2019-04-17 17:54:42

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 0/7] Extend write-hint/stream infrastructure

V4 series, towards extending write-hint/streams infrastucture so that file-systems
and other kernel-mode components can use write-hints which are separate from
user-space ones. Also this introduces support for sending write-hint with
Ext4/JBD2 journal.

Here is the changelog/history -

Changes since v3:
- Correction in grouping related changes into patches
- Rectification in commit text at places

Changes since v2:
- Introduce API in block layer so that drivers can register stream info. Added
new limit in request queue for this purpose.
- Block layer does the conversion from write-hint to stream-id.
- Any write-hint beyond registered limit turn to 0.
- New macro "WRITE_LIFE_KERN_MIN" can be used as base by kernel mode components.

Changes since v1:
- introduce four more hints for in-kernel use, as recommended by Dave chinner
& Jens axboe. This isolates kernel-mode hints from user-mode ones.
- remove mount-option to specify write-hint, as recommended by Jan kara &
Dave chinner. Rather, FS always sets write-hint for journal. This gets ignored
if device does not support stream.
- Removed code-redundancy for write_dirty_buffer (Jan kara's review comment)

V3 patch:
https://marc.info/?l=linux-block&m=155384631909082&w=2

V2 patch:
https://patchwork.kernel.org/cover/10754405/

V1 patch:
https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2

Kanchan Joshi (7):
fs: introduce write-hint start point for in-kernel hints
block: increase stream count for in-kernel use
block: introduce API to register stream information with block-layer
block: introduce write-hint to stream-id conversion
nvme: register stream info with block layer
fs: introduce APIs to enable passing write-hint with buffer-head
fs/ext4,jbd2: add support for sending write-hint with journal

block/blk-core.c | 20 ++++++++++++++++++++
block/blk-settings.c | 12 ++++++++++++
drivers/nvme/host/core.c | 23 ++++++-----------------
fs/buffer.c | 18 ++++++++++++++++--
fs/ext4/ext4_jbd2.h | 1 +
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/blkdev.h | 7 ++++++-
include/linux/buffer_head.h | 3 +++
include/linux/fs.h | 2 ++
include/linux/jbd2.h | 8 ++++++++
13 files changed, 87 insertions(+), 26 deletions(-)

--
2.7.4



2019-04-17 17:54:15

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 7/7] fs/ext4,jbd2: add support for sending write-hint with journal

For NAND based SSDs, mixing of data with different life-time reduces
efficiency of internal garbage-collection. During FS operations,
series of journal updates will follow/precede series of data/meta
updates, causing intermixing inside SSD. By passing a write-hint with
journal, its write can be isolated from other data/meta writes, leading
to endurance/performance benefit on SSD.
This patch introduces "j_writehint" member in JBD2 journal, using which
Ext4 specifies write-hint for journal.

Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/ext4/ext4_jbd2.h | 1 +
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/jbd2.h | 8 ++++++++
6 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 75a5309..ade47b2 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -16,6 +16,7 @@
#include <linux/jbd2.h>
#include "ext4.h"

+#define EXT4_JOURNAL_WRITE_HINT (WRITE_LIFE_KERN_MIN)
#define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)

/* Define the number of blocks we need to account to a transaction to
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6ed4eb8..238c0b5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4298,6 +4298,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

+ sbi->s_journal->j_writehint = EXT4_JOURNAL_WRITE_HINT;
+
sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;

no_journal:
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index efd0ce9..be3a0b9 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,

if (journal->j_flags & JBD2_BARRIER &&
!jbd2_has_feature_async_commit(journal))
- ret = submit_bh(REQ_OP_WRITE,
- REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE,
+ REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
+ journal->j_writehint);
else
- ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
+ journal->j_writehint);

*cbh = bh;
return ret;
@@ -713,7 +715,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;
- submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+ submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC,
+ bh, journal->j_writehint);
}
cond_resched();
stats.run.rs_blocks_logged += bufs;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 382c030..6dc7c9a 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1372,7 +1372,8 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
sb->s_checksum = jbd2_superblock_csum(journal, sb);
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
- ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh,
+ journal->j_writehint);
wait_on_buffer(bh);
if (buffer_write_io_error(bh)) {
clear_buffer_write_io_error(bh);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index a1143e5..376b1d8 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -642,7 +642,8 @@ static void flush_descriptor(journal_t *journal,
set_buffer_jwrite(descriptor);
BUFFER_TRACE(descriptor, "write");
set_buffer_dirty(descriptor);
- write_dirty_buffer(descriptor, REQ_SYNC);
+ write_dirty_buffer_with_hint(descriptor, REQ_SYNC,
+ journal->j_writehint);
}
#endif

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0f919d5..918f21e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1139,6 +1139,14 @@ struct journal_s
*/
__u32 j_csum_seed;

+ /**
+ * @j_writehint:
+ *
+ * write-hint for journal (set by FS).
+ */
+ enum rw_hint j_writehint;
+
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/**
* @j_trans_commit_map:
--
2.7.4


2019-04-17 17:54:17

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 6/7] fs: introduce APIs to enable passing write-hint with buffer-head

submit_bh and write_dirty_buffer do not take write-hint as parameter.
This introduces variants which do, and pass write-hint to lower layer.

Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/buffer.c | 18 ++++++++++++++++--
include/linux/buffer_head.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index ce35760..dc1ea59 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3105,6 +3105,13 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
}
EXPORT_SYMBOL(submit_bh);

+int submit_bh_write_hint(int op, int op_flags, struct buffer_head *bh,
+ enum rw_hint hint)
+{
+ return submit_bh_wbc(op, op_flags, bh, hint, NULL);
+}
+EXPORT_SYMBOL(submit_bh_write_hint);
+
/**
* ll_rw_block: low-level access to block devices (DEPRECATED)
* @op: whether to %READ or %WRITE
@@ -3162,6 +3169,13 @@ EXPORT_SYMBOL(ll_rw_block);

void write_dirty_buffer(struct buffer_head *bh, int op_flags)
{
+ write_dirty_buffer_with_hint(bh, op_flags, 0);
+}
+EXPORT_SYMBOL(write_dirty_buffer);
+
+void write_dirty_buffer_with_hint(struct buffer_head *bh, int op_flags,
+ enum rw_hint hint)
+{
lock_buffer(bh);
if (!test_clear_buffer_dirty(bh)) {
unlock_buffer(bh);
@@ -3169,9 +3183,9 @@ void write_dirty_buffer(struct buffer_head *bh, int op_flags)
}
bh->b_end_io = end_buffer_write_sync;
get_bh(bh);
- submit_bh(REQ_OP_WRITE, op_flags, bh);
+ submit_bh_wbc(REQ_OP_WRITE, op_flags, bh, hint, NULL);
}
-EXPORT_SYMBOL(write_dirty_buffer);
+EXPORT_SYMBOL(write_dirty_buffer_with_hint);

/*
* For a data-integrity writeout, we need to wait upon any in-progress I/O
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73ef7..3d682ac 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -200,7 +200,10 @@ void ll_rw_block(int, int, int, struct buffer_head * bh[]);
int sync_dirty_buffer(struct buffer_head *bh);
int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
void write_dirty_buffer(struct buffer_head *bh, int op_flags);
+void write_dirty_buffer_with_hint(struct buffer_head *bh, int op_flags,
+ enum rw_hint hint);
int submit_bh(int, int, struct buffer_head *);
+int submit_bh_write_hint(int, int, struct buffer_head *, enum rw_hint hint);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
int bh_uptodate_or_lock(struct buffer_head *bh);
--
2.7.4


2019-04-17 17:54:20

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

This patch moves write-hint-to-stream-id conversion in block-layer.
Earlier this was done by driver (nvme). Current conversion is of the
form "streamid = write-hint - 1", for both user and kernel streams.
Conversion takes stream limit (maintained in request queue) into
account. Write-hints beyond the exposed limit turn to 0.
A new field 'streamid' has been added in request. While 'write-hint'
field continues to exist. It keeps original value passed from upper
layer, and used during merging checks.

Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-core.c | 20 ++++++++++++++++++++
include/linux/blkdev.h | 1 +
2 files changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index a55389b..712e6b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -730,6 +730,25 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
return false;
}

+enum rw_hint blk_write_hint_to_streamid(struct request *req,
+ struct bio *bio)
+{
+ enum rw_hint streamid, nr_streams;
+ struct request_queue *q = req->q;
+ nr_streams = q->limits.nr_streams;
+
+ streamid = bio->bi_write_hint;
+ if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
+ streamid == WRITE_LIFE_NONE)
+ streamid = 0;
+ else {
+ --streamid;
+ if(streamid > nr_streams)
+ streamid = 0;
+ }
+ return streamid;
+}
+
void blk_init_request_from_bio(struct request *req, struct bio *bio)
{
if (bio->bi_opf & REQ_RAHEAD)
@@ -738,6 +757,7 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
req->__sector = bio->bi_iter.bi_sector;
req->ioprio = bio_prio(bio);
req->write_hint = bio->bi_write_hint;
+ req->streamid = blk_write_hint_to_streamid(req, bio);
blk_rq_bio_prep(req->q, req, bio);
}
EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eb6eb60..7cd1c2d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -217,6 +217,7 @@ struct request {
#endif

unsigned short write_hint;
+ unsigned short streamid;
unsigned short ioprio;

unsigned int extra_len; /* length of alignment and padding */
--
2.7.4


2019-04-17 17:54:28

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 5/7] nvme: register stream info with block layer

Make nvme driver register number of streams with block layer. Block
layer will use that for write-hint to stream-id conversion. Registration
is done for each namespace. Since NVMe spec allow all available
streams (within subsystem) to be used by all namespaces, no attempt
has been made to add reservation at namespace level.

Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/core.c | 23 ++++++-----------------
1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2c43e12..81b86fa 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -464,10 +464,6 @@ static int nvme_toggle_streams(struct nvme_ctrl *ctrl, bool enable)
return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
}

-static int nvme_disable_streams(struct nvme_ctrl *ctrl)
-{
- return nvme_toggle_streams(ctrl, false);
-}

static int nvme_enable_streams(struct nvme_ctrl *ctrl)
{
@@ -510,14 +506,7 @@ static int nvme_configure_directives(struct nvme_ctrl *ctrl)
return ret;

ctrl->nssa = le16_to_cpu(s.nssa);
- if (ctrl->nssa < BLK_MAX_WRITE_HINTS - 1) {
- dev_info(ctrl->device, "too few streams (%u) available\n",
- ctrl->nssa);
- nvme_disable_streams(ctrl);
- return 0;
- }
-
- ctrl->nr_streams = min_t(unsigned, ctrl->nssa, BLK_MAX_WRITE_HINTS - 1);
+ ctrl->nr_streams = ctrl->nssa;
dev_info(ctrl->device, "Using %u streams\n", ctrl->nr_streams);
return 0;
}
@@ -530,12 +519,9 @@ static void nvme_assign_write_stream(struct nvme_ctrl *ctrl,
struct request *req, u16 *control,
u32 *dsmgmt)
{
- enum rw_hint streamid = req->write_hint;
+ enum rw_hint streamid = req->streamid;

- if (streamid == WRITE_LIFE_NOT_SET || streamid == WRITE_LIFE_NONE)
- streamid = 0;
- else {
- streamid--;
+ if (streamid != 0) {
if (WARN_ON_ONCE(streamid > ctrl->nr_streams))
return;

@@ -3189,6 +3175,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
{
struct streams_directive_params s;
int ret;
+ u16 nr_streams;

if (!ctrl->nr_streams)
return 0;
@@ -3200,6 +3187,8 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
ns->sws = le32_to_cpu(s.sws);
ns->sgs = le16_to_cpu(s.sgs);

+ nr_streams = min_t(unsigned, ctrl->nr_streams, BLK_MAX_WRITE_HINTS - 1);
+ blk_queue_stream_limits(ns->queue, nr_streams);
if (ns->sws) {
unsigned int bs = 1 << ns->lba_shift;

--
2.7.4


2019-04-17 17:54:29

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 3/7] block: introduce API to register stream information with block-layer

This introduces stream limit (count of streams supported by underlying
driver) in request-queue.

Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-settings.c | 12 ++++++++++++
include/linux/blkdev.h | 2 ++
2 files changed, 14 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6375afa..6023229 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -231,6 +231,18 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
EXPORT_SYMBOL(blk_queue_max_discard_sectors);

/**
+ * blk_queue_stream_limits - set stream parameters
+ * @q: the request queue for the device
+ * @nr_streams: number of streams supported by device
+ **/
+void blk_queue_stream_limits(struct request_queue *q,
+ unsigned short nr_streams)
+{
+ q->limits.nr_streams = nr_streams;
+}
+EXPORT_SYMBOL(blk_queue_stream_limits);
+
+/**
* blk_queue_max_write_same_sectors - set max sectors for a single write same
* @q: the request queue for the device
* @max_write_same_sectors: maximum number of sectors to write per command
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f200aed..eb6eb60 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -340,6 +340,7 @@ struct queue_limits {
unsigned char discard_misaligned;
unsigned char raid_partial_stripes_expensive;
enum blk_zoned_model zoned;
+ unsigned short nr_streams;
};

#ifdef CONFIG_BLK_DEV_ZONED
@@ -1065,6 +1066,7 @@ extern void blk_queue_dma_alignment(struct request_queue *, int);
extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
+extern void blk_queue_stream_limits(struct request_queue *q, unsigned short);

/*
* Number of physical segments as sent to the device.
--
2.7.4


2019-04-17 17:54:35

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 2/7] block: increase stream count for in-kernel use

This bumps up stream count to support in-kernel hints.

Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/blkdev.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c58a3b..f200aed 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -566,7 +566,9 @@ struct request_queue {

struct work_struct release_work;

-#define BLK_MAX_WRITE_HINTS 5
+#define BLK_MAX_USER_HINTS (WRITE_LIFE_KERN_MIN - 2)
+#define BLK_MAX_KERN_HINTS 4
+#define BLK_MAX_WRITE_HINTS (1 + BLK_MAX_USER_HINTS + BLK_MAX_KERN_HINTS)
u64 write_hints[BLK_MAX_WRITE_HINTS];
};

--
2.7.4


2019-04-17 17:54:39

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v4 1/7] fs: introduce write-hint start point for in-kernel hints

Introduce "WRITE_LIFE_KERN_MIN". Kernel mode components can define own
write-hints using this as base.

Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/fs.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e76..ee27eb4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -299,6 +299,8 @@ enum rw_hint {
WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
+/* Kernel should use write-hint starting from this */
+ WRITE_LIFE_KERN_MIN,
};

#define IOCB_EVENTFD (1 << 0)
--
2.7.4


2019-04-17 17:57:54

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

On 4/17/19 11:50 AM, Kanchan Joshi wrote:
> This patch moves write-hint-to-stream-id conversion in block-layer.
> Earlier this was done by driver (nvme). Current conversion is of the
> form "streamid = write-hint - 1", for both user and kernel streams.
> Conversion takes stream limit (maintained in request queue) into
> account. Write-hints beyond the exposed limit turn to 0.
> A new field 'streamid' has been added in request. While 'write-hint'
> field continues to exist. It keeps original value passed from upper
> layer, and used during merging checks.

Why not just use the bio write hint? We already disallow merging of
dissimilar write hints, so req->bio->bi_write_hint is known to be
identical with the rest of the bio's in that chain.

--
Jens Axboe


2019-04-18 13:52:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 5/7] nvme: register stream info with block layer

On Wed 17-04-19 23:20:04, Kanchan Joshi wrote:
> Make nvme driver register number of streams with block layer. Block
> layer will use that for write-hint to stream-id conversion. Registration
> is done for each namespace. Since NVMe spec allow all available
> streams (within subsystem) to be used by all namespaces, no attempt
> has been made to add reservation at namespace level.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> drivers/nvme/host/core.c | 23 ++++++-----------------
> 1 file changed, 6 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2c43e12..81b86fa 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -464,10 +464,6 @@ static int nvme_toggle_streams(struct nvme_ctrl *ctrl, bool enable)
> return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
> }
>
> -static int nvme_disable_streams(struct nvme_ctrl *ctrl)
> -{
> - return nvme_toggle_streams(ctrl, false);
> -}
>
> static int nvme_enable_streams(struct nvme_ctrl *ctrl)
> {
> @@ -510,14 +506,7 @@ static int nvme_configure_directives(struct nvme_ctrl *ctrl)
> return ret;
>
> ctrl->nssa = le16_to_cpu(s.nssa);
> - if (ctrl->nssa < BLK_MAX_WRITE_HINTS - 1) {
> - dev_info(ctrl->device, "too few streams (%u) available\n",
> - ctrl->nssa);
> - nvme_disable_streams(ctrl);
> - return 0;
> - }
> -
> - ctrl->nr_streams = min_t(unsigned, ctrl->nssa, BLK_MAX_WRITE_HINTS - 1);
> + ctrl->nr_streams = ctrl->nssa;
> dev_info(ctrl->device, "Using %u streams\n", ctrl->nr_streams);
> return 0;
> }

This changes the current behavior, doesn't it? Previously devices with less
than 5 hints got streams completely disabled, now they will have streams
enabled but ids beyond supported max will be mapped to 0. I'm not against
this but I guess it should be spelled out explicitely in the changelog.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-18 14:06:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

On Wed 17-04-19 23:20:03, Kanchan Joshi wrote:
> This patch moves write-hint-to-stream-id conversion in block-layer.
> Earlier this was done by driver (nvme). Current conversion is of the
> form "streamid = write-hint - 1", for both user and kernel streams.
> Conversion takes stream limit (maintained in request queue) into
> account. Write-hints beyond the exposed limit turn to 0.
> A new field 'streamid' has been added in request. While 'write-hint'
> field continues to exist. It keeps original value passed from upper
> layer, and used during merging checks.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> block/blk-core.c | 20 ++++++++++++++++++++
> include/linux/blkdev.h | 1 +
> 2 files changed, 21 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a55389b..712e6b7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -730,6 +730,25 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
> return false;
> }
>
> +enum rw_hint blk_write_hint_to_streamid(struct request *req,
> + struct bio *bio)
> +{
> + enum rw_hint streamid, nr_streams;
> + struct request_queue *q = req->q;
> + nr_streams = q->limits.nr_streams;
> +
> + streamid = bio->bi_write_hint;
> + if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
> + streamid == WRITE_LIFE_NONE)
> + streamid = 0;
> + else {
> + --streamid;
> + if(streamid > nr_streams)
> + streamid = 0;
> + }
> + return streamid;
> +}
> +

Someone told me that stream ids are potentially persistent on the storage
so it isn't great to change the id for the same thing e.g. if we add
another user hint. So maybe we should allocate kernel hints from top as
Dave Chinner suggested? I.e., something like the following mapping function:

if (streamid <= BLK_MAX_USER_HINTS) {
streamid--;
if (streamid > nr_streams)
streamid = 0;
} else {
/* Kernel hints get allocated from top */
streamid -= WRITE_LIFE_KERN_MIN;
if (streamid > nr_streams - BLK_MAX_USER_HINTS)
streamid = 0;
else
streamid = nr_streams - streamid - 1;
}

what do you think?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-18 18:58:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

On Apr 18, 2019, at 8:06 AM, Jan Kara <[email protected]> wrote:
>
> On Wed 17-04-19 23:20:03, Kanchan Joshi wrote:
>> This patch moves write-hint-to-stream-id conversion in block-layer.
>> Earlier this was done by driver (nvme). Current conversion is of the
>> form "streamid = write-hint - 1", for both user and kernel streams.
>> Conversion takes stream limit (maintained in request queue) into
>> account. Write-hints beyond the exposed limit turn to 0.
>> A new field 'streamid' has been added in request. While 'write-hint'
>> field continues to exist. It keeps original value passed from upper
>> layer, and used during merging checks.
>>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> ---
>> block/blk-core.c | 20 ++++++++++++++++++++
>> include/linux/blkdev.h | 1 +
>> 2 files changed, 21 insertions(+)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index a55389b..712e6b7 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -730,6 +730,25 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>> return false;
>> }
>>
>> +enum rw_hint blk_write_hint_to_streamid(struct request *req,
>> + struct bio *bio)
>> +{
>> + enum rw_hint streamid, nr_streams;
>> + struct request_queue *q = req->q;
>> + nr_streams = q->limits.nr_streams;
>> +
>> + streamid = bio->bi_write_hint;
>> + if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
>> + streamid == WRITE_LIFE_NONE)
>> + streamid = 0;
>> + else {
>> + --streamid;
>> + if(streamid > nr_streams)
>> + streamid = 0;
>> + }
>> + return streamid;
>> +}
>> +
>
> Someone told me that stream ids are potentially persistent on the storage
> so it isn't great to change the id for the same thing e.g. if we add
> another user hint. So maybe we should allocate kernel hints from top as
> Dave Chinner suggested? I.e., something like the following mapping function:
>
> if (streamid <= BLK_MAX_USER_HINTS) {
> streamid--;
> if (streamid > nr_streams)
> streamid = 0;
> } else {
> /* Kernel hints get allocated from top */
> streamid -= WRITE_LIFE_KERN_MIN;
> if (streamid > nr_streams - BLK_MAX_USER_HINTS)
> streamid = 0;
> else
> streamid = nr_streams - streamid - 1;
> }
>
> what do you think?

Dave has expressed this sentiment several times, and I agree. We don't
want the filesystem hint values/mapping to change over time, or it will
conflict with data that was written with the previous hints (e.g. if
data was written with a "short lifetime" hint suddenly changes to be
grouped with a "long lifetime" hint). This is easily avoided with some
simple changes now, but harder to fix after this patch has landed.

Cheers, Andreas






Attachments:
signature.asc (873.00 B)
Message signed with OpenPGP

2019-04-22 13:33:53

by Kanchan Joshi

[permalink] [raw]
Subject: RE: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

> Someone told me that stream ids are potentially persistent on the
> storage so it isn't great to change the id for the same thing e.g. if
> we add another user hint. So maybe we should allocate kernel hints
> from top as Dave Chinner suggested? I.e., something like the following
mapping function:

This function is good. Thank you for sharing.

-----Original Message-----
From: Andreas Dilger [mailto:[email protected]]
Sent: Friday, April 19, 2019 12:28 AM
To: Jan Kara <[email protected]>
Cc: Kanchan Joshi <[email protected]>; open list
<[email protected]>; linux-block <[email protected]>;
[email protected]; linux-fsdevel
<[email protected]>; [email protected];
[email protected]
Subject: Re: [PATCH v4 4/7] block: introduce write-hint to stream-id
conversion

On Apr 18, 2019, at 8:06 AM, Jan Kara <[email protected]> wrote:
>
> On Wed 17-04-19 23:20:03, Kanchan Joshi wrote:
>> This patch moves write-hint-to-stream-id conversion in block-layer.
>> Earlier this was done by driver (nvme). Current conversion is of the
>> form "streamid = write-hint - 1", for both user and kernel streams.
>> Conversion takes stream limit (maintained in request queue) into
>> account. Write-hints beyond the exposed limit turn to 0.
>> A new field 'streamid' has been added in request. While 'write-hint'
>> field continues to exist. It keeps original value passed from upper
>> layer, and used during merging checks.
>>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> ---
>> block/blk-core.c | 20 ++++++++++++++++++++
>> include/linux/blkdev.h | 1 +
>> 2 files changed, 21 insertions(+)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c index
>> a55389b..712e6b7 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -730,6 +730,25 @@ bool blk_attempt_plug_merge(struct request_queue *q,
struct bio *bio,
>> return false;
>> }
>>
>> +enum rw_hint blk_write_hint_to_streamid(struct request *req,
>> + struct bio *bio)
>> +{
>> + enum rw_hint streamid, nr_streams;
>> + struct request_queue *q = req->q;
>> + nr_streams = q->limits.nr_streams;
>> +
>> + streamid = bio->bi_write_hint;
>> + if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
>> + streamid == WRITE_LIFE_NONE)
>> + streamid = 0;
>> + else {
>> + --streamid;
>> + if(streamid > nr_streams)
>> + streamid = 0;
>> + }
>> + return streamid;
>> +}
>> +
>
> Someone told me that stream ids are potentially persistent on the
> storage so it isn't great to change the id for the same thing e.g. if
> we add another user hint. So maybe we should allocate kernel hints
> from top as Dave Chinner suggested? I.e., something like the following
mapping function:
>
> if (streamid <= BLK_MAX_USER_HINTS) {
> streamid--;
> if (streamid > nr_streams)
> streamid = 0;
> } else {
> /* Kernel hints get allocated from top */
> streamid -= WRITE_LIFE_KERN_MIN;
> if (streamid > nr_streams - BLK_MAX_USER_HINTS)
> streamid = 0;
> else
> streamid = nr_streams - streamid - 1;
> }
>
> what do you think?

Dave has expressed this sentiment several times, and I agree. We don't want
the filesystem hint values/mapping to change over time, or it will conflict
with data that was written with the previous hints (e.g. if data was written
with a "short lifetime" hint suddenly changes to be grouped with a "long
lifetime" hint). This is easily avoided with some simple changes now, but
harder to fix after this patch has landed.

Cheers, Andreas







2019-04-22 13:37:23

by Kanchan Joshi

[permalink] [raw]
Subject: RE: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

> Why not just use the bio write hint? We already disallow merging of dissimilar write hints, so req->bio->bi_write_hint is known to be identical with the rest of the bio's in that chain.

Yes, that is better. Thanks for suggesting it.

-----Original Message-----
From: Jens Axboe [mailto:[email protected]]
Sent: Wednesday, April 17, 2019 11:28 PM
To: Kanchan Joshi <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Cc: [email protected]
Subject: Re: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

On 4/17/19 11:50 AM, Kanchan Joshi wrote:
> This patch moves write-hint-to-stream-id conversion in block-layer.
> Earlier this was done by driver (nvme). Current conversion is of the
> form "streamid = write-hint - 1", for both user and kernel streams.
> Conversion takes stream limit (maintained in request queue) into
> account. Write-hints beyond the exposed limit turn to 0.
> A new field 'streamid' has been added in request. While 'write-hint'
> field continues to exist. It keeps original value passed from upper
> layer, and used during merging checks.

Why not just use the bio write hint? We already disallow merging of dissimilar write hints, so req->bio->bi_write_hint is known to be identical with the rest of the bio's in that chain.

--
Jens Axboe



2019-04-22 13:43:37

by Kanchan Joshi

[permalink] [raw]
Subject: RE: [PATCH v4 5/7] nvme: register stream info with block layer

> This changes the current behavior, doesn't it? Previously devices with
less than 5 hints got streams completely disabled, now they will have
streams enabled but ids beyond supported max will be mapped to 0. I'm not
against this but I guess it should be spelled out explicitely in the
changelog.

Yes, that is a change in current behavior. Will add that in next version,
thanks.

-----Original Message-----
From: Jan Kara [mailto:[email protected]]
Sent: Thursday, April 18, 2019 7:22 PM
To: Kanchan Joshi <[email protected]>
Cc: [email protected]; [email protected];
[email protected]; [email protected];
[email protected]; [email protected]
Subject: Re: [PATCH v4 5/7] nvme: register stream info with block layer

On Wed 17-04-19 23:20:04, Kanchan Joshi wrote:
> Make nvme driver register number of streams with block layer. Block
> layer will use that for write-hint to stream-id conversion.
> Registration is done for each namespace. Since NVMe spec allow all
> available streams (within subsystem) to be used by all namespaces, no
> attempt has been made to add reservation at namespace level.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> drivers/nvme/host/core.c | 23 ++++++-----------------
> 1 file changed, 6 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
> 2c43e12..81b86fa 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -464,10 +464,6 @@ static int nvme_toggle_streams(struct nvme_ctrl
*ctrl, bool enable)
> return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0); }
>
> -static int nvme_disable_streams(struct nvme_ctrl *ctrl) -{
> - return nvme_toggle_streams(ctrl, false);
> -}
>
> static int nvme_enable_streams(struct nvme_ctrl *ctrl) { @@ -510,14
> +506,7 @@ static int nvme_configure_directives(struct nvme_ctrl *ctrl)
> return ret;
>
> ctrl->nssa = le16_to_cpu(s.nssa);
> - if (ctrl->nssa < BLK_MAX_WRITE_HINTS - 1) {
> - dev_info(ctrl->device, "too few streams (%u) available\n",
> - ctrl->nssa);
> - nvme_disable_streams(ctrl);
> - return 0;
> - }
> -
> - ctrl->nr_streams = min_t(unsigned, ctrl->nssa, BLK_MAX_WRITE_HINTS -
1);
> + ctrl->nr_streams = ctrl->nssa;
> dev_info(ctrl->device, "Using %u streams\n", ctrl->nr_streams);
> return 0;
> }

This changes the current behavior, doesn't it? Previously devices with less
than 5 hints got streams completely disabled, now they will have streams
enabled but ids beyond supported max will be mapped to 0. I'm not against
this but I guess it should be spelled out explicitely in the changelog.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR