2019-01-09 15:33:32

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH v2 0/4] Write-hint for FS journal

Towards supporing write-hints/streams for filesystem journal.

Here is the v1 patch for background -
https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2

Changes since v1:
- introduce four more hints for in-kernel use, as recommended by Dave chinner
& Jens axboe. This isolates kernel-mode hints from user-mode ones.
- remove mount-option to specify write-hint, as recommended by Jan kara &
Dave chinner. Rather, FS always sets write-hint for journal. This gets ignored
if device does not support stream.
- Removed code-redundancy for write_dirty_buffer (Jan kara's review comment)


Kanchan Joshi (4):
block: Increase count of supported write-hints
fs: introduce four macros for in-kernel hints
fs: introduce APIs to enable sending write-hint with buffer-head
fs/ext4,jbd2: add support for passing write-hint with journal.

fs/buffer.c | 18 ++++++++++++++++--
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/blkdev.h | 5 ++++-
include/linux/buffer_head.h | 3 +++
include/linux/fs.h | 5 +++++
include/linux/jbd2.h | 8 ++++++++
9 files changed, 49 insertions(+), 9 deletions(-)

--
2.7.4


2019-01-09 15:33:45

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH 3/4] fs: introduce APIs to enable sending write-hint with buffer-head

submit_bh and write_dirty_buffer do not take write-hint as parameter.
This patch introduces variants which do.

Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/buffer.c | 18 ++++++++++++++++--
include/linux/buffer_head.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d60d61e..2a94676 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3094,6 +3094,13 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
}
EXPORT_SYMBOL(submit_bh);

+int submit_bh_write_hint(int op, int op_flags, struct buffer_head *bh,
+ enum rw_hint hint)
+{
+ return submit_bh_wbc(op, op_flags, bh, hint, NULL);
+}
+EXPORT_SYMBOL(submit_bh_write_hint);
+
/**
* ll_rw_block: low-level access to block devices (DEPRECATED)
* @op: whether to %READ or %WRITE
@@ -3151,6 +3158,13 @@ EXPORT_SYMBOL(ll_rw_block);

void write_dirty_buffer(struct buffer_head *bh, int op_flags)
{
+ write_dirty_buffer_with_hint(bh, op_flags, 0);
+}
+EXPORT_SYMBOL(write_dirty_buffer);
+
+void write_dirty_buffer_with_hint(struct buffer_head *bh, int op_flags,
+ enum rw_hint hint)
+{
lock_buffer(bh);
if (!test_clear_buffer_dirty(bh)) {
unlock_buffer(bh);
@@ -3158,9 +3172,9 @@ void write_dirty_buffer(struct buffer_head *bh, int op_flags)
}
bh->b_end_io = end_buffer_write_sync;
get_bh(bh);
- submit_bh(REQ_OP_WRITE, op_flags, bh);
+ submit_bh_wbc(REQ_OP_WRITE, op_flags, bh, hint, NULL);
}
-EXPORT_SYMBOL(write_dirty_buffer);
+EXPORT_SYMBOL(write_dirty_buffer_with_hint);

/*
* For a data-integrity writeout, we need to wait upon any in-progress I/O
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73ef7..3d682ac 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -200,7 +200,10 @@ void ll_rw_block(int, int, int, struct buffer_head * bh[]);
int sync_dirty_buffer(struct buffer_head *bh);
int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
void write_dirty_buffer(struct buffer_head *bh, int op_flags);
+void write_dirty_buffer_with_hint(struct buffer_head *bh, int op_flags,
+ enum rw_hint hint);
int submit_bh(int, int, struct buffer_head *);
+int submit_bh_write_hint(int, int, struct buffer_head *, enum rw_hint hint);
void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
int bh_uptodate_or_lock(struct buffer_head *bh);
--
2.7.4

2019-01-09 15:33:41

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH 2/4] fs: introduce four macros for in-kernel hints

Exiting write-hints are exposed to user-mode. There is a possiblity
of conflict if kernel happens to use those. This patch introduces four
write-hints for exclusive kernel-mode use.

Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/fs.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c777..e8548eb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -291,6 +291,11 @@ enum rw_hint {
WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
+/* below ones are meant for in-kernel use */
+ KERN_WRITE_LIFE_SHORT,
+ KERN_WRITE_LIFE_MEDIUM,
+ KERN_WRITE_LIFE_LONG,
+ KERN_WRITE_LIFE_EXTREME
};

#define IOCB_EVENTFD (1 << 0)
--
2.7.4

2019-01-09 15:33:36

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH 1/4] block: Increase count of supported write-hints

This patch bumps up write-hint count to support four new, in-kernel
hints.

Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/blkdev.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 338604d..df07759 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -568,7 +568,10 @@ struct request_queue {

struct work_struct release_work;

-#define BLK_MAX_WRITE_HINTS 5
+#define BLK_MAX_USER_WRITE_HINTS 5
+#define BLK_MAX_KERNEL_WRITE_HINTS 4
+#define BLK_MAX_WRITE_HINTS (BLK_MAX_USER_WRITE_HINTS + \
+ BLK_MAX_KERNEL_WRITE_HINTS)
u64 write_hints[BLK_MAX_WRITE_HINTS];
};

--
2.7.4

2019-01-09 15:33:46

by Kanchan Joshi

[permalink] [raw]
Subject: [PATCH 4/4] fs/ext4,jbd2: add support for passing write-hint with journal.

For NAND based SSDs, mixing of data with different life-time reduces
efficiency of internal garbage-collection. During FS operations, series
of journal updates will follow/precede series of data/meta updates, causing
intermixing inside SSD. By passing a write-hint with journal, its write
can be isolated from other data/meta writes, leading to endurance/performance
benefit on SSD.

This patch introduces "j_writehint" member in JBD2 journal, using which
Ext4 specifies write-hint (as SHORT) for journal.

Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/ext4/super.c | 2 ++
fs/jbd2/commit.c | 11 +++++++----
fs/jbd2/journal.c | 3 ++-
fs/jbd2/revoke.c | 3 ++-
include/linux/jbd2.h | 8 ++++++++
5 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d6c142d..3af4049 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4289,6 +4289,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

+ sbi->s_journal->j_writehint = KERN_WRITE_LIFE_SHORT;
+
sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;

no_journal:
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 2eb55c3..6da4c28 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,

if (journal->j_flags & JBD2_BARRIER &&
!jbd2_has_feature_async_commit(journal))
- ret = submit_bh(REQ_OP_WRITE,
- REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE,
+ REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
+ journal->j_writehint);
else
- ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
+ journal->j_writehint);

*cbh = bh;
return ret;
@@ -711,7 +713,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;
- submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+ submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC,
+ bh, journal->j_writehint);
}
cond_resched();
stats.run.rs_blocks_logged += bufs;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8ef6b6d..804dc2c 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1384,7 +1384,8 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
jbd2_superblock_csum_set(journal, sb);
get_bh(bh);
bh->b_end_io = end_buffer_write_sync;
- ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
+ ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh,
+ journal->j_writehint);
wait_on_buffer(bh);
if (buffer_write_io_error(bh)) {
clear_buffer_write_io_error(bh);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index a1143e5..376b1d8 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -642,7 +642,8 @@ static void flush_descriptor(journal_t *journal,
set_buffer_jwrite(descriptor);
BUFFER_TRACE(descriptor, "write");
set_buffer_dirty(descriptor);
- write_dirty_buffer(descriptor, REQ_SYNC);
+ write_dirty_buffer_with_hint(descriptor, REQ_SYNC,
+ journal->j_writehint);
}
#endif

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0f919d5..918f21e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1139,6 +1139,14 @@ struct journal_s
*/
__u32 j_csum_seed;

+ /**
+ * @j_writehint:
+ *
+ * write-hint for journal (set by FS).
+ */
+ enum rw_hint j_writehint;
+
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/**
* @j_trans_commit_map:
--
2.7.4

2019-01-23 18:27:20

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/4] introduce four macros for in-kernel hints


> On 9 Jan 2019, at 16.30, Kanchan Joshi <[email protected]> wrote:
>
> Exiting write-hints are exposed to user-mode. There is a possiblity
> of conflict if kernel happens to use those. This patch introduces four
> write-hints for exclusive kernel-mode use.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> include/linux/fs.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 811c777..e8548eb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -291,6 +291,11 @@ enum rw_hint {
> WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
> WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
> WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
> +/* below ones are meant for in-kernel use */
> + KERN_WRITE_LIFE_SHORT,
> + KERN_WRITE_LIFE_MEDIUM,
> + KERN_WRITE_LIFE_LONG,
> + KERN_WRITE_LIFE_EXTREME
> };
>

I think Jens and Dave meant kernel hints to go top down. This would also
give space for supporting more hints / streams from both ends for user
and kernel.

Javier


Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP

2019-01-23 18:35:51

by Javier González

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal


> On 9 Jan 2019, at 16.30, Kanchan Joshi <[email protected]> wrote:
>
> Towards supporing write-hints/streams for filesystem journal.
>
> Here is the v1 patch for background -
> https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
>
> Changes since v1:
> - introduce four more hints for in-kernel use, as recommended by Dave chinner
> & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> - remove mount-option to specify write-hint, as recommended by Jan kara &
> Dave chinner. Rather, FS always sets write-hint for journal. This gets ignored
> if device does not support stream.
> - Removed code-redundancy for write_dirty_buffer (Jan kara's review comment)
>
>
> Kanchan Joshi (4):
> block: Increase count of supported write-hints
> fs: introduce four macros for in-kernel hints
> fs: introduce APIs to enable sending write-hint with buffer-head
> fs/ext4,jbd2: add support for passing write-hint with journal.
>
> fs/buffer.c | 18 ++++++++++++++++--
> fs/ext4/super.c | 2 ++
> fs/jbd2/commit.c | 11 +++++++----
> fs/jbd2/journal.c | 3 ++-
> fs/jbd2/revoke.c | 3 ++-
> include/linux/blkdev.h | 5 ++++-
> include/linux/buffer_head.h | 3 +++
> include/linux/fs.h | 5 +++++
> include/linux/jbd2.h | 8 ++++++++
> 9 files changed, 49 insertions(+), 9 deletions(-)
>
> --
> 2.7.4

Worth sharing the paper where you describe the design and the numbers
you collected [1]. Also, addressing Dave's comment on stream support, it
points to a Samsung drive supporting streams (PM963). In this context,
you should verify in V3 that we are at least passing xfstests and
blktests with these changes.

[1] https://www.usenix.org/system/files/conference/fast18/fast18-rho.pdf

Javier


Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP

2019-01-24 08:29:38

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

Hello,

On Wed 09-01-19 21:00:57, Kanchan Joshi wrote:
> Towards supporing write-hints/streams for filesystem journal.
>
> Here is the v1 patch for background -
> https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
>
> Changes since v1:
> - introduce four more hints for in-kernel use, as recommended by Dave chinner
> & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> - remove mount-option to specify write-hint, as recommended by Jan kara &
> Dave chinner. Rather, FS always sets write-hint for journal. This gets ignored
> if device does not support stream.
> - Removed code-redundancy for write_dirty_buffer (Jan kara's review comment)

I guess the series should also go to Jens since he was the original author
of write-hint support so he has the best idea about the architecture.

Honza

> Kanchan Joshi (4):
> block: Increase count of supported write-hints
> fs: introduce four macros for in-kernel hints
> fs: introduce APIs to enable sending write-hint with buffer-head
> fs/ext4,jbd2: add support for passing write-hint with journal.
>
> fs/buffer.c | 18 ++++++++++++++++--
> fs/ext4/super.c | 2 ++
> fs/jbd2/commit.c | 11 +++++++----
> fs/jbd2/journal.c | 3 ++-
> fs/jbd2/revoke.c | 3 ++-
> include/linux/blkdev.h | 5 ++++-
> include/linux/buffer_head.h | 3 +++
> include/linux/fs.h | 5 +++++
> include/linux/jbd2.h | 8 ++++++++
> 9 files changed, 49 insertions(+), 9 deletions(-)
>
> --
> 2.7.4
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-01-24 08:35:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/4] introduce four macros for in-kernel hints

On Wed 23-01-19 19:27:12, Javier Gonz?lez wrote:
>
> > On 9 Jan 2019, at 16.30, Kanchan Joshi <[email protected]> wrote:
> >
> > Exiting write-hints are exposed to user-mode. There is a possiblity
> > of conflict if kernel happens to use those. This patch introduces four
> > write-hints for exclusive kernel-mode use.
> >
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > ---
> > include/linux/fs.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 811c777..e8548eb 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -291,6 +291,11 @@ enum rw_hint {
> > WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
> > WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
> > WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
> > +/* below ones are meant for in-kernel use */
> > + KERN_WRITE_LIFE_SHORT,
> > + KERN_WRITE_LIFE_MEDIUM,
> > + KERN_WRITE_LIFE_LONG,
> > + KERN_WRITE_LIFE_EXTREME
> > };
> >
>
> I think Jens and Dave meant kernel hints to go top down. This would also
> give space for supporting more hints / streams from both ends for user
> and kernel.

Yes, that was the idea however if I understand it right, the write hints do
not really have to be consistent boot-to-boot since they aren't stored
persistently by the disk, are they? If that's the case, it doesn't really
matter which numbers we pick.

One thing I don't quite like is the naming of KERN_WRITE_LIFE_SHORT etc.. It
is upto filesystem to assign meanings to the write hints. So I think it is
enough to provide something like KERN_WRITE_HINT_MIN which is the first
hint available to the kernel and then the number of hints available to the
kernel.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-01-24 08:50:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/4] fs/ext4,jbd2: add support for passing write-hint with journal.

On Wed 09-01-19 21:01:01, Kanchan Joshi wrote:
> For NAND based SSDs, mixing of data with different life-time reduces
> efficiency of internal garbage-collection. During FS operations, series
> of journal updates will follow/precede series of data/meta updates, causing
> intermixing inside SSD. By passing a write-hint with journal, its write
> can be isolated from other data/meta writes, leading to endurance/performance
> benefit on SSD.
>
> This patch introduces "j_writehint" member in JBD2 journal, using which
> Ext4 specifies write-hint (as SHORT) for journal.
>
> Signed-off-by: Kanchan Joshi <[email protected]>

Thanks for the patch. It looks mostly good, just one nit below.

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index d6c142d..3af4049 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4289,6 +4289,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
> set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
>
> + sbi->s_journal->j_writehint = KERN_WRITE_LIFE_SHORT;
> +
> sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;

So I'd rather have defines like:

#define EXT4_WRITE_HINT_JOURNAL KERN_WRITE_HINT_MIN

like I suggested in another email and then later we could add stuff like

#define EXT4_WRITE_HINT_BITMAP (KERN_WRITE_HINT_MIN + 1)

etc.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-01-24 09:23:54

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/4] introduce four macros for in-kernel hints


> On 24 Jan 2019, at 09.35, Jan Kara <[email protected]> wrote:
>
> On Wed 23-01-19 19:27:12, Javier González wrote:
>>> On 9 Jan 2019, at 16.30, Kanchan Joshi <[email protected]> wrote:
>>>
>>> Exiting write-hints are exposed to user-mode. There is a possiblity
>>> of conflict if kernel happens to use those. This patch introduces four
>>> write-hints for exclusive kernel-mode use.
>>>
>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>> ---
>>> include/linux/fs.h | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 811c777..e8548eb 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -291,6 +291,11 @@ enum rw_hint {
>>> WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
>>> WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
>>> WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
>>> +/* below ones are meant for in-kernel use */
>>> + KERN_WRITE_LIFE_SHORT,
>>> + KERN_WRITE_LIFE_MEDIUM,
>>> + KERN_WRITE_LIFE_LONG,
>>> + KERN_WRITE_LIFE_EXTREME
>>> };
>>
>> I think Jens and Dave meant kernel hints to go top down. This would also
>> give space for supporting more hints / streams from both ends for user
>> and kernel.
>
> Yes, that was the idea however if I understand it right, the write hints do
> not really have to be consistent boot-to-boot since they aren't stored
> persistently by the disk, are they? If that's the case, it doesn't really
> matter which numbers we pick.
>

I guess this is implementation specific. Some times the drive will want
to store this to improve GC. For the current "coldness" hint I does not
matter much, but if the hint were to express other metric it can become
relevant.

Anyway, the comment was more to separate user / kernel hints and allow
them to grow from the ends.


> One thing I don't quite like is the naming of KERN_WRITE_LIFE_SHORT etc.. It
> is upto filesystem to assign meanings to the write hints. So I think it is
> enough to provide something like KERN_WRITE_HINT_MIN which is the first
> hint available to the kernel and then the number of hints available to the
> kernel.
>

Makes sense to me. Then we can rename the hint for each FS to give it
proper mening.

> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

Javier


Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP

2019-01-25 14:23:40

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

Hi Jens,
Can you please have a glance on this patch series, given than attempt is
to extend the original architecture of write-hints.

Thanks,

On Thursday 24 January 2019 01:59 PM, Jan Kara wrote:
> Hello,
>
> On Wed 09-01-19 21:00:57, Kanchan Joshi wrote:
>> Towards supporing write-hints/streams for filesystem journal.
>>
>> Here is the v1 patch for background -
>> https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
>>
>> Changes since v1:
>> - introduce four more hints for in-kernel use, as recommended by Dave chinner
>> & Jens axboe. This isolates kernel-mode hints from user-mode ones.
>> - remove mount-option to specify write-hint, as recommended by Jan kara &
>> Dave chinner. Rather, FS always sets write-hint for journal. This gets ignored
>> if device does not support stream.
>> - Removed code-redundancy for write_dirty_buffer (Jan kara's review comment)
>
> I guess the series should also go to Jens since he was the original author
> of write-hint support so he has the best idea about the architecture.
>
> Honza
>
>> Kanchan Joshi (4):
>> block: Increase count of supported write-hints
>> fs: introduce four macros for in-kernel hints
>> fs: introduce APIs to enable sending write-hint with buffer-head
>> fs/ext4,jbd2: add support for passing write-hint with journal.
>>
>> fs/buffer.c | 18 ++++++++++++++++--
>> fs/ext4/super.c | 2 ++
>> fs/jbd2/commit.c | 11 +++++++----
>> fs/jbd2/journal.c | 3 ++-
>> fs/jbd2/revoke.c | 3 ++-
>> include/linux/blkdev.h | 5 ++++-
>> include/linux/buffer_head.h | 3 +++
>> include/linux/fs.h | 5 +++++
>> include/linux/jbd2.h | 8 ++++++++
>> 9 files changed, 49 insertions(+), 9 deletions(-)
>>
>> --
>> 2.7.4
>>
>>

2019-01-25 16:24:52

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> Towards supporing write-hints/streams for filesystem journal.
>
> Here is the v1 patch for background -
> https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
>
> Changes since v1:
> - introduce four more hints for in-kernel use, as recommended by Dave chinner
> & Jens axboe. This isolates kernel-mode hints from user-mode ones.

The nvme driver disables streams if the controller doesn't support
BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
for controllers that only support up to 4.

2019-01-28 12:47:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Fri 25-01-19 09:23:53, Keith Busch wrote:
> On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> > Towards supporing write-hints/streams for filesystem journal.
> >
> > Here is the v1 patch for background -
> > https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
> >
> > Changes since v1:
> > - introduce four more hints for in-kernel use, as recommended by Dave chinner
> > & Jens axboe. This isolates kernel-mode hints from user-mode ones.
>
> The nvme driver disables streams if the controller doesn't support
> BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
> for controllers that only support up to 4.

Right. Do you know if there are such controllers? Or are you just afraid
that there could be?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-01-28 23:25:15

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Mon, Jan 28, 2019 at 04:47:09AM -0800, Jan Kara wrote:
> On Fri 25-01-19 09:23:53, Keith Busch wrote:
> > On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> > > Towards supporing write-hints/streams for filesystem journal.
> > >
> > > Here is the v1 patch for background -
> > > https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
> > >
> > > Changes since v1:
> > > - introduce four more hints for in-kernel use, as recommended by Dave chinner
> > > & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> >
> > The nvme driver disables streams if the controller doesn't support
> > BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
> > for controllers that only support up to 4.
>
> Right. Do you know if there are such controllers? Or are you just afraid
> that there could be?

I've asked around, and the concensus I received is all currently support
at least 8, but they couldn't say if that would be true for potential
lower budget products. Can we implement a reasonable fallback to use
what's available?

2019-01-29 10:07:05

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Mon 28-01-19 16:24:24, Keith Busch wrote:
> On Mon, Jan 28, 2019 at 04:47:09AM -0800, Jan Kara wrote:
> > On Fri 25-01-19 09:23:53, Keith Busch wrote:
> > > On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> > > > Towards supporing write-hints/streams for filesystem journal.
> > > >
> > > > Here is the v1 patch for background -
> > > > https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
> > > >
> > > > Changes since v1:
> > > > - introduce four more hints for in-kernel use, as recommended by Dave chinner
> > > > & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> > >
> > > The nvme driver disables streams if the controller doesn't support
> > > BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
> > > for controllers that only support up to 4.
> >
> > Right. Do you know if there are such controllers? Or are you just afraid
> > that there could be?
>
> I've asked around, and the concensus I received is all currently support
> at least 8, but they couldn't say if that would be true for potential
> lower budget products. Can we implement a reasonable fallback to use
> what's available?

OK, thanks for input. So probably we should just map kernel stream IDs to 0
if the device doesn't support them. But that probably means we need to
propagate number of available streams up from NVME into the block layer so
that this can be handled reasonably seamlessly. Jens, Kanchan?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-01-30 00:13:58

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Tue, Jan 29, 2019 at 11:07:02AM +0100, Jan Kara wrote:
> On Mon 28-01-19 16:24:24, Keith Busch wrote:
> > On Mon, Jan 28, 2019 at 04:47:09AM -0800, Jan Kara wrote:
> > > On Fri 25-01-19 09:23:53, Keith Busch wrote:
> > > > On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> > > > > Towards supporing write-hints/streams for filesystem journal.
> > > > >
> > > > > Here is the v1 patch for background -
> > > > > https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
> > > > >
> > > > > Changes since v1:
> > > > > - introduce four more hints for in-kernel use, as recommended by Dave chinner
> > > > > & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> > > >
> > > > The nvme driver disables streams if the controller doesn't support
> > > > BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
> > > > for controllers that only support up to 4.
> > >
> > > Right. Do you know if there are such controllers? Or are you just afraid
> > > that there could be?
> >
> > I've asked around, and the concensus I received is all currently support
> > at least 8, but they couldn't say if that would be true for potential
> > lower budget products. Can we implement a reasonable fallback to use
> > what's available?
>
> OK, thanks for input. So probably we should just map kernel stream IDs to 0
> if the device doesn't support them. But that probably means we need to
> propagate number of available streams up from NVME into the block layer so
> that this can be handled reasonably seamlessly. Jens, Kanchan?

Yeah, that's basically what I said we needed to do when this was
last discussed. i.e. that the block layer needed to know how many
streams the hardware had and map the 4 "kernel internal" hints
appropriately to what he device supports.

e.g. if the device only supports 4 hints, then it needs to map the
kernel hints either to zero. If it supports less than 8 streams,
then they need otbe mapped into the hints above index 5. If there
are N streams, then they need to be mapped to the hints {N-3,N}

And, to top it all off, there needs to be guards so that if we want
to grow the userspace hints to more than 4 hints, they don't crash
into ranges the kernel is already reserving because of limited
device range support.

Nothing is ever simple....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-30 13:57:19

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal


On Wednesday 30 January 2019 05:43 AM, Dave Chinner wrote:
> On Tue, Jan 29, 2019 at 11:07:02AM +0100, Jan Kara wrote:
>> On Mon 28-01-19 16:24:24, Keith Busch wrote:
>>> On Mon, Jan 28, 2019 at 04:47:09AM -0800, Jan Kara wrote:
>>>> On Fri 25-01-19 09:23:53, Keith Busch wrote:
>>>>> On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
>>>>>> Towards supporing write-hints/streams for filesystem journal.
>>>>>>
>>>>>> Here is the v1 patch for background -
>>>>>> https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
>>>>>>
>>>>>> Changes since v1:
>>>>>> - introduce four more hints for in-kernel use, as recommended by Dave chinner
>>>>>> & Jens axboe. This isolates kernel-mode hints from user-mode ones.
>>>>>
>>>>> The nvme driver disables streams if the controller doesn't support
>>>>> BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
>>>>> for controllers that only support up to 4.
>>>>
>>>> Right. Do you know if there are such controllers? Or are you just afraid
>>>> that there could be?
>>>
>>> I've asked around, and the concensus I received is all currently support
>>> at least 8, but they couldn't say if that would be true for potential
>>> lower budget products. Can we implement a reasonable fallback to use
>>> what's available?
>>
>> OK, thanks for input. So probably we should just map kernel stream IDs to 0
>> if the device doesn't support them. But that probably means we need to
>> propagate number of available streams up from NVME into the block layer so
>> that this can be handled reasonably seamlessly. Jens, Kanchan?
>
> Yeah, that's basically what I said we needed to do when this was
> last discussed. i.e. that the block layer needed to know how many
> streams the hardware had and map the 4 "kernel internal" hints
> appropriately to what he device supports.
>
> e.g. if the device only supports 4 hints, then it needs to map the
> kernel hints either to zero. If it supports less than 8 streams,
> then they need otbe mapped into the hints above index 5. If there
> are N streams, then they need to be mapped to the hints {N-3,N}
>
> And, to top it all off, there needs to be guards so that if we want
> to grow the userspace hints to more than 4 hints, they don't crash
> into ranges the kernel is already reserving because of limited
> device range support.
>
> Nothing is ever simple....
>
Thanks all for feedback.
user-hints, when they reach to kernel via fcntl path, are sanity-checked
(rw_hint_valid function).
Currently streams are enabled when nvme driver is made to run with
"streams =1" option, while stream users always pass some write-hint,
without bothering whether streams (and how many of those) are
operational or not. This keeps configuration simple for stream users.
Second, block layer does not translate write-hint to stream-number,
rather it is done inside nvme driver. I suppose I should keep both these
properties intact.
And considering all the suggestions, this is the plan for V3 -

[In block layer]
1. Introduce one macro "KERN_WRITE_HINT_MIN" which will take the value
"user_hint_cnt + 1".
FS code will use this value (onwards) to define their own streams.

2. Introduce another macro "BLK_MAX_KERNEL_WRITE_HINTS" which will be
set to 4 for now.

[In nvme driver]
1. Continue working as before if device supports just 4 streams. All
these streams are used by user-hints, and kernel-hints are translated to 0.

2. If device supports any more than 4 streams, those will be mapped to
serve kernel-hints, starting from KERN_WRITE_HINT_MIN onwards.
For example, if device has 6 streams, four streams (numbers = 1,2,3,4)
will be used to serve user-hints and two streams ( numbers = 65535,
65534) will be used to serve first two kernel hints. Other kernel-hints
get mapped to 0. OTOH, if device has 10 streams, first four kernel-hints
will be mapped to non-zero values (65535 to 65532) and anything else
would get turned to 0.


Let me know if this sounds fine?


Thanks,
Kanchan







2019-02-05 11:50:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Wed 30-01-19 19:24:39, Kanchan Joshi wrote:
>
> On Wednesday 30 January 2019 05:43 AM, Dave Chinner wrote:
> > On Tue, Jan 29, 2019 at 11:07:02AM +0100, Jan Kara wrote:
> > > On Mon 28-01-19 16:24:24, Keith Busch wrote:
> > > > On Mon, Jan 28, 2019 at 04:47:09AM -0800, Jan Kara wrote:
> > > > > On Fri 25-01-19 09:23:53, Keith Busch wrote:
> > > > > > On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> > > > > > > Towards supporing write-hints/streams for filesystem journal.
> > > > > > > Here is the v1 patch for background -
> > > > > > > https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
> > > > > > > Changes since v1:
> > > > > > > - introduce four more hints for in-kernel use, as recommended by Dave chinner
> > > > > > > & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> > > > > >
> > > > > > The nvme driver disables streams if the controller doesn't support
> > > > > > BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
> > > > > > for controllers that only support up to 4.
> > > > >
> > > > > Right. Do you know if there are such controllers? Or are you just afraid
> > > > > that there could be?
> > > >
> > > > I've asked around, and the concensus I received is all currently support
> > > > at least 8, but they couldn't say if that would be true for potential
> > > > lower budget products. Can we implement a reasonable fallback to use
> > > > what's available?
> > >
> > > OK, thanks for input. So probably we should just map kernel stream IDs to 0
> > > if the device doesn't support them. But that probably means we need to
> > > propagate number of available streams up from NVME into the block layer so
> > > that this can be handled reasonably seamlessly. Jens, Kanchan?
> >
> > Yeah, that's basically what I said we needed to do when this was
> > last discussed. i.e. that the block layer needed to know how many
> > streams the hardware had and map the 4 "kernel internal" hints
> > appropriately to what he device supports.
> >
> > e.g. if the device only supports 4 hints, then it needs to map the
> > kernel hints either to zero. If it supports less than 8 streams,
> > then they need otbe mapped into the hints above index 5. If there
> > are N streams, then they need to be mapped to the hints {N-3,N}
> >
> > And, to top it all off, there needs to be guards so that if we want
> > to grow the userspace hints to more than 4 hints, they don't crash
> > into ranges the kernel is already reserving because of limited
> > device range support.
> >
> > Nothing is ever simple....
> >
> Thanks all for feedback.
> user-hints, when they reach to kernel via fcntl path, are sanity-checked
> (rw_hint_valid function).
> Currently streams are enabled when nvme driver is made to run with "streams
> =1" option, while stream users always pass some write-hint, without
> bothering whether streams (and how many of those) are operational or not.
> This keeps configuration simple for stream users. Second, block layer does
> not translate write-hint to stream-number, rather it is done inside nvme
> driver. I suppose I should keep both these properties intact.
> And considering all the suggestions, this is the plan for V3 -
>
> [In block layer]
> 1. Introduce one macro "KERN_WRITE_HINT_MIN" which will take the value
> "user_hint_cnt + 1".
> FS code will use this value (onwards) to define their own streams.
>
> 2. Introduce another macro "BLK_MAX_KERNEL_WRITE_HINTS" which will be set to
> 4 for now.
>
> [In nvme driver]
> 1. Continue working as before if device supports just 4 streams. All these
> streams are used by user-hints, and kernel-hints are translated to 0.
>
> 2. If device supports any more than 4 streams, those will be mapped to serve
> kernel-hints, starting from KERN_WRITE_HINT_MIN onwards.
> For example, if device has 6 streams, four streams (numbers = 1,2,3,4) will
> be used to serve user-hints and two streams ( numbers = 65535, 65534) will
> be used to serve first two kernel hints. Other kernel-hints get mapped to 0.
> OTOH, if device has 10 streams, first four kernel-hints will be mapped to
> non-zero values (65535 to 65532) and anything else would get turned to 0.

Well, I'm not sure if the mapping should happen in the NVME driver. In
future, there will be potentially more drivers supporting write hints and
we probably don't want each of them to replicate the mapping behavior. So
IMO the mapping should rather belong to the block layer...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-02-05 22:53:22

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Write-hint for FS journal

On Tue, Feb 05, 2019 at 12:50:48PM +0100, Jan Kara wrote:
> On Wed 30-01-19 19:24:39, Kanchan Joshi wrote:
> >
> > On Wednesday 30 January 2019 05:43 AM, Dave Chinner wrote:
> > > On Tue, Jan 29, 2019 at 11:07:02AM +0100, Jan Kara wrote:
> > > > On Mon 28-01-19 16:24:24, Keith Busch wrote:
> > > > > On Mon, Jan 28, 2019 at 04:47:09AM -0800, Jan Kara wrote:
> > > > > > On Fri 25-01-19 09:23:53, Keith Busch wrote:
> > > > > > > On Wed, Jan 09, 2019 at 09:00:57PM +0530, Kanchan Joshi wrote:
> > > > > > > > Towards supporing write-hints/streams for filesystem journal.
> > > > > > > > Here is the v1 patch for background -
> > > > > > > > https://marc.info/?l=linux-fsdevel&m=154444637519020&w=2
> > > > > > > > Changes since v1:
> > > > > > > > - introduce four more hints for in-kernel use, as recommended by Dave chinner
> > > > > > > > & Jens axboe. This isolates kernel-mode hints from user-mode ones.
> > > > > > >
> > > > > > > The nvme driver disables streams if the controller doesn't support
> > > > > > > BLK_MAX_WRITE_HINT number of streams, so this series breaks the feature
> > > > > > > for controllers that only support up to 4.
> > > > > >
> > > > > > Right. Do you know if there are such controllers? Or are you just afraid
> > > > > > that there could be?
> > > > >
> > > > > I've asked around, and the concensus I received is all currently support
> > > > > at least 8, but they couldn't say if that would be true for potential
> > > > > lower budget products. Can we implement a reasonable fallback to use
> > > > > what's available?
> > > >
> > > > OK, thanks for input. So probably we should just map kernel stream IDs to 0
> > > > if the device doesn't support them. But that probably means we need to
> > > > propagate number of available streams up from NVME into the block layer so
> > > > that this can be handled reasonably seamlessly. Jens, Kanchan?
> > >
> > > Yeah, that's basically what I said we needed to do when this was
> > > last discussed. i.e. that the block layer needed to know how many
> > > streams the hardware had and map the 4 "kernel internal" hints
> > > appropriately to what he device supports.
> > >
> > > e.g. if the device only supports 4 hints, then it needs to map the
> > > kernel hints either to zero. If it supports less than 8 streams,
> > > then they need otbe mapped into the hints above index 5. If there
> > > are N streams, then they need to be mapped to the hints {N-3,N}
> > >
> > > And, to top it all off, there needs to be guards so that if we want
> > > to grow the userspace hints to more than 4 hints, they don't crash
> > > into ranges the kernel is already reserving because of limited
> > > device range support.
> > >
> > > Nothing is ever simple....
> > >
> > Thanks all for feedback.
> > user-hints, when they reach to kernel via fcntl path, are sanity-checked
> > (rw_hint_valid function).
> > Currently streams are enabled when nvme driver is made to run with "streams
> > =1" option, while stream users always pass some write-hint, without
> > bothering whether streams (and how many of those) are operational or not.
> > This keeps configuration simple for stream users. Second, block layer does
> > not translate write-hint to stream-number, rather it is done inside nvme
> > driver. I suppose I should keep both these properties intact.
> > And considering all the suggestions, this is the plan for V3 -
> >
> > [In block layer]
> > 1. Introduce one macro "KERN_WRITE_HINT_MIN" which will take the value
> > "user_hint_cnt + 1".
> > FS code will use this value (onwards) to define their own streams.
> >
> > 2. Introduce another macro "BLK_MAX_KERNEL_WRITE_HINTS" which will be set to
> > 4 for now.
> >
> > [In nvme driver]
> > 1. Continue working as before if device supports just 4 streams. All these
> > streams are used by user-hints, and kernel-hints are translated to 0.
> >
> > 2. If device supports any more than 4 streams, those will be mapped to serve
> > kernel-hints, starting from KERN_WRITE_HINT_MIN onwards.
> > For example, if device has 6 streams, four streams (numbers = 1,2,3,4) will
> > be used to serve user-hints and two streams ( numbers = 65535, 65534) will
> > be used to serve first two kernel hints. Other kernel-hints get mapped to 0.
> > OTOH, if device has 10 streams, first four kernel-hints will be mapped to
> > non-zero values (65535 to 65532) and anything else would get turned to 0.
>
> Well, I'm not sure if the mapping should happen in the NVME driver. In
> future, there will be potentially more drivers supporting write hints and
> we probably don't want each of them to replicate the mapping behavior. So
> IMO the mapping should rather belong to the block layer...

*nod*

That's what I was suggesting. All the driver does is supply the
block layer with the number of hints it supports, and the block
layer does the rest. After all, this has to work with DM, MD, etc
so it really does need to bubble up from the driver to the block
layer so it can be handled appropriately by multi-device block
drivers. e.g. md raid might want to reserve a kernel channel for
itself (e.g. internal metadata) and so only present 7 channels to
the next layer up (4 user and 3 kernel)....

Cheers,

Dave.
--
Dave Chinner
[email protected]