2019-06-22 23:23:33

by Eric Wheeler

Subject: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

From: Eric Wheeler <[email protected]>

While some drivers set queue_limits.io_opt (e.g., md raid5), there are
currently no SCSI/RAID controller drivers that do. Previously stripe_size
and partial_stripes_expensive were read-only values and could not be
tuned by users (e.g., for hardware RAID5/6).

This patch enables users to save the optimal IO size via sysfs through
the backing device attributes stripe_size and partial_stripes_expensive
into the bcache superblock.

Superblock changes are backwards-compatible:

* partial_stripes_expensive: One bit was used in the superblock flags field

* stripe_size: There are eight 64-bit "pad" fields for future use in
the superblock which default to 0; of those, 32 bits are now used to
save the stripe_size and load it at device registration time.

Signed-off-by: Eric Wheeler <[email protected]>
---
Documentation/admin-guide/bcache.rst | 21 +++++++++++++++++++++
drivers/md/bcache/super.c | 15 ++++++++++++++-
drivers/md/bcache/sysfs.c | 33 +++++++++++++++++++++++++++++++--
include/uapi/linux/bcache.h | 6 ++++--
4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
index c0ce64d..ef82022 100644
--- a/Documentation/admin-guide/bcache.rst
+++ b/Documentation/admin-guide/bcache.rst
@@ -420,6 +420,12 @@ dirty_data
label
Name of underlying device.

+partial_stripes_expensive
+ Flag indicating to bcache that partial or stripe-unaligned
+ writes to the backing device are expensive (e.g., RAID5/6 incur
+ read-modify-write). Writing this sysfs attribute updates the superblock
+ and also takes effect immediately. See also stripe_size, below.
+
readahead
Size of readahead that should be performed. Defaults to 0. If set to e.g.
1M, it will round cache miss reads up to that size, but without overlapping
@@ -458,6 +464,21 @@ stop
Write to this file to shut down the bcache device and close the backing
device.

+stripe_size
+ The stripe size in bytes of the backing device for optimal
+ write performance (also known as the "stride width"). This is set
+ automatically when a device driver sets blk_limits_io_opt
+ (e.g., md, rbd, skd, zram, virtio_blk). No hardware RAID controller
+ sets blk_limits_io_opt as of 2019-06-15, so configure this to suit
+ your needs. Note that you must unregister and re-register the backing
+ device after making a change to stripe_size.
+
+ Where N is the total number of disks in the array,
+ RAID5: stripe_size = (N-1)*RAID_CHUNK_SIZE.
+ RAID6: stripe_size = (N-2)*RAID_CHUNK_SIZE.
+
+ See also partial_stripes_expensive, above.
+
writeback_delay
When dirty data is written to the cache and it previously did not contain
any, waits some number of seconds before initiating writeback. Defaults to
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1b63ac8..d0b9501 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -80,6 +80,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,

sb->flags = le64_to_cpu(s->flags);
sb->seq = le64_to_cpu(s->seq);
+ sb->stripe_size = le32_to_cpu(s->stripe_size);
sb->last_mount = le32_to_cpu(s->last_mount);
sb->first_bucket = le16_to_cpu(s->first_bucket);
sb->keys = le16_to_cpu(s->keys);
@@ -221,6 +222,7 @@ static void __write_super(struct cache_sb *sb, struct bio *bio)

out->flags = cpu_to_le64(sb->flags);
out->seq = cpu_to_le64(sb->seq);
+ out->stripe_size = cpu_to_le32(sb->stripe_size);

out->last_mount = cpu_to_le32(sb->last_mount);
out->first_bucket = cpu_to_le16(sb->first_bucket);
@@ -1258,7 +1260,18 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)

dc->disk.stripe_size = q->limits.io_opt >> 9;

- if (dc->disk.stripe_size)
+ if (dc->sb.stripe_size) {
+ if (dc->disk.stripe_size &&
+ dc->disk.stripe_size != dc->sb.stripe_size) {
+ pr_warn("superblock stripe_size (%d) overrides bdev stripe_size (%d)\n",
+ (int)dc->sb.stripe_size,
+ (int)dc->disk.stripe_size);
+ }
+
+ dc->disk.stripe_size = dc->sb.stripe_size;
+ dc->partial_stripes_expensive =
+ (unsigned int)BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb);
+ } else if (dc->disk.stripe_size)
dc->partial_stripes_expensive =
q->limits.raid_partial_stripes_expensive;

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index bfb437f..4ebca52 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -111,8 +111,8 @@
rw_attribute(writeback_rate_minimum);
read_attribute(writeback_rate_debug);

-read_attribute(stripe_size);
-read_attribute(partial_stripes_expensive);
+rw_attribute(stripe_size);
+rw_attribute(partial_stripes_expensive);

rw_attribute(synchronous);
rw_attribute(journal_delay_ms);
@@ -343,6 +343,35 @@ static ssize_t bch_snprint_string_list(char *buf,
}
}

+ if (attr == &sysfs_stripe_size) {
+ int v = strtoul_or_return(buf);
+
+ if (v & 0x1FF) {
+ pr_err("stripe_size must be a multiple of 512 bytes");
+ return -EINVAL;
+ }
+
+ v >>= 9;
+
+ if (v != dc->sb.stripe_size) {
+ dc->sb.stripe_size = v;
+ pr_info("stripe_size=%d, re-register to take effect.",
+ v<<9);
+ bch_write_bdev_super(dc, NULL);
+ } else
+ pr_info("stripe_size is already set to %d.", v<<9);
+ }
+
+ if (attr == &sysfs_partial_stripes_expensive) {
+ int v = strtoul_or_return(buf);
+
+ if (v != BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb)) {
+ SET_BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb, v);
+ dc->partial_stripes_expensive = v;
+ bch_write_bdev_super(dc, NULL);
+ }
+ }
+
if (attr == &sysfs_stop_when_cache_set_failed) {
v = __sysfs_match_string(bch_stop_on_failure_modes, -1, buf);
if (v < 0)
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index 5d4f58e..ee60914 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -172,7 +172,9 @@ struct cache_sb {

__u64 flags;
__u64 seq;
- __u64 pad[8];
+ __u32 stripe_size;
+ __u32 pad_u32;
+ __u64 pad_u64[7];

union {
struct {
@@ -230,7 +232,7 @@ static inline _Bool SB_IS_BDEV(const struct cache_sb *sb)
#define BDEV_STATE_CLEAN 1U
#define BDEV_STATE_DIRTY 2U
#define BDEV_STATE_STALE 3U
-
+BITMASK(BDEV_PARTIAL_STRIPES_EXPENSIVE, struct cache_sb, flags, 60, 1);
/*
* Magic numbers
*
--
1.8.3.1
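As a sanity check of the formulas in the patch's documentation, the stripe size can be computed with a small helper like the one below (a hypothetical sketch, not part of the patch; it also mirrors the 512-byte alignment check, `v & 0x1FF`, from the sysfs store path):

```python
def raid_stripe_size(n_disks, chunk_bytes, raid_level):
    """Stripe size in bytes per the documented formulas:
    (data disks) * chunk size, where RAID5 loses 1 disk to
    parity and RAID6 loses 2."""
    parity = {5: 1, 6: 2}[raid_level]
    size = (n_disks - parity) * chunk_bytes
    # Same sector-alignment check the sysfs store path applies.
    if size & 0x1FF:
        raise ValueError("stripe_size must be a multiple of 512 bytes")
    return size
```

For example, a 6-disk RAID6 with 256 KiB chunks gives `raid_stripe_size(6, 262144, 6)`, i.e. 4 data disks times 256 KiB = 1 MiB, which would then be written to the `stripe_size` sysfs attribute.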


2019-06-23 01:20:32

by Martin K. Petersen

Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6


Eric,

> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
> currently no SCSI/RAID controller drivers that do.

That's not true. Lots of SCSI RAID devices report a stripe width.

--
Martin K. Petersen Oracle Linux Engineering

2019-06-24 07:17:19

by Coly Li

Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On 2019/6/23 7:16 上午, Eric Wheeler wrote:
> From: Eric Wheeler <[email protected]>
>
> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
> currently no SCSI/RAID controller drivers that do. Previously stripe_size
> and partial_stripes_expensive were read-only values and could not be
> tuned by users (eg, for hardware RAID5/6).
>
> This patch enables users to save the optimal IO size via sysfs through
> the backing device attributes stripe_size and partial_stripes_expensive
> into the bcache superblock.
>
> Superblock changes are backwards-compatable:
>
> * partial_stripes_expensive: One bit was used in the superblock flags field
>
> * stripe_size: There are eight 64-bit "pad" fields for future use in
> the superblock which default to 0; from those, 32-bits are now used
> to save the stripe_size and load at device registration time.
>
> Signed-off-by: Eric Wheeler <[email protected]>

Hi Eric,

In general I am OK with this patch. Since Martin comments that lots of
SCSI RAID devices report a stripe width, could you please list the
hardware RAID devices which don't report a stripe size? Then we can
decide whether it is necessary to have such an option enabled.

Another point: this patch changes struct cache_sb, and it is no problem
to change the on-disk format. I plan to update the super block version
soon, to store more configuration persistently in the super block.
stripe_size can be added to cache_sb along with those other on-disk
changes.

Thanks.

Coly Li


> ---
> Documentation/admin-guide/bcache.rst | 21 +++++++++++++++++++++
> drivers/md/bcache/super.c | 15 ++++++++++++++-
> drivers/md/bcache/sysfs.c | 33 +++++++++++++++++++++++++++++++--
> include/uapi/linux/bcache.h | 6 ++++--
> 4 files changed, 70 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
> index c0ce64d..ef82022 100644
> --- a/Documentation/admin-guide/bcache.rst
> +++ b/Documentation/admin-guide/bcache.rst
> @@ -420,6 +420,12 @@ dirty_data
> label
> Name of underlying device.
>
> +partial_stripes_expensive
> + Flag to bcache that partial or unaligned stripe_size'd
> + writes to the backing device are expensive (e.g., RAID5/6 incur
> + read-copy-write). Writing this sysfs attribute updates the superblock
> + and also takes effect immediately. See also stripe_size, below.
> +
> readahead
> Size of readahead that should be performed. Defaults to 0. If set to e.g.
> 1M, it will round cache miss reads up to that size, but without overlapping
> @@ -458,6 +464,21 @@ stop
> Write to this file to shut down the bcache device and close the backing
> device.
>
> +stripe_size
> + The stripe size in bytes of the backing device for optimial
> + write performance (also known as the "stride width"). This is set
> + automatically when using a device driver sets blk_limits_io_opt
> + (e.g., md, rbd, skd, zram, virtio_blk). No hardware RAID controller
> + sets blk_limits_io_opt as of 2019-06-15, so configure this to suit
> + your needs. Note that you must unregister and re-register the backing
> + device after making a change to stripe_size.
> +
> + Where N is the number of data disks,
> + RAID5: stripe_size = (N-1)*RAID_CHUNK_SIZE.
> + RAID6: stripe_size = (N-2)*RAID_CHUNK_SIZE.
> +
> + See also partial_stripes_expensive, above.
> +
> writeback_delay
> When dirty data is written to the cache and it previously did not contain
> any, waits some number of seconds before initiating writeback. Defaults to
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 1b63ac8..d0b9501 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -80,6 +80,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
>
> sb->flags = le64_to_cpu(s->flags);
> sb->seq = le64_to_cpu(s->seq);
> + sb->stripe_size = le32_to_cpu(s->stripe_size);
> sb->last_mount = le32_to_cpu(s->last_mount);
> sb->first_bucket = le16_to_cpu(s->first_bucket);
> sb->keys = le16_to_cpu(s->keys);
> @@ -221,6 +222,7 @@ static void __write_super(struct cache_sb *sb, struct bio *bio)
>
> out->flags = cpu_to_le64(sb->flags);
> out->seq = cpu_to_le64(sb->seq);
> + out->stripe_size = cpu_to_le32(sb->stripe_size);
>
> out->last_mount = cpu_to_le32(sb->last_mount);
> out->first_bucket = cpu_to_le16(sb->first_bucket);
> @@ -1258,7 +1260,18 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>
> dc->disk.stripe_size = q->limits.io_opt >> 9;
>
> - if (dc->disk.stripe_size)
> + if (dc->sb.stripe_size) {
> + if (dc->disk.stripe_size &&
> + dc->disk.stripe_size != dc->sb.stripe_size) {
> + pr_warn("superblock stripe_size (%d) overrides bdev stripe_size (%d)\n",
> + (int)dc->sb.stripe_size,
> + (int)dc->disk.stripe_size);
> + }
> +
> + dc->disk.stripe_size = dc->sb.stripe_size;
> + dc->partial_stripes_expensive =
> + (unsigned int)BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb);
> + } else if (dc->disk.stripe_size)
> dc->partial_stripes_expensive =
> q->limits.raid_partial_stripes_expensive;
>
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index bfb437f..4ebca52 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -111,8 +111,8 @@
> rw_attribute(writeback_rate_minimum);
> read_attribute(writeback_rate_debug);
>
> -read_attribute(stripe_size);
> -read_attribute(partial_stripes_expensive);
> +rw_attribute(stripe_size);
> +rw_attribute(partial_stripes_expensive);
>
> rw_attribute(synchronous);
> rw_attribute(journal_delay_ms);
> @@ -343,6 +343,35 @@ static ssize_t bch_snprint_string_list(char *buf,
> }
> }
>
> + if (attr == &sysfs_stripe_size) {
> + int v = strtoul_or_return(buf);
> +
> + if (v & 0x1FF) {
> + pr_err("stripe_size must be a muliple of 512-byte sectors");
> + return -EINVAL;
> + }
> +
> + v >>= 9;
> +
> + if (v != dc->sb.stripe_size) {
> + dc->sb.stripe_size = v;
> + pr_info("stripe_size=%d, re-register to take effect.",
> + v<<9);
> + bch_write_bdev_super(dc, NULL);
> + } else
> + pr_info("stripe_size is already set to %d.", v<<9);
> + }
> +
> + if (attr == &sysfs_partial_stripes_expensive) {
> + int v = strtoul_or_return(buf);
> +
> + if (v != BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb)) {
> + SET_BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb, v);
> + dc->partial_stripes_expensive = v;
> + bch_write_bdev_super(dc, NULL);
> + }
> + }
> +
> if (attr == &sysfs_stop_when_cache_set_failed) {
> v = __sysfs_match_string(bch_stop_on_failure_modes, -1, buf);
> if (v < 0)
> diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
> index 5d4f58e..ee60914 100644
> --- a/include/uapi/linux/bcache.h
> +++ b/include/uapi/linux/bcache.h
> @@ -172,7 +172,9 @@ struct cache_sb {
>
> __u64 flags;
> __u64 seq;
> - __u64 pad[8];
> + __u32 stripe_size;
> + __u32 pad_u32;
> + __u64 pad_u64[7];
>
> union {
> struct {
> @@ -230,7 +232,7 @@ static inline _Bool SB_IS_BDEV(const struct cache_sb *sb)
> #define BDEV_STATE_CLEAN 1U
> #define BDEV_STATE_DIRTY 2U
> #define BDEV_STATE_STALE 3U
> -
> +BITMASK(BDEV_PARTIAL_STRIPES_EXPENSIVE, struct cache_sb, flags, 60, 1);
> /*
> * Magic numbers
> *
>


--

Coly Li

2019-06-24 21:20:15

by Eric Wheeler

Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On Mon, 24 Jun 2019, Coly Li wrote:

> On 2019/6/23 7:16 上午, Eric Wheeler wrote:
> > From: Eric Wheeler <[email protected]>
> >
> > While some drivers set queue_limits.io_opt (e.g., md raid5), there are
> > currently no SCSI/RAID controller drivers that do. Previously stripe_size
> > and partial_stripes_expensive were read-only values and could not be
> > tuned by users (eg, for hardware RAID5/6).
> >
> > This patch enables users to save the optimal IO size via sysfs through
> > the backing device attributes stripe_size and partial_stripes_expensive
> > into the bcache superblock.
> >
> > Superblock changes are backwards-compatable:
> >
> > * partial_stripes_expensive: One bit was used in the superblock flags field
> >
> > * stripe_size: There are eight 64-bit "pad" fields for future use in
> > the superblock which default to 0; from those, 32-bits are now used
> > to save the stripe_size and load at device registration time.
> >
> > Signed-off-by: Eric Wheeler <[email protected]>
>
> Hi Eric,
>
> In general I am OK with this patch. Since Peter comments lots of SCSI
> RAID devices reports a stripe width, could you please list the hardware
> raid devices which don't list stripe size ? Then we can make decision
> whether it is necessary to have such option enabled.

Perhaps they do not set stripe_width using io_opt? I did a grep to see if
any of them did, but I didn't see them. How is stripe_width indicated by
RAID controllers?

If they do set io_opt, then at least my Areca 1883 does not set io_opt as
of 4.19.x. I also have a LSI MegaRAID 3108 which does not report io_opt as
of 4.1.x, but that is an older kernel so maybe support has been added
since then.

Martin,

Where would stripe_width be configured in the SCSI drivers? Is it visible
through sysfs or debugfs so I can check my hardware's support without
hacking on or debugging the kernel?

>
> Another point is, this patch changes struct cache_sb, it is no problem
> to change on-disk format. I plan to update the super block version soon,
> to store more configuration persistently into super block. stripe_size
> can be added to cache_sb with other on-disk changes.

Maybe bumping version makes sense, but even if you do not, this is safe to
use on systems without bumping the version because the values are unused
and default to 0.

--
Eric Wheeler

>
> Thanks.
>
> Coly Li
>
>
> > ---
> > Documentation/admin-guide/bcache.rst | 21 +++++++++++++++++++++
> > drivers/md/bcache/super.c | 15 ++++++++++++++-
> > drivers/md/bcache/sysfs.c | 33 +++++++++++++++++++++++++++++++--
> > include/uapi/linux/bcache.h | 6 ++++--
> > 4 files changed, 70 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
> > index c0ce64d..ef82022 100644
> > --- a/Documentation/admin-guide/bcache.rst
> > +++ b/Documentation/admin-guide/bcache.rst
> > @@ -420,6 +420,12 @@ dirty_data
> > label
> > Name of underlying device.
> >
> > +partial_stripes_expensive
> > + Flag to bcache that partial or unaligned stripe_size'd
> > + writes to the backing device are expensive (e.g., RAID5/6 incur
> > + read-copy-write). Writing this sysfs attribute updates the superblock
> > + and also takes effect immediately. See also stripe_size, below.
> > +
> > readahead
> > Size of readahead that should be performed. Defaults to 0. If set to e.g.
> > 1M, it will round cache miss reads up to that size, but without overlapping
> > @@ -458,6 +464,21 @@ stop
> > Write to this file to shut down the bcache device and close the backing
> > device.
> >
> > +stripe_size
> > + The stripe size in bytes of the backing device for optimial
> > + write performance (also known as the "stride width"). This is set
> > + automatically when using a device driver sets blk_limits_io_opt
> > + (e.g., md, rbd, skd, zram, virtio_blk). No hardware RAID controller
> > + sets blk_limits_io_opt as of 2019-06-15, so configure this to suit
> > + your needs. Note that you must unregister and re-register the backing
> > + device after making a change to stripe_size.
> > +
> > + Where N is the number of data disks,
> > + RAID5: stripe_size = (N-1)*RAID_CHUNK_SIZE.
> > + RAID6: stripe_size = (N-2)*RAID_CHUNK_SIZE.
> > +
> > + See also partial_stripes_expensive, above.
> > +
> > writeback_delay
> > When dirty data is written to the cache and it previously did not contain
> > any, waits some number of seconds before initiating writeback. Defaults to
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index 1b63ac8..d0b9501 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -80,6 +80,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
> >
> > sb->flags = le64_to_cpu(s->flags);
> > sb->seq = le64_to_cpu(s->seq);
> > + sb->stripe_size = le32_to_cpu(s->stripe_size);
> > sb->last_mount = le32_to_cpu(s->last_mount);
> > sb->first_bucket = le16_to_cpu(s->first_bucket);
> > sb->keys = le16_to_cpu(s->keys);
> > @@ -221,6 +222,7 @@ static void __write_super(struct cache_sb *sb, struct bio *bio)
> >
> > out->flags = cpu_to_le64(sb->flags);
> > out->seq = cpu_to_le64(sb->seq);
> > + out->stripe_size = cpu_to_le32(sb->stripe_size);
> >
> > out->last_mount = cpu_to_le32(sb->last_mount);
> > out->first_bucket = cpu_to_le16(sb->first_bucket);
> > @@ -1258,7 +1260,18 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
> >
> > dc->disk.stripe_size = q->limits.io_opt >> 9;
> >
> > - if (dc->disk.stripe_size)
> > + if (dc->sb.stripe_size) {
> > + if (dc->disk.stripe_size &&
> > + dc->disk.stripe_size != dc->sb.stripe_size) {
> > + pr_warn("superblock stripe_size (%d) overrides bdev stripe_size (%d)\n",
> > + (int)dc->sb.stripe_size,
> > + (int)dc->disk.stripe_size);
> > + }
> > +
> > + dc->disk.stripe_size = dc->sb.stripe_size;
> > + dc->partial_stripes_expensive =
> > + (unsigned int)BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb);
> > + } else if (dc->disk.stripe_size)
> > dc->partial_stripes_expensive =
> > q->limits.raid_partial_stripes_expensive;
> >
> > diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> > index bfb437f..4ebca52 100644
> > --- a/drivers/md/bcache/sysfs.c
> > +++ b/drivers/md/bcache/sysfs.c
> > @@ -111,8 +111,8 @@
> > rw_attribute(writeback_rate_minimum);
> > read_attribute(writeback_rate_debug);
> >
> > -read_attribute(stripe_size);
> > -read_attribute(partial_stripes_expensive);
> > +rw_attribute(stripe_size);
> > +rw_attribute(partial_stripes_expensive);
> >
> > rw_attribute(synchronous);
> > rw_attribute(journal_delay_ms);
> > @@ -343,6 +343,35 @@ static ssize_t bch_snprint_string_list(char *buf,
> > }
> > }
> >
> > + if (attr == &sysfs_stripe_size) {
> > + int v = strtoul_or_return(buf);
> > +
> > + if (v & 0x1FF) {
> > + pr_err("stripe_size must be a muliple of 512-byte sectors");
> > + return -EINVAL;
> > + }
> > +
> > + v >>= 9;
> > +
> > + if (v != dc->sb.stripe_size) {
> > + dc->sb.stripe_size = v;
> > + pr_info("stripe_size=%d, re-register to take effect.",
> > + v<<9);
> > + bch_write_bdev_super(dc, NULL);
> > + } else
> > + pr_info("stripe_size is already set to %d.", v<<9);
> > + }
> > +
> > + if (attr == &sysfs_partial_stripes_expensive) {
> > + int v = strtoul_or_return(buf);
> > +
> > + if (v != BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb)) {
> > + SET_BDEV_PARTIAL_STRIPES_EXPENSIVE(&dc->sb, v);
> > + dc->partial_stripes_expensive = v;
> > + bch_write_bdev_super(dc, NULL);
> > + }
> > + }
> > +
> > if (attr == &sysfs_stop_when_cache_set_failed) {
> > v = __sysfs_match_string(bch_stop_on_failure_modes, -1, buf);
> > if (v < 0)
> > diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
> > index 5d4f58e..ee60914 100644
> > --- a/include/uapi/linux/bcache.h
> > +++ b/include/uapi/linux/bcache.h
> > @@ -172,7 +172,9 @@ struct cache_sb {
> >
> > __u64 flags;
> > __u64 seq;
> > - __u64 pad[8];
> > + __u32 stripe_size;
> > + __u32 pad_u32;
> > + __u64 pad_u64[7];
> >
> > union {
> > struct {
> > @@ -230,7 +232,7 @@ static inline _Bool SB_IS_BDEV(const struct cache_sb *sb)
> > #define BDEV_STATE_CLEAN 1U
> > #define BDEV_STATE_DIRTY 2U
> > #define BDEV_STATE_STALE 3U
> > -
> > +BITMASK(BDEV_PARTIAL_STRIPES_EXPENSIVE, struct cache_sb, flags, 60, 1);
> > /*
> > * Magic numbers
> > *
> >
>
>
> --
>
> Coly Li
>

2019-06-25 03:55:45

by Martin K. Petersen

Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6


Eric,

> Perhaps they do not set stripe_width using io_opt? I did a grep to see
> if any of them did, but I didn't see them. How is stripe_width
> indicated by RAID controllers?

The values are reported in the Block Limits VPD page for each SCSI block
device and are thus set by the SCSI disk driver. IOW, the RAID
controller device drivers have nothing to do with this.

For RAID controllers specifically, the controller firmware will fill out
the VPD fields for each virtual SCSI disk when you configure a RAID
set. For pretty much everything else, the Block Limits come straight
from the device itself.

Also note that these values aren't specific to RAID controllers at
all. Most new SCSI devices, including disk drives and SSDs, will fill
out the Block Limits VPD page one way or the other. Even some USB
storage devices provide this page.

> If they do set io_opt, then at least my Areca 1883 does not set io_opt
> as of 4.19.x. I also have a LSI MegaRAID 3108 which does not report
> io_opt as of 4.1.x, but that is an older kernel so maybe support has
> been added since then.

I have several MegaRAIDs that all report it. But it depends on the
controller firmware.

> Is it visible through sysfs or debugfs so I can check my hardware
> support without hacking debugging the kernel?

To print the block device topology:

# lsblk -t

or look up io_opt in sysfs:

# grep . /sys/block/sdX/queue/optimal_io_size

You can also query a SCSI device's Block Limits directly:

# sg_vpd -p bl /dev/sdX

If you want to tinker, you can simulate a SCSI disk with your choice of
io_opt:

# modprobe scsi_debug opt_blks=N

where N is the number of logical blocks to report as being the optimal
I/O size.

--
Martin K. Petersen Oracle Linux Engineering
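Martin's checks suggest a simple policy an admin script might follow: prefer the kernel-reported `optimal_io_size`, and fall back to the RAID formula from the patch documentation only when it reads 0. A hypothetical sketch (the sysfs file name comes from this thread; the function and its parameters are assumptions for illustration):

```python
import os

def choose_stripe_size(queue_dir, n_disks=None, chunk=None, parity=1):
    """Return a stripe size in bytes: use the device's reported
    optimal_io_size if nonzero, else fall back to the RAID formula
    (data disks * chunk); parity=1 for RAID5, 2 for RAID6."""
    io_opt = 0
    try:
        # e.g. queue_dir = "/sys/block/sdX/queue"
        with open(os.path.join(queue_dir, "optimal_io_size")) as f:
            io_opt = int(f.read().strip())
    except OSError:
        pass  # no sysfs entry; rely on the manual formula
    if io_opt:
        return io_opt
    if n_disks and chunk:
        return (n_disks - parity) * chunk
    return 0
```

The returned value could then be written to the bcache backing device's `stripe_size` attribute, followed by an unregister/re-register as the patch documentation requires.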

2019-06-25 04:10:16

by Coly Li

Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On 2019/6/25 2:14 上午, Eric Wheeler wrote:
> On Mon, 24 Jun 2019, Coly Li wrote:
>
>> On 2019/6/23 7:16 上午, Eric Wheeler wrote:
>>> From: Eric Wheeler <[email protected]>
>>>
>>> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
>>> currently no SCSI/RAID controller drivers that do. Previously stripe_size
>>> and partial_stripes_expensive were read-only values and could not be
>>> tuned by users (e.g., for hardware RAID5/6).
>>>
>>> This patch enables users to save the optimal IO size via sysfs through
>>> the backing device attributes stripe_size and partial_stripes_expensive
>>> into the bcache superblock.
>>>
>>> Superblock changes are backwards-compatible:
>>>
>>> * partial_stripes_expensive: One bit was used in the superblock flags field
>>>
>>> * stripe_size: There are eight 64-bit "pad" fields for future use in
>>> the superblock which default to 0; from those, 32-bits are now used
>>> to save the stripe_size and load at device registration time.
>>>
>>> Signed-off-by: Eric Wheeler <[email protected]>
>>
>> Hi Eric,
>>
>> In general I am OK with this patch. Since Peter comments that lots of
>> SCSI RAID devices report a stripe width, could you please list the
>> hardware RAID devices which don't report a stripe size? Then we can
>> decide whether it is necessary to have such an option enabled.
>
> Perhaps they do not set stripe_width using io_opt? I did a grep to see if
> any of them did, but I didn't see them. How is stripe_width indicated by
> RAID controllers?
>
> If they do set io_opt, then at least my Areca 1883 does not set io_opt as
> of 4.19.x. I also have a LSI MegaRAID 3108 which does not report io_opt as
> of 4.1.x, but that is an older kernel so maybe support has been added
> since then.
>
> Martin,
>
> Where would stripe_width be configured in the SCSI drivers? Is it visible
> through sysfs or debugfs so I can check my hardware support without
> hacking debugging the kernel?
>
>>
>> Another point is, this patch changes struct cache_sb, and it is no
>> problem to change the on-disk format. I plan to update the super block
>> version soon, to store more configuration persistently in the super
>> block. stripe_size can be added to cache_sb with the other on-disk
>> changes.
>

Hi Eric,

> Maybe bumping version makes sense, but even if you do not, this is safe to
> use on systems without bumping the version because the values are unused
> and default to 0.

Yes, I understand; it works as you suggested. I need to think about how
to organize all the options in struct cache_sb, and stripe_size will be
arranged then. I will ask for your help in reviewing the changes to the
on-disk format.

Thanks.

[snipped]

--

Coly Li

2019-06-26 00:23:32

by Eric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On Mon, 24 Jun 2019, Martin K. Petersen wrote:
> > Perhaps they do not set stripe_width using io_opt? I did a grep to see
> > if any of them did, but I didn't see them. How is stripe_width
> > indicated by RAID controllers?
>
> The values are reported in the Block Limits VPD page for each SCSI block
> device and are thus set by the SCSI disk driver. IOW, the RAID
> controller device drivers have nothing to do with this.
>
> For RAID controllers specifically, the controller firmware will fill out
> the VPD fields for each virtual SCSI disk when you configure a RAID
> set. For pretty much everything else, the Block Limits come straight
> from the device itself.
>
> Also note that these values aren't specific to RAID controllers at
> all. Most new SCSI devices, including disk drives and SSDs, will fill
> out the Block Limits VPD page one way or the other. Even some USB
> storage devices are providing this page.

Thanks, that makes sense. Interesting about USB.

> > If they do set io_opt, then at least my Areca 1883 does not set io_opt
> > as of 4.19.x. I also have a LSI MegaRAID 3108 which does not report
> > io_opt as of 4.1.x, but that is an older kernel so maybe support has
> > been added since then.
>
> I have several MegaRAIDs that all report it. But it depends on the
> controller firmware.
>
> > Is it visible through sysfs or debugfs so I can check my hardware
> > support without hacking debugging the kernel?
>
> To print the block device topology:
>
> # lsblk -t
>
> or look up io_opt in sysfs:
>
> # grep . /sys/block/sdX/queue/optimal_io_size
>
> You can also query a SCSI device's Block Limits directly:
>
> # sg_vpd -p bl /dev/sdX

Perfect, thank you for that. I've tried the following controllers that I
have access to. One worked (hpsa/HP Gen8 Smart Array Controller), but the
others I tried do not provide the Block Limits VPD page:

* LSI 2108 (Supermicro)
* LSI 3108 (Dell)
* Areca 1882
* Areca 1883
* 8 Gb Fibre Channel connected to a Storwize 3700

~]# sg_vpd -p bl /dev/sdb
VPD page=0xb0
fetching VPD page failed

> If you want to tinker, you can simulate a SCSI disk with your choice of
> io_opt:
>
> # modprobe scsi_debug opt_blks=N
>
> where N is the number of logical blocks to report as being the optimal
> I/O size.

Neat, thanks for the hint!

-Eric

>
> --
> Martin K. Petersen Oracle Linux Engineering
>

2019-06-26 02:51:59

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6


Eric,

> * LSI 2108 (Supermicro)
> * LSI 3108 (Dell)
> * Areca 1882
> * Areca 1883
> * 8 Gb Fibre Channel connected to a Storwize 3700

I have a 3108 that provides the BL VPD. Surprised the 1883 doesn't.

As a rule of thumb, you need 12 Gbps SAS or 16 Gbps FC devices for the
VPD page to be present. The protocol feature is not tied to the
transport signaling speed in any way. But general support for the BL VPD
page roughly coincided with vendors introducing 12 Gbps SAS and 16 Gbps
FC products to the market.

--
Martin K. Petersen Oracle Linux Engineering

2022-01-06 03:34:40

by Eric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On Tue, 25 Jun 2019, Coly Li wrote:
> On 2019/6/25 2:14 上午, Eric Wheeler wrote:
> > On Mon, 24 Jun 2019, Coly Li wrote:
> >
> >> On 2019/6/23 7:16 上午, Eric Wheeler wrote:
> >>> From: Eric Wheeler <[email protected]>
> >>>
> >>> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
> >>> currently no SCSI/RAID controller drivers that do. Previously stripe_size
> >>> and partial_stripes_expensive were read-only values and could not be
> >>> tuned by users (e.g., for hardware RAID5/6).
> >>>
> >>> This patch enables users to save the optimal IO size via sysfs through
> >>> the backing device attributes stripe_size and partial_stripes_expensive
> >>> into the bcache superblock.
> >>>
> >>> Superblock changes are backwards-compatible:
> >>>
> >>> * partial_stripes_expensive: One bit was used in the superblock flags field
> >>>
> >>> * stripe_size: There are eight 64-bit "pad" fields for future use in
> >>> the superblock which default to 0; from those, 32-bits are now used
> >>> to save the stripe_size and load at device registration time.
> >>>
> >>> Signed-off-by: Eric Wheeler <[email protected]>
> >>
> >> Hi Eric,
> >>
> >> In general I am OK with this patch. Since Peter comments that lots of
> >> SCSI RAID devices report a stripe width, could you please list the
> >> hardware RAID devices which don't report a stripe size? Then we can
> >> decide whether it is necessary to have such an option enabled.
> >
> > Perhaps they do not set stripe_width using io_opt? I did a grep to see if
> > any of them did, but I didn't see them. How is stripe_width indicated by
> > RAID controllers?
> >
> > If they do set io_opt, then at least my Areca 1883 does not set io_opt as
> > of 4.19.x. I also have a LSI MegaRAID 3108 which does not report io_opt as
> > of 4.1.x, but that is an older kernel so maybe support has been added
> > since then.
> >
> > Martin,
> >
> > Where would stripe_width be configured in the SCSI drivers? Is it visible
> > through sysfs or debugfs so I can check my hardware support without
> > hacking debugging the kernel?
> >
> >>
> >> Another point is, this patch changes struct cache_sb, and it is no
> >> problem to change the on-disk format. I plan to update the super block
> >> version soon, to store more configuration persistently in the super
> >> block. stripe_size can be added to cache_sb with the other on-disk
> >> changes.
> >
>
> Hi Eric,
>
> > Maybe bumping version makes sense, but even if you do not, this is safe to
> > use on systems without bumping the version because the values are unused
> > and default to 0.
>
> Yes, I understand; it works as you suggested. I need to think about how
> to organize all the options in struct cache_sb, and stripe_size will be
> arranged then. I will ask for your help in reviewing the changes to the
> on-disk format.

Hi Coly,

Just checking in; it's been a while and I didn't see any more discussion
on the topic:

This would benefit users with older RAID controllers using RAID-5/6 that
don't set io_opt.

Even new RAID controllers that _do_ provide `io_opt` still do _not_
indicate partial_stripes_expensive (which is an mdraid feature, but Martin,
please correct me if I'm wrong here). Thus, all hardware RAID-5/6 users
could benefit from manually flagging partial_stripes_expensive to get burst
writes out of bcache that fit their stripe width.

This patch probably needs to be rebased and its documentation updated
regarding io_opt, but here is the original patch with documentation for
your reference:
https://lkml.org/lkml/2019/6/22/298

What do you think?

-Eric

>
> Thanks.
>
> [snipped]
>
> --
>
> Coly Li
>

2022-01-06 16:17:44

by Coly Li

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On 1/6/22 11:29 AM, Eric Wheeler wrote:
> On Tue, 25 Jun 2019, Coly Li wrote:
>> On 2019/6/25 2:14 上午, Eric Wheeler wrote:
>>> On Mon, 24 Jun 2019, Coly Li wrote:
>>>
>>>> On 2019/6/23 7:16 上午, Eric Wheeler wrote:
>>>>> From: Eric Wheeler <[email protected]>
>>>>>
>>>>> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
>>>>> currently no SCSI/RAID controller drivers that do. Previously stripe_size
>>>>> and partial_stripes_expensive were read-only values and could not be
>>>>> tuned by users (e.g., for hardware RAID5/6).
>>>>>
>>>>> This patch enables users to save the optimal IO size via sysfs through
>>>>> the backing device attributes stripe_size and partial_stripes_expensive
>>>>> into the bcache superblock.
>>>>>
>>>>> Superblock changes are backwards-compatible:
>>>>>
>>>>> * partial_stripes_expensive: One bit was used in the superblock flags field
>>>>>
>>>>> * stripe_size: There are eight 64-bit "pad" fields for future use in
>>>>> the superblock which default to 0; from those, 32-bits are now used
>>>>> to save the stripe_size and load at device registration time.
>>>>>
>>>>> Signed-off-by: Eric Wheeler <[email protected]>
>>>> Hi Eric,
>>>>
>>>> In general I am OK with this patch. Since Peter comments that lots of
>>>> SCSI RAID devices report a stripe width, could you please list the
>>>> hardware RAID devices which don't report a stripe size? Then we can
>>>> decide whether it is necessary to have such an option enabled.
>>> Perhaps they do not set stripe_width using io_opt? I did a grep to see if
>>> any of them did, but I didn't see them. How is stripe_width indicated by
>>> RAID controllers?
>>>
>>> If they do set io_opt, then at least my Areca 1883 does not set io_opt as
>>> of 4.19.x. I also have a LSI MegaRAID 3108 which does not report io_opt as
>>> of 4.1.x, but that is an older kernel so maybe support has been added
>>> since then.
>>>
>>> Martin,
>>>
>>> Where would stripe_width be configured in the SCSI drivers? Is it visible
>>> through sysfs or debugfs so I can check my hardware support without
>>> hacking debugging the kernel?
>>>
>>>> Another point is, this patch changes struct cache_sb, and it is no
>>>> problem to change the on-disk format. I plan to update the super block
>>>> version soon, to store more configuration persistently in the super
>>>> block. stripe_size can be added to cache_sb with the other on-disk
>>>> changes.
>> Hi Eric,
>>
>>> Maybe bumping version makes sense, but even if you do not, this is safe to
>>> use on systems without bumping the version because the values are unused
>>> and default to 0.
>> Yes, I understand; it works as you suggested. I need to think about how
>> to organize all the options in struct cache_sb, and stripe_size will be
>> arranged then. I will ask for your help in reviewing the changes to the
>> on-disk format.
> Hi Coli,
>
> Just checking in, its been a while and I didn't see any more discussion on
> the topic:

Hi Eric,

Thank you for reminding me. The persistent on-disk options were not as
many as I thought, so using reserved space from the on-disk super block
is fine.

> This would benefit users with older RAID controllers using RAID-5/6 that
> don't set io_opt.
>
> Even new RAID controllers that _do_ provide `io_opt` still do _not_
> indicate partial_stripes_expensive (which is an mdraid feature, but Martin,
> please correct me if I'm wrong here). Thus, all hardware RAID-5/6 users
> could benefit from manually flagging partial_stripes_expensive to get burst
> writes out of bcache that fit their stripe width.

Yeah, I agree with you.

> This patch probably needs to be rebased and its documentation updated
> regarding io_opt, but here is the original patch with documentation for
> your reference:
> https://lkml.org/lkml/2019/6/22/298
>
> What do you think?

Yes, please rebase the patch against the latest mainline kernel and let's
start the review.

Thank you.

Coly Li

2022-01-08 00:21:45

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6


Eric,

> Even new RAID controllers that _do_ provide `io_opt` still do _not_
> indicate partial_stripes_expensive (which is an mdraid feature, but Martin,
> please correct me if I'm wrong here).

partial_stripes_expensive is a bcache thing; I am not sure why it needs
a separate flag. It is implied, although I guess one could argue that
RAID0 is a special case since partial writes are not as painful as with
parity RAID.

The SCSI spec states that submitting an I/O that is smaller than io_min
"may incur delays in processing the command". And similarly, submitting
a command larger than io_opt "may incur delays in processing the
command".

IOW, the spec says "don't write less than an aligned multiple of the
stripe chunk size" and "don't write more than an aligned full
stripe". That leaves "aligned multiples of the stripe chunk size but
less than the full stripe width" unaccounted for. And I guess that's
what the bcache flag is trying to capture.

SCSI doesn't go into details about RAID levels and other implementation
details which is why the wording is deliberately vague. But obviously
the expectation is that partial stripe writes are slower than full.

In my book any component in the stack that sees either io_min or io_opt
should try very hard to send I/Os that are aligned multiples of those
values. I am not opposed to letting users manually twiddle the
settings. But I do think that we should aim for the stack doing the
right thing when it sees io_opt reported on a device.

--
Martin K. Petersen Oracle Linux Engineering

2022-01-08 04:54:28

by Eric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On Fri, 7 Jan 2022, Martin K. Petersen wrote:
> Eric,
>
> > Even new RAID controllers that _do_ provide `io_opt` still do _not_
> > indicate partial_stripes_expensive (which is an mdraid feature, but Martin,
> > please correct me if I'm wrong here).
>
> partial_stripes_expensive is a bcache thing, I am not sure why it needs
> a separate flag. It is implied, although I guess one could argue that
> RAID0 is a special case since partial writes are not as painful as with
> parity RAID.

I'm guessing bcache did some optimization for
queue->limits.raid_partial_stripes_expensive because the md raid5 code
sets this flag. At least when using Linux md as the RAID5 implementation
it gets configured automatically:
raid5.c: mddev->queue->limits.raid_partial_stripes_expensive = 1;

https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L7729

Interestingly only bcache uses it, but md does set it.

> The SCSI spec states that submitting an I/O that is smaller than io_min
> "may incur delays in processing the command". And similarly, submitting
> a command larger than io_opt "may incur delays in processing the
> command".
>
> IOW, the spec says "don't write less than an aligned multiple of the
> stripe chunk size" and "don't write more than an aligned full
> stripe". That leaves "aligned multiples of the stripe chunk size but
> less than the full stripe width" unaccounted for. And I guess that's
> what the bcache flag is trying to capture.

Maybe any time io_opt is provided then partial_stripes_expensive should be
flagged too and any code to the contrary should be removed?

Question: Does anyone have a reason to keep partial_stripes_expensive in
the kernel at all?

> SCSI doesn't go into details about RAID levels and other implementation
> details which is why the wording is deliberately vague. But obviously
> the expectation is that partial stripe writes are slower than full.
>
> In my book any component in the stack that sees either io_min or io_opt
> should try very hard to send I/Os that are aligned multiples of those
> values. I am not opposed to letting users manually twiddle the
> settings. But I do think that we should aim for the stack doing the
> right thing when it sees io_opt reported on a device.

Agreed, thanks for the feedback!

-Eric


>
> --
> Martin K. Petersen Oracle Linux Engineering
>

2022-01-08 21:51:32

by Eric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On Fri, 7 Jan 2022, Eric Wheeler wrote:
> On Fri, 7 Jan 2022, Martin K. Petersen wrote:
> > Eric,
> >
> > > Even new RAID controllers that _do_ provide `io_opt` still do _not_
> > > indicate partial_stripes_expensive (which is an mdraid feature, but Martin,
> > > please correct me if I'm wrong here).
> >
> > partial_stripes_expensive is a bcache thing, I am not sure why it needs
> > a separate flag. It is implied, although I guess one could argue that
> > RAID0 is a special case since partial writes are not as painful as with
> > parity RAID.
>
> I'm guessing bcache did some optimization for
> queue->limits.raid_partial_stripes_expensive because the md raid5 code
> sets this flag. At least when using Linux md as the RAID5 implementation
> it gets configured automatically:
> raid5.c: mddev->queue->limits.raid_partial_stripes_expensive = 1;
>
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L7729
>
> Interestingly only bcache uses it, but md does set it.

Ok so `git blame` shows that Kent added this to md/raid5.c in
c78afc6261b (Kent Overstreet 2013-07-11 22:39:53 -0700 7526)
mddev->queue->limits.raid_partial_stripes_expensive = 1;

bcache/md: Use raid stripe size

Now that we've got code for raid5/6 stripe awareness, bcache just needs
to know about the stripes and when writing partial stripes is expensive
- we probably don't want to enable this optimization for raid1 or 10,
even though they have stripes. So add a flag to queue_limits.

Kent, Martin:

Do you think we should leave the md-specific
raid_partial_stripes_expensive setting and require users of RAID
controllers to set the bit themselves in bcache, or remove all
raid_partial_stripes_expensive code and always treat writes as "expensive"
when `io_opt` is defined?

--
Eric Wheeler


>
> > The SCSI spec states that submitting an I/O that is smaller than io_min
> > "may incur delays in processing the command". And similarly, submitting
> > a command larger than io_opt "may incur delays in processing the
> > command".
> >
> > IOW, the spec says "don't write less than an aligned multiple of the
> > stripe chunk size" and "don't write more than an aligned full
> > stripe". That leaves "aligned multiples of the stripe chunk size but
> > less than the full stripe width" unaccounted for. And I guess that's
> > what the bcache flag is trying to capture.
>
> Maybe any time io_opt is provided then partial_stripes_expensive should be
> flagged too and any code to the contrary should be removed?
>
> Question: Does anyone have a reason to keep partial_stripes_expensive in
> the kernel at all?
>
> > SCSI doesn't go into details about RAID levels and other implementation
> > details which is why the wording is deliberately vague. But obviously
> > the expectation is that partial stripe writes are slower than full.
> >
> > In my book any component in the stack that sees either io_min or io_opt
> > should try very hard to send I/Os that are aligned multiples of those
> > values. I am not opposed to letting users manually twiddle the
> > settings. But I do think that we should aim for the stack doing the
> > right thing when it sees io_opt reported on a device.
>
> Agreed, thanks for the feedback!
>
> -Eric
>
>
> >
> > --
> > Martin K. Petersen Oracle Linux Engineering
> >
>

2022-01-10 16:14:49

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6


Eric,

> Ok so `git blame` shows that Kent added this to md/raid5.c in
> c78afc6261b (Kent Overstreet 2013-07-11 22:39:53 -0700 7526)

Yep.

> Do you think we should leave the md-specific
> raid_partial_stripes_expensive setting and require users of RAID
> controllers to set the bit themselves in bcache, or remove all
> raid_partial_stripes_expensive code and always treat writes as
> "expensive" when `io_opt` is defined?

I'd prefer the latter since that was the very intent of exporting the
device topology in an abstract and protocol-independent fashion.
However, I don't know enough about bcache internals to know whether it
is always the right choice, what the trade-offs are, etc.

--
Martin K. Petersen Oracle Linux Engineering

2022-01-10 23:30:31

by Eric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6

On Mon, 10 Jan 2022, Martin K. Petersen wrote:
> > Ok so `git blame` shows that Kent added this to md/raid5.c in
> > c78afc6261b (Kent Overstreet 2013-07-11 22:39:53 -0700 7526)
>
> Yep.
>
> > Do you think we should leave the md-specific
> > raid_partial_stripes_expensive setting and require users of RAID
> > controllers to set the bit themselves in bcache, or remove all
> > raid_partial_stripes_expensive code and always treat writes as
> > "expensive" when `io_opt` is defined?
>
> I'd prefer the latter since that was the very intent of exporting the
> device topology in an abstract and protocol-independent fashion.
> However, I don't know enough about bcache internals to know whether it
> is always the right choice, what the trade-offs are, etc.

I've not looked in detail, but I don't think it's too difficult.

Do you know if io_opt can be configured on SCSI devices that do not have
io_opt provided by the controller? If so, or if it would be easy to add,
then configuring io_opt at the SCSI layer is probably a better option so
subsystems besides bcache would benefit.

What do you think?

--
Eric Wheeler


>
> --
> Martin K. Petersen Oracle Linux Engineering
>

2022-01-11 02:55:37

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH] bcache: make stripe_size configurable and persistent for hardware raid5/6


Eric,

> Do you know if io_opt can be configured on scsi devices that do not
> have io_opt provided by the controller? If so, or if it would be easy
> to add, then configuring io_opt at the scsi layer is probably a better
> option so subsystems besides bcache would benefit.

There is currently no way to do this in the SCSI sysfs interface.
However, io_opt is a queue_limits parameter. The appropriate location
for modification is blk-sysfs.c. We already allow overriding other block
device properties such as rotational, stable_writes, discards, max I/O
size, etc. I don't see any particular reason why we couldn't also permit
overriding the optimal I/O size.

--
Martin K. Petersen Oracle Linux Engineering