2020-08-19 15:26:41

by John Garry

Subject: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

Hi all,

Here is v8 of the patchset.

In this version of the series, we keep the shared sbitmap for driver tags,
and introduce changes to fix up the tag budgeting across request queues.
We also have a change to count requests per-hctx for when an elevator is
enabled, as an optimisation. I also dropped the debugfs changes - more on
that below.

Some performance figures:

Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
but it is not always an appropriate scheduler to use.

Tag depth 4000 (default) 260**

Baseline (v5.9-rc1):
none sched: 2094K IOPS 513K
mq-deadline sched: 2145K IOPS 1336K

Final, host_tagset=0 in LLDD *, ***:
none sched: 2120K IOPS 550K
mq-deadline sched: 2121K IOPS 1309K

Final ***:
none sched: 2132K IOPS 1185
mq-deadline sched: 2145K IOPS 2097

* this is relevant as it shows the performance when supporting, but not
enabling, the feature
** depth=260 is relevant as it is a point where we are regularly waiting
for tags to be available. Figures were a bit unstable here.
*** Included "[PATCH V4] scsi: core: only re-run queue in
scsi_end_request() if device queue is busy"

A copy of the patches can be found here:
https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8

The hpsa patch depends on:
https://lore.kernel.org/linux-scsi/[email protected]/

And the smartpqi patch is not to be accepted.

Comments (and testing) welcome, thanks!

Differences to v7:
- Add null_blk and scsi_debug support
- Drop debugfs tags patch - it's too difficult to keep the output the same
between hostwide and non-hostwide, as discussed:
https://lore.kernel.org/linux-scsi/[email protected]/T/#mb3eb462d8be40273718505989abd12f8228c15fd
And from commit 6bf0eb550452 ("sbitmap: Consider cleared bits in
sbitmap_bitmap_show()"), I guess not many used this anyway...
- Add elevator per-hctx request count for optimisation
- Break up "blk-mq: rename blk_mq_update_tag_set_depth()" into 2x patches
- Pass flags to avoid per-hw-queue tags init/free for hostwide tags
- Add Don's Reviewed-by and Tested-by tags to appropriate patches
- (@Don, please let me know if issue with how I did this)
- Add "scsi: core: Show nr_hw_queues in sysfs"
- Rework megaraid SAS patch to have module param (Kashyap)
- rebase

V7 is here for more info:
https://lore.kernel.org/linux-scsi/[email protected]/T/#t

Hannes Reinecke (5):
blk-mq: Rename blk_mq_update_tag_set_depth()
blk-mq: Free tags in blk_mq_init_tags() upon error
scsi: Add host and host template flag 'host_tagset'
hpsa: enable host_tagset and switch to MQ
smartpqi: enable host tagset

John Garry (10):
blk-mq: Pass flags for tag init/free
blk-mq: Use pointers for blk_mq_tags bitmap tags
blk-mq: Facilitate a shared sbitmap per tagset
blk-mq: Relocate hctx_may_queue()
blk-mq: Record nr_active_requests per queue for when using shared
sbitmap
blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
shared sbitmap
null_blk: Support shared tag bitmap
scsi: core: Show nr_hw_queues in sysfs
scsi: hisi_sas: Switch v3 hw to MQ
scsi: scsi_debug: Support host tagset

Kashyap Desai (2):
blk-mq, elevator: Count requests per hctx to improve performance
scsi: megaraid_sas: Added support for shared host tagset for
cpuhotplug

Ming Lei (1):
blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

block/bfq-iosched.c | 9 +-
block/blk-core.c | 2 +
block/blk-mq-debugfs.c | 10 +-
block/blk-mq-sched.c | 13 +-
block/blk-mq-tag.c | 149 ++++++++++++++------
block/blk-mq-tag.h | 56 +++-----
block/blk-mq.c | 81 +++++++----
block/blk-mq.h | 76 +++++++++-
block/kyber-iosched.c | 4 +-
block/mq-deadline.c | 6 +
drivers/block/null_blk_main.c | 6 +
drivers/block/rnbd/rnbd-clt.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas.h | 3 +-
drivers/scsi/hisi_sas/hisi_sas_main.c | 36 ++---
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++-------
drivers/scsi/hosts.c | 1 +
drivers/scsi/hpsa.c | 44 +-----
drivers/scsi/hpsa.h | 1 -
drivers/scsi/megaraid/megaraid_sas_base.c | 39 +++++
drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++--
drivers/scsi/scsi_debug.c | 28 ++--
drivers/scsi/scsi_lib.c | 2 +
drivers/scsi/scsi_sysfs.c | 11 ++
drivers/scsi/smartpqi/smartpqi_init.c | 45 ++++--
include/linux/blk-mq.h | 13 +-
include/linux/blkdev.h | 3 +
include/scsi/scsi_host.h | 9 +-
27 files changed, 484 insertions(+), 281 deletions(-)

--
2.26.2


2020-08-19 15:26:52

by John Garry

Subject: [PATCH v8 12/18] scsi: Add host and host template flag 'host_tagset'

From: Hannes Reinecke <[email protected]>

Add host and host template flag 'host_tagset' so that a hostwide tagset
can be shared across multiple reply queues after the SCSI device's reply
queues are converted to blk-mq hw queues.

Tested-by: Don Brace <[email protected]> #SCSI resv cmds patches used
Signed-off-by: Hannes Reinecke <[email protected]>
[jpg: Update comment on .can_queue and add Scsi_Host.host_tagset]
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/hosts.c | 1 +
drivers/scsi/scsi_lib.c | 2 ++
include/scsi/scsi_host.h | 9 ++++++++-
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 37d1c5565d90..2f162603876f 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -421,6 +421,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
shost->cmd_per_lun = sht->cmd_per_lun;
shost->unchecked_isa_dma = sht->unchecked_isa_dma;
shost->no_write_same = sht->no_write_same;
+ shost->host_tagset = sht->host_tagset;

if (shost_eh_deadline == -1 || !sht->eh_host_reset_handler)
shost->eh_deadline = -1;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7c6dd6f75190..cdef5d1777cb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1891,6 +1891,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
tag_set->flags |=
BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
tag_set->driver_data = shost;
+ if (shost->host_tagset)
+ tag_set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;

return blk_mq_alloc_tag_set(tag_set);
}
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 46ef8cccc982..701f178b20ae 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -436,6 +436,9 @@ struct scsi_host_template {
/* True if the controller does not support WRITE SAME */
unsigned no_write_same:1;

+ /* True if the host uses host-wide tagspace */
+ unsigned host_tagset:1;
+
/*
* Countdown for host blocking with no commands outstanding.
*/
@@ -603,7 +606,8 @@ struct Scsi_Host {
*
* Note: it is assumed that each hardware queue has a queue depth of
* can_queue. In other words, the total queue depth per host
- * is nr_hw_queues * can_queue.
+ is nr_hw_queues * can_queue. However, when host_tagset is set, the
+ total queue depth is can_queue.
*/
unsigned nr_hw_queues;
unsigned active_mode:2;
@@ -634,6 +638,9 @@ struct Scsi_Host {
/* The controller does not support WRITE SAME */
unsigned no_write_same:1;

+ /* True if the host uses host-wide tagspace */
+ unsigned host_tagset:1;
+
/* Host responded with short (<36 bytes) INQUIRY result */
unsigned short_inquiry:1;

--
2.26.2

2020-08-19 15:26:55

by John Garry

Subject: [PATCH v8 05/18] blk-mq: Use pointers for blk_mq_tags bitmap tags

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
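The indirection can be modelled in userspace like this; `sbitmap_queue` is a stand-in struct, and `share_tags()` is a hypothetical helper for the shared case that later patches in the series provide:

```c
/* Userspace model of the pointer indirection, not kernel code. */
struct sbitmap_queue { unsigned int depth; };

struct blk_mq_tags {
	struct sbitmap_queue *bitmap_tags;	/* what all users dereference */
	struct sbitmap_queue __bitmap_tags;	/* per-tags backing storage */
};

/* Regular case: the pointer refers to this tags' own embedded bitmap */
static void init_tags(struct blk_mq_tags *tags, unsigned int depth)
{
	tags->__bitmap_tags.depth = depth;
	tags->bitmap_tags = &tags->__bitmap_tags;
}

/* Shared case: re-point a tags struct at one common bitmap, so every
 * HW context ends up allocating from the same tag space */
static void share_tags(struct blk_mq_tags *tags, struct sbitmap_queue *common)
{
	tags->bitmap_tags = common;
}
```

All readers go through `bitmap_tags`/`breserved_tags`, so switching between the embedded and the shared bitmap is a pure init-time decision.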

Tested-by: Don Brace <[email protected]> #SCSI resv cmds patches used
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
block/bfq-iosched.c | 4 ++--
block/blk-mq-debugfs.c | 8 ++++----
block/blk-mq-tag.c | 41 ++++++++++++++++++++++-------------------
block/blk-mq-tag.h | 7 +++++--
block/blk-mq.c | 8 ++++----
block/kyber-iosched.c | 4 ++--
6 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a4c0bec920cb..88f0dfa545d7 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6372,8 +6372,8 @@ static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
struct blk_mq_tags *tags = hctx->sched_tags;
unsigned int min_shallow;

- min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
- sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
+ min_shallow = bfq_update_depths(bfqd, tags->bitmap_tags);
+ sbitmap_queue_min_shallow_depth(tags->bitmap_tags, min_shallow);
}

static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a4597febff69..645b7f800cb8 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -452,11 +452,11 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
atomic_read(&tags->active_queues));

seq_puts(m, "\nbitmap_tags:\n");
- sbitmap_queue_show(&tags->bitmap_tags, m);
+ sbitmap_queue_show(tags->bitmap_tags, m);

if (tags->nr_reserved_tags) {
seq_puts(m, "\nbreserved_tags:\n");
- sbitmap_queue_show(&tags->breserved_tags, m);
+ sbitmap_queue_show(tags->breserved_tags, m);
}
}

@@ -487,7 +487,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
if (res)
goto out;
if (hctx->tags)
- sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+ sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
mutex_unlock(&q->sysfs_lock);

out:
@@ -521,7 +521,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
if (res)
goto out;
if (hctx->sched_tags)
- sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+ sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
mutex_unlock(&q->sysfs_lock);

out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 0426f01e06a6..77fbeae05745 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -35,9 +35,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
*/
void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
{
- sbitmap_queue_wake_all(&tags->bitmap_tags);
+ sbitmap_queue_wake_all(tags->bitmap_tags);
if (include_reserve)
- sbitmap_queue_wake_all(&tags->breserved_tags);
+ sbitmap_queue_wake_all(tags->breserved_tags);
}

/*
@@ -82,10 +82,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
WARN_ON_ONCE(1);
return BLK_MQ_NO_TAG;
}
- bt = &tags->breserved_tags;
+ bt = tags->breserved_tags;
tag_offset = 0;
} else {
- bt = &tags->bitmap_tags;
+ bt = tags->bitmap_tags;
tag_offset = tags->nr_reserved_tags;
}

@@ -131,9 +131,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
data->ctx);
tags = blk_mq_tags_from_data(data);
if (data->flags & BLK_MQ_REQ_RESERVED)
- bt = &tags->breserved_tags;
+ bt = tags->breserved_tags;
else
- bt = &tags->bitmap_tags;
+ bt = tags->bitmap_tags;

/*
* If destination hw queue is changed, fake wake up on
@@ -167,10 +167,10 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
const int real_tag = tag - tags->nr_reserved_tags;

BUG_ON(real_tag >= tags->nr_tags);
- sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
+ sbitmap_queue_clear(tags->bitmap_tags, real_tag, ctx->cpu);
} else {
BUG_ON(tag >= tags->nr_reserved_tags);
- sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
+ sbitmap_queue_clear(tags->breserved_tags, tag, ctx->cpu);
}
}

@@ -298,9 +298,9 @@ static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED);

if (tags->nr_reserved_tags)
- bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
+ bt_tags_for_each(tags, tags->breserved_tags, fn, priv,
flags | BT_TAG_ITER_RESERVED);
- bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
+ bt_tags_for_each(tags, tags->bitmap_tags, fn, priv, flags);
}

/**
@@ -416,8 +416,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
continue;

if (tags->nr_reserved_tags)
- bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
- bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+ bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
+ bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
}
blk_queue_exit(q);
}
@@ -435,15 +435,18 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;

- if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
+ if (bt_alloc(&tags->__bitmap_tags, depth, round_robin, node))
return -ENOMEM;
- if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
- node))
+ if (bt_alloc(&tags->__breserved_tags, tags->nr_reserved_tags,
+ round_robin, node))
goto free_bitmap_tags;

+ tags->bitmap_tags = &tags->__bitmap_tags;
+ tags->breserved_tags = &tags->__breserved_tags;
+
return 0;
free_bitmap_tags:
- sbitmap_queue_free(&tags->bitmap_tags);
+ sbitmap_queue_free(&tags->__bitmap_tags);
return -ENOMEM;
}

@@ -475,8 +478,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,

void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags)
{
- sbitmap_queue_free(&tags->bitmap_tags);
- sbitmap_queue_free(&tags->breserved_tags);
+ sbitmap_queue_free(&tags->__bitmap_tags);
+ sbitmap_queue_free(&tags->__breserved_tags);
kfree(tags);
}

@@ -527,7 +530,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
* Don't need (or can't) update reserved tags here, they
* remain static and should never need resizing.
*/
- sbitmap_queue_resize(&tags->bitmap_tags,
+ sbitmap_queue_resize(tags->bitmap_tags,
tdepth - tags->nr_reserved_tags);
}

diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 391f68d32fa1..5b6f25bd1966 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -13,8 +13,11 @@ struct blk_mq_tags {

atomic_t active_queues;

- struct sbitmap_queue bitmap_tags;
- struct sbitmap_queue breserved_tags;
+ struct sbitmap_queue *bitmap_tags;
+ struct sbitmap_queue *breserved_tags;
+
+ struct sbitmap_queue __bitmap_tags;
+ struct sbitmap_queue __breserved_tags;

struct request **rqs;
struct request **static_rqs;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 376930061452..ce712ec818e4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1096,14 +1096,14 @@ static inline unsigned int queued_to_index(unsigned int queued)

static bool __blk_mq_get_driver_tag(struct request *rq)
{
- struct sbitmap_queue *bt = &rq->mq_hctx->tags->bitmap_tags;
+ struct sbitmap_queue *bt = rq->mq_hctx->tags->bitmap_tags;
unsigned int tag_offset = rq->mq_hctx->tags->nr_reserved_tags;
int tag;

blk_mq_tag_busy(rq->mq_hctx);

if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
- bt = &rq->mq_hctx->tags->breserved_tags;
+ bt = rq->mq_hctx->tags->breserved_tags;
tag_offset = 0;
}

@@ -1145,7 +1145,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
struct sbitmap_queue *sbq;

list_del_init(&wait->entry);
- sbq = &hctx->tags->bitmap_tags;
+ sbq = hctx->tags->bitmap_tags;
atomic_dec(&sbq->ws_active);
}
spin_unlock(&hctx->dispatch_wait_lock);
@@ -1163,7 +1163,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
- struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;
+ struct sbitmap_queue *sbq = hctx->tags->bitmap_tags;
struct wait_queue_head *wq;
wait_queue_entry_t *wait;
bool ret;
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a38c5ab103d1..075e99c207ef 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -359,7 +359,7 @@ static unsigned int kyber_sched_tags_shift(struct request_queue *q)
* All of the hardware queues have the same depth, so we can just grab
* the shift of the first one.
*/
- return q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+ return q->queue_hw_ctx[0]->sched_tags->bitmap_tags->sb.shift;
}

static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
@@ -502,7 +502,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
khd->batching = 0;

hctx->sched_data = khd;
- sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+ sbitmap_queue_min_shallow_depth(hctx->sched_tags->bitmap_tags,
kqd->async_depth);

return 0;
--
2.26.2

2020-08-19 15:27:13

by John Garry

Subject: [PATCH v8 01/18] blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

From: Ming Lei <[email protected]>

BLK_MQ_F_TAG_SHARED actually means that tags are shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make the point explicit.

Suggested-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
[jpg: rebase a few times, add rnbd-clt.c change]
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-debugfs.c | 2 +-
block/blk-mq-tag.h | 6 +++---
block/blk-mq.c | 20 ++++++++++----------
drivers/block/rnbd/rnbd-clt.c | 2 +-
include/linux/blk-mq.h | 2 +-
5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3f09bcb8a6fd..a4597febff69 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,7 +240,7 @@ static const char *const alloc_policy_name[] = {
#define HCTX_FLAG_NAME(name) [ilog2(BLK_MQ_F_##name)] = #name
static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
- HCTX_FLAG_NAME(TAG_SHARED),
+ HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
HCTX_FLAG_NAME(STACKING),
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index b1acac518c4e..918e2cee4f43 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -56,7 +56,7 @@ extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);

static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
{
- if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+ if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
return false;

return __blk_mq_tag_busy(hctx);
@@ -64,7 +64,7 @@ static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)

static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
{
- if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+ if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
return;

__blk_mq_tag_idle(hctx);
@@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
{
unsigned int depth, users;

- if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
+ if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
return true;
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return true;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0015a1892153..8f95cc443527 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1124,7 +1124,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_get_driver_tag(rq))
return false;

- if ((hctx->flags & BLK_MQ_F_TAG_SHARED) &&
+ if ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
!(rq->rq_flags & RQF_MQ_INFLIGHT)) {
rq->rq_flags |= RQF_MQ_INFLIGHT;
atomic_inc(&hctx->nr_active);
@@ -1168,7 +1168,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
wait_queue_entry_t *wait;
bool ret;

- if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
+ if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
blk_mq_sched_mark_restart_hctx(hctx);

/*
@@ -1420,7 +1420,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
bool needs_restart;
/* For non-shared tags, the RESTART check will suffice */
bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
- (hctx->flags & BLK_MQ_F_TAG_SHARED);
+ (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED);
bool no_budget_avail = prep == PREP_DISPATCH_NO_BUDGET;

blk_mq_release_budgets(q, nr_budgets);
@@ -2657,7 +2657,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
spin_lock_init(&hctx->lock);
INIT_LIST_HEAD(&hctx->dispatch);
hctx->queue = q;
- hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
+ hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;

INIT_LIST_HEAD(&hctx->hctx_list);

@@ -2874,9 +2874,9 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)

queue_for_each_hw_ctx(q, hctx, i) {
if (shared)
- hctx->flags |= BLK_MQ_F_TAG_SHARED;
+ hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
else
- hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+ hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
}
}

@@ -2902,7 +2902,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
list_del(&q->tag_set_list);
if (list_is_singular(&set->tag_list)) {
/* just transitioned to unshared */
- set->flags &= ~BLK_MQ_F_TAG_SHARED;
+ set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
/* update existing queue */
blk_mq_update_tag_set_depth(set, false);
}
@@ -2919,12 +2919,12 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
* Check to see if we're transitioning to shared (from 1 to 2 queues).
*/
if (!list_empty(&set->tag_list) &&
- !(set->flags & BLK_MQ_F_TAG_SHARED)) {
- set->flags |= BLK_MQ_F_TAG_SHARED;
+ !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+ set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
/* update existing queue */
blk_mq_update_tag_set_depth(set, true);
}
- if (set->flags & BLK_MQ_F_TAG_SHARED)
+ if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
queue_set_hctx_shared(q, true);
list_add_tail(&q->tag_set_list, &set->tag_list);

diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
index cc6a4e2587ae..77b0e0c62c36 100644
--- a/drivers/block/rnbd/rnbd-clt.c
+++ b/drivers/block/rnbd/rnbd-clt.c
@@ -1180,7 +1180,7 @@ static int setup_mq_tags(struct rnbd_clt_session *sess)
tag_set->queue_depth = sess->queue_depth;
tag_set->numa_node = NUMA_NO_NODE;
tag_set->flags = BLK_MQ_F_SHOULD_MERGE |
- BLK_MQ_F_TAG_SHARED;
+ BLK_MQ_F_TAG_QUEUE_SHARED;
tag_set->cmd_size = sizeof(struct rnbd_iu);
tag_set->nr_hw_queues = num_online_cpus();

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9d2d5ad367a4..95aca7aa8957 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -378,7 +378,7 @@ struct blk_mq_ops {

enum {
BLK_MQ_F_SHOULD_MERGE = 1 << 0,
- BLK_MQ_F_TAG_SHARED = 1 << 1,
+ BLK_MQ_F_TAG_QUEUE_SHARED = 1 << 1,
/*
* Set when this device requires underlying blk-mq device for
* completing IO:
--
2.26.2

2020-08-19 15:27:28

by John Garry

Subject: [PATCH v8 11/18] null_blk: Support shared tag bitmap

Support a shared tag bitmap, whereby request tags are unique over all
submission queues, and not just per submission queue.

As such, the total queue depth per device is normally hw_queue_depth *
submit_queues, but just hw_queue_depth when shared_tag_bitmap is set. A
similar story applies when shared_tags is also set, in which case that is
the queue depth over all null_blk devices.

Signed-off-by: John Garry <[email protected]>
---
drivers/block/null_blk_main.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 47a9dad880af..19c04e074880 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -164,6 +164,10 @@ static bool shared_tags;
module_param(shared_tags, bool, 0444);
MODULE_PARM_DESC(shared_tags, "Share tag set between devices for blk-mq");

+static bool g_shared_tag_bitmap;
+module_param_named(shared_tag_bitmap, g_shared_tag_bitmap, bool, 0444);
+MODULE_PARM_DESC(shared_tag_bitmap, "Use shared tag bitmap for all submission queues for blk-mq");
+
static int g_irqmode = NULL_IRQ_SOFTIRQ;

static int null_set_irqmode(const char *str, const struct kernel_param *kp)
@@ -1692,6 +1696,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
set->flags = BLK_MQ_F_SHOULD_MERGE;
if (g_no_sched)
set->flags |= BLK_MQ_F_NO_SCHED;
+ if (g_shared_tag_bitmap)
+ set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;
set->driver_data = NULL;

if ((nullb && nullb->dev->blocking) || g_blocking)
--
2.26.2

2020-08-19 15:27:39

by John Garry

Subject: [PATCH v8 07/18] blk-mq: Relocate hctx_may_queue()

blk-mq.h and blk-mq-tag.h include each other, which is less than ideal.

Relocate hctx_may_queue() to blk-mq.h, as it is not really tag-specific
code.

In this way, we can drop the blk-mq-tag.h include of blk-mq.h.
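The heart of the relocated function is its fair-share computation; lifted out for illustration (the `fair_share_depth()` name is mine, not the kernel's):

```c
/* The fair-share calculation from hctx_may_queue(): each active user
 * gets roughly an equal share of the bitmap depth, rounded up, with a
 * floor of 4 tags so no queue is starved entirely. */
static unsigned int fair_share_depth(unsigned int sb_depth,
				     unsigned int users)
{
	unsigned int depth = (sb_depth + users - 1) / users;

	return depth > 4U ? depth : 4U;
}
```

A queue is then allowed to allocate only while its nr_active count stays below this depth.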

Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-tag.h | 33 ---------------------------------
block/blk-mq.h | 32 ++++++++++++++++++++++++++++++++
2 files changed, 32 insertions(+), 33 deletions(-)

diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 8f4cce52ba2d..7d3e6b333a4a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -2,8 +2,6 @@
#ifndef INT_BLK_MQ_TAG_H
#define INT_BLK_MQ_TAG_H

-#include "blk-mq.h"
-
/*
* Tag address space map.
*/
@@ -81,37 +79,6 @@ static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
__blk_mq_tag_idle(hctx);
}

-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
- struct sbitmap_queue *bt)
-{
- unsigned int depth, users;
-
- if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
- return true;
- if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
- return true;
-
- /*
- * Don't try dividing an ant
- */
- if (bt->sb.depth == 1)
- return true;
-
- users = atomic_read(&hctx->tags->active_queues);
- if (!users)
- return true;
-
- /*
- * Allow at least some tags
- */
- depth = max((bt->sb.depth + users - 1) / users, 4U);
- return atomic_read(&hctx->nr_active) < depth;
-}
-
static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
unsigned int tag)
{
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3acdb56065e1..56dc37c21908 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -259,4 +259,36 @@ static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
return NULL;
}

+/*
+ * For shared tag users, we track the number of currently active users
+ * and attempt to provide a fair share of the tag depth for each of them.
+ */
+static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+ struct sbitmap_queue *bt)
+{
+ unsigned int depth, users;
+
+ if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+ return true;
+ if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+ return true;
+
+ /*
+ * Don't try dividing an ant
+ */
+ if (bt->sb.depth == 1)
+ return true;
+
+ users = atomic_read(&hctx->tags->active_queues);
+ if (!users)
+ return true;
+
+ /*
+ * Allow at least some tags
+ */
+ depth = max((bt->sb.depth + users - 1) / users, 4U);
+ return atomic_read(&hctx->nr_active) < depth;
+}
+
+
#endif
--
2.26.2

2020-08-19 15:27:48

by John Garry

Subject: [PATCH v8 09/18] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap

When using a shared sbitmap, the number of active request queues per hctx
should no longer be relied on when judging how to share the tag bitmap.

Instead, maintain the number of active request queues per tag_set, and
make the judgement based on that.

Tested-by: Don Brace <[email protected]> #SCSI resv cmds patches used
Originally-from: Kashyap Desai <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-tag.c | 33 +++++++++++++++++++++++++--------
block/blk-mq.c | 2 ++
block/blk-mq.h | 16 +++++++++++++---
include/linux/blk-mq.h | 1 +
include/linux/blkdev.h | 1 +
5 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c6d7ebc62bdb..c31c4a0478a5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -23,9 +23,18 @@
*/
bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
{
- if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
- !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
- atomic_inc(&hctx->tags->active_queues);
+ if (blk_mq_is_sbitmap_shared(hctx->flags)) {
+ struct request_queue *q = hctx->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
+
+ if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
+ !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+ atomic_inc(&set->active_queues_shared_sbitmap);
+ } else {
+ if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
+ !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+ atomic_inc(&hctx->tags->active_queues);
+ }

return true;
}
@@ -47,11 +56,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
{
struct blk_mq_tags *tags = hctx->tags;
-
- if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
- return;
-
- atomic_dec(&tags->active_queues);
+ struct request_queue *q = hctx->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
+
+ if (blk_mq_is_sbitmap_shared(hctx->flags)) {
+ if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
+ &q->queue_flags))
+ return;
+ atomic_dec(&set->active_queues_shared_sbitmap);
+ } else {
+ if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+ return;
+ atomic_dec(&tags->active_queues);
+ }

blk_mq_tag_wakeup_all(tags, false);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ebb72a59b433..457b43829a4f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3431,6 +3431,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
goto out_free_mq_map;

if (blk_mq_is_sbitmap_shared(set->flags)) {
+ atomic_set(&set->active_queues_shared_sbitmap, 0);
+
if (blk_mq_init_shared_sbitmap(set, set->flags)) {
ret = -ENOMEM;
goto out_free_mq_rq_maps;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 25ec73078e95..a52703c98b77 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -292,8 +292,6 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,

if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
return true;
- if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
- return true;

/*
* Don't try dividing an ant
@@ -301,7 +299,19 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
if (bt->sb.depth == 1)
return true;

- users = atomic_read(&hctx->tags->active_queues);
+ if (blk_mq_is_sbitmap_shared(hctx->flags)) {
+ struct request_queue *q = hctx->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
+
+ if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+ return true;
+ users = atomic_read(&set->active_queues_shared_sbitmap);
+ } else {
+ if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+ return true;
+ users = atomic_read(&hctx->tags->active_queues);
+ }
+
if (!users)
return true;

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 03b1f39d5a72..a4b35ec60faf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -252,6 +252,7 @@ struct blk_mq_tag_set {
unsigned int timeout;
unsigned int flags;
void *driver_data;
+ atomic_t active_queues_shared_sbitmap;

struct sbitmap_queue __bitmap_tags;
struct sbitmap_queue __breserved_tags;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1d85235611e1..67b39596436c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -617,6 +617,7 @@ struct request_queue {
#define QUEUE_FLAG_PCI_P2PDMA 25 /* device supports PCI p2p requests */
#define QUEUE_FLAG_ZONE_RESETALL 26 /* supports Zone Reset All */
#define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
+#define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */

#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP))
--
2.26.2

2020-08-19 15:27:55

by John Garry

[permalink] [raw]
Subject: [PATCH v8 04/18] blk-mq: Pass flags for tag init/free

Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.

For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).
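The alloc policy travels inside the flags word via the existing packing macros; the versions below mirror the ones in include/linux/blk-mq.h (bit positions as I understand them for this series, so check the header for the authoritative values):

```c
/* Packing/unpacking of the tag alloc policy inside tag_set flags. */
#define BLK_MQ_F_ALLOC_POLICY_START_BIT 8
#define BLK_MQ_F_ALLOC_POLICY_BITS 1
#define BLK_MQ_FLAG_TO_ALLOC_POLICY(flags) \
	(((flags) >> BLK_MQ_F_ALLOC_POLICY_START_BIT) & \
	 ((1 << BLK_MQ_F_ALLOC_POLICY_BITS) - 1))
#define BLK_ALLOC_POLICY_TO_MQ_FLAG(policy) \
	(((policy) & ((1 << BLK_MQ_F_ALLOC_POLICY_BITS) - 1)) << \
	 BLK_MQ_F_ALLOC_POLICY_START_BIT)

enum { BLK_TAG_ALLOC_FIFO = 0, BLK_TAG_ALLOC_RR = 1 };
```

So passing the whole flags word down lets blk_mq_init_tags() recover the policy itself, instead of every caller extracting it first.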

Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-sched.c | 11 ++++++++---
block/blk-mq-tag.c | 12 +++++++-----
block/blk-mq-tag.h | 7 ++++---
block/blk-mq.c | 23 +++++++++++++----------
block/blk-mq.h | 5 +++--
5 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a19cdf159b75..ee970d848e62 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -607,9 +607,11 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx,
unsigned int hctx_idx)
{
+ unsigned int flags = set->flags;
+
if (hctx->sched_tags) {
blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
- blk_mq_free_rq_map(hctx->sched_tags);
+ blk_mq_free_rq_map(hctx->sched_tags, flags);
hctx->sched_tags = NULL;
}
}
@@ -619,10 +621,11 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
unsigned int hctx_idx)
{
struct blk_mq_tag_set *set = q->tag_set;
+ unsigned int flags = set->flags;
int ret;

hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
- set->reserved_tags);
+ set->reserved_tags, flags);
if (!hctx->sched_tags)
return -ENOMEM;

@@ -640,8 +643,10 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q)
int i;

queue_for_each_hw_ctx(q, hctx, i) {
+ unsigned int flags = hctx->flags;
+
if (hctx->sched_tags) {
- blk_mq_free_rq_map(hctx->sched_tags);
+ blk_mq_free_rq_map(hctx->sched_tags, flags);
hctx->sched_tags = NULL;
}
}
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c2c9f369d777..0426f01e06a6 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -449,8 +449,9 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,

struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
unsigned int reserved_tags,
- int node, int alloc_policy)
+ int node, unsigned int flags)
{
+ int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(flags);
struct blk_mq_tags *tags;

if (total_tags > BLK_MQ_TAG_MAX) {
@@ -472,7 +473,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
return tags;
}

-void blk_mq_free_tags(struct blk_mq_tags *tags)
+void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags)
{
sbitmap_queue_free(&tags->bitmap_tags);
sbitmap_queue_free(&tags->breserved_tags);
@@ -494,6 +495,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
*/
if (tdepth > tags->nr_tags) {
struct blk_mq_tag_set *set = hctx->queue->tag_set;
+ unsigned int flags = set->flags;
struct blk_mq_tags *new;
bool ret;

@@ -508,17 +510,17 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
return -EINVAL;

new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth,
- tags->nr_reserved_tags);
+ tags->nr_reserved_tags, flags);
if (!new)
return -ENOMEM;
ret = blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth);
if (ret) {
- blk_mq_free_rq_map(new);
+ blk_mq_free_rq_map(new, flags);
return -ENOMEM;
}

blk_mq_free_rqs(set, *tagsptr, hctx->queue_num);
- blk_mq_free_rq_map(*tagsptr);
+ blk_mq_free_rq_map(*tagsptr, flags);
*tagsptr = new;
} else {
/*
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 918e2cee4f43..391f68d32fa1 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -21,9 +21,10 @@ struct blk_mq_tags {
struct list_head page_list;
};

-
-extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
-extern void blk_mq_free_tags(struct blk_mq_tags *tags);
+extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
+ unsigned int reserved_tags,
+ int node, unsigned int flags);
+extern void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags);

extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 50de9eb60466..376930061452 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2285,20 +2285,21 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
}
}

-void blk_mq_free_rq_map(struct blk_mq_tags *tags)
+void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags)
{
kfree(tags->rqs);
tags->rqs = NULL;
kfree(tags->static_rqs);
tags->static_rqs = NULL;

- blk_mq_free_tags(tags);
+ blk_mq_free_tags(tags, flags);
}

struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
unsigned int hctx_idx,
unsigned int nr_tags,
- unsigned int reserved_tags)
+ unsigned int reserved_tags,
+ unsigned int flags)
{
struct blk_mq_tags *tags;
int node;
@@ -2307,8 +2308,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
if (node == NUMA_NO_NODE)
node = set->numa_node;

- tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
- BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
+ tags = blk_mq_init_tags(nr_tags, reserved_tags, node, flags);
if (!tags)
return NULL;

@@ -2316,7 +2316,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
node);
if (!tags->rqs) {
- blk_mq_free_tags(tags);
+ blk_mq_free_tags(tags, flags);
return NULL;
}

@@ -2325,7 +2325,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
node);
if (!tags->static_rqs) {
kfree(tags->rqs);
- blk_mq_free_tags(tags);
+ blk_mq_free_tags(tags, flags);
return NULL;
}

@@ -2734,10 +2734,11 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set,
int hctx_idx)
{
+ unsigned int flags = set->flags;
int ret = 0;

set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
- set->queue_depth, set->reserved_tags);
+ set->queue_depth, set->reserved_tags, flags);
if (!set->tags[hctx_idx])
return false;

@@ -2746,7 +2747,7 @@ static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set,
if (!ret)
return true;

- blk_mq_free_rq_map(set->tags[hctx_idx]);
+ blk_mq_free_rq_map(set->tags[hctx_idx], flags);
set->tags[hctx_idx] = NULL;
return false;
}
@@ -2754,9 +2755,11 @@ static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set,
static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
unsigned int hctx_idx)
{
+ unsigned int flags = set->flags;
+
if (set->tags && set->tags[hctx_idx]) {
blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
- blk_mq_free_rq_map(set->tags[hctx_idx]);
+ blk_mq_free_rq_map(set->tags[hctx_idx], flags);
set->tags[hctx_idx] = NULL;
}
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 863a2f3346d4..b86e7fe47789 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -53,11 +53,12 @@ struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
*/
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx);
-void blk_mq_free_rq_map(struct blk_mq_tags *tags);
+void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags);
struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
unsigned int hctx_idx,
unsigned int nr_tags,
- unsigned int reserved_tags);
+ unsigned int reserved_tags,
+ unsigned int flags);
int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx, unsigned int depth);

--
2.26.2

2020-08-19 15:28:20

by John Garry

Subject: [PATCH v8 13/18] scsi: core: Show nr_hw_queues in sysfs

So that we don't show a value of 0 when Scsi_Host.nr_hw_queues is unset,
use tag_set->nr_hw_queues (which holds 1 in that case).

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/scsi_sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 163dbcb741c1..d6e344fa33ad 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -393,6 +393,16 @@ show_use_blk_mq(struct device *dev, struct device_attribute *attr, char *buf)
}
static DEVICE_ATTR(use_blk_mq, S_IRUGO, show_use_blk_mq, NULL);

+static ssize_t
+show_nr_hw_queues(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct Scsi_Host *shost = class_to_shost(dev);
+ struct blk_mq_tag_set *tag_set = &shost->tag_set;
+
+ return snprintf(buf, 20, "%d\n", tag_set->nr_hw_queues);
+}
+static DEVICE_ATTR(nr_hw_queues, S_IRUGO, show_nr_hw_queues, NULL);
+
static struct attribute *scsi_sysfs_shost_attrs[] = {
&dev_attr_use_blk_mq.attr,
&dev_attr_unique_id.attr,
@@ -411,6 +421,7 @@ static struct attribute *scsi_sysfs_shost_attrs[] = {
&dev_attr_prot_guard_type.attr,
&dev_attr_host_reset.attr,
&dev_attr_eh_deadline.attr,
+ &dev_attr_nr_hw_queues.attr,
NULL
};

--
2.26.2

2020-08-19 15:28:28

by John Garry

Subject: [PATCH v8 02/18] blk-mq: Rename blk_mq_update_tag_set_depth()

From: Hannes Reinecke <[email protected]>

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.

So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

Signed-off-by: Hannes Reinecke <[email protected]>
[jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8f95cc443527..50de9eb60466 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2880,8 +2880,8 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
}
}

-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
- bool shared)
+static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
+ bool shared)
{
struct request_queue *q;

@@ -2904,7 +2904,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
/* just transitioned to unshared */
set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
/* update existing queue */
- blk_mq_update_tag_set_depth(set, false);
+ blk_mq_update_tag_set_shared(set, false);
}
mutex_unlock(&set->tag_list_lock);
INIT_LIST_HEAD(&q->tag_set_list);
@@ -2922,7 +2922,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
!(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
/* update existing queue */
- blk_mq_update_tag_set_depth(set, true);
+ blk_mq_update_tag_set_shared(set, true);
}
if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
queue_set_hctx_shared(q, true);
--
2.26.2

2020-08-19 15:28:41

by John Garry

Subject: [PATCH v8 03/18] blk-mq: Free tags in blk_mq_init_tags() upon error

From: Hannes Reinecke <[email protected]>

Since the tags are allocated in blk_mq_init_tags(), it is better practice
to free them in that same function upon error, rather than in the callee
which initialises the bitmap tags (blk_mq_init_bitmap_tags()).

Signed-off-by: Hannes Reinecke <[email protected]>
[jpg: Split from an earlier patch with a new commit message]
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-tag.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 32d82e23b095..c2c9f369d777 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -429,24 +429,22 @@ static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
node);
}

-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
- int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+ int node, int alloc_policy)
{
unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;

if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
- goto free_tags;
+ return -ENOMEM;
if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
node))
goto free_bitmap_tags;

- return tags;
+ return 0;
free_bitmap_tags:
sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
- kfree(tags);
- return NULL;
+ return -ENOMEM;
}

struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -467,7 +465,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;

- return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+ if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+ kfree(tags);
+ return NULL;
+ }
+ return tags;
}

void blk_mq_free_tags(struct blk_mq_tags *tags)
--
2.26.2

2020-08-19 15:28:48

by John Garry

Subject: [PATCH v8 08/18] blk-mq: Record nr_active_requests per queue for when using shared sbitmap

The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue when using a shared sbitmap, as it does not
consider that the tags are shared over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.

Tested-by: Don Brace <[email protected]> #SCSI resv cmds patches used
Co-developed-with: Kashyap Desai <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
block/blk-core.c | 2 ++
block/blk-mq.c | 4 ++--
block/blk-mq.h | 26 ++++++++++++++++++++++++--
include/linux/blkdev.h | 2 ++
4 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d9d632639bd1..360975255a2a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -542,6 +542,8 @@ struct request_queue *blk_alloc_queue(int node_id)
q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->node = node_id;

+ atomic_set(&q->nr_active_requests_shared_sbitmap, 0);
+
timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
laptop_mode_timer_fn, 0);
timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a500d1dfa1bd..ebb72a59b433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -519,7 +519,7 @@ void blk_mq_free_request(struct request *rq)

ctx->rq_completed[rq_is_sync(rq)]++;
if (rq->rq_flags & RQF_MQ_INFLIGHT)
- atomic_dec(&hctx->nr_active);
+ __blk_mq_dec_active_requests(hctx);

if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
laptop_io_completion(q->backing_dev_info);
@@ -1127,7 +1127,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
if ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) &&
!(rq->rq_flags & RQF_MQ_INFLIGHT)) {
rq->rq_flags |= RQF_MQ_INFLIGHT;
- atomic_inc(&hctx->nr_active);
+ __blk_mq_inc_active_requests(hctx);
}
hctx->tags->rqs[rq->tag] = rq;
return true;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 56dc37c21908..25ec73078e95 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -199,6 +199,28 @@ static inline bool blk_mq_get_dispatch_budget(struct request_queue *q)
return true;
}

+static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
+{
+ if (blk_mq_is_sbitmap_shared(hctx->flags))
+ atomic_inc(&hctx->queue->nr_active_requests_shared_sbitmap);
+ else
+ atomic_inc(&hctx->nr_active);
+}
+
+static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
+{
+ if (blk_mq_is_sbitmap_shared(hctx->flags))
+ atomic_dec(&hctx->queue->nr_active_requests_shared_sbitmap);
+ else
+ atomic_dec(&hctx->nr_active);
+}
+
+static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx)
+{
+ if (blk_mq_is_sbitmap_shared(hctx->flags))
+ return atomic_read(&hctx->queue->nr_active_requests_shared_sbitmap);
+ return atomic_read(&hctx->nr_active);
+}
static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
@@ -207,7 +229,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,

if (rq->rq_flags & RQF_MQ_INFLIGHT) {
rq->rq_flags &= ~RQF_MQ_INFLIGHT;
- atomic_dec(&hctx->nr_active);
+ __blk_mq_dec_active_requests(hctx);
}
}

@@ -287,7 +309,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
* Allow at least some tags
*/
depth = max((bt->sb.depth + users - 1) / users, 4U);
- return atomic_read(&hctx->nr_active) < depth;
+ return __blk_mq_active_requests(hctx) < depth;
}


diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bb5636cc17b9..1d85235611e1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -484,6 +484,8 @@ struct request_queue {
struct timer_list timeout;
struct work_struct timeout_work;

+ atomic_t nr_active_requests_shared_sbitmap;
+
struct list_head icq_list;
#ifdef CONFIG_BLK_CGROUP
DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);
--
2.26.2

2020-08-19 15:28:59

by John Garry

Subject: [PATCH v8 18/18] smartpqi: enable host tagset

From: Hannes Reinecke <[email protected]>

Enable host tagset for smartpqi; with this we can use the request
tag to look up the command in the pool, avoiding the list iteration in
the hot path.

Signed-off-by: Hannes Reinecke <[email protected]>
[jpg: Mod ctrl_info->next_io_request_slot calc]
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/smartpqi/smartpqi_init.c | 45 ++++++++++++++++++---------
1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index bd38c8cea56e..870ed1400a9e 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -575,22 +575,33 @@ static inline void pqi_reinit_io_request(struct pqi_io_request *io_request)
}

static struct pqi_io_request *pqi_alloc_io_request(
- struct pqi_ctrl_info *ctrl_info)
+ struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
{
struct pqi_io_request *io_request;
- u16 i = ctrl_info->next_io_request_slot; /* benignly racy */
+ unsigned int limit = PQI_RESERVED_IO_SLOTS;
+ u16 i;

- while (1) {
+ if (scmd) {
+ u32 blk_tag = blk_mq_unique_tag(scmd->request);
+
+ i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
io_request = &ctrl_info->io_request_pool[i];
- if (atomic_inc_return(&io_request->refcount) == 1)
- break;
- atomic_dec(&io_request->refcount);
- i = (i + 1) % ctrl_info->max_io_slots;
+ if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
+ atomic_dec(&io_request->refcount);
+ return NULL;
+ }
+ } else {
+ i = ctrl_info->next_io_request_slot; /* benignly racy */
+ while (1) {
+ io_request = &ctrl_info->io_request_pool[i];
+ if (atomic_inc_return(&io_request->refcount) == 1)
+ break;
+ atomic_dec(&io_request->refcount);
+ i = (i + 1) % limit;
+ }
+ ctrl_info->next_io_request_slot = (i + 1) % limit;
}

- /* benignly racy */
- ctrl_info->next_io_request_slot = (i + 1) % ctrl_info->max_io_slots;
-
pqi_reinit_io_request(io_request);

return io_request;
@@ -4075,7 +4086,7 @@ static int pqi_submit_raid_request_synchronous(struct pqi_ctrl_info *ctrl_info,

atomic_inc(&ctrl_info->sync_cmds_outstanding);

- io_request = pqi_alloc_io_request(ctrl_info);
+ io_request = pqi_alloc_io_request(ctrl_info, NULL);

put_unaligned_le16(io_request->index,
&(((struct pqi_raid_path_request *)request)->request_id));
@@ -5032,7 +5043,9 @@ static inline int pqi_raid_submit_scsi_cmd(struct pqi_ctrl_info *ctrl_info,
{
struct pqi_io_request *io_request;

- io_request = pqi_alloc_io_request(ctrl_info);
+ io_request = pqi_alloc_io_request(ctrl_info, scmd);
+ if (!io_request)
+ return SCSI_MLQUEUE_HOST_BUSY;

return pqi_raid_submit_scsi_cmd_with_io_request(ctrl_info, io_request,
device, scmd, queue_group);
@@ -5230,7 +5243,10 @@ static int pqi_aio_submit_io(struct pqi_ctrl_info *ctrl_info,
struct pqi_io_request *io_request;
struct pqi_aio_path_request *request;

- io_request = pqi_alloc_io_request(ctrl_info);
+ io_request = pqi_alloc_io_request(ctrl_info, scmd);
+ if (!io_request)
+ return SCSI_MLQUEUE_HOST_BUSY;
+
io_request->io_complete_callback = pqi_aio_io_complete;
io_request->scmd = scmd;
io_request->raid_bypass = raid_bypass;
@@ -5657,7 +5673,7 @@ static int pqi_lun_reset(struct pqi_ctrl_info *ctrl_info,
DECLARE_COMPLETION_ONSTACK(wait);
struct pqi_task_management_request *request;

- io_request = pqi_alloc_io_request(ctrl_info);
+ io_request = pqi_alloc_io_request(ctrl_info, NULL);
io_request->io_complete_callback = pqi_lun_reset_complete;
io_request->context = &wait;

@@ -6504,6 +6520,7 @@ static struct scsi_host_template pqi_driver_template = {
.map_queues = pqi_map_queues,
.sdev_attrs = pqi_sdev_attrs,
.shost_attrs = pqi_shost_attrs,
+ .host_tagset = 1,
};

static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
--
2.26.2

2020-08-19 15:29:16

by John Garry

Subject: [PATCH v8 10/18] blk-mq, elevator: Count requests per hctx to improve performance

From: Kashyap Desai <[email protected]>

High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.

This is because the kblockd workqueue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.

The elevator .has_work callback for the mq-deadline and bfq schedulers
reports pending work if there are any IOs on the request queue, but it does
not account for the hctx context.

Add a per-hctx 'elevator_queued' count to avoid triggering the elevator
when there are no requests queued on that hctx.

Signed-off-by: Kashyap Desai <[email protected]>
Signed-off-by: Hannes Reinecke <[email protected]>
[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]
Signed-off-by: John Garry <[email protected]>
---
block/bfq-iosched.c | 5 +++++
block/blk-mq.c | 1 +
block/mq-deadline.c | 6 ++++++
include/linux/blk-mq.h | 4 ++++
4 files changed, 16 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 88f0dfa545d7..4650012f1e55 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,9 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
{
struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

+ if (!atomic_read(&hctx->elevator_queued))
+ return false;
+
/*
* Avoiding lock: a race on bfqd->busy_queues should cause at
* most a call to dispatch for nothing
@@ -5554,6 +5557,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
rq = list_first_entry(list, struct request, queuelist);
list_del_init(&rq->queuelist);
bfq_insert_request(hctx, rq, at_head);
+ atomic_inc(&hctx->elevator_queued);
}
}

@@ -5933,6 +5937,7 @@ static void bfq_finish_requeue_request(struct request *rq)

bfq_completed_request(bfqq, bfqd);
bfq_finish_requeue_request_body(bfqq);
+ atomic_dec(&rq->mq_hctx->elevator_queued);

spin_unlock_irqrestore(&bfqd->lock, flags);
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 457b43829a4f..361fb9fe1dc5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2649,6 +2649,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
goto free_hctx;

atomic_set(&hctx->nr_active, 0);
+ atomic_set(&hctx->elevator_queued, 0);
if (node == NUMA_NO_NODE)
node = set->numa_node;
hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e154c8..800ac902809b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -386,6 +386,8 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
spin_lock(&dd->lock);
rq = __dd_dispatch_request(dd);
spin_unlock(&dd->lock);
+ if (rq)
+ atomic_dec(&rq->mq_hctx->elevator_queued);

return rq;
}
@@ -533,6 +535,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
rq = list_first_entry(list, struct request, queuelist);
list_del_init(&rq->queuelist);
dd_insert_request(hctx, rq, at_head);
+ atomic_inc(&hctx->elevator_queued);
}
spin_unlock(&dd->lock);
}
@@ -579,6 +582,9 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
{
struct deadline_data *dd = hctx->queue->elevator->elevator_data;

+ if (!atomic_read(&hctx->elevator_queued))
+ return false;
+
return !list_empty_careful(&dd->dispatch) ||
!list_empty_careful(&dd->fifo_list[0]) ||
!list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a4b35ec60faf..2f3ba31a1658 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
* shared across request queues.
*/
atomic_t nr_active;
+ /**
+ * @elevator_queued: Number of queued requests on hctx.
+ */
+ atomic_t elevator_queued;

/** @cpuhp_online: List to store request if CPU is going to die */
struct hlist_node cpuhp_online;
--
2.26.2

2020-08-19 15:30:21

by John Garry

Subject: [PATCH v8 15/18] scsi: scsi_debug: Support host tagset

When host_max_queue is set (> 0), set the Scsi_Host.host_tagset such that
blk-mq will use a hostwide tagset over all SCSI host submission queues.

This means that we may expose all submission queues and always use the hwq
chosen by blk-mq.

And since sdebug_max_queue is fixed to the same value when
sdebug_host_max_queue is set, we can simplify how
sdebug_driver_template.can_queue is set.
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/scsi_debug.c | 28 ++++++++++------------------
1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 064ed680c053..4238baa2b34b 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -4698,19 +4698,14 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
static struct sdebug_queue *get_queue(struct scsi_cmnd *cmnd)
{
u16 hwq;
+ u32 tag = blk_mq_unique_tag(cmnd->request);

- if (sdebug_host_max_queue) {
- /* Provide a simple method to choose the hwq */
- hwq = smp_processor_id() % submit_queues;
- } else {
- u32 tag = blk_mq_unique_tag(cmnd->request);
+ hwq = blk_mq_unique_tag_to_hwq(tag);

- hwq = blk_mq_unique_tag_to_hwq(tag);
+ pr_debug("tag=%#x, hwq=%d\n", tag, hwq);
+ if (WARN_ON_ONCE(hwq >= submit_queues))
+ hwq = 0;

- pr_debug("tag=%#x, hwq=%d\n", tag, hwq);
- if (WARN_ON_ONCE(hwq >= submit_queues))
- hwq = 0;
- }
return sdebug_q_arr + hwq;
}

@@ -7347,10 +7342,7 @@ static int sdebug_driver_probe(struct device *dev)

sdbg_host = to_sdebug_host(dev);

- if (sdebug_host_max_queue)
- sdebug_driver_template.can_queue = sdebug_host_max_queue;
- else
- sdebug_driver_template.can_queue = sdebug_max_queue;
+ sdebug_driver_template.can_queue = sdebug_max_queue;
if (!sdebug_clustering)
sdebug_driver_template.dma_boundary = PAGE_SIZE - 1;

@@ -7367,11 +7359,11 @@ static int sdebug_driver_probe(struct device *dev)
}
/*
* Decide whether to tell scsi subsystem that we want mq. The
- * following should give the same answer for each host. If the host
- * has a limit of hostwide max commands, then do not set.
+ * following should give the same answer for each host.
*/
- if (!sdebug_host_max_queue)
- hpnt->nr_hw_queues = submit_queues;
+ hpnt->nr_hw_queues = submit_queues;
+ if (sdebug_host_max_queue)
+ hpnt->host_tagset = 1;

sdbg_host->shost = hpnt;
*((struct sdebug_host_info **)hpnt->hostdata) = sdbg_host;
--
2.26.2

2020-08-19 15:30:20

by John Garry

Subject: [PATCH v8 06/18] blk-mq: Facilitate a shared sbitmap per tagset

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However, another problem is that blk-mq assumes the host may accept
(Scsi_Host.can_queue * #hw queues) commands. In commit 6eb045e092ef ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed; it had prevented the LLDD from being sent more than
.can_queue commands. It should still be ensured that the block layer does
not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
is still allocated per hctx; the reason for this is that if tags and
requests were allocated for only a single hctx - like hctx0 - it may break
block drivers which expect a request to be associated with a specific hctx,
i.e. not always hctx0. This does introduce extra memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/[email protected]/
[1] https://lore.kernel.org/linux-block/[email protected]/
[2] https://lore.kernel.org/linux-block/[email protected]/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Tested-by: Don Brace <[email protected]> #SCSI resv cmds patches used
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-sched.c | 8 ++++---
block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++----
block/blk-mq-tag.h | 7 ++++++
block/blk-mq.c | 15 +++++++++++++
block/blk-mq.h | 5 +++++
include/linux/blk-mq.h | 6 +++++
6 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ee970d848e62..777be967c4f2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -607,7 +607,7 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx,
unsigned int hctx_idx)
{
- unsigned int flags = set->flags;
+ unsigned int flags = set->flags & ~BLK_MQ_F_TAG_HCTX_SHARED;

if (hctx->sched_tags) {
blk_mq_free_rqs(set, hctx->sched_tags, hctx_idx);
@@ -621,7 +621,8 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
unsigned int hctx_idx)
{
struct blk_mq_tag_set *set = q->tag_set;
- unsigned int flags = set->flags;
+ /* Clear HCTX_SHARED so tags are init'ed */
+ unsigned int flags = set->flags & ~BLK_MQ_F_TAG_HCTX_SHARED;
int ret;

hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
@@ -643,7 +644,8 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q)
int i;

queue_for_each_hw_ctx(q, hctx, i) {
- unsigned int flags = hctx->flags;
+ /* Clear HCTX_SHARED so tags are freed */
+ unsigned int flags = hctx->flags & ~BLK_MQ_F_TAG_HCTX_SHARED;

if (hctx->sched_tags) {
blk_mq_free_rq_map(hctx->sched_tags, flags);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 77fbeae05745..c6d7ebc62bdb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -197,7 +197,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
* We can hit rq == NULL here, because the tagging functions
* test and set the bit before assigning ->rqs[].
*/
- if (rq && rq->q == hctx->queue)
+ if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
return iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
}
@@ -450,6 +450,38 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
return -ENOMEM;
}

+int blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int flags)
+{
+ unsigned int depth = set->queue_depth - set->reserved_tags;
+ int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags);
+ bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
+ int i, node = set->numa_node;
+
+ if (bt_alloc(&set->__bitmap_tags, depth, round_robin, node))
+ return -ENOMEM;
+ if (bt_alloc(&set->__breserved_tags, set->reserved_tags,
+ round_robin, node))
+ goto free_bitmap_tags;
+
+ for (i = 0; i < set->nr_hw_queues; i++) {
+ struct blk_mq_tags *tags = set->tags[i];
+
+ tags->bitmap_tags = &set->__bitmap_tags;
+ tags->breserved_tags = &set->__breserved_tags;
+ }
+
+ return 0;
+free_bitmap_tags:
+ sbitmap_queue_free(&set->__bitmap_tags);
+ return -ENOMEM;
+}
+
+void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *set)
+{
+ sbitmap_queue_free(&set->__bitmap_tags);
+ sbitmap_queue_free(&set->__breserved_tags);
+}
+
struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
unsigned int reserved_tags,
int node, unsigned int flags)
@@ -469,6 +501,9 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;

+ if (flags & BLK_MQ_F_TAG_HCTX_SHARED)
+ return tags;
+
if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
kfree(tags);
return NULL;
@@ -478,8 +513,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,

void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags)
{
- sbitmap_queue_free(&tags->__bitmap_tags);
- sbitmap_queue_free(&tags->__breserved_tags);
+ if (!(flags & BLK_MQ_F_TAG_HCTX_SHARED)) {
+ sbitmap_queue_free(tags->bitmap_tags);
+ sbitmap_queue_free(tags->breserved_tags);
+ }
kfree(tags);
}

@@ -498,7 +535,8 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
*/
if (tdepth > tags->nr_tags) {
struct blk_mq_tag_set *set = hctx->queue->tag_set;
- unsigned int flags = set->flags;
+ /* Only sched tags can grow, so clear HCTX_SHARED flag */
+ unsigned int flags = set->flags & ~BLK_MQ_F_TAG_HCTX_SHARED;
struct blk_mq_tags *new;
bool ret;

@@ -537,6 +575,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
return 0;
}

+void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
+{
+ sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
+}
+
/**
* blk_mq_unique_tag() - return a tag that is unique queue-wide
* @rq: request for which to compute a unique tag
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 5b6f25bd1966..8f4cce52ba2d 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -29,12 +29,19 @@ extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
int node, unsigned int flags);
extern void blk_mq_free_tags(struct blk_mq_tags *tags, unsigned int flags);

+extern int blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *set,
+ unsigned int flags);
+extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *set);
+
extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
unsigned int tag);
extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
struct blk_mq_tags **tags,
unsigned int depth, bool can_grow);
+extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
+ unsigned int size);
+
extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce712ec818e4..a500d1dfa1bd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3430,11 +3430,21 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
if (ret)
goto out_free_mq_map;

+ if (blk_mq_is_sbitmap_shared(set->flags)) {
+ if (blk_mq_init_shared_sbitmap(set, set->flags)) {
+ ret = -ENOMEM;
+ goto out_free_mq_rq_maps;
+ }
+ }
+
mutex_init(&set->tag_list_lock);
INIT_LIST_HEAD(&set->tag_list);

return 0;

+out_free_mq_rq_maps:
+ for (i = 0; i < set->nr_hw_queues; i++)
+ blk_mq_free_map_and_requests(set, i);
out_free_mq_map:
for (i = 0; i < set->nr_maps; i++) {
kfree(set->map[i].mq_map);
@@ -3453,6 +3463,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
for (i = 0; i < set->nr_hw_queues; i++)
blk_mq_free_map_and_requests(set, i);

+ if (blk_mq_is_sbitmap_shared(set->flags))
+ blk_mq_exit_shared_sbitmap(set);
+
for (j = 0; j < set->nr_maps; j++) {
kfree(set->map[j].mq_map);
set->map[j].mq_map = NULL;
@@ -3489,6 +3502,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
if (!hctx->sched_tags) {
ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
false);
+ if (!ret && blk_mq_is_sbitmap_shared(set->flags))
+ blk_mq_tag_resize_shared_sbitmap(set, nr);
} else {
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
nr, true);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b86e7fe47789..3acdb56065e1 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -159,6 +159,11 @@ struct blk_mq_alloc_data {
struct blk_mq_hw_ctx *hctx;
};

+static inline bool blk_mq_is_sbitmap_shared(unsigned int flags)
+{
+ return flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
{
if (data->q->elevator)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95aca7aa8957..03b1f39d5a72 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -231,6 +231,9 @@ enum hctx_type {
* @flags: Zero or more BLK_MQ_F_* flags.
* @driver_data: Pointer to data owned by the block driver that created this
* tag set.
+ * @__bitmap_tags: A shared tags sbitmap, used over all hctx's
+ * @__breserved_tags:
+ * A shared reserved tags sbitmap, used over all hctx's
* @tags: Tag sets. One tag set per hardware queue. Has @nr_hw_queues
* elements.
* @tag_list_lock: Serializes tag_list accesses.
@@ -250,6 +253,8 @@ struct blk_mq_tag_set {
unsigned int flags;
void *driver_data;

+ struct sbitmap_queue __bitmap_tags;
+ struct sbitmap_queue __breserved_tags;
struct blk_mq_tags **tags;

struct mutex tag_list_lock;
@@ -384,6 +389,7 @@ enum {
* completing IO:
*/
BLK_MQ_F_STACKING = 1 << 2,
+ BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
BLK_MQ_F_BLOCKING = 1 << 5,
BLK_MQ_F_NO_SCHED = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
--
2.26.2

2020-08-19 15:31:32

by John Garry

Subject: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

From: Kashyap Desai <[email protected]>

Fusion adapters can steer completions to individual queues, and
we now have support for shared host-wide tags.
So we can enable multiqueue support for fusion adapters.

Once the driver enables shared host-wide tags, the CPU hotplug feature
is also supported, as enabled by commit bf0beec0607d ("blk-mq: drain
I/O when all CPUs in a hctx are offline").

Currently the driver provides a "host_tagset_enable" module parameter
to disable host-wide tags.

Once we do not have any major performance regression using host-wide
tags, we will drop the hand-crafted interrupt affinity settings.

Performance also meets expectations (tested with both the none and
mq-deadline schedulers):
24 SSD drives on Aero with/without this patch can get 3.1M IOPS.
3 VDs consisting of 8 SAS SSDs each on Aero with/without this patch can
get 3.1M IOPS.

Signed-off-by: Kashyap Desai <[email protected]>
Signed-off-by: Hannes Reinecke <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/megaraid/megaraid_sas_base.c | 39 +++++++++++++++++++++
drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++++++++-------
2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 861f7140f52e..6960922d0d7f 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -37,6 +37,7 @@
#include <linux/poll.h>
#include <linux/vmalloc.h>
#include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
@@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
module_param(enable_sdev_max_qd, int, 0444);
MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue. Default: 0");

+int host_tagset_enable = 1;
+module_param(host_tagset_enable, int, 0444);
+MODULE_PARM_DESC(host_tagset_enable, "Shared host tagset enable/disable Default: enable(1)");
+
MODULE_LICENSE("GPL");
MODULE_VERSION(MEGASAS_VERSION);
MODULE_AUTHOR("[email protected]");
@@ -3119,6 +3124,19 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
return 0;
}

+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+ struct megasas_instance *instance;
+
+ instance = (struct megasas_instance *)shost->hostdata;
+
+ if (shost->nr_hw_queues == 1)
+ return 0;
+
+ return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+ instance->pdev, instance->low_latency_index_start);
+}
+
static void megasas_aen_polling(struct work_struct *work);

/**
@@ -3427,6 +3445,7 @@ static struct scsi_host_template megasas_template = {
.eh_timed_out = megasas_reset_timer,
.shost_attrs = megaraid_host_attrs,
.bios_param = megasas_bios_param,
+ .map_queues = megasas_map_queues,
.change_queue_depth = scsi_change_queue_depth,
.max_segment_size = 0xffffffff,
};
@@ -6808,6 +6827,26 @@ static int megasas_io_attach(struct megasas_instance *instance)
host->max_lun = MEGASAS_MAX_LUN;
host->max_cmd_len = 16;

+ /* Use shared host tagset only for fusion adaptors
+ * if there are managed interrupts (smp affinity enabled case).
+ * Single msix_vectors in kdump, so shared host tag is also disabled.
+ */
+
+ host->host_tagset = 0;
+ host->nr_hw_queues = 1;
+
+ if ((instance->adapter_type != MFI_SERIES) &&
+ (instance->msix_vectors > instance->low_latency_index_start) &&
+ host_tagset_enable &&
+ instance->smp_affinity_enable) {
+ host->host_tagset = 1;
+ host->nr_hw_queues = instance->msix_vectors -
+ instance->low_latency_index_start;
+ }
+
+ dev_info(&instance->pdev->dev,
+ "Max firmware commands: %d shared with nr_hw_queues = %d\n",
+ instance->max_fw_cmds, host->nr_hw_queues);
/*
* Notify the mid-layer about the new controller
*/
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 0824410f78f8..a4251121f173 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -359,24 +359,29 @@ megasas_get_msix_index(struct megasas_instance *instance,
{
int sdev_busy;

- /* nr_hw_queue = 1 for MegaRAID */
- struct blk_mq_hw_ctx *hctx =
- scmd->device->request_queue->queue_hw_ctx[0];
-
- sdev_busy = atomic_read(&hctx->nr_active);
+ /* TBD - if sml remove device_busy in future, driver
+ * should track counter in internal structure.
+ */
+ sdev_busy = atomic_read(&scmd->device->device_busy);

if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
- sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+ sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
cmd->request_desc->SCSIIO.MSIxIndex =
mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
- else if (instance->msix_load_balance)
+ } else if (instance->msix_load_balance) {
cmd->request_desc->SCSIIO.MSIxIndex =
(mega_mod64(atomic64_add_return(1, &instance->total_io_count),
instance->msix_vectors));
- else
+ } else if (instance->host->nr_hw_queues > 1) {
+ u32 tag = blk_mq_unique_tag(scmd->request);
+
+ cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) +
+ instance->low_latency_index_start;
+ } else {
cmd->request_desc->SCSIIO.MSIxIndex =
instance->reply_map[raw_smp_processor_id()];
+ }
}

/**
@@ -956,9 +961,6 @@ megasas_alloc_cmds_fusion(struct megasas_instance *instance)
if (megasas_alloc_cmdlist_fusion(instance))
goto fail_exit;

- dev_info(&instance->pdev->dev, "Configured max firmware commands: %d\n",
- instance->max_fw_cmds);
-
/* The first 256 bytes (SMID 0) is not used. Don't add to the cmd list */
io_req_base = fusion->io_request_frames + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
io_req_base_phys = fusion->io_request_frames_phys + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
@@ -1102,8 +1104,9 @@ megasas_ioc_init_fusion(struct megasas_instance *instance)
MR_HIGH_IOPS_QUEUE_COUNT) && cur_intr_coalescing)
instance->perf_mode = MR_BALANCED_PERF_MODE;

- dev_info(&instance->pdev->dev, "Performance mode :%s\n",
- MEGASAS_PERF_MODE_2STR(instance->perf_mode));
+ dev_info(&instance->pdev->dev, "Performance mode :%s (latency index = %d)\n",
+ MEGASAS_PERF_MODE_2STR(instance->perf_mode),
+ instance->low_latency_index_start);

instance->fw_sync_cache_support = (scratch_pad_1 &
MR_CAN_HANDLE_SYNC_CACHE_OFFSET) ? 1 : 0;
--
2.26.2

2020-08-27 08:59:51

by John Garry

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

Hi Jens,

I was wondering if you could kindly consider the block changes in this
series, since I have now dropped the RFC flag?

I guess patch 5/18 (using pointers to bitmaps) would be of first
concern. We did discuss this previously, and I think what we're doing
now could be considered satisfactory.

Thanks,
john

>
> Here is v8 of the patchset.
>
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues.
> We also have a change to count requests per-hctx for when an elevator is
> enabled, as an optimisation. I also dropped the debugfs changes - more on
> that below.
>
> Some performance figures:
>
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
>
> Tag depth 4000 (default) 260**
>
> Baseline (v5.9-rc1):
> none sched: 2094K IOPS 513K
> mq-deadline sched: 2145K IOPS 1336K
>
> Final, host_tagset=0 in LLDD *, ***:
> none sched: 2120K IOPS 550K
> mq-deadline sched: 2121K IOPS 1309K
>
> Final ***:
> none sched: 2132K IOPS 1185K
> mq-deadline sched: 2145K IOPS 2097K
>
> * this is relevant as this is the performance in supporting but not
> enabling the feature
> ** depth=260 is relevant as a point where we are regularly waiting for
> tags to be available. Figures were a bit unstable here.
> *** Included "[PATCH V4] scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy"
>
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8
>
> The hpsa patch depends on:
> https://lore.kernel.org/linux-scsi/[email protected]/
>
> And the smartpqi patch is not to be accepted.
>
> Comments (and testing) welcome, thanks!
>
> Differences to v7:
> - Add null_blk and scsi_debug support
> - Drop debugfs tags patch - it's too difficult to be the same between
> hostwide and non-hostwide, as discussed:
> https://lore.kernel.org/linux-scsi/[email protected]/T/#mb3eb462d8be40273718505989abd12f8228c15fd
> And from commit 6bf0eb550452 ("sbitmap: Consider cleared bits in
> sbitmap_bitmap_show()"), I guess not many used this anyway...
> - Add elevator per-hctx request count for optimisation
> - Break up "blk-mq: rename blk_mq_update_tag_set_depth()" into 2x patches
> - Pass flags to avoid per-hw-queue tags init/free for hostwide tags
> - Add Don's Reviewed-by and Tested-by tags to appropriate patches
> - (@Don, please let me know if issue with how I did this)
> - Add "scsi: core: Show nr_hw_queues in sysfs"
> - Rework megaraid SAS patch to have module param (Kashyap)
> - rebase
>
> V7 is here for more info:
> https://lore.kernel.org/linux-scsi/[email protected]/T/#t
>
> Hannes Reinecke (5):
> blk-mq: Rename blk_mq_update_tag_set_depth()
> blk-mq: Free tags in blk_mq_init_tags() upon error
> scsi: Add host and host template flag 'host_tagset'
> hpsa: enable host_tagset and switch to MQ
> smartpqi: enable host tagset
>
> John Garry (10):
> blk-mq: Pass flags for tag init/free
> blk-mq: Use pointers for blk_mq_tags bitmap tags
> blk-mq: Facilitate a shared sbitmap per tagset
> blk-mq: Relocate hctx_may_queue()
> blk-mq: Record nr_active_requests per queue for when using shared
> sbitmap
> blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
> shared sbitmap
> null_blk: Support shared tag bitmap
> scsi: core: Show nr_hw_queues in sysfs
> scsi: hisi_sas: Switch v3 hw to MQ
> scsi: scsi_debug: Support host tagset
>
> Kashyap Desai (2):
> blk-mq, elevator: Count requests per hctx to improve performance
> scsi: megaraid_sas: Added support for shared host tagset for
> cpuhotplug
>
> Ming Lei (1):
> blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
>
> block/bfq-iosched.c | 9 +-
> block/blk-core.c | 2 +
> block/blk-mq-debugfs.c | 10 +-
> block/blk-mq-sched.c | 13 +-
> block/blk-mq-tag.c | 149 ++++++++++++++------
> block/blk-mq-tag.h | 56 +++-----
> block/blk-mq.c | 81 +++++++----
> block/blk-mq.h | 76 +++++++++-
> block/kyber-iosched.c | 4 +-
> block/mq-deadline.c | 6 +
> drivers/block/null_blk_main.c | 6 +
> drivers/block/rnbd/rnbd-clt.c | 2 +-
> drivers/scsi/hisi_sas/hisi_sas.h | 3 +-
> drivers/scsi/hisi_sas/hisi_sas_main.c | 36 ++---
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++-------
> drivers/scsi/hosts.c | 1 +
> drivers/scsi/hpsa.c | 44 +-----
> drivers/scsi/hpsa.h | 1 -
> drivers/scsi/megaraid/megaraid_sas_base.c | 39 +++++
> drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++--
> drivers/scsi/scsi_debug.c | 28 ++--
> drivers/scsi/scsi_lib.c | 2 +
> drivers/scsi/scsi_sysfs.c | 11 ++
> drivers/scsi/smartpqi/smartpqi_init.c | 45 ++++--
> include/linux/blk-mq.h | 13 +-
> include/linux/blkdev.h | 3 +
> include/scsi/scsi_host.h | 9 +-
> 27 files changed, 484 insertions(+), 281 deletions(-)
>

2020-09-03 19:32:06

by Douglas Gilbert

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 2020-08-19 11:20 a.m., John Garry wrote:
> Hi all,
>
> Here is v8 of the patchset.
>
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues.
> We also have a change to count requests per-hctx for when an elevator is
> enabled, as an optimisation. I also dropped the debugfs changes - more on
> that below.
>
> Some performance figures:
>
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
>
> Tag depth 4000 (default) 260**
>
> Baseline (v5.9-rc1):
> none sched: 2094K IOPS 513K
> mq-deadline sched: 2145K IOPS 1336K
>
> Final, host_tagset=0 in LLDD *, ***:
> none sched: 2120K IOPS 550K
> mq-deadline sched: 2121K IOPS 1309K
>
> Final ***:
> none sched: 2132K IOPS 1185K
> mq-deadline sched: 2145K IOPS 2097K
>
> * this is relevant as this is the performance in supporting but not
> enabling the feature
> ** depth=260 is relevant as a point where we are regularly waiting for
> tags to be available. Figures were a bit unstable here.
> *** Included "[PATCH V4] scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy"
>
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8
>
> The hpsa patch depends on:
> https://lore.kernel.org/linux-scsi/[email protected]/
>
> And the smartpqi patch is not to be accepted.
>
> Comments (and testing) welcome, thanks!

I tested this v8 patchset on MKP's 5.10/scsi-queue branch together with
my rewritten sg driver on my laptop and a Ryzen 5 3600 machine. Since I
don't have the same hardware, I used the scsi_debug driver as the target:

modprobe scsi_debug dev_size_mb=1024 sector_size=512 add_host=7
per_host_store=1 ndelay=1000 random=1 submit_queues=12

My test is a script which runs these three commands many times with
differing parameters:

sg_mrq_dd iflag=random bs=512 of=/dev/sg8 thr=64 time=2
time to transfer data was 0.312705 secs, 3433.72 MB/sec
2097152+0 records in
2097152+0 records out

sg_mrq_dd bpt=256 thr=64 mrq=36 time=2 if=/dev/sg8 bs=512 of=/dev/sg9
time to transfer data was 0.212090 secs, 5062.67 MB/sec
2097152+0 records in
2097152+0 records out

sg_mrq_dd --verify if=/dev/sg8 of=/dev/sg9 bs=512 bpt=256 thr=64 mrq=36 time=2
Doing verify/cmp rather than copy
time to transfer data was 0.184563 secs, 5817.75 MB/sec
2097152+0 records in
2097152+0 records verified

The above is the output from last section of the my script run on the Ryzen 5.

So the three steps are:
1) produce random data on /dev/sg8
2) copy /dev/sg8 to /dev/sg9
3) verify /dev/sg8 and /dev/sg9 are the same.

The latter step is done with a sequence of READ (on /dev/sg8) and
VERIFY (BYTCHK=1, on /dev/sg9) commands. The "mrq" stands for multiple
requests in one invocation; the bsg driver did that before its write(2)
command was removed.
The SCSI devices on the Ryzen 5 machine are:

# lsscsi -gs
[2:0:0:0] disk IBM-207x HUSMM8020ASS20 J4B6 /dev/sda /dev/sg0 200GB
[2:0:1:0] disk SEAGATE ST200FM0073 0007 /dev/sdb /dev/sg1 200GB
[2:0:2:0] enclosu Areca Te ARC-802801.37.69 0137 - /dev/sg2 -
[3:0:0:0] disk Linux scsi_debug 0190 /dev/sdc /dev/sg3 1.07GB
[4:0:0:0] disk Linux scsi_debug 0190 /dev/sdd /dev/sg4 1.07GB
[5:0:0:0] disk Linux scsi_debug 0190 /dev/sde /dev/sg5 1.07GB
[6:0:0:0] disk Linux scsi_debug 0190 /dev/sdf /dev/sg6 1.07GB
[7:0:0:0] disk Linux scsi_debug 0190 /dev/sdg /dev/sg7 1.07GB
[8:0:0:0] disk Linux scsi_debug 0190 /dev/sdh /dev/sg8 1.07GB
[9:0:0:0] disk Linux scsi_debug 0190 /dev/sdi /dev/sg9 1.07GB
[N:0:1:1] disk WDC WDS250G2B0C-00PXH0__1 /dev/nvme0n1 - 250GB

My script took 17m12s, and the highest throughput (on a copy) was 7.5 GB/sec.
Then I reloaded the scsi_debug module, this time with an additional
'host_max_queue=128' parameter. The script run time was 5 seconds shorter
and the maximum throughput was around 7.6 GB/sec. [Average throughput is
around 4 GB/sec.]

For comparison:

# time liburing/examples/io_uring-cp /dev/sdh /dev/sdi
real 0m1.542s
user 0m0.004s
sys 0m1.027s

Umm, that's less than 1 GB/sec. In its defence, io_uring-cp is an
extremely simple, single-threaded, proof-of-concept copy program, at
least compared to sg_mrq_dd. As used by sg_mrq_dd, the rewritten sg
driver avoids moving 1 GB to and from user space while doing the above
copy and verify steps.

So:

Tested-by: Douglas Gilbert <[email protected]>

> Differences to v7:
> - Add null_blk and scsi_debug support
> - Drop debugfs tags patch - it's too difficult to be the same between
> hostwide and non-hostwide, as discussed:
> https://lore.kernel.org/linux-scsi/[email protected]/T/#mb3eb462d8be40273718505989abd12f8228c15fd
> And from commit 6bf0eb550452 ("sbitmap: Consider cleared bits in
> sbitmap_bitmap_show()"), I guess not many used this anyway...
> - Add elevator per-hctx request count for optimisation
> - Break up "blk-mq: rename blk_mq_update_tag_set_depth()" into 2x patches
> - Pass flags to avoid per-hw-queue tags init/free for hostwide tags
> - Add Don's Reviewed-by and Tested-by tags to appropriate patches
> - (@Don, please let me know if issue with how I did this)
> - Add "scsi: core: Show nr_hw_queues in sysfs"
> - Rework megaraid SAS patch to have module param (Kashyap)
> - rebase
>
> V7 is here for more info:
> https://lore.kernel.org/linux-scsi/[email protected]/T/#t
>
> Hannes Reinecke (5):
> blk-mq: Rename blk_mq_update_tag_set_depth()
> blk-mq: Free tags in blk_mq_init_tags() upon error
> scsi: Add host and host template flag 'host_tagset'
> hpsa: enable host_tagset and switch to MQ
> smartpqi: enable host tagset
>
> John Garry (10):
> blk-mq: Pass flags for tag init/free
> blk-mq: Use pointers for blk_mq_tags bitmap tags
> blk-mq: Facilitate a shared sbitmap per tagset
> blk-mq: Relocate hctx_may_queue()
> blk-mq: Record nr_active_requests per queue for when using shared
> sbitmap
> blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
> shared sbitmap
> null_blk: Support shared tag bitmap
> scsi: core: Show nr_hw_queues in sysfs
> scsi: hisi_sas: Switch v3 hw to MQ
> scsi: scsi_debug: Support host tagset
>
> Kashyap Desai (2):
> blk-mq, elevator: Count requests per hctx to improve performance
> scsi: megaraid_sas: Added support for shared host tagset for
> cpuhotplug
>
> Ming Lei (1):
> blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
>
> block/bfq-iosched.c | 9 +-
> block/blk-core.c | 2 +
> block/blk-mq-debugfs.c | 10 +-
> block/blk-mq-sched.c | 13 +-
> block/blk-mq-tag.c | 149 ++++++++++++++------
> block/blk-mq-tag.h | 56 +++-----
> block/blk-mq.c | 81 +++++++----
> block/blk-mq.h | 76 +++++++++-
> block/kyber-iosched.c | 4 +-
> block/mq-deadline.c | 6 +
> drivers/block/null_blk_main.c | 6 +
> drivers/block/rnbd/rnbd-clt.c | 2 +-
> drivers/scsi/hisi_sas/hisi_sas.h | 3 +-
> drivers/scsi/hisi_sas/hisi_sas_main.c | 36 ++---
> drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++-------
> drivers/scsi/hosts.c | 1 +
> drivers/scsi/hpsa.c | 44 +-----
> drivers/scsi/hpsa.h | 1 -
> drivers/scsi/megaraid/megaraid_sas_base.c | 39 +++++
> drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++--
> drivers/scsi/scsi_debug.c | 28 ++--
> drivers/scsi/scsi_lib.c | 2 +
> drivers/scsi/scsi_sysfs.c | 11 ++
> drivers/scsi/smartpqi/smartpqi_init.c | 45 ++++--
> include/linux/blk-mq.h | 13 +-
> include/linux/blkdev.h | 3 +
> include/scsi/scsi_host.h | 9 +-
> 27 files changed, 484 insertions(+), 281 deletions(-)
>

2020-09-03 21:25:35

by Jens Axboe

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 8/19/20 9:20 AM, John Garry wrote:
> Hi all,
>
> Here is v8 of the patchset.
>
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues.
> We also have a change to count requests per-hctx for when an elevator is
> enabled, as an optimisation. I also dropped the debugfs changes - more on
> that below.
>
> Some performance figures:
>
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
>
> Tag depth 4000 (default) 260**
>
> Baseline (v5.9-rc1):
> none sched: 2094K IOPS 513K
> mq-deadline sched: 2145K IOPS 1336K
>
> Final, host_tagset=0 in LLDD *, ***:
> none sched: 2120K IOPS 550K
> mq-deadline sched: 2121K IOPS 1309K
>
> Final ***:
> none sched: 2132K IOPS 1185K
> mq-deadline sched: 2145K IOPS 2097K
>
> * this is relevant as this is the performance in supporting but not
> enabling the feature
> ** depth=260 is relevant as a point where we are regularly waiting for
> tags to be available. Figures were a bit unstable here.
> *** Included "[PATCH V4] scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy"
>
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8
>
> The hpsa patch depends on:
> https://lore.kernel.org/linux-scsi/[email protected]/
>
> And the smartpqi patch is not to be accepted.
>
> Comments (and testing) welcome, thanks!

I applied 1-11, leaving the SCSI core bits and drivers to Martin. I can
also carry them, just let me know.

--
Jens Axboe

2020-09-04 09:13:52

by John Garry

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 03/09/2020 22:23, Jens Axboe wrote:
> On 8/19/20 9:20 AM, John Garry wrote:
>> Hi all,
>>
>> Here is v8 of the patchset.
>>
>> In this version of the series, we keep the shared sbitmap for driver tags,
>> and introduce changes to fix up the tag budgeting across request queues.
>> We also have a change to count requests per-hctx for when an elevator is
>> enabled, as an optimisation. I also dropped the debugfs changes - more on
>> that below.
>>
>> Some performance figures:
>>
>> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
>> but it is not always an appropriate scheduler to use.
>>
>> Tag depth 4000 (default) 260**
>>
>> Baseline (v5.9-rc1):
>> none sched: 2094K IOPS 513K
>> mq-deadline sched: 2145K IOPS 1336K
>>
>> Final, host_tagset=0 in LLDD *, ***:
>> none sched: 2120K IOPS 550K
>> mq-deadline sched: 2121K IOPS 1309K
>>
>> Final ***:
>> none sched: 2132K IOPS 1185K
>> mq-deadline sched: 2145K IOPS 2097K
>>
>> * this is relevant as this is the performance in supporting but not
>> enabling the feature
>> ** depth=260 is relevant as a point where we are regularly waiting for
>> tags to be available. Figures were a bit unstable here.
>> *** Included "[PATCH V4] scsi: core: only re-run queue in
>> scsi_end_request() if device queue is busy"
>>
>> A copy of the patches can be found here:
>> https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8
>>
>> The hpsa patch depends on:
>> https://lore.kernel.org/linux-scsi/[email protected]/
>>
>> And the smartpqi patch is not to be accepted.
>>
>> Comments (and testing) welcome, thanks!
>
> I applied 1-11, leaving the SCSI core bits and drivers to Martin. I can
> also carry them, just let me know.
>

Great, thanks!

The SCSI parts depend on the block parts for building, so I guess it
makes sense if you could carry them also.

hpsa and smartpqi patches are pending for now, but the rest could be
picked up. Martin/James may want more review of the SCSI core bits, though.

Thanks again,
John


2020-09-04 12:49:05

by Martin K. Petersen

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs


John,

> Martin/James may want more review of the SCSI core bits, though.

I'll take a look later today.

--
Martin K. Petersen Oracle Linux Engineering

2020-09-08 17:51:04

by John Garry

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 08/09/2020 13:46, Hannes Reinecke wrote:
> Now that Jens merged the block bits in his tree, wouldn't it be better
> to re-send the SCSI bits only, thereby avoiding a potential merge error
> later on?
>

Anything which I resend would need to be against Jens' tree (and not
Martin's), assuming Jens will carry them also. So I am not sure how that
will help.

JFYI, I just tested against today's linux-next, and the SCSI parts (hpsa
and smartpqi omitted) still apply there without conflict.

Thanks,
John

2020-09-08 19:43:45

by Hannes Reinecke

Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 8/19/20 5:20 PM, John Garry wrote:
> Hi all,
>
> Here is v8 of the patchset.
>
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues.
> We also have a change to count requests per-hctx for when an elevator is
> enabled, as an optimisation. I also dropped the debugfs changes - more on
> that below.
>
> Some performance figures:
>
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
>
> Tag depth 4000 (default) 260**
>
> Baseline (v5.9-rc1):
> none sched: 2094K IOPS 513K
> mq-deadline sched: 2145K IOPS 1336K
>
> Final, host_tagset=0 in LLDD *, ***:
> none sched: 2120K IOPS 550K
> mq-deadline sched: 2121K IOPS 1309K
>
> Final ***:
> none sched: 2132K IOPS 1185K
> mq-deadline sched: 2145K IOPS 2097K
>
> * this is relevant as this is the performance in supporting but not
> enabling the feature
> ** depth=260 is relevant as a point where we are regularly waiting for
> tags to be available. Figures were a bit unstable here.
> *** Included "[PATCH V4] scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy"
>
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8
>
> The hpsa patch depends on:
> https://lore.kernel.org/linux-scsi/[email protected]/
>
> And the smartpqi patch is not to be accepted.
>
> Comments (and testing) welcome, thanks!
>
> Differences to v7:
> - Add null_blk and scsi_debug support
> - Drop debugfs tags patch - it's too difficult to be the same between
> hostwide and non-hostwide, as discussed:
> https://lore.kernel.org/linux-scsi/[email protected]/T/#mb3eb462d8be40273718505989abd12f8228c15fd
> And from commit 6bf0eb550452 ("sbitmap: Consider cleared bits in
> sbitmap_bitmap_show()"), I guess not many used this anyway...
> - Add elevator per-hctx request count for optimisation
> - Break up "blk-mq: rename blk_mq_update_tag_set_depth()" into 2x patches
> - Pass flags to avoid per-hw-queue tags init/free for hostwide tags
> - Add Don's Reviewed-by and Tested-by tags to appropriate patches
> - (@Don, please let me know if issue with how I did this)
> - Add "scsi: core: Show nr_hw_queues in sysfs"
> - Rework megaraid SAS patch to have module param (Kashyap)
> - rebase
>
> V7 is here for more info:
> https://lore.kernel.org/linux-scsi/[email protected]/T/#t
>
> Hannes Reinecke (5):
> blk-mq: Rename blk_mq_update_tag_set_depth()
> blk-mq: Free tags in blk_mq_init_tags() upon error
> scsi: Add host and host template flag 'host_tagset'
> hpsa: enable host_tagset and switch to MQ
> smartpqi: enable host tagset
>
> John Garry (10):
> blk-mq: Pass flags for tag init/free
> blk-mq: Use pointers for blk_mq_tags bitmap tags
> blk-mq: Facilitate a shared sbitmap per tagset
> blk-mq: Relocate hctx_may_queue()
> blk-mq: Record nr_active_requests per queue for when using shared
> sbitmap
> blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
> shared sbitmap
> null_blk: Support shared tag bitmap
> scsi: core: Show nr_hw_queues in sysfs
> scsi: hisi_sas: Switch v3 hw to MQ
> scsi: scsi_debug: Support host tagset
>
> Kashyap Desai (2):
> blk-mq, elevator: Count requests per hctx to improve performance
> scsi: megaraid_sas: Added support for shared host tagset for
> cpuhotplug
>
> Ming Lei (1):
> blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
>
Now that Jens merged the block bits in his tree, wouldn't it be better
to re-send the SCSI bits only, thereby avoiding a potential merge error
later on?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2020-09-10 08:38:19

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 13/18] scsi: core: Show nr_hw_queues in sysfs

On 19/08/2020 16:20, John Garry wrote:
> So that we don't use a value of 0 for when Scsi_Host.nr_hw_queues is unset,
> use the tag_set->nr_hw_queues (which holds 1 for this case).
>
> Signed-off-by: John Garry <[email protected]>

Note that there has been no review on this patch yet.

It's not strictly necessary, but I see it as useful. The same info can
be derived indirectly from block debugfs, but that's not always
available (debugfs, that is).

Thanks,
John

> ---
> drivers/scsi/scsi_sysfs.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 163dbcb741c1..d6e344fa33ad 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -393,6 +393,16 @@ show_use_blk_mq(struct device *dev, struct device_attribute *attr, char *buf)
> }
> static DEVICE_ATTR(use_blk_mq, S_IRUGO, show_use_blk_mq, NULL);
>
> +static ssize_t
> +show_nr_hw_queues(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> + struct Scsi_Host *shost = class_to_shost(dev);
> + struct blk_mq_tag_set *tag_set = &shost->tag_set;
> +
> + return snprintf(buf, 20, "%d\n", tag_set->nr_hw_queues);
> +}
> +static DEVICE_ATTR(nr_hw_queues, S_IRUGO, show_nr_hw_queues, NULL);
> +
> static struct attribute *scsi_sysfs_shost_attrs[] = {
> &dev_attr_use_blk_mq.attr,
> &dev_attr_unique_id.attr,
> @@ -411,6 +421,7 @@ static struct attribute *scsi_sysfs_shost_attrs[] = {
> &dev_attr_prot_guard_type.attr,
> &dev_attr_host_reset.attr,
> &dev_attr_eh_deadline.attr,
> + &dev_attr_nr_hw_queues.attr,
> NULL
> };
>
>
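As a side note, the show() routine in the patch is plain snprintf() formatting. A user-space sketch (not kernel code; the function name is a local stand-in, not part of the patch) of that formatting:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* User-space sketch of the formatting done by the show_nr_hw_queues()
 * routine in the patch above: a decimal value plus a trailing newline
 * written into buf. The 20-byte bound mirrors the patch; real sysfs
 * show() callbacks receive a PAGE_SIZE buffer. */
static int format_nr_hw_queues(char *buf, size_t len, int nr_hw_queues)
{
	return snprintf(buf, len, "%d\n", nr_hw_queues);
}
```

On a kernel carrying this patch, the value would then be read from something like /sys/class/scsi_host/host0/nr_hw_queues.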

2020-09-16 07:27:41

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 04/09/2020 13:44, Martin K. Petersen wrote:
>> Martin/James may want more review of the SCSI core bits, though.
> I'll take a look later today.


Hi Martin,

Have you had a chance to check these outstanding SCSI patches?

scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
scsi: scsi_debug: Support host tagset
scsi: hisi_sas: Switch v3 hw to MQ
scsi: core: Show nr_hw_queues in sysfs
scsi: Add host and host template flag 'host_tagset'

Cheers,
John

2020-09-17 04:43:06

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs


John,

> Have you had a chance to check these outstanding SCSI patches?
>
> scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
> scsi: scsi_debug: Support host tagset
> scsi: hisi_sas: Switch v3 hw to MQ
> scsi: core: Show nr_hw_queues in sysfs
> scsi: Add host and host template flag 'host_tagset'

These look good to me.

Jens, feel free to merge.

Acked-by: Martin K. Petersen <[email protected]>

--
Martin K. Petersen Oracle Linux Engineering

2020-09-17 07:08:24

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

>
>> Have you had a chance to check these outstanding SCSI patches?
>>
>> scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
>> scsi: scsi_debug: Support host tagset
>> scsi: hisi_sas: Switch v3 hw to MQ
>> scsi: core: Show nr_hw_queues in sysfs
>> scsi: Add host and host template flag 'host_tagset'
>
> These look good to me.
>
> Jens, feel free to merge.

Hi Jens,

I'm waiting on the hpsa and smartpqi patches update, so please kindly
merge only those patches, above.

Thanks!

>
> Acked-by: Martin K. Petersen <[email protected]>
>

Cheers Martin

2020-09-21 21:44:14

by Don Brace

[permalink] [raw]
Subject: RE: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

>> Hi Jens,
>>
>> I'm waiting on the hpsa and smartpqi patches update, so please kindly
>> merge only those patches, above.
>>
>> Thanks!

John, the hpsa driver crashes; one or more patches to allow internal commands from Hannes seem to be missing.

I'll let you know exactly which ones soon.

Thanks,
Don

2020-09-21 23:49:16

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 21/09/2020 22:35, [email protected] wrote:
>>> I'm waiting on the hpsa and smartpqi patches update, so please kindly merge only those patches, above.
>>> Thanks!
> John, the hpsa driver crashes; one or more patches to allow internal commands from Hannes seem to be missing.
>
> I'll let you know exactly which ones soon.
>

Hi Don,

Right, that branch did not include Hannes' patches as they did not apply
cleanly, but I think that you need the same patches as before. I can
create a branch for you to test which does include those tomorrow - let
me know.

Alternatively I think that we could create a hpsa patch which does not
rely on that series, like I mentioned here [0], but it would not be as
clean.

Cheers,
John

[0] https://lore.kernel.org/linux-scsi/[email protected]/

2020-09-22 09:10:15

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 21/09/2020 23:15, John Garry wrote:
> On 21/09/2020 22:35, [email protected] wrote:
>>>> I'm waiting on the hpsa and smartpqi patches update, so please
>>>> kindly merge only those patches, above.
>>>> Thanks!
>> John, the hpsa driver crashes; one or more patches to allow internal
>> commands from Hannes seem to be missing.
>>
>> I'll let you know exactly which ones soon.
>>
>
> Hi Don,
>
> Right, that branch did not include Hannes' patches as they did not apply
> cleanly, but I think that you need the same patches as before. I can
> create a branch for you to test which does include those tomorrow - let
> me know.
>
> Alternatively I think that we could create a hpsa patch which does not
> rely on that series, like I mentioned here [0], but it would not be as
> clean.
>
> Cheers,
> John
>
> [0] https://lore.kernel.org/linux-scsi/[email protected]/
>
> .

JFYI, I put the reserved command v6 patchset and $subject patchset on
this branch:

https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v8-resv-v6-baseline

I don't have HW to test hpsa or smartpqi.

Thanks,
John

2020-09-28 16:15:31

by Kashyap Desai

[permalink] [raw]
Subject: RE: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

>
> John,
>
> > Have you had a chance to check these outstanding SCSI patches?
> >
> > scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
> > scsi: scsi_debug: Support host tagset
> > scsi: hisi_sas: Switch v3 hw to MQ
> > scsi: core: Show nr_hw_queues in sysfs
> > scsi: Add host and host template flag 'host_tagset'
>
> These look good to me.
>
> Jens, feel free to merge.

Hi Jens, gentle ping. I am not able to find commits for the SCSI patches
listed above. I want to use your repo, which has the above-mentioned
patches, for the <scsi:io_uring iopoll> patch submission. Martin has
Acked the SCSI patches.

>
> Acked-by: Martin K. Petersen <[email protected]>
>
> --
> Martin K. Petersen Oracle Linux Engineering



2020-10-06 14:30:40

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 28/09/2020 17:11, Kashyap Desai wrote:
>>
>> John,
>>
>>> Have you had a chance to check these outstanding SCSI patches?
>>>
>>> scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
>>> scsi: scsi_debug: Support host tagset
>>> scsi: hisi_sas: Switch v3 hw to MQ
>>> scsi: core: Show nr_hw_queues in sysfs
>>> scsi: Add host and host template flag 'host_tagset'
>>
>> These look good to me.
>>
>> Jens, feel free to merge.
>
> Hi Jens, Gentle ping. I am not able to find commits for above listed scsi
> patches. I want to use your repo which has above mentioned patch for
> <scsi:io_uring iopoll> patch submission. Martin has Acked the scsi
> patches.
>
>>
>> Acked-by: Martin K. Petersen <[email protected]>
>>
>> --
>> Martin K. Petersen Oracle Linux Engineering


Hi Jens,

Could you kindly pick up the following patches, to go along with the
blk-mq changes:

scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
scsi: scsi_debug: Support host tagset
scsi: hisi_sas: Switch v3 hw to MQ
scsi: core: Show nr_hw_queues in sysfs
scsi: Add host and host template flag 'host_tagset'

Thanks,
John

2020-10-06 14:43:47

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

On 10/6/20 8:24 AM, John Garry wrote:
> On 28/09/2020 17:11, Kashyap Desai wrote:
>>>
>>> John,
>>>
>>>> Have you had a chance to check these outstanding SCSI patches?
>>>>
>>>> scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
>>>> scsi: scsi_debug: Support host tagset
>>>> scsi: hisi_sas: Switch v3 hw to MQ
>>>> scsi: core: Show nr_hw_queues in sysfs
>>>> scsi: Add host and host template flag 'host_tagset'
>>>
>>> These look good to me.
>>>
>>> Jens, feel free to merge.
>>
>> Hi Jens, Gentle ping. I am not able to find commits for above listed scsi
>> patches. I want to use your repo which has above mentioned patch for
>> <scsi:io_uring iopoll> patch submission. Martin has Acked the scsi
>> patches.
>>
>>>
>>> Acked-by: Martin K. Petersen <[email protected]>
>>>
>>> --
>>> Martin K. Petersen Oracle Linux Engineering
>
>
> Hi Jens,
>
> Could you kindly pick up the following patches, to go along with the
> blk-mq changes:
>
> scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
> scsi: scsi_debug: Support host tagset
> scsi: hisi_sas: Switch v3 hw to MQ
> scsi: core: Show nr_hw_queues in sysfs
> scsi: Add host and host template flag 'host_tagset'

Sorry about the delay, picked up these 5 patches.

--
Jens Axboe

2020-11-02 14:20:04

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> From: Kashyap Desai <[email protected]>
>
> Fusion adapters can steer completions to individual queues, and
> we now have support for shared host-wide tags.
> So we can enable multiqueue support for fusion adapters.
>
> Once the driver enables shared host-wide tags, the CPU hotplug feature is
> also supported, as it was enabled by the patchset below -
> commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> offline")
>
> Currently the driver has a provision to disable host-wide tags using the
> "host_tagset_enable" module parameter.
>
> Once we do not have any major performance regression using host-wide
> tags, we will drop the hand-crafted interrupt affinity settings.
>
> Performance is also meeting the expectation (tested with both the none and
> mq-deadline schedulers):
> 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get 3.1M
> IOPs.
>
> Signed-off-by: Kashyap Desai <[email protected]>
> Signed-off-by: Hannes Reinecke <[email protected]>
> Signed-off-by: John Garry <[email protected]>

Reverting this commit fixed an issue where a Dell PowerEdge R6415 server
with megaraid_sas is unable to boot.

c1:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
DeviceName: Integrated RAID
Subsystem: Dell PERC H730P Mini
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 48
NUMA node: 3
Region 0: I/O ports at c000 [size=256]
Region 1: Memory at a5500000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at a5400000 (64-bit, non-prefetchable) [size=1M]
Expansion ROM at <ignored> [disabled]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 04000001 c000000f c1080000 4ba9007a
Capabilities: [1e0 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [1c0 v1] Power Budgeting <?>
Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Kernel driver in use: megaraid_sas
Kernel modules: megaraid_sas

[ 26.330282][ T567] megasas: 07.714.04.00-rc1
[ 26.355663][ T611] ahci 0000:87:00.2: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[ 26.364585][ T611] ahci 0000:87:00.2: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[ 26.376125][ T289] megaraid_sas 0000:c1:00.0: FW now in Ready state
[ 26.382534][ T289] megaraid_sas 0000:c1:00.0: 63 bit DMA mask and 32 bit consistent mask
[ 26.391537][ T289] megaraid_sas 0000:c1:00.0: firmware supports msix : (96)
[ 26.431767][ T611] scsi host1: ahci
[ 26.492580][ T611] ata1: SATA max UDMA/133 abar m4096@0xc0a02000 port 0xc0a02100 irq 60
[ 26.701197][ T283] bnxt_en 0000:84:00.0 eth0: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem ad210000, node addr 4c:d9:8f:4a:20:e6
[ 26.714352][ T283] bnxt_en 0000:84:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 26.743738][ T24] tg3 0000:81:00.0 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 4c:d9:8f:65:3f:32
[ 26.754974][ T24] tg3 0000:81:00.0 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 26.765523][ T24] tg3 0000:81:00.0 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 26.774074][ T24] tg3 0000:81:00.0 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[ 26.842518][ T620] ata1: SATA link down (SStatus 0 SControl 300)
[ 26.945741][ T289] megaraid_sas 0000:c1:00.0: requested/available msix 49/49
[ 26.952912][ T289] megaraid_sas 0000:c1:00.0: current msix/online cpus : (49/48)
[ 26.960401][ T289] megaraid_sas 0000:c1:00.0: RDPQ mode : (disabled)
[ 26.966876][ T289] megaraid_sas 0000:c1:00.0: Current firmware supports maximum commands: 928 LDIO threshold: 0
[ 27.079361][ T289] megaraid_sas 0000:c1:00.0: Performance mode :Latency (latency index = 1)
[ 27.085381][ T283] bnxt_en 0000:84:00.1 eth2: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem ad200000, node addr 4c:d9:8f:4a:20:e7
[ 27.087824][ T289] megaraid_sas 0000:c1:00.0: FW supports sync cache : No
[ 27.100959][ T283] bnxt_en 0000:84:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[ 27.107835][ T289] megaraid_sas 0000:c1:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 27.130978][ T24] tg3 0000:81:00.1 eth3: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 4c:d9:8f:65:3f:33
[ 27.142919][ T24] tg3 0000:81:00.1 eth3: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 27.146042][ T571] bnxt_en 0000:84:00.1 enp132s0f1np1: renamed from eth2
[ 27.153456][ T24] tg3 0000:81:00.1 eth3: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 27.153467][ T24] tg3 0000:81:00.1 eth3: dma_rwctrl[00000001] dma_mask[64-bit]
[ 27.200900][ T289] megaraid_sas 0000:c1:00.0: FW provided supportMaxExtLDs: 1 max_lds: 64
[ 27.209174][ T289] megaraid_sas 0000:c1:00.0: controller type : MR(2048MB)
[ 27.216260][ T289] megaraid_sas 0000:c1:00.0: Online Controller Reset(OCR) : Enabled
[ 27.224105][ T289] megaraid_sas 0000:c1:00.0: Secure JBOD support : No
[ 27.230720][ T289] megaraid_sas 0000:c1:00.0: NVMe passthru support : No
[ 27.237527][ T289] megaraid_sas 0000:c1:00.0: FW provided TM TaskAbort/Reset timeout : 0 secs/0 secs
[ 27.246754][ T289] megaraid_sas 0000:c1:00.0: JBOD sequence map support : No
[ 27.253906][ T289] megaraid_sas 0000:c1:00.0: PCI Lane Margining support : No
[ 27.341447][ T289] megaraid_sas 0000:c1:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[ 27.351729][ T289] megaraid_sas 0000:c1:00.0: INIT adapter done
[ 27.357742][ T289] megaraid_sas 0000:c1:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5709
[ 27.367832][ T289] megaraid_sas 0000:c1:00.0: pci id : (0x1000)/(0x005d)/(0x1028)/(0x1f47)
[ 27.376287][ T289] megaraid_sas 0000:c1:00.0: unevenspan support : yes
[ 27.382925][ T289] megaraid_sas 0000:c1:00.0: firmware crash dump : no
[ 27.389547][ T289] megaraid_sas 0000:c1:00.0: JBOD sequence map : disabled
[ 27.397816][ T289] megaraid_sas 0000:c1:00.0: Max firmware commands: 927 shared with nr_hw_queues = 48
[ 27.407232][ T289] scsi host0: Avago SAS based MegaRAID driver
[ 27.430212][ T586] bnxt_en 0000:84:00.0 enp132s0f0np0: renamed from eth0
[ 27.781038][ T603] tg3 0000:81:00.0 eno1: renamed from eth1
[ 28.194046][ T552] tg3 0000:81:00.1 eno2: renamed from eth3

[ 251.961152][ T330] INFO: task systemd-udevd:567 blocked for more than 122 seconds.
[ 251.968876][ T330] Not tainted 5.10.0-rc1-next-20201102 #1
[ 251.975003][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 251.983546][ T330] task:systemd-udevd state:D stack:27224 pid: 567 ppid: 506 flags:0x00004324
[ 251.992620][ T330] Call Trace:
[ 251.995784][ T330] __schedule+0x71d/0x1b60
[ 252.000067][ T330] ? __sched_text_start+0x8/0x8
[ 252.004798][ T330] schedule+0xbf/0x270
[ 252.008735][ T330] schedule_timeout+0x3fc/0x590
[ 252.013464][ T330] ? usleep_range+0x120/0x120
[ 252.018008][ T330] ? wait_for_completion+0x156/0x250
[ 252.023176][ T330] ? lock_downgrade+0x700/0x700
[ 252.027886][ T330] ? rcu_read_unlock+0x40/0x40
[ 252.032530][ T330] ? do_raw_spin_lock+0x121/0x290
[ 252.037412][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[ 252.043268][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
[ 252.048331][ T330] wait_for_completion+0x15e/0x250
[ 252.053323][ T330] ? wait_for_completion_interruptible+0x320/0x320
[ 252.059687][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[ 252.065543][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
[ 252.070606][ T330] __flush_work+0x42a/0x900
[ 252.074989][ T330] ? queue_delayed_work_on+0x90/0x90
[ 252.080139][ T330] ? __queue_work+0x463/0xf40
[ 252.084700][ T330] ? init_pwq+0x320/0x320
[ 252.088891][ T330] ? queue_work_on+0x5e/0x80
[ 252.093364][ T330] ? trace_hardirqs_on+0x1c/0x150
[ 252.098255][ T330] work_on_cpu+0xe7/0x130
[ 252.102461][ T330] ? flush_delayed_work+0xc0/0xc0
[ 252.107342][ T330] ? __mutex_unlock_slowpath+0xd4/0x670
[ 252.112764][ T330] ? work_debug_hint+0x30/0x30
[ 252.117391][ T330] ? pci_device_shutdown+0x80/0x80
[ 252.122378][ T330] ? cpumask_next_and+0x57/0x80
[ 252.127094][ T330] pci_device_probe+0x500/0x5c0
[ 252.131824][ T330] ? pci_device_remove+0x1f0/0x1f0
[ 252.136805][ T330] really_probe+0x207/0xad0
[ 252.141191][ T330] ? device_driver_attach+0x120/0x120
[ 252.146428][ T330] driver_probe_device+0x1f1/0x370
[ 252.151424][ T330] device_driver_attach+0xe5/0x120
[ 252.156399][ T330] __driver_attach+0xf0/0x260
[ 252.160953][ T330] bus_for_each_dev+0x117/0x1a0
[ 252.165669][ T330] ? subsys_dev_iter_exit+0x10/0x10
[ 252.170731][ T330] bus_add_driver+0x399/0x560
[ 252.175289][ T330] driver_register+0x189/0x310
[ 252.179919][ T330] ? 0xffffffffc05c1000
[ 252.183960][ T330] megasas_init+0x117/0x1000 [megaraid_sas]
[ 252.189713][ T330] do_one_initcall+0xf6/0x510
[ 252.194267][ T330] ? perf_trace_initcall_level+0x490/0x490
[ 252.199940][ T330] ? kasan_unpoison_shadow+0x30/0x40
[ 252.205104][ T330] ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[ 252.210859][ T330] ? do_init_module+0x49/0x6c0
[ 252.215500][ T330] ? kmem_cache_alloc_trace+0x11f/0x1e0
[ 252.220925][ T330] ? kasan_unpoison_shadow+0x30/0x40
[ 252.226068][ T330] do_init_module+0x1ed/0x6c0
[ 252.230608][ T330] load_module+0x4a59/0x5d20
[ 252.235081][ T330] ? layout_and_allocate+0x2770/0x2770
[ 252.240404][ T330] ? __vmalloc_node+0x8d/0x100
[ 252.245046][ T330] ? kernel_read_file+0x485/0x5a0
[ 252.249934][ T330] ? kernel_read_file+0x305/0x5a0
[ 252.254839][ T330] ? __x64_sys_fsconfig+0x970/0x970
[ 252.259903][ T330] ? __do_sys_finit_module+0xff/0x180
[ 252.265153][ T330] __do_sys_finit_module+0xff/0x180
[ 252.270216][ T330] ? __do_sys_init_module+0x1d0/0x1d0
[ 252.275465][ T330] ? __fget_files+0x1c3/0x2e0
[ 252.280010][ T330] do_syscall_64+0x33/0x40
[ 252.284304][ T330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 252.290054][ T330] RIP: 0033:0x7fbb3e2fa78d
[ 252.294348][ T330] Code: Unable to access opcode bytes at RIP 0x7fbb3e2fa763.
[ 252.301584][ T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 252.309855][ T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX: 00007fbb3e2fa78d
[ 252.317703][ T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI: 0000000000000006
[ 252.325553][ T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09: 00007ffe572e8e40
[ 252.333402][ T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[ 252.341257][ T330] R13: 000055c7795930e0 R14: 0000000000020000 R15: 0000000000000000
[ 252.349117][ T330]
[ 252.349117][ T330] Showing all locks held in the system:
[ 252.356770][ T330] 3 locks held by kworker/3:1/289:
[ 252.361759][ T330] #0: ffff8881001eb938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ec/0x1610
[ 252.371976][ T330] #1: ffffc90004ee7e00 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
[ 252.382803][ T330] #2: ffff8881430380e0 (&shost->scan_mutex){+.+.}-{3:3}, at: scsi_scan_host_selected+0xde/0x260
[ 252.393199][ T330] 1 lock held by khungtaskd/330:
[ 252.397993][ T330] #0: ffffffff9d4d3760 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.52+0x0/0x30
[ 252.408296][ T330] 1 lock held by systemd-journal/420:
[ 252.413562][ T330] 1 lock held by systemd-udevd/567:
[ 252.418619][ T330] #0: ffff8881207ac218 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x37/0x120
[ 252.428159][ T330]
[ 252.430355][ T330] =============================================
[ 252.430355][ T330]

> ---
> drivers/scsi/megaraid/megaraid_sas_base.c | 39 +++++++++++++++++++++
> drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++++++++-------
> 2 files changed, 55 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 861f7140f52e..6960922d0d7f 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -37,6 +37,7 @@
> #include <linux/poll.h>
> #include <linux/vmalloc.h>
> #include <linux/irq_poll.h>
> +#include <linux/blk-mq-pci.h>
>
> #include <scsi/scsi.h>
> #include <scsi/scsi_cmnd.h>
> @@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
> module_param(enable_sdev_max_qd, int, 0444);
> MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue.
> Default: 0");
>
> +int host_tagset_enable = 1;
> +module_param(host_tagset_enable, int, 0444);
> +MODULE_PARM_DESC(host_tagset_enable, "Shared host tagset enable/disable
> Default: enable(1)");
> +
> MODULE_LICENSE("GPL");
> MODULE_VERSION(MEGASAS_VERSION);
> MODULE_AUTHOR("[email protected]");
> @@ -3119,6 +3124,19 @@ megasas_bios_param(struct scsi_device *sdev, struct
> block_device *bdev,
> return 0;
> }
>
> +static int megasas_map_queues(struct Scsi_Host *shost)
> +{
> + struct megasas_instance *instance;
> +
> + instance = (struct megasas_instance *)shost->hostdata;
> +
> + if (shost->nr_hw_queues == 1)
> + return 0;
> +
> + return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
> + instance->pdev, instance->low_latency_index_start);
> +}
> +
> static void megasas_aen_polling(struct work_struct *work);
>
> /**
> @@ -3427,6 +3445,7 @@ static struct scsi_host_template megasas_template = {
> .eh_timed_out = megasas_reset_timer,
> .shost_attrs = megaraid_host_attrs,
> .bios_param = megasas_bios_param,
> + .map_queues = megasas_map_queues,
> .change_queue_depth = scsi_change_queue_depth,
> .max_segment_size = 0xffffffff,
> };
> @@ -6808,6 +6827,26 @@ static int megasas_io_attach(struct megasas_instance
> *instance)
> host->max_lun = MEGASAS_MAX_LUN;
> host->max_cmd_len = 16;
>
> + /* Use shared host tagset only for fusion adaptors
> + * if there are managed interrupts (smp affinity enabled case).
> + * Single msix_vectors in kdump, so shared host tag is also disabled.
> + */
> +
> + host->host_tagset = 0;
> + host->nr_hw_queues = 1;
> +
> + if ((instance->adapter_type != MFI_SERIES) &&
> + (instance->msix_vectors > instance->low_latency_index_start) &&
> + host_tagset_enable &&
> + instance->smp_affinity_enable) {
> + host->host_tagset = 1;
> + host->nr_hw_queues = instance->msix_vectors -
> + instance->low_latency_index_start;
> + }
> +
> + dev_info(&instance->pdev->dev,
> + "Max firmware commands: %d shared with nr_hw_queues = %d\n",
> + instance->max_fw_cmds, host->nr_hw_queues);
> /*
> * Notify the mid-layer about the new controller
> */
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 0824410f78f8..a4251121f173 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -359,24 +359,29 @@ megasas_get_msix_index(struct megasas_instance
> *instance,
> {
> int sdev_busy;
>
> - /* nr_hw_queue = 1 for MegaRAID */
> - struct blk_mq_hw_ctx *hctx =
> - scmd->device->request_queue->queue_hw_ctx[0];
> -
> - sdev_busy = atomic_read(&hctx->nr_active);
> + /* TBD - if sml remove device_busy in future, driver
> + * should track counter in internal structure.
> + */
> + sdev_busy = atomic_read(&scmd->device->device_busy);
>
> if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
> - sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
> + sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
> cmd->request_desc->SCSIIO.MSIxIndex =
> mega_mod64((atomic64_add_return(1, &instance-
> >high_iops_outstanding) /
> MR_HIGH_IOPS_BATCH_COUNT), instance-
> >low_latency_index_start);
> - else if (instance->msix_load_balance)
> + } else if (instance->msix_load_balance) {
> cmd->request_desc->SCSIIO.MSIxIndex =
> (mega_mod64(atomic64_add_return(1, &instance-
> >total_io_count),
> instance->msix_vectors));
> - else
> + } else if (instance->host->nr_hw_queues > 1) {
> + u32 tag = blk_mq_unique_tag(scmd->request);
> +
> + cmd->request_desc->SCSIIO.MSIxIndex =
> blk_mq_unique_tag_to_hwq(tag) +
> + instance->low_latency_index_start;
> + } else {
> cmd->request_desc->SCSIIO.MSIxIndex =
> instance->reply_map[raw_smp_processor_id()];
> + }
> }
>
> /**
> @@ -956,9 +961,6 @@ megasas_alloc_cmds_fusion(struct megasas_instance
> *instance)
> if (megasas_alloc_cmdlist_fusion(instance))
> goto fail_exit;
>
> - dev_info(&instance->pdev->dev, "Configured max firmware commands: %d\n",
> - instance->max_fw_cmds);
> -
> /* The first 256 bytes (SMID 0) is not used. Don't add to the cmd list
> */
> io_req_base = fusion->io_request_frames +
> MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
> io_req_base_phys = fusion->io_request_frames_phys +
> MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
> @@ -1102,8 +1104,9 @@ megasas_ioc_init_fusion(struct megasas_instance
> *instance)
> MR_HIGH_IOPS_QUEUE_COUNT) && cur_intr_coalescing)
> instance->perf_mode = MR_BALANCED_PERF_MODE;
>
> - dev_info(&instance->pdev->dev, "Performance mode :%s\n",
> - MEGASAS_PERF_MODE_2STR(instance->perf_mode));
> + dev_info(&instance->pdev->dev, "Performance mode :%s (latency index =
> %d)\n",
> + MEGASAS_PERF_MODE_2STR(instance->perf_mode),
> + instance->low_latency_index_start);
>
> instance->fw_sync_cache_support = (scratch_pad_1 &
> MR_CAN_HANDLE_SYNC_CACHE_OFFSET) ? 1 : 0;
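For context on the megasas_get_msix_index() hunk above: blk_mq_unique_tag() packs the hardware-queue index into the upper 16 bits of the returned value, which is what lets the driver derive an MSI-x index from the request. A user-space sketch of that encoding (constants as in include/linux/blk-mq.h; the helper names here are local stand-ins for the kernel functions, not the functions themselves):

```c
#include <assert.h>
#include <stdint.h>

/* BLK_MQ_UNIQUE_TAG_BITS/_MASK mirror include/linux/blk-mq.h. */
#define BLK_MQ_UNIQUE_TAG_BITS 16
#define BLK_MQ_UNIQUE_TAG_MASK ((1U << BLK_MQ_UNIQUE_TAG_BITS) - 1)

/* Stand-in for blk_mq_unique_tag(): hwq index in the upper 16 bits,
 * per-queue tag in the lower 16. */
static uint32_t unique_tag(uint32_t hwq, uint32_t tag)
{
	return (hwq << BLK_MQ_UNIQUE_TAG_BITS) | (tag & BLK_MQ_UNIQUE_TAG_MASK);
}

/* Stand-in for blk_mq_unique_tag_to_hwq(): recover the hwq index. */
static uint32_t unique_tag_to_hwq(uint32_t ut)
{
	return ut >> BLK_MQ_UNIQUE_TAG_BITS;
}
```

The patch then adds instance->low_latency_index_start to the decoded hwq so that completions land past the reserved low-latency vectors.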

2020-11-02 14:33:58

by Kashyap Desai

[permalink] [raw]
Subject: RE: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

> On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > From: Kashyap Desai <[email protected]>
> >
> > Fusion adapters can steer completions to individual queues, and we now
> > have support for shared host-wide tags.
> > So we can enable multiqueue support for fusion adapters.
> >
> > Once the driver enables shared host-wide tags, the CPU hotplug feature is
> > also supported, as it was enabled by the patchset below - commit
> > bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> > offline")
> >
> > Currently the driver has a provision to disable host-wide tags using the
> > "host_tagset_enable" module parameter.
> >
> > Once we do not have any major performance regression using host-wide
> > tags, we will drop the hand-crafted interrupt affinity settings.
> >
> > Performance is also meeting the expecatation - (used both none and
> > mq-deadline scheduler)
> > 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> > 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get
> > 3.1M IOPs.
> >
> > Signed-off-by: Kashyap Desai <[email protected]>
> > Signed-off-by: Hannes Reinecke <[email protected]>
> > Signed-off-by: John Garry <[email protected]>
>
> Reverting this commit fixed an issue where a Dell PowerEdge R6415 server
> with megaraid_sas was unable to boot.

I will take a look at this. BTW, can you try keeping the same patch but
using the module parameter "host_tagset_enable=0"?

Kashyap
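
As a sketch of the suggested workaround (the parameter name is taken from the
patch description above; the file path is an assumption for illustration),
host-wide tags could be disabled without reverting the patch via a modprobe
option:

```shell
# /etc/modprobe.d/megaraid_sas.conf (hypothetical path) -- keep the patch
# applied but disable the shared host-wide tagset for megaraid_sas
options megaraid_sas host_tagset_enable=0
```

When the driver is built in rather than loaded as a module, the equivalent
kernel command-line form would be `megaraid_sas.host_tagset_enable=0`.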



2020-11-02 14:59:32

by John Garry

Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On 02/11/2020 14:17, Qian Cai wrote:
> [ 251.961152][ T330] INFO: task systemd-udevd:567 blocked for more than 122 seconds.
> [ 251.968876][ T330] Not tainted 5.10.0-rc1-next-20201102 #1
> [ 251.975003][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 251.983546][ T330] task:systemd-udevd state:D stack:27224 pid: 567 ppid: 506 flags:0x00004324
> [ 251.992620][ T330] Call Trace:
> [ 251.995784][ T330] __schedule+0x71d/0x1b60
> [ 252.000067][ T330] ? __sched_text_start+0x8/0x8
> [ 252.004798][ T330] schedule+0xbf/0x270
> [ 252.008735][ T330] schedule_timeout+0x3fc/0x590
> [ 252.013464][ T330] ? usleep_range+0x120/0x120
> [ 252.018008][ T330] ? wait_for_completion+0x156/0x250
> [ 252.023176][ T330] ? lock_downgrade+0x700/0x700
> [ 252.027886][ T330] ? rcu_read_unlock+0x40/0x40
> [ 252.032530][ T330] ? do_raw_spin_lock+0x121/0x290
> [ 252.037412][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [ 252.043268][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
> [ 252.048331][ T330] wait_for_completion+0x15e/0x250
> [ 252.053323][ T330] ? wait_for_completion_interruptible+0x320/0x320
> [ 252.059687][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [ 252.065543][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
> [ 252.070606][ T330] __flush_work+0x42a/0x900
> [ 252.074989][ T330] ? queue_delayed_work_on+0x90/0x90
> [ 252.080139][ T330] ? __queue_work+0x463/0xf40
> [ 252.084700][ T330] ? init_pwq+0x320/0x320
> [ 252.088891][ T330] ? queue_work_on+0x5e/0x80
> [ 252.093364][ T330] ? trace_hardirqs_on+0x1c/0x150
> [ 252.098255][ T330] work_on_cpu+0xe7/0x130
> [ 252.102461][ T330] ? flush_delayed_work+0xc0/0xc0
> [ 252.107342][ T330] ? __mutex_unlock_slowpath+0xd4/0x670
> [ 252.112764][ T330] ? work_debug_hint+0x30/0x30
> [ 252.117391][ T330] ? pci_device_shutdown+0x80/0x80
> [ 252.122378][ T330] ? cpumask_next_and+0x57/0x80
> [ 252.127094][ T330] pci_device_probe+0x500/0x5c0
> [ 252.131824][ T330] ? pci_device_remove+0x1f0/0x1f0

Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, given this call trace.

Or please share the .config.

Cheers,
John

> [ 252.136805][ T330] really_probe+0x207/0xad0
> [ 252.141191][ T330] ? device_driver_attach+0x120/0x120
> [ 252.146428][ T330] driver_probe_device+0x1f1/0x370
> [ 252.151424][ T330] device_driver_attach+0xe5/0x120
> [ 252.156399][ T330] __driver_attach+0xf0/0x260
> [ 252.160953][ T330] bus_for_each_dev+0x117/0x1a0
> [ 252.165669][ T330] ? subsys_dev_iter_exit+0x10/0x10
> [ 252.170731][ T330] bus_add_driver+0x399/0x560
> [ 252.175289][ T330] driver_register+0x189/0x310
> [ 252.179919][ T330] ? 0xffffffffc05c1000
> [ 252.183960][ T330] megasas_init+0x117/0x1000 [megaraid_sas]
> [ 252.189713][ T330] do_one_initcall+0xf6/0x510
> [ 252.194267][ T330] ? perf_trace_initcall_level+0x490/0x490
> [ 252.199940][ T330] ? kasan_unpoison_shadow+0x30/0x40
> [ 252.205104][ T330] ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> [ 252.210859][ T330] ? do_init_module+0x49/0x6c0
> [ 252.215500][ T330] ? kmem_cache_alloc_trace+0x11f/0x1e0
> [ 252.220925][ T330] ? kasan_unpoison_shadow+0x30/0x40
> [ 252.226068][ T330] do_init_module+0x1ed/0x6c0
> [ 252.230608][ T330] load_module+0x4a59/0x5d20
> [ 252.235081][ T330] ? layout_and_allocate+0x2770/0x2770
> [ 252.240404][ T330] ? __vmalloc_node+0x8d/0x100
> [ 252.245046][ T330] ? kernel_read_file+0x485/0x5a0
> [ 252.249934][ T330] ? kernel_read_file+0x305/0x5a0
> [ 252.254839][ T330] ? __x64_sys_fsconfig+0x970/0x970
> [ 252.259903][ T330] ? __do_sys_finit_module+0xff/0x180
> [ 252.265153][ T330] __do_sys_finit_module+0xff/0x180
> [ 252.270216][ T330] ? __do_sys_init_module+0x1d0/0x1d0
> [ 252.275465][ T330] ? __fget_files+0x1c3/0x2e0
> [ 252.280010][ T330] do_syscall_64+0x33/0x40
> [ 252.284304][ T330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 252.290054][ T330] RIP: 0033:0x7fbb3e2fa78d
> [ 252.294348][ T330] Code: Unable to access opcode bytes at RIP 0x7fbb3e2fa763.
> [ 252.301584][ T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [ 252.309855][ T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX: 00007fbb3e2fa78d
> [ 252.317703][ T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI: 0000000000000006
> [ 252.325553][ T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09: 00007ffe572e8e40
> [ 252.333402][ T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
> [ 252.341257][ T330] R13: 000055c7795930e0 R14: 0000000000020000 R15: 0000000000000000
> [ 252.349117][ T330]
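
To answer a .config question like the one above without sharing the whole
file, a small helper can report whether an option such as
CONFIG_DEBUG_TEST_DRIVER_REMOVE is enabled in a kernel config file (a minimal
sketch; the function name `config_enabled` is illustrative, not part of any
kernel tooling):

```shell
# Report whether a kernel .config enables a given option (=y or =m).
# Usage: config_enabled <config-file> <CONFIG_OPTION>  -- prints yes/no.
config_enabled() {
    if grep -Eq "^${2}=(y|m)$" "$1"; then
        echo yes
    else
        echo no
    fi
}
```

On a running system the config typically lives at /boot/config-$(uname -r),
or in /proc/config.gz when CONFIG_IKCONFIG_PROC is set.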

2020-11-02 15:21:22

by Qian Cai

Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Mon, 2020-11-02 at 14:51 +0000, John Garry wrote:
> On 02/11/2020 14:17, Qian Cai wrote:
> > [ 251.961152][ T330] INFO: task systemd-udevd:567 blocked for more than
> > 122 seconds.
> > [ 251.968876][ T330] Not tainted 5.10.0-rc1-next-20201102 #1
> > [ 251.975003][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 251.983546][ T330] task:systemd-udevd state:D stack:27224 pid: 567
> > ppid: 506 flags:0x00004324
> > [ 251.992620][ T330] Call Trace:
> > [ 251.995784][ T330] __schedule+0x71d/0x1b60
> > [ 252.000067][ T330] ? __sched_text_start+0x8/0x8
> > [ 252.004798][ T330] schedule+0xbf/0x270
> > [ 252.008735][ T330] schedule_timeout+0x3fc/0x590
> > [ 252.013464][ T330] ? usleep_range+0x120/0x120
> > [ 252.018008][ T330] ? wait_for_completion+0x156/0x250
> > [ 252.023176][ T330] ? lock_downgrade+0x700/0x700
> > [ 252.027886][ T330] ? rcu_read_unlock+0x40/0x40
> > [ 252.032530][ T330] ? do_raw_spin_lock+0x121/0x290
> > [ 252.037412][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> > [ 252.043268][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
> > [ 252.048331][ T330] wait_for_completion+0x15e/0x250
> > [ 252.053323][ T330] ? wait_for_completion_interruptible+0x320/0x320
> > [ 252.059687][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> > [ 252.065543][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
> > [ 252.070606][ T330] __flush_work+0x42a/0x900
> > [ 252.074989][ T330] ? queue_delayed_work_on+0x90/0x90
> > [ 252.080139][ T330] ? __queue_work+0x463/0xf40
> > [ 252.084700][ T330] ? init_pwq+0x320/0x320
> > [ 252.088891][ T330] ? queue_work_on+0x5e/0x80
> > [ 252.093364][ T330] ? trace_hardirqs_on+0x1c/0x150
> > [ 252.098255][ T330] work_on_cpu+0xe7/0x130
> > [ 252.102461][ T330] ? flush_delayed_work+0xc0/0xc0
> > [ 252.107342][ T330] ? __mutex_unlock_slowpath+0xd4/0x670
> > [ 252.112764][ T330] ? work_debug_hint+0x30/0x30
> > [ 252.117391][ T330] ? pci_device_shutdown+0x80/0x80
> > [ 252.122378][ T330] ? cpumask_next_and+0x57/0x80
> > [ 252.127094][ T330] pci_device_probe+0x500/0x5c0
> > [ 252.131824][ T330] ? pci_device_remove+0x1f0/0x1f0
>
> Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, with this call.
>
> Or please share the .config

No. https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

>
> Cheers,
> John
>
> > [ 252.136805][ T330] really_probe+0x207/0xad0
> > [ 252.141191][ T330] ? device_driver_attach+0x120/0x120
> > [ 252.146428][ T330] driver_probe_device+0x1f1/0x370
> > [ 252.151424][ T330] device_driver_attach+0xe5/0x120
> > [ 252.156399][ T330] __driver_attach+0xf0/0x260
> > [ 252.160953][ T330] bus_for_each_dev+0x117/0x1a0
> > [ 252.165669][ T330] ? subsys_dev_iter_exit+0x10/0x10
> > [ 252.170731][ T330] bus_add_driver+0x399/0x560
> > [ 252.175289][ T330] driver_register+0x189/0x310
> > [ 252.179919][ T330] ? 0xffffffffc05c1000
> > [ 252.183960][ T330] megasas_init+0x117/0x1000 [megaraid_sas]
> > [ 252.189713][ T330] do_one_initcall+0xf6/0x510
> > [ 252.194267][ T330] ? perf_trace_initcall_level+0x490/0x490
> > [ 252.199940][ T330] ? kasan_unpoison_shadow+0x30/0x40
> > [ 252.205104][ T330] ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> > [ 252.210859][ T330] ? do_init_module+0x49/0x6c0
> > [ 252.215500][ T330] ? kmem_cache_alloc_trace+0x11f/0x1e0
> > [ 252.220925][ T330] ? kasan_unpoison_shadow+0x30/0x40
> > [ 252.226068][ T330] do_init_module+0x1ed/0x6c0
> > [ 252.230608][ T330] load_module+0x4a59/0x5d20
> > [ 252.235081][ T330] ? layout_and_allocate+0x2770/0x2770
> > [ 252.240404][ T330] ? __vmalloc_node+0x8d/0x100
> > [ 252.245046][ T330] ? kernel_read_file+0x485/0x5a0
> > [ 252.249934][ T330] ? kernel_read_file+0x305/0x5a0
> > [ 252.254839][ T330] ? __x64_sys_fsconfig+0x970/0x970
> > [ 252.259903][ T330] ? __do_sys_finit_module+0xff/0x180
> > [ 252.265153][ T330] __do_sys_finit_module+0xff/0x180
> > [ 252.270216][ T330] ? __do_sys_init_module+0x1d0/0x1d0
> > [ 252.275465][ T330] ? __fget_files+0x1c3/0x2e0
> > [ 252.280010][ T330] do_syscall_64+0x33/0x40
> > [ 252.284304][ T330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 252.290054][ T330] RIP: 0033:0x7fbb3e2fa78d
> > [ 252.294348][ T330] Code: Unable to access opcode bytes at RIP
> > 0x7fbb3e2fa763.
> > [ 252.301584][ T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000139
> > [ 252.309855][ T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX:
> > 00007fbb3e2fa78d
> > [ 252.317703][ T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI:
> > 0000000000000006
> > [ 252.325553][ T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09:
> > 00007ffe572e8e40
> > [ 252.333402][ T330] R10: 0000000000000006 R11: 0000000000000246 R12:
> > 0000000000000000
> > [ 252.341257][ T330] R13: 000055c7795930e0 R14: 0000000000020000 R15:
> > 0000000000000000
> > [ 252.349117][ T330]

2020-11-02 15:27:17

by Qian Cai

Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Mon, 2020-11-02 at 20:01 +0530, Kashyap Desai wrote:
> > On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > > From: Kashyap Desai <[email protected]>
> > >
> > > Fusion adapters can steer completions to individual queues, and we now
> > > have support for shared host-wide tags.
> > > So we can enable multiqueue support for fusion adapters.
> > >
> > > Once the driver enables shared host-wide tags, the CPU hotplug feature
> > > is also supported, as enabled by commit bf0beec0607d ("blk-mq: drain
> > > I/O when all CPUs in a hctx are offline").
> > >
> > > Currently the driver has a provision to disable host-wide tags using
> > > the "host_tagset_enable" module parameter.
> > >
> > > Once we see no major performance regression using host-wide tags, we
> > > will drop the hand-crafted interrupt affinity settings.
> > >
> > > Performance also meets expectations (tested with both the none and
> > > mq-deadline schedulers):
> > > 24-drive SSD on Aero with/without this patch can get 3.1M IOPs.
> > > 3 VDs consisting of 8 SAS SSDs on Aero with/without this patch can get
> > > 3.1M IOPs.
> > >
> > > Signed-off-by: Kashyap Desai <[email protected]>
> > > Signed-off-by: Hannes Reinecke <[email protected]>
> > > Signed-off-by: John Garry <[email protected]>
> >
> > Reverting this commit fixed an issue where a Dell PowerEdge R6415 server
> > with megaraid_sas was unable to boot.
>
> I will take a look at this. BTW, can you try keeping the same patch but
> using the module parameter "host_tagset_enable=0"?

Yes, that also works.

2020-11-03 10:56:44

by John Garry

Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On 02/11/2020 15:18, Qian Cai wrote:
> On Mon, 2020-11-02 at 14:51 +0000, John Garry wrote:
>> On 02/11/2020 14:17, Qian Cai wrote:
>>> [ 251.961152][ T330] INFO: task systemd-udevd:567 blocked for more than
>>> 122 seconds.
>>> [ 251.968876][ T330] Not tainted 5.10.0-rc1-next-20201102 #1
>>> [ 251.975003][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> disables this message.
>>> [ 251.983546][ T330] task:systemd-udevd state:D stack:27224 pid: 567
>>> ppid: 506 flags:0x00004324
>>> [ 251.992620][ T330] Call Trace:
>>> [ 251.995784][ T330] __schedule+0x71d/0x1b60
>>> [ 252.000067][ T330] ? __sched_text_start+0x8/0x8
>>> [ 252.004798][ T330] schedule+0xbf/0x270
>>> [ 252.008735][ T330] schedule_timeout+0x3fc/0x590
>>> [ 252.013464][ T330] ? usleep_range+0x120/0x120
>>> [ 252.018008][ T330] ? wait_for_completion+0x156/0x250
>>> [ 252.023176][ T330] ? lock_downgrade+0x700/0x700
>>> [ 252.027886][ T330] ? rcu_read_unlock+0x40/0x40
>>> [ 252.032530][ T330] ? do_raw_spin_lock+0x121/0x290
>>> [ 252.037412][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
>>> [ 252.043268][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
>>> [ 252.048331][ T330] wait_for_completion+0x15e/0x250
>>> [ 252.053323][ T330] ? wait_for_completion_interruptible+0x320/0x320
>>> [ 252.059687][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
>>> [ 252.065543][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
>>> [ 252.070606][ T330] __flush_work+0x42a/0x900
>>> [ 252.074989][ T330] ? queue_delayed_work_on+0x90/0x90
>>> [ 252.080139][ T330] ? __queue_work+0x463/0xf40
>>> [ 252.084700][ T330] ? init_pwq+0x320/0x320
>>> [ 252.088891][ T330] ? queue_work_on+0x5e/0x80
>>> [ 252.093364][ T330] ? trace_hardirqs_on+0x1c/0x150
>>> [ 252.098255][ T330] work_on_cpu+0xe7/0x130
>>> [ 252.102461][ T330] ? flush_delayed_work+0xc0/0xc0
>>> [ 252.107342][ T330] ? __mutex_unlock_slowpath+0xd4/0x670
>>> [ 252.112764][ T330] ? work_debug_hint+0x30/0x30
>>> [ 252.117391][ T330] ? pci_device_shutdown+0x80/0x80
>>> [ 252.122378][ T330] ? cpumask_next_and+0x57/0x80
>>> [ 252.127094][ T330] pci_device_probe+0x500/0x5c0
>>> [ 252.131824][ T330] ? pci_device_remove+0x1f0/0x1f0
>>
>> Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, with this call.
>>
>> Or please share the .config
>
> No. https://cailca.coding.net/public/linux/mm/git/files/master/x86.config
>

Thanks. FWIW, I just tested another megaraid_sas card on linux-next (02 Nov)
with a vanilla arm64 defconfig and no special command-line parameters, and
found no issue:

dmesg | grep mega
[30.031739] megasas: 07.714.04.00-rc1
[30.039749] megaraid_sas 0000:08:00.0: Adding to iommu group 0
[30.053247] megaraid_sas 0000:08:00.0: BAR:0x0 BAR's base_addr(phys):0x0000080010000000 mapped virt_addr:0x(____ptrval____)
[30.053251] megaraid_sas 0000:08:00.0: FW now in Ready state
[30.065162] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63 bit consistent mask
[30.081197] megaraid_sas 0000:08:00.0: firmware supports msix : (128)
[30.096349] megaraid_sas 0000:08:00.0: requested/available msix 128/128
[30.110277] megaraid_sas 0000:08:00.0: current msix/online cpus: (128/128)
[30.124917] megaraid_sas 0000:08:00.0: RDPQ mode : (enabled)
[30.136821] megaraid_sas 0000:08:00.0: Current firmware supports maximum commands: 4077 LDIO threshold: 0
[30.208538] megaraid_sas 0000:08:00.0: Performance mode :Latency (latency index = 1)
[30.224838] megaraid_sas 0000:08:00.0: FW supports sync cache : Yes
[30.238021] megaraid_sas 0000:08:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[30.311960] megaraid_sas 0000:08:00.0: FW provided supportMaxExtLDs: 1 max_lds: 64
[30.327885] megaraid_sas 0000:08:00.0: controller type : MR(2048MB)
[30.341066] megaraid_sas 0000:08:00.0: Online Controller Reset(OCR) : Enabled
[30.356076] megaraid_sas 0000:08:00.0: Secure JBOD support: Yes
[30.368710] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
[30.381708] megaraid_sas 0000:08:00.0: FW provided TM TaskAbort/Reset timeout : 6 secs/60 secs
[30.399825] megaraid_sas 0000:08:00.0: JBOD sequence map support : Yes
[30.413552] megaraid_sas 0000:08:00.0: PCI Lane Margining support : No
[30.452059] megaraid_sas 0000:08:00.0: NVME page size : (4096)
[30.465079] megaraid_sas 0000:08:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[30.485208] megaraid_sas 0000:08:00.0: INIT adapter done
[30.496609] megaraid_sas 0000:08:00.0: pci id : (0x1000)/(0x0016)/(0x19e5)/(0xd215)
[30.512931] megaraid_sas 0000:08:00.0: unevenspan support : no
[30.525199] megaraid_sas 0000:08:00.0: firmware crash dump: no
[30.537649] megaraid_sas 0000:08:00.0: JBOD sequence map : enabled
[30.550743] megaraid_sas 0000:08:00.0: Max firmware commands: 4076 shared with nr_hw_queues = 127

john@ubuntu:~$ lspci -s 08:00.0 -v
08:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3508 (rev 01)
Subsystem: Huawei Technologies Co., Ltd. MegaRAID Tri-Mode SAS3508
Flags: bus master, fast devsel, latency 0, IRQ 41, NUMA node 0
Memory at 80010000000 (64-bit, prefetchable) [size=1M]
Memory at 80010100000 (64-bit, prefetchable) [size=1M]
Memory at e9400000 (32-bit, non-prefetchable) [size=1M]
I/O ports at 1000 [size=256]
Expansion ROM at e9500000 [disabled] [size=1M]
Capabilities: <access denied>
Kernel driver in use: megaraid_sas

I have no x86 system to test that x86 config with, though. How about
v5.10-rc2 -- does it show this issue?

Thanks,
John


>>
>> Cheers,
>> John
>>
>>> [ 252.136805][ T330] really_probe+0x207/0xad0
>>> [ 252.141191][ T330] ? device_driver_attach+0x120/0x120
>>> [ 252.146428][ T330] driver_probe_device+0x1f1/0x370
>>> [ 252.151424][ T330] device_driver_attach+0xe5/0x120
>>> [ 252.156399][ T330] __driver_attach+0xf0/0x260
>>> [ 252.160953][ T330] bus_for_each_dev+0x117/0x1a0
>>> [ 252.165669][ T330] ? subsys_dev_iter_exit+0x10/0x10
>>> [ 252.170731][ T330] bus_add_driver+0x399/0x560
>>> [ 252.175289][ T330] driver_register+0x189/0x310
>>> [ 252.179919][ T330] ? 0xffffffffc05c1000
>>> [ 252.183960][ T330] megasas_init+0x117/0x1000 [megaraid_sas]
>>> [ 252.189713][ T330] do_one_initcall+0xf6/0x510
>>> [ 252.194267][ T330] ? perf_trace_initcall_level+0x490/0x490
>>> [ 252.199940][ T330] ? kasan_unpoison_shadow+0x30/0x40
>>> [ 252.205104][ T330] ? __kasan_kmalloc.constprop.11+0xc1/0xd0
>>> [ 252.210859][ T330] ? do_init_module+0x49/0x6c0
>>> [ 252.215500][ T330] ? kmem_cache_alloc_trace+0x11f/0x1e0
>>> [ 252.220925][ T330] ? kasan_unpoison_shadow+0x30/0x40
>>> [ 252.226068][ T330] do_init_module+0x1ed/0x6c0
>>> [ 252.230608][ T330] load_module+0x4a59/0x5d20
>>> [ 252.235081][ T330] ? layout_and_allocate+0x2770/0x2770
>>> [ 252.240404][ T330] ? __vmalloc_node+0x8d/0x100
>>> [ 252.245046][ T330] ? kernel_read_file+0x485/0x5a0
>>> [ 252.249934][ T330] ? kernel_read_file+0x305/0x5a0
>>> [ 252.254839][ T330] ? __x64_sys_fsconfig+0x970/0x970
>>> [ 252.259903][ T330] ? __do_sys_finit_module+0xff/0x180
>>> [ 252.265153][ T330] __do_sys_finit_module+0xff/0x180
>>> [ 252.270216][ T330] ? __do_sys_init_module+0x1d0/0x1d0
>>> [ 252.275465][ T330] ? __fget_files+0x1c3/0x2e0
>>> [ 252.280010][ T330] do_syscall_64+0x33/0x40
>>> [ 252.284304][ T330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [ 252.290054][ T330] RIP: 0033:0x7fbb3e2fa78d
>>> [ 252.294348][ T330] Code: Unable to access opcode bytes at RIP
>>> 0x7fbb3e2fa763.
>>> [ 252.301584][ T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000139
>>> [ 252.309855][ T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX:
>>> 00007fbb3e2fa78d
>>> [ 252.317703][ T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI:
>>> 0000000000000006
>>> [ 252.325553][ T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09:
>>> 00007ffe572e8e40
>>> [ 252.333402][ T330] R10: 0000000000000006 R11: 0000000000000246 R12:
>>> 0000000000000000
>>> [ 252.341257][ T330] R13: 000055c7795930e0 R14: 0000000000020000 R15:
>>> 0000000000000000
>>> [ 252.349117][ T330]
>
> .
>

2020-11-03 13:07:54

by Qian Cai

Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Tue, 2020-11-03 at 10:54 +0000, John Garry wrote:
> I have no x86 system to test that x86 config, though. How about
> v5.10-rc2 for this issue?

v5.10-rc2 is also broken here.

[ 251.941451][ T330] INFO: task systemd-udevd:551 blocked for more than 122 seconds.
[ 251.949176][ T330] Not tainted 5.10.0-rc2 #3
[ 251.954094][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 251.962633][ T330] task:systemd-udevd state:D stack:27160 pid: 551 ppid: 506 flags:0x00000324
[ 251.971707][ T330] Call Trace:
[ 251.974871][ T330] __schedule+0x71d/0x1b50
[ 251.979155][ T330] ? kcore_callback+0x1d/0x1d
[ 251.983709][ T330] schedule+0xbf/0x270
[ 251.987640][ T330] schedule_timeout+0x3fc/0x590
[ 251.992370][ T330] ? usleep_range+0x120/0x120
[ 251.996910][ T330] ? wait_for_completion+0x156/0x250
[ 252.002080][ T330] ? lock_downgrade+0x700/0x700
[ 252.006792][ T330] ? rcu_read_unlock+0x40/0x40
[ 252.011435][ T330] ? do_raw_spin_lock+0x121/0x290
[ 252.016324][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[ 252.022178][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
[ 252.027235][ T330] wait_for_completion+0x15e/0x250
[ 252.032226][ T330] ? wait_for_completion_interruptible+0x2f0/0x2f0
[ 252.038590][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[ 252.044443][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
[ 252.049502][ T330] __flush_work+0x42a/0x900
[ 252.053882][ T330] ? queue_delayed_work_on+0x90/0x90
[ 252.059025][ T330] ? __queue_work+0x463/0xf40
[ 252.063583][ T330] ? init_pwq+0x320/0x320
[ 252.067777][ T330] ? queue_work_on+0x5e/0x80
[ 252.072249][ T330] ? trace_hardirqs_on+0x1c/0x150
[ 252.077138][ T330] work_on_cpu+0xe7/0x130
[ 252.081347][ T330] ? flush_delayed_work+0xc0/0xc0
[ 252.086231][ T330] ? __mutex_unlock_slowpath+0xd4/0x670
[ 252.091655][ T330] ? work_debug_hint+0x30/0x30
[ 252.096284][ T330] ? pci_device_shutdown+0x80/0x80
[ 252.101274][ T330] ? cpumask_next_and+0x57/0x80
[ 252.105990][ T330] pci_device_probe+0x500/0x5c0
[ 252.110703][ T330] ? pci_device_remove+0x1f0/0x1f0
[ 252.115697][ T330] really_probe+0x207/0xad0
[ 252.120065][ T330] ? device_driver_attach+0x120/0x120
[ 252.125317][ T330] driver_probe_device+0x1f1/0x370
[ 252.130291][ T330] device_driver_attach+0xe5/0x120
[ 252.135281][ T330] __driver_attach+0xf0/0x260
[ 252.139827][ T330] bus_for_each_dev+0x117/0x1a0
[ 252.144552][ T330] ? subsys_dev_iter_exit+0x10/0x10
[ 252.149609][ T330] bus_add_driver+0x399/0x560
[ 252.154166][ T330] driver_register+0x189/0x310
[ 252.158795][ T330] ? 0xffffffffc05c5000
[ 252.162838][ T330] megasas_init+0x117/0x1000 [megaraid_sas]
[ 252.168593][ T330] do_one_initcall+0xf6/0x510
[ 252.173143][ T330] ? perf_trace_initcall_level+0x490/0x490
[ 252.178809][ T330] ? kasan_unpoison_shadow+0x30/0x40
[ 252.183973][ T330] ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[ 252.189728][ T330] ? do_init_module+0x49/0x6c0
[ 252.194370][ T330] ? kmem_cache_alloc_trace+0x12e/0x2a0
[ 252.199780][ T330] ? kasan_unpoison_shadow+0x30/0x40
[ 252.204942][ T330] do_init_module+0x1ed/0x6c0
[ 252.209479][ T330] load_module+0x4a25/0x5cf0
[ 252.213950][ T330] ? layout_and_allocate+0x2770/0x2770
[ 252.219271][ T330] ? __vmalloc_node+0x8d/0x100
[ 252.223913][ T330] ? kernel_read_file+0x485/0x5a0
[ 252.228796][ T330] ? kernel_read_file+0x305/0x5a0
[ 252.233696][ T330] ? __ia32_sys_fsconfig+0x6a0/0x6a0
[ 252.238841][ T330] ? __do_sys_finit_module+0xff/0x180
[ 252.244093][ T330] __do_sys_finit_module+0xff/0x180
[ 252.249155][ T330] ? __do_sys_init_module+0x1d0/0x1d0
[ 252.254403][ T330] ? __fget_files+0x1c3/0x2e0
[ 252.258940][ T330] do_syscall_64+0x33/0x40
[ 252.263234][ T330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 252.268984][ T330] RIP: 0033:0x7f7cf6a4878d
[ 252.273276][ T330] Code: Unable to access opcode bytes at RIP 0x7f7cf6a48763.
[ 252.280499][ T330] RSP: 002b:00007ffcfa94b978 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 252.288781][ T330] RAX: ffffffffffffffda RBX: 000055e01f48b730 RCX: 00007f7cf6a4878d
[ 252.296628][ T330] RDX: 0000000000000000 RSI: 00007f7cf75ba82d RDI: 0000000000000006
[ 252.304482][ T330] RBP: 00007f7cf75ba82d R08: 0000000000000000 R09: 00007ffcfa94baa0
[ 252.312331][ T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[ 252.320167][ T330] R13: 000055e01f433530 R14: 0000000000020000 R15: 0000000000000000
[ 252.328052][ T330]
[ 252.328052][ T330] Showing all locks held in the system:
[ 252.335722][ T330] 3 locks held by kworker/3:1/289:
[ 252.340697][ T330] #0: ffff8881001eb338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ec/0x1610
[ 252.350906][ T330] #1: ffffc90004ef7e00 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
[ 252.361725][ T330] #2: ffff88810dc600e0 (&shost->scan_mutex){+.+.}-{3:3}, at: scsi_scan_host_selected+0xde/0x260
[ 252.372132][ T330] 1 lock held by khungtaskd/330:
[ 252.376933][ T330] #0: ffffffffb42d2de0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.52+0x0/0x30
[ 252.387234][ T330] 1 lock held by systemd-journal/398:
[ 252.392489][ T330] 1 lock held by systemd-udevd/551:
[ 252.397550][ T330] #0: ffff888109a49218 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x37/0x120
[ 252.407085][ T330]
[ 252.409285][ T330] =============================================
[ 252.409285][ T330]


2020-11-04 15:25:11

by Qian Cai

Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Tue, 2020-11-03 at 08:04 -0500, Qian Cai wrote:
> On Tue, 2020-11-03 at 10:54 +0000, John Garry wrote:
> > I have no x86 system to test that x86 config, though. How about
> > v5.10-rc2 for this issue?
>
> v5.10-rc2 is also broken here.

John, Kashyap, any update on this? If this is going to take a while to fix it
properly, should I send a patch to revert this, or at least disable the feature
by default for megaraid_sas in the meantime, so it no longer breaks the
existing systems out there?

>
> [ 251.941451][ T330] INFO: task systemd-udevd:551 blocked for more than 122
> seconds.
> [ 251.949176][ T330] Not tainted 5.10.0-rc2 #3
> [ 251.954094][ T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 251.962633][ T330] task:systemd-udevd state:D stack:27160 pid: 551
> ppid: 506 flags:0x00000324
> [ 251.971707][ T330] Call Trace:
> [ 251.974871][ T330] __schedule+0x71d/0x1b50
> [ 251.979155][ T330] ? kcore_callback+0x1d/0x1d
> [ 251.983709][ T330] schedule+0xbf/0x270
> [ 251.987640][ T330] schedule_timeout+0x3fc/0x590
> [ 251.992370][ T330] ? usleep_range+0x120/0x120
> [ 251.996910][ T330] ? wait_for_completion+0x156/0x250
> [ 252.002080][ T330] ? lock_downgrade+0x700/0x700
> [ 252.006792][ T330] ? rcu_read_unlock+0x40/0x40
> [ 252.011435][ T330] ? do_raw_spin_lock+0x121/0x290
> [ 252.016324][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [ 252.022178][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
> [ 252.027235][ T330] wait_for_completion+0x15e/0x250
> [ 252.032226][ T330] ? wait_for_completion_interruptible+0x2f0/0x2f0
> [ 252.038590][ T330] ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [ 252.044443][ T330] ? _raw_spin_unlock_irq+0x1f/0x30
> [ 252.049502][ T330] __flush_work+0x42a/0x900
> [ 252.053882][ T330] ? queue_delayed_work_on+0x90/0x90
> [ 252.059025][ T330] ? __queue_work+0x463/0xf40
> [ 252.063583][ T330] ? init_pwq+0x320/0x320
> [ 252.067777][ T330] ? queue_work_on+0x5e/0x80
> [ 252.072249][ T330] ? trace_hardirqs_on+0x1c/0x150
> [ 252.077138][ T330] work_on_cpu+0xe7/0x130
> [ 252.081347][ T330] ? flush_delayed_work+0xc0/0xc0
> [ 252.086231][ T330] ? __mutex_unlock_slowpath+0xd4/0x670
> [ 252.091655][ T330] ? work_debug_hint+0x30/0x30
> [ 252.096284][ T330] ? pci_device_shutdown+0x80/0x80
> [ 252.101274][ T330] ? cpumask_next_and+0x57/0x80
> [ 252.105990][ T330] pci_device_probe+0x500/0x5c0
> [ 252.110703][ T330] ? pci_device_remove+0x1f0/0x1f0
> [ 252.115697][ T330] really_probe+0x207/0xad0
> [ 252.120065][ T330] ? device_driver_attach+0x120/0x120
> [ 252.125317][ T330] driver_probe_device+0x1f1/0x370
> [ 252.130291][ T330] device_driver_attach+0xe5/0x120
> [ 252.135281][ T330] __driver_attach+0xf0/0x260
> [ 252.139827][ T330] bus_for_each_dev+0x117/0x1a0
> [ 252.144552][ T330] ? subsys_dev_iter_exit+0x10/0x10
> [ 252.149609][ T330] bus_add_driver+0x399/0x560
> [ 252.154166][ T330] driver_register+0x189/0x310
> [ 252.158795][ T330] ? 0xffffffffc05c5000
> [ 252.162838][ T330] megasas_init+0x117/0x1000 [megaraid_sas]
> [ 252.168593][ T330] do_one_initcall+0xf6/0x510
> [ 252.173143][ T330] ? perf_trace_initcall_level+0x490/0x490
> [ 252.178809][ T330] ? kasan_unpoison_shadow+0x30/0x40
> [ 252.183973][ T330] ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> [ 252.189728][ T330] ? do_init_module+0x49/0x6c0
> [ 252.194370][ T330] ? kmem_cache_alloc_trace+0x12e/0x2a0
> [ 252.199780][ T330] ? kasan_unpoison_shadow+0x30/0x40
> [ 252.204942][ T330] do_init_module+0x1ed/0x6c0
> [ 252.209479][ T330] load_module+0x4a25/0x5cf0
> [ 252.213950][ T330] ? layout_and_allocate+0x2770/0x2770
> [ 252.219271][ T330] ? __vmalloc_node+0x8d/0x100
> [ 252.223913][ T330] ? kernel_read_file+0x485/0x5a0
> [ 252.228796][ T330] ? kernel_read_file+0x305/0x5a0
> [ 252.233696][ T330] ? __ia32_sys_fsconfig+0x6a0/0x6a0
> [ 252.238841][ T330] ? __do_sys_finit_module+0xff/0x180
> [ 252.244093][ T330] __do_sys_finit_module+0xff/0x180
> [ 252.249155][ T330] ? __do_sys_init_module+0x1d0/0x1d0
> [ 252.254403][ T330] ? __fget_files+0x1c3/0x2e0
> [ 252.258940][ T330] do_syscall_64+0x33/0x40
> [ 252.263234][ T330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 252.268984][ T330] RIP: 0033:0x7f7cf6a4878d
> [ 252.273276][ T330] Code: Unable to access opcode bytes at RIP
> 0x7f7cf6a48763.
> [ 252.280499][ T330] RSP: 002b:00007ffcfa94b978 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [ 252.288781][ T330] RAX: ffffffffffffffda RBX: 000055e01f48b730 RCX:
> 00007f7cf6a4878d
> [ 252.296628][ T330] RDX: 0000000000000000 RSI: 00007f7cf75ba82d RDI:
> 0000000000000006
> [ 252.304482][ T330] RBP: 00007f7cf75ba82d R08: 0000000000000000 R09:
> 00007ffcfa94baa0
> [ 252.312331][ T330] R10: 0000000000000006 R11: 0000000000000246 R12:
> 0000000000000000
> [ 252.320167][ T330] R13: 000055e01f433530 R14: 0000000000020000 R15:
> 0000000000000000
> [ 252.328052][ T330]
> [ 252.328052][ T330] Showing all locks held in the system:
> [ 252.335722][ T330] 3 locks held by kworker/3:1/289:
> [ 252.340697][ T330] #0: ffff8881001eb338 ((wq_completion)events){+.+.}-
> {0:0}, at: process_one_work+0x7ec/0x1610
> [ 252.350906][ T330] #1: ffffc90004ef7e00
> ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
> [ 252.361725][ T330] #2: ffff88810dc600e0 (&shost->scan_mutex){+.+.}-{3:3},
> at: scsi_scan_host_selected+0xde/0x260
> [ 252.372132][ T330] 1 lock held by khungtaskd/330:
> [ 252.376933][ T330] #0: ffffffffb42d2de0 (rcu_read_lock){....}-{1:2}, at:
> rcu_lock_acquire.constprop.52+0x0/0x30
> [ 252.387234][ T330] 1 lock held by systemd-journal/398:
> [ 252.392489][ T330] 1 lock held by systemd-udevd/551:
> [ 252.397550][ T330] #0: ffff888109a49218 (&dev->mutex){....}-{3:3}, at:
> device_driver_attach+0x37/0x120
> [ 252.407085][ T330]
> [ 252.409285][ T330] =============================================
> [ 252.409285][ T330]
>

2020-11-04 16:11:29

by Kashyap Desai

Subject: RE: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

> >
> > v5.10-rc2 is also broken here.
>
> John, Kashyap, any update on this? If this is going to take a while to fix
> it properly, should I send a patch to revert this, or at least disable the
> feature by default for megaraid_sas in the meantime, so it no longer breaks
> the existing systems out there?

I am trying to get similar h/w to try out. All my current h/w works fine.
Give me a couple of days.
If this is not an obviously common issue and needs time, we will go with the
module-parameter disable method.
I will let you know.

Kashyap



2020-11-05 00:30:59

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On 04/11/2020 16:07, Kashyap Desai wrote:
>>>
>>> v5.10-rc2 is also broken here.
>>
>> John, Kashyap, any update on this? If this is going to take a while to fix
>> it
>> proper, should I send a patch to revert this or at least disable the
>> feature by
>> default for megaraid_sas in the meantime, so it no longer breaks the
>> existing
>> systems out there?
>
> I am trying to get similar h/w to try out. All my current h/w works fine.
> Give me couple of days' time.
> If this is not obviously common issue and need time, we will go with module
> parameter disable method.
> I will let you know.

Hi Kashyap,

Please also consider just disabling the feature for this card, so that any
other possible issues are unearthed on other cards. Unfortunately, I don't
have this card or any x86 machine to test with, so I can't assist directly.

BTW, just to be clear, did you try the same .config as Qian Cai?

Thanks,
John

2020-11-06 19:30:27

by Sumit Saxena

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Wed, Nov 4, 2020 at 11:38 PM John Garry <[email protected]> wrote:
>
> On 04/11/2020 16:07, Kashyap Desai wrote:
> >>>
> >>> v5.10-rc2 is also broken here.
> >>
> >> John, Kashyap, any update on this? If this is going to take a while to fix
> >> it
> >> proper, should I send a patch to revert this or at least disable the
> >> feature by
> >> default for megaraid_sas in the meantime, so it no longer breaks the
> >> existing
> >> systems out there?
> >
> > I am trying to get similar h/w to try out. All my current h/w works fine.
> > Give me couple of days' time.
> > If this is not obviously common issue and need time, we will go with module
> > parameter disable method.
> > I will let you know.
>
> Hi Kashyap,
>
> Please also consider just disabling for this card, so any other possible
> issues are unearthed on other cards. I don't have this card or any x86
> machine to test it unfortunately to assist.
>
> BTW, just to be clear, did you try the same .config as Qian Cai?
>
> Thanks,
> John
I am able to hit the boot hang, with stack traces similar to those
reported by Qian, using the shared .config on an x86 machine.
In my case the system boots after a hang of 40-45 mins. Qian, is that
true for you as well?
With the module parameter "host_tagset_enable=0", the issue is not seen.
Below is a snippet of the dmesg logs/traces observed during system boot;
after a wait of 40-45 mins, the
drives attached to the megaraid_sas adapter are discovered:

========================================
[ 1969.502913] INFO: task systemd-udevd:906 can't die for more than
1720 seconds.
[ 1969.597725] task:systemd-udevd state:D stack:13456 pid: 906
ppid: 858 flags:0x00000324
[ 1969.597730] Call Trace:
[ 1969.597734] __schedule+0x263/0x7f0
[ 1969.597737] ? __lock_acquire+0x576/0xaf0
[ 1969.597739] ? wait_for_completion+0x7b/0x110
[ 1969.597741] schedule+0x4c/0xc0
[ 1969.597743] schedule_timeout+0x244/0x2e0
[ 1969.597745] ? find_held_lock+0x2d/0x90
[ 1969.597748] ? wait_for_completion+0xa6/0x110
[ 1969.597750] ? wait_for_completion+0x7b/0x110
[ 1969.597752] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 1969.597753] ? wait_for_completion+0x7b/0x110
[ 1969.597755] wait_for_completion+0xae/0x110
[ 1969.597757] __flush_work+0x269/0x4b0
[ 1969.597760] ? init_pwq+0xf0/0xf0
[ 1969.597763] work_on_cpu+0x9c/0xd0
[ 1969.597765] ? work_is_static_object+0x10/0x10
[ 1969.597768] ? pci_device_shutdown+0x30/0x30
[ 1969.597770] pci_device_probe+0x197/0x1b0
[ 1969.597773] really_probe+0xda/0x410
[ 1969.597776] driver_probe_device+0xd9/0x140
[ 1969.597778] device_driver_attach+0x4a/0x50
[ 1969.597780] __driver_attach+0x83/0x140
[ 1969.597782] ? device_driver_attach+0x50/0x50
[ 1969.597784] ? device_driver_attach+0x50/0x50
[ 1969.597787] bus_for_each_dev+0x74/0xc0
[ 1969.597789] bus_add_driver+0x14b/0x1f0
[ 1969.597791] ? 0xffffffffc04fb000
[ 1969.597793] driver_register+0x66/0xb0
[ 1969.597795] ? 0xffffffffc04fb000
[ 1969.597801] megasas_init+0xe7/0x1000 [megaraid_sas]
[ 1969.597803] do_one_initcall+0x62/0x300
[ 1969.597806] ? do_init_module+0x1d/0x200
[ 1969.597808] ? kmem_cache_alloc_trace+0x296/0x2d0
[ 1969.597811] do_init_module+0x55/0x200
[ 1969.597813] load_module+0x15f2/0x17b0
[ 1969.597816] ? __do_sys_finit_module+0xad/0x110
[ 1969.597818] __do_sys_finit_module+0xad/0x110
[ 1969.597820] do_syscall_64+0x33/0x40
[ 1969.597823] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1969.597825] RIP: 0033:0x7f66340262bd
[ 1969.597827] Code: Unable to access opcode bytes at RIP 0x7f6634026293.
[ 1969.597828] RSP: 002b:00007ffca1011f48 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[ 1969.597831] RAX: ffffffffffffffda RBX: 000055f6720cf370 RCX: 00007f66340262bd
[ 1969.597833] RDX: 0000000000000000 RSI: 00007f6634b9880d RDI: 0000000000000006
[ 1969.597835] RBP: 00007f6634b9880d R08: 0000000000000000 R09: 00007ffca1012070
[ 1969.597836] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[ 1969.597838] R13: 000055f6720cce70 R14: 0000000000020000 R15: 0000000000000000
[ 1969.597859]
Showing all locks held in the system:
[ 1969.597862] 2 locks held by kworker/0:0/5:
[ 1969.597863] #0: ffff9af800194b38
((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1e6/0x5e0
[ 1969.597872] #1: ffffbf3bc01f3e70
((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x1e6/0x5e0
[ 1969.597890] 3 locks held by kworker/0:1/7:
[ 1969.597960] 1 lock held by khungtaskd/643:
[ 1969.597962] #0: ffffffffa624cb60 (rcu_read_lock){....}-{1:2}, at:
rcu_lock_acquire.constprop.54+0x0/0x30
[ 1969.597982] 1 lock held by systemd-udevd/906:
[ 1969.597983] #0: ffff9af984a1c218 (&dev->mutex){....}-{3:3}, at:
device_driver_attach+0x18/0x50

[ 1969.598010] =============================================

[ 1983.242512] random: fast init done
[ 2071.928411] sd 0:2:0:0: [sda] 1951399936 512-byte logical blocks:
(999 GB/931 GiB)
[ 2071.928480] sd 0:2:2:0: [sdc] 1756889088 512-byte logical blocks:
(900 GB/838 GiB)
[ 2071.928537] sd 0:2:1:0: [sdb] 285474816 512-byte logical blocks:
(146 GB/136 GiB)
[ 2071.928580] sd 0:2:0:0: [sda] Write Protect is off
[ 2071.928625] sd 0:2:0:0: [sda] Mode Sense: 1f 00 00 08
[ 2071.928629] sd 0:2:2:0: [sdc] Write Protect is off
[ 2071.928669] sd 0:2:1:0: [sdb] Write Protect is off
[ 2071.928706] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 00 08
[ 2071.928844] sd 0:2:2:0: [sdc] Mode Sense: 1f 00 00 08
[ 2071.928848] sd 0:2:0:0: [sda] Write cache: disabled, read cache:
enabled, doesn't support DPO or FUA


================================

I am working on it and need some time for debugging. BTW, did anyone
try the "shared host tagset" patchset, with the .config shared by Qian,
on other adapters which are not really multiqueue at the HW level
but whose driver exposes multiple hardware queues (similar to
megaraid_sas)?

Thanks,
Sumit



2020-11-07 00:21:19

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
> I am able to hit the boot hang and similar kind of stack traces as
> reported by Qian with shared .config on x86 machine.
> In my case the system boots after a hang of 40-45 mins. Qian, is it
> true for you as well ?
I don't know. I had never waited for that long.

2020-11-09 08:53:57

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On 07/11/2020 00:17, Qian Cai wrote:
> On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
>> I am able to hit the boot hang and similar kind of stack traces as
>> reported by Qian with shared .config on x86 machine.
>> In my case the system boots after a hang of 40-45 mins. Qian, is it
>> true for you as well ?
> I don't know. I had never waited for that long.
>
> .
>

Hi Qian,

By chance, do you have an equivalent arm64 .config, enabling the same RH
config options?

I suppose I could try to do this myself, but an authentic version
would be nicer.

Thanks,
John

2020-11-09 13:43:59

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Mon, 2020-11-09 at 08:49 +0000, John Garry wrote:
> On 07/11/2020 00:17, Qian Cai wrote:
> > On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
> > > I am able to hit the boot hang and similar kind of stack traces as
> > > reported by Qian with shared .config on x86 machine.
> > > In my case the system boots after a hang of 40-45 mins. Qian, is it
> > > true for you as well ?
> > I don't know. I had never waited for that long.
> >
> > .
> >
>
> Hi Qian,
>
> By chance do have an equivalent arm64 .config, enabling the same RH
> config options?
>
> I suppose I could try do this myself also, but an authentic version
> would be nicer.
The closest one I have here is:
https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

but it only selects the Thunder X2 platform, and you would need to manually
select CONFIG_MEGARAID_SAS=m to start with; also, none of the arm64 systems
here have megaraid_sas.

2020-11-09 14:08:26

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On 09/11/2020 13:39, Qian Cai wrote:
>> I suppose I could try do this myself also, but an authentic version
>> would be nicer.
> The closest one I have here is:
> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
>
> but it only selects the Thunder X2 platform and needs to manually select
> CONFIG_MEGARAID_SAS=m to start with, but none of arm64 systems here have
> megaraid_sas.

Thanks, I'm confident I can fix it up to get it going on my Huawei arm64
D06CS.

That board has a megaraid_sas card. In addition, it also has hisi_sas
HW, another storage controller on which we enabled this same feature,
and which is exposing the problem.

I'll report back when I can.

Thanks,
John

2020-11-10 17:47:14

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On 09/11/2020 14:05, John Garry wrote:
> On 09/11/2020 13:39, Qian Cai wrote:
>>> I suppose I could try do this myself also, but an authentic version
>>> would be nicer.
>> The closest one I have here is:
>> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
>>
>> but it only selects the Thunder X2 platform and needs to manually select
>> CONFIG_MEGARAID_SAS=m to start with, but none of arm64 systems here have
>> megaraid_sas.
>
> Thanks, I'm confident I can fix it up to get it going on my Huawei arm64
> D06CS.
>
> So that board has a megaraid sas card. In addition, it also has hisi_sas
> HW, which is another storage controller which we enabled this same
> feature which is causing the problem.
>
> I'll report back when I can.

So I had to hack that arm64 config a bit to get it booting:
https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.10-megaraid-hang

Boot is OK on my board without the megaraid_sas card but with hisi_sas
HW (which enables the equivalent option that exposes the problem).

But the board with the megaraid_sas card boots very slowly, specifically
around the megaraid_sas probe:

: ttyS0 at MMIO 0x3f00002f8 (irq = 17, base_baud = 115200) is a 16550A
[ 50.023726][ T1] printk: console [ttyS0] enabled
[ 50.412597][ T1] megasas: 07.714.04.00-rc1
[ 50.436614][ T5] megaraid_sas 0000:08:00.0: FW now in Ready state
[ 50.450079][ T5] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63
bit consistent mask
[ 50.467811][ T5] megaraid_sas 0000:08:00.0: firmware supports msix
: (128)
[ 50.845995][ T5] megaraid_sas 0000:08:00.0: requested/available
msix 128/128
[ 50.861476][ T5] megaraid_sas 0000:08:00.0: current msix/online
cpus : (128/128)
[ 50.877616][ T5] megaraid_sas 0000:08:00.0: RDPQ mode : (enabled)
[ 50.891018][ T5] megaraid_sas 0000:08:00.0: Current firmware
supports maximum commands: 4077 LDIO threshold: 0
[ 51.262942][ T5] megaraid_sas 0000:08:00.0: Performance mode
:Latency (latency index = 1)
[ 51.280749][ T5] megaraid_sas 0000:08:00.0: FW supports sync cache
: Yes
[ 51.295451][ T5] megaraid_sas 0000:08:00.0:
megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 51.387474][ T5] megaraid_sas 0000:08:00.0: FW provided
supportMaxExtLDs: 1 max_lds: 64
[ 51.404931][ T5] megaraid_sas 0000:08:00.0: controller type
: MR(2048MB)
[ 51.419616][ T5] megaraid_sas 0000:08:00.0: Online Controller
Reset(OCR) : Enabled
[ 51.436132][ T5] megaraid_sas 0000:08:00.0: Secure JBOD support
: Yes
[ 51.450265][ T5] megaraid_sas 0000:08:00.0: NVMe passthru support
: Yes
[ 51.464757][ T5] megaraid_sas 0000:08:00.0: FW provided TM
TaskAbort/Reset timeout : 6 secs/60 secs
[ 51.484379][ T5] megaraid_sas 0000:08:00.0: JBOD sequence map
support : Yes
[ 51.499607][ T5] megaraid_sas 0000:08:00.0: PCI Lane Margining
support : No
[ 51.547610][ T5] megaraid_sas 0000:08:00.0: NVME page size
: (4096)
[ 51.608635][ T5] megaraid_sas 0000:08:00.0:
megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[ 51.630285][ T5] megaraid_sas 0000:08:00.0: INIT adapter done
[ 51.649854][ T5] megaraid_sas 0000:08:00.0: pci id
: (0x1000)/(0x0016)/(0x19e5)/(0xd215)
[ 51.667873][ T5] megaraid_sas 0000:08:00.0: unevenspan support : no
[ 51.681646][ T5] megaraid_sas 0000:08:00.0: firmware crash dump : no
[ 51.695596][ T5] megaraid_sas 0000:08:00.0: JBOD sequence map
: enabled
[ 51.711521][ T5] megaraid_sas 0000:08:00.0: Max firmware commands:
4076 shared with nr_hw_queues = 127
[ 51.733056][ T5] scsi host0: Avago SAS based MegaRAID driver
[ 65.304363][ T5] scsi 0:0:0:0: Direct-Access ATA SAMSUNG
MZ7KH1T9 404Q PQ: 0 ANSI: 6
[ 65.392401][ T5] scsi 0:0:1:0: Direct-Access ATA SAMSUNG
MZ7KH1T9 404Q PQ: 0 ANSI: 6
[ 79.508307][ T5] scsi 0:0:65:0: Enclosure HUAWEI
Expander 12Gx16 131 PQ: 0 ANSI: 6
[ 183.965109][ C14] random: fast init done

Notice the 14 and 104 second delays.

But it does boot fully and reaches the console. I'll watch for further
issues, which you guys seem to experience after a while.

Thanks,
John

2020-11-11 07:32:22

by Sumit Saxena

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Tue, Nov 10, 2020 at 11:12 PM John Garry <[email protected]> wrote:
>
> On 09/11/2020 14:05, John Garry wrote:
> > On 09/11/2020 13:39, Qian Cai wrote:
> >>> I suppose I could try do this myself also, but an authentic version
> >>> would be nicer.
> >> The closest one I have here is:
> >> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
> >>
> >> but it only selects the Thunder X2 platform and needs to manually select
> >> CONFIG_MEGARAID_SAS=m to start with, but none of arm64 systems here have
> >> megaraid_sas.
> >
> > Thanks, I'm confident I can fix it up to get it going on my Huawei arm64
> > D06CS.
> >
> > So that board has a megaraid sas card. In addition, it also has hisi_sas
> > HW, which is another storage controller which we enabled this same
> > feature which is causing the problem.
> >
> > I'll report back when I can.
>
> So I had to hack that arm64 config a bit to get it booting:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.10-megaraid-hang
>
> Boot is ok on my board without the megaraid sas card, but includes
> hisi_sas HW (which enables the equivalent option which is exposing the
> problem).
>
> But the board with the megaraid sas boots very slowly, specifically
> around the megaraid sas probe:
>
> : ttyS0 at MMIO 0x3f00002f8 (irq = 17, base_baud = 115200) is a 16550A
> [ 50.023726][ T1] printk: console [ttyS0] enabled
> [ 50.412597][ T1] megasas: 07.714.04.00-rc1
> [ 50.436614][ T5] megaraid_sas 0000:08:00.0: FW now in Ready state
> [ 50.450079][ T5] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63
> bit consistent mask
> [ 50.467811][ T5] megaraid_sas 0000:08:00.0: firmware supports msix
> : (128)
> [ 50.845995][ T5] megaraid_sas 0000:08:00.0: requested/available
> msix 128/128
> [ 50.861476][ T5] megaraid_sas 0000:08:00.0: current msix/online
> cpus : (128/128)
> [ 50.877616][ T5] megaraid_sas 0000:08:00.0: RDPQ mode : (enabled)
> [ 50.891018][ T5] megaraid_sas 0000:08:00.0: Current firmware
> supports maximum commands: 4077 LDIO threshold: 0
> [ 51.262942][ T5] megaraid_sas 0000:08:00.0: Performance mode
> :Latency (latency index = 1)
> [ 51.280749][ T5] megaraid_sas 0000:08:00.0: FW supports sync cache
> : Yes
> [ 51.295451][ T5] megaraid_sas 0000:08:00.0:
> megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> [ 51.387474][ T5] megaraid_sas 0000:08:00.0: FW provided
> supportMaxExtLDs: 1 max_lds: 64
> [ 51.404931][ T5] megaraid_sas 0000:08:00.0: controller type
> : MR(2048MB)
> [ 51.419616][ T5] megaraid_sas 0000:08:00.0: Online Controller
> Reset(OCR) : Enabled
> [ 51.436132][ T5] megaraid_sas 0000:08:00.0: Secure JBOD support
> : Yes
> [ 51.450265][ T5] megaraid_sas 0000:08:00.0: NVMe passthru support
> : Yes
> [ 51.464757][ T5] megaraid_sas 0000:08:00.0: FW provided TM
> TaskAbort/Reset timeout : 6 secs/60 secs
> [ 51.484379][ T5] megaraid_sas 0000:08:00.0: JBOD sequence map
> support : Yes
> [ 51.499607][ T5] megaraid_sas 0000:08:00.0: PCI Lane Margining
> support : No
> [ 51.547610][ T5] megaraid_sas 0000:08:00.0: NVME page size
> : (4096)
> [ 51.608635][ T5] megaraid_sas 0000:08:00.0:
> megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
> [ 51.630285][ T5] megaraid_sas 0000:08:00.0: INIT adapter done
> [ 51.649854][ T5] megaraid_sas 0000:08:00.0: pci id
> : (0x1000)/(0x0016)/(0x19e5)/(0xd215)
> [ 51.667873][ T5] megaraid_sas 0000:08:00.0: unevenspan support : no
> [ 51.681646][ T5] megaraid_sas 0000:08:00.0: firmware crash dump : no
> [ 51.695596][ T5] megaraid_sas 0000:08:00.0: JBOD sequence map
> : enabled
> [ 51.711521][ T5] megaraid_sas 0000:08:00.0: Max firmware commands:
> 4076 shared with nr_hw_queues = 127
> [ 51.733056][ T5] scsi host0: Avago SAS based MegaRAID driver
> [ 65.304363][ T5] scsi 0:0:0:0: Direct-Access ATA SAMSUNG
> MZ7KH1T9 404Q PQ: 0 ANSI: 6
> [ 65.392401][ T5] scsi 0:0:1:0: Direct-Access ATA SAMSUNG
> MZ7KH1T9 404Q PQ: 0 ANSI: 6
> [ 79.508307][ T5] scsi 0:0:65:0: Enclosure HUAWEI
> Expander 12Gx16 131 PQ: 0 ANSI: 6
> [ 183.965109][ C14] random: fast init done
>
> Notice the 14 and 104 second delays.
>
> But does boot fully to get to the console. I'll wait for further issues,
> which you guys seem to experience after a while.
>
> Thanks,
> John
The "megaraid_sas" driver calls scsi_scan_host() to discover SCSI
devices. In this failure case, scsi_scan_host() takes a long time
to complete, which delays system boot.
With "host_tagset" enabled, scsi_scan_host() takes around 20 mins.
With "host_tagset" disabled, scsi_scan_host() takes up to 5-8 mins.

The scan time depends upon the number of SCSI channels and the number
of devices per channel exposed by the LLD.
The megaraid_sas driver exposes 4 channels and 128 drives per channel.

Each target scan takes 2 seconds (in the failure case with host_tagset
enabled), which is why driver load only completes after ~20 minutes. See
below:

[ 299.725271] kobject: 'target18:0:96': free name
[ 301.681267] kobject: 'target18:0:97' (00000000987c7f11):
kobject_cleanup, parent 0000000000000000
[ 301.681269] kobject: 'target18:0:97' (00000000987c7f11): calling
ktype release
[ 301.681273] kobject: 'target18:0:97': free name
[ 303.575268] kobject: 'target18:0:98' (00000000a8c34149):
kobject_cleanup, parent 0000000000000000

In Qian's kernel .config, async SCSI scan is disabled, so in the failure
case the SCSI scan type is synchronous.
Below is the stack trace when scsi_scan_host() hangs:

[<0>] __wait_rcu_gp+0x134/0x170
[<0>] synchronize_rcu.part.80+0x53/0x60
[<0>] blk_free_flush_queue+0x12/0x30
[<0>] blk_mq_hw_sysfs_release+0x21/0x70
[<0>] kobject_release+0x46/0x150
[<0>] blk_mq_release+0xb4/0xf0
[<0>] blk_release_queue+0xc4/0x130
[<0>] kobject_release+0x46/0x150
[<0>] scsi_device_dev_release_usercontext+0x194/0x3f0
[<0>] execute_in_process_context+0x22/0xa0
[<0>] device_release+0x2e/0x80
[<0>] kobject_release+0x46/0x150
[<0>] scsi_alloc_sdev+0x2e7/0x310
[<0>] scsi_probe_and_add_lun+0x410/0xbd0
[<0>] __scsi_scan_target+0xf2/0x530
[<0>] scsi_scan_channel.part.7+0x51/0x70
[<0>] scsi_scan_host_selected+0xd4/0x140
[<0>] scsi_scan_host+0x198/0x1c0

This issue hits when lock-related debugging is enabled in the kernel
config. The following .config parameters (possibly only a subset of this
list) are required to hit the issue:

CONFIG_PREEMPT_COUNT=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_TRACE_IRQFLAGS_NMI=y
CONFIG_DEBUG_KOBJECT=y
CONFIG_PROVE_RCU=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y

When scsi_scan_host() hangs, there are no outstanding IOs in the
megaraid_sas driver-firmware stack, as the SCSI "host_busy" counter and
the megaraid_sas driver's internal counter are both "0".
Key takeaways:
1. The issue is observed when lock-related debugging is enabled, so it
is only seen in debug environments.
2. The issue seems to be related to the generic shared "host_tagset"
code whenever some kind of kernel debugging is enabled. We do not see an
immediate reason to hide this issue by disabling the
"host_tagset" feature.

John,
The issue may hit on ARM platforms too, using Qian's .config file with
other adapters (e.g. hisi_sas) as well. So I feel that disabling
"host_tagset" in the megaraid_sas driver will not help. It requires
debugging from the perspective of the entire shared host tag feature, as
the scsi_scan_host() wait time gets much worse when "host_tagset" is
enabled. I am also debugging in parallel and will share anything useful
I find.

Qian,
I need full dmesg logs from your setup with
megaraid_sas.host_tagset_enable=1 and
megaraid_sas.host_tagset_enable=0. Please wait for a long time. I just
want to make sure that whatever you observe is the same as mine.

Thanks,
Sumit

2020-11-11 09:31:59

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Wed, Nov 11, 2020 at 12:57:59PM +0530, Sumit Saxena wrote:
> On Tue, Nov 10, 2020 at 11:12 PM John Garry <[email protected]> wrote:
> >
> > On 09/11/2020 14:05, John Garry wrote:
> > > On 09/11/2020 13:39, Qian Cai wrote:
> > >>> I suppose I could try do this myself also, but an authentic version
> > >>> would be nicer.
> > >> The closest one I have here is:
> > >> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
> > >>
> > >> but it only selects the Thunder X2 platform and needs to manually select
> > >> CONFIG_MEGARAID_SAS=m to start with, but none of arm64 systems here have
> > >> megaraid_sas.
> > >
> > > Thanks, I'm confident I can fix it up to get it going on my Huawei arm64
> > > D06CS.
> > >
> > > So that board has a megaraid sas card. In addition, it also has hisi_sas
> > > HW, which is another storage controller which we enabled this same
> > > feature which is causing the problem.
> > >
> > > I'll report back when I can.
> >
> > So I had to hack that arm64 config a bit to get it booting:
> > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.10-megaraid-hang
> >
> > Boot is ok on my board without the megaraid sas card, but includes
> > hisi_sas HW (which enables the equivalent option which is exposing the
> > problem).
> >
> > But the board with the megaraid sas boots very slowly, specifically
> > around the megaraid sas probe:
> >
> > : ttyS0 at MMIO 0x3f00002f8 (irq = 17, base_baud = 115200) is a 16550A
> > [ 50.023726][ T1] printk: console [ttyS0] enabled
> > [ 50.412597][ T1] megasas: 07.714.04.00-rc1
> > [ 50.436614][ T5] megaraid_sas 0000:08:00.0: FW now in Ready state
> > [ 50.450079][ T5] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63
> > bit consistent mask
> > [ 50.467811][ T5] megaraid_sas 0000:08:00.0: firmware supports msix
> > : (128)
> > [ 50.845995][ T5] megaraid_sas 0000:08:00.0: requested/available
> > msix 128/128
> > [ 50.861476][ T5] megaraid_sas 0000:08:00.0: current msix/online
> > cpus : (128/128)
> > [ 50.877616][ T5] megaraid_sas 0000:08:00.0: RDPQ mode : (enabled)
> > [ 50.891018][ T5] megaraid_sas 0000:08:00.0: Current firmware
> > supports maximum commands: 4077 LDIO threshold: 0
> > [ 51.262942][ T5] megaraid_sas 0000:08:00.0: Performance mode
> > :Latency (latency index = 1)
> > [ 51.280749][ T5] megaraid_sas 0000:08:00.0: FW supports sync cache
> > : Yes
> > [ 51.295451][ T5] megaraid_sas 0000:08:00.0:
> > megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> > [ 51.387474][ T5] megaraid_sas 0000:08:00.0: FW provided
> > supportMaxExtLDs: 1 max_lds: 64
> > [ 51.404931][ T5] megaraid_sas 0000:08:00.0: controller type
> > : MR(2048MB)
> > [ 51.419616][ T5] megaraid_sas 0000:08:00.0: Online Controller
> > Reset(OCR) : Enabled
> > [ 51.436132][ T5] megaraid_sas 0000:08:00.0: Secure JBOD support
> > : Yes
> > [ 51.450265][ T5] megaraid_sas 0000:08:00.0: NVMe passthru support
> > : Yes
> > [ 51.464757][ T5] megaraid_sas 0000:08:00.0: FW provided TM
> > TaskAbort/Reset timeout : 6 secs/60 secs
> > [ 51.484379][ T5] megaraid_sas 0000:08:00.0: JBOD sequence map
> > support : Yes
> > [ 51.499607][ T5] megaraid_sas 0000:08:00.0: PCI Lane Margining
> > support : No
> > [ 51.547610][ T5] megaraid_sas 0000:08:00.0: NVME page size
> > : (4096)
> > [ 51.608635][ T5] megaraid_sas 0000:08:00.0:
> > megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
> > [ 51.630285][ T5] megaraid_sas 0000:08:00.0: INIT adapter done
> > [ 51.649854][ T5] megaraid_sas 0000:08:00.0: pci id
> > : (0x1000)/(0x0016)/(0x19e5)/(0xd215)
> > [ 51.667873][ T5] megaraid_sas 0000:08:00.0: unevenspan support : no
> > [ 51.681646][ T5] megaraid_sas 0000:08:00.0: firmware crash dump : no
> > [ 51.695596][ T5] megaraid_sas 0000:08:00.0: JBOD sequence map
> > : enabled
> > [ 51.711521][ T5] megaraid_sas 0000:08:00.0: Max firmware commands:
> > 4076 shared with nr_hw_queues = 127
> > [ 51.733056][ T5] scsi host0: Avago SAS based MegaRAID driver
> > [ 65.304363][ T5] scsi 0:0:0:0: Direct-Access ATA SAMSUNG
> > MZ7KH1T9 404Q PQ: 0 ANSI: 6
> > [ 65.392401][ T5] scsi 0:0:1:0: Direct-Access ATA SAMSUNG
> > MZ7KH1T9 404Q PQ: 0 ANSI: 6
> > [ 79.508307][ T5] scsi 0:0:65:0: Enclosure HUAWEI
> > Expander 12Gx16 131 PQ: 0 ANSI: 6
> > [ 183.965109][ C14] random: fast init done
> >
> > Notice the 14 and 104 second delays.
> >
> > But does boot fully to get to the console. I'll wait for further issues,
> > which you guys seem to experience after a while.
> >
> > Thanks,
> > John
> "megaraid_sas" driver calls “scsi_scan_host()” to discover SCSI
> devices. In this failure case, scsi_scan_host() is taking a long time
> to complete, hence causing delay in system boot.
> With "host_tagset" enabled, scsi_scan_host() takes around 20 mins.
> With "host_tagset" disabled, scsi_scan_host() takes upto 5-8 mins.
>
> The scan time depends upon the number of scsi channels and devices per
> scsi channel is exposed by LLD.
> megaraid_sas driver exposes 4 channels and 128 drives per channel.
>
> Each target scan takes 2 seconds (in case of failure with host_tagset
> enabled). That's why driver load completes after ~20 minutes. See
> below:
>
> [ 299.725271] kobject: 'target18:0:96': free name
> [ 301.681267] kobject: 'target18:0:97' (00000000987c7f11):
> kobject_cleanup, parent 0000000000000000
> [ 301.681269] kobject: 'target18:0:97' (00000000987c7f11): calling
> ktype release
> [ 301.681273] kobject: 'target18:0:97': free name
> [ 303.575268] kobject: 'target18:0:98' (00000000a8c34149):
> kobject_cleanup, parent 0000000000000000
>
> In Qian's kernel .config, async scsi scan is disabled so in failure
> case SCSI scan type is synchronous.
> Below is the stack trace when scsi_scan_host() hangs:
>
> [<0>] __wait_rcu_gp+0x134/0x170
> [<0>] synchronize_rcu.part.80+0x53/0x60
> [<0>] blk_free_flush_queue+0x12/0x30

Can this issue disappear by applying the following change?

diff --git a/block/blk-flush.c b/block/blk-flush.c
index e32958f0b687..b1fe6176d77f 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
INIT_LIST_HEAD(&fq->flush_queue[1]);
INIT_LIST_HEAD(&fq->flush_data_in_flight);

- lockdep_register_key(&fq->key);
- lockdep_set_class(&fq->mq_flush_lock, &fq->key);
-
return fq;

fail_rq:
@@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
if (!fq)
return;

- lockdep_unregister_key(&fq->key);
kfree(fq->flush_rq);
kfree(fq);
}


Thanks,
Ming

2020-11-11 11:40:46

by Sumit Saxena

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

>
> Can this issue disappear by applying the following change?
This change fixes the issue for me.

Qian,
Please retry after applying the change suggested by Ming.

Thanks,
Sumit
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e32958f0b687..b1fe6176d77f 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
> INIT_LIST_HEAD(&fq->flush_queue[1]);
> INIT_LIST_HEAD(&fq->flush_data_in_flight);
>
> - lockdep_register_key(&fq->key);
> - lockdep_set_class(&fq->mq_flush_lock, &fq->key);
> -
> return fq;
>
> fail_rq:
> @@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
> if (!fq)
> return;
>
> - lockdep_unregister_key(&fq->key);
> kfree(fq->flush_rq);
> kfree(fq);
> }
>
>
> Thanks,
> Ming
>

2020-11-11 11:53:50

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

>
> In Qian's kernel .config, async scsi scan is disabled so in failure
> case SCSI scan type is synchronous.
> Below is the stack trace when scsi_scan_host() hangs:
>
> [<0>] __wait_rcu_gp+0x134/0x170
> [<0>] synchronize_rcu.part.80+0x53/0x60
> [<0>] blk_free_flush_queue+0x12/0x30
> [<0>] blk_mq_hw_sysfs_release+0x21/0x70

this is per blk_mq_hw_ctx

> [<0>] kobject_release+0x46/0x150
> [<0>] blk_mq_release+0xb4/0xf0
> [<0>] blk_release_queue+0xc4/0x130
> [<0>] kobject_release+0x46/0x150
> [<0>] scsi_device_dev_release_usercontext+0x194/0x3f0
> [<0>] execute_in_process_context+0x22/0xa0
> [<0>] device_release+0x2e/0x80
> [<0>] kobject_release+0x46/0x150
> [<0>] scsi_alloc_sdev+0x2e7/0x310
> [<0>] scsi_probe_and_add_lun+0x410/0xbd0
> [<0>] __scsi_scan_target+0xf2/0x530
> [<0>] scsi_scan_channel.part.7+0x51/0x70
> [<0>] scsi_scan_host_selected+0xd4/0x140
> [<0>] scsi_scan_host+0x198/0x1c0
>
> This issue hits when lock related debugging is enabled in kernel config.
> kernel .config parameters(may be subset of this list) are required to
> hit the issue:
>
> CONFIG_PREEMPT_COUNT=y *
> CONFIG_UNINLINE_SPIN_UNLOCK=y *
> CONFIG_LOCK_STAT=y
> CONFIG_DEBUG_RT_MUTEXES=y *
> CONFIG_DEBUG_SPINLOCK=y *
> CONFIG_DEBUG_MUTEXES=y *
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y *
> CONFIG_DEBUG_RWSEMS=y *
> CONFIG_DEBUG_LOCK_ALLOC=y *
> CONFIG_LOCKDEP=y *
> CONFIG_DEBUG_LOCKDEP=y
> CONFIG_TRACE_IRQFLAGS=y *
> CONFIG_TRACE_IRQFLAGS_NMI=y
> CONFIG_DEBUG_KOBJECT=y
> CONFIG_PROVE_RCU=y *
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y *

(* marks the options that I enabled)

>
> When scsi_scan_host() hangs, there are no outstanding IOs in the
> megaraid_sas driver-firmware stack, as both the SCSI "host_busy" counter
> and the megaraid_sas driver's internal counter are "0".
> Key takeaways:
> 1. The issue is observed when lock-related debugging is enabled, so it
> is seen in debug environments.
> 2. The issue seems to be related to the generic shared "host_tagset"
> code whenever some kind of kernel debugging is enabled. We do not see an
> immediate reason to hide this issue by disabling the "host_tagset"
> feature.
>
> John,
> The issue may hit on the ARM platform too using Qian's .config file with
> other adapters (e.g. hisi_sas), so I feel disabling “host_tagset” in the
> megaraid_sas driver will not help. It requires debugging from the
> perspective of the entire shared host tag feature, as the
> scsi_scan_host() wait time worsens when "host_tagset" is enabled. I am
> also debugging in parallel, and if I find anything useful, I will share.

So isn't this then really related to how many HW queues we expose, which
just scales up the time? For megaraid_sas, it's 1->128 on my arm64
platform when host_tagset_enable=1.

As a hack, I tried this (while keeping host_tagset_enable=1):

@@ -6162,11 +6168,15 @@ static int megasas_init_fw(struct megasas_instance *instance)
 	else
 		instance->low_latency_index_start = 1;

-	num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+	num_msix_req = 6 + instance->low_latency_index_start;

(6 is an arbitrary small number)

And boot time is nearly the same as with host_tagset_enable=0.

For hisi_sas, the max HW queue count is only ever 16. In addition, we
don't scan each channel/id/lun for hisi_sas, as it has its own scan
handler.

>
> Qian,
> I need full dmesg logs from your setup with
> megaraid_sas.host_tagset_enable=1 and
> megaraid_sas.host_tagset_enable=0. Please wait for a long time. I just
> want to make sure that whatever you observe is the same as mine.
>

Thanks,
John

2020-11-11 14:44:26

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Wed, 2020-11-11 at 17:27 +0800, Ming Lei wrote:
> Can this issue disappear by applying the following change?

This makes the system boot again as well.

>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e32958f0b687..b1fe6176d77f 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node,
> int cmd_size,
> INIT_LIST_HEAD(&fq->flush_queue[1]);
> INIT_LIST_HEAD(&fq->flush_data_in_flight);
>
> - lockdep_register_key(&fq->key);
> - lockdep_set_class(&fq->mq_flush_lock, &fq->key);
> -
> return fq;
>
> fail_rq:
> @@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
> if (!fq)
> return;
>
> - lockdep_unregister_key(&fq->key);
> kfree(fq->flush_rq);
> kfree(fq);
> }
>
>
> Thanks,
> Ming

2020-11-11 15:09:23

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

On Wed, Nov 11, 2020 at 09:42:17AM -0500, Qian Cai wrote:
> On Wed, 2020-11-11 at 17:27 +0800, Ming Lei wrote:
> > Can this issue disappear by applying the following change?
>
> This makes the system boot again as well.

OK, actually it isn't necessary to register one new lock key for each
hctx (blk_flush_queue) instance; the current way is really overkill
because there can be lots of HW queues in one system.

The original lockdep warning can be avoided simply by setting one
nvme_loop-specific lock class. If one nvme_loop device is backed by
another nvme_loop device, we can still avoid the warning by killing the
direct end-io chain, or by assigning another lock class.

Will prepare one formal patch tomorrow.

Thanks,
Ming