2014-06-25 16:49:52

by Christoph Hellwig

Subject: scsi-mq V2

This is the second post of the scsi-mq series.

At this point the code is ready for merging and use by developers and early
adopters. The core blk-mq code isn't that suitable for slow devices
yet, mostly due to the lack of an I/O scheduler, but Jens is working on it.
Similarly there is no dm-multipath support for drivers using blk-mq yet,
but I'm working on it. It should also be noted that the code doesn't
actually support multiple hardware queues or fine-grained tuning of the
blk-mq parameters yet. All these could be added fairly easily as soon
as low-level drivers want to make use of them.

The amount of changes to the existing code is fairly small, and mostly
speedups or cleanups that apply to the old path as well. Because
of this I also haven't bothered to put it under a config option, just
like the blk-mq core.

The usage of blk-mq dramatically decreases CPU usage under all workloads, going
down from the 100% CPU usage that the old setup can easily hit to usually less
than 20% when maxing out storage subsystems with 512-byte reads and writes,
and it allows millions of IOPS to be achieved easily. Bart and Robert have
helped with some very detailed measurements that they might be able to send
in reply to this, although these usually involve significantly reworked
low-level drivers to avoid other bottlenecks.

One major objection to previous iterations of this code was the simple
replacement of the host_lock with atomic counters for the host and target
busy counters. The host_lock avoidance on its own already improves
performance, and with the patch to avoid maintaining the per-target busy
counter unless needed we now replace a lock round trip on the host_lock with
just a single atomic increment in the submission path, and a single atomic
decrement in the completion path, which should provide benefits even for the
oddest RISC architecture. Longer term I'd still love to get rid of these
entirely and use the counters in blk-mq, but due to the difference in how
they are maintained this doesn't seem feasible as long as we still need to
support the legacy request code path.
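
To make that concrete, here is a rough sketch of what the fast-path
accounting boils down to after the series (not the actual patch code; the
sketch_* helpers are made up for illustration):

	#include <linux/atomic.h>
	#include <scsi/scsi_device.h>
	#include <scsi/scsi_host.h>

	/*
	 * Per-command accounting, assuming the common case where the
	 * LLD sets no per-target can_queue limit: one atomic increment
	 * on submission, one atomic decrement on completion, and no
	 * host_lock round trip at all.
	 */
	static void sketch_submit_accounting(struct Scsi_Host *shost,
					     struct scsi_target *starget)
	{
		atomic_inc(&shost->host_busy);
		if (starget->can_queue > 0)	/* only a few drivers set this */
			atomic_inc(&starget->target_busy);
	}

	static void sketch_complete_accounting(struct Scsi_Host *shost,
					       struct scsi_target *starget)
	{
		if (starget->can_queue > 0)
			atomic_dec(&starget->target_busy);
		atomic_dec(&shost->host_busy);
	}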

Changes from V1:
- rebased on top of the core-for-3.17 branch, most notably the
scsi logging changes
- fixed handling of cmd_list to prevent crashes for some heavy
workloads
- fixed incorrect handling of !target->can_queue
- avoid scheduling a workqueue on I/O completions when no queues
are congested

In addition to the patches in this thread there is also a git tree available at:

git://git.infradead.org/users/hch/scsi.git scsi-mq.2

This work was sponsored by the ION division of Fusion IO.


2014-06-25 16:50:08

by Christoph Hellwig

Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

This patch adds support for an alternate I/O path in the scsi midlayer
which uses the blk-mq infrastructure instead of the legacy request code.

Use of blk-mq is fully transparent to drivers, although for now a host
template field is provided to opt out of blk-mq usage in case any unforeseen
incompatibilities arise.

In general replacing the legacy request code with blk-mq is a simple and
mostly mechanical transformation. The biggest exception is the new code
that deals with the fact that I/O submissions in blk-mq must happen from
process context, which slightly complicates the I/O completion handler.
The second biggest difference is that blk-mq is built around the concept
of preallocated requests that also include driver-specific data, which
in the SCSI context means the scsi_cmnd structure. This completely avoids
dynamic memory allocations in the fast path through I/O submission.

Due to the preallocated requests, the MQ code path exclusively uses the
host-wide shared tag allocator instead of a per-LUN one. This only
affects drivers actually using the block layer provided tag allocator
instead of their own. Unlike the old path, blk-mq always provides a tag,
although drivers don't have to use it.
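
For low-level drivers the practical upshot is that the command lives
directly behind the struct request, so looking it up is pure pointer
arithmetic. A rough sketch of the per-tag layout (illustrative only; the
real setup is in scsi_mq_setup_tags and scsi_mq_prep_fn below):

	#include <linux/blk-mq.h>
	#include <scsi/scsi_cmnd.h>

	/*
	 * Each tag's preallocation, cmd_size bytes behind the request:
	 *
	 *	struct request
	 *	struct scsi_cmnd		<- blk_mq_rq_to_pdu(rq)
	 *	LLD per-command data		(shost->hostt->cmd_size bytes)
	 *	struct scatterlist[]		(+ protection sglist, if any)
	 */
	static struct scsi_cmnd *sketch_rq_to_cmd(struct request *rq)
	{
		return blk_mq_rq_to_pdu(rq);	/* no allocation in the fast path */
	}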

For now the blk-mq path is disabled by default and must be enabled using
the "use_blk_mq" module parameter. Once the remaining work in the block
layer to make blk-mq more suitable for slow devices is complete I hope
to make it the default and eventually even remove the old code path.

Based on the earlier scsi-mq prototype by Nicholas Bellinger.

Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking and
various suggestions and code contributions.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/hosts.c | 30 ++-
drivers/scsi/scsi.c | 5 +-
drivers/scsi/scsi_lib.c | 475 +++++++++++++++++++++++++++++++++++++++------
drivers/scsi/scsi_priv.h | 3 +
drivers/scsi/scsi_scan.c | 5 +-
drivers/scsi/scsi_sysfs.c | 2 +
include/scsi/scsi_host.h | 18 +-
include/scsi/scsi_tcq.h | 28 ++-
8 files changed, 494 insertions(+), 72 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 0632eee..6322e6c 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
goto fail;
}

+ if (shost_use_blk_mq(shost)) {
+ error = scsi_mq_setup_tags(shost);
+ if (error)
+ goto fail;
+ }
+
+ /*
+ * Note that we allocate the freelist even for the MQ case for now,
+ * as we need a command set aside for scsi_reset_provider. Having
+ * the full host freelist and one command available for that is a
+ * little heavy-handed, but avoids introducing a special allocator
+ * just for this. Eventually the structure of scsi_reset_provider
+ * will need a major overhaul.
+ */
error = scsi_setup_command_freelist(shost);
if (error)
- goto fail;
+ goto out_destroy_tags;
+

if (!shost->shost_gendev.parent)
shost->shost_gendev.parent = dev ? dev : &platform_bus;
@@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,

error = device_add(&shost->shost_gendev);
if (error)
- goto out;
+ goto out_destroy_freelist;

pm_runtime_set_active(&shost->shost_gendev);
pm_runtime_enable(&shost->shost_gendev);
@@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
device_del(&shost->shost_dev);
out_del_gendev:
device_del(&shost->shost_gendev);
- out:
+ out_destroy_freelist:
scsi_destroy_command_freelist(shost);
+ out_destroy_tags:
+ if (shost_use_blk_mq(shost))
+ scsi_mq_destroy_tags(shost);
fail:
return error;
}
@@ -309,7 +327,9 @@ static void scsi_host_dev_release(struct device *dev)
}

scsi_destroy_command_freelist(shost);
- if (shost->bqt)
+ if (shost_use_blk_mq(shost) && shost->tag_set.tags)
+ scsi_mq_destroy_tags(shost);
+ else if (shost->bqt)
blk_free_tags(shost->bqt);

kfree(shost->shost_data);
@@ -436,6 +456,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
else
shost->dma_boundary = 0xffffffff;

+ shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
+
device_initialize(&shost->shost_gendev);
dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
shost->shost_gendev.bus = &scsi_bus_type;
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index b362058..c089812 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -809,7 +809,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
* is more IO than the LLD's can_queue (so there are not enuogh
* tags) request_fn's host queue ready check will handle it.
*/
- if (!sdev->host->bqt) {
+ if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
if (blk_queue_tagged(sdev->request_queue) &&
blk_queue_resize_tags(sdev->request_queue, tags) != 0)
goto out;
@@ -1363,6 +1363,9 @@ MODULE_LICENSE("GPL");
module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");

+bool scsi_use_blk_mq = false;
+module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
+
static int __init init_scsi(void)
{
int error;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 900b1c0..5d39cfc 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1,5 +1,6 @@
/*
- * scsi_lib.c Copyright (C) 1999 Eric Youngdale
+ * Copyright (C) 1999 Eric Youngdale
+ * Copyright (C) 2014 Christoph Hellwig
*
* SCSI queueing library.
* Initial versions: Eric Youngdale ([email protected]).
@@ -20,6 +21,7 @@
#include <linux/delay.h>
#include <linux/hardirq.h>
#include <linux/scatterlist.h>
+#include <linux/blk-mq.h>

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
@@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
}
}

+static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
+{
+ struct scsi_device *sdev = cmd->device;
+ struct request_queue *q = cmd->request->q;
+
+ blk_mq_requeue_request(cmd->request);
+ blk_mq_kick_requeue_list(q);
+ put_device(&sdev->sdev_gendev);
+}
+
/**
* __scsi_queue_insert - private queue insertion
* @cmd: The SCSI command being requeued
@@ -150,6 +162,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
* before blk_cleanup_queue() finishes.
*/
cmd->result = 0;
+ if (q->mq_ops) {
+ scsi_mq_requeue_cmd(cmd);
+ return;
+ }
spin_lock_irqsave(q->queue_lock, flags);
blk_requeue_request(q, cmd->request);
kblockd_schedule_work(&device->requeue_work);
@@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
atomic_dec(&sdev->device_busy);
}

+static void scsi_kick_queue(struct request_queue *q)
+{
+ if (q->mq_ops)
+ blk_mq_start_hw_queues(q);
+ else
+ blk_run_queue(q);
+}
+
/*
* Called for single_lun devices on IO completion. Clear starget_sdev_user,
* and call blk_run_queue for all the scsi_devices on the target -
@@ -332,7 +356,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
* but in most cases, we will be first. Ideally, each LU on the
* target would get some limited time or requests on the target.
*/
- blk_run_queue(current_sdev->request_queue);
+ scsi_kick_queue(current_sdev->request_queue);

spin_lock_irqsave(shost->host_lock, flags);
if (starget->starget_sdev_user)
@@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
continue;

spin_unlock_irqrestore(shost->host_lock, flags);
- blk_run_queue(sdev->request_queue);
+ scsi_kick_queue(sdev->request_queue);
spin_lock_irqsave(shost->host_lock, flags);

scsi_device_put(sdev);
@@ -438,7 +462,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
continue;
spin_unlock_irqrestore(shost->host_lock, flags);

- blk_run_queue(slq);
+ scsi_kick_queue(slq);
blk_put_queue(slq);

spin_lock_irqsave(shost->host_lock, flags);
@@ -469,7 +493,10 @@ static void scsi_run_queue(struct request_queue *q)
if (!list_empty(&sdev->host->starved_list))
scsi_starved_list_run(sdev->host);

- blk_run_queue(q);
+ if (q->mq_ops)
+ blk_mq_start_stopped_hw_queues(q, false);
+ else
+ blk_run_queue(q);
}

void scsi_requeue_run_queue(struct work_struct *work)
@@ -567,25 +594,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
return mempool_alloc(sgp->pool, gfp_mask);
}

-static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
+static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
{
- __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
+ if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
+ return;
+ __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
}

static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, bool mq)
{
+ struct scatterlist *first_chunk = NULL;
int ret;

BUG_ON(!nents);

+ if (mq) {
+ if (nents <= SCSI_MAX_SG_SEGMENTS) {
+ sdb->table.nents = nents;
+ sg_init_table(sdb->table.sgl, sdb->table.nents);
+ return 0;
+ }
+ first_chunk = sdb->table.sgl;
+ }
+
ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
- NULL, gfp_mask, scsi_sg_alloc);
+ first_chunk, gfp_mask, scsi_sg_alloc);
if (unlikely(ret))
- scsi_free_sgtable(sdb);
+ scsi_free_sgtable(sdb, mq);
return ret;
}

+static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
+{
+ if (cmd->request->cmd_type == REQ_TYPE_FS) {
+ struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
+
+ if (drv->uninit_command)
+ drv->uninit_command(cmd);
+ }
+}
+
+static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
+{
+ if (cmd->sdb.table.nents)
+ scsi_free_sgtable(&cmd->sdb, true);
+ if (cmd->request->next_rq && cmd->request->next_rq->special)
+ scsi_free_sgtable(cmd->request->next_rq->special, true);
+ if (scsi_prot_sg_count(cmd))
+ scsi_free_sgtable(cmd->prot_sdb, true);
+}
+
+static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
+{
+ struct scsi_device *sdev = cmd->device;
+ unsigned long flags;
+
+ BUG_ON(list_empty(&cmd->list));
+
+ scsi_mq_free_sgtables(cmd);
+ scsi_uninit_cmd(cmd);
+
+ spin_lock_irqsave(&sdev->list_lock, flags);
+ list_del_init(&cmd->list);
+ spin_unlock_irqrestore(&sdev->list_lock, flags);
+}
+
/*
* Function: scsi_release_buffers()
*
@@ -605,12 +679,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
void scsi_release_buffers(struct scsi_cmnd *cmd)
{
if (cmd->sdb.table.nents)
- scsi_free_sgtable(&cmd->sdb);
+ scsi_free_sgtable(&cmd->sdb, false);

memset(&cmd->sdb, 0, sizeof(cmd->sdb));

if (scsi_prot_sg_count(cmd))
- scsi_free_sgtable(cmd->prot_sdb);
+ scsi_free_sgtable(cmd->prot_sdb, false);
}
EXPORT_SYMBOL(scsi_release_buffers);

@@ -618,7 +692,7 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
{
struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;

- scsi_free_sgtable(bidi_sdb);
+ scsi_free_sgtable(bidi_sdb, false);
kmem_cache_free(scsi_sdb_cache, bidi_sdb);
cmd->request->next_rq->special = NULL;
}
@@ -629,8 +703,6 @@ static bool scsi_end_request(struct request *req, int error,
struct scsi_cmnd *cmd = req->special;
struct scsi_device *sdev = cmd->device;
struct request_queue *q = sdev->request_queue;
- unsigned long flags;
-

if (blk_update_request(req, error, bytes))
return true;
@@ -643,14 +715,38 @@ static bool scsi_end_request(struct request *req, int error,
if (blk_queue_add_random(q))
add_disk_randomness(req->rq_disk);

- spin_lock_irqsave(q->queue_lock, flags);
- blk_finish_request(req, error);
- spin_unlock_irqrestore(q->queue_lock, flags);
+ if (req->mq_ctx) {
+ /*
+ * In the MQ case the command gets freed by __blk_mq_end_io,
+ * so we have to do all cleanup that depends on it earlier.
+ *
+ * We also can't kick the queues from irq context, so we
+ * will have to defer it to a workqueue.
+ */
+ scsi_mq_uninit_cmd(cmd);
+
+ __blk_mq_end_io(req, error);
+
+ if (scsi_target(sdev)->single_lun ||
+ !list_empty(&sdev->host->starved_list))
+ kblockd_schedule_work(&sdev->requeue_work);
+ else
+ blk_mq_start_stopped_hw_queues(q, true);
+
+ put_device(&sdev->sdev_gendev);
+ } else {
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_finish_request(req, error);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ if (bidi_bytes)
+ scsi_release_bidi_buffers(cmd);
+ scsi_release_buffers(cmd);
+ scsi_next_command(cmd);
+ }

- if (bidi_bytes)
- scsi_release_bidi_buffers(cmd);
- scsi_release_buffers(cmd);
- scsi_next_command(cmd);
return false;
}

@@ -981,8 +1077,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
/* Unprep the request and put it back at the head of the queue.
* A new command will be prepared and issued.
*/
- scsi_release_buffers(cmd);
- scsi_requeue_command(q, cmd);
+ if (q->mq_ops) {
+ cmd->request->cmd_flags &= ~REQ_DONTPREP;
+ scsi_mq_uninit_cmd(cmd);
+ scsi_mq_requeue_cmd(cmd);
+ } else {
+ scsi_release_buffers(cmd);
+ scsi_requeue_command(q, cmd);
+ }
break;
case ACTION_RETRY:
/* Retry the same command immediately */
@@ -1004,9 +1106,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
* If sg table allocation fails, requeue request later.
*/
if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
- gfp_mask))) {
+ gfp_mask, req->mq_ctx != NULL)))
return BLKPREP_DEFER;
- }

/*
* Next, walk the list, and fill in the addresses and sizes of
@@ -1034,21 +1135,27 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
{
struct scsi_device *sdev = cmd->device;
struct request *rq = cmd->request;
+ bool is_mq = (rq->mq_ctx != NULL);
+ int error;

- int error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
+ error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
if (error)
goto err_exit;

if (blk_bidi_rq(rq)) {
- struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
- scsi_sdb_cache, GFP_ATOMIC);
- if (!bidi_sdb) {
- error = BLKPREP_DEFER;
- goto err_exit;
+ if (!rq->q->mq_ops) {
+ struct scsi_data_buffer *bidi_sdb =
+ kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
+ if (!bidi_sdb) {
+ error = BLKPREP_DEFER;
+ goto err_exit;
+ }
+
+ rq->next_rq->special = bidi_sdb;
}

- rq->next_rq->special = bidi_sdb;
- error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
+ error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
+ GFP_ATOMIC);
if (error)
goto err_exit;
}
@@ -1060,7 +1167,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
BUG_ON(prot_sdb == NULL);
ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);

- if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
+ if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
error = BLKPREP_DEFER;
goto err_exit;
}
@@ -1074,13 +1181,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
cmd->prot_sdb->table.nents = count;
}

- return BLKPREP_OK ;
-
+ return BLKPREP_OK;
err_exit:
- scsi_release_buffers(cmd);
- cmd->request->special = NULL;
- scsi_put_command(cmd);
- put_device(&sdev->sdev_gendev);
+ if (is_mq) {
+ scsi_mq_free_sgtables(cmd);
+ } else {
+ scsi_release_buffers(cmd);
+ cmd->request->special = NULL;
+ scsi_put_command(cmd);
+ put_device(&sdev->sdev_gendev);
+ }
return error;
}
EXPORT_SYMBOL(scsi_init_io);
@@ -1295,13 +1405,7 @@ out:

static void scsi_unprep_fn(struct request_queue *q, struct request *req)
{
- if (req->cmd_type == REQ_TYPE_FS) {
- struct scsi_cmnd *cmd = req->special;
- struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
-
- if (drv->uninit_command)
- drv->uninit_command(cmd);
- }
+ scsi_uninit_cmd(req->special);
}

/*
@@ -1318,7 +1422,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
busy = atomic_inc_return(&sdev->device_busy) - 1;
if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
if (atomic_dec_return(&sdev->device_blocked) > 0) {
- blk_delay_queue(q, SCSI_QUEUE_DELAY);
+ /*
+ * For the MQ case we take care of this in the caller.
+ */
+ if (!q->mq_ops)
+ blk_delay_queue(q, SCSI_QUEUE_DELAY);
goto out_dec;
}
SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
@@ -1688,6 +1796,188 @@ out_delay:
blk_delay_queue(q, SCSI_QUEUE_DELAY);
}

+static inline int prep_to_mq(int ret)
+{
+ switch (ret) {
+ case BLKPREP_OK:
+ return 0;
+ case BLKPREP_DEFER:
+ return BLK_MQ_RQ_QUEUE_BUSY;
+ default:
+ return BLK_MQ_RQ_QUEUE_ERROR;
+ }
+}
+
+static int scsi_mq_prep_fn(struct request *req)
+{
+ struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+ struct scsi_device *sdev = req->q->queuedata;
+ struct Scsi_Host *shost = sdev->host;
+ unsigned char *sense_buf = cmd->sense_buffer;
+ struct scatterlist *sg;
+
+ memset(cmd, 0, sizeof(struct scsi_cmnd));
+
+ req->special = cmd;
+
+ cmd->request = req;
+ cmd->device = sdev;
+ cmd->sense_buffer = sense_buf;
+
+ cmd->tag = req->tag;
+
+ req->cmd = req->__cmd;
+ cmd->cmnd = req->cmd;
+ cmd->prot_op = SCSI_PROT_NORMAL;
+
+ INIT_LIST_HEAD(&cmd->list);
+ INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
+ cmd->jiffies_at_alloc = jiffies;
+
+ /*
+ * XXX: cmd_list lookups are only used by two drivers, try to get
+ * rid of this list in common code.
+ */
+ spin_lock_irq(&sdev->list_lock);
+ list_add_tail(&cmd->list, &sdev->cmd_list);
+ spin_unlock_irq(&sdev->list_lock);
+
+ sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
+ cmd->sdb.table.sgl = sg;
+
+ if (scsi_host_get_prot(shost)) {
+ cmd->prot_sdb = (void *)sg +
+ shost->sg_tablesize * sizeof(struct scatterlist);
+ memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
+
+ cmd->prot_sdb->table.sgl =
+ (struct scatterlist *)(cmd->prot_sdb + 1);
+ }
+
+ if (blk_bidi_rq(req)) {
+ struct request *next_rq = req->next_rq;
+ struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
+
+ memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
+ bidi_sdb->table.sgl =
+ (struct scatterlist *)(bidi_sdb + 1);
+
+ next_rq->special = bidi_sdb;
+ }
+
+ switch (req->cmd_type) {
+ case REQ_TYPE_FS:
+ return scsi_cmd_to_driver(cmd)->init_command(cmd);
+ case REQ_TYPE_BLOCK_PC:
+ return scsi_setup_blk_pc_cmnd(cmd->device, req);
+ default:
+ return BLKPREP_KILL;
+ }
+}
+
+static void scsi_mq_done(struct scsi_cmnd *cmd)
+{
+ trace_scsi_dispatch_cmd_done(cmd);
+ blk_mq_complete_request(cmd->request);
+}
+
+static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+{
+ struct request_queue *q = req->q;
+ struct scsi_device *sdev = q->queuedata;
+ struct Scsi_Host *shost = sdev->host;
+ struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+ int ret;
+ int reason;
+
+ ret = prep_to_mq(scsi_prep_state_check(sdev, req));
+ if (ret)
+ goto out;
+
+ ret = BLK_MQ_RQ_QUEUE_BUSY;
+ if (!get_device(&sdev->sdev_gendev))
+ goto out;
+
+ if (!scsi_dev_queue_ready(q, sdev))
+ goto out_put_device;
+ if (!scsi_target_queue_ready(shost, sdev))
+ goto out_dec_device_busy;
+ if (!scsi_host_queue_ready(q, shost, sdev))
+ goto out_dec_target_busy;
+
+ if (!(req->cmd_flags & REQ_DONTPREP)) {
+ ret = prep_to_mq(scsi_mq_prep_fn(req));
+ if (ret)
+ goto out_dec_host_busy;
+ req->cmd_flags |= REQ_DONTPREP;
+ }
+
+ scsi_init_cmd_errh(cmd);
+ cmd->scsi_done = scsi_mq_done;
+
+ reason = scsi_dispatch_cmd(cmd);
+ if (reason) {
+ scsi_set_blocked(cmd, reason);
+ ret = BLK_MQ_RQ_QUEUE_BUSY;
+ goto out_dec_host_busy;
+ }
+
+ return BLK_MQ_RQ_QUEUE_OK;
+
+out_dec_host_busy:
+ cancel_delayed_work(&cmd->abort_work);
+ atomic_dec(&shost->host_busy);
+out_dec_target_busy:
+ if (scsi_target(sdev)->can_queue > 0)
+ atomic_dec(&scsi_target(sdev)->target_busy);
+out_dec_device_busy:
+ atomic_dec(&sdev->device_busy);
+out_put_device:
+ put_device(&sdev->sdev_gendev);
+out:
+ switch (ret) {
+ case BLK_MQ_RQ_QUEUE_BUSY:
+ blk_mq_stop_hw_queue(hctx);
+ if (atomic_read(&sdev->device_busy) == 0 &&
+ !scsi_device_blocked(sdev))
+ blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+ break;
+ case BLK_MQ_RQ_QUEUE_ERROR:
+ /*
+ * Make sure to release all allocated ressources when
+ * we hit an error, as we will never see this command
+ * again.
+ */
+ if (req->cmd_flags & REQ_DONTPREP)
+ scsi_mq_uninit_cmd(cmd);
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
+
+static int scsi_init_request(void *data, struct request *rq,
+ unsigned int hctx_idx, unsigned int request_idx,
+ unsigned int numa_node)
+{
+ struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+ cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
+ numa_node);
+ if (!cmd->sense_buffer)
+ return -ENOMEM;
+ return 0;
+}
+
+static void scsi_exit_request(void *data, struct request *rq,
+ unsigned int hctx_idx, unsigned int request_idx)
+{
+ struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+ kfree(cmd->sense_buffer);
+}
+
u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
{
struct device *host_dev;
@@ -1710,16 +2000,10 @@ u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
}
EXPORT_SYMBOL(scsi_calculate_bounce_limit);

-struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
- request_fn_proc *request_fn)
+static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
{
- struct request_queue *q;
struct device *dev = shost->dma_dev;

- q = blk_init_queue(request_fn, NULL);
- if (!q)
- return NULL;
-
/*
* this limit is imposed by hardware restrictions
*/
@@ -1750,7 +2034,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
* blk_queue_update_dma_alignment() later.
*/
blk_queue_dma_alignment(q, 0x03);
+}

+struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
+ request_fn_proc *request_fn)
+{
+ struct request_queue *q;
+
+ q = blk_init_queue(request_fn, NULL);
+ if (!q)
+ return NULL;
+ __scsi_init_queue(shost, q);
return q;
}
EXPORT_SYMBOL(__scsi_alloc_queue);
@@ -1771,6 +2065,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
return q;
}

+static struct blk_mq_ops scsi_mq_ops = {
+ .map_queue = blk_mq_map_queue,
+ .queue_rq = scsi_queue_rq,
+ .complete = scsi_softirq_done,
+ .timeout = scsi_times_out,
+ .init_request = scsi_init_request,
+ .exit_request = scsi_exit_request,
+};
+
+struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
+{
+ sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
+ if (IS_ERR(sdev->request_queue))
+ return NULL;
+
+ sdev->request_queue->queuedata = sdev;
+ __scsi_init_queue(sdev->host, sdev->request_queue);
+ return sdev->request_queue;
+}
+
+int scsi_mq_setup_tags(struct Scsi_Host *shost)
+{
+ unsigned int cmd_size, sgl_size, tbl_size;
+
+ tbl_size = shost->sg_tablesize;
+ if (tbl_size > SCSI_MAX_SG_SEGMENTS)
+ tbl_size = SCSI_MAX_SG_SEGMENTS;
+ sgl_size = tbl_size * sizeof(struct scatterlist);
+ cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
+ if (scsi_host_get_prot(shost))
+ cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+
+ memset(&shost->tag_set, 0, sizeof(shost->tag_set));
+ shost->tag_set.ops = &scsi_mq_ops;
+ shost->tag_set.nr_hw_queues = 1;
+ shost->tag_set.queue_depth = shost->can_queue;
+ shost->tag_set.cmd_size = cmd_size;
+ shost->tag_set.numa_node = NUMA_NO_NODE;
+ shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ shost->tag_set.driver_data = shost;
+
+ return blk_mq_alloc_tag_set(&shost->tag_set);
+}
+
+void scsi_mq_destroy_tags(struct Scsi_Host *shost)
+{
+ blk_mq_free_tag_set(&shost->tag_set);
+}
+
/*
* Function: scsi_block_requests()
*
@@ -2516,9 +2859,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
* block layer from calling the midlayer with this device's
* request queue.
*/
- spin_lock_irqsave(q->queue_lock, flags);
- blk_stop_queue(q);
- spin_unlock_irqrestore(q->queue_lock, flags);
+ if (q->mq_ops) {
+ blk_mq_stop_hw_queues(q);
+ } else {
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_stop_queue(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ }

return 0;
}
@@ -2564,9 +2911,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
sdev->sdev_state != SDEV_OFFLINE)
return -EINVAL;

- spin_lock_irqsave(q->queue_lock, flags);
- blk_start_queue(q);
- spin_unlock_irqrestore(q->queue_lock, flags);
+ if (q->mq_ops) {
+ blk_mq_start_stopped_hw_queues(q, false);
+ } else {
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_start_queue(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ }

return 0;
}
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index a45d1c2..12b8e1b 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
extern void scsi_run_host_queues(struct Scsi_Host *shost);
extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
+extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
+extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
+extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
extern int scsi_init_queue(void);
extern void scsi_exit_queue(void);
struct request_queue;
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 4a6e4ba..b91cfaf 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
*/
sdev->borken = 1;

- sdev->request_queue = scsi_alloc_queue(sdev);
+ if (shost_use_blk_mq(shost))
+ sdev->request_queue = scsi_mq_alloc_queue(sdev);
+ else
+ sdev->request_queue = scsi_alloc_queue(sdev);
if (!sdev->request_queue) {
/* release fn is set up in scsi_sysfs_device_initialise, so
* have to free and put manually here */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index deef063..6c9227f 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,

static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);

+shost_rd_attr(use_blk_mq, "%d\n");
shost_rd_attr(unique_id, "%u\n");
shost_rd_attr(cmd_per_lun, "%hd\n");
shost_rd_attr(can_queue, "%hd\n");
@@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);

static struct attribute *scsi_sysfs_shost_attrs[] = {
+ &dev_attr_use_blk_mq.attr,
&dev_attr_unique_id.attr,
&dev_attr_host_busy.attr,
&dev_attr_cmd_per_lun.attr,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7f9bbda..b54511e 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -7,6 +7,7 @@
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/seq_file.h>
+#include <linux/blk-mq.h>
#include <scsi/scsi.h>

struct request_queue;
@@ -531,6 +532,9 @@ struct scsi_host_template {
*/
unsigned int cmd_size;
struct scsi_host_cmd_pool *cmd_pool;
+
+ /* temporary flag to disable blk-mq I/O path */
+ bool disable_blk_mq;
};

/*
@@ -601,7 +605,10 @@ struct Scsi_Host {
* Area to keep a shared tag map (if needed, will be
* NULL if not).
*/
- struct blk_queue_tag *bqt;
+ union {
+ struct blk_queue_tag *bqt;
+ struct blk_mq_tag_set tag_set;
+ };

atomic_t host_busy; /* commands actually active on low-level */
atomic_t host_blocked;
@@ -693,6 +700,8 @@ struct Scsi_Host {
/* The controller does not support WRITE SAME */
unsigned no_write_same:1;

+ unsigned use_blk_mq:1;
+
/*
* Optional work queue to be utilized by the transport
*/
@@ -793,6 +802,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
shost->tmf_in_progress;
}

+extern bool scsi_use_blk_mq;
+
+static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
+{
+ return shost->use_blk_mq;
+}
+
extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
extern void scsi_flush_work(struct Scsi_Host *);

diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 81dd12e..cdcc90b 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
if (!sdev->tagged_supported)
return;

- if (!blk_queue_tagged(sdev->request_queue))
+ if (!shost_use_blk_mq(sdev->host) &&
+ blk_queue_tagged(sdev->request_queue))
blk_queue_init_tags(sdev->request_queue, depth,
sdev->host->bqt);

@@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
**/
static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
{
- if (blk_queue_tagged(sdev->request_queue))
+ if (!shost_use_blk_mq(sdev->host) &&
+ blk_queue_tagged(sdev->request_queue))
blk_queue_free_tags(sdev->request_queue);
scsi_adjust_queue_depth(sdev, 0, depth);
}
@@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
return 0;
}

+static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
+ unsigned int hw_ctx, int tag)
+{
+ struct request *req;
+
+ req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
+ return req ? (struct scsi_cmnd *)req->special : NULL;
+}
+
/**
* scsi_find_tag - find a tagged command by device
* @SDpnt: pointer to the ScSI device
@@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
**/
static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
{
-
struct request *req;

if (tag != SCSI_NO_TAG) {
+ if (shost_use_blk_mq(sdev->host))
+ return scsi_mq_find_tag(sdev->host, 0, tag);
+
req = blk_queue_find_tag(sdev->request_queue, tag);
return req ? (struct scsi_cmnd *)req->special : NULL;
}
@@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
return sdev->current_cmnd;
}

+
/**
* scsi_init_shared_tag_map - create a shared tag map
* @shost: the host to share the tag map among all devices
@@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
{
/*
+ * We always have a shared tag map around when using blk-mq.
+ */
+ if (shost_use_blk_mq(shost))
+ return 0;
+
+ /*
* If the shared tag map isn't already initialized, do it now.
* This saves callers from having to check ->bqt when setting up
* devices on the shared host (for libata)
@@ -165,6 +185,8 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
struct request *req;

if (tag != SCSI_NO_TAG) {
+ if (shost_use_blk_mq(shost))
+ return scsi_mq_find_tag(shost, 0, tag);
req = blk_map_queue_find_tag(shost->bqt, tag);
return req ? (struct scsi_cmnd *)req->special : NULL;
}
--
1.7.10.4

2014-06-25 16:50:04

by Christoph Hellwig

Subject: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit

This saves us an atomic operation for each I/O submission and completion
in the usual case where the driver doesn't set a per-target can_queue
value. Only a few iscsi hardware offload drivers set the per-target
can_queue value at the moment.
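
Reduced to its essentials the ready check now looks roughly like this (a
sketch only; the real scsi_target_queue_ready() below also handles
target_blocked and the starved list):

	#include <linux/atomic.h>
	#include <scsi/scsi_device.h>

	static int sketch_target_queue_ready(struct scsi_target *starget)
	{
		if (starget->can_queue <= 0)
			return 1;	/* no limit configured: skip the atomics */

		if (atomic_inc_return(&starget->target_busy) > starget->can_queue) {
			atomic_dec(&starget->target_busy);	/* over the limit */
			return 0;
		}
		return 1;
	}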

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a39d5ba..a64b9d3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -295,7 +295,8 @@ void scsi_device_unbusy(struct scsi_device *sdev)
unsigned long flags;

atomic_dec(&shost->host_busy);
- atomic_dec(&starget->target_busy);
+ if (starget->can_queue > 0)
+ atomic_dec(&starget->target_busy);

if (unlikely(scsi_host_in_recovery(shost) &&
(shost->host_failed || shost->host_eh_scheduled))) {
@@ -1335,6 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
spin_unlock_irq(shost->host_lock);
}

+ if (starget->can_queue <= 0)
+ return 1;
+
busy = atomic_inc_return(&starget->target_busy) - 1;
if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
if (atomic_dec_return(&starget->target_blocked) > 0)
@@ -1344,7 +1348,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
"unblocking target at zero depth\n"));
}

- if (starget->can_queue > 0 && busy >= starget->can_queue)
+ if (busy >= starget->can_queue)
goto starved;
if (atomic_read(&starget->target_blocked) > 0)
goto starved;
@@ -1356,7 +1360,8 @@ starved:
list_move_tail(&sdev->starved_entry, &shost->starved_list);
spin_unlock_irq(shost->host_lock);
out_dec:
- atomic_dec(&starget->target_busy);
+ if (starget->can_queue > 0)
+ atomic_dec(&starget->target_busy);
return 0;
}

@@ -1473,7 +1478,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
*/
atomic_inc(&sdev->device_busy);
atomic_inc(&shost->host_busy);
- atomic_inc(&starget->target_busy);
+ if (starget->can_queue > 0)
+ atomic_inc(&starget->target_busy);

blk_complete_request(req);
}
@@ -1642,7 +1648,8 @@ static void scsi_request_fn(struct request_queue *q)
return;

host_not_ready:
- atomic_dec(&scsi_target(sdev)->target_busy);
+ if (scsi_target(sdev)->can_queue > 0)
+ atomic_dec(&scsi_target(sdev)->target_busy);
not_ready:
/*
* lock q, handle tag, requeue req, and decrement device_busy. We
--
1.7.10.4

2014-06-25 16:50:35

by Christoph Hellwig

Subject: [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case

Currently the midlayer fakes up a struct request for the explicit reset
ioctls, and those don't have a tag allocated to them. The fnic driver pokes
into midlayer structures to paper over this design issue, but that won't
work for the blk-mq case.

Either someone who can actually test the hardware will have to come up with
a similar hack for the blk-mq case, or we'll have to bite the bullet and fix
the way the EH ioctls work for real, but until that happens we fail these
explicit requests here.

Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Hiral Patel <[email protected]>
Cc: Suma Ramars <[email protected]>
Cc: Brian Uchino <[email protected]>
---
drivers/scsi/fnic/fnic_scsi.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
index 3f88f56..961bdf5 100644
--- a/drivers/scsi/fnic/fnic_scsi.c
+++ b/drivers/scsi/fnic/fnic_scsi.c
@@ -2224,6 +2224,22 @@ int fnic_device_reset(struct scsi_cmnd *sc)

tag = sc->request->tag;
if (unlikely(tag < 0)) {
+ /*
+ * XXX(hch): current the midlayer fakes up a struct
+ * request for the explicit reset ioctls, and those
+ * don't have a tag allocated to them. The below
+ * code pokes into midlayer structures to paper over
+ * this design issue, but that won't work for blk-mq.
+ *
+ * Either someone who can actually test the hardware
+ * will have to come up with a similar hack for the
+ * blk-mq case, or we'll have to bite the bullet and
+ * fix the way the EH ioctls work for real, but until
+ * that happens we fail these explicit requests here.
+ */
+ if (shost_use_blk_mq(sc->device->host))
+ goto fnic_device_reset_end;
+
tag = fnic_scsi_host_start_tag(fnic, sc);
if (unlikely(tag == SCSI_NO_TAG))
goto fnic_device_reset_end;
--
1.7.10.4

2014-06-25 16:50:58

by Christoph Hellwig

Subject: [PATCH 12/14] scatterlist: allow chaining to preallocated chunks

Blk-mq drivers usually preallocate their S/G list as part of the request,
but if we want to support the very large S/G lists currently supported by
the SCSI code that would tie up a lot of memory in the preallocated request
pool. Add support to the scatterlist code so that it can initialize an
S/G list that uses a preallocated first chunk and dynamically allocated
additional chunks. That way the scsi-mq code can preallocate a first
page worth of S/G entries as part of the request, and dynamically extend
the S/G list when needed.
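
A rough sketch of how a caller with an embedded first chunk uses the
extended interface (illustrative only; in the scsi-mq case nr_prealloc is
SCSI_MAX_SG_SEGMENTS and alloc_fn is scsi_sg_alloc):

	#include <linux/scatterlist.h>

	static int sketch_alloc_sgl(struct sg_table *table, unsigned int nents,
				    unsigned int nr_prealloc,
				    struct scatterlist *first_chunk,
				    gfp_t gfp, sg_alloc_fn *alloc_fn)
	{
		/* small requests fit entirely in the preallocated chunk */
		if (nents <= nr_prealloc) {
			table->sgl = first_chunk;
			table->nents = table->orig_nents = nents;
			sg_init_table(table->sgl, nents);
			return 0;
		}
		/* larger requests chain dynamic chunks off first_chunk */
		return __sg_alloc_table(table, nents, nr_prealloc,
					first_chunk, gfp, alloc_fn);
	}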

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 16 +++++++---------
include/linux/scatterlist.h | 6 +++---
lib/scatterlist.c | 24 ++++++++++++++++--------
3 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 58534fd..900b1c0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -567,6 +567,11 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
return mempool_alloc(sgp->pool, gfp_mask);
}

+static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
+{
+ __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
+}
+
static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
gfp_t gfp_mask)
{
@@ -575,19 +580,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
BUG_ON(!nents);

ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
- gfp_mask, scsi_sg_alloc);
+ NULL, gfp_mask, scsi_sg_alloc);
if (unlikely(ret))
- __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS,
- scsi_sg_free);
-
+ scsi_free_sgtable(sdb);
return ret;
}

-static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
-{
- __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, scsi_sg_free);
-}
-
/*
* Function: scsi_release_buffers()
*
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index a964f72..f4ec8bb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -229,10 +229,10 @@ void sg_init_one(struct scatterlist *, const void *, unsigned int);
typedef struct scatterlist *(sg_alloc_fn)(unsigned int, gfp_t);
typedef void (sg_free_fn)(struct scatterlist *, unsigned int);

-void __sg_free_table(struct sg_table *, unsigned int, sg_free_fn *);
+void __sg_free_table(struct sg_table *, unsigned int, bool, sg_free_fn *);
void sg_free_table(struct sg_table *);
-int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int, gfp_t,
- sg_alloc_fn *);
+int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int,
+ struct scatterlist *, gfp_t, sg_alloc_fn *);
int sg_alloc_table(struct sg_table *, unsigned int, gfp_t);
int sg_alloc_table_from_pages(struct sg_table *sgt,
struct page **pages, unsigned int n_pages,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 3a8e8e8..48c15d2 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -165,6 +165,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
* __sg_free_table - Free a previously mapped sg table
* @table: The sg table header to use
* @max_ents: The maximum number of entries per single scatterlist
+ * @skip_first_chunk: don't free the (preallocated) first scatterlist chunk
* @free_fn: Free function
*
* Description:
@@ -174,7 +175,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
*
**/
void __sg_free_table(struct sg_table *table, unsigned int max_ents,
- sg_free_fn *free_fn)
+ bool skip_first_chunk, sg_free_fn *free_fn)
{
struct scatterlist *sgl, *next;

@@ -202,7 +203,9 @@ void __sg_free_table(struct sg_table *table, unsigned int max_ents,
}

table->orig_nents -= sg_size;
- free_fn(sgl, alloc_size);
+ if (!skip_first_chunk)
+ free_fn(sgl, alloc_size);
+ skip_first_chunk = false;
sgl = next;
}

@@ -217,7 +220,7 @@ EXPORT_SYMBOL(__sg_free_table);
**/
void sg_free_table(struct sg_table *table)
{
- __sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
+ __sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
}
EXPORT_SYMBOL(sg_free_table);

@@ -241,8 +244,8 @@ EXPORT_SYMBOL(sg_free_table);
*
**/
int __sg_alloc_table(struct sg_table *table, unsigned int nents,
- unsigned int max_ents, gfp_t gfp_mask,
- sg_alloc_fn *alloc_fn)
+ unsigned int max_ents, struct scatterlist *first_chunk,
+ gfp_t gfp_mask, sg_alloc_fn *alloc_fn)
{
struct scatterlist *sg, *prv;
unsigned int left;
@@ -269,7 +272,12 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,

left -= sg_size;

- sg = alloc_fn(alloc_size, gfp_mask);
+ if (first_chunk) {
+ sg = first_chunk;
+ first_chunk = NULL;
+ } else {
+ sg = alloc_fn(alloc_size, gfp_mask);
+ }
if (unlikely(!sg)) {
/*
* Adjust entry count to reflect that the last
@@ -324,9 +332,9 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
int ret;

ret = __sg_alloc_table(table, nents, SG_MAX_SINGLE_ALLOC,
- gfp_mask, sg_kmalloc);
+ NULL, gfp_mask, sg_kmalloc);
if (unlikely(ret))
- __sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
+ __sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);

return ret;
}
--
1.7.10.4

2014-06-25 16:51:41

by Christoph Hellwig

Subject: [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls

Replace the calls to the various blk_end_request variants with open-coded
equivalents. Blk-mq uses a model that gives the driver control
between the bio updates and the actual completion, and making the old
code follow that same model allows us to keep the code more similar for
both paths.
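
In outline the completion model both paths end up sharing looks like this
(a sketch, not the patch; sketch_release_buffers() stands in for the
SCSI-specific buffer teardown):

	#include <linux/blkdev.h>

	static void sketch_release_buffers(struct request *req)
	{
		/* placeholder for the SCSI-specific buffer teardown */
	}

	static bool sketch_end_request(struct request *req, int error,
				       unsigned int bytes)
	{
		unsigned long flags;

		if (blk_update_request(req, error, bytes))
			return true;	/* bios only partially done, keep going */

		/*
		 * The driver regains control here, between updating the
		 * bios and actually finishing the request -- the same
		 * window that blk-mq gives its drivers.
		 */
		sketch_release_buffers(req);

		spin_lock_irqsave(req->q->queue_lock, flags);
		blk_finish_request(req, error);
		spin_unlock_irqrestore(req->q->queue_lock, flags);
		return false;
	}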

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 61 ++++++++++++++++++++++++++++++++---------------
1 file changed, 42 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a64b9d3..58534fd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -625,6 +625,37 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
cmd->request->next_rq->special = NULL;
}

+static bool scsi_end_request(struct request *req, int error,
+ unsigned int bytes, unsigned int bidi_bytes)
+{
+ struct scsi_cmnd *cmd = req->special;
+ struct scsi_device *sdev = cmd->device;
+ struct request_queue *q = sdev->request_queue;
+ unsigned long flags;
+
+
+ if (blk_update_request(req, error, bytes))
+ return true;
+
+ /* Bidi request must be completed as a whole */
+ if (unlikely(bidi_bytes) &&
+ blk_update_request(req->next_rq, error, bidi_bytes))
+ return true;
+
+ if (blk_queue_add_random(q))
+ add_disk_randomness(req->rq_disk);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ blk_finish_request(req, error);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ if (bidi_bytes)
+ scsi_release_bidi_buffers(cmd);
+ scsi_release_buffers(cmd);
+ scsi_next_command(cmd);
+ return false;
+}
+
/**
* __scsi_error_from_host_byte - translate SCSI error code into errno
* @cmd: SCSI command (unused)
@@ -697,7 +728,7 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd *cmd, int result)
* be put back on the queue and retried using the same
* command as before, possibly after a delay.
*
- * c) We can call blk_end_request() with -EIO to fail
+ * c) We can call scsi_end_request() with -EIO to fail
* the remainder of the request.
*/
void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
@@ -749,13 +780,9 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
* both sides at once.
*/
req->next_rq->resid_len = scsi_in(cmd)->resid;
-
- scsi_release_buffers(cmd);
- scsi_release_bidi_buffers(cmd);
-
- blk_end_request_all(req, 0);
-
- scsi_next_command(cmd);
+ if (scsi_end_request(req, 0, blk_rq_bytes(req),
+ blk_rq_bytes(req->next_rq)))
+ BUG();
return;
}
}
@@ -794,15 +821,16 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
/*
* If we finished all bytes in the request we are done now.
*/
- if (!blk_end_request(req, error, good_bytes))
- goto next_command;
+ if (!scsi_end_request(req, error, good_bytes, 0))
+ return;

/*
* Kill remainder if no retrys.
*/
if (error && scsi_noretry_cmd(cmd)) {
- blk_end_request_all(req, error);
- goto next_command;
+ if (scsi_end_request(req, error, blk_rq_bytes(req), 0))
+ BUG();
+ return;
}

/*
@@ -947,8 +975,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
scsi_print_sense("", cmd);
scsi_print_command(cmd);
}
- if (!blk_end_request_err(req, error))
- goto next_command;
+ if (!scsi_end_request(req, error, blk_rq_err_bytes(req), 0))
+ return;
/*FALLTHRU*/
case ACTION_REPREP:
requeue:
@@ -967,11 +995,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
__scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0);
break;
}
- return;
-
-next_command:
- scsi_release_buffers(cmd);
- scsi_next_command(cmd);
}

static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
--
1.7.10.4

2014-06-25 16:50:00

by Christoph Hellwig

Subject: [PATCH 07/14] scsi: convert host_busy to atomic_t

Avoid taking the host-wide host_lock to check the per-host queue limit.
Instead we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it again after finishing all checks.
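
The core of the pattern, sketched (the real scsi_host_queue_ready() below
additionally handles host_blocked and the starved list, which still take
the host_lock on their rare paths):

	#include <linux/atomic.h>
	#include <scsi/scsi_host.h>

	static int sketch_host_queue_ready(struct Scsi_Host *shost)
	{
		/* the increment itself reserves our slot in the queue */
		unsigned int busy = atomic_inc_return(&shost->host_busy) - 1;

		if (shost->can_queue > 0 && busy >= shost->can_queue) {
			atomic_dec(&shost->host_busy);	/* give the slot back */
			return 0;
		}
		return 1;
	}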

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/advansys.c | 4 +-
drivers/scsi/libiscsi.c | 4 +-
drivers/scsi/libsas/sas_scsi_host.c | 5 ++-
drivers/scsi/qlogicpti.c | 2 +-
drivers/scsi/scsi.c | 2 +-
drivers/scsi/scsi_error.c | 7 ++--
drivers/scsi/scsi_lib.c | 71 +++++++++++++++++++++--------------
drivers/scsi/scsi_sysfs.c | 9 ++++-
include/scsi/scsi_host.h | 10 ++---
9 files changed, 66 insertions(+), 48 deletions(-)

diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
index e716d0a..43761c1 100644
--- a/drivers/scsi/advansys.c
+++ b/drivers/scsi/advansys.c
@@ -2512,7 +2512,7 @@ static void asc_prt_scsi_host(struct Scsi_Host *s)

printk("Scsi_Host at addr 0x%p, device %s\n", s, dev_name(boardp->dev));
printk(" host_busy %u, host_no %d,\n",
- s->host_busy, s->host_no);
+ atomic_read(&s->host_busy), s->host_no);

printk(" base 0x%lx, io_port 0x%lx, irq %d,\n",
(ulong)s->base, (ulong)s->io_port, boardp->irq);
@@ -3346,7 +3346,7 @@ static void asc_prt_driver_conf(struct seq_file *m, struct Scsi_Host *shost)

seq_printf(m,
" host_busy %u, max_id %u, max_lun %llu, max_channel %u\n",
- shost->host_busy, shost->max_id,
+ atomic_read(&shost->host_busy), shost->max_id,
shost->max_lun, shost->max_channel);

seq_printf(m,
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index f2db82b..f9f3a12 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2971,7 +2971,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
*/
for (;;) {
spin_lock_irqsave(session->host->host_lock, flags);
- if (!session->host->host_busy) { /* OK for ERL == 0 */
+ if (!atomic_read(&session->host->host_busy)) { /* OK for ERL == 0 */
spin_unlock_irqrestore(session->host->host_lock, flags);
break;
}
@@ -2979,7 +2979,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
msleep_interruptible(500);
iscsi_conn_printk(KERN_INFO, conn, "iscsi conn_destroy(): "
"host_busy %d host_failed %d\n",
- session->host->host_busy,
+ atomic_read(&session->host->host_busy),
session->host->host_failed);
/*
* force eh_abort() to unblock
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 7d02a19..24e477d 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -813,7 +813,7 @@ retry:
spin_unlock_irq(shost->host_lock);

SAS_DPRINTK("Enter %s busy: %d failed: %d\n",
- __func__, shost->host_busy, shost->host_failed);
+ __func__, atomic_read(&shost->host_busy), shost->host_failed);
/*
* Deal with commands that still have SAS tasks (i.e. they didn't
* complete via the normal sas_task completion mechanism),
@@ -858,7 +858,8 @@ out:
goto retry;

SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n",
- __func__, shost->host_busy, shost->host_failed, tries);
+ __func__, atomic_read(&shost->host_busy),
+ shost->host_failed, tries);
}

enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd)
diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
index 6d48d30..740ae49 100644
--- a/drivers/scsi/qlogicpti.c
+++ b/drivers/scsi/qlogicpti.c
@@ -959,7 +959,7 @@ static inline void update_can_queue(struct Scsi_Host *host, u_int in_ptr, u_int
/* Temporary workaround until bug is found and fixed (one bug has been found
already, but fixing it makes things even worse) -jj */
int num_free = QLOGICPTI_REQ_QUEUE_LEN - REQ_QUEUE_DEPTH(in_ptr, out_ptr) - 64;
- host->can_queue = host->host_busy + num_free;
+ host->can_queue = atomic_read(&host->host_busy) + num_free;
host->sg_tablesize = QLOGICPTI_MAX_SG(num_free);
}

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d3bd6cf..35a23e2 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -603,7 +603,7 @@ void scsi_log_completion(struct scsi_cmnd *cmd, int disposition)
if (level > 3)
scmd_printk(KERN_INFO, cmd,
"scsi host busy %d failed %d\n",
- cmd->device->host->host_busy,
+ atomic_read(&cmd->device->host->host_busy),
cmd->device->host->host_failed);
}
}
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index e4a5324..5db8454 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -59,7 +59,7 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *,
/* called with shost->host_lock held */
void scsi_eh_wakeup(struct Scsi_Host *shost)
{
- if (shost->host_busy == shost->host_failed) {
+ if (atomic_read(&shost->host_busy) == shost->host_failed) {
trace_scsi_eh_wakeup(shost);
wake_up_process(shost->ehandler);
SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
@@ -2164,7 +2164,7 @@ int scsi_error_handler(void *data)
while (!kthread_should_stop()) {
set_current_state(TASK_INTERRUPTIBLE);
if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
- shost->host_failed != shost->host_busy) {
+ shost->host_failed != atomic_read(&shost->host_busy)) {
SCSI_LOG_ERROR_RECOVERY(1,
shost_printk(KERN_INFO, shost,
"scsi_eh_%d: sleeping\n",
@@ -2178,7 +2178,8 @@ int scsi_error_handler(void *data)
shost_printk(KERN_INFO, shost,
"scsi_eh_%d: waking up %d/%d/%d\n",
shost->host_no, shost->host_eh_scheduled,
- shost->host_failed, shost->host_busy));
+ shost->host_failed,
+ atomic_read(&shost->host_busy)));

/*
* We have a host that is failing for some reason. Figure out
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5e269d6..5d37d79 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -292,14 +292,17 @@ void scsi_device_unbusy(struct scsi_device *sdev)
struct scsi_target *starget = scsi_target(sdev);
unsigned long flags;

- spin_lock_irqsave(shost->host_lock, flags);
- shost->host_busy--;
+ atomic_dec(&shost->host_busy);
atomic_dec(&starget->target_busy);
+
if (unlikely(scsi_host_in_recovery(shost) &&
- (shost->host_failed || shost->host_eh_scheduled)))
+ (shost->host_failed || shost->host_eh_scheduled))) {
+ spin_lock_irqsave(shost->host_lock, flags);
scsi_eh_wakeup(shost);
- spin_unlock(shost->host_lock);
- spin_lock(sdev->request_queue->queue_lock);
+ spin_unlock_irqrestore(shost->host_lock, flags);
+ }
+
+ spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
sdev->device_busy--;
spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
}
@@ -367,7 +370,8 @@ static inline int scsi_target_is_busy(struct scsi_target *starget)

static inline int scsi_host_is_busy(struct Scsi_Host *shost)
{
- if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
+ if ((shost->can_queue > 0 &&
+ atomic_read(&shost->host_busy) >= shost->can_queue) ||
shost->host_blocked || shost->host_self_blocked)
return 1;

@@ -1359,38 +1363,51 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
struct Scsi_Host *shost,
struct scsi_device *sdev)
{
- int ret = 0;
-
- spin_lock_irq(shost->host_lock);
+ unsigned int busy;

if (scsi_host_in_recovery(shost))
- goto out;
- if (shost->host_busy == 0 && shost->host_blocked) {
+ return 0;
+
+ busy = atomic_inc_return(&shost->host_busy) - 1;
+ if (busy == 0 && shost->host_blocked) {
/*
* unblock after host_blocked iterates to zero
*/
- if (--shost->host_blocked != 0)
- goto out;
+ spin_lock_irq(shost->host_lock);
+ if (--shost->host_blocked != 0) {
+ spin_unlock_irq(shost->host_lock);
+ goto out_dec;
+ }
+ spin_unlock_irq(shost->host_lock);

SCSI_LOG_MLQUEUE(3,
shost_printk(KERN_INFO, shost,
"unblocking host at zero depth\n"));
}
- if (scsi_host_is_busy(shost)) {
- if (list_empty(&sdev->starved_entry))
- list_add_tail(&sdev->starved_entry, &shost->starved_list);
- goto out;
- }
+
+ if (shost->can_queue > 0 && busy >= shost->can_queue)
+ goto starved;
+ if (shost->host_blocked || shost->host_self_blocked)
+ goto starved;

/* We're OK to process the command, so we can't be starved */
- if (!list_empty(&sdev->starved_entry))
- list_del_init(&sdev->starved_entry);
+ if (!list_empty(&sdev->starved_entry)) {
+ spin_lock_irq(shost->host_lock);
+ if (!list_empty(&sdev->starved_entry))
+ list_del_init(&sdev->starved_entry);
+ spin_unlock_irq(shost->host_lock);
+ }

- shost->host_busy++;
- ret = 1;
-out:
+ return 1;
+
+starved:
+ spin_lock_irq(shost->host_lock);
+ if (list_empty(&sdev->starved_entry))
+ list_add_tail(&sdev->starved_entry, &shost->starved_list);
spin_unlock_irq(shost->host_lock);
- return ret;
+out_dec:
+ atomic_dec(&shost->host_busy);
+ return 0;
}

/*
@@ -1454,12 +1471,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
* with the locks as normal issue path does.
*/
sdev->device_busy++;
- spin_unlock(sdev->request_queue->queue_lock);
- spin_lock(shost->host_lock);
- shost->host_busy++;
+ atomic_inc(&shost->host_busy);
atomic_inc(&starget->target_busy);
- spin_unlock(shost->host_lock);
- spin_lock(sdev->request_queue->queue_lock);

blk_complete_request(req);
}
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 5f36788..7ec5e06 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -334,7 +334,6 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);

shost_rd_attr(unique_id, "%u\n");
-shost_rd_attr(host_busy, "%hu\n");
shost_rd_attr(cmd_per_lun, "%hd\n");
shost_rd_attr(can_queue, "%hd\n");
shost_rd_attr(sg_tablesize, "%hu\n");
@@ -344,6 +343,14 @@ shost_rd_attr(prot_capabilities, "%u\n");
shost_rd_attr(prot_guard_type, "%hd\n");
shost_rd_attr2(proc_name, hostt->proc_name, "%s\n");

+static ssize_t
+show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct Scsi_Host *shost = class_to_shost(dev);
+ return snprintf(buf, 20, "%hu\n", atomic_read(&shost->host_busy));
+}
+static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
+
static struct attribute *scsi_sysfs_shost_attrs[] = {
&dev_attr_unique_id.attr,
&dev_attr_host_busy.attr,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index abb6958..3d124f7 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -603,13 +603,9 @@ struct Scsi_Host {
*/
struct blk_queue_tag *bqt;

- /*
- * The following two fields are protected with host_lock;
- * however, eh routines can safely access during eh processing
- * without acquiring the lock.
- */
- unsigned int host_busy; /* commands actually active on low-level */
- unsigned int host_failed; /* commands that failed. */
+ atomic_t host_busy; /* commands actually active on low-level */
+ unsigned int host_failed; /* commands that failed.
+ protected by host_lock */
unsigned int host_eh_scheduled; /* EH scheduled without command */

unsigned int host_no; /* Used for IOCTL_GET_IDLUN, /proc/scsi et al. */
--
1.7.10.4
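
The scsi_host_queue_ready() conversion above is an instance of a pattern
used throughout this series: optimistically grab a slot with
atomic_inc_return(), run the limit checks against the pre-increment value,
and give the slot back if any check fails. A minimal userspace model of
the pattern, with C11 atomics standing in for the kernel's atomic_t (the
names here are illustrative, not the kernel's):

#include <stdatomic.h>
#include <stdbool.h>

struct host {
        atomic_int busy;        /* like shost->host_busy */
        int can_queue;          /* 0 means no limit */
};

static bool host_queue_ready(struct host *h)
{
        /* atomic_fetch_add() returns the old value, matching the
         * atomic_inc_return(&h->busy) - 1 idiom in the patch */
        int busy = atomic_fetch_add(&h->busy, 1);

        if (h->can_queue > 0 && busy >= h->can_queue) {
                atomic_fetch_sub(&h->busy, 1);  /* the out_dec: undo */
                return false;
        }
        return true;    /* slot is kept, and dropped again on completion */
}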

2014-06-25 16:52:01

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess

Seems like these counters are missing any sort of synchronization for
updates, as an over 10 year old comment from me noted. Fix this by
using atomic counters, and while we're at it also make sure they are
in the same cacheline as the _busy counters and not needlessly stored
to in every I/O completion.

With the new model the _blocked counters can temporarily go negative,
so all the readers are updated to check for > 0 values. Longer
term every successful I/O completion will reset the counters to zero,
so the temporarily negative values will not cause any harm.
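
A rough userspace model of that countdown, again with C11 atomics standing
in for the kernel's atomic_t (names illustrative):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int device_blocked;       /* stands in for sdev->device_blocked */

/* Unblock path: racing callers can drive the counter below zero, but
 * because every reader tests "> 0" rather than "!= 0", a negative value
 * simply reads as "not blocked". */
static bool consume_blocked(void)
{
        if (atomic_load(&device_blocked) <= 0)
                return false;
        return atomic_fetch_sub(&device_blocked, 1) - 1 > 0;
}

/* Completion path: a successful I/O resets the counter, which also
 * erases any temporarily negative value. */
static void io_completed(void)
{
        if (atomic_load(&device_blocked))
                atomic_store(&device_blocked, 0);
}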

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi.c | 21 ++++++------
drivers/scsi/scsi_lib.c | 82 +++++++++++++++++++++-----------------------
drivers/scsi/scsi_sysfs.c | 10 +++++-
include/scsi/scsi_device.h | 7 ++--
include/scsi/scsi_host.h | 7 ++--
5 files changed, 64 insertions(+), 63 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 35a23e2..b362058 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -729,17 +729,16 @@ void scsi_finish_command(struct scsi_cmnd *cmd)

scsi_device_unbusy(sdev);

- /*
- * Clear the flags which say that the device/host is no longer
- * capable of accepting new commands. These are set in scsi_queue.c
- * for both the queue full condition on a device, and for a
- * host full condition on the host.
- *
- * XXX(hch): What about locking?
- */
- shost->host_blocked = 0;
- starget->target_blocked = 0;
- sdev->device_blocked = 0;
+ /*
+ * Clear the flags which say that the device/target/host is no longer
+ * capable of accepting new commands.
+ */
+ if (atomic_read(&shost->host_blocked))
+ atomic_set(&shost->host_blocked, 0);
+ if (atomic_read(&starget->target_blocked))
+ atomic_set(&starget->target_blocked, 0);
+ if (atomic_read(&sdev->device_blocked))
+ atomic_set(&sdev->device_blocked, 0);

/*
* If we have valid sense information, then some kind of recovery
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e23fef5..a39d5ba 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -99,14 +99,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
*/
switch (reason) {
case SCSI_MLQUEUE_HOST_BUSY:
- host->host_blocked = host->max_host_blocked;
+ atomic_set(&host->host_blocked, host->max_host_blocked);
break;
case SCSI_MLQUEUE_DEVICE_BUSY:
case SCSI_MLQUEUE_EH_RETRY:
- device->device_blocked = device->max_device_blocked;
+ atomic_set(&device->device_blocked,
+ device->max_device_blocked);
break;
case SCSI_MLQUEUE_TARGET_BUSY:
- starget->target_blocked = starget->max_target_blocked;
+ atomic_set(&starget->target_blocked,
+ starget->max_target_blocked);
break;
}
}
@@ -351,30 +353,39 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
spin_unlock_irqrestore(shost->host_lock, flags);
}

-static inline int scsi_device_is_busy(struct scsi_device *sdev)
+static inline bool scsi_device_is_busy(struct scsi_device *sdev)
{
if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
- return 1;
- if (sdev->device_blocked)
- return 1;
+ return true;
+ if (atomic_read(&sdev->device_blocked) > 0)
+ return true;
return 0;
}

-static inline int scsi_target_is_busy(struct scsi_target *starget)
+static inline bool scsi_target_is_busy(struct scsi_target *starget)
{
- return ((starget->can_queue > 0 &&
- atomic_read(&starget->target_busy) >= starget->can_queue) ||
- starget->target_blocked);
+ if (starget->can_queue > 0) {
+ if (atomic_read(&starget->target_busy) >= starget->can_queue)
+ return true;
+ if (atomic_read(&starget->target_blocked) > 0)
+ return true;
+ }
+
+ return false;
}

-static inline int scsi_host_is_busy(struct Scsi_Host *shost)
+static inline bool scsi_host_is_busy(struct Scsi_Host *shost)
{
- if ((shost->can_queue > 0 &&
- atomic_read(&shost->host_busy) >= shost->can_queue) ||
- shost->host_blocked || shost->host_self_blocked)
- return 1;
+ if (shost->can_queue > 0) {
+ if (atomic_read(&shost->host_busy) >= shost->can_queue)
+ return true;
+ if (atomic_read(&shost->host_blocked) > 0)
+ return true;
+ if (shost->host_self_blocked)
+ return true;
+ }

- return 0;
+ return false;
}

static void scsi_starved_list_run(struct Scsi_Host *shost)
@@ -1283,11 +1294,8 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
unsigned int busy;

busy = atomic_inc_return(&sdev->device_busy) - 1;
- if (busy == 0 && sdev->device_blocked) {
- /*
- * unblock after device_blocked iterates to zero
- */
- if (--sdev->device_blocked != 0) {
+ if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
+ if (atomic_dec_return(&sdev->device_blocked) > 0) {
blk_delay_queue(q, SCSI_QUEUE_DELAY);
goto out_dec;
}
@@ -1297,7 +1305,7 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,

if (busy >= sdev->queue_depth)
goto out_dec;
- if (sdev->device_blocked)
+ if (atomic_read(&sdev->device_blocked) > 0)
goto out_dec;

return 1;
@@ -1328,16 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
}

busy = atomic_inc_return(&starget->target_busy) - 1;
- if (busy == 0 && starget->target_blocked) {
- /*
- * unblock after target_blocked iterates to zero
- */
- spin_lock_irq(shost->host_lock);
- if (--starget->target_blocked != 0) {
- spin_unlock_irq(shost->host_lock);
+ if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
+ if (atomic_dec_return(&starget->target_blocked) > 0)
goto out_dec;
- }
- spin_unlock_irq(shost->host_lock);

SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
"unblocking target at zero depth\n"));
@@ -1345,7 +1346,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,

if (starget->can_queue > 0 && busy >= starget->can_queue)
goto starved;
- if (starget->target_blocked)
+ if (atomic_read(&starget->target_blocked) > 0)
goto starved;

return 1;
@@ -1374,16 +1375,9 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
return 0;

busy = atomic_inc_return(&shost->host_busy) - 1;
- if (busy == 0 && shost->host_blocked) {
- /*
- * unblock after host_blocked iterates to zero
- */
- spin_lock_irq(shost->host_lock);
- if (--shost->host_blocked != 0) {
- spin_unlock_irq(shost->host_lock);
+ if (busy == 0 && atomic_read(&shost->host_blocked) > 0) {
+ if (atomic_dec_return(&shost->host_blocked) > 0)
goto out_dec;
- }
- spin_unlock_irq(shost->host_lock);

SCSI_LOG_MLQUEUE(3,
shost_printk(KERN_INFO, shost,
@@ -1392,7 +1386,9 @@ static inline int scsi_host_queue_ready(struct request_queue *q,

if (shost->can_queue > 0 && busy >= shost->can_queue)
goto starved;
- if (shost->host_blocked || shost->host_self_blocked)
+ if (atomic_read(&shost->host_blocked) > 0)
+ goto starved;
+ if (shost->host_self_blocked)
goto starved;

/* We're OK to process the command, so we can't be starved */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 54e3dac..deef063 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -584,7 +584,6 @@ static int scsi_sdev_check_buf_bit(const char *buf)
/*
* Create the actual show/store functions and data structures.
*/
-sdev_rd_attr (device_blocked, "%d\n");
sdev_rd_attr (type, "%d\n");
sdev_rd_attr (scsi_level, "%d\n");
sdev_rd_attr (vendor, "%.8s\n");
@@ -600,6 +599,15 @@ sdev_show_device_busy(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR(device_busy, S_IRUGO, sdev_show_device_busy, NULL);

+static ssize_t
+sdev_show_device_blocked(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct scsi_device *sdev = to_scsi_device(dev);
+ return snprintf(buf, 20, "%d\n", atomic_read(&sdev->device_blocked));
+}
+static DEVICE_ATTR(device_blocked, S_IRUGO, sdev_show_device_blocked, NULL);
+
/*
* TODO: can we make these symlinks to the block layer ones?
*/
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 5ff3d24..a8a8981 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -82,6 +82,8 @@ struct scsi_device {
struct list_head same_target_siblings; /* just the devices sharing same target id */

atomic_t device_busy; /* commands actually active on LLDD */
+ atomic_t device_blocked; /* Device returned QUEUE_FULL. */
+
spinlock_t list_lock;
struct list_head cmd_list; /* queue of in use SCSI Command structures */
struct list_head starved_entry;
@@ -179,8 +181,6 @@ struct scsi_device {
struct list_head event_list; /* asserted events */
struct work_struct event_work;

- unsigned int device_blocked; /* Device returned QUEUE_FULL. */
-
unsigned int max_device_blocked; /* what device_blocked counts down from */
#define SCSI_DEFAULT_DEVICE_BLOCKED 3

@@ -290,12 +290,13 @@ struct scsi_target {
* the same target will also. */
/* commands actually active on LLD. */
atomic_t target_busy;
+ atomic_t target_blocked;
+
/*
* LLDs should set this in the slave_alloc host template callout.
* If set to zero then there is not limit.
*/
unsigned int can_queue;
- unsigned int target_blocked;
unsigned int max_target_blocked;
#define SCSI_DEFAULT_TARGET_BLOCKED 3

diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 3d124f7..7f9bbda 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -604,6 +604,8 @@ struct Scsi_Host {
struct blk_queue_tag *bqt;

atomic_t host_busy; /* commands actually active on low-level */
+ atomic_t host_blocked;
+
unsigned int host_failed; /* commands that failed.
protected by host_lock */
unsigned int host_eh_scheduled; /* EH scheduled without command */
@@ -703,11 +705,6 @@ struct Scsi_Host {
struct workqueue_struct *tmf_work_q;

/*
- * Host has rejected a command because it was busy.
- */
- unsigned int host_blocked;
-
- /*
* Value host_blocked counts down from
*/
unsigned int max_host_blocked;
--
1.7.10.4

2014-06-25 16:49:58

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready

Prepare for not taking a host-wide lock in the dispatch path by pushing
the lock down into the places that actually need it. Note that this
patch is just a preparation step, as it will actually increase lock
roundtrips and thus decrease performance on its own.
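
Concretely: a successful dispatch used to take host_lock once, around both
readiness checks and both busy-counter increments; after this patch each
helper takes the lock internally, so the same dispatch pays two or more
lock roundtrips. A compressed fragment sketch of the two shapes (not the
exact code):

        /* before: one roundtrip, the caller holds host_lock throughout */
        spin_lock_irq(shost->host_lock);
        if (scsi_target_queue_ready(shost, sdev) &&
            scsi_host_queue_ready(q, shost, sdev)) {
                scsi_target(sdev)->target_busy++;
                shost->host_busy++;
        }
        spin_unlock_irq(shost->host_lock);

        /* after: each helper locks, checks, bumps its own busy counter
         * and unlocks by itself */
        if (scsi_target_queue_ready(shost, sdev) &&
            scsi_host_queue_ready(q, shost, sdev))
                /* dispatch the command */;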

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 75 ++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6989b6f..18e6449 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1300,18 +1300,18 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
/*
* scsi_target_queue_ready: checks if there we can send commands to target
* @sdev: scsi device on starget to check.
- *
- * Called with the host lock held.
*/
static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
struct scsi_device *sdev)
{
struct scsi_target *starget = scsi_target(sdev);
+ int ret = 0;

+ spin_lock_irq(shost->host_lock);
if (starget->single_lun) {
if (starget->starget_sdev_user &&
starget->starget_sdev_user != sdev)
- return 0;
+ goto out;
starget->starget_sdev_user = sdev;
}

@@ -1319,57 +1319,66 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
/*
* unblock after target_blocked iterates to zero
*/
- if (--starget->target_blocked == 0) {
- SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
- "unblocking target at zero depth\n"));
- } else
- return 0;
+ if (--starget->target_blocked != 0)
+ goto out;
+
+ SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
+ "unblocking target at zero depth\n"));
}

if (scsi_target_is_busy(starget)) {
list_move_tail(&sdev->starved_entry, &shost->starved_list);
- return 0;
+ goto out;
}

- return 1;
+ scsi_target(sdev)->target_busy++;
+ ret = 1;
+out:
+ spin_unlock_irq(shost->host_lock);
+ return ret;
}

/*
* scsi_host_queue_ready: if we can send requests to shost, return 1 else
* return 0. We must end up running the queue again whenever 0 is
* returned, else IO can hang.
- *
- * Called with host_lock held.
*/
static inline int scsi_host_queue_ready(struct request_queue *q,
struct Scsi_Host *shost,
struct scsi_device *sdev)
{
+ int ret = 0;
+
+ spin_lock_irq(shost->host_lock);
+
if (scsi_host_in_recovery(shost))
- return 0;
+ goto out;
if (shost->host_busy == 0 && shost->host_blocked) {
/*
* unblock after host_blocked iterates to zero
*/
- if (--shost->host_blocked == 0) {
- SCSI_LOG_MLQUEUE(3,
- shost_printk(KERN_INFO, shost,
- "unblocking host at zero depth\n"));
- } else {
- return 0;
- }
+ if (--shost->host_blocked != 0)
+ goto out;
+
+ SCSI_LOG_MLQUEUE(3,
+ shost_printk(KERN_INFO, shost,
+ "unblocking host at zero depth\n"));
}
if (scsi_host_is_busy(shost)) {
if (list_empty(&sdev->starved_entry))
list_add_tail(&sdev->starved_entry, &shost->starved_list);
- return 0;
+ goto out;
}

/* We're OK to process the command, so we can't be starved */
if (!list_empty(&sdev->starved_entry))
list_del_init(&sdev->starved_entry);

- return 1;
+ shost->host_busy++;
+ ret = 1;
+out:
+ spin_unlock_irq(shost->host_lock);
+ return ret;
}

/*
@@ -1550,7 +1559,7 @@ static void scsi_request_fn(struct request_queue *q)
blk_start_request(req);
sdev->device_busy++;

- spin_unlock(q->queue_lock);
+ spin_unlock_irq(q->queue_lock);
cmd = req->special;
if (unlikely(cmd == NULL)) {
printk(KERN_CRIT "impossible request in %s.\n"
@@ -1560,7 +1569,6 @@ static void scsi_request_fn(struct request_queue *q)
blk_dump_rq_flags(req, "foo");
BUG();
}
- spin_lock(shost->host_lock);

/*
* We hit this when the driver is using a host wide
@@ -1571,9 +1579,11 @@ static void scsi_request_fn(struct request_queue *q)
* a run when a tag is freed.
*/
if (blk_queue_tagged(q) && !blk_rq_tagged(req)) {
+ spin_lock_irq(shost->host_lock);
if (list_empty(&sdev->starved_entry))
list_add_tail(&sdev->starved_entry,
&shost->starved_list);
+ spin_unlock_irq(shost->host_lock);
goto not_ready;
}

@@ -1581,16 +1591,7 @@ static void scsi_request_fn(struct request_queue *q)
goto not_ready;

if (!scsi_host_queue_ready(q, shost, sdev))
- goto not_ready;
-
- scsi_target(sdev)->target_busy++;
- shost->host_busy++;
-
- /*
- * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
- * take the lock again.
- */
- spin_unlock_irq(shost->host_lock);
+ goto host_not_ready;

/*
* Finally, initialize any error handling parameters, and set up
@@ -1613,9 +1614,11 @@ static void scsi_request_fn(struct request_queue *q)

return;

- not_ready:
+ host_not_ready:
+ spin_lock_irq(shost->host_lock);
+ scsi_target(sdev)->target_busy--;
spin_unlock_irq(shost->host_lock);
-
+ not_ready:
/*
* lock q, handle tag, requeue req, and decrement device_busy. We
* must return with queue_lock held.
--
1.7.10.4

2014-06-25 16:52:46

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 08/14] scsi: convert device_busy to atomic_t

Avoid taking the queue_lock to check the per-device queue limit. Instead
we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it after finishing all checks.

Unlike the host and target busy counters this doesn't allow us to avoid the
queue_lock in the request_fn due to the way the interface works, but it'll
allow us to prepare for using the blk-mq code, which doesn't use the
queue_lock at all, and it at least avoids a queue_lock roundtrip in
scsi_device_unbusy, which is still important given how busy the queue_lock
is.
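
Put differently, device_busy itself no longer needs a lock, but the legacy
request_fn still brackets its loop in queue_lock because blk_peek_request()
and blk_start_request() must be called with that lock held. A fragment
sketch of the resulting loop shape (simplified from the patch below):

        spin_lock_irq(q->queue_lock);
        for (;;) {
                struct request *req = blk_peek_request(q); /* needs queue_lock */
                if (!req)
                        break;
                /* ... online check ... */
                if (!scsi_dev_queue_ready(q, sdev))     /* lock-free now */
                        break;
                blk_start_request(req);                 /* needs queue_lock */
                spin_unlock_irq(q->queue_lock);
                /* ... set up and dispatch the command ... */
                spin_lock_irq(q->queue_lock);
        }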

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/message/fusion/mptsas.c | 2 +-
drivers/scsi/scsi_lib.c | 50 ++++++++++++++++++++++-----------------
drivers/scsi/scsi_sysfs.c | 10 +++++++-
drivers/scsi/sg.c | 2 +-
include/scsi/scsi_device.h | 4 +---
5 files changed, 40 insertions(+), 28 deletions(-)

diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 711fcb5..d636dbe 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -3763,7 +3763,7 @@ mptsas_send_link_status_event(struct fw_event_work *fw_event)
printk(MYIOC_s_DEBUG_FMT
"SDEV OUTSTANDING CMDS"
"%d\n", ioc->name,
- sdev->device_busy));
+ atomic_read(&sdev->device_busy)));
}

}
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5d37d79..e23fef5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -302,9 +302,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
spin_unlock_irqrestore(shost->host_lock, flags);
}

- spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
- sdev->device_busy--;
- spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
+ atomic_dec(&sdev->device_busy);
}

/*
@@ -355,9 +353,10 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)

static inline int scsi_device_is_busy(struct scsi_device *sdev)
{
- if (sdev->device_busy >= sdev->queue_depth || sdev->device_blocked)
+ if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+ return 1;
+ if (sdev->device_blocked)
return 1;
-
return 0;
}

@@ -1224,7 +1223,7 @@ scsi_prep_return(struct request_queue *q, struct request *req, int ret)
* queue must be restarted, so we schedule a callback to happen
* shortly.
*/
- if (sdev->device_busy == 0)
+ if (atomic_read(&sdev->device_busy) == 0)
blk_delay_queue(q, SCSI_QUEUE_DELAY);
break;
default:
@@ -1281,26 +1280,32 @@ static void scsi_unprep_fn(struct request_queue *q, struct request *req)
static inline int scsi_dev_queue_ready(struct request_queue *q,
struct scsi_device *sdev)
{
- if (sdev->device_busy == 0 && sdev->device_blocked) {
+ unsigned int busy;
+
+ busy = atomic_inc_return(&sdev->device_busy) - 1;
+ if (busy == 0 && sdev->device_blocked) {
/*
* unblock after device_blocked iterates to zero
*/
- if (--sdev->device_blocked == 0) {
- SCSI_LOG_MLQUEUE(3,
- sdev_printk(KERN_INFO, sdev,
- "unblocking device at zero depth\n"));
- } else {
+ if (--sdev->device_blocked != 0) {
blk_delay_queue(q, SCSI_QUEUE_DELAY);
- return 0;
+ goto out_dec;
}
+ SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
+ "unblocking device at zero depth\n"));
}
- if (scsi_device_is_busy(sdev))
- return 0;
+
+ if (busy >= sdev->queue_depth)
+ goto out_dec;
+ if (sdev->device_blocked)
+ goto out_dec;

return 1;
+out_dec:
+ atomic_dec(&sdev->device_busy);
+ return 0;
}

-
/*
* scsi_target_queue_ready: checks if there we can send commands to target
* @sdev: scsi device on starget to check.
@@ -1470,7 +1475,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
* bump busy counts. To bump the counters, we need to dance
* with the locks as normal issue path does.
*/
- sdev->device_busy++;
+ atomic_inc(&sdev->device_busy);
atomic_inc(&shost->host_busy);
atomic_inc(&starget->target_busy);

@@ -1566,7 +1571,7 @@ static void scsi_request_fn(struct request_queue *q)
* accept it.
*/
req = blk_peek_request(q);
- if (!req || !scsi_dev_queue_ready(q, sdev))
+ if (!req)
break;

if (unlikely(!scsi_device_online(sdev))) {
@@ -1576,13 +1581,14 @@ static void scsi_request_fn(struct request_queue *q)
continue;
}

+ if (!scsi_dev_queue_ready(q, sdev))
+ break;

/*
* Remove the request from the request list.
*/
if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
blk_start_request(req);
- sdev->device_busy++;

spin_unlock_irq(q->queue_lock);
cmd = req->special;
@@ -1652,9 +1658,9 @@ static void scsi_request_fn(struct request_queue *q)
*/
spin_lock_irq(q->queue_lock);
blk_requeue_request(q, req);
- sdev->device_busy--;
+ atomic_dec(&sdev->device_busy);
out_delay:
- if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
+ if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
blk_delay_queue(q, SCSI_QUEUE_DELAY);
}

@@ -2394,7 +2400,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
return err;

scsi_run_queue(sdev->request_queue);
- while (sdev->device_busy) {
+ while (atomic_read(&sdev->device_busy)) {
msleep_interruptible(200);
scsi_run_queue(sdev->request_queue);
}
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 7ec5e06..54e3dac 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -585,13 +585,21 @@ static int scsi_sdev_check_buf_bit(const char *buf)
* Create the actual show/store functions and data structures.
*/
sdev_rd_attr (device_blocked, "%d\n");
-sdev_rd_attr (device_busy, "%d\n");
sdev_rd_attr (type, "%d\n");
sdev_rd_attr (scsi_level, "%d\n");
sdev_rd_attr (vendor, "%.8s\n");
sdev_rd_attr (model, "%.16s\n");
sdev_rd_attr (rev, "%.4s\n");

+static ssize_t
+sdev_show_device_busy(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct scsi_device *sdev = to_scsi_device(dev);
+ return snprintf(buf, 20, "%d\n", atomic_read(&sdev->device_busy));
+}
+static DEVICE_ATTR(device_busy, S_IRUGO, sdev_show_device_busy, NULL);
+
/*
* TODO: can we make these symlinks to the block layer ones?
*/
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index cb2a18e..3db4fc9 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -2573,7 +2573,7 @@ static int sg_proc_seq_show_dev(struct seq_file *s, void *v)
scsidp->id, scsidp->lun, (int) scsidp->type,
1,
(int) scsidp->queue_depth,
- (int) scsidp->device_busy,
+ (int) atomic_read(&scsidp->device_busy),
(int) scsi_device_online(scsidp));
}
read_unlock_irqrestore(&sg_index_lock, iflags);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 446f741..5ff3d24 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -81,9 +81,7 @@ struct scsi_device {
struct list_head siblings; /* list of all devices on this host */
struct list_head same_target_siblings; /* just the devices sharing same target id */

- /* this is now protected by the request_queue->queue_lock */
- unsigned int device_busy; /* commands actually active on
- * low-level. protected by queue_lock. */
+ atomic_t device_busy; /* commands actually active on LLDD */
spinlock_t list_lock;
struct list_head cmd_list; /* queue of in use SCSI Command structures */
struct list_head starved_entry;
--
1.7.10.4

2014-06-25 16:53:19

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 06/14] scsi: convert target_busy to an atomic_t

Avoid taking the host-wide host_lock to check the per-target queue limit.
Instead we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it after finishing all checks.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 52 ++++++++++++++++++++++++++------------------
include/scsi/scsi_device.h | 4 ++--
2 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 18e6449..5e269d6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -294,7 +294,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)

spin_lock_irqsave(shost->host_lock, flags);
shost->host_busy--;
- starget->target_busy--;
+ atomic_dec(&starget->target_busy);
if (unlikely(scsi_host_in_recovery(shost) &&
(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
@@ -361,7 +361,7 @@ static inline int scsi_device_is_busy(struct scsi_device *sdev)
static inline int scsi_target_is_busy(struct scsi_target *starget)
{
return ((starget->can_queue > 0 &&
- starget->target_busy >= starget->can_queue) ||
+ atomic_read(&starget->target_busy) >= starget->can_queue) ||
starget->target_blocked);
}

@@ -1305,37 +1305,49 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
struct scsi_device *sdev)
{
struct scsi_target *starget = scsi_target(sdev);
- int ret = 0;
+ unsigned int busy;

- spin_lock_irq(shost->host_lock);
if (starget->single_lun) {
+ spin_lock_irq(shost->host_lock);
if (starget->starget_sdev_user &&
- starget->starget_sdev_user != sdev)
- goto out;
+ starget->starget_sdev_user != sdev) {
+ spin_unlock_irq(shost->host_lock);
+ return 0;
+ }
starget->starget_sdev_user = sdev;
+ spin_unlock_irq(shost->host_lock);
}

- if (starget->target_busy == 0 && starget->target_blocked) {
+ busy = atomic_inc_return(&starget->target_busy) - 1;
+ if (busy == 0 && starget->target_blocked) {
/*
* unblock after target_blocked iterates to zero
*/
- if (--starget->target_blocked != 0)
- goto out;
+ spin_lock_irq(shost->host_lock);
+ if (--starget->target_blocked != 0) {
+ spin_unlock_irq(shost->host_lock);
+ goto out_dec;
+ }
+ spin_unlock_irq(shost->host_lock);

SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
"unblocking target at zero depth\n"));
}

- if (scsi_target_is_busy(starget)) {
- list_move_tail(&sdev->starved_entry, &shost->starved_list);
- goto out;
- }
+ if (starget->can_queue > 0 && busy >= starget->can_queue)
+ goto starved;
+ if (starget->target_blocked)
+ goto starved;

- scsi_target(sdev)->target_busy++;
- ret = 1;
-out:
+ return 1;
+
+starved:
+ spin_lock_irq(shost->host_lock);
+ list_move_tail(&sdev->starved_entry, &shost->starved_list);
spin_unlock_irq(shost->host_lock);
- return ret;
+out_dec:
+ atomic_dec(&starget->target_busy);
+ return 0;
}

/*
@@ -1445,7 +1457,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
spin_unlock(sdev->request_queue->queue_lock);
spin_lock(shost->host_lock);
shost->host_busy++;
- starget->target_busy++;
+ atomic_inc(&starget->target_busy);
spin_unlock(shost->host_lock);
spin_lock(sdev->request_queue->queue_lock);

@@ -1615,9 +1627,7 @@ static void scsi_request_fn(struct request_queue *q)
return;

host_not_ready:
- spin_lock_irq(shost->host_lock);
- scsi_target(sdev)->target_busy--;
- spin_unlock_irq(shost->host_lock);
+ atomic_dec(&scsi_target(sdev)->target_busy);
not_ready:
/*
* lock q, handle tag, requeue req, and decrement device_busy. We
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 816e8a2..446f741 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -290,8 +290,8 @@ struct scsi_target {
unsigned int expecting_lun_change:1; /* A device has reported
* a 3F/0E UA, other devices on
* the same target will also. */
- /* commands actually active on LLD. protected by host lock. */
- unsigned int target_busy;
+ /* commands actually active on LLD. */
+ atomic_t target_busy;
/*
* LLDs should set this in the slave_alloc host template callout.
* If set to zero then there is not limit.
--
1.7.10.4

2014-06-25 16:49:55

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 02/14] scsi: split __scsi_queue_insert

Factor out a helper to set the _blocked values, which we'll reuse for the
blk-mq code path.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d5d22e4..2667c75 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -75,28 +75,12 @@ struct kmem_cache *scsi_sdb_cache;
*/
#define SCSI_QUEUE_DELAY 3

-/**
- * __scsi_queue_insert - private queue insertion
- * @cmd: The SCSI command being requeued
- * @reason: The reason for the requeue
- * @unbusy: Whether the queue should be unbusied
- *
- * This is a private queue insertion. The public interface
- * scsi_queue_insert() always assumes the queue should be unbusied
- * because it's always called before the completion. This function is
- * for a requeue after completion, which should only occur in this
- * file.
- */
-static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
+static void
+scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
{
struct Scsi_Host *host = cmd->device->host;
struct scsi_device *device = cmd->device;
struct scsi_target *starget = scsi_target(device);
- struct request_queue *q = device->request_queue;
- unsigned long flags;
-
- SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
- "Inserting command %p into mlqueue\n", cmd));

/*
* Set the appropriate busy bit for the device/host.
@@ -123,6 +107,30 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
starget->target_blocked = starget->max_target_blocked;
break;
}
+}
+
+/**
+ * __scsi_queue_insert - private queue insertion
+ * @cmd: The SCSI command being requeued
+ * @reason: The reason for the requeue
+ * @unbusy: Whether the queue should be unbusied
+ *
+ * This is a private queue insertion. The public interface
+ * scsi_queue_insert() always assumes the queue should be unbusied
+ * because it's always called before the completion. This function is
+ * for a requeue after completion, which should only occur in this
+ * file.
+ */
+static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
+{
+ struct scsi_device *device = cmd->device;
+ struct request_queue *q = device->request_queue;
+ unsigned long flags;
+
+ SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
+ "Inserting command %p into mlqueue\n", cmd));
+
+ scsi_set_blocked(cmd, reason);

/*
* Decrement the counters, since these commands are no longer
--
1.7.10.4

2014-06-25 16:53:44

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd

The blk-mq code path will set this to a different function, so make the
code simpler by setting it up in a legacy-request-specific place.
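
The effect is that ->scsi_done becomes a per-path completion hook instead
of something scsi_dispatch_cmd() hardcodes. A fragment sketch of the
intended shape (scsi_mq_done is the blk-mq counterpart discussed later in
this thread, not part of this patch):

        /* legacy path, in scsi_request_fn(); completes via
         * blk_complete_request() */
        cmd->scsi_done = scsi_done;
        rtn = scsi_dispatch_cmd(cmd);

        /* a blk-mq path would install its own hook before dispatching */
        cmd->scsi_done = scsi_mq_done;
        rtn = scsi_dispatch_cmd(cmd);

        /* scsi_dispatch_cmd() itself only ever invokes the hook */
        cmd->scsi_done(cmd);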

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi.c | 23 +----------------------
drivers/scsi/scsi_lib.c | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index dcc43fd..d3bd6cf 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -72,8 +72,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/scsi.h>

-static void scsi_done(struct scsi_cmnd *cmd);
-
/*
* Definitions and constants.
*/
@@ -696,8 +694,6 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
}

trace_scsi_dispatch_cmd_start(cmd);
-
- cmd->scsi_done = scsi_done;
rtn = host->hostt->queuecommand(host, cmd);
if (rtn) {
trace_scsi_dispatch_cmd_error(cmd, rtn);
@@ -711,28 +707,11 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)

return rtn;
done:
- scsi_done(cmd);
+ cmd->scsi_done(cmd);
return 0;
}

/**
- * scsi_done - Invoke completion on finished SCSI command.
- * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
- * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
- *
- * Description: This function is the mid-level's (SCSI Core) interrupt routine,
- * which regains ownership of the SCSI command (de facto) from a LLDD, and
- * calls blk_complete_request() for further processing.
- *
- * This function is interrupt context safe.
- */
-static void scsi_done(struct scsi_cmnd *cmd)
-{
- trace_scsi_dispatch_cmd_done(cmd);
- blk_complete_request(cmd->request);
-}
-
-/**
* scsi_finish_command - cleanup and pass command back to upper layer
* @cmd: the command
*
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 63bf844..6989b6f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -29,6 +29,8 @@
#include <scsi/scsi_eh.h>
#include <scsi/scsi_host.h>

+#include <trace/events/scsi.h>
+
#include "scsi_priv.h"
#include "scsi_logging.h"

@@ -1480,6 +1482,23 @@ static void scsi_softirq_done(struct request *rq)
}
}

+/**
+ * scsi_done - Invoke completion on finished SCSI command.
+ * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
+ * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
+ *
+ * Description: This function is the mid-level's (SCSI Core) interrupt routine,
+ * which regains ownership of the SCSI command (de facto) from a LLDD, and
+ * calls blk_complete_request() for further processing.
+ *
+ * This function is interrupt context safe.
+ */
+static void scsi_done(struct scsi_cmnd *cmd)
+{
+ trace_scsi_dispatch_cmd_done(cmd);
+ blk_complete_request(cmd->request);
+}
+
/*
* Function: scsi_request_fn()
*
@@ -1582,6 +1601,7 @@ static void scsi_request_fn(struct request_queue *q)
/*
* Dispatch the command to the low-level driver.
*/
+ cmd->scsi_done = scsi_done;
rtn = scsi_dispatch_cmd(cmd);
if (rtn) {
scsi_queue_insert(cmd, rtn);
--
1.7.10.4

2014-06-25 16:54:42

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn

Make sure we only have the logic for requeueing commands in one place.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi.c | 35 ++++++++++++-----------------------
drivers/scsi/scsi_lib.c | 9 ++++++---
2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index ce5b4e5..dcc43fd 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
* returns an immediate error upwards, and signals
* that the device is no longer present */
cmd->result = DID_NO_CONNECT << 16;
- scsi_done(cmd);
- /* return 0 (because the command has been processed) */
- goto out;
+ goto done;
}

/* Check to see if the scsi lld made this device blocked. */
@@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
* occur until the device transitions out of the
* suspend state.
*/
-
- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
-
SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
"queuecommand : device blocked\n"));
-
- /*
- * NOTE: rtn is still zero here because we don't need the
- * queue to be plugged on return (it's already stopped)
- */
- goto out;
+ return SCSI_MLQUEUE_DEVICE_BUSY;
}

/*
@@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
"cdb_size=%d host->max_cmd_len=%d\n",
cmd->cmd_len, cmd->device->host->max_cmd_len));
cmd->result = (DID_ABORT << 16);
-
- scsi_done(cmd);
- goto out;
+ goto done;
}

if (unlikely(host->shost_state == SHOST_DEL)) {
cmd->result = (DID_NO_CONNECT << 16);
- scsi_done(cmd);
- } else {
- trace_scsi_dispatch_cmd_start(cmd);
- cmd->scsi_done = scsi_done;
- rtn = host->hostt->queuecommand(host, cmd);
+ goto done;
+
}

+ trace_scsi_dispatch_cmd_start(cmd);
+
+ cmd->scsi_done = scsi_done;
+ rtn = host->hostt->queuecommand(host, cmd);
if (rtn) {
trace_scsi_dispatch_cmd_error(cmd, rtn);
if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
@@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)

SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
"queuecommand : request rejected\n"));
-
- scsi_queue_insert(cmd, rtn);
}

- out:
return rtn;
+ done:
+ scsi_done(cmd);
+ return 0;
}

/**
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2667c75..63bf844 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1583,9 +1583,12 @@ static void scsi_request_fn(struct request_queue *q)
* Dispatch the command to the low-level driver.
*/
rtn = scsi_dispatch_cmd(cmd);
- spin_lock_irq(q->queue_lock);
- if (rtn)
+ if (rtn) {
+ scsi_queue_insert(cmd, rtn);
+ spin_lock_irq(q->queue_lock);
goto out_delay;
+ }
+ spin_lock_irq(q->queue_lock);
}

return;
@@ -1605,7 +1608,7 @@ static void scsi_request_fn(struct request_queue *q)
blk_requeue_request(q, req);
sdev->device_busy--;
out_delay:
- if (sdev->device_busy == 0)
+ if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
blk_delay_queue(q, SCSI_QUEUE_DELAY);
}

--
1.7.10.4

2014-06-25 16:55:31

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 01/14] sd: don't use rq->cmd_len before setting it up

Unlike the old request code, blk-mq doesn't initialize cmd_len with a
default value, so don't rely on it being set in sd_setup_write_same_cmnd.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/sd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9c86e3d..6ec4ffe 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -824,15 +824,16 @@ static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)

rq->__data_len = sdp->sector_size;
rq->timeout = SD_WRITE_SAME_TIMEOUT;
- memset(rq->cmd, 0, rq->cmd_len);

if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff) {
rq->cmd_len = 16;
+ memset(rq->cmd, 0, rq->cmd_len);
rq->cmd[0] = WRITE_SAME_16;
put_unaligned_be64(sector, &rq->cmd[2]);
put_unaligned_be32(nr_sectors, &rq->cmd[10]);
} else {
rq->cmd_len = 10;
+ memset(rq->cmd, 0, rq->cmd_len);
rq->cmd[0] = WRITE_SAME;
put_unaligned_be32(sector, &rq->cmd[2]);
put_unaligned_be16(nr_sectors, &rq->cmd[7]);
--
1.7.10.4

2014-06-26 04:50:50

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-06-25 10:51, Christoph Hellwig wrote:
> This is the second post of the scsi-mq series.
>
> At this point the code is ready for merging and use by developers and early
> adopters. The core blk-mq code isn't that suitable for slow devices
> yet, mostly due to the lack of an I/O scheduler, but Jens is working on it.
> Similarly there is no dm-multipath support for drivers using blk-mq yet,
> but I'm working on it. It should also be noted that the code doesn't
> actually support multiple hardware queues or fine grained tuning of the
> blk-mq parameters yet. All these could be added fairly easily as soon
> as low-level drivers want to make use of them.
>
> The amount of chances to the existing code are fairly small, and mostly
> speedups or cleanups that also apply to the old path as well. Because
> of this I also haven't bothered to put it under a config option, just
> like the blk-mq core.
>
> The usage of blk-mq dramatically decreases CPU usage under all workloads going
> down from 100% CPU usage that the old setup can hit easily to usually less
> than 20% for maxing out storage subsystems with 512byte reads and writes,
> and it allows to easily archive millions of IOPS. Bart and Robert have
> helped with some very detailed measurements that they might be able to send
> in reply to this, although these usually involve significantly reworked low
> level drivers to avoid other bottle necks.
>
> One major objection to previous iterations of this code was the simple
> replacement of the host_lock with atomic counters for the host and busy
> counters. The host_lock avoidance on it's own already improves performance,
> and with the patch to avoid maintaining the per-target busy counter unless
> needed we now replace a lock round trip on the host_lock with just a single
> atomic increment in the submission path, and a single atomic decrement in
> completion path, which should provide benefits even for the oddest RISC
> architecture. Longer term I'd still love to get rid of these entirely
> and use the counters in blk-mq, but due to the difference in how they
> are maintained this doesn't seem feasible as long as we still need to
> support the legacy request code path.
>
> Changes from V1:
> - rebased on top of the core-for-3.17 branch, most notable the
> scsi logging changes
> - fixed handling of cmd_list to prevent crashes for some heavy
> workloads
> - fixed incorrect handling of !target->can_queue
> - avoid scheduling a workqueue on I/O completions when no queues
> are congested
>
> In addition to the patches in this thread there also is a git available at:
>
> git://git.infradead.org/users/hch/scsi.git scsi-mq.2

You can add my acked/reviewed-by to the series.

--
Jens Axboe

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Wednesday, 25 June, 2014 11:51 PM
> To: Christoph Hellwig; James Bottomley
> Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
> [email protected]; [email protected]
> Subject: Re: scsi-mq V2
>
> On 2014-06-25 10:51, Christoph Hellwig wrote:
> > This is the second post of the scsi-mq series.
> >
...
> >
> > Changes from V1:
> > - rebased on top of the core-for-3.17 branch, most notable the
> > scsi logging changes
> > - fixed handling of cmd_list to prevent crashes for some heavy
> > workloads
> > - fixed incorrect handling of !target->can_queue
> > - avoid scheduling a workqueue on I/O completions when no queues
> > are congested
> >
> > In addition to the patches in this thread there also is a git available at:
> >
> > git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>
> You can add my acked/reviewed-by to the series.
>
> --
> Jens Axboe

Since March 20th (circa LSF-MM 2014) we've run many hours of tests
with hpsa and the scsi-mq tree. We've also done a little bit of
testing with mpt3sas and, in the last few days, scsi_debug.

Although there are certainly more problems to find and improvements
to be made, it's become quite stable. It's even been used on the
boot drives of our test servers.

For the patches in scsi-mq.2 you may add:
Tested-by: Robert Elliott <[email protected]>


---
Rob Elliott HP Server Storage


2014-06-27 14:42:26

by Bart Van Assche

[permalink] [raw]
Subject: Re: scsi-mq V2

On 06/27/14 00:07, Elliott, Robert (Server Storage) wrote:
>> -----Original Message-----
>> From: Jens Axboe [mailto:[email protected]]
>> Sent: Wednesday, 25 June, 2014 11:51 PM
>> To: Christoph Hellwig; James Bottomley
>> Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
>> [email protected]; [email protected]
>> Subject: Re: scsi-mq V2
>>
>> On 2014-06-25 10:51, Christoph Hellwig wrote:
>>> This is the second post of the scsi-mq series.
>>>
> ...
>>>
>>> Changes from V1:
>>> - rebased on top of the core-for-3.17 branch, most notable the
>>> scsi logging changes
>>> - fixed handling of cmd_list to prevent crashes for some heavy
>>> workloads
>>> - fixed incorrect handling of !target->can_queue
>>> - avoid scheduling a workqueue on I/O completions when no queues
>>> are congested
>>>
>>> In addition to the patches in this thread there also is a git available at:
>>>
>>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>>
>> You can add my acked/reviewed-by to the series.
>
> Since March 20th (circa LSF-MM 2014) we've run many hours of tests
> with hpsa and the scsi-mq tree. We've also done a little bit of
> testing with mpt3sas and, in the last few days, scsi_debug.
>
> Although there are certainly more problems to find and improvements
> to be made, it's become quite stable. It's even been used on the
> boot drives of our test servers.
>
> For the patches in scsi-mq.2 you may add:
> Tested-by: Robert Elliott <[email protected]>

Performance of scsi-mq-v2 looks even better than that of scsi-mq-v1. The
slight single-LUN regression is gone, peak IOPS with use_blk_mq=Y on my
test setup is now 3x the performance of use_blk_mq=N and latency has
been reduced further. I think this means reducing the number of context
switches did really help :-) Detailed measurement results can be found
on https://drive.google.com/file/d/0B1YQOreL3_FxWmZfbl8xSzRfdGM/.

If you want you may add to the scsi-mq-v2 patch series:

Tested-by: Bart Van Assche <[email protected]>

Bart.

2014-06-30 15:20:41

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 06/25/2014 10:50 PM, Jens Axboe wrote:
> On 2014-06-25 10:51, Christoph Hellwig wrote:
>> This is the second post of the scsi-mq series.
>>
>> At this point the code is ready for merging and use by developers and
>> early
>> adopters. The core blk-mq code isn't that suitable for slow devices
>> yet, mostly due to the lack of an I/O scheduler, but Jens is working
>> on it.
>> Similarly there is no dm-multipath support for drivers using blk-mq yet,
>> but I'm working on it. It should also be noted that the code doesn't
>> actually support multiple hardware queues or fine grained tuning of the
>> blk-mq parameters yet. All these could be added fairly easily as soon
>> as low-level drivers want to make use of them.
>>
>> The amount of chances to the existing code are fairly small, and mostly
>> speedups or cleanups that also apply to the old path as well. Because
>> of this I also haven't bothered to put it under a config option, just
>> like the blk-mq core.
>>
>> The usage of blk-mq dramatically decreases CPU usage under all
>> workloads going
>> down from 100% CPU usage that the old setup can hit easily to usually
>> less
>> than 20% for maxing out storage subsystems with 512byte reads and writes,
>> and it allows to easily archive millions of IOPS. Bart and Robert have
>> helped with some very detailed measurements that they might be able to
>> send
>> in reply to this, although these usually involve significantly
>> reworked low
>> level drivers to avoid other bottle necks.
>>
>> One major objection to previous iterations of this code was the simple
>> replacement of the host_lock with atomic counters for the host and busy
>> counters. The host_lock avoidance on it's own already improves
>> performance,
>> and with the patch to avoid maintaining the per-target busy counter
>> unless
>> needed we now replace a lock round trip on the host_lock with just a
>> single
>> atomic increment in the submission path, and a single atomic decrement in
>> completion path, which should provide benefits even for the oddest RISC
>> architecture. Longer term I'd still love to get rid of these entirely
>> and use the counters in blk-mq, but due to the difference in how they
>> are maintained this doesn't seem feasible as long as we still need to
>> support the legacy request code path.
>>
>> Changes from V1:
>> - rebased on top of the core-for-3.17 branch, most notable the
>> scsi logging changes
>> - fixed handling of cmd_list to prevent crashes for some heavy
>> workloads
>> - fixed incorrect handling of !target->can_queue
>> - avoid scheduling a workqueue on I/O completions when no queues
>> are congested
>>
>> In addition to the patches in this thread there also is a git
>> available at:
>>
>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>
> You can add my acked/reviewed-by to the series.

Ran stress testing from Friday to now, 65h of beating up on it and no
problems observed. 47TB read and 20TB written for a total of 17.7
billion IOs issued and completed. Latencies look good. I officially
declare this code bug free.

Bug-free-by: Jens Axboe <[email protected]>

Now let's get this queued up for inclusion, pretty please.

--
Jens Axboe

2014-06-30 15:25:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Mon, Jun 30, 2014 at 09:20:51AM -0600, Jens Axboe wrote:
> Ran stress testing from Friday to now, 65h of beating up on it and no
> problems observed. 47TB read and 20TB written for a total of 17.7
> billion of IOs issued and completed. Latencies look good. I officially
> declare this code for bug free.
>
> Bug-free-by: Jens Axboe <[email protected]>
>
> Now lets get this queued up for inclusion, pretty please.

I'm still looking for one (or better two) persons familiar with the
SCSI and/or block code to go over it and do a real detailed review.

2014-06-30 15:55:41

by Martin K. Petersen

[permalink] [raw]
Subject: Re: scsi-mq V2

>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:

Christoph> I'm still looking for one (or better two) persons familiar
Christoph> with the SCSI and/or block code to go over it and do a real
Christoph> detailed review.

I'm on vacation for a couple of days. Will review Wednesday.

--
Martin K. Petersen Oracle Linux Engineering

2014-07-08 14:48:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
> Changes from V1:
> - rebased on top of the core-for-3.17 branch, most notable the
> scsi logging changes
> - fixed handling of cmd_list to prevent crashes for some heavy
> workloads
> - fixed incorrect handling of !target->can_queue
> - avoid scheduling a workqueue on I/O completions when no queues
> are congested
>
> In addition to the patches in this thread there also is a git available at:
>
> git://git.infradead.org/users/hch/scsi.git scsi-mq.2


I've pushed out a new scsi-mq.3 branch, which has been rebased on the
latest core-for-3.17 tree + the "RFC: clean up command setup" series
from June 29th. Robert Elliott found a problem with not fully zeroed
out UNMAP CDBs, which is fixed by the saner discard handling in that
series.

There is a new patch to factor out code from the above series for
blk-mq use, which I've attached below. Besides that the only changes
are minor merge fixups in the main blk-mq usage patch.

---
From f925c317c74849666d599926d8ad8f34ef99d5cf Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Tue, 8 Jul 2014 13:16:17 +0200
Subject: scsi: add scsi_setup_cmnd helper

Factor out command setup code that will be shared with the blk-mq code path.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/scsi_lib.c | 40 ++++++++++++++++++++++------------------
1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 116f541..61afae8 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1116,6 +1116,27 @@ static int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
return scsi_cmd_to_driver(cmd)->init_command(cmd);
}

+static int scsi_setup_cmnd(struct scsi_device *sdev, struct request *req)
+{
+ struct scsi_cmnd *cmd = req->special;
+
+ if (!blk_rq_bytes(req))
+ cmd->sc_data_direction = DMA_NONE;
+ else if (rq_data_dir(req) == WRITE)
+ cmd->sc_data_direction = DMA_TO_DEVICE;
+ else
+ cmd->sc_data_direction = DMA_FROM_DEVICE;
+
+ switch (req->cmd_type) {
+ case REQ_TYPE_FS:
+ return scsi_setup_fs_cmnd(sdev, req);
+ case REQ_TYPE_BLOCK_PC:
+ return scsi_setup_blk_pc_cmnd(sdev, req);
+ default:
+ return BLKPREP_KILL;
+ }
+}
+
static int
scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
{
@@ -1219,24 +1240,7 @@ static int scsi_prep_fn(struct request_queue *q, struct request *req)
goto out;
}

- if (!blk_rq_bytes(req))
- cmd->sc_data_direction = DMA_NONE;
- else if (rq_data_dir(req) == WRITE)
- cmd->sc_data_direction = DMA_TO_DEVICE;
- else
- cmd->sc_data_direction = DMA_FROM_DEVICE;
-
- switch (req->cmd_type) {
- case REQ_TYPE_FS:
- ret = scsi_setup_fs_cmnd(sdev, req);
- break;
- case REQ_TYPE_BLOCK_PC:
- ret = scsi_setup_blk_pc_cmnd(sdev, req);
- break;
- default:
- ret = BLKPREP_KILL;
- }
-
+ ret = scsi_setup_cmnd(sdev, req);
out:
return scsi_prep_return(q, req, ret);
}
--
1.7.10.4

Subject: RE: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Wednesday, 25 June, 2014 11:52 AM
> To: James Bottomley
> Cc: Jens Axboe; Bart Van Assche; Elliott, Robert (Server Storage); linux-
> [email protected]; [email protected]
> Subject: [PATCH 03/14] scsi: centralize command re-queueing in
> scsi_dispatch_fn
>
> Make sure we only have the logic for requeueing commands in one place.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi.c | 35 ++++++++++++-----------------------
> drivers/scsi/scsi_lib.c | 9 ++++++---
> 2 files changed, 18 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index ce5b4e5..dcc43fd 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> * returns an immediate error upwards, and signals
> * that the device is no longer present */
> cmd->result = DID_NO_CONNECT << 16;
> - scsi_done(cmd);
> - /* return 0 (because the command has been processed) */
> - goto out;
> + goto done;
> }
>
> /* Check to see if the scsi lld made this device blocked. */
> @@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> * occur until the device transitions out of the
> * suspend state.
> */
> -
> - scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> -
> SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
> "queuecommand : device blocked\n"));
> -
> - /*
> - * NOTE: rtn is still zero here because we don't need the
> - * queue to be plugged on return (it's already stopped)
> - */
> - goto out;
> + return SCSI_MLQUEUE_DEVICE_BUSY;
> }
>
> /*
> @@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> "cdb_size=%d host->max_cmd_len=%d\n",
> cmd->cmd_len, cmd->device->host->max_cmd_len));
> cmd->result = (DID_ABORT << 16);
> -
> - scsi_done(cmd);
> - goto out;
> + goto done;
> }
>
> if (unlikely(host->shost_state == SHOST_DEL)) {
> cmd->result = (DID_NO_CONNECT << 16);
> - scsi_done(cmd);
> - } else {
> - trace_scsi_dispatch_cmd_start(cmd);
> - cmd->scsi_done = scsi_done;
> - rtn = host->hostt->queuecommand(host, cmd);
> + goto done;
> +
> }
>
> + trace_scsi_dispatch_cmd_start(cmd);
> +
> + cmd->scsi_done = scsi_done;
> + rtn = host->hostt->queuecommand(host, cmd);
> if (rtn) {
> trace_scsi_dispatch_cmd_error(cmd, rtn);
> if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
> @@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>
> SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
> "queuecommand : request rejected\n"));
> -
> - scsi_queue_insert(cmd, rtn);
> }
>
> - out:
> return rtn;
> + done:
> + scsi_done(cmd);
> + return 0;
> }
>

Related to the position of the trace_scsi_dispatch_cmd_start()
call... this function does:

1. check sdev_state - goto done
2. check scsi_device_blocked() - return
3. put LUN into CDB for ancient SCSI-1 devices
4. scsi_log_send()
5. check cmd_len - goto done
6. check shost_state - goto done
7. trace_scsi_dispatch_cmd_start()
8. queuecommand()
9. return
10. done:
cmd->scsi_done(cmd) [PATCH 04/14 upgrades it to this]
return 0;

It's inconsistent for logging and tracing to occur after
different numbers of checks.

In scsi_lib.c, both scsi_done() and scsi_mq_done() always call
trace_scsi_dispatch_cmd_done(), so trace_scsi_dispatch_cmd_start()
should be called before scsi_done() is called. That way the
trace will always have a submission to match each completion.

That means trace should be called before the sdev_state check
(which calls scsi_done()).

I don't know about the scsi_device_blocked check (which just
returns). Should the trace record multiple submissions with
one completion? Maybe trace_scsi_dispatch_cmd_start()
and trace_scsi_dispatch_cmd_done() should both be called?

scsi_log_completion() is called by scsi_softirq_done() and
scsi_times_out() but not by scsi_done() and scsi_mq_done(), so
scsi_log_send() should not be called unless all the checks
pass and an IO is really queued.

That would lead to something like:
1. check sdev_state - goto done
2. check scsi_device_blocked() - return
3. put LUN into CDB for ancient SCSI-1 devices
5. check cmd_len - goto done
6. check shost_state - goto done
7a. scsi_log_send()
7b. trace_scsi_dispatch_cmd_start()
8. queuecommand()
9. return
10. done:
trace_scsi_dispatch_cmd_start()
cmd->scsi_done(cmd);
return 0;
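
Rendered as code, that ordering would look roughly like this (only the
cmd_len check is spelled out; the other checks stay as in the patch quoted
above):

static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
{
        struct Scsi_Host *host = cmd->device->host;
        int rtn;

        /* steps 1-3, 5, 6: the existing state/blocked/LUN/cmd_len/shost
         * checks, unchanged except that logging moves out of them */
        if (cmd->cmd_len > host->max_cmd_len) {
                cmd->result = (DID_ABORT << 16);
                goto done;
        }

        scsi_log_send(cmd);                     /* 7a: only for a real submission */
        trace_scsi_dispatch_cmd_start(cmd);     /* 7b */
        rtn = host->hostt->queuecommand(host, cmd);
        /* error handling as in the patch */
        return rtn;

done:
        trace_scsi_dispatch_cmd_start(cmd);     /* keep start/done traces paired */
        cmd->scsi_done(cmd);
        return 0;
}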

---
Rob Elliott HP Server Storage


2014-07-09 06:40:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn

On Tue, Jul 08, 2014 at 08:51:30PM +0000, Elliott, Robert (Server Storage) wrote:
> In scsi_lib.c, both scsi_done() and scsi_mq_done() always call
> trace_scsi_dispatch_cmd_done(), so trace_scsi_dispatch_cmd_start()
> should be called before scsi_done() is called. That way the
> trace will always have a submission to match each completion.
>
> That means trace should be called before the sdev_state check
> (which calls scsi_done()).
>
> I don't know about the scsi_device_blocked check (which just
> returns). Should the trace record multiple submissions with
> one completion? Maybe both trace_scsi_dispatch_cmd_start()
> and trace_scsi_dispatch_cmd_done() should both be called?

trace_scsi_dispatch_cmd_start is maybe a little misnamed as it traces
the command submission to the driver. So getting a done trace without
this one sounds perfectly fine. Adding another trace for an error
before submission could be done if you care about pairing. The *_BUSY
returns don't fit this scheme at all.

But none of this is really specific to this patch. Hannes has some plans to clean
up the logging and tracing mess in scsi, and it might be a good idea
to incorporate it there.

2014-07-09 11:12:23

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Seems like these counters are missing any sort of synchronization for
> updates, as an over 10 year old comment from me noted. Fix this by
> using atomic counters, and while we're at it also make sure they are
> in the same cacheline as the _busy counters and not needlessly stored
> to in every I/O completion.
>
> With the new model the _busy counters can temporarily go negative,
> so all the readers are updated to check for > 0 values. Longer
> term every successful I/O completion will reset the counters to zero,
> so the temporarily negative values will not cause any harm.
>
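
For the record, the negative excursion comes from an interleaving
like the following (sketched here for device_blocked), which is why
all readers test for > 0 rather than != 0:

	CPU0: atomic_read(&sdev->device_blocked)	/* sees 1 */
	CPU1: atomic_set(&sdev->device_blocked, 0)	/* completion resets */
	CPU0: atomic_dec_return(&sdev->device_blocked)	/* now -1 */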
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi.c | 21 ++++++------
> drivers/scsi/scsi_lib.c | 82 +++++++++++++++++++++-----------------------
> drivers/scsi/scsi_sysfs.c | 10 +++++-
> include/scsi/scsi_device.h | 7 ++--
> include/scsi/scsi_host.h | 7 ++--
> 5 files changed, 64 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 35a23e2..b362058 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -729,17 +729,16 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
>
> scsi_device_unbusy(sdev);
>
> - /*
> - * Clear the flags which say that the device/host is no longer
> - * capable of accepting new commands. These are set in scsi_queue.c
> - * for both the queue full condition on a device, and for a
> - * host full condition on the host.
> - *
> - * XXX(hch): What about locking?
> - */
> - shost->host_blocked = 0;
> - starget->target_blocked = 0;
> - sdev->device_blocked = 0;
> + /*
> + * Clear the flags which say that the device/target/host is no longer
> + * capable of accepting new commands.
> + */
> + if (atomic_read(&shost->host_blocked))
> + atomic_set(&shost->host_blocked, 0);
> + if (atomic_read(&starget->target_blocked))
> + atomic_set(&starget->target_blocked, 0);
> + if (atomic_read(&sdev->device_blocked))
> + atomic_set(&sdev->device_blocked, 0);
>
> /*
> * If we have valid sense information, then some kind of recovery
Hmm. I guess there is a race window between
atomic_read() and atomic_set().
Doesn't this cause issues when someone calls atomic_set() just
before the call to atomic_read?
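
A sketch of one way to narrow that window: clear with a cmpxchg
against the value we read, so a value that a concurrent
scsi_set_blocked() writes in between is not wiped:

	int blocked = atomic_read(&shost->host_blocked);

	if (blocked)
		atomic_cmpxchg(&shost->host_blocked, blocked, 0);

If another CPU re-blocks the host between the read and the cmpxchg,
the cmpxchg fails and the fresh value survives.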

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index e23fef5..a39d5ba 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -99,14 +99,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
> */
> switch (reason) {
> case SCSI_MLQUEUE_HOST_BUSY:
> - host->host_blocked = host->max_host_blocked;
> + atomic_set(&host->host_blocked, host->max_host_blocked);
> break;
> case SCSI_MLQUEUE_DEVICE_BUSY:
> case SCSI_MLQUEUE_EH_RETRY:
> - device->device_blocked = device->max_device_blocked;
> + atomic_set(&device->device_blocked,
> + device->max_device_blocked);
> break;
> case SCSI_MLQUEUE_TARGET_BUSY:
> - starget->target_blocked = starget->max_target_blocked;
> + atomic_set(&starget->target_blocked,
> + starget->max_target_blocked);
> break;
> }
> }
> @@ -351,30 +353,39 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> spin_unlock_irqrestore(shost->host_lock, flags);
> }
>
> -static inline int scsi_device_is_busy(struct scsi_device *sdev)
> +static inline bool scsi_device_is_busy(struct scsi_device *sdev)
> {
> if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> - return 1;
> - if (sdev->device_blocked)
> - return 1;
> + return true;
> + if (atomic_read(&sdev->device_blocked) > 0)
> + return true;
> return 0;
> }
>
> -static inline int scsi_target_is_busy(struct scsi_target *starget)
> +static inline bool scsi_target_is_busy(struct scsi_target *starget)
> {
> - return ((starget->can_queue > 0 &&
> - atomic_read(&starget->target_busy) >= starget->can_queue) ||
> - starget->target_blocked);
> + if (starget->can_queue > 0) {
> + if (atomic_read(&starget->target_busy) >= starget->can_queue)
> + return true;
> + if (atomic_read(&starget->target_blocked) > 0)
> + return true;
> + }
> +
> + return false;
> }
>
> -static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> +static inline bool scsi_host_is_busy(struct Scsi_Host *shost)
> {
> - if ((shost->can_queue > 0 &&
> - atomic_read(&shost->host_busy) >= shost->can_queue) ||
> - shost->host_blocked || shost->host_self_blocked)
> - return 1;
> + if (shost->can_queue > 0) {
> + if (atomic_read(&shost->host_busy) >= shost->can_queue)
> + return true;
> + if (atomic_read(&shost->host_blocked) > 0)
> + return true;
> + if (shost->host_self_blocked)
> + return true;
> + }
>
> - return 0;
> + return false;
> }
>
> static void scsi_starved_list_run(struct Scsi_Host *shost)
> @@ -1283,11 +1294,8 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> unsigned int busy;
>
> busy = atomic_inc_return(&sdev->device_busy) - 1;
> - if (busy == 0 && sdev->device_blocked) {
> - /*
> - * unblock after device_blocked iterates to zero
> - */
> - if (--sdev->device_blocked != 0) {
> + if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
> + if (atomic_dec_return(&sdev->device_blocked) > 0) {
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> goto out_dec;
> }
> @@ -1297,7 +1305,7 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>
> if (busy >= sdev->queue_depth)
> goto out_dec;
> - if (sdev->device_blocked)
> + if (atomic_read(&sdev->device_blocked) > 0)
> goto out_dec;
>
> return 1;
> @@ -1328,16 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
> }
>
> busy = atomic_inc_return(&starget->target_busy) - 1;
> - if (busy == 0 && starget->target_blocked) {
> - /*
> - * unblock after target_blocked iterates to zero
> - */
> - spin_lock_irq(shost->host_lock);
> - if (--starget->target_blocked != 0) {
> - spin_unlock_irq(shost->host_lock);
> + if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
> + if (atomic_dec_return(&starget->target_blocked) > 0)
> goto out_dec;
> - }
> - spin_unlock_irq(shost->host_lock);
>
> SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
> "unblocking target at zero depth\n"));
> @@ -1345,7 +1346,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>
> if (starget->can_queue > 0 && busy >= starget->can_queue)
> goto starved;
> - if (starget->target_blocked)
> + if (atomic_read(&starget->target_blocked) > 0)
> goto starved;
>
> return 1;
> @@ -1374,16 +1375,9 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
> return 0;
>
> busy = atomic_inc_return(&shost->host_busy) - 1;
> - if (busy == 0 && shost->host_blocked) {
> - /*
> - * unblock after host_blocked iterates to zero
> - */
> - spin_lock_irq(shost->host_lock);
> - if (--shost->host_blocked != 0) {
> - spin_unlock_irq(shost->host_lock);
> + if (busy == 0 && atomic_read(&shost->host_blocked) > 0) {
> + if (atomic_dec_return(&shost->host_blocked) > 0)
> goto out_dec;
> - }
> - spin_unlock_irq(shost->host_lock);
>
> SCSI_LOG_MLQUEUE(3,
> shost_printk(KERN_INFO, shost,
Same with this one.
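
For these decrement paths a lock-free alternative might be
atomic_dec_if_positive(), which never takes the counter below zero
(sketch only, shown for the device_blocked case):

	int dec = atomic_dec_if_positive(&sdev->device_blocked);

	if (dec > 0) {
		/* still counting down, keep the queue delayed */
		blk_delay_queue(q, SCSI_QUEUE_DELAY);
		goto out_dec;
	}
	if (dec == 0)
		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
			"unblocking device at zero depth\n"));
	/* dec < 0: device_blocked was already zero, nothing to do */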

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:12:38

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 01/14] sd: don't use rq->cmd_len before setting it up

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Unlike the old request code blk-mq doesn't initialize cmd_len with a
> default value, so don't rely on it being set in sd_setup_write_same_cmnd.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/sd.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 9c86e3d..6ec4ffe 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -824,15 +824,16 @@ static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)
>
> rq->__data_len = sdp->sector_size;
> rq->timeout = SD_WRITE_SAME_TIMEOUT;
> - memset(rq->cmd, 0, rq->cmd_len);
>
> if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff) {
> rq->cmd_len = 16;
> + memset(rq->cmd, 0, rq->cmd_len);
> rq->cmd[0] = WRITE_SAME_16;
> put_unaligned_be64(sector, &rq->cmd[2]);
> put_unaligned_be32(nr_sectors, &rq->cmd[10]);
> } else {
> rq->cmd_len = 10;
> + memset(rq->cmd, 0, rq->cmd_len);
> rq->cmd[0] = WRITE_SAME;
> put_unaligned_be32(sector, &rq->cmd[2]);
> put_unaligned_be16(nr_sectors, &rq->cmd[7]);
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:12:57

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 02/14] scsi: split __scsi_queue_insert

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Factor out a helper to set the _blocked values, which we'll reuse for the
> blk-mq code path.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 44 ++++++++++++++++++++++++++------------------
> 1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index d5d22e4..2667c75 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -75,28 +75,12 @@ struct kmem_cache *scsi_sdb_cache;
> */
> #define SCSI_QUEUE_DELAY 3
>
> -/**
> - * __scsi_queue_insert - private queue insertion
> - * @cmd: The SCSI command being requeued
> - * @reason: The reason for the requeue
> - * @unbusy: Whether the queue should be unbusied
> - *
> - * This is a private queue insertion. The public interface
> - * scsi_queue_insert() always assumes the queue should be unbusied
> - * because it's always called before the completion. This function is
> - * for a requeue after completion, which should only occur in this
> - * file.
> - */
> -static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> +static void
> +scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
> {
> struct Scsi_Host *host = cmd->device->host;
> struct scsi_device *device = cmd->device;
> struct scsi_target *starget = scsi_target(device);
> - struct request_queue *q = device->request_queue;
> - unsigned long flags;
> -
> - SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
> - "Inserting command %p into mlqueue\n", cmd));
>
> /*
> * Set the appropriate busy bit for the device/host.
> @@ -123,6 +107,30 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> starget->target_blocked = starget->max_target_blocked;
> break;
> }
> +}
> +
> +/**
> + * __scsi_queue_insert - private queue insertion
> + * @cmd: The SCSI command being requeued
> + * @reason: The reason for the requeue
> + * @unbusy: Whether the queue should be unbusied
> + *
> + * This is a private queue insertion. The public interface
> + * scsi_queue_insert() always assumes the queue should be unbusied
> + * because it's always called before the completion. This function is
> + * for a requeue after completion, which should only occur in this
> + * file.
> + */
> +static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> +{
> + struct scsi_device *device = cmd->device;
> + struct request_queue *q = device->request_queue;
> + unsigned long flags;
> +
> + SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
> + "Inserting command %p into mlqueue\n", cmd));
> +
> + scsi_set_blocked(cmd, reason);
>
> /*
> * Decrement the counters, since these commands are no longer
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:13:37

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Make sure we only have the logic for re-queueing commands in one place.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi.c | 35 ++++++++++++-----------------------
> drivers/scsi/scsi_lib.c | 9 ++++++---
> 2 files changed, 18 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index ce5b4e5..dcc43fd 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> * returns an immediate error upwards, and signals
> * that the device is no longer present */
> cmd->result = DID_NO_CONNECT << 16;
> - scsi_done(cmd);
> - /* return 0 (because the command has been processed) */
> - goto out;
> + goto done;
> }
>
> /* Check to see if the scsi lld made this device blocked. */
> @@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> * occur until the device transitions out of the
> * suspend state.
> */
> -
> - scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> -
> SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
> "queuecommand : device blocked\n"));
> -
> - /*
> - * NOTE: rtn is still zero here because we don't need the
> - * queue to be plugged on return (it's already stopped)
> - */
> - goto out;
> + return SCSI_MLQUEUE_DEVICE_BUSY;
> }
>
> /*
> @@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> "cdb_size=%d host->max_cmd_len=%d\n",
> cmd->cmd_len, cmd->device->host->max_cmd_len));
> cmd->result = (DID_ABORT << 16);
> -
> - scsi_done(cmd);
> - goto out;
> + goto done;
> }
>
> if (unlikely(host->shost_state == SHOST_DEL)) {
> cmd->result = (DID_NO_CONNECT << 16);
> - scsi_done(cmd);
> - } else {
> - trace_scsi_dispatch_cmd_start(cmd);
> - cmd->scsi_done = scsi_done;
> - rtn = host->hostt->queuecommand(host, cmd);
> + goto done;
> +
> }
>
> + trace_scsi_dispatch_cmd_start(cmd);
> +
> + cmd->scsi_done = scsi_done;
> + rtn = host->hostt->queuecommand(host, cmd);
> if (rtn) {
> trace_scsi_dispatch_cmd_error(cmd, rtn);
> if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
> @@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>
> SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
> "queuecommand : request rejected\n"));
> -
> - scsi_queue_insert(cmd, rtn);
> }
>
> - out:
> return rtn;
> + done:
> + scsi_done(cmd);
> + return 0;
> }
>
> /**
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 2667c75..63bf844 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1583,9 +1583,12 @@ static void scsi_request_fn(struct request_queue *q)
> * Dispatch the command to the low-level driver.
> */
> rtn = scsi_dispatch_cmd(cmd);
> - spin_lock_irq(q->queue_lock);
> - if (rtn)
> + if (rtn) {
> + scsi_queue_insert(cmd, rtn);
> + spin_lock_irq(q->queue_lock);
> goto out_delay;
> + }
> + spin_lock_irq(q->queue_lock);
> }
>
> return;
> @@ -1605,7 +1608,7 @@ static void scsi_request_fn(struct request_queue *q)
> blk_requeue_request(q, req);
> sdev->device_busy--;
> out_delay:
> - if (sdev->device_busy == 0)
> + if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
>
>

Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:14:07

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> The blk-mq code path will set this to a different function, so make the
> code simpler by setting it up in a legacy-request specific place.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi.c | 23 +----------------------
> drivers/scsi/scsi_lib.c | 20 ++++++++++++++++++++
> 2 files changed, 21 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index dcc43fd..d3bd6cf 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -72,8 +72,6 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/scsi.h>
>
> -static void scsi_done(struct scsi_cmnd *cmd);
> -
> /*
> * Definitions and constants.
> */
> @@ -696,8 +694,6 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> }
>
> trace_scsi_dispatch_cmd_start(cmd);
> -
> - cmd->scsi_done = scsi_done;
> rtn = host->hostt->queuecommand(host, cmd);
> if (rtn) {
> trace_scsi_dispatch_cmd_error(cmd, rtn);
> @@ -711,28 +707,11 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>
> return rtn;
> done:
> - scsi_done(cmd);
> + cmd->scsi_done(cmd);
> return 0;
> }
>
> /**
> - * scsi_done - Invoke completion on finished SCSI command.
> - * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
> - * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
> - *
> - * Description: This function is the mid-level's (SCSI Core) interrupt routine,
> - * which regains ownership of the SCSI command (de facto) from a LLDD, and
> - * calls blk_complete_request() for further processing.
> - *
> - * This function is interrupt context safe.
> - */
> -static void scsi_done(struct scsi_cmnd *cmd)
> -{
> - trace_scsi_dispatch_cmd_done(cmd);
> - blk_complete_request(cmd->request);
> -}
> -
> -/**
> * scsi_finish_command - cleanup and pass command back to upper layer
> * @cmd: the command
> *
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 63bf844..6989b6f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -29,6 +29,8 @@
> #include <scsi/scsi_eh.h>
> #include <scsi/scsi_host.h>
>
> +#include <trace/events/scsi.h>
> +
> #include "scsi_priv.h"
> #include "scsi_logging.h"
>
> @@ -1480,6 +1482,23 @@ static void scsi_softirq_done(struct request *rq)
> }
> }
>
> +/**
> + * scsi_done - Invoke completion on finished SCSI command.
> + * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
> + * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
> + *
> + * Description: This function is the mid-level's (SCSI Core) interrupt routine,
> + * which regains ownership of the SCSI command (de facto) from a LLDD, and
> + * calls blk_complete_request() for further processing.
> + *
> + * This function is interrupt context safe.
> + */
> +static void scsi_done(struct scsi_cmnd *cmd)
> +{
> + trace_scsi_dispatch_cmd_done(cmd);
> + blk_complete_request(cmd->request);
> +}
> +
> /*
> * Function: scsi_request_fn()
> *
> @@ -1582,6 +1601,7 @@ static void scsi_request_fn(struct request_queue *q)
> /*
> * Dispatch the command to the low-level driver.
> */
> + cmd->scsi_done = scsi_done;
> rtn = scsi_dispatch_cmd(cmd);
> if (rtn) {
> scsi_queue_insert(cmd, rtn);
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:14:41

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Prepare for not taking a host-wide lock in the dispatch path by pushing
> the lock down into the places that actually need it. Note that this
> patch is just a preparation step, as it will actually increase lock
> roundtrips and thus decrease performance on its own.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 75 ++++++++++++++++++++++++-----------------------
> 1 file changed, 39 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 6989b6f..18e6449 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1300,18 +1300,18 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> /*
> * scsi_target_queue_ready: checks if there we can send commands to target
> * @sdev: scsi device on starget to check.
> - *
> - * Called with the host lock held.
> */
> static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
> struct scsi_device *sdev)
> {
> struct scsi_target *starget = scsi_target(sdev);
> + int ret = 0;
>
> + spin_lock_irq(shost->host_lock);
> if (starget->single_lun) {
> if (starget->starget_sdev_user &&
> starget->starget_sdev_user != sdev)
> - return 0;
> + goto out;
> starget->starget_sdev_user = sdev;
> }
>
> @@ -1319,57 +1319,66 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
> /*
> * unblock after target_blocked iterates to zero
> */
> - if (--starget->target_blocked == 0) {
> - SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
> - "unblocking target at zero depth\n"));
> - } else
> - return 0;
> + if (--starget->target_blocked != 0)
> + goto out;
> +
> + SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
> + "unblocking target at zero depth\n"));
> }
>
> if (scsi_target_is_busy(starget)) {
> list_move_tail(&sdev->starved_entry, &shost->starved_list);
> - return 0;
> + goto out;
> }
>
> - return 1;
> + scsi_target(sdev)->target_busy++;
> + ret = 1;
> +out:
> + spin_unlock_irq(shost->host_lock);
> + return ret;
> }
>
> /*
> * scsi_host_queue_ready: if we can send requests to shost, return 1 else
> * return 0. We must end up running the queue again whenever 0 is
> * returned, else IO can hang.
> - *
> - * Called with host_lock held.
> */
> static inline int scsi_host_queue_ready(struct request_queue *q,
> struct Scsi_Host *shost,
> struct scsi_device *sdev)
> {
> + int ret = 0;
> +
> + spin_lock_irq(shost->host_lock);
> +
> if (scsi_host_in_recovery(shost))
> - return 0;
> + goto out;
> if (shost->host_busy == 0 && shost->host_blocked) {
> /*
> * unblock after host_blocked iterates to zero
> */
> - if (--shost->host_blocked == 0) {
> - SCSI_LOG_MLQUEUE(3,
> - shost_printk(KERN_INFO, shost,
> - "unblocking host at zero depth\n"));
> - } else {
> - return 0;
> - }
> + if (--shost->host_blocked != 0)
> + goto out;
> +
> + SCSI_LOG_MLQUEUE(3,
> + shost_printk(KERN_INFO, shost,
> + "unblocking host at zero depth\n"));
> }
> if (scsi_host_is_busy(shost)) {
> if (list_empty(&sdev->starved_entry))
> list_add_tail(&sdev->starved_entry, &shost->starved_list);
> - return 0;
> + goto out;
> }
>
> /* We're OK to process the command, so we can't be starved */
> if (!list_empty(&sdev->starved_entry))
> list_del_init(&sdev->starved_entry);
>
> - return 1;
> + shost->host_busy++;
> + ret = 1;
> +out:
> + spin_unlock_irq(shost->host_lock);
> + return ret;
> }
>
> /*
> @@ -1550,7 +1559,7 @@ static void scsi_request_fn(struct request_queue *q)
> blk_start_request(req);
> sdev->device_busy++;
>
> - spin_unlock(q->queue_lock);
> + spin_unlock_irq(q->queue_lock);
> cmd = req->special;
> if (unlikely(cmd == NULL)) {
> printk(KERN_CRIT "impossible request in %s.\n"
> @@ -1560,7 +1569,6 @@ static void scsi_request_fn(struct request_queue *q)
> blk_dump_rq_flags(req, "foo");
> BUG();
> }
> - spin_lock(shost->host_lock);
>
> /*
> * We hit this when the driver is using a host wide
> @@ -1571,9 +1579,11 @@ static void scsi_request_fn(struct request_queue *q)
> * a run when a tag is freed.
> */
> if (blk_queue_tagged(q) && !blk_rq_tagged(req)) {
> + spin_lock_irq(shost->host_lock);
> if (list_empty(&sdev->starved_entry))
> list_add_tail(&sdev->starved_entry,
> &shost->starved_list);
> + spin_unlock_irq(shost->host_lock);
> goto not_ready;
> }
>
> @@ -1581,16 +1591,7 @@ static void scsi_request_fn(struct request_queue *q)
> goto not_ready;
>
> if (!scsi_host_queue_ready(q, shost, sdev))
> - goto not_ready;
> -
> - scsi_target(sdev)->target_busy++;
> - shost->host_busy++;
> -
> - /*
> - * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
> - * take the lock again.
> - */
> - spin_unlock_irq(shost->host_lock);
> + goto host_not_ready;
>
> /*
> * Finally, initialize any error handling parameters, and set up
> @@ -1613,9 +1614,11 @@ static void scsi_request_fn(struct request_queue *q)
>
> return;
>
> - not_ready:
> + host_not_ready:
> + spin_lock_irq(shost->host_lock);
> + scsi_target(sdev)->target_busy--;
> spin_unlock_irq(shost->host_lock);
> -
> + not_ready:
> /*
> * lock q, handle tag, requeue req, and decrement device_busy. We
> * must return with queue_lock held.
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:15:46

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 06/14] scsi: convert target_busy to an atomic_t

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the host-wide host_lock to check the per-target queue limit.
> Instead we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 52 ++++++++++++++++++++++++++------------------
> include/scsi/scsi_device.h | 4 ++--
> 2 files changed, 33 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 18e6449..5e269d6 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -294,7 +294,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>
> spin_lock_irqsave(shost->host_lock, flags);
> shost->host_busy--;
> - starget->target_busy--;
> + atomic_dec(&starget->target_busy);
> if (unlikely(scsi_host_in_recovery(shost) &&
> (shost->host_failed || shost->host_eh_scheduled)))
> scsi_eh_wakeup(shost);
> @@ -361,7 +361,7 @@ static inline int scsi_device_is_busy(struct scsi_device *sdev)
> static inline int scsi_target_is_busy(struct scsi_target *starget)
> {
> return ((starget->can_queue > 0 &&
> - starget->target_busy >= starget->can_queue) ||
> + atomic_read(&starget->target_busy) >= starget->can_queue) ||
> starget->target_blocked);
> }
>
> @@ -1305,37 +1305,49 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
> struct scsi_device *sdev)
> {
> struct scsi_target *starget = scsi_target(sdev);
> - int ret = 0;
> + unsigned int busy;
>
> - spin_lock_irq(shost->host_lock);
> if (starget->single_lun) {
> + spin_lock_irq(shost->host_lock);
> if (starget->starget_sdev_user &&
> - starget->starget_sdev_user != sdev)
> - goto out;
> + starget->starget_sdev_user != sdev) {
> + spin_unlock_irq(shost->host_lock);
> + return 0;
> + }
> starget->starget_sdev_user = sdev;
> + spin_unlock_irq(shost->host_lock);
> }
>
> - if (starget->target_busy == 0 && starget->target_blocked) {
> + busy = atomic_inc_return(&starget->target_busy) - 1;
> + if (busy == 0 && starget->target_blocked) {
> /*
> * unblock after target_blocked iterates to zero
> */
> - if (--starget->target_blocked != 0)
> - goto out;
> + spin_lock_irq(shost->host_lock);
> + if (--starget->target_blocked != 0) {
> + spin_unlock_irq(shost->host_lock);
> + goto out_dec;
> + }
> + spin_unlock_irq(shost->host_lock);
>
> SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
> "unblocking target at zero depth\n"));
> }
>
> - if (scsi_target_is_busy(starget)) {
> - list_move_tail(&sdev->starved_entry, &shost->starved_list);
> - goto out;
> - }
> + if (starget->can_queue > 0 && busy >= starget->can_queue)
> + goto starved;
> + if (starget->target_blocked)
> + goto starved;
>
> - scsi_target(sdev)->target_busy++;
> - ret = 1;
> -out:
> + return 1;
> +
> +starved:
> + spin_lock_irq(shost->host_lock);
> + list_move_tail(&sdev->starved_entry, &shost->starved_list);
> spin_unlock_irq(shost->host_lock);
> - return ret;
> +out_dec:
> + atomic_dec(&starget->target_busy);
> + return 0;
> }
>
> /*
> @@ -1445,7 +1457,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
> spin_unlock(sdev->request_queue->queue_lock);
> spin_lock(shost->host_lock);
> shost->host_busy++;
> - starget->target_busy++;
> + atomic_inc(&starget->target_busy);
> spin_unlock(shost->host_lock);
> spin_lock(sdev->request_queue->queue_lock);
>
> @@ -1615,9 +1627,7 @@ static void scsi_request_fn(struct request_queue *q)
> return;
>
> host_not_ready:
> - spin_lock_irq(shost->host_lock);
> - scsi_target(sdev)->target_busy--;
> - spin_unlock_irq(shost->host_lock);
> + atomic_dec(&scsi_target(sdev)->target_busy);
> not_ready:
> /*
> * lock q, handle tag, requeue req, and decrement device_busy. We
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 816e8a2..446f741 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -290,8 +290,8 @@ struct scsi_target {
> unsigned int expecting_lun_change:1; /* A device has reported
> * a 3F/0E UA, other devices on
> * the same target will also. */
> - /* commands actually active on LLD. protected by host lock. */
> - unsigned int target_busy;
> + /* commands actually active on LLD. */
> + atomic_t target_busy;
> /*
> * LLDs should set this in the slave_alloc host template callout.
> * If set to zero then there is not limit.
>
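
The pattern introduced here (and reused for host_busy and
device_busy in the following patches) boils down to an optimistic
claim followed by a back-off, roughly (a distilled sketch, not the
exact patch):

	busy = atomic_inc_return(&starget->target_busy) - 1; /* claim a slot */

	if (starget->can_queue > 0 && busy >= starget->can_queue)
		goto starved;	/* over the limit */
	return 1;		/* the claimed slot is ours */

starved:
	atomic_dec(&starget->target_busy); /* give the slot back */
	return 0;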
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:16:04

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 07/14] scsi: convert host_busy to atomic_t

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the host-wide host_lock to check the per-host queue limit.
> Instead we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/advansys.c | 4 +-
> drivers/scsi/libiscsi.c | 4 +-
> drivers/scsi/libsas/sas_scsi_host.c | 5 ++-
> drivers/scsi/qlogicpti.c | 2 +-
> drivers/scsi/scsi.c | 2 +-
> drivers/scsi/scsi_error.c | 7 ++--
> drivers/scsi/scsi_lib.c | 71 +++++++++++++++++++++--------------
> drivers/scsi/scsi_sysfs.c | 9 ++++-
> include/scsi/scsi_host.h | 10 ++---
> 9 files changed, 66 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
> index e716d0a..43761c1 100644
> --- a/drivers/scsi/advansys.c
> +++ b/drivers/scsi/advansys.c
> @@ -2512,7 +2512,7 @@ static void asc_prt_scsi_host(struct Scsi_Host *s)
>
> printk("Scsi_Host at addr 0x%p, device %s\n", s, dev_name(boardp->dev));
> printk(" host_busy %u, host_no %d,\n",
> - s->host_busy, s->host_no);
> + atomic_read(&s->host_busy), s->host_no);
>
> printk(" base 0x%lx, io_port 0x%lx, irq %d,\n",
> (ulong)s->base, (ulong)s->io_port, boardp->irq);
> @@ -3346,7 +3346,7 @@ static void asc_prt_driver_conf(struct seq_file *m, struct Scsi_Host *shost)
>
> seq_printf(m,
> " host_busy %u, max_id %u, max_lun %llu, max_channel %u\n",
> - shost->host_busy, shost->max_id,
> + atomic_read(&shost->host_busy), shost->max_id,
> shost->max_lun, shost->max_channel);
>
> seq_printf(m,
> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> index f2db82b..f9f3a12 100644
> --- a/drivers/scsi/libiscsi.c
> +++ b/drivers/scsi/libiscsi.c
> @@ -2971,7 +2971,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
> */
> for (;;) {
> spin_lock_irqsave(session->host->host_lock, flags);
> - if (!session->host->host_busy) { /* OK for ERL == 0 */
> + if (!atomic_read(&session->host->host_busy)) { /* OK for ERL == 0 */
> spin_unlock_irqrestore(session->host->host_lock, flags);
> break;
> }
> @@ -2979,7 +2979,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
> msleep_interruptible(500);
> iscsi_conn_printk(KERN_INFO, conn, "iscsi conn_destroy(): "
> "host_busy %d host_failed %d\n",
> - session->host->host_busy,
> + atomic_read(&session->host->host_busy),
> session->host->host_failed);
> /*
> * force eh_abort() to unblock
> diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
> index 7d02a19..24e477d 100644
> --- a/drivers/scsi/libsas/sas_scsi_host.c
> +++ b/drivers/scsi/libsas/sas_scsi_host.c
> @@ -813,7 +813,7 @@ retry:
> spin_unlock_irq(shost->host_lock);
>
> SAS_DPRINTK("Enter %s busy: %d failed: %d\n",
> - __func__, shost->host_busy, shost->host_failed);
> + __func__, atomic_read(&shost->host_busy), shost->host_failed);
> /*
> * Deal with commands that still have SAS tasks (i.e. they didn't
> * complete via the normal sas_task completion mechanism),
> @@ -858,7 +858,8 @@ out:
> goto retry;
>
> SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n",
> - __func__, shost->host_busy, shost->host_failed, tries);
> + __func__, atomic_read(&shost->host_busy),
> + shost->host_failed, tries);
> }
>
> enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd)
> diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
> index 6d48d30..740ae49 100644
> --- a/drivers/scsi/qlogicpti.c
> +++ b/drivers/scsi/qlogicpti.c
> @@ -959,7 +959,7 @@ static inline void update_can_queue(struct Scsi_Host *host, u_int in_ptr, u_int
> /* Temporary workaround until bug is found and fixed (one bug has been found
> already, but fixing it makes things even worse) -jj */
> int num_free = QLOGICPTI_REQ_QUEUE_LEN - REQ_QUEUE_DEPTH(in_ptr, out_ptr) - 64;
> - host->can_queue = host->host_busy + num_free;
> + host->can_queue = atomic_read(&host->host_busy) + num_free;
> host->sg_tablesize = QLOGICPTI_MAX_SG(num_free);
> }
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index d3bd6cf..35a23e2 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -603,7 +603,7 @@ void scsi_log_completion(struct scsi_cmnd *cmd, int disposition)
> if (level > 3)
> scmd_printk(KERN_INFO, cmd,
> "scsi host busy %d failed %d\n",
> - cmd->device->host->host_busy,
> + atomic_read(&cmd->device->host->host_busy),
> cmd->device->host->host_failed);
> }
> }
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index e4a5324..5db8454 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -59,7 +59,7 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *,
> /* called with shost->host_lock held */
> void scsi_eh_wakeup(struct Scsi_Host *shost)
> {
> - if (shost->host_busy == shost->host_failed) {
> + if (atomic_read(&shost->host_busy) == shost->host_failed) {
> trace_scsi_eh_wakeup(shost);
> wake_up_process(shost->ehandler);
> SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
> @@ -2164,7 +2164,7 @@ int scsi_error_handler(void *data)
> while (!kthread_should_stop()) {
> set_current_state(TASK_INTERRUPTIBLE);
> if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
> - shost->host_failed != shost->host_busy) {
> + shost->host_failed != atomic_read(&shost->host_busy)) {
> SCSI_LOG_ERROR_RECOVERY(1,
> shost_printk(KERN_INFO, shost,
> "scsi_eh_%d: sleeping\n",
> @@ -2178,7 +2178,8 @@ int scsi_error_handler(void *data)
> shost_printk(KERN_INFO, shost,
> "scsi_eh_%d: waking up %d/%d/%d\n",
> shost->host_no, shost->host_eh_scheduled,
> - shost->host_failed, shost->host_busy));
> + shost->host_failed,
> + atomic_read(&shost->host_busy)));
>
> /*
> * We have a host that is failing for some reason. Figure out
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5e269d6..5d37d79 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -292,14 +292,17 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> struct scsi_target *starget = scsi_target(sdev);
> unsigned long flags;
>
> - spin_lock_irqsave(shost->host_lock, flags);
> - shost->host_busy--;
> + atomic_dec(&shost->host_busy);
> atomic_dec(&starget->target_busy);
> +
> if (unlikely(scsi_host_in_recovery(shost) &&
> - (shost->host_failed || shost->host_eh_scheduled)))
> + (shost->host_failed || shost->host_eh_scheduled))) {
> + spin_lock_irqsave(shost->host_lock, flags);
> scsi_eh_wakeup(shost);
> - spin_unlock(shost->host_lock);
> - spin_lock(sdev->request_queue->queue_lock);
> + spin_unlock_irqrestore(shost->host_lock, flags);
> + }
> +
> + spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
> sdev->device_busy--;
> spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
> }
> @@ -367,7 +370,8 @@ static inline int scsi_target_is_busy(struct scsi_target *starget)
>
> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> {
> - if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
> + if ((shost->can_queue > 0 &&
> + atomic_read(&shost->host_busy) >= shost->can_queue) ||
> shost->host_blocked || shost->host_self_blocked)
> return 1;
>
> @@ -1359,38 +1363,51 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
> struct Scsi_Host *shost,
> struct scsi_device *sdev)
> {
> - int ret = 0;
> -
> - spin_lock_irq(shost->host_lock);
> + unsigned int busy;
>
> if (scsi_host_in_recovery(shost))
> - goto out;
> - if (shost->host_busy == 0 && shost->host_blocked) {
> + return 0;
> +
> + busy = atomic_inc_return(&shost->host_busy) - 1;
> + if (busy == 0 && shost->host_blocked) {
> /*
> * unblock after host_blocked iterates to zero
> */
> - if (--shost->host_blocked != 0)
> - goto out;
> + spin_lock_irq(shost->host_lock);
> + if (--shost->host_blocked != 0) {
> + spin_unlock_irq(shost->host_lock);
> + goto out_dec;
> + }
> + spin_unlock_irq(shost->host_lock);
>
> SCSI_LOG_MLQUEUE(3,
> shost_printk(KERN_INFO, shost,
> "unblocking host at zero depth\n"));
> }
> - if (scsi_host_is_busy(shost)) {
> - if (list_empty(&sdev->starved_entry))
> - list_add_tail(&sdev->starved_entry, &shost->starved_list);
> - goto out;
> - }
> +
> + if (shost->can_queue > 0 && busy >= shost->can_queue)
> + goto starved;
> + if (shost->host_blocked || shost->host_self_blocked)
> + goto starved;
>
> /* We're OK to process the command, so we can't be starved */
> - if (!list_empty(&sdev->starved_entry))
> - list_del_init(&sdev->starved_entry);
> + if (!list_empty(&sdev->starved_entry)) {
> + spin_lock_irq(shost->host_lock);
> + if (!list_empty(&sdev->starved_entry))
> + list_del_init(&sdev->starved_entry);
> + spin_unlock_irq(shost->host_lock);
> + }
>
> - shost->host_busy++;
> - ret = 1;
> -out:
> + return 1;
> +
> +starved:
> + spin_lock_irq(shost->host_lock);
> + if (list_empty(&sdev->starved_entry))
> + list_add_tail(&sdev->starved_entry, &shost->starved_list);
> spin_unlock_irq(shost->host_lock);
> - return ret;
> +out_dec:
> + atomic_dec(&shost->host_busy);
> + return 0;
> }
>
> /*
> @@ -1454,12 +1471,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
> * with the locks as normal issue path does.
> */
> sdev->device_busy++;
> - spin_unlock(sdev->request_queue->queue_lock);
> - spin_lock(shost->host_lock);
> - shost->host_busy++;
> + atomic_inc(&shost->host_busy);
> atomic_inc(&starget->target_busy);
> - spin_unlock(shost->host_lock);
> - spin_lock(sdev->request_queue->queue_lock);
>
> blk_complete_request(req);
> }
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 5f36788..7ec5e06 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -334,7 +334,6 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
> static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
>
> shost_rd_attr(unique_id, "%u\n");
> -shost_rd_attr(host_busy, "%hu\n");
> shost_rd_attr(cmd_per_lun, "%hd\n");
> shost_rd_attr(can_queue, "%hd\n");
> shost_rd_attr(sg_tablesize, "%hu\n");
> @@ -344,6 +343,14 @@ shost_rd_attr(prot_capabilities, "%u\n");
> shost_rd_attr(prot_guard_type, "%hd\n");
> shost_rd_attr2(proc_name, hostt->proc_name, "%s\n");
>
> +static ssize_t
> +show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> + struct Scsi_Host *shost = class_to_shost(dev);
> + return snprintf(buf, 20, "%hu\n", atomic_read(&shost->host_busy));
> +}
> +static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
> +
> static struct attribute *scsi_sysfs_shost_attrs[] = {
> &dev_attr_unique_id.attr,
> &dev_attr_host_busy.attr,
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index abb6958..3d124f7 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -603,13 +603,9 @@ struct Scsi_Host {
> */
> struct blk_queue_tag *bqt;
>
> - /*
> - * The following two fields are protected with host_lock;
> - * however, eh routines can safely access during eh processing
> - * without acquiring the lock.
> - */
> - unsigned int host_busy; /* commands actually active on low-level */
> - unsigned int host_failed; /* commands that failed. */
> + atomic_t host_busy; /* commands actually active on low-level */
> + unsigned int host_failed; /* commands that failed.
> + protected by host_lock */
> unsigned int host_eh_scheduled; /* EH scheduled without command */
>
> unsigned int host_no; /* Used for IOCTL_GET_IDLUN, /proc/scsi et al. */
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:16:21

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 08/14] scsi: convert device_busy to atomic_t

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the queue_lock to check the per-device queue limit. Instead
> we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Unlike the host and target busy counters this doesn't allow us to avoid the
> queue_lock in the request_fn due to the way the interface works, but it'll
> allow us to prepare for using the blk-mq code, which doesn't use the
> queue_lock at all, and it at least avoids a queue_lock roundtrip in
> scsi_device_unbusy, which is still important given how busy the queue_lock
> is.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/message/fusion/mptsas.c | 2 +-
> drivers/scsi/scsi_lib.c | 50 ++++++++++++++++++++++-----------------
> drivers/scsi/scsi_sysfs.c | 10 +++++++-
> drivers/scsi/sg.c | 2 +-
> include/scsi/scsi_device.h | 4 +---
> 5 files changed, 40 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
> index 711fcb5..d636dbe 100644
> --- a/drivers/message/fusion/mptsas.c
> +++ b/drivers/message/fusion/mptsas.c
> @@ -3763,7 +3763,7 @@ mptsas_send_link_status_event(struct fw_event_work *fw_event)
> printk(MYIOC_s_DEBUG_FMT
> "SDEV OUTSTANDING CMDS"
> "%d\n", ioc->name,
> - sdev->device_busy));
> + atomic_read(&sdev->device_busy)));
> }
>
> }
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5d37d79..e23fef5 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -302,9 +302,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> spin_unlock_irqrestore(shost->host_lock, flags);
> }
>
> - spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
> - sdev->device_busy--;
> - spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
> + atomic_dec(&sdev->device_busy);
> }
>
> /*
> @@ -355,9 +353,10 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>
> static inline int scsi_device_is_busy(struct scsi_device *sdev)
> {
> - if (sdev->device_busy >= sdev->queue_depth || sdev->device_blocked)
> + if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> + return 1;
> + if (sdev->device_blocked)
> return 1;
> -
> return 0;
> }
>
> @@ -1224,7 +1223,7 @@ scsi_prep_return(struct request_queue *q, struct request *req, int ret)
> * queue must be restarted, so we schedule a callback to happen
> * shortly.
> */
> - if (sdev->device_busy == 0)
> + if (atomic_read(&sdev->device_busy) == 0)
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> break;
> default:
> @@ -1281,26 +1280,32 @@ static void scsi_unprep_fn(struct request_queue *q, struct request *req)
> static inline int scsi_dev_queue_ready(struct request_queue *q,
> struct scsi_device *sdev)
> {
> - if (sdev->device_busy == 0 && sdev->device_blocked) {
> + unsigned int busy;
> +
> + busy = atomic_inc_return(&sdev->device_busy) - 1;
> + if (busy == 0 && sdev->device_blocked) {
> /*
> * unblock after device_blocked iterates to zero
> */
> - if (--sdev->device_blocked == 0) {
> - SCSI_LOG_MLQUEUE(3,
> - sdev_printk(KERN_INFO, sdev,
> - "unblocking device at zero depth\n"));
> - } else {
> + if (--sdev->device_blocked != 0) {
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> - return 0;
> + goto out_dec;
> }
> + SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
> + "unblocking device at zero depth\n"));
> }
> - if (scsi_device_is_busy(sdev))
> - return 0;
> +
> + if (busy >= sdev->queue_depth)
> + goto out_dec;
> + if (sdev->device_blocked)
> + goto out_dec;
>
> return 1;
> +out_dec:
> + atomic_dec(&sdev->device_busy);
> + return 0;
> }
>
> -
> /*
> * scsi_target_queue_ready: checks if there we can send commands to target
> * @sdev: scsi device on starget to check.
> @@ -1470,7 +1475,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
> * bump busy counts. To bump the counters, we need to dance
> * with the locks as normal issue path does.
> */
> - sdev->device_busy++;
> + atomic_inc(&sdev->device_busy);
> atomic_inc(&shost->host_busy);
> atomic_inc(&starget->target_busy);
>
> @@ -1566,7 +1571,7 @@ static void scsi_request_fn(struct request_queue *q)
> * accept it.
> */
> req = blk_peek_request(q);
> - if (!req || !scsi_dev_queue_ready(q, sdev))
> + if (!req)
> break;
>
> if (unlikely(!scsi_device_online(sdev))) {
> @@ -1576,13 +1581,14 @@ static void scsi_request_fn(struct request_queue *q)
> continue;
> }
>
> + if (!scsi_dev_queue_ready(q, sdev))
> + break;
>
> /*
> * Remove the request from the request list.
> */
> if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
> blk_start_request(req);
> - sdev->device_busy++;
>
> spin_unlock_irq(q->queue_lock);
> cmd = req->special;
> @@ -1652,9 +1658,9 @@ static void scsi_request_fn(struct request_queue *q)
> */
> spin_lock_irq(q->queue_lock);
> blk_requeue_request(q, req);
> - sdev->device_busy--;
> + atomic_dec(&sdev->device_busy);
> out_delay:
> - if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
> + if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
>
> @@ -2394,7 +2400,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
> return err;
>
> scsi_run_queue(sdev->request_queue);
> - while (sdev->device_busy) {
> + while (atomic_read(&sdev->device_busy)) {
> msleep_interruptible(200);
> scsi_run_queue(sdev->request_queue);
> }
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 7ec5e06..54e3dac 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -585,13 +585,21 @@ static int scsi_sdev_check_buf_bit(const char *buf)
> * Create the actual show/store functions and data structures.
> */
> sdev_rd_attr (device_blocked, "%d\n");
> -sdev_rd_attr (device_busy, "%d\n");
> sdev_rd_attr (type, "%d\n");
> sdev_rd_attr (scsi_level, "%d\n");
> sdev_rd_attr (vendor, "%.8s\n");
> sdev_rd_attr (model, "%.16s\n");
> sdev_rd_attr (rev, "%.4s\n");
>
> +static ssize_t
> +sdev_show_device_busy(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct scsi_device *sdev = to_scsi_device(dev);
> + return snprintf(buf, 20, "%d\n", atomic_read(&sdev->device_busy));
> +}
> +static DEVICE_ATTR(device_busy, S_IRUGO, sdev_show_device_busy, NULL);
> +
> /*
> * TODO: can we make these symlinks to the block layer ones?
> */
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index cb2a18e..3db4fc9 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -2573,7 +2573,7 @@ static int sg_proc_seq_show_dev(struct seq_file *s, void *v)
> scsidp->id, scsidp->lun, (int) scsidp->type,
> 1,
> (int) scsidp->queue_depth,
> - (int) scsidp->device_busy,
> + (int) atomic_read(&scsidp->device_busy),
> (int) scsi_device_online(scsidp));
> }
> read_unlock_irqrestore(&sg_index_lock, iflags);
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 446f741..5ff3d24 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -81,9 +81,7 @@ struct scsi_device {
> struct list_head siblings; /* list of all devices on this host */
> struct list_head same_target_siblings; /* just the devices sharing same target id */
>
> - /* this is now protected by the request_queue->queue_lock */
> - unsigned int device_busy; /* commands actually active on
> - * low-level. protected by queue_lock. */
> + atomic_t device_busy; /* commands actually active on LLDD */
> spinlock_t list_lock;
> struct list_head cmd_list; /* queue of in use SCSI Command structures */
> struct list_head starved_entry;
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:19:46

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> This saves us an atomic operation for each I/O submission and completion
> for the usual case where the driver doesn't set a per-target can_queue
> value. Only a few iscsi hardware offload drivers set the per-target
> can_queue value at the moment.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index a39d5ba..a64b9d3 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -295,7 +295,8 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> unsigned long flags;
>
> atomic_dec(&shost->host_busy);
> - atomic_dec(&starget->target_busy);
> + if (starget->can_queue > 0)
> + atomic_dec(&starget->target_busy);
>
> if (unlikely(scsi_host_in_recovery(shost) &&
> (shost->host_failed || shost->host_eh_scheduled))) {
> @@ -1335,6 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
> spin_unlock_irq(shost->host_lock);
> }
>
> + if (starget->can_queue <= 0)
> + return 1;
> +
> busy = atomic_inc_return(&starget->target_busy) - 1;
> if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
> if (atomic_dec_return(&starget->target_blocked) > 0)
> @@ -1344,7 +1348,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
> "unblocking target at zero depth\n"));
> }
>
> - if (starget->can_queue > 0 && busy >= starget->can_queue)
> + if (busy >= starget->can_queue)
> goto starved;
> if (atomic_read(&starget->target_blocked) > 0)
> goto starved;
> @@ -1356,7 +1360,8 @@ starved:
> list_move_tail(&sdev->starved_entry, &shost->starved_list);
> spin_unlock_irq(shost->host_lock);
> out_dec:
> - atomic_dec(&starget->target_busy);
> + if (starget->can_queue > 0)
> + atomic_dec(&starget->target_busy);
> return 0;
> }
>
> @@ -1473,7 +1478,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
> */
> atomic_inc(&sdev->device_busy);
> atomic_inc(&shost->host_busy);
> - atomic_inc(&starget->target_busy);
> + if (starget->can_queue > 0)
> + atomic_inc(&starget->target_busy);
>
> blk_complete_request(req);
> }
> @@ -1642,7 +1648,8 @@ static void scsi_request_fn(struct request_queue *q)
> return;
>
> host_not_ready:
> - atomic_dec(&scsi_target(sdev)->target_busy);
> + if (scsi_target(sdev)->can_queue > 0)
> + atomic_dec(&scsi_target(sdev)->target_busy);
> not_ready:
> /*
> * lock q, handle tag, requeue req, and decrement device_busy. We
>
Hmm. 'can_queue' can be changed by the LLDD. Don't we need some sort
of synchronization here?
(Or move that to atomic_t, too?)
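
One lockless option might be to snapshot the limit once per call,
e.g. (sketch only; this keeps a single invocation self-consistent,
but the inc in scsi_target_queue_ready() and the dec in
scsi_device_unbusy() could still disagree if can_queue flips
between submission and completion):

	int can_queue = ACCESS_ONCE(starget->can_queue);

	if (can_queue <= 0)
		return 1;

	busy = atomic_inc_return(&starget->target_busy) - 1;
	...
	if (busy >= can_queue)
		goto starved;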

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:20:41

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Replace the calls to the various blk_end_request variants with open-coded
> equivalents. Blk-mq is using a model that gives the driver control
> between the bio updates and the actual completion, and making the old
> code follow that same model allows us to keep the code more similar for
> both paths.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 61 ++++++++++++++++++++++++++++++++---------------
> 1 file changed, 42 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index a64b9d3..58534fd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -625,6 +625,37 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
> cmd->request->next_rq->special = NULL;
> }
>
> +static bool scsi_end_request(struct request *req, int error,
> + unsigned int bytes, unsigned int bidi_bytes)
> +{
> + struct scsi_cmnd *cmd = req->special;
> + struct scsi_device *sdev = cmd->device;
> + struct request_queue *q = sdev->request_queue;
> + unsigned long flags;
> +
> +
> + if (blk_update_request(req, error, bytes))
> + return true;
> +
> + /* Bidi request must be completed as a whole */
> + if (unlikely(bidi_bytes) &&
> + blk_update_request(req->next_rq, error, bidi_bytes))
> + return true;
> +
> + if (blk_queue_add_random(q))
> + add_disk_randomness(req->rq_disk);
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + blk_finish_request(req, error);
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +
> + if (bidi_bytes)
> + scsi_release_bidi_buffers(cmd);
> + scsi_release_buffers(cmd);
> + scsi_next_command(cmd);
> + return false;
> +}
> +
> /**
> * __scsi_error_from_host_byte - translate SCSI error code into errno
> * @cmd: SCSI command (unused)
> @@ -697,7 +728,7 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd *cmd, int result)
> * be put back on the queue and retried using the same
> * command as before, possibly after a delay.
> *
> - * c) We can call blk_end_request() with -EIO to fail
> + * c) We can call scsi_end_request() with -EIO to fail
> * the remainder of the request.
> */
> void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> @@ -749,13 +780,9 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> * both sides at once.
> */
> req->next_rq->resid_len = scsi_in(cmd)->resid;
> -
> - scsi_release_buffers(cmd);
> - scsi_release_bidi_buffers(cmd);
> -
> - blk_end_request_all(req, 0);
> -
> - scsi_next_command(cmd);
> + if (scsi_end_request(req, 0, blk_rq_bytes(req),
> + blk_rq_bytes(req->next_rq)))
> + BUG();
> return;
> }
> }
> @@ -794,15 +821,16 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> /*
> * If we finished all bytes in the request we are done now.
> */
> - if (!blk_end_request(req, error, good_bytes))
> - goto next_command;
> + if (!scsi_end_request(req, error, good_bytes, 0))
> + return;
>
> /*
> * Kill remainder if no retrys.
> */
> if (error && scsi_noretry_cmd(cmd)) {
> - blk_end_request_all(req, error);
> - goto next_command;
> + if (scsi_end_request(req, error, blk_rq_bytes(req), 0))
> + BUG();
> + return;
> }
>
> /*
> @@ -947,8 +975,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> scsi_print_sense("", cmd);
> scsi_print_command(cmd);
> }
> - if (!blk_end_request_err(req, error))
> - goto next_command;
> + if (!scsi_end_request(req, error, blk_rq_err_bytes(req), 0))
> + return;
> /*FALLTHRU*/
> case ACTION_REPREP:
> requeue:
> @@ -967,11 +995,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> __scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0);
> break;
> }
> - return;
> -
> -next_command:
> - scsi_release_buffers(cmd);
> - scsi_next_command(cmd);
> }
>
> static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
>
YES.

That code really was a mess.

Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 11:21:32

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 12/14] scatterlist: allow chaining to preallocated chunks

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Blk-mq drivers usually preallocate their S/G list as part of the request,
> but if we want to support the very large S/G lists currently supported by
> the SCSI code that would tie up a lot of memory in the preallocated request
> pool. Add support to the scatterlist code so that it can initialize a
> S/G list that uses a preallocated first chunk and dynamically allocated
> additional chunks. That way the scsi-mq code can preallocate a first
> page worth of S/G entries as part of the request, and dynamically extend
> the S/G list when needed.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/scsi_lib.c | 16 +++++++---------
> include/linux/scatterlist.h | 6 +++---
> lib/scatterlist.c | 24 ++++++++++++++++--------
> 3 files changed, 26 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 58534fd..900b1c0 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -567,6 +567,11 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
> return mempool_alloc(sgp->pool, gfp_mask);
> }
>
> +static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> +{
> + __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
> +}
> +
> static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> gfp_t gfp_mask)
> {
> @@ -575,19 +580,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> BUG_ON(!nents);
>
> ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
> - gfp_mask, scsi_sg_alloc);
> + NULL, gfp_mask, scsi_sg_alloc);
> if (unlikely(ret))
> - __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS,
> - scsi_sg_free);
> -
> + scsi_free_sgtable(sdb);
> return ret;
> }
>
> -static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> -{
> - __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, scsi_sg_free);
> -}
> -
> /*
> * Function: scsi_release_buffers()
> *
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index a964f72..f4ec8bb 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -229,10 +229,10 @@ void sg_init_one(struct scatterlist *, const void *, unsigned int);
> typedef struct scatterlist *(sg_alloc_fn)(unsigned int, gfp_t);
> typedef void (sg_free_fn)(struct scatterlist *, unsigned int);
>
> -void __sg_free_table(struct sg_table *, unsigned int, sg_free_fn *);
> +void __sg_free_table(struct sg_table *, unsigned int, bool, sg_free_fn *);
> void sg_free_table(struct sg_table *);
> -int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int, gfp_t,
> - sg_alloc_fn *);
> +int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int,
> + struct scatterlist *, gfp_t, sg_alloc_fn *);
> int sg_alloc_table(struct sg_table *, unsigned int, gfp_t);
> int sg_alloc_table_from_pages(struct sg_table *sgt,
> struct page **pages, unsigned int n_pages,
> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
> index 3a8e8e8..48c15d2 100644
> --- a/lib/scatterlist.c
> +++ b/lib/scatterlist.c
> @@ -165,6 +165,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
> * __sg_free_table - Free a previously mapped sg table
> * @table: The sg table header to use
> * @max_ents: The maximum number of entries per single scatterlist
> + * @skip_first_chunk: don't free the (preallocated) first scatterlist chunk
> * @free_fn: Free function
> *
> * Description:
> @@ -174,7 +175,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
> *
> **/
> void __sg_free_table(struct sg_table *table, unsigned int max_ents,
> - sg_free_fn *free_fn)
> + bool skip_first_chunk, sg_free_fn *free_fn)
> {
> struct scatterlist *sgl, *next;
>
> @@ -202,7 +203,9 @@ void __sg_free_table(struct sg_table *table, unsigned int max_ents,
> }
>
> table->orig_nents -= sg_size;
> - free_fn(sgl, alloc_size);
> + if (!skip_first_chunk)
> + free_fn(sgl, alloc_size);
> + skip_first_chunk = false;
> sgl = next;
> }
>
> @@ -217,7 +220,7 @@ EXPORT_SYMBOL(__sg_free_table);
> **/
> void sg_free_table(struct sg_table *table)
> {
> - __sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
> + __sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
> }
> EXPORT_SYMBOL(sg_free_table);
>
> @@ -241,8 +244,8 @@ EXPORT_SYMBOL(sg_free_table);
> *
> **/
> int __sg_alloc_table(struct sg_table *table, unsigned int nents,
> - unsigned int max_ents, gfp_t gfp_mask,
> - sg_alloc_fn *alloc_fn)
> + unsigned int max_ents, struct scatterlist *first_chunk,
> + gfp_t gfp_mask, sg_alloc_fn *alloc_fn)
> {
> struct scatterlist *sg, *prv;
> unsigned int left;
> @@ -269,7 +272,12 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
>
> left -= sg_size;
>
> - sg = alloc_fn(alloc_size, gfp_mask);
> + if (first_chunk) {
> + sg = first_chunk;
> + first_chunk = NULL;
> + } else {
> + sg = alloc_fn(alloc_size, gfp_mask);
> + }
> if (unlikely(!sg)) {
> /*
> * Adjust entry count to reflect that the last
> @@ -324,9 +332,9 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
> int ret;
>
> ret = __sg_alloc_table(table, nents, SG_MAX_SINGLE_ALLOC,
> - gfp_mask, sg_kmalloc);
> + NULL, gfp_mask, sg_kmalloc);
> if (unlikely(ret))
> - __sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
> + __sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
>
> return ret;
> }
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

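To make the extended interface concrete, here is a sketch of a user that
embeds a small first chunk and lets the library chain dynamically
allocated chunks behind it. struct my_req, MY_INLINE_ENTS and the
alloc/free callbacks are invented for illustration; only the
__sg_alloc_table()/__sg_free_table() signatures come from the patch.
Note that max_ents must equal the inline chunk's capacity, exactly as
the scsi conversion above passes SCSI_MAX_SG_SEGMENTS for both:

#include <linux/scatterlist.h>
#include <linux/slab.h>

#define MY_INLINE_ENTS	16	/* hypothetical inline capacity */

struct my_req {
        struct sg_table         table;
        struct scatterlist      inline_sgl[MY_INLINE_ENTS];
};

static struct scatterlist *my_sg_alloc(unsigned int nents, gfp_t gfp)
{
        return kmalloc_array(nents, sizeof(struct scatterlist), gfp);
}

static void my_sg_free(struct scatterlist *sgl, unsigned int nents)
{
        kfree(sgl);
}

static int my_req_map(struct my_req *r, unsigned int nents)
{
        int ret;

        if (nents <= MY_INLINE_ENTS) {
                /* fits in the preallocated chunk, no allocation at all */
                r->table.sgl = r->inline_sgl;
                r->table.nents = r->table.orig_nents = nents;
                sg_init_table(r->inline_sgl, nents);
                return 0;
        }

        /* inline chunk first, dynamically allocated chunks chained on */
        ret = __sg_alloc_table(&r->table, nents, MY_INLINE_ENTS,
                               r->inline_sgl, GFP_KERNEL, my_sg_alloc);
        if (unlikely(ret))
                __sg_free_table(&r->table, MY_INLINE_ENTS, true, my_sg_free);
        return ret;
}

static void my_req_unmap(struct my_req *r)
{
        if (r->table.orig_nents <= MY_INLINE_ENTS)
                return;		/* nothing was dynamically allocated */
        /* skip_first_chunk == true: leave the embedded chunk alone */
        __sg_free_table(&r->table, MY_INLINE_ENTS, true, my_sg_free);
}
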
2014-07-09 11:25:17

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

On 06/25/2014 06:52 PM, Christoph Hellwig wrote:
> This patch adds support for an alternate I/O path in the scsi midlayer
> which uses the blk-mq infrastructure instead of the legacy request code.
>
> Use of blk-mq is fully transparent to drivers, although for now a host
> template field is provided to opt out of blk-mq usage in case any unforeseen
> incompatibilities arise.
>
> In general replacing the legacy request code with blk-mq is a simple and
> mostly mechanical transformation. The biggest exception is the new code
> that deals with the fact that I/O submissions in blk-mq must happen from
> process context, which slightly complicates the I/O completion handler.
> The second biggest difference is that blk-mq is built around the concept
> of preallocated requests that also include driver-specific data, which
> in the SCSI context means the scsi_cmnd structure. This completely avoids
> dynamic memory allocations for the fast path through I/O submission.
>
> Due to the preallocated requests the MQ code path exclusively uses the
> host-wide shared tag allocator instead of a per-LUN one. This only
> affects drivers actually using the block layer provided tag allocator
> instead of their own. Unlike the old path, blk-mq always provides a tag,
> although drivers don't have to use it.
>
> For now the blk-mq path is disabled by default and must be enabled using
> the "use_blk_mq" module parameter. Once the remaining work in the block
> layer to make blk-mq more suitable for slow devices is complete I hope
> to make it the default and eventually even remove the old code path.
>
> Based on the earlier scsi-mq prototype by Nicholas Bellinger.
>
> Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking and
> various suggestions and code contributions.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/scsi/hosts.c | 30 ++-
> drivers/scsi/scsi.c | 5 +-
> drivers/scsi/scsi_lib.c | 475 +++++++++++++++++++++++++++++++++++++++------
> drivers/scsi/scsi_priv.h | 3 +
> drivers/scsi/scsi_scan.c | 5 +-
> drivers/scsi/scsi_sysfs.c | 2 +
> include/scsi/scsi_host.h | 18 +-
> include/scsi/scsi_tcq.h | 28 ++-
> 8 files changed, 494 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 0632eee..6322e6c 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
> goto fail;
> }
>
> + if (shost_use_blk_mq(shost)) {
> + error = scsi_mq_setup_tags(shost);
> + if (error)
> + goto fail;
> + }
> +
> + /*
> + * Note that we allocate the freelist even for the MQ case for now,
> + * as we need a command set aside for scsi_reset_provider. Having
> + * the full host freelist and one command available for that is a
> + * little heavy-handed, but avoids introducing a special allocator
> + * just for this. Eventually the structure of scsi_reset_provider
> + * will need a major overhaul.
> + */
> error = scsi_setup_command_freelist(shost);
> if (error)
> - goto fail;
> + goto out_destroy_tags;
> +
>
> if (!shost->shost_gendev.parent)
> shost->shost_gendev.parent = dev ? dev : &platform_bus;
> @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>
> error = device_add(&shost->shost_gendev);
> if (error)
> - goto out;
> + goto out_destroy_freelist;
>
> pm_runtime_set_active(&shost->shost_gendev);
> pm_runtime_enable(&shost->shost_gendev);
> @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
> device_del(&shost->shost_dev);
> out_del_gendev:
> device_del(&shost->shost_gendev);
> - out:
> + out_destroy_freelist:
> scsi_destroy_command_freelist(shost);
> + out_destroy_tags:
> + if (shost_use_blk_mq(shost))
> + scsi_mq_destroy_tags(shost);
> fail:
> return error;
> }
> @@ -309,7 +327,9 @@ static void scsi_host_dev_release(struct device *dev)
> }
>
> scsi_destroy_command_freelist(shost);
> - if (shost->bqt)
> + if (shost_use_blk_mq(shost) && shost->tag_set.tags)
> + scsi_mq_destroy_tags(shost);
> + else if (shost->bqt)
> blk_free_tags(shost->bqt);
>
> kfree(shost->shost_data);
> @@ -436,6 +456,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
> else
> shost->dma_boundary = 0xffffffff;
>
> + shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
> +
> device_initialize(&shost->shost_gendev);
> dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
> shost->shost_gendev.bus = &scsi_bus_type;
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index b362058..c089812 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -809,7 +809,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
> * is more IO than the LLD's can_queue (so there are not enough
> * tags) request_fn's host queue ready check will handle it.
> */
> - if (!sdev->host->bqt) {
> + if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
> if (blk_queue_tagged(sdev->request_queue) &&
> blk_queue_resize_tags(sdev->request_queue, tags) != 0)
> goto out;
> @@ -1363,6 +1363,9 @@ MODULE_LICENSE("GPL");
> module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
> MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
>
> +bool scsi_use_blk_mq = false;
> +module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
> +
> static int __init init_scsi(void)
> {
> int error;
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 900b1c0..5d39cfc 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1,5 +1,6 @@
> /*
> - * scsi_lib.c Copyright (C) 1999 Eric Youngdale
> + * Copyright (C) 1999 Eric Youngdale
> + * Copyright (C) 2014 Christoph Hellwig
> *
> * SCSI queueing library.
> * Initial versions: Eric Youngdale ([email protected]).
> @@ -20,6 +21,7 @@
> #include <linux/delay.h>
> #include <linux/hardirq.h>
> #include <linux/scatterlist.h>
> +#include <linux/blk-mq.h>
>
> #include <scsi/scsi.h>
> #include <scsi/scsi_cmnd.h>
> @@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
> }
> }
>
> +static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
> +{
> + struct scsi_device *sdev = cmd->device;
> + struct request_queue *q = cmd->request->q;
> +
> + blk_mq_requeue_request(cmd->request);
> + blk_mq_kick_requeue_list(q);
> + put_device(&sdev->sdev_gendev);
> +}
> +
> /**
> * __scsi_queue_insert - private queue insertion
> * @cmd: The SCSI command being requeued
> @@ -150,6 +162,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> * before blk_cleanup_queue() finishes.
> */
> cmd->result = 0;
> + if (q->mq_ops) {
> + scsi_mq_requeue_cmd(cmd);
> + return;
> + }
> spin_lock_irqsave(q->queue_lock, flags);
> blk_requeue_request(q, cmd->request);
> kblockd_schedule_work(&device->requeue_work);
> @@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> atomic_dec(&sdev->device_busy);
> }
>
> +static void scsi_kick_queue(struct request_queue *q)
> +{
> + if (q->mq_ops)
> + blk_mq_start_hw_queues(q);
> + else
> + blk_run_queue(q);
> +}
> +
> /*
> * Called for single_lun devices on IO completion. Clear starget_sdev_user,
> * and call blk_run_queue for all the scsi_devices on the target -
> @@ -332,7 +356,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> * but in most cases, we will be first. Ideally, each LU on the
> * target would get some limited time or requests on the target.
> */
> - blk_run_queue(current_sdev->request_queue);
> + scsi_kick_queue(current_sdev->request_queue);
>
> spin_lock_irqsave(shost->host_lock, flags);
> if (starget->starget_sdev_user)
> @@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> continue;
>
> spin_unlock_irqrestore(shost->host_lock, flags);
> - blk_run_queue(sdev->request_queue);
> + scsi_kick_queue(sdev->request_queue);
> spin_lock_irqsave(shost->host_lock, flags);
>
> scsi_device_put(sdev);
> @@ -438,7 +462,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
> continue;
> spin_unlock_irqrestore(shost->host_lock, flags);
>
> - blk_run_queue(slq);
> + scsi_kick_queue(slq);
> blk_put_queue(slq);
>
> spin_lock_irqsave(shost->host_lock, flags);
> @@ -469,7 +493,10 @@ static void scsi_run_queue(struct request_queue *q)
> if (!list_empty(&sdev->host->starved_list))
> scsi_starved_list_run(sdev->host);
>
> - blk_run_queue(q);
> + if (q->mq_ops)
> + blk_mq_start_stopped_hw_queues(q, false);
> + else
> + blk_run_queue(q);
> }
>
> void scsi_requeue_run_queue(struct work_struct *work)
> @@ -567,25 +594,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
> return mempool_alloc(sgp->pool, gfp_mask);
> }
>
> -static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> +static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
> {
> - __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
> + if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
> + return;
> + __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
> }
>
> static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> - gfp_t gfp_mask)
> + gfp_t gfp_mask, bool mq)
> {
> + struct scatterlist *first_chunk = NULL;
> int ret;
>
> BUG_ON(!nents);
>
> + if (mq) {
> + if (nents <= SCSI_MAX_SG_SEGMENTS) {
> + sdb->table.nents = nents;
> + sg_init_table(sdb->table.sgl, sdb->table.nents);
> + return 0;
> + }
> + first_chunk = sdb->table.sgl;
> + }
> +
> ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
> - NULL, gfp_mask, scsi_sg_alloc);
> + first_chunk, gfp_mask, scsi_sg_alloc);
> if (unlikely(ret))
> - scsi_free_sgtable(sdb);
> + scsi_free_sgtable(sdb, mq);
> return ret;
> }
>
> +static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
> +{
> + if (cmd->request->cmd_type == REQ_TYPE_FS) {
> + struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> +
> + if (drv->uninit_command)
> + drv->uninit_command(cmd);
> + }
> +}
> +
> +static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
> +{
> + if (cmd->sdb.table.nents)
> + scsi_free_sgtable(&cmd->sdb, true);
> + if (cmd->request->next_rq && cmd->request->next_rq->special)
> + scsi_free_sgtable(cmd->request->next_rq->special, true);
> + if (scsi_prot_sg_count(cmd))
> + scsi_free_sgtable(cmd->prot_sdb, true);
> +}
> +
> +static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
> +{
> + struct scsi_device *sdev = cmd->device;
> + unsigned long flags;
> +
> + BUG_ON(list_empty(&cmd->list));
> +
> + scsi_mq_free_sgtables(cmd);
> + scsi_uninit_cmd(cmd);
> +
> + spin_lock_irqsave(&sdev->list_lock, flags);
> + list_del_init(&cmd->list);
> + spin_unlock_irqrestore(&sdev->list_lock, flags);
> +}
> +
> /*
> * Function: scsi_release_buffers()
> *
> @@ -605,12 +679,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> void scsi_release_buffers(struct scsi_cmnd *cmd)
> {
> if (cmd->sdb.table.nents)
> - scsi_free_sgtable(&cmd->sdb);
> + scsi_free_sgtable(&cmd->sdb, false);
>
> memset(&cmd->sdb, 0, sizeof(cmd->sdb));
>
> if (scsi_prot_sg_count(cmd))
> - scsi_free_sgtable(cmd->prot_sdb);
> + scsi_free_sgtable(cmd->prot_sdb, false);
> }
> EXPORT_SYMBOL(scsi_release_buffers);
>
> @@ -618,7 +692,7 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
> {
> struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;
>
> - scsi_free_sgtable(bidi_sdb);
> + scsi_free_sgtable(bidi_sdb, false);
> kmem_cache_free(scsi_sdb_cache, bidi_sdb);
> cmd->request->next_rq->special = NULL;
> }
> @@ -629,8 +703,6 @@ static bool scsi_end_request(struct request *req, int error,
> struct scsi_cmnd *cmd = req->special;
> struct scsi_device *sdev = cmd->device;
> struct request_queue *q = sdev->request_queue;
> - unsigned long flags;
> -
>
> if (blk_update_request(req, error, bytes))
> return true;
> @@ -643,14 +715,38 @@ static bool scsi_end_request(struct request *req, int error,
> if (blk_queue_add_random(q))
> add_disk_randomness(req->rq_disk);
>
> - spin_lock_irqsave(q->queue_lock, flags);
> - blk_finish_request(req, error);
> - spin_unlock_irqrestore(q->queue_lock, flags);
> + if (req->mq_ctx) {
> + /*
> + * In the MQ case the command gets freed by __blk_mq_end_io,
> + * so we have to do all cleanup that depends on it earlier.
> + *
> + * We also can't kick the queues from irq context, so we
> + * will have to defer it to a workqueue.
> + */
> + scsi_mq_uninit_cmd(cmd);
> +
> + __blk_mq_end_io(req, error);
> +
> + if (scsi_target(sdev)->single_lun ||
> + !list_empty(&sdev->host->starved_list))
> + kblockd_schedule_work(&sdev->requeue_work);
> + else
> + blk_mq_start_stopped_hw_queues(q, true);
> +
> + put_device(&sdev->sdev_gendev);
> + } else {
> + unsigned long flags;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + blk_finish_request(req, error);
> + spin_unlock_irqrestore(q->queue_lock, flags);
> +
> + if (bidi_bytes)
> + scsi_release_bidi_buffers(cmd);
> + scsi_release_buffers(cmd);
> + scsi_next_command(cmd);
> + }
>
> - if (bidi_bytes)
> - scsi_release_bidi_buffers(cmd);
> - scsi_release_buffers(cmd);
> - scsi_next_command(cmd);
> return false;
> }
>
> @@ -981,8 +1077,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> /* Unprep the request and put it back at the head of the queue.
> * A new command will be prepared and issued.
> */
> - scsi_release_buffers(cmd);
> - scsi_requeue_command(q, cmd);
> + if (q->mq_ops) {
> + cmd->request->cmd_flags &= ~REQ_DONTPREP;
> + scsi_mq_uninit_cmd(cmd);
> + scsi_mq_requeue_cmd(cmd);
> + } else {
> + scsi_release_buffers(cmd);
> + scsi_requeue_command(q, cmd);
> + }
> break;
> case ACTION_RETRY:
> /* Retry the same command immediately */
> @@ -1004,9 +1106,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
> * If sg table allocation fails, requeue request later.
> */
> if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
> - gfp_mask))) {
> + gfp_mask, req->mq_ctx != NULL)))
> return BLKPREP_DEFER;
> - }
>
> /*
> * Next, walk the list, and fill in the addresses and sizes of
> @@ -1034,21 +1135,27 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
> {
> struct scsi_device *sdev = cmd->device;
> struct request *rq = cmd->request;
> + bool is_mq = (rq->mq_ctx != NULL);
> + int error;
>
> - int error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
> + error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
> if (error)
> goto err_exit;
>
> if (blk_bidi_rq(rq)) {
> - struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
> - scsi_sdb_cache, GFP_ATOMIC);
> - if (!bidi_sdb) {
> - error = BLKPREP_DEFER;
> - goto err_exit;
> + if (!rq->q->mq_ops) {
> + struct scsi_data_buffer *bidi_sdb =
> + kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
> + if (!bidi_sdb) {
> + error = BLKPREP_DEFER;
> + goto err_exit;
> + }
> +
> + rq->next_rq->special = bidi_sdb;
> }
>
> - rq->next_rq->special = bidi_sdb;
> - error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
> + error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
> + GFP_ATOMIC);
> if (error)
> goto err_exit;
> }
> @@ -1060,7 +1167,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
> BUG_ON(prot_sdb == NULL);
> ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
>
> - if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
> + if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
> error = BLKPREP_DEFER;
> goto err_exit;
> }
> @@ -1074,13 +1181,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
> cmd->prot_sdb->table.nents = count;
> }
>
> - return BLKPREP_OK ;
> -
> + return BLKPREP_OK;
> err_exit:
> - scsi_release_buffers(cmd);
> - cmd->request->special = NULL;
> - scsi_put_command(cmd);
> - put_device(&sdev->sdev_gendev);
> + if (is_mq) {
> + scsi_mq_free_sgtables(cmd);
> + } else {
> + scsi_release_buffers(cmd);
> + cmd->request->special = NULL;
> + scsi_put_command(cmd);
> + put_device(&sdev->sdev_gendev);
> + }
> return error;
> }
> EXPORT_SYMBOL(scsi_init_io);
> @@ -1295,13 +1405,7 @@ out:
>
> static void scsi_unprep_fn(struct request_queue *q, struct request *req)
> {
> - if (req->cmd_type == REQ_TYPE_FS) {
> - struct scsi_cmnd *cmd = req->special;
> - struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> -
> - if (drv->uninit_command)
> - drv->uninit_command(cmd);
> - }
> + scsi_uninit_cmd(req->special);
> }
>
> /*
> @@ -1318,7 +1422,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> busy = atomic_inc_return(&sdev->device_busy) - 1;
> if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
> if (atomic_dec_return(&sdev->device_blocked) > 0) {
> - blk_delay_queue(q, SCSI_QUEUE_DELAY);
> + /*
> + * For the MQ case we take care of this in the caller.
> + */
> + if (!q->mq_ops)
> + blk_delay_queue(q, SCSI_QUEUE_DELAY);
> goto out_dec;
> }
> SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
> @@ -1688,6 +1796,188 @@ out_delay:
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
>
> +static inline int prep_to_mq(int ret)
> +{
> + switch (ret) {
> + case BLKPREP_OK:
> + return 0;
> + case BLKPREP_DEFER:
> + return BLK_MQ_RQ_QUEUE_BUSY;
> + default:
> + return BLK_MQ_RQ_QUEUE_ERROR;
> + }
> +}
> +
> +static int scsi_mq_prep_fn(struct request *req)
> +{
> + struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> + struct scsi_device *sdev = req->q->queuedata;
> + struct Scsi_Host *shost = sdev->host;
> + unsigned char *sense_buf = cmd->sense_buffer;
> + struct scatterlist *sg;
> +
> + memset(cmd, 0, sizeof(struct scsi_cmnd));
> +
> + req->special = cmd;
> +
> + cmd->request = req;
> + cmd->device = sdev;
> + cmd->sense_buffer = sense_buf;
> +
> + cmd->tag = req->tag;
> +
> + req->cmd = req->__cmd;
> + cmd->cmnd = req->cmd;
> + cmd->prot_op = SCSI_PROT_NORMAL;
> +
> + INIT_LIST_HEAD(&cmd->list);
> + INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
> + cmd->jiffies_at_alloc = jiffies;
> +
> + /*
> + * XXX: cmd_list lookups are only used by two drivers, try to get
> + * rid of this list in common code.
> + */
> + spin_lock_irq(&sdev->list_lock);
> + list_add_tail(&cmd->list, &sdev->cmd_list);
> + spin_unlock_irq(&sdev->list_lock);
> +
> + sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
> + cmd->sdb.table.sgl = sg;
> +
> + if (scsi_host_get_prot(shost)) {
> + cmd->prot_sdb = (void *)sg +
> + shost->sg_tablesize * sizeof(struct scatterlist);
> + memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
> +
> + cmd->prot_sdb->table.sgl =
> + (struct scatterlist *)(cmd->prot_sdb + 1);
> + }
> +
> + if (blk_bidi_rq(req)) {
> + struct request *next_rq = req->next_rq;
> + struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
> +
> + memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
> + bidi_sdb->table.sgl =
> + (struct scatterlist *)(bidi_sdb + 1);
> +
> + next_rq->special = bidi_sdb;
> + }
> +
> + switch (req->cmd_type) {
> + case REQ_TYPE_FS:
> + return scsi_cmd_to_driver(cmd)->init_command(cmd);
> + case REQ_TYPE_BLOCK_PC:
> + return scsi_setup_blk_pc_cmnd(cmd->device, req);
> + default:
> + return BLKPREP_KILL;
> + }
> +}
> +
> +static void scsi_mq_done(struct scsi_cmnd *cmd)
> +{
> + trace_scsi_dispatch_cmd_done(cmd);
> + blk_mq_complete_request(cmd->request);
> +}
> +
> +static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
> +{
> + struct request_queue *q = req->q;
> + struct scsi_device *sdev = q->queuedata;
> + struct Scsi_Host *shost = sdev->host;
> + struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> + int ret;
> + int reason;
> +
> + ret = prep_to_mq(scsi_prep_state_check(sdev, req));
> + if (ret)
> + goto out;
> +
> + ret = BLK_MQ_RQ_QUEUE_BUSY;
> + if (!get_device(&sdev->sdev_gendev))
> + goto out;
> +
> + if (!scsi_dev_queue_ready(q, sdev))
> + goto out_put_device;
> + if (!scsi_target_queue_ready(shost, sdev))
> + goto out_dec_device_busy;
> + if (!scsi_host_queue_ready(q, shost, sdev))
> + goto out_dec_target_busy;
> +
> + if (!(req->cmd_flags & REQ_DONTPREP)) {
> + ret = prep_to_mq(scsi_mq_prep_fn(req));
> + if (ret)
> + goto out_dec_host_busy;
> + req->cmd_flags |= REQ_DONTPREP;
> + }
> +
> + scsi_init_cmd_errh(cmd);
> + cmd->scsi_done = scsi_mq_done;
> +
> + reason = scsi_dispatch_cmd(cmd);
> + if (reason) {
> + scsi_set_blocked(cmd, reason);
> + ret = BLK_MQ_RQ_QUEUE_BUSY;
> + goto out_dec_host_busy;
> + }
> +
> + return BLK_MQ_RQ_QUEUE_OK;
> +
> +out_dec_host_busy:
> + cancel_delayed_work(&cmd->abort_work);
> + atomic_dec(&shost->host_busy);
> +out_dec_target_busy:
> + if (scsi_target(sdev)->can_queue > 0)
> + atomic_dec(&scsi_target(sdev)->target_busy);
> +out_dec_device_busy:
> + atomic_dec(&sdev->device_busy);
> +out_put_device:
> + put_device(&sdev->sdev_gendev);
> +out:
> + switch (ret) {
> + case BLK_MQ_RQ_QUEUE_BUSY:
> + blk_mq_stop_hw_queue(hctx);
> + if (atomic_read(&sdev->device_busy) == 0 &&
> + !scsi_device_blocked(sdev))
> + blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
> + break;
> + case BLK_MQ_RQ_QUEUE_ERROR:
> + /*
> + * Make sure to release all allocated resources when
> + * we hit an error, as we will never see this command
> + * again.
> + */
> + if (req->cmd_flags & REQ_DONTPREP)
> + scsi_mq_uninit_cmd(cmd);
> + break;
> + default:
> + break;
> + }
> + return ret;
> +}
> +
> +static int scsi_init_request(void *data, struct request *rq,
> + unsigned int hctx_idx, unsigned int request_idx,
> + unsigned int numa_node)
> +{
> + struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> + cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
> + numa_node);
> + if (!cmd->sense_buffer)
> + return -ENOMEM;
> + return 0;
> +}
> +
> +static void scsi_exit_request(void *data, struct request *rq,
> + unsigned int hctx_idx, unsigned int request_idx)
> +{
> + struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> + kfree(cmd->sense_buffer);
> +}
> +
> u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
> {
> struct device *host_dev;
> @@ -1710,16 +2000,10 @@ u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
> }
> EXPORT_SYMBOL(scsi_calculate_bounce_limit);
>
> -struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> - request_fn_proc *request_fn)
> +static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
> {
> - struct request_queue *q;
> struct device *dev = shost->dma_dev;
>
> - q = blk_init_queue(request_fn, NULL);
> - if (!q)
> - return NULL;
> -
> /*
> * this limit is imposed by hardware restrictions
> */
> @@ -1750,7 +2034,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> * blk_queue_update_dma_alignment() later.
> */
> blk_queue_dma_alignment(q, 0x03);
> +}
>
> +struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> + request_fn_proc *request_fn)
> +{
> + struct request_queue *q;
> +
> + q = blk_init_queue(request_fn, NULL);
> + if (!q)
> + return NULL;
> + __scsi_init_queue(shost, q);
> return q;
> }
> EXPORT_SYMBOL(__scsi_alloc_queue);
> @@ -1771,6 +2065,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
> return q;
> }
>
> +static struct blk_mq_ops scsi_mq_ops = {
> + .map_queue = blk_mq_map_queue,
> + .queue_rq = scsi_queue_rq,
> + .complete = scsi_softirq_done,
> + .timeout = scsi_times_out,
> + .init_request = scsi_init_request,
> + .exit_request = scsi_exit_request,
> +};
> +
> +struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
> +{
> + sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
> + if (IS_ERR(sdev->request_queue))
> + return NULL;
> +
> + sdev->request_queue->queuedata = sdev;
> + __scsi_init_queue(sdev->host, sdev->request_queue);
> + return sdev->request_queue;
> +}
> +
> +int scsi_mq_setup_tags(struct Scsi_Host *shost)
> +{
> + unsigned int cmd_size, sgl_size, tbl_size;
> +
> + tbl_size = shost->sg_tablesize;
> + if (tbl_size > SCSI_MAX_SG_SEGMENTS)
> + tbl_size = SCSI_MAX_SG_SEGMENTS;
> + sgl_size = tbl_size * sizeof(struct scatterlist);
> + cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
> + if (scsi_host_get_prot(shost))
> + cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
> +
> + memset(&shost->tag_set, 0, sizeof(shost->tag_set));
> + shost->tag_set.ops = &scsi_mq_ops;
> + shost->tag_set.nr_hw_queues = 1;
> + shost->tag_set.queue_depth = shost->can_queue;
> + shost->tag_set.cmd_size = cmd_size;
> + shost->tag_set.numa_node = NUMA_NO_NODE;
> + shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> + shost->tag_set.driver_data = shost;
> +
> + return blk_mq_alloc_tag_set(&shost->tag_set);
> +}
> +
> +void scsi_mq_destroy_tags(struct Scsi_Host *shost)
> +{
> + blk_mq_free_tag_set(&shost->tag_set);
> +}
> +
> /*
> * Function: scsi_block_requests()
> *
> @@ -2516,9 +2859,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
> * block layer from calling the midlayer with this device's
> * request queue.
> */
> - spin_lock_irqsave(q->queue_lock, flags);
> - blk_stop_queue(q);
> - spin_unlock_irqrestore(q->queue_lock, flags);
> + if (q->mq_ops) {
> + blk_mq_stop_hw_queues(q);
> + } else {
> + spin_lock_irqsave(q->queue_lock, flags);
> + blk_stop_queue(q);
> + spin_unlock_irqrestore(q->queue_lock, flags);
> + }
>
> return 0;
> }
> @@ -2564,9 +2911,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
> sdev->sdev_state != SDEV_OFFLINE)
> return -EINVAL;
>
> - spin_lock_irqsave(q->queue_lock, flags);
> - blk_start_queue(q);
> - spin_unlock_irqrestore(q->queue_lock, flags);
> + if (q->mq_ops) {
> + blk_mq_start_stopped_hw_queues(q, false);
> + } else {
> + spin_lock_irqsave(q->queue_lock, flags);
> + blk_start_queue(q);
> + spin_unlock_irqrestore(q->queue_lock, flags);
> + }
>
> return 0;
> }
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index a45d1c2..12b8e1b 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
> extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
> extern void scsi_run_host_queues(struct Scsi_Host *shost);
> extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
> +extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
> +extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
> +extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
> extern int scsi_init_queue(void);
> extern void scsi_exit_queue(void);
> struct request_queue;
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 4a6e4ba..b91cfaf 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
> */
> sdev->borken = 1;
>
> - sdev->request_queue = scsi_alloc_queue(sdev);
> + if (shost_use_blk_mq(shost))
> + sdev->request_queue = scsi_mq_alloc_queue(sdev);
> + else
> + sdev->request_queue = scsi_alloc_queue(sdev);
> if (!sdev->request_queue) {
> /* release fn is set up in scsi_sysfs_device_initialise, so
> * have to free and put manually here */
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index deef063..6c9227f 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
>
> static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
>
> +shost_rd_attr(use_blk_mq, "%d\n");
> shost_rd_attr(unique_id, "%u\n");
> shost_rd_attr(cmd_per_lun, "%hd\n");
> shost_rd_attr(can_queue, "%hd\n");
> @@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
> static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
>
> static struct attribute *scsi_sysfs_shost_attrs[] = {
> + &dev_attr_use_blk_mq.attr,
> &dev_attr_unique_id.attr,
> &dev_attr_host_busy.attr,
> &dev_attr_cmd_per_lun.attr,
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 7f9bbda..b54511e 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -7,6 +7,7 @@
> #include <linux/workqueue.h>
> #include <linux/mutex.h>
> #include <linux/seq_file.h>
> +#include <linux/blk-mq.h>
> #include <scsi/scsi.h>
>
> struct request_queue;
> @@ -531,6 +532,9 @@ struct scsi_host_template {
> */
> unsigned int cmd_size;
> struct scsi_host_cmd_pool *cmd_pool;
> +
> + /* temporary flag to disable blk-mq I/O path */
> + bool disable_blk_mq;
> };
>
> /*
> @@ -601,7 +605,10 @@ struct Scsi_Host {
> * Area to keep a shared tag map (if needed, will be
> * NULL if not).
> */
> - struct blk_queue_tag *bqt;
> + union {
> + struct blk_queue_tag *bqt;
> + struct blk_mq_tag_set tag_set;
> + };
>
> atomic_t host_busy; /* commands actually active on low-level */
> atomic_t host_blocked;
> @@ -693,6 +700,8 @@ struct Scsi_Host {
> /* The controller does not support WRITE SAME */
> unsigned no_write_same:1;
>
> + unsigned use_blk_mq:1;
> +
> /*
> * Optional work queue to be utilized by the transport
> */
> @@ -793,6 +802,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
> shost->tmf_in_progress;
> }
>
> +extern bool scsi_use_blk_mq;
> +
> +static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
> +{
> + return shost->use_blk_mq;
> +}
> +
> extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
> extern void scsi_flush_work(struct Scsi_Host *);
>
> diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
> index 81dd12e..cdcc90b 100644
> --- a/include/scsi/scsi_tcq.h
> +++ b/include/scsi/scsi_tcq.h
> @@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
> if (!sdev->tagged_supported)
> return;
>
> - if (!blk_queue_tagged(sdev->request_queue))
> + if (!shost_use_blk_mq(sdev->host) &&
> + blk_queue_tagged(sdev->request_queue))
> blk_queue_init_tags(sdev->request_queue, depth,
> sdev->host->bqt);
>
> @@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
> **/
> static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
> {
> - if (blk_queue_tagged(sdev->request_queue))
> + if (!shost_use_blk_mq(sdev->host) &&
> + blk_queue_tagged(sdev->request_queue))
> blk_queue_free_tags(sdev->request_queue);
> scsi_adjust_queue_depth(sdev, 0, depth);
> }
> @@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
> return 0;
> }
>
> +static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
> + unsigned int hw_ctx, int tag)
> +{
> + struct request *req;
> +
> + req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
> + return req ? (struct scsi_cmnd *)req->special : NULL;
> +}
> +
> /**
> * scsi_find_tag - find a tagged command by device
> * @SDpnt: pointer to the ScSI device
> @@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
> **/
> static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
> {
> -
> struct request *req;
>
> if (tag != SCSI_NO_TAG) {
> + if (shost_use_blk_mq(sdev->host))
> + return scsi_mq_find_tag(sdev->host, 0, tag);
> +
> req = blk_queue_find_tag(sdev->request_queue, tag);
> return req ? (struct scsi_cmnd *)req->special : NULL;
> }
> @@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
> return sdev->current_cmnd;
> }
>
> +
> /**
> * scsi_init_shared_tag_map - create a shared tag map
> * @shost: the host to share the tag map among all devices
> @@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
> static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
> {
> /*
> + * We always have a shared tag map around when using blk-mq.
> + */
> + if (shost_use_blk_mq(shost))
> + return 0;
> +
> + /*
> * If the shared tag map isn't already initialized, do it now.
> * This saves callers from having to check ->bqt when setting up
> * devices on the shared host (for libata)
> @@ -165,6 +185,8 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
> struct request *req;
>
> if (tag != SCSI_NO_TAG) {
> + if (shost_use_blk_mq(shost))
> + return scsi_mq_find_tag(shost, 0, tag);
> req = blk_map_queue_find_tag(shost->bqt, tag);
> return req ? (struct scsi_cmnd *)req->special : NULL;
> }
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: J. Hawn, J. Guild, F. Imend?rffer, HRB 16746 (AG N?rnberg)

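A note on the memory layout behind the "no allocations in the fast path"
claim: blk-mq allocates one contiguous buffer per tag, sized by
tag_set.cmd_size, and the midlayer carves the command, the LLD private
area and the inline S/G table out of it with plain pointer arithmetic.
A worked illustration (the numbers are hypothetical, and
sizeof(struct scatterlist) varies with architecture and config options):

        /*
         * Per-request PDU layout established by scsi_mq_setup_tags():
         *
         *   [ struct scsi_cmnd ][ hostt->cmd_size bytes ][ inline data sgl ]
         *   optionally followed by:
         *   [ struct scsi_data_buffer ][ inline protection sgl ]
         *
         * Hypothetical example:
         *   sg_tablesize = 128 (already <= SCSI_MAX_SG_SEGMENTS)
         *   sgl_size     = 128 * sizeof(struct scatterlist)
         *   cmd_size     = sizeof(struct scsi_cmnd) + hostt->cmd_size
         *                  + sgl_size
         *
         * The prep side (see scsi_mq_prep_fn() above) then recovers the
         * pieces without allocating anything:
         */
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scatterlist *sg;

        sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
        cmd->sdb.table.sgl = sg;	/* inline first chunk for the data sgl */
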
2014-07-09 11:27:36

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case

On 06/25/2014 06:52 PM, Christoph Hellwig wrote:
> Currently the midlayer fakes up a struct request for the explicit reset
> ioctls, and those don't have a tag allocated to them. The fnic driver pokes
> into midlayer structures to paper over this design issue, but that won't
> work for the blk-mq case.
>
> Either someone who can actually test the hardware will have to come up with
> a similar hack for the blk-mq case, or we'll have to bite the bullet and fix
> the way the EH ioctls work for real, but until that happens we fail these
> explicit requests here.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Cc: Hiral Patel <[email protected]>
> Cc: Suma Ramars <[email protected]>
> Cc: Brian Uchino <[email protected]>
> ---
> drivers/scsi/fnic/fnic_scsi.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
> index 3f88f56..961bdf5 100644
> --- a/drivers/scsi/fnic/fnic_scsi.c
> +++ b/drivers/scsi/fnic/fnic_scsi.c
> @@ -2224,6 +2224,22 @@ int fnic_device_reset(struct scsi_cmnd *sc)
>
> tag = sc->request->tag;
> if (unlikely(tag < 0)) {
> + /*
> + * XXX(hch): currently the midlayer fakes up a struct
> + * request for the explicit reset ioctls, and those
> + * don't have a tag allocated to them. The below
> + * code pokes into midlayer structures to paper over
> + * this design issue, but that won't work for blk-mq.
> + *
> + * Either someone who can actually test the hardware
> + * will have to come up with a similar hack for the
> + * blk-mq case, or we'll have to bite the bullet and
> + * fix the way the EH ioctls work for real, but until
> + * that happens we fail these explicit requests here.
> + */
> + if (shost_use_blk_mq(sc->device->host))
> + goto fnic_device_reset_end;
> +
> tag = fnic_scsi_host_start_tag(fnic, sc);
> if (unlikely(tag == SCSI_NO_TAG))
> goto fnic_device_reset_end;
>
The correct fix will be part of my EH redesign.
Plan is to allocate a real command/request for EH, which then can be
used to send down EH TMFs and related commands.

Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-07-09 15:03:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 01/14] sd: don't use rq->cmd_len before setting it up

FYI, this has been dropped from the series in favour of always memsetting
the cdb in common code. Take a look at the "RFC: clean up command setup"
series, on top of which I have rebased the scsi-mq changes.

2014-07-09 15:05:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit

On Wed, Jul 09, 2014 at 01:19:41PM +0200, Hannes Reinecke wrote:
>> host_not_ready:
>> - atomic_dec(&scsi_target(sdev)->target_busy);
>> + if (scsi_target(sdev)->can_queue > 0)
>> + atomic_dec(&scsi_target(sdev)->target_busy);
>> not_ready:
>> /*
>> * lock q, handle tag, requeue req, and decrement device_busy. We
>>
> Hmm. 'can_queue' can be changed by the LLDD. Don't we need some sort of
> synchronization here?

While a few drivers change the host can_queue value at runtime, none
do for the target. And while I don't think drivers should even change the
host one, modification to the target one is perfectly fine as long
as no driver drops it to zero.

2014-07-09 16:39:48

by Douglas Gilbert

[permalink] [raw]
Subject: Re: scsi-mq V2

On 14-07-08 10:48 AM, Christoph Hellwig wrote:
> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
>> Changes from V1:
>> - rebased on top of the core-for-3.17 branch, most notable the
>> scsi logging changes
>> - fixed handling of cmd_list to prevent crashes for some heavy
>> workloads
>> - fixed incorrect handling of !target->can_queue
>> - avoid scheduling a workqueue on I/O completions when no queues
>> are congested
>>
>> In addition to the patches in this thread there also is a git available at:
>>
>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>
>
> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> from June 29th. Robert Elliot found a problem with not fully zeroed
> out UNMAP CDBs, which is fixed by the saner discard handling in that
> series.
>
> There is a new patch to factor the code from the above series for
> blk-mq use, which I've attached below. Besides that the only changes
> are minor merge fixups in the main blk-mq usage patch.

Be warned: both Rob Elliott and I can easily break
the scsi-mq.3 branch. It seems as though a regression
has slipped in. I notice that Christoph has added a
new branch called "scsi-mq.3-no-rebase".

For those interested, watch this space.

Doug Gilbert

2014-07-09 16:50:02

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 08/14] scsi: convert device_busy to atomic_t

On Wed, 2014-06-25 at 18:51 +0200, Christoph Hellwig wrote:
> Avoid taking the queue_lock to check the per-device queue limit. Instead
> we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Unlike the host and target busy counters this doesn't allow us to avoid the
> queue_lock in the request_fn due to the way the interface works, but it'll
> allow us to prepare for using the blk-mq code, which doesn't use the
> queue_lock at all, and it at least avoids a queue_lock round trip in
> scsi_device_unbusy, which is still important given how busy the queue_lock
> is.

Most of these patches look fine to me, but this one worries me largely
because of the expense of atomics.

As far as I can tell from the block MQ, we get one CPU thread per LUN.
Doesn't this mean that we only need true atomics for variables that
cross threads? That does mean target and host, but shouldn't mean
device, since device == LUN. As long as we protect from local
interrupts, we should be able to exclusively update all LUN local
variables without having to change them to being atomic.

This view depends on correct CPU steering of returning interrupts, since
the LUN thread model only works if the same CPU handles issue and
completion, but it looks like that works in MQ, even if it doesn't work
in vanilla.

James

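For reference, the accounting pattern the series uses in place of the
host_lock is the optimistic grab-then-back-out idiom below. This is a
condensed sketch of scsi_dev_queue_ready(), assuming the series'
conversion of device_busy to atomic_t; the real function also handles
the device_blocked back-off:

static inline bool example_dev_queue_ready(struct scsi_device *sdev)
{
        unsigned int busy;

        /* optimistically claim a slot with a single atomic op */
        busy = atomic_inc_return(&sdev->device_busy) - 1;
        if (busy >= sdev->queue_depth) {
                /* over the limit: give the slot back and back off */
                atomic_dec(&sdev->device_busy);
                return false;
        }
        /* slot is held until scsi_device_unbusy() drops it */
        return true;
}
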
2014-07-09 19:38:18

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-09 18:39, Douglas Gilbert wrote:
> On 14-07-08 10:48 AM, Christoph Hellwig wrote:
>> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
>>> Changes from V1:
>>> - rebased on top of the core-for-3.17 branch, most notable the
>>> scsi logging changes
>>> - fixed handling of cmd_list to prevent crashes for some heavy
>>> workloads
>>> - fixed incorrect handling of !target->can_queue
>>> - avoid scheduling a workqueue on I/O completions when no queues
>>> are congested
>>>
>>> In addition to the patches in this thread there also is a git
>>> available at:
>>>
>>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>>
>>
>> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
>> latest core-for-3.17 tree + the "RFC: clean up command setup" series
>> from June 29th. Robert Elliot found a problem with not fully zeroed
>> out UNMAP CDBs, which is fixed by the saner discard handling in that
>> series.
>>
>> There is a new patch to factor the code from the above series for
>> blk-mq use, which I've attached below. Besides that the only changes
>> are minor merge fixups in the main blk-mq usage patch.
>
> Be warned: both Rob Elliott and I can easily break
> the scsi-mq.3 branch. It seems as though a regression
> has slipped in. I notice that Christoph has added a
> new branch called "scsi-mq.3-no-rebase".

Rob/Doug, those issues look very much like problems in the aio code. Can
either/both of you try with:

f8567a3845ac05bb28f3c1b478ef752762bd39ef
edfbbf388f293d70bf4b7c0bc38774d05e6f711a

reverted (in that order) and see if that changes anything.


--
Jens Axboe

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Wednesday, 09 July, 2014 2:38 PM
> To: [email protected]; Christoph Hellwig; James Bottomley; Bart Van
> Assche; Elliott, Robert (Server Storage); [email protected]; linux-
> [email protected]
> Subject: Re: scsi-mq V2
>
> On 2014-07-09 18:39, Douglas Gilbert wrote:
> > On 14-07-08 10:48 AM, Christoph Hellwig wrote:
> >> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
> >>> Changes from V1:
> >>> - rebased on top of the core-for-3.17 branch, most notable the
> >>> scsi logging changes
> >>> - fixed handling of cmd_list to prevent crashes for some heavy
> >>> workloads
> >>> - fixed incorrect handling of !target->can_queue
> >>> - avoid scheduling a workqueue on I/O completions when no queues
> >>> are congested
> >>>
> >>> In addition to the patches in this thread there also is a git
> >>> available at:
> >>>
> >>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> >>
> >>
> >> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> >> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> >> from June 29th. Robert Elliot found a problem with not fully zeroed
> >> out UNMAP CDBs, which is fixed by the saner discard handling in that
> >> series.
> >>
> >> There is a new patch to factor the code from the above series for
> >> blk-mq use, which I've attached below. Besides that the only changes
> >> are minor merge fixups in the main blk-mq usage patch.
> >
> > Be warned: both Rob Elliott and I can easily break
> > the scsi-mq.3 branch. It seems as though a regression
> > has slipped in. I notice that Christoph has added a
> > new branch called "scsi-mq.3-no-rebase".
>
> Rob/Doug, those issues look very much like problems in the aio code. Can
> either/both of you try with:
>
> f8567a3845ac05bb28f3c1b478ef752762bd39ef
> edfbbf388f293d70bf4b7c0bc38774d05e6f711a
>
> reverted (in that order) and see if that changes anything.
>
>
> --
> Jens Axboe

scsi-mq.3-no-rebase, which has all the scsi updates from scsi-mq.3
but is based on 3.16.0-rc2 rather than 3.16.0-rc4, works fine:
* ^C exits fio cleanly with scsi_debug devices
* ^C exits fio cleanly with mpt3sas devices
* fio hits 1M IOPS with 16 hpsa devices
* fio hits 700K IOPS with 6 mpt3sas devices
* 38 device test to mpt3sas, hpsa, and scsi_debug devices runs OK


With:
* scsi-mq.3, which is based on 3.16.0-rc4
* [PATCH] x86-64: fix vDSO build from https://lkml.org/lkml/2014/7/3/738
* those two aio patches reverted

the problem still occurs - fio results in low or 0 IOPS, with perf top
reporting unusual amounts of time spent in do_io_submit and io_submit.

perf top:
14.38% [kernel] [k] do_io_submit
13.71% libaio.so.1.0.1 [.] io_submit
13.32% [kernel] [k] system_call
11.60% [kernel] [k] system_call_after_swapgs
8.88% [kernel] [k] lookup_ioctx
8.78% [kernel] [k] copy_user_generic_string
7.78% [kernel] [k] io_submit_one
5.97% [kernel] [k] blk_flush_plug_list
2.73% fio [.] fio_libaio_commit
2.70% [kernel] [k] sysret_check
2.68% [kernel] [k] blk_finish_plug
1.98% [kernel] [k] blk_start_plug
1.17% [kernel] [k] SyS_io_submit
1.17% [kernel] [k] __get_user_4
0.99% fio [.] io_submit@plt
0.85% [kernel] [k] _copy_from_user
0.79% [kernel] [k] system_call_fastpath

Repeating some of last night's investigation details for the lists:

ftrace of one of the CPUs for all functions shows these
are repeatedly being called:

<...>-34508 [004] .... 6360.790714: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790714: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790714: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790714: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790715: do_io_submit <-SyS_io_submit
<...>-34508 [004] .... 6360.790715: lookup_ioctx <-do_io_submit
<...>-34508 [004] .... 6360.790715: blk_start_plug <-do_io_submit
<...>-34508 [004] .... 6360.790715: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790715: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790715: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790715: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790715: do_io_submit <-SyS_io_submit
<...>-34508 [004] .... 6360.790715: lookup_ioctx <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_start_plug <-do_io_submit
<...>-34508 [004] .... 6360.790716: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790716: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790716: do_io_submit <-SyS_io_submit
<...>-34508 [004] .... 6360.790716: lookup_ioctx <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_start_plug <-do_io_submit
<...>-34508 [004] .... 6360.790717: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790717: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790717: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790717: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790717: do_io_submit <-SyS_io_submit

fs/aio.c do_io_submit is apparently completing (many times) - it's not
stuck in the for loop:
        blk_start_plug(&plug);

        /*
         * AKPM: should this return a partial result if some of the IOs were
         * successfully submitted?
         */
        for (i=0; i<nr; i++) {
                struct iocb __user *user_iocb;
                struct iocb tmp;

                if (unlikely(__get_user(user_iocb, iocbpp + i))) {
                        ret = -EFAULT;
                        break;
                }

                if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
                        ret = -EFAULT;
                        break;
                }

                ret = io_submit_one(ctx, user_iocb, &tmp, compat);
                if (ret)
                        break;
        }
        blk_finish_plug(&plug);


fs/aio.c io_submit_one is not getting to fget, which is traceable:
        /* enforce forwards compatibility on users */
        if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
                pr_debug("EINVAL: reserve field set\n");
                return -EINVAL;
        }

        /* prevent overflows */
        if (unlikely(
            (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
            (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
            ((ssize_t)iocb->aio_nbytes < 0)
           )) {
                pr_debug("EINVAL: io_submit: overflow check\n");
                return -EINVAL;
        }

        req = aio_get_req(ctx);
        if (unlikely(!req))
                return -EAGAIN;

        req->ki_filp = fget(iocb->aio_fildes);

I don't have that file compiled with -DDEBUG so the pr_debug
prints are unavailable. The -EAGAIN seems most likely to lead
to a hang like this.

aio_get_req is not getting to kmem_cache_alloc, which is
traceable:
        if (!get_reqs_available(ctx))
                return NULL;

        req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);

get_reqs_available is probably returning false because not
enough reqs are available compared to req_batch:

        struct kioctx_cpu *kcpu;
        bool ret = false;

        preempt_disable();
        kcpu = this_cpu_ptr(ctx->cpu);

        if (!kcpu->reqs_available) {
                int old, avail = atomic_read(&ctx->reqs_available);

                do {
                        if (avail < ctx->req_batch)
                                goto out;

                        old = avail;
                        avail = atomic_cmpxchg(&ctx->reqs_available,
                                               avail, avail - ctx->req_batch);
                } while (avail != old);

                kcpu->reqs_available += ctx->req_batch;
        }

        ret = true;
        kcpu->reqs_available--;
out:
        preempt_enable();
        return ret;


---
Rob Elliott HP Server Storage


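The -EAGAIN theory is easy to probe from userspace: a minimal libaio
loop in the shape fio uses - io_setup() with the exact queue depth, then
submit and reap - makes the starvation visible as an io_submit() return
value instead of an apparent hang. The test program below is
hypothetical (not from the thread); build with -laio and point it at a
block device:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QD 64			/* matches the fio iodepth under test */

int main(int argc, char **argv)
{
        struct iocb iocbs[QD], *iocbp[QD];
        struct io_event events[QD];
        io_context_t ctx;
        void *buf;
        int fd, i, ret;

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;

        memset(&ctx, 0, sizeof(ctx));
        if (io_setup(QD, &ctx) < 0)	/* bare minimum, like fio */
                return 1;

        for (i = 0; i < QD; i++) {
                io_prep_pread(&iocbs[i], fd, buf, 4096, 0);
                iocbp[i] = &iocbs[i];
        }

        for (;;) {
                ret = io_submit(ctx, QD, iocbp);
                if (ret < 0)
                        /* -EAGAIN here means aio_get_req() is starving */
                        fprintf(stderr, "io_submit: %s\n", strerror(-ret));
                else if (ret > 0)
                        /* reap what we submitted before resubmitting */
                        io_getevents(ctx, ret, ret, events, NULL);
        }
        return 0;
}
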
2014-07-10 06:02:02

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 08/14] scsi: convert device_busy to atomic_t

On Wed, Jul 09, 2014 at 09:49:56AM -0700, James Bottomley wrote:
> As far as I can tell from the block MQ, we get one CPU thread per LUN.

No, that's entirely incorrect. IFF a device supports multiple hardware
queues we only submit I/O from the CPUs (there might be more than one) that
each queue is bound to. With the single hardware queue supported by most
hardware submissions can and will happen from any CPU. Note that
this patchset doesn't even support multiple hardware queues yet, although
it should be fairly simple to add once the low level driver support is
ready.

2014-07-10 06:06:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess

On Wed, Jul 09, 2014 at 01:12:17PM +0200, Hannes Reinecke wrote:
> Hmm. I guess there is a race window between
> atomic_read() and atomic_set().
> Doesn't this cause issues when someone calls atomic_set() just before the
> call to atomic_read?

There is a race window just _after_ the atomic_read, but it's harmless.
The whole _blocked scheme is a backoff to avoid resubmitting I/O all
the time when the HBA or target returned a busy status. If we race
and incorrectly reset it we will submit I/O and just get a busy indicator
again.

On the other hand doing the atomic_set all the time introduces three atomic
operations in the I/O completion path that are entirely pointless most of the time.

I guess I should add something like this as a comment to the code.

Note that the old code didn't use any sort of synchronization either.

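The idiom under discussion, reduced to its essentials (a sketch of the
shape of the code, not a verbatim hunk from the series):

        /*
         * Clear the back-off counter only when it was actually seen
         * nonzero, so the common completion path pays for one plain
         * atomic_read() instead of three unconditional atomic_set()s.
         *
         * The race: another CPU may set host_blocked right after the
         * read below.  Worst case we clear a fresh back-off, resubmit,
         * and simply receive another busy status - the back-off then
         * restarts itself.
         */
        if (atomic_read(&shost->host_blocked))
                atomic_set(&shost->host_blocked, 0);
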
2014-07-10 06:20:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage) wrote:
> the problem still occurs - fio results in low or 0 IOPS, with perf top
> reporting unusual amounts of time spent in do_io_submit and io_submit.

The diff between the two versions doesn't show many other possibly
interesting commits, the most interesting being some minor block
updates.

I guess we'll have to do a manual bisect. I've pushed out a
scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
the block tree, and a scsi-mq.3-bisect-2 branch that is just after
the merge of the block tree to get started.

2014-07-10 13:36:25

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: scsi-mq V2

On Wed, Jul 09, 2014 at 11:20:40PM -0700, Christoph Hellwig wrote:
> On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage) wrote:
> > the problem still occurs - fio results in low or 0 IOPS, with perf top
> > reporting unusual amounts of time spent in do_io_submit and io_submit.
>
> The diff between the two versions doesn't show many other possibly
> interesting commits, the most interesting being some minor block
> updates.
>
> I guess we'll have to do a manual bisect. I've pushed out a
> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> the block tree, and a scsi-mq.3-bisect-2 branch that is just after
> the merge of the block tree to get started.

There is one possible concern that could be exacerbated by other changes in
the system: if the application is running close to the bare minimum number
of requests allocated in io_setup(), the per cpu reference counters will
have a hard time batching things. It might be worth testing with an
increased number of requests being allocated if this is the case.
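
For instance, a minimal test program along these lines (hypothetical,
not from this thread; build with "gcc test.c -laio") would give the
batching some headroom beyond the depth actually used:

#include <libaio.h>
#include <stdio.h>

int main(void)
{
	io_context_t ctx = 0;
	int depth = 96;			/* the depth the app really uses */
	int padded = depth * 4;		/* arbitrary headroom for batching */
	int ret = io_setup(padded, &ctx);

	if (ret < 0) {
		fprintf(stderr, "io_setup failed: %d\n", ret);
		return 1;
	}
	io_destroy(ctx);
	return 0;
}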

-ben
--
"Thought is the essence of where you are now."

2014-07-10 13:40:05

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-10 15:36, Benjamin LaHaise wrote:
> On Wed, Jul 09, 2014 at 11:20:40PM -0700, Christoph Hellwig wrote:
>> On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage) wrote:
>>> the problem still occurs - fio results in low or 0 IOPS, with perf top
>>> reporting unusual amounts of time spent in do_io_submit and io_submit.
>>
>> The diff between the two versions doesn't show many other potentially
>> interesting commits, the most interesting being some minor block
>> updates.
>>
>> I guess we'll have to do a manual bisect. I've pushed out a
>> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
>> the block tree, and a scsi-mq.3-bisect-2 branch that is just after
>> the merge of the block tree, to get started.
>
> There is one possible concern that could be exacerbated by other changes in
> the system: if the application is running close to the bare minimum number
> of requests allocated in io_setup(), the per cpu reference counters will
> have a hard time batching things. It might be worth testing with an
> increased number of requests being allocated if this is the case.

That's how fio always runs, it sets up the context with the exact queue
depth that it needs. Do we have a good enough understanding of other aio
use cases to say that this isn't the norm? I would expect it to be, it's
the way that the API would most obviously be used.

--
Jens Axboe

2014-07-10 13:44:41

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
> That's how fio always runs, it sets up the context with the exact queue
> depth that it needs. Do we have a good enough understanding of other aio
> use cases to say that this isn't the norm? I would expect it to be, it's
> the way that the API would most obviously be used.

The problem with this approach is that it works very poorly with per cpu
reference counting's batching of references, which is pretty much a
requirement now that many core systems are the norm. Allocating the bare
minimum is not the right thing to do today. That said, the default limits
on the number of requests probably need to be raised.

-ben
--
"Thought is the essence of where you are now."

2014-07-10 13:48:16

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-10 15:44, Benjamin LaHaise wrote:
> On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
>> That's how fio always runs, it sets up the context with the exact queue
>> depth that it needs. Do we have a good enough understanding of other aio
>> use cases to say that this isn't the norm? I would expect it to be, it's
>> the way that the API would most obviously be used.
>
> The problem with this approach is that it works very poorly with per cpu
> reference counting's batching of references, which is pretty much a
> requirement now that many core systems are the norm. Allocating the bare
> minimum is not the right thing to do today. That said, the default limits
> on the number of requests probably need to be raised.

Sorry, that's a complete cop-out. Then you handle this internally,
allocate a bigger pool and cap the limit if you need to. Look at the
API. You pass in the number of requests you will use. Do you expect
anyone to double up, just in case? Will never happen.

But all of this is sidestepping the point that there's a real bug
reported here. The above could potentially explain an "it's using X
more CPU" or "it's Y slower" report. What was reported is a softlock;
it never completes.

--
Jens Axboe

2014-07-10 13:50:37

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:
> On 2014-07-10 15:44, Benjamin LaHaise wrote:
> >On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
> >>That's how fio always runs, it sets up the context with the exact queue
> >>depth that it needs. Do we have a good enough understanding of other aio
> >>use cases to say that this isn't the norm? I would expect it to be, it's
> >>the way that the API would most obviously be used.
> >
> >The problem with this approach is that it works very poorly with per cpu
> >reference counting's batching of references, which is pretty much a
> >requirement now that many core systems are the norm. Allocating the bare
> >minimum is not the right thing to do today. That said, the default limits
> >on the number of requests probably need to be raised.
>
> Sorry, that's a complete cop-out. Then you handle this internally,
> allocate a bigger pool and cap the limit if you need to. Look at the
> API. You pass in the number of requests you will use. Do you expect
> anyone to double up, just in case? Will never happen.
>
> But all of this is sidestepping the point that there's a real bug
> reported here. The above could potentially explain an "it's using X
> more CPU" or "it's Y slower" report. What was reported is a softlock;
> it never completes.

I'm not trying to cop out on this -- I'm asking for a data point to see
if changing the request limits has any effect.

-ben

> --
> Jens Axboe

--
"Thought is the essence of where you are now."

2014-07-10 13:50:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> There is one possible concern that could be exacerbated by other changes in
> the system: if the application is running close to the bare minimum number
> of requests allocated in io_setup(), the per cpu reference counters will
> have a hard time batching things. It might be worth testing with an
> increased number of requests being allocated if this is the case.

Well, Robert said reverting the two aio commits didn't help. Either he
didn't manage to boot into the right kernel, or we need to look
elsewhere for the culprit.

2014-07-10 13:52:23

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-10 15:50, Benjamin LaHaise wrote:
> On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:
>> On 2014-07-10 15:44, Benjamin LaHaise wrote:
>>> On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
>>>> That's how fio always runs, it sets up the context with the exact queue
>>>> depth that it needs. Do we have a good enough understanding of other aio
>>>> use cases to say that this isn't the norm? I would expect it to be, it's
>>>> the way that the API would most obviously be used.
>>>
>>> The problem with this approach is that it works very poorly with per cpu
>>> reference counting's batching of references, which is pretty much a
>>> requirement now that many core systems are the norm. Allocating the bare
>>> minimum is not the right thing to do today. That said, the default limits
>>> on the number of requests probably need to be raised.
>>
>> Sorry, that's a complete cop-out. Then you handle this internally,
>> allocate a bigger pool and cap the limit if you need to. Look at the
>> API. You pass in the number of requests you will use. Do you expect
>> anyone to double up, just in case? Will never happen.
>>
>> But all of this is sidestepping the point that there's a real bug
>> reported here. The above could potentially explain an "it's using X
>> more CPU" or "it's Y slower" report. What was reported is a softlock;
>> it never completes.
>
> I'm not trying to cop out on this -- I'm asking for a data point to see
> if changing the request limits has any effect.

Fair enough, if the question is "does it solve the regression", then
it's a valid data point. Rob/Doug, for fio, you can just double the
iodepth passed in in engines/libaio:fio_libaio_init() and test with that
and see if it makes a difference.

--
Jens Axboe

2014-07-10 13:52:54

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-10 15:50, Christoph Hellwig wrote:
> On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
>> There is one possible concern that could be exacerbated by other changes in
>> the system: if the application is running close to the bare minimum number
>> of requests allocated in io_setup(), the per cpu reference counters will
>> have a hard time batching things. It might be worth testing with an
>> increased number of requests being allocated if this is the case.
>
> Well, Robert said reverting the two aio commits didn't help. Either he
> didn't manage to boot into the right kernel, or we need to look
> elsewhere for the culprit.

Rob, let me know what scsi_debug setup you use, and I can try and
reproduce it here as well.

--
Jens Axboe

2014-07-10 14:36:40

by Elliott, Robert (Server Storage)

[permalink] [raw]
Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Thursday, 10 July, 2014 8:53 AM
> To: Christoph Hellwig; Benjamin LaHaise
> Cc: Elliott, Robert (Server Storage); [email protected]; James Bottomley;
> Bart Van Assche; [email protected]; [email protected]
> Subject: Re: scsi-mq V2
>
> On 2014-07-10 15:50, Christoph Hellwig wrote:
> > On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> >> There is one possible concern that could be exacerbated by other changes
> in
> >> the system: if the application is running close to the bare minimum number
> >> of requests allocated in io_setup(), the per cpu reference counters will
> >> have a hard time batching things. It might be worth testing with an
> >> increased number of requests being allocated if this is the case.
> >
> > Well, Robert said reverting the two aio commits didn't help. Either he
> > didn't manage to boot into the right kernel, or we need to look
> > elsewhere for the culprit.
>
> Rob, let me know what scsi_debug setup you use, and I can try and
> reproduce it here as well.
>
> --
> Jens Axboe

This system has 6 online CPUs and 64 possible CPUs.

Printing avail and req_batch in that loop results in many of these:
** 3813 printk messages dropped ** [10643.503772] ctx ffff88042d8d4cc0 avail=0 req_batch=2
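
For reference, a back-of-envelope model of where that req_batch=2 comes
from (the formula and ring sizing are assumed from the fs/aio.c of this
era, so treat it as illustrative):

#include <stdio.h>

int main(void)
{
	/* io_setup(96) -> max(96, 64 possible CPUs * 4) = 256, doubled
	 * to 512, then rounded up to whole ring pages: roughly 639
	 * usable events */
	unsigned ring_events = 639;
	unsigned possible_cpus = 64;	/* 64 possible, only 6 online */
	unsigned req_batch = (ring_events - 1) / (possible_cpus * 4);

	if (req_batch < 1)
		req_batch = 1;
	printf("req_batch=%u\n", req_batch);	/* prints req_batch=2 */
	return 0;
}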

Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
those pr_debug prints results in nothing extra being printed,
so it's not hitting an error path.

Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
these as fio starts:

[ 186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.435370] hrtimer: interrupt took 6868 ns
[ 186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.449896] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.449900] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.449901] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.449909] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.449932] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.449933] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.461147] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.461176] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.461177] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.461181] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.461181] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.461184] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.461185] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.461185] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.461191] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.461192] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.474426] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.481353] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.483706] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.483707] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.483709] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.483710] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.483712] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.483717] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.495441] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.495444] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.495445] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.490] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.495451] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.495457] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.495457] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.495460] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.495461] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.495463] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.495464] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.499429] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.499437] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.619785] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.627371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.627374] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.627383] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.627384] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.627385] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.628371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.628372] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.630361] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.665329] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.666360] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.666361] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.666366] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.666367] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.666367] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.666369] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 186.666370] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 186.670369] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.670372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.670373] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 186.767323] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 187.211053] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 187.213053] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]

Subsequently added prints showed nr_events coming in with
an initial value of 0x7FFFFFFF in those cases where it showed
up as -2 above. Since the last call had a reasonable value
of 512, it doesn't seem like a problem.
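
To connect those two numbers: the -2 is just the doubling step in
ioctx_alloc() wrapping a 32-bit value (the doubling is assumed from the
fs/aio.c of this era; tiny userspace demonstration):

#include <stdio.h>

int main(void)
{
	unsigned nr_events = 0x7FFFFFFFu;	/* the "initial" value above */

	nr_events *= 2;		/* ioctx_alloc's ring-sizing step; wraps
				 * to 0xFFFFFFFE */
	printf("nr_events=%d\n", (int)nr_events);	/* prints -2 */
	return 0;
}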


script to create the scsi_debug devices:
#!/bin/bash
devices=6
delay=0 # -2 = jiffy, -1 = hi jiffy, 0 = none, 1..n = longer
ndelay=20000 # 20 us
opts=0x4801 #
every_nth=3
capacity_mib=0
capacity_gib=$((2 * 1024))
lbpu=1
lbpws=1

modprobe -r scsi_debug
modprobe -v scsi_debug fake_rw=1 delay=$delay ndelay=$ndelay num_tgts=$devices opts=$opts every_nth=$every_nth physblk_exp=3 lbpu=$lbpu lbpws=$lbpws dev_size_mb=$capacity_mib virtual_gb=$capacity_gib
lsscsi -s
lsblk
# the assigned /dev names will vary...
for device in /sys/block/sda[h-m]
do
	echo none > $device/device/queue_type
done

fio script:
[global]
direct=1
invalidate=1
ioengine=libaio
norandommap
randrepeat=0
bs=4096
iodepth=96
numjobs=6
runtime=216000
time_based=1
group_reporting
thread
gtod_reduce=1
iodepth_batch=16
iodepth_batch_complete=16
cpus_allowed=0-5
cpus_allowed_policy=split
rw=randread

[4_KiB_RR_drive_ah]
filename=/dev/sdah

[4_KiB_RR_drive_ai]
filename=/dev/sdai

[4_KiB_RR_drive_aj]
filename=/dev/sdaj

[4_KiB_RR_drive_ak]
filename=/dev/sdak

[4_KiB_RR_drive_al]
filename=/dev/sdal

[4_KiB_RR_drive_am]
filename=/dev/sdam

kernel log with some prints in ioctx_alloc:
(2147483647 is 0x7FFFFFFF)

[ 94.050877] ioctx_alloc: initial nr_events=2147483647
[ 94.053610] ioctx_alloc: num_possible_cpus=64
[ 94.055235] ioctx_alloc: after max nr_events=2147483647
[ 94.057110] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.059189] ioctx_alloc: initial nr_events=96
[ 94.059294] ioctx_alloc: initial nr_events=2147483647
[ 94.059295] ioctx_alloc: num_possible_cpus=64
[ 94.059295] ioctx_alloc: after max nr_events=2147483647
[ 94.059296] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.059296] ioctx_alloc: initial nr_events=96
[ 94.059297] ioctx_alloc: num_possible_cpus=64
[ 94.059297] ioctx_alloc: after max nr_events=256
[ 94.059298] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.075891] ioctx_alloc: num_possible_cpus=64
[ 94.077529] ioctx_alloc: after max nr_events=256
[ 94.079064] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.087777] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.087810] ioctx_alloc: initial nr_events=2147483647
[ 94.087810] ioctx_alloc: num_possible_cpus=64
[ 94.087811] ioctx_alloc: after max nr_events=2147483647
[ 94.087811] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.087812] ioctx_alloc: initial nr_events=96
[ 94.087812] ioctx_alloc: num_possible_cpus=64
[ 94.087813] ioctx_alloc: after max nr_events=256
[ 94.087813] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.087815] ioctx_alloc: initial nr_events=2147483647
[ 94.087816] ioctx_alloc: initial nr_events=2147483647
[ 94.087816] ioctx_alloc: num_possible_cpus=64
[ 94.087817] ioctx_alloc: initial nr_events=2147483647
[ 94.087818] ioctx_alloc: num_possible_cpus=64
[ 94.087819] ioctx_alloc: after max nr_events=2147483647
[ 94.087819] ioctx_alloc: num_possible_cpus=64
[ 94.087820] ioctx_alloc: after max nr_events=2147483647
[ 94.087820] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.087821] ioctx_alloc: after max nr_events=2147483647
[ 94.087822] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.087822] ioctx_alloc: initial nr_events=96
[ 94.087823] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.087824] ioctx_alloc: initial nr_events=96
[ 94.087825] ioctx_alloc: initial nr_events=2147483647
[ 94.087825] ioctx_alloc: num_possible_cpus=64
[ 94.087826] ioctx_alloc: initial nr_events=96
[ 94.087826] ioctx_alloc: num_possible_cpus=64
[ 94.087827] ioctx_alloc: num_possible_cpus=64
[ 94.087827] ioctx_alloc: after max nr_events=256
[ 94.087828] ioctx_alloc: num_possible_cpus=64
[ 94.087828] ioctx_alloc: after max nr_events=256
[ 94.087829] ioctx_alloc: after max nr_events=2147483647
[ 94.087829] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.087830] ioctx_alloc: after max nr_events=256
[ 94.087831] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.087831] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.087832] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.087833] ioctx_alloc: initial nr_events=96
[ 94.087833] ioctx_alloc: num_possible_cpus=64
[ 94.087833] ioctx_alloc: after max nr_events=256
[ 94.087834] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.090668] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.259433] ioctx_alloc: initial nr_events=2147483647
[ 94.259435] ioctx_alloc: initial nr_events=2147483647
[ 94.259436] ioctx_alloc: num_possible_cpus=64
[ 94.259437] ioctx_alloc: after max nr_events=2147483647
[ 94.259437] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.259438] ioctx_alloc: initial nr_events=96
[ 94.259438] ioctx_alloc: num_possible_cpus=64
[ 94.259438] ioctx_alloc: after max nr_events=256
[ 94.259439] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.259446] ioctx_alloc: initial nr_events=2147483647
[ 94.259448] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.259449] ioctx_alloc: initial nr_events=2147483647
[ 94.259450] ioctx_alloc: initial nr_events=2147483647
[ 94.259450] ioctx_alloc: num_possible_cpus=64
[ 94.259451] ioctx_alloc: num_possible_cpus=64
[ 94.259452] ioctx_alloc: num_possible_cpus=64
[ 94.259452] ioctx_alloc: after max nr_events=2147483647
[ 94.259453] ioctx_alloc: after max nr_events=2147483647
[ 94.259453] ioctx_alloc: after max nr_events=2147483647
[ 94.259454] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.259456] ioctx_alloc: initial nr_events=96
[ 94.259456] ioctx_alloc: initial nr_events=96
[ 94.259457] ioctx_alloc: initial nr_events=96
[ 94.259457] ioctx_alloc: num_possible_cpus=64
[ 94.259458] ioctx_alloc: num_possible_cpus=64
[ 94.259458] ioctx_alloc: num_possible_cpus=64
[ 94.259459] ioctx_alloc: after max nr_events=256
[ 94.259459] ioctx_alloc: after max nr_events=256
[ 94.259460] ioctx_alloc: after max nr_events=256
[ 94.259460] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.259461] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.259462] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.260539] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.260544] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.262535] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.262550] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.423889] ioctx_alloc: num_possible_cpus=64
[ 94.425386] ioctx_alloc: after max nr_events=2147483647
[ 94.427327] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.429359] ioctx_alloc: initial nr_events=96
[ 94.429448] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.429451] ioctx_alloc: initial nr_events=2147483647
[ 94.429452] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.429453] ioctx_alloc: num_possible_cpus=64
[ 94.429454] ioctx_alloc: initial nr_events=2147483647
[ 94.429454] ioctx_alloc: after max nr_events=2147483647
[ 94.429455] ioctx_alloc: num_possible_cpus=64
[ 94.429456] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.429456] ioctx_alloc: after max nr_events=2147483647
[ 94.429457] ioctx_alloc: initial nr_events=96
[ 94.429458] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.429458] ioctx_alloc: num_possible_cpus=64
[ 94.429459] ioctx_alloc: initial nr_events=96
[ 94.429459] ioctx_alloc: after max nr_events=256
[ 94.429460] ioctx_alloc: num_possible_cpus=64
[ 94.429461] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.429461] ioctx_alloc: after max nr_events=256
[ 94.429462] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.429463] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.430422] hrtimer: interrupt took 6115 ns
[ 94.431463] ioctx_alloc: initial nr_events=2147483647
[ 94.431464] ioctx_alloc: num_possible_cpus=64
[ 94.431464] ioctx_alloc: after max nr_events=2147483647
[ 94.431465] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.431465] ioctx_alloc: initial nr_events=96
[ 94.431466] ioctx_alloc: num_possible_cpus=64
[ 94.431466] ioctx_alloc: after max nr_events=256
[ 94.431466] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.432641] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.580307] ioctx_alloc: num_possible_cpus=64
[ 94.581844] ioctx_alloc: after max nr_events=256
[ 94.583405] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.585313] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.585319] ioctx_alloc: initial nr_events=2147483647
[ 94.585320] ioctx_alloc: num_possible_cpus=64
[ 94.585320] ioctx_alloc: after max nr_events=2147483647
[ 94.585321] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.585322] ioctx_alloc: initial nr_events=2147483647
[ 94.585322] ioctx_alloc: initial nr_events=96
[ 94.585323] ioctx_alloc: num_possible_cpus=64
[ 94.585324] ioctx_alloc: num_possible_cpus=64
[ 94.585324] ioctx_alloc: after max nr_events=2147483647
[ 94.585325] ioctx_alloc: after max nr_events=256
[ 94.585325] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.585326] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.585327] ioctx_alloc: initial nr_events=2147483647
[ 94.585328] ioctx_alloc: initial nr_events=96
[ 94.585328] ioctx_alloc: num_possible_cpus=64
[ 94.585329] ioctx_alloc: num_possible_cpus=64
[ 94.585329] ioctx_alloc: after max nr_events=2147483647
[ 94.585330] ioctx_alloc: after max nr_events=256
[ 94.585331] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.585331] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.585332] ioctx_alloc: initial nr_events=96
[ 94.585332] ioctx_alloc: num_possible_cpus=64
[ 94.585333] ioctx_alloc: after max nr_events=256
[ 94.585333] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.585372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.585402] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.588377] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.632221] ioctx_alloc: initial nr_events=2147483647
[ 94.632228] ioctx_alloc: initial nr_events=2147483647
[ 94.632229] ioctx_alloc: num_possible_cpus=64
[ 94.632229] ioctx_alloc: after max nr_events=2147483647
[ 94.632230] ioctx_alloc: initial nr_events=2147483647
[ 94.632231] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.632232] ioctx_alloc: num_possible_cpus=64
[ 94.632232] ioctx_alloc: initial nr_events=96
[ 94.632233] ioctx_alloc: after max nr_events=2147483647
[ 94.632233] ioctx_alloc: num_possible_cpus=64
[ 94.632234] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.632234] ioctx_alloc: after max nr_events=256
[ 94.632235] ioctx_alloc: initial nr_events=96
[ 94.632236] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.632236] ioctx_alloc: num_possible_cpus=64
[ 94.632237] ioctx_alloc: after max nr_events=256
[ 94.632237] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.632241] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.633350] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.764384] ioctx_alloc: num_possible_cpus=64
[ 94.766038] ioctx_alloc: after max nr_events=2147483647
[ 94.767807] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.769568] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.770328] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.773546] ioctx_alloc: initial nr_events=2147483647
[ 94.773550] ioctx_alloc: initial nr_events=2147483647
[ 94.773551] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.773552] ioctx_alloc: num_possible_cpus=64
[ 94.773552] ioctx_alloc: after max nr_events=2147483647
[ 94.773553] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.773553] ioctx_alloc: initial nr_events=96
[ 94.773554] ioctx_alloc: num_possible_cpus=64
[ 94.773554] ioctx_alloc: after max nr_events=256
[ 94.773555] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.773569] ioctx_alloc: initial nr_events=2147483647
[ 94.773569] ioctx_alloc: num_possible_cpus=64
[ 94.773570] ioctx_alloc: after max nr_events=2147483647
[ 94.773570] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.773571] ioctx_alloc: initial nr_events=96
[ 94.773571] ioctx_alloc: num_possible_cpus=64
[ 94.773572] ioctx_alloc: after max nr_events=256
[ 94.773572] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.903978] ioctx_alloc: num_possible_cpus=64
[ 94.905427] ioctx_alloc: after max nr_events=2147483647
[ 94.907320] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.909300] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.909305] ioctx_alloc: initial nr_events=2147483647
[ 94.909306] ioctx_alloc: num_possible_cpus=64
[ 94.909306] ioctx_alloc: after max nr_events=2147483647
[ 94.909307] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.909307] ioctx_alloc: initial nr_events=96
[ 94.909308] ioctx_alloc: num_possible_cpus=64
[ 94.909308] ioctx_alloc: after max nr_events=256
[ 94.909309] ioctx_alloc: initial nr_events=2147483647
[ 94.909310] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.909310] ioctx_alloc: num_possible_cpus=64
[ 94.909311] ioctx_alloc: after max nr_events=2147483647
[ 94.909311] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.909312] ioctx_alloc: initial nr_events=96
[ 94.909312] ioctx_alloc: num_possible_cpus=64
[ 94.909313] ioctx_alloc: after max nr_events=256
[ 94.909313] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.912223] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.940281] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 94.940283] ioctx_alloc: initial nr_events=2147483647
[ 94.940284] ioctx_alloc: num_possible_cpus=64
[ 94.940285] ioctx_alloc: after max nr_events=2147483647
[ 94.940286] ioctx_alloc: initial nr_events=2147483647
[ 94.940286] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.940287] ioctx_alloc: num_possible_cpus=64
[ 94.940288] ioctx_alloc: initial nr_events=96
[ 94.940288] ioctx_alloc: after max nr_events=2147483647
[ 94.940289] ioctx_alloc: num_possible_cpus=64
[ 94.940290] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 94.940290] ioctx_alloc: after max nr_events=256
[ 94.940291] ioctx_alloc: initial nr_events=96
[ 94.940291] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.940292] ioctx_alloc: num_possible_cpus=64
[ 94.940292] ioctx_alloc: after max nr_events=256
[ 94.940293] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 94.942198] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.069096] ioctx_alloc: initial nr_events=96
[ 95.069097] ioctx_alloc: num_possible_cpus=64
[ 95.069097] ioctx_alloc: after max nr_events=256
[ 95.069098] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.087101] ioctx_alloc: initial nr_events=2147483647
[ 95.087108] ioctx_alloc: initial nr_events=2147483647
[ 95.087108] ioctx_alloc: num_possible_cpus=64
[ 95.087109] ioctx_alloc: after max nr_events=2147483647
[ 95.087109] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.087110] ioctx_alloc: initial nr_events=96
[ 95.087110] ioctx_alloc: num_possible_cpus=64
[ 95.087111] ioctx_alloc: after max nr_events=256
[ 95.087112] ioctx_alloc: initial nr_events=2147483647
[ 95.087113] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.087113] ioctx_alloc: num_possible_cpus=64
[ 95.087114] ioctx_alloc: after max nr_events=2147483647
[ 95.087114] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.087115] ioctx_alloc: initial nr_events=96
[ 95.087115] ioctx_alloc: num_possible_cpus=64
[ 95.087116] ioctx_alloc: after max nr_events=256
[ 95.087117] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.087117] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.087120] ioctx_alloc: initial nr_events=2147483647
[ 95.087120] ioctx_alloc: num_possible_cpus=64
[ 95.087121] ioctx_alloc: after max nr_events=2147483647
[ 95.087121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.087122] ioctx_alloc: initial nr_events=96
[ 95.087122] ioctx_alloc: num_possible_cpus=64
[ 95.087123] ioctx_alloc: after max nr_events=256
[ 95.087123] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.087126] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.091100] ioctx_alloc: initial nr_events=2147483647
[ 95.091100] ioctx_alloc: num_possible_cpus=64
[ 95.091100] ioctx_alloc: after max nr_events=2147483647
[ 95.091101] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.091101] ioctx_alloc: initial nr_events=96
[ 95.091102] ioctx_alloc: num_possible_cpus=64
[ 95.091102] ioctx_alloc: after max nr_events=256
[ 95.091103] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.145236] ioctx_alloc: num_possible_cpus=64
[ 95.146754] ioctx_alloc: after max nr_events=2147483647
[ 95.248567] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.250432] ioctx_alloc: initial nr_events=2147483647
[ 95.250438] ioctx_alloc: initial nr_events=2147483647
[ 95.250439] ioctx_alloc: num_possible_cpus=64
[ 95.250439] ioctx_alloc: after max nr_events=2147483647
[ 95.250440] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.250440] ioctx_alloc: initial nr_events=96
[ 95.250441] ioctx_alloc: num_possible_cpus=64
[ 95.250441] ioctx_alloc: after max nr_events=256
[ 95.250442] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.250450] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.250457] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.251027] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.251038] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.252029] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.275430] ioctx_alloc: num_possible_cpus=64
[ 95.277000] ioctx_alloc: after max nr_events=2147483647
[ 95.278747] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.280540] ioctx_alloc: initial nr_events=2147483647
[ 95.280554] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.284457] ioctx_alloc: num_possible_cpus=64
[ 95.285998] ioctx_alloc: after max nr_events=2147483647
[ 95.287764] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[ 95.289455] ioctx_alloc: initial nr_events=96
[ 95.290901] ioctx_alloc: num_possible_cpus=64
[ 95.292450] ioctx_alloc: after max nr_events=256
[ 95.294013] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.295873] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.381941] ioctx_alloc: initial nr_events=96
[ 95.383764] ioctx_alloc: num_possible_cpus=64
[ 95.385303] ioctx_alloc: after max nr_events=256
[ 95.386959] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.391935] ioctx_alloc: initial nr_events=96
[ 95.393493] ioctx_alloc: num_possible_cpus=64
[ 95.394994] ioctx_alloc: after max nr_events=256
[ 95.396751] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.421964] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.425953] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[ 95.611825] ioctx_alloc: initial nr_events=96
[ 95.613398] ioctx_alloc: num_possible_cpus=64
[ 95.614893] ioctx_alloc: after max nr_events=256
[ 95.616615] ioctx_alloc: nr_events=512 aio_max_nr=65536
[ 95.645844] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]

2014-07-10 14:45:43

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 02:36:40PM +0000, Elliott, Robert (Server Storage) wrote:
>
>
> > -----Original Message-----
> > From: Jens Axboe [mailto:[email protected]]
> > Sent: Thursday, 10 July, 2014 8:53 AM
> > To: Christoph Hellwig; Benjamin LaHaise
> > Cc: Elliott, Robert (Server Storage); [email protected]; James Bottomley;
> > Bart Van Assche; [email protected]; [email protected]
> > Subject: Re: scsi-mq V2
> >
> > On 2014-07-10 15:50, Christoph Hellwig wrote:
> > > On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> > >> There is one possible concern that could be exacerbated by other changes
> > in
> > >> the system: if the application is running close to the bare minimum number
> > >> of requests allocated in io_setup(), the per cpu reference counters will
> > >> have a hard time batching things. It might be worth testing with an
> > >> increased number of requests being allocated if this is the case.
> > >
> > > Well, Robert said reverting the two aio commits didn't help. Either he
> > > didn't manage to boot into the right kernel, or we need to look
> > > elsewhere for the culprit.
> >
> > Rob, let me know what scsi_debug setup you use, and I can try and
> > reproduce it here as well.
> >
> > --
> > Jens Axboe
>
> This system has 6 online CPUs and 64 possible CPUs.
>
> Printing avail and req_batch in that loop results in many of these:
> ** 3813 printk messages dropped ** [10643.503772] ctx ffff88042d8d4cc0 avail=0 req_batch=2
>
> Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
> those pr_debug prints results in nothing extra printing,
> so it's not hitting an error.
>
> Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
> these as fio starts:
>
> [ 186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

Something is horribly wrong here. There is no way that value for nr_events
should be passed in to ioctx_alloc(). This implies that userland is calling
io_setup() with an impossibly large value for nr_events. Can you post the
actual diff for your fs/aio.c relative to Linus' tree?

-ben


> [ 186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.435370] hrtimer: interrupt took 6868 ns
> [ 186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.449896] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.449900] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.449901] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.449909] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.449932] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.449933] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.461147] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.461176] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.461177] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.461181] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.461181] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.461184] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.461185] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.461185] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.461191] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.461192] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.474426] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.481353] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.483706] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.483707] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.483709] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.483710] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.483712] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.483717] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.495441] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.495444] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.495445] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.490] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.495451] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.495457] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.495457] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.495460] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.495461] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.495463] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.495464] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.499429] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.499437] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.619785] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.627371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.627374] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.627383] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.627384] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.627385] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.628371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.628372] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.630361] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.665329] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.666360] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.666361] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.666366] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.666367] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.666367] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.666369] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 186.666370] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 186.670369] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.670372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.670373] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 186.767323] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 187.211053] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 187.213053] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
>
> Subsequent added prints showed nr_events coming in with
> an initial value of 0x7FFFFFFF in those cases where it showed
> as -2 above. Since the last call had a reasonable value
> of 512, it doesn't seem like a problem.
>
>
> script to create the scsi_debug devices:
> #!/bin/bash
> devices=6
> delay=0 # -2 = jiffy, -1 = hi jiffy, 0 = none, 1..n = longer
> ndelay=20000 # 20 us
> opts=0x4801 #
> every_nth=3
> capacity_mib=0
> capacity_gib=$((2 * 1024))
> lbpu=1
> lbpws=1
>
> modprobe -r scsi_debug
> modprobe -v scsi_debug fake_rw=1 delay=$delay ndelay=$ndelay num_tgts=$devices opts=$opts every_nth=$every_nth physblk_exp=3 lbpu=$lbpu lbpws=$lbpws dev_size_mb=$capacity_mib virtual_gb=$capacity_gib
> lsscsi -s
> lsblk
> # the assigned /dev names will vary...
> for device in /sys/block/sda[h-m]
> do
> echo none > $device/device/queue_type
> done
>
> fio script:
> [global]
> direct=1
> invalidate=1
> ioengine=libaio
> norandommap
> randrepeat=0
> bs=4096
> iodepth=96
> numjobs=6
> runtime=216000
> time_based=1
> group_reporting
> thread
> gtod_reduce=1
> iodepth_batch=16
> iodepth_batch_complete=16
> cpus_allowed=0-5
> cpus_allowed_policy=split
> rw=randread
>
> [4_KiB_RR_drive_ah]
> filename=/dev/sdah
>
> [4_KiB_RR_drive_ai]
> filename=/dev/sdai
>
> [4_KiB_RR_drive_aj]
> filename=/dev/sdaj
>
> [4_KiB_RR_drive_ak]
> filename=/dev/sdak
>
> [4_KiB_RR_drive_al]
> filename=/dev/sdal
>
> [4_KiB_RR_drive_am]
> filename=/dev/sdam
>
> kernel log with some prints in ioctx_alloc:
> (2147483647 is 0x7FFFFFFF)
>
> [ 94.050877] ioctx_alloc: initial nr_events=2147483647
> [ 94.053610] ioctx_alloc: num_possible_cpus=64
> [ 94.055235] ioctx_alloc: after max nr_events=2147483647
> [ 94.057110] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.059189] ioctx_alloc: initial nr_events=96
> [ 94.059294] ioctx_alloc: initial nr_events=2147483647
> [ 94.059295] ioctx_alloc: num_possible_cpus=64
> [ 94.059295] ioctx_alloc: after max nr_events=2147483647
> [ 94.059296] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.059296] ioctx_alloc: initial nr_events=96
> [ 94.059297] ioctx_alloc: num_possible_cpus=64
> [ 94.059297] ioctx_alloc: after max nr_events=256
> [ 94.059298] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.075891] ioctx_alloc: num_possible_cpus=64
> [ 94.077529] ioctx_alloc: after max nr_events=256
> [ 94.079064] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.087777] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.087810] ioctx_alloc: initial nr_events=2147483647
> [ 94.087810] ioctx_alloc: num_possible_cpus=64
> [ 94.087811] ioctx_alloc: after max nr_events=2147483647
> [ 94.087811] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.087812] ioctx_alloc: initial nr_events=96
> [ 94.087812] ioctx_alloc: num_possible_cpus=64
> [ 94.087813] ioctx_alloc: after max nr_events=256
> [ 94.087813] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.087815] ioctx_alloc: initial nr_events=2147483647
> [ 94.087816] ioctx_alloc: initial nr_events=2147483647
> [ 94.087816] ioctx_alloc: num_possible_cpus=64
> [ 94.087817] ioctx_alloc: initial nr_events=2147483647
> [ 94.087818] ioctx_alloc: num_possible_cpus=64
> [ 94.087819] ioctx_alloc: after max nr_events=2147483647
> [ 94.087819] ioctx_alloc: num_possible_cpus=64
> [ 94.087820] ioctx_alloc: after max nr_events=2147483647
> [ 94.087820] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.087821] ioctx_alloc: after max nr_events=2147483647
> [ 94.087822] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.087822] ioctx_alloc: initial nr_events=96
> [ 94.087823] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.087824] ioctx_alloc: initial nr_events=96
> [ 94.087825] ioctx_alloc: initial nr_events=2147483647
> [ 94.087825] ioctx_alloc: num_possible_cpus=64
> [ 94.087826] ioctx_alloc: initial nr_events=96
> [ 94.087826] ioctx_alloc: num_possible_cpus=64
> [ 94.087827] ioctx_alloc: num_possible_cpus=64
> [ 94.087827] ioctx_alloc: after max nr_events=256
> [ 94.087828] ioctx_alloc: num_possible_cpus=64
> [ 94.087828] ioctx_alloc: after max nr_events=256
> [ 94.087829] ioctx_alloc: after max nr_events=2147483647
> [ 94.087829] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.087830] ioctx_alloc: after max nr_events=256
> [ 94.087831] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.087831] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.087832] ioctx_alloc: nr_events=512 aio_max_nr=65
> [ 94.087833] ioctx_alloc: initial nr_events=96
> [ 94.087833] ioctx_alloc: num_possible_cpus=64
> [ 94.087833] ioctx_alloc: after max nr_events=256
> [ 94.087834] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.090668] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.259433] ioctx_alloc: initial nr_events=2147483647
> [ 94.259435] ioctx_alloc: initial nr_events=2147483647
> [ 94.259436] ioctx_alloc: num_possible_cpus=64
> [ 94.259437] ioctx_alloc: after max nr_events=2147483647
> [ 94.259437] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.259438] ioctx_alloc: initial nr_events=96
> [ 94.259438] ioctx_alloc: num_possible_cpus=64
> [ 94.259438] ioctx_alloc: after max nr_events=256
> [ 94.259439] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.259446] ioctx_alloc: initial nr_events=2147483647
> [ 94.259448] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.259449] ioctx_alloc: initial nr_events=2147483647
> [ 94.259450] ioctx_alloc: initial nr_events=2147483647
> [ 94.259450] ioctx_alloc: num_possible_cpus=64
> [ 94.259451] ioctx_alloc: num_possible_cpus=64
> [ 94.259452] ioctx_alloc: num_possible_cpus=64
> [ 94.259452] ioctx_alloc: after max nr_events=2147483647
> [ 94.259453] ioctx_alloc: after max nr_events=2147483647
> [ 94.259453] ioctx_alloc: after max nr_events=2147483647
> [ 94.259454] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.259456] ioctx_alloc: initial nr_events=96
> [ 94.259456] ioctx_alloc: initial nr_events=96
> [ 94.259457] ioctx_alloc: initial nr_events=96
> [ 94.259457] ioctx_alloc: num_possible_cpus=64
> [ 94.259458] ioctx_alloc: num_possible_cpus=64
> [ 94.259458] ioctx_alloc: num_possible_cpus=64
> [ 94.259459] ioctx_alloc: after max nr_events=256
> [ 94.259459] ioctx_alloc: after max nr_events=256
> [ 94.259460] ioctx_alloc: after max nr_events=256
> [ 94.259460] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 259461] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.259462] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.260539] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.260544] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.262535] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.262550] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.423889] ioctx_alloc: num_possible_cpus=64
> [ 94.425386] ioctx_alloc: after max nr_events=2147483647
> [ 94.427327] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.429359] ioctx_alloc: initial nr_events=96
> [ 94.429448] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.429451] ioctx_alloc: initial nr_events=2147483647
> [ 94.429452] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.429453] ioctx_alloc: num_possible_cpus=64
> [ 94.429454] ioctx_alloc: initial nr_events=2147483647
> [ 94.429454] ioctx_alloc: after max nr_events=2147483647
> [ 94.429455] ioctx_alloc: num_possible_cpus=64
> [ 94.429456] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.429456] ioctx_alloc: after max nr_events=2147483647
> [ 94.429457] ioctx_alloc: initial nr_events=96
> [ 94.429458] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.429458] ioctx_alloc: num_possible_cpus=64
> [ 94.429459] ioctx_alloc: initial nr_events=96
> [ 94.429459] ioctx_alloc: after max nr_events=256
> [ 94.429460] ioctx_alloc: num_possible_cpus=64
> [ 94.429461] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.429461] ioctx_alloc: after max nr_events=256
> [ 94.429462] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.429463] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.430422] hrtimer: interrupt took 6115 ns
> [ 94.431463] ioctx_alloc: initial nr_events=2147483647
> [ 94.431464] ioctx_alloc: num_possible_cpus=64
> [ 94.431464] ioctx_alloc: after max nr_events=2147483647
> [ 94.431465] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.431465] ioctx_alloc: initial nr_events=96
> [ 94.431466] ioctx_alloc: num_possible_cpus=64
> [ 931466] ioctx_alloc: after max nr_events=256
> [ 94.431466] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.432641] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.580307] ioctx_alloc: num_possible_cpus=64
> [ 94.581844] ioctx_alloc: after max nr_events=256
> [ 94.583405] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.585313] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.585319] ioctx_alloc: initial nr_events=2147483647
> [ 94.585320] ioctx_alloc: num_possible_cpus=64
> [ 94.585320] ioctx_alloc: after max nr_events=2147483647
> [ 94.585321] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.585322] ioctx_alloc: initial nr_events=2147483647
> [ 94.585322] ioctx_alloc: initial nr_events=96
> [ 94.585323] ioctx_alloc: num_possible_cpus=64
> [ 94.585324] ioctx_alloc: num_possible_cpus=64
> [ 94.585324] ioctx_alloc: after max nr_events=2147483647
> [ 94.585325] ioctx_alloc: after max nr_events=256
> [ 94.585325] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.585326] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.585327] ioctx_alloc: initial nr_events=2147483647
> [ 94.585328] ioctx_alloc: initial nr_events=96
> [ 94.585328] ioctx_alloc: num_possible_cpus=64
> [ 94.585329] ioctx_alloc: num_possible_cpus=64
> [ 94.585329] ioctx_alloc: after max nr_events=2147483647
> [ 94.585330] ioctx_alloc: after max nr_events=256
> [ 94.585331] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.585331] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.585332] ioctx_alloc: initial nr_events=96
> [ 94.585332] ioctx_alloc: num_possible_cpus=64
> [ 94.585333] ioctx_alloc: after max nr_events=256
> [ 94.585333] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.585372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.585402] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.588377] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.632221] ioctx_alloc: initial nr_events=2147483647
> [ 94.632228] ioctx_alloc: initial nr_events=2147483647
> [ 94.632229] iocalloc: num_possible_cpus=64
> [ 94.632229] ioctx_alloc: after max nr_events=2147483647
> [ 94.632230] ioctx_alloc: initial nr_events=2147483647
> [ 94.632231] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.632232] ioctx_alloc: num_possible_cpus=64
> [ 94.632232] ioctx_alloc: initial nr_events=96
> [ 94.632233] ioctx_alloc: after max nr_events=2147483647
> [ 94.632233] ioctx_alloc: num_possible_cpus=64
> [ 94.632234] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.632234] ioctx_alloc: after max nr_events=256
> [ 94.632235] ioctx_alloc: initial nr_events=96
> [ 94.632236] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.632236] ioctx_alloc: num_possible_cpus=64
> [ 94.632237] ioctx_alloc: after max nr_events=256
> [ 94.632237] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.632241] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.633350] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.764384] ioctx_alloc: num_possible_cpus=64
> [ 94.766038] ioctx_alloc: after max nr_events=2147483647
> [ 94.767807] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.769568] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.770328] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.773546] ioctx_alloc: initial nr_events=2147483647
> [ 94.773550] ioctx_alloc: initial nr_events=2147483647
> [ 94.773551] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.773552] ioctx_alloc: num_possible_cpus=64
> [ 94.773552] ioctx_alloc: after max nr_events=2147483647
> [ 94.773553] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.773553] ioctx_alloc: initial nr_events=96
> [ 94.773554] ioctx_alloc: num_possible_cpus=64
> [ 94.773554] ioctx_alloc: after max nr_events=256
> [ 94.773555] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.773569] ioctx_alloc: initial nr_events=2147483647
> [ 94.773569] ioctx_alloc: num_possible_cpus=64
> [ 94.773570] ioctx_alloc: after max nr_events=2147483647
> [ 94.773570] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.773571] ioctx_alloc:itial nr_events=96
> [ 94.773571] ioctx_alloc: num_possible_cpus=64
> [ 94.773572] ioctx_alloc: after max nr_events=256
> [ 94.773572] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.903978] ioctx_alloc: num_possible_cpus=64
> [ 94.905427] ioctx_alloc: after max nr_events=2147483647
> [ 94.907320] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.909300] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.909305] ioctx_alloc: initial nr_events=2147483647
> [ 94.909306] ioctx_alloc: num_possible_cpus=64
> [ 94.909306] ioctx_alloc: after max nr_events=2147483647
> [ 94.909307] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.909307] ioctx_alloc: initial nr_events=96
> [ 94.909308] ioctx_alloc: num_possible_cpus=64
> [ 94.909308] ioctx_alloc: after max nr_events=256
> [ 94.909309] ioctx_alloc: initial nr_events=2147483647
> [ 94.909310] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.909310] ioctx_alloc: num_possible_cpus=64
> [ 94.909311] ioctx_alloc: after max nr_events=2147483647
> [ 94.909311] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.909312] ioctx_alloc: initial nr_events=96
> [ 94.909312] ioctx_alloc: num_possible_cpus=64
> [ 94.909313] ioctx_alloc: after max nr_events=256
> [ 94.909313] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.912223] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.940281] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 94.940283] ioctx_alloc: initial nr_events=2147483647
> [ 94.940284] ioctx_alloc: num_possible_cpus=64
> [ 94.940285] ioctx_alloc: after max nr_events=2147483647
> [ 94.940286] ioctx_alloc: initial nr_events=2147483647
> [ 94.940286] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.940287] ioctx_alloc: num_possible_cpus=64
> [ 94.940288] ioctx_alloc: initial nr_events=96
> [ 94.940288] ioctx_alloc: after max nr_events=2147483647
> [ 94.940289] ioctx_alloc: num_possible_cpus=64
> [ 94.940290] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 94.940290] ioctx_alloc: after max nr_events=256
> [ 94.940291] ioctx_alloc: initial nr_events=96
> [ 94.940291] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.940292] ioctx_alloc: num_possible_cpus=64
> [ 94.940292] ioctx_alloc: after max nr_events=256
> [ 94.940293] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 94.942198] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.069096] ioctx_alloc: initial nr_events=96
> [ 95.069097] ioctx_alloc: num_possible_cpus=64
> [ 95.069097] ioctx_alloc: after max nr_events=256
> [ 95.069098] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.087101] ioctx_alloc: initial nr_events=2147483647
> [ 95.087108] ioctx_alloc: initial nr_events=2147483647
> [ 95.087108] ioctx_alloc: num_possible_cpus=64
> [ 95.087109] ioctx_alloc: after max nr_events=2147483647
> [ 95.087109] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.087110] ioctx_alloc: initial nr_events=96
> [ 95.087110] ioctx_alloc: num_possible_cpus=64
> [ 95.087111] ioctx_alloc: after max nr_events=256
> [ 95.087112] ioctx_alloc: initial nr_events=2147483647
> [ 95.087113] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.087113] ioctx_alloc: num_possible_cpus=64
> [ 95.087114] ioctx_alloc: after max nr_events=2147483647
> [ 95.087114] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.087115] ioctx_alloc: initial nr_events=96
> [ 95.087115] ioctx_alloc: num_possible_cpus=64
> [ 95.087116] ioctx_alloc: after max nr_events=256
> [ 95.087117] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.087117] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.087120] ioctx_alloc: initial nr_events=2147483647
> [ 95.087120] ioctx_alloc: num_possible_cpus=64
> [ 95.087121] ioctx_alloc: after max nr_events=2147483647
> [ 95.087121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.087122] ioctx_alloc: initial nr_events=96
> [ 95.087122] ioctx_alloc: num_possible_cpus=64
> [ 95.087123] ioctx_alloc: after max nr_events=256
> [ 95.087123] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.087126] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.091100] ioctx_alloc: initial nr_events=2147483647
> [ 95.091100] ioctx_alloc: num_possible_cpus=64
> [ 95.091100] ioctx_alloc: after max nr_events=2147483647
> [ 95.091101] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.091101] ioctx_alloc: initial nr_events=96
> [ 95.091102] ioctx_alloc: num_possible_cpus=64
> [ 95.091102] ioctx_alloc: after max nr_events=256
> [ 95.091103] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.145236] ioctx_alloc: num_possible_cpus=64
> [ 95.146754] ioctx_alloc: after max nr_events=2147483647
> [ 95.248567] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.250432] ioctx_alloc: initial nr_events=2147483647
> [ 95.250438] ioctx_alloc: initial nr_events=2147483647
> [ 95.250439] ioctx_alloc: num_possible_cpus=64
> [ 95.250439] ioctx_alloc: after max nr_events=2147483647
> [ 95.250440] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.250440] ioctx_alloc: initial nr_events=96
> [ 95.250441] ioctx_alloc: num_possible_cpus=64
> [ 95.250441] ioctx_alloc: after max nr_events=256
> [ 95.250442] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.250450] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.250457] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.251027] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.251038] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.252029] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.275430] ioctx_alloc: num_possible_cpus=64
> [ 95.277000] ioctx_alloc: after max nr_events=2147483647
> [ 95.278747] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.280540] ioctx_alloc: initial nr_events=2147483647
> [ 95.280554] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.284457] ioctx_alloc: num_possible_cpus=64
> [ 95.285998] ioctx_alloc: after max nr_events=2147483647
> [ 95.287764] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [ 95.289455] ioctx_alloc: initial nr_events=96
> [ 95.290901] ioctx_alloc: num_possible_cpus=64
> [ 95.292450] ioctx_alloc: after max nr_events=256
> [ 95.294013] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.295873] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.381941] ioctx_alloc: initial nr_events=96
> [ 95.383764] ioctx_alloc: num_possible_cpus=64
> [ 95.385303] ioctx_alloc: after max nr_events=256
> [ 95.386959] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.391935] ioctx_alloc: initial nr_events=96
> [ 95.393493] ioctx_alloc: num_possible_cpus=64
> [ 95.394994] ioctx_alloc: after max nr_events=256
> [ 95.396751] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.421964] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.425953] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [ 95.611825] ioctx_alloc: initial nr_events=96
> [ 95.613398] ioctx_alloc: num_possible_cpus=64
> [ 95.614893] ioctx_alloc: after max nr_events=256
> [ 95.616615] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [ 95.645844] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
>

--
"Thought is the essence of where you are now."

2014-07-10 15:11:59

by Jeff Moyer

[permalink] [raw]
Subject: Re: scsi-mq V2

Benjamin LaHaise <[email protected]> writes:

>>
>> [ 186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [ 186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [ 186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [ 186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [ 186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>
> Something is horribly wrong here. There is no way that value for nr_events
> should be passed in to ioctx_alloc(). This implies that userland is calling
> io_setup() with an impossibly large value for nr_events. Can you post the
> actual diff for your fs/aio.c relative to linus' tree?
>

fio does exactly this! It passes INT_MAX.

Cheers,
Jeff
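
(Aside: the nr_events values in the logs earlier in the thread fall out
of the clamping arithmetic that ioctx_alloc() performed in kernels of
that era. A minimal standalone sketch of that arithmetic; the
max()-then-double logic is assumed from fs/aio.c circa 3.16, and
clamp_nr_events() is a hypothetical name:

#include <stdio.h>
#include <limits.h>

static unsigned clamp_nr_events(unsigned nr_events, unsigned ncpus)
{
        unsigned floor = ncpus * 4;     /* percpu batching floor */

        if (nr_events < floor)          /* max(nr_events, ncpus * 4) */
                nr_events = floor;
        return nr_events * 2;           /* nr_events *= 2 */
}

int main(void)
{
        /* io_setup(96) on a 64-CPU box: max(96, 256) = 256, doubled = 512 */
        printf("%d\n", clamp_nr_events(96, 64));
        /* io_setup(INT_MAX): unchanged by max(); doubling wraps to
         * 4294967294, which a %d format prints as -2, matching the log */
        printf("%d\n", clamp_nr_events(INT_MAX, 64));
        return 0;
}

With aio_max_nr=65536, the wrapped value then presumably fails the
aio_max_nr bounds check, so the io_setup(INT_MAX) calls never succeed.)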

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Thursday, 10 July, 2014 1:21 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jens Axboe; [email protected]; Christoph Hellwig; James Bottomley;
> Bart Van Assche; Benjamin LaHaise; [email protected]; linux-
> [email protected]
> Subject: Re: scsi-mq V2
>
> On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage)
> wrote:
> > the problem still occurs - fio results in low or 0 IOPS, with perf top
> > reporting unusual amounts of time spent in do_io_submit and io_submit.
>
> The diff between the two versions doesn't show many other possibly
> interesting commits, the most interesting being some minor block
> updates.
>
> I guess we'll have to do a manual bisect. I've pushed out a

> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> the block tree

good.

> and a scsi-mq.3-bisect-2 branch that is just after the merge of the
> block tree to get started.

good.

2014-07-10 16:04:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 03:51:44PM +0000, Elliott, Robert (Server Storage) wrote:
> > scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> > the block tree
>
> good.
>
> > and a scsi-mq.3-bisect-2 branch that is just after the merge of the
> > block tree to get started.
>
> good.

It's starting to look weird. I'll prepare another two bisect branches
around some MM changes, which seems the only other possible candidate.

2014-07-10 16:14:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> It's starting to look weird. I'll prepare another two bisect branches
> around some MM changes, which seems the only other possible candidate.

I've pushed out scsi-mq.3-bisect-3 and scsi-mq.3-bisect-4 for you.

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Thursday, 10 July, 2014 11:15 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jens Axboe; [email protected]; James Bottomley; Bart Van Assche;
> Benjamin LaHaise; [email protected]; [email protected]
> Subject: Re: scsi-mq V2
>
> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> > It's starting to look weird. I'll prepare another two bisect branches
> > around some MM changes, which seems the only other possible candidate.
>
> I've pushed out scsi-mq.3-bisect-3

Good.

> and scsi-mq.3-bisect-4 for you.

Bad.

Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index df95a2f..11b65d4 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
         uint64_t flags = GET_LE(&in->sh_flags);
 
         bool copy = flags & SHF_ALLOC &&
+                (GET_LE(&in->sh_size) ||
+                 (GET_LE(&in->sh_type) != SHT_RELA &&
+                  GET_LE(&in->sh_type) != SHT_REL)) &&
                 strcmp(name, ".altinstructions") &&
                 strcmp(name, ".altinstr_replacement");

Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

perf top shows lookup_ioctx function alongside io_submit and
do_io_submit this time:
14.96% [kernel] [k] lookup_ioctx
14.71% libaio.so.1.0.1 [.] io_submit
13.78% [kernel] [k] system_call
10.79% [kernel] [k] system_call_after_swapgs
10.17% [kernel] [k] do_io_submit
8.91% [kernel] [k] copy_user_generic_string
4.24% [kernel] [k] io_submit_one
3.93% [kernel] [k] blk_flush_plug_list
3.32% fio [.] fio_libaio_commit
2.84% [kernel] [k] sysret_check
2.06% [kernel] [k] blk_finish_plug
1.89% [kernel] [k] SyS_io_submit
1.48% [kernel] [k] blk_start_plug
1.04% fio [.] io_submit@plt
0.84% [kernel] [k] __get_user_4
0.74% [kernel] [k] system_call_fastpath
0.60% [kernel] [k] _copy_from_user
0.51% diff [.] 0x0000000000007abb

ftrace on CPU 0 shows similar repetition to before:
fio-4107 [000] .... 389.992300: lookup_ioctx <-do_io_submit
fio-4107 [000] .... 389.992300: blk_start_plug <-do_io_submit
fio-4107 [000] .... 389.992300: io_submit_one <-do_io_submit
fio-4107 [000] .... 389.992300: blk_finish_plug <-do_io_submit
fio-4107 [000] .... 389.992300: blk_flush_plug_list <-blk_finish_plug
fio-4107 [000] .... 389.992301: SyS_io_submit <-system_call_fastpath
fio-4107 [000] .... 389.992301: do_io_submit <-SyS_io_submit
fio-4107 [000] .... 389.992301: lookup_ioctx <-do_io_submit
fio-4107 [000] .... 389.992301: blk_start_plug <-do_io_submit
fio-4107 [000] .... 389.992301: io_submit_one <-do_io_submit
fio-4107 [000] .... 389.992301: blk_finish_plug <-do_io_submit
fio-4107 [000] .... 389.992301: blk_flush_plug_list <-blk_finish_plug
fio-4107 [000] .... 389.992301: SyS_io_submit <-system_call_fastpath
fio-4107 [000] .... 389.992302: do_io_submit <-SyS_io_submit
fio-4107 [000] .... 389.992302: lookup_ioctx <-do_io_submit
fio-4107 [000] .... 389.992302: blk_start_plug <-do_io_submit
fio-4107 [000] .... 389.992302: io_submit_one <-do_io_submit
fio-4107 [000] .... 389.992302: blk_finish_plug <-do_io_submit
fio-4107 [000] .... 389.992302: blk_flush_plug_list <-blk_finish_plug
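
(The trace is consistent with a submitter retrying a failing io_submit()
in a tight loop. A hedged sketch, not fio's actual code: if, as the later
diagnosis in this thread suggests, reqs_available had leaked away, the
kernel would keep returning -EAGAIN, and a retry loop like this would
spin exactly as the ftrace output shows:

#include <libaio.h>
#include <errno.h>

/* Submit one iocb, retrying while the kernel reports exhaustion. */
static int submit_retry(io_context_t ctx, struct iocb *iocb)
{
        struct iocb *list[1] = { iocb };
        int ret;

        do {
                /* libaio's io_submit() returns a negative errno;
                 * -EAGAIN means no request slots are available */
                ret = io_submit(ctx, 1, list);
        } while (ret == -EAGAIN);

        return ret;
}
)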


2014-07-10 19:15:46

by Jeff Moyer

[permalink] [raw]
Subject: Re: scsi-mq V2

"Elliott, Robert (Server Storage)" <[email protected]> writes:

>> -----Original Message-----
>> From: Christoph Hellwig [mailto:[email protected]]
>> Sent: Thursday, 10 July, 2014 11:15 AM
>> To: Elliott, Robert (Server Storage)
>> Cc: Jens Axboe; [email protected]; James Bottomley; Bart Van Assche;
>> Benjamin LaHaise; [email protected]; [email protected]
>> Subject: Re: scsi-mq V2
>>
>> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
>> > It's starting to look weird. I'll prepare another two bisect branches
>> > around some MM changes, which seems the only other possible candidate.
>>
>> I've pushed out scsi-mq.3-bisect-3
>
> Good.
>
>> and scsi-mq.3-bisect-4 for you.
>
> Bad.
>
> Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
> diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
> index df95a2f..11b65d4 100644
> --- a/arch/x86/vdso/vdso2c.h
> +++ b/arch/x86/vdso/vdso2c.h
> @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
>          uint64_t flags = GET_LE(&in->sh_flags);
>
>          bool copy = flags & SHF_ALLOC &&
> +                (GET_LE(&in->sh_size) ||
> +                 (GET_LE(&in->sh_type) != SHT_RELA &&
> +                  GET_LE(&in->sh_type) != SHT_REL)) &&
>                  strcmp(name, ".altinstructions") &&
>                  strcmp(name, ".altinstr_replacement");
>
> Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
> an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

Hi, Rob,

Can you get sysrq-t output for me? I don't know how/why we'd continue
to get io_submits for an exiting process.

Thanks,
Jeff

2014-07-10 19:37:37

by Jeff Moyer

[permalink] [raw]
Subject: Re: scsi-mq V2

Jeff Moyer <[email protected]> writes:

> Hi, Rob,
>
> Can you get sysrq-t output for me? I don't know how/why we'd continue
> to get io_submits for an exiting process.

Also, do you know what sys_io_submit is returning?

2014-07-10 19:59:44

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-10 17:11, Jeff Moyer wrote:
> Benjamin LaHaise <[email protected]> writes:
>
>>>
>>> [ 186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [ 186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [ 186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [ 186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [ 186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>
>> Something is horribly wrong here. There is no way that value for nr_events
>> should be passed in to ioctx_alloc(). This implies that userland is calling
>> io_setup() with an impossibly large value for nr_events. Can you post the
>> actual diff for your fs/aio.c relative to linus' tree?
>>
>
> fio does exactly this! It passes INT_MAX.

That's correct, I had actually forgotten about this. It was a change
made a few years back, in correlation with the aio optimizations posted
then, basically telling aio to ignore that silly (and broken) user ring.

--
Jens Axboe

2014-07-10 20:05:48

by Jeff Moyer

[permalink] [raw]
Subject: Re: scsi-mq V2

Jens Axboe <[email protected]> writes:

> On 2014-07-10 17:11, Jeff Moyer wrote:
>> Benjamin LaHaise <[email protected]> writes:
>>
>>>>
>>>> [ 186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [ 186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [ 186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [ 186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [ 186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>
>>> Something is horribly wrong here. There is no way that value for nr_events
>>> should be passed in to ioctx_alloc(). This implies that userland is calling
>>> io_setup() with an impossibly large value for nr_events. Can you post the
>>> actual diff for your fs/aio.c relative to linus' tree?
>>>
>>
>> fio does exactly this! It passes INT_MAX.
>
> That's correct, I had actually forgotten about this. It was a change
> made a few years back, in correlation with the aio optimizations
> posted then, basically telling aio to ignore that silly (and broken)
> user ring.

I still don't see how you accomplish that. Making it bigger doesn't get
rid of it. ;-)

Cheers,
Jeff

2014-07-10 20:06:59

by Jens Axboe

[permalink] [raw]
Subject: Re: scsi-mq V2

On 2014-07-10 22:05, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
>> On 2014-07-10 17:11, Jeff Moyer wrote:
>>> Benjamin LaHaise <[email protected]> writes:
>>>
>>>>>
>>>>> [ 186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [ 186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [ 186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [ 186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [ 186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>
>>>> Something is horribly wrong here. There is no way that value for nr_events
>>>> should be passed in to ioctx_alloc(). This implies that userland is calling
>>>> io_setup() with an impossibly large value for nr_events. Can you post the
>>>> actual diff for your fs/aio.c relative to linus' tree?
>>>>
>>>
>>> fio does exactly this! It passes INT_MAX.
>>
>> That's correct, I had actually forgotten about this. It was a change
>> made a few years back, in correlation with the aio optimizations
>> posted then, basically telling aio to ignore that silly (and broken)
>> user ring.
>
> I still don't see how you accomplish that. Making it bigger doesn't get
> rid of it. ;-)

See the patches from back then - INT_MAX basically just meant the same
as 0, but 0 could not be used because of the (silly) setup with the
wrappers around the syscalls. So INT_MAX was overloaded to mean "no ring
events, I don't care".

--
Jens Axboe
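
(A minimal userspace illustration of the convention Jens describes,
using the libaio wrapper; this is a sketch of the call shape, not fio's
actual code:

#include <libaio.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
        io_context_t ctx = 0;
        int ret;

        /* INT_MAX as the overloaded "no ring events, I don't care"
         * hint; 0 would be rejected by the syscall wrappers. */
        ret = io_setup(INT_MAX, &ctx);
        if (ret < 0) {
                fprintf(stderr, "io_setup failed: %d\n", ret);
                return 1;
        }
        io_destroy(ctx);
        return 0;
}

Build with -laio.)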

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Jeff Moyer [mailto:[email protected]]
> Sent: Thursday, 10 July, 2014 2:14 PM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; Jens Axboe; [email protected]; James Bottomley;
> Bart Van Assche; Benjamin LaHaise; [email protected]; linux-
> [email protected]
> Subject: Re: scsi-mq V2
>
> "Elliott, Robert (Server Storage)" <[email protected]> writes:
>
> >> -----Original Message-----
> >> From: Christoph Hellwig [mailto:[email protected]]
> >> Sent: Thursday, 10 July, 2014 11:15 AM
> >> To: Elliott, Robert (Server Storage)
> >> Cc: Jens Axboe; [email protected]; James Bottomley; Bart Van Assche;
> >> Benjamin LaHaise; [email protected]; [email protected]
> >> Subject: Re: scsi-mq V2
> >>
> >> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> >> > It's starting to look weird. I'll prepare another two bisect branches
> >> > around some MM changes, which seems the only other possible candidate.
> >>
> >> I've pushed out scsi-mq.3-bisect-3
> >
> > Good.
> >
> >> and scsi-mq.3-bisect-4 for you.
> >
> > Bad.
> >
> > Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
> > diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
> > index df95a2f..11b65d4 100644
> > --- a/arch/x86/vdso/vdso2c.h
> > +++ b/arch/x86/vdso/vdso2c.h
> > @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
> >          uint64_t flags = GET_LE(&in->sh_flags);
> >
> >          bool copy = flags & SHF_ALLOC &&
> > +                (GET_LE(&in->sh_size) ||
> > +                 (GET_LE(&in->sh_type) != SHT_RELA &&
> > +                  GET_LE(&in->sh_type) != SHT_REL)) &&
> >                  strcmp(name, ".altinstructions") &&
> >                  strcmp(name, ".altinstr_replacement");
> >
> > Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
> > an fio hang, with one CPU (CPU 0) stuck in io_submit loops.
>

I added some prints in aio_setup_ring and ioctx_alloc and
rebooted. This time it took much longer to hit the problem. It
survived dozens of ^Cs. Running a few minutes, though, IOPS
eventually dropped. So, sometimes it happens immediately,
sometimes it takes time to develop.

I will rerun bisect-1 -2 and -3 for longer times to increase
confidence that they didn't just appear good.

On this bisect-4 run, as IOPS started to drop from 900K to 40K,
I ran perf top when it was at 700K. You can see io_submit times
creeping up.

4.30% [kernel] [k] do_io_submit
4.29% [kernel] [k] _raw_spin_lock_irqsave
3.88% libaio.so.1.0.1 [.] io_submit
3.55% [kernel] [k] system_call
3.34% [kernel] [k] put_compound_page
3.11% [kernel] [k] io_submit_one
3.06% [kernel] [k] system_call_after_swapgs
2.89% [kernel] [k] copy_user_generic_string
2.45% [kernel] [k] lookup_ioctx
2.16% [kernel] [k] apic_timer_interrupt
2.00% [kernel] [k] _raw_spin_lock
1.97% [scsi_debug] [k] sdebug_q_cmd_hrt_complete
1.84% [kernel] [k] __get_page_tail
1.74% [kernel] [k] do_blockdev_direct_IO
1.68% [kernel] [k] blk_flush_plug_list
1.41% [kernel] [k] _raw_spin_unlock_irqrestore
1.24% [scsi_debug] [k] schedule_resp

finally settling like before:
14.15% [kernel] [k] do_io_submit
13.61% libaio.so.1.0.1 [.] io_submit
11.81% [kernel] [k] system_call
10.11% [kernel] [k] system_call_after_swapgs
8.59% [kernel] [k] io_submit_one
8.56% [kernel] [k] copy_user_generic_string
7.96% [kernel] [k] lookup_ioctx
5.33% [kernel] [k] blk_flush_plug_list
3.11% [kernel] [k] blk_finish_plug
2.84% [kernel] [k] sysret_check
2.63% fio [.] fio_libaio_commit
2.27% [kernel] [k] blk_start_plug
1.17% [kernel] [k] SyS_io_submit

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: [email protected] [mailto:linux-scsi-
> [email protected]] On Behalf Of Elliott, Robert (Server Storage)
>
> I added some prints in aio_setup_ring and ioctx_alloc and
> rebooted. This time it took much longer to hit the problem. It
> survived dozens of ^Cs. Running a few minutes, though, IOPS
> eventually dropped. So, sometimes it happens immediately,
> sometimes it takes time to develop.
>
> I will rerun bisect-1 -2 and -3 for longer times to increase
> confidence that they didn't just appear good.

Allowing longer run times before declaring success, the problem
does appear in all of the bisect trees. I just let fio
continue to run for many minutes - no ^Cs necessary.

no-rebase: good for > 45 minutes (I will leave that running for
8 more hours)
bisect-1: bad
bisect-2: bad
bisect-3: bad
bisect-4: bad

2014-07-11 06:14:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: scsi-mq V2

On Fri, Jul 11, 2014 at 06:02:03AM +0000, Elliott, Robert (Server Storage) wrote:
> Allowing longer run times before declaring success, the problem
> does appear in all of the bisect trees. I just let fio
> continue to run for many minutes - no ^Cs necessary.
>
> no-rebase: good for > 45 minutes (I will leave that running for
> 8 more hours)

Ok, thanks. If it's still running tomorrow morning let's look into the
aio reverts again.

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Friday, 11 July, 2014 1:15 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jeff Moyer; Christoph Hellwig; Jens Axboe; [email protected]; James
> Bottomley; Bart Van Assche; Benjamin LaHaise; [email protected];
> [email protected]
> Subject: Re: scsi-mq V2
>
> On Fri, Jul 11, 2014 at 06:02:03AM +0000, Elliott, Robert (Server Storage)
> wrote:
> > Allowing longer run times before declaring success, the problem
> > does appear in all of the bisect trees. I just let fio
> > continue to run for many minutes - no ^Cs necessary.
> >
> > no-rebase: good for > 45 minutes (I will leave that running for
> > 8 more hours)
>
> Ok, thanks. If it's still running tomorrow morning let's look into the
> aio reverts again.

That ran 9 total hours with no problem.

Rather than revert in the bisect trees, I added just this single additional
patch to the no-rebase tree, and the problem appeared:


48a2e94154177286b3bcbed25ea802232527fa7c
aio: fix aio request leak when events are reaped by userspace

diff --git a/fs/aio.c b/fs/aio.c
index 4f078c0..e59bba8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1021,6 +1021,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 
         /* everything turned out well, dispose of the aiocb. */
         kiocb_free(iocb);
+        put_reqs_available(ctx, 1);        /* added by patch f8567 */
 
         /*
          * We have to order our ring_info tail store above and test
@@ -1101,7 +1102,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
 
         pr_debug("%li h%u t%u\n", ret, head, tail);
 
-        put_reqs_available(ctx, ret);
+        /* put_reqs_available(ctx, ret); removed by patch f8567 */
 out:
         mutex_unlock(&ctx->ring_lock);

---
Rob Elliott HP Server Storage


2014-07-11 14:55:26

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: scsi-mq V2

On Fri, Jul 11, 2014 at 02:33:12PM +0000, Elliott, Robert (Server Storage) wrote:
> That ran 9 total hours with no problem.
>
> Rather than revert in the bisect trees, I added just this single additional
> patch to the no-rebase tree, and the problem appeared:

Can you try the below totally untested patch instead? It looks like
put_reqs_available() is not irq-safe.

-ben
--
"Thought is the essence of where you are now."


diff --git a/fs/aio.c b/fs/aio.c
index 955947e..4b97180 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
         struct kioctx_cpu *kcpu;
+        unsigned long flags;
 
         preempt_disable();
         kcpu = this_cpu_ptr(ctx->cpu);
 
+        local_irq_save(flags);
         kcpu->reqs_available += nr;
+
         while (kcpu->reqs_available >= ctx->req_batch * 2) {
                 kcpu->reqs_available -= ctx->req_batch;
                 atomic_add(ctx->req_batch, &ctx->reqs_available);
         }
 
+        local_irq_restore(flags);
         preempt_enable();
 }

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: Benjamin LaHaise [mailto:[email protected]]
> Sent: Friday, 11 July, 2014 9:55 AM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; [email protected]; James
> Bottomley; Bart Van Assche; [email protected]; linux-
> [email protected]
> Subject: Re: scsi-mq V2
...
> Can you try the below totally untested patch instead? It looks like
> put_reqs_available() is not irq-safe.
>

With that addition alone, fio still runs into the same problem.

I added the same fix to get_reqs_available, which also accesses
kcpu->reqs_available, and the test has run for 35 minutes with
no problem.

Patch applied:

diff --git a/fs/aio.c b/fs/aio.c
index e59bba8..8e85e26 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
         struct kioctx_cpu *kcpu;
+        unsigned long flags;
 
         preempt_disable();
         kcpu = this_cpu_ptr(ctx->cpu);
 
+        local_irq_save(flags);
         kcpu->reqs_available += nr;
+
         while (kcpu->reqs_available >= ctx->req_batch * 2) {
                 kcpu->reqs_available -= ctx->req_batch;
                 atomic_add(ctx->req_batch, &ctx->reqs_available);
         }
 
+        local_irq_restore(flags);
         preempt_enable();
 }
 
@@ -847,10 +851,12 @@ static bool get_reqs_available(struct kioctx *ctx)
 {
         struct kioctx_cpu *kcpu;
         bool ret = false;
+        unsigned long flags;
 
         preempt_disable();
         kcpu = this_cpu_ptr(ctx->cpu);
 
+        local_irq_save(flags);
         if (!kcpu->reqs_available) {
                 int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -869,6 +875,7 @@ static bool get_reqs_available(struct kioctx *ctx)
         ret = true;
         kcpu->reqs_available--;
 out:
+        local_irq_restore(flags);
         preempt_enable();
         return ret;
 }

--
I will see if that solves the problem with the scsi-mq-3 tree, or
at least some of the bisect trees leading up to it.

A few other comments:

1. Those changes boost _raw_spin_lock_irqsave into first place
in perf top:

6.59% [kernel] [k] _raw_spin_lock_irqsave
4.37% [kernel] [k] put_compound_page
2.87% [scsi_debug] [k] sdebug_q_cmd_hrt_complete
2.74% [kernel] [k] _raw_spin_lock
2.73% [kernel] [k] apic_timer_interrupt
2.41% [kernel] [k] do_blockdev_direct_IO
2.24% [kernel] [k] __get_page_tail
1.97% [kernel] [k] _raw_spin_unlock_irqrestore
1.87% [kernel] [k] scsi_queue_rq
1.76% [scsi_debug] [k] schedule_resp

Maybe (later) kcpu->reqs_available should be converted to an atomic,
like ctx->reqs_available, to reduce that overhead? (A sketch of that
direction follows comment 3 below.)

2. After the f8567a3 patch, aio_complete has one early return that
bypasses the call to put_reqs_available. Is that OK, or does
that mean that sync iocbs will now eat up reqs_available?

        /*
         * Special case handling for sync iocbs:
         *  - events go directly into the iocb for fast handling
         *  - the sync task with the iocb in its stack holds the single iocb
         *    ref, no other paths have a way to get another ref
         *  - the sync task helpfully left a reference to itself in the iocb
         */
        if (is_sync_kiocb(iocb)) {
                iocb->ki_user_data = res;
                smp_wmb();
                iocb->ki_ctx = ERR_PTR(-EXDEV);
                wake_up_process(iocb->ki_obj.tsk);
                return;
        }


3. The f8567a3 patch renders this comment in aio.c out of date -
reqs_available is no longer incremented when an io_event is pulled off
the ringbuffer, but when aio_complete is called.

        struct {
                /*
                 * This counts the number of available slots in the ringbuffer,
                 * so we avoid overflowing it: it's decremented (if positive)
                 * when allocating a kiocb and incremented when the resulting
                 * io_event is pulled off the ringbuffer.
                 *
                 * We batch accesses to it with a percpu version.
                 */
                atomic_t        reqs_available;
        } ____cacheline_aligned_in_smp;
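
A hedged sketch of the atomic direction comment 1 alludes to; this is
hypothetical and untested, not the fix that was applied (that fix kept
the local_irq_save/restore pair). this_cpu_*() operations are atomic
with respect to interrupts on the local CPU, so they could replace the
explicit flag save around the read-modify-write:

static void put_reqs_available(struct kioctx *ctx, unsigned nr)
{
        unsigned avail;

        /* irq-safe percpu add, no local_irq_save() needed */
        this_cpu_add(ctx->cpu->reqs_available, nr);

        for (;;) {
                avail = this_cpu_read(ctx->cpu->reqs_available);
                if (avail < ctx->req_batch * 2)
                        break;
                /* hand a batch back to the shared pool, but only if no
                 * interrupt changed the percpu count under us */
                if (this_cpu_cmpxchg(ctx->cpu->reqs_available,
                                     avail, avail - ctx->req_batch) == avail)
                        atomic_add(ctx->req_batch, &ctx->reqs_available);
        }
}

get_reqs_available() would need the same treatment, since it performs
the same kind of unprotected read-modify-write on kcpu->reqs_available.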


---
Rob Elliott HP Server Storage


Subject: RE: scsi-mq V2

> I will see if that solves the problem with the scsi-mq-3 tree, or
> at least some of the bisect trees leading up to it.

scsi-mq-3 is still going after 45 minutes. I'll leave it running
overnight.

Subject: RE: scsi-mq V2



> -----Original Message-----
> From: [email protected] [mailto:linux-scsi-
> [email protected]] On Behalf Of Elliott, Robert (Server Storage)
> Sent: Saturday, July 12, 2014 6:20 PM
> To: Benjamin LaHaise
> Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; [email protected];
> James Bottomley; Bart Van Assche; [email protected]; linux-
> [email protected]
> Subject: RE: scsi-mq V2
>
> > I will see if that solves the problem with the scsi-mq-3 tree, or
> > at least some of the bisect trees leading up to it.
>
> scsi-mq-3 is still going after 45 minutes. I'll leave it running
> overnight.
>

That has been going strong for 18 hours, so I think that's the patch
we need.


2014-07-14 09:13:42

by Sagi Grimberg

[permalink] [raw]
Subject: Re: scsi-mq V2

On 7/8/2014 5:48 PM, Christoph Hellwig wrote:
<SNIP>
> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> from June 29th. Robert Elliott found a problem with not fully zeroed
> out UNMAP CDBs, which is fixed by the saner discard handling in that
> series.
>
> There is a new patch to factor the code from the above series for
> blk-mq use, which I've attached below. Besides that the only changes
> are minor merge fixups in the main blk-mq usage patch.

Hey Christoph & Co,

I'd like to share some benchmarks I took on this patch set using the iSER
initiator (+2 pre-submitted performance improvements) against a LIO iSER
target. I ran workloads I think are interesting use-cases (a single LUN
with 1, 2, or 4 IO threads, up to a fully occupied system doing IO to
multiple LUNs).
Overall (except for 2 strange anomalies) it seems that the scsi-mq patches
(use_blk_mq=N) roughly sustain traditional scsi performance.
On the other hand, the scsi-mq code path (use_blk_mq=Y) clearly shows
better performance (tables below).

At first I too hit the aio issues discussed in this thread and converted
to scsi-mq.3-no-rebase for testing (thanks Doug & Rob for raising it).
I must say that for some reason I get very low numbers for writes vs.
reads (write performance is stuck at ~20K IOPS per thread); this happens
on 3.16-rc2 even before the scsi-mq patches. Did anyone else hit this,
or is it just a weird problem in my setup?
Anyway, this is why my benchmarks show only the randread IO pattern
(which gives familiar numbers). I need to figure out what's wrong with
IO writes - I'll start bisecting on this.

I also reviewed the patch set and at this point, I don't have any
comments. So you can add to the series:
Reviewed-by: Sagi Grimberg '<[email protected]>' (or Tested-by -
whatever you choose).

I want to state that I tested a traditional iSER initiator - no scsi-mq
adoption at all.
I started looking into adopting scsi-mq for iSCSI/iSER recently and I
must say that the scsi-mq adoption is not so trivial, due to iSCSI
session-wide CmdSN/StatSN ordering constraints (we can't just use more
RDMA channels per connection...).
I'll be on vacation for the next couple of weeks, so I'll start a
separate thread to get the community's input on this matter.


Results: table entries are KIOPS(CPU%)

3.16-rc2 (scsi-mq patches reverted)
#LUNs \ Threads/LUN       1             2             4
 1                    231(6.5%)     355(18.5%)    337(31.1%)
 2                    446(13.6%)    673(37.2%)    654(49.8%)
 4                    594(25%)      960(49.41%)   1165(99.3%)
 8                    1018(50.3%)   1563(99.6%)   1696(99.9%)
16                    1660(86.5%)   1731(99.6%)   1710(100%)


3.16-rc2 (scsi-mq included, use_blk_mq=N)
#LUNs \ Threads/LUN       1             2             4
 1                    231(6.5%)     351(18.5%)    337(31.4%)
 2                    446(13.6%)    660(37.3%)    647(50%)
 4                    591(25%)      967(49.7%)    1136(98.1%)
 8                    1014(52.1%)   1296(100%)    1470(100%)
16                    1741(100%)    1761(100%)    1853(100%)


3.16-rc2 (scsi-mq included, use_blk_mq=Y)
#LUNs \ Threads/LUN       1             2             4
 1                    265(6.4%)     465(13.4%)    572(27.9%)
 2                    507(13.4%)    902(27.8%)    1034(45.9%)
 4                    697(25%)      1197(49.5%)   1477(98.6%)
 8                    1257(53.6%)   1856(98.7%)   1906(100%)
16                    1991(100%)    2021(100%)    2020(100%)

Notes:
- IOPS measurements are the average of 60-second runs.
- The CPU measurement is the total usage across all CPUs; to get the
  per-CPU utilization, the value should be normalized to 16 cores.
- scsi-mq (use_blk_mq=N) has roughly the same performance as the
  traditional scsi IO path, but I see an anomaly in the test cases
  {8 LUNs, 2/4 threads per LUN}. This may result from NUMA misalignment
  of threads/interrupts; it requires further investigation.
- The iSER initiator has no multi-queue awareness.

Testing environment:
- Initiator and target systems with 16 (8x2) cores (Hyper-Threading
  disabled).
- CPU model: Intel(R) Xeon(R) @ 2.60GHz
- Block layer settings:
  - scheduler=noop
  - rq_affinity=1
  - add_random=0
  - nomerges=1
- Single FDR link between the target and initiator.
- Device model: Mellanox Connect-IB (the numbers are similar with
  Mellanox ConnectX-3).
- MSI-X interrupt vectors were spread across the system cores.
- irqbalance was disabled.
- scsi_host settings:
  - cmd_per_lun=32 (default)
  - can_queue=113 (default)
- In the multi-LUN test cases, each LUN is exposed via a different
  scsi_host (iSCSI session).

Software:
- fio version: 2.0.13
- LIO iSER target (target-pending for-next)
- Null backing devices (NULLIO)
- Upstream based iSER initiator + internal pre-submitted
performance enhancements.

fio configuration:
rw=randread
bs=1k
iodepth=128
loops=1
ioengine=libaio
direct=1
invalidate=1
fsync_on_close=1
randrepeat=1
norandommap

Cheers,
Sagi.

2014-07-14 17:15:24

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: scsi-mq V2

Hi Robert,

On Sun, Jul 13, 2014 at 05:15:15PM +0000, Elliott, Robert (Server Storage) wrote:
> > > I will see if that solves the problem with the scsi-mq-3 tree, or
> > > at least some of the bisect trees leading up to it.
> >
> > scsi-mq-3 is still going after 45 minutes. I'll leave it running
> > overnight.
> >
>
> That has been going strong for 18 hours, so I think that's the patch
> we need.

Thanks for taking the time to narrow this down. I've applied the fix to
my aio-fixes tree at git://git.kvack.org/~bcrl/aio-fixes.git and forwarded
it on to Linus as well.

-ben
--
"Thought is the essence of where you are now."

2014-07-16 11:13:45

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

On 06/25/2014 11:52 AM, Christoph Hellwig wrote:
> +static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
> +{
> +        struct request_queue *q = req->q;
> +        struct scsi_device *sdev = q->queuedata;
> +        struct Scsi_Host *shost = sdev->host;
> +        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> +        int ret;
> +        int reason;
> +
> +        ret = prep_to_mq(scsi_prep_state_check(sdev, req));
> +        if (ret)
> +                goto out;
> +
> +        ret = BLK_MQ_RQ_QUEUE_BUSY;
> +        if (!get_device(&sdev->sdev_gendev))
> +                goto out;
> +
> +        if (!scsi_dev_queue_ready(q, sdev))
> +                goto out_put_device;
> +        if (!scsi_target_queue_ready(shost, sdev))
> +                goto out_dec_device_busy;
> +        if (!scsi_host_queue_ready(q, shost, sdev))
> +                goto out_dec_target_busy;
> +
> +        if (!(req->cmd_flags & REQ_DONTPREP)) {
> +                ret = prep_to_mq(scsi_mq_prep_fn(req));
> +                if (ret)
> +                        goto out_dec_host_busy;
> +                req->cmd_flags |= REQ_DONTPREP;
> +        }
> +
> +        scsi_init_cmd_errh(cmd);
> +        cmd->scsi_done = scsi_mq_done;
> +
> +        reason = scsi_dispatch_cmd(cmd);
> +        if (reason) {
> +                scsi_set_blocked(cmd, reason);
> +                ret = BLK_MQ_RQ_QUEUE_BUSY;
> +                goto out_dec_host_busy;
> +        }
> +
> +        return BLK_MQ_RQ_QUEUE_OK;
> +
> +out_dec_host_busy:
> +        cancel_delayed_work(&cmd->abort_work);

Hey Christoph,

I see the request timer is started before calling queue_rq, but I could
not figure out what the cancel_delayed_work here is for exactly. It
seems that if the request were to time out and the EH started while
queue_rq was running, we could end up with some nasty bugs, like the
request being requeued twice.

Is the cancel_delayed_work call just to be safe, or is it supposed to be
handling a case where the abort_work could be queued at this time due to
a request timing out while queue_rq is running? Is this case mq-specific?

2014-07-16 11:16:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

On Wed, Jul 16, 2014 at 06:13:21AM -0500, Mike Christie wrote:
> I see the request timer is started before calling queue_rq, but I could
> not figure out what the cancel_delayed_work here is for exactly. It
> seems that if the request were to time out and the EH started while
> queue_rq was running, we could end up with some nasty bugs, like the
> request being requeued twice.
>
> Is the cancel_delayed_work call just to be safe, or is it supposed to
> be handling a case where the abort_work could be queued at this time
> due to a request timing out while queue_rq is running? Is this case
> mq-specific?

It was cargo cult copy & paste from the old path. I've merged a patch
from Bart to remove it from the old code, so it should go away here as well,
thanks for the reminder.