As mentioned in the v1 submission, these patches were delayed a bit
by the discussion and testing of patch 5. Debugging the problem also
led to the discovery of a midlayer bug fixed by patch 4.
Paolo
v1->v2: fix all occurrences of "scmd->result |= DID_TIME_OUT << 16"
in patch 4
Ming Lei (2):
virtio_scsi: remove ACCESS_ONCE() and smp_read_barrier_depends()
virtio-scsi: replace target spinlock with seqcount
Paolo Bonzini (2):
virtio-scsi: avoid cancelling uninitialized work items
virtio-scsi: fix various bad behavior on aborted requests
Ulrich Obergfell (1):
scsi_error: fix invalid setting of host byte
Venkatesh Srinivas (1):
virtio-scsi: Implement change_queue_depth for virtscsi targets
drivers/scsi/scsi_error.c | 6 +-
drivers/scsi/virtio_scsi.c | 148 +++++++++++++++++++++++++++------------------
2 files changed, 93 insertions(+), 61 deletions(-)
--
1.8.3.1
From: Ming Lei <[email protected]>
Access to tgt->req_vq is strictly serialized by the tgt->tgt_lock
spinlock, so the ACCESS_ONCE() isn't necessary.
The smp_read_barrier_depends() in virtscsi_req_done() was introduced
to order reading req_vq against decrementing tgt->reqs, but it is no
longer needed: req_vq is now read from
vscsi->req_vqs[vq->index - VIRTIO_SCSI_VQ_BASE] instead of
tgt->req_vq. Remove the unnecessary barrier, along with the comment
that explained it.
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
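Condensed view of the two accesses after this patch (illustrative
excerpts; the diff below is authoritative):

        /* virtscsi_req_done(): req_vq is computed from the interrupting
         * virtqueue, with no dependent load through tgt, so no read
         * barrier is needed. */
        int index = vq->index - VIRTIO_SCSI_VQ_BASE;
        struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];

        /* virtscsi_pick_vq(): req_vq is only accessed under tgt_lock,
         * so a plain load replaces ACCESS_ONCE(). */
        spin_lock_irqsave(&tgt->tgt_lock, flags);
        if (atomic_inc_return(&tgt->reqs) > 1)
                vq = tgt->req_vq;
        ...
        spin_unlock_irqrestore(&tgt->tgt_lock, flags);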
drivers/scsi/virtio_scsi.c | 53 ++++++----------------------------------------
1 file changed, 6 insertions(+), 47 deletions(-)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index db3b494e5926..fc054935eb1f 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -73,17 +73,12 @@ struct virtio_scsi_vq {
* queue, and also lets the driver optimize the IRQ affinity for the virtqueues
* (each virtqueue's affinity is set to the CPU that "owns" the queue).
*
- * An interesting effect of this policy is that only writes to req_vq need to
- * take the tgt_lock. Read can be done outside the lock because:
+ * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
+ * could be done locklessly, but we do not do it yet.
*
- * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1.
- * In that case, no other CPU is reading req_vq: even if they were in
- * virtscsi_queuecommand_multi, they would be spinning on tgt_lock.
- *
- * - reads of req_vq only occur when the target is not idle (reqs != 0).
- * A CPU that enters virtscsi_queuecommand_multi will not modify req_vq.
- *
- * Similarly, decrements of reqs are never concurrent with writes of req_vq.
+ * Decrements of reqs are never concurrent with writes of req_vq: before the
+ * decrement reqs will be != 0; after the decrement the virtqueue completion
+ * routine will not use the req_vq so it can be changed by a new request.
* Thus they can happen outside the tgt_lock, provided of course we make reqs
* an atomic_t.
*/
@@ -238,38 +233,6 @@ static void virtscsi_req_done(struct virtqueue *vq)
int index = vq->index - VIRTIO_SCSI_VQ_BASE;
struct virtio_scsi_vq *req_vq = &vscsi->req_vqs[index];
- /*
- * Read req_vq before decrementing the reqs field in
- * virtscsi_complete_cmd.
- *
- * With barriers:
- *
- * CPU #0 virtscsi_queuecommand_multi (CPU #1)
- * ------------------------------------------------------------
- * lock vq_lock
- * read req_vq
- * read reqs (reqs = 1)
- * write reqs (reqs = 0)
- * increment reqs (reqs = 1)
- * write req_vq
- *
- * Possible reordering without barriers:
- *
- * CPU #0 virtscsi_queuecommand_multi (CPU #1)
- * ------------------------------------------------------------
- * lock vq_lock
- * read reqs (reqs = 1)
- * write reqs (reqs = 0)
- * increment reqs (reqs = 1)
- * write req_vq
- * read (wrong) req_vq
- *
- * We do not need a full smp_rmb, because req_vq is required to get
- * to tgt->reqs: tgt is &vscsi->tgt[sc->device->id], where sc is stored
- * in the virtqueue as the user token.
- */
- smp_read_barrier_depends();
-
virtscsi_vq_done(vscsi, req_vq, virtscsi_complete_cmd);
};
@@ -560,12 +523,8 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
spin_lock_irqsave(&tgt->tgt_lock, flags);
- /*
- * The memory barrier after atomic_inc_return matches
- * the smp_read_barrier_depends() in virtscsi_req_done.
- */
if (atomic_inc_return(&tgt->reqs) > 1)
- vq = ACCESS_ONCE(tgt->req_vq);
+ vq = tgt->req_vq;
else {
queue_num = smp_processor_id();
while (unlikely(queue_num >= vscsi->num_queues))
--
1.8.3.1
Calling the workqueue interface on uninitialized work items isn't a
good idea even if they're zeroed; that it doesn't fail
catastrophically today is only a happy accident.
Signed-off-by: Paolo Bonzini <[email protected]>
---
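A sketch of the invariant this patch restores (names from the driver;
the snippet is illustrative, not part of the diff): a work item must
be initialized before it is handed to any workqueue function,
including cancellation.

        INIT_WORK(&event_node->work, virtscsi_handle_event); /* first */
        schedule_work(&event_node->work);            /* then queue it */
        ...
        cancel_work_sync(&event_node->work);  /* valid only afterwards */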
drivers/scsi/virtio_scsi.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index f0b4cdbfceb0..d66c4ee2c774 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -253,6 +253,8 @@ static void virtscsi_ctrl_done(struct virtqueue *vq)
virtscsi_vq_done(vscsi, &vscsi->ctrl_vq, virtscsi_complete_free);
};
+static void virtscsi_handle_event(struct work_struct *work);
+
static int virtscsi_kick_event(struct virtio_scsi *vscsi,
struct virtio_scsi_event_node *event_node)
{
@@ -260,6 +262,7 @@ static int virtscsi_kick_event(struct virtio_scsi *vscsi,
struct scatterlist sg;
unsigned long flags;
+ INIT_WORK(&event_node->work, virtscsi_handle_event);
sg_init_one(&sg, &event_node->event, sizeof(struct virtio_scsi_event));
spin_lock_irqsave(&vscsi->event_vq.vq_lock, flags);
@@ -377,7 +380,6 @@ static void virtscsi_complete_event(struct virtio_scsi *vscsi, void *buf)
{
struct virtio_scsi_event_node *event_node = buf;
- INIT_WORK(&event_node->work, virtscsi_handle_event);
schedule_work(&event_node->work);
}
--
1.8.3.1
From: Ulrich Obergfell <[email protected]>
After scsi_try_to_abort_cmd() returns, the eh_abort_handler may
already have found that the command completed in the device, leaving
the host_byte nonzero (e.g. DID_ABORT). When that happens, ORing
DID_TIME_OUT into the host byte corrupts the result field and can
initiate an unwanted command retry.
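Concretely, with the host byte values from include/scsi/scsi.h
(DID_TIME_OUT is 0x03, DID_ABORT is 0x05):

        (DID_ABORT << 16) | (DID_TIME_OUT << 16) == DID_ERROR << 16

since 0x05 | 0x03 == 0x07 == DID_ERROR, so the command appears to
have failed with DID_ERROR rather than timed out.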
Fix this by using set_host_byte instead, following the model of
commit 2082ebc45af9c9c648383b8cde0dc1948eadbf31.
Cc: [email protected]
Signed-off-by: Ulrich Obergfell <[email protected]>
[Fix all instances according to review comments. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
---
v1->v2: fix all occurrences [Bart]
drivers/scsi/scsi_error.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f17aa7aa7879..8e7a0f8d4f52 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -131,7 +131,7 @@ scmd_eh_abort_handler(struct work_struct *work)
"aborting command %p\n", scmd));
rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
if (rtn == SUCCESS) {
- scmd->result |= DID_TIME_OUT << 16;
+ set_host_byte(scmd, DID_TIME_OUT);
if (scsi_host_eh_past_deadline(sdev->host)) {
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_INFO, scmd,
@@ -167,7 +167,7 @@ scmd_eh_abort_handler(struct work_struct *work)
scmd_printk(KERN_WARNING, scmd,
"scmd %p terminate "
"aborted command\n", scmd));
- scmd->result |= DID_TIME_OUT << 16;
+ set_host_byte(scmd, DID_TIME_OUT);
scsi_finish_command(scmd);
}
}
@@ -291,7 +291,7 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
if (scsi_abort_command(scmd) == SUCCESS)
return BLK_EH_NOT_HANDLED;
- scmd->result |= DID_TIME_OUT << 16;
+ set_host_byte(scmd, DID_TIME_OUT);
if (unlikely(rtn == BLK_EH_NOT_HANDLED &&
!scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD)))
--
1.8.3.1
From: Venkatesh Srinivas <[email protected]>
change_queue_depth allows changing the per-target queue depth via
sysfs. It also lets the SCSI midlayer ramp down the number of
concurrent in-flight requests in response to a SCSI BUSY status and
ramp the count back up to the device maximum once the BUSY condition
has resolved.
Signed-off-by: Venkatesh Srinivas <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
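As a usage example (device name illustrative), once this hook is in
place the depth can be changed from userspace through the standard
sysfs attribute:

        echo 64 > /sys/block/sda/device/queue_depth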
drivers/scsi/virtio_scsi.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index fda9fb358888..de0f8d9b3682 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
#include <scsi/scsi_host.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_tcq.h>
#include <linux/seqlock.h>
#define VIRTIO_SCSI_MEMPOOL_SZ 64
@@ -629,6 +630,36 @@ static int virtscsi_device_reset(struct scsi_cmnd *sc)
return virtscsi_tmf(vscsi, cmd);
}
+/**
+ * virtscsi_change_queue_depth() - Change a virtscsi target's queue depth
+ * @sdev: Virtscsi target whose queue depth to change
+ * @qdepth: New queue depth
+ * @reason: Reason for the queue depth change.
+ */
+static int virtscsi_change_queue_depth(struct scsi_device *sdev,
+ int qdepth,
+ int reason)
+{
+ struct Scsi_Host *shost = sdev->host;
+ int max_depth = shost->cmd_per_lun;
+
+ switch (reason) {
+ case SCSI_QDEPTH_QFULL: /* Drop qdepth in response to BUSY state */
+ scsi_track_queue_full(sdev, qdepth);
+ break;
+ case SCSI_QDEPTH_RAMP_UP: /* Raise qdepth after BUSY state resolved */
+ case SCSI_QDEPTH_DEFAULT: /* Manual change via sysfs */
+ scsi_adjust_queue_depth(sdev,
+ scsi_get_tag_type(sdev),
+ min(max_depth, qdepth));
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ return sdev->queue_depth;
+}
+
static int virtscsi_abort(struct scsi_cmnd *sc)
{
struct virtio_scsi *vscsi = shost_priv(sc->device->host);
@@ -683,6 +714,7 @@ static struct scsi_host_template virtscsi_host_template_single = {
.proc_name = "virtio_scsi",
.this_id = -1,
.queuecommand = virtscsi_queuecommand_single,
+ .change_queue_depth = virtscsi_change_queue_depth,
.eh_abort_handler = virtscsi_abort,
.eh_device_reset_handler = virtscsi_device_reset,
@@ -699,6 +731,7 @@ static struct scsi_host_template virtscsi_host_template_multi = {
.proc_name = "virtio_scsi",
.this_id = -1,
.queuecommand = virtscsi_queuecommand_multi,
+ .change_queue_depth = virtscsi_change_queue_depth,
.eh_abort_handler = virtscsi_abort,
.eh_device_reset_handler = virtscsi_device_reset,
--
1.8.3.1
Even though the virtio-scsi spec guarantees that all requests related
to the TMF will have been completed by the time the TMF itself
completes, the request queue's callback might not have run yet when
that happens. Requests can then be completed more than once, which
triggers a variety of BUGs or oopses.
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
---
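A rough sketch of the race (illustrative, in the style of the CPU
timelines used elsewhere in the driver):

        CPU #0 (eh: virtscsi_tmf)       CPU #1 (request vq interrupt)
        ------------------------------------------------------------
        send TMF and wait
        TMF completes on the host;
        the affected request is already
        sitting in the request virtqueue
        return SUCCESS; error handling
        finishes and the request is
        completed/recycled              virtscsi_req_done finally runs
                                        and completes it a second time

Polling the request virtqueues before returning from the TMF ensures
the callback has consumed any such request first.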
drivers/scsi/virtio_scsi.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index d66c4ee2c774..fda9fb358888 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -235,6 +235,16 @@ static void virtscsi_req_done(struct virtqueue *vq)
virtscsi_vq_done(vscsi, req_vq, virtscsi_complete_cmd);
};
+static void virtscsi_poll_requests(struct virtio_scsi *vscsi)
+{
+ int i, num_vqs;
+
+ num_vqs = vscsi->num_queues;
+ for (i = 0; i < num_vqs; i++)
+ virtscsi_vq_done(vscsi, &vscsi->req_vqs[i],
+ virtscsi_complete_cmd);
+}
+
static void virtscsi_complete_free(struct virtio_scsi *vscsi, void *buf)
{
struct virtio_scsi_cmd *cmd = buf;
@@ -579,6 +589,18 @@ static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
cmd->resp.tmf.response == VIRTIO_SCSI_S_FUNCTION_SUCCEEDED)
ret = SUCCESS;
+ /*
+ * The spec guarantees that all requests related to the TMF have
+ * been completed, but the callback might not have run yet if
+ * we're using independent interrupts (e.g. MSI). Poll the
+ * virtqueues once.
+ *
+ * In the abort case, sc->scsi_done will do nothing, because
+ * the block layer must have detected a timeout and as a result
+ * REQ_ATOM_COMPLETE has been set.
+ */
+ virtscsi_poll_requests(vscsi);
+
out:
mempool_free(cmd, virtscsi_cmd_pool);
return ret;
--
1.8.3.1
From: Ming Lei <[email protected]>
The tgt_lock spinlock serves only to serialize reads and writes of
req_vq; a lockless seqcount is enough for that purpose.
On a 16-core VM with a vhost-scsi backend, this patch improves IOPS
by 3% on a random-read test.
Signed-off-by: Ming Lei <[email protected]>
[Add initialization in virtscsi_target_alloc. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
---
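The resulting lock-free pattern, condensed (the diff below is
authoritative; note that writer exclusion comes from
atomic_inc_return() == 1, not from the seqcount itself):

        /* reader: retry if it raced with a writer */
        do {
                seq = read_seqcount_begin(&tgt->tgt_seq);
                vq = tgt->req_vq;
        } while (read_seqcount_retry(&tgt->tgt_seq, seq));

        /* writer: only the CPU that took reqs from 0 to 1 gets here */
        write_seqcount_begin(&tgt->tgt_seq);
        tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
        write_seqcount_end(&tgt->tgt_seq);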
drivers/scsi/virtio_scsi.c | 42 +++++++++++++++++++++++++++++-------------
1 file changed, 29 insertions(+), 13 deletions(-)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index fc054935eb1f..f0b4cdbfceb0 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -26,6 +26,7 @@
#include <scsi/scsi_host.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_cmnd.h>
+#include <linux/seqlock.h>
#define VIRTIO_SCSI_MEMPOOL_SZ 64
#define VIRTIO_SCSI_EVENT_LEN 8
@@ -73,18 +74,16 @@ struct virtio_scsi_vq {
* queue, and also lets the driver optimize the IRQ affinity for the virtqueues
* (each virtqueue's affinity is set to the CPU that "owns" the queue).
*
- * tgt_lock is held to serialize reading and writing req_vq. Reading req_vq
- * could be done locklessly, but we do not do it yet.
+ * tgt_seq is held to serialize reading and writing req_vq.
*
* Decrements of reqs are never concurrent with writes of req_vq: before the
* decrement reqs will be != 0; after the decrement the virtqueue completion
* routine will not use the req_vq so it can be changed by a new request.
- * Thus they can happen outside the tgt_lock, provided of course we make reqs
+ * Thus they can happen outside the tgt_seq, provided of course we make reqs
* an atomic_t.
*/
struct virtio_scsi_target_state {
- /* This spinlock never held at the same time as vq_lock. */
- spinlock_t tgt_lock;
+ seqcount_t tgt_seq;
/* Count of outstanding requests. */
atomic_t reqs;
@@ -521,19 +520,33 @@ static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
unsigned long flags;
u32 queue_num;
- spin_lock_irqsave(&tgt->tgt_lock, flags);
+ local_irq_save(flags);
+ if (atomic_inc_return(&tgt->reqs) > 1) {
+ unsigned long seq;
+
+ do {
+ seq = read_seqcount_begin(&tgt->tgt_seq);
+ vq = tgt->req_vq;
+ } while (read_seqcount_retry(&tgt->tgt_seq, seq));
+ } else {
+ /* no writes can be concurrent because of atomic_t */
+ write_seqcount_begin(&tgt->tgt_seq);
+
+ /* keep previous req_vq if there is reader found */
+ if (unlikely(atomic_read(&tgt->reqs) > 1)) {
+ vq = tgt->req_vq;
+ goto unlock;
+ }
- if (atomic_inc_return(&tgt->reqs) > 1)
- vq = tgt->req_vq;
- else {
queue_num = smp_processor_id();
while (unlikely(queue_num >= vscsi->num_queues))
queue_num -= vscsi->num_queues;
-
tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+ unlock:
+ write_seqcount_end(&tgt->tgt_seq);
}
+ local_irq_restore(flags);
- spin_unlock_irqrestore(&tgt->tgt_lock, flags);
return vq;
}
@@ -618,14 +631,17 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
static int virtscsi_target_alloc(struct scsi_target *starget)
{
+ struct Scsi_Host *sh = dev_to_shost(starget->dev.parent);
+ struct virtio_scsi *vscsi = shost_priv(sh);
+
struct virtio_scsi_target_state *tgt =
kmalloc(sizeof(*tgt), GFP_KERNEL);
if (!tgt)
return -ENOMEM;
- spin_lock_init(&tgt->tgt_lock);
+ seqcount_init(&tgt->tgt_seq);
atomic_set(&tgt->reqs, 0);
- tgt->req_vq = NULL;
+ tgt->req_vq = &vscsi->req_vqs[0];
starget->hostdata = tgt;
return 0;
--
1.8.3.1
On 06/04/14 12:11, Paolo Bonzini wrote:
> v1->v2: fix all occurrences of "scmd->result |= DID_TIME_OUT << 16"
> in patch 4
Hello Paolo,
Thanks for the quick respin. However, since you mention that in
v2 all occurrences of "scmd->result |= DID_TIME_OUT << 16" have been
addressed, I wondered whether you had noticed the following
code in scsi_decide_disposition():
switch (host_byte(scmd->result)) {
[ ... ]
case DID_ABORT:
	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) {
		scmd->result |= DID_TIME_OUT << 16;
		return SUCCESS;
	}
Thanks,
Bart.
On 04/06/2014 13:21, Bart Van Assche wrote:
>
> Thanks for the quick respin. However, since you mention that in
> v2 all occurrences of "scmd->result |= DID_TIME_OUT << 16" have been
> addressed, I wondered whether you had noticed the following
> code in scsi_decide_disposition():
>
> switch (host_byte(scmd->result)) {
> [ ... ]
> case DID_ABORT:
> 	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) {
> 		scmd->result |= DID_TIME_OUT << 16;
> 		return SUCCESS;
> 	}
Nope, I miscounted how many occurrences were left. v3 on the way.
Paolo
On 6/4/14, Paolo Bonzini <[email protected]> wrote:
> From: Ming Lei <[email protected]>
>
> The tgt_lock spinlock serves only to serialize reads and writes of
> req_vq; a lockless seqcount is enough for that purpose.
>
> On a 16-core VM with a vhost-scsi backend, this patch improves IOPS
> by 3% on a random-read test.
>
> Signed-off-by: Ming Lei <[email protected]>
> [Add initialization in virtscsi_target_alloc. - Paolo]
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Venkatesh Srinivas <[email protected]>
-- vs;