2015-05-22 08:00:07

by Nicholas A. Bellinger

Subject: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

From: Nicholas Bellinger <[email protected]>

Hi HCH & Co,

Here is RFC code for adding target_core_fabric_ops->complete_irq
bypass that allows fabrics to invoke response callbacks directly
from target_complete_cmd() IRQ context.

It breaks up existing target_complete_ok_work() code into three
pieces:

- transport_complete_task_attr()
- target_complete_ok_pre()
- target_complete_irq()

and allows target_complete_irq() to be called directly from IRQ context
if no special-case se_cmd handling requirements exist.

The three cases that still fall back to queue_work() process context
(see the condensed sketch after the list) are:

- non-GOOD status
- ORDERED task sync
- non-NULL se_cmd->transport_complete_callback()
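
Condensed, the decision inside target_complete_cmd() from patch 1 below ends
up shaped roughly like this (locking, t_state updates and the failure-work
setup are elided here):

	/* Condensed from patch 1 below; locking and state updates elided. */
	if (success &&
	    (cmd->se_cmd_flags & SCF_COMPLETE_IRQ) &&          /* fabric opted in */
	    !(cmd->se_cmd_flags & SCF_TRANSPORT_TASK_SENSE) && /* GOOD status */
	    !cmd->transport_complete_callback) {               /* no special callback */
		if (transport_complete_task_attr(cmd)) {       /* no ORDERED sync pending */
			target_complete_irq(cmd, true);
			return;
		}
	}
	/* otherwise fall back to the existing work-queue dispatch */
	queue_work(target_completion_wq, &cmd->work);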

It also converts target_restart_delayed_cmds() to use llist_head, and
makes transport_complete_task_attr() check whether dev_ordered_sync is
non-zero before doing the llist_del_all() -> cmpxchg drain of
outstanding ordered tags.

This allows loopback LLD code to bypass the extra queue_work() and
invoke its ->scsi_done() callback directly from IRQ context.
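
For context, the loopback status callback already completes straight back
into the SCSI midlayer; paraphrased (not an exact quote of tcm_loop.c), it
looks roughly like:

	/* Paraphrased sketch of tcm_loop's status callback; status and
	 * sense copying elided, not the exact driver code. */
	static int tcm_loop_queue_status(struct se_cmd *se_cmd)
	{
		struct tcm_loop_cmd *tl_cmd = container_of(se_cmd,
				struct tcm_loop_cmd, tl_se_cmd);
		struct scsi_cmnd *sc = tl_cmd->sc;

		sc->scsi_done(sc);	/* complete to the SCSI midlayer */
		return 0;
	}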

WDYT..?

--nab

Nicholas Bellinger (2):
target: Add support for fabric IRQ completion
loopback: Enable TFO->complete_irq for fast-path ->scsi_done

drivers/target/loopback/tcm_loop.c | 1 +
drivers/target/target_core_device.c | 2 +-
drivers/target/target_core_tpg.c | 1 +
drivers/target/target_core_transport.c | 165 ++++++++++++++++++++++++---------
include/target/target_core_base.h | 5 +-
include/target/target_core_fabric.h | 1 +
6 files changed, 127 insertions(+), 48 deletions(-)

--
1.9.1


2015-05-22 08:01:06

by Nicholas A. Bellinger

Subject: [RFC 1/2] target: Add support for fabric IRQ completion

From: Nicholas Bellinger <[email protected]>

This patch adds support for invoking TFO completion callbacks directly
from IRQ context in target_complete_cmd().

Some fabric drivers like loopback and vhost can invoke their response
callbacks directly from IRQ context, and this patch allows the extra
queue_work() dispatch to a separate work-queue process context to be
avoided for fast-path operation.

This includes the refactoring of target_complete_ok_work(), so that
transport_complete_task_attr() and target_complete_irq() can be
invoked directly from target_complete_cmd() context.

Also, target_restart_delayed_cmds() has been converted to use llist,
and will only be invoked from !in_interrupt() with an llist_del_all()
compare-and-exchange when draining the list of ordered tags.

Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
---
drivers/target/target_core_device.c | 2 +-
drivers/target/target_core_tpg.c | 1 +
drivers/target/target_core_transport.c | 165 ++++++++++++++++++++++++---------
include/target/target_core_base.h | 5 +-
include/target/target_core_fabric.h | 1 +
5 files changed, 126 insertions(+), 48 deletions(-)

diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
index 1d98033..70213fa 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -739,7 +739,7 @@ struct se_device *target_alloc_device(struct se_hba *hba, const char *name)
INIT_LIST_HEAD(&dev->dev_list);
INIT_LIST_HEAD(&dev->dev_sep_list);
INIT_LIST_HEAD(&dev->dev_tmr_list);
- INIT_LIST_HEAD(&dev->delayed_cmd_list);
+ init_llist_head(&dev->delayed_cmd_llist);
INIT_LIST_HEAD(&dev->state_list);
INIT_LIST_HEAD(&dev->qf_cmd_list);
INIT_LIST_HEAD(&dev->g_dev_node);
diff --git a/drivers/target/target_core_tpg.c b/drivers/target/target_core_tpg.c
index 3fbb0d4..aa08d6b 100644
--- a/drivers/target/target_core_tpg.c
+++ b/drivers/target/target_core_tpg.c
@@ -37,6 +37,7 @@

#include <target/target_core_base.h>
#include <target/target_core_backend.h>
+#include <target/target_core_configfs.h>
#include <target/target_core_fabric.h>

#include "target_core_internal.h"
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 2ccaeff..a98a23c 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -63,10 +63,11 @@ struct kmem_cache *t10_alua_tg_pt_gp_cache;
struct kmem_cache *t10_alua_lba_map_cache;
struct kmem_cache *t10_alua_lba_map_mem_cache;

-static void transport_complete_task_attr(struct se_cmd *cmd);
+static bool transport_complete_task_attr(struct se_cmd *cmd);
static void transport_handle_queue_full(struct se_cmd *cmd,
struct se_device *dev);
static int transport_put_cmd(struct se_cmd *cmd);
+static void target_complete_irq(struct se_cmd *cmd, bool);
static void target_complete_ok_work(struct work_struct *work);

int init_se_kmem_caches(void)
@@ -711,16 +712,37 @@ void target_complete_cmd(struct se_cmd *cmd, u8 scsi_status)
spin_unlock_irqrestore(&cmd->t_state_lock, flags);
complete_all(&cmd->t_transport_stop_comp);
return;
- } else if (!success) {
- INIT_WORK(&cmd->work, target_complete_failure_work);
- } else {
+ }
+ if (success) {
+ /*
+ * Invoke TFO completion callback now if fabric driver can
+ * queue response in IRQ context, and special case descriptor
+ * handling requirements in process context for ORDERED task
+ * and friends do not exist.
+ */
+ if (cmd->se_cmd_flags & SCF_COMPLETE_IRQ &&
+ !(cmd->se_cmd_flags & SCF_TRANSPORT_TASK_SENSE) &&
+ !cmd->transport_complete_callback) {
+
+ cmd->t_state = TRANSPORT_COMPLETE;
+ cmd->transport_state |= (CMD_T_COMPLETE | CMD_T_ACTIVE);
+ spin_unlock_irqrestore(&cmd->t_state_lock, flags);
+
+ if (!transport_complete_task_attr(cmd))
+ goto do_work;
+
+ target_complete_irq(cmd, true);
+ return;
+ }
INIT_WORK(&cmd->work, target_complete_ok_work);
+ } else {
+ INIT_WORK(&cmd->work, target_complete_failure_work);
}

cmd->t_state = TRANSPORT_COMPLETE;
cmd->transport_state |= (CMD_T_COMPLETE | CMD_T_ACTIVE);
spin_unlock_irqrestore(&cmd->t_state_lock, flags);
-
+do_work:
queue_work(target_completion_wq, &cmd->work);
}
EXPORT_SYMBOL(target_complete_cmd);
@@ -1145,7 +1167,6 @@ void transport_init_se_cmd(
int task_attr,
unsigned char *sense_buffer)
{
- INIT_LIST_HEAD(&cmd->se_delayed_node);
INIT_LIST_HEAD(&cmd->se_qf_node);
INIT_LIST_HEAD(&cmd->se_cmd_list);
INIT_LIST_HEAD(&cmd->state_list);
@@ -1155,6 +1176,8 @@ void transport_init_se_cmd(
spin_lock_init(&cmd->t_state_lock);
kref_init(&cmd->cmd_kref);
cmd->transport_state = CMD_T_DEV_ACTIVE;
+ if (tfo->complete_irq)
+ cmd->se_cmd_flags |= SCF_COMPLETE_IRQ;

cmd->se_tfo = tfo;
cmd->se_sess = se_sess;
@@ -1803,9 +1826,7 @@ static bool target_handle_task_attr(struct se_cmd *cmd)
if (atomic_read(&dev->dev_ordered_sync) == 0)
return false;

- spin_lock(&dev->delayed_cmd_lock);
- list_add_tail(&cmd->se_delayed_node, &dev->delayed_cmd_list);
- spin_unlock(&dev->delayed_cmd_lock);
+ llist_add(&cmd->se_delayed_node, &dev->delayed_cmd_llist);

pr_debug("Added CDB: 0x%02x Task Attr: 0x%02x to"
" delayed CMD list, se_ordered_id: %u\n",
@@ -1856,28 +1877,32 @@ EXPORT_SYMBOL(target_execute_cmd);

/*
* Process all commands up to the last received ORDERED task attribute which
- * requires another blocking boundary
+ * requires another blocking boundary.
*/
static void target_restart_delayed_cmds(struct se_device *dev)
{
- for (;;) {
- struct se_cmd *cmd;
+ bool ordered = false;
+ struct se_cmd *cmd;
+ struct llist_node *llnode;

- spin_lock(&dev->delayed_cmd_lock);
- if (list_empty(&dev->delayed_cmd_list)) {
- spin_unlock(&dev->delayed_cmd_lock);
- break;
+ llnode = llist_del_all(&dev->delayed_cmd_llist);
+ while (llnode) {
+ cmd = llist_entry(llnode, struct se_cmd, se_delayed_node);
+ llnode = llist_next(llnode);
+ /*
+ * Re-add outstanding command to se_device delayed llist to
+ * satisfy ordered tag execution requirements.
+ */
+ if (ordered) {
+ llist_add(&cmd->se_delayed_node, &dev->delayed_cmd_llist);
+ continue;
}
-
- cmd = list_entry(dev->delayed_cmd_list.next,
- struct se_cmd, se_delayed_node);
- list_del(&cmd->se_delayed_node);
- spin_unlock(&dev->delayed_cmd_lock);
-
__target_execute_cmd(cmd);

- if (cmd->sam_task_attr == TCM_ORDERED_TAG)
- break;
+ if (cmd->sam_task_attr == TCM_ORDERED_TAG) {
+ ordered = true;
+ continue;
+ }
}
}

@@ -1885,12 +1910,12 @@ static void target_restart_delayed_cmds(struct se_device *dev)
* Called from I/O completion to determine which dormant/delayed
* and ordered cmds need to have their tasks added to the execution queue.
*/
-static void transport_complete_task_attr(struct se_cmd *cmd)
+static bool transport_complete_task_attr(struct se_cmd *cmd)
{
struct se_device *dev = cmd->se_dev;

if (dev->transport->transport_type == TRANSPORT_PLUGIN_PHBA_PDEV)
- return;
+ return true;

if (cmd->sam_task_attr == TCM_SIMPLE_TAG) {
atomic_dec_mb(&dev->simple_cmds);
@@ -1910,8 +1935,23 @@ static void transport_complete_task_attr(struct se_cmd *cmd)
pr_debug("Incremented dev_cur_ordered_id: %u for ORDERED:"
" %u\n", dev->dev_cur_ordered_id, cmd->se_ordered_id);
}
+ /*
+ * Check for special in_interrupt() case where completion can happen
+ * for certain fabrics from IRQ context, as long as no outstanding
+ * ordered tags exist.
+ */
+ if (in_interrupt()) {
+ if (atomic_read(&dev->dev_ordered_sync))
+ return false;

+ return true;
+ }
+ /*
+ * If called from process context, go ahead and drain the current
+ * se_device->delayed_cmd_llist of ordered tags if any exist.
+ */
target_restart_delayed_cmds(dev);
+ return true;
}

static void transport_complete_qf(struct se_cmd *cmd)
@@ -1995,18 +2035,18 @@ static bool target_read_prot_action(struct se_cmd *cmd)
return false;
}

-static void target_complete_ok_work(struct work_struct *work)
-{
- struct se_cmd *cmd = container_of(work, struct se_cmd, work);
- int ret;

- /*
- * Check if we need to move delayed/dormant tasks from cmds on the
- * delayed execution list after a HEAD_OF_QUEUE or ORDERED Task
- * Attribute.
- */
- transport_complete_task_attr(cmd);
+static void target_complete_queue_full(struct se_cmd *cmd)
+{
+ pr_debug("Handling complete_ok QUEUE_FULL: se_cmd: %p,"
+ " data_direction: %d\n", cmd, cmd->data_direction);
+ cmd->t_state = TRANSPORT_COMPLETE_QF_OK;
+ transport_handle_queue_full(cmd, cmd->se_dev);
+}

+static bool target_complete_ok_pre(struct se_cmd *cmd)
+{
+ int ret;
/*
* Check to schedule QUEUE_FULL work, or execute an existing
* cmd->transport_qf_callback()
@@ -2027,7 +2067,7 @@ static void target_complete_ok_work(struct work_struct *work)

transport_lun_remove_cmd(cmd);
transport_cmd_check_stop_to_fabric(cmd);
- return;
+ return true;
}
/*
* Check for a callback, used by amongst other things
@@ -2040,9 +2080,9 @@ static void target_complete_ok_work(struct work_struct *work)
if (!rc && !(cmd->se_cmd_flags & SCF_COMPARE_AND_WRITE_POST)) {
if ((cmd->se_cmd_flags & SCF_COMPARE_AND_WRITE) &&
!cmd->data_length)
- goto queue_rsp;
+ return false;

- return;
+ return true;
} else if (rc) {
ret = transport_send_check_condition_and_sense(cmd,
rc, 0);
@@ -2051,11 +2091,27 @@ static void target_complete_ok_work(struct work_struct *work)

transport_lun_remove_cmd(cmd);
transport_cmd_check_stop_to_fabric(cmd);
- return;
+ return true;
}
}

-queue_rsp:
+ return false;
+
+queue_full:
+ target_complete_queue_full(cmd);
+ return true;
+}
+
+static void target_complete_irq(struct se_cmd *cmd, bool check_qf)
+{
+ int ret;
+ /*
+ * Check to schedule QUEUE_FULL work, or execute an existing
+ * cmd->transport_qf_callback()
+ */
+ if (check_qf && atomic_read(&cmd->se_dev->dev_qf_count) != 0)
+ schedule_work(&cmd->se_dev->qf_work_queue);
+
switch (cmd->data_direction) {
case DMA_FROM_DEVICE:
atomic_long_add(cmd->data_length,
@@ -2111,10 +2167,29 @@ queue_rsp:
return;

queue_full:
- pr_debug("Handling complete_ok QUEUE_FULL: se_cmd: %p,"
- " data_direction: %d\n", cmd, cmd->data_direction);
- cmd->t_state = TRANSPORT_COMPLETE_QF_OK;
- transport_handle_queue_full(cmd, cmd->se_dev);
+ target_complete_queue_full(cmd);
+}
+
+static void target_complete_ok_work(struct work_struct *work)
+{
+ struct se_cmd *cmd = container_of(work, struct se_cmd, work);
+ /*
+ * Check if we need to move delayed/dormant tasks from cmds on the
+ * delayed execution list after a HEAD_OF_QUEUE or ORDERED Task
+ * Attribute.
+ */
+ transport_complete_task_attr(cmd);
+ /*
+ * Invoke special case handling for QUEUE_FULL, backend TASK_SENSE or
+ * transport_complete_callback() cases.
+ */
+ if (target_complete_ok_pre(cmd))
+ return;
+ /*
+ * Perform the se_tfo->queue_data_in() and/or se_tfo->queue_status()
+ * callbacks into fabric code.
+ */
+ target_complete_irq(cmd, false);
}

static inline void transport_free_sgl(struct scatterlist *sgl, int nents)
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index 9f3878f..bc07c8b 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -156,6 +156,7 @@ enum se_cmd_flags_table {
SCF_COMPARE_AND_WRITE = 0x00080000,
SCF_COMPARE_AND_WRITE_POST = 0x00100000,
SCF_PASSTHROUGH_PROT_SG_TO_MEM_NOALLOC = 0x00200000,
+ SCF_COMPLETE_IRQ = 0x00400000,
};

/* struct se_dev_entry->lun_flags and struct se_lun->lun_access */
@@ -487,7 +488,7 @@ struct se_cmd {
u64 pr_res_key;
/* Used for sense data */
void *sense_buffer;
- struct list_head se_delayed_node;
+ struct llist_node se_delayed_node;
struct list_head se_qf_node;
struct se_device *se_dev;
struct se_lun *se_lun;
@@ -795,7 +796,7 @@ struct se_device {
struct list_head dev_tmr_list;
struct workqueue_struct *tmr_wq;
struct work_struct qf_work_queue;
- struct list_head delayed_cmd_list;
+ struct llist_head delayed_cmd_llist;
struct list_head state_list;
struct list_head qf_cmd_list;
struct list_head g_dev_node;
diff --git a/include/target/target_core_fabric.h b/include/target/target_core_fabric.h
index d6216b7..72c4a12 100644
--- a/include/target/target_core_fabric.h
+++ b/include/target/target_core_fabric.h
@@ -4,6 +4,7 @@
struct target_core_fabric_ops {
struct module *module;
const char *name;
+ bool complete_irq;
size_t node_acl_size;
char *(*get_fabric_name)(void);
char *(*tpg_get_wwn)(struct se_portal_group *);
--
1.9.1

2015-05-22 08:00:24

by Nicholas A. Bellinger

Subject: [RFC 2/2] loopback: Enable TFO->complete_irq for fast-path ->scsi_done

From: Nicholas Bellinger <[email protected]>

Go ahead and enable TFO->complete_irq for testing with loopback
LLD code, and avoid the extra fast-path queue_work() context
switch if no se_cmd special case handling requirements exist.

Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
CC: Sagi Grimberg <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
---
drivers/target/loopback/tcm_loop.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/target/loopback/tcm_loop.c b/drivers/target/loopback/tcm_loop.c
index bd9d11a..562906a 100644
--- a/drivers/target/loopback/tcm_loop.c
+++ b/drivers/target/loopback/tcm_loop.c
@@ -1229,6 +1229,7 @@ static struct configfs_attribute *tcm_loop_wwn_attrs[] = {
static const struct target_core_fabric_ops loop_ops = {
.module = THIS_MODULE,
.name = "loopback",
+ .complete_irq = true,
.get_fabric_name = tcm_loop_get_fabric_name,
.tpg_get_wwn = tcm_loop_get_endpoint_wwn,
.tpg_get_tag = tcm_loop_get_tag,
--
1.9.1

2015-06-03 12:58:11

by Christoph Hellwig

Subject: Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

This makes lockdep very unhappy, rightly so. If you execute
one end_io function inside another you basically nest every possible
lock taken in the I/O completion path. Also, adding more work
to the hardirq path generally isn't a smart idea. Can you explain
what issues you were seeing and how much this helps? Note that
the workqueue usage in the target core so far is fairly basic, so
there should be some low-hanging fruit.

[ 21.119148]
[ 21.119382] =============================================
[ 21.120012] [ INFO: possible recursive locking detected ]
[ 21.120639] 4.1.0-rc1+ #489 Not tainted
[ 21.121131] ---------------------------------------------
[ 21.121754] swapper/0/0 is trying to acquire lock:
[ 21.122324] (&(&fq->mq_flush_lock)->rlock){-.....}, at: [<ffffffff817b0256>] flush_end_io+0x66/0x220
[ 21.122451]
[ 21.122451] but task is already holding lock:
[ 21.122451] (&(&fq->mq_flush_lock)->rlock){-.....}, at: [<ffffffff817b0256>] flush_end_io+0x66/0x220
[ 21.122451]
[ 21.122451] other info that might help us debug this:
[ 21.122451] Possible unsafe locking scenario:
[ 21.122451]
[ 21.122451] CPU0
[ 21.122451] ----
[ 21.122451] lock(&(&fq->mq_flush_lock)->rlock);
[ 21.122451] lock(&(&fq->mq_flush_lock)->rlock);
[ 21.122451]
[ 21.122451] *** DEADLOCK ***
[ 21.122451]
[ 21.122451] May be due to missing lock nesting notation
[ 21.122451]
[ 21.122451] 3 locks held by swapper/0/0:
[ 21.122451] #0: (&(&vp_dev->lock)->rlock){-.-...}, at: [<ffffffff8187c87c>] vp_vring_interrupt+0x2c/0x90
[ 21.122451] #1: (&(&virtscsi_vq->vq_lock)->rlock){-.-...}, at: [<ffffffff81ad2796>] virtscsi_vq_done+0x26/0x90
[ 21.122451] #2: (&(&fq->mq_flush_lock)->rlock){-.....}, at: [<ffffffff817b0256>] flush_end_io+0x66/0x220
[ 21.122451]
[ 21.122451] stack backtrace:
[ 21.122451] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-rc1+ #489
[ 21.122451] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 21.122451] ffffffff82c34820 ffff88007fc03618 ffffffff81e3396f 0000000000000000
[ 21.122451] ffffffff82445500 ffff88007fc036c8 ffffffff811209a9 ffff88007fc17d18
[ 21.122451] 0000000000000000 0000000000000000 ffffffff82c34820 ffff88007fc17d18
[ 21.122451] Call Trace:
[ 21.122451] <IRQ> [<ffffffff81e3396f>] dump_stack+0x45/0x57
[ 21.122451] [<ffffffff811209a9>] validate_chain.isra.37+0xd39/0x1160
[ 21.122451] [<ffffffff811219b8>] __lock_acquire+0x488/0xd40
[ 21.122451] [<ffffffff810f6b28>] ? __kernel_text_address+0x58/0x80
[ 21.122451] [<ffffffff8112285f>] lock_acquire+0xaf/0x130
[ 21.122451] [<ffffffff817b0256>] ? flush_end_io+0x66/0x220
[ 21.122451] [<ffffffff81e3d859>] _raw_spin_lock_irqsave+0x49/0x60
[ 21.122451] [<ffffffff817b0256>] ? flush_end_io+0x66/0x220
[ 21.122451] [<ffffffff817b0256>] flush_end_io+0x66/0x220
[ 21.122451] [<ffffffff817b5d0f>] __blk_mq_end_request+0x2f/0x70
[ 21.122451] [<ffffffff81a088ad>] scsi_end_request+0x7d/0x1e0
[ 21.122451] [<ffffffff81a0a3e0>] scsi_io_completion+0x110/0x610
[ 21.122451] [<ffffffff811219b8>] ? __lock_acquire+0x488/0xd40
[ 21.122451] [<ffffffff81a00e5b>] scsi_finish_command+0xdb/0x140
[ 21.122451] [<ffffffff81a09b66>] scsi_softirq_done+0x136/0x150
[ 21.122451] [<ffffffff817b7f6e>] __blk_mq_complete_request+0x8e/0x130
[ 21.122451] [<ffffffff817b8039>] blk_mq_complete_request+0x29/0x30
[ 21.122451] [<ffffffff81a078d8>] scsi_mq_done+0x28/0x60
[ 21.122451] [<ffffffff81b43fbb>] tcm_loop_queue_status+0x3b/0xb0
[ 21.122451] [<ffffffff81b318da>] target_complete_irq+0x8a/0x250
[ 21.122451] [<ffffffff81b335ea>] target_complete_cmd+0x1fa/0x2a0
[ 21.122451] [<ffffffff81b3f6da>] iblock_end_io_flush+0x2a/0x60
[ 21.122451] [<ffffffff817a3a93>] bio_endio+0x53/0x90
[ 21.122451] [<ffffffff817acd58>] blk_update_request+0x98/0x360
[ 21.122451] [<ffffffff817b752e>] blk_mq_end_request+0x1e/0x80
[ 21.122451] [<ffffffff817affb4>] blk_flush_complete_seq+0xe4/0x320
[ 21.122451] [<ffffffff817b0328>] flush_end_io+0x138/0x220
[ 21.122451] [<ffffffff817b5d0f>] __blk_mq_end_request+0x2f/0x70
[ 21.122451] [<ffffffff81a088ad>] scsi_end_request+0x7d/0x1e0
[ 21.122451] [<ffffffff81a0a3e0>] scsi_io_completion+0x110/0x610
[ 21.122451] [<ffffffff8111dcdd>] ? trace_hardirqs_off+0xd/0x10
[ 21.122451] [<ffffffff81a00e5b>] scsi_finish_command+0xdb/0x140
[ 21.122451] [<ffffffff81a09b66>] scsi_softirq_done+0x136/0x150
[ 21.122451] [<ffffffff817b7f6e>] __blk_mq_complete_request+0x8e/0x130
[ 21.122451] [<ffffffff817b8039>] blk_mq_complete_request+0x29/0x30
[ 21.122451] [<ffffffff81a078d8>] scsi_mq_done+0x28/0x60
[ 21.122451] [<ffffffff81ad29af>] virtscsi_complete_cmd+0x10f/0x1e0
[ 21.122451] [<ffffffff81ad28a0>] ? virtscsi_ctrl_done+0x30/0x30
[ 21.122451] [<ffffffff81ad27b9>] virtscsi_vq_done+0x49/0x90
[ 21.122451] [<ffffffff81ad2839>] virtscsi_req_done+0x39/0x40
[ 21.122451] [<ffffffff8187a470>] vring_interrupt+0x30/0x60
[ 21.122451] [<ffffffff8187c8ab>] vp_vring_interrupt+0x5b/0x90
[ 21.122451] [<ffffffff811348a0>] handle_irq_event_percpu+0x60/0x1d0
[ 21.122451] [<ffffffff81134a53>] handle_irq_event+0x43/0x70
[ 21.122451] [<ffffffff811376d6>] handle_edge_irq+0x96/0x110
[ 21.122451] [<ffffffff81063fe8>] handle_irq+0x58/0x130
[ 21.122451] [<ffffffff810f96a1>] ? atomic_notifier_call_chain+0x11/0x20
[ 21.122451] [<ffffffff810638d7>] do_IRQ+0x57/0x100
[ 21.122451] [<ffffffff81e3eeb3>] common_interrupt+0x73/0x73
[ 21.122451] <EOI> [<ffffffff810a3726>] ? native_safe_halt+0x6/0x10
[ 21.122451] [<ffffffff811233fd>] ? trace_hardirqs_on+0xd/0x10
[ 21.122451] [<ffffffff8106c8ae>] default_idle+0x1e/0xc0
[ 21.122451] [<ffffffff8106d26a>] arch_cpu_idle+0xa/0x10
[ 21.122451] [<ffffffff81118073>] cpu_startup_entry+0x383/0x430
[ 21.122451] [<ffffffff81e29788>] rest_init+0x128/0x130
[ 21.122451] [<ffffffff81e29660>] ? csum_partial_copy_generic+0x170/0x170
[ 21.122451] [<ffffffff825d21a6>] start_kernel+0x54a/0x557
[ 21.122451] [<ffffffff825d1a49>] ? set_init_arg+0x58/0x58
[ 21.122451] [<ffffffff825d1117>] ? early_idt_handlers+0x117/0x120
[ 21.122451] [<ffffffff825d15f0>] x86_64_start_reservations+0x2a/0x2c
[ 21.122451] [<ffffffff825d1730>] x86_64_start_kernel+0x13e/0x14d

2015-06-04 07:06:14

by Nicholas A. Bellinger

Subject: Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

On Wed, 2015-06-03 at 14:57 +0200, Christoph Hellwig wrote:
> This makes lockdep very unhappy, rightly so. If you execute
> one end_io function inside another you basically nest every possible
> lock taken in the I/O completion path. Also, adding more work
> to the hardirq path generally isn't a smart idea. Can you explain
> what issues you were seeing and how much this helps? Note that
> the workqueue usage in the target core so far is fairly basic, so
> there should be some low-hanging fruit.

So I've been using tcm_loop + RAMDISK backends for prototyping, but this
patch is intended for vhost-scsi so it can avoid the unnecessary
queue_work() context switch within target_complete_cmd() for all backend
driver types.

This is because vhost_work_queue() is just updating vhost_dev->work_list
and immediately wake_up_process() into a different vhost_worker()
process context. For heavy small block workloads into fast IBLOCK
backends, avoiding this extra context switch should be a nice efficiency
win.
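
Roughly (paraphrasing drivers/vhost/vhost.c from memory, not the exact
code), that queueing path looks like:

	/* Paraphrased sketch of vhost_work_queue(); not the exact vhost.c code. */
	void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
	{
		unsigned long flags;

		spin_lock_irqsave(&dev->work_lock, flags);
		if (list_empty(&work->node)) {
			list_add_tail(&work->node, &dev->work_list);
			spin_unlock_irqrestore(&dev->work_lock, flags);
			wake_up_process(dev->worker);	/* switch to vhost_worker() */
		} else {
			spin_unlock_irqrestore(&dev->work_lock, flags);
		}
	}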

Also, AFAIK RDMA fabrics are allowed to do ib_post_send() response
callbacks directly from IRQ context as well.

Perhaps tcm_loop LLD code should just be limited to RAMDISK here..?

2015-06-04 17:01:11

by Sagi Grimberg

Subject: Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

On 6/4/2015 10:06 AM, Nicholas A. Bellinger wrote:
> On Wed, 2015-06-03 at 14:57 +0200, Christoph Hellwig wrote:
>> This makes lockdep very unhappy, rightly so. If you execute
>> one end_io function inside another you basically nest every possible
>> lock taken in the I/O completion path. Also, adding more work
>> to the hardirq path generally isn't a smart idea. Can you explain
>> what issues you were seeing and how much this helps? Note that
>> the workqueue usage in the target core so far is fairly basic, so
>> there should be some low-hanging fruit.
>
> So I've been using tcm_loop + RAMDISK backends for prototyping, but this
> patch is intended for vhost-scsi so it can avoid the unnecessary
> queue_work() context switch within target_complete_cmd() for all backend
> driver types.
>
> This is because vhost_work_queue() is just updating vhost_dev->work_list
> and immediately wake_up_process() into a different vhost_worker()
> process context. For heavy small block workloads into fast IBLOCK
> backends, avoiding this extra context switch should be a nice efficiency
> win.

I can see that. Did you get a chance to measure the expected latency
improvement?

>
> Also, AFAIK RDMA fabrics are allowed to do ib_post_send() response
> callbacks directly from IRQ context as well.

This is correct in general; ib_post_send is not allowed to schedule.
isert/srpt might benefit in latency here, but it would require the
drivers to pre-allocate the sgls (ib_sge's) and use a worst-case
approach (or use GFP_ATOMIC allocations - I'm not sure which is
better...)
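
For illustration only (hypothetical names, not isert/srpt code), the two
options look something like:

	/* Hypothetical illustration of the two options; not actual isert/srpt code. */
	#include <linux/slab.h>
	#include <rdma/ib_verbs.h>

	#define FABRIC_MAX_SGES	128	/* made-up worst-case bound */

	struct fabric_cmd_ctx {
		struct ib_sge sges[FABRIC_MAX_SGES];	/* option 1: pre-allocate at setup */
	};

	static struct ib_sge *fabric_alloc_sges_atomic(unsigned int nents)
	{
		/* option 2: allocate at completion time, usable from IRQ context */
		return kmalloc_array(nents, sizeof(struct ib_sge), GFP_ATOMIC);
	}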

2015-06-09 07:19:31

by Christoph Hellwig

Subject: Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

On Thu, Jun 04, 2015 at 12:06:09AM -0700, Nicholas A. Bellinger wrote:
> So I've been using tcm_loop + RAMDISK backends for prototyping, but this
> patch is intended for vhost-scsi so it can avoid the unnecessary
> queue_work() context switch within target_complete_cmd() for all backend
> driver types.
>
> This is because vhost_work_queue() is just updating vhost_dev->work_list
> and immediately wake_up_process() into a different vhost_worker()
> process context. For heavy small block workloads into fast IBLOCK
> backends, avoiding this extra context switch should be a nice efficiency
> win.

How about trying to merge the two workers instead?

> Perhaps tcm_loop LLD code should just be limited to RAMDISK here..?

I'd prefer not to do it, especially for the loopback code, as that
should serve as a simple example. But before making further judgement
I'd really like to see the numbers.

Note that something that might help much more is getting rid of
the remaining irq or bh disabling spinlocks in the target core,
as that tends to introduce a lot of additional latency. Moving
additional code to hardirq context is fairly diametrical to that
design.

2015-06-09 07:27:31

by Christoph Hellwig

Subject: Re: [RFC 1/2] target: Add support for fabric IRQ completion

> diff --git a/drivers/target/target_core_tpg.c b/drivers/target/target_core_tpg.c
> index 3fbb0d4..aa08d6b 100644
> --- a/drivers/target/target_core_tpg.c
> +++ b/drivers/target/target_core_tpg.c
> @@ -37,6 +37,7 @@
>
> #include <target/target_core_base.h>
> #include <target/target_core_backend.h>
> +#include <target/target_core_configfs.h>
> #include <target/target_core_fabric.h>
>
> #include "target_core_internal.h"

This hunk look unrelated.

On the implementation: the new target_core_fabric_ops flag and
command flag aren't really needed; just add a version of
target_complete_cmd() that takes the flag.

target_complete_irq() is fairly misnamed, as it doesn't get called
just from irq context; please choose a more sensible name for it.
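
Something like this, hypothetically (the name below is made up here, not a
concrete proposal):

	/*
	 * Hypothetical sketch only; the name is made up. Behaves like
	 * target_complete_cmd(), but completes in the caller's context when
	 * no ORDERED sync, non-GOOD status or completion callback requires
	 * deferring to process context.
	 */
	void target_complete_cmd_direct(struct se_cmd *cmd, u8 scsi_status);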

2015-06-10 07:12:12

by Nicholas A. Bellinger

Subject: Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

On Tue, 2015-06-09 at 09:19 +0200, Christoph Hellwig wrote:
> On Thu, Jun 04, 2015 at 12:06:09AM -0700, Nicholas A. Bellinger wrote:
> > So I've been using tcm_loop + RAMDISK backends for prototyping, but this
> > patch is intended for vhost-scsi so it can avoid the unnecessary
> > queue_work() context switch within target_complete_cmd() for all backend
> > driver types.
> >
> > This is because vhost_work_queue() is just updating vhost_dev->work_list
> > and immediately wake_up_process() into a different vhost_worker()
> > process context. For heavy small block workloads into fast IBLOCK
> > backends, avoiding this extra context switch should be a nice efficiency
> > win.
>
> How about trying to merge the two workers instead?
>

IIRC, vhost.c has an existing requirement for running completions within
a single kernel thread context for each vhost_dev context -> vhost-scsi
WWPN.

> > Perhaps tcm_loop LLD code should just be limited to RAMDISK here..?
>
> I'd prefer to not do it especially for the loopback code, as that
> should serve as a simple example.

Fair enough.

> But before making further judgement I'd really like to see the numbers.
>

Sure, will include some performance + context switch results for -v2.

> Note that something that might help much more is getting rid of
> the remaining irq or bh disabling spinlocks in the target core,
> as that tends to introduce a lot of additional latency. Moving
> additional code to hardirq context is fairly diametrical to that
> design.

Within the for-next RCU-enabled target code, the three spinlocks that
disable irqs in the fast-path submit + completion paths are:

* se_cmd->t_state_lock:

Used to update se_cmd->transport_state within target_complete_cmd() from
backend driver irq context, and when passing se_cmd ownership back to
fabric driver code via fast-path transport_cmd_check_stop() response
completion.

Still required while iblock backends are calling target_complete_cmd()
from irq context.

* se_device->execute_task_lock

Used for tracking device TMR tasks. Completion path called from irq
context in transport_cmd_check_stop() -> target_remove_from_state_list()
when passing se_cmd ownership back to fabric driver.

transport_generic_free_cmd() needs to obtain this lock during se_cmd
exception status too, if the failure occurs before se_cmd->execute_cmd()
submission happens.

* se_session->sess_cmd_lock

Originally required for tcm_qla2xxx, where qla_hw_data->hardware_lock
must be held while performing the initial per se_session shutdown of
outstanding + active se_cmd_list entries. Other HW LLD fabric drivers
also need this when target-core is responsible for active I/O
shutdown.

However, not all fabric drivers need to disable irq while acquiring this
specific lock.

2015-06-29 09:51:44

by Sagi Grimberg

Subject: Re: [RFC 1/2] target: Add support for fabric IRQ completion

On 6/9/2015 10:27 AM, Christoph Hellwig wrote:
>> diff --git a/drivers/target/target_core_tpg.c b/drivers/target/target_core_tpg.c
>> index 3fbb0d4..aa08d6b 100644
>> --- a/drivers/target/target_core_tpg.c
>> +++ b/drivers/target/target_core_tpg.c
>> @@ -37,6 +37,7 @@
>>
>> #include <target/target_core_base.h>
>> #include <target/target_core_backend.h>
>> +#include <target/target_core_configfs.h>
>> #include <target/target_core_fabric.h>
>>
>> #include "target_core_internal.h"
>
> This hunk look unrelated.

Nic, can you remove that - it breaks compilation...

Thanks,
Sagi.