2019-11-11 07:36:13

by Baolin Wang

Subject: [PATCH v6 0/4] Add MMC software queue support

Hi All,

Currently the MMC read/write stack always waits for the previous request to
complete, via mmc_blk_rw_wait(), before sending a new request to the hardware,
or queues a work item to complete the request. Both paths add context-switching
overhead, which hurts IO performance, especially at high I/O-per-second
rates.

Thus this patch set introduces MMC software command queue support
based on the command queue engine's interfaces, and sets the queue depth to 32
so that more requests can be prepared, merged and inserted into the IO
scheduler. Only 2 requests are allowed in flight, which is enough to let
the irq handler always trigger the next request without a context switch,
while avoiding a long latency.
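
The split between a deep queue and a small in-flight limit can be sketched in
user space as below. This is only an illustrative model: the names
(sketch_queue, sketch_try_issue, sketch_complete) are hypothetical, there is no
locking, and the limit is treated as a strict cap rather than copying the exact
comparison used in mmc_mq_queue_rq().

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_QUEUE_DEPTH   32 /* requests the block layer may prepare */
#define SKETCH_MAX_IN_FLIGHT 2  /* requests handed to the host at once */

/* Hypothetical stand-in for the per-queue accounting. */
struct sketch_queue {
	int in_flight; /* requests issued but not yet completed */
};

/* Returns true if the request was issued, false if the caller must retry. */
static bool sketch_try_issue(struct sketch_queue *q)
{
	if (q->in_flight >= SKETCH_MAX_IN_FLIGHT)
		return false; /* plays the role of BLK_STS_RESOURCE */

	q->in_flight++;
	return true;
}

/* Called from the (simulated) completion path to free one in-flight slot. */
static void sketch_complete(struct sketch_queue *q)
{
	assert(q->in_flight > 0);
	q->in_flight--;
}
```

A third sketch_try_issue() call is refused until sketch_complete() has run
once, while up to SKETCH_QUEUE_DEPTH requests can still be prepared and merged
above this layer.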

Moreover, the MMC software queue interface can be extended to support
MMC packed requests or packed commands instead of adding new interfaces,
according to the previous discussion.

Below is some comparison data from the fio tool. The fio command I used
is shown below; I varied the '--rw' parameter and enabled the direct
IO flag to measure the actual hardware transfer speed with a 4K block size.

./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read

My eMMC card is working in HS400 Enhanced strobe mode:
[ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
[ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
[ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
[ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
[ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)

1. Without MMC software queue
I ran each case 5 times and report the average speed.

1) Sequential read:
Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
Average speed: 59.66MiB/s

2) Random read:
Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
Average speed: 27.04MiB/s

3) Sequential write:
Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
Average speed: 69.68MiB/s

4) Random write:
Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
Average speed: 35.96MiB/s

2. With MMC software queue
I ran each case 5 times and report the average speed.

1) Sequential read:
Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
Average speed: 60.68MiB/s

2) Random read:
Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
Average speed: 31.36MiB/s

3) Sequential write:
Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
Average speed: 71.66MiB/s

4) Random write:
Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
Average speed: 68.76MiB/s

From the above data, we can see that the MMC software queue clearly improves
performance for random read and write, though there is no obvious improvement
for sequential read and write.
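
The averages and relative gains can be re-derived from the per-run figures with
a few lines of C. This is a sketch for checking the arithmetic only; the helper
names (mean, gain_percent) and array names are hypothetical, and the samples
are copied from the random read/write results above.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0 for an empty set. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];

	return n ? sum / (double)n : 0.0;
}

/* Relative change from 'before' to 'after', in percent. */
static double gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Per-run speeds in MiB/s, copied from the results above. */
static const double rand_read_before[]  = { 26.9, 26.9, 27.1, 27.1, 27.2 };
static const double rand_read_after[]   = { 31.3, 31.4, 31.5, 31.3, 31.3 };
static const double rand_write_before[] = { 36.3, 35.4, 38.6, 34.0, 35.5 };
static const double rand_write_after[]  = { 68.9, 68.7, 68.8, 68.6, 68.8 };
```

Calling gain_percent(mean(rand_read_before, 5), mean(rand_read_after, 5)) gives
roughly 16%, and the same call on the random write arrays gives roughly 91%.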

Any comments are welcome. Thanks a lot.

Hi Ulf,

This patch set has been pending for a while; I've tested it several times and
have not found any regressions. I hope it can be merged into v5.5
if you have no objections, since I still have some patches introducing the
packed request that depend on the MMC software queue, as we discussed before.
Thanks a lot.

Changes from v5:
- Modify the condition for deferring request completion, as suggested by Adrian.

Changes from v4:
- Add a separate patch to introduce a variable to defer completion of
data requests for some host drivers, when using host software queue.

Changes from v3:
- Use host software queue instead of sqhci.
- Fix random config building issue.
- Change queue depth to 32, but still only allow 2 requests in flight.
- Update the testing data.

Changes from v2:
- Remove reference to 'struct cqhci_host' and 'struct cqhci_slot',
instead adding 'struct sqhci_host', which is only used by software queue.

Changes from v1:
- Add request_done ops for sdhci_ops.
- Replace virtual command queue with software queue for functions and
variables.
- Rename the software queue file and add sqhci.h header file.

Baolin Wang (4):
mmc: Add MMC host software queue support
mmc: host: sdhci: Add request_done ops for struct sdhci_ops
mmc: host: sdhci-sprd: Add software queue support
mmc: host: sdhci: Add a variable to defer to complete requests if
needed

drivers/mmc/core/block.c | 61 ++++++++
drivers/mmc/core/mmc.c | 13 +-
drivers/mmc/core/queue.c | 33 +++-
drivers/mmc/host/Kconfig | 8 +
drivers/mmc/host/Makefile | 1 +
drivers/mmc/host/mmc_hsq.c | 344 +++++++++++++++++++++++++++++++++++++++++
drivers/mmc/host/mmc_hsq.h | 30 ++++
drivers/mmc/host/sdhci-sprd.c | 26 ++++
drivers/mmc/host/sdhci.c | 14 +-
drivers/mmc/host/sdhci.h | 3 +
include/linux/mmc/host.h | 3 +
11 files changed, 523 insertions(+), 13 deletions(-)
create mode 100644 drivers/mmc/host/mmc_hsq.c
create mode 100644 drivers/mmc/host/mmc_hsq.h

--
1.7.9.5


2019-11-11 07:36:23

by Baolin Wang

Subject: [PATCH v6 1/4] mmc: Add MMC host software queue support

Currently the MMC read/write stack always waits for the previous request to
complete, via mmc_blk_rw_wait(), before sending a new request to the hardware,
or queues a work item to complete the request. Both paths add context-switching
overhead, which hurts IO performance, especially at high I/O-per-second
rates.

Thus this patch introduces an MMC software queue interface based on the
hardware command queue engine's interfaces. It follows the same idea as the
hardware command queue engine and can remove the context switching.
Moreover, we set the default queue depth to 32 for the software
queue, which allows more requests to be prepared, merged and inserted
into the IO scheduler to improve performance, but we only allow 2 requests
in flight, which is enough to let the irq handler always trigger the
next request without a context switch, while avoiding a long latency.

From the fio testing data in the cover letter, we can see that the software
queue improves performance with a 4K block size: about 16% for random
read and about 90% for random write, though there is no obvious
improvement for sequential read and write.

Moreover, the software queue interface can be extended to support MMC
packed requests or packed commands in the future.

Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/core/block.c | 61 ++++++++
drivers/mmc/core/mmc.c | 13 +-
drivers/mmc/core/queue.c | 33 ++++-
drivers/mmc/host/Kconfig | 7 +
drivers/mmc/host/Makefile | 1 +
drivers/mmc/host/mmc_hsq.c | 344 ++++++++++++++++++++++++++++++++++++++++++++
drivers/mmc/host/mmc_hsq.h | 30 ++++
include/linux/mmc/host.h | 3 +
8 files changed, 482 insertions(+), 10 deletions(-)
create mode 100644 drivers/mmc/host/mmc_hsq.c
create mode 100644 drivers/mmc/host/mmc_hsq.h

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 2c71a43..870462c 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -168,6 +168,11 @@ struct mmc_rpmb_data {

static inline int mmc_blk_part_switch(struct mmc_card *card,
unsigned int part_type);
+static void mmc_blk_rw_rq_prep(struct mmc_queue_req *mqrq,
+ struct mmc_card *card,
+ int disable_multi,
+ struct mmc_queue *mq);
+static void mmc_blk_swq_req_done(struct mmc_request *mrq);

static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
{
@@ -1569,9 +1574,30 @@ static int mmc_blk_cqe_issue_flush(struct mmc_queue *mq, struct request *req)
return mmc_blk_cqe_start_req(mq->card->host, mrq);
}

+static int mmc_blk_swq_issue_rw_rq(struct mmc_queue *mq, struct request *req)
+{
+ struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+ struct mmc_host *host = mq->card->host;
+ int err;
+
+ mmc_blk_rw_rq_prep(mqrq, mq->card, 0, mq);
+ mqrq->brq.mrq.done = mmc_blk_swq_req_done;
+ mmc_pre_req(host, &mqrq->brq.mrq);
+
+ err = mmc_cqe_start_req(host, &mqrq->brq.mrq);
+ if (err)
+ mmc_post_req(host, &mqrq->brq.mrq, err);
+
+ return err;
+}
+
static int mmc_blk_cqe_issue_rw_rq(struct mmc_queue *mq, struct request *req)
{
struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+ struct mmc_host *host = mq->card->host;
+
+ if (host->swq_enabled)
+ return mmc_blk_swq_issue_rw_rq(mq, req);

mmc_blk_data_prep(mq, mqrq, 0, NULL, NULL);

@@ -1957,6 +1983,41 @@ static void mmc_blk_urgent_bkops(struct mmc_queue *mq,
mmc_run_bkops(mq->card);
}

+static void mmc_blk_swq_req_done(struct mmc_request *mrq)
+{
+ struct mmc_queue_req *mqrq =
+ container_of(mrq, struct mmc_queue_req, brq.mrq);
+ struct request *req = mmc_queue_req_to_req(mqrq);
+ struct request_queue *q = req->q;
+ struct mmc_queue *mq = q->queuedata;
+ struct mmc_host *host = mq->card->host;
+ unsigned long flags;
+
+ if (mmc_blk_rq_error(&mqrq->brq) ||
+ mmc_blk_urgent_bkops_needed(mq, mqrq)) {
+ spin_lock_irqsave(&mq->lock, flags);
+ mq->recovery_needed = true;
+ mq->recovery_req = req;
+ spin_unlock_irqrestore(&mq->lock, flags);
+
+ host->cqe_ops->cqe_recovery_start(host);
+
+ schedule_work(&mq->recovery_work);
+ return;
+ }
+
+ mmc_blk_rw_reset_success(mq, req);
+
+ /*
+ * Block layer timeouts race with completions which means the normal
+ * completion path cannot be used during recovery.
+ */
+ if (mq->in_recovery)
+ mmc_blk_cqe_complete_rq(mq, req);
+ else
+ blk_mq_complete_request(req);
+}
+
void mmc_blk_mq_complete(struct request *req)
{
struct mmc_queue *mq = req->q->queuedata;
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index c880489..8eac1a2 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1852,15 +1852,22 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
*/
card->reenable_cmdq = card->ext_csd.cmdq_en;

- if (card->ext_csd.cmdq_en && !host->cqe_enabled) {
+ if (host->cqe_ops && !host->cqe_enabled) {
err = host->cqe_ops->cqe_enable(host, card);
if (err) {
pr_err("%s: Failed to enable CQE, error %d\n",
mmc_hostname(host), err);
} else {
host->cqe_enabled = true;
- pr_info("%s: Command Queue Engine enabled\n",
- mmc_hostname(host));
+
+ if (card->ext_csd.cmdq_en) {
+ pr_info("%s: Command Queue Engine enabled\n",
+ mmc_hostname(host));
+ } else {
+ host->swq_enabled = true;
+ pr_info("%s: Software Queue enabled\n",
+ mmc_hostname(host));
+ }
}
}

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 9edc086..d9086c1 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -62,7 +62,7 @@ enum mmc_issue_type mmc_issue_type(struct mmc_queue *mq, struct request *req)
{
struct mmc_host *host = mq->card->host;

- if (mq->use_cqe)
+ if (mq->use_cqe && !host->swq_enabled)
return mmc_cqe_issue_type(host, req);

if (req_op(req) == REQ_OP_READ || req_op(req) == REQ_OP_WRITE)
@@ -124,12 +124,14 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
{
struct request_queue *q = req->q;
struct mmc_queue *mq = q->queuedata;
+ struct mmc_card *card = mq->card;
+ struct mmc_host *host = card->host;
unsigned long flags;
int ret;

spin_lock_irqsave(&mq->lock, flags);

- if (mq->recovery_needed || !mq->use_cqe)
+ if (mq->recovery_needed || !mq->use_cqe || host->swq_enabled)
ret = BLK_EH_RESET_TIMER;
else
ret = mmc_cqe_timed_out(req);
@@ -144,12 +146,13 @@ static void mmc_mq_recovery_handler(struct work_struct *work)
struct mmc_queue *mq = container_of(work, struct mmc_queue,
recovery_work);
struct request_queue *q = mq->queue;
+ struct mmc_host *host = mq->card->host;

mmc_get_card(mq->card, &mq->ctx);

mq->in_recovery = true;

- if (mq->use_cqe)
+ if (mq->use_cqe && !host->swq_enabled)
mmc_blk_cqe_recovery(mq);
else
mmc_blk_mq_recovery(mq);
@@ -160,6 +163,9 @@ static void mmc_mq_recovery_handler(struct work_struct *work)
mq->recovery_needed = false;
spin_unlock_irq(&mq->lock);

+ if (host->swq_enabled)
+ host->cqe_ops->cqe_recovery_finish(host);
+
mmc_put_card(mq->card, &mq->ctx);

blk_mq_run_hw_queues(q, true);
@@ -279,6 +285,14 @@ static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
}
break;
case MMC_ISSUE_ASYNC:
+ /*
+ * For MMC host software queue, we only allow 2 requests in
+ * flight to avoid a long latency.
+ */
+ if (host->swq_enabled && mq->in_flight[issue_type] > 2) {
+ spin_unlock_irq(&mq->lock);
+ return BLK_STS_RESOURCE;
+ }
break;
default:
/*
@@ -430,11 +444,16 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
* The queue depth for CQE must match the hardware because the request
* tag is used to index the hardware queue.
*/
- if (mq->use_cqe)
- mq->tag_set.queue_depth =
- min_t(int, card->ext_csd.cmdq_depth, host->cqe_qdepth);
- else
+ if (mq->use_cqe) {
+ if (host->swq_enabled)
+ mq->tag_set.queue_depth = host->cqe_qdepth;
+ else
+ mq->tag_set.queue_depth =
+ min_t(int, card->ext_csd.cmdq_depth, host->cqe_qdepth);
+ } else {
mq->tag_set.queue_depth = MMC_QUEUE_DEPTH;
+ }
+
mq->tag_set.numa_node = NUMA_NO_NODE;
mq->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
mq->tag_set.nr_hw_queues = 1;
diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 49ea02c..efa4019 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -936,6 +936,13 @@ config MMC_CQHCI

If unsure, say N.

+config MMC_HSQ
+ tristate "MMC Host Software Queue support"
+ help
+ This selects the Software Queue support.
+
+ If unsure, say N.
+
config MMC_TOSHIBA_PCI
tristate "Toshiba Type A SD/MMC Card Interface Driver"
depends on PCI
diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
index 11c4598..c14b439 100644
--- a/drivers/mmc/host/Makefile
+++ b/drivers/mmc/host/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_MMC_SDHCI_BRCMSTB) += sdhci-brcmstb.o
obj-$(CONFIG_MMC_SDHCI_OMAP) += sdhci-omap.o
obj-$(CONFIG_MMC_SDHCI_SPRD) += sdhci-sprd.o
obj-$(CONFIG_MMC_CQHCI) += cqhci.o
+obj-$(CONFIG_MMC_HSQ) += mmc_hsq.o

ifeq ($(CONFIG_CB710_DEBUG),y)
CFLAGS-cb710-mmc += -DDEBUG
diff --git a/drivers/mmc/host/mmc_hsq.c b/drivers/mmc/host/mmc_hsq.c
new file mode 100644
index 0000000..f5a4f93
--- /dev/null
+++ b/drivers/mmc/host/mmc_hsq.c
@@ -0,0 +1,344 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * MMC software queue support based on command queue interfaces
+ *
+ * Copyright (C) 2019 Linaro, Inc.
+ * Author: Baolin Wang <[email protected]>
+ */
+
+#include <linux/mmc/card.h>
+#include <linux/mmc/host.h>
+
+#include "mmc_hsq.h"
+
+#define HSQ_NUM_SLOTS 32
+#define HSQ_INVALID_TAG HSQ_NUM_SLOTS
+
+static void mmc_hsq_pump_requests(struct mmc_hsq *hsq)
+{
+ struct mmc_host *mmc = hsq->mmc;
+ struct hsq_slot *slot;
+ unsigned long flags;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ /* Make sure we are not already running a request now */
+ if (hsq->mrq) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return;
+ }
+
+ /* Make sure there are remaining requests to pump */
+ if (!hsq->qcnt || !hsq->enabled) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return;
+ }
+
+ slot = &hsq->slot[hsq->next_tag];
+ hsq->mrq = slot->mrq;
+ hsq->qcnt--;
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+
+ mmc->ops->request(mmc, hsq->mrq);
+}
+
+static void mmc_hsq_update_next_tag(struct mmc_hsq *hsq, int remains)
+{
+ struct hsq_slot *slot;
+ int tag;
+
+ /*
+ * If there are no remaining requests in the software queue, then set an
+ * invalid tag.
+ */
+ if (!remains) {
+ hsq->next_tag = HSQ_INVALID_TAG;
+ return;
+ }
+
+ /*
+ * Increase the next tag and check whether the corresponding request is
+ * available; if so, we have found a candidate request.
+ */
+ if (++hsq->next_tag != HSQ_INVALID_TAG) {
+ slot = &hsq->slot[hsq->next_tag];
+ if (slot->mrq)
+ return;
+ }
+
+ /* Otherwise we should iterate all slots to find an available tag. */
+ for (tag = 0; tag < HSQ_NUM_SLOTS; tag++) {
+ slot = &hsq->slot[tag];
+ if (slot->mrq)
+ break;
+ }
+
+ if (tag == HSQ_NUM_SLOTS)
+ tag = HSQ_INVALID_TAG;
+
+ hsq->next_tag = tag;
+}
+
+static void mmc_hsq_post_request(struct mmc_hsq *hsq)
+{
+ unsigned long flags;
+ int remains;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ remains = hsq->qcnt;
+ hsq->mrq = NULL;
+
+ /* Update the next available tag to be queued. */
+ mmc_hsq_update_next_tag(hsq, remains);
+
+ if (hsq->waiting_for_idle && !remains) {
+ hsq->waiting_for_idle = false;
+ wake_up(&hsq->wait_queue);
+ }
+
+ /* Do not pump new request in recovery mode. */
+ if (hsq->recovery_halt) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return;
+ }
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+
+ /*
+ * Try to pump new request to host controller as fast as possible,
+ * after completing previous request.
+ */
+ if (remains > 0)
+ mmc_hsq_pump_requests(hsq);
+}
+
+/**
+ * mmc_hsq_finalize_request - finalize one request if the request is done
+ * @mmc: the host controller
+ * @mrq: the request that needs to be finalized
+ *
+ * Return true if we finalized the corresponding request in software queue,
+ * otherwise return false.
+ */
+bool mmc_hsq_finalize_request(struct mmc_host *mmc, struct mmc_request *mrq)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ if (!hsq->enabled || !hsq->mrq || hsq->mrq != mrq) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return false;
+ }
+
+ /*
+ * Clear the completed slot's request to make room for a new request.
+ */
+ hsq->slot[hsq->next_tag].mrq = NULL;
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+
+ mmc_cqe_request_done(mmc, hsq->mrq);
+
+ mmc_hsq_post_request(hsq);
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_finalize_request);
+
+static void mmc_hsq_recovery_start(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ hsq->recovery_halt = true;
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+}
+
+static void mmc_hsq_recovery_finish(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ int remains;
+
+ spin_lock_irq(&hsq->lock);
+
+ hsq->recovery_halt = false;
+ remains = hsq->qcnt;
+
+ spin_unlock_irq(&hsq->lock);
+
+ /*
+ * Try to pump a new request if there are requests pending in the
+ * software queue after finishing recovery.
+ */
+ if (remains > 0)
+ mmc_hsq_pump_requests(hsq);
+}
+
+static int mmc_hsq_request(struct mmc_host *mmc, struct mmc_request *mrq)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ int tag = mrq->tag;
+
+ spin_lock_irq(&hsq->lock);
+
+ if (!hsq->enabled) {
+ spin_unlock_irq(&hsq->lock);
+ return -ESHUTDOWN;
+ }
+
+ /* Do not queue any new requests in recovery mode. */
+ if (hsq->recovery_halt) {
+ spin_unlock_irq(&hsq->lock);
+ return -EBUSY;
+ }
+
+ hsq->slot[tag].mrq = mrq;
+
+ /*
+ * Set the next tag as current request tag if no available
+ * next tag.
+ */
+ if (hsq->next_tag == HSQ_INVALID_TAG)
+ hsq->next_tag = tag;
+
+ hsq->qcnt++;
+
+ spin_unlock_irq(&hsq->lock);
+
+ mmc_hsq_pump_requests(hsq);
+
+ return 0;
+}
+
+static void mmc_hsq_post_req(struct mmc_host *mmc, struct mmc_request *mrq)
+{
+ if (mmc->ops->post_req)
+ mmc->ops->post_req(mmc, mrq, 0);
+}
+
+static bool mmc_hsq_queue_is_idle(struct mmc_hsq *hsq, int *ret)
+{
+ bool is_idle;
+
+ spin_lock_irq(&hsq->lock);
+
+ is_idle = (!hsq->mrq && !hsq->qcnt) ||
+ hsq->recovery_halt;
+
+ *ret = hsq->recovery_halt ? -EBUSY : 0;
+ hsq->waiting_for_idle = !is_idle;
+
+ spin_unlock_irq(&hsq->lock);
+
+ return is_idle;
+}
+
+static int mmc_hsq_wait_for_idle(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ int ret;
+
+ wait_event(hsq->wait_queue,
+ mmc_hsq_queue_is_idle(hsq, &ret));
+
+ return ret;
+}
+
+static void mmc_hsq_disable(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ u32 timeout = 500;
+ int ret;
+
+ spin_lock_irq(&hsq->lock);
+
+ if (!hsq->enabled) {
+ spin_unlock_irq(&hsq->lock);
+ return;
+ }
+
+ spin_unlock_irq(&hsq->lock);
+
+ ret = wait_event_timeout(hsq->wait_queue,
+ mmc_hsq_queue_is_idle(hsq, &ret),
+ msecs_to_jiffies(timeout));
+ if (ret == 0) {
+ pr_warn("could not stop mmc software queue\n");
+ return;
+ }
+
+ spin_lock_irq(&hsq->lock);
+
+ hsq->enabled = false;
+
+ spin_unlock_irq(&hsq->lock);
+}
+
+static int mmc_hsq_enable(struct mmc_host *mmc, struct mmc_card *card)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+
+ spin_lock_irq(&hsq->lock);
+
+ if (hsq->enabled) {
+ spin_unlock_irq(&hsq->lock);
+ return -EBUSY;
+ }
+
+ hsq->enabled = true;
+
+ spin_unlock_irq(&hsq->lock);
+
+ return 0;
+}
+
+static const struct mmc_cqe_ops mmc_hsq_ops = {
+ .cqe_enable = mmc_hsq_enable,
+ .cqe_disable = mmc_hsq_disable,
+ .cqe_request = mmc_hsq_request,
+ .cqe_post_req = mmc_hsq_post_req,
+ .cqe_wait_for_idle = mmc_hsq_wait_for_idle,
+ .cqe_recovery_start = mmc_hsq_recovery_start,
+ .cqe_recovery_finish = mmc_hsq_recovery_finish,
+};
+
+int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc)
+{
+ hsq->num_slots = HSQ_NUM_SLOTS;
+ hsq->next_tag = HSQ_INVALID_TAG;
+ mmc->cqe_qdepth = HSQ_NUM_SLOTS;
+
+ hsq->slot = devm_kcalloc(mmc_dev(mmc), hsq->num_slots,
+ sizeof(struct hsq_slot), GFP_KERNEL);
+ if (!hsq->slot)
+ return -ENOMEM;
+
+ hsq->mmc = mmc;
+ hsq->mmc->cqe_private = hsq;
+ mmc->cqe_ops = &mmc_hsq_ops;
+
+ spin_lock_init(&hsq->lock);
+ init_waitqueue_head(&hsq->wait_queue);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_init);
+
+void mmc_hsq_suspend(struct mmc_host *mmc)
+{
+ mmc_hsq_disable(mmc);
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_suspend);
+
+int mmc_hsq_resume(struct mmc_host *mmc)
+{
+ return mmc_hsq_enable(mmc, NULL);
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_resume);
diff --git a/drivers/mmc/host/mmc_hsq.h b/drivers/mmc/host/mmc_hsq.h
new file mode 100644
index 0000000..d51beb7
--- /dev/null
+++ b/drivers/mmc/host/mmc_hsq.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_MMC_HSQ_H
+#define LINUX_MMC_HSQ_H
+
+struct hsq_slot {
+ struct mmc_request *mrq;
+};
+
+struct mmc_hsq {
+ struct mmc_host *mmc;
+ struct mmc_request *mrq;
+ wait_queue_head_t wait_queue;
+ struct hsq_slot *slot;
+ spinlock_t lock;
+
+ int next_tag;
+ int num_slots;
+ int qcnt;
+
+ bool enabled;
+ bool waiting_for_idle;
+ bool recovery_halt;
+};
+
+int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc);
+void mmc_hsq_suspend(struct mmc_host *mmc);
+int mmc_hsq_resume(struct mmc_host *mmc);
+bool mmc_hsq_finalize_request(struct mmc_host *mmc, struct mmc_request *mrq);
+
+#endif
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index ba70338..3931aa3 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -462,6 +462,9 @@ struct mmc_host {
bool cqe_enabled;
bool cqe_on;

+ /* Software Queue support */
+ bool swq_enabled;
+
unsigned long private[0] ____cacheline_aligned;
};

--
1.7.9.5
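
The slot bookkeeping in mmc_hsq.c (enqueue into slot[tag], pump the request at
next_tag, then pick the following tag on completion) can be followed more
easily in a small user-space model. The sketch below is an assumption-laden
simplification: the model_* names are hypothetical, there is no locking, no
recovery handling and no "one request at a time" guard, and the increment
check uses a bounds test so the model never indexes past the slot array.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS   32
#define INVALID_TAG NUM_SLOTS

/* Hypothetical user-space stand-in for struct mmc_hsq. */
struct model_hsq {
	const void *slot[NUM_SLOTS]; /* non-NULL: a request is queued at this tag */
	int next_tag;                /* tag to issue next, or INVALID_TAG */
	int qcnt;                    /* queued requests not yet issued */
};

static void model_init(struct model_hsq *h)
{
	int i;

	for (i = 0; i < NUM_SLOTS; i++)
		h->slot[i] = NULL;
	h->next_tag = INVALID_TAG;
	h->qcnt = 0;
}

/* Follows the queueing step of mmc_hsq_request(). */
static void model_enqueue(struct model_hsq *h, int tag, const void *req)
{
	assert(tag >= 0 && tag < NUM_SLOTS);
	h->slot[tag] = req;
	if (h->next_tag == INVALID_TAG)
		h->next_tag = tag;
	h->qcnt++;
}

/* Follows the steps of mmc_hsq_update_next_tag(). */
static void model_update_next_tag(struct model_hsq *h, int remains)
{
	int tag;

	if (!remains) {
		h->next_tag = INVALID_TAG;
		return;
	}

	/* Fast path: the very next tag already holds a request. */
	if (++h->next_tag < NUM_SLOTS && h->slot[h->next_tag])
		return;

	/* Slow path: scan all slots for the first queued request. */
	for (tag = 0; tag < NUM_SLOTS; tag++)
		if (h->slot[tag])
			break;

	h->next_tag = tag; /* equals INVALID_TAG when nothing was found */
}

/* Issue the request at next_tag; returns its tag, or -1 if the queue is empty. */
static int model_pump(struct model_hsq *h)
{
	if (!h->qcnt)
		return -1;
	h->qcnt--;
	return h->next_tag;
}

/* Complete the request that was just issued and choose the next tag. */
static void model_complete(struct model_hsq *h)
{
	h->slot[h->next_tag] = NULL;
	model_update_next_tag(h, h->qcnt);
}
```

With requests queued at tags 5, 6 and 2, this model issues them in the order
5, 6, 2: tag 6 is found by the fast path, tag 2 by the scan from slot 0.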

2019-11-11 07:36:43

by Baolin Wang

Subject: [PATCH v6 2/4] mmc: host: sdhci: Add request_done ops for struct sdhci_ops

Add a request_done op to struct sdhci_ops in preparation for host
controllers that complete a request in a different way, such as
supporting request completion for the MMC software queue.

Suggested-by: Adrian Hunter <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/host/sdhci.c | 12 ++++++++++--
drivers/mmc/host/sdhci.h | 2 ++
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index b056400..850241f 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -2729,7 +2729,10 @@ static bool sdhci_request_done(struct sdhci_host *host)

spin_unlock_irqrestore(&host->lock, flags);

- mmc_request_done(host->mmc, mrq);
+ if (host->ops->request_done)
+ host->ops->request_done(host, mrq);
+ else
+ mmc_request_done(host->mmc, mrq);

return false;
}
@@ -3157,7 +3160,12 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)

/* Process mrqs ready for immediate completion */
for (i = 0; i < SDHCI_MAX_MRQS; i++) {
- if (mrqs_done[i])
+ if (!mrqs_done[i])
+ continue;
+
+ if (host->ops->request_done)
+ host->ops->request_done(host, mrqs_done[i]);
+ else
mmc_request_done(host->mmc, mrqs_done[i]);
}

diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 0ed3e0e..d89cdb9 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -644,6 +644,8 @@ struct sdhci_ops {
void (*voltage_switch)(struct sdhci_host *host);
void (*adma_write_desc)(struct sdhci_host *host, void **desc,
dma_addr_t addr, int len, unsigned int cmd);
+ void (*request_done)(struct sdhci_host *host,
+ struct mmc_request *mrq);
};

#ifdef CONFIG_MMC_SDHCI_IO_ACCESSORS
--
1.7.9.5
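
The change above is an instance of the "optional op with a default fallback"
pattern. A minimal user-space sketch of that pattern follows; all demo_* names
are hypothetical and the counters exist only so the two paths can be told
apart.

```c
#include <assert.h>
#include <stddef.h>

struct demo_request {
	int id;
};

struct demo_host;

struct demo_ops {
	/* Optional: NULL means "use the default completion path". */
	void (*request_done)(struct demo_host *host, struct demo_request *req);
};

struct demo_host {
	const struct demo_ops *ops;
	int default_done; /* completions that took the default path */
	int custom_done;  /* completions handled by the driver's op */
};

/* Plays the role of mmc_request_done() in this sketch. */
static void demo_default_done(struct demo_host *host, struct demo_request *req)
{
	(void)req;
	host->default_done++;
}

/* Plays the role of a driver-specific request_done op. */
static void demo_custom_done(struct demo_host *host, struct demo_request *req)
{
	(void)req;
	host->custom_done++;
}

/* Dispatch: prefer the driver's op when it is provided. */
static void demo_complete(struct demo_host *host, struct demo_request *req)
{
	if (host->ops && host->ops->request_done)
		host->ops->request_done(host, req);
	else
		demo_default_done(host, req);
}
```

Drivers that leave request_done unset keep the existing behaviour, so the new
op costs existing users nothing beyond one pointer check per completion.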

2019-11-11 07:38:06

by Baolin Wang

Subject: [PATCH v6 3/4] mmc: host: sdhci-sprd: Add software queue support

Add software queue support to improve the performance.

Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/host/Kconfig | 1 +
drivers/mmc/host/sdhci-sprd.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index efa4019..54b86f6 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -632,6 +632,7 @@ config MMC_SDHCI_SPRD
depends on ARCH_SPRD
depends on MMC_SDHCI_PLTFM
select MMC_SDHCI_IO_ACCESSORS
+ select MMC_HSQ
help
This selects the SDIO Host Controller in Spreadtrum
SoCs, this driver supports R11(IP version: R11P0).
diff --git a/drivers/mmc/host/sdhci-sprd.c b/drivers/mmc/host/sdhci-sprd.c
index d07b979..3cc1277 100644
--- a/drivers/mmc/host/sdhci-sprd.c
+++ b/drivers/mmc/host/sdhci-sprd.c
@@ -19,6 +19,7 @@
#include <linux/slab.h>

#include "sdhci-pltfm.h"
+#include "mmc_hsq.h"

/* SDHCI_ARGUMENT2 register high 16bit */
#define SDHCI_SPRD_ARG2_STUFF GENMASK(31, 16)
@@ -379,6 +380,16 @@ static unsigned int sdhci_sprd_get_ro(struct sdhci_host *host)
return 0;
}

+static void sdhci_sprd_request_done(struct sdhci_host *host,
+ struct mmc_request *mrq)
+{
+ /* First check whether the request came from the software queue. */
+ if (mmc_hsq_finalize_request(host->mmc, mrq))
+ return;
+
+ mmc_request_done(host->mmc, mrq);
+}
+
static struct sdhci_ops sdhci_sprd_ops = {
.read_l = sdhci_sprd_readl,
.write_l = sdhci_sprd_writel,
@@ -392,6 +403,7 @@ static unsigned int sdhci_sprd_get_ro(struct sdhci_host *host)
.hw_reset = sdhci_sprd_hw_reset,
.get_max_timeout_count = sdhci_sprd_get_max_timeout_count,
.get_ro = sdhci_sprd_get_ro,
+ .request_done = sdhci_sprd_request_done,
};

static void sdhci_sprd_request(struct mmc_host *mmc, struct mmc_request *mrq)
@@ -521,6 +533,7 @@ static int sdhci_sprd_probe(struct platform_device *pdev)
{
struct sdhci_host *host;
struct sdhci_sprd_host *sprd_host;
+ struct mmc_hsq *hsq;
struct clk *clk;
int ret = 0;

@@ -631,6 +644,16 @@ static int sdhci_sprd_probe(struct platform_device *pdev)

sprd_host->flags = host->flags;

+ hsq = devm_kzalloc(&pdev->dev, sizeof(*hsq), GFP_KERNEL);
+ if (!hsq) {
+ ret = -ENOMEM;
+ goto err_cleanup_host;
+ }
+
+ ret = mmc_hsq_init(hsq, host->mmc);
+ if (ret)
+ goto err_cleanup_host;
+
ret = __sdhci_add_host(host);
if (ret)
goto err_cleanup_host;
@@ -689,6 +712,7 @@ static int sdhci_sprd_runtime_suspend(struct device *dev)
struct sdhci_host *host = dev_get_drvdata(dev);
struct sdhci_sprd_host *sprd_host = TO_SPRD_HOST(host);

+ mmc_hsq_suspend(host->mmc);
sdhci_runtime_suspend_host(host);

clk_disable_unprepare(sprd_host->clk_sdio);
@@ -717,6 +741,8 @@ static int sdhci_sprd_runtime_resume(struct device *dev)
goto clk_disable;

sdhci_runtime_resume_host(host, 1);
+ mmc_hsq_resume(host->mmc);
+
return 0;

clk_disable:
--
1.7.9.5

2019-11-11 07:39:15

by Baolin Wang

Subject: [PATCH v6 4/4] mmc: host: sdhci: Add a variable to defer to complete requests if needed

When using the host software queue, the next request is triggered in the
irq handler without a context switch. But for some host drivers,
sdhci_request() cannot be called in interrupt context when using the host
software queue, because the get_cd() op may sleep.

For other host drivers, such as the Spreadtrum host driver, the card is
nonremovable, so the get_cd() op does not sleep, which means we can
complete the data request and trigger the next request in the irq handler
to remove the context switch for the Spreadtrum host driver.

Thus we still need to introduce a variable in struct sdhci_host to indicate
that request completion must always be deferred when sdhci_request()
cannot be called in interrupt context for some host drivers, when using
the host software queue.

Suggested-by: Adrian Hunter <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/host/sdhci.c | 2 +-
drivers/mmc/host/sdhci.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 850241f..4bef066 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -3035,7 +3035,7 @@ static inline bool sdhci_defer_done(struct sdhci_host *host,
{
struct mmc_data *data = mrq->data;

- return host->pending_reset ||
+ return host->pending_reset || host->always_defer_done ||
((host->flags & SDHCI_REQ_USE_DMA) && data &&
data->host_cookie == COOKIE_MAPPED);
}
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index d89cdb9..a73ce89 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -533,6 +533,7 @@ struct sdhci_host {
bool pending_reset; /* Cmd/data reset is pending */
bool irq_wake_enabled; /* IRQ wakeup is enabled */
bool v4_mode; /* Host Version 4 Enable */
+ bool always_defer_done; /* Always defer to complete requests */

struct mmc_request *mrqs_done[SDHCI_MAX_MRQS]; /* Requests done */
struct mmc_command *cmd; /* Current command */
--
1.7.9.5
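
The effect of the new flag on the deferral decision can be checked with a tiny
user-space model of the predicate. This is a sketch only: model_host and
model_defer_done are hypothetical names, uses_dma stands in for the
SDHCI_REQ_USE_DMA flag test, and data_mapped stands in for
"data && data->host_cookie == COOKIE_MAPPED".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the struct sdhci_host fields used here. */
struct model_host {
	bool pending_reset;     /* cmd/data reset is pending */
	bool always_defer_done; /* driver asked to always defer completion */
	bool uses_dma;          /* stands in for SDHCI_REQ_USE_DMA */
};

/* True when completion must be deferred out of the hard irq handler. */
static bool model_defer_done(const struct model_host *host, bool data_mapped)
{
	return host->pending_reset || host->always_defer_done ||
	       (host->uses_dma && data_mapped);
}
```

With always_defer_done set, the predicate is true regardless of the other
inputs; with it clear, the result is the same as before the patch.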

2019-11-11 07:47:12

by Adrian Hunter

Subject: Re: [PATCH v6 4/4] mmc: host: sdhci: Add a variable to defer to complete requests if needed

On 11/11/19 9:34 AM, Baolin Wang wrote:
> When using the host software queue, it will trigger the next request in
> irq handler without a context switch. But the sdhci_request() can not be
> called in interrupt context when using host software queue for some host
> drivers, due to the get_cd() ops can be sleepable.
>
> But for some host drivers, such as Spreadtrum host driver, the card is
> nonremovable, so the get_cd() ops is not sleepable, which means we can
> complete the data request and trigger the next request in irq handler
> to remove the context switch for the Spreadtrum host driver.
>
> Thus we still need introduce a variable in struct sdhci_host to indicate
> that we will always to defer to complete requests if the sdhci_request()
> can not be called in interrupt context for some host drivers, when using
> the host software queue.

Sorry, I assumed you would set host->always_defer_done = true for the
Spreadtrum host driver in patch "mmc: host: sdhci-sprd: Add software queue
support" and put this patch before it.

>
> Suggested-by: Adrian Hunter <[email protected]>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> drivers/mmc/host/sdhci.c | 2 +-
> drivers/mmc/host/sdhci.h | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index 850241f..4bef066 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -3035,7 +3035,7 @@ static inline bool sdhci_defer_done(struct sdhci_host *host,
> {
> struct mmc_data *data = mrq->data;
>
> - return host->pending_reset ||
> + return host->pending_reset || host->always_defer_done ||
> ((host->flags & SDHCI_REQ_USE_DMA) && data &&
> data->host_cookie == COOKIE_MAPPED);
> }
> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
> index d89cdb9..a73ce89 100644
> --- a/drivers/mmc/host/sdhci.h
> +++ b/drivers/mmc/host/sdhci.h
> @@ -533,6 +533,7 @@ struct sdhci_host {
> bool pending_reset; /* Cmd/data reset is pending */
> bool irq_wake_enabled; /* IRQ wakeup is enabled */
> bool v4_mode; /* Host Version 4 Enable */
> + bool always_defer_done; /* Always defer to complete requests */
>
> struct mmc_request *mrqs_done[SDHCI_MAX_MRQS]; /* Requests done */
> struct mmc_command *cmd; /* Current command */
>

2019-11-11 08:00:28

by Baolin Wang

Subject: Re: [PATCH v6 4/4] mmc: host: sdhci: Add a variable to defer to complete requests if needed

On Mon, 11 Nov 2019 at 15:45, Adrian Hunter <[email protected]> wrote:
>
> On 11/11/19 9:34 AM, Baolin Wang wrote:
> > When using the host software queue, it will trigger the next request in
> > irq handler without a context switch. But the sdhci_request() can not be
> > called in interrupt context when using host software queue for some host
> > drivers, due to the get_cd() ops can be sleepable.
> >
> > But for some host drivers, such as Spreadtrum host driver, the card is
> > nonremovable, so the get_cd() ops is not sleepable, which means we can
> > complete the data request and trigger the next request in irq handler
> > to remove the context switch for the Spreadtrum host driver.
> >
> > Thus we still need introduce a variable in struct sdhci_host to indicate
> > that we will always to defer to complete requests if the sdhci_request()
> > can not be called in interrupt context for some host drivers, when using
> > the host software queue.
>
> Sorry, I assumed you would set host->always_defer_done = true for the
> Spreadtrum host driver in patch "mmc: host: sdhci-sprd: Add software queue
> support" and put this patch before it.

Ah, sorry, I misunderstood your point.
So you still expect the Spreadtrum host driver to defer completing
requests at first, with a request_atomic API introduced in the next
patch set so that the Spreadtrum host driver can call
request_atomic() in interrupt context. OK, will do in the next
version. Thanks.
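
For reference, the condition being discussed can be modelled in user space
roughly as below. This is only a sketch: the model_* types, the flag value
and the cookie enum are stand-ins invented for illustration, and only the
shape of the boolean expression is taken from the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel definitions; the values are
 * arbitrary and only need to be distinct for this model. */
#define MODEL_REQ_USE_DMA (1u << 2)
enum { MODEL_COOKIE_UNMAPPED, MODEL_COOKIE_PRE_MAPPED, MODEL_COOKIE_MAPPED };

struct model_data {
	int host_cookie;
};

struct model_request {
	struct model_data *data; /* NULL for a command without data */
};

struct model_host {
	unsigned int flags;
	bool pending_reset;
	bool always_defer_done; /* the new field from this patch */
};

/* Returns true when completion must be deferred out of the irq handler. */
static bool model_defer_done(const struct model_host *host,
			     const struct model_request *mrq)
{
	assert(host != NULL && mrq != NULL);

	const struct model_data *data = mrq->data;

	return host->pending_reset || host->always_defer_done ||
	       ((host->flags & MODEL_REQ_USE_DMA) && data &&
		data->host_cookie == MODEL_COOKIE_MAPPED);
}
```

With always_defer_done set, even a command without data is deferred, which
is the behaviour a host driver would opt into when its request path may
sleep.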

> > Suggested-by: Adrian Hunter <[email protected]>
> > Signed-off-by: Baolin Wang <[email protected]>
> > ---
> > drivers/mmc/host/sdhci.c | 2 +-
> > drivers/mmc/host/sdhci.h | 1 +
> > 2 files changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > index 850241f..4bef066 100644
> > --- a/drivers/mmc/host/sdhci.c
> > +++ b/drivers/mmc/host/sdhci.c
> > @@ -3035,7 +3035,7 @@ static inline bool sdhci_defer_done(struct sdhci_host *host,
> > {
> > struct mmc_data *data = mrq->data;
> >
> > - return host->pending_reset ||
> > + return host->pending_reset || host->always_defer_done ||
> > ((host->flags & SDHCI_REQ_USE_DMA) && data &&
> > data->host_cookie == COOKIE_MAPPED);
> > }
> > diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
> > index d89cdb9..a73ce89 100644
> > --- a/drivers/mmc/host/sdhci.h
> > +++ b/drivers/mmc/host/sdhci.h
> > @@ -533,6 +533,7 @@ struct sdhci_host {
> > bool pending_reset; /* Cmd/data reset is pending */
> > bool irq_wake_enabled; /* IRQ wakeup is enabled */
> > bool v4_mode; /* Host Version 4 Enable */
> > + bool always_defer_done; /* Always defer to complete requests */
> >
> > struct mmc_request *mrqs_done[SDHCI_MAX_MRQS]; /* Requests done */
> > struct mmc_command *cmd; /* Current command */
> >
>


--
Baolin Wang
Best Regards

2019-11-11 09:29:52

by Arnd Bergmann

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Mon, Nov 11, 2019 at 8:35 AM Baolin Wang <[email protected]> wrote:
>
> Hi All,
>
> Now the MMC read/write stack will always wait for previous request is
> completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> or queue a work to complete request, that will bring context switching
> overhead, especially for high I/O per second rates, to affect the IO
> performance.

Hi Baolin,

I had a chance to discuss your changes and what other improvements
can be done to the way mmc-blk works with Hannes Reinecke during the ELC
conference. He had some good suggestions. Adding him and the linux-block
mailing list to Cc to make sure I'm correctly representing this.

- For the queue_depth of a non-queuing block device, you indeed need to
leave it at e.g. 32 or 64 rather than 1 or 2, as you do now (I was wrong
here originally, but just confirmed that). The queue depth is just used to
ensure there is room for reordering and merging, as you also noticed.

- Removing all the context switches and workqueues from the data submission
path is also the right idea. As you found, there is still a workqueue inside
of blk_mq that is used because it may get called from atomic context but
the submission may get blocked in __mmc_claim_host(). This really
needs to be changed as well, but not in the way I originally suggested:
As Hannes suggested, the host interrupt handler should always use
request_threaded_irq() to have its own process context, and then pass a
flag to blk_mq to say that we never need another workqueue there.

- With that change in place calling a blocking __mmc_claim_host() is
still a problem, so there should still be a nonblocking mmc_try_claim_host()
for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
should always return right away, either after having queued the next I/O
or with an error, but not waiting for the device in any way.

- For the packed requests, there is apparently a very simple way to implement
that without a software queue: mmc_mq_queue_rq() is allowed to look at
and dequeue all requests that are currently part of the request_queue,
so it should take out as many as it wants to submit at once and send
them all down to the driver together, avoiding the need for any further
round-trips to blk_mq or maintaining a queue in mmc.

- The DMA management (bounce buffer, map, unmap) that is currently
done in mmc_blk_mq_issue_rq() should ideally be done in the
init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
can be done asynchronously, out of the critical timing path for the
submission. With this, there won't be any need for a software queue.
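
To make the third point a bit more concrete, here is a small user-space
sketch of a submission path that never waits for the host. All names
(sketch_try_claim_host(), SKETCH_STS_*) are made up for illustration and
are not existing mmc or blk-mq symbols; the atomic flag merely stands in
for whatever the real claim mechanism would be.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes, loosely mirroring "queued" vs "busy". */
enum sketch_status { SKETCH_STS_OK, SKETCH_STS_DEV_RESOURCE };

struct sketch_host {
	atomic_flag claimed; /* set while someone owns the host */
	int issued;          /* number of requests handed to the hardware */
};

static void sketch_host_init(struct sketch_host *host)
{
	assert(host != NULL);
	atomic_flag_clear(&host->claimed);
	host->issued = 0;
}

/* Nonblocking claim: succeeds only if nobody holds the host right now. */
static bool sketch_try_claim_host(struct sketch_host *host)
{
	return !atomic_flag_test_and_set(&host->claimed);
}

/* Called from the completion path once the request is finished. */
static void sketch_release_host(struct sketch_host *host)
{
	atomic_flag_clear(&host->claimed);
}

/* Either issues the request or reports "busy"; it never sleeps. */
static enum sketch_status sketch_queue_rq(struct sketch_host *host)
{
	if (!sketch_try_claim_host(host))
		return SKETCH_STS_DEV_RESOURCE;

	host->issued++; /* host stays claimed until completion releases it */
	return SKETCH_STS_OK;
}
```

The caller is expected to re-run the queue after sketch_release_host(), so
a "busy" return is a deferral rather than an I/O error.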

Hannes,

Let me know if I misunderstood any of the above, or if I missed any
additional points.

Arnd

> Thus this patch set will introduce the MMC software command queue support
> based on command queue engine's interfaces, and set the queue depth as 32
> to allow more requests can be be prepared, merged and inserted into IO
> scheduler, but we only allow 2 requests in flight, that is enough to let
> the irq handler always trigger the next request without a context switch,
> as well as avoiding a long latency.
>
> Moreover we can expand the MMC software queue interface to support
> MMC packed request or packed command instead of adding new interfaces,
> according to previosus discussion.
>
> Below are some comparison data with fio tool. The fio command I used
> is like below with changing the '--rw' parameter and enabling the direct
> IO flag to measure the actual hardware transfer speed in 4K block size.
>
> ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
>
> My eMMC card working at HS400 Enhanced strobe mode:
> [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
>
> 1. Without MMC software queue
> I tested 5 times for each case and output a average speed.
>
> 1) Sequential read:
> Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> Average speed: 59.66MiB/s
>
> 2) Random read:
> Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> Average speed: 27.04MiB/s
>
> 3) Sequential write:
> Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> Average speed: 69.68MiB/s
>
> 4) Random write:
> Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> Average speed: 35.96MiB/s
>
> 2. With MMC software queue
> I tested 5 times for each case and output a average speed.
>
> 1) Sequential read:
> Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> Average speed: 60.68MiB/s
>
> 2) Random read:
> Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> Average speed: 31.36MiB/s
>
> 3) Sequential write:
> Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> Average speed: 71.66MiB/s
>
> 4) Random write:
> Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> Average speed: 68.76MiB/s
>
> Form above data, we can see the MMC software queue can help to improve some
> performance obviously for random read and write, though no obvious improvement
> for sequential read and write.
>
> Any comments are welcome. Thanks a lot.
>
> Hi Ulf,
>
> This patch set was pending for a while, and I've tested it several times and
> have not found any recessions. Hope this patch set can be merged into v5.5
> if no objection from you, since I still have some patches introducing the
> packed request depend on the mmc software queue as we talked before.
> Thanks a lot.
>
> Changes from v5:
> - Modify the condition of defering to complete request suggested by Adrian.
>
> Changes from v4:
> - Add a seperate patch to introduce a variable to defer to complete
> data requests for some host drivers, when using host software queue.
>
> Changes from v3:
> - Use host software queue instead of sqhci.
> - Fix random config building issue.
> - Change queue depth to 32, but still only allow 2 requests in flight.
> - Update the testing data.
>
> Changes from v2:
> - Remove reference to 'struct cqhci_host' and 'struct cqhci_slot',
> instead adding 'struct sqhci_host', which is only used by software queue.
>
> Changes from v1:
> - Add request_done ops for sdhci_ops.
> - Replace virtual command queue with software queue for functions and
> variables.
> - Rename the software queue file and add sqhci.h header file.
>
> Baolin Wang (4):
> mmc: Add MMC host software queue support
> mmc: host: sdhci: Add request_done ops for struct sdhci_ops
> mmc: host: sdhci-sprd: Add software queue support
> mmc: host: sdhci: Add a variable to defer to complete requests if
> needed
>
> drivers/mmc/core/block.c | 61 ++++++++
> drivers/mmc/core/mmc.c | 13 +-
> drivers/mmc/core/queue.c | 33 +++-
> drivers/mmc/host/Kconfig | 8 +
> drivers/mmc/host/Makefile | 1 +
> drivers/mmc/host/mmc_hsq.c | 344 +++++++++++++++++++++++++++++++++++++++++
> drivers/mmc/host/mmc_hsq.h | 30 ++++
> drivers/mmc/host/sdhci-sprd.c | 26 ++++
> drivers/mmc/host/sdhci.c | 14 +-
> drivers/mmc/host/sdhci.h | 3 +
> include/linux/mmc/host.h | 3 +
> 11 files changed, 523 insertions(+), 13 deletions(-)
> create mode 100644 drivers/mmc/host/mmc_hsq.c
> create mode 100644 drivers/mmc/host/mmc_hsq.h
>
> --
> 1.7.9.5
>

2019-11-11 13:00:43

by Baolin Wang

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

Hi Arnd,

On Mon, 11 Nov 2019 at 17:28, Arnd Bergmann <[email protected]> wrote:
>
> On Mon, Nov 11, 2019 at 8:35 AM Baolin Wang <[email protected]> wrote:
> >
> > Hi All,
> >
> > Now the MMC read/write stack will always wait for previous request is
> > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > or queue a work to complete request, that will bring context switching
> > overhead, especially for high I/O per second rates, to affect the IO
> > performance.
>
> Hi Baolin,
>
> I had a chance to discuss your changes and what other improvements
> can be done to the way mmc-blk works with Hannes Reinecke during the ELC
> conference. He had some good suggestions. Adding him and the linux-block
> mailing list to Cc to make sure I'm correctly representing this.

Great, thanks for your input.

>
> - For the queue_depth of a non-queuing block device, you indeed need to
> leave it at e.g. 32 or 64 rather than 1 or 2, as you do now (I was wrong
> here originally, but just confirmed that). The queue depth is just used to
> ensure there is room for reordering and merging, as you also noticed.

Right.

>
> - Removing all the context switches and workqueues from the data submission
> path is also the right idea. As you found, there is still a workqueue inside
> of blk_mq that is used because it may get called from atomic context but
> the submission may get blocked in __mmc_claim_host(). This really
> needs to be changed as well, but not in the way I originally suggested:
> As Hannes suggested, the host interrupt handler should always use
> request_threaded_irq() to have its own process context, and then pass a
> flag to blk_mq to say that we never need another workqueue there.

So you mean we should complete the request in the host driver irq
thread context, then issue another request in this context by calling
blk_mq_run_hw_queues()?

>
> - With that change in place calling a blocking __mmc_claim_host() is
> still a problem, so there should still be a nonblocking mmc_try_claim_host()
> for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
> return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
> should always return right away, either after having queued the next I/O
> or with an error, but not waiting for the device in any way.

Actually, mmc_claim_host() is not the only thing that can block MMC
request processing: in this routine, mmc_blk_part_switch() and
mmc_retune() can also block it. Moreover, partition switching and
tuning must be synchronous operations, so we cannot move them to a
work item or a thread.

>
> - For the packed requests, there is apparently a very simple way to implement
> that without a software queue: mmc_mq_queue_rq() is allowed to look at
> and dequeue all requests that are currently part of the request_queue,
> so it should take out as many as it wants to submit at once and send
> them all down to the driver together, avoiding the need for any further
> round-trips to blk_mq or maintaining a queue in mmc.

You mean we can dispatch a request directly from
elevator->type->ops.dispatch_request()? But we would still need some
helper functions to check whether these requests can be packed (the
packing condition), and we would need to invent new APIs to start a
packed request (or use the cqe interfaces, which means we would still
need to implement some cqe callbacks).

>
> - The DMA management (bounce buffer, map, unmap) that is currently
> done in mmc_blk_mq_issue_rq() should ideally be done in the
> init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
> can be done asynchronously, out of the critical timing path for the
> submission. With this, there won't be any need for a software queue.

This is not true. blk-mq currently allocates a set of static request
objects (usually as many as the hardware queue depth) that are saved
in struct blk_mq_tags. init_request() is used to initialize the static
requests when they are allocated, and exit_request() is called to free
them when the 'struct blk_mq_tags' is freed, for example when the
queue is dead. So we cannot move the DMA management into
init_request()/exit_request().
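
A toy user-space model of the lifecycle may help show the mismatch. The
names and the depth of 4 are arbitrary and chosen only for illustration;
the point is that the init hook runs once per preallocated request while
the queue hook runs once per I/O, so per-I/O work such as DMA mapping
cannot live in the former.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DEPTH 4 /* arbitrary number of preallocated requests */

struct sketch_request {
	int init_calls;  /* times the init hook ran for this slot */
	int queue_calls; /* times this slot carried an I/O */
};

struct sketch_tag_set {
	struct sketch_request rqs[SKETCH_DEPTH];
};

/* Runs once per slot, when the tag set is allocated. */
static void sketch_init_request(struct sketch_request *rq)
{
	rq->init_calls++;
}

/* Runs for every I/O that is dispatched through a slot. */
static void sketch_issue(struct sketch_request *rq)
{
	rq->queue_calls++;
}

static void sketch_alloc_tag_set(struct sketch_tag_set *set)
{
	assert(set != NULL);
	for (size_t i = 0; i < SKETCH_DEPTH; i++) {
		set->rqs[i].init_calls = 0;
		set->rqs[i].queue_calls = 0;
		sketch_init_request(&set->rqs[i]);
	}
}

/* Submit nr_ios I/Os, reusing the static slots round-robin. */
static void sketch_submit_ios(struct sketch_tag_set *set, size_t nr_ios)
{
	for (size_t i = 0; i < nr_ios; i++)
		sketch_issue(&set->rqs[i % SKETCH_DEPTH]);
}
```

After any number of submissions each slot has still been initialized
exactly once, while its queue count keeps growing with the I/O load.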

>
> Hannes,
>
> Let me know if I misunderstood any of the above, or if I missed any
> additional points.
>
> Arnd
>



--
Baolin Wang
Best Regards

2019-11-11 17:02:44

by Arnd Bergmann

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
> On Mon, 11 Nov 2019 at 17:28, Arnd Bergmann <[email protected]> wrote:
> > On Mon, Nov 11, 2019 at 8:35 AM Baolin Wang <[email protected]> wrote:
> > - Removing all the context switches and workqueues from the data submission
> > path is also the right idea. As you found, there is still a workqueue inside
> > of blk_mq that is used because it may get called from atomic context but
> > the submission may get blocked in __mmc_claim_host(). This really
> > needs to be changed as well, but not in the way I originally suggested:
> > As Hannes suggested, the host interrupt handler should always use
> > request_threaded_irq() to have its own process context, and then pass a
> > flag to blk_mq to say that we never need another workqueue there.
>
> So you mean we should complete the request in the host driver irq
> thread context, then issue another request in this context by calling
> blk_mq_run_hw_queues()?

Yes. I assumed there was already code that would always run
blk_mq_run_hw_queue() at I/O completion, but I can't find where
that happens today.

As I understand, the main difference to today is that
__blk_mq_delay_run_hw_queue() can call into __blk_mq_run_hw_queue
directly rather than using the delayed work queue once we
can skip the BLK_MQ_F_BLOCKING check.
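
Roughly, the decision I have in mind looks like the user-space sketch
below. It is a simplification written from memory, not a copy of the
blk-mq code: the SKETCH_* names are invented, and the real function also
checks things such as whether the queue is stopped and which CPU it is
running on, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the BLK_MQ_F_BLOCKING queue flag. */
#define SKETCH_F_BLOCKING (1u << 0)

enum sketch_path {
	SKETCH_RUN_DIRECT,        /* dispatch in the caller's context */
	SKETCH_RUN_VIA_WORKQUEUE, /* punt to a (delayed) work item */
};

/* Decide how a "run the hardware queue" request is carried out. */
static enum sketch_path sketch_pick_run_path(unsigned int hctx_flags,
					     bool async)
{
	if (!async && !(hctx_flags & SKETCH_F_BLOCKING))
		return SKETCH_RUN_DIRECT;

	return SKETCH_RUN_VIA_WORKQUEUE;
}

/* Usage: the same synchronous run request takes a different path
 * depending only on whether the queue is marked as blocking. */
static void sketch_self_check(void)
{
	assert(sketch_pick_run_path(SKETCH_F_BLOCKING, false) ==
	       SKETCH_RUN_VIA_WORKQUEUE);
	assert(sketch_pick_run_path(0, false) == SKETCH_RUN_DIRECT);
}
```

So a queue that no longer needs the blocking flag avoids the extra
context switch on every synchronous run.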

> > - With that change in place calling a blocking __mmc_claim_host() is
> > still a problem, so there should still be a nonblocking mmc_try_claim_host()
> > for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
> > return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
> > should always return right away, either after having queued the next I/O
> > or with an error, but not waiting for the device in any way.
>
> Actually not only the mmc_claim_host() will block the MMC request
> processing, in this routine, the mmc_blk_part_switch() and
> mmc_retune() can also block the request processing. Moreover the part
> switching and tuning should be sync operations, and we can not move
> them to a work or a thread.

Ok, I see.

Those would also cause requests to be sent to the device or the host
controller, right? Maybe we can treat them as "a non-IO request
has successfully been queued to the device" events, returning
busy from the mmc_mq_queue_rq() function and then running
the queue again when they complete?

> > - For the packed requests, there is apparently a very simple way to implement
> > that without a software queue: mmc_mq_queue_rq() is allowed to look at
> > and dequeue all requests that are currently part of the request_queue,
> > so it should take out as many as it wants to submit at once and send
> > them all down to the driver together, avoiding the need for any further
> > round-trips to blk_mq or maintaining a queue in mmc.
>
> You mean we can dispatch a request directly from
> elevator->type->ops.dispatch_request()? but we still need some helper
> functions to check if these requests can be packed (the package
> condition), and need to invent new APIs to start a packed request (or
> using cqe interfaces, which means we still need to implement some cqe
> callbacks).

I don't know how the dispatch_request() function fits in there.
What Hannes told me is that in ->queue_rq() you can always
look at the following requests that are already queued up
and take the next ones off the list. Looking at bd->last
tells you if there are additional requests. If there are, you can
look at the next one from blk_mq_hw_ctx (not sure how, but it
should not be hard to find).

I also see that there is a commit_rqs() callback that may
go along with queue_rq(); implementing that one could make
this easier as well.
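
As a rough illustration of how the "last" hint and a commit hook could
work together for batching, here is a user-space sketch. The structure,
the batch limit of 8 and all sketch_* names are assumptions made up for
this example and say nothing about how mmc would actually pack requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BATCH 8 /* arbitrary cap on requests per batch */

struct sketch_hw {
	int pending;   /* requests staged but not yet sent */
	int sent;      /* total requests handed to the hardware */
	int doorbells; /* how many times the hardware was kicked */
};

/* Send everything staged so far as one batch. */
static void sketch_kick(struct sketch_hw *hw)
{
	if (hw->pending == 0)
		return;

	hw->sent += hw->pending;
	hw->pending = 0;
	hw->doorbells++;
}

/* Stage one request; only kick when told this is the last one in the
 * current dispatch run, or when the batch is full. */
static void sketch_queue_rq(struct sketch_hw *hw, bool last)
{
	assert(hw != NULL && hw->pending < SKETCH_MAX_BATCH);

	hw->pending++;
	if (last || hw->pending == SKETCH_MAX_BATCH)
		sketch_kick(hw);
}

/* Flush hook for the case where dispatch stopped early, after a
 * request that was queued with last == false. */
static void sketch_commit_rqs(struct sketch_hw *hw)
{
	sketch_kick(hw);
}
```

Three requests queued with the hint set only on the third one result in a
single doorbell, which is the kind of saving packed requests are after.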

> > - The DMA management (bounce buffer, map, unmap) that is currently
> > done in mmc_blk_mq_issue_rq() should ideally be done in the
> > init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
> > can be done asynchronously, out of the critical timing path for the
> > submission. With this, there won't be any need for a software queue.
>
> This is not true, now the blk-mq will allocate some static request
> objects (usually the static requests number should be the same with
> the hardware queue depth) saved in struct blk_mq_tags. So the
> init_request() is used to initialize the static requests when
> allocating them, and call exit_request to free the static requests
> when freeing the 'struct blk_mq_tags', such as the queue is dead. So
> we can not move the DMA management into the init_request/exit_request.

Ok, I must have misremembered which callback that is then, but I guess
there is some other place to do it.

Arnd

2019-11-12 08:51:01

by Baolin Wang

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Tue, Nov 12, 2019 at 12:59 AM Arnd Bergmann <[email protected]> wrote:
>
> On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
> > On Mon, 11 Nov 2019 at 17:28, Arnd Bergmann <[email protected]> wrote:
> > > On Mon, Nov 11, 2019 at 8:35 AM Baolin Wang <[email protected]> wrote:
> > > - Removing all the context switches and workqueues from the data submission
> > > path is also the right idea. As you found, there is still a workqueue inside
> > > of blk_mq that is used because it may get called from atomic context but
> > > the submission may get blocked in __mmc_claim_host(). This really
> > > needs to be changed as well, but not in the way I originally suggested:
> > > As Hannes suggested, the host interrupt handler should always use
> > > request_threaded_irq() to have its own process context, and then pass a
> > > flag to blk_mq to say that we never need another workqueue there.
> >
> > So you mean we should complete the request in the host driver irq
> > thread context, then issue another request in this context by calling
> > blk_mq_run_hw_queues()?
>
> Yes. I assumed there was already code that would always run
> blk_mq_run_hw_queue() at I/O completion, but I can't find where
> that happens today.

OK. Currently a request is completed in the block softirq, which means
the irq thread of the host driver should call blk_mq_complete_request()
to complete the request (triggering the block softirq) and then call
blk_mq_run_hw_queues() to dispatch another request in this context.

>
> As I understand, the main difference to today is that
> __blk_mq_delay_run_hw_queue() can call into __blk_mq_run_hw_queue
> directly rather than using the delayed work queue once we
> can skip the BLK_MQ_F_BLOCKING check.

Right. Need to improve this as you suggested.

>
> > > - With that change in place calling a blocking __mmc_claim_host() is
> > > still a problem, so there should still be a nonblocking mmc_try_claim_host()
> > > for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
> > > return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
> > > should always return right away, either after having queued the next I/O
> > > or with an error, but not waiting for the device in any way.
> >
> > Actually not only the mmc_claim_host() will block the MMC request
> > processing, in this routine, the mmc_blk_part_switch() and
> > mmc_retune() can also block the request processing. Moreover the part
> > switching and tuning should be sync operations, and we can not move
> > them to a work or a thread.
>
> Ok, I see.
>
> Those would also cause requests to be sent to the device or the host
> controller, right? Maybe we can treat them as "a non-IO request

Right.

> has successfully been queued to the device" events, returning
> busy from the mmc_mq_queue_rq() function and then running
> the queue again when they complete?

Yes, seems reasonable to me.

>
> > > - For the packed requests, there is apparently a very simple way to implement
> > > that without a software queue: mmc_mq_queue_rq() is allowed to look at
> > > and dequeue all requests that are currently part of the request_queue,
> > > so it should take out as many as it wants to submit at once and send
> > > them all down to the driver together, avoiding the need for any further
> > > round-trips to blk_mq or maintaining a queue in mmc.
> >
> > You mean we can dispatch a request directly from
> > elevator->type->ops.dispatch_request()? but we still need some helper
> > functions to check if these requests can be packed (the package
> > condition), and need to invent new APIs to start a packed request (or
> > using cqe interfaces, which means we still need to implement some cqe
> > callbacks).
>
> I don't know how the dispatch_request() function fits in there,
> what Hannes told me is that in ->queue_rq() you can always
> look at the following requests that are already queued up
> and take the next ones off the list. Looking at bd->last
> tells you if there are additional requests. If there are, you can
> look at the next one from blk_mq_hw_ctx (not sure how, but
> should not be hard to find)
>
> I also see that there is a commit_rqs() callback that may
> go along with queue_rq(), implementing that one could make
> this easier as well.

Yes, we can use queue_rq()/commit_rqs() and bd->last (though bd->last
may not work well at the moment, see [1]). But as we discussed before,
packed requests still need some new interfaces (for example, one
interface to start a packed request and another to complete it), and in
the end we reached a consensus that we should re-use the CQE interfaces
instead of inventing new ones.

[1] https://lore.kernel.org/patchwork/patch/1102897/

>
> > > - The DMA management (bounce buffer, map, unmap) that is currently
> > > done in mmc_blk_mq_issue_rq() should ideally be done in the
> > > init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
> > > can be done asynchronously, out of the critical timing path for the
> > > submission. With this, there won't be any need for a software queue.
> >
> > This is not true, now the blk-mq will allocate some static request
> > objects (usually the static requests number should be the same with
> > the hardware queue depth) saved in struct blk_mq_tags. So the
> > init_request() is used to initialize the static requests when
> > allocating them, and call exit_request to free the static requests
> > when freeing the 'struct blk_mq_tags', such as the queue is dead. So
> > we can not move the DMA management into the init_request/exit_request.
>
> Ok, I must have misremembered which callback that is then, but I guess
> there is some other place to do it.

I checked 'struct blk_mq_ops' and did not find an op that can be
used to do the DMA management. I also checked the UFS driver, and it
also does the DMA mapping in queue_rq() (scsi_queue_rq() --->
ufshcd_queuecommand() ---> ufshcd_map_sg()). Maybe I missed something?

Moreover, as I said above, for packed requests we still need to
implement something (like the software queue) based on the CQE
interfaces to help handle them.

2019-11-18 10:09:26

by Baolin Wang

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

Hi Arnd,

On Tue, 12 Nov 2019 at 16:48, Baolin Wang <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 12:59 AM Arnd Bergmann <[email protected]> wrote:
> >
> > On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
> > > On Mon, 11 Nov 2019 at 17:28, Arnd Bergmann <[email protected]> wrote:
> > > > On Mon, Nov 11, 2019 at 8:35 AM Baolin Wang <[email protected]> wrote:
> > > > - Removing all the context switches and workqueues from the data submission
> > > > path is also the right idea. As you found, there is still a workqueue inside
> > > > of blk_mq that is used because it may get called from atomic context but
> > > > the submission may get blocked in __mmc_claim_host(). This really
> > > > needs to be changed as well, but not in the way I originally suggested:
> > > > As Hannes suggested, the host interrupt handler should always use
> > > > request_threaded_irq() to have its own process context, and then pass a
> > > > flag to blk_mq to say that we never need another workqueue there.
> > >
> > > So you mean we should complete the request in the host driver irq
> > > thread context, then issue another request in this context by calling
> > > blk_mq_run_hw_queues()?
> >
> > Yes. I assumed there was already code that would always run
> > blk_mq_run_hw_queue() at I/O completion, but I can't find where
> > that happens today.
>
> OK. Now we will complete a request in block softirq, which means the
> irq thread of host driver should call blk_mq_complete_request() to
> complete this request (triggering the block softirq) and call
> blk_mq_run_hw_queues() to dispatch another request in this context.
>
> >
> > As I understand, the main difference to today is that
> > __blk_mq_delay_run_hw_queue() can call into __blk_mq_run_hw_queue
> > directly rather than using the delayed work queue once we
> > can skip the BLK_MQ_F_BLOCKING check.
>
> Right. Need to improve this as you suggested.
>
> >
> > > > - With that change in place calling a blocking __mmc_claim_host() is
> > > > still a problem, so there should still be a nonblocking mmc_try_claim_host()
> > > > for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
> > > > return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
> > > > should always return right away, either after having queued the next I/O
> > > > or with an error, but not waiting for the device in any way.
> > >
> > > Actually not only the mmc_claim_host() will block the MMC request
> > > processing, in this routine, the mmc_blk_part_switch() and
> > > mmc_retune() can also block the request processing. Moreover the part
> > > switching and tuning should be sync operations, and we can not move
> > > them to a work or a thread.
> >
> > Ok, I see.
> >
> > Those would also cause requests to be sent to the device or the host
> > controller, right? Maybe we can treat them as "a non-IO request
>
> Right.
>
> > has successfully been queued to the device" events, returning
> > busy from the mmc_mq_queue_rq() function and then running
> > the queue again when they complete?
>
> Yes, seems reasonable to me.
>
> >
> > > > - For the packed requests, there is apparently a very simple way to implement
> > > > that without a software queue: mmc_mq_queue_rq() is allowed to look at
> > > > and dequeue all requests that are currently part of the request_queue,
> > > > so it should take out as many as it wants to submit at once and send
> > > > them all down to the driver together, avoiding the need for any further
> > > > round-trips to blk_mq or maintaining a queue in mmc.
> > >
> > > You mean we can dispatch a request directly from
> > > elevator->type->ops.dispatch_request()? but we still need some helper
> > > functions to check if these requests can be packed (the package
> > > condition), and need to invent new APIs to start a packed request (or
> > > using cqe interfaces, which means we still need to implement some cqe
> > > callbacks).
> >
> > I don't know how the dispatch_request() function fits in there,
> > what Hannes told me is that in ->queue_rq() you can always
> > look at the following requests that are already queued up
> > and take the next ones off the list. Looking at bd->last
> > tells you if there are additional requests. If there are, you can
> > look at the next one from blk_mq_hw_ctx (not sure how, but
> > should not be hard to find)
> >
> > I also see that there is a commit_rqs() callback that may
> > go along with queue_rq(), implementing that one could make
> > this easier as well.
>
> Yes, we can use queue_rq()/commit_rqs() and bd->last (now bd->last may
> can not work well, see [1]), but like we talked before, for packed
> request, we still need some new interfaces (for example, a interface
> used to start a packed request, and a interface used to complete a
> packed request), but at last we got a consensus that we should re-use
> the CQE interfaces instead of new invention.
>
> [1] https://lore.kernel.org/patchwork/patch/1102897/
>
> >
> > > > - The DMA management (bounce buffer, map, unmap) that is currently
> > > > done in mmc_blk_mq_issue_rq() should ideally be done in the
> > > > init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
> > > > can be done asynchronously, out of the critical timing path for the
> > > > submission. With this, there won't be any need for a software queue.
> > >
> > > This is not true, now the blk-mq will allocate some static request
> > > objects (usually the static requests number should be the same with
> > > the hardware queue depth) saved in struct blk_mq_tags. So the
> > > init_request() is used to initialize the static requests when
> > > allocating them, and call exit_request to free the static requests
> > > when freeing the 'struct blk_mq_tags', such as the queue is dead. So
> > > we can not move the DMA management into the init_request/exit_request.
> >
> > Ok, I must have misremembered which callback that is then, but I guess
> > there is some other place to do it.
>
> I checked the 'struct blk_mq_ops', and I did not find a ops can be
> used to do DMA management. And I also checked UFS driver, it also did
> the DMA mapping in the queue_rq() (scsi_queue_rq() --->
> ufshcd_queuecommand() ---> ufshcd_map_sg()). Maybe I missed something?
>
> Moreover like I said above, for the packed request, we still need
> implement something (like the software queue) based on the CQE
> interfaces to help to handle packed requests.

After some investigation and offline discussion with you, I still have
some concerns about your suggestion.

1) blk-mq currently has no op for preparing a request, which could be
used to do the DMA management asynchronously. Yes, we could
introduce a new op for blk-mq, but some preparation would still remain
in mmc_mq_queue_rq(), such as the MMC partition switch. With the software
queue, we can fully prepare the next request right after issuing one.

2) I wonder whether it is appropriate to use the threaded irq context
to dispatch the next request, since that still introduces a context
switch. Now we complete a request in the hard irq handler,
kick the softirq to do the time-consuming operations such as DMA
unmapping, and start the next request in the hard irq handler
without a context switch. Moreover, if we remove BLK_MQ_F_BLOCKING in
the future as you suggested, we can remove all context switches. And
I think we can dispatch the next request in softirq context (the CQE
already does this).

3) For packed request support, I have not seen an example of a block
driver dispatching a request from the IO scheduler in queue_rq(), and
blk-mq provides no API for it. We also do not know where to dispatch a
request from in queue_rq(): the IO scheduler, the ctx, or the
hctx->dispatch list? And if the request cannot be passed to the host
right now, what do we do with it? This seems very complicated.

Moreover, we still need some interfaces for packed request
handling. As discussed previously, we still need something like the MMC
software queue based on the CQE to help handle packed requests.

So I think I still need to introduce the MMC software queue: on the one
hand, the fio data show that it really improves performance and
avoids a long latency; on the other hand, we can easily extend it to
support packed requests in the future. Thanks.

(Anyway I will still post the V7 to address Adrian's comments and to
see if we can get a consensus there).

--
Baolin Wang
Best Regards

2019-11-22 09:52:04

by Arnd Bergmann

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

(adding Paolo as well, maybe he has some more insights)

On Mon, Nov 18, 2019 at 11:04 AM (Exiting) Baolin Wang
<[email protected]> wrote:
> On Tue, 12 Nov 2019 at 16:48, Baolin Wang <[email protected]> wrote:
> > On Tue, Nov 12, 2019 at 12:59 AM Arnd Bergmann <[email protected]> wrote:
> > > On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
> > > > > - With that change in place calling a blocking __mmc_claim_host() is
> > > > > still a problem, so there should still be a nonblocking mmc_try_claim_host()
> > > > > for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
> > > > > return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
> > > > > should always return right away, either after having queued the next I/O
> > > > > or with an error, but not waiting for the device in any way.
> > > >
> > > > Actually not only the mmc_claim_host() will block the MMC request
> > > > processing, in this routine, the mmc_blk_part_switch() and
> > > > mmc_retune() can also block the request processing. Moreover the part
> > > > switching and tuning should be sync operations, and we can not move
> > > > them to a work or a thread.
> > >
> > > Ok, I see.
> > >
> > > Those would also cause requests to be sent to the device or the host
> > > controller, right? Maybe we can treat them as "a non-IO request
> >
> > Right.
> >
> > > has successfully been queued to the device" events, returning
> > > busy from the mmc_mq_queue_rq() function and then running
> > > the queue again when they complete?
> >
> > Yes, seems reasonable to me.
> >
> > >
> > > > > - For the packed requests, there is apparently a very simple way to implement
> > > > > that without a software queue: mmc_mq_queue_rq() is allowed to look at
> > > > > and dequeue all requests that are currently part of the request_queue,
> > > > > so it should take out as many as it wants to submit at once and send
> > > > > them all down to the driver together, avoiding the need for any further
> > > > > round-trips to blk_mq or maintaining a queue in mmc.
> > > >
> > > > You mean we can dispatch a request directly from
> > > > elevator->type->ops.dispatch_request()? but we still need some helper
> > > > functions to check if these requests can be packed (the package
> > > > condition), and need to invent new APIs to start a packed request (or
> > > > using cqe interfaces, which means we still need to implement some cqe
> > > > callbacks).
> > >
> > > I don't know how the dispatch_request() function fits in there,
> > > what Hannes told me is that in ->queue_rq() you can always
> > > look at the following requests that are already queued up
> > > and take the next ones off the list. Looking at bd->last
> > > tells you if there are additional requests. If there are, you can
> > > look at the next one from blk_mq_hw_ctx (not sure how, but
> > > should not be hard to find)
> > >
> > > I also see that there is a commit_rqs() callback that may
> > > go along with queue_rq(), implementing that one could make
> > > this easier as well.
> >
> > Yes, we can use queue_rq()/commit_rqs() and bd->last (now bd->last may
> > can not work well, see [1]), but like we talked before, for packed
> > request, we still need some new interfaces (for example, a interface
> > used to start a packed request, and a interface used to complete a
> > packed request), but at last we got a consensus that we should re-use
> > the CQE interfaces instead of new invention.
> >
> > [1] https://lore.kernel.org/patchwork/patch/1102897/
> >
> > >
> > > > > - The DMA management (bounce buffer, map, unmap) that is currently
> > > > > done in mmc_blk_mq_issue_rq() should ideally be done in the
> > > > > init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
> > > > > can be done asynchronously, out of the critical timing path for the
> > > > > submission. With this, there won't be any need for a software queue.
> > > >
> > > > This is not true, now the blk-mq will allocate some static request
> > > > objects (usually the static requests number should be the same with
> > > > the hardware queue depth) saved in struct blk_mq_tags. So the
> > > > init_request() is used to initialize the static requests when
> > > > allocating them, and call exit_request to free the static requests
> > > > when freeing the 'struct blk_mq_tags', such as the queue is dead. So
> > > > we can not move the DMA management into the init_request/exit_request.
> > >
> > > Ok, I must have misremembered which callback that is then, but I guess
> > > there is some other place to do it.
> >
> > I checked the 'struct blk_mq_ops', and I did not find a ops can be
> > used to do DMA management. And I also checked UFS driver, it also did
> > the DMA mapping in the queue_rq() (scsi_queue_rq() --->
> > ufshcd_queuecommand() ---> ufshcd_map_sg()). Maybe I missed something?
> >
> > Moreover like I said above, for the packed request, we still need
> > implement something (like the software queue) based on the CQE
> > interfaces to help to handle packed requests.
>
> After some investigation and offline discussion with you, I still have
> some concerns about your suggestion.
>
> 1) Now blk-mq have not supplied some ops to prepare a request, which is
> used to do some DMA management asynchronously. But yes, we can
> introduce new ops for blk-mq. But there are still some remaining
> preparation in mmc_mq_queue_rq(), like mmc part switch. For software
> queue, we can prepare a request totally after issuing one.

I suppose to make the submission non-blocking, all operations that
currently block in the submission path may have to be changed first.

For the case of a partition switch (same for retune), I suppose
something like this can be done:

- in queue_rq() check whether a partition switch is needed. If not,
submit the current rq
- if a partition switch is needed, submit the partition switch cmd
instead, and return busy status
- when the completion arrives for the partition switch, call back into
blk_mq to have it call queue_rq again.

Or possibly even (this might not be possible without significant
restructuring):

- when preparing a request that would require a partition switch,
insert another meta-request to switch the partition ahead of it.

I do realize that this is a significant departure from how it was done
in the past, but it seems cleaner that way to me.
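
As a rough illustration of the first variant above (submit the switch
cmd, report busy, re-run the queue on completion), here is a small
userspace model. It is not kernel code: the psw_* names and the status
enum are made up for this sketch, and a real driver would return a
blk_status_t from queue_rq() and ask blk_mq to re-run the queue from
the completion path instead of relying on the caller to retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes standing in for "queued" and "busy". */
enum psw_status { PSW_OK, PSW_BUSY };

/* Hypothetical model of the card state that queue_rq() has to look at. */
struct psw_card {
	int cur_part;                 /* partition currently selected */
	int switching_to;             /* target of an in-flight switch, or -1 */
	unsigned int io_issued;       /* I/O requests handed to the "hardware" */
	unsigned int switches_issued; /* partition switch cmds sent */
};

/* Non-blocking submission: returns right away, never waits for the card. */
static enum psw_status psw_queue_rq(struct psw_card *c, int rq_part)
{
	if (c->switching_to >= 0)
		return PSW_BUSY;            /* a switch is still in flight */

	if (rq_part != c->cur_part) {
		c->switching_to = rq_part;  /* send the switch cmd instead of the rq */
		c->switches_issued++;
		return PSW_BUSY;
	}

	c->io_issued++;                 /* right partition: submit the rq */
	return PSW_OK;
}

/* Completion of the switch cmd; a real driver would re-run the queue here. */
static void psw_switch_done(struct psw_card *c)
{
	assert(c->switching_to >= 0);
	c->cur_part = c->switching_to;
	c->switching_to = -1;
}
```

In this model psw_queue_rq() never waits: a request for another
partition only costs a busy return, and the retry succeeds once
psw_switch_done() has run.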

> 2) I wonder if it is appropriate that using the irq threaded context
> to dispatch next request, actually we will still introduce a context
> switch here. Now we will complete a request in the hard irq handler
> and kick the softirq to do time-consuming operations, like DMA
> unmapping , and will start next request in the hard irq handler
> without context switch. Moreover if we remove the BLK_MQ_F_BLOCKING in
> future like you suggested, then we can remove all context switch. And
> I think we can dispatch next request in the softirq context (actually
> the CQE already did).

I hope Hannes (or someone else) can comment here, as I don't
know exactly what his objection to kicking off the next cmd in the
hardirq was.

I think generally, deferring all slow operations to an irqthread
rather than a softirq is a good idea, but I share your concern that
this can introduce an unnecessary latency between the time
the IRQ is signaled and the time the following cmd is sent to the
hardware.

Doing everything in a single (irqthread) context is clearly simpler,
so this would need to be measured carefully to avoid unnecessary
complexity, but I don't see anything stopping us from having
a fast path where the low-level driver first checks for any possible
error conditions in hardirq context and then fires off a prepared cmd
right away whenever it can, before triggering the irqthread that does
everything else. I think this has to be a per-driver optimization, so
the common case would just have an irqthread.

> 3) For packed request support, I did not see an example that block
> driver can dispatch a request from the IO scheduler in queue_rq() and
> no APIs supported from blk-mq. And we do not know where can dispatch a
> request in queue_rq(), from IO scheduler? from ctx? or from
> hctx->dispatch list? and if this request can not be passed to host
> now, how to do it? Seems lots of complicated things.

The only way I can see is the ->last flag: blk_mq submits multiple
requests in a row to queue_rq() with this flag cleared and calls
->commit_rqs() after the last one. This seems to be what the scsi
disk driver and the nvme driver rely on, and we should be able to use
it the same way for packed cmds, by checking each time in queue_rq()
whether requests can/should be combined and reporting busy otherwise
(after preparing a combined mmc cmd).
blk_mq will then call commit_rqs, which should do the actual submission
to the hardware driver.
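
To make the idea concrete, here is a minimal userspace sketch of
batching driven by a "last" flag. Everything named pack_* is
hypothetical and only models the flow described above; it is not the
blk_mq API. As a simplification, a request that cannot join the current
pack flushes the pack and starts a new one, rather than reporting busy,
and the packing condition is reduced to "same direction, pack not full".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PACK_MAX 8	/* assumed upper bound on requests per packed cmd */

struct pack_rq {
	unsigned long sector;
	unsigned int nr_sectors;
	bool write;
};

struct pack_host {
	struct pack_rq pack[PACK_MAX];
	size_t n;                 /* requests collected, not yet issued */
	unsigned int issued_cmds; /* packed cmds sent to the "hardware" */
	unsigned int issued_rqs;  /* requests covered by those cmds */
};

/* Models ->commit_rqs(): send whatever has been collected so far. */
static void pack_commit_rqs(struct pack_host *h)
{
	if (h->n == 0)
		return;
	h->issued_cmds++;
	h->issued_rqs += (unsigned int)h->n;
	h->n = 0;
}

/* Simplified packing condition: same direction and room left in the pack. */
static bool pack_can_add(const struct pack_host *h, const struct pack_rq *rq)
{
	if (h->n == 0)
		return true;
	if (h->n == PACK_MAX)
		return false;
	return h->pack[0].write == rq->write;
}

/* Models ->queue_rq(): collect while 'last' is clear, submit when it is set. */
static void pack_queue_rq(struct pack_host *h, const struct pack_rq *rq, bool last)
{
	if (!pack_can_add(h, rq))
		pack_commit_rqs(h);     /* flush the old pack, start a new one */

	assert(h->n < PACK_MAX);
	h->pack[h->n++] = *rq;

	if (last)
		pack_commit_rqs(h);
}
```

Note that if every request arrives with 'last' set, each call issues a
pack of one, so the packing logic gains nothing.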

Now as you point out, the *big* problem with this is that we never
get multiple requests together in practice, i.e. the last flag is almost
always set, and any optimization around it has no effect.

This is where I'm a bit lost in the code as well, but it seems that
this is part of the current bfq design that only sends one request down
the driver stack at a time, and this would have to change first before
we can rely on this for packing requests.

Paolo, can you comment on why this is currently done, or if it can
be changed? It seems to me that sending multiple requests at
once would also have a significant benefit on the per-request overhead
on NVMe devices with bfq.

> Moreover, we still need some interfaces for the packed request
> handling, from previous discussion, we still need something like MMC
> software queue based on the CQE to help to handle the packed request.
>
> So I think I still need to introduce the MMC software queue, on the one
> hand is that it can really improve the performance from fio data and
> avoid a long latency, on the other hand we can expand it to support
> packed request easily in future. Thanks.
>
> (Anyway I will still post the V7 to address Adrian's comments and to
> see if we can get a consensus there).


Arnd

2019-11-22 13:23:32

by Linus Walleij

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Fri, Nov 22, 2019 at 10:50 AM Arnd Bergmann <[email protected]> wrote:

> I suppose to make the submission non-blocking, all operations that
> currently block in the submission path may have to be changed first.
>
> For the case of a partition switch (same for retune), I suppose
> something like this can be done:
>
> - in queue_rq() check whether a partition switch is needed. If not,
> submit the current rq
> - if a partition switch is needed, submit the partition switch cmd
> instead, and return busy status
> - when the completion arrives for the partition switch, call back into
> blk_mq to have it call queue_rq again.
>
> Or possibly even (this might not be possible without significant
> restructuring):
>
> - when preparing a request that would require a partition switch,
> insert another meta-request to switch the partition ahead of it.
>
> I do realize that this is a significant departure from how it was done
> in the past, but it seems cleaner that way to me.

This partition business really needs a proper overhaul.

I outlined the work elsewhere but the problem is that the
eMMC "partitions" such as boot partitions and the usecase-defined
"general" partition (notice SD cards do not have this problem)
are badly integrated with the Linux partition manager.

Instead of mapping these partitions 1:1 to the Linux
partitions they are separate block devices with their own
block queue while still having a name that suggests they
are just a partition of the device. Which they are. The
only thing peculiar with them is that the firmware in the
card is aware of them. I think the partitions that are
not primary may trade speed for update correctness,
such that e.g. boot partitions may have extra redundant
pages in the device so that they never become corrupted.
But card vendors would have to comment.

This has peculiar side effects yielding weird user experiences
such that
dd if=/dev/mmcblk0 of=my-mmc-backup.img
will actually NOT make a backup of the whole device,
only the primary partition.

This should be fixed. My preferred solution would be to just
catenate the logical blocks for these partitions beyond those
of the primary partition, stash these offsets away somewhere
and when they are accessed, insert special partition switch
commands into the block scheduler just like you said.
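
A minimal sketch of the offset bookkeeping this would need, as a plain
userspace model: given a table of hardware partitions laid out back to
back in one logical block space, find which partition a logical block
falls into and the offset inside it. The cat_* names and the table
layout are assumptions made for illustration, not existing MMC core
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one hardware partition. */
struct cat_part {
	const char *name;
	uint64_t nr_blocks;
};

/* Result of a lookup: partition index (-1 if out of range) and offset. */
struct cat_loc {
	int part;
	uint64_t offset;
};

/*
 * Walk the table in order; partition i covers the logical blocks
 * [start_i, start_i + nr_blocks_i), where start_i is the sum of the
 * sizes of all partitions before it.
 */
static struct cat_loc cat_lookup(const struct cat_part *tbl, size_t n,
				 uint64_t lba)
{
	struct cat_loc loc = { -1, 0 };
	uint64_t start = 0;

	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (lba - start < tbl[i].nr_blocks) {
			loc.part = (int)i;
			loc.offset = lba - start;
			return loc;
		}
		start += tbl[i].nr_blocks;
	}
	return loc;
}
```

When the partition index returned for a request differs from the one
currently selected on the card, that is the point where a partition
switch command would have to be inserted ahead of the request.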

Right now the MMC core is trying to coordinate the uses of
different partitions by arbitrating different requests from
typically 4 different block devices instead, which isn't very
good, to say the least.

Also each block device eats memory
and it should really just be one block device.

Yours,
Linus Walleij

2019-11-22 13:51:43

by Arnd Bergmann

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Fri, Nov 22, 2019 at 2:20 PM Linus Walleij <[email protected]> wrote:
>
> On Fri, Nov 22, 2019 at 10:50 AM Arnd Bergmann <[email protected]> wrote:
>
> > I suppose to make the submission non-blocking, all operations that
> > currently block in the submission path may have to be changed first.
> >
> > For the case of a partition switch (same for retune), I suppose
> > something like this can be done:
> >
> > - in queue_rq() check whether a partition switch is needed. If not,
> > submit the current rq
> > - if a partition switch is needed, submit the partition switch cmd
> > instead, and return busy status
> > - when the completion arrives for the partition switch, call back into
> > blk_mq to have it call queue_rq again.
> >
> > Or possibly even (this might not be possible without significant
> > restructuring):
> >
> > - when preparing a request that would require a partition switch,
> > insert another meta-request to switch the partition ahead of it.
> >
> > I do realize that this is a significant departure from how it was done
> > in the past, but it seems cleaner that way to me.
>
> This partition business really need a proper overhaul.
>
> I outlined the work elsewhere but the problem is that the
> eMMC "partitions" such as boot partitions and the usecase-defined
> "general" partition (notice SD cards do not have this problem)
> are badly integrated with the Linux partition manager.

I think that's a totally orthogonal problem though: we may
well be able to come up with a different way of representing
the extra partitions to user space or as separate block devices,
but in the end, this looks exactly the same to the mmc
->queue_rq() callback. If we have sent a cmd to one partition
and want to send the next cmd to another partition, we first
have to send the partition switch cmd.

Arnd

2019-11-26 08:13:58

by Paolo Valente

Subject: Re: [PATCH v6 0/4] Add MMC software queue support



> On 22 Nov 2019, at 10:50, Arnd Bergmann <[email protected]> wrote:
>
> (adding Paolo as well, maybe he has some more insights)
>
> On Mon, Nov 18, 2019 at 11:04 AM (Exiting) Baolin Wang
> <[email protected]> wrote:
>> On Tue, 12 Nov 2019 at 16:48, Baolin Wang <[email protected]> wrote:
>>> On Tue, Nov 12, 2019 at 12:59 AM Arnd Bergmann <[email protected]> wrote:
>>>> On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
>>>>>> - With that change in place calling a blocking __mmc_claim_host() is
>>>>>> still a problem, so there should still be a nonblocking mmc_try_claim_host()
>>>>>> for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
>>>>>> return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
>>>>>> should always return right away, either after having queued the next I/O
>>>>>> or with an error, but not waiting for the device in any way.
>>>>>
>>>>> Actually not only the mmc_claim_host() will block the MMC request
>>>>> processing, in this routine, the mmc_blk_part_switch() and
>>>>> mmc_retune() can also block the request processing. Moreover the part
>>>>> switching and tuning should be sync operations, and we can not move
>>>>> them to a work or a thread.
>>>>
>>>> Ok, I see.
>>>>
>>>> Those would also cause requests to be sent to the device or the host
>>>> controller, right? Maybe we can treat them as "a non-IO request
>>>
>>> Right.
>>>
>>>> has successfully been queued to the device" events, returning
>>>> busy from the mmc_mq_queue_rq() function and then running
>>>> the queue again when they complete?
>>>
>>> Yes, seems reasonable to me.
>>>
>>>>
>>>>>> - For the packed requests, there is apparently a very simple way to implement
>>>>>> that without a software queue: mmc_mq_queue_rq() is allowed to look at
>>>>>> and dequeue all requests that are currently part of the request_queue,
>>>>>> so it should take out as many as it wants to submit at once and send
>>>>>> them all down to the driver together, avoiding the need for any further
>>>>>> round-trips to blk_mq or maintaining a queue in mmc.
>>>>>
>>>>> You mean we can dispatch a request directly from
>>>>> elevator->type->ops.dispatch_request()? but we still need some helper
>>>>> functions to check if these requests can be packed (the package
>>>>> condition), and need to invent new APIs to start a packed request (or
>>>>> using cqe interfaces, which means we still need to implement some cqe
>>>>> callbacks).
>>>>
>>>> I don't know how the dispatch_request() function fits in there,
>>>> what Hannes told me is that in ->queue_rq() you can always
>>>> look at the following requests that are already queued up
>>>> and take the next ones off the list. Looking at bd->last
>>>> tells you if there are additional requests. If there are, you can
>>>> look at the next one from blk_mq_hw_ctx (not sure how, but
>>>> should not be hard to find)
>>>>
>>>> I also see that there is a commit_rqs() callback that may
>>>> go along with queue_rq(), implementing that one could make
>>>> this easier as well.
>>>
>>> Yes, we can use queue_rq()/commit_rqs() and bd->last (now bd->last may
>>> can not work well, see [1]), but like we talked before, for packed
>>> request, we still need some new interfaces (for example, a interface
>>> used to start a packed request, and a interface used to complete a
>>> packed request), but at last we got a consensus that we should re-use
>>> the CQE interfaces instead of new invention.
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/1102897/
>>>
>>>>
>>>>>> - The DMA management (bounce buffer, map, unmap) that is currently
>>>>>> done in mmc_blk_mq_issue_rq() should ideally be done in the
>>>>>> init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
>>>>>> can be done asynchronously, out of the critical timing path for the
>>>>>> submission. With this, there won't be any need for a software queue.
>>>>>
>>>>> This is not true, now the blk-mq will allocate some static request
>>>>> objects (usually the static requests number should be the same with
>>>>> the hardware queue depth) saved in struct blk_mq_tags. So the
>>>>> init_request() is used to initialize the static requests when
>>>>> allocating them, and call exit_request to free the static requests
>>>>> when freeing the 'struct blk_mq_tags', such as the queue is dead. So
>>>>> we can not move the DMA management into the init_request/exit_request.
>>>>
>>>> Ok, I must have misremembered which callback that is then, but I guess
>>>> there is some other place to do it.
>>>
>>> I checked the 'struct blk_mq_ops', and I did not find a ops can be
>>> used to do DMA management. And I also checked UFS driver, it also did
>>> the DMA mapping in the queue_rq() (scsi_queue_rq() --->
>>> ufshcd_queuecommand() ---> ufshcd_map_sg()). Maybe I missed something?
>>>
>>> Moreover like I said above, for the packed request, we still need
>>> implement something (like the software queue) based on the CQE
>>> interfaces to help to handle packed requests.
>>
>> After some investigation and offline discussion with you, I still have
>> some concerns about your suggestion.
>>
>> 1) Now blk-mq have not supplied some ops to prepare a request, which is
>> used to do some DMA management asynchronously. But yes, we can
>> introduce new ops for blk-mq. But there are still some remaining
>> preparation in mmc_mq_queue_rq(), like mmc part switch. For software
>> queue, we can prepare a request totally after issuing one.
>
> I suppose to make the submission non-blocking, all operations that
> currently block in the submission path may have to be changed first.
>
> For the case of a partition switch (same for retune), I suppose
> something like this can be done:
>
> - in queue_rq() check whether a partition switch is needed. If not,
> submit the current rq
> - if a partition switch is needed, submit the partition switch cmd
> instead, and return busy status
> - when the completion arrives for the partition switch, call back into
> blk_mq to have it call queue_rq again.
>
> Or possibly even (this might not be possible without significant
> restructuring):
>
> - when preparing a request that would require a partition switch,
> insert another meta-request to switch the partition ahead of it.
>
> I do realize that this is a significant departure from how it was done
> in the past, but it seems cleaner that way to me.
>
>> 2) I wonder if it is appropriate that using the irq threaded context
>> to dispatch next request, actually we will still introduce a context
>> switch here. Now we will complete a request in the hard irq handler
>> and kick the softirq to do time-consuming operations, like DMA
>> unmapping , and will start next request in the hard irq handler
>> without context switch. Moreover if we remove the BLK_MQ_F_BLOCKING in
>> future like you suggested, then we can remove all context switch. And
>> I think we can dispatch next request in the softirq context (actually
>> the CQE already did).
>
> I hope Hannes (or someone else) can comment here, as I don't
> know exactly what his objection to kicking off the next cmd in the
> hardirq was.
>
> I think generally, deferring all slow operations to an irqthread
> rather than a softirq is a good idea, but I share your concern that
> this can introduce an unnecessary latency between the time
> the IRQ is signaled and the time the following cmd is sent to the
> hardware.
>
> Doing everything in a single (irqthread) context is clearly simpler,
> so this would need to be measured carefully to avoid unnecessary
> complexity, but I don't see anything stopping us from having
> a fast path where the low-level driver first checks for any possible
> error conditions in hardirq context and then fires off a prepared cmd
> right away whenever it can before triggering the irqthread that does
> everything else. I think this has to be a per-driver optimization, so
> the common case would just have an irqthread.
>
>> 3) For packed request support, I did not see an example that block
>> driver can dispatch a request from the IO scheduler in queue_rq() and
>> no APIs supported from blk-mq. And we do not know where can dispatch a
>> request in queue_rq(), from IO scheduler? from ctx? or from
>> hctx->dispatch list? and if this request can not be passed to host
>> now, how to do it? Seems lots of complicated things.
>
> The only way I can see is the ->last flag, so if blk_mq submits multiple
> requests in a row to queue_rq() with this flag cleared and calls
> ->commit_rqs() after the last one. This seems to be what the scsi
> disk driver and the nvme driver rely on, and we should be able to use
> it the same way for packed cmds, by checking each time in queue_rq()
> whether requests can/should be combined and reporting busy otherwise
> (after preparing a combined mmc cmd).
> blk_mq will then call commit_rqs, which should do the actual submission
> to the hardware driver.
>
> Now as you point out, the *big* problem with this is that we never
> get multiple requests together in practice, i.e. the last flag is almost
> always set, and any optimization around it has no effect.
>
> This is where I'm a bit lost in the code as well, but it seems that
> this is part of the current bfq design that only sends one request down
> the driver stack at a time, and this would have to change first before
> we can rely on this for packing requests.
>
> Paolo, can you comment on why this is currently done, or if it can
> be changed? It seems to me that sending multiple requests at
> once would also have a significant benefit on the per-request overhead
> on NVMe devices with bfq.
>

Hi,
actually, "one request dispatched at a time" is not a peculiarity of
bfq. Any scheduler can provide only one request at a time, with the
current blk-mq API for I/O schedulers.

Yet, when it is time to refill a hardware queue, blk-mq pulls as many
requests as it deems appropriate from the scheduler, by invoking the
latter multiple times. See blk_mq_do_dispatch_sched() in
block/blk-mq-sched.c.

I don't know where the glitch for MMC is with respect to this scheme.

Thanks,
Paolo


>> Moreover, we still need some interfaces for the packed request
>> handling, from previous discussion, we still need something like MMC
>> software queue based on the CQE to help to handle the packed request.
>>
>> So I think I still need to introduce the MMC software queue, on the one
>> hand is that it can really improve the performance from fio data and
>> avoid a long latency, on the other hand we can expand it to support
>> packed request easily in future. Thanks.
>>
>> (Anyway I will still post the V7 to address Adrian's comments and to
>> see if we can get a consensus there).
>
>
> Arnd

2019-11-26 12:11:27

by Arnd Bergmann

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Tue, Nov 26, 2019 at 8:41 AM Paolo Valente <[email protected]> wrote:
> > On 22 Nov 2019, at 10:50, Arnd Bergmann <[email protected]> wrote:
> > On Mon, Nov 18, 2019 at 11:04 AM (Exiting) Baolin Wang <[email protected]> wrote:
> > Paolo, can you comment on why this is currently done, or if it can
> > be changed? It seems to me that sending multiple requests at
> > once would also have a significant benefit on the per-request overhead
> > on NVMe devices with bfq.
> >
>
> Hi,
> actually, "one request dispatched at a time" is not a peculiarity of
> bfq. Any scheduler can provide only one request at a time, with the
> current blk-mq API for I/O schedulers.
>
> Yet, when it is time to refill a hardware queue, blk-mq pulls as many
> requests as it deems appropriate from the scheduler, by invoking the
> latter multiple times. See blk_mq_do_dispatch_sched() in
> block/blk-mq-sched.c.
>
> I don't know where the glitch for MMC is with respect to this scheme.

Right, this is what is puzzling me as well: in both blk_mq_do_dispatch_sched()
and blk_mq_do_dispatch_ctx(), we seem to always take one request from
the scheduler and dispatch it to the device, regardless of the driver or
the scheduler, so there should only ever be one request in the local list.

Yet, both the blk_mq_dispatch_rq_list() function and the NVMe driver
appear to be written based on the idea that there are multiple entries
in this list. The one place that I see putting multiple requests on the
local list before dispatching them is the end of
blk_mq_sched_dispatch_requests():

        if (!list_empty(&rq_list)) {
                ...
                }
        } else if (has_sched_dispatch) {
                blk_mq_do_dispatch_sched(hctx);
        } else if (hctx->dispatch_busy) {
                /* dequeue request one by one from sw queue if queue is busy */
                blk_mq_do_dispatch_ctx(hctx);
        } else {
->              blk_mq_flush_busy_ctxs(hctx, &rq_list);  <----
                blk_mq_dispatch_rq_list(q, &rq_list, false);
        }

So as you said, if we use an elevator (has_sched_dispatch == true),
we only get one request, but without an elevator, we get into this
optimized path.
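
The difference between the two paths can be illustrated with a small
userspace model. This is only a sketch, not kernel code: the toy_* names
are made up for illustration, and the model assumes that 'last' is set
only on the final entry of each list handed to the dispatch function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver: counts how many requests arrived
 * with 'last' set, i.e. with no chance of being batched. */
struct toy_driver {
	size_t queued;
	size_t last_seen;
};

static void toy_queue_rq(struct toy_driver *drv, bool last)
{
	assert(drv != NULL);
	drv->queued++;
	if (last)
		drv->last_seen++;
}

/* Models dispatching one local list: 'last' is set only for the final
 * entry of the list handed over in this call. */
static void toy_dispatch_list(struct toy_driver *drv, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		toy_queue_rq(drv, i == nr - 1);
}

/* Elevator path: one request is pulled per iteration and dispatched
 * as its own single-entry list. */
static void toy_dispatch_sched(struct toy_driver *drv, size_t pending)
{
	while (pending--)
		toy_dispatch_list(drv, 1);
}

/* No-elevator path: all pending requests are flushed into one list. */
static void toy_dispatch_flush(struct toy_driver *drv, size_t pending)
{
	toy_dispatch_list(drv, pending);
}
```

With four pending requests, the elevator path in this model delivers
four requests that all carry 'last', while the flush path delivers four
requests of which only the final one does.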

Could we perhaps change the ops.dispatch_request() function to pass
down the list as in https://paste.ubuntu.com/p/MfSRwKqFCs/ ?

Arnd

2019-11-26 12:42:16

by Hannes Reinecke

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On 11/22/19 10:50 AM, Arnd Bergmann wrote:
> (adding Paolo as well, maybe he has some more insights)
>
> On Mon, Nov 18, 2019 at 11:04 AM (Exiting) Baolin Wang
> <[email protected]> wrote:
>> On Tue, 12 Nov 2019 at 16:48, Baolin Wang <[email protected]> wrote:
>>> On Tue, Nov 12, 2019 at 12:59 AM Arnd Bergmann <[email protected]> wrote:
>>>> On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
>>>>>> - With that change in place calling a blocking __mmc_claim_host() is
>>>>>> still a problem, so there should still be a nonblocking mmc_try_claim_host()
>>>>>> for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
>>>>>> return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
>>>>>> should always return right away, either after having queued the next I/O
>>>>>> or with an error, but not waiting for the device in any way.
>>>>>
>>>>> Actually not only the mmc_claim_host() will block the MMC request
>>>>> processing, in this routine, the mmc_blk_part_switch() and
>>>>> mmc_retune() can also block the request processing. Moreover the part
>>>>> switching and tuning should be sync operations, and we can not move
>>>>> them to a work or a thread.
>>>>
>>>> Ok, I see.
>>>>
>>>> Those would also cause requests to be sent to the device or the host
>>>> controller, right? Maybe we can treat them as "a non-IO request
>>>
>>> Right.
>>>
>>>> has successfully been queued to the device" events, returning
>>>> busy from the mmc_mq_queue_rq() function and then running
>>>> the queue again when they complete?
>>>
>>> Yes, seems reasonable to me.
>>>
>>>>
>>>>>> - For the packed requests, there is apparently a very simple way to implement
>>>>>> that without a software queue: mmc_mq_queue_rq() is allowed to look at
>>>>>> and dequeue all requests that are currently part of the request_queue,
>>>>>> so it should take out as many as it wants to submit at once and send
>>>>>> them all down to the driver together, avoiding the need for any further
>>>>>> round-trips to blk_mq or maintaining a queue in mmc.
>>>>>
>>>>> You mean we can dispatch a request directly from
>>>>> elevator->type->ops.dispatch_request()? but we still need some helper
>>>>> functions to check if these requests can be packed (the package
>>>>> condition), and need to invent new APIs to start a packed request (or
>>>>> using cqe interfaces, which means we still need to implement some cqe
>>>>> callbacks).
>>>>
>>>> I don't know how the dispatch_request() function fits in there,
>>>> what Hannes told me is that in ->queue_rq() you can always
>>>> look at the following requests that are already queued up
>>>> and take the next ones off the list. Looking at bd->last
>>>> tells you if there are additional requests. If there are, you can
>>>> look at the next one from blk_mq_hw_ctx (not sure how, but
>>>> should not be hard to find)
>>>>
>>>> I also see that there is a commit_rqs() callback that may
>>>> go along with queue_rq(), implementing that one could make
>>>> this easier as well.
>>>
>>> Yes, we can use queue_rq()/commit_rqs() and bd->last (now bd->last may
>>> can not work well, see [1]), but like we talked before, for packed
>>> request, we still need some new interfaces (for example, a interface
>>> used to start a packed request, and a interface used to complete a
>>> packed request), but at last we got a consensus that we should re-use
>>> the CQE interfaces instead of new invention.
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/1102897/
>>>
>>>>
>>>>>> - The DMA management (bounce buffer, map, unmap) that is currently
>>>>>> done in mmc_blk_mq_issue_rq() should ideally be done in the
>>>>>> init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
>>>>>> can be done asynchronously, out of the critical timing path for the
>>>>>> submission. With this, there won't be any need for a software queue.
>>>>>
>>>>> This is not true, now the blk-mq will allocate some static request
>>>>> objects (usually the static requests number should be the same with
>>>>> the hardware queue depth) saved in struct blk_mq_tags. So the
>>>>> init_request() is used to initialize the static requests when
>>>>> allocating them, and call exit_request to free the static requests
>>>>> when freeing the 'struct blk_mq_tags', such as the queue is dead. So
>>>>> we can not move the DMA management into the init_request/exit_request.
>>>>
>>>> Ok, I must have misremembered which callback that is then, but I guess
>>>> there is some other place to do it.
>>>
>>> I checked the 'struct blk_mq_ops', and I did not find a ops can be
>>> used to do DMA management. And I also checked UFS driver, it also did
>>> the DMA mapping in the queue_rq() (scsi_queue_rq() --->
>>> ufshcd_queuecommand() ---> ufshcd_map_sg()). Maybe I missed something?
>>>
>>> Moreover like I said above, for the packed request, we still need
>>> implement something (like the software queue) based on the CQE
>>> interfaces to help to handle packed requests.
>>
>> After some investigation and offline discussion with you, I still have
>> some concerns about your suggestion.
>>
>> 1) Now blk-mq have not supplied some ops to prepare a request, which is
>> used to do some DMA management asynchronously. But yes, we can
>> introduce new ops for blk-mq. But there are still some remaining
>> preparation in mmc_mq_queue_rq(), like mmc part switch. For software
>> queue, we can prepare a request totally after issuing one.
>
> I suppose to make the submission non-blocking, all operations that
> currently block in the submission path may have to be changed first.
>
> For the case of a partition switch (same for retune), I suppose
> something like this can be done:
>
> - in queue_rq() check whether a partition switch is needed. If not,
> submit the current rq
> - if a partition switch is needed, submit the partition switch cmd
> instead, and return busy status
> - when the completion arrives for the partition switch, call back into
> blk_mq to have it call queue_rq again.
>
> Or possibly even (this might not be possible without significant
> restructuring):
>
> - when preparing a request that would require a partition switch,
> insert another meta-request to switch the partition ahead of it.
>
> I do realize that this is a significant departure from how it was done
> in the past, but it seems cleaner that way to me.
>
I would treat the partition issue separately from the queued/batched
submission.

Aligning with the 'traditional' Linux way for partition handling is
definitely the way to go IMO; otherwise you'll end up having to worry
about resource allocation between distinct queues (like you have to do
now), and will have a hard time trying to map it properly to the
underlying hardware abstraction in blk-mq.

For starters I would keep a partition marker in the driver instance, and
calculate the partition for each incoming request. If the partition is
different you'll have to insert a partition switch request before
submitting the actual one.
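
A minimal userspace sketch of that bookkeeping, assuming nothing about
the real MMC core: the part_* types and function names are hypothetical,
and a "command" here is just a tagged struct rather than anything sent
to hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command kinds; not the MMC core's real types. */
enum part_cmd_type { PART_CMD_SWITCH, PART_CMD_IO };

struct part_cmd {
	enum part_cmd_type type;
	unsigned int part;
};

struct part_host {
	unsigned int cur_part;	/* partition marker kept per driver instance */
};

/*
 * Expand one incoming request for 'part' into the commands that have to
 * be issued. Returns the number of entries written to 'out' (1 or 2).
 */
static size_t part_prepare(struct part_host *host, unsigned int part,
			   struct part_cmd out[2])
{
	size_t n = 0;

	assert(host && out);
	if (part != host->cur_part) {
		/* Different partition: put a switch ahead of the I/O. */
		out[n++] = (struct part_cmd){ PART_CMD_SWITCH, part };
		host->cur_part = part;
	}
	out[n++] = (struct part_cmd){ PART_CMD_IO, part };
	return n;
}
```

The sketch updates the marker as soon as the switch is prepared; a real
driver would presumably only do so once the switch has completed.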

To do this efficiently it would be good to know:
a) How invasive is the partition switch? Would it be feasible to e.g. add
a partition switch for every command? This might sound daft now, but if
we get request batching going it might not be _that_ expensive after all...
b) Can the partition switch command be batched together with the normal
command? I.e. is it possible to have them both sent in one go?
If so it would make life _so_ much easier; we could submit both commands
at the same time, and wouldn't have to worry about handling internal
completions ...

>> 2) I wonder if it is appropriate that using the irq threaded context
>> to dispatch next request, actually we will still introduce a context
>> switch here. Now we will complete a request in the hard irq handler
>> and kick the softirq to do time-consuming operations, like DMA
>> unmapping , and will start next request in the hard irq handler
>> without context switch. Moreover if we remove the BLK_MQ_F_BLOCKING in
>> future like you suggested, then we can remove all context switch. And
>> I think we can dispatch next request in the softirq context (actually
>> the CQE already did).
>
> I hope Hannes (or someone else) can comment here, as I don't
> know exactly what his objection to kicking off the next cmd in the
> hardirq was.
>
The point being that you'll have to have a context switch anyway
(between hardirq and softirq), and you'll be able to better handle
recovery as the hardirq handler is pretty generic and there won't be any
chance of that one becoming stuck.
And, of course, modern software design :-)

> I think generally, deferring all slow operations to an irqthread
> rather than a softirq is a good idea, but I share your concern that
> this can introduce an unnecessary latency between the time
> the IRQ is signaled and the time the following cmd is sent to the
> hardware.
>
> Doing everything in a single (irqthread) context is clearly simpler,
> so this would need to be measured carefully to avoid unnecessary
> complexity, but I don't see anything stopping us from having
> the fast-path where the low-level driver first checks for any possible
> error conditions in hardirq context and then fires off a prepared cmd
> right away whenever it can before triggering the irqthread that does
> everything else. I think this has to be a per-driver optimization, so
> the common case would just have an irqthread.
>
>> 3) For packed request support, I did not see an example that block
>> driver can dispatch a request from the IO scheduler in queue_rq() and
>> no APIs supported from blk-mq. And we do not know where can dispatch a
>> request in queue_rq(), from IO scheduler? from ctx? or from
>> hctx->dispatch list? and if this request can not be passed to host
>> now, how to do it? Seems lots of complicated things.
>
> The only way I can see is the ->last flag, so if blk_mq submits multiple
> requests in a row to queue_rq() with this flag cleared and calls
> ->commit_rqs() after the last one. This seems to be what the scsi
> disk driver and the nvme driver rely on, and we should be able to use
> it the same way for packed cmds, by checking each time in queue_rq()
> whether requests can/should be combined and reporting busy otherwise
> (after preparing a combined mmc cmd).
> blk_mq will then call commit_rqs, which should do the actual submission
> to the hardware driver.
>
The ->last flag really depends on the submission thread, e.g. on whether
something in the upper layers is using on-stack plugging.
In my experience this is done only in some specific use-cases or
filesystems, so this is not something you can rely on to make any decisions.

> Now as you point out, the *big* problem with this is that we never
> get multiple requests together in practice, i.e. the last flag is almost
> always set, and any optimization around it has no effect.
>
See above. I don't think using the ->last flag is the way to go here.
What you really need to do here is to inject some 'pushback' into the
block layer so that it has a _chance_ of assembling more requests.

But the actual design really needs to take hardware features into account.

As mentioned above, initially I would concentrate on getting the
partitioning working with a single request queue; once that is done we
can look at request batching proper.

And for that you could have a look at the S/390 DASD driver (right,
Arnd?), which has a very similar concept.

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2019-11-27 07:51:07

by Baolin Wang

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Tue, Nov 26, 2019 at 7:17 PM Hannes Reinecke <[email protected]> wrote:
>
> On 11/22/19 10:50 AM, Arnd Bergmann wrote:
> > (adding Paolo as well, maybe he has some more insights)
> >
> > On Mon, Nov 18, 2019 at 11:04 AM (Exiting) Baolin Wang
> > <[email protected]> wrote:
> >> On Tue, 12 Nov 2019 at 16:48, Baolin Wang <[email protected]> wrote:
> >>> On Tue, Nov 12, 2019 at 12:59 AM Arnd Bergmann <[email protected]> wrote:
> >>>> On Mon, Nov 11, 2019 at 1:58 PM Baolin Wang <[email protected]> wrote:
> >>>>>> - With that change in place calling a blocking __mmc_claim_host() is
> >>>>>> still a problem, so there should still be a nonblocking mmc_try_claim_host()
> >>>>>> for the submission path, leading to a BLK_STS_DEV_RESOURCE (?)
> >>>>>> return code from mmc_mq_queue_rq(). Basically mmc_mq_queue_rq()
> >>>>>> should always return right away, either after having queued the next I/O
> >>>>>> or with an error, but not waiting for the device in any way.
> >>>>>
> >>>>> Actually not only the mmc_claim_host() will block the MMC request
> >>>>> processing, in this routine, the mmc_blk_part_switch() and
> >>>>> mmc_retune() can also block the request processing. Moreover the part
> >>>>> switching and tuning should be sync operations, and we can not move
> >>>>> them to a work or a thread.
> >>>>
> >>>> Ok, I see.
> >>>>
> >>>> Those would also cause requests to be sent to the device or the host
> >>>> controller, right? Maybe we can treat them as "a non-IO request
> >>>
> >>> Right.
> >>>
> >>>> has successfully been queued to the device" events, returning
> >>>> busy from the mmc_mq_queue_rq() function and then running
> >>>> the queue again when they complete?
> >>>
> >>> Yes, seems reasonable to me.
> >>>
> >>>>
> >>>>>> - For the packed requests, there is apparently a very simple way to implement
> >>>>>> that without a software queue: mmc_mq_queue_rq() is allowed to look at
> >>>>>> and dequeue all requests that are currently part of the request_queue,
> >>>>>> so it should take out as many as it wants to submit at once and send
> >>>>>> them all down to the driver together, avoiding the need for any further
> >>>>>> round-trips to blk_mq or maintaining a queue in mmc.
> >>>>>
> >>>>> You mean we can dispatch a request directly from
> >>>>> elevator->type->ops.dispatch_request()? but we still need some helper
> >>>>> functions to check if these requests can be packed (the package
> >>>>> condition), and need to invent new APIs to start a packed request (or
> >>>>> using cqe interfaces, which means we still need to implement some cqe
> >>>>> callbacks).
> >>>>
> >>>> I don't know how the dispatch_request() function fits in there,
> >>>> what Hannes told me is that in ->queue_rq() you can always
> >>>> look at the following requests that are already queued up
> >>>> and take the next ones off the list. Looking at bd->last
> >>>> tells you if there are additional requests. If there are, you can
> >>>> look at the next one from blk_mq_hw_ctx (not sure how, but
> >>>> should not be hard to find)
> >>>>
> >>>> I also see that there is a commit_rqs() callback that may
> >>>> go along with queue_rq(), implementing that one could make
> >>>> this easier as well.
> >>>
> >>> Yes, we can use queue_rq()/commit_rqs() and bd->last (now bd->last may
> >>> can not work well, see [1]), but like we talked before, for packed
> >>> request, we still need some new interfaces (for example, a interface
> >>> used to start a packed request, and a interface used to complete a
> >>> packed request), but at last we got a consensus that we should re-use
> >>> the CQE interfaces instead of new invention.
> >>>
> >>> [1] https://lore.kernel.org/patchwork/patch/1102897/
> >>>
> >>>>
> >>>>>> - The DMA management (bounce buffer, map, unmap) that is currently
> >>>>>> done in mmc_blk_mq_issue_rq() should ideally be done in the
> >>>>>> init_request()/exit_request() (?) callbacks from mmc_mq_ops so this
> >>>>>> can be done asynchronously, out of the critical timing path for the
> >>>>>> submission. With this, there won't be any need for a software queue.
> >>>>>
> >>>>> This is not true, now the blk-mq will allocate some static request
> >>>>> objects (usually the static requests number should be the same with
> >>>>> the hardware queue depth) saved in struct blk_mq_tags. So the
> >>>>> init_request() is used to initialize the static requests when
> >>>>> allocating them, and call exit_request to free the static requests
> >>>>> when freeing the 'struct blk_mq_tags', such as the queue is dead. So
> >>>>> we can not move the DMA management into the init_request/exit_request.
> >>>>
> >>>> Ok, I must have misremembered which callback that is then, but I guess
> >>>> there is some other place to do it.
> >>>
> >>> I checked the 'struct blk_mq_ops', and I did not find a ops can be
> >>> used to do DMA management. And I also checked UFS driver, it also did
> >>> the DMA mapping in the queue_rq() (scsi_queue_rq() --->
> >>> ufshcd_queuecommand() ---> ufshcd_map_sg()). Maybe I missed something?
> >>>
> >>> Moreover like I said above, for the packed request, we still need
> >>> implement something (like the software queue) based on the CQE
> >>> interfaces to help to handle packed requests.
> >>
> >> After some investigation and offline discussion with you, I still have
> >> some concerns about your suggestion.
> >>
> >> 1) Now blk-mq have not supplied some ops to prepare a request, which is
> >> used to do some DMA management asynchronously. But yes, we can
> >> introduce new ops for blk-mq. But there are still some remaining
> >> preparation in mmc_mq_queue_rq(), like mmc part switch. For software
> >> queue, we can prepare a request totally after issuing one.
> >
> > I suppose to make the submission non-blocking, all operations that
> > currently block in the submission path may have to be changed first.
> >
> > For the case of a partition switch (same for retune), I suppose
> > something like this can be done:
> >
> > - in queue_rq() check whether a partition switch is needed. If not,
> > submit the current rq
> > - if a partition switch is needed, submit the partition switch cmd
> > instead, and return busy status
> > - when the completion arrives for the partition switch, call back into
> > blk_mq to have it call queue_rq again.
> >
> > Or possibly even (this might not be possible without significant
> > restructuring):
> >
> > - when preparing a request that would require a partition switch,
> > insert another meta-request to switch the partition ahead of it.
> >
> > I do realize that this is a significant departure from how it was done
> > in the past, but it seems cleaner that way to me.
> >
> I would be treating the partition issue separate from the queued/batched
> submission.
>
> Aligning with the 'traditional' linux way for partition handling is
> definitely the way to go IMO; otherwise you'll end up having to worry
> about resource allocation between distinct queues (like you have to do
> now), and will be having a hard time trying to map it properly to the
> underlying hardware abstraction in blk-mq.
>
> For starters I would keep a partition marker in the driver instance, and
> calculate the partition for each incoming request. If the partition is
> different you'll have to insert a partition switch request before
> submitting the actual one.
>
> To do this efficiently it would be good to know:
> a) How invasive is the partition switch? Would it be feasible to e.g. add
> a partition switch for every command? This might sound daft now, but if
> we get request batching going it might not be _that_ expensive after all...

I think this is expensive: not all SD host controllers or cards can
handle batched requests, only those that support ADMA3 or packed
commands.

> b) Can the partition switch command be batched together with the normal
> command? I.e. is it possible to have them both sent in one go?

I do not think that the switch command can be batched together with
the normal command. We must wait for the completion of the switch
command before sending the normal command. It has to be a synchronous
command in the MMC stack.

I think the first method suggested by Arnd can work.

> If so it would make life _so_ much easier; we could submit both commands
> at the same time, and wouldn't have to worry about handling internal
> completions ...
>
> >> 2) I wonder if it is appropriate that using the irq threaded context
> >> to dispatch next request, actually we will still introduce a context
> >> switch here. Now we will complete a request in the hard irq handler
> >> and kick the softirq to do time-consuming operations, like DMA
> >> unmapping , and will start next request in the hard irq handler
> >> without context switch. Moreover if we remove the BLK_MQ_F_BLOCKING in
> >> future like you suggested, then we can remove all context switch. And
> >> I think we can dispatch next request in the softirq context (actually
> >> the CQE already did).
> >
> > I hope Hannes (or someone else) can comment here, as I don't
> > know exactly what his objection to kicking off the next cmd in the
> > hardirq was.
> >
> The point being that you'll have to have a context switch anyway
> (between hardirq and softirq), and you'll be able to better handle
> recovery as the hardirq handler is pretty generic and there won't be any
> chance of that one becoming stuck.
> And, of course, modern software design :-)
>
> > I think generally, deferring all slow operations to an irqthread
> > rather than a softirq is a good idea, but I share your concern that
> > this can introduce an unnecessary latency between the time
> > the IRQ is signaled and the time the following cmd is sent to the
> > hardware.
> >
> > Doing everything in a single (irqthread) context is clearly simpler,
> > so this would need to be measured carefully to avoid unnecessary
> > complexity, but I don't see anything stopping us from having
> > the fast-path where the low-level driver first checks for any possible
> > error conditions in hardirq context and then fires off a prepared cmd
> > right away whenever it can before triggering the irqthread that does
> > everything else. I think this has to be a per-driver optimization, so
> > the common case would just have an irqthread.
> >
> >> 3) For packed request support, I did not see an example that block
> >> driver can dispatch a request from the IO scheduler in queue_rq() and
> >> no APIs supported from blk-mq. And we do not know where can dispatch a
> >> request in queue_rq(), from IO scheduler? from ctx? or from
> >> hctx->dispatch list? and if this request can not be passed to host
> >> now, how to do it? Seems lots of complicated things.
> >
> > The only way I can see is the ->last flag, so if blk_mq submits multiple
> > requests in a row to queue_rq() with this flag cleared and calls
> > ->commit_rqs() after the last one. This seems to be what the scsi
> > disk driver and the nvme driver rely on, and we should be able to use
> > it the same way for packed cmds, by checking each time in queue_rq()
> > whether requests can/should be combined and reporting busy otherwise
> > (after preparing a combined mmc cmd).
> > blk_mq will then call commit_rqs, which should do the actual submission
> > to the hardware driver.
> >
> The ->last flag really depends on the submission thread, e.g. on whether
> something in the upper layers is using on-stack plugging.
> In my experience this is done only in some specific use-cases or
> filesystems, so this is not something you can rely on to make any decisions.

On-stack plugging is enabled in my case, and according to the commit
message that introduced this structure, the ->last flag is used to
indicate the last request in the chain, but it does not seem to work
as described:

"Since we have the notion of a 'last' request in a chain, we can use
this to have the hardware optimize the issuing of requests. Add
a list_head parameter to queue_rq that the driver can use to
temporarily store hw commands for issue when 'last' is true. If we
are doing a chain of requests, pass in a NULL list for the first
request to force issue of that immediately, then batch the remainder
for deferred issue until the last request has been sent."
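
The intended use of the flag can be sketched in userspace as follows.
This is an illustration under assumptions, not the blk-mq or MMC code:
the batch_* names and the BATCH_MAX limit are made up, and "issuing"
is reduced to bumping counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 8	/* hypothetical limit on staged commands */

struct batch_hw {
	size_t staged;		/* commands prepared but not yet issued */
	size_t issued;		/* commands handed to the hardware */
	size_t doorbells;	/* how often the hardware was kicked */
};

/* Counterpart of a ->commit_rqs() hook: issue everything staged so far. */
static void batch_commit_rqs(struct batch_hw *hw)
{
	if (!hw->staged)
		return;
	hw->issued += hw->staged;
	hw->staged = 0;
	hw->doorbells++;
}

/* Counterpart of ->queue_rq(): stage the command and only kick the
 * hardware when 'last' is set or the staging area is full. */
static void batch_queue_rq(struct batch_hw *hw, bool last)
{
	assert(hw->staged < BATCH_MAX);
	hw->staged++;
	if (last || hw->staged == BATCH_MAX)
		batch_commit_rqs(hw);
}
```

If 'last' is set on every request, as observed here, each call ends in
a commit and the staging never holds more than one command.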

>
> > Now as you point out, the *big* problem with this is that we never
> > get multiple requests together in practice, i.e. the last flag is almost
> > always set, and any optimization around it has no effect.
> >
> See above. I don't think using the ->last flag is the way to go here.
> What you really need to do here is to inject some 'pushback' into the
> block layer so that it has a _chance_ of assembling more requests.

How about Arnd's suggestion? It looks reasonable to me to have a
new API to handle batched requests.
https://paste.ubuntu.com/p/MfSRwKqFCs/

>
> But the actual design really needs to take hardware features into account.
>
> As mentioned above, initially I would concentrate on getting the
> partitioning working with a single request queue; once that is done we
> can look at request batching proper.

OK.

>
> And for that you could have a look at the S/390 DASD driver (right,
> Arnd?), which has a very similar concept.

Yes, I've looked at the dasd.c driver: it uses a list to link
requests from blk-mq and handles them with a tasklet. But wouldn't this
cause a long latency if we link more requests into the list and
dispatch them to the controller slowly?

Thanks for your input.

2019-11-27 09:02:30

by Christoph Hellwig

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Tue, Nov 26, 2019 at 12:17:15PM +0100, Hannes Reinecke wrote:
> Aligning with the 'traditional' linux way for partition handling is
> definitely the way to go IMO; otherwise you'll end up having to worry
> about resource allocation between distinct queues (like you have to do
> now), and will be having a hard time trying to map it properly to the
> underlying hardware abstraction in blk-mq.

Sorry, but this is complete bullshit. Except for the very unfortunate
name, MMC partitions have nothing to do with partitions. They are a
concept roughly equivalent to SCSI logical units and NVMe namespaces,
just with a pretty idiotic design decision that only allows I/O to one
of them at a time. The block layer way to deal with them is to use
a shared tagset for multiple request queues, which doesn't use up a
whole lot of resources. The only hard part is the draining when
switching between partitions, and there is no really nice way to
deal with that. If requests are batched enough we could just drain
and switch every time another partition access comes in. Especially
so if people only use partitions for boot partitions and other rarely
used areas. If that doesn't work out we'll just have to reject other
partition access and then use a timer and/or counter to eventually
switch and provide basic fairness.
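
The counter variant could look roughly like the userspace sketch below.
Everything in it is an assumption made for illustration: the fair_*
names, the FAIR_MAX_DEFERRED threshold, and the premise that a rejected
request is simply retried by the caller later.

```c
#include <assert.h>
#include <stdbool.h>

#define FAIR_MAX_DEFERRED 4	/* hypothetical fairness threshold */

struct fair_host {
	unsigned int cur_part;	/* partition currently selected */
	unsigned int inflight;	/* issued to cur_part, not yet completed */
	unsigned int deferred;	/* rejected accesses to other partitions */
};

/* Returns true if the request may be issued now, false if the caller
 * has to retry later. */
static bool fair_admit(struct fair_host *h, unsigned int part)
{
	if (part == h->cur_part) {
		/* Others waited too long: stop feeding this partition
		 * so that it drains. */
		if (h->deferred >= FAIR_MAX_DEFERRED)
			return false;
		h->inflight++;
		return true;
	}
	if (h->inflight) {
		h->deferred++;	/* must drain before switching */
		return false;
	}
	h->cur_part = part;	/* drained: the switch is now safe */
	h->deferred = 0;
	h->inflight = 1;
	return true;
}

static void fair_complete(struct fair_host *h)
{
	assert(h->inflight > 0);
	h->inflight--;
}
```

A timer could replace or complement the counter; the sketch leaves that
out to keep the draining logic visible.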

2019-11-27 12:03:11

by Arnd Bergmann

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Wed, Nov 27, 2019 at 10:00 AM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Nov 26, 2019 at 12:17:15PM +0100, Hannes Reinecke wrote:
> If requests are batched enough we could just drain
> and switch every time an other partition access comes in. Especially
> so if people only use partitions for boot partitions and other rarely
> used areas.

We only support a single user partition plus up to two boot partitions that
are accessed rarely, so I don't think there is any reason to optimize
switching between them.

The only change that I think we need here is to change the partition switch
from something that is done synchronously during ->queue_rq() to
something that fits better into the normal scheme of sending a cmd to
the device, returning BLK_STS_RESOURCE from ->queue_rq().
Possibly this could even be turned into a standard struct request that is
added between two normal requests for different partitions at some
point, if this simplifies the logic (I suspect it won't, but it may be worth
a try).
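
As a rough userspace sketch of that flow (not the actual driver code):
the sw_* names and the two-value status enum are stand-ins invented for
illustration, with SW_STS_RESOURCE playing the role of BLK_STS_RESOURCE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the block layer status codes. */
enum sw_status { SW_STS_OK, SW_STS_RESOURCE };

struct sw_host {
	unsigned int cur_part;
	unsigned int target_part;
	bool switching;		/* a switch cmd is in flight */
	unsigned int io_issued;
};

/* Never blocks: either issues the I/O, or starts (or waits for) a
 * partition switch and reports busy. */
static enum sw_status sw_queue_rq(struct sw_host *h, unsigned int part)
{
	if (h->switching)
		return SW_STS_RESOURCE;	/* switch still pending */
	if (part != h->cur_part) {
		h->target_part = part;	/* send the switch cmd instead */
		h->switching = true;
		return SW_STS_RESOURCE;
	}
	h->io_issued++;
	return SW_STS_OK;
}

/* Completion handler for the switch cmd. A real driver would ask the
 * block layer to run the queue again at this point. */
static void sw_switch_done(struct sw_host *h)
{
	assert(h->switching);
	h->cur_part = h->target_part;
	h->switching = false;
}
```

The request that triggered the switch is not consumed; it is expected to
be offered to sw_queue_rq() again once the queue is rerun.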

Arnd

2019-11-28 12:19:19

by Martin K. Petersen

Subject: Re: [PATCH v6 0/4] Add MMC software queue support


Christoph,

> equivalent to SCSI logical units and nvme namespace, just with a
> pretty idiotic design decision that only allows I/O to one of them at
> a time. The block layer way to deal with them is to use a shared
> tagset for multiple request queues, which doesn't use up a whole lot
> of resources. The only hard part is the draining when switching
> between partitions, and there is no really nice way to deal with that.
> If requests are batched enough we could just drain and switch every
> time an other partition access comes in.

This mirrors single_lun in SCSI closely. I was hoping we could
eventually get rid of that travesty but if MMC needs something similar,
maybe it would be good to move that plumbing to block?

--
Martin K. Petersen Oracle Linux Engineering

2019-11-28 15:56:06

by Christoph Hellwig

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Thu, Nov 28, 2019 at 07:15:09AM -0500, Martin K. Petersen wrote:
> This mirrors single_lun in SCSI closely. I was hoping we could
> eventually get rid of that travesty but if MMC needs something similar,
> maybe it would be good to move that plumbing to block?

Oh, I totally forgot about single_lun. Given that it is only set
for ancient CD changers I'm not even sure that code works properly
anymore. Having common code that is exercised regularly by mmc would
certainly be better than the current SCSI code, but I'm not sure how
well this is going to work out.

2019-12-10 15:18:54

by Ulf Hansson

Subject: Re: [PATCH v6 0/4] Add MMC software queue support

On Wed, 27 Nov 2019 at 13:01, Arnd Bergmann wrote:
>
> On Wed, Nov 27, 2019 at 10:00 AM Christoph Hellwig wrote:
> >
> > On Tue, Nov 26, 2019 at 12:17:15PM +0100, Hannes Reinecke wrote:
> > If requests are batched enough we could just drain
> > and switch every time an other partition access comes in. Especially
> > so if people only use partitions for boot partitions and other rarely
> > used areas.
>
> We only support a single user partition plus up to two boot partitions that
> are accessed rarely, I don't think there is any reason to optimize switching
> between them.

I agree. However, let me just add some more information to this.

There are more partitions, like the RPMB for example. With regard to
partition switching, after serving a request to the RPMB partition, we
always switch back to the main user area. I think that is sufficient.

Also note that requests for the RPMB partitions are managed via
REQ_OP_DRV_IN|OUT.

>
> The only change that I think we need here is to change the partition switch
> from something that is done synchronously during ->queue_rq() to
> something that fits better into normal scheme of sending a cmd to
> the device, returning BLK_STS_RESOURCE from ->queue_rq.

You want to translate them to be managed similarly to REQ_OP_DRV_IN|OUT, no?

I am just trying to understand what this would help us with, but I
don't get it, sorry.

I realize that I am joining the show a bit late; apologies for that.
But it seems like you are forgetting about re-tuning, urgent bkops,
card detect, SDIO combo cards, etc.

For example, re-tuning may be required because of a CRC error on the
previously sent transfer. Thus re-tuning must be done before serving
the next request.

Likewise, when the device signals urgent bkops status, we must not
serve any new request until the card has notified us that it is done
with its internal housekeeping operations.
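
As a rough illustration of the ordering constraint described above, here is a
minimal userspace sketch, not kernel code. All names (struct sim_card,
sim_try_issue(), sim_finish(), the SIM_* flags) are hypothetical and exist
only for this example; the real MMC core tracks re-tuning and bkops
differently.

```c
#include <stdbool.h>

/* Hypothetical flags for work that must finish before the next request. */
enum sim_pending {
	SIM_NEED_RETUNE  = 1 << 0, /* e.g. CRC error on the previous transfer */
	SIM_URGENT_BKOPS = 1 << 1, /* card signalled urgent bkops status */
};

/* Hypothetical per-card state. */
struct sim_card {
	unsigned int pending;	/* bitmask of enum sim_pending */
	int issued;		/* requests handed to the hardware */
};

/* Issue the next request only if no housekeeping is outstanding. */
static bool sim_try_issue(struct sim_card *c)
{
	if (c->pending)
		return false;	/* caller has to retry later */
	c->issued++;
	return true;
}

/* Record that one kind of housekeeping has finished. */
static void sim_finish(struct sim_card *c, enum sim_pending what)
{
	c->pending &= ~(unsigned int)what;
}
```

The point of the sketch is only that partition switching is one of several
conditions gating the next request, so any scheme for it has to coexist with
the others.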

> Possibly this could even be turned into a standard struct request that is
> added between two normal requests for different partitions at some
> point, if this simplifies the logic (I suspect it won't, but it may be worth
> a try).

Doing so means that re-tuning, bkops, etc. also need to be managed in the
same way. Is this really the way to go?

Kind regards
Uffe