2020-02-05 12:52:31

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v8 0/5] Add MMC software queue support

Hi All,

Now the MMC read/write stack will always wait for previous request is
completed by mmc_blk_rw_wait(), before sending a new request to hardware,
or queue a work to complete request, that will bring context switching
overhead, especially for high I/O per second rates, to affect the IO
performance.

Thus this patch set will introduce the MMC software command queue support
based on command queue engine's interfaces, and set the queue depth as 64
to allow more requests can be be prepared, merged and inserted into IO
scheduler, but we only allow 2 requests in flight, that is enough to let
the irq handler always trigger the next request without a context switch,
as well as avoiding a long latency.

Moreover we can expand the MMC software queue interface to support
MMC packed request or packed command instead of adding new interfaces,
according to previosus discussion.

Below are some comparison data with fio tool. The fio command I used
is like below with changing the '--rw' parameter and enabling the direct
IO flag to measure the actual hardware transfer speed in 4K block size.

./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read

My eMMC card working at HS400 Enhanced strobe mode:
[ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
[ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
[ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
[ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
[ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)

1. Without MMC software queue
I tested 5 times for each case and output a average speed.

1) Sequential read:
Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
Average speed: 59.66MiB/s

2) Random read:
Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
Average speed: 27.04MiB/s

3) Sequential write:
Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
Average speed: 69.68MiB/s

4) Random write:
Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
Average speed: 35.96MiB/s

2. With MMC software queue
I tested 5 times for each case and output a average speed.

1) Sequential read:
Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
Average speed: 60.68MiB/s

2) Random read:
Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
Average speed: 31.36MiB/s

3) Sequential write:
Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
Average speed: 71.66MiB/s

4) Random write:
Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
Average speed: 68.76MiB/s

Form above data, we can see the MMC software queue can help to improve some
performance obviously for random read and write, though no obvious improvement
for sequential read and write.

Any comments are welcome. Thanks a lot.

Changes from v7:
- Add reviewed tag from Arnd.
- Use the 'hsq' acronym for varibles and functions in the core layer.
- Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
can work normally.
- Add a new patch to enable the host software queue for the SD card.
- Use the default MMC queue depth for host software queue.

Changes from v6:
- Change the patch order and set host->always_defer_done = true for the
Spreadtrum host driver.

Changes from v5:
- Modify the condition of defering to complete request suggested by Adrian.

Changes from v4:
- Add a seperate patch to introduce a variable to defer to complete
data requests for some host drivers, when using host software queue.

Changes from v3:
- Use host software queue instead of sqhci.
- Fix random config building issue.
- Change queue depth to 32, but still only allow 2 requests in flight.
- Update the testing data.

Changes from v2:
- Remove reference to 'struct cqhci_host' and 'struct cqhci_slot',
instead adding 'struct sqhci_host', which is only used by software queue.

Changes from v1:
- Add request_done ops for sdhci_ops.
- Replace virtual command queue with software queue for functions and
variables.
- Rename the software queue file and add sqhci.h header file.

Baolin Wang (5):
mmc: Add MMC host software queue support
mmc: core: Enable the MMC host software queue for the SD card
mmc: host: sdhci: Add request_done ops for struct sdhci_ops
mmc: host: sdhci: Add a variable to defer to complete requests if
needed
mmc: host: sdhci-sprd: Add software queue support

drivers/mmc/core/block.c | 61 ++++++++
drivers/mmc/core/mmc.c | 13 +-
drivers/mmc/core/queue.c | 22 ++-
drivers/mmc/core/sd.c | 10 ++
drivers/mmc/host/Kconfig | 8 +
drivers/mmc/host/Makefile | 1 +
drivers/mmc/host/cqhci.c | 3 +
drivers/mmc/host/mmc_hsq.c | 343 +++++++++++++++++++++++++++++++++++++++++
drivers/mmc/host/mmc_hsq.h | 30 ++++
drivers/mmc/host/sdhci-sprd.c | 28 ++++
drivers/mmc/host/sdhci.c | 14 +-
drivers/mmc/host/sdhci.h | 3 +
include/linux/mmc/host.h | 3 +
13 files changed, 529 insertions(+), 10 deletions(-)
create mode 100644 drivers/mmc/host/mmc_hsq.c
create mode 100644 drivers/mmc/host/mmc_hsq.h

--
1.7.9.5


2020-02-05 12:52:39

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v8 1/5] mmc: Add MMC host software queue support

From: Baolin Wang <[email protected]>

Now the MMC read/write stack will always wait for previous request is
completed by mmc_blk_rw_wait(), before sending a new request to hardware,
or queue a work to complete request, that will bring context switching
overhead, especially for high I/O per second rates, to affect the IO
performance.

Thus this patch introduces MMC software queue interface based on the
hardware command queue engine's interfaces, which is similar with the
hardware command queue engine's idea, that can remove the context
switching. Moreover we set the default queue depth as 64 for software
queue, which allows more requests to be prepared, merged and inserted
into IO scheduler to improve performance, but we only allow 2 requests
in flight, that is enough to let the irq handler always trigger the
next request without a context switch, as well as avoiding a long latency.

From the fio testing data in cover letter, we can see the software
queue can improve some performance with 4K block size, increasing
about 16% for random read, increasing about 90% for random write,
though no obvious improvement for sequential read and write.

Moreover we can expand the software queue interface to support MMC
packed request or packed command in future.

Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/core/block.c | 61 ++++++++
drivers/mmc/core/mmc.c | 13 +-
drivers/mmc/core/queue.c | 22 ++-
drivers/mmc/host/Kconfig | 7 +
drivers/mmc/host/Makefile | 1 +
drivers/mmc/host/cqhci.c | 3 +
drivers/mmc/host/mmc_hsq.c | 343 ++++++++++++++++++++++++++++++++++++++++++++
drivers/mmc/host/mmc_hsq.h | 30 ++++
include/linux/mmc/host.h | 3 +
9 files changed, 476 insertions(+), 7 deletions(-)
create mode 100644 drivers/mmc/host/mmc_hsq.c
create mode 100644 drivers/mmc/host/mmc_hsq.h

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 663d879..55d52fc 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -168,6 +168,11 @@ struct mmc_rpmb_data {

static inline int mmc_blk_part_switch(struct mmc_card *card,
unsigned int part_type);
+static void mmc_blk_rw_rq_prep(struct mmc_queue_req *mqrq,
+ struct mmc_card *card,
+ int disable_multi,
+ struct mmc_queue *mq);
+static void mmc_blk_hsq_req_done(struct mmc_request *mrq);

static struct mmc_blk_data *mmc_blk_get(struct gendisk *disk)
{
@@ -1532,9 +1537,30 @@ static int mmc_blk_cqe_issue_flush(struct mmc_queue *mq, struct request *req)
return mmc_blk_cqe_start_req(mq->card->host, mrq);
}

+static int mmc_blk_hsq_issue_rw_rq(struct mmc_queue *mq, struct request *req)
+{
+ struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+ struct mmc_host *host = mq->card->host;
+ int err;
+
+ mmc_blk_rw_rq_prep(mqrq, mq->card, 0, mq);
+ mqrq->brq.mrq.done = mmc_blk_hsq_req_done;
+ mmc_pre_req(host, &mqrq->brq.mrq);
+
+ err = mmc_cqe_start_req(host, &mqrq->brq.mrq);
+ if (err)
+ mmc_post_req(host, &mqrq->brq.mrq, err);
+
+ return err;
+}
+
static int mmc_blk_cqe_issue_rw_rq(struct mmc_queue *mq, struct request *req)
{
struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+ struct mmc_host *host = mq->card->host;
+
+ if (host->hsq_enabled)
+ return mmc_blk_hsq_issue_rw_rq(mq, req);

mmc_blk_data_prep(mq, mqrq, 0, NULL, NULL);

@@ -1920,6 +1946,41 @@ static void mmc_blk_urgent_bkops(struct mmc_queue *mq,
mmc_run_bkops(mq->card);
}

+static void mmc_blk_hsq_req_done(struct mmc_request *mrq)
+{
+ struct mmc_queue_req *mqrq =
+ container_of(mrq, struct mmc_queue_req, brq.mrq);
+ struct request *req = mmc_queue_req_to_req(mqrq);
+ struct request_queue *q = req->q;
+ struct mmc_queue *mq = q->queuedata;
+ struct mmc_host *host = mq->card->host;
+ unsigned long flags;
+
+ if (mmc_blk_rq_error(&mqrq->brq) ||
+ mmc_blk_urgent_bkops_needed(mq, mqrq)) {
+ spin_lock_irqsave(&mq->lock, flags);
+ mq->recovery_needed = true;
+ mq->recovery_req = req;
+ spin_unlock_irqrestore(&mq->lock, flags);
+
+ host->cqe_ops->cqe_recovery_start(host);
+
+ schedule_work(&mq->recovery_work);
+ return;
+ }
+
+ mmc_blk_rw_reset_success(mq, req);
+
+ /*
+ * Block layer timeouts race with completions which means the normal
+ * completion path cannot be used during recovery.
+ */
+ if (mq->in_recovery)
+ mmc_blk_cqe_complete_rq(mq, req);
+ else
+ blk_mq_complete_request(req);
+}
+
void mmc_blk_mq_complete(struct request *req)
{
struct mmc_queue *mq = req->q->queuedata;
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index f6912de..7a9976f 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1851,15 +1851,22 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
*/
card->reenable_cmdq = card->ext_csd.cmdq_en;

- if (card->ext_csd.cmdq_en && !host->cqe_enabled) {
+ if (host->cqe_ops && !host->cqe_enabled) {
err = host->cqe_ops->cqe_enable(host, card);
if (err) {
pr_err("%s: Failed to enable CQE, error %d\n",
mmc_hostname(host), err);
} else {
host->cqe_enabled = true;
- pr_info("%s: Command Queue Engine enabled\n",
- mmc_hostname(host));
+
+ if (card->ext_csd.cmdq_en) {
+ pr_info("%s: Command Queue Engine enabled\n",
+ mmc_hostname(host));
+ } else {
+ host->hsq_enabled = true;
+ pr_info("%s: Host Software Queue enabled\n",
+ mmc_hostname(host));
+ }
}
}

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 9edc086..25bee3d 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -62,7 +62,7 @@ enum mmc_issue_type mmc_issue_type(struct mmc_queue *mq, struct request *req)
{
struct mmc_host *host = mq->card->host;

- if (mq->use_cqe)
+ if (mq->use_cqe && !host->hsq_enabled)
return mmc_cqe_issue_type(host, req);

if (req_op(req) == REQ_OP_READ || req_op(req) == REQ_OP_WRITE)
@@ -124,12 +124,14 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
{
struct request_queue *q = req->q;
struct mmc_queue *mq = q->queuedata;
+ struct mmc_card *card = mq->card;
+ struct mmc_host *host = card->host;
unsigned long flags;
int ret;

spin_lock_irqsave(&mq->lock, flags);

- if (mq->recovery_needed || !mq->use_cqe)
+ if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled)
ret = BLK_EH_RESET_TIMER;
else
ret = mmc_cqe_timed_out(req);
@@ -144,12 +146,13 @@ static void mmc_mq_recovery_handler(struct work_struct *work)
struct mmc_queue *mq = container_of(work, struct mmc_queue,
recovery_work);
struct request_queue *q = mq->queue;
+ struct mmc_host *host = mq->card->host;

mmc_get_card(mq->card, &mq->ctx);

mq->in_recovery = true;

- if (mq->use_cqe)
+ if (mq->use_cqe && !host->hsq_enabled)
mmc_blk_cqe_recovery(mq);
else
mmc_blk_mq_recovery(mq);
@@ -160,6 +163,9 @@ static void mmc_mq_recovery_handler(struct work_struct *work)
mq->recovery_needed = false;
spin_unlock_irq(&mq->lock);

+ if (host->hsq_enabled)
+ host->cqe_ops->cqe_recovery_finish(host);
+
mmc_put_card(mq->card, &mq->ctx);

blk_mq_run_hw_queues(q, true);
@@ -279,6 +285,14 @@ static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
}
break;
case MMC_ISSUE_ASYNC:
+ /*
+ * For MMC host software queue, we only allow 2 requests in
+ * flight to avoid a long latency.
+ */
+ if (host->hsq_enabled && mq->in_flight[issue_type] > 2) {
+ spin_unlock_irq(&mq->lock);
+ return BLK_STS_RESOURCE;
+ }
break;
default:
/*
@@ -430,7 +444,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
* The queue depth for CQE must match the hardware because the request
* tag is used to index the hardware queue.
*/
- if (mq->use_cqe)
+ if (mq->use_cqe && !host->hsq_enabled)
mq->tag_set.queue_depth =
min_t(int, card->ext_csd.cmdq_depth, host->cqe_qdepth);
else
diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 3a5089f..65d3966 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -949,6 +949,13 @@ config MMC_CQHCI

If unsure, say N.

+config MMC_HSQ
+ tristate "MMC Host Software Queue support"
+ help
+ This selects the Software Queue support.
+
+ If unsure, say N.
+
config MMC_TOSHIBA_PCI
tristate "Toshiba Type A SD/MMC Card Interface Driver"
depends on PCI
diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
index 21d9089..b929ef9 100644
--- a/drivers/mmc/host/Makefile
+++ b/drivers/mmc/host/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_MMC_SDHCI_BRCMSTB) += sdhci-brcmstb.o
obj-$(CONFIG_MMC_SDHCI_OMAP) += sdhci-omap.o
obj-$(CONFIG_MMC_SDHCI_SPRD) += sdhci-sprd.o
obj-$(CONFIG_MMC_CQHCI) += cqhci.o
+obj-$(CONFIG_MMC_HSQ) += mmc_hsq.o

ifeq ($(CONFIG_CB710_DEBUG),y)
CFLAGS-cb710-mmc += -DDEBUG
diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
index 5047f73..4a335f7 100644
--- a/drivers/mmc/host/cqhci.c
+++ b/drivers/mmc/host/cqhci.c
@@ -321,6 +321,9 @@ static int cqhci_enable(struct mmc_host *mmc, struct mmc_card *card)
struct cqhci_host *cq_host = mmc->cqe_private;
int err;

+ if (!card->ext_csd.cmdq_en)
+ return -EINVAL;
+
if (cq_host->enabled)
return 0;

diff --git a/drivers/mmc/host/mmc_hsq.c b/drivers/mmc/host/mmc_hsq.c
new file mode 100644
index 0000000..2011988
--- /dev/null
+++ b/drivers/mmc/host/mmc_hsq.c
@@ -0,0 +1,343 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * MMC software queue support based on command queue interfaces
+ *
+ * Copyright (C) 2019 Linaro, Inc.
+ * Author: Baolin Wang <[email protected]>
+ */
+
+#include <linux/mmc/card.h>
+#include <linux/mmc/host.h>
+
+#include "mmc_hsq.h"
+
+#define HSQ_NUM_SLOTS 64
+#define HSQ_INVALID_TAG HSQ_NUM_SLOTS
+
+static void mmc_hsq_pump_requests(struct mmc_hsq *hsq)
+{
+ struct mmc_host *mmc = hsq->mmc;
+ struct hsq_slot *slot;
+ unsigned long flags;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ /* Make sure we are not already running a request now */
+ if (hsq->mrq) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return;
+ }
+
+ /* Make sure there are remain requests need to pump */
+ if (!hsq->qcnt || !hsq->enabled) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return;
+ }
+
+ slot = &hsq->slot[hsq->next_tag];
+ hsq->mrq = slot->mrq;
+ hsq->qcnt--;
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+
+ mmc->ops->request(mmc, hsq->mrq);
+}
+
+static void mmc_hsq_update_next_tag(struct mmc_hsq *hsq, int remains)
+{
+ struct hsq_slot *slot;
+ int tag;
+
+ /*
+ * If there are no remain requests in software queue, then set a invalid
+ * tag.
+ */
+ if (!remains) {
+ hsq->next_tag = HSQ_INVALID_TAG;
+ return;
+ }
+
+ /*
+ * Increasing the next tag and check if the corresponding request is
+ * available, if yes, then we found a candidate request.
+ */
+ if (++hsq->next_tag != HSQ_INVALID_TAG) {
+ slot = &hsq->slot[hsq->next_tag];
+ if (slot->mrq)
+ return;
+ }
+
+ /* Othersie we should iterate all slots to find a available tag. */
+ for (tag = 0; tag < HSQ_NUM_SLOTS; tag++) {
+ slot = &hsq->slot[tag];
+ if (slot->mrq)
+ break;
+ }
+
+ if (tag == HSQ_NUM_SLOTS)
+ tag = HSQ_INVALID_TAG;
+
+ hsq->next_tag = tag;
+}
+
+static void mmc_hsq_post_request(struct mmc_hsq *hsq)
+{
+ unsigned long flags;
+ int remains;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ remains = hsq->qcnt;
+ hsq->mrq = NULL;
+
+ /* Update the next available tag to be queued. */
+ mmc_hsq_update_next_tag(hsq, remains);
+
+ if (hsq->waiting_for_idle && !remains) {
+ hsq->waiting_for_idle = false;
+ wake_up(&hsq->wait_queue);
+ }
+
+ /* Do not pump new request in recovery mode. */
+ if (hsq->recovery_halt) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return;
+ }
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+
+ /*
+ * Try to pump new request to host controller as fast as possible,
+ * after completing previous request.
+ */
+ if (remains > 0)
+ mmc_hsq_pump_requests(hsq);
+}
+
+/**
+ * mmc_hsq_finalize_request - finalize one request if the request is done
+ * @mmc: the host controller
+ * @mrq: the request need to be finalized
+ *
+ * Return true if we finalized the corresponding request in software queue,
+ * otherwise return false.
+ */
+bool mmc_hsq_finalize_request(struct mmc_host *mmc, struct mmc_request *mrq)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ if (!hsq->enabled || !hsq->mrq || hsq->mrq != mrq) {
+ spin_unlock_irqrestore(&hsq->lock, flags);
+ return false;
+ }
+
+ /*
+ * Clear current completed slot request to make a room for new request.
+ */
+ hsq->slot[hsq->next_tag].mrq = NULL;
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+
+ mmc_cqe_request_done(mmc, hsq->mrq);
+
+ mmc_hsq_post_request(hsq);
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_finalize_request);
+
+static void mmc_hsq_recovery_start(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&hsq->lock, flags);
+
+ hsq->recovery_halt = true;
+
+ spin_unlock_irqrestore(&hsq->lock, flags);
+}
+
+static void mmc_hsq_recovery_finish(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ int remains;
+
+ spin_lock_irq(&hsq->lock);
+
+ hsq->recovery_halt = false;
+ remains = hsq->qcnt;
+
+ spin_unlock_irq(&hsq->lock);
+
+ /*
+ * Try to pump new request if there are request pending in software
+ * queue after finishing recovery.
+ */
+ if (remains > 0)
+ mmc_hsq_pump_requests(hsq);
+}
+
+static int mmc_hsq_request(struct mmc_host *mmc, struct mmc_request *mrq)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ int tag = mrq->tag;
+
+ spin_lock_irq(&hsq->lock);
+
+ if (!hsq->enabled) {
+ spin_unlock_irq(&hsq->lock);
+ return -ESHUTDOWN;
+ }
+
+ /* Do not queue any new requests in recovery mode. */
+ if (hsq->recovery_halt) {
+ spin_unlock_irq(&hsq->lock);
+ return -EBUSY;
+ }
+
+ hsq->slot[tag].mrq = mrq;
+
+ /*
+ * Set the next tag as current request tag if no available
+ * next tag.
+ */
+ if (hsq->next_tag == HSQ_INVALID_TAG)
+ hsq->next_tag = tag;
+
+ hsq->qcnt++;
+
+ spin_unlock_irq(&hsq->lock);
+
+ mmc_hsq_pump_requests(hsq);
+
+ return 0;
+}
+
+static void mmc_hsq_post_req(struct mmc_host *mmc, struct mmc_request *mrq)
+{
+ if (mmc->ops->post_req)
+ mmc->ops->post_req(mmc, mrq, 0);
+}
+
+static bool mmc_hsq_queue_is_idle(struct mmc_hsq *hsq, int *ret)
+{
+ bool is_idle;
+
+ spin_lock_irq(&hsq->lock);
+
+ is_idle = (!hsq->mrq && !hsq->qcnt) ||
+ hsq->recovery_halt;
+
+ *ret = hsq->recovery_halt ? -EBUSY : 0;
+ hsq->waiting_for_idle = !is_idle;
+
+ spin_unlock_irq(&hsq->lock);
+
+ return is_idle;
+}
+
+static int mmc_hsq_wait_for_idle(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ int ret;
+
+ wait_event(hsq->wait_queue,
+ mmc_hsq_queue_is_idle(hsq, &ret));
+
+ return ret;
+}
+
+static void mmc_hsq_disable(struct mmc_host *mmc)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+ u32 timeout = 500;
+ int ret;
+
+ spin_lock_irq(&hsq->lock);
+
+ if (!hsq->enabled) {
+ spin_unlock_irq(&hsq->lock);
+ return;
+ }
+
+ spin_unlock_irq(&hsq->lock);
+
+ ret = wait_event_timeout(hsq->wait_queue,
+ mmc_hsq_queue_is_idle(hsq, &ret),
+ msecs_to_jiffies(timeout));
+ if (ret == 0) {
+ pr_warn("could not stop mmc software queue\n");
+ return;
+ }
+
+ spin_lock_irq(&hsq->lock);
+
+ hsq->enabled = false;
+
+ spin_unlock_irq(&hsq->lock);
+}
+
+static int mmc_hsq_enable(struct mmc_host *mmc, struct mmc_card *card)
+{
+ struct mmc_hsq *hsq = mmc->cqe_private;
+
+ spin_lock_irq(&hsq->lock);
+
+ if (hsq->enabled) {
+ spin_unlock_irq(&hsq->lock);
+ return -EBUSY;
+ }
+
+ hsq->enabled = true;
+
+ spin_unlock_irq(&hsq->lock);
+
+ return 0;
+}
+
+static const struct mmc_cqe_ops mmc_hsq_ops = {
+ .cqe_enable = mmc_hsq_enable,
+ .cqe_disable = mmc_hsq_disable,
+ .cqe_request = mmc_hsq_request,
+ .cqe_post_req = mmc_hsq_post_req,
+ .cqe_wait_for_idle = mmc_hsq_wait_for_idle,
+ .cqe_recovery_start = mmc_hsq_recovery_start,
+ .cqe_recovery_finish = mmc_hsq_recovery_finish,
+};
+
+int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc)
+{
+ hsq->num_slots = HSQ_NUM_SLOTS;
+ hsq->next_tag = HSQ_INVALID_TAG;
+
+ hsq->slot = devm_kcalloc(mmc_dev(mmc), hsq->num_slots,
+ sizeof(struct hsq_slot), GFP_KERNEL);
+ if (!hsq->slot)
+ return -ENOMEM;
+
+ hsq->mmc = mmc;
+ hsq->mmc->cqe_private = hsq;
+ mmc->cqe_ops = &mmc_hsq_ops;
+
+ spin_lock_init(&hsq->lock);
+ init_waitqueue_head(&hsq->wait_queue);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_init);
+
+void mmc_hsq_suspend(struct mmc_host *mmc)
+{
+ mmc_hsq_disable(mmc);
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_suspend);
+
+int mmc_hsq_resume(struct mmc_host *mmc)
+{
+ return mmc_hsq_enable(mmc, NULL);
+}
+EXPORT_SYMBOL_GPL(mmc_hsq_resume);
diff --git a/drivers/mmc/host/mmc_hsq.h b/drivers/mmc/host/mmc_hsq.h
new file mode 100644
index 0000000..d51beb7
--- /dev/null
+++ b/drivers/mmc/host/mmc_hsq.h
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef LINUX_MMC_HSQ_H
+#define LINUX_MMC_HSQ_H
+
+struct hsq_slot {
+ struct mmc_request *mrq;
+};
+
+struct mmc_hsq {
+ struct mmc_host *mmc;
+ struct mmc_request *mrq;
+ wait_queue_head_t wait_queue;
+ struct hsq_slot *slot;
+ spinlock_t lock;
+
+ int next_tag;
+ int num_slots;
+ int qcnt;
+
+ bool enabled;
+ bool waiting_for_idle;
+ bool recovery_halt;
+};
+
+int mmc_hsq_init(struct mmc_hsq *hsq, struct mmc_host *mmc);
+void mmc_hsq_suspend(struct mmc_host *mmc);
+int mmc_hsq_resume(struct mmc_host *mmc);
+bool mmc_hsq_finalize_request(struct mmc_host *mmc, struct mmc_request *mrq);
+
+#endif
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index ba70338..562ed06 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -462,6 +462,9 @@ struct mmc_host {
bool cqe_enabled;
bool cqe_on;

+ /* Host Software Queue support */
+ bool hsq_enabled;
+
unsigned long private[0] ____cacheline_aligned;
};

--
1.7.9.5

2020-02-05 12:52:45

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v8 2/5] mmc: core: Enable the MMC host software queue for the SD card

From: Baolin Wang <[email protected]>

Enable the MMC host software queue for the SD card if the host controller
supports the MMC host software queue.

Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/core/sd.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
index fe914ff..76c7add 100644
--- a/drivers/mmc/core/sd.c
+++ b/drivers/mmc/core/sd.c
@@ -1082,6 +1082,16 @@ static int mmc_sd_init_card(struct mmc_host *host, u32 ocr,
}
}

+ if (host->cqe_ops && !host->cqe_enabled) {
+ err = host->cqe_ops->cqe_enable(host, card);
+ if (!err) {
+ host->cqe_enabled = true;
+ host->hsq_enabled = true;
+ pr_info("%s: Host Software Queue enabled\n",
+ mmc_hostname(host));
+ }
+ }
+
if (host->caps2 & MMC_CAP2_AVOID_3_3V &&
host->ios.signal_voltage == MMC_SIGNAL_VOLTAGE_330) {
pr_err("%s: Host failed to negotiate down from 3.3V\n",
--
1.7.9.5

2020-02-05 12:52:57

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v8 3/5] mmc: host: sdhci: Add request_done ops for struct sdhci_ops

From: Baolin Wang <[email protected]>

Add request_done ops for struct sdhci_ops as a preparation in case some
host controllers have different method to complete one request, such as
supporting request completion of MMC software queue.

Suggested-by: Adrian Hunter <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/host/sdhci.c | 12 ++++++++++--
drivers/mmc/host/sdhci.h | 2 ++
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 63db844..9761981 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -2944,7 +2944,10 @@ static bool sdhci_request_done(struct sdhci_host *host)

spin_unlock_irqrestore(&host->lock, flags);

- mmc_request_done(host->mmc, mrq);
+ if (host->ops->request_done)
+ host->ops->request_done(host, mrq);
+ else
+ mmc_request_done(host->mmc, mrq);

return false;
}
@@ -3372,7 +3375,12 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)

/* Process mrqs ready for immediate completion */
for (i = 0; i < SDHCI_MAX_MRQS; i++) {
- if (mrqs_done[i])
+ if (!mrqs_done[i])
+ continue;
+
+ if (host->ops->request_done)
+ host->ops->request_done(host, mrqs_done[i]);
+ else
mmc_request_done(host->mmc, mrqs_done[i]);
}

diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index a6a3ddc..3e95f74 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -654,6 +654,8 @@ struct sdhci_ops {
void (*voltage_switch)(struct sdhci_host *host);
void (*adma_write_desc)(struct sdhci_host *host, void **desc,
dma_addr_t addr, int len, unsigned int cmd);
+ void (*request_done)(struct sdhci_host *host,
+ struct mmc_request *mrq);
};

#ifdef CONFIG_MMC_SDHCI_IO_ACCESSORS
--
1.7.9.5

2020-02-05 12:53:07

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v8 4/5] mmc: host: sdhci: Add a variable to defer to complete requests if needed

From: Baolin Wang <[email protected]>

When using the host software queue, it will trigger the next request in
irq handler without a context switch. But the sdhci_request() can not be
called in interrupt context when using host software queue for some host
drivers, due to the get_cd() ops can be sleepable.

But for some host drivers, such as Spreadtrum host driver, the card is
nonremovable, so the get_cd() ops is not sleepable, which means we can
complete the data request and trigger the next request in irq handler
to remove the context switch for the Spreadtrum host driver.

As suggested by Adrian, we should introduce a request_atomic() API to
indicate that a request can be called in interrupt context to remove
the context switch when using mmc host software queue. But this should
be done in another thread to convert the users of mmc host software queue.
Thus we can introduce a variable in struct sdhci_host to indicate that
we will always to defer to complete requests when using the host software
queue.

Suggested-by: Adrian Hunter <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/host/sdhci.c | 2 +-
drivers/mmc/host/sdhci.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 9761981..9c37451 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -3250,7 +3250,7 @@ static inline bool sdhci_defer_done(struct sdhci_host *host,
{
struct mmc_data *data = mrq->data;

- return host->pending_reset ||
+ return host->pending_reset || host->always_defer_done ||
((host->flags & SDHCI_REQ_USE_DMA) && data &&
data->host_cookie == COOKIE_MAPPED);
}
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index 3e95f74..cac2d97 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -537,6 +537,7 @@ struct sdhci_host {
bool irq_wake_enabled; /* IRQ wakeup is enabled */
bool v4_mode; /* Host Version 4 Enable */
bool use_external_dma; /* Host selects to use external DMA */
+ bool always_defer_done; /* Always defer to complete requests */

struct mmc_request *mrqs_done[SDHCI_MAX_MRQS]; /* Requests done */
struct mmc_command *cmd; /* Current command */
--
1.7.9.5

2020-02-05 12:54:15

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v8 5/5] mmc: host: sdhci-sprd: Add software queue support

From: Baolin Wang <[email protected]>

Add software queue support to improve the performance.

Signed-off-by: Baolin Wang <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
---
drivers/mmc/host/Kconfig | 1 +
drivers/mmc/host/sdhci-sprd.c | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 65d3966..3ea45be 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -645,6 +645,7 @@ config MMC_SDHCI_SPRD
depends on ARCH_SPRD
depends on MMC_SDHCI_PLTFM
select MMC_SDHCI_IO_ACCESSORS
+ select MMC_HSQ
help
This selects the SDIO Host Controller in Spreadtrum
SoCs, this driver supports R11(IP version: R11P0).
diff --git a/drivers/mmc/host/sdhci-sprd.c b/drivers/mmc/host/sdhci-sprd.c
index d07b979..d346223 100644
--- a/drivers/mmc/host/sdhci-sprd.c
+++ b/drivers/mmc/host/sdhci-sprd.c
@@ -19,6 +19,7 @@
#include <linux/slab.h>

#include "sdhci-pltfm.h"
+#include "mmc_hsq.h"

/* SDHCI_ARGUMENT2 register high 16bit */
#define SDHCI_SPRD_ARG2_STUFF GENMASK(31, 16)
@@ -379,6 +380,16 @@ static unsigned int sdhci_sprd_get_ro(struct sdhci_host *host)
return 0;
}

+static void sdhci_sprd_request_done(struct sdhci_host *host,
+ struct mmc_request *mrq)
+{
+ /* Validate if the request was from software queue firstly. */
+ if (mmc_hsq_finalize_request(host->mmc, mrq))
+ return;
+
+ mmc_request_done(host->mmc, mrq);
+}
+
static struct sdhci_ops sdhci_sprd_ops = {
.read_l = sdhci_sprd_readl,
.write_l = sdhci_sprd_writel,
@@ -392,6 +403,7 @@ static unsigned int sdhci_sprd_get_ro(struct sdhci_host *host)
.hw_reset = sdhci_sprd_hw_reset,
.get_max_timeout_count = sdhci_sprd_get_max_timeout_count,
.get_ro = sdhci_sprd_get_ro,
+ .request_done = sdhci_sprd_request_done,
};

static void sdhci_sprd_request(struct mmc_host *mmc, struct mmc_request *mrq)
@@ -521,6 +533,7 @@ static int sdhci_sprd_probe(struct platform_device *pdev)
{
struct sdhci_host *host;
struct sdhci_sprd_host *sprd_host;
+ struct mmc_hsq *hsq;
struct clk *clk;
int ret = 0;

@@ -631,6 +644,18 @@ static int sdhci_sprd_probe(struct platform_device *pdev)

sprd_host->flags = host->flags;

+ hsq = devm_kzalloc(&pdev->dev, sizeof(*hsq), GFP_KERNEL);
+ if (!hsq) {
+ ret = -ENOMEM;
+ goto err_cleanup_host;
+ }
+
+ ret = mmc_hsq_init(hsq, host->mmc);
+ if (ret)
+ goto err_cleanup_host;
+
+ host->always_defer_done = true;
+
ret = __sdhci_add_host(host);
if (ret)
goto err_cleanup_host;
@@ -689,6 +714,7 @@ static int sdhci_sprd_runtime_suspend(struct device *dev)
struct sdhci_host *host = dev_get_drvdata(dev);
struct sdhci_sprd_host *sprd_host = TO_SPRD_HOST(host);

+ mmc_hsq_suspend(host->mmc);
sdhci_runtime_suspend_host(host);

clk_disable_unprepare(sprd_host->clk_sdio);
@@ -717,6 +743,8 @@ static int sdhci_sprd_runtime_resume(struct device *dev)
goto clk_disable;

sdhci_runtime_resume_host(host, 1);
+ mmc_hsq_resume(host->mmc);
+
return 0;

clk_disable:
--
1.7.9.5

2020-02-06 15:02:40

by Ulf Hansson

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add MMC software queue support

On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
>
> Hi All,
>
> Now the MMC read/write stack will always wait for previous request is
> completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> or queue a work to complete request, that will bring context switching
> overhead, especially for high I/O per second rates, to affect the IO
> performance.

In the regular request path (non CQE), we call mmc_blk_card_busy() to
complete a request. For write I/O, this leads to calling
card_busy_detect(), which starts to poll the card by sending a CMD13.

At least one CMD13 will be sent to the card, before we exit the
polling loop and a new I/O request can get submitted. However, in many
cases, depending on the controller/host/card/request-size, my best
guess is that *one* CMD13 might not be sufficient. At least, that is
what I have observed on those platforms I recently have been working
on.

That said, I am wondering if you have done some measurement/profiling
on this particular behaviour for your controller/driver? For example,
how many CMD13 gets sent for random small writes during polling?

Why am I asking this? Because, unless I am mistaken, when using the
new hsq path that you introduce in $subject series, based on the cqe
ops, then mmc_blk_card_busy() is not being called at all. In other
words, you rely on HW busy detection from the controller/driver,
rather than polling with CMD13. Is that correct?

This seems like an additional reason to why you achieve significant
improvements for the random write case. Don't you think?

>
> Thus this patch set will introduce the MMC software command queue support
> based on command queue engine's interfaces, and set the queue depth as 64
> to allow more requests can be be prepared, merged and inserted into IO
> scheduler, but we only allow 2 requests in flight, that is enough to let
> the irq handler always trigger the next request without a context switch,
> as well as avoiding a long latency.
>
> Moreover we can expand the MMC software queue interface to support
> MMC packed request or packed command instead of adding new interfaces,
> according to previosus discussion.
>
> Below are some comparison data with fio tool. The fio command I used
> is like below with changing the '--rw' parameter and enabling the direct
> IO flag to measure the actual hardware transfer speed in 4K block size.
>
> ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
>
> My eMMC card working at HS400 Enhanced strobe mode:
> [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
>
> 1. Without MMC software queue
> I tested 5 times for each case and output a average speed.
>
> 1) Sequential read:
> Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> Average speed: 59.66MiB/s
>
> 2) Random read:
> Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> Average speed: 27.04MiB/s
>
> 3) Sequential write:
> Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> Average speed: 69.68MiB/s
>
> 4) Random write:
> Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> Average speed: 35.96MiB/s
>
> 2. With MMC software queue
> I tested 5 times for each case and output a average speed.
>
> 1) Sequential read:
> Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> Average speed: 60.68MiB/s
>
> 2) Random read:
> Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> Average speed: 31.36MiB/s
>
> 3) Sequential write:
> Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> Average speed: 71.66MiB/s
>
> 4) Random write:
> Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> Average speed: 68.76MiB/s
>
> Form above data, we can see the MMC software queue can help to improve some
> performance obviously for random read and write, though no obvious improvement
> for sequential read and write.
>
> Any comments are welcome. Thanks a lot.
>
> Changes from v7:
> - Add reviewed tag from Arnd.
> - Use the 'hsq' acronym for varibles and functions in the core layer.
> - Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
> can work normally.
> - Add a new patch to enable the host software queue for the SD card.
> - Use the default MMC queue depth for host software queue.

It would be nice to also have some measurements for an SD card, now
that the series supports this. Is that possible for you test as well?

[...]

Kind regards
Uffe

2020-02-10 08:41:56

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add MMC software queue support

Hi Ulf,

On Thu, Feb 6, 2020 at 11:00 PM Ulf Hansson <[email protected]> wrote:
>
> On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
> >
> > Hi All,
> >
> > Now the MMC read/write stack will always wait for previous request is
> > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > or queue a work to complete request, that will bring context switching
> > overhead, especially for high I/O per second rates, to affect the IO
> > performance.
>
> In the regular request path (non CQE), we call mmc_blk_card_busy() to
> complete a request. For write I/O, this leads to calling
> card_busy_detect(), which starts to poll the card by sending a CMD13.
>
> At least one CMD13 will be sent to the card, before we exit the
> polling loop and a new I/O request can get submitted. However, in many
> cases, depending on the controller/host/card/request-size, my best
> guess is that *one* CMD13 might not be sufficient. At least, that is
> what I have observed on those platforms I recently have been working
> on.
>
> That said, I am wondering if you have done some measurement/profiling
> on this particular behaviour for your controller/driver? For example,
> how many CMD13 gets sent for random small writes during polling?

Ah, I had not checked how many CMD13 for random small writes before.
And I did a quick testing today, I found only 1 CMD13 gets sent for
random writes on my platform.


> Why am I asking this? Because, unless I am mistaken, when using the
> new hsq path that you introduce in $subject series, based on the cqe
> ops, then mmc_blk_card_busy() is not being called at all. In other
> words, you rely on HW busy detection from the controller/driver,
> rather than polling with CMD13. Is that correct?

Right. I think so.

> This seems like an additional reason to why you achieve significant
> improvements for the random write case. Don't you think?

Yes, agree wtih you.

> >
> > Thus this patch set will introduce the MMC software command queue support
> > based on command queue engine's interfaces, and set the queue depth as 64
> > to allow more requests can be be prepared, merged and inserted into IO
> > scheduler, but we only allow 2 requests in flight, that is enough to let
> > the irq handler always trigger the next request without a context switch,
> > as well as avoiding a long latency.
> >
> > Moreover we can expand the MMC software queue interface to support
> > MMC packed request or packed command instead of adding new interfaces,
> > according to previosus discussion.
> >
> > Below are some comparison data with fio tool. The fio command I used
> > is like below with changing the '--rw' parameter and enabling the direct
> > IO flag to measure the actual hardware transfer speed in 4K block size.
> >
> > ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
> >
> > My eMMC card working at HS400 Enhanced strobe mode:
> > [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> > [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> > [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> > [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> > [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
> >
> > 1. Without MMC software queue
> > I tested 5 times for each case and output a average speed.
> >
> > 1) Sequential read:
> > Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> > Average speed: 59.66MiB/s
> >
> > 2) Random read:
> > Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> > Average speed: 27.04MiB/s
> >
> > 3) Sequential write:
> > Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> > Average speed: 69.68MiB/s
> >
> > 4) Random write:
> > Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> > Average speed: 35.96MiB/s
> >
> > 2. With MMC software queue
> > I tested 5 times for each case and output a average speed.
> >
> > 1) Sequential read:
> > Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> > Average speed: 60.68MiB/s
> >
> > 2) Random read:
> > Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> > Average speed: 31.36MiB/s
> >
> > 3) Sequential write:
> > Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> > Average speed: 71.66MiB/s
> >
> > 4) Random write:
> > Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> > Average speed: 68.76MiB/s
> >
> > Form above data, we can see the MMC software queue can help to improve some
> > performance obviously for random read and write, though no obvious improvement
> > for sequential read and write.
> >
> > Any comments are welcome. Thanks a lot.
> >
> > Changes from v7:
> > - Add reviewed tag from Arnd.
> > - Use the 'hsq' acronym for varibles and functions in the core layer.
> > - Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
> > can work normally.
> > - Add a new patch to enable the host software queue for the SD card.
> > - Use the default MMC queue depth for host software queue.
>
> It would be nice to also have some measurements for an SD card, now
> that the series supports this. Is that possible for you test as well?

Yes, but my SD card works at high speed mode, and shows a low speed in
4k block size.
[ 2.941965] mmc0: new high speed SDHC card at address b368
[ 2.948325] mmcblk0: mmc0:b368 SD08G 7.42 GiB
[ 2.956554] mmcblk0: p1

And I did not see any obvious improvement or recession for my SD card
in 4k block size from below data, I think the most of the time is
spent in hardware. (But when I enabled the packed request based on
hsq, I can see some obvious improvement.)
Without hsq:
read: bw=4347KiB/s
randread: bw=3040KiB/s
write: bw=1361KiB/s
randwrite: bw=692KiB/s

With hsq:
read: bw=4246KiB/s
randread: bw=29950KiB/s
write: bw=1417KiB/s
randwrite: bw=697KiB/s

2020-02-10 13:26:57

by Ulf Hansson

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add MMC software queue support

On Mon, 10 Feb 2020 at 09:41, Baolin Wang <[email protected]> wrote:
>
> Hi Ulf,
>
> On Thu, Feb 6, 2020 at 11:00 PM Ulf Hansson <[email protected]> wrote:
> >
> > On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
> > >
> > > Hi All,
> > >
> > > Now the MMC read/write stack will always wait for previous request is
> > > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > > or queue a work to complete request, that will bring context switching
> > > overhead, especially for high I/O per second rates, to affect the IO
> > > performance.
> >
> > In the regular request path (non CQE), we call mmc_blk_card_busy() to
> > complete a request. For write I/O, this leads to calling
> > card_busy_detect(), which starts to poll the card by sending a CMD13.
> >
> > At least one CMD13 will be sent to the card, before we exit the
> > polling loop and a new I/O request can get submitted. However, in many
> > cases, depending on the controller/host/card/request-size, my best
> > guess is that *one* CMD13 might not be sufficient. At least, that is
> > what I have observed on those platforms I recently have been working
> > on.
> >
> > That said, I am wondering if you have done some measurement/profiling
> > on this particular behaviour for your controller/driver? For example,
> > how many CMD13 gets sent for random small writes during polling?
>
> Ah, I had not checked how many CMD13 for random small writes before.
> And I did a quick testing today, I found only 1 CMD13 gets sent for
> random writes on my platform.

Thanks for sharing the result, very interesting!

Would you mind running a "dd write operation", to test large
consecutive writes as those should cause longer busy times. Just to
make sure the HW busy detection really works as expected.

For example:
dd of=/dev/mmcblk[n] if=/dev/zero bs=1M count=512 conv=fsync

>
>
> > Why am I asking this? Because, unless I am mistaken, when using the
> > new hsq path that you introduce in $subject series, based on the cqe
> > ops, then mmc_blk_card_busy() is not being called at all. In other
> > words, you rely on HW busy detection from the controller/driver,
> > rather than polling with CMD13. Is that correct?
>
> Right. I think so.

A couple of follow up questions then.

Normally, the mmc core adds the MMC_RSP_BUSY (part of MMC_RSP_R1B)
response flag, for those commands having busy signaling on DAT0, like
CMD6 for example. After the command has been sent, the core checks
whether the host supports HW busy signaling, via the
MMC_CAP_WAIT_WHILE_BUSY flag. If so the polling loop to detect when
the card stops signaling busy, is skipped by the core. See
__mmc_switch() and mmc_poll_for_busy(), for example.

This makes me wonder, why doesn't your driver set the
MMC_CAP_WAIT_WHILE_BUSY, as it seems to support HW busy signaling?

Moreover, it also seems like your driver can support
MMC_CAP_DONE_COMPLETE. Or at least the part that requires HW busy
detection for I/O write operations. I guess we also need your series,
"[PATCH 0/3] Introduce the request_atomic() for the host" as to
support it. What do you think, would it be possible to test this at
your side?

Note that, I haven't played with MMC_CAP_DONE_COMPLETE so far, but it
was invented to allow optimization for these kind of situations.

Now, don't get me wrong, I still think we should move forward with
@subject series. I just want to make sure we don't have several
methods to implement the same thing. So perhaps, MMC_CAP_DONE_COMPLETE
and the corresponding code should be removed, in favor of the more
generic hsq interface?

>
> > This seems like an additional reason to why you achieve significant
> > improvements for the random write case. Don't you think?
>
> Yes, agree wtih you.
>
> > >
> > > Thus this patch set will introduce the MMC software command queue support
> > > based on command queue engine's interfaces, and set the queue depth as 64
> > > to allow more requests can be be prepared, merged and inserted into IO
> > > scheduler, but we only allow 2 requests in flight, that is enough to let
> > > the irq handler always trigger the next request without a context switch,
> > > as well as avoiding a long latency.
> > >
> > > Moreover we can expand the MMC software queue interface to support
> > > MMC packed request or packed command instead of adding new interfaces,
> > > according to previosus discussion.
> > >
> > > Below are some comparison data with fio tool. The fio command I used
> > > is like below with changing the '--rw' parameter and enabling the direct
> > > IO flag to measure the actual hardware transfer speed in 4K block size.
> > >
> > > ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
> > >
> > > My eMMC card working at HS400 Enhanced strobe mode:
> > > [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> > > [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> > > [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> > > [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> > > [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
> > >
> > > 1. Without MMC software queue
> > > I tested 5 times for each case and output a average speed.
> > >
> > > 1) Sequential read:
> > > Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> > > Average speed: 59.66MiB/s
> > >
> > > 2) Random read:
> > > Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> > > Average speed: 27.04MiB/s
> > >
> > > 3) Sequential write:
> > > Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> > > Average speed: 69.68MiB/s
> > >
> > > 4) Random write:
> > > Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> > > Average speed: 35.96MiB/s
> > >
> > > 2. With MMC software queue
> > > I tested 5 times for each case and output a average speed.
> > >
> > > 1) Sequential read:
> > > Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> > > Average speed: 60.68MiB/s
> > >
> > > 2) Random read:
> > > Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> > > Average speed: 31.36MiB/s
> > >
> > > 3) Sequential write:
> > > Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> > > Average speed: 71.66MiB/s
> > >
> > > 4) Random write:
> > > Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> > > Average speed: 68.76MiB/s
> > >
> > > Form above data, we can see the MMC software queue can help to improve some
> > > performance obviously for random read and write, though no obvious improvement
> > > for sequential read and write.
> > >
> > > Any comments are welcome. Thanks a lot.
> > >
> > > Changes from v7:
> > > - Add reviewed tag from Arnd.
> > > - Use the 'hsq' acronym for varibles and functions in the core layer.
> > > - Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
> > > can work normally.
> > > - Add a new patch to enable the host software queue for the SD card.
> > > - Use the default MMC queue depth for host software queue.
> >
> > It would be nice to also have some measurements for an SD card, now
> > that the series supports this. Is that possible for you test as well?
>
> Yes, but my SD card works at high speed mode, and shows a low speed in
> 4k block size.
> [ 2.941965] mmc0: new high speed SDHC card at address b368
> [ 2.948325] mmcblk0: mmc0:b368 SD08G 7.42 GiB
> [ 2.956554] mmcblk0: p1
>
> And I did not see any obvious improvement or recession for my SD card
> in 4k block size from below data, I think the most of the time is
> spent in hardware. (But when I enabled the packed request based on
> hsq, I can see some obvious improvement.)
> Without hsq:
> read: bw=4347KiB/s
> randread: bw=3040KiB/s
> write: bw=1361KiB/s
> randwrite: bw=692KiB/s
>
> With hsq:
> read: bw=4246KiB/s
> randread: bw=29950KiB/s
> write: bw=1417KiB/s
> randwrite: bw=697KiB/s

Thanks for testing and sharing!

Did you use "[PATCH 0/3] Introduce the request_atomic() for the host"
as well? In there, it seems like you are disabling the hsq option for
removable cards, or did I get that wrong? Does it matter?

Kind regards
Uffe

2020-02-11 05:50:53

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add MMC software queue support

On Mon, Feb 10, 2020 at 9:26 PM Ulf Hansson <[email protected]> wrote:
>
> On Mon, 10 Feb 2020 at 09:41, Baolin Wang <[email protected]> wrote:
> >
> > Hi Ulf,
> >
> > On Thu, Feb 6, 2020 at 11:00 PM Ulf Hansson <[email protected]> wrote:
> > >
> > > On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
> > > >
> > > > Hi All,
> > > >
> > > > Now the MMC read/write stack will always wait for previous request is
> > > > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > > > or queue a work to complete request, that will bring context switching
> > > > overhead, especially for high I/O per second rates, to affect the IO
> > > > performance.
> > >
> > > In the regular request path (non CQE), we call mmc_blk_card_busy() to
> > > complete a request. For write I/O, this leads to calling
> > > card_busy_detect(), which starts to poll the card by sending a CMD13.
> > >
> > > At least one CMD13 will be sent to the card, before we exit the
> > > polling loop and a new I/O request can get submitted. However, in many
> > > cases, depending on the controller/host/card/request-size, my best
> > > guess is that *one* CMD13 might not be sufficient. At least, that is
> > > what I have observed on those platforms I recently have been working
> > > on.
> > >
> > > That said, I am wondering if you have done some measurement/profiling
> > > on this particular behaviour for your controller/driver? For example,
> > > how many CMD13 gets sent for random small writes during polling?
> >
> > Ah, I had not checked how many CMD13 for random small writes before.
> > And I did a quick testing today, I found only 1 CMD13 gets sent for
> > random writes on my platform.
>
> Thanks for sharing the result, very interesting!
>
> Would you mind running a "dd write operation", to test large
> consecutive writes as those should cause longer busy times. Just to
> make sure the HW busy detection really works as expected.
>
> For example:
> dd of=/dev/mmcblk[n] if=/dev/zero bs=1M count=512 conv=fsync

Sure. I've run the dd command and still got the same result. Only 1
CMD13 for each write operation.

> > > Why am I asking this? Because, unless I am mistaken, when using the
> > > new hsq path that you introduce in $subject series, based on the cqe
> > > ops, then mmc_blk_card_busy() is not being called at all. In other
> > > words, you rely on HW busy detection from the controller/driver,
> > > rather than polling with CMD13. Is that correct?
> >
> > Right. I think so.
>
> A couple of follow up questions then.
>
> Normally, the mmc core adds the MMC_RSP_BUSY (part of MMC_RSP_R1B)
> response flag, for those commands having busy signaling on DAT0, like
> CMD6 for example. After the command has been sent, the core checks
> whether the host supports HW busy signaling, via the
> MMC_CAP_WAIT_WHILE_BUSY flag. If so the polling loop to detect when
> the card stops signaling busy, is skipped by the core. See
> __mmc_switch() and mmc_poll_for_busy(), for example.

Make sense.

> This makes me wonder, why doesn't your driver set the
> MMC_CAP_WAIT_WHILE_BUSY, as it seems to support HW busy signaling?

I think we should set this flag, but missed it before. And I did a
quick testing with setting this flag, I did not find any problem.
So I will post one patch to enable this flag with more stable testing.

> Moreover, it also seems like your driver can support
> MMC_CAP_DONE_COMPLETE. Or at least the part that requires HW busy

No. Cause we will complete the request in the irq context, if we set
this MMC_CAP_DONE_COMPLETE, we will call mmc_blk_mq_post_req()--->
mmc_post_req() in the irq context, which is a time-consuming operation
and not be allowed.

> detection for I/O write operations. I guess we also need your series,
> "[PATCH 0/3] Introduce the request_atomic() for the host" as to
> support it. What do you think, would it be possible to test this at
> your side?

Yes, we need this series ("[PATCH 0/3] Introduce the request_atomic()
for the host"), which is used to dispatch next request to the
controller in the irq context directly, to remove context switching.

> Note that, I haven't played with MMC_CAP_DONE_COMPLETE so far, but it
> was invented to allow optimization for these kind of situations.

I think the MMC_CAP_DONE_COMPLETE flag is used for this case: the host
controller completes requests in the irq thread or a workqueue
context, then we do not need queue the 'mq->complete_work' to complete
requests, instead we can compelete requests in the current context.

But now we will complete the requests in the irq context, so seems
MMC_CAP_DONE_COMPLETE is not useful here.

> Now, don't get me wrong, I still think we should move forward with
> @subject series. I just want to make sure we don't have several
> methods to implement the same thing. So perhaps, MMC_CAP_DONE_COMPLETE
> and the corresponding code should be removed, in favor of the more
> generic hsq interface?

Yes, now no host controllers set the MMC_CAP_DONE_COMPLETE flag, I
think we should remove this flag.

> > > This seems like an additional reason to why you achieve significant
> > > improvements for the random write case. Don't you think?
> >
> > Yes, agree wtih you.
> >
> > > >
> > > > Thus this patch set will introduce the MMC software command queue support
> > > > based on command queue engine's interfaces, and set the queue depth as 64
> > > > to allow more requests can be be prepared, merged and inserted into IO
> > > > scheduler, but we only allow 2 requests in flight, that is enough to let
> > > > the irq handler always trigger the next request without a context switch,
> > > > as well as avoiding a long latency.
> > > >
> > > > Moreover we can expand the MMC software queue interface to support
> > > > MMC packed request or packed command instead of adding new interfaces,
> > > > according to previosus discussion.
> > > >
> > > > Below are some comparison data with fio tool. The fio command I used
> > > > is like below with changing the '--rw' parameter and enabling the direct
> > > > IO flag to measure the actual hardware transfer speed in 4K block size.
> > > >
> > > > ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
> > > >
> > > > My eMMC card working at HS400 Enhanced strobe mode:
> > > > [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> > > > [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> > > > [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> > > > [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> > > > [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
> > > >
> > > > 1. Without MMC software queue
> > > > I tested 5 times for each case and output a average speed.
> > > >
> > > > 1) Sequential read:
> > > > Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> > > > Average speed: 59.66MiB/s
> > > >
> > > > 2) Random read:
> > > > Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> > > > Average speed: 27.04MiB/s
> > > >
> > > > 3) Sequential write:
> > > > Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> > > > Average speed: 69.68MiB/s
> > > >
> > > > 4) Random write:
> > > > Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> > > > Average speed: 35.96MiB/s
> > > >
> > > > 2. With MMC software queue
> > > > I tested 5 times for each case and output a average speed.
> > > >
> > > > 1) Sequential read:
> > > > Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> > > > Average speed: 60.68MiB/s
> > > >
> > > > 2) Random read:
> > > > Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> > > > Average speed: 31.36MiB/s
> > > >
> > > > 3) Sequential write:
> > > > Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> > > > Average speed: 71.66MiB/s
> > > >
> > > > 4) Random write:
> > > > Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> > > > Average speed: 68.76MiB/s
> > > >
> > > > Form above data, we can see the MMC software queue can help to improve some
> > > > performance obviously for random read and write, though no obvious improvement
> > > > for sequential read and write.
> > > >
> > > > Any comments are welcome. Thanks a lot.
> > > >
> > > > Changes from v7:
> > > > - Add reviewed tag from Arnd.
> > > > - Use the 'hsq' acronym for varibles and functions in the core layer.
> > > > - Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
> > > > can work normally.
> > > > - Add a new patch to enable the host software queue for the SD card.
> > > > - Use the default MMC queue depth for host software queue.
> > >
> > > It would be nice to also have some measurements for an SD card, now
> > > that the series supports this. Is that possible for you test as well?
> >
> > Yes, but my SD card works at high speed mode, and shows a low speed in
> > 4k block size.
> > [ 2.941965] mmc0: new high speed SDHC card at address b368
> > [ 2.948325] mmcblk0: mmc0:b368 SD08G 7.42 GiB
> > [ 2.956554] mmcblk0: p1
> >
> > And I did not see any obvious improvement or recession for my SD card
> > in 4k block size from below data, I think the most of the time is
> > spent in hardware. (But when I enabled the packed request based on
> > hsq, I can see some obvious improvement.)
> > Without hsq:
> > read: bw=4347KiB/s
> > randread: bw=3040KiB/s
> > write: bw=1361KiB/s
> > randwrite: bw=692KiB/s
> >
> > With hsq:
> > read: bw=4246KiB/s
> > randread: bw=29950KiB/s
> > write: bw=1417KiB/s
> > randwrite: bw=697KiB/s
>
> Thanks for testing and sharing!
>
> Did you use "[PATCH 0/3] Introduce the request_atomic() for the host"
> as well? In there, it seems like you are disabling the hsq option for

No, I did not use this series when testing, but I think the result
will be same. Since we will set host->always_defer_done as true for
removable SD cards.

> removable cards, or did I get that wrong? Does it matter?

No, I did not disable the hsq. In this series, we will not implement
the request_atomic() API for these removable cards, since we need
check the card status when handling a request, which maybe a sleepable
operation when detecting the card status (such as from GPIO), so we
can not disaptch next request in the irq context, instead we should
still set the host->always_defer_done as true for the removable cards.

2020-02-11 09:47:05

by Ulf Hansson

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] mmc: Add MMC host software queue support

On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
>
> From: Baolin Wang <[email protected]>
>
> Now the MMC read/write stack will always wait for previous request is
> completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> or queue a work to complete request, that will bring context switching
> overhead, especially for high I/O per second rates, to affect the IO
> performance.

Would you mind adding some more context about the mmc_blk_rw_wait()?
Especially I want to make it clear that mmc_blk_rw_wait() is also used
to poll the card for busy completion for I/O writes, via sending
CMD13.

>
> Thus this patch introduces MMC software queue interface based on the
> hardware command queue engine's interfaces, which is similar with the
> hardware command queue engine's idea, that can remove the context
> switching. Moreover we set the default queue depth as 64 for software
> queue, which allows more requests to be prepared, merged and inserted
> into IO scheduler to improve performance, but we only allow 2 requests
> in flight, that is enough to let the irq handler always trigger the
> next request without a context switch, as well as avoiding a long latency.

I think it's important to clarify that to use this new interface, hsq,
the host controller/driver needs to support HW busy detection for I/O
operations.

In other words, the host driver must not complete a data transfer
request, until after the card stops signals busy. This behaviour is
also required for "closed-ended-transmissions" with CMD23, as in this
path there is no CMD12 sent to complete the transfer, thus no R1B
response flag to trigger the HW busy detection behaviour in the
driver.

>
> From the fio testing data in cover letter, we can see the software
> queue can improve some performance with 4K block size, increasing
> about 16% for random read, increasing about 90% for random write,
> though no obvious improvement for sequential read and write.
>
> Moreover we can expand the software queue interface to support MMC
> packed request or packed command in future.
>
> Reviewed-by: Arnd Bergmann <[email protected]>
> Signed-off-by: Baolin Wang <[email protected]>
> Signed-off-by: Baolin Wang <[email protected]>
> ---

[...]

> diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
> index f6912de..7a9976f 100644
> --- a/drivers/mmc/core/mmc.c
> +++ b/drivers/mmc/core/mmc.c
> @@ -1851,15 +1851,22 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
> */
> card->reenable_cmdq = card->ext_csd.cmdq_en;
>
> - if (card->ext_csd.cmdq_en && !host->cqe_enabled) {
> + if (host->cqe_ops && !host->cqe_enabled) {
> err = host->cqe_ops->cqe_enable(host, card);
> if (err) {
> pr_err("%s: Failed to enable CQE, error %d\n",
> mmc_hostname(host), err);

This means we are going to start printing an error message for those
eMMCs that doesn't support command queuing, but the host supports
MMC_CAP2_CQE.

Not sure how big of a problem this is, but another option is simply to
leave the logging of the *failures* to the host driver, rather than
doing it here.

Oh well, feel free to change or leave this as is for now. We can
always change it on top, if needed.

> } else {
> host->cqe_enabled = true;
> - pr_info("%s: Command Queue Engine enabled\n",
> - mmc_hostname(host));
> +
> + if (card->ext_csd.cmdq_en) {
> + pr_info("%s: Command Queue Engine enabled\n",
> + mmc_hostname(host));
> + } else {
> + host->hsq_enabled = true;
> + pr_info("%s: Host Software Queue enabled\n",
> + mmc_hostname(host));
> + }
> }
> }

[...]

Kind regards
Uffe

2020-02-11 09:48:42

by Ulf Hansson

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add MMC software queue support

On Tue, 11 Feb 2020 at 05:47, Baolin Wang <[email protected]> wrote:
>
> On Mon, Feb 10, 2020 at 9:26 PM Ulf Hansson <[email protected]> wrote:
> >
> > On Mon, 10 Feb 2020 at 09:41, Baolin Wang <[email protected]> wrote:
> > >
> > > Hi Ulf,
> > >
> > > On Thu, Feb 6, 2020 at 11:00 PM Ulf Hansson <[email protected]> wrote:
> > > >
> > > > On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > Now the MMC read/write stack will always wait for previous request is
> > > > > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > > > > or queue a work to complete request, that will bring context switching
> > > > > overhead, especially for high I/O per second rates, to affect the IO
> > > > > performance.
> > > >
> > > > In the regular request path (non CQE), we call mmc_blk_card_busy() to
> > > > complete a request. For write I/O, this leads to calling
> > > > card_busy_detect(), which starts to poll the card by sending a CMD13.
> > > >
> > > > At least one CMD13 will be sent to the card, before we exit the
> > > > polling loop and a new I/O request can get submitted. However, in many
> > > > cases, depending on the controller/host/card/request-size, my best
> > > > guess is that *one* CMD13 might not be sufficient. At least, that is
> > > > what I have observed on those platforms I recently have been working
> > > > on.
> > > >
> > > > That said, I am wondering if you have done some measurement/profiling
> > > > on this particular behaviour for your controller/driver? For example,
> > > > how many CMD13 gets sent for random small writes during polling?
> > >
> > > Ah, I had not checked how many CMD13 for random small writes before.
> > > And I did a quick testing today, I found only 1 CMD13 gets sent for
> > > random writes on my platform.
> >
> > Thanks for sharing the result, very interesting!
> >
> > Would you mind running a "dd write operation", to test large
> > consecutive writes as those should cause longer busy times. Just to
> > make sure the HW busy detection really works as expected.
> >
> > For example:
> > dd of=/dev/mmcblk[n] if=/dev/zero bs=1M count=512 conv=fsync
>
> Sure. I've run the dd command and still got the same result. Only 1
> CMD13 for each write operation.

Great, thanks for confirming the behaviour.

>
> > > > Why am I asking this? Because, unless I am mistaken, when using the
> > > > new hsq path that you introduce in $subject series, based on the cqe
> > > > ops, then mmc_blk_card_busy() is not being called at all. In other
> > > > words, you rely on HW busy detection from the controller/driver,
> > > > rather than polling with CMD13. Is that correct?
> > >
> > > Right. I think so.
> >
> > A couple of follow up questions then.
> >
> > Normally, the mmc core adds the MMC_RSP_BUSY (part of MMC_RSP_R1B)
> > response flag, for those commands having busy signaling on DAT0, like
> > CMD6 for example. After the command has been sent, the core checks
> > whether the host supports HW busy signaling, via the
> > MMC_CAP_WAIT_WHILE_BUSY flag. If so the polling loop to detect when
> > the card stops signaling busy, is skipped by the core. See
> > __mmc_switch() and mmc_poll_for_busy(), for example.
>
> Make sense.
>
> > This makes me wonder, why doesn't your driver set the
> > MMC_CAP_WAIT_WHILE_BUSY, as it seems to support HW busy signaling?
>
> I think we should set this flag, but missed it before. And I did a
> quick testing with setting this flag, I did not find any problem.
> So I will post one patch to enable this flag with more stable testing.

Don't forget to also set .max_busy_timeout for the host, if you need
to set an upper limit of the busy timeout. Zero indicates, no limit.

>
> > Moreover, it also seems like your driver can support
> > MMC_CAP_DONE_COMPLETE. Or at least the part that requires HW busy
>
> No. Cause we will complete the request in the irq context, if we set
> this MMC_CAP_DONE_COMPLETE, we will call mmc_blk_mq_post_req()--->
> mmc_post_req() in the irq context, which is a time-consuming operation
> and not be allowed.

Ahh, I see. Thanks for clarifying this.

>
> > detection for I/O write operations. I guess we also need your series,
> > "[PATCH 0/3] Introduce the request_atomic() for the host" as to
> > support it. What do you think, would it be possible to test this at
> > your side?
>
> Yes, we need this series ("[PATCH 0/3] Introduce the request_atomic()
> for the host"), which is used to dispatch next request to the
> controller in the irq context directly, to remove context switching.
>
> > Note that, I haven't played with MMC_CAP_DONE_COMPLETE so far, but it
> > was invented to allow optimization for these kind of situations.
>
> I think the MMC_CAP_DONE_COMPLETE flag is used for this case: the host
> controller completes requests in the irq thread or a workqueue
> context, then we do not need queue the 'mq->complete_work' to complete
> requests, instead we can compelete requests in the current context.
>
> But now we will complete the requests in the irq context, so seems
> MMC_CAP_DONE_COMPLETE is not useful here.

Yes, I fully agree with you, now. Thanks again for clarifying.

>
> > Now, don't get me wrong, I still think we should move forward with
> > @subject series. I just want to make sure we don't have several
> > methods to implement the same thing. So perhaps, MMC_CAP_DONE_COMPLETE
> > and the corresponding code should be removed, in favor of the more
> > generic hsq interface?
>
> Yes, now no host controllers set the MMC_CAP_DONE_COMPLETE flag, I
> think we should remove this flag.

Yeah, let's consider that for later then.

If we should keep it, at least we should clarify with some
comments/documentation about when it makes sense to use it.

>
> > > > This seems like an additional reason to why you achieve significant
> > > > improvements for the random write case. Don't you think?
> > >
> > > Yes, agree wtih you.
> > >
> > > > >
> > > > > Thus this patch set will introduce the MMC software command queue support
> > > > > based on command queue engine's interfaces, and set the queue depth as 64
> > > > > to allow more requests can be be prepared, merged and inserted into IO
> > > > > scheduler, but we only allow 2 requests in flight, that is enough to let
> > > > > the irq handler always trigger the next request without a context switch,
> > > > > as well as avoiding a long latency.
> > > > >
> > > > > Moreover we can expand the MMC software queue interface to support
> > > > > MMC packed request or packed command instead of adding new interfaces,
> > > > > according to previosus discussion.
> > > > >
> > > > > Below are some comparison data with fio tool. The fio command I used
> > > > > is like below with changing the '--rw' parameter and enabling the direct
> > > > > IO flag to measure the actual hardware transfer speed in 4K block size.
> > > > >
> > > > > ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
> > > > >
> > > > > My eMMC card working at HS400 Enhanced strobe mode:
> > > > > [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> > > > > [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> > > > > [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> > > > > [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> > > > > [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
> > > > >
> > > > > 1. Without MMC software queue
> > > > > I tested 5 times for each case and output a average speed.
> > > > >
> > > > > 1) Sequential read:
> > > > > Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> > > > > Average speed: 59.66MiB/s
> > > > >
> > > > > 2) Random read:
> > > > > Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> > > > > Average speed: 27.04MiB/s
> > > > >
> > > > > 3) Sequential write:
> > > > > Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> > > > > Average speed: 69.68MiB/s
> > > > >
> > > > > 4) Random write:
> > > > > Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> > > > > Average speed: 35.96MiB/s
> > > > >
> > > > > 2. With MMC software queue
> > > > > I tested 5 times for each case and output a average speed.
> > > > >
> > > > > 1) Sequential read:
> > > > > Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> > > > > Average speed: 60.68MiB/s
> > > > >
> > > > > 2) Random read:
> > > > > Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> > > > > Average speed: 31.36MiB/s
> > > > >
> > > > > 3) Sequential write:
> > > > > Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> > > > > Average speed: 71.66MiB/s
> > > > >
> > > > > 4) Random write:
> > > > > Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> > > > > Average speed: 68.76MiB/s
> > > > >
> > > > > Form above data, we can see the MMC software queue can help to improve some
> > > > > performance obviously for random read and write, though no obvious improvement
> > > > > for sequential read and write.
> > > > >
> > > > > Any comments are welcome. Thanks a lot.
> > > > >
> > > > > Changes from v7:
> > > > > - Add reviewed tag from Arnd.
> > > > > - Use the 'hsq' acronym for varibles and functions in the core layer.
> > > > > - Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
> > > > > can work normally.
> > > > > - Add a new patch to enable the host software queue for the SD card.
> > > > > - Use the default MMC queue depth for host software queue.
> > > >
> > > > It would be nice to also have some measurements for an SD card, now
> > > > that the series supports this. Is that possible for you test as well?
> > >
> > > Yes, but my SD card works at high speed mode, and shows a low speed in
> > > 4k block size.
> > > [ 2.941965] mmc0: new high speed SDHC card at address b368
> > > [ 2.948325] mmcblk0: mmc0:b368 SD08G 7.42 GiB
> > > [ 2.956554] mmcblk0: p1
> > >
> > > And I did not see any obvious improvement or recession for my SD card
> > > in 4k block size from below data, I think the most of the time is
> > > spent in hardware. (But when I enabled the packed request based on
> > > hsq, I can see some obvious improvement.)
> > > Without hsq:
> > > read: bw=4347KiB/s
> > > randread: bw=3040KiB/s
> > > write: bw=1361KiB/s
> > > randwrite: bw=692KiB/s
> > >
> > > With hsq:
> > > read: bw=4246KiB/s
> > > randread: bw=29950KiB/s
> > > write: bw=1417KiB/s
> > > randwrite: bw=697KiB/s
> >
> > Thanks for testing and sharing!
> >
> > Did you use "[PATCH 0/3] Introduce the request_atomic() for the host"
> > as well? In there, it seems like you are disabling the hsq option for
>
> No, I did not use this series when testing, but I think the result
> will be same. Since we will set host->always_defer_done as true for
> removable SD cards.
>
> > removable cards, or did I get that wrong? Does it matter?
>
> No, I did not disable the hsq. In this series, we will not implement
> the request_atomic() API for these removable cards, since we need
> check the card status when handling a request, which maybe a sleepable
> operation when detecting the card status (such as from GPIO), so we
> can not disaptch next request in the irq context, instead we should
> still set the host->always_defer_done as true for the removable cards.

Got it.

So, a temporary/not-to-be-merged hack, to make the SD slot
non-removable, would allow you to use the optimized path with hsq,
right? That could give us some performance numbers also for SD cards.

Kind regards
Uffe

2020-02-11 12:48:30

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add MMC software queue support

On Tue, Feb 11, 2020 at 5:47 PM Ulf Hansson <[email protected]> wrote:
>
> On Tue, 11 Feb 2020 at 05:47, Baolin Wang <[email protected]> wrote:
> >
> > On Mon, Feb 10, 2020 at 9:26 PM Ulf Hansson <[email protected]> wrote:
> > >
> > > On Mon, 10 Feb 2020 at 09:41, Baolin Wang <[email protected]> wrote:
> > > >
> > > > Hi Ulf,
> > > >
> > > > On Thu, Feb 6, 2020 at 11:00 PM Ulf Hansson <[email protected]> wrote:
> > > > >
> > > > > On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Now the MMC read/write stack will always wait for previous request is
> > > > > > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > > > > > or queue a work to complete request, that will bring context switching
> > > > > > overhead, especially for high I/O per second rates, to affect the IO
> > > > > > performance.
> > > > >
> > > > > In the regular request path (non CQE), we call mmc_blk_card_busy() to
> > > > > complete a request. For write I/O, this leads to calling
> > > > > card_busy_detect(), which starts to poll the card by sending a CMD13.
> > > > >
> > > > > At least one CMD13 will be sent to the card, before we exit the
> > > > > polling loop and a new I/O request can get submitted. However, in many
> > > > > cases, depending on the controller/host/card/request-size, my best
> > > > > guess is that *one* CMD13 might not be sufficient. At least, that is
> > > > > what I have observed on those platforms I recently have been working
> > > > > on.
> > > > >
> > > > > That said, I am wondering if you have done some measurement/profiling
> > > > > on this particular behaviour for your controller/driver? For example,
> > > > > how many CMD13 gets sent for random small writes during polling?
> > > >
> > > > Ah, I had not checked how many CMD13 for random small writes before.
> > > > And I did a quick testing today, I found only 1 CMD13 gets sent for
> > > > random writes on my platform.
> > >
> > > Thanks for sharing the result, very interesting!
> > >
> > > Would you mind running a "dd write operation", to test large
> > > consecutive writes as those should cause longer busy times. Just to
> > > make sure the HW busy detection really works as expected.
> > >
> > > For example:
> > > dd of=/dev/mmcblk[n] if=/dev/zero bs=1M count=512 conv=fsync
> >
> > Sure. I've run the dd command and still got the same result. Only 1
> > CMD13 for each write operation.
>
> Great, thanks for confirming the behaviour.
>
> >
> > > > > Why am I asking this? Because, unless I am mistaken, when using the
> > > > > new hsq path that you introduce in $subject series, based on the cqe
> > > > > ops, then mmc_blk_card_busy() is not being called at all. In other
> > > > > words, you rely on HW busy detection from the controller/driver,
> > > > > rather than polling with CMD13. Is that correct?
> > > >
> > > > Right. I think so.
> > >
> > > A couple of follow up questions then.
> > >
> > > Normally, the mmc core adds the MMC_RSP_BUSY (part of MMC_RSP_R1B)
> > > response flag, for those commands having busy signaling on DAT0, like
> > > CMD6 for example. After the command has been sent, the core checks
> > > whether the host supports HW busy signaling, via the
> > > MMC_CAP_WAIT_WHILE_BUSY flag. If so the polling loop to detect when
> > > the card stops signaling busy, is skipped by the core. See
> > > __mmc_switch() and mmc_poll_for_busy(), for example.
> >
> > Make sense.
> >
> > > This makes me wonder, why doesn't your driver set the
> > > MMC_CAP_WAIT_WHILE_BUSY, as it seems to support HW busy signaling?
> >
> > I think we should set this flag, but missed it before. And I did a
> > quick testing with setting this flag, I did not find any problem.
> > So I will post one patch to enable this flag with more stable testing.
>
> Don't forget to also set .max_busy_timeout for the host, if you need
> to set an upper limit of the busy timeout. Zero indicates, no limit.

Sure. We've already set the correct max_busy_timeout correctly.

> >
> > > Moreover, it also seems like your driver can support
> > > MMC_CAP_DONE_COMPLETE. Or at least the part that requires HW busy
> >
> > No. Cause we will complete the request in the irq context, if we set
> > this MMC_CAP_DONE_COMPLETE, we will call mmc_blk_mq_post_req()--->
> > mmc_post_req() in the irq context, which is a time-consuming operation
> > and not be allowed.
>
> Ahh, I see. Thanks for clarifying this.
>
> >
> > > detection for I/O write operations. I guess we also need your series,
> > > "[PATCH 0/3] Introduce the request_atomic() for the host" as to
> > > support it. What do you think, would it be possible to test this at
> > > your side?
> >
> > Yes, we need this series ("[PATCH 0/3] Introduce the request_atomic()
> > for the host"), which is used to dispatch next request to the
> > controller in the irq context directly, to remove context switching.
> >
> > > Note that, I haven't played with MMC_CAP_DONE_COMPLETE so far, but it
> > > was invented to allow optimization for these kind of situations.
> >
> > I think the MMC_CAP_DONE_COMPLETE flag is used for this case: the host
> > controller completes requests in the irq thread or a workqueue
> > context, then we do not need queue the 'mq->complete_work' to complete
> > requests, instead we can compelete requests in the current context.
> >
> > But now we will complete the requests in the irq context, so seems
> > MMC_CAP_DONE_COMPLETE is not useful here.
>
> Yes, I fully agree with you, now. Thanks again for clarifying.
>
> >
> > > Now, don't get me wrong, I still think we should move forward with
> > > @subject series. I just want to make sure we don't have several
> > > methods to implement the same thing. So perhaps, MMC_CAP_DONE_COMPLETE
> > > and the corresponding code should be removed, in favor of the more
> > > generic hsq interface?
> >
> > Yes, now no host controllers set the MMC_CAP_DONE_COMPLETE flag, I
> > think we should remove this flag.
>
> Yeah, let's consider that for later then.
>
> If we should keep it, at least we should clarify with some
> comments/documentation about when it makes sense to use it.

OK.

> > > > > This seems like an additional reason to why you achieve significant
> > > > > improvements for the random write case. Don't you think?
> > > >
> > > > Yes, agree wtih you.
> > > >
> > > > > >
> > > > > > Thus this patch set will introduce the MMC software command queue support
> > > > > > based on command queue engine's interfaces, and set the queue depth as 64
> > > > > > to allow more requests can be be prepared, merged and inserted into IO
> > > > > > scheduler, but we only allow 2 requests in flight, that is enough to let
> > > > > > the irq handler always trigger the next request without a context switch,
> > > > > > as well as avoiding a long latency.
> > > > > >
> > > > > > Moreover we can expand the MMC software queue interface to support
> > > > > > MMC packed request or packed command instead of adding new interfaces,
> > > > > > according to previosus discussion.
> > > > > >
> > > > > > Below are some comparison data with fio tool. The fio command I used
> > > > > > is like below with changing the '--rw' parameter and enabling the direct
> > > > > > IO flag to measure the actual hardware transfer speed in 4K block size.
> > > > > >
> > > > > > ./fio --filename=/dev/mmcblk0p30 --direct=1 --iodepth=20 --rw=read --bs=4K --size=1G --group_reporting --numjobs=20 --name=test_read
> > > > > >
> > > > > > My eMMC card working at HS400 Enhanced strobe mode:
> > > > > > [ 2.229856] mmc0: new HS400 Enhanced strobe MMC card at address 0001
> > > > > > [ 2.237566] mmcblk0: mmc0:0001 HBG4a2 29.1 GiB
> > > > > > [ 2.242621] mmcblk0boot0: mmc0:0001 HBG4a2 partition 1 4.00 MiB
> > > > > > [ 2.249110] mmcblk0boot1: mmc0:0001 HBG4a2 partition 2 4.00 MiB
> > > > > > [ 2.255307] mmcblk0rpmb: mmc0:0001 HBG4a2 partition 3 4.00 MiB, chardev (248:0)
> > > > > >
> > > > > > 1. Without MMC software queue
> > > > > > I tested 5 times for each case and output a average speed.
> > > > > >
> > > > > > 1) Sequential read:
> > > > > > Speed: 59.4MiB/s, 63.4MiB/s, 57.5MiB/s, 57.2MiB/s, 60.8MiB/s
> > > > > > Average speed: 59.66MiB/s
> > > > > >
> > > > > > 2) Random read:
> > > > > > Speed: 26.9MiB/s, 26.9MiB/s, 27.1MiB/s, 27.1MiB/s, 27.2MiB/s
> > > > > > Average speed: 27.04MiB/s
> > > > > >
> > > > > > 3) Sequential write:
> > > > > > Speed: 71.6MiB/s, 72.5MiB/s, 72.2MiB/s, 64.6MiB/s, 67.5MiB/s
> > > > > > Average speed: 69.68MiB/s
> > > > > >
> > > > > > 4) Random write:
> > > > > > Speed: 36.3MiB/s, 35.4MiB/s, 38.6MiB/s, 34MiB/s, 35.5MiB/s
> > > > > > Average speed: 35.96MiB/s
> > > > > >
> > > > > > 2. With MMC software queue
> > > > > > I tested 5 times for each case and output a average speed.
> > > > > >
> > > > > > 1) Sequential read:
> > > > > > Speed: 59.2MiB/s, 60.4MiB/s, 63.6MiB/s, 60.3MiB/s, 59.9MiB/s
> > > > > > Average speed: 60.68MiB/s
> > > > > >
> > > > > > 2) Random read:
> > > > > > Speed: 31.3MiB/s, 31.4MiB/s, 31.5MiB/s, 31.3MiB/s, 31.3MiB/s
> > > > > > Average speed: 31.36MiB/s
> > > > > >
> > > > > > 3) Sequential write:
> > > > > > Speed: 71MiB/s, 71.8MiB/s, 72.3MiB/s, 72.2MiB/s, 71MiB/s
> > > > > > Average speed: 71.66MiB/s
> > > > > >
> > > > > > 4) Random write:
> > > > > > Speed: 68.9MiB/s, 68.7MiB/s, 68.8MiB/s, 68.6MiB/s, 68.8MiB/s
> > > > > > Average speed: 68.76MiB/s
> > > > > >
> > > > > > Form above data, we can see the MMC software queue can help to improve some
> > > > > > performance obviously for random read and write, though no obvious improvement
> > > > > > for sequential read and write.
> > > > > >
> > > > > > Any comments are welcome. Thanks a lot.
> > > > > >
> > > > > > Changes from v7:
> > > > > > - Add reviewed tag from Arnd.
> > > > > > - Use the 'hsq' acronym for varibles and functions in the core layer.
> > > > > > - Check the 'card->ext_csd.cmdq_en' in cqhci.c to make sure the CQE
> > > > > > can work normally.
> > > > > > - Add a new patch to enable the host software queue for the SD card.
> > > > > > - Use the default MMC queue depth for host software queue.
> > > > >
> > > > > It would be nice to also have some measurements for an SD card, now
> > > > > that the series supports this. Is that possible for you test as well?
> > > >
> > > > Yes, but my SD card works at high speed mode, and shows a low speed in
> > > > 4k block size.
> > > > [ 2.941965] mmc0: new high speed SDHC card at address b368
> > > > [ 2.948325] mmcblk0: mmc0:b368 SD08G 7.42 GiB
> > > > [ 2.956554] mmcblk0: p1
> > > >
> > > > And I did not see any obvious improvement or recession for my SD card
> > > > in 4k block size from below data, I think the most of the time is
> > > > spent in hardware. (But when I enabled the packed request based on
> > > > hsq, I can see some obvious improvement.)
> > > > Without hsq:
> > > > read: bw=4347KiB/s
> > > > randread: bw=3040KiB/s
> > > > write: bw=1361KiB/s
> > > > randwrite: bw=692KiB/s
> > > >
> > > > With hsq:
> > > > read: bw=4246KiB/s
> > > > randread: bw=29950KiB/s
> > > > write: bw=1417KiB/s
> > > > randwrite: bw=697KiB/s
> > >
> > > Thanks for testing and sharing!
> > >
> > > Did you use "[PATCH 0/3] Introduce the request_atomic() for the host"
> > > as well? In there, it seems like you are disabling the hsq option for
> >
> > No, I did not use this series when testing, but I think the result
> > will be same. Since we will set host->always_defer_done as true for
> > removable SD cards.
> >
> > > removable cards, or did I get that wrong? Does it matter?
> >
> > No, I did not disable the hsq. In this series, we will not implement
> > the request_atomic() API for these removable cards, since we need
> > check the card status when handling a request, which maybe a sleepable
> > operation when detecting the card status (such as from GPIO), so we
> > can not disaptch next request in the irq context, instead we should
> > still set the host->always_defer_done as true for the removable cards.
>
> Got it.
>
> So, a temporary/not-to-be-merged hack, to make the SD slot
> non-removable, would allow you to use the optimized path with hsq,
> right? That could give us some performance numbers also for SD cards.

I tried the request_atomic() patchset and set non-removable for the SD
card host, but I got the same result. So I still think the speed
(clock) is so low that we can not find any obvious improvements in
this case.

2020-02-11 12:54:35

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] mmc: Add MMC host software queue support

On Tue, Feb 11, 2020 at 5:45 PM Ulf Hansson <[email protected]> wrote:
>
> On Wed, 5 Feb 2020 at 13:51, Baolin Wang <[email protected]> wrote:
> >
> > From: Baolin Wang <[email protected]>
> >
> > Now the MMC read/write stack will always wait for previous request is
> > completed by mmc_blk_rw_wait(), before sending a new request to hardware,
> > or queue a work to complete request, that will bring context switching
> > overhead, especially for high I/O per second rates, to affect the IO
> > performance.
>
> Would you mind adding some more context about the mmc_blk_rw_wait()?
> Especially I want to make it clear that mmc_blk_rw_wait() is also used
> to poll the card for busy completion for I/O writes, via sending
> CMD13.

Sure.

>
> >
> > Thus this patch introduces MMC software queue interface based on the
> > hardware command queue engine's interfaces, which is similar with the
> > hardware command queue engine's idea, that can remove the context
> > switching. Moreover we set the default queue depth as 64 for software
> > queue, which allows more requests to be prepared, merged and inserted
> > into IO scheduler to improve performance, but we only allow 2 requests
> > in flight, that is enough to let the irq handler always trigger the
> > next request without a context switch, as well as avoiding a long latency.
>
> I think it's important to clarify that to use this new interface, hsq,
> the host controller/driver needs to support HW busy detection for I/O
> operations.
>
> In other words, the host driver must not complete a data transfer
> request, until after the card stops signals busy. This behaviour is
> also required for "closed-ended-transmissions" with CMD23, as in this
> path there is no CMD12 sent to complete the transfer, thus no R1B
> response flag to trigger the HW busy detection behaviour in the
> driver.

Sure.

>
> >
> > From the fio testing data in cover letter, we can see the software
> > queue can improve some performance with 4K block size, increasing
> > about 16% for random read, increasing about 90% for random write,
> > though no obvious improvement for sequential read and write.
> >
> > Moreover we can expand the software queue interface to support MMC
> > packed request or packed command in future.
> >
> > Reviewed-by: Arnd Bergmann <[email protected]>
> > Signed-off-by: Baolin Wang <[email protected]>
> > Signed-off-by: Baolin Wang <[email protected]>
> > ---
>
> [...]
>
> > diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
> > index f6912de..7a9976f 100644
> > --- a/drivers/mmc/core/mmc.c
> > +++ b/drivers/mmc/core/mmc.c
> > @@ -1851,15 +1851,22 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
> > */
> > card->reenable_cmdq = card->ext_csd.cmdq_en;
> >
> > - if (card->ext_csd.cmdq_en && !host->cqe_enabled) {
> > + if (host->cqe_ops && !host->cqe_enabled) {
> > err = host->cqe_ops->cqe_enable(host, card);
> > if (err) {
> > pr_err("%s: Failed to enable CQE, error %d\n",
> > mmc_hostname(host), err);
>
> This means we are going to start printing an error message for those
> eMMCs that doesn't support command queuing, but the host supports
> MMC_CAP2_CQE.
>
> Not sure how big of a problem this is, but another option is simply to
> leave the logging of the *failures* to the host driver, rather than
> doing it here.
>
> Oh well, feel free to change or leave this as is for now. We can
> always change it on top, if needed.

OK. I will move the failure log to cqe_enable() callback to keep the
same logs' logic. Thanks.

> > } else {
> > host->cqe_enabled = true;
> > - pr_info("%s: Command Queue Engine enabled\n",
> > - mmc_hostname(host));
> > +
> > + if (card->ext_csd.cmdq_en) {
> > + pr_info("%s: Command Queue Engine enabled\n",
> > + mmc_hostname(host));
> > + } else {
> > + host->hsq_enabled = true;
> > + pr_info("%s: Host Software Queue enabled\n",
> > + mmc_hostname(host));
> > + }
> > }
> > }
>
> [...]
>
> Kind regards
> Uffe