2024-03-19 11:59:50

by Sergey Khimich

Subject: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support

Hello!

This series implements SDHCI CQE support for the sdhci-of-dwcmshc driver.
To enable CQE support, set 'supports-cqe' in the appropriate mmc node of
your device tree file.

Also, while implementing CQE support for the driver, I ran into a problem,
which I describe below.
According to the IP block documentation, CQE works only in "ADMA2 only"
mode, which is activated only when v4 mode is enabled. I see in the
dwcmshc_probe() function that v4 mode gets enabled only for the
'sdhci_dwcmshc_bf3_pdata' platform data.

So my question is: is it correct to enable v4 mode for all platform data
if the 'SDHCI_CAN_64BIT_V4' bit is set in hardware?

I'm afraid that enabling v4 mode could break some platforms. On the other
hand, if the host controller says that it can do v4
(caps & SDHCI_CAN_64BIT_V4), let's do v4 or disable it manually with some
quirk. Anyway - RFC.
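
The choice described above (follow the capability bit, with a manual
opt-out) can be sketched as a small standalone C function. This is only an
illustration under stated assumptions: DWCMSHC_QUIRK_NO_V4_MODE and
dwcmshc_want_v4_mode() are hypothetical names that do not exist in the
driver, and the value of SDHCI_CAN_64BIT_V4 is copied here on the
assumption that it matches the kernel's sdhci.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the definition in the kernel's sdhci.h. */
#define SDHCI_CAN_64BIT_V4		0x08000000u

/* Hypothetical opt-out flag, for illustration only. */
#define DWCMSHC_QUIRK_NO_V4_MODE	(1u << 0)

/*
 * Decide whether v4 mode should be enabled: trust the capability bit
 * reported by the controller unless the platform opts out via the quirk.
 */
static bool dwcmshc_want_v4_mode(uint32_t caps, uint32_t quirks)
{
	if (quirks & DWCMSHC_QUIRK_NO_V4_MODE)
		return false;

	return (caps & SDHCI_CAN_64BIT_V4) != 0;
}
```

A platform known to misbehave in v4 mode would set the hypothetical quirk
in its platform data, while every other platform would simply follow the
capability register.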


v2:
- Added dwcmshc specific cqe_disable hook to prevent losing
in-flight cmd when an ioctl is issued and cqe_disable is called;

- Added handling of the 128 MB boundary for the host memory data buffer
size and the data buffer. To implement this, an extra callback is
added to the struct 'sdhci_ops'.

- Fixed typo.

v3:
- Fix warning reported by kernel test robot:
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

v4:
- Data reset moved to custom driver tuning hook.
- Removed unnecessary dwcmshc_sdhci_cqe_disable() func
- Removed unnecessary dwcmshc_cqhci_set_tran_desc. Export and use
cqhci_set_tran_desc() instead.
- Provide a hook for cqhci_set_tran_desc() instead of cqhci_prep_tran_desc().
- Fix typo: int_clok_disable --> int_clock_disable

v5:
- Fix warning reported by kernel test robot:
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

v6:
- Rebase to master branch
- Fix typo;
- Fix double blank line;
- Add cqhci_suspend() and cqhci_resume() functions
to support mmc suspend-to-ram (s2r);
- Move reading DWCMSHC_P_VENDOR_AREA2 register under "supports-cqe"
condition as not all IPs have that register;
- Remove sdhci V4 mode from the list of prerequisites to init cqhci.

v7:
- Disable the MMC_CAP2_CQE and MMC_CAP2_CQE_DCMD caps if CQE init
fails, to prevent problems in the suspend/resume functions.
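
The 128 MB boundary handling mentioned in the v2 notes can be illustrated
with a standalone user-space sketch. It mirrors the idea behind the
driver's BOUNDARY_OK() check: a buffer that crosses a 128 MB boundary is
described by two segments instead of one. The struct seg type and the
split_at_128m() helper are hypothetical names used only for this
illustration, and the sketch assumes the buffer is no longer than 128 MB,
so at most one boundary is crossed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_128M 0x08000000ull

/* Hypothetical segment type: one DMA address range. */
struct seg {
	uint64_t addr;
	uint32_t len;
};

/* True when [addr, addr + len) stays inside one 128 MB window. */
static bool boundary_ok(uint64_t addr, uint32_t len)
{
	return (addr | (SZ_128M - 1)) == ((addr + len - 1) | (SZ_128M - 1));
}

/*
 * Split a buffer at the 128 MB boundary if it crosses one.
 * Returns the number of segments written to out[] (1 or 2).
 * Assumes len <= 128 MB.
 */
static size_t split_at_128m(uint64_t addr, uint32_t len, struct seg out[2])
{
	uint32_t first;

	if (len == 0 || boundary_ok(addr, len)) {
		out[0] = (struct seg){ addr, len };
		return 1;
	}

	/* Bytes remaining until the next 128 MB boundary. */
	first = (uint32_t)(SZ_128M - (addr & (SZ_128M - 1)));
	out[0] = (struct seg){ addr, first };
	out[1] = (struct seg){ addr + first, len - first };
	return 2;
}
```

In the series itself the second segment is written into the next transfer
descriptor, which is why the callback takes a pointer to the descriptor
pointer and advances it.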

Sergey Khimich (2):
mmc: cqhci: Add cqhci set_tran_desc() callback
mmc: sdhci-of-dwcmshc: Implement SDHCI CQE support

drivers/mmc/host/Kconfig | 1 +
drivers/mmc/host/cqhci-core.c | 11 +-
drivers/mmc/host/cqhci.h | 4 +
drivers/mmc/host/sdhci-of-dwcmshc.c | 191 +++++++++++++++++++++++++++-
4 files changed, 202 insertions(+), 5 deletions(-)

--
2.30.2



2024-03-19 11:59:54

by Sergey Khimich

Subject: [PATCH v7 1/2] mmc: cqhci: Add cqhci set_tran_desc() callback

From: Sergey Khimich <[email protected]>

Some mmc controllers may have specific limitations on how cqhci
transfer descriptors are set.
So add a callback to allow a driver-specific implementation.

Signed-off-by: Sergey Khimich <[email protected]>
---
drivers/mmc/host/cqhci-core.c | 11 ++++++++---
drivers/mmc/host/cqhci.h | 4 ++++
2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index 41e94cd14109..c14d7251d0bb 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -474,8 +474,8 @@ static int cqhci_dma_map(struct mmc_host *host, struct mmc_request *mrq)
return sg_count;
}

-static void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end,
- bool dma64)
+void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end,
+ bool dma64)
{
__le32 *attr = (__le32 __force *)desc;

@@ -495,6 +495,7 @@ static void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end,
dataddr[0] = cpu_to_le32(addr);
}
}
+EXPORT_SYMBOL(cqhci_set_tran_desc);

static int cqhci_prep_tran_desc(struct mmc_request *mrq,
struct cqhci_host *cq_host, int tag)
@@ -522,7 +523,11 @@ static int cqhci_prep_tran_desc(struct mmc_request *mrq,

if ((i+1) == sg_count)
end = true;
- cqhci_set_tran_desc(desc, addr, len, end, dma64);
+ if (cq_host->ops->set_tran_desc)
+ cq_host->ops->set_tran_desc(cq_host, &desc, addr, len, end, dma64);
+ else
+ cqhci_set_tran_desc(desc, addr, len, end, dma64);
+
desc += cq_host->trans_desc_len;
}

diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 1a12e40a02e6..fab9d74445ba 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -293,6 +293,9 @@ struct cqhci_host_ops {
int (*program_key)(struct cqhci_host *cq_host,
const union cqhci_crypto_cfg_entry *cfg, int slot);
#endif
+ void (*set_tran_desc)(struct cqhci_host *cq_host, u8 **desc,
+ dma_addr_t addr, int len, bool end, bool dma64);
+
};

static inline void cqhci_writel(struct cqhci_host *host, u32 val, int reg)
@@ -318,6 +321,7 @@ irqreturn_t cqhci_irq(struct mmc_host *mmc, u32 intmask, int cmd_error,
int cqhci_init(struct cqhci_host *cq_host, struct mmc_host *mmc, bool dma64);
struct cqhci_host *cqhci_pltfm_init(struct platform_device *pdev);
int cqhci_deactivate(struct mmc_host *mmc);
+void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end, bool dma64);
static inline int cqhci_suspend(struct mmc_host *mmc)
{
return cqhci_deactivate(mmc);
--
2.30.2


2024-03-19 12:00:09

by Sergey Khimich

Subject: [PATCH v7 2/2] mmc: sdhci-of-dwcmshc: Implement SDHCI CQE support

From: Sergey Khimich <[email protected]>

To enable CQE support, set 'supports-cqe' in the appropriate mmc node
of your device tree file.

Signed-off-by: Sergey Khimich <[email protected]>
---
drivers/mmc/host/Kconfig | 1 +
drivers/mmc/host/sdhci-of-dwcmshc.c | 191 +++++++++++++++++++++++++++-
2 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 81f2c4e05287..554dbf7f2fa4 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -233,6 +233,7 @@ config MMC_SDHCI_OF_DWCMSHC
depends on MMC_SDHCI_PLTFM
depends on OF
depends on COMMON_CLK
+ select MMC_CQHCI
help
This selects Synopsys DesignWare Cores Mobile Storage Controller
support.
diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c
index a1f57af6acfb..8d6cfb648096 100644
--- a/drivers/mmc/host/sdhci-of-dwcmshc.c
+++ b/drivers/mmc/host/sdhci-of-dwcmshc.c
@@ -21,6 +21,7 @@
#include <linux/sizes.h>

#include "sdhci-pltfm.h"
+#include "cqhci.h"

#define SDHCI_DWCMSHC_ARG2_STUFF GENMASK(31, 16)

@@ -52,6 +53,9 @@
#define AT_CTRL_SWIN_TH_VAL_MASK GENMASK(31, 24) /* bits [31:24] */
#define AT_CTRL_SWIN_TH_VAL 0x9 /* sampling window threshold */

+/* DWC IP vendor area 2 pointer */
+#define DWCMSHC_P_VENDOR_AREA2 0xea
+
/* Rockchip specific Registers */
#define DWCMSHC_EMMC_DLL_CTRL 0x800
#define DWCMSHC_EMMC_DLL_RXCLK 0x804
@@ -167,6 +171,10 @@
#define BOUNDARY_OK(addr, len) \
((addr | (SZ_128M - 1)) == ((addr + len - 1) | (SZ_128M - 1)))

+#define DWCMSHC_SDHCI_CQE_TRNS_MODE (SDHCI_TRNS_MULTI | \
+ SDHCI_TRNS_BLK_CNT_EN | \
+ SDHCI_TRNS_DMA)
+
enum dwcmshc_rk_type {
DWCMSHC_RK3568,
DWCMSHC_RK3588,
@@ -182,7 +190,9 @@ struct rk35xx_priv {

struct dwcmshc_priv {
struct clk *bus_clk;
- int vendor_specific_area1; /* P_VENDOR_SPECIFIC_AREA reg */
+ int vendor_specific_area1; /* P_VENDOR_SPECIFIC_AREA1 reg */
+ int vendor_specific_area2; /* P_VENDOR_SPECIFIC_AREA2 reg */
+
void *priv; /* pointer to SoC private stuff */
u16 delay_line;
u16 flags;
@@ -441,6 +451,90 @@ static void dwcmshc_hs400_enhanced_strobe(struct mmc_host *mmc,
sdhci_writel(host, vendor, reg);
}

+static int dwcmshc_execute_tuning(struct mmc_host *mmc, u32 opcode)
+{
+ int err = sdhci_execute_tuning(mmc, opcode);
+ struct sdhci_host *host = mmc_priv(mmc);
+
+ if (err)
+ return err;
+
+ /*
+ * Tuning can leave the IP in an active state (Buffer Read Enable bit
+ * set) which prevents the entry to low power states (i.e. S0i3). Data
+ * reset will clear it.
+ */
+ sdhci_reset(host, SDHCI_RESET_DATA);
+
+ return 0;
+}
+
+static u32 dwcmshc_cqe_irq_handler(struct sdhci_host *host, u32 intmask)
+{
+ int cmd_error = 0;
+ int data_error = 0;
+
+ if (!sdhci_cqe_irq(host, intmask, &cmd_error, &data_error))
+ return intmask;
+
+ cqhci_irq(host->mmc, intmask, cmd_error, data_error);
+
+ return 0;
+}
+
+static void dwcmshc_sdhci_cqe_enable(struct mmc_host *mmc)
+{
+ struct sdhci_host *host = mmc_priv(mmc);
+ u8 ctrl;
+
+ sdhci_writew(host, DWCMSHC_SDHCI_CQE_TRNS_MODE, SDHCI_TRANSFER_MODE);
+
+ sdhci_cqe_enable(mmc);
+
+ /*
+ * The "DesignWare Cores Mobile Storage Host Controller
+ * DWC_mshc / DWC_mshc_lite Databook" says:
+ * when Host Version 4 Enable" is 1 in Host Control 2 register,
+ * SDHCI_CTRL_ADMA32 bit means ADMA2 is selected.
+ * Selection of 32-bit/64-bit System Addressing:
+ * either 32-bit or 64-bit system addressing is selected by
+ * 64-bit Addressing bit in Host Control 2 register.
+ *
+ * On the other hand the "DesignWare Cores Mobile Storage Host
+ * Controller DWC_mshc / DWC_mshc_lite User Guide" says, that we have to
+ * set DMA_SEL to ADMA2 _only_ mode in the Host Control 2 register.
+ */
+ ctrl = sdhci_readb(host, SDHCI_HOST_CONTROL);
+ ctrl &= ~SDHCI_CTRL_DMA_MASK;
+ ctrl |= SDHCI_CTRL_ADMA32;
+ sdhci_writeb(host, ctrl, SDHCI_HOST_CONTROL);
+}
+
+static void dwcmshc_set_tran_desc(struct cqhci_host *cq_host, u8 **desc,
+ dma_addr_t addr, int len, bool end, bool dma64)
+{
+ int tmplen, offset;
+
+ if (likely(!len || BOUNDARY_OK(addr, len))) {
+ cqhci_set_tran_desc(*desc, addr, len, end, dma64);
+ return;
+ }
+
+ offset = addr & (SZ_128M - 1);
+ tmplen = SZ_128M - offset;
+ cqhci_set_tran_desc(*desc, addr, tmplen, false, dma64);
+
+ addr += tmplen;
+ len -= tmplen;
+ *desc += cq_host->trans_desc_len;
+ cqhci_set_tran_desc(*desc, addr, len, end, dma64);
+}
+
+static void dwcmshc_cqhci_dumpregs(struct mmc_host *mmc)
+{
+ sdhci_dumpregs(mmc_priv(mmc));
+}
+
static void dwcmshc_rk3568_set_clock(struct sdhci_host *host, unsigned int clock)
{
struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
@@ -649,6 +743,7 @@ static const struct sdhci_ops sdhci_dwcmshc_ops = {
.get_max_clock = dwcmshc_get_max_clock,
.reset = sdhci_reset,
.adma_write_desc = dwcmshc_adma_write_desc,
+ .irq = dwcmshc_cqe_irq_handler,
};

static const struct sdhci_ops sdhci_dwcmshc_rk35xx_ops = {
@@ -700,6 +795,73 @@ static const struct sdhci_pltfm_data sdhci_dwcmshc_th1520_pdata = {
.quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN,
};

+static const struct cqhci_host_ops dwcmshc_cqhci_ops = {
+ .enable = dwcmshc_sdhci_cqe_enable,
+ .disable = sdhci_cqe_disable,
+ .dumpregs = dwcmshc_cqhci_dumpregs,
+ .set_tran_desc = dwcmshc_set_tran_desc,
+};
+
+static void dwcmshc_cqhci_init(struct sdhci_host *host, struct platform_device *pdev)
+{
+ struct cqhci_host *cq_host;
+ struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
+ struct dwcmshc_priv *priv = sdhci_pltfm_priv(pltfm_host);
+ bool dma64 = false;
+ u16 clk;
+ int err;
+
+ host->mmc->caps2 |= MMC_CAP2_CQE | MMC_CAP2_CQE_DCMD;
+ cq_host = devm_kzalloc(&pdev->dev, sizeof(*cq_host), GFP_KERNEL);
+ if (!cq_host) {
+ dev_err(mmc_dev(host->mmc), "Unable to setup CQE: not enough memory\n");
+ goto dsbl_cqe_caps;
+ }
+
+ /*
+ * For dwcmshc host controller we have to enable internal clock
+ * before access to some registers from Vendor Specific Area 2.
+ */
+ clk = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
+ clk |= SDHCI_CLOCK_INT_EN;
+ sdhci_writew(host, clk, SDHCI_CLOCK_CONTROL);
+ clk = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
+ if (!(clk & SDHCI_CLOCK_INT_EN)) {
+ dev_err(mmc_dev(host->mmc), "Unable to setup CQE: internal clock enable error\n");
+ goto free_cq_host;
+ }
+
+ cq_host->mmio = host->ioaddr + priv->vendor_specific_area2;
+ cq_host->ops = &dwcmshc_cqhci_ops;
+
+ /* Enable using of 128-bit task descriptors */
+ dma64 = host->flags & SDHCI_USE_64_BIT_DMA;
+ if (dma64) {
+ dev_dbg(mmc_dev(host->mmc), "128-bit task descriptors\n");
+ cq_host->caps |= CQHCI_TASK_DESC_SZ_128;
+ }
+ err = cqhci_init(cq_host, host->mmc, dma64);
+ if (err) {
+ dev_err(mmc_dev(host->mmc), "Unable to setup CQE: error %d\n", err);
+ goto int_clock_disable;
+ }
+
+ dev_dbg(mmc_dev(host->mmc), "CQE init done\n");
+
+ return;
+
+int_clock_disable:
+ clk = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
+ clk &= ~SDHCI_CLOCK_INT_EN;
+ sdhci_writew(host, clk, SDHCI_CLOCK_CONTROL);
+
+free_cq_host:
+ devm_kfree(&pdev->dev, cq_host);
+
+dsbl_cqe_caps:
+ host->mmc->caps2 &= ~(MMC_CAP2_CQE | MMC_CAP2_CQE_DCMD);
+}
+
static int dwcmshc_rk35xx_init(struct sdhci_host *host, struct dwcmshc_priv *dwc_priv)
{
int err;
@@ -796,7 +958,7 @@ static int dwcmshc_probe(struct platform_device *pdev)
struct rk35xx_priv *rk_priv = NULL;
const struct sdhci_pltfm_data *pltfm_data;
int err;
- u32 extra;
+ u32 extra, caps;

pltfm_data = device_get_match_data(&pdev->dev);
if (!pltfm_data) {
@@ -847,6 +1009,7 @@ static int dwcmshc_probe(struct platform_device *pdev)

host->mmc_host_ops.request = dwcmshc_request;
host->mmc_host_ops.hs400_enhanced_strobe = dwcmshc_hs400_enhanced_strobe;
+ host->mmc_host_ops.execute_tuning = dwcmshc_execute_tuning;

if (pltfm_data == &sdhci_dwcmshc_rk35xx_pdata) {
rk_priv = devm_kzalloc(&pdev->dev, sizeof(struct rk35xx_priv), GFP_KERNEL);
@@ -896,6 +1059,10 @@ static int dwcmshc_probe(struct platform_device *pdev)
sdhci_enable_v4_mode(host);
#endif

+ caps = sdhci_readl(host, SDHCI_CAPABILITIES);
+ if (caps & SDHCI_CAN_64BIT_V4)
+ sdhci_enable_v4_mode(host);
+
host->mmc->caps |= MMC_CAP_WAIT_WHILE_BUSY;

pm_runtime_get_noresume(dev);
@@ -906,6 +1073,14 @@ static int dwcmshc_probe(struct platform_device *pdev)
if (err)
goto err_rpm;

+ /* Setup Command Queue Engine if enabled */
+ if (device_property_read_bool(&pdev->dev, "supports-cqe")) {
+ priv->vendor_specific_area2 =
+ sdhci_readw(host, DWCMSHC_P_VENDOR_AREA2);
+
+ dwcmshc_cqhci_init(host, pdev);
+ }
+
if (rk_priv)
dwcmshc_rk35xx_postinit(host, priv);

@@ -961,6 +1136,12 @@ static int dwcmshc_suspend(struct device *dev)

pm_runtime_resume(dev);

+ if (host->mmc->caps2 & MMC_CAP2_CQE) {
+ ret = cqhci_suspend(host->mmc);
+ if (ret)
+ return ret;
+ }
+
ret = sdhci_suspend_host(host);
if (ret)
return ret;
@@ -1005,6 +1186,12 @@ static int dwcmshc_resume(struct device *dev)
if (ret)
goto disable_rockchip_clks;

+ if (host->mmc->caps2 & MMC_CAP2_CQE) {
+ ret = cqhci_resume(host->mmc);
+ if (ret)
+ goto disable_rockchip_clks;
+ }
+
return 0;

disable_rockchip_clks:
--
2.30.2


2024-03-19 12:48:43

by Adrian Hunter

Subject: Re: [PATCH v7 1/2] mmc: cqhci: Add cqhci set_tran_desc() callback

On 19/03/24 13:59, Sergey Khimich wrote:
> From: Sergey Khimich <[email protected]>
>
> Some mmc controllers may have specific limitations on how cqhci
> transfer descriptors are set.
> So add a callback to allow a driver-specific implementation.
>
> Signed-off-by: Sergey Khimich <[email protected]>

Acked-by: Adrian Hunter <[email protected]>

> ---
> drivers/mmc/host/cqhci-core.c | 11 ++++++++---
> drivers/mmc/host/cqhci.h | 4 ++++
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index 41e94cd14109..c14d7251d0bb 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -474,8 +474,8 @@ static int cqhci_dma_map(struct mmc_host *host, struct mmc_request *mrq)
> return sg_count;
> }
>
> -static void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end,
> - bool dma64)
> +void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end,
> + bool dma64)
> {
> __le32 *attr = (__le32 __force *)desc;
>
> @@ -495,6 +495,7 @@ static void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end,
> dataddr[0] = cpu_to_le32(addr);
> }
> }
> +EXPORT_SYMBOL(cqhci_set_tran_desc);
>
> static int cqhci_prep_tran_desc(struct mmc_request *mrq,
> struct cqhci_host *cq_host, int tag)
> @@ -522,7 +523,11 @@ static int cqhci_prep_tran_desc(struct mmc_request *mrq,
>
> if ((i+1) == sg_count)
> end = true;
> - cqhci_set_tran_desc(desc, addr, len, end, dma64);
> + if (cq_host->ops->set_tran_desc)
> + cq_host->ops->set_tran_desc(cq_host, &desc, addr, len, end, dma64);
> + else
> + cqhci_set_tran_desc(desc, addr, len, end, dma64);
> +
> desc += cq_host->trans_desc_len;
> }
>
> diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
> index 1a12e40a02e6..fab9d74445ba 100644
> --- a/drivers/mmc/host/cqhci.h
> +++ b/drivers/mmc/host/cqhci.h
> @@ -293,6 +293,9 @@ struct cqhci_host_ops {
> int (*program_key)(struct cqhci_host *cq_host,
> const union cqhci_crypto_cfg_entry *cfg, int slot);
> #endif
> + void (*set_tran_desc)(struct cqhci_host *cq_host, u8 **desc,
> + dma_addr_t addr, int len, bool end, bool dma64);
> +
> };
>
> static inline void cqhci_writel(struct cqhci_host *host, u32 val, int reg)
> @@ -318,6 +321,7 @@ irqreturn_t cqhci_irq(struct mmc_host *mmc, u32 intmask, int cmd_error,
> int cqhci_init(struct cqhci_host *cq_host, struct mmc_host *mmc, bool dma64);
> struct cqhci_host *cqhci_pltfm_init(struct platform_device *pdev);
> int cqhci_deactivate(struct mmc_host *mmc);
> +void cqhci_set_tran_desc(u8 *desc, dma_addr_t addr, int len, bool end, bool dma64);
> static inline int cqhci_suspend(struct mmc_host *mmc)
> {
> return cqhci_deactivate(mmc);


2024-03-19 12:54:01

by Adrian Hunter

Subject: Re: [PATCH v7 2/2] mmc: sdhci-of-dwcmshc: Implement SDHCI CQE support

On 19/03/24 13:59, Sergey Khimich wrote:
> From: Sergey Khimich <[email protected]>
>
> To enable CQE support, set 'supports-cqe' in the appropriate mmc node
> of your device tree file.
>
> Signed-off-by: Sergey Khimich <[email protected]>

It seems to need a rebase onto the latest mmc next, but nevertheless:

Acked-by: Adrian Hunter <[email protected]>

> ---
> drivers/mmc/host/Kconfig | 1 +
> drivers/mmc/host/sdhci-of-dwcmshc.c | 191 +++++++++++++++++++++++++++-
> 2 files changed, 190 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
> index 81f2c4e05287..554dbf7f2fa4 100644
> --- a/drivers/mmc/host/Kconfig
> +++ b/drivers/mmc/host/Kconfig
> @@ -233,6 +233,7 @@ config MMC_SDHCI_OF_DWCMSHC
> depends on MMC_SDHCI_PLTFM
> depends on OF
> depends on COMMON_CLK
> + select MMC_CQHCI
> help
> This selects Synopsys DesignWare Cores Mobile Storage Controller
> support.
> diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c
> index a1f57af6acfb..8d6cfb648096 100644
> --- a/drivers/mmc/host/sdhci-of-dwcmshc.c
> +++ b/drivers/mmc/host/sdhci-of-dwcmshc.c
> @@ -21,6 +21,7 @@
> #include <linux/sizes.h>
>
> #include "sdhci-pltfm.h"
> +#include "cqhci.h"
>
> #define SDHCI_DWCMSHC_ARG2_STUFF GENMASK(31, 16)
>
> @@ -52,6 +53,9 @@
> #define AT_CTRL_SWIN_TH_VAL_MASK GENMASK(31, 24) /* bits [31:24] */
> #define AT_CTRL_SWIN_TH_VAL 0x9 /* sampling window threshold */
>
> +/* DWC IP vendor area 2 pointer */
> +#define DWCMSHC_P_VENDOR_AREA2 0xea
> +
> /* Rockchip specific Registers */
> #define DWCMSHC_EMMC_DLL_CTRL 0x800
> #define DWCMSHC_EMMC_DLL_RXCLK 0x804
> @@ -167,6 +171,10 @@
> #define BOUNDARY_OK(addr, len) \
> ((addr | (SZ_128M - 1)) == ((addr + len - 1) | (SZ_128M - 1)))
>
> +#define DWCMSHC_SDHCI_CQE_TRNS_MODE (SDHCI_TRNS_MULTI | \
> + SDHCI_TRNS_BLK_CNT_EN | \
> + SDHCI_TRNS_DMA)
> +
> enum dwcmshc_rk_type {
> DWCMSHC_RK3568,
> DWCMSHC_RK3588,
> @@ -182,7 +190,9 @@ struct rk35xx_priv {
>
> struct dwcmshc_priv {
> struct clk *bus_clk;
> - int vendor_specific_area1; /* P_VENDOR_SPECIFIC_AREA reg */
> + int vendor_specific_area1; /* P_VENDOR_SPECIFIC_AREA1 reg */
> + int vendor_specific_area2; /* P_VENDOR_SPECIFIC_AREA2 reg */
> +
> void *priv; /* pointer to SoC private stuff */
> u16 delay_line;
> u16 flags;
> @@ -441,6 +451,90 @@ static void dwcmshc_hs400_enhanced_strobe(struct mmc_host *mmc,
> sdhci_writel(host, vendor, reg);
> }
>
> +static int dwcmshc_execute_tuning(struct mmc_host *mmc, u32 opcode)
> +{
> + int err = sdhci_execute_tuning(mmc, opcode);
> + struct sdhci_host *host = mmc_priv(mmc);
> +
> + if (err)
> + return err;
> +
> + /*
> + * Tuning can leave the IP in an active state (Buffer Read Enable bit
> + * set) which prevents the entry to low power states (i.e. S0i3). Data
> + * reset will clear it.
> + */
> + sdhci_reset(host, SDHCI_RESET_DATA);
> +
> + return 0;
> +}
> +
> +static u32 dwcmshc_cqe_irq_handler(struct sdhci_host *host, u32 intmask)
> +{
> + int cmd_error = 0;
> + int data_error = 0;
> +
> + if (!sdhci_cqe_irq(host, intmask, &cmd_error, &data_error))
> + return intmask;
> +
> + cqhci_irq(host->mmc, intmask, cmd_error, data_error);
> +
> + return 0;
> +}
> +
> +static void dwcmshc_sdhci_cqe_enable(struct mmc_host *mmc)
> +{
> + struct sdhci_host *host = mmc_priv(mmc);
> + u8 ctrl;
> +
> + sdhci_writew(host, DWCMSHC_SDHCI_CQE_TRNS_MODE, SDHCI_TRANSFER_MODE);
> +
> + sdhci_cqe_enable(mmc);
> +
> + /*
> + * The "DesignWare Cores Mobile Storage Host Controller
> + * DWC_mshc / DWC_mshc_lite Databook" says:
> + * when Host Version 4 Enable" is 1 in Host Control 2 register,
> + * SDHCI_CTRL_ADMA32 bit means ADMA2 is selected.
> + * Selection of 32-bit/64-bit System Addressing:
> + * either 32-bit or 64-bit system addressing is selected by
> + * 64-bit Addressing bit in Host Control 2 register.
> + *
> + * On the other hand the "DesignWare Cores Mobile Storage Host
> + * Controller DWC_mshc / DWC_mshc_lite User Guide" says, that we have to
> + * set DMA_SEL to ADMA2 _only_ mode in the Host Control 2 register.
> + */
> + ctrl = sdhci_readb(host, SDHCI_HOST_CONTROL);
> + ctrl &= ~SDHCI_CTRL_DMA_MASK;
> + ctrl |= SDHCI_CTRL_ADMA32;
> + sdhci_writeb(host, ctrl, SDHCI_HOST_CONTROL);
> +}
> +
> +static void dwcmshc_set_tran_desc(struct cqhci_host *cq_host, u8 **desc,
> + dma_addr_t addr, int len, bool end, bool dma64)
> +{
> + int tmplen, offset;
> +
> + if (likely(!len || BOUNDARY_OK(addr, len))) {
> + cqhci_set_tran_desc(*desc, addr, len, end, dma64);
> + return;
> + }
> +
> + offset = addr & (SZ_128M - 1);
> + tmplen = SZ_128M - offset;
> + cqhci_set_tran_desc(*desc, addr, tmplen, false, dma64);
> +
> + addr += tmplen;
> + len -= tmplen;
> + *desc += cq_host->trans_desc_len;
> + cqhci_set_tran_desc(*desc, addr, len, end, dma64);
> +}
> +
> +static void dwcmshc_cqhci_dumpregs(struct mmc_host *mmc)
> +{
> + sdhci_dumpregs(mmc_priv(mmc));
> +}
> +
> static void dwcmshc_rk3568_set_clock(struct sdhci_host *host, unsigned int clock)
> {
> struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
> @@ -649,6 +743,7 @@ static const struct sdhci_ops sdhci_dwcmshc_ops = {
> .get_max_clock = dwcmshc_get_max_clock,
> .reset = sdhci_reset,
> .adma_write_desc = dwcmshc_adma_write_desc,
> + .irq = dwcmshc_cqe_irq_handler,
> };
>
> static const struct sdhci_ops sdhci_dwcmshc_rk35xx_ops = {
> @@ -700,6 +795,73 @@ static const struct sdhci_pltfm_data sdhci_dwcmshc_th1520_pdata = {
> .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN,
> };
>
> +static const struct cqhci_host_ops dwcmshc_cqhci_ops = {
> + .enable = dwcmshc_sdhci_cqe_enable,
> + .disable = sdhci_cqe_disable,
> + .dumpregs = dwcmshc_cqhci_dumpregs,
> + .set_tran_desc = dwcmshc_set_tran_desc,
> +};
> +
> +static void dwcmshc_cqhci_init(struct sdhci_host *host, struct platform_device *pdev)
> +{
> + struct cqhci_host *cq_host;
> + struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
> + struct dwcmshc_priv *priv = sdhci_pltfm_priv(pltfm_host);
> + bool dma64 = false;
> + u16 clk;
> + int err;
> +
> + host->mmc->caps2 |= MMC_CAP2_CQE | MMC_CAP2_CQE_DCMD;
> + cq_host = devm_kzalloc(&pdev->dev, sizeof(*cq_host), GFP_KERNEL);
> + if (!cq_host) {
> + dev_err(mmc_dev(host->mmc), "Unable to setup CQE: not enough memory\n");
> + goto dsbl_cqe_caps;
> + }
> +
> + /*
> + * For dwcmshc host controller we have to enable internal clock
> + * before access to some registers from Vendor Specific Area 2.
> + */
> + clk = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
> + clk |= SDHCI_CLOCK_INT_EN;
> + sdhci_writew(host, clk, SDHCI_CLOCK_CONTROL);
> + clk = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
> + if (!(clk & SDHCI_CLOCK_INT_EN)) {
> + dev_err(mmc_dev(host->mmc), "Unable to setup CQE: internal clock enable error\n");
> + goto free_cq_host;
> + }
> +
> + cq_host->mmio = host->ioaddr + priv->vendor_specific_area2;
> + cq_host->ops = &dwcmshc_cqhci_ops;
> +
> + /* Enable using of 128-bit task descriptors */
> + dma64 = host->flags & SDHCI_USE_64_BIT_DMA;
> + if (dma64) {
> + dev_dbg(mmc_dev(host->mmc), "128-bit task descriptors\n");
> + cq_host->caps |= CQHCI_TASK_DESC_SZ_128;
> + }
> + err = cqhci_init(cq_host, host->mmc, dma64);
> + if (err) {
> + dev_err(mmc_dev(host->mmc), "Unable to setup CQE: error %d\n", err);
> + goto int_clock_disable;
> + }
> +
> + dev_dbg(mmc_dev(host->mmc), "CQE init done\n");
> +
> + return;
> +
> +int_clock_disable:
> + clk = sdhci_readw(host, SDHCI_CLOCK_CONTROL);
> + clk &= ~SDHCI_CLOCK_INT_EN;
> + sdhci_writew(host, clk, SDHCI_CLOCK_CONTROL);
> +
> +free_cq_host:
> + devm_kfree(&pdev->dev, cq_host);
> +
> +dsbl_cqe_caps:
> + host->mmc->caps2 &= ~(MMC_CAP2_CQE | MMC_CAP2_CQE_DCMD);
> +}
> +
> static int dwcmshc_rk35xx_init(struct sdhci_host *host, struct dwcmshc_priv *dwc_priv)
> {
> int err;
> @@ -796,7 +958,7 @@ static int dwcmshc_probe(struct platform_device *pdev)
> struct rk35xx_priv *rk_priv = NULL;
> const struct sdhci_pltfm_data *pltfm_data;
> int err;
> - u32 extra;
> + u32 extra, caps;
>
> pltfm_data = device_get_match_data(&pdev->dev);
> if (!pltfm_data) {
> @@ -847,6 +1009,7 @@ static int dwcmshc_probe(struct platform_device *pdev)
>
> host->mmc_host_ops.request = dwcmshc_request;
> host->mmc_host_ops.hs400_enhanced_strobe = dwcmshc_hs400_enhanced_strobe;
> + host->mmc_host_ops.execute_tuning = dwcmshc_execute_tuning;
>
> if (pltfm_data == &sdhci_dwcmshc_rk35xx_pdata) {
> rk_priv = devm_kzalloc(&pdev->dev, sizeof(struct rk35xx_priv), GFP_KERNEL);
> @@ -896,6 +1059,10 @@ static int dwcmshc_probe(struct platform_device *pdev)
> sdhci_enable_v4_mode(host);
> #endif
>
> + caps = sdhci_readl(host, SDHCI_CAPABILITIES);
> + if (caps & SDHCI_CAN_64BIT_V4)
> + sdhci_enable_v4_mode(host);
> +
> host->mmc->caps |= MMC_CAP_WAIT_WHILE_BUSY;
>
> pm_runtime_get_noresume(dev);
> @@ -906,6 +1073,14 @@ static int dwcmshc_probe(struct platform_device *pdev)
> if (err)
> goto err_rpm;
>
> + /* Setup Command Queue Engine if enabled */
> + if (device_property_read_bool(&pdev->dev, "supports-cqe")) {
> + priv->vendor_specific_area2 =
> + sdhci_readw(host, DWCMSHC_P_VENDOR_AREA2);
> +
> + dwcmshc_cqhci_init(host, pdev);
> + }
> +
> if (rk_priv)
> dwcmshc_rk35xx_postinit(host, priv);
>
> @@ -961,6 +1136,12 @@ static int dwcmshc_suspend(struct device *dev)
>
> pm_runtime_resume(dev);
>
> + if (host->mmc->caps2 & MMC_CAP2_CQE) {
> + ret = cqhci_suspend(host->mmc);
> + if (ret)
> + return ret;
> + }
> +
> ret = sdhci_suspend_host(host);
> if (ret)
> return ret;
> @@ -1005,6 +1186,12 @@ static int dwcmshc_resume(struct device *dev)
> if (ret)
> goto disable_rockchip_clks;
>
> + if (host->mmc->caps2 & MMC_CAP2_CQE) {
> + ret = cqhci_resume(host->mmc);
> + if (ret)
> + goto disable_rockchip_clks;
> + }
> +
> return 0;
>
> disable_rockchip_clks:


2024-03-20 10:36:59

by Maksim Kiselev

Subject: Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support

Hi Sergey, Adrian!

First of all I want to thank Sergey for supporting the CQE feature
on the DWC MSHC controller.

I tested this series on the LicheePi 4A board (TH1520 SoC).
It has the DWC MSHC IP too and according to the T-Head datasheet
it also supports the CQE feature.

> Supports Command Queuing Engine (CQE) and compliant with eMMC CQ HCI.

So, to enable CQE on the LicheePi 4A, one needs to set a property in the
DT and add an IRQ handler to th1520_ops:
> .irq = dwcmshc_cqe_irq_handler,

Then CQE will work on the TH1520 SoC too.

But when I enabled CQE, I ran into a strange effect.

The fio benchmark shows that the eMMC is ~2.5 times slower with CQE enabled:
219MB/s w/o CQE vs 87.4MB/s w/ CQE. I'll put the logs below.

I would appreciate it if you could point me to where to look for
the bottleneck.
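
As a rough cross-check of the numbers in the logs below: fio reports that
the queue depth is capped at 1 with the sync engine, so each request has
to complete before the next one is issued, and throughput should be
approximately the block size divided by the average latency. The small C
sketch below does that arithmetic. The helper names are hypothetical, and
the average "lat" values from the two runs (4779.93 usec without CQE,
11983.00 usec with CQE) are meant to be passed in by the caller.

```c
/*
 * Estimated throughput in MiB/s at queue depth 1: one block of
 * block_mib MiB is completed every avg_lat_usec microseconds.
 */
static double qd1_mib_per_sec(double avg_lat_usec, double block_mib)
{
	return block_mib * 1e6 / avg_lat_usec;
}

/* Slowdown factor of run B relative to run A, from average latencies. */
static double qd1_slowdown(double lat_a_usec, double lat_b_usec)
{
	return lat_b_usec / lat_a_usec;
}
```

With 1 MiB blocks these estimates land close to the bandwidth figures fio
prints for both runs, which suggests the difference comes from per-request
latency rather than from anything queue-related.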

Without CQE:

# cat /sys/kernel/debug/mmc0/ios
clock: 198000000 Hz
actual clock: 198000000 Hz
vdd: 21 (3.3 ~ 3.4 V)
bus mode: 2 (push-pull)
chip select: 0 (don't care)
power mode: 2 (on)
bus width: 3 (8 bits)
timing spec: 10 (mmc HS400 enhanced strobe)
signal voltage: 1 (1.80 V)
driver type: 0 (driver type B)

# fio --filename=/dev/mmcblk0 --direct=1 --rw=randread --bs=1M
--ioengine=sync --iodepth=256 --size=4G --numjobs=1 --group_reporting
--name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
fio-3.34
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue
depth will be capped at 1
Jobs: 1 (f=1): [r(1)][15.0%][r=209MiB/s][r=209 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [r(1)][25.0%][r=208MiB/s][r=208 IOPS][eta 00m:15s]
Jobs: 1 (f=1): [r(1)][35.0%][r=207MiB/s][r=207 IOPS][eta 00m:13s]
Jobs: 1 (f=1): [r(1)][47.4%][r=208MiB/s][r=208 IOPS][eta 00m:10s]
Jobs: 1 (f=1): [r(1)][52.6%][r=209MiB/s][r=208 IOPS][eta 00m:09s]
Jobs: 1 (f=1): [r(1)][63.2%][r=208MiB/s][r=208 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [r(1)][68.4%][r=208MiB/s][r=207 IOPS][eta 00m:06s]
Jobs: 1 (f=1): [r(1)][78.9%][r=207MiB/s][r=207 IOPS][eta 00m:04s]
Jobs: 1 (f=1): [r(1)][89.5%][r=209MiB/s][r=209 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [r(1)][100.0%][r=209MiB/s][r=209 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=1): err= 0: pid=132: Thu Jan 1 00:03:44 1970
read: IOPS=208, BW=208MiB/s (219MB/s)(4096MiB/19652msec)
clat (usec): min=3882, max=11557, avg=4778.37, stdev=238.26
lat (usec): min=3883, max=11559, avg=4779.93, stdev=238.26
clat percentiles (usec):
| 1.00th=[ 4359], 5.00th=[ 4555], 10.00th=[ 4555], 20.00th=[ 4621],
| 30.00th=[ 4621], 40.00th=[ 4686], 50.00th=[ 4752], 60.00th=[ 4817],
| 70.00th=[ 4883], 80.00th=[ 4948], 90.00th=[ 5014], 95.00th=[ 5145],
| 99.00th=[ 5473], 99.50th=[ 5538], 99.90th=[ 5932], 99.95th=[ 6915],
| 99.99th=[11600]
bw ( KiB/s): min=208896, max=219136, per=100.00%, avg=213630.77,
stdev=1577.33, samples=39
iops : min= 204, max= 214, avg=208.56, stdev= 1.55, samples=39
lat (msec) : 4=0.39%, 10=99.58%, 20=0.02%
cpu : usr=0.38%, sys=13.04%, ctx=4132, majf=0, minf=275
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=208MiB/s (219MB/s), 208MiB/s-208MiB/s (219MB/s-219MB/s),
io=4096MiB (4295MB), run=19652-19652msec

Disk stats (read/write):
mmcblk0: ios=8181/0, merge=0/0, ticks=25682/0, in_queue=25682, util=99.66%


With CQE:

fio --filename=/dev/mmcblk1 --direct=1 --rw=randread --bs=1M
--ioengine=sync --iodepth=256 --size=4G --numjobs=1 --group_reporting
--name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
fio-3.34
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue
depth will be capped at 1
Jobs: 1 (f=1): [r(1)][5.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:49s]
Jobs: 1 (f=1): [r(1)][10.0%][r=84.0MiB/s][r=84 IOPS][eta 00m:45s]
Jobs: 1 (f=1): [r(1)][14.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:43s]
Jobs: 1 (f=1): [r(1)][18.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [r(1)][22.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:38s]
Jobs: 1 (f=1): [r(1)][26.5%][r=83.1MiB/s][r=83 IOPS][eta 00m:36s]
Jobs: 1 (f=1): [r(1)][30.6%][r=83.1MiB/s][r=83 IOPS][eta 00m:34s]
Jobs: 1 (f=1): [r(1)][34.7%][r=84.1MiB/s][r=84 IOPS][eta 00m:32s]
Jobs: 1 (f=1): [r(1)][38.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:30s]
Jobs: 1 (f=1): [r(1)][42.9%][r=83.1MiB/s][r=83 IOPS][eta 00m:28s]
Jobs: 1 (f=1): [r(1)][46.9%][r=84.1MiB/s][r=84 IOPS][eta 00m:26s]
Jobs: 1 (f=1): [r(1)][51.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:24s]
Jobs: 1 (f=1): [r(1)][55.1%][r=83.0MiB/s][r=83 IOPS][eta 00m:22s]
Jobs: 1 (f=1): [r(1)][59.2%][r=84.1MiB/s][r=84 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [r(1)][63.3%][r=83.0MiB/s][r=83 IOPS][eta 00m:18s]
Jobs: 1 (f=1): [r(1)][67.3%][r=83.1MiB/s][r=83 IOPS][eta 00m:16s]
Jobs: 1 (f=1): [r(1)][71.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:14s]
Jobs: 1 (f=1): [r(1)][75.5%][r=83.0MiB/s][r=83 IOPS][eta 00m:12s]
Jobs: 1 (f=1): [r(1)][79.6%][r=83.0MiB/s][r=83 IOPS][eta 00m:10s]
Jobs: 1 (f=1): [r(1)][83.7%][r=84.0MiB/s][r=84 IOPS][eta 00m:08s]
Jobs: 1 (f=1): [r(1)][87.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:06s]
Jobs: 1 (f=1): [r(1)][91.8%][r=83.0MiB/s][r=83 IOPS][eta 00m:04s]
Jobs: 1 (f=1): [r(1)][95.9%][r=84.0MiB/s][r=84 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [r(1)][100.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=1): err= 0: pid=134: Thu Jan 1 00:02:19 1970
read: IOPS=83, BW=83.3MiB/s (87.4MB/s)(4096MiB/49154msec)
clat (usec): min=11885, max=14840, avg=11981.37, stdev=61.89
lat (usec): min=11887, max=14843, avg=11983.00, stdev=61.92
clat percentiles (usec):
| 1.00th=[11863], 5.00th=[11994], 10.00th=[11994], 20.00th=[11994],
| 30.00th=[11994], 40.00th=[11994], 50.00th=[11994], 60.00th=[11994],
| 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
| 99.00th=[12125], 99.50th=[12256], 99.90th=[12387], 99.95th=[12387],
| 99.99th=[14877]
bw ( KiB/s): min=83800, max=86016, per=100.00%, avg=85430.61,
stdev=894.16, samples=98
iops : min= 81, max= 84, avg=83.22, stdev= 0.89, samples=98
lat (msec) : 20=100.00%
cpu : usr=0.00%, sys=5.44%, ctx=4097, majf=0, minf=274
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=83.3MiB/s (87.4MB/s), 83.3MiB/s-83.3MiB/s (87.4MB/s-87.4MB/s),
io=4096MiB (4295MB), run=49154-49154msec

Disk stats (read/write):
mmcblk1: ios=8181/0, merge=0/0, ticks=69682/0, in_queue=69682, util=99.96%


Best regards,
Maksim

2024-03-21 06:40:35

by Adrian Hunter

Subject: Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support

On 20/03/24 12:36, Maxim Kiselev wrote:
> Subject: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
>
> Hi Sergey, Adrian!
>
> First of all I want to thank Sergey for supporting the CQE feature
> on the DWC MSHC controller.
>
> I tested this series on the LicheePi 4A board (TH1520 SoC).
> It has the DWC MSHC IP too and according to the T-Head datasheet
> it also supports the CQE feature.
>
>> Supports Command Queuing Engine (CQE) and compliant with eMMC CQ HCI.
>
> So, to enable CQE on LicheePi 4A need to set a prop in DT
> and add a IRQ handler to th1520_ops:
>> .irq = dwcmshc_cqe_irq_handler,
>
> And the CQE will work for th1520 SoC too.
>
> But, when I enabled the CQE, I was faced with a strange effect.
>
> The fio benchmark shows that emmc works ~2.5 slower with enabled CQE.
> 219MB/s w/o CQE vs 87.4MB/s w/ CQE. I'll put logs below.
>
> I would be very appreciative if you could point me where to look for
> the bottleneck.

Some things you could try:

Check for any related kernel messages.

Have a look at /sys/kernel/debug/mmc*/err_stats

See if disabling runtime PM for the host controller has any effect.

Enable mmc dynamic debug messages and see if anything looks different.
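
The err_stats check above can be scripted. What follows is only a minimal sketch: it assumes each err_stats line has the form "<description>: <count>", which is an assumption about the debugfs layout and is not taken from this thread, and nonzero_err_stats is a hypothetical helper name.

```python
import glob


def nonzero_err_stats(text):
    """Return {description: count} for every counter that is not zero.

    Assumes one "<description>: <count>" pair per line; lines that do
    not fit that shape are skipped rather than guessed at.
    """
    found = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            count = int(value.strip())
        except ValueError:
            continue
        if count != 0:
            found[name.strip("# \t")] = count
    return found


if __name__ == "__main__":
    # debugfs must be mounted and is normally readable by root only.
    for path in sorted(glob.glob("/sys/kernel/debug/mmc*/err_stats")):
        try:
            with open(path) as f:
                print(path, nonzero_err_stats(f.read()))
        except OSError as exc:
            print(path, "unreadable:", exc)
```

An empty dict for a host means no counter was raised during the run, which would point away from transfer errors and retries as the cause.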

>
> Without CQE:
>
> # cat /sys/kernel/debug/mmc0/ios
> clock: 198000000 Hz
> actual clock: 198000000 Hz
> vdd: 21 (3.3 ~ 3.4 V)
> bus mode: 2 (push-pull)
> chip select: 0 (don't care)
> power mode: 2 (on)
> bus width: 3 (8 bits)
> timing spec: 10 (mmc HS400 enhanced strobe)
> signal voltage: 1 (1.80 V)
> driver type: 0 (driver type B)
>
> # fio --filename=/dev/mmcblk0 --direct=1 --rw=randread --bs=1M
> --ioengine=sync --iodepth=256 --size=4G --numjobs=1 --group_reporting
> --name=iops-test-job --eta-newline=1 --readonly
> iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
> 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
> fio-3.34
> Starting 1 process
> note: both iodepth >= 1 and synchronous I/O engine are selected, queue
> depth will be capped at 1
> Jobs: 1 (f=1): [r(1)][15.0%][r=209MiB/s][r=209 IOPS][eta 00m:17s]
> Jobs: 1 (f=1): [r(1)][25.0%][r=208MiB/s][r=208 IOPS][eta 00m:15s]
> Jobs: 1 (f=1): [r(1)][35.0%][r=207MiB/s][r=207 IOPS][eta 00m:13s]
> Jobs: 1 (f=1): [r(1)][47.4%][r=208MiB/s][r=208 IOPS][eta 00m:10s]
> Jobs: 1 (f=1): [r(1)][52.6%][r=209MiB/s][r=208 IOPS][eta 00m:09s]
> Jobs: 1 (f=1): [r(1)][63.2%][r=208MiB/s][r=208 IOPS][eta 00m:07s]
> Jobs: 1 (f=1): [r(1)][68.4%][r=208MiB/s][r=207 IOPS][eta 00m:06s]
> Jobs: 1 (f=1): [r(1)][78.9%][r=207MiB/s][r=207 IOPS][eta 00m:04s]
> Jobs: 1 (f=1): [r(1)][89.5%][r=209MiB/s][r=209 IOPS][eta 00m:02s]
> Jobs: 1 (f=1): [r(1)][100.0%][r=209MiB/s][r=209 IOPS][eta 00m:00s]
> iops-test-job: (groupid=0, jobs=1): err= 0: pid=132: Thu Jan 1 00:03:44 1970
> read: IOPS=208, BW=208MiB/s (219MB/s)(4096MiB/19652msec)
> clat (usec): min=3882, max=11557, avg=4778.37, stdev=238.26
> lat (usec): min=3883, max=11559, avg=4779.93, stdev=238.26
> clat percentiles (usec):
> | 1.00th=[ 4359], 5.00th=[ 4555], 10.00th=[ 4555], 20.00th=[ 4621],
> | 30.00th=[ 4621], 40.00th=[ 4686], 50.00th=[ 4752], 60.00th=[ 4817],
> | 70.00th=[ 4883], 80.00th=[ 4948], 90.00th=[ 5014], 95.00th=[ 5145],
> | 99.00th=[ 5473], 99.50th=[ 5538], 99.90th=[ 5932], 99.95th=[ 6915],
> | 99.99th=[11600]
> bw ( KiB/s): min=208896, max=219136, per=100.00%, avg=213630.77,
> stdev=1577.33, samples=39
> iops : min= 204, max= 214, avg=208.56, stdev= 1.55, samples=39
> lat (msec) : 4=0.39%, 10=99.58%, 20=0.02%
> cpu : usr=0.38%, sys=13.04%, ctx=4132, majf=0, minf=275
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=256
>
> Run status group 0 (all jobs):
> READ: bw=208MiB/s (219MB/s), 208MiB/s-208MiB/s (219MB/s-219MB/s),
> io=4096MiB (4295MB), run=19652-19652msec
>
> Disk stats (read/write):
> mmcblk0: ios=8181/0, merge=0/0, ticks=25682/0, in_queue=25682, util=99.66%
>
>
> With CQE:

Was the output of "cat /sys/kernel/debug/mmc0/ios" the same with CQE enabled?
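
One way to answer that is to save the ios dump from each boot and compare them field by field. A minimal sketch, assuming the "key: value" layout shown in the dump quoted above; parse_ios and diff_ios are hypothetical helper names, and the two file paths are whatever the dumps were saved as.

```python
import sys


def parse_ios(text):
    """Parse the "key: value" lines of an ios dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_ios(before, after):
    """Return {key: (before, after)} for every field that differs."""
    a, b = parse_ios(before), parse_ios(after)
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Usage: script.py ios_without_cqe.txt ios_with_cqe.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
            for key, (old, new) in diff_ios(f1.read(), f2.read()).items():
                print(f"{key}: {old!r} -> {new!r}")
```

No output means the two dumps agree, so clock, bus width and timing spec were the same in both runs.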

>
> fio --filename=/dev/mmcblk1 --direct=1 --rw=randread --bs=1M
> --ioengine=sync --iodepth=256 --size=4G --numjobs=1 --group_reporting
> --name=iops-test-job --eta-newline=1 --readonly
> iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
> 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
> fio-3.34
> Starting 1 process
> note: both iodepth >= 1 and synchronous I/O engine are selected, queue
> depth will be capped at 1
> Jobs: 1 (f=1): [r(1)][5.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:49s]
> Jobs: 1 (f=1): [r(1)][10.0%][r=84.0MiB/s][r=84 IOPS][eta 00m:45s]
> Jobs: 1 (f=1): [r(1)][14.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:43s]
> Jobs: 1 (f=1): [r(1)][18.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:41s]
> Jobs: 1 (f=1): [r(1)][22.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:38s]
> Jobs: 1 (f=1): [r(1)][26.5%][r=83.1MiB/s][r=83 IOPS][eta 00m:36s]
> Jobs: 1 (f=1): [r(1)][30.6%][r=83.1MiB/s][r=83 IOPS][eta 00m:34s]
> Jobs: 1 (f=1): [r(1)][34.7%][r=84.1MiB/s][r=84 IOPS][eta 00m:32s]
> Jobs: 1 (f=1): [r(1)][38.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:30s]
> Jobs: 1 (f=1): [r(1)][42.9%][r=83.1MiB/s][r=83 IOPS][eta 00m:28s]
> Jobs: 1 (f=1): [r(1)][46.9%][r=84.1MiB/s][r=84 IOPS][eta 00m:26s]
> Jobs: 1 (f=1): [r(1)][51.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:24s]
> Jobs: 1 (f=1): [r(1)][55.1%][r=83.0MiB/s][r=83 IOPS][eta 00m:22s]
> Jobs: 1 (f=1): [r(1)][59.2%][r=84.1MiB/s][r=84 IOPS][eta 00m:20s]
> Jobs: 1 (f=1): [r(1)][63.3%][r=83.0MiB/s][r=83 IOPS][eta 00m:18s]
> Jobs: 1 (f=1): [r(1)][67.3%][r=83.1MiB/s][r=83 IOPS][eta 00m:16s]
> Jobs: 1 (f=1): [r(1)][71.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:14s]
> Jobs: 1 (f=1): [r(1)][75.5%][r=83.0MiB/s][r=83 IOPS][eta 00m:12s]
> Jobs: 1 (f=1): [r(1)][79.6%][r=83.0MiB/s][r=83 IOPS][eta 00m:10s]
> Jobs: 1 (f=1): [r(1)][83.7%][r=84.0MiB/s][r=84 IOPS][eta 00m:08s]
> Jobs: 1 (f=1): [r(1)][87.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:06s]
> Jobs: 1 (f=1): [r(1)][91.8%][r=83.0MiB/s][r=83 IOPS][eta 00m:04s]
> Jobs: 1 (f=1): [r(1)][95.9%][r=84.0MiB/s][r=84 IOPS][eta 00m:02s]
> Jobs: 1 (f=1): [r(1)][100.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:00s]
> iops-test-job: (groupid=0, jobs=1): err= 0: pid=134: Thu Jan 1 00:02:19 1970
> read: IOPS=83, BW=83.3MiB/s (87.4MB/s)(4096MiB/49154msec)
> clat (usec): min=11885, max=14840, avg=11981.37, stdev=61.89
> lat (usec): min=11887, max=14843, avg=11983.00, stdev=61.92
> clat percentiles (usec):
> | 1.00th=[11863], 5.00th=[11994], 10.00th=[11994], 20.00th=[11994],
> | 30.00th=[11994], 40.00th=[11994], 50.00th=[11994], 60.00th=[11994],
> | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
> | 99.00th=[12125], 99.50th=[12256], 99.90th=[12387], 99.95th=[12387],
> | 99.99th=[14877]
> bw ( KiB/s): min=83800, max=86016, per=100.00%, avg=85430.61,
> stdev=894.16, samples=98
> iops : min= 81, max= 84, avg=83.22, stdev= 0.89, samples=98
> lat (msec) : 20=100.00%
> cpu : usr=0.00%, sys=5.44%, ctx=4097, majf=0, minf=274
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=256
>
> Run status group 0 (all jobs):
> READ: bw=83.3MiB/s (87.4MB/s), 83.3MiB/s-83.3MiB/s (87.4MB/s-87.4MB/s),
> io=4096MiB (4295MB), run=49154-49154msec
>
> Disk stats (read/write):
> mmcblk1: ios=8181/0, merge=0/0, ticks=69682/0, in_queue=69682, util=99.96%
>
>
> Best regards,
> Maksim


2024-03-22 14:07:57

by Christian Loehle

Subject: Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support

On 20/03/2024 10:36, Maxim Kiselev wrote:
> Subject: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
>
> Hi Sergey, Adrian!
>
> First of all I want to thank Sergey for supporting the CQE feature
> on the DWC MSHC controller.
>
> I tested this series on the LicheePi 4A board (TH1520 SoC).
> It has the DWC MSHC IP too and according to the T-Head datasheet
> it also supports the CQE feature.
>
>> Supports Command Queuing Engine (CQE) and compliant with eMMC CQ HCI.
>
> So, to enable CQE on LicheePi 4A need to set a prop in DT
> and add a IRQ handler to th1520_ops:
>> .irq = dwcmshc_cqe_irq_handler,
>
> And the CQE will work for th1520 SoC too.
>
> But, when I enabled the CQE, I was faced with a strange effect.
>
> The fio benchmark shows that emmc works ~2.5 slower with enabled CQE.
> 219MB/s w/o CQE vs 87.4MB/s w/ CQE. I'll put logs below.
>
> I would be very appreciative if you could point me where to look for
> the bottleneck.
>
> Without CQE:

I would also suspect a bus issue here. Reading out ios or ext_csd after
enabling CQE could be helpful.
OTOH the CQE could just be limiting the frequency, which you wouldn't be
able to see without a scope. Does the TRM say anything about that?

Are you limited to <100MB/s with CQE for HS400(non-ES) and HS200, too?

What about sequential reads with a smaller bs, e.g. 256K?

FWIW your fio call can at best be on par with non-CQE performance, as you
have just one IO in flight, i.e. no CQE performance improvement is
possible; see the warning in your log:

> both iodepth >= 1 and synchronous I/O engine are selected, queue
> depth will be capped at 1
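
With the queue depth capped at 1, throughput is just block size divided by per-request latency, so the slowdown is entirely a latency increase. A small sketch of that arithmetic using the mean "lat" values from the two fio logs in this thread; the helper names are hypothetical.

```python
MIB = 1024 * 1024


def qd1_iops(mean_lat_usec):
    """IOPS when exactly one request is in flight at a time."""
    return 1_000_000 / mean_lat_usec


def qd1_bandwidth_mib_s(block_bytes, mean_lat_usec):
    """Bandwidth in MiB/s at queue depth 1 for a given block size."""
    return qd1_iops(mean_lat_usec) * block_bytes / MIB


if __name__ == "__main__":
    # Mean "lat (usec)" values copied from the fio logs, bs=1M.
    runs = (("without CQE", 4779.93), ("with CQE", 11983.00))
    for label, lat in runs:
        bw = qd1_bandwidth_mib_s(MIB, lat)
        print(f"{label}: {qd1_iops(lat):.1f} IOPS, {bw:.1f} MiB/s")
    print(f"latency ratio: {runs[1][1] / runs[0][1]:.2f}")
```

The results line up with the IOPS and bandwidth fio reported for both runs, and the latency ratio matches the ~2.5x slowdown, so the thing to explain is why a single 1M read takes about 12 ms with CQE instead of about 4.8 ms.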

Kind Regards,
Christian

2024-03-25 16:07:17

by Ulf Hansson

Subject: Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support

On Tue, 19 Mar 2024 at 12:59, Sergey Khimich <[email protected]> wrote:
>
> Hello!
>
> This is implementation of SDHCI CQE support for sdhci-of-dwcmshc driver.
> For enabling CQE support just set 'supports-cqe' in your DevTree file
> for appropriate mmc node.
>
> Also, while implementing CQE support for the driver, I faced with a problem
> which I will describe below.
> According to the IP block documentation CQE works only with "AMDA-2 only"
> mode which is activated only with v4 mode enabled. I see in dwcmshc_probe()
> function that v4 mode gets enabled only for 'sdhci_dwcmshc_bf3_pdata'
> platform data.
>
> So my question is: is it correct to enable v4 mode for all platform data
> if 'SDHCI_CAN_64BIT_V4' bit is set in hw?
>
> Because I`m afraid that enabling v4 mode for some platforms could break
> them down. On the other hand, if host controller says that it can do v4
> (caps & SDHCI_CAN_64BIT_V4), lets do v4 or disable it manualy by some
> quirk. Anyway - RFC.
>
>
> v2:
> - Added dwcmshc specific cqe_disable hook to prevent losing
> in-flight cmd when an ioctl is issued and cqe_disable is called;
>
> - Added processing 128Mb boundary for the host memory data buffer size
> and the data buffer. For implementing this processing an extra
> callback is added to the struct 'sdhci_ops'.
>
> - Fixed typo.
>
> v3:
> - Fix warning reported by kernel test robot:
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> v4:
> - Data reset moved to custom driver tuning hook.
> - Removed unnecessary dwcmshc_sdhci_cqe_disable() func
> - Removed unnecessary dwcmshc_cqhci_set_tran_desc. Export and use
> cqhci_set_tran_desc() instead.
> - Provide a hook for cqhci_set_tran_desc() instead of cqhci_prep_tran_desc().
> - Fix typo: int_clok_disable --> int_clock_disable
>
> v5:
> - Fix warning reported by kernel test robot:
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> v6:
> - Rebase to master branch
> - Fix typo;
> - Fix double blank line;
> - Add cqhci_suspend() and cqhci_resume() functions
> to support mmc suspend-to-ram (s2r);
> - Move reading DWCMSHC_P_VENDOR_AREA2 register under "supports-cqe"
> condition as not all IPs have that register;
> - Remove sdhci V4 mode from the list of prerequisites to init cqhci.
>
> v7:
> - Add disabling MMC_CAP2_CQE and MMC_CAP2_CQE_DCMD caps
> in case of CQE init fails to prevent problems in suspend/resume
> functions.
>
> Sergey Khimich (2):
> mmc: cqhci: Add cqhci set_tran_desc() callback
> mmc: sdhci-of-dwcmshc: Implement SDHCI CQE support
>
> drivers/mmc/host/Kconfig | 1 +
> drivers/mmc/host/cqhci-core.c | 11 +-
> drivers/mmc/host/cqhci.h | 4 +
> drivers/mmc/host/sdhci-of-dwcmshc.c | 191 +++++++++++++++++++++++++++-
> 4 files changed, 202 insertions(+), 5 deletions(-)
>

Applied for next, fixing a minor conflict along the way, thanks!

Kind regards
Uffe