2019-03-20 17:50:17

by Alexander Kochetkov

[permalink] [raw]
Subject: [PATCH 0/2] Fix eMMC hang on rk3188 and earlier

Hello!

I found that the dw_mmc driver sometimes stops transferring data to the
eMMC card on my rk3188-based board. One of the transfers hangs while
doing an EDMA transfer and the controller reports HTO. Here is a fix.

Alexander Kochetkov (2):
mmc: dw_mmc: add init_slot() hook to platform function table
mmc: dw_mmc-rockchip: fix transfer hangs on rk3188

drivers/mmc/host/dw_mmc-rockchip.c | 19 +++++++++++++++++++
drivers/mmc/host/dw_mmc.c | 4 ++++
drivers/mmc/host/dw_mmc.h | 2 ++
3 files changed, 25 insertions(+)

--
1.7.9.5



2019-03-20 17:49:21

by Alexander Kochetkov

[permalink] [raw]
Subject: [PATCH 1/2] mmc: dw_mmc: add init_slot() hook to platform function table

The init_slot() hook allows a platform driver to override the slot defaults
provided by the generic dw_mmc driver. It is required to fix the EDMA-based
transfer hangs observed on Rockchip rk3188.

Signed-off-by: Alexander Kochetkov <[email protected]>
---
drivers/mmc/host/dw_mmc.c | 4 ++++
drivers/mmc/host/dw_mmc.h | 2 ++
2 files changed, 6 insertions(+)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 80dc2fd..d3ecee9 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2819,6 +2819,7 @@ static int dw_mci_init_slot_caps(struct dw_mci_slot *slot)

static int dw_mci_init_slot(struct dw_mci *host)
{
+	const struct dw_mci_drv_data *drv_data = host->drv_data;
 	struct mmc_host *mmc;
 	struct dw_mci_slot *slot;
 	int ret;
@@ -2876,6 +2877,9 @@ static int dw_mci_init_slot(struct dw_mci *host)
 		mmc->max_seg_size = mmc->max_req_size;
 	}

+	if (drv_data && drv_data->init_slot)
+		drv_data->init_slot(host);
+
 	dw_mci_get_cd(mmc);

 	ret = mmc_add_host(mmc);
diff --git a/drivers/mmc/host/dw_mmc.h b/drivers/mmc/host/dw_mmc.h
index 46e9f8e..de51c59 100644
--- a/drivers/mmc/host/dw_mmc.h
+++ b/drivers/mmc/host/dw_mmc.h
@@ -548,6 +548,7 @@ struct dw_mci_slot {
  * @caps: mmc subsystem specified capabilities of the controller(s).
  * @num_caps: number of capabilities specified by @caps.
  * @init: early implementation specific initialization.
+ * @init_slot: platform specific slot initialization.
  * @set_ios: handle bus specific extensions.
  * @parse_dt: parse implementation specific device tree properties.
  * @execute_tuning: implementation specific tuning procedure.
@@ -560,6 +561,7 @@ struct dw_mci_drv_data {
 	unsigned long *caps;
 	u32 num_caps;
 	int (*init)(struct dw_mci *host);
+	void (*init_slot)(struct dw_mci *host);
 	void (*set_ios)(struct dw_mci *host, struct mmc_ios *ios);
 	int (*parse_dt)(struct dw_mci *host);
 	int (*execute_tuning)(struct dw_mci_slot *slot, u32 opcode);
--
1.7.9.5
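
The optional-hook pattern that this patch adds can be illustrated outside the kernel. The following is only a minimal sketch: every `sketch_*` name is hypothetical and stands in for the much larger `struct dw_mci` and `struct dw_mci_drv_data`. It shows the generic code setting its defaults first and then calling the platform hook only if one is registered.

```c
#include <stddef.h>

struct sketch_host;

/* Hypothetical, simplified stand-in for struct dw_mci_drv_data. */
struct sketch_drv_data {
	int (*init)(struct sketch_host *host);
	void (*init_slot)(struct sketch_host *host);	/* optional, may be NULL */
};

/* Hypothetical, simplified stand-in for struct dw_mci and its slot. */
struct sketch_host {
	const struct sketch_drv_data *drv_data;
	unsigned int max_segs;
};

/* Generic code: set the defaults, then let the platform override them. */
static void sketch_init_slot(struct sketch_host *host)
{
	const struct sketch_drv_data *drv_data = host->drv_data;

	host->max_segs = 64;	/* generic default */

	/* Both checks are needed: a host may have no drv_data at all. */
	if (drv_data && drv_data->init_slot)
		drv_data->init_slot(host);
}

/* Platform code: override one default, in the spirit of the rk2928 hook. */
static void sketch_rk_init_slot(struct sketch_host *host)
{
	host->max_segs = 1;
}

static const struct sketch_drv_data sketch_rk_drv_data = {
	.init_slot = sketch_rk_init_slot,
};
```

A host whose `drv_data` is NULL, or whose table leaves `init_slot` unset, keeps the generic default of 64 segments. A host that points at `sketch_rk_drv_data` ends up with `max_segs == 1`.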


2019-03-20 17:49:41

by Alexander Kochetkov

[permalink] [raw]
Subject: [PATCH 2/2] mmc: dw_mmc-rockchip: fix transfer hangs on rk3188

I've found that dw_mmc on my rk3188-based board sometimes stops transferring
any data with the error:

kernel: dwmmc_rockchip 1021c000.dwmmc: Unexpected command timeout, state 3

Further digging into the problem showed that sometimes one of the EDMA-based
transfers hangs and aborts with an HTO error. I've made a test that 100%
reproduces the error. I found that setting the max_segs parameter to 1 fixes
the problem.

I guess the problem is hardware related and lies in the DMA controller
implementation for rk3188. It is probably related to the missing FLUSHP,
see commit 271e1b86e691 ("dmaengine: pl330: add quirk for broken no
flushp"). It is possible that pl330 and dw_mmc get out of sync when the
pl330 driver switches from one scatterlist to another. If we limit the
scatterlist size to 1, we avoid switching scatterlists and so avoid the
hardware problem. Setting max_segs to 1 tells the mmc core to use at most
one scatterlist per transfer.

I guess that all other rk3xxx chips that lack FLUSHP are also affected by
the problem, so I made the fix for all rk3xxx chips from rk2928 to rk3188.

Signed-off-by: Alexander Kochetkov <[email protected]>
---
drivers/mmc/host/dw_mmc-rockchip.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/drivers/mmc/host/dw_mmc-rockchip.c b/drivers/mmc/host/dw_mmc-rockchip.c
index 8c86a80..2eed922 100644
--- a/drivers/mmc/host/dw_mmc-rockchip.c
+++ b/drivers/mmc/host/dw_mmc-rockchip.c
@@ -292,6 +292,24 @@ static int dw_mci_rk3288_parse_dt(struct dw_mci *host)
 	return 0;
 }

+static void dw_mci_rk2928_init_slot(struct dw_mci *host)
+{
+	struct mmc_host *mmc = host->slot->mmc;
+
+	if (host->use_dma == TRANS_MODE_EDMAC) {
+		/*
+		 * Using max_segs > 1 leads to rare EDMA transfer hangs
+		 * resulting in HTO errors.
+		 */
+		mmc->max_segs = 1;
+		mmc->max_blk_size = 65535;
+		mmc->max_blk_count = 64 * 512;
+		mmc->max_req_size =
+				mmc->max_blk_size * mmc->max_blk_count;
+		mmc->max_seg_size = mmc->max_req_size;
+	}
+}
+
 static int dw_mci_rockchip_init(struct dw_mci *host)
 {
 	/* It is slot 8 on Rockchip SoCs */
@@ -314,6 +332,7 @@ static int dw_mci_rockchip_init(struct dw_mci *host)

 static const struct dw_mci_drv_data rk2928_drv_data = {
 	.init = dw_mci_rockchip_init,
+	.init_slot = dw_mci_rk2928_init_slot,
};

static const struct dw_mci_drv_data rk3288_drv_data = {
--
1.7.9.5


2019-03-21 02:32:28

by Shawn Lin

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmc: dw_mmc-rockchip: fix transfer hangs on rk3188 [Please note: this mail was sent on behalf of [email protected]]

+ Caesar Wang

On 2019/3/21 1:48, Alexander Kochetkov wrote:
> I've found that sometimes dw_mmc in my rk3188 based board stop transfer
> any data with error:
>
> kernel: dwmmc_rockchip 1021c000.dwmmc: Unexpected command timeout, state 3
>
> Further digging into problem showed that sometimes one of EDMA-based
> transfers hangs and abort with HTO error. I've made test, that 100%

I'm not sure what 100% means, but Caesar ran a QA test for RK3036 with
EDMA-based dwmmc on the vendor 4.4 kernel, and it did not seem to be a big
deal. The vendor 4.4 kernel didn't patch anything else w.r.t. the EDMA code,
but we did enhance the PL330 code and fix some bugs there, so you may want
to give it a try.

> reproduce the error. I found, that setting max_segs parameter to 1 fix
> the problem.
>
> I guess the problem is hardware related and relates to DMA controller
> implementation for rk3188. Probably it can relates to missed FLUSHP,
> see commit 271e1b86e691 ("dmaengine: pl330: add quirk for broken no
> flushp"). It is possible that pl330 and dw_mmc become out of sync then
> pl330 driver switch from one scatterlist to another. If we limit
> scatterlist size to 1, we can avoid switching scatterlists and avoid
> hardware problem. Setting max_segs to 1 tells mmc core to use maximum
> one scatterlist for one transfer.
>
> I guess that all other rk3xxx chips that lacks FLUSHP also affected by
> the problem. So I made fix for all rk3xxx chips from rk2928 to rk3188.

It's hard to find these ancient platforms to test, especially as some are EOL....

>
> Signed-off-by: Alexander Kochetkov <[email protected]>
> ---
> drivers/mmc/host/dw_mmc-rockchip.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/drivers/mmc/host/dw_mmc-rockchip.c b/drivers/mmc/host/dw_mmc-rockchip.c
> index 8c86a80..2eed922 100644
> --- a/drivers/mmc/host/dw_mmc-rockchip.c
> +++ b/drivers/mmc/host/dw_mmc-rockchip.c
> @@ -292,6 +292,24 @@ static int dw_mci_rk3288_parse_dt(struct dw_mci *host)
> return 0;
> }
>
> +static void dw_mci_rk2928_init_slot(struct dw_mci *host)
> +{
> + struct mmc_host *mmc = host->slot->mmc;
> +
> + if (host->use_dma == TRANS_MODE_EDMAC) {
> + /*
> + * Using max_segs > 1 leads to rare EDMA transfer hangs
> + * resulting in HTO errors.
> + */
> + mmc->max_segs = 1;
> + mmc->max_blk_size = 65535;
> + mmc->max_blk_count = 64 * 512;
> + mmc->max_req_size =
> + mmc->max_blk_size * mmc->max_blk_count;
> + mmc->max_seg_size = mmc->max_req_size;
> + }
> +}
> +
> static int dw_mci_rockchip_init(struct dw_mci *host)
> {
> /* It is slot 8 on Rockchip SoCs */
> @@ -314,6 +332,7 @@ static int dw_mci_rockchip_init(struct dw_mci *host)
>
> static const struct dw_mci_drv_data rk2928_drv_data = {
> .init = dw_mci_rockchip_init,
> + .init_slot = dw_mci_rk2928_init_slot,
> };
>
> static const struct dw_mci_drv_data rk3288_drv_data = {
>



2019-03-21 10:33:45

by Alexander Kochetkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmc: dw_mmc-rockchip: fix transfer hangs on rk3188 [Please note: this mail was sent on behalf of [email protected]]

Hello!

I forgot to mention that the transfer hangs happen only on mem-to-dev transfers
(DMA writes to the device) and never on dev-to-mem.

Yes, I know rk3188 and earlier are quite ancient, but we made custom hardware
based on rk3188 and some of our customers report problems.

For testing I use an rk3188-based custom board with eMMC (an rk3188 Radxa Rock
with SD can probably also be used for testing) with cpufreq enabled.

For testing I made a simple script that does the following in a loop:
1. Creates 6 new empty partitions using mkfs.ext3, about 1 GB in total.
2. Extracts a 100 MB archive of a Linux image to a 512 MB partition (about 400 MB extracted size).
3. Sleeps for a random time from 60 to 120 sec.

CPU load looks like this:
cpufreq stats: 312 MHz:32.63%, 504 MHz:0.00%, 600 MHz:0.00%, 816 MHz:0.38%, 1.01 GHz:29.83%, 1.20 GHz:0.38%, 1.42 GHz:0.00%, 1.61 GHz:36.79% (494481)

This test can run for 6 hours before a transfer hangs. I used 5 devices for testing.
Some devices may run the test for a long time, but some may fail within an hour.

I played with the CPU clock settings in u-boot and the mmc bus clock settings in the
dts file. I tried lowering the eMMC bus clock frequency to rule out PCB errors. I found
that some combinations of settings make my test run longer, but the test fails anyway.

I also found that making the following change to dw_mmc results in a high error count:

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 9c54d60..dcf7d36e 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2905,10 +2905,9 @@ static int dw_mci_init_slot(struct dw_mci *host)
 	} else if (host->use_dma == TRANS_MODE_EDMAC) {
 		mmc->max_segs = 64;
 		mmc->max_blk_size = 65535;
-		mmc->max_blk_count = 65535;
-		mmc->max_req_size =
-				mmc->max_blk_size * mmc->max_blk_count;
-		mmc->max_seg_size = mmc->max_req_size;
+		mmc->max_seg_size = 0x1000;
+		mmc->max_req_size = mmc->max_seg_size * mmc->max_segs;
+		mmc->max_blk_count = mmc->max_req_size / 512;
 	} else {
 		/* TRANS_MODE_PIO */
 		mmc->max_segs = 64;

With these settings the mmc core splits large transfers into multi-item scatterlists,
which increases the scatterlist switching rate inside pl330. So I assumed that the root
of the problem is that DMA goes out of sync with the device.
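
To make the arithmetic of the stress settings concrete, here is a small standalone sketch. The `sketch_*` names are hypothetical and not kernel API. It recomputes the limits from the diff above and gives a lower bound on how many segments a request needs when no segment may exceed `max_seg_size`.

```c
#include <stdint.h>

/* Hypothetical mirror of the four mmc_host limits touched by the diff. */
struct sketch_limits {
	uint32_t max_segs;
	uint32_t max_seg_size;
	uint32_t max_req_size;
	uint32_t max_blk_count;
};

/* Recompute the stress-test limits exactly as the diff derives them. */
static struct sketch_limits sketch_stress_limits(void)
{
	struct sketch_limits l;

	l.max_segs = 64;
	l.max_seg_size = 0x1000;			/* 4 KiB per segment */
	l.max_req_size = l.max_seg_size * l.max_segs;	/* 256 KiB per request */
	l.max_blk_count = l.max_req_size / 512;		/* in 512-byte blocks */
	return l;
}

/*
 * Lower bound on the number of segments needed for len bytes when no
 * segment may exceed max_seg_size (ceiling division). Real scatterlists
 * may use more segments if memory is fragmented.
 */
static uint64_t sketch_min_segments(uint64_t len, uint32_t max_seg_size)
{
	return (len + max_seg_size - 1) / max_seg_size;
}
```

Under these limits a full-size 256 KiB request needs at least 64 segments, where a segment size equal to the request size would need only one. That is consistent with the higher switching rate described above.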

For example, there is a patch in mainline Linux:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/dma/pl330.c?h=v5.0.3&id=1d48745b192a7a45bbdd3557b4c039609569ca41
It fixes a problem where EDMA can get out of sync with the device. But the patch doesn't
work for rk3188, because rk3188 has the PL330_QUIRK_BROKEN_NO_FLUSHP quirk.

I'll try to backport the EDMA driver from the vendor 4.4 kernel and report the test results.

It is safer to fix the problem by patching the dw_mmc code than the pl330 code, because
the patch changes the transfer parameters from the known-working values:

mmc->max_segs = 64;
mmc->max_blk_size = 65535;
mmc->max_blk_count = 65535;
mmc->max_req_size =
mmc->max_blk_size * mmc->max_blk_count;
mmc->max_seg_size = mmc->max_req_size;

to

mmc->max_segs = 1;
mmc->max_blk_size = 65535;
mmc->max_blk_count = 64 * 512;
mmc->max_req_size =
mmc->max_blk_size * mmc->max_blk_count;
mmc->max_seg_size = mmc->max_req_size;
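
As a quick check of the two parameter sets above, the sketch below computes `max_req_size` for each in 64-bit arithmetic and tests whether the product still fits a 32-bit unsigned field. The helper names are hypothetical, and the 32-bit field width is an assumption of this sketch.

```c
#include <stdint.h>

/* Product of block size and block count, widened to avoid overflow. */
static uint64_t sketch_max_req_size(uint32_t max_blk_size,
				    uint32_t max_blk_count)
{
	return (uint64_t)max_blk_size * max_blk_count;
}

/* Nonzero if the value can be stored in a 32-bit unsigned field. */
static int sketch_fits_u32(uint64_t value)
{
	return value <= UINT32_MAX;
}
```

With the old values (65535 * 65535) the product is 4294836225 bytes. With the new values (65535 * 64 * 512) it is 2147450880 bytes, roughly half. Both fit in 32 bits.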

