2016-03-17 12:12:30

by Enric Balletbo Serra

[permalink] [raw]
Subject: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

Dear all,

Seems the following thread[1] didn't go anywhere. I'd like to continue
the discussion and share some tests that I did regarding the issue
that the patch is trying to fix.

First I reproduced the issue on my rockchip board and I tested the
patch intensively, I can confirm that the patch made by Doug fixes the
issue.But, as reported by Alim, seems that this patch has the side
effect that breaks mmc on peach-pi board [2], specially on
suspend/resume, I ran lots of tests on peach-pi and, although is a bit
random, I can also confirm the breakage.

Looks like that on peach-pi, when the patch is applied the controller
moves into a data transfer and the interrupt does not come, then the
task blocks. The reason why I think the dw_mmc-rockchip driver works
is because it has the DW_MCI_QUIRK_BROKEN_DTO quirk [3].

So I did lots of tests on peach-pi with dto quirk, suspend/resume
started to work again. But I guess this is not the proper solution or
it is? Thoughts?

[1] https://lkml.org/lkml/2015/5/18/495
[2] https://lava.collabora.co.uk/scheduler/job/169384/log_file#L_195_5
[3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/host/dw_mmc-rockchip.c?id=57e104864bc4874a36796fd222d8d084dbf90b9b

Cheers,
Enric

>
> Alim,
>
> On Tue, May 26, 2015 at 11:02 AM, Alim Akhtar <[email protected]> wrote:
>> Hi Doug,
>> On peach-pi, I got a hung task once in 4 cold boot as [1].
>
> OK, I'll have to get my peach-pi or peach-pit up and running again. I
> ran out of desk space and I haven't gotten it set back up. :(
>
> I've been testing on an rk3288-based device. Past experience has
> taught me that the rk3288 dw_mmc works differently than the exynos
> one, so perhaps this is a difference.
>
> Could you possibly patch in something like
> <https://chromium-review.googlesource.com/#/c/244347/> and provide the
> console for the failure? I'll put it on my list to try this myself,
> too
>
>
>> I was checking on v4.1-rc5, git hash as below:
>>
>> 862e58a mmc: dw_mmc: Wait for data transfer after response errors
>> ba155e2 Linux 4.1-rc5
>> 5b13966
>>
>> Not sure if I missed any dependent patch??
>
> I'm currently testing out of tree, but my dw_mmc is very close to mainline.
>
>
>> Have not checked the dw TRM for this change, will do that as soon as I
>> get access to it.
>
> OK, sounds good. I have some old version of the DesignWare TRM, so
> possibly something is different in the newer one...
>
>
> -Doug
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2016-03-21 22:38:06

by Douglas Anderson

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

Enric,

On Thu, Mar 17, 2016 at 5:12 AM, Enric Balletbo Serra
<[email protected]> wrote:
> Dear all,
>
> Seems the following thread[1] didn't go anywhere. I'd like to continue
> the discussion and share some tests that I did regarding the issue
> that the patch is trying to fix.
>
> First I reproduced the issue on my rockchip board and I tested the
> patch intensively, I can confirm that the patch made by Doug fixes the
> issue.But, as reported by Alim, seems that this patch has the side
> effect that breaks mmc on peach-pi board [2], specially on
> suspend/resume, I ran lots of tests on peach-pi and, although is a bit
> random, I can also confirm the breakage.
>
> Looks like that on peach-pi, when the patch is applied the controller
> moves into a data transfer and the interrupt does not come, then the
> task blocks. The reason why I think the dw_mmc-rockchip driver works
> is because it has the DW_MCI_QUIRK_BROKEN_DTO quirk [3].
>
> So I did lots of tests on peach-pi with dto quirk, suspend/resume
> started to work again. But I guess this is not the proper solution or
> it is? Thoughts?
>
> [1] https://lkml.org/lkml/2015/5/18/495
> [2] https://lava.collabora.co.uk/scheduler/job/169384/log_file#L_195_5
> [3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/host/dw_mmc-rockchip.c?id=57e104864bc4874a36796fd222d8d084dbf90b9b

Ah, that would make some sense why things work OK on Rockchip. Adding
DW_MCI_QUIRK_BROKEN_DTO to peach probably doesn't make sense, then.
Hrm...

Since my original debugging of the issue was over a year ago, I think
I've almost totally lost context of any debugging I did on the issue,
so I'm not sure I'm going to be too much help in giving any details
other than what I put in the original commit message. From the
original message it appears that I thought we could solve this other
ways but just that my patch was easier than the alternative of
handling every error case. Maybe we just need to go back to the
drawing board and handle the error directly?

Also: my original commit message says "response error or response CRC
error". Do you happen to know which of these two we're hitting on
rk3288? If we limit the workaround to just one of these two cases
does peach pi still break?

Also: I'd be curious, with the same SD card can you reproduce any
failures on peach pi? ...or does peach-pi work fine in this case?

Hmm, also I think my last suggestion was to see how things looked with
<https://chromium-review.googlesource.com/#/c/244347/> picked to get
extra debug info...


-Doug

2016-03-24 11:26:52

by Enric Balletbo Serra

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

I fixed Javier Martinez email and removed [email protected] (delivery fail)
Also cc'ing Russell King as I think might help (see my comment below)


2016-03-21 23:38 GMT+01:00 Doug Anderson <[email protected]>:
> Enric,
>
> On Thu, Mar 17, 2016 at 5:12 AM, Enric Balletbo Serra
> <[email protected]> wrote:
>> Dear all,
>>
>> Seems the following thread[1] didn't go anywhere. I'd like to continue
>> the discussion and share some tests that I did regarding the issue
>> that the patch is trying to fix.
>>
>> First I reproduced the issue on my rockchip board and I tested the
>> patch intensively, I can confirm that the patch made by Doug fixes the
>> issue.But, as reported by Alim, seems that this patch has the side
>> effect that breaks mmc on peach-pi board [2], specially on
>> suspend/resume, I ran lots of tests on peach-pi and, although is a bit
>> random, I can also confirm the breakage.
>>
>> Looks like that on peach-pi, when the patch is applied the controller
>> moves into a data transfer and the interrupt does not come, then the
>> task blocks. The reason why I think the dw_mmc-rockchip driver works
>> is because it has the DW_MCI_QUIRK_BROKEN_DTO quirk [3].
>>
>> So I did lots of tests on peach-pi with dto quirk, suspend/resume
>> started to work again. But I guess this is not the proper solution or
>> it is? Thoughts?
>>
>> [1] https://lkml.org/lkml/2015/5/18/495
>> [2] https://lava.collabora.co.uk/scheduler/job/169384/log_file#L_195_5
>> [3] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/mmc/host/dw_mmc-rockchip.c?id=57e104864bc4874a36796fd222d8d084dbf90b9b
>
> Ah, that would make some sense why things work OK on Rockchip. Adding
> DW_MCI_QUIRK_BROKEN_DTO to peach probably doesn't make sense, then.
> Hrm...
>
> Since my original debugging of the issue was over a year ago, I think
> I've almost totally lost context of any debugging I did on the issue,
> so I'm not sure I'm going to be too much help in giving any details
> other than what I put in the original commit message. From the
> original message it appears that I thought we could solve this other
> ways but just that my patch was easier than the alternative of
> handling every error case. Maybe we just need to go back to the
> drawing board and handle the error directly?
>

I just saw that Russell introduced a patch [1] that will land on 4.6.
I think that patch solves the same issue that we're trying to fix, but
for sdhci controller.

The problem that we have on peach-pi, with our patch applied, is that
when we get a response CRC error on a command and we move to start
sending data, the transfer doesn't receives a timeout interrupt (I
don't know why). As I told, on rockchip works due the DTO quirk.
exynos is not using this quirk. Also, please correct me if I'm wrong,
looks like the sdhci controller has a timer to signal the command
timed out.

ooi, anyone knows what was the test case that caused the necessity of
the DTO quirk?

> Also: my original commit message says "response error or response CRC
> error". Do you happen to know which of these two we're hitting on
> rk3288? If we limit the workaround to just one of these two cases
> does peach pi still break?
>

Yes, the peach pi still break. The one that is hitting is the response
CRC error, so limit the workaround doesn't help.


> Also: I'd be curious, with the same SD card can you reproduce any
> failures on peach pi? ...or does peach-pi work fine in this case?
>

I can't test this now because I don't have physical access to the
peach-pi. But yeah, this is something to test.

> Hmm, also I think my last suggestion was to see how things looked with
> <https://chromium-review.googlesource.com/#/c/244347/> picked to get
> extra debug info...
>
>
> -Doug

[1] https://git.linaro.org/people/ulf.hansson/mmc.git/commit/71fcbda0fcddd0896c4982a484f6c8aa802d28b1

Enric

2016-03-24 15:16:58

by Douglas Anderson

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

Hi,

On Thu, Mar 24, 2016 at 4:26 AM, Enric Balletbo Serra
<[email protected]> wrote:
>> Ah, that would make some sense why things work OK on Rockchip. Adding
>> DW_MCI_QUIRK_BROKEN_DTO to peach probably doesn't make sense, then.
>> Hrm...
>>
>> Since my original debugging of the issue was over a year ago, I think
>> I've almost totally lost context of any debugging I did on the issue,
>> so I'm not sure I'm going to be too much help in giving any details
>> other than what I put in the original commit message. From the
>> original message it appears that I thought we could solve this other
>> ways but just that my patch was easier than the alternative of
>> handling every error case. Maybe we just need to go back to the
>> drawing board and handle the error directly?
>>
>
> I just saw that Russell introduced a patch [1] that will land on 4.6.
> I think that patch solves the same issue that we're trying to fix, but
> for sdhci controller.

Yes, the description sounds very similar for sure.


> The problem that we have on peach-pi, with our patch applied, is that
> when we get a response CRC error on a command and we move to start
> sending data, the transfer doesn't receives a timeout interrupt (I
> don't know why). As I told, on rockchip works due the DTO quirk.
> exynos is not using this quirk. Also, please correct me if I'm wrong,
> looks like the sdhci controller has a timer to signal the command
> timed out.

I haven't spent any amount of time looking at SDHCI driver.

If we have no other better solution, enabling the DTO timer for this
specific case on all dw_mmc might be a good idea?


> ooi, anyone knows what was the test case that caused the necessity of
> the DTO quirk?

If I remember correctly, the DTO quirk is necessary to get basic
tuning working on almost every UHS card out there on rk3288. Take it
out and you'll see lots of problems. If you have it in there and add
a printout when the DTO quirk fires, you'll see your printout a decent
amount during tuning. Should be easy to test that...

Those same cards work fine on exynos devices without the DTO quirk.


-Doug

2016-03-24 15:31:24

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

On Thu, Mar 24, 2016 at 12:26:43PM +0100, Enric Balletbo Serra wrote:
> I just saw that Russell introduced a patch [1] that will land on 4.6.
> I think that patch solves the same issue that we're trying to fix, but
> for sdhci controller.

It doesn't sound like the same issue to me, though it was a long while
back when I was looking at sdhci, so I may be misremembering.

> The problem that we have on peach-pi, with our patch applied, is that
> when we get a response CRC error on a command and we move to start
> sending data, the transfer doesn't receives a timeout interrupt (I
> don't know why). As I told, on rockchip works due the DTO quirk.
> exynos is not using this quirk. Also, please correct me if I'm wrong,
> looks like the sdhci controller has a timer to signal the command
> timed out.

>From what I remember, the problem I was seeing is that SDHCI sends a
command (iirc, a tuning command), and receives a response CRC error.
The card, however, knows nothing about the CRC error, so it moves into
the transfer state.

Meanwhile, SDHCI stopped processing the command, resetting the SDHCI
controller and reporting the error to the upper layers.

Then, a new command gets queued, issued to the card, and this fails
because the card is still in transfer state. This totally screws up
the SDHCI UHS tuning.

This is not the only SDHCI UHS tuning bug: others exist which do not
yet have patches, where we can get spurious false positives/false
negatives for various tuning steps which totally confuse the code.

>From what you say above, your issue is that you get a response CRC
error, but the dw MMC masks the data side, which sounds like a
different solution is needed. The MMC block driver error handling
is fairly robust, but there is no core error handling (because the
error handling is not obvious.) So any command not eminating from
the MMC block driver that invokes a transfer from the card which
fails won't have a stop command sent for it.

Maybe that's a weakness of the core MMC code: when I originally
designed that part of the MMC code, my thoughts were to leave error
handling to the higher levels (such as MMC block) because its
dependent on those higher levels. Eg, the various status bits
which report errors, whether a stop command needs to be issued,
etc.

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2016-03-24 16:06:56

by Douglas Anderson

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

Russell,

On Thu, Mar 24, 2016 at 8:30 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Thu, Mar 24, 2016 at 12:26:43PM +0100, Enric Balletbo Serra wrote:
>> I just saw that Russell introduced a patch [1] that will land on 4.6.
>> I think that patch solves the same issue that we're trying to fix, but
>> for sdhci controller.
>
> It doesn't sound like the same issue to me, though it was a long while
> back when I was looking at sdhci, so I may be misremembering.
>
>> The problem that we have on peach-pi, with our patch applied, is that
>> when we get a response CRC error on a command and we move to start
>> sending data, the transfer doesn't receives a timeout interrupt (I
>> don't know why). As I told, on rockchip works due the DTO quirk.
>> exynos is not using this quirk. Also, please correct me if I'm wrong,
>> looks like the sdhci controller has a timer to signal the command
>> timed out.
>
> From what I remember, the problem I was seeing is that SDHCI sends a
> command (iirc, a tuning command), and receives a response CRC error.
> The card, however, knows nothing about the CRC error, so it moves into
> the transfer state.
>
> Meanwhile, SDHCI stopped processing the command, resetting the SDHCI
> controller and reporting the error to the upper layers.
>
> Then, a new command gets queued, issued to the card, and this fails
> because the card is still in transfer state. This totally screws up
> the SDHCI UHS tuning.
>
> This is not the only SDHCI UHS tuning bug: others exist which do not
> yet have patches, where we can get spurious false positives/false
> negatives for various tuning steps which totally confuse the code.
>
> From what you say above, your issue is that you get a response CRC
> error, but the dw MMC masks the data side, which sounds like a
> different solution is needed.

What I was seeing that when the controller saw the CRC error it tried
to abort with a "stop" command. You can see the
"send_stop_abort(host, data)" in dw_mmc.c. Then I saw:

> Sending the stop command after the "response CRC error" would
> then throw the system into a confused state causing all future
> tuning phases to report failure.

Presumably this is similar to what you saw: the host saw the CRC error
but the card knew nothing about it. Sending the stop command during
this time confused the card. Presumably the card was in transfer
state during this time?


-Doug

2016-03-24 16:22:42

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
> Russell,
...
> Presumably this is similar to what you saw: the host saw the CRC error
> but the card knew nothing about it. Sending the stop command during
> this time confused the card. Presumably the card was in transfer
> state during this time?

If the card was in transfer state for a command which expects a stop
command, and that stop command was issued after the card entered
the transfer state, then I'd expect the card to handle it... though
there's always the firmware bug issue.

If the card hadn't entered transfer state at the time the stop command
was issued.. I think that's more likely to hit card firmware issues.

With the tuning commands, there's another case you can hit though:
the data transfer may have completed before you get around to sending
the stop command.

That's why, for sdhci, I came to the conclusion that waiting for the
data transfer to complete or timeout was the best solution for SDHCI.

Maybe, if sending a STOP command does cause card firmware issues, then:

1) it provides evidence that trying to send a stop command on response
CRC error is the wrong thing to do (it was talked about making SDHCI
do this.)

2) it suggests that the solution I came up with for SDHCI is the better
solution, rather than trying to immediately recover the situation by
sending a STOP command.

Maybe dw-mmc can do something similar, but with the lack of data transfer
timeout, maybe it's possible to do something with a kernel timer instead,
and check what the hardware is doing after a response CRC error?

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2016-03-30 17:16:22

by Enric Balletbo Serra

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <[email protected]>:
> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>> Russell,
> ...
>> Presumably this is similar to what you saw: the host saw the CRC error
>> but the card knew nothing about it. Sending the stop command during
>> this time confused the card. Presumably the card was in transfer
>> state during this time?
>
> If the card was in transfer state for a command which expects a stop
> command, and that stop command was issued after the card entered
> the transfer state, then I'd expect the card to handle it... though
> there's always the firmware bug issue.
>
> If the card hadn't entered transfer state at the time the stop command
> was issued.. I think that's more likely to hit card firmware issues.
>
> With the tuning commands, there's another case you can hit though:
> the data transfer may have completed before you get around to sending
> the stop command.
>
> That's why, for sdhci, I came to the conclusion that waiting for the
> data transfer to complete or timeout was the best solution for SDHCI.
>

In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
behaviour. What does this is use a kernel timer to signal when DTO
interrupt does NOT come. Note that if I disable this quirk I can also
saw the problem on rockchip.

> Maybe, if sending a STOP command does cause card firmware issues, then:
>
> 1) it provides evidence that trying to send a stop command on response
> CRC error is the wrong thing to do (it was talked about making SDHCI
> do this.)
>

Seems the same here, so guess is the wrong thing to do.

> 2) it suggests that the solution I came up with for SDHCI is the better
> solution, rather than trying to immediately recover the situation by
> sending a STOP command.
>

I'm wondering if just enable this quirk on exynos too is the proper
solution. Unfortunately I don't have enough documentation to check
differences between those controllers.
Also will really help have access to some hardware that uses
dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
Anyone with the hardware who can do some tests?


> Maybe dw-mmc can do something similar, but with the lack of data transfer
> timeout, maybe it's possible to do something with a kernel timer instead,
> and check what the hardware is doing after a response CRC error?
>
> --
> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.

2016-03-30 17:26:27

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

On Wed, Mar 30, 2016 at 07:16:18PM +0200, Enric Balletbo Serra wrote:
> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <[email protected]>:
> > On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
> >> Russell,
> > ...
> >> Presumably this is similar to what you saw: the host saw the CRC error
> >> but the card knew nothing about it. Sending the stop command during
> >> this time confused the card. Presumably the card was in transfer
> >> state during this time?
> >
> > If the card was in transfer state for a command which expects a stop
> > command, and that stop command was issued after the card entered
> > the transfer state, then I'd expect the card to handle it... though
> > there's always the firmware bug issue.
> >
> > If the card hadn't entered transfer state at the time the stop command
> > was issued.. I think that's more likely to hit card firmware issues.
> >
> > With the tuning commands, there's another case you can hit though:
> > the data transfer may have completed before you get around to sending
> > the stop command.
> >
> > That's why, for sdhci, I came to the conclusion that waiting for the
> > data transfer to complete or timeout was the best solution for SDHCI.
> >
>
> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
> behaviour. What does this is use a kernel timer to signal when DTO
> interrupt does NOT come. Note that if I disable this quirk I can also
> saw the problem on rockchip.
>
> > Maybe, if sending a STOP command does cause card firmware issues, then:
> >
> > 1) it provides evidence that trying to send a stop command on response
> > CRC error is the wrong thing to do (it was talked about making SDHCI
> > do this.)
> >
>
> Seems the same here, so guess is the wrong thing to do.
>
> > 2) it suggests that the solution I came up with for SDHCI is the better
> > solution, rather than trying to immediately recover the situation by
> > sending a STOP command.
> >
>
> I'm wondering if just enable this quirk on exynos too is the proper
> solution. Unfortunately I don't have enough documentation to check
> differences between those controllers.
> Also will really help have access to some hardware that uses
> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
> Anyone with the hardware who can do some tests?

I'd really suggest that the dw-mmc folk place a moritorium on quirk
flags, and instead deal with situations like this without resorting
to this kind of thing.

sdhci is a good example why the quirk flag approach is totally wrong,
and shows that it leads to an unmaintainable mess. If dw-mmc people
don't want the driver to decend into the same state that sdhci is,
then things like this should not be quirks. sdhci already has a
long-term moritorium on quirk flags until the resulting mess has been
cleaned up.

The danger that quirk flags cause is also highlighted in your mail:
it's very likely that this _isn't_ a host controller issue at all,
but a MMC protocol issue or a card issue - and the behaviour required
here is not specific to any particular host controller. The problem
with having a quirk flag for it is that you end up with some hosts
enabling it, and other hosts having it disabled only because they
haven't yet tripped over the issue.

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2016-03-30 20:39:39

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

On Wed, Mar 30, 2016 at 10:06:36PM +0200, Enric Balletbo Serra wrote:
> El dia 30/03/2016 19:26, "Russell King - ARM Linux" <[email protected]>
> va escriure:
> > I'd really suggest that the dw-mmc folk place a moritorium on quirk
> > flags, and instead deal with situations like this without resorting
> > to this kind of thing.
> >
> > sdhci is a good example why the quirk flag approach is totally wrong,
> > and shows that it leads to an unmaintainable mess. If dw-mmc people
> > don't want the driver to decend into the same state that sdhci is,
> > then things like this should not be quirks. sdhci already has a
> > long-term moritorium on quirk flags until the resulting mess has been
> > cleaned up.
> >
>
> You mean that was a mess in the past and now is cleaned up?

SDHCI is far from being "cleaned up" - the conclusion that I've put
forward, and Ulf appears to agree with is that the core SDHCI driver
needs to become a library.

We then need individual drivers, which make selective use of the
library functions, and the "quirks" implemented as either fixes to
the core code or implemented by replacement functions.

The problem with SDHCI is that it's not far off having every other
line of code being conditional on a quirk flag - I've submitted to
very large series of patches so far doing a series of code transforms
to reduce this, but the job is nowhere near complete. Given the
number of drivers, it's something that needs to be done with care
and over a period of time.

This is why I'm warning about this as soon as I've heard another
driver going down the quirks route - hopefully you can avoid this
mistake early enough that it doesn't become a several year project
to sort out.

--
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

2016-03-31 01:56:33

by Shawn Lin

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

在 2016/3/31 1:26, Russell King - ARM Linux 写道:
> On Wed, Mar 30, 2016 at 07:16:18PM +0200, Enric Balletbo Serra wrote:
>> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <[email protected]>:
>>> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>>>> Russell,
>>> ...
>>>> Presumably this is similar to what you saw: the host saw the CRC error
>>>> but the card knew nothing about it. Sending the stop command during
>>>> this time confused the card. Presumably the card was in transfer
>>>> state during this time?
>>>
>>> If the card was in transfer state for a command which expects a stop
>>> command, and that stop command was issued after the card entered
>>> the transfer state, then I'd expect the card to handle it... though
>>> there's always the firmware bug issue.
>>>
>>> If the card hadn't entered transfer state at the time the stop command
>>> was issued.. I think that's more likely to hit card firmware issues.
>>>
>>> With the tuning commands, there's another case you can hit though:
>>> the data transfer may have completed before you get around to sending
>>> the stop command.
>>>
>>> That's why, for sdhci, I came to the conclusion that waiting for the
>>> data transfer to complete or timeout was the best solution for SDHCI.
>>>
>>
>> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
>> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
>> behaviour. What does this is use a kernel timer to signal when DTO
>> interrupt does NOT come. Note that if I disable this quirk I can also
>> saw the problem on rockchip.
>>
>>> Maybe, if sending a STOP command does cause card firmware issues, then:
>>>
>>> 1) it provides evidence that trying to send a stop command on response
>>> CRC error is the wrong thing to do (it was talked about making SDHCI
>>> do this.)
>>>
>>
>> Seems the same here, so guess is the wrong thing to do.
>>
>>> 2) it suggests that the solution I came up with for SDHCI is the better
>>> solution, rather than trying to immediately recover the situation by
>>> sending a STOP command.
>>>
>>
>> I'm wondering if just enable this quirk on exynos too is the proper
>> solution. Unfortunately I don't have enough documentation to check
>> differences between those controllers.
>> Also will really help have access to some hardware that uses
>> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
>> Anyone with the hardware who can do some tests?
>
> I'd really suggest that the dw-mmc folk place a moritorium on quirk
> flags, and instead deal with situations like this without resorting
> to this kind of thing.
>

Some quirks and some callbacks have been cleaned in Jaehoon's repo,and
still some are going to removed. Finally we do plan to turn dw_mmc core
into a pure library..

> sdhci is a good example why the quirk flag approach is totally wrong,
> and shows that it leads to an unmaintainable mess. If dw-mmc people
> don't want the driver to decend into the same state that sdhci is,
> then things like this should not be quirks. sdhci already has a
> long-term moritorium on quirk flags until the resulting mess has been
> cleaned up.
>
> The danger that quirk flags cause is also highlighted in your mail:
> it's very likely that this _isn't_ a host controller issue at all,

Two issues found by dw_mmc-rockchip part,
(1) need reset idma when switching between fifo-transfer and
idma-transfer. When biu:ciu > 1:6, idma internal fsm take a risk of
a race condition to maintain its fifo lookup pointer. It can be very
easy reproduce by seting biu:ciu > 1:6.. Common bug for dw_mmc! But
unfortunately these details was missing in the commit msg.

(2) Missing DTO/DRTO; I missed the thread for this topic, so I need to
reproduce it by setting a simple C model code. I can't say more
currently until we can find a way to easily reproduce it. But I guess
it's NOT a host issue....since I slightly glance at the TMOUT reg at
dw_mmc databook and find a software timer requirement:

31:8 data_timeout 0xffffff
Value for card Data Read Timeout; same value also used for Data
Starvation by Host timeout. The timeout counter is started only after
thecard clock is stopped. Value is in number of card output clocks –
cclk_out of selected card.

Note: The software timer should be used if the timeout value is in the
order of 100 ms. In this case, read data timeout interrupt needs to be
disabled.

> but a MMC protocol issue or a card issue - and the behaviour required
> here is not specific to any particular host controller. The problem
> with having a quirk flag for it is that you end up with some hosts
> enabling it, and other hosts having it disabled only because they
> haven't yet tripped over the issue.
>


--
Best Regards
Shawn Lin

2016-03-31 02:03:32

by Jaehoon Chung

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

On 03/31/2016 02:16 AM, Enric Balletbo Serra wrote:
> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <[email protected]>:
>> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>>> Russell,
>> ...
>>> Presumably this is similar to what you saw: the host saw the CRC error
>>> but the card knew nothing about it. Sending the stop command during
>>> this time confused the card. Presumably the card was in transfer
>>> state during this time?
>>
>> If the card was in transfer state for a command which expects a stop
>> command, and that stop command was issued after the card entered
>> the transfer state, then I'd expect the card to handle it... though
>> there's always the firmware bug issue.
>>
>> If the card hadn't entered transfer state at the time the stop command
>> was issued.. I think that's more likely to hit card firmware issues.
>>
>> With the tuning commands, there's another case you can hit though:
>> the data transfer may have completed before you get around to sending
>> the stop command.
>>
>> That's why, for sdhci, I came to the conclusion that waiting for the
>> data transfer to complete or timeout was the best solution for SDHCI.
>>
>
> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
> behaviour. What does this is use a kernel timer to signal when DTO
> interrupt does NOT come. Note that if I disable this quirk I can also
> saw the problem on rockchip.

Did you see the problem with exynos? Could you share which exynos chip you use?
Then i can check this with all exynos.

>
>> Maybe, if sending a STOP command does cause card firmware issues, then:
>>
>> 1) it provides evidence that trying to send a stop command on response
>> CRC error is the wrong thing to do (it was talked about making SDHCI
>> do this.)
>>
>
> Seems the same here, so guess is the wrong thing to do.
>
>> 2) it suggests that the solution I came up with for SDHCI is the better
>> solution, rather than trying to immediately recover the situation by
>> sending a STOP command.
>>
>
> I'm wondering if just enable this quirk on exynos too is the proper
> solution. Unfortunately I don't have enough documentation to check
> differences between those controllers.
> Also will really help have access to some hardware that uses
> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
> Anyone with the hardware who can do some tests?

I want to remove all quirks for dwmmc controller. (in progressing with Shawn.)

>
>
>> Maybe dw-mmc can do something similar, but with the lack of data transfer
>> timeout, maybe it's possible to do something with a kernel timer instead,
>> and check what the hardware is doing after a response CRC error?
>>
>> --
>> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
>> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
>> according to speedtest.net.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2016-03-31 06:39:35

by Enric Balletbo Serra

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

2016-03-31 4:03 GMT+02:00 Jaehoon Chung <[email protected]>:
> On 03/31/2016 02:16 AM, Enric Balletbo Serra wrote:
>> 2016-03-24 17:22 GMT+01:00 Russell King - ARM Linux <[email protected]>:
>>> On Thu, Mar 24, 2016 at 09:06:45AM -0700, Doug Anderson wrote:
>>>> Russell,
>>> ...
>>>> Presumably this is similar to what you saw: the host saw the CRC error
>>>> but the card knew nothing about it. Sending the stop command during
>>>> this time confused the card. Presumably the card was in transfer
>>>> state during this time?
>>>
>>> If the card was in transfer state for a command which expects a stop
>>> command, and that stop command was issued after the card entered
>>> the transfer state, then I'd expect the card to handle it... though
>>> there's always the firmware bug issue.
>>>
>>> If the card hadn't entered transfer state at the time the stop command
>>> was issued.. I think that's more likely to hit card firmware issues.
>>>
>>> With the tuning commands, there's another case you can hit though:
>>> the data transfer may have completed before you get around to sending
>>> the stop command.
>>>
>>> That's why, for sdhci, I came to the conclusion that waiting for the
>>> data transfer to complete or timeout was the best solution for SDHCI.
>>>
>>
>> In fact I only saw the problem with dw_mmc-exynos, on dw_mmc-rockchip
>> it doesn't happen because it enables the DW_MCI_QUIRK_BROKEN_DTO
>> behaviour. What does this is use a kernel timer to signal when DTO
>> interrupt does NOT come. Note that if I disable this quirk I can also
>> saw the problem on rockchip.
>
> Did you see the problem with exynos? Could you share which exynos chip you use?
> Then i can check this with all exynos.
>

Applying this patch I saw the problem with peach-pi, exynos 58000.

>>
>>> Maybe, if sending a STOP command does cause card firmware issues, then:
>>>
>>> 1) it provides evidence that trying to send a stop command on response
>>> CRC error is the wrong thing to do (it was talked about making SDHCI
>>> do this.)
>>>
>>
>> Seems the same here, so guess is the wrong thing to do.
>>
>>> 2) it suggests that the solution I came up with for SDHCI is the better
>>> solution, rather than trying to immediately recover the situation by
>>> sending a STOP command.
>>>
>>
>> I'm wondering if just enable this quirk on exynos too is the proper
>> solution. Unfortunately I don't have enough documentation to check
>> differences between those controllers.
>> Also will really help have access to some hardware that uses
>> dw_mmc-pltfm to check if, like on exynos, same issue is triggered.
>> Anyone with the hardware who can do some tests?
>
> I want to remove all quirks for dwmmc controller. (in progressing with Shawn.)
>

Nice, do you have a work-in-progress repository? Maybe I can help
testing the patches.

>>
>>
>>> Maybe dw-mmc can do something similar, but with the lack of data transfer
>>> timeout, maybe it's possible to do something with a kernel timer instead,
>>> and check what the hardware is doing after a response CRC error?
>>>
>>> --
>>> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
>>> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
>>> according to speedtest.net.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>

2016-03-31 18:12:55

by Douglas Anderson

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors

Hi,

On Thu, Mar 24, 2016 at 9:22 AM, Russell King - ARM Linux
<[email protected]> wrote:
> That's why, for sdhci, I came to the conclusion that waiting for the
> data transfer to complete or timeout was the best solution for SDHCI.
>
> Maybe, if sending a STOP command does cause card firmware issues, then:
>
> 1) it provides evidence that trying to send a stop command on response
> CRC error is the wrong thing to do (it was talked about making SDHCI
> do this.)
>
> 2) it suggests that the solution I came up with for SDHCI is the better
> solution, rather than trying to immediately recover the situation by
> sending a STOP command.
>
> Maybe dw-mmc can do something similar, but with the lack of data transfer
> timeout, maybe it's possible to do something with a kernel timer instead,
> and check what the hardware is doing after a response CRC error?

Since the problem only reproduced with a single model of a single
brand of card, it is probably a card firmware issue as you say.
Presumably if we put this card in an exynos that had tuning enabled
we'd seem similar issues. I haven't had a chance to test this.

dw_mmc does have a data transfer timeout, but for some reason it
doesn't seem to fire reliably on Rockchip. Sounds like Shawn may be
digging here? In the past I've found that the same code running on
rk3288 and exynos would fire the timeout on exynos but not on rk3288.
That's the reason why rk3288 has a special quirk to enable a software
timer.

In the case of CRC error perhaps the controller sin't sending a data
timeout because it already told us about the CRC error (so need to
report further?), but I guess the quirk on the Rockhip platform makes
it work.


Sounds like the right fix is to enable a timer after the CRC error
(similar to the DTO quirk) and not send a STOP. Even if it's not a
firmware problem, keeping dw_mmc behaving more like SDHCI is a good
idea because presumably SD cards out there will not test against all
controllers, so we'll have the most compatibility if all controllers
behave the same (even if the spec technically allows them to behave
otherwise).


-Doug