2020-02-13 15:43:29

by Naresh Kamboju

[permalink] [raw]
Subject: LKFT: arm x15: mmc1: cache flush error -110

arm beagleboard x15 device failed to boot Linux mainline and
linux-next kernel due
to below error.
This error occurred across all x15 device for these kernel version.

This regression started happening on x15 from this commit onwards (27th Jan)
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git commit: aae1464f46a2403565f75717438118691d31ccf1
git describe: v5.5-489-gaae1464f46a2


Test output log,
[ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 37.611850] mmc1: cache flush error -110
[ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector
4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 37.627387] Aborting journal on device mmcblk1p9-8.
[ 37.635448] systemd[1]: Installed transient /etc/machine-id file.
[ 37.659283] systemd[1]: Couldn't move remaining userspace
processes, ignoring: Input/output error
[ 37.744027] EXT4-fs error (device mmcblk1p9):
ext4_journal_check_start:61: Detected aborted journal
[ 37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
[ 37.917486] systemd-gpt-auto-generator[108]: Failed to dissect:
Input/output error
[ 37.927825] systemd[104]:
/lib/systemd/system-generators/systemd-gpt-auto-generator failed with
exit status 1.
<>
[ 68.856307] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 68.861838] mmc1: cache flush error -110
[ 68.865812] blk_update_request: I/O error, dev mmcblk1, sector 0 op
0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
[ 98.906243] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 98.911774] mmc1: cache flush error -110
[ 98.915747] blk_update_request: I/O error, dev mmcblk1, sector 0 op
0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
<>
Dependency failed for Serial Getty on ttyS2.
[ 128.946258] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 128.951786] mmc1: cache flush error -110
[ 128.955756] blk_update_request: I/O error, dev mmcblk1, sector 0 op
0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[FAILED] Failed to start File System Check on Root Device.
See 'systemctl status systemd-fsck-root.service' for details.
[ OK ] Started Apply Kernel Variables.
[ OK ] Reached target Login Prompts.
Starting Remount Root and Kernel File Systems...
[ OK ] Reached target Timers.
[ OK ] Closed Syslog Socket.
[ OK ] Started Emergency Shell.
[ 129.227328] EXT4-fs error (device mmcblk1p9): ext4_remount:5354:
Abort forced by user
[ OK ] Reached target Emergency Mode.
[ OK ] Reached target Sockets.
[FAILED] Failed to start Remount Root and Kernel File Systems.
<>
You are in emergency mode. After logging in, type \"journalctl -xb\" to view
system logs, \"systemctl reboot\" to reboot, \"systemctl default\" or \"exit\"
to boot into default mode.
Press Enter for maintenance
auto-login-action timed out after 874 seconds

ref:
https://lkft.validation.linaro.org/scheduler/job/1137693#L4034
https://lkft.validation.linaro.org/scheduler/job/1158106#L4048
https://lkft.validation.linaro.org/scheduler/job/1137690#L3985
https://lkft.validation.linaro.org/scheduler/job/1137691#L4012
https://lkft.validation.linaro.org/scheduler/job/1137696#L4043
https://lkft.validation.linaro.org/scheduler/job/1137699#L4153

--
Linaro LKFT
https://lkft.linaro.org


2020-02-14 09:10:16

by Arnd Bergmann

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Thu, Feb 13, 2020 at 4:43 PM Naresh Kamboju
<[email protected]> wrote:
>
> arm beagleboard x15 device failed to boot Linux mainline and
> linux-next kernel due
> to below error.
> This error occurred across all x15 device for these kernel version.
>
> This regression started happening on x15 from this commit onwards (27th Jan)
> git branch: master
> git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git commit: aae1464f46a2403565f75717438118691d31ccf1
> git describe: v5.5-489-gaae1464f46a2

Is it only the merge that introduced the issue, or is the branch that got
merged already broken?

If it's easy for you to reproduce, please run the same test on commit
e4e4c2ff78ed from Mark's regulator tree to narrow it down further.

Added Mark to Cc as well, in case it is indeed one of those.

Arnd

8<---
> Test output log,
> [ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
> [ 37.611850] mmc1: cache flush error -110
> [ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector
> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> [ 37.627387] Aborting journal on device mmcblk1p9-8.
> [ 37.635448] systemd[1]: Installed transient /etc/machine-id file.
> [ 37.659283] systemd[1]: Couldn't move remaining userspace
> processes, ignoring: Input/output error
> [ 37.744027] EXT4-fs error (device mmcblk1p9):
> ext4_journal_check_start:61: Detected aborted journal
> [ 37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
> [ 37.917486] systemd-gpt-auto-generator[108]: Failed to dissect:
> Input/output error
> [ 37.927825] systemd[104]:
> /lib/systemd/system-generators/systemd-gpt-auto-generator failed with
> exit status 1.
> <>
> [ 68.856307] mmc1: Card stuck being busy! mmc_poll_for_busy
> [ 68.861838] mmc1: cache flush error -110
> [ 68.865812] blk_update_request: I/O error, dev mmcblk1, sector 0 op
> 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> <>
> [ 98.906243] mmc1: Card stuck being busy! mmc_poll_for_busy
> [ 98.911774] mmc1: cache flush error -110
> [ 98.915747] blk_update_request: I/O error, dev mmcblk1, sector 0 op
> 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> <>
> Dependency failed for Serial Getty on ttyS2.
> [ 128.946258] mmc1: Card stuck being busy! mmc_poll_for_busy
> [ 128.951786] mmc1: cache flush error -110
> [ 128.955756] blk_update_request: I/O error, dev mmcblk1, sector 0 op
> 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> [FAILED] Failed to start File System Check on Root Device.
> See 'systemctl status systemd-fsck-root.service' for details.
> [ OK ] Started Apply Kernel Variables.
> [ OK ] Reached target Login Prompts.
> Starting Remount Root and Kernel File Systems...
> [ OK ] Reached target Timers.
> [ OK ] Closed Syslog Socket.
> [ OK ] Started Emergency Shell.
> [ 129.227328] EXT4-fs error (device mmcblk1p9): ext4_remount:5354:
> Abort forced by user
> [ OK ] Reached target Emergency Mode.
> [ OK ] Reached target Sockets.
> [FAILED] Failed to start Remount Root and Kernel File Systems.
> <>
> You are in emergency mode. After logging in, type \"journalctl -xb\" to view
> system logs, \"systemctl reboot\" to reboot, \"systemctl default\" or \"exit\"
> to boot into default mode.
> Press Enter for maintenance
> auto-login-action timed out after 874 seconds
>
> ref:
> https://lkft.validation.linaro.org/scheduler/job/1137693#L4034
> https://lkft.validation.linaro.org/scheduler/job/1158106#L4048
> https://lkft.validation.linaro.org/scheduler/job/1137690#L3985
> https://lkft.validation.linaro.org/scheduler/job/1137691#L4012
> https://lkft.validation.linaro.org/scheduler/job/1137696#L4043
> https://lkft.validation.linaro.org/scheduler/job/1137699#L4153

2020-02-14 12:09:34

by Mark Brown

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Fri, Feb 14, 2020 at 10:09:34AM +0100, Arnd Bergmann wrote:
> On Thu, Feb 13, 2020 at 4:43 PM Naresh Kamboju

> > arm beagleboard x15 device failed to boot Linux mainline and
> > linux-next kernel due
> > to below error.
> > This error occurred across all x15 device for these kernel version.

> > This regression started happening on x15 from this commit onwards (27th Jan)
> > git branch: master
> > git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > git commit: aae1464f46a2403565f75717438118691d31ccf1
> > git describe: v5.5-489-gaae1464f46a2

That's Linus' commit:

Merge tag 'regulator-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Please include human readable descriptions of things like commits and
issues being discussed in e-mail in your mails, this makes them much
easier for humans to read especially when they have no internet access.
I do frequently catch up on my mail on flights or while otherwise
travelling so this is even more pressing for me than just being about
making things a bit easier to read.

> Is it only the merge that introduced the issue, or is the branch that got
> merged already broken?

Or is this just the first commit where the problem happened to be
noticed? Commits done just before that on the same day include the
dmaengine and MMC merges which seem potentially relevant:

> > [ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
> > [ 37.611850] mmc1: cache flush error -110
> > [ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector
> > 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0


Attachments:
(No filename) (1.63 kB)
signature.asc (499.00 B)
Download all attachments

2020-02-19 16:25:57

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
>
> arm beagleboard x15 device failed to boot Linux mainline and
> linux-next kernel due
> to below error.
> This error occurred across all x15 device for these kernel version.
>
> This regression started happening on x15 from this commit onwards (27th Jan)
> git branch: master
> git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git commit: aae1464f46a2403565f75717438118691d31ccf1
> git describe: v5.5-489-gaae1464f46a2
>
>
> Test output log,
> [ 37.606241] mmc1: Card stuck being busy! mmc_poll_for_busy
> [ 37.611850] mmc1: cache flush error -110
> [ 37.615883] blk_update_request: I/O error, dev mmcblk1, sector
> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> [ 37.627387] Aborting journal on device mmcblk1p9-8.
> [ 37.635448] systemd[1]: Installed transient /etc/machine-id file.
> [ 37.659283] systemd[1]: Couldn't move remaining userspace
> processes, ignoring: Input/output error
> [ 37.744027] EXT4-fs error (device mmcblk1p9):
> ext4_journal_check_start:61: Detected aborted journal
> [ 37.753322] EXT4-fs (mmcblk1p9): Remounting filesystem read-only
> [ 37.917486] systemd-gpt-auto-generator[108]: Failed to dissect:
> Input/output error
> [ 37.927825] systemd[104]:
> /lib/systemd/system-generators/systemd-gpt-auto-generator failed with
> exit status 1.
> <>

Try to restore the value for the cache flush timeout, by updating the
define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".

The offending commit could perhaps be this one.

commit 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9
Author: Ulf Hansson <[email protected]>
Date: Wed Jan 22 15:27:45 2020 +0100
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC

[...]

Kind regards
Uffe

2020-02-20 17:55:03

by Naresh Kamboju

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
>
> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
> >
>
> Try to restore the value for the cache flush timeout, by updating the
> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".

I have increased the timeout to 10 minutes but it did not help.
Same error found.
[ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
[ 608.684964] mmc1: cache flush error -110
[ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0

OTOH, What best i could do for my own experiment to revert all three patches and
now the reported error gone and device mount successfully [1].

List of patches reverted,
mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
mmc: block: Use generic_cmd6_time when modifying
INAND_CMD38_ARG_EXT_CSD
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()

[1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346

- Naresh

2020-02-21 09:49:21

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <[email protected]> wrote:
>
> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
> >
> > On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
> > >
> >
> > Try to restore the value for the cache flush timeout, by updating the
> > define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>
> I have increased the timeout to 10 minutes but it did not help.
> Same error found.
> [ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
> [ 608.684964] mmc1: cache flush error -110
> [ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>
> OTOH, What best i could do for my own experiment to revert all three patches and
> now the reported error gone and device mount successfully [1].
>
> List of patches reverted,
> mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
> mmc: block: Use generic_cmd6_time when modifying
> INAND_CMD38_ARG_EXT_CSD
> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>
> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
>
> - Naresh

Thanks for testing!

This sounds a bit weird, I must say. Also, while looking into the
logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
maybe I didn't read the logs carefully enough.

In any case, I am looking into creating a debug patch so we can
narrow down the problem a bit further.

Kind regards
Uffe

2020-02-21 19:46:27

by Bitan Biswas

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On 2/21/20 1:48 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <[email protected]> wrote:
>>
>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
>>>
>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
>>>>
>>>
>>> Try to restore the value for the cache flush timeout, by updating the
>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>>
>> I have increased the timeout to 10 minutes but it did not help.
>> Same error found.
>> [ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
>> [ 608.684964] mmc1: cache flush error -110
>> [ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>>
>> OTOH, What best i could do for my own experiment to revert all three patches and
>> now the reported error gone and device mount successfully [1].
>>
>> List of patches reverted,
>> mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>> mmc: block: Use generic_cmd6_time when modifying
>> INAND_CMD38_ARG_EXT_CSD
>> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>>
>> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
>>
>> - Naresh
>
> Thanks for testing!
>
> This sounds a bit weird, I must say. Also, while looking into the
> logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
> maybe I didn't read the logs carefully enough.
>
> In any case, I am looking into creating a debug patch so we can
> narrow down the problem a bit further.
>
> Kind regards
> Uffe
>

Hi Ulf,

I see that Jetson-TX2 / Jetson-Xavier suspend test is aborted and
below error is seen due to the commit
24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9 ("mmc: core: Specify timeouts
for BKOPS and CACHE_FLUSH for eMMC"):

##
[ 268.976197] Freezing user space processes ... (elapsed 0.001 seconds)
done.
[ 268.984414] OOM killer disabled.
[ 268.987635] Freezing remaining freezable tasks ... (elapsed 0.000
seconds) done.
[ 269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
[ 269.224255] PM: Device mmc1:0001 failed to suspend: error -110
[ 269.230080] PM: Some devices failed to suspend, or early wake event
detected
##

I find that from the commit the changes in mmc_flush_cache below is
the cause.

##
@@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
(card->ext_csd.cache_size > 0) &&
(card->ext_csd.cache_ctrl & 1)) {
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
- EXT_CSD_FLUSH_CACHE, 1, 0);
+ EXT_CSD_FLUSH_CACHE, 1,
+ MMC_CACHE_FLUSH_TIMEOUT_MS);

##

Do you have suggestion to try for the suspend errors ?

-regards,
Bitan

2020-02-24 11:17:53

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

+ Adrian

On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <[email protected]> wrote:
>
> On 2/21/20 1:48 AM, Ulf Hansson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <[email protected]> wrote:
> >>
> >> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
> >>>
> >>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
> >>>>
> >>>
> >>> Try to restore the value for the cache flush timeout, by updating the
> >>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
> >>
> >> I have increased the timeout to 10 minutes but it did not help.
> >> Same error found.
> >> [ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
> >> [ 608.684964] mmc1: cache flush error -110
> >> [ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
> >> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> >>
> >> OTOH, What best i could do for my own experiment to revert all three patches and
> >> now the reported error gone and device mount successfully [1].
> >>
> >> List of patches reverted,
> >> mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
> >> mmc: block: Use generic_cmd6_time when modifying
> >> INAND_CMD38_ARG_EXT_CSD
> >> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
> >>
> >> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
> >>
> >> - Naresh
> >
> > Thanks for testing!
> >
> > This sounds a bit weird, I must say. Also, while looking into the
> > logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
> > maybe I didn't read the logs carefully enough.
> >
> > In any case, I am looking into creating a debug patch so we can
> > narrow down the problem a bit further.
> >
> > Kind regards
> > Uffe
> >
>
> Hi Ulf,
>
> I see that Jetson-TX2 / Jetson-Xavier suspend test is aborted and
> below error is seen due to the commit
> 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9 ("mmc: core: Specify timeouts
> for BKOPS and CACHE_FLUSH for eMMC"):
>
> ##
> [ 268.976197] Freezing user space processes ... (elapsed 0.001 seconds)
> done.
> [ 268.984414] OOM killer disabled.
> [ 268.987635] Freezing remaining freezable tasks ... (elapsed 0.000
> seconds) done.
> [ 269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
> [ 269.224255] PM: Device mmc1:0001 failed to suspend: error -110
> [ 269.230080] PM: Some devices failed to suspend, or early wake event
> detected

Is there also a print in the log about "Card stuck being busy!"?

In any case, it seems like the timeout error (-110) is happening way
too soon. The cache flush timeout is now 30s, but the timeout seems to
fire only a few hundred ms (at most) after the cache flush command has
been sent.

> ##
>
> I find that from the commit the changes in mmc_flush_cache below is
> the cause.
>
> ##
> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
> (card->ext_csd.cache_size > 0) &&
> (card->ext_csd.cache_ctrl & 1)) {
> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> - EXT_CSD_FLUSH_CACHE, 1, 0);
> + EXT_CSD_FLUSH_CACHE, 1,
> + MMC_CACHE_FLUSH_TIMEOUT_MS);
>
> ##
>
> Do you have suggestion to try for the suspend errors ?

Just as a quick sanity test, please try the below patch, which
restores the old cache flush timeout to 10min.

However, as I indicated above, this seems to be a problem that needs
to be fixed at in the host driver side. For the sdhci driver, there is
a bit of a tricky logic around how to deal with timeouts in
sdhci_send_command(). My best guess is that's where we should look
more closely (and I am doing that).

From: Ulf Hansson <[email protected]>
Date: Mon, 24 Feb 2020 11:43:33 +0100
Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing

Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/mmc/core/mmc_ops.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..713e7dd6d028 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,7 +21,7 @@

#define MMC_OPS_TIMEOUT_MS (10 * 60 * 1000) /* 10min*/
#define MMC_BKOPS_TIMEOUT_MS (120 * 1000) /* 120s */
-#define MMC_CACHE_FLUSH_TIMEOUT_MS (30 * 1000) /* 30s */
+#define MMC_CACHE_FLUSH_TIMEOUT_MS (10 * 60 * 1000) /* 10min */

static const u8 tuning_blk_pattern_4bit[] = {
0xff, 0x0f, 0xff, 0x00, 0xff, 0xcc, 0xc3, 0xcc,
--

Kind regards
Uffe

2020-02-24 13:01:43

by Adrian Hunter

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On 24/02/20 1:16 pm, Ulf Hansson wrote:
> + Adrian
>
> On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <[email protected]> wrote:
>>
>> On 2/21/20 1:48 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <[email protected]> wrote:
>>>>
>>>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
>>>>>
>>>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
>>>>>>
>>>>>
>>>>> Try to restore the value for the cache flush timeout, by updating the
>>>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>>>>
>>>> I have increased the timeout to 10 minutes but it did not help.
>>>> Same error found.
>>>> [ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
>>>> [ 608.684964] mmc1: cache flush error -110
>>>> [ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
>>>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>>>>
>>>> OTOH, What best i could do for my own experiment to revert all three patches and
>>>> now the reported error gone and device mount successfully [1].
>>>>
>>>> List of patches reverted,
>>>> mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>>>> mmc: block: Use generic_cmd6_time when modifying
>>>> INAND_CMD38_ARG_EXT_CSD
>>>> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>>>>
>>>> [1] https://lkft.validation.linaro.org/scheduler/job/1238275#L4346
>>>>
>>>> - Naresh
>>>
>>> Thanks for testing!
>>>
>>> This sounds a bit weird, I must say. Also, while looking into the
>>> logs, it seems like you are comparing a v5.5 kernel with v5.6-rc2, but
>>> maybe I didn't read the logs carefully enough.
>>>
>>> In any case, I am looking into creating a debug patch so we can
>>> narrow down the problem a bit further.
>>>
>>> Kind regards
>>> Uffe
>>>
>>
>> Hi Ulf,
>>
>> I see that Jetson-TX2 / Jetson-Xavier suspend test is aborted and
>> below error is seen due to the commit
>> 24ed3bd01d6a844fd5e8a75f48d0a3d10ed71bf9 ("mmc: core: Specify timeouts
>> for BKOPS and CACHE_FLUSH for eMMC"):
>>
>> ##
>> [ 268.976197] Freezing user space processes ... (elapsed 0.001 seconds)
>> done.
>> [ 268.984414] OOM killer disabled.
>> [ 268.987635] Freezing remaining freezable tasks ... (elapsed 0.000
>> seconds) done.
>> [ 269.217471] PM: dpm_run_callback(): mmc_bus_suspend+0x0/0x58 returns -110
>> [ 269.224255] PM: Device mmc1:0001 failed to suspend: error -110
>> [ 269.230080] PM: Some devices failed to suspend, or early wake event
>> detected
>
> Is there also a print in the log about "Card stuck being busy!"?
>
> In any case, it seems like the timeout error (-110) is happening way
> too soon. The cache flush timeout is now 30s, but the timeout seems to
> fire only a few hundred ms (at most) after the cache flush command has
> been sent.
>
>> ##
>>
>> I find that from the commit the changes in mmc_flush_cache below is
>> the cause.
>>
>> ##
>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>> (card->ext_csd.cache_size > 0) &&
>> (card->ext_csd.cache_ctrl & 1)) {
>> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>> - EXT_CSD_FLUSH_CACHE, 1, 0);
>> + EXT_CSD_FLUSH_CACHE, 1,
>> + MMC_CACHE_FLUSH_TIMEOUT_MS);
>>
>> ##
>>
>> Do you have suggestion to try for the suspend errors ?
>
> Just as a quick sanity test, please try the below patch, which
> restores the old cache flush timeout to 10min.
>
> However, as I indicated above, this seems to be a problem that needs
> to be fixed at in the host driver side. For the sdhci driver, there is
> a bit of a tricky logic around how to deal with timeouts in
> sdhci_send_command(). My best guess is that's where we should look
> more closely (and I am doing that).

Could be, but hard to narrow down without more debug. I would enable
dynamic debug and mmc messages e.g.

kernel config

CONFIG_DYNAMIC_DEBUG=y

kernel command line option

dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"

But that might lead to a very large number of messages.


>
> From: Ulf Hansson <[email protected]>
> Date: Mon, 24 Feb 2020 11:43:33 +0100
> Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing
>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/mmc/core/mmc_ops.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index da425ee2d9bf..713e7dd6d028 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -21,7 +21,7 @@
>
> #define MMC_OPS_TIMEOUT_MS (10 * 60 * 1000) /* 10min*/
> #define MMC_BKOPS_TIMEOUT_MS (120 * 1000) /* 120s */
> -#define MMC_CACHE_FLUSH_TIMEOUT_MS (30 * 1000) /* 30s */
> +#define MMC_CACHE_FLUSH_TIMEOUT_MS (10 * 60 * 1000) /* 10min */
>
> static const u8 tuning_blk_pattern_4bit[] = {
> 0xff, 0x0f, 0xff, 0x00, 0xff, 0xcc, 0xc3, 0xcc,
>

2020-02-25 10:05:25

by Jon Hunter

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 24/02/2020 11:16, Ulf Hansson wrote:
> + Adrian
>
> On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <[email protected]> wrote:
>>
>> On 2/21/20 1:48 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <[email protected]> wrote:
>>>>
>>>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
>>>>>
>>>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
>>>>>>
>>>>>
>>>>> Try to restore the value for the cache flush timeout, by updating the
>>>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
>>>>
>>>> I have increased the timeout to 10 minutes but it did not help.
>>>> Same error found.
>>>> [ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
>>>> [ 608.684964] mmc1: cache flush error -110
>>>> [ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
>>>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
>>>>
>>>> OTOH, What best i could do for my own experiment to revert all three patches and
>>>> now the reported error gone and device mount successfully [1].
>>>>
>>>> List of patches reverted,
>>>> mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
>>>> mmc: block: Use generic_cmd6_time when modifying
>>>> INAND_CMD38_ARG_EXT_CSD
>>>> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()


Reverting all the above also fixes the problem for me.

>> I find that from the commit the changes in mmc_flush_cache below is
>> the cause.
>>
>> ##
>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>> (card->ext_csd.cache_size > 0) &&
>> (card->ext_csd.cache_ctrl & 1)) {
>> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>> - EXT_CSD_FLUSH_CACHE, 1, 0);
>> + EXT_CSD_FLUSH_CACHE, 1,
>> + MMC_CACHE_FLUSH_TIMEOUT_MS);


I no longer see the issue on reverting the above hunk as Bitan suggested
but now I see the following (which is expected) ...

WARNING KERN mmc1: unspecified timeout for CMD6 - use generic

> Just as a quick sanity test, please try the below patch, which
> restores the old cache flush timeout to 10min.
>
> However, as I indicated above, this seems to be a problem that needs
> to be fixed at in the host driver side. For the sdhci driver, there is
> a bit of a tricky logic around how to deal with timeouts in
> sdhci_send_command(). My best guess is that's where we should look
> more closely (and I am doing that).
>
> From: Ulf Hansson <[email protected]>
> Date: Mon, 24 Feb 2020 11:43:33 +0100
> Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing
>
> Signed-off-by: Ulf Hansson <[email protected]>
> ---
> drivers/mmc/core/mmc_ops.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index da425ee2d9bf..713e7dd6d028 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -21,7 +21,7 @@
>
> #define MMC_OPS_TIMEOUT_MS (10 * 60 * 1000) /* 10min*/
> #define MMC_BKOPS_TIMEOUT_MS (120 * 1000) /* 120s */
> -#define MMC_CACHE_FLUSH_TIMEOUT_MS (30 * 1000) /* 30s */
> +#define MMC_CACHE_FLUSH_TIMEOUT_MS (10 * 60 * 1000) /* 10min */

This does not fix the problem for me.

Jon

--
nvpublic

2020-02-25 12:11:16

by Jon Hunter

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 25/02/2020 10:04, Jon Hunter wrote:

...

>>> I find that from the commit the changes in mmc_flush_cache below is
>>> the cause.
>>>
>>> ##
>>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
>>> (card->ext_csd.cache_size > 0) &&
>>> (card->ext_csd.cache_ctrl & 1)) {
>>> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
>>> - EXT_CSD_FLUSH_CACHE, 1, 0);
>>> + EXT_CSD_FLUSH_CACHE, 1,
>>> + MMC_CACHE_FLUSH_TIMEOUT_MS);
>
>
> I no longer see the issue on reverting the above hunk as Bitan suggested
> but now I see the following (which is expected) ...
>
> WARNING KERN mmc1: unspecified timeout for CMD6 - use generic

For Tegra, the default timeout used when no timeout is specified for CMD6
is 100mS. So hard-coding the following also appears to workaround the
problem on Tegra ...

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..5155e0240fca 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
(card->ext_csd.cache_size > 0) &&
(card->ext_csd.cache_ctrl & 1)) {
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
- EXT_CSD_FLUSH_CACHE, 1, 0);
+ EXT_CSD_FLUSH_CACHE, 1, 100);
if (err)
pr_err("%s: cache flush error %d\n",
mmc_hostname(card->host), err);

So the problem appears to be causing by the timeout being too long rather
than not long enough.

Looking more at the code, I think now that we are hitting the condition
...

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 868653bc1555..feae82b1ff35 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
* the host to avoid HW busy detection, by converting to a R1 response
* instead of a R1B.
*/
- if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+ if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+ pr_warn("%s: timeout (%d) > max busy timeout (%d)", mmc_hostname(host), timeout_ms, host->max_busy_timeout);
use_r1b_resp = false;
+ }


With the above I see ...

WARNING KERN mmc1: timeout (1600) > max busy timeout (672)

So with the longer timeout we are not using/requesting the response.

Cheers
Jon

--
nvpublic

2020-02-25 12:12:09

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Tue, 25 Feb 2020 at 11:04, Jon Hunter <[email protected]> wrote:
>
>
> On 24/02/2020 11:16, Ulf Hansson wrote:
> > + Adrian
> >
> > On Fri, 21 Feb 2020 at 20:44, Bitan Biswas <[email protected]> wrote:
> >>
> >> On 2/21/20 1:48 AM, Ulf Hansson wrote:
> >>> External email: Use caution opening links or attachments
> >>>
> >>>
> >>> On Thu, 20 Feb 2020 at 18:54, Naresh Kamboju <[email protected]> wrote:
> >>>>
> >>>> On Wed, 19 Feb 2020 at 21:54, Ulf Hansson <[email protected]> wrote:
> >>>>>
> >>>>> On Thu, 13 Feb 2020 at 16:43, Naresh Kamboju <[email protected]> wrote:
> >>>>>>
> >>>>>
> >>>>> Try to restore the value for the cache flush timeout, by updating the
> >>>>> define MMC_CACHE_FLUSH_TIMEOUT_MS to 10 * 60 * 1000".
> >>>>
> >>>> I have increased the timeout to 10 minutes but it did not help.
> >>>> Same error found.
> >>>> [ 608.679353] mmc1: Card stuck being busy! mmc_poll_for_busy
> >>>> [ 608.684964] mmc1: cache flush error -110
> >>>> [ 608.689005] blk_update_request: I/O error, dev mmcblk1, sector
> >>>> 4302400 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
> >>>>
> >>>> OTOH, What best i could do for my own experiment to revert all three patches and
> >>>> now the reported error gone and device mount successfully [1].
> >>>>
> >>>> List of patches reverted,
> >>>> mmc: core: Specify timeouts for BKOPS and CACHE_FLUSH for eMMC
> >>>> mmc: block: Use generic_cmd6_time when modifying
> >>>> INAND_CMD38_ARG_EXT_CSD
> >>>> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>
>
> Reverting all the above also fixes the problem for me.
>
> >> I find that from the commit the changes in mmc_flush_cache below is
> >> the cause.
> >>
> >> ##
> >> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
> >> (card->ext_csd.cache_size > 0) &&
> >> (card->ext_csd.cache_ctrl & 1)) {
> >> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> >> - EXT_CSD_FLUSH_CACHE, 1, 0);
> >> + EXT_CSD_FLUSH_CACHE, 1,
> >> + MMC_CACHE_FLUSH_TIMEOUT_MS);
>
>
> I no longer see the issue on reverting the above hunk as Bitan suggested
> but now I see the following (which is expected) ...
>
> WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
>
> > Just as a quick sanity test, please try the below patch, which
> > restores the old cache flush timeout to 10min.
> >
> > However, as I indicated above, this seems to be a problem that needs
> > to be fixed at in the host driver side. For the sdhci driver, there is
> > a bit of a tricky logic around how to deal with timeouts in
> > sdhci_send_command(). My best guess is that's where we should look
> > more closely (and I am doing that).
> >
> > From: Ulf Hansson <[email protected]>
> > Date: Mon, 24 Feb 2020 11:43:33 +0100
> > Subject: [PATCH] mmc: core: Restore busy timeout for eMMC cache flushing
> >
> > Signed-off-by: Ulf Hansson <[email protected]>
> > ---
> > drivers/mmc/core/mmc_ops.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> > index da425ee2d9bf..713e7dd6d028 100644
> > --- a/drivers/mmc/core/mmc_ops.c
> > +++ b/drivers/mmc/core/mmc_ops.c
> > @@ -21,7 +21,7 @@
> >
> > #define MMC_OPS_TIMEOUT_MS (10 * 60 * 1000) /* 10min*/
> > #define MMC_BKOPS_TIMEOUT_MS (120 * 1000) /* 120s */
> > -#define MMC_CACHE_FLUSH_TIMEOUT_MS (30 * 1000) /* 30s */
> > +#define MMC_CACHE_FLUSH_TIMEOUT_MS (10 * 60 * 1000) /* 10min */
>
> This does not fix the problem for me.

Thanks for testing and the report - the results are not surprising.

I am getting back with an update asap, just want to complete my tests
over here before I say more.

Kind regards
Uffe

2020-02-25 14:28:02

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

+ Faiz Abbas

On Tue, 25 Feb 2020 at 12:41, Jon Hunter <[email protected]> wrote:
>
>
> On 25/02/2020 10:04, Jon Hunter wrote:
>
> ...
>
> >>> I find that from the commit the changes in mmc_flush_cache below is
> >>> the cause.
> >>>
> >>> ##
> >>> @@ -961,7 +963,8 @@ int mmc_flush_cache(struct mmc_card *card)
> >>> (card->ext_csd.cache_size > 0) &&
> >>> (card->ext_csd.cache_ctrl & 1)) {
> >>> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> >>> - EXT_CSD_FLUSH_CACHE, 1, 0);
> >>> + EXT_CSD_FLUSH_CACHE, 1,
> >>> + MMC_CACHE_FLUSH_TIMEOUT_MS);
> >
> >
> > I no longer see the issue on reverting the above hunk as Bitan suggested
> > but now I see the following (which is expected) ...
> >
> > WARNING KERN mmc1: unspecified timeout for CMD6 - use generic
>
> For Tegra, the default timeout used when no timeout is specified for CMD6
> is 100mS. So hard-coding the following also appears to workaround the
> problem on Tegra ...

Interesting.

>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 868653bc1555..5155e0240fca 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -992,7 +992,7 @@ int mmc_flush_cache(struct mmc_card *card)
> (card->ext_csd.cache_size > 0) &&
> (card->ext_csd.cache_ctrl & 1)) {
> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> - EXT_CSD_FLUSH_CACHE, 1, 0);
> + EXT_CSD_FLUSH_CACHE, 1, 100);
> if (err)
> pr_err("%s: cache flush error %d\n",
> mmc_hostname(card->host), err);
>
> So the problem appears to be causing by the timeout being too long rather
> than not long enough.
>
> Looking more at the code, I think now that we are hitting the condition
> ...
>
> diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
> index 868653bc1555..feae82b1ff35 100644
> --- a/drivers/mmc/core/mmc_ops.c
> +++ b/drivers/mmc/core/mmc_ops.c
> @@ -579,8 +579,10 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
> * the host to avoid HW busy detection, by converting to a R1 response
> * instead of a R1B.
> */
> - if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
> + if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
> + pr_warn("%s: timeout (%d) > max busy timeout (%d)", mmc_hostname(host), timeout_ms, host->max_busy_timeout);
> use_r1b_resp = false;
> + }
>
>
> With the above I see ...
>
> WARNING KERN mmc1: timeout (1600) > max busy timeout (672)
>
> So with the longer timeout we are not using/requesting the response.

You are most likely correct.

However, from the core point of view, the response is still requested,
only that we don't want the driver to wait for the card to stop
signaling busy. Instead we want to deal with that via "polling" from
the core.

This is a rather worrying behaviour, as it seems like the host driver
doesn't really follow this expectations from the core point of view.
And mmc_flush_cache() is not the only case, as we have erase, bkops,
sanitize, etc. Are all these working or not really well tested?

Earlier, before my three patches, if the provided timeout_ms parameter
to __mmc_switch() was zero, which was the case for
mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
ignored validating host->max_busy_timeout, which was wrong. In any
case, this also meant that an R1B response was always used for
mmc_flush_cache(), as you also indicated above. Perhaps this is the
critical part where things can go wrong.

BTW, have you tried erase commands for sdhci tegra driver? If those
are working fine, do you have any special treatments for these?

I have looped in Faiz, as sdhci-omap seems to suffer from very similar
problems. One thing I noted for sdhci-omap, is that MMC_ERASE commands
is treated in a special manner in sdhci_omap_set_timeout(). This
indicates that there is something fishy going on.

Faiz, can you please comment on this?

Kind regards
Uffe

2020-02-25 16:25:16

by Jon Hunter

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 25/02/2020 14:26, Ulf Hansson wrote:

...

> However, from the core point of view, the response is still requested,
> only that we don't want the driver to wait for the card to stop
> signaling busy. Instead we want to deal with that via "polling" from
> the core.
>
> This is a rather worrying behaviour, as it seems like the host driver
> doesn't really follow this expectations from the core point of view.
> And mmc_flush_cache() is not the only case, as we have erase, bkops,
> sanitize, etc. Are all these working or not really well tested?

I don't believe that they are well tested. We have a simple test to
mount an eMMC partition, create a file, check the contents, remove the
file and unmount. The timeouts always occur during unmounting.

> Earlier, before my three patches, if the provided timeout_ms parameter
> to __mmc_switch() was zero, which was the case for
> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> ignored validating host->max_busy_timeout, which was wrong. In any
> case, this also meant that an R1B response was always used for
> mmc_flush_cache(), as you also indicated above. Perhaps this is the
> critical part where things can go wrong.
>
> BTW, have you tried erase commands for sdhci tegra driver? If those
> are working fine, do you have any special treatments for these?

That I am not sure, but I will check.

Cheers
Jon

--
nvpublic

2020-02-26 15:23:47

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

+ Anders, Kishon

On Tue, 25 Feb 2020 at 17:24, Jon Hunter <[email protected]> wrote:
>
>
> On 25/02/2020 14:26, Ulf Hansson wrote:
>
> ...
>
> > However, from the core point of view, the response is still requested,
> > only that we don't want the driver to wait for the card to stop
> > signaling busy. Instead we want to deal with that via "polling" from
> > the core.
> >
> > This is a rather worrying behaviour, as it seems like the host driver
> > doesn't really follow this expectations from the core point of view.
> > And mmc_flush_cache() is not the only case, as we have erase, bkops,
> > sanitize, etc. Are all these working or not really well tested?
>
> I don't believe that they are well tested. We have a simple test to
> mount an eMMC partition, create a file, check the contents, remove the
> file and unmount. The timeouts always occur during unmounting.
>
> > Earlier, before my three patches, if the provided timeout_ms parameter
> > to __mmc_switch() was zero, which was the case for
> > mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> > ignored validating host->max_busy_timeout, which was wrong. In any
> > case, this also meant that an R1B response was always used for
> > mmc_flush_cache(), as you also indicated above. Perhaps this is the
> > critical part where things can go wrong.
> >
> > BTW, have you tried erase commands for sdhci tegra driver? If those
> > are working fine, do you have any special treatments for these?
>
> That I am not sure, but I will check.

Great, thanks. Looking forward to your report.

So, from my side, me and Anders Roxell, have been collaborating on
testing the behaviour on a TI Beagleboard x15 (remotely with limited
debug options), which is using the sdhci-omap variant. I am trying to
get hold of an Nvidia jetson-TX2, but not found one yet. These are the
conclusions from the observed behaviour on the Beagleboard for the
CMD6 cache flush command.

First, the reported host->max_busy_timeout is 2581 (ms) for the
sdhci-omap driver in this configuration.

1. As we all know by now, the cache flush command (CMD6) fails with
-110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
from the command.

2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
Then everything works fine.

3. Updating the code to again use 30s as the
MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
set, even when the timeout_ms becomes greater than max_busy_timeout.
This also works fine.

Clearly this indicates a problem that I think needs to be addressed in
the sdhci driver. However, of course I can revert the three discussed
patches to fix the problem, but that would only hide the issues and I
am sure we would then get back to this issue, sooner or later.

To fix the problem in the sdhci driver, I would appreciate if someone
from TI and Nvidia can step in to help, as I don't have the HW on my
desk.

Comments or other ideas of how to move forward?

Kind regards
Uffe

2020-02-26 17:06:43

by Jon Hunter

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 26/02/2020 15:21, Ulf Hansson wrote:
> + Anders, Kishon
>
> On Tue, 25 Feb 2020 at 17:24, Jon Hunter <[email protected]> wrote:
>>
>>
>> On 25/02/2020 14:26, Ulf Hansson wrote:
>>
>> ...
>>
>>> However, from the core point of view, the response is still requested,
>>> only that we don't want the driver to wait for the card to stop
>>> signaling busy. Instead we want to deal with that via "polling" from
>>> the core.
>>>
>>> This is a rather worrying behaviour, as it seems like the host driver
>>> doesn't really follow this expectations from the core point of view.
>>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
>>> sanitize, etc. Are all these working or not really well tested?
>>
>> I don't believe that they are well tested. We have a simple test to
>> mount an eMMC partition, create a file, check the contents, remove the
>> file and unmount. The timeouts always occur during unmounting.
>>
>>> Earlier, before my three patches, if the provided timeout_ms parameter
>>> to __mmc_switch() was zero, which was the case for
>>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
>>> ignored validating host->max_busy_timeout, which was wrong. In any
>>> case, this also meant that an R1B response was always used for
>>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
>>> critical part where things can go wrong.
>>>
>>> BTW, have you tried erase commands for sdhci tegra driver? If those
>>> are working fine, do you have any special treatments for these?
>>
>> That I am not sure, but I will check.
>
> Great, thanks. Looking forward to your report.

So no I don't believe that we have explicitly tried erase commands and
no there is nothing special that we are doing for erase.

> So, from my side, me and Anders Roxell, have been collaborating on
> testing the behaviour on a TI Beagleboard x15 (remotely with limited
> debug options), which is using the sdhci-omap variant. I am trying to
> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> conclusions from the observed behaviour on the Beagleboard for the
> CMD6 cache flush command.
>
> First, the reported host->max_busy_timeout is 2581 (ms) for the
> sdhci-omap driver in this configuration.
>
> 1. As we all know by now, the cache flush command (CMD6) fails with
> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> from the command.
>
> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> Then everything works fine.
>
> 3. Updating the code to again use 30s as the
> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> set, even when the timeout_ms becomes greater than max_busy_timeout.
> This also works fine.
>
> Clearly this indicates a problem that I think needs to be addressed in
> the sdhci driver. However, of course I can revert the three discussed
> patches to fix the problem, but that would only hide the issues and I
> am sure we would then get back to this issue, sooner or later.
>
> To fix the problem in the sdhci driver, I would appreciate if someone
> from TI and Nvidia can step in to help, as I don't have the HW on my
> desk.

OK, let me check to see who can help from our side.

Cheers
Jon

--
nvpublic

2020-03-02 13:11:32

by Faiz Abbas

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

Uffe,

On 26/02/20 8:51 pm, Ulf Hansson wrote:
> + Anders, Kishon
>
> On Tue, 25 Feb 2020 at 17:24, Jon Hunter <[email protected]> wrote:
>>
>>
>> On 25/02/2020 14:26, Ulf Hansson wrote:
>>
>> ...
>>
>>> However, from the core point of view, the response is still requested,
>>> only that we don't want the driver to wait for the card to stop
>>> signaling busy. Instead we want to deal with that via "polling" from
>>> the core.
>>>
>>> This is a rather worrying behaviour, as it seems like the host driver
>>> doesn't really follow this expectations from the core point of view.
>>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
>>> sanitize, etc. Are all these working or not really well tested?
>>
>> I don't believe that they are well tested. We have a simple test to
>> mount an eMMC partition, create a file, check the contents, remove the
>> file and unmount. The timeouts always occur during unmounting.
>>
>>> Earlier, before my three patches, if the provided timeout_ms parameter
>>> to __mmc_switch() was zero, which was the case for
>>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
>>> ignored validating host->max_busy_timeout, which was wrong. In any
>>> case, this also meant that an R1B response was always used for
>>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
>>> critical part where things can go wrong.
>>>
>>> BTW, have you tried erase commands for sdhci tegra driver? If those
>>> are working fine, do you have any special treatments for these?
>>
>> That I am not sure, but I will check.
>
> Great, thanks. Looking forward to your report.
>
> So, from my side, me and Anders Roxell, have been collaborating on
> testing the behaviour on a TI Beagleboard x15 (remotely with limited
> debug options), which is using the sdhci-omap variant. I am trying to
> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> conclusions from the observed behaviour on the Beagleboard for the
> CMD6 cache flush command.
>
> First, the reported host->max_busy_timeout is 2581 (ms) for the
> sdhci-omap driver in this configuration.
>
> 1. As we all know by now, the cache flush command (CMD6) fails with
> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> from the command.
>
> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> Then everything works fine.
>
> 3. Updating the code to again use 30s as the
> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> set, even when the timeout_ms becomes greater than max_busy_timeout.
> This also works fine.
>
> Clearly this indicates a problem that I think needs to be addressed in
> the sdhci driver. However, of course I can revert the three discussed
> patches to fix the problem, but that would only hide the issues and I
> am sure we would then get back to this issue, sooner or later.
>
> To fix the problem in the sdhci driver, I would appreciate if someone
> from TI and Nvidia can step in to help, as I don't have the HW on my
> desk.
>
> Comments or other ideas of how to move forward?
>

Sorry I missed this earlier.

I don't have an X15 with me here but I'm trying to set one up in our
remote farm. In the meantime, I tried to reproduce this issue on two
platforms (dra72-evm and am57xx-evm) and wasn't able to see the issue
because those eMMC's don't even have a cache. I will keep you updated
when I do get a board with a eMMC that has a cache.

Is there a way to reproduce this CMD6 issue with another operation?

Thanks,
Faiz

2020-03-02 16:51:47

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Mon, 2 Mar 2020 at 14:11, Faiz Abbas <[email protected]> wrote:
>
> Uffe,
>
> On 26/02/20 8:51 pm, Ulf Hansson wrote:
> > + Anders, Kishon
> >
> > On Tue, 25 Feb 2020 at 17:24, Jon Hunter <[email protected]> wrote:
> >>
> >>
> >> On 25/02/2020 14:26, Ulf Hansson wrote:
> >>
> >> ...
> >>
> >>> However, from the core point of view, the response is still requested,
> >>> only that we don't want the driver to wait for the card to stop
> >>> signaling busy. Instead we want to deal with that via "polling" from
> >>> the core.
> >>>
> >>> This is a rather worrying behaviour, as it seems like the host driver
> >>> doesn't really follow this expectations from the core point of view.
> >>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
> >>> sanitize, etc. Are all these working or not really well tested?
> >>
> >> I don't believe that they are well tested. We have a simple test to
> >> mount an eMMC partition, create a file, check the contents, remove the
> >> file and unmount. The timeouts always occur during unmounting.
> >>
> >>> Earlier, before my three patches, if the provided timeout_ms parameter
> >>> to __mmc_switch() was zero, which was the case for
> >>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> >>> ignored validating host->max_busy_timeout, which was wrong. In any
> >>> case, this also meant that an R1B response was always used for
> >>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
> >>> critical part where things can go wrong.
> >>>
> >>> BTW, have you tried erase commands for sdhci tegra driver? If those
> >>> are working fine, do you have any special treatments for these?
> >>
> >> That I am not sure, but I will check.
> >
> > Great, thanks. Looking forward to your report.
> >
> > So, from my side, me and Anders Roxell, have been collaborating on
> > testing the behaviour on a TI Beagleboard x15 (remotely with limited
> > debug options), which is using the sdhci-omap variant. I am trying to
> > get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> > conclusions from the observed behaviour on the Beagleboard for the
> > CMD6 cache flush command.
> >
> > First, the reported host->max_busy_timeout is 2581 (ms) for the
> > sdhci-omap driver in this configuration.
> >
> > 1. As we all know by now, the cache flush command (CMD6) fails with
> > -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> > 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> > from the command.
> >
> > 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> > the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> > timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> > Then everything works fine.
> >
> > 3. Updating the code to again use 30s as the
> > MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> > set, even when the timeout_ms becomes greater than max_busy_timeout.
> > This also works fine.
> >
> > Clearly this indicates a problem that I think needs to be addressed in
> > the sdhci driver. However, of course I can revert the three discussed
> > patches to fix the problem, but that would only hide the issues and I
> > am sure we would then get back to this issue, sooner or later.
> >
> > To fix the problem in the sdhci driver, I would appreciate if someone
> > from TI and Nvidia can step in to help, as I don't have the HW on my
> > desk.
> >
> > Comments or other ideas of how to move forward?
> >
>
> Sorry I missed this earlier.
>
> I don't have an X15 with me here but I'm trying to set one up in our
> remote farm. In the meantime, I tried to reproduce this issue on two
> platforms (dra72-evm and am57xx-evm) and wasn't able to see the issue
> because those eMMC's don't even have a cache. I will keep you updated
> when I do get a board with a eMMC that has a cache.
>
> Is there a way to reproduce this CMD6 issue with another operation?

Yes, most definitely.

Let me cook a debug patch for you that should trigger the problem for
another CMD6 operation. I will post something later this evening or in
the mornings (Swedish timezone).

Kind regards
Uffe

2020-03-03 21:52:50

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Mon, 2 Mar 2020 at 17:50, Ulf Hansson <[email protected]> wrote:
>
> On Mon, 2 Mar 2020 at 14:11, Faiz Abbas <[email protected]> wrote:
> >
> > Uffe,
> >
> > On 26/02/20 8:51 pm, Ulf Hansson wrote:
> > > + Anders, Kishon
> > >
> > > On Tue, 25 Feb 2020 at 17:24, Jon Hunter <[email protected]> wrote:
> > >>
> > >>
> > >> On 25/02/2020 14:26, Ulf Hansson wrote:
> > >>
> > >> ...
> > >>
> > >>> However, from the core point of view, the response is still requested,
> > >>> only that we don't want the driver to wait for the card to stop
> > >>> signaling busy. Instead we want to deal with that via "polling" from
> > >>> the core.
> > >>>
> > >>> This is a rather worrying behaviour, as it seems like the host driver
> > >>> doesn't really follow this expectations from the core point of view.
> > >>> And mmc_flush_cache() is not the only case, as we have erase, bkops,
> > >>> sanitize, etc. Are all these working or not really well tested?
> > >>
> > >> I don't believe that they are well tested. We have a simple test to
> > >> mount an eMMC partition, create a file, check the contents, remove the
> > >> file and unmount. The timeouts always occur during unmounting.
> > >>
> > >>> Earlier, before my three patches, if the provided timeout_ms parameter
> > >>> to __mmc_switch() was zero, which was the case for
> > >>> mmc_mmc_flush_cache() - this lead to that __mmc_switch() simply
> > >>> ignored validating host->max_busy_timeout, which was wrong. In any
> > >>> case, this also meant that an R1B response was always used for
> > >>> mmc_flush_cache(), as you also indicated above. Perhaps this is the
> > >>> critical part where things can go wrong.
> > >>>
> > >>> BTW, have you tried erase commands for sdhci tegra driver? If those
> > >>> are working fine, do you have any special treatments for these?
> > >>
> > >> That I am not sure, but I will check.
> > >
> > > Great, thanks. Looking forward to your report.
> > >
> > > So, from my side, me and Anders Roxell, have been collaborating on
> > > testing the behaviour on a TI Beagleboard x15 (remotely with limited
> > > debug options), which is using the sdhci-omap variant. I am trying to
> > > get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> > > conclusions from the observed behaviour on the Beagleboard for the
> > > CMD6 cache flush command.
> > >
> > > First, the reported host->max_busy_timeout is 2581 (ms) for the
> > > sdhci-omap driver in this configuration.
> > >
> > > 1. As we all know by now, the cache flush command (CMD6) fails with
> > > -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> > > 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> > > from the command.
> > >
> > > 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> > > the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> > > timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> > > Then everything works fine.
> > >
> > > 3. Updating the code to again use 30s as the
> > > MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> > > set, even when the timeout_ms becomes greater than max_busy_timeout.
> > > This also works fine.
> > >
> > > Clearly this indicates a problem that I think needs to be addressed in
> > > the sdhci driver. However, of course I can revert the three discussed
> > > patches to fix the problem, but that would only hide the issues and I
> > > am sure we would then get back to this issue, sooner or later.
> > >
> > > To fix the problem in the sdhci driver, I would appreciate if someone
> > > from TI and Nvidia can step in to help, as I don't have the HW on my
> > > desk.
> > >
> > > Comments or other ideas of how to move forward?
> > >
> >
> > Sorry I missed this earlier.
> >
> > I don't have an X15 with me here but I'm trying to set one up in our
> > remote farm. In the meantime, I tried to reproduce this issue on two
> > platforms (dra72-evm and am57xx-evm) and wasn't able to see the issue
> > because those eMMC's don't even have a cache. I will keep you updated
> > when I do get a board with a eMMC that has a cache.
> >
> > Is there a way to reproduce this CMD6 issue with another operation?
>
> Yes, most definitely.
>
> Let me cook a debug patch for you that should trigger the problem for
> another CMD6 operation. I will post something later this evening or in
> the mornings (Swedish timezone).

A bit later than promised, I am clearly an optimist. In any case
here's the patch I had in mind to trigger the problem for other CMD6
operations. Please give at shot and see what happens.

-------

From: Ulf Hansson <[email protected]>
Date: Tue, 3 Mar 2020 22:11:05 +0100
Subject: [PATCH] mmc: core: DEBUG: Force a long timeout for all CMD6

This is to test sdhci-omap, for example, to see what happens when using a
longer timeout. My guess is that it triggers __mmc_switch() to disable the
MMC_RSP_BUSY flag for the command. If so, it likely to make the host driver
to fail, in some way or the other.

Signed-off-by: Ulf Hansson <[email protected]>
---
drivers/mmc/core/mmc_ops.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index da425ee2d9bf..f0d2563961f6 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -532,6 +532,9 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8
index, u8 value,

mmc_retune_hold(host);

+ /* Force a long timeout to likely make use_r1b_resp to become false. */
+ timeout_ms = MMC_CACHE_FLUSH_TIMEOUT_MS;
+
if (!timeout_ms) {
pr_warn("%s: unspecified timeout for CMD6 - use generic\n",
mmc_hostname(host));
@@ -544,8 +547,11 @@ int __mmc_switch(struct mmc_card *card, u8 set,
u8 index, u8 value,
* the host to avoid HW busy detection, by converting to a R1 response
* instead of a R1B.
*/
- if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout))
+ if (host->max_busy_timeout && (timeout_ms > host->max_busy_timeout)) {
+ pr_warn("%s:Disable MMC_RSP_BUSY. timeout_ms(%u) >
max_busy_timeout(%u)\n",
+ mmc_hostname(host), timeout_ms, host->max_busy_timeout);
use_r1b_resp = false;
+ }

cmd.opcode = MMC_SWITCH;
cmd.arg = (MMC_SWITCH_MODE_WRITE_BYTE << 24) |
--

Kind regards
Uffe

2020-03-04 10:19:05

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

[...]

>
> So, from my side, me and Anders Roxell, have been collaborating on
> testing the behaviour on a TI Beagleboard x15 (remotely with limited
> debug options), which is using the sdhci-omap variant. I am trying to
> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
> conclusions from the observed behaviour on the Beagleboard for the
> CMD6 cache flush command.
>
> First, the reported host->max_busy_timeout is 2581 (ms) for the
> sdhci-omap driver in this configuration.
>
> 1. As we all know by now, the cache flush command (CMD6) fails with
> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> from the command.
>
> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> Then everything works fine.
>
> 3. Updating the code to again use 30s as the
> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
> set, even when the timeout_ms becomes greater than max_busy_timeout.
> This also works fine.
>
> Clearly this indicates a problem that I think needs to be addressed in
> the sdhci driver. However, of course I can revert the three discussed
> patches to fix the problem, but that would only hide the issues and I
> am sure we would then get back to this issue, sooner or later.
>
> To fix the problem in the sdhci driver, I would appreciate if someone
> from TI and Nvidia can step in to help, as I don't have the HW on my
> desk.
>
> Comments or other ideas of how to move forward?

[...]

> Hi Ulf,
>
> I could repro during suspend on Jetson TX1/TX2 as when it does mmc flush cache.

Okay, great.

>
>
> Timeout I see is for switch status CMD13 after sending CMD6 as device side CMD6 is still inflight while host sends CMD13 as we are using R1 response type with timeout_ms changes to 30s.
>
>
>
> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it uses R1B response type and host will wait for busy state followed by response from device for CMD6 and then data lines go High.
>
>
>
> Now with timeout_ms changed to 30s, we use R1 response and SW waits for busy by checking for DAT0 line to go High.

If I understand correctly, because of the timeout now set to 30s,
MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
sdhci-tegra as well?

In other words, mmc_poll_for_busy() is being called, which in your
case means the ->card_busy() host ops (set to sdhci_card_busy() in
your case) will be invoked to wait for the card to stop signal busy on
DAT0.

This indicates to me, that the ->card_busy() ops returns zero to
inform that the card is *not* busy, even if the card actually signals
busy? Is that correct?

>
>
>
> With R1B type, host design after sending command at end of completion after end bit waits for 2 cycles for data line to go low (busy state from device) and waits for response cycles after which data lines will go back high and then we issue switch status CMD13.
>
>
>
> With R1 type, host after sending command and at end of completion after end bit, DATA lines will go high immediately as its R1 type and switch status CMD13 gets issued but by this time it looks like CMD6 on device side is still in flight for sending status and data.

So, yes, using R1 instead of R1B triggers a different behaviour, but
according to the eMMC spec it's perfectly allowed to issue a CMD13
even if the card signals busy on DAT0. The CMD13 is not using the DATA
lines, so this should work.

If I understand correctly, your driver (and controller?) has issues
with coping with this scenario. Is it something that can be fixed?

>
>
> 30s timeout is the wait time for data0 line to go high and mmc_busy_status will return success right away with R1 response type and SW sends switch status CMD13 but during that time on device side looks like still processing CMD6 as we are not waiting for enough time when we use R1 response type.

Right, as stated above, isn't sdhci_card_busy() working for your case?
Can we fix it?

>
>
>
>
> Actually we always use R1B with CMD6 as per spec.

I fully agree that R1B is preferable, but it's not against the spec to
send CMD13 to poll for busy.

Moreover, we need to cope with the scenario when the host has
specified a maximum timeout that isn't sufficiently long enough for
the requested operation. Do you have another proposal for how to
manage this, but disabling MMC_RSP_BUSY?

Let's assume you driver would get a R1B for the CMD6 (we force it),
then what timeout would the driver be using if we would set
cmd.busy_timeout to 30ms?

Kind regards
Uffe

2020-03-04 10:33:44

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

[...]

>
> > Actually we always use R1B with CMD6 as per spec.
>
> I fully agree that R1B is preferable, but it's not against the spec to
> send CMD13 to poll for busy.
>
> Moreover, we need to cope with the scenario when the host has
> specified a maximum timeout that isn't sufficiently long enough for
> the requested operation. Do you have another proposal for how to
> manage this, but disabling MMC_RSP_BUSY?
>
> Let's assume you driver would get a R1B for the CMD6 (we force it),
> then what timeout would the driver be using if we would set
> cmd.busy_timeout to 30ms?

/s/30ms/30s

Kind regards
Uffe

2020-03-04 16:56:34

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 2:18 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> [...]
>
>> So, from my side, me and Anders Roxell, have been collaborating on
>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>> debug options), which is using the sdhci-omap variant. I am trying to
>> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
>> conclusions from the observed behaviour on the Beagleboard for the
>> CMD6 cache flush command.
>>
>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>> sdhci-omap driver in this configuration.
>>
>> 1. As we all know by now, the cache flush command (CMD6) fails with
>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>> from the command.
>>
>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>> Then everything works fine.
>>
>> 3. Updating the code to again use 30s as the
>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>> This also works fine.
>>
>> Clearly this indicates a problem that I think needs to be addressed in
>> the sdhci driver. However, of course I can revert the three discussed
>> patches to fix the problem, but that would only hide the issues and I
>> am sure we would then get back to this issue, sooner or later.
>>
>> To fix the problem in the sdhci driver, I would appreciate if someone
>> from TI and Nvidia can step in to help, as I don't have the HW on my
>> desk.
>>
>> Comments or other ideas of how to move forward?
> [...]
>
>> Hi Ulf,
>>
>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc flush cache.
> Okay, great.
>
>>
>> Timeout I see is for switch status CMD13 after sending CMD6 as device side CMD6 is still inflight while host sends CMD13 as we are using R1 response type with timeout_ms changes to 30s.
>>
>>
>>
>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it uses R1B response type and host will wait for busy state followed by response from device for CMD6 and then data lines go High.
>>
>>
>>
>> Now with timeout_ms changed to 30s, we use R1 response and SW waits for busy by checking for DAT0 line to go High.
> If I understand correctly, because of the timeout now set to 30s,
> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
> sdhci-tegra as well?
Yes
>
> In other words, mmc_poll_for_busy() is being called, which in your
> case means the ->card_busy() host ops (set to sdhci_card_busy() in
> your case) will be invoked to wait for the card to stop signal busy on
> DAT0.
>
> This indicates to me, that the ->card_busy() ops returns zero to
> inform that the card is *not* busy, even if the card actually signals
> busy? Is that correct?
Yes
>
>>
>>
>> With R1B type, host design after sending command at end of completion after end bit waits for 2 cycles for data line to go low (busy state from device) and waits for response cycles after which data lines will go back high and then we issue switch status CMD13.
>>
>>
>>
>> With R1 type, host after sending command and at end of completion after end bit, DATA lines will go high immediately as its R1 type and switch status CMD13 gets issued but by this time it looks like CMD6 on device side is still in flight for sending status and data.
> So, yes, using R1 instead of R1B triggers a different behaviour, but
> according to the eMMC spec it's perfectly allowed to issue a CMD13
> even if the card signals busy on DAT0. The CMD13 is not using the DATA
> lines, so this should work.
>
> If I understand correctly, your driver (and controller?) has issues
> with coping with this scenario. Is it something that can be fixed?
>
>>
>> 30s timeout is the wait time for data0 line to go high and mmc_busy_status will return success right away with R1 response type and SW sends switch status CMD13 but during that time on device side looks like still processing CMD6 as we are not waiting for enough time when we use R1 response type.
> Right, as stated above, isn't sdhci_card_busy() working for your case?
> Can we fix it?

sdhci_card_busy() returned 0 indicating its not busy.

Based on our host design, When CMD6 is issued with R1 type, we program
it as NO_RESPONSE and with this command complete interrupt happens right
at end bit of command and there will be no transfer complete interrupt.

When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT and
with this command complete is end bit of device resp and transfer
complete interrupt will be when DAT0 LOW -> HIGH.

Regardless of R1/R1B, device side CMD6 will always have busy state on D0
and response on CMD lines.

There will be 2 clock cycles period after sending CMD6 for device to
send busy state on data0.

In case of R1 type, after sending command DAT will stay high and looks
like we are polling for busy early before busy state has started and
sending CMD13 while device is busy and sending response on CMD line is
causing timeout.

Probably with this specific case of CMD6 with R1 type, to wait for card
busy we should poll for DAT0 to go Low first and then to go High??

>
>>
>>
>>
>> Actually we always use R1B with CMD6 as per spec.
> I fully agree that R1B is preferable, but it's not against the spec to
> send CMD13 to poll for busy.
>
> Moreover, we need to cope with the scenario when the host has
> specified a maximum timeout that isn't sufficiently long enough for
> the requested operation. Do you have another proposal for how to
> manage this, but disabling MMC_RSP_BUSY?
>
> Let's assume you driver would get a R1B for the CMD6 (we force it),
> then what timeout would the driver be using if we would set
> cmd.busy_timeout to 30ms?
>
> Kind regards
> Uffe

2020-03-04 17:22:56

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> [...]
>>
>>> So, from my side, me and Anders Roxell, have been collaborating on
>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>> debug options), which is using the sdhci-omap variant. I am trying to
>>> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
>>> conclusions from the observed behaviour on the Beagleboard for the
>>> CMD6 cache flush command.
>>>
>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>> sdhci-omap driver in this configuration.
>>>
>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>> from the command.
>>>
>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>> Then everything works fine.
>>>
>>> 3. Updating the code to again use 30s as the
>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>> This also works fine.
>>>
>>> Clearly this indicates a problem that I think needs to be addressed in
>>> the sdhci driver. However, of course I can revert the three discussed
>>> patches to fix the problem, but that would only hide the issues and I
>>> am sure we would then get back to this issue, sooner or later.
>>>
>>> To fix the problem in the sdhci driver, I would appreciate if someone
>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>> desk.
>>>
>>> Comments or other ideas of how to move forward?
>> [...]
>>
>>> Hi Ulf,
>>>
>>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc
>>> flush cache.
>> Okay, great.
>>
>>>
>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>> device side CMD6 is still inflight while host sends CMD13 as we are
>>> using R1 response type with timeout_ms changes to 30s.
>>>
>>>
>>>
>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it
>>> uses R1B response type and host will wait for busy state followed by
>>> response from device for CMD6 and then data lines go High.
>>>
>>>
>>>
>>> Now with timeout_ms changed to 30s, we use R1 response and SW waits
>>> for busy by checking for DAT0 line to go High.
>> If I understand correctly, because of the timeout now set to 30s,
>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>> sdhci-tegra as well?
> Yes
>>
>> In other words, mmc_poll_for_busy() is being called, which in your
>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>> your case) will be invoked to wait for the card to stop signal busy on
>> DAT0.
>>
>> This indicates to me, that the ->card_busy() ops returns zero to
>> inform that the card is *not* busy, even if the card actually signals
>> busy? Is that correct?
> Yes
>>
>>>
>>>
>>> With R1B type, host design after sending command at end of
>>> completion after end bit waits for 2 cycles for data line to go low
>>> (busy state from device) and waits for response cycles after which
>>> data lines will go back high and then we issue switch status CMD13.
>>>
>>>
>>>
>>> With R1 type, host after sending command and at end of completion
>>> after end bit, DATA lines will go high immediately as its R1 type
>>> and switch status CMD13 gets issued but by this time it looks like
>>> CMD6 on device side is still in flight for sending status and data.
>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>> even if the card signals busy on DAT0. The CMD13 is not using the DATA
>> lines, so this should work.
>>
>> If I understand correctly, your driver (and controller?) has issues
>> with coping with this scenario. Is it something that can be fixed?
>>
>>>
>>> 30s timeout is the wait time for data0 line to go high and
>>> mmc_busy_status will return success right away with R1 response type
>>> and SW sends switch status CMD13 but during that time on device side
>>> looks like still processing CMD6 as we are not waiting for enough
>>> time when we use R1 response type.
>> Right, as stated above, isn't sdhci_card_busy() working for your case?
>> Can we fix it?
>
> sdhci_card_busy() returned 0 indicating its not busy.
>
> Based on our host design, When CMD6 is issued with R1 type, we program
> it as NO_RESPONSE and with this command complete interrupt happens
> right at end bit of command and there will be no transfer complete
> interrupt.
*[Correction] Based on our host design, When CMD6 is issued with R1 type
as we program it as NO_RESPONSE and with this command complete interrupt
happens right at end bit of command and there will be no transfer
complete interrupt.
>
> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT and
> with this command complete is end bit of device resp and transfer
> complete interrupt will be when DAT0 LOW -> HIGH.
>
> Regardless of R1/R1B, device side CMD6 will always have busy state on
> D0 and response on CMD lines.
>
> There will be 2 clock cycles period after sending CMD6 for device to
> send busy state on data0.
>
> In case of R1 type, after sending command DAT will stay high and looks
> like we are polling for busy early before busy state has started and
> sending CMD13 while device is busy and sending response on CMD line is
> causing timeout.
>
> Probably with this specific case of CMD6 with R1 type, to wait for
> card busy we should poll for DAT0 to go Low first and then to go High??
>
>>
>>>
>>>
>>>
>>> Actually we always use R1B with CMD6 as per spec.
>> I fully agree that R1B is preferable, but it's not against the spec to
>> send CMD13 to poll for busy.
>>
>> Moreover, we need to cope with the scenario when the host has
>> specified a maximum timeout that isn't sufficiently long enough for
>> the requested operation. Do you have another proposal for how to
>> manage this, but disabling MMC_RSP_BUSY?
>>
>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>> then what timeout would the driver be using if we would set
>> cmd.busy_timeout to 30ms?
>>
>> Kind regards
>> Uffe

2020-03-04 17:27:39

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> [...]
>>>
>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>>> debug options), which is using the sdhci-omap variant. I am trying to
>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These are the
>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>> CMD6 cache flush command.
>>>>
>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>> sdhci-omap driver in this configuration.
>>>>
>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to 30 *
>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>> from the command.
>>>>
>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>> Then everything works fine.
>>>>
>>>> 3. Updating the code to again use 30s as the
>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY to be
>>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>>> This also works fine.
>>>>
>>>> Clearly this indicates a problem that I think needs to be addressed in
>>>> the sdhci driver. However, of course I can revert the three discussed
>>>> patches to fix the problem, but that would only hide the issues and I
>>>> am sure we would then get back to this issue, sooner or later.
>>>>
>>>> To fix the problem in the sdhci driver, I would appreciate if someone
>>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>>> desk.
>>>>
>>>> Comments or other ideas of how to move forward?
>>> [...]
>>>
>>>> Hi Ulf,
>>>>
>>>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc
>>>> flush cache.
>>> Okay, great.
>>>
>>>>
>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>> device side CMD6 is still inflight while host sends CMD13 as we are
>>>> using R1 response type with timeout_ms changes to 30s.
>>>>
>>>>
>>>>
>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it
>>>> uses R1B response type and host will wait for busy state followed
>>>> by response from device for CMD6 and then data lines go High.
>>>>
>>>>
>>>>
>>>> Now with timeout_ms changed to 30s, we use R1 response and SW waits
>>>> for busy by checking for DAT0 line to go High.
>>> If I understand correctly, because of the timeout now set to 30s,
>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>> sdhci-tegra as well?
>> Yes
>>>
>>> In other words, mmc_poll_for_busy() is being called, which in your
>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>> your case) will be invoked to wait for the card to stop signal busy on
>>> DAT0.
>>>
>>> This indicates to me, that the ->card_busy() ops returns zero to
>>> inform that the card is *not* busy, even if the card actually signals
>>> busy? Is that correct?
>> Yes
>>>
>>>>
>>>>
>>>> With R1B type, host design after sending command at end of
>>>> completion after end bit waits for 2 cycles for data line to go low
>>>> (busy state from device) and waits for response cycles after which
>>>> data lines will go back high and then we issue switch status CMD13.
>>>>
>>>>
>>>>
>>>> With R1 type, host after sending command and at end of completion
>>>> after end bit, DATA lines will go high immediately as its R1 type
>>>> and switch status CMD13 gets issued but by this time it looks like
>>>> CMD6 on device side is still in flight for sending status and data.
>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>> even if the card signals busy on DAT0. The CMD13 is not using the DATA
>>> lines, so this should work.
>>>
>>> If I understand correctly, your driver (and controller?) has issues
>>> with coping with this scenario. Is it something that can be fixed?
>>>
>>>>
>>>> 30s timeout is the wait time for data0 line to go high and
>>>> mmc_busy_status will return success right away with R1 response
>>>> type and SW sends switch status CMD13 but during that time on
>>>> device side looks like still processing CMD6 as we are not waiting
>>>> for enough time when we use R1 response type.
>>> Right, as stated above, isn't sdhci_card_busy() working for your case?
>>> Can we fix it?
>>
>> sdhci_card_busy() returned 0 indicating its not busy.
>>
>> Based on our host design, When CMD6 is issued with R1 type, we
>> program it as NO_RESPONSE and with this command complete interrupt
>> happens right at end bit of command and there will be no transfer
>> complete interrupt.
> *[Correction] Based on our host design, When CMD6 is issued with R1
> type as we program it as NO_RESPONSE and with this command complete
> interrupt happens right at end bit of command and there will be no
> transfer complete interrupt.

Sorry to correct wordings, I meant sdhci driver programs response type
as NO_RESPONSE for CMD6.

When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our host
design  command complete interrupt happens right at end bit of command
and there will be no transfer complete interrupt.


>>
>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
>> and with this command complete is end bit of device resp and transfer
>> complete interrupt will be when DAT0 LOW -> HIGH.
>>
>> Regardless of R1/R1B, device side CMD6 will always have busy state on
>> D0 and response on CMD lines.
>>
>> There will be 2 clock cycles period after sending CMD6 for device to
>> send busy state on data0.
>>
>> In case of R1 type, after sending command DAT will stay high and
>> looks like we are polling for busy early before busy state has
>> started and sending CMD13 while device is busy and sending response
>> on CMD line is causing timeout.
>>
>> Probably with this specific case of CMD6 with R1 type, to wait for
>> card busy we should poll for DAT0 to go Low first and then to go High??
>>
>>>
>>>>
>>>>
>>>>
>>>> Actually we always use R1B with CMD6 as per spec.
>>> I fully agree that R1B is preferable, but it's not against the spec to
>>> send CMD13 to poll for busy.
>>>
>>> Moreover, we need to cope with the scenario when the host has
>>> specified a maximum timeout that isn't sufficiently long enough for
>>> the requested operation. Do you have another proposal for how to
>>> manage this, but disabling MMC_RSP_BUSY?
>>>
>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>> then what timeout would the driver be using if we would set
>>> cmd.busy_timeout to 30ms?
>>>
>>> Kind regards
>>> Uffe

2020-03-04 17:52:30

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> [...]
>>>>
>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>>>> debug options), which is using the sdhci-omap variant. I am trying to
>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These are
>>>>> the
>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>> CMD6 cache flush command.
>>>>>
>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>> sdhci-omap driver in this configuration.
>>>>>
>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to
>>>>> 30 *
>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>> from the command.
>>>>>
>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>> Then everything works fine.
>>>>>
>>>>> 3. Updating the code to again use 30s as the
>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
>>>>> to be
>>>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>>>> This also works fine.
>>>>>
>>>>> Clearly this indicates a problem that I think needs to be
>>>>> addressed in
>>>>> the sdhci driver. However, of course I can revert the three discussed
>>>>> patches to fix the problem, but that would only hide the issues and I
>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>
>>>>> To fix the problem in the sdhci driver, I would appreciate if someone
>>>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>>>> desk.
>>>>>
>>>>> Comments or other ideas of how to move forward?
>>>> [...]
>>>>
>>>>> Hi Ulf,
>>>>>
>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does mmc
>>>>> flush cache.
>>>> Okay, great.
>>>>
>>>>>
>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>>> device side CMD6 is still inflight while host sends CMD13 as we
>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>
>>>>>
>>>>>
>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it
>>>>> uses R1B response type and host will wait for busy state followed
>>>>> by response from device for CMD6 and then data lines go High.
>>>>>
>>>>>
>>>>>
>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
>>>>> waits for busy by checking for DAT0 line to go High.
>>>> If I understand correctly, because of the timeout now set to 30s,
>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>> sdhci-tegra as well?
>>> Yes
>>>>
>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>> your case) will be invoked to wait for the card to stop signal busy on
>>>> DAT0.
>>>>
>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>> inform that the card is *not* busy, even if the card actually signals
>>>> busy? Is that correct?
>>> Yes
>>>>
>>>>>
>>>>>
>>>>> With R1B type, host design after sending command at end of
>>>>> completion after end bit waits for 2 cycles for data line to go
>>>>> low (busy state from device) and waits for response cycles after
>>>>> which data lines will go back high and then we issue switch status
>>>>> CMD13.
>>>>>
>>>>>
>>>>>
>>>>> With R1 type, host after sending command and at end of completion
>>>>> after end bit, DATA lines will go high immediately as its R1 type
>>>>> and switch status CMD13 gets issued but by this time it looks like
>>>>> CMD6 on device side is still in flight for sending status and data.
>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>> even if the card signals busy on DAT0. The CMD13 is not using the DATA
>>>> lines, so this should work.
>>>>
>>>> If I understand correctly, your driver (and controller?) has issues
>>>> with coping with this scenario. Is it something that can be fixed?
>>>>
>>>>>
>>>>> 30s timeout is the wait time for data0 line to go high and
>>>>> mmc_busy_status will return success right away with R1 response
>>>>> type and SW sends switch status CMD13 but during that time on
>>>>> device side looks like still processing CMD6 as we are not waiting
>>>>> for enough time when we use R1 response type.
>>>> Right, as stated above, isn't sdhci_card_busy() working for your case?
>>>> Can we fix it?
>>>
>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>
>>> Based on our host design, When CMD6 is issued with R1 type, we
>>> program it as NO_RESPONSE and with this command complete interrupt
>>> happens right at end bit of command and there will be no transfer
>>> complete interrupt.
>> *[Correction] Based on our host design, When CMD6 is issued with R1
>> type as we program it as NO_RESPONSE and with this command complete
>> interrupt happens right at end bit of command and there will be no
>> transfer complete interrupt.
>
> Sorry to correct wordings, I meant sdhci driver programs response type
> as NO_RESPONSE for CMD6.
>
> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our host
> design  command complete interrupt happens right at end bit of command
> and there will be no transfer complete interrupt.
>
>
Sorry for confusion. Please ignore above on response. it is using SHORT
response for R1. So SW poll should be working.

Will get back on checking on host design side internally.

>>>
>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
>>> and with this command complete is end bit of device resp and
>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>
>>> Regardless of R1/R1B, device side CMD6 will always have busy state
>>> on D0 and response on CMD lines.
>>>
>>> There will be 2 clock cycles period after sending CMD6 for device to
>>> send busy state on data0.
>>>
>>> In case of R1 type, after sending command DAT will stay high and
>>> looks like we are polling for busy early before busy state has
>>> started and sending CMD13 while device is busy and sending response
>>> on CMD line is causing timeout.
>>>
>>> Probably with this specific case of CMD6 with R1 type, to wait for
>>> card busy we should poll for DAT0 to go Low first and then to go High??
>>>
>>>>

>>>>>
>>>>>
>>>>>
>>>>> Actually we always use R1B with CMD6 as per spec.
>>>> I fully agree that R1B is preferable, but it's not against the spec to
>>>> send CMD13 to poll for busy.
>>>>
>>>> Moreover, we need to cope with the scenario when the host has
>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>> the requested operation. Do you have another proposal for how to
>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>
>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>> then what timeout would the driver be using if we would set
>>>> cmd.busy_timeout to 30ms?
>>>>
>>>> Kind regards
>>>> Uffe

2020-03-04 22:35:59

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>
> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>
>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with limited
>>>>>> debug options), which is using the sdhci-omap variant. I am
>>>>>> trying to
>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
>>>>>> are the
>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>> CMD6 cache flush command.
>>>>>>
>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>> sdhci-omap driver in this configuration.
>>>>>>
>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set to
>>>>>> 30 *
>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>> from the command.
>>>>>>
>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>> Then everything works fine.
>>>>>>
>>>>>> 3. Updating the code to again use 30s as the
>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
>>>>>> to be
>>>>>> set, even when the timeout_ms becomes greater than max_busy_timeout.
>>>>>> This also works fine.
>>>>>>
>>>>>> Clearly this indicates a problem that I think needs to be
>>>>>> addressed in
>>>>>> the sdhci driver. However, of course I can revert the three
>>>>>> discussed
>>>>>> patches to fix the problem, but that would only hide the issues
>>>>>> and I
>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>
>>>>>> To fix the problem in the sdhci driver, I would appreciate if
>>>>>> someone
>>>>>> from TI and Nvidia can step in to help, as I don't have the HW on my
>>>>>> desk.
>>>>>>
>>>>>> Comments or other ideas of how to move forward?
>>>>> [...]
>>>>>
>>>>>> Hi Ulf,
>>>>>>
>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
>>>>>> mmc flush cache.
>>>>> Okay, great.
>>>>>
>>>>>>
>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with it
>>>>>> uses R1B response type and host will wait for busy state followed
>>>>>> by response from device for CMD6 and then data lines go High.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>> sdhci-tegra as well?
>>>> Yes
>>>>>
>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>> your case) will be invoked to wait for the card to stop signal
>>>>> busy on
>>>>> DAT0.
>>>>>
>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>> inform that the card is *not* busy, even if the card actually signals
>>>>> busy? Is that correct?
>>>> Yes
>>>>>
>>>>>>
>>>>>>
>>>>>> With R1B type, host design after sending command at end of
>>>>>> completion after end bit waits for 2 cycles for data line to go
>>>>>> low (busy state from device) and waits for response cycles after
>>>>>> which data lines will go back high and then we issue switch
>>>>>> status CMD13.
>>>>>>
>>>>>>
>>>>>>
>>>>>> With R1 type, host after sending command and at end of completion
>>>>>> after end bit, DATA lines will go high immediately as its R1 type
>>>>>> and switch status CMD13 gets issued but by this time it looks
>>>>>> like CMD6 on device side is still in flight for sending status
>>>>>> and data.
>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>> even if the card signals busy on DAT0. The CMD13 is not using the
>>>>> DATA
>>>>> lines, so this should work.
>>>>>
>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>
>>>>>>
>>>>>> 30s timeout is the wait time for data0 line to go high and
>>>>>> mmc_busy_status will return success right away with R1 response
>>>>>> type and SW sends switch status CMD13 but during that time on
>>>>>> device side looks like still processing CMD6 as we are not
>>>>>> waiting for enough time when we use R1 response type.
>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
>>>>> case?
>>>>> Can we fix it?
>>>>
>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>
>>>> Based on our host design, When CMD6 is issued with R1 type, we
>>>> program it as NO_RESPONSE and with this command complete interrupt
>>>> happens right at end bit of command and there will be no transfer
>>>> complete interrupt.
>>> *[Correction] Based on our host design, When CMD6 is issued with R1
>>> type as we program it as NO_RESPONSE and with this command complete
>>> interrupt happens right at end bit of command and there will be no
>>> transfer complete interrupt.
>>
>> Sorry to correct wordings, I meant sdhci driver programs response
>> type as NO_RESPONSE for CMD6.
>>
>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
>> host design  command complete interrupt happens right at end bit of
>> command and there will be no transfer complete interrupt.
>>
>>
> Sorry for confusion. Please ignore above on response. it is using
> SHORT response for R1. So SW poll should be working.
>
> Will get back on checking on host design side internally.
>
Hi Ulf,

Verified internally regarding the busy state over DATA0 Our host design
has known minor bug where command complete interrupt is asserted after
waiting for busy cycles from device.So because of this polling for
card_busy() returns 0 (DAT0 line High) immediately as waiting for busy
is taken care during command complete interrupt in host design. This
behavior is same for R1 and R1B.


>>>>
>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
>>>> and with this command complete is end bit of device resp and
>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>
>>>> Regardless of R1/R1B, device side CMD6 will always have busy state
>>>> on D0 and response on CMD lines.
>>>>
>>>> There will be 2 clock cycles period after sending CMD6 for device
>>>> to send busy state on data0.
>>>>
>>>> In case of R1 type, after sending command DAT will stay high and
>>>> looks like we are polling for busy early before busy state has
>>>> started and sending CMD13 while device is busy and sending response
>>>> on CMD line is causing timeout.
>>>>
>>>> Probably with this specific case of CMD6 with R1 type, to wait for
>>>> card busy we should poll for DAT0 to go Low first and then to go
>>>> High??
>>>>
>>>>>
>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>> spec to
>>>>> send CMD13 to poll for busy.
>>>>>
>>>>> Moreover, we need to cope with the scenario when the host has
>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>> the requested operation. Do you have another proposal for how to
>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>
>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>> then what timeout would the driver be using if we would set
>>>>> cmd.busy_timeout to 30ms?
>>>>>

>>>>> Kind regards
>>>>> Uffe

2020-03-05 00:20:54

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
>
> On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>>
>>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>>
>>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>>> External email: Use caution opening links or attachments
>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with
>>>>>>> limited
>>>>>>> debug options), which is using the sdhci-omap variant. I am
>>>>>>> trying to
>>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
>>>>>>> are the
>>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>>> CMD6 cache flush command.
>>>>>>>
>>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>>> sdhci-omap driver in this configuration.
>>>>>>>
>>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set
>>>>>>> to 30 *
>>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>>> from the command.
>>>>>>>
>>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>>> Then everything works fine.
>>>>>>>
>>>>>>> 3. Updating the code to again use 30s as the
>>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
>>>>>>> to be
>>>>>>> set, even when the timeout_ms becomes greater than
>>>>>>> max_busy_timeout.
>>>>>>> This also works fine.
>>>>>>>
>>>>>>> Clearly this indicates a problem that I think needs to be
>>>>>>> addressed in
>>>>>>> the sdhci driver. However, of course I can revert the three
>>>>>>> discussed
>>>>>>> patches to fix the problem, but that would only hide the issues
>>>>>>> and I
>>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>>
>>>>>>> To fix the problem in the sdhci driver, I would appreciate if
>>>>>>> someone
>>>>>>> from TI and Nvidia can step in to help, as I don't have the HW
>>>>>>> on my
>>>>>>> desk.
>>>>>>>
>>>>>>> Comments or other ideas of how to move forward?
>>>>>> [...]
>>>>>>
>>>>>>> Hi Ulf,
>>>>>>>
>>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
>>>>>>> mmc flush cache.
>>>>>> Okay, great.
>>>>>>
>>>>>>>
>>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
>>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with
>>>>>>> it uses R1B response type and host will wait for busy state
>>>>>>> followed by response from device for CMD6 and then data lines go
>>>>>>> High.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
>>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>>> sdhci-tegra as well?
>>>>> Yes
>>>>>>
>>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>>> your case) will be invoked to wait for the card to stop signal
>>>>>> busy on
>>>>>> DAT0.
>>>>>>
>>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>>> inform that the card is *not* busy, even if the card actually
>>>>>> signals
>>>>>> busy? Is that correct?
>>>>> Yes
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With R1B type, host design after sending command at end of
>>>>>>> completion after end bit waits for 2 cycles for data line to go
>>>>>>> low (busy state from device) and waits for response cycles after
>>>>>>> which data lines will go back high and then we issue switch
>>>>>>> status CMD13.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With R1 type, host after sending command and at end of
>>>>>>> completion after end bit, DATA lines will go high immediately as
>>>>>>> its R1 type and switch status CMD13 gets issued but by this time
>>>>>>> it looks like CMD6 on device side is still in flight for sending
>>>>>>> status and data.
>>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>>> even if the card signals busy on DAT0. The CMD13 is not using the
>>>>>> DATA
>>>>>> lines, so this should work.
>>>>>>
>>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>>
>>>>>>>
>>>>>>> 30s timeout is the wait time for data0 line to go high and
>>>>>>> mmc_busy_status will return success right away with R1 response
>>>>>>> type and SW sends switch status CMD13 but during that time on
>>>>>>> device side looks like still processing CMD6 as we are not
>>>>>>> waiting for enough time when we use R1 response type.
>>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
>>>>>> case?
>>>>>> Can we fix it?
>>>>>
>>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>>
>>>>> Based on our host design, When CMD6 is issued with R1 type, we
>>>>> program it as NO_RESPONSE and with this command complete interrupt
>>>>> happens right at end bit of command and there will be no transfer
>>>>> complete interrupt.
>>>> *[Correction] Based on our host design, When CMD6 is issued with R1
>>>> type as we program it as NO_RESPONSE and with this command complete
>>>> interrupt happens right at end bit of command and there will be no
>>>> transfer complete interrupt.
>>>
>>> Sorry to correct wordings, I meant sdhci driver programs response
>>> type as NO_RESPONSE for CMD6.
>>>
>>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
>>> host design  command complete interrupt happens right at end bit of
>>> command and there will be no transfer complete interrupt.
>>>
>>>
>> Sorry for confusion. Please ignore above on response. it is using
>> SHORT response for R1. So SW poll should be working.
>>
>> Will get back on checking on host design side internally.
>>
> Hi Ulf,
>
> Verified internally regarding the busy state over DATA0 Our host
> design has known minor bug where command complete interrupt is
> asserted after waiting for busy cycles from device.So because of this
> polling for card_busy() returns 0 (DAT0 line High) immediately as
> waiting for busy is taken care during command complete interrupt in
> host design. This behavior is same for R1 and R1B.
>
>
>>>>>
>>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
>>>>> and with this command complete is end bit of device resp and
>>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>>
>>>>> Regardless of R1/R1B, device side CMD6 will always have busy state
>>>>> on D0 and response on CMD lines.
>>>>>
>>>>> There will be 2 clock cycles period after sending CMD6 for device
>>>>> to send busy state on data0.
>>>>>
>>>>> In case of R1 type, after sending command DAT will stay high and
>>>>> looks like we are polling for busy early before busy state has
>>>>> started and sending CMD13 while device is busy and sending
>>>>> response on CMD line is causing timeout.
>>>>>
>>>>> Probably with this specific case of CMD6 with R1 type, to wait for
>>>>> card busy we should poll for DAT0 to go Low first and then to go
>>>>> High??
>>>>>
>>>>>>
>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>> spec to
>>>>>> send CMD13 to poll for busy.
>>>>>>
>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>> the requested operation. Do you have another proposal for how to
>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>
>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>> then what timeout would the driver be using if we would set
>>>>>> cmd.busy_timeout to 30ms?
>>>>>>
>
Sorry didn't understood clearly. Are you asking with 30s timeout, whats
the data timeout counter used?

Because of above mentioned issue on our host where CMD interrupt happens
after busy state, poll for busy returns right away as not busy.

So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
But weird that with small delay of 1ms or debug print before CMD13 it
doesn't timeout and works all the time.


>>>>>> Kind regards
>>>>>> Uffe

2020-03-05 03:06:33

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/4/20 4:20 PM, Sowjanya Komatineni wrote:
>
> On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
>>
>> On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>>>
>>>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>>>
>>>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>>>
>>>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>>>> External email: Use caution opening links or attachments
>>>>>>>
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with
>>>>>>>> limited
>>>>>>>> debug options), which is using the sdhci-omap variant. I am
>>>>>>>> trying to
>>>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
>>>>>>>> are the
>>>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>>>> CMD6 cache flush command.
>>>>>>>>
>>>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>>>> sdhci-omap driver in this configuration.
>>>>>>>>
>>>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails
>>>>>>>> with
>>>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set
>>>>>>>> to 30 *
>>>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>>>> from the command.
>>>>>>>>
>>>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means
>>>>>>>> that
>>>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>>>> Then everything works fine.
>>>>>>>>
>>>>>>>> 3. Updating the code to again use 30s as the
>>>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the
>>>>>>>> MMC_RSP_BUSY to be
>>>>>>>> set, even when the timeout_ms becomes greater than
>>>>>>>> max_busy_timeout.
>>>>>>>> This also works fine.
>>>>>>>>
>>>>>>>> Clearly this indicates a problem that I think needs to be
>>>>>>>> addressed in
>>>>>>>> the sdhci driver. However, of course I can revert the three
>>>>>>>> discussed
>>>>>>>> patches to fix the problem, but that would only hide the issues
>>>>>>>> and I
>>>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>>>
>>>>>>>> To fix the problem in the sdhci driver, I would appreciate if
>>>>>>>> someone
>>>>>>>> from TI and Nvidia can step in to help, as I don't have the HW
>>>>>>>> on my
>>>>>>>> desk.
>>>>>>>>
>>>>>>>> Comments or other ideas of how to move forward?
>>>>>>> [...]
>>>>>>>
>>>>>>>> Hi Ulf,
>>>>>>>>
>>>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
>>>>>>>> mmc flush cache.
>>>>>>> Okay, great.
>>>>>>>
>>>>>>>>
>>>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
>>>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with
>>>>>>>> it uses R1B response type and host will wait for busy state
>>>>>>>> followed by response from device for CMD6 and then data lines
>>>>>>>> go High.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
>>>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>>>> sdhci-tegra as well?
>>>>>> Yes
>>>>>>>
>>>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>>>> your case) will be invoked to wait for the card to stop signal
>>>>>>> busy on
>>>>>>> DAT0.
>>>>>>>
>>>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>>>> inform that the card is *not* busy, even if the card actually
>>>>>>> signals
>>>>>>> busy? Is that correct?
>>>>>> Yes
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With R1B type, host design after sending command at end of
>>>>>>>> completion after end bit waits for 2 cycles for data line to go
>>>>>>>> low (busy state from device) and waits for response cycles
>>>>>>>> after which data lines will go back high and then we issue
>>>>>>>> switch status CMD13.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> With R1 type, host after sending command and at end of
>>>>>>>> completion after end bit, DATA lines will go high immediately
>>>>>>>> as its R1 type and switch status CMD13 gets issued but by this
>>>>>>>> time it looks like CMD6 on device side is still in flight for
>>>>>>>> sending status and data.
>>>>>>> So, yes, using R1 instead of R1B triggers a different behaviour,
>>>>>>> but
>>>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>>>> even if the card signals busy on DAT0. The CMD13 is not using
>>>>>>> the DATA
>>>>>>> lines, so this should work.
>>>>>>>
>>>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>>>
>>>>>>>>
>>>>>>>> 30s timeout is the wait time for data0 line to go high and
>>>>>>>> mmc_busy_status will return success right away with R1 response
>>>>>>>> type and SW sends switch status CMD13 but during that time on
>>>>>>>> device side looks like still processing CMD6 as we are not
>>>>>>>> waiting for enough time when we use R1 response type.
>>>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
>>>>>>> case?
>>>>>>> Can we fix it?
>>>>>>
>>>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>>>
>>>>>> Based on our host design, When CMD6 is issued with R1 type, we
>>>>>> program it as NO_RESPONSE and with this command complete
>>>>>> interrupt happens right at end bit of command and there will be
>>>>>> no transfer complete interrupt.
>>>>> *[Correction] Based on our host design, When CMD6 is issued with
>>>>> R1 type as we program it as NO_RESPONSE and with this command
>>>>> complete interrupt happens right at end bit of command and there
>>>>> will be no transfer complete interrupt.
>>>>
>>>> Sorry to correct wordings, I meant sdhci driver programs response
>>>> type as NO_RESPONSE for CMD6.
>>>>
>>>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
>>>> host design  command complete interrupt happens right at end bit of
>>>> command and there will be no transfer complete interrupt.
>>>>
>>>>
>>> Sorry for confusion. Please ignore above on response. it is using
>>> SHORT response for R1. So SW poll should be working.
>>>
>>> Will get back on checking on host design side internally.
>>>
>> Hi Ulf,
>>
>> Verified internally regarding the busy state over DATA0 Our host
>> design has known minor bug where command complete interrupt is
>> asserted after waiting for busy cycles from device.So because of this
>> polling for card_busy() returns 0 (DAT0 line High) immediately as
>> waiting for busy is taken care during command complete interrupt in
>> host design. This behavior is same for R1 and R1B.
>>
>>
>>>>>>
>>>>>> When CMD6 is issued with R1B type, we program is as R1B
>>>>>> RESP_SHORT and with this command complete is end bit of device
>>>>>> resp and transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>>>
>>>>>> Regardless of R1/R1B, device side CMD6 will always have busy
>>>>>> state on D0 and response on CMD lines.
>>>>>>
>>>>>> There will be 2 clock cycles period after sending CMD6 for device
>>>>>> to send busy state on data0.
>>>>>>
>>>>>> In case of R1 type, after sending command DAT will stay high and
>>>>>> looks like we are polling for busy early before busy state has
>>>>>> started and sending CMD13 while device is busy and sending
>>>>>> response on CMD line is causing timeout.
>>>>>>
>>>>>> Probably with this specific case of CMD6 with R1 type, to wait
>>>>>> for card busy we should poll for DAT0 to go Low first and then to
>>>>>> go High??
>>>>>>
>>>>>>>
>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>> spec to
>>>>>>> send CMD13 to poll for busy.
>>>>>>>
>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>
>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>> then what timeout would the driver be using if we would set
>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>
>>
> Sorry didn't understood clearly. Are you asking with 30s timeout,
> whats the data timeout counter used?
>
> Because of above mentioned issue on our host where CMD interrupt
> happens after busy state, poll for busy returns right away as not busy.
>
> So issuing CMD13 after CMD6-R1 followed by busy poll should be
> working. But weird that with small delay of 1ms or debug print before
> CMD13 it doesn't timeout and works all the time.
>
>
With R1B for CMD6, busy detection timeout on our host is 11s (data
timeout count = 0xE)
>>>>>>> Kind regards
>>>>>>> Uffe

2020-03-05 13:07:02

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Thu, 5 Mar 2020 at 01:20, Sowjanya Komatineni <[email protected]> wrote:
>
>
> On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
> >
> > On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
> >>
> >> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
> >>>
> >>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
> >>>>
> >>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
> >>>>>
> >>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
> >>>>>> External email: Use caution opening links or attachments
> >>>>>>
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
> >>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with
> >>>>>>> limited
> >>>>>>> debug options), which is using the sdhci-omap variant. I am
> >>>>>>> trying to
> >>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
> >>>>>>> are the
> >>>>>>> conclusions from the observed behaviour on the Beagleboard for the
> >>>>>>> CMD6 cache flush command.
> >>>>>>>
> >>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
> >>>>>>> sdhci-omap driver in this configuration.
> >>>>>>>
> >>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
> >>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set
> >>>>>>> to 30 *
> >>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
> >>>>>>> from the command.
> >>>>>>>
> >>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
> >>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
> >>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
> >>>>>>> Then everything works fine.
> >>>>>>>
> >>>>>>> 3. Updating the code to again use 30s as the
> >>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
> >>>>>>> to be
> >>>>>>> set, even when the timeout_ms becomes greater than
> >>>>>>> max_busy_timeout.
> >>>>>>> This also works fine.
> >>>>>>>
> >>>>>>> Clearly this indicates a problem that I think needs to be
> >>>>>>> addressed in
> >>>>>>> the sdhci driver. However, of course I can revert the three
> >>>>>>> discussed
> >>>>>>> patches to fix the problem, but that would only hide the issues
> >>>>>>> and I
> >>>>>>> am sure we would then get back to this issue, sooner or later.
> >>>>>>>
> >>>>>>> To fix the problem in the sdhci driver, I would appreciate if
> >>>>>>> someone
> >>>>>>> from TI and Nvidia can step in to help, as I don't have the HW
> >>>>>>> on my
> >>>>>>> desk.
> >>>>>>>
> >>>>>>> Comments or other ideas of how to move forward?
> >>>>>> [...]
> >>>>>>
> >>>>>>> Hi Ulf,
> >>>>>>>
> >>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
> >>>>>>> mmc flush cache.
> >>>>>> Okay, great.
> >>>>>>
> >>>>>>>
> >>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
> >>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
> >>>>>>> are using R1 response type with timeout_ms changes to 30s.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with
> >>>>>>> it uses R1B response type and host will wait for busy state
> >>>>>>> followed by response from device for CMD6 and then data lines go
> >>>>>>> High.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
> >>>>>>> waits for busy by checking for DAT0 line to go High.
> >>>>>> If I understand correctly, because of the timeout now set to 30s,
> >>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
> >>>>>> sdhci-tegra as well?
> >>>>> Yes
> >>>>>>
> >>>>>> In other words, mmc_poll_for_busy() is being called, which in your
> >>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
> >>>>>> your case) will be invoked to wait for the card to stop signal
> >>>>>> busy on
> >>>>>> DAT0.
> >>>>>>
> >>>>>> This indicates to me, that the ->card_busy() ops returns zero to
> >>>>>> inform that the card is *not* busy, even if the card actually
> >>>>>> signals
> >>>>>> busy? Is that correct?
> >>>>> Yes
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> With R1B type, host design after sending command at end of
> >>>>>>> completion after end bit waits for 2 cycles for data line to go
> >>>>>>> low (busy state from device) and waits for response cycles after
> >>>>>>> which data lines will go back high and then we issue switch
> >>>>>>> status CMD13.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> With R1 type, host after sending command and at end of
> >>>>>>> completion after end bit, DATA lines will go high immediately as
> >>>>>>> its R1 type and switch status CMD13 gets issued but by this time
> >>>>>>> it looks like CMD6 on device side is still in flight for sending
> >>>>>>> status and data.
> >>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
> >>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
> >>>>>> even if the card signals busy on DAT0. The CMD13 is not using the
> >>>>>> DATA
> >>>>>> lines, so this should work.
> >>>>>>
> >>>>>> If I understand correctly, your driver (and controller?) has issues
> >>>>>> with coping with this scenario. Is it something that can be fixed?
> >>>>>>
> >>>>>>>
> >>>>>>> 30s timeout is the wait time for data0 line to go high and
> >>>>>>> mmc_busy_status will return success right away with R1 response
> >>>>>>> type and SW sends switch status CMD13 but during that time on
> >>>>>>> device side looks like still processing CMD6 as we are not
> >>>>>>> waiting for enough time when we use R1 response type.
> >>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
> >>>>>> case?
> >>>>>> Can we fix it?
> >>>>>
> >>>>> sdhci_card_busy() returned 0 indicating its not busy.
> >>>>>
> >>>>> Based on our host design, When CMD6 is issued with R1 type, we
> >>>>> program it as NO_RESPONSE and with this command complete interrupt
> >>>>> happens right at end bit of command and there will be no transfer
> >>>>> complete interrupt.
> >>>> *[Correction] Based on our host design, When CMD6 is issued with R1
> >>>> type as we program it as NO_RESPONSE and with this command complete
> >>>> interrupt happens right at end bit of command and there will be no
> >>>> transfer complete interrupt.
> >>>
> >>> Sorry to correct wordings, I meant sdhci driver programs response
> >>> type as NO_RESPONSE for CMD6.
> >>>
> >>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
> >>> host design command complete interrupt happens right at end bit of
> >>> command and there will be no transfer complete interrupt.
> >>>
> >>>
> >> Sorry for confusion. Please ignore above on response. it is using
> >> SHORT response for R1. So SW poll should be working.
> >>
> >> Will get back on checking on host design side internally.
> >>
> > Hi Ulf,
> >
> > Verified internally regarding the busy state over DATA0 Our host
> > design has known minor bug where command complete interrupt is
> > asserted after waiting for busy cycles from device.So because of this
> > polling for card_busy() returns 0 (DAT0 line High) immediately as
> > waiting for busy is taken care during command complete interrupt in
> > host design. This behavior is same for R1 and R1B.
> >
> >
> >>>>>
> >>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
> >>>>> and with this command complete is end bit of device resp and
> >>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
> >>>>>
> >>>>> Regardless of R1/R1B, device side CMD6 will always have busy state
> >>>>> on D0 and response on CMD lines.
> >>>>>
> >>>>> There will be 2 clock cycles period after sending CMD6 for device
> >>>>> to send busy state on data0.
> >>>>>
> >>>>> In case of R1 type, after sending command DAT will stay high and
> >>>>> looks like we are polling for busy early before busy state has
> >>>>> started and sending CMD13 while device is busy and sending
> >>>>> response on CMD line is causing timeout.
> >>>>>
> >>>>> Probably with this specific case of CMD6 with R1 type, to wait for
> >>>>> card busy we should poll for DAT0 to go Low first and then to go
> >>>>> High??
> >>>>>
> >>>>>>
> >>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>> spec to
> >>>>>> send CMD13 to poll for busy.
> >>>>>>
> >>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>> the requested operation. Do you have another proposal for how to
> >>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>
> >>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>> then what timeout would the driver be using if we would set
> >>>>>> cmd.busy_timeout to 30ms?
> >>>>>>
> >
> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> the data timeout counter used?

Yes. It seems like it will pick the maximum, which is 11s?

>
> Because of above mentioned issue on our host where CMD interrupt happens
> after busy state, poll for busy returns right away as not busy.

I see.

>
> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> But weird that with small delay of 1ms or debug print before CMD13 it
> doesn't timeout and works all the time.

I have digested the information you provided in these emails. Let me
summarize it, to see if I have understood correctly.

1.
Your controller can't distinguish between R1 and R1B because of a
limitation in the HW. So, in both cases you need to wait for the card
to stop signal busy, before the controller can give an IRQ to notify
that the R1 response has been received. Correct?

In this context, I am wondering if sdhci_send_command(), really
conforms to these requirements. For example, depending on if the CMD6
has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
SDHCI_CMD_RESP_SHORT_BUSY.

Does this work as expected for your case?

2.
Assuming my interpretation of the above is somewhat correct. Then you
always need to set a busy timeout for R1/R1B responses in the
controller. The maximum timeout seems to be 11s long. Obviously, this
isn't enough for all cases, such as cache flushing and erase, for
example. So, what can we do to support a longer timeouts than 11s?
Would it be possible to disable the HW timeout, if the requested
timeout is longer than 11s and use a SW timeout instead?

Kind regards
Uffe

2020-03-06 02:45:11

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/5/20 5:05 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, 5 Mar 2020 at 01:20, Sowjanya Komatineni <[email protected]> wrote:
>>
>> On 3/4/20 2:35 PM, Sowjanya Komatineni wrote:
>>> On 3/4/20 9:51 AM, Sowjanya Komatineni wrote:
>>>> On 3/4/20 9:26 AM, Sowjanya Komatineni wrote:
>>>>> On 3/4/20 9:21 AM, Sowjanya Komatineni wrote:
>>>>>> On 3/4/20 8:56 AM, Sowjanya Komatineni wrote:
>>>>>>> On 3/4/20 2:18 AM, Ulf Hansson wrote:
>>>>>>>> External email: Use caution opening links or attachments
>>>>>>>>
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> So, from my side, me and Anders Roxell, have been collaborating on
>>>>>>>>> testing the behaviour on a TI Beagleboard x15 (remotely with
>>>>>>>>> limited
>>>>>>>>> debug options), which is using the sdhci-omap variant. I am
>>>>>>>>> trying to
>>>>>>>>> get hold of an Nvidia jetson-TX2, but not found one yet. These
>>>>>>>>> are the
>>>>>>>>> conclusions from the observed behaviour on the Beagleboard for the
>>>>>>>>> CMD6 cache flush command.
>>>>>>>>>
>>>>>>>>> First, the reported host->max_busy_timeout is 2581 (ms) for the
>>>>>>>>> sdhci-omap driver in this configuration.
>>>>>>>>>
>>>>>>>>> 1. As we all know by now, the cache flush command (CMD6) fails with
>>>>>>>>> -110 currently. This is when MMC_CACHE_FLUSH_TIMEOUT_MS is set
>>>>>>>>> to 30 *
>>>>>>>>> 1000 (30s), which means __mmc_switch() drops the MMC_RSP_BUSY flag
>>>>>>>>> from the command.
>>>>>>>>>
>>>>>>>>> 2. Changing the MMC_CACHE_FLUSH_TIMEOUT_MS to 2000 (2s), means that
>>>>>>>>> the MMC_RSP_BUSY flag becomes set by __mmc_switch, because of the
>>>>>>>>> timeout_ms parameter is less than max_busy_timeout (2000 < 2581).
>>>>>>>>> Then everything works fine.
>>>>>>>>>
>>>>>>>>> 3. Updating the code to again use 30s as the
>>>>>>>>> MMC_CACHE_FLUSH_TIMEOUT_MS, but instead forcing the MMC_RSP_BUSY
>>>>>>>>> to be
>>>>>>>>> set, even when the timeout_ms becomes greater than
>>>>>>>>> max_busy_timeout.
>>>>>>>>> This also works fine.
>>>>>>>>>
>>>>>>>>> Clearly this indicates a problem that I think needs to be
>>>>>>>>> addressed in
>>>>>>>>> the sdhci driver. However, of course I can revert the three
>>>>>>>>> discussed
>>>>>>>>> patches to fix the problem, but that would only hide the issues
>>>>>>>>> and I
>>>>>>>>> am sure we would then get back to this issue, sooner or later.
>>>>>>>>>
>>>>>>>>> To fix the problem in the sdhci driver, I would appreciate if
>>>>>>>>> someone
>>>>>>>>> from TI and Nvidia can step in to help, as I don't have the HW
>>>>>>>>> on my
>>>>>>>>> desk.
>>>>>>>>>
>>>>>>>>> Comments or other ideas of how to move forward?
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> Hi Ulf,
>>>>>>>>>
>>>>>>>>> I could repro during suspend on Jetson TX1/TX2 as when it does
>>>>>>>>> mmc flush cache.
>>>>>>>> Okay, great.
>>>>>>>>
>>>>>>>>> Timeout I see is for switch status CMD13 after sending CMD6 as
>>>>>>>>> device side CMD6 is still inflight while host sends CMD13 as we
>>>>>>>>> are using R1 response type with timeout_ms changes to 30s.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Earlier we used timeout_ms of 0 for CMD6 flush cache, and with
>>>>>>>>> it uses R1B response type and host will wait for busy state
>>>>>>>>> followed by response from device for CMD6 and then data lines go
>>>>>>>>> High.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now with timeout_ms changed to 30s, we use R1 response and SW
>>>>>>>>> waits for busy by checking for DAT0 line to go High.
>>>>>>>> If I understand correctly, because of the timeout now set to 30s,
>>>>>>>> MMC_RSP_BUSY becomes disabled in __mmc_switch() for your case in
>>>>>>>> sdhci-tegra as well?
>>>>>>> Yes
>>>>>>>> In other words, mmc_poll_for_busy() is being called, which in your
>>>>>>>> case means the ->card_busy() host ops (set to sdhci_card_busy() in
>>>>>>>> your case) will be invoked to wait for the card to stop signal
>>>>>>>> busy on
>>>>>>>> DAT0.
>>>>>>>>
>>>>>>>> This indicates to me, that the ->card_busy() ops returns zero to
>>>>>>>> inform that the card is *not* busy, even if the card actually
>>>>>>>> signals
>>>>>>>> busy? Is that correct?
>>>>>>> Yes
>>>>>>>>>
>>>>>>>>> With R1B type, host design after sending command at end of
>>>>>>>>> completion after end bit waits for 2 cycles for data line to go
>>>>>>>>> low (busy state from device) and waits for response cycles after
>>>>>>>>> which data lines will go back high and then we issue switch
>>>>>>>>> status CMD13.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With R1 type, host after sending command and at end of
>>>>>>>>> completion after end bit, DATA lines will go high immediately as
>>>>>>>>> its R1 type and switch status CMD13 gets issued but by this time
>>>>>>>>> it looks like CMD6 on device side is still in flight for sending
>>>>>>>>> status and data.
>>>>>>>> So, yes, using R1 instead of R1B triggers a different behaviour, but
>>>>>>>> according to the eMMC spec it's perfectly allowed to issue a CMD13
>>>>>>>> even if the card signals busy on DAT0. The CMD13 is not using the
>>>>>>>> DATA
>>>>>>>> lines, so this should work.
>>>>>>>>
>>>>>>>> If I understand correctly, your driver (and controller?) has issues
>>>>>>>> with coping with this scenario. Is it something that can be fixed?
>>>>>>>>
>>>>>>>>> 30s timeout is the wait time for data0 line to go high and
>>>>>>>>> mmc_busy_status will return success right away with R1 response
>>>>>>>>> type and SW sends switch status CMD13 but during that time on
>>>>>>>>> device side looks like still processing CMD6 as we are not
>>>>>>>>> waiting for enough time when we use R1 response type.
>>>>>>>> Right, as stated above, isn't sdhci_card_busy() working for your
>>>>>>>> case?
>>>>>>>> Can we fix it?
>>>>>>> sdhci_card_busy() returned 0 indicating its not busy.
>>>>>>>
>>>>>>> Based on our host design, When CMD6 is issued with R1 type, we
>>>>>>> program it as NO_RESPONSE and with this command complete interrupt
>>>>>>> happens right at end bit of command and there will be no transfer
>>>>>>> complete interrupt.
>>>>>> *[Correction] Based on our host design, When CMD6 is issued with R1
>>>>>> type as we program it as NO_RESPONSE and with this command complete
>>>>>> interrupt happens right at end bit of command and there will be no
>>>>>> transfer complete interrupt.
>>>>> Sorry to correct wordings, I meant sdhci driver programs response
>>>>> type as NO_RESPONSE for CMD6.
>>>>>
>>>>> When CMD6 is issued with R1 type and as NO_RESPONSE, Based on our
>>>>> host design command complete interrupt happens right at end bit of
>>>>> command and there will be no transfer complete interrupt.
>>>>>
>>>>>
>>>> Sorry for confusion. Please ignore above on response. it is using
>>>> SHORT response for R1. So SW poll should be working.
>>>>
>>>> Will get back on checking on host design side internally.
>>>>
>>> Hi Ulf,
>>>
>>> Verified internally regarding the busy state over DATA0 Our host
>>> design has known minor bug where command complete interrupt is
>>> asserted after waiting for busy cycles from device.So because of this
>>> polling for card_busy() returns 0 (DAT0 line High) immediately as
>>> waiting for busy is taken care during command complete interrupt in
>>> host design. This behavior is same for R1 and R1B.
>>>
>>>
>>>>>>> When CMD6 is issued with R1B type, we program is as R1B RESP_SHORT
>>>>>>> and with this command complete is end bit of device resp and
>>>>>>> transfer complete interrupt will be when DAT0 LOW -> HIGH.
>>>>>>>
>>>>>>> Regardless of R1/R1B, device side CMD6 will always have busy state
>>>>>>> on D0 and response on CMD lines.
>>>>>>>
>>>>>>> There will be 2 clock cycles period after sending CMD6 for device
>>>>>>> to send busy state on data0.
>>>>>>>
>>>>>>> In case of R1 type, after sending command DAT will stay high and
>>>>>>> looks like we are polling for busy early before busy state has
>>>>>>> started and sending CMD13 while device is busy and sending
>>>>>>> response on CMD line is causing timeout.
>>>>>>>
>>>>>>> Probably with this specific case of CMD6 with R1 type, to wait for
>>>>>>> card busy we should poll for DAT0 to go Low first and then to go
>>>>>>> High??
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>> spec to
>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>
>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>
>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>
>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>> the data timeout counter used?
> Yes. It seems like it will pick the maximum, which is 11s?
yes
>
>> Because of above mentioned issue on our host where CMD interrupt happens
>> after busy state, poll for busy returns right away as not busy.
> I see.
>
>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>> But weird that with small delay of 1ms or debug print before CMD13 it
>> doesn't timeout and works all the time.
> I have digested the information you provided in these emails. Let me
> summarize it, to see if I have understood correctly.
>
> 1.
> Your controller can't distinguish between R1 and R1B because of a
> limitation in the HW. So, in both cases you need to wait for the card
> to stop signal busy, before the controller can give an IRQ to notify
> that the R1 response has been received. Correct?
>
> In this context, I am wondering if sdhci_send_command(), really
> conforms to these requirements. For example, depending on if the CMD6
> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> SDHCI_CMD_RESP_SHORT_BUSY.
>
> Does this work as expected for your case?
Design team re-verified internally and bug where HW waits for busy state
before IRQ is only for R1B and R1 is spec compliant.

So, with R1, CMD complete is generated after response received.

With R1B, CMD complete and xfer complete both are generated after
response received + device busy (max timeout of 11s)
DATA timeout interrupt will be asserted incase if HW busy detection fails.

With R1B we may see DATA Timeout if operation takes more than max busy
timeout of 11s.

> 2.
> Assuming my interpretation of the above is somewhat correct. Then you
> always need to set a busy timeout for R1/R1B responses in the
> controller. The maximum timeout seems to be 11s long. Obviously, this
> isn't enough for all cases, such as cache flushing and erase, for
> example. So, what can we do to support a longer timeouts than 11s?
> Would it be possible to disable the HW timeout, if the requested
> timeout is longer than 11s and use a SW timeout instead?
>
> Kind regards
> Uffe

For erase long operations we have register bit to enable for infinite
busy wait mode where host controller would be monitoring until card is busy.

But so far for emmc devices we used on our platforms, we haven't seen
cache flush taking more than 11s.

Will get back on possibility of disabling HW timeout and using SW timeout..

Thanks

Sowjanya

2020-03-06 11:15:57

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

[...]

> >>>>>>>>>
> >>>>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>>>> spec to
> >>>>>>>> send CMD13 to poll for busy.
> >>>>>>>>
> >>>>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>>>> the requested operation. Do you have another proposal for how to
> >>>>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>>>
> >>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>>>> then what timeout would the driver be using if we would set
> >>>>>>>> cmd.busy_timeout to 30ms?
> >>>>>>>>
> >> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> >> the data timeout counter used?
> > Yes. It seems like it will pick the maximum, which is 11s?
> yes

Okay, thanks!

> >
> >> Because of above mentioned issue on our host where CMD interrupt happens
> >> after busy state, poll for busy returns right away as not busy.
> > I see.
> >
> >> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> >> But weird that with small delay of 1ms or debug print before CMD13 it
> >> doesn't timeout and works all the time.
> > I have digested the information you provided in these emails. Let me
> > summarize it, to see if I have understood correctly.
> >
> > 1.
> > Your controller can't distinguish between R1 and R1B because of a
> > limitation in the HW. So, in both cases you need to wait for the card
> > to stop signal busy, before the controller can give an IRQ to notify
> > that the R1 response has been received. Correct?
> >
> > In this context, I am wondering if sdhci_send_command(), really
> > conforms to these requirements. For example, depending on if the CMD6
> > has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> > SDHCI_CMD_RESP_SHORT_BUSY.
> >
> > Does this work as expected for your case?
> Design team re-verified internally and bug where HW waits for busy state
> before IRQ is only for R1B and R1 is spec compliant.
>
> So, with R1, CMD complete is generated after response received.

Okay.

So, the issue we see for CMD6 with R1, is a software problem that we
should be able to fix.

>
> With R1B, CMD complete and xfer complete both are generated after
> response received + device busy (max timeout of 11s)
> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>
> With R1B we may see DATA Timeout if operation takes more than max busy
> timeout of 11s.

Okay, I see.

>
> > 2.
> > Assuming my interpretation of the above is somewhat correct. Then you
> > always need to set a busy timeout for R1/R1B responses in the
> > controller. The maximum timeout seems to be 11s long. Obviously, this
> > isn't enough for all cases, such as cache flushing and erase, for
> > example. So, what can we do to support a longer timeouts than 11s?
> > Would it be possible to disable the HW timeout, if the requested
> > timeout is longer than 11s and use a SW timeout instead?
> >
> > Kind regards
> > Uffe
>
> For erase long operations we have register bit to enable for infinite
> busy wait mode where host controller would be monitoring until card is busy.

Alright, that sounds great!

>
> But so far for emmc devices we used on our platforms, we haven't seen
> cache flush taking more than 11s.

I understand that 11s is probably fine to use, for most cases.

However, it's not spec compliant, as for some operations there are
simply no timeout specified. BKOPS, cache flush, sanitize are cases
like this - and then 11s is definitely not sufficient.

>
> Will get back on possibility of disabling HW timeout and using SW timeout..

Thanks!

I would like to get the regression fixed asap, but I also would like
to avoid reverting patches, unless really necessary. May I propose the
following two options.

1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
an R1 response doesn't work - and then fix that behaviour.

2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
the core to always use R1B for CMD6 (and erase). This also means that
when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
disable the HW busy timeout and just wait "forever".

If you decide for 2, you can add the software timeout support on top,
but make that can be considered as a next step of an improvement,
rather than needed as fix. Note that, I believe there are some support
for software timeout already in the sdhci core, maybe you need to
tweak it a bit for your case, I don't know.

Kind regards
Uffe

2020-03-09 14:06:59

by Faiz Abbas

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

Uffe,

On 06/03/20 4:44 pm, Ulf Hansson wrote:
> [...]
>
>>>>>>>>>>>
>>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>>>> spec to
>>>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>>>
>>>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>>>
>>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>>>
>>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>>>> the data timeout counter used?
>>> Yes. It seems like it will pick the maximum, which is 11s?
>> yes
>
> Okay, thanks!
>
>>>
>>>> Because of above mentioned issue on our host where CMD interrupt happens
>>>> after busy state, poll for busy returns right away as not busy.
>>> I see.
>>>
>>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>>>> But weird that with small delay of 1ms or debug print before CMD13 it
>>>> doesn't timeout and works all the time.
>>> I have digested the information you provided in these emails. Let me
>>> summarize it, to see if I have understood correctly.
>>>
>>> 1.
>>> Your controller can't distinguish between R1 and R1B because of a
>>> limitation in the HW. So, in both cases you need to wait for the card
>>> to stop signal busy, before the controller can give an IRQ to notify
>>> that the R1 response has been received. Correct?
>>>
>>> In this context, I am wondering if sdhci_send_command(), really
>>> conforms to these requirements. For example, depending on if the CMD6
>>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
>>> SDHCI_CMD_RESP_SHORT_BUSY.
>>>
>>> Does this work as expected for your case?
>> Design team re-verified internally and bug where HW waits for busy state
>> before IRQ is only for R1B and R1 is spec compliant.
>>
>> So, with R1, CMD complete is generated after response received.
>
> Okay.
>
> So, the issue we see for CMD6 with R1, is a software problem that we
> should be able to fix.
>
>>
>> With R1B, CMD complete and xfer complete both are generated after
>> response received + device busy (max timeout of 11s)
>> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>>
>> With R1B we may see DATA Timeout if operation takes more than max busy
>> timeout of 11s.
>
> Okay, I see.
>
>>
>>> 2.
>>> Assuming my interpretation of the above is somewhat correct. Then you
>>> always need to set a busy timeout for R1/R1B responses in the
>>> controller. The maximum timeout seems to be 11s long. Obviously, this
>>> isn't enough for all cases, such as cache flushing and erase, for
>>> example. So, what can we do to support a longer timeouts than 11s?
>>> Would it be possible to disable the HW timeout, if the requested
>>> timeout is longer than 11s and use a SW timeout instead?
>>>
>>> Kind regards
>>> Uffe
>>
>> For erase long operations we have register bit to enable for infinite
>> busy wait mode where host controller would be monitoring until card is busy.
>
> Alright, that sounds great!
>
>>
>> But so far for emmc devices we used on our platforms, we haven't seen
>> cache flush taking more than 11s.
>
> I understand that 11s is probably fine to use, for most cases.
>
> However, it's not spec compliant, as for some operations there are
> simply no timeout specified. BKOPS, cache flush, sanitize are cases
> like this - and then 11s is definitely not sufficient.
>
>>
>> Will get back on possibility of disabling HW timeout and using SW timeout..
>
> Thanks!
>
> I would like to get the regression fixed asap, but I also would like
> to avoid reverting patches, unless really necessary. May I propose the
> following two options.
>
> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> an R1 response doesn't work - and then fix that behaviour.
>
> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> the core to always use R1B for CMD6 (and erase). This also means that
> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> disable the HW busy timeout and just wait "forever".
>
> If you decide for 2, you can add the software timeout support on top,
> but make that can be considered as a next step of an improvement,
> rather than needed as fix. Note that, I believe there are some support
> for software timeout already in the sdhci core, maybe you need to
> tweak it a bit for your case, I don't know.
>

I was able to reproduce the issue on sdhci-omap with your patch applied.

In our case, the dat0 line never comes back up when trying to send a
CMD6 with R1. It needs an R1b to work properly. I also set
mmc->max_busy_timeout to 0 and that made it work as well.

I wanna try it out with other R1b commands like erase to see if it works
properly if the delay is actually very long.

Thanks,
Faiz

2020-03-09 15:58:57

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Mon, 9 Mar 2020 at 15:06, Faiz Abbas <[email protected]> wrote:
>
> Uffe,
>
> On 06/03/20 4:44 pm, Ulf Hansson wrote:
> > [...]
> >
> >>>>>>>>>>>
> >>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>>>>>> spec to
> >>>>>>>>>> send CMD13 to poll for busy.
> >>>>>>>>>>
> >>>>>>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>>>>>> the requested operation. Do you have another proposal for how to
> >>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>>>>>
> >>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>>>>>> then what timeout would the driver be using if we would set
> >>>>>>>>>> cmd.busy_timeout to 30ms?
> >>>>>>>>>>
> >>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> >>>> the data timeout counter used?
> >>> Yes. It seems like it will pick the maximum, which is 11s?
> >> yes
> >
> > Okay, thanks!
> >
> >>>
> >>>> Because of above mentioned issue on our host where CMD interrupt happens
> >>>> after busy state, poll for busy returns right away as not busy.
> >>> I see.
> >>>
> >>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> >>>> But weird that with small delay of 1ms or debug print before CMD13 it
> >>>> doesn't timeout and works all the time.
> >>> I have digested the information you provided in these emails. Let me
> >>> summarize it, to see if I have understood correctly.
> >>>
> >>> 1.
> >>> Your controller can't distinguish between R1 and R1B because of a
> >>> limitation in the HW. So, in both cases you need to wait for the card
> >>> to stop signal busy, before the controller can give an IRQ to notify
> >>> that the R1 response has been received. Correct?
> >>>
> >>> In this context, I am wondering if sdhci_send_command(), really
> >>> conforms to these requirements. For example, depending on if the CMD6
> >>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> >>> SDHCI_CMD_RESP_SHORT_BUSY.
> >>>
> >>> Does this work as expected for your case?
> >> Design team re-verified internally and bug where HW waits for busy state
> >> before IRQ is only for R1B and R1 is spec compliant.
> >>
> >> So, with R1, CMD complete is generated after response received.
> >
> > Okay.
> >
> > So, the issue we see for CMD6 with R1, is a software problem that we
> > should be able to fix.
> >
> >>
> >> With R1B, CMD complete and xfer complete both are generated after
> >> response received + device busy (max timeout of 11s)
> >> DATA timeout interrupt will be asserted incase if HW busy detection fails.
> >>
> >> With R1B we may see DATA Timeout if operation takes more than max busy
> >> timeout of 11s.
> >
> > Okay, I see.
> >
> >>
> >>> 2.
> >>> Assuming my interpretation of the above is somewhat correct. Then you
> >>> always need to set a busy timeout for R1/R1B responses in the
> >>> controller. The maximum timeout seems to be 11s long. Obviously, this
> >>> isn't enough for all cases, such as cache flushing and erase, for
> >>> example. So, what can we do to support a longer timeouts than 11s?
> >>> Would it be possible to disable the HW timeout, if the requested
> >>> timeout is longer than 11s and use a SW timeout instead?
> >>>
> >>> Kind regards
> >>> Uffe
> >>
> >> For erase long operations we have register bit to enable for infinite
> >> busy wait mode where host controller would be monitoring until card is busy.
> >
> > Alright, that sounds great!
> >
> >>
> >> But so far for emmc devices we used on our platforms, we haven't seen
> >> cache flush taking more than 11s.
> >
> > I understand that 11s is probably fine to use, for most cases.
> >
> > However, it's not spec compliant, as for some operations there are
> > simply no timeout specified. BKOPS, cache flush, sanitize are cases
> > like this - and then 11s is definitely not sufficient.
> >
> >>
> >> Will get back on possibility of disabling HW timeout and using SW timeout..
> >
> > Thanks!
> >
> > I would like to get the regression fixed asap, but I also would like
> > to avoid reverting patches, unless really necessary. May I propose the
> > following two options.
> >
> > 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> > an R1 response doesn't work - and then fix that behaviour.
> >
> > 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> > the core to always use R1B for CMD6 (and erase). This also means that
> > when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> > disable the HW busy timeout and just wait "forever".
> >
> > If you decide for 2, you can add the software timeout support on top,
> > but make that can be considered as a next step of an improvement,
> > rather than needed as fix. Note that, I believe there are some support
> > for software timeout already in the sdhci core, maybe you need to
> > tweak it a bit for your case, I don't know.
> >
>
> I was able to reproduce the issue on sdhci-omap with your patch applied.

Great, thanks for confirming!

>
> In our case, the dat0 line never comes back up when trying to send a
> CMD6 with R1. It needs an R1b to work properly. I also set

Do you know if that is because of how the driver manages R1B/R1 or
because of a limitation in the HW?

> mmc->max_busy_timeout to 0 and that made it work as well.

Okay, good.

>
> I wanna try it out with other R1b commands like erase to see if it works
> properly if the delay is actually very long.

Right, it probably works for most cases. In any case, this was broken
even before my three patches that screwed things up.

BTW, what HW busy time out will you be using for sdhci omap if
cmd->busy_timeout exceeds it? Is it 2581ms for all sdhci-omap
instances or does it differ depending on platform?

Anyway, in this phase, it sounds like we should move forward with a
patch setting max_busy_timeout to zero for sdhci omap, as that
restores things to the earlier behavior. Would you mind posting a
patch, or do you want me to manage it?

Kind regards
Uffe

2020-03-09 17:36:17

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/6/20 3:14 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> [...]
>
>>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>>>> spec to
>>>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>>>
>>>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>>>
>>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>>>
>>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>>>> the data timeout counter used?
>>> Yes. It seems like it will pick the maximum, which is 11s?
>> yes
> Okay, thanks!
>
>>>> Because of above mentioned issue on our host where CMD interrupt happens
>>>> after busy state, poll for busy returns right away as not busy.
>>> I see.
>>>
>>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>>>> But weird that with small delay of 1ms or debug print before CMD13 it
>>>> doesn't timeout and works all the time.
>>> I have digested the information you provided in these emails. Let me
>>> summarize it, to see if I have understood correctly.
>>>
>>> 1.
>>> Your controller can't distinguish between R1 and R1B because of a
>>> limitation in the HW. So, in both cases you need to wait for the card
>>> to stop signal busy, before the controller can give an IRQ to notify
>>> that the R1 response has been received. Correct?
>>>
>>> In this context, I am wondering if sdhci_send_command(), really
>>> conforms to these requirements. For example, depending on if the CMD6
>>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
>>> SDHCI_CMD_RESP_SHORT_BUSY.
>>>
>>> Does this work as expected for your case?
>> Design team re-verified internally and bug where HW waits for busy state
>> before IRQ is only for R1B and R1 is spec compliant.
>>
>> So, with R1, CMD complete is generated after response received.
> Okay.
>
> So, the issue we see for CMD6 with R1, is a software problem that we
> should be able to fix.
>
>> With R1B, CMD complete and xfer complete both are generated after
>> response received + device busy (max timeout of 11s)
>> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>>
>> With R1B we may see DATA Timeout if operation takes more than max busy
>> timeout of 11s.
> Okay, I see.
>
>>> 2.
>>> Assuming my interpretation of the above is somewhat correct. Then you
>>> always need to set a busy timeout for R1/R1B responses in the
>>> controller. The maximum timeout seems to be 11s long. Obviously, this
>>> isn't enough for all cases, such as cache flushing and erase, for
>>> example. So, what can we do to support a longer timeouts than 11s?
>>> Would it be possible to disable the HW timeout, if the requested
>>> timeout is longer than 11s and use a SW timeout instead?
>>>
>>> Kind regards
>>> Uffe
>> For erase long operations we have register bit to enable for infinite
>> busy wait mode where host controller would be monitoring until card is busy.
> Alright, that sounds great!
>
>> But so far for emmc devices we used on our platforms, we haven't seen
>> cache flush taking more than 11s.
> I understand that 11s is probably fine to use, for most cases.
>
> However, it's not spec compliant, as for some operations there are
> simply no timeout specified. BKOPS, cache flush, sanitize are cases
> like this - and then 11s is definitely not sufficient.
>
>> Will get back on possibility of disabling HW timeout and using SW timeout..
> Thanks!
>
> I would like to get the regression fixed asap, but I also would like
> to avoid reverting patches, unless really necessary. May I propose the
> following two options.
>
> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> an R1 response doesn't work - and then fix that behaviour.
>
> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> the core to always use R1B for CMD6 (and erase). This also means that
> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> disable the HW busy timeout and just wait "forever".
>
> If you decide for 2, you can add the software timeout support on top,
> but make that can be considered as a next step of an improvement,
> rather than needed as fix. Note that, I believe there are some support
> for software timeout already in the sdhci core, maybe you need to
> tweak it a bit for your case, I don't know.
>
> Kind regards
> Uffe

Hi Uffe

Will go with 2nd option and will send patches out when ready.

BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
data timeout based on host clock when using finite mode (HW busy
detection based on DATA TIMEOUT count value when cmd operation timeout
is < 11s for tegra host).

So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
force R1B during SWITCH and SLEEP_AWAKE.

So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
which can be used for hosts supporting long or infinite HW busy wait
detection and will update mmc and mmc_ops drivers to not allow convert
R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.

Thanks

Sowjanya

2020-03-10 09:49:22

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Mon, 9 Mar 2020 at 18:33, Sowjanya Komatineni <[email protected]> wrote:
>
>
> On 3/6/20 3:14 AM, Ulf Hansson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > [...]
> >
> >>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
> >>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
> >>>>>>>>>> spec to
> >>>>>>>>>> send CMD13 to poll for busy.
> >>>>>>>>>>
> >>>>>>>>>> Moreover, we need to cope with the scenario when the host has
> >>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
> >>>>>>>>>> the requested operation. Do you have another proposal for how to
> >>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
> >>>>>>>>>>
> >>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
> >>>>>>>>>> then what timeout would the driver be using if we would set
> >>>>>>>>>> cmd.busy_timeout to 30ms?
> >>>>>>>>>>
> >>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
> >>>> the data timeout counter used?
> >>> Yes. It seems like it will pick the maximum, which is 11s?
> >> yes
> > Okay, thanks!
> >
> >>>> Because of above mentioned issue on our host where CMD interrupt happens
> >>>> after busy state, poll for busy returns right away as not busy.
> >>> I see.
> >>>
> >>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
> >>>> But weird that with small delay of 1ms or debug print before CMD13 it
> >>>> doesn't timeout and works all the time.
> >>> I have digested the information you provided in these emails. Let me
> >>> summarize it, to see if I have understood correctly.
> >>>
> >>> 1.
> >>> Your controller can't distinguish between R1 and R1B because of a
> >>> limitation in the HW. So, in both cases you need to wait for the card
> >>> to stop signal busy, before the controller can give an IRQ to notify
> >>> that the R1 response has been received. Correct?
> >>>
> >>> In this context, I am wondering if sdhci_send_command(), really
> >>> conforms to these requirements. For example, depending on if the CMD6
> >>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
> >>> SDHCI_CMD_RESP_SHORT_BUSY.
> >>>
> >>> Does this work as expected for your case?
> >> Design team re-verified internally and bug where HW waits for busy state
> >> before IRQ is only for R1B and R1 is spec compliant.
> >>
> >> So, with R1, CMD complete is generated after response received.
> > Okay.
> >
> > So, the issue we see for CMD6 with R1, is a software problem that we
> > should be able to fix.
> >
> >> With R1B, CMD complete and xfer complete both are generated after
> >> response received + device busy (max timeout of 11s)
> >> DATA timeout interrupt will be asserted incase if HW busy detection fails.
> >>
> >> With R1B we may see DATA Timeout if operation takes more than max busy
> >> timeout of 11s.
> > Okay, I see.
> >
> >>> 2.
> >>> Assuming my interpretation of the above is somewhat correct. Then you
> >>> always need to set a busy timeout for R1/R1B responses in the
> >>> controller. The maximum timeout seems to be 11s long. Obviously, this
> >>> isn't enough for all cases, such as cache flushing and erase, for
> >>> example. So, what can we do to support a longer timeouts than 11s?
> >>> Would it be possible to disable the HW timeout, if the requested
> >>> timeout is longer than 11s and use a SW timeout instead?
> >>>
> >>> Kind regards
> >>> Uffe
> >> For erase long operations we have register bit to enable for infinite
> >> busy wait mode where host controller would be monitoring until card is busy.
> > Alright, that sounds great!
> >
> >> But so far for emmc devices we used on our platforms, we haven't seen
> >> cache flush taking more than 11s.
> > I understand that 11s is probably fine to use, for most cases.
> >
> > However, it's not spec compliant, as for some operations there are
> > simply no timeout specified. BKOPS, cache flush, sanitize are cases
> > like this - and then 11s is definitely not sufficient.
> >
> >> Will get back on possibility of disabling HW timeout and using SW timeout..
> > Thanks!
> >
> > I would like to get the regression fixed asap, but I also would like
> > to avoid reverting patches, unless really necessary. May I propose the
> > following two options.
> >
> > 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> > an R1 response doesn't work - and then fix that behaviour.
> >
> > 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> > the core to always use R1B for CMD6 (and erase). This also means that
> > when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> > disable the HW busy timeout and just wait "forever".
> >
> > If you decide for 2, you can add the software timeout support on top,
> > but make that can be considered as a next step of an improvement,
> > rather than needed as fix. Note that, I believe there are some support
> > for software timeout already in the sdhci core, maybe you need to
> > tweak it a bit for your case, I don't know.
> >
> > Kind regards
> > Uffe
>
> Hi Uffe
>
> Will go with 2nd option and will send patches out when ready.

Okay, good.

>
> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
> data timeout based on host clock when using finite mode (HW busy
> detection based on DATA TIMEOUT count value when cmd operation timeout
> is < 11s for tegra host).
>
> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
> force R1B during SWITCH and SLEEP_AWAKE.
>
> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
> which can be used for hosts supporting long or infinite HW busy wait
> detection and will update mmc and mmc_ops drivers to not allow convert
> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.

That seems reasonable, it becomes probably both easier and clearer by
adding a new host cap.

In any case, let me help out and cook a patch for this for the core
part (I leave the sdhci change to you). It may be a bit tricky,
especially since I have currently queued a bunch of new changes for
v5.7, that enables more users of mmc_poll_for_busy() in the core.
Maybe I need to temporarily drop them, so we can fix these problems
first. I will check.

Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
seems to be describing the common problem we have for sdhci
omap/tegra.

Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
sdhci- tegra, so while at it, perhaps you can cook a patch for that as
well.

Kind regards
Uffe

2020-03-10 16:57:52

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/10/20 2:46 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, 9 Mar 2020 at 18:33, Sowjanya Komatineni <[email protected]> wrote:
>>
>> On 3/6/20 3:14 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> [...]
>>>
>>>>>>>>>>>>> Actually we always use R1B with CMD6 as per spec.
>>>>>>>>>>>> I fully agree that R1B is preferable, but it's not against the
>>>>>>>>>>>> spec to
>>>>>>>>>>>> send CMD13 to poll for busy.
>>>>>>>>>>>>
>>>>>>>>>>>> Moreover, we need to cope with the scenario when the host has
>>>>>>>>>>>> specified a maximum timeout that isn't sufficiently long enough for
>>>>>>>>>>>> the requested operation. Do you have another proposal for how to
>>>>>>>>>>>> manage this, but disabling MMC_RSP_BUSY?
>>>>>>>>>>>>
>>>>>>>>>>>> Let's assume you driver would get a R1B for the CMD6 (we force it),
>>>>>>>>>>>> then what timeout would the driver be using if we would set
>>>>>>>>>>>> cmd.busy_timeout to 30ms?
>>>>>>>>>>>>
>>>>>> Sorry didn't understood clearly. Are you asking with 30s timeout, whats
>>>>>> the data timeout counter used?
>>>>> Yes. It seems like it will pick the maximum, which is 11s?
>>>> yes
>>> Okay, thanks!
>>>
>>>>>> Because of above mentioned issue on our host where CMD interrupt happens
>>>>>> after busy state, poll for busy returns right away as not busy.
>>>>> I see.
>>>>>
>>>>>> So issuing CMD13 after CMD6-R1 followed by busy poll should be working.
>>>>>> But weird that with small delay of 1ms or debug print before CMD13 it
>>>>>> doesn't timeout and works all the time.
>>>>> I have digested the information you provided in these emails. Let me
>>>>> summarize it, to see if I have understood correctly.
>>>>>
>>>>> 1.
>>>>> Your controller can't distinguish between R1 and R1B because of a
>>>>> limitation in the HW. So, in both cases you need to wait for the card
>>>>> to stop signal busy, before the controller can give an IRQ to notify
>>>>> that the R1 response has been received. Correct?
>>>>>
>>>>> In this context, I am wondering if sdhci_send_command(), really
>>>>> conforms to these requirements. For example, depending on if the CMD6
>>>>> has MMC_RSP_BUSY or not, it may pick either SDHCI_CMD_RESP_SHORT or
>>>>> SDHCI_CMD_RESP_SHORT_BUSY.
>>>>>
>>>>> Does this work as expected for your case?
>>>> Design team re-verified internally and bug where HW waits for busy state
>>>> before IRQ is only for R1B and R1 is spec compliant.
>>>>
>>>> So, with R1, CMD complete is generated after response received.
>>> Okay.
>>>
>>> So, the issue we see for CMD6 with R1, is a software problem that we
>>> should be able to fix.
>>>
>>>> With R1B, CMD complete and xfer complete both are generated after
>>>> response received + device busy (max timeout of 11s)
>>>> DATA timeout interrupt will be asserted incase if HW busy detection fails.
>>>>
>>>> With R1B we may see DATA Timeout if operation takes more than max busy
>>>> timeout of 11s.
>>> Okay, I see.
>>>
>>>>> 2.
>>>>> Assuming my interpretation of the above is somewhat correct. Then you
>>>>> always need to set a busy timeout for R1/R1B responses in the
>>>>> controller. The maximum timeout seems to be 11s long. Obviously, this
>>>>> isn't enough for all cases, such as cache flushing and erase, for
>>>>> example. So, what can we do to support a longer timeouts than 11s?
>>>>> Would it be possible to disable the HW timeout, if the requested
>>>>> timeout is longer than 11s and use a SW timeout instead?
>>>>>
>>>>> Kind regards
>>>>> Uffe
>>>> For erase long operations we have register bit to enable for infinite
>>>> busy wait mode where host controller would be monitoring until card is busy.
>>> Alright, that sounds great!
>>>
>>>> But so far for emmc devices we used on our platforms, we haven't seen
>>>> cache flush taking more than 11s.
>>> I understand that 11s is probably fine to use, for most cases.
>>>
>>> However, it's not spec compliant, as for some operations there are
>>> simply no timeout specified. BKOPS, cache flush, sanitize are cases
>>> like this - and then 11s is definitely not sufficient.
>>>
>>>> Will get back on possibility of disabling HW timeout and using SW timeout..
>>> Thanks!
>>>
>>> I would like to get the regression fixed asap, but I also would like
>>> to avoid reverting patches, unless really necessary. May I propose the
>>> following two options.
>>>
>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
>>> an R1 response doesn't work - and then fix that behaviour.
>>>
>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
>>> the core to always use R1B for CMD6 (and erase). This also means that
>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
>>> disable the HW busy timeout and just wait "forever".
>>>
>>> If you decide for 2, you can add the software timeout support on top,
>>> but make that can be considered as a next step of an improvement,
>>> rather than needed as fix. Note that, I believe there are some support
>>> for software timeout already in the sdhci core, maybe you need to
>>> tweak it a bit for your case, I don't know.
>>>
>>> Kind regards
>>> Uffe
>> Hi Uffe
>>
>> Will go with 2nd option and will send patches out when ready.
> Okay, good.
>
>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>> data timeout based on host clock when using finite mode (HW busy
>> detection based on DATA TIMEOUT count value when cmd operation timeout
>> is < 11s for tegra host).
>>
>> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
>> force R1B during SWITCH and SLEEP_AWAKE.
>>
>> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
>> which can be used for hosts supporting long or infinite HW busy wait
>> detection and will update mmc and mmc_ops drivers to not allow convert
>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.
> That seems reasonable, it becomes probably both easier and clearer by
> adding a new host cap.
>
> In any case, let me help out and cook a patch for this for the core
> part (I leave the sdhci change to you). It may be a bit tricky,
> especially since I have currently queued a bunch of new changes for
> v5.7, that enables more users of mmc_poll_for_busy() in the core.
> Maybe I need to temporarily drop them, so we can fix these problems
> first. I will check.
>
> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
> seems to be describing the common problem we have for sdhci
> omap/tegra.
>
> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
> sdhci- tegra, so while at it, perhaps you can cook a patch for that as
> well.
>
> Kind regards
> Uffe

OK, I sent v1 yesterday. Please ignore them then.

Will send out patches only for HW busy wait modes program based on cmd
timeout and WAIT_WHILE_BUSY enabled.

Thanks

Sowjanya

2020-03-10 17:10:50

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

[...]

> >>>
> >>> I would like to get the regression fixed asap, but I also would like
> >>> to avoid reverting patches, unless really necessary. May I propose the
> >>> following two options.
> >>>
> >>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
> >>> an R1 response doesn't work - and then fix that behaviour.
> >>>
> >>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
> >>> the core to always use R1B for CMD6 (and erase). This also means that
> >>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
> >>> disable the HW busy timeout and just wait "forever".
> >>>
> >>> If you decide for 2, you can add the software timeout support on top,
> >>> but make that can be considered as a next step of an improvement,
> >>> rather than needed as fix. Note that, I believe there are some support
> >>> for software timeout already in the sdhci core, maybe you need to
> >>> tweak it a bit for your case, I don't know.
> >>>
> >>> Kind regards
> >>> Uffe
> >> Hi Uffe
> >>
> >> Will go with 2nd option and will send patches out when ready.
> > Okay, good.
> >
> >> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
> >> data timeout based on host clock when using finite mode (HW busy
> >> detection based on DATA TIMEOUT count value when cmd operation timeout
> >> is < 11s for tegra host).
> >>
> >> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
> >> force R1B during SWITCH and SLEEP_AWAKE.
> >>
> >> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
> >> which can be used for hosts supporting long or infinite HW busy wait
> >> detection and will update mmc and mmc_ops drivers to not allow convert
> >> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.
> > That seems reasonable, it becomes probably both easier and clearer by
> > adding a new host cap.
> >
> > In any case, let me help out and cook a patch for this for the core
> > part (I leave the sdhci change to you). It may be a bit tricky,
> > especially since I have currently queued a bunch of new changes for
> > v5.7, that enables more users of mmc_poll_for_busy() in the core.
> > Maybe I need to temporarily drop them, so we can fix these problems
> > first. I will check.
> >
> > Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
> > seems to be describing the common problem we have for sdhci
> > omap/tegra.
> >
> > Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
> > sdhci- tegra, so while at it, perhaps you can cook a patch for that as
> > well.
> >
> > Kind regards
> > Uffe
>
> OK, I sent v1 yesterday. Please ignore them then.

Oh, I haven't seen them. In any case, I am ignoring them.

>
> Will send out patches only for HW busy wait modes program based on cmd
> timeout and WAIT_WHILE_BUSY enabled.

Great, thanks!

Please help test the series I just posted as well, if you have the
time ofcourse.

Kind regards
Uffe

2020-03-10 17:26:48

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/10/20 10:09 AM, Ulf Hansson wrote:
> External email: Use caution opening links or attachments
>
>
> [...]
>
>>>>> I would like to get the regression fixed asap, but I also would like
>>>>> to avoid reverting patches, unless really necessary. May I propose the
>>>>> following two options.
>>>>>
>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>
>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which makes
>>>>> the core to always use R1B for CMD6 (and erase). This also means that
>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
>>>>> disable the HW busy timeout and just wait "forever".
>>>>>
>>>>> If you decide for 2, you can add the software timeout support on top,
>>>>> but make that can be considered as a next step of an improvement,
>>>>> rather than needed as fix. Note that, I believe there are some support
>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>> tweak it a bit for your case, I don't know.
>>>>>
>>>>> Kind regards
>>>>> Uffe
>>>> Hi Uffe
>>>>
>>>> Will go with 2nd option and will send patches out when ready.
>>> Okay, good.
>>>
>>>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>>>> data timeout based on host clock when using finite mode (HW busy
>>>> detection based on DATA TIMEOUT count value when cmd operation timeout
>>>> is < 11s for tegra host).
>>>>
>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra host to
>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>
>>>> So, was thinking to introduce host capability MMC_CAP2_LONG_WAIT_HW_BUSY
>>>> which can be used for hosts supporting long or infinite HW busy wait
>>>> detection and will update mmc and mmc_ops drivers to not allow convert
>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and SWITCH.
>>> That seems reasonable, it becomes probably both easier and clearer by
>>> adding a new host cap.
>>>
>>> In any case, let me help out and cook a patch for this for the core
>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>> especially since I have currently queued a bunch of new changes for
>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>> Maybe I need to temporarily drop them, so we can fix these problems
>>> first. I will check.
>>>
>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
>>> seems to be describing the common problem we have for sdhci
>>> omap/tegra.
>>>
>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>> sdhci- tegra, so while at it, perhaps you can cook a patch for that as
>>> well.
>>>
>>> Kind regards
>>> Uffe
>> OK, I sent v1 yesterday. Please ignore them then.
> Oh, I haven't seen them. In any case, I am ignoring them.
>
>> Will send out patches only for HW busy wait modes program based on cmd
>> timeout and WAIT_WHILE_BUSY enabled.
> Great, thanks!
>
> Please help test the series I just posted as well, if you have the
> time ofcourse.
>
> Kind regards
> Uffe

Sure,

Thanks

Sowjanya

2020-03-10 21:57:19

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
>
> On 3/10/20 10:09 AM, Ulf Hansson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> [...]
>>
>>>>>> I would like to get the regression fixed asap, but I also would like
>>>>>> to avoid reverting patches, unless really necessary. May I
>>>>>> propose the
>>>>>> following two options.
>>>>>>
>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6 with
>>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>>
>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which
>>>>>> makes
>>>>>> the core to always use R1B for CMD6 (and erase). This also means
>>>>>> that
>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra must
>>>>>> disable the HW busy timeout and just wait "forever".
>>>>>>
>>>>>> If you decide for 2, you can add the software timeout support on
>>>>>> top,
>>>>>> but make that can be considered as a next step of an improvement,
>>>>>> rather than needed as fix. Note that, I believe there are some
>>>>>> support
>>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>>> tweak it a bit for your case, I don't know.
>>>>>>
>>>>>> Kind regards
>>>>>> Uffe
>>>>> Hi Uffe
>>>>>
>>>>> Will go with 2nd option and will send patches out when ready.
>>>> Okay, good.
>>>>
>>>>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>>>>> data timeout based on host clock when using finite mode (HW busy
>>>>> detection based on DATA TIMEOUT count value when cmd operation
>>>>> timeout
>>>>> is < 11s for tegra host).
>>>>>
>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra
>>>>> host to
>>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>>
>>>>> So, was thinking to introduce host capability
>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
>>>>> which can be used for hosts supporting long or infinite HW busy wait
>>>>> detection and will update mmc and mmc_ops drivers to not allow
>>>>> convert
>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and
>>>>> SWITCH.
>>>> That seems reasonable, it becomes probably both easier and clearer by
>>>> adding a new host cap.
>>>>
>>>> In any case, let me help out and cook a patch for this for the core
>>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>>> especially since I have currently queued a bunch of new changes for
>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>>> Maybe I need to temporarily drop them, so we can fix these problems
>>>> first. I will check.
>>>>
>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
>>>> seems to be describing the common problem we have for sdhci
>>>> omap/tegra.
>>>>
>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for that as
>>>> well.
>>>>
>>>> Kind regards
>>>> Uffe
>>> OK, I sent v1 yesterday. Please ignore them then.
>> Oh, I haven't seen them. In any case, I am ignoring them.
>>
>>> Will send out patches only for HW busy wait modes program based on cmd
>>> timeout and WAIT_WHILE_BUSY enabled.
>> Great, thanks!
>>
>> Please help test the series I just posted as well, if you have the
>> time ofcourse.
>>
>> Kind regards
>> Uffe
>
> Sure,
>
> Thanks
>
> Sowjanya


mmc_sleep() also needs update to force R1B when host sets capability
MMC_CAP_NEED_RSP_BUSY


2020-03-10 23:09:36

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
>
> On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
>>
>> On 3/10/20 10:09 AM, Ulf Hansson wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> [...]
>>>
>>>>>>> I would like to get the regression fixed asap, but I also would
>>>>>>> like
>>>>>>> to avoid reverting patches, unless really necessary. May I
>>>>>>> propose the
>>>>>>> following two options.
>>>>>>>
>>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6
>>>>>>> with
>>>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>>>
>>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which
>>>>>>> makes
>>>>>>> the core to always use R1B for CMD6 (and erase). This also means
>>>>>>> that
>>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra
>>>>>>> must
>>>>>>> disable the HW busy timeout and just wait "forever".
>>>>>>>
>>>>>>> If you decide for 2, you can add the software timeout support on
>>>>>>> top,
>>>>>>> but make that can be considered as a next step of an improvement,
>>>>>>> rather than needed as fix. Note that, I believe there are some
>>>>>>> support
>>>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>>>> tweak it a bit for your case, I don't know.
>>>>>>>
>>>>>>> Kind regards
>>>>>>> Uffe
>>>>>> Hi Uffe
>>>>>>
>>>>>> Will go with 2nd option and will send patches out when ready.
>>>>> Okay, good.
>>>>>
>>>>>> BTW, Tegra host also supports SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK
>>>>>> for
>>>>>> data timeout based on host clock when using finite mode (HW busy
>>>>>> detection based on DATA TIMEOUT count value when cmd operation
>>>>>> timeout
>>>>>> is < 11s for tegra host).
>>>>>>
>>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra
>>>>>> host to
>>>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>>>
>>>>>> So, was thinking to introduce host capability
>>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
>>>>>> which can be used for hosts supporting long or infinite HW busy wait
>>>>>> detection and will update mmc and mmc_ops drivers to not allow
>>>>>> convert
>>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and
>>>>>> SWITCH.
>>>>> That seems reasonable, it becomes probably both easier and clearer by
>>>>> adding a new host cap.
>>>>>
>>>>> In any case, let me help out and cook a patch for this for the core
>>>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>>>> especially since I have currently queued a bunch of new changes for
>>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>>>> Maybe I need to temporarily drop them, so we can fix these problems
>>>>> first. I will check.
>>>>>
>>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as that
>>>>> seems to be describing the common problem we have for sdhci
>>>>> omap/tegra.
>>>>>
>>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for
>>>>> that as
>>>>> well.
>>>>>
>>>>> Kind regards
>>>>> Uffe
>>>> OK, I sent v1 yesterday. Please ignore them then.
>>> Oh, I haven't seen them. In any case, I am ignoring them.
>>>
>>>> Will send out patches only for HW busy wait modes program based on cmd
>>>> timeout and WAIT_WHILE_BUSY enabled.
>>> Great, thanks!
>>>
>>> Please help test the series I just posted as well, if you have the
>>> time ofcourse.
>>>
>>> Kind regards
>>> Uffe
>>
>> Sure,
>>
>> Thanks
>>
>> Sowjanya
>
>
> mmc_sleep() also needs update to force R1B when host sets capability
> MMC_CAP_NEED_RSP_BUSY
>
>
Tested patches and they work good.

2020-03-11 00:20:31

by Sowjanya Komatineni

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110


On 3/10/20 4:10 PM, Sowjanya Komatineni wrote:
>
> On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
>>
>> On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
>>>
>>> On 3/10/20 10:09 AM, Ulf Hansson wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> [...]
>>>>
>>>>>>>> I would like to get the regression fixed asap, but I also would
>>>>>>>> like
>>>>>>>> to avoid reverting patches, unless really necessary. May I
>>>>>>>> propose the
>>>>>>>> following two options.
>>>>>>>>
>>>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6
>>>>>>>> with
>>>>>>>> an R1 response doesn't work - and then fix that behaviour.
>>>>>>>>
>>>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which
>>>>>>>> makes
>>>>>>>> the core to always use R1B for CMD6 (and erase). This also
>>>>>>>> means that
>>>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra
>>>>>>>> must
>>>>>>>> disable the HW busy timeout and just wait "forever".
>>>>>>>>
>>>>>>>> If you decide for 2, you can add the software timeout support
>>>>>>>> on top,
>>>>>>>> but make that can be considered as a next step of an improvement,
>>>>>>>> rather than needed as fix. Note that, I believe there are some
>>>>>>>> support
>>>>>>>> for software timeout already in the sdhci core, maybe you need to
>>>>>>>> tweak it a bit for your case, I don't know.
>>>>>>>>
>>>>>>>> Kind regards
>>>>>>>> Uffe
>>>>>>> Hi Uffe
>>>>>>>
>>>>>>> Will go with 2nd option and will send patches out when ready.
>>>>>> Okay, good.
>>>>>>
>>>>>>> BTW, Tegra host also supports
>>>>>>> SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
>>>>>>> data timeout based on host clock when using finite mode (HW busy
>>>>>>> detection based on DATA TIMEOUT count value when cmd operation
>>>>>>> timeout
>>>>>>> is < 11s for tegra host).
>>>>>>>
>>>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra
>>>>>>> host to
>>>>>>> force R1B during SWITCH and SLEEP_AWAKE.
>>>>>>>
>>>>>>> So, was thinking to introduce host capability
>>>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
>>>>>>> which can be used for hosts supporting long or infinite HW busy
>>>>>>> wait
>>>>>>> detection and will update mmc and mmc_ops drivers to not allow
>>>>>>> convert
>>>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and
>>>>>>> SWITCH.
>>>>>> That seems reasonable, it becomes probably both easier and
>>>>>> clearer by
>>>>>> adding a new host cap.
>>>>>>
>>>>>> In any case, let me help out and cook a patch for this for the core
>>>>>> part (I leave the sdhci change to you). It may be a bit tricky,
>>>>>> especially since I have currently queued a bunch of new changes for
>>>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
>>>>>> Maybe I need to temporarily drop them, so we can fix these problems
>>>>>> first. I will check.
>>>>>>
>>>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as
>>>>>> that
>>>>>> seems to be describing the common problem we have for sdhci
>>>>>> omap/tegra.
>>>>>>
>>>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
>>>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for
>>>>>> that as
>>>>>> well.
>>>>>>
>>>>>> Kind regards
>>>>>> Uffe
>>>>> OK, I sent v1 yesterday. Please ignore them then.
>>>> Oh, I haven't seen them. In any case, I am ignoring them.
>>>>
>>>>> Will send out patches only for HW busy wait modes program based on
>>>>> cmd
>>>>> timeout and WAIT_WHILE_BUSY enabled.
>>>> Great, thanks!
>>>>
>>>> Please help test the series I just posted as well, if you have the
>>>> time ofcourse.
>>>>
>>>> Kind regards
>>>> Uffe
>>>
>>> Sure,
>>>
>>> Thanks
>>>
>>> Sowjanya
>>
>>
>> mmc_sleep() also needs update to force R1B when host sets capability
>> MMC_CAP_NEED_RSP_BUSY
>>
>>
> Tested patches and they work good.
>
Sent sdhci-tegra v2 patches

- includes busy wait mode programming based on cmd busy_timeout

- enables MMC_CAP_WAIT_WHILE_BUSY

Thanks

Sowjanya

2020-03-11 08:35:52

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Wed, 11 Mar 2020 at 01:19, Sowjanya Komatineni
<[email protected]> wrote:
>
>
> On 3/10/20 4:10 PM, Sowjanya Komatineni wrote:
> >
> > On 3/10/20 2:59 PM, Sowjanya Komatineni wrote:
> >>
> >> On 3/10/20 10:27 AM, Sowjanya Komatineni wrote:
> >>>
> >>> On 3/10/20 10:09 AM, Ulf Hansson wrote:
> >>>> External email: Use caution opening links or attachments
> >>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>>>>> I would like to get the regression fixed asap, but I also would
> >>>>>>>> like
> >>>>>>>> to avoid reverting patches, unless really necessary. May I
> >>>>>>>> propose the
> >>>>>>>> following two options.
> >>>>>>>>
> >>>>>>>> 1. Find out why polling with ->card_busy() or CMD13, for a CMD6
> >>>>>>>> with
> >>>>>>>> an R1 response doesn't work - and then fix that behaviour.
> >>>>>>>>
> >>>>>>>> 2. Set the mmc->max_busy_timeout to zero for sdhci-tegra, which
> >>>>>>>> makes
> >>>>>>>> the core to always use R1B for CMD6 (and erase). This also
> >>>>>>>> means that
> >>>>>>>> when the cmd->busy_timeout becomes longer than 11s, sdhci-tegra
> >>>>>>>> must
> >>>>>>>> disable the HW busy timeout and just wait "forever".
> >>>>>>>>
> >>>>>>>> If you decide for 2, you can add the software timeout support
> >>>>>>>> on top,
> >>>>>>>> but make that can be considered as a next step of an improvement,
> >>>>>>>> rather than needed as fix. Note that, I believe there are some
> >>>>>>>> support
> >>>>>>>> for software timeout already in the sdhci core, maybe you need to
> >>>>>>>> tweak it a bit for your case, I don't know.
> >>>>>>>>
> >>>>>>>> Kind regards
> >>>>>>>> Uffe
> >>>>>>> Hi Uffe
> >>>>>>>
> >>>>>>> Will go with 2nd option and will send patches out when ready.
> >>>>>> Okay, good.
> >>>>>>
> >>>>>>> BTW, Tegra host also supports
> >>>>>>> SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK for
> >>>>>>> data timeout based on host clock when using finite mode (HW busy
> >>>>>>> detection based on DATA TIMEOUT count value when cmd operation
> >>>>>>> timeout
> >>>>>>> is < 11s for tegra host).
> >>>>>>>
> >>>>>>> So, looks like we cant set host max_busy_timeout to 0 for Tegra
> >>>>>>> host to
> >>>>>>> force R1B during SWITCH and SLEEP_AWAKE.
> >>>>>>>
> >>>>>>> So, was thinking to introduce host capability
> >>>>>>> MMC_CAP2_LONG_WAIT_HW_BUSY
> >>>>>>> which can be used for hosts supporting long or infinite HW busy
> >>>>>>> wait
> >>>>>>> detection and will update mmc and mmc_ops drivers to not allow
> >>>>>>> convert
> >>>>>>> R1B to R1B for hosts with this capability during SLEEP_AWAKE and
> >>>>>>> SWITCH.
> >>>>>> That seems reasonable, it becomes probably both easier and
> >>>>>> clearer by
> >>>>>> adding a new host cap.
> >>>>>>
> >>>>>> In any case, let me help out and cook a patch for this for the core
> >>>>>> part (I leave the sdhci change to you). It may be a bit tricky,
> >>>>>> especially since I have currently queued a bunch of new changes for
> >>>>>> v5.7, that enables more users of mmc_poll_for_busy() in the core.
> >>>>>> Maybe I need to temporarily drop them, so we can fix these problems
> >>>>>> first. I will check.
> >>>>>>
> >>>>>> Probably, I would also name the cap MMC_CAP_HW_NEED_RSP_BUSY, as
> >>>>>> that
> >>>>>> seems to be describing the common problem we have for sdhci
> >>>>>> omap/tegra.
> >>>>>>
> >>>>>> Finally, it seems like MMC_CAP_WAIT_WHILE_BUSY should be set for
> >>>>>> sdhci- tegra, so while at it, perhaps you can cook a patch for
> >>>>>> that as
> >>>>>> well.
> >>>>>>
> >>>>>> Kind regards
> >>>>>> Uffe
> >>>>> OK, I sent v1 yesterday. Please ignore them then.
> >>>> Oh, I haven't seen them. In any case, I am ignoring them.
> >>>>
> >>>>> Will send out patches only for HW busy wait modes program based on
> >>>>> cmd
> >>>>> timeout and WAIT_WHILE_BUSY enabled.
> >>>> Great, thanks!
> >>>>
> >>>> Please help test the series I just posted as well, if you have the
> >>>> time ofcourse.
> >>>>
> >>>> Kind regards
> >>>> Uffe
> >>>
> >>> Sure,
> >>>
> >>> Thanks
> >>>
> >>> Sowjanya
> >>
> >>
> >> mmc_sleep() also needs update to force R1B when host sets capability
> >> MMC_CAP_NEED_RSP_BUSY

Yes, I am on it! Thanks!

> >>
> >>
> > Tested patches and they work good.

Great, I am adding your tested-by tag then.

> >
> Sent sdhci-tegra v2 patches
>
> - includes busy wait mode programming based on cmd busy_timeout
>
> - enables MMC_CAP_WAIT_WHILE_BUSY

Sounds great, but I don't see the patches on the mailing list, nor did
they reach the mmc patchtracker.

Seems like you probably need to check your email settings when sending patches.

Kind regards
Uffe

2020-03-19 19:14:22

by Naresh Kamboju

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

FYI,

The arm device x15 running stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2
kernel pops up the following messages on console log,
Is this a problem ?

[ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version
7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020

[ 15.737765] mmc1: unspecified timeout for CMD6 - use generic
[ 16.754248] mmc1: unspecified timeout for CMD6 - use generic
[ 16.842071] mmc1: unspecified timeout for CMD6 - use generic
...
[ 20.580541] mmc1: unspecified timeout for CMD6 - use generic
[ 20.588216] mmc1: unspecified timeout for CMD6 - use generic
[ 20.604011] mmc1: unspecified timeout for CMD6 - use generic

ref:
https://lkft.validation.linaro.org/scheduler/job/1298207#L4037
https://lkft.validation.linaro.org/scheduler/job/1298945#L4132
https://lkft.validation.linaro.org/scheduler/job/1299973#L4232

- Naresh

2020-03-20 09:21:37

by Ulf Hansson

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Thu, 19 Mar 2020 at 20:12, Naresh Kamboju <[email protected]> wrote:
>
> FYI,
>
> The arm device x15 running stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2
> kernel pops up the following messages on console log,
> Is this a problem ?
>
> [ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version
> 7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020
>
> [ 15.737765] mmc1: unspecified timeout for CMD6 - use generic
> [ 16.754248] mmc1: unspecified timeout for CMD6 - use generic
> [ 16.842071] mmc1: unspecified timeout for CMD6 - use generic
> ...
> [ 20.580541] mmc1: unspecified timeout for CMD6 - use generic
> [ 20.588216] mmc1: unspecified timeout for CMD6 - use generic
> [ 20.604011] mmc1: unspecified timeout for CMD6 - use generic
>
> ref:
> https://lkft.validation.linaro.org/scheduler/job/1298207#L4037
> https://lkft.validation.linaro.org/scheduler/job/1298945#L4132
> https://lkft.validation.linaro.org/scheduler/job/1299973#L4232

The commit below is the problem, it shouldn't be applied for stable.

commit 533a6cfe08f96a7b5c65e06d20916d552c11b256
Author: Ulf Hansson <[email protected]>
Date: Wed Jan 22 15:27:47 2020 +0100
mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()

Let me sort it out with Greg/Sasha. I will keep you in the loop,
thanks for reporting!

Kind regards
Uffe

2020-03-20 09:51:01

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: LKFT: arm x15: mmc1: cache flush error -110

On Fri, Mar 20, 2020 at 10:20:04AM +0100, Ulf Hansson wrote:
> On Thu, 19 Mar 2020 at 20:12, Naresh Kamboju <[email protected]> wrote:
> >
> > FYI,
> >
> > The arm device x15 running stable rc 4.19.112-rc1, 5.4.27-rc1 and 5.5.11-rc2
> > kernel pops up the following messages on console log,
> > Is this a problem ?
> >
> > [ 0.000000] Linux version 4.19.112-rc1 (oe-user@oe-host) (gcc version
> > 7.3.0 (GCC)) #1 SMP Thu Mar 19 12:55:45 UTC 2020
> >
> > [ 15.737765] mmc1: unspecified timeout for CMD6 - use generic
> > [ 16.754248] mmc1: unspecified timeout for CMD6 - use generic
> > [ 16.842071] mmc1: unspecified timeout for CMD6 - use generic
> > ...
> > [ 20.580541] mmc1: unspecified timeout for CMD6 - use generic
> > [ 20.588216] mmc1: unspecified timeout for CMD6 - use generic
> > [ 20.604011] mmc1: unspecified timeout for CMD6 - use generic
> >
> > ref:
> > https://lkft.validation.linaro.org/scheduler/job/1298207#L4037
> > https://lkft.validation.linaro.org/scheduler/job/1298945#L4132
> > https://lkft.validation.linaro.org/scheduler/job/1299973#L4232
>
> The commit below is the problem, it shouldn't be applied for stable.
>
> commit 533a6cfe08f96a7b5c65e06d20916d552c11b256
> Author: Ulf Hansson <[email protected]>
> Date: Wed Jan 22 15:27:47 2020 +0100
> mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch()
>
> Let me sort it out with Greg/Sasha. I will keep you in the loop,
> thanks for reporting!

Now dropped from the 4.19, 5.4, and 5.5 stable queues, sorry for the
noise.

greg k-h