2018-06-19 06:29:24

by Kurt Kanzenbach

Subject: [PATCH 0/1] eMMC controller issue on Intel Baytrail SoC

Hi,

I've encountered a problem on an Intel Atom E3825. When performing lots of
reboots (10, 50, 100, ...) the eMMC controller stops working. The reset commands
won't work anymore and you get error messages such as:

|mmc1: Reset 0x1 never completed.
|sdhci: =========== REGISTER DUMP (mmc1)===========
|sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
|sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
|sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
|sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
|sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
|sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
|sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
|sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
|sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
|sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
|sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
|sdhci: Host ctl2: 0x0000ffff
|sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
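
For reference, register reads that return all ones, as in the dump above, are commonly what software sees when a PCI device no longer responds to MMIO accesses (for example because it is powered down). A minimal sketch in plain C of a check for that pattern follows; `struct reg_sample`, `all_ones()` and `dump_looks_dead()` are hypothetical names for illustration, not part of the sdhci driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one register value as printed in a dump. */
struct reg_sample {
	uint32_t value;  /* value read from the register */
	unsigned width;  /* register width in bytes: 1, 2 or 4 */
};

/* Value with every bit set for a register of the given width. */
static uint32_t all_ones(unsigned width)
{
	assert(width == 1 || width == 2 || width == 4);
	return width == 4 ? UINT32_MAX : (1u << (8 * width)) - 1u;
}

/* True if there is at least one sample and every sample reads all ones. */
static bool dump_looks_dead(const struct reg_sample *regs, size_t n)
{
	assert(regs != NULL);
	for (size_t i = 0; i < n; i++)
		if (regs[i].value != all_ones(regs[i].width))
			return false;
	return n > 0;
}
```

Fed with values from the dump above (for example Sys addr 0xffffffff at 4 bytes, Version 0x0000ffff at 2 bytes, Host ctl 0x000000ff at 1 byte), `dump_looks_dead()` returns true.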

After using ftrace, I've discovered that this issue happens when runtime power
management is utilized. So after searching a bit, I've found the errata list for
the E3825:

https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf

Erratum VLI10 basically states that suspend/resume shouldn't be used; otherwise,
wrong data may be transferred between memory and the device. Therefore, I've
disabled runtime power management and the issue disappeared. That's what the
following patch does.

This patch is tested against v4.17 and v4.9.

Any suggestions?

Kurt Kanzenbach (1):
mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

--
2.11.0


2018-06-19 06:29:13

by Kurt Kanzenbach

Subject: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
SoCs. The resulting error looks like:

|mmc1: Reset 0x1 never completed.
|sdhci: =========== REGISTER DUMP (mmc1)===========
|sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
|sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
|sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
|sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
|sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
|sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
|sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
|sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
|sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
|sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
|sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
|sdhci: Host ctl2: 0x0000ffff
|sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff

The behavior was observed on an Intel Atom E3825 performing lots of reboots. The
issue seems to occur if runtime power management is used. Found by utilizing
ftrace.

The erratum VLI10 for the Intel E3825 states that the eMMC controller
incorrectly announces that it supports suspend/resume. However, that shouldn't
be used, as the controller may incorrectly transfer data between memory and the
SD device.

Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.

Signed-off-by: Kurt Kanzenbach <[email protected]>
---
drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
index 77dd3521daae..df89381944cd 100644
--- a/drivers/mmc/host/sdhci-pci-core.c
+++ b/drivers/mmc/host/sdhci-pci-core.c
@@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
 	.priv_size = sizeof(struct intel_host),
 };

+/*
+ * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
+ * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
+ */
+static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
+	.allow_runtime_pm = false,
+	.probe_slot = byt_emmc_probe_slot,
+	.quirks = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
+	.quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
+		SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
+		SDHCI_QUIRK2_STOP_WITH_TC,
+	.ops = &sdhci_intel_byt_ops,
+	.priv_size = sizeof(struct intel_host),
+};
+
 static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
 	.allow_runtime_pm = true,
 	.probe_slot = glk_emmc_probe_slot,
@@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
 	SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
 	SDHCI_PCI_DEVICE(INTEL, BYT_SDIO, intel_byt_sdio),
 	SDHCI_PCI_DEVICE(INTEL, BYT_SD, intel_byt_sd),
-	SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
+	SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
 	SDHCI_PCI_DEVICE(INTEL, BSW_EMMC, intel_byt_emmc),
 	SDHCI_PCI_DEVICE(INTEL, BSW_SDIO, intel_byt_sdio),
 	SDHCI_PCI_DEVICE(INTEL, BSW_SD, intel_byt_sd),
--
2.11.0
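
The mechanism the patch relies on, a per-device fixes entry whose `allow_runtime_pm` flag is looked up through the ID table, can be modelled outside the kernel. The following is only a simplified userspace sketch: `struct fixes`, `struct id_entry`, the `DEV_*` placeholder IDs and `runtime_pm_allowed()` are made up for illustration and are not the driver's actual types, IDs or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a fixes structure; only the flag discussed here. */
struct fixes {
	bool allow_runtime_pm;
};

/* Placeholder identifiers for illustration, not real PCI device IDs. */
enum { DEV_BYT_EMMC2 = 1, DEV_BSW_EMMC = 2 };

struct id_entry {
	uint16_t device;
	const struct fixes *fixes;
};

static const struct fixes byt_emmc = { .allow_runtime_pm = true };
static const struct fixes byt_emmc_no_runtime_pm = { .allow_runtime_pm = false };

/* After the patch, only the BYT_EMMC2 entry points at the no-runtime-pm fixes. */
static const struct id_entry id_table[] = {
	{ DEV_BYT_EMMC2, &byt_emmc_no_runtime_pm },
	{ DEV_BSW_EMMC,  &byt_emmc },
};

/* Return the fixes entry for a device, or NULL if the device is unknown. */
static const struct fixes *lookup_fixes(uint16_t device)
{
	for (size_t i = 0; i < sizeof(id_table) / sizeof(id_table[0]); i++) {
		assert(id_table[i].fixes != NULL);
		if (id_table[i].device == device)
			return id_table[i].fixes;
	}
	return NULL;
}

/* Runtime PM is allowed only if the device is known and its fixes say so. */
static bool runtime_pm_allowed(uint16_t device)
{
	const struct fixes *f = lookup_fixes(device);

	return f != NULL && f->allow_runtime_pm;
}
```

In this model `runtime_pm_allowed(DEV_BYT_EMMC2)` is false while `runtime_pm_allowed(DEV_BSW_EMMC)` stays true, which mirrors the one-line change to the ID table in the diff.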


2018-06-19 07:05:21

by Adrian Hunter

Subject: Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

On 19/06/18 09:31, Kurt Kanzenbach wrote:
> Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
> SoCs. The resulting error looks like:
>
> |mmc1: Reset 0x1 never completed.
> |sdhci: =========== REGISTER DUMP (mmc1)===========
> |sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
> |sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
> |sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
> |sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
> |sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
> |sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
> |sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
> |sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
> |sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
> |sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
> |sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
> |sdhci: Host ctl2: 0x0000ffff
> |sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
>
> The behavior was observed on an Intel Atom E3825 performing lots of reboots. The

So you are saying this only happens at boot time? And only when re-booting?
Can you send all the kernel messages? Can you send an acpidump?

> issue seems to occur if runtime power management is used. Found by utilizing
> ftrace.
>
> The erratum VLI10 for the Intel E3825 states, that the eMMC controller
> incorrectly announces that it supports suspend/resume. However, that shouldn't
> be used, as the controller may incorrectly transfer data between memory and the
> SD device.

That erratum is not related to this problem. The suspend/resume that is
documented is an internal SDHCI feature, not the kernel's suspend/resume.
The SDHCI Suspend/Resume Mechanism is not supported in the driver, so it is
not being used anyway.

>
> Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.
>
> Signed-off-by: Kurt Kanzenbach <[email protected]>
> ---
> drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
> index 77dd3521daae..df89381944cd 100644
> --- a/drivers/mmc/host/sdhci-pci-core.c
> +++ b/drivers/mmc/host/sdhci-pci-core.c
> @@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
> .priv_size = sizeof(struct intel_host),
> };
>
> +/*
> + * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
> + * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
> + */
> +static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
> + .allow_runtime_pm = false,
> + .probe_slot = byt_emmc_probe_slot,
> + .quirks = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
> + .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
> + SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
> + SDHCI_QUIRK2_STOP_WITH_TC,
> + .ops = &sdhci_intel_byt_ops,
> + .priv_size = sizeof(struct intel_host),
> +};
> +
> static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
> .allow_runtime_pm = true,
> .probe_slot = glk_emmc_probe_slot,
> @@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
> SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
> SDHCI_PCI_DEVICE(INTEL, BYT_SDIO, intel_byt_sdio),
> SDHCI_PCI_DEVICE(INTEL, BYT_SD, intel_byt_sd),
> - SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
> + SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
> SDHCI_PCI_DEVICE(INTEL, BSW_EMMC, intel_byt_emmc),
> SDHCI_PCI_DEVICE(INTEL, BSW_SDIO, intel_byt_sdio),
> SDHCI_PCI_DEVICE(INTEL, BSW_SD, intel_byt_sd),
>


2018-06-20 13:17:20

by Kurt Kanzenbach

Subject: Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

Hi,

thanks for your response.

On Tue, Jun 19, 2018 at 10:03:01AM +0300, Adrian Hunter wrote:
> On 19/06/18 09:31, Kurt Kanzenbach wrote:
> > Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
> > SoCs. The resulting error looks like:
> >
> > |mmc1: Reset 0x1 never completed.
> > |sdhci: =========== REGISTER DUMP (mmc1)===========
> > |sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
> > |sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
> > |sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
> > |sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
> > |sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
> > |sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
> > |sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
> > |sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
> > |sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
> > |sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
> > |sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
> > |sdhci: Host ctl2: 0x0000ffff
> > |sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
> >
> > The behavior was observed on an Intel Atom E3825 performing lots of reboots. The
>
> So you are saying this only happens at boot time? And only when
> re-booting?

Well, exactly. This issue was only observed when rebooting, not on cold
boots.

> Can you send all the kernel messages? Can you send an acpidump?

The kernel log is straightforward. The system is booting and starting a
few applications. Afterwards the issue happens. The root filesystem is
located on the eMMC.

The error message above is from the Linux v4.9 boot log.

On v4.17 the same issue happens, but the error messages are different:

|mmc1: Timeout waiting for hardware interrupt.
|mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
|mmc1: sdhci: Sys addr: 0x00000002 | Version: 0x00001002
|mmc1: sdhci: Blk size: 0x00007200 | Blk cnt: 0x00000000
|mmc1: sdhci: Argument: 0x00040fd4 | Trn mode: 0x0000003b
|mmc1: sdhci: Present: 0x1fff0000 | Host ctl: 0x00000035
|mmc1: sdhci: Power: 0x0000000b | Blk gap: 0x00000080
|mmc1: sdhci: Wake-up: 0x00000000 | Clock: 0x00000207
|mmc1: sdhci: Timeout: 0x00000000 | Int stat: 0x00000003
|mmc1: sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
|mmc1: sdhci: AC12 err: 0x00000000 | Slot int: 0x00000001
|mmc1: sdhci: Caps: 0x446cc801 | Caps_1: 0x00000005
|mmc1: sdhci: Cmd: 0x0000123a | Max curr: 0x00000000
|mmc1: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0xffffffff
|mmc1: sdhci: Resp[2]: 0x320f5913 | Resp[3]: 0x00000900
|mmc1: sdhci: Host ctl2: 0x0000000c
|mmc1: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x34ee5208
|mmc1: sdhci: ============================================
|[...]

Both issues disappear when disabling runtime pm.
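
Runtime PM can also be switched off for a single device without a kernel patch, through the generic `power/control` sysfs attribute ("on" keeps the device active, "auto" allows runtime suspend). This is a different route from the driver flag used in the patch. A minimal sketch, assuming the caller passes the sysfs directory of the controller; which directory backs mmc1 is system-specific and not given in this thread, and `set_runtime_pm_control()` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "on" or "auto" to <dev_dir>/power/control.
 * dev_dir is the device's sysfs directory, supplied by the caller.
 * Returns 0 on success, -1 on any error.
 */
static int set_runtime_pm_control(const char *dev_dir, const char *mode)
{
	char path[512];
	FILE *f;
	int n;

	assert(dev_dir != NULL && mode != NULL);

	/* Only the two values accepted by the attribute are passed through. */
	if (strcmp(mode, "on") != 0 && strcmp(mode, "auto") != 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/power/control", dev_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(mode, f) == EOF) {
		fclose(f);
		return -1;
	}

	/* Buffered write errors may only show up when the stream is closed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling it with "on" needs root privileges on a real system and only lasts until the next boot or until "auto" is written back.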

Anyway, I'll prepare an acpidump for you.

>
> > issue seems to occur if runtime power management is used. Found by utilizing
> > ftrace.
> >
> > The erratum VLI10 for the Intel E3825 states, that the eMMC controller
> > incorrectly announces that it supports suspend/resume. However, that shouldn't
> > be used, as the controller may incorrectly transfer data between memory and the
> > SD device.
>
> That erratum is not related to this problem. The suspend/resume that is
> documented is an internal SDHCI feature, not the kernel's suspend/resume.
> The SDHCI Suspend/Resume Mechanism is not supported in the driver, so it is
> not being used anyway.

Thanks for the clarification.

Do you have any idea why this issue might happen?

Thanks, Kurt

>
> >
> > Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.
> >
> > Signed-off-by: Kurt Kanzenbach <[email protected]>
> > ---
> > drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
> > index 77dd3521daae..df89381944cd 100644
> > --- a/drivers/mmc/host/sdhci-pci-core.c
> > +++ b/drivers/mmc/host/sdhci-pci-core.c
> > @@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
> > .priv_size = sizeof(struct intel_host),
> > };
> >
> > +/*
> > + * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
> > + * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
> > + */
> > +static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
> > + .allow_runtime_pm = false,
> > + .probe_slot = byt_emmc_probe_slot,
> > + .quirks = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
> > + .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
> > + SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
> > + SDHCI_QUIRK2_STOP_WITH_TC,
> > + .ops = &sdhci_intel_byt_ops,
> > + .priv_size = sizeof(struct intel_host),
> > +};
> > +
> > static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
> > .allow_runtime_pm = true,
> > .probe_slot = glk_emmc_probe_slot,
> > @@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
> > SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
> > SDHCI_PCI_DEVICE(INTEL, BYT_SDIO, intel_byt_sdio),
> > SDHCI_PCI_DEVICE(INTEL, BYT_SD, intel_byt_sd),
> > - SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
> > + SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
> > SDHCI_PCI_DEVICE(INTEL, BSW_EMMC, intel_byt_emmc),
> > SDHCI_PCI_DEVICE(INTEL, BSW_SDIO, intel_byt_sdio),
> > SDHCI_PCI_DEVICE(INTEL, BSW_SD, intel_byt_sd),
> >
>

2018-06-20 15:54:33

by Adrian Hunter

Subject: Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

On 06/20/2018 04:15 PM, Kurt Kanzenbach wrote:
> Hi,
>
> thanks for your response.
>
> On Tue, Jun 19, 2018 at 10:03:01AM +0300, Adrian Hunter wrote:
>> On 19/06/18 09:31, Kurt Kanzenbach wrote:
>>> Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
>>> SoCs. The resulting error looks like:
>>>
>>> |mmc1: Reset 0x1 never completed.
>>> |sdhci: =========== REGISTER DUMP (mmc1)===========
>>> |sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
>>> |sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
>>> |sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
>>> |sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
>>> |sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
>>> |sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
>>> |sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
>>> |sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
>>> |sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
>>> |sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
>>> |sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
>>> |sdhci: Host ctl2: 0x0000ffff
>>> |sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
>>>
>>> The behavior was observed on an Intel Atom E3825 performing lots of reboots. The
>>
>> So you are saying this only happens at boot time? And only when
>> re-booting?
>
> well, exactly. This issue was only observed when rebooting, not on cold
> boots.
>
>> Can you send all the kernel messages? Can you send an acpidump?
>
> The kernel log is straightforward. The system is booting and starting a
> few applications. Afterwards the issue happens. The rootfilesystem is
> located on the eMMC.

The full messages can be more revealing such as showing what else was
happening and the order of events, so I would still like to see them.

>
> The error message above is from the Linux v4.9 boot log.
>
> On v4.17 the same issue happens, but the error messages are different:
>
> |mmc1: Timeout waiting for hardware interrupt.
> |mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
> |mmc1: sdhci: Sys addr: 0x00000002 | Version: 0x00001002
> |mmc1: sdhci: Blk size: 0x00007200 | Blk cnt: 0x00000000
> |mmc1: sdhci: Argument: 0x00040fd4 | Trn mode: 0x0000003b
> |mmc1: sdhci: Present: 0x1fff0000 | Host ctl: 0x00000035
> |mmc1: sdhci: Power: 0x0000000b | Blk gap: 0x00000080
> |mmc1: sdhci: Wake-up: 0x00000000 | Clock: 0x00000207
> |mmc1: sdhci: Timeout: 0x00000000 | Int stat: 0x00000003
> |mmc1: sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
> |mmc1: sdhci: AC12 err: 0x00000000 | Slot int: 0x00000001
> |mmc1: sdhci: Caps: 0x446cc801 | Caps_1: 0x00000005
> |mmc1: sdhci: Cmd: 0x0000123a | Max curr: 0x00000000
> |mmc1: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0xffffffff
> |mmc1: sdhci: Resp[2]: 0x320f5913 | Resp[3]: 0x00000900
> |mmc1: sdhci: Host ctl2: 0x0000000c
> |mmc1: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x34ee5208
> |mmc1: sdhci: ============================================
> |[...]

Those messages show that the interrupt did happen but the driver did not see
it. Are you doing anything unusual like using threadirqs?
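
To make that reading of the dump concrete: in the SD Host Controller interrupt status layout, bit 0 is Command Complete and bit 1 is Transfer Complete, so Int stat 0x00000003 has both set, and Sig enab 0x02ff000b has both enabled. A small sketch in plain C that decodes the low bits; the macro and function names are local to this example and are not the driver's identifiers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Low bits of the normal interrupt status register (names local to this sketch). */
#define INT_CMD_COMPLETE   (1u << 0)
#define INT_XFER_COMPLETE  (1u << 1)
#define INT_DMA            (1u << 3)

static const struct {
	uint32_t bit;
	const char *name;
} int_names[] = {
	{ INT_CMD_COMPLETE,  "command-complete" },
	{ INT_XFER_COMPLETE, "transfer-complete" },
	{ INT_DMA,           "dma" },
};

/* Interrupts that are both latched in the status and enabled for signalling. */
static uint32_t pending_and_enabled(uint32_t int_stat, uint32_t sig_enab)
{
	return int_stat & sig_enab;
}

/*
 * Write a space-separated list of the known bits set in 'bits' into buf.
 * Returns the string length, or -1 if buf is too small.
 */
static int describe_pending(uint32_t bits, char *buf, size_t len)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(int_names) / sizeof(int_names[0]); i++) {
		int n;

		if (!(bits & int_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", int_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

With the v4.17 values quoted above, `pending_and_enabled(0x00000003, 0x02ff000b)` yields 0x3: both completion interrupts were latched and enabled when the timeout fired.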

>
> Both issues disappear when disabling runtime pm.
>
> Anyway I'll prepare an acpidump for you.
>
>>
>>> issue seems to occur if runtime power management is used. Found by utilizing
>>> ftrace.
>>>
>>> The erratum VLI10 for the Intel E3825 states, that the eMMC controller
>>> incorrectly announces that it supports suspend/resume. However, that shouldn't
>>> be used, as the controller may incorrectly transfer data between memory and the
>>> SD device.
>>
>> That erratum is not related to this problem. The suspend/resume that is
>> documented is an internal SDHCI feature, not the kernel's suspend/resume.
>> The SDHCI Suspend/Resume Mechanism is not supported in the driver, so it is
>> not being used anyway.
>
> Thanks for the clarification.
>
> Do you have any idea why this issue might happen?

No, but it seems like the runtime pm callbacks aren't happening when they
are supposed to.

>
> Thanks, Kurt
>
>>
>>>
>>> Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.
>>>
>>> Signed-off-by: Kurt Kanzenbach <[email protected]>
>>> ---
>>> drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
>>> index 77dd3521daae..df89381944cd 100644
>>> --- a/drivers/mmc/host/sdhci-pci-core.c
>>> +++ b/drivers/mmc/host/sdhci-pci-core.c
>>> @@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
>>> .priv_size = sizeof(struct intel_host),
>>> };
>>>
>>> +/*
>>> + * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
>>> + * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
>>> + */
>>> +static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
>>> + .allow_runtime_pm = false,
>>> + .probe_slot = byt_emmc_probe_slot,
>>> + .quirks = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
>>> + .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
>>> + SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
>>> + SDHCI_QUIRK2_STOP_WITH_TC,
>>> + .ops = &sdhci_intel_byt_ops,
>>> + .priv_size = sizeof(struct intel_host),
>>> +};
>>> +
>>> static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
>>> .allow_runtime_pm = true,
>>> .probe_slot = glk_emmc_probe_slot,
>>> @@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
>>> SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
>>> SDHCI_PCI_DEVICE(INTEL, BYT_SDIO, intel_byt_sdio),
>>> SDHCI_PCI_DEVICE(INTEL, BYT_SD, intel_byt_sd),
>>> - SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
>>> + SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
>>> SDHCI_PCI_DEVICE(INTEL, BSW_EMMC, intel_byt_emmc),
>>> SDHCI_PCI_DEVICE(INTEL, BSW_SDIO, intel_byt_sdio),
>>> SDHCI_PCI_DEVICE(INTEL, BSW_SD, intel_byt_sd),
>>>
>>
>


2018-06-25 14:34:44

by Kurt Kanzenbach

Subject: Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

> On 06/20/2018 04:15 PM, Kurt Kanzenbach wrote:
> > Hi,
> >
> > thanks for your response.
> >
> > On Tue, Jun 19, 2018 at 10:03:01AM +0300, Adrian Hunter wrote:
> >> On 19/06/18 09:31, Kurt Kanzenbach wrote:
> >>> Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
> >>> SoCs. The resulting error looks like:
> >>>
> >>> |mmc1: Reset 0x1 never completed.
> >>> |sdhci: =========== REGISTER DUMP (mmc1)===========
> >>> |sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
> >>> |sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
> >>> |sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
> >>> |sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
> >>> |sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
> >>> |sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
> >>> |sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
> >>> |sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
> >>> |sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
> >>> |sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
> >>> |sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
> >>> |sdhci: Host ctl2: 0x0000ffff
> >>> |sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
> >>>
> >>> The behavior was observed on an Intel Atom E3825 performing lots of reboots. The
> >>
> >> So you are saying this only happens at boot time? And only when
> >> re-booting?
> >
> > well, exactly. This issue was only observed when rebooting, not on cold
> > boots.
> >
> >> Can you send all the kernel messages? Can you send an acpidump?
> >
> > The kernel log is straightforward. The system is booting and starting a
> > few applications. Afterwards the issue happens. The rootfilesystem is
> > located on the eMMC.
>
> The full messages can be more revealing such as showing what else was
> happening and the order of events, so I would still like to see them.
>
> >
> > The error message above is from the Linux v4.9 boot log.
> >
> > On v4.17 the same issue happens, but the error messages are different:
> >
> > |mmc1: Timeout waiting for hardware interrupt.
> > |mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
> > |mmc1: sdhci: Sys addr: 0x00000002 | Version: 0x00001002
> > |mmc1: sdhci: Blk size: 0x00007200 | Blk cnt: 0x00000000
> > |mmc1: sdhci: Argument: 0x00040fd4 | Trn mode: 0x0000003b
> > |mmc1: sdhci: Present: 0x1fff0000 | Host ctl: 0x00000035
> > |mmc1: sdhci: Power: 0x0000000b | Blk gap: 0x00000080
> > |mmc1: sdhci: Wake-up: 0x00000000 | Clock: 0x00000207
> > |mmc1: sdhci: Timeout: 0x00000000 | Int stat: 0x00000003
> > |mmc1: sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
> > |mmc1: sdhci: AC12 err: 0x00000000 | Slot int: 0x00000001
> > |mmc1: sdhci: Caps: 0x446cc801 | Caps_1: 0x00000005
> > |mmc1: sdhci: Cmd: 0x0000123a | Max curr: 0x00000000
> > |mmc1: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0xffffffff
> > |mmc1: sdhci: Resp[2]: 0x320f5913 | Resp[3]: 0x00000900
> > |mmc1: sdhci: Host ctl2: 0x0000000c
> > |mmc1: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x34ee5208
> > |mmc1: sdhci: ============================================
> > |[...]
>
> Those messages show that the interrupt did happen but the driver did not see
> it. Are you doing anything unusual like using threadirqs?

No, I'm not doing anything unusual. The mmc core uses threaded irqs by
default, but most of the work is performed in the primary handler, so
that shouldn't be a problem.

But in the v4.9 case, we use preempt rt. I took a few scheduler traces
in order to see if there might be any task blocking or preempting the
mmc irqs. However, that's not the case.

The common pattern is: mmc1 is suspended, afterwards some applications
use mmc0 and finally a different application accesses mmc1. The suspend
function is called and during initialization the reset doesn't work
anymore.

Anyway, I'll perform more tests.

Thanks, Kurt

>
> >
> > Both issues disappear when disabling runtime pm.
> >
> > Anyway I'll prepare an acpidump for you.
> >
> >>
> >>> issue seems to occur if runtime power management is used. Found by utilizing
> >>> ftrace.
> >>>
> >>> The erratum VLI10 for the Intel E3825 states, that the eMMC controller
> >>> incorrectly announces that it supports suspend/resume. However, that shouldn't
> >>> be used, as the controller may incorrectly transfer data between memory and the
> >>> SD device.
> >>
> >> That erratum is not related to this problem. The suspend/resume that is
> >> documented is an internal SDHCI feature, not the kernel's suspend/resume.
> >> The SDHCI Suspend/Resume Mechanism is not supported in the driver, so it is
> >> not being used anyway.
> >
> > Thanks for the clarification.
> >
> > Do you have any idea why this issue might happen?
>
> No, but it seems like the runtime pm callbacks aren't happening when they
> are supposed to.
>
> >
> > Thanks, Kurt
> >
> >>
> >>>
> >>> Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.
> >>>
> >>> Signed-off-by: Kurt Kanzenbach <[email protected]>
> >>> ---
> >>> drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
> >>> 1 file changed, 16 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
> >>> index 77dd3521daae..df89381944cd 100644
> >>> --- a/drivers/mmc/host/sdhci-pci-core.c
> >>> +++ b/drivers/mmc/host/sdhci-pci-core.c
> >>> @@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
> >>> .priv_size = sizeof(struct intel_host),
> >>> };
> >>>
> >>> +/*
> >>> + * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
> >>> + * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
> >>> + */
> >>> +static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
> >>> + .allow_runtime_pm = false,
> >>> + .probe_slot = byt_emmc_probe_slot,
> >>> + .quirks = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
> >>> + .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
> >>> + SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
> >>> + SDHCI_QUIRK2_STOP_WITH_TC,
> >>> + .ops = &sdhci_intel_byt_ops,
> >>> + .priv_size = sizeof(struct intel_host),
> >>> +};
> >>> +
> >>> static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
> >>> .allow_runtime_pm = true,
> >>> .probe_slot = glk_emmc_probe_slot,
> >>> @@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
> >>> SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
> >>> SDHCI_PCI_DEVICE(INTEL, BYT_SDIO, intel_byt_sdio),
> >>> SDHCI_PCI_DEVICE(INTEL, BYT_SD, intel_byt_sd),
> >>> - SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
> >>> + SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
> >>> SDHCI_PCI_DEVICE(INTEL, BSW_EMMC, intel_byt_emmc),
> >>> SDHCI_PCI_DEVICE(INTEL, BSW_SDIO, intel_byt_sdio),
> >>> SDHCI_PCI_DEVICE(INTEL, BSW_SD, intel_byt_sd),
> >>>
> >>
> >
>

2018-07-10 12:54:21

by Adrian Hunter

Subject: Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

On 25/06/18 17:36, Kurt Kanzenbach wrote:
>> On 06/20/2018 04:15 PM, Kurt Kanzenbach wrote:
>>> Hi,
>>>
>>> thanks for your response.
>>>
>>> On Tue, Jun 19, 2018 at 10:03:01AM +0300, Adrian Hunter wrote:
>>>> On 19/06/18 09:31, Kurt Kanzenbach wrote:
>>>>> Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
>>>>> SoCs. The resulting error looks like:
>>>>>
>>>>> |mmc1: Reset 0x1 never completed.
>>>>> |sdhci: =========== REGISTER DUMP (mmc1)===========
>>>>> |sdhci: Sys addr: 0xffffffff | Version: 0x0000ffff
>>>>> |sdhci: Blk size: 0x0000ffff | Blk cnt: 0x0000ffff
>>>>> |sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
>>>>> |sdhci: Present: 0xffffffff | Host ctl: 0x000000ff
>>>>> |sdhci: Power: 0x000000ff | Blk gap: 0x000000ff
>>>>> |sdhci: Wake-up: 0x000000ff | Clock: 0x0000ffff
>>>>> |sdhci: Timeout: 0x000000ff | Int stat: 0xffffffff
>>>>> |sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
>>>>> |sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
>>>>> |sdhci: Caps: 0xffffffff | Caps_1: 0xffffffff
>>>>> |sdhci: Cmd: 0x0000ffff | Max curr: 0xffffffff
>>>>> |sdhci: Host ctl2: 0x0000ffff
>>>>> |sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
>>>>>
>>>>> The behavior was observed on an Intel Atom E3825 performing lots of reboots. The
>>>>
>>>> So you are saying this only happens at boot time? And only when
>>>> re-booting?
>>>
>>> well, exactly. This issue was only observed when rebooting, not on cold
>>> boots.
>>>
>>>> Can you send all the kernel messages? Can you send an acpidump?
>>>
>>> The kernel log is straightforward. The system is booting and starting a
>>> few applications. Afterwards the issue happens. The rootfilesystem is
>>> located on the eMMC.
>>
>> The full messages can be more revealing such as showing what else was
>> happening and the order of events, so I would still like to see them.
>>
>>>
>>> The error message above is from the Linux v4.9 boot log.
>>>
>>> On v4.17 the same issue happens, but the error messages are different:
>>>
>>> |mmc1: Timeout waiting for hardware interrupt.
>>> |mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> |mmc1: sdhci: Sys addr: 0x00000002 | Version: 0x00001002
>>> |mmc1: sdhci: Blk size: 0x00007200 | Blk cnt: 0x00000000
>>> |mmc1: sdhci: Argument: 0x00040fd4 | Trn mode: 0x0000003b
>>> |mmc1: sdhci: Present: 0x1fff0000 | Host ctl: 0x00000035
>>> |mmc1: sdhci: Power: 0x0000000b | Blk gap: 0x00000080
>>> |mmc1: sdhci: Wake-up: 0x00000000 | Clock: 0x00000207
>>> |mmc1: sdhci: Timeout: 0x00000000 | Int stat: 0x00000003
>>> |mmc1: sdhci: Int enab: 0x02ff000b | Sig enab: 0x02ff000b
>>> |mmc1: sdhci: AC12 err: 0x00000000 | Slot int: 0x00000001
>>> |mmc1: sdhci: Caps: 0x446cc801 | Caps_1: 0x00000005
>>> |mmc1: sdhci: Cmd: 0x0000123a | Max curr: 0x00000000
>>> |mmc1: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0xffffffff
>>> |mmc1: sdhci: Resp[2]: 0x320f5913 | Resp[3]: 0x00000900
>>> |mmc1: sdhci: Host ctl2: 0x0000000c
>>> |mmc1: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x34ee5208
>>> |mmc1: sdhci: ============================================
>>> |[...]
>>
>> Those messages show that the interrupt did happen but the driver did not see
>> it. Are you doing anything unusual like using threadirqs?
>
> No, I'm not doing anything unusual. The mmc core uses threaded irqs by
> default. But, most of the work is performed in the primary handler. So,
> that shouldn't be a problem.
>
> But in the v4.9 case, we use preempt rt. I took a few scheduler traces

preempt rt is unusual. SDHCI uses synchronize_hardirq() and that might
explain the difference between the 4.9 case with preempt rt and the 4.17
without.

> in order to see if there might be any task blocking or preempting the
> mmc irqs. However, that's not the case.
>
> The common pattern is: mmc1 is suspended, afterwards some applications
> use mmc0 and finally a different application accesses mmc1. The suspend
> function is called and during initialization the reset doesn't work
> anymore.
>
> Anyway, I'll perform more tests.
>
> Thanks, Kurt
>
>>
>>>
>>> Both issues disappear when disabling runtime pm.
>>>
>>> Anyway I'll prepare an acpidump for you.
>>>
>>>>
>>>>> issue seems to occur if runtime power management is used. Found by utilizing
>>>>> ftrace.
>>>>>
>>>>> The erratum VLI10 for the Intel E3825 states, that the eMMC controller
>>>>> incorrectly announces that it supports suspend/resume. However, that shouldn't
>>>>> be used, as the controller may incorrectly transfer data between memory and the
>>>>> SD device.
>>>>
>>>> That erratum is not related to this problem. The suspend/resume that is
>>>> documented is an internal SDHCI feature, not the kernel's suspend/resume.
>>>> The SDHCI Suspend/Resume Mechanism is not supported in the driver, so it is
>>>> not being used anyway.
>>>
>>> Thanks for the clarification.
>>>
>>> Do you have any idea why this issue might happen?
>>
>> No, but it seems like the runtime pm callbacks aren't happening when they
>> are supposed to.
>>
>>>
>>> Thanks, Kurt
>>>
>>>>
>>>>>
>>>>> Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.
>>>>>
>>>>> Signed-off-by: Kurt Kanzenbach <[email protected]>
>>>>> ---
>>>>> drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
>>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
>>>>> index 77dd3521daae..df89381944cd 100644
>>>>> --- a/drivers/mmc/host/sdhci-pci-core.c
>>>>> +++ b/drivers/mmc/host/sdhci-pci-core.c
>>>>> @@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
>>>>> .priv_size = sizeof(struct intel_host),
>>>>> };
>>>>>
>>>>> +/*
>>>>> + * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
>>>>> + * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
>>>>> + */
>>>>> +static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
>>>>> + .allow_runtime_pm = false,
>>>>> + .probe_slot = byt_emmc_probe_slot,
>>>>> + .quirks = SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
>>>>> + .quirks2 = SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
>>>>> + SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
>>>>> + SDHCI_QUIRK2_STOP_WITH_TC,
>>>>> + .ops = &sdhci_intel_byt_ops,
>>>>> + .priv_size = sizeof(struct intel_host),
>>>>> +};
>>>>> +
>>>>> static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
>>>>> .allow_runtime_pm = true,
>>>>> .probe_slot = glk_emmc_probe_slot,
>>>>> @@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
>>>>> SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
>>>>> SDHCI_PCI_DEVICE(INTEL, BYT_SDIO, intel_byt_sdio),
>>>>> SDHCI_PCI_DEVICE(INTEL, BYT_SD, intel_byt_sd),
>>>>> - SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
>>>>> + SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
>>>>> SDHCI_PCI_DEVICE(INTEL, BSW_EMMC, intel_byt_emmc),
>>>>> SDHCI_PCI_DEVICE(INTEL, BSW_SDIO, intel_byt_sdio),
>>>>> SDHCI_PCI_DEVICE(INTEL, BSW_SD, intel_byt_sd),
>>>>>
>>>>
>>>
>>
>