2015-11-23 14:10:09

by Jorge Ramirez-Ortiz

Subject: dw_mmc: HLE errors

Doug/Jaehoon,

Were there any follow ups to this thread [1] from March 30, 2015?
We are seeing HLE errors on 3.18 and we are trying to determine if a solution
was ever delivered.
On inspection, I can't find anything specific in recent kernels that addresses
this particular issue (was the actual root cause identified?).

I put together a possible work-around that avoids the HLE storm from occurring
for this specific SoC [2].
However we'd rather not merge this -or any other similar fix- if there is a
generic solution already that we can pick up from mainline.

thanks

Jorge




[1] https://lkml.org/lkml/headers/2015/3/30/423
[2]
https://github.com/96boards/linux/commit/fe8d7f714d420121cec460e69f6529044a2cb6d8


2015-11-23 16:57:36

by Doug Anderson

Subject: Re: dw_mmc: HLE errors

Jorge,

On Mon, Nov 23, 2015 at 6:10 AM, Jorge Ramirez-Ortiz
<[email protected]> wrote:
> Doug/Jaehoon,
>
> Were there any follow ups to this thread [1] from March 30, 2015?
> We are seeing HLE errors on 3.18 and we are trying to determine if a solution
> was ever delivered.
> On inspection, I can't find anything specific in recent kernels that address
> this particular issue (was the actual root cause identified?)
>
> I put together a possible work-around that avoids the HLE storm from occurring
> for this specific SoC [2].
> However we'd rather not merge this -or any other similar fix- if there is a
> generic solution already that we can pick up from mainline.

Nothing landed that I'm aware of. Are you on SDIO, SD or eMMC?
Trying to do UHS?

I know that this patch mattered for me for UHS:

7c5209c315ea mmc: core: Increase delay for voltage to stabilize from
3.3V to 1.8V


Also important for UHS (for at least some folks) were patches like:

9c85f37a2984 mmc: core: Add mmc_regulator_set_vqmmc()

...that attempted to get voltages more proper...


In the ChromeOS tree we did just land treating HLE errors as data and
cmd errors <https://patchwork.kernel.org/patch/5978711/>. It's not
wonderful but it's better than letting an interrupt go off forever...
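
To illustrate the idea of treating HLE as both a command and a data error, here
is a minimal user-space sketch. It is not the ChromeOS patch itself. The bit
positions are assumed from the mainline dw_mmc.h header, the data mask is
deliberately simplified, and classify_pending() and struct irq_result are
hypothetical names used only for this example.

```c
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/* Interrupt bits; positions are an assumption taken from mainline dw_mmc.h. */
#define INT_RESP_ERR BIT(1)
#define INT_RCRC     BIT(6)
#define INT_DCRC     BIT(7)
#define INT_RTO      BIT(8)
#define INT_DRTO     BIT(9)
#define INT_HLE      BIT(12)

/* HLE is included in both masks, so neither path can ignore it. */
#define CMD_ERROR_FLAGS  (INT_RTO | INT_RCRC | INT_RESP_ERR | INT_HLE)
#define DATA_ERROR_FLAGS (INT_DRTO | INT_DCRC | INT_HLE)

/* Hypothetical result type: which errors fired and which bits to acknowledge. */
struct irq_result {
	uint32_t cmd_err;
	uint32_t data_err;
	uint32_t to_clear;
};

/* Splits a pending-interrupt word into command and data error sets. */
struct irq_result classify_pending(uint32_t pending)
{
	struct irq_result r;

	r.cmd_err = pending & CMD_ERROR_FLAGS;
	r.data_err = pending & DATA_ERROR_FLAGS;
	/* Every recognised error bit gets acknowledged, so it cannot re-fire. */
	r.to_clear = r.cmd_err | r.data_err;
	return r;
}
```

With a pending value of 0x1000, both error sets are non-zero and the bit is
marked for clearing, which is what stops the interrupt from going off forever.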


-Doug

2015-11-23 17:29:38

by Jorge Ramirez-Ortiz

Subject: Re: dw_mmc: HLE errors

On 11/23/2015 11:57 AM, Doug Anderson wrote:
> Jorge,
>
> On Mon, Nov 23, 2015 at 6:10 AM, Jorge Ramirez-Ortiz
> <[email protected]> wrote:
>> Doug/Jaehoon,
>>
>> Were there any follow ups to this thread [1] from March 30, 2015?
>> We are seeing HLE errors on 3.18 and we are trying to determine if a solution
>> was ever delivered.
>> On inspection, I can't find anything specific in recent kernels that address
>> this particular issue (was the actual root cause identified?)
>>
>> I put together a possible work-around that avoids the HLE storm from occurring
>> for this specific SoC [2].
>> However we'd rather not merge this -or any other similar fix- if there is a
>> generic solution already that we can pick up from mainline.
> Nothing landed that I'm aware of. Are you on SDIO, SD or eMMC?
> Trying to do UHS?

SD even without UHS (yet, that is coming now)

>
> I know that this patch mattered for me for UHS:
>
> 7c5209c315ea mmc: core: Increase delay for voltage to stabilize from
> 3.3V to 1.8V
>
>
> Also important for UHS (for at least some folks) were patches like:
>
> 9c85f37a2984 mmc: core: Add mmc_regulator_set_vqmmc()
>
> ...that attempted to get voltages more proper...

ack

>
>
> In the ChromeOS tree we did just land treating HLE errors as data and
> cmd errors <https://patchwork.kernel.org/patch/5978711/>. It's not
> wonderful but it's better than letting an interrupt go off forever...

Yes I did try this patch on 3.18 but it didn't seem to be enough for us.
Even though it would prevent the interrupt storm from flooding the kernel, once
the event triggered and the interrupt was handled no more card
insertions/ejections would be detected.

ok, thanks for the info!


2015-11-24 00:12:12

by Jaehoon Chung

Subject: Re: dw_mmc: HLE errors

Dear Jorge,

On 11/24/2015 02:29 AM, Jorge Ramirez-Ortiz wrote:
> On 11/23/2015 11:57 AM, Doug Anderson wrote:
>> Jorge,
>>
>> On Mon, Nov 23, 2015 at 6:10 AM, Jorge Ramirez-Ortiz
>> <[email protected]> wrote:
>>> Doug/Jaehoon,
>>>
>>> Were there any follow ups to this thread [1] from March 30, 2015?
>>> We are seeing HLE errors on 3.18 and we are trying to determine if a solution
>>> was ever delivered.
>>> On inspection, I can't find anything specific in recent kernels that address
>>> this particular issue (was the actual root cause identified?)
>>>
>>> I put together a possible work-around that avoids the HLE storm from occurring
>>> for this specific SoC [2].
>>> However we'd rather not merge this -or any other similar fix- if there is a
>>> generic solution already that we can pick up from mainline.
>> Nothing landed that I'm aware of. Are you on SDIO, SD or eMMC?
>> Trying to do UHS?
>
> SD even without UHS (yet, that is coming now)

If you want to use a mode higher than UHS-DDR50 for the SD card, you need to apply the patch below.

https://patchwork.kernel.org/patch/7456121/

Actually, this is not relevant to HLE error.

When the SD card is inserted/removed quickly, the dwmmc controller sometimes raises the HLE error.
(Now, I can't see the HLE error.)
So I applied some reset processing in my official repository. (It's not a generic solution.)

>
>>
>> I know that this patch mattered for me for UHS:
>>
>> 7c5209c315ea mmc: core: Increase delay for voltage to stabilize from
>> 3.3V to 1.8V
>>
>>
>> Also important for UHS (for at least some folks) were patches like:
>>
>> 9c85f37a2984 mmc: core: Add mmc_regulator_set_vqmmc()
>>
>> ...that attempted to get voltages more proper...
>
> ack
>
>>
>>
>> In the ChromeOS tree we did just land treating HLE errors as data and
>> cmd errors <https://patchwork.kernel.org/patch/5978711/>. It's not
>> wonderful but it's better than letting an interrupt go off forever...
>
> Yes I did try this patch on 3.18 but it didn't seem to be enough for us.
> Even though it would prevent the interrupt storm from flooding the kernel, once
> the event triggered and the interrupt was handled no more card
> insertions/ejections would be detected.

If the HLE error can be reproduced with a generic sequence, I think we can find a generic solution.
So could you explain it to me in more detail? If I can reproduce it with v3.18, I will try to test it.
Your case will help me solve the HLE error.

Best Regards,
Jaehoon Chung

>
> ok, thanks for the info!
>
>
>
>
>

2015-11-24 01:55:20

by Jorge Ramirez-Ortiz

Subject: Re: dw_mmc: HLE errors

On 11/23/2015 07:11 PM, Jaehoon Chung wrote:
> Dear, Jorge.
>
> On 11/24/2015 02:29 AM, Jorge Ramirez-Ortiz wrote:
>> On 11/23/2015 11:57 AM, Doug Anderson wrote:
>>> Jorge,
>>>
>>> On Mon, Nov 23, 2015 at 6:10 AM, Jorge Ramirez-Ortiz
>>> <[email protected]> wrote:
>>>> Doug/Jaehoon,
>>>>
>>>> Were there any follow ups to this thread [1] from March 30, 2015?
>>>> We are seeing HLE errors on 3.18 and we are trying to determine if a solution
>>>> was ever delivered.
>>>> On inspection, I can't find anything specific in recent kernels that address
>>>> this particular issue (was the actual root cause identified?)
>>>>
>>>> I put together a possible work-around that avoids the HLE storm from occurring
>>>> for this specific SoC [2].
>>>> However we'd rather not merge this -or any other similar fix- if there is a
>>>> generic solution already that we can pick up from mainline.
>>> Nothing landed that I'm aware of. Are you on SDIO, SD or eMMC?
>>> Trying to do UHS?
>> SD even without UHS (yet, that is coming now)
> If you want to use the upper mode than UHS-DDR50 for SD-card, you need to apply the below patch.

ACK

>
> https://patchwork.kernel.org/patch/7456121/
>
> Actually, this is not relevant to HLE error.
>
> When sd-card is inserted/removed quickly, then sometime dwmmc controller is occurred the HLE error.
> (Now, i can't see HLE error.)
> So i had applied the some reset processing at my official repository.(It's not generic solution.)

Thanks, I'll have a look now.

I believe this to be your official repo:
https://github.com/jh80chung/dw-mmc

Please let me know if it is not.


>
>>> I know that this patch mattered for me for UHS:
>>>
>>> 7c5209c315ea mmc: core: Increase delay for voltage to stabilize from
>>> 3.3V to 1.8V
>>>
>>>
>>> Also important for UHS (for at least some folks) were patches like:
>>>
>>> 9c85f37a2984 mmc: core: Add mmc_regulator_set_vqmmc()
>>>
>>> ...that attempted to get voltages more proper...
>> ack
>>
>>>
>>> In the ChromeOS tree we did just land treating HLE errors as data and
>>> cmd errors <https://patchwork.kernel.org/patch/5978711/>. It's not
>>> wonderful but it's better than letting an interrupt go off forever...
>> Yes I did try this patch on 3.18 but it didn't seem to be enough for us.
>> Even though it would prevent the interrupt storm from flooding the kernel, once
>> the event triggered and the interrupt was handled no more card
>> insertions/ejections would be detected.
> If HLE error will be reproduce with the generic sequence, I think we can find the generic solution.
> So could you explain to me in more detail? If i can reproduce with v3.18, i will try to test it.
> Your case will be helpful to me for solving the HLE error.


Yes, the issue is relatively easy to reproduce.

On this platform:
https://www.96boards.org/products/ce/hikey/

Using either debian [1] or android [2] releases and the latest UEFI [3]
[1] https://builds.96boards.org/snapshots/hikey/linaro/debian/379/
[2] https://builds.96boards.org/snapshots/hikey/linaro/aosp/197/
[3] https://builds.96boards.org/snapshots/hikey/linaro/uefi/89/

The kernel tree between android and debian is shared [4].
We are using the "hikey" branch (v3.18)
[4] https://github.com/96boards/linux

For my tests, and to be able to handle the interrupt storm and monitor the
registers while it happens, I patched the kernel with a Xenomai [5] co-kernel.
This is my kernel tree [6]
[5] http://xenomai.org/
[6] http://git.xenomai.org/ipipe-jro.git/log/?h=hikey

To reproduce the problem, all that was required was to insert/remove the SD card
rapidly until it triggered this condition:
[ 229.974525] dwmmc_k3 f723e000.dwmmc1: Busy; trying anyway

When it triggered, and after patching the interrupt handler with some debug info
to show the distance between interrupts and the content of the MINTSTS register,
I could see the following:
mci_isr: 0x1000, 3333 ns
mci_isr: 0x1000, 3334 ns
mci_isr: 0x1000, 3333 ns
mci_isr: 0x1000, 3334 ns
mci_isr: 0x1000, 3333 ns
mci_isr: 0x1000, 2500 ns
mci_isr: 0x1000, 3334 ns
mci_isr: 0x1000, 2500 ns
mci_isr: 0x1000, 3333 ns
mci_isr: 0x1000, 3334 ns
mci_isr: 0x1000, 3334 ns
mci_isr: 0x1000, 3333 ns
mci_isr: 0x1000, 3334 ns
mci_isr: 0x1000, 2500 ns
mci_isr: 0x1000, 3334 ns
[...]

Notice that since the Xenomai co-kernel runs with a higher priority than the
Linux kernel, I was able to output this information to the console.
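
As a reading aid for the trace above, here is a small sketch that decodes the
logged values. It assumes that bit 12 of MINTSTS is the HLE (hardware locked
write error) bit, as in the mainline dw_mmc.h header. The helper names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bit position assumed from mainline dw_mmc.h (SDMMC_INT_HLE). */
#define INT_HLE (UINT32_C(1) << 12)

/* Returns 1 if the HLE bit is set in a masked interrupt status value. */
int mintsts_has_hle(uint32_t mintsts)
{
	return (mintsts & INT_HLE) != 0;
}

/* Converts the spacing between two interrupts (in ns) to interrupts/second. */
uint64_t irq_rate_per_sec(uint64_t interval_ns)
{
	assert(interval_ns > 0);
	return UINT64_C(1000000000) / interval_ns;
}
```

Under that assumption, 0x1000 means HLE is the only pending source, and a
spacing of 3333 ns corresponds to roughly 300,000 interrupts per second, which
would explain why Linux alone could not make progress during the storm.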

I put together a fix based on this commit from Doug:
mmc: dw_mmc: Don't start commands while busy
https://lkml.org/lkml/2015/2/20/508

In Doug's commit, we would delay sending a command until the SDMMC_STATUS_BUSY
bit cleared.
However, if it never cleared, we'd go ahead and submit the command anyway.

I believe this is what was causing the HLE to be raised.
In order to prevent that from happening, I think we should abort the operation
completely.
My "extension" for the Hikey platform looks like this:
https://github.com/96boards/linux/commit/fe8d7f714d420121cec460e69f6529044a2cb6d8

It could be made generic or the fix could have some other form of course.
I was only targeting the Hikey platform when I wrote this hoping that it would
have been fixed upstream.
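
The difference between "busy; trying anyway" and aborting can be sketched in
user-space C as follows. This is only an illustration of the idea, not the code
from the commit linked above. The busy bit position is assumed from mainline
dw_mmc.h, and wait_while_busy(), read_status_fn and the fake register are
hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Bit position assumed from mainline dw_mmc.h (SDMMC_STATUS_BUSY). */
#define STATUS_BUSY (UINT32_C(1) << 9)

/* Hypothetical accessor standing in for a read of the STATUS register. */
typedef uint32_t (*read_status_fn)(void *ctx);

/*
 * Polls the status up to max_polls times. Returns 0 once the busy bit
 * clears, or -ETIMEDOUT so that the caller can abort the request instead
 * of issuing a command while the controller is still busy.
 */
int wait_while_busy(read_status_fn read_status, void *ctx,
		    unsigned int max_polls)
{
	assert(read_status != NULL);

	while (max_polls--) {
		if (!(read_status(ctx) & STATUS_BUSY))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Simulated register for illustration: reports busy for busy_reads reads. */
struct fake_status {
	unsigned int busy_reads;
};

uint32_t fake_read_status(void *ctx)
{
	struct fake_status *f = ctx;

	if (f->busy_reads > 0) {
		f->busy_reads--;
		return STATUS_BUSY;
	}
	return 0;
}
```

A caller would check the return value and fail the request on -ETIMEDOUT rather
than writing the command register, which is the step believed here to trigger
the HLE.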

Having said all of this, I am not sure what would cause the host status to
remain busy for so long (which is Ulf's biggest concern).
I also tried increasing some of the timers that wait for the voltages to ramp up
after power on, but it didn't make any difference.

I captured most of the information above under this bug for reference.
https://bugs.96boards.org/show_bug.cgi?id=175


2015-11-24 01:59:31

by Jaehoon Chung

Subject: Re: dw_mmc: HLE errors

On 11/24/2015 10:55 AM, Jorge Ramirez-Ortiz wrote:
> On 11/23/2015 07:11 PM, Jaehoon Chung wrote:
>> Dear, Jorge.
>>
>> On 11/24/2015 02:29 AM, Jorge Ramirez-Ortiz wrote:
>>> On 11/23/2015 11:57 AM, Doug Anderson wrote:
>>>> Jorge,
>>>>
>>>> On Mon, Nov 23, 2015 at 6:10 AM, Jorge Ramirez-Ortiz
>>>> <[email protected]> wrote:
>>>>> Doug/Jaehoon,
>>>>>
>>>>> Were there any follow ups to this thread [1] from March 30, 2015?
>>>>> We are seeing HLE errors on 3.18 and we are trying to determine if a solution
>>>>> was ever delivered.
>>>>> On inspection, I can't find anything specific in recent kernels that address
>>>>> this particular issue (was the actual root cause identified?)
>>>>>
>>>>> I put together a possible work-around that avoids the HLE storm from occurring
>>>>> for this specific SoC [2].
>>>>> However we'd rather not merge this -or any other similar fix- if there is a
>>>>> generic solution already that we can pick up from mainline.
>>>> Nothing landed that I'm aware of. Are you on SDIO, SD or eMMC?
>>>> Trying to do UHS?
>>> SD even without UHS (yet, that is coming now)
>> If you want to use the upper mode than UHS-DDR50 for SD-card, you need to apply the below patch.
>
> ACK
>
>>
>> https://patchwork.kernel.org/patch/7456121/
>>
>> Actually, this is not relevant to HLE error.
>>
>> When sd-card is inserted/removed quickly, then sometime dwmmc controller is occurred the HLE error.
>> (Now, i can't see HLE error.)
>> So i had applied the some reset processing at my official repository.(It's not generic solution.)
>
> Thanks, I'll have a look now.
>
> I believe this to be your official repo:
> https://github.com/jh80chung/dw-mmc
>
> Please let me know if it is not.

Sorry, it's not the official repo (Samsung), so I can't share the URL. :(
It's just my personal git repository. I will work on that repository. :)

Best Regards,
Jaehoon Chung

>
>
>>
>>>> I know that this patch mattered for me for UHS:
>>>>
>>>> 7c5209c315ea mmc: core: Increase delay for voltage to stabilize from
>>>> 3.3V to 1.8V
>>>>
>>>>
>>>> Also important for UHS (for at least some folks) were patches like:
>>>>
>>>> 9c85f37a2984 mmc: core: Add mmc_regulator_set_vqmmc()
>>>>
>>>> ...that attempted to get voltages more proper...
>>> ack
>>>
>>>>
>>>> In the ChromeOS tree we did just land treating HLE errors as data and
>>>> cmd errors <https://patchwork.kernel.org/patch/5978711/>. It's not
>>>> wonderful but it's better than letting an interrupt go off forever...
>>> Yes I did try this patch on 3.18 but it didn't seem to be enough for us.
>>> Even though it would prevent the interrupt storm from flooding the kernel, once
>>> the event triggered and the interrupt was handled no more card
>>> insertions/ejections would be detected.
>> If HLE error will be reproduce with the generic sequence, I think we can find the generic solution.
>> So could you explain to me in more detail? If i can reproduce with v3.18, i will try to test it.
>> Your case will be helpful to me for solving the HLE error.
>
>
> Yes, the issue is relatively easy to reproduce.
>
> On this platform:
> https://www.96boards.org/products/ce/hikey/
>
> Using either debian [1] or android [2] releases and the latest UEFI [3]
> [1] https://builds.96boards.org/snapshots/hikey/linaro/debian/379/
> [2] https://builds.96boards.org/snapshots/hikey/linaro/aosp/197/
> [3] https://builds.96boards.org/snapshots/hikey/linaro/uefi/89/
>
> The kernel tree between android and debian is shared [4].
> We are using the "hikey" branch (v3.18)
> [4] https://github.com/96boards/linux
>
> For my tests and to be able to handed the interrupt storm and monitor the
> registers while it happens, I patched the kernel with a Xenomai [5] co-kernel.
> This is my kernel tree [6]
> [5] http://xenomai.org/
> [6] http://git.xenomai.org/ipipe-jro.git/log/?h=hikey
>
> To reproduce the problem all it was required was to insert/remove the SD card
> rapidly until it triggers this condition:
> [ 229.974525] dwmmc_k3 f723e000.dwmmc1: Busy; trying anyway
>
> When it triggered, and after patching the interrupt handler with some debug info
> to show the distance between interrupts and the content of the MINTSTS register,
> I could see the following:
> mci_isr: 0x1000, 3333 ns
> mci_isr: 0x1000, 3334 ns
> mci_isr: 0x1000, 3333 ns
> mci_isr: 0x1000, 3334 ns
> mci_isr: 0x1000, 3333 ns
> mci_isr: 0x1000, 2500 ns
> mci_isr: 0x1000, 3334 ns
> mci_isr: 0x1000, 2500 ns
> mci_isr: 0x1000, 3333 ns
> mci_isr: 0x1000, 3334 ns
> mci_isr: 0x1000, 3334 ns
> mci_isr: 0x1000, 3333 ns
> mci_isr: 0x1000, 3334 ns
> mci_isr: 0x1000, 2500 ns
> mci_isr: 0x1000, 3334 ns
> [...]
>
> Notice that since the Xenomai co-kernel runs with a higher priority than the
> Linux kernel, I was able to output this information to the console.
>
> I put together a fix based on this commit from Doug;
> mmc: dw_mmc: Don't start commands while busy
> https://lkml.org/lkml/2015/2/20/508
>
> In Doug's commit, we would delay sending a command until the SDMCC_STATUS_BUSY
> cleared.
> However if it never cleared, we'd go ahead and submit the command anyway.
>
> I believe this is what was causing the HLE to be raised.
> In order to prevent that from happening, I think we should abort the operation
> completely.
> My "extension" for the Hikey platform looks like this:
> https://github.com/96boards/linux/commit/fe8d7f714d420121cec460e69f6529044a2cb6d
>
> It could be made generic or the fix could have some other form of course.
> I was only targeting the Hikey platform when I wrote this hoping that it would
> have been fixed upstream.
>
> Having said all of this, I am not sure what would cause the host status to
> remain busy for so long (which is Ulf's biggest concern)
> I also tried increasing some of the timers that wait for the voltages to ramp up
> after power on but it didnt make any difference.
>
> I captured most of the information above under this bug for reference.
> https://bugs.96boards.org/show_bug.cgi?id=175
>
>
>
>
>