2019-08-14 12:55:39

by Vicente Bergas

[permalink] [raw]
Subject: Re: kexec on rk3399

On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote:
> Hi, i have been running linux on rk3399 booted with kexec fine until 5.2
> From 5.2 onwards, there are memory corruption issues as reported here:
> http://lkml.iu.edu/hypermail/linux/kernel/1906.2/07211.html
> kexec has been identified as the principal reason for the issues.
>
> It turns out that kexec has never worked reliably on this platform,
> i was just lucky until recently.
>
> Please, can you provide some directions on how to debug the issue?

Thank you all for your suggestions on where the issue could be.

It seems that it was the USB driver.
Now using v5.2.8 booted with kexec from v5.2.8 with a workaround and
so far so good. It is being tested on the Sapphire board.

The workaround is:
--- a/drivers/usb/dwc3/dwc3-of-simple.c
+++ b/drivers/usb/dwc3/dwc3-of-simple.c
@@ -133,6 +133,13 @@
return 0;
}

+static void dwc3_of_simple_shutdown(struct platform_device *pdev)
+{
+ struct dwc3_of_simple *simple = platform_get_drvdata(pdev);
+
+ reset_control_assert(simple->resets);
+}
+
static int __maybe_unused dwc3_of_simple_runtime_suspend(struct device
*dev)
{
struct dwc3_of_simple *simple = dev_get_drvdata(dev);
@@ -190,6 +197,7 @@
static struct platform_driver dwc3_of_simple_driver = {
.probe = dwc3_of_simple_probe,
.remove = dwc3_of_simple_remove,
+ .shutdown = dwc3_of_simple_shutdown,
.driver = {
.name = "dwc3-of-simple",
.of_match_table = of_dwc3_simple_match,

If this patch is OK after review i can resubmit it as a pull request.
Should a similar change be applied to drivers/usb/dwc3/core.c ?

Regards,
Vicenç.


2019-08-14 13:09:04

by Felipe Balbi

[permalink] [raw]
Subject: Re: kexec on rk3399


Hi,

Vicente Bergas <[email protected]> writes:
> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote:
>> Hi, i have been running linux on rk3399 booted with kexec fine until 5.2
>> From 5.2 onwards, there are memory corruption issues as reported here:
>> http://lkml.iu.edu/hypermail/linux/kernel/1906.2/07211.html
>> kexec has been identified as the principal reason for the issues.
>>
>> It turns out that kexec has never worked reliably on this platform,
>> i was just lucky until recently.
>>
>> Please, can you provide some directions on how to debug the issue?
>
> Thank you all for your suggestions on where the issue could be.
>
> It seems that it was the USB driver.
> Now using v5.2.8 booted with kexec from v5.2.8 with a workaround and
> so far so good. It is being tested on the Sapphire board.
>
> The workaround is:
> --- a/drivers/usb/dwc3/dwc3-of-simple.c
> +++ b/drivers/usb/dwc3/dwc3-of-simple.c
> @@ -133,6 +133,13 @@
> return 0;
> }
>
> +static void dwc3_of_simple_shutdown(struct platform_device *pdev)
> +{
> + struct dwc3_of_simple *simple = platform_get_drvdata(pdev);
> +
> + reset_control_assert(simple->resets);
> +}
> +
> static int __maybe_unused dwc3_of_simple_runtime_suspend(struct device
> *dev)
> {
> struct dwc3_of_simple *simple = dev_get_drvdata(dev);
> @@ -190,6 +197,7 @@
> static struct platform_driver dwc3_of_simple_driver = {
> .probe = dwc3_of_simple_probe,
> .remove = dwc3_of_simple_remove,
> + .shutdown = dwc3_of_simple_shutdown,
> .driver = {
> .name = "dwc3-of-simple",
> .of_match_table = of_dwc3_simple_match,
>
> If this patch is OK after review i can resubmit it as a pull request.

not a pull request, just send a patch using git send-email

> Should a similar change be applied to drivers/usb/dwc3/core.c ?

Is it necessary? We haven't had any bug reports regarding that. Also, if
we have reset control support in the core driver, why do we need it in
of_simple? Seems like of_simple could just rely on what core does.

--
balbi


Attachments:
signature.asc (847.00 B)

2019-08-14 13:14:53

by Robin Murphy

[permalink] [raw]
Subject: Re: kexec on rk3399

On 14/08/2019 13:53, Vicente Bergas wrote:
> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote:
>> Hi, i have been running linux on rk3399 booted with kexec fine until 5.2
>> From 5.2 onwards, there are memory corruption issues as reported here:
>> http://lkml.iu.edu/hypermail/linux/kernel/1906.2/07211.html
>> kexec has been identified as the principal reason for the issues.
>>
>> It turns out that kexec has never worked reliably on this platform,
>> i was just lucky until recently.
>>
>> Please, can you provide some directions on how to debug the issue?
>
> Thank you all for your suggestions on where the issue could be.
>
> It seems that it was the USB driver.
> Now using v5.2.8 booted with kexec from v5.2.8 with a workaround and
> so far so good. It is being tested on the Sapphire board.
>
> The workaround is:
> --- a/drivers/usb/dwc3/dwc3-of-simple.c
> +++ b/drivers/usb/dwc3/dwc3-of-simple.c
> @@ -133,6 +133,13 @@
>     return 0;
> }
>
> +static void dwc3_of_simple_shutdown(struct platform_device *pdev)
> +{
> +    struct dwc3_of_simple *simple = platform_get_drvdata(pdev);
> +
> +    reset_control_assert(simple->resets);
> +}
> +
> static int __maybe_unused dwc3_of_simple_runtime_suspend(struct device
> *dev)
> {
>     struct dwc3_of_simple    *simple = dev_get_drvdata(dev);
> @@ -190,6 +197,7 @@
> static struct platform_driver dwc3_of_simple_driver = {
>     .probe        = dwc3_of_simple_probe,
>     .remove        = dwc3_of_simple_remove,
> +    .shutdown    = dwc3_of_simple_shutdown,
>     .driver        = {
>         .name    = "dwc3-of-simple",
>         .of_match_table = of_dwc3_simple_match,
>
> If this patch is OK after review i can resubmit it as a pull request.
> Should a similar change be applied to drivers/usb/dwc3/core.c ?

This particular change looks like it's implicitly specific to RK3399,
which wouldn't be ideal. Presumably if the core dwc3 driver implemented
shutdown correctly (echoing parts of dwc3_remove(), I guess) then the
glue layers shouldn't need anything special anyway.

Robin.

2019-08-14 13:16:02

by Vicente Bergas

[permalink] [raw]
Subject: Re: kexec on rk3399

On Wednesday, August 14, 2019 3:06:04 PM CEST, Felipe Balbi wrote:
> Hi,
>
> Vicente Bergas <[email protected]> writes:
>> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote:
>>> Hi, i have been running linux on rk3399 booted with kexec fine until 5.2
>>> From 5.2 onwards, there are memory corruption issues as reported here:
>>> http://lkml.iu.edu/hypermail/linux/kernel/1906.2/07211.html
>>> kexec has been identified as the principal reason for the issues.
>>>
>>> It turns out that kexec has never worked reliably on this platform, ...
>>
>> Thank you all for your suggestions on where the issue could be.
>>
>> It seems that it was the USB driver.
>> Now using v5.2.8 booted with kexec from v5.2.8 with a workaround and
>> so far so good. It is being tested on the Sapphire board.
>>
>> The workaround is:
>> --- a/drivers/usb/dwc3/dwc3-of-simple.c
>> +++ b/drivers/usb/dwc3/dwc3-of-simple.c
>> @@ -133,6 +133,13 @@
>> return 0;
>> }
>>
>> +static void dwc3_of_simple_shutdown(struct platform_device *pdev)
>> +{
>> + struct dwc3_of_simple *simple = platform_get_drvdata(pdev);
>> +
>> + reset_control_assert(simple->resets);
>> +}
>> +
>> static int __maybe_unused dwc3_of_simple_runtime_suspend(struct device
>> *dev)
>> {
>> struct dwc3_of_simple *simple = dev_get_drvdata(dev);
>> @@ -190,6 +197,7 @@
>> static struct platform_driver dwc3_of_simple_driver = {
>> .probe = dwc3_of_simple_probe,
>> .remove = dwc3_of_simple_remove,
>> + .shutdown = dwc3_of_simple_shutdown,
>> .driver = {
>> .name = "dwc3-of-simple",
>> .of_match_table = of_dwc3_simple_match,
>>
>> If this patch is OK after review i can resubmit it as a pull request.
>
> not a pull request, just send a patch using git send-email
>
>> Should a similar change be applied to drivers/usb/dwc3/core.c ?
>
> Is it necessary? We haven't had any bug reports regarding that. Also, if
> we have reset control support in the core driver, why do we need it in
> of_simple? Seems like of_simple could just rely on what core does.

the workaround has been tested patching only core.c with
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1561,6 +1561,13 @@
return 0;
}

+static void dwc3_shutdown(struct platform_device *pdev)
+{
+ struct dwc3 *dwc = platform_get_drvdata(pdev);
+
+ reset_control_assert(dwc->reset);
+}
+
#ifdef CONFIG_PM
static int dwc3_core_init_for_resume(struct dwc3 *dwc)
{
@@ -1866,6 +1873,7 @@
static struct platform_driver dwc3_driver = {
.probe = dwc3_probe,
.remove = dwc3_remove,
+ .shutdown = dwc3_shutdown,
.driver = {
.name = "dwc3",
.of_match_table = of_match_ptr(of_dwc3_match),

and leaving dwc3-of-simple.c as is, the issue persisted.

Regards,
Vicenç.

2019-08-15 01:30:20

by Vicente Bergas

[permalink] [raw]
Subject: Re: kexec on rk3399

On Wednesday, August 14, 2019 3:12:26 PM CEST, Robin Murphy wrote:
> On 14/08/2019 13:53, Vicente Bergas wrote:
>> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote: ...
>
> This particular change looks like it's implicitly specific to
> RK3399, which wouldn't be ideal. Presumably if the core dwc3
> driver implemented shutdown correctly (echoing parts of
> dwc3_remove(), I guess) then the glue layers shouldn't need
> anything special anyway.
>
> Robin.

I just checked simple->resets from dwc3-of-simple.c and it is an array
with multiple resets whereas dwc->reset from core.c is NULL.
So the reset seems specific to the glue layers.
Is there another way than resetting the thing that is
generic enough to go to core.c and allows kexec?

2019-08-15 06:10:37

by Felipe Balbi

[permalink] [raw]
Subject: Re: kexec on rk3399


Hi,

Vicente Bergas <[email protected]> writes:
>> Vicente Bergas <[email protected]> writes:
>>> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote:
>>>> Hi, i have been running linux on rk3399 booted with kexec fine until 5.2
>>>> From 5.2 onwards, there are memory corruption issues as reported here:
>>>> http://lkml.iu.edu/hypermail/linux/kernel/1906.2/07211.html
>>>> kexec has been identified as the principal reason for the issues.
>>>>
>>>> It turns out that kexec has never worked reliably on this platform, ...
>>>
>>> Thank you all for your suggestions on where the issue could be.
>>>
>>> It seems that it was the USB driver.
>>> Now using v5.2.8 booted with kexec from v5.2.8 with a workaround and
>>> so far so good. It is being tested on the Sapphire board.
>>>
>>> The workaround is:
>>> --- a/drivers/usb/dwc3/dwc3-of-simple.c
>>> +++ b/drivers/usb/dwc3/dwc3-of-simple.c
>>> @@ -133,6 +133,13 @@
>>> return 0;
>>> }
>>>
>>> +static void dwc3_of_simple_shutdown(struct platform_device *pdev)
>>> +{
>>> + struct dwc3_of_simple *simple = platform_get_drvdata(pdev);
>>> +
>>> + reset_control_assert(simple->resets);
>>> +}
>>> +
>>> static int __maybe_unused dwc3_of_simple_runtime_suspend(struct device
>>> *dev)
>>> {
>>> struct dwc3_of_simple *simple = dev_get_drvdata(dev);
>>> @@ -190,6 +197,7 @@
>>> static struct platform_driver dwc3_of_simple_driver = {
>>> .probe = dwc3_of_simple_probe,
>>> .remove = dwc3_of_simple_remove,
>>> + .shutdown = dwc3_of_simple_shutdown,
>>> .driver = {
>>> .name = "dwc3-of-simple",
>>> .of_match_table = of_dwc3_simple_match,
>>>
>>> If this patch is OK after review i can resubmit it as a pull request.
>>
>> not a pull request, just send a patch using git send-email
>>
>>> Should a similar change be applied to drivers/usb/dwc3/core.c ?
>>
>> Is it necessary? We haven't had any bug reports regarding that. Also, if
>> we have reset control support in the core driver, why do we need it in
>> of_simple? Seems like of_simple could just rely on what core does.
>
> the workaround has been tested patching only core.c with
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -1561,6 +1561,13 @@
> return 0;
> }
>
> +static void dwc3_shutdown(struct platform_device *pdev)
> +{
> + struct dwc3 *dwc = platform_get_drvdata(pdev);
> +
> + reset_control_assert(dwc->reset);
> +}
> +
> #ifdef CONFIG_PM
> static int dwc3_core_init_for_resume(struct dwc3 *dwc)
> {
> @@ -1866,6 +1873,7 @@
> static struct platform_driver dwc3_driver = {
> .probe = dwc3_probe,
> .remove = dwc3_remove,
> + .shutdown = dwc3_shutdown,
> .driver = {
> .name = "dwc3",
> .of_match_table = of_match_ptr(of_dwc3_match),
>
> and leaving dwc3-of-simple.c as is, the issue persisted.

That's because your reset controller is not passed to dwc3 core, only to
your glue layer.

--
balbi

2019-08-15 06:58:03

by Felipe Balbi

[permalink] [raw]
Subject: Re: kexec on rk3399


Hi,

Vicente Bergas <[email protected]> writes:

> On Wednesday, August 14, 2019 3:12:26 PM CEST, Robin Murphy wrote:
>> On 14/08/2019 13:53, Vicente Bergas wrote:
>>> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote: ...
>>
>> This particular change looks like it's implicitly specific to
>> RK3399, which wouldn't be ideal. Presumably if the core dwc3
>> driver implemented shutdown correctly (echoing parts of
>> dwc3_remove(), I guess) then the glue layers shouldn't need
>> anything special anyway.
>>
>> Robin.
>
> I just checked simple->resets from dwc3-of-simple.c and it is an array
> with multiple resets whereas dwc->reset from core.c is NULL.
> So the reset seems specific to the glue layers.
> Is there another way than resetting the thing that is
> generic enough to go to core.c and allows kexec?

This is a really odd 'failure'. We do full soft reset during driver
initialization on dwc3. We shouldn't need to assert reset on shutdown,
really.

I think the problem is here:

if (simple->pulse_resets) {
ret = reset_control_reset(simple->resets);
if (ret)
goto err_resetc_put;
} else {
ret = reset_control_deassert(simple->resets);
if (ret)
goto err_resetc_put;
}

Note that if pulse_resets is set, we will run a reset. But if
pulse_resets is false and need_reset is true, we deassert the reset.

I think below patch is enough:

diff --git a/drivers/usb/dwc3/dwc3-of-simple.c b/drivers/usb/dwc3/dwc3-of-simple.c
index bdac3e7d7b18..9a2f3e09aa2e 100644
--- a/drivers/usb/dwc3/dwc3-of-simple.c
+++ b/drivers/usb/dwc3/dwc3-of-simple.c
@@ -72,7 +72,15 @@ static int dwc3_of_simple_probe(struct platform_device *pdev)
ret = reset_control_reset(simple->resets);
if (ret)
goto err_resetc_put;
- } else {
+ }
+
+ if (simple->need_reset) {
+ ret = reset_control_assert(simple->resets);
+ if (ret)
+ goto err_resetc_put;
+
+ usleep_range(1000, 2000);
+
ret = reset_control_deassert(simple->resets);
if (ret)
goto err_resetc_put;
@@ -121,9 +129,6 @@ static int dwc3_of_simple_remove(struct platform_device *pdev)
clk_bulk_put_all(simple->num_clocks, simple->clks);
simple->num_clocks = 0;

- if (!simple->pulse_resets)
- reset_control_assert(simple->resets);
-
reset_control_put(simple->resets);

pm_runtime_disable(dev);

Can you test?

--
balbi

2019-08-15 11:35:40

by Robin Murphy

[permalink] [raw]
Subject: Re: kexec on rk3399

On 15/08/2019 07:06, Felipe Balbi wrote:
>
> Hi,
>
> Vicente Bergas <[email protected]> writes:
>
>> On Wednesday, August 14, 2019 3:12:26 PM CEST, Robin Murphy wrote:
>>> On 14/08/2019 13:53, Vicente Bergas wrote:
>>>> On Monday, July 22, 2019 4:31:27 PM CEST, Vicente Bergas wrote: ...
>>>
>>> This particular change looks like it's implicitly specific to
>>> RK3399, which wouldn't be ideal. Presumably if the core dwc3
>>> driver implemented shutdown correctly (echoing parts of
>>> dwc3_remove(), I guess) then the glue layers shouldn't need
>>> anything special anyway.
>>>
>>> Robin.
>>
>> I just checked simple->resets from dwc3-of-simple.c and it is an array
>> with multiple resets whereas dwc->reset from core.c is NULL.
>> So the reset seems specific to the glue layers.
>> Is there another way than resetting the thing that is
>> generic enough to go to core.c and allows kexec?
>
> This is a really odd 'failure'. We do full soft reset during driver
> initialization on dwc3. We shouldn't need to assert reset on shutdown,
> really.

Probing/initialisation has never been the problem. The issue for the
kexec case is that when the first kernel shuts down, there is currently
nothing to quiesce the controller (since only driver->shutdown gets
called, not driver->remove), and thus (presumably) external USB activity
causes it to keep writing back over the memory where the
descriptors/command ring used to be while the second kernel boots. The
second kernel will eventually probe and reset it appropriately, but by
that time the damage is already done.

Yanking on a hardware reset line when the first kernel shuts down is
certainly one way to stop any memory accesses if such a control is
available, but presumably there's a general software way to gracefully
disable the controller's DMA functions until a subsequent probe can
fully reset it again - I think that would be the preferable solution.

Robin.

> I think the problem is here:
>
> if (simple->pulse_resets) {
> ret = reset_control_reset(simple->resets);
> if (ret)
> goto err_resetc_put;
> } else {
> ret = reset_control_deassert(simple->resets);
> if (ret)
> goto err_resetc_put;
> }
>
> Note that if pulse_resets is set, we will run a reset. But if
> pulse_resets is false and need_reset is true, we deassert the reset.
>
> I think below patch is enough:
>
> diff --git a/drivers/usb/dwc3/dwc3-of-simple.c b/drivers/usb/dwc3/dwc3-of-simple.c
> index bdac3e7d7b18..9a2f3e09aa2e 100644
> --- a/drivers/usb/dwc3/dwc3-of-simple.c
> +++ b/drivers/usb/dwc3/dwc3-of-simple.c
> @@ -72,7 +72,15 @@ static int dwc3_of_simple_probe(struct platform_device *pdev)
> ret = reset_control_reset(simple->resets);
> if (ret)
> goto err_resetc_put;
> - } else {
> + }
> +
> + if (simple->need_reset) {
> + ret = reset_control_assert(simple->resets);
> + if (ret)
> + goto err_resetc_put;
> +
> + usleep_range(1000, 2000);
> +
> ret = reset_control_deassert(simple->resets);
> if (ret)
> goto err_resetc_put;
> @@ -121,9 +129,6 @@ static int dwc3_of_simple_remove(struct platform_device *pdev)
> clk_bulk_put_all(simple->num_clocks, simple->clks);
> simple->num_clocks = 0;
>
> - if (!simple->pulse_resets)
> - reset_control_assert(simple->resets);
> -
> reset_control_put(simple->resets);
>
> pm_runtime_disable(dev);
>
> Can you test?
>