2019-06-03 21:44:01

by Thor Thayer

[permalink] [raw]
Subject: [PATCH] EDAC/altera: Warm Reset option for Stratix10 peripheral DBE

From: Thor Thayer <[email protected]>

The Stratix10 peripheral FIFO memories can recover from double
bit errors with a warm reset instead of a cold reset.
Add the option of a warm reset for peripheral (USB, Ethernet)
memories.

CPU memories such as SDRAM and OCRAM require a cold reset for
DBEs.
Filter on whether the error is a SDRAM/OCRAM or a peripheral
FIFO memory to determine which reset to use when the warm
reset option is configured.

Signed-off-by: Thor Thayer <[email protected]>
---
drivers/edac/Kconfig | 9 +++++++++
drivers/edac/altera_edac.c | 31 +++++++++++++++++++++++++++++--
drivers/edac/altera_edac.h | 4 ++++
3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 47eb4d13ed5f..e47c428d485d 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -394,6 +394,15 @@ config EDAC_ALTERA
Altera SOCs. This is the global enable for the
various Altera peripherals.

+config EDAC_ALTERA_ARM64_WARM_RESET
+ bool "Altera ARM64 Peripheral Warm Reset"
+ depends on EDAC_ALTERA=y && ARM64
+ help
+ Support for Warm Reset on peripheral FIFO double bit errors
+ on SoCFPGA ARM64 platforms. Otherwise a peripheral FIFO DBE
+ will cause a cold reset. SDRAM and OCRAM DBEs always cause
+ a cold reset.
+
config EDAC_ALTERA_SDRAM
bool "Altera SDRAM ECC"
depends on EDAC_ALTERA=y
diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
index 8816f74a22b4..179601f14b48 100644
--- a/drivers/edac/altera_edac.c
+++ b/drivers/edac/altera_edac.c
@@ -2036,6 +2036,19 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
/* panic routine issues reboot on non-zero panic_timeout */
extern int panic_timeout;

+#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
+/* EL3 SMC call to setup CPUs for warm reset */
+void panic_smp_self_stop(void)
+{
+ struct arm_smccc_res result;
+
+ __cpu_disable();
+ cpu_relax();
+ arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, S10_WARM_RESET_WFI_FLAG,
+ S10_WARM_RESET_WFI_FLAG, 0, 0, 0, 0, 0, &result);
+}
+#endif
+
/*
* The double bit error is handled through SError which is fatal. This is
* called as a panic notifier to printout ECC error info as part of the panic.
@@ -2067,14 +2080,28 @@ static int s10_edac_dberr_handler(struct notifier_block *this,
regmap_write(edac->ecc_mgr_map,
S10_SYSMGR_UE_ADDR_OFST, err_addr);
edac_printk(KERN_ERR, EDAC_DEVICE,
- "EDAC: [Fatal DBE on %s @ 0x%08X]\n",
- ed->edac_dev_name, err_addr);
+ "EDAC: [Fatal DBE on %s [CPU=%d] @ 0x%08X]\n",
+ ed->edac_dev_name, raw_smp_processor_id(),
+ err_addr);
break;
}
/* Notify the System through SMC. Reboot delay = 1 second */
+#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
+ /* Handle peripheral FIFO DBE as Warm Resets */
+ if (dberror & S10_COLD_RESET_MASK) {
+ panic_timeout = 1;
+ arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0,
+ 0, 0, 0, &result);
+ } else {
+ arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE,
+ S10_WARM_RESET_WFI_FLAG | dberror, 0, 0,
+ 0, 0, 0, 0, &result);
+ }
+#else
panic_timeout = 1;
arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0, 0,
0, 0, &result);
+#endif
}

return NOTIFY_DONE;
diff --git a/drivers/edac/altera_edac.h b/drivers/edac/altera_edac.h
index 55654cc4bcdf..e5936fbe3964 100644
--- a/drivers/edac/altera_edac.h
+++ b/drivers/edac/altera_edac.h
@@ -327,6 +327,10 @@ struct altr_sdram_mc_data {
#define ECC_READ_EOVR 0x2
#define ECC_READ_EDOVR 0x3

+/* DRAM and OCRAM require cold reset */
+#define S10_COLD_RESET_MASK 0x30002
+#define S10_WARM_RESET_WFI_FLAG BIT(31)
+
struct altr_edac_device_dev;

struct edac_device_prv_data {
--
2.7.4


2019-06-04 17:24:51

by James Morse

[permalink] [raw]
Subject: Re: [PATCH] EDAC/altera: Warm Reset option for Stratix10 peripheral DBE

Hi Thor,

(CC: +Mark, Lorenzo and Sudeep for PSCI.
How should SYSTEM_RESET2 be used for a vendor-specific reset?

The original patch is:
lore.kernel.org/r/[email protected]
)

On 03/06/2019 21:37, [email protected] wrote:
> From: Thor Thayer <[email protected]>
>
> The Stratix10 peripheral FIFO memories can recover from double
> bit errors with a warm reset instead of a cold reset.
> Add the option of a warm reset for peripheral (USB, Ethernet)
> memories.
>
> CPU memories such as SDRAM and OCRAM require a cold reset for
> DBEs.
> Filter on whether the error is a SDRAM/OCRAM or a peripheral
> FIFO memory to determine which reset to use when the warm
> reset option is configured.

... so you want to make different SMC calls on each CPU after panic()?


> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> index 8816f74a22b4..179601f14b48 100644
> --- a/drivers/edac/altera_edac.c
> +++ b/drivers/edac/altera_edac.c
> @@ -2036,6 +2036,19 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
> /* panic routine issues reboot on non-zero panic_timeout */
> extern int panic_timeout;
>
> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
> +/* EL3 SMC call to setup CPUs for warm reset */
> +void panic_smp_self_stop(void)
> +{
> + struct arm_smccc_res result;
> +
> + __cpu_disable();
> + cpu_relax();
> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, S10_WARM_RESET_WFI_FLAG,
> + S10_WARM_RESET_WFI_FLAG, 0, 0, 0, 0, 0, &result);
> +}
> +#endif

Oooer!

panic_smp_self_stop() isn't for drivers to override: only the arch code.
__cpu_disable() is only for the cpu-hotplug machinery. Nothing else should touch it.

Isn't this thing only called if another CPU out there is panic()ing too?


I think one of the problems here is arm64 leaves secondary CPUs running after panic().
This would be better fixed by using the appropriate cpu_ops[]->cpu_die() call in arm64's
ipi_cpu_stop().


As for passing platform-specific options, PSCI[0] has a 'reset_type' for SYSTEM_RESET2,
which looks suspiciously like what you want here. I'm not sure how its expected to be
used... hopefully the PSCI maintainers can give us some pointers.

(The existing support is commit 4302e381a870 ("firmware/psci: add support for SYSTEM_RESET2"))


Is it possible for firmware to do both the cold/warm reset work when SYSTEM_RESET is
called? This would mean you don't have to care here and there are fewer choices to be made
overall.
If not, is there anything left behind that can give it the hint? Like non-zero error
counters for the USB/Ethernet devices?


> @@ -2067,14 +2080,28 @@ static int s10_edac_dberr_handler(struct notifier_block *this,
> regmap_write(edac->ecc_mgr_map,
> S10_SYSMGR_UE_ADDR_OFST, err_addr);
> edac_printk(KERN_ERR, EDAC_DEVICE,
> - "EDAC: [Fatal DBE on %s @ 0x%08X]\n",
> - ed->edac_dev_name, err_addr);
> + "EDAC: [Fatal DBE on %s [CPU=%d] @ 0x%08X]\n",
> + ed->edac_dev_name, raw_smp_processor_id(),
> + err_addr);
> break;
> }
> /* Notify the System through SMC. Reboot delay = 1 second */
> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
> + /* Handle peripheral FIFO DBE as Warm Resets */
> + if (dberror & S10_COLD_RESET_MASK) {


> + panic_timeout = 1;

Isn't this value supposed to be provided on the kernel commandline? Surely this prevents
debug using the commandline option to increase the delay?

(I see you already change it)


> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0,
> + 0, 0, 0, &result);
> + } else {
> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE,
> + S10_WARM_RESET_WFI_FLAG | dberror, 0, 0,
> + 0, 0, 0, 0, &result);
> + }
> +#else
> panic_timeout = 1;
> arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0, 0,
> 0, 0, &result);
> +#endif
> }
>
> return NOTIFY_DONE;

What do these SMC do? Are they equivalent to the PSCI CPU online/offline calls?

panic() notifiers aren't robust as they can be skipped if kdump is loaded.


Thanks,

James


[0]
https://static.docs.arm.com/den0022/d/Power_State_Coordination_Interface_PDD_v1_1_DEN0022D.pdf

2019-06-04 17:41:36

by Sudeep Holla

[permalink] [raw]
Subject: Re: [PATCH] EDAC/altera: Warm Reset option for Stratix10 peripheral DBE

On Tue, Jun 04, 2019 at 06:23:15PM +0100, James Morse wrote:
> Hi Thor,
>
> (CC: +Mark, Lorenzo and Sudeep for PSCI.
> How should SYSTEM_RESET2 be used for a vendor-specific reset?
>

Initially it was indented to be used by passing command line argument
"reboot=w" or "reboot=warm" as specified in kernel document[1]

However it was enhanced and enabled specifically for panic by
Commit b287a25a7148 ("panic/reboot: allow specifying reboot_mode for panic only")

IIUC you can now pass "reboot=panic_warm" to just set reboot_mode to
WARM when there's a panic. SYSTEM_RESET2 gets called whenever reboot_mode
is set to WARM/SOFT

> The original patch is:
> lore.kernel.org/r/[email protected]
> )
>
> On 03/06/2019 21:37, [email protected] wrote:
> > From: Thor Thayer <[email protected]>
> >
> > The Stratix10 peripheral FIFO memories can recover from double
> > bit errors with a warm reset instead of a cold reset.
> > Add the option of a warm reset for peripheral (USB, Ethernet)
> > memories.
> >
> > CPU memories such as SDRAM and OCRAM require a cold reset for
> > DBEs.
> > Filter on whether the error is a SDRAM/OCRAM or a peripheral
> > FIFO memory to determine which reset to use when the warm
> > reset option is configured.
>
> ... so you want to make different SMC calls on each CPU after panic()?
>
>
> > diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> > index 8816f74a22b4..179601f14b48 100644
> > --- a/drivers/edac/altera_edac.c
> > +++ b/drivers/edac/altera_edac.c
> > @@ -2036,6 +2036,19 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
> > /* panic routine issues reboot on non-zero panic_timeout */
> > extern int panic_timeout;
> >
> > +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
> > +/* EL3 SMC call to setup CPUs for warm reset */
> > +void panic_smp_self_stop(void)
> > +{
> > + struct arm_smccc_res result;
> > +
> > + __cpu_disable();
> > + cpu_relax();
> > + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, S10_WARM_RESET_WFI_FLAG,
> > + S10_WARM_RESET_WFI_FLAG, 0, 0, 0, 0, 0, &result);

Please use SYSTEM_RESET2 or let us know why it can't be used to understand
the requirement better. There are options to use vendor extentions with
the SYSTEM_RESET2 PSCI command if you really have to. However the mainline
supports only architectural warm reset.

--
Regards,
Sudeep

[1] Documentation/admin-guide/kernel-parameters.txt

2019-06-04 21:41:25

by Thor Thayer

[permalink] [raw]
Subject: Re: [PATCH] EDAC/altera: Warm Reset option for Stratix10 peripheral DBE

Hi James,

On 6/4/19 12:23 PM, James Morse wrote:
> Hi Thor,
>
> (CC: +Mark, Lorenzo and Sudeep for PSCI.
> How should SYSTEM_RESET2 be used for a vendor-specific reset?
>
> The original patch is:
> lore.kernel.org/r/[email protected]
> )
>

Thank you for reviewing my patch and bringing in additional expertise.

> On 03/06/2019 21:37, [email protected] wrote:
>> From: Thor Thayer <[email protected]>
>>
>> The Stratix10 peripheral FIFO memories can recover from double
>> bit errors with a warm reset instead of a cold reset.
>> Add the option of a warm reset for peripheral (USB, Ethernet)
>> memories.
>>
>> CPU memories such as SDRAM and OCRAM require a cold reset for
>> DBEs.
>> Filter on whether the error is a SDRAM/OCRAM or a peripheral
>> FIFO memory to determine which reset to use when the warm
>> reset option is configured.
>
> ... so you want to make different SMC calls on each CPU after panic()?
>
Yes. The registers needed for a warm reset need to be setup in EL3
(U-Boot). As part of this, I needed the other CPUs (3 of the 4) to idle
in WFI.

>
>> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
>> index 8816f74a22b4..179601f14b48 100644
>> --- a/drivers/edac/altera_edac.c
>> +++ b/drivers/edac/altera_edac.c
>> @@ -2036,6 +2036,19 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
>> /* panic routine issues reboot on non-zero panic_timeout */
>> extern int panic_timeout;
>>
>> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
>> +/* EL3 SMC call to setup CPUs for warm reset */
>> +void panic_smp_self_stop(void)
>> +{
>> + struct arm_smccc_res result;
>> +
>> + __cpu_disable();
>> + cpu_relax();
>> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, S10_WARM_RESET_WFI_FLAG,
>> + S10_WARM_RESET_WFI_FLAG, 0, 0, 0, 0, 0, &result);
>> +}
>> +#endif
>
> Oooer!
>
> panic_smp_self_stop() isn't for drivers to override: only the arch code.
> __cpu_disable() is only for the cpu-hotplug machinery. Nothing else should touch it.
>
> Isn't this thing only called if another CPU out there is panic()ing too?
>
OK. I wasn't aware of that intention. Thanks.

On an SError, all the CPUs except the first one end up calling
panic_smp_self_stop().

All the CPUs are calling the panic routine which follows this path:
arm64_serror_panic() -> nmi_panic()

In nmi_panic()
If this is the first CPU to reach nmi_panic(), call panic()
otherwise, call nmi_panic_self_stop() -> panic_smp_self_stop()

panic_smp_self_stop() is a while(1) cpu_relax() which is a yield but I
needed a WFI.

>
> I think one of the problems here is arm64 leaves secondary CPUs running after panic().
> This would be better fixed by using the appropriate cpu_ops[]->cpu_die() call in arm64's
> ipi_cpu_stop().
>
Yes. I will look into this.

>
> As for passing platform-specific options, PSCI[0] has a 'reset_type' for SYSTEM_RESET2,
> which looks suspiciously like what you want here. I'm not sure how its expected to be
> used... hopefully the PSCI maintainers can give us some pointers.
>
> (The existing support is commit 4302e381a870 ("firmware/psci: add support for SYSTEM_RESET2"))
>
>
> Is it possible for firmware to do both the cold/warm reset work when SYSTEM_RESET is
> called? This would mean you don't have to care here and there are fewer choices to be made
> overall.
> If not, is there anything left behind that can give it the hint? Like non-zero error
> counters for the USB/Ethernet devices?
>

Hmm. I'll look into this. I was trying to handle more in Linux so that I
could do a best effort of printout notification of the error, etc for
the logs (even though I know getting this far or printing out isn't
guaranteed)

>
>> @@ -2067,14 +2080,28 @@ static int s10_edac_dberr_handler(struct notifier_block *this,
>> regmap_write(edac->ecc_mgr_map,
>> S10_SYSMGR_UE_ADDR_OFST, err_addr);
>> edac_printk(KERN_ERR, EDAC_DEVICE,
>> - "EDAC: [Fatal DBE on %s @ 0x%08X]\n",
>> - ed->edac_dev_name, err_addr);
>> + "EDAC: [Fatal DBE on %s [CPU=%d] @ 0x%08X]\n",
>> + ed->edac_dev_name, raw_smp_processor_id(),
>> + err_addr);
>> break;
>> }
>> /* Notify the System through SMC. Reboot delay = 1 second */
>> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
>> + /* Handle peripheral FIFO DBE as Warm Resets */
>> + if (dberror & S10_COLD_RESET_MASK) {
>
>
>> + panic_timeout = 1;
>
> Isn't this value supposed to be provided on the kernel commandline? Surely this prevents
> debug using the commandline option to increase the delay?
>
> (I see you already change it)

I see your point but I only override the value when I have an ECC DBE panic.
>
>
>> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0,
>> + 0, 0, 0, &result);
>> + } else {
>> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE,
>> + S10_WARM_RESET_WFI_FLAG | dberror, 0, 0,
>> + 0, 0, 0, 0, &result);
>> + }
>> +#else
>> panic_timeout = 1;
>> arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0, 0,
>> 0, 0, &result);
>> +#endif
>> }
>>
>> return NOTIFY_DONE;
>
> What do these SMC do? Are they equivalent to the PSCI CPU online/offline calls?
>
> panic() notifiers aren't robust as they can be skipped if kdump is loaded.
>
These are calling into the firmware to 1) notify the firmware of an
upcoming cold reset in the case of cold reset or 2) call the warm reset
handler in the firmware.

I missed that limitation of the panic() notifiers.

Thanks for the insightful comments!

Thor
>
> Thanks,
>
> James
>
>
> [0]
> https://static.docs.arm.com/den0022/d/Power_State_Coordination_Interface_PDD_v1_1_DEN0022D.pdf
>

2019-06-04 21:52:56

by Thor Thayer

[permalink] [raw]
Subject: Re: [PATCH] EDAC/altera: Warm Reset option for Stratix10 peripheral DBE

Hi Sudeep,

On 6/4/19 12:38 PM, Sudeep Holla wrote:
> On Tue, Jun 04, 2019 at 06:23:15PM +0100, James Morse wrote:
>> Hi Thor,
>>
>> (CC: +Mark, Lorenzo and Sudeep for PSCI.
>> How should SYSTEM_RESET2 be used for a vendor-specific reset?
>>
>
> Initially it was indented to be used by passing command line argument
> "reboot=w" or "reboot=warm" as specified in kernel document[1]
>
> However it was enhanced and enabled specifically for panic by
> Commit b287a25a7148 ("panic/reboot: allow specifying reboot_mode for panic only")
>
> IIUC you can now pass "reboot=panic_warm" to just set reboot_mode to
> WARM when there's a panic. SYSTEM_RESET2 gets called whenever reboot_mode
> is set to WARM/SOFT
>
Thanks. I missed that SYSTEM_RESET2 had been implemented.

>> The original patch is:
>> lore.kernel.org/r/[email protected]
>> )
>>
>> On 03/06/2019 21:37, [email protected] wrote:
>>> From: Thor Thayer <[email protected]>
>>>
>>> The Stratix10 peripheral FIFO memories can recover from double
>>> bit errors with a warm reset instead of a cold reset.
>>> Add the option of a warm reset for peripheral (USB, Ethernet)
>>> memories.
>>>
>>> CPU memories such as SDRAM and OCRAM require a cold reset for
>>> DBEs.
>>> Filter on whether the error is a SDRAM/OCRAM or a peripheral
>>> FIFO memory to determine which reset to use when the warm
>>> reset option is configured.
>>
>> ... so you want to make different SMC calls on each CPU after panic()?
>>
>>
>>> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
>>> index 8816f74a22b4..179601f14b48 100644
>>> --- a/drivers/edac/altera_edac.c
>>> +++ b/drivers/edac/altera_edac.c
>>> @@ -2036,6 +2036,19 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = {
>>> /* panic routine issues reboot on non-zero panic_timeout */
>>> extern int panic_timeout;
>>>
>>> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
>>> +/* EL3 SMC call to setup CPUs for warm reset */
>>> +void panic_smp_self_stop(void)
>>> +{
>>> + struct arm_smccc_res result;
>>> +
>>> + __cpu_disable();
>>> + cpu_relax();
>>> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, S10_WARM_RESET_WFI_FLAG,
>>> + S10_WARM_RESET_WFI_FLAG, 0, 0, 0, 0, 0, &result);
>
> Please use SYSTEM_RESET2 or let us know why it can't be used to understand
> the requirement better. There are options to use vendor extentions with
> the SYSTEM_RESET2 PSCI command if you really have to. However the mainline
> supports only architectural warm reset.
>
I need to decide between warm reset and cold reset based on the
peripheral type but maybe that decision can be done by firmware as James
pointed out.

Thanks for the links and the comments!

Thor
> --
> Regards,
> Sudeep
>
> [1] Documentation/admin-guide/kernel-parameters.txt
>