2024-01-09 19:53:34

by Andrea Fois

[permalink] [raw]
Subject: [PATCH] tg3: add new module param to force device power down on reboot

The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
device on system reboot to avoid triggering AER") but was reintroduced
by commit 9fc3bc764334 ("tg3: power down device only on
SYSTEM_POWER_OFF").

The problem described in #1917471 is still consistently replicable on
reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
(i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
was committed.

The problem is detected also by the Lifecycle controller and logged as
a PCI Bus Error for the device.

As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
opposite strategies, a new module param "force_pwr_down_on_reboot"
<bool> is introduced to fix both scenarios:

force_pwr_down_on_reboot = 0/N/n = disable, keep the current
behavior, don't force dev
power down on reboot

force_pwr_down_on_reboot = 1/Y/y = enable, revert to the
behavior of 2ca1c94ce0b6,
force dev power down on reboot

Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
Signed-off-by: Andrea Fois <[email protected]>
---
drivers/net/ethernet/broadcom/tg3.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index f52830dfb26a..287786357c9b 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -233,6 +233,12 @@ static int tg3_debug = -1; /* -1 == use TG3_DEF_MSG_ENABLE as value */
module_param(tg3_debug, int, 0);
MODULE_PARM_DESC(tg3_debug, "Tigon3 bitmapped debugging message enable value");

+static bool force_pwr_down_on_reboot; /* false == Don't force the power down of
+ * the device during reboot, only on SYSTEM_POWER_OFF
+ */
+module_param(force_pwr_down_on_reboot, bool, 0x644);
+MODULE_PARM_DESC(force_pwr_down_on_reboot, "Tigon3 force power down of the device on reboot enable value");
+
#define TG3_DRV_DATA_FLAG_10_100_ONLY 0x0001
#define TG3_DRV_DATA_FLAG_5705_10_100 0x0002

@@ -18197,7 +18203,7 @@ static void tg3_shutdown(struct pci_dev *pdev)
if (netif_running(dev))
dev_close(dev);

- if (system_state == SYSTEM_POWER_OFF)
+ if (system_state == SYSTEM_POWER_OFF || force_pwr_down_on_reboot)
tg3_power_down(tp);

rtnl_unlock();
--
2.40.1



2024-01-09 20:31:40

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH] tg3: add new module param to force device power down on reboot

On 09.01.2024 20:45, Andrea Fois wrote:
> The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
> device on system reboot to avoid triggering AER") but was reintroduced
> by commit 9fc3bc764334 ("tg3: power down device only on
> SYSTEM_POWER_OFF").
>
> The problem described in #1917471 is still consistently replicable on
> reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
> (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
> was committed.
>
> The problem is detected also by the Lifecycle controller and logged as
> a PCI Bus Error for the device.
>
> As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
> opposite strategies, a new module param "force_pwr_down_on_reboot"
> <bool> is introduced to fix both scenarios:
>
Adding module parameters is discouraged. What I see could try:

- limit 9fc3bc764334 to the specific machine type mentioned in the
commit message (based DMI info)
- 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
Based on the commit description disabling the device might be sufficient.

> force_pwr_down_on_reboot = 0/N/n = disable, keep the current
> behavior, don't force dev
> power down on reboot
>
> force_pwr_down_on_reboot = 1/Y/y = enable, revert to the
> behavior of 2ca1c94ce0b6,
> force dev power down on reboot
>
> Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF")
> Signed-off-by: Andrea Fois <[email protected]>
> ---
> drivers/net/ethernet/broadcom/tg3.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index f52830dfb26a..287786357c9b 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -233,6 +233,12 @@ static int tg3_debug = -1; /* -1 == use TG3_DEF_MSG_ENABLE as value */
> module_param(tg3_debug, int, 0);
> MODULE_PARM_DESC(tg3_debug, "Tigon3 bitmapped debugging message enable value");
>
> +static bool force_pwr_down_on_reboot; /* false == Don't force the power down of
> + * the device during reboot, only on SYSTEM_POWER_OFF
> + */
> +module_param(force_pwr_down_on_reboot, bool, 0x644);
> +MODULE_PARM_DESC(force_pwr_down_on_reboot, "Tigon3 force power down of the device on reboot enable value");
> +
> #define TG3_DRV_DATA_FLAG_10_100_ONLY 0x0001
> #define TG3_DRV_DATA_FLAG_5705_10_100 0x0002
>
> @@ -18197,7 +18203,7 @@ static void tg3_shutdown(struct pci_dev *pdev)
> if (netif_running(dev))
> dev_close(dev);
>
> - if (system_state == SYSTEM_POWER_OFF)
> + if (system_state == SYSTEM_POWER_OFF || force_pwr_down_on_reboot)
> tg3_power_down(tp);
>
> rtnl_unlock();


2024-01-10 04:13:21

by Pavan Chebbi

[permalink] [raw]
Subject: Re: [PATCH] tg3: add new module param to force device power down on reboot

On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <[email protected]> wrote:
>
> On 09.01.2024 20:45, Andrea Fois wrote:
> > The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
> > device on system reboot to avoid triggering AER") but was reintroduced
> > by commit 9fc3bc764334 ("tg3: power down device only on
> > SYSTEM_POWER_OFF").
> >
> > The problem described in #1917471 is still consistently replicable on
> > reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
> > (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
> > was committed.
> >
> > The problem is detected also by the Lifecycle controller and logged as
> > a PCI Bus Error for the device.
> >
> > As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
> > opposite strategies, a new module param "force_pwr_down_on_reboot"
> > <bool> is introduced to fix both scenarios:
> >
> Adding module parameters is discouraged. What I see could try:
Ack.
>
> - limit 9fc3bc764334 to the specific machine type mentioned in the
> commit message (based DMI info)
> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
> Based on the commit description disabling the device might be sufficient.

I think the second suggestion could be a better solution. Helps to
solve the issue 9fc3bc764334 is trying to fix.
But I am not sure how easy it is to test. As I recall, Goerge was
unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
test his patch for regression.
We did discuss the risk of this regression.
https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
Unfortunately, looks like it has come true :(


Attachments:
smime.p7s (4.11 kB)
S/MIME Cryptographic Signature

2024-01-10 07:09:31

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH] tg3: add new module param to force device power down on reboot

On 10.01.2024 05:12, Pavan Chebbi wrote:
> On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <[email protected]> wrote:
>>
>> On 09.01.2024 20:45, Andrea Fois wrote:
>>> The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
>>> device on system reboot to avoid triggering AER") but was reintroduced
>>> by commit 9fc3bc764334 ("tg3: power down device only on
>>> SYSTEM_POWER_OFF").
>>>
>>> The problem described in #1917471 is still consistently replicable on
>>> reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
>>> (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
>>> was committed.
>>>
>>> The problem is detected also by the Lifecycle controller and logged as
>>> a PCI Bus Error for the device.
>>>
>>> As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
>>> opposite strategies, a new module param "force_pwr_down_on_reboot"
>>> <bool> is introduced to fix both scenarios:
>>>
>> Adding module parameters is discouraged. What I see could try:
> Ack.
>>
>> - limit 9fc3bc764334 to the specific machine type mentioned in the
>> commit message (based DMI info)
>> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
>> Based on the commit description disabling the device might be sufficient.
>
> I think the second suggestion could be a better solution. Helps to
> solve the issue 9fc3bc764334 is trying to fix.
> But I am not sure how easy it is to test. As I recall, Goerge was
> unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
> test his patch for regression.
> We did discuss the risk of this regression.
> https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> Unfortunately, looks like it has come true :(

If the culprit is an unexpected MSI interrupt, then you may also address
this directly by disabling interrupts in the device. tg3_stop() may be
a candidate here.


2024-01-10 07:18:03

by Michael Chan

[permalink] [raw]
Subject: Re: [PATCH] tg3: add new module param to force device power down on reboot

On Tue, Jan 9, 2024 at 11:09 PM Heiner Kallweit <[email protected]> wrote:
>
> On 10.01.2024 05:12, Pavan Chebbi wrote:
> > I think the second suggestion could be a better solution. Helps to
> > solve the issue 9fc3bc764334 is trying to fix.
> > But I am not sure how easy it is to test. As I recall, Goerge was
> > unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
> > test his patch for regression.
> > We did discuss the risk of this regression.
> > https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> > Unfortunately, looks like it has come true :(
>
> If the culprit is an unexpected MSI interrupt, then you may also address
> this directly by disabling interrupts in the device. tg3_stop() may be
> a candidate here.
>

We already call dev_close() which will call tg3_close() -> tg3_stop()
a few lines above.


Attachments:
smime.p7s (4.11 kB)
S/MIME Cryptographic Signature

2024-01-10 07:34:53

by Heiner Kallweit

[permalink] [raw]
Subject: Re: [PATCH] tg3: add new module param to force device power down on reboot

On 10.01.2024 08:17, Michael Chan wrote:
> On Tue, Jan 9, 2024 at 11:09 PM Heiner Kallweit <[email protected]> wrote:
>>
>> On 10.01.2024 05:12, Pavan Chebbi wrote:
>>> I think the second suggestion could be a better solution. Helps to
>>> solve the issue 9fc3bc764334 is trying to fix.
>>> But I am not sure how easy it is to test. As I recall, Goerge was
>>> unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
>>> test his patch for regression.
>>> We did discuss the risk of this regression.
>>> https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>>> Unfortunately, looks like it has come true :(
>>
>> If the culprit is an unexpected MSI interrupt, then you may also address
>> this directly by disabling interrupts in the device. tg3_stop() may be
>> a candidate here.
>>
>
> We already call dev_close() which will call tg3_close() -> tg3_stop()
> a few lines above.

tg3_stop() calls tg3_disable_ints(), so I wonder how a MSI interrupt can
occur after that. Does tg3_disable_ints() disable interrupts synchronously?
Or maybe some kind of commit is needed?


2024-01-10 17:02:03

by Michael Chan

[permalink] [raw]
Subject: Re: [PATCH] tg3: add new module param to force device power down on reboot

On Tue, Jan 9, 2024 at 11:34 PM Heiner Kallweit <[email protected]> wrote:
>
> On 10.01.2024 08:17, Michael Chan wrote:
> > We already call dev_close() which will call tg3_close() -> tg3_stop()
> > a few lines above.
>
> tg3_stop() calls tg3_disable_ints(), so I wonder how a MSI interrupt can
> occur after that. Does tg3_disable_ints() disable interrupts synchronously?
> Or maybe some kind of commit is needed?
>

Yes, it is synchronous. The tg3_full_lock() call before
tg3_disable_ints() makes it synchronous.


Attachments:
smime.p7s (4.11 kB)
S/MIME Cryptographic Signature