2023-07-11 15:34:57

by Mike Looijmans

[permalink] [raw]
Subject: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue

When there's an error attempting to store the BER counter, don't abort
but continue shutting down the chip as required.

After disabling communications, also stop the TX queue with a call to
netif_stop_queue. When the interface restarts, mcp251xfd_set_mode will
call netif_wake_queue and resume.

This fixes a hangup in either send() or poll() from userspace. To
reproduce: I ran "cansequence can0 -p" to flood the system with packets.
While running that, I shorted the CAN signals, causing a bus error.
Usually communications would resume after the CAN bus restarted, but
sometimes the system got stuck and refused to send any more packets.
The sending process would be in either poll() or send(), waiting for
the queue to resume. To "unstuck" the process from send() it was
sufficient to send any packet on the can interface from another
process. To get it out of the poll() hang, only bringing the can
interface down (and up) would work.

After adding the netif_stop_queue call, I was unable to reproduce the
problem.

Signed-off-by: Mike Looijmans <[email protected]>

---

drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
index 68df6d4641b5..854e0644764c 100644
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
@@ -1084,10 +1084,11 @@ static int mcp251xfd_handle_cerrif(struct mcp251xfd_priv *priv)
*/
err = __mcp251xfd_get_berr_counter(priv->ndev, &priv->bec);
if (err)
- return err;
+ netdev_err(priv->ndev, "Failed to store BER counter\n");

mcp251xfd_chip_stop(priv, CAN_STATE_BUS_OFF);
can_bus_off(priv->ndev);
+ netif_stop_queue(priv->ndev);
}

if (!skb)
--
2.17.1


Met vriendelijke groet / kind regards,

Mike Looijmans
System Expert


TOPIC Embedded Products B.V.
Materiaalweg 4, 5681 RJ Best
The Netherlands

T: +31 (0) 499 33 69 69
E: [email protected]
W: http://www.topic.nl

Please consider the environment before printing this e-mail


2023-07-11 17:34:17

by Marc Kleine-Budde

[permalink] [raw]
Subject: Re: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue

On 11.07.2023 17:26:47, Mike Looijmans wrote:
> When there's an error attempting to store the BER counter, don't abort
> but continue shutting down the chip as required.

If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
store error, but a failure of regmap_read(priv->map_reg,
MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
and no/wrong data from the chip will be detected and return an error
here (after 3 retires). In out of memory conditions or other kernel
errors might be possible here, too.

Have you seen a problem here? But as we shut down the chip here anyways,
we can ignore the error here.

> After disabling communications, also stop the TX queue with a call to
> netif_stop_queue.

can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
you enabled automatic restart in case of bus off? I think the netdev
watchdog will kick you, if the interface has a stooped queue for too
long (IIRC 5 seconds).

> When the interface restarts, mcp251xfd_set_mode will
> call netif_wake_queue and resume.
>
> This fixes a hangup in either send() or poll() from userspace. To
> reproduce: I ran "cansequence can0 -p" to flood the system with packets.
> While running that, I shorted the CAN signals, causing a bus error.
> Usually communications would resume after the CAN bus restarted, but
> sometimes the system got stuck and refused to send any more packets.
> The sending process would be in either poll() or send(), waiting for
> the queue to resume. To "unstuck" the process from send() it was
> sufficient to send any packet on the can interface from another
> process. To get it out of the poll() hang, only bringing the can
> interface down (and up) would work.
>
> After adding the netif_stop_queue call, I was unable to reproduce the
> problem.

The newly added netif_stop_queue() will cause the netif_wake_queue() in
the mcp251xfd_set_mode() to actually wake the queue. If you observe a
problem, I think it's a general problem, so all drivers would be
effected.

> Signed-off-by: Mike Looijmans <[email protected]>

Marc

--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung Nürnberg | Phone: +49-5121-206917-129 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |


Attachments:
(No filename) (2.36 kB)
signature.asc (499.00 B)
Download all attachments

2023-07-12 06:05:54

by Mike Looijmans

[permalink] [raw]
Subject: Re: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue


Met vriendelijke groet / kind regards,

Mike Looijmans
System Expert


TOPIC Embedded Products B.V.
Materiaalweg 4, 5681 RJ Best
The Netherlands

T: +31 (0) 499 33 69 69
E: [email protected]
W: http://www.topic.nl

Please consider the environment before printing this e-mail
On 11-07-2023 19:05, Marc Kleine-Budde wrote:
> On 11.07.2023 17:26:47, Mike Looijmans wrote:
>> When there's an error attempting to store the BER counter, don't abort
>> but continue shutting down the chip as required.
>
> If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
> store error, but a failure of regmap_read(priv->map_reg,
> MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
> and no/wrong data from the chip will be detected and return an error
> here (after 3 retires). In out of memory conditions or other kernel
> errors might be possible here, too.
>
> Have you seen a problem here? But as we shut down the chip here anyways,
> we can ignore the error here.

Excactly my point, the important task here is to report the bus off state.

I haven't seen the error trigger, so it wasn't the cause of our problems.

In retrospect, maybe I should have split this in two patches?


>> After disabling communications, also stop the TX queue with a call to
>> netif_stop_queue.
>
> can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
> you enabled automatic restart in case of bus off? I think the netdev
> watchdog will kick you, if the interface has a stooped queue for too
> long (IIRC 5 seconds).


Apparently not. The problem also occurred in the field, where the CAN
communication got stuck for something like a full day.

Automatic restart was enabled, with a 100ms timeout. Setting a shorter timeout
like 50 ms seemed to make the problem more likely to occur

What I also noticed, but didn't accurately measure, is that the time it took
to recover used to be somewhat random, it could take up to about a second for
packets to start going out again, as observed by watching LEDs. After adding
netif_stop_queue the recovery was almost instantaneous.


>
>> When the interface restarts, mcp251xfd_set_mode will
>> call netif_wake_queue and resume.
>>
>> This fixes a hangup in either send() or poll() from userspace. To
>> reproduce: I ran "cansequence can0 -p" to flood the system with packets.
>> While running that, I shorted the CAN signals, causing a bus error.
>> Usually communications would resume after the CAN bus restarted, but
>> sometimes the system got stuck and refused to send any more packets.
>> The sending process would be in either poll() or send(), waiting for
>> the queue to resume. To "unstuck" the process from send() it was
>> sufficient to send any packet on the can interface from another
>> process. To get it out of the poll() hang, only bringing the can
>> interface down (and up) would work.
>>
>> After adding the netif_stop_queue call, I was unable to reproduce the
>> problem.
>
> The newly added netif_stop_queue() will cause the netif_wake_queue() in
> the mcp251xfd_set_mode() to actually wake the queue. If you observe a
> problem, I think it's a general problem, so all drivers would be
> effected.

After two days digging, my conclusion was that "something" wasn't getting a
proper "kick" when the CAN interface came back online.


One could argue the netif_stop_queue should be put into can_bus_off. But the
netif_wake_queue call is in the driver, so that would break symmetry.


>
>> Signed-off-by: Mike Looijmans <[email protected]>
>
> Marc
>