In bi-directional CAN transfer using M_CAN IP, with
the frame gap being set to '0', it leads to Protocol
error in Arbitration phase resulting in communication
stall.
Discussed with Bosch M_CAN IP team and the stall issue
can only be overcome by controlling the tx and rx
packets flow as done by the patch.
Rx packets would also be serviced when there is a tx
interrupt. The solution has been tested extensively for
more than 10 days, and no issues has been observed.
Setup that is used to reproduce the issue:
+---------------------+ +----------------------+
|Intel ElkhartLake | |Intel ElkhartLake |
| +--------+ | | +--------+ |
| |m_can 0 | |<=======>| |m_can 0 | |
| +--------+ | | +--------+ |
+---------------------+ +----------------------+
Steps to be run on the two Elkhartlake HW:
1. ip link set can0 type can bitrate 1000000
2. ip link set can0 txqueuelen 2048
3. ip link set can0 up
4. cangen -g 0 can0
5. candump can0
cangen -g 0 can0 & candump can0 commands are used for transmit and
receive on both the m_can HW simultaneously where -g is the frame gap
between two frames.
Signed-off-by: Kumari Pallavi <[email protected]>
Signed-off-by: Srikanth Thokala <[email protected]>
---
drivers/net/can/m_can/m_can.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
index a5003435802b..94aa0ba89202 100644
--- a/drivers/net/can/m_can/m_can.c
+++ b/drivers/net/can/m_can/m_can.c
@@ -1118,7 +1118,7 @@ static irqreturn_t m_can_isr(int irq, void *dev_id)
/* New TX FIFO Element arrived */
if (m_can_echo_tx_event(dev) != 0)
goto out_fail;
-
+ m_can_write(cdev, M_CAN_IE, IR_ALL_INT & ~(IR_TEFN));
if (netif_queue_stopped(dev) &&
!m_can_tx_fifo_full(cdev))
netif_wake_queue(dev);
@@ -1787,6 +1787,7 @@ static netdev_tx_t m_can_start_xmit(struct sk_buff *skb,
}
} else {
cdev->tx_skb = skb;
+ m_can_write(cdev, M_CAN_IE, IR_ALL_INT & (IR_TEFN));
return m_can_tx_handler(cdev);
}
--
2.17.1
Hi,
> -----Original Message-----
> From: Kumari Pallavi <[email protected]>
> Sent: Friday, June 23, 2023 2:29 PM
> To: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: Sangannavar, Mallikarjunappa <[email protected]>;
> Nikula, Jarkko <[email protected]>; [email protected];
> [email protected]; [email protected]; Kumari Pallavi
> <[email protected]>; Thokala, Srikanth <[email protected]>
> Subject: [RESEND] [PATCH 1/1] can: m_can: Control tx and rx flow to avoid
> communication stall
>
> In bi-directional CAN transfer using M_CAN IP, with the frame gap being set to
> '0', it leads to Protocol error in Arbitration phase resulting in communication
> stall.
> Discussed with Bosch M_CAN IP team and the stall issue can only be overcome
> by controlling the tx and rx packets flow as done by the patch.
>
> Rx packets would also be serviced when there is a tx interrupt. The solution has
> been tested extensively for more than 10 days, and no issues has been observed.
>
> Setup that is used to reproduce the issue:
>
> +---------------------+ +----------------------+
> |Intel ElkhartLake | |Intel ElkhartLake |
> | +--------+ | | +--------+ |
> | |m_can 0 | |<=======>| |m_can 0 | |
> | +--------+ | | +--------+ |
> +---------------------+ +----------------------+
>
> Steps to be run on the two Elkhartlake HW:
>
> 1. ip link set can0 type can bitrate 1000000 2. ip link set can0 txqueuelen 2048 3.
> ip link set can0 up 4. cangen -g 0 can0 5. candump can0
>
> cangen -g 0 can0 & candump can0 commands are used for transmit and receive
> on both the m_can HW simultaneously where -g is the frame gap between two
> frames.
>
> Signed-off-by: Kumari Pallavi <[email protected]>
> Signed-off-by: Srikanth Thokala <[email protected]>
> ---
> drivers/net/can/m_can/m_can.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> index a5003435802b..94aa0ba89202 100644
> --- a/drivers/net/can/m_can/m_can.c
> +++ b/drivers/net/can/m_can/m_can.c
> @@ -1118,7 +1118,7 @@ static irqreturn_t m_can_isr(int irq, void *dev_id)
> /* New TX FIFO Element arrived */
> if (m_can_echo_tx_event(dev) != 0)
> goto out_fail;
> -
> + m_can_write(cdev, M_CAN_IE, IR_ALL_INT &
> ~(IR_TEFN));
> if (netif_queue_stopped(dev) &&
> !m_can_tx_fifo_full(cdev))
> netif_wake_queue(dev);
> @@ -1787,6 +1787,7 @@ static netdev_tx_t m_can_start_xmit(struct sk_buff
> *skb,
> }
> } else {
> cdev->tx_skb = skb;
> + m_can_write(cdev, M_CAN_IE, IR_ALL_INT & (IR_TEFN));
> return m_can_tx_handler(cdev);
> }
>
Kindly review the patch and please let us know if any comments!
Thanks,
Pallavi
> --
> 2.17.1
On 23.06.2023 14:29:20, Kumari Pallavi wrote:
> In bi-directional CAN transfer using M_CAN IP, with
> the frame gap being set to '0', it leads to Protocol
> error in Arbitration phase resulting in communication
> stall.
Is there a (public) erratum describing the problem?
> Discussed with Bosch M_CAN IP team and the stall issue
> can only be overcome by controlling the tx and rx
> packets flow as done by the patch.
Please elaborate the suggested workaround.
> Rx packets would also be serviced when there is a tx
> interrupt. The solution has been tested extensively for
> more than 10 days, and no issues has been observed.
Can you describe how your patch implements the workaround?
| Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
| instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
| to do frotz", as if you are giving orders to the codebase to change
| its behaviour.
See: https://github.com/torvalds/linux/blob/master/Documentation/process/submitting-patches.rst#describe-your-changes
> Setup that is used to reproduce the issue:
>
> +---------------------+ +----------------------+
> |Intel ElkhartLake | |Intel ElkhartLake |
> | +--------+ | | +--------+ |
> | |m_can 0 | |<=======>| |m_can 0 | |
> | +--------+ | | +--------+ |
> +---------------------+ +----------------------+
>
> Steps to be run on the two Elkhartlake HW:
>
> 1. ip link set can0 type can bitrate 1000000
> 2. ip link set can0 txqueuelen 2048
> 3. ip link set can0 up
> 4. cangen -g 0 can0
> 5. candump can0
>
> cangen -g 0 can0 & candump can0 commands are used for transmit and
> receive on both the m_can HW simultaneously where -g is the frame gap
> between two frames.
>
> Signed-off-by: Kumari Pallavi <[email protected]>
> Signed-off-by: Srikanth Thokala <[email protected]>
> ---
> drivers/net/can/m_can/m_can.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> index a5003435802b..94aa0ba89202 100644
> --- a/drivers/net/can/m_can/m_can.c
> +++ b/drivers/net/can/m_can/m_can.c
> @@ -1118,7 +1118,7 @@ static irqreturn_t m_can_isr(int irq, void *dev_id)
> /* New TX FIFO Element arrived */
> if (m_can_echo_tx_event(dev) != 0)
> goto out_fail;
> -
nitpick: please keep that empty line.
> + m_can_write(cdev, M_CAN_IE, IR_ALL_INT & ~(IR_TEFN));
- What's the purpose of "()" around IR_TEFN?
- You enable a lot of interrupts that have not been enabled before. Have
a look at m_can_chip_config() how the original register value for
M_CAN_IE is calculated.
> if (netif_queue_stopped(dev) &&
> !m_can_tx_fifo_full(cdev))
> netif_wake_queue(dev);
> @@ -1787,6 +1787,7 @@ static netdev_tx_t m_can_start_xmit(struct sk_buff *skb,
> }
> } else {
> cdev->tx_skb = skb;
> + m_can_write(cdev, M_CAN_IE, IR_ALL_INT & (IR_TEFN));
- What's the purpose of "()" around IR_TEFN?
- "IR_ALL_INT & (IR_TEFN)" is equal to IR_TEFN, isn't it?
- This basically disables all other interrupts, is this what you want to
do?
- What happens if the bus is busy with high prio CAN frames and you want
to send low prio ones? You will not get any RX-IRQ, this doesn't look
correct to me.
> return m_can_tx_handler(cdev);
> }
>
> --
> 2.17.1
>
>
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung Nürnberg | Phone: +49-5121-206917-129 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |
Hi Marc,
Thank you for the review.
> -----Original Message-----
> From: Marc Kleine-Budde <[email protected]>
> Sent: Wednesday, July 5, 2023 5:47 PM
> To: Kumari Pallavi <[email protected]>
> Cc: [email protected]; Sangannavar, Mallikarjunappa
> <[email protected]>; Nikula, Jarkko
> <[email protected]>; [email protected];
> [email protected]; [email protected]; Thokala, Srikanth
> <[email protected]>
> Subject: Re: [RESEND] [PATCH 1/1] can: m_can: Control tx and rx flow to avoid
> communication stall
>
> On 23.06.2023 14:29:20, Kumari Pallavi wrote:
> > In bi-directional CAN transfer using M_CAN IP, with the frame gap
> > being set to '0', it leads to Protocol error in Arbitration phase
> > resulting in communication stall.
>
> Is there a (public) erratum describing the problem?
No, there is no any public erratum available. I found this issue during stress test.
>
> > Discussed with Bosch M_CAN IP team and the stall issue can only be
> > overcome by controlling the tx and rx packets flow as done by the
> > patch.
>
> Please elaborate the suggested workaround.
Sure.
>
> > Rx packets would also be serviced when there is a tx interrupt. The
> > solution has been tested extensively for more than 10 days, and no
> > issues has been observed.
>
> Can you describe how your patch implements the workaround?
>
> | Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
> | instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
> | to do frotz", as if you are giving orders to the codebase to change
> | its behaviour.
>
> See:
> https://github.com/torvalds/linux/blob/master/Documentation/process/submit
> ting-patches.rst#describe-your-changes
>
Sure, I will reformat in next version.
> > Setup that is used to reproduce the issue:
> >
> > +---------------------+ +----------------------+
> > |Intel ElkhartLake | |Intel ElkhartLake |
> > | +--------+ | | +--------+ |
> > | |m_can 0 | |<=======>| |m_can 0 | |
> > | +--------+ | | +--------+ |
> > +---------------------+ +----------------------+
> >
> > Steps to be run on the two Elkhartlake HW:
> >
> > 1. ip link set can0 type can bitrate 1000000 2. ip link set can0
> > txqueuelen 2048 3. ip link set can0 up 4. cangen -g 0 can0 5. candump
> > can0
> >
> > cangen -g 0 can0 & candump can0 commands are used for transmit and
> > receive on both the m_can HW simultaneously where -g is the frame gap
> > between two frames.
> >
> > Signed-off-by: Kumari Pallavi <[email protected]>
> > Signed-off-by: Srikanth Thokala <[email protected]>
> > ---
> > drivers/net/can/m_can/m_can.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/can/m_can/m_can.c
> > b/drivers/net/can/m_can/m_can.c index a5003435802b..94aa0ba89202
> > 100644
> > --- a/drivers/net/can/m_can/m_can.c
> > +++ b/drivers/net/can/m_can/m_can.c
> > @@ -1118,7 +1118,7 @@ static irqreturn_t m_can_isr(int irq, void *dev_id)
> > /* New TX FIFO Element arrived */
> > if (m_can_echo_tx_event(dev) != 0)
> > goto out_fail;
> > -
>
> nitpick: please keep that empty line.
Sure.
>
> > + m_can_write(cdev, M_CAN_IE, IR_ALL_INT &
> ~(IR_TEFN));
>
> - What's the purpose of "()" around IR_TEFN?
> - You enable a lot of interrupts that have not been enabled before. Have
> a look at m_can_chip_config() how the original register value for
> M_CAN_IE is calculated.
>
Sure, I will recheck and fix as needed.
> > if (netif_queue_stopped(dev) &&
> > !m_can_tx_fifo_full(cdev))
> > netif_wake_queue(dev);
> > @@ -1787,6 +1787,7 @@ static netdev_tx_t m_can_start_xmit(struct sk_buff
> *skb,
> > }
> > } else {
> > cdev->tx_skb = skb;
> > + m_can_write(cdev, M_CAN_IE, IR_ALL_INT & (IR_TEFN));
>
> - What's the purpose of "()" around IR_TEFN?
> - "IR_ALL_INT & (IR_TEFN)" is equal to IR_TEFN, isn't it?
> - This basically disables all other interrupts, is this what you want to
> do?
> - What happens if the bus is busy with high prio CAN frames and you want
> to send low prio ones? You will not get any RX-IRQ, this doesn't look
> correct to me.
>
Even though the RX interrupt is disabled (in IE), if there is an TX interrupt and the RF0N bit is set (in IR), the RX packet will still be serviced because the TX and RX share the same IRQ handler.
> > return m_can_tx_handler(cdev);
> > }
> >
> > --
> > 2.17.1
> >
> >
>
> Marc
>
> --
> Pengutronix e.K. | Marc Kleine-Budde |
> Embedded Linux | https://www.pengutronix.de |
> Vertretung Nürnberg | Phone: +49-5121-206917-129 |
> Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |
Thanks,
Pallavi
On 07.07.2023 05:38:09, Kumari Pallavi wrote:
> > > if (netif_queue_stopped(dev) &&
> > > !m_can_tx_fifo_full(cdev))
> > > netif_wake_queue(dev);
> > > @@ -1787,6 +1787,7 @@ static netdev_tx_t m_can_start_xmit(struct sk_buff
> > *skb,
> > > }
> > > } else {
> > > cdev->tx_skb = skb;
> > > + m_can_write(cdev, M_CAN_IE, IR_ALL_INT & (IR_TEFN));
> >
> > - What's the purpose of "()" around IR_TEFN?
> > - "IR_ALL_INT & (IR_TEFN)" is equal to IR_TEFN, isn't it?
> > - This basically disables all other interrupts, is this what you want to
> > do?
> > - What happens if the bus is busy with high prio CAN frames and you want
> > to send low prio ones? You will not get any RX-IRQ, this doesn't look
> > correct to me.
> >
>
> Even though the RX interrupt is disabled (in IE), if there is an TX
> interrupt and the RF0N bit is set (in IR), the RX packet will still be
> serviced because the TX and RX share the same IRQ handler.
If the bus is busy with high prio CAN frames and the m_can wants to send
a low prio frame, the m_can will not be able to send it's CAN frame,
there will be not TX interrupt. If there are enough high prio CAN frames
the RX buffer will overflow.
regards,
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung Nürnberg | Phone: +49-5121-206917-129 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |
Hi Marc,
> -----Original Message-----
> From: Marc Kleine-Budde <[email protected]>
> Sent: Friday, July 7, 2023 1:04 PM
> To: Kumari Pallavi <[email protected]>
> Cc: [email protected]; Sangannavar, Mallikarjunappa
> <[email protected]>; Nikula, Jarkko
> <[email protected]>; [email protected];
> [email protected]; [email protected]; Thokala, Srikanth
> <[email protected]>
> Subject: Re: RE: [RESEND] [PATCH 1/1] can: m_can: Control tx and rx flow to
> avoid communication stall
>
> On 07.07.2023 05:38:09, Kumari Pallavi wrote:
> > > > if (netif_queue_stopped(dev) &&
> > > > !m_can_tx_fifo_full(cdev))
> > > > netif_wake_queue(dev);
> > > > @@ -1787,6 +1787,7 @@ static netdev_tx_t m_can_start_xmit(struct
> > > > sk_buff
> > > *skb,
> > > > }
> > > > } else {
> > > > cdev->tx_skb = skb;
> > > > + m_can_write(cdev, M_CAN_IE, IR_ALL_INT & (IR_TEFN));
> > >
> > > - What's the purpose of "()" around IR_TEFN?
> > > - "IR_ALL_INT & (IR_TEFN)" is equal to IR_TEFN, isn't it?
> > > - This basically disables all other interrupts, is this what you want to
> > > do?
> > > - What happens if the bus is busy with high prio CAN frames and you want
> > > to send low prio ones? You will not get any RX-IRQ, this doesn't look
> > > correct to me.
> > >
> >
> > Even though the RX interrupt is disabled (in IE), if there is an TX
> > interrupt and the RF0N bit is set (in IR), the RX packet will still be
> > serviced because the TX and RX share the same IRQ handler.
>
> If the bus is busy with high prio CAN frames and the m_can wants to send a low
> prio frame, the m_can will not be able to send it's CAN frame, there will be not
> TX interrupt. If there are enough high prio CAN frames the RX buffer will
> overflow.
>
Sorry for late reply, I agree let me see if I can try to simulate this scenario using CAN
analyzer. I already stressed the current solution for more than 10 days and didn't
observe any issue. However, I will try to incorporate this scenario for stress as well and
come back.
Thanks,
Pallavi
> regards,
> Marc
>
> --
> Pengutronix e.K. | Marc Kleine-Budde |
> Embedded Linux | https://www.pengutronix.de |
> Vertretung Nürnberg | Phone: +49-5121-206917-129 |
> Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-9 |