2022-09-22 05:22:15

by Shivnandan Kumar

[permalink] [raw]
Subject: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"

Hi Christian,


Do you have any update or suggestion regarding thread
https://lore.kernel.org/lkml/20211105094310.GI6526@e120937-lin/T/#m07993053f6f238864acad4e9bad9f08d85aeb019.

We are still getting this issue and wanted to check if  there is any fix
that I can try.


Thanks,

Shivnandan


2022-09-22 15:14:18

by Cristian Marussi

[permalink] [raw]
Subject: Re: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"

On Thu, Sep 22, 2022 at 10:31:47AM +0530, Shivnandan Kumar wrote:
> Hi Christian,
>

Hi Shivnandan,

>
> Do you have any update or suggestion regarding thread https://lore.kernel.org/lkml/20211105094310.GI6526@e120937-lin/T/#m07993053f6f238864acad4e9bad9f08d85aeb019.
>
> We are still getting this issue and wanted to check if? there is any fix
> that I can try.
>

Sorry this issue fell to the bottom of my list in these past months...
... but it stil on TODO :D

So today I tried to get my head around this issue again (i.e. mainly
re-reading the above thread to remind me what was the status and wth I
had written... :P)

In summary the racy thing seemed to be caused by the a delayed late
SCMI Base reply happily served on one core by scmi_rx_callback operating
on some well-defined SCMI channel, while on another core we are effectively
shutting down the system and destroying such channels: now this should
be clearly NOT be possible and it is what we have to synchronize.

Looking at the transport layer that you use, mailbox, I see that while
setup/free helpers are synchronized on an internal chan->lock, the RX
path inside the mailbox core is not, so I tried this:


diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 4229b9b5da98..bb6173c0ad54 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -157,9 +157,13 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
*/
void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
{
+ unsigned long flags;
+
+ spin_lock_irqsave(&chan->lock, flags);
/* No buffering the received data */
if (chan->cl->rx_callback)
chan->cl->rx_callback(chan->cl, mssg);
+ spin_unlock_irqrestore(&chan->lock, flags);
}
EXPORT_SYMBOL_GPL(mbox_chan_received_data);


... can you try on your setup ? I dont have a way to easily reproduce
your race as of now...

NOTE THAT, I am not still convinced that the above fix, if it works, will
constitute the final solution to this issue, I could maybe move this same
kind of sync up into the SCMI transport layer to avoid to impact all other
users of the above mailbox interface (since, as of today, nobody has
reported any issue like ours due to the missing spinlock..)..but it
could be helpful to test the above to verify that this is really where
the root issue is.

Thanks,
Cristian

2022-09-22 15:39:08

by Cristian Marussi

[permalink] [raw]
Subject: Re: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"

On Thu, Sep 22, 2022 at 03:54:31PM +0100, Cristian Marussi wrote:
> On Thu, Sep 22, 2022 at 10:31:47AM +0530, Shivnandan Kumar wrote:
> > Hi Christian,
> >
>

Hi Shivnandan,

[snip]

> Looking at the transport layer that you use, mailbox, I see that while
> setup/free helpers are synchronized on an internal chan->lock, the RX
> path inside the mailbox core is not, so I tried this:
>
>
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 4229b9b5da98..bb6173c0ad54 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -157,9 +157,13 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
> */
> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
> {
> + unsigned long flags;
> +
> + spin_lock_irqsave(&chan->lock, flags);
> /* No buffering the received data */
> if (chan->cl->rx_callback)
> chan->cl->rx_callback(chan->cl, mssg);
> + spin_unlock_irqrestore(&chan->lock, flags);
> }
> EXPORT_SYMBOL_GPL(mbox_chan_received_data);
>

...sorry... a small change on the tentative above fix...

----8<------

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 4229b9b5da98..6fbe183acdae 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -157,9 +157,13 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
*/
void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
{
+ unsigned long flags;
+
+ spin_lock_irqsave(&chan->lock, flags);
/* No buffering the received data */
- if (chan->cl->rx_callback)
+ if (chan->cl && chan->cl->rx_callback)
chan->cl->rx_callback(chan->cl, mssg);
+ spin_unlock_irqrestore(&chan->lock, flags);
}
EXPORT_SYMBOL_GPL(mbox_chan_received_data);

------8<-----

Thanks,
Cristian

2022-09-30 13:42:51

by Shivnandan Kumar

[permalink] [raw]
Subject: Re: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"

hi Cristian,

Thanks for your support in providing the patch to try.

I found one race condition in our downstream mbox controller driver
while accessing con_priv, when I serialized access to this, issue is not
seen on 3 days of testing.

As you rightly mentioned that your provided patch will impact all the
other users.

Also if  we take your provided patch, same race still exists while
accessing con_priv in our downstream mbox controller so this issue will
still be there.

So, we are planning to merge the patch( serialized access to con_priv)
in our downstream mbox controller now.


Thanks,

Shivnandan


On 9/22/2022 8:58 PM, Cristian Marussi wrote:
> On Thu, Sep 22, 2022 at 03:54:31PM +0100, Cristian Marussi wrote:
>> On Thu, Sep 22, 2022 at 10:31:47AM +0530, Shivnandan Kumar wrote:
>>> Hi Cristian,
>>>
> Hi Shivnandan,
>
> [snip]
>
>> Looking at the transport layer that you use, mailbox, I see that while
>> setup/free helpers are synchronized on an internal chan->lock, the RX
>> path inside the mailbox core is not, so I tried this:
>>
>>
>> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
>> index 4229b9b5da98..bb6173c0ad54 100644
>> --- a/drivers/mailbox/mailbox.c
>> +++ b/drivers/mailbox/mailbox.c
>> @@ -157,9 +157,13 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
>> */
>> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
>> {
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&chan->lock, flags);
>> /* No buffering the received data */
>> if (chan->cl->rx_callback)
>> chan->cl->rx_callback(chan->cl, mssg);
>> + spin_unlock_irqrestore(&chan->lock, flags);
>> }
>> EXPORT_SYMBOL_GPL(mbox_chan_received_data);
>>
> ...sorry... a small change on the tentative above fix...
>
> ----8<------
>
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 4229b9b5da98..6fbe183acdae 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -157,9 +157,13 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
> */
> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
> {
> + unsigned long flags;
> +
> + spin_lock_irqsave(&chan->lock, flags);
> /* No buffering the received data */
> - if (chan->cl->rx_callback)
> + if (chan->cl && chan->cl->rx_callback)
> chan->cl->rx_callback(chan->cl, mssg);
> + spin_unlock_irqrestore(&chan->lock, flags);
> }
> EXPORT_SYMBOL_GPL(mbox_chan_received_data);
>
> ------8<-----
>
> Thanks,
> Cristian
>

2022-10-03 13:48:06

by Cristian Marussi

[permalink] [raw]
Subject: Re: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"

On Fri, Sep 30, 2022 at 06:29:02PM +0530, Shivnandan Kumar wrote:
> hi Cristian,

Hi Shivnandan,
>
> Thanks for your support in providing the patch to try.
>
> I found one race condition in our downstream mbox controller driver while
> accessing con_priv, when I serialized access to this, issue is not seen on 3
> days of testing.

Good to hear that you find the issue.

>
> As you rightly mentioned that your provided patch will impact all the other
> users.
>
> Also if? we take your provided patch, same race still exists while accessing
> con_priv in our downstream mbox controller so this issue will still be
> there.
>

Yes indeed, even though I think that race in the mailbox core between RX path
and chan_free could still be theoretically possible it does not seem to me
appropriate to try to fix it now that you cannot reproduce it anymore and
no other mailbox user has ever raised this concern (even though, as said, the
proper solution to that race wont probably be directly in the mailbox-core as
in my experimental two liners..)

> So, we are planning to merge the patch( serialized access to con_priv) in
> our downstream mbox controller now.
>

Ok, just out of curiosity, once done, can you point me at your downstream public
sources so I can see the issue and the fix that you are applying to your trees ?

Thanks,
Cristian

2022-10-11 10:06:26

by Shivnandan Kumar

[permalink] [raw]
Subject: Re: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"


Hi Cristian,

>>Ok, just out of curiosity, once done, can you point me at your
downstream public sources so I can see the issue and the fix that you
are applying to your trees ?

https://source.codeaurora.org/quic/la/kernel/msm-5.10/tree/drivers/soc/qcom/qcom_rimps.c?h=KERNEL.PLATFORM.1.0.r1-07800-kernel.0

I have added lock while accessing con_priv inside irq handler and
shutdown function.


I have one input regarding timeout from firmware, can we enable BUG on
response  time out in function do_xfer based on some debug config
flag,this will help to debug firmware timeout issue faster.

We will only enable that config flag during internal testing.


Thanks,

Shivnandan

On 10/3/2022 6:52 PM, Cristian Marussi wrote:
> On Fri, Sep 30, 2022 at 06:29:02PM +0530, Shivnandan Kumar wrote:
>> hi Cristian,
> Hi Shivnandan,
>> Thanks for your support in providing the patch to try.
>>
>> I found one race condition in our downstream mbox controller driver while
>> accessing con_priv, when I serialized access to this, issue is not seen on 3
>> days of testing.
> Good to hear that you find the issue.
>
>> As you rightly mentioned that your provided patch will impact all the other
>> users.
>>
>> Also if  we take your provided patch, same race still exists while accessing
>> con_priv in our downstream mbox controller so this issue will still be
>> there.
>>
> Yes indeed, even though I think that race in the mailbox core between RX path
> and chan_free could still be theoretically possible it does not seem to me
> appropriate to try to fix it now that you cannot reproduce it anymore and
> no other mailbox user has ever raised this concern (even though, as said, the
> proper solution to that race wont probably be directly in the mailbox-core as
> in my experimental two liners..)
>
>> So, we are planning to merge the patch( serialized access to con_priv) in
>> our downstream mbox controller now.
>>
> Ok, just out of curiosity, once done, can you point me at your downstream public
> sources so I can see the issue and the fix that you are applying to your trees ?
>
> Thanks,
> Cristian

2022-10-11 10:48:43

by Cristian Marussi

[permalink] [raw]
Subject: Re: Query regarding "firmware: arm_scmi: Free mailbox channels if probe fails"

On Tue, Oct 11, 2022 at 03:34:45PM +0530, Shivnandan Kumar wrote:
>
> Hi Cristian,
>

Hi Shivnandan,

> >>Ok, just out of curiosity, once done, can you point me at your downstream
> public sources so I can see the issue and the fix that you are applying to
> your trees ?
>
> https://source.codeaurora.org/quic/la/kernel/msm-5.10/tree/drivers/soc/qcom/qcom_rimps.c?h=KERNEL.PLATFORM.1.0.r1-07800-kernel.0
>
> I have added lock while accessing con_priv inside irq handler and shutdown
> function.
>

Thanks !

>
> I have one input regarding timeout from firmware, can we enable BUG on
> response? time out in function do_xfer based on some debug config flag,this
> will help to debug firmware timeout issue faster.
>
> We will only enable that config flag during internal testing.
>

I understand a sort of 'Panic-on-timeout' would be useful to just freeze
the system as it is and debug, but it seems to me pretty much invasive
(and generally frowned upon) to BUG_ON timeouts, given on some SCMI
platforms/transports a few timeouts can happen really not so infrequently
due to transient conditions during moments of peak SCMI traffic.

Even though you mention to make it conditional to Kconfig, I'm not sure
this could fly, especially if you want to enable only for internal
testing...I'll ping Sudeep about this to see what he thinks.

As an alternative, what if I try to improve SCMI tracing/debug, let's say
dumping more info in dmesg about the offending (timed-out) message instead
of hanging the system as a whole ?

I'd have also some still-brewing-and-not-published patches to add some
SCMI stats somewhere in sysfs to be able to read current SCMI errors/timeouts
and transport anomalies, would that be of interest ?

...maybe, we could combine some of these stats and some sort of
BUG_ON/WARN_ON (if it will fly eventually..) into some kind SCMI_DEBUG mode
...any input on your needs about which kind of SCMI info you'll like to see
exposed by the stack would be welcome.

Last but not least, since we are talking about SCMI Server/FW testing,
have you (or your team) seen this work-in-progress of mine:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

about a new unified userspace interface to inject/snoop SCMI messages to
test/fuzz/stress the SCMI server wherever it is placed ?

Any feedback on the API proprosed in the cover-letter would be highly welcome;
I'll post a new V4 next week possibly, and the changes to the existing ARM SCMI
Compliance suite (mentioned in the cover) to support this new SCMI Raw
mode are in their final stage too.

Thanks,
Cristian