LinuxLists.cc - [PATCH 0/2] mailbox: Fix error handling in mbox_send

2022-09-15 17:22:59

Subject: [PATCH 0/2] mailbox: Fix error handling in mbox_send_message()

This series fixes a couple issues I found while experimenting with mailbox device drivers:

1. If one tries to send a message via the mailbox using mbox_send_message(), and the controller driver for the mailbox device returns any error, mbox_send_message() returns -ETIME when in blocking mode. This is confusing at least.

2. If the mbox controller driver fails to send a message, mbox_send_message() needlessly waits for the message to be sent. That ends when the waiting timeout has been expired. However, the pointer to the message remains in the queue of the mbox framework (chan->msg_data) and the framework will try to process it over and over again, when one tries to send other messages. If the driver rejects the message each time because it is invalid (some drivers do that), this will prevent anyone from using that mailbox channel.

Evgenii Shatokhin (2):
mailbox: Propagate errors from .send_data() callback to
mbox_send_message()
mailbox: Error out early if the mbox driver has failed to submit the
message

drivers/mailbox/mailbox.c | 30 ++++++++++++++++++++----------
1 file changed, 20 insertions(+), 10 deletions(-)

--
2.34.1

2022-09-15 17:40:58

by Evgenii Shatokhin

[permalink] [raw]

Subject: [PATCH 2/2] mailbox: Error out early if the mbox driver has failed to submit the message

mbox_send_message() places the pointer to the message to the queue
(add_to_rbuf) then calls msg_submit(chan) to submit the first of the
queued messaged to the mailbox. Some of mailbox drivers can return
errors from their .send_data() callbacks, e.g., if the message is
invalid or there is something wrong with the mailbox device.

In this case:
* hrtimer is not started by msg_submit();
* the pointer to that message remains in the queue;
* if mbox_send_message() is called in blocking mode, it will needlessly
wait for tx_tout ms (or for an hour if tx_out is not set), then it will
call tx_tick(chan, -ETIME).

tx_tick() will then try to submit the first message in the queue - the
same message as before. The underlying driver might reject it again.

The problematic message could then remain in the queue forever, which would
prevent the system from sending other, maybe unrelated messages via the
same mailbox channel. Moreover, the caller would be unable to free or reuse
the message structure.

Let us remove the message from the queue and error out from
mbox_send_message() early if sending of the message fails.

As for tx_tick() - not sure, if chan->cl->tx_done() should still be
called if msg_submit(chan) reports an error. For now, tx_tick() will
exit early too.

Signed-off-by: Evgenii Shatokhin <[email protected]>
---
drivers/mailbox/mailbox.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 04db5ef58f93..32d9ba05427e 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -82,10 +82,9 @@ static int msg_submit(struct mbox_chan *chan)
chan->cl->tx_prepare(chan->cl, data);
/* Try to submit a message to the MBOX controller */
err = chan->mbox->ops->send_data(chan, data);
- if (!err) {
+ chan->msg_count--;
+ if (!err)
chan->active_req = data;
- chan->msg_count--;
- }
exit:
spin_unlock_irqrestore(&chan->lock, flags);

@@ -102,6 +101,7 @@ static void tx_tick(struct mbox_chan *chan, int r)
{
unsigned long flags;
void *mssg;
+ int err;

spin_lock_irqsave(&chan->lock, flags);
mssg = chan->active_req;
@@ -109,9 +109,8 @@ static void tx_tick(struct mbox_chan *chan, int r)
spin_unlock_irqrestore(&chan->lock, flags);

/* Submit next message */
- msg_submit(chan);
-
- if (!mssg)
+ err = msg_submit(chan);
+ if (!mssg || err)
return;

/* Notify the client */
@@ -276,6 +275,8 @@ int mbox_send_message(struct mbox_chan *chan, void *mssg)
}

err = msg_submit(chan);
+ if (err)
+ return err;

if (chan->cl->tx_block) {
unsigned long wait;
@@ -293,7 +294,7 @@ int mbox_send_message(struct mbox_chan *chan, void *mssg)
}
}

- return (err < 0) ? err : t;
+ return t;
}
EXPORT_SYMBOL_GPL(mbox_send_message);

--
2.34.1

2022-09-16 17:20:46

by Jassi Brar

[permalink] [raw]

Subject: Re: [PATCH 2/2] mailbox: Error out early if the mbox driver has failed to submit the message

On Thu, Sep 15, 2022 at 11:50 AM Evgenii Shatokhin
<[email protected]> wrote:
>
> mbox_send_message() places the pointer to the message to the queue
> (add_to_rbuf) then calls msg_submit(chan) to submit the first of the
> queued messaged to the mailbox. Some of mailbox drivers can return
> errors from their .send_data() callbacks, e.g., if the message is
> invalid or there is something wrong with the mailbox device.
>
The message can't be invalid because the client code is written for a
particular provider.

Though it is possible for the mailbox controller to break down for
some reason. In that case, the blocking api will keep retrying until
successful. But ideally the client, upon getting -ETIME, should free()
and request() the channel reset it (because controller drivers usually
don't contain the logic to automatically reset upon some error).

thanks.

2022-09-18 16:40:40

by Evgenii Shatokhin

[permalink] [raw]

Subject: Re: [PATCH 2/2] mailbox: Error out early if the mbox driver has failed to submit the message

Thank you for a quick reply!

On 16.09.2022 20:04, Jassi Brar wrote:
> On Thu, Sep 15, 2022 at 11:50 AM Evgenii Shatokhin
> <[email protected]> wrote:
>>
>> mbox_send_message() places the pointer to the message to the queue
>> (add_to_rbuf) then calls msg_submit(chan) to submit the first of the
>> queued messaged to the mailbox. Some of mailbox drivers can return
>> errors from their .send_data() callbacks, e.g., if the message is
>> invalid or there is something wrong with the mailbox device.
>>
> The message can't be invalid because the client code is written for a
> particular provider.

As of mainline kernel v6.0-rc5, there are mailbox controller drivers
which check if the messages are valid in their .send_data() callbacks.
Example:

drivers/mailbox/rockchip-mailbox.c, rockchip_mbox_send_data():
if (msg->rx_size > mb->buf_size) {
dev_err(mb->mbox.dev, "Transmit size over buf size(%d)\n",
mb->buf_size);
return -EINVAL;
}

Other examples are zynqmp_ipi_send_data() from
drivers/mailbox/zynqmp-ipi-mailbox.c, ti_msgmgr_send_data() from
drivers/mailbox/ti-msgmgr.c, etc.

If this is incorrect and the controller drivers should not do such
things, I'd suggest to clearly state it in the docs, because it is far
from obvious from Documentation/driver-api/mailbox.rst at the moment.

That is, one could state that checking if the messages to be transmitted
are valid is a responsibility of the callers of mailbox API rather than
of the controller driver.

I could prepare such patch for the docs. Objections?

>
> Though it is possible for the mailbox controller to break down for
> some reason. In that case, the blocking api will keep retrying until
> successful.

As far as I can see from the code, the behaviour seems to be different.

mbox_send_message() calls msg_submit() to send the message the first
time. If that fails, hrtimer is not armed, so there will be no attempts
to send the message again till tx_out ms pass:

err = chan->mbox->ops->send_data(chan, data);
if (!err) {
chan->active_req = data;
chan->msg_count--;
}
exit:
spin_unlock_irqrestore(&chan->lock, flags);

if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
/* kick start the timer immediately to avoid delays */
spin_lock_irqsave(&chan->mbox->poll_hrt_lock, flags);
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
spin_unlock_irqrestore(&chan->mbox->poll_hrt_lock, flags);
}

This is from msg_submit(). Thus, the hrtimer will not fire, tx_tick()
will not be called until tx_out ms have passed, and no attempts to send
the message again will be made here.

In addition, complete(&chan->tx_complete) will not be called, so,
mbox_send_message() will have to needlessly wait whole tx_out ms.
Only after that, it will call tx_tick(chan, -ETIME), which will, in
turn, call msg_submit() to try to send the message again. If the mbox
has not recovered, sending will fail again.

Then, mbox_send_message() will exit with -ETIME. The pointer to the
message will remain in chan->msg_data[], but the framework will not
attempt to send it again until the client calls mbox_send_message() for
another, possibly unrelated message.

In this case, to sum up, mbox_send_message():
* needlessly waits for tx_out ms;
* only tries to send the message twice rather than makes retries until
successful;
* does not inform the client about the actual error happened, just
returns -ETIME;
* keeps the pointer to the message in chan->msg_data[], which is too
easy to overlook on the client side. Too easy for the client to, say,
reuse the structure and cause trouble.

What I suggest is to leave it to the client (or some other
provider-specific code using the client) what to do with the failures.

If the error is reported by the controller driver, don't wait in
mbox_send_message(), just pass the error to the client and exit. If the
client decides to ignore the error - OK, its problem. Or - it may kick
the mbox device somehow in a provider-specific way to make it work, or -
reset the channel, or - do anything else to make things work again.

The behaviour of mbox_send_message() would then become more consistent:
either it has sent the message successfully or it failed and returned an
error, without side-effects (like the pointer to that message kept in
the internal buffer).

I do not think this change would break the existing controller drivers
and client drivers.

What do you think?

>But ideally the client, upon getting -ETIME, should free()
> and request() the channel reset it (because controller drivers usually
> don't contain the logic to automatically reset upon some error).
>
> thanks.

Regards,
Evgenii

2022-10-05 12:27:44

by Evgenii Shatokhin

[permalink] [raw]

Subject: Re: [PATCH 2/2] mailbox: Error out early if the mbox driver has failed to submit the message

Hi,

On 18.09.2022 19:32, Evgeny Shatokhin wrote:
> Thank you for a quick reply!
>
> On 16.09.2022 20:04, Jassi Brar wrote:
>> On Thu, Sep 15, 2022 at 11:50 AM Evgenii Shatokhin
>> <[email protected]> wrote:
>>>
>>> mbox_send_message() places the pointer to the message to the queue
>>> (add_to_rbuf) then calls msg_submit(chan) to submit the first of the
>>> queued messaged to the mailbox. Some of mailbox drivers can return
>>> errors from their .send_data() callbacks, e.g., if the message is
>>> invalid or there is something wrong with the mailbox device.
>>>
>> The message can't be invalid because the client code is written for a
>> particular provider.
>
> As of mainline kernel v6.0-rc5, there are mailbox controller drivers
> which check if the messages are valid in their .send_data() callbacks.
> Example:
>
> drivers/mailbox/rockchip-mailbox.c, rockchip_mbox_send_data():
>     if (msg->rx_size > mb->buf_size) {
>         dev_err(mb->mbox.dev, "Transmit size over buf size(%d)\n",
>             mb->buf_size);
>         return -EINVAL;
>     }
>
> Other examples are zynqmp_ipi_send_data() from
> drivers/mailbox/zynqmp-ipi-mailbox.c, ti_msgmgr_send_data() from
> drivers/mailbox/ti-msgmgr.c, etc.
>
> If this is incorrect and the controller drivers should not do such
> things, I'd suggest to clearly state it in the docs, because it is far
> from obvious from Documentation/driver-api/mailbox.rst at the moment.
>
> That is, one could state that checking if the messages to be transmitted
> are valid is a responsibility of the callers of mailbox API rather than
> of the controller driver.
>
> I could prepare such patch for the docs. Objections?
>
>>
>> Though it is possible for the mailbox controller to break down for
>> some reason. In that case, the blocking api will keep retrying until
>> successful.
>
> As far as I can see from the code, the behaviour seems to be different.
>
> mbox_send_message() calls msg_submit() to send the message the first
> time. If that fails, hrtimer is not armed, so there will be no attempts
> to send the message again till tx_out ms pass:
>
>     err = chan->mbox->ops->send_data(chan, data);
>     if (!err) {
>         chan->active_req = data;
>         chan->msg_count--;
>     }
> exit:
>     spin_unlock_irqrestore(&chan->lock, flags);
>
>     if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
>         /* kick start the timer immediately to avoid delays */
>         spin_lock_irqsave(&chan->mbox->poll_hrt_lock, flags);
>         hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
>         spin_unlock_irqrestore(&chan->mbox->poll_hrt_lock, flags);
>     }
>
> This is from msg_submit(). Thus, the hrtimer will not fire, tx_tick()
> will not be called until tx_out ms have passed, and no attempts to send
> the message again will be made here.
>
> In addition, complete(&chan->tx_complete) will not be called, so,
> mbox_send_message() will have to needlessly wait whole tx_out ms.
> Only after that, it will call tx_tick(chan, -ETIME), which will, in
> turn, call msg_submit() to try to send the message again. If the mbox
> has not recovered, sending will fail again.
>
> Then, mbox_send_message() will exit with -ETIME. The pointer to the
> message will remain in chan->msg_data[], but the framework will not
> attempt to send it again until the client calls mbox_send_message() for
> another, possibly unrelated message.
>
> In this case, to sum up, mbox_send_message():
> * needlessly waits for tx_out ms;
> * only tries to send the message twice rather than makes retries until
> successful;
> * does not inform the client about the actual error happened, just
> returns -ETIME;
> * keeps the pointer to the message in chan->msg_data[], which is too
> easy to overlook on the client side. Too easy for the client to, say,
> reuse the structure and cause trouble.
>
> What I suggest is to leave it to the client (or some other
> provider-specific code using the client) what to do with the failures.
>
> If the error is reported by the controller driver, don't wait in
> mbox_send_message(), just pass the error to the client and exit. If the
> client decides to ignore the error - OK, its problem. Or - it may kick
> the mbox device somehow in a provider-specific way to make it work, or -
> reset the channel, or - do anything else to make things work again.
>
> The behaviour of mbox_send_message() would then become more consistent:
> either it has sent the message successfully or it failed and returned an
> error, without side-effects (like the pointer to that message kept in
> the internal buffer).
>
> I do not think this change would break the existing controller drivers
> and client drivers.
>
> What do you think?
>
>> But ideally the client, upon getting -ETIME, should free()
>> and request() the channel reset it (because controller drivers usually
>> don't contain the logic to automatically reset upon some error).
>>
>> thanks.

Any updates on this?
Looking forward to your comments.

Regards,
Evgenii