LinuxLists.cc - [PATCH] mailbox: forward the hrtimer if not queued and under a lock

2022-03-31 09:21:31

Subject: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
"mailbox: avoid timer start from callback"

The previous commit was reverted since it lead to a race that
caused the hrtimer to not be started at all. The check for
hrtimer_active() in msg_submit() will return true if the
callback function txdone_hrtimer() is currently running. This
function could return HRTIMER_NORESTART and then the timer
will not be restarted, and also msg_submit() will not start
the timer. This will lead to a message actually being submitted
but no timer will start to check for its compleation.

The original fix that added checking hrtimer_active() was added to
avoid a warning with hrtimer_forward. Looking in the kernel
another solution to avoid this warning is to check hrtimer_is_queued()
before calling hrtimer_forward_now() instead. This however requires a
lock so the timer is not started by msg_submit() inbetween this check
and the hrtimer_forward() call.

Signed-off-by: Björn Ardö <[email protected]>
---
drivers/mailbox/mailbox.c | 19 +++++++++++++------
include/linux/mailbox_controller.h | 1 +
2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 3e7d4b20ab34..4229b9b5da98 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -82,11 +82,11 @@ static void msg_submit(struct mbox_chan *chan)
exit:
spin_unlock_irqrestore(&chan->lock, flags);

- /* kick start the timer immediately to avoid delays */
if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
- /* but only if not already active */
- if (!hrtimer_active(&chan->mbox->poll_hrt))
- hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+ /* kick start the timer immediately to avoid delays */
+ spin_lock_irqsave(&chan->mbox->poll_hrt_lock, flags);
+ hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+ spin_unlock_irqrestore(&chan->mbox->poll_hrt_lock, flags);
}
}

@@ -120,20 +120,26 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
container_of(hrtimer, struct mbox_controller, poll_hrt);
bool txdone, resched = false;
int i;
+ unsigned long flags;

for (i = 0; i < mbox->num_chans; i++) {
struct mbox_chan *chan = &mbox->chans[i];

if (chan->active_req && chan->cl) {
- resched = true;
txdone = chan->mbox->ops->last_tx_done(chan);
if (txdone)
tx_tick(chan, 0);
+ else
+ resched = true;
}
}

if (resched) {
- hrtimer_forward_now(hrtimer, ms_to_ktime(mbox->txpoll_period));
+ spin_lock_irqsave(&mbox->poll_hrt_lock, flags);
+ if (!hrtimer_is_queued(hrtimer))
+ hrtimer_forward_now(hrtimer, ms_to_ktime(mbox->txpoll_period));
+ spin_unlock_irqrestore(&mbox->poll_hrt_lock, flags);
+
return HRTIMER_RESTART;
}
return HRTIMER_NORESTART;
@@ -500,6 +506,7 @@ int mbox_controller_register(struct mbox_controller *mbox)
hrtimer_init(&mbox->poll_hrt, CLOCK_MONOTONIC,
HRTIMER_MODE_REL);
mbox->poll_hrt.function = txdone_hrtimer;
+ spin_lock_init(&mbox->poll_hrt_lock);
}

for (i = 0; i < mbox->num_chans; i++) {
diff --git a/include/linux/mailbox_controller.h b/include/linux/mailbox_controller.h
index 36d6ce673503..6fee33cb52f5 100644
--- a/include/linux/mailbox_controller.h
+++ b/include/linux/mailbox_controller.h
@@ -83,6 +83,7 @@ struct mbox_controller {
const struct of_phandle_args *sp);
/* Internal to API */
struct hrtimer poll_hrt;
+ spinlock_t poll_hrt_lock;
struct list_head node;
};

--
2.20.1

2022-04-16 02:30:40

by Jassi Brar

[permalink] [raw]

Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

On Thu, Mar 31, 2022 at 2:01 AM Björn Ardö <[email protected]> wrote:
>
> This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
> "mailbox: avoid timer start from callback"
>
> The previous commit was reverted since it lead to a race that
> caused the hrtimer to not be started at all. The check for
> hrtimer_active() in msg_submit() will return true if the
> callback function txdone_hrtimer() is currently running. This
> function could return HRTIMER_NORESTART and then the timer
> will not be restarted, and also msg_submit() will not start
> the timer. This will lead to a message actually being submitted
> but no timer will start to check for its compleation.
>
> The original fix that added checking hrtimer_active() was added to
> avoid a warning with hrtimer_forward. Looking in the kernel
> another solution to avoid this warning is to check hrtimer_is_queued()
> before calling hrtimer_forward_now() instead. This however requires a
> lock so the timer is not started by msg_submit() inbetween this check
> and the hrtimer_forward() call.
>
This is a very dense api used by many use-cases, I am not confident
making any changes without confirming its a real issue with the common
code. Please share your client code and traces, that will help me get
a clearer picture.

Thanks.

2022-04-22 18:17:57

by Jassi Brar

[permalink] [raw]

Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

On Tue, Apr 19, 2022 at 7:15 AM Bjorn Ardo <[email protected]> wrote:
>
> Hi,
>
>
> I can confirm that this is an actual issue found on our system, not just
> a theoretical case.
>
>
> If I add the following trace-code to the original code:
>
>
> diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
> index 3e7d4b20ab34..8e9e82e5f4b1 100644
> --- a/drivers/mailbox/mailbox.c
> +++ b/drivers/mailbox/mailbox.c
> @@ -57,6 +57,7 @@ static void msg_submit(struct mbox_chan *chan)
> void *data;
> int err = -EBUSY;
>
> + trace_printk("Entering msg_submit\n");
> spin_lock_irqsave(&chan->lock, flags);
>
> if (!chan->msg_count || chan->active_req)
> @@ -85,9 +86,14 @@ static void msg_submit(struct mbox_chan *chan)
> /* kick start the timer immediately to avoid delays */
> if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
> /* but only if not already active */
> - if (!hrtimer_active(&chan->mbox->poll_hrt))
> + if (!hrtimer_active(&chan->mbox->poll_hrt)) {
> + trace_printk("Starting the hr timer from
> submit\n");
> hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
> + } else {
> + trace_printk("Not starting the hr timer from
> submit since it is active\n");
> + }
> }
> + trace_printk("Leaving msg_submit\n");
> }
>
> static void tx_tick(struct mbox_chan *chan, int r)
> @@ -121,6 +127,7 @@ static enum hrtimer_restart txdone_hrtimer(struct
> hrtimer *hrtimer)
> bool txdone, resched = false;
> int i;
>
> + trace_printk("Entering txdone_hrtimer\n");
> for (i = 0; i < mbox->num_chans; i++) {
> struct mbox_chan *chan = &mbox->chans[i];
>
> @@ -134,8 +141,10 @@ static enum hrtimer_restart txdone_hrtimer(struct
> hrtimer *hrtimer)
>
> if (resched) {
> hrtimer_forward_now(hrtimer,
> ms_to_ktime(mbox->txpoll_period));
> + trace_printk("Leaving txdone_hrtimer with restart\n");
> return HRTIMER_RESTART;
> }
> + trace_printk("Leaving txdone_hrtimer without restart\n");
> return HRTIMER_NORESTART;
> }
>
> Then I get the following trace output (I have cropped a small portion
> around where the error appears):
>
>
> vhost-475-480 [000] d..1. 217.440325: msg_submit: Entering
> msg_submit
> vhost-475-480 [000] d..1. 217.440332: msg_submit: Starting
> the hr timer from submit
> vhost-475-480 [000] d..1. 217.440336: msg_submit: Leaving
> msg_submit
> vhost-475-480 [000] d.h1. 217.440342: txdone_hrtimer:
> Entering txdone_hrtimer
> vhost-475-480 [000] d.h1. 217.440349: txdone_hrtimer:
> Leaving txdone_hrtimer without restart
> vhost-475-480 [000] d..1. 217.440597: msg_submit: Entering
> msg_submit
> vhost-475-480 [000] d..1. 217.440599: msg_submit: Starting
> the hr timer from submit
> vhost-475-480 [000] d..1. 217.440602: msg_submit: Leaving
> msg_submit
> <idle>-0 [001] ..s1. 217.440604: msg_submit: Entering
> msg_submit
> vhost-475-480 [000] d.h1. 217.440606: txdone_hrtimer:
> Entering txdone_hrtimer
> vhost-475-480 [000] d.h1. 217.440608: txdone_hrtimer:
> Leaving txdone_hrtimer without restart
> <idle>-0 [001] ..s1. 217.440609: msg_submit: Not
> starting the hr timer from submit since it is active
> <idle>-0 [001] ..s1. 217.440610: msg_submit: Leaving
> msg_submit
>
>
> If I break down the log above we first have one case that works as
> intended. That is a message being written and a timer started and the
> message have been read when the timer hits:
>
> vhost-475-480 [000] d..1. 217.440325: msg_submit: Entering
> msg_submit
> vhost-475-480 [000] d..1. 217.440332: msg_submit: Starting
> the hr timer from submit
> vhost-475-480 [000] d..1. 217.440336: msg_submit: Leaving
> msg_submit
> vhost-475-480 [000] d.h1. 217.440342: txdone_hrtimer:
> Entering txdone_hrtimer
> vhost-475-480 [000] d.h1. 217.440349: txdone_hrtimer:
> Leaving txdone_hrtimer without restart
>
>
> After this we write a new message and a new timer is started:
>
> vhost-475-480 [000] d..1. 217.440597: msg_submit: Entering
> msg_submit
> vhost-475-480 [000] d..1. 217.440599: msg_submit: Starting
> the hr timer from submit
> vhost-475-480 [000] d..1. 217.440602: msg_submit: Leaving
> msg_submit
>
>
> However here comes the race. Now a new message is being written at the
> same time as the hr-timer is handling the first reply (on different CPU's):
>
> <idle>-0 [001] ..s1. 217.440604: msg_submit: Entering
> msg_submit
> vhost-475-480 [000] d.h1. 217.440606: txdone_hrtimer:
> Entering txdone_hrtimer
>
I don't have access to your client driver, but if it submits another
message from rx_callback() that is the problem.

Please have a look at how other platforms do, for example
imx_dsp_rproc_rx_tx_callback()

/**
* mbox_chan_received_data - A way for controller driver to push data
* received from remote to the upper layer.
* @chan: Pointer to the mailbox channel on which RX happened.
* @mssg: Client specific message typecasted as void *
*
* After startup and before shutdown any data received on the chan
* is passed on to the API via atomic mbox_chan_received_data().
* The controller should ACK the RX only after this call returns.
*/
void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)

If not this, please share your client code as well.

thanks.

2022-04-22 18:19:40

by Bjorn Ardo

[permalink] [raw]

Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

Hi,

I can confirm that this is an actual issue found on our system, not just
a theoretical case.

If I add the following trace-code to the original code:

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 3e7d4b20ab34..8e9e82e5f4b1 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -57,6 +57,7 @@ static void msg_submit(struct mbox_chan *chan)
        void *data;
        int err = -EBUSY;

+       trace_printk("Entering msg_submit\n");
        spin_lock_irqsave(&chan->lock, flags);

        if (!chan->msg_count || chan->active_req)
@@ -85,9 +86,14 @@ static void msg_submit(struct mbox_chan *chan)
        /* kick start the timer immediately to avoid delays */
        if (!err && (chan->txdone_method & TXDONE_BY_POLL)) {
                /* but only if not already active */
-               if (!hrtimer_active(&chan->mbox->poll_hrt))
+               if (!hrtimer_active(&chan->mbox->poll_hrt)) {
+                       trace_printk("Starting the hr timer from
submit\n");
hrtimer_start(&chan->mbox->poll_hrt, 0, HRTIMER_MODE_REL);
+               } else {
+                       trace_printk("Not starting the hr timer from
submit since it is active\n");
+               }
        }
+       trace_printk("Leaving msg_submit\n");
}

static void tx_tick(struct mbox_chan *chan, int r)
@@ -121,6 +127,7 @@ static enum hrtimer_restart txdone_hrtimer(struct
hrtimer *hrtimer)
        bool txdone, resched = false;
        int i;

+       trace_printk("Entering txdone_hrtimer\n");
        for (i = 0; i < mbox->num_chans; i++) {
                struct mbox_chan *chan = &mbox->chans[i];

@@ -134,8 +141,10 @@ static enum hrtimer_restart txdone_hrtimer(struct
hrtimer *hrtimer)

        if (resched) {
                hrtimer_forward_now(hrtimer,
ms_to_ktime(mbox->txpoll_period));
+               trace_printk("Leaving txdone_hrtimer with restart\n");
                return HRTIMER_RESTART;
        }
+       trace_printk("Leaving txdone_hrtimer without restart\n");
        return HRTIMER_NORESTART;
}

Then I get the following trace output (I have cropped a small portion
around where the error appears):

       vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering
msg_submit
       vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting
the hr timer from submit
       vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving
msg_submit
       vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer:
Entering txdone_hrtimer
       vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer:
Leaving txdone_hrtimer without restart
       vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering
msg_submit
       vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting
the hr timer from submit
       vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving
msg_submit
          <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering
msg_submit
       vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer:
Entering txdone_hrtimer
       vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer:
Leaving txdone_hrtimer without restart
          <idle>-0       [001] ..s1.   217.440609: msg_submit: Not
starting the hr timer from submit since it is active
          <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving
msg_submit

If I break down the log above we first have one case that works as
intended. That is a message being written and a timer started and the
message have been read when the timer hits:

       vhost-475-480     [000] d..1.   217.440325: msg_submit: Entering
msg_submit
       vhost-475-480     [000] d..1.   217.440332: msg_submit: Starting
the hr timer from submit
       vhost-475-480     [000] d..1.   217.440336: msg_submit: Leaving
msg_submit
       vhost-475-480     [000] d.h1.   217.440342: txdone_hrtimer:
Entering txdone_hrtimer
       vhost-475-480     [000] d.h1.   217.440349: txdone_hrtimer:
Leaving txdone_hrtimer without restart

After this we write a new message and a new timer is started:

       vhost-475-480     [000] d..1.   217.440597: msg_submit: Entering
msg_submit
       vhost-475-480     [000] d..1.   217.440599: msg_submit: Starting
the hr timer from submit
       vhost-475-480     [000] d..1.   217.440602: msg_submit: Leaving
msg_submit

However here comes the race. Now a new message is being written at the
same time as the hr-timer is handling the first reply (on different CPU's):

          <idle>-0       [001] ..s1.   217.440604: msg_submit: Entering
msg_submit
       vhost-475-480     [000] d.h1.   217.440606: txdone_hrtimer:
Entering txdone_hrtimer

And the hr-timer decides not to restart itself:

       vhost-475-480     [000] d.h1.   217.440608: txdone_hrtimer:
Leaving txdone_hrtimer without restart

And also the new message written decides to not start the timer (even
thou it requires one), since it has the status of being active during
the time that the hr-timer callback is running:

          <idle>-0       [001] ..s1.   217.440609: msg_submit: Not
starting the hr timer from submit since it is active
          <idle>-0       [001] ..s1.   217.440610: msg_submit: Leaving
msg_submit

So now we end up in a state where we have a message waiting for a reply
but no timer started that can provide this reply.

Our own mailbox code is attached (it is not upstreamed yet, but it is on
its way).

Best Regards

Björn

On 4/14/22 16:30, Jassi Brar wrote:
> On Thu, Mar 31, 2022 at 2:01 AM Björn Ardö <[email protected]> wrote:
>> This reverts commit c7dacf5b0f32957b24ef29df1207dc2cd8307743,
>> "mailbox: avoid timer start from callback"
>>
>> The previous commit was reverted since it lead to a race that
>> caused the hrtimer to not be started at all. The check for
>> hrtimer_active() in msg_submit() will return true if the
>> callback function txdone_hrtimer() is currently running. This
>> function could return HRTIMER_NORESTART and then the timer
>> will not be restarted, and also msg_submit() will not start
>> the timer. This will lead to a message actually being submitted
>> but no timer will start to check for its compleation.
>>
>> The original fix that added checking hrtimer_active() was added to
>> avoid a warning with hrtimer_forward. Looking in the kernel
>> another solution to avoid this warning is to check hrtimer_is_queued()
>> before calling hrtimer_forward_now() instead. This however requires a
>> lock so the timer is not started by msg_submit() inbetween this check
>> and the hrtimer_forward() call.
>>
> This is a very dense api used by many use-cases, I am not confident
> making any changes without confirming its a real issue with the common
> code. Please share your client code and traces, that will help me get
> a clearer picture.
>
> Thanks.

Attachments:

mailbox-artpec8.c (7.33 kB)

2022-04-22 22:32:07

by Bjorn Ardo

[permalink] [raw]

Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

Hi,

On 4/19/22 16:10, Jassi Brar wrote:
> I don't have access to your client driver, but if it submits another
> message from rx_callback() that is the problem.
>
> Please have a look at how other platforms do, for example
> imx_dsp_rproc_rx_tx_callback()
>
> /**
> * mbox_chan_received_data - A way for controller driver to push data
> * received from remote to the upper layer.
> * @chan: Pointer to the mailbox channel on which RX happened.
> * @mssg: Client specific message typecasted as void *
> *
> * After startup and before shutdown any data received on the chan
> * is passed on to the API via atomic mbox_chan_received_data().
> * The controller should ACK the RX only after this call returns.
> */
> void mbox_chan_received_data(struct mbox_chan *chan, void *mssg)
>
> If not this, please share your client code as well.
>
> thanks.

In rx_callback() we call tasklet_hi_schedule() to schedule a tasklet and
this tasklet calls mbox_send_message(). The mailbox is setup with
.tx_block = false.

I do not quite understand how calling mbox_send_message() from another
contex (like a work thread as in imx_dsp_rproc_rx_tx_callback()) will
resolve the race, could you explain this? Or do you mean that it should
not be called from any interrupt context at all? Looking at the
documentation of mbox_send_message() it states:

* This function could be called from atomic context as it simply
* queues the data and returns a token against the request.

If I look at the code in msg_submit() the part that calls
hrtimer_active() and hrtimer_start() is completely without a lock. Also
the code in txdone_hrtimer() that calls hrtimer_forward_now() is without
a lock. So I cannot see anything preventing these two functions to be
called concurrently on two different processors (as they do in my
trace). And looking at the hr_timer code then hrtimer_active() will
return true if the hrtimer is currently executing txdone_hrtimer().

The way I see the race is if the timer has called txdone_hrtimer() and
it has passed the part with the for-loop looking at the individual
channels when the msg_submit() runs. Then txdone_hrtimer() has decided
to not restart the timer, but hrtimer_active() will still return true
for a while (until txdone_hrtimer() return completely and the hrtimer
code can change its status). In this time there is nothing that prevents
msg_submit() from running, adding a new message that the timer should
monitor. But if it manages to call hrtimer_active() in the time period
before the hrtimer code updates it then the current code will never
start the timer.

Also the poll_hrt is shared between all channels in the mailbox, so it
does not have to be the same mailbox channel that hrtimer is currently
waiting for that is calling msg_submit(). So not calling
mbox_send_message() from rx_callback() will only alter timing for that
channel. There could still be a race between two different mailbox channels.

This is my understanding of the current code, please tell me if I have
missed or misunderstood any part of it?

Our current solution are using 4 different mailbox channels
asynchronously. The code is part of a larger system, but I can put down
some time and try and extract the relevant parts if you still think this
is a client issue? But with my current understanding of the code, the
race between msg_submit() and txdone_hrtimer() is quite clear, and with
my proposed patch that removes this race we have be able to run for very
long time without any problems (that is several days). Without the fix
we get the race after 5-10 min.

Best Regards,

Björn

2022-05-23 11:56:52

by Bjorn Ardo

[permalink] [raw]

Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

Hi again,

On 4/20/22 10:28, Bjorn Ardo wrote:
>
>
> Our current solution are using 4 different mailbox channels
> asynchronously. The code is part of a larger system, but I can put
> down some time and try and extract the relevant parts if you still
> think this is a client issue? But with my current understanding of the
> code, the race between msg_submit() and txdone_hrtimer() is quite
> clear, and with my proposed patch that removes this race we have be
> able to run for very long time without any problems (that is several
> days). Without the fix we get the race after 5-10 min.
>
>
>

I do not know if you have had any time to review my comments yet, but we
have created some examples to trigger the error.

With the attached testmodule mailbox-loadtest.c I can trigger the error
by attaching it to the two sides of an mailbox with the following
devicetree code:

        mboxtest1 {
                compatible = "mailbox-loadtest";
                mbox-names = "ping", "pong";
                mboxes = <&mbox_loop_pri 0 &mbox_loop_pri 1>;
        };

        mboxtest2 {
                compatible = "mailbox-loadtest";
                mbox-names = "pong", "ping";
                mboxes = <&mbox_loop_scd 0 &mbox_loop_scd 1>;
        };

After that I create load on the mailbox by running (or respectively
system) the following:

while echo 1 > /sys/kernel/debug/mboxtest1/ping ; do
usleep 1
done

while echo 1 > /sys/kernel/debug/mboxtest2/ping ; do
usleep 50000
done

After a few minutes (normally 2-5) I get errors.

Using the patch I sent earlier the errors goes away.

We also have created a mailbox-loopback.c that is a loopback mailbox
that can be used on the same system (to make testing easier on systems
that does not have a hardware mailbox), it is also attached. This can be
probed by the following devicetree code:

        mbox_loop_pri: mailbox_loop_pri {
                compatible = "mailbox-loopback";
                #mbox-cells = <1>;
                side = <0>;
        };
        mbox_loop_scd: mailbox_loop_scd {
                compatible = "mailbox-loopback";
                #mbox-cells = <1>;
                side = <1>;
        };

And with this loopback mailbox we have also been able to reproduce the
errors without the patch applied.

Best Regards,

Björn

Attachments:

mailbox-loadtest.c (5.10 kB)
mailbox-loopback.c (4.17 kB)
Download all attachments

2022-05-23 20:44:21

by Jassi Brar

[permalink] [raw]

Subject: Re: [PATCH] mailbox: forward the hrtimer if not queued and under a lock

On Mon, May 23, 2022 at 6:56 AM Bjorn Ardo <[email protected]> wrote:
>
> Hi again,
>
>
> On 4/20/22 10:28, Bjorn Ardo wrote:
> >
> >
> > Our current solution are using 4 different mailbox channels
> > asynchronously. The code is part of a larger system, but I can put
> > down some time and try and extract the relevant parts if you still
> > think this is a client issue? But with my current understanding of the
> > code, the race between msg_submit() and txdone_hrtimer() is quite
> > clear, and with my proposed patch that removes this race we have be
> > able to run for very long time without any problems (that is several
> > days). Without the fix we get the race after 5-10 min.
> >
> >
> >
>
> I do not know if you have had any time to review my comments yet, but we
> have created some examples to trigger the error.
>
Thanks, but your last explanation was enough. The logic seems fine.
I was hoping someone impacted by the commit may chime in with a tested/acked by.
I think I'll pick it up now.

Thanks.

>
> With the attached testmodule mailbox-loadtest.c I can trigger the error
> by attaching it to the two sides of an mailbox with the following
> devicetree code:
>
> mboxtest1 {
> compatible = "mailbox-loadtest";
> mbox-names = "ping", "pong";
> mboxes = <&mbox_loop_pri 0 &mbox_loop_pri 1>;
> };
>
> mboxtest2 {
> compatible = "mailbox-loadtest";
> mbox-names = "pong", "ping";
> mboxes = <&mbox_loop_scd 0 &mbox_loop_scd 1>;
> };
>
>
> After that I create load on the mailbox by running (or respectively
> system) the following:
>
> while echo 1 > /sys/kernel/debug/mboxtest1/ping ; do
> usleep 1
> done
>
> while echo 1 > /sys/kernel/debug/mboxtest2/ping ; do
> usleep 50000
> done
>
> After a few minutes (normally 2-5) I get errors.
>
>
> Using the patch I sent earlier the errors goes away.
>
>
> We also have created a mailbox-loopback.c that is a loopback mailbox
> that can be used on the same system (to make testing easier on systems
> that does not have a hardware mailbox), it is also attached. This can be
> probed by the following devicetree code:
>
> mbox_loop_pri: mailbox_loop_pri {
> compatible = "mailbox-loopback";
> #mbox-cells = <1>;
> side = <0>;
> };
> mbox_loop_scd: mailbox_loop_scd {
> compatible = "mailbox-loopback";
> #mbox-cells = <1>;
> side = <1>;
> };
>
> And with this loopback mailbox we have also been able to reproduce the
> errors without the patch applied.
>
>
> Best Regards,
>
> Björn
>