I added a cover page for this single patch to provide a little more
explanation about testing.
I have verified that, when this patch is applied, the IPA driver
receives the desired crash notification after the modem has crashed.
It recovers properly, and functions correctly following the crash.
However I have not tested the specific scenario it is intended to
fix; that is, I have not verified that the crash notification is
delayed if the crash occurs while the IPA driver is suspended. If I
had a reliable way to do this I would have done so, but I do not.
I can only argue that it *should* work, based on the way the
freezable system workqueue is designed.
My biggest concern about this change is that I don't understand
other remoteproc users well enough to know what impact this change
would have on them. So I am relying on review (and testing if
possible!) to evaluate that.
And finally, while this is a very simple (one line) change, if there
are other suggestions I'd like to hear them.
-Alex
Alex Elder (1):
remoteproc: use freezable workqueue for crash notifications
drivers/remoteproc/remoteproc_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--
2.27.0
When a remoteproc has crashed, rproc_report_crash() is called to
handle whatever recovery is desired. This can happen at almost any
time, often triggered by an interrupt, though it can also be
initiated by a write to debugfs file remoteproc/remoteproc*/crash.
When a crash is reported, the crash handler worker is scheduled to
run (rproc_crash_handler_work()). One thing that worker does is
call rproc_trigger_recovery(), which calls rproc_stop(). That calls
the ->stop method for any remoteproc subdevices before making the
remote processor go offline.
The Q6V5 modem remoteproc driver implements an SSR subdevice that
notifies registered drivers when the modem changes operational state
(prepare, started, stop/crash, unprepared). The IPA driver
registers to receive these notifications.
With that as context, I'll now describe the problem.
There was a situation in which buggy modem firmware led to a modem
crash very soon after system (AP) resume had begun. The crash caused
a remoteproc SSR crash notification to be sent to the IPA driver.
The problem was that, although system resume had begun, it had not
yet completed, and the IPA driver was still in a suspended state.
This scenario could happen to any driver that registers for these
SSR notifications, because they are delivered without knowledge of
the (suspend) state of registered recipient drivers.
This patch offers a simple fix for this, by having the crash
handling worker function run on the system freezable workqueue.
This workqueue does not operate if user space is frozen (for
suspend). As a result, the SSR subdevice only delivers its
crash notification when the system is fully operational (i.e.,
neither suspended nor in suspend/resume transition).
Signed-off-by: Alex Elder <[email protected]>
---
drivers/remoteproc/remoteproc_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index 39cf44cb08035..6bedf2d2af239 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
dev_err(&rproc->dev, "crash detected in %s: type %s\n",
rproc->name, rproc_crash_to_string(type));
- /* create a new task to handle the error */
- schedule_work(&rproc->crash_handler);
+ /* Have a worker handle the error; ensure system is not suspended */
+ queue_work(system_freezable_wq, &rproc->crash_handler);
}
EXPORT_SYMBOL(rproc_report_crash);
--
2.27.0
On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
> When a remoteproc has crashed, rproc_report_crash() is called to
> handle whatever recovery is desired. This can happen at almost any
> time, often triggered by an interrupt, though it can also be
> initiated by a write to debugfs file remoteproc/remoteproc*/crash.
>
> When a crash is reported, the crash handler worker is scheduled to
> run (rproc_crash_handler_work()). One thing that worker does is
> call rproc_trigger_recovery(), which calls rproc_stop(). That calls
> the ->stop method for any remoteproc subdevices before making the
> remote processor go offline.
>
> The Q6V5 modem remoteproc driver implements an SSR subdevice that
> notifies registered drivers when the modem changes operational state
> (prepare, started, stop/crash, unprepared). The IPA driver
> registers to receive these notifications.
>
> With that as context, I'll now describe the problem.
>
> There was a situation in which buggy modem firmware led to a modem
> crash very soon after system (AP) resume had begun. The crash caused
> a remoteproc SSR crash notification to be sent to the IPA driver.
> The problem was that, although system resume had begun, it had not
> yet completed, and the IPA driver was still in a suspended state.
>
> This scenario could happen to any driver that registers for these
> SSR notifications, because they are delivered without knowledge of
> the (suspend) state of registered recipient drivers.
>
> This patch offers a simple fix for this, by having the crash
> handling worker function run on the system freezable workqueue.
> This workqueue does not operate if user space is frozen (for
> suspend). As a result, the SSR subdevice only delivers its
> crash notification when the system is fully operational (i.e.,
> neither suspended nor in suspend/resume transition).
>
This makes sense to me; both that it ensures that we spend our resources
on the actual system resume and that it avoids surprises from this
happening while the system still is in a funky state...
Reviewed-by: Bjorn Andersson <[email protected]>
But it would be nice to get some input from other users of the
framework.
Regards,
Bjorn
> Signed-off-by: Alex Elder <[email protected]>
> ---
> drivers/remoteproc/remoteproc_core.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index 39cf44cb08035..6bedf2d2af239 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
> dev_err(&rproc->dev, "crash detected in %s: type %s\n",
> rproc->name, rproc_crash_to_string(type));
>
> - /* create a new task to handle the error */
> - schedule_work(&rproc->crash_handler);
> + /* Have a worker handle the error; ensure system is not suspended */
> + queue_work(system_freezable_wq, &rproc->crash_handler);
> }
> EXPORT_SYMBOL(rproc_report_crash);
>
> --
> 2.27.0
>
On Thu, May 27, 2021 at 10:55:05PM -0500, Bjorn Andersson wrote:
> On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
>
> > When a remoteproc has crashed, rproc_report_crash() is called to
> > handle whatever recovery is desired. This can happen at almost any
> > time, often triggered by an interrupt, though it can also be
> > initiated by a write to debugfs file remoteproc/remoteproc*/crash.
> >
> > When a crash is reported, the crash handler worker is scheduled to
> > run (rproc_crash_handler_work()). One thing that worker does is
> > call rproc_trigger_recovery(), which calls rproc_stop(). That calls
> > the ->stop method for any remoteproc subdevices before making the
> > remote processor go offline.
> >
> > The Q6V5 modem remoteproc driver implements an SSR subdevice that
> > notifies registered drivers when the modem changes operational state
> > (prepare, started, stop/crash, unprepared). The IPA driver
> > registers to receive these notifications.
> >
> > With that as context, I'll now describe the problem.
> >
> > There was a situation in which buggy modem firmware led to a modem
> > crash very soon after system (AP) resume had begun. The crash caused
> > a remoteproc SSR crash notification to be sent to the IPA driver.
> > The problem was that, although system resume had begun, it had not
> > yet completed, and the IPA driver was still in a suspended state.
> >
> > This scenario could happen to any driver that registers for these
> > SSR notifications, because they are delivered without knowledge of
> > the (suspend) state of registered recipient drivers.
> >
> > This patch offers a simple fix for this, by having the crash
> > handling worker function run on the system freezable workqueue.
> > This workqueue does not operate if user space is frozen (for
> > suspend). As a result, the SSR subdevice only delivers its
> > crash notification when the system is fully operational (i.e.,
> > neither suspended nor in suspend/resume transition).
> >
>
> This makes sense to me; both that it ensures that we spend our resources
> on the actual system resume and that it avoids surprises from this
> happening while the system still is in a funky state...
>
> Reviewed-by: Bjorn Andersson <[email protected]>
>
> But it would be nice to get some input from other users of the
> framework.
>
This patch is in my review queue - I should be able to get to it by the end of
next week.
> Regards,
> Bjorn
>
> > Signed-off-by: Alex Elder <[email protected]>
> > ---
> > drivers/remoteproc/remoteproc_core.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index 39cf44cb08035..6bedf2d2af239 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
> > dev_err(&rproc->dev, "crash detected in %s: type %s\n",
> > rproc->name, rproc_crash_to_string(type));
> >
> > - /* create a new task to handle the error */
> > - schedule_work(&rproc->crash_handler);
> > + /* Have a worker handle the error; ensure system is not suspended */
> > + queue_work(system_freezable_wq, &rproc->crash_handler);
> > }
> > EXPORT_SYMBOL(rproc_report_crash);
> >
> > --
> > 2.27.0
> >
On 5/27/2021 8:55 PM, Bjorn Andersson wrote:
> On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
>
>> When a remoteproc has crashed, rproc_report_crash() is called to
>> handle whatever recovery is desired. This can happen at almost any
>> time, often triggered by an interrupt, though it can also be
>> initiated by a write to debugfs file remoteproc/remoteproc*/crash.
>>
>> When a crash is reported, the crash handler worker is scheduled to
>> run (rproc_crash_handler_work()). One thing that worker does is
>> call rproc_trigger_recovery(), which calls rproc_stop(). That calls
>> the ->stop method for any remoteproc subdevices before making the
>> remote processor go offline.
>>
>> The Q6V5 modem remoteproc driver implements an SSR subdevice that
>> notifies registered drivers when the modem changes operational state
>> (prepare, started, stop/crash, unprepared). The IPA driver
>> registers to receive these notifications.
>>
>> With that as context, I'll now describe the problem.
>>
>> There was a situation in which buggy modem firmware led to a modem
>> crash very soon after system (AP) resume had begun. The crash caused
>> a remoteproc SSR crash notification to be sent to the IPA driver.
>> The problem was that, although system resume had begun, it had not
>> yet completed, and the IPA driver was still in a suspended state.
>>
>> This scenario could happen to any driver that registers for these
>> SSR notifications, because they are delivered without knowledge of
>> the (suspend) state of registered recipient drivers.
>>
>> This patch offers a simple fix for this, by having the crash
>> handling worker function run on the system freezable workqueue.
>> This workqueue does not operate if user space is frozen (for
>> suspend). As a result, the SSR subdevice only delivers its
>> crash notification when the system is fully operational (i.e.,
>> neither suspended nor in suspend/resume transition).
>>
> This makes sense to me; both that it ensures that we spend our resources
> on the actual system resume and that it avoids surprises from this
> happening while the system still is in a funky state...
>
> Reviewed-by: Bjorn Andersson <[email protected]>
>
> But it would be nice to get some input from other users of the
> framework.
This patch sounds like a good idea for cases where the
request_firmware() APIs fallback to userspace firmware loading.
Will test out this patch and report back.
Thanks,
Sid
>
> Regards,
> Bjorn
>
>> Signed-off-by: Alex Elder <[email protected]>
>> ---
>> drivers/remoteproc/remoteproc_core.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>> index 39cf44cb08035..6bedf2d2af239 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
>> dev_err(&rproc->dev, "crash detected in %s: type %s\n",
>> rproc->name, rproc_crash_to_string(type));
>>
>> - /* create a new task to handle the error */
>> - schedule_work(&rproc->crash_handler);
>> + /* Have a worker handle the error; ensure system is not suspended */
>> + queue_work(system_freezable_wq, &rproc->crash_handler);
>> }
>> EXPORT_SYMBOL(rproc_report_crash);
>>
>> --
>> 2.27.0
>>
On Fri 28 May 21:48 CDT 2021, Hillf Danton wrote:
> On Fri, 28 May 2021 09:09:12 -0600 Mathieu Poirier wrote:
> >On Thu, May 27, 2021 at 10:55:05PM -0500, Bjorn Andersson wrote:
> >> On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
> >>
> >> > When a remoteproc has crashed, rproc_report_crash() is called to
> >> > handle whatever recovery is desired. This can happen at almost any
> >> > time, often triggered by an interrupt, though it can also be
> >> > initiated by a write to debugfs file remoteproc/remoteproc*/crash.
> >> >
> >> > When a crash is reported, the crash handler worker is scheduled to
> >> > run (rproc_crash_handler_work()). One thing that worker does is
> >> > call rproc_trigger_recovery(), which calls rproc_stop(). That calls
> >> > the ->stop method for any remoteproc subdevices before making the
> >> > remote processor go offline.
> >> >
> >> > The Q6V5 modem remoteproc driver implements an SSR subdevice that
> >> > notifies registered drivers when the modem changes operational state
> >> > (prepare, started, stop/crash, unprepared). The IPA driver
> >> > registers to receive these notifications.
> >> >
> >> > With that as context, I'll now describe the problem.
> >> >
> >> > There was a situation in which buggy modem firmware led to a modem
> >> > crash very soon after system (AP) resume had begun. The crash caused
> >> > a remoteproc SSR crash notification to be sent to the IPA driver.
> >> > The problem was that, although system resume had begun, it had not
> >> > yet completed, and the IPA driver was still in a suspended state.
> >> >
> >> > This scenario could happen to any driver that registers for these
> >> > SSR notifications, because they are delivered without knowledge of
> >> > the (suspend) state of registered recipient drivers.
> >> >
> >> > This patch offers a simple fix for this, by having the crash
> >> > handling worker function run on the system freezable workqueue.
> >> > This workqueue does not operate if user space is frozen (for
> >> > suspend). As a result, the SSR subdevice only delivers its
> >> > crash notification when the system is fully operational (i.e.,
> >> > neither suspended nor in suspend/resume transition).
> >> >
> >>
> >> This makes sense to me; both that it ensures that we spend our resources
> >> on the actual system resume and that it avoids surprises from this
> >> happening while the system still is in a funky state...
> >>
> >> Reviewed-by: Bjorn Andersson <[email protected]>
> >>
> >> But it would be nice to get some input from other users of the
> >> framework.
> >>
> >
> >This patch is in my review queue - I should be able to get to it by the end of
> >next week.
>
> A minute...
>
> >> Regards,
> >> Bjorn
> >>
> >> > Signed-off-by: Alex Elder <[email protected]>
> >> > ---
> >> > drivers/remoteproc/remoteproc_core.c | 4 ++--
> >> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> >> > index 39cf44cb08035..6bedf2d2af239 100644
> >> > --- a/drivers/remoteproc/remoteproc_core.c
> >> > +++ b/drivers/remoteproc/remoteproc_core.c
> >> > @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
> >> > dev_err(&rproc->dev, "crash detected in %s: type %s\n",
> >> > rproc->name, rproc_crash_to_string(type));
> >> >
> >> > - /* create a new task to handle the error */
> >> > - schedule_work(&rproc->crash_handler);
> >> > + /* Have a worker handle the error; ensure system is not suspended */
> >> > + queue_work(system_freezable_wq, &rproc->crash_handler);
> >> > }
> >> > EXPORT_SYMBOL(rproc_report_crash);
>
> Given 1) system_freezable_wq is bound 2) the mutex_lock(&rproc->lock) in
> rproc_crash_handler_work() requires an unbound wq, the system_unbound_wq is
> the right candidate if no more wq is allocated under the rproc directory.
>
Can you please explain why the mutex_lock() "requires" the context
executing it to be "unbound"? The lock is there to protect against
concurrent modifications of the state coming from e.g. sysfs.
Also, we're currently scheduling the work on the system_wq, which afaict
isn't UNBOUND either. I do suspect that your suggestion of moving the
work to an UNBOUND queue would improve concurrency in the event of
multiple remote processors triggering rproc_report_crash() from the same
CPU.
If this is the case I'd be happy to take the proposed patch and then
separately create a new unbound, freezable, queue in remoteproc (to
ensure we can bisect any potential issues).
Regards,
Bjorn
On Thu, May 27, 2021 at 10:55:05PM -0500, Bjorn Andersson wrote:
> On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
>
> > When a remoteproc has crashed, rproc_report_crash() is called to
> > handle whatever recovery is desired. This can happen at almost any
> > time, often triggered by an interrupt, though it can also be
> > initiated by a write to debugfs file remoteproc/remoteproc*/crash.
> >
> > When a crash is reported, the crash handler worker is scheduled to
> > run (rproc_crash_handler_work()). One thing that worker does is
> > call rproc_trigger_recovery(), which calls rproc_stop(). That calls
> > the ->stop method for any remoteproc subdevices before making the
> > remote processor go offline.
> >
> > The Q6V5 modem remoteproc driver implements an SSR subdevice that
> > notifies registered drivers when the modem changes operational state
> > (prepare, started, stop/crash, unprepared). The IPA driver
> > registers to receive these notifications.
> >
> > With that as context, I'll now describe the problem.
> >
> > There was a situation in which buggy modem firmware led to a modem
> > crash very soon after system (AP) resume had begun. The crash caused
> > a remoteproc SSR crash notification to be sent to the IPA driver.
> > The problem was that, although system resume had begun, it had not
> > yet completed, and the IPA driver was still in a suspended state.
This is a very tight race condition - I agree with you that it is next to
impossible to test.
> >
> > This scenario could happen to any driver that registers for these
> > SSR notifications, because they are delivered without knowledge of
> > the (suspend) state of registered recipient drivers.
> >
> > This patch offers a simple fix for this, by having the crash
> > handling worker function run on the system freezable workqueue.
> > This workqueue does not operate if user space is frozen (for
> > suspend). As a result, the SSR subdevice only delivers its
> > crash notification when the system is fully operational (i.e.,
> > neither suspended nor in suspend/resume transition).
> >
I think the real fix for this problem should be in the platform driver where
the remoteproc interrupt would be masked while suspending and re-enabled again
when resuming. The runtime PM API would work just fine for that... But doing
so wouldn't guarantee that other drivers, i.e IPA, would be operational. Unless
of one is a child of the other or using a bus like mechanic, and getting
to that point will introduce a lot more churn than what this patch does.
The proposed solution will work since user space is expected to resume when
the kernel and drivers are fully operational. As you pointed out the only
concern is with other users, which may not be noticeable given the very small
delay that is introduced. As such:
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro>
But be mindful the patch will have to be reverted if someone complains, whether
that happens in two weeks or 10 months.
Thanks,
Mathieu
>
> This makes sense to me; both that it ensures that we spend our resources
> on the actual system resume and that it avoids surprises from this
> happening while the system still is in a funky state...
>
> Reviewed-by: Bjorn Andersson <[email protected]>
>
> But it would be nice to get some input from other users of the
> framework.
>
> Regards,
> Bjorn
>
> > Signed-off-by: Alex Elder <[email protected]>
> > ---
> > drivers/remoteproc/remoteproc_core.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index 39cf44cb08035..6bedf2d2af239 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
> > dev_err(&rproc->dev, "crash detected in %s: type %s\n",
> > rproc->name, rproc_crash_to_string(type));
> >
> > - /* create a new task to handle the error */
> > - schedule_work(&rproc->crash_handler);
> > + /* Have a worker handle the error; ensure system is not suspended */
> > + queue_work(system_freezable_wq, &rproc->crash_handler);
> > }
> > EXPORT_SYMBOL(rproc_report_crash);
> >
> > --
> > 2.27.0
> >
On Mon 31 May 12:21 CDT 2021, Mathieu Poirier wrote:
> On Thu, May 27, 2021 at 10:55:05PM -0500, Bjorn Andersson wrote:
> > On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
> >
> > > When a remoteproc has crashed, rproc_report_crash() is called to
> > > handle whatever recovery is desired. This can happen at almost any
> > > time, often triggered by an interrupt, though it can also be
> > > initiated by a write to debugfs file remoteproc/remoteproc*/crash.
> > >
> > > When a crash is reported, the crash handler worker is scheduled to
> > > run (rproc_crash_handler_work()). One thing that worker does is
> > > call rproc_trigger_recovery(), which calls rproc_stop(). That calls
> > > the ->stop method for any remoteproc subdevices before making the
> > > remote processor go offline.
> > >
> > > The Q6V5 modem remoteproc driver implements an SSR subdevice that
> > > notifies registered drivers when the modem changes operational state
> > > (prepare, started, stop/crash, unprepared). The IPA driver
> > > registers to receive these notifications.
> > >
> > > With that as context, I'll now describe the problem.
> > >
> > > There was a situation in which buggy modem firmware led to a modem
> > > crash very soon after system (AP) resume had begun. The crash caused
> > > a remoteproc SSR crash notification to be sent to the IPA driver.
> > > The problem was that, although system resume had begun, it had not
> > > yet completed, and the IPA driver was still in a suspended state.
>
> This is a very tight race condition - I agree with you that it is next to
> impossible to test.
>
I certainly appreciate to see the upstream kernel be put through the
level of product testing necessary to find issues like this.
> > >
> > > This scenario could happen to any driver that registers for these
> > > SSR notifications, because they are delivered without knowledge of
> > > the (suspend) state of registered recipient drivers.
> > >
> > > This patch offers a simple fix for this, by having the crash
> > > handling worker function run on the system freezable workqueue.
> > > This workqueue does not operate if user space is frozen (for
> > > suspend). As a result, the SSR subdevice only delivers its
> > > crash notification when the system is fully operational (i.e.,
> > > neither suspended nor in suspend/resume transition).
> > >
>
> I think the real fix for this problem should be in the platform driver where
> the remoteproc interrupt would be masked while suspending and re-enabled again
> when resuming. The runtime PM API would work just fine for that... But doing
> so wouldn't guarantee that other drivers, i.e IPA, would be operational. Unless
> of one is a child of the other or using a bus like mechanic, and getting
> to that point will introduce a lot more churn than what this patch does.
>
Disabling the related interrupt(s) would mean that if the modem
remoteproc firmware crashes while Linux is suspended we would not know
about this until the next time Linux resumes. The expected outcome of
this would be that until something else happens to wake up Linux you
won't get any notifications from the network (i.e. no phone calls, text
messages or incoming notifications)
Regards,
Bjorn
On Sat 29 May 22:07 CDT 2021, Hillf Danton wrote:
> On Sat, 29 May 2021 12:28:36 -0500 Bjorn Andersson wrote:
> >
> >Can you please explain why the mutex_lock() "requires" the context
> >executing it to be "unbound"? The lock is there to protect against
> >concurrent modifications of the state coming from e.g. sysfs.
>
> There are simple and light events pending on the bound workqueue,
>
> static void foo_event_fn(struct work_struct *w)
> {
> struct bar_struct *bar = container_of(w, struct bar_struct, work);
>
> spin_lock_irq(&foo_lock);
> list_del(&bar->list);
> spin_unlock_irq(&foo_lock);
>
> kfree(bar);
> return;
> or
> if (bar has waiter)
> wake_up();
> }
>
> and they are not tough enough to tolerate a schedule() for which the unbound
> wq is allocated.
If you have work that is so latency sensitive that it can't handle other
work items sleeping momentarily, is it really a good idea to schedule
them on the system wide queues - or even schedule them at all?
That said, the proposed patch does not move the work from an unbound to
a bound queue, it simply moves it from one bound system queue to another
and further changes to this should be done in a separate patch - backed
by some measurements/data.
Thanks,
Bjorn
On 5/31/21 12:21 PM, Mathieu Poirier wrote:
> On Thu, May 27, 2021 at 10:55:05PM -0500, Bjorn Andersson wrote:
>> On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
>>
>>> When a remoteproc has crashed, rproc_report_crash() is called to
>>> handle whatever recovery is desired. This can happen at almost any
>>> time, often triggered by an interrupt, though it can also be
>>> initiated by a write to debugfs file remoteproc/remoteproc*/crash.
>>>
>>> When a crash is reported, the crash handler worker is scheduled to
>>> run (rproc_crash_handler_work()). One thing that worker does is
>>> call rproc_trigger_recovery(), which calls rproc_stop(). That calls
>>> the ->stop method for any remoteproc subdevices before making the
>>> remote processor go offline.
>>>
>>> The Q6V5 modem remoteproc driver implements an SSR subdevice that
>>> notifies registered drivers when the modem changes operational state
>>> (prepare, started, stop/crash, unprepared). The IPA driver
>>> registers to receive these notifications.
>>>
>>> With that as context, I'll now describe the problem.
>>>
>>> There was a situation in which buggy modem firmware led to a modem
>>> crash very soon after system (AP) resume had begun. The crash caused
>>> a remoteproc SSR crash notification to be sent to the IPA driver.
>>> The problem was that, although system resume had begun, it had not
>>> yet completed, and the IPA driver was still in a suspended state.
>
> This is a very tight race condition - I agree with you that it is next to
> impossible to test.
>
>>>
>>> This scenario could happen to any driver that registers for these
>>> SSR notifications, because they are delivered without knowledge of
>>> the (suspend) state of registered recipient drivers.
>>>
>>> This patch offers a simple fix for this, by having the crash
>>> handling worker function run on the system freezable workqueue.
>>> This workqueue does not operate if user space is frozen (for
>>> suspend). As a result, the SSR subdevice only delivers its
>>> crash notification when the system is fully operational (i.e.,
>>> neither suspended nor in suspend/resume transition).
>>>
>
> I think the real fix for this problem should be in the platform driver where
> the remoteproc interrupt would be masked while suspending and re-enabled again
> when resuming. The runtime PM API would work just fine for that... But doing
I considered that (and sent a private e-mail to Bjorn with that
as a suggestion), but as he pointed out, there's a chance this
would include disabling an interrupt that is needed to trigger a
system resume.
> so wouldn't guarantee that other drivers, i.e IPA, would be operational. Unless
> of one is a child of the other or using a bus like mechanic, and getting
> to that point will introduce a lot more churn than what this patch does.
Agreed. I think either the remoteproc driver should handle it
centrally, or all drivers that could be affected by this scenario
should be updated to handle it. To my knowledge only IPA is
affected, but if it's possible for remoteproc to handle it, it's
simpler overall.
Thanks for the review.
-Alex
> The proposed solution will work since user space is expected to resume when
> the kernel and drivers are fully operational. As you pointed out the only
> concern is with other users, which may not be noticeable given the very small
> delay that is introduced. As such:
>
> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro>
>
> But be mindful the patch will have to be reverted if someone complains, whether
> that happens in two weeks or 10 months.
>
> Thanks,
> Mathieu
>
>>
>> This makes sense to me; both that it ensures that we spend our resources
>> on the actual system resume and that it avoids surprises from this
>> happening while the system still is in a funky state...
>>
>> Reviewed-by: Bjorn Andersson <[email protected]>
>>
>> But it would be nice to get some input from other users of the
>> framework.
>>
>> Regards,
>> Bjorn
>>
>>> Signed-off-by: Alex Elder <[email protected]>
>>> ---
>>> drivers/remoteproc/remoteproc_core.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>>> index 39cf44cb08035..6bedf2d2af239 100644
>>> --- a/drivers/remoteproc/remoteproc_core.c
>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>> @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
>>> dev_err(&rproc->dev, "crash detected in %s: type %s\n",
>>> rproc->name, rproc_crash_to_string(type));
>>>
>>> - /* create a new task to handle the error */
>>> - schedule_work(&rproc->crash_handler);
>>> + /* Have a worker handle the error; ensure system is not suspended */
>>> + queue_work(system_freezable_wq, &rproc->crash_handler);
>>> }
>>> EXPORT_SYMBOL(rproc_report_crash);
>>>
>>> --
>>> 2.27.0
>>>
On 5/28/2021 5:12 PM, Siddharth Gupta wrote:
>
> On 5/27/2021 8:55 PM, Bjorn Andersson wrote:
>> On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:
>>
>>> When a remoteproc has crashed, rproc_report_crash() is called to
>>> handle whatever recovery is desired. This can happen at almost any
>>> time, often triggered by an interrupt, though it can also be
>>> initiated by a write to debugfs file remoteproc/remoteproc*/crash.
>>>
>>> When a crash is reported, the crash handler worker is scheduled to
>>> run (rproc_crash_handler_work()). One thing that worker does is
>>> call rproc_trigger_recovery(), which calls rproc_stop(). That calls
>>> the ->stop method for any remoteproc subdevices before making the
>>> remote processor go offline.
>>>
>>> The Q6V5 modem remoteproc driver implements an SSR subdevice that
>>> notifies registered drivers when the modem changes operational state
>>> (prepare, started, stop/crash, unprepared). The IPA driver
>>> registers to receive these notifications.
>>>
>>> With that as context, I'll now describe the problem.
>>>
>>> There was a situation in which buggy modem firmware led to a modem
>>> crash very soon after system (AP) resume had begun. The crash caused
>>> a remoteproc SSR crash notification to be sent to the IPA driver.
>>> The problem was that, although system resume had begun, it had not
>>> yet completed, and the IPA driver was still in a suspended state.
>>>
>>> This scenario could happen to any driver that registers for these
>>> SSR notifications, because they are delivered without knowledge of
>>> the (suspend) state of registered recipient drivers.
>>>
>>> This patch offers a simple fix for this, by having the crash
>>> handling worker function run on the system freezable workqueue.
>>> This workqueue does not operate if user space is frozen (for
>>> suspend). As a result, the SSR subdevice only delivers its
>>> crash notification when the system is fully operational (i.e.,
>>> neither suspended nor in suspend/resume transition).
>>>
>> This makes sense to me; both that it ensures that we spend our resources
>> on the actual system resume and that it avoids surprises from this
>> happening while the system still is in a funky state...
>>
>> Reviewed-by: Bjorn Andersson <[email protected]>
>>
>> But it would be nice to get some input from other users of the
>> framework.
> This patch sounds like a good idea for cases where the
> request_firmware() APIs fallback to userspace firmware loading.
>
> Will test out this patch and report back.
>
> Thanks,
> Sid
I was able to test out this change with one of our usecases, no
issues to report.
Could you please CC stable as well?
Thanks,
Sid
Tested-by: Siddharth Gupta <[email protected]>
>>
>> Regards,
>> Bjorn
>>
>>> Signed-off-by: Alex Elder <[email protected]>
>>> ---
>>> drivers/remoteproc/remoteproc_core.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>> b/drivers/remoteproc/remoteproc_core.c
>>> index 39cf44cb08035..6bedf2d2af239 100644
>>> --- a/drivers/remoteproc/remoteproc_core.c
>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>> @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc,
>>> enum rproc_crash_type type)
>>> dev_err(&rproc->dev, "crash detected in %s: type %s\n",
>>> rproc->name, rproc_crash_to_string(type));
>>> - /* create a new task to handle the error */
>>> - schedule_work(&rproc->crash_handler);
>>> + /* Have a worker handle the error; ensure system is not
>>> suspended */
>>> + queue_work(system_freezable_wq, &rproc->crash_handler);
>>> }
>>> EXPORT_SYMBOL(rproc_report_crash);
>>> --
>>> 2.27.0
>>>