Subject: Re: [PATCH] remoteproc: Create a separate workqueue for recovery tasks
To: Bjorn Andersson, Rishabh Bhatnagar
Cc: linux-remoteproc@vger.kernel.org, linux-kernel@vger.kernel.org,
    tsoni@codeaurora.org, psodagud@codeaurora.org, sidgup@codeaurora.org
References: <1607806087-27244-1-git-send-email-rishabhb@codeaurora.org>
From: Alex Elder
Date: Thu, 17 Dec 2020 10:12:56 -0600

On 12/15/20 4:55 PM, Bjorn Andersson wrote:
> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>
>> Create an unbound high priority workqueue for recovery tasks.

I have been looking at a different issue caused by crash
notification.  What happened was that the modem crashed while the
AP was in system suspend (or possibly resuming).  There is no
guarantee that the system will have called a driver's ->resume
callback by the time the crash notification is delivered.

In my case (the IPA driver), handling a modem crash cannot be done
while the driver is suspended; i.e. the activities in its ->resume
callback must be completed before we can recover from the crash.

For this reason I might like to change the way the crash
notification is handled.  What I'd rather see is for the workqueue
work not to run until user space is unfrozen, which would guarantee
that all drivers that have registered for a crash notification will
have been resumed by the time the notification arrives.

I'm not sure how that interacts with what you are looking for here.
I think the workqueue could still be unbound, but its work would be
delayed longer before any notification (and recovery) started.

					-Alex

> This simply repeats $subject
>
>> Recovery time is an important parameter for a subsystem and there
>> might be situations where multiple subsystems crash around the same
>> time. Scheduling into an unbound workqueue increases parallelization
>> and avoids time impact.
>
> You should be able to write this more succinctly.  The important
> part is that you want an unbound workqueue to allow recovery to
> happen in parallel - which naturally implies that you care about
> recovery latency.
>
>> Also creating a high priority workqueue will utilize separate
>> worker threads with higher nice values than normal ones.
>>
>
> This doesn't describe why you need the higher priority.
>
>
> I believe, and certainly with the in-line coredump, that we're
> running our recovery work for way too long to be queued on the
> system_wq.  As such the content of the patch looks good!
>
> Regards,
> Bjorn
>
>> Signed-off-by: Rishabh Bhatnagar
>> ---
>>  drivers/remoteproc/remoteproc_core.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
>> index 46c2937..8fd8166 100644
>> --- a/drivers/remoteproc/remoteproc_core.c
>> +++ b/drivers/remoteproc/remoteproc_core.c
>> @@ -48,6 +48,8 @@ static DEFINE_MUTEX(rproc_list_mutex);
>>  static LIST_HEAD(rproc_list);
>>  static struct notifier_block rproc_panic_nb;
>>
>> +static struct workqueue_struct *rproc_wq;
>> +
>>  typedef int (*rproc_handle_resource_t)(struct rproc *rproc,
>>  				 void *, int offset, int avail);
>>
>> @@ -2475,7 +2477,7 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
>>  		rproc->name, rproc_crash_to_string(type));
>>
>>  	/* create a new task to handle the error */
>> -	schedule_work(&rproc->crash_handler);
>> +	queue_work(rproc_wq, &rproc->crash_handler);
>>  }
>>  EXPORT_SYMBOL(rproc_report_crash);
>>
>> @@ -2520,6 +2522,10 @@ static void __exit rproc_exit_panic(void)
>>
>>  static int __init remoteproc_init(void)
>>  {
>> +	rproc_wq = alloc_workqueue("rproc_wq", WQ_UNBOUND | WQ_HIGHPRI, 0);
>> +	if (!rproc_wq)
>> +		return -ENOMEM;
>> +
>>  	rproc_init_sysfs();
>>  	rproc_init_debugfs();
>>  	rproc_init_cdev();
>> @@ -2536,6 +2542,7 @@ static void __exit remoteproc_exit(void)
>>  	rproc_exit_panic();
>>  	rproc_exit_debugfs();
>>  	rproc_exit_sysfs();
>> +	destroy_workqueue(rproc_wq);
>>  }
>>  module_exit(remoteproc_exit);
>>
>> --
>> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
>> a Linux Foundation Collaborative Project
>>
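
P.S. To make the suspend interaction above concrete: one possible
(untested, purely illustrative) way to hold the recovery work back
until user space is thawed would be to also mark the workqueue
freezable.  The names below (rproc_recovery_wq, rproc_recovery_wq_init,
rproc_recovery_report) are made up for this sketch; only the
alloc_workqueue()/queue_work() calls mirror the patch, and WQ_FREEZABLE
is my assumption about how the delay could be expressed:

#include <linux/workqueue.h>

/* Hypothetical names for this sketch only; not from the patch. */
static struct workqueue_struct *rproc_recovery_wq;

static int rproc_recovery_wq_init(void)
{
	/*
	 * WQ_UNBOUND | WQ_HIGHPRI: run recovery for multiple remote
	 * processors in parallel with little scheduling delay, which
	 * is what the patch is after.
	 * WQ_FREEZABLE (the assumed addition): queued work is held
	 * while the system is suspended and only runs once tasks are
	 * thawed, i.e. after drivers have had their ->resume called.
	 */
	rproc_recovery_wq = alloc_workqueue("rproc_recovery_wq",
					    WQ_UNBOUND | WQ_HIGHPRI |
					    WQ_FREEZABLE, 0);
	if (!rproc_recovery_wq)
		return -ENOMEM;

	return 0;
}

/* Reporting a crash would then queue work exactly as in the patch. */
static void rproc_recovery_report(struct work_struct *crash_handler)
{
	queue_work(rproc_recovery_wq, crash_handler);
}

Whether freezing the workqueue is the right mechanism, versus deferring
the notification itself, is exactly the open question I raised above.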