From: Mathieu Poirier
Date: Thu, 10 Nov 2022 13:50:10 -0700
Subject: Re: [PATCH v4] remoteproc: core: do pm relax when in RPROC_OFFLINE
To: "Aiqun(Maria) Yu"
Cc: Arnaud POULIQUEN, linux-remoteproc@vger.kernel.org, linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, quic_clew@quicinc.com

I had a couple of good discussions with our power management expert and even then, the way forward isn't as clear as I would have liked. I am currently travelling and as such don't have the time to go into greater detail; I will do that next week.

On Sun, 6 Nov 2022 at 18:14, Aiqun(Maria) Yu wrote:
>
> Hi,
>
> On 11/4/2022 11:59 PM, Mathieu Poirier wrote:
> > On Thu, Nov 03, 2022 at 10:03:49AM +0800, Aiqun(Maria) Yu wrote:
> >> On 11/3/2022 2:03 AM, Mathieu Poirier wrote:
> >>> On Wed, Nov 02, 2022 at 06:53:49PM +0800, Aiqun(Maria) Yu wrote:
> >>>> Hi,
> >>>>
> >>>> Let me think about this carefully.
> >>>>
> >>>> When in the RPROC_RECOVERY_FAIL case, do we want to re-do the recovery process again or just leave out the pm_relax?
> >>>
> >>> Neither.
> >>>
> >>> When a recovery fails we don't want to call pm_relax(). The code in rproc_crash_handler_work() becomes:
> >>>
> >>>	if (rproc->state == RPROC_OFFLINE) {
> >>>		/* We have raced with rproc_shutdown() */
> >>>		pm_relax();
> >>>		mutex_unlock(&rproc->lock);
> >>>		return;
> >>>	}
> >>>
> >>>	if (rproc->state == RPROC_CRASHED ||
> >>>	    rproc->state == RPROC_RECOVERY_FAILED) {
> >>>		/* handle only the first crash detected */
> >>>		mutex_unlock(&rproc->lock);
> >>>		return;
> >>>	}
> >>>
> >>> RPROC_RECOVERY_FAILED gets set in rproc_boot_recovery() if request_firmware() or rproc_start() fail. Function rproc_trigger_recovery() needs to allow recovery when the remote processor is in the RPROC_RECOVERY_FAILED state. As such the condition becomes:
> >>>
> >>>	/* State could have changed before we got the mutex */
> >>>	if (rproc->state != RPROC_CRASHED &&
> >>>	    rproc->state != RPROC_RECOVERY_FAILED)
> >>>		goto unlock_mutex;
> >>>
> >>> Start with that and we can look at corner cases (if any exist) with a fresh patchset. Note that I have not addressed the attach/detach() scenario in the above.
> >>
> >> If we don't deal with the recovery-failed case with the correct pm_relax call, it may leave the device in a state where it cannot enter suspend.
> >
> > That is what I am looking for. We don't want to give the impression that everything is fine by allowing the device to suspend. If the remote processor can't be recovered then it needs to be dealt with.
>
> For the normal recovery-failed case, pm_relax still needs to be called so the device is not prevented from suspending. That is what we already do in rproc_crash_handler_work() for a recovery failure in the normal case: rproc_crash_handler_work() does not check the return value of rproc_trigger_recovery() and always calls pm_relax.
>
> For the current concurrency corner case as well, it is better to stay consistent with the current design for recovery-failure scenarios in the normal case.
>
> I personally agree that we shouldn't do nothing when the rproc_crash_handler_work() check finds a RPROC_RECOVERY_FAILED state, because the crash may have happened while it was trying to do the recovery.
> So I suggested continuing to try trigger recovery again instead of doing nothing and bailing out if it is in the RPROC_RECOVERY_FAILED state.
>
> >> Because the first RPROC_RECOVERY_FAIL case cannot ensure pm_relax has been called before the second crash handler calls pm_stay_awake.
> >
> > I've been thinking about that part. I don't think adding a wake_count to control calls to pm_stay_awake()/pm_relax() is the best way to go. There is a similar count happening in the PM runtime subsystem and that is what we should be using. I have asked a power management expert at Linaro for guidance with this matter. I should be able to get back to you with a way forward by the end of next week.
> >
> Thanks for providing a specific date as well. I will wait for your reply before the next patchset then.
>
> >> So, what about an atomic count along with pm_relax and pm_stay_awake?
> >>
> >> struct rproc {
> >> 	...
> >> 	atomic_t wake_count;
> >> 	...
> >> }
> >>
> >> rproc_pm_stay_awake()
> >> {
> >> 	atomic_inc(&wake_count);
> >> 	pm_stay_awake();
> >> }
> >>
> >> rproc_pm_relax()
> >> {
> >> 	if (atomic_dec_return(&wake_count) == 0)
> >> 		pm_relax();
> >> }
> >>
> >> It can be used in code like:
> >>
> >> rproc_report_crash()
> >> {
> >> 	...
> >> 	rproc_pm_stay_awake();
> >> 	queue_work();
> >> 	...
> >> }
> >>
> >> rproc_crash_handler_work()
> >> {
> >> 	...
> >> 	if (rproc->state == RPROC_OFFLINE || rproc->state == RPROC_CRASHED) {
> >> 		/* We have raced with rproc_shutdown() */
> >> 		rproc_pm_relax();
> >> 		mutex_unlock(&rproc->lock);
> >> 		return;
> >> 	}
> >> 	...
> >> }
> >>
> >>>
> >>> Thanks,
> >>> Mathieu
> >>>
> >>>>
> >>>> recovery fail case 1:
> >>>> |                                     | first crash interrupt issued
> >>>> | second crash interrupt issued       |  rproc_report_crash()
> >>>> |  rproc_report_crash()               |   pm_stay_awake()
> >>>> |   pm_stay_awake()                   |   queue_work()
> >>>> |   queue_work()                      | rproc_crash_handler_work()
> >>>> |                                     |  mutex_lock(&rproc->lock);
> >>>> |                                     |  rproc_stop()
> >>>> | rproc_crash_handler_work()          |  rproc->state = RPROC_OFFLINE;
> >>>> |                                     |  RPROC_RECOVERY_FAIL //new
> >>>> |                                     |  mutex_unlock(&rproc->lock);
> >>>> |  mutex_lock(&rproc->lock);          |  pm_relax()
> >>>> |  if (rproc->state == RPROC_OFFLINE) |
> >>>> |    return // shouldn't do pm_relax if RPROC_RECOVERY_FAIL? |
> >>>> |  mutex_unlock(&rproc->lock);        |
> >>>>
> >>>> recovery fail case 2:
> >>>> |                                     | first crash interrupt issued
> >>>> |                                     |  rproc_report_crash()
> >>>> |                                     |   pm_stay_awake()
> >>>> |                                     |   queue_work()
> >>>> |                                     | rproc_crash_handler_work()
> >>>> |                                     |  mutex_lock(&rproc->lock);
> >>>> |                                     |  rproc_stop()
> >>>> |                                     |  rproc->state = RPROC_OFFLINE;
> >>>> |                                     |  RPROC_RECOVERY_FAIL //new
> >>>> |                                     |  mutex_unlock(&rproc->lock);
> >>>> |                                     |  pm_relax()
> >>>> |
> >>>> | second crash interrupt issued       |
> >>>> |  rproc_report_crash()               |
> >>>> |   pm_stay_awake()                   |
> >>>> |   queue_work()                      |
> >>>> |  pm_stay_awake()                    |
> >>>> |  mutex_lock(&rproc->lock);          |
> >>>> |  if (rproc->state == RPROC_OFFLINE) |
> >>>> |    return // still need to do pm_relax if RPROC_RECOVERY_FAIL? |
> >>>> |  mutex_unlock(&rproc->lock);        |
> >>>>
> >>>> Maybe I can have:
> >>>> 1. the pm_stay_awake and pm_relax count-based and called in pairs, to fix the current concurrency issue.
> >>>> 2. RPROC_RECOVERY_FAIL as another patch, to continue trying the recovery work.
> >>>> 3. handle the RPROC_DETACHED case.
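For illustration only, here is a minimal, untested kernel-style sketch of the counted wakeup-source idea proposed above. The wake_count field and the rproc_pm_stay_awake()/rproc_pm_relax() helpers come from the proposal in this thread and do not exist in the current remoteproc core; pm_stay_awake()/pm_relax() are the standard helpers from linux/pm_wakeup.h.

	#include <linux/atomic.h>
	#include <linux/pm_wakeup.h>
	#include <linux/remoteproc.h>

	/* Hypothetical field added to struct rproc: atomic_t wake_count; */

	static void rproc_pm_stay_awake(struct rproc *rproc)
	{
		/* Every crash report takes a reference on the wakeup source. */
		atomic_inc(&rproc->wake_count);
		pm_stay_awake(rproc->dev.parent);
	}

	static void rproc_pm_relax(struct rproc *rproc)
	{
		/* Release the wakeup source only when the last reference is dropped. */
		if (atomic_dec_return(&rproc->wake_count) == 0)
			pm_relax(rproc->dev.parent);
	}

As Mathieu notes above, the PM runtime subsystem already maintains a similar reference count, so an open-coded counter like this may not be the preferred direction.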
> >>>> On 11/2/2022 4:11 AM, Mathieu Poirier wrote:
> >>>>> On Fri, 28 Oct 2022 at 09:31, Arnaud POULIQUEN wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> >>>>>>> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
> >>>>>>>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu wrote:
> >>>>>>>>>
> >>>>>>>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> >>>>>>>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> >>>>>>>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> >>>>>>>>>>>> Hi Mathieu,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> >>>>>>>>>>>>> Please add what has changed from one version to another, either in a cover letter or after the "Signed-off-by". There are many examples of how to do that on the mailing list.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Thx for the information, will take a note and benefit from it next time.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> >>>>>>>>>>>>>> RPROC_OFFLINE state indicates there is no recovery process in progress and no chance to do the pm_relax. Because when recovering from a crash, rproc->lock is held and the state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING, and then rproc->lock is unlocked.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You are correct - because the lock is held, rproc->state should be set to RPROC_RUNNING when rproc_trigger_recovery() returns. If that is not the case then something went wrong.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning, so we know the remote processor was stopped. Therefore if rproc->state is set to RPROC_OFFLINE, something went wrong in either request_firmware() or rproc_start(). Either way the remote processor is offline and the system is probably in an unknown/unstable state. As such I don't see how calling pm_relax() can help things along.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> RPROC_OFFLINE is also possible when rproc_shutdown is triggered and finishes successfully. Even if it is a multi-crash rproc_crash_handler_work contention issue and the last rproc_trigger_recovery bailed out with only rproc->state == RPROC_OFFLINE, it is still worth doing pm_relax in pairs, since the subsystem may still be recovered by the customer's next trigger of rproc_start, and we can make each error-out path clean with respect to PM resources.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I suggest spending time understanding what leads to the failure when recovering from a crash and addressing that problem(s).
> >>>>>>>>>>>>>
> >>>>>>>>>>>> In the current case, the customer's information is that the issue happened when rproc_shutdown was triggered at around the same time, so it is not an issue caused by an error out of rproc_trigger_recovery.
> >>>>>>>>>>>
> >>>>>>>>>>> That is a very important element to consider and should have been mentioned from the beginning.
> >>>>>>>>>>> What I see happening is the following:
> >>>>>>>>>>>
> >>>>>>>>>>> rproc_report_crash()
> >>>>>>>>>>>     pm_stay_awake()
> >>>>>>>>>>>     queue_work() // current thread is suspended
> >>>>>>>>>>>
> >>>>>>>>>>> rproc_shutdown()
> >>>>>>>>>>>     rproc_stop()
> >>>>>>>>>>>         rproc->state = RPROC_OFFLINE;
> >>>>>>>>>>>
> >>>>>>>>>>> rproc_crash_handler_work()
> >>>>>>>>>>>     if (rproc->state == RPROC_OFFLINE)
> >>>>>>>>>>>         return // pm_relax() is not called
> >>>>>>>>>>>
> >>>>>>>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and rproc_detach(), along with a very descriptive comment as to why it is needed.
> >>>>>>>>>>
> >>>>>>>>>> Thinking about this further, there are more ramifications to consider. Please confirm the above scenario is what you are facing. I will advise on how to move forward if that is the case.
> >>>>>>>>>>
> >>>>>>>>> Not sure if the situation was clear or not, so resending the email again.
> >>>>>>>>>
> >>>>>>>>> The above scenario is what the customer is facing: a crash happened while a shutdown was triggered at the same time.
> >>>>>>>>
> >>>>>>>> Unfortunately this is not enough detail to address a problem as complex as this one.
> >>>>>>>>
> >>>>>>>>> And the device cannot go to suspend state after that. The subsystem can still be started normally after this.
> >>>>>>>>
> >>>>>>>> If the code flow I pasted above reflects the problem at hand, the current patch will not be sufficient to address the issue. If Arnaud confirms my suspicions we will have to think about a better solution.
> >>>>>>>
> >>>>>>> Hi Mathieu,
> >>>>>>>
> >>>>>>> Could you please give more details on any side effects, other than the power issue, of the current scenario? Why is the current patch not sufficient?
> >>>>>>>
> >>>>>>> Here is the current scenario in detail, with the rproc->lock information:
> >>>>>>>
> >>>>>>> | subsystem crashed interrupt issued  | user triggers shutdown
> >>>>>>> |  rproc_report_crash()               |
> >>>>>>> |   pm_stay_awake()                   |
> >>>>>>> |   queue_work()                      |
> >>>>>>> |                                     | rproc_shutdown
> >>>>>>> |                                     |  mutex_lock(&rproc->lock);
> >>>>>>> |                                     |  rproc_stop()
> >>>>>>> | rproc_crash_handler_work()          |  rproc->state = RPROC_OFFLINE;
> >>>>>>> |                                     |  mutex_unlock(&rproc->lock);
> >>>>>>> |  mutex_lock(&rproc->lock);          |
> >>>>>>> |  if (rproc->state == RPROC_OFFLINE) |
> >>>>>>> |    return // pm_relax() is not called | rproc_boot
> >>>>>>> |  mutex_unlock(&rproc->lock);        |
> >>>>>>> |                                     |  mutex_lock(&rproc->lock);
> >>>>>>> |                                     |  rproc_start()
> >>>>>>> |                                     |  mutex_unlock(&rproc->lock);
> >>>>>>
> >>>>>> Agree with Mathieu, this is not so simple.
> >>>>>
> >>>>> Thanks for looking into this.
> >>>>>
> >>>>>> Here is my view, hoping I haven't missed a point in your discussion or another corner case.
> >>>>>>
> >>>>>> I tried to analyze the issues (in what follows, the term "condition" means the "if" condition in which Aiqun proposes to add the fix):
> >>>>>>
> >>>>>> I can see 4 use cases with a race condition:
> >>>>>>
> >>>>>> 1) crash report while one is already being treated (rproc_boot_recovery called)
> >>>>>>    => not a real use case, as if the remote processor is crashed we should not have a second crash report
> >>>>>>
> >>>>>
> >>>>> That part is of great concern to me.
> >>>>> *Theoretically* we should not get a new crash report while one has already been dispatched, but the current code accounts for this scenario and as such the possibility can't be dismissed. Therefore we need to expect rproc_report_crash() to be called multiple times before a single instance of rproc_boot_recovery() is scheduled.
> >>>>
> >>>>>
> >>>>>> 2) rproc_stop executed between the queuing of the crash work and the call of rproc_crash_handler_work
> >>>>>>    => rproc->state = RPROC_OFFLINE
> >>>>>>    => we enter "the condition" and pm_relax has to be called
> >>>>>>    => this commit's fix should solve this use case
> >>>>>>
> >>>>>> 3) rproc_detach executed between the queuing of the crash work and the call of rproc_crash_handler_work
> >>>>>>    => rproc->state = RPROC_DETACHED;
> >>>>>>    => we don't enter "the condition", and there is an issue because the recovery re-attaches to the remote processor
> >>>>>>    => but pm_relax is called
> >>>>>>    => probably needs an extra fix to avoid re-attaching
> >>>>>>
> >>>>>> 4) crash report while one is already being treated (rproc_attach_recovery called); this corresponds to an autonomous reboot of the remote processor, with a new crash
> >>>>>>    => rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
> >>>>>>    4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
> >>>>>>    => should call pm_relax if rproc->recovery_disabled = true
> >>>>>>    => the commit does not work for this use case
> >>>>>>
> >>>>>>    4)b) rproc->state = RPROC_DETACHED if recovery fails
> >>>>>>    => error case with an unstable state
> >>>>>>    => how to differentiate it from use case 3)?
> >>>>>>    => introduce a RPROC_RECOVERY_FAIL state?
> >>>>>>
> >>>>>
> >>>>> The case where a recovery fails needs to be considered and is the reason the original patch doesn't work. Right now in rproc_crash_handler_work(), it is not possible to differentiate between a legitimate shutdown request (scenario #2 above) and a recovery that went wrong. I think introducing RPROC_RECOVERY_FAIL would greatly simplify things.
> >>>>>
> >>>>> My initial evaluation had not considered the attach/detach scenarios - thanks for adding that to the mix.
> >>>>>
> >>>>> Aiqun, please send a new patchset that adds a new remote processor state, i.e. RPROC_RECOVERY_FAIL. There should also be another patch in that set that takes the attach/detach scenarios into account. The code between the v6.0 and v6.1 cycles has changed a lot in that area, so make sure to properly rebase.
> >>>>>
> >>>> I will try.
> >>>>
> >>>>>>
> >>>>>> Then pm_stay_awake is called when the crash work is queued. It seems coherent to me to call the pm_relax in the work handler.
> >>>>>>
> >>>>>> Here is a quick and dirty patch (not tested) that should take into account the main use cases (except 1) and 4)b)):
> >>>>>>
> >>>>>> @@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)
> >>>>>>
> >>>>>>  	mutex_lock(&rproc->lock);
> >>>>>>
> >>>>>> -	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> >>>>>> +	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
> >>>>>> +	    rproc->state == RPROC_DETACHED) {
> >>>>>>  		/* handle only the first crash detected */
> >>>>>> +
> >>>>>> +		/*
> >>>>>> +		 * Call pm_relax() in the following use cases:
> >>>>>> +		 * - the remote processor has been stopped by the user
> >>>>>> +		 * - the remote processor is detached
> >>>>>> +		 * - the remote proc has an autonomous reset but recovery_disabled is true.
> >>>>>> +		 */
> >>>>>> +		if (rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
> >>>>>> +			pm_relax(rproc->dev.parent);
> >>>>>>  		mutex_unlock(&rproc->lock);
> >>>>>>  		return;
> >>>>>>  	}
> >>>>>>
> >>>>>> Regards,
> >>>>>> Arnaud
> >>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Mathieu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> When the state is RPROC_OFFLINE it means a separate rproc_stop request was done and there is no need to hold the wakeup source in the crash handler to recover any more.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Signed-off-by: Maria Yu
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>  drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> >>>>>>>>>>>>>>  1 file changed, 11 insertions(+)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
> >>>>>>>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
> >>>>>>>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
> >>>>>>>>>>>>>>  	if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> >>>>>>>>>>>>>>  		/* handle only the first crash detected */
> >>>>>>>>>>>>>>  		mutex_unlock(&rproc->lock);
> >>>>>>>>>>>>>> +		/*
> >>>>>>>>>>>>>> +		 * RPROC_OFFLINE state indicates there is no recovery process
> >>>>>>>>>>>>>> +		 * in progress and no chance to have pm_relax in place.
> >>>>>>>>>>>>>> +		 * Because when recovering from a crash, rproc->lock is held and
> >>>>>>>>>>>>>> +		 * the state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>>>>>>>>>>> +		 * and then rproc->lock is unlocked.
> >>>>>>>>>>>>>> +		 * RPROC_OFFLINE is only an intermediate state in the recovery
> >>>>>>>>>>>>>> +		 * process.
> >>>>>>>>>>>>>> +		 */
> >>>>>>>>>>>>>> +		if (rproc->state == RPROC_OFFLINE)
> >>>>>>>>>>>>>> +			pm_relax(rproc->dev.parent);
> >>>>>>>>>>>>>>  		return;
> >>>>>>>>>>>>>>  	}
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> 2.7.4
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Thx and BRs,
> >>>>>>>>>>>> Aiqun(Maria) Yu
>
> --
> Thx and BRs,
> Aiqun(Maria) Yu
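As a reference for the patchset requested above, here is a rough, untested sketch of what the proposed state could look like. It assumes the v6.1-era layout of enum rproc_state in include/linux/remoteproc.h; RPROC_RECOVERY_FAILED is only a proposal from this thread, not existing code, and the two snippets simply restate the checks Mathieu describes earlier.

	/* include/linux/remoteproc.h (sketch): add the proposed state */
	enum rproc_state {
		RPROC_OFFLINE		= 0,
		RPROC_SUSPENDED		= 1,
		RPROC_RUNNING		= 2,
		RPROC_CRASHED		= 3,
		RPROC_DELETED		= 4,
		RPROC_ATTACHED		= 5,
		RPROC_DETACHED		= 6,
		RPROC_RECOVERY_FAILED	= 7,	/* proposed: recovery attempt failed */
		RPROC_LAST		= 8,
	};

	/* In rproc_crash_handler_work(), with rproc->lock held: */
	if (rproc->state == RPROC_OFFLINE) {
		/* We have raced with rproc_shutdown() */
		pm_relax(rproc->dev.parent);
		mutex_unlock(&rproc->lock);
		return;
	}

	if (rproc->state == RPROC_CRASHED ||
	    rproc->state == RPROC_RECOVERY_FAILED) {
		/* handle only the first crash detected */
		mutex_unlock(&rproc->lock);
		return;
	}

	/* In rproc_trigger_recovery(), with rproc->lock held: */
	if (rproc->state != RPROC_CRASHED &&
	    rproc->state != RPROC_RECOVERY_FAILED)
		goto unlock_mutex;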