2021-02-25 08:11:21

by Bhaumik Bhatt

[permalink] [raw]
Subject: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

In current design, whenever the BHI interrupt is fired, the
execution environment is updated. This can cause race conditions
and impede ongoing power up/down processing. For example, if a
power down is in progress, MHI host updates to a local "disabled"
execution environment. If a BHI interrupt fires later, that value
gets replaced with one from the BHI EE register. This impacts the
controller as it does not expect multiple RDDM execution
environment change status callbacks as an example. Another issue
would be that the device can enter mission mode and the execution
environment is updated, while device creation for SBL channels is
still going on due to slower PM state worker thread run, leading
to multiple attempts at opening the same channel.

Ensure that EE changes are handled only from appropriate places
and occur one after another and handle only PBL modes or RDDM EE
changes as critical events directly from the interrupt handler.
Simplify handling by waiting for SYS ERROR before handling RDDM.
This also makes sure that we use the correct execution environment
to notify the controller driver when the device resets to one of
the PBL execution environments.

Signed-off-by: Bhaumik Bhatt <[email protected]>
---
drivers/bus/mhi/core/main.c | 40 +++++++++++++++++++++-------------------
drivers/bus/mhi/core/pm.c | 7 ++++---
2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 7a2e98c..9715f51 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -430,7 +430,7 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
struct device *dev = &mhi_cntrl->mhi_dev->dev;
enum mhi_state state = MHI_STATE_MAX;
enum mhi_pm_state pm_state = 0;
- enum mhi_ee_type ee = 0;
+ enum mhi_ee_type ee = MHI_EE_MAX;

write_lock_irq(&mhi_cntrl->pm_lock);
if (!MHI_REG_ACCESS_VALID(mhi_cntrl->pm_state)) {
@@ -439,8 +439,7 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
}

state = mhi_get_mhi_state(mhi_cntrl);
- ee = mhi_cntrl->ee;
- mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
+ ee = mhi_get_exec_env(mhi_cntrl);
dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
TO_MHI_STATE_STR(state));
@@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
}
write_unlock_irq(&mhi_cntrl->pm_lock);

- /* If device supports RDDM don't bother processing SYS error */
- if (mhi_cntrl->rddm_image) {
- /* host may be performing a device power down already */
- if (!mhi_is_active(mhi_cntrl))
- goto exit_intvec;
+ if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
+ goto exit_intvec;

- if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee) {
+ switch (ee) {
+ case MHI_EE_RDDM:
+ /* proceed if power down is not already in progress */
+ if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
+ mhi_cntrl->ee = ee;
wake_up_all(&mhi_cntrl->state_event);
}
- goto exit_intvec;
- }
-
- if (pm_state == MHI_PM_SYS_ERR_DETECT) {
+ break;
+ case MHI_EE_PBL:
+ case MHI_EE_EDL:
+ case MHI_EE_PTHRU:
+ mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
+ mhi_cntrl->ee = ee;
wake_up_all(&mhi_cntrl->state_event);
-
- /* For fatal errors, we let controller decide next step */
- if (MHI_IN_PBL(ee))
- mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
- else
- mhi_pm_sys_err_handler(mhi_cntrl);
+ mhi_pm_sys_err_handler(mhi_cntrl);
+ break;
+ default:
+ wake_up_all(&mhi_cntrl->state_event);
+ mhi_pm_sys_err_handler(mhi_cntrl);
+ break;
}

exit_intvec:
diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
index c09ec13..c870fa8 100644
--- a/drivers/bus/mhi/core/pm.c
+++ b/drivers/bus/mhi/core/pm.c
@@ -377,21 +377,22 @@ static int mhi_pm_mission_mode_transition(struct mhi_controller *mhi_cntrl)
{
struct mhi_event *mhi_event;
struct device *dev = &mhi_cntrl->mhi_dev->dev;
- enum mhi_ee_type current_ee = mhi_cntrl->ee;
+ enum mhi_ee_type ee = MHI_EE_MAX, current_ee = mhi_cntrl->ee;
int i, ret;

dev_dbg(dev, "Processing Mission Mode transition\n");

write_lock_irq(&mhi_cntrl->pm_lock);
if (MHI_REG_ACCESS_VALID(mhi_cntrl->pm_state))
- mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
+ ee = mhi_get_exec_env(mhi_cntrl);

- if (!MHI_IN_MISSION_MODE(mhi_cntrl->ee)) {
+ if (!MHI_IN_MISSION_MODE(ee)) {
mhi_cntrl->pm_state = MHI_PM_LD_ERR_FATAL_DETECT;
write_unlock_irq(&mhi_cntrl->pm_lock);
wake_up_all(&mhi_cntrl->state_event);
return -EIO;
}
+ mhi_cntrl->ee = ee;
write_unlock_irq(&mhi_cntrl->pm_lock);

wake_up_all(&mhi_cntrl->state_event);
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


2021-03-04 15:39:17

by Loic Poulain

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

On Thu, 25 Feb 2021 at 00:23, Bhaumik Bhatt <[email protected]> wrote:
>
> In current design, whenever the BHI interrupt is fired, the
> execution environment is updated. This can cause race conditions
> and impede ongoing power up/down processing. For example, if a
> power down is in progress, MHI host updates to a local "disabled"
> execution environment. If a BHI interrupt fires later, that value
> gets replaced with one from the BHI EE register. This impacts the
> controller as it does not expect multiple RDDM execution
> environment change status callbacks as an example. Another issue
> would be that the device can enter mission mode and the execution
> environment is updated, while device creation for SBL channels is
> still going on due to slower PM state worker thread run, leading
> to multiple attempts at opening the same channel.
>
> Ensure that EE changes are handled only from appropriate places
> and occur one after another and handle only PBL modes or RDDM EE
> changes as critical events directly from the interrupt handler.
> Simplify handling by waiting for SYS ERROR before handling RDDM.
> This also makes sure that we use the correct execution environment
> to notify the controller driver when the device resets to one of
> the PBL execution environments.
>
> Signed-off-by: Bhaumik Bhatt <[email protected]>

Looking good now

Reviewed-by: Loic Poulain <[email protected]>

2021-03-10 14:06:50

by Manivannan Sadhasivam

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

On Wed, Feb 24, 2021 at 03:23:04PM -0800, Bhaumik Bhatt wrote:
> In current design, whenever the BHI interrupt is fired, the
> execution environment is updated. This can cause race conditions
> and impede ongoing power up/down processing. For example, if a
> power down is in progress, MHI host updates to a local "disabled"
> execution environment. If a BHI interrupt fires later, that value
> gets replaced with one from the BHI EE register. This impacts the
> controller as it does not expect multiple RDDM execution
> environment change status callbacks as an example. Another issue
> would be that the device can enter mission mode and the execution
> environment is updated, while device creation for SBL channels is
> still going on due to slower PM state worker thread run, leading
> to multiple attempts at opening the same channel.
>
> Ensure that EE changes are handled only from appropriate places
> and occur one after another and handle only PBL modes or RDDM EE
> changes as critical events directly from the interrupt handler.
> Simplify handling by waiting for SYS ERROR before handling RDDM.
> This also makes sure that we use the correct execution environment
> to notify the controller driver when the device resets to one of
> the PBL execution environments.
>
> Signed-off-by: Bhaumik Bhatt <[email protected]>

Reviewed-by: Manivannan Sadhasivam <[email protected]>

Thanks,
Mani

> ---
> drivers/bus/mhi/core/main.c | 40 +++++++++++++++++++++-------------------
> drivers/bus/mhi/core/pm.c | 7 ++++---
> 2 files changed, 25 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 7a2e98c..9715f51 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -430,7 +430,7 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
> struct device *dev = &mhi_cntrl->mhi_dev->dev;
> enum mhi_state state = MHI_STATE_MAX;
> enum mhi_pm_state pm_state = 0;
> - enum mhi_ee_type ee = 0;
> + enum mhi_ee_type ee = MHI_EE_MAX;
>
> write_lock_irq(&mhi_cntrl->pm_lock);
> if (!MHI_REG_ACCESS_VALID(mhi_cntrl->pm_state)) {
> @@ -439,8 +439,7 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
> }
>
> state = mhi_get_mhi_state(mhi_cntrl);
> - ee = mhi_cntrl->ee;
> - mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
> + ee = mhi_get_exec_env(mhi_cntrl);
> dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> TO_MHI_STATE_STR(state));
> @@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
> }
> write_unlock_irq(&mhi_cntrl->pm_lock);
>
> - /* If device supports RDDM don't bother processing SYS error */
> - if (mhi_cntrl->rddm_image) {
> - /* host may be performing a device power down already */
> - if (!mhi_is_active(mhi_cntrl))
> - goto exit_intvec;
> + if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
> + goto exit_intvec;
>
> - if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee) {
> + switch (ee) {
> + case MHI_EE_RDDM:
> + /* proceed if power down is not already in progress */
> + if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
> mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
> + mhi_cntrl->ee = ee;
> wake_up_all(&mhi_cntrl->state_event);
> }
> - goto exit_intvec;
> - }
> -
> - if (pm_state == MHI_PM_SYS_ERR_DETECT) {
> + break;
> + case MHI_EE_PBL:
> + case MHI_EE_EDL:
> + case MHI_EE_PTHRU:
> + mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
> + mhi_cntrl->ee = ee;
> wake_up_all(&mhi_cntrl->state_event);
> -
> - /* For fatal errors, we let controller decide next step */
> - if (MHI_IN_PBL(ee))
> - mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
> - else
> - mhi_pm_sys_err_handler(mhi_cntrl);
> + mhi_pm_sys_err_handler(mhi_cntrl);
> + break;
> + default:
> + wake_up_all(&mhi_cntrl->state_event);
> + mhi_pm_sys_err_handler(mhi_cntrl);
> + break;
> }
>
> exit_intvec:
> diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> index c09ec13..c870fa8 100644
> --- a/drivers/bus/mhi/core/pm.c
> +++ b/drivers/bus/mhi/core/pm.c
> @@ -377,21 +377,22 @@ static int mhi_pm_mission_mode_transition(struct mhi_controller *mhi_cntrl)
> {
> struct mhi_event *mhi_event;
> struct device *dev = &mhi_cntrl->mhi_dev->dev;
> - enum mhi_ee_type current_ee = mhi_cntrl->ee;
> + enum mhi_ee_type ee = MHI_EE_MAX, current_ee = mhi_cntrl->ee;
> int i, ret;
>
> dev_dbg(dev, "Processing Mission Mode transition\n");
>
> write_lock_irq(&mhi_cntrl->pm_lock);
> if (MHI_REG_ACCESS_VALID(mhi_cntrl->pm_state))
> - mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
> + ee = mhi_get_exec_env(mhi_cntrl);
>
> - if (!MHI_IN_MISSION_MODE(mhi_cntrl->ee)) {
> + if (!MHI_IN_MISSION_MODE(ee)) {
> mhi_cntrl->pm_state = MHI_PM_LD_ERR_FATAL_DETECT;
> write_unlock_irq(&mhi_cntrl->pm_lock);
> wake_up_all(&mhi_cntrl->state_event);
> return -EIO;
> }
> + mhi_cntrl->ee = ee;
> write_unlock_irq(&mhi_cntrl->pm_lock);
>
> wake_up_all(&mhi_cntrl->state_event);
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>

2021-08-23 18:45:33

by Jeffrey Hugo

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

On 2/24/2021 4:23 PM, Bhaumik Bhatt wrote:
> In current design, whenever the BHI interrupt is fired, the
> execution environment is updated. This can cause race conditions
> and impede ongoing power up/down processing. For example, if a
> power down is in progress, MHI host updates to a local "disabled"
> execution environment. If a BHI interrupt fires later, that value
> gets replaced with one from the BHI EE register. This impacts the
> controller as it does not expect multiple RDDM execution
> environment change status callbacks as an example. Another issue
> would be that the device can enter mission mode and the execution
> environment is updated, while device creation for SBL channels is
> still going on due to slower PM state worker thread run, leading
> to multiple attempts at opening the same channel.
>
> Ensure that EE changes are handled only from appropriate places
> and occur one after another and handle only PBL modes or RDDM EE
> changes as critical events directly from the interrupt handler.
> Simplify handling by waiting for SYS ERROR before handling RDDM.
> This also makes sure that we use the correct execution environment
> to notify the controller driver when the device resets to one of
> the PBL execution environments.
>
> Signed-off-by: Bhaumik Bhatt <[email protected]>

<snip>

> @@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
> }
> write_unlock_irq(&mhi_cntrl->pm_lock);
>
> - /* If device supports RDDM don't bother processing SYS error */
> - if (mhi_cntrl->rddm_image) {
> - /* host may be performing a device power down already */
> - if (!mhi_is_active(mhi_cntrl))
> - goto exit_intvec;
> + if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
> + goto exit_intvec;
>
> - if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee) {
> + switch (ee) {
> + case MHI_EE_RDDM:
> + /* proceed if power down is not already in progress */
> + if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
> mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
> + mhi_cntrl->ee = ee;
> wake_up_all(&mhi_cntrl->state_event);
> }
> - goto exit_intvec;
> - }
> -
> - if (pm_state == MHI_PM_SYS_ERR_DETECT) {
> + break;
> + case MHI_EE_PBL:
> + case MHI_EE_EDL:
> + case MHI_EE_PTHRU:
> + mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
> + mhi_cntrl->ee = ee;
> wake_up_all(&mhi_cntrl->state_event);
> -
> - /* For fatal errors, we let controller decide next step */
> - if (MHI_IN_PBL(ee))
> - mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
> - else
> - mhi_pm_sys_err_handler(mhi_cntrl);
> + mhi_pm_sys_err_handler(mhi_cntrl);
> + break;
> + default:
> + wake_up_all(&mhi_cntrl->state_event);
> + mhi_pm_sys_err_handler(mhi_cntrl);
> + break;
> }

Bhaumik, can you explain the above change? Before this patch (which is
now committed), if there was a fatal error, the controller was notified
(MHI_CB_FATAL_ERROR) and it decided all action. After this patch, the
controller is notified, but also the core attempts to handle the syserr.

This is a change in behavior, and seems to make a mess of the
controller, and possibly the core fighting each other.

Specifically, I'm rebasing the AIC100 driver onto 5.13, which has this
change, and I'm seeing a serious regression. I'm thinking that for the
PBL/EDL/PTHRU case, mhi_pm_sys_err_handler() should not be called.

Thoughts?

2021-08-23 19:20:58

by Bhaumik Bhatt

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

On 2021-08-23 11:43 AM, Jeffrey Hugo wrote:
> On 2/24/2021 4:23 PM, Bhaumik Bhatt wrote:
>> In current design, whenever the BHI interrupt is fired, the
>> execution environment is updated. This can cause race conditions
>> and impede ongoing power up/down processing. For example, if a
>> power down is in progress, MHI host updates to a local "disabled"
>> execution environment. If a BHI interrupt fires later, that value
>> gets replaced with one from the BHI EE register. This impacts the
>> controller as it does not expect multiple RDDM execution
>> environment change status callbacks as an example. Another issue
>> would be that the device can enter mission mode and the execution
>> environment is updated, while device creation for SBL channels is
>> still going on due to slower PM state worker thread run, leading
>> to multiple attempts at opening the same channel.
>>
>> Ensure that EE changes are handled only from appropriate places
>> and occur one after another and handle only PBL modes or RDDM EE
>> changes as critical events directly from the interrupt handler.
>> Simplify handling by waiting for SYS ERROR before handling RDDM.
>> This also makes sure that we use the correct execution environment
>> to notify the controller driver when the device resets to one of
>> the PBL execution environments.
>>
>> Signed-off-by: Bhaumik Bhatt <[email protected]>
>
> <snip>
>
>> @@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int
>> irq_number, void *priv)
>> }
>> write_unlock_irq(&mhi_cntrl->pm_lock);
>> - /* If device supports RDDM don't bother processing SYS error */
>> - if (mhi_cntrl->rddm_image) {
>> - /* host may be performing a device power down already */
>> - if (!mhi_is_active(mhi_cntrl))
>> - goto exit_intvec;
>> + if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
>> + goto exit_intvec;
>> - if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee) {
>> + switch (ee) {
>> + case MHI_EE_RDDM:
>> + /* proceed if power down is not already in progress */
>> + if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
>> mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
>> + mhi_cntrl->ee = ee;
>> wake_up_all(&mhi_cntrl->state_event);
>> }
>> - goto exit_intvec;
>> - }
>> -
>> - if (pm_state == MHI_PM_SYS_ERR_DETECT) {
>> + break;
>> + case MHI_EE_PBL:
>> + case MHI_EE_EDL:
>> + case MHI_EE_PTHRU:
>> + mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>> + mhi_cntrl->ee = ee;
>> wake_up_all(&mhi_cntrl->state_event);
>> -
>> - /* For fatal errors, we let controller decide next step */
>> - if (MHI_IN_PBL(ee))
>> - mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>> - else
>> - mhi_pm_sys_err_handler(mhi_cntrl);
>> + mhi_pm_sys_err_handler(mhi_cntrl);
>> + break;
>> + default:
>> + wake_up_all(&mhi_cntrl->state_event);
>> + mhi_pm_sys_err_handler(mhi_cntrl);
>> + break;
>> }
>
> Bhaumik, can you explain the above change? Before this patch (which
> is now committed), if there was a fatal error, the controller was
> notified (MHI_CB_FATAL_ERROR) and it decided all action. After this
> patch, the controller is notified, but also the core attempts to
> handle the syserr.
>
> This is a change in behavior, and seems to make a mess of the
> controller, and possibly the core fighting each other.
>
> Specifically, I'm rebasing the AIC100 driver onto 5.13, which has this
> change, and I'm seeing a serious regression. I'm thinking that for
> the PBL/EDL/PTHRU case, mhi_pm_sys_err_handler() should not be called.
>
> Thoughts?

I see. We use this heavily for entry to EDL use-cases.

We do require the mhi_pm_sys_err_handler() to be called in those cases
as any entry to PBL/EDL means there is need to clean-up MHI host and
notify all
its clients.

We include PTHRU in here because its a SYS ERROR in PBL modes.

Premise being the controller should not be in that business of doing any
of
the clean-up that is responsibility of the core driver. We're using this
feature
set to ensure controller is only notified.

What does AIC100 do on fatal error that you run in to issues? I don't
think
any of the other controllers do anything other than disabling runtime PM
since
device is down. Maybe there's some room for improvement.

Thanks,
Bhaumik
---
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum,
a Linux Foundation Collaborative Project

2021-08-23 19:41:02

by Jeffrey Hugo

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

On 8/23/2021 1:19 PM, Bhaumik Bhatt wrote:
> On 2021-08-23 11:43 AM, Jeffrey Hugo wrote:
>> On 2/24/2021 4:23 PM, Bhaumik Bhatt wrote:
>>> In current design, whenever the BHI interrupt is fired, the
>>> execution environment is updated. This can cause race conditions
>>> and impede ongoing power up/down processing. For example, if a
>>> power down is in progress, MHI host updates to a local "disabled"
>>> execution environment. If a BHI interrupt fires later, that value
>>> gets replaced with one from the BHI EE register. This impacts the
>>> controller as it does not expect multiple RDDM execution
>>> environment change status callbacks as an example. Another issue
>>> would be that the device can enter mission mode and the execution
>>> environment is updated, while device creation for SBL channels is
>>> still going on due to slower PM state worker thread run, leading
>>> to multiple attempts at opening the same channel.
>>>
>>> Ensure that EE changes are handled only from appropriate places
>>> and occur one after another and handle only PBL modes or RDDM EE
>>> changes as critical events directly from the interrupt handler.
>>> Simplify handling by waiting for SYS ERROR before handling RDDM.
>>> This also makes sure that we use the correct execution environment
>>> to notify the controller driver when the device resets to one of
>>> the PBL execution environments.
>>>
>>> Signed-off-by: Bhaumik Bhatt <[email protected]>
>>
>> <snip>
>>
>>> @@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int
>>> irq_number, void *priv)
>>>       }
>>>       write_unlock_irq(&mhi_cntrl->pm_lock);
>>>   -     /* If device supports RDDM don't bother processing SYS error */
>>> -    if (mhi_cntrl->rddm_image) {
>>> -        /* host may be performing a device power down already */
>>> -        if (!mhi_is_active(mhi_cntrl))
>>> -            goto exit_intvec;
>>> +    if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
>>> +        goto exit_intvec;
>>>   -        if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee) {
>>> +    switch (ee) {
>>> +    case MHI_EE_RDDM:
>>> +        /* proceed if power down is not already in progress */
>>> +        if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
>>>               mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
>>> +            mhi_cntrl->ee = ee;
>>>               wake_up_all(&mhi_cntrl->state_event);
>>>           }
>>> -        goto exit_intvec;
>>> -    }
>>> -
>>> -    if (pm_state == MHI_PM_SYS_ERR_DETECT) {
>>> +        break;
>>> +    case MHI_EE_PBL:
>>> +    case MHI_EE_EDL:
>>> +    case MHI_EE_PTHRU:
>>> +        mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>>> +        mhi_cntrl->ee = ee;
>>>           wake_up_all(&mhi_cntrl->state_event);
>>> -
>>> -        /* For fatal errors, we let controller decide next step */
>>> -        if (MHI_IN_PBL(ee))
>>> -            mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>>> -        else
>>> -            mhi_pm_sys_err_handler(mhi_cntrl);
>>> +        mhi_pm_sys_err_handler(mhi_cntrl);
>>> +        break;
>>> +    default:
>>> +        wake_up_all(&mhi_cntrl->state_event);
>>> +        mhi_pm_sys_err_handler(mhi_cntrl);
>>> +        break;
>>>       }
>>
>> Bhaumik, can you explain the above change?  Before this patch (which
>> is now committed), if there was a fatal error, the controller was
>> notified (MHI_CB_FATAL_ERROR) and it decided all action.  After this
>> patch, the controller is notified, but also the core attempts to
>> handle the syserr.
>>
>> This is a change in behavior, and seems to make a mess of the
>> controller, and possibly the core fighting each other.
>>
>> Specifically, I'm rebasing the AIC100 driver onto 5.13, which has this
>> change, and I'm seeing a serious regression.  I'm thinking that for
>> the PBL/EDL/PTHRU case, mhi_pm_sys_err_handler() should not be called.
>>
>> Thoughts?
>
> I see. We use this heavily for entry to EDL use-cases.
>
> We do require the mhi_pm_sys_err_handler() to be called in those cases
> as any entry to PBL/EDL means there is need to clean-up MHI host and
> notify all
> its clients.
>
> We include PTHRU in here because its a SYS ERROR in PBL modes.
>
> Premise being the controller should not be in that business of doing any of
> the clean-up that is responsibility of the core driver. We're using this
> feature
> set to ensure controller is only notified.
>
> What does AIC100 do on fatal error that you run in to issues? I don't think
> any of the other controllers do anything other than disabling runtime PM
> since
> device is down. Maybe there's some room for improvement.

Our usecase is PBL as a result of a full device crash. AIC100 doesn't
exercise the EDL/PTHRU usecases. (Just giving you some context, not
trying to imply EDL is not valuable to others for example).

In that case (FATAL_ERROR), our controller schedules a work item, and
then returnsas we assume FATAL_ERROR is notified in atomic context. We
need to do non-mhi cleanup first. Then we powerdown the MHI to cleanup
the MHI core, and kick off all the clients (its been a long while, but
initially, we were seeing the syserr handling in the core not
sufficiently kicking off the clients). Then we power up MHI. MHI will
init in PBL, still in syserr, and handle it.

I haven't fully traced everything, but we were getting into some really
bad states with the core triggering mhi_pm_sys_err_handler() per this patch.

Its important that we have control over the ordering of our cleanup, vs
the MHI cleanup. Sadly, non-atomic context (sleeping) is a requirement
of our cleanup.

Seems like we have differing requirements here. Hmm. What about an API
the controller can call, that does mhi_pm_sys_err_handler() (or
equivalent) so that the controller can control when the MHI core cleanup
is done, but doesn't need to be concerned with the details? Or, do you
have a suggestion?

2021-08-26 18:07:34

by Bhaumik Bhatt

[permalink] [raw]
Subject: Re: [PATCH v6 3/4] bus: mhi: core: Process execution environment changes serially

On 2021-08-23 12:38 PM, Jeffrey Hugo wrote:
> On 8/23/2021 1:19 PM, Bhaumik Bhatt wrote:
>> On 2021-08-23 11:43 AM, Jeffrey Hugo wrote:
>>> On 2/24/2021 4:23 PM, Bhaumik Bhatt wrote:
>>>> In current design, whenever the BHI interrupt is fired, the
>>>> execution environment is updated. This can cause race conditions
>>>> and impede ongoing power up/down processing. For example, if a
>>>> power down is in progress, MHI host updates to a local "disabled"
>>>> execution environment. If a BHI interrupt fires later, that value
>>>> gets replaced with one from the BHI EE register. This impacts the
>>>> controller as it does not expect multiple RDDM execution
>>>> environment change status callbacks as an example. Another issue
>>>> would be that the device can enter mission mode and the execution
>>>> environment is updated, while device creation for SBL channels is
>>>> still going on due to slower PM state worker thread run, leading
>>>> to multiple attempts at opening the same channel.
>>>>
>>>> Ensure that EE changes are handled only from appropriate places
>>>> and occur one after another and handle only PBL modes or RDDM EE
>>>> changes as critical events directly from the interrupt handler.
>>>> Simplify handling by waiting for SYS ERROR before handling RDDM.
>>>> This also makes sure that we use the correct execution environment
>>>> to notify the controller driver when the device resets to one of
>>>> the PBL execution environments.
>>>>
>>>> Signed-off-by: Bhaumik Bhatt <[email protected]>
>>>
>>> <snip>
>>>
>>>> @@ -452,27 +451,30 @@ irqreturn_t mhi_intvec_threaded_handler(int
>>>> irq_number, void *priv)
>>>>       }
>>>>       write_unlock_irq(&mhi_cntrl->pm_lock);
>>>>   -     /* If device supports RDDM don't bother processing SYS error
>>>> */
>>>> -    if (mhi_cntrl->rddm_image) {
>>>> -        /* host may be performing a device power down already */
>>>> -        if (!mhi_is_active(mhi_cntrl))
>>>> -            goto exit_intvec;
>>>> +    if (pm_state != MHI_PM_SYS_ERR_DETECT || ee == mhi_cntrl->ee)
>>>> +        goto exit_intvec;
>>>>   -        if (mhi_cntrl->ee == MHI_EE_RDDM && mhi_cntrl->ee != ee)
>>>> {
>>>> +    switch (ee) {
>>>> +    case MHI_EE_RDDM:
>>>> +        /* proceed if power down is not already in progress */
>>>> +        if (mhi_cntrl->rddm_image && mhi_is_active(mhi_cntrl)) {
>>>>               mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_EE_RDDM);
>>>> +            mhi_cntrl->ee = ee;
>>>>               wake_up_all(&mhi_cntrl->state_event);
>>>>           }
>>>> -        goto exit_intvec;
>>>> -    }
>>>> -
>>>> -    if (pm_state == MHI_PM_SYS_ERR_DETECT) {
>>>> +        break;
>>>> +    case MHI_EE_PBL:
>>>> +    case MHI_EE_EDL:
>>>> +    case MHI_EE_PTHRU:
>>>> +        mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>>>> +        mhi_cntrl->ee = ee;
>>>>           wake_up_all(&mhi_cntrl->state_event);
>>>> -
>>>> -        /* For fatal errors, we let controller decide next step */
>>>> -        if (MHI_IN_PBL(ee))
>>>> -            mhi_cntrl->status_cb(mhi_cntrl, MHI_CB_FATAL_ERROR);
>>>> -        else
>>>> -            mhi_pm_sys_err_handler(mhi_cntrl);
>>>> +        mhi_pm_sys_err_handler(mhi_cntrl);
>>>> +        break;
>>>> +    default:
>>>> +        wake_up_all(&mhi_cntrl->state_event);
>>>> +        mhi_pm_sys_err_handler(mhi_cntrl);
>>>> +        break;
>>>>       }
>>>
>>> Bhaumik, can you explain the above change?  Before this patch (which
>>> is now committed), if there was a fatal error, the controller was
>>> notified (MHI_CB_FATAL_ERROR) and it decided all action.  After this
>>> patch, the controller is notified, but also the core attempts to
>>> handle the syserr.
>>>
>>> This is a change in behavior, and seems to make a mess of the
>>> controller, and possibly the core fighting each other.
>>>
>>> Specifically, I'm rebasing the AIC100 driver onto 5.13, which has
>>> this
>>> change, and I'm seeing a serious regression.  I'm thinking that for
>>> the PBL/EDL/PTHRU case, mhi_pm_sys_err_handler() should not be
>>> called.
>>>
>>> Thoughts?
>>
>> I see. We use this heavily for entry to EDL use-cases.
>>
>> We do require the mhi_pm_sys_err_handler() to be called in those cases
>> as any entry to PBL/EDL means there is need to clean-up MHI host and
>> notify all
>> its clients.
>>
>> We include PTHRU in here because its a SYS ERROR in PBL modes.
>>
>> Premise being the controller should not be in that business of doing
>> any of
>> the clean-up that is responsibility of the core driver. We're using
>> this feature
>> set to ensure controller is only notified.
>>
>> What does AIC100 do on fatal error that you run in to issues? I don't
>> think
>> any of the other controllers do anything other than disabling runtime
>> PM since
>> device is down. Maybe there's some room for improvement.
>
> Our usecase is PBL as a result of a full device crash. AIC100 doesn't
> exercise the EDL/PTHRU usecases. (Just giving you some context, not
> trying to imply EDL is not valuable to others for example).
>
> In that case (FATAL_ERROR), our controller schedules a work item, and
> then returnsas we assume FATAL_ERROR is notified in atomic context.
> We need to do non-mhi cleanup first. Then we powerdown the MHI to
> cleanup the MHI core, and kick off all the clients (its been a long
> while, but initially, we were seeing the syserr handling in the core
> not sufficiently kicking off the clients). Then we power up MHI. MHI
> will init in PBL, still in syserr, and handle it.
>
> I haven't fully traced everything, but we were getting into some
> really bad states with the core triggering mhi_pm_sys_err_handler()
> per this patch.
>
> Its important that we have control over the ordering of our cleanup,
> vs the MHI cleanup. Sadly, non-atomic context (sleeping) is a
> requirement of our cleanup.
>
> Seems like we have differing requirements here. Hmm. What about an
> API the controller can call, that does mhi_pm_sys_err_handler() (or
> equivalent) so that the controller can control when the MHI core
> cleanup is done, but doesn't need to be concerned with the details?
> Or, do you have a suggestion?

I see why you're seeing the issue. We had a design shift when we
introduced
EDL handling and updated how the execution environments and state
changes get
processed.

Previously, you would only receive the FATAL ERROR callback if a
SYS_ERROR
entry in PBL execution environment occurred. Now, you're getting both
the
callbacks as MHI core driver takes the responsibility of SYS_ERROR
handling
and related clean-up.

The way we can take it forward is - FATAL ERROR callback should just be
an
indication to the controller that device saw a SYS_ERROR and has entered
PBL.

The SYS_ERROR callback is sleep-capable and will block for the
controller to
perform any heavy-lifting clean-up with the exception of an
mhi_power_down()
call from within that callback. If that provision is desired, it can be
taken
care of in a future patch as an async power down request or something
equivalent.

I will push a patch to improve the documentation for this.

Thanks,
Bhaumik
---
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum,
a Linux Foundation Collaborative Project