2023-12-13 21:09:17

by Rob Clark

[permalink] [raw]
Subject: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

From: Rob Clark <[email protected]>

We need to bail out before adding/removing devices, if we are going
to -EPROBE_DEFER. Otherwise boot will get stuck forever at
deferred_probe_initcall().

Suggested-by: Dmitry Baryshkov <[email protected]>
Signed-off-by: Rob Clark <[email protected]>
---
drivers/soc/qcom/pmic_glink.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index 914057331afd..2899746af6a6 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -268,10 +268,16 @@ static int pmic_glink_probe(struct platform_device *pdev)
else
pg->client_mask = PMIC_GLINK_CLIENT_DEFAULT;

+ pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg);
+ if (IS_ERR(pg->pdr)) {
+ ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n");
+ return ret;
+ }
+
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) {
ret = pmic_glink_add_aux_device(pg, &pg->ucsi_aux, "ucsi");
if (ret)
- return ret;
+ goto out_release_pdr_handle;
}
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_ALTMODE)) {
ret = pmic_glink_add_aux_device(pg, &pg->altmode_aux, "altmode");
@@ -284,17 +290,11 @@ static int pmic_glink_probe(struct platform_device *pdev)
goto out_release_altmode_aux;
}

- pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg);
- if (IS_ERR(pg->pdr)) {
- ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n");
- goto out_release_aux_devices;
- }
-
service = pdr_add_lookup(pg->pdr, "tms/servreg", "msm/adsp/charger_pd");
if (IS_ERR(service)) {
ret = dev_err_probe(&pdev->dev, PTR_ERR(service),
"failed adding pdr lookup for charger_pd\n");
- goto out_release_pdr_handle;
+ goto out_release_aux_devices;
}

mutex_lock(&__pmic_glink_lock);
@@ -303,8 +303,6 @@ static int pmic_glink_probe(struct platform_device *pdev)

return 0;

-out_release_pdr_handle:
- pdr_handle_release(pg->pdr);
out_release_aux_devices:
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_BATT))
pmic_glink_del_aux_device(pg, &pg->ps_aux);
@@ -314,6 +312,8 @@ static int pmic_glink_probe(struct platform_device *pdev)
out_release_ucsi_aux:
if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI))
pmic_glink_del_aux_device(pg, &pg->ucsi_aux);
+out_release_pdr_handle:
+ pdr_handle_release(pg->pdr);

return ret;
}
--
2.43.0


2023-12-14 07:16:53

by Johan Hovold

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> From: Rob Clark <[email protected]>
>
> We need to bail out before adding/removing devices, if we are going
> to -EPROBE_DEFER. Otherwise boot will get stuck forever at
> deferred_probe_initcall().

Can please you expand on why this is a problem here in the commit
message?

The aux devices appear to be tore down correctly in the probe error
paths so how exactly does that lead to deferred_probe_initcall() being
stuck? This sounds like we may have a problem elsewhere which this patch
is papering over.

Also where does the probe deferral come from in your case?
pdr_handle_alloc()?

If this is a correct fix, I'd also expect there to be a Fixes and
CC-stable tag.

Johan

2023-12-14 11:06:07

by Dmitry Baryshkov

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, 14 Dec 2023 at 09:16, Johan Hovold <[email protected]> wrote:
>
> On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > From: Rob Clark <[email protected]>
> >
> > We need to bail out before adding/removing devices, if we are going
> > to -EPROBE_DEFER. Otherwise boot will get stuck forever at
> > deferred_probe_initcall().
>
> Can please you expand on why this is a problem here in the commit
> message?
>
> The aux devices appear to be tore down correctly in the probe error
> paths so how exactly does that lead to deferred_probe_initcall() being
> stuck? This sounds like we may have a problem elsewhere which this patch
> is papering over.

This is a known problem. Successful probes during the probe deferral
loop causes the whole loop to be reiterated. Creating child devices
usually results in a successful probe. Aso I thought that just
creating new device also causes a reprobe, but I can not find any
evidence now.

>
> Also where does the probe deferral come from in your case?
> pdr_handle_alloc()?
>
> If this is a correct fix, I'd also expect there to be a Fixes and
> CC-stable tag.
>
> Johan
>


--
With best wishes
Dmitry

2023-12-14 14:02:13

by Johan Hovold

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:
> On Thu, 14 Dec 2023 at 09:16, Johan Hovold <[email protected]> wrote:
> >
> > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > > From: Rob Clark <[email protected]>
> > >
> > > We need to bail out before adding/removing devices, if we are going
> > > to -EPROBE_DEFER. Otherwise boot will get stuck forever at
> > > deferred_probe_initcall().
> >
> > Can please you expand on why this is a problem here in the commit
> > message?
> >
> > The aux devices appear to be tore down correctly in the probe error
> > paths so how exactly does that lead to deferred_probe_initcall() being
> > stuck? This sounds like we may have a problem elsewhere which this patch
> > is papering over.
>
> This is a known problem. Successful probes during the probe deferral
> loop causes the whole loop to be reiterated. Creating child devices
> usually results in a successful probe. Aso I thought that just
> creating new device also causes a reprobe, but I can not find any
> evidence now.

This still needs to be described in the commit message.

Only a successful probe should trigger a reprobe, and when the child
devices are registered the parent is not yet on the deferred probe list.
So something is not right or missing here.

Johan

2023-12-14 14:05:31

by Dmitry Baryshkov

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, 14 Dec 2023 at 16:01, Johan Hovold <[email protected]> wrote:
>
> On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:
> > On Thu, 14 Dec 2023 at 09:16, Johan Hovold <[email protected]> wrote:
> > >
> > > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > > > From: Rob Clark <[email protected]>
> > > >
> > > > We need to bail out before adding/removing devices, if we are going
> > > > to -EPROBE_DEFER. Otherwise boot will get stuck forever at
> > > > deferred_probe_initcall().
> > >
> > > Can please you expand on why this is a problem here in the commit
> > > message?
> > >
> > > The aux devices appear to be tore down correctly in the probe error
> > > paths so how exactly does that lead to deferred_probe_initcall() being
> > > stuck? This sounds like we may have a problem elsewhere which this patch
> > > is papering over.
> >
> > This is a known problem. Successful probes during the probe deferral
> > loop causes the whole loop to be reiterated. Creating child devices
> > usually results in a successful probe. Aso I thought that just
> > creating new device also causes a reprobe, but I can not find any
> > evidence now.
>
> This still needs to be described in the commit message.
>
> Only a successful probe should trigger a reprobe, and when the child
> devices are registered the parent is not yet on the deferred probe list.
> So something is not right or missing here.

Child devices can be successfully probed, then the parent gets
-EPROBE_DEFER, removes children and then it goes on and on.

--
With best wishes
Dmitry

2023-12-14 14:10:01

by Johan Hovold

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, Dec 14, 2023 at 04:04:49PM +0200, Dmitry Baryshkov wrote:
> On Thu, 14 Dec 2023 at 16:01, Johan Hovold <[email protected]> wrote:
> > On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:
> > > On Thu, 14 Dec 2023 at 09:16, Johan Hovold <[email protected]> wrote:
> > > > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > > > > From: Rob Clark <[email protected]>
> > > > >
> > > > > We need to bail out before adding/removing devices, if we are going
> > > > > to -EPROBE_DEFER. Otherwise boot will get stuck forever at
> > > > > deferred_probe_initcall().
> > > >
> > > > Can please you expand on why this is a problem here in the commit
> > > > message?
> > > >
> > > > The aux devices appear to be tore down correctly in the probe error
> > > > paths so how exactly does that lead to deferred_probe_initcall() being
> > > > stuck? This sounds like we may have a problem elsewhere which this patch
> > > > is papering over.
> > >
> > > This is a known problem. Successful probes during the probe deferral
> > > loop causes the whole loop to be reiterated. Creating child devices
> > > usually results in a successful probe. Aso I thought that just
> > > creating new device also causes a reprobe, but I can not find any
> > > evidence now.
> >
> > This still needs to be described in the commit message.
> >
> > Only a successful probe should trigger a reprobe, and when the child
> > devices are registered the parent is not yet on the deferred probe list.
> > So something is not right or missing here.
>
> Child devices can be successfully probed, then the parent gets
> -EPROBE_DEFER, removes children and then it goes on and on.

So what? As I described above, the successful probe of the children
should have nothing to do with whether the parent is reprobed.

If that isn't the case, then explain how.

Johan

2023-12-14 15:38:50

by Johan Hovold

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, Dec 14, 2023 at 03:09:36PM +0100, Johan Hovold wrote:
> On Thu, Dec 14, 2023 at 04:04:49PM +0200, Dmitry Baryshkov wrote:
> > On Thu, 14 Dec 2023 at 16:01, Johan Hovold <[email protected]> wrote:
> > > On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:

> > > > This is a known problem. Successful probes during the probe deferral
> > > > loop causes the whole loop to be reiterated. Creating child devices
> > > > usually results in a successful probe. Aso I thought that just
> > > > creating new device also causes a reprobe, but I can not find any
> > > > evidence now.
> > >
> > > This still needs to be described in the commit message.
> > >
> > > Only a successful probe should trigger a reprobe, and when the child
> > > devices are registered the parent is not yet on the deferred probe list.
> > > So something is not right or missing here.
> >
> > Child devices can be successfully probed, then the parent gets
> > -EPROBE_DEFER, removes children and then it goes on and on.
>
> So what? As I described above, the successful probe of the children
> should have nothing to do with whether the parent is reprobed.
>
> If that isn't the case, then explain how.

I took a closer look at this and indeed we do have code that triggers a
reprobe of a device in case there was a successful probe while the
device was probing.

This was introduced by commit 58b116bce136 ("drivercore: deferral race
condition fix") and the workaround for the reprobe-loop bug that hack
led to is to not return -EPROBE_DEFER after registering child devices as
no one managed to come up with a proper fix. This was documented here:

fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")

But please spell this out in some more detail in the commit message, and
add a Fixes and CC stable tag.

Johan

2023-12-14 16:08:55

by Johan Hovold

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote:

> I took a closer look at this and indeed we do have code that triggers a
> reprobe of a device in case there was a successful probe while the
> device was probing.
>
> This was introduced by commit 58b116bce136 ("drivercore: deferral race
> condition fix") and the workaround for the reprobe-loop bug that hack
> led to is to not return -EPROBE_DEFER after registering child devices as
> no one managed to come up with a proper fix. This was documented here:
>
> fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")
>
> But please spell this out in some more detail in the commit message, and
> add a Fixes and CC stable tag.

And please update the commit summary as I've been booting with QRTR=m
all along just fine. I guess the issue is if you have pmic_glink
built-in or in the initramfs but forgot to include qrtr or similar?

Johan

2023-12-14 20:45:45

by Rob Clark

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, Dec 14, 2023 at 8:08 AM Johan Hovold <[email protected]> wrote:
>
> On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote:
>
> > I took a closer look at this and indeed we do have code that triggers a
> > reprobe of a device in case there was a successful probe while the
> > device was probing.
> >
> > This was introduced by commit 58b116bce136 ("drivercore: deferral race
> > condition fix") and the workaround for the reprobe-loop bug that hack
> > led to is to not return -EPROBE_DEFER after registering child devices as
> > no one managed to come up with a proper fix. This was documented here:
> >
> > fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")
> >
> > But please spell this out in some more detail in the commit message, and
> > add a Fixes and CC stable tag.
>
> And please update the commit summary as I've been booting with QRTR=m
> all along just fine. I guess the issue is if you have pmic_glink
> built-in or in the initramfs but forgot to include qrtr or similar?

I do have both QRTR=m and QCOM_GLINK=m. I'm honestly not sure what
started triggering this issue for me.. it seemed to have started after
merging msm-next + drm-misc-next on top of your
jhovold/wip/sc8280xp-v6.7-rc5 (the merged branches were based on -rc3
so this shouldn't have really brought in random non-drm things).
Maybe there is a timing element to it? I felt like the problem was
obvious enough, and the exact details of why I started hitting this
were not important enough to spend time tracking down.

BR,
-R

> Johan

2023-12-15 07:50:46

by Johan Hovold

[permalink] [raw]
Subject: Re: [PATCH] soc: qcom: pmic_glink: Fix boot when QRTR=m

On Thu, Dec 14, 2023 at 12:44:10PM -0800, Rob Clark wrote:
> On Thu, Dec 14, 2023 at 8:08 AM Johan Hovold <[email protected]> wrote:
> >
> > On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote:
> >
> > > I took a closer look at this and indeed we do have code that triggers a
> > > reprobe of a device in case there was a successful probe while the
> > > device was probing.
> > >
> > > This was introduced by commit 58b116bce136 ("drivercore: deferral race
> > > condition fix") and the workaround for the reprobe-loop bug that hack
> > > led to is to not return -EPROBE_DEFER after registering child devices as
> > > no one managed to come up with a proper fix. This was documented here:
> > >
> > > fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")
> > >
> > > But please spell this out in some more detail in the commit message, and
> > > add a Fixes and CC stable tag.
> >
> > And please update the commit summary as I've been booting with QRTR=m
> > all along just fine. I guess the issue is if you have pmic_glink
> > built-in or in the initramfs but forgot to include qrtr or similar?
>
> I do have both QRTR=m and QCOM_GLINK=m. I'm honestly not sure what
> started triggering this issue for me.. it seemed to have started after
> merging msm-next + drm-misc-next on top of your
> jhovold/wip/sc8280xp-v6.7-rc5 (the merged branches were based on -rc3
> so this shouldn't have really brought in random non-drm things).
> Maybe there is a timing element to it?

Yeah, possibly, and device links may also come into play here.

> I felt like the problem was obvious enough, and the exact details of
> why I started hitting this were not important enough to spend time
> tracking down.

The patch itself is of course fine itself as a clean up (or
microoptimisation) but the claim that it solves, and is the correct fix
for, a probe loop issue is not obvious at all and requires some further
justification.

Since he last time someone suggested reverting the commit that
introduced the probe-loop issue, the result was to leave things as they
were and just document the workaround, it should be fine to just refer
to that commit:

fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")

Perhaps you can keep the Subject if you make it clear in the commit
message that this bug isn't always hit with QRTR=m.

Johan