In the following sequence:
1) of_platform_depopulate()
2) of_overlay_remove()
During the step 1, devices are destroyed and devlinks are removed.
During the step 2, OF nodes are destroyed but
__of_changeset_entry_destroy() can raise warnings related to missing
of_node_put():
ERROR: memory leak, expected refcount 1 instead of 2 ...
Indeed, during the devlink removals performed at step 1, the removal
itself releasing the device (and the attached of_node) is done by a job
queued in a workqueue and so, it is done asynchronously with respect to
function calls.
When the warning is present, of_node_put() will be called but wrongly
too late from the workqueue job.
In order to be sure that any ongoing devlink removals are done before
the of_node destruction, synchronize the of_overlay_remove() with the
devlink removals.
Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
Cc: [email protected]
Signed-off-by: Herve Codina <[email protected]>
---
drivers/of/overlay.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 2ae7e9d24a64..99659ae9fb28 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -853,6 +853,14 @@ static void free_overlay_changeset(struct overlay_changeset *ovcs)
{
int i;
+ /*
+ * Wait for any ongoing device link removals before removing some of
+ * nodes. Drop the global lock while waiting
+ */
+ mutex_unlock(&of_mutex);
+ device_link_wait_removal();
+ mutex_lock(&of_mutex);
+
if (ovcs->cset.entries.next)
of_changeset_destroy(&ovcs->cset);
@@ -862,7 +870,6 @@ static void free_overlay_changeset(struct overlay_changeset *ovcs)
ovcs->id = 0;
}
-
for (i = 0; i < ovcs->count; i++) {
of_node_put(ovcs->fragments[i].target);
of_node_put(ovcs->fragments[i].overlay);
--
2.43.0
On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> In the following sequence:
> 1) of_platform_depopulate()
> 2) of_overlay_remove()
>
> During the step 1, devices are destroyed and devlinks are removed.
> During the step 2, OF nodes are destroyed but
> __of_changeset_entry_destroy() can raise warnings related to missing
> of_node_put():
> ERROR: memory leak, expected refcount 1 instead of 2 ...
>
> Indeed, during the devlink removals performed at step 1, the removal
> itself releasing the device (and the attached of_node) is done by a job
> queued in a workqueue and so, it is done asynchronously with respect to
> function calls.
> When the warning is present, of_node_put() will be called but wrongly
> too late from the workqueue job.
>
> In order to be sure that any ongoing devlink removals are done before
> the of_node destruction, synchronize the of_overlay_remove() with the
> devlink removals.
>
> Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> Cc: [email protected]
> Signed-off-by: Herve Codina <[email protected]>
> ---
> drivers/of/overlay.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 2ae7e9d24a64..99659ae9fb28 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -853,6 +853,14 @@ static void free_overlay_changeset(struct
> overlay_changeset *ovcs)
> {
> int i;
>
> + /*
> + * Wait for any ongoing device link removals before removing some of
> + * nodes. Drop the global lock while waiting
> + */
> + mutex_unlock(&of_mutex);
> + device_link_wait_removal();
> + mutex_lock(&of_mutex);
I'm still not convinced we need to drop the lock. What happens if someone else
grabs the lock while we are in device_link_wait_removal()? Can we guarantee that
we can't screw things badly?
The question is, do you have a system/use case where you can really see the
deadlock happening? Until I see one, I'm very skeptical about this. And if we
have one, I'm not really sure this is also the right solution for it.
- Nuno Sá
On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> In the following sequence:
> 1) of_platform_depopulate()
> 2) of_overlay_remove()
>
> During the step 1, devices are destroyed and devlinks are removed.
> During the step 2, OF nodes are destroyed but
> __of_changeset_entry_destroy() can raise warnings related to missing
> of_node_put():
> ERROR: memory leak, expected refcount 1 instead of 2 ...
>
> Indeed, during the devlink removals performed at step 1, the removal
> itself releasing the device (and the attached of_node) is done by a job
> queued in a workqueue and so, it is done asynchronously with respect to
> function calls.
> When the warning is present, of_node_put() will be called but wrongly
> too late from the workqueue job.
>
> In order to be sure that any ongoing devlink removals are done before
> the of_node destruction, synchronize the of_overlay_remove() with the
> devlink removals.
>
> Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> Cc: [email protected]
> Signed-off-by: Herve Codina <[email protected]>
> ---
> drivers/of/overlay.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 2ae7e9d24a64..99659ae9fb28 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
In the cover, you mention device.h inclusion but I'm not seeing it? This is
clearly up to the DT maintainers to decide but, IMHO, I would very much prefer
to see fwnode.h included in here rather than directly device.h (so yeah,
renaming the function to fwnode_*). But yeah, I might be biased by own series :)
- Nuno Sá
On Thu, 29 Feb 2024 10:50:21 +0100
Nuno Sá <[email protected]> wrote:
> On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> > In the following sequence:
> > 1) of_platform_depopulate()
> > 2) of_overlay_remove()
> >
> > During the step 1, devices are destroyed and devlinks are removed.
> > During the step 2, OF nodes are destroyed but
> > __of_changeset_entry_destroy() can raise warnings related to missing
> > of_node_put():
> > ERROR: memory leak, expected refcount 1 instead of 2 ...
> >
> > Indeed, during the devlink removals performed at step 1, the removal
> > itself releasing the device (and the attached of_node) is done by a job
> > queued in a workqueue and so, it is done asynchronously with respect to
> > function calls.
> > When the warning is present, of_node_put() will be called but wrongly
> > too late from the workqueue job.
> >
> > In order to be sure that any ongoing devlink removals are done before
> > the of_node destruction, synchronize the of_overlay_remove() with the
> > devlink removals.
> >
> > Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > Cc: [email protected]
> > Signed-off-by: Herve Codina <[email protected]>
> > ---
> > drivers/of/overlay.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> > index 2ae7e9d24a64..99659ae9fb28 100644
> > --- a/drivers/of/overlay.c
> > +++ b/drivers/of/overlay.c
>
> In the cover, you mention device.h inclusion but I'm not seeing it? This is
> clearly up to the DT maintainers to decide but, IMHO, I would very much prefer
> to see fwnode.h included in here rather than directly device.h (so yeah,
> renaming the function to fwnode_*). But yeah, I might be biased by own series :)
>
Damned. I missed device.h in this patch.
Without this one, the patch do not compile :(
A fixup commit I missed to squash before sending.
A v3 is planned to add this device.h.
Nuno, do you prefer I wait few days before sending this v3 waiting for more replies
or I send it right now and you re-do your comment on the v3 ?
I would really prefer to send it now :)
Sorry about my mistake.
Best regards,
Hervé
--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
On Thu, 2024-02-29 at 11:14 +0100, Herve Codina wrote:
> On Thu, 29 Feb 2024 10:50:21 +0100
> Nuno Sá <[email protected]> wrote:
>
> > On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> > > In the following sequence:
> > > 1) of_platform_depopulate()
> > > 2) of_overlay_remove()
> > >
> > > During the step 1, devices are destroyed and devlinks are removed.
> > > During the step 2, OF nodes are destroyed but
> > > __of_changeset_entry_destroy() can raise warnings related to missing
> > > of_node_put():
> > > ERROR: memory leak, expected refcount 1 instead of 2 ...
> > >
> > > Indeed, during the devlink removals performed at step 1, the removal
> > > itself releasing the device (and the attached of_node) is done by a job
> > > queued in a workqueue and so, it is done asynchronously with respect to
> > > function calls.
> > > When the warning is present, of_node_put() will be called but wrongly
> > > too late from the workqueue job.
> > >
> > > In order to be sure that any ongoing devlink removals are done before
> > > the of_node destruction, synchronize the of_overlay_remove() with the
> > > devlink removals.
> > >
> > > Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > > Cc: [email protected]
> > > Signed-off-by: Herve Codina <[email protected]>
> > > ---
> > > drivers/of/overlay.c | 9 ++++++++-
> > > 1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> > > index 2ae7e9d24a64..99659ae9fb28 100644
> > > --- a/drivers/of/overlay.c
> > > +++ b/drivers/of/overlay.c
> >
> > In the cover, you mention device.h inclusion but I'm not seeing it? This is
> > clearly up to the DT maintainers to decide but, IMHO, I would very much
> > prefer
> > to see fwnode.h included in here rather than directly device.h (so yeah,
> > renaming the function to fwnode_*). But yeah, I might be biased by own
> > series :)
> >
>
> Damned. I missed device.h in this patch.
> Without this one, the patch do not compile :(
>
> A fixup commit I missed to squash before sending.
>
> A v3 is planned to add this device.h.
>
> Nuno, do you prefer I wait few days before sending this v3 waiting for more
> replies
> or I send it right now and you re-do your comment on the v3 ?
>
> I would really prefer to send it now :)
>
Typically maintainers don't like much of re-spinning versions too fast. That
said, up to you :). I can copy paste my comments in v3.
- Nuno Sá