2020-05-15 05:39:23

by Saravana Kannan

[permalink] [raw]
Subject: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
parsing of the device tree nodes when a lot of devices are added. This
will significantly cut down parsing time (as much a 1 second on some
systems). So, use them when adding devices for all the top level device
tree nodes in a system.

Signed-off-by: Saravana Kannan <[email protected]>
---
drivers/of/platform.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 3371e4a06248..55d719347810 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -538,7 +538,9 @@ static int __init of_platform_default_populate_init(void)
}

/* Populate everything else. */
+ fw_devlink_pause();
of_platform_default_populate(NULL, NULL, NULL);
+ fw_devlink_resume();

return 0;
}
--
2.26.2.761.g0e0b3e54be-goog


2020-05-19 06:27:27

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On 15.05.2020 07:35, Saravana Kannan wrote:
> The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> parsing of the device tree nodes when a lot of devices are added. This
> will significantly cut down parsing time (as much a 1 second on some
> systems). So, use them when adding devices for all the top level device
> tree nodes in a system.
>
> Signed-off-by: Saravana Kannan <[email protected]>

This patch recently landed in linux-next 20200518. Sadly, it causes
regression on Samsung Exynos5433-based TM2e board:

s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
s3c64xx-spi 14d50000.spi: Failed to get RX DMA channel
s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
s3c64xx-spi 14d50000.spi: Failed to get RX DMA channel
s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel

Internal error: synchronous external abort: 96000210 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 50 Comm: kworker/0:1 Not tainted 5.7.0-rc5+ #701
Hardware name: Samsung TM2E board (DT)
Workqueue: events deferred_probe_work_func
pstate: 60000005 (nZCv daif -PAN -UAO)
pc : samsung_i2s_probe+0x768/0x8f0
lr : samsung_i2s_probe+0x688/0x8f0
...
Call trace:
 samsung_i2s_probe+0x768/0x8f0
 platform_drv_probe+0x50/0xa8
 really_probe+0x108/0x370
 driver_probe_device+0x54/0xb8
 __device_attach_driver+0x90/0xc0
 bus_for_each_drv+0x70/0xc8
 __device_attach+0xdc/0x140
 device_initial_probe+0x10/0x18
 bus_probe_device+0x94/0xa0
 deferred_probe_work_func+0x70/0xa8
 process_one_work+0x2a8/0x718
 worker_thread+0x48/0x470
 kthread+0x134/0x160
 ret_from_fork+0x10/0x1c
Code: 17ffffaf d503201f f94086c0 91003000 (88dffc00)
---[ end trace ccf721c9400ddbd6 ]---
Kernel panic - not syncing: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x090002,24006087
Memory Limit: none

---[ end Kernel panic - not syncing: Fatal exception ]---

Both issues, the lack of DMA for SPI device and Synchronous abort in I2S
probe are new after applying this patch. I'm trying to investigate which
resources are missing and why. The latter issue means typically that the
registers for the given device has been accessed without enabling the
needed clocks or power domains.

> ---
> drivers/of/platform.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> index 3371e4a06248..55d719347810 100644
> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -538,7 +538,9 @@ static int __init of_platform_default_populate_init(void)
> }
>
> /* Populate everything else. */
> + fw_devlink_pause();
> of_platform_default_populate(NULL, NULL, NULL);
> + fw_devlink_resume();
>
> return 0;
> }

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-05-19 06:50:45

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

On Mon, May 18, 2020 at 11:25 PM Marek Szyprowski
<[email protected]> wrote:
>
> Hi Saravana,
>
> On 15.05.2020 07:35, Saravana Kannan wrote:
> > The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> > parsing of the device tree nodes when a lot of devices are added. This
> > will significantly cut down parsing time (as much a 1 second on some
> > systems). So, use them when adding devices for all the top level device
> > tree nodes in a system.
> >
> > Signed-off-by: Saravana Kannan <[email protected]>
>
> This patch recently landed in linux-next 20200518. Sadly, it causes
> regression on Samsung Exynos5433-based TM2e board:
>
> s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
> s3c64xx-spi 14d50000.spi: Failed to get RX DMA channel
> s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
> s3c64xx-spi 14d50000.spi: Failed to get RX DMA channel
> s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
>
> Internal error: synchronous external abort: 96000210 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 0 PID: 50 Comm: kworker/0:1 Not tainted 5.7.0-rc5+ #701
> Hardware name: Samsung TM2E board (DT)
> Workqueue: events deferred_probe_work_func
> pstate: 60000005 (nZCv daif -PAN -UAO)
> pc : samsung_i2s_probe+0x768/0x8f0
> lr : samsung_i2s_probe+0x688/0x8f0
> ...
> Call trace:
> samsung_i2s_probe+0x768/0x8f0
> platform_drv_probe+0x50/0xa8
> really_probe+0x108/0x370
> driver_probe_device+0x54/0xb8
> __device_attach_driver+0x90/0xc0
> bus_for_each_drv+0x70/0xc8
> __device_attach+0xdc/0x140
> device_initial_probe+0x10/0x18
> bus_probe_device+0x94/0xa0
> deferred_probe_work_func+0x70/0xa8
> process_one_work+0x2a8/0x718
> worker_thread+0x48/0x470
> kthread+0x134/0x160
> ret_from_fork+0x10/0x1c
> Code: 17ffffaf d503201f f94086c0 91003000 (88dffc00)
> ---[ end trace ccf721c9400ddbd6 ]---
> Kernel panic - not syncing: Fatal exception
> SMP: stopping secondary CPUs
> Kernel Offset: disabled
> CPU features: 0x090002,24006087
> Memory Limit: none
>
> ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> Both issues, the lack of DMA for SPI device and Synchronous abort in I2S
> probe are new after applying this patch. I'm trying to investigate which
> resources are missing and why. The latter issue means typically that the
> registers for the given device has been accessed without enabling the
> needed clocks or power domains.

Did you try this copy-pasta fix that I sent later?
https://lore.kernel.org/lkml/[email protected]/

Not every system would need it (my test setup didn't), but it helps some cases.

If that fix doesn't help, then some tips for debugging the failing drivers.
What this pause/resume patch effectively (not explicitly) does is:
1. Doesn't immediately probe the devices as they are added in
of_platform_default_populate_init()
2. Adds them in order to the deferred probe list.
3. Then kicks off deferred probe on them in the order they were added.

These drivers are just not handling -EPROBE_DEFER correctly or
assuming probe order and that's causing these issues.

So, we can either fix that or you can try adding some code to flush
the deferred probe workqueue at the end of fw_devlink_resume().

Let me know how it goes.

Thanks,
Saravana

2020-05-19 07:13:54

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On 19.05.2020 08:48, Saravana Kannan wrote:
> On Mon, May 18, 2020 at 11:25 PM Marek Szyprowski
> <[email protected]> wrote:
>> On 15.05.2020 07:35, Saravana Kannan wrote:
>>> The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
>>> parsing of the device tree nodes when a lot of devices are added. This
>>> will significantly cut down parsing time (as much a 1 second on some
>>> systems). So, use them when adding devices for all the top level device
>>> tree nodes in a system.
>>>
>>> Signed-off-by: Saravana Kannan <[email protected]>
>> This patch recently landed in linux-next 20200518. Sadly, it causes
>> regression on Samsung Exynos5433-based TM2e board:
>>
>> s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
>> s3c64xx-spi 14d50000.spi: Failed to get RX DMA channel
>> s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
>> s3c64xx-spi 14d50000.spi: Failed to get RX DMA channel
>> s3c64xx-spi 14d30000.spi: Failed to get RX DMA channel
>>
>> Internal error: synchronous external abort: 96000210 [#1] PREEMPT SMP
>> Modules linked in:
>> CPU: 0 PID: 50 Comm: kworker/0:1 Not tainted 5.7.0-rc5+ #701
>> Hardware name: Samsung TM2E board (DT)
>> Workqueue: events deferred_probe_work_func
>> pstate: 60000005 (nZCv daif -PAN -UAO)
>> pc : samsung_i2s_probe+0x768/0x8f0
>> lr : samsung_i2s_probe+0x688/0x8f0
>> ...
>> Call trace:
>> samsung_i2s_probe+0x768/0x8f0
>> platform_drv_probe+0x50/0xa8
>> really_probe+0x108/0x370
>> driver_probe_device+0x54/0xb8
>> __device_attach_driver+0x90/0xc0
>> bus_for_each_drv+0x70/0xc8
>> __device_attach+0xdc/0x140
>> device_initial_probe+0x10/0x18
>> bus_probe_device+0x94/0xa0
>> deferred_probe_work_func+0x70/0xa8
>> process_one_work+0x2a8/0x718
>> worker_thread+0x48/0x470
>> kthread+0x134/0x160
>> ret_from_fork+0x10/0x1c
>> Code: 17ffffaf d503201f f94086c0 91003000 (88dffc00)
>> ---[ end trace ccf721c9400ddbd6 ]---
>> Kernel panic - not syncing: Fatal exception
>> SMP: stopping secondary CPUs
>> Kernel Offset: disabled
>> CPU features: 0x090002,24006087
>> Memory Limit: none
>>
>> ---[ end Kernel panic - not syncing: Fatal exception ]---
>>
>> Both issues, the lack of DMA for SPI device and Synchronous abort in I2S
>> probe are new after applying this patch. I'm trying to investigate which
>> resources are missing and why. The latter issue means typically that the
>> registers for the given device has been accessed without enabling the
>> needed clocks or power domains.
> Did you try this copy-pasta fix that I sent later?
> https://lore.kernel.org/lkml/[email protected]/
>
> Not every system would need it (my test setup didn't), but it helps some cases.
>
> If that fix doesn't help, then some tips for debugging the failing drivers.
> What this pause/resume patch effectively (not explicitly) does is:
> 1. Doesn't immediately probe the devices as they are added in
> of_platform_default_populate_init()
> 2. Adds them in order to the deferred probe list.
> 3. Then kicks off deferred probe on them in the order they were added.
>
> These drivers are just not handling -EPROBE_DEFER correctly or
> assuming probe order and that's causing these issues.
>
> So, we can either fix that or you can try adding some code to flush
> the deferred probe workqueue at the end of fw_devlink_resume().
>
> Let me know how it goes.

So far it looks that your patch revealed a hidden issue in exynos5433
clocks configuration, because adding clk_ignore_unused parameter to
kernel command line fixes the boot. I'm still investigating it, so
probable you can ignore my regression report. I will let you know asap I
finish checking it.

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-05-19 10:34:16

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi

On 19.05.2020 09:11, Marek Szyprowski wrote:
> On 19.05.2020 08:48, Saravana Kannan wrote:
>> On Mon, May 18, 2020 at 11:25 PM Marek Szyprowski
>> <[email protected]> wrote:
>>> On 15.05.2020 07:35, Saravana Kannan wrote:
>>>> The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
>>>> parsing of the device tree nodes when a lot of devices are added. This
>>>> will significantly cut down parsing time (as much a 1 second on some
>>>> systems). So, use them when adding devices for all the top level
>>>> device
>>>> tree nodes in a system.
>>>>
>>>> Signed-off-by: Saravana Kannan <[email protected]>
>>> This patch recently landed in linux-next 20200518. Sadly, it causes
>>> regression on Samsung Exynos5433-based TM2e board:
>>>
>>> ...
>>>
>>> Both issues, the lack of DMA for SPI device and Synchronous abort in
>>> I2S
>>> probe are new after applying this patch. I'm trying to investigate
>>> which
>>> resources are missing and why. The latter issue means typically that
>>> the
>>> registers for the given device has been accessed without enabling the
>>> needed clocks or power domains.
>> Did you try this copy-pasta fix that I sent later?
>> https://lore.kernel.org/lkml/[email protected]/
>>
>>
>> Not every system would need it (my test setup didn't), but it helps
>> some cases.
>>
>> If that fix doesn't help, then some tips for debugging the failing
>> drivers.
>> What this pause/resume patch effectively (not explicitly) does is:
>> 1. Doesn't immediately probe the devices as they are added in
>> of_platform_default_populate_init()
>> 2. Adds them in order to the deferred probe list.
>> 3. Then kicks off deferred probe on them in the order they were added.
>>
>> These drivers are just not handling -EPROBE_DEFER correctly or
>> assuming probe order and that's causing these issues.
>>
>> So, we can either fix that or you can try adding some code to flush
>> the deferred probe workqueue at the end of fw_devlink_resume().
>>
>> Let me know how it goes.
>
> So far it looks that your patch revealed a hidden issue in exynos5433
> clocks configuration, because adding clk_ignore_unused parameter to
> kernel command line fixes the boot. I'm still investigating it, so
> probable you can ignore my regression report. I will let you know asap
> I finish checking it.
>
Okay, I confirm that the issue is in the Exynos I2S driver and
Exynos5433 clock provider. I've posted a quick workaround. I'm sorry for
the noise, your patch is fine.

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-05-19 18:05:18

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

On Tue, May 19, 2020 at 3:32 AM Marek Szyprowski
<[email protected]> wrote:
>
> Hi
>
> On 19.05.2020 09:11, Marek Szyprowski wrote:
> > On 19.05.2020 08:48, Saravana Kannan wrote:
> >> On Mon, May 18, 2020 at 11:25 PM Marek Szyprowski
> >> <[email protected]> wrote:
> >>> On 15.05.2020 07:35, Saravana Kannan wrote:
> >>>> The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> >>>> parsing of the device tree nodes when a lot of devices are added. This
> >>>> will significantly cut down parsing time (as much a 1 second on some
> >>>> systems). So, use them when adding devices for all the top level
> >>>> device
> >>>> tree nodes in a system.
> >>>>
> >>>> Signed-off-by: Saravana Kannan <[email protected]>
> >>> This patch recently landed in linux-next 20200518. Sadly, it causes
> >>> regression on Samsung Exynos5433-based TM2e board:
> >>>
> >>> ...
> >>>
> >>> Both issues, the lack of DMA for SPI device and Synchronous abort in
> >>> I2S
> >>> probe are new after applying this patch. I'm trying to investigate
> >>> which
> >>> resources are missing and why. The latter issue means typically that
> >>> the
> >>> registers for the given device has been accessed without enabling the
> >>> needed clocks or power domains.
> >> Did you try this copy-pasta fix that I sent later?
> >> https://lore.kernel.org/lkml/[email protected]/
> >>
> >>
> >> Not every system would need it (my test setup didn't), but it helps
> >> some cases.
> >>
> >> If that fix doesn't help, then some tips for debugging the failing
> >> drivers.
> >> What this pause/resume patch effectively (not explicitly) does is:
> >> 1. Doesn't immediately probe the devices as they are added in
> >> of_platform_default_populate_init()
> >> 2. Adds them in order to the deferred probe list.
> >> 3. Then kicks off deferred probe on them in the order they were added.
> >>
> >> These drivers are just not handling -EPROBE_DEFER correctly or
> >> assuming probe order and that's causing these issues.
> >>
> >> So, we can either fix that or you can try adding some code to flush
> >> the deferred probe workqueue at the end of fw_devlink_resume().
> >>
> >> Let me know how it goes.
> >
> > So far it looks that your patch revealed a hidden issue in exynos5433
> > clocks configuration, because adding clk_ignore_unused parameter to
> > kernel command line fixes the boot. I'm still investigating it, so
> > probable you can ignore my regression report. I will let you know asap
> > I finish checking it.
> >
> Okay, I confirm that the issue is in the Exynos I2S driver and
> Exynos5433 clock provider. I've posted a quick workaround. I'm sorry for
> the noise, your patch is fine.

Thanks for debugging and finding the real issue. I tried finding your
patches, but couldn't. Can you point me to a lore.kernel.org link? I'm
just curious to see what the issue was.

I'm guessing you didn't need to pick up this one?
https://lore.kernel.org/lkml/[email protected]/

-Saravana

2020-05-20 04:26:09

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On 19.05.2020 20:02, Saravana Kannan wrote:
> On Tue, May 19, 2020 at 3:32 AM Marek Szyprowski
> <[email protected]> wrote:
>> On 19.05.2020 09:11, Marek Szyprowski wrote:
>>> On 19.05.2020 08:48, Saravana Kannan wrote:
>>>> On Mon, May 18, 2020 at 11:25 PM Marek Szyprowski
>>>> <[email protected]> wrote:
>>>>> On 15.05.2020 07:35, Saravana Kannan wrote:
>>>>>> The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
>>>>>> parsing of the device tree nodes when a lot of devices are added. This
>>>>>> will significantly cut down parsing time (as much a 1 second on some
>>>>>> systems). So, use them when adding devices for all the top level
>>>>>> device
>>>>>> tree nodes in a system.
>>>>>>
>>>>>> Signed-off-by: Saravana Kannan <[email protected]>
>>>>> This patch recently landed in linux-next 20200518. Sadly, it causes
>>>>> regression on Samsung Exynos5433-based TM2e board:
>>>>>
>>>>> ...
>>>>>
>>>>> Both issues, the lack of DMA for SPI device and Synchronous abort in
>>>>> I2S
>>>>> probe are new after applying this patch. I'm trying to investigate
>>>>> which
>>>>> resources are missing and why. The latter issue means typically that
>>>>> the
>>>>> registers for the given device has been accessed without enabling the
>>>>> needed clocks or power domains.
>>>> Did you try this copy-pasta fix that I sent later?
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>>
>>>> Not every system would need it (my test setup didn't), but it helps
>>>> some cases.
>>>>
>>>> If that fix doesn't help, then some tips for debugging the failing
>>>> drivers.
>>>> What this pause/resume patch effectively (not explicitly) does is:
>>>> 1. Doesn't immediately probe the devices as they are added in
>>>> of_platform_default_populate_init()
>>>> 2. Adds them in order to the deferred probe list.
>>>> 3. Then kicks off deferred probe on them in the order they were added.
>>>>
>>>> These drivers are just not handling -EPROBE_DEFER correctly or
>>>> assuming probe order and that's causing these issues.
>>>>
>>>> So, we can either fix that or you can try adding some code to flush
>>>> the deferred probe workqueue at the end of fw_devlink_resume().
>>>>
>>>> Let me know how it goes.
>>> So far it looks that your patch revealed a hidden issue in exynos5433
>>> clocks configuration, because adding clk_ignore_unused parameter to
>>> kernel command line fixes the boot. I'm still investigating it, so
>>> probable you can ignore my regression report. I will let you know asap
>>> I finish checking it.
>>>
>> Okay, I confirm that the issue is in the Exynos I2S driver and
>> Exynos5433 clock provider. I've posted a quick workaround. I'm sorry for
>> the noise, your patch is fine.
> Thanks for debugging and finding the real issue. I tried finding your
> patches, but couldn't. Can you point me to a lore.kernel.org link? I'm
> just curious to see what the issue was.

https://lore.kernel.org/linux-samsung-soc/[email protected]/T/#t

It looks that one more clock has to be enabled to properly read init
configuration. So far it worked, because that device was probed much
earlier, before the unused clocks are turned off. Your patch changed the
probe order, so that device is probed later.

> I'm guessing you didn't need to pick up this one?
> https://lore.kernel.org/lkml/[email protected]/

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-06-17 12:23:29

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On Fri, May 15, 2020 at 7:38 AM Saravana Kannan <[email protected]> wrote:
> The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> parsing of the device tree nodes when a lot of devices are added. This
> will significantly cut down parsing time (as much a 1 second on some
> systems). So, use them when adding devices for all the top level device
> tree nodes in a system.
>
> Signed-off-by: Saravana Kannan <[email protected]>

This is now commit 93d2e4322aa74c1a ("of: platform: Batch fwnode parsing
when adding all top level devices") in v5.8-rc1, and I have bisected a
regression to it: on r8a7740/armadillo and sh73a0/kzm9g, the system can
no longer be woken up from s2ram by a GPIO key. Reverting the commit
fixes the issue.

On these systems, the GPIO/PFC block has its interrupt lines connected
to intermediate interrupt controllers (Renesas INTC), which are in turn
connected to the main interrupt controller (ARM GIC). The INTC block is
part of a power and clock domain. Hence if a GPIO is enabled as a
wake-up source, the INTC is part of the wake-up path, and thus must be
kept enabled when entering s2ram.

While this commit has no impact on probe order for me (unlike in Marek's
case), it does have an impact on suspend order:
- Before this commit:
1. The keyboard (gpio-keys) is suspended, and calls
enable_irq_wake() to inform the upstream interrupt controller
(INTC) that it is part of the wake-up path,
2. INTC is suspended, and calls device_set_wakeup_path() to inform
the device core that it must be kept enabled,
3. The system is woken by pressing a wake-up key.

- After this commit:
1. INTC is suspended, and is not aware it is part of the wake-up
path, so it is disabled by the device core,
2. gpio-keys is suspended, and calls enable_irq_wake() in vain,
3. Pressing a wake-up key has no effect, as INTC is disabled, and
the interrupt does not come through.

It looks like no device links are involved, as both gpio-keys and INTC have
no links.
Do you have a clue?

Thanks!

> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -538,7 +538,9 @@ static int __init of_platform_default_populate_init(void)
> }
>
> /* Populate everything else. */
> + fw_devlink_pause();
> of_platform_default_populate(NULL, NULL, NULL);
> + fw_devlink_resume();
>
> return 0;
> }

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-06-17 18:41:37

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Geert,

On Wed, Jun 17, 2020 at 5:20 AM Geert Uytterhoeven <[email protected]> wrote:
>
> Hi Saravana,
>
> On Fri, May 15, 2020 at 7:38 AM Saravana Kannan <[email protected]> wrote:
> > The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> > parsing of the device tree nodes when a lot of devices are added. This
> > will significantly cut down parsing time (as much a 1 second on some
> > systems). So, use them when adding devices for all the top level device
> > tree nodes in a system.
> >
> > Signed-off-by: Saravana Kannan <[email protected]>
>
> This is now commit 93d2e4322aa74c1a ("of: platform: Batch fwnode parsing
> when adding all top level devices") in v5.8-rc1, and I have bisected a
> regression to it: on r8a7740/armadillo and sh73a0/kzm9g, the system can
> no longer be woken up from s2ram by a GPIO key. Reverting the commit
> fixes the issue.
>
> On these systems, the GPIO/PFC block has its interrupt lines connected
> to intermediate interrupt controllers (Renesas INTC), which are in turn
> connected to the main interrupt controller (ARM GIC). The INTC block is
> part of a power and clock domain. Hence if a GPIO is enabled as a
> wake-up source, the INTC is part of the wake-up path, and thus must be
> kept enabled when entering s2ram.
>
> While this commit has no impact on probe order for me (unlike in Marek's
> case), it does have an impact on suspend order:
> - Before this commit:
> 1. The keyboard (gpio-keys) is suspended, and calls
> enable_irq_wake() to inform the upstream interrupt controller
> (INTC) that it is part of the wake-up path,
> 2. INTC is suspended, and calls device_set_wakeup_path() to inform
> the device core that it must be kept enabled,
> 3. The system is woken by pressing a wake-up key.
>
> - After this commit:
> 1. INTC is suspended, and is not aware it is part of the wake-up
> path, so it is disabled by the device core,
> 2. gpio-keys is suspended, and calls enable_irq_wake() in vain,
> 3. Pressing a wake-up key has no effect, as INTC is disabled, and
> the interrupt does not come through.
>
> It looks like no device links are involved, as both gpio-keys and INTC have
> no links.
> Do you have a clue?
>
> Thanks!

That patch of mine defers probe on all devices added by the
of_platform_default_populate() call, and then once the call returns,
it immediately triggers a deferred probe.

So all these devices are being probed in parallel in the deferred
probe workqueue while the main "initcall thread" continues down to
further initcalls. It looks like some of the drivers in subsequent
initcalls are assuming that devices in the earlier initcalls always
probe and can't be deferred?

There are two options.
1. Fix these drivers.
2. Add a "flush deferred workqueue" in fw_devlink_resume()

I'd rather we fix the drivers so that they handle deferred probes
correctly. Thoughts?

-Saravana

2020-06-18 07:37:02

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On Wed, Jun 17, 2020 at 8:36 PM Saravana Kannan <[email protected]> wrote:
> On Wed, Jun 17, 2020 at 5:20 AM Geert Uytterhoeven <[email protected]> wrote:
> > On Fri, May 15, 2020 at 7:38 AM Saravana Kannan <[email protected]> wrote:
> > > The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> > > parsing of the device tree nodes when a lot of devices are added. This
> > > will significantly cut down parsing time (as much a 1 second on some
> > > systems). So, use them when adding devices for all the top level device
> > > tree nodes in a system.
> > >
> > > Signed-off-by: Saravana Kannan <[email protected]>
> >
> > This is now commit 93d2e4322aa74c1a ("of: platform: Batch fwnode parsing
> > when adding all top level devices") in v5.8-rc1, and I have bisected a
> > regression to it: on r8a7740/armadillo and sh73a0/kzm9g, the system can
> > no longer be woken up from s2ram by a GPIO key. Reverting the commit
> > fixes the issue.
> >
> > On these systems, the GPIO/PFC block has its interrupt lines connected
> > to intermediate interrupt controllers (Renesas INTC), which are in turn
> > connected to the main interrupt controller (ARM GIC). The INTC block is
> > part of a power and clock domain. Hence if a GPIO is enabled as a
> > wake-up source, the INTC is part of the wake-up path, and thus must be
> > kept enabled when entering s2ram.
> >
> > While this commit has no impact on probe order for me (unlike in Marek's
> > case), it does have an impact on suspend order:
> > - Before this commit:
> > 1. The keyboard (gpio-keys) is suspended, and calls
> > enable_irq_wake() to inform the upstream interrupt controller
> > (INTC) that it is part of the wake-up path,
> > 2. INTC is suspended, and calls device_set_wakeup_path() to inform
> > the device core that it must be kept enabled,
> > 3. The system is woken by pressing a wake-up key.
> >
> > - After this commit:
> > 1. INTC is suspended, and is not aware it is part of the wake-up
> > path, so it is disabled by the device core,
> > 2. gpio-keys is suspended, and calls enable_irq_wake() in vain,
> > 3. Pressing a wake-up key has no effect, as INTC is disabled, and
> > the interrupt does not come through.
> >
> > It looks like no device links are involved, as both gpio-keys and INTC have
> > no links.
> > Do you have a clue?
> >
> > Thanks!
>
> That patch of mine defers probe on all devices added by the
> of_platform_default_populate() call, and then once the call returns,
> it immediately triggers a deferred probe.
>
> So all these devices are being probed in parallel in the deferred
> probe workqueue while the main "initcall thread" continues down to
> further initcalls. It looks like some of the drivers in subsequent
> initcalls are assuming that devices in the earlier initcalls always
> probe and can't be deferred?
>
> There are two options.
> 1. Fix these drivers.
> 2. Add a "flush deferred workqueue" in fw_devlink_resume()
>
> I'd rather we fix the drivers so that they handle deferred probes
> correctly. Thoughts?

While the affected drivers should handle deferred probe fine, none of
the affected drivers is subject to deferred probing: they all probe
successfully on first try (I had added debug prints to
platform_drv_probe() to be sure).
The affected drivers are still probed in the same order (INTC is one of
the earliest drivers probed, gpio-keys is the last). However, during
system suspend, gpio-keys is suspended before INTC, which is wrong, as
gpio-keys uses an interrupt provided by INTC.

Perhaps the "in parallel" is the real culprit, and there is a race
condition somewhere?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-06-19 01:49:47

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

On Thu, Jun 18, 2020 at 12:32 AM Geert Uytterhoeven
<[email protected]> wrote:
>
> Hi Saravana,
>
> On Wed, Jun 17, 2020 at 8:36 PM Saravana Kannan <[email protected]> wrote:
> > On Wed, Jun 17, 2020 at 5:20 AM Geert Uytterhoeven <[email protected]> wrote:
> > > On Fri, May 15, 2020 at 7:38 AM Saravana Kannan <[email protected]> wrote:
> > > > The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> > > > parsing of the device tree nodes when a lot of devices are added. This
> > > > will significantly cut down parsing time (as much a 1 second on some
> > > > systems). So, use them when adding devices for all the top level device
> > > > tree nodes in a system.
> > > >
> > > > Signed-off-by: Saravana Kannan <[email protected]>
> > >
> > > This is now commit 93d2e4322aa74c1a ("of: platform: Batch fwnode parsing
> > > when adding all top level devices") in v5.8-rc1, and I have bisected a
> > > regression to it: on r8a7740/armadillo and sh73a0/kzm9g, the system can
> > > no longer be woken up from s2ram by a GPIO key. Reverting the commit
> > > fixes the issue.
> > >
> > > On these systems, the GPIO/PFC block has its interrupt lines connected
> > > to intermediate interrupt controllers (Renesas INTC), which are in turn
> > > connected to the main interrupt controller (ARM GIC). The INTC block is
> > > part of a power and clock domain. Hence if a GPIO is enabled as a
> > > wake-up source, the INTC is part of the wake-up path, and thus must be
> > > kept enabled when entering s2ram.
> > >
> > > While this commit has no impact on probe order for me (unlike in Marek's
> > > case), it does have an impact on suspend order:
> > > - Before this commit:
> > > 1. The keyboard (gpio-keys) is suspended, and calls
> > > enable_irq_wake() to inform the upstream interrupt controller
> > > (INTC) that it is part of the wake-up path,
> > > 2. INTC is suspended, and calls device_set_wakeup_path() to inform
> > > the device core that it must be kept enabled,
> > > 3. The system is woken by pressing a wake-up key.
> > >
> > > - After this commit:
> > > 1. INTC is suspended, and is not aware it is part of the wake-up
> > > path, so it is disabled by the device core,
> > > 2. gpio-keys is suspended, and calls enable_irq_wake() in vain,
> > > 3. Pressing a wake-up key has no effect, as INTC is disabled, and
> > > the interrupt does not come through.
> > >
> > > It looks like no device links are involved, as both gpio-keys and INTC have
> > > no links.
> > > Do you have a clue?
> > >
> > > Thanks!
> >
> > That patch of mine defers probe on all devices added by the
> > of_platform_default_populate() call, and then once the call returns,
> > it immediately triggers a deferred probe.
> >
> > So all these devices are being probed in parallel in the deferred
> > probe workqueue while the main "initcall thread" continues down to
> > further initcalls. It looks like some of the drivers in subsequent
> > initcalls are assuming that devices in the earlier initcalls always
> > probe and can't be deferred?
> >
> > There are two options.
> > 1. Fix these drivers.
> > 2. Add a "flush deferred workqueue" in fw_devlink_resume()
> >
> > I'd rather we fix the drivers so that they handle deferred probes
> > correctly. Thoughts?
>
> While the affected drivers should handle deferred probe fine, none of
> the affected drivers is subject to deferred probing: they all probe
> successfully on first try (I had added debug prints to
> platform_drv_probe() to be sure).
> The affected drivers are still probed in the same order (INTC is one of
> the earliest drivers probed, gpio-keys is the last).

Thanks, this is useful info. Now I know that my patch isn't somehow
reordering devices that would have probed as soon as
of_platform_default_populate_init() added them.

When you say the "The affected drivers are still probed in the same
order", are you only referring to the devices that would have probed
before of_platform_default_populate_init() returns? Or ALL devices in
the system are probing in the same order?

I assume gpio-keys gets probed in the "normal init thread" and not by
the deferred probe workqueue? I'm guessing this because gpio_keys
driver seems to register during late_initcall() whereas
of_platform_default_populate_init() runs as an arch_initcall_sync().

> However, during
> system suspend, gpio-keys is suspended before INTC, which is wrong, as
> gpio-keys uses an interrupt provided by INTC.
>
> Perhaps the "in parallel" is the real culprit, and there is a race
> condition somewhere?

I tried digging into the gpio_keys driver code to see how it interacts
with INTC and if gpio-keys defers probe if INTC hasn't probed yet. But
it seems like a rabbit hole that'd be easier to figure out when you
have the device. Can you check if gpio-keys is probing before INTC in
the "bad" case?

Also, in general, can you see if there's a difference in the probe
order between all the devices in the system? Adding a log to
really_probe() would be better in case non-platform devices are
getting reordered (my change affects all devices that are created from
DT, not just platform devices).

I want to make sure we understand the real issue before we try to fix it.

Thanks,
Saravana

2020-06-19 13:01:50

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On Fri, Jun 19, 2020 at 1:01 AM Saravana Kannan <[email protected]> wrote:
> On Thu, Jun 18, 2020 at 12:32 AM Geert Uytterhoeven
> <[email protected]> wrote:
> > On Wed, Jun 17, 2020 at 8:36 PM Saravana Kannan <[email protected]> wrote:
> > > On Wed, Jun 17, 2020 at 5:20 AM Geert Uytterhoeven <[email protected]> wrote:
> > > > On Fri, May 15, 2020 at 7:38 AM Saravana Kannan <[email protected]> wrote:
> > > > > The fw_devlink_pause() and fw_devlink_resume() APIs allow batching the
> > > > > parsing of the device tree nodes when a lot of devices are added. This
> > > > > will significantly cut down parsing time (as much a 1 second on some
> > > > > systems). So, use them when adding devices for all the top level device
> > > > > tree nodes in a system.
> > > > >
> > > > > Signed-off-by: Saravana Kannan <[email protected]>
> > > >
> > > > This is now commit 93d2e4322aa74c1a ("of: platform: Batch fwnode parsing
> > > > when adding all top level devices") in v5.8-rc1, and I have bisected a
> > > > regression to it: on r8a7740/armadillo and sh73a0/kzm9g, the system can
> > > > no longer be woken up from s2ram by a GPIO key. Reverting the commit
> > > > fixes the issue.
> > > >
> > > > On these systems, the GPIO/PFC block has its interrupt lines connected
> > > > to intermediate interrupt controllers (Renesas INTC), which are in turn
> > > > connected to the main interrupt controller (ARM GIC). The INTC block is
> > > > part of a power and clock domain. Hence if a GPIO is enabled as a
> > > > wake-up source, the INTC is part of the wake-up path, and thus must be
> > > > kept enabled when entering s2ram.
> > > >
> > > > While this commit has no impact on probe order for me (unlike in Marek's
> > > > case), it does have an impact on suspend order:
> > > > - Before this commit:
> > > > 1. The keyboard (gpio-keys) is suspended, and calls
> > > > enable_irq_wake() to inform the upstream interrupt controller
> > > > (INTC) that it is part of the wake-up path,
> > > > 2. INTC is suspended, and calls device_set_wakeup_path() to inform
> > > > the device core that it must be kept enabled,
> > > > 3. The system is woken by pressing a wake-up key.
> > > >
> > > > - After this commit:
> > > > 1. INTC is suspended, and is not aware it is part of the wake-up
> > > > path, so it is disabled by the device core,
> > > > 2. gpio-keys is suspended, and calls enable_irq_wake() in vain,
> > > > 3. Pressing a wake-up key has no effect, as INTC is disabled, and
> > > > the interrupt does not come through.
> > > >
> > > > It looks like no device links are involved, as both gpio-keys and INTC have
> > > > no links.
> > > > Do you have a clue?
> > > >
> > > > Thanks!
> > >
> > > That patch of mine defers probe on all devices added by the
> > > of_platform_default_populate() call, and then once the call returns,
> > > it immediately triggers a deferred probe.
> > >
> > > So all these devices are being probed in parallel in the deferred
> > > probe workqueue while the main "initcall thread" continues down to
> > > further initcalls. It looks like some of the drivers in subsequent
> > > initcalls are assuming that devices in the earlier initcalls always
> > > probe and can't be deferred?
> > >
> > > There are two options.
> > > 1. Fix these drivers.
> > > 2. Add a "flush deferred workqueue" in fw_devlink_resume()
> > >
> > > I'd rather we fix the drivers so that they handle deferred probes
> > > correctly. Thoughts?
> >
> > While the affected drivers should handle deferred probe fine, none of
> > the affected drivers is subject to deferred probing: they all probe
> > successfully on first try (I had added debug prints to
> > platform_drv_probe() to be sure).
> > The affected drivers are still probed in the same order (INTC is one of
> > the earliest drivers probed, gpio-keys is the last).
>
> Thanks, this is useful info. Now I know that my patch isn't somehow
> reordering devices that would have probed as soon as
> of_platform_default_populate_init() added them.
>
> When you say the "The affected drivers are still probed in the same
> order", are you only referring to the devices that would have probed
> before of_platform_default_populate_init() returns? Or ALL devices in
> the system are probing in the same order?

I was referring to all platform devices (based on a debug print added to
platform_drv_probe()). See more below.

> I assume gpio-keys gets probed in the "normal init thread" and not by
> the deferred probe workqueue? I'm guessing this because gpio_keys
> driver seems to register during late_initcall() whereas
> of_platform_default_populate_init() runs as an arch_initcall_sync().

After adding a WARN(1, ...) to gpio_keys_probe(), the backtrace shows it
is called directly from do_one_initcall(), in both the good and the bad
case.

> > However, during
> > system suspend, gpio-keys is suspended before INTC, which is wrong, as
> > gpio-keys uses an interrupt provided by INTC.
> >
> > Perhaps the "in parallel" is the real culprit, and there is a race
> > condition somewhere?
>
> I tried digging into the gpio_keys driver code to see how it interacts
> with INTC and if gpio-keys defers probe if INTC hasn't probed yet. But
> it seems like a rabbit hole that'd be easier to figure out when you
> have the device. Can you check if gpio-keys is probing before INTC in
> the "bad" case?

It is not, gpio-keys is always probed very late.

Hence for testing, I moved gpio-keys initialization just before INTC, so
it is probed before INTC. Then gpio-keys is deferred, as expected, and
reprobes successfully later.
Interestingly, that fixes my wake-up issue, too?!?

> Also, in general, can you see if there's a difference in the probe
> order between all the devices in the system? Adding a log to
> really_probe() would be better in case non-platform devices are
> getting reordered (my change affects all devices that are created from
> DT, not just platform devices).
>
> I want to make sure we understand the real issue before we try to fix it.

Enabling all debug prints in really_probe(), comparing the output
before/after the bad commit, and filtering out all noise, I get:

bus: 'platform': really_probe: probing driver reg-dummy with
device reg-dummy
B bus: 'platform': really_probe: probing driver renesas_intc_irqpin
with device e6900000.interrupt-controller
bus: 'platform': really_probe: probing driver renesas_intc_irqpin
with device e6900004.interrupt-controller
A +Workqueue: events deferred_probe_work_func
bus: 'platform': really_probe: probing driver renesas_intc_irqpin
with device e6900008.interrupt-controller
C bus: 'platform': really_probe: probing driver renesas_intc_irqpin
with device e690000c.interrupt-controller
bus: 'platform': really_probe: probing driver sh-pfc with device
e6050000.pin-controller
bus: 'platform': really_probe: probing driver reg-fixed-voltage
with device regulator-1p8v
bus: 'platform': really_probe: probing driver reg-fixed-voltage
with device regulator-3p3v
bus: 'platform': really_probe: probing driver reg-fixed-voltage
with device regulator-vmmc-sdhi0
bus: 'platform': really_probe: probing driver reg-fixed-voltage
with device regulator-vmmc-sdhi2
bus: 'platform': really_probe: probing driver i2c-sh_mobile with
device e6820000.i2c
bus: 'i2c': really_probe: probing driver as3711 with device 0-0040
bus: 'platform': really_probe: probing driver as3711-regulator
with device as3711-regulator
bus: 'platform': really_probe: probing driver i2c-sh_mobile with
device e6822000.i2c
bus: 'platform': really_probe: probing driver i2c-sh_mobile with
device e6826000.i2c
bus: 'i2c': really_probe: probing driver pcf857x with device 2-0020
bus: 'platform': really_probe: probing driver sh_cmt with device
e6138000.timer
bus: 'platform': really_probe: probing driver armv7-pmu with device pmu
bus: 'platform': really_probe: probing driver simple-pm-bus with
device fec10000.bus
bus: 'platform': really_probe: probing driver as3711-backlight
with device as3711-backlight
bus: 'platform': really_probe: probing driver sh-sci with device
e6c80000.serial
bus: 'platform': really_probe: probing driver smsc911x with
device 10000000.ethernet
bus: 'i2c': really_probe: probing driver st1232-ts with device 1-0055
bus: 'i2c': really_probe: probing driver adxl34x with device 0-001d
bus: 'i2c': really_probe: probing driver rtc-rs5c372 with device 0-0032
bus: 'platform': really_probe: probing driver rmobile_reset with
device e6180000.system-controller
bus: 'platform': really_probe: probing driver cpufreq-dt with
device cpufreq-dt
bus: 'platform': really_probe: probing driver sh_mobile_sdhi with
device ee100000.sd
bus: 'platform': really_probe: probing driver sh_mobile_sdhi with
device ee140000.sd
bus: 'platform': really_probe: probing driver sh_mmcif with
device e6bd0000.mmc
bus: 'platform': really_probe: probing driver leds-gpio with device leds
bus: 'i2c': really_probe: probing driver ak8975 with device 0-000c
bus: 'platform': really_probe: probing driver snd-soc-dummy with
device snd-soc-dummy
bus: 'i2c': really_probe: probing driver ak4642-codec with device 0-0012
bus: 'platform': really_probe: probing driver asoc-simple-card
with device sound
bus: 'platform': really_probe: probing driver fsi-pcm-audio with
device ec230000.sound
bus: 'platform': really_probe: probing driver asoc-simple-card
with device sound
bus: 'platform': really_probe: probing driver gpio-keys with
device keyboard"
bus: 'mmc': really_probe: probing driver mmcblk with device mmc2:0001

So all devices are probed in the exact same order.
A: Note the addition of the message "Workqueue: events
deferred_probe_work_func", which might give a clue?
B,C: "e6900000.interrupt-controller" and "e6900008.interrupt-controller"
are the two devices that are suspended later in the wrong order.
One of them is probed before A, one after, so A may be a red herring?

I'm still not much wiser, though....

BTW, r8a7740/armadillo is single CPU, while sh73a0/kzm9g is dual-CPU.
So both UP and SMP are affected.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-06-20 04:39:59

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Snipped a bunch of old context.

On Fri, Jun 19, 2020 at 5:24 AM Geert Uytterhoeven <[email protected]> wrote:
>
> On Fri, Jun 19, 2020 at 1:01 AM Saravana Kannan <[email protected]> wrote:
> >
> > When you say the "The affected drivers are still probed in the same
> > order", are you only referring to the devices that would have probed
> > before of_platform_default_populate_init() returns? Or ALL devices in
> > the system are probing in the same order?
>
> I was referring to all platform devices (based on a debug print added to
> platform_drv_probe()). See more below.
>
> > I assume gpio-keys gets probed in the "normal init thread" and not by
> > the deferred probe workqueue? I'm guessing this because gpio_keys
> > driver seems to register during late_initcall() whereas
> > of_platform_default_populate_init() runs as an arch_initcall_sync().
>
> After adding a WARN(1, ...) to gpio_keys_probe(), the backtrace shows it
> is called directly from do_one_initcall(), in both the good and the bad
> case.
>
> > > However, during
> > > system suspend, gpio-keys is suspended before INTC, which is wrong, as
> > > gpio-keys uses an interrupt provided by INTC.
> > >
> > > Perhaps the "in parallel" is the real culprit, and there is a race
> > > condition somewhere?
> >
> > I tried digging into the gpio_keys driver code to see how it interacts
> > with INTC and if gpio-keys defers probe if INTC hasn't probed yet. But
> > it seems like a rabbit hole that'd be easier to figure out when you
> > have the device. Can you check if gpio-keys is probing before INTC in
> > the "bad" case?
>
> It is not, gpio-keys is always probed very late.
>
> Hence for testing, I moved gpio-keys initialization just before INTC, so
> it is probed before INTC. Then gpio-keys is deferred, as expected, and
> reprobes successfully later.
> Interestingly, that fixes my wake-up issue, too?!?
>
> > Also, in general, can you see if there's a difference in the probe
> > order between all the devices in the system? Adding a log to
> > really_probe() would be better in case non-platform devices are
> > getting reordered (my change affects all devices that are created from
> > DT, not just platform devices).
> >
> > I want to make sure we understand the real issue before we try to fix it.
>
> Enabling all debug prints in really_probe(), comparing the output
> before/after the bad commit, and filtering out all noise, I get:
>
> bus: 'platform': really_probe: probing driver reg-dummy with
> device reg-dummy
> B bus: 'platform': really_probe: probing driver renesas_intc_irqpin
> with device e6900000.interrupt-controller
> bus: 'platform': really_probe: probing driver renesas_intc_irqpin
> with device e6900004.interrupt-controller
> A +Workqueue: events deferred_probe_work_func
> bus: 'platform': really_probe: probing driver renesas_intc_irqpin
> with device e6900008.interrupt-controller
> C bus: 'platform': really_probe: probing driver renesas_intc_irqpin
> with device e690000c.interrupt-controller
> bus: 'platform': really_probe: probing driver sh-pfc with device
> e6050000.pin-controller
> bus: 'platform': really_probe: probing driver reg-fixed-voltage
> with device regulator-1p8v
> bus: 'platform': really_probe: probing driver reg-fixed-voltage
> with device regulator-3p3v
> bus: 'platform': really_probe: probing driver reg-fixed-voltage
> with device regulator-vmmc-sdhi0
> bus: 'platform': really_probe: probing driver reg-fixed-voltage
> with device regulator-vmmc-sdhi2
> bus: 'platform': really_probe: probing driver i2c-sh_mobile with
> device e6820000.i2c
> bus: 'i2c': really_probe: probing driver as3711 with device 0-0040
> bus: 'platform': really_probe: probing driver as3711-regulator
> with device as3711-regulator
> bus: 'platform': really_probe: probing driver i2c-sh_mobile with
> device e6822000.i2c
> bus: 'platform': really_probe: probing driver i2c-sh_mobile with
> device e6826000.i2c
> bus: 'i2c': really_probe: probing driver pcf857x with device 2-0020
> bus: 'platform': really_probe: probing driver sh_cmt with device
> e6138000.timer
> bus: 'platform': really_probe: probing driver armv7-pmu with device pmu
> bus: 'platform': really_probe: probing driver simple-pm-bus with
> device fec10000.bus
> bus: 'platform': really_probe: probing driver as3711-backlight
> with device as3711-backlight
> bus: 'platform': really_probe: probing driver sh-sci with device
> e6c80000.serial
> bus: 'platform': really_probe: probing driver smsc911x with
> device 10000000.ethernet
> bus: 'i2c': really_probe: probing driver st1232-ts with device 1-0055
> bus: 'i2c': really_probe: probing driver adxl34x with device 0-001d
> bus: 'i2c': really_probe: probing driver rtc-rs5c372 with device 0-0032
> bus: 'platform': really_probe: probing driver rmobile_reset with
> device e6180000.system-controller
> bus: 'platform': really_probe: probing driver cpufreq-dt with
> device cpufreq-dt
> bus: 'platform': really_probe: probing driver sh_mobile_sdhi with
> device ee100000.sd
> bus: 'platform': really_probe: probing driver sh_mobile_sdhi with
> device ee140000.sd
> bus: 'platform': really_probe: probing driver sh_mmcif with
> device e6bd0000.mmc
> bus: 'platform': really_probe: probing driver leds-gpio with device leds
> bus: 'i2c': really_probe: probing driver ak8975 with device 0-000c
> bus: 'platform': really_probe: probing driver snd-soc-dummy with
> device snd-soc-dummy
> bus: 'i2c': really_probe: probing driver ak4642-codec with device 0-0012
> bus: 'platform': really_probe: probing driver asoc-simple-card
> with device sound
> bus: 'platform': really_probe: probing driver fsi-pcm-audio with
> device ec230000.sound
> bus: 'platform': really_probe: probing driver asoc-simple-card
> with device sound
> bus: 'platform': really_probe: probing driver gpio-keys with
> device keyboard"
> bus: 'mmc': really_probe: probing driver mmcblk with device mmc2:0001
>
> So all devices are probed in the exact same order.
> A: Note the addition of the message "Workqueue: events
> deferred_probe_work_func", which might give a clue?
> B,C: "e6900000.interrupt-controller" and "e6900008.interrupt-controller"
> are the two devices that are suspended later in the wrong order.
> One of them is probed before A, one after, so A may be a red herring?
>

Thanks! Your logs and the tip above helped!

*face palm*/Sigh... (for what is happening)

In the absence of other info (like device links, parent-child
relationship, etc) the device_add() order is used as an implicit order
for suspend (in reverse) and resume.
device_add() -> device_pm_add() -> list_add_tail(..., &dpm_list)

But the deferred probe code reorders the deferred devices:
deferred_probe_work_func() -> device_pm_move_to_tail()

So, this is what is happening:
1. of_platform_default_populate() adds all the top level devices in DT order.
2. So dpm_list is: ... INTC, ..., gpio-keys
3. Only devices with drivers already registered by this time get added
to the deferred probe list (because that only happens when a driver
matches)
4. INTC is one of them.
5. So when deferred_probe_work_func() gets to INTC, it moves it to the
end of the dpm_list
6. So dpm_list is now: ..., gpio-keys, ... INTC

Things fail.

Obviously, my patch is causing this unintended behavior and I could
try to fix it. But I'm not really sure if the actual bug is in my
patch...

Deferred probe is supposed to allow a driver to defer probe without
worrying something else is going to get screwed up. Also, order of
devices in DT isn't supposed to affect functionality. Both of these
are not true with the current behavior.

I think instead of deferred_probe_work_func() moving the device to the
end of the dpm_list, I think the device probing successfully is what
should move it to the end of the dpm_list. That way, the dpm_list is
actually ordered by when the devices become functional and not the
random order in DT or random probe order which can get pretty
convoluted with multiple deferred probes. This feels right and will
make suspend/resume more robust against DT ordering -- but I'm not
sure what other wide ranging impact this has for other platforms.

Greg/Rafael, do you agree with the paragraph above? If yes, I can send
a patch out that orders dpm_list by probe success order.

Thanks,
Saravana

2020-06-20 05:04:50

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

On Fri, Jun 19, 2020 at 1:07 PM Saravana Kannan <[email protected]> wrote:
>
> I think instead of deferred_probe_work_func() moving the device to the
> end of the dpm_list, I think the device probing successfully is what
> should move it to the end of the dpm_list. That way, the dpm_list is
> actually ordered by when the devices become functional and not the
> random order in DT or random probe order which can get pretty
> convoluted with multiple deferred probes. This feels right and will
> make suspend/resume more robust against DT ordering -- but I'm not
> sure what other wide ranging impact this has for other platforms.

Geert,

If you want to play around with a potential fix to test my hypothesis,
I think it's just adding this one line to driver_bound():
============
klist_add_tail(&dev->p->knode_driver, &dev->driver->p->klist_devices);
device_links_driver_bound(dev);
+device_pm_move_to_tail(dev);

device_pm_check_callbacks(dev);
============

-Saravana

2020-06-22 15:51:53

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

Hi Saravana,

On Sat, Jun 20, 2020 at 4:33 AM Saravana Kannan <[email protected]> wrote:
> On Fri, Jun 19, 2020 at 1:07 PM Saravana Kannan <[email protected]> wrote:
> > I think instead of deferred_probe_work_func() moving the device to the
> > end of the dpm_list, I think the device probing successfully is what
> > should move it to the end of the dpm_list. That way, the dpm_list is
> > actually ordered by when the devices become functional and not the
> > random order in DT or random probe order which can get pretty
> > convoluted with multiple deferred probes. This feels right and will
> > make suspend/resume more robust against DT ordering -- but I'm not
> > sure what other wide ranging impact this has for other platforms.
>
> If you want to play around with a potential fix to test my hypothesis,
> I think it's just adding this one line to driver_bound():
> ============
> klist_add_tail(&dev->p->knode_driver, &dev->driver->p->klist_devices);
> device_links_driver_bound(dev);
> +device_pm_move_to_tail(dev);
>
> device_pm_check_callbacks(dev);
> ============

Thanks, that seems to fix the issue for me, on both affected systems!
Note that this has quite some impact on the order devices are suspended,
but this seems harmless.

Will try on more systems later...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-06-24 23:24:38

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] of: platform: Batch fwnode parsing when adding all top level devices

On Mon, Jun 22, 2020 at 8:49 AM Geert Uytterhoeven <[email protected]> wrote:
>
> Hi Saravana,
>
> On Sat, Jun 20, 2020 at 4:33 AM Saravana Kannan <[email protected]> wrote:
> > On Fri, Jun 19, 2020 at 1:07 PM Saravana Kannan <[email protected]> wrote:
> > > I think instead of deferred_probe_work_func() moving the device to the
> > > end of the dpm_list, I think the device probing successfully is what
> > > should move it to the end of the dpm_list. That way, the dpm_list is
> > > actually ordered by when the devices become functional and not the
> > > random order in DT or random probe order which can get pretty
> > > convoluted with multiple deferred probes. This feels right and will
> > > make suspend/resume more robust against DT ordering -- but I'm not
> > > sure what other wide ranging impact this has for other platforms.
> >
> > If you want to play around with a potential fix to test my hypothesis,
> > I think it's just adding this one line to driver_bound():
> > ============
> > klist_add_tail(&dev->p->knode_driver, &dev->driver->p->klist_devices);
> > device_links_driver_bound(dev);
> > +device_pm_move_to_tail(dev);
> >
> > device_pm_check_callbacks(dev);
> > ============
>
> Thanks, that seems to fix the issue for me, on both affected systems!
> Note that this has quite some impact on the order devices are suspended,
> but this seems harmless.
>
> Will try on more systems later...

Thanks for testing. Maybe I should just send that change as a patch
and see what Greg/Rafael have to say to that.

It's a general fix anyway. So, might as well send it out.

-Saravana