2022-06-13 14:35:12

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction

From: Saravana Kannan <[email protected]>

[ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]

Mounting NFS rootfs was timing out when deferred_probe_timeout was
non-zero [1]. This was because ip_auto_config() initcall times out
waiting for the network interfaces to show up when
deferred_probe_timeout was non-zero. While ip_auto_config() calls
wait_for_device_probe() to make sure any currently running deferred
probe work or asynchronous probe finishes, that wasn't sufficient to
account for devices being deferred until deferred_probe_timeout.

Commit 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits
until the deferred_probe_timeout fires") tried to fix that by making
sure wait_for_device_probe() waits for deferred_probe_timeout to expire
before returning.

However, if wait_for_device_probe() is called from the kernel_init()
context:

- Before deferred_probe_initcall() [2], it causes the boot process to
hang due to a deadlock.

- After deferred_probe_initcall() [3], it blocks kernel_init() from
continuing till deferred_probe_timeout expires and beats the point of
deferred_probe_timeout that's trying to wait for userspace to load
modules.

Neither of this is good. So revert the changes to
wait_for_device_probe().

[1] - https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@TYAPR01MB4544.jpnprd01.prod.outlook.com/
[2] - https://lore.kernel.org/lkml/[email protected]/
[3] - https://lore.kernel.org/lkml/[email protected]/

Fixes: 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires")
Cc: John Stultz <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Yoshihiro Shimoda <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Naresh Kamboju <[email protected]>
Cc: Basil Eljuse <[email protected]>
Cc: Ferry Toth <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Anders Roxell <[email protected]>
Cc: [email protected]
Reported-by: Nathan Chancellor <[email protected]>
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Acked-by: John Stultz <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/base/dd.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 977e94cf669e..86fd2ea35656 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -257,7 +257,6 @@ DEFINE_SHOW_ATTRIBUTE(deferred_devs);

int driver_deferred_probe_timeout;
EXPORT_SYMBOL_GPL(driver_deferred_probe_timeout);
-static DECLARE_WAIT_QUEUE_HEAD(probe_timeout_waitqueue);

static int __init deferred_probe_timeout_setup(char *str)
{
@@ -312,7 +311,6 @@ static void deferred_probe_timeout_work_func(struct work_struct *work)
list_for_each_entry(p, &deferred_probe_pending_list, deferred_probe)
dev_info(p->device, "deferred probe pending\n");
mutex_unlock(&deferred_probe_mutex);
- wake_up_all(&probe_timeout_waitqueue);
}
static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_work_func);

@@ -720,9 +718,6 @@ int driver_probe_done(void)
*/
void wait_for_device_probe(void)
{
- /* wait for probe timeout */
- wait_event(probe_timeout_waitqueue, !driver_deferred_probe_timeout);
-
/* wait for the deferred probe workqueue to finish */
flush_work(&deferred_probe_work);

--
2.35.1




2023-08-16 12:10:05

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction

Hi Shreeya,

On Wed, Aug 16, 2023 at 11:39 AM Shreeya Patel
<[email protected]> wrote:
> On 13/06/22 15:40, Greg Kroah-Hartman wrote:
> > From: Saravana Kannan<[email protected]>
> >
> > [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
> >
> > Mounting NFS rootfs was timing out when deferred_probe_timeout was
> > non-zero [1]. This was because ip_auto_config() initcall times out
> > waiting for the network interfaces to show up when
> > deferred_probe_timeout was non-zero. While ip_auto_config() calls
> > wait_for_device_probe() to make sure any currently running deferred
> > probe work or asynchronous probe finishes, that wasn't sufficient to
> > account for devices being deferred until deferred_probe_timeout.
> >
> > Commit 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits
> > until the deferred_probe_timeout fires") tried to fix that by making
> > sure wait_for_device_probe() waits for deferred_probe_timeout to expire
> > before returning.
> >
> > However, if wait_for_device_probe() is called from the kernel_init()
> > context:
> >
> > - Before deferred_probe_initcall() [2], it causes the boot process to
> > hang due to a deadlock.
> >
> > - After deferred_probe_initcall() [3], it blocks kernel_init() from
> > continuing till deferred_probe_timeout expires and beats the point of
> > deferred_probe_timeout that's trying to wait for userspace to load
> > modules.
> >
> > Neither of this is good. So revert the changes to
> > wait_for_device_probe().
> >
> > [1] -https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@TYAPR01MB4544.jpnprd01.prod.outlook.com/
> > [2] -https://lore.kernel.org/lkml/[email protected]/
> > [3] -https://lore.kernel.org/lkml/[email protected]/
>
> Hi Saravana, Greg,
>
>
> KernelCI found this patch causes the baseline.bootrr.deferred-probe-empty test to fail on r8a77960-ulcb,
> see the following details for more information.

Commit 9be4cbd09da820a2 ("driver core: Set default deferred_probe_timeout
back to 0.") in v5.19 contains a reference to the same commit as
mentioned in the Fixes tag. Does backporting that help?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-08-16 13:16:28

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction

On Wed, Aug 16, 2023 at 12:10 PM Geert Uytterhoeven
<[email protected]> wrote:
> On Wed, Aug 16, 2023 at 11:39 AM Shreeya Patel
> <[email protected]> wrote:
> > On 13/06/22 15:40, Greg Kroah-Hartman wrote:
> > > From: Saravana Kannan<[email protected]>
> > >
> > > [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
> > >
> > > Mounting NFS rootfs was timing out when deferred_probe_timeout was
> > > non-zero [1]. This was because ip_auto_config() initcall times out
> > > waiting for the network interfaces to show up when
> > > deferred_probe_timeout was non-zero. While ip_auto_config() calls
> > > wait_for_device_probe() to make sure any currently running deferred
> > > probe work or asynchronous probe finishes, that wasn't sufficient to
> > > account for devices being deferred until deferred_probe_timeout.
> > >
> > > Commit 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits
> > > until the deferred_probe_timeout fires") tried to fix that by making
> > > sure wait_for_device_probe() waits for deferred_probe_timeout to expire
> > > before returning.
> > >
> > > However, if wait_for_device_probe() is called from the kernel_init()
> > > context:
> > >
> > > - Before deferred_probe_initcall() [2], it causes the boot process to
> > > hang due to a deadlock.
> > >
> > > - After deferred_probe_initcall() [3], it blocks kernel_init() from
> > > continuing till deferred_probe_timeout expires and beats the point of
> > > deferred_probe_timeout that's trying to wait for userspace to load
> > > modules.
> > >
> > > Neither of this is good. So revert the changes to
> > > wait_for_device_probe().
> > >
> > > [1] -https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@TYAPR01MB4544.jpnprd01.prod.outlook.com/
> > > [2] -https://lore.kernel.org/lkml/[email protected]/
> > > [3] -https://lore.kernel.org/lkml/[email protected]/
> >
> > Hi Saravana, Greg,
> >
> >
> > KernelCI found this patch causes the baseline.bootrr.deferred-probe-empty test to fail on r8a77960-ulcb,
> > see the following details for more information.
>
> Commit 9be4cbd09da820a2 ("driver core: Set default deferred_probe_timeout
> back to 0.") in v5.19 contains a reference to the same commit as
> mentioned in the Fixes tag. Does backporting that help?

Anyway, remembering the days (weeks?) spent in investigating
subtle issues with fw_devlinks and deferred probe, collecting all the
fixes for backporting to stable may be a very hard job...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-08-21 13:48:19

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction

On 2023-08-21 12:35, Shreeya Patel wrote:
>
> On 19/08/23 01:49, Saravana Kannan wrote:
>> On Thu, Aug 17, 2023 at 4:13 PM Shreeya Patel
>> <[email protected]> wrote:
>>> Hi Geert, Saravana,
>>>
>>> On 18/08/23 00:03, Saravana Kannan wrote:
>>>> On Thu, Aug 17, 2023 at 4:37 AM Shreeya Patel
>>>> <[email protected]> wrote:
>>>>> Hi Greg,
>>>>>
>>>>> On 16/08/23 20:33, Greg Kroah-Hartman wrote:
>>>>>> On Wed, Aug 16, 2023 at 03:09:27PM +0530, Shreeya Patel wrote:
>>>>>>> On 13/06/22 15:40, Greg Kroah-Hartman wrote:
>>>>>>>> From: Saravana Kannan<[email protected]>
>>>>>>>>
>>>>>>>> [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
>>>>>>>>
>>>>>>>> Mounting NFS rootfs was timing out when deferred_probe_timeout was
>>>>>>>> non-zero [1].  This was because ip_auto_config() initcall times out
>>>>>>>> waiting for the network interfaces to show up when
>>>>>>>> deferred_probe_timeout was non-zero. While ip_auto_config() calls
>>>>>>>> wait_for_device_probe() to make sure any currently running deferred
>>>>>>>> probe work or asynchronous probe finishes, that wasn't
>>>>>>>> sufficient to
>>>>>>>> account for devices being deferred until deferred_probe_timeout.
>>>>>>>>
>>>>>>>> Commit 35a672363ab3 ("driver core: Ensure
>>>>>>>> wait_for_device_probe() waits
>>>>>>>> until the deferred_probe_timeout fires") tried to fix that by
>>>>>>>> making
>>>>>>>> sure wait_for_device_probe() waits for deferred_probe_timeout to
>>>>>>>> expire
>>>>>>>> before returning.
>>>>>>>>
>>>>>>>> However, if wait_for_device_probe() is called from the
>>>>>>>> kernel_init()
>>>>>>>> context:
>>>>>>>>
>>>>>>>> - Before deferred_probe_initcall() [2], it causes the boot
>>>>>>>> process to
>>>>>>>>       hang due to a deadlock.
>>>>>>>>
>>>>>>>> - After deferred_probe_initcall() [3], it blocks kernel_init() from
>>>>>>>>       continuing till deferred_probe_timeout expires and beats
>>>>>>>> the point of
>>>>>>>>       deferred_probe_timeout that's trying to wait for userspace
>>>>>>>> to load
>>>>>>>>       modules.
>>>>>>>>
>>>>>>>> Neither of this is good. So revert the changes to
>>>>>>>> wait_for_device_probe().
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> -https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@TYAPR01MB4544.jpnprd01.prod.outlook.com/
>>>>>>>> [2]
>>>>>>>> -https://lore.kernel.org/lkml/[email protected]/
>>>>>>>> [3] -https://lore.kernel.org/lkml/[email protected]/
>>>>>>> Hi Saravana, Greg,
>>>>>>>
>>>>>>>
>>>>>>> KernelCI found this patch causes the
>>>>>>> baseline.bootrr.deferred-probe-empty test to fail on r8a77960-ulcb,
>>>>>>> see the following details for more information.
>>>>>>>
>>>>>>> KernelCI dashboard link:
>>>>>>> https://linux.kernelci.org/test/plan/id/64d2a6be8c1a8435e535b264/
>>>>>>>
>>>>>>> Error messages from the logs :-
>>>>>>>
>>>>>>> + UUID=11236495_1.5.2.4.5
>>>>>>> + set +x
>>>>>>> + export
>>>>>>> 'PATH=/opt/bootrr/libexec/bootrr/helpers:/lava-11236495/1/../bin:/sbin:/usr/sbin:/bin:/usr/bin'
>>>>>>> + cd /opt/bootrr/libexec/bootrr
>>>>>>> + sh helpers/bootrr-auto
>>>>>>> e6800000.ethernet
>>>>>>> e6700000.dma-controller
>>>>>>> e7300000.dma-controller
>>>>>>> e7310000.dma-controller
>>>>>>> ec700000.dma-controller
>>>>>>> ec720000.dma-controller
>>>>>>> fea20000.vsp
>>>>>>> feb00000.display
>>>>>>> fea28000.vsp
>>>>>>> fea30000.vsp
>>>>>>> fe9a0000.vsp
>>>>>>> fe9af000.fcp
>>>>>>> fea27000.fcp
>>>>>>> fea2f000.fcp
>>>>>>> fea37000.fcp
>>>>>>> sound
>>>>>>> ee100000.mmc
>>>>>>> ee140000.mmc
>>>>>>> ec500000.sound
>>>>>>> /lava-11236495/1/../bin/lava-test-case
>>>>>>> <8>[   17.476741] <LAVA_SIGNAL_TESTCASE
>>>>>>> TEST_CASE_ID=deferred-probe-empty RESULT=fail>
>>>>>>>
>>>>>>> Test case failing :-
>>>>>>> Baseline Bootrr deferred-probe-empty test
>>>>>>> -https://github.com/kernelci/bootrr/blob/main/helpers/bootrr-generic-tests
>>>>>>>
>>>>>>> Regression Reproduced :-
>>>>>>>
>>>>>>> Lava job after reverting the commit 5ee76c256e92
>>>>>>> https://lava.collabora.dev/scheduler/job/11292890
>>>>>>>
>>>>>>>
>>>>>>> Bisection report from KernelCI can be found at the bottom of the
>>>>>>> email.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shreeya Patel
>>>>>>>
>>>>>>> #regzbot introduced: 5ee76c256e92
>>>>>>> #regzbot title: KernelCI: Multiple devices deferring on
>>>>>>> r8a77960-ulcb
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * **
>>>>>>> * If you do send a fix, please include this trailer: *
>>>>>>> * Reported-by: "kernelci.org bot" <bot@...> *
>>>>>>> * *
>>>>>>> * Hope this helps! *
>>>>>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
>>>>>>>
>>>>>>> stable-rc/linux-5.10.y bisection:
>>>>>>> baseline.bootrr.deferred-probe-empty on
>>>>>>> r8a77960-ulcb
>>>>>> You are testing 5.10.y, yet the subject says 5.17?
>>>>>>
>>>>>> Which is it here?
>>>>> Sorry, I accidentally used the lore link for 5.17 while reporting this
>>>>> issue,
>>>>> but this test does fail on all the stable releases from 5.10 onwards.
>>>>>
>>>>> stable 5.15 :-
>>>>> https://linux.kernelci.org/test/case/id/64dd156a5ac58d0cf335b1ea/
>>>>> mainline :-
>>>>> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
>>>>>
>>>> Shreeya, can you try the patch Geert suggested and let us know if it
>>>> helps? If not, then I can try to take a closer look.
>>> I tried to test the kernel with 9be4cbd09da8 but it didn't change the
>>> result.
>>> https://lava.collabora.dev/scheduler/job/11311615
>>>
>>> Also, I am not sure if this can change things but just FYI, KernelCI
>>> adds some kernel parameters when running these tests and one of the
>>> parameter is deferred_probe_timeout=60.
>> Ah this is good to know.
>>
>>> You can check this in the definition details given in the Lava job. I
>>> also tried to remove this parameter and rerun the test but again I got
>>> the same result.
>> How long does the test wait after boot before checking for the
>> deferred devices list?
>>
>
> AFAIK, script for running the tests is immediately ran after the boot
> process is complete so there is no wait time.

Regardless of what the kernel is doing, it seems like a fundamentally
dumb test to specifically ask deferred probe to wait for up to a minute
then complain that it hasn't finished after 11 seconds :/

If anything, it seems plausible that the "regression" might actually be
the correct behaviour, and it was wrong before. I can't manage to pull
up a boot log for a pre-5.10 kernel since all the async stuff on the
KernelCI dashboard always just times out for me with a helpful "Error
while loading data from the server (error code: 0)", but what would be
interesting is whether those devices on the list are expected to
successfully probe anyway - the mainline log below also shows other
stuff failing to probe and CPUs failing to come online, so it's clearly
not a very happy platform to begin with.

Robin.

>>> I will try to add 9be4cbd09da8 to mainline kernel and see what results I
>>> get.
>> Now I'm confused. What do you mean by mainline? Are you saying the tip
>> of tree of Linus's tree is also hitting this issue?
>
>
> KernelCI runs tests on different kernel branches and trees, we also have
> this same test running on mainline tree.
> Following is the link to the dashboard for it and as you can see, it
> does fail there too.
>
>
> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
>
>
>> -Saravana
>>