2024-01-12 08:12:26

by Thomas Perrot

Subject: Failed to create a rescuer kthread for the amdgpu-reset-dev workqueue

Hello,

We are updating the kernel from 6.1 to 6.6 and we observe an amdgpu
regression with a Radeon RX580 8GB on a SiFive Unmatched:
“workqueue: Failed to create a rescuer kthread for wq 'amdgpu-reset-dev': -EINTR
[drm:amdgpu_reset_create_reset_domain [amdgpu]] *ERROR* Failed to allocate wq for amdgpu_reset_domain!
amdgpu 0000:07:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:07:00.0: amdgpu: amdgpu: finishing device.
amdgpu: probe of 0000:07:00.0 failed with error -12”

We have tried to figure it out, without success so far. Do you have any
advice on how to identify the root cause and fix it?
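
As a first diagnostic step, the two error codes in the log can be decoded
by name and number. The following is only a minimal Python sketch, assuming
a Linux host whose errno numbering matches the kernel's; the helper names
decode and decode_log are hypothetical and not part of any kernel tooling.

```python
import errno
import os
import re

# Two lines copied from the log in this report.
LOG = """\
workqueue: Failed to create a rescuer kthread for wq 'amdgpu-reset-dev': -EINTR
amdgpu: probe of 0000:07:00.0 failed with error -12
"""

# Matches a symbolic errno after ": " or a numeric one after "error ".
PATTERN = re.compile(r"(?:: |error )(-(?:E[A-Z]+|\d+))\b")


def decode(token: str) -> str:
    """Return 'NAME (number): description' for '-EINTR' or '-12' (hypothetical helper)."""
    token = token.lstrip("-")
    number = int(token) if token.isdigit() else getattr(errno, token)
    name = errno.errorcode.get(number, "unknown")
    return f"{name} ({number}): {os.strerror(number)}"


def decode_log(log: str) -> list[str]:
    """Decode every error code found in the given log text (hypothetical helper)."""
    return [decode(match.group(1)) for match in PATTERN.finditer(log)]


if __name__ == "__main__":
    for line in decode_log(LOG):
        print(line)
```

Decoded this way, -EINTR is errno 4 (interrupted call) and -12 is ENOMEM.
One possible reading, not confirmed here, is that the -12 reported by the
probe is a consequence of the interrupted kthread creation rather than a
real memory shortage.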

Kind regards,
Thomas Perrot

--
Thomas Perrot, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com



2024-01-12 08:17:55

by Christian König

Subject: Re: Failed to create a rescuer kthread for the amdgpu-reset-dev workqueue

Well, the driver load is interrupted for some reason.

Have you set any timeout for modprobe?

Regards,
Christian.



2024-01-15 10:20:42

by Christian König

Subject: Re: Failed to create a rescuer kthread for the amdgpu-reset-dev workqueue

On 15.01.24 at 11:17, Thomas Perrot wrote:
> Hello Christian,
>
> On Fri, 2024-01-12 at 09:17 +0100, Christian König wrote:
>> Well the driver load is interrupted for some reason.
>>
>> Have you set any timeout for modprobe?
>>
> We don't set a modprobe timeout.

Well, you somehow abort probing the driver.

This seems to be an external event and not something the driver can
influence.

Regards,
Christian.



2024-01-15 10:38:13

by Thomas Perrot

Subject: Re: Failed to create a rescuer kthread for the amdgpu-reset-dev workqueue

Hello Christian,

On Fri, 2024-01-12 at 09:17 +0100, Christian König wrote:
> Well the driver load is interrupted for some reason.
>
> Have you set any timeout for modprobe?
>

We don't set a modprobe timeout.

Kind regards,
Thomas


--
Thomas Perrot, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

