2024-05-05 01:12:35

by Micha Albert

Subject: [REGRESSION] Thunderbolt Host Reset Change Causes eGPU Disconnection from 6.8.7=>6.8.8

Hello,

I have an AMD Radeon 6600 XT GPU in a cheap Thunderbolt eGPU board. In 6.8.7, this works as expected, and my Plymouth screen (including the LUKS password prompt) shows on my 2 monitors connected to the GPU as well as my main laptop screen. Upon entering the password, I'm put into userspace as expected. However, upon upgrading to 6.8.8, I will be greeted with the regular password prompt, but after entering my password and waiting for it to be accepted, my eGPU will reset and not function. I can tell that it resets since I can hear the click of my ATX power supply turning off and on again, and the status LED of the eGPU board goes from green to blue and back to green, all in less than a second.

I talked to a friend, and we found out that the kernel parameter thunderbolt.host_reset=false fixes the issue. He also thinks that commits cc4c94 (59a54c upstream) and 11371c (ec8162 upstream) look suspicious. I've attached the output of dmesg when the error was occurring, since I'm still able to use my laptop normally when this happens, just not with my eGPU and its connected displays.
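
To confirm which setting is actually in effect on a given boot, something like the following Python sketch could be used. It assumes the thunderbolt module exposes host_reset read-only under /sys/module/thunderbolt/parameters/, which should be checked on the running kernel; the function name is made up for illustration.

```python
from pathlib import Path


def host_reset_from_cmdline(cmdline: str):
    """Return the value given to thunderbolt.host_reset, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "thunderbolt.host_reset" and sep:
            value = val  # if the option is repeated, keep the last one
    return value


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        print("command line:", host_reset_from_cmdline(cmdline.read_text()))
    # Assumed sysfs location of the module parameter; verify on your kernel.
    param = Path("/sys/module/thunderbolt/parameters/host_reset")
    if param.exists():
        print("module parameter:", param.read_text().strip())
    else:
        print("thunderbolt not loaded, or parameter not exposed")
```

A result of None for the command line means the option was not passed at boot and the driver default applies.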

Sincerely,
Micha Albert


Attachments:
kernel-log-thunderbolt-error.log (119.08 kB)

2024-05-05 05:00:10

by Thorsten Leemhuis

Subject: Re: [REGRESSION] Thunderbolt Host Reset Change Causes eGPU Disconnection from 6.8.7=>6.8.8

[CCing Mario, who asked for the two suspected commits to be backported]

On 05.05.24 03:12, Micha Albert wrote:
>
>     I have an AMD Radeon 6600 XT GPU in a cheap Thunderbolt eGPU board.
> In 6.8.7, this works as expected, and my Plymouth screen (including the
> LUKS password prompt) shows on my 2 monitors connected to the GPU as
> well as my main laptop screen. Upon entering the password, I'm put into
> userspace as expected. However, upon upgrading to 6.8.8, I will be
> greeted with the regular password prompt, but after entering my password
> and waiting for it to be accepted, my eGPU will reset and not function.
> I can tell that it resets since I can hear the click of my ATX power
> supply turning off and on again, and the status LED of the eGPU board
> goes from green to blue and back to green, all in less than a second.
>
>    I talked to a friend, and we found out that the kernel parameter
> thunderbolt.host_reset=false fixes the issue. He also thinks that
> commits cc4c94 (59a54c upstream) and 11371c (ec8162 upstream) look
> suspicious. I've attached the output of dmesg when the error was
> occurring, since I'm still able to use my laptop normally when this
> happens, just not with my eGPU and its connected displays.

Thx for the report. Could you please test if 6.9-rc6 (or a later
snapshot; or -rc7, which should be out in about ~18 hours) is affected
as well? That would be really important to know.

It would also be great if you could try reverting the two patches you
mentioned and see if they are really what's causing this. There iirc are
two more; maybe you might need to revert some or all of them in the
order they were applied.

Ciao, Thorsten

P.s.: To be sure the issue doesn't fall through the cracks unnoticed,
I'm adding it to regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced v6.8.7..v6.8.8
#regzbot title thunderbolt: eGPU disconnected during boot

2024-05-05 12:37:20

by Mario Limonciello

Subject: Re: [REGRESSION] Thunderbolt Host Reset Change Causes eGPU Disconnection from 6.8.7=>6.8.8



On 5/4/24 23:59, Linux regression tracking (Thorsten Leemhuis) wrote:
> [CCing Mario, who asked for the two suspected commits to be backported]
>
> On 05.05.24 03:12, Micha Albert wrote:
>>
>>     I have an AMD Radeon 6600 XT GPU in a cheap Thunderbolt eGPU board.
>> In 6.8.7, this works as expected, and my Plymouth screen (including the
>> LUKS password prompt) shows on my 2 monitors connected to the GPU as
>> well as my main laptop screen. Upon entering the password, I'm put into
>> userspace as expected. However, upon upgrading to 6.8.8, I will be
>> greeted with the regular password prompt, but after entering my password
>> and waiting for it to be accepted, my eGPU will reset and not function.
>> I can tell that it resets since I can hear the click of my ATX power
>> supply turning off and on again, and the status LED of the eGPU board
>> goes from green to blue and back to green, all in less than a second.
>>
>>    I talked to a friend, and we found out that the kernel parameter
>> thunderbolt.host_reset=false fixes the issue. He also thinks that
>> commits cc4c94 (59a54c upstream) and 11371c (ec8162 upstream) look
>> suspicious. I've attached the output of dmesg when the error was
>> occurring, since I'm still able to use my laptop normally when this
>> happens, just not with my eGPU and its connected displays.
>
> Thx for the report. Could you please test if 6.9-rc6 (or a later
> snapshot; or -rc7, which should be out in about ~18 hours) is affected
> as well? That would be really important to know.
>
> It would also be great if you could try reverting the two patches you
> mentioned and see if they are really what's causing this. There iirc are
> two more; maybe you might need to revert some or all of them in the
> order they were applied.

There are two other things that I think would be good to check in order
to understand this issue.

1) Is it related to trusted devices handling?

You can try applying the patch below to either 6.8.y or 6.9-rc.

https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/commit/?h=iommu/fixes&id=0f91d0795741c12cee200667648669a91b568735

2) Is it because you have amdgpu in your initramfs but not thunderbolt?

If so, there's very likely an ordering issue.

[ 2.325788] [drm] GPU posting now...
[ 30.360701] ACPI: bus type thunderbolt registered

Can you remove amdgpu from your initramfs and wait for it to start up
after you pivot rootfs? Does this still happen?
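
The ordering can also be checked mechanically from a saved kernel log. A minimal Python sketch, assuming the log has the usual "[seconds] message" prefix; the function names are made up for illustration, and the two message strings are the ones from the log lines quoted above:

```python
import re

# Matches lines such as "[    2.325788] [drm] GPU posting now..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def first_timestamp(log_text, needle):
    """Return the timestamp of the first log line containing needle, or None."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and needle in match.group(2):
            return float(match.group(1))
    return None


def gpu_posted_before_thunderbolt(log_text):
    """True if amdgpu posted the GPU before the thunderbolt bus registered.

    Returns None when either message is missing from the log.
    """
    gpu = first_timestamp(log_text, "GPU posting now")
    tbt = first_timestamp(log_text, "bus type thunderbolt registered")
    if gpu is None or tbt is None:
        return None
    return gpu < tbt


sample = """\
[    2.325788] [drm] GPU posting now...
[   30.360701] ACPI: bus type thunderbolt registered
"""

if __name__ == "__main__":
    print(gpu_posted_before_thunderbolt(sample))
```

Feeding it the output of "dmesg" or "journalctl -k" instead of the sample shows whether the same ordering holds after amdgpu is removed from the initramfs.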

>
> Ciao, Thorsten
>
> P.s.: To be sure the issue doesn't fall through the cracks unnoticed,
> I'm adding it to regzbot, the Linux kernel regression tracking bot:
>
> #regzbot ^introduced v6.8.7..v6.8.8
> #regzbot title thunderbolt: eGPU disconnected during boot
>

2024-05-05 14:23:56

by Mario Limonciello

Subject: Re: [REGRESSION] Thunderbolt Host Reset Change Causes eGPU Disconnection from 6.8.7=>6.8.8

On 5/5/2024 07:37, Mario Limonciello wrote:
>
>
> On 5/4/24 23:59, Linux regression tracking (Thorsten Leemhuis) wrote:
>> [CCing Mario, who asked for the two suspected commits to be backported]
>>
>> On 05.05.24 03:12, Micha Albert wrote:
>>>
>>>      I have an AMD Radeon 6600 XT GPU in a cheap Thunderbolt eGPU board.
>>> In 6.8.7, this works as expected, and my Plymouth screen (including the
>>> LUKS password prompt) shows on my 2 monitors connected to the GPU as
>>> well as my main laptop screen. Upon entering the password, I'm put into
>>> userspace as expected. However, upon upgrading to 6.8.8, I will be
>>> greeted with the regular password prompt, but after entering my password
>>> and waiting for it to be accepted, my eGPU will reset and not function.
>>> I can tell that it resets since I can hear the click of my ATX power
>>> supply turning off and on again, and the status LED of the eGPU board
>>> goes from green to blue and back to green, all in less than a second.
>>>
>>>     I talked to a friend, and we found out that the kernel parameter
>>> thunderbolt.host_reset=false fixes the issue. He also thinks that
>>> commits cc4c94 (59a54c upstream) and 11371c (ec8162 upstream) look
>>> suspicious. I've attached the output of dmesg when the error was
>>> occurring, since I'm still able to use my laptop normally when this
>>> happens, just not with my eGPU and its connected displays.
>>
>> Thx for the report. Could you please test if 6.9-rc6 (or a later
>> snapshot; or -rc7, which should be out in about ~18 hours) is affected
>> as well? That would be really important to know.
>>
>> It would also be great if you could try reverting the two patches you
>> mentioned and see if they are really what's causing this. There iirc are
>> two more; maybe you might need to revert some or all of them in the
>> order they were applied.
>
> There are two other things that I think would be good to understand this
> issue.
>
> 1) Is it related to trusted devices handling?
>
> You can try to apply it both to 6.8.y or to 6.9-rc.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git/commit/?h=iommu/fixes&id=0f91d0795741c12cee200667648669a91b568735
>
> 2) Is it because you have amdgpu in your initramfs but not thunderbolt?
>
> If so; there's very likely an ordering issue.
>
> [    2.325788] [drm] GPU posting now...
> [   30.360701] ACPI: bus type thunderbolt registered
>
> Can you remove amdgpu from your initramfs and wait for it to startup
> after you pivot rootfs?  Does this still happen?
>

One more thought. When you say it does "not function", is it authorized
in thunderbolt sysfs?

See
https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/thunderbolt.rst

Is it still showing up in lspci?
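
For the authorization part, a small Python sketch that lists what the thunderbolt bus reports. It assumes the sysfs layout described in the document linked above (a per-device "authorized" attribute, plus "device_name" where present); the function name is made up for illustration:

```python
from pathlib import Path


def thunderbolt_authorization(root="/sys/bus/thunderbolt/devices"):
    """Map each device that has an 'authorized' attribute to (name, value)."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result  # no thunderbolt bus, or the module is not loaded
    for dev in sorted(base.iterdir()):
        auth = dev / "authorized"
        if not auth.is_file():
            continue  # entries such as domains carry no such attribute
        name_file = dev / "device_name"
        name = name_file.read_text().strip() if name_file.is_file() else ""
        result[dev.name] = (name, auth.read_text().strip())
    return result


if __name__ == "__main__":
    for dev, (name, auth) in thunderbolt_authorization().items():
        # Per the admin guide, 0 means the device is not authorized.
        print(f"{dev}: {name or '(no name)'} authorized={auth}")
```

If the eGPU is missing from this list entirely after the reset, that is a different failure from it being listed with authorized=0.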

>>
>> Ciao, Thorsten
>>
>> P.s.: To be sure the issue doesn't fall through the cracks unnoticed,
>> I'm adding it to regzbot, the Linux kernel regression tracking bot:
>>
>> #regzbot ^introduced v6.8.7..v6.8.8
>> #regzbot title thunderbolt: eGPU disconnected during boot
>>


2024-05-06 12:26:17

by Gia

Subject: Re: [REGRESSION] Thunderbolt Host Reset Change Causes eGPU Disconnection from 6.8.7=>6.8.8

Hello, going from 6.8.7 to 6.8.8 I ran into a similar problem with my
Caldigit TS3 Plus Thunderbolt 3 dock.

After the update I see this message on boot "xHCI host controller not
responding, assume dead" and the dock is not working anymore. Kernel
6.8.7 works great.

2024-05-06 12:54:18

by Thorsten Leemhuis

Subject: Re: [REGRESSION] Thunderbolt Host Reset Change Causes eGPU Disconnection from 6.8.7=>6.8.8

[CCing Mario, who asked for the two suspected commits to be backported]

On 06.05.24 14:24, Gia wrote:
> Hello, from 6.8.7=>6.8.8 I run into a similar problem with my Caldigit
> TS3 Plus Thunderbolt 3 dock.
>
> After the update I see this message on boot "xHCI host controller not
> responding, assume dead" and the dock is not working anymore. Kernel
> 6.8.7 works great.

Thx for the report. Could you make the kernel log (journalctl -k/dmesg)
accessible somewhere?

And have you looked into the other stuff that Mario suggested in the
other thread? See the following mail and the reply to it for details:

https://lore.kernel.org/all/[email protected]/T/#u

Ciao, Thorsten

P.S.: To be sure the issue doesn't fall through the cracks unnoticed,
I'm adding it to regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced v6.8.7..v6.8.8
#regzbot title thunderbolt: TB3 dock problems, xHCI host controller not responding, assume dead