LinuxLists.cc - Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

2023-11-06 20:15:19

Subject: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

Hi,

I recently came across a kernel bugzilla that bisected a boot problem
[1] introduced in kernel 6.5 to this change.

commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb (HEAD -> dcadfd7f7c7)
Author: Takashi Sakamoto <[email protected]>
Date: Tue May 30 08:12:40 2023 +0900

firewire: core: use union for callback of transaction completion

Removing the firewire card from the system fixes it for both reporters
(CC'ed)

As the author of this commit, can you please take a look at it?

Thanks,

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217993

2023-11-07 07:03:03

by Bagas Sanjaya

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

On Mon, Nov 06, 2023 at 02:14:39PM -0600, Mario Limonciello wrote:
> Hi,
>
> I recently came across a kernel bugzilla that bisected a boot problem [1]
> introduced in kernel 6.5 to this change.
>
> commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb (HEAD -> dcadfd7f7c7)
> Author: Takashi Sakamoto <[email protected]>
> Date: Tue May 30 08:12:40 2023 +0900
>
> firewire: core: use union for callback of transaction completion
>
> Removing the firewire card from the system fixes it for both reporters
> (CC'ed)
>
> As the author of this issue can you please take a look at it?
>

Thanks for forwarding the regression report from Bugzilla. I'm adding it
to regzbot:

#regzbot introduced: dcadfd7f7c74ef https://bugzilla.kernel.org/show_bug.cgi?id=217993
#regzbot title: completing firewire transaction callback with union bootloops AMD Ryzen 7 system
#regzbot link: https://lore.kernel.org/regressions/[email protected]/

--
An old man doll... just what I always wanted! - Clara

2023-11-20 05:43:17

by Thorsten Leemhuis

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

On 08.11.23 06:16, Takashi Sakamoto wrote:
> On Tue, Nov 07, 2023 at 03:27:08PM -0600, Mario Limonciello wrote:
>> +linux-pci / Bjorn
>> On 11/7/2023 06:17, Takashi Sakamoto wrote:
>>>
>>> Thanks for the report.
>>>
>>> I apologize for the inconvenience you and your reporters are facing; however,
>>> I cannot avoid saying that the problem appears to be specific to AMD
>>> Ryzen machines.

From the point of view of kernel development it nevertheless *afaics* is a
kernel regression, as it doesn't matter much if the root of the problem
is a hw problem; what matters primarily is that the problem apparently
did not happen before the commit mentioned in the subject.

Which raises the question: can the culprit (and other commits that might
depend on it) still be reverted without causing a regression itself?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

>> Unfortunately I don't have this 1394 hardware myself. I was just looking at
>> another completely unrelated issue on Bugzilla and noticed the report come
>> up in my search and wanted to ensure it's on your radar already as the
>> author as it's lingered a while.
>
> It is your misfortune to face this kind of machine trouble.
>
> In the report[1], Matthias Schrumpf and Mark Broadworth noted that they use AMD
> Ryzen 7 5800X on B550/X570 chipsets, with a VT6307 inserted in the PCIe bus.
> I guess that the device sits behind a PCI bridge (ASM1083) since the VT6307 has a PCI
> interface only.
>
> We can see an MCE error in another report[2]. Unfortunately, the reporter,
> Ian Donnelly, did not suspect the machine architecture and never
> provided hardware information. But I believe that it comes from an AMD Ryzen
> machine. I transcribe the error here:
>
> ```
> [ 0.860834] mce: [Hardware Error]: Machine check events logged
> [ 0.860834] microcode: CPU20: patch_level=0x0a201025
> [ 0.860835] microcode: CPU21: patch_level=0x0a201025
> [ 0.860836] microcode: CPU23: patch_level=0x0a201025
> [ 0.860836] microcode: CPU22: patch_level=0x0a201025
> [ 0.860837] mce: [Hardware Error]: CPU 17: Machine Check: 0 Bank 0: bc00080001010135
> [ 0.860845] fbcon: Taking over console
> [ 0.860847] mce: [Hardware Error]: TSC 0 ADDR fca000f0 MISC d012000000000000 IPID 1000b000000000
> [ 0.860854] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1696955537 SOCKET 0 APIC b microcode a201025
> [ 0.860860] microcode: CPU0: patch_level=0x0a201025
> [ 0.861676] microcode: Microcode Update Driver: v2.2.
> ```
>
> Additionally, as I note in the PR[3], I observed a cache-coherence failure
> over memory dedicated to DMA transmission. The mapping is created by
> `dmam_alloc_coherent()` and should need no extra care such as the streaming
> API. However, the combination of ASM1083 and VT6307 gives me bogus
> values from the memory on the AMD Ryzen machine, and I can see no issue on
> Intel machines.
>
> Essentially, the host system reboots when the firewire-ohci module in the guest
> system probes the PCI device for 1394 OHCI hardware provided by PCI
> pass-through[4].
>
>>> I've already received a similar report[1] and have been
>>> investigating it for the last few weeks, which gave me some insight. Please take
>>> a look at my short report about it in the PR to Linus for 6.7-rc1:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> I can confirm that I have been able to reproduce the problem on an AMD Ryzen
>>> machine. However, it's important to note that I have not observed the
>>> problem on the following systems:
>>
>> Any chance you (or anyone with the issue) has a serial output available?
>> I think it would be really good to look at the circumstances surrounding the
>> reboot.
>>
>>>
>>> * Intel machine (Sandy Bridge and Skylake generations)
>>> * AMD machines predating Ryzen (Sempron 145)
>>> * Machines using different 1394 OHCI hardware from other vendors such as
>>> TI
>>> * VIA VT6307 connected directly to PCI slot (i.e. without the issued
>>> PCIe/PCI bridge)
>>>
>>> Currently, I have not been able to obtain any useful debug output from
>>> the Linux system or any hardware error reports when the system reboots.
>>> It seems that the system reboots spontaneously. My assumption at this
>>> point is that AMD Ryzen machines detect a specific hardware error
>>> triggered by Ryzen machine quirk related to the combination of the Asmedia
>>> ASM1083/1085 and VIA VT6306/6307/6308, leading to power reset.
>>>
>>
>> Recent kernels have enabled PCI AER. Could that be factoring in perhaps?
>
> I ordered equipment for that workflow and am waiting for shipping, since
> my motherboard has no interface for serial output.
>
> (However, I predict that we will get no helpful output via the interface.)
>
>>> I genuinely appreciate your assistance in debugging this elusive
>>> hardware issue. If any workaround specific to AMD Ryzen machine quirk is
>>> required in PCI driver for 1394 OHCI hardware, I'm willing to apply it.
>>> However, it is preferable to figure out the reboot mechanism at first,
>>> I think.
>>
>> Does the BIOS on these machines enable a watchdog timer? If so, I'd suggest
>> disabling that for a starting point.
>
> For consumer use, the machine has no such function, I think. For
> your information, this is the machine information I used:
>
> * Ryzen 5 2400G
> * Gigabyte Technology Co., Ltd. AX370-Gaming 5/AX370-Gaming 5
> * BIOS F51h 02/09/2023
>
>> How about if you compile as a module and then modprobe.blacklist the module
>> on kernel command line and load it later. Can you trigger the fault/reboot
>> this way? If so, it at least rules out some conditions that happen during a
>> race at boot.
>
> Nowadays the FireWire software stack is optional in most
> distributions. I can encounter the same issue at deferred probing well
> after booting up, even if the system load is very low.
>
>> Looking more closely at the change, I would guess the fault is specifically
>> in get_cycle_time(). I can see that the VIA devices do set
>> QUIRK_CYCLE_TIMER which will cause additional reads.
>
> I've already tested with the driver compiled without this code, but the
> system still reboots.
>
>> Another guesses worth looking at is to see if iommu=pt or amd_iommu=off
>> help.
>>
>> If either of those help it could point at being a problem with
>> get_cycle_time() and IOMMU. The older systems you mentioned working
>> probably didn't enable IOMMU by default but most AMD Ryzen systems do.
>
> I already suspect the platform IOMMU and kernel implementation; however,
> disabling AMD SVM and IOMMU in BIOS settings does not help. Of course,
> passing any iommu options on the kernel command line does not help either.
>
> If I had any opportunity to access AMD machines for enterprise-grade
> usage somehow, I would have done it. However, I am a private-time
> contributor and what I can access are the ones for consumer use
> without any hardware support for RAS reporting.
>
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217993
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=217994
> [3] https://lore.kernel.org/lkml/[email protected]/
> [4] https://lore.kernel.org/lkml/[email protected]/
>
> Thanks
>
> Takashi Sakamoto
>
>
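
For readers following the quoted machine-check record, the flag bits in the top of the bank status value can be decoded by hand. The sketch below is only an illustration: it covers the architectural x86 MCA status flags (bits 63 to 57) and makes no attempt to interpret the model-specific error code in the lower bits. The function name `decode_mca_status` is made up for this example.

```python
# Architectural flag bits in the upper part of an x86 MCA bank status value.
# Only these well-known flags are decoded; the model-specific error code in
# the lower bits is deliberately left alone.
MCA_STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error record
    62: "OVER",   # an earlier error record was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of the flag bits set in a 64-bit MCA status value."""
    return [
        name
        for bit, name in sorted(MCA_STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


# Status value transcribed from the quoted log ("Bank 0: bc00080001010135").
flags = decode_mca_status(0xBC00080001010135)
print(flags)
```

MISCV and ADDRV being set is consistent with the log also printing ADDR and MISC values for the same event.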

2023-11-28 05:24:56

by Takashi Sakamoto

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

Hi Mario

Following up on our last conversation, I purchased some hardware to
attempt to retrieve output from a serial port. In the end, I bought another
motherboard on the used market which provides a serial port from its Super I/O
chip (ASUS TUF Gaming X570-Plus). However, I have retrieved no helpful
output yet when encountering the system reboot.

As you suggested, I checked whether PCIe AER is enabled in the
running kernel (Ubuntu 23.04 linux-image-6.2.0-37-generic). It is
certainly enabled; however, I can see nothing in the output, as I noted.

I experienced extra trouble related to the AMD Ryzen machines and the
PCIe device in question:

* ASRock X570 Phantom Gaming 4 with AMD Ryzen 5 3600X does not detect
the card. We can see no corresponding entry in lspci.
* After binding the card to vfio-pci, the lspci command can reboot the
system even if the firewire-ohci driver is not loaded. I can reproduce it
on both Gigabyte AX370-Gaming 5 and ASUS TUF Gaming X570-Plus with AMD
Ryzen 2400G.

I'd be pleased if you have extra ideas to get helpful output from
the system. But I guess that I should start looking for some workaround to
avoid the register access in question instead of investigating the reboot
mechanism, sigh...

Anyway, thanks for your help.

Takashi Sakamoto


2023-11-28 06:10:08

by Mario Limonciello

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

+ Boris

Maybe he has some ideas on this issue.

On 11/27/2023 23:24, Takashi Sakamoto wrote:
> Hi Mario
>
> Following up on our last conversation, I purchase some hardware to
> attempt to retrieve outputs from serial port. Finally, I bought another
> mother board in used market which provides serial port from Super I/O
> chip (ASUS TUF Gaming X570-Plus). However, I have retrieved no helpful
> outputs yet when encountering the system reboot.

Did you up the loglevel to 8 to make sure you'll get all kernel output
on the serial port, not just errors?

>
> As you mentioned, I check whether PCIe AER is enabled or not in the
> running kernel (Ubuntu 23.04 linux-image-6.2.0-37-generic). It is
> certainly enabled, however I can see nothing in the output as I noted.
>
> I experienced extra troubles relevant to AMD Ryzen machine and the issued
> PCIe device:
>
> * ASRock X570 Phantom Gaming 4 with AMD Ryzen 5 3600X does not detect
> the card. We can see no corresponding entry in lspci.
> * After associating the card to vfio-pci, lspci command can reboot the
> system even if firewire-ohci driver is not loaded. I can regenerate it
> in both Gigabyte AX370-Gaming 5/ASUS TUF Gaming X570-plus with AMD
> Ryzen 2400G.

Rather than lspci, is it specifically config space access from sysfs?
Does the output from the serial port change with IOMMU enabled vs disabled?

>
> I'd be pleased if you have extra ideas to get helpful output from
> the system. But I guess that I should start looking for some workaround to
> avoid the register access in question instead of investigating the reboot
> mechanism, sigh...
>
> Anyway, thanks for your help.
>

Can you check FCH::PM::S5_RESET_STATUS on next boot after failure has
occurred? It is available at MMIO FED80300 or through indirect IO
access at 0xC0.

If MMIO doesn't work, double check FCH::PM_ISACONTROL bit 1 (described
on page 296) to confirm if your system enables it.

The meanings of the different bits can be found in a recent PPR:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55901_B1_pub_053.zip

Indirect IO is described on PDF page 294.

This will at least give us a hint what's going on in this case.
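
A rough sketch of how the MMIO read suggested above might be done from user space follows. It assumes root privileges, a kernel that permits mapping this range through `/dev/mem`, and that a 4-byte copy from the mapping is an acceptable way to access the register; none of that is confirmed in this thread. The helper names are invented for the example, and the meaning of each bit has to be looked up in the PPR, so the code only lists which bit positions are set.

```python
import mmap
import os
import struct

# MMIO address of FCH::PM::S5_RESET_STATUS as given in the message above.
S5_RESET_STATUS_ADDR = 0xFED80300


def split_page(addr: int, page_size: int = mmap.PAGESIZE) -> tuple[int, int]:
    """Return (page-aligned base, offset inside that page) for an address."""
    return addr & ~(page_size - 1), addr & (page_size - 1)


def set_bits(value: int) -> list[int]:
    """Return the positions of the bits set in a 32-bit value, lowest first."""
    return [bit for bit in range(32) if (value >> bit) & 1]


def read_mmio_u32(addr: int, devmem: str = "/dev/mem") -> int:
    """Read a little-endian 32-bit value at a physical address.

    Assumption: the kernel allows read-only mapping of this page via devmem.
    """
    base, offset = split_page(addr)
    fd = os.open(devmem, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, mmap.PAGESIZE, mmap.MAP_SHARED, mmap.PROT_READ,
                       offset=base) as mapping:
            return struct.unpack_from("<I", mapping, offset)[0]
    finally:
        os.close(fd)
```

On the boot after a failure, something like `print(set_bits(read_mmio_u32(S5_RESET_STATUS_ADDR)))` run as root would give the bit positions to compare against the register description in the PPR.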

>
> Takashi Sakamoto
>

2023-12-03 12:30:39

by Takashi Sakamoto

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

Hi Mario,

Thanks for the advice.

Note that in my experiments I use Ubuntu 23.04 amd64 (v6.2 kernel) with
a backported FireWire stack[1]. Except for the stack, the kernel and software
packages can be retrieved from the Ubuntu project's repositories.

On Tue, Nov 28, 2023 at 12:09:41AM -0600, Mario Limonciello wrote:
> On 11/27/2023 23:24, Takashi Sakamoto wrote:
> > Hi Mario
> >
> > Following up on our last conversation, I purchase some hardware to
> > attempt to retrieve outputs from serial port. Finally, I bought another
> > mother board in used market which provides serial port from Super I/O
> > chip (ASUS TUF Gaming X570-Plus). However, I have retrieved no helpful
> > outputs yet when encountering the system reboot.
>
> Did you up the loglevel to 8 to make sure you'll get all kernel output on
> the serial port, not just errors?

Even when giving the 'debug' cmdline option or raising the console
loglevel via sysctl, I receive no useful output from the console when loading
the module at or after boot.

```
$ sysctl kernel.printk
kernel.printk = 7 7 1 7
```
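
As a side note on reading that output: the four `kernel.printk` fields are, in order, the current console loglevel, the default message loglevel, the minimum console loglevel and the default console loglevel, and a message reaches the console only when its level is numerically lower than the current console loglevel. The sketch below just parses the line shown above; the helper names are invented for the example.

```python
# Field names of kernel.printk, in the order the sysctl prints them.
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)

KERN_DEBUG = 7  # numeric level of debug messages


def parse_printk(line: str) -> dict[str, int]:
    """Parse a 'kernel.printk = a b c d' line into named integer fields."""
    values = line.split("=", 1)[1].split()
    return dict(zip(PRINTK_FIELDS, map(int, values)))


def console_shows(level: int, console_loglevel: int) -> bool:
    """A message is printed to the console only if level < console_loglevel."""
    return level < console_loglevel


levels = parse_printk("kernel.printk = 7 7 1 7")
print(levels["console_loglevel"])
print(console_shows(KERN_DEBUG, levels["console_loglevel"]))
print(console_shows(KERN_DEBUG, 8))
```

With the values in this particular output, debug-level messages would still be filtered from the console, whereas a console loglevel of 8 lets them through. The sysctl output only reflects the setting at the moment it was read, so it says nothing about the runs made with the 'debug' option.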

I tried several different cases: enabling/disabling IOMMU, and
enabling/disabling SVM at the motherboard level. But nothing was effective.

> > As you mentioned, I check whether PCIe AER is enabled or not in the
> > running kernel (Ubuntu 23.04 linux-image-6.2.0-37-generic). It is
> > certainly enabled, however I can see nothing in the output as I noted.
> >
> > I experienced extra troubles relevant to AMD Ryzen machine and the issued
> > PCIe device:
> >
> > * ASRock X570 Phantom Gaming 4 with AMD Ryzen 5 3600X does not detect
> > the card. We can see no corresponding entry in lspci.
> > * After associating the card to vfio-pci, lspci command can reboot the
> > system even if firewire-ohci driver is not loaded. I can regenerate it
> > in both Gigabyte AX370-Gaming 5/ASUS TUF Gaming X570-plus with AMD
> > Ryzen 2400G.
>
> Rather than lspci, is it specifically config space access from sysfs? Does
> the output from the serial port change with IOMMU enabled vs disabled?

In the lspci case, I worked with a debugger and figured out that 'pread(2)' on
the file descriptor for the 'config' node in sysfs causes the unexpected system
reboot. Additionally, I can reproduce it by running hexdump(1) on the node:

```
$ lspci
...
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 03)
05:00.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044] (rev 80)
...
$ hexdump -C /sys/bus/pci/devices/0000\:05\:00.0/config
00000000 06 11 44 30 80 00 10 02 80 10 00 0c 10 20 00 00 |..D0......... ..|
00000010 00 00 90 fc 01 d0 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 06 11 44 30 |..............D0|
00000030 00 00 00 00 50 00 00 00 00 00 00 00 ff 01 00 20 |....P.......... |
00000040

$ lsmod | grep firewire
(no output)

$ sudo -i
# modprobe vfio-pci
# echo 1106 3044 > /sys/bus/pci/drivers/vfio-pci/new_id
# exit

$ hexdump -C /sys/bus/pci/devices/0000\:05\:00.0/config
(reboot)
```

I can suppress it by disabling IOMMU in the motherboard settings. In this
respect, the lspci issue is a bit different from the driver issue.

> > I'd be pleased if you have extra ideas to get helpful output from
> > the system. But I guess that I should start looking for some workaround to
> > avoid the register access in question instead of investigating the reboot
> > mechanism, sigh...
> >
> > Anyway, thanks for your help.
> >
>
> Can you check FCH::PM::S5_RESET_STATUS on next boot after failure has
> occurred? It is available at MMIO FED80300 or through indirect IO access at
> 0xC0.
>
> If MMIO doesn't work, double check FCH::PM_ISACONTROL bit 1 (described on
> page 296) to confirm if your system enables it.
>
> The meanings of the different bits can be found in a recent PPR:
> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55901_B1_pub_053.zip
>
> Indirect IO is described on PDF page 294.
>
> This will at least give us a hint what's going on in this case.

I'll try the above this week. Thanks.

[1] https://github.com/takaswie/linux-firewire-dkms/

Takashi Sakamoto

2023-12-05 20:05:06

by Mario Limonciello

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

On 12/3/2023 06:29, Takashi Sakamoto wrote:
> Hi Mario,
>
> Thanks for the advice.
>
> Note that in my experiments I use Ubuntu 23.04 amd64 (v6.2 kernel) with
> a backported FireWire stack[1]. Except for the stack, the kernel and software
> packages can be retrieved from the Ubuntu project's repositories.
>
> On Tue, Nov 28, 2023 at 12:09:41AM -0600, Mario Limonciello wrote:
>> On 11/27/2023 23:24, Takashi Sakamoto wrote:
>>> Hi Mario
>>>
>>> Following up on our last conversation, I purchased some hardware to
>>> try to capture output from a serial port. In the end, I bought another
>>> motherboard on the used market that provides a serial port from its
>>> Super I/O chip (ASUS TUF Gaming X570-Plus). However, I have not captured
>>> any helpful output yet when the system reboot occurs.
>>
>> Did you up the loglevel to 8 to make sure you'll get all kernel output on
>> the serial port, not just errors?
>
> Even when passing the 'debug' cmdline option or raising the console
> loglevel via sysctl, I receive no useful output on the console when loading
> the module at or after boot.
>
> ```
> $ sysctl kernel.printk
> kernel.printk = 7 7 1 7
> ```
>
> I tried several different cases: enabling/disabling IOMMU and
> enabling/disabling SVM at the motherboard level. None had any effect.
>
>>> As you mentioned, I checked whether PCIe AER is enabled in the
>>> running kernel (Ubuntu 23.04 linux-image-6.2.0-37-generic). It is
>>> certainly enabled, but I see nothing in the output, as I noted.
>>>
>>> I ran into additional trouble with AMD Ryzen machines and the PCIe
>>> device in question:
>>>
>>> * ASRock X570 Phantom Gaming 4 with AMD Ryzen 5 3600X does not detect
>>>   the card. There is no corresponding entry in lspci.
>>> * After binding the card to vfio-pci, the lspci command can reboot the
>>>   system even if the firewire-ohci driver is not loaded. I can reproduce
>>>   it on both Gigabyte AX370-Gaming 5 and ASUS TUF Gaming X570-Plus with
>>>   AMD Ryzen 2400G.
>>
>> Rather than lspci, is it specifically config space access from sysfs? Does
>> the output from the serial port change with IOMMU enabled vs disabled?
>
> In the lspci case, I used a debugger and found that 'pread(2)' on the
> file descriptor for the 'config' node in sysfs causes the unexpected system
> reboot. Additionally, I can reproduce it by running hexdump(1) on the node:

OK - is this by chance related to access to PCI extended config space
failing for this device then? If you read just the first 256 bytes it's
ok, but beyond that it fails?

If so, can you please try to reproduce using this series from Bjorn applied:
https://lore.kernel.org/r/[email protected]

And then add this to kernel command line:
efi=debug "dyndbg=file arch/x86/pci/* +p"

Capture the dmesg and share it.

>
> ```
> $ lspci
> ...
> 04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 03)
> 05:00.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044] (rev 80)
> ...
> $ hexdump -C /sys/bus/pci/devices/0000\:05\:00.0/config
> 00000000 06 11 44 30 80 00 10 02 80 10 00 0c 10 20 00 00 |..D0......... ..|
> 00000010 00 00 90 fc 01 d0 00 00 00 00 00 00 00 00 00 00 |................|
> 00000020 00 00 00 00 00 00 00 00 00 00 00 00 06 11 44 30 |..............D0|
> 00000030 00 00 00 00 50 00 00 00 00 00 00 00 ff 01 00 20 |....P.......... |
> 00000040
>
> $ lsmod | grep firewire
> (no output)
>
> $ sudo -i
> # modprobe vfio-pci
> # echo 1106 3044 > /sys/bus/pci/drivers/vfio-pci/new_id
> # exit
>
> $ hexdump -C /sys/bus/pci/devices/0000\:05\:00.0/config
> (reboot)
> ```

Can you access config space for other PCIe devices successfully on this
system?
Specifically extended config space?

>
> I can suppress it by disabling the IOMMU in the motherboard settings. In this
> respect, the lspci issue is a bit different from the driver issue.
>
>>> I'd be pleased to hear any extra ideas you have for getting helpful output
>>> from the system. But I guess that I should start looking for a workaround
>>> that avoids the problematic register access instead of investigating the
>>> reboot mechanism, sigh...
>>>
>>> Anyway, thanks for your help.
>>
>> Can you check FCH::PM::S5_RESET_STATUS on next boot after failure has
>> occurred? It is available at MMIO FED80300 or through indirect IO access at
>> 0xC0.
>>
>> If MMIO doesn't work, double check FCH::PM_ISACONTROL bit 1 (described on
>> page 296) to confirm if your system enables it.
>>
>> The meanings of the different bits can be found in a recent PPR:
>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55901_B1_pub_053.zip
>>
>> Indirect IO is described on PDF page 294.
>>
>> This will at least give us a hint what's going on in this case.
>
> I'll try the above this week. Thanks.
>
>
> [1] https://github.com/takaswie/linux-firewire-dkms/
>
> Takashi Sakamoto

2023-12-10 08:04:51

by Takashi Sakamoto

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

Hi Mario,

On Tue, Nov 28, 2023 at 12:09:41AM -0600, Mario Limonciello wrote:
> Can you check FCH::PM::S5_RESET_STATUS on next boot after failure has
> occurred? It is available at MMIO FED80300 or through indirect IO access at
> 0xC0.
>
> If MMIO doesn't work, double check FCH::PM_ISACONTROL bit 1 (described on
> page 296) to confirm if your system enables it.
>
> The meanings of the different bits can be found in a recent PPR:
> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55901_B1_pub_053.zip
>
> Indirect IO is described on PDF page 294.
>
> This will at least give us a hint what's going on in this case.

I checked the above. The value at the ::S5_RESET_STATUS offset is
0x00080800 every time after the unexpected reboot (in both the lspci case
and the 1394 OHCI driver case). The 'mmioen' bit at the ::ISACONTROL
offset is always enabled.

According to the document, bit 20 of ::S5_RESET_STATUS is
'do_k8_full_reset', which means that the system reboot was caused by
'CF9 = 0x0E'. But I have no idea what triggers it...
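
For reference, the bits set in a raw register value can be listed with a
few lines of Python. This is only a sketch: the helper name `set_bits` is
made up for illustration, positions are counted from zero, and the meaning
of each position still has to be looked up in the PPR.

```python
def set_bits(value: int, width: int = 32) -> list[int]:
    """Return the zero-based positions of the bits that are set in value."""
    return [bit for bit in range(width) if value & (1 << bit)]


# Value read back from ::S5_RESET_STATUS after the unexpected reboot.
print(set_bits(0x00080800))  # prints [11, 19]
```

With zero-based numbering the value has positions 11 and 19 set. Whether
that lines up with the 'bit 20' named above depends on how the PPR numbers
its bits, which is not verified here.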

To read the registers, I found a patch to i2c-piix4[1] and applied it with
a slight fix, like:

```
diff --git a/drivers/i2c/busses/i2c-piix4.c b/drivers/i2c/busses/i2c-piix4.c
index 809fbd014cd6..11c1ba3afa76 100644
--- a/drivers/i2c/busses/i2c-piix4.c
+++ b/drivers/i2c/busses/i2c-piix4.c
@@ -99,7 +99,9 @@
#define SB800_PIIX4_PORT_IDX_SHIFT_KERNCZ 3

#define SB800_PIIX4_FCH_PM_ADDR 0xFED80300
-#define SB800_PIIX4_FCH_PM_SIZE 8
+#define SB800_PIIX4_FCH_PM_SIZE 0x100
+#define SB800_PIIX4_FCH_PM_OFFSET_ISACONTROL 0x04
+#define SB800_PIIX4_FCH_PM_OFFSET_S5_RESET_STATUS 0xc0

/* insmod parameters */

@@ -200,6 +202,9 @@ static int piix4_sb800_region_request(struct device *dev,

mmio_cfg->addr = addr;

+ dev_info(dev, "ISACONTROL = 0x%08x", ioread32(addr + SB800_PIIX4_FCH_PM_OFFSET_ISACONTROL));
+ dev_info(dev, "S5_RESET_STATUS = 0x%08x", ioread32(addr + SB800_PIIX4_FCH_PM_OFFSET_S5_RESET_STATUS));
+
return 0;
}
```

> On 12/3/2023 06:29, Takashi Sakamoto wrote:
> > In lspci case, I can work with debugger and figure out that 'pread(2)' to
> > file descriptor for 'config' node in sysfs causes the unexpected system
> > reboot. Additionally I can regenerate it by hexdump(1) to the node:
>
> OK - is this by chance related to access to PCI extended config space
> failing for this device then? If you read just the first 256 bytes it's ok,
> but beyond that it fails?

I can reproduce the unexpected system reboot even when reading far fewer
than 256 bytes from the node. This time I used dd(1) for the purpose, since
hexdump uses the stream I/O API (it reads 4096 bytes at once).
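
A short, bounded read of this kind can be sketched in Python as below. The
helper name `read_prefix` is hypothetical, and the exact dd(1) invocation is
not given in this thread, so this only illustrates issuing one pread(2) for
a chosen length and offset.

```python
import os


def read_prefix(path: str, length: int = 64, offset: int = 0) -> bytes:
    """Issue a single pread(2) for at most `length` bytes at `offset`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # One system call; may return fewer bytes than requested.
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

For example, `read_prefix("/sys/bus/pci/devices/0000:05:00.0/config", 64)`
would request only the first 64 bytes of the node shown in the lspci output
earlier in the thread. On an affected machine, that call should be expected
to trigger the same reboot.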

> If so, can you please try to reproduce using this series from Bjorn applied:
> https://lore.kernel.org/r/[email protected]
>
> And then add this to kernel command line:
> efi=debug "dyndbg=file arch/x86/pci/* +p"
>
> Capture the dmesg and share it.

I'll try the series by backporting it, but that will take a bit of time.

[1] https://lore.kernel.org/lkml/[email protected]/T/

Thanks a lot

Takashi Sakamoto

2024-01-16 01:56:24

by Takashi Sakamoto

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

Hi,

The change to the 1394 OHCI driver, aimed at suppressing the unexpected
system reboot on AMD Ryzen machines[1], has been merged into Linux kernel
v6.7[2]. It has also been applied to the following stable and longterm
kernel releases.

* 6.6.11[3]
* 6.1.72[4]
* 5.15.147[5]
* 5.10.208[6]
* 5.4.267[7]
* 4.19.305[8]
* 4.14.336[9]

Once your downstream distribution provides the corresponding kernel
packages, you should no longer encounter the unexpected system reboot.

Note that the following combination of hardware is not necessarily suitable,
depending on your use case:

* Any type of AMD Ryzen machine
* 1394 OHCI hardware consisting of:
  * ASMedia ASM1083/1085
  * VIA VT6306/6307/6308

When working with a time-aware protocol, such as audio sample processing, it
is advisable to avoid this combination. The change comes with a functional
limitation: the software stack does not provide precise hardware time
in this case.

If you choose to continue using an AMD Ryzen machine, the recommendation is
to replace the 1394 OHCI hardware with different hardware. Conversely, if
you choose to continue using the 1394 OHCI hardware, the recommendation is
to use a machine from a vendor other than AMD.

Thanks for your report and your patience.

[1] https://git.kernel.org/torvalds/linux/c/ac9184fbb847
[2] https://lore.kernel.org/lkml/CAHk-=widprp4XoHUcsDe7e16YZjLYJWra-dK0hE1MnfPMf6C3Q@mail.gmail.com/
[3] https://lore.kernel.org/lkml/2024011058-sheep-thrower-d2f8@gregkh/
[4] https://lore.kernel.org/lkml/2024011052-unsightly-bronze-e628@gregkh/
[5] https://lore.kernel.org/lkml/2024011541-defective-scuff-c55e@gregkh/
[6] https://lore.kernel.org/lkml/2024011532-lustiness-hybrid-fc72@gregkh/
[7] https://lore.kernel.org/lkml/2024011519-mating-tag-1f62@gregkh/
[8] https://lore.kernel.org/lkml/2024011508-shakiness-resonant-f15e@gregkh/
[9] https://lore.kernel.org/lkml/2024011046-ecology-tiptoeing-ce50@gregkh/

On Mon, Nov 06, 2023 at 02:14:39PM -0600, Mario Limonciello wrote:
> Hi,
>
> I recently came across a kernel bugzilla that bisected a boot problem [1]
> introduced in kernel 6.5 to this change.
>
> commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb (HEAD -> dcadfd7f7c7)
> Author: Takashi Sakamoto <[email protected]>
> Date: Tue May 30 08:12:40 2023 +0900
>
> firewire: core: use union for callback of transaction completion
>
> Removing the firewire card from the system fixes it for both reporters
> (CC'ed)
>
> As the author of this issue can you please take a look at it?
>
> Thanks,
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217993

Thanks

Takashi Sakamoto

2024-01-19 14:08:49

by Thorsten Leemhuis

Subject: Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

On 07.11.23 08:02, Bagas Sanjaya wrote:
> On Mon, Nov 06, 2023 at 02:14:39PM -0600, Mario Limonciello wrote:
>> Hi,
>>
>> I recently came across a kernel bugzilla that bisected a boot problem [1]
>> introduced in kernel 6.5 to this change.
>>
>> commit dcadfd7f7c74ef9ee415e072a19bdf6c085159eb (HEAD -> dcadfd7f7c7)
>> Author: Takashi Sakamoto <[email protected]>
>> Date: Tue May 30 08:12:40 2023 +0900
>>
>> firewire: core: use union for callback of transaction completion
>>
>> Removing the firewire card from the system fixes it for both reporters
>> (CC'ed)
>>
>> As the author of this issue can you please take a look at it?
>>
>
> Thanks for the forwarding regression report from Bugzilla. I'm adding it
> to regzbot:
>
> #regzbot introduced: dcadfd7f7c74ef https://bugzilla.kernel.org/show_bug.cgi?id=217993
> #regzbot title: completing firewire transaction callback with union bootloops AMD Ryzen 7 system
> #regzbot link: https://lore.kernel.org/regressions/[email protected]/

#regzbot fix: ac9184fbb8478dab4a0724b279f94956b69be827
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.