2024-05-30 12:02:27

by David Wang

Subject: [Regression] 6.10-rc1: Fail to resurrect from suspend.

Hi,

My system fails to resume after `systemctl suspend` with 6.10-rc1.
When I press the power button, the machine sounds like it is starting (fans roaring),
but my keyboard, mouse, and monitor are not powered, and I have no choice
but to power-cycle the system.

I ran a bisect session and narrowed it down to the following commit:

commit d74169ceb0d2e32438946a2f1f9fc8c803304bd6
Author: Dimitri Sivanich <[email protected]>
Date: Wed Apr 24 15:16:29 2024 +0800

iommu/vt-d: Allocate DMAR fault interrupts locally

The Intel IOMMU code currently tries to allocate all DMAR fault interrupt
vectors on the boot cpu. On large systems with high DMAR counts this
results in vector exhaustion, and most of the vectors are not initially
allocated socket local.

Instead, have a cpu on each node do the vector allocation for the DMARs on
that node. The boot cpu still does the allocation for its node during its
boot sequence.

Signed-off-by: Dimitri Sivanich <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>

I have confirmed that reverting this commit fixes my problem.

Here is my bisect log:
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
git bisect good a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
# status: waiting for bad commit, 1 good commit known
# bad: [1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0] Linux 6.10-rc1
git bisect bad 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
# good: [db5d28c0bfe566908719bec8e25443aabecbb802] Merge tag 'drm-next-2024-05-15' of https://gitlab.freedesktop.org/drm/kernel
git bisect good db5d28c0bfe566908719bec8e25443aabecbb802
# good: [db5d28c0bfe566908719bec8e25443aabecbb802] Merge tag 'drm-next-2024-05-15' of https://gitlab.freedesktop.org/drm/kernel
git bisect good db5d28c0bfe566908719bec8e25443aabecbb802
# bad: [a90f1cd105c6c5c246f07ca371d873d35b78c7d9] Merge tag 'turbostat-for-Linux-6.10-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
git bisect bad a90f1cd105c6c5c246f07ca371d873d35b78c7d9
# good: [8b35a3bb33b57bc2cb2694a50e49e0ea01b9ff6f] Merge tag 'pmdomain-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
git bisect good 8b35a3bb33b57bc2cb2694a50e49e0ea01b9ff6f
# bad: [619b92b9c8fe5369503ae948ad4e0a9c195c2c4a] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect bad 619b92b9c8fe5369503ae948ad4e0a9c195c2c4a
# good: [91b6163be404e36baea39fc978e4739fd0448ebd] Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
git bisect good 91b6163be404e36baea39fc978e4739fd0448ebd
# bad: [0cc6f45cecb46cefe89c17ec816dc8cd58a2229a] Merge tag 'iommu-updates-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect bad 0cc6f45cecb46cefe89c17ec816dc8cd58a2229a
# good: [89721e3038d181bacbd6be54354b513fdf1b4f10] Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
git bisect good 89721e3038d181bacbd6be54354b513fdf1b4f10
# good: [89721e3038d181bacbd6be54354b513fdf1b4f10] Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
git bisect good 89721e3038d181bacbd6be54354b513fdf1b4f10
# good: [de111f6b4f6a3010020825d22a068f416bc29c95] iommu/amd: Enable Guest Translation after reading IOMMU feature register
git bisect good de111f6b4f6a3010020825d22a068f416bc29c95
# good: [da55da5a42d4247d7a48b843fa5fcd9a4a10f4fe] iommu/arm-smmu-v3: Make the kunit into a module
git bisect good da55da5a42d4247d7a48b843fa5fcd9a4a10f4fe
# bad: [ba00196ca41c4f6d0b0d3c4a6748a133577abe05] iommu/vt-d: Decouple igfx_off from graphic identity mapping
git bisect bad ba00196ca41c4f6d0b0d3c4a6748a133577abe05
# bad: [446a68c58d2e5b8140d474f1a74082aebeee9bb0] iommu/vt-d: Add trace events for cache tag interface
git bisect bad 446a68c58d2e5b8140d474f1a74082aebeee9bb0
# bad: [cc9e49d35b4de47d6b656ac144cb22b11dc65c2e] iommu/vt-d: Remove debugfs use of private data field
git bisect bad cc9e49d35b4de47d6b656ac144cb22b11dc65c2e
# good: [9e7ee0f045395dc8aa55fbdc164c062484f4c88d] iommu/vt-d: Use try_cmpxchg64{,_local}() in iommu.c
git bisect good 9e7ee0f045395dc8aa55fbdc164c062484f4c88d
# bad: [d74169ceb0d2e32438946a2f1f9fc8c803304bd6] iommu/vt-d: Allocate DMAR fault interrupts locally
git bisect bad d74169ceb0d2e32438946a2f1f9fc8c803304bd6
# first bad commit: [d74169ceb0d2e32438946a2f1f9fc8c803304bd6] iommu/vt-d: Allocate DMAR fault interrupts locally


FYI
David
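
For anyone wanting to repeat the check described above, a minimal shell sketch follows. It assumes the bisect log was saved to a file and that the commands run inside a Linux kernel git tree that has the `v6.10-rc1` tag; the file name `bisect.log` and both function names are hypothetical and do not come from this thread. The first helper pulls the first bad commit out of the log, and the second reverts it on top of 6.10-rc1 so the result can be built and retested.

```sh
#!/bin/sh
# Hypothetical helpers; the names and the log file path are illustrative only.

# Print the 40-character hash from the "# first bad commit:" line
# of a saved `git bisect log` dump.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]\{40\}\)\].*/\1/p' "$1"
}

# Check out v6.10-rc1 (detached HEAD) and revert the commit named in the
# log, leaving a tree that can be built and tested for the resume failure.
# Must be run inside a Linux kernel git checkout.
revert_first_bad() {
    hash=$(first_bad_commit "$1")
    [ -n "$hash" ] || { echo "no first bad commit in $1" >&2; return 1; }
    git checkout v6.10-rc1 && git revert --no-edit "$hash"
}
```

Usage would look like `git bisect log > bisect.log` before resetting the bisect session, followed by `revert_first_bad bisect.log`. A saved log can also be re-run on another machine with `git bisect replay bisect.log`.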



2024-05-30 12:15:24

by Dimitri Sivanich

Subject: Re: [Regression] 6.10-rc1: Fail to resurrect from suspend.

Hi David,

There is a fix to commit d74169ceb0, which I'll be posting shortly. Hopefully
that will resolve your issue.

On Thu, May 30, 2024 at 08:01:10PM +0800, David Wang wrote:
> Hi,
>
> My system fails to resume after `systemctl suspend` with 6.10-rc1.
> When I press the power button, the machine sounds like it is starting (fans roaring),
> but my keyboard, mouse, and monitor are not powered, and I have no choice
> but to power-cycle the system.
>
> I ran a bisect session and narrowed it down to the following commit:
>
> commit d74169ceb0d2e32438946a2f1f9fc8c803304bd6
> Author: Dimitri Sivanich <[email protected]>
> Date: Wed Apr 24 15:16:29 2024 +0800
>
> iommu/vt-d: Allocate DMAR fault interrupts locally
>
> The Intel IOMMU code currently tries to allocate all DMAR fault interrupt
> vectors on the boot cpu. On large systems with high DMAR counts this
> results in vector exhaustion, and most of the vectors are not initially
> allocated socket local.
>
> Instead, have a cpu on each node do the vector allocation for the DMARs on
> that node. The boot cpu still does the allocation for its node during its
> boot sequence.
>
> Signed-off-by: Dimitri Sivanich <[email protected]>
> Reviewed-by: Kevin Tian <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Lu Baolu <[email protected]>
> Signed-off-by: Joerg Roedel <[email protected]>
>
> I have confirmed that reverting this commit fixes my problem.
>
> Here is my bisect log:
> $ git bisect log
> git bisect start
> # status: waiting for both good and bad commits
> # good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
> git bisect good a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
> # status: waiting for bad commit, 1 good commit known
> # bad: [1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0] Linux 6.10-rc1
> git bisect bad 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
> # good: [db5d28c0bfe566908719bec8e25443aabecbb802] Merge tag 'drm-next-2024-05-15' of https://gitlab.freedesktop.org/drm/kernel
> git bisect good db5d28c0bfe566908719bec8e25443aabecbb802
> # good: [db5d28c0bfe566908719bec8e25443aabecbb802] Merge tag 'drm-next-2024-05-15' of https://gitlab.freedesktop.org/drm/kernel
> git bisect good db5d28c0bfe566908719bec8e25443aabecbb802
> # bad: [a90f1cd105c6c5c246f07ca371d873d35b78c7d9] Merge tag 'turbostat-for-Linux-6.10-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
> git bisect bad a90f1cd105c6c5c246f07ca371d873d35b78c7d9
> # good: [8b35a3bb33b57bc2cb2694a50e49e0ea01b9ff6f] Merge tag 'pmdomain-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
> git bisect good 8b35a3bb33b57bc2cb2694a50e49e0ea01b9ff6f
> # bad: [619b92b9c8fe5369503ae948ad4e0a9c195c2c4a] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
> git bisect bad 619b92b9c8fe5369503ae948ad4e0a9c195c2c4a
> # good: [91b6163be404e36baea39fc978e4739fd0448ebd] Merge tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
> git bisect good 91b6163be404e36baea39fc978e4739fd0448ebd
> # bad: [0cc6f45cecb46cefe89c17ec816dc8cd58a2229a] Merge tag 'iommu-updates-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
> git bisect bad 0cc6f45cecb46cefe89c17ec816dc8cd58a2229a
> # good: [89721e3038d181bacbd6be54354b513fdf1b4f10] Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
> git bisect good 89721e3038d181bacbd6be54354b513fdf1b4f10
> # good: [89721e3038d181bacbd6be54354b513fdf1b4f10] Merge tag 'net-accept-more-20240515' of git://git.kernel.dk/linux
> git bisect good 89721e3038d181bacbd6be54354b513fdf1b4f10
> # good: [de111f6b4f6a3010020825d22a068f416bc29c95] iommu/amd: Enable Guest Translation after reading IOMMU feature register
> git bisect good de111f6b4f6a3010020825d22a068f416bc29c95
> # good: [da55da5a42d4247d7a48b843fa5fcd9a4a10f4fe] iommu/arm-smmu-v3: Make the kunit into a module
> git bisect good da55da5a42d4247d7a48b843fa5fcd9a4a10f4fe
> # bad: [ba00196ca41c4f6d0b0d3c4a6748a133577abe05] iommu/vt-d: Decouple igfx_off from graphic identity mapping
> git bisect bad ba00196ca41c4f6d0b0d3c4a6748a133577abe05
> # bad: [446a68c58d2e5b8140d474f1a74082aebeee9bb0] iommu/vt-d: Add trace events for cache tag interface
> git bisect bad 446a68c58d2e5b8140d474f1a74082aebeee9bb0
> # bad: [cc9e49d35b4de47d6b656ac144cb22b11dc65c2e] iommu/vt-d: Remove debugfs use of private data field
> git bisect bad cc9e49d35b4de47d6b656ac144cb22b11dc65c2e
> # good: [9e7ee0f045395dc8aa55fbdc164c062484f4c88d] iommu/vt-d: Use try_cmpxchg64{,_local}() in iommu.c
> git bisect good 9e7ee0f045395dc8aa55fbdc164c062484f4c88d
> # bad: [d74169ceb0d2e32438946a2f1f9fc8c803304bd6] iommu/vt-d: Allocate DMAR fault interrupts locally
> git bisect bad d74169ceb0d2e32438946a2f1f9fc8c803304bd6
> # first bad commit: [d74169ceb0d2e32438946a2f1f9fc8c803304bd6] iommu/vt-d: Allocate DMAR fault interrupts locally
>
>
> FYI
> David

2024-05-30 14:10:21

by David Wang

Subject: Re: [Regression] 6.10-rc1: Fail to resurrect from suspend.


At 2024-05-30 20:14:56, "Dimitri Sivanich" <[email protected]> wrote:
>Hi David,
>
>There is a fix to commit d74169ceb0, which I'll be posting shortly. Hopefully
>that will resolve your issue.
>

Hi, I just applied that patch on 6.10-rc1, and it does fix my problem!

Thx~
David

2024-06-10 13:17:15

by David Wang

Subject: Re: [Regression] 6.10-rc1: Fail to resurrect from suspend.

Hi,

This still happens with 6.10-rc3...
I think this is a serious problem for AMD users who are used to using `suspend`...


David


2024-06-10 14:02:52

by Vasant Hegde

Subject: Re: [Regression] 6.10-rc1: Fail to resurrect from suspend.

Hi David,


On 6/10/2024 6:45 PM, David Wang wrote:
> Hi,
>
> This still happens with 6.10-rc3...
> I think this is a serious problem for AMD users who are used to using `suspend`...


I am sorry. Can you tell us which issue you are referring to?

Is this `ILLEGAL_DEV_TABLE_ENTRY` (one described in bugzilla [1])? -OR-
the kernel panic in amd_iommu_enable_faulting() path?

If it's the kernel panic, then patch [2] didn't make it into rc3 fixes.


[1] https://bugzilla.kernel.org/show_bug.cgi?id=218900
[2]
https://lore.kernel.org/linux-iommu/lsahbfrt26ysjzgg6p6ezcrf525b25d7nnuqxgis5k6g3zsnzt@qsmzecwdjnen/T/#t


-Vasant

2024-06-10 14:10:52

by Vasant Hegde

Subject: Re: [Regression] 6.10-rc1: Fail to resurrect from suspend.

David,

On 6/10/2024 7:24 PM, David Wang wrote:
>
> At 2024-06-10 21:44:02, "Vasant Hegde" <[email protected]> wrote:
>> Hi David,
>>
>>
>> On 6/10/2024 6:45 PM, David Wang wrote:
>>> Hi,
>>>
>>> This still happens with 6.10-rc3...
>>> I think this is a serious problem for AMD users who are used to using `suspend`...
>>
>>
>> I am sorry. Can you tell us which issue you are referring to?
>
> Oh, I was referring to this thread: https://lore.kernel.org/lkml/[email protected]/
>
>>
>> Is this `ILLEGAL_DEV_TABLE_ENTRY` (one described in bugzilla [1])? -OR-
>> the kernel panic in amd_iommu_enable_faulting() path?
>>
>> If it's the kernel panic, then patch [2] didn't make it into rc3 fixes.
>>
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=218900
>> [2]
>> https://lore.kernel.org/linux-iommu/lsahbfrt26ysjzgg6p6ezcrf525b25d7nnuqxgis5k6g3zsnzt@qsmzecwdjnen/T/#t
>>
>>
>> -Vasant
>
> Yes, it is the patch [2]; sad to hear it didn't make it into rc3. I have had to patch my system manually ever since rc1.

Thanks for confirming. Hopefully Joerg will pick this patch for -rc4.

-Vasant


2024-06-10 14:22:37

by David Wang

Subject: Re: [Regression] 6.10-rc1: Fail to resurrect from suspend.


At 2024-06-10 21:44:02, "Vasant Hegde" <[email protected]> wrote:
>Hi David,
>
>
>On 6/10/2024 6:45 PM, David Wang wrote:
>> Hi,
>>
>> This still happens with 6.10-rc3...
>> I think this is a serious problem for AMD users who are used to using `suspend`...
>
>
>I am sorry. Can you tell us which issue you are referring to?

Oh, I was referring to this thread: https://lore.kernel.org/lkml/[email protected]/

>
>Is this `ILLEGAL_DEV_TABLE_ENTRY` (one described in bugzilla [1])? -OR-
>the kernel panic in amd_iommu_enable_faulting() path?
>
>If it's the kernel panic, then patch [2] didn't make it into rc3 fixes.
>
>
>[1] https://bugzilla.kernel.org/show_bug.cgi?id=218900
>[2]
>https://lore.kernel.org/linux-iommu/lsahbfrt26ysjzgg6p6ezcrf525b25d7nnuqxgis5k6g3zsnzt@qsmzecwdjnen/T/#t
>
>
>-Vasant

Yes, it is the patch [2]; sad to hear it didn't make it into rc3. I have had to patch my system manually ever since rc1.


Thx
David