2023-09-21 02:33:26

by Bagas Sanjaya

Subject: what to do on no reproducer case? (was Re: Fwd: Uhhuh. NMI received for unknown reason 3d/2d/ on CPU xx)

[addressing to Thorsten]

On Sat, Sep 02, 2023 at 07:20:55AM +0700, Bagas Sanjaya wrote:
> Hi,
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
> > seems to be a regression since 6.5 release:
> > the infamous error message from the kernel on this 32c/64t threadripper:
> >> [ 2046.269103] perf: interrupt took too long (3141 > 3138), lowering
> >> kernel.perf_event_max_sample_rate to 63600
> >> [ 2405.049567] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> >> [ 2405.049571] Dazed and confused, but trying to continue
> >> [ 2406.902609] Uhhuh. NMI received for unknown reason 2d on CPU 33.
> >> [ 2406.902612] Dazed and confused, but trying to continue
> >> [ 2423.978918] Uhhuh. NMI received for unknown reason 2d on CPU 33.
> >> [ 2423.978921] Dazed and confused, but trying to continue
> >> [ 2429.995160] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2429.995163] Dazed and confused, but trying to continue
> >> [ 2431.233575] Uhhuh. NMI received for unknown reason 3d on CPU 36.
> >> [ 2431.233578] Dazed and confused, but trying to continue
> >> [ 2442.382252] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2442.382255] Dazed and confused, but trying to continue
> >> [ 2442.725076] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2442.725078] Dazed and confused, but trying to continue
> >> [ 2442.732025] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> >> [ 2442.732027] Dazed and confused, but trying to continue
> >> [ 2443.666671] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> >> [ 2443.666673] Dazed and confused, but trying to continue
> >> [ 2443.756776] Uhhuh. NMI received for unknown reason 3d on CPU 39.
> >> [ 2443.756779] Dazed and confused, but trying to continue
> >> [ 2443.907309] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2443.907311] Dazed and confused, but trying to continue
> >> [ 2444.004281] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2444.004283] Dazed and confused, but trying to continue
> >> [ 2444.207944] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2444.207945] Dazed and confused, but trying to continue
> >> [ 2444.517408] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2444.517410] Dazed and confused, but trying to continue
> >> [ 2444.946941] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2444.946943] Dazed and confused, but trying to continue
> >> [ 2445.573807] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2445.573809] Dazed and confused, but trying to continue
> >> [ 2445.776108] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2445.776110] Dazed and confused, but trying to continue
> >> [ 2445.969029] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2445.969031] Dazed and confused, but trying to continue
> >> [ 2446.977458] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2446.977460] Dazed and confused, but trying to continue
> >> [ 2447.044329] Uhhuh. NMI received for unknown reason 2d on CPU 46.
> >> [ 2447.044331] Dazed and confused, but trying to continue
> >> [ 2447.469269] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2447.469271] Dazed and confused, but trying to continue
> >> [ 2447.866530] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2447.866531] Dazed and confused, but trying to continue
> >> [ 2448.456615] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2448.456617] Dazed and confused, but trying to continue
> >> [ 2448.509614] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2448.509616] Dazed and confused, but trying to continue
> >> [ 2448.758005] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2448.758007] Dazed and confused, but trying to continue
> >> [ 2449.093565] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2449.093567] Dazed and confused, but trying to continue
> >> [ 2449.227344] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2449.227346] Dazed and confused, but trying to continue
> >> [ 2449.770534] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2449.770535] Dazed and confused, but trying to continue
> >> [ 2449.955594] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2449.955596] Dazed and confused, but trying to continue
> >> [ 2450.077872] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> >> [ 2450.077874] Dazed and confused, but trying to continue
> >> [ 2450.190844] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2450.190846] Dazed and confused, but trying to continue
> >> [ 2450.561450] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2450.561452] Dazed and confused, but trying to continue
> >> [ 2450.604498] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2450.604500] Dazed and confused, but trying to continue
> >> [ 2450.814451] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2450.814453] Dazed and confused, but trying to continue
> >> [ 2450.923171] Uhhuh. NMI received for unknown reason 2d on CPU 49.
> >> [ 2450.923173] Dazed and confused, but trying to continue
> >> [ 2451.084612] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2451.084614] Dazed and confused, but trying to continue
> >> [ 2451.793342] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2451.793343] Dazed and confused, but trying to continue
> >> [ 2451.793662] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> >> [ 2451.793664] Dazed and confused, but trying to continue
> >> [ 2451.926819] Uhhuh. NMI received for unknown reason 3d on CPU 48.
> >> [ 2451.926821] Dazed and confused, but trying to continue
> >> [ 2452.502583] Uhhuh. NMI received for unknown reason 3d on CPU 49.
> >> [ 2452.502585] Dazed and confused, but trying to continue
> >> [ 2452.675633] Uhhuh. NMI received for unknown reason 2d on CPU 61.
> >> [ 2452.675636] Dazed and confused, but trying to continue
> >> [ 2452.974655] Uhhuh. NMI received for unknown reason 2d on CPU 48.
> >> [ 2452.974657] Dazed and confused, but trying to continue
> >> [ 7065.904855] elogind-daemon[2461]: New session c2 of user janpieter.
> >
> > according to dmesg, this happens without any special reason (I didn't even notice)
> > some googling points at an ACPI C state problem on AMD CPUs a few years ago
> > in 5.14 kernels, I didn't see it.
>
> See Bugzilla for the full thread.
>
> Anyway, I'm adding this regression to be tracked by regzbot:
>
> #regzbot introduced: v6.4..v6.5 https://bugzilla.kernel.org/show_bug.cgi?id=217857
>

Hi Thorsten,

This regression looks stalled: on Bugzilla, the reporter keeps asking me for
help, but I'm not an expert in the subsystem involved. And apparently, he still
doesn't have a reproducer (is it triggered by random chance?). Should I
mark this as inconclusive?

Thanks.

--
An old man doll... just what I always wanted! - Clara



2023-09-21 21:42:46

by Thorsten Leemhuis

Subject: Re: what to do on no reproducer case? (was Re: Fwd: Uhhuh. NMI received for unknown reason 3d/2d/ on CPU xx)

On 20.09.23 02:27, Bagas Sanjaya wrote:
> [addressing to Thorsten]
>
> On Sat, Sep 02, 2023 at 07:20:55AM +0700, Bagas Sanjaya wrote:
>> I notice a regression report on Bugzilla [1]. Quoting from it:
>>
>>> seems to be a regression since 6.5 release:
>>> the infamous error message from the kernel on this 32c/64t threadripper:
>>>> [ 2046.269103] perf: interrupt took too long (3141 > 3138), lowering
>>>> kernel.perf_event_max_sample_rate to 63600
>>>> [ 2405.049567] Uhhuh. NMI received for unknown reason 2d on CPU 48.
>>>> [ 2405.049571] Dazed and confused, but trying to continue
>>>> [ 2406.902609] Uhhuh. NMI received for unknown reason 2d on CPU 33.
>>>> [ 2406.902612] Dazed and confused, but trying to continue
>>>> [ 2423.978918] Uhhuh. NMI received for unknown reason 2d on CPU 33.
>>>> [ 2423.978921] Dazed and confused, but trying to continue
> [...]
>>> according to dmesg, this happens without any special reason (I didn't even notice)
>>> some googling points at an ACPI C state problem on AMD CPUs a few years ago
>>> in 5.14 kernels, I didn't see it.
>>
>> See Bugzilla for the full thread.
>>
>> Anyway, I'm adding this regression to be tracked by regzbot:
>>
>> #regzbot introduced: v6.4..v6.5 https://bugzilla.kernel.org/show_bug.cgi?id=217857
>
> This regression looks stalled: on Bugzilla, the reporter keeps asking me for
> help, but I'm not an expert in the subsystem involved. And apparently, he still
> doesn't have a reproducer (is it triggered by random chance?). Should I
> mark this as inconclusive?

Yes, without a reliable bisection result there sometimes is not much we
can do -- apart from prodding various developers directly and asking for
help or an idea. But in this case that's not worth it afaics, as
messages like
https://lore.kernel.org/all/[email protected]/
indicate that it might be a hardware problem and not really a
regression. Hence:

#regzbot resolve: inconclusive: not bisected and might be a hardware
problem after all

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

2023-10-04 12:50:59

by Bagas Sanjaya

Subject: Re: what to do on no reproducer case? (was Re: Fwd: Uhhuh. NMI received for unknown reason 3d/2d/ on CPU xx)

On 21/09/2023 15:10, Thorsten Leemhuis wrote:
> On 20.09.23 02:27, Bagas Sanjaya wrote:
>> This regression looks stalled: on Bugzilla, the reporter keeps asking me for
>> help, but I'm not an expert in the subsystem involved. And apparently, he still
>> doesn't have a reproducer (is it triggered by random chance?). Should I
>> mark this as inconclusive?
>
> Yes, without a reliable bisection result there sometimes is not much we
> can do -- apart from prodding various developers directly and asking for
> help or an idea. But in this case that's not worth it afaics, as
> messages like
> https://lore.kernel.org/all/[email protected]/
> indicate that it might be a hardware problem and not really a
> regression. Hence:
>
> #regzbot resolve: inconclusive: not bisected and might be a hardware
> problem after all
>

Thanks for the tip! Now to fix up:

#regzbot inconclusive: regression not bisected - possibly hardware issue

--
An old man doll... just what I always wanted! - Clara

2023-10-05 16:35:02

by Adam Borowski

Subject: Re: what to do on no reproducer case? (was Re: Fwd: Uhhuh. NMI received for unknown reason 3d/2d/ on CPU xx)

On Wed, Oct 04, 2023 at 07:50:29PM +0700, Bagas Sanjaya wrote:
> On 21/09/2023 15:10, Thorsten Leemhuis wrote:
> > Yes, without a reliable bisection result there sometimes is not much we
> > can do -- apart from prodding various developers directly and asking for
> > help or an idea. But in this case that's not worth it afaics, as
> > messages like
> > https://lore.kernel.org/all/[email protected]/
> > indicate that it might be a hardware problem and not really a
> > regression. Hence:
> >
> > #regzbot resolve: inconclusive: not bisected and might be a hardware
> > problem after all

> Thanks for the tip! Now to fix up:
>
> #regzbot inconclusive: regression not bisected - possibly hardware issue

This doesn't seem to be a regression: I've been seeing this (reason 2c/3c)
since forever (2019) on my box (2990WX), up to current -rc kernels.

Of course, it might be a different problem that results in the same message,
but I don't pretend to have a clue about the cause -- just reporting.
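
For anyone who wants to compare logs across machines, here is a minimal
sketch of how the "unknown reason" lines could be tallied by reason code and
by CPU. It assumes the message format shown in the quoted dmesg excerpt; the
helper name tally_unknown_nmis is made up for illustration and is not part
of any kernel tooling.

```python
import re
from collections import Counter

# Matches the message format seen in the quoted dmesg excerpt, e.g.
# "Uhhuh. NMI received for unknown reason 2d on CPU 48."
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-f]+) on CPU (\d+)")


def tally_unknown_nmis(lines):
    """Count unknown-NMI messages per reason code and per CPU.

    Hypothetical helper for illustration only. Lines that do not match
    the pattern (such as "Dazed and confused...") are ignored.
    """
    by_reason = Counter()
    by_cpu = Counter()
    for line in lines:
        match = NMI_RE.search(line)
        if match:
            by_reason[match.group(1)] += 1
            by_cpu[int(match.group(2))] += 1
    return by_reason, by_cpu


# Three lines copied from the report quoted earlier in this thread.
sample = [
    "[ 2405.049567] Uhhuh. NMI received for unknown reason 2d on CPU 48.",
    "[ 2405.049571] Dazed and confused, but trying to continue",
    "[ 2406.902609] Uhhuh. NMI received for unknown reason 2d on CPU 33.",
    "[ 2429.995160] Uhhuh. NMI received for unknown reason 3d on CPU 48.",
]

by_reason, by_cpu = tally_unknown_nmis(sample)
print("by reason:", by_reason.most_common())
print("by CPU:", by_cpu.most_common())
```

A tally like this would not identify the cause, but it might show whether
the messages cluster on a few CPUs or carry different reason codes on
different boxes.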


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Q: Is it ok to combine wired, wifi, and/or bluetooth connections
⢿⡄⠘⠷⠚⠋⠀ in wearable computing?
⠈⠳⣄⠀⠀⠀⠀ A: No, that would be mixed fabric, which Lev19:19 forbids.