2023-07-20 00:45:53

by Luck, Tony

Subject: rasdaemon broke between v6.0 and v6.3?

[resend as plain text - sorry for the earlier HTML]

An internal team is seeing tests that worked on v6.0 fail on v6.3. The problem is that
rasdaemon isn’t waking up to process the “mce_record” trace events.

Manually checking for them works:

root@R-4251:/sys/kernel/debug/tracing>systemctl stop rasdaemon
root@R-4251:/sys/kernel/debug/tracing>
root@R-4251:/sys/kernel/debug/tracing>
root@R-4251:/sys/kernel/debug/tracing>echo 1 > events/mce/mce_record/enable
root@R-4251:/sys/kernel/debug/tracing>
root@R-4251:/sys/kernel/debug/tracing>cat trace_pipe
<...>-235 [000] ..... 596.892583: mce_record: CPU: 0, MCGc/s: f000c15/0, MC13: 8c00004200800090, IPID: 0000000000000000, ADDR/MISC/SYND: 0000000123450000/08000a80c2982086/0000000000000000, RIP: 00:<0000000000000000>, TSC: 14120b051a1, PROCESSOR: 0:c06f1, TIME: 1689802780, SOCKET: 0, APIC: 0
kworker/0:2-235 [000] ..... 597.204343: mce_record: CPU: 0, MCGc/s: f000c15/0, MC255: 9c0000000000009f, IPID: 0000000000000000, ADDR/MISC/SYND: 0000000123450000/000000000000008c/0000000000000000, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:c06f1, TIME: 1689802781, SOCKET: 0, APIC: 0

So their tests are injecting errors, and the trace event is firing.

Is there some updated version of rasdaemon needed?

Some kernel CONFIG option problem?

-Tony


2023-07-20 03:40:31

by Aristeu Rozanski

Subject: Re: rasdaemon broke between v6.0 and v6.3?

On Wed, Jul 19, 2023 at 04:35:54PM -0700, Tony Luck wrote:
> [resend as plain text - sorry for the earlier HTML]
>
> An internal team is seeing tests that worked on v6.0 fail on v6.3. The problem is that
> rasdaemon isn’t waking up to process the “mce_record” trace events.
>
> Manually checking for them works:
>
> root@R-4251:/sys/kernel/debug/tracing>systemctl stop rasdaemon
> root@R-4251:/sys/kernel/debug/tracing>
> root@R-4251:/sys/kernel/debug/tracing>
> root@R-4251:/sys/kernel/debug/tracing>echo 1 > events/mce/mce_record/enable
> root@R-4251:/sys/kernel/debug/tracing>
> root@R-4251:/sys/kernel/debug/tracing>cat trace_pipe
> <...>-235 [000] ..... 596.892583: mce_record: CPU: 0, MCGc/s: f000c15/0, MC13: 8c00004200800090, IPID: 0000000000000000, ADDR/MISC/SYND: 0000000123450000/08000a80c2982086/0000000000000000, RIP: 00:<0000000000000000>, TSC: 14120b051a1, PROCESSOR: 0:c06f1, TIME: 1689802780, SOCKET: 0, APIC: 0
> kworker/0:2-235 [000] ..... 597.204343: mce_record: CPU: 0, MCGc/s: f000c15/0, MC255: 9c0000000000009f, IPID: 0000000000000000, ADDR/MISC/SYND: 0000000123450000/000000000000008c/0000000000000000, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:c06f1, TIME: 1689802781, SOCKET: 0, APIC: 0
>
> So their tests are injecting errors, and the trace event is firing.
>
> Is there some updated version of rasdaemon needed?
>
> Some kernel CONFIG option problem?

Looks like you're hitting the issue this commit is supposed to fix:
http://git.infradead.org/users/mchehab/rasdaemon.git/commit/6986d818e6d2c846c001fc7211b5a4153e5ecd11

--
Aristeu


2023-07-20 11:30:44

by Steven Rostedt

Subject: Re: rasdaemon broke between v6.0 and v6.3?

On Wed, 19 Jul 2023 16:35:54 -0700
Tony Luck wrote:

> [resend as plain text - sorry for the earlier HTML]
>
> An internal team is seeing tests that worked on v6.0 fail on v6.3. The problem is that
> rasdaemon isn’t waking up to process the “mce_record” trace events.
>
> Manually checking for them works:
>
> root@R-4251:/sys/kernel/debug/tracing>systemctl stop rasdaemon
> root@R-4251:/sys/kernel/debug/tracing>
> root@R-4251:/sys/kernel/debug/tracing>
> root@R-4251:/sys/kernel/debug/tracing>echo 1 > events/mce/mce_record/enable
> root@R-4251:/sys/kernel/debug/tracing>
> root@R-4251:/sys/kernel/debug/tracing>cat trace_pipe
> <...>-235 [000] ..... 596.892583: mce_record: CPU: 0, MCGc/s: f000c15/0, MC13: 8c00004200800090, IPID: 0000000000000000, ADDR/MISC/SYND: 0000000123450000/08000a80c2982086/0000000000000000, RIP: 00:<0000000000000000>, TSC: 14120b051a1, PROCESSOR: 0:c06f1, TIME: 1689802780, SOCKET: 0, APIC: 0
> kworker/0:2-235 [000] ..... 597.204343: mce_record: CPU: 0, MCGc/s: f000c15/0, MC255: 9c0000000000009f, IPID: 0000000000000000, ADDR/MISC/SYND: 0000000123450000/000000000000008c/0000000000000000, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 0:c06f1, TIME: 1689802781, SOCKET: 0, APIC: 0
>
> So their tests are injecting errors, and the trace event is firing.
>
> Is there some updated version of rasdaemon needed?
>
> Some kernel CONFIG option problem?
>

A bug was fixed that I think affected rasdaemon.

commit 3e46d910d8acf94e5360126593b68bf4fee4c4a1
Author: Shiju Jose
Date: Thu Feb 2 18:23:09 2023 +0000

tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw

Make sure /sys/kernel/tracing/buffer_percent is set to 0, so that a
waiting reader is woken as soon as there is any data to read.

-- Steve
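
Steve's suggestion can be tried by hand before upgrading rasdaemon. The following is a minimal sketch, not taken from rasdaemon itself: the function name set_buffer_percent_zero and its optional directory argument are made up for illustration. It assumes tracefs is mounted at /sys/kernel/tracing (pass /sys/kernel/debug/tracing instead if that is the path in use, as in Tony's transcript) and that it is run as root.

```shell
#!/bin/sh
# Hypothetical helper (not part of rasdaemon): set the tracing ring-buffer
# wakeup watermark to 0 so that poll()/select() on trace_pipe_raw returns
# as soon as any event is available.
#
# $1: tracefs directory (assumption: defaults to /sys/kernel/tracing)
set_buffer_percent_zero() {
    tracing_dir="${1:-/sys/kernel/tracing}"
    file="$tracing_dir/buffer_percent"

    # Bail out early if the file is missing or not writable.
    if [ ! -w "$file" ]; then
        echo "cannot write $file (not root, or tracefs not mounted?)" >&2
        return 1
    fi

    # Remember the old value so the change is visible in the output.
    old=$(cat "$file")
    echo 0 > "$file"
    echo "buffer_percent: $old -> $(cat "$file")"
}
```

After sourcing the file, run set_buffer_percent_zero as root, restart rasdaemon, and inject an error again. If the daemon now wakes up for mce_record events, the watermark was likely the cause. The setting does not persist across reboots.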