2023-06-16 12:34:14

by Breno Leitao

[permalink] [raw]
Subject: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On some systems, the Performance Counter Global Status Register is
coming with reserved bits set, which causes the system to be unusable
if a simple `perf top` runs. The system hits the WARN() thousands times
while perf runs.

WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0

This happens because the "Performance Counter Global Status Register"
(PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
Manual, Volume 2: System Programming, 24593"[1]

WARN_ONCE if any reserved bit is set, and sanitize the value to what the
code is handling, so the overflow events continue to be handled for the
number of events that are known to be sane.

Signed-off-by: Breno Leitao <[email protected]>
Fixes: 7685665c390d ("perf/x86/amd/core: Add PerfMonV2 overflow handling")

[1] Link: https://www.amd.com/system/files/TechDocs/24593.pdf
---
arch/x86/events/amd/core.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index bccea57dee81..809ddb15c122 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -909,6 +909,10 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
status &= ~GLOBAL_STATUS_LBRS_FROZEN;
}

+ amd_pmu_global_cntr_mask = (1ULL << x86_pmu.num_counters) - 1;
+ WARN_ON_ONCE(status & ~amd_pmu_global_cntr_mask);
+ status &= amd_pmu_global_cntr_mask;
+
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;
--
2.34.1



2023-06-16 13:41:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> On some systems, the Performance Counter Global Status Register is
> coming with reserved bits set, which causes the system to be unusable
> if a simple `perf top` runs. The system hits the WARN() thousands times
> while perf runs.
>
> WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
>
> This happens because the "Performance Counter Global Status Register"
> (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> Manual, Volume 2: System Programming, 24593"[1]

Would it then not make more sense to mask out bit7 before:

+ status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
if (!status)
goto done;

?

Aside from being reserved, why are these bits magically set all of a
sudden?

2023-06-16 14:20:41

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > On some systems, the Performance Counter Global Status Register is
> > coming with reserved bits set, which causes the system to be unusable
> > if a simple `perf top` runs. The system hits the WARN() thousands times
> > while perf runs.
> >
> > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> >
> > This happens because the "Performance Counter Global Status Register"
> > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > Manual, Volume 2: System Programming, 24593"[1]
>
> Would it then not make more sense to mask out bit7 before:

It is more than bit 7. This is the register structure according to the document
above:

Bits Mnemonic Description Access type
63:60 Reserved RO
59 PMCF Performance Counter Freeze RO
58 LBRSF Last Branch Record Stack Freeze RO
57:6 Reserved RO
5:0 CNT_OF Counter overflow for PerfCnt[5:0] RO

In the code, bit GLOBAL_STATUS_LBRS_FROZEN is handled and cleared before
we reach my changes

That said, your approach is almost similar to what I did, and I will be happy
to change in order to make the code clearer.

> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> if (!status)
> goto done;
>
> ?
>
> Aside from being reserved, why are these bits magically set all of a
> sudden?

That is probably a question to AMD.


2023-06-16 14:48:03

by Sandipan Das (AMD)

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On 16/06/23 7:33 pm, Breno Leitao wrote:
> On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
>>> On some systems, the Performance Counter Global Status Register is
>>> coming with reserved bits set, which causes the system to be unusable
>>> if a simple `perf top` runs. The system hits the WARN() thousands times
>>> while perf runs.
>>>
>>> WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
>>>
>>> This happens because the "Performance Counter Global Status Register"
>>> (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
>>> to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
>>> Manual, Volume 2: System Programming, 24593"[1]
>>
>> Would it then not make more sense to mask out bit7 before:
>
> It is more than bit 7. This is the register structure according to the document
> above:
>
> Bits Mnemonic Description Access type
> 63:60 Reserved RO
> 59 PMCF Performance Counter Freeze RO
> 58 LBRSF Last Branch Record Stack Freeze RO
> 57:6 Reserved RO
> 5:0 CNT_OF Counter overflow for PerfCnt[5:0] RO
>
> In the code, bit GLOBAL_STATUS_LBRS_FROZEN is handled and cleared before
> we reach my changes
>
> That said, your approach is almost similar to what I did, and I will be happy
> to change in order to make the code clearer.
>
>> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
>> if (!status)
>> goto done;
>>
>> ?
>>
>> Aside from being reserved, why are these bits magically set all of a
>> sudden?
>
> That is probably a question to AMD.
>

The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
I am working out the details with Breno and will report back with a resolution.

- Sandipan

[sending from my personal email as I currently don't have access to my work laptop]

2023-06-16 16:02:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> I am working out the details with Breno and will report back with a resolution.

Thanks!

2023-09-13 15:04:35

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Wed, Sep 13, 2023 at 04:30:44PM +0200, Jirka Hladky wrote:
> Hi Sandipan,
>
> Is there any update on this issue?

I still think we want to have this patch in Linux, so, we protect the
kernel for whatever the hardware/firmware is doing.

2023-09-13 16:29:51

by Jirka Hladky

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

I agree.

Are you planning v2 of the patch, addressing the points raised by
Peter Zijlstra?

On Wed, Sep 13, 2023 at 5:03 PM Breno Leitao <[email protected]> wrote:
>
> On Wed, Sep 13, 2023 at 04:30:44PM +0200, Jirka Hladky wrote:
> > Hi Sandipan,
> >
> > Is there any update on this issue?
>
> I still think we want to have this patch in Linux, so, we protect the
> kernel for whatever the hardware/firmware is doing.
>


--
-Jirka

2023-09-13 16:35:35

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Peter,

On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > On some systems, the Performance Counter Global Status Register is
> > coming with reserved bits set, which causes the system to be unusable
> > if a simple `perf top` runs. The system hits the WARN() thousands times
> > while perf runs.
> >
> > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> >
> > This happens because the "Performance Counter Global Status Register"
> > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > Manual, Volume 2: System Programming, 24593"[1]
>
> Would it then not make more sense to mask out bit7 before:
>
> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> if (!status)
> goto done;

Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
(AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
global variable because it seems to represent what the loop below is
iterating over:

/* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
static u64 amd_pmu_global_cntr_mask __read_mostly;

Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
now, it warns at every time we call this function, which makes the
machine unusable, but, warning it once could be helpful to figure out
there is something wrong with the machine/firmware.

Anyway, please let me know whatever is your preferred way and I will
submit a v2.

2023-09-13 19:21:39

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Jirka,

Can you please share the following?

1. Family, Model and Stepping of the processor
2. Microcode patch level
On 9/13/2023 8:00 PM, Jirka Hladky wrote:
> Hi Sandipan,
>
> Is there any update on this issue? We have hit the issue, and it makes
> the server pretty unusable due to the thousands of Call Traces being
> logged.
>
> Thanks a lot!
> Jirka
>
>
> On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
>>> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
>>> I am working out the details with Breno and will report back with a resolution.
>>
>> Thanks!
>>
>
>

2023-09-14 05:30:23

by Jirka Hladky

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Sandipan,

Is there any update on this issue? We have hit the issue, and it makes
the server pretty unusable due to the thousands of Call Traces being
logged.

Thanks a lot!
Jirka


On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> > The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> > I am working out the details with Breno and will report back with a resolution.
>
> Thanks!
>


--
-Jirka

2023-09-14 09:08:40

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Breno, Jirka,

On 9/14/2023 2:15 PM, Jirka Hladky wrote:
> Hi Breno,
>
> I'm definitively voting for using WARN_ON_ONCE - in the current
> implementation, we are getting thousands of the same warnings and Call
> Traces, causing the system to become unusable.
>
>> Anyway, please let me know whatever is your preferred way and I will submit a v2.
> @Peter Zijlstra and @Sandipan - could you please comment on the
> preferred implementation of the patch?
>

I agree with using WARN_ON_ONCE() to make this less intrusive.

>
> On Wed, Sep 13, 2023 at 6:24 PM Breno Leitao <[email protected]> wrote:
>>
>> Hi Peter,
>>
>> On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
>>> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
>>>> On some systems, the Performance Counter Global Status Register is
>>>> coming with reserved bits set, which causes the system to be unusable
>>>> if a simple `perf top` runs. The system hits the WARN() thousands times
>>>> while perf runs.
>>>>
>>>> WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
>>>>
>>>> This happens because the "Performance Counter Global Status Register"
>>>> (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
>>>> to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
>>>> Manual, Volume 2: System Programming, 24593"[1]
>>>
>>> Would it then not make more sense to mask out bit7 before:
>>>
>>> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
>>> if (!status)
>>> goto done;
>>
>> Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
>> (AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
>> global variable because it seems to represent what the loop below is
>> iterating over:
>>
>> /* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
>> static u64 amd_pmu_global_cntr_mask __read_mostly;
>>
>> Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
>> now, it warns at every time we call this function, which makes the
>> machine unusable, but, warning it once could be helpful to figure out
>> there is something wrong with the machine/firmware.
>>
>> Anyway, please let me know whatever is your preferred way and I will
>> submit a v2.
>>
>
>

2023-09-14 09:57:19

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:

> > I agree with using WARN_ON_ONCE() to make this less intrusive.
>
> Could you send a patch that AMD is happy with?

Why the current patch is not good enough?

2023-09-14 10:29:56

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ


On 9/14/2023 2:42 PM, Peter Zijlstra wrote:
> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
>> Hi Breno, Jirka,
>>
>> On 9/14/2023 2:15 PM, Jirka Hladky wrote:
>>> Hi Breno,
>>>
>>> I'm definitively voting for using WARN_ON_ONCE - in the current
>>> implementation, we are getting thousands of the same warnings and Call
>>> Traces, causing the system to become unusable.
>>>
>>>> Anyway, please let me know whatever is your preferred way and I will submit a v2.
>>> @Peter Zijlstra and @Sandipan - could you please comment on the
>>> preferred implementation of the patch?
>>>
>>
>> I agree with using WARN_ON_ONCE() to make this less intrusive.
>
> Could you send a patch that AMD is happy with? I think you wrote this
> was a firmware bug and is sorted with a new firmware, so perhaps make it
> WARN_ONCE() and tell the users to upgrade their firmware to $ver ?

Sure. Will do.

2023-09-14 10:58:41

by Jirka Hladky

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Breno,

I'm definitively voting for using WARN_ON_ONCE - in the current
implementation, we are getting thousands of the same warnings and Call
Traces, causing the system to become unusable.

>Anyway, please let me know whatever is your preferred way and I will submit a v2.
@Peter Zijlstra and @Sandipan - could you please comment on the
preferred implementation of the patch?

THANK YOU
Jirka

On Wed, Sep 13, 2023 at 6:24 PM Breno Leitao <[email protected]> wrote:
>
> Hi Peter,
>
> On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > > On some systems, the Performance Counter Global Status Register is
> > > coming with reserved bits set, which causes the system to be unusable
> > > if a simple `perf top` runs. The system hits the WARN() thousands times
> > > while perf runs.
> > >
> > > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> > >
> > > This happens because the "Performance Counter Global Status Register"
> > > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > > Manual, Volume 2: System Programming, 24593"[1]
> >
> > Would it then not make more sense to mask out bit7 before:
> >
> > + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> > if (!status)
> > goto done;
>
> Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
> (AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
> global variable because it seems to represent what the loop below is
> iterating over:
>
> /* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
> static u64 amd_pmu_global_cntr_mask __read_mostly;
>
> Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
> now, it warns at every time we call this function, which makes the
> machine unusable, but, warning it once could be helpful to figure out
> there is something wrong with the machine/firmware.
>
> Anyway, please let me know whatever is your preferred way and I will
> submit a v2.
>


--
-Jirka

2023-09-14 11:13:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> Hi Breno, Jirka,
>
> On 9/14/2023 2:15 PM, Jirka Hladky wrote:
> > Hi Breno,
> >
> > I'm definitively voting for using WARN_ON_ONCE - in the current
> > implementation, we are getting thousands of the same warnings and Call
> > Traces, causing the system to become unusable.
> >
> >> Anyway, please let me know whatever is your preferred way and I will submit a v2.
> > @Peter Zijlstra and @Sandipan - could you please comment on the
> > preferred implementation of the patch?
> >
>
> I agree with using WARN_ON_ONCE() to make this less intrusive.

Could you send a patch that AMD is happy with? I think you wrote this
was a firmware bug and is sorted with a new firmware, so perhaps make it
WARN_ONCE() and tell the users to upgrade their firmware to $ver ?

2023-09-14 12:32:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Thu, Sep 14, 2023 at 04:52:13PM +0530, Sandipan Das wrote:
> On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> > On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> >> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> >>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> >>
> >>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
> >>>
> >>> Could you send a patch that AMD is happy with?
> >>
> >> Why the current patch is not good enough?
> >
> > Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> > the AMD pmu and certainly I don't have insight into their design future.
>
> Hi Breno,
>
> Functionally, the patch looks good to me and I will be reusing it
> without any change to the authorship. However, as Peter suggested, I
> wanted to add a message to prompt users to update the microcode and
> also call out the required patch levels in the commit message since
> different Zen 4 variants and steppings use different microcode.
>
> Here's what I plan to send.
>
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index abadd5f23425..186a124bb3c0 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -909,6 +909,13 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> status &= ~GLOBAL_STATUS_LBRS_FROZEN;
> }
>
> + if (status & ~amd_pmu_global_cntr_mask)
> + pr_warn_once("Unknown status bits are set (0x%llx), please consider updating microcode\n",
> + status);
> +
> + /* Clear any reserved bits set by buggy microcode */
> + status &= amd_pmu_global_cntr_mask;
> +
> for (idx = 0; idx < x86_pmu.num_counters; idx++) {
> if (!test_bit(idx, cpuc->active_mask))
> continue;
>
> --
>
> Hi Peter,
>
> There is another case where users will see warnings but the patch
> to fix it (link below) is yet to be reviewed. May I rebase and
> resend it along with the above?
>
> https://lore.kernel.org/all/[email protected]/
>

Sure, sorry I seem to have missed that :-(

2023-09-14 12:34:35

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Thu, Sep 14, 2023 at 04:52:13PM +0530, Sandipan Das wrote:
> On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> > On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> >> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> >>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> >>
> >>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
> >>>
> >>> Could you send a patch that AMD is happy with?
> >>
> >> Why the current patch is not good enough?
> >
> > Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> > the AMD pmu and certainly I don't have insight into their design future.
>
> Hi Breno,
>
> Functionally, the patch looks good to me and I will be reusing it
> without any change to the authorship. However, as Peter suggested, I
> wanted to add a message to prompt users to update the microcode and
> also call out the required patch levels in the commit message since
> different Zen 4 variants and steppings use different microcode.
>
> Here's what I plan to send.

Awesome. Thanks for addressing it.

2023-09-14 16:24:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
>
> > > I agree with using WARN_ON_ONCE() to make this less intrusive.
> >
> > Could you send a patch that AMD is happy with?
>
> Why the current patch is not good enough?

Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
the AMD pmu and certainly I don't have insight into their design future.

2023-09-14 20:28:58

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
>> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
>>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
>>
>>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
>>>
>>> Could you send a patch that AMD is happy with?
>>
>> Why the current patch is not good enough?
>
> Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> the AMD pmu and certainly I don't have insight into their design future.

Hi Breno,

Functionally, the patch looks good to me and I will be reusing it
without any change to the authorship. However, as Peter suggested, I
wanted to add a message to prompt users to update the microcode and
also call out the required patch levels in the commit message since
different Zen 4 variants and steppings use different microcode.

Here's what I plan to send.

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index abadd5f23425..186a124bb3c0 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -909,6 +909,13 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
status &= ~GLOBAL_STATUS_LBRS_FROZEN;
}

+ if (status & ~amd_pmu_global_cntr_mask)
+ pr_warn_once("Unknown status bits are set (0x%llx), please consider updating microcode\n",
+ status);
+
+ /* Clear any reserved bits set by buggy microcode */
+ status &= amd_pmu_global_cntr_mask;
+
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;

--

Hi Peter,

There is another case where users will see warnings but the patch
to fix it (link below) is yet to be reviewed. May I rebase and
resend it along with the above?

https://lore.kernel.org/all/[email protected]/

2023-09-14 21:20:49

by Jirka Hladky

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Sandipan,

here is the info from /proc/cpuinfo

vendor_id : AuthenticAMD
cpu family : 25
model : 160
model name : AMD EPYC 9754 128-Core Processor
stepping : 2
microcode : 0xaa0020f

>2. Microcode patch level
Is it the microcode string from cpuinfo provided above, or are you
looking for something else?

Thanks!
Jirka


On Wed, Sep 13, 2023 at 6:19 PM Sandipan Das <[email protected]> wrote:
>
> Hi Jirka,
>
> Can you please share the following?
>
> 1. Family, Model and Stepping of the processor
> 2. Microcode patch level
> On 9/13/2023 8:00 PM, Jirka Hladky wrote:
> > Hi Sandipan,
> >
> > Is there any update on this issue? We have hit the issue, and it makes
> > the server pretty unusable due to the thousands of Call Traces being
> > logged.
> >
> > Thanks a lot!
> > Jirka
> >
> >
> > On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected]> wrote:
> >>
> >> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> >>> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> >>> I am working out the details with Breno and will report back with a resolution.
> >>
> >> Thanks!
> >>
> >
> >
>


--
-Jirka

2023-09-14 21:57:52

by Jirka Hladky

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Thank you, all!

I have confirmed that the latest microcode fixes the issue.

=============================================
cat /proc/cpuinfo
microcode : 0xaa00212

perf top runs fine, without any kernel warnings
=============================================

Jirka


On Thu, Sep 14, 2023 at 2:22 PM Breno Leitao <[email protected]> wrote:
>
> On Thu, Sep 14, 2023 at 04:52:13PM +0530, Sandipan Das wrote:
> > On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> > > On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> > >> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> > >>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> > >>
> > >>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
> > >>>
> > >>> Could you send a patch that AMD is happy with?
> > >>
> > >> Why the current patch is not good enough?
> > >
> > > Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> > > the AMD pmu and certainly I don't have insight into their design future.
> >
> > Hi Breno,
> >
> > Functionally, the patch looks good to me and I will be reusing it
> > without any change to the authorship. However, as Peter suggested, I
> > wanted to add a message to prompt users to update the microcode and
> > also call out the required patch levels in the commit message since
> > different Zen 4 variants and steppings use different microcode.
> >
> > Here's what I plan to send.
>
> Awesome. Thanks for addressing it.
>


--
-Jirka

2023-09-14 22:47:40

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH] perf/x86/amd: Do not WARN on every IRQ

Hi Jirka,

Thanks for reporting back. Moving to the latest Family 19h microcode (link below) will fix the problem.
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amd-ucode/microcode_amd_fam19h.bin


On 9/14/2023 2:03 PM, Jirka Hladky wrote:
> Hi Sandipan,
>
> here is the info from /proc/cpuinfo
>                                                                                                                                                                                                                      
> vendor_id       : AuthenticAMD                                                                                                                                                                                                              
> cpu family      : 25                                                                                                                                                                                                                        
> model           : 160                                                                                                                                                                                                                      
> model name      : AMD EPYC 9754 128-Core Processor                                                                                                                                                                                          
> stepping        : 2                                                                                                                                                                                                                        
> microcode       : 0xaa0020f
>
>>2. Microcode patch level
> Is it the microcode string from cpuinfo provided above, or are you looking for something else? 
>
> Thanks!
> Jirka
>
>
> On Wed, Sep 13, 2023 at 6:19 PM Sandipan Das <[email protected] <mailto:[email protected]>> wrote:
>
> Hi Jirka,
>
> Can you please share the following?
>
> 1. Family, Model and Stepping of the processor
> 2. Microcode patch level
> On 9/13/2023 8:00 PM, Jirka Hladky wrote:
> > Hi Sandipan,
> >
> > Is there any update on this issue? We have hit the issue, and it makes
> > the server pretty unusable due to the thousands of Call Traces being
> > logged.
> >
> > Thanks a lot!
> > Jirka
> >
> >
> > On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected] <mailto:[email protected]>> wrote:
> >>
> >> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> >>> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> >>> I am working out the details with Breno and will report back with a resolution.
> >>
> >> Thanks!
> >>
> >
> >
>
>
>
> --
> -Jirka