On some systems, the Performance Counter Global Status Register is
coming with reserved bits set, which causes the system to be unusable
if a simple `perf top` runs. The system hits the WARN() thousands times
while perf runs.
WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
This happens because the "Performance Counter Global Status Register"
(PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
Manual, Volume 2: System Programming, 24593"[1]
WARN_ONCE if any reserved bit is set, and sanitize the value to what the
code is handling, so the overflow events continue to be handled for the
number of events that are known to be sane.
Signed-off-by: Breno Leitao <[email protected]>
Fixes: 7685665c390d ("perf/x86/amd/core: Add PerfMonV2 overflow handling")
[1] Link: https://www.amd.com/system/files/TechDocs/24593.pdf
---
arch/x86/events/amd/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index bccea57dee81..809ddb15c122 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -909,6 +909,10 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
status &= ~GLOBAL_STATUS_LBRS_FROZEN;
}
+ amd_pmu_global_cntr_mask = (1ULL << x86_pmu.num_counters) - 1;
+ WARN_ON_ONCE(status & ~amd_pmu_global_cntr_mask);
+ status &= amd_pmu_global_cntr_mask;
+
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;
--
2.34.1
On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> On some systems, the Performance Counter Global Status Register is
> coming with reserved bits set, which causes the system to be unusable
> if a simple `perf top` runs. The system hits the WARN() thousands times
> while perf runs.
>
> WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
>
> This happens because the "Performance Counter Global Status Register"
> (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> Manual, Volume 2: System Programming, 24593"[1]
Would it then not make more sense to mask out bit7 before:
+ status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
if (!status)
goto done;
?
Aside from being reserved, why are these bits magically set all of a
sudden?
On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > On some systems, the Performance Counter Global Status Register is
> > coming with reserved bits set, which causes the system to be unusable
> > if a simple `perf top` runs. The system hits the WARN() thousands times
> > while perf runs.
> >
> > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> >
> > This happens because the "Performance Counter Global Status Register"
> > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > Manual, Volume 2: System Programming, 24593"[1]
>
> Would it then not make more sense to mask out bit7 before:
It is more than bit 7. This is the register structure according to the document
above:
Bits Mnemonic Description Access type
63:60 Reserved RO
59 PMCF Performance Counter Freeze RO
58 LBRSF Last Branch Record Stack Freeze RO
57:6 Reserved RO
5:0 CNT_OF Counter overflow for PerfCnt[5:0] RO
In the code, bit GLOBAL_STATUS_LBRS_FROZEN is handled and cleared before
we reach my changes
That said, your approach is almost similar to what I did, and I will be happy
to change in order to make the code clearer.
> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> if (!status)
> goto done;
>
> ?
>
> Aside from being reserved, why are these bits magically set all of a
> sudden?
That is probably a question to AMD.
On 16/06/23 7:33 pm, Breno Leitao wrote:
> On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
>>> On some systems, the Performance Counter Global Status Register is
>>> coming with reserved bits set, which causes the system to be unusable
>>> if a simple `perf top` runs. The system hits the WARN() thousands times
>>> while perf runs.
>>>
>>> WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
>>>
>>> This happens because the "Performance Counter Global Status Register"
>>> (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
>>> to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
>>> Manual, Volume 2: System Programming, 24593"[1]
>>
>> Would it then not make more sense to mask out bit7 before:
>
> It is more than bit 7. This is the register structure according to the document
> above:
>
> Bits Mnemonic Description Access type
> 63:60 Reserved RO
> 59 PMCF Performance Counter Freeze RO
> 58 LBRSF Last Branch Record Stack Freeze RO
> 57:6 Reserved RO
> 5:0 CNT_OF Counter overflow for PerfCnt[5:0] RO
>
> In the code, bit GLOBAL_STATUS_LBRS_FROZEN is handled and cleared before
> we reach my changes
>
> That said, your approach is almost similar to what I did, and I will be happy
> to change in order to make the code clearer.
>
>> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
>> if (!status)
>> goto done;
>>
>> ?
>>
>> Aside from being reserved, why are these bits magically set all of a
>> sudden?
>
> That is probably a question to AMD.
>
The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
I am working out the details with Breno and will report back with a resolution.
- Sandipan
[sending from my personal email as I currently don't have access to my work laptop]
On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> I am working out the details with Breno and will report back with a resolution.
Thanks!
On Wed, Sep 13, 2023 at 04:30:44PM +0200, Jirka Hladky wrote:
> Hi Sandipan,
>
> Is there any update on this issue?
I still think we want to have this patch in Linux, so, we protect the
kernel for whatever the hardware/firmware is doing.
I agree.
Are you planning v2 of the patch, addressing the points raised by
Peter Zijlstra?
On Wed, Sep 13, 2023 at 5:03 PM Breno Leitao <[email protected]> wrote:
>
> On Wed, Sep 13, 2023 at 04:30:44PM +0200, Jirka Hladky wrote:
> > Hi Sandipan,
> >
> > Is there any update on this issue?
>
> I still think we want to have this patch in Linux, so, we protect the
> kernel for whatever the hardware/firmware is doing.
>
--
-Jirka
Hi Peter,
On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > On some systems, the Performance Counter Global Status Register is
> > coming with reserved bits set, which causes the system to be unusable
> > if a simple `perf top` runs. The system hits the WARN() thousands times
> > while perf runs.
> >
> > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> >
> > This happens because the "Performance Counter Global Status Register"
> > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > Manual, Volume 2: System Programming, 24593"[1]
>
> Would it then not make more sense to mask out bit7 before:
>
> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> if (!status)
> goto done;
Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
(AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
global variable because it seems to represent what the loop below is
iterating over:
/* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
static u64 amd_pmu_global_cntr_mask __read_mostly;
Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
now, it warns at every time we call this function, which makes the
machine unusable, but, warning it once could be helpful to figure out
there is something wrong with the machine/firmware.
Anyway, please let me know whatever is your preferred way and I will
submit a v2.
Hi Jirka,
Can you please share the following?
1. Family, Model and Stepping of the processor
2. Microcode patch level
On 9/13/2023 8:00 PM, Jirka Hladky wrote:
> Hi Sandipan,
>
> Is there any update on this issue? We have hit the issue, and it makes
> the server pretty unusable due to the thousands of Call Traces being
> logged.
>
> Thanks a lot!
> Jirka
>
>
> On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
>>> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
>>> I am working out the details with Breno and will report back with a resolution.
>>
>> Thanks!
>>
>
>
Hi Sandipan,
Is there any update on this issue? We have hit the issue, and it makes
the server pretty unusable due to the thousands of Call Traces being
logged.
Thanks a lot!
Jirka
On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> > The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> > I am working out the details with Breno and will report back with a resolution.
>
> Thanks!
>
--
-Jirka
Hi Breno, Jirka,
On 9/14/2023 2:15 PM, Jirka Hladky wrote:
> Hi Breno,
>
> I'm definitively voting for using WARN_ON_ONCE - in the current
> implementation, we are getting thousands of the same warnings and Call
> Traces, causing the system to become unusable.
>
>> Anyway, please let me know whatever is your preferred way and I will submit a v2.
> @Peter Zijlstra and @Sandipan - could you please comment on the
> preferred implementation of the patch?
>
I agree with using WARN_ON_ONCE() to make this less intrusive.
>
> On Wed, Sep 13, 2023 at 6:24 PM Breno Leitao <[email protected]> wrote:
>>
>> Hi Peter,
>>
>> On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
>>> On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
>>>> On some systems, the Performance Counter Global Status Register is
>>>> coming with reserved bits set, which causes the system to be unusable
>>>> if a simple `perf top` runs. The system hits the WARN() thousands times
>>>> while perf runs.
>>>>
>>>> WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
>>>>
>>>> This happens because the "Performance Counter Global Status Register"
>>>> (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
>>>> to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
>>>> Manual, Volume 2: System Programming, 24593"[1]
>>>
>>> Would it then not make more sense to mask out bit7 before:
>>>
>>> + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
>>> if (!status)
>>> goto done;
>>
>> Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
>> (AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
>> global variable because it seems to represent what the loop below is
>> iterating over:
>>
>> /* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
>> static u64 amd_pmu_global_cntr_mask __read_mostly;
>>
>> Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
>> now, it warns at every time we call this function, which makes the
>> machine unusable, but, warning it once could be helpful to figure out
>> there is something wrong with the machine/firmware.
>>
>> Anyway, please let me know whatever is your preferred way and I will
>> submit a v2.
>>
>
>
On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> > I agree with using WARN_ON_ONCE() to make this less intrusive.
>
> Could you send a patch that AMD is happy with?
Why the current patch is not good enough?
On 9/14/2023 2:42 PM, Peter Zijlstra wrote:
> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
>> Hi Breno, Jirka,
>>
>> On 9/14/2023 2:15 PM, Jirka Hladky wrote:
>>> Hi Breno,
>>>
>>> I'm definitively voting for using WARN_ON_ONCE - in the current
>>> implementation, we are getting thousands of the same warnings and Call
>>> Traces, causing the system to become unusable.
>>>
>>>> Anyway, please let me know whatever is your preferred way and I will submit a v2.
>>> @Peter Zijlstra and @Sandipan - could you please comment on the
>>> preferred implementation of the patch?
>>>
>>
>> I agree with using WARN_ON_ONCE() to make this less intrusive.
>
> Could you send a patch that AMD is happy with? I think you wrote this
> was a firmware bug and is sorted with a new firmware, so perhaps make it
> WARN_ONCE() and tell the users to upgrade their firmware to $ver ?
Sure. Will do.
Hi Breno,
I'm definitively voting for using WARN_ON_ONCE - in the current
implementation, we are getting thousands of the same warnings and Call
Traces, causing the system to become unusable.
>Anyway, please let me know whatever is your preferred way and I will submit a v2.
@Peter Zijlstra and @Sandipan - could you please comment on the
preferred implementation of the patch?
THANK YOU
Jirka
On Wed, Sep 13, 2023 at 6:24 PM Breno Leitao <[email protected]> wrote:
>
> Hi Peter,
>
> On Fri, Jun 16, 2023 at 03:29:54PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 16, 2023 at 04:53:15AM -0700, Breno Leitao wrote:
> > > On some systems, the Performance Counter Global Status Register is
> > > coming with reserved bits set, which causes the system to be unusable
> > > if a simple `perf top` runs. The system hits the WARN() thousands times
> > > while perf runs.
> > >
> > > WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0
> > >
> > > This happens because the "Performance Counter Global Status Register"
> > > (PerfCntGlobalStatus) MSR has bit 7 set. Bit 7 should be reserved according
> > > to the documentation (Figure 13-12 from "AMD64 Architecture Programmer’s
> > > Manual, Volume 2: System Programming, 24593"[1]
> >
> > Would it then not make more sense to mask out bit7 before:
> >
> > + status &= ~AMD_PMU_V2_GLOBAL_STATUS_RESERVED;
> > if (!status)
> > goto done;
>
> Instead of masking `status` against AMD_PMU_V2_GLOBAL_STATUS_RESERVED
> (AMD64_NUM_COUNTERS?), I opted for using the `amd_pmu_global_cntr_mask`
> global variable because it seems to represent what the loop below is
> iterating over:
>
> /* PMC Enable and Overflow bits for PerfCntrGlobal* registers */
> static u64 amd_pmu_global_cntr_mask __read_mostly;
>
> Also, I think we want to WARN_ON_ONCE() if we see this problem. Right
> now, it warns at every time we call this function, which makes the
> machine unusable, but, warning it once could be helpful to figure out
> there is something wrong with the machine/firmware.
>
> Anyway, please let me know whatever is your preferred way and I will
> submit a v2.
>
--
-Jirka
On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> Hi Breno, Jirka,
>
> On 9/14/2023 2:15 PM, Jirka Hladky wrote:
> > Hi Breno,
> >
> > I'm definitively voting for using WARN_ON_ONCE - in the current
> > implementation, we are getting thousands of the same warnings and Call
> > Traces, causing the system to become unusable.
> >
> >> Anyway, please let me know whatever is your preferred way and I will submit a v2.
> > @Peter Zijlstra and @Sandipan - could you please comment on the
> > preferred implementation of the patch?
> >
>
> I agree with using WARN_ON_ONCE() to make this less intrusive.
Could you send a patch that AMD is happy with? I think you wrote this
was a firmware bug and is sorted with a new firmware, so perhaps make it
WARN_ONCE() and tell the users to upgrade their firmware to $ver ?
On Thu, Sep 14, 2023 at 04:52:13PM +0530, Sandipan Das wrote:
> On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> > On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> >> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> >>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> >>
> >>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
> >>>
> >>> Could you send a patch that AMD is happy with?
> >>
> >> Why the current patch is not good enough?
> >
> > Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> > the AMD pmu and certainly I don't have insight into their design future.
>
> Hi Breno,
>
> Functionally, the patch looks good to me and I will be reusing it
> without any change to the authorship. However, as Peter suggested, I
> wanted to add a message to prompt users to update the microcode and
> also call out the required patch levels in the commit message since
> different Zen 4 variants and steppings use different microcode.
>
> Here's what I plan to send.
>
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index abadd5f23425..186a124bb3c0 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -909,6 +909,13 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
> status &= ~GLOBAL_STATUS_LBRS_FROZEN;
> }
>
> + if (status & ~amd_pmu_global_cntr_mask)
> + pr_warn_once("Unknown status bits are set (0x%llx), please consider updating microcode\n",
> + status);
> +
> + /* Clear any reserved bits set by buggy microcode */
> + status &= amd_pmu_global_cntr_mask;
> +
> for (idx = 0; idx < x86_pmu.num_counters; idx++) {
> if (!test_bit(idx, cpuc->active_mask))
> continue;
>
> --
>
> Hi Peter,
>
> There is another case where users will see warnings but the patch
> to fix it (link below) is yet to be reviewed. May I rebase and
> resend it along with the above?
>
> https://lore.kernel.org/all/[email protected]/
>
Sure, sorry I seem to have missed that :-(
On Thu, Sep 14, 2023 at 04:52:13PM +0530, Sandipan Das wrote:
> On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> > On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> >> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> >>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> >>
> >>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
> >>>
> >>> Could you send a patch that AMD is happy with?
> >>
> >> Why the current patch is not good enough?
> >
> > Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> > the AMD pmu and certainly I don't have insight into their design future.
>
> Hi Breno,
>
> Functionally, the patch looks good to me and I will be reusing it
> without any change to the authorship. However, as Peter suggested, I
> wanted to add a message to prompt users to update the microcode and
> also call out the required patch levels in the commit message since
> different Zen 4 variants and steppings use different microcode.
>
> Here's what I plan to send.
Awesome. Thanks for addressing it.
On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
>
> > > I agree with using WARN_ON_ONCE() to make this less intrusive.
> >
> > Could you send a patch that AMD is happy with?
>
> Why the current patch is not good enough?
Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
the AMD pmu and certainly I don't have insight into their design future.
On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
>> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
>>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
>>
>>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
>>>
>>> Could you send a patch that AMD is happy with?
>>
>> Why the current patch is not good enough?
>
> Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> the AMD pmu and certainly I don't have insight into their design future.
Hi Breno,
Functionally, the patch looks good to me and I will be reusing it
without any change to the authorship. However, as Peter suggested, I
wanted to add a message to prompt users to update the microcode and
also call out the required patch levels in the commit message since
different Zen 4 variants and steppings use different microcode.
Here's what I plan to send.
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index abadd5f23425..186a124bb3c0 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -909,6 +909,13 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
status &= ~GLOBAL_STATUS_LBRS_FROZEN;
}
+ if (status & ~amd_pmu_global_cntr_mask)
+ pr_warn_once("Unknown status bits are set (0x%llx), please consider updating microcode\n",
+ status);
+
+ /* Clear any reserved bits set by buggy microcode */
+ status &= amd_pmu_global_cntr_mask;
+
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;
--
Hi Peter,
There is another case where users will see warnings but the patch
to fix it (link below) is yet to be reviewed. May I rebase and
resend it along with the above?
https://lore.kernel.org/all/[email protected]/
Hi Sandipan,
here is the info from /proc/cpuinfo
vendor_id : AuthenticAMD
cpu family : 25
model : 160
model name : AMD EPYC 9754 128-Core Processor
stepping : 2
microcode : 0xaa0020f
>2. Microcode patch level
Is it the microcode string from cpuinfo provided above, or are you
looking for something else?
Thanks!
Jirka
On Wed, Sep 13, 2023 at 6:19 PM Sandipan Das <[email protected]> wrote:
>
> Hi Jirka,
>
> Can you please share the following?
>
> 1. Family, Model and Stepping of the processor
> 2. Microcode patch level
> On 9/13/2023 8:00 PM, Jirka Hladky wrote:
> > Hi Sandipan,
> >
> > Is there any update on this issue? We have hit the issue, and it makes
> > the server pretty unusable due to the thousands of Call Traces being
> > logged.
> >
> > Thanks a lot!
> > Jirka
> >
> >
> > On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected]> wrote:
> >>
> >> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> >>> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> >>> I am working out the details with Breno and will report back with a resolution.
> >>
> >> Thanks!
> >>
> >
> >
>
--
-Jirka
Thank you, all!
I have confirmed that the latest microcode fixes the issue.
=============================================
cat /proc/cpuinfo
microcode : 0xaa00212
perf top runs fine, without any kernel warnings
=============================================
Jirka
On Thu, Sep 14, 2023 at 2:22 PM Breno Leitao <[email protected]> wrote:
>
> On Thu, Sep 14, 2023 at 04:52:13PM +0530, Sandipan Das wrote:
> > On 9/14/2023 4:48 PM, Peter Zijlstra wrote:
> > > On Thu, Sep 14, 2023 at 02:30:43AM -0700, Breno Leitao wrote:
> > >> On Thu, Sep 14, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> > >>> On Thu, Sep 14, 2023 at 02:25:40PM +0530, Sandipan Das wrote:
> > >>
> > >>>> I agree with using WARN_ON_ONCE() to make this less intrusive.
> > >>>
> > >>> Could you send a patch that AMD is happy with?
> > >>
> > >> Why the current patch is not good enough?
> > >
> > > Sandipan, can you answer this? I don't tihnk I'm qualified to speak for
> > > the AMD pmu and certainly I don't have insight into their design future.
> >
> > Hi Breno,
> >
> > Functionally, the patch looks good to me and I will be reusing it
> > without any change to the authorship. However, as Peter suggested, I
> > wanted to add a message to prompt users to update the microcode and
> > also call out the required patch levels in the commit message since
> > different Zen 4 variants and steppings use different microcode.
> >
> > Here's what I plan to send.
>
> Awesome. Thanks for addressing it.
>
--
-Jirka
Hi Jirka,
Thanks for reporting back. Moving to the latest Family 19h microcode (link below) will fix the problem.
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amd-ucode/microcode_amd_fam19h.bin
On 9/14/2023 2:03 PM, Jirka Hladky wrote:
> Hi Sandipan,
>
> here is the info from /proc/cpuinfo
>
> vendor_id : AuthenticAMD
> cpu family : 25
> model : 160
> model name : AMD EPYC 9754 128-Core Processor
> stepping : 2
> microcode : 0xaa0020f
>
>>2. Microcode patch level
> Is it the microcode string from cpuinfo provided above, or are you looking for something else?
>
> Thanks!
> Jirka
>
>
> On Wed, Sep 13, 2023 at 6:19 PM Sandipan Das <[email protected] <mailto:[email protected]>> wrote:
>
> Hi Jirka,
>
> Can you please share the following?
>
> 1. Family, Model and Stepping of the processor
> 2. Microcode patch level
> On 9/13/2023 8:00 PM, Jirka Hladky wrote:
> > Hi Sandipan,
> >
> > Is there any update on this issue? We have hit the issue, and it makes
> > the server pretty unusable due to the thousands of Call Traces being
> > logged.
> >
> > Thanks a lot!
> > Jirka
> >
> >
> > On Fri, Jun 16, 2023 at 5:34 PM Peter Zijlstra <[email protected] <mailto:[email protected]>> wrote:
> >>
> >> On Fri, Jun 16, 2023 at 08:13:22PM +0530, Sandipan Das (AMD) wrote:
> >>> The reserved bits should never be set. The purpose of the WARN_ON() is to catch such anomalies.
> >>> I am working out the details with Breno and will report back with a resolution.
> >>
> >> Thanks!
> >>
> >
> >
>
>
>
> --
> -Jirka