2023-04-05 22:17:07

by Rui Salvaterra

Subject: [BUG?] unchecked MSR access error: WRMSR to 0x19c

Hi, everyone,

I have a Haswell (Core i7-4770R) machine running Linux 6.3-rc5 on
which, after a while under load (say, compiling the kernel), I get
this trace…

[ 832.549630] unchecked MSR access error: WRMSR to 0x19c (tried to write 0x000000000000aaa8) at rIP: 0xffffffff816f66a6 (throttle_active_work+0xa6/0x1d0)
[ 832.549652] Call Trace:
[ 832.549654] <TASK>
[ 832.549655] process_one_work+0x1ab/0x300
[ 832.549661] worker_thread+0x4b/0x340
[ 832.549664] ? process_one_work+0x300/0x300
[ 832.549676] kthread+0xac/0xc0
[ 832.549679] ? kthread_exit+0x20/0x20
[ 832.549682] ret_from_fork+0x1f/0x30
[ 832.549693] </TASK>

… after which I get these from time to time in dmesg.

[ 836.709562] CPU7: Core temperature is above threshold, cpu clock is throttled (total events = 219)
[ 836.709569] CPU3: Core temperature is above threshold, cpu clock is throttled (total events = 219)
[ 1272.792138] CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 1)
[ 1272.792156] CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 1)

This is the microcode revision on the CPU.

[ 0.000000] microcode: updated early: 0xe -> 0x1c, date = 2019-11-12

Note that I have the exact same issue on an Ivy Bridge (Core
i7-3720QM) machine, but not on an Ivy Bridge laptop (Celeron 1007U).
Maybe this is a legitimate warning, but please note that I've
thoroughly cleaned the machines before retesting to see if, by
coincidence, I had any airway/cooling issues. The fact that it started
happening recently (since Linux 6.1, I believe), and the fact that
running stress-ng --cpu 16 before the unchecked WRMSR error happens
doesn't cause any thermal throttling events, lead me to believe this
is possibly some unintended oversight.

Please let me know if you need any additional information (.config, or
anything else).

Thanks in advance,
Rui


2023-04-06 21:39:30

by Borislav Petkov

Subject: Re: [BUG?] unchecked MSR access error: WRMSR to 0x19c

CCing more appropriate people and quoting the whole mail...

On Wed, Apr 05, 2023 at 11:14:45PM +0100, Rui Salvaterra wrote:
> Hi, everyone,
>
> I have a Haswell (Core i7-4770R) machine running Linux 6.3-rc5 on
> which, after a while under load (say, compiling the kernel), I get
> this trace…
>
> [ 832.549630] unchecked MSR access error: WRMSR to 0x19c (tried to write 0x000000000000aaa8) at rIP: 0xffffffff816f66a6 (throttle_active_work+0xa6/0x1d0)
> [ 832.549652] Call Trace:
> [ 832.549654] <TASK>
> [ 832.549655] process_one_work+0x1ab/0x300
> [ 832.549661] worker_thread+0x4b/0x340
> [ 832.549664] ? process_one_work+0x300/0x300
> [ 832.549676] kthread+0xac/0xc0
> [ 832.549679] ? kthread_exit+0x20/0x20
> [ 832.549682] ret_from_fork+0x1f/0x30
> [ 832.549693] </TASK>
>
> … after which I get these from time to time in dmesg.
>
> [ 836.709562] CPU7: Core temperature is above threshold, cpu clock is throttled (total events = 219)
> [ 836.709569] CPU3: Core temperature is above threshold, cpu clock is throttled (total events = 219)
> [ 1272.792138] CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 1)
> [ 1272.792156] CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 1)
>
> This is the microcode revision on the CPU.
>
> [ 0.000000] microcode: updated early: 0xe -> 0x1c, date = 2019-11-12
>
> Note that I have the exact same issue on an Ivy Bridge (Core
> i7-3720QM) machine, but not on an Ivy Bridge laptop (Celeron 1007U).
> Maybe this is a legitimate warning, but please note that I've
> thoroughly cleaned the machines before retesting to see if, by
> coincidence, I had any airway/cooling issues. The fact that it started
> happening recently (since Linux 6.1, I believe), and the fact that
> running stress-ng --cpu 16 before the unchecked WRMSR error happens
> doesn't cause any thermal throttling events, lead me to believe this
> is possibly some unintended oversight.
>
> Please let me know if you need any additional information (.config, or
> anything else).
>
> Thanks in advance,
> Rui

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-06 22:44:41

by srinivas pandruvada

Subject: Re: [BUG?] unchecked MSR access error: WRMSR to 0x19c

Hi Rui,

On Thu, 2023-04-06 at 23:36 +0200, Borislav Petkov wrote:
> CCing more appropriate people and quoting the whole mail...
>
> On Wed, Apr 05, 2023 at 11:14:45PM +0100, Rui Salvaterra wrote:
> > Hi, everyone,
> >
> > I have a Haswell (Core i7-4770R) machine running Linux 6.3-rc5 on
> > which, after a while under load (say, compiling the kernel), I get
> > this trace…
> >
> > [  832.549630] unchecked MSR access error: WRMSR to 0x19c (tried to
> > write 0x000000000000aaa8) at rIP: 0xffffffff816f66a6

Please send the output of: cpuid -1

Also please try this:

#wrmsr 0x19c 0xaaa8
This should give "wrmsr: CPU 0 cannot set MSR 0x0000019c to
0x000000000000aaa8".

#wrmsr 0x19c 0x0aa8

I think this will not return an error.
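For readers without msr-tools, the same two test writes can be issued through the Linux msr driver's character device, where the file offset selects the MSR and each access is 8 bytes. The following is a minimal Python sketch, not part of the thread; `try_wrmsr` and `encode_msr_value` are hypothetical helper names. It assumes a faulting WRMSR is reported to userspace as `EIO`, which is the case msr-tools reports as "cannot set MSR". Writing MSRs changes hardware state, so it should only be used to repeat the two writes proposed above.

```python
import errno
import os
import struct

# MSR index taken from the trace above (0x19c).
MSR_INDEX = 0x19C


def encode_msr_value(value: int) -> bytes:
    """Pack a 64-bit MSR value as 8 little-endian bytes for the msr device."""
    return struct.pack("<Q", value)


def try_wrmsr(cpu: int, msr: int, value: int) -> bool:
    """Hypothetical helper: write `value` to `msr` on `cpu` via /dev/cpu/<cpu>/msr.

    Returns True if all 8 bytes were written, False if the write failed with
    EIO (assumed here to mean the WRMSR faulted). Needs root and the msr
    module; any other error is re-raised.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        # pwrite's offset argument selects the MSR to write.
        written = os.pwrite(fd, encode_msr_value(value), msr)
        return written == 8
    except OSError as exc:
        if exc.errno == errno.EIO:
            return False
        raise
    finally:
        os.close(fd)
```

As root, with the msr module loaded, `try_wrmsr(0, MSR_INDEX, 0xAAA8)` and `try_wrmsr(0, MSR_INDEX, 0x0AA8)` correspond to the two `wrmsr` commands above. The sketch does not run anything at import time, so it can be read or imported without touching hardware.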

Thanks,
Srinivas

> > (throttle_active_work+0xa6/0x1d0)
> > [  832.549652] Call Trace:
> > [  832.549654]  <TASK>
> > [  832.549655]  process_one_work+0x1ab/0x300
> > [  832.549661]  worker_thread+0x4b/0x340
> > [  832.549664]  ? process_one_work+0x300/0x300
> > [  832.549676]  kthread+0xac/0xc0
> > [  832.549679]  ? kthread_exit+0x20/0x20
> > [  832.549682]  ret_from_fork+0x1f/0x30
> > [  832.549693]  </TASK>
> >
> > … after which I get these from time to time in dmesg.
> >
> > [  836.709562] CPU7: Core temperature is above threshold, cpu clock is throttled (total events = 219)
> > [  836.709569] CPU3: Core temperature is above threshold, cpu clock is throttled (total events = 219)
> > [ 1272.792138] CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 1)
> > [ 1272.792156] CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 1)
> >
> > This is the microcode revision on the CPU.
> >
> > [    0.000000] microcode: updated early: 0xe -> 0x1c, date = 2019-11-12
> >
> > Note that I have the exact same issue on an Ivy Bridge (Core
> > i7-3720QM) machine, but not on an Ivy Bridge laptop (Celeron 1007U).
> > Maybe this is a legitimate warning, but please note that I've
> > thoroughly cleaned the machines before retesting to see if, by
> > coincidence, I had any airway/cooling issues. The fact that it started
> > happening recently (since Linux 6.1, I believe), and the fact that
> > running stress-ng --cpu 16 before the unchecked WRMSR error happens
> > doesn't cause any thermal throttling events, lead me to believe this
> > is possibly some unintended oversight.
> >
> > Please let me know if you need any additional information (.config, or
> > anything else).
> >
> > Thanks in advance,
> > Rui
>

2023-04-07 19:14:20

by Rui Salvaterra

Subject: Re: [BUG?] unchecked MSR access error: WRMSR to 0x19c

Hi, Srinivas,

On Thu, 6 Apr 2023 at 23:31, srinivas pandruvada
<[email protected]> wrote:
>
> Please send the output of: cpuid -1

Attached.

> Also please try this:
>
> #wrmsr 0x19c 0xaaa8
> This should give "wrmsr: CPU 0 cannot set MSR 0x0000019c to 0x000000000000aaa8".

Indeed, it does:

wrmsr: CPU 0 cannot set MSR 0x0000019c to 0x000000000000aaa8

> #wrmsr 0x19c 0x0aa8
>
> I think this will not return an error.

Correct, no error writing 0x0aa8.
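The rejected value (0xaaa8) and the accepted one (0x0aa8) differ only in their top hex digit. A minimal Python sketch, not part of the thread, that lists the set bits of each value and of their XOR makes the difference explicit. `set_bits` is a hypothetical helper. The sketch only does the arithmetic; which fields of MSR 0x19c those bits correspond to, and when a given CPU accepts them, should be checked against the Intel SDM rather than inferred from it.

```python
# Values from the thread: the write the kernel attempted and the one that succeeded.
REJECTED = 0xAAA8
ACCEPTED = 0x0AA8


def set_bits(value: int) -> list[int]:
    """Return the indices of the bits set in `value`, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# XOR keeps only the bits that differ between the two writes.
diff = REJECTED ^ ACCEPTED

print(f"rejected bits: {set_bits(REJECTED)}")
print(f"accepted bits: {set_bits(ACCEPTED)}")
print(f"difference:    {diff:#06x} -> bits {set_bits(diff)}")
```

Running it shows that the accepted value sets bits 3, 5, 7, 9 and 11, and that the kernel's write additionally sets bits 13 and 15 (a difference of 0xa000).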

I'll soon test the patch you sent me.

Kind regards,
Rui


Attachments:
cpuid.txt (26.40 kB)