2023-04-19 10:02:34

by Jinpu Wang

Subject: k10temp show over 100 degrees temperature on EPYC Milan servers from DELL and SMC

Dear experts on the list,

We've noticed that many of our EPYC Milan servers from different vendors
(DELL and SMC) report Tctl readings above 100 degrees, e.g.

sudo sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +117.5°C
Tdie: +117.5°C
Tccd1: +67.0°C
Tccd2: +65.2°C
Tccd3: +63.2°C
Tccd4: +63.8°C
Tccd5: +67.2°C
Tccd6: +63.5°C
Tccd7: +64.2°C
Tccd8: +64.8°C

sudo lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7713P 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2479.705
CPU max MHz: 3720,7029
CPU min MHz: 1500,0000
BogoMIPS: 3992.43
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-127

We've seen such high temperatures even on idle servers.
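
For illustration only, here is a minimal sketch of how the mismatch in the output above could be flagged automatically: it parses `sensors`-style text and compares Tctl with the hottest per-CCD reading. The helper names and the 30-degree threshold are assumptions made for this example, not part of lm-sensors or the k10temp driver.

```python
import re

# Sample text copied from the k10temp output quoted in this message.
SAMPLE = """\
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +117.5°C
Tdie: +117.5°C
Tccd1: +67.0°C
Tccd2: +65.2°C
Tccd3: +63.2°C
Tccd4: +63.8°C
Tccd5: +67.2°C
Tccd6: +63.5°C
Tccd7: +64.2°C
Tccd8: +64.8°C
"""

# Hypothetical threshold: how far Tctl may exceed the hottest CCD
# before we treat the reading as implausible.
SUSPICIOUS_GAP_C = 30.0

LINE_RE = re.compile(r"^(T\w+):\s*([+-]?\d+(?:\.\d+)?)°C")


def parse_sensors(text):
    """Return {label: degrees Celsius} for every 'Txxx: +NN.N°C' line."""
    readings = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings


def tctl_gap(readings):
    """Return Tctl minus the hottest Tccd value, or None if data is missing."""
    ccds = [v for k, v in readings.items() if k.startswith("Tccd")]
    if "Tctl" not in readings or not ccds:
        return None
    return readings["Tctl"] - max(ccds)


def looks_suspicious(readings, limit=SUSPICIOUS_GAP_C):
    """True when Tctl is more than `limit` degrees above the hottest CCD."""
    gap = tctl_gap(readings)
    return gap is not None and gap > limit


if __name__ == "__main__":
    r = parse_sensors(SAMPLE)
    print(f"gap: {tctl_gap(r):.1f} C, suspicious: {looks_suspicious(r)}")
```

On the sample above, Tctl sits roughly 50 degrees above every Tccd value, which is what prompted the question.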

We are running LTS kernel 5.10.136, but checking the git history for
the k10temp driver, I couldn't find any fix that we are missing.
My questions are:
1. Is it normal to have such high temperatures for Tctl? Can we trust
the value?
2. Do we need to worry about such high temperatures?

Thx!
Jinpu Wang @ IONOS Cloud.


2023-04-19 13:44:49

by Mario Limonciello

Subject: RE: k10temp show over 100 degrees temperature on EPYC Milan servers from DELL and SMC

[Public]

Hi,

[ ... ]

It's fixed by this patch that will be going into 6.4.
https://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git/commit/?h=hwmon-next&id=1dc8e097967b69a56531c9ccb70b854771310e85

Guenter,

If you didn't already send your 6.4 PR, can you please add
Cc: stable@vger.kernel.org to the patch in your tree?

Thanks,

2023-04-19 14:11:04

by Guenter Roeck

Subject: Re: k10temp show over 100 degrees temperature on EPYC Milan servers from DELL and SMC

On 4/19/23 06:33, Limonciello, Mario wrote:
> [Public]
>
> Hi,
>
[ ... ]

>
> It's fixed by this patch that will be going into 6.4.
> https://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git/commit/?h=hwmon-next&id=1dc8e097967b69a56531c9ccb70b854771310e85
>
> Guenter,
>
> If you didn't already send your 6.4 PR, can you please add
> Cc: stable@vger.kernel.org to the patch in your tree?
>

I already added a Fixes: tag but, sure, adding Cc: stable@
makes sense. Will do.

Thanks,
Guenter

2023-04-20 05:50:09

by Jinpu Wang

Subject: Re: k10temp show over 100 degrees temperature on EPYC Milan servers from DELL and SMC

On Wed, Apr 19, 2023 at 3:33 PM Limonciello, Mario wrote:
>
> [Public]
>
> Hi,
>
[ ... ]
>
> It's fixed by this patch that will be going into 6.4.
> https://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git/commit/?h=hwmon-next&id=1dc8e097967b69a56531c9ccb70b854771310e85

Hi,

I tested the patch on an affected server, and the Tctl output looks normal now.

Thanks for the quick reply.
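
For anyone who wants to re-check the reading after booting a patched kernel without going through `sensors`, the sketch below reads the same values from the hwmon sysfs interface. It assumes the standard hwmon layout (`name`, `tempN_label`, and `tempN_input` in millidegrees Celsius); the function name and the `hwmon_root` parameter are made up for this example.

```python
from pathlib import Path


def read_k10temp(hwmon_root="/sys/class/hwmon"):
    """Return {label: degrees Celsius} for k10temp devices under hwmon_root.

    Note: on multi-socket systems each socket has its own k10temp device
    with the same labels; this simple sketch keeps only the last one seen.
    """
    results = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        if not name_file.is_file():
            continue
        if name_file.read_text().strip() != "k10temp":
            continue
        for label_file in sorted(dev.glob("temp*_label")):
            # temp1_label pairs with temp1_input, and so on.
            input_file = dev / label_file.name.replace("_label", "_input")
            if input_file.is_file():
                label = label_file.read_text().strip()
                # hwmon reports temperatures in millidegrees Celsius.
                results[label] = int(input_file.read_text()) / 1000.0
    return results


if __name__ == "__main__":
    for label, temp in read_k10temp().items():
        print(f"{label}: {temp:.1f} C")
```

The root directory is a parameter only so the function can be pointed at a test directory; on a real system the default path is the one to use.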