2023-02-17 16:37:42

by Zhang, Rui

[permalink] [raw]
Subject: [RFC PATCH 0/1] x86: cpu topology fix and question on x86_max_cores

Hi, All,

There are two kernel issues observed on Intel Hybrid platform named
MeteorLake. And these two issues altogether bring broken CPU topology.

This patch series aims to
1. provide a solution for the first issue, as an urgent fix and -stable
candidate, so that this problem is not exposed to end users.
2. get feedback on how to fix the second issue.

Any comments on this are really appreciated.

Problem details on MeteorLake
-----------------------------

On Intel Hybrid platforms like AlderLake-P/S, both smp_num_siblings
and cpuinfo_x86.x86_max_cores are broken like below

[ 0.201005] detect_extended_topology: CPU APICID 0x0, smp_num_siblings 2, x86_max_cores 10
[ 0.201117] start_kernel->check_bugs->cpu_smt_check_topology: smp_num_siblings 2
...
[ 0.010146] detect_extended_topology: CPU APICID 0x8, smp_num_siblings 2, x86_max_cores 10
...
[ 0.010146] detect_extended_topology: CPU APICID 0x39, smp_num_siblings 2, x86_max_cores 10
[ 0.010146] detect_extended_topology: CPU APICID 0x48, smp_num_siblings 1, x86_max_cores 20
...
[ 0.010146] detect_extended_topology: CPU APICID 0x4e, smp_num_siblings 1, x86_max_cores 20
[ 2.583800] sched_set_itmt_core_prio: smp_num_siblings 1

This is because the SMT siblings value returned by CPUID.1F SMT level
EBX differs among CPUs. It returns 2 for Pcore CPUs which have HT
sibling and returns 1 for Ecore CPUs which do not have SMT sibling.

This brings several potential issues:
1. some kernel configuration like cpu_smt_control, as set in
start_kernel()->check_bugs()->cpu_smt_check_topology(), depends on
smp_num_siblings set by boot cpu.
It is pure luck that all the current hybrid platforms use Pcore as cpu0
and hide this problem.
2. some per CPU data like cpuinfo_x86.x86_max_cores that depends on
smp_num_siblings becomes inconsistent and bogus.
3. the final smp_num_siblings value after boot depends on the last CPU
enumerated, which could either be Pcore or Ecore CPU.

Previously, there is no functional issue observed on AlderLake-P/S.

However, on MeteorLake, this becomes worse.
a). smp_num_siblings varies like AlderLake and it is set to 1 for Ecore.
b). x86_max_cores is totally broken and it is set to 1 for the boot cpu.

Altogether, these two issues make the system being treated as an UP
system in set_cpu_sibling_map() when probing Ecore CPUs, and the Ecore
CPUs are not updated in any cpu sibling maps erroneously.

Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21

And this also breaks userspace tools like lscpu
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1

Solution for fix smp_num_sibling
--------------------------------

Patch 1/1 ensures that smp_num_siblings represents the system-wide maximum
number of siblings by always increasing its value. Never allow it to
decrease.

It is sufficient to make the problem go away.

However, there is a pontenial problem left. That is, when boot CPU is an
Ecore CPU, smp_num_sibling is set to 1 during BSP probe, kernel disables
SMT support by setting cpu_smt_control to CPU_SMT_NOT_SUPPORTED in
start_kernel()->check_bugs()->cpu_smt_check_topology().
So far, we don't have such platforms.

Questions on how to fix cpuinfo_x86.x86_max_cores
-------------------------------------------------

Fixing x86_max_cores is more complex. Current kernel uses below logic to
get x86_max_cores
x86_max_cores = cpus_in_a_package / smp_num_siblings
But
1. There is a known bug in CPUID.1F handling code. Thus cpus_in_a_package
can be bogus. To fix it, I will add CPUID.1F Module level support.
2. x86_max_cores is set and used in an inconsistent way in current kernel.
In short, smp_num_siblings/x86_max_cores
2.1 represents the number of maximum *addressable* threads/cores in a
core/package when retrieved via CPUID 1 and 4 on old platforms.
CPUID.1 EBX 23:16 "Maximum number of addressable IDs for logical
processors in this physical package".
CPUID.4 EAX 31:26 "Maximum number of addressable IDs for processor
cores in the physical package".
2.2 represents the number of maximum *possible* threads/cores in a
core/package, when retrieved via CPUID.B/1F on non-Hybrid platforms.
CPUID.B/1F EBX 15:0 "Number of logical processors at this level type.
The number reflects configuration as shipped by Intel".
For example, in calc_llc_size_per_core()
do_div(llc_size, c->x86_max_cores);
x86_max_cores is used as the max *possible* cores in a package.
2.3 is used in a conflict way on other vendors like AMD by checking the
code. I need help on confirming the proper behavior for AMD.
For example, in amd_get_topology(),
c->x86_coreid_bits = get_count_order(c->x86_max_cores);
x86_max_cores is used as the max *addressable* cores in a package.
in get_nbc_for_node(),
cores_per_node = (c->x86_max_cores * smp_num_siblings) / amd_get_nodes_per_socket();
x86_max_cores is used as the max *possible* cores in a package.
3. using
x86_max_cores = cpus_in_a_package / smp_num_siblings
to get the number of maximum *possible* cores in a package during boot
cpu bringup is not applicable on platforms with asymmetric cores.
Because, for a given number of threads, we don't know how many of the
threads are the master thread or the only thread of a core, and how
many of them are SMT siblings.
For example, on a platform with 6 Pcore and 8 Ecore, there are 20
threads. But setting x86_max_cores to 10 is apparently wrong.

Given the above situation, I have below question and any input is really
appreciated.

Is this inconsistency a problem or not?

If we want to keep it consistent, it has to represent the max
*addressable* cores in a package. Because max *possible* cores in a
package is not available on Intel Hybrid platform.

I have proposed a patch for this purpose, but gave up because I didn't
address the above scenarios that uses x86_max_core differently.
https://lore.kernel.org/all/[email protected]/

Does this break the expectation on AMD platforms using CPUID.B?
If yes, what should we do?

BTW, this also fixes the potential issue that kernel runs as SMT disabled
when cpu0 is a Ecore cpu.

The downside is that, on a platform that does not have SMT, like
AlderLake-N, which has Ecore CPUs only, kernel will still run as
SMT-capable with this solution. Because SMT ID still takes 1 bit in
APIC-ID and smp_num_siblings is set to 2. But given that kernel has
already optimized for non-SMT case at runtime in places like scheduler,
by checking sibling maps, so my current understanding is that the overhead
won't be big.

thanks,
rui


2023-02-17 16:37:49

by Zhang, Rui

[permalink] [raw]
Subject: [PATCH 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

The SMT siblings value returned by CPUID.1F SMT level EBX differs
among CPUs on Intel Hybrid platforms like AlderLake and MeteorLake.
It returns 2 for Pcore CPUs which have SMT siblings and returns 1 for
Ecore CPUs which do not have SMT siblings.

Today, the CPU boot code sets the global variable smp_num_siblings when
every CPU thread is brought up. The last thread to boot will overwrite
it with the number of siblings of *that* thread. That last thread to
boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
is an Ecore, smp_num_siblings == 1.

smp_num_siblings describes if the *system* supports SMT. It should
specify the maximum number of SMT threads among all cores.

On AlderLake-P/S platforms, it does not cause any functional issues so
far.
But on MeteorLake-P platform, when probing an Ecore CPU,
a). smp_num_siblings varies like AlderLake and it is set to 1 for Ecore.
b). x86_max_cores is totally broken and it is set to 1 for the boot cpu.
Altogether, these two issues make the system being treated as an UP
system in set_cpu_sibling_map() when probing Ecore CPUs, and the Ecore
CPUs are not updated in any cpu sibling maps erroneously.

Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21

And this also breaks userspace tools like lscpu
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1

To fix the first issue, ensure that smp_num_siblings represents the
system-wide maximum number of siblings by always increasing its value.
Never allow it to decrease.

Note that this fix is sufficient to make set_cpu_sibling_map() work
correctly. And how to fix the bogus cpuinfo_x86.x86_max_cores will be
addressed separately.

CC: [email protected]
Suggested-by: Len Brown <[email protected]>
Signed-off-by: Zhang Rui <[email protected]>
---
arch/x86/kernel/cpu/topology.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 5e868b62a7c4..0270925fe013 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -79,7 +79,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c)
* initial apic id, which also represents 32-bit extended x2apic id.
*/
c->initial_apicid = edx;
- smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
+ smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
#endif
return 0;
}
@@ -109,7 +109,8 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
*/
cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
c->initial_apicid = edx;
- core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
+ core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
+ smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
die_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
pkg_mask_width = die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
--
2.25.1


2023-02-17 18:03:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

On 2/17/23 08:37, Zhang Rui wrote:
> The SMT siblings value returned by CPUID.1F SMT level EBX differs
> among CPUs on Intel Hybrid platforms like AlderLake and MeteorLake.
> It returns 2 for Pcore CPUs which have SMT siblings and returns 1 for
> Ecore CPUs which do not have SMT siblings.
>
> Today, the CPU boot code sets the global variable smp_num_siblings when
> every CPU thread is brought up. The last thread to boot will overwrite
> it with the number of siblings of *that* thread. That last thread to
> boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
> is an Ecore, smp_num_siblings == 1.
>
> smp_num_siblings describes if the *system* supports SMT. It should
> specify the maximum number of SMT threads among all cores.

I was with you until here, but I'm having a hard time parsing this:

> On AlderLake-P/S platforms, it does not cause any functional issues so
> far.
> But on MeteorLake-P platform, when probing an Ecore CPU,
> a). smp_num_siblings varies like AlderLake and it is set to 1 for Ecore.
> b). x86_max_cores is totally broken and it is set to 1 for the boot cpu.
> Altogether, these two issues make the system being treated as an UP
> system in set_cpu_sibling_map() when probing Ecore CPUs, and the Ecore
> CPUs are not updated in any cpu sibling maps erroneously.

Let's try and focus this changelog on the problem at hand which is a
broken smp_num_siblings on MeterorLake. Right?

> Below shows part of the CPU topology information before and after the
> fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
> ...
> -/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
> -/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
> +/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
> +/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
> ...
> -/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
> -/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
> +/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
> +/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
>
> And this also breaks userspace tools like lscpu
> -Core(s) per socket: 1
> -Socket(s): 11
> +Core(s) per socket: 16
> +Socket(s): 1

Heh, yeah, 11 sockets is a tiny bug.

> To fix the first issue, ensure that smp_num_siblings represents the
> system-wide maximum number of siblings by always increasing its value.
> Never allow it to decrease.
>
> Note that this fix is sufficient to make set_cpu_sibling_map() work
> correctly. And how to fix the bogus cpuinfo_x86.x86_max_cores will be
> addressed separately.

Having this note here is probably OK. But, I'm not sure even mentioning
x86_max_cores is worth it. Doesn't it just add confusion?

> diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
> index 5e868b62a7c4..0270925fe013 100644
> --- a/arch/x86/kernel/cpu/topology.c
> +++ b/arch/x86/kernel/cpu/topology.c
> @@ -79,7 +79,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c)
> * initial apic id, which also represents 32-bit extended x2apic id.
> */
> c->initial_apicid = edx;
> - smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
> + smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
> #endif
> return 0;
> }
> @@ -109,7 +109,8 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
> */
> cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
> c->initial_apicid = edx;
> - core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
> + core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> + smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
> core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
> die_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> pkg_mask_width = die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);

The fix seems simple enough.

2023-02-18 16:11:49

by Zhang, Rui

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

Hi, Dave,

Thanks for reviewing this.

On Fri, 2023-02-17 at 10:03 -0800, Dave Hansen wrote:
> On 2/17/23 08:37, Zhang Rui wrote:
> > The SMT siblings value returned by CPUID.1F SMT level EBX differs
> > among CPUs on Intel Hybrid platforms like AlderLake and MeteorLake.
> > It returns 2 for Pcore CPUs which have SMT siblings and returns 1
> > for
> > Ecore CPUs which do not have SMT siblings.
> >
> > Today, the CPU boot code sets the global variable smp_num_siblings
> > when
> > every CPU thread is brought up. The last thread to boot will
> > overwrite
> > it with the number of siblings of *that* thread. That last thread
> > to
> > boot will "win". If the thread is a Pcore, smp_num_siblings ==
> > 2. If it
> > is an Ecore, smp_num_siblings == 1.
> >
> > smp_num_siblings describes if the *system* supports SMT. It should
> > specify the maximum number of SMT threads among all cores.
>
> I was with you until here, but I'm having a hard time parsing this:
>
> > On AlderLake-P/S platforms, it does not cause any functional issues
> > so
> > far.
> > But on MeteorLake-P platform, when probing an Ecore CPU,
> > a). smp_num_siblings varies like AlderLake and it is set to 1 for
> > Ecore.
> > b). x86_max_cores is totally broken and it is set to 1 for the boot
> > cpu.
> > Altogether, these two issues make the system being treated as an UP
> > system in set_cpu_sibling_map() when probing Ecore CPUs, and the
> > Ecore
> > CPUs are not updated in any cpu sibling maps erroneously.
>
> Let's try and focus this changelog on the problem at hand which is a
> broken smp_num_siblings on MeterorLake. Right?

yes. I totally agree with this.

But when showing the (cpu topology info and lscpu) problem below, I
want to deliver a clear message that
1. there are two bugs and *both* of them are required in order to
trigger the problem
2. this patch just fixes one of the bugs

Do you mean that I don't need to mention the x86_max_cores issue here?

>
> > Below shows part of the CPU topology information before and after
> > the
> > fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
> > ...
> > -/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
> > -/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
> > +/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
> > +/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
> > ...
> > -/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
> > -/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
> > +/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
> > +/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
> >
> > And this also breaks userspace tools like lscpu
> > -Core(s) per socket: 1
> > -Socket(s): 11
> > +Core(s) per socket: 16
> > +Socket(s): 1
>
> Heh, yeah, 11 sockets is a tiny bug.
>
> > To fix the first issue, ensure that smp_num_siblings represents the
> > system-wide maximum number of siblings by always increasing its
> > value.
> > Never allow it to decrease.
> >
> > Note that this fix is sufficient to make set_cpu_sibling_map() work
> > correctly. And how to fix the bogus cpuinfo_x86.x86_max_cores will
> > be
> > addressed separately.
>
> Having this note here is probably OK. But, I'm not sure even
> mentioning
> x86_max_cores is worth it. Doesn't it just add confusion?

Even without the CPU topology problem we see, changing smp_num_siblings
from a larger value to a smaller value is wrong. In that sense, I can
remove this from the changelog.

thanks,
rui

2023-02-18 17:20:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

On 2/18/23 08:11, Zhang, Rui wrote:
> yes. I totally agree with this.
>
> But when showing the (cpu topology info and lscpu) problem below, I
> want to deliver a clear message that
> 1. there are two bugs and *both* of them are required in order to
> trigger the problem
> 2. this patch just fixes one of the bugs

That's fine, but please deliver that message in the cover letter, not
the patch changelog.

> Do you mean that I don't need to mention the x86_max_cores issue here?

Yes.