2023-02-20 03:29:16

by Zhang, Rui

[permalink] [raw]
Subject: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

Hi, All,

There are two kernel issues observed on Intel Hybrid platform named
MeteorLake. And these two issues altogether bring broken CPU topology.

This patch series aims to
1. provide a solution for the first issue, as an urgent fix and -stable
candidate, so that this problem is not exposed to end users.
2. get feedback on how to fix the second issue.

Any comments on this are really appreciated.

Problem details on MeteorLake
-----------------------------

On Intel Hybrid platforms like AlderLake-P/S, both smp_num_siblings
and cpuinfo_x86.x86_max_cores are broken like below

[ 0.201005] detect_extended_topology: CPU APICID 0x0, smp_num_siblings 2, x86_max_cores 10
[ 0.201117] start_kernel->check_bugs->cpu_smt_check_topology: smp_num_siblings 2
...
[ 0.010146] detect_extended_topology: CPU APICID 0x8, smp_num_siblings 2, x86_max_cores 10
...
[ 0.010146] detect_extended_topology: CPU APICID 0x39, smp_num_siblings 2, x86_max_cores 10
[ 0.010146] detect_extended_topology: CPU APICID 0x48, smp_num_siblings 1, x86_max_cores 20
...
[ 0.010146] detect_extended_topology: CPU APICID 0x4e, smp_num_siblings 1, x86_max_cores 20
[ 2.583800] sched_set_itmt_core_prio: smp_num_siblings 1

This is because the SMT siblings value returned by CPUID.1F SMT level
EBX differs among CPUs. It returns 2 for Pcore CPUs which have HT
sibling and returns 1 for Ecore CPUs which do not have SMT sibling.

This brings several potential issues:
1. some kernel configuration like cpu_smt_control, as set in
start_kernel()->check_bugs()->cpu_smt_check_topology(), depends on
smp_num_siblings set by boot cpu.
It is pure luck that all the current hybrid platforms use Pcore as cpu0
and hide this problem.
2. some per CPU data like cpuinfo_x86.x86_max_cores that depends on
smp_num_siblings becomes inconsistent and bogus.
3. the final smp_num_siblings value after boot depends on the last CPU
enumerated, which could either be Pcore or Ecore CPU.

Previously, there is no functional issue observed on AlderLake-P/S.

However, on MeteorLake, this becomes worse.
a). smp_num_siblings varies like AlderLake and it is set to 1 for Ecore.
b). x86_max_cores is totally broken and it is set to 1 for the boot cpu.

Altogether, these two issues make the system being treated as an UP
system in set_cpu_sibling_map() when probing Ecore CPUs, and the Ecore
CPUs are not updated in any cpu sibling maps erroneously.

Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21

And this also breaks userspace tools like lscpu
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1

Solution for fix smp_num_sibling
--------------------------------

Patch 1/1 ensures that smp_num_siblings represents the system-wide maximum
number of siblings by always increasing its value. Never allow it to
decrease.

It is sufficient to make the problem go away.

However, there is a pontenial problem left. That is, when boot CPU is an
Ecore CPU, smp_num_sibling is set to 1 during BSP probe, kernel disables
SMT support by setting cpu_smt_control to CPU_SMT_NOT_SUPPORTED in
start_kernel()->check_bugs()->cpu_smt_check_topology().
So far, we don't have such platforms.

Questions on how to fix cpuinfo_x86.x86_max_cores
-------------------------------------------------

Fixing x86_max_cores is more complex. Current kernel uses below logic to
get x86_max_cores
x86_max_cores = cpus_in_a_package / smp_num_siblings
But
1. There is a known bug in CPUID.1F handling code. Thus cpus_in_a_package
can be bogus. To fix it, I will add CPUID.1F Module level support.
2. x86_max_cores is set and used in an inconsistent way in current kernel.
In short, smp_num_siblings/x86_max_cores
2.1 represents the number of maximum *addressable* threads/cores in a
core/package when retrieved via CPUID 1 and 4 on old platforms.
CPUID.1 EBX 23:16 "Maximum number of addressable IDs for logical
processors in this physical package".
CPUID.4 EAX 31:26 "Maximum number of addressable IDs for processor
cores in the physical package".
2.2 represents the number of maximum *possible* threads/cores in a
core/package, when retrieved via CPUID.B/1F on non-Hybrid platforms.
CPUID.B/1F EBX 15:0 "Number of logical processors at this level type.
The number reflects configuration as shipped by Intel".
For example, in calc_llc_size_per_core()
do_div(llc_size, c->x86_max_cores);
x86_max_cores is used as the max *possible* cores in a package.
2.3 is used in a conflict way on other vendors like AMD by checking the
code. I need help on confirming the proper behavior for AMD.
For example, in amd_get_topology(),
c->x86_coreid_bits = get_count_order(c->x86_max_cores);
x86_max_cores is used as the max *addressable* cores in a package.
in get_nbc_for_node(),
cores_per_node = (c->x86_max_cores * smp_num_siblings) / amd_get_nodes_per_socket();
x86_max_cores is used as the max *possible* cores in a package.
3. using
x86_max_cores = cpus_in_a_package / smp_num_siblings
to get the number of maximum *possible* cores in a package during boot
cpu bringup is not applicable on platforms with asymmetric cores.
Because, for a given number of threads, we don't know how many of the
threads are the master thread or the only thread of a core, and how
many of them are SMT siblings.
For example, on a platform with 6 Pcore and 8 Ecore, there are 20
threads. But setting x86_max_cores to 10 is apparently wrong.

Given the above situation, I have below question and any input is really
appreciated.

Is this inconsistency a problem or not?

If we want to keep it consistent, it has to represent the max
*addressable* cores in a package. Because max *possible* cores in a
package is not available on Intel Hybrid platform.

I have proposed a patch for this purpose, but gave up because I didn't
address the above scenarios that uses x86_max_core differently.
https://lore.kernel.org/all/[email protected]/

Does this break the expectation on AMD platforms using CPUID.B?
If yes, what should we do?

BTW, this also fixes the potential issue that kernel runs as SMT disabled
when cpu0 is a Ecore cpu.

The downside is that, on a platform that does not have SMT, like
AlderLake-N, which has Ecore CPUs only, kernel will still run as
SMT-capable with this solution. Because SMT ID still takes 1 bit in
APIC-ID and smp_num_siblings is set to 2. But given that kernel has
already optimized for non-SMT case at runtime in places like scheduler,
by checking sibling maps, so my current understanding is that the overhead
won't be big.

thanks,
rui

---
Changes in V2:
- improve changelog to more focus on the smp_num_siblings issue.


2023-02-20 03:29:24

by Zhang, Rui

[permalink] [raw]
Subject: [PATCH V2 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

The SMT siblings value returned by CPUID.1F SMT level EBX differs
among CPUs on Intel Hybrid platforms like AlderLake and MeteorLake.
It returns 2 for Pcore CPUs which have SMT siblings and returns 1 for
Ecore CPUs which do not have SMT siblings.

Today, the CPU boot code sets the global variable smp_num_siblings when
every CPU thread is brought up. The last thread to boot will overwrite
it with the number of siblings of *that* thread. That last thread to
boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
is an Ecore, smp_num_siblings == 1.

smp_num_siblings describes if the *system* supports SMT. It should
specify the maximum number of SMT threads among all cores.

Ensure that smp_num_siblings represents the system-wide maximum number
of siblings by always increasing its value. Never allow it to decrease.

On MeteorLake-P platform, this fixes a problem that the Ecore CPUs are
not updated in any cpu sibling map because the system is treated as an
UP system when probing Ecore CPUs.

Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21

And this also breaks userspace tools like lscpu
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1

CC: [email protected]
Suggested-by: Len Brown <[email protected]>
Signed-off-by: Zhang Rui <[email protected]>
---
arch/x86/kernel/cpu/topology.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 5e868b62a7c4..0270925fe013 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -79,7 +79,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c)
* initial apic id, which also represents 32-bit extended x2apic id.
*/
c->initial_apicid = edx;
- smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
+ smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
#endif
return 0;
}
@@ -109,7 +109,8 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
*/
cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
c->initial_apicid = edx;
- core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
+ core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
+ smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
die_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
pkg_mask_width = die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
--
2.25.1


2023-02-20 10:37:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores


On Mon, Feb 20, 2023 at 11:28:55AM +0800, Zhang Rui wrote:

> Solution for fix smp_num_sibling
> --------------------------------
>
> Patch 1/1 ensures that smp_num_siblings represents the system-wide maximum
> number of siblings by always increasing its value. Never allow it to
> decrease.
>
> It is sufficient to make the problem go away.
>
> However, there is a pontenial problem left. That is, when boot CPU is an
> Ecore CPU, smp_num_sibling is set to 1 during BSP probe, kernel disables
> SMT support by setting cpu_smt_control to CPU_SMT_NOT_SUPPORTED in
> start_kernel()->check_bugs()->cpu_smt_check_topology().
> So far, we don't have such platforms.

This is the much recurring problem of the boot CPU not having access to
the system topology.

Instead of fixing that, Intel seems to work at making it worse. At some
point, we're just going to have to give up and move to DT or something
:/

Please communicate (again), that only knowing the topology/setup of the
system once all the CPUs are online is crap. Once you start bringing up
APs some things are fixed -- if we guessed wrong, we're hosed.

Specific examples of this that we've ran into in the past are:

- does the machine have SMT
- is the machine Hybrid
(and if so, how many different core types will be have)

Specifically, things like determining the number of GP event counters on
a PMU sometimes depends on HT being active, but we want the PMU
initialized really early because it also serves watchdog and you want
splats when something goes side-ways.

The end result is that we have to make things complicated and
dynamically re-adjust when system resources come online.

So far we've managed -- just, but *PLEASE*, dont make it worse!!!


2023-02-20 11:09:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, Feb 20, 2023 at 11:28:55AM +0800, Zhang Rui wrote:

> Questions on how to fix cpuinfo_x86.x86_max_cores
> -------------------------------------------------
>
> Fixing x86_max_cores is more complex. Current kernel uses below logic to
> get x86_max_cores
> x86_max_cores = cpus_in_a_package / smp_num_siblings
> But
> 1. There is a known bug in CPUID.1F handling code. Thus cpus_in_a_package
> can be bogus. To fix it, I will add CPUID.1F Module level support.
> 2. x86_max_cores is set and used in an inconsistent way in current kernel.
> In short, smp_num_siblings/x86_max_cores
> 2.1 represents the number of maximum *addressable* threads/cores in a
> core/package when retrieved via CPUID 1 and 4 on old platforms.
> CPUID.1 EBX 23:16 "Maximum number of addressable IDs for logical
> processors in this physical package".
> CPUID.4 EAX 31:26 "Maximum number of addressable IDs for processor
> cores in the physical package".
> 2.2 represents the number of maximum *possible* threads/cores in a
> core/package, when retrieved via CPUID.B/1F on non-Hybrid platforms.
> CPUID.B/1F EBX 15:0 "Number of logical processors at this level type.
> The number reflects configuration as shipped by Intel".
> For example, in calc_llc_size_per_core()
> do_div(llc_size, c->x86_max_cores);
> x86_max_cores is used as the max *possible* cores in a package.
> 2.3 is used in a conflict way on other vendors like AMD by checking the
> code. I need help on confirming the proper behavior for AMD.
> For example, in amd_get_topology(),
> c->x86_coreid_bits = get_count_order(c->x86_max_cores);
> x86_max_cores is used as the max *addressable* cores in a package.
> in get_nbc_for_node(),
> cores_per_node = (c->x86_max_cores * smp_num_siblings) / amd_get_nodes_per_socket();
> x86_max_cores is used as the max *possible* cores in a package.
> 3. using
> x86_max_cores = cpus_in_a_package / smp_num_siblings
> to get the number of maximum *possible* cores in a package during boot
> cpu bringup is not applicable on platforms with asymmetric cores.
> Because, for a given number of threads, we don't know how many of the
> threads are the master thread or the only thread of a core, and how
> many of them are SMT siblings.
> For example, on a platform with 6 Pcore and 8 Ecore, there are 20
> threads. But setting x86_max_cores to 10 is apparently wrong.
>
> Given the above situation, I have below question and any input is really
> appreciated.
>
> Is this inconsistency a problem or not?

IIRC x86_max_cores in specific is only ever used in arch specific code,
the pmu uncore drivers and things like that (grep shows MCE).

Also, perhaps you want to look at calculate_max_logical_packages(). That
has a comment about there not being heterogeneous systems :/

Anyway, the reason I went and had a look there, is because I remember
Thomas and me spend entirely too much time to try and figure out means
to size an array for number of pacakges at boot time and getting it
wrong too many times to recount.

If only there was a sane way to tell these things without actually
bringing everything online first :-(

2023-02-20 11:23:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

On Mon, Feb 20, 2023 at 11:28:56AM +0800, Zhang Rui wrote:
> The SMT siblings value returned by CPUID.1F SMT level EBX differs
> among CPUs on Intel Hybrid platforms like AlderLake and MeteorLake.
> It returns 2 for Pcore CPUs which have SMT siblings and returns 1 for
> Ecore CPUs which do not have SMT siblings.
>
> Today, the CPU boot code sets the global variable smp_num_siblings when
> every CPU thread is brought up. The last thread to boot will overwrite
> it with the number of siblings of *that* thread. That last thread to
> boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
> is an Ecore, smp_num_siblings == 1.
>
> smp_num_siblings describes if the *system* supports SMT. It should
> specify the maximum number of SMT threads among all cores.
>
> Ensure that smp_num_siblings represents the system-wide maximum number
> of siblings by always increasing its value. Never allow it to decrease.
>
> On MeteorLake-P platform, this fixes a problem that the Ecore CPUs are
> not updated in any cpu sibling map because the system is treated as an
> UP system when probing Ecore CPUs.
>
> Below shows part of the CPU topology information before and after the
> fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
> ...
> -/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
> -/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
> +/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
> +/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
> ...
> -/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
> -/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
> +/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
> +/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
>
> And this also breaks userspace tools like lscpu
> -Core(s) per socket: 1
> -Socket(s): 11
> +Core(s) per socket: 16
> +Socket(s): 1
>
> CC: [email protected]
> Suggested-by: Len Brown <[email protected]>
> Signed-off-by: Zhang Rui <[email protected]>
> ---
> arch/x86/kernel/cpu/topology.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
> index 5e868b62a7c4..0270925fe013 100644
> --- a/arch/x86/kernel/cpu/topology.c
> +++ b/arch/x86/kernel/cpu/topology.c
> @@ -79,7 +79,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c)
> * initial apic id, which also represents 32-bit extended x2apic id.
> */
> c->initial_apicid = edx;
> - smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
> + smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
> #endif
> return 0;
> }
> @@ -109,7 +109,8 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
> */
> cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
> c->initial_apicid = edx;
> - core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
> + core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> + smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));
> core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
> die_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> pkg_mask_width = die_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);

Seems ok, but perhaps you can stick an 'int' cast in
LEVEL_MAX_SIGLINGS instead and write a simpler max() -- and/or convert
smt_num_siblings to unsigned int.

Regardless,

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2023-02-20 14:33:24

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, 2023-02-20 at 12:08 +0100, Peter Zijlstra wrote:
> On Mon, Feb 20, 2023 at 11:28:55AM +0800, Zhang Rui wrote:
>
> > Questions on how to fix cpuinfo_x86.x86_max_cores
> > -------------------------------------------------
> >
> > Fixing x86_max_cores is more complex. Current kernel uses below
> > logic to
> > get x86_max_cores
> > x86_max_cores = cpus_in_a_package / smp_num_siblings
> > But
> > 1. There is a known bug in CPUID.1F handling code. Thus
> > cpus_in_a_package
> > can be bogus. To fix it, I will add CPUID.1F Module level
> > support.
> > 2. x86_max_cores is set and used in an inconsistent way in current
> > kernel.
> > In short, smp_num_siblings/x86_max_cores
> > 2.1 represents the number of maximum *addressable* threads/cores
> > in a
> > core/package when retrieved via CPUID 1 and 4 on old
> > platforms.
> > CPUID.1 EBX 23:16 "Maximum number of addressable IDs for
> > logical
> > processors in this physical package".
> > CPUID.4 EAX 31:26 "Maximum number of addressable IDs for
> > processor
> > cores in the physical package".
> > 2.2 represents the number of maximum *possible* threads/cores in
> > a
> > core/package, when retrieved via CPUID.B/1F on non-Hybrid
> > platforms.
> > CPUID.B/1F EBX 15:0 "Number of logical processors at this
> > level type.
> > The number reflects configuration as shipped by Intel".
> > For example, in calc_llc_size_per_core()
> > do_div(llc_size, c->x86_max_cores);
> > x86_max_cores is used as the max *possible* cores in a
> > package.
> > 2.3 is used in a conflict way on other vendors like AMD by
> > checking the
> > code. I need help on confirming the proper behavior for AMD.
> > For example, in amd_get_topology(),
> > c->x86_coreid_bits = get_count_order(c->x86_max_cores);
> > x86_max_cores is used as the max *addressable* cores in a
> > package.
> > in get_nbc_for_node(),
> > cores_per_node = (c->x86_max_cores * smp_num_siblings) /
> > amd_get_nodes_per_socket();
> > x86_max_cores is used as the max *possible* cores in a
> > package.
> > 3. using
> > x86_max_cores = cpus_in_a_package / smp_num_siblings
> > to get the number of maximum *possible* cores in a package
> > during boot
> > cpu bringup is not applicable on platforms with asymmetric
> > cores.
> > Because, for a given number of threads, we don't know how many
> > of the
> > threads are the master thread or the only thread of a core, and
> > how
> > many of them are SMT siblings.
> > For example, on a platform with 6 Pcore and 8 Ecore, there are
> > 20
> > threads. But setting x86_max_cores to 10 is apparently wrong.
> >
> > Given the above situation, I have below question and any input is
> > really
> > appreciated.
> >
> > Is this inconsistency a problem or not?
>
> IIRC x86_max_cores in specific is only ever used in arch specific
> code,
> the pmu uncore drivers and things like that (grep shows MCE).

Do you mean that, as long as the usage of x86_max_cores matches its
definition for a given vendor/generation, the definition of
x86_max_cores can be inconsistent?

I was thinking how to make it consistent.
For Intel platform, defining x86_max_cores to max-*addressable*-cores-
in-a-package is not a problem in most cases, except the one below
calc_llc_size_per_core() in
arch/x86/kernel/cpu/microcode/intel.c
which needs the number of *possible* cores to get per core LLC size.
But I think we probably can improve this by replacing x86_max_cores
with boot_cpu_data.booted_cores? Because doing microcode update
requires all the cores to be online.

I don't know the answer for other X86 vendors.

>
> Also, perhaps you want to look at calculate_max_logical_packages().
> That
> has a comment about there not being heterogeneous systems :/

yeah, I noticed this previously.

ncpus = cpu_data(0).booted_cores * topology_max_smt_threads();
__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);

The DIV_ROUND_UP() makes it to work on systems with current asymmetric
core systems. But
1. if a core can support more than 2 HT siblings, this can break if
there are multi symmetric packages.
2. if the system has asymmetric packages, this can break.
So far we don't have such platforms.
3. it can also be broken when using boot option 'maxcpus' as booted_cor
es changes.

But ironically, we don't have a better way to get __max_logical_package
s.


> Anyway, the reason I went and had a look there, is because I remember
> Thomas and me spend entirely too much time to try and figure out
> means
> to size an array for number of pacakges at boot time and getting it
> wrong too many times to recount.
>
> If only there was a sane way to tell these things without actually
> bringing everything online first :-(

I thought of improving this by parsing all the valid APIC-IDs in MADT
during BSP bootup, and get such information by decoding the APIC-IDs
using the APIC-ID layout information retrieved from BSP. But this is
likely to be a fertile new source of bugs as Dave concerned.

thanks,
rui

2023-02-20 14:41:03

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, 2023-02-20 at 11:36 +0100, Peter Zijlstra wrote:
> On Mon, Feb 20, 2023 at 11:28:55AM +0800, Zhang Rui wrote:
>
> > Solution for fix smp_num_sibling
> > --------------------------------
> >
> > Patch 1/1 ensures that smp_num_siblings represents the system-wide
> > maximum
> > number of siblings by always increasing its value. Never allow it
> > to
> > decrease.
> >
> > It is sufficient to make the problem go away.
> >
> > However, there is a pontenial problem left. That is, when boot CPU
> > is an
> > Ecore CPU, smp_num_sibling is set to 1 during BSP probe, kernel
> > disables
> > SMT support by setting cpu_smt_control to CPU_SMT_NOT_SUPPORTED in
> > start_kernel()->check_bugs()->cpu_smt_check_topology().
> > So far, we don't have such platforms.
>
> This is the much recurring problem of


> the boot CPU not having access to
> the system topology.
>
Exactly!


> Instead of fixing that, Intel seems to work at making it worse. At
> some
> point, we're just going to have to give up and move to DT or
> something
> :/
>
> Please communicate (again), that only knowing the topology/setup of
> the
> system once all the CPUs are online is crap. Once you start bringing
> up
> APs some things are fixed -- if we guessed wrong, we're hosed.
>
> Specific examples of this that we've ran into in the past are:
>
> - does the machine have SMT
> - is the machine Hybrid
> (and if so, how many different core types will be have)
>

It is "good" for me to know this is not the first time we run into such
issues. Less effort for me to sell the problem.
I will bring your suggestions back and see if we can improve this
instead of complicating software.

thanks,
rui

2023-02-20 19:08:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, Feb 20, 2023 at 02:33:09PM +0000, Zhang, Rui wrote:

> > If only there was a sane way to tell these things without actually
> > bringing everything online first :-(
>
> I thought of improving this by parsing all the valid APIC-IDs in MADT
> during BSP bootup, and get such information by decoding the APIC-IDs
> using the APIC-ID layout information retrieved from BSP. But this is
> likely to be a fertile new source of bugs as Dave concerned.

Yes and no, aren't we using those APIC-IDs to send INIT to the APs?
Consequently, those APIC-IDs must be good, otherwise SMP bringup will go
side-ways -- and it typically doesn't.

We could add an assertion in SMP bringup that the APIC-ID from MADT
matches the state as read from CPUID ? If that matches (it really
should) then using the MADT table IDs early should work and at least
give us a litle bit more data.

APIC-ID is of no use vs hybrid though, and MADT doesn't include any
either, so that's all still buggered.

Luckily it looks like MADT is fairly extensible, ideally we add an field
to entry-type-0 to help with the hybrid mess.

TL;DR, I really think we should try this.

2023-02-20 22:49:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, Feb 20 2023 at 14:33, Rui Zhang wrote:
> On Mon, 2023-02-20 at 12:08 +0100, Peter Zijlstra wrote:
>> Also, perhaps you want to look at calculate_max_logical_packages().
>> That has a comment about there not being heterogeneous systems :/
>
> yeah, I noticed this previously.
>
> ncpus = cpu_data(0).booted_cores * topology_max_smt_threads();
> __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
>
> The DIV_ROUND_UP() makes it to work on systems with current asymmetric
> core systems. But
> 1. if a core can support more than 2 HT siblings, this can break if
> there are multi symmetric packages.
> 2. if the system has asymmetric packages, this can break.
> So far we don't have such platforms.
> 3. it can also be broken when using boot option 'maxcpus' as booted_cor
> es changes.
>
> But ironically, we don't have a better way to get __max_logical_packages.

Sadly enough this was communicated to Intel hard/firmware people many
years ago.

>> Anyway, the reason I went and had a look there, is because I remember
>> Thomas and me spend entirely too much time to try and figure out
>> means
>> to size an array for number of pacakges at boot time and getting it
>> wrong too many times to recount.
>>
>> If only there was a sane way to tell these things without actually
>> bringing everything online first :-(
>
> I thought of improving this by parsing all the valid APIC-IDs in MADT
> during BSP bootup, and get such information by decoding the APIC-IDs
> using the APIC-ID layout information retrieved from BSP. But this is
> likely to be a fertile new source of bugs as Dave concerned.

The APIC-IDs are only usefull if there is an architected scheme how they
are assigned. Is there such a thing?

The SDM is not helpful at all, but according to the ACPI spec there
exists:

Processor Properties Topology Table (PPTT)

That table actually provides pretty much what we are looking for, but
that table is optional and there is actually code for that in the
kernel, which is ARM64 specific.

So while this would be useful it's not usable on x86 because that would
make too much sense, right?

Thanks,

tglx

2023-02-20 22:52:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, Feb 20 2023 at 20:06, Peter Zijlstra wrote:
> On Mon, Feb 20, 2023 at 02:33:09PM +0000, Zhang, Rui wrote:
> APIC-ID is of no use vs hybrid though, and MADT doesn't include any
> either, so that's all still buggered.
>
> Luckily it looks like MADT is fairly extensible, ideally we add an field
> to entry-type-0 to help with the hybrid mess.

I don't think that's the right way. This should use PPTT because that's
a proper hierarchy model. Extending MADT is just another duct tape
solution.

And it actually does not matter which way we go because none of the
existing systems have any of that.

Thanks,

tglx

2023-02-21 08:02:10

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

Hi, Peter,

> We could add an assertion in SMP bringup that the APIC-ID from MADT
> matches the state as read from CPUID ? If that matches (it really
> should) then using the MADT table IDs early should work and at least
> give us a litle bit more data.
>
> APIC-ID is of no use vs hybrid though,

Do you have any pointer to the hybrid issue you're referring to here?
In which case we need to know how many type of cores?

thanks,
rui

2023-02-21 08:26:22

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

> > >
> > I thought of improving this by parsing all the valid APIC-IDs in
> > MADT
> > during BSP bootup, and get such information by decoding the APIC-
> > IDs
> > using the APIC-ID layout information retrieved from BSP. But this
> > is
> > likely to be a fertile new source of bugs as Dave concerned.
>
> The APIC-IDs are only usefull if there is an architected scheme how
> they
> are assigned. Is there such a thing?

I don't know.
Do you think it helps if the APIC-ID layout are defined to be identical
across all CPUs?
In this case, BSP knows the APIC-ID layout of itself and this can apply
to the other APIC-IDs.

>
> The SDM is not helpful at all, but according to the ACPI spec there
> exists:
>
> Processor Properties Topology Table (PPTT)
>
> That table actually provides pretty much what we are looking for, but
> that table is optional and there is actually code for that in the
> kernel, which is ARM64 specific.
>
> So while this would be useful it's not usable on x86 because that
> would
> make too much sense, right?

Thanks for pointing to this.

I got a brief view of PPTT. So far, my understanding is that PPTT
provides
1. the cpu Hierarchy, but package level only. There may be multiple
levels but it does not tell us if it is a Die, Module or Core.
2. the cache Hierarchy

I need to find one real PPTT implementation to see how it works.

thanks,
rui

2023-02-21 08:38:51

by Zhang, Rui

[permalink] [raw]
Subject: Re: [PATCH V2 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

Hi, Peter,

> > ---
> > arch/x86/kernel/cpu/topology.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/topology.c
> > b/arch/x86/kernel/cpu/topology.c
> > index 5e868b62a7c4..0270925fe013 100644
> > --- a/arch/x86/kernel/cpu/topology.c
> > +++ b/arch/x86/kernel/cpu/topology.c
> > @@ -79,7 +79,7 @@ int detect_extended_topology_early(struct
> > cpuinfo_x86 *c)
> > * initial apic id, which also represents 32-bit extended
> > x2apic id.
> > */
> > c->initial_apicid = edx;
> > - smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
> > + smp_num_siblings = max_t(int, smp_num_siblings,
> > LEVEL_MAX_SIBLINGS(ebx));
> > #endif
> > return 0;
> > }
> > @@ -109,7 +109,8 @@ int detect_extended_topology(struct cpuinfo_x86
> > *c)
> > */
> > cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
> > c->initial_apicid = edx;
> > - core_level_siblings = smp_num_siblings =
> > LEVEL_MAX_SIBLINGS(ebx);
> > + core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> > + smp_num_siblings = max_t(int, smp_num_siblings,
> > LEVEL_MAX_SIBLINGS(ebx));
> > core_plus_mask_width = ht_mask_width =
> > BITS_SHIFT_NEXT_LEVEL(eax);
> > die_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> > pkg_mask_width = die_plus_mask_width =
> > BITS_SHIFT_NEXT_LEVEL(eax);
>
> Seems ok, but perhaps you can stick an 'int' cast in
> LEVEL_MAX_SIGLINGS instead and write a simpler max() -- and/or
> convert
> smt_num_siblings to unsigned int.
>
yeah, it is doable. I'd prefer to use the current version to keep this
fix simpler if you don't mind.

> Regardless,
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Thanks for your ACK.

-rui


2023-02-21 09:01:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Mon, Feb 20, 2023 at 11:49:45PM +0100, Thomas Gleixner wrote:

> > I thought of improving this by parsing all the valid APIC-IDs in MADT
> > during BSP bootup, and get such information by decoding the APIC-IDs
> > using the APIC-ID layout information retrieved from BSP. But this is
> > likely to be a fertile new source of bugs as Dave concerned.
>
> The APIC-IDs are only usefull if there is an architected scheme how they
> are assigned. Is there such a thing?

Isn't that given through CPUID? Or are we worried each CPU will have
different values in the topology leafs?

We really should have added that CPUID uniformity sanity check a long
while ago :-(

2023-02-21 10:09:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

On Tue, Feb 21, 2023 at 10:00:54AM +0100, Peter Zijlstra wrote:
> We really should have added that CPUID uniformity sanity check a long
> while ago :-(

We're pretty much there:

c0dd9245aa9e ("x86/microcode: Check CPU capabilities after late microcode update correctly")

(in tip currently)

It would need to get extended to do it on each CPU during SMP boot.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-03-07 16:11:15

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

Hi, all,

sorry for the late followup.

On Tue, 2023-02-21 at 16:26 +0800, Zhang Rui wrote:
> > > I thought of improving this by parsing all the valid APIC-IDs in
> > > MADT
> > > during BSP bootup, and get such information by decoding the APIC-
> > > IDs
> > > using the APIC-ID layout information retrieved from BSP. But this
> > > is
> > > likely to be a fertile new source of bugs as Dave concerned.
> >
> > The APIC-IDs are only usefull if there is an architected scheme how
> > they
> > are assigned. Is there such a thing?
>
> I don't know.
> Do you think it helps if the APIC-ID layout are defined to be
> identical
> across all CPUs?
> In this case, BSP knows the APIC-ID layout of itself and this can
> apply
> to the other APIC-IDs.

Yeah, I have confirmed with Len that the APIC-ID layouts are identical
across all CPUs on each single system.

>
> > The SDM is not helpful at all, but according to the ACPI spec there
> > exists:
> >
> > Processor Properties Topology Table (PPTT)
> >
> > That table actually provides pretty much what we are looking for,
> > but
> > that table is optional and there is actually code for that in the
> > kernel, which is ARM64 specific.
> >
> > So while this would be useful it's not usable on x86 because that
> > would
> > make too much sense, right?
>
> Thanks for pointing to this.
>
> I got a brief view of PPTT. So far, my understanding is that PPTT
> provides
> 1. the cpu Hierarchy, but package level only. There may be multiple
> levels but it does not tell us if it is a Die, Module or Core.
> 2. the cache Hierarchy
>
> I need to find one real PPTT implementation to see how it works.

I got one PPTT dump and also checked the kernel pptt parsing code.
Based on current PPTT definition, it is true that it can only tell
1. a thread (a Processor Hierarchy Node Structure with "Processor is a
Thread" flag set)
2. a CPU(core) (a Processor Hierarchy Node Structure with "Processor is
a Thread" flag cleared)
3. a package (a Processor Hierarchy Node Structure with "Physical
package" flag set)

We can get useful information like total packages, number of cores in a
package, number of smt siblings etc. But, say, if there is another
level between Core and package, it cannot tell if it is a
Die/Tile/Module. So far, this does not show a strong advantage compared
with the MADT solution, which doesn't depend on new firmware support.

thanks,
rui

2023-03-08 02:47:11

by Brown, Len

[permalink] [raw]
Subject: RE: [RFC PATCH V2 0/1] x86: cpu topology fix and question on x86_max_cores

Yes, On Intel, Linux should be able assert that CPUID.1F has exactly the same "level" definitions on all CPUs in the system -- no matter SMP or Hybrid processor.

This is what Thomas is asking for -- an architectural APIC-ID decoder -- not a future feature, but one that is already shipping.

With this decoder, we can parse the ACPI BIOS APIC/MADT.
As Peter asserts, the list of processors in the MADT has to work, else smpboot would not be working.

If we don't already have a check that the INIT to APIC-ID N really wakes up N, and not some other id,
It shouldn't be hard to add that sanity check...

Anyway, trusting the MADT, and CPUID.1F, we can determine from cpu0 if we are going to boot any SMT siblings or not.
Indeed, we know ahead of time all of the CPUID.1F levels, including the implicit which level -- which CPUs are in which packages,
And thus how many packages.

-Len

2023-03-13 02:05:36

by Zhang, Rui

[permalink] [raw]
Subject: Re: [PATCH V2 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

On Tue, 2023-02-21 at 16:34 +0800, Zhang Rui wrote:
> Hi, Peter,
>
> > > ---
> > > arch/x86/kernel/cpu/topology.c | 5 +++--
> > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/cpu/topology.c
> > > b/arch/x86/kernel/cpu/topology.c
> > > index 5e868b62a7c4..0270925fe013 100644
> > > --- a/arch/x86/kernel/cpu/topology.c
> > > +++ b/arch/x86/kernel/cpu/topology.c
> > > @@ -79,7 +79,7 @@ int detect_extended_topology_early(struct
> > > cpuinfo_x86 *c)
> > > * initial apic id, which also represents 32-bit extended
> > > x2apic id.
> > > */
> > > c->initial_apicid = edx;
> > > - smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
> > > + smp_num_siblings = max_t(int, smp_num_siblings,
> > > LEVEL_MAX_SIBLINGS(ebx));
> > > #endif
> > > return 0;
> > > }
> > > @@ -109,7 +109,8 @@ int detect_extended_topology(struct
> > > cpuinfo_x86
> > > *c)
> > > */
> > > cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
> > > c->initial_apicid = edx;
> > > - core_level_siblings = smp_num_siblings =
> > > LEVEL_MAX_SIBLINGS(ebx);
> > > + core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> > > + smp_num_siblings = max_t(int, smp_num_siblings,
> > > LEVEL_MAX_SIBLINGS(ebx));
> > > core_plus_mask_width = ht_mask_width =
> > > BITS_SHIFT_NEXT_LEVEL(eax);
> > > die_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
> > > pkg_mask_width = die_plus_mask_width =
> > > BITS_SHIFT_NEXT_LEVEL(eax);
> >
> > Seems ok, but perhaps you can stick an 'int' cast in
> > LEVEL_MAX_SIGLINGS instead and write a simpler max() -- and/or
> > convert
> > smt_num_siblings to unsigned int.
> >
> yeah, it is doable. I'd prefer to use the current version to keep
> this
> fix simpler if you don't mind.
>
> > Regardless,
> >
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
>
> Thanks for your ACK.

Hi, all,

Despite the discussions about future improvements in the cover letter
of this patch series, is there any further changes needed for this one?

thanks,
rui