Hi!
This is the (for now) last part of reworking topology enumeration and
management. It's based on the APIC and CPUID rework series which can be
found here:
https://lore.kernel.org/lkml/[email protected]
With these preparatory changes in place, it's now possible to address the
real issues of the current topology code:
- Wrong core count on hybrid systems
- Heuristics-based size information for packages and dies, which fails
  to work correctly with certain command line parameters.
- Complete evaluation failure for a theoretical hybrid system which boots
  from an E-core
- The complete insanity of manipulating global data from firmware parsers
or the XEN/PV fake SMP enumeration. The latter is really a piece of art.
This series addresses these issues by
- Mopping up some more historical technical debt
- Consolidating all topology relevant functionality into one place
- Providing separate interfaces for boot time and ACPI hotplug operations
- A sane ordering of command line options and restrictions
- A sensible way to handle the BSP problem in kdump kernels instead of
the unreliable command line option.
- Confinement of topology relevant variables by replacing the XEN/PV SMP
  enumeration fake with something halfway sensible.
- Evaluation of sizes by analysing the topology via the CPUID-provided
  APIC ID segmentation and the actual APIC IDs which are registered at
  boot time.
- Removal of heuristics and broken size calculations
The idea behind this is the following:
The APIC IDs describe the system topology in multiple domain levels. The
CPUID topology parser provides the information about which part of the APIC
ID is associated with the individual levels (Intel terminology):
[ROOT][PACKAGE][DIE][TILE][MODULE][CORE][THREAD]
The root space contains the package (socket) IDs. Levels which are not
enumerated consume 0 bits of space, but conceptually they are always
represented. If, e.g., only the CORE and THREAD levels are enumerated,
then DIE, MODULE and TILE have the same physical ID as the PACKAGE.
If SMT is not supported, then the THREAD domain is still used. It then
has the same physical ID as the CORE domain and is the only child of
the core domain.
This allows a unified view of the system, independent of the enumerated
domain levels, without requiring any conditionals in the code.
AMD only exposes 4 domain levels, with obviously different terminology,
but they can easily be mapped onto the Intel variant with a trivial lookup
table added to the CPUID parser.
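A minimal sketch of the resulting lookup (illustrative only; the names and
layout below are hypothetical and do not claim to match the series' actual
code):

/*
 * Hypothetical illustration of the APIC ID segmentation described above.
 */
enum topo_domain {
	DOM_THREAD,	/* always present, even without SMT */
	DOM_CORE,
	DOM_MODULE,
	DOM_TILE,
	DOM_DIE,
	DOM_PKG,
	DOM_MAX,
};

/*
 * Bit position at which each domain's ID field starts within the APIC ID,
 * as recorded by the CPUID topology parser. A level which is not
 * enumerated consumes 0 bits, i.e. it gets the same shift -- and therefore
 * the same physical ID -- as its parent domain.
 */
static unsigned int dom_shift[DOM_MAX];

static inline unsigned int topo_domain_id(unsigned int apicid,
					  enum topo_domain dom)
{
	/* All threads within the same instance of 'dom' share this ID */
	return apicid >> dom_shift[dom];
}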
The resulting topology information of an ADL hybrid system with 8 P-Cores
and 8 E-Cores looks like this:
CPU topo: Max. logical packages: 1
CPU topo: Max. logical dies: 1
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 2
CPU topo: Num. cores per package: 16
CPU topo: Num. threads per package: 24
CPU topo: Allowing 24 present CPUs plus 0 hotplug CPUs
CPU topo: Thread : 24
CPU topo: Core : 16
CPU topo: Module : 1
CPU topo: Tile : 1
CPU topo: Die : 1
CPU topo: Package : 1
This happens on the boot CPU before any of the APs are started and
provides correct size information right from the start.
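(For reference: 8 P-cores with 2 SMT threads each plus 8 single-threaded
E-cores add up to 8 * 2 + 8 = 24 threads and 16 cores per package, i.e.
exactly the numbers above; this is the core count which the old heuristics
got wrong on hybrid systems.)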
Even the XEN/PV trainwreck makes use of this now. On Dom0 it utilizes the
MADT and on DomU it provides fake APIC IDs, which combined with the
provided CPUID information make it at least look halfway realistic instead
of claiming to have one CPU per package as the current upstream code does.
This is solely addressing the core topology issues, but there is a plan for
further consolidation of other topology-related information into one single
source of information instead of having a gazillion localized special
parsers and representations all over the place. There are quite some other
things which can be simplified on top of this, like updating the various
cpumasks during CPU bringup, but that's all left for later.
So another 53 patches later, the resulting diffstat is:
64 files changed, 830 insertions(+), 955 deletions(-)
and the combo diffstat of all three series combined:
115 files changed, 2414 insertions(+), 3035 deletions(-)
The current series applies on top of
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-cpuid-v3
and is available from git here:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-full-v1
Thanks,
tglx
---
Documentation/admin-guide/kdump/kdump.rst | 7
Documentation/admin-guide/kernel-parameters.txt | 9
Documentation/arch/x86/topology.rst | 24
arch/x86/events/intel/cstate.c | 2
arch/x86/events/intel/uncore.c | 2
arch/x86/events/intel/uncore_nhmex.c | 4
arch/x86/events/intel/uncore_snb.c | 8
arch/x86/events/intel/uncore_snbep.c | 18
arch/x86/events/rapl.c | 2
arch/x86/include/asm/apic.h | 17
arch/x86/include/asm/io_apic.h | 1
arch/x86/include/asm/mpspec.h | 66 -
arch/x86/include/asm/perf_event_p4.h | 11
arch/x86/include/asm/processor.h | 2
arch/x86/include/asm/prom.h | 4
arch/x86/include/asm/smp.h | 6
arch/x86/include/asm/topology.h | 60 -
arch/x86/include/asm/x86_init.h | 10
arch/x86/kernel/acpi/boot.c | 59 -
arch/x86/kernel/apic/apic.c | 207 ----
arch/x86/kernel/apic/apic_common.c | 15
arch/x86/kernel/apic/apic_flat_64.c | 9
arch/x86/kernel/apic/apic_noop.c | 2
arch/x86/kernel/apic/apic_numachip.c | 12
arch/x86/kernel/apic/bigsmp_32.c | 14
arch/x86/kernel/apic/io_apic.c | 85 -
arch/x86/kernel/apic/local.h | 4
arch/x86/kernel/apic/probe_32.c | 3
arch/x86/kernel/apic/x2apic_cluster.c | 3
arch/x86/kernel/apic/x2apic_phys.c | 6
arch/x86/kernel/apic/x2apic_uv_x.c | 6
arch/x86/kernel/cpu/Makefile | 12
arch/x86/kernel/cpu/cacheinfo.c | 2
arch/x86/kernel/cpu/common.c | 42
arch/x86/kernel/cpu/debugfs.c | 15
arch/x86/kernel/cpu/mce/inject.c | 3
arch/x86/kernel/cpu/microcode/intel.c | 4
arch/x86/kernel/cpu/topology.c | 497 ++++++++++
arch/x86/kernel/cpu/topology.h | 5
arch/x86/kernel/cpu/topology_common.c | 57 -
arch/x86/kernel/devicetree.c | 4
arch/x86/kernel/jailhouse.c | 30
arch/x86/kernel/mpparse.c | 35
arch/x86/kernel/process.c | 2
arch/x86/kernel/setup.c | 27
arch/x86/kernel/smpboot.c | 229 ----
arch/x86/kernel/x86_init.c | 5
arch/x86/mm/amdtopology.c | 7
arch/x86/platform/ce4100/ce4100.c | 14
arch/x86/platform/intel-mid/intel-mid.c | 5
arch/x86/xen/apic.c | 21
arch/x86/xen/enlighten_hvm.c | 2
arch/x86/xen/enlighten_pv.c | 3
arch/x86/xen/smp.c | 2
arch/x86/xen/smp.h | 2
arch/x86/xen/smp_pv.c | 66 -
drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 2
drivers/hwmon/coretemp.c | 2
drivers/hwmon/fam15h_power.c | 2
drivers/platform/x86/intel/uncore-frequency/uncore-frequency.c | 2
drivers/powercap/intel_rapl_common.c | 2
drivers/thermal/intel/intel_hfi.c | 2
drivers/thermal/intel/intel_powerclamp.c | 2
drivers/thermal/intel/x86_pkg_temp_thermal.c | 2
64 files changed, 830 insertions(+), 955 deletions(-)
On 07.08.23 15:52, Thomas Gleixner wrote:
> [...]
>
> Even the XEN/PV trainwreck makes use of this now. On Dom0 it utilizes the
> MADT and on DomU it provides fake APIC IDs, which combined with the
> provided CPUID information make it at least look halfway realistic instead
> of claiming to have one CPU per package as the current upstream code does.
>
> [...]
Tested on an Intel system with Xen:
- PV dom0 is working fine. I couldn't test physical cpu hotplug, but
  removing and then re-adding vcpus to dom0 worked.
- PV domU is working fine, too. A test starting with 2 vcpus initially
  and onlining another 2 vcpus later went fine.
So for Xen PV you can add my:
Tested-by: Juergen Gross <[email protected]>
One other thing to mention: with this series the reported topology via "lscpu"
and "cat /proc/cpuinfo" inside a PV guest/dom0 looks sane for the first
time. :-)
Thanks for this significant improvement!
Juergen
On Tue, Aug 08 2023 at 11:29, Sohil Mehta wrote:
> On 8/7/2023 6:52 AM, Thomas Gleixner wrote:
> However, I am a bit confused with the dmesg results.
>
> Dmesg output
> ------------
> CPU topo: Max. logical packages: 4
> CPU topo: Max. logical dies: 4
> CPU topo: Max. dies per package: 1
> CPU topo: Max. threads per core: 2
> CPU topo: Num. cores per package: 16
> CPU topo: Num. threads per package: 32
That's indeed weird. Can you please provide:
- the output of 'cpuid -r'
- the output of /sys/kernel/debug/x86/topo/domains
- the APIC IDs of all CPUs (see below patch)
> Questions
> ---------
> 1) Before this series, the Max logical packages used to be logged as 8
> in dmesg. But now it shows up as 4. Is that expected?
> To me, it seems that to get to 160 potential CPUs with 10 cores/20 threads
> per package, the Max logical packages should be 8.
As the number of cores per package is not consistent, this is not a
surprise. 160/32 = 5, but yes something is fishy there.
Oh. Is this perhaps one of those machines where the APICs are enumerated
twice? Can you apply the patch below?
The ACPI part is a modified variant of:
https://lore.kernel.org/r/[email protected]
Please apply the topology hunks first without the ACPI changes and then
try the ACPI change on top.
Thanks,
tglx
---
arch/x86/kernel/acpi/boot.c | 31 ++++++++++++++++---------------
arch/x86/kernel/cpu/topology.c | 2 ++
2 files changed, 18 insertions(+), 15 deletions(-)
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -171,6 +171,8 @@ static bool __init acpi_is_processor_usa
return false;
}
+static bool has_lapic_cpus;
+
static int __init
acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
{
@@ -241,6 +243,14 @@ acpi_parse_lapic(union acpi_subtable_hea
return 0;
/*
+ * According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure
+ * when MADT provides both valid LAPIC and x2APIC entries, the APIC ID
+ * in x2APIC must be equal or greater than 0xff.
+ */
+ if (has_lapic_cpus && apic_id < 0xff)
+ return 0;
+
+ /*
* We need to register disabled CPU as well to permit
* counting disabled CPUs. This allows us to size
* cpus_possible_map more accurately, to permit
@@ -1084,21 +1094,12 @@ static int __init acpi_parse_madt_lapic_
acpi_parse_sapic, MAX_LOCAL_APIC);
if (!count) {
- memset(madt_proc, 0, sizeof(madt_proc));
- madt_proc[0].id = ACPI_MADT_TYPE_LOCAL_APIC;
- madt_proc[0].handler = acpi_parse_lapic;
- madt_proc[1].id = ACPI_MADT_TYPE_LOCAL_X2APIC;
- madt_proc[1].handler = acpi_parse_x2apic;
- ret = acpi_table_parse_entries_array(ACPI_SIG_MADT,
- sizeof(struct acpi_table_madt),
- madt_proc, ARRAY_SIZE(madt_proc), MAX_LOCAL_APIC);
- if (ret < 0) {
- pr_err("Error parsing LAPIC/X2APIC entries\n");
- return ret;
- }
-
- count = madt_proc[0].count;
- x2count = madt_proc[1].count;
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_LAPIC,
+ acpi_parse_lapic, MAX_LOCAL_APIC);
+ if (count)
+ has_lapic_cpus = true;
+ x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC,
+ acpi_parse_x2apic, MAX_LOCAL_APIC);
}
if (!count && !x2count) {
pr_err("No LAPIC entries present\n");
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -159,6 +159,8 @@ void __init topology_register_apic(u32 a
return;
}
+ pr_info("Register %03x %d\n", apic_id, present);
+
if (present) {
/*
* Prevent double registration, which is valid in case of
On 8/7/2023 6:52 AM, Thomas Gleixner wrote:
>
> The current series applies on top of
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-cpuid-v3
>
> and is available from git here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-full-v1
>
I tested this on a 2S Ivy bridge system with 10 cores per socket
resulting in a total of 20 cores/40 threads. The specifications are
listed at:
https://www.intel.com/content/www/us/en/products/sku/75279/intel-xeon-processor-e52690-v2-25m-cache-3-00-ghz/specifications.html
However, I am a bit confused with the dmesg results.
Dmesg output
------------
CPU topo: Max. logical packages: 4
CPU topo: Max. logical dies: 4
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 2
CPU topo: Num. cores per package: 16
CPU topo: Num. threads per package: 32
CPU topo: Allowing 40 present CPUs plus 120 hotplug CPUs
CPU topo: Thread : 120
CPU topo: Core : 60
CPU topo: Module : 4
CPU topo: Tile : 4
CPU topo: Die : 4
CPU topo: Package : 4
smpboot: x86: Booting SMP configuration:
.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9
.... node #1, CPUs: #10 #11 #12 #13 #14 #15 #16 #17 #18 #19
.... node #0, CPUs: #20 #21 #22 #23 #24 #25 #26 #27 #28 #29
.... node #1, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39
smp: Brought up 2 nodes, 40 CPUs
smpboot: Total of 40 processors activated (239426.00 BogoMIPS)
Debugfs
-------
# cat /sys/kernel/debug/x86/topo/cpus/39
online: 1
initial_apicid: 39
apicid: 39
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 12
logical_pkg_id: 1
logical_die_id: 1
llc_id: 32
l2c_id: 56
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 32
num_cores: 16
max_dies_per_pkg: 1
max_threads_per_core:2
lscpu output
------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
CPU family: 6
Model: 62
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
Stepping: 4
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5985.65
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 640 KiB (20 instances)
L1i: 640 KiB (20 instances)
L2: 5 MiB (20 instances)
L3: 50 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
Questions
---------
1) Before this series, the Max logical packages used to be logged as 8
in dmesg. But now it shows up as 4. Is that expected?
To me, it seems that to get to 160 potential CPUs with 10 cores/20 threads
per package, the Max logical packages should be 8.
2) The "Num. cores per package" is listed as 16 in dmesg but shows as 10
in lscpu. The lscpu one seems correct to me. Why does dmesg show this as
16? I don't believe it refers to Max cores per package either?
3) Should the domain name and weight print be a little more descriptive?
pr_info("%-10s: %5u\n", domain_names[dom], domain_weight(dom));
Thread : 120
Core : 60
Module : 4
Tile : 4
Die : 4
Package : 4
This seems a little hard to decipher for the end user without some
context. Can you please help explain what Thread: 120 refers to wrt
this system?
Please let me know if you need any additional info.
Maybe some of these things are not specific to this series. I apologize
for not raising these questions earlier. The new prints in dmesg caught
my eye and I started to wonder what they exactly mean.
Thanks,
Sohil
On Tue, Aug 08 2023 at 13:30, Sohil Mehta wrote:
> There are compile issues with the ACPI hunk.
>
>> arch/x86/kernel/acpi/boot.c: In function ‘acpi_parse_lapic’:
>> arch/x86/kernel/acpi/boot.c:250:31: error: ‘apic_id’ undeclared (first use in this function)
>> 250 | if (has_lapic_cpus && apic_id < 0xff)
>> | ^~~~~~~
>> arch/x86/kernel/acpi/boot.c:250:31: note: each undeclared identifier is reported only once for each function it appears in
>> arch/x86/kernel/acpi/boot.c: In function ‘acpi_parse_madt_lapic_entries’:
>> arch/x86/kernel/acpi/boot.c:1097:47: error: ‘ACPI_MADT_TYPE_LOCAL_LAPIC’ undeclared (first use in this function); did you mean ‘ACPI_MADT_TYPE_LOCAL_SAPIC’?
>> 1097 | count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_LAPIC,
>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> | ACPI_MADT_TYPE_LOCAL_SAPIC
>
Duh. Yes. I just picked the thing from the list and hacked it
up. Compilable variant below.
Thanks,
tglx
---
arch/x86/kernel/acpi/boot.c | 35 +++++++++++++++++------------------
arch/x86/kernel/cpu/topology.c | 2 ++
2 files changed, 19 insertions(+), 18 deletions(-)
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -171,6 +171,8 @@ static bool __init acpi_is_processor_usa
return false;
}
+static bool has_lapic_cpus;
+
static int __init
acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
{
@@ -241,6 +243,14 @@ acpi_parse_lapic(union acpi_subtable_hea
return 0;
/*
+ * According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure
+ * when MADT provides both valid LAPIC and x2APIC entries, the APIC ID
+ * in x2APIC must be equal or greater than 0xff.
+ */
+ if (has_lapic_cpus && processor->id < 0xff)
+ return 0;
+
+ /*
* We need to register disabled CPU as well to permit
* counting disabled CPUs. This allows us to size
* cpus_possible_map more accurately, to permit
@@ -1072,10 +1082,8 @@ static int __init early_acpi_parse_madt_
static int __init acpi_parse_madt_lapic_entries(void)
{
- int count;
int x2count = 0;
- int ret;
- struct acpi_subtable_proc madt_proc[2];
+ int count;
if (!boot_cpu_has(X86_FEATURE_APIC))
return -ENODEV;
@@ -1084,21 +1092,12 @@ static int __init acpi_parse_madt_lapic_
acpi_parse_sapic, MAX_LOCAL_APIC);
if (!count) {
- memset(madt_proc, 0, sizeof(madt_proc));
- madt_proc[0].id = ACPI_MADT_TYPE_LOCAL_APIC;
- madt_proc[0].handler = acpi_parse_lapic;
- madt_proc[1].id = ACPI_MADT_TYPE_LOCAL_X2APIC;
- madt_proc[1].handler = acpi_parse_x2apic;
- ret = acpi_table_parse_entries_array(ACPI_SIG_MADT,
- sizeof(struct acpi_table_madt),
- madt_proc, ARRAY_SIZE(madt_proc), MAX_LOCAL_APIC);
- if (ret < 0) {
- pr_err("Error parsing LAPIC/X2APIC entries\n");
- return ret;
- }
-
- count = madt_proc[0].count;
- x2count = madt_proc[1].count;
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
+ acpi_parse_lapic, MAX_LOCAL_APIC);
+ if (count > 0)
+ has_lapic_cpus = true;
+ x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC,
+ acpi_parse_x2apic, MAX_LOCAL_APIC);
}
if (!count && !x2count) {
pr_err("No LAPIC entries present\n");
On 8/8/2023 12:10 PM, Thomas Gleixner wrote:
> That's indeed weird. Can you please provide:
>
> - the output of 'cpuid -r'
> - the output of /sys/kernel/debug/x86/topo/domains
> - the APIC IDs of all CPUs (see below patch)
>
Domains
-------
domain: Thread shift: 1 dom_size: 2 max_threads: 2
domain: Core shift: 5 dom_size: 16 max_threads: 32
domain: Module shift: 5 dom_size: 1 max_threads: 32
domain: Tile shift: 5 dom_size: 1 max_threads: 32
domain: Die shift: 5 dom_size: 1 max_threads: 32
domain: Package shift: 5 dom_size: 1 max_threads: 32
Attached the cpuid output and the apic id list.
Also, I see a warning message that only seems to show up with the final
installment series applied. I attached the complete dmesg as well (just
in case):
unchecked MSR access error: WRMSR to 0xe44 (tried to write
0x0000000000000003) at rIP: 0xffffffff8d2a6698 (native_write_msr+0x8/0x30)
Call Trace:
<TASK>
? show_stack_regs+0x27/0x30
? ex_handler_msr+0x10f/0x180
? search_extable+0x2b/0x40
? fixup_exception+0x315/0x380
? exc_general_protection+0x139/0x460
? idr_alloc_cyclic+0x59/0xc0
? asm_exc_general_protection+0x2b/0x30
? native_write_msr+0x8/0x30
? ivbep_uncore_msr_init_box+0x47/0x60
uncore_box_ref.part.0+0xa6/0xe0
uncore_event_cpu_online+0x6e/0x1c0
? __pfx_uncore_event_cpu_online+0x10/0x10
cpuhp_invoke_callback+0x165/0x4b0
? try_to_wake_up+0x284/0x6b0
cpuhp_thread_fun+0xc4/0x1e0
? __pfx_smpboot_thread_fn+0x10/0x10
smpboot_thread_fn+0xe7/0x1e0
kthread+0xfb/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x40/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
>
> Please apply the topology hunks first without the ACPI changes and then
> try the ACPI change on top.
>
There are compile issues with the ACPI hunk.
> arch/x86/kernel/acpi/boot.c: In function ‘acpi_parse_lapic’:
> arch/x86/kernel/acpi/boot.c:250:31: error: ‘apic_id’ undeclared (first use in this function)
> 250 | if (has_lapic_cpus && apic_id < 0xff)
> | ^~~~~~~
> arch/x86/kernel/acpi/boot.c:250:31: note: each undeclared identifier is reported only once for each function it appears in
> arch/x86/kernel/acpi/boot.c: In function ‘acpi_parse_madt_lapic_entries’:
> arch/x86/kernel/acpi/boot.c:1097:47: error: ‘ACPI_MADT_TYPE_LOCAL_LAPIC’ undeclared (first use in this function); did you mean ‘ACPI_MADT_TYPE_LOCAL_SAPIC’?
> 1097 | count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_LAPIC,
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~
> | ACPI_MADT_TYPE_LOCAL_SAPIC
Did I miss applying something? I didn't try to understand and fix them.
Sohil
On 8/8/2023 3:10 PM, Peter Zijlstra wrote:
> It works better if you move this hunk into acpi_parse_x2apic() instead.
> Then I can indeed confirm it works as advertised -- also having one of
> them afflicted ivb-ep machines.
>
I had a disappointed email typed up and was about to send it when I saw
this.
The inconsistency and warning on my system resolve with this. I lost
120 imaginary hotpluggable cpus but other than that everything seems fine :)
CPU topo: Max. logical packages: 2
CPU topo: Max. logical dies: 2
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 2
CPU topo: Num. cores per package: 10
CPU topo: Num. threads per package: 20
CPU topo: Allowing 40 present CPUs plus 0 hotplug CPUs
CPU topo: Thread : 40
CPU topo: Core : 20
CPU topo: Module : 2
CPU topo: Tile : 2
CPU topo: Die : 2
CPU topo: Package : 2
domain: Thread shift: 1 dom_size: 2 max_threads: 2
domain: Core shift: 5 dom_size: 16 max_threads: 32
domain: Module shift: 5 dom_size: 1 max_threads: 32
domain: Tile shift: 5 dom_size: 1 max_threads: 32
domain: Die shift: 5 dom_size: 1 max_threads: 32
domain: Package shift: 5 dom_size: 1 max_threads: 32
/sys/kernel/debug/x86/topo/cpus/39
online: 1
initial_apicid: 39
apicid: 39
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 12
logical_pkg_id: 1
logical_die_id: 1
llc_id: 32
l2c_id: 56
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 20
num_cores: 10
max_dies_per_pkg: 1
max_threads_per_core:2
Sohil
On Tue, Aug 08 2023 at 12:20, Andrew Cooper wrote:
> On 08/08/2023 8:40 am, Juergen Gross wrote:
>> Tested on an Intel system with Xen:
>>
>> - PV dom0 is working fine. I couldn't test physical cpu hotplug, but
>> removing and then re-adding vcpus to dom0 worked.
>
> It turns out that physical CPU hotplug with XenPV is broken in at least
> two ways.
>
> It's dom0 (not Xen) that gets the hot-unplug event, after which the Xen
> code in Linux succumbs to a preempt-check failure while trying to
> offline the vCPU that aliases the pCPU wanting to go offline.
That should be gone by now :)
On Tue, Aug 08 2023 at 15:58, Sohil Mehta wrote:
> On 8/8/2023 3:10 PM, Peter Zijlstra wrote:
>> It works better if you move this hunk into acpi_parse_x2apic() instead.
>> Then I can indeed confirm it works as advertised -- also having one of
>> them afflicted ivb-ep machines.
>>
>
> I had a disappointed email typed up
Rightfully though as I'm clearly too tired and too grumpy to think
straight.
> and was about to send it when I saw this.
:)
> The inconsistency and warning on my system resolve with this. I lost
> 120 imaginary hotpluggable cpus but other than that everything seems
> fine :)
Sorry about that loss. :)
> CPU topo: Max. logical packages: 2
> CPU topo: Max. logical dies: 2
> CPU topo: Max. dies per package: 1
> CPU topo: Max. threads per core: 2
> CPU topo: Num. cores per package: 10
> CPU topo: Num. threads per package: 20
> CPU topo: Allowing 40 present CPUs plus 0 hotplug CPUs
> CPU topo: Thread : 40
> CPU topo: Core : 20
> CPU topo: Module : 2
> CPU topo: Tile : 2
> CPU topo: Die : 2
> CPU topo: Package : 2
>
> [...]
That makes much more sense now.
Zhang, can you please follow up on:
https://lore.kernel.org/all/[email protected]/
or otherwise I just polish up PeterZ's variant of it tomorrow.
Thanks,
tglx
On Tue, Aug 08 2023 at 13:30, Sohil Mehta wrote:
> On 8/8/2023 12:10 PM, Thomas Gleixner wrote:
> domain: Thread shift: 1 dom_size: 2 max_threads: 2
> domain: Core shift: 5 dom_size: 16 max_threads: 32
> domain: Module shift: 5 dom_size: 1 max_threads: 32
> domain: Tile shift: 5 dom_size: 1 max_threads: 32
> domain: Die shift: 5 dom_size: 1 max_threads: 32
> domain: Package shift: 5 dom_size: 1 max_threads: 32
>
> CPU 0:
> 0x0000000b 0x00: eax=0x00000001 ebx=0x00000002 ecx=0x00000100 edx=0x00000000
> 0x0000000b 0x01: eax=0x00000005 ebx=0x00000014 ecx=0x00000201 edx=0x00000000
Ok. So this is consistent.
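For reference: leaf 0xB encodes each level's shift width in EAX[4:0] and
the level type in ECX[15:8] (1 = SMT, 2 = Core). Subleaf 0 says the SMT
level consumes 1 bit of the APIC ID and subleaf 1 says thread plus core
together consume 5 bits, which matches the Thread/Core shifts of 1 and 5
above. Decoding CPU 39's APIC ID 0x39 with those shifts gives pkg_id =
0x39 >> 5 = 1 and core_id = (0x39 >> 1) & 0xf = 12, exactly what the
earlier debugfs dump shows.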
> Also, I see a warning message that only seems to show up with the final
> installment series applied. I attached the complete dmesg as well (just
> in case):
>
> unchecked MSR access error: WRMSR to 0xe44 (tried to write
> 0x0000000000000003) at rIP: 0xffffffff8d2a6698 (native_write_msr+0x8/0x30)
> uncore_box_ref.part.0+0xa6/0xe0
> uncore_event_cpu_online+0x6e/0x1c0
> ? __pfx_uncore_event_cpu_online+0x10/0x10
> cpuhp_invoke_callback+0x165/0x4b0
That's probably a consequence of the inconsistency.
> [ 0.187210] CPU topo: Register 000 1
> [ 0.187211] CPU topo: Register 002 1
> [ 0.187212] CPU topo: Register 004 1
> [ 0.187213] CPU topo: Register 006 1
> [ 0.187214] CPU topo: Register 008 1
> [ 0.187215] CPU topo: Register 010 1
> [ 0.187216] CPU topo: Register 012 1
> [ 0.187217] CPU topo: Register 014 1
> [ 0.187218] CPU topo: Register 016 1
> [ 0.187219] CPU topo: Register 018 1
The first package (primary threads)
> [ 0.187219] CPU topo: Register 020 1
> [ 0.187220] CPU topo: Register 022 1
> [ 0.187221] CPU topo: Register 024 1
> [ 0.187222] CPU topo: Register 026 1
> [ 0.187223] CPU topo: Register 028 1
> [ 0.187223] CPU topo: Register 030 1
> [ 0.187224] CPU topo: Register 032 1
> [ 0.187225] CPU topo: Register 034 1
> [ 0.187226] CPU topo: Register 036 1
> [ 0.187227] CPU topo: Register 038 1
The second package (primary threads)
> [ 0.187228] CPU topo: Register 001 1
> [ 0.187228] CPU topo: Register 003 1
> [ 0.187229] CPU topo: Register 005 1
> [ 0.187230] CPU topo: Register 007 1
> [ 0.187230] CPU topo: Register 009 1
> [ 0.187231] CPU topo: Register 011 1
> [ 0.187232] CPU topo: Register 013 1
> [ 0.187233] CPU topo: Register 015 1
> [ 0.187233] CPU topo: Register 017 1
> [ 0.187234] CPU topo: Register 019 1
The first package (secondary threads)
> [ 0.187235] CPU topo: Register 021 1
> [ 0.187235] CPU topo: Register 023 1
> [ 0.187236] CPU topo: Register 025 1
> [ 0.187237] CPU topo: Register 027 1
> [ 0.187238] CPU topo: Register 029 1
> [ 0.187238] CPU topo: Register 031 1
> [ 0.187239] CPU topo: Register 033 1
> [ 0.187240] CPU topo: Register 035 1
> [ 0.187241] CPU topo: Register 037 1
> [ 0.187241] CPU topo: Register 039 1
The second package (secondary threads)
> [ 0.187244] CPU topo: Register 000 0
> [ 0.187244] CPU topo: Register 001 0
... PKG 0
> [ 0.187266] CPU topo: Register 01e 0
> [ 0.187267] CPU topo: Register 01f 0
Ah, that's indeed the issue which the ACPI patch addresses. So that
table claims that the packages are truly filled up to capacity, i.e. 32
threads. The old code did not notice because they are all marked
non-present, but with the new approach these are rightfully accounted as
pluggable and show up in the bitmaps accordingly. Sigh...
> [ 0.187268] CPU topo: Register 020 0
... PKG 1
> [ 0.187291] CPU topo: Register 03f 0
> [ 0.187292] CPU topo: Register 040 0
... PKG 2
> [ 0.187304] CPU topo: Register 05f 0
> [ 0.187305] CPU topo: Register 060 0
... PKG 3
> [ 0.187335] CPU topo: Register 077 0
This one is funny as it stops at 0x77, i.e. 8 CPUs short of the full
range.
So this:
> [ 0.187412] CPU topo: Max. logical packages: 4
_IS_ correct according to the above.
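(It also checks out numerically: with a package shift of 5, the registered
APIC IDs 0x00-0x77 decode to package IDs 0x00 >> 5 = 0 through
0x77 >> 5 = 3, i.e. exactly four packages. The 0x78 = 120 disabled IDs in
that range are what shows up as "plus 120 hotplug CPUs" and as the
Thread : 120 / Core : 60 weights above.)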
I bet that the ACPI patch cures it.
Thanks,
tglx
On Tue, Aug 08, 2023 at 10:41:51PM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -171,6 +171,8 @@ static bool __init acpi_is_processor_usa
> return false;
> }
>
> +static bool has_lapic_cpus;
> +
> static int __init
> acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
> {
> @@ -241,6 +243,14 @@ acpi_parse_lapic(union acpi_subtable_hea
> return 0;
>
> /*
> + * According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure
> + * when MADT provides both valid LAPIC and x2APIC entries, the APIC ID
> + * in x2APIC must be equal or greater than 0xff.
> + */
> + if (has_lapic_cpus && processor->id < 0xff)
> + return 0;
> +
> + /*
> * We need to register disabled CPU as well to permit
> * counting disabled CPUs. This allows us to size
> * cpus_possible_map more accurately, to permit
It works better if you move this hunk into acpi_parse_x2apic() instead.
Then I can indeed confirm it works as advertised -- also having one of
them afflicted ivb-ep machines.
Tested-by: Peter Zijlstra (Intel) <[email protected]>
---
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 088323ed6179..f6cff99d6087 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -171,6 +171,8 @@ static bool __init acpi_is_processor_usable(u32 lapic_flags)
return false;
}
+static bool has_lapic_cpus;
+
static int __init
acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
{
@@ -199,6 +201,14 @@ acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
if (!acpi_is_processor_usable(processor->lapic_flags))
return 0;
+ /*
+ * According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure
+ * when MADT provides both valid LAPIC and x2APIC entries, the APIC ID
+ * in x2APIC must be equal or greater than 0xff.
+ */
+ if (has_lapic_cpus && apic_id < 0xff)
+ return 0;
+
/*
* We need to register disabled CPU as well to permit
* counting disabled CPUs. This allows us to size
@@ -1072,10 +1083,8 @@ static int __init early_acpi_parse_madt_lapic_addr_ovr(void)
static int __init acpi_parse_madt_lapic_entries(void)
{
- int count;
int x2count = 0;
- int ret;
- struct acpi_subtable_proc madt_proc[2];
+ int count;
if (!boot_cpu_has(X86_FEATURE_APIC))
return -ENODEV;
@@ -1084,21 +1093,12 @@ static int __init acpi_parse_madt_lapic_entries(void)
acpi_parse_sapic, MAX_LOCAL_APIC);
if (!count) {
- memset(madt_proc, 0, sizeof(madt_proc));
- madt_proc[0].id = ACPI_MADT_TYPE_LOCAL_APIC;
- madt_proc[0].handler = acpi_parse_lapic;
- madt_proc[1].id = ACPI_MADT_TYPE_LOCAL_X2APIC;
- madt_proc[1].handler = acpi_parse_x2apic;
- ret = acpi_table_parse_entries_array(ACPI_SIG_MADT,
- sizeof(struct acpi_table_madt),
- madt_proc, ARRAY_SIZE(madt_proc), MAX_LOCAL_APIC);
- if (ret < 0) {
- pr_err("Error parsing LAPIC/X2APIC entries\n");
- return ret;
- }
-
- count = madt_proc[0].count;
- x2count = madt_proc[1].count;
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
+ acpi_parse_lapic, MAX_LOCAL_APIC);
+ if (count > 0)
+ has_lapic_cpus = true;
+ x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC,
+ acpi_parse_x2apic, MAX_LOCAL_APIC);
}
if (!count && !x2count) {
pr_err("No LAPIC entries present\n");
Hi Thomas,
> From: Thomas Gleixner <[email protected]>
> ...
> Subject: [patch 00/53] x86/topology: The final installment
> ...
>
> The current series applies on top of
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-cpuid-v3
>
> and is available from git here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-full-v1
Test Machine
------------
I tested the 'topo-full-v1' branch on a Sapphire Rapids server with 2 sockets,
each containing 48 cores, resulting in a total of 192 threads.
Test Results
------------
The following test results (same results with or without [1]) show that
this branch worked well on this server. Based on the test results,
Tested-by: Qiuxu Zhuo <[email protected]>
[1] https://lore.kernel.org/all/[email protected]/
Logs (hyper-thread enabled)
---------------------------
1.1 dmesg | grep "CPU topo" :
CPU topo: Max. logical packages: 2
CPU topo: Max. logical dies: 2
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 2
CPU topo: Num. cores per package: 48
CPU topo: Num. threads per package: 96
CPU topo: Allowing 192 present CPUs plus 0 hotplug CPUs
CPU topo: Thread : 192
CPU topo: Core : 96
CPU topo: Module : 2
CPU topo: Tile : 2
CPU topo: Die : 2
CPU topo: Package : 2
1.2 cat /sys/kernel/debug/x86/topo/domains :
domain: Thread shift: 1 dom_size: 2 max_threads: 2
domain: Core shift: 7 dom_size: 64 max_threads: 128
domain: Module shift: 7 dom_size: 1 max_threads: 128
domain: Tile shift: 7 dom_size: 1 max_threads: 128
domain: Die shift: 7 dom_size: 1 max_threads: 128
domain: Package shift: 7 dom_size: 1 max_threads: 128
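(As a quick cross-check of these shifts: the package ID is apicid >> 7 and
the package-relative core ID is (apicid >> 1) & 0x3f, so e.g. apicid 0xde
below decodes to pkg_id = 0xde >> 7 = 1 and core_id = (0xde >> 1) & 0x3f = 47,
matching the dump for CPU 95.)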
1.3 <1st socket, 1st core, primary thread>
cat /sys/kernel/debug/x86/topo/cpus/0
online: 1
initial_apicid: 0
apicid: 0
pkg_id: 0
die_id: 0
cu_id: 255
core_id: 0
logical_pkg_id: 0
logical_die_id: 0
llc_id: 0
l2c_id: 0
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<1st socket, last core, primary thread>
cat /sys/kernel/debug/x86/topo/cpus/47
online: 1
initial_apicid: 5e
apicid: 5e
pkg_id: 0
die_id: 0
cu_id: 255
core_id: 47
logical_pkg_id: 0
logical_die_id: 0
llc_id: 0
l2c_id: 94
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<2nd socket, 1st core, primary thread>
cat /sys/kernel/debug/x86/topo/cpus/48
online: 1
initial_apicid: 80
apicid: 80
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 0
logical_pkg_id: 1
logical_die_id: 1
llc_id: 128
l2c_id: 128
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<2nd socket, last core, primary thread>
cat /sys/kernel/debug/x86/topo/cpus/95
online: 1
initial_apicid: de
apicid: de
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 47
logical_pkg_id: 1
logical_die_id: 1
llc_id: 128
l2c_id: 222
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<1st socket, 1st core, secondary thread>
cat /sys/kernel/debug/x86/topo/cpus/96
online: 1
initial_apicid: 1
apicid: 1
pkg_id: 0
die_id: 0
cu_id: 255
core_id: 0
logical_pkg_id: 0
logical_die_id: 0
llc_id: 0
l2c_id: 0
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<1st socket, last core, secondary thread>
cat /sys/kernel/debug/x86/topo/cpus/143
online: 1
initial_apicid: 5f
apicid: 5f
pkg_id: 0
die_id: 0
cu_id: 255
core_id: 47
logical_pkg_id: 0
logical_die_id: 0
llc_id: 0
l2c_id: 94
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<2nd socket, 1st core, secondary thread>
cat /sys/kernel/debug/x86/topo/cpus/144
online: 1
initial_apicid: 81
apicid: 81
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 0
logical_pkg_id: 1
logical_die_id: 1
llc_id: 128
l2c_id: 128
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
<2nd socket, last core, secondary thread>
cat /sys/kernel/debug/x86/topo/cpus/191
online: 1
initial_apicid: df
apicid: df
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 47
logical_pkg_id: 1
logical_die_id: 1
llc_id: 128
l2c_id: 222
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 96
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:2
Logs (hyper-thread disabled)
----------------------------
2.1 dmesg | grep "CPU topo" :
CPU topo: Max. logical packages: 2
CPU topo: Max. logical dies: 2
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 1
CPU topo: Num. cores per package: 48
CPU topo: Num. threads per package: 48
CPU topo: Allowing 96 present CPUs plus 0 hotplug CPUs
CPU topo: Thread : 96
CPU topo: Core : 96
CPU topo: Module : 2
CPU topo: Tile : 2
CPU topo: Die : 2
CPU topo: Package : 2
2.2 cat /sys/kernel/debug/x86/topo/domains :
domain: Thread shift: 1 dom_size: 2 max_threads: 2
domain: Core shift: 7 dom_size: 64 max_threads: 128
domain: Module shift: 7 dom_size: 1 max_threads: 128
domain: Tile shift: 7 dom_size: 1 max_threads: 128
domain: Die shift: 7 dom_size: 1 max_threads: 128
domain: Package shift: 7 dom_size: 1 max_threads: 128
2.3 <1st socket, 1st core>
cat /sys/kernel/debug/x86/topo/cpus/0
online: 1
initial_apicid: 0
apicid: 0
pkg_id: 0
die_id: 0
cu_id: 255
core_id: 0
logical_pkg_id: 0
logical_die_id: 0
llc_id: 0
l2c_id: 0
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 48
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:1
<1st socket, last core>
cat /sys/kernel/debug/x86/topo/cpus/47
online: 1
initial_apicid: 5e
apicid: 5e
pkg_id: 0
die_id: 0
cu_id: 255
core_id: 47
logical_pkg_id: 0
logical_die_id: 0
llc_id: 0
l2c_id: 94
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 48
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:1
<2nd socket, 1st core>
cat /sys/kernel/debug/x86/topo/cpus/48
online: 1
initial_apicid: 80
apicid: 80
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 0
logical_pkg_id: 1
logical_die_id: 1
llc_id: 128
l2c_id: 128
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 48
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:1
<2nd socket, last core>
cat /sys/kernel/debug/x86/topo/cpus/95
online: 1
initial_apicid: de
apicid: de
pkg_id: 1
die_id: 1
cu_id: 255
core_id: 47
logical_pkg_id: 1
logical_die_id: 1
llc_id: 128
l2c_id: 222
amd_node_id: 0
amd_nodes_per_pkg: 0
num_threads: 48
num_cores: 48
max_dies_per_pkg: 1
max_threads_per_core:1
-Qiuxu
Hi Sohil,
> From: Sohil Mehta <[email protected]>
> ...
> Subject: Re: [patch 00/53] x86/topology: The final installment
> ...
> Debugfs
> -------
> # cat /sys/kernel/debug/x86/topo/cpus/39
> online: 1
> initial_apicid: 39
> apicid: 39
> ...
Did you convert the output formats of 'initial_apicid' and 'apicid'
from hexadecimal to decimal in your testing? It would be quite a
coincidence if the output values were '39' in hexadecimal for CPU '39'
in decimal :-).
I noticed they were represented in hexadecimal format in
arch/x86/kernel/cpu/debugfs.c:
seq_printf(m, "initial_apicid: %x\n", c->topo.initial_apicid);
seq_printf(m, "apicid: %x\n", c->topo.apicid);
Thanks!
-Qiuxu
Hi Qiuxu,
On 8/9/2023 9:50 AM, Qiuxu Zhuo wrote:
>> Debugfs
>> -------
>> # cat /sys/kernel/debug/x86/topo/cpus/39
>> online: 1
>> initial_apicid: 39
>> apicid: 39
>> ...
>
> Did you convert the output formats of 'initial_apicid' and 'apicid'
> from hexadecimal to decimal in your testing? It was really coincidental
> that if the output values were '39' in hexadecimal for the CPU '39'
> in decimal :-).
>
I didn't convert the output formats. That is purely coincidental or some
trick by firmware developers to avoid letters in the hexadecimal APIC IDs.
I have now modified the print as shown below which generates the
following output:
> pr_info("Register 0x%03x as CPU %d is_present %d\n", apic_id, cpu, present);
CPU topo: Register 0x000 as CPU 0 is_present 1
CPU topo: Register 0x002 as CPU 1 is_present 1
CPU topo: Register 0x004 as CPU 2 is_present 1
CPU topo: Register 0x006 as CPU 3 is_present 1
CPU topo: Register 0x008 as CPU 4 is_present 1
CPU topo: Register 0x010 as CPU 5 is_present 1
CPU topo: Register 0x012 as CPU 6 is_present 1
CPU topo: Register 0x014 as CPU 7 is_present 1
CPU topo: Register 0x016 as CPU 8 is_present 1
CPU topo: Register 0x018 as CPU 9 is_present 1
CPU topo: Register 0x020 as CPU 10 is_present 1
CPU topo: Register 0x022 as CPU 11 is_present 1
CPU topo: Register 0x024 as CPU 12 is_present 1
CPU topo: Register 0x026 as CPU 13 is_present 1
CPU topo: Register 0x028 as CPU 14 is_present 1
CPU topo: Register 0x030 as CPU 15 is_present 1
CPU topo: Register 0x032 as CPU 16 is_present 1
CPU topo: Register 0x034 as CPU 17 is_present 1
CPU topo: Register 0x036 as CPU 18 is_present 1
CPU topo: Register 0x038 as CPU 19 is_present 1
CPU topo: Register 0x001 as CPU 20 is_present 1
CPU topo: Register 0x003 as CPU 21 is_present 1
CPU topo: Register 0x005 as CPU 22 is_present 1
CPU topo: Register 0x007 as CPU 23 is_present 1
CPU topo: Register 0x009 as CPU 24 is_present 1
CPU topo: Register 0x011 as CPU 25 is_present 1
CPU topo: Register 0x013 as CPU 26 is_present 1
CPU topo: Register 0x015 as CPU 27 is_present 1
CPU topo: Register 0x017 as CPU 28 is_present 1
CPU topo: Register 0x019 as CPU 29 is_present 1
CPU topo: Register 0x021 as CPU 30 is_present 1
CPU topo: Register 0x023 as CPU 31 is_present 1
CPU topo: Register 0x025 as CPU 32 is_present 1
CPU topo: Register 0x027 as CPU 33 is_present 1
CPU topo: Register 0x029 as CPU 34 is_present 1
CPU topo: Register 0x031 as CPU 35 is_present 1
CPU topo: Register 0x033 as CPU 36 is_present 1
CPU topo: Register 0x035 as CPU 37 is_present 1
CPU topo: Register 0x037 as CPU 38 is_present 1
CPU topo: Register 0x039 as CPU 39 is_present 1
Sohil
On 8/8/2023 4:20 PM, Thomas Gleixner wrote:
> That makes much more sense now.
>
> Zhang, can you please follow up on:
>
> https://lore.kernel.org/all/[email protected]/
>
> or otherwise I just polish up PeterZ's variant of it tomorrow.
>
> Thanks,
>
> tglx
For the full series (with PeterZ's patch applied),
Tested-by: Sohil Mehta <[email protected]>
Zhang, Thomas, please let me know if there is a different variant of the
patch that needs to be tested.
Sohil
> From: Mehta, Sohil <[email protected]>
>...
>I didn't convert the output formats. That is purely coincidental or some
>trick by firmware developers to avoid letters in the hexadecimal APIC IDs.
From your new log below, it looks like this is indeed the case.
>I have now modified the print as shown below which generates the
>following output:
>
>> pr_info("Register 0x%03x as CPU %d is_present %d\n", apic_id, cpu, present);
>
>
>CPU topo: Register 0x000 as CPU 0 is_present 1
>[...]
>CPU topo: Register 0x039 as CPU 39 is_present 1
Thank you for sharing the output from your machine.
Interesting :-).
-Qiuxu
Hi, Thomas,
> > CPU topo: Num. cores per package: 10
> > CPU topo: Num. threads per package: 20
> > CPU topo: Allowing 40 present CPUs plus 0 hotplug CPUs
> > [...]
>
> That makes much more sense now.
>
> Zhang, can you please follow up on:
>
>
> https://lore.kernel.org/all/[email protected]/
>
> or otherwise I just polish up PeterZ's variant of it tomorrow.
>
Sorry for the late response. I was reviewing the previous
patch series and missed this one.
IMO, PeterZ's patch already follows the suggestion at
https://lore.kernel.org/all/87pm4bp54z.ffs@tglx/
And https://lore.kernel.org/all/87jzukqjvf.ffs@tglx/ is not needed
because there are no duplicate APIC IDs from LAPIC and x2APIC anymore
with the above patch.
So I think we can go with PeterZ's patch.
thanks,
rui
From: Thomas Gleixner <[email protected]> Sent: Monday, August 7, 2023 6:53 AM
>
> [...]
>
> The current series applies on top of
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-cpuid-v3
>
> and is available from git here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-full-v1
Tested the full series on Hyper-V VMs on Intel and AMD Zen processors.
Tested with hyper-threading enabled and disabled, and with a variety of
NUMA and L3 cache configurations. All looks good, modulo the known
issue with Hyper-V providing incorrect APIC IDs in some NUMA configs,
but this patch series did not make that problem any worse.
Tested-by: Michael Kelley <[email protected]>