This patch series changes how the kernel derives the cpu "package" from
package-as-socket to package-as-die in order to fix the following issues
on AMD family17h multi-die processor platforms:
* irqbalance fails to allocate IRQs to individual CPUs within the die.
* The scheduler fails to load-balance across 8 threads within a die
(e.g. running an 8-thread application w/ taskset -c 0-7) with
the DIE schedule domain omitted due to x86_has_numa_in_package.
These issues are fixed when the kernel properly interprets the package as the die.
This series has also been tested on existing AMD systems w/ family15h
and family10h multi-die processors.
Suravee Suthikulpanit (2):
x86/CPU/AMD: Present package as die instead of socket
x86/CPU/AMD: Use L3 Cache info from CPUID to determine LLC ID
arch/x86/kernel/cpu/amd.c | 205 ++++++++++++++++++++++++++++------------------
1 file changed, 124 insertions(+), 81 deletions(-)
--
2.7.4
According to Documentation/x86/topology.txt, the AMD nomenclature for a
package is a NUMA node (or die). However, this is not the case on AMD
family17h multi-die processor platforms, which can have up to 4 dies
per socket, as shown in the following system topology.
Die (Dx) View :
----------------------------
C0 | T0 T1 | || | T0 T1 | C4
--------| || |--------
C1 | T0 T1 | L3 || L3 | T0 T1 | C5
--------| || |--------
C2 | T0 T1 | #0 || #1 | T0 T1 | C6
--------| || |--------
C3 | T0 T1 | || | T0 T1 | C7
----------------------------
System View (with 2 socket) :
--------------------
| -------------|------
| | | |
------------ ------------
| D1 -- D0 | | D7 -- D6 |
| | \/ | | | | \/ | |
SOCKET0 | | /\ | | | | /\ | | SOCKET1
| D2 -- D3 | | D4 -- D5 |
------------ ------------
| | | |
------|------------| |
--------------------
The current logic interprets package as socket (i.e. phys_proc_id is the
socket id), which results in setting x86_has_numa_in_package and omitting
the DIE schedule domain. However, NUMA schedule domains are derived from
SRAT/SLIT, which assume a NUMA node is a die, and NUMA schedule domains
are built on top of NUMA nodes. This results in incomplete schedule domains,
as follows:
domain 0: SMT
domain 1: MC /* core complex w/ shared L3*/
---- Missing DIE level domain ----
domain 2: NUMA /* socket */
domain 3: NUMA /* platform */
Presenting package-as-die does not set x86_has_numa_in_package.
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Leo Duran <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
Cc: <[email protected]> # v4.10+
---
arch/x86/kernel/cpu/amd.c | 189 +++++++++++++++++++++++++++-------------------
1 file changed, 112 insertions(+), 77 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index bb5abe8..2f5869c 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1,3 +1,5 @@
+#define pr_fmt(fmt) "x86/AMD: " fmt
+
#include <linux/export.h>
#include <linux/bitops.h>
#include <linux/elf.h>
@@ -32,6 +34,12 @@ static bool cpu_has_amd_erratum(struct cpuinfo_x86 *cpu, const int *erratum);
*/
static u32 nodes_per_socket = 1;
+/*
+ * l3_num_threads_sharing: Stores the number of threads sharing L3 cache.
+ * Refer to CPUID_Fn8000001D_EAX_x03 [Cache Properties (L3)] NumSharingCache.
+ */
+static u32 l3_num_threads_sharing;
+
static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
{
u32 gprs[8] = { 0 };
@@ -296,96 +304,122 @@ static int nearby_node(int apicid)
}
#endif
+#ifdef CONFIG_SMP
+
/*
- * Fixup core topology information for
- * (1) AMD multi-node processors
+ * Per Documentation/x86/topology.c, the kernel works with
+ * {packages, cores, threads}, and we will map:
+ *
+ * thread = core in compute-unit (CMT), or thread in core (SMT)
+ * core = compute-unit (CMT), or core (SMT)
+ * package = node (die)
+ *
+ * Discover topology based on available information from CPUID first,
+ * and only derive them as needed.
+ *
+ * (1) phys_proc_id is die ID in AMD multi-die processors.
* Assumption: Number of cores in each internal node is the same.
- * (2) AMD processors supporting compute units
+ * (2) cpu_core_id is derived from either CPUID topology extension
+ * or initial APIC_ID.
+ * (3) cpu_llc_id is either L3 or per-node
*/
-#ifdef CONFIG_SMP
static void amd_get_topology(struct cpuinfo_x86 *c)
{
- u8 node_id;
int cpu = smp_processor_id();
- /* get information required for multi-node processors */
if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
u32 eax, ebx, ecx, edx;
cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
- node_id = ecx & 0xff;
+ c->phys_proc_id = ecx & 0xff;
smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
- if (c->x86 == 0x15)
- c->cu_id = ebx & 0xff;
-
- if (c->x86 >= 0x17) {
- c->cpu_core_id = ebx & 0xff;
-
- if (smp_num_siblings > 1)
- c->x86_max_cores /= smp_num_siblings;
- }
+ /* Adjustment to get core per die */
+ c->x86_max_cores /= smp_num_siblings;
/*
- * We may have multiple LLCs if L3 caches exist, so check if we
- * have an L3 cache by looking at the L3 cache CPUID leaf.
+ * For family15h/16h, this is ComputeUnitId per socket
+ * For family17h, this is CoreId per socket
*/
+ c->cpu_core_id = (ebx & 0xff);
+
if (cpuid_edx(0x80000006)) {
- if (c->x86 == 0x17) {
+ cpuid_count(0x8000001d, 3, &eax, &ebx, &ecx, &edx);
+ l3_num_threads_sharing = ((eax >> 14) & 0xfff) + 1;
+ }
+
+ if (c->x86 == 0x17) {
+ /*
+ * In family 17h, the CPUID_Fn8000001E_EBX[7:0] (CoreId)
+ * is non-contiguous in down-coring and non-SMT cases.
+ * This logic fixes up the cpu_core_id to be contiguous
+ * for cores within the die.
+ */
+ u32 tmp = c->cpu_core_id;
+ u32 die_offset, ccx_offset, cpu_offset;
+
+ if (smp_num_siblings == 1) {
/*
- * LLC is at the core complex level.
- * Core complex id is ApicId[3].
+ * For SMT-disabled case, the CoreId bit-encoding is
+ * [7:4] : die
+ * [3] : ccx
+ * [2:0] : core
*/
- per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
+ die_offset = ((tmp >> 4) & 0xf) * c->x86_max_cores;
+ ccx_offset = ((tmp >> 3) & 1) * l3_num_threads_sharing;
+ cpu_offset = tmp & 7;
} else {
- /* LLC is at the node level. */
- per_cpu(cpu_llc_id, cpu) = node_id;
+ /*
+ * For SMT-enabled case, the CoreId bit-encoding is
+ * [7:3] : die
+ * [2] : ccx
+ * [1:0] : core
+ */
+ die_offset = ((tmp >> 3) & 0x1f) * c->x86_max_cores;
+ ccx_offset = ((tmp >> 2) & 1) * l3_num_threads_sharing / smp_num_siblings;
+ cpu_offset = tmp & 3;
}
+ c->cpu_core_id = die_offset + ccx_offset + cpu_offset;
+ pr_debug("Fixup CoreId:%#x to cpu_core_id:%#x\n", tmp, c->cpu_core_id);
}
- } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
- u64 value;
+ } else {
+ if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
+ u64 value;
- rdmsrl(MSR_FAM10H_NODE_ID, value);
- node_id = value & 7;
-
- per_cpu(cpu_llc_id, cpu) = node_id;
- } else
- return;
-
- /* fixup multi-node processor information */
- if (nodes_per_socket > 1) {
- u32 cus_per_node;
-
- set_cpu_cap(c, X86_FEATURE_AMD_DCM);
- cus_per_node = c->x86_max_cores / nodes_per_socket;
+ /* Use MSR provided node ID */
+ rdmsrl(MSR_FAM10H_NODE_ID, value);
+ c->phys_proc_id = value & 7;
+ } else {
+ /*
+ * On older AMD dual core setup the lower
+ * bits of the APIC id distinguish the cores.
+ * Assumes number of cores is a power of two.
+ */
+ c->phys_proc_id = c->initial_apicid >> c->x86_coreid_bits;
+ }
- /* core id has to be in the [0 .. cores_per_node - 1] range */
- c->cpu_core_id %= cus_per_node;
+ /* Get core id from APIC */
+ c->cpu_core_id = c->initial_apicid & ((1 << c->x86_coreid_bits) - 1);
}
-}
-#endif
-/*
- * On a AMD dual core setup the lower bits of the APIC id distinguish the cores.
- * Assumes number of cores is a power of two.
- */
-static void amd_detect_cmp(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_SMP
- unsigned bits;
- int cpu = smp_processor_id();
+ /* core id has to be in the [0 .. cores_per_die - 1] range */
+ c->cpu_core_id %= c->x86_max_cores;
- bits = c->x86_coreid_bits;
- /* Low order bits define the core id (index of core in socket) */
- c->cpu_core_id = c->initial_apicid & ((1 << bits)-1);
- /* Convert the initial APIC ID into the socket ID */
- c->phys_proc_id = c->initial_apicid >> bits;
- /* use socket ID also for last level cache */
+ /* Default LLC is at the die level. */
per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
- amd_get_topology(c);
-#endif
+
+ /*
+ * We may have multiple LLCs if L3 caches exist, so check if we
+ * have an L3 cache by looking at the L3 cache CPUID leaf.
+ * For family17h, LLC is at the core complex level.
+ * Core complex id is ApicId[3].
+ */
+ if (cpuid_edx(0x80000006) && c->x86 == 0x17)
+ per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
+
}
+#endif
u16 amd_get_nb_id(int cpu)
{
@@ -412,7 +446,7 @@ static void srat_detect_node(struct cpuinfo_x86 *c)
node = numa_cpu_node(cpu);
if (node == NUMA_NO_NODE)
- node = per_cpu(cpu_llc_id, cpu);
+ node = c->phys_proc_id;
/*
* On multi-fabric platform (e.g. Numascale NumaChip) a
@@ -457,26 +491,23 @@ static void srat_detect_node(struct cpuinfo_x86 *c)
static void early_init_amd_mc(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_SMP
- unsigned bits, ecx;
+ u32 threads_per_socket;
/* Multi core CPU? */
if (c->extended_cpuid_level < 0x80000008)
return;
- ecx = cpuid_ecx(0x80000008);
-
- c->x86_max_cores = (ecx & 0xff) + 1;
-
- /* CPU telling us the core id bits shift? */
- bits = (ecx >> 12) & 0xF;
-
- /* Otherwise recompute */
- if (bits == 0) {
- while ((1 << bits) < c->x86_max_cores)
- bits++;
- }
+ /* Threads per socket */
+ threads_per_socket = (cpuid_ecx(0x80000008) & 0xff) + 1;
+ /* Thread per die */
+ c->x86_max_cores = threads_per_socket / nodes_per_socket;
- c->x86_coreid_bits = bits;
+ /*
+ * This is per socket, and should only be used to decode APIC ID,
+ * which is needed on older systems where X86_FEATURE_TOPOEXT
+ * is not supported.
+ */
+ c->x86_coreid_bits = get_count_order(threads_per_socket);
#endif
}
@@ -765,11 +796,15 @@ static void init_amd(struct cpuinfo_x86 *c)
cpu_detect_cache_sizes(c);
- /* Multi core CPU? */
+#ifdef CONFIG_SMP
if (c->extended_cpuid_level >= 0x80000008) {
- amd_detect_cmp(c);
+ amd_get_topology(c);
srat_detect_node(c);
}
+#endif
+ /* Multi-die? */
+ if (nodes_per_socket > 1)
+ set_cpu_cap(c, X86_FEATURE_AMD_DCM);
#ifdef CONFIG_X86_32
detect_ht(c);
--
2.7.4
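For readers unfamiliar with the CoreId fixup above, here is a minimal user-space
sketch of the same arithmetic, using the bit layouts from the comments in the
hunk. The function name, the standalone-program framing and the example
parameters (standing in for c->x86_max_cores, smp_num_siblings and
l3_num_threads_sharing) are this example's own, not part of the patch:

#include <stdio.h>

/*
 * Illustrative only: convert a family-17h CPUID_Fn8000001E_EBX[7:0] CoreId
 * into a contiguous per-die core id, using the bit layouts described in the
 * comments of the hunk above.  cores_per_die and l3_sharing stand in for
 * c->x86_max_cores and l3_num_threads_sharing.
 */
static unsigned int fixup_core_id(unsigned int core_id, unsigned int smt_siblings,
                                  unsigned int cores_per_die, unsigned int l3_sharing)
{
        unsigned int die_off, ccx_off, cpu_off;

        if (smt_siblings == 1) {
                /* SMT disabled: [7:4] die, [3] ccx, [2:0] core */
                die_off = ((core_id >> 4) & 0xf) * cores_per_die;
                ccx_off = ((core_id >> 3) & 1) * l3_sharing;
                cpu_off = core_id & 7;
        } else {
                /* SMT enabled: [7:3] die, [2] ccx, [1:0] core */
                die_off = ((core_id >> 3) & 0x1f) * cores_per_die;
                ccx_off = ((core_id >> 2) & 1) * l3_sharing / smt_siblings;
                cpu_off = core_id & 3;
        }

        return die_off + ccx_off + cpu_off;
}

int main(void)
{
        unsigned int raw;

        /* Example: SMT enabled, 8 cores per die, 8 threads sharing each L3. */
        for (raw = 0; raw < 8; raw++)
                printf("CoreId %#x -> cpu_core_id %u\n",
                       raw, fixup_core_id(raw, 2, 8, 8));

        return 0;
}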
CPUID_Fn8000001D_EAX_x03: Cache Properties (L3) [NumSharingCache]
should be used to determine whether the last-level cache (LLC) ID is
the same as the die ID or the core-complex (CCX) ID. In the former case,
the number of threads within a die would be the same as the number of
threads sharing the cache. This information is available if the CPU topology
extension is supported (e.g. since family15h).
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Leo Duran <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
Cc: <[email protected]> # v4.10+
---
arch/x86/kernel/cpu/amd.c | 40 ++++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 2f5869c..faa4ec3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -306,6 +306,29 @@ static int nearby_node(int apicid)
#ifdef CONFIG_SMP
+static void amd_get_llc_id(struct cpuinfo_x86 *c)
+{
+ int cpu = smp_processor_id();
+
+ /* Default LLC is at the node level. */
+ per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+
+ /*
+ * We may have multiple LLCs per die if L3 caches exist.
+ * Currently, the only case where LLC (L3) is not
+ * at the die level is when LLC is at the core complex (CCX) level.
+ * So, enumerate cpu_llc_id using ccx_id.
+ */
+ if (l3_num_threads_sharing &&
+ l3_num_threads_sharing < (c->x86_max_cores * smp_num_siblings)) {
+ u32 cpu_id = (c->phys_proc_id * c->x86_max_cores) + c->cpu_core_id;
+ u32 ccx_id = cpu_id * smp_num_siblings / l3_num_threads_sharing;
+
+ per_cpu(cpu_llc_id, cpu) = ccx_id;
+ pr_debug("Use ccx ID as llc ID: %#x\n", ccx_id);
+ }
+}
+
/*
* Per Documentation/x86/topology.c, the kernel works with
* {packages, cores, threads}, and we will map:
@@ -321,12 +344,9 @@ static int nearby_node(int apicid)
* Assumption: Number of cores in each internal node is the same.
* (2) cpu_core_id is derived from either CPUID topology extension
* or initial APIC_ID.
- * (3) cpu_llc_id is either L3 or per-node
*/
static void amd_get_topology(struct cpuinfo_x86 *c)
{
- int cpu = smp_processor_id();
-
if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
u32 eax, ebx, ecx, edx;
@@ -405,19 +425,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
/* core id has to be in the [0 .. cores_per_die - 1] range */
c->cpu_core_id %= c->x86_max_cores;
-
- /* Default LLC is at the die level. */
- per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
-
- /*
- * We may have multiple LLCs if L3 caches exist, so check if we
- * have an L3 cache by looking at the L3 cache CPUID leaf.
- * For family17h, LLC is at the core complex level.
- * Core complex id is ApicId[3].
- */
- if (cpuid_edx(0x80000006) && c->x86 == 0x17)
- per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
-
}
#endif
@@ -799,6 +806,7 @@ static void init_amd(struct cpuinfo_x86 *c)
#ifdef CONFIG_SMP
if (c->extended_cpuid_level >= 0x80000008) {
amd_get_topology(c);
+ amd_get_llc_id(c);
srat_detect_node(c);
}
#endif
--
2.7.4
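For illustration, a hedged user-space sketch of the NumSharingCache decoding
and the die-vs-CCX check the commit message describes. It assumes GCC/clang's
<cpuid.h> helpers, and the cores_per_die/smt_siblings values are placeholders
for what the kernel derives from other CPUID leaves:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx, l3_sharing;
        unsigned int cores_per_die = 8, smt_siblings = 2;       /* placeholder values */

        /* CPUID_Fn8000001D_EAX_x03: Cache Properties (L3) */
        if (!__get_cpuid_count(0x8000001d, 3, &eax, &ebx, &ecx, &edx)) {
                fprintf(stderr, "CPUID leaf 0x8000001d not available\n");
                return 1;
        }

        /* NumSharingCache is EAX[25:14], stored as (number of threads - 1). */
        l3_sharing = ((eax >> 14) & 0xfff) + 1;
        printf("threads sharing the L3: %u\n", l3_sharing);

        /*
         * Per the commit message: if the threads sharing the L3 cover the
         * whole die, the LLC id can be the die id; otherwise the LLC is at
         * the core-complex (CCX) level.
         */
        if (l3_sharing == cores_per_die * smt_siblings)
                printf("LLC is at the die level\n");
        else
                printf("LLC is at the core-complex (CCX) level\n");

        return 0;
}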
Remove stable from CC - this is not how you do a stable submission.
See Documentation/process/stable-kernel-rules.rst
On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
> According to the Documentation/x86/topology.txt, AMD nomenclature for
> package is NUMA node (or die).
You guys keep talking about die so you should add the definition of
"die" to that document so that we're all on the same page. In a separate
patch.
Which reminds me, this patch does a gazillion things. It needs to be
split.
> However, this is not the case on AMD family17h multi-die processor
> platforms, which can have up to 4 dies per socket as shown in the
> following system topology.
So what exactly does that mean?
A die is a package on ZN and you can have up to 4 packages on a physical
socket?
Please be specific and verbose.
> Die (Dx) View :
> ----------------------------
> C0 | T0 T1 | || | T0 T1 | C4
> --------| || |--------
> C1 | T0 T1 | L3 || L3 | T0 T1 | C5
> --------| || |--------
> C2 | T0 T1 | #0 || #1 | T0 T1 | C6
> --------| || |--------
> C3 | T0 T1 | || | T0 T1 | C7
> ----------------------------
>
> System View (with 2 socket) :
> --------------------
> | -------------|------
> | | | |
> ------------ ------------
> | D1 -- D0 | | D7 -- D6 |
> | | \/ | | | | \/ | |
> SOCKET0 | | /\ | | | | /\ | | SOCKET1
> | D2 -- D3 | | D4 -- D5 |
> ------------ ------------
> | | | |
> ------|------------| |
> --------------------
>
> Current logic interpretes package as socket (i.e. phys_proc_id is
> socket id), which results in setting x86_has_numa_in_package, and omits
> the DIE schedule domain.
Well, this is what we do on MCM Magny-Cours. Btw, here's where you
explain *why* you need the DIE domain.
> However, NUMA schedule domains are derived from
> SRAT/SLIT, which assumes NUMA node is a die,
Ok, so I don't understand this: SRAT/SLIT should basically say that you
have 4 *logical* NUMA nodes on that *physical* socket. Is that what you
want?
Or is not what you want?
Because it makes sense to me that BIOS should describe how the logical
grouping of the cores in a node is. And x86_numa_in_package_topology
roughly means that we rely on SRAT/SLIT to tell us the topology.
So why can't SRAT/SLIT do that just like they did on Magny-Cours MCM
packages?
> and build NUMA schedule
> domains on top of NUMA nodes. This results in incomplete schedule domains
> as following:
> domain 0: SMT
> domain 1: MC /* core complex w/ shared L3*/
> ---- Missing DIE level domain ----
> domain 2: NUMA /* socket */
> domain 3: NUMA /* platform */
>
> Presenting package-as-die does not set x86_has_numa_in_package.
>
> Signed-off-by: Suravee Suthikulpanit <[email protected]>
> Signed-off-by: Leo Duran <[email protected]>
> Signed-off-by: Yazen Ghannam <[email protected]>
This SOB chain is wrong, of course. Lemme guess, you want to say that
you all three worked on this patch, right?
Well, this is not how we express that. Please read this:
Documentation/process/submitting-patches.rst, section 11
And while you're at it, read section 3 too pls.
> Cc: <[email protected]> # v4.10+
> ---
> arch/x86/kernel/cpu/amd.c | 189 +++++++++++++++++++++++++++-------------------
> 1 file changed, 112 insertions(+), 77 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index bb5abe8..2f5869c 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1,3 +1,5 @@
> +#define pr_fmt(fmt) "x86/AMD: " fmt
> +
This needs to go in a separate patch along with auditing all pr_* calls
in that file and removing any other prefixes they have.
> #include <linux/export.h>
> #include <linux/bitops.h>
> #include <linux/elf.h>
> @@ -32,6 +34,12 @@ static bool cpu_has_amd_erratum(struct cpuinfo_x86 *cpu, const int *erratum);
> */
> static u32 nodes_per_socket = 1;
>
> +/*
> + * l3_num_threads_sharing: Stores the number of threads sharing L3 cache.
> + * Refer to CPUID_Fn8000001D_EAX_x03 [Cache Properties (L3)] NumSharingCache.
> + */
> +static u32 l3_num_threads_sharing;
u32 is clearly too much. I know, nodes_per_socket is u32 too but that
should be turned into a u8 too. In a separate patch.
> +
> static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
> {
> u32 gprs[8] = { 0 };
> @@ -296,96 +304,122 @@ static int nearby_node(int apicid)
> }
> #endif
>
> +#ifdef CONFIG_SMP
> +
This is an unnecessary move.
> /*
> - * Fixup core topology information for
> - * (1) AMD multi-node processors
> + * Per Documentation/x86/topology.c,
$ ls Documentation/x86/topology.c
ls: cannot access 'Documentation/x86/topology.c': No such file or directory
> + * the kernel works with
> + * {packages, cores, threads}, and we will map:
> + *
> + * thread = core in compute-unit (CMT), or thread in core (SMT)
> + * core = compute-unit (CMT), or core (SMT)
> + * package = node (die)
Put all those definitions into Documentation/x86/topology.txt and point
only to that file in the comment here.
> + *
> + * Discover topology based on available information from CPUID first,
> + * and only derive them as needed.
> + *
> + * (1) phys_proc_id is die ID in AMD multi-die processors.
> * Assumption: Number of cores in each internal node is the same.
So that's wrong:
- cpuinfo_x86.phys_proc_id:
The physical ID of the package. This information is retrieved via CPUID
and deduced from the APIC IDs of the cores in the package.
> - * (2) AMD processors supporting compute units
> + * (2) cpu_core_id is derived from either CPUID topology extension
> + * or initial APIC_ID.
> + * (3) cpu_llc_id is either L3 or per-node
Move to Documentation/x86/topology.txt
> */
> -#ifdef CONFIG_SMP
> static void amd_get_topology(struct cpuinfo_x86 *c)
> {
> - u8 node_id;
> int cpu = smp_processor_id();
>
> - /* get information required for multi-node processors */
> if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
> u32 eax, ebx, ecx, edx;
>
> cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
>
> - node_id = ecx & 0xff;
> + c->phys_proc_id = ecx & 0xff;
> smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
>
> - if (c->x86 == 0x15)
> - c->cu_id = ebx & 0xff;
You can't remove this, of course, see
79a8b9aa388b ("x86/CPU/AMD: Bring back Compute Unit ID")
And so on and so on.
So I'm going to stop reviewing here because:
* this patch needs to be split
* I still don't understand why we need all that dance and not describe the
topology properly with SRAT/SLIT
Thanks.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Hi Boris,
On 6/27/17 17:48, Borislav Petkov wrote:
> On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
>> According to the Documentation/x86/topology.txt, AMD nomenclature for
>> package is NUMA node (or die).
>
> You guys keep talking about die so you should add the definition of
> "die" to that document so that we're all on the same page. In a separate
> patch.
What we are trying to point out here is that (NUMA) "node" and "die" are the same
thing on most AMD processors; we are not necessarily trying to introduce another term
here.
>> However, this is not the case on AMD family17h multi-die processor
>> platforms, which can have up to 4 dies per socket as shown in the
>> following system topology.
>
> So what exactly does that mean? A die is a package on ZN and you can have up to 4 packages on a physical socket?
Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
>> Current logic interpretes package as socket (i.e. phys_proc_id is
>> socket id), which results in setting x86_has_numa_in_package, and omits
>> the DIE schedule domain.
>
> Well, this is what we do on MCM Magny-Cours. Btw, here's where you
> explain *why* you need the DIE domain.
As I have described in the cover letter, this patch series changes how the kernel
derives the cpu "package" from package-as-socket to package-as-die in order to fix the
following issues on AMD family17h multi-die processor platforms:
* irqbalance fails to allocate IRQs to individual CPUs within the die.
* The scheduler fails to load-balance across 8 threads within a die
(e.g. running an 8-thread application w/ taskset -c 0-7) with
the DIE schedule domain omitted due to x86_has_numa_in_package.
These issues are fixed when the package is properly interpreted as the die.
For MCM Magny-Cours (family15h), we do not have this issue because the MC
sched-domain has the same cpumask as the DIE sched-domain. This is not the same
as for Zen cores (family17h), where the MC sched-domain is the CCX (2 CCXs per die).
>> However, NUMA schedule domains are derived from
>> SRAT/SLIT, which assumes NUMA node is a die,
>
> Ok, so I don't understand this: SRAT/SLIT should basically say that you
> have 4 *logical* NUMA nodes on that *physical* socket. Is that what you
> want? Or is not what you want? Because it makes sense to me that BIOS
> should describe how the logical grouping of the cores in a node is.
> And x86_numa_in_package_topology roughly means that we rely on SRAT/SLIT
> to tell us the topology.
The SRAT/SLIT tables are doing the right thing, in that they describe the topology among
*logical* NUMA nodes (dies). So, for the system view below:
>> System View (with 2 socket) :
>> --------------------
>> | -------------|------
>> | | | |
>> ------------ ------------
>> | D1 -- D0 | | D7 -- D6 |
>> | | \/ | | | | \/ | |
>> SOCKET0 | | /\ | | | | /\ | | SOCKET1
>> | D2 -- D3 | | D4 -- D5 |
>> ------------ ------------
>> | | | |
>> ------|------------| |
>> --------------------
SLIT table is showing
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
However, SRAT/SLIT does not describe the DIE. So, using
x86_numa_in_package_topology on a multi-die Zen processor will result in missing
the DIE sched-domain for cpus within a die.
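As a quick user-space cross-check of the matrix above, here is a small sketch
assuming the standard sysfs node interface and this example's 8 nodes:

#include <stdio.h>

int main(void)
{
        char path[64], buf[256];
        int node;

        /* Each line printed here mirrors one row of the SLIT matrix above. */
        for (node = 0; node < 8; node++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/distance", node);
                f = fopen(path, "r");
                if (!f)
                        break;
                if (fgets(buf, sizeof(buf), f))
                        printf("node %d: %s", node, buf);
                fclose(f);
        }

        return 0;
}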
> So why can't SRAT/SLIT do that just like they did on Magny-Cours MCM
> packages?
Again, on Magny-Cours MCM the DIE sched-domain is the same as the MC sched-domain, so we
can omit the DIE sched-domain. Here is an example from /proc/schedstat:
Magny-Cours cpu0
domain0 00000000,00000003 (SMT)
domain1 00000000,000000ff (MC which is the same as DIE)
domain2 00ff00ff,00ffffff (NUMA 1 hop)
domain3 ffffffff,ffffffff (NUMA platform)
Zen cpu0 (package-as-socket)
domain0 00000000,00000001,00000000,00000001 (SMT)
domain1 00000000,0000000f,00000000,0000000f (MC ccx)
domain2 00000000,ffffffff,00000000,ffffffff (NUMA socket)
domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA platform)
Zen cpu0 (package-as-die)
domain0 00000000,00000001,00000000,00000001 (SMT)
domain1 00000000,0000000f,00000000,0000000f (MC ccx)
domain2 00000000,000000ff,00000000,000000ff (DIE)
domain3 00000000,ffffffff,00000000,ffffffff (NUMA socket)
domain4 ffffffff,ffffffff,ffffffff,ffffffff (NUMA platform)
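For reference, a small sketch for decoding such masks, assuming the usual
comma-separated 32-bit-word, most-significant-first format; it confirms that
00000000,000000ff,00000000,000000ff covers cpus 0-7 and 64-71:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Print the CPU numbers set in a comma-separated hex cpumask string,
 * e.g. as printed in /proc/schedstat.  Words are assumed to be 32 bits
 * wide and listed most-significant first.
 */
static void print_cpumask(const char *mask)
{
        unsigned long words[16];
        char copy[128], *tok;
        int nwords = 0, i, bit;

        strncpy(copy, mask, sizeof(copy) - 1);
        copy[sizeof(copy) - 1] = '\0';

        for (tok = strtok(copy, ","); tok && nwords < 16; tok = strtok(NULL, ","))
                words[nwords++] = strtoul(tok, NULL, 16);

        for (i = 0; i < nwords; i++) {
                /* words[] is MSW-first, so walk it backwards */
                unsigned long w = words[nwords - 1 - i];

                for (bit = 0; bit < 32; bit++)
                        if (w & (1UL << bit))
                                printf("%d ", i * 32 + bit);
        }
        printf("\n");
}

int main(void)
{
        /* The DIE-level mask from the package-as-die example above. */
        print_cpumask("00000000,000000ff,00000000,000000ff");   /* -> 0-7 and 64-71 */
        return 0;
}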
Hope this helps clarify the purpose of this patch series. If there are no other concerns
about going in this direction, I'll split and clean up the patch series per your
initial review comments.
Thanks,
Suravee
On Tue, Jun 27, 2017 at 08:07:10PM +0700, Suravee Suthikulpanit wrote:
> What we are trying point out here is that (NUMA) "node" and "die" are the
> same thing in most of AMD processors, not necessary trying to introduce
> another term here.
So don't use it then. The whole topology topic is confusing as it is to
people so that every time I, for example, have to explain myself with an
example when talking about it. Adding a "die" into the mix makes it more
confusing, not less.
So pick the terms, please, and document them properly so that we all are
on the same page when talking about topology.
> Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
See above.
I'd like to have the topology terminology all explained and written
down, pls.
> As I have described in the cover letter, this patch series changes how
I know what you've described in the cover letter - I've read it. I
meant, put that same explanation in the commit message to state *why*
this patch is needed.
> SLIT table is showing
>
> node 0 1 2 3 4 5 6 7
> 0: 10 16 16 16 32 32 32 32
> 1: 16 10 16 16 32 32 32 32
> 2: 16 16 10 16 32 32 32 32
> 3: 16 16 16 10 32 32 32 32
> 4: 32 32 32 32 10 16 16 16
> 5: 32 32 32 32 16 10 16 16
> 6: 32 32 32 32 16 16 10 16
> 7: 32 32 32 32 16 16 16 10
>
> However, SRAT/SLIT does not describe the DIE. So, using
> x86_numa_in_packge_topology on multi-die Zen processor will result in
> missing the DIE sched-domain for cpus within a die.
What does "does not describe the DIE" mean exactly? How exactly you need
to describe a die. And forget the die sched domain - first answer the
question: how is the NUMA info in SRAT/SLIT insufficient for scheduling?
Are you saying, you want to have all threads on a die belong to a
separate scheduling entity?
> Again, Magny-Cours MCM DIE sched-domain is the same as MC sched-domain.
> So, we can omit the DIE sched-domain. Here is an example of /proc/schedstat
Ok, we can forget the DIE thing.
> Magney-Cours cpu0
> domain0 00000000,00000003 (SMT)
> domain1 00000000,000000ff (MC which is the same as DIE)
> domain2 00ff00ff,00ffffff (NUMA 1 hop)
> domain3 ffffffff,ffffffff (NUMA platform)
>
> Zen cpu0 (package-as-socket)
> domain0 00000000,00000001,00000000,00000001 (SMT)
> domain1 00000000,0000000f,00000000,0000000f (MC ccx)
> domain2 00000000,ffffffff,00000000,ffffffff (NUMA socket)
> domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA platform)
>
> Zen cpu0 (package-as-die)
> domain0 00000000,00000001,00000000,00000001 (SMT)
> domain1 00000000,0000000f,00000000,0000000f (MC ccx)
> domain2 00000000,000000ff,00000000,000000ff (DIE)
So this is 8 threads IINM.
You want to have those 8 threads as a separate scheduling entity?
But looking at this picture:
Die (Dx) View :
----------------------------
C0 | T0 T1 | || | T0 T1 | C4
--------| || |--------
C1 | T0 T1 | L3 || L3 | T0 T1 | C5
--------| || |--------
C2 | T0 T1 | #0 || #1 | T0 T1 | C6
--------| || |--------
C3 | T0 T1 | || | T0 T1 | C7
----------------------------
That's 16 threads on a die.
So are you trying to tell me that you want to have all threads sharing
an L3 in a single scheduling domain? Is that it?
Or do you want to have all threads on a die in a single scheduling
domain?
Also, bear in mind that if we start picking
threads/cores/nodes/dies/... apart like this now, it will definitely
need touching in the future when the hw guys change topology again.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, 27 Jun 2017, Suravee Suthikulpanit wrote:
> On 6/27/17 17:48, Borislav Petkov wrote:
> > On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
> > > However, this is not the case on AMD family17h multi-die processor
> > > platforms, which can have up to 4 dies per socket as shown in the
> > > following system topology.
> >
> > So what exactly does that mean? A die is a package on ZN and you can have up
> > to 4 packages on a physical socket?
>
> Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
And why is this relevant at all?
The kernel does not care about sockets. Sockets are electromechanical
components and completely irrelevant.
The kernel cares about :
Threads - Single scheduling unit
Cores - Contains one or more threads
Packages - Contains one or more cores. The cores share L3.
NUMA Node - Contains one or more Packages which share a memory
controller.
I'm not aware of x86 systems which have several Packages
sharing a memory controller, so Package == NUMA Node
(but I might be wrong here).
Platform - Contains one or more Numa Nodes
All the kernel is interested in is the above and the NUMA Node distance so
it knows about memory access latencies. No sockets, no MCMs, that's all
completely useless for the scheduler.
So if the current CPUID stuff gives you the same physical package ID for all
packages in an MCM, then this needs to be fixed at the CPUID/ACPI/BIOS level
and not hacked around in the kernel.
The only reason why an MCM might need its own ID is when it contains
infrastructure which is shared between the packages, but again that's
irrelevant for the scheduler. That'd be only relevant to implement a driver
for that shared infrastructure.
Thanks,
tglx
On 27/06/2017 16:21, Thomas Gleixner wrote:
> On Tue, 27 Jun 2017, Suravee Suthikulpanit wrote:
>> On 6/27/17 17:48, Borislav Petkov wrote:
>>> On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
>>>> However, this is not the case on AMD family17h multi-die processor
>>>> platforms, which can have up to 4 dies per socket as shown in the
>>>> following system topology.
>>> So what exactly does that mean? A die is a package on ZN and you can have up
>>> to 4 packages on a physical socket?
>> Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
> And why is this relevant at all?
>
> The kernel does not care about sockets. Sockets are electromechanical
> components and completely irrelevant.
>
> The kernel cares about :
>
> Threads - Single scheduling unit
>
> Cores - Contains one or more threads
>
> Packages - Contains one or more cores. The cores share L3.
>
> NUMA Node - Contains one or more Packages which share a memory
> controller.
>
> I'm not aware of x86 systems which have several Packages
> sharing a memory controller, so Package == NUMA Node
> (but I might be wrong here).
You often have multiple NUMA nodes inside a single package. That's what
we see in sysfs on Intel (since Haswell, when Cluster-on-Die is enabled)
and on AMD (since Magny-Cours).
Kernel "packages" contains cores whose "physical_package_id" are the
same. Documentation says:
physical_package_id: physical package id of cpu#. Typically
corresponds to a physical socket number, but the actual value
is architecture and platform dependent.
I don't know if it's possible on x86 to have different
physical_package_ids for different cores on a single socket.
Brice
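For completeness, a minimal sketch of reading these ids back from sysfs
(standard topology files; cpu0 chosen arbitrarily):

#include <stdio.h>

static void show(const char *name)
{
        char path[128], buf[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/topology/%s", name);
        f = fopen(path, "r");
        if (!f)
                return;
        if (fgets(buf, sizeof(buf), f))
                printf("%-22s: %s", name, buf);
        fclose(f);
}

int main(void)
{
        /* The ids the scheduler-facing topology is built from, for cpu0. */
        show("physical_package_id");
        show("core_id");
        show("thread_siblings_list");

        return 0;
}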
Hi Thomas, et al,
Just a quick comment below.
Leo.
> -----Original Message-----
> From: Thomas Gleixner [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 9:21 AM
> To: Suthikulpanit, Suravee <[email protected]>
> Cc: Borislav Petkov <[email protected]>; [email protected]; linux-
> [email protected]; Duran, Leo <[email protected]>; Ghannam,
> Yazen <[email protected]>; Peter Zijlstra <[email protected]>
> Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of
> socket
>
> On Tue, 27 Jun 2017, Suravee Suthikulpanit wrote:
> > On 6/27/17 17:48, Borislav Petkov wrote:
> > > On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
> > > > However, this is not the case on AMD family17h multi-die processor
> > > > platforms, which can have up to 4 dies per socket as shown in the
> > > > following system topology.
> > >
> > > So what exactly does that mean? A die is a package on ZN and you can
> > > have up to 4 packages on a physical socket?
> >
> > Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
>
> And why is this relevant at all?
>
> The kernel does not care about sockets. Sockets are electromechanical
> components and completely irrelevant.
>
> The kernel cares about :
>
> Threads - Single scheduling unit
>
> Cores - Contains one or more threads
>
> Packages - Contains one or more cores. The cores share L3.
>
> NUMA Node - Contains one or more Packages which share a memory
> controller.
>
> I'm not aware of x86 systems which have several Packages
> sharing a memory controller, so Package == NUMA Node
> (but I might be wrong here).
>
> Platform - Contains one or more Numa Nodes
[Duran, Leo]
That is my understanding of intent as well... However, regarding the L3:
The sentence 'The cores share L3.' under 'Packages' may give the impression that all cores in a package share an L3.
In our case, we define a Package as a group of cores sharing a memory controller, i.e. a 'Die' in hardware terms.
Also, it turns out that within a Package we may have separate groups of cores each having their own L3 (in hardware terms we refer to those as a 'Core Complex').
Basically, in our case a Package may contain more than one L3 (i.e., in hardware terms, there may be more than one 'Core complex' in a 'Die').
The important point is that all logical processors (threads) that share an L3 have a common "cpu_llc_id".
>
> All the kernel is interested in is the above and the NUMA Node distance so it
> knows about memory access latencies. No sockets, no MCMs, that's all
> completely useless for the scheduler.
>
> So if the current CPUID stuff gives you the same phycial package ID for all
> packages in a MCM, then this needs to be fixed at the CPUID/ACPI/BIOS
> level and not hacked around in the kernel.
>
> The only reason why a MCM might need it's own ID is, when it contains
> infrastructure which is shared between the packages, but again that's
> irrelevant for the scheduler. That'd be only relevant to implement a driver for
> that shared infrastructure.
>
> Thanks,
>
> tglx
On 6/27/17 21:21, Thomas Gleixner wrote:
> On Tue, 27 Jun 2017, Suravee Suthikulpanit wrote:
>> On 6/27/17 17:48, Borislav Petkov wrote:
>>> On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
>>>> However, this is not the case on AMD family17h multi-die processor
>>>> platforms, which can have up to 4 dies per socket as shown in the
>>>> following system topology.
>>>
>>> So what exactly does that mean? A die is a package on ZN and you can have up
>>> to 4 packages on a physical socket?
>>
>> Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
>
> And why is this relevant at all?
>
> The kernel does not care about sockets. Sockets are electromechanical
> components and completely irrelevant.
>
> The kernel cares about :
>
> Threads - Single scheduling unit
>
> Cores - Contains one or more threads
>
> Packages - Contains one or more cores. The cores share L3.
>
> NUMA Node - Contains one or more Packages which share a memory
> controller.
>
> I'm not aware of x86 systems which have several Packages
> sharing a memory controller, so Package == NUMA Node
> (but I might be wrong here).
According to the definition of Packages above, x86_has_numa_in_package does not
map to the topology we are trying to present here since we have 2 L3 caches
(packages) in a NUMA node. So, I am not sure if x86_has_numa_in_package makes
sense here.
> Platform - Contains one or more Numa Nodes
>
> All the kernel is interested in is the above and the NUMA Node distance so
> it knows about memory access latencies. No sockets, no MCMs, that's all
> completely useless for the scheduler.
Agreed with your point here that we do not care about sockets, MCMs, and anything
that has to do with the physical grouping of the cores. It does not really
mean anything as long as we can correctly describe the NUMA node topology.
The reason we are trying to present "package == NUMA node (die)" here is that
topology.txt defines a package to contain a number of cores plus shared
resources (e.g. DRAM controller, shared caches, etc). Since
cpuinfo_x86.phys_proc_id is also defined as the physical ID of the package, and
it is used in arch/x86/kernel/smpboot.c:match_die(), we think it would make
sense to represent package == NUMA node (die) here instead of the socket.
> So if the current CPUID stuff gives you the same phycial package ID for all
> packages in a MCM, then this needs to be fixed at the CPUID/ACPI/BIOS level
> and not hacked around in the kernel.
Unfortunately, the information from CPUID (since the older families) has been
representing the whole physical package topology, which is why we need a fix-up in
the AMD-specific code (in arch/x86/kernel/cpu/amd.c) to correctly derive the
topology information as defined in topology.txt.
Thanks,
Suravee
On Tue, Jun 27, 2017 at 03:48:55PM +0000, Duran, Leo wrote:
> Basically, in our case a Package may contain more than one L3 (i.e.,
> in hardware terms, there may more than one 'Core complex' in a 'Die').
> The important point is that all logical processors (threads) that
> share an L3 have a common "cpu_llc_id".
All that means nothing if it doesn't have any effect on scheduling.
If moving the working set between L3s within the package, as you call
it, is not measurable, i.e., task migration, then we don't care. The
whole exercise is not about modelling the hardware topology accurately
but about presenting sufficiently detailed topology to the scheduler so that
it works optimally on Zen.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jun 27, 2017 at 10:55:52PM +0700, Suravee Suthikulpanit wrote:
> The reason we are trying to present "package == NUMA node (die)" here is
> because the topology.txt defines package to contain a number of cores plus
> shared resources (e.g. DRAM controller, shared caches, etc). Since the
> cpuinfo_x86.phys_proc_id is also defined as the physical ID of the package,
Ok, it seems we will continue talking past each other here. So let's
look at the issues separately:
* irqbalance fails to allocate IRQs to individual CPUs within the die.
Why does it fail? What is the root cause for this?
* The scheduler fails to load-balance across 8 threads within a die
(e.g. running an 8-thread application w/ taskset -c 0-7) with
the DIE schedule domain omitted due to x86_has_numa_in_package.
Why do you need to load-balance within the die? Why not load-balance
within the 0-15 threads?
What are the disadvantages of the situation now?
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, 27 Jun 2017, Duran, Leo wrote:
> > From: Thomas Gleixner [mailto:[email protected]]
> > On Tue, 27 Jun 2017, Suravee Suthikulpanit wrote:
> > > On 6/27/17 17:48, Borislav Petkov wrote:
> > > > On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
> > > > > However, this is not the case on AMD family17h multi-die processor
> > > > > platforms, which can have up to 4 dies per socket as shown in the
> > > > > following system topology.
> > > >
> > > > So what exactly does that mean? A die is a package on ZN and you can
> > > > have up to 4 packages on a physical socket?
> > >
> > > Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
> >
> > And why is this relevant at all?
> >
> > The kernel does not care about sockets. Sockets are electromechanical
> > components and completely irrelevant.
> >
> > The kernel cares about :
> >
> > Threads - Single scheduling unit
> >
> > Cores - Contains one or more threads
> >
> > Packages - Contains one or more cores. The cores share L3.
> >
> > NUMA Node - Contains one or more Packages which share a memory
> > controller.
> >
> > I'm not aware of x86 systems which have several Packages
> > sharing a memory controller, so Package == NUMA Node
> > (but I might be wrong here).
> >
> > Platform - Contains one or more Numa Nodes
> [Duran, Leo]
> That is my understanding of intent as well... However, regarding the L3:
Obviously that is not your understanding.
> The sentence 'The cores share L3.' under 'Packages' may give the
> impression that all cores in a package share an L3.
> In our case, we define a Package a group of cores sharing a memory
> controller, a 'Die' in hardware terms.
>
> Also, it turns out that within a Package we may have separate groups of
> cores each having their own L3 (in hardware terms we refer to those as a
> 'Core Complex').
>
> Basically, in our case a Package may contain more than one L3 (i.e., in
> hardware terms, there may more than one 'Core complex' in a 'Die').
>
> The important point is that all logical processors (threads) that share
> an L3 have a common "cpu_llc_id".
I don't care at all what you call a package. Really. The only relevant view
is what the kernel thinks a package is. And I made that entirely clear.
If you think that's confusing, then feel free to submit patches which
change the names we are using throughout x86 including the documentation.
Thanks,
tglx
Boris,
Please see my comments below.
Leo.
> -----Original Message-----
> From: Borislav Petkov [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 11:12 AM
> To: Duran, Leo <[email protected]>
> Cc: 'Thomas Gleixner' <[email protected]>; Suthikulpanit, Suravee
> <[email protected]>; [email protected]; linux-
> [email protected]; Ghannam, Yazen <[email protected]>;
> Peter Zijlstra <[email protected]>
> Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of
> socket
>
> On Tue, Jun 27, 2017 at 03:48:55PM +0000, Duran, Leo wrote:
> > Basically, in our case a Package may contain more than one L3 (i.e.,
> > in hardware terms, there may more than one 'Core complex' in a 'Die').
> > The important point is that all logical processors (threads) that
> > share an L3 have a common "cpu_llc_id".
>
> All that means nothing if it doesn't have any effect on scheduling.
>
> If moving the working set between L3s within the package, as you call it, is
> not measureable, i.e., task migration, then we don't care. The whole exercise
> is not about modelling the hardware topology accurately but for presenting
> sufficient detailed topology to the scheduler so that it works optimally on
> Zen.
[Duran, Leo]
My understanding is that the intent of the vendor-specific CPU topology code (in our case, amd.c) is to accurately present topology to the kernel, based on the 'abstractions' the kernel has defined
(abstractions being terms like "Package", "cpu_llc_id", etc).
It is then up to the code that consumes the kernel-defined abstractions to ensure optimal use of the returned topology information.
That is,
my understanding is that it is *not* our job at the "amd.c" level to return information that influences the scheduler in any prescribed way.
If that were the case, it would seem to me that either the code consuming the returned topology information is 'broken', or the defined abstractions are 'broken'.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jun 27, 2017 at 04:23:16PM +0000, Duran, Leo wrote:
> My understating is that it is *not* our job at the "amd.c" level to
> return information that influences the scheduler in any prescribed
> way.
Your understanding is wrong.
The abstractions we present to the rest of the kernel is so that other
facilities can operate in generic way without having to know the
processor they're running on.
In your particular case you're trying to address two, AFAIU, scheduling
problems.
So no, we want to tell the scheduler *exactly* what it needs to know.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Thomas,
Please see my comments & question below.
Leo.
> -----Original Message-----
> From: Thomas Gleixner [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 11:19 AM
> To: Duran, Leo <[email protected]>
> Cc: Suthikulpanit, Suravee <[email protected]>; Borislav
> Petkov <[email protected]>; [email protected]; [email protected];
> Ghannam, Yazen <[email protected]>; Peter Zijlstra
> <[email protected]>
> Subject: RE: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of
> socket
>
> On Tue, 27 Jun 2017, Duran, Leo wrote:
> > > From: Thomas Gleixner [mailto:[email protected]] On Tue, 27 Jun
> > > 2017, Suravee Suthikulpanit wrote:
> > > > On 6/27/17 17:48, Borislav Petkov wrote:
> > > > > On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit
> wrote:
> > > > > > However, this is not the case on AMD family17h multi-die
> > > > > > processor platforms, which can have up to 4 dies per socket as
> > > > > > shown in the following system topology.
> > > > >
> > > > > So what exactly does that mean? A die is a package on ZN and you
> > > > > can have up to 4 packages on a physical socket?
> > > >
> > > > Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
> > >
> > > And why is this relevant at all?
> > >
> > > The kernel does not care about sockets. Sockets are
> > > electromechanical components and completely irrelevant.
> > >
> > > The kernel cares about :
> > >
> > > Threads - Single scheduling unit
> > >
> > > Cores - Contains one or more threads
> > >
> > > Packages - Contains one or more cores. The cores share L3.
> > >
> > > NUMA Node - Contains one or more Packages which share a
> memory
> > > controller.
> > >
> > > I'm not aware of x86 systems which have several Packages
> > > sharing a memory controller, so Package == NUMA Node
> > > (but I might be wrong here).
> > >
> > > Platform - Contains one or more Numa Nodes
> > [Duran, Leo]
>
> > That is my understanding of intent as well... However, regarding the L3:
>
> Obviously is that not your understanding.
>
> > The sentence 'The cores share L3.' under 'Packages' may give the
> > impression that all cores in a package share an L3.
> > In our case, we define a Package a group of cores sharing a memory
> > controller, a 'Die' in hardware terms.
> >
> > Also, it turns out that within a Package we may have separate groups
> > of cores each having their own L3 (in hardware terms we refer to those
> > as a 'Core Complex').
> >
> > Basically, in our case a Package may contain more than one L3 (i.e.,
> > in hardware terms, there may more than one 'Core complex' in a 'Die').
> >
> > The important point is that all logical processors (threads) that
> > share an L3 have a common "cpu_llc_id".
>
> I don't care at all what you call a package. Really. The only relevant view is
> what the kernel thinks a package is. And I made that entirely clear.
[Duran, Leo]
Well, our intent is to return what the kernel expects.
So, if we're not doing that properly, we need to fix it.
To be clear,
are you saying that there can only be a single L3 per 'Package'?
>
> If you think that's confusing, then feel free to submit patches which change
> the names we are using throughout x86 including the documentation.
>
> Thanks,
>
> tglx
Boris,
Are you saying that "amd.c' should be scheduler-aware?.. Really?
If so, are you saying that information returned by kernel-defined terms like 'Package', 'Core', etc, should done in the context of understanding the scheduler,
rather than in the context what is being documented for those terms to actually mean or represent.
I'd hope that "amd.c" should be doing the latter... and that perhaps we're simply not returning information as specified by the intended definition of those terms (in which case we need to fix our code)
Leo.
> -----Original Message-----
> From: Borislav Petkov [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 11:32 AM
> To: Duran, Leo <[email protected]>
> Cc: 'Thomas Gleixner' <[email protected]>; Suthikulpanit, Suravee
> <[email protected]>; [email protected]; linux-
> [email protected]; Ghannam, Yazen <[email protected]>;
> Peter Zijlstra <[email protected]>
> Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of
> socket
>
> On Tue, Jun 27, 2017 at 04:23:16PM +0000, Duran, Leo wrote:
> > My understating is that it is *not* our job at the "amd.c" level to
> > return information that influences the scheduler in any prescribed
> > way.
>
> Your understanding is wrong.
>
> The abstractions we present to the rest of the kernel is so that other facilities
> can operate in generic way without having to know the processor they're
> running on.
>
> In your particular case you're trying to address two, AFAIU, scheduling
> problems.
>
> So no, we want to tell the scheduler *exactly* what it needs to know.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jun 27, 2017 at 04:42:32PM +0000, Duran, Leo wrote:
First of all, please do not top-post.
> Are you saying that "amd.c' should be scheduler-aware?.. Really?
Please read again what I said.
> If so, are you saying that information returned by kernel-defined
> terms like 'Package', 'Core',
"information returned by kernel-defined terms"... hmmm, I don't know
what that means.
> etc, should done in the context of understanding the scheduler, rather
> than in the context what is being documented for those terms to
> actually mean or represent.
-ENOPARSE.
> I'd hope that "amd.c" should be doing the latter... and that perhaps
> we're simply not returning information as specified by the intended
> definition of those terms (in which case we need to fix our code)
-ENOPARSE.
I can't really understand what you're trying to tell me here.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
Boris,
On 6/27/17 20:42, Borislav Petkov wrote:
> On Tue, Jun 27, 2017 at 08:07:10PM +0700, Suravee Suthikulpanit wrote:
>> What we are trying point out here is that (NUMA) "node" and "die" are the
>> same thing in most of AMD processors, not necessary trying to introduce
>> another term here.
>
> So don't use it then. The whole topology topic is confusing as it is to
> people so that every time I, for example, have to explain myself with an
> example when talking about it. Adding a "die" into the mix makes it more
> confusing, not less.
>
> So pick the terms, please, and document them properly so that we all are
> on the same page when talking about topology.
>
>
>> Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
>
> See above.
>
> I'd like to have the topology terminology all explained and written
> down, pls.
>
Sure, I will document the additional terms as you suggested once we agree on the
direction.
>> However, SRAT/SLIT does not describe the DIE. So, using
>> x86_numa_in_packge_topology on multi-die Zen processor will result in
>> missing the DIE sched-domain for cpus within a die.
>
> What does "does not describe the DIE" mean exactly? How exactly you need
> to describe a die. And forget the die sched domain - first answer the
> question: how is the NUMA info in SRAT/SLIT insufficient for scheduling?
>
> Are you saying, you want to have all threads on a die belong to a
> separate scheduling entity?
Please see my comment below.....
>> Zen cpu0 (package-as-die)
>> domain0 00000000,00000001,00000000,00000001 (SMT)
>> domain1 00000000,0000000f,00000000,0000000f (MC ccx)
>> domain2 00000000,000000ff,00000000,000000ff (DIE)
>
> So this is 8 threads IINM.
>
Actually, the DIE sched-domain (domain2) has 16 threads (the cpumask is split
between cpus 0-7 and 64-71, since the BIOS enumerates all T0 threads in the system
before the T1 threads).
> You want to have those 8 threads as a separate scheduling entity?
> But looking at this picture:
>
> Die (Dx) View :
> ----------------------------
> C0 | T0 T1 | || | T0 T1 | C4
> --------| || |--------
> C1 | T0 T1 | L3 || L3 | T0 T1 | C5
> --------| || |--------
> C2 | T0 T1 | #0 || #1 | T0 T1 | C6
> --------| || |--------
> C3 | T0 T1 | || | T0 T1 | C7
> ----------------------------
>
> That's 16 threads on a die.
>
> So are you trying to tell me that you want to have all threads sharing
> an L3 into a single scheduling domain? Is that it?
> Or do you want to have all threads on a die in a single scheduling
> domain?
The 8 threads sharing each L3 are already in the same sched-domain1 (MC CCX).
So, cpu0 is in the same sched-domain1 as cpu1,2,3,64,65,66,67. Here, we need the
DIE sched-domain because it represents all cpus that are in the same NUMA node
(since we have one memory controller per DIE). IIUC, for Zen, w/o the DIE
sched-domain, the scheduler could try to re-balance the tasks from one CCX
(schedule group) to another CCX across NUMA nodes, potentially causing an
unnecessary performance penalty due to remote memory access.
Please note also that the SRAT/SLIT information is used to derive the NUMA
sched-domains, while the DIE sched-domain is a non-NUMA sched-domain (derived from
the CPUID topology extension, which is available on newer families).
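For reference, a hedged user-space sketch of reading that CPUID topology
extension leaf (Fn8000_001E) and decoding the same fields the patch uses;
<cpuid.h> is GCC/clang's helper header and the field names in the comments
follow the patch:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID_Fn8000001E: extended APIC id, core id, node (die) id */
        if (!__get_cpuid(0x8000001e, &eax, &ebx, &ecx, &edx)) {
                fprintf(stderr, "CPUID leaf 0x8000001e not available\n");
                return 1;
        }

        printf("extended APIC id : %#x\n", eax);
        printf("core id          : %u\n", ebx & 0xff);              /* EBX[7:0] */
        printf("threads per core : %u\n", ((ebx >> 8) & 0xff) + 1); /* EBX[15:8] + 1 */
        printf("node (die) id    : %u\n", ecx & 0xff);              /* ECX[7:0] */

        return 0;
}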
Please let me know if I am missing any other points.
Thanks,
Suravee
> -----Original Message-----
> From: Borislav Petkov [mailto:[email protected]]
> Sent: Tuesday, June 27, 2017 11:46 AM
> To: Duran, Leo <[email protected]>
> Cc: 'Thomas Gleixner' <[email protected]>; Suthikulpanit, Suravee
> <[email protected]>; [email protected]; linux-
> [email protected]; Ghannam, Yazen <[email protected]>;
> Peter Zijlstra <[email protected]>
> Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of
> socket
>
> On Tue, Jun 27, 2017 at 04:42:32PM +0000, Duran, Leo wrote:
>
> First of all, please do not top-post.
>
> > Are you saying that "amd.c' should be scheduler-aware?.. Really?
>
> Please read again what I said.
>
> > If so, are you saying that information returned by kernel-defined
> > terms like 'Package', 'Core',
>
> "information returned by kernel-defined terms"... hmmm, I don't know what
> that means.
[Duran, Leo]
I'm referring to: Documentation/x86/topology.txt
Started by Thomas Gleixner <[email protected]> and Borislav Petkov <[email protected]>.
>
> > etc, should done in the context of understanding the scheduler, rather
> > than in the context what is being documented for those terms to
> > actually mean or represent.
>
> -ENOPARSE.
>
> > I'd hope that "amd.c" should be doing the latter... and that perhaps
> > we're simply not returning information as specified by the intended
> > definition of those terms (in which case we need to fix our code)
>
> -ENOPARSE.
>
> I can't really understand what you're trying to tell me here.
>
[Duran, Leo]
OK, let me try again:
I'd hope that "amd.c" returns 'Package', 'Core', et al, in compliance with the document I referred to above.
Allowing code that consumes the returned information to make (hopefully optimal) decisions in a vendor-agnostic way.
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jun 27, 2017 at 11:54:12PM +0700, Suravee Suthikulpanit wrote:
> The 8 threads sharing each L3 are already in the same sched-domain1 (MC
> CCX). So, cpu0 is in the same sched-domain1 as cpu1,2,3,64,65,66,67. Here,
> we need the DIE sched-domain because it represents all cpus that are in the
> same NUMA node (since we have one memory controller per DIE).
So this is still confusing. Please drop the "DIE sched-domain" as that is
something you're trying to define and I'm trying to parse what you're
trying to define and why.
> IIUC, for Zen, w/o the DIE sched-domain, the scheduler could try to
> re-balance the tasks from one CCX (schedule group) to another CCX
> across NUMA node, and
CCX, schedule group, NUMA node, ... now my head is spinning. Do you
see what I mean with agreeing on the nomenclature and proper term
definitions first?
> potentially causing unnecessary performance due to remote memory access.
>
> Please note also that SRAT/SLIT information are used to derive the NUMA
> sched-domains, while the DIE sched-domain is non-NUMA sched-domain (derived
> from CPUID topology extension which is available on newer families).
So let's try to discuss this without using DIE sched-domain, CCX, etc,
and let's start simple.
So in that die graphic:
----------------------------
C0 | T0 T1 | || | T0 T1 | C4
--------| || |--------
C1 | T0 T1 | L3 || L3 | T0 T1 | C5
--------| || |--------
C2 | T0 T1 | #0 || #1 | T0 T1 | C6
--------| || |--------
C3 | T0 T1 | || | T0 T1 | C7
----------------------------
you want all those threads to belong to a single scheduling group.
Correct?
Now that thing has a memory controller attached to it, correct?
If so, why is this thing not a logical NUMA node, as described in
SRAT/SLIT?
If not, what does a NUMA node entail on Zen as described by SRAT/SLIT?
I.e., what is the difference between the two things? I.e., how many dies
as above are in a NUMA node?
Now, SRAT should contain the assignment which core belongs to which
node. Why is that not sufficient?
Ok, that should be enough questions for now. Let's start with them.
Thanks.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tuesday, June 27, 2017 1:44 PM, Borislav Petkov wrote:
> [...]
> you want all those threads to belong to a single scheduling group.
> Correct?
>
> Now that thing has a memory controller attached to it, correct?
>
> If so, why is this thing not a logical NUMA node, as described in SRAT/SLIT?
>
> If not, what does a NUMA node entail on Zen as described by SRAT/SLIT?
> I.e., what is the difference between the two things? I.e., how many dies as
> above are in a NUMA node?
>
> Now, SRAT should contain the assignment which core belongs to which node.
> Why is that not sufficient?
>
> Ok, that should be enough questions for now. Let's start with them.
>
This group is a NUMA node. It is the "identity" NUMA node. Linux skips the
identity NUMA node when finding the NUMA levels. This is fine as long as the
MC domain is equivalent to the identity NUMA node. However, this is not the
case on Zen systems.
We could patch sched/topology.c to not skip the identity NUMA node,
though this will affect all systems, not just AMD.
Something like this:
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..98d856c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1103,6 +1103,8 @@ void sched_init_numa(void)
 	 * node_distance(i,j) in order to avoid cubic time.
 	 */
 	next_distance = curr_distance;
+	sched_domains_numa_distance[level++] = next_distance;
+	sched_domains_numa_levels = level;
 	for (i = 0; i < nr_node_ids; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
 			for (k = 0; k < nr_node_ids; k++) {
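To make the effect of those two added lines concrete, here is a standalone
userspace sketch (explicitly not the kernel loop; the 4x4 distance table is a
made-up example in the style of a one-socket, 4-die SLIT) showing that
recording the identity distance yields one extra, smaller NUMA level:

/*
 * Sketch only: dedup the distances of a SLIT-style table, once skipping
 * the local (identity) distance and once including it.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4

static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 16 },
	{ 16, 10, 16, 16 },
	{ 16, 16, 10, 16 },
	{ 16, 16, 16, 10 },
};

static int find_levels(bool include_identity, int *levels)
{
	int n = 0;

	for (int i = 0; i < NR_NODES; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			int d = node_distance[i][j];
			bool seen = false;

			if (!include_identity && d == node_distance[0][0])
				continue;	/* what the current code effectively does */
			for (int k = 0; k < n; k++)
				if (levels[k] == d)
					seen = true;
			if (!seen)
				levels[n++] = d;
		}
	}
	return n;
}

int main(void)
{
	int levels[NR_NODES * NR_NODES];

	for (int pass = 0; pass < 2; pass++) {
		int n = find_levels(pass, levels);

		printf("%s identity distance: %d NUMA level(s):",
		       pass ? "with   " : "without", n);
		for (int i = 0; i < n; i++)
			printf(" %d", levels[i]);
		printf("\n");
	}
	return 0;
}

The extra level at distance 10 is the one that would give the scheduler a
per-node (i.e. per-die) span below the first remote level.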
Thanks,
Yazen
On Tue, Jun 27, 2017 at 06:32:34PM +0000, Ghannam, Yazen wrote:
> > you want all those threads to belong to a single scheduling group.
> > Correct?
> >
> > Now that thing has a memory controller attached to it, correct?
> >
> > If so, why is this thing not a logical NUMA node, as described in SRAT/SLIT?
> >
> > If not, what does a NUMA node entail on Zen as described by SRAT/SLIT?
> > I.e., what is the difference between the two things? I.e., how many dies as
> > above are in a NUMA node?
> >
> > Now, SRAT should contain the assignment which core belongs to which node.
> > Why is that not sufficient?
> >
> > Ok, that should be enough questions for now. Let's start with them.
> >
>
> This group is a NUMA node. It is the "identity" NUMA node. Linux skips the
Please be more specific. Which group exactly? Which question above are
you answering?
> identity NUMA node when finding the NUMA levels. This is fine as long as the
> MC domain is equivalent to the identity NUMA node. However, this is not the
> case on Zen systems.
>
> We could patch sched/topology.c to not skip the identity NUMA node,
> though this will affect all systems, not just AMD.
We can always add an X86_FEATURE flag, but we need to agree on what you
guys are actually trying to change and why.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 6/28/17 00:44, Borislav Petkov wrote:
> So let's try to discuss this without using DIE sched-domain, CCX, etc,
> and let's start simple.
>
> So in that die graphic:
>
> ----------------------------
> C0 | T0 T1 | || | T0 T1 | C4
> --------| || |--------
> C1 | T0 T1 | L3 || L3 | T0 T1 | C5
> --------| || |--------
> C2 | T0 T1 | #0 || #1 | T0 T1 | C6
> --------| || |--------
> C3 | T0 T1 | || | T0 T1 | C7
> ----------------------------
>
> you want all those threads to belong to a single scheduling group.
> Correct?
Actually, let's be a bit more specific here, since sched-group and
sched-domain mean different things:
(From: Documentation/scheduler/sched-domains.txt)
---- begin snippet ----
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.
---- end snippet ----
So, from the definition above, we would like all those 16 threads to be in the
same sched-domain, where threads from C0,1,2,3 are in the same sched-group, and
threads in C4,5,6,7 are in another sched-group.
> Now that thing has a memory controller attached to it, correct?
Yes
> If so, why is this thing not a logical NUMA node, as described in
> SRAT/SLIT?
Yes, this thing is a logical NUMA node and represented correctly in the SRAT/SLIT.
> Now, SRAT should contain the assignment which core belongs to which
> node. Why is that not sufficient?
Yes, SRAT provides the cpu-to-node mapping, which is sufficient to tell the
scheduler which cpus are within a NUMA node.
However, looking at the current sched-domains below (mask format decoded in
the sketch right after the listing), notice that there is no sched-domain
with 16 threads to represent a NUMA node:
cpu0
domain0 00000000,00000001,00000000,00000001 (SMT)
domain1 00000000,0000000f,00000000,0000000f (MC)
domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)
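(Those masks are in the kernel's usual cpumask format: comma-separated 32-bit
hex words, most significant word first, so domain1 above is cpu0-3 plus their
SMT siblings cpu64-67. A small decoder sketch, assuming only that format:)

/*
 * Sketch: decode a cpumask string such as "00000000,0000000f,00000000,0000000f"
 * into a CPU list. Assumes comma-separated 32-bit hex words, most significant
 * word first, and at most 32 words (1024 CPUs).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void decode_cpumask(const char *mask)
{
	unsigned int words[32];
	int nwords = 0;
	char *copy = strdup(mask);

	for (char *tok = strtok(copy, ","); tok && nwords < 32; tok = strtok(NULL, ","))
		words[nwords++] = (unsigned int)strtoul(tok, NULL, 16);

	printf("%s ->", mask);
	for (int w = nwords - 1; w >= 0; w--) {
		int base = (nwords - 1 - w) * 32;	/* words[0] is most significant */

		for (int bit = 0; bit < 32; bit++)
			if (words[w] & (1u << bit))
				printf(" %d", base + bit);
	}
	printf("\n");
	free(copy);
}

int main(void)
{
	decode_cpumask("00000000,00000001,00000000,00000001");	/* SMT pair */
	decode_cpumask("00000000,0000000f,00000000,0000000f");	/* MC (one CCX) */
	decode_cpumask("00000000,ffffffff,00000000,ffffffff");	/* one socket */
	return 0;
}

Decoding domain2 the same way gives the 64 threads of one socket, which
matches the description below.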
sched-domain2 (which represents a sched-domain containing all cpus within a
socket) would have 8 sched-groups (based on the cpumasks from domain1).
According to the documentation snippet above regarding balancing within a
sched-domain, the scheduler will try to do (NUMA) load-balancing between those
8 groups (spanning 4 NUMA nodes). Here, IINM, it would be more beneficial if
the scheduler first tried to load-balance between the two groups within the
same NUMA node before going across NUMA nodes, in order to minimize memory
latency. This would require another sched-domain between domain 1 and 2 which
represents all 16 threads within a NUMA node (i.e. a die sched-domain); that
would allow the scheduler to load-balance within the NUMA node first, before
going across NUMA nodes.
However, since the current code decides that x86_has_numa_in_package is true, it
omits the die sched-domain. In order to avoid this, we are proposing to
represent cpuinfo_x86.phys_proc_id using NUMA node ID (i.e. die ID). And this is
the main point of the patch series.
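As a rough userspace illustration of what "phys_proc_id = NUMA node / die ID"
means in practice (a sketch only, not the actual patch; it assumes the
family-17h CPUID Fn8000_001E layout with NodeId in ECX[7:0],
NodesPerProcessor-1 in ECX[10:8] and CoreId in EBX[7:0]):

/* Sketch: report the die/node id of the CPU this happens to run on. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x8000001e, &eax, &ebx, &ecx, &edx)) {
		fprintf(stderr, "CPUID leaf 0x8000001e not available\n");
		return 1;
	}

	unsigned int node_id          = ecx & 0xff;
	unsigned int nodes_per_socket = ((ecx >> 8) & 0x7) + 1;
	unsigned int core_id          = ebx & 0xff;

	printf("core id %u, die/node id %u, dies per socket %u\n",
	       core_id, node_id, nodes_per_socket);
	printf("proposed phys_proc_id (package-as-die): %u\n", node_id);
	return 0;
}

Pinning this to different cpus (e.g. with taskset) should report the same
die/node id for all 16 hardware threads of one die on the topology above.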
Thanks,
Suravee
+ Matt and Mel.
On Wed, Jun 28, 2017 at 03:26:10AM +0700, Suravee Suthikulpanit wrote:
> So, from the definition above, we would like all those 16 threads to be in
> the same sched-domain, where threads from C0,1,2,3 are in the same
> sched-group, and threads in C4,5,6,7 are in another sched-group.
Figures, you want to have a sched group per L3.
> > Now that thing has a memory controller attached to it, correct?
>
> Yes
>
> > If so, why is this thing not a logical NUMA node, as described in
> > SRAT/SLIT?
>
> Yes, this thing is a logical NUMA node and represented correctly in the SRAT/SLIT.
>
> > Now, SRAT should contain the assignment which core belongs to which
> > node. Why is that not sufficient?
>
> Yes, SRAT provides the cpu-to-node mapping, which is sufficient to tell the
> scheduler which cpus are within a NUMA node.
>
> However, looking at the current sched-domains below, notice that there is no
> sched-domain with 16 threads to represent a NUMA node:
>
> cpu0
> domain0 00000000,00000001,00000000,00000001 (SMT)
> domain1 00000000,0000000f,00000000,0000000f (MC)
> domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
> domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)
>
> sched-domain2 (which represents a sched-domain containing all cpus within a
> socket) would have 8 sched-groups (based on the cpumasks from domain1).
> According to the documentation snippet above regarding balancing within a
> sched-domain, the scheduler will try to do (NUMA) load-balancing between
> those 8 groups (spanning 4 NUMA nodes). Here, IINM, it would be more
> beneficial if the scheduler first tried to load-balance between the two
> groups within the same NUMA node before going across NUMA nodes, in order to
> minimize memory latency. This would require another sched-domain between
> domain 1 and 2 which represents all 16 threads within a NUMA node (i.e. a
> die sched-domain); that would allow the scheduler to load-balance within the
> NUMA node first, before going across NUMA nodes.
>
> However, since the current code decides that x86_has_numa_in_package is
> true, it omits the die sched-domain. In order to avoid this, we are
> proposing to represent cpuinfo_x86.phys_proc_id using NUMA node ID (i.e. die
> ID). And this is the main point of the patch series.
Ok, so here's what I'm reading from all this: a 16-thread package *is*
represented as a logical NUMA node properly by SRAT/SLIT, after all.
However, sched_init_numa() doesn't parse it properly or something else
along that path keeps the scheduler from seeing the 4 NUMA nodes on the
physical socket.
Because we do use the SLIT table to build the NUMA domains. So
considering that 16T package really is represented properly as a logical
NUMA node, why doesn't it get detected as such?
This is what needs answering first, not some forceful fitting of the
topology into a DIE domain.
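(A quick way to see what SRAT/SLIT actually hand to the kernel is to dump the
node distance table it built; a small sketch that only reads the standard
/sys/devices/system/node/node<N>/distance files and assumes contiguous node
numbering:)

#include <stdio.h>

int main(void)
{
	char path[64], line[256];

	for (int node = 0; ; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);

		FILE *f = fopen(path, "r");
		if (!f)
			break;		/* no more nodes */

		if (fgets(line, sizeof(line), f))
			printf("node%d: %s", node, line);
		fclose(f);
	}
	return 0;
}

On the 2-socket system described earlier one would expect 8 rows here, with a
smaller distance between dies of the same socket than across sockets; that is
the thing to compare against what sched_init_numa() ends up building.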
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Jun 27, 2017 at 08:07:10PM +0700, Suravee Suthikulpanit wrote:
> As I have described in the cover letter, this patch series changes how
Which you didn't send to me... thanks for that.
> kernel derives cpu "package" from package-as-socket to package-as-die in
> order to fix following issues on AMD family17h multi-die processor
> platforms:
> These issues are fixed when properly intepretes package as DIE.
>
> For MCM Magny-Cours (family15h), we do not have this issue because the MC
> sched-domain has the same cpumask as the DIE sched-domain. This is not the
> same as for Zen cores (family17h) where the MC sched-domain is the CCX (2
> CCX per die)
Are you saying that your LLCs are smaller than your NUMA nodes? You do
not in fact have a cache level that spans the memory controller?
Because that is indeed the assumption we have for Intel CoD/SnC and AMD
Magny-Cours.
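(For what it's worth, that can be checked from userspace by comparing cpu0's
L3 sharing list with its node's cpulist; a sketch which assumes that cache
index3 is the L3 and that cpu0 sits in node0:)

#include <stdio.h>
#include <string.h>

static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return 0;
}

int main(void)
{
	char llc[256], node[256];

	if (read_line("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list",
		      llc, sizeof(llc)) ||
	    read_line("/sys/devices/system/node/node0/cpulist",
		      node, sizeof(node)))
		return 1;

	printf("cpu0 LLC shared by   : %s\n", llc);
	printf("cpu0 NUMA node spans : %s\n", node);
	printf("LLC %s the NUMA node\n",
	       strcmp(llc, node) ? "does not cover" : "exactly covers");
	return 0;
}

On the Zen topology described in this thread the L3 list should be 8 threads
while the node list is 16, i.e. the LLC is indeed smaller than the NUMA node
(the case being asked about here).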