2020-12-10 14:30:43

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH v3 0/5] Extend Parsing "ibm,thread-groups" for Shared-L2 information

From: "Gautham R. Shenoy" <[email protected]>

Hi,

This is the v2 of the patchset to extend parsing of "ibm,thread-groups" property
to discover the Shared-L2 cache information.

The previous versions can be found here :

v2 : https://lore.kernel.org/linuxppc-dev/[email protected]/T/#m043ea15d3832658527fca94765202b9cbefd330d

v1 : https://lore.kernel.org/linuxppc-dev/[email protected]/T/#m0fabffa1ea1a2807b362f25c849bb19415216520


Changes form v2-->v3:
* Fixed the build errors reported by the Kernel Test Robot for Patches 4 and 5.

Changes from v1-->v2:
Incorporate the review comments from Srikar and
fix a build error on !PPC64 configs reported by the kernel bot.

* Split Patch 1 into three patches
* First patch ensure that parse_thread_groups() is made generic to
support more than one property.
* Second patch renames cpu_l1_cache_map as
thread_group_l1_cache_map for consistency. No functional impact.
* The third patch makes init_thread_group_l1_cache_map()
generic. No functional impact.

* Patch 2 (Now patch 4): Incorporates the review comments from Srikar simplifying
the changes to update_mask_by_l2()

* Patch 3 (Now patch 5): Fix a build errors for 32-bit configs
reported by the kernel build bot.

Description of the Patchset
===========================
The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed up into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
each with "4" threads each. The "ibm,ppc-interrupt-server#s" of the
first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
the second group is {9,11,13,15}. Property "1" is indicative of
the thread in the group sharing L1 cache, translation cache and
Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups,
each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
the first group is {8,10,12,14} and the
"ibm,ppc-interrupt-server#s" of the second group is
{9,11,13,15}. Property "2" indicates that the threads in each group
share the L2-cache.

The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, currently on platforms where groups of threads share L2
cache, we incorrectly create an extra CACHE level sched-domain that
maps to all the threads of the core.

For example, if "ibm,thread-groups" is
00000001 00000002 00000004 00000000
00000002 00000004 00000006 00000001
00000003 00000005 00000007 00000002
00000002 00000004 00000000 00000002
00000004 00000006 00000001 00000003
00000005 00000007

then, the sub-array
[00000002 00000002 00000004
00000000 00000002 00000004 00000006
00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire
core which is incorrect on such platforms.

This patchset remedies these issues by extending the parsing support
for "ibm,thread-groups" to discover information about multiple
properties being shared by the corresponding groups of threads. In
particular we cano now detect if the groups of threads within a core
share the L2-cache. On such platforms, we populate the populating the
cpu_l2_cache_mask of every CPU to the core-siblings which share L2
with the CPU as specified in the by the "ibm,thread-groups" property
array.

With the patchset, the sched-domain hierarchy is correctly
reported. For eg for CPUs 0,1, with the patchset

CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Testing
==========

With the producer-consumer
testcase(https://github.com/gautshen/misc/tree/master/producer_consumer)
where in the producer thread performs writes to 4096 random locations,
and the consumer thread subsequently reads from those 4096 random
location. We measure the time taken by the consumer to finish the 4096
reads (called an iteration of the consumer). Thus lower the value,
better is the result.

The best case occurs when the producer and consumer are affined to the
same L2 cache domain (Eg: CPU0, CPU2). On the platform with the
thread-groups sharing L2,
|-----------------------------------------------|
| Without Patch |
|-----------|-----------|-----------------------|
| Producer | Consumer | Avg time per Consumer |
| Affinity | Affinity | Iteration |
|-----------|-----------|-----------------------|
| CPU0 | CPU2 | 235us |
|-----------|-----------|-----------------------|
|Not affined|Not affined| 347us |
|-----------------------------------------------|

We see that out-of-box, the average time per consumer iteration is
higher since the tasks can be placed anywhere within the core without
them being in the L2 domain.

|-----------------------------------------------|
| With Patch |
|-----------|-----------|-----------------------|
| Producer | Consumer | Avg time per Consumer |
| Affinity | Affinity | Iteration |
|-----------|-----------|-----------------------|
| CPU0 | CPU2 | 235us |
|-----------|-----------|-----------------------|
|Not affined|Not affined| 236us |
|-----------------------------------------------|

With the patch, since the L2 domain is correctly identified, the
scheduler does the right thing by co-locating the producer and
consumer on the same L2 domain, thereby yielding the out-of-box
performance matching the best case.

Finally, this patchset reports the correct shared_cpu_map/list in the
sysfs for L2 cache on such platforms. With the patchset for CPUs0, 1,
for L2 cache we see the correct shared_cpu_map/list

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:000000,00000055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:000000,000000aa

The patchset has been tested on older platforms which encode only the
L1 sharing information via "ibm,thread-groups" and there is no
regression found.

Gautham R. Shenoy (5):
powerpc/smp: Parse ibm,thread-groups with multiple properties
powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
powerpc/smp: Rename init_thread_group_l1_cache_map() to make it
generic
powerpc/smp: Add support detecting thread-groups sharing L2 cache
powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

arch/powerpc/include/asm/smp.h | 6 +
arch/powerpc/kernel/cacheinfo.c | 30 +++--
arch/powerpc/kernel/smp.c | 241 ++++++++++++++++++++++++++++------------
3 files changed, 198 insertions(+), 79 deletions(-)

--
1.9.4


2020-12-11 00:17:26

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties

From: "Gautham R. Shenoy" <[email protected]>

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed up into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
each with "4" threads each. The "ibm,ppc-interrupt-server#s" of the
first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
the second group is {9,11,13,15}. Property "1" is indicative of
the thread in the group sharing L1 cache, translation cache and
Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups,
each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
the first group is {8,10,12,14} and the
"ibm,ppc-interrupt-server#s" of the second group is
{9,11,13,15}. Property "2" indicates that the threads in each group
share the L2-cache.

The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support on platforms which encode
information about multiple properties being shared by the
corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/kernel/smp.c | 174 ++++++++++++++++++++++++++++++----------------
1 file changed, 113 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..88d88ad 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
unsigned int thread_list[MAX_THREAD_LIST_SIZE];
};

+/* Maximum number of properties that groups of threads within a core can share */
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+ unsigned int nr_properties;
+ struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
/*
* On big-cores system, cpu_l1_cache_map for each CPU corresponds to
* the set its siblings that share the L1-cache.
@@ -695,81 +704,98 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
/*
* parse_thread_groups: Parses the "ibm,thread-groups" device tree
* property for the CPU device node @dn and stores
- * the parsed output in the thread_groups
- * structure @tg if the ibm,thread-groups[0]
- * matches @property.
+ * the parsed output in the thread_groups_list
+ * structure @tglp.
*
* @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
* output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- * interested in.
*
* ibm,thread-groups[0..N-1] array defines which group of threads in
* the CPU-device node can be grouped together based on the property.
*
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
* threads are being grouped together. If this value is 1, it implies
* that the threads in the same group share L1, translation cache.
*
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
+ * property ibm,thread-groups[i]
*
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
* group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
*
- * ibm,thread-groups[3..N-1] is the list of threads identified by
+ * ibm,thread-groups[i+3..i+k+2] (is the list of threads identified by
* "ibm,ppc-interrupt-server#s" arranged as per their membership in
* the grouping.
*
- * Example: If ibm,thread-groups = [1,2,4,5,6,7,8,9,10,11,12] it
- * implies that there are 2 groups of 4 threads each, where each group
- * of threads share L1, translation cache.
+ * Example:
+ * If "ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]
+ * This can be decomposed up into two consecutive arrays:
+ * a) [1,2,4,8,10,12,14,9,11,13,15]
+ * b) [2,2,4,8,10,12,14,9,11,13,15]
+ *
+ * where in,
+ *
+ * a) provides information of Property "1" being shared by "2" groups,
+ * each with "4" threads each. The "ibm,ppc-interrupt-server#s" of
+ * the first group is {8,10,12,14} and the
+ * "ibm,ppc-interrupt-server#s" of the second group is
+ * {9,11,13,15}. Property "1" is indicative of the thread in the
+ * group sharing L1 cache, translation cache and Instruction Data
+ * flow.
*
- * The "ibm,ppc-interrupt-server#s" of the first group is {5,6,7,8}
- * and the "ibm,ppc-interrupt-server#s" of the second group is {9, 10,
- * 11, 12} structure
+ * b) provides information of Property "2" being shared by "2" groups,
+ * each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
+ * the first group is {8,10,12,14} and the
+ * "ibm,ppc-interrupt-server#s" of the second group is
+ * {9,11,13,15}. Property "2" indicates that the threads in each
+ * group share the L2-cache.
*
* Returns 0 on success, -EINVAL if the property does not exist,
* -ENODATA if property does not have a value, and -EOVERFLOW if the
* property data isn't large enough.
*/
static int parse_thread_groups(struct device_node *dn,
- struct thread_groups *tg,
- unsigned int property)
+ struct thread_groups_list *tglp)
{
- int i;
- u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
- u32 *thread_list;
+ unsigned int property_idx = 0;
+ u32 *thread_group_array;
size_t total_threads;
- int ret;
+ int ret = 0, count;
+ u32 *thread_list;
+ int i = 0;

+ count = of_property_count_u32_elems(dn, "ibm,thread-groups");
+ thread_group_array = kcalloc(count, sizeof(u32), GFP_KERNEL);
ret = of_property_read_u32_array(dn, "ibm,thread-groups",
- thread_group_array, 3);
+ thread_group_array, count);
if (ret)
- return ret;
-
- tg->property = thread_group_array[0];
- tg->nr_groups = thread_group_array[1];
- tg->threads_per_group = thread_group_array[2];
- if (tg->property != property ||
- tg->nr_groups < 1 ||
- tg->threads_per_group < 1)
- return -ENODATA;
+ goto out_free;

- total_threads = tg->nr_groups * tg->threads_per_group;
+ while (i < count && property_idx < MAX_THREAD_GROUP_PROPERTIES) {
+ int j;
+ struct thread_groups *tg = &tglp->property_tgs[property_idx++];

- ret = of_property_read_u32_array(dn, "ibm,thread-groups",
- thread_group_array,
- 3 + total_threads);
- if (ret)
- return ret;
+ tg->property = thread_group_array[i];
+ tg->nr_groups = thread_group_array[i + 1];
+ tg->threads_per_group = thread_group_array[i + 2];
+ total_threads = tg->nr_groups * tg->threads_per_group;

- thread_list = &thread_group_array[3];
+ thread_list = &thread_group_array[i + 3];

- for (i = 0 ; i < total_threads; i++)
- tg->thread_list[i] = thread_list[i];
+ for (j = 0; j < total_threads; j++)
+ tg->thread_list[j] = thread_list[j];
+ i = i + 3 + total_threads;
+ }

- return 0;
+ tglp->nr_properties = property_idx;
+
+out_free:
+ kfree(thread_group_array);
+ return ret;
}

/*
@@ -805,50 +831,76 @@ static int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
return -1;
}

+static struct thread_groups *__init get_thread_groups(int cpu,
+ int group_property,
+ int *err)
+{
+ struct device_node *dn = of_get_cpu_node(cpu, NULL);
+ struct thread_groups_list *cpu_tgl = &tgl[cpu];
+ struct thread_groups *tg = NULL;
+ int i;
+ *err = 0;
+
+ if (!dn) {
+ *err = -ENODATA;
+ return NULL;
+ }
+
+ if (!cpu_tgl->nr_properties) {
+ *err = parse_thread_groups(dn, cpu_tgl);
+ if (*err)
+ goto out;
+ }
+
+ for (i = 0; i < cpu_tgl->nr_properties; i++) {
+ if (cpu_tgl->property_tgs[i].property == group_property) {
+ tg = &cpu_tgl->property_tgs[i];
+ break;
+ }
+ }
+
+ if (!tg)
+ *err = -EINVAL;
+out:
+ of_node_put(dn);
+ return tg;
+}
+
static int init_cpu_l1_cache_map(int cpu)

{
- struct device_node *dn = of_get_cpu_node(cpu, NULL);
- struct thread_groups tg = {.property = 0,
- .nr_groups = 0,
- .threads_per_group = 0};
int first_thread = cpu_first_thread_sibling(cpu);
int i, cpu_group_start = -1, err = 0;
+ struct thread_groups *tg = NULL;

- if (!dn)
- return -ENODATA;
-
- err = parse_thread_groups(dn, &tg, THREAD_GROUP_SHARE_L1);
- if (err)
- goto out;
+ tg = get_thread_groups(cpu, THREAD_GROUP_SHARE_L1,
+ &err);
+ if (!tg)
+ return err;

- cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
+ cpu_group_start = get_cpu_thread_group_start(cpu, tg);

if (unlikely(cpu_group_start == -1)) {
WARN_ON_ONCE(1);
- err = -ENODATA;
- goto out;
+ return -ENODATA;
}

zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));

for (i = first_thread; i < first_thread + threads_per_core; i++) {
- int i_group_start = get_cpu_thread_group_start(i, &tg);
+ int i_group_start = get_cpu_thread_group_start(i, tg);

if (unlikely(i_group_start == -1)) {
WARN_ON_ONCE(1);
- err = -ENODATA;
- goto out;
+ return -ENODATA;
}

if (i_group_start == cpu_group_start)
cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
}

-out:
- of_node_put(dn);
- return err;
+ return 0;
}

static bool shared_caches;
--
1.9.4

2020-12-15 08:49:32

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties

* Gautham R. Shenoy <[email protected]> [2020-12-10 16:08:55]:

> From: "Gautham R. Shenoy" <[email protected]>
>
> The "ibm,thread-groups" device-tree property is an array that is used
> to indicate if groups of threads within a core share certain
> properties. It provides details of which property is being shared by
> which groups of threads. This array can encode information about
> multiple properties being shared by different thread-groups within the
> core.
>

Looks good to me.

Reviewed-by: Srikar Dronamraju <[email protected]>

--
Thanks and Regards
Srikar Dronamraju

2020-12-15 10:54:52

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] Extend Parsing "ibm, thread-groups" for Shared-L2 information

On Thu, 10 Dec 2020 16:08:54 +0530, Gautham R. Shenoy wrote:
> This is the v2 of the patchset to extend parsing of "ibm,thread-groups" property
> to discover the Shared-L2 cache information.
>
> The previous versions can be found here :
>
> v2 : https://lore.kernel.org/linuxppc-dev/[email protected]/T/#m043ea15d3832658527fca94765202b9cbefd330d
>
> [...]

Applied to powerpc/next.

[1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties
https://git.kernel.org/powerpc/c/790a1662d3a26fe9fa5f691386d8fde6bb8b0dc2
[2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
https://git.kernel.org/powerpc/c/1fdc1d6632ff3f6813a2f15b65586bde8fe0f0ba
[3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
https://git.kernel.org/powerpc/c/fbd2b672e91d276b9fa5a729e4a823ba29fa2692
[4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache
https://git.kernel.org/powerpc/c/9538abee18cca70ffd03cef56027388b0c5084cc
[5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
https://git.kernel.org/powerpc/c/0be47634db0baa9e91c7e635e7e73355d6a5cf43

cheers