2022-04-22 09:00:28

by 王擎

Subject: [PATCH 0/2] Add complex scheduler level for arm64

From: Wang Qing <[email protected]>

The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
which contain up to two cores of the same type and some shared logic.
Sharing some logic between the cores can make a complex area efficient.

This series adds a complex level for complexes by parsing the cache
topology from DT. It directly benefits workloads that want more
resources, such as memory bandwidth and caches.

Note that this series only handles the DT case.

wangqing (2):
arch_topology: support for describing cache topology from DT
arm64: Add complex scheduler level for arm64

arch/arm64/Kconfig | 13 ++++++++++
arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++-
drivers/base/arch_topology.c | 47 +++++++++++++++++++++++++++++++++-
include/linux/arch_topology.h | 3 +++
4 files changed, 109 insertions(+), 2 deletions(-)

--
2.7.4


2022-04-22 19:23:46

by 王擎

Subject: [PATCH 2/2] arm64: Add complex scheduler level for arm64

From: Wang Qing <[email protected]>

The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
which contain up to two cores of the same type and some shared logic.
Sharing some logic between the cores can make a complex area efficient.

This patch adds a complex level for complexes and automatically enables
load balancing among complexes. It directly benefits workloads that
want more resources, such as memory bandwidth and caches.

Testing has been done on qcom sm8450 with the Stream benchmark, 8 threads
(2 little cores * 2 (complex) + 3 middle cores + 1 big core):

                     stream                 stream
                     w/o patch              w/ patch
MB/sec copy     37579.2 (   0.00%)     39127.3 (   4.12%)
MB/sec scale    38261.1 (   0.00%)     39195.4 (   2.44%)
MB/sec add      39497.0 (   0.00%)     41101.5 (   4.06%)
MB/sec triad    39885.6 (   0.00%)     40772.7 (   2.22%)

Signed-off-by: Wang Qing <[email protected]>
---
arch/arm64/Kconfig | 13 +++++++++++
arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index edbe035cb0e3..4063de8c6153 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
by sharing mid-level caches, last-level cache tags or internal
busses.

+config SCHED_COMPLEX
+ bool "Complex scheduler support"
+ help
+ DSU supports blocks that are called complexes which contain up to
+ two cores of the same type and some shared logic. Sharing some logic
+ between the cores can make a complex area efficient.
+
+ Complex also can be considered as a shared cache group smaller
+ than cluster.
+
+ Complex scheduler support improves the CPU scheduler's decision
+ making when dealing with machines that have complexes of CPUs.
+
config SCHED_SMT
bool "SMT scheduler support"
help
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b46041f2b97..526765112146 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -14,6 +14,7 @@
#include <linux/sched/mm.h>
#include <linux/sched/hotplug.h>
#include <linux/sched/task_stack.h>
+#include <linux/sched/topology.h>
#include <linux/interrupt.h>
#include <linux/cache.h>
#include <linux/profile.h>
@@ -57,6 +58,10 @@
DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
EXPORT_PER_CPU_SYMBOL(cpu_number);

+#ifdef CONFIG_SCHED_COMPLEX
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
+#endif
+
/*
* as from 2.5, kernels no longer have an init_tasks structure
* so we need some other way of telling a new secondary core
@@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
}
}

+#ifdef CONFIG_SCHED_COMPLEX
+static int arm64_complex_flags(void)
+{
+ return SD_SHARE_PKG_RESOURCES;
+}
+
+const struct cpumask *arm64_complex_mask(int cpu)
+{
+ const struct cpumask *core_mask = cpu_cpu_mask(cpu);
+
+ /* Find the smaller shared cache level than clustergroup and coregroup*/
+#ifdef CONFIG_SCHED_MC
+ core_mask = cpu_coregroup_mask(cpu);
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+ core_mask = cpu_clustergroup_mask(cpu);
+#endif
+
+ find_max_sub_sc(core_mask, cpu, &per_cpu(cpu_complex_map, cpu));
+
+ return &per_cpu(cpu_complex_map, cpu);
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+ { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_COMPLEX
+ { arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+ { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+#ifdef CONFIG_SCHED_MC
+ { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+ { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { NULL, },
+};
+
void __init smp_prepare_cpus(unsigned int max_cpus)
{
const struct cpu_operations *ops;
@@ -723,9 +769,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
unsigned int this_cpu;

init_cpu_topology();
-
this_cpu = smp_processor_id();
store_cpu_topology(this_cpu);
+ set_sched_topology(arm64_topology);
numa_store_cpu_info(this_cpu);
numa_add_cpu(this_cpu);

--
2.27.0.windows.1
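
For readers following the thread: the scheduler builds one domain per
entry in the topology table, and for every CPU each level's mask is
expected to be contained in the mask of the level above it. The sketch
below models that nesting in user space. It is not kernel code. The
8-CPU layout (cpu0-1 and cpu2-3 as two little-core complexes, cpu4-6 as
middle cores, cpu7 as the big core), the cluster grouping, the plain
bitmask in place of struct cpumask, and all function names are
assumptions made for illustration.

```c
#include <stddef.h>

#define NCPUS 8

/* Hypothetical stand-in for the kernel's per-level mask callback. */
typedef unsigned int (*mask_fn)(int cpu);

struct topo_level {
	mask_fn mask;
	const char *name;
};

/* Assumed layout: two little-core complexes, all other CPUs on their own. */
static unsigned int complex_mask(int cpu)
{
	if (cpu < 2)
		return 0x03;		/* cpu0-1 */
	if (cpu < 4)
		return 0x0c;		/* cpu2-3 */
	return 1u << cpu;
}

/* Assumed clusters: little (cpu0-3), middle (cpu4-6), big (cpu7). */
static unsigned int cluster_mask(int cpu)
{
	if (cpu < 4)
		return 0x0f;
	if (cpu < 7)
		return 0x70;
	return 0x80;
}

static unsigned int die_mask(int cpu)
{
	(void)cpu;
	return 0xff;			/* every CPU in the system */
}

/* Ordered from the smallest span to the largest, NULL-terminated. */
static const struct topo_level levels[] = {
	{ complex_mask, "CPL" },
	{ cluster_mask, "CLS" },
	{ die_mask, "DIE" },
	{ NULL, NULL },
};

/*
 * Return 1 if, for every CPU, each level contains the CPU itself and is
 * a subset of the next level up; return 0 otherwise.
 */
static int levels_are_nested(const struct topo_level *tl)
{
	int cpu, i;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		unsigned int prev = 1u << cpu;

		for (i = 0; tl[i].mask; i++) {
			unsigned int cur = tl[i].mask(cpu);

			if (prev & ~cur)
				return 0;
			prev = cur;
		}
	}
	return 1;
}
```

With this table, levels_are_nested(levels) returns 1. Swapping the CPL
and CLS entries makes it return 0, which is the kind of mis-ordering a
new level between SMT and CLS has to avoid.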

2022-04-22 20:15:30

by 王擎

Subject: [PATCH 1/2] arch_topology: support for describing cache topology from DT

From: Wang Qing <[email protected]>

When ACPI is not enabled, we can get cache topology from DT like:
* cpu0: cpu@000 {
* next-level-cache = <&L2_1>;
* L2_1: l2-cache {
* compatible = "cache";
* next-level-cache = <&L3_1>;
* };
* L3_1: l3-cache {
* compatible = "cache";
* };
* };
*
* cpu1: cpu@001 {
* next-level-cache = <&L2_1>;
* };
* ...
* };
cache_topology holds the pointers describing "next-level-cache",
so it can describe the cache topology of every level.

Signed-off-by: Wang Qing <[email protected]>
---
drivers/base/arch_topology.c | 47 ++++++++++++++++++++++++++++++++++-
include/linux/arch_topology.h | 3 +++
2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..46e84ce2ec0c 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
return -1;

cpu = of_cpu_node_to_id(cpu_node);
- if (cpu >= 0)
+ if (cpu >= 0) {
topology_parse_cpu_capacity(cpu_node, cpu);
+ topology_parse_cpu_caches(cpu_node, cpu);
+ }
else
pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
cpu_node, cpumask_pr_args(cpu_possible_mask));
@@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
}
#endif

+/*
+ * cpu cache topology table
+ */
+#define MAX_CACHE_LEVEL 7
+staic struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
+
+void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
+{
+ struct device_node *node_cache = cpu_node;
+ int level = 0;
+
+ while (level < MAX_CACHE_LEVEL) {
+ node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
+ if (!node_cache)
+ break;
+
+ cache_topology[cpu][level++] = node_cache;
+ }
+}
+
+/*
+ * find the maximum level shared cache under giving mask
+ */
+void find_max_sub_sc(const struct cpumask *giving_mask, int cpu,
+ struct cpumask *sc_mask)
+{
+ int cache_level, cpu_id;
+
+ for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
+ if (!cache_topology[cpu][cache_level])
+ continue;
+
+ cpumask_clear(sc_mask);
+ for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
+ if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
+ cpumask_set_cpu(cpu_id, sc_mask);
+ }
+
+ if (cpumask_subset(sc_mask, giving_mask))
+ break;
+ }
+}
+
/*
* cpu topology table
*/
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 58cbe18d825c..c6ed727e453c 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
void remove_cpu_topology(unsigned int cpuid);
void reset_cpu_topology(void);
int parse_acpi_topology(void);
+void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
+void find_max_sub_sc(const struct cpumask *giving_mask, int cpu,
+ struct cpumask *sc_mask);
#endif

#endif /* _LINUX_ARCH_TOPOLOGY_H_ */
--
2.27.0.windows.1
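
For readers following the thread: the lookup that find_max_sub_sc()
performs can be modelled in user space. The sketch below is not the
patch's code. It assumes 8 CPUs, uses a plain bitmask in place of
struct cpumask, and replaces the device_node pointers with small
integer cache IDs (0 meaning "no cache described at this level"). The
table contents and function names are hypothetical. Unlike the patch,
it falls back to a mask holding only the CPU itself when no level fits.

```c
#define NCPUS 8
#define MAX_CACHE_LEVEL 3

/*
 * cache_id[cpu][level]: level 0 is the first "next-level-cache" hop.
 * Equal non-zero IDs at the same level mean the CPUs share that cache.
 */
static const int cache_id[NCPUS][MAX_CACHE_LEVEL] = {
	{ 1, 9, 0 },	/* cpu0 shares cache 1 with cpu1 */
	{ 1, 9, 0 },
	{ 2, 9, 0 },	/* cpu2 shares cache 2 with cpu3 */
	{ 2, 9, 0 },
	{ 3, 9, 0 },	/* cpu4..cpu7 have a private first-hop cache */
	{ 4, 9, 0 },
	{ 5, 9, 0 },
	{ 6, 9, 0 },
};

/* Bitmask of the CPUs that share the cache at 'level' with 'cpu'. */
static unsigned int sharing_mask(int cpu, int level)
{
	unsigned int mask = 0;
	int other;

	for (other = 0; other < NCPUS; other++)
		if (cache_id[other][level] == cache_id[cpu][level])
			mask |= 1u << other;
	return mask;
}

/*
 * Walk from the outermost cache level inwards and return the first
 * sharing mask that is fully contained in 'limit'.
 */
static unsigned int largest_shared_cache_mask(unsigned int limit, int cpu)
{
	int level;

	for (level = MAX_CACHE_LEVEL - 1; level >= 0; level--) {
		unsigned int mask;

		if (!cache_id[cpu][level])
			continue;	/* no cache described at this level */

		mask = sharing_mask(cpu, level);
		if ((mask & ~limit) == 0)
			return mask;
	}
	return 1u << cpu;	/* assumed fallback: the CPU on its own */
}
```

With a limit of 0x0f (cpu0-3, standing in for the enclosing cluster
mask), the outer cache shared by all eight CPUs is rejected and the
call for cpu0 returns 0x03, the two CPUs sharing the first-hop cache.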

2022-04-22 20:32:49

by Sudeep Holla

Subject: Re: [PATCH 1/2] arch_topology: support for describing cache topology from DT

On Thu, Apr 21, 2022 at 07:55:57AM -0700, Qing Wang wrote:
> From: Wang Qing <[email protected]>
>
> When ACPI is not enabled, we can get cache topology from DT like:
> * cpu0: cpu@000 {
> * next-level-cache = <&L2_1>;
> * L2_1: l2-cache {
> * compatible = "cache";
> * next-level-cache = <&L3_1>;
> * };
> * L3_1: l3-cache {
> * compatible = "cache";
> * };
> * };
> *
> * cpu1: cpu@001 {
> * next-level-cache = <&L2_1>;
> * };
> * ...
> * };
> cache_topology holds the pointers describing "next-level-cache",
> so it can describe the cache topology of every level.

As I mentioned before, I would like to avoid any duplication and see
what can be reused from drivers/base/cacheinfo.c

We can discuss and see how to proceed on that once we settle/agree on
2/2. I don't want to waste your or my time if we don't end up using this.
So let us look at this once we agree to push the sched related changes
as we have used generic ones so far and you want to introduce arm64 specific
levels. That requires some discussions and thoughts before we can finalise.

Also, I have asked you before to keep Dietmar and Vincent in Cc for all
sched related changes, which you failed to do again. I expect you to fix
that next time if you want them to help you in discussions and make any
progress on this. Otherwise it may get ignored, as you don't have all the
right people in Cc.

--
Regards,
Sudeep

2022-04-22 20:56:41

by kernel test robot

Subject: Re: [PATCH 1/2] arch_topology: support for describing cache topology from DT

Hi Qing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on driver-core/driver-core-testing linus/master arm-perf/for-next/perf v5.18-rc3 next-20220421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Qing-Wang/Add-complex-scheduler-level-for-arm64/20220421-225748
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: riscv-randconfig-c006-20220421 (https://download.01.org/0day-ci/archive/20220422/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 5bd87350a5ae429baf8f373cb226a57b62f87280)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install riscv cross compiling tool for clang build
# apt-get install binutils-riscv64-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/854ee80a8c32ea98203c96ba25cae2e87eeb43b1
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Qing-Wang/Add-complex-scheduler-level-for-arm64/20220421-225748
git checkout 854ee80a8c32ea98203c96ba25cae2e87eeb43b1
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash arch/riscv/ drivers/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> drivers/base/arch_topology.c:617:1: error: unknown type name 'staic'; did you mean 'static'?
staic struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
^~~~~
static
1 error generated.


vim +617 drivers/base/arch_topology.c

612
613 /*
614 * cpu cache topology table
615 */
616 #define MAX_CACHE_LEVEL 7
> 617 staic struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
618

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-04-22 21:40:08

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/2] arch_topology: support for describing cache topology from DT

On Thu, Apr 21, 2022 at 07:55:57AM -0700, Qing Wang wrote:
> From: Wang Qing <[email protected]>
>
> When ACPI is not enabled, we can get cache topology from DT like:
> * cpu0: cpu@000 {
> * next-level-cache = <&L2_1>;
> * L2_1: l2-cache {
> * compatible = "cache";
> * next-level-cache = <&L3_1>;
> * };
> * L3_1: l3-cache {
> * compatible = "cache";
> * };
> * };
> *
> * cpu1: cpu@001 {
> * next-level-cache = <&L2_1>;
> * };
> * ...
> * };
> cache_topology holds the pointers describing "next-level-cache",
> so it can describe the cache topology of every level.
>
> Signed-off-by: Wang Qing <[email protected]>
> ---
> drivers/base/arch_topology.c | 47 ++++++++++++++++++++++++++++++++++-
> include/linux/arch_topology.h | 3 +++
> 2 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 1d6636ebaac5..46e84ce2ec0c 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
> return -1;
>
> cpu = of_cpu_node_to_id(cpu_node);
> - if (cpu >= 0)
> + if (cpu >= 0) {
> topology_parse_cpu_capacity(cpu_node, cpu);
> + topology_parse_cpu_caches(cpu_node, cpu);
> + }
> else
> pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
> cpu_node, cpumask_pr_args(cpu_possible_mask));
> @@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
> }
> #endif
>
> +/*
> + * cpu cache topology table
> + */
> +#define MAX_CACHE_LEVEL 7
> +staic struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
> +
> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
> +{
> + struct device_node *node_cache = cpu_node;
> + int level = 0;
> +
> + while (level < MAX_CACHE_LEVEL) {
> + node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
> + if (!node_cache)
> + break;
> +
> + cache_topology[cpu][level++] = node_cache;
> + }
> +}
> +
> +/*
> + * find the maximum level shared cache under giving mask
> + */
> +void find_max_sub_sc(const struct cpumask *giving_mask, int cpu,
> + struct cpumask *sc_mask)

This is not a good global function name. No one will know what this
means when they read it. Please make it make more sense.

thanks,

greg k-h