2016-10-01 11:46:02

by Srinivas Pandruvada

Subject: [PATCH v5 0/9] Support Intel® Turbo Boost Max Technology 3.0

v5:
- Simplify intel_pstate for enabling ITMT feature
- Put x86_sched_itmt_flags related functions under proper
CONFIG_SCHED_MC/SMT flags
- Add a comment to note that rebuild_sched_domains is not needed after
updating CPU priorities.
- Define sysctl_sched_itmt_enabled to 0 when ITMT is not used in
arch/x86/include/asm/topology.h
- Dropped patch "Fix numa in package topology bug", as this is
already applied

v4:
- Split x86 multi-node numa topology bug fix and setting
of SD_ASYM flag for ITMT topology into 2 patches
- Split the sysctl changes for ITMT enablement and setting of ITMT
capability/core priorities into 2 patches.
- Avoid unnecessary rebuild of sched domains when ITMT sysctl or
capabilities are updated.
- Fix missing stub function for topology_max_packages for !SMP case.
- Rename set_sched_itmt() to sched_set_itmt_support().
- Various updates to itmt.c to eliminate goto and make logic tighter.
- Various change logs and comments update.
- intel_pstate: Split function to process cppc and enable ITMT
- intel_pstate: Just keep the cppc_perf information till we use CPPC for HWP

v3:
- Fix race when more than one program is enabling/disabling ITMT
- Remove group_priority_cpu macro to simplify code.
- Fix compile error on ARM reported by 0-day

v2:
- The patchset is split into two parts so that CPPC changes can be merged first
1. Only ACPI CPPC changes (It is posted separately)
2. ITMT changes (scheduler and Intel P-State)

- Changes in patch: sched,x86: Enable Turbo Boost Max Technology
1. Use arch_update_cpu_topology to indicate need to completely
rebuild sched domain when ITMT related sched domain flags change
2. Enable ITMT scheduling by default on client (single node) platforms
capable of ITMT
3. Implement arch_asym_cpu_priority to provide the cpu priority
value to scheduler for asym packing.
4. Fix a compile bug for i386 architecture.

- Changes in patch: sched: Extend scheduler's asym packing
1. Use arch_asym_cpu_priority() to provide cpu priority
value used for asym packing to the scheduler.

- Changes in acpi: bus: Enable HWP CPPC objects and
acpi: bus: Set _OSC for diverse core support
Minor code cleanup by removing #ifdef
- Changes in Kconfig for Intel P-State
Avoid building CPPC lib for i386 for issue reported by 0-day

- Feature is enabled by default for single socket systems

With Intel® Turbo Boost Max Technology 3.0 (ITMT), single-threaded
performance is optimized by identifying the processor's fastest core and
running critical workloads on it.
Refer to:
http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-max-technology.html

This patchset consists of all the changes required to support the ITMT feature:
- Use CPPC information in the Intel P-State driver to get performance information
- Scheduler enhancements
- CPPC lib patches (split into a separate series)

This feature can be enabled at runtime by writing:
# echo 1 > /proc/sys/kernel/sched_itmt_enabled
and disabled at runtime by writing:
# echo 0 > /proc/sys/kernel/sched_itmt_enabled
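
The current setting can be read back the same way (illustrative session;
the output assumes the feature was just enabled):

# cat /proc/sys/kernel/sched_itmt_enabled
1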

Peter Zijlstra (Intel) (1):
x86/topology: Fix numa in package topology bug

Rafael J. Wysocki (1):
cpufreq: intel_pstate: Use CPPC to get max performance

Srinivas Pandruvada (2):
acpi: bus: Enable HWP CPPC objects
acpi: bus: Set _OSC for diverse core support

Tim Chen (6):
sched: Extend scheduler's asym packing
x86/topology: Provide topology_num_packages()
x86/topology: Define x86's arch_update_cpu_topology
x86: Enable Intel Turbo Boost Max Technology 3.0
x86/sysctl: Add sysctl for ITMT scheduling feature
x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU

arch/x86/Kconfig | 9 ++
arch/x86/include/asm/topology.h | 28 ++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/itmt.c | 189 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/smpboot.c | 44 +++++++++-
drivers/acpi/bus.c | 10 +++
drivers/cpufreq/Kconfig.x86 | 1 +
drivers/cpufreq/intel_pstate.c | 56 +++++++++++-
include/linux/acpi.h | 1 +
include/linux/sched.h | 2 +
kernel/sched/core.c | 18 ++++
kernel/sched/fair.c | 35 +++++---
kernel/sched/sched.h | 6 ++
13 files changed, 384 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/kernel/itmt.c

--
2.7.4


2016-10-01 11:46:15

by Srinivas Pandruvada

Subject: [PATCH v5 2/9] x86/topology: Provide topology_num_packages()

From: Tim Chen <[email protected]>

Return the number of CPU packages discovered.

This information is needed to determine the size of the platform and
decide if the Intel Turbo Boost Max Technology 3.0 (ITMT) feature
should be turned on by default. The ITMT feature is more effective on
single-socket, client-like systems that use a small number of cores
most of the time.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
arch/x86/include/asm/topology.h | 3 +++
arch/x86/kernel/smpboot.c | 5 +++++
2 files changed, 8 insertions(+)
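
As an illustration of the intended consumer (not part of this patch;
patch 5/9 in this series ends up with exactly this check):

	/* enable ITMT scheduling by default only on single-package systems */
	if (topology_num_packages() == 1)
		sysctl_sched_itmt_enabled = 1;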

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index cf75871..3e95dfc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -129,10 +129,13 @@ static inline int topology_max_smt_threads(void)
}

int topology_update_package_map(unsigned int apicid, unsigned int cpu);
+extern int topology_num_packages(void);
extern int topology_phys_to_logical_pkg(unsigned int pkg);
#else
#define topology_max_packages() (1)
static inline int
+topology_num_packages(void) { return 1; }
+static inline int
topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
static inline int topology_max_smt_threads(void) { return 1; }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 7137ec4..6a763a2 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -295,6 +295,11 @@ found:
return 0;
}

+int topology_num_packages(void)
+{
+ return logical_packages;
+}
+
/**
* topology_phys_to_logical_pkg - Map a physical package id to a logical
*
--
2.7.4

2016-10-01 11:46:27

by Srinivas Pandruvada

Subject: [PATCH v5 8/9] acpi: bus: Set _OSC for diverse core support

Set the OSC_SB_CPC_DIVERSE_HIGH_SUPPORT (bit 12) to enable diverse
core support.

This is required to inform the BIOS that the OS supports the Intel
Turbo Boost Max Technology 3.0 feature.

Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/acpi/bus.c | 3 +++
include/linux/acpi.h | 1 +
2 files changed, 4 insertions(+)
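
For reference, the named mask is simply bit 12 of the _OSC support
dword; a one-line illustration (not part of the patch):

	#define OSC_SB_CPC_DIVERSE_HIGH_SUPPORT (1U << 12)	/* == 0x00001000 */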

diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 61643a5..8ab6ec2 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -337,6 +337,9 @@ static void acpi_bus_osc_support(void)
}
#endif

+ if (IS_ENABLED(CONFIG_SCHED_ITMT))
+ capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_CPC_DIVERSE_HIGH_SUPPORT;
+
if (!ghes_disable)
capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_APEI_SUPPORT;
if (ACPI_FAILURE(acpi_get_handle(NULL, "\\_SB", &handle)))
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index e746552..53841a2 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -462,6 +462,7 @@ acpi_status acpi_run_osc(acpi_handle handle, struct acpi_osc_context *context);
#define OSC_SB_CPCV2_SUPPORT 0x00000040
#define OSC_SB_PCLPI_SUPPORT 0x00000080
#define OSC_SB_OSLPI_SUPPORT 0x00000100
+#define OSC_SB_CPC_DIVERSE_HIGH_SUPPORT 0x00001000

extern bool osc_sb_apei_support_acked;
extern bool osc_pc_lpi_support_confirmed;
--
2.7.4

2016-10-01 11:46:38

by Srinivas Pandruvada

Subject: [PATCH v5 7/9] acpi: bus: Enable HWP CPPC objects

We need to set the platform-wide _OSC bits to enable CPPC and CPPC
version 2. If the platform supports CPPC, the BIOS will then expose the
CPPC tables.

The primary reason to enable CPPC support is to get the maximum
performance of each CPU, which is needed to check for and enable the
Intel Turbo Boost Max Technology 3.0 (ITMT) feature.

Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/acpi/bus.c | 7 +++++++
1 file changed, 7 insertions(+)
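
Once these bits are granted and the BIOS exposes the CPPC tables, the
per-CPU limits become readable through the CPPC lib. A minimal sketch of
the read side (cf. patch 9/9; illustration only):

	struct cppc_perf_caps caps;

	if (!cppc_get_perf_caps(cpu, &caps))
		pr_debug("cpu%d highest_perf %u\n", cpu, caps.highest_perf);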

diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 85b7d07..61643a5 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -330,6 +330,13 @@ static void acpi_bus_osc_support(void)
capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_HOTPLUG_OST_SUPPORT;
capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_PCLPI_SUPPORT;

+#ifdef CONFIG_X86
+ if (boot_cpu_has(X86_FEATURE_HWP)) {
+ capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_CPC_SUPPORT;
+ capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_CPCV2_SUPPORT;
+ }
+#endif
+
if (!ghes_disable)
capbuf[OSC_SUPPORT_DWORD] |= OSC_SB_APEI_SUPPORT;
if (ACPI_FAILURE(acpi_get_handle(NULL, "\\_SB", &handle)))
--
2.7.4

2016-10-01 11:46:08

by Srinivas Pandruvada

Subject: [PATCH v5 1/9] sched: Extend scheduler's asym packing

From: Tim Chen <[email protected]>

We generalize the scheduler's asym packing to provide an ordering
of the CPUs beyond just the CPU number. This allows the use of the
ASYM_PACKING scheduler machinery to move loads to the preferred CPUs
in a sched domain. The preference is defined by the CPU priority
given by arch_asym_cpu_priority(cpu).

We also record the most preferred CPU in a sched group when
we build the group's capacity, for fast lookup of the preferred CPU
during load balancing.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 18 ++++++++++++++++++
kernel/sched/fair.c | 35 ++++++++++++++++++++++++-----------
kernel/sched/sched.h | 6 ++++++
4 files changed, 50 insertions(+), 11 deletions(-)
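
A side note on backward compatibility: with the weak default
arch_asym_cpu_priority(cpu) == -cpu introduced below,
sched_asym_prefer(a, b) reduces to a < b, i.e. the old "lower-numbered
CPU wins" rule. A standalone sketch (illustration only, mirroring the
hunks below):

	#include <stdbool.h>

	/* weak default from this patch: lower CPU number => higher priority */
	static int arch_asym_cpu_priority(int cpu)
	{
		return -cpu;
	}

	static bool sched_asym_prefer(int a, int b)
	{
		return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
	}
	/* sched_asym_prefer(0, 2) is true: CPU0 preferred, matching the old rule */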

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98fe95f..82ca1e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1052,6 +1052,8 @@ static inline int cpu_numa_flags(void)
}
#endif

+int arch_asym_cpu_priority(int cpu);
+
struct sched_domain_attr {
int relax_domain_level;
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e86c4a5..08135ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6237,7 +6237,25 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
WARN_ON(!sg);

do {
+ int cpu, max_cpu = -1, prev_cpu = -1;
+
sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+
+ if (!(sd->flags & SD_ASYM_PACKING))
+ goto next;
+
+ for_each_cpu(cpu, sched_group_cpus(sg)) {
+ if (prev_cpu < 0) {
+ prev_cpu = cpu;
+ max_cpu = cpu;
+ } else {
+ if (sched_asym_prefer(cpu, max_cpu))
+ max_cpu = cpu;
+ }
+ }
+ sg->asym_prefer_cpu = max_cpu;
+
+next:
sg = sg->next;
} while (sg != sd->groups);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5cd07b..bb96e1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -100,6 +100,16 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
*/
unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;

+#ifdef CONFIG_SMP
+/*
+ * For asym packing, by default the lower numbered cpu has higher priority.
+ */
+int __weak arch_asym_cpu_priority(int cpu)
+{
+ return -cpu;
+}
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
@@ -6861,16 +6871,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (env->idle == CPU_NOT_IDLE)
return true;
/*
- * ASYM_PACKING needs to move all the work to the lowest
- * numbered CPUs in the group, therefore mark all groups
- * higher than ourself as busy.
+ * ASYM_PACKING needs to move all the work to the highest
+ * priority CPUs in the group, therefore mark all groups
+ * of lower priority than ourself as busy.
*/
- if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
+ if (sgs->sum_nr_running &&
+ sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
if (!sds->busiest)
return true;

- /* Prefer to move from highest possible cpu's work */
- if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
+ /* Prefer to move from lowest priority cpu's work */
+ if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
+ sg->asym_prefer_cpu))
return true;
}

@@ -7022,8 +7034,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
if (!sds->busiest)
return 0;

- busiest_cpu = group_first_cpu(sds->busiest);
- if (env->dst_cpu > busiest_cpu)
+ busiest_cpu = sds->busiest->asym_prefer_cpu;
+ if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
return 0;

env->imbalance = DIV_ROUND_CLOSEST(
@@ -7364,10 +7376,11 @@ static int need_active_balance(struct lb_env *env)

/*
* ASYM_PACKING needs to force migrate tasks from busy but
- * higher numbered CPUs in order to pack all tasks in the
- * lowest numbered CPUs.
+ * lower priority CPUs in order to pack all tasks in the
+ * highest priority CPUs.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+ if ((sd->flags & SD_ASYM_PACKING) &&
+ sched_asym_prefer(env->dst_cpu, env->src_cpu))
return 1;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b7fc1ce..3f3d04a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -532,6 +532,11 @@ struct dl_rq {

#ifdef CONFIG_SMP

+static inline bool sched_asym_prefer(int a, int b)
+{
+ return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
+}
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -884,6 +889,7 @@ struct sched_group {

unsigned int group_weight;
struct sched_group_capacity *sgc;
+ int asym_prefer_cpu; /* cpu of highest priority in group */

/*
* The CPUs this group covers.
--
2.7.4

2016-10-01 11:46:57

by Srinivas Pandruvada

Subject: [PATCH v5 9/9] cpufreq: intel_pstate: Use CPPC to get max performance

From: "Rafael J. Wysocki" <[email protected]>

This change uses the ACPI cppc_lib interface to get the CPPC performance
limits and calls the scheduler interface to update the per-CPU highest
priority. If the highest performance differs across CPUs, the scheduler
interface is called once to enable the ITMT feature.

Here sched_set_itmt_core_prio() is called to set the priorities and
sched_set_itmt_support() is called to enable the ITMT feature.

Original-by: Srinivas Pandruvada <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/Kconfig.x86 | 1 +
drivers/cpufreq/intel_pstate.c | 56 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 56 insertions(+), 1 deletion(-)
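
The one-time enable logic in the hunk below can be summarized by this
standalone sketch (hypothetical names; enable_itmt_once() stands in for
queuing sched_itmt_work):

	#include <stdint.h>

	static uint32_t max_hp, min_hp = UINT32_MAX;

	/* hypothetical stand-in for scheduling the ITMT work item */
	extern void enable_itmt_once(void);

	static void observe_highest_perf(uint32_t hp)
	{
		if (max_hp <= min_hp) {		/* ITMT not enabled yet */
			if (hp > max_hp)
				max_hp = hp;
			if (hp < min_hp)
				min_hp = hp;
			if (max_hp > min_hp)	/* first asymmetry seen */
				enable_itmt_once();
		}
	}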

diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86
index adbd1de..c6d273b 100644
--- a/drivers/cpufreq/Kconfig.x86
+++ b/drivers/cpufreq/Kconfig.x86
@@ -6,6 +6,7 @@ config X86_INTEL_PSTATE
bool "Intel P state control"
depends on X86
select ACPI_PROCESSOR if ACPI
+ select ACPI_CPPC_LIB if X86_64 && ACPI && SCHED_ITMT
help
This driver provides a P state for Intel core processors.
The driver implements an internal governor and will become
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c877e70..e135cef 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -44,6 +44,7 @@

#ifdef CONFIG_ACPI
#include <acpi/processor.h>
+#include <acpi/cppc_acpi.h>
#endif

#define FRAC_BITS 8
@@ -377,14 +378,67 @@ static bool intel_pstate_get_ppc_enable_status(void)
return acpi_ppc;
}

+#ifdef CONFIG_SCHED_ITMT
+
+/* The work item is needed to avoid CPU hotplug locking issues */
+static void intel_pstate_sched_itmt_work_fn(struct work_struct *work)
+{
+ sched_set_itmt_support(true);
+}
+
+static DECLARE_WORK(sched_itmt_work, intel_pstate_sched_itmt_work_fn);
+
+static void intel_pstate_set_itmt_prio(int cpu)
+{
+ struct cppc_perf_caps cppc_perf;
+ static u32 max_highest_perf = 0, min_highest_perf = U32_MAX;
+ int ret;
+
+ ret = cppc_get_perf_caps(cpu, &cppc_perf);
+ if (ret)
+ return;
+
+ /*
+ * The priorities can be set regardless of whether or not
+ * sched_set_itmt_support(true) has been called and it is valid to
+ * update them at any time after it has been called.
+ */
+ sched_set_itmt_core_prio(cppc_perf.highest_perf, cpu);
+
+ if (max_highest_perf <= min_highest_perf) {
+ if (cppc_perf.highest_perf > max_highest_perf)
+ max_highest_perf = cppc_perf.highest_perf;
+
+ if (cppc_perf.highest_perf < min_highest_perf)
+ min_highest_perf = cppc_perf.highest_perf;
+
+ if (max_highest_perf > min_highest_perf) {
+ /*
+ * This code can be run during CPU online under the
+ * CPU hotplug locks, so sched_set_itmt_support()
+ * cannot be called from here. Queue up a work item
+ * to invoke it.
+ */
+ schedule_work(&sched_itmt_work);
+ }
+ }
+}
+#else
+static void intel_pstate_set_itmt_prio(int cpu)
+{
+}
+#endif
+
static void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy *policy)
{
struct cpudata *cpu;
int ret;
int i;

- if (hwp_active)
+ if (hwp_active) {
+ intel_pstate_set_itmt_prio(policy->cpu);
return;
+ }

if (!intel_pstate_get_ppc_enable_status())
return;
--
2.7.4

2016-10-01 11:47:18

by Srinivas Pandruvada

Subject: [PATCH v5 6/9] x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU

From: Tim Chen <[email protected]>

Some Intel cores in a package can be boosted to a higher turbo frequency
with Intel Turbo Boost Max Technology 3.0 (ITMT). The scheduler can use
the asymmetric packing feature to move tasks to the more capable cores.

If ITMT is enabled, add the SD_ASYM_PACKING flag to the thread and core
sched domains to enable asymmetric packing.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 38901b3..607cbe6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -487,22 +487,42 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
return false;
}

+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+static inline int x86_sched_itmt_flags(void)
+{
+ return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
+}
+
+#ifdef CONFIG_SCHED_MC
+static int x86_core_flags(void)
+{
+ return cpu_core_flags() | x86_sched_itmt_flags();
+}
+#endif
+#ifdef CONFIG_SCHED_SMT
+static int x86_smt_flags(void)
+{
+ return cpu_smt_flags() | x86_sched_itmt_flags();
+}
+#endif
+#endif
+
static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+ { cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
{ NULL, },
};

static struct sched_domain_topology_level x86_topology[] = {
#ifdef CONFIG_SCHED_SMT
- { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+ { cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
- { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+ { cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
--
2.7.4

2016-10-01 11:47:33

by Srinivas Pandruvada

Subject: [PATCH v5 5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

From: Tim Chen <[email protected]>

Intel Turbo Boost Max Technology 3.0 (ITMT) allows some cores to be
boosted to a higher turbo frequency than others.

Add /proc/sys/kernel/sched_itmt_enabled so the operator can
enable/disable scheduling of tasks that favor cores with higher turbo
boost frequency potential.

By default, a system that is ITMT-capable and single-socket has this
feature turned on, since it is more likely to be lightly loaded and to
operate in the turbo range.

When a change in the desired ITMT scheduling operation occurs, a
rebuild of the sched domains is initiated so the scheduler can set up
the sched domains with the appropriate flags to enable/disable ITMT
scheduling operations.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
arch/x86/include/asm/topology.h | 2 +
arch/x86/kernel/itmt.c | 98 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 98 insertions(+), 2 deletions(-)
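
Note the extra1/extra2 bounds wired up below: proc_dointvec_minmax()
rejects writes outside [0, 1], so an out-of-range write fails
(illustrative session):

	# echo 2 > /proc/sys/kernel/sched_itmt_enabled
	-bash: echo: write error: Invalid argument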

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 637d847..e45151f 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -155,6 +155,7 @@ extern bool x86_topology_update;
#include <asm/percpu.h>

DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
+extern unsigned int __read_mostly sysctl_sched_itmt_enabled;

/* Interface to set priority of a cpu */
void sched_set_itmt_core_prio(int prio, int core_cpu);
@@ -164,6 +165,7 @@ void sched_set_itmt_support(bool itmt_supported);

#else /* CONFIG_SCHED_ITMT */

+#define sysctl_sched_itmt_enabled 0
static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
{
}
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index f485b49..ab0ae2a 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -33,6 +33,67 @@ static DEFINE_MUTEX(itmt_update_mutex);
/* Boolean to track if system has ITMT capabilities */
static bool __read_mostly sched_itmt_capable;

+/*
+ * Boolean to control whether we want to move processes to cpu capable
+ * of higher turbo frequency for cpus supporting Intel Turbo Boost Max
+ * Technology 3.0.
+ *
+ * It can be set via /proc/sys/kernel/sched_itmt_enabled
+ */
+unsigned int __read_mostly sysctl_sched_itmt_enabled;
+
+static int sched_itmt_update_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ unsigned int old_sysctl;
+
+ mutex_lock(&itmt_update_mutex);
+
+ if (!sched_itmt_capable) {
+ mutex_unlock(&itmt_update_mutex);
+ return 0;
+ }
+
+ old_sysctl = sysctl_sched_itmt_enabled;
+ ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }
+
+ mutex_unlock(&itmt_update_mutex);
+
+ return ret;
+}
+
+static unsigned int zero;
+static unsigned int one = 1;
+static struct ctl_table itmt_kern_table[] = {
+ {
+ .procname = "sched_itmt_enabled",
+ .data = &sysctl_sched_itmt_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_itmt_update_handler,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ {}
+};
+
+static struct ctl_table itmt_root_table[] = {
+ {
+ .procname = "kernel",
+ .mode = 0555,
+ .child = itmt_kern_table,
+ },
+ {}
+};
+
+static struct ctl_table_header *itmt_sysctl_header;
+
/**
* sched_set_itmt_support - Indicate platform supports ITMT
* @itmt_supported: indicate platform's CPU has ITMT capability
@@ -45,13 +106,46 @@ static bool __read_mostly sched_itmt_capable;
*
* This must be done only after sched_set_itmt_core_prio
* has been called to set the cpus' priorities.
+ *
+ * It must not be called with cpu hot plug lock
+ * held as we need to acquire the lock to rebuild sched domains
+ * later.
*/
void sched_set_itmt_support(bool itmt_supported)
{
mutex_lock(&itmt_update_mutex);

- if (itmt_supported != sched_itmt_capable)
- sched_itmt_capable = itmt_supported;
+ if (itmt_supported == sched_itmt_capable) {
+ mutex_unlock(&itmt_update_mutex);
+ return;
+ }
+ sched_itmt_capable = itmt_supported;
+
+ if (itmt_supported) {
+ itmt_sysctl_header =
+ register_sysctl_table(itmt_root_table);
+ if (!itmt_sysctl_header) {
+ mutex_unlock(&itmt_update_mutex);
+ return;
+ }
+ /*
+ * ITMT capability automatically enables ITMT
+ * scheduling for small systems (single node).
+ */
+ if (topology_num_packages() == 1)
+ sysctl_sched_itmt_enabled = 1;
+ } else {
+ if (itmt_sysctl_header)
+ unregister_sysctl_table(itmt_sysctl_header);
+ }
+
+ if (sysctl_sched_itmt_enabled) {
+ /* disable sched_itmt if we are no longer ITMT capable */
+ if (!itmt_supported)
+ sysctl_sched_itmt_enabled = 0;
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }

mutex_unlock(&itmt_update_mutex);
}
--
2.7.4

2016-10-01 11:47:43

by Srinivas Pandruvada

Subject: [PATCH v5 4/9] x86: Enable Intel Turbo Boost Max Technology 3.0

From: Tim Chen <[email protected]>

On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum
turbo frequencies of some cores in a CPU package may be higher than for
the other cores in the same package. In that case, better performance
(and possibly lower energy consumption as well) can be achieved by
making the scheduler prefer to run tasks on the CPUs with higher max
turbo frequencies.

To that end, set up a core priority metric to abstract the core
preferences based on the maximum turbo frequency. In that metric,
the cores with higher maximum turbo frequencies are higher-priority
than the other cores in the same package and that causes the scheduler
to favor them when making load-balancing decisions using the asymmetric
packing approach. At the same time, the priority of SMT threads with a
higher CPU number is reduced so as to avoid scheduling tasks on all of
the threads that belong to a favored core before all of the other cores
have been given a task to run.

The priority metric will be initialized by the P-state driver with the
help of the sched_set_itmt_core_prio() function. The P-state driver
will also determine whether or not ITMT is supported by the platform
and will call sched_set_itmt_support() to indicate that.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
arch/x86/Kconfig | 9 ++++
arch/x86/include/asm/topology.h | 22 ++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/itmt.c | 95 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 127 insertions(+)
create mode 100644 arch/x86/kernel/itmt.c
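
A worked example of the SMT priority spread described above
(illustrative numbers, assuming smp_num_siblings == 2): a core with
priority 100 yields smt_prio = 100 * 2 / 1 = 200 for its first sibling
and 100 * 2 / 2 = 100 for its second, while a slower core with priority
80 yields 160 and 80. The resulting order 200, 160, 100, 80 hands out
one task per core before any second sibling is used. As a standalone
(userspace) snippet:

	#include <stdio.h>

	int main(void)
	{
		int prio = 100, smp_num_siblings = 2, i;

		/* illustration only: priority spread across SMT siblings */
		for (i = 1; i <= smp_num_siblings; i++)
			printf("sibling %d: smt_prio %d\n", i,
			       prio * smp_num_siblings / i);
		return 0;	/* prints 200, then 100 */
	}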

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a1f0ce..6dfb97d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -927,6 +927,15 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.

+config SCHED_ITMT
+ bool "Intel Turbo Boost Max Technology (ITMT) scheduler support"
+ depends on SCHED_MC && CPU_SUP_INTEL && X86_INTEL_PSTATE
+ ---help---
+ ITMT enabled scheduler support improves the CPU scheduler's decision
+ to move tasks to CPU cores that can be boosted to a higher frequency
+ than others. It will have better performance at a cost of slightly
+ increased overhead in task migrations. If unsure say N here.
+
source "kernel/Kconfig.preempt"

config UP_LATE_INIT
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 323f61f..637d847 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -150,4 +150,26 @@ int x86_pci_root_bus_node(int bus);
void x86_pci_root_bus_resources(int bus, struct list_head *resources);

extern bool x86_topology_update;
+
+#ifdef CONFIG_SCHED_ITMT
+#include <asm/percpu.h>
+
+DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
+
+/* Interface to set priority of a cpu */
+void sched_set_itmt_core_prio(int prio, int core_cpu);
+
+/* Interface to notify scheduler that system supports ITMT */
+void sched_set_itmt_support(bool itmt_supported);
+
+#else /* CONFIG_SCHED_ITMT */
+
+static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
+{
+}
+static inline void sched_set_itmt_support(bool itmt_supported)
+{
+}
+#endif /* CONFIG_SCHED_ITMT */
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0503f5b..2008335 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -124,6 +124,7 @@ obj-$(CONFIG_EFI) += sysfb_efi.o

obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
obj-$(CONFIG_TRACING) += tracepoint.o
+obj-$(CONFIG_SCHED_ITMT) += itmt.o

###
# 64 bit specific files
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
new file mode 100644
index 0000000..f485b49
--- /dev/null
+++ b/arch/x86/kernel/itmt.c
@@ -0,0 +1,95 @@
+/*
+ * itmt.c: Support Intel Turbo Boost Max Technology 3.0
+ *
+ * (C) Copyright 2016 Intel Corporation
+ * Author: Tim Chen <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ *
+ * On platforms supporting Intel Turbo Boost Max Technology 3.0 (ITMT),
+ * the maximum turbo frequencies of some cores in a CPU package may be
+ * higher than for the other cores in the same package. In that case,
+ * better performance can be achieved by making the scheduler prefer
+ * to run tasks on the CPUs with higher max turbo frequencies.
+ *
+ * This file provides functions and data structures for enabling the
+ * scheduler to favor scheduling on cores that can be boosted to a higher
+ * frequency under ITMT.
+ */
+
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/cpuset.h>
+#include <asm/mutex.h>
+#include <linux/sched.h>
+#include <linux/sysctl.h>
+#include <linux/nodemask.h>
+
+static DEFINE_MUTEX(itmt_update_mutex);
+
+/* Boolean to track if system has ITMT capabilities */
+static bool __read_mostly sched_itmt_capable;
+
+/**
+ * sched_set_itmt_support - Indicate platform supports ITMT
+ * @itmt_supported: indicate platform's CPU has ITMT capability
+ *
+ * This function is used by the OS to indicate to scheduler if the platform
+ * is capable of supporting the ITMT feature.
+ *
+ * The current scheme has the pstate driver detect if the system
+ * is ITMT capable and call sched_set_itmt_support().
+ *
+ * This must be done only after sched_set_itmt_core_prio
+ * has been called to set the cpus' priorities.
+ */
+void sched_set_itmt_support(bool itmt_supported)
+{
+ mutex_lock(&itmt_update_mutex);
+
+ if (itmt_supported != sched_itmt_capable)
+ sched_itmt_capable = itmt_supported;
+
+ mutex_unlock(&itmt_update_mutex);
+}
+
+DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
+int arch_asym_cpu_priority(int cpu)
+{
+ return per_cpu(sched_core_priority, cpu);
+}
+
+/**
+ * sched_set_itmt_core_prio - Set CPU priority based on ITMT
+ * @prio: Priority of cpu core
+ * @core_cpu: The cpu number associated with the core
+ *
+ * The pstate driver will find out the max boost frequency
+ * and call this function to set a priority proportional
+ * to the max boost frequency. CPU with higher boost
+ * frequency will receive higher priority.
+ *
+ * No need to rebuild sched domain after updating
+ * the CPU priorities. The sched domains have no
+ * dependency on CPU priorities.
+ */
+void sched_set_itmt_core_prio(int prio, int core_cpu)
+{
+ int cpu, i = 1;
+
+ for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
+ int smt_prio;
+
+ /*
+ * Ensure that the siblings are moved to the end
+ * of the priority chain and only used when
+ * all other high priority cpus are out of capacity.
+ */
+ smt_prio = prio * smp_num_siblings / i;
+ i++;
+ per_cpu(sched_core_priority, cpu) = smt_prio;
+ }
+}
--
2.7.4

2016-10-01 11:47:51

by Srinivas Pandruvada

Subject: [PATCH v5 3/9] x86/topology: Define x86's arch_update_cpu_topology

From: Tim Chen <[email protected]>

The scheduler calls arch_update_cpu_topology() to check whether the
scheduler domains have to be rebuilt.

So far x86 has no requirement for this, but the upcoming ITMT support
makes this necessary.

Request the rebuild when the x86 internal update flag is set.

Suggested-by: Morten Rasmussen <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/smpboot.c | 11 +++++++++++
2 files changed, 12 insertions(+)
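
On the consumer side, the scheduler core already calls this hook when
deciding whether to rebuild; a simplified view of that call site
(abridged sketch of partition_sched_domains(), for orientation only):

	/* a nonzero return forces a complete sched domain rebuild */
	new_topology = arch_update_cpu_topology();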

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 3e95dfc..323f61f 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -149,4 +149,5 @@ struct pci_bus;
int x86_pci_root_bus_node(int bus);
void x86_pci_root_bus_resources(int bus, struct list_head *resources);

+extern bool x86_topology_update;
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6a763a2..38901b3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -109,6 +109,17 @@ static bool logical_packages_frozen __read_mostly;
/* Maximum number of SMT threads on any online core */
int __max_smt_threads __read_mostly;

+/* Flag to indicate if a complete sched domain rebuild is required */
+bool x86_topology_update;
+
+int arch_update_cpu_topology(void)
+{
+ int retval = x86_topology_update;
+
+ x86_topology_update = false;
+ return retval;
+}
+
static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
{
unsigned long flags;
--
2.7.4

2016-10-01 16:38:54

by Nilay Vaish

Subject: Re: [PATCH v5 1/9] sched: Extend scheduler's asym packing

On 1 October 2016 at 06:45, Srinivas Pandruvada
<[email protected]> wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e86c4a5..08135ca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6237,7 +6237,25 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> WARN_ON(!sg);
>
> do {
> + int cpu, max_cpu = -1, prev_cpu = -1;
> +
> sg->group_weight = cpumask_weight(sched_group_cpus(sg));
> +
> + if (!(sd->flags & SD_ASYM_PACKING))
> + goto next;
> +
> + for_each_cpu(cpu, sched_group_cpus(sg)) {
> + if (prev_cpu < 0) {
> + prev_cpu = cpu;
> + max_cpu = cpu;

It seems that you can drop prev_cpu and put the check on max_cpu instead.

--
Nilay

2016-10-03 16:49:11

by Tim Chen

Subject: Re: [PATCH v5 1/9] sched: Extend scheduler's asym packing

On Sat, Oct 01, 2016 at 11:38:04AM -0500, Nilay Vaish wrote:
> On 1 October 2016 at 06:45, Srinivas Pandruvada
> <[email protected]> wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e86c4a5..08135ca 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6237,7 +6237,25 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> > WARN_ON(!sg);
> >
> > do {
> > + int cpu, max_cpu = -1, prev_cpu = -1;
> > +
> > sg->group_weight = cpumask_weight(sched_group_cpus(sg));
> > +
> > + if (!(sd->flags & SD_ASYM_PACKING))
> > + goto next;
> > +
> > + for_each_cpu(cpu, sched_group_cpus(sg)) {
> > + if (prev_cpu < 0) {
> > + prev_cpu = cpu;
> > + max_cpu = cpu;
>
> It seems that you can drop prev_cpu and put the check on max_cpu instead.
>

The usage of prev_cpu was an artifact of the evolution of this patch.
It can indeed be dropped and I've updated this change in the patch below.

Thanks.

Tim

--->8---
commit 90955f87f228ee2fe7eeffcab851eb3141a783b4
Author: Tim Chen <[email protected]>
Subject: [PATCH v5 1/9 update] sched: Extend scheduler's asym packing

sched: Extend scheduler's asym packing

We generalize the scheduler's asym packing to provide an ordering
of the CPUs beyond just the CPU number. This allows the use of the
ASYM_PACKING scheduler machinery to move loads to the preferred CPUs
in a sched domain. The preference is defined by the CPU priority
given by arch_asym_cpu_priority(cpu).

We also record the most preferred CPU in a sched group when
we build the group's capacity, for fast lookup of the preferred CPU
during load balancing.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..aeea288 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1052,6 +1052,8 @@ static inline int cpu_numa_flags(void)
}
#endif

+int arch_asym_cpu_priority(int cpu);
+
struct sched_domain_attr {
int relax_domain_level;
};
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e86c4a5..b2e22de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6237,7 +6237,22 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
WARN_ON(!sg);

do {
+ int cpu, max_cpu = -1;
+
sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+
+ if (!(sd->flags & SD_ASYM_PACKING))
+ goto next;
+
+ for_each_cpu(cpu, sched_group_cpus(sg)) {
+ if (max_cpu < 0)
+ max_cpu = cpu;
+ else if (sched_asym_prefer(cpu, max_cpu))
+ max_cpu = cpu;
+ }
+ sg->asym_prefer_cpu = max_cpu;
+
+next:
sg = sg->next;
} while (sg != sd->groups);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..8e2a078 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -100,6 +100,16 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
*/
unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;

+#ifdef CONFIG_SMP
+/*
+ * For asym packing, by default the lower numbered cpu has higher priority.
+ */
+int __weak arch_asym_cpu_priority(int cpu)
+{
+ return -cpu;
+}
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
@@ -6862,16 +6872,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (env->idle == CPU_NOT_IDLE)
return true;
/*
- * ASYM_PACKING needs to move all the work to the lowest
- * numbered CPUs in the group, therefore mark all groups
- * higher than ourself as busy.
+ * ASYM_PACKING needs to move all the work to the highest
+ * priority CPUs in the group, therefore mark all groups
+ * of lower priority than ourself as busy.
*/
- if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
+ if (sgs->sum_nr_running &&
+ sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
if (!sds->busiest)
return true;

- /* Prefer to move from highest possible cpu's work */
- if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
+ /* Prefer to move from lowest priority cpu's work */
+ if (sched_asym_prefer(sds->busiest->asym_prefer_cpu,
+ sg->asym_prefer_cpu))
return true;
}

@@ -7023,8 +7035,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
if (!sds->busiest)
return 0;

- busiest_cpu = group_first_cpu(sds->busiest);
- if (env->dst_cpu > busiest_cpu)
+ busiest_cpu = sds->busiest->asym_prefer_cpu;
+ if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
return 0;

env->imbalance = DIV_ROUND_CLOSEST(
@@ -7365,10 +7377,11 @@ static int need_active_balance(struct lb_env *env)

/*
* ASYM_PACKING needs to force migrate tasks from busy but
- * higher numbered CPUs in order to pack all tasks in the
- * lowest numbered CPUs.
+ * lower priority CPUs in order to pack all tasks in the
+ * highest priority CPUs.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+ if ((sd->flags & SD_ASYM_PACKING) &&
+ sched_asym_prefer(env->dst_cpu, env->src_cpu))
return 1;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..b6f449d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -532,6 +532,11 @@ struct dl_rq {

#ifdef CONFIG_SMP

+static inline bool sched_asym_prefer(int a, int b)
+{
+ return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
+}
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -884,6 +889,7 @@ struct sched_group {

unsigned int group_weight;
struct sched_group_capacity *sgc;
+ int asym_prefer_cpu; /* cpu of highest priority in group */

/*
* The CPUs this group covers.

2016-10-05 14:26:23

by Thomas Gleixner

Subject: Re: [PATCH v5 4/9] x86: Enable Intel Turbo Boost Max Technology 3.0

On Sat, 1 Oct 2016, Srinivas Pandruvada wrote:
> +void sched_set_itmt_support(bool itmt_supported)
> +{
> + mutex_lock(&itmt_update_mutex);
> +
> + if (itmt_supported != sched_itmt_capable)
> + sched_itmt_capable = itmt_supported;

Yikes. What is this conditional for? The only value it has is to confuse
the reader.

> +
> + mutex_unlock(&itmt_update_mutex);
> +}
> +
> +DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);

Darn. Do not stick variable definitions in the middle of the code and
especially not glued to the function w/o a newline in between. Move it to
the top of the file.

> +int arch_asym_cpu_priority(int cpu)
> +{
> + return per_cpu(sched_core_priority, cpu);
> +}


> +void sched_set_itmt_core_prio(int prio, int core_cpu)
> +{
> + int cpu, i = 1;
> +
> + for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
> + int smt_prio;
> +
> + /*
> + * Ensure that the siblings are moved to the end
> + * of the priority chain and only used when
> + * all other high priority cpus are out of capacity.
> + */
> + smt_prio = prio * smp_num_siblings / i;
> + i++;

Your code ordering is really random. What has this i++ to do with the
store? Nothing. It just makes reading the code harder. Just move it below
the store.

> + per_cpu(sched_core_priority, cpu) = smt_prio;

Thanks,

tglx

2016-10-05 14:38:11

by Thomas Gleixner

Subject: Re: [PATCH v5 5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

On Sat, 1 Oct 2016, Srinivas Pandruvada wrote:
> +static int sched_itmt_update_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> + unsigned int old_sysctl;
> +
> + mutex_lock(&itmt_update_mutex);
> +
> + if (!sched_itmt_capable) {
> + mutex_unlock(&itmt_update_mutex);
> + return 0;

This should return a proper error code.

> void sched_set_itmt_support(bool itmt_supported)
> {
> mutex_lock(&itmt_update_mutex);
>
> - if (itmt_supported != sched_itmt_capable)
> - sched_itmt_capable = itmt_supported;
> + if (itmt_supported == sched_itmt_capable) {
> + mutex_unlock(&itmt_update_mutex);
> + return;
> + }
> + sched_itmt_capable = itmt_supported;
> +
> + if (itmt_supported) {
> + itmt_sysctl_header =
> + register_sysctl_table(itmt_root_table);
> + if (!itmt_sysctl_header) {
> + mutex_unlock(&itmt_update_mutex);
> + return;

So you now have a state of capable which cannot be enabled. Whats the
point?

> + }
> + /*
> + * ITMT capability automatically enables ITMT
> + * scheduling for small systems (single node).
> + */
> + if (topology_num_packages() == 1)
> + sysctl_sched_itmt_enabled = 1;
> + } else {
> + if (itmt_sysctl_header)
> + unregister_sysctl_table(itmt_sysctl_header);
> + }
> +
> + if (sysctl_sched_itmt_enabled) {
> + /* disable sched_itmt if we are no longer ITMT capable */
> + if (!itmt_supported)


How do you get here if itmt is not supported?

> + sysctl_sched_itmt_enabled = 0;
> + x86_topology_update = true;
> + rebuild_sched_domains();
> + }
>
> mutex_unlock(&itmt_update_mutex);

Thanks,

tglx

2016-10-05 16:05:24

by Tim Chen

Subject: Re: [PATCH v5 4/9] x86: Enable Intel Turbo Boost Max Technology 3.0

On Wed, 2016-10-05 at 16:23 +0200, Thomas Gleixner wrote:
> On Sat, 1 Oct 2016, Srinivas Pandruvada wrote:
> >
> > +void sched_set_itmt_support(bool itmt_supported)
> > +{
> > + mutex_lock(&itmt_update_mutex);
> > +
> > + if (itmt_supported != sched_itmt_capable)
> > + sched_itmt_capable = itmt_supported;
> Yikes. What is this conditional for? The only value it has is to confuse
> the reader.

Will remove the check.

>
> >
> > +
> > + mutex_unlock(&itmt_update_mutex);
> > +}
> > +
> > +DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
> Darn. Do not stick variable definitions in the middle of the code and
> especially not glued to the function w/o a newline in between. Move it to
> the top of the file.

Will move to top of file.

>
> >
> > +int arch_asym_cpu_priority(int cpu)
> > +{
> > + return per_cpu(sched_core_priority, cpu);
> > +}
>
> >
> > +void sched_set_itmt_core_prio(int prio, int core_cpu)
> > +{
> > + int cpu, i = 1;
> > +
> > + for_each_cpu(cpu, topology_sibling_cpumask(core_cpu)) {
> > + int smt_prio;
> > +
> > + /*
> > +  * Ensure that the siblings are moved to the end
> > +  * of the priority chain and only used when
> > +  * all other high priority cpus are out of capacity.
> > +  */
> > + smt_prio = prio * smp_num_siblings / i;
> > + i++;
> Your code ordering is really random. What has this i++ to do with the
> store? Nothing. It just makes reading the code harder. Just move it below
> the store.

Will move it to the end of the for loop.

>
> >
> > + per_cpu(sched_core_priority, cpu) = smt_prio;

Thanks.

Tim

2016-10-05 16:24:59

by Tim Chen

Subject: Re: [PATCH v5 5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

On Wed, 2016-10-05 at 16:35 +0200, Thomas Gleixner wrote:
> On Sat, 1 Oct 2016, Srinivas Pandruvada wrote:
> >
> > +static int sched_itmt_update_handler(struct ctl_table *table, int write,
> > +       void __user *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > + int ret;
> > + unsigned int old_sysctl;
> > +
> > + mutex_lock(&itmt_update_mutex);
> > +
> > + if (!sched_itmt_capable) {
> > + mutex_unlock(&itmt_update_mutex);
> > + return 0;
> This should return a proper error code.

Okay. Will return EINVAL instead.

>
> >
> >  void sched_set_itmt_support(bool itmt_supported)
> >  {
> >   mutex_lock(&itmt_update_mutex);
> >  
> > - if (itmt_supported != sched_itmt_capable)
> > - sched_itmt_capable = itmt_supported;
> > + if (itmt_supported == sched_itmt_capable) {
> > + mutex_unlock(&itmt_update_mutex);
> > + return;
> > + }
> > + sched_itmt_capable = itmt_supported;
> > +
> > + if (itmt_supported) {
> > + itmt_sysctl_header =
> > + register_sysctl_table(itmt_root_table);
> > + if (!itmt_sysctl_header) {
> > + mutex_unlock(&itmt_update_mutex);
> > + return;
> So you now have a state of capable which cannot be enabled. Whats the
> point?

For multi-socket systems where ITMT is not enabled by default, the operator
can still decide to enable it via sysctl.

>
> >
> > + }
> > + /*
> > +  * ITMT capability automatically enables ITMT
> > +  * scheduling for small systems (single node).
> > +  */
> > + if (topology_num_packages() == 1)
> > + sysctl_sched_itmt_enabled = 1;
> > + } else {
> > + if (itmt_sysctl_header)
> > + unregister_sysctl_table(itmt_sysctl_header);
> > + }
> > +
> > + if (sysctl_sched_itmt_enabled) {
> > + /* disable sched_itmt if we are no longer ITMT capable */
> > + if (!itmt_supported)
>
> How do you get here if itmt is not supported? 

If the OS decides to turn off ITMT for any reason (i.e., invoke
sched_set_itmt_support(false) after having turned ITMT support on
earlier), this is the logic to do it. We don't turn off ITMT support
after it has been turned on today, but in the future the OS may.

If you prefer, I can change things to sched_set_itmt_support(void) so
we can only turn on ITMT support. And once the support is on, we
don't revoke it.

Thanks.

Tim

2016-10-06 11:15:46

by Thomas Gleixner

Subject: Re: [PATCH v5 5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

On Wed, 5 Oct 2016, Tim Chen wrote:
> On Wed, 2016-10-05 at 16:35 +0200, Thomas Gleixner wrote:
> > > + if (itmt_supported) {
> > > + itmt_sysctl_header =
> > > + register_sysctl_table(itmt_root_table);
> > > + if (!itmt_sysctl_header) {
> > > + mutex_unlock(&itmt_update_mutex);
> > > + return;
> > So you now have a state of capable which cannot be enabled. Whats the
> > point?
>
> For multi-socket system where ITMT is not enabled by default, the operator
> can still decide to enable it via sysctl.

With a sysctl which failed to be installed. Good luck with that.

> > > + }
> > > + /*
> > > +  * ITMT capability automatically enables ITMT
> > > +  * scheduling for small systems (single node).
> > > +  */
> > > + if (topology_num_packages() == 1)
> > > + sysctl_sched_itmt_enabled = 1;
> > > + } else {
> > > + if (itmt_sysctl_header)
> > > + unregister_sysctl_table(itmt_sysctl_header);
> > > + }
> > > +
> > > + if (sysctl_sched_itmt_enabled) {
> > > + /* disable sched_itmt if we are no longer ITMT capable */
> > > + if (!itmt_supported)
> >
> > How do you get here if itmt is not supported? 
>
> If the OS decides to turn off ITMT for any reason, (i.e. invoke 
> sched_set_itmt_support(false) after it has turned on itmt_support
> before), this is the logic to do it.  We don't turn off ITMT support
> after it has been turned on today, in the future the OS may.

Then please make this two functions (set/clear) so one can actually follow
the logic. The above is just too convoluted.

Thanks,

tglx

2016-10-06 17:37:57

by Tim Chen

Subject: Re: [PATCH v5 5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

On Thu, 2016-10-06 at 13:13 +0200, Thomas Gleixner wrote:
> On Wed, 5 Oct 2016, Tim Chen wrote:
> >
> > On Wed, 2016-10-05 at 16:35 +0200, Thomas Gleixner wrote:
> > >
> > > >
> > > > + if (itmt_supported) {
> > > > + itmt_sysctl_header =
> > > > + register_sysctl_table(itmt_root_table);
> > > > + if (!itmt_sysctl_header) {
> > > > + mutex_unlock(&itmt_update_mutex);
> > > > + return;
> > > So you now have a state of capable which cannot be enabled. Whats the
> > > point?
> > For multi-socket system where ITMT is not enabled by default, the operator
> > can still decide to enable it via sysctl.
> With a sysctl which failed to be installed. Good luck with that.

I misunderstood your earlier comment.
You are talking about the case where we fail to register the sysctl?

In this case, the system is in a state that indicates it is 
ITMT capable but cannot be enabled.  So we return and do not turn on ITMT
scheduling.  The system operator should always have the capability
to enable/disable ITMT via sysctl.  So we do not turn on ITMT if the operator has
no control over it, even if the system is capable of ITMT.


>  
> >
> > >
> > > >
> > > > + }
> > > > + /*
> > > > +  * ITMT capability automatically enables ITMT
> > > > +  * scheduling for small systems (single node).
> > > > +  */
> > > > + if (topology_num_packages() == 1)
> > > > + sysctl_sched_itmt_enabled = 1;
> > > > + } else {
> > > > + if (itmt_sysctl_header)
> > > > + unregister_sysctl_table(itmt_sysctl_header);
> > > > + }
> > > > +
> > > > + if (sysctl_sched_itmt_enabled) {
> > > > + /* disable sched_itmt if we are no longer ITMT capable */
> > > > + if (!itmt_supported)
> > > How do you get here if itmt is not supported? 
> > If the OS decides to turn off ITMT for any reason, (i.e. invoke 
> > sched_set_itmt_support(false) after it has turned on itmt_support
> > before), this is the logic to do it.  We don't turn off ITMT support
> > after it has been turned on today, in the future the OS may.
> Then please make this two functions (set/clear) so one can actually follow
> the logic. The above is just too convoluted.

Sure, I will add a clear function and move the clearing logic there.

Thanks.

Tim

2016-10-12 16:54:00

by Tim Chen

Subject: Re: [PATCH v5 5/9] x86/sysctl: Add sysctl for ITMT scheduling feature

On Thu, Oct 06, 2016 at 01:13:08PM +0200, Thomas Gleixner wrote:
> On Wed, 5 Oct 2016, Tim Chen wrote:
> > On Wed, 2016-10-05 at 16:35 +0200, Thomas Gleixner wrote:
> > > > + if (itmt_supported) {
> > > > + itmt_sysctl_header =
> > > > + register_sysctl_table(itmt_root_table);
> > > > + if (!itmt_sysctl_header) {
> > > > + mutex_unlock(&itmt_update_mutex);
> > > > + return;
> > > So you now have a state of capable which cannot be enabled. Whats the
> > > point?
> >
> > For multi-socket system where ITMT is not enabled by default, the operator
> > can still decide to enable it via sysctl.
>
> With a sysctl which failed to be installed. Good luck with that.
>
> > > > + }
> > > > + /*
> > > > + * ITMT capability automatically enables ITMT
> > > > + * scheduling for small systems (single node).
> > > > + */
> > > > + if (topology_num_packages() == 1)
> > > > + sysctl_sched_itmt_enabled = 1;
> > > > + } else {
> > > > + if (itmt_sysctl_header)
> > > > + unregister_sysctl_table(itmt_sysctl_header);
> > > > + }
> > > > +
> > > > + if (sysctl_sched_itmt_enabled) {
> > > > + /* disable sched_itmt if we are no longer ITMT capable */
> > > > + if (!itmt_supported)
> > >
> > > How do you get here if itmt is not supported?
> >
> > If the OS decides to turn off ITMT for any reason, (i.e. invoke
> > sched_set_itmt_support(false) after it has turned on itmt_support
> > before), this is the logic to do it. We don't turn off ITMT support
> > after it has been turned on today, in the future the OS may.
>
> Then please make this two functions (set/clear) so one can actually follow
> the logic. The above is just too convoluted.
>

Thomas,

Will the updated patch below address your concerns with this patch?
Please let us know if you have any other additional comments about
this series. We'd like to address all of them before we post
an update to the series.

Thanks.

Tim

--->8---

From: Tim Chen <[email protected]>
Subject: [PATCH 5/6 v5 - update proposal] x86/sysctl: Add sysctl for ITMT scheduling feature

Intel Turbo Boost Max Technology 3.0 (ITMT) allows some cores to be
boosted to a higher turbo frequency than others.

Add /proc/sys/kernel/sched_itmt_enabled so the operator can
enable/disable scheduling of tasks that favor cores with higher turbo
boost frequency potential.

By default, a system that is ITMT-capable and single-socket has this
feature turned on, since it is more likely to be lightly loaded and to
operate in the turbo range.

When a change in the desired ITMT scheduling operation occurs, a
rebuild of the sched domains is initiated so the scheduler can set up
the sched domains with the appropriate flags to enable/disable ITMT
scheduling operations.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
arch/x86/include/asm/topology.h | 2 +
arch/x86/kernel/itmt.c | 105 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 107 insertions(+)
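
With the set/clear split, the driver-facing usage becomes (sketch, not
part of the patch; cf. patch 9/9, which invokes the set side from a work
item):

	int ret = sched_set_itmt_support();	/* registers sysctl, may enable by default */

	if (ret)
		pr_warn("ITMT: could not register sysctl (%d)\n", ret);

	/* later, to revoke support: unregister sysctl, disable ITMT, rebuild domains */
	sched_clear_itmt_support();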

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 1cd8d12..46ebdd1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -155,6 +155,7 @@ extern bool x86_topology_update;
#include <asm/percpu.h>

DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
+extern unsigned int __read_mostly sysctl_sched_itmt_enabled;

/* Interface to set priority of a cpu */
void sched_set_itmt_core_prio(int prio, int core_cpu);
@@ -167,6 +168,7 @@ void sched_clear_itmt_support(void);

#else /* CONFIG_SCHED_ITMT */

+#define sysctl_sched_itmt_enabled 0
static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
{
}
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 4be3d81..b104368 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -34,6 +34,67 @@ DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
/* Boolean to track if system has ITMT capabilities */
static bool __read_mostly sched_itmt_capable;

+/*
+ * Boolean to control whether we want to move processes to cpu capable
+ * of higher turbo frequency for cpus supporting Intel Turbo Boost Max
+ * Technology 3.0.
+ *
+ * It can be set via /proc/sys/kernel/sched_itmt_enabled
+ */
+unsigned int __read_mostly sysctl_sched_itmt_enabled;
+
+static int sched_itmt_update_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ unsigned int old_sysctl;
+
+ mutex_lock(&itmt_update_mutex);
+
+ if (!sched_itmt_capable) {
+ mutex_unlock(&itmt_update_mutex);
+ return -EINVAL;
+ }
+
+ old_sysctl = sysctl_sched_itmt_enabled;
+ ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }
+
+ mutex_unlock(&itmt_update_mutex);
+
+ return ret;
+}
+
+static unsigned int zero;
+static unsigned int one = 1;
+static struct ctl_table itmt_kern_table[] = {
+ {
+ .procname = "sched_itmt_enabled",
+ .data = &sysctl_sched_itmt_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_itmt_update_handler,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ {}
+};
+
+static struct ctl_table itmt_root_table[] = {
+ {
+ .procname = "kernel",
+ .mode = 0555,
+ .child = itmt_kern_table,
+ },
+ {}
+};
+
+static struct ctl_table_header *itmt_sysctl_header;
+
/**
* sched_set_itmt_support - Indicate platform supports ITMT
*
@@ -47,13 +108,40 @@ static bool __read_mostly sched_itmt_capable;
*
* This must be done only after sched_set_itmt_core_prio
* has been called to set the cpus' priorities.
+ *
+ * It must not be called with cpu hot plug lock
+ * held as we need to acquire the lock to rebuild sched domains
+ * later.
*/
int sched_set_itmt_support(void)
{
mutex_lock(&itmt_update_mutex);

+ if (sched_itmt_capable) {
+ mutex_unlock(&itmt_update_mutex);
+ return 0;
+ }
+
+ itmt_sysctl_header = register_sysctl_table(itmt_root_table);
+ if (!itmt_sysctl_header) {
+ mutex_unlock(&itmt_update_mutex);
+ return -ENOMEM;
+ }
+
sched_itmt_capable = true;

+ /*
+ * ITMT capability automatically enables ITMT
+ * scheduling for small systems (single node).
+ */
+ if (topology_num_packages() == 1)
+ sysctl_sched_itmt_enabled = 1;
+
+ if (sysctl_sched_itmt_enabled) {
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }
+
mutex_unlock(&itmt_update_mutex);
return 0;
}
@@ -64,13 +152,30 @@ int sched_set_itmt_support(void)
* This function is used by the OS to indicate that it has
* revoked the platform's support of ITMT feature.
*
+ * It must not be called with cpu hot plug lock
+ * held as we need to acquire the lock to rebuild sched domains
+ * later.
*/
void sched_clear_itmt_support(void)
{
mutex_lock(&itmt_update_mutex);

+ if (!sched_itmt_capable) {
+ mutex_unlock(&itmt_update_mutex);
+ return;
+ }
sched_itmt_capable = false;

+ if (itmt_sysctl_header)
+ unregister_sysctl_table(itmt_sysctl_header);
+
+ if (sysctl_sched_itmt_enabled) {
+ /* disable sched_itmt if we are no longer ITMT capable */
+ sysctl_sched_itmt_enabled = 0;
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }
+
mutex_unlock(&itmt_update_mutex);
}

--
2.5.5