2021-02-19 13:05:11

by Valentin Schneider

Subject: [PATCH v2 0/7] sched/fair: misfit task load-balance tweaks

Hi folks,

Here is this year's series of misfit changes. On the menu:

o Patch 1 prevents pcpu kworkers from causing group_imbalanced
o Patch 2 is an independent active balance cleanup
o Patch 3 adds some more sched_asym_cpucapacity static branches
o Patch 4 introduces yet another margin for capacity to capacity
comparisons
o Patches 5-6 build on top of patch 4 and change capacity comparisons
throughout misfit load balancing
o Patch 7 aligns running and non-running misfit task cache hotness
considerations

IMO the somewhat controversial bit is patch 4, because it attempts to solve
margin issues by... Adding another margin. This does solve issues on
existing platforms (e.g. Pixel4), but we'll be back to square one the day
some "clever" folks spin a platform with two different CPU capacities less
than 5% apart.
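
For the record, the new helper in patch 4 considers cap1 "noticeably greater"
than cap2 only when cap1 * 1024 > cap2 * 1078, i.e. when the two are more
than ~5% apart. A hypothetical example with made-up capacities of 1024 and
980 (~4.5% apart):

  capacity_greater(1024, 980): 1024 * 1024 = 1048576 > 980 * 1078 = 1056440 -> false

so on such a platform, misfit upmigration from the 980-capacity CPUs would
once again be filtered out.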

This is based on top of today's tip/sched/core at:

c5e6fc08feb2 ("sched,x86: Allow !PREEMPT_DYNAMIC")

Testing
=======

I ran my usual [1] misfit tests on
o TC2
o Juno
o HiKey960
o Dragonboard845C
o RB5

RB5 has a similar topology to Pixel4 and highlights the problem of having
two different CPU capacity values above 819 (in this case 871 and 1024):
without these patches, CPU hogs (i.e. misfit tasks) running on the "medium"
CPUs will never be upmigrated to a "big" via misfit balance.

Revisions
=========

v1 -> v2
--------

o Collected Reviewed-by
o Minor comment and code cleanups

o Consolidated static key vs SD flag explanation (Dietmar)

Note to Vincent: I didn't measure the impact of adding said static key to
load_balance(); I do however believe it is a low hanging fruit. The
wrapper keeps things neat and tidy, and is also helpful for documenting
the intricacies of the static key status vs the presence of the SD flag
in a CPU's sched_domain hierarchy.

o Removed v1 patch 4 - root_domain.max_cpu_capacity is absolutely not what
I had convinced myself it was.
o Squashed capacity margin usage with removal of
group_smaller_{min, max}_capacity() (Vincent)
o Replaced v1 patch 7 with Lingutla's can_migrate_task() patch [2]
o Rewrote task_hot() modification changelog

Links
=====

[1]: https://lisa-linux-integrated-system-analysis.readthedocs.io/en/master/kernel_tests.html#lisa.tests.scheduler.misfit.StaggeredFinishes
[2]: http://lore.kernel.org/r/[email protected]

Cheers,
Valentin

Lingutla Chandrasekhar (1):
sched/fair: Ignore percpu threads for imbalance pulls

Valentin Schneider (6):
sched/fair: Clean up active balance nr_balance_failed trickery
sched/fair: Add more sched_asym_cpucapacity static branch checks
sched/fair: Introduce a CPU capacity comparison helper
sched/fair: Employ capacity_greater() throughout load_balance()
sched/fair: Filter out locally-unsolvable misfit imbalances
sched/fair: Relax task_hot() for misfit tasks

kernel/sched/fair.c | 128 ++++++++++++++++++++++++-------------------
kernel/sched/sched.h | 33 +++++++++++
2 files changed, 105 insertions(+), 56 deletions(-)

--
2.27.0


2021-02-19 13:05:38

by Valentin Schneider

Subject: [PATCH v2 1/7] sched/fair: Ignore percpu threads for imbalance pulls

From: Lingutla Chandrasekhar <[email protected]>

During load balancing, when the balancing group is unable to pull a task
from the busy group due to ->cpus_ptr constraints, it sets LBF_SOME_PINNED
in the lb env flags. As a consequence, sgc->imbalance is set at the parent
domain level, which causes the group to be classified as imbalanced so that
it can get help from another balancing CPU.

Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
CPUs 2-3 as Bigs, with the following scenario:
- CPU0 doing newly_idle balancing
- CPU1 running percpu kworker and RT task (small tasks)
- CPU2 running 2 big tasks
- CPU3 running 1 medium task

While CPU0 is doing a newly_idle load balance at MC level, it fails to
pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env flags
and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED is not
cleared, it tries to redo the balancing after clearing CPU1 from the env
cpus, but it doesn't find another busiest_group, so CPU0 stops balancing at
MC level without clearing 'sgc->imbalance' and restarts the load balancing
at DIE level.

CPU0 (the balancing CPU) then finds the LITTLEs' group as busiest_group
with group type imbalanced; the Bigs' group, being classified at a level
below the imbalanced type, is ignored when picking the busiest group, and
the balancing is aborted without pulling any tasks (by that time, CPU1
might not have any running tasks).

Classifying the group as imbalanced because of percpu threads is a
suboptimal decision, so don't set LBF_SOME_PINNED for per-CPU kthreads.

Signed-off-by: Lingutla Chandrasekhar <[email protected]>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..2d4dcf1a3372 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7539,6 +7539,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;

+ /* Disregard pcpu kthreads; they are where they need to be. */
+ if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+ return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;

--
2.27.0

2021-02-19 13:06:27

by Valentin Schneider

Subject: [PATCH v2 4/7] sched/fair: Introduce a CPU capacity comparison helper

During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
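
To put numbers on the cutoff (this is just the existing macro's arithmetic):

  fits_capacity(cap, max) := cap * 1280 < max * 1024

  fits_capacity(819, 1024): 819 * 1280 = 1048320 < 1048576 -> true
  fits_capacity(820, 1024): 820 * 1280 = 1049600 < 1048576 -> false
  fits_capacity(871, 1024): 871 * 1280 = 1114880 < 1048576 -> false

IOW anything above 1024 * 1024 / 1280 = 819.2 fails the check, so the
Pixel 4's mediums (871) are never seen as "smaller" than its bigs.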

One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24119f9ad191..cc16d0e0b9fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap1' noticeably greater than 'cap2'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
#endif

#ifdef CONFIG_CFS_BANDWIDTH
--
2.27.0

2021-02-19 13:06:37

by Valentin Schneider

Subject: [PATCH v2 2/7] sched/fair: Clean up active balance nr_balance_failed trickery

When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().

This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.

Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose.
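
For reference, the write being removed (see the hunk below) sets
sd->nr_balance_failed to sd->cache_nice_tries + 1, which is exactly what it
takes to trip the existing check in can_migrate_task() (paraphrased here):

	if (tsk_cache_hot <= 0 ||
	    env->sd->nr_balance_failed > env->sd->cache_nice_tries)
		return 1;

The new LBF_ACTIVE_LB flag makes that intent explicit instead.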

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d4dcf1a3372..535ebc31c9a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7394,6 +7394,7 @@ enum migration_type {
#define LBF_SOME_PINNED 0x08
#define LBF_NOHZ_STATS 0x10
#define LBF_NOHZ_AGAIN 0x20
+#define LBF_ACTIVE_LB 0x40

struct lb_env {
struct sched_domain *sd;
@@ -7583,10 +7584,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)

/*
* Aggressive migration if:
- * 1) destination numa is preferred
- * 2) task is cache cold, or
- * 3) too many balance attempts have failed.
+ * 1) active balance
+ * 2) destination numa is preferred
+ * 3) task is cache cold, or
+ * 4) too many balance attempts have failed.
*/
+ if (env->flags & LBF_ACTIVE_LB)
+ return 1;
+
tsk_cache_hot = migrate_degrades_locality(p, env);
if (tsk_cache_hot == -1)
tsk_cache_hot = task_hot(p, env);
@@ -9780,9 +9785,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
}
-
- /* We've kicked active balancing, force task migration. */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
}
} else {
sd->nr_balance_failed = 0;
@@ -9938,7 +9940,8 @@ static int active_load_balance_cpu_stop(void *data)
* @dst_grpmask we need to make that test go away with lying
* about DST_PINNED.
*/
- .flags = LBF_DST_PINNED,
+ .flags = LBF_DST_PINNED |
+ LBF_ACTIVE_LB,
};

schedstat_inc(sd->alb_count);
--
2.27.0

2021-02-19 13:07:10

by Valentin Schneider

Subject: [PATCH v2 3/7] sched/fair: Add more sched_asym_cpucapacity static branch checks

Rik noted a while back that a handful of

sd->flags & SD_ASYM_CPUCAPACITY

& family in the CFS load-balancer code aren't guarded by the
sched_asym_cpucapacity static branch.

Turning those checks into NOPs for those who don't need it is fairly
straightforward, and hiding it in a helper leaves the code size unchanged in
all but one spot. It also gives us a place to document the differences between
checking the static key and checking the SD flag.

Suggested-by: Rik van Riel <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++-------------
kernel/sched/sched.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 535ebc31c9a8..24119f9ad191 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6288,15 +6288,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* sd_asym_cpucapacity rather than sd_llc.
*/
if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ /* See sd_has_asym_cpucapacity() */
sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
- /*
- * On an asymmetric CPU capacity system where an exclusive
- * cpuset defines a symmetric island (i.e. one unique
- * capacity_orig value through the cpuset), the key will be set
- * but the CPUs within that cpuset will not have a domain with
- * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
- * capacity path.
- */
if (sd) {
i = select_idle_capacity(p, sd, target);
return ((unsigned)i < nr_cpumask_bits) ? i : target;
@@ -8440,7 +8433,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;

/* Check for a misfit task on the cpu */
- if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ if (sd_has_asym_cpucapacity(env->sd) &&
sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
*sg_status |= SG_OVERLOAD;
@@ -8497,7 +8490,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
*/
- if (sgs->group_type == group_misfit_task &&
+ if (static_branch_unlikely(&sched_asym_cpucapacity) &&
+ sgs->group_type == group_misfit_task &&
(!group_smaller_max_cpu_capacity(sg, sds->local) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -8580,7 +8574,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* throughput. Maximize throughput, power/energy consequences are not
* considered.
*/
- if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ if (sd_has_asym_cpucapacity(env->sd) &&
(sgs->group_type <= group_fully_busy) &&
(group_smaller_min_cpu_capacity(sds->local, sg)))
return false;
@@ -8703,7 +8697,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
}

/* Check if task fits in the group */
- if (sd->flags & SD_ASYM_CPUCAPACITY &&
+ if (sd_has_asym_cpucapacity(sd) &&
!task_fits_capacity(p, group->sgc->max_capacity)) {
sgs->group_misfit_task_load = 1;
}
@@ -9394,7 +9388,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* Higher per-CPU capacity is considered better than balancing
* average load.
*/
- if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ if (sd_has_asym_cpucapacity(env->sd) &&
capacity_of(env->dst_cpu) < capacity &&
nr_running == 1)
continue;
@@ -10224,6 +10218,7 @@ static void nohz_balancer_kick(struct rq *rq)
}
}

+ /* See sd_has_asym_cpucapacity(). */
sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, cpu));
if (sd) {
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..a447b3f28792 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1484,6 +1484,39 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
extern struct static_key_false sched_asym_cpucapacity;

+/*
+ * Note that the static key is system-wide, but the visibility of
+ * SD_ASYM_CPUCAPACITY isn't. Thus the static key being enabled does not
+ * imply all CPUs can see asymmetry.
+ *
+ * Consider an asymmetric CPU capacity system such as:
+ *
+ * MC [ ]
+ * 0 1 2 3 4 5
+ * L L L L B B
+ *
+ * w/ arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
+ *
+ * By default, booting this system will enable the sched_asym_cpucapacity
+ * static key, and all CPUs will see SD_ASYM_CPUCAPACITY set at their MC
+ * sched_domain.
+ *
+ * Further consider exclusive cpusets creating a "symmetric island":
+ *
+ * MC [ ][ ]
+ * 0 1 2 3 4 5
+ * L L L L B B
+ *
+ * Again, booting this will enable the static key, but CPUs 0-1 will *not* have
+ * SD_ASYM_CPUCAPACITY set in any of their sched_domain. This is the intended
+ * behaviour, as CPUs 0-1 should be treated as a regular, isolated SMP system.
+ */
+static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
+{
+ return static_branch_unlikely(&sched_asym_cpucapacity) &&
+ sd->flags & SD_ASYM_CPUCAPACITY;
+}
+
struct sched_group_capacity {
atomic_t ref;
/*
--
2.27.0

2021-02-19 13:07:16

by Valentin Schneider

Subject: [PATCH v2 5/7] sched/fair: Employ capacity_greater() throughout load_balance()

Employ the new capacity_greater() helper for the capacity comparisons in
load_balance(). While at it, replace group_smaller_{min, max}_cpu_capacity()
with comparisons of the source group's min/max capacity and the destination
CPU's capacity.
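
In other words, the group-vs-group comparisons become comparisons against
the destination CPU (summarizing the hunks below):

  group_smaller_max_cpu_capacity(sg, sds->local)
    -> capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity)

  group_smaller_min_cpu_capacity(sds->local, sg)
    -> capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))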

Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 33 ++++-----------------------------
1 file changed, 4 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc16d0e0b9fb..af5ce083c982 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8320,26 +8320,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
return false;
}

-/*
- * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity than sched_group ref.
- */
-static inline bool
-group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
-}
-
-/*
- * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
- * per-CPU capacity_orig than sched_group ref.
- */
-static inline bool
-group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
-{
- return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
-}
-
static inline enum
group_type group_classify(unsigned int imbalance_pct,
struct sched_group *group,
@@ -8491,15 +8471,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (!sgs->sum_h_nr_running)
return false;

- /*
- * Don't try to pull misfit tasks we can't help.
- * We can use max_capacity here as reduction in capacity on some
- * CPUs in the group should either be possible to resolve
- * internally or be covered by avg_load imbalance (eventually).
- */
+ /* Don't try to pull misfit tasks we can't help */
if (static_branch_unlikely(&sched_asym_cpucapacity) &&
sgs->group_type == group_misfit_task &&
- (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;

@@ -8583,7 +8558,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
*/
if (sd_has_asym_cpucapacity(env->sd) &&
(sgs->group_type <= group_fully_busy) &&
- (group_smaller_min_cpu_capacity(sds->local, sg)))
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
return false;

return true;
@@ -9396,7 +9371,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* average load.
*/
if (sd_has_asym_cpucapacity(env->sd) &&
- capacity_of(env->dst_cpu) < capacity &&
+ !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;

--
2.27.0

2021-02-19 13:07:37

by Valentin Schneider

Subject: [PATCH v2 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

Consider the following (hypothetical) asymmetric CPU capacity topology,
with some amount of capacity pressure (RT | DL | IRQ | thermal):

DIE [ ]
MC [ ][ ]
0 1 2 3

| CPU | capacity_orig | capacity |
|-----+---------------+----------|
| 0 | 870 | 860 |
| 1 | 870 | 600 |
| 2 | 1024 | 850 |
| 3 | 1024 | 860 |

If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 to be
sufficiently busy, i.e. they don't have enough spare capacity to accommodate
CPU1's misfit task. It would then fall on CPU2 to pull the task.

This currently won't happen, because CPU2 will fail

capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)

in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
level. In this case, the max_capacity is that of CPU0, which is at this
point in time greater than that of CPU2. This comparison doesn't make
much sense, given that the only CPUs we should care about in this scenario
are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
destination CPU).
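
Plugging the above numbers in:

  capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
    = capacity_greater(850, 860): 850 * 1024 = 870400 > 860 * 1078 = 927080 -> false

so the group gets filtered out, even though the comparison that actually
matters here would pass:

  capacity_greater(capacity_of(CPU2), capacity_of(CPU1))
    = capacity_greater(850, 600): 850 * 1024 = 870400 > 600 * 1078 = 646800 -> true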

Aggregate a misfit task's load into sgs->group_misfit_task_load only if
env->dst_cpu would grant it a capacity uplift. Separately track whether a
sched_group contains a misfit task to still classify it as
group_misfit_task and not pick it as busiest group when pulling from a
lower-capacity CPU (which is the current behaviour and prevents
down-migration).

Since find_busiest_queue() can now iterate over CPUs with a higher capacity
than the local CPU's, add a capacity check there.

Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af5ce083c982..ee172b384e29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5747,6 +5747,12 @@ static unsigned long capacity_of(int cpu)
return cpu_rq(cpu)->cpu_capacity;
}

+/* Is CPU a's capacity noticeably greater than CPU b's? */
+static inline bool cpu_capacity_greater(int a, int b)
+{
+ return capacity_greater(capacity_of(a), capacity_of(b));
+}
+
static void record_wakee(struct task_struct *p)
{
/*
@@ -8061,7 +8067,8 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ unsigned long group_misfit_task_load; /* Task load that can be uplifted */
+ int group_has_misfit_task; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -8334,7 +8341,7 @@ group_type group_classify(unsigned int imbalance_pct,
if (sgs->group_asym_packing)
return group_asym_packing;

- if (sgs->group_misfit_task_load)
+ if (sgs->group_has_misfit_task)
return group_misfit_task;

if (!group_has_capacity(imbalance_pct, sgs))
@@ -8420,10 +8427,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;

/* Check for a misfit task on the cpu */
- if (sd_has_asym_cpucapacity(env->sd) &&
- sgs->group_misfit_task_load < rq->misfit_task_load) {
- sgs->group_misfit_task_load = rq->misfit_task_load;
- *sg_status |= SG_OVERLOAD;
+ if (!sd_has_asym_cpucapacity(env->sd) ||
+ !rq->misfit_task_load)
+ continue;
+
+ *sg_status |= SG_OVERLOAD;
+ sgs->group_has_misfit_task = true;
+
+ /*
+ * Don't attempt to maximize load for misfit tasks that can't be
+ * granted a CPU capacity uplift.
+ */
+ if (cpu_capacity_greater(env->dst_cpu, i)) {
+ sgs->group_misfit_task_load = max(
+ sgs->group_misfit_task_load,
+ rq->misfit_task_load);
}
}

@@ -8474,7 +8492,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Don't try to pull misfit tasks we can't help */
if (static_branch_unlikely(&sched_asym_cpucapacity) &&
sgs->group_type == group_misfit_task &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ (!sgs->group_misfit_task_load ||
sds->local_stat.group_type != group_has_spare))
return false;

@@ -9434,15 +9452,18 @@ static struct rq *find_busiest_queue(struct lb_env *env,
case migrate_misfit:
/*
* For ASYM_CPUCAPACITY domains with misfit tasks we
- * simply seek the "biggest" misfit task.
+ * simply seek the "biggest" misfit task we can
+ * accommodate.
*/
+ if (!cpu_capacity_greater(env->dst_cpu, i))
+ continue;
+
if (rq->misfit_task_load > busiest_load) {
busiest_load = rq->misfit_task_load;
busiest = rq;
}

break;
-
}
}

--
2.27.0

2021-02-19 13:07:46

by Valentin Schneider

Subject: [PATCH v2 7/7] sched/fair: Relax task_hot() for misfit tasks

Consider the following topology:

DIE [ ]
MC [ ][ ]
0 1 2 3

capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})

w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).

When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just before
this load_balance() happens and preempt the CPU hog running there, we would
have, for the [0-1] group at CPU2's DIE level:

o sgs->sum_nr_running > sgs->group_weight
o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct

IOW, this group is group_overloaded.

Considering CPU0 is picked by find_busiest_queue(), we would then visit the
preempted CPU hog in detach_tasks(). However, given it has just been
preempted by this pcpu kworker, task_hot() will prevent it from being
detached. We then leave load_balance() without having done anything.

Long story short, preempted misfit tasks are affected by task_hot(), while
currently running misfit tasks are intentionally preempted by the stopper
task to migrate them over to a higher-capacity CPU.

Align detach_tasks() with the active-balance logic and let it pick a
cache-hot misfit task when the destination CPU can provide a capacity
uplift.
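
As an illustration with made-up numbers (capacity_orig of 512 for CPUs 0-1,
1024 for CPUs 2-3, no pressure, no uclamp), the new condition in task_hot()
for the preempted hog (util_avg=1024) on CPU0 with CPU2 as destination:

  env->idle != CPU_NOT_IDLE                      -> true (periodic / NOHZ idle balance)
  task_fits_capacity(p, capacity_of(CPU0)):
    1024 * 1280 = 1310720 < 512 * 1024 = 524288  -> false, so !task_fits_capacity() holds
  cpu_capacity_greater(CPU2, CPU0):
    1024 * 1024 = 1048576 > 512 * 1078 = 551936  -> true

so task_hot() returns 0 and detach_tasks() is free to pull the hog.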

Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee172b384e29..554430fd249c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7448,6 +7448,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
if (env->sd->flags & SD_SHARE_CPUCAPACITY)
return 0;

+ /*
+ * On a (sane) asymmetric CPU capacity system, the increase in compute
+ * capacity should offset any potential performance hit caused by a
+ * migration.
+ */
+ if (sd_has_asym_cpucapacity(env->sd) &&
+ env->idle != CPU_NOT_IDLE &&
+ !task_fits_capacity(p, capacity_of(env->src_cpu)) &&
+ cpu_capacity_greater(env->dst_cpu, env->src_cpu))
+ return 0;
+
/*
* Buddy candidates are cache hot:
*/
--
2.27.0

2021-02-22 05:35:13

by Pavankumar Kondeti

Subject: Re: [PATCH v2 1/7] sched/fair: Ignore percpu threads for imbalance pulls

On Fri, Feb 19, 2021 at 12:59:57PM +0000, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar <[email protected]>
>
> During load balancing, when the balancing group is unable to pull a task
> from the busy group due to ->cpus_ptr constraints, it sets LBF_SOME_PINNED
> in the lb env flags. As a consequence, sgc->imbalance is set at the parent
> domain level, which causes the group to be classified as imbalanced so that
> it can get help from another balancing CPU.
>
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
> CPUs 2-3 as Bigs, with the following scenario:
> - CPU0 doing newly_idle balancing
> - CPU1 running percpu kworker and RT task (small tasks)
> - CPU2 running 2 big tasks
> - CPU3 running 1 medium task
>
> While CPU0 is doing a newly_idle load balance at MC level, it fails to
> pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env flags
> and sets sgc->imbalance at the DIE level domain. As LBF_ALL_PINNED is not
> cleared, it tries to redo the balancing after clearing CPU1 from the env
> cpus, but it doesn't find another busiest_group, so CPU0 stops balancing at
> MC level without clearing 'sgc->imbalance' and restarts the load balancing
> at DIE level.
>
> CPU0 (the balancing CPU) then finds the LITTLEs' group as busiest_group
> with group type imbalanced; the Bigs' group, being classified at a level
> below the imbalanced type, is ignored when picking the busiest group, and
> the balancing is aborted without pulling any tasks (by that time, CPU1
> might not have any running tasks).
>
> Classifying the group as imbalanced because of percpu threads is a
> suboptimal decision, so don't set LBF_SOME_PINNED for per-CPU kthreads.
>
> Signed-off-by: Lingutla Chandrasekhar <[email protected]>
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..2d4dcf1a3372 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7539,6 +7539,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
> + /* Disregard pcpu kthreads; they are where they need to be. */
> + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> + return 0;
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>

Looks good to me. Thanks Valentin for the help.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2021-02-23 02:39:02

by kernel test robot

Subject: [sched/fair] b360fb5e59: stress-ng.vm-segv.ops_per_sec -13.9% regression


Greetings,

FYI, we noticed a -13.9% regression of stress-ng.vm-segv.ops_per_sec due to commit:


commit: b360fb5e5954a8a440ef95bf11257e2e7ea90340 ("[PATCH v2 1/7] sched/fair: Ignore percpu threads for imbalance pulls")
url: https://github.com/0day-ci/linux/commits/Valentin-Schneider/sched-fair-misfit-task-load-balance-tweaks/20210219-211028
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git c5e6fc08feb2b88dc5dac2f3c817e1c2a4cafda4

in testcase: stress-ng
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 60s
fs: ext4
class: vm
test: vm-segv
cpufreq_governor: performance
ucode: 0x5003003




If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml

=========================================================================================
class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
vm/gcc-9/performance/1HDD/ext4/x86_64-rhel-8.3/10%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp7/vm-segv/stress-ng/60s/0x5003003

commit:
c5e6fc08fe ("sched,x86: Allow !PREEMPT_DYNAMIC")
b360fb5e59 ("sched/fair: Ignore percpu threads for imbalance pulls")

c5e6fc08feb2b88d b360fb5e5954a8a440ef95bf112
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
1:6 -3% 1:6 perf-profile.children.cycles-pp.error_entry
0:6 -1% 0:6 perf-profile.self.cycles-pp.error_entry
%stddev %change %stddev
\ | \
11324 ? 3% -28.1% 8140 ? 3% stress-ng.time.involuntary_context_switches
6818 ? 15% +315.2% 28311 ? 12% stress-ng.time.major_page_faults
30952041 -12.8% 26988502 stress-ng.time.minor_page_faults
378.82 +5.3% 398.75 stress-ng.time.system_time
215.82 -10.0% 194.24 stress-ng.time.user_time
62102177 -13.9% 53448474 stress-ng.time.voluntary_context_switches
810348 -13.9% 698034 stress-ng.vm-segv.ops
13505 -13.9% 11633 stress-ng.vm-segv.ops_per_sec
9.76 -1.3% 9.63 iostat.cpu.system
311.06 +3.9% 323.18 pmeter.Average_Active_Power
1937515 -13.9% 1667746 vmstat.system.cs
2.33e+08 +15.2% 2.685e+08 cpuidle.C1.time
14107313 -9.8% 12722308 ? 2% cpuidle.C1.usage
51960611 -15.6% 43858596 cpuidle.POLL.usage
1.11 -0.3 0.85 mpstat.cpu.all.irq%
0.18 -0.0 0.16 mpstat.cpu.all.soft%
0.40 -0.1 0.35 mpstat.cpu.all.usr%
52529 ? 6% -35.8% 33706 ? 44% sched_debug.cfs_rq:/.min_vruntime.avg
38574 ? 10% -41.7% 22479 ? 41% sched_debug.cfs_rq:/.min_vruntime.min
624583 -56.9% 269004 ? 99% sched_debug.cpu.nr_switches.avg
421467 ? 6% -65.0% 147633 ?102% sched_debug.cpu.nr_switches.min
33017 ? 41% +71.9% 56752 ? 4% numa-vmstat.node0.nr_anon_pages
33861 ? 40% +68.8% 57163 ? 4% numa-vmstat.node0.nr_inactive_anon
33861 ? 40% +68.8% 57163 ? 4% numa-vmstat.node0.nr_zone_inactive_anon
31421 ? 44% -75.4% 7730 ? 31% numa-vmstat.node1.nr_anon_pages
32588 ? 42% -71.2% 9386 ? 25% numa-vmstat.node1.nr_inactive_anon
1949 ? 5% -15.9% 1638 ? 9% numa-vmstat.node1.nr_page_table_pages
32588 ? 42% -71.2% 9386 ? 25% numa-vmstat.node1.nr_zone_inactive_anon
7935635 ? 13% -19.3% 6407159 ? 6% numa-vmstat.node1.numa_hit
7732506 ? 14% -19.7% 6206498 ? 6% numa-vmstat.node1.numa_local
131518 ? 41% +72.1% 226297 ? 4% numa-meminfo.node0.AnonPages
140036 ? 38% +66.7% 233420 ? 4% numa-meminfo.node0.AnonPages.max
136203 ? 40% +68.2% 229070 ? 4% numa-meminfo.node0.Inactive
135279 ? 40% +68.9% 228531 ? 4% numa-meminfo.node0.Inactive(anon)
7950 ? 8% +15.4% 9178 ? 2% numa-meminfo.node0.PageTables
40320 ? 68% -79.8% 8138 ? 45% numa-meminfo.node1.AnonHugePages
125324 ? 44% -75.9% 30215 ? 32% numa-meminfo.node1.AnonPages
208760 ? 26% -48.1% 108370 ? 8% numa-meminfo.node1.AnonPages.max
129998 ? 42% -71.1% 37570 ? 25% numa-meminfo.node1.Inactive
129996 ? 42% -71.4% 37183 ? 25% numa-meminfo.node1.Inactive(anon)
7433 ? 6% -18.1% 6091 ? 9% numa-meminfo.node1.PageTables
9921 +2.0% 10121 proc-vmstat.nr_mapped
23344 -3.6% 22514 proc-vmstat.nr_slab_reclaimable
30383698 ? 2% -13.8% 26186115 proc-vmstat.numa_hit
30297093 ? 2% -13.9% 26099520 proc-vmstat.numa_local
1600 ?149% -100.0% 0.00 proc-vmstat.numa_pages_migrated
9337 ? 2% -12.5% 8174 ? 3% proc-vmstat.pgactivate
34255062 ? 2% -14.7% 29225155 proc-vmstat.pgalloc_normal
31199740 -12.6% 27277537 proc-vmstat.pgfault
34113059 ? 2% -14.7% 29083007 proc-vmstat.pgfree
1600 ?149% -100.0% 0.00 proc-vmstat.pgmigrate_success
345591 -8.1% 317680 proc-vmstat.pgreuse
11501 ? 5% +9.6% 12610 ? 6% softirqs.CPU12.RCU
10678 ? 5% +16.0% 12383 ? 2% softirqs.CPU16.RCU
10871 ? 4% +13.1% 12294 ? 2% softirqs.CPU17.RCU
10724 ? 2% +13.8% 12205 ? 3% softirqs.CPU18.RCU
10810 ? 4% +16.2% 12560 ? 3% softirqs.CPU19.RCU
10647 ? 6% +16.2% 12372 ? 6% softirqs.CPU20.RCU
10863 ? 3% +14.7% 12461 ? 3% softirqs.CPU21.RCU
11231 ? 5% +14.6% 12873 ? 6% softirqs.CPU22.RCU
11141 ? 6% +21.0% 13480 ? 8% softirqs.CPU64.RCU
11209 ? 6% +20.8% 13545 ? 2% softirqs.CPU65.RCU
11108 ? 3% +20.0% 13334 ? 6% softirqs.CPU66.RCU
11414 ? 9% +16.9% 13345 ? 6% softirqs.CPU67.RCU
11162 ? 4% +16.2% 12968 ? 9% softirqs.CPU68.RCU
11035 ? 5% +13.6% 12533 ? 4% softirqs.CPU69.RCU
11003 ? 5% +18.9% 13078 ? 8% softirqs.CPU70.RCU
11097 ? 4% +14.9% 12756 ? 5% softirqs.CPU71.RCU
76929 +14.7% 88227 slabinfo.anon_vma_chain.active_objs
1205 +14.4% 1379 slabinfo.anon_vma_chain.active_slabs
77192 +14.4% 88314 slabinfo.anon_vma_chain.num_objs
1205 +14.4% 1379 slabinfo.anon_vma_chain.num_slabs
23274 +12.2% 26121 slabinfo.kmalloc-512.active_objs
9748 ? 2% -10.0% 8776 ? 3% slabinfo.pid.active_objs
9748 ? 2% -10.0% 8776 ? 3% slabinfo.pid.num_objs
28777 -11.3% 25539 ? 2% slabinfo.proc_inode_cache.active_objs
28919 -11.3% 25647 ? 2% slabinfo.proc_inode_cache.num_objs
11256 ? 5% -23.8% 8581 ? 3% slabinfo.task_delay_info.active_objs
11256 ? 5% -23.8% 8581 ? 3% slabinfo.task_delay_info.num_objs
53293 +12.8% 60111 slabinfo.vm_area_struct.active_objs
1335 +12.6% 1503 slabinfo.vm_area_struct.active_slabs
53426 +12.6% 60156 slabinfo.vm_area_struct.num_objs
1335 +12.6% 1503 slabinfo.vm_area_struct.num_slabs
20536 ? 2% +25.3% 25732 ? 12% slabinfo.vmap_area.active_objs
24731 +23.5% 30551 ? 3% slabinfo.vmap_area.num_objs
4.799e+09 -12.2% 4.215e+09 perf-stat.i.branch-instructions
1.09 +0.1 1.14 perf-stat.i.branch-miss-rate%
50816825 -7.1% 47191966 perf-stat.i.branch-misses
4.52 ? 3% +5.6 10.16 ? 3% perf-stat.i.cache-miss-rate%
18266157 ? 2% +125.2% 41128630 ? 3% perf-stat.i.cache-misses
4.817e+08 -11.6% 4.258e+08 perf-stat.i.cache-references
1998506 -14.0% 1718841 perf-stat.i.context-switches
1.89 +12.4% 2.12 perf-stat.i.cpi
4.232e+10 -1.4% 4.175e+10 perf-stat.i.cpu-cycles
836.21 ? 4% +14.8% 959.82 ? 7% perf-stat.i.cpu-migrations
2507 -54.0% 1153 ? 5% perf-stat.i.cycles-between-cache-misses
4212845 ? 3% -10.4% 3775244 ? 2% perf-stat.i.dTLB-load-misses
6.137e+09 -12.7% 5.359e+09 perf-stat.i.dTLB-loads
0.03 ? 3% +0.0 0.04 ? 2% perf-stat.i.dTLB-store-miss-rate%
1127230 ? 3% -6.7% 1051240 ? 2% perf-stat.i.dTLB-store-misses
3.192e+09 -13.4% 2.766e+09 perf-stat.i.dTLB-stores
67.76 -0.7 67.08 perf-stat.i.iTLB-load-miss-rate%
17023777 -10.2% 15291627 perf-stat.i.iTLB-load-misses
7885583 -7.2% 7319880 perf-stat.i.iTLB-loads
2.237e+10 -12.5% 1.956e+10 perf-stat.i.instructions
0.53 -10.8% 0.48 perf-stat.i.ipc
111.04 ? 13% +310.1% 455.43 ? 11% perf-stat.i.major-faults
0.44 -1.4% 0.43 perf-stat.i.metric.GHz
0.06 ? 12% +21.9% 0.07 ? 14% perf-stat.i.metric.K/sec
152.39 -12.5% 133.27 perf-stat.i.metric.M/sec
494626 -12.8% 431405 perf-stat.i.minor-faults
77.46 +12.6 90.07 perf-stat.i.node-load-miss-rate%
3962536 ? 4% +195.0% 11688629 ? 3% perf-stat.i.node-load-misses
1000403 +7.6% 1076834 ? 2% perf-stat.i.node-loads
41.34 ? 2% +31.6 72.93 perf-stat.i.node-store-miss-rate%
1494039 ? 5% +198.5% 4459287 ? 3% perf-stat.i.node-store-misses
1975747 ? 3% -25.8% 1466690 ? 2% perf-stat.i.node-stores
520459 -12.8% 453999 perf-stat.i.page-faults
1.06 +0.1 1.12 perf-stat.overall.branch-miss-rate%
3.78 ? 2% +5.8 9.60 ? 3% perf-stat.overall.cache-miss-rate%
1.89 +12.7% 2.13 perf-stat.overall.cpi
2324 ? 2% -56.1% 1021 ? 3% perf-stat.overall.cycles-between-cache-misses
0.04 ? 3% +0.0 0.04 ? 2% perf-stat.overall.dTLB-store-miss-rate%
68.35 -0.7 67.65 perf-stat.overall.iTLB-load-miss-rate%
0.53 -11.2% 0.47 perf-stat.overall.ipc
79.66 +11.8 91.48 perf-stat.overall.node-load-miss-rate%
42.93 ? 3% +32.1 75.07 perf-stat.overall.node-store-miss-rate%
4.727e+09 -12.1% 4.154e+09 perf-stat.ps.branch-instructions
50025626 -7.1% 46471888 perf-stat.ps.branch-misses
17939323 ? 2% +124.6% 40288410 ? 3% perf-stat.ps.cache-misses
4.743e+08 -11.5% 4.196e+08 perf-stat.ps.cache-references
1968071 -13.9% 1694380 perf-stat.ps.context-switches
4.167e+10 -1.4% 4.11e+10 perf-stat.ps.cpu-cycles
821.15 ? 4% +14.8% 942.91 ? 7% perf-stat.ps.cpu-migrations
4147982 ? 3% -10.3% 3720585 ? 2% perf-stat.ps.dTLB-load-misses
6.043e+09 -12.6% 5.282e+09 perf-stat.ps.dTLB-loads
1109704 ? 3% -6.6% 1036060 ? 2% perf-stat.ps.dTLB-store-misses
3.144e+09 -13.3% 2.726e+09 perf-stat.ps.dTLB-stores
16765267 -10.1% 15068821 perf-stat.ps.iTLB-load-misses
7761893 -7.1% 7207514 perf-stat.ps.iTLB-loads
2.203e+10 -12.5% 1.928e+10 perf-stat.ps.instructions
108.63 ? 13% +309.6% 444.94 ? 12% perf-stat.ps.major-faults
487068 -12.7% 425183 perf-stat.ps.minor-faults
3881430 ? 4% +194.6% 11435790 ? 3% perf-stat.ps.node-load-misses
989805 +7.5% 1064249 ? 2% perf-stat.ps.node-loads
1464554 ? 5% +198.0% 4363838 ? 3% perf-stat.ps.node-store-misses
1945831 ? 3% -25.6% 1448410 ? 2% perf-stat.ps.node-stores
512507 -12.7% 447452 perf-stat.ps.page-faults
1.389e+12 -12.5% 1.215e+12 perf-stat.total.instructions
239831 -33.6% 159201 interrupts.CAL:Function_call_interrupts
2858 ? 21% -36.1% 1825 ? 17% interrupts.CPU0.CAL:Function_call_interrupts
76.83 ? 14% -73.1% 20.67 ? 24% interrupts.CPU0.TLB:TLB_shootdowns
3447 ? 10% -38.8% 2108 ? 18% interrupts.CPU1.CAL:Function_call_interrupts
2457 ? 16% -30.3% 1712 ? 15% interrupts.CPU10.CAL:Function_call_interrupts
2252 ? 16% -29.6% 1585 ? 22% interrupts.CPU11.CAL:Function_call_interrupts
2156 ? 23% -27.9% 1555 ? 13% interrupts.CPU12.CAL:Function_call_interrupts
378.33 ? 11% +42.1% 537.50 ? 19% interrupts.CPU13.RES:Rescheduling_interrupts
2358 ? 15% -40.2% 1409 ? 8% interrupts.CPU14.CAL:Function_call_interrupts
2216 ? 20% -36.0% 1418 ? 10% interrupts.CPU15.CAL:Function_call_interrupts
2387 ? 15% -24.2% 1811 ? 5% interrupts.CPU16.CAL:Function_call_interrupts
2329 ? 14% -36.0% 1491 ? 14% interrupts.CPU17.CAL:Function_call_interrupts
2509 ? 13% -39.2% 1525 ? 15% interrupts.CPU18.CAL:Function_call_interrupts
2333 ? 19% -32.9% 1566 ? 18% interrupts.CPU19.CAL:Function_call_interrupts
3082 ? 15% -36.0% 1972 ? 23% interrupts.CPU2.CAL:Function_call_interrupts
2019 ? 6% -24.8% 1518 ? 17% interrupts.CPU2.NMI:Non-maskable_interrupts
2019 ? 6% -24.8% 1518 ? 17% interrupts.CPU2.PMI:Performance_monitoring_interrupts
2456 ? 11% -36.1% 1569 ? 11% interrupts.CPU21.CAL:Function_call_interrupts
2668 ? 21% -37.7% 1662 ? 11% interrupts.CPU22.CAL:Function_call_interrupts
2361 ? 15% -34.6% 1543 ? 15% interrupts.CPU23.CAL:Function_call_interrupts
2651 ? 8% -22.4% 2057 ? 17% interrupts.CPU24.CAL:Function_call_interrupts
2936 ? 10% -42.9% 1675 ? 20% interrupts.CPU25.CAL:Function_call_interrupts
2734 ? 22% -37.4% 1712 ? 24% interrupts.CPU26.CAL:Function_call_interrupts
2614 ? 19% -34.1% 1723 ? 18% interrupts.CPU27.CAL:Function_call_interrupts
2774 ? 13% -46.3% 1490 ? 21% interrupts.CPU28.CAL:Function_call_interrupts
2454 ? 18% -44.0% 1373 ? 18% interrupts.CPU29.CAL:Function_call_interrupts
433.83 ? 16% -30.7% 300.67 ? 13% interrupts.CPU29.RES:Rescheduling_interrupts
2969 ? 19% -40.4% 1769 ? 15% interrupts.CPU3.CAL:Function_call_interrupts
2566 ? 10% -32.4% 1734 ? 13% interrupts.CPU30.CAL:Function_call_interrupts
2614 ? 16% -43.6% 1475 ? 26% interrupts.CPU31.CAL:Function_call_interrupts
2212 ? 11% -34.4% 1452 ? 11% interrupts.CPU32.CAL:Function_call_interrupts
2273 ? 16% -39.6% 1372 ? 23% interrupts.CPU33.CAL:Function_call_interrupts
2313 ? 20% -34.2% 1521 ? 18% interrupts.CPU36.CAL:Function_call_interrupts
2258 ? 19% -31.3% 1550 ? 25% interrupts.CPU38.CAL:Function_call_interrupts
2482 ? 16% -29.8% 1743 ? 14% interrupts.CPU4.CAL:Function_call_interrupts
2466 ? 16% -40.9% 1458 ? 23% interrupts.CPU40.CAL:Function_call_interrupts
2488 ? 21% -42.9% 1420 ? 21% interrupts.CPU41.CAL:Function_call_interrupts
2327 ? 16% -39.7% 1403 ? 19% interrupts.CPU42.CAL:Function_call_interrupts
2352 ? 16% -40.0% 1412 ? 20% interrupts.CPU44.CAL:Function_call_interrupts
2252 ? 15% -41.0% 1329 ? 26% interrupts.CPU46.CAL:Function_call_interrupts
2490 ? 13% -40.1% 1491 ? 23% interrupts.CPU47.CAL:Function_call_interrupts
2628 ? 13% -32.4% 1777 ? 17% interrupts.CPU48.CAL:Function_call_interrupts
2752 ? 14% -35.0% 1788 ? 12% interrupts.CPU49.CAL:Function_call_interrupts
2668 ? 15% -39.8% 1605 ? 20% interrupts.CPU5.CAL:Function_call_interrupts
2693 ? 12% -35.1% 1749 ? 15% interrupts.CPU50.CAL:Function_call_interrupts
2515 ? 17% -35.9% 1613 ? 22% interrupts.CPU51.CAL:Function_call_interrupts
2809 ? 16% -45.4% 1534 ? 14% interrupts.CPU52.CAL:Function_call_interrupts
2748 ? 11% -35.2% 1780 ? 22% interrupts.CPU53.CAL:Function_call_interrupts
2634 ? 12% -34.1% 1736 ? 9% interrupts.CPU54.CAL:Function_call_interrupts
2607 ? 11% -38.0% 1616 ? 9% interrupts.CPU55.CAL:Function_call_interrupts
2527 ? 19% -38.3% 1558 ? 13% interrupts.CPU57.CAL:Function_call_interrupts
2620 ? 12% -22.3% 2037 ? 15% interrupts.CPU58.CAL:Function_call_interrupts
2569 ? 10% -30.1% 1795 ? 16% interrupts.CPU61.CAL:Function_call_interrupts
2514 ? 8% -36.3% 1601 ? 25% interrupts.CPU62.CAL:Function_call_interrupts
2525 ? 16% -31.9% 1719 ? 25% interrupts.CPU63.CAL:Function_call_interrupts
2494 ? 8% -24.6% 1880 ? 11% interrupts.CPU64.CAL:Function_call_interrupts
506.50 ? 15% +26.9% 642.67 ? 9% interrupts.CPU65.RES:Rescheduling_interrupts
486.67 ? 9% +29.5% 630.33 ? 15% interrupts.CPU66.RES:Rescheduling_interrupts
120.00 ?126% -92.4% 9.17 ? 36% interrupts.CPU66.TLB:TLB_shootdowns
2444 ? 12% -21.9% 1908 ? 18% interrupts.CPU67.CAL:Function_call_interrupts
2403 ? 16% -32.6% 1620 ? 15% interrupts.CPU69.CAL:Function_call_interrupts
2673 ? 11% -37.2% 1679 ? 24% interrupts.CPU7.CAL:Function_call_interrupts
2360 ? 12% -27.0% 1723 ? 14% interrupts.CPU70.CAL:Function_call_interrupts
2408 ? 11% -31.9% 1639 ? 16% interrupts.CPU71.CAL:Function_call_interrupts
2591 ? 13% -36.9% 1635 ? 24% interrupts.CPU75.CAL:Function_call_interrupts
2793 ? 13% -39.7% 1685 ? 17% interrupts.CPU76.CAL:Function_call_interrupts
2662 ? 14% -37.3% 1668 ? 20% interrupts.CPU77.CAL:Function_call_interrupts
1899 ? 8% -25.9% 1408 ? 33% interrupts.CPU77.NMI:Non-maskable_interrupts
1899 ? 8% -25.9% 1408 ? 33% interrupts.CPU77.PMI:Performance_monitoring_interrupts
2624 ? 10% -35.4% 1694 ? 27% interrupts.CPU78.CAL:Function_call_interrupts
2816 ? 12% -35.0% 1829 ? 25% interrupts.CPU79.CAL:Function_call_interrupts
2393 ? 10% -36.2% 1525 ? 11% interrupts.CPU8.CAL:Function_call_interrupts
2465 ? 16% -33.6% 1637 ? 18% interrupts.CPU80.CAL:Function_call_interrupts
2430 ? 16% -41.1% 1432 ? 9% interrupts.CPU81.CAL:Function_call_interrupts
2369 ? 7% -25.5% 1765 ? 14% interrupts.CPU83.CAL:Function_call_interrupts
2509 ? 20% -40.1% 1504 ? 19% interrupts.CPU84.CAL:Function_call_interrupts
2624 ? 20% -34.7% 1713 ? 26% interrupts.CPU85.CAL:Function_call_interrupts
2329 ? 15% -29.0% 1654 ? 21% interrupts.CPU86.CAL:Function_call_interrupts
2599 ? 14% -39.2% 1580 ? 17% interrupts.CPU88.CAL:Function_call_interrupts
2413 ? 13% -39.0% 1471 ? 12% interrupts.CPU9.CAL:Function_call_interrupts
2492 ? 19% -36.1% 1592 ? 30% interrupts.CPU90.CAL:Function_call_interrupts
2325 ? 21% -32.6% 1567 ? 19% interrupts.CPU91.CAL:Function_call_interrupts
2557 ? 13% -42.3% 1474 ? 19% interrupts.CPU92.CAL:Function_call_interrupts
2576 ? 19% -45.2% 1412 ? 19% interrupts.CPU94.CAL:Function_call_interrupts
2571 ? 10% -46.4% 1377 ? 20% interrupts.CPU95.CAL:Function_call_interrupts
475.17 ? 15% -22.4% 368.50 ? 13% interrupts.CPU95.RES:Rescheduling_interrupts
5869 ? 2% -83.2% 988.50 ? 12% interrupts.TLB:TLB_shootdowns
2.74 ? 6% -0.3 2.40 ? 3% perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
2.66 ? 6% -0.3 2.33 ? 3% perf-profile.calltrace.cycles-pp.__sched_text_start.schedule_idle.do_idle.cpu_startup_entry.start_secondary
1.88 ? 4% -0.2 1.64 ? 4% perf-profile.calltrace.cycles-pp.ptrace_request.arch_ptrace.__x64_sys_ptrace.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.89 ? 4% -0.2 1.65 ? 4% perf-profile.calltrace.cycles-pp.arch_ptrace.__x64_sys_ptrace.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.02 ? 5% -0.2 1.79 ? 3% perf-profile.calltrace.cycles-pp.__wake_up_common_lock.do_notify_parent_cldstop.ptrace_stop.ptrace_do_notify.ptrace_notify
1.87 ? 4% -0.2 1.64 ? 4% perf-profile.calltrace.cycles-pp.ptrace_resume.ptrace_request.arch_ptrace.__x64_sys_ptrace.do_syscall_64
1.90 ? 6% -0.2 1.67 ? 3% perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.do_notify_parent_cldstop.ptrace_stop.ptrace_do_notify
1.73 ? 4% -0.2 1.50 ? 4% perf-profile.calltrace.cycles-pp.try_to_wake_up.ptrace_resume.ptrace_request.arch_ptrace.__x64_sys_ptrace
1.46 ? 6% -0.2 1.25 ? 4% perf-profile.calltrace.cycles-pp.schedule.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
1.42 ? 6% -0.2 1.22 ? 4% perf-profile.calltrace.cycles-pp.__sched_text_start.schedule.do_wait.kernel_wait4.__do_sys_wait4
0.63 ? 5% -0.2 0.46 ? 44% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.ptrace_resume
1.25 ? 6% -0.2 1.09 ? 3% perf-profile.calltrace.cycles-pp.pick_next_task_fair.__sched_text_start.schedule_idle.do_idle.cpu_startup_entry
1.19 ? 5% -0.1 1.05 ? 4% perf-profile.calltrace.cycles-pp.do_notify_parent_cldstop.ptrace_stop.ptrace_do_notify.ptrace_notify.syscall_trace_enter
0.75 ? 7% -0.1 0.64 ? 3% perf-profile.calltrace.cycles-pp.dequeue_task_fair.__sched_text_start.schedule.do_wait.kernel_wait4
0.83 ? 4% -0.1 0.73 ? 5% perf-profile.calltrace.cycles-pp.enqueue_task_fair.ttwu_do_activate.try_to_wake_up.ptrace_resume.ptrace_request
0.67 ? 7% -0.1 0.57 ? 4% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__sched_text_start.schedule.do_wait
0.86 ? 4% -0.1 0.76 ? 4% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.ptrace_resume.ptrace_request.arch_ptrace
0.61 ? 6% +0.1 0.69 ? 4% perf-profile.calltrace.cycles-pp.kmem_cache_free.remove_vma.__do_munmap.__vm_munmap.__x64_sys_munmap
0.78 ? 6% +0.1 0.88 ? 3% perf-profile.calltrace.cycles-pp.remove_vma.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
0.67 ? 8% +0.1 0.78 ? 4% perf-profile.calltrace.cycles-pp.anon_vma_interval_tree_insert.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm
0.58 ? 4% +0.1 0.69 ? 6% perf-profile.calltrace.cycles-pp.queued_read_lock_slowpath.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
1.09 ? 5% +0.1 1.22 ? 4% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap
1.35 ? 5% +0.2 1.53 ? 4% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap
1.36 ? 5% +0.2 1.54 ? 4% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
0.79 ? 5% +0.2 0.99 ? 4% perf-profile.calltrace.cycles-pp.ptrace_check_attach.__x64_sys_ptrace.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.00 ? 6% +0.2 2.24 ? 3% perf-profile.calltrace.cycles-pp.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm.copy_process
0.68 ? 7% +0.3 0.94 ? 6% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.asm_call_sysvec_on_stack.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
0.67 ? 7% +0.3 0.93 ? 6% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.asm_call_sysvec_on_stack.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.69 ? 7% +0.3 0.94 ? 6% perf-profile.calltrace.cycles-pp.asm_call_sysvec_on_stack.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
2.31 ? 6% +0.3 2.57 ? 3% perf-profile.calltrace.cycles-pp.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
2.42 ? 6% +0.3 2.68 ? 3% perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
1.24 ? 6% +0.3 1.58 ? 7% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.do_idle
0.18 ?141% +0.4 0.55 ? 6% perf-profile.calltrace.cycles-pp.__vmalloc_node_range.dup_task_struct.copy_process.kernel_clone.__do_sys_clone
1.36 ? 6% +0.4 1.74 ? 6% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
1.94 ? 6% +0.4 2.36 ? 4% perf-profile.calltrace.cycles-pp.unlink_anon_vmas.free_pgtables.unmap_region.__do_munmap.__vm_munmap
0.26 ?100% +0.5 0.73 ? 9% perf-profile.calltrace.cycles-pp.unlink_file_vma.free_pgtables.unmap_region.__do_munmap.__vm_munmap
0.00 +0.6 0.58 ? 4% perf-profile.calltrace.cycles-pp.wait_task_inactive.ptrace_check_attach.__x64_sys_ptrace.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.69 ? 5% +0.6 3.33 ? 3% perf-profile.calltrace.cycles-pp.free_pgtables.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
5.68 ? 5% -0.6 5.06 ? 3% perf-profile.children.cycles-pp.__sched_text_start
3.69 ? 5% -0.5 3.22 ? 3% perf-profile.children.cycles-pp.try_to_wake_up
2.79 ? 6% -0.4 2.42 ? 3% perf-profile.children.cycles-pp.schedule_idle
2.49 ? 5% -0.3 2.21 ? 4% perf-profile.children.cycles-pp.do_notify_parent_cldstop
2.23 ? 5% -0.3 1.96 ? 3% perf-profile.children.cycles-pp.__wake_up_common_lock
2.08 ? 5% -0.3 1.83 ? 3% perf-profile.children.cycles-pp.__wake_up_common
1.89 ? 4% -0.2 1.65 ? 4% perf-profile.children.cycles-pp.arch_ptrace
1.89 ? 4% -0.2 1.64 ? 4% perf-profile.children.cycles-pp.ptrace_request
1.87 ? 4% -0.2 1.64 ? 4% perf-profile.children.cycles-pp.ptrace_resume
1.68 ? 5% -0.2 1.47 ? 4% perf-profile.children.cycles-pp.enqueue_task_fair
1.70 ? 5% -0.2 1.50 ? 4% perf-profile.children.cycles-pp.ttwu_do_activate
1.33 ? 5% -0.2 1.14 ? 4% perf-profile.children.cycles-pp.enqueue_entity
1.45 ? 5% -0.2 1.26 ? 3% perf-profile.children.cycles-pp.pick_next_task_fair
1.47 ? 5% -0.2 1.30 ? 3% perf-profile.children.cycles-pp.dequeue_task_fair
1.17 ? 5% -0.2 1.02 ? 3% perf-profile.children.cycles-pp.update_load_avg
1.33 ? 5% -0.2 1.18 ? 2% perf-profile.children.cycles-pp.dequeue_entity
0.84 ? 7% -0.1 0.73 ? 4% perf-profile.children.cycles-pp.get_page_from_freelist
0.52 ? 8% -0.1 0.43 ? 3% perf-profile.children.cycles-pp.page_add_file_rmap
0.75 ? 4% -0.1 0.66 ? 3% perf-profile.children.cycles-pp.select_task_rq_fair
0.65 ? 4% -0.1 0.57 ? 3% perf-profile.children.cycles-pp.update_curr
0.52 ? 6% -0.1 0.44 ? 6% perf-profile.children.cycles-pp.___might_sleep
0.39 ? 5% -0.1 0.33 ? 5% perf-profile.children.cycles-pp.__switch_to_asm
0.50 ? 5% -0.1 0.44 ? 3% perf-profile.children.cycles-pp.wake_up_new_task
0.46 ? 5% -0.1 0.41 ? 5% perf-profile.children.cycles-pp.wp_page_copy
0.41 ? 4% -0.0 0.36 ? 2% perf-profile.children.cycles-pp.find_idlest_group
0.38 ? 6% -0.0 0.33 ? 3% perf-profile.children.cycles-pp.tick_nohz_idle_exit
0.39 ? 6% -0.0 0.34 ? 5% perf-profile.children.cycles-pp.sched_clock
0.37 ? 6% -0.0 0.33 ? 6% perf-profile.children.cycles-pp.native_sched_clock
0.28 ? 7% -0.0 0.24 ? 6% perf-profile.children.cycles-pp.switch_mm_irqs_off
0.31 ? 3% -0.0 0.28 ? 3% perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
0.23 ? 6% -0.0 0.19 ? 6% perf-profile.children.cycles-pp.rmqueue
0.18 ? 5% -0.0 0.14 ? 5% perf-profile.children.cycles-pp.tick_nohz_idle_enter
0.16 ? 4% -0.0 0.13 ? 5% perf-profile.children.cycles-pp.__cond_resched
0.23 ? 5% -0.0 0.20 ? 6% perf-profile.children.cycles-pp.reweight_entity
0.07 ? 11% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.perf_iterate_sb
0.19 ? 5% -0.0 0.16 ? 4% perf-profile.children.cycles-pp.allocate_slab
0.26 ? 5% -0.0 0.23 ? 6% perf-profile.children.cycles-pp.update_ts_time_stats
0.14 ? 7% -0.0 0.11 ? 9% perf-profile.children.cycles-pp.free_pcppages_bulk
0.17 ? 4% -0.0 0.14 ? 4% perf-profile.children.cycles-pp.___perf_sw_event
0.20 ? 11% -0.0 0.17 ? 7% perf-profile.children.cycles-pp.sync_regs
0.15 ? 5% -0.0 0.12 ? 7% perf-profile.children.cycles-pp.free_unref_page_list
0.19 ? 9% -0.0 0.17 ? 6% perf-profile.children.cycles-pp.__put_user_nocheck_4
0.09 ? 7% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.memcpy_erms
0.24 ? 4% -0.0 0.21 ? 4% perf-profile.children.cycles-pp.__task_pid_nr_ns
0.07 ? 6% -0.0 0.05 perf-profile.children.cycles-pp.rcu_eqs_enter
0.15 ? 4% -0.0 0.13 ? 5% perf-profile.children.cycles-pp.place_entity
0.07 ? 10% -0.0 0.05 ? 45% perf-profile.children.cycles-pp.perf_event_task
0.12 ? 5% -0.0 0.10 ? 6% perf-profile.children.cycles-pp.ttwu_queue_wakelist
0.08 ? 6% -0.0 0.06 ? 13% perf-profile.children.cycles-pp.arch_dup_task_struct
0.09 ? 8% -0.0 0.07 ? 7% perf-profile.children.cycles-pp.call_cpuidle
0.07 ? 7% -0.0 0.05 ? 45% perf-profile.children.cycles-pp.lru_add_drain
0.14 ? 5% -0.0 0.12 ? 3% perf-profile.children.cycles-pp.copy_page
0.07 ? 6% -0.0 0.06 ? 6% perf-profile.children.cycles-pp.rcu_all_qs
0.07 ? 7% -0.0 0.05 ? 7% perf-profile.children.cycles-pp.unfreeze_partials
0.09 ? 8% -0.0 0.07 ? 6% perf-profile.children.cycles-pp.__calc_delta
0.08 ? 14% +0.0 0.11 ? 11% perf-profile.children.cycles-pp.userfaultfd_unmap_prep
0.06 ? 11% +0.0 0.09 ? 10% perf-profile.children.cycles-pp.tick_nohz_irq_exit
0.14 ? 11% +0.0 0.18 ? 9% perf-profile.children.cycles-pp.__memcg_kmem_uncharge
0.33 ? 5% +0.0 0.37 ? 6% perf-profile.children.cycles-pp.__put_anon_vma
0.25 ? 5% +0.0 0.29 ? 6% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.18 ? 7% +0.0 0.22 ? 11% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.12 ? 10% +0.1 0.17 ? 14% perf-profile.children.cycles-pp.tick_nohz_next_event
0.02 ?141% +0.1 0.07 ? 13% perf-profile.children.cycles-pp.lock_page_lruvec_irqsave
0.16 ? 9% +0.1 0.21 ? 8% perf-profile.children.cycles-pp.__rb_erase_color
0.19 ? 11% +0.1 0.24 ? 9% perf-profile.children.cycles-pp.page_counter_uncharge
0.18 ? 9% +0.1 0.24 ? 9% perf-profile.children.cycles-pp.page_counter_cancel
0.26 ? 10% +0.1 0.32 ? 3% perf-profile.children.cycles-pp.find_get_task_by_vpid
0.49 ? 8% +0.1 0.55 ? 6% perf-profile.children.cycles-pp.__vmalloc_node_range
0.00 +0.1 0.06 ? 14% perf-profile.children.cycles-pp.calc_global_load_tick
0.21 ? 6% +0.1 0.28 ? 8% perf-profile.children.cycles-pp.drain_obj_stock
0.58 ? 4% +0.1 0.65 ? 3% perf-profile.children.cycles-pp.irq_exit_rcu
0.52 ? 5% +0.1 0.59 ? 2% perf-profile.children.cycles-pp.do_softirq_own_stack
0.16 ? 7% +0.1 0.23 ? 7% perf-profile.children.cycles-pp.scheduler_tick
0.00 +0.1 0.07 ? 18% perf-profile.children.cycles-pp.timekeeping_max_deferment
0.44 ? 5% +0.1 0.51 ? 9% perf-profile.children.cycles-pp.page_counter_try_charge
0.59 ? 5% +0.1 0.67 perf-profile.children.cycles-pp.__softirqentry_text_start
0.38 ? 5% +0.1 0.46 ? 4% perf-profile.children.cycles-pp._raw_read_lock
0.21 ? 13% +0.1 0.28 ? 12% perf-profile.children.cycles-pp.queued_write_lock_slowpath
0.28 ? 5% +0.1 0.37 ? 12% perf-profile.children.cycles-pp.vma_interval_tree_remove
0.34 ? 5% +0.1 0.43 ? 4% perf-profile.children.cycles-pp.anon_vma_interval_tree_remove
0.79 ? 6% +0.1 0.88 ? 3% perf-profile.children.cycles-pp.remove_vma
0.28 ? 13% +0.1 0.38 ? 7% perf-profile.children.cycles-pp.alloc_vmap_area
0.51 ? 6% +0.1 0.60 ? 6% perf-profile.children.cycles-pp.refill_obj_stock
0.29 ? 14% +0.1 0.39 ? 6% perf-profile.children.cycles-pp.__get_vm_area_node
0.28 ? 7% +0.1 0.38 ? 8% perf-profile.children.cycles-pp.tick_sched_handle
0.27 ? 7% +0.1 0.37 ? 8% perf-profile.children.cycles-pp.update_process_times
0.33 ? 7% +0.1 0.45 ? 7% perf-profile.children.cycles-pp.tick_sched_timer
0.67 ? 7% +0.1 0.79 ? 5% perf-profile.children.cycles-pp.anon_vma_interval_tree_insert
0.62 ? 6% +0.1 0.74 ? 4% perf-profile.children.cycles-pp.down_write
0.46 ? 6% +0.1 0.58 ? 4% perf-profile.children.cycles-pp.wait_task_inactive
0.48 ? 5% +0.1 0.61 ? 6% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.28 ? 6% +0.1 0.41 ? 7% perf-profile.children.cycles-pp.clockevents_program_event
1.11 ? 5% +0.1 1.25 ? 4% perf-profile.children.cycles-pp.release_pages
0.62 ? 4% +0.1 0.77 ? 4% perf-profile.children.cycles-pp.ktime_get
0.09 ? 24% +0.2 0.26 ? 11% perf-profile.children.cycles-pp.osq_lock
1.37 ? 5% +0.2 1.54 ? 4% perf-profile.children.cycles-pp.tlb_finish_mmu
1.36 ? 5% +0.2 1.54 ? 4% perf-profile.children.cycles-pp.tlb_flush_mmu
0.79 ? 5% +0.2 0.99 ? 4% perf-profile.children.cycles-pp.ptrace_check_attach
0.51 ? 4% +0.2 0.73 ? 9% perf-profile.children.cycles-pp.unlink_file_vma
2.00 ? 6% +0.2 2.24 ? 3% perf-profile.children.cycles-pp.anon_vma_clone
2.33 ? 6% +0.3 2.59 ? 3% perf-profile.children.cycles-pp.filemap_map_pages
2.43 ? 6% +0.3 2.69 ? 3% perf-profile.children.cycles-pp.do_fault
1.00 ? 5% +0.3 1.27 ? 6% perf-profile.children.cycles-pp.queued_read_lock_slowpath
0.22 ? 12% +0.3 0.49 ? 5% perf-profile.children.cycles-pp.cgroup_enter_frozen
0.90 ? 4% +0.3 1.18 ? 4% perf-profile.children.cycles-pp.hrtimer_interrupt
0.91 ? 4% +0.3 1.19 ? 4% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.26 ? 5% +0.3 0.57 ? 5% perf-profile.children.cycles-pp.cgroup_leave_frozen
0.19 ? 12% +0.3 0.51 ? 7% perf-profile.children.cycles-pp.rwsem_spin_on_owner
1.47 ? 3% +0.3 1.82 ? 3% perf-profile.children.cycles-pp.asm_call_sysvec_on_stack
1.73 ? 3% +0.4 2.12 ? 4% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
1.95 ? 6% +0.4 2.38 ? 3% perf-profile.children.cycles-pp.unlink_anon_vmas
1.88 ? 3% +0.4 2.32 ? 3% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.83 ? 6% +0.6 1.42 ? 4% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.35 ? 13% +0.6 0.96 ? 6% perf-profile.children.cycles-pp.rwsem_down_write_slowpath
2.69 ? 5% +0.6 3.34 ? 3% perf-profile.children.cycles-pp.free_pgtables
1.23 ? 8% +0.8 2.06 ? 5% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
1.15 ? 5% -0.1 1.03 ? 4% perf-profile.self.cycles-pp.__sched_text_start
0.48 ? 8% -0.1 0.38 ? 4% perf-profile.self.cycles-pp.page_add_file_rmap
0.55 ? 5% -0.1 0.45 ? 5% perf-profile.self.cycles-pp.update_load_avg
0.71 ? 6% -0.1 0.62 ? 4% perf-profile.self.cycles-pp.update_rq_clock
0.51 ? 6% -0.1 0.43 ? 6% perf-profile.self.cycles-pp.___might_sleep
0.54 ? 3% -0.1 0.48 ? 5% perf-profile.self.cycles-pp.do_idle
0.39 ? 6% -0.1 0.33 ? 5% perf-profile.self.cycles-pp.__switch_to_asm
0.31 ? 5% -0.1 0.26 ? 4% perf-profile.self.cycles-pp.update_curr
0.24 ? 7% -0.0 0.20 ? 7% perf-profile.self.cycles-pp.obj_cgroup_charge
0.36 ? 4% -0.0 0.32 ? 3% perf-profile.self.cycles-pp.find_idlest_group
0.23 ? 5% -0.0 0.20 ? 6% perf-profile.self.cycles-pp.reweight_entity
0.22 ? 4% -0.0 0.19 ? 4% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.20 ? 10% -0.0 0.17 ? 7% perf-profile.self.cycles-pp.sync_regs
0.12 ? 11% -0.0 0.09 ? 11% perf-profile.self.cycles-pp.try_charge
0.09 ? 7% -0.0 0.06 ? 11% perf-profile.self.cycles-pp.memcpy_erms
0.36 ? 2% -0.0 0.34 ? 3% perf-profile.self.cycles-pp.native_irq_return_iret
0.23 ? 3% -0.0 0.20 ? 4% perf-profile.self.cycles-pp.__task_pid_nr_ns
0.12 ? 6% -0.0 0.09 ? 11% perf-profile.self.cycles-pp.ttwu_queue_wakelist
0.13 ? 5% -0.0 0.11 ? 4% perf-profile.self.cycles-pp.___perf_sw_event
0.16 ? 7% -0.0 0.14 ? 5% perf-profile.self.cycles-pp.dequeue_entity
0.14 ? 4% -0.0 0.12 ? 6% perf-profile.self.cycles-pp.__wake_up_common
0.14 ? 6% -0.0 0.12 ? 3% perf-profile.self.cycles-pp.copy_page
0.08 ? 5% -0.0 0.07 ? 7% perf-profile.self.cycles-pp.call_cpuidle
0.09 ? 8% -0.0 0.07 ? 6% perf-profile.self.cycles-pp.__calc_delta
0.07 ? 7% +0.0 0.08 perf-profile.self.cycles-pp.alloc_vmap_area
0.06 ? 7% +0.0 0.08 ? 10% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.07 ? 14% +0.0 0.11 ? 8% perf-profile.self.cycles-pp.userfaultfd_unmap_prep
0.08 ? 9% +0.0 0.11 ? 5% perf-profile.self.cycles-pp.drain_obj_stock
0.15 ? 8% +0.0 0.19 ? 5% perf-profile.self.cycles-pp.unlink_anon_vmas
0.16 ? 9% +0.0 0.21 ? 9% perf-profile.self.cycles-pp.page_counter_cancel
0.12 ? 11% +0.1 0.18 ? 11% perf-profile.self.cycles-pp.__rb_erase_color
0.27 ? 5% +0.1 0.33 ? 6% perf-profile.self.cycles-pp.cpuidle_enter_state
0.00 +0.1 0.06 ? 13% perf-profile.self.cycles-pp.cgroup_enter_frozen
0.15 ? 8% +0.1 0.21 ? 6% perf-profile.self.cycles-pp.find_get_task_by_vpid
0.00 +0.1 0.06 ? 14% perf-profile.self.cycles-pp.calc_global_load_tick
0.00 +0.1 0.07 ? 18% perf-profile.self.cycles-pp.timekeeping_max_deferment
0.38 ? 6% +0.1 0.45 ? 3% perf-profile.self.cycles-pp._raw_read_lock
0.06 ? 11% +0.1 0.15 ? 7% perf-profile.self.cycles-pp.rwsem_down_write_slowpath
0.33 ? 5% +0.1 0.43 ? 4% perf-profile.self.cycles-pp.anon_vma_interval_tree_remove
0.27 ? 4% +0.1 0.37 ? 12% perf-profile.self.cycles-pp.vma_interval_tree_remove
0.81 ? 4% +0.1 0.92 ? 4% perf-profile.self.cycles-pp.kmem_cache_free
0.67 ? 8% +0.1 0.78 ? 4% perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
0.31 ? 8% +0.1 0.43 ? 5% perf-profile.self.cycles-pp.ptrace_stop
0.85 ? 6% +0.1 0.97 ? 4% perf-profile.self.cycles-pp.release_pages
0.49 ? 6% +0.1 0.62 ? 5% perf-profile.self.cycles-pp.down_write
0.63 ? 6% +0.1 0.76 ? 4% perf-profile.self.cycles-pp.do_wait
0.17 ? 9% +0.2 0.32 ? 3% perf-profile.self.cycles-pp.wait_task_inactive
0.56 ? 6% +0.2 0.72 ? 4% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.43 ? 6% +0.2 0.59 ? 5% perf-profile.self.cycles-pp.ktime_get
0.09 ? 23% +0.2 0.26 ? 11% perf-profile.self.cycles-pp.osq_lock
0.18 ? 12% +0.3 0.50 ? 7% perf-profile.self.cycles-pp.rwsem_spin_on_owner
1.33 ? 6% +0.3 1.67 ? 3% perf-profile.self.cycles-pp.filemap_map_pages
1.22 ? 7% +0.8 2.04 ? 5% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath



pmeter.Average_Active_Power

328 +---------------------------------------------------------------------+
326 |-+ O O |
| O O O O O O O O O O O O O O O O |
324 |-O O O O O O O O O O O O OO |
322 |-+ O O O O O O O |
320 |-+ |
318 |-+ |
| |
316 |-+ + |
314 |-+ :: + |
312 |-+ +. : +. +. +. +. ++ + : .+ |
310 |-+ + + .++. : +. : + : +. + + .+ + : +.+ +.+ .+ |
|.++ + + + +.+.+ ++.++ +.+ + : .+ + |
308 |-+ + |
306 +---------------------------------------------------------------------+


stress-ng.time.user_time

220 +---------------------------------------------------------------------+
|.+ + .++.+. +.+ + .+ ++. |
215 |-++ ++.+ .++. + + + +.+ + ++.+.+ + :.+ : + ++.+.+ |
| + + +.+ + :+ + + + + +.+ |
210 |-+ + + + |
| |
205 |-+ |
| |
200 |-+ |
| |
195 |-+O O OO O O OO O O O OO O O O O O O O O OO O |
| O O O O O O O O O O O O O O O O |
190 |-+ O O |
| |
185 +---------------------------------------------------------------------+


stress-ng.time.system_time

405 +---------------------------------------------------------------------+
| O O OO OO |
400 |-+O O O O |
| O O O O O O OO O O OO OO OO O OO |
| O OO O O O O O O OO O |
395 |-+ O |
| |
390 |-+ |
| |
385 |-+ |
| |
| .+. +. .+. +. |
380 |-++ ++.++.+.++.+. +. .+ +.+ + ++.+.++.+ +.++.+. +.+ .+.+ |
|.+ + ++.+ +.+ + + |
375 +---------------------------------------------------------------------+


stress-ng.time.major_page_faults

55000 +-------------------------------------------------------------------+
50000 |-+ O |
| |
45000 |-O |
40000 |-+O O O |
35000 |-+ O O O O O O O O |
30000 |-+ O O O O O O O O |
| O O OO OO O O O O O O O O O |
25000 |-+ O O O O O |
20000 |-+ O |
15000 |-+ |
10000 |-+ |
|.+ .++.++.+.++.++.+.++.++. +.+.++.+ .+.++. +.+ .+. +. +.+.++.++.+ |
5000 |-++ + + + + + + |
0 +-------------------------------------------------------------------+


stress-ng.time.voluntary_context_switches

6.4e+07 +-----------------------------------------------------------------+
| +. + .+ +.+ |
6.2e+07 |.++ .+ ++. +.+ +.++ .++.++ .+ .+. :: + : : +.+ .+ |
| + + +. + + + + + + + + : : +.+ + |
6e+07 |-+ + ++ + + + |
| |
5.8e+07 |-+ |
| |
5.6e+07 |-+ |
| |
5.4e+07 |-+ O OO O O O O O O O O O O OO O O |
| OO OO O O O O O O OO O OO O O OO O O |
5.2e+07 |-+ OO OO |
| |
5e+07 +-----------------------------------------------------------------+


stress-ng.vm-segv.ops

820000 +------------------------------------------------------------------+
|.++ .+ +.+ ++ +.++ +.++.+ .+ .+ :: + + : + .+ |
800000 |-+ + + +. + :+ + .+ :.+ + :: :+ +.: + |
780000 |-+ + ++ + + + + + + |
| |
760000 |-+ |
| |
740000 |-+ |
| |
720000 |-+ |
700000 |-+ O O OO O O O O O |
| O OO O O O OO O OO OO O O OO OO O O OO O O O |
680000 |-O O OO O O |
| O |
660000 +------------------------------------------------------------------+


stress-ng.vm-segv.ops_per_sec

14000 +-------------------------------------------------------------------+
| |
13500 |.++ .+ .+ +.++.++.+ .++.++ + + + ++ +.++.+ |
| : + +. + :+ : + : + + :+ :+ + + + +.+ |
| : : +.+ + :+ : + + + + ++ |
13000 |-+ + + + |
| |
12500 |-+ |
| |
12000 |-+ |
| O O O O |
| O OO O O O O O OO O O OO O O O O O OO O |
11500 |-O O OO O O O O O O O O O O O |
| O |
11000 +-------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Oliver Sang



2021-02-23 12:40:31

by Valentin Schneider

[permalink] [raw]
Subject: Re: [sched/fair] b360fb5e59: stress-ng.vm-segv.ops_per_sec -13.9% regression

On 23/02/21 10:30, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -13.9% regression of stress-ng.vm-segv.ops_per_sec due to commit:
>
>
> commit: b360fb5e5954a8a440ef95bf11257e2e7ea90340 ("[PATCH v2 1/7] sched/fair: Ignore percpu threads for imbalance pulls")
> url: https://github.com/0day-ci/linux/commits/Valentin-Schneider/sched-fair-misfit-task-load-balance-tweaks/20210219-211028
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git c5e6fc08feb2b88dc5dac2f3c817e1c2a4cafda4
>
> in testcase: stress-ng
> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
> with following parameters:
>
> nr_threads: 10%
> disk: 1HDD
> testtime: 60s
> fs: ext4
> class: vm
> test: vm-segv
> cpufreq_governor: performance
> ucode: 0x5003003
>
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml
> bin/lkp run compatible-job.yaml
>
> =========================================================================================
> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
> vm/gcc-9/performance/1HDD/ext4/x86_64-rhel-8.3/10%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp7/vm-segv/stress-ng/60s/0x5003003
>
> commit:
> c5e6fc08fe ("sched,x86: Allow !PREEMPT_DYNAMIC")
> b360fb5e59 ("sched/fair: Ignore percpu threads for imbalance pulls")
>
> c5e6fc08feb2b88d b360fb5e5954a8a440ef95bf112
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> 1:6 -3% 1:6 perf-profile.children.cycles-pp.error_entry
> 0:6 -1% 0:6 perf-profile.self.cycles-pp.error_entry
> %stddev %change %stddev
> \ | \
> 11324 ± 3% -28.1% 8140 ± 3% stress-ng.time.involuntary_context_switches
> 6818 ± 15% +315.2% 28311 ± 12% stress-ng.time.major_page_faults
> 30952041 -12.8% 26988502 stress-ng.time.minor_page_faults

> 378.82 +5.3% 398.75 stress-ng.time.system_time
> 215.82 -10.0% 194.24 stress-ng.time.user_time
> 62102177 -13.9% 53448474 stress-ng.time.voluntary_context_switches
> 810348 -13.9% 698034 stress-ng.vm-segv.ops
> 13505 -13.9% 11633 stress-ng.vm-segv.ops_per_sec

My hunch was that this could be due to the balance interval no longer being
increased when load balance catches pcpu kworkers, but that's not the case:
LBF_ALL_PINNED would still be set, and that still doubles the balance
interval when no task was moved.
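
For reference, here is a simplified sketch of the relevant paths
(paraphrased from kernel/sched/fair.c with this patch applied; abridged,
not the exact hunks):

	/*
	 * can_migrate_task(), abridged: the pcpu kthread bail-out sits
	 * before the affinity check, so it neither sets LBF_SOME_PINNED
	 * nor clears LBF_ALL_PINNED.
	 */
	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
		return 0;

	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
		env->flags |= LBF_SOME_PINNED;
		/* ... possibly retry with another dst_cpu ... */
		return 0;
	}

	/* Found at least one task that could run on dst_cpu */
	env->flags &= ~LBF_ALL_PINNED;

	/*
	 * load_balance() tail, abridged: if nothing was moved and
	 * everything was pinned, the interval keeps doubling as before.
	 */
	if ((env.flags & LBF_ALL_PINNED &&
	     sd->balance_interval < MAX_PINNED_INTERVAL) ||
	    sd->balance_interval < sd->max_interval)
		sd->balance_interval *= 2;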

I'm not sure which stat to look at wrt softirqs; the overall mpstat
percentages suggest there weren't that many more:

> 1.11 -0.3 0.85 mpstat.cpu.all.irq%
> 0.18 -0.0 0.16 mpstat.cpu.all.soft%
> 0.40 -0.1 0.35 mpstat.cpu.all.usr%

But the per-CPU RCU softirq counts do show an increase:

> 11501 ± 5% +9.6% 12610 ± 6% softirqs.CPU12.RCU
> 10678 ± 5% +16.0% 12383 ± 2% softirqs.CPU16.RCU
> 10871 ± 4% +13.1% 12294 ± 2% softirqs.CPU17.RCU
> 10724 ± 2% +13.8% 12205 ± 3% softirqs.CPU18.RCU
> 10810 ± 4% +16.2% 12560 ± 3% softirqs.CPU19.RCU
> 10647 ± 6% +16.2% 12372 ± 6% softirqs.CPU20.RCU
> 10863 ± 3% +14.7% 12461 ± 3% softirqs.CPU21.RCU
> 11231 ± 5% +14.6% 12873 ± 6% softirqs.CPU22.RCU
> 11141 ± 6% +21.0% 13480 ± 8% softirqs.CPU64.RCU
> 11209 ± 6% +20.8% 13545 ± 2% softirqs.CPU65.RCU
> 11108 ± 3% +20.0% 13334 ± 6% softirqs.CPU66.RCU
> 11414 ± 9% +16.9% 13345 ± 6% softirqs.CPU67.RCU
> 11162 ± 4% +16.2% 12968 ± 9% softirqs.CPU68.RCU
> 11035 ± 5% +13.6% 12533 ± 4% softirqs.CPU69.RCU
> 11003 ± 5% +18.9% 13078 ± 8% softirqs.CPU70.RCU
> 11097 ± 4% +14.9% 12756 ± 5% softirqs.CPU71.RCU

2021-03-04 23:27:16

by Valentin Schneider

[permalink] [raw]
Subject: Re: [sched/fair] b360fb5e59: stress-ng.vm-segv.ops_per_sec -13.9% regression

On 23/02/21 12:36, Valentin Schneider wrote:
> On 23/02/21 10:30, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -13.9% regression of stress-ng.vm-segv.ops_per_sec due to commit:
>>
>>
>> commit: b360fb5e5954a8a440ef95bf11257e2e7ea90340 ("[PATCH v2 1/7] sched/fair: Ignore percpu threads for imbalance pulls")
>> url: https://github.com/0day-ci/linux/commits/Valentin-Schneider/sched-fair-misfit-task-load-balance-tweaks/20210219-211028
>> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git c5e6fc08feb2b88dc5dac2f3c817e1c2a4cafda4
>>
>> in testcase: stress-ng
>> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
>> with following parameters:
>>
>> nr_threads: 10%
>> disk: 1HDD
>> testtime: 60s
>> fs: ext4
>> class: vm
>> test: vm-segv
>> cpufreq_governor: performance
>> ucode: 0x5003003
>>

So I've been running this on my 32 CPU arm64 desktop with:
nr_threads: 10%
nr_threads: 50%
(20 iterations each)

In the 50% case I see a ~2% improvement; in the 10% case, a -0.3%
regression (another batch showed -0.08%). That's still far off from the
reported -14%. If it's really needed I can go find an x86 box to test this
on, but so far it looks like a fluke.