Modern platforms are growing fast in CPU count. To achieve better
utilization of CPU resources, multiple apps are starting to share the CPUs.

What we need is a way to ease contention in shared mode and make
groups as exclusive as possible, to gain both performance
and resource efficiency.

The main idea of the group balancer is to fulfill this requirement
by balancing groups of tasks among groups of CPUs; consider it
a dynamic semi-exclusive mode. A task triggers work to settle its
group into a proper partition (the one with minimum predicted load),
then tries to migrate itself into it, gradually settling groups into
the most exclusive partitions.

GB can be seen as an optimization policy built on top of load balancing;
it obeys the main idea of the load balancer and makes adjustments
based on that.

Our test on an ARM64 platform with 128 CPUs shows that the
throughput of sysbench memory is improved by about 25%,
and redis-benchmark is improved by up to about 10%.
See each patch for details:
The 1st patch introduces the infrastructure.
The 2nd patch handles the input of partition info.
The 3rd patch is the main part of the group balancer.
The 4th patch is about stats.

v2:
Put partition info and period settings into the cpuset subsys of cgroup_v2.
v1: https://lore.kernel.org/all/[email protected]/
Michael Wang (1):
sched: Introduce group balancer
Tianchen Ding (3):
sched, cpuset: Introduce infrastructure of group balancer
cpuset: Handle input of partition info for group balancer
cpuset, gb: Add stat for group balancer
include/linux/cpuset.h | 5 +
include/linux/sched.h | 5 +
include/linux/sched/gb.h | 70 ++++++
init/Kconfig | 12 +
kernel/cgroup/cpuset.c | 405 +++++++++++++++++++++++++++++++-
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 5 +
kernel/sched/debug.c | 10 +-
kernel/sched/fair.c | 26 ++-
kernel/sched/gb.c | 487 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 14 ++
11 files changed, 1037 insertions(+), 3 deletions(-)
create mode 100644 include/linux/sched/gb.h
create mode 100644 kernel/sched/gb.c
--
2.27.0
From: Michael Wang <[email protected]>
Modern platforms are growing fast in CPU count, and multiple
apps sharing one box is very common. They used to have
exclusive CPU settings, but nowadays things are changing.

To achieve better utilization of CPU resources, multiple apps
are starting to share the CPUs. The CPU resources are usually
overcommitted, since the apps' workloads fluctuate.

This introduces performance problems for shared mode versus
exclusive mode. For example, with cgroups A, B and C deployed in
exclusive mode, the layout will be:
CPU_X (100%) CPU_Y (100%) CPU_Z (50%)
T_1_CG_A T_1_CG_B T_1_CG_C
T_2_CG_A T_2_CG_B T_2_CG_C
T_3_CG_A T_3_CG_B
T_4_CG_A T_4_CG_B
while in shared mode it will be:
CPU_X (100%) CPU_Y (75%) CPU_Z (75%)
T_1_CG_A T_2_CG_A T_1_CG_B
T_2_CG_B T_3_CG_B T_2_CG_C
T_4_CG_B T_4_CG_A T_3_CG_A
T_1_CG_C
As we can see, contention between groups for CPU
resources now happens all over the CPUs.

Testing with sysbench-memory shows a 30+% drop in shared
mode, and redis-benchmark shows a 10+% drop too, compared
to exclusive mode.

However, despite the performance drop, in the real world
we still prefer shared mode. Fluctuating workloads can
make exclusive mode very inefficient in CPU utilization;
for example in the next period, when CG_A becomes 'idle',
exclusive mode will look like:
CPU_X (0%) CPU_Y (100%) CPU_Z (50%)
T_1_CG_B T_1_CG_C
T_2_CG_B T_2_CG_C
T_3_CG_B
T_4_CG_B
while shared mode looks like:
CPU_X (50%) CPU_Y (50%) CPU_Z (50%)
T_2_CG_B T_1_CG_C T_3_CG_B
T_4_CG_B T_1_CG_B T_2_CG_C
CPU_X is totally wasted in exclusive mode; the resource
efficiency is really poor.

Thus what we need is a way to ease contention in shared mode
and make groups as exclusive as possible, to gain both performance
and resource efficiency.

The main idea of the group balancer is to fulfill this requirement
by balancing groups of tasks among groups of CPUs; consider it
a dynamic semi-exclusive mode. A task triggers work to settle its
group into a proper partition (the one with minimum predicted load),
then tries to migrate itself into it, gradually settling groups into
the most exclusive partitions.

Just like balancing tasks among CPUs, with GB a user can now
put CPUs X, Y and Z into three partitions and balance groups A, B
and C into these partitions, to make them as exclusive as possible.

GB can be seen as an optimization policy built on top of load balancing;
it obeys the main idea of the load balancer and makes adjustments
based on that.
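
As a rough illustration only (a standalone sketch, not the kernel code
in this patch; all names below are made up for the example), the settle
decision boils down to picking the partition whose load, excluding this
group's own contribution and adding the load predicted from recently
settled groups, is the smallest:

	#include <stdint.h>

	struct part_sample {
		uint64_t load;          /* current load of the partition's CPUs */
		uint64_t group_load;    /* share of that load from this group */
		uint64_t predict_load;  /* load expected from recently settled groups */
	};

	/* Return the index of the partition with the minimum predicted load. */
	static int pick_partition(const struct part_sample *parts, int nr, int prefer)
	{
		uint64_t best_load = UINT64_MAX;
		int i, best = -1;

		for (i = 0; i < nr; i++) {
			/* exclude this group's own load from the comparison */
			uint64_t load = parts[i].load > parts[i].group_load ?
					parts[i].load - parts[i].group_load : 0;

			/* the preferred partition's prediction is mostly our own */
			if (i != prefer)
				load += parts[i].predict_load;

			if (load < best_load) {
				best_load = load;
				best = i;
			}
		}
		return best;
	}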
How to use:
First, make sure the children of your cgroup enable both the "cpu" and
"cpuset" subsystems, because the group balancer gathers load info from the
task group and partition info from the cpuset group, so tasks should stay
in the same cgroup. Do this recursively if necessary:
echo "+cpu +cpuset" > $CGROUP_PATH/cgroup.subtree_control
To create partitions, for example run:
echo "0-15;16-31;32-47;48-63;" > $CGROUP_PATH/cpuset.gb.partition
This will create 4 partitions containing CPUs 0-15, 16-31, 32-47 and
48-63 respectively.
Then enable GB for your cgroup, run:
echo 200000 > $CGROUP_PATH/cpuset.gb.period_us
This will enable GB for all descendants of $CGROUP_PATH, EXCEPT itself.
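
The same steps can also be driven programmatically; here is a minimal
sketch in C, assuming a hypothetical cgroup path (the interface files are
the ones described above):

	#include <stdio.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%s", val);
		return fclose(f);
	}

	int main(void)
	{
		const char *cg = "/sys/fs/cgroup/mygroup";	/* hypothetical path */
		char path[256];

		/* enable the cpu and cpuset controllers for the children */
		snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cg);
		write_str(path, "+cpu +cpuset");

		/* create four 16-CPU partitions, then enable GB with a 200ms period */
		snprintf(path, sizeof(path), "%s/cpuset.gb.partition", cg);
		write_str(path, "0-15;16-31;32-47;48-63;");
		snprintf(path, sizeof(path), "%s/cpuset.gb.period_us", cg);
		write_str(path, "200000");

		return 0;
	}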
Testing Results:
In order to enlarge the differences, we do the testing on an ARM platform
with 128 CPUs and create 8 partitions according to the cluster info.
Since we pick benchmarks which can gain benefit from exclusive mode,
this is more like a functional test than a performance test, to
show that GB helps win back the performance.

Create 8 cgroups, each running 'sysbench memory --threads=16 run'.
The output of shared mode is:
events/s (eps): 4939865.8892
events/s (eps): 4699033.0351
events/s (eps): 4373262.0563
events/s (eps): 3534852.1000
events/s (eps): 4724359.4354
events/s (eps): 3438985.1082
events/s (eps): 3600268.9196
events/s (eps): 3782130.8202
The output of GB mode is:
events/s (eps): 4919287.0608
events/s (eps): 5926525.9995
events/s (eps): 4933459.3272
events/s (eps): 5184040.1349
events/s (eps): 5030940.1116
events/s (eps): 5773255.0246
events/s (eps): 4649109.0129
events/s (eps): 5683217.7641
Create 4 cgroups, each running redis-server with 16 IO threads.
4 redis-benchmark instances per server show the average RPS as:
             share mode   gb mode     improvement
PING_INLINE 45903.94 46529.9 1.36%
PING_MBULK 48342.58 50881.99 5.25%
SET 38681.95 42108.17 8.86%
GET 46774.09 51067.99 9.18%
INCR 46092.24 50543.98 9.66%
LPUSH 41723.35 45464.73 8.97%
RPUSH 42722.77 47667.76 11.57%
LPOP 41010.75 45077.65 9.92%
RPOP 43198.33 47248.05 9.37%
SADD 44750.16 50253.79 12.30%
HSET 44352 47940.12 8.09%
SPOP 47436.64 51658.99 8.90%
ZADD 43124.02 45992.96 6.65%
ZPOPMIN 46854.7 51561.52 10.05%
LPUSH 41723.35 45464.73 8.97%
LRANGE_100 22411.47 23311.32 4.02%
LRANGE_300 11323.8 11585.06 2.31%
LRANGE_500 7516.12 7577.76 0.82%
LRANGE_600 6632.1 6737.31 1.59%
MSET 27945.01 29401.3 5.21%
Co-developed-by: Cruz Zhao <[email protected]>
Signed-off-by: Cruz Zhao <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
Co-developed-by: Tianchen Ding <[email protected]>
Signed-off-by: Tianchen Ding <[email protected]>
---
include/linux/cpuset.h | 5 +
include/linux/sched.h | 5 +
include/linux/sched/gb.h | 10 +
kernel/cgroup/cpuset.c | 39 ++++
kernel/sched/core.c | 5 +
kernel/sched/fair.c | 26 ++-
kernel/sched/gb.c | 448 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 10 +
8 files changed, 547 insertions(+), 1 deletion(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..3be2ab42bb98 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,11 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
+#ifdef CONFIG_GROUP_BALANCER
+struct gb_info *task_gi(struct task_struct *p, bool get_controller);
+cpumask_var_t *gi_cpus(struct gb_info *gi);
+#endif
+
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e0b5b4a4c8f..fca655770925 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,11 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_GROUP_BALANCER
+ u64 gb_stamp;
+ struct callback_head gb_work;
+#endif
+
#ifdef CONFIG_RSEQ
struct rseq __user *rseq;
u32 rseq_sig;
diff --git a/include/linux/sched/gb.h b/include/linux/sched/gb.h
index b03a2b4ef4b7..7af91662b740 100644
--- a/include/linux/sched/gb.h
+++ b/include/linux/sched/gb.h
@@ -9,10 +9,14 @@
#include <linux/sched.h>
#include <linux/cpumask.h>
+#include <linux/seq_file.h>
+#include <linux/cgroup-defs.h>
#define CPU_PART_MAX 32
struct gb_part {
int id;
+ unsigned int mgrt_on;
+ u64 predict_load;
struct cpumask cpus;
};
@@ -41,6 +45,12 @@ struct gb_info {
/* Stamp to record next settling. */
unsigned long settle_next;
+
+ /*
+ * Record preferred partition (in part_info of controller) of this cgroup.
+ * Default -1.
+ */
+ int gb_prefer;
};
#define for_each_gbpart(i, pi) \
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4b456b379b87..de13c22c1921 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -235,6 +235,20 @@ static struct gb_info *css_gi(struct cgroup_subsys_state *css, bool get_controll
return get_controller ? cs->control_gi : &cs->gi;
}
+
+struct gb_info *task_gi(struct task_struct *p, bool get_controller)
+{
+ struct cpuset *cs = task_cs(p);
+
+ return cs ? css_gi(&cs->css, get_controller) : NULL;
+}
+
+cpumask_var_t *gi_cpus(struct gb_info *gi)
+{
+ struct cpuset *cs = container_of(gi, struct cpuset, gi);
+
+ return &cs->effective_cpus;
+}
#else
static inline struct gb_info *cs_gi(struct cpuset *cs)
{
@@ -2572,6 +2586,21 @@ static inline void update_child_gb_controller(struct cpuset *cs, struct gb_info
rcu_read_unlock();
}
+static inline void reset_child_gb_prefer(struct cpuset *cs)
+{
+ struct cpuset *cp;
+ struct cgroup_subsys_state *pos_css;
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+ if (cp == cs)
+ continue;
+
+ cs_gi(cp)->gb_prefer = -1;
+ }
+ rcu_read_unlock();
+}
+
static inline void update_gb_part_info(struct cpuset *cs, struct gb_part_info *new)
{
struct gb_info *gi = cs_gi(cs);
@@ -2606,6 +2635,7 @@ static inline void init_gb(struct cpuset *cs)
gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period);
gi->settle_next = jiffies + gi->settle_period;
+ gi->gb_prefer = -1;
/* Search upwards to find any existing controller. */
for (css = cs->css.parent; css; css = css->parent) {
@@ -2637,6 +2667,7 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
struct cpuset *cs = css_cs(css);
struct gb_info *gi = cs_gi(cs);
int retval = -EINVAL;
+ bool reset;
if (val > MAX_GB_PERIOD)
return retval;
@@ -2650,9 +2681,17 @@ static int gb_period_write_u64(struct cgroup_subsys_state *css, struct cftype *c
if (!gi->part_info)
goto out_unlock;
+ /* gb_period = 0 means disabling group balancer. */
+ reset = gi->gb_period && !val;
+
gi->gb_period = val * NSEC_PER_USEC;
retval = 0;
+ if (reset) {
+ synchronize_rcu();
+ reset_child_gb_prefer(cs);
+ }
+
out_unlock:
percpu_up_write(&cpuset_rwsem);
return retval;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e5365755bdd..d57d6210bcfc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4484,6 +4484,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_GROUP_BALANCER
+ p->gb_stamp = 0;
+ p->gb_work.next = &p->gb_work;
+ init_task_work(&p->gb_work, group_balancing_work);
#endif
return 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10c8dc9b7494..c1bd3f62e39c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,6 +1609,9 @@ static void update_numa_stats(struct task_numa_env *env,
!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
+ if (group_hot(env->p, env->src_cpu, cpu) == 0)
+ continue;
+
if (ns->idle_cpu == -1)
ns->idle_cpu = cpu;
@@ -1944,6 +1947,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
continue;
+ if (group_hot(env->p, env->src_cpu, cpu) == 0)
+ continue;
+
env->dst_cpu = cpu;
if (task_numa_compare(env, taskimp, groupimp, maymove))
break;
@@ -6050,8 +6056,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
schedstat_inc(p->stats.nr_wakeups_affine_attempts);
- if (target == nr_cpumask_bits)
+ if (target == nr_cpumask_bits) {
+ if (group_hot(p, prev_cpu, this_cpu) == 1)
+ return this_cpu;
return prev_cpu;
+ }
schedstat_inc(sd->ttwu_move_affine);
schedstat_inc(p->stats.nr_wakeups_affine);
@@ -7846,6 +7855,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}
+ tsk_cache_hot = group_hot(p, env->src_cpu, env->dst_cpu);
+ if (tsk_cache_hot != -1)
+ return tsk_cache_hot;
+
/*
* Aggressive migration if:
* 1) active balance
@@ -8266,6 +8279,15 @@ static unsigned long task_h_load(struct task_struct *p)
return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
cfs_rq_load_avg(cfs_rq) + 1);
}
+
+#ifdef CONFIG_GROUP_BALANCER
+unsigned long cfs_h_load(struct cfs_rq *cfs_rq)
+{
+ update_cfs_rq_h_load(cfs_rq);
+ return cfs_rq->h_load;
+}
+#endif
+
#else
static bool __update_blocked_fair(struct rq *rq, bool *done)
{
@@ -11213,6 +11235,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+ task_tick_gb(rq, curr);
+
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c
index 4a2876ec1cca..f7da96253ad0 100644
--- a/kernel/sched/gb.c
+++ b/kernel/sched/gb.c
@@ -8,7 +8,455 @@
#include "sched.h"
+#define DECAY_PERIOD HZ
+
/* Params about settle period. (ms) */
unsigned int sysctl_gb_settle_period = 200;
unsigned int sysctl_gb_settle_period_max = 10000;
unsigned int sysctl_gb_global_settle_period = 200;
+
+static unsigned long global_settle_next;
+static unsigned long predict_period_stamp;
+DEFINE_SPINLOCK(settle_lock);
+
+static inline int ctop(struct gb_part_info *pi, int cpu)
+{
+ return pi->ctop[cpu];
+}
+
+static u64 tg_load_of_part(struct gb_part_info *pi, struct task_group *tg, int id)
+{
+ int i;
+ u64 load = 0;
+
+ for_each_cpu(i, part_cpus(pi, id))
+ load += cfs_h_load(tg->cfs_rq[i]);
+
+ return load;
+}
+
+static u64 load_of_part(struct gb_part_info *pi, int id)
+{
+ int i;
+ u64 load = 0;
+
+ for_each_cpu(i, part_cpus(pi, id))
+ load += cpu_rq(i)->cfs.avg.load_avg;
+
+ return load;
+}
+
+static inline int part_mgrt_lock(struct gb_part_info *pi, int src, int dst)
+{
+ struct gb_part *src_part, *dst_part;
+
+ if (src != -1) {
+ src_part = &pi->parts[src];
+ if (READ_ONCE(src_part->mgrt_on))
+ return 0;
+ }
+
+ if (dst != -1) {
+ dst_part = &pi->parts[dst];
+ if (READ_ONCE(dst_part->mgrt_on))
+ return 0;
+ }
+
+ if (src != -1 && xchg(&src_part->mgrt_on, 1))
+ return 0;
+
+ if (dst != -1 && xchg(&dst_part->mgrt_on, 1)) {
+ WRITE_ONCE(src_part->mgrt_on, 0);
+ return 0;
+ }
+
+ return 1;
+}
+
+static inline void part_mgrt_unlock(struct gb_part_info *pi, int src, int dst)
+{
+ struct gb_part *src_part, *dst_part;
+
+ if (src != -1) {
+ src_part = &pi->parts[src];
+ WRITE_ONCE(src_part->mgrt_on, 0);
+ }
+
+ if (dst != -1) {
+ dst_part = &pi->parts[dst];
+ WRITE_ONCE(dst_part->mgrt_on, 0);
+ }
+}
+
+static u64 cap_of_part(struct gb_part_info *pi, int id)
+{
+ int i;
+ u64 cap = 0;
+
+ for_each_cpu(i, part_cpus(pi, id))
+ cap += cpu_rq(i)->cpu_capacity;
+
+ return cap;
+}
+
+static inline void predict_load_add(struct gb_part_info *pi, int id, u64 load)
+{
+ pi->parts[id].predict_load += load;
+}
+
+static void predict_load_decay(struct gb_part_info *pi)
+{
+ int i, fact, passed;
+
+ passed = jiffies - global_settle_next + predict_period_stamp;
+ predict_period_stamp = passed % DECAY_PERIOD;
+ fact = passed / DECAY_PERIOD;
+
+ if (!fact)
+ return;
+
+ for_each_gbpart(i, pi) {
+ struct gb_part *gp = &pi->parts[i];
+
+ /*
+ * Decay NICE_0_LOAD into zero after 10 seconds
+ */
+ if (fact > 10)
+ gp->predict_load = 0;
+ else
+ gp->predict_load >>= fact;
+ }
+}
+
+static int try_to_settle(struct gb_part_info *pi, struct gb_info *gi, struct task_group *tg)
+{
+ int i, src, dst, ret;
+ u64 mgrt_load, tg_load, min_load, src_load, dst_load;
+ cpumask_var_t *effective_cpus = gi_cpus(gi);
+
+ src = dst = -1;
+ min_load = U64_MAX;
+ tg_load = 0;
+ for_each_gbpart(i, pi) {
+ u64 mgrt, load;
+
+ /* DO NOT settle in parts out of effective_cpus. */
+ if (!cpumask_intersects(part_cpus(pi, i), effective_cpus[0]))
+ continue;
+
+ mgrt = tg_load_of_part(pi, tg, i);
+ load = load_of_part(pi, i);
+ /* load after migration */
+ if (load > mgrt)
+ load -= mgrt;
+ else
+ load = 0;
+
+ /*
+ * Try to find the partition containing the
+ * minimum load, with the load of this task
+ * group excluded from the comparison.
+ *
+ * This helps prevent a partition full of
+ * tasks from this task group from being
+ * considered busy.
+ *
+ * As for the predicted load, the partition
+ * this group prefers is excluded, since that
+ * predicted load could have been introduced
+ * by the group itself.
+ *
+ * This is not precise, but it serves the idea
+ * of keeping a preferred partition as long as
+ * possible, to save the cost of resettling as
+ * much as possible.
+ */
+ if (i == gi->gb_prefer) {
+ src = i;
+ src_load = load + mgrt;
+ mgrt_load = mgrt;
+ } else
+ load += pi->parts[i].predict_load;
+
+ if (load < min_load) {
+ dst = i;
+ min_load = load;
+ dst_load = load + mgrt;
+ }
+
+ tg_load += mgrt;
+ }
+
+ if (!tg_load)
+ return 0;
+
+ ret = 0;
+ gi->settle_period *= 2;
+ if (src == -1) {
+ /* First settle */
+ gi->gb_prefer = dst;
+ predict_load_add(pi, dst, tg_load);
+ ret = 1;
+ } else if (src != dst) {
+ /* Resettle will cost, be careful */
+ long dst_imb, src_imb, dst_cap, src_cap;
+
+ src_cap = cap_of_part(pi, src);
+ dst_cap = cap_of_part(pi, dst);
+
+ /*
+ * src_load dst_load
+ * ------------ vs ---------
+ * src_capacity dst_capacity
+ *
+ * Should not cause further imbalancing after
+ * resettle.
+ */
+ src_imb = abs(src_load * dst_cap - dst_load * src_cap);
+ dst_imb = abs((src_load - mgrt_load) * dst_cap - (dst_load + mgrt_load) * src_cap);
+
+ if (dst_imb <= src_imb) {
+ gi->gb_prefer = dst;
+ predict_load_add(pi, dst, tg_load);
+ gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period) * 2;
+ ret = 1;
+ }
+ }
+
+ if (gi->settle_period > msecs_to_jiffies(sysctl_gb_settle_period_max))
+ gi->settle_period = msecs_to_jiffies(sysctl_gb_settle_period_max);
+
+ return ret;
+}
+
+/*
+ * group_hot() tells us whether src_cpu or dst_cpu is contained in
+ * the preferred CPU partition of the task's group.
+ *
+ * return 1 if prefer dst_cpu
+ * return 0 if prefer src_cpu
+ * return -1 if prefer either or neither
+ */
+int group_hot(struct task_struct *p, int src_cpu, int dst_cpu)
+{
+ int ret = -1;
+ struct task_group *tg;
+ struct gb_info *gi, *control_gi;
+ struct gb_part_info *pi;
+
+ rcu_read_lock();
+
+ control_gi = task_gi(p, true);
+
+ if (!control_gi || !control_gi->gb_period)
+ goto out_unlock;
+
+ gi = task_gi(p, false);
+ pi = control_gi->part_info;
+ tg = task_group(p);
+ if (gi->gb_prefer != -1 && ctop(pi, src_cpu) != ctop(pi, dst_cpu))
+ ret = (gi->gb_prefer == ctop(pi, dst_cpu));
+
+out_unlock:
+ rcu_read_unlock();
+ return ret;
+}
+
+void task_tick_gb(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->gb_work;
+ struct gb_info *gi, *control_gi;
+ struct gb_part_info *pi;
+ u64 period, now;
+
+ if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+ return;
+
+ rcu_read_lock();
+
+ control_gi = task_gi(curr, true);
+
+ if (!control_gi || !control_gi->gb_period)
+ goto unlock;
+
+ gi = task_gi(curr, false);
+ pi = control_gi->part_info;
+
+ /* Save it when already satisfied. */
+ if (gi->gb_prefer != -1 &&
+ gi->gb_prefer == ctop(pi, task_cpu(curr)))
+ goto unlock;
+
+ now = curr->se.sum_exec_runtime;
+ period = control_gi->gb_period;
+
+ if (now > curr->gb_stamp + period) {
+ curr->gb_stamp = now;
+ task_work_add(curr, work, TWA_RESUME);
+ }
+
+unlock:
+ rcu_read_unlock();
+}
+
+void group_balancing_work(struct callback_head *work)
+{
+ int cpu, this_cpu, this_part, this_prefer, best_cpu;
+ struct task_group *this_tg;
+ struct gb_info *this_gi, *control_gi;
+ struct gb_part_info *pi;
+ struct task_struct *best_task;
+ struct cpumask cpus;
+
+ SCHED_WARN_ON(current != container_of(work, struct task_struct, gb_work));
+
+ work->next = work;
+ if (current->flags & PF_EXITING)
+ return;
+
+ rcu_read_lock();
+ /*
+ * We build the group balancer on the "cpuset" subsys, and gather load
+ * info from the "cpu" subsys, so we need to ensure these two subsystems
+ * belong to the same cgroup.
+ */
+ if (task_cgroup(current, cpuset_cgrp_id) != task_group(current)->css.cgroup)
+ goto unlock;
+
+ control_gi = task_gi(current, true);
+
+ if (!control_gi || !control_gi->gb_period)
+ goto unlock;
+
+ this_gi = task_gi(current, false);
+ pi = control_gi->part_info;
+ this_tg = task_group(current);
+
+ /*
+ * Settling task groups one by one helps prevent the
+ * situation where multiple groups try to settle into
+ * the same partition at the same time.
+ *
+ * However, when a bunch of groups try to settle at
+ * the same time, there is no guarantee of fairness;
+ * some of them may get more chances and settle
+ * sooner than the others.
+ *
+ * So one trick here is to grow the cg_settle_period
+ * of a settled group, to make sure it yields the
+ * next chances to others.
+ *
+ * Another trick here is about prediction: as settling
+ * a group will be followed by a bunch of task migrations,
+ * the current load of a CPU partition can't imply its
+ * busyness in the future, and we may pick a busy one in
+ * the end.
+ *
+ * Thus we maintain the predicted load after each settle,
+ * so the next try will be able to do the prediction and
+ * avoid picking those which are already busy enough.
+ */
+ if (spin_trylock(&settle_lock)) {
+ if (time_after(jiffies, global_settle_next) &&
+ time_after(jiffies, this_gi->settle_next)) {
+ predict_load_decay(pi);
+
+ global_settle_next = jiffies;
+ if (try_to_settle(pi, this_gi, this_tg))
+ global_settle_next +=
+ msecs_to_jiffies(sysctl_gb_global_settle_period);
+
+ this_gi->settle_next = jiffies + this_gi->settle_period;
+ }
+ spin_unlock(&settle_lock);
+ }
+
+ this_cpu = task_cpu(current);
+ this_prefer = this_gi->gb_prefer;
+ this_part = ctop(pi, this_cpu);
+ if (this_prefer == -1 ||
+ this_part == this_prefer ||
+ !part_mgrt_lock(pi, this_part, this_prefer)) {
+ goto unlock;
+ }
+
+ cpumask_copy(&cpus, part_cpus(pi, this_prefer));
+
+ /*
+ * We arrive here when the current task A does not prefer
+ * its current CPU, but prefers the CPUs of partition Y.
+ *
+ * In other words, if task A could run on the CPUs of
+ * partition Y, it has a good chance to reduce conflict
+ * with the tasks from other groups, and to share hot
+ * cache with the tasks from the same group.
+ *
+ * So here is the main logic of the group balancer to
+ * achieve its purpose: make sure groups of tasks
+ * are balanced into groups of CPUs.
+ *
+ * If A's current CPU belongs to CPU partition X,
+ * try to find a CPU in partition Y which is running
+ * a task that prefers partition X, and swap them.
+ *
+ * Otherwise, or if no such CPU and task can be found,
+ * just find an idle CPU in partition Y and do the
+ * migration.
+ *
+ * Ideally the migration and swap work will finally
+ * put the tasks into the right places, but the wakeup
+ * path can easily break that by locating an idle CPU
+ * out of the range.
+ *
+ * However, since the whole idea is to gain cache
+ * benefit and reduce conflict between groups, if
+ * there are enough idle CPUs out there then everything
+ * is just fine, so let it go.
+ */
+
+ best_cpu = -1;
+ best_task = NULL;
+ for_each_cpu_and(cpu, &cpus, current->cpus_ptr) {
+ struct task_struct *p;
+
+ WARN_ON(cpu == this_cpu);
+
+ if (available_idle_cpu(cpu)) {
+ best_cpu = cpu;
+ continue;
+ }
+
+ if (this_part == -1)
+ continue;
+
+ p = rcu_dereference(cpu_rq(cpu)->curr);
+
+ if (!p || (p->flags & PF_EXITING) || is_idle_task(p)
+ || p == current || task_gi(p, true) != control_gi)
+ continue;
+
+ if (task_cpu(p) == cpu &&
+ this_part == task_gi(p, false)->gb_prefer &&
+ cpumask_test_cpu(this_cpu, p->cpus_ptr)) {
+ get_task_struct(p);
+ best_task = p;
+ best_cpu = cpu;
+ break;
+ }
+ }
+
+ rcu_read_unlock();
+
+ if (best_task) {
+ migrate_swap(current, best_task, best_cpu, this_cpu);
+ put_task_struct(best_task);
+ } else if (best_cpu != -1) {
+ migrate_task_to(current, best_cpu);
+ }
+
+ part_mgrt_unlock(pi, this_part, this_prefer);
+ return;
+
+unlock:
+ rcu_read_unlock();
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d6555aacdf78..97c621bcded3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3148,3 +3148,13 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif
+#ifdef CONFIG_GROUP_BALANCER
+extern unsigned long cfs_h_load(struct cfs_rq *cfs_rq);
+extern int group_hot(struct task_struct *p, int src_cpu, int dst_cpu);
+extern void task_tick_gb(struct rq *rq, struct task_struct *curr);
+extern void group_balancing_work(struct callback_head *work);
+#else
+static inline int group_hot(struct task_struct *p, int src_cpu, int dst_cpu) { return -1; }
+static inline void task_tick_gb(struct rq *rq, struct task_struct *curr) {};
+static inline void group_balancing_work(struct callback_head *work) {};
+#endif
--
2.27.0
On 2022/3/9 01:13, Tejun Heo wrote:
> Hello,
>
> On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
>> Modern platform are growing fast on CPU numbers. To achieve better
>> utility of CPU resource, multiple apps are starting to sharing the CPUs.
>>
>> What we need is a way to ease confliction in share mode,
>> make groups as exclusive as possible, to gain both performance
>> and resource efficiency.
>>
>> The main idea of group balancer is to fulfill this requirement
>> by balancing groups of tasks among groups of CPUs, consider this
>> as a dynamic demi-exclusive mode. Task trigger work to settle it's
>> group into a proper partition (minimum predicted load), then try
>> migrate itself into it. To gradually settle groups into the most
>> exclusively partition.
>>
>> GB can be seen as an optimize policy based on load balance,
>> it obeys the main idea of load balance and makes adjustment
>> based on that.
>>
>> Our test on ARM64 platform with 128 CPUs shows that,
>> throughput of sysbench memory is improved about 25%,
>> and redis-benchmark is improved up to about 10%.
>
> The motivation makes sense to me but I'm not sure this is the right way to
> architecture it. We already have the framework to do all these - the sched
> domains and the load balancer. Architecturally, what the suggested patchset
> is doing is building a separate load balancer on top of cpuset after using
> cpuset to disable the existing load balancer, which is rather obviously
> convoluted.
>
"the sched domains and the load balancer" you mentioned are the ways to
"balance" tasks on each domains. However, this patchset aims to "group"
them together to win hot cache and less competition, which is different
from load balancer. See commit log of the patch 3/4 and this link:
https://lore.kernel.org/all/[email protected]/
> * AFAICS, none of what the suggested code does is all that complicated or
> needs a lot of input from userspace. it should be possible to parametrize
> the existing load balancer to behave better.
>
The group balancer mainly needs 2 inputs from userspace: CPU partition info
and cgroup info.
CPU partition info does need user input (and may be a bit complicated).
As a result, the division method is left __free__ to users (it can follow
NUMA nodes, clusters, caches, etc.).
Cgroup info doesn't need extra input; it's naturally configured.
It does parametrize the existing load balancer to behave better.
The group balancer is a kind of optimization policy, and should obey the
basic policy (load balance) and improve it.
The relationship between the load balancer and the group balancer is
explained in detail at the above link.
> * If, for some reason, you need more customizable behavior in terms of cpu
> allocation, which is what cpuset is for, maybe it'd be better to build the
> load balancer in userspace. That'd fit way better with how cgroup is used
> in general and with threaded cgroups, it should fit nicely with everything
> else.
>
We put the group balancer in kernel space because this new policy does not
depend on userspace apps. It's a "general" feature.
Doing "dynamic cpuset" in userspace may also introduce performance
issues, since it may need to bind and unbind different cpusets several
times, and it is too strict (compared with our "soft bind").
On 2022/3/10 02:00, Tejun Heo wrote:
> Hello,
>
> On Wed, Mar 09, 2022 at 04:30:51PM +0800, Tianchen Ding wrote:
>> "the sched domains and the load balancer" you mentioned are the ways to
>> "balance" tasks on each domains. However, this patchset aims to "group" them
>> together to win hot cache and less competition, which is different from load
>> balancer. See commit log of the patch 3/4 and this link:
>> https://lore.kernel.org/all/[email protected]/
>
> I read that but it doesn't make whole lot of sense to me. As Peter noted, we
> already have issues with cross NUMA node balancing interacting with in-node
> balancing, which likely indicates that it needs more unified solution rather
> than more fragmented. I have a hard time seeing how adding yet another layer
> on top helps the situation.
>
>>> * If, for some reason, you need more customizable behavior in terms of cpu
>>> allocation, which is what cpuset is for, maybe it'd be better to build the
>>> load balancer in userspace. That'd fit way better with how cgroup is used
>>> in general and with threaded cgroups, it should fit nicely with everything
>>> else.
>>>
>>
>> We put group balancer in kernel space because this new policy does not
>> depend on userspace apps. It's a "general" feature.
>
> Well, it's general for use cases which are happy with the two knobs that you
> defined for your use case.
>
>> Doing "dynamic cpuset" in userspace may also introduce performance issue,
>> since it may need to bind and unbind different cpusets for several times,
>> and is too strict(compared with our "soft bind").
>
> My bet is that you're gonna be able to get just about the same bench results
> with userspace diddling with thread cgroup membership. Why not try that
> first? The interface is already there. I have a hard time seeing the
> justification for hard coding this into the kernel at this stage.
>
> Thanks.
>
Well, I understand your point is to put this in userspace. However,
we've considered that but found it hard to do so. If you have any
better idea, please share it with us. :-)
If we want to build the group balancer in userspace, we need to:
1) gather load info from each rq periodically
2) make decisions and set cpuset.cpus of each cgroup
However, there are some problems with this approach.
For 1), we need to consider how frequently to collect this info, which
may impact performance and accuracy. If the load changes hugely right
after we sample it, our data are stale and the decision may be
wrong. (If we are in the kernel, faster action can be taken.)
We believe 2) is harder. The specific policy may be complex and vary
across different scenarios; there is no general method.
E.g., with 16 CPUs and 4 cgroups, how do we decide when to set one of them
to 0-3 (when busy) or 0-7 (when some of the other cgroups are idle)? If
there are many more threads in cgroup A than in cgroups B/C/D, and we want
to satisfy cgroup A as far as possible (on the premise of fairness for
B/C/D), dynamically enlarging (when B/C/D are partly idle) and shrinking
(when B/C/D are busy) the cpuset of cgroup A requires a complex policy.
In this example, fairness and performance can be provided by the existing
scheduler, but when it comes to grouping for hot cache or reducing
competition, both the in-kernel scheduler and userspace actions struggle.
What's more, in many cloud computing scenarios, there may be hundreds or
thousands of containers, far more than the number of partitions.
These containers may be dynamically created and destroyed at any time.
Making a policy to manage them from userspace will not be practical.
These problems become easy when going to kernel space. We get info
directly from the scheduler, and help revise its decisions at some key
points, or do some support work (e.g., task migration when possible).
Thanks.
On 2022/3/22 02:16, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 10, 2022 at 01:47:34PM +0800, Tianchen Ding wrote:
>> If we want to build group balancer in userspace, we need:
>> 1) gather load info from each rq periodically
>> 2) make decision to set cpuset.cpus of each cgroup
>>
>> However, there're some problems about this way.
>>
>> For 1), we need to consider how frequently collecting these info, which may
>> impact performance and accuracy. If the load changes hugely right after we
>> get it once, our data are expired and then the decision may be wrong. (If we
>> are in kernel, faster action can be taken.)
>
> We now have a pretty well established way to transport data to userspace at
> really low overhead. If you piggy back on bpf interfaces, they can usually
> be pretty unintrusive and low effort as long as you have the right kind of
> data aggregated already, which shouldn't be that difficult here.
>
Yes, bpf is a good way to fetch these data.
In fact, we may also consider influencing some decisions of the scheduler
in some scenarios, but that seems to be long-term work...
>> We believe 2) is harder. The specific policy may be complex and alter
>> according to different scenes. There's not a general method.
>> e.g., with 16cpus and 4 cgroups, how to decide when we set one of them
>> 0-3(when busy)or 0-7(when some of other cgroups are idle)? If there are much
>> more threads in cgroupA than cgroupB/C/D , and we want to satisfy cgroupA as
>> far as possible(on the premise of fairness of B/C/D), dynamically
>> enlarging(when B/C/D partly idle) and shrinking(when B/C/D busy) cpuset of
>> cgroupA requires complex policy. In this example, fairness and performance
>> can be provided by existing scheduler, but when it comes to grouping hot
>> cache or decreasing competion, both scheduler in kernel and action in
>> userspace are hard to solve.
>
> So, I get that it's not easy. In fact, we don't even yet know how to
> properly compare loads across groups of CPUs - simple sums that you're using
> break down when there are big gaps in weight numbers across tasks and can
> become meaningless in the presence of CPU affinities. They can still work
"simple sums" is fine because we only use this load when selecting
partitions, not migrating.
If I understand correctly, do you mean a scene that:
cgroupA has only one thread with full busy load (1024)
cgroup B/C/D have many threads with small load (maybe 100)
There're two CPU partitons:0-3 and 4-7
If A choose 0-3 (we suppose its thread sticking on cpu0), so the load of
partition 0-3 becomes 1024 because simple sums. So B/C/D will all choose
partition 4-7. This will be fine since our group balancer is a kind of
"soft" bind. If we are waking a thread from cgroupB and find there is no
free cpu in 4-7, we'll fallback to normal path and select other idle cpu
(i.e., 1-3) if possible.
Otherwise, if there is no idle cpu totally, meaning the whole system is
busy. The existing load balance mechanism will handle it, and our
patches only try to switch two tasks if their cache in their own cgroups
can be closer after migration.
> when the configuration is fairly homogeenous and controlled but the standard
> should be far higher for something we bake into the kernel and expose
> userland-visible interface for.
>
Our aim is to improve the performance of apps, so we will not limit CPU
usage strictly. This kind of soft bind also helps to avoid conflicts with
the existing load balance mechanism. We may explain this in the documentation.
>> What's more, in many cloud computing scenes, there may be hundreds or
>> thousands of containers, which are much larger than partition number. These
>> containers may be dynamically created and destroyed at any time. Making
>> policy to manage them from userspace will not be practical.
>>
>> These problems become easy when going to kernelspace. We get info directly
>> from scheduler, and help revising its decision at some key points, or do
>> some support work(e.g., task migration if possible).
>
> I don't think they become necessarily easy. Sure, you can hack up something
> which works for some cases by poking into existing code; however, the bar
> for acceptance is also way higher for a kernel interface - it should be
> generic, consistent with other interfaces (I won't go into cgroup interface
> issues here), and work orthogonally with other kernel features (ie. task /
> group weights should work in an explainable way). I don't think the proposed
> patches are scoring high in those axes.
>
We put the interface into the cpuset subsys because partition info should
follow effective_cpus. One issue is that we need to get load
info from the cpu subsys, so both "cpu" and "cpuset" must be enabled in
the same cgroup. Fortunately, this is easy in cgroup-v2. Do you see any
other conflict?
About "work orthogonally with other kernel features", we've explained
this above. Weights are only used when selecting partitions, and we only
tend to pull tasks toward their selected partitions; we do not force them.
> I'm not against the goal here. Given that cgroups express the logical
> structure of applications running on the system, it does make sense to
> factor that into scheduling decisions. However, what's proposed seems too
> premature and I have a hard time seeing why this level of functionality
> would be difficult to be implement from userspace with some additions in
> terms of visibility which is really easy to do these days.
As we explained in the last mail, we believe building this in
userspace is not easy, because the interfaces currently provided by the
kernel are inflexible. From the view of a cgroup, it is given a bound
"cpuset.cpus", and all CPUs in this range are treated equally, while all
CPUs out of this range are strictly forbidden. The existing cpuset
mechanisms do help with resource limitation and isolation, but may be
weak at improving performance.
So our view is that doing this in userspace, relying on the existing
kernel interfaces, is hard. Merely exposing more data to increase
visibility in userspace is not enough; we need a whole solution, from
collecting data to influencing the decisions of the scheduler. Maybe eBPF
can cover all this work in the future, but it takes a long time to develop
and may be too much just to achieve a dynamic CPU bind.
Thanks.
Ping~
Any thoughts on this patchset? :-)