Subject: Re: [RFC PATCH] sched: introduce group balancer
From: 王贇 <yun.wang@linux.alibaba.com>
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Masahiro Yamada, Andrew Morton, Nathan Chancellor, Kees Cook, Vlastimil Babka, Chris Down, Vipin Sharma, Rasmus Villemoes, Daniel Borkmann, Arnd Bergmann, Cruz Zhao, Tianchen Ding, linux-kernel@vger.kernel.org
Date: Wed, 12 Jan 2022 10:59:09 +0800

Friendly ping~

Please let me know if you have any comments or suggestions :-)

Regards,
Michael Wang

On 2022/1/4 15:33, 王贇 wrote:
> Modern platforms are growing fast in CPU count, and it is very
> common for multiple apps to share one box. They used to run with
> exclusive CPU settings, but nowadays things are changing.
>
> To make better use of the CPU resources, multiple apps now share
> the CPUs. The CPU resources are usually overcommitted, since each
> app's workload fluctuates.
>
> This hurts performance in share mode compared with exclusive mode.
> For example, with cgroups A, B and C deployed in exclusive mode,
> the layout will be:
>
>   CPU_X (100%)     CPU_Y (100%)     CPU_Z (50%)
>   T_1_CG_A         T_1_CG_B         T_1_CG_C
>   T_2_CG_A         T_2_CG_B         T_2_CG_C
>   T_3_CG_A         T_3_CG_B
>   T_4_CG_A         T_4_CG_B
>
> while in share mode it will be:
>
>   CPU_X (100%)     CPU_Y (75%)     CPU_Z (75%)
>   T_1_CG_A         T_2_CG_A         T_1_CG_B
>   T_2_CG_B         T_3_CG_B         T_2_CG_C
>   T_4_CG_B         T_4_CG_A         T_3_CG_A
>   T_1_CG_C
>
> As we can see, the conflicts between groups over CPU resources now
> happen all over the CPUs.
>
> Testing with sysbench-memory shows a 30+% drop in share mode, and
> redis-benchmark shows a 10+% drop too, compared with exclusive
> mode.
>
> However, despite the performance drop, in the real world we still
> prefer share mode.
> The fluctuating workload can make exclusive mode very inefficient
> in CPU utilization. For example, in the next period, when CG_A
> becomes 'idle', exclusive mode will look like:
>
>   CPU_X (0%)     CPU_Y (100%)     CPU_Z (50%)
>                  T_1_CG_B         T_1_CG_C
>                  T_2_CG_B         T_2_CG_C
>                  T_3_CG_B
>                  T_4_CG_B
>
> while share mode will look like:
>
>   CPU_X (50%)     CPU_Y (50%)     CPU_Z (50%)
>   T_2_CG_B         T_1_CG_C        T_3_CG_B
>   T_4_CG_B         T_1_CG_B        T_2_CG_C
>
> CPU_X is totally wasted in exclusive mode; the resource efficiency
> is really poor.
>
> Thus what we need is a way to ease the conflicts in share mode and
> make groups as exclusive as possible, to gain both performance and
> resource efficiency.
>
> The main idea of the group balancer is to fulfill this requirement
> by balancing groups of tasks among groups of CPUs; consider it a
> dynamic demi-exclusive mode.
>
> Just like balancing tasks among CPUs, with GB a user can now put
> CPUs X, Y, Z into three partitions and balance groups A, B, C into
> these partitions, to make them as exclusive as possible.
>
> The design is very similar to NUMA balancing: a task triggers work
> to settle its group into a proper partition (the one with the
> minimum predicted load), then tries to migrate itself into it, so
> that groups gradually settle into the most exclusive partitions.
>
> How To Use:
>
> To create partitions, for example run:
>   echo disable > /proc/gb_ctrl
>   echo "0-15;16-31;32-47;48-63;" > /proc/gb_ctrl
>   echo enable > /proc/gb_ctrl
>
> This will create 4 partitions containing CPUs 0-15, 16-31, 32-47
> and 48-63 respectively.
>
> Then enable GB for your cgroup by writing a nonzero period (in
> milliseconds) to:
>   $CPU_CGROUP_PATH/cpu.gb_period_ms
>
> And you can check:
>   $CPU_CGROUP_PATH/cpu.gb_stat
>
> which gives output like:
>   PART-0 0-15 1008 1086  *
>   PART-1 16-31 0 2
>   PART-2 32-47 0 0
>   PART-3 48-63 0 1024
>
> Each line shows the partition ID, followed by its CPU range, the
> load of this group in that partition, the load of the whole
> partition, and a star marking the preferred partition.
>
> Testing Results:
>   In order to enlarge the differences, we test on an ARM platform
>   with 128 CPUs and create 8 partitions according to the cluster
>   info (see the example setup sketched below).
>
>   Since we picked benchmarks which can benefit from exclusive
>   mode, this is more of a functional test than a performance test,
>   to show that GB helps win back the performance.
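>
>   For reference, a test of this shape can be set up along the
>   following lines (bash sketch; the CPU ranges and the 100ms period
>   are only example values, assuming 8 clusters of 16 CPUs each, and
>   should match the real cluster topology):
>
>     # carve the 128 CPUs into 8 partitions, one per 16-CPU cluster
>     echo disable > /proc/gb_ctrl
>     echo "0-15;16-31;32-47;48-63;64-79;80-95;96-111;112-127;" > /proc/gb_ctrl
>     echo enable > /proc/gb_ctrl
>
>     # one cgroup per benchmark instance, with GB enabled
>     for i in $(seq 0 7); do
>         cg=/sys/fs/cgroup/cpu/CG_$i
>         mkdir -p $cg
>         echo 100 > $cg/cpu.gb_period_ms
>         (
>             echo $BASHPID > $cg/cgroup.procs   # move this subshell into the cgroup
>             exec sysbench memory --threads=16 run
>         ) &
>     done
>     wait
>
>     # afterwards, check which partition each group settled in
>     grep . /sys/fs/cgroup/cpu/CG_*/cpu.gb_stat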
> >   Create 8 cgroup each running 'sysbench memory --threads=16 run', >   the output of share mode is: >     events/s (eps):                      4181233.4646 >     events/s (eps):                      3548328.2346 >     events/s (eps):                      4578816.2412 >     events/s (eps):                      4761797.3932 >     events/s (eps):                      3486703.0455 >     events/s (eps):                      3474920.9803 >     events/s (eps):                      3604632.7799 >     events/s (eps):                      3149506.7001 >   the output of gb mode is: >     events/s (eps):                      5472334.9313 >     events/s (eps):                      4085399.1606 >     events/s (eps):                      4398122.2170 >     events/s (eps):                      6180233.6766 >     events/s (eps):                      4299784.2742 >     events/s (eps):                      4914813.6847 >     events/s (eps):                      3675395.1191 >     events/s (eps):                      6767666.6229 > >   Create 4 cgroup each running redis-server with 16 io threads, >   4 redis-benchmark per each server show average rps as: > >                     share mode       gb mode > >   PING_INLINE     : 41154.84         42229.27               2.61% >   PING_MBULK      : 43042.07         44907.10               4.33% >   SET             : 34502.00         37374.58               8.33% >   GET             : 41713.47         45257.68               8.50% >   INCR            : 41533.26         44259.31               6.56% >   LPUSH           : 36541.23         39417.84               7.87% >   RPUSH           : 39059.26         42075.32               7.72% >   LPOP            : 36978.73         39903.15               7.91% >   RPOP            : 39553.32         42071.53               6.37% >   SADD            : 40614.30         44693.33              10.04% >   HSET            : 39101.93         42401.16               8.44% >   SPOP            : 42838.90         46560.46               8.69% >   ZADD            : 38346.80         41685.46               8.71% >   ZPOPMIN         : 41952.26         46138.14               9.98% >   LRANGE_100      : 19364.66         20251.56               4.58% >   LRANGE_300      : 9699.57          9935.86                2.44% >   LRANGE_500      : 6291.76          6512.48                3.51% >   LRANGE_600      : 5619.13          5658.31                0.70% >   MSET            : 24432.78         26517.63               8.53% > > Signed-off-by: Cruz Zhao > Signed-off-by: Tianchen Ding > Signed-off-by: Michael Wang > --- >  include/linux/sched.h |   5 + >  init/Kconfig          |  12 + >  kernel/sched/Makefile |   1 + >  kernel/sched/core.c   |  44 +++ >  kernel/sched/fair.c   |  22 +- >  kernel/sched/gb.c     | 767 > ++++++++++++++++++++++++++++++++++++++++++++++++++ >  kernel/sched/sched.h  |  24 ++ >  7 files changed, 874 insertions(+), 1 deletion(-) >  create mode 100644 kernel/sched/gb.c > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 0cd3d9c..efabfb1 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1284,6 +1284,11 @@ struct task_struct { >      unsigned long            numa_pages_migrated; >  #endif /* CONFIG_NUMA_BALANCING */ > > +#ifdef CONFIG_GROUP_BALANCER > +    unsigned long            gb_stamp; > +    struct callback_head        gb_work; > +#endif > + >  #ifdef CONFIG_RSEQ >      struct rseq __user *rseq; >      u32 rseq_sig; > diff --git a/init/Kconfig b/init/Kconfig > index 41a728d..32131af 100644 > 
--- a/init/Kconfig > +++ b/init/Kconfig > @@ -1021,6 +1021,18 @@ config RT_GROUP_SCHED >        realtime bandwidth for them. >        See Documentation/scheduler/sched-rt-group.rst for more > information. > > +config GROUP_BALANCER > +    bool "Group balancer support for SCHED_OTHER" > +    depends on (CGROUP_SCHED && SMP) > +    default CGROUP_SCHED > +    help > +      This feature allow you to do load balance in group mode. In other > +      word, balance groups of tasks among groups of CPUs. > + > +      This can reduce the conflict between task groups and gain benefit > +      from hot cache, usually for the cases when there are multiple apps > +      sharing the same CPUs. > + >  endif #CGROUP_SCHED > >  config UCLAMP_TASK_GROUP > diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile > index c7421f2..e24d645 100644 > --- a/kernel/sched/Makefile > +++ b/kernel/sched/Makefile > @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o >  obj-$(CONFIG_CPU_ISOLATION) += isolation.o >  obj-$(CONFIG_PSI) += psi.o >  obj-$(CONFIG_SCHED_CORE) += core_sched.o > +obj-$(CONFIG_GROUP_BALANCER) += gb.o > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 83872f9..e3675a1 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4436,6 +4436,10 @@ int sched_fork(unsigned long clone_flags, struct > task_struct *p) >      plist_node_init(&p->pushable_tasks, MAX_PRIO); >      RB_CLEAR_NODE(&p->pushable_dl_tasks); >  #endif > +#ifdef CONFIG_GROUP_BALANCER > +    p->gb_stamp = 0; > +    init_task_work(&p->gb_work, group_balancing_work); > +#endif >      return 0; >  } > > @@ -10554,6 +10558,35 @@ static int cpu_cfs_stat_show(struct seq_file > *sf, void *v) >  #endif /* CONFIG_CFS_BANDWIDTH */ >  #endif /* CONFIG_FAIR_GROUP_SCHED */ > > +#ifdef CONFIG_GROUP_BALANCER > +int cpu_cfs_gb_stat_show(struct seq_file *sf, void *v) > +{ > +    struct task_group *tg = css_tg(seq_css(sf)); > + > +    if (gb_enabled() && tg != &root_task_group) > +        gb_stat_show(tg, sf); > + > +    return 0; > +} > + > +static int cpu_gb_period_write(struct cgroup_subsys_state *css, > +                    struct cftype *cftype, u64 period_ms) > +{ > +    struct task_group *tg = css_tg(css); > + > +    if (tg == &root_task_group) > +        return -EINVAL; > + > +    return set_gb_period(tg, period_ms); > +} > + > +static u64 cpu_gb_period_read(struct cgroup_subsys_state *css, > +                   struct cftype *cft) > +{ > +    return get_gb_period(css_tg(css)); > +} > +#endif > + >  #ifdef CONFIG_RT_GROUP_SCHED >  static int cpu_rt_runtime_write(struct cgroup_subsys_state *css, >                  struct cftype *cft, s64 val) > @@ -10628,6 +10661,17 @@ static int cpu_idle_write_s64(struct > cgroup_subsys_state *css, >          .seq_show = cpu_cfs_stat_show, >      }, >  #endif > +#ifdef CONFIG_GROUP_BALANCER > +    { > +        .name = "gb_stat", > +        .seq_show = cpu_cfs_gb_stat_show, > +    }, > +    { > +        .name = "gb_period_ms", > +        .read_u64 = cpu_gb_period_read, > +        .write_u64 = cpu_gb_period_write, > +    }, > +#endif >  #ifdef CONFIG_RT_GROUP_SCHED >      { >          .name = "rt_runtime_us", > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 095b0aa..80f5038 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1577,6 +1577,9 @@ static void update_numa_stats(struct task_numa_env > *env, >                  !cpumask_test_cpu(cpu, env->p->cpus_ptr)) >                  continue; > > +            if (group_hot(env->p, env->src_cpu, cpu) == 
0) > +                continue; > + >              if (ns->idle_cpu == -1) >                  ns->idle_cpu = cpu; > > @@ -1912,6 +1915,9 @@ static void task_numa_find_cpu(struct > task_numa_env *env, >          if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) >              continue; > > +        if (group_hot(env->p, env->src_cpu, cpu) == 0) > +            continue; > + >          env->dst_cpu = cpu; >          if (task_numa_compare(env, taskimp, groupimp, maymove)) >              break; > @@ -5965,8 +5971,11 @@ static int wake_affine(struct sched_domain *sd, > struct task_struct *p, >          target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync); > >      schedstat_inc(p->stats.nr_wakeups_affine_attempts); > -    if (target == nr_cpumask_bits) > +    if (target == nr_cpumask_bits) { > +        if (group_hot(p, prev_cpu, this_cpu) == 1) > +            return this_cpu; >          return prev_cpu; > +    } > >      schedstat_inc(sd->ttwu_move_affine); >      schedstat_inc(p->stats.nr_wakeups_affine); > @@ -7761,6 +7770,12 @@ int can_migrate_task(struct task_struct *p, > struct lb_env *env) >          return 0; >      } > > +    if (gb_enabled()) { > +        tsk_cache_hot = group_hot(p, env->src_cpu, env->dst_cpu); > +        if (tsk_cache_hot != -1) > +            return tsk_cache_hot; > +    } > + >      /* >       * Aggressive migration if: >       * 1) active balance > @@ -11125,6 +11140,9 @@ static void task_tick_fair(struct rq *rq, struct > task_struct *curr, int queued) >      if (static_branch_unlikely(&sched_numa_balancing)) >          task_tick_numa(rq, curr); > > +    if (gb_enabled()) > +        task_tick_gb(rq, curr); > + >      update_misfit_status(curr, rq); >      update_overutilized_status(task_rq(curr)); > > @@ -11453,6 +11471,8 @@ int alloc_fair_sched_group(struct task_group > *tg, struct task_group *parent) >          init_entity_runnable_average(se); >      } > > +    init_cfs_gb(tg); > + >      return 1; > >  err_free_rq: > diff --git a/kernel/sched/gb.c b/kernel/sched/gb.c > new file mode 100644 > index 0000000..e7195c6 > --- /dev/null > +++ b/kernel/sched/gb.c > @@ -0,0 +1,767 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + *  Group Balancer code > + * > + *  Copyright (C) 2021 Alibaba, Inc., Michael Wang > > + */ > +#include "sched.h" > + > +#define GB_IMBALANCE_MIN    32 > +#define CPU_PART_MAX        32 > +#define DECAY_PERIOD        HZ > +#define for_each_part(pi, x)    \ > +    for (x = 0; x < pi->nr_part; x++) > +#define for_each_part_but(pi, x, y)    \ > +    for (x = (y + 1) % pi->nr_part; x != y; x = (x + 1) % pi->nr_part) > + > +struct gb_part { > +    int id; > +    unsigned int mgrt_on; > +    u64 predict_load; > +    struct cpumask cpus; > +}; > + > +struct gb_part_info { > +    int nr_part; > +    struct rcu_head rcu_head; > +    struct gb_part parts[CPU_PART_MAX]; > +    int ctop[NR_CPUS]; > +}; > + > +static unsigned long tg_settle_period = HZ/5; > +static unsigned long tg_settle_period_max = HZ*10; > +static unsigned long global_settle_period = HZ/5; > + > +static unsigned long global_settle_next; > +static unsigned long predict_period_stamp; > + > +DEFINE_STATIC_KEY_FALSE(sched_group_balancer); > +DEFINE_MUTEX(gb_mutex); > +DEFINE_SPINLOCK(settle_lock); > + > +static struct gb_part_info *part_info; > + > +static inline int ctop(struct gb_part_info *pi, int cpu) > +{ > +    return pi->ctop[cpu]; > +} > + > +static inline struct cpumask *part_cpus(struct gb_part_info *pi, int id) > +{ > +    return &pi->parts[id].cpus; > +} > + > +static 
inline int part_mgrt_lock(struct gb_part_info *pi, int src, int > dst) > +{ > +    struct gb_part *src_part, *dst_part; > + > +    /* > +     * The target partition number must be > +     * nonnegative. > +     */ > +    if (dst == -1) > +        return 0; > + > +    dst_part = &pi->parts[dst]; > +    if (READ_ONCE(dst_part->mgrt_on)) > +        return 0; > + > +    /* > +     * The source partition number can be -1 > +     * when from cpus that out of any partitions. > +     */ > +    if (src != -1) { > +        src_part = &pi->parts[src]; > +        if (READ_ONCE(src_part->mgrt_on)) > +            return 0; > +    } > + > +    if (xchg(&dst_part->mgrt_on, 1)) > +        return 0; > + > +    if (src != -1 && xchg(&src_part->mgrt_on, 1)) { > +        WRITE_ONCE(dst_part->mgrt_on, 0); > +        return 0; > +    } > + > +    return 1; > +} > + > +static inline void part_mgrt_unlock(struct gb_part_info *pi, int src, > int dst) > +{ > +    struct gb_part *src_part, *dst_part; > + > +    if (src != -1) { > +        src_part = &pi->parts[src]; > +        WRITE_ONCE(src_part->mgrt_on, 0); > +    } > + > +    if (dst != -1) { > +        dst_part = &pi->parts[dst]; > +        WRITE_ONCE(dst_part->mgrt_on, 0); > +    } > +} > + > +static u64 cfs_h_load(struct sched_entity *se) > +{ > +    u64 load = se->avg.load_avg; > + > +    se = se->parent; > +    while (se) { > +        struct cfs_rq *cfs_rq = cfs_rq_of(se); > + > +        load = div64_ul(load * se->avg.load_avg, > +                cfs_rq->avg.runnable_avg + 1); > +        se = se->parent; > +    } > + > +    return load; > +} > + > +static u64 tg_load_of_part(struct gb_part_info *pi, struct task_group > *tg, int id) > +{ > +    int i; > +    u64 load = 0; > + > +    for_each_cpu(i, part_cpus(pi, id)) > +        load += cfs_h_load(tg->se[i]); > + > +    return load; > +} > + > +static u64 load_of_part(struct gb_part_info *pi, int id) > +{ > +    int i; > +    u64 load = 0; > + > +    for_each_cpu(i, part_cpus(pi, id)) > +        load += cpu_rq(i)->cfs.avg.load_avg; > + > +    return load; > +} > + > +static u64 cap_of_part(struct gb_part_info *pi, int id) > +{ > +    int i; > +    u64 cap = 0; > + > +    for_each_cpu(i, part_cpus(pi, id)) > +        cap += cpu_rq(i)->cpu_capacity; > + > +    return cap; > +} > + > +static void free_gb_part_info(struct rcu_head *rcu_head) > +{ > +    struct gb_part_info *pi = container_of(rcu_head, struct > gb_part_info, rcu_head); > + > +    kfree(pi); > +} > + > +static inline void predict_load_add(struct gb_part_info *pi, int id, > u64 load) > +{ > +    pi->parts[id].predict_load += load; > +} > + > +static void predict_load_decay(struct gb_part_info *pi) > +{ > +    int i, fact, passed; > + > +    passed = jiffies - global_settle_next + predict_period_stamp; > +    predict_period_stamp = passed % DECAY_PERIOD; > +    fact = passed / DECAY_PERIOD; > + > +    if (!fact) > +        return; > + > +    for_each_part(pi, i) { > +        struct gb_part *part = &pi->parts[i]; > + > +        /* > +         * Decay NICE_0_LOAD into zero after 10 seconds > +         */ > +        if (fact > 10) > +            part->predict_load = 0; > +        else > +            part->predict_load >>= fact; > +    } > +} > + > +static noinline int try_to_settle(struct gb_part_info *pi, struct > task_group *tg) > +{ > +    int i, src, dst, ret; > +    u64 mgrt_load, tg_load, min_load, src_load, dst_load; > + > +    src = dst = -1; > +    min_load = U64_MAX; > +    tg_load = 0; > +    for_each_part(pi, i) { > +        u64 mgrt = 
tg_load_of_part(pi, tg, i); > +        u64 load = load_of_part(pi, i); > + > +        /* load after migration */ > +        if (load > mgrt) > +            load -= mgrt; > +        else > +            load = 0; > + > +        /* > +         * Try to find out the partition contain > +         * minimum load, and the load of this task > +         * group is excluded on comparison. > +         * > +         * This help to prevent that a partition > +         * full of the tasks from this task group was > +         * considered as busy. > +         * > +         * As for the prediction load, the partition > +         * this group preferred will be excluded, since > +         * these prediction load could be introduced by > +         * itself. > +         * > +         * This is not precise, but it serves the idea > +         * to prefer a partition as long as possible, > +         * to save the cost of resettle as much as > +         * possible. > +         */ > +        if (i == tg->gb_prefer) { > +            src = i; > +            src_load = load + mgrt; > +            mgrt_load = mgrt; > +        } else > +            load += pi->parts[i].predict_load; > + > +        if (load < min_load) { > +            dst = i; > +            min_load = load; > +            dst_load = load + mgrt; > +        } > + > +        tg_load += mgrt; > +    } > + > +    if (!tg_load) > +        return 0; > + > +    ret = 0; > +    tg->settle_period *= 2; > +    if (src == -1) { > +        /* First settle */ > +        tg->gb_prefer = dst; > +        predict_load_add(pi, dst, tg_load); > +        ret = 1; > +    } else if (src != dst) { > +        /* Resettle will cost, be careful */ > +        long dst_imb, src_imb, dst_cap, src_cap; > + > +        src_cap = cap_of_part(pi, src); > +        dst_cap = cap_of_part(pi, dst); > + > +        /* > +         * src_load        dst_load > +         * ------------ vs --------- > +         * src_capacity    dst_capacity > +         * > +         * Should not cause further imbalancing after > +         * resettle. > +         */ > +        src_imb = abs(src_load * dst_cap - dst_load * src_cap); > +        dst_imb = abs((src_load - mgrt_load) * dst_cap - (dst_load + > mgrt_load) * src_cap); > + > +        if (dst_imb <= src_imb) { > +            tg->gb_prefer = dst; > +            predict_load_add(pi, dst, tg_load); > +            tg->settle_period = tg_settle_period * 2; > +            ret = 1; > +        } > +    } > + > +    if (tg->settle_period > tg_settle_period_max) > +        tg->settle_period = tg_settle_period_max; > + > +    return ret; > +} > + > +/* > + * group_hot() will tell us which cpu is contained in the > + * preferred CPU partition of the task group of a task. 
> + * > + * return 1 if prefer dst_cpu > + * return 0 if prefer src_cpu > + * return -1 if prefer either or neither > + */ > +int group_hot(struct task_struct *p, int src_cpu, int dst_cpu) > +{ > +    int ret = -1; > +    struct task_group *tg; > +    struct gb_part_info *pi; > + > +    rcu_read_lock(); > + > +    pi = part_info; > +    tg = task_group(p); > +    if (pi && tg->gb_prefer != -1 && > +        ctop(pi, src_cpu) != ctop(pi, dst_cpu)) > +        ret = (tg->gb_prefer == ctop(pi, dst_cpu)); > + > +    rcu_read_unlock(); > +    return ret; > +} > + > +void task_tick_gb(struct rq *rq, struct task_struct *curr) > +{ > +    struct callback_head *work = &curr->gb_work; > +    struct task_group *tg = task_group(curr); > +    struct gb_part_info *pi = part_info; > + > +    if ((curr->flags & (PF_EXITING | PF_KTHREAD))) > +        return; > + > +    if (!tg->gb_period || !pi) > +        return; > + > +    /* Save it when already satisfied */ > +    if (tg->gb_prefer != -1 && > +        tg->gb_prefer == ctop(pi, task_cpu(curr))) > +        return; > + > +    /* Try to settle the active group */ > +    curr->gb_stamp++; > +    if (curr->gb_stamp > tg->gb_period) { > +        curr->gb_stamp = 0; > +        task_work_add(curr, work, TWA_RESUME); > +    } > +} > + > +void group_balancing_work(struct callback_head *work) > +{ > +    int cpu, this_cpu, this_part, this_prefer, best_cpu; > +    struct task_group *this_tg; > +    struct gb_part_info *pi; > +    struct task_struct *best_task; > +    struct cpumask cpus; > + > +    SCHED_WARN_ON(current != container_of(work, struct task_struct, > gb_work)); > + > +    rcu_read_lock(); > + > +    pi = part_info; > +    this_tg = task_group(current); > +    if (!this_tg->gb_period || !pi) { > +        rcu_read_unlock(); > +        return; > +    } > + > +    /* > +     * Settle task group one-by-one help prevent the > +     * situation when multiple group try to settle the > +     * same partition at the same time. > +     * > +     * However, when bunch of groups trying to settle at > +     * the same time, there are no guarantee on the > +     * fairness, some of them may get more chances and > +     * settle sooner than the others. > +     * > +     * So one trick here is to grow the tg_settle_period > +     * of settled group, to make sure they yield the > +     * next chances to others. > +     * > +     * Another trick here is about prediction, as settle > +     * group will followed by bunch of task migration, > +     * the current load of CPU partition can't imply it's > +     * busyness in future, and we may pick a busy one in > +     * the end. > +     * > +     * Thus we maintain the predict load after each settle, > +     * so next try will be able to do the prediction and > +     * avoid to pick those which is already busy enough. 
> +     */ > +    if (spin_trylock(&settle_lock)) { > +        if (time_after(jiffies, global_settle_next) && > +            time_after(jiffies, this_tg->settle_next)) { > +            predict_load_decay(pi); > + > +            global_settle_next = jiffies; > +            if (try_to_settle(pi, this_tg)) > +                global_settle_next += global_settle_period; > + > +            this_tg->settle_next = jiffies + this_tg->settle_period; > +        } > +        spin_unlock(&settle_lock); > +    } > + > +    this_cpu = task_cpu(current); > +    this_prefer = this_tg->gb_prefer; > +    this_part = ctop(pi, this_cpu); > +    if (this_prefer == -1 || > +        this_part == this_prefer || > +        !part_mgrt_lock(pi, this_part, this_prefer)) { > +        rcu_read_unlock(); > +        return; > +    } > + > +    cpumask_copy(&cpus, part_cpus(pi, this_prefer)); > + > +    /* > +     * We arrived here when current task A don't prefer > +     * it's current CPU, but prefer CPUs of partition Y. > +     * > +     * In other word, if task A could run on CPUs of > +     * partition Y, it have good chance to reduce conflict > +     * with the tasks from other groups, and share hot > +     * cache with the tasks from the same group. > +     * > +     * So here is the main logical of group balancer to > +     * achieve it's purpose, make sure groups of tasks > +     * are balanced into groups of CPUs. > +     * > +     * If A's current CPU belong to CPU partition X, > +     * try to find a CPU from partition Y, which is > +     * running a task prefer partition X, and swap them. > +     * > +     * Otherwise, or if can't find such CPU and task, > +     * just find an idle CPU from partition Y, and do > +     * migration. > +     * > +     * Ideally the migration and swap work will finally > +     * put the tasks into right places, but the wakeup > +     * stuff can easily break that by locate an idle CPU > +     * out of the range. > +     * > +     * However, since the whole idea is to gain cache > +     * benefit and reduce conflict between groups, if > +     * there are enough idle CPU out there then every > +     * thing just fine, so let it go. 
> +     */ > + > +    best_cpu = -1; > +    best_task = NULL; > +    for_each_cpu_and(cpu, &cpus, current->cpus_ptr) { > +        struct task_struct *p; > +        struct task_group *tg; > + > +        WARN_ON(cpu == this_cpu); > + > +        if (cpu == smp_processor_id()) > +            continue; > + > +        if (available_idle_cpu(cpu)) { > +            best_cpu = cpu; > +            continue; > +        } > + > +        if (this_part == -1) > +            continue; > + > +        p = rcu_dereference(cpu_rq(cpu)->curr); > +        tg = task_group(p); > + > +        if (!p || (p->flags & PF_EXITING) || is_idle_task(p)) > +            continue; > + > +        if (task_cpu(p) == cpu && > +            this_part == tg->gb_prefer && > +            cpumask_test_cpu(this_cpu, p->cpus_ptr)) { > +            get_task_struct(p); > +            best_task = p; > +            best_cpu = cpu; > +            break; > +        } > +    } > + > +    part_mgrt_unlock(pi, this_part, this_prefer); > + > +    rcu_read_unlock(); > + > +    if (best_task) { > +        migrate_swap(current, best_task, best_cpu, this_cpu); > +        put_task_struct(best_task); > +    } else if (best_cpu != -1) > +        migrate_task_to(current, best_cpu); > +} > + > +void gb_stat_show(struct task_group *tg, struct seq_file *sf) > +{ > +    int i; > +    struct gb_part_info *pi; > + > +    rcu_read_lock(); > + > +    pi = part_info; > +    if (!tg->gb_period || !pi) > +        goto out; > + > +    for_each_part(pi, i) { > +        seq_printf(sf, "PART-%d ", i); > +        seq_printf(sf, "%*pbl ", cpumask_pr_args(part_cpus(pi, i))); > +        seq_printf(sf, "%llu ", tg_load_of_part(pi, tg, i)); > +        seq_printf(sf, "%llu ", load_of_part(pi, i)); > +        if (tg->gb_prefer == i) > +            seq_printf(sf, " *"); > +        seq_putc(sf, '\n'); > +    } > +out: > +    rcu_read_unlock(); > +} > + > +int set_gb_period(struct task_group *tg, unsigned long period_ms) > +{ > +    tg->gb_period = msecs_to_jiffies(period_ms); > + > +    return 0; > +} > + > +unsigned long get_gb_period(struct task_group *tg) > +{ > +    return jiffies_to_msecs(tg->gb_period); > +} > + > +void init_cfs_gb(struct task_group *tg) > +{ > +    tg->gb_prefer = -1; > +    tg->settle_period = tg_settle_period; > +    tg->settle_next = jiffies + tg->settle_period; > +} > + > +static int build_gb_partition(char *buf, struct gb_part_info *pi, int id) > +{ > +    int i, ret; > +    struct cpumask cpus_allowed; > +    struct gb_part *part = &pi->parts[id]; > + > +    ret = cpulist_parse(buf, &cpus_allowed); > +    if (ret || cpumask_empty(&cpus_allowed)) > +        return -1; > + > +    part->id = id; > +    part->mgrt_on = 0; > +    part->predict_load = 0; > +    cpumask_copy(&part->cpus, &cpus_allowed); > + > +    for_each_cpu(i, &part->cpus) > +        pi->ctop[i] = id; > + > +    return 0; > +} > + > +static inline int start_group_balancer(void) > +{ > +    mutex_lock(&gb_mutex); > +    if (!part_info) { > +        mutex_unlock(&gb_mutex); > +        return -1; > +    } > + > +    static_branch_enable(&sched_group_balancer); > + > +    mutex_unlock(&gb_mutex); > +    return 0; > +} > + > +static int gb_prefer_reset(struct task_group *tg, void *data) > +{ > +    tg->gb_prefer = -1; > + > +    return 0; > +} > + > +static inline void stop_group_balancer(void) > +{ > +    bool reset = false; > + > +    mutex_lock(&gb_mutex); > +    if (gb_enabled()) { > +        reset = true; > +        static_branch_disable(&sched_group_balancer); > +    } > +    
mutex_unlock(&gb_mutex); > + > +    if (!reset) > +        return; > + > +    synchronize_rcu(); > + > +    rcu_read_lock(); > +    walk_tg_tree(gb_prefer_reset, tg_nop, NULL); > +    rcu_read_unlock(); > +} > + > +static ssize_t gb_write(struct file *file, const char __user *buffer, > +               size_t count, loff_t *ppos) > +{ > +    int id; > +    size_t ret; > +    unsigned long val; > +    char *kbuf, *start, *end; > +    struct gb_part_info *new, *old; > + > +    if (count < 2) > +        return -EINVAL; > + > +    ret = -ENOMEM; > +    kbuf = kzalloc(count + 2, GFP_KERNEL); > +    if (!kbuf) > +        return ret; > + > +    new = kzalloc(sizeof(*new), GFP_KERNEL); > +    if (!new) > +        goto out; > + > +    if (copy_from_user(kbuf, buffer, count)) { > +        ret = -EFAULT; > +        goto out; > +    } > + > +    kbuf[count - 1] = ';'; > +    kbuf[count] = '\0'; > + > +    ret = count; > +    if (!strcmp(kbuf, "enable;")) { > +        if (start_group_balancer()) > +            ret = -EINVAL; > +        goto out; > +    } else if (!strcmp(kbuf, "disable;")) { > +        stop_group_balancer(); > +        goto out; > +    } else if (sscanf(kbuf, "tg_settle_period=%lu;", &val) == 1) { > +        val = msecs_to_jiffies(val); > +        if (val > tg_settle_period_max) > +            ret = -EINVAL; > +        else > +            tg_settle_period = val; > +        goto out; > +    } else if (sscanf(kbuf, "tg_settle_period_max=%lu;", &val) == 1) { > +        val = msecs_to_jiffies(val); > +        if (val < tg_settle_period) > +            ret = -EINVAL; > +        else > +            tg_settle_period_max = val; > +        goto out; > +    } else if (sscanf(kbuf, "global_settle_period=%lu;", &val) == 1) { > +        global_settle_period = msecs_to_jiffies(val); > +        goto out; > +    } > + > +    if (gb_enabled()) { > +        ret = -EINVAL; > +        goto out; > +    } > + > +    memset(new->ctop, -1, sizeof(int) * num_possible_cpus()); > + > +    id = 0; > +    start = kbuf; > +    end = strchr(start, ';'); > + > +    while (end && *end != '\0') { > +        *end = '\0'; > + > +        if (new->nr_part >= CPU_PART_MAX) > +            goto out; > + > +        if (!build_gb_partition(start, new, id)) { > +            id++; > +            new->nr_part++; > +        } > + > +        start = end + 1; > +        end = strchr(start, ';'); > +    }; > + > +    if (new->nr_part < 2) > +        goto out; > + > +    mutex_lock(&gb_mutex); > +    old = part_info; > +    part_info = new; > +    mutex_unlock(&gb_mutex); > + > +    if (old) > +        call_rcu(&old->rcu_head, free_gb_part_info); > + > +    kfree(kbuf); > +    return count; > + > +out: > +    kfree(kbuf); > +    kfree(new); > +    return ret; > +} > + > +static void *gb_start(struct seq_file *s, loff_t *pos) > +{ > +    mutex_lock(&gb_mutex); > +    if (!part_info || *pos != 0) > +        return NULL; > + > +    return (void *)&part_info->parts[0]; > +} > + > +static void *gb_next(struct seq_file *s, void *p, loff_t *pos) > +{ > +    if (++(*pos) >= part_info->nr_part) > +        return NULL; > + > +    return (void *)&part_info->parts[*pos]; > +} > + > +static void gb_stop(struct seq_file *s, void *p) > +{ > +    mutex_unlock(&gb_mutex); > +} > + > +static int gb_show(struct seq_file *s, void *p) > +{ > +    struct gb_part *part = (struct gb_part *)p; > +    u64 load = load_of_part(part_info, part->id); > + > +    if (part->id == 0) { > +        seq_printf(s, "Group Balancer %s\n", > +                gb_enabled() ? 
"Enabled" : "Disabled"); > +        seq_printf(s, "Group Settle Period %u\n", > +                jiffies_to_msecs(tg_settle_period)); > +        seq_printf(s, "Group Settle Period Maximum %u\n", > +                jiffies_to_msecs(tg_settle_period_max)); > +        seq_printf(s, "Global Settle Period %u\n", > +                jiffies_to_msecs(global_settle_period)); > +    } > + > +    seq_printf(s, "PART-%d: ", part->id); > +    seq_printf(s, "%*pbl ", cpumask_pr_args(&part->cpus)); > +    seq_printf(s, "load=%llu ", load); > +    seq_putc(s, '\n'); > + > +    return 0; > +} > + > +static const struct seq_operations gb_op = { > +    .start = gb_start, > +    .next = gb_next, > +    .stop = gb_stop, > +    .show = gb_show, > +}; > + > +static int gb_open(struct inode *inode, struct file *file) > +{ > +    return seq_open(file, &gb_op); > +} > + > +static const struct proc_ops gb_proc_ops = { > +    .proc_flags    = PROC_ENTRY_PERMANENT, > +    .proc_open    = gb_open, > +    .proc_read    = seq_read, > +    .proc_write    = gb_write, > +    .proc_lseek    = seq_lseek, > +    .proc_release    = seq_release, > +}; > + > +static int __init gb_init(void) > +{ > +    root_task_group.gb_period = 0; > +    root_task_group.gb_prefer = -1; > + > +    proc_create("gb_ctrl", 0600, NULL, &gb_proc_ops); > + > +    return 0; > +} > +core_initcall(gb_init); > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index de53be9..d80f872 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -440,6 +440,12 @@ struct task_group { >      struct uclamp_se    uclamp[UCLAMP_CNT]; >  #endif > > +#ifdef CONFIG_GROUP_BALANCER > +    int            gb_prefer; > +    unsigned long        gb_period; > +    unsigned long        settle_period; > +    unsigned long        settle_next; > +#endif >  }; > >  #ifdef CONFIG_FAIR_GROUP_SCHED > @@ -3118,3 +3124,21 @@ static inline bool is_per_cpu_kthread(struct > task_struct *p) >  extern void sched_dynamic_update(int mode); >  #endif > > +#ifdef CONFIG_GROUP_BALANCER > +extern struct static_key_false sched_group_balancer; > +#define gb_enabled() static_branch_unlikely(&sched_group_balancer) > +extern void init_cfs_gb(struct task_group *tg); > +extern int group_hot(struct task_struct *p, int src_cpu, int dst_cpu); > +extern void task_tick_gb(struct rq *rq, struct task_struct *curr); > +extern void group_balancing_work(struct callback_head *work); > + > +extern void gb_stat_show(struct task_group *tg, struct seq_file *sf); > +extern int set_gb_period(struct task_group *tg, unsigned long period_ms); > +extern unsigned long get_gb_period(struct task_group *tg); > +#else > +#define gb_enabled() (false) > +static inline void init_cfs_gb(struct task_group *tg) { return 0; } > +static inline int group_hot(struct task_struct *p, int src_cpu, int > dst_cpu) { return -1; } > +static inline void task_tick_gb(struct rq *rq, struct task_struct > *curr) {}; > +static inline void group_balancing_work(struct callback_head *work) {}; > +#endif