Received-SPF: pass (google.com: domain of linux-kernel+bounces-449-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1;
From: Imran Khan <imran.f.khan@oracle.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
        vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
        rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
        bristot@redhat.com, vschneid@redhat.com
Cc: linux-kernel@vger.kernel.org
Subject: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.
Date: Fri, 15 Dec 2023 16:26:52 +1100
Message-Id: <20231215052652.917741-1-imran.f.khan@oracle.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

It has been found that sometimes a task_group has some residual
load_avg even though the load average at each of its owned queues
i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
tg_load_avg_contrib have become 0 for a long time.
Under this scenario if another task starts running in this task_group,
it does not get proper time share on CPU since pre-existing
load average of task group inversely impacts the new task's CPU share
on each CPU.

This change looks for the condition when a task_group has no running
tasks and sets the task_group's load average to 0 in such cases, so
that tasks that run in future under this task_group get the CPU time
in accordance with the current load.

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
---

So far I have encountered only one way of reproducing this issue which is
as follows:

1. Take a kvm guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and 
create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/
cpu/test_group_2

2. Assign a CPU intensive workload to each of these cgroups and start the
workload.

For my tests I am using following app:

int main(int argc, char *argv[])
{
        unsigned long count, i, val;
        if (argc != 2) {
              printf("usage: ./a.out <number of random nums to generate> \n");
              return 0;
        }

        count = strtoul(argv[1], NULL, 10);

        printf("Generating %lu random numbers \n", count);
        for (i = 0; i < count; i++) {
                val = rand();
                val = val % 2;
                //usleep(1);
        }
        printf("Generated %lu random numbers \n", count);
        return 0;
}
Also since the system is booted with 4 CPUs, in order to completely load the
system I am also launching 4 instances of same test app under:
/sys/fs/cgroup/cpu/

3. We can see that both of the cgroups get similar CPU time:

# systemd-cgtop --depth 1
Path                                 Tasks    %CPU  Memory  Input/s    Output/s
/                                      659      -     5.5G        -        -
/system.slice                            -      -     5.7G        -        -
/test_group_1                            4      -        -        -        -
/test_group_2                            3      -        -        -        -
/user.slice                             31      -    56.5M        -        -

Path                                 Tasks   %CPU   Memory  Input/s    Output/s
/                                      659  394.6     5.5G        -        -
/test_group_2                            3   65.7        -        -        -
/user.slice                             29   55.1    48.0M        -        -
/test_group_1                            4   47.3        -        -        -
/system.slice                            -    2.2     5.7G        -        -

Path                                 Tasks  %CPU    Memory  Input/s    Output/s
/                                      659  394.8     5.5G        -        -
/test_group_1                            4   62.9        -        -        -
/user.slice                             28   44.9    54.2M        -        -
/test_group_2                            3   44.7        -        -        -
/system.slice                            -    0.9     5.7G        -        -

Path                                 Tasks  %CPU    Memory  Input/s     Output/s
/                                      659  394.4     5.5G        -        -
/test_group_2                            3   58.8        -        -        -
/test_group_1                            4   51.9        -        -        -
/user.slice                              30   39.3    59.6M        -        -
/system.slice                            -    1.9     5.7G        -        -

Path                                 Tasks  %CPU     Memory  Input/s    Output/s
/                                      659  394.7     5.5G        -        -
/test_group_1                            4   60.9        -        -        -
/test_group_2                            3   57.9        -        -        -
/user.slice                             28   43.5    36.9M        -        -
/system.slice                            -    3.0     5.7G        -        -

Path                                 Tasks  %CPU     Memory  Input/s     Output/s
/                                      659  395.0     5.5G        -        -
/test_group_1                            4   66.8        -        -        -
/test_group_2                            3   56.3        -        -        -
/user.slice                             29   43.1    51.8M        -        -
/system.slice                            -    0.7     5.7G        -        -

4. Now move systemd-udevd to one of these test groups, say test_group_1, and 
perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
host side.

5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
and one instance of  CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
/sys/fs/cgroup/test_group_2.
It is seen that test_group_1 (the one where systemd-udevd was moved) is getting
much less CPU than the test_group_2, even though at this point of time both of
these groups have only CPU hogger running.

#systemd-cgtop --depth 1
Path                                   Tasks   %CPU   Memory  Input/s   Output/s
/                                      1219     -     5.4G        -        -
/system.slice                           -       -     5.6G        -        -
/test_group_1                           4       -        -        -        -
/test_group_2                           3       -        -        -        -
/user.slice                            26       -    91.3M        -        -

Path                                   Tasks  %CPU     Memory  Input/s   Output/s
/                                      1221  394.3     5.4G        -        -
/test_group_2                             3   82.7        -        -        -
/test_group_1                             4   14.3        -        -        -
/system.slice                             -    0.8     5.6G        -        -
/user.slice                              26    0.4    91.2M        -        -

Path                                   Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  394.6     5.4G        -        -
/test_group_2                             3   67.4        -        -        -
/system.slice                             -   24.6     5.6G        -        -
/test_group_1                             4   12.5        -        -        -
/user.slice                              26    0.4    91.2M        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                     1221  395.2     5.4G        -        -
/test_group_2                            3   60.9        -        -        -
/system.slice                            -   27.9     5.6G        -        -
/test_group_1                            4   12.2        -        -        -
/user.slice                             26    0.4    91.2M        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                     1221  395.2     5.4G        -        -
/test_group_2                            3   69.4        -        -        -
/test_group_1                            4   13.9        -        -        -
/user.slice                             28    1.6    92.0M        -        -
/system.slice                            -    1.0     5.6G        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  395.6     5.4G        -        -
/test_group_2                             3   59.3        -        -        -
/test_group_1                             4   14.1        -        -        -
/user.slice                              28    1.3    92.2M        -        -
/system.slice                             -    0.7     5.6G        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  395.5     5.4G        -        -
/test_group_2                            3   67.2        -        -        -
/test_group_1                            4   11.5        -        -        -
/user.slice                             28    1.3    92.5M        -        -
/system.slice                            -    0.6     5.6G        -        -

Path                                  Tasks  %CPU    Memory  Input/s    Output/s
/                                      1221  395.1     5.4G        -        -
/test_group_2                             3   76.8        -        -        -
/test_group_1                             4   12.9        -        -        -
/user.slice                              28    1.3    92.8M        -        -
/system.slice                             -    1.2     5.6G        -        -

From sched_debug data it is seen that in bad case the load.weight of per-cpu
sched entities corresponding to test_group_1 has reduced significantly and
also load_avg of test_group_1 remains much higher than that of test_group_2,
even though systemd-udevd stopped running long time back and at this point of
time both cgroups just have the cpu hogger app as running entity. 

I put some traces in update_tg_load_avg to check why load_avg was much higher
for test_group_1 and from there, I saw that even after systemd-udevd has
stopped and we are yet to launch the CPU hoggers, test_group_1 has got some
load average like shown in following snippet (I have changed the format of 
trace to limit the number of columns here):

  <idle>-0   [000] d.s.  1824.372593: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.380572: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1 
  <idle>-0   [000] d.s.  1824.380573: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.384563: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.384563: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.391564: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.391564: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.398569: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.398569: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
.....................
.....................
  xfsaild/dm-3-1623    [001] d... 1824.569615: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  xfsaild/dm-3-1623    [001] d... 1824.569617: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
...................
...................
  sh-195050  [000] d.s.  1824.756563: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.757570: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.757571: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.761566: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.761566: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.765567: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.765568: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
  <idle>-0   [000] d.s.  1824.769562: update_tg_load_avg.constprop.124:
	 tg_path = /test_group_1
  <idle>-0   [000] d.s.  1824.769563: update_tg_load_avg.constprop.124:
	 cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664

Now when CPU hogger is launched again in test_group_1, this pre-existing
tg->load_avg gets added to load contribution of cpu hogger and results in a
situation where both cgroups are running the same work load but very different
task_group.load_avg which in turn results in these cgroups getting very
different amount of time on CPU.

Apologies for this long description of the issue, but I have found only one way
of reproducing this issue and using this way issue can be reproduced with 100%
strike rate, so I thought of providing as much info as possible.

I have kept the change as RFC as I was not sure if the load_avg should be
dragged down to 0 in one go (as being done in this change) or should it be
pulled back in steps like making it half of the previous value each time.


 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..c34d9e7abb96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3572,6 +3572,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 		account_numa_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
+		atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
 	}
 #endif
 	cfs_rq->nr_running++;
@@ -3587,6 +3588,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+		atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
 	}
 #endif
 	cfs_rq->nr_running--;
@@ -4104,6 +4106,20 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
+	/*
+	 * If number of running tasks for a task_group is 0, pull back its
+	 * load_avg to 0 as well.
+	 * This avoids the situation where a freshly forked task in a cgroup,
+	 * with some residual load_avg but with no running tasks, gets less
+	 * CPU time because of pre-existing load_avg of task_group.
+	 */
+	if (!atomic_long_read(&cfs_rq->tg->cfs_nr_running)
+	    && atomic_long_read(&cfs_rq->tg->load_avg)) {
+		atomic_long_set(&cfs_rq->tg->load_avg, 0);
+		cfs_rq->last_update_tg_load_avg = now;
+		return;
+	}
+
 	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
@@ -12808,6 +12824,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	atomic_long_set(&tg->cfs_nr_running, 0);
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..c3390bdb8400 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -370,6 +370,7 @@ struct task_group {
 	 */
 	atomic_long_t		load_avg ____cacheline_aligned;
 #endif
+	atomic_long_t           cfs_nr_running ____cacheline_aligned;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.34.1