2023-07-17 10:10:43

by Hao Jia

[permalink] [raw]
Subject: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants

Now the member variable bstat of the structure cgroup_rstat_cpu
records the per-cpu time of the cgroup itself, but does not
include the per-cpu time of its descendants. The per-cpu time
including descendants is very useful for calculating the
per-cpu usage of cgroups.

Although we can indirectly obtain the total per-cpu time
of the cgroup and its descendants by accumulating the per-cpu
bstat of each descendant of the cgroup. But after a child cgroup
is removed, we will lose its bstat information. This will cause
the cumulative value to be non-monotonic, thus affecting
the accuracy of cgroup per-cpu usage.

So we add the cumul_bstat variable to record the total
per-cpu time of this cgroup and its descendants, which is
similar to "cpuacct.usage*" in cgroup v1. And this is
also helpful for the migration from cgroup v1 to cgroup v2.
After adding this variable, we can obtain the per-cpu time of
cgroup and its descendants in user mode through eBPF, etc.

Signed-off-by: Hao Jia <[email protected]>
---
include/linux/cgroup-defs.h | 3 +++
kernel/cgroup/rstat.c | 4 ++++
2 files changed, 7 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8a0d5466c7be..25114d62089e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -341,6 +341,9 @@ struct cgroup_rstat_cpu {
*/
struct cgroup_base_stat last_bstat;

+ /* Record the cumulative per-cpu time of cgroup and its descendants */
+ struct cgroup_base_stat cumul_bstat;
+
/*
* Child cgroups with stat updates on this cpu since the last read
* are linked on the parent's ->updated_children through
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 2542c21b6b6d..7c0ddd2e90d7 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -361,11 +361,15 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
cgroup_base_stat_add(&cgrp->bstat, &delta);
cgroup_base_stat_add(&rstatc->last_bstat, &delta);
+ cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);

/* propagate global delta to parent (unless that's root) */
if (cgroup_parent(parent)) {
delta = cgrp->bstat;
+ rstatc = cgroup_rstat_cpu(parent, cpu);
+
cgroup_base_stat_sub(&delta, &cgrp->last_bstat);
+ cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);
cgroup_base_stat_add(&parent->bstat, &delta);
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
}
--
2.37.0



2023-07-17 19:56:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants

Hello,

On Mon, Jul 17, 2023 at 05:36:12PM +0800, Hao Jia wrote:
> Now the member variable bstat of the structure cgroup_rstat_cpu

You said "now" indicating that the behavior has changed recently but I don't
see what changed there. Can you elaborate?

> records the per-cpu time of the cgroup itself, but does not
> include the per-cpu time of its descendants. The per-cpu time

It does. The per-cpu delta is added to its parent and then that will in turn
be used to propagate to its parent.

> including descendants is very useful for calculating the
> per-cpu usage of cgroups.
>
> Although we can indirectly obtain the total per-cpu time
> of the cgroup and its descendants by accumulating the per-cpu
> bstat of each descendant of the cgroup. But after a child cgroup
> is removed, we will lose its bstat information. This will cause
> the cumulative value to be non-monotonic, thus affecting
> the accuracy of cgroup per-cpu usage.
>
> So we add the cumul_bstat variable to record the total
> per-cpu time of this cgroup and its descendants, which is
> similar to "cpuacct.usage*" in cgroup v1. And this is
> also helpful for the migration from cgroup v1 to cgroup v2.
> After adding this variable, we can obtain the per-cpu time of
> cgroup and its descendants in user mode through eBPF, etc.

I think you're misunderstanding how the code works. Can you please double
check?

Thanks.

--
tejun

2023-07-18 10:30:21

by Hao Jia

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants


Hello,
On 2023/7/18 Tejun Heo wrote:

> On Mon, Jul 17, 2023 at 05:36:12PM +0800, Hao Jia wrote:
>> Now the member variable bstat of the structure cgroup_rstat_cpu
>
> You said "now" indicating that the behavior has changed recently but I don't
> see what changed there. Can you elaborate?

Thank you for your review, sorry for confusing you with my expression,
it is true that it has not changed, bstat has always recorded the
per-cpu time of cgroup itself.

>
>> records the per-cpu time of the cgroup itself, but does not
>> include the per-cpu time of its descendants. The per-cpu time
>
> It does. The per-cpu delta is added to its parent and then that will in turn
> be used to propagate to its parent.
>Yes, so only cgrp->bstat contains the cpu time of the cgroup and its
descendants. rstatc->bstat only records the per-cpu time of cgroup itself.

>
> I think you're misunderstanding how the code works. Can you please double
> check?

--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -361,11 +361,15 @@ static void cgroup_base_stat_flush(struct cgroup
*cgrp, int cpu)
cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
cgroup_base_stat_add(&cgrp->bstat, &delta);
cgroup_base_stat_add(&rstatc->last_bstat, &delta);

We add the current cgroup's per-cpu delta to its per-cpu variable
(cumul_bstat).

+ cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);

/* propagate global delta to parent (unless that's root) */
if (cgroup_parent(parent)) {
delta = cgrp->bstat;
+ rstatc = cgroup_rstat_cpu(parent, cpu);
+
cgroup_base_stat_sub(&delta, &cgrp->last_bstat);

Since we hold the global cgroup_rstat_lock and disable interrupts in
cgroup_base_stat_flush() and we only update cgrp->bstat here.
So the global delta at this time is equal to the increment of this
cgroup and its descendants on this cpu (unless root cgroup), so we can
add the global delta to the cumul_bstat of its parent and propagate it
upward.

+ cgroup_base_stat_add(&rstatc->cumul_bstat, &delta);
cgroup_base_stat_add(&parent->bstat, &delta);
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
}


I wrote a kernel module to verify and test the above kernel code,
The kernel module adds a cpu.stat_percpu_all file for each cgroup, which
can output the per-cpu time information of the cgroup and its
descendants calculated in two ways:
1. Cumulatively add the per-cpu bstat of cgroup and each of its descendants.
2. Directly output @cumul_bstat read in the kernel (patch has been applied).
When the child cgroup is not removed, the results calculated by the two
methods should be equal.

NOTE: We need to flush the data before "cat cpu.stat_percpu_all", such
as "cat cpu.stat".

Code link:
https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main

The testing procedure is in README.

I installed the 6.5.0-rc1 kernel on my machine (an Intel Xeon(R)
Platinum 8260 machine 96 CPUs in total.) and did some tests.
So far I haven't found anything unusual, if I'm wrong, please point it
out, it's very helpful to me, thanks again!

Thanks,
Hao

2023-07-18 22:01:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants

On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
> https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main

Isn't that just adding the same numbers twice and verifying that? Maybe I'm
misunderstanding you. Here's a simpler case:

# cd /sys/fs/cgroup
# mkdir -p asdf/test0
# grep usage_usec asdf/test0/cpu.stat
usage_usec 0
# echo $$ > asdf/test0/cgroup.procs
# stress -c 1 & sleep 1; kill %%
[1] 122329
stress: info: [122329] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
# grep usage_usec asdf/test0/cpu.stat
usage_usec 1000956
[1]+ Terminated stress -c 1
# grep usage_usec asdf/cpu.stat
usage_usec 1002548
# echo $$ > /sys/fs/cgroup/cgroup.procs
# rmdir asdf/test0
# grep usage_usec asdf/cpu.stat
usage_usec 1006338

So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
asdf/cpu.stat correctly reports the cumulative usage. After removing
asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?
What are you adding?

Thanks.

--
tejun

2023-07-19 03:10:24

by Hao Jia

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants



On 2023/7/19 Tejun Heo wrote:
> On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
>> https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
>
> So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
> asdf/cpu.stat correctly reports the cumulative usage. After removing
> asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?

Sorry, some of my expressions may have misled you.

Yes, cpu.stat will display the cumulative **global** cpu time of the
cgroup and its descendants (the corresponding kernel variable is
"cgrp->bstat"), and it will not be lost when the child cgroup is removed.

Similarly, we need a **per-cpu** variable to record the accumulated
per-cpu time of cgroup and its descendants.
The existing kernel variable "cgroup_rstat_cpu(cgrp, cpu)->bstat" is not
satisfied, it only records the per-cpu time of cgroup itself,
So I try to add "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" to record
per-cpu time of cgroup and its descendants.

In order to verify the correctness of my patch, I wrote a kernel module
to compare the results of calculating the per-cpu time of cgroup and its
descendants in two ways:
Method 1. Traverse and add the per-cpu rstatc->bstat of cgroup and
each of its descendants.
Method 2. Directly read "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" in
the kernel.

When the child cgroup is not removed, the results calculated by the two
methods should be equal.

> What are you adding?
I want to add a **per-cpu variable** to record the cumulative per-cpu
time of cgroup and its descendants, which is similar to the variable
"cgrp->bstat", but it is a per-cpu variable.
It is very useful and convenient for calculating the usage of cgroup on
each cpu, and its behavior is similar to the "cpuacct.usage*" interface
of cgroup v1.

Thanks,
Hao

2023-07-27 12:47:44

by Hao Jia

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants



On 2023/7/19 Hao Jia wrote:
>
>
> On 2023/7/19 Tejun Heo wrote:
>> On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
>>> https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
>>
>> So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
>> asdf/cpu.stat correctly reports the cumulative usage. After removing
>> asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?
>
> Sorry, some of my expressions may have misled you.
>
> Yes, cpu.stat will display the cumulative **global** cpu time of the
> cgroup and its descendants (the corresponding kernel variable is
> "cgrp->bstat"), and it will not be lost when the child cgroup is removed.
>
> Similarly, we need a **per-cpu** variable to record the accumulated
> per-cpu time of cgroup and its descendants.
> The existing kernel variable "cgroup_rstat_cpu(cgrp, cpu)->bstat" is not
> satisfied, it only records the per-cpu time of cgroup itself,
> So I try to add "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" to record
> per-cpu time of cgroup and its descendants.
>
> In order to verify the correctness of my patch, I wrote a kernel module
> to compare the results of calculating the per-cpu time of cgroup and its
> descendants in two ways:
>   Method 1. Traverse and add the per-cpu rstatc->bstat of cgroup and
> each of its descendants.
>   Method 2. Directly read "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" in
> the kernel.
>
> When the child cgroup is not removed, the results calculated by the two
> methods should be equal.
>
>> What are you adding?
> I want to add a **per-cpu variable** to record the cumulative per-cpu
> time of cgroup and its descendants, which is similar to the variable
> "cgrp->bstat", but it is a per-cpu variable.
> It is very useful and convenient for calculating the usage of cgroup on
> each cpu, and its behavior is similar to the "cpuacct.usage*" interface
> of cgroup v1.
>

Hello Tejun,

I don't know if I explained it clearly, and do you understand what I mean?

Would you mind adding a variable like this to facilitate per-cpu usage
calculations and migration from cgroup v1 to cgroup v2?

Thanks,
Hao

2023-07-27 18:11:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants

On Thu, Jul 27, 2023 at 08:05:44PM +0800, Hao Jia wrote:
>
>
> On 2023/7/19 Hao Jia wrote:
> >
> >
> > On 2023/7/19 Tejun Heo wrote:
> > > On Tue, Jul 18, 2023 at 06:08:50PM +0800, Hao Jia wrote:
> > > > https://github.com/jiaozhouxiaojia/cgv2-stat-percpu_test/tree/main
> > >
> > > So, we run `stress -c 1` for 1 second in the asdf/test0 cgroup and
> > > asdf/cpu.stat correctly reports the cumulative usage. After removing
> > > asdf/test0 cgroup, asdf's usage_usec is still there. What's missing here?
> >
> > Sorry, some of my expressions may have misled you.
> >
> > Yes, cpu.stat will display the cumulative **global** cpu time of the
> > cgroup and its descendants (the corresponding kernel variable is
> > "cgrp->bstat"), and it will not be lost when the child cgroup is
> > removed.
> >
> > Similarly, we need a **per-cpu** variable to record the accumulated
> > per-cpu time of cgroup and its descendants.
> > The existing kernel variable "cgroup_rstat_cpu(cgrp, cpu)->bstat" is not
> > satisfied, it only records the per-cpu time of cgroup itself,
> > So I try to add "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" to record
> > per-cpu time of cgroup and its descendants.
> >
> > In order to verify the correctness of my patch, I wrote a kernel module
> > to compare the results of calculating the per-cpu time of cgroup and its
> > descendants in two ways:
> > ? Method 1. Traverse and add the per-cpu rstatc->bstat of cgroup and
> > each of its descendants.
> > ? Method 2. Directly read "cgroup_rstat_cpu(cgrp, cpu)->cumul_bstat" in
> > the kernel.
> >
> > When the child cgroup is not removed, the results calculated by the two
> > methods should be equal.
> >
> > > What are you adding?
> > I want to add a **per-cpu variable** to record the cumulative per-cpu
> > time of cgroup and its descendants, which is similar to the variable
> > "cgrp->bstat", but it is a per-cpu variable.
> > It is very useful and convenient for calculating the usage of cgroup on
> > each cpu, and its behavior is similar to the "cpuacct.usage*" interface
> > of cgroup v1.
> >
>
> Hello Tejun,
>
> I don't know if I explained it clearly, and do you understand what I mean?
>
> Would you mind adding a variable like this to facilitate per-cpu usage
> calculations and migration from cgroup v1 to cgroup v2?

Oh yeah, I do. I'm just thinking whether we also want to expose that in the
cgroupfs. We are currently not showing anything per-cpu and the output
formatting gets nasty with a huge number of CPUs, so maybe that's not going
to work out all that well. Anyways, I'll get back to you next week.

Thanks.

--
tejun

2023-08-02 22:21:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants

Hello,

On Thu, Jul 27, 2023 at 07:44:28AM -1000, Tejun Heo wrote:
> Oh yeah, I do. I'm just thinking whether we also want to expose that in the
> cgroupfs. We are currently not showing anything per-cpu and the output
> formatting gets nasty with a huge number of CPUs, so maybe that's not going
> to work out all that well. Anyways, I'll get back to you next week.

I couldn't come up with an answer. Let's go ahead with adding the field but
can you please do the followings?

* Name it to something like subtree_bstat instead of cumul_bstat. The
counters are all cumulative.

* Are you sure the upward propagation logic is correct? It's calculating
global delta and then propagating to the per-cpu delta of the parent. Is
that correct because the two delta calculations always end up the same?

* Please add a comment explaining that the field is not currently used
outside of being read from bpf / drgn and what not and that we're still
trying to determine how to expose that in the cgroupfs interface.

Thanks.

--
tejun

2023-08-07 05:20:46

by Hao Jia

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants



On 2023/8/3 Tejun Heo wrote:
> I couldn't come up with an answer. Let's go ahead with adding the field but
> can you please do the followings?
>

Thank you for your suggestion. I am very willing to do it.

> * Name it to something like subtree_bstat instead of cumul_bstat. The
> counters are all cumulative.

I did it in v2 patch.

>
> * Are you sure the upward propagation logic is correct? It's calculating
> global delta and then propagating to the per-cpu delta of the parent. Is
> that correct because the two delta calculations always end up the same?

Sorry, I made a mistake and misled you. These two deltas are not always
equal.

I found and reproduced a bug case:
We build /sys/fs/cgroup/test /sys/fs/cgroup/test/t1,
/sys/fs/cgroup/test/t1/tt1 in turn.
And there are only tasks in /sys/fs/cgroup/test/t1/tt1. After applying
this patch, some operations and corresponding data changes are as follows:

Step 1、cat /sys/fs/cgroup/test/t1/tt1/cpu.stat
*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cgrp->bstat utime 0 stime 0 sum_exec_runtime 257341
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 0 sum_exec_runtime 257341
parent(/test/t1) ->bstat utime 0 stime 0 sum_exec_runtime 257341
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 0 sum_exec_runtime 257341
}


Step 2、cat /sys/fs/cgroup/test/t1/tt1/cpu.stat
*cpu 12* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cumul_bstat utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1004383
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 1000000 sum_exec_runtime 747042
parent(/test/t1) ->bstat utime 0 stime 1000000 sum_exec_runtime 1004383
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 1000000 sum_exec_runtime
747042
}


Step 3、cat /sys/fs/cgroup/test/cpu.stat
(cgroup fulsh /test/t1/tt1 -> /test/t1 -> /test in turn)

*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 0 sum_exec_runtime 263468
/test/t1/tt1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 520809
/test/t1/tt1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 0 sum_exec_runtime 263468
parent(/test/t1) ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 0 sum_exec_runtime 520809
}

*cpu 6* current flush cgroup /test/t1
per-cpu delta utime 0 stime 0 sum_exec_runtime 0
/test/t1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 520809
/test/t1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 1000000 sum_exec_runtime 1267851 <---
parent(/test) ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
parent(/test) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test) cumul_bstat (cpu 6) utime 0 stime 1000000
sum_exec_runtime 1267851 <--- *error*
******
Here cgrp delta is *not equal* to per-cpu delta.
The frequency of cgroup (/test) and its chiled cgroup (/test/t1/tt1)
flush is inconsistent.
In other words (when we call cgroup_base_stat_flush(), we will not
necessarily flush to the highest-level cgroup except root(like step 1
and 2 above)).
Therefore, cgrp delta may contain the cumulative value of multiple
per-cpu deltas.

The correct value of parent(/test) cumul_bstat should be utime 0
stime 0 sum_exec_runtime 520809.
******
}

*cpu 6* current flush cgroup /test
per-cpu delta utime 0 stime 0 sum_exec_runtime 0
cumul_bstat utime 0 stime 1000000 sum_exec_runtime 1267851
/test ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
cgroup_parent(parent) is NULL end.


So we should add a per-cpu variable subtree_last_bstat similar to
cgrp->last_bstat to record the last value.

I have sent v2 patch, please review it again.
v2 link:
https://lore.kernel.org/all/[email protected]


> * Please add a comment explaining that the field is not currently used
> outside of being read from bpf / drgn and what not and that we're still
> trying to determine how to expose that in the cgroupfs interface.


Thanks, I did it in v2 patch.

Thanks,
Hao