2023-06-30 23:56:11

by Ivan Babrou

Subject: Expensive memory.stat + cpu.stat reads

Hello,

We're seeing CPU load issues with cgroup stats retrieval. I made a
public gist with all the details, including the repro code (which
unfortunately requires heavily loaded hardware) and some flamegraphs:

* https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13

I'll repeat the gist of that gist here. Our repro has the following
output after a warm-up run:

completed: 5.17s [manual / mem-stat + cpu-stat]
completed: 5.59s [manual / cpu-stat + mem-stat]
completed: 0.52s [manual / mem-stat]
completed: 0.04s [manual / cpu-stat]

The first two lines do effectively the following:

for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat \
    /sys/fs/cgroup/system.slice/cpu.stat > /dev/null; done

The latter two are the same thing, but via two loops:

for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat \
    > /dev/null; done
for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat \
    > /dev/null; done

As you might've noticed from the output, splitting the loop into two
makes the code run 10x faster. This isn't great, because most
monitoring software likes to get all stats for one service before
reading the stats for the next one, which maps to the slow and
expensive way of doing this.
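
For illustration, that per-service pattern looks roughly like the
following sketch (service paths are just an example of what an agent
would iterate over):

for cg in /sys/fs/cgroup/system.slice/*.service; do
    # read both stat files for one cgroup back to back, then move on
    cat "$cg/memory.stat" "$cg/cpu.stat" > /dev/null
done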

We're running Linux v6.1 (the output is from v6.1.25) with no patches
that touch the cgroup or mm subsystems, so you can assume a vanilla
kernel.

From the flamegraph it just looks like rstat flushing takes longer. I
used the following flags on an AMD EPYC 7642 system (our usual pick,
cpu-clock, was blaming spinlock irqrestore, which seemed questionable):

perf record -e cycles -g --call-graph fp -F 999 -- /tmp/repro

Naturally, there are two questions that arise:

* Is this expected (I guess not, but good to be sure)?
* What can we do to make this better?

I am happy to try out patches or to do some tracing to help understand
this better.


2023-07-06 06:41:19

by Shakeel Butt

Subject: Re: Expensive memory.stat + cpu.stat reads

On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> Hello,
>
> We're seeing CPU load issues with cgroup stats retrieval. I made a
> public gist with all the details, including the repro code (which
> unfortunately requires heavily loaded hardware) and some flamegraphs:
>
> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
>
> I'll repeat the gist of that gist here. Our repro has the following
> output after a warm-up run:
>
> completed: 5.17s [manual / mem-stat + cpu-stat]
> completed: 5.59s [manual / cpu-stat + mem-stat]
> completed: 0.52s [manual / mem-stat]
> completed: 0.04s [manual / cpu-stat]
>
> The first two lines do effectively the following:
>
> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
>
> The latter two are the same thing, but via two loops:
>
> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> /dev/null; done
> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> > /dev/null; done
>
> As you might've noticed from the output, splitting the loop into two
> makes the code run 10x faster. This isn't great, because most
> monitoring software likes to get all stats for one service before
> reading the stats for the next one, which maps to the slow and
> expensive way of doing this.
>
> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> that touch the cgroup or mm subsystems, so you can assume vanilla
> kernel.
>
> From the flamegraph it just looks like rstat flushing takes longer. I
> used the following flags on an AMD EPYC 7642 system (our usual pick
> cpu-clock was blaming spinlock irqrestore, which was questionable):
>
> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
>
> Naturally, there are two questions that arise:
>
> * Is this expected (I guess not, but good to be sure)?
> * What can we do to make this better?
>
> I am happy to try out patches or to do some tracing to help understand
> this better.

Hi Ivan,

Thanks a lot, as always, for reporting this. This is not expected and
should be fixed. Is the issue easy to repro, or is a specific workload
or high load/traffic required? Can you repro this with the latest Linus
tree? Also, do you see any difference in the root's cgroup.stat between
hosts where this issue happens and hosts in a good state?

BTW, I am away for the next month with very limited connectivity, so
expect a slow response.

thanks,
Shakeel

2023-07-10 15:38:53

by Michal Koutný

Subject: Re: Expensive memory.stat + cpu.stat reads

Hello.

On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou <[email protected]> wrote:
> As you might've noticed from the output, splitting the loop into two
> makes the code run 10x faster.

That is curious.

> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> that touch the cgroup or mm subsystems, so you can assume vanilla
> kernel.

Have you watched for this on older kernels too?

> I am happy to try out patches or to do some tracing to help understand
> this better.

I see in your reproducer that you tried swapping the order in which the
controllers are flushed.
Have you also tried flushing the same controller twice (in the inner loop)?
(Although the expectation is that it shouldn't differ from half of the
scenario where the two separate loops are run.)
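
Something like this, reusing your reproducer's paths (just a sketch of
what I mean, not tested):

for _ in $(seq 1 1000); do
    cat /sys/fs/cgroup/system.slice/memory.stat \
        /sys/fs/cgroup/system.slice/memory.stat > /dev/null
done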

Thanks,
Michal



2023-07-10 23:34:11

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <[email protected]> wrote:
>
> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> > Hello,
> >
> > We're seeing CPU load issues with cgroup stats retrieval. I made a
> > public gist with all the details, including the repro code (which
> > unfortunately requires heavily loaded hardware) and some flamegraphs:
> >
> > * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> >
> > I'll repeat the gist of that gist here. Our repro has the following
> > output after a warm-up run:
> >
> > completed: 5.17s [manual / mem-stat + cpu-stat]
> > completed: 5.59s [manual / cpu-stat + mem-stat]
> > completed: 0.52s [manual / mem-stat]
> > completed: 0.04s [manual / cpu-stat]
> >
> > The first two lines do effectively the following:
> >
> > for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> > /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> >
> > The latter two are the same thing, but via two loops:
> >
> > for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> > /dev/null; done
> > for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> > > /dev/null; done
> >
> > As you might've noticed from the output, splitting the loop into two
> > makes the code run 10x faster. This isn't great, because most
> > monitoring software likes to get all stats for one service before
> > reading the stats for the next one, which maps to the slow and
> > expensive way of doing this.
> >
> > We're running Linux v6.1 (the output is from v6.1.25) with no patches
> > that touch the cgroup or mm subsystems, so you can assume vanilla
> > kernel.
> >
> > From the flamegraph it just looks like rstat flushing takes longer. I
> > used the following flags on an AMD EPYC 7642 system (our usual pick
> > cpu-clock was blaming spinlock irqrestore, which was questionable):
> >
> > perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> >
> > Naturally, there are two questions that arise:
> >
> > * Is this expected (I guess not, but good to be sure)?
> > * What can we do to make this better?
> >
> > I am happy to try out patches or to do some tracing to help understand
> > this better.
>
> Hi Ivan,
>
> Thanks a lot, as always, for reporting this. This is not expected and
> should be fixed. Is the issue easy to repro or some specific workload or
> high load/traffic is required? Can you repro this with the latest linus
> tree? Also do you see any difference of root's cgroup.stat where this
> issue happens vs good state?

I'm afraid there's no easy way to reproduce. We see it from time to
time in different locations. The one that I was looking at for the
initial email does not reproduce it anymore:

completed: 0.75s [manual / cpu-stat + mem-stat]
completed: 0.72s [manual / mem-stat]
completed: 0.05s [manual / cpu-stat]

I took the top 20 servers by cadvisor metrics scrape duration, and it
does happen on the slowest ones:

completed: 22.32s [manual / cpu-stat + mem-stat]
completed: 0.34s [manual / mem-stat]
completed: 1.68s [manual / cpu-stat]

Is cadvisor or a similar metrics agent used by Google? Any chance you
could see if your own fleet exhibits this behavior?

Given that this behavior requires a full production setup with customer
traffic, plus the long qualification times we have for kernels, I'm not
sure if I can try the Linus tree here.

I uploaded the contents of /sys/fs/cgroup/memory.stat here:

* https://gist.github.com/bobrik/9255b5e8ed0a83afb73ebf06b79f07c4

The fast one is v6.1.37 and the slow one is v6.1.25. I'm not sure if
the kernel version makes a difference or if it's a matter of uptime /
traffic profile. The data is from two different locations. The fast
location has gone through an expansion, which meant a full reboot with
a kernel upgrade, so maybe that affected things:

* https://i.imgur.com/x8uyMaF.png

Let me try to reboot the slow location and see if there's any lasting
improvement.

2023-07-10 23:52:30

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Mon, Jul 10, 2023 at 7:44 AM Michal Koutný <[email protected]> wrote:
>
> Hello.
>
> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou <[email protected]> wrote:
> > As you might've noticed from the output, splitting the loop into two
> > makes the code run 10x faster.
>
> That is curious.
>
> > We're running Linux v6.1 (the output is from v6.1.25) with no patches
> > that touch the cgroup or mm subsystems, so you can assume vanilla
> > kernel.
>
> Have you watched for this on older kernels too?

We've been on v6.1 for quite a while now, but it's possible that we
weren't paying enough attention before to notice.

> > I am happy to try out patches or to do some tracing to help understand
> > this better.
>
> I see in your reproducer you tried swapping order of controllers
> flushed.
> Have you also tried flushing same controller twice (in the inner loop)?
> (Despite the expectation is that it shouldn't be different from half the
> scenario where ran two loops.)

Same controller twice is fast (whether it's mem + mem or cpu + cpu):

warm-up
completed: 17.24s [manual / cpu-stat + mem-stat]
completed: 1.02s [manual / mem-stat+mem-stat]
completed: 0.59s [manual / cpu-stat+cpu-stat]
completed: 0.44s [manual / mem-stat]
completed: 0.16s [manual / cpu-stat]
running
completed: 14.32s [manual / cpu-stat + mem-stat]
completed: 1.25s [manual / mem-stat+mem-stat]
completed: 0.42s [manual / cpu-stat+cpu-stat]
completed: 0.12s [manual / mem-stat]
completed: 0.50s [manual / cpu-stat]

2023-07-11 01:01:43

by Waiman Long

Subject: Re: Expensive memory.stat + cpu.stat reads

On 7/10/23 19:21, Ivan Babrou wrote:
> On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <[email protected]> wrote:
>> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
>>> Hello,
>>>
>>> We're seeing CPU load issues with cgroup stats retrieval. I made a
>>> public gist with all the details, including the repro code (which
>>> unfortunately requires heavily loaded hardware) and some flamegraphs:
>>>
>>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
>>>
>>> I'll repeat the gist of that gist here. Our repro has the following
>>> output after a warm-up run:
>>>
>>> completed: 5.17s [manual / mem-stat + cpu-stat]
>>> completed: 5.59s [manual / cpu-stat + mem-stat]
>>> completed: 0.52s [manual / mem-stat]
>>> completed: 0.04s [manual / cpu-stat]
>>>
>>> The first two lines do effectively the following:
>>>
>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
>>>
>>> The latter two are the same thing, but via two loops:
>>>
>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
>>> /dev/null; done
>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>>> /dev/null; done
>>> As you might've noticed from the output, splitting the loop into two
>>> makes the code run 10x faster. This isn't great, because most
>>> monitoring software likes to get all stats for one service before
>>> reading the stats for the next one, which maps to the slow and
>>> expensive way of doing this.
>>>
>>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
>>> that touch the cgroup or mm subsystems, so you can assume vanilla
>>> kernel.
>>>
>>> From the flamegraph it just looks like rstat flushing takes longer. I
>>> used the following flags on an AMD EPYC 7642 system (our usual pick
>>> cpu-clock was blaming spinlock irqrestore, which was questionable):
>>>
>>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
>>>
>>> Naturally, there are two questions that arise:
>>>
>>> * Is this expected (I guess not, but good to be sure)?
>>> * What can we do to make this better?
>>>
>>> I am happy to try out patches or to do some tracing to help understand
>>> this better.
>> Hi Ivan,
>>
>> Thanks a lot, as always, for reporting this. This is not expected and
>> should be fixed. Is the issue easy to repro or some specific workload or
>> high load/traffic is required? Can you repro this with the latest linus
>> tree? Also do you see any difference of root's cgroup.stat where this
>> issue happens vs good state?
> I'm afraid there's no easy way to reproduce. We see it from time to
> time in different locations. The one that I was looking at for the
> initial email does not reproduce it anymore:

My understanding of mem-stat and cpu-stat is that they are independent
of each other. In theory, reading one shouldn't affect the performance
of reading the other. Since you are reading mem-stat and cpu-stat
repeatedly in a loop, it is likely that all the data is in the cache
most of the time, resulting in very fast processing. If the specific
memory locations of the mem-stat and cpu-stat data are such that
reading one causes the other's data to be evicted from the cache and
re-read from memory, you could see a significant performance
regression.

It is one of the possible causes, but I may be wrong.

Cheers,
Longman



2023-07-14 00:28:44

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Mon, Jul 10, 2023 at 5:44 PM Waiman Long <[email protected]> wrote:
>
> On 7/10/23 19:21, Ivan Babrou wrote:
> > On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <[email protected]> wrote:
> >> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> >>> Hello,
> >>>
> >>> We're seeing CPU load issues with cgroup stats retrieval. I made a
> >>> public gist with all the details, including the repro code (which
> >>> unfortunately requires heavily loaded hardware) and some flamegraphs:
> >>>
> >>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> >>>
> >>> I'll repeat the gist of that gist here. Our repro has the following
> >>> output after a warm-up run:
> >>>
> >>> completed: 5.17s [manual / mem-stat + cpu-stat]
> >>> completed: 5.59s [manual / cpu-stat + mem-stat]
> >>> completed: 0.52s [manual / mem-stat]
> >>> completed: 0.04s [manual / cpu-stat]
> >>>
> >>> The first two lines do effectively the following:
> >>>
> >>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> >>>
> >>> The latter two are the same thing, but via two loops:
> >>>
> >>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> >>> /dev/null; done
> >>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>>> /dev/null; done
> >>> As you might've noticed from the output, splitting the loop into two
> >>> makes the code run 10x faster. This isn't great, because most
> >>> monitoring software likes to get all stats for one service before
> >>> reading the stats for the next one, which maps to the slow and
> >>> expensive way of doing this.
> >>>
> >>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> >>> that touch the cgroup or mm subsystems, so you can assume vanilla
> >>> kernel.
> >>>
> >>> From the flamegraph it just looks like rstat flushing takes longer. I
> >>> used the following flags on an AMD EPYC 7642 system (our usual pick
> >>> cpu-clock was blaming spinlock irqrestore, which was questionable):
> >>>
> >>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> >>>
> >>> Naturally, there are two questions that arise:
> >>>
> >>> * Is this expected (I guess not, but good to be sure)?
> >>> * What can we do to make this better?
> >>>
> >>> I am happy to try out patches or to do some tracing to help understand
> >>> this better.
> >> Hi Ivan,
> >>
> >> Thanks a lot, as always, for reporting this. This is not expected and
> >> should be fixed. Is the issue easy to repro or some specific workload or
> >> high load/traffic is required? Can you repro this with the latest linus
> >> tree? Also do you see any difference of root's cgroup.stat where this
> >> issue happens vs good state?
> > I'm afraid there's no easy way to reproduce. We see it from time to
> > time in different locations. The one that I was looking at for the
> > initial email does not reproduce it anymore:
>
> My understanding of mem-stat and cpu-stat is that they are independent
> of each other. In theory, reading one shouldn't affect the performance
> of reading the others. Since you are doing mem-stat and cpu-stat reading
> repetitively in a loop, it is likely that all the data are in the cache
> most of the time resulting in very fast processing time. If it happens
> that the specific memory location of mem-stat and cpu-stat data are such
> that reading one will cause the other data to be flushed out of the
> cache and have to be re-read from memory again, you could see
> significant performance regression.
>
> It is one of the possible causes, but I may be wrong.

Do you think it's somewhat similar to how iterating over a matrix by
rows is faster than by columns due to sequential vs random memory reads?

* https://stackoverflow.com/q/9936132
* https://en.wikipedia.org/wiki/Row-_and_column-major_order
* https://en.wikipedia.org/wiki/Loop_interchange

I've had a similar suspicion and it would be good to confirm whether
it's that or something else. I can probably collect perf counters for
different runs, but I'm not sure which ones I'll need.
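
I could presumably start with the generic cache counters (assuming
these events are exposed on this EPYC host), something like:

perf stat -e cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses -- /tmp/repro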

In a similar vein, if we could come up with a tracepoint that would
tell us the amount of work done (or any other relevant metric that
would help) during rstat flushing, I can certainly collect that
information as well for every reading combination.

2023-07-14 17:48:22

by Waiman Long

Subject: Re: Expensive memory.stat + cpu.stat reads

On 7/13/23 19:25, Ivan Babrou wrote:
> On Mon, Jul 10, 2023 at 5:44 PM Waiman Long <[email protected]> wrote:
>> On 7/10/23 19:21, Ivan Babrou wrote:
>>> On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <[email protected]> wrote:
>>>> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
>>>>> Hello,
>>>>>
>>>>> We're seeing CPU load issues with cgroup stats retrieval. I made a
>>>>> public gist with all the details, including the repro code (which
>>>>> unfortunately requires heavily loaded hardware) and some flamegraphs:
>>>>>
>>>>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
>>>>>
>>>>> I'll repeat the gist of that gist here. Our repro has the following
>>>>> output after a warm-up run:
>>>>>
>>>>> completed: 5.17s [manual / mem-stat + cpu-stat]
>>>>> completed: 5.59s [manual / cpu-stat + mem-stat]
>>>>> completed: 0.52s [manual / mem-stat]
>>>>> completed: 0.04s [manual / cpu-stat]
>>>>>
>>>>> The first two lines do effectively the following:
>>>>>
>>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>>>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
>>>>>
>>>>> The latter two are the same thing, but via two loops:
>>>>>
>>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
>>>>> /dev/null; done
>>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>>>>> /dev/null; done
>>>>> As you might've noticed from the output, splitting the loop into two
>>>>> makes the code run 10x faster. This isn't great, because most
>>>>> monitoring software likes to get all stats for one service before
>>>>> reading the stats for the next one, which maps to the slow and
>>>>> expensive way of doing this.
>>>>>
>>>>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
>>>>> that touch the cgroup or mm subsystems, so you can assume vanilla
>>>>> kernel.
>>>>>
>>>>> From the flamegraph it just looks like rstat flushing takes longer. I
>>>>> used the following flags on an AMD EPYC 7642 system (our usual pick
>>>>> cpu-clock was blaming spinlock irqrestore, which was questionable):
>>>>>
>>>>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
>>>>>
>>>>> Naturally, there are two questions that arise:
>>>>>
>>>>> * Is this expected (I guess not, but good to be sure)?
>>>>> * What can we do to make this better?
>>>>>
>>>>> I am happy to try out patches or to do some tracing to help understand
>>>>> this better.
>>>> Hi Ivan,
>>>>
>>>> Thanks a lot, as always, for reporting this. This is not expected and
>>>> should be fixed. Is the issue easy to repro or some specific workload or
>>>> high load/traffic is required? Can you repro this with the latest linus
>>>> tree? Also do you see any difference of root's cgroup.stat where this
>>>> issue happens vs good state?
>>> I'm afraid there's no easy way to reproduce. We see it from time to
>>> time in different locations. The one that I was looking at for the
>>> initial email does not reproduce it anymore:
>> My understanding of mem-stat and cpu-stat is that they are independent
>> of each other. In theory, reading one shouldn't affect the performance
>> of reading the others. Since you are doing mem-stat and cpu-stat reading
>> repetitively in a loop, it is likely that all the data are in the cache
>> most of the time resulting in very fast processing time. If it happens
>> that the specific memory location of mem-stat and cpu-stat data are such
>> that reading one will cause the other data to be flushed out of the
>> cache and have to be re-read from memory again, you could see
>> significant performance regression.
>>
>> It is one of the possible causes, but I may be wrong.
> Do you think it's somewhat similar to how iterating a matrix in rows
> is faster than in columns due to sequential vs random memory reads?
>
> * https://stackoverflow.com/q/9936132
> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> * https://en.wikipedia.org/wiki/Loop_interchange

Yes, it is similar to what is being described in those articles.


>
> I've had a similar suspicion and it would be good to confirm whether
> it's that or something else. I can probably collect perf counters for
> different runs, but I'm not sure which ones I'll need.
>
> In a similar vein, if we could come up with a tracepoint that would
> tell us the amount of work done (or any other relevant metric that
> would help) during rstat flushing, I can certainly collect that
> information as well for every reading combination.

The perf-c2c tool may be able to help. The data to look for is how often
loads are served from the caches versus having to go out to memory.
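
A rough sketch of running it against your reproducer (it needs root and
a PMU that supports the load/store sampling events):

perf c2c record -- /tmp/repro
perf c2c report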

Cheers,
Longman


2023-07-15 00:54:07

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Mon, Jul 10, 2023 at 4:21 PM Ivan Babrou <[email protected]> wrote:
> The fast one is v6.1.37 and the slow one is v6.1.25. I'm not sure if
> the kernel version makes a difference or if it's a matter of uptime /
> traffic profile. The data is from two different locations. The fast
> location has gone through an expansion, which meant a full reboot with
> a kernel upgrade, so maybe that affected things:
>
> * https://i.imgur.com/x8uyMaF.png
>
> Let me try to reboot the slow location and see if there's any lasting
> improvement.

There is no correlation with the kernel version; v6.1.38 is slow too:

completed: 23.09s [manual / cpu-stat + mem-stat]
completed: 0.30s [manual / mem-stat]
completed: 0.64s [manual / cpu-stat]

2023-07-15 00:54:56

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Fri, Jul 14, 2023 at 10:23 AM Waiman Long <[email protected]> wrote:
>
> On 7/13/23 19:25, Ivan Babrou wrote:
> > On Mon, Jul 10, 2023 at 5:44 PM Waiman Long <[email protected]> wrote:
> >> On 7/10/23 19:21, Ivan Babrou wrote:
> >>> On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <[email protected]> wrote:
> >>>> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> >>>>> Hello,
> >>>>>
> >>>>> We're seeing CPU load issues with cgroup stats retrieval. I made a
> >>>>> public gist with all the details, including the repro code (which
> >>>>> unfortunately requires heavily loaded hardware) and some flamegraphs:
> >>>>>
> >>>>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> >>>>>
> >>>>> I'll repeat the gist of that gist here. Our repro has the following
> >>>>> output after a warm-up run:
> >>>>>
> >>>>> completed: 5.17s [manual / mem-stat + cpu-stat]
> >>>>> completed: 5.59s [manual / cpu-stat + mem-stat]
> >>>>> completed: 0.52s [manual / mem-stat]
> >>>>> completed: 0.04s [manual / cpu-stat]
> >>>>>
> >>>>> The first two lines do effectively the following:
> >>>>>
> >>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>>>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> >>>>>
> >>>>> The latter two are the same thing, but via two loops:
> >>>>>
> >>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> >>>>> /dev/null; done
> >>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>>>>> /dev/null; done
> >>>>> As you might've noticed from the output, splitting the loop into two
> >>>>> makes the code run 10x faster. This isn't great, because most
> >>>>> monitoring software likes to get all stats for one service before
> >>>>> reading the stats for the next one, which maps to the slow and
> >>>>> expensive way of doing this.
> >>>>>
> >>>>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> >>>>> that touch the cgroup or mm subsystems, so you can assume vanilla
> >>>>> kernel.
> >>>>>
> >>>>> From the flamegraph it just looks like rstat flushing takes longer. I
> >>>>> used the following flags on an AMD EPYC 7642 system (our usual pick
> >>>>> cpu-clock was blaming spinlock irqrestore, which was questionable):
> >>>>>
> >>>>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> >>>>>
> >>>>> Naturally, there are two questions that arise:
> >>>>>
> >>>>> * Is this expected (I guess not, but good to be sure)?
> >>>>> * What can we do to make this better?
> >>>>>
> >>>>> I am happy to try out patches or to do some tracing to help understand
> >>>>> this better.
> >>>> Hi Ivan,
> >>>>
> >>>> Thanks a lot, as always, for reporting this. This is not expected and
> >>>> should be fixed. Is the issue easy to repro or some specific workload or
> >>>> high load/traffic is required? Can you repro this with the latest linus
> >>>> tree? Also do you see any difference of root's cgroup.stat where this
> >>>> issue happens vs good state?
> >>> I'm afraid there's no easy way to reproduce. We see it from time to
> >>> time in different locations. The one that I was looking at for the
> >>> initial email does not reproduce it anymore:
> >> My understanding of mem-stat and cpu-stat is that they are independent
> >> of each other. In theory, reading one shouldn't affect the performance
> >> of reading the others. Since you are doing mem-stat and cpu-stat reading
> >> repetitively in a loop, it is likely that all the data are in the cache
> >> most of the time resulting in very fast processing time. If it happens
> >> that the specific memory location of mem-stat and cpu-stat data are such
> >> that reading one will cause the other data to be flushed out of the
> >> cache and have to be re-read from memory again, you could see
> >> significant performance regression.
> >>
> >> It is one of the possible causes, but I may be wrong.
> > Do you think it's somewhat similar to how iterating a matrix in rows
> > is faster than in columns due to sequential vs random memory reads?
> >
> > * https://stackoverflow.com/q/9936132
> > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > * https://en.wikipedia.org/wiki/Loop_interchange
>
> Yes, it is similar to what is being described in those articles.
>
>
> >
> > I've had a similar suspicion and it would be good to confirm whether
> > it's that or something else. I can probably collect perf counters for
> > different runs, but I'm not sure which ones I'll need.
> >
> > In a similar vein, if we could come up with a tracepoint that would
> > tell us the amount of work done (or any other relevant metric that
> > would help) during rstat flushing, I can certainly collect that
> > information as well for every reading combination.
>
> The perf-c2c tool may be able to help. The data to look for is how often
> the data is from caches vs direct memory load/store.

It looks like c2c only works for the whole system, not individual
threads. There's a lot of noise from the rest of the system.

2023-07-15 00:55:15

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <[email protected]> wrote:
> > My understanding of mem-stat and cpu-stat is that they are independent
> > of each other. In theory, reading one shouldn't affect the performance
> > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > repetitively in a loop, it is likely that all the data are in the cache
> > most of the time resulting in very fast processing time. If it happens
> > that the specific memory location of mem-stat and cpu-stat data are such
> > that reading one will cause the other data to be flushed out of the
> > cache and have to be re-read from memory again, you could see
> > significant performance regression.
> >
> > It is one of the possible causes, but I may be wrong.
>
> Do you think it's somewhat similar to how iterating a matrix in rows
> is faster than in columns due to sequential vs random memory reads?
>
> * https://stackoverflow.com/q/9936132
> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> * https://en.wikipedia.org/wiki/Loop_interchange
>
> I've had a similar suspicion and it would be good to confirm whether
> it's that or something else. I can probably collect perf counters for
> different runs, but I'm not sure which ones I'll need.
>
> In a similar vein, if we could come up with a tracepoint that would
> tell us the amount of work done (or any other relevant metric that
> would help) during rstat flushing, I can certainly collect that
> information as well for every reading combination.

Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
(discrete) and slow (combined) cases, I grabbed some stats for it:

* Slow:

completed: 19.43s [manual / mem-stat + cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
^C
00:12:55
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 191 |************ |
256 -> 511 : 590 |****************************************|
512 -> 1023 : 186 |************ |
1024 -> 2047 : 2 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 504 |********************************** |
16384 -> 32767 : 514 |********************************** |
32768 -> 65535 : 3 | |
65536 -> 131071 : 1 | |

avg = 8852 usecs, total: 17633268 usecs, count: 1992

* Fast:

completed: 0.95s [manual / mem-stat]
completed: 0.05s [manual / cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
^C
00:13:27
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 499 |****************************************|
8 -> 15 : 253 |******************** |
16 -> 31 : 191 |*************** |
32 -> 63 : 41 |*** |
64 -> 127 : 12 | |
128 -> 255 : 2 | |
256 -> 511 : 2 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 34 |** |
16384 -> 32767 : 21 |* |

avg = 857 usecs, total: 904762 usecs, count: 1055

There's a different number of calls into cgroup_rstat_flush_locked, and
they are much slower in the slow case. There are also two bands in the
slow case, with the 8ms..32ms range containing about half of the calls.

For mem_cgroup_css_rstat_flush:

* Slow:

completed: 32.77s [manual / mem-stat + cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
^C
00:21:25
usecs : count distribution
0 -> 1 : 93078 |* |
2 -> 3 : 3397714 |****************************************|
4 -> 7 : 1009440 |*********** |
8 -> 15 : 168013 |* |
16 -> 31 : 93 | |

avg = 3 usecs, total: 17189289 usecs, count: 4668338

* Fast:

completed: 0.16s [manual / mem-stat]
completed: 0.04s [manual / cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
^C
00:21:57
usecs : count distribution
0 -> 1 : 1441 |*** |
2 -> 3 : 18780 |****************************************|
4 -> 7 : 4826 |********** |
8 -> 15 : 732 |* |
16 -> 31 : 1 | |

avg = 3 usecs, total: 89174 usecs, count: 25780

There's a 181x difference in the number of calls into
mem_cgroup_css_rstat_flush.

Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
yielding a ton more iterations for some reason here?

* https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196

It's inlined, but I can place a probe into the loop:

  7          for_each_possible_cpu(cpu) {
  8                  raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
                                                            cpu);
 10                  struct cgroup *pos = NULL;
                     unsigned long flags;

                     /*
                      * The _irqsave() is needed because cgroup_rstat_lock is
                      * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
                      * this lock with the _irq() suffix only disables interrupts on
                      * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
                      * interrupts on both configurations. The _irqsave() ensures
                      * that interrupts are always disabled and later restored.
                      */
                     raw_spin_lock_irqsave(cpu_lock, flags);
                     while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
                             struct cgroup_subsys_state *css;

                             cgroup_base_stat_flush(pos, cpu);
 26                          bpf_rstat_flush(pos, cgroup_parent(pos), cpu);

 28                          rcu_read_lock();
 29                          list_for_each_entry_rcu(css, &pos->rstat_css_list,
                                                     rstat_css_node)
 31                                  css->ss->css_rstat_flush(css, cpu);
 32                          rcu_read_unlock();
                     }
 34                  raw_spin_unlock_irqrestore(cpu_lock, flags);

I added probes on both line 26 and line 31 to catch the middle and inner loops.
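
Placing them can be done roughly like this (relying on kernel debuginfo
for the relative line numbers to resolve):

perf probe --add 'cgroup_rstat_flush_locked_L26=cgroup_rstat_flush_locked:26'
perf probe --add 'cgroup_rstat_flush_locked_L31=cgroup_rstat_flush_locked:31'
perf stat -e probe:cgroup_rstat_flush_locked_L26,probe:cgroup_rstat_flush_locked_L31 -- /tmp/derp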

* Slow:

completed: 32.97s [manual / mem-stat + cpu-stat]

Performance counter stats for '/tmp/derp':

4,702,570 probe:cgroup_rstat_flush_locked_L26
9,301,436 probe:cgroup_rstat_flush_locked_L31

* Fast:

completed: 0.17s [manual / mem-stat]
completed: 0.34s [manual / cpu-stat]

Performance counter stats for '/tmp/derp':

31,769 probe:cgroup_rstat_flush_locked_L26
62,849 probe:cgroup_rstat_flush_locked_L31

It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
lot more positions.

I'm going to sign off for the week, but let me know if I should place
any more probes to nail this down.

2023-08-11 22:33:51

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <[email protected]> wrote:
>
> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <[email protected]> wrote:
> > > My understanding of mem-stat and cpu-stat is that they are independent
> > > of each other. In theory, reading one shouldn't affect the performance
> > > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > > repetitively in a loop, it is likely that all the data are in the cache
> > > most of the time resulting in very fast processing time. If it happens
> > > that the specific memory location of mem-stat and cpu-stat data are such
> > > that reading one will cause the other data to be flushed out of the
> > > cache and have to be re-read from memory again, you could see
> > > significant performance regression.
> > >
> > > It is one of the possible causes, but I may be wrong.
> >
> > Do you think it's somewhat similar to how iterating a matrix in rows
> > is faster than in columns due to sequential vs random memory reads?
> >
> > * https://stackoverflow.com/q/9936132
> > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > * https://en.wikipedia.org/wiki/Loop_interchange
> >
> > I've had a similar suspicion and it would be good to confirm whether
> > it's that or something else. I can probably collect perf counters for
> > different runs, but I'm not sure which ones I'll need.
> >
> > In a similar vein, if we could come up with a tracepoint that would
> > tell us the amount of work done (or any other relevant metric that
> > would help) during rstat flushing, I can certainly collect that
> > information as well for every reading combination.
>
> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> (discrete) and slow (combined) cases, I grabbed some stats for it:
>
> * Slow:
>
> completed: 19.43s [manual / mem-stat + cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> ^C
> 00:12:55
> usecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 1 | |
> 128 -> 255 : 191 |************ |
> 256 -> 511 : 590 |****************************************|
> 512 -> 1023 : 186 |************ |
> 1024 -> 2047 : 2 | |
> 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 504 |********************************** |
> 16384 -> 32767 : 514 |********************************** |
> 32768 -> 65535 : 3 | |
> 65536 -> 131071 : 1 | |
>
> avg = 8852 usecs, total: 17633268 usecs, count: 1992
>
> * Fast:
>
> completed: 0.95s [manual / mem-stat]
> completed: 0.05s [manual / cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> ^C
> 00:13:27
> usecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 499 |****************************************|
> 8 -> 15 : 253 |******************** |
> 16 -> 31 : 191 |*************** |
> 32 -> 63 : 41 |*** |
> 64 -> 127 : 12 | |
> 128 -> 255 : 2 | |
> 256 -> 511 : 2 | |
> 512 -> 1023 : 0 | |
> 1024 -> 2047 : 0 | |
> 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 34 |** |
> 16384 -> 32767 : 21 |* |
>
> avg = 857 usecs, total: 904762 usecs, count: 1055
>
> There's a different number of calls into cgroup_rstat_flush_locked and
> they are much slower in the slow case. There are also two bands in the
> slow case, with 8ms..32ms having the half of the calls.
>
> For mem_cgroup_css_rstat_flush:
>
> * Slow:
>
> completed: 32.77s [manual / mem-stat + cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> ^C
> 00:21:25
> usecs : count distribution
> 0 -> 1 : 93078 |* |
> 2 -> 3 : 3397714 |****************************************|
> 4 -> 7 : 1009440 |*********** |
> 8 -> 15 : 168013 |* |
> 16 -> 31 : 93 | |
>
> avg = 3 usecs, total: 17189289 usecs, count: 4668338
>
> * Fast:
>
> completed: 0.16s [manual / mem-stat]
> completed: 0.04s [manual / cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> ^C
> 00:21:57
> usecs : count distribution
> 0 -> 1 : 1441 |*** |
> 2 -> 3 : 18780 |****************************************|
> 4 -> 7 : 4826 |********** |
> 8 -> 15 : 732 |* |
> 16 -> 31 : 1 | |
>
> avg = 3 usecs, total: 89174 usecs, count: 25780
>
> There's an 181x difference in the number of calls into
> mem_cgroup_css_rstat_flush.
>
> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> yielding a ton more iterations for some reason here?
>
> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
>
> It's inlined, but I can place a probe into the loop:
>
> 7 for_each_possible_cpu(cpu) {
> 8 raw_spinlock_t *cpu_lock =
> per_cpu_ptr(&cgroup_rstat_cpu_lock,
> cpu);
> 10 struct cgroup *pos = NULL;
> unsigned long flags;
>
> /*
> * The _irqsave() is needed because cgroup_rstat_lock is
> * spinlock_t which is a sleeping lock on
> PREEMPT_RT. Acquiring
> * this lock with the _irq() suffix only
> disables interrupts on
> * a non-PREEMPT_RT kernel. The raw_spinlock_t
> below disables
> * interrupts on both configurations. The
> _irqsave() ensures
> * that interrupts are always disabled and
> later restored.
> */
> raw_spin_lock_irqsave(cpu_lock, flags);
> while ((pos =
> cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> struct cgroup_subsys_state *css;
>
> cgroup_base_stat_flush(pos, cpu);
> 26 bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>
> 28 rcu_read_lock();
> 29 list_for_each_entry_rcu(css,
> &pos->rstat_css_list,
> rstat_css_node)
> 31 css->ss->css_rstat_flush(css, cpu);
> 32 rcu_read_unlock();
> }
> 34 raw_spin_unlock_irqrestore(cpu_lock, flags);
>
> I added probes on both line 26 and line 31 to catch the middle and inner loops.
>
> * Slow:
>
> completed: 32.97s [manual / mem-stat + cpu-stat]
>
> Performance counter stats for '/tmp/derp':
>
> 4,702,570 probe:cgroup_rstat_flush_locked_L26
> 9,301,436 probe:cgroup_rstat_flush_locked_L31
>
> * Fast:
>
> completed: 0.17s [manual / mem-stat]
> completed: 0.34s [manual / cpu-stat]
>
> Performance counter stats for '/tmp/derp':
>
> 31,769 probe:cgroup_rstat_flush_locked_L26
> 62,849 probe:cgroup_rstat_flush_locked_L31
>
> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> lot more positions.
>
> I'm going to sign off for the week, but let me know if I should place
> any more probes to nail this down.

I spent some time looking into this and I think I landed on a fix:

* https://github.com/bobrik/linux/commit/50b627811d54

I'm not 100% sure if it's the right fix for the issue, but it reduces
the runtime significantly.

We see a 50x decrease in the memory.stat + cpu.stat loop duration with
the patch applied. TL;DR is that memory.stat flushes all cgroups,
while cpu.stat flushes just the subtree you're asking for. Both flush
cpu and memory stats at the same time, since both are rstat-based. See
the description of the commit linked above for more details.

There are two more graphs to add. I rebooted 4 servers, 2 of which
received my patch and 2 of which were the control. The reboot happened
at around 05:00Z.

First is the latency of scraping cadvisor, where you can clearly see
that the control group is rising with the daily load, while the test
group is staying mostly flat:

* https://i.imgur.com/GMtzHIu.png

Second is the CPU time spent by cadvisor, where you can see a similar picture:

* https://i.imgur.com/LWHt14P.png

Let me know what you think.

2023-08-11 23:21:21

by Waiman Long

Subject: Re: Expensive memory.stat + cpu.stat reads

On 8/11/23 18:03, Ivan Babrou wrote:
> On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <[email protected]> wrote:
>> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <[email protected]> wrote:
>>>> My understanding of mem-stat and cpu-stat is that they are independent
>>>> of each other. In theory, reading one shouldn't affect the performance
>>>> of reading the others. Since you are doing mem-stat and cpu-stat reading
>>>> repetitively in a loop, it is likely that all the data are in the cache
>>>> most of the time resulting in very fast processing time. If it happens
>>>> that the specific memory location of mem-stat and cpu-stat data are such
>>>> that reading one will cause the other data to be flushed out of the
>>>> cache and have to be re-read from memory again, you could see
>>>> significant performance regression.
>>>>
>>>> It is one of the possible causes, but I may be wrong.
>>> Do you think it's somewhat similar to how iterating a matrix in rows
>>> is faster than in columns due to sequential vs random memory reads?
>>>
>>> * https://stackoverflow.com/q/9936132
>>> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
>>> * https://en.wikipedia.org/wiki/Loop_interchange
>>>
>>> I've had a similar suspicion and it would be good to confirm whether
>>> it's that or something else. I can probably collect perf counters for
>>> different runs, but I'm not sure which ones I'll need.
>>>
>>> In a similar vein, if we could come up with a tracepoint that would
>>> tell us the amount of work done (or any other relevant metric that
>>> would help) during rstat flushing, I can certainly collect that
>>> information as well for every reading combination.
>> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
>> (discrete) and slow (combined) cases, I grabbed some stats for it:
>>
>> * Slow:
>>
>> completed: 19.43s [manual / mem-stat + cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
>> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
>> ^C
>> 00:12:55
>> usecs : count distribution
>> 0 -> 1 : 0 | |
>> 2 -> 3 : 0 | |
>> 4 -> 7 : 0 | |
>> 8 -> 15 : 0 | |
>> 16 -> 31 : 0 | |
>> 32 -> 63 : 0 | |
>> 64 -> 127 : 1 | |
>> 128 -> 255 : 191 |************ |
>> 256 -> 511 : 590 |****************************************|
>> 512 -> 1023 : 186 |************ |
>> 1024 -> 2047 : 2 | |
>> 2048 -> 4095 : 0 | |
>> 4096 -> 8191 : 0 | |
>> 8192 -> 16383 : 504 |********************************** |
>> 16384 -> 32767 : 514 |********************************** |
>> 32768 -> 65535 : 3 | |
>> 65536 -> 131071 : 1 | |
>>
>> avg = 8852 usecs, total: 17633268 usecs, count: 1992
>>
>> * Fast:
>>
>> completed: 0.95s [manual / mem-stat]
>> completed: 0.05s [manual / cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
>> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
>> ^C
>> 00:13:27
>> usecs : count distribution
>> 0 -> 1 : 0 | |
>> 2 -> 3 : 0 | |
>> 4 -> 7 : 499 |****************************************|
>> 8 -> 15 : 253 |******************** |
>> 16 -> 31 : 191 |*************** |
>> 32 -> 63 : 41 |*** |
>> 64 -> 127 : 12 | |
>> 128 -> 255 : 2 | |
>> 256 -> 511 : 2 | |
>> 512 -> 1023 : 0 | |
>> 1024 -> 2047 : 0 | |
>> 2048 -> 4095 : 0 | |
>> 4096 -> 8191 : 0 | |
>> 8192 -> 16383 : 34 |** |
>> 16384 -> 32767 : 21 |* |
>>
>> avg = 857 usecs, total: 904762 usecs, count: 1055
>>
>> There's a different number of calls into cgroup_rstat_flush_locked and
>> they are much slower in the slow case. There are also two bands in the
>> slow case, with 8ms..32ms having the half of the calls.
>>
>> For mem_cgroup_css_rstat_flush:
>>
>> * Slow:
>>
>> completed: 32.77s [manual / mem-stat + cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
>> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
>> ^C
>> 00:21:25
>> usecs : count distribution
>> 0 -> 1 : 93078 |* |
>> 2 -> 3 : 3397714 |****************************************|
>> 4 -> 7 : 1009440 |*********** |
>> 8 -> 15 : 168013 |* |
>> 16 -> 31 : 93 | |
>>
>> avg = 3 usecs, total: 17189289 usecs, count: 4668338
>>
>> * Fast:
>>
>> completed: 0.16s [manual / mem-stat]
>> completed: 0.04s [manual / cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
>> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
>> ^C
>> 00:21:57
>> usecs : count distribution
>> 0 -> 1 : 1441 |*** |
>> 2 -> 3 : 18780 |****************************************|
>> 4 -> 7 : 4826 |********** |
>> 8 -> 15 : 732 |* |
>> 16 -> 31 : 1 | |
>>
>> avg = 3 usecs, total: 89174 usecs, count: 25780
>>
>> There's an 181x difference in the number of calls into
>> mem_cgroup_css_rstat_flush.
>>
>> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
>> yielding a ton more iterations for some reason here?
>>
>> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
>>
>> It's inlined, but I can place a probe into the loop:
>>
>> 7 for_each_possible_cpu(cpu) {
>> 8 raw_spinlock_t *cpu_lock =
>> per_cpu_ptr(&cgroup_rstat_cpu_lock,
>> cpu);
>> 10 struct cgroup *pos = NULL;
>> unsigned long flags;
>>
>> /*
>> * The _irqsave() is needed because cgroup_rstat_lock is
>> * spinlock_t which is a sleeping lock on
>> PREEMPT_RT. Acquiring
>> * this lock with the _irq() suffix only
>> disables interrupts on
>> * a non-PREEMPT_RT kernel. The raw_spinlock_t
>> below disables
>> * interrupts on both configurations. The
>> _irqsave() ensures
>> * that interrupts are always disabled and
>> later restored.
>> */
>> raw_spin_lock_irqsave(cpu_lock, flags);
>> while ((pos =
>> cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
>> struct cgroup_subsys_state *css;
>>
>> cgroup_base_stat_flush(pos, cpu);
>> 26 bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>>
>> 28 rcu_read_lock();
>> 29 list_for_each_entry_rcu(css,
>> &pos->rstat_css_list,
>> rstat_css_node)
>> 31 css->ss->css_rstat_flush(css, cpu);
>> 32 rcu_read_unlock();
>> }
>> 34 raw_spin_unlock_irqrestore(cpu_lock, flags);
>>
>> I added probes on both line 26 and line 31 to catch the middle and inner loops.
>>
>> * Slow:
>>
>> completed: 32.97s [manual / mem-stat + cpu-stat]
>>
>> Performance counter stats for '/tmp/derp':
>>
>> 4,702,570 probe:cgroup_rstat_flush_locked_L26
>> 9,301,436 probe:cgroup_rstat_flush_locked_L31
>>
>> * Fast:
>>
>> completed: 0.17s [manual / mem-stat]
>> completed: 0.34s [manual / cpu-stat]
>>
>> Performance counter stats for '/tmp/derp':
>>
>> 31,769 probe:cgroup_rstat_flush_locked_L26
>> 62,849 probe:cgroup_rstat_flush_locked_L31
>>
>> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
>> lot more positions.
>>
>> I'm going to sign off for the week, but let me know if I should place
>> any more probes to nail this down.
> I spent some time looking into this and I think I landed on a fix:
>
> * https://github.com/bobrik/linux/commit/50b627811d54
>
> I'm not 100% sure if it's the right fix for the issue, but it reduces
> the runtime significantly.

It looks like the overhead of mem_cgroup_flush_stats() may be the cause
of the performance regression. So what version of the Linux kernel are
you using? From the patch linked in the URL above, it doesn't seem like
you are using the latest kernel.

The latest upstream kernel already has patches that reduce rstat
flushing overhead, such as commit 11192d9c124 ("memcg: flush stats only
if updated"). Have you tried to reproduce it with the latest kernel?

Cheers,
Longman


2023-08-11 23:22:00

by Ivan Babrou

Subject: Re: Expensive memory.stat + cpu.stat reads

On Fri, Aug 11, 2023 at 3:27 PM Waiman Long <[email protected]> wrote:
>
> On 8/11/23 18:03, Ivan Babrou wrote:
> > On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <[email protected]> wrote:
> >> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <[email protected]> wrote:
> >>>> My understanding of mem-stat and cpu-stat is that they are independent
> >>>> of each other. In theory, reading one shouldn't affect the performance
> >>>> of reading the others. Since you are doing mem-stat and cpu-stat reading
> >>>> repetitively in a loop, it is likely that all the data are in the cache
> >>>> most of the time resulting in very fast processing time. If it happens
> >>>> that the specific memory location of mem-stat and cpu-stat data are such
> >>>> that reading one will cause the other data to be flushed out of the
> >>>> cache and have to be re-read from memory again, you could see
> >>>> significant performance regression.
> >>>>
> >>>> It is one of the possible causes, but I may be wrong.
> >>> Do you think it's somewhat similar to how iterating a matrix in rows
> >>> is faster than in columns due to sequential vs random memory reads?
> >>>
> >>> * https://stackoverflow.com/q/9936132
> >>> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> >>> * https://en.wikipedia.org/wiki/Loop_interchange
> >>>
> >>> I've had a similar suspicion and it would be good to confirm whether
> >>> it's that or something else. I can probably collect perf counters for
> >>> different runs, but I'm not sure which ones I'll need.
> >>>
> >>> In a similar vein, if we could come up with a tracepoint that would
> >>> tell us the amount of work done (or any other relevant metric that
> >>> would help) during rstat flushing, I can certainly collect that
> >>> information as well for every reading combination.
> >> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> >> (discrete) and slow (combined) cases, I grabbed some stats for it:
> >>
> >> * Slow:
> >>
> >> completed: 19.43s [manual / mem-stat + cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> >> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> >> ^C
> >> 00:12:55
> >> usecs : count distribution
> >> 0 -> 1 : 0 | |
> >> 2 -> 3 : 0 | |
> >> 4 -> 7 : 0 | |
> >> 8 -> 15 : 0 | |
> >> 16 -> 31 : 0 | |
> >> 32 -> 63 : 0 | |
> >> 64 -> 127 : 1 | |
> >> 128 -> 255 : 191 |************ |
> >> 256 -> 511 : 590 |****************************************|
> >> 512 -> 1023 : 186 |************ |
> >> 1024 -> 2047 : 2 | |
> >> 2048 -> 4095 : 0 | |
> >> 4096 -> 8191 : 0 | |
> >> 8192 -> 16383 : 504 |********************************** |
> >> 16384 -> 32767 : 514 |********************************** |
> >> 32768 -> 65535 : 3 | |
> >> 65536 -> 131071 : 1 | |
> >>
> >> avg = 8852 usecs, total: 17633268 usecs, count: 1992
> >>
> >> * Fast:
> >>
> >> completed: 0.95s [manual / mem-stat]
> >> completed: 0.05s [manual / cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> >> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> >> ^C
> >> 00:13:27
> >> usecs : count distribution
> >> 0 -> 1 : 0 | |
> >> 2 -> 3 : 0 | |
> >> 4 -> 7 : 499 |****************************************|
> >> 8 -> 15 : 253 |******************** |
> >> 16 -> 31 : 191 |*************** |
> >> 32 -> 63 : 41 |*** |
> >> 64 -> 127 : 12 | |
> >> 128 -> 255 : 2 | |
> >> 256 -> 511 : 2 | |
> >> 512 -> 1023 : 0 | |
> >> 1024 -> 2047 : 0 | |
> >> 2048 -> 4095 : 0 | |
> >> 4096 -> 8191 : 0 | |
> >> 8192 -> 16383 : 34 |** |
> >> 16384 -> 32767 : 21 |* |
> >>
> >> avg = 857 usecs, total: 904762 usecs, count: 1055
> >>
> >> There's a different number of calls into cgroup_rstat_flush_locked and
> >> they are much slower in the slow case. There are also two bands in the
> >> slow case, with 8ms..32ms having the half of the calls.
> >>
> >> For mem_cgroup_css_rstat_flush:
> >>
> >> * Slow:
> >>
> >> completed: 32.77s [manual / mem-stat + cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> >> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> >> ^C
> >> 00:21:25
> >> usecs : count distribution
> >> 0 -> 1 : 93078 |* |
> >> 2 -> 3 : 3397714 |****************************************|
> >> 4 -> 7 : 1009440 |*********** |
> >> 8 -> 15 : 168013 |* |
> >> 16 -> 31 : 93 | |
> >>
> >> avg = 3 usecs, total: 17189289 usecs, count: 4668338
> >>
> >> * Fast:
> >>
> >> completed: 0.16s [manual / mem-stat]
> >> completed: 0.04s [manual / cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> >> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> >> ^C
> >> 00:21:57
> >> usecs : count distribution
> >> 0 -> 1 : 1441 |*** |
> >> 2 -> 3 : 18780 |****************************************|
> >> 4 -> 7 : 4826 |********** |
> >> 8 -> 15 : 732 |* |
> >> 16 -> 31 : 1 | |
> >>
> >> avg = 3 usecs, total: 89174 usecs, count: 25780
> >>
> >> There's a 181x difference in the number of calls into
> >> mem_cgroup_css_rstat_flush.
> >>
> >> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> >> yielding a ton more iterations for some reason here?
> >>
> >> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> >>
> >> It's inlined, but I can place a probe into the loop:
> >>
> >>   7         for_each_possible_cpu(cpu) {
> >>   8                 raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> >>  10                 struct cgroup *pos = NULL;
> >>                     unsigned long flags;
> >>
> >>                     /*
> >>                      * The _irqsave() is needed because cgroup_rstat_lock is
> >>                      * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> >>                      * this lock with the _irq() suffix only disables interrupts on
> >>                      * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> >>                      * interrupts on both configurations. The _irqsave() ensures
> >>                      * that interrupts are always disabled and later restored.
> >>                      */
> >>                     raw_spin_lock_irqsave(cpu_lock, flags);
> >>                     while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> >>                             struct cgroup_subsys_state *css;
> >>
> >>                             cgroup_base_stat_flush(pos, cpu);
> >>  26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> >>
> >>  28                         rcu_read_lock();
> >>  29                         list_for_each_entry_rcu(css, &pos->rstat_css_list, rstat_css_node)
> >>  31                                 css->ss->css_rstat_flush(css, cpu);
> >>  32                         rcu_read_unlock();
> >>                     }
> >>  34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> >>
> >> I added probes on both line 26 and line 31 to catch the middle and inner loops.
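> >>
> >> For reference, the probes can be placed with perf's func:relative-line
> >> syntax, roughly like this (needs kernel debuginfo; the relative line
> >> numbers come from perf probe -L and vary between builds):
> >>
> >> perf probe -L cgroup_rstat_flush_locked
> >> perf probe --add 'cgroup_rstat_flush_locked:26'
> >> perf probe --add 'cgroup_rstat_flush_locked:31'
> >> perf stat -e probe:cgroup_rstat_flush_locked_L26 \
> >>           -e probe:cgroup_rstat_flush_locked_L31 -- /tmp/repro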
> >>
> >> * Slow:
> >>
> >> completed: 32.97s [manual / mem-stat + cpu-stat]
> >>
> >> Performance counter stats for '/tmp/derp':
> >>
> >> 4,702,570 probe:cgroup_rstat_flush_locked_L26
> >> 9,301,436 probe:cgroup_rstat_flush_locked_L31
> >>
> >> * Fast:
> >>
> >> completed: 0.17s [manual / mem-stat]
> >> completed: 0.34s [manual / cpu-stat]
> >>
> >> Performance counter stats for '/tmp/derp':
> >>
> >> 31,769 probe:cgroup_rstat_flush_locked_L26
> >> 62,849 probe:cgroup_rstat_flush_locked_L31
> >>
> >> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> >> lot more positions.
> >>
> >> I'm going to sign off for the week, but let me know if I should place
> >> any more probes to nail this down.
> > I spent some time looking into this and I think I landed on a fix:
> >
> > * https://github.com/bobrik/linux/commit/50b627811d54
> >
> > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > the runtime significantly.
>
> It looks like the overhead of mem_cgroup_flush_stats() may be the cause
> of the performance regression. So what version of the Linux kernel are
> you using? From the patch listed in the URL above, it doesn't seem like
> you are using the latest kernel.
>
> The latest upstream kernel already has patches that reduce rstat
> flushing overhead, like commit 11192d9c124 ("memcg: flush stats only if
> updated"). Have you tried to reproduce this with the latest kernel?

We're on v6.1, which is the latest LTS (I mentioned it in the first
message). It already includes this patch, and my commit mentions it in
the very first line of the description.

I should've also mentioned commit fd25a9e0e23b ("memcg: unify memcg
stat flushing") from the same series, which my patch is partially
reverting.

2023-08-11 23:58:03

by Yosry Ahmed

[permalink] [raw]
Subject: Re: Expensive memory.stat + cpu.stat reads

On Fri, Aug 11, 2023 at 3:03 PM Ivan Babrou <[email protected]> wrote:
>
> On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <[email protected]> wrote:
> >
> > On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <[email protected]> wrote:
> > > > My understanding of mem-stat and cpu-stat is that they are independent
> > > > of each other. In theory, reading one shouldn't affect the performance
> > > > of reading the other. Since you are doing mem-stat and cpu-stat reads
> > > > repeatedly in a loop, it is likely that all the data is in the cache
> > > > most of the time, resulting in very fast processing times. If it
> > > > happens that the specific memory locations of the mem-stat and cpu-stat
> > > > data are such that reading one causes the other's data to be flushed
> > > > out of the cache and re-read from memory, you could see a significant
> > > > performance regression.
> > > >
> > > > It is one of the possible causes, but I may be wrong.
> > >
> > > Do you think it's somewhat similar to how iterating a matrix in rows
> > > is faster than in columns due to sequential vs random memory reads?
> > >
> > > * https://stackoverflow.com/q/9936132
> > > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > > * https://en.wikipedia.org/wiki/Loop_interchange
> > >
> > > I've had a similar suspicion and it would be good to confirm whether
> > > it's that or something else. I can probably collect perf counters for
> > > different runs, but I'm not sure which ones I'll need.
> > >
> > > In a similar vein, if we could come up with a tracepoint that would
> > > tell us the amount of work done (or any other relevant metric that
> > > would help) during rstat flushing, I can certainly collect that
> > > information as well for every reading combination.
> >
> > Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> > (discrete) and slow (combined) cases, I grabbed some stats for it:
> >
> > * Slow:
> >
> > completed: 19.43s [manual / mem-stat + cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > ^C
> > 00:12:55
> > usecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 1 | |
> > 128 -> 255 : 191 |************ |
> > 256 -> 511 : 590 |****************************************|
> > 512 -> 1023 : 186 |************ |
> > 1024 -> 2047 : 2 | |
> > 2048 -> 4095 : 0 | |
> > 4096 -> 8191 : 0 | |
> > 8192 -> 16383 : 504 |********************************** |
> > 16384 -> 32767 : 514 |********************************** |
> > 32768 -> 65535 : 3 | |
> > 65536 -> 131071 : 1 | |
> >
> > avg = 8852 usecs, total: 17633268 usecs, count: 1992
> >
> > * Fast:
> >
> > completed: 0.95s [manual / mem-stat]
> > completed: 0.05s [manual / cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > ^C
> > 00:13:27
> > usecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 499 |****************************************|
> > 8 -> 15 : 253 |******************** |
> > 16 -> 31 : 191 |*************** |
> > 32 -> 63 : 41 |*** |
> > 64 -> 127 : 12 | |
> > 128 -> 255 : 2 | |
> > 256 -> 511 : 2 | |
> > 512 -> 1023 : 0 | |
> > 1024 -> 2047 : 0 | |
> > 2048 -> 4095 : 0 | |
> > 4096 -> 8191 : 0 | |
> > 8192 -> 16383 : 34 |** |
> > 16384 -> 32767 : 21 |* |
> >
> > avg = 857 usecs, total: 904762 usecs, count: 1055
> >
> > The number of calls into cgroup_rstat_flush_locked differs, and the calls
> > are much slower in the slow case. There are also two bands in the slow
> > case, with the 8ms..32ms band accounting for half of the calls.
> >
> > For mem_cgroup_css_rstat_flush:
> >
> > * Slow:
> >
> > completed: 32.77s [manual / mem-stat + cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > ^C
> > 00:21:25
> > usecs : count distribution
> > 0 -> 1 : 93078 |* |
> > 2 -> 3 : 3397714 |****************************************|
> > 4 -> 7 : 1009440 |*********** |
> > 8 -> 15 : 168013 |* |
> > 16 -> 31 : 93 | |
> >
> > avg = 3 usecs, total: 17189289 usecs, count: 4668338
> >
> > * Fast:
> >
> > completed: 0.16s [manual / mem-stat]
> > completed: 0.04s [manual / cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > ^C
> > 00:21:57
> > usecs : count distribution
> > 0 -> 1 : 1441 |*** |
> > 2 -> 3 : 18780 |****************************************|
> > 4 -> 7 : 4826 |********** |
> > 8 -> 15 : 732 |* |
> > 16 -> 31 : 1 | |
> >
> > avg = 3 usecs, total: 89174 usecs, count: 25780
> >
> > There's a 181x difference in the number of calls into
> > mem_cgroup_css_rstat_flush.
> >
> > Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> > yielding a ton more iterations for some reason here?
> >
> > * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> >
> > It's inlined, but I can place a probe into the loop:
> >
> >   7         for_each_possible_cpu(cpu) {
> >   8                 raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> >  10                 struct cgroup *pos = NULL;
> >                     unsigned long flags;
> >
> >                     /*
> >                      * The _irqsave() is needed because cgroup_rstat_lock is
> >                      * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> >                      * this lock with the _irq() suffix only disables interrupts on
> >                      * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> >                      * interrupts on both configurations. The _irqsave() ensures
> >                      * that interrupts are always disabled and later restored.
> >                      */
> >                     raw_spin_lock_irqsave(cpu_lock, flags);
> >                     while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> >                             struct cgroup_subsys_state *css;
> >
> >                             cgroup_base_stat_flush(pos, cpu);
> >  26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> >
> >  28                         rcu_read_lock();
> >  29                         list_for_each_entry_rcu(css, &pos->rstat_css_list, rstat_css_node)
> >  31                                 css->ss->css_rstat_flush(css, cpu);
> >  32                         rcu_read_unlock();
> >                     }
> >  34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> >
> > I added probes on both line 26 and line 31 to catch the middle and inner loops.
> >
> > * Slow:
> >
> > completed: 32.97s [manual / mem-stat + cpu-stat]
> >
> > Performance counter stats for '/tmp/derp':
> >
> > 4,702,570 probe:cgroup_rstat_flush_locked_L26
> > 9,301,436 probe:cgroup_rstat_flush_locked_L31
> >
> > * Fast:
> >
> > completed: 0.17s [manual / mem-stat]
> > completed: 0.34s [manual / cpu-stat]
> >
> > Performance counter stats for '/tmp/derp':
> >
> > 31,769 probe:cgroup_rstat_flush_locked_L26
> > 62,849 probe:cgroup_rstat_flush_locked_L31
> >
> > It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> > lot more positions.
> >
> > I'm going to sign off for the week, but let me know if I should place
> > any more probes to nail this down.
>
> I spent some time looking into this and I think I landed on a fix:
>
> * https://github.com/bobrik/linux/commit/50b627811d54
>
> I'm not 100% sure if it's the right fix for the issue, but it reduces
> the runtime significantly.

Flushing the entire hierarchy in mem_cgroup_flush_stats() was added so
that concurrent flushers can just skip and let one flusher do the work
for everyone. We did this because we flush the stats in some paths
(like reclaim, refault, and dirty throttling) where there is sometimes
a lot of concurrency, which gives us a thundering herd problem on the
global cgroup rstat lock.

Maybe we can separate userspace reads from other flushers, such that
userspace reads flush the cgroup in question only, while in-kernel
flushers skip if someone else is flushing.
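
Roughly, the split I have in mind looks like this (the *_kernel/*_user
names and the flush-ongoing flag are made up for illustration, not
existing kernel API):

static atomic_t stats_flush_ongoing;	/* made-up flag for this sketch */

/* In-kernel flushers (reclaim, refault, dirty throttling): keep the
 * "skip if someone else is already flushing" behaviour on the root,
 * since here avoiding a thundering herd matters more than accuracy. */
static void mem_cgroup_flush_stats_kernel(void)
{
	if (atomic_xchg(&stats_flush_ongoing, 1))
		return;		/* someone else is flushing for us */
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_ongoing, 0);
}

/* Userspace readers (memory.stat and friends): always flush, but only
 * the subtree actually being read, the way cpu.stat already does. */
static void mem_cgroup_flush_stats_user(struct mem_cgroup *memcg)
{
	cgroup_rstat_flush(memcg->css.cgroup);
}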

There is also some inconsistency today as not all paths use
mem_cgroup_flush_stats() (see zswap charging function in
mm/memcontrol.c).

Separating userspace reads from in-kernel flushers would also help
because, for userspace reads, skipping the flush when someone else is
flushing can lead to inaccuracy (see [1]).

I would wait for Shakeel to weigh in here, since he introduced the
unified flushing.

[1]https://lore.kernel.org/lkml/[email protected]/

>
> We see a 50x decrease in the memory.stat + cpu.stat loop duration with
> the patch applied. The TL;DR is that memory.stat flushes all cgroups,
> while cpu.stat flushes just the subtree you're asking for. Both flush
> cpu and memory at the same time, since both are rstat based. See the
> description of the commit linked above for more details.
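>
> To illustrate, paraphrasing the v6.1 read paths rather than quoting
> them verbatim:
>
>     /* memory.stat: flushes the whole hierarchy from the root */
>     static void __mem_cgroup_flush_stats(void)
>     {
>             ...
>             cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
>             ...
>     }
>
>     /* cpu.stat: flushes only the cgroup being read */
>     void cgroup_base_stat_cputime_show(struct seq_file *seq)
>     {
>             struct cgroup *cgrp = seq_css(seq)->cgroup;
>             ...
>             cgroup_rstat_flush_hold(cgrp);
>             ...
>     }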
>
> There are two more graphs to add. I rebooted 4 servers, 2 of which
> received my patch and 2 of which served as the control. The reboot
> happened at around 05:00Z.
>
> First is the latency of scraping cadvisor, where you can clearly see
> that the control group is rising with the daily load, while the test
> group is staying mostly flat:
>
> * https://i.imgur.com/GMtzHIu.png
>
> Second is the CPU time spent by cadvisor, where you can see a similar picture:
>
> * https://i.imgur.com/LWHt14P.png
>
> Let me know what you think.
>

2023-08-12 00:42:58

by Yosry Ahmed

[permalink] [raw]
Subject: Re: Expensive memory.stat + cpu.stat reads

On Fri, Aug 11, 2023 at 4:43 PM Yosry Ahmed <[email protected]> wrote:
>
> On Fri, Aug 11, 2023 at 3:03 PM Ivan Babrou <[email protected]> wrote:
> >
> > On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <[email protected]> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <[email protected]> wrote:
> > > > > My understanding of mem-stat and cpu-stat is that they are independent
> > > > > of each other. In theory, reading one shouldn't affect the performance
> > > > > of reading the other. Since you are doing mem-stat and cpu-stat reads
> > > > > repeatedly in a loop, it is likely that all the data is in the cache
> > > > > most of the time, resulting in very fast processing times. If it
> > > > > happens that the specific memory locations of the mem-stat and cpu-stat
> > > > > data are such that reading one causes the other's data to be flushed
> > > > > out of the cache and re-read from memory, you could see a significant
> > > > > performance regression.
> > > > >
> > > > > It is one of the possible causes, but I may be wrong.
> > > >
> > > > Do you think it's somewhat similar to how iterating a matrix in rows
> > > > is faster than in columns due to sequential vs random memory reads?
> > > >
> > > > * https://stackoverflow.com/q/9936132
> > > > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > > > * https://en.wikipedia.org/wiki/Loop_interchange
> > > >
> > > > I've had a similar suspicion and it would be good to confirm whether
> > > > it's that or something else. I can probably collect perf counters for
> > > > different runs, but I'm not sure which ones I'll need.
> > > >
> > > > In a similar vein, if we could come up with a tracepoint that would
> > > > tell us the amount of work done (or any other relevant metric that
> > > > would help) during rstat flushing, I can certainly collect that
> > > > information as well for every reading combination.
> > >
> > > Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> > > (discrete) and slow (combined) cases, I grabbed some stats for it:
> > >
> > > * Slow:
> > >
> > > completed: 19.43s [manual / mem-stat + cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > > ^C
> > > 00:12:55
> > > usecs : count distribution
> > > 0 -> 1 : 0 | |
> > > 2 -> 3 : 0 | |
> > > 4 -> 7 : 0 | |
> > > 8 -> 15 : 0 | |
> > > 16 -> 31 : 0 | |
> > > 32 -> 63 : 0 | |
> > > 64 -> 127 : 1 | |
> > > 128 -> 255 : 191 |************ |
> > > 256 -> 511 : 590 |****************************************|
> > > 512 -> 1023 : 186 |************ |
> > > 1024 -> 2047 : 2 | |
> > > 2048 -> 4095 : 0 | |
> > > 4096 -> 8191 : 0 | |
> > > 8192 -> 16383 : 504 |********************************** |
> > > 16384 -> 32767 : 514 |********************************** |
> > > 32768 -> 65535 : 3 | |
> > > 65536 -> 131071 : 1 | |
> > >
> > > avg = 8852 usecs, total: 17633268 usecs, count: 1992
> > >
> > > * Fast:
> > >
> > > completed: 0.95s [manual / mem-stat]
> > > completed: 0.05s [manual / cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > > ^C
> > > 00:13:27
> > > usecs : count distribution
> > > 0 -> 1 : 0 | |
> > > 2 -> 3 : 0 | |
> > > 4 -> 7 : 499 |****************************************|
> > > 8 -> 15 : 253 |******************** |
> > > 16 -> 31 : 191 |*************** |
> > > 32 -> 63 : 41 |*** |
> > > 64 -> 127 : 12 | |
> > > 128 -> 255 : 2 | |
> > > 256 -> 511 : 2 | |
> > > 512 -> 1023 : 0 | |
> > > 1024 -> 2047 : 0 | |
> > > 2048 -> 4095 : 0 | |
> > > 4096 -> 8191 : 0 | |
> > > 8192 -> 16383 : 34 |** |
> > > 16384 -> 32767 : 21 |* |
> > >
> > > avg = 857 usecs, total: 904762 usecs, count: 1055
> > >
> > > The number of calls into cgroup_rstat_flush_locked differs, and the calls
> > > are much slower in the slow case. There are also two bands in the slow
> > > case, with the 8ms..32ms band accounting for half of the calls.
> > >
> > > For mem_cgroup_css_rstat_flush:
> > >
> > > * Slow:
> > >
> > > completed: 32.77s [manual / mem-stat + cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > > ^C
> > > 00:21:25
> > > usecs : count distribution
> > > 0 -> 1 : 93078 |* |
> > > 2 -> 3 : 3397714 |****************************************|
> > > 4 -> 7 : 1009440 |*********** |
> > > 8 -> 15 : 168013 |* |
> > > 16 -> 31 : 93 | |
> > >
> > > avg = 3 usecs, total: 17189289 usecs, count: 4668338
> > >
> > > * Fast:
> > >
> > > completed: 0.16s [manual / mem-stat]
> > > completed: 0.04s [manual / cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > > ^C
> > > 00:21:57
> > > usecs : count distribution
> > > 0 -> 1 : 1441 |*** |
> > > 2 -> 3 : 18780 |****************************************|
> > > 4 -> 7 : 4826 |********** |
> > > 8 -> 15 : 732 |* |
> > > 16 -> 31 : 1 | |
> > >
> > > avg = 3 usecs, total: 89174 usecs, count: 25780
> > >
> > > There's a 181x difference in the number of calls into
> > > mem_cgroup_css_rstat_flush.
> > >
> > > Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> > > yielding a ton more iterations for some reason here?
> > >
> > > * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> > >
> > > It's inlined, but I can place a probe into the loop:
> > >
> > >   7         for_each_possible_cpu(cpu) {
> > >   8                 raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> > >  10                 struct cgroup *pos = NULL;
> > >                     unsigned long flags;
> > >
> > >                     /*
> > >                      * The _irqsave() is needed because cgroup_rstat_lock is
> > >                      * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> > >                      * this lock with the _irq() suffix only disables interrupts on
> > >                      * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> > >                      * interrupts on both configurations. The _irqsave() ensures
> > >                      * that interrupts are always disabled and later restored.
> > >                      */
> > >                     raw_spin_lock_irqsave(cpu_lock, flags);
> > >                     while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> > >                             struct cgroup_subsys_state *css;
> > >
> > >                             cgroup_base_stat_flush(pos, cpu);
> > >  26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> > >
> > >  28                         rcu_read_lock();
> > >  29                         list_for_each_entry_rcu(css, &pos->rstat_css_list, rstat_css_node)
> > >  31                                 css->ss->css_rstat_flush(css, cpu);
> > >  32                         rcu_read_unlock();
> > >                     }
> > >  34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> > >
> > > I added probes on both line 26 and line 31 to catch the middle and inner loops.
> > >
> > > * Slow:
> > >
> > > completed: 32.97s [manual / mem-stat + cpu-stat]
> > >
> > > Performance counter stats for '/tmp/derp':
> > >
> > > 4,702,570 probe:cgroup_rstat_flush_locked_L26
> > > 9,301,436 probe:cgroup_rstat_flush_locked_L31
> > >
> > > * Fast:
> > >
> > > completed: 0.17s [manual / mem-stat]
> > > completed: 0.34s [manual / cpu-stat]
> > >
> > > Performance counter stats for '/tmp/derp':
> > >
> > > 31,769 probe:cgroup_rstat_flush_locked_L26
> > > 62,849 probe:cgroup_rstat_flush_locked_L31
> > >
> > > It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> > > lot more positions.
> > >
> > > I'm going to sign off for the week, but let me know if I should place
> > > any more probes to nail this down.
> >
> > I spent some time looking into this and I think I landed on a fix:
> >
> > * https://github.com/bobrik/linux/commit/50b627811d54
> >
> > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > the runtime significantly.
>
> Flushing the entire hierarchy in mem_cgroup_flush_stats() was added so
> that concurrent flushers can just skip and let one flusher do the work
> for everyone. We did this because we flush the stats in some paths
> (like reclaim, refault, and dirty throttling) where there is sometimes
> a lot of concurrency, which gives us a thundering herd problem on the
> global cgroup rstat lock.
>
> Maybe we can separate userspace reads from other flushers, such that
> userspace reads flush the cgroup in question only, while in-kernel
> flushers skip if someone else is flushing.
>
> There is also some inconsistency today as not all paths use
> mem_cgroup_flush_stats() (see zswap charging function in
> mm/memcontrol.c).
>
> Separating userspace reads from in-kernel flushers would also help
> because, for userspace reads, skipping the flush when someone else is
> flushing can lead to inaccuracy (see [1]).
>
> I would wait for Shakeel to weigh in here, since he introduced the
> unified flushing.
>
> [1]https://lore.kernel.org/lkml/[email protected]/
>

+Tejun Heo

There have been a lot of problems coming from this global rstat lock:
hard lockups (when we used to flush atomically), unified flushing
being expensive, skipping flushing being inaccurate, etc.

I wonder if it's time to rethink this lock and break it down into more
granular locks. Perhaps a per-cgroup lock, with a locking scheme where
you always lock a parent then a child, flush the child, unlock it, and
move on to the next child. This would allow concurrent flushing of
non-root cgroups. Even when flushing the root, if we flush all of its
children first without locking the root and only take the root lock
while flushing the top-level children, some level of concurrency can
still be achieved.

Maybe this is too complicated; I never tried to implement it, but I
have been bouncing this idea around in my head for a while now.
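
Just to make the idea concrete, something like the sketch below; the
per-cgroup rstat_lock field and the child-iteration helper are
imaginary, nothing like them exists today:

/* Flush one level: lock the parent first so each child can fold its
 * deltas into it, then lock/flush/unlock the children one by one. */
static void rstat_flush_children(struct cgroup *parent, int cpu)
{
	struct cgroup *child;

	spin_lock(&parent->rstat_lock);
	for_each_child(child, parent) {		/* simplified iteration */
		spin_lock(&child->rstat_lock);
		cgroup_base_stat_flush(child, cpu);	/* child -> parent */
		spin_unlock(&child->rstat_lock);
	}
	spin_unlock(&parent->rstat_lock);
}

Flushers of disjoint subtrees would then mostly not contend with each
other, instead of all serializing on one global lock.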

We can also split the update tree per controller. As far as I can tell,
there is no reason to flush cpu stats, for example, when someone wants
to read memory stats.

2023-08-12 02:57:15

by Shakeel Butt

[permalink] [raw]
Subject: Re: Expensive memory.stat + cpu.stat reads

Hi Ivan,

(sorry for the late response, I was away)

On Fri, Aug 11, 2023 at 3:35 PM Ivan Babrou <[email protected]> wrote:
[...]
> > > I spent some time looking into this and I think I landed on a fix:
> > >
> > > * https://github.com/bobrik/linux/commit/50b627811d54
> > >
> > > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > > the runtime significantly.

In your patch, can you try replacing mem_cgroup_flush_stats() with
mem_cgroup_flush_stats_ratelimited() instead of cgroup_rstat_flush()?
I want to see whether you observe any stale stats issues.
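
Something like the following on top of your patch (illustrative only:
the exact call site depends on your change, and
mem_cgroup_flush_stats_ratelimited() would need to be carried back to
your v6.1 tree):

-	cgroup_rstat_flush(memcg->css.cgroup);
+	mem_cgroup_flush_stats_ratelimited();

With that, reads only trigger a flush once enough time has passed since
the last one, which is exactly where stale stats could show up.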