Hello.
On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <[email protected]> wrote:
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of the object or page which is pinning it in memory, it's
> not a good idea to simply ignore it. It actually breaks the isolation
> between cgroups.
No doubt about accounting the memory if it's a significant amount.
> Let's account the consumed percpu memory to the parent cgroup.
Why did you choose charging to the parent of the created cgroup?
Should the charge go to the cgroup _that is creating_ the new memcg?
One reason is that there are throttling mechanisms for memory limits,
and those are better exercised when the actor and its memory artefact
are in the same cgroup, aren't they?
The second reason is based on the Delegation Containment example
(Documentation/admin-guide/cgroup-v2.rst)
> For an example, let's assume cgroups C0 and C1 have been delegated to
> user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> all processes under C0 and C1 belong to U0::
>
> ~~~~~~~~~~~~~ - C0 - C00
> ~ cgroup ~ \ C01
> ~ hierarchy ~
> ~~~~~~~~~~~~~ - C1 - C10
Thanks to the delegation permissions, a task running in C0 creating a cgroup
in C1 would deplete C1's supply, victimizing the tasks inside C1.
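
To make the concern concrete, here is a minimal userspace sketch (the
/sys/fs/cgroup/C1 path and the loop count are only illustrative; it assumes
the memory controller is enabled on C1 and that the proposed per-parent
charging is in place): a task that runs in C0 but has delegated write access
to C1 creates many empty cgroups under C1 and watches C1's memory.current
grow with internal memcg structures it never uses itself.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static long read_counter(const char *path)
{
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        char name[128];
        int i;

        printf("C1 memory.current before: %ld\n",
               read_counter("/sys/fs/cgroup/C1/memory.current"));

        /* Create many empty cgroups under C1 from a task confined to C0. */
        for (i = 0; i < 1000; i++) {
                snprintf(name, sizeof(name), "/sys/fs/cgroup/C1/burst-%d", i);
                if (mkdir(name, 0755))
                        perror(name);
        }

        printf("C1 memory.current after:  %ld\n",
               read_counter("/sys/fs/cgroup/C1/memory.current"));
        return 0;
}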
Thanks,
Michal
On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <[email protected]> wrote:
> Hello.
>
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <[email protected]> wrote:
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of the object or page which is pinning it in memory, it's
> > not a good idea to simply ignore it. It actually breaks the isolation
> > between cgroups.
> No doubt about accounting the memory if it's a significant amount.
>
> > Let's account the consumed percpu memory to the parent cgroup.
> Why did you choose charging to the parent of the created cgroup?
>
> Should the charge go to the cgroup _that is creating_ the new memcg?
>
> One reason is that there are throttling mechanisms for memory limits,
> and those are better exercised when the actor and its memory artefact
> are in the same cgroup, aren't they?
>
> The second reason is based on the Delegation Containment example
> (Documentation/admin-guide/cgroup-v2.rst)
>
> > For an example, let's assume cgroups C0 and C1 have been delegated to
> > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > all processes under C0 and C1 belong to U0::
> >
> > ~~~~~~~~~~~~~ - C0 - C00
> > ~ cgroup ~ \ C01
> > ~ hierarchy ~
> > ~~~~~~~~~~~~~ - C1 - C10
>
> Thanks to the delegation permissions, a task running in C0 creating a cgroup
> in C1 would deplete C1's supply, victimizing the tasks inside C1.
These week-old issues appear to be significant. Roman? Or someone
else?
On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <[email protected]> wrote:
>
> > Hello.
> >
> > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <[email protected]> wrote:
> > > Because the size of memory cgroup internal structures can dramatically
> > > exceed the size of the object or page which is pinning it in memory, it's
> > > not a good idea to simply ignore it. It actually breaks the isolation
> > > between cgroups.
> > No doubt about accounting the memory if it's a significant amount.
> >
> > > Let's account the consumed percpu memory to the parent cgroup.
> > Why did you choose charging to the parent of the created cgroup?
> >
> > Should the charge go to the cgroup _that is creating_ the new memcg?
> >
> > One reason is that there are throttling mechanisms for memory limits,
> > and those are better exercised when the actor and its memory artefact
> > are in the same cgroup, aren't they?
Hi!
In general, yes. But in this case I think it wouldn't be a good idea:
most often cgroups are created by a centralized daemon (systemd),
which is usually located in the root cgroup. Even if it's not located in
the root cgroup, limiting its memory will likely affect the whole system,
even if only one specific limit was reached.
If there is a containerized workload which creates sub-cgroups,
charging its parent cgroup is perfectly effective.
And the opposite: if we charge the cgroup of the process which created
a cgroup, we won't cover the most common case: systemd creating
cgroups for all services in the system.
> >
> > The second reason is based on the Delegation Containment example
> > (Documentation/admin-guide/cgroup-v2.rst)
> >
> > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > all processes under C0 and C1 belong to U0::
> > >
> > > ~~~~~~~~~~~~~ - C0 - C00
> > > ~ cgroup ~ \ C01
> > > ~ hierarchy ~
> > > ~~~~~~~~~~~~~ - C1 - C10
> >
> > Thanks to the delegation permissions, a task running in C0 creating a cgroup
> > in C1 would deplete C1's supply, victimizing the tasks inside C1.
Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
in a completely different cgroup. In this particular case there are tons of other
ways a task from C00 can hurt C1.
>
> These week-old issues appear to be significant. Roman? Or someone
> else?
Oh, I'm sorry, somehow I missed this email.
Thank you for pointing it out!
Thanks!
On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin wrote:
> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <[email protected]> wrote:
> >
> > > Hello.
> > >
> > > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <[email protected]> wrote:
> > > > Because the size of memory cgroup internal structures can dramatically
> > > > exceed the size of the object or page which is pinning it in memory, it's
> > > > not a good idea to simply ignore it. It actually breaks the isolation
> > > > between cgroups.
> > > No doubt about accounting the memory if it's a significant amount.
> > >
> > > > Let's account the consumed percpu memory to the parent cgroup.
> > > Why did you choose charging to the parent of the created cgroup?
> > >
> > > Should the charge go to the cgroup _that is creating_ the new memcg?
> > >
> > > One reason is that there are throttling mechanisms for memory limits,
> > > and those are better exercised when the actor and its memory artefact
> > > are in the same cgroup, aren't they?
>
> Hi!
>
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's not located in
> the root cgroup, limiting its memory will likely affect the whole system,
> even if only one specific limit was reached.
> If there is a containerized workload which creates sub-cgroups,
> charging its parent cgroup is perfectly effective.
>
> And the opposite: if we charge the cgroup of the process which created
> a cgroup, we won't cover the most common case: systemd creating
> cgroups for all services in the system.
>
> > >
> > > The second reason is based on the Delegation Containment example
> > > (Documentation/admin-guide/cgroup-v2.rst)
> > >
> > > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > > all processes under C0 and C1 belong to U0::
> > > >
> > > > ~~~~~~~~~~~~~ - C0 - C00
> > > > ~ cgroup ~ \ C01
> > > > ~ hierarchy ~
> > > > ~~~~~~~~~~~~~ - C1 - C10
> > >
> > > Thanks to the delegation permissions, a task running in C0 creating a cgroup
> > > in C1 would deplete C1's supply, victimizing the tasks inside C1.
>
> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in a completely different cgroup. In this particular case there are tons of other
> ways a task from C00 can hurt C1.
>
> >
> > These week-old issues appear to be significant. Roman? Or someone
> > else?
>
> Oh, I'm sorry, somehow I missed this email.
> Thank you for pointing it out!
Hello, Michal!
Do you have any concerns left here, or is it good to go?
It seems that this is blocking the whole percpu accounting patchset from being merged,
and I still hope it can be squeezed into 5.9.
Thank you!
Roman
On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <[email protected]> wrote:
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's not located in
> the root cgroup, limiting its memory will likely affect the whole system,
> even if only one specific limit was reached.
The generic scheme would be (assuming the no internal process
constraint, in the root too)
` root or delegated root
` manager-cgroup (systemd, docker, ...)
` [aggregation group(s)]
` job-group-1
` ...
` job-group-n
> If there is a containerized workload which creates sub-cgroups,
> charging its parent cgroup is perfectly effective.
No dispute about this in either approach.
> And the opposite: if we charge the cgroup of the process which created
> a cgroup, we won't cover the most common case: systemd creating
> cgroups for all services in the system.
What I mean is that systemd should be charged for the cgroup creation.
Or more generally, any container/cgroup manager should be charged.
Consider a leak where it doesn't remove spent cgroups; IMO the effect
(throttling, reclaim) should be exercised on such a culprit.
I don't think the existing workload (job-group-i above) should
unnecessarily suffer when only the manager is acting up. Is that different
from your idea?
> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in a completely different cgroup. In this particular case there are tons of other
> ways a task from C00 can hurt C1.
I agree with that.
If I haven't overlooked anything, this would be the first case where
cgroup-related structures are accounted (please correct me).
So this is setting a precedent, should others prove useful to account
in the future too. I'm thinking about cpu_cgroup_css_alloc(), which can
also allocate a lot (with a big CPU count). The current approach would lead to
situations where matching cpu and memory csses need not exist, and that
would need special handling.
> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > These week-old issues appear to be significant. Roman? Or someone
> > else?
Despite my concerns, I don't think this is fundamental or impossible to
change later, so it doesn't prevent the inclusion in 5.9-rc1.
Regards,
Michal
On Tue, Aug 11, 2020 at 04:47:54PM +0200, Michal Koutny wrote:
> On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <[email protected]> wrote:
> > In general, yes. But in this case I think it wouldn't be a good idea:
> > most often cgroups are created by a centralized daemon (systemd),
> > which is usually located in the root cgroup. Even if it's not located in
> > the root cgroup, limiting its memory will likely affect the whole system,
> > even if only one specific limit was reached.
> The generic scheme would be (assuming the no internal process
> constraint, in the root too)
>
> ` root or delegated root
> ` manager-cgroup (systemd, docker, ...)
> ` [aggregation group(s)]
> ` job-group-1
> ` ...
> ` job-group-n
>
> > If there is a containerized workload which creates sub-cgroups,
> > charging its parent cgroup is perfectly effective.
> No dispute about this in either approach.
>
> > And the opposite: if we charge the cgroup of the process which created
> > a cgroup, we won't cover the most common case: systemd creating
> > cgroups for all services in the system.
> What I mean is that systemd should be charged for the cgroup creation.
> Or more generally, any container/cgroup manager should be charged.
> Consider a leak where it doesn't remove spent cgroups; IMO the effect
> (throttling, reclaim) should be exercised on such a culprit.
As I said, there are 2 problems with charging systemd (or a similar daemon):
1) It often belongs to the root cgroup.
2) OOMing or failing some random memory allocations is a bad way
to "communicate" a memory shortage to the daemon.
What we really want is to prevent creating a huge number of cgroups
(including dying cgroups) in some specific sub-tree(s).
OOMing the daemon or returning -ENOMEM to some random syscalls
will not help us reach the goal and will likely bring a bad
experience to a user.
In a generic case I don't see how we can charge the cgroup which
creates cgroups without solving these problems first.
And if there is a very special case where we have to limit it,
we can just add an additional layer:
` root or delegated root
` manager-parent-cgroup-with-a-limit
` manager-cgroup (systemd, docker, ...)
` [aggregation group(s)]
` job-group-1
` ...
` job-group-n
>
> I don't think the existing workload (job-group-i above) should
> unnecessarily suffer when only the manager is acting up. Is that different
> from your idea?
>
> > Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> > in a completely different cgroup. In this particular case there are tons of other
> > ways a task from C00 can hurt C1.
> I agree with that.
>
>
> If I haven't overlooked anything, this would be the first case where
> cgroup-related structures are accounted (please correct me).
> So this is setting a precedent, should others prove useful to account
> in the future too.
Right.
> I'm thinking about cpu_cgroup_css_alloc(), which can
> also allocate a lot (with a big CPU count). The current approach would lead to
> situations where matching cpu and memory csses need not exist, and that
> would need special handling.
I'd definitely charge the parent cgroup in all similar cases.
>
>
> > On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > > These week-old issues appear to be significant. Roman? Or someone
> > > else?
> Despite my concerns, I don't think this is fundamental or impossible to
> change later, so it doesn't prevent the inclusion in 5.9-rc1.
Thank you!
On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <[email protected]> wrote:
> As I said, there are 2 problems with charging systemd (or a similar daemon):
> 1) It often belongs to the root cgroup.
This doesn't hold for systemd (if we agree that systemd is the most
common case).
> 2) OOMing or failing some random memory allocations is a bad way
> to "communicate" a memory shortage to the daemon.
> What we really want is to prevent creating a huge number of cgroups
There's cgroup.max.descendants for that...
> (including dying cgroups) in some specific sub-tree(s).
...oh, so is this limiting the number of cgroups or limiting resources
used?
> OOMing the daemon or returning -ENOMEM to some random syscalls
> > will not help us reach the goal and will likely bring a bad
> experience to a user.
If we reach the situation when memory for cgroup operations is tight,
it'll disappoint the user either way.
My premise is that a running workload is more valuable than the
accompanying manager.
> In a generic case I don't see how we can charge the cgroup which
> creates cgroups without solving these problems first.
In my understanding, "onbehalveness" is a concept useful for various
kernel threads doing deferred work. Here it's promoted to user processes
managing cgroups.
> And if there is a very special case where we have to limit it,
> we can just add an additional layer:
>
> ` root or delegated root
> ` manager-parent-cgroup-with-a-limit
> ` manager-cgroup (systemd, docker, ...)
> ` [aggregation group(s)]
> ` job-group-1
> ` ...
> ` job-group-n
If the charge goes to the parent of the created cgroup (job-group-i here),
then the layer adds nothing. Am I missing something?
> I'd definitely charge the parent cgroup in all similar cases.
(This would mandate having the controllers on the unified hierarchy, which is
fine IMO.) Then the order in which the manager enables controllers on a subtree
(e.g. cpu,memory vs memory,cpu) would yield different charging.
This seems wrong^W confusing to me.
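
A tiny sketch of what that ordering looks like from userspace (the job-parent
path is hypothetical and assumes both controllers are available there): the
same controller set can be enabled with either sequence of writes to
cgroup.subtree_control, and each write is what triggers the css allocations
for the existing children.

#include <stdio.h>

/* Enable one controller for the children of a (hypothetical) subtree. */
static int enable(const char *ctrl)
{
        FILE *f = fopen("/sys/fs/cgroup/job-parent/cgroup.subtree_control", "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", ctrl);
        return fclose(f);
}

int main(void)
{
        /* One manager may enable cpu first, then memory ... */
        enable("+cpu");
        enable("+memory");
        /* ... another may do the reverse ("+memory", then "+cpu"). */
        return 0;
}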
Thanks,
Michal
On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <[email protected]> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).
Ok, it's better.
>
> > 2) OOMing or failing some random memory allocations is a bad way
> > to "communicate" a memory shortage to the daemon.
> > What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...
cgroup.max.descendants limits the number of live cgroups; it can't limit
the number of dying cgroups.
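
For illustration, both counts are visible in cgroup.stat (the path below is
just an example), which is why a limit on live descendants alone can't bound
the memory pinned by the dying ones:

#include <stdio.h>

int main(void)
{
        char key[64];
        long val;
        FILE *f = fopen("/sys/fs/cgroup/system.slice/cgroup.stat", "r");

        if (!f) {
                perror("cgroup.stat");
                return 1;
        }
        /* Typically prints nr_descendants and nr_dying_descendants. */
        while (fscanf(f, "%63s %ld", key, &val) == 2)
                printf("%-24s %ld\n", key, val);
        fclose(f);
        return 0;
}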
>
> > (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?
My scenario is simple: there is a large machine which has no memory
pressure for some time (e.g. it is idle or running a workload with a small
working set). Periodically running services create a lot of cgroups,
usually in system.slice. After some time a significant part of the whole
memory is consumed by dying cgroups and their percpu data.
Getting rid of it and reclaiming all that memory is not always possible
(percpu gets fragmented relatively easily) and is time consuming.
If we set memory.high on system.slice, it will create artificial
memory pressure once we're getting close to the limit. It will trigger
the reclaim of user pages and slab objects, so eventually we'll be able
to release dying cgroups as well.
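
A sketch of that mitigation (the limit value is arbitrary and the path
assumes a systemd-style cgroup v2 layout): writing memory.high on
system.slice is all it takes to get the artificial pressure described above.

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/fs/cgroup/system.slice/memory.high", "w");

        if (!f) {
                perror("memory.high");
                return 1;
        }
        /* Cap system.slice at ~8 GiB (arbitrary example); "max" removes it. */
        fprintf(f, "%llu\n", 8ULL << 30);
        return fclose(f) ? 1 : 0;
}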
You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If we don't take the indirectly pinned memory
into account, we likely won't apply enough memory pressure.
If we limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create thrashing in init.slice, followed by OOM.
I struggle to see how that makes a user's life better.
>
> > OOMing the daemon or returning -ENOMEM to some random syscalls
> > will not help us reach the goal and will likely bring a bad
> > experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.
The problem is that OOM-killing the accompanying manager won't release
resources or help to get rid of accumulated cgroups. So in the very
best case it will prevent new cgroups from being created (as well
as some other random operations from being performed). Most likely
the only way to "fix" this for a user will be to reboot the machine.
>
> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
>
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> >
> > ` root or delegated root
> > ` manager-parent-cgroup-with-a-limit
> > ` manager-cgroup (systemd, docker, ...)
> > ` [aggregation group(s)]
> > ` job-group-1
> > ` ...
> > ` job-group-n
> If the charge goes to the parent of the created cgroup (job-group-i here),
> then the layer adds nothing. Am I missing something?
Sorry, I was wrong here, please ignore this part.
>
> > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate having the controllers on the unified hierarchy, which is
> fine IMO.) Then the order in which the manager enables controllers on a subtree
> (e.g. cpu,memory vs memory,cpu) would yield different charging.
> This seems wrong^W confusing to me.
I agree it's confusing.
Thanks!
On Tue, Aug 11, 2020 at 12:32:28PM -0700, Roman Gushchin <[email protected]> wrote:
> If we limit init.slice (where systemd seems to reside), as you suggest,
> we'll eventually create thrashing in init.slice, followed by OOM.
> I struggle to see how that makes a user's life better.
> [...]
> The problem is that OOM-killing the accompanying manager won't release
> resources or help to get rid of accumulated cgroups.
I see your point now. I focused on the creator because of the live
memcgs.
When the issue is the dying memcgs (c), they were effectively released
by their creator but are pinned by whatever remained after their life
(LRU pages, slab->obj_cgroups). Since these pins were created _from
within_ such a child (c), they're most readily removable by reclaiming
(hierarchically) close to c. (It'd be achievable by limiting the lowest
common ancestor of the manager and its product (typically the root), but that is
clumsier and less effective.)
This is the reasoning that justifies the remote charge.
Thanks!
Michal