LinuxLists.cc - [RFD] cgroup: about multiple hierarchies

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Sorry, forgot to cc hch. Cc'ing him and quoting whole message.

On Tue, Feb 21, 2012 at 01:19:38PM -0800, Tejun Heo wrote:
> Hello, guys.
>
> I've been thinking about multiple hierarchy support in cgroup for a
> while, especially after Frederic's pending task counter patchset.
> This is a write up of what I've been thinking. I don't know what to
> do yet and simply continuing the current situation definitely is an
> option, so please read on and throw in your 20 Won (or whatever amount
> in whatever currency you want).
>
> * The problems.
>
> The support for multiple process hierarchies always struck me as
> rather strange. If you forget about the current cgroup controllers
> and their implementations, the *only* reason to support multiple
> hierarchies is if you want to apply resource limits based on different
> orthogonal categorizations.
>
> Documentation/cgroups.txt seems to be written with this consideration
> on mind. It's giving an example of applying limits accoring to two
> orthogonal categorizations - user groups (profressors, students...)
> and applications (WWW, NFS...). While it may sound like a valid use
> case, I'm very skeptical how useful or common mixing such orthogonal
> categorizations in a single setup would be.
>
> If support for multiple hierarchies comes for free, at least in terms
> of features, maybe it can be better but of course it isn't so. Any
> given cgroup subsystem (or controller) can only be applied to a single
> hierarchy, which makes sense for a lot of things - what would two
> different limits on the same resource from different hierarchies mean?
> But, there also are things which can be used and useful in all
> hierarchies - e.g. cgroup freezer and task counter.
>
> While the current cgroup implementation and conventions can probably
> allow admins and engineers to tailor cgroup configuration for a
> specific setup, it is very difficult to use in generic and automated
> way. I mean, who owns the freezer or task counter? If they're
> mounted on their own hierarchies, how should they be structured?
> Should the different hierarchies be structured such that they are
> projections of one unified hierarchy so that those generic mechanisms
> can be applied uniformly? If so, why do we need multiple hierarchies
> at all?
>
> A related limitation is that as different subsystems don't know which
> hierarchies they'll end up on, they can't cooperate. Wouldn't it make
> more sense if task counter is a separate thing watching the resources
> and triggers different actions as conifgured - be it failing forks or
> freezing?
>
> And yet another oddity is how cgroup handles nested cgroups - some
> care about nesting but others just treat both internal and leaf nodes
> equally. They don't care about the topology at all. This, too, can
> be fine if you approach things subsys by subsys and use them in
> different ways but if you try to combine them in generic way you get
> sucked into the lala land of whatevers.
>
> The following is a "best practices" document on using cgroups.
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> To me, it seems to demonstrate the rather ugly situation that the
> current cgroup is providing. Everyone should tip-toe around cgroup
> hierarchies and nobody has full knowledge or control over them.
> e.g. base system management (e.g. systemd) can't use freezer or task
> counter as someone else might want to use it for different hierarchy
> layout.
>
> It seems to me that cgroup interface is too complicated and inflexible
> at the same time to be useful in generic manner. Sure, it can be
> useful for setups individually crafted by engineers and admins to
> match specific sites or applications but as soon as you try to do
> something automatic and generic with it, there just are too many
> different scenarios and limitations to consider.
>
>
> * So, what to do?
>
> Heh, I don't know. IIRC, last year at LinuxCon Japan, I heard
> Christoph saying that the biggest problem w/ cgroup was that it was
> building completely separate hierarchies out of the traditional
> process hierarchies. After thinking about this stuff for a while, I
> fully agree with him. I think this whole thing should have been a
> layer over the process tree like sessions or program groups.
>
> Unfortunately, that ship sailed long ago and we gotta make do with
> what we have on our collective hands. Here are some paths that we can
> take.
>
> 1. We're screwed anyway. Just don't worry about it and continue down
> on this path. Can't get much worse, right?
>
> This approach has the apparent advantage of not having to do
> anything and is probably most likely to be taken. This isn't ideal
> but hey nothing is. :P
>
> 2. Make it more flexible (and likely more complex, unfortunately).
> Allow the utility type subsystems to be used in multiple
> hierarchies. The easiest and probably dirtiest way to achieve that
> would be embedding them into cgroup core.
>
> Thinking about doing this depresses me and it's not like I have a
> cheerful personality to begin with. :(
>
> 3. Head towards single hierarchy with the pie-in-the-sky goal of
> merging things into process hierarchy in some distant future.
>
> The first step would be herding people to use a unified hierarchy
> (ie. all subsystems mounted on a single cgroup tree) which is
> controlled by single entity in userland (be it systemd or cgroupd,
> cgroup-kit or whatever); however, even if we exclude supporting
> orthogonal categorizations, there are good number of non-trivial
> hurdles to clear before this can be realized.
>
> Most importantly, we would need to clean up how nesting is handled
> across different subsystems. Handling internal and leaf nodes as
> equals simply can't work. Membership should be recursive, and for
> subsystems which can't support proper nesting, the right thing to
> do would be somehow ensuring that only single node in the path from
> root to leaf is active for the controller. We may even have to
> introduce an alternative of operation to support this (yuck).
>
> This path would require the most amount of work and we would be
> excluding a feature - support for multiple orthogonal
> categorizations - which has been available till now, probably
> through deprecation process spanning years; however, this at least
> gives us hope that we may reach sanity in the end, how distant that
> end may be. Oh, hope. :)
>
> So, I mean, I don't know. What do other people think? Is this a
> unnecessary worry? Are people generally happy with the way things
> are? Lennart, Kay, what do you guys think?
>
> Thanks.
>
> --
> tejun

--
tejun

2012-02-22 13:31:18

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, 2012-02-21 at 13:19 -0800, Tejun Heo wrote:
> So, I mean, I don't know. What do other people think? Is this a
> unnecessary worry? Are people generally happy with the way things
> are? Lennart, Kay, what do you guys think?

FWIW I'm all for ripping the orthogonal hierarchy crap out, I hate it
just about as much as you do judging from your write-up.

Yes it will make some people unhappy, but I can live with that since my
life will be easier.. :-)

I'm not sure on your process hierarchy pie though, I rather like being
able to assign tasks to cgroups of my making without having to mirror
that in the process hierarchy.

Having seen what userspace does (libvirt in particular, I've still
managed to not get infected by the systemd crap) its utterly and
completely insane. Now I don't think any of my machines actually still
have libvirt on it, so I don't care if we break that either ;-)

Another thing I dislike about all the cgroup crap is all the dozens of
tiny controllers being proposed left right and center. Like WTF isn't
the hugetlb controller part of memcg? Its all memory, right?

Now I appreciate all this is new and exciting and Linux does the
evolutionary development thing so its bound to be a mess sometimes, but
shees..

So +1 on just ripping everything apart and trying again.

2012-02-22 13:35:45

by Glauber Costa

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

I am afraid I also don't have too much answers for your questions, but
I do have more questions =)

On 02/22/2012 01:21 AM, Tejun Heo wrote:
> Sorry, forgot to cc hch. Cc'ing him and quoting whole message.
>
> On Tue, Feb 21, 2012 at 01:19:38PM -0800, Tejun Heo wrote:
>> Hello, guys.
>>
>> I've been thinking about multiple hierarchy support in cgroup for a
>> while, especially after Frederic's pending task counter patchset.
>> This is a write up of what I've been thinking. I don't know what to
>> do yet and simply continuing the current situation definitely is an
>> option, so please read on and throw in your 20 Won (or whatever amount
>> in whatever currency you want).

I said that previously, but to this days the need for it still strikes
me. I mean: the usecase is pretty clear. But every single cgroup is
counting forks in a way or another. So for me, it would be better to
simply count it as a cgroup property and act on it accordingly.

But then, of course, if you have multiple hierarchies, in which of them
should you put that ? How ugly is it that you'll fail a fork, then check
a hierarchy - no problem - only to later found out that this was
configures in another hierarchy ?

>>
>> * The problems.
>>
>> The support for multiple process hierarchies always struck me as
>> rather strange. If you forget about the current cgroup controllers
>> and their implementations, the *only* reason to support multiple
>> hierarchies is if you want to apply resource limits based on different
>> orthogonal categorizations.
>>
>> Documentation/cgroups.txt seems to be written with this consideration
>> on mind. It's giving an example of applying limits accoring to two
>> orthogonal categorizations - user groups (profressors, students...)
>> and applications (WWW, NFS...). While it may sound like a valid use
>> case, I'm very skeptical how useful or common mixing such orthogonal
>> categorizations in a single setup would be.
>>
>> If support for multiple hierarchies comes for free, at least in terms
>> of features, maybe it can be better but of course it isn't so. Any
>> given cgroup subsystem (or controller) can only be applied to a single
>> hierarchy, which makes sense for a lot of things - what would two
>> different limits on the same resource from different hierarchies mean?
>> But, there also are things which can be used and useful in all
>> hierarchies - e.g. cgroup freezer and task counter.
>>
>> While the current cgroup implementation and conventions can probably
>> allow admins and engineers to tailor cgroup configuration for a
>> specific setup, it is very difficult to use in generic and automated
>> way. I mean, who owns the freezer or task counter? If they're
>> mounted on their own hierarchies, how should they be structured?
>> Should the different hierarchies be structured such that they are
>> projections of one unified hierarchy so that those generic mechanisms
>> can be applied uniformly? If so, why do we need multiple hierarchies
>> at all?
>>
>> A related limitation is that as different subsystems don't know which
>> hierarchies they'll end up on, they can't cooperate. Wouldn't it make
>> more sense if task counter is a separate thing watching the resources
>> and triggers different actions as conifgured - be it failing forks or
>> freezing?

Well, there is more. The use case we have in mind here, is Containers.
To span a container, we put process in cgroups - we don't care about
hierarchies, they are all the same - but then also need to put those
same process in different namespaces.

This is quite cumbersome, because those are two completely different
ways of achieving more or less the same thing, resource visibility. At
some point, we need to allow the container admin to interface with those
resources - traditionally done via /proc. And now the mess begins:

Part of /proc is namespace aware. So if you are reading your
/proc/mounts file, this is okay. But part of the data coming from there,
like /proc/cpuinfo, /proc/stat, or /proc/meminfo, really belong to
cgroups. And in some cases, information comes from more than one cgroup.
A consensus wasn't yet reached about what to do with it.

>> And yet another oddity is how cgroup handles nested cgroups - some
>> care about nesting but others just treat both internal and leaf nodes
>> equally.
To be honest, I don't like that very much. I think once you have a
directory-like structure, nesting of controlled resources should be
assumed. But since I don't understand why this is this way to begin
with, I'll leave it to someone else.

>> They don't care about the topology at all. This, too, can
>> be fine if you approach things subsys by subsys and use them in
>> different ways but if you try to combine them in generic way you get
>> sucked into the lala land of whatevers.
>>
>> The following is a "best practices" document on using cgroups.
>>
>> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>>
>> To me, it seems to demonstrate the rather ugly situation that the
>> current cgroup is providing. Everyone should tip-toe around cgroup
>> hierarchies and nobody has full knowledge or control over them.
>> e.g. base system management (e.g. systemd) can't use freezer or task
>> counter as someone else might want to use it for different hierarchy
>> layout.
>>
>> It seems to me that cgroup interface is too complicated and inflexible
>> at the same time to be useful in generic manner. Sure, it can be
>> useful for setups individually crafted by engineers and admins to
>> match specific sites or applications but as soon as you try to do
>> something automatic and generic with it, there just are too many
>> different scenarios and limitations to consider.
>>
>>
>> * So, what to do?
>>
>> Heh, I don't know. IIRC, last year at LinuxCon Japan, I heard
>> Christoph saying that the biggest problem w/ cgroup was that it was
>> building completely separate hierarchies out of the traditional
>> process hierarchies. After thinking about this stuff for a while, I
>> fully agree with him. I think this whole thing should have been a
>> layer over the process tree like sessions or program groups.
>>
>> Unfortunately, that ship sailed long ago and we gotta make do with
>> what we have on our collective hands. Here are some paths that we can
>> take.
>>
>> 1. We're screwed anyway. Just don't worry about it and continue down
>> on this path. Can't get much worse, right?
Wrong. =)

>>
>> This approach has the apparent advantage of not having to do
>> anything and is probably most likely to be taken. This isn't ideal
>> but hey nothing is. :P
>>
>> 2. Make it more flexible (and likely more complex, unfortunately).
It sounds like the guys on TV proposing more debt to end the debt crisis...

>> Allow the utility type subsystems to be used in multiple
>> hierarchies. The easiest and probably dirtiest way to achieve that
>> would be embedding them into cgroup core.
>>
>> Thinking about doing this depresses me and it's not like I have a
>> cheerful personality to begin with. :(
>>
>> 3. Head towards single hierarchy with the pie-in-the-sky goal of
>> merging things into process hierarchy in some distant future.
>>
>> The first step would be herding people to use a unified hierarchy
>> (ie. all subsystems mounted on a single cgroup tree) which is
>> controlled by single entity in userland (be it systemd or cgroupd,
>> cgroup-kit or whatever); however, even if we exclude supporting
>> orthogonal categorizations, there are good number of non-trivial
>> hurdles to clear before this can be realized.
>>
>> Most importantly, we would need to clean up how nesting is handled
>> across different subsystems. Handling internal and leaf nodes as
>> equals simply can't work.
Agree here.

>> Membership should be recursive, and for
>> subsystems which can't support proper nesting, the right thing to
>> do would be somehow ensuring that only single node in the path from
>> root to leaf is active for the controller. We may even have to
>> introduce an alternative of operation to support this (yuck).
>>
>> This path would require the most amount of work and we would be
>> excluding a feature - support for multiple orthogonal
>> categorizations - which has been available till now, probably
>> through deprecation process spanning years; however, this at least
>> gives us hope that we may reach sanity in the end, how distant that
>> end may be. Oh, hope. :)
>>
>> So, I mean, I don't know. What do other people think? Is this a
>> unnecessary worry? Are people generally happy with the way things
>> are? Lennart, Kay, what do you guys think?
>>

Well, most of the controllers, can be in practice enabled or disabled.
The mere fact that you live on a cgroup controller doesn't do anything
until you start to set limits - with the big exception being the cpu
controller - once you're there, it treats you as a sched entity. Maybe
we should ensure that all cgroups can be either on/off. Then after that,
we can group processes the way we want, and they may or may be not
resource constrained, depending on what you put on your files.

This can be combined with a mechanism to lock the tasks file for
removal, then maybe we can end up in a better awareness situation -
maybe it would be saner if you can be sure that once you put a task on a
group, it won't just disappear...

2012-02-22 13:38:57

by Glauber Costa

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On 02/22/2012 05:30 PM, Peter Zijlstra wrote:
> On Tue, 2012-02-21 at 13:19 -0800, Tejun Heo wrote:
>> So, I mean, I don't know. What do other people think? Is this a
>> unnecessary worry? Are people generally happy with the way things
>> are? Lennart, Kay, what do you guys think?
>
> FWIW I'm all for ripping the orthogonal hierarchy crap out, I hate it
> just about as much as you do judging from your write-up.
>
> Yes it will make some people unhappy, but I can live with that since my
> life will be easier.. :-)
>
> I'm not sure on your process hierarchy pie though, I rather like being
> able to assign tasks to cgroups of my making without having to mirror
> that in the process hierarchy.
>
> Having seen what userspace does (libvirt in particular, I've still
> managed to not get infected by the systemd crap) its utterly and
> completely insane. Now I don't think any of my machines actually still
> have libvirt on it, so I don't care if we break that either ;-)
>
> Another thing I dislike about all the cgroup crap is all the dozens of
> tiny controllers being proposed left right and center. Like WTF isn't
> the hugetlb controller part of memcg? Its all memory, right?
>
Right. But this is easy to solve.
People are usually pointing out that "Hey, but that's not how my
controller works, I need it to be slightly different here and there".
If we agree this is a bad thing - I think it is, we can at least adopt
as a policy not to take any patches that create another hierarchy unless
the need is utterly demonstrated.

2012-02-22 15:45:14

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, Feb 21, 2012 at 01:19:38PM -0800, Tejun Heo wrote:
> Hello, guys.
>
> I've been thinking about multiple hierarchy support in cgroup for a
> while, especially after Frederic's pending task counter patchset.
> This is a write up of what I've been thinking. I don't know what to
> do yet and simply continuing the current situation definitely is an
> option, so please read on and throw in your 20 Won (or whatever amount
> in whatever currency you want).
>
> * The problems.
>
> The support for multiple process hierarchies always struck me as
> rather strange. If you forget about the current cgroup controllers
> and their implementations, the *only* reason to support multiple
> hierarchies is if you want to apply resource limits based on different
> orthogonal categorizations.
>
> Documentation/cgroups.txt seems to be written with this consideration
> on mind. It's giving an example of applying limits accoring to two
> orthogonal categorizations - user groups (profressors, students...)
> and applications (WWW, NFS...). While it may sound like a valid use
> case, I'm very skeptical how useful or common mixing such orthogonal
> categorizations in a single setup would be.
>
> If support for multiple hierarchies comes for free, at least in terms
> of features, maybe it can be better but of course it isn't so. Any
> given cgroup subsystem (or controller) can only be applied to a single
> hierarchy, which makes sense for a lot of things - what would two
> different limits on the same resource from different hierarchies mean?
> But, there also are things which can be used and useful in all
> hierarchies - e.g. cgroup freezer and task counter.
>
> While the current cgroup implementation and conventions can probably
> allow admins and engineers to tailor cgroup configuration for a
> specific setup, it is very difficult to use in generic and automated
> way. I mean, who owns the freezer or task counter? If they're
> mounted on their own hierarchies, how should they be structured?
> Should the different hierarchies be structured such that they are
> projections of one unified hierarchy so that those generic mechanisms
> can be applied uniformly? If so, why do we need multiple hierarchies
> at all?
>
> A related limitation is that as different subsystems don't know which
> hierarchies they'll end up on, they can't cooperate. Wouldn't it make
> more sense if task counter is a separate thing watching the resources
> and triggers different actions as conifgured - be it failing forks or
> freezing?

For this particular example, I think we'd better have a file in which
a task can poll and get woken up when the task limit has been reached.
Then that task can decide to freeze or whatever.

>
> And yet another oddity is how cgroup handles nested cgroups - some
> care about nesting but others just treat both internal and leaf nodes
> equally. They don't care about the topology at all. This, too, can
> be fine if you approach things subsys by subsys and use them in
> different ways but if you try to combine them in generic way you get
> sucked into the lala land of whatevers.
>
> The following is a "best practices" document on using cgroups.
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> To me, it seems to demonstrate the rather ugly situation that the
> current cgroup is providing. Everyone should tip-toe around cgroup
> hierarchies and nobody has full knowledge or control over them.
> e.g. base system management (e.g. systemd) can't use freezer or task
> counter as someone else might want to use it for different hierarchy
> layout.
>
> It seems to me that cgroup interface is too complicated and inflexible
> at the same time to be useful in generic manner. Sure, it can be
> useful for setups individually crafted by engineers and admins to
> match specific sites or applications but as soon as you try to do
> something automatic and generic with it, there just are too many
> different scenarios and limitations to consider.
>
>
> * So, what to do?
>
> Heh, I don't know. IIRC, last year at LinuxCon Japan, I heard
> Christoph saying that the biggest problem w/ cgroup was that it was
> building completely separate hierarchies out of the traditional
> process hierarchies. After thinking about this stuff for a while, I
> fully agree with him. I think this whole thing should have been a
> layer over the process tree like sessions or program groups.
>
> Unfortunately, that ship sailed long ago and we gotta make do with
> what we have on our collective hands. Here are some paths that we can
> take.
>
> 1. We're screwed anyway. Just don't worry about it and continue down
> on this path. Can't get much worse, right?
>
> This approach has the apparent advantage of not having to do
> anything and is probably most likely to be taken. This isn't ideal
> but hey nothing is. :P

Thing is we have an ABI now and it has been there for a while now. Aren't
we stuck with it? I'm no big fan of that multiple hierarchies thing either
but now I fear we have to support it.

>
> 2. Make it more flexible (and likely more complex, unfortunately).
> Allow the utility type subsystems to be used in multiple
> hierarchies. The easiest and probably dirtiest way to achieve that
> would be embedding them into cgroup core.
>
> Thinking about doing this depresses me and it's not like I have a
> cheerful personality to begin with. :(

Another solution is to support a class of multi-bindable subsystems as in
this old patch from Paul:

https://lkml.org/lkml/2009/7/1/578

It sounds to me more healthy to iterate only over subsystems in fork/exit.
We probably don't want to add a new iteration over cgroups themselves
on these fast path.

2012-02-22 16:39:17

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, Feb 21, 2012 at 01:19:38PM -0800, Tejun Heo wrote:

[..]
> 3. Head towards single hierarchy with the pie-in-the-sky goal of
> merging things into process hierarchy in some distant future.
>
> The first step would be herding people to use a unified hierarchy
> (ie. all subsystems mounted on a single cgroup tree) which is
> controlled by single entity in userland (be it systemd or cgroupd,
> cgroup-kit or whatever); however, even if we exclude supporting
> orthogonal categorizations, there are good number of non-trivial
> hurdles to clear before this can be realized.

Apart from orthogonal categorizations, one advantage of of multiple
hierarchies is that you don't have to use a controller if you don't
want to. (Just don't create cgroup in controller's respective hierarchy).

This is not ideal but practically it might he helpful. In the sense
cgroups might not come cheap and different controllers might have different
overheads associated with it. For example, in blkio controller we can end
up idling a lot with increasing number of cgroups. In that case a better
way might be that use blkio controller cgroups selectively and that is
any workload which is destroying the performance of others, move it out
in a separate blkio group.

This is not ideal situation but that's how things currently are.

systemd by default creates in cgroups only cpu hierarchy (apart from named
systemd hiearchy to keep track of groups/processes). By default it does
not make use of other controllers and put any restrictions on
processes/services apart from cpu. Having a separate hiearchy for every
controller atleast easily allows that.

>
> Most importantly, we would need to clean up how nesting is handled
> across different subsystems. Handling internal and leaf nodes as
> equals simply can't work. Membership should be recursive, and for
> subsystems which can't support proper nesting, the right thing to
> do would be somehow ensuring that only single node in the path from
> root to leaf is active for the controller. We may even have to
> introduce an alternative of operation to support this (yuck).
>
> This path would require the most amount of work and we would be
> excluding a feature - support for multiple orthogonal
> categorizations - which has been available till now, probably
> through deprecation process spanning years; however, this at least
> gives us hope that we may reach sanity in the end, how distant that
> end may be. Oh, hope. :)

Yes this is something needs to be cleaned up. Everybody seems to have
dealt with hiearchy in its own way.

For blkio controller, initially we provided fully nested hiearchies like
cpu controller but then implementation became too complex (CFQ is already
complicated and implementing fully nested hiearchies made it much more
complicated without any significant gain). So, I converted it into
flat model where internally we treat the whole hierarchy flat. (It
might have been a bad decision though).

So for blkio controller we can convert it into fully nested hierarchy
at the expense of more complex code in CFQ. I think memory cgroup
controller provides both flat and hierarchical mode. Keeping it fully
hierarchical also increases the cost as we need to traverse lot more
pointers for simple things like nested stats. On a system having
both systemd and libvirt, every virtual machine is already 3-4 level
deep in cgroup hierarchy.

Trying to make all the controllers uniform in terms of their treatment
of cgroup hiearchy sounds like a good thing to do. Once that is done,
one can probably see if it is worth to put all the controllers in a
single hierarchy.

Thanks
Vivek

2012-02-22 16:57:27

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Wed, Feb 22, 2012 at 11:38:58AM -0500, Vivek Goyal wrote:

[..]
> >
> > Most importantly, we would need to clean up how nesting is handled
> > across different subsystems. Handling internal and leaf nodes as
> > equals simply can't work. Membership should be recursive, and for
> > subsystems which can't support proper nesting, the right thing to
> > do would be somehow ensuring that only single node in the path from
> > root to leaf is active for the controller. We may even have to
> > introduce an alternative of operation to support this (yuck).
> >
> > This path would require the most amount of work and we would be
> > excluding a feature - support for multiple orthogonal
> > categorizations - which has been available till now, probably
> > through deprecation process spanning years; however, this at least
> > gives us hope that we may reach sanity in the end, how distant that
> > end may be. Oh, hope. :)
>
> Yes this is something needs to be cleaned up. Everybody seems to have
> dealt with hiearchy in its own way.
>
> For blkio controller, initially we provided fully nested hiearchies like
> cpu controller but then implementation became too complex (CFQ is already
> complicated and implementing fully nested hiearchies made it much more
> complicated without any significant gain). So, I converted it into
> flat model where internally we treat the whole hierarchy flat. (It
> might have been a bad decision though).

IIRC, another reason to implement flat hierachy was that some people
believed that's more natural way of doing things. For example, when
you talk about cgroup, people ask, ok, give me a cgroup with 25% IO
bandwidth. Now this does not come naturally with completely nested
hierarchies where task and groups are treated at the same level. As
group's peer tasks share the bandwidth, and task come and go a group's
% share varies dynamically.

Again, it does not mean I am advocating flat hiearchy. I am just wondering
in case of fully nested hierarchies (task at same level as groups), how
does one explain it to a layman user who understands things in terms of
% of resources.

Just saying that your group has weight X does not mean much in absolute
terms. And % bandwidth achieved by group will vary dynamically. (Hey,
you told me that one can divide the system resources somewhat
deterministically. But bandwidth varying dynamically does not sound the
same).

Thanks
Vivek

2012-02-22 18:01:20

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hey, Peter.

On Wed, Feb 22, 2012 at 02:30:59PM +0100, Peter Zijlstra wrote:
> FWIW I'm all for ripping the orthogonal hierarchy crap out, I hate it
> just about as much as you do judging from your write-up.

I just don't get why it's there. Maybe, there can be some remote use
cases where orthogonal hierarchies can be useful but structuring whole
cgroup around that seems really extreme.

> I'm not sure on your process hierarchy pie though, I rather like being
> able to assign tasks to cgroups of my making without having to mirror
> that in the process hierarchy.

The only question is whether we want to allow cgroup hierarchy to be
completely orthogonal from process tree structure, which I don't think
is a good idea. It shouldn't affect trivial use cases. If not
explicitly configured, all tasks would live in a single root cgroup -
much like every process would belong to the same session if nobody
does setsid() since boot (or container).

I don't know how the implementation would turn out and it may as well
stay separate as it is now but I still think the topology should match
pstree.

Thanks.

--
tejun

2012-02-22 18:22:13

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hey, Frederic.

On Wed, Feb 22, 2012 at 04:45:04PM +0100, Frederic Weisbecker wrote:
> > A related limitation is that as different subsystems don't know which
> > hierarchies they'll end up on, they can't cooperate. Wouldn't it make
> > more sense if task counter is a separate thing watching the resources
> > and triggers different actions as conifgured - be it failing forks or
> > freezing?
>
> For this particular example, I think we'd better have a file in which
> a task can poll and get woken up when the task limit has been reached.
> Then that task can decide to freeze or whatever.

Yes, that may be a solution but to "guarantee" that the limit is never
breached, we need to stop it first somehow. Probably making freezing
the default behavior with userland notifier (inotify event should
suffice) should do, which we can't do now. :(

> > 1. We're screwed anyway. Just don't worry about it and continue down
> > on this path. Can't get much worse, right?
> >
> > This approach has the apparent advantage of not having to do
> > anything and is probably most likely to be taken. This isn't ideal
> > but hey nothing is. :P
>
> Thing is we have an ABI now and it has been there for a while now. Aren't
> we stuck with it? I'm no big fan of that multiple hierarchies thing either
> but now I fear we have to support it.

Well, yes and no. While maintaining userland ABI is very important,
its importance isn't infinite and there are different types of
userland ABIs. We definitely don't want to screw with syscalls. We
should keep userland visible dynamic files which are used by common
usertools stable at almost all costs. When it comes over to system
interface which is used mostly by base system tools, it can be a bit
flexible. If the ABI in question is an optional thing, we probably
can be slightly more flexible.

We of course can't change things drastically. It should be done
carefully with rather long deprecation period, but it can be done and
in fact isn't too uncommon. Stuff under /sysfs tends to be somewhat
volatile and sysfs itself went through several ABI incompatible
iterations.

So, we can transition in baby steps. e.g. we can first implement
proper nesting behavior without changing the default behavior and then
the base system can be updated to mount and control all subsystems by
default (with configuration opt-outs) so that the hierarchy reflects
pstree, effectively driving people away from multiple hierarchies and
we can implement new features assuming the new structure. After a few
years, the kernel can start whining about non-start hierarchies and
then eventually remove the support. It's a long process but
definitely doable.

> > 2. Make it more flexible (and likely more complex, unfortunately).
> > Allow the utility type subsystems to be used in multiple
> > hierarchies. The easiest and probably dirtiest way to achieve that
> > would be embedding them into cgroup core.
> >
> > Thinking about doing this depresses me and it's not like I have a
> > cheerful personality to begin with. :(
>
> Another solution is to support a class of multi-bindable subsystems as in
> this old patch from Paul:
>
> https://lkml.org/lkml/2009/7/1/578

Heh, yeah, this would be closer to the proper way to achieve
multi-attach but I can't help feeling that this just buries ourselves
deeper into s*it and we're already knee-deep. If multiple hierarchies
is an essential feature, maybe, but, if it's not, and I'm extremely
skeptical that it is, why the hell would we want to go that way?

> It sounds to me more healthy to iterate only over subsystems in fork/exit.
> We probably don't want to add a new iteration over cgroups themselves
> on these fast path.

Hmmm? Don't follow why this is relevant.

Thanks.

--
tejun

2012-02-22 18:33:57

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hey, Vivek.

On Wed, Feb 22, 2012 at 11:38:58AM -0500, Vivek Goyal wrote:
> Apart from orthogonal categorizations, one advantage of of multiple
> hierarchies is that you don't have to use a controller if you don't
> want to. (Just don't create cgroup in controller's respective hierarchy).
>
> This is not ideal but practically it might he helpful. In the sense
> cgroups might not come cheap and different controllers might have different
> overheads associated with it. For example, in blkio controller we can end
> up idling a lot with increasing number of cgroups. In that case a better
> way might be that use blkio controller cgroups selectively and that is
> any workload which is destroying the performance of others, move it out
> in a separate blkio group.

It should of course be possible to apply selective grouping on
different cgroups. It's like any other layers on top of pstree -
sessions, program groups or containers. Just group subtrees as you
see fit for each subsystem (there gotta be some fancy CS word for this
thing). As long as those grouped trees are from the same base tree,
we can represent it in a single tree, just like we can just annotate
sessions and program groups in pstree.

So, as long as you don't want something orthogonal from pstree, it
should be fine.

> So for blkio controller we can convert it into fully nested hierarchy
> at the expense of more complex code in CFQ. I think memory cgroup
> controller provides both flat and hierarchical mode. Keeping it fully
> hierarchical also increases the cost as we need to traverse lot more
> pointers for simple things like nested stats. On a system having
> both systemd and libvirt, every virtual machine is already 3-4 level
> deep in cgroup hierarchy.

I don't think every controller should implement full nesting and
sharing the same hierarchy doesn't require it. ie. if a controller
only wants to support flat hierarchy, just allow a single subgroup to
be active on any path between root and leaf. We can add a flag or
helpers to support such mode of operation and controllers themselves
can treat all cgroups equally.

Thanks.

--
tejun

2012-02-22 18:43:27

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hello,

On Wed, Feb 22, 2012 at 11:57:14AM -0500, Vivek Goyal wrote:
> IIRC, another reason to implement flat hierachy was that some people
> believed that's more natural way of doing things. For example, when
> you talk about cgroup, people ask, ok, give me a cgroup with 25% IO
> bandwidth. Now this does not come naturally with completely nested
> hierarchies where task and groups are treated at the same level. As
> group's peer tasks share the bandwidth, and task come and go a group's
> % share varies dynamically.

I don't see how that is more "natural". While I don't think
supporting full nesting is necessary for all controllers, the
semantics is very clear - build grouped trees according to active
configurations and distritbute resources top to bottom (network qdiscs
do exactly this). Flat case is proper degenerate case of nesting.
There's nothing more or less natural. It's just matter of trade off
between complexity and requirements.

> Again, it does not mean I am advocating flat hiearchy. I am just wondering
> in case of fully nested hierarchies (task at same level as groups), how
> does one explain it to a layman user who understands things in terms of
> % of resources.

I don't know whether we want nesting for block cgroup or not but at
the same time that doesn't really matter. Sharing hierarchy doesn't
require every controller supporting full hierarchy. I'm not sure how
the interface should be tho - maybe we can fail specifying config if
there already is an effective config encompassing that node or maybe
we can just break the existing config, I don't know.

Thank you.

--
tejun

2012-02-23 07:36:28

by Li Zefan

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

> Another thing I dislike about all the cgroup crap is all the dozens of
> tiny controllers being proposed left right and center. Like WTF isn't
> the hugetlb controller part of memcg? Its all memory, right?
>

We also have two network controllers - net_cls and net_prio.
Patches were sent to netdev only, so I didn't see them until
they hit mainline.

2012-02-23 07:51:47

by Serge E. Hallyn

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Quoting Glauber Costa ([email protected]):
> >>The support for multiple process hierarchies always struck me as
> >>rather strange. If you forget about the current cgroup controllers
> >>and their implementations, the *only* reason to support multiple
> >>hierarchies is if you want to apply resource limits based on different
> >>orthogonal categorizations.
> >>

Right, the old lwn writeup took the same approach:
http://lwn.net/Articles/236038/

> >>Documentation/cgroups.txt seems to be written with this consideration
> >>on mind. It's giving an example of applying limits accoring to two
> >>orthogonal categorizations - user groups (profressors, students...)
> >>and applications (WWW, NFS...). While it may sound like a valid use
> >>case, I'm very skeptical how useful or common mixing such orthogonal
> >>categorizations in a single setup would be.

My first inclination is to agree, but counterexamples do come to mind.

I could imagine a site saying "users can run (X) (say, ftpds), but the
memory consumed by all those ftpds must not be > 10% total RAM". At
the same time, they may run several apaches but want them all locked to
two of the cpus.

It might be worth a formal description of the new limits on use cases
such changes (both dropping support for orthogonal cgroups, and limiting
cgroups hierarchies to a mirror pstrees, separately) would bring.

To me personally the hierarchy limitation is more worrying. There have
been times when I've simply created cgroups for 'compile' and 'image
build', with particular cpu and memory limits. If I started a second
simultaneous compile, I'd want both compiles confined together. (That's
not to say the simplification might not be worth it, just bringing up
the other side)

-serge

2012-02-23 08:10:36

by Li Zefan

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

> Trying to make all the controllers uniform in terms of their treatment
> of cgroup hiearchy sounds like a good thing to do.

Agreed.

Apart from nesting cgroups, there're other inconsistencies.

- Some controllers disallow more than one cgroup layer. That's the new
net_prio controller, and I don't know why it's made so, but I guess
it's fine to eliminate this restriction.

- Some controllers move resource charges when a task is moved to
a different cgroup, but some don't?

- Some controllers disallow task attaching under some circumstances.
So if we have a single hierarchy with all subsystems, the chance
that attaching a task to a cgroup fails may be bigger.

> Once that is done,
> one can probably see if it is worth to put all the controllers in a
> single hierarchy.
>

2012-02-23 08:19:31

by Li Zefan

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

> The following is a "best practices" document on using cgroups.
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> To me, it seems to demonstrate the rather ugly situation that the
> current cgroup is providing. Everyone should tip-toe around cgroup
> hierarchies and nobody has full knowledge or control over them.
> e.g. base system management (e.g. systemd) can't use freezer or task
> counter as someone else might want to use it for different hierarchy
> layout.
>

This issue still exists if we allow a single hierarchy only, right?
Different cgroup users/applications have to struggle not to step
on each other's toe.

2012-02-23 09:41:56

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Wed, 2012-02-22 at 11:57 -0500, Vivek Goyal wrote:
>
> Again, it does not mean I am advocating flat hiearchy. I am just wondering
> in case of fully nested hierarchies (task at same level as groups), how
> does one explain it to a layman user who understands things in terms of
> % of resources.

If your complete control is % based then I would assume its a % of a %.
Simple enough.

If its bandwidth based then simply don't allow a child to consume more
bandwidth than its parent, also simple.

If your layman isn't capable of grokking that, he should stay the f*ck
away from it.

I'm really thinking that if we stick with the full hierarchical thing we
should mandate all controllers be fully hierarchical. And yes that
sucks, but so be it.

The scheduler thing tries to be completely hierarchical and yes it will
run into the ground if you push it hard enough simply because we're
hitting the limits of fixed point arithmetic, fractions can only go so
far, so the deeper you nest the crappier things get -- not that any
userspace cares about this.

2012-02-23 14:13:38

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Thu, 2012-02-23 at 10:41 +0100, Peter Zijlstra wrote:
> If your complete control is % based then I would assume its a % of a %.
> Simple enough.
>
> If its bandwidth based then simply don't allow a child to consume more
> bandwidth than its parent, also simple.
>
> If your layman isn't capable of grokking that, he should stay the f*ck
> away from it.

Fact is, the scheduler does both these things, so there's absolutely no
reason for other controllers not to do so too. Its the only sensible
thing if you want hierarchy.

My utter disregard for cgroups comes from having to actually implement a
controller for them, its a frigging nightmare. The systemd retards
mandating all this nonsense for booting a machine is completely bonghit
inspired and hasn't made me feel any better about it.

2012-02-23 17:29:23

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hey, Serge.

On Thu, Feb 23, 2012 at 07:45:26AM +0000, Serge E. Hallyn wrote:
> > >>Documentation/cgroups.txt seems to be written with this consideration
> > >>on mind. It's giving an example of applying limits accoring to two
> > >>orthogonal categorizations - user groups (profressors, students...)
> > >>and applications (WWW, NFS...). While it may sound like a valid use
> > >>case, I'm very skeptical how useful or common mixing such orthogonal
> > >>categorizations in a single setup would be.
>
> My first inclination is to agree, but counterexamples do come to mind.
>
> I could imagine a site saying "users can run (X) (say, ftpds), but the
> memory consumed by all those ftpds must not be > 10% total RAM". At
> the same time, they may run several apaches but want them all locked to
> two of the cpus.

Orthogonal hierarchies is a feature and it does allow use cases which
aren't possible to support otherwise. It's not too difficult to come
up with a use case crafted to exploit the feature. The main thing is
whether the added functionality justifies the complexity and other
disadvantages described earlier in the thread. To me, the scenarios
seem not realistic, common place or essential enough.

Also, it's not like there's only one problem to solve these issues.
It may not be exactly the same thing but that's just part of the
trade-off game we all play.

> It might be worth a formal description of the new limits on use cases
> such changes (both dropping support for orthogonal cgroups, and limiting
> cgroups hierarchies to a mirror pstrees, separately) would bring.

The word "formal" scares me. :)

> To me personally the hierarchy limitation is more worrying. There have
> been times when I've simply created cgroups for 'compile' and 'image
> build', with particular cpu and memory limits. If I started a second
> simultaneous compile, I'd want both compiles confined together. (That's
> not to say the simplification might not be worth it, just bringing up
> the other side)

Yeah, that's an interesting point, but wouldn't something like the
following work too?

1. create_cgroup --cpu 40% --mem 20% screen
2. tell screen to create as many build screens you want
3. issue builds from those screens

To me, something like the above seems far more consistent with
everything else we have on the system than moving tasks around by
echoing pids to some sysfs file.

Thanks.

--
tejun

2012-02-23 17:33:55

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hello, Li.

On Thu, Feb 23, 2012 at 04:22:26PM +0800, Li Zefan wrote:
> > The following is a "best practices" document on using cgroups.
> >
> > http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
> >
> > To me, it seems to demonstrate the rather ugly situation that the
> > current cgroup is providing. Everyone should tip-toe around cgroup
> > hierarchies and nobody has full knowledge or control over them.
> > e.g. base system management (e.g. systemd) can't use freezer or task
> > counter as someone else might want to use it for different hierarchy
> > layout.
> >
>
> This issue still exists if we allow a single hierarchy only, right?
> Different cgroup users/applications have to struggle not to step
> on each other's toe.

Oh sure, having single hierarchy doesn't solve that problem but makes
it clear that there's single representation that kernel understands
and deals with. I think the problem now is that kernel tries to
multiplex multiple users. Unfortunately, it does that half-way and
badly and I think the nature of the problem doesn't really allow
proper muxed interface at kernel layer. So, I'm suggesting to let go
of the broken pretense and just have a single unified interfce and let
userland deal with resource allocation policies.

Thanks.

--
tejun

2012-02-23 18:47:19

by Serge Hallyn

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Quoting Tejun Heo ([email protected]):
> Hey, Serge.
>
> On Thu, Feb 23, 2012 at 07:45:26AM +0000, Serge E. Hallyn wrote:
> > > >>Documentation/cgroups.txt seems to be written with this consideration
> > > >>on mind. It's giving an example of applying limits accoring to two
> > > >>orthogonal categorizations - user groups (profressors, students...)
> > > >>and applications (WWW, NFS...). While it may sound like a valid use
> > > >>case, I'm very skeptical how useful or common mixing such orthogonal
> > > >>categorizations in a single setup would be.
> >
> > My first inclination is to agree, but counterexamples do come to mind.
> >
> > I could imagine a site saying "users can run (X) (say, ftpds), but the
> > memory consumed by all those ftpds must not be > 10% total RAM". At
> > the same time, they may run several apaches but want them all locked to
> > two of the cpus.
>
> Orthogonal hierarchies is a feature and it does allow use cases which

Of course. Note that while I used myself in the examples, I'm not
opposed to any of what you've suggested. Just trying to raise
discussion.

> aren't possible to support otherwise. It's not too difficult to come
> up with a use case crafted to exploit the feature. The main thing is
> whether the added functionality justifies the complexity and other

And (somehow) I think we need to get input from the users - the ones not
on lkml. There is an end-user summit coming up, right? Perhaps this
question should be floated there?

> disadvantages described earlier in the thread. To me, the scenarios
> seem not realistic, common place or essential enough.
>
> Also, it's not like there's only one problem to solve these issues.
> It may not be exactly the same thing but that's just part of the
> trade-off game we all play.
>
> > It might be worth a formal description of the new limits on use cases
> > such changes (both dropping support for orthogonal cgroups, and limiting
> > cgroups hierarchies to a mirror pstrees, separately) would bring.
>
> The word "formal" scares me. :)

The upside would be a clear explanation of what userspace can do to
work around the more limited kernel functionality.

> > To me personally the hierarchy limitation is more worrying. There have
> > been times when I've simply created cgroups for 'compile' and 'image
> > build', with particular cpu and memory limits. If I started a second
> > simultaneous compile, I'd want both compiles confined together. (That's
> > not to say the simplification might not be worth it, just bringing up
> > the other side)
>
> Yeah, that's an interesting point, but wouldn't something like the
> following work too?
>
> 1. create_cgroup --cpu 40% --mem 20% screen
> 2. tell screen to create as many build screens you want
> 3. issue builds from those screens

That works for a single user. Gets more complicated if you have multiple
users but still want to confine compiles differently from other workloads.

Still, we now have 'namespace attach', so even if we generally shadow
pstree with the cgroups, perhaps we could implement a cgroup transfer
much more cleanly than the current cgroup attach stuff.

Or, maybe it's just not something users would deem worthwhile. *I*
will be fine either way.

> To me, something like the above seems far more consistent with
> everything else we have on the system than moving tasks around by
> echoing pids to some sysfs file.

-serge

2012-02-23 19:41:36

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Wed, Feb 22, 2012 at 10:33:51AM -0800, Tejun Heo wrote:

[..]
>
> > So for blkio controller we can convert it into fully nested hierarchy
> > at the expense of more complex code in CFQ. I think memory cgroup
> > controller provides both flat and hierarchical mode. Keeping it fully
> > hierarchical also increases the cost as we need to traverse lot more
> > pointers for simple things like nested stats. On a system having
> > both systemd and libvirt, every virtual machine is already 3-4 level
> > deep in cgroup hierarchy.
>
> I don't think every controller should implement full nesting and
> sharing the same hierarchy doesn't require it. ie. if a controller
> only wants to support flat hierarchy, just allow a single subgroup to
> be active on any path between root and leaf. We can add a flag or
> helpers to support such mode of operation and controllers themselves
> can treat all cgroups equally.

I am not sure I understand "allow a single subgroup to be active on any
path"

So if a hierarchy looks as follows.

root
/ | \
g1 g2 g3
| |
g4 g5

So you are saying that just either g2 or g4 to be active in path 2 and
similiarly allow g3 or g5 to be active. IOW, if a task is in g5 and g3
is active group, and effectively task will be considered in g3? So in
above diagram if g1 and g4 and g3 are active groups, controller will
see them as.

root
/ | \
g1 g4 g3

Did I understand it right or you meant something else. But this is still
not flat and has 2 level of hierarchy. Tasks in root group and tasks in
children group (g1, g2 and g3) are different levels hence controller needs
to implement hierarchy. For it to be truly flat, it needs to look like
this.
pivot_point
/ | \ \
g1 g4 g3 root

Now the notion that only one group is active in each path from root to
leaf does not mean much.

Considering everything internally as flat, isn't it simpler. So cgroup
tree still might look hierarchical but actually controller treats it
as.
root
/ / | \ \
g1 g2 g3 g4 g5

Well, above is not exactly flat as has 2 level of hierarchy. That's blkio
controller views cgroup hierarhcy as follows, currently.

pivot_point
/ / | \ \ \
g1 g2 g3 g4 g5 root

Thanks
Vivek

2012-02-23 20:32:59

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Thu, Feb 23, 2012 at 03:59:44PM +0800, Li Zefan wrote:
> > Trying to make all the controllers uniform in terms of their treatment
> > of cgroup hiearchy sounds like a good thing to do.
>
> Agreed.
>
> Apart from nesting cgroups, there're other inconsistencies.
>
> - Some controllers disallow more than one cgroup layer. That's the new
> net_prio controller, and I don't know why it's made so, but I guess
> it's fine to eliminate this restriction.

You mean don't allow creating deeper levels in cgroup hierarcy? That will
fail with libvirt + systemd as they create much deeper levels. I had to
change that for blkio.

>
> - Some controllers move resource charges when a task is moved to
> a different cgroup, but some don't?

I think in case of some controllers it does not even apply. For cpu, blkio
resources are renewable, so there is no moving around of charges.

Thanks
Vivek

2012-02-23 21:39:10

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Thu, Feb 23, 2012 at 10:41:34AM +0100, Peter Zijlstra wrote:
> On Wed, 2012-02-22 at 11:57 -0500, Vivek Goyal wrote:
> >
> > Again, it does not mean I am advocating flat hiearchy. I am just wondering
> > in case of fully nested hierarchies (task at same level as groups), how
> > does one explain it to a layman user who understands things in terms of
> > % of resources.
>
> If your complete control is % based then I would assume its a % of a %.
> Simple enough.

But % of % will vary dynamically and not be static. So if root has got
100% of resources and we want 25% of that for a group, then hierarchy
might look as follows.

root
/ | \
T1 T2 g1

T1, T2 are tasks and g1 is the group needing 25% of root's resources. Now
number of tasks running in parallel to g1 will determine its effective %
and tasks come and go. So the only way to do this would be that move T1
and T2 in a child group under root and make sure new tasks don't show up
in root.

Otherwise creating a group under root does not ensure that you get minimum
% of resource. It just makes sure that you can't get more than 25% of
% resources when things are tight.

>
> If its bandwidth based then simply don't allow a child to consume more
> bandwidth than its parent, also simple.

In case of absolute limit, things are somewhat simpler. A group is not impacted
by its peer tasks/groups that much. Well, there is also an issue and that
is how do all the children of a group share the resources. So assume
following.

g1
/ | \
T1 T2 g2

Assume g1 has 100MB/s limit and g2 has 90MB/s limit too. Now how this
100MB/s is divided among T1, T2 and g2? Round robin or do proportional
division based on weights. I think for cpu scheduler it can do
proportional division as everything is implemented in single layer. For
blkio, trottling is stacked on top of proportional. So I guess, I can
do round robin between T1, T2 and g2 and also make sure total of T1, T2
and g2 does not cross g1's bandwidth.

So upper limit is not that big a issue. Proportional one does become
one with effective % varying dynamically.

Thanks
Vivek

2012-02-23 22:35:46

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Thu, Feb 23, 2012 at 04:38:47PM -0500, Vivek Goyal wrote:
> On Thu, Feb 23, 2012 at 10:41:34AM +0100, Peter Zijlstra wrote:
> > On Wed, 2012-02-22 at 11:57 -0500, Vivek Goyal wrote:
> > >
> > > Again, it does not mean I am advocating flat hiearchy. I am just wondering
> > > in case of fully nested hierarchies (task at same level as groups), how
> > > does one explain it to a layman user who understands things in terms of
> > > % of resources.
> >
> > If your complete control is % based then I would assume its a % of a %.
> > Simple enough.
>
> But % of % will vary dynamically and not be static. So if root has got
> 100% of resources and we want 25% of that for a group, then hierarchy
> might look as follows.

It is complex but semantics is pretty well defined. It should behave
exactly the same as HTB. Whether the complexity would be justifiable
is a different issue.

Thanks.

--
tejun

2012-02-23 22:38:38

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Hello,

On Thu, Feb 23, 2012 at 02:41:10PM -0500, Vivek Goyal wrote:
> Considering everything internally as flat, isn't it simpler. So cgroup
> tree still might look hierarchical but actually controller treats it
> as.
> root
> / / | \ \
> g1 g2 g3 g4 g5

I don't know. Mixing the above with controllers which implement
proper nesting makes my head explode (why is there a hierarchy at
all?). Root is always special anyway. Just treating root differently
and collapsing the rest of hierarchies should do, right?

Thanks.

--
tejun

2012-02-24 11:33:35

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Thu, 2012-02-23 at 16:38 -0500, Vivek Goyal wrote:
> > > Again, it does not mean I am advocating flat hiearchy. I am just wondering
> > > in case of fully nested hierarchies (task at same level as groups), how
> > > does one explain it to a layman user who understands things in terms of
> > > % of resources.
> >
> > If your complete control is % based then I would assume its a % of a %.
> > Simple enough.
>
> But % of % will vary dynamically and not be static. So if root has got
> 100% of resources and we want 25% of that for a group, then hierarchy
> might look as follows.
>
> root
> / | \
> T1 T2 g1
>
> T1, T2 are tasks and g1 is the group needing 25% of root's resources. Now
> number of tasks running in parallel to g1 will determine its effective %
> and tasks come and go. So the only way to do this would be that move T1
> and T2 in a child group under root and make sure new tasks don't show up
> in root.

Which is exactly that the scheduler stuff does.. so tough luck for the
sysad who can't grasp it.

> Otherwise creating a group under root does not ensure that you get minimum
> % of resource. It just makes sure that you can't get more than 25% of
> % resources when things are tight.

You never said anything about minimum resource guarantees in the initial
problem statement.

2012-02-26 05:00:06

by Konstantin Khlebnikov

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

Tejun Heo wrote:
> Sorry, forgot to cc hch. Cc'ing him and quoting whole message.
>
> On Tue, Feb 21, 2012 at 01:19:38PM -0800, Tejun Heo wrote:
>> Hello, guys.
>>
>> I've been thinking about multiple hierarchy support in cgroup for a
>> while, especially after Frederic's pending task counter patchset.
>> This is a write up of what I've been thinking. I don't know what to
>> do yet and simply continuing the current situation definitely is an
>> option, so please read on and throw in your 20 Won (or whatever amount
>> in whatever currency you want).
>>
>> * The problems.
>>
>> The support for multiple process hierarchies always struck me as
>> rather strange. If you forget about the current cgroup controllers
>> and their implementations, the *only* reason to support multiple
>> hierarchies is if you want to apply resource limits based on different
>> orthogonal categorizations.
>>
>> Documentation/cgroups.txt seems to be written with this consideration
>> on mind. It's giving an example of applying limits accoring to two
>> orthogonal categorizations - user groups (profressors, students...)
>> and applications (WWW, NFS...). While it may sound like a valid use
>> case, I'm very skeptical how useful or common mixing such orthogonal
>> categorizations in a single setup would be.
>>
>> If support for multiple hierarchies comes for free, at least in terms
>> of features, maybe it can be better but of course it isn't so. Any
>> given cgroup subsystem (or controller) can only be applied to a single
>> hierarchy, which makes sense for a lot of things - what would two
>> different limits on the same resource from different hierarchies mean?
>> But, there also are things which can be used and useful in all
>> hierarchies - e.g. cgroup freezer and task counter.
>>
>> While the current cgroup implementation and conventions can probably
>> allow admins and engineers to tailor cgroup configuration for a
>> specific setup, it is very difficult to use in generic and automated
>> way. I mean, who owns the freezer or task counter? If they're
>> mounted on their own hierarchies, how should they be structured?
>> Should the different hierarchies be structured such that they are
>> projections of one unified hierarchy so that those generic mechanisms
>> can be applied uniformly? If so, why do we need multiple hierarchies
>> at all?

We can keep orthogonal categorization in a single hierarchy, if we allow task
to live in several cgroups simultaneously, each controller in independent cgroup.
Task to cgroup links already organized through css, which can store any combination
of subsystems. I think it might be easier than current multiple hierarchies.

>>
>> A related limitation is that as different subsystems don't know which
>> hierarchies they'll end up on, they can't cooperate. Wouldn't it make
>> more sense if task counter is a separate thing watching the resources
>> and triggers different actions as conifgured - be it failing forks or
>> freezing?
>>
>> And yet another oddity is how cgroup handles nested cgroups - some
>> care about nesting but others just treat both internal and leaf nodes
>> equally. They don't care about the topology at all. This, too, can
>> be fine if you approach things subsys by subsys and use them in
>> different ways but if you try to combine them in generic way you get
>> sucked into the lala land of whatevers.
>>
>> The following is a "best practices" document on using cgroups.
>>
>> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>>
>> To me, it seems to demonstrate the rather ugly situation that the
>> current cgroup is providing. Everyone should tip-toe around cgroup
>> hierarchies and nobody has full knowledge or control over them.
>> e.g. base system management (e.g. systemd) can't use freezer or task
>> counter as someone else might want to use it for different hierarchy
>> layout.
>>
>> It seems to me that cgroup interface is too complicated and inflexible
>> at the same time to be useful in generic manner. Sure, it can be
>> useful for setups individually crafted by engineers and admins to
>> match specific sites or applications but as soon as you try to do
>> something automatic and generic with it, there just are too many
>> different scenarios and limitations to consider.
>>
>>
>> * So, what to do?
>>
>> Heh, I don't know. IIRC, last year at LinuxCon Japan, I heard
>> Christoph saying that the biggest problem w/ cgroup was that it was
>> building completely separate hierarchies out of the traditional
>> process hierarchies. After thinking about this stuff for a while, I
>> fully agree with him. I think this whole thing should have been a
>> layer over the process tree like sessions or program groups.

I agree too. Zombies can not live in cgroups, this is not fair!
It seems, to integrate cgroups into normal process hierarchies, we should
link cgroup-css with struct pid rather than struct task.
Struct pid always rcu-protected and well managed. This change should
simplify cgroup iteration and allows to drop ugly "use_task_css_set_links"
together with "css_set_lock" on fork/exit paths.

>>
>> Unfortunately, that ship sailed long ago and we gotta make do with
>> what we have on our collective hands. Here are some paths that we can
>> take.
>>
>> 1. We're screwed anyway. Just don't worry about it and continue down
>> on this path. Can't get much worse, right?
>>
>> This approach has the apparent advantage of not having to do
>> anything and is probably most likely to be taken. This isn't ideal
>> but hey nothing is. :P
>>
>> 2. Make it more flexible (and likely more complex, unfortunately).
>> Allow the utility type subsystems to be used in multiple
>> hierarchies. The easiest and probably dirtiest way to achieve that
>> would be embedding them into cgroup core.
>>
>> Thinking about doing this depresses me and it's not like I have a
>> cheerful personality to begin with. :(
>>
>> 3. Head towards single hierarchy with the pie-in-the-sky goal of
>> merging things into process hierarchy in some distant future.
>>
>> The first step would be herding people to use a unified hierarchy
>> (ie. all subsystems mounted on a single cgroup tree) which is
>> controlled by single entity in userland (be it systemd or cgroupd,
>> cgroup-kit or whatever); however, even if we exclude supporting
>> orthogonal categorizations, there are good number of non-trivial
>> hurdles to clear before this can be realized.
>>
>> Most importantly, we would need to clean up how nesting is handled
>> across different subsystems. Handling internal and leaf nodes as
>> equals simply can't work. Membership should be recursive, and for
>> subsystems which can't support proper nesting, the right thing to
>> do would be somehow ensuring that only single node in the path from
>> root to leaf is active for the controller. We may even have to
>> introduce an alternative of operation to support this (yuck).
>>
>> This path would require the most amount of work and we would be
>> excluding a feature - support for multiple orthogonal
>> categorizations - which has been available till now, probably
>> through deprecation process spanning years; however, this at least
>> gives us hope that we may reach sanity in the end, how distant that
>> end may be. Oh, hope. :)
>>
>> So, I mean, I don't know. What do other people think? Is this a
>> unnecessary worry? Are people generally happy with the way things
>> are? Lennart, Kay, what do you guys think?
>>
>> Thanks.
>>
>> --
>> tejun
>

2012-02-27 17:46:22

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Wed, Feb 22, 2012 at 10:22:07AM -0800, Tejun Heo wrote:
> Hey, Frederic.
>
> On Wed, Feb 22, 2012 at 04:45:04PM +0100, Frederic Weisbecker wrote:
> > > A related limitation is that as different subsystems don't know which
> > > hierarchies they'll end up on, they can't cooperate. Wouldn't it make
> > > more sense if task counter is a separate thing watching the resources
> > > and triggers different actions as conifgured - be it failing forks or
> > > freezing?
> >
> > For this particular example, I think we'd better have a file in which
> > a task can poll and get woken up when the task limit has been reached.
> > Then that task can decide to freeze or whatever.
>
> Yes, that may be a solution but to "guarantee" that the limit is never
> breached, we need to stop it first somehow. Probably making freezing
> the default behavior with userland notifier (inotify event should
> suffice) should do, which we can't do now. :(

The limit can't be breached because forks are rejected once we reached the
limit given by the user.

With this rejection, another task can take control of this and freeze the
cgroup.

>
> > > 1. We're screwed anyway. Just don't worry about it and continue down
> > > on this path. Can't get much worse, right?
> > >
> > > This approach has the apparent advantage of not having to do
> > > anything and is probably most likely to be taken. This isn't ideal
> > > but hey nothing is. :P
> >
> > Thing is we have an ABI now and it has been there for a while now. Aren't
> > we stuck with it? I'm no big fan of that multiple hierarchies thing either
> > but now I fear we have to support it.
>
> Well, yes and no. While maintaining userland ABI is very important,
> its importance isn't infinite and there are different types of
> userland ABIs. We definitely don't want to screw with syscalls. We
> should keep userland visible dynamic files which are used by common
> usertools stable at almost all costs. When it comes over to system
> interface which is used mostly by base system tools, it can be a bit
> flexible. If the ABI in question is an optional thing, we probably
> can be slightly more flexible.

But cgroups falls into the general purpose category to me. Not something
that was used only by a finite circle of a few well known and defined tools.

> We of course can't change things drastically. It should be done
> carefully with rather long deprecation period, but it can be done and
> in fact isn't too uncommon. Stuff under /sysfs tends to be somewhat
> volatile and sysfs itself went through several ABI incompatible
> iterations.
>
> So, we can transition in baby steps. e.g. we can first implement
> proper nesting behavior without changing the default behavior and then
> the base system can be updated to mount and control all subsystems by
> default (with configuration opt-outs) so that the hierarchy reflects
> pstree, effectively driving people away from multiple hierarchies and
> we can implement new features assuming the new structure. After a few
> years, the kernel can start whining about non-start hierarchies and
> then eventually remove the support. It's a long process but
> definitely doable.

Well, if we can I'll be glad.

>
> > > 2. Make it more flexible (and likely more complex, unfortunately).
> > > Allow the utility type subsystems to be used in multiple
> > > hierarchies. The easiest and probably dirtiest way to achieve that
> > > would be embedding them into cgroup core.
> > >
> > > Thinking about doing this depresses me and it's not like I have a
> > > cheerful personality to begin with. :(
> >
> > Another solution is to support a class of multi-bindable subsystems as in
> > this old patch from Paul:
> >
> > https://lkml.org/lkml/2009/7/1/578
>
> Heh, yeah, this would be closer to the proper way to achieve
> multi-attach but I can't help feeling that this just buries ourselves
> deeper into s*it and we're already knee-deep. If multiple hierarchies
> is an essential feature, maybe, but, if it's not, and I'm extremely
> skeptical that it is, why the hell would we want to go that way?

I don't know, it just depend what will happen on these multiple
hierarchies.

>
> > It sounds to me more healthy to iterate only over subsystems in fork/exit.
> > We probably don't want to add a new iteration over cgroups themselves
> > on these fast path.
>
> Hmmm? Don't follow why this is relevant.

If you make something a cgroup core feature instead of a subsystem and you
need to do something on these cgroups during forks, then you need to
iterate over these as well as the subsystems.

Typically adding some more loop on fork is not considered very welcome.

>
> Thanks.
>
> --
> tejun

2012-02-28 21:16:44

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Thu, Feb 23, 2012 at 02:34:57PM -0800, Tejun Heo wrote:
> On Thu, Feb 23, 2012 at 04:38:47PM -0500, Vivek Goyal wrote:
> > On Thu, Feb 23, 2012 at 10:41:34AM +0100, Peter Zijlstra wrote:
> > > On Wed, 2012-02-22 at 11:57 -0500, Vivek Goyal wrote:
> > > >
> > > > Again, it does not mean I am advocating flat hiearchy. I am just wondering
> > > > in case of fully nested hierarchies (task at same level as groups), how
> > > > does one explain it to a layman user who understands things in terms of
> > > > % of resources.
> > >
> > > If your complete control is % based then I would assume its a % of a %.
> > > Simple enough.
> >
> > But % of % will vary dynamically and not be static. So if root has got
> > 100% of resources and we want 25% of that for a group, then hierarchy
> > might look as follows.
>
> It is complex but semantics is pretty well defined. It should behave
> exactly the same as HTB. Whether the complexity would be justifiable
> is a different issue.

I don't know much about HTB but a quick read at internet seems to suggest
that hierarchy we setup is pretty static and does not change with more
task coming in/going out. That means share/configured bandwidth of each
queue in the hierarchy is fixed until and unless that tree is changed.

But in this case, if task and groups are treated at same level, things
are not static and % share will change dynamically.

Thanks
Vivek

2012-02-28 21:21:57

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, 2012-02-28 at 16:16 -0500, Vivek Goyal wrote:
>
> But in this case, if task and groups are treated at same level, things
> are not static and % share will change dynamically.

which is exactly how the scheduler stuff behaves for the proportional
bits.. so there's no reason not to do it too.

2012-02-28 21:35:46

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, Feb 28, 2012 at 10:21:40PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-02-28 at 16:16 -0500, Vivek Goyal wrote:
> >
> > But in this case, if task and groups are treated at same level, things
> > are not static and % share will change dynamically.
>
> which is exactly how the scheduler stuff behaves for the proportional
> bits.. so there's no reason not to do it too.

Yes this is how scheduler does to handle hierarchy. Treat task and group
at same level. Tejun was giving example of HTB and I was saying that there
class/queues or whatever, seem to be static and are not created
dynamically as tasks come in/go. So its not same.

So coming back to scheduler, handling tasks and groups at same level only
provides us with notion of priority for group. It does not provide any
notion of % (neither minimum, nor maximum). To calculate the % one needs
to know the proportioanal share/weight of all entities at same level and
currently number of entities vary hence % share can't be determined.

Whether it is a good thing or bad thing, I don't know. I think previous
design was allocating a group for every user. I guess, in that case we
will have fixed % share of each user (until and unless users are created/
removed).

So I don't know what's the right behavior. With this discussion, I am just
trying to make it explicit what to expect out of cgroup controllers. For
cpu controller, it is priority at the group level no fixed minimum/maximum
% shares. And that's a limitation of treating task and group at same level.

Thanks
Vivek

2012-02-28 21:44:11

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, 2012-02-28 at 16:35 -0500, Vivek Goyal wrote:
> For
> cpu controller, it is priority at the group level no fixed minimum/maximum
> % shares. And that's a limitation of treating task and group at same level.

Depends on what you mean by min/max %, you can do it on the group level
by using bandwidth caps (for max) or inverted (max on everybody else,
for min).

2012-02-28 21:54:16

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, 2012-02-28 at 16:35 -0500, Vivek Goyal wrote:
> Yes this is how scheduler does to handle hierarchy. Treat task and group
> at same level.

...

> Whether it is a good thing or bad thing, I don't know.

That's IMO what the cgroupfs interface provides for, if you do anything
different there's this shadow group that contains the tasks for which
you then have to provide extra parameter control.

Furthermore, by treating tasks and groups at the same level you can
create the extra group, but you can't do the reverse. So its the more
versatile solution as well.

> I think previous
> design was allocating a group for every user. I guess, in that case we
> will have fixed % share of each user (until and unless users are created/
> removed).

Not even, it depended on if the user had anything runnable or not. It
was very much like the current cgroup stuff if you create a cgroup for
each user and stick the tasks in.

The cpu-cgroup stuff is purely runnable based, so every wakeup/sleep
changes the entire weight distribution, yay! :-)

2012-02-28 21:54:53

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, Feb 28, 2012 at 10:43:54PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-02-28 at 16:35 -0500, Vivek Goyal wrote:
> > For
> > cpu controller, it is priority at the group level no fixed minimum/maximum
> > % shares. And that's a limitation of treating task and group at same level.
>
> Depends on what you mean by min/max %, you can do it on the group level
> by using bandwidth caps (for max) or inverted (max on everybody else,
> for min).

I was referring to using pure proportional controller. max bandwidth is
new and I am looking for a quick documentation file which describes
what are the knobs and how to use it. Did not find any in
Documentation/cgroups/. Is there any documentation available?

I am assuming that max are being specified for groups in some absolute
quantity. That is fine. It will not still be max %, as again for % you
need fixed number of entities at any level and that's not the case with
tasks.

Minimum for one group (max for everyone else) will also only work if
task and groups are not at same level.

I think the only way to get fixed % share is not to put task and group
at same level during system configuration.

Thanks
Vivek

2012-02-28 22:00:36

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, 2012-02-28 at 16:54 -0500, Vivek Goyal wrote:
> On Tue, Feb 28, 2012 at 10:43:54PM +0100, Peter Zijlstra wrote:
> > On Tue, 2012-02-28 at 16:35 -0500, Vivek Goyal wrote:
> > > For
> > > cpu controller, it is priority at the group level no fixed minimum/maximum
> > > % shares. And that's a limitation of treating task and group at same level.
> >
> > Depends on what you mean by min/max %, you can do it on the group level
> > by using bandwidth caps (for max) or inverted (max on everybody else,
> > for min).
>
> I was referring to using pure proportional controller. max bandwidth is
> new and I am looking for a quick documentation file which describes
> what are the knobs and how to use it. Did not find any in
> Documentation/cgroups/. Is there any documentation available?

Its written in C, its at kernel/sched/fair.c ;-)

> I am assuming that max are being specified for groups in some absolute
> quantity. That is fine. It will not still be max %, as again for % you
> need fixed number of entities at any level and that's not the case with
> tasks.
>
> Minimum for one group (max for everyone else) will also only work if
> task and groups are not at same level.

I'm really not seeing this.

> I think the only way to get fixed % share is not to put task and group
> at same level during system configuration.

Still doesn't matter, like said, its all runnable based. If a group has
0 runnable entities it doesn't exist (more or less).

2012-02-28 22:09:50

[permalink] [raw]

Subject: Re: [RFD] cgroup: about multiple hierarchies

On Tue, Feb 28, 2012 at 10:53:59PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-02-28 at 16:35 -0500, Vivek Goyal wrote:
> > Yes this is how scheduler does to handle hierarchy. Treat task and group
> > at same level.
>
> ...
>
> > Whether it is a good thing or bad thing, I don't know.
>
> That's IMO what the cgroupfs interface provides for, if you do anything
> different there's this shadow group that contains the tasks for which
> you then have to provide extra parameter control.
>
> Furthermore, by treating tasks and groups at the same level you can
> create the extra group, but you can't do the reverse. So its the more
> versatile solution as well.

Agreed that it is more versatile. And one can move all the tasks to a
new group to achieve what a shadow group will do.

The only thing is what is a good default. If we are thinking of dividing
resources in terms of % and writing a user space tool, then in default
model we just don't know what's the %. May be it is dynamically varying
% and should be shown accordingly.

Or if idea of minimum % proportional bandwidth is more natural, then
we shall have to change userspace and things like systemd to not run
any task in /. Then a user space tool can go through cgroup hierarchy
and calculate minimum % share of a group and display it.

>
> > I think previous
> > design was allocating a group for every user. I guess, in that case we
> > will have fixed % share of each user (until and unless users are created/
> > removed).
>
> Not even, it depended on if the user had anything runnable or not. It
> was very much like the current cgroup stuff if you create a cgroup for
> each user and stick the tasks in.
>
> The cpu-cgroup stuff is purely runnable based, so every wakeup/sleep
> changes the entire weight distribution, yay! :-)

:-). That's fine. If a group is not using its bandwidth because there is
no runnable task, then other groups get more cpu. I thought that's the
proportional definition.

Thanks
Vivek

2012-02-28 22:31:23