Hi,
While development is going on for cgroup and various controllers, we also
need a facility so that an admin/user can specify the group creation and
also specify the rules based on which tasks should be placed in respective
groups. Group creation part will be handled by libcg which is already
under development. We still need to tackle the issue of how to specify
the rules and how these rules are enforced (rules engine).
I have gathered a few views on how the rules engine could possibly be
implemented; I am listing them below.
Proposal 1
==========
Let a user-space daemon handle all of this. The daemon will open a netlink socket
and receive notifications for various kernel events. It will
also parse an appropriate admin-specified rules config file and place
processes in the right cgroup, based on the rules, as and when events happen.
I have written a prototype user-space program which does that. The program
can be found here. It is currently in very crude shape.
http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
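For concreteness, the rule-matching core of such a daemon might look roughly like
this (the "uid:<n> <cgroup>" rule syntax and the cgroup paths are made up for
illustration; the actual prototype above may differ):

```python
# Sketch of the rule-matching core such a daemon might contain.
# The rule syntax and cgroup paths are illustrative assumptions,
# not the prototype's actual format.

def parse_rules(text):
    """Parse 'uid:<n> <cgroup-path>' lines into a lookup table."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, cgroup = line.split()
        kind, value = key.split(":")
        rules[(kind, value)] = cgroup
    return rules

def classify(rules, uid, default="/sys/fs/cgroup/others"):
    """Pick the cgroup a task with this uid belongs in."""
    return rules.get(("uid", str(uid)), default)

def place_task(pid, cgroup):
    """Move a pid by writing it to the cgroup's tasks file (needs root)."""
    with open(cgroup + "/tasks", "w") as f:
        f.write(str(pid))
```

The daemon would call classify() on every fork/exec/setuid notification and
place_task() to carry out the move.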
Various people have raised two main issues with this approach.
- netlink is not a reliable protocol.
  - Messages can be dropped and one can lose a message. That means a
    newly forked process might never go into the right group as intended.
- How to handle delays in rule execution?
  - For example, if an "exec" happens, then by the time the process is moved to
    the right group, it might have forked off a few more processes or might
    have done quite some amount of memory allocation, which will be
    charged to the wrong group. Or, the newly exec'd process might get
    killed in the existing cgroup because of lack of memory (despite the
    fact that the destination cgroup has sufficient memory).
Proposal 2
==========
Implement one or more kernel modules which implement the rule engine.
A user-space program can parse the config files and pass them to the module.
The kernel will be patched only at select points to look for the rules (as
provided by the modules). Very minimal code runs inside the kernel if there
are no rules loaded.
Concerns:
- Rules can become complex and we don't want to handle that complexity in
kernel.
Pros:
- Reliable and precise movement of tasks in right cgroup based on rules.
Proposal 3
==========
What if additional parameters could be passed to system calls, so that one
can pass the destination cgroup as an additional parameter? Probably something
like the sys_indirect proposal. Maybe glibc can act as a wrapper to pass the
additional parameter so that applications don't need any modifications.
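In the meantime, something with a similar effect can be approximated purely in
user space by a wrapper that places the child into the destination cgroup between
fork and exec (a hypothetical helper for illustration, not the sys_indirect
interface itself):

```python
import os

def spawn_in_cgroup(cgroup_dir, argv):
    """Fork, move the child into cgroup_dir, then exec argv there.
    Hypothetical sketch: approximates 'exec with a destination cgroup'
    without kernel changes. Because the child only execs after it has
    been placed, its early allocations are charged to the right group."""
    pid = os.fork()
    if pid == 0:
        # child: join the destination cgroup before exec
        with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
            f.write(str(os.getpid()))
        os.execvp(argv[0], argv)
    return pid
```

This only helps for cooperating launchers, of course; it does nothing for
unmodified applications, which is what the glibc-wrapper idea is about.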
Concerns:
========
- It looks like the sys_indirect interface for passing extra flags was rejected.
- Requires extra work in glibc, which can also involve parsing of rule
  files. :-(
Proposal 4
==========
There are some vague thoughts about freezing the process or thread upon fork or
exec, and unfreezing it once the thread has been placed in the right cgroup.
Concerns:
========
- Requires a reliable netlink protocol; otherwise there is a possibility that
  a task never gets unfrozen.
- On what basis does one freeze a thread? There might not be any rules to
  process for that thread, and we would unnecessarily delay it.
Please provide your inputs regarding the best way to handle the
rules engine.
To me, letting the rules live in a separate module (or modules) seems a
reasonable way to move forward. It will provide reliable and timely
execution of rules, and by making it modular, we can keep most of the
complexity out of the core kernel code.
Thanks
Vivek
Vivek Goyal wrote:
> Hi,
>
> While development is going on for cgroup and various controllers, we also
> need a facility so that an admin/user can specify the group creation and
> also specify the rules based on which tasks should be placed in respective
> groups. Group creation part will be handled by libcg which is already
> under development. We still need to tackle the issue of how to specify
> the rules and how these rules are enforced (rules engine).
>
> I have gathered few views, with regards to how rule engine can possibly be
> implemented, I am listing these down.
>
> Proposal 1
> ==========
> Let user space daemon handle all that. Daemon will open a netlink socket
> and receive the notifications for various kernel events. Daemon will
> also parse appropriate admin specified rules config file and place the
> processes in right cgroup based on rules as and when events happen.
>
> I have written a prototype user space program which does that. Program
> can be found here. Currently it is in very crude shape.
>
> http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
>
> Various people have raised two main issues with this approach.
>
> - netlink is not a reliable protocol.
> - Messages can be dropped and one can lose a message. That means a
> newly forked process might never go into right group as meant.
>
> - How to handle delays in rule execution?
> - For example, if an "exec" happens and by the time process is moved to
> right group, it might have forked off few more processes or might
> have done quite some amount of memory allocation which will be
> charged to the wrong group. Or, newly exec'd process might get
> killed in existing cgroup because of lack of memory (despite the
> fact that destination cgroup has sufficient memory).
Right.
I think it is necessary to avoid these issues,
IMO particularly the second one (handling delays):
that issue can always happen.
> Proposal 2
> ==========
> Implement one or more kernel modules which will implement the rule engine.
> User space program can parse the config files and pass it to module.
> Kernel will be patched only on select points to look for the rules (as
> provided by modules). Very minimal code running inside the kernel if there
> are no rules loaded.
>
> Concerns:
> - Rules can become complex and we don't want to handle that complexity in
> kernel.
>
> Pros:
> - Reliable and precise movement of tasks in right cgroup based on rules.
>
> Proposal 3
> ==========
> How about if additional parameters can be passed to system calls and one
> can pass destination cgroup as additional parameter. Probably something
> like sys_indirect proposal. Maybe glibc can act as a wrapper to pass
> additional parameter so that applications don't need any modifications.
>
> Concerns:
> ========
> - Looks like sys_indirect interface for passing extra flags was rejected.
> - Requires extra work in glibc which can also involve parsing of rule
> files. :-(
>
> Proposal 4
> ==========
> Some vague thoughts are there regarding how about kind of freezing the
> process or thread upon fork, exec and unfreeze it once the thread has been
> placed in right cgroup.
>
> Concerns:
> ========
> - Requires reliable netlink protocol otherwise there is a possibility that
> a task never gets unfrozen.
> - On what basis does one freeze a thread. There might not be any rules to
> process for that thread we will unnecessarily delay it.
>
>
> Please provide your inputs regarding what's the best way to handle the
> rules engine.
>
> To me, letting the rules live in separate module/modules seems to be a
> reasonable way to move forward which will provide reliable and timely
> execution of rules and by making it modular, we can remove most of the
> complexity from core kernel code.
I'd agree with your opinion.
Strict movement of tasks is indispensable in enterprise scenarios.
Regards, Kazunaga Ikeno
On Tue, 1 Jul 2008 15:11:26 -0400
Vivek Goyal <[email protected]> wrote:
> Hi,
>
> While development is going on for cgroup and various controllers, we also
> need a facility so that an admin/user can specify the group creation and
> also specify the rules based on which tasks should be placed in respective
> groups. Group creation part will be handled by libcg which is already
> under development. We still need to tackle the issue of how to specify
> the rules and how these rules are enforced (rules engine).
>
> I have gathered few views, with regards to how rule engine can possibly be
> implemented, I am listing these down.
>
> Proposal 1
> ==========
> Let user space daemon handle all that. Daemon will open a netlink socket
> and receive the notifications for various kernel events. Daemon will
> also parse appropriate admin specified rules config file and place the
> processes in right cgroup based on rules as and when events happen.
>
> I have written a prototype user space program which does that. Program
> can be found here. Currently it is in very crude shape.
>
> http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
>
> Various people have raised two main issues with this approach.
>
> - netlink is not a reliable protocol.
> - Messages can be dropped and one can lose a message. That means a
> newly forked process might never go into right group as meant.
>
> - How to handle delays in rule execution?
> - For example, if an "exec" happens and by the time process is moved to
> right group, it might have forked off few more processes or might
> have done quite some amount of memory allocation which will be
> charged to the wrong group. Or, newly exec'd process might get
> killed in existing cgroup because of lack of memory (despite the
> fact that destination cgroup has sufficient memory).
>
Hmm, can't we rework the process event connector to use some reliable,
fast interface besides netlink? (I mean an interface like eventpoll.)
(Or enhance netlink? ;)
Because the "a child inherits its parent's group" rule is very strong, I think the number
of events we actually have to check is much smaller than the number we get reported.
Can't we add some filter/assumption here?
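To illustrate the kind of filtering being suggested, a listener could register an
event mask and have everything else discarded up front (the mask interface below is
hypothetical; the proc connector exposes no such knob today):

```python
# Hypothetical per-listener event mask, in the spirit of the filtering
# suggested above; nothing like this exists in the proc connector today.
EV_FORK, EV_EXEC, EV_UID, EV_EXIT = 0x1, 0x2, 0x4, 0x8

def wants(mask, event):
    """True if the listener subscribed to this event type."""
    return bool(mask & event)

def filter_events(mask, events):
    """Discard events the listener did not ask for, before any rule work."""
    return [e for e in events if wants(mask, e)]
```

A daemon interested only in exec and setuid events would then see far fewer
messages queued on its socket, as argued above.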
BTW, isn't the placement of proc_exec_connector() too late? It seems the memory for
creating the exec image is charged to the original group...
Thanks,
-Kame
On Thu, Jul 03, 2008 at 10:19:57AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 1 Jul 2008 15:11:26 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > Hi,
> >
> > While development is going on for cgroup and various controllers, we also
> > need a facility so that an admin/user can specify the group creation and
> > also specify the rules based on which tasks should be placed in respective
> > groups. Group creation part will be handled by libcg which is already
> > under development. We still need to tackle the issue of how to specify
> > the rules and how these rules are enforced (rules engine).
> >
> > I have gathered few views, with regards to how rule engine can possibly be
> > implemented, I am listing these down.
> >
> > Proposal 1
> > ==========
> > Let user space daemon handle all that. Daemon will open a netlink socket
> > and receive the notifications for various kernel events. Daemon will
> > also parse appropriate admin specified rules config file and place the
> > processes in right cgroup based on rules as and when events happen.
> >
> > I have written a prototype user space program which does that. Program
> > can be found here. Currently it is in very crude shape.
> >
> > http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
> >
> > Various people have raised two main issues with this approach.
> >
> > - netlink is not a reliable protocol.
> > - Messages can be dropped and one can lose a message. That means a
> > newly forked process might never go into right group as meant.
> >
> > - How to handle delays in rule execution?
> > - For example, if an "exec" happens and by the time process is moved to
> > right group, it might have forked off few more processes or might
> > have done quite some amount of memory allocation which will be
> > charged to the wrong group. Or, newly exec'd process might get
> > killed in existing cgroup because of lack of memory (despite the
> > fact that destination cgroup has sufficient memory).
> >
> Hmm, can't we rework the process event connector to use some reliable
> fast interface besides netlink ? (I mean an interface like eventpoll.)
> (Or enhance netlink ? ;)
I see following text in netlink man page.
"However, reliable transmissions from kernel to user are impossible in
any case. The kernel can’t send a netlink message if the socket buffer
is full: the message will be dropped and the kernel and the userspace
process will no longer have the same view of kernel state. It is up to
the application to detect when this happens (via the ENOBUFS error
returned by recvmsg(2)) and resynchronize."
So, at the end of the day, it looks like the unreliability comes from the
fact that we cannot allocate memory at that moment, so we discard the
packet.
Are there alternatives to dropping packets?
- Let the sender cache the packet and retry later. So maybe the netlink layer
  can return an error if the packet cannot be queued, and the connector can cache the
  event and try sending it later. (Hopefully the memory situation has improved by
  then, because of the OOM killer, some process exiting, or something else...)
  This looks like a band-aid for temporary-congestion kinds of
  problems. It will not help if the consumer is inherently slow and
  event generation is faster.
  This can probably be one possible enhancement to the connector, but at the end
  of the day, any kind of user-space daemon will have to accept the fact
  that packets can be dropped, leading to lost events; detect that situation
  (using ENOBUFS); and then let the admin know about it (logging). I am not sure
  what the admin is supposed to do after that.
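The sender-side cache-and-retry idea might look roughly like this (a sketch under
the assumption of a bounded pending queue; not actual connector code). Note how the
bound makes the band-aid limitation explicit: a persistently slow consumer still
forces drops eventually.

```python
from collections import deque

class RetryingSender:
    """Sketch: cache events whose delivery fails and retry them later,
    in arrival order. The queue is bounded, so an inherently slow
    consumer still loses events once the cache fills up."""

    def __init__(self, deliver, max_pending=1024):
        self.deliver = deliver      # callable: returns True on success
        self.pending = deque()
        self.max_pending = max_pending
        self.dropped = 0

    def send(self, event):
        self.flush()                # retry cached events first, in order
        if not self.deliver(event):
            if len(self.pending) >= self.max_pending:
                self.dropped += 1   # consumer too slow: event is lost
            else:
                self.pending.append(event)

    def flush(self):
        while self.pending and self.deliver(self.pending[0]):
            self.pending.popleft()
```

This helps with transient congestion but, as argued above, cannot make delivery
reliable in general.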
I am CCing Thomas Graf. He might have a better idea of netlink's limitations
and whether there is a way to overcome them.
>
> Because "a child inherits parent's" rule is very strong, I think the amount
> of events we have to check is much less than we get report. Can't we add some
> filter/assumption here ?
>
I am not sure if the proc connector currently allows filtering of the various
events like fork, exec, exit etc. From a quick look, it appears it
does not, but that can probably be worked out. Even then, it will just
help reduce the number of messages queued for user space on that socket;
it will not take away the fact that messages can be dropped under
memory pressure.
> BTW, the placement of proc_exec_connector() is not too late ? It seems memory for
> creating exec-image is charged to original group...
>
As of today that is what happens, because the newly exec'd process runs in the
same cgroup as its parent. But that's probably what we need to avoid.
For example, if an admin has created three cgroups "database", "browser" and
"others", and a user launches "firefox" from a shell (assuming the shell is running
originally in the "others" cgroup), then any memory allocation for firefox should
come from the "browser" cgroup and not from "others".
I am assuming that this will be a requirement for enterprise-class
systems. It would be good to know the experiences of people who are already
doing some kind of workload management.
Thanks
Vivek
On Thu, 3 Jul 2008 11:54:46 -0400
Vivek Goyal <[email protected]> wrote:
> On Thu, Jul 03, 2008 at 10:19:57AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 1 Jul 2008 15:11:26 -0400
> > Vivek Goyal <[email protected]> wrote:
> > > - How to handle delays in rule execution?
> > > - For example, if an "exec" happens and by the time process is moved to
> > > right group, it might have forked off few more processes or might
> > > have done quite some amount of memory allocation which will be
> > > charged to the wrong group. Or, newly exec'd process might get
> > > killed in existing cgroup because of lack of memory (despite the
> > > fact that destination cgroup has sufficient memory).
> > >
> > Hmm, can't we rework the process event connector to use some reliable
> > fast interface besides netlink ? (I mean an interface like eventpoll.)
> > (Or enhance netlink ? ;)
>
> I see following text in netlink man page.
>
> "However, reliable transmissions from kernel to user are impossible in
> any case. The kernel can't send a netlink message if the socket buffer
> is full: the message will be dropped and the kernel and the userspace
> process will no longer have the same view of kernel state. It is up to
> the application to detect when this happens (via the ENOBUFS error
> returned by recvmsg(2)) and resynchronize."
>
> So at the end of the day, it looks like unreliability comes from the
> fact that we can not allocate memory currently so we will discard the
> packet.
>
> Are there alternatives as compared to dropping packets?
>
If it's just a problem of memory allocation, preallocate the socket buffer and
use it later, like radix_tree_preload().
==
foo() {
	/* reserve the buffer up front, where failure can still be reported */
	if (preallocate())
		return -ENOBUFS;
	.......
	/* the notification itself can no longer fail for lack of memory */
	proc_xxxx_connector();
}
==
(This means setuid() would return -ENOBUFS, an undocumented error code.)
But the af_netlink layer has other causes of dropped packets:
1. copying the skb at broadcast.
2. receive buffer overrun.
(2) is not avoidable in the kernel.
> - Let sender cache the packet and retry later. So maybe netlink layer
> can return error if packet can not be queued and connector can cache the
> event and try sending it later. (Hopefully later memory situation became
> better because of OOM or some process exited or something else...).
>
> This looks like a band-aid to handle the temporary congestion kind of
> problems. Will not be able to help if consumer is inherently slow and
> event generation is faster.
>
> This probably can be one possible enhancement to connector, but at the end
> of the day, any kind of user space daemon will have to accept the fact
> that packets can be dropped, leading to lost events. Detect that situation
> (using ENOBUFS) and then let admin know about it (logging). I am not sure
> what admin is supposed to do after that.
>
I'm not either ;)
> I am CCing Thomas Graf. He might have a better idea of netlink limitations
> and is there a way to overcome these.
>
> >
> > Because "a child inherits parent's" rule is very strong, I think the amount
> > of events we have to check is much less than we get report. Can't we add some
> > filter/assumption here ?
> >
>
> I am not sure if proc connector currently allows filtering of various
> events like fork, exec, exit etc. In a quick look it looks like it
> does not. But probably that can be worked out. Even then, it will just
> help reduce the number of messages queued for user space on that socket
> but will not take away the fact that messages can be dropped under
> memory pressure.
>
agreed.
> > BTW, the placement of proc_exec_connector() is not too late ? It seems memory for
> > creating exec-image is charged to original group...
> >
>
> As of today it should happen because newly exec'd process will run into
> same cgroup as parent. But that's what probably we need to avoid.
I think so.
> For example, if an admin has created three cgroups "database", "browser"
> "others" and a user launches "firefox" from shell (assuming shell is running
> originally in "others" cgroup), then any memory allocation for firefox should
> come from "browser" cgroup and not from "others".
>
yes.
> I am assuming that this will be a requirement for enterprise class
> systems. Would be good to know the experiences of people who are already
> doing some kind of work load management.
>
Thanks,
-Kame
> Thanks
> Vivek
>
>> Because "a child inherits parent's" rule is very strong, I think the amount
>> of events we have to check is much less than we get report. Can't we add some
>> filter/assumption here ?
>>
>
> I am not sure if proc connector currently allows filtering of various
> events like fork, exec, exit etc. In a quick look it looks like it
> does not. But probably that can be worked out. Even then, it will just
> help reduce the number of messages queued for user space on that socket
> but will not take away the fact that messages can be dropped under
> memory pressure.
>
The proc connector doesn't support event filtering. We could easily add a
global event mask, but it is not straightforward, if not impossible, to add a
per-socket event mask.
Vivek Goyal wrote:
> On Thu, Jul 03, 2008 at 10:19:57AM +0900, KAMEZAWA Hiroyuki wrote:
>> On Tue, 1 Jul 2008 15:11:26 -0400
>> Vivek Goyal <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> While development is going on for cgroup and various controllers, we also
>>> need a facility so that an admin/user can specify the group creation and
>>> also specify the rules based on which tasks should be placed in respective
>>> groups. Group creation part will be handled by libcg which is already
>>> under development. We still need to tackle the issue of how to specify
>>> the rules and how these rules are enforced (rules engine).
>>>
>>> I have gathered few views, with regards to how rule engine can possibly be
>>> implemented, I am listing these down.
>>>
>>> Proposal 1
>>> ==========
>>> Let user space daemon handle all that. Daemon will open a netlink socket
>>> and receive the notifications for various kernel events. Daemon will
>>> also parse appropriate admin specified rules config file and place the
>>> processes in right cgroup based on rules as and when events happen.
>>>
>>> I have written a prototype user space program which does that. Program
>>> can be found here. Currently it is in very crude shape.
>>>
>>> http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
>>>
>>> Various people have raised two main issues with this approach.
>>>
>>> - netlink is not a reliable protocol.
>>> - Messages can be dropped and one can lose a message. That means a
>>> newly forked process might never go into right group as meant.
>>>
>>> - How to handle delays in rule execution?
>>> - For example, if an "exec" happens and by the time process is moved to
>>> right group, it might have forked off few more processes or might
>>> have done quite some amount of memory allocation which will be
>>> charged to the wrong group. Or, newly exec'd process might get
>>> killed in existing cgroup because of lack of memory (despite the
>>> fact that destination cgroup has sufficient memory).
>>>
>> Hmm, can't we rework the process event connector to use some reliable
>> fast interface besides netlink ? (I mean an interface like eventpoll.)
>> (Or enhance netlink ? ;)
>
> I see following text in netlink man page.
>
> "However, reliable transmissions from kernel to user are impossible in
> any case. The kernel can’t send a netlink message if the socket buffer
> is full: the message will be dropped and the kernel and the userspace
> process will no longer have the same view of kernel state. It is up to
> the application to detect when this happens (via the ENOBUFS error
> returned by recvmsg(2)) and resynchronize."
>
> So at the end of the day, it looks like unreliability comes from the
> fact that we can not allocate memory currently so we will discard the
> packet.
>
> Are there alternatives as compared to dropping packets?
>
> - Let sender cache the packet and retry later. So maybe netlink layer
> can return error if packet can not be queued and connector can cache the
> event and try sending it later. (Hopefully later memory situation became
> better because of OOM or some process exited or something else...).
>
> This looks like a band-aid to handle the temporary congestion kind of
> problems. Will not be able to help if consumer is inherently slow and
> event generation is faster.
>
> This probably can be one possible enhancement to connector, but at the end
> of the day, any kind of user space daemon will have to accept the fact
> that packets can be dropped, leading to lost events. Detect that situation
> (using ENOBUFS) and then let admin know about it (logging). I am not sure
> what admin is supposed to do after that.
>
> I am CCing Thomas Graf. He might have a better idea of netlink limitations
> and is there a way to overcome these.
>
One thing we did with the delay accounting framework was to add the ability for
clients to listen on a per-cpu basis; that helped us scale well (user-space
buffers per client, in turn per CPU).
>> Because "a child inherits parent's" rule is very strong, I think the amount
>> of events we have to check is much less than we get report. Can't we add some
>> filter/assumption here ?
>>
>
> I am not sure if proc connector currently allows filtering of various
> events like fork, exec, exit etc. In a quick look it looks like it
> does not. But probably that can be worked out. Even then, it will just
> help reduce the number of messages queued for user space on that socket
> but will not take away the fact that messages can be dropped under
> memory pressure.
>
>> BTW, the placement of proc_exec_connector() is not too late ? It seems memory for
>> creating exec-image is charged to original group...
>>
>
> As of today it should happen because newly exec'd process will run into
> same cgroup as parent. But that's what probably we need to avoid.
> For example, if an admin has created three cgroups "database", "browser"
> "others" and a user launches "firefox" from shell (assuming shell is running
> originally in "others" cgroup), then any memory allocation for firefox should
> come from "browser" cgroup and not from "others".
>
> I am assuming that this will be a requirement for enterprise class
> systems. Would be good to know the experiences of people who are already
> doing some kind of work load management.
CKRM had a kernel module for rule-based classification, called the rule-based
classification engine (rbce). We should consider a simple cgroups client that
can share a database from user space and use the fork callback for classification.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
On Tue, Jul 08, 2008 at 03:05:47PM +0530, Balbir Singh wrote:
> Vivek Goyal wrote:
> > On Thu, Jul 03, 2008 at 10:19:57AM +0900, KAMEZAWA Hiroyuki wrote:
> >> On Tue, 1 Jul 2008 15:11:26 -0400
> >> Vivek Goyal <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> While development is going on for cgroup and various controllers, we also
> >>> need a facility so that an admin/user can specify the group creation and
> >>> also specify the rules based on which tasks should be placed in respective
> >>> groups. Group creation part will be handled by libcg which is already
> >>> under development. We still need to tackle the issue of how to specify
> >>> the rules and how these rules are enforced (rules engine).
> >>>
> >>> I have gathered few views, with regards to how rule engine can possibly be
> >>> implemented, I am listing these down.
> >>>
> >>> Proposal 1
> >>> ==========
> >>> Let user space daemon handle all that. Daemon will open a netlink socket
> >>> and receive the notifications for various kernel events. Daemon will
> >>> also parse appropriate admin specified rules config file and place the
> >>> processes in right cgroup based on rules as and when events happen.
> >>>
> >>> I have written a prototype user space program which does that. Program
> >>> can be found here. Currently it is in very crude shape.
> >>>
> >>> http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
> >>>
> >>> Various people have raised two main issues with this approach.
> >>>
> >>> - netlink is not a reliable protocol.
> >>> - Messages can be dropped and one can lose a message. That means a
> >>> newly forked process might never go into right group as meant.
> >>>
> >>> - How to handle delays in rule execution?
> >>> - For example, if an "exec" happens and by the time process is moved to
> >>> right group, it might have forked off few more processes or might
> >>> have done quite some amount of memory allocation which will be
> >>> charged to the wrong group. Or, newly exec'd process might get
> >>> killed in existing cgroup because of lack of memory (despite the
> >>> fact that destination cgroup has sufficient memory).
> >>>
> >> Hmm, can't we rework the process event connector to use some reliable
> >> fast interface besides netlink ? (I mean an interface like eventpoll.)
> >> (Or enhance netlink ? ;)
> >
> > I see following text in netlink man page.
> >
> > "However, reliable transmissions from kernel to user are impossible in
> > any case. The kernel can’t send a netlink message if the socket buffer
> > is full: the message will be dropped and the kernel and the userspace
> > process will no longer have the same view of kernel state. It is up to
> > the application to detect when this happens (via the ENOBUFS error
> > returned by recvmsg(2)) and resynchronize."
> >
> > So at the end of the day, it looks like unreliability comes from the
> > fact that we can not allocate memory currently so we will discard the
> > packet.
> >
> > Are there alternatives as compared to dropping packets?
> >
> > - Let sender cache the packet and retry later. So maybe netlink layer
> > can return error if packet can not be queued and connector can cache the
> > event and try sending it later. (Hopefully later memory situation became
> > better because of OOM or some process exited or something else...).
> >
> > This looks like a band-aid to handle the temporary congestion kind of
> > problems. Will not be able to help if consumer is inherently slow and
> > event generation is faster.
> >
> > This probably can be one possible enhancement to connector, but at the end
> > of the day, any kind of user space daemon will have to accept the fact
> > that packets can be dropped, leading to lost events. Detect that situation
> > (using ENOBUFS) and then let admin know about it (logging). I am not sure
> > what admin is supposed to do after that.
> >
> > I am CCing Thomas Graf. He might have a better idea of netlink limitations
> > and is there a way to overcome these.
> >
>
> One thing we did with the delay accounting framework was to add the ability for
> clients to listen on a per-cpu basis, that helped us scale well (user space
> buffers per-client in turn per-cpu)
>
Ok, I will look into it. But another key question still remains: if we
do it in user space, then there is no easy way to avoid delays in the
execution of rules.
> >> Because "a child inherits parent's" rule is very strong, I think the amount
> >> of events we have to check is much less than we get report. Can't we add some
> >> filter/assumption here ?
> >>
> >
> > I am not sure if proc connector currently allows filtering of various
> > events like fork, exec, exit etc. In a quick look it looks like it
> > does not. But probably that can be worked out. Even then, it will just
> > help reduce the number of messages queued for user space on that socket
> > but will not take away the fact that messages can be dropped under
> > memory pressure.
> >
> >> BTW, the placement of proc_exec_connector() is not too late ? It seems memory for
> >> creating exec-image is charged to original group...
> >>
> >
> > As of today it should happen because newly exec'd process will run into
> > same cgroup as parent. But that's what probably we need to avoid.
> > For example, if an admin has created three cgroups "database", "browser"
> > "others" and a user launches "firefox" from shell (assuming shell is running
> > originally in "others" cgroup), then any memory allocation for firefox should
> > come from "browser" cgroup and not from "others".
> >
> > I am assuming that this will be a requirement for enterprise class
> > systems. Would be good to know the experiences of people who are already
> > doing some kind of work load management.
>
> CKRM had a kernel module for rule based classification - called rule based
> classification engine (rbce). We should consider a simple cgroups client that
> can share a database from user space and use the fork callback for classification.
Hmm..., had a quick look and CKRM implemented the rule based engine as a
kernel module.
Initially I thought of providing rules based on uid, gid and executable name,
so basically policies enforced upon setuid and exec related calls. I am
thinking the rules engine could be split in two parts: the set of rules which
can bear delay can live in user space, and those which cannot can live in the
kernel. Something like, moving of existing tasks from one cgroup to another
can probably go in user space, or fork-notification related rules can live
in user space.
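A minimal sketch of what such a user-space rule table might look like; the struct layout, the example uids and the cgroup names here are all hypothetical, purely for illustration of uid/exec-name based classification:

```c
#include <string.h>
#include <sys/types.h>

/* Hypothetical rule entry: match on uid and/or executable name.
 * A uid of (uid_t)-1 or an empty exec name means "match anything". */
struct cg_rule {
    uid_t uid;
    const char *exec;    /* basename of the binary, "" = any */
    const char *cgroup;  /* destination cgroup path */
};

static const struct cg_rule rules[] = {
    { 500,       "firefox", "/browser"  },
    { 500,       "oracle",  "/database" },
    { (uid_t)-1, "",        "/others"   },   /* catch-all default */
};

/* Return the destination cgroup for (uid, exec); the last rule
 * acts as the default. */
const char *classify(uid_t uid, const char *exec)
{
    size_t i;
    for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        if (rules[i].uid != (uid_t)-1 && rules[i].uid != uid)
            continue;
        if (rules[i].exec[0] && strcmp(rules[i].exec, exec) != 0)
            continue;
        return rules[i].cgroup;
    }
    return "/others";
}
```

The same table could be loaded into a kernel module for the rules that cannot bear delay, with only the lookup sitting on the setuid/exec path.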
Thanks
Vivek
Hi Vivek,
On Tue, Jul 1, 2008 at 12:11 PM, Vivek Goyal <[email protected]> wrote:
>
> - netlink is not a reliable protocol.
> - Messages can be dropped and one can loose message. That means a
> newly forked process might never go into right group as meant.
One way that you could avoid the unreliability would be to not use
netlink, but instead use cgroups itself.
What we're looking for is a way to easily distinguish between
processes that are in the right cgroups, and processes that might be
in the wrong cgroups. Additionally, we want the children of such
processes to inherit the same status until we've dealt with them, and
not be able to change their status themselves.
That sounds a bit like a cgroup. How about the following?
- create a cgroup subsystem called "setuid".
- have a uid_changed() hook called by sys_setuid() and friends; this
hook would simply attach current to the root cgroup in the "setuid"
hierarchy if it wasn't already in that cgroup (which can be determined
with a couple of dereferences from current and no locking, so not
slowing down the normal case).
- userspace uses this by:
mount the setuid hierarchy, e.g. at /mnt/setuid
create a child cgroup /mnt/setuid/processed
while true:
wait for /mnt/setuid/tasks to be non-empty
read a pid from /mnt/setuid/tasks
move that pid to the appropriate cgroups in memory/cpu/etc
hierarchies if necessary
move that pid to /mnt/setuid/processed/tasks
i.e. any pid in the root cgroup of the setuid hierarchy is one that
needs attention and may need to be moved to different cgroups
A couple of enhancements to make this more usable might include:
- adding an API (via a new syscall or an eventfd?) to wait for a
cgroup to be non-empty, to avoid having to poll /mnt/setuid/tasks more
than necessary
- allow the user to designate certain processes and their children as
uninteresting so that their setuid calls don't trigger them being
moved back to the root (perhaps indicated via membership of an
"ignored" cgroup in the setuid hierarchy?)
This should be more reliable than netlink since it doesn't involve
userspace having to keep up with a stream of events - we're not
queuing up events, we're just shifting process group memberships.
Similar approaches could be used for a "setgid" hierarchy and an
"execve" hierarchy.
Paul
On Thu, Jul 3, 2008 at 8:54 AM, Vivek Goyal <[email protected]> wrote:
>
> As of today it should happen because newly execed process will run into
> same cgroup as parent. But that's what probably we need to avoid.
> For example, if an admin has created three cgroups "database", "browser"
> "others" and a user launches "firefox" from shell (assuming shell is running
> originally in "others" cgroup), then any memory allocation for firefox should
> come from "browser" cgroup and not from "others".
I think that I'm a little skeptical that anyone would ever want to do that.
Wouldn't it be a simpler mechanism for the admin to simply have
wrappers around the "firefox" and "oracle" binaries that move the
process into the "browser" or "database" cgroup before running the
real binaries?
>
> I am assuming that this will be a requirement for enterprise class
> systems. Would be good to know the experiences of people who are already
> doing some kind of work load management.
I can help there. :-) At Google we have two approaches:
- grid jobs, which are moved into the appropriate cgroup (actually,
currently cpuset) by the grid daemon when it starts the job
- ssh logins, which are moved into the appropriate cpuset by a
forced-command script specified in the sshd config.
I don't see the rule-based approach being all that useful for our needs.
It's all very well coming up with theoretical cases that a fancy new
mechanism solves. But it carries more weight if someone can stand up
and say "Yes, I want to use this on my real cluster of machines". (Or
even "Yes, if this is implemented I *will* use it on my desktop" would
be a start)
Paul
On Thu, Jul 10, 2008 at 02:07:11AM -0700, Paul Menage wrote:
> Hi Vivek,
>
> On Tue, Jul 1, 2008 at 12:11 PM, Vivek Goyal <[email protected]> wrote:
> >
> > - netlink is not a reliable protocol.
> > - Messages can be dropped and one can loose message. That means a
> > newly forked process might never go into right group as meant.
>
> One way that you could avoid the unreliability would be to not use
> netlink, but instead use cgroups itself.
>
> What we're looking for is a way to easily distinguish between
> processes that are in the right cgroups, and processes that might be
> in the wrong cgroups. Additionally, we want the children of such
> processes to inherit the same status until we've dealt with them, and
> not be able to change their status themselves.
>
> That sounds a bit like a cgroup. How about the following?
>
> - create a cgroup subsystem called "setuid".
>
> - have a uid_changed() hook called by sys_setuid() and friends; this
> hook would simply attach current to the root cgroup in the "setuid"
> hierarchy if it wasn't already in that cgroup (which can be determined
> with a couple of dereferences from current and no locking, so not
> slowing down the normal case).
>
> - userspace uses this by:
>
> mount the setuid hierarchy, e.g. at /mnt/setuid
> create a child cgroup /mnt/setuid/processed
> while true:
> wait for /mnt/setuid/tasks to be non-empty
> read a pid from /mnt/setuid/tasks
> move that pid to the appropriate cgroups in memory/cpu/etc
> hierarchies if necessary
> move that pid to /mnt/setuid/processed/tasks
>
> i.e. any pid in the root cgroup of the setuid hierarchy is one that
> needs attention and may need to be moved to different cgroups
>
> A couple of enhancements to make this more usable might include:
>
> - adding an API (via a new syscall or an eventfd?) to wait for a
> cgroup to be non-empty, to avoid having to poll /mnt/setuid/tasks more
> than necessary
>
> - allow the user to designate certain processes and their children as
> uninteresting so that their setuid calls don't trigger them being
> moved back to the root (perhaps indicated via membership of an
> "ignored" cgroup in the setuid hierarchy?)
>
> This should be more reliable than netlink since it doesn't involve
> userspace having to keep up with a stream of events - we're not
> queuing up events, we're just shifting process group memberships.
>
> Similar approaches could be used for a "setgid" hierarchy and an
> "execve" hierarchy.
This looks interesting. So the above method should solve at least the
reliability issue of event transport to user space. A few thoughts:
- Hopefully the number of hierarchies will not explode, as we will be
  mounting one hierarchy per event type (uid change, gid change,
  exec, maybe fork etc.).
- IIUC, it does not address the concern about delay. After setuid or exec,
  a task continues to run in the existing cgroup until the user space daemon
  processes the event and moves the task into the right cgroup. More on this
  in reply to your other mail.
Thanks
Vivek
On Thu, Jul 10, 2008 at 02:23:52AM -0700, Paul Menage wrote:
> On Thu, Jul 3, 2008 at 8:54 AM, Vivek Goyal <[email protected]> wrote:
> >
> > As of today it should happen because newly execed process will run into
> > same cgroup as parent. But that's what probably we need to avoid.
> > For example, if an admin has created three cgroups "database", "browser"
> > "others" and a user launches "firefox" from shell (assuming shell is running
> > originally in "others" cgroup), then any memory allocation for firefox should
> > come from "browser" cgroup and not from "others".
>
> I think that I'm a little skeptical that anyone would ever want to do that.
>
> Wouldn't it be a simpler mechanism for the admin to simply have
> wrappers around the "firefox" and "oracle" binaries that move the
> process into the "browser" or "database" cgroup before running the
> real binaries?
>
Well, that would mean wrappers first need to be created around all the
applications which need to be controlled. Then each wrapper needs to
synchronize with the classification daemon, to know whether it has been put
into the right cgroup and can go ahead with launching the real binary.
This sounds ugly, and putting wrappers around all the applications does
not seem very practical.
> >
> > I am assuming that this will be a requirement for enterprise class
> > systems. Would be good to know the experiences of people who are already
> > doing some kind of work load management.
>
> I can help there. :-) At Google we have two approaches:
>
> - grid jobs, which are moved into the appropriate cgroup (actually,
> currently cpuset) by the grid daemon when it starts the job
>
So the grid daemon probably first forks off, determines the right cpuset,
moves the job there and then does the exec?
> - ssh logins, which are moved into the appropriate cpuset by a
> forced-command script specified in the sshd config.
>
> I don't see the rule-based approach being all that useful for our needs.
>
> It's all very well coming up with theoretical cases that a fancy new
> mechanism solves. But it carries more weight if someone can stand up
> and say "Yes, I want to use this on my real cluster of machines". (Or
> even "Yes, if this is implemented I *will* use it on my desktop" would
> be a start)
>
So it boils down to:
1) Can we bear the delay in task classification (especially exec)? If yes,
   then all the classification work can take place in user space.
2) If no,
   a) then either we need to implement a rule based engine to let the
      kernel do classification,
   b) or we need to do various things in user space as you suggested:
      - put wrappers around applications;
      - modify the job launcher (ex. grid daemon) to determine
        the right cgroup and place the application there before
        actually launching the job.
Balbir and other people, any more thoughts on this? How exactly would this
need to be used in your work environment?
I am a little skeptical of option 2b working in most of the scenarios.
Thanks
Vivek
On Thu, Jul 10, 2008 at 02:07:11AM -0700, Paul Menage wrote:
> Hi Vivek,
>
> On Tue, Jul 1, 2008 at 12:11 PM, Vivek Goyal <[email protected]> wrote:
> [...]
>
> This should be more reliable than netlink since it doesn't involve
> userspace having to keep up with a stream of events - we're not
> queuing up events, we're just shifting process group memberships.
>
> Similar approaches could be used for a "setgid" hierarchy and an
> "execve" hierarchy.
We also need to do something to track all the children forked after
the setuid, setgid or exec until the original parent event gets classified;
those children need to get the same treatment.
Thanks
Vivek
On Thu, 10 Jul 2008 02:23:52 -0700
"Paul Menage" <[email protected]> wrote:
> I don't see the rule-based approach being all that useful for our needs.
Agreed, there really is no need for a rule-based approach in kernel space.
There are basically three different cases:
1) daemons get started up in their own process groups, this can
be handled by the initscripts
2) user sessions (ssh, etc) start in their own process groups,
this can be handled by PAM
3) users fork processes that should go into special process
groups - this could be handled by having a small ruleset
in userspace handle things, right before calling exec(),
it can even be hidden from the application by hooking into
the exec() call
If a user overrides the rules for their own processes, at worst
s/he takes away resources from him/herself. No security problem.
Is there any reason at all to push for a kernel side rule-based
engine, except "I want to make my patch set unmergeable?"
--
All Rights Reversed
>
> So it boils down to.
>
> 1) Can we bear the delay in task classification (Especially, exec). If yes,
> then all the classification job can take place in userspace.
The answer is not really.
>
> 2) If no,
> a) Then either we need to implement rule based engine to let
> kernel do classfication.
>
> b) or we need to do various things in user space as you suggested.
> - Pur wrapper around applications.
> - Job launcher (ex. Grid daemon) is modified to determine
> the right cgroup and place application there before
> actually launching the job.
>
I like this approach. The whole classification should really be done by
userspace. Let the wrapper move into the correct group and then start the
task. The kernel really is not the right place for the classification.
And you can have a default group for tasks that really don't care about
where they are placed.
--
regards,
Dhaval
On Thu, Jul 10, 2008 at 10:48:52AM -0400, Rik van Riel wrote:
> On Thu, 10 Jul 2008 02:23:52 -0700
> "Paul Menage" <[email protected]> wrote:
>
> > I don't see the rule-based approach being all that useful for our needs.
>
> Agreed, there really is no need for a rule-based approach in kernel space.
>
> There are basically three different cases:
>
> 1) daemons get started up in their own process groups, this can
> be handled by the initscripts
>
> 2) user sessions (ssh, etc) start in their own process groups,
> this can be handled by PAM
>
> 3) users fork processes that should go into special process
> groups - this could be handled by having a small ruleset
> in userspace handle things, right before calling exec(),
That means the application launcher (ex. shell) is aware of the right cgroup
the targeted application should go in, and then moves the forked pid to the
right cgroup and calls exec? Or did you have something else in mind?
> it can even be hidden from the application by hooking into
> the exec() call
>
This means hooking into libc. So libc will parse the rules file, determine
the right cgroup, place the application there and then call exec?
CCing, Ulrich also in case he has some thoughts.
Thanks
Vivek
On Thu, Jul 10, 2008 at 02:07:11AM -0700, Paul Menage wrote:
> Hi Vivek,
>
> On Tue, Jul 1, 2008 at 12:11 PM, Vivek Goyal <[email protected]> wrote:
> [...]
> - userspace uses this by:
>
> mount the setuid hierarchy, e.g. at /mnt/setuid
> create a child cgroup /mnt/setuid/processed
> while true:
> wait for /mnt/setuid/tasks to be non-empty
> read a pid from /mnt/setuid/tasks
> move that pid to the appropriate cgroups in memory/cpu/etc
> hierarchies if necessary
> move that pid to /mnt/setuid/processed/tasks
>
> i.e. any pid in the root cgroup of the setuid hierarchy is one that
> needs attention and may need to be moved to different cgroups
>
Where I see complications is in handling forks happening in that window. For
example, it could take us a long time to ensure that a fork bomb ends up in
the correct cgroup.
Another issue: where does the pid reside in the memory/cpu hierarchy? If it
is not in the correct cgroup at the time of exec, or soon after exec, the
wrong cgroup is getting charged.
I liked the other idea you posted about in the other mail, having wrappers.
I believe that can be done at the distro level, which should not really be
too tough.
Or maybe we can use something like selinux (OK, this really is a shot in
the dark; I should read up before opening my mouth here).
Thanks,
--
regards,
Dhaval
Vivek Goyal wrote:
>> it can even be hidden from the application by hooking into
>> the exec() call
>>
>
> This means hooking into libc. So libc will parse rules file, determine
> the right cgroup, place application there and then call exec?
As with any "solution" based on userlevel code, the problem is overhead
and interfaces.
Such a rules file would be a real file, I assume, and as such we'd have
to read it every time an exec call is made. At least we'd have to check
using a stat() call that nothing changed. That's always a big overhead.
Once the information is available, how is it used? We'd have to pass
additional information to the exec syscalls. And it has to happen so
that if the exec call fails the original process is not affected (i.e.,
premature changing isn't an option). The method also must be
thread-safe in a limited way: executing failing exec syscalls in
multiple threads mustn't disturb the process.
There is one set of problems which I don't care about but others likely
will: what happens if some program uses the syscalls directly? And what
happens with old libcs and old statically linked programs? It's exactly
the kind of problem why I tell people never to link statically, but
some people don't listen.
The additional file update check hurts performance, but since I hope
that we will get an inotify-like interface that doesn't need normal file
descriptors (or any file descriptors) I think I can live with it.
Somebody would "just" have to implement, e.g., the anonfd functionality
discussed some time ago. (Make sure to talk to Al Viro who already
mentioned to me that it'll be "fun").
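The stat()-based freshness check mentioned above could look roughly like this; the rules_cache struct and its fields are illustrative, not an existing API. The rules file is parsed once and only re-read when its mtime or size changes:

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Cached metadata of the last parse of the rules file. */
struct rules_cache {
    time_t mtime;
    off_t  size;
    int    loaded;
};

/* Return 1 if the rules file must be (re)parsed, 0 if the cached copy is
 * still fresh, -1 if the file cannot be stat()ed.  On 1, the cache is
 * updated to reflect the file about to be parsed. */
int rules_need_reload(const char *path, struct rules_cache *c)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return -1;
    if (c->loaded && st.st_mtime == c->mtime && st.st_size == c->size)
        return 0;
    c->mtime  = st.st_mtime;
    c->size   = st.st_size;
    c->loaded = 1;
    return 1;
}
```

This is the per-exec overhead Ulrich is worried about: even the cached path still costs one stat() per exec unless an inotify-like notification replaces it.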
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Thu, Jul 10, 2008 at 7:06 AM, Vivek Goyal <[email protected]> wrote:
>
> This looks interesting. So above method should solve atleast the
> reliability issue of event transport to user space. Got few thougts.
>
> - Hopefully number of hiearchies will not explode as we will be
> mounting one hierarchies per event type (uid change, gid change,
> exec, maybe fork etc.).
In what circumstances would you want to reclassify processes to a
different cgroup on a fork?
Paul
On Thu, Jul 10, 2008 at 7:33 AM, Vivek Goyal <[email protected]> wrote:
>
> We also need to do something to track all the forked childs after
> the setuid, setgid or exec till original parent event got classified
> and children need to meet the same treatment.
You'd get that automatically, since children of the task moved to the
root cgroup (indicating "needs attention") would also end up in that
cgroup since cgroup are inherited across fork.
Paul
On Thu, Jul 10, 2008 at 7:30 AM, Vivek Goyal <[email protected]> wrote:
>
> Well, that would mean first wrappers need to be created around all the
> applications which needs to be controlled. Then wrapper needs to
> synchronize with the classification daemon
I was suggesting that you wouldn't need a classification daemon in
this case. The logic of which cgroup to enter would be in the
script/command invoked by the wrapper.
>> - grid jobs, which are moved into the appropriate cgroup (actually,
>> currently cpuset) by the grid daemon when it starts the job
>
> So grid daemon probably first forks off, determines the right cpuset
> move the job there and then do exec?
Pretty much, yes. Most jobs have their own cpuset that's created for
them dynamically when the job starts on the machine.
Paul
On Thu, Jul 10, 2008 at 09:46:46AM -0700, Paul Menage wrote:
> On Thu, Jul 10, 2008 at 7:33 AM, Vivek Goyal <[email protected]> wrote:
> >
> > We also need to do something to track all the forked childs after
> > the setuid, setgid or exec till original parent event got classified
> > and children need to meet the same treatment.
>
> You'd get that automatically, since children of the task moved to the
> root cgroup (indicating "needs attention") would also end up in that
> cgroup since cgroup are inherited across fork.
>
I am sorry, I seem to be missing something, but who moves the forked
children (which got forked during the time between the parent getting
classified into the right group and the fork itself) into the correct
group?
--
regards,
Dhaval
On Thu, Jul 10, 2008 at 09:41:06AM -0700, Paul Menage wrote:
> On Thu, Jul 10, 2008 at 7:06 AM, Vivek Goyal <[email protected]> wrote:
> >
> > This looks interesting. So above method should solve atleast the
> > reliability issue of event transport to user space. Got few thougts.
> >
> > - Hopefully number of hiearchies will not explode as we will be
> > mounting one hierarchies per event type (uid change, gid change,
> > exec, maybe fork etc.).
>
> In what circumstances would you want to reclassify processes to a
> different cgroup on a fork?
I don't know. Balbir had mentioned something in one of the mails in this
thread about getting notifications on fork.
Thanks
Vivek
On Thu, 10 Jul 2008 08:56:25 -0700
Ulrich Drepper <[email protected]> wrote:
> Once the information is available, how is it used? We'd have to pass
> additional information to the exec syscalls. And it has to happen so
> that if the exec call fails the original process is not affected (i.e.,
> premature changing isn't an option). The method also must be
> thread-safe in a limited way: executing failing exec syscalls in
> multiple threads mustn't disturb the process.
One easy way is to have a "migrate on exec" option added to the
process group code. Instead of moving yourself to a new process
group before exec, you do the same invocation but with a "migrate
me lazily at exec time" flag.
At exec time, your current resources will be subtracted from the
old process group (most of it automatically in exit_mmap) and your
new resources will be added to the new process group on the other
side of exec.
The exec syscall itself does not need to change.
> There is one set of problems which I don't care about but others likely
> will: what happens if some program uses the syscalls directly? And what
> happens with old libcs and old statically linked programs? It's exactly
> the kind of problem why I tell people to never linked statically but
> some people don't listen.
Those people will have to move their processes around between
process groups manually (or with shell scripts). Having per
program process groups is essentially bonus functionality
over the "start daemon in own process group" and "start user
in own process group" functionalities.
Whether and how we want to implement this is open for discussion.
Personally I suspect that a kernel side rule-based engine with
user loadable rules may not be the best idea :)
--
All Rights Reversed
On Thu, Jul 10, 2008 at 01:19:39PM -0400, Vivek Goyal wrote:
> On Thu, Jul 10, 2008 at 09:41:06AM -0700, Paul Menage wrote:
> > On Thu, Jul 10, 2008 at 7:06 AM, Vivek Goyal <[email protected]> wrote:
> > >
> > > This looks interesting. So above method should solve atleast the
> > > reliability issue of event transport to user space. Got few thougts.
> > >
> > > - Hopefully number of hiearchies will not explode as we will be
> > > mounting one hierarchies per event type (uid change, gid change,
> > > exec, maybe fork etc.).
> >
> > In what circumstances would you want to reclassify processes to a
> > different cgroup on a fork?
>
> I don't know. Balbir had mentioned in one of the mails in this thread
> regarding getting notification on fork.
fork or exec? I believe reclassifications would happen only on exec.
--
regards,
Dhaval
On Thu, Jul 10, 2008 at 10:18 AM, Dhaval Giani
<[email protected]> wrote:
>
> I am sorry, I seem to missing something, but who moves the forked
> children (which got forked during the time between the parent getting
> classified into the right group and the fork itself) into the correct
> group?
The classifier daemon would have to do that - my point was that it
would be very clear exactly which processes needed this attention,
since they'd end up in the root cgroup too.
Paul
On Thu, Jul 10, 2008 at 10:30:15AM -0700, Paul Menage wrote:
> On Thu, Jul 10, 2008 at 10:18 AM, Dhaval Giani
> <[email protected]> wrote:
> >
> > I am sorry, I seem to missing something, but who moves the forked
> > children (which got forked during the time between the parent getting
> > classified into the right group and the fork itself) into the correct
> > group?
>
> The classifier daemon would have to do that - my point was that it
> would be very clear exactly which processes needed this attention,
> since they'd end up in the root cgroup too.
>
It still would not solve the problem of the correct group getting
charged. For something like cpu, the task would get far more cpu time than
it should. It's a step in the right direction, but I am not sure it is the
solution. I was thinking of having a sandbox cgroup at each level, but then
I am not very sure that solves this "correct cgroup getting charged"
problem.
--
regards,
Dhaval
Rik van Riel wrote:
> One easy way is to have a "migrate on exec" option added to the
> process group code.
That's going to be ugly because the exec functions are signal-safe.
I.e., they can happen at any time. This would mean that one always has
to set the migration policy before every exec call and that there must
be a way to retrieve the currently selected policy so that it can
potentially be restored. This policy must be a thread property, not a
process property.
Sticky information like this is IMO always hairy at best. We had the
same discussion at the time of the sys_indirect discussion. This new
syscall proposal was the result of sticky information not being suitable
and it could very well be used for the exec syscalls, too.
Again, this is all about failing exec calls of which there can be
arbitrarily many.
On Thu, Jul 10, 2008 at 10:39:28AM -0700, Ulrich Drepper wrote:
>
> Rik van Riel wrote:
> > One easy way is to have a "migrate on exec" option added to the
> > process group code.
>
> That's going to be ugly because the exec functions are signal-safe.
> I.e., they can happen at any time. This would mean that one always has
> to set the migration policy before every exec call and that there must
> be a way to retrieve the currently selected policy so that it can
> potentially be restored. This policy must be a thread property, not a
> process property.
>
Sorry, I did not understand exactly what the problem is with exec being
signal-safe. Before exec, we should be able to determine the migration
policy related to the process/thread (either by reading a file or by some
other means) and set the policy through the cgroup file system. If exec
fails for some reason, we just need to go back to the cgroup file system to
undo the migration policy previously set for that thread.
Thanks
Vivek
Vivek Goyal wrote:
> Before exec, we should be able to determine the
> migration policy related to process/thread (either by reading file or
> something else etc). Set the policy through cgroup file system. If exec
> fails for some reason, we just need to go back to cgroup file system to
> undo the effect of setting migration policy previously set for that thread.
That's what I said. It would be necessary to get the old state and
reset it if necessary.
As for the interface: I hope nobody honestly thinks that it is doable to
perform a whole bunch of filesystem operations for every exec.
And more: reading a rule file, interpreting the rules to find the best
match, etc is also too expensive. Every process would have to read the
rule file again. If this is non-trivial or the rule file is large, the
cost of an exec could easily be overshadowed by the cost of this
preparation. Unlike the kernel, the userlevel runtime cannot in general
amortize the cost over several exec calls. Handling all this in the
kernel wouldn't have any of these problems.
On Thu, 10 Jul 2008 11:40:35 -0400
Vivek Goyal <[email protected]> wrote:
> On Thu, Jul 10, 2008 at 10:48:52AM -0400, Rik van Riel wrote:
> > On Thu, 10 Jul 2008 02:23:52 -0700
> > "Paul Menage" <[email protected]> wrote:
> >
> > > I don't see the rule-based approach being all that useful for our needs.
> >
> > Agreed, there really is no need for a rule-based approach in kernel space.
> >
> > There are basically three different cases:
> >
> > 1) daemons get started up in their own process groups, this can
> > be handled by the initscripts
> >
> > 2) user sessions (ssh, etc) start in their own process groups,
> > this can be handled by PAM
> >
> > 3) users fork processes that should go into special process
> > groups - this could be handled by having a small ruleset
> > in userspace handle things, right before calling exec(),
>
> That means application launcher (ex, shell) is aware of the right cgroup
> targeted application should go in and then move forked pid to right
> cgroup and call exec? Or you had something else in mind?
>
> > it can even be hidden from the application by hooking into
> > the exec() call
> >
>
> This means hooking into libc. So libc will parse rules file, determine
> the right cgroup, place application there and then call exec?
>
Hmm, as I wrote, the rule that the child inherits its own parent's cgroup is
a very strong rule (most cases can be handled by it). So, what I think of is:
1. support a new command (in libcg):
   - /bin/change_group_exec ..... reads /etc/cgroup/config, moves the task
     to the right cgroup and calls exec.
2. and a libc function, if necessary.
1. is enough, because an admin/user can write a wrapper script for their
applications if "child inherits parent's" isn't suitable.
no?
Thanks,
-Kame
On Fri, Jul 11, 2008 at 09:55:01AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 10 Jul 2008 11:40:35 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Thu, Jul 10, 2008 at 10:48:52AM -0400, Rik van Riel wrote:
> > > On Thu, 10 Jul 2008 02:23:52 -0700
> > > "Paul Menage" <[email protected]> wrote:
> > >
> > > > I don't see the rule-based approach being all that useful for our needs.
> > >
> > > Agreed, there really is no need for a rule-based approach in kernel space.
> > >
> > > There are basically three different cases:
> > >
> > > 1) daemons get started up in their own process groups, this can
> > > be handled by the initscripts
> > >
> > > 2) user sessions (ssh, etc) start in their own process groups,
> > > this can be handled by PAM
> > >
> > > 3) users fork processes that should go into special process
> > > groups - this could be handled by having a small ruleset
> > > in userspace handle things, right before calling exec(),
> >
> > That means application launcher (ex, shell) is aware of the right cgroup
> > targeted application should go in and then move forked pid to right
> > cgroup and call exec? Or you had something else in mind?
> >
> > > it can even be hidden from the application by hooking into
> > > the exec() call
> > >
> >
> > This means hooking into libc. So libc will parse rules file, determine
> > the right cgroup, place application there and then call exec?
> >
>
> Hmm, as I wrote, the rule that the child inherits its own parent't is very
> strong rule. (Most of case can be handle by this.) So, what I think of is
>
> 1. support a new command (in libcg.)
> - /bin/change_group_exec ..... read to /etc/cgroup/config and move cgroup
> and call exec.
> 2. and libc function
> - if necessary.
>
> 1. is enough because admin/user can write a wrapper script for their
> applications if "child inherits parent's" isn't suitable.
>
> no ?
>
If the admin has decided to group applications and has written the rules for
them, then applications should not know anything about grouping. So I think
an application writing a script to place itself in the right group should
be out of the question. Now, how does an admin write a wrapper around an
existing application without breaking anything else?
One option could be creating soft links, where an admin-created alias points
to the wrapper and the wrapper in turn invokes the executable. But this will
not solve the problem if some user decides to invoke the application
executable directly and not use the admin-created alias.
Did you have something else in mind when it came to creating wrappers
around applications?
Thanks
Vivek
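For reference, the change_group_exec-style wrapper under discussion could look
roughly like the sketch below. The config path, its one-line "<uid> <cgroup-path>"
format, and the mount point are illustrative assumptions, not part of libcg:

```shell
#!/bin/sh
# Hypothetical change_group_exec-style wrapper: look up the target cgroup
# for the current uid in a config file, move ourselves there by writing
# to the tasks file, then exec the real program so it starts (and forks)
# in the right group. Config format and paths are assumptions.

CONFIG="${CGRULES_CONFIG:-/etc/cgroup/config}"   # assumed: "<uid> <cgroup-path>" per line
CGROOT="${CGROOT:-/mnt/cgroup}"

lookup_cgroup() {
    # Print the cgroup path configured for uid $1, or nothing.
    [ -r "$CONFIG" ] || return 0
    awk -v uid="$1" '$1 == uid { print $2; exit }' "$CONFIG"
}

dest=$(lookup_cgroup "$(id -u)")
if [ -n "$dest" ] && [ -w "$CGROOT/$dest/tasks" ]; then
    # Writing our pid migrates this shell; the exec below keeps the pid,
    # so the application inherits the destination group.
    echo $$ > "$CGROOT/$dest/tasks"
fi
exec "$@"
```

Because the move happens before exec, the application itself stays unaware of
the grouping, which is the property asked for above.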
Vivek Goyal wrote:
> If admin has decided to group applications and has written the rules for
> it then applications should not know anything about grouping. So I think
> application writing an script for being placed into the right group should
> be out of question. Now how does an admin write a wrapper around existing
> application without breaking anything else.
In the Solaris world, processes are placed into cgroups (projects) by
one of two mechanisms:
1) inheritance, with everything I create in my existing project.
To get this started, there is a mechanism under login/getty/whatever
reading the /etc/projects file and, for example, tossing user davecb
into a "user.davecb" project.
2) explicit placement with newtask, which starts a program or moves
a process into a project/cgroup
I have a "bg" project which I use for limiting resource consumption of
background jobs, and a background command which either starts or moves
jobs, thusly:
case "$1" in
[0-9]*) # It's a pid
newtask -p bg -c $1
;;
*) # It's a command-line
newtask -p bg "$@" &
;;
esac
A rules engine would be more useful for managing workloads once
they're assigned, as IBM does on the mainframe with WLM and goal-directed
resource management. (They're brilliant in this area, by the way, so
I'd be inclined to steal ideas from them (;-))
--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
----- Original Message -----
>On Fri, Jul 11, 2008 at 09:55:01AM +0900, KAMEZAWA Hiroyuki wrote:
>> On Thu, 10 Jul 2008 11:40:35 -0400
>> Vivek Goyal <[email protected]> wrote:
>>
>> > On Thu, Jul 10, 2008 at 10:48:52AM -0400, Rik van Riel wrote:
>> > > On Thu, 10 Jul 2008 02:23:52 -0700
>> > > "Paul Menage" <[email protected]> wrote:
>> > >
>> > > > I don't see the rule-based approach being all that useful for our needs.
>> > >
>> > > Agreed, there really is no need for a rule-based approach in kernel space.
>> > >
>> > > There are basically three different cases:
>> > >
>> > > 1) daemons get started up in their own process groups, this can
>> > > be handled by the initscripts
>> > >
>> > > 2) user sessions (ssh, etc) start in their own process groups,
>> > > this can be handled by PAM
>> > >
>> > > 3) users fork processes that should go into special process
>> > > groups - this could be handled by having a small ruleset
>> > > in userspace handle things, right before calling exec(),
>> >
>> > That means application launcher (ex, shell) is aware of the right cgroup
>> > targeted application should go in and then move forked pid to right
>> > cgroup and call exec? Or you had something else in mind?
>> >
>> > > it can even be hidden from the application by hooking into
>> > > the exec() call
>> > >
>> >
>> > This means hooking into libc. So libc will parse rules file, determine
>> > the right cgroup, place application there and then call exec?
>> >
>>
>> Hmm, as I wrote, the rule that the child inherits its own parent't is very
>> strong rule. (Most of case can be handle by this.) So, what I think of is
>>
>> 1. support a new command (in libcg.)
>> - /bin/change_group_exec ..... read to /etc/cgroup/config and move cgroup
>> and call exec.
>> 2. and libc function
>> - if necessary.
>>
>> 1. is enough because admin/user can write a wrapper script for their
>> applications if "child inherits parent's" isn't suitable.
>>
>> no ?
>>
>
>If admin has decided to group applications and has written the rules for
>it then applications should not know anything about grouping. So I think
>application writing an script for being placed into the right group should
>be out of question. Now how does an admin write a wrapper around existing
>application without breaking anything else.
>
Sure.
>One thing could be creating soft links where admin created alias points
>to wrapper and wrapper inturn invokes the executable. But this will not
>solve the problem if some user decides to invoke the application
>executable directly and not use admin created alias.
>
yes. It's a hole.
>Did you have something else in mind when it came to creating wrappers
>around applications?
>
I have no strong idea around this, but for now it seems:
- handling complicated rules in the kernel will get a lot of NAKs
(and it seems to add some latency);
- we cannot avoid the problem discussed here if we handle the rules in a
userland daemon via the process-event connector.
So, I wonder if adding some limitations may make things simpler:
- an application under a wrapper will be executed in a group defined by the admin;
- an application without a wrapper will be executed in the group where exec()
was called.
The point is that an application without a wrapper is also under the admin's
control, because it's executed in the group which calls exec.
But this is not strict control... this is very loose ;)
Thanks,
-Kame
On Mon, Jul 14, 2008 at 10:44:43AM -0400, David Collier-Brown wrote:
> Vivek Goyal wrote:
>> If admin has decided to group applications and has written the rules for
>> it then applications should not know anything about grouping. So I think
>> application writing an script for being placed into the right group should
>> be out of question. Now how does an admin write a wrapper around existing
>> application without breaking anything else.
>
> In the Solaris world, processes are placed into cgroups (projects) by
> one of two mechanisms:
>
> 1) inheritance, with everything I create in my existing project.
> To get this started, there is a mechanism under login/getty/whatever
> reading the /etc/projects file and, for example, tossing user davecb
> into a "user.davecb" project.
>
Placing the login sessions in the right cgroup based on uid/gid rules is
probably easy, as the check needs to happen only on system entry upon login
(a PAM plugin should do). After that, any job started by the user
will automatically start in the same cgroup.
> 2) explicit placement with newtask, which starts a program or moves
> a process into a project/cgroup
>
Explicit placement of tasks based on application type will be tricky.
> I have a "bg" project which I use for limiting resource consumption of
> background jobs, and a background command which either starts or moves
> jobs, thusly:
>
> case "$1" in
> [0-9]*) # It's a pid
> newtask -p bg -c $1
Ok, this is moving tasks from one cgroup to another based on pid. This
is really easy to do through the cgroup filesystem; it's just a matter of
writing the pid to the tasks file.
> ;;
> *) # It's a command-line
> newtask -p bg "$@" &
> ;;
So here a user explicitly invokes the wrapper, passing it the target
cgroup and the application to be launched in that cgroup. This should work
if the user has created their own cgroups (let's say
under a user-controlled cgroup dir in the hierarchy) and the user explicitly
wants to control the resources of applications under that dir. For example:
/mnt/cgroup
| |
gid1 gid2
| | | |
uid1 uid2 uid3 uid4
| |
proj1 proj2
Here the admin can probably write the rules for how users are allocated
resources and give users the ability to create subdirs under their cgroups,
where users can create more cgroups, do their own resource
management based on application tasks, and place applications in the right
cgroup by writing wrappers like the "newtask" you mentioned.
But here there is no discrimination by application type on the admin's side.
The admin controls resource division only based on uid/gid, and users manage
applications within their user groups. In fact, I am having a hard time thinking
of the kinds of scenarios in which an admin needs to control
resources based on application type. Do we really need setups like "on
a system, databases should get 30% of the network bandwidth"? If yes, then
it becomes tricky, as the admin needs to write a wrapper to place the task
in the right cgroup without the application/user knowing it.
Thanks
Vivek
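The pid-based move really is a one-liner through the cgroup filesystem; a
minimal sketch of the "move an existing pid" half of newtask (the /mnt/cgroup
paths follow the example hierarchy in this thread, not a fixed mount point):

```shell
#!/bin/sh
# Move an existing task into a cgroup by writing its pid to the group's
# "tasks" file, which is how migration works in the cgroup filesystem.

move_to_cgroup() {
    # move_to_cgroup <cgroup-dir> <pid>
    cgdir="$1"; pid="$2"
    if [ ! -d "$cgdir" ]; then
        echo "no such cgroup: $cgdir" >&2
        return 1
    fi
    # Each cgroup directory exposes a "tasks" file; writing a pid to it
    # migrates that task into the group.
    echo "$pid" > "$cgdir/tasks"
}

# Example (hypothetical path): move_to_cgroup /mnt/cgroup/gid1/uid1/proj1 1234
```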
Vivek Goyal wrote:
> On Mon, Jul 14, 2008 at 10:44:43AM -0400, David Collier-Brown wrote:
> > Vivek Goyal wrote:
> >> If admin has decided to group applications and has written the rules for
> >> it then applications should not know anything about grouping. So I think
> >> application writing an script for being placed into the right group should
> >> be out of question. Now how does an admin write a wrapper around existing
> >> application without breaking anything else.
> >
> > In the Solaris world, processes are placed into cgroups (projects) by
> > one of two mechanisms:
> >
> > 1) inheritance, with everything I create in my existing project.
> > To get this started, there is a mechanism under login/getty/whatever
> > reading the /etc/projects file and, for example, tossing user davecb
> > into a "user.davecb" project.
> >
>
> Placing the login sessions in right cgroup based on uid/gid rules is
> probably easy as check needs to be placed only on system entry upon login
> (Pam plugin should do). And after that any job started by the user
> will automatically start in the same cgroup.
>
> > 2) explicit placement with newtask, which starts a program or moves
> > a process into a project/cgroup
> >
>
> explicit placement of task based on application type will be tricky.
>
> > I have a "bg" project which I use for limiting resource consumption of
> > background jobs, and a background command which either starts or moves
> > jobs, thusly:
> >
> > case "$1" in
> > [0-9]*) # It's a pid
> > newtask -p bg -c $1
>
> Ok, this is moving of tasks from one cgroup to other based on pid. This
> is really easy to do through cgroup file system. Just a matter of writing
> to task file.
>
> > ;;
> > *) # It's a command-line
> > newtask -p bg "$@" &
> > ;;
>
> So here a user explicitly invokes the wrapper passing it the targeted
> cgroup and the application to be launched in that cgroup. This should work
> if there is a facility if user has created its own cgroups (lets say
> under user controlled cgroup dir in the hierarchy) and user explicitly
> wants to control the resources of applications under its dir. For example,
>
> /mnt/cgroup
> | |
> gid1 gid2
> | | | |
> uid1 uid2 uid3 uid4
> | |
> proj1 proj2
>
> Here probably admin can write the rules for how users are allocated the
> resources and give ability to users to create subdirs under their cgroups
> where users can create more cgroups and can do their own resource
> management based on application tasks and place applications in the right
> cgroup by writing wrappers as mentioned by you "newtask".
>
> But here there is no discrimination of application type by admin. Admin
> controls resource divisions only based on uid/gid. And users can manage
> applications within their user groups. In fact I am having hard time thinking
> in what kind of scenarios, there is a need for an admin to control
> resource based on application type? Do we really need setups like, on
> a system databases should get network bandwidth of 30%. If yes, then
> it becomes tricky where admin need to write a wrapper to place the task
> in right cgroup without application/user knowing it.
I think a wrapper (which moves to the right group and calls exec) will be run by the user, not by the admin.
In explicit placement, the user knows what type of application he/she is launching.
/mnt/cgroup
| |
gid1 gid2
| | | |
uid1 uid2 uid3 uid4
| |
proj1 proj2
[uid1/gid1]% newtask.sh proj1app
... proj1app run under /mnt/cgroup/gid1/uid1
[uid1/gid1]% newtask.sh --type proj1type proj1app
... proj1app run under /mnt/cgroup/gid1/uid1/proj1
In this case, the admin sets up the limits for proj1type.
Also, I guess proj1type has ownership (e.g., proj1type allows gid1).
Isn't this enough?
Thanks,
Kazunaga Ikeno
On Thu, Jul 17, 2008 at 04:05:17PM +0900, Kazunaga Ikeno wrote:
> Vivek Goyal wrote:
> > On Mon, Jul 14, 2008 at 10:44:43AM -0400, David Collier-Brown wrote:
> > > Vivek Goyal wrote:
> > >> If admin has decided to group applications and has written the rules for
> > >> it then applications should not know anything about grouping. So I think
> > >> application writing an script for being placed into the right group should
> > >> be out of question. Now how does an admin write a wrapper around existing
> > >> application without breaking anything else.
> > >
> > > In the Solaris world, processes are placed into cgroups (projects) by
> > > one of two mechanisms:
> > >
> > > 1) inheritance, with everything I create in my existing project.
> > > To get this started, there is a mechanism under login/getty/whatever
> > > reading the /etc/projects file and, for example, tossing user davecb
> > > into a "user.davecb" project.
> > >
> >
> > Placing the login sessions in right cgroup based on uid/gid rules is
> > probably easy as check needs to be placed only on system entry upon login
> > (Pam plugin should do). And after that any job started by the user
> > will automatically start in the same cgroup.
> >
> > > 2) explicit placement with newtask, which starts a program or moves
> > > a process into a project/cgroup
> > >
> >
> > explicit placement of task based on application type will be tricky.
> >
> > > I have a "bg" project which I use for limiting resource consumption of
> > > background jobs, and a background command which either starts or moves
> > > jobs, thusly:
> > >
> > > case "$1" in
> > > [0-9]*) # It's a pid
> > > newtask -p bg -c $1
> >
> > Ok, this is moving of tasks from one cgroup to other based on pid. This
> > is really easy to do through cgroup file system. Just a matter of writing
> > to task file.
> >
> > > ;;
> > > *) # It's a command-line
> > > newtask -p bg "$@" &
> > > ;;
> >
> > So here a user explicitly invokes the wrapper passing it the targeted
> > cgroup and the application to be launched in that cgroup. This should work
> > if there is a facility if user has created its own cgroups (lets say
> > under user controlled cgroup dir in the hierarchy) and user explicitly
> > wants to control the resources of applications under its dir. For example,
> >
> > /mnt/cgroup
> > | |
> > gid1 gid2
> > | | | |
> > uid1 uid2 uid3 uid4
> > | |
> > proj1 proj2
> >
> > Here probably admin can write the rules for how users are allocated the
> > resources and give ability to users to create subdirs under their cgroups
> > where users can create more cgroups and can do their own resource
> > management based on application tasks and place applications in the right
> > cgroup by writing wrappers as mentioned by you "newtask".
> >
> > But here there is no discrimination of application type by admin. Admin
> > controls resource divisions only based on uid/gid. And users can manage
> > applications within their user groups. In fact I am having hard time thinking
> > in what kind of scenarios, there is a need for an admin to control
> > resource based on application type? Do we really need setups like, on
> > a system databases should get network bandwidth of 30%. If yes, then
> > it becomes tricky where admin need to write a wrapper to place the task
> > in right cgroup without application/user knowing it.
>
> I think a wrapper (move to right group and calls exec) will run by user, not by admin.
> In explicit placement, user knows what a type of application he/she launch.
>
> /mnt/cgroup
> | |
> gid1 gid2
> | | | |
> uid1 uid2 uid3 uid4
> | |
> proj1 proj2
>
This is the easy-to-handle situation and I am hoping it will work in many
of the cases.
Currently I am writing a patch for libcg which allows querying the
destination cgroup based on uid/gid, and libcg will also migrate the
application there. I am also writing a PAM plugin which will move
all the login sessions to the respective cgroups (as specified by the rule file).
I will also modify "init" so that all the system services go into the cgroup
belonging to root.
Once a user is logged in and running in his resource group, he can manage
further subgroups on his own based on his application requirements (as you
mentioned with proj1 and proj2 here).
> [uid1/gid1]% newtask.sh proj1app
> ... proj1app run under /mnt/cgroup/gid1/uid1
>
Yes, so if a user does not specifically launch an application targeted
at a particular cgroup, then it will run in the default group for that
user (as specified by the rule file). In this case, under /mnt/cgroup/gid1/uid1.
> [uid1/gid1]% newtask.sh --type proj1type proj1app
> ... proj1app run under /mnt/cgroup/gid1/uid1/proj1
>
IOW, a user can probably say:
newtask.sh --cgrp /mnt/cgroup/gid1/uid1/proj1/ proj1app
> In this case, admin sets up limitation of proj1type.
I think the admin should set up limits only down to /mnt/cgroup/gid1/uid1.
After that, how the resources allocated to uid1 are subdivided between various
user applications should be controlled by the user. So resources under
proj1 and proj2 will be fully controlled by the user.
> Also I guess proj1type has ownership (ex: proj1type allows gid1).
> Isn't this enough?
I think to begin with, and to get some kind of simple functionality
going, it might be good. I am sure others will aim for more complex
configurations and usages.
Thanks
Vivek
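The delegation split described here (admin controls down to the uid level, the
user owns everything below) can be sketched as below. The layout mirrors the
example hierarchy in this thread, and the chown-based delegation is an
illustrative assumption, not libcg behaviour:

```shell
#!/bin/sh
# Sketch: admin-side setup of a per-user cgroup subtree. Creating a
# subdirectory of a cgroup creates a child group, so handing the
# directory to the user is what delegates subgroup management (they can
# then mkdir proj1/proj2 themselves).

setup_user_subtree() {
    # setup_user_subtree <cgroup-root> <group-name> <user-name>
    root="$1"; grp="$2"; usr="$3"
    dir="$root/$grp/$usr"
    mkdir -p "$dir" || return 1
    # Best-effort: give the user the directory and its tasks file so they
    # can create subgroups and move their own tasks (needs root on a real
    # cgroup mount; suppressed here so the sketch degrades gracefully).
    chown "$usr" "$dir" "$dir/tasks" 2>/dev/null || true
    echo "$dir"
}

# Example (hypothetical): setup_user_subtree /mnt/cgroup gid1 uid1
```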
On Tue, 1 Jul 2008 15:11:26 -0400
Vivek Goyal <[email protected]> wrote:
> Hi,
>
> While development is going on for cgroup and various controllers, we also
> need a facility so that an admin/user can specify the group creation and
> also specify the rules based on which tasks should be placed in respective
> groups. Group creation part will be handled by libcg which is already
> under development. We still need to tackle the issue of how to specify
> the rules and how these rules are enforced (rules engine).
>
A different topic.
Recently I'm interested in "how to write a userland daemon program
to control the cgroup subsystem." To implement that effectively, we need
some notifier between user <-> kernel.
Can we use "inotify" to catch changes in a cgroup (from a daemon program)?
For example, create a new file under the memory cgroup:
==
/opt/memory_cgroup/group_A/notify_at_memory_reach_limit
==
and have a user watch the file with inotify.
The kernel modifies the modified-time of the notify_at_memory_reach_limit file
and calls fs/notify_user.c::notify_change() against this inode. The user can
catch the event with inotify.
(I think he can also catch removal of this file, etc...)
Is there some difficulty or problem? (I'm sorry if we can already do this.)
Thanks,
-Kame
> I have gathered few views, with regards to how rule engine can possibly be
> implemented, I am listing these down.
>
> Proposal 1
> ==========
> Let a user space daemon handle all that. Daemon will open a netlink socket
> and receive the notifications for various kernel events. Daemon will
> also parse appropriate admin specified rules config file and place the
> processes in right cgroup based on rules as and when events happen.
>
> I have written a prototype user space program which does that. Program
> can be found here. Currently it is in very crude shape.
>
> http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch
>
> Various people have raised two main issues with this approach.
>
> - netlink is not a reliable protocol.
> - Messages can be dropped and one can lose messages. That means a
> newly forked process might never go into right group as meant.
>
> - How to handle delays in rule execution?
> - For example, if an "exec" happens and by the time process is moved to
> right group, it might have forked off few more processes or might
> have done quite some amount of memory allocation which will be
> charged to the wrong group. Or, newly exec process might get
> killed in existing cgroup because of lack of memory (despite the
> fact that destination cgroup has sufficient memory).
>
> Proposal 2
> ==========
> Implement one or more kernel modules which will implement the rule engine.
> User space program can parse the config files and pass it to module.
> Kernel will be patched only on select points to look for the rules (as
> provided by modules). Very minimal code running inside the kernel if there
> are no rules loaded.
>
> Concerns:
> - Rules can become complex and we don't want to handle that complexity in
> kernel.
>
> Pros:
> - Reliable and precise movement of tasks in right cgroup based on rules.
>
> Proposal 3
> ==========
> How about if additional parameters can be passed to system calls and one
> can pass destination cgroup as additional parameter. Probably something
> like sys_indirect proposal. Maybe glibc can act as a wrapper to pass
> additional parameter so that applications don't need any modifications.
>
> Concerns:
> ========
> - Looks like sys_indirect interface for passing extra flags was rejected.
> - Requires extra work in glibc which can also involve parsing of rule
> files. :-(
>
> Proposal 4
> ==========
> Some vague thoughts are there regarding how about kind of freezing the
> process or thread upon fork, exec and unfreeze it once the thread has been
> placed in right cgroup.
>
> Concerns:
> ========
> - Requires reliable netlink protocol otherwise there is a possibility that
> a task never gets unfrozen.
> - On what basis does one freeze a thread. There might not be any rules to
> process for that thread we will unnecessarily delay it.
>
>
> Please provide your inputs regarding what's the best way to handle the
> rules engine.
>
> To me, letting the rules live in separate module/modules seems to be a
> reasonable way to move forward which will provide reliable and timely
> execution of rules and by making it modular, we can remove most of the
> complexity from core kernel code.
>
> Thanks
> Vivek
>
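Kame's scheme (the kernel bumps a control file's mtime, a daemon watches it)
has a simple userland shape. In practice the watcher would block in inotify
(e.g. inotifywait(1) from inotify-tools); as a dependency-free illustration of
the same loop, this polls the file's mtime instead. The file name
notify_at_memory_reach_limit is the hypothetical example from the mail above:

```shell
#!/bin/sh
# Watcher-side sketch of the mtime-based notification. mtime_of and
# watch_once factor out the detection step so a daemon loop can reuse it.

mtime_of() { stat -c %Y "$1" 2>/dev/null || echo 0; }

watch_once() {
    # watch_once <file> <last-mtime>: print "changed <new-mtime>" if the
    # file was touched since <last-mtime>, else "unchanged <last-mtime>".
    now=$(mtime_of "$1")
    if [ "$now" -gt "$2" ]; then
        echo "changed $now"
    else
        echo "unchanged $2"
    fi
}

# A real daemon would loop over a cgroup control file, e.g.:
#   f=/opt/memory_cgroup/group_A/notify_at_memory_reach_limit  # hypothetical
#   last=$(mtime_of "$f")
#   while :; do
#       set -- $(watch_once "$f" "$last"); state=$1; last=$2
#       [ "$state" = changed ] && echo "limit hit in group_A"  # handle event
#       sleep 1
#   done
```

With real inotify the sleep/poll pair disappears and events queue in the
kernel, which is one of the advantages discussed later in the thread.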
On Fri, Jul 18, 2008 at 2:52 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> For example, create a new file under memory cgroup
> ==
> /opt/memory_cgroup/group_A/notify_at_memory_reach_limit
> ==
> And a user watches the file by inotify.
> The kernel modify modified-time of notify_at_memory_reach_limit file and call
> fs/notify_user.c::notify_change() against this inode. He can catchthe event
> by inotify.
> (I think he can also catch removal of this file, etc...)
>
We've been doing something like this to handle OOMs in userspace, with
pretty good success. The approach that we used so far was a custom
control file tied to a wait queue, that gets woken when a cgroup
triggers OOM, but it's a bit hacky. I've been considering some kind of
more generic approach that could be reused by different subsystems for
other notifications, maybe using eventfd or maybe netlink.
inotify would be an option too, but that seems like it might be
forcing ourselves into filesystem semantics too much.
Paul
KAMEZAWA Hiroyuki wrote:
> On Tue, 1 Jul 2008 15:11:26 -0400
> Vivek Goyal <[email protected]> wrote:
>
>> Hi,
>>
>> While development is going on for cgroup and various controllers, we also
>> need a facility so that an admin/user can specify the group creation and
>> also specify the rules based on which tasks should be placed in respective
>> groups. Group creation part will be handled by libcg which is already
>> under development. We still need to tackle the issue of how to specify
>> the rules and how these rules are enforced (rules engine).
>>
>
> A different topic.
>
> Recently I'm interested in "How to write userland daemon program
> to control group subsystem." To implement that effectively, we need
> some notifier between user <-> kernel.
>
> Can we use "inotify" to catch changes in cgroup (by daemon program) ?
>
> For example, create a new file under memory cgroup
> ==
> /opt/memory_cgroup/group_A/notify_at_memory_reach_limit
> ==
> And a user watches the file by inotify.
> The kernel modify modified-time of notify_at_memory_reach_limit file and call
> fs/notify_user.c::notify_change() against this inode. He can catchthe event
> by inotify.
Won't the time latency be an issue (the time between exceeding the limit and
user space being notified)? The notification does not use user memory at
the moment (so it will not stress the limits further :)), provided the
notification handler is not running under the group that has exceeded its
limit. Do we expect the user space application to ACK that it's seen the
notification? We could use a netlink channel as well (in case we need two-way
communication).
I would prefer to notify on memory.failcnt, if we do use this interface.
> (I think he can also catch removal of this file, etc...)
>
> Is there some difficulty or problem ? (I'm sorry if we can do this now.)
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
On Fri, Jul 18, 2008 at 11:39:13AM -0500, Balbir Singh wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Tue, 1 Jul 2008 15:11:26 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> While development is going on for cgroup and various controllers, we also
> >> need a facility so that an admin/user can specify the group creation and
> >> also specify the rules based on which tasks should be placed in respective
> >> groups. Group creation part will be handled by libcg which is already
> >> under development. We still need to tackle the issue of how to specify
> >> the rules and how these rules are enforced (rules engine).
> >>
> >
> > A different topic.
> >
> > Recently I'm interested in "How to write userland daemon program
> > to control group subsystem." To implement that effectively, we need
> > some notifier between user <-> kernel.
> >
> > Can we use "inotify" to catch changes in cgroup (by daemon program) ?
> >
> > For example, create a new file under memory cgroup
> > ==
> > /opt/memory_cgroup/group_A/notify_at_memory_reach_limit
> > ==
> > And a user watches the file by inotify.
> > The kernel modify modified-time of notify_at_memory_reach_limit file and call
> > fs/notify_user.c::notify_change() against this inode. He can catchthe event
> > by inotify.
>
> Won't the time latency be an issue (time between exceeding the limit and the
> user space being notified?).
It does not look like it will be an issue. Of course, the faster the
notification the better, but there will be some latency. So if we get notified
on memory.failcnt then we will probably try to increase the memory limit, and
even if that takes some time it should be fine. Anyway, there is no way to
avoid latency, and hopefully we are not looking at real-time notifications and
responses. :-)
> Since the notification does not use user memory at
> the moment (it will not stress the limits futher :)), provided the notification
> handler is not running under the group that has exceeded its limit. Do we expect
> the user space application to ACK that it's seen the notification? We could use
> a netlink channel as well (in the case that we need two way communication).
>
I can't think of a reason why user space needs to send an ACK to the kernel
after seeing the event. If we are not using netlink and instead rely on
inotify coupled with epoll, then we should not lose any events and the kernel
need not be acked back.
Given the fact that netlink can drop packets, I am not sure how good an
option netlink is for cgroup notifications. Is it too hard to stick to
filesystem semantics for notifications?
Thanks
Vivek
----- Original Message -----
>On Fri, Jul 18, 2008 at 2:52 AM, KAMEZAWA Hiroyuki
><[email protected]> wrote:
>>
>> For example, create a new file under memory cgroup
>> ==
>> /opt/memory_cgroup/group_A/notify_at_memory_reach_limit
>> ==
>> And a user watches the file by inotify.
>> The kernel modify modified-time of notify_at_memory_reach_limit file and call
>> fs/notify_user.c::notify_change() against this inode. He can catchthe event
>> by inotify.
>> (I think he can also catch removal of this file, etc...)
>>
>
>We've been doing something like this to handle OOMs in userspace, with
>pretty good success. The approach that we used so far was a custom
>control file tied to a wait queue, that gets woken when a cgroup
>triggers OOM, but it's a bit hacky. I've been considering some kind of
>more generic approach that could be reused by different subsystems for
>other notifications, maybe using eventfd or maybe netlink.
>
Hmm, eventfd is the AIO one?
Anyway, I agree we need something generic (hopefully, reusing an existing one).
>inotify would be an option too, but that seems like it might be
>forcing ourselves into filesystem semantics too much.
>
At a quick glance, inotify's good points are:
- it can be used for any file; for example, even changes in the "tasks" file
can be caught if the kernel modifies its modified-time;
- events can be queued;
- it supports ONESHOT, NONBLOCK, etc...;
- all memory allocation is done by the waiter (the user).
But yes, we cannot deliver any event other than "there is some change".
Thanks,
-Kame
----- Original Message -----
>KAMEZAWA Hiroyuki wrote:
>> On Tue, 1 Jul 2008 15:11:26 -0400
>> Vivek Goyal <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> While development is going on for cgroup and various controllers, we also
>>> need a facility so that an admin/user can specify the group creation and
>>> also specify the rules based on which tasks should be placed in respective
>>> groups. Group creation part will be handled by libcg which is already
>>> under development. We still need to tackle the issue of how to specify
>>> the rules and how these rules are enforced (rules engine).
>>>
>>
>> A different topic.
>>
>> Recently I'm interested in "How to write userland daemon program
>> to control group subsystem." To implement that effectively, we need
>> some notifier between user <-> kernel.
>>
>> Can we use "inotify" to catch changes in cgroup (by daemon program) ?
>>
>> For example, create a new file under memory cgroup
>> ==
>> /opt/memory_cgroup/group_A/notify_at_memory_reach_limit
>> ==
>> And a user watches the file by inotify.
>> The kernel modifies the modified-time of the notify_at_memory_reach_limit file
>> and calls fs/notify_user.c::notify_change() against this inode. He can catch
>> the event by inotify.
>
>Won't the time latency be an issue (the time between exceeding the limit and
>user space being notified)? The notification does not use user memory at the
>moment (it will not stress the limits further :)), provided the notification
>handler is not running under the group that has exceeded its limit. Do we
>expect the user space application to ACK that it's seen the notification? We
>could use a netlink channel as well (in case we need two-way communication).
>
>I would prefer to notify on memory.failcnt, if we do use this interface.
>
Maybe we need some technique for "how to run a daemon in a proper way"
(use a special daemon cgroup, etc...).
I don't think the user space has to ACK to the kernel. The user space
can modify a control file when it gets events, but that's all it can do, anyway.
Thanks,
-Kame
The problem of placing tasks in respective cgroups seems to be correctly
addressed by userspace lib wrappers or classifier daemons [1].
However, this is an attempt to implement an in-kernel classifier.
[ I wrote this patch for a "special purpose" environment, where a lot of
short-lived processes belonging to different users are spawned by
different daemons, so the main goal here would be to remove the delay
needed by userspace classification and place the tasks in the right
cgroup at the time they're created. This is just an ugly hack for now
and it works only for uid-based rules, gid-based rules could be
implemented in a similar way. ]
UID:cgroup associations are stored in a RCU-protected hash list.
The kernel<->userspace interface works as following:
- the file "uids" is added in the cgroup filesystem
- a UID can be placed only in a single cgroup
- a cgroup can have multiple UIDs
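Assuming the patch below is applied, the interface would be driven roughly like this (the mount point and group name are invented for illustration; this is a sketch of the proposed semantics, not a tested session):

```shell
# create a group and bind uid 1000 to it (paths hypothetical)
mount -t cgroup none /dev/cgroup
mkdir /dev/cgroup/students
echo 1000 > /dev/cgroup/students/uids

# a cgroup can hold several uids, but each uid maps to one cgroup only
echo 1001 > /dev/cgroup/students/uids
cat /dev/cgroup/students/uids
```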
With respect to the userspace solution (e.g. a classifier daemon), this
solution has the advantage of removing the delay for task classification,
which means each task always runs in the appropriate cgroup at the time it is
created (fork, exec) or when the uid changes (setuid).
OTOH the disadvantage is introducing this complexity into the kernel.
[1] http://lkml.org/lkml/2008/7/1/391
Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/cgroup.h | 9 +++
kernel/cgroup.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sys.c | 6 ++-
3 files changed, 154 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30934e4..243819a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -393,6 +393,7 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
int cgroup_scan_tasks(struct cgroup_scanner *scan);
int cgroup_attach_task(struct cgroup *, struct task_struct *);
+struct cgroup *uid_to_cgroup(uid_t uid);
#else /* !CONFIG_CGROUPS */
@@ -411,6 +412,14 @@ static inline int cgroupstats_build(struct cgroupstats *stats,
{
return -EINVAL;
}
+static inline int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+ return 0;
+}
+static inline struct cgroup *uid_to_cgroup(uid_t uid)
+{
+ return NULL;
+}
#endif /* !CONFIG_CGROUPS */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 791246a..5a010db 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1318,6 +1318,7 @@ enum cgroup_filetype {
FILE_ROOT,
FILE_DIR,
FILE_TASKLIST,
+ FILE_UIDLIST,
FILE_NOTIFY_ON_RELEASE,
FILE_RELEASE_AGENT,
};
@@ -2203,6 +2204,131 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
return 0;
}
+#define CGROUP_UID_HASH_SHIFT 9
+#define CGROUP_UID_HASH_SIZE (1UL << CGROUP_UID_HASH_SHIFT)
+#define cgroup_uid_hashfn(__uid) \
+ hash_long((unsigned long)__uid, CGROUP_UID_HASH_SHIFT)
+
+struct cgroup_uid {
+ uid_t uid;
+ struct cgroup *cgroup;
+ struct hlist_node cgroup_uid_chain;
+};
+
+/* hash list to store uid:cgroup associations (protected by RCU locking) */
+static struct hlist_head *cgroup_uids;
+
+/* spinlock to protect cgroup_uids write operations */
+static __cacheline_aligned DEFINE_SPINLOCK(cgroup_uid_lock);
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static struct cgroup_uid *cgroup_uid_find_item(uid_t uid)
+{
+ struct hlist_node *item;
+ struct cgroup_uid *u;
+
+ hlist_for_each_entry_rcu(u, item, &cgroup_uids[cgroup_uid_hashfn(uid)],
+ cgroup_uid_chain)
+ if (u->uid == uid)
+ return u;
+ return NULL;
+}
+
+struct cgroup *uid_to_cgroup(uid_t uid)
+{
+ struct cgroup_uid *cu;
+ struct cgroup *ret;
+
+ rcu_read_lock();
+ cu = cgroup_uid_find_item(uid);
+ ret = cu ? cu->cgroup : NULL;
+ rcu_read_unlock();
+ return ret;
+}
+
+static int cgroup_uid_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct hlist_node *item;
+ struct cgroup_uid *u;
+ int i;
+
+ rcu_read_lock();
+ for (i = 0; i < CGROUP_UID_HASH_SIZE; i++)
+ hlist_for_each_entry_rcu(u, item, &cgroup_uids[i],
+ cgroup_uid_chain)
+ if (u->cgroup == cgrp)
+ seq_printf(m, "%u\n", u->uid);
+ rcu_read_unlock();
+ return 0;
+}
+
+static int cgroup_uid_write(struct cgroup *cgrp, struct cftype *cft, u64 uid)
+{
+ struct cgroup_uid *u, *old_u;
+
+ u = kmalloc(sizeof(*u), GFP_KERNEL);
+ if (unlikely(!u))
+ return -ENOMEM;
+ u->uid = (uid_t)uid;
+ u->cgroup = cgrp;
+
+ spin_lock_irq(&cgroup_uid_lock);
+ old_u = cgroup_uid_find_item(uid);
+ if (old_u) {
+ /* Replace old element with newer */
+ hlist_replace_rcu(&old_u->cgroup_uid_chain,
+ &u->cgroup_uid_chain);
+ spin_unlock_irq(&cgroup_uid_lock);
+ synchronize_rcu();
+ kfree(old_u);
+ return 0;
+ }
+ /* Add the new element to the cgroup uid hash list */
+ hlist_add_head_rcu(&u->cgroup_uid_chain,
+ &cgroup_uids[cgroup_uid_hashfn(uid)]);
+ spin_unlock_irq(&cgroup_uid_lock);
+ return 0;
+}
+
+static int cgroup_uid_cleanup(struct cgroup *cgrp)
+{
+ HLIST_HEAD(old_items);
+ struct hlist_node *item, *n;
+ struct cgroup_uid *u;
+ int i;
+
+ spin_lock_irq(&cgroup_uid_lock);
+ for (i = 0; i < CGROUP_UID_HASH_SIZE; i++)
+ hlist_for_each_entry_safe(u, item, n, &cgroup_uids[i],
+ cgroup_uid_chain)
+ if (u->cgroup == cgrp) {
+ hlist_del_rcu(&u->cgroup_uid_chain);
+ hlist_add_head(&u->cgroup_uid_chain,
+ &old_items);
+ }
+ spin_unlock_irq(&cgroup_uid_lock);
+ synchronize_rcu();
+ hlist_for_each_entry_safe(u, item, n, &old_items, cgroup_uid_chain)
+ kfree(u);
+ return 0;
+}
+
+static int __init init_cgroup_uid(void)
+{
+ int i;
+
+ cgroup_uids = kmalloc(sizeof(*cgroup_uids) * CGROUP_UID_HASH_SIZE,
+ GFP_KERNEL);
+ if (unlikely(!cgroup_uids))
+ return -ENOMEM;
+ for (i = 0; i < CGROUP_UID_HASH_SIZE; i++)
+ INIT_HLIST_HEAD(&cgroup_uids[i]);
+ return 0;
+}
+
/*
* for the common functions, 'private' gives the type of file
*/
@@ -2215,7 +2341,12 @@ static struct cftype files[] = {
.release = cgroup_tasks_release,
.private = FILE_TASKLIST,
},
-
+ {
+ .name = "uids",
+ .read_seq_string = cgroup_uid_read,
+ .write_u64 = cgroup_uid_write,
+ .private = FILE_UIDLIST,
+ },
{
.name = "notify_on_release",
.read_u64 = cgroup_read_notify_on_release,
@@ -2434,6 +2565,8 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
return -EBUSY;
}
+ cgroup_uid_cleanup(cgrp);
+
spin_lock(&release_list_lock);
set_bit(CGRP_REMOVED, &cgrp->flags);
if (!list_empty(&cgrp->release_list))
@@ -2550,6 +2683,8 @@ int __init cgroup_init(void)
if (err)
return err;
+ init_cgroup_uid();
+
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
@@ -2700,11 +2835,15 @@ static struct file_operations proc_cgroupstats_operations = {
*/
void cgroup_fork(struct task_struct *child)
{
+ struct cgroup *cgrp = uid_to_cgroup(child->uid);
+
task_lock(current);
child->cgroups = current->cgroups;
get_css_set(child->cgroups);
task_unlock(current);
INIT_LIST_HEAD(&child->cg_list);
+ if (cgrp)
+ cgroup_attach_task(cgrp, child);
}
/**
diff --git a/kernel/sys.c b/kernel/sys.c
index c018580..d22e815 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -19,6 +19,7 @@
#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/capability.h>
+#include <linux/cgroup.h>
#include <linux/device.h>
#include <linux/key.h>
#include <linux/times.h>
@@ -548,10 +549,11 @@ asmlinkage long sys_setgid(gid_t gid)
proc_id_connector(current, PROC_EVENT_GID);
return 0;
}
-
+
static int set_user(uid_t new_ruid, int dumpclear)
{
struct user_struct *new_user;
+ struct cgroup *cgrp = uid_to_cgroup(new_ruid);
new_user = alloc_uid(current->nsproxy->user_ns, new_ruid);
if (!new_user)
@@ -571,6 +573,8 @@ static int set_user(uid_t new_ruid, int dumpclear)
smp_wmb();
}
current->uid = new_ruid;
+ if (cgrp)
+ cgroup_attach_task(cgrp, current);
return 0;
}
On Sun, Aug 17, 2008 at 12:33:31PM +0200, Andrea Righi wrote:
> The problem of placing tasks in respective cgroups seems to be correctly
> addressed by userspace lib wrappers or classifier daemons [1].
>
> However, this is an attempt to implement an in-kernel classifier.
>
> [ I wrote this patch for a "special purpose" environment, where a lot of
> short-lived processes belonging to different users are spawned by
> different daemons, so the main goal here would be to remove the delay
> needed by userspace classification and place the tasks in the right
> cgroup at the time they're created. This is just an ugly hack for now
> and it works only for uid-based rules, gid-based rules could be
> implemented in a similar way. ]
>
Hi Andrea,
Recently I introduced the infrastructure in libcgroup to handle
the task placement issue based on uid and gid rules. This is what I did.
- Introduced two new APIs in libcgroup to place the task in right cgroup.
- cgroup_change_cgroup_uid_gid
Places the task in the destination cgroup based on uid/gid
rules specified in /etc/cgrules.conf
- cgroup_change_cgroup_path
Puts the task into the cgroup specified by caller
- Provided two command line tools (cgexec and cgclassify) to perform
various process placement related tasks.
- cgexec
A tool to launch a task in a user-specified cgroup
- cgclassify
A tool to re-classify already running tasks.
- Wrote a pam plugin so that tasks are placed in the right user groups upon
login or receipt of other services which use pam's help.
- Currently work is in progress for a user space daemon which will
automatically place the tasks based on notifications.
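For reference, the two tools are used along these lines (the group paths are examples, and option syntax may differ across libcgroup versions):

```shell
# launch make in the cpu hierarchy's faculty/john group
cgexec -g cpu:usergroup/faculty/john make

# move an already-running pid 1234 into the same group
cgclassify -g cpu:usergroup/faculty/john 1234
```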
For your environment, where delay is unbearable, I think you can modify
the daemon to use libcgroup to place the forked task in the right cgroup
before actually executing it. Once the task has been placed in the right
cgroup, exec() will be called.
We have been doing all the user space development on following mailing
list.
https://lists.sourceforge.net/lists/listinfo/libcg-devel
Latest patches which got merged in libcgroup, are here.
http://sourceforge.net/mailarchive/forum.php?thread_name=20080813171720.108005557%40redhat.com&forum_name=libcg-devel
It is accompanied with a decent README file for design details and for
how to use it.
I think modifying the daemon to make use of libcgroup is a better way
to handle this issue than duplicating the infrastructure in both user
space and kernel space.
Thanks
Vivek
On Sun, Aug 17, 2008 at 3:33 AM, Andrea Righi <[email protected]> wrote:
>
> [ I wrote this patch for a "special purpose" environment, where a lot of
> short-lived processes belonging to different users are spawned by
> different daemons,
What kinds of daemons are these? Is it not possible to add some
libcgroup calls to these daemons?
I'm reluctant to add features like this to the kernel side of cgroups
due to their "magical" nature - any task that does a setuid() now
risks being swept off into a different cgroup.
Having the cgroup attachment done explicitly e.g. by a PAM library at
login time is much less likely to cause unexpected behaviour.
Maybe if we had a way to control which tasks the magical setuid
switching occurs for, it might be more acceptable. (Perhaps base it on
the cgroup of the task that's doing the setuid as well?
Other thoughts:
- what about other uids (euid, fsuid)?
- what about multiple hierarchies?
- if the attach fails, userspace gets no notification.
Paul
On Mon, Aug 18, 2008 at 02:05:36PM -0700, Paul Menage wrote:
> On Sun, Aug 17, 2008 at 3:33 AM, Andrea Righi <[email protected]> wrote:
> >
> > [ I wrote this patch for a "special purpose" environment, where a lot of
> > short-lived processes belonging to different users are spawned by
> > different daemons,
>
> What kinds of daemons are these? Is it not possible to add some
> libcgroup calls to these daemons?
>
> I'm reluctant to add features like this to the kernel side of cgroups
> due to their "magical" nature - any task that does a setuid() now
> risks being swept off into a different cgroup.
>
> Having the cgroup attachment done explicitly e.g. by a PAM library at
> login time is much less likely to cause unexpected behaviour.
>
> Maybe if we had a way to control which tasks the magical setuid
> switching occurs for, it might be more acceptable. (Perhaps base it on
> the cgroup of the task that's doing the setuid as well?
Hi Paul,
Same thing will happen if we implement the daemon in user space. A task
that does seteuid() can be swept away to a different cgroup based on
rules specified in /etc/cgrules.conf.
What do you mean by risk? This is the policy set up by system admin and
behaviour would seem consistent as per the policy. If an admin decides
that tasks of user "apache" should run in the /container/cpu/apache cgroup and
if a "root" task does seteuid(apache), then it makes sense to move the task
to /container/cpu/apache.
Exactly what kind of scenario do you have in mind when you want the policy
to be enforced selectively based on task (tid)?
Thanks
Vivek
On Mon, Aug 18, 2008 at 08:35:26AM -0400, Vivek Goyal wrote:
> On Sun, Aug 17, 2008 at 12:33:31PM +0200, Andrea Righi wrote:
> > The problem of placing tasks in respective cgroups seems to be correctly
> > addressed by userspace lib wrappers or classifier daemons [1].
> >
> > However, this is an attempt to implement an in-kernel classifier.
> >
> > [ I wrote this patch for a "special purpose" environment, where a lot of
> > short-lived processes belonging to different users are spawned by
> > different daemons, so the main goal here would be to remove the delay
> > needed by userspace classification and place the tasks in the right
> > cgroup at the time they're created. This is just an ugly hack for now
> > and it works only for uid-based rules, gid-based rules could be
> > implemented in a similar way. ]
> >
>
> Hi Andrea,
yep! I'm having some troubles with my internet connection, and it seems
my previous reply got lost.. :( Resending it; sorry for the noise if
you receive more than one mail.
>
> Recently I introduced the infrastructure in libcgroup to handle
> the task placement issue based on uid and gid rules. This is what I did.
>
> - Introduced two new APIs in libcgroup to place the task in right cgroup.
> - cgroup_change_cgroup_uid_gid
> Places the task in the destination cgroup based on uid/gid
> rules specified in /etc/cgrules.conf
> - cgroup_change_cgroup_path
> Puts the task into the cgroup specified by caller
>
> - Provided two command line tools (cgexec and cgclassify) to perform
> various process placement related tasks.
> - cgexec
> A tool to launch a task in a user-specified cgroup
> - cgclassify
> A tool to re-classify already running tasks.
>
> - Wrote a pam plugin so that tasks are placed in right user groups upon
> login or reception of other services which take pam's help.
That's interesting. All the daemons that provide access to a system
should be pam-aware, so with the pam plugin I should be able to handle all
the cases. Unfortunately I don't have too many details about those
daemons, and in fact I was looking for the most generic solution..
>
> - Currently work is in progress for a user space daemon which will
> automatically place the tasks based on notifications.
>
> For your environment, where delay is unbearable, I think you can modify
> the daemon to use libcgroup to place the forked task in right cgroup
> before actually executing it. Once the task has been placed in right
> cgroup, exec() will be called.
>
The daemons should all use the exec() + setuid() way. If pam doesn't
help I'll try to wrap setuid(), using a wrapper lib or something
similar.
> We have been doing all the user space development on following mailing
> list.
>
> https://lists.sourceforge.net/lists/listinfo/libcg-devel
>
> Latest patches which got merged in libcgroup, are here.
>
> http://sourceforge.net/mailarchive/forum.php?thread_name=20080813171720.108005557%40redhat.com&forum_name=libcg-devel
>
> It is accompanied with a decent README file for design details and for
> how to use it.
Thanks, I'll look at the latest libcgroup features ASAP.
>
> I think modifying the daemon to make use of libcgroup is the right way
> to handle this issue than duplicating the infrastructure in user space
> as well as kernel space.
Totally agree in perspective (obviously when it's possible/reasonable in
terms of efforts to change the userspace daemon).
Thanks,
-Andrea
On 8/18/08, Paul Menage <[email protected]> wrote:
> On Sun, Aug 17, 2008 at 3:33 AM, Andrea Righi <[email protected]>
> wrote:
>>
>> [ I wrote this patch for a "special purpose" environment, where a lot of
>> short-lived processes belonging to different users are spawned by
>> different daemons,
>
> What kinds of daemons are these? Is it not possible to add some
> libcgroup calls to these daemons?
unfortunately I don't have too many details for now, so I was just
looking for the most generic solution. The PAM lib approach seems
reasonable for each daemon that represents an entry point to the
system, and, to a large degree, I like the userspace solution (e.g.
libcgroup as reported by Vivek). It seems to be the right way to
handle all the possible/complex rules an admin would like to define.
>
> I'm reluctant to add features like this to the kernel side of cgroups
agree
> due to their "magical" nature - any task that does a setuid() now
> risks being swept off into a different cgroup.
If the admin configures it so, moving tasks that do setuid() into different
cgroups should be expected behaviour, shouldn't it?
>
> Having the cgroup attachment done explicitly e.g. by a PAM library at
> login time is much less likely to cause unexpected behaviour.
>
> Maybe if we had a way to control which tasks the magical setuid
> switching occurs for, it might be more acceptable. (Perhaps base it on
> the cgroup of the task that's doing the setuid as well?
do you mean create a cgroup subsystem to handle different per-cgroup
setuid() switching behaviours?
>
> Other thoughts:
>
> - what about other uids (euid, fsuid)?
>
> - what about multiple hierarchies?
>
> - if the attach fails, userspace gets no notification.
good points.
For the last one we could just return an error code from cgroup_fork()
and goto bad_fork_cleanup_cgroup (in this way the fork/exec would
fail anyway).
>
> Paul
>
Thanks,
-Andrea
On Tue, Aug 19, 2008 at 5:57 AM, Vivek Goyal <[email protected]> wrote:
>
> Same thing will happen if we implement the daemon in user space. A task
> who does seteuid(), can be swept away to a different cgroup based on
> rules specified in /etc/cgrules.conf.
Yes, I'm not so keen on a daemon magically pulling things into a
cgroup based on uid either, for the same reasons.
But a user-space based solution can be much more flexible (e.g. easier
to configure it to only move tasks from certain source cgroups).
>
> What do you mean by risk? This is the policy set up by system admin and
> behaviour would seem consistent as per the policy. If an admin decides
> that tasks of user "apache" should run into /container/cpu/apache cgroup and
> if a "root" task does seteuid(apache), then it makes sense to move the task
> to /container/cpu/apache.
The kind of unexpected behaviour I was imagining was when some other
daemon (e.g. ftpd?) unexpectedly does a setuid to one of the
magically-controlled users, and results in that daemon being pulled
into the specified cgroup. For something like cpu maybe that's mostly
benign (but what moves it back into its original group after it
switches back to root?) but for other subsystems it could be more
painful (memory, device access, etc).
>
> Exactly what kind of scenario do you have in mind when you want the policy
> to be enforced selectively based on task (tid)?
I was thinking of something like possibly a per-cgroup file (that also
affected child cgroups) rather than a global file. That would also
automatically handle multiple hierarchies.
Paul
On Tue, Aug 19, 2008 at 8:12 AM, <[email protected]> wrote:
>
> unfortunately I don't have too much details for now, so I was just
> looking for the most generic solution. The PAM lib approach seems
> reasonable for each daemon that represents an entry point to the
> system,
The PAM approach seems like the cleanest solution to me.
>
>> due to their "magical" nature - any task that does a setuid() now
>> risks being swept off into a different cgroup.
>
> If the admin configures so, moving tasks that do setuid() in different
> cgroups should be an expected behaviour, isn't it?
Is the sysadmin aware of all the places in all system daemons that do
setuid() calls?
Paul
On Mon, Aug 25, 2008 at 05:54:39PM -0700, Paul Menage wrote:
> On Tue, Aug 19, 2008 at 5:57 AM, Vivek Goyal <[email protected]> wrote:
> >
> > Same thing will happen if we implement the daemon in user space. A task
> > who does seteuid(), can be swept away to a different cgroup based on
> > rules specified in /etc/cgrules.conf.
>
> Yes, I'm not so keen on a daemon magically pulling things into a
> cgroup based on uid either, for the same reasons.
>
> But a user-space based solution can be much more flexible (e.g. easier
> to configure it to only move tasks from certain source cgroups).
>
> >
> > What do you mean by risk? This is the policy set up by system admin and
> > behaviour would seem consistent as per the policy. If an admin decides
> > that tasks of user "apache" should run into /container/cpu/apache cgroup and
> > if a "root" task does seteuid(apache), then it makes sense to move the task
> > to /container/cpu/apache.
>
> The kind of unexpected behaviour I was imagining was when some other
> daemon (e.g. ftpd?) unexpectedly does a setuid to one of the
> magically-controlled users, and results in that daemon being pulled
> into the specified cgroup. For something like cpu maybe that's mostly
> benign (but what moves it back into its original group after it
> switches back to root?)
Once ftpd does seteuid() or setreuid() again to switch the effective user to
"root", it will be moved back to the original group (root's group).
So the basic question is: if a program changes its effective user id
temporarily to user B, should all resource consumption then be charged to
user B's resources, or should it continue to be charged to the original
cgroup?
I would think that we should move the task temporarily to B's cgroup and
bring it back again upon identity change.
At the same time I can also understand that this behavior can probably
be considered over-intrusive and some people might want to avoid that.
Two things come to my mind.
- Users who find it too intrusive can just shut down the rules-based
  daemon.
- Or, we can implement selective movement of tasks by the daemon as suggested
  by you. This will make the system more complex but provides more
  flexibility, in the sense that users can keep the daemon running and at the
  same time control movement of certain tasks.
> but for other subsystems it could be more
> painful (memory, device access, etc).
>
> >
> > Exactly what kind of scenario do you have in mind when you want the policy
> > to be enforced selectively based on task (tid)?
>
> I was thinking of something like possibly a per-cgroup file (that also
> affected child cgroups) rather than a global file. That would also
> automatically handle multiple hierarchies.
>
So there can be two kinds of controls.
- Create a per-cgroup file, say "group_pinned"; if 1 is written to
  "group_pinned", the daemon will not move tasks from this cgroup upon
  effective uid/gid changes.
- Provide more fine-grained control where task movement is not controlled
  per cgroup, but rather per thread id. In that case every cgroup will contain
  another file, "tasks_pinned", which will contain all the tids which cannot
  be moved from this cgroup by the daemon. By default this file will be empty
  and all the tids are movable.
I think initially we can keep things simple and implement "group_pinned"
which provides coarse control on the whole group and pins all the tasks
in that cgroup.
Thoughts?
Thanks
Vivek
Vivek Goyal wrote:
> On Mon, Aug 25, 2008 at 05:54:39PM -0700, Paul Menage wrote:
>> On Tue, Aug 19, 2008 at 5:57 AM, Vivek Goyal <[email protected]> wrote:
>>> Same thing will happen if we implement the daemon in user space. A task
>>> who does seteuid(), can be swept away to a different cgroup based on
>>> rules specified in /etc/cgrules.conf.
>> Yes, I'm not so keen on a daemon magically pulling things into a
>> cgroup based on uid either, for the same reasons.
>>
>> But a user-space based solution can be much more flexible (e.g. easier
>> to configure it to only move tasks from certain source cgroups).
>>
>>> What do you mean by risk? This is the policy set up by system admin and
>>> behaviour would seem consistent as per the policy. If an admin decides
>>> that tasks of user "apache" should run into /container/cpu/apache cgroup and
>>> if a "root" task does seteuid(apache), then it makes sense to move the task
>>> to /container/cpu/apache.
>> The kind of unexpected behaviour I was imagining was when some other
>> daemon (e.g. ftpd?) unexpectedly does a setuid to one of the
>> magically-controlled users, and results in that daemon being pulled
>> into the specified cgroup. For something like cpu maybe that's mostly
>> benign (but what moves it back into its original group after it
>> switches back to root?)
>
> Once ftpd does seteuid() or setreuid() again to switch effective user to
> "root", it will be moved back to original group (root's group).
>
> So basic question is if a program changes its effective user id temporarily
> to user B, then all the resource consumption should take place from the
> resources of user B or should continue to take place from original cgroup.
>
> I would think that we should move the task temporarily to B's cgroup and
> bring back again upon identity change.
>
> At the same time I can also understand that this behavior can probably
> be considered over-intrusive and some people might want to avoid that.
>
> Two things come to my mind.
>
> - Users who find it too intrusive, can just shut down the rules based
> daemon.
>
Yes, I would say administrators should do that. Classification via setuid()
does make a lot of sense, but at the same time it might be too aggressive if
an application uses setuid() frequently.
> - Or, we can implement selective movement of tasks by daemon as suggested by
> you. This will make system more complex but provides more flexibility
> in the sense users can keep daemon running at the same time control
> movement of certain tasks.
>
Applications that really care about moving should use cgroup_attach_task* and
move back otherwise with cgrules parsing turned off.
I see control as a two-level hierarchy, automatic and controlled; how we make
sure that they don't conflict is something I have not thought about yet.
>> but for other subsystems it could be more
>> painful (memory, device access, etc).
>>
>
>
>>> Exactly what kind of scenario do you have in mind when you want the policy
>>> to be enforced selectively based on task (tid)?
>> I was thinking of something like possibly a per-cgroup file (that also
>> affected child cgroups) rather than a global file. That would also
>> automatically handle multiple hierarchies.
>>
>
> So there can be two kinds of controls.
>
> - Create a per cgroup file say "group_pinned", where if 1 is written to
> "group_pinned" that means daemon will not move tasks from this cgroup upon
> effective uid/gid changes.
>
> - Provide more fine grained control where task movement is not controlled
> per cgroup, rather per thread id. In that case every cgroup will contain
> another file "tasks_pinned" which will contain all the tids which cannot
> be moved from this cgroup by daemon. By default this file will be empty
> and all the tids are movable.
>
> I think initially we can keep things simple and implement "group_pinned"
> which provides coarse control on the whole group and pins all the tasks
> in that cgroup.
>
Hmm... I wonder if we are providing too many knobs. Can't we do something simpler?
--
Balbir
Balbir Singh wrote:
> Applications that really care about moving should use cgroup_attach_task* and
> move back otherwise with cgrules parsing turned off.
>
> I see control as a two-level hierarchy, automatic and controlled; how we make
> sure that they don't conflict is something I have not thought about yet.
[...]
> Hmm... I wonder if we are providing too many knobs. Can't we do something simpler?
Solaris doesn't try to change cgroup ("project") on a setuid call, assuming
the program is in the proper cgroup initially. For most cases this is
trivially true under the very simple default rules, and for the rest one
can create a rule or a startup script that sets it with "newtask".
The Sun default is
$ cat /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
Which means that root users are distinguished from users in
the staff group, and they are distinguished from daemons
and everyone else.
Personally, I add
user.davecb:101::davecb::
bg:100:Background jobs:davecb::
which puts me in a separate cgroup, and provides another one
for me to put background tasks into. The latter allows
me to keep them from reducing the interactive performance of
my laptop.
In practice, this looks like:
$ prstat -J
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
695 davecb 52M 38M sleep 1 0 0:01:41 2.4% Xsun/1
1025 davecb 150M 88M sleep 59 0 0:04:25 1.9% mozilla-bin/5
926 davecb 73M 16M sleep 33 0 0:00:11 1.3% gnome-terminal/2
1067 davecb 6232K 5224K cpu0 54 0 0:00:00 0.3% prstat/1
918 davecb 66M 15M sleep 59 0 0:00:15 0.2% metacity/1
956 davecb 67M 13M sleep 59 0 0:00:04 0.1% gnome-netstatus/1
958 davecb 66M 12M sleep 59 0 0:00:02 0.1% mixer_applet2/1
931 root 2112K 1240K sleep 59 0 0:00:01 0.0% rpc.rstatd/1
954 davecb 68M 15M sleep 57 0 0:00:06 0.0% wnck-applet/1
920 davecb 71M 17M sleep 59 0 0:00:04 0.0% gnome-panel/1
943 davecb 1408K 1136K sleep 57 0 0:00:00 0.0% ksh/1
871 davecb 3984K 2656K sleep 59 0 0:00:01 0.0% xscreensaver/1
916 davecb 10M 4936K sleep 59 0 0:00:01 0.0% gnome-smproxy/1
924 davecb 67M 13M sleep 59 0 0:00:01 0.0% gnome-perfmeter/1
116 root 4352K 1168K sleep 59 0 0:00:00 0.0% lp/1
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
101 32 1050M 352M 71% 0:07:02 6.4% user.davecb
0 49 192M 73M 15% 0:00:22 0.1% system
3 5 33M 11M 2.2% 0:00:00 0.0% default
I'm using 6.4% of the CPU, the daemons are using 0.1% and even a
terribly CPU-heavy program will not starve the others of resources.
So for me, cgroups/projects are golden, and the simplest rules
suffice.
--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
On Tue, Aug 26, 2008 at 11:04:42AM -0400, David Collier-Brown wrote:
> Balbir Singh wrote:
>> Applications that really care about moving should use cgroup_attach_task* and
>> move back otherwise with cgrules parsing turned off.
>>
>> I see control as a two level hierarchy, automatic and controlled, how do we make
>> sure that they don't conflict is something I have not thought about yet.
> [...]
>
>> Hmm... I wonder if we are providing too many knobs. Can't we do something simpler?
>
> Solaris doesn't try to change cgroup ("project") on a setuid call, assuming
> the program is in the proper cgroup initially. For most cases this is
> trivially true under the very simple default rules, and for the rest one
> can create a rule or a startup script that sets it with "newtask".
>
Who executes default rules? IOW, how do you make sure tasks of user.davecb
end up in project 101 only and not outside?
> The Sun default is
> $ cat /etc/project
> system:0::::
> user.root:1::::
> noproject:2::::
> default:3::::
> group.staff:10::::
>
> Which means that root users are distinguished from users in
> the staff group, and they are distinguished from daemons
> and everyone else.
>
Now Linux will also allow the admin to specify simple rules in
/etc/cgrules.conf. The rules are based on the premise that users/groups
own resources in a particular cgroup, and one can specify which cgroup
a task should run in. For example:
#john cpu usergroup/faculty/john/
#@student cpu,memory usergroup/student/
#@root * admingroup/
#* * default/
This simply says which user's/group's tasks should run in which cgroup for
which controller. (There are also some wildcards.) For details, you can
check out the libcg-devel source tree and documentation files.
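To make the matching concrete, here is a small illustrative shell sketch of the kind of lookup such a rules engine performs. The function name and the simplified first-match semantics are mine, not libcg's actual parser; @group rules and rule continuation are omitted:

```shell
# resolve_cgroup USER CONTROLLER
# Reads cgrules.conf-style lines on stdin ("user controllers cgroup") and
# prints the first matching destination cgroup.  "*" acts as a wildcard
# for both the user field and the controller list.
resolve_cgroup() {
    user="$1"; controller="$2"
    while read -r who ctrls dest; do
        case "$who" in ""|"#"*) continue ;; esac        # skip blanks/comments
        [ "$who" = "$user" ] || [ "$who" = "*" ] || continue
        case ",$ctrls," in
            *",$controller,"*|",*,") echo "$dest"; return 0 ;;
        esac
    done
    return 1
}

rules='john cpu usergroup/faculty/john/
* * default/'

printf '%s\n' "$rules" | resolve_cgroup john cpu      # -> usergroup/faculty/john/
printf '%s\n' "$rules" | resolve_cgroup alice memory  # -> default/
```

The second call falls through to the "* *" catch-all rule, mirroring the "#* * default/" line in the example above.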
> Personally, I add
> user.davecb:101::davecb::
> bg:100:Background jobs:davecb::
> which puts me in a separate cgroup, and provides another one
> for me to put background tasks into. The latter allows
> me to keep them from reducing the interactive performance of
> my laptop.
So by default all the tasks of user.davecb will run in project 101 until
user davecb decides to launch some background jobs in project 100 using
newtask?
"newtask" like functionality is being provided by a new command line tool
"cgexec" which will allow launching of a new task in specific cgroup
(project).
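For illustration, a cgexec invocation corresponding to the newtask examples would look something like this (the cgroup path is made up; check the libcg documentation for the exact syntax):

```shell
# Run a command in a specific cgroup for the cpu controller
# (the group path here is illustrative)
cgexec -g cpu:usergroup/faculty/john/ make -j4
```

This is shown as usage only, since it requires libcgroup and a mounted cgroup hierarchy.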
Thanks
Vivek
On Tue, Aug 26, 2008 at 08:05:12PM +0530, Balbir Singh wrote:
> Vivek Goyal wrote:
> > On Mon, Aug 25, 2008 at 05:54:39PM -0700, Paul Menage wrote:
> >> On Tue, Aug 19, 2008 at 5:57 AM, Vivek Goyal <[email protected]> wrote:
> >>> Same thing will happen if we implement the daemon in user space. A task
> >>> who does seteuid(), can be swept away to a different cgroup based on
> >>> rules specified in /etc/cgrules.conf.
> >> Yes, I'm not so keen on a daemon magically pulling things into a
> >> cgroup based on uid either, for the same reasons.
> >>
> >> But a user-space based solution can be much more flexible (e.g. easier
> >> to configure it to only move tasks from certain source cgroups).
> >>
> >>> What do you mean by risk? This is the policy set up by system admin and
> >>> behaviour would seem consistent as per the policy. If an admin decides
> >>> that tasks of user "apache" should run into /container/cpu/apache cgroup and
> >>> if a "root" tasks does seteuid(apache), then it manes sense to move task
> >>> to /container/cpu/apache.
> >> The kind of unexpected behaviour I was imagining was when some other
> >> daemon (e.g. ftpd?) unexpectedly does a setuid to one of the
> >> magically-controlled users, and results in that daemon being pulled
> >> into the specified cgroup. For something like cpu maybe that's mostly
> >> benign (but what moves it back into its original group after it
> >> switches back to root?)
> >
> > Once ftpd does seteuid() or setreuid() again to switch effective user to
> > "root", it will be moved back to original group (root's group).
> >
> > So the basic question is: if a program temporarily changes its effective
> > user id to user B, should all its resource consumption be charged to
> > user B's resources, or continue to be charged to the original cgroup?
> >
> > I would think that we should move the task temporarily to B's cgroup and
> > bring back again upon identity change.
> >
> > At the same time I can also understand that this behavior can probably
> > be considered over-intrusive and some people might want to avoid that.
> >
> > Two things come to my mind.
> >
> > - Users who find it too intrusive, can just shut down the rules based
> > daemon.
> >
>
> Yes, I would say administrators should do that. Classification via setuid()
> does make a lot of sense, but at the same time it might be too aggressive if an
> application uses setuid() frequently.
>
Just a minor clarification: right now all the classification is being done
based on the effective uid and effective gid.
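As an aside on mechanics: however the daemon is notified, it can double-check a task's current effective ids from /proc before moving it. A minimal sketch (the helper names are mine; field positions are as documented for /proc/<pid>/status, whose "Uid:"/"Gid:" lines list real, effective, saved and fs ids):

```shell
# Effective uid/gid of a pid, read from /proc/<pid>/status.
euid_of() { awk '/^Uid:/ { print $3 }' "/proc/$1/status"; }
egid_of() { awk '/^Gid:/ { print $3 }' "/proc/$1/status"; }

euid_of $$   # effective uid of the current shell
```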
[..]
> >>> Exactly what kind of scenario do you have in mind when you want the policy
> >>> to be enforced selectively based on task (tid)?
> >> I was thinking of something like possibly a per-cgroup file (that also
> >> affected child cgroups) rather than a global file. That would also
> >> automatically handle multiple hierarchies.
> >>
> >
> > So there can be two kind of controls.
> >
> > - Create a per cgroup file say "group_pinned", where if 1 is written to
> > "group_pinned" that means daemon will not move tasks from this cgroup upon
> > effective uid/gid changes.
> >
> > - Provide more fine grained control where task movement is not controlled
> > per cgroup, rather per thread id. In that case every cgroup will contain
> > another file "tasks_pinned" which will contain all the tids which cannot
> > be moved from this cgroup by daemon. By default this file will be empty
> > and all the tids are movable.
> >
> > I think initially we can keep things simple and implement "group_pinned"
> > which provides coarse control on the whole group and pins all the tasks
> > in that cgroup.
> >
>
> Hmm... I wonder if we are providing too many knobs. Can't we do something simpler?
I also fear that we are probably providing too many knobs. Until we get
a strong use case, to keep things simple I recommend that for the time
being we stick to a simple user-space daemon which the user can turn on
or off based on his needs (i.e. whether he wants a cgroup change upon
seteuid()-related events). No controls based on group_pinned or
tasks_pinned etc. It is all or none.
Thanks
Vivek
Vivek Goyal wrote:
> Who executes default rules? IOW, how do you make sure tasks of user.davecb
> end up in project 101 only and not outside?
A classifier at login/connect starts each new process off in the correct group.
New processes inherit their parent's group unless you use newtask or su.
> So by default all the tasks of user.davecb will run into project 101 until
> user davecb decides to launch some background jobs in project 100 using
> newtask?
That's right, and the cgexec-like "newtask" is what I use
to script things: for example, my background script says
case "$1" in
[0-9]*) # It's a pid
newtask -p bg -c $1
;;
*) # It's a command-line
newtask -p bg "$@" &
;;
esac
There's also an -F option to put a process into a cgroup
and never let it newtask itself or its children to another one,
so that software from Dr Evil, Inc. can't do privilege
escalation (;-))
--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#