2013-06-27 16:54:14

by Tim Hockin

Subject: cgroup access daemon

Changing the subject, so as not to mix two discussions

On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn <[email protected]> wrote:
>
>> > FWIW, the code is too embarassing yet to see daylight, but I'm playing
>> > with a very lowlevel cgroup manager which supports nesting itself.
>> > Access in this POC is low-level ("set freezer.state to THAWED for cgroup
>> > /c1/c2", "Create /c3"), but the key feature is that it can run in two
>> > modes - native mode in which it uses cgroupfs, and child mode where it
>> > talks to a parent manager to make the changes.
>>
>> In this world, are users able to read cgroup files, or do they have to
>> go through a central agent, too?
>
> The agent won't itself do anything to stop access through cgroupfs, but
> the idea would be that cgroupfs would only be mounted in the agent's
> mntns. My hope would be that the libcgroup commands (like cgexec,
> cgcreate, etc) would know to talk to the agent when possible, and users
> would use those.

For our use case this is a huge problem. We have people who access
cgroup files in fairly tight loops, polling for information. We
have literally hundreds of jobs polling at sub-second frequencies -
plumbing all of that through a daemon is going to be a disaster.
Either your daemon becomes a bottleneck, or we have to build something
far more scalable than you really want to build. Not to mention the
inefficiency of inserting an extra layer.
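
To make that concrete, the hot path for those jobs is essentially
nothing more than this (the path and the interval are illustrative,
not our real setup):

/* Illustrative only: the kind of tight polling loop our jobs run today,
 * reading cgroup files straight out of cgroupfs. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[256];

    for (;;) {
        FILE *f = fopen("/sys/fs/cgroup/memory/job1/memory.usage_in_bytes", "r");
        if (f) {
            if (fgets(buf, sizeof(buf), f))
                printf("usage: %s", buf);
            fclose(f);
        }
        usleep(100 * 1000);    /* sub-second polling interval */
    }
}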

We also need the ability to set up eventfds for users or to let them
poll() on the socket from this daemon.
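
By eventfds I mean something along the lines of the v1 memory
controller's cgroup.event_control threshold notifications, e.g. the
following sketch (the path and the threshold are made up):

/* Sketch: ask the v1 memory controller to fire an eventfd when usage
 * in one cgroup crosses a threshold, then block until it does. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/sys/fs/cgroup/memory/job1";    /* illustrative path */
    char path[256], line[64];
    uint64_t count;
    int efd, ufd, cfd;

    efd = eventfd(0, 0);
    snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", dir);
    ufd = open(path, O_RDONLY);
    snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
    cfd = open(path, O_WRONLY);
    if (efd < 0 || ufd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    /* "<eventfd> <fd of memory.usage_in_bytes> <threshold in bytes>" */
    snprintf(line, sizeof(line), "%d %d %llu", efd, ufd, 512ULL << 20);
    if (write(cfd, line, strlen(line)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    read(efd, &count, sizeof(count));    /* blocks until the threshold is crossed */
    printf("crossed %llu time(s)\n", (unsigned long long)count);
    return 0;
}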

>> > So then the idea would be that userspace (like libvirt and lxc) would
>> > talk over /dev/cgroup to its manager. Userspace inside a container
>> > (which can't actually mount cgroups itself) would talk to its own
>> > manager which is talking over a passed-in socket to the host manager,
>> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
>> > the requestor's cgroup).
>>
>> How do you handle updates of this agent? Suppose I have hundreds of
>> running containers, and I want to release a new version of the cgroupd
>> ?
>
> This may change (which is part of what I want to investigate with some
> POC), but right now I'm building any controller-aware smarts into it. I
> think that's what you're asking about? The agent doesn't do "slices"
> etc. This may turn out to be insufficient, we'll see.

No, what I am asking about is a release-engineering problem. Suppose we
need to roll out a new version of this daemon (a new feature, a bug
fix, or something like that). We have hundreds of these "child" agents
running in the job containers.

How do I bring down all these children, and then bring them back up on
a new version in a way that does not disrupt user jobs (much)?

Similarly, what happens when one of these child agents crashes? Does
someone restart it? Do user jobs just stop working?

>
> So the only state which the agent stores is a list of cgroup mounts (if
> in native mode) or an open socket to the parent (if in child mode), and a
> list of connected children sockets.
>
> HUPping the agent will cause it to reload the cgroupfs mounts (in case
> you've mounted a new controller, living in "the old world" :). If you
> just kill it and start a new one, it shouldn't matter.
>
>> (note: inquiries about the implementation do not denote acceptance of
>> the model :)
>
> To put it another way, the problem I'm solving (for now) is not the "I
> want a daemon to ensure that requested guarantees are correctly
> implemented." In that sense I'm maintaining the status quo, i.e. the
> admin needs to architect the layout correctly.
>
> The problem I'm solving is really that I want containers to be able to
> handle cgroups even if they can't mount cgroupfs, and I want all
> userspace to be able to behave the same whether they are in a container
> or not.
>
> This isn't meant as a poke in the eye of anyone who wants to address the
> other problem. If it turns out that we (meaning "the community of
> cgroup users") really want such an agent, then we can add that. I'm not
> convinced.
>
> What would probably be a better design, then, would be that the agent
> I'm working on can plug into a resource allocation agent. Or, I
> suppose, the other way around.
>
>> > At some point (probably soon) we might want to talk about a standard API
>> > for these things. However I think it will have to come in the form of
>> > a standard library, which knows to either send requests over dbus to
>> > systemd, or over /dev/cgroup sock to the manager.
>> >
>> > -serge


2013-06-27 18:11:34

by Serge Hallyn

Subject: Re: cgroup access daemon

Quoting Tim Hockin ([email protected]):
> Changing the subject, so as not to mix two discussions
>
> On Thu, Jun 27, 2013 at 9:18 AM, Serge Hallyn <[email protected]> wrote:
> >
> >> > FWIW, the code is too embarassing yet to see daylight, but I'm playing
> >> > with a very lowlevel cgroup manager which supports nesting itself.
> >> > Access in this POC is low-level ("set freezer.state to THAWED for cgroup
> >> > /c1/c2", "Create /c3"), but the key feature is that it can run in two
> >> > modes - native mode in which it uses cgroupfs, and child mode where it
> >> > talks to a parent manager to make the changes.
> >>
> >> In this world, are users able to read cgroup files, or do they have to
> >> go through a central agent, too?
> >
> > The agent won't itself do anything to stop access through cgroupfs, but
> > the idea would be that cgroupfs would only be mounted in the agent's
> > mntns. My hope would be that the libcgroup commands (like cgexec,
> > cgcreate, etc) would know to talk to the agent when possible, and users
> > would use those.
>
> For our use case this is a huge problem. We have people who access
> cgroup files in a fairly tight loops, polling for information. We
> have literally hundeds of jobs running on sub-second frequencies -
> plumbing all of that through a daemon is going to be a disaster.
> Either your daemon becomes a bottleneck, or we have to build something
> far more scalable than you really want to. Not to mention the
> inefficiency of inserting a layer.

Currently you can trivially create a container which has the
container's cgroups bind-mounted to the expected places
(/sys/fs/cgroup/$controller) by uncommenting two lines in the
configuration file, and handle cgroups through cgroupfs there.
(This is what the management agent wants to be an alternative
for.) The main deficiency there is that /proc/self/cgroup is
not filtered, so it will show /lxc/c1 for init's cgroup, while
what the container sees under its own /sys/fs/cgroup/devices is
really the host's /sys/fs/cgroup/devices/lxc/c1/c1.real (for
instance). Not ideal.

> We also need the ability to set up eventfds for users or to let them
> poll() on the socket from this daemon.

So you'd want to be able to request updates when any cgroup value
is changed, right?

That's currently not in my very limited set of commands, but I can
certainly add it, and yes it would be a simple unix sock so you can
set up eventfd, select/poll, etc.
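
So on the client side it would be no more than this sort of thing
(the /dev/cgroup path is just the placeholder I've been using;
nothing is final):

/* Sketch: connect to the manager's named unix socket and wait for it
 * to become readable; the fd can equally be fed to select/epoll. */
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    struct pollfd pfd;
    int fd;

    strncpy(sa.sun_path, "/dev/cgroup", sizeof(sa.sun_path) - 1);
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("connect");
        return 1;
    }

    pfd.fd = fd;
    pfd.events = POLLIN;
    poll(&pfd, 1, -1);      /* wait for the manager to push an update */
    /* ... read and parse the update here ... */
    close(fd);
    return 0;
}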

> >> > So then the idea would be that userspace (like libvirt and lxc) would
> >> > talk over /dev/cgroup to its manager. Userspace inside a container
> >> > (which can't actually mount cgroups itself) would talk to its own
> >> > manager which is talking over a passed-in socket to the host manager,
> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
> >> > the requestor's cgroup).
> >>
> >> How do you handle updates of this agent? Suppose I have hundreds of
> >> running containers, and I want to release a new version of the cgroupd
> >> ?
> >
> > This may change (which is part of what I want to investigate with some
> > POC), but right now I'm building any controller-aware smarts into it. I
> > think that's what you're asking about? The agent doesn't do "slices"
> > etc. This may turn out to be insufficient, we'll see.
>
> No, what I am asking is a release-engineering problem. Suppose we
> need to roll out a new version of this daemon (some new feature or a
> bug or something). We have hundreds of these "child" agents running
> in the job containers.

When I say "container" I mean an lxc container, with its own isolated
rootfs and mntns. I'm not sure what your "containers" are, but if
they're not that, then they shouldn't need to run a child agent. They
can just talk over the host cgroup agent's socket.

> How do I bring down all these children, and then bring them back up on
> a new version in a way that does not disrupt user jobs (much)?
>
> Similarly, what happens when one of these child agents crashes? Does
> someone restart it? Do user jobs just stop working?

An upstart^W$init_system job will restart it...

-serge

2013-06-27 20:28:22

by Tim Hockin

Subject: Re: cgroup access daemon

On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <[email protected]> wrote:
> Quoting Tim Hockin ([email protected]):
>
>> For our use case this is a huge problem. We have people who access
>> cgroup files in a fairly tight loops, polling for information. We
>> have literally hundeds of jobs running on sub-second frequencies -
>> plumbing all of that through a daemon is going to be a disaster.
>> Either your daemon becomes a bottleneck, or we have to build something
>> far more scalable than you really want to. Not to mention the
>> inefficiency of inserting a layer.
>
> Currently you can trivially create a container which has the
> container's cgroups bind-mounted to the expected places
> (/sys/fs/cgroup/$controller) by uncommenting two lines in the
> configuration file, and handle cgroups through cgroupfs there.
> (This is what the management agent wants to be an alternative
> for) The main deficiency there is that /proc/self/cgroups is
> not filtered, so it will show /lxc/c1 for init's cgroup, while
> the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what
> is seen under the container's /sys/fs/cgroup/devices (for
> instance). Not ideal.

I'm really saying that if your daemon is to provide a replacement for
cgroupfs direct access, it needs to be designed to be scalable. If
we're going to get away from bind mounting cgroupfs into user
namespaces, then let's try to solve ALL the problems.

>> We also need the ability to set up eventfds for users or to let them
>> poll() on the socket from this daemon.
>
> So you'd want to be able to request updates when any cgroup value
> is changed, right?

Not necessarily ANY, but that's the terminus of this API facet.

> That's currently not in my very limited set of commands, but I can
> certainly add it, and yes it would be a simple unix sock so you can
> set up eventfd, select/poll, etc.

Assuming the protocol is basically a pass-through to basic filesystem
ops, it should be pretty easy. You just need to add it to your
protocol.

But it brings up another point - access control. How do you decide
which files a child agent should have access to? Does that ever
change based on the child's configuration? In our world, the answer is
almost certainly yes.

>> >> > So then the idea would be that userspace (like libvirt and lxc) would
>> >> > talk over /dev/cgroup to its manager. Userspace inside a container
>> >> > (which can't actually mount cgroups itself) would talk to its own
>> >> > manager which is talking over a passed-in socket to the host manager,
>> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
>> >> > the requestor's cgroup).
>> >>
>> >> How do you handle updates of this agent? Suppose I have hundreds of
>> >> running containers, and I want to release a new version of the cgroupd
>> >> ?
>> >
>> > This may change (which is part of what I want to investigate with some
>> > POC), but right now I'm building any controller-aware smarts into it. I
>> > think that's what you're asking about? The agent doesn't do "slices"
>> > etc. This may turn out to be insufficient, we'll see.
>>
>> No, what I am asking is a release-engineering problem. Suppose we
>> need to roll out a new version of this daemon (some new feature or a
>> bug or something). We have hundreds of these "child" agents running
>> in the job containers.
>
> When I say "container" I mean an lxc container, with it's own isolated
> rootfs and mntns. I'm not sure what your "containers" are, but I if
> they're not that, then they shouldn't need to run a child agent. They
> can just talk over the host cgroup agent's socket.

If they talk over the host agent's socket, where is the access control
and restriction done? Who decides how deep I can nest groups? Who
says which files I may access? Who stops me from modifying someone
else's container?

Our containers are somewhat thinner and more managed than LXC, but not
that much. If we're running a system agent in a user container, we
need to manage that software. We can't just start up a version and
leave it running until the user decides to upgrade - we force
upgrades.

>> How do I bring down all these children, and then bring them back up on
>> a new version in a way that does not disrupt user jobs (much)?
>>
>> Similarly, what happens when one of these child agents crashes? Does
>> someone restart it? Do user jobs just stop working?
>
> An upstart^W$init_system job will restart it...

What happens when the main agent crashes? All those children on UNIX
sockets need to reconnect, I guess. This means your UNIX socket needs
to be a named socket, not just a socketpair(), making your auth model
more complicated.

What happens when the main agent hangs? Is someone health-checking
it? How about all the child daemons?

I guess my main point is that this SOUNDS like a simple project, but
if you just do the simple obvious things, it will be woefully
inadequate for anything but simple use-cases. If we get forced into
such a model (and there are some good reasons to do it, even
disregarding all the other chatter), we'd rather use the same thing
that the upstream world uses, and not re-invent the whole thing
ourselves.

Do you have a design spec, or a requirements list, or even a prototype
that we can look at?

Tim

2013-06-28 16:32:07

by Serge Hallyn

Subject: Re: cgroup access daemon

Quoting Tim Hockin ([email protected]):
> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <[email protected]> wrote:
> > Quoting Tim Hockin ([email protected]):
> >
> >> For our use case this is a huge problem. We have people who access
> >> cgroup files in a fairly tight loops, polling for information. We
> >> have literally hundeds of jobs running on sub-second frequencies -
> >> plumbing all of that through a daemon is going to be a disaster.
> >> Either your daemon becomes a bottleneck, or we have to build something
> >> far more scalable than you really want to. Not to mention the
> >> inefficiency of inserting a layer.
> >
> > Currently you can trivially create a container which has the
> > container's cgroups bind-mounted to the expected places
> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the
> > configuration file, and handle cgroups through cgroupfs there.
> > (This is what the management agent wants to be an alternative
> > for) The main deficiency there is that /proc/self/cgroups is
> > not filtered, so it will show /lxc/c1 for init's cgroup, while
> > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what
> > is seen under the container's /sys/fs/cgroup/devices (for
> > instance). Not ideal.
>
> I'm really saying that if your daemon is to provide a replacement for
> cgroupfs direct access, it needs to be designed to be scalable. If
> we're going to get away from bind mounting cgroupfs into user
> namespaces, then let's try to solve ALL the problems.
>
> >> We also need the ability to set up eventfds for users or to let them
> >> poll() on the socket from this daemon.
> >
> > So you'd want to be able to request updates when any cgroup value
> > is changed, right?
>
> Not necessarily ANY, but that's the terminus of this API facet.
>
> > That's currently not in my very limited set of commands, but I can
> > certainly add it, and yes it would be a simple unix sock so you can
> > set up eventfd, select/poll, etc.
>
> Assuming the protocol is basically a pass-through to basic filesystem
> ops, it should be pretty easy. You just need to add it to your
> protocol.
>
> But it brings up another point - access control. How do you decide
> which files a child agent should have access to? Does that ever
> change based on the child's configuration? In our world, the answer is
> almost certainly yes.

Could you give examples?

If you have a white/academic paper I should go read, that'd be great.

At the moment I'm going on the naive belief that proper hierarchy
controls will be enforced (eventually) by the kernel - i.e. if
a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
won't be possible for /lxc/c1/lxc/c2 to take that access.

The native cgroup manager (the one using cgroupfs) will be checking
the credentials of the requesting child manager for access(2) to
the cgroup files.
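
Roughly like this (a sketch only - error handling elided, the helper
name invented): take the requester's identity off the socket and do
the access(2) check with those ids:

/* Sketch: decide whether the connected child manager may write a given
 * cgroup file.  access(2) checks against the real uid/gid, so the check
 * runs in a forked helper that assumes the peer's credentials. */
#define _GNU_SOURCE             /* for struct ucred */
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int may_write(int client_fd, const char *cgroup_file)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);
    int status;
    pid_t pid;

    if (getsockopt(client_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return 0;

    pid = fork();
    if (pid == 0) {
        if (setgid(cred.gid) < 0 || setuid(cred.uid) < 0)
            _exit(1);
        _exit(access(cgroup_file, W_OK) == 0 ? 0 : 1);
    }
    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}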

> >> >> > So then the idea would be that userspace (like libvirt and lxc) would
> >> >> > talk over /dev/cgroup to its manager. Userspace inside a container
> >> >> > (which can't actually mount cgroups itself) would talk to its own
> >> >> > manager which is talking over a passed-in socket to the host manager,
> >> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
> >> >> > the requestor's cgroup).
> >> >>
> >> >> How do you handle updates of this agent? Suppose I have hundreds of
> >> >> running containers, and I want to release a new version of the cgroupd
> >> >> ?
> >> >
> >> > This may change (which is part of what I want to investigate with some
> >> > POC), but right now I'm building any controller-aware smarts into it. I
> >> > think that's what you're asking about? The agent doesn't do "slices"
> >> > etc. This may turn out to be insufficient, we'll see.
> >>
> >> No, what I am asking is a release-engineering problem. Suppose we
> >> need to roll out a new version of this daemon (some new feature or a
> >> bug or something). We have hundreds of these "child" agents running
> >> in the job containers.
> >
> > When I say "container" I mean an lxc container, with it's own isolated
> > rootfs and mntns. I'm not sure what your "containers" are, but I if
> > they're not that, then they shouldn't need to run a child agent. They
> > can just talk over the host cgroup agent's socket.
>
> If they talk over the host agent's socket, where is the access control
> and restriction done? Who decides how deep I can nest groups? Who
> says which files I may access? Who stops me from modifying someone
> else's container?
>
> Our containers are somewhat thinner and more managed than LXC, but not
> that much. If we're running a system agent in a user container, we
> need to manage that software. We can't just start up a version and
> leave it running until the user decides to upgrade - we force
> upgrades.
>
> >> How do I bring down all these children, and then bring them back up on
> >> a new version in a way that does not disrupt user jobs (much)?
> >>
> >> Similarly, what happens when one of these child agents crashes? Does
> >> someone restart it? Do user jobs just stop working?
> >
> > An upstart^W$init_system job will restart it...
>
> What happens when the main agent crashes? All those children on UNIX
> sockets need to reconnect, I guess. This means your UNIX socket needs
> to be a named socket, not just a socketpair(), making your auth model
> more complicated.

It is a named socket.

> What happens when the main agent hangs? Is someone health-checking
> it? How about all the child daemons?
>
> I guess my main point is that this SOUNDS like a simple project, but

I guess it's not "simple". It just focuses on one specific problem.

> if you just do the simple obvious things, it will be woefully
> inadequate for anything but simple use-cases. If we get forced into
> such a model (and there are some good reasons to do it, even
> disregarding all the other chatter), we'd rather use the same thing
> that the upstream world uses, and not re-invent the whole thing
> ourselves.
>
> Do you have a design spec, or a requirements list, or even a prototype
> that we can look at?

The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
shows what I have in mind. It (and the sloppy code next to it)
represents a few hours' work over the last few days while waiting
for compiles and in between emails...

But again, it is completely predicated on my goal to have libvirt
and lxc (and other cgroup users) be able to use the same library
or API to make their requests whether they are on host or in a
container, and regardless of the distro they're running under.

-serge

2013-06-28 18:38:03

by Tim Hockin

Subject: Re: cgroup access daemon

On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn <[email protected]> wrote:
> Quoting Tim Hockin ([email protected]):
>> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <[email protected]> wrote:
>> > Quoting Tim Hockin ([email protected]):
>> >
>> >> For our use case this is a huge problem. We have people who access
>> >> cgroup files in a fairly tight loops, polling for information. We
>> >> have literally hundeds of jobs running on sub-second frequencies -
>> >> plumbing all of that through a daemon is going to be a disaster.
>> >> Either your daemon becomes a bottleneck, or we have to build something
>> >> far more scalable than you really want to. Not to mention the
>> >> inefficiency of inserting a layer.
>> >
>> > Currently you can trivially create a container which has the
>> > container's cgroups bind-mounted to the expected places
>> > (/sys/fs/cgroup/$controller) by uncommenting two lines in the
>> > configuration file, and handle cgroups through cgroupfs there.
>> > (This is what the management agent wants to be an alternative
>> > for) The main deficiency there is that /proc/self/cgroups is
>> > not filtered, so it will show /lxc/c1 for init's cgroup, while
>> > the host's /sys/fs/cgroup/devices/lxc/c1/c1.real will be what
>> > is seen under the container's /sys/fs/cgroup/devices (for
>> > instance). Not ideal.
>>
>> I'm really saying that if your daemon is to provide a replacement for
>> cgroupfs direct access, it needs to be designed to be scalable. If
>> we're going to get away from bind mounting cgroupfs into user
>> namespaces, then let's try to solve ALL the problems.
>>
>> >> We also need the ability to set up eventfds for users or to let them
>> >> poll() on the socket from this daemon.
>> >
>> > So you'd want to be able to request updates when any cgroup value
>> > is changed, right?
>>
>> Not necessarily ANY, but that's the terminus of this API facet.
>>
>> > That's currently not in my very limited set of commands, but I can
>> > certainly add it, and yes it would be a simple unix sock so you can
>> > set up eventfd, select/poll, etc.
>>
>> Assuming the protocol is basically a pass-through to basic filesystem
>> ops, it should be pretty easy. You just need to add it to your
>> protocol.
>>
>> But it brings up another point - access control. How do you decide
>> which files a child agent should have access to? Does that ever
>> change based on the child's configuration? In our world, the answer is
>> almost certainly yes.
>
> Could you give examples?
>
> If you have a white/academic paper I should go read, that'd be great.

We don't have anything on this, but examples may help.

Someone running as root should be able to connect to the "native"
daemon and read or write any cgroup file they want, right? You could
argue that root should be able to do this to a child-daemon, too, but
let's ignore that.

But inside a container, I don't want the users to be able to write to
anything in their own container. I do want them to be able to make
sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be
able to write to memory.limit_in_bytes, to read but not write
memory.soft_limit_in_bytes, and not be able to read memory.stat.

To get even fancier, a user should be able to create a sub-cgroup and
then designate that sub-cgroup as "final" - no further sub-sub-cgroups
allowed under it. They should also be able to designate that a
sub-cgroup is "one-way" - once a process enters it, it can not leave.

These are real(ish) examples based on what people want to do today.
In particular, the last couple are things that we want to do, but
don't do today.

The particular policy can differ per-container. Production jobs might
be allowed to create sub-cgroups, but batch jobs are not. Some user
jobs are designated "trusted" in one facet or another and get more
(but still not full) access.
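
To be clear, nothing below exists as an interface anywhere today;
it's just a sketch of the kind of per-container policy table I mean,
with every name invented:

/* Purely hypothetical: the sort of per-container cgroup policy we would
 * want to express.  None of these names correspond to a real interface. */
enum file_access { DENY, READ_ONLY, READ_WRITE };

struct file_rule {
    const char *file;           /* cgroup control file */
    enum file_access access;
};

struct container_policy {
    int max_depth;              /* how deep sub-cgroups may nest; 0 = none */
    int allow_final;            /* may mark a sub-cgroup "no further children" */
    int allow_one_way;          /* may mark a sub-cgroup "tasks cannot leave" */
    struct file_rule rules[8];
};

const struct container_policy production_job = {
    .max_depth     = 5,
    .allow_final   = 1,
    .allow_one_way = 1,
    .rules = {
        { "memory.limit_in_bytes",      READ_WRITE },
        { "memory.soft_limit_in_bytes", READ_ONLY  },
        { "memory.stat",                DENY       },
    },
};

const struct container_policy batch_job = {
    .max_depth = 0,             /* no sub-cgroups at all */
};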

> At the moment I'm going on the naive belief that proper hierarchy
> controls will be enforced (eventually) by the kernel - i.e. if
> a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
> won't be possible for /lxc/c1/lxc/c2 to take that access.
>
> The native cgroup manager (the one using cgroupfs) will be checking
> the credentials of the requesting child manager for access(2) to
> the cgroup files.

This might be sufficient, or at least the basis for a sufficient access
control system for users. The problem is that we have multiple jobs on a
single machine running as the same user. We need to ensure that the
jobs cannot modify each other.

>> >> >> > So then the idea would be that userspace (like libvirt and lxc) would
>> >> >> > talk over /dev/cgroup to its manager. Userspace inside a container
>> >> >> > (which can't actually mount cgroups itself) would talk to its own
>> >> >> > manager which is talking over a passed-in socket to the host manager,
>> >> >> > which in turn runs natively (uses cgroupfs, and nests "create /c1" under
>> >> >> > the requestor's cgroup).
>> >> >>
>> >> >> How do you handle updates of this agent? Suppose I have hundreds of
>> >> >> running containers, and I want to release a new version of the cgroupd
>> >> >> ?
>> >> >
>> >> > This may change (which is part of what I want to investigate with some
>> >> > POC), but right now I'm building any controller-aware smarts into it. I
>> >> > think that's what you're asking about? The agent doesn't do "slices"
>> >> > etc. This may turn out to be insufficient, we'll see.
>> >>
>> >> No, what I am asking is a release-engineering problem. Suppose we
>> >> need to roll out a new version of this daemon (some new feature or a
>> >> bug or something). We have hundreds of these "child" agents running
>> >> in the job containers.
>> >
>> > When I say "container" I mean an lxc container, with it's own isolated
>> > rootfs and mntns. I'm not sure what your "containers" are, but I if
>> > they're not that, then they shouldn't need to run a child agent. They
>> > can just talk over the host cgroup agent's socket.
>>
>> If they talk over the host agent's socket, where is the access control
>> and restriction done? Who decides how deep I can nest groups? Who
>> says which files I may access? Who stops me from modifying someone
>> else's container?
>>
>> Our containers are somewhat thinner and more managed than LXC, but not
>> that much. If we're running a system agent in a user container, we
>> need to manage that software. We can't just start up a version and
>> leave it running until the user decides to upgrade - we force
>> upgrades.
>>
>> >> How do I bring down all these children, and then bring them back up on
>> >> a new version in a way that does not disrupt user jobs (much)?
>> >>
>> >> Similarly, what happens when one of these child agents crashes? Does
>> >> someone restart it? Do user jobs just stop working?
>> >
>> > An upstart^W$init_system job will restart it...
>>
>> What happens when the main agent crashes? All those children on UNIX
>> sockets need to reconnect, I guess. This means your UNIX socket needs
>> to be a named socket, not just a socketpair(), making your auth model
>> more complicated.
>
> It is a named socket.

So anyone can connect? Even with SO_PEERCRED, how do you know which
branches of the cgroup tree I am allowed to modify if the same user
owns more than one?

>> What happens when the main agent hangs? Is someone health-checking
>> it? How about all the child daemons?
>>
>> I guess my main point is that this SOUNDS like a simple project, but
>
> I guess it's not "simple". It just focuses on one specific problem.
>
>> if you just do the simple obvious things, it will be woefully
>> inadequate for anything but simple use-cases. If we get forced into
>> such a model (and there are some good reasons to do it, even
>> disregarding all the other chatter), we'd rather use the same thing
>> that the upstream world uses, and not re-invent the whole thing
>> ourselves.
>>
>> Do you have a design spec, or a requirements list, or even a prototype
>> that we can look at?
>
> The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
> shows what I have in mind. It (and the sloppy code next to it)
> represent a few hours' work over the last few days while waiting
> for compiles and in between emails...

Awesome. Do you mind if we look?

> But again, it is completely predicated on my goal to have libvirt
> and lxc (and other cgroup users) be able to use the same library
> or API to make their requests whether they are on host or in a
> container, and regardless of the distro they're running under.

I think that is a good goal. We'd like to not be different, if
possible. Obviously, we can't impose our needs on you if you don't
want to handle them. It sounds like what you are building is the
bottom layer in a stack - we (Google) should use that same bottom
layer. But that can only happen iff you're open to hearing our
requirements. Otherwise we have to strike out on our own or build
more layers in-between.

Tim

2013-06-28 19:21:33

by Serge Hallyn

Subject: Re: cgroup access daemon

Quoting Tim Hockin ([email protected]):
> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn <[email protected]> wrote:
> > Quoting Tim Hockin ([email protected]):
> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <[email protected]> wrote:
> >> > Quoting Tim Hockin ([email protected]):
> > Could you give examples?
> >
> > If you have a white/academic paper I should go read, that'd be great.
>
> We don't have anything on this, but examples may help.
>
> Someone running as root should be able to connect to the "native"
> daemon and read or write any cgroup file they want, right? You could
> argue that root should be able to do this to a child-daemon, too, but
> let's ignore that.
>
> But inside a container, I don't want the users to be able to write to
> anything in their own container. I do want them to be able to make
> sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be
> able to write to memory.limit_in_bytes, to read but not write
> memory.soft_limit_in_bytes, and not be able to read memory.stat.
>
> To get even fancier, a user should be able to create a sub-cgroup and
> then designate that sub-cgroup as "final" - no further sub-sub-cgroups
> allowed under it. They should also be able to designate that a
> sub-cgroup is "one-way" - once a process enters it, it can not leave.
>
> These are real(ish) examples based on what people want to do today.
> In particular, the last couple are things that we want to do, but
> don't do today.
>
> The particular policy can differ per-container. Production jobs might
> be allowed to create sub-cgroups, but batch jobs are not. Some user
> jobs are designated "trusted" in one facet or another and get more
> (but still not full) access.

Interesting, thanks.

I'll think a bit on how to best address these.

> > At the moment I'm going on the naive belief that proper hierarchy
> > controls will be enforced (eventually) by the kernel - i.e. if
> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
> > won't be possible for /lxc/c1/lxc/c2 to take that access.
> >
> > The native cgroup manager (the one using cgroupfs) will be checking
> > the credentials of the requesting child manager for access(2) to
> > the cgroup files.
>
> This might be sufficient, or the basis for a sufficient access control
> system for users. The problem comes that we have multiple jobs on a
> single machine running as the same user. We need to ensure that the
> jobs can not modify each other.

Would running them each in user namespaces with different mappings (all
jobs running as uid 1000, but uid 1000 mapped to different host uids
for each job) be (long-term) feasible?
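
Concretely, each job's init would be started in a fresh user namespace
(CLONE_NEWUSER) and the host side would write it a mapping along these
lines (a sketch; the host uid ranges are made up):

/* Sketch: map in-container uid/gid 1000 to a distinct host uid for each
 * job.  child_pid must already sit in its own user namespace. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void write_file(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);

    if (fd >= 0) {
        write(fd, s, strlen(s));
        close(fd);
    }
}

void map_job_user(pid_t child_pid, unsigned long host_uid)
{
    char path[64], map[64];

    snprintf(map, sizeof(map), "1000 %lu 1\n", host_uid);

    snprintf(path, sizeof(path), "/proc/%d/setgroups", child_pid);
    write_file(path, "deny");   /* needed on newer kernels before gid_map */
    snprintf(path, sizeof(path), "/proc/%d/gid_map", child_pid);
    write_file(path, map);
    snprintf(path, sizeof(path), "/proc/%d/uid_map", child_pid);
    write_file(path, map);
}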

> > It is a named socket.
>
> So anyone can connect? even with SO_PEERCRED, how do you know which
> branches of the cgroup tree I am allowed to modify if the same user
> owns more than one?

I was assuming that any process requesting management of
/c1/c2/c3 would have to be in one of its ancestor cgroups (i.e. /c1).

So if you have two jobs running as uid 1000, one under /c1 and one
under /c2, and one as uid 1001 running under /c3 (with the uids owning
the cgroups), then the file permissions will prevent 1000 and 1001
from walking over each other, while the cgroup manager will not allow
a process (child manager or otherwise) under /c1 to manage cgroups
under /c2 and vice versa.
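
The ancestor check itself is nothing more than a path-prefix test
against the requester's own cgroup, something like this (sketch only;
one controller picked arbitrarily):

/* Sketch: allow a requester to manage only cgroups at or below its own
 * cgroup, by prefix-matching the path from /proc/<pid>/cgroup. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

int is_ancestor_of(pid_t requester, const char *target)
{
    char path[64], line[256], *cg = NULL;
    FILE *f;
    size_t n;

    snprintf(path, sizeof(path), "/proc/%d/cgroup", requester);
    f = fopen(path, "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, ":memory:")) {     /* one controller, for the sketch */
            cg = strchr(line, '/');
            break;
        }
    }
    fclose(f);
    if (!cg)
        return 0;
    cg[strcspn(cg, "\n")] = '\0';
    n = strlen(cg);

    /* "/c1" is an ancestor of "/c1/c2/c3" (and of itself); "/" of everything */
    return strncmp(target, cg, n) == 0 &&
           (target[n] == '/' || target[n] == '\0' || strcmp(cg, "/") == 0);
}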

> >> Do you have a design spec, or a requirements list, or even a prototype
> >> that we can look at?
> >
> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
> > shows what I have in mind. It (and the sloppy code next to it)
> > represent a few hours' work over the last few days while waiting
> > for compiles and in between emails...
>
> Awesome. Do you mind if we look?

No, but it might not be worth it (other than the readme) :) - so far
it's only served to help me think through what I want and need from
the mgr.

> > But again, it is completely predicated on my goal to have libvirt
> > and lxc (and other cgroup users) be able to use the same library
> > or API to make their requests whether they are on host or in a
> > container, and regardless of the distro they're running under.
>
> I think that is a good goal. We'd like to not be different, if
> possible. Obviously, we can't impose our needs on you if you don't
> want to handle them. It sounds like what you are building is the
> bottom layer in a stack - we (Google) should use that same bottom
> layer. But that can only happen iff you're open to hearing our
> requirements. Otherwise we have to strike out on our own or build
> more layers in-between.

I'm definitely open to your requirements - whether providing what
you need for another layer on top, or building it right in.

-serge

2013-06-28 19:49:11

by Tim Hockin

Subject: Re: cgroup access daemon

On Fri, Jun 28, 2013 at 12:21 PM, Serge Hallyn <[email protected]> wrote:
> Quoting Tim Hockin ([email protected]):
>> On Fri, Jun 28, 2013 at 9:31 AM, Serge Hallyn <[email protected]> wrote:
>> > Quoting Tim Hockin ([email protected]):
>> >> On Thu, Jun 27, 2013 at 11:11 AM, Serge Hallyn <[email protected]> wrote:
>> >> > Quoting Tim Hockin ([email protected]):
>> > Could you give examples?
>> >
>> > If you have a white/academic paper I should go read, that'd be great.
>>
>> We don't have anything on this, but examples may help.
>>
>> Someone running as root should be able to connect to the "native"
>> daemon and read or write any cgroup file they want, right? You could
>> argue that root should be able to do this to a child-daemon, too, but
>> let's ignore that.
>>
>> But inside a container, I don't want the users to be able to write to
>> anything in their own container. I do want them to be able to make
>> sub-cgroups, but only 5 levels deep. For sub-cgroups, they should be
>> able to write to memory.limit_in_bytes, to read but not write
>> memory.soft_limit_in_bytes, and not be able to read memory.stat.
>>
>> To get even fancier, a user should be able to create a sub-cgroup and
>> then designate that sub-cgroup as "final" - no further sub-sub-cgroups
>> allowed under it. They should also be able to designate that a
>> sub-cgroup is "one-way" - once a process enters it, it can not leave.
>>
>> These are real(ish) examples based on what people want to do today.
>> In particular, the last couple are things that we want to do, but
>> don't do today.
>>
>> The particular policy can differ per-container. Production jobs might
>> be allowed to create sub-cgroups, but batch jobs are not. Some user
>> jobs are designated "trusted" in one facet or another and get more
>> (but still not full) access.
>
> Interesting, thanks.
>
> I'll think a bit on how to best address these.
>
>> > At the moment I'm going on the naive belief that proper hierarchy
>> > controls will be enforced (eventually) by the kernel - i.e. if
>> > a task in cgroup /lxc/c1 is not allowed to mknod /dev/sda1, then it
>> > won't be possible for /lxc/c1/lxc/c2 to take that access.
>> >
>> > The native cgroup manager (the one using cgroupfs) will be checking
>> > the credentials of the requesting child manager for access(2) to
>> > the cgroup files.
>>
>> This might be sufficient, or the basis for a sufficient access control
>> system for users. The problem comes that we have multiple jobs on a
>> single machine running as the same user. We need to ensure that the
>> jobs can not modify each other.
>
> Would running them each in user namespaces with different mappings (all
> jobs running as uid 1000, but uid 1000 mapped to different host uids
> for each job) would be (long-term) feasible?

Possibly. It's a largish imposition to make on the caller (we don't
use user namespaces today, though we are evaluating how to start using
them) but perhaps not terrible.

>> > It is a named socket.
>>
>> So anyone can connect? even with SO_PEERCRED, how do you know which
>> branches of the cgroup tree I am allowed to modify if the same user
>> owns more than one?
>
> I was assuming that any process requesting management of
> /c1/c2/c3 would have to be in one of its ancestor cgroups (i.e. /c1)
>
> So if you have two jobs running as uid 1000, one under /c1 and one
> under /c2, and one as uid 1001 running under /c3 (with the uids owning
> the cgroups), then the file permissions will prevent 1000 and 1001
> from walk over each other, while the cgroup manager will not allow
> a process (child manager or otherwise) under /c1 to manage cgroups
> under /c2 and vice versa.
>
>> >> Do you have a design spec, or a requirements list, or even a prototype
>> >> that we can look at?
>> >
>> > The readme at https://github.com/hallyn/cgroup-mgr/blob/master/README
>> > shows what I have in mind. It (and the sloppy code next to it)
>> > represent a few hours' work over the last few days while waiting
>> > for compiles and in between emails...
>>
>> Awesome. Do you mind if we look?
>
> No, but it might not be worth it (other than the readme) :) - so far
> it's only served to help me think through what I want and need from
> the mgr.
>
>> > But again, it is completely predicated on my goal to have libvirt
>> > and lxc (and other cgroup users) be able to use the same library
>> > or API to make their requests whether they are on host or in a
>> > container, and regardless of the distro they're running under.
>>
>> I think that is a good goal. We'd like to not be different, if
>> possible. Obviously, we can't impose our needs on you if you don't
>> want to handle them. It sounds like what you are building is the
>> bottom layer in a stack - we (Google) should use that same bottom
>> layer. But that can only happen iff you're open to hearing our
>> requirements. Otherwise we have to strike out on our own or build
>> more layers in-between.
>
> I'm definately open to your requirements - whether providing what
> you need for another layer on top, or building it right in.

Great. That's a good place to start :)