Quoting Alexey Dobriyan ([email protected]):
> On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > implementation, FYI.
>
> OK, here is second version which shows what to do with shared objects
> (cr_dump_nsproxy(), cr_dump_task_struct()), introduced more checks
> (still no unlinked files) and dumps some more information including
> structures connections (cr_pos_*)
>
> Dumping pids in under thinking because in OpenVZ pids are saved as
> numbers due to CLONE_NEWPID is not allowed in container. In presense
> of multiple CLONE_NEWPID levels this must present a big problem. Looks
> like there is now way to not dump pids as separate object.
>
> As result, struct cr_image_pid is variable-sized, don't know how this will
> play later.
>
> Also, pid refcount check for external pointers is busted right now,
> because /proc inode pins struct pid, so there is almost always refcount
> vs ->o_count mismatch.
>
> No restore yet. ;-)
Hi Alexey,
thanks for posting this. Of course there are some predictable responses
(I like the simplicity of pure in-kernel, Dave will not :) but this
needs to be posted to make us talk about it.
A few more comments that came to me while looking it over:
1. cap_sys_admin check is unfortunate. In discussions about Oren's
patchset we've agreed that not having that check from the outset forces
us to consider security with each new patch and feature, which is a good
thing.
2. if any tasks being checkpointed are frozen, checkpoint has the
side effect of thawing them, right?
3. wrt pids, i guess what you really want is to store the pids from
init_tsk's level down to the task's lowest pid, right? Then you
manually set each of those on restart? Any higher pids of course
don't matter.
4. do you have any thoughts on what to do with the mntns info at
restart? Will you try to detect mounts which need to be re-created?
How?
5. Since you're always setting f_pos, this won't work straight over
a pipe? Do you figure that's just not a worthwhile feature?
Were you saying (in response to Dave) that you're having private
discussions about whether to pursue posting this as an alternative
to Oren's patchset? If so, any updates on those discussions?
thanks,
-serge
On Sun, Mar 01, 2009 at 02:02:31PM -0600, Serge E. Hallyn wrote:
> Quoting Alexey Dobriyan ([email protected]):
> > On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> > > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > > implementation, FYI.
> >
> > OK, here is second version which shows what to do with shared objects
> > (cr_dump_nsproxy(), cr_dump_task_struct()), introduced more checks
> > (still no unlinked files) and dumps some more information including
> > structures connections (cr_pos_*)
> >
> > Dumping pids in under thinking because in OpenVZ pids are saved as
> > numbers due to CLONE_NEWPID is not allowed in container. In presense
> > of multiple CLONE_NEWPID levels this must present a big problem. Looks
> > like there is now way to not dump pids as separate object.
> >
> > As result, struct cr_image_pid is variable-sized, don't know how this will
> > play later.
> >
> > Also, pid refcount check for external pointers is busted right now,
> > because /proc inode pins struct pid, so there is almost always refcount
> > vs ->o_count mismatch.
> >
> > No restore yet. ;-)
>
> Hi Alexey,
>
> thanks for posting this. Of course there are some predictable responses
> (I like the simplicity of pure in-kernel, Dave will not :) but this
> needs to be posted to make us talk about it.
>
> A few more comments that came to me while looking it over:
>
> 1. cap_sys_admin check is unfortunate. In discussions about Oren's
> patchset we've agreed that not having that check from the outset forces
> us to consider security with each new patch and feature, which is a good
> thing.
Removing CAP_SYS_ADMIN on restore?
> 2. if any tasks being checkpointed are frozen, checkpoint has the
> side effect of thawing them, right?
Haven't tried, but should be a bug, yes. It will be "thaw or kill"
depending on "flags".
> 3. wrt pids, i guess what you really want is to store the pids from
> init_tsk's level down to the task's lowest pid, right? Then you
> manually set each of those on restart? Any higher pids of course
> don't matter.
Yes, numbers are really meant to be from init_tsk level.
> 4. do you have any thoughts on what to do with the mntns info at
> restart? Will you try to detect mounts which need to be re-created?
> How?
Haven't thought, but it will be tricky for sure :^)
> 5. Since you're always setting f_pos, this won't work straight over
> a pipe? Do you figure that's just not a worthwhile feature?
So far there were no loops when dumping data structures, but I _think_
there will be some, so seeking over dumpfile would be inevitable.
> Were you saying (in response to Dave) that you're having private
> discussions about whether to pursue posting this as an alternative
> to Oren's patchset? If so, any updates on those discussions?
Right now, no.
Quoting Alexey Dobriyan ([email protected]):
> On Sun, Mar 01, 2009 at 02:02:31PM -0600, Serge E. Hallyn wrote:
> > Quoting Alexey Dobriyan ([email protected]):
> > > On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> > > > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > > > implementation, FYI.
> > >
> > > OK, here is second version which shows what to do with shared objects
> > > (cr_dump_nsproxy(), cr_dump_task_struct()), introduced more checks
> > > (still no unlinked files) and dumps some more information including
> > > structures connections (cr_pos_*)
> > >
> > > Dumping pids in under thinking because in OpenVZ pids are saved as
> > > numbers due to CLONE_NEWPID is not allowed in container. In presense
> > > of multiple CLONE_NEWPID levels this must present a big problem. Looks
> > > like there is now way to not dump pids as separate object.
> > >
> > > As result, struct cr_image_pid is variable-sized, don't know how this will
> > > play later.
> > >
> > > Also, pid refcount check for external pointers is busted right now,
> > > because /proc inode pins struct pid, so there is almost always refcount
> > > vs ->o_count mismatch.
> > >
> > > No restore yet. ;-)
> >
> > Hi Alexey,
> >
> > thanks for posting this. Of course there are some predictable responses
> > (I like the simplicity of pure in-kernel, Dave will not :) but this
> > needs to be posted to make us talk about it.
> >
> > A few more comments that came to me while looking it over:
> >
> > 1. cap_sys_admin check is unfortunate. In discussions about Oren's
> > patchset we've agreed that not having that check from the outset forces
> > us to consider security with each new patch and feature, which is a good
> > thing.
>
> Removing CAP_SYS_ADMIN on restore?
And checkpoint.
-serge
>> 1. cap_sys_admin check is unfortunate. In discussions about Oren's
>> patchset we've agreed that not having that check from the outset forces
>> us to consider security with each new patch and feature, which is a good
>> thing.
>
> Removing CAP_SYS_ADMIN on restore?
we've kept the capabilities in our patchset but the user tools doing checkpoint
and restart are setcap'ed appropriately to be able to do different things like :
clone() the namespaces
mount /dev/mqueue
interact with net_ns
etc.
at restart, the task are restarted through execve() so they loose their
capabilities automatically.
but I think we could drop the CAP_SYS_ADMIN tests for some namespaces,
uts and ipc are good candidates. I guess network should require some
privilege.
C.
Quoting Cedric Le Goater ([email protected]):
>
> >> 1. cap_sys_admin check is unfortunate. In discussions about Oren's
> >> patchset we've agreed that not having that check from the outset forces
> >> us to consider security with each new patch and feature, which is a good
> >> thing.
> >
> > Removing CAP_SYS_ADMIN on restore?
>
> we've kept the capabilities in our patchset but the user tools doing checkpoint
> and restart are setcap'ed appropriately to be able to do different things like :
>
> clone() the namespaces
> mount /dev/mqueue
> interact with net_ns
> etc.
Right, that stuff done in userspace requires capabilities.
> at restart, the task are restarted through execve() so they loose their
> capabilities automatically.
>
> but I think we could drop the CAP_SYS_ADMIN tests for some namespaces,
> uts and ipc are good candidates. I guess network should require some
> privilege.
Eric and i have talked about that a lot, and so far are continuing
to punt on it. There are too many possibilities for subtle exploits
so I'm not suggesting changing those now.
But checkpoint and restart are entirely new. If at each small step
we accept that an unprivileged user should be able to use it safely,
that will lead to a better design, i.e. doing may_ptrace before
checkpoint, and always doing access checks before re-creating a
resource.
If we *don't* do that, we'll have a big-stick setuid root checkpoint
and restart program which isn't at all trustworthy (bc it hasn't
received due scrutiny at each commit point), but must be trusted by
anyone wanting to use it.
And if we're too afraid to remove CAP_SYS_ADMIN checks from unsharing
one innocuous namespace, will we ever convince ourselves to remove it
from an established feature that can recreate every type of resource on
the system?
-serge