2010-04-06 03:45:01

by Roland McGrath

[permalink] [raw]
Subject: Re: Testing lxc 0.6.5 in Fedora 13

(I've been away for a couple of weeks.)
I concur with the things Oleg's said in this thread.

As to what's "correct" for the kernel in theory, it would certainly make
sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
reporting any PID as such. The wait and SIGCHLD code already does this, so
that would be consistent. Off hand I don't see anything other than
tracehook_report_clone{,_complete}() that sees the wrong value now.

Fixing that requires a bit of hair. The simple and clean approach is to
just have the tracehook calls (i.e. ptrace layer) extract the PID from the
task_struct using the desired pid_ns. The trouble there is that the
tracehook_report_clone_complete() call is made when that task_struct is no
longer guaranteed to be valid. The contrary approach of extracting the
appropriate value for the tracer earlier breaks the clean layering because
the fork.c code really should not know at all that ->parent->nsproxy is the
place to look for what values to pass to tracehook calls. I guess the
simple and clean fix is to get_task_struct() before wake_up_new_task()
and put_task_struct() after tracehook_report_clone_complete(). That does
add some gratuitous atomic incr/decr overhead, though.

None of this has much of anything to do with strace, of course. As I've
said, I don't see anything other than the PTRACE_GETEVENTMSG value for
PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel. As
Oleg said, strace doesn't use that at all. (This is not the place to
discuss the details of strace further.)


Thanks,
Roland


2010-04-06 13:54:09

by Matt Helsley

[permalink] [raw]
Subject: Re: Testing lxc 0.6.5 in Fedora 13


On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
> (I've been away for a couple of weeks.)
> I concur with the things Oleg's said in this thread.
>
> As to what's "correct" for the kernel in theory, it would certainly make
> sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
> reporting any PID as such. The wait and SIGCHLD code already does this, so
> that would be consistent. Off hand I don't see anything other than
> tracehook_report_clone{,_complete}() that sees the wrong value now.

Yup.

> Fixing that requires a bit of hair. The simple and clean approach is to
> just have the tracehook calls (i.e. ptrace layer) extract the PID from the
> task_struct using the desired pid_ns. The trouble there is that the

It's also possible to take an extra reference to the struct pid and pass
that to the tracehook. That and the pid_ns of the tracer receiving the pid
is all we'll ever need inside the tracehook layer. The only advantage, I
think, is we wouldn't pin the task struct while holding the pid reference.

> tracehook_report_clone_complete() call is made when that task_struct is no
> longer guaranteed to be valid. The contrary approach of extracting the
> appropriate value for the tracer earlier breaks the clean layering because
> the fork.c code really should not know at all that ->parent->nsproxy is the
> place to look for what values to pass to tracehook calls. I guess the
> simple and clean fix is to get_task_struct() before wake_up_new_task()
> and put_task_struct() after tracehook_report_clone_complete(). That does
> add some gratuitous atomic incr/decr overhead, though.

Also true.

Of course my suggestion of holding the pid reference won't avoid adding
atomic ops -- just changes which refcount they work on.

>
> None of this has much of anything to do with strace, of course. As I've
> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel. As
> Oleg said, strace doesn't use that at all. (This is not the place to
> discuss the details of strace further.)

Also, looking at proposed changes (utrace and Eric Biederman's setns())
storing a pid nr rather than a reference to a task struct or struct pid
probably won't be correct.

In the case of Eric Biederman's setns(), if capable of changing pid namespace,
we could have:

Traced Tracer
fork()
... (an arbitrary amount of time passes)
setns()
ptrace(GETEVENTMSG)

At which point returning a static pid number held in the message field
produces the wrong pid. Also, if utrace allows multiple tracers and they each
exist in a different namespace then storing a pid nr isn't going to work.

So my hunch is, in the long run, we'll need to hold a reference there and
drop it when the last tracer detaches or the next event would have
overwritten the "message". The amount of time it would need to be held
suggests to me that we should refer to a struct pid and not a task struct
if possible.

Cheers,
-Matt Helsley

2010-04-06 14:38:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Testing lxc 0.6.5 in Fedora 13

On 04/06, Matt Helsley wrote:
>
> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>
> > tracehook_report_clone_complete() call is made when that task_struct is no
> > longer guaranteed to be valid.

Hmm. I missed this.

> Also, if utrace allows multiple tracers and they each
> exist in a different namespace then storing a pid nr isn't going to work.

Yes, but utrace is simple. ptrace_report_clone() does

ctx->eventmsg = child->pid;

we should fix this line and that is all, afaics. Every tracer has a
separate "struct ptrace_context *ctx".

> So my hunch is, in the long run, we'll need to hold a reference there and
> drop it when the last tracer detaches

Without utrace only one tracer is possible.

So, I think we should either change do_fork() to get the right tracee_pid_nr,
or add get/put into do_fork() and change tracehooks as Roland suggested.

Oleg.

2010-04-06 15:13:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Testing lxc 0.6.5 in Fedora 13

Matt Helsley <[email protected]> writes:

> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>> (I've been away for a couple of weeks.)
>> I concur with the things Oleg's said in this thread.
>>
>> As to what's "correct" for the kernel in theory, it would certainly make
>> sense to clean up the ptrace cases to use the tracer (parent) pid_ns when
>> reporting any PID as such. The wait and SIGCHLD code already does this, so
>> that would be consistent. Off hand I don't see anything other than
>> tracehook_report_clone{,_complete}() that sees the wrong value now.
>
> Yup.
>
>> Fixing that requires a bit of hair. The simple and clean approach is to
>> just have the tracehook calls (i.e. ptrace layer) extract the PID from the
>> task_struct using the desired pid_ns. The trouble there is that the
>
> It's also possible to take an extra reference to the struct pid and pass
> that to the tracehook. That and the pid_ns of the tracer receiving the pid
> is all we'll ever need inside the tracehook layer. The only advantage, I
> think, is we wouldn't pin the task struct while holding the pid reference.
>
>> tracehook_report_clone_complete() call is made when that task_struct is no
>> longer guaranteed to be valid. The contrary approach of extracting the
>> appropriate value for the tracer earlier breaks the clean layering because
>> the fork.c code really should not know at all that ->parent->nsproxy is the
>> place to look for what values to pass to tracehook calls. I guess the
>> simple and clean fix is to get_task_struct() before wake_up_new_task()
>> and put_task_struct() after tracehook_report_clone_complete(). That does
>> add some gratuitous atomic incr/decr overhead, though.
>
> Also true.
>
> Of course my suggestion of holding the pid reference won't avoid adding
> atomic ops -- just changes which refcount they work on.
>
>>
>> None of this has much of anything to do with strace, of course. As I've
>> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
>> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel. As
>> Oleg said, strace doesn't use that at all. (This is not the place to
>> discuss the details of strace further.)
>
> Also, looking at proposed changes (utrace and Eric Biederman's setns())
> storing a pid nr rather than a reference to a task struct or struct pid
> probably won't be correct.

My setns work has demonstrated that even for entering a namespace we
never ever need to change the struct pid of a task. setns has no other
bearing on this problem then to say there is no foreseeable reason to
change the rules.

> In the case of Eric Biederman's setns(), if capable of changing pid namespace,
> we could have:
>
> Traced Tracer
> fork()
> ... (an arbitrary amount of time passes)
> setns()
> ptrace(GETEVENTMSG)

Forget that. The pid namespace was architected so that we can ptrace a process
in another pid namespace.

> At which point returning a static pid number held in the message field
> produces the wrong pid.

No. A processes always sees pids from the context of it's original pid
namespace. All setns does is affect which pid namespace children will
be native in.


Eric

2010-04-06 15:17:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Testing lxc 0.6.5 in Fedora 13

Oleg Nesterov <[email protected]> writes:

> On 04/06, Matt Helsley wrote:
>>
>> On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:
>>
>> > tracehook_report_clone_complete() call is made when that task_struct is no
>> > longer guaranteed to be valid.
>
> Hmm. I missed this.
>
>> Also, if utrace allows multiple tracers and they each
>> exist in a different namespace then storing a pid nr isn't going to work.
>
> Yes, but utrace is simple. ptrace_report_clone() does
>
> ctx->eventmsg = child->pid;
>
> we should fix this line and that is all, afaics. Every tracer has a
> separate "struct ptrace_context *ctx".
>
>> So my hunch is, in the long run, we'll need to hold a reference there and
>> drop it when the last tracer detaches
>
> Without utrace only one tracer is possible.
>
> So, I think we should either change do_fork() to get the right tracee_pid_nr,
> or add get/put into do_fork() and change tracehooks as Roland suggested.

For a unicast path where the is no danger of the destination process changing
I don't see why we can't compute the userspace pid_nr. It only get's tricky
for things like broadcast signals (pgrp, session) when we don't who the
final recipient process will be.

I think that only gets truly bad in the case of unix domain sockets.

Eric

2010-04-06 15:30:00

by Matt Helsley

[permalink] [raw]
Subject: Re: Testing lxc 0.6.5 in Fedora 13

On Tue, Apr 06, 2010 at 08:13:13AM -0700, Eric W. Biederman wrote:
> Matt Helsley <[email protected]> writes:
>
> > On Mon, Apr 05, 2010 at 08:44:43PM -0700, Roland McGrath wrote:

<snip>

> >> None of this has much of anything to do with strace, of course. As I've
> >> said, I don't see anything other than the PTRACE_GETEVENTMSG value for
> >> PTRACE_EVENT_{CLONE,FORK,VFORK} reports that is wrong in the kernel. As
> >> Oleg said, strace doesn't use that at all. (This is not the place to
> >> discuss the details of strace further.)
> >
> > Also, looking at proposed changes (utrace and Eric Biederman's setns())
> > storing a pid nr rather than a reference to a task struct or struct pid
> > probably won't be correct.
>
> My setns work has demonstrated that even for entering a namespace we
> never ever need to change the struct pid of a task. setns has no other
> bearing on this problem then to say there is no foreseeable reason to
> change the rules.
>
> > In the case of Eric Biederman's setns(), if capable of changing pid namespace,
> > we could have:
> >
> > Traced Tracer
> > fork()
> > ... (an arbitrary amount of time passes)
> > setns()
> > ptrace(GETEVENTMSG)
>
> Forget that. The pid namespace was architected so that we can ptrace a process
> in another pid namespace.
>
> > At which point returning a static pid number held in the message field
> > produces the wrong pid.
>
> No. A processes always sees pids from the context of it's original pid
> namespace. All setns does is affect which pid namespace children will
> be native in.

OK, good. So we can resolve the tasks/struct pids within the tracehook
and be done with it. Thanks Eric!

Cheers,
-Matt Helsley