2011-06-15 14:56:06

by Greg Kurz

Subject: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Since pid namespaces were introduced, there has been a recurring demand: how
can one correlate a pid from a child pid ns with a pid from a parent pid ns?
The need arises in the LXC community when one wants to send a signal from
the host (aka. init_pid_ns context) to a container process for which one
only knows the pid inside the container.

In the future, this should be achievable thanks to Eric Biederman's setns()
syscall but there's still some work to be done to support pid namespaces:

https://lkml.org/lkml/2011/5/21/162

As stated by Serge Hallyn in:

http://sourceforge.net/mailarchive/message.php?msg_id=27424447

"There is nothing that gives you a 100% guaranteed correct race-free
correspondence right now. You can look under /proc/<pid>/root/proc/ to
see the pids valid in the container, and you can relate output of
lxc-ps --forest to ps --forest output. But nothing under /proc that I
know of tells you "this task is the same as that task". You can't
even look at /proc/<pid> inode numbers since they are different
filesystems for each proc mount."

This patch adds a single line to /proc/self/status. Provided one has kept
track of one's container tasks (with a cgroup, as liblxc does for example),
one may correlate global pids with container pids. This is still racy but
definitely easier than what we have today.

ChangeLog:
v2: - changed Vpid: to ActivePid:
    - lock ->sighand before calling task_active_pid_ns()

Signed-off-by: Greg Kurz <[email protected]>
Signed-off-by: Cedric Le Goater <[email protected]>
---

fs/proc/array.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 5e4f776..5f796d9 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -165,7 +165,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
int g;
struct fdtable *fdt = NULL;
const struct cred *cred;
- pid_t ppid, tpid;
+ pid_t ppid, tpid, actpid;
+ struct sighand_struct *sighand;

rcu_read_lock();
ppid = pid_alive(p) ?
@@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
if (tracer)
tpid = task_pid_nr_ns(tracer, ns);
}
+ actpid = 0;
+ sighand = rcu_dereference(p->sighand);
+ if (sighand) {
+ struct pid_namespace *pid_ns;
+ unsigned long flags;
+ spin_lock_irqsave(&sighand->siglock, flags);
+ pid_ns = task_active_pid_ns(p);
+ if (pid_ns)
+ actpid = task_pid_nr_ns(p, pid_ns);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+ }
cred = get_task_cred(p);
seq_printf(m,
"State:\t%s\n"
@@ -183,12 +195,13 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
"Pid:\t%d\n"
"PPid:\t%d\n"
"TracerPid:\t%d\n"
+ "ActivePid:\t%d\n"
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
task_tgid_nr_ns(p, ns),
pid_nr_ns(pid, ns),
- ppid, tpid,
+ ppid, tpid, actpid,
cred->uid, cred->euid, cred->suid, cred->fsuid,
cred->gid, cred->egid, cred->sgid, cred->fsgid);


2011-06-15 18:49:23

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/15, Greg Kurz wrote:
>
> @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> if (tracer)
> tpid = task_pid_nr_ns(tracer, ns);
> }
> + actpid = 0;
> + sighand = rcu_dereference(p->sighand);
> + if (sighand) {
> + struct pid_namespace *pid_ns;
> + unsigned long flags;
> + spin_lock_irqsave(&sighand->siglock, flags);

Well. This is not exactly right. We have lock_task_sighand() for this.

But. Why do you need ->siglock? Why rcu_read_lock() is not enough?

Hmm. You don't even need pid_ns afaics, you could simply look at
pid->numbers[pid->level].

Oleg.

2011-06-15 19:04:56

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Forgot to ask,

On 06/15, Greg Kurz wrote:
>
> The need arises in the LXC community when one wants to send a signal from
> the host (aka. init_pid_ns context) to a container process for which one
> only knows the pid inside the container.

I am just curious, why do you need this?

Oleg.

2011-06-15 19:08:37

by Eric W. Biederman

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Oleg Nesterov <[email protected]> writes:

> On 06/15, Greg Kurz wrote:
>>
>> @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>> if (tracer)
>> tpid = task_pid_nr_ns(tracer, ns);
>> }
>> + actpid = 0;
>> + sighand = rcu_dereference(p->sighand);
>> + if (sighand) {
>> + struct pid_namespace *pid_ns;
>> + unsigned long flags;
>> + spin_lock_irqsave(&sighand->siglock, flags);
>
> Well. This is not exactly right. We have lock_task_sighand() for this.
>
> But. Why do you need ->siglock? Why rcu_read_lock() is not enough?
>
> Hmm. You don't even need pid_ns afaics, you could simply look at
> pid->numbers[pid->level].

I got this moving in that direction, but I admit I probably didn't look
close enough. I just remember it is always tricky when accessing a
process and dealing with races with things like unhash_process().

Eric

2011-06-16 11:02:16

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Wed, 2011-06-15 at 20:46 +0200, Oleg Nesterov wrote:
> On 06/15, Greg Kurz wrote:
> >
> > @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> > if (tracer)
> > tpid = task_pid_nr_ns(tracer, ns);
> > }
> > + actpid = 0;
> > + sighand = rcu_dereference(p->sighand);
> > + if (sighand) {
> > + struct pid_namespace *pid_ns;
> > + unsigned long flags;
> > + spin_lock_irqsave(&sighand->siglock, flags);
>
> Well. This is not exactly right. We have lock_task_sighand() for this.
>

I see... ->sighand could change so we need the for(;;) loop in
__lock_task_sighand() to be sure we have the right pointer, correct ?
By the way, if we use lock_task_sighand() we'll end up with nested
rcu_read_lock(): it will work but I don't know how it may affect
performance...

> But. Why do you need ->siglock? Why rcu_read_lock() is not enough?
>

Because there's a race with
__exit_signal()->__unhash_process()->detach_pid() that can break
task_active_pid_ns() and rcu won't help here (unless *perhaps* by
modifying __exit_signal() but I don't want to mess with such a critical
path).

> Hmm. You don't even need pid_ns afaics, you could simply look at
> pid->numbers[pid->level].
>

True but I will have the same problem: detach_pid() nullifies the pid.

Thanks for your comments.

--
Greg

2011-06-16 11:19:41

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Wed, 2011-06-15 at 21:03 +0200, Oleg Nesterov wrote:
> Forgot to ask,
>
> On 06/15, Greg Kurz wrote:
> >
> > The need arises in the LXC community when one wants to send a signal from
> > the host (aka. init_pid_ns context) to a container process for which one
> > only knows the pid inside the container.
>
> I am just curious, why do you need this?
>

Because some LXC users run partially isolated containers (AKA.
application containers started with the lxc-execute command). Some of
the user code runs outside the container and some inside. Since
lxc-execute uses CLONE_NEWPID, it's difficult for the external code to
relate a pid generated inside the container with a task. There are
regular requests on lxc-users@ about this.

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2011-06-16 12:26:28

by Cédric Le Goater

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16/2011 01:19 PM, Greg Kurz wrote:
> On Wed, 2011-06-15 at 21:03 +0200, Oleg Nesterov wrote:
>> Forgot to ask,
>>
>> On 06/15, Greg Kurz wrote:
>>>
>>> The need arises in the LXC community when one wants to send a signal from
>>> the host (aka. init_pid_ns context) to a container process for which one
>>> only knows the pid inside the container.
>>
>> I am just curious, why do you need this?
>>
>
> Because some LXC users run partially isolated containers (AKA.
> application containers started with the lxc-execute command). Some of
> the user code runs outside the container and some inside. Since
> lxc-execute uses CLONE_NEWPID, it's difficult for the external code to
> relate a pid generated inside the container with a task. There are
> regular requests on lxc-users@ about this.

It's simply not possible. There's no support for it.

We have a case where a task in a parent pid namespace needs to kill
another task in a sub pid namespace only knowing its internal pid.
The latter has been communicated to the parent task through a file or
a unix socket.

This 'ActivePid' information in /proc is not sufficient to identify
the task, you also need the list of the tasks which are living in
the pid namespace.


We could have used the setns syscall and attach a kill command to
a pid namespace. But this is borderline as you might end up
killing all the tasks in the namespace you've attached to. Anyhow,
setns doesn't support pid namespaces yet and it feels safer to send
the signal from outside.

a new kill syscall could be the solution:

int pidns_kill(pid_t init_pid, pid_t some_pid);

where 'init_pid' identifies the namespace and 'some_pid' identifies
a task in this namespace. this is very specific but why not.

Finally, we thought that the 'ActivePid' information was interesting
enough to be exposed in /proc and was solving the case we're facing
rather easily with some framework (cgroups) in user space.

Best,

C.

2011-06-16 12:36:07

by Louis Rilling

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 16/06/11 13:01 +0200, Greg Kurz wrote:
> On Wed, 2011-06-15 at 20:46 +0200, Oleg Nesterov wrote:
> > On 06/15, Greg Kurz wrote:
> > >
> > > @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> > > if (tracer)
> > > tpid = task_pid_nr_ns(tracer, ns);
> > > }
> > > + actpid = 0;
> > > + sighand = rcu_dereference(p->sighand);
> > > + if (sighand) {
> > > + struct pid_namespace *pid_ns;
> > > + unsigned long flags;
> > > + spin_lock_irqsave(&sighand->siglock, flags);
> >
> > Well. This is not exactly right. We have lock_task_sighand() for this.
> >
>
> I see... ->sighand could change so we need the for(;;) loop in
> __lock_task_sighand() to be sure we have the right pointer, correct ?
> By the way, if we use lock_task_sighand() we'll end up with nested
> rcu_read_lock(): it will work but I don't know how it may affect
> performance...

rcu_read_lock() is very cheap.

>
> > But. Why do you need ->siglock? Why rcu_read_lock() is not enough?
> >
>
> Because there's a race with
> __exit_signal()->__unhash_process()->detach_pid() that can break
> task_active_pid_ns() and rcu won't help here (unless *perhaps* by
> modifying __exit_signal() but I don't want to mess with such a critical
> path).

In case of race, the only risk is that task_active_pid_ns() returns NULL.
Otherwise, RCU guarantees that the pid_ns will stay alive (see below).

>
> > Hmm. You don't even need pid_ns afaics, you could simply look at
> > pid->numbers[pid->level].
> >
>
> True but I will have the same problem: detach_pid() nullifies the pid.

But the pid won't be freed until an RCU grace period expires. See free_pid(). So
the non-determinism here is when /proc/<pid>/status is read at the same time as a
threaded execve() or task's exit(), in which case a stale pid (execve()) or
no pid (exit after __unhash_process()) can be accessed. This does not look like
a big deal...

Thanks,

Louis

>
> Thanks for your comments.
>
> --
> Greg
>

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2011-06-16 12:45:13

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Greg Kurz wrote:
>
> On Wed, 2011-06-15 at 20:46 +0200, Oleg Nesterov wrote:
> > On 06/15, Greg Kurz wrote:
> > >
> > > @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> > > if (tracer)
> > > tpid = task_pid_nr_ns(tracer, ns);
> > > }
> > > + actpid = 0;
> > > + sighand = rcu_dereference(p->sighand);
> > > + if (sighand) {
> > > + struct pid_namespace *pid_ns;
> > > + unsigned long flags;
> > > + spin_lock_irqsave(&sighand->siglock, flags);
> >
> > Well. This is not exactly right. We have lock_task_sighand() for this.
> >
>
> I see... ->sighand could change so we need the for(;;) loop in
> __lock_task_sighand() to be sure we have the right pointer, correct ?

Yes,

> By the way, if we use lock_task_sighand() we'll end up with nested
> rcu_read_lock(): it will work but I don't know how it may affect
> performance...

You are kidding ;)

> > But. Why do you need ->siglock? Why rcu_read_lock() is not enough?
> >
>
> Because there's a race with
> __exit_signal()->__unhash_process()->detach_pid() that can break
> task_active_pid_ns()

Yes,

> and rcu won't help here

Why? free_pid() uses call_rcu() to do put_pid()

> (unless *perhaps* by
> modifying __exit_signal() but I don't want to mess with such a critical
> path).

I don't think so...

> > Hmm. You don't even need pid_ns afaics, you could simply look at
> > pid->numbers[pid->level].
> >
>
> True but I will have the same problem: detach_pid() nullifies the pid.

Can't understand. Of course pid can be NULL. So what? Say, ->sighand
can be NULL as well, they both "disappear" at the same time. This is
fine, we raced with exit, we should report pid=0.

Oleg.

2011-06-16 12:54:56

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Greg Kurz wrote:
>
> On Wed, 2011-06-15 at 21:03 +0200, Oleg Nesterov wrote:
> > Forgot to ask,
> >
> > On 06/15, Greg Kurz wrote:
> > >
> > > The need arises in the LXC community when one wants to send a signal from
> > > the host (aka. init_pid_ns context) to a container process for which one
> > > only knows the pid inside the container.
> >
> > I am just curious, why do you need this?
> >
>
> Because some LXC users run partially isolated containers (AKA.
> application containers started with the lxc-execute command). Some of
> the user code runs outside the container and some inside. Since
> lxc-execute uses CLONE_NEWPID, it's difficult for the external code to
> relate a pid generated inside the container with a task. There are
> regular requests on lxc-users@ about this.

Well, this doesn't answer my question ;) Why do you need to send the
signal into the sub-namespace, and how/why do you know the pid inside
the container?

OK, nevermind, I was just curious.

Oleg.

2011-06-16 13:01:11

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Thu, 2011-06-16 at 14:35 +0200, Louis Rilling wrote:
> On 16/06/11 13:01 +0200, Greg Kurz wrote:
> > On Wed, 2011-06-15 at 20:46 +0200, Oleg Nesterov wrote:
> > > On 06/15, Greg Kurz wrote:
> > > >
> > > > @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> > > > if (tracer)
> > > > tpid = task_pid_nr_ns(tracer, ns);
> > > > }
> > > > + actpid = 0;
> > > > + sighand = rcu_dereference(p->sighand);
> > > > + if (sighand) {
> > > > + struct pid_namespace *pid_ns;
> > > > + unsigned long flags;
> > > > + spin_lock_irqsave(&sighand->siglock, flags);
> > >
> > > Well. This is not exactly right. We have lock_task_sighand() for this.
> > >
> >
> > I see... ->sighand could change so we need the for(;;) loop in
> > __lock_task_sighand() to be sure we have the right pointer, correct ?
> > By the way, if we use lock_task_sighand() we'll end up with nested
> > rcu_read_lock(): it will work but I don't know how it may affect
> > performance...
>
> rcu_read_lock() is very cheap.
>

Fair enough. In this case, lock_task_sighand() would be the right choice
if locking is needed.

> >
> > > But. Why do you need ->siglock? Why rcu_read_lock() is not enough?
> > >
> >
> > Because there's a race with
> > __exit_signal()->__unhash_process()->detach_pid() that can break
> > task_active_pid_ns() and rcu won't help here (unless *perhaps* by
> > modifying __exit_signal() but I don't want to mess with such a critical
> > path).
>
> In case of race, the only risk is that task_active_pid_ns() returns NULL.
> Otherwise, RCU guarantees that the pid_ns will stay alive (see below).
>
> >
> > > Hmm. You don't even need pid_ns afaics, you could simply look at
> > > pid->numbers[pid->level].
> > >
> >
> > True but I will have the same problem: detach_pid() nullifies the pid.
>
> But the pid won't be freed until an RCU grace period expires. See free_pid(). So
> the non-determinism here is when /proc/<pid>/status is read at the same time as a
> threaded execve() or task's exit(), in which case a stale pid (execve()) or
> no pid (exit after __unhash_process()) can be accessed. This does not look like
> a big deal...
>

Ok. You're right, the RCU grace period is just what I need to ensure I
won't dereference a stale pointer. So I don't even have to bother with
->siglock and just check pid_alive() before peeking into pid->numbers.

> Thanks,
>
> Louis
>

Thanks for your help.

--
Greg

2011-06-16 13:09:36

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Cedric Le Goater wrote:
>
> We have a case where a task in a parent pid namespace needs to kill
> another task in a sub pid namespace only knowing its internal pid.
> the latter has been communicated to the parent task through a file or
> a unix socket.

OK, thanks, this partly answers my question... But if they communicate
anyway, it is not clear why the signal is needed.

> This 'ActivePid' information in /proc is not sufficient to identify
> the task, you also need the list of the tasks which are living in
> the pid namespace.

Yes, I see.

> a new kill syscall could be the solution:
>
> int pidns_kill(pid_t init_pid, pid_t some_pid);
>
> where 'init_pid' identifies the namespace and 'some_pid' identifies
> a task in this namespace. this is very specific but why not.

Yes, I also thought about this. Should be trivial.

Or int sys_tell_me_its_pid(pid_t init_pid, pid_t some_pid).




Just in case... This is a hack, yes, but in fact you do not need the
kernel changes to send a signal inside the namespace. You could
ptrace sub_init and execute the necessary code "inside" the namespace.

Oleg.

2011-06-16 13:21:15

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Greg Kurz wrote:
>
> Ok. You're right, the RCU grace period is just what I need to ensure I
> won't dereference a stale pointer. So I don't even have to bother with
> ->siglock and just check pid_alive() before peeking into pid->numbers.

Yes, but you don't really need pid_alive(). Just read ->pids[].pid and
check it is !NULL before dereferencing. Also, we already have 2 pid_alive()
checks, this is ugly. You can consolidate this code.

Oleg.

2011-06-16 13:25:59

by Louis Rilling

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 16/06/11 15:00 +0200, Greg Kurz wrote:
> On Thu, 2011-06-16 at 14:35 +0200, Louis Rilling wrote:
> > On 16/06/11 13:01 +0200, Greg Kurz wrote:
> > > On Wed, 2011-06-15 at 20:46 +0200, Oleg Nesterov wrote:
> > > > On 06/15, Greg Kurz wrote:
> > > > >
> > > > > @@ -176,6 +177,17 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> > > > > if (tracer)
> > > > > tpid = task_pid_nr_ns(tracer, ns);
> > > > > }
> > > > > + actpid = 0;
> > > > > + sighand = rcu_dereference(p->sighand);
> > > > > + if (sighand) {
> > > > > + struct pid_namespace *pid_ns;
> > > > > + unsigned long flags;
> > > > > + spin_lock_irqsave(&sighand->siglock, flags);
> > > >
> > > > Well. This is not exactly right. We have lock_task_sighand() for this.
> > > >
> > >
> > > I see... ->sighand could change so we need the for(;;) loop in
> > > __lock_task_sighand() to be sure we have the right pointer, correct ?
> > > By the way, if we use lock_task_sighand() we'll end up with nested
> > > rcu_read_lock(): it will work but I don't know how it may affect
> > > performance...
> >
> > rcu_read_lock() is very cheap.
> >
>
> Fair enough. In this case, lock_task_sighand() would be the right choice
> if locking is needed.
>
> > >
> > > > But. Why do you need ->siglock? Why rcu_read_lock() is not enough?
> > > >
> > >
> > > Because there's a race with
> > > __exit_signal()->__unhash_process()->detach_pid() that can break
> > > task_active_pid_ns() and rcu won't help here (unless *perhaps* by
> > > modifying __exit_signal() but I don't want to mess with such a critical
> > > path).
> >
> > In case of race, the only risk is that task_active_pid_ns() returns NULL.
> > Otherwise, RCU guarantees that the pid_ns will stay alive (see below).
> >
> > >
> > > > Hmm. You don't even need pid_ns afaics, you could simply look at
> > > > pid->numbers[pid->level].
> > > >
> > >
> > > True but I will have the same problem: detach_pid() nullifies the pid.
> >
> > But the pid won't be freed until an RCU grace period expires. See free_pid(). So
> > the non-determinism here is when /proc/<pid>/status is read at the same time as a
> > threaded execve() or task's exit(), in which case a stale pid (execve()) or
> > no pid (exit after __unhash_process()) can be accessed. This does not look like
> > a big deal...
> >
>
> Ok. You're right, the RCU grace period is just what I need to ensure I
> won't dereference a stale pointer. So I don't even have to bother with
> ->siglock and just check pid_alive() before peeking into pid->numbers.

It ends up as open-coding an optimized version of task_pid_vnr(). If the
optimization is really important (I guess this depends on the depth of recursive
pid namespaces), it would be better to re-write task_pid_vnr(). Otherwise, just
use task_pid_vnr() as it is.

Thanks,

Louis

>
> > Thanks,
> >
> > Louis
> >
>
> Thanks for your help.
>
> --
> Greg
>

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2011-06-16 14:26:12

by Cédric Le Goater

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16/2011 03:06 PM, Oleg Nesterov wrote:
> On 06/16, Cedric Le Goater wrote:
>>
>> We have a case where a task in a parent pid namespace needs to kill
>> another task in a sub pid namespace only knowing its internal pid.
>> the latter has been communicated to the parent task through a file or
>> a unix socket.
>
> OK, thanks, this partly answers my question... But if they communicate
> anyway, it is not clear why the signal is needed.

Well, user space always finds ways to challenge the kernel.

Our case is related to HPC. The batch manager runs jobs inside lxc
containers (using namespaces) and signals are sent to the application
for different reasons. First, to cleanly exit but also for other more
specific actions related to the cluster interconnects.

>> This 'ActivePid' information in /proc is not sufficient to identity
>> the task, you also need the list of the tasks which are living in
>> the pid namespace.
>
> Yes, I see.
>
>> a new kill syscall could be the solution:
>>
>> int pidns_kill(pid_t init_pid, pid_t some_pid);
>>
>> where 'init_pid' identifies the namespace and 'some_pid' identifies
>> a task in this namespace. this is very specific but why not.
>
> Yes, I also thought about this. Should be trivial.
>
> Or int sys_tell_me_its_pid(pid_t init_pid, pid_t some_pid).

why not. it's even better because more general.

> Just in case.... This is hack, yes, but in fact you do not need the
> kernel changes to send a signal inside the namespace. You could
> ptrace sub_init, and execute the necessary code "inside" the namespace.

hmm, I'll look at that.

Thanks,

C.

2011-06-16 14:53:00

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Louis Rilling wrote:
>
> On 16/06/11 15:00 +0200, Greg Kurz wrote:
> > peeking into pid->numbers.
>
> It ends like open-coding an optimized version of task_pid_vnr(). If the
> optimization is really important (I guess this depends on the depth of recursive
> pid namespaces), it would be better to re-write task_pid_vnr().

No, task_pid_vnr(p) is different, it should use the caller's namespace.

Just in case, I agree there is no need to optimize this code. The simpler
the better. I mentioned pid->numbers[pid->level] just to point that all
we need is task_pid() itself, there are no subtle races which need the
locking.

Oleg.

2011-06-16 15:01:59

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Thu, 2011-06-16 at 15:25 +0200, Louis Rilling wrote:
> > Ok. You're right, the RCU grace period is just what I need to ensure I
> > won't dereference a stale pointer. So I don't even have to bother with
> > ->siglock and just check pid_alive() before peeking into pid->numbers.
>
> It ends like open-coding an optimized version of task_pid_vnr(). If the
> optimization is really important (I guess this depends on the depth of
> recursive pid namespaces), it would be better to re-write task_pid_vnr().
> Otherwise, just use task_pid_vnr() as it is.
>
> Thanks,
>
> Louis
>
Hmm, sorry Louis, but I'm looking for the pid number from the task's active
pid_ns (i.e. the return value of getpid() if called by this task), so
task_pid_vnr() doesn't fit.

About the open-coding argument, that's why I used task_pid_nr_ns() and
task_active_pid_ns() at first...

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2011-06-16 15:07:27

by Eric W. Biederman

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Cedric Le Goater <[email protected]> writes:

> On 06/16/2011 01:19 PM, Greg Kurz wrote:
>> On Wed, 2011-06-15 at 21:03 +0200, Oleg Nesterov wrote:
>>> Forgot to ask,
>>>
>>> On 06/15, Greg Kurz wrote:
>>>>
>>>> The need arises in the LXC community when one wants to send a signal from
>>>> the host (aka. init_pid_ns context) to a container process for which one
>>>> only knows the pid inside the container.
>>>
>>> I am just curious, why do you need this?
>>>
>>
>> Because some LXC users run partially isolated containers (AKA.
>> application containers started with the lxc-execute command). Some of
>> the user code runs outside the container and some inside. Since
>> lxc-execute uses CLONE_NEWPID, it's difficult for the external code to
>> relate a pid generated inside the container with a task. There are
>> regular requests on lxc-users@ about this.
>
> It's simply not possible. There's no support for it.
>
> We have a case where a task in a parent pid namespace needs to kill
> another task in a sub pid namespace only knowing its internal pid.
> the latter has been communicated to the parent task through a file or
> a unix socket.

If you are using unix domain sockets you can get them to translate
the pid for you, with SCM_CREDS. Unix domain sockets have to do that
or it is a security problem.

> This 'ActivePid' information in /proc is not sufficient to identify
> the task, you also need the list of the tasks which are living in
> the pid namespace.
>
>
> We could have used the setns syscall and attach a kill command to
> a pid namespace. But this is borderline as you might end up
> killing all the tasks in the namespace you've attached to. Anyhow,
> setns doesn't support pid namespaces yet and it feels safer to send
> the signal from outside.
>
> a new kill syscall could be the solution:
>
> int pidns_kill(pid_t init_pid, pid_t some_pid);
>
> where 'init_pid' identifies the namespace and 'some_pid' identifies
> a task in this namespace. this is very specific but why not.

Ugh.

So I took a quick look at things and while setns support is tricky
from a technical perspective for pid namespaces, supporting the
namespace files is not.

We can plan on having pid, mnt, and user namespace namespace proc files
in 3.1. Along with the small changes needed to have inode numbers on
the namespace files that are useful for comparing namespaces for
equality. I will see about getting those patches out for review
shortly.

> Finally, we thought that the 'ActivePid' information was interesting
> enough to be exposed in /proc and was solving the case we're facing
> rather easily with some framework (cgroups) in user space.

The pid number by which a process knows itself doesn't seem
unreasonable. I'm not totally

After having a second look at the code in proc we already have
a reference to the struct pid we want to print. So simply
printing
pid->numbers[pid->level].nr
seems reasonable, and race free.

Alternatively that could be written.
pid_nr_ns(pid, ns_of_pid(pid));

Eric

2011-06-16 15:08:26

by Louis Rilling

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 16/06/11 16:51 +0200, Oleg Nesterov wrote:
> On 06/16, Louis Rilling wrote:
> >
> > On 16/06/11 15:00 +0200, Greg Kurz wrote:
> > > peeking into pid->numbers.
> >
> > It ends like open-coding an optimized version of task_pid_vnr(). If the
> > optimization is really important (I guess this depends on the depth of recursive
> > pid namespaces), it would be better to re-write task_pid_vnr().
>
> No, task_pid_vnr(p) is different, it should use the caller's namespace.

Damned, I read __task_pid_nr_ns() too quickly. Thanks for correcting me.

>
> Just in case, I agree there is no need to optimize this code. The simpler
> the better. I mentioned pid->numbers[pid->level] just to point that all
> we need is task_pid() itself, there are no subtle races which need the
> locking.

Agreed.

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2011-06-16 15:22:28

by Eric W. Biederman

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Cedric Le Goater <[email protected]> writes:

> On 06/16/2011 03:06 PM, Oleg Nesterov wrote:
>> On 06/16, Cedric Le Goater wrote:
>>>
>>> We have a case where a task in a parent pid namespace needs to kill
>>> another task in a sub pid namespace only knowing its internal pid.
>>> the latter has been communicated to the parent task through a file or
>>> a unix socket.
>>
>> OK, thanks, this partly answers my question... But if they communicate
>> anyway, it is not clear why the signal is needed.
>
> Well, user space always finds ways to challenge the kernel.
>
> Our case is related to HPC. The batch manager runs jobs inside lxc
> containers (using namespaces) and signals are sent to the application
> for different reasons. First, to cleanly exit but also for other more
> specific actions related to the cluster interconnects.

In that case I really recommend unix domain sockets. You likely
won't need a kernel upgrade to make use of those and their pid
translation ability.

>>> a new kill syscall could be the solution:
>>>
>>> int pidns_kill(pid_t init_pid, pid_t some_pid);
>>>
>>> where 'init_pid' identifies the namespace and 'some_pid' identifies
>>> a task in this namespace. this is very specific but why not.
>>
>> Yes, I also thought about this. Should be trivial.
>>
>> Or int sys_tell_me_its_pid(pid_t init_pid, pid_t some_pid).
>
> why not. it's even better because more general.

If we get as far as a new system call (and I don't think any of this
needs a new system call) we really should use a namespace file
descriptor to identify the pid namespace not a pid.

>> Just in case.... This is hack, yes, but in fact you do not need the
>> kernel changes to send a signal inside the namespace. You could
>> ptrace sub_init, and execute the necessary code "inside" the namespace.
>
> hmm, I look at that.

Looking at the ptrace interactions is definitely worthwhile.

I remember there were a few very weird things with pids when ptracing
a process in another pid namespace. It may be that ActivePid is enough
to allow the tracer to figure out the confusing information it is
getting.

I would be surprised if using ptrace to send signals is how you
want to do things. It works, and it is a great argument from
a security perspective on allowing things that we already allow.
Using ptrace to run system calls was cumbersome and not easily
portable across architectures last time I looked.

Eric

2011-06-16 15:27:46

by Louis Rilling

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 16/06/11 17:01 +0200, Greg Kurz wrote:
> On Thu, 2011-06-16 at 15:25 +0200, Louis Rilling wrote:
> > > Ok. You're right, the RCU grace period is just what I need to ensure I
> > > won't dereference a stale pointer. So I don't even have to bother with
> > > ->siglock and just check pid_alive() before peeking into pid->numbers.
> >
> > It ends like open-coding an optimized version of task_pid_vnr(). If the
> > optimization is really important (I guess this depends on the depth of
> > recursive pid namespaces), it would be better to re-write task_pid_vnr().
> > Otherwise, just use task_pid_vnr() as it is.
> >
> > Thanks,
> >
> > Louis
> >
> Hmm, sorry Louis but I'm looking for the pid number from the task active
> pid_ns (AKA. the return value of getpid() if called by this task), so
> task_pid_vnr() doesn't fit.

Yup, I should have somewhat recalled that _vnr() means "in the caller's pid
namespace". Sorry for the wrong argument.

>
> About the open-coding argument, that's why I used task_pid_nr_ns() and
> task_active_pid_ns() at first...

Sure. I still think that if you end accessing pid->numbers[pid->level].nr, then
you're close to coding some (optimized) task_active_pid_nr() that could also be
used by getpid(), using some task_active_tgid_nr() instead of task_tgid_vnr().
Anyway, optimization is not a relevant point here AFAICS.

Thanks,

Louis


2011-06-16 15:33:48

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Thu, 2011-06-16 at 08:07 -0700, Eric W. Biederman wrote:
> > Finally, we thought that the 'ActivePid' information was interesting
> > enough to be exposed in /proc and was solving the case we're facing
> > rather easily with some framework (cgroups) in user space.
>
> The pid number by which a process knows itself doesn't seem
> unreasonable. I'm not totally
>
> After having a second look at the code in proc we already have
> a reference to the struct pid we want to print. So simply
> printing
> pid->numbers[pid->level].nr
> seems reasonable, and race free.
>
> Alternatively that could be written.
> pid_nr_ns(pid, ns_of_pid(pid));
>
> Eric

So we end up with the following, since pid is valid:

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 5e4f776..b7316cc 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -183,12 +183,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
"Pid:\t%d\n"
"PPid:\t%d\n"
"TracerPid:\t%d\n"
+ "ActivePid:\t%d\n"
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
task_tgid_nr_ns(p, ns),
pid_nr_ns(pid, ns),
ppid, tpid,
+ pid_nr_ns(pid, ns_of_pid(pid)),
cred->uid, cred->euid, cred->suid, cred->fsuid,
cred->gid, cred->egid, cred->sgid, cred->fsgid);

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2011-06-16 16:15:42

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Greg Kurz wrote:
>
> @@ -183,12 +183,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> "Pid:\t%d\n"
> "PPid:\t%d\n"
> "TracerPid:\t%d\n"
> + "ActivePid:\t%d\n"
> "Uid:\t%d\t%d\t%d\t%d\n"
> "Gid:\t%d\t%d\t%d\t%d\n",
> get_task_state(p),
> task_tgid_nr_ns(p, ns),
> pid_nr_ns(pid, ns),
> ppid, tpid,
> + pid_nr_ns(pid, ns_of_pid(pid)),

Indeed, I forgot that we already have task_pid() in the arguments.

Oleg.

2011-06-16 16:25:15

by Oleg Nesterov

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On 06/16, Eric W. Biederman wrote:
>
> I remember there were a few very weird things with pids when ptracing
> a process in another pid namespace.

The only problem is pids. PTRACE_EVENT_FORK/etc reports the global pid.
This is fixable. Some tracers look at syscall_get_return_value() register
and get the "local" pid which they do not want.

> It may be that ActivePid is enough
> to allow the tracer to figure out the confusing information it is
> getting.

I don't think so. ActivePid doesn't help unless you know all pids in
this container.

> I would be surprised if using ptrace to send signals is how you
> want to do things.

Yes, yes, this is hack.

Oleg.

2011-06-16 17:55:39

by Bryan Donlan

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Wed, Jun 15, 2011 at 10:55, Greg Kurz <[email protected]> wrote:
> Since pid namespaces were introduced, there's a recurring demand: how one
> can correlate a pid from a child pid ns with a pid from a parent pid ns ?
> The need arises in the LXC community when one wants to send a signal from
> the host (aka. init_pid_ns context) to a container process for which one
> only knows the pid inside the container.
>
> In the future, this should be achievable thanks to Eric Biederman's setns()
> syscall but there's still some work to be done to support pid namespaces:
>
> https://lkml.org/lkml/2011/5/21/162
>
> As stated by Serge Hallyn in:
>
> http://sourceforge.net/mailarchive/message.php?msg_id=27424447
>
> "There is nothing that gives you a 100% guaranteed correct race-free
> correspondence right now. You can look under /proc/<pid>/root/proc/ to
> see the pids valid in the container, and you can relate output of
> lxc-ps --forest to ps --forest output. But nothing under /proc that I
> know of tells you "this task is the same as that task". You can't
> even look at /proc/<pid> inode numbers since they are different
> filesystems for each proc mount."
>
> This patch adds a single line to /proc/self/status. Provided one has kept
> track of its container tasks (with a cgroup like liblxc does for example),
> he may correlate global pids and container pids. This is still racy but
> definitely easier than what we have today.

Although getting the in-namespace PID is a useful thing, wouldn't a
truly race-free API be preferable? Any access by PID has the race
condition in which the target process could die, and its PID get
recycled between retrieving the PID and doing something with it.
Perhaps a file-descriptor API would be better, such as something like
this:

int openpid(int id, int flags);
int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
int sigqueue_fd(int process_fd, int sig, const union sigval value); // glibc wrapper

The opened process FD could be passed across a unix domain socket to a
process outside the namespace, which could then send signals without
knowing the in-namespace PID. This same API can be easily extended to
cover other syscalls which may require PIDs as well.

2011-06-20 11:45:40

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Thu, 2011-06-16 at 13:54 -0400, Bryan Donlan wrote:
> On Wed, Jun 15, 2011 at 10:55, Greg Kurz <[email protected]> wrote:
> > Since pid namespaces were introduced, there's a recurring demand: how one
> > can correlate a pid from a child pid ns with a pid from a parent pid ns ?
> > The need arises in the LXC community when one wants to send a signal from
> > the host (aka. init_pid_ns context) to a container process for which one
> > only knows the pid inside the container.
> >
> > In the future, this should be achievable thanks to Eric Biederman's setns()
> > syscall but there's still some work to be done to support pid namespaces:
> >
> > https://lkml.org/lkml/2011/5/21/162
> >
> > As stated by Serge Hallyn in:
> >
> > http://sourceforge.net/mailarchive/message.php?msg_id=27424447
> >
> > "There is nothing that gives you a 100% guaranteed correct race-free
> > correspondence right now. You can look under /proc/<pid>/root/proc/ to
> > see the pids valid in the container, and you can relate output of
> > lxc-ps --forest to ps --forest output. But nothing under /proc that I
> > know of tells you "this task is the same as that task". You can't
> > even look at /proc/<pid> inode numbers since they are different
> > filesystems for each proc mount."
> >
> > This patch adds a single line to /proc/self/status. Provided one has kept
> > track of its container tasks (with a cgroup like liblxc does for example),
> > he may correlate global pids and container pids. This is still racy but
> > definitely easier than what we have today.
>
> Although getting the in-namespace PID is a useful thing, wouldn't a
> truly race-free API be preferable? Any access by PID has the race
> condition in which the target process could die, and its PID get
> recycled between retrieving the PID and doing something with it.

Well the PID is a racy construct when used by another task than the
parent... fortunately, most userland code can cope with it ! :)

> Perhaps a file-descriptor API would be better, such as something like
> this:
>
> int openpid(int id, int flags);
> int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
> int sigqueue_fd(int process_fd, int sig, const union sigval value); //
> glibc wrapper
>

The race still exists: openpid() is being passed a PID... Only the
parent can legitimately know that this PID identifies a specific
unwaited child.

> The opened process FD could be passed across a unix domain socket to a
> process outside the namespace, which could then send signals without
> knowing the in-namespace PID. This same API can be easily extended to
> cover other syscalls which may require PIDs as well.

Indeed, the idea of not exposing a PID from another namespace sounds
nice.


2011-06-20 17:38:10

by Bryan Donlan

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Mon, Jun 20, 2011 at 07:45, Greg Kurz <[email protected]> wrote:
> On Thu, 2011-06-16 at 13:54 -0400, Bryan Donlan wrote:

>> Although getting the in-namespace PID is a useful thing, wouldn't a
>> truly race-free API be preferable? Any access by PID has the race
>> condition in which the target process could die, and its PID get
>> recycled between retrieving the PID and doing something with it.
>
> Well the PID is a racy construct when used by another task than the
> parent... fortunately, most userland code can cope with it ! :)

That doesn't mean we shouldn't try to fix the race! :)

>> Perhaps a file-descriptor API would be better, such as something like
>> this:
>>
>> int openpid(int id, int flags);
>> int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
>> int sigqueue_fd(int process_fd, int sig, const union sigval value); //
>> glibc wrapper
>>
>
> The race still exists: openpid() is being passed a PID... Only the
> parent can legitimately know that this PID identifies a specific
> unwaited child.

Yes, the idea would be either the parent process, or the target
process itself would open the PID, then pass the resulting file
descriptor to whatever process is actually doing the killing.
Alternately, one could add additional calls to help identify whether
the right process was opened (perhaps a call to get a directory handle
to the corresponding /proc directory?)

2011-06-20 22:45:45

by Eric W. Biederman

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Bryan Donlan <[email protected]> writes:

> On Mon, Jun 20, 2011 at 07:45, Greg Kurz <[email protected]> wrote:
>> On Thu, 2011-06-16 at 13:54 -0400, Bryan Donlan wrote:
>
>>> Although getting the in-namespace PID is a useful thing, wouldn't a
>>> truly race-free API be preferable? Any access by PID has the race
>>> condition in which the target process could die, and its PID get
>>> recycled between retrieving the PID and doing something with it.
>>
>> Well the PID is a racy construct when used by another task than the
>> parent... fortunately, most userland code can cope with it ! :)
>
> That doesn't mean we shouldn't try to fix the race! :)
>
>>> Perhaps a file-descriptor API would be better, such as something like
>>> this:
>>>
>>> int openpid(int id, int flags);
>>> int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
>>> int sigqueue_fd(int process_fd, int sig, const union sigval value); //
>>> glibc wrapper
>>>
>>
>> The race still exists: openpid() is being passed a PID... Only the
>> parent can legitimately know that this PID identifies a specific
>> unwaited child.
>
> Yes, the idea would be either the parent process, or the target
> process itself would open the PID, then pass the resulting file
> descriptor to whatever process is actually doing the killing.
> Alternately, one could add additional calls to help identify whether
> the right process was opened (perhaps a call to get a directory handle
> to the corresponding /proc directory?)

fd = open("/proc/self/", O_DIRECTORY);
?

Doing something based on proc files seems like a reasonable direction to
head if we are working on a race free api.

I suspect all we need is a sigqueue file.

Eric

2011-06-22 15:01:00

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Mon, 2011-06-20 at 13:37 -0400, Bryan Donlan wrote:
> On Mon, Jun 20, 2011 at 07:45, Greg Kurz <[email protected]> wrote:
> > On Thu, 2011-06-16 at 13:54 -0400, Bryan Donlan wrote:
>
> >> Although getting the in-namespace PID is a useful thing, wouldn't a
> >> truly race-free API be preferable? Any access by PID has the race
> >> condition in which the target process could die, and its PID get
> >> recycled between retrieving the PID and doing something with it.
> >
> > Well the PID is a racy construct when used by another task than the
> > parent... fortunately, most userland code can cope with it ! :)
>
> That doesn't mean we shouldn't try to fix the race! :)
>
> >> Perhaps a file-descriptor API would be better, such as something like
> >> this:
> >>
> >> int openpid(int id, int flags);
> >> int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
> >> int sigqueue_fd(int process_fd, int sig, const union sigval value); //
> >> glibc wrapper
> >>
> >
> > The race still exists: openpid() is being passed a PID... Only the
> > parent can legitimately know that this PID identifies a specific
> > unwaited child.
>
> Yes, the idea would be either the parent process, or the target
> process itself would open the PID, then pass the resulting file
> descriptor to whatever process is actually doing the killing.

Agreed. Such an API would be useful in a scenario where the task to be
killed and the killing task can share a file descriptor: same thread
group or inherited with clone() or connected with an AF_UNIX socket.
My point was just that the racy pid based API will still be needed to
handle all the other scenarios. But maybe it's fine to have two sets of
process handling calls.

> Alternately, one could add additional calls to help identify whether
> the right process was opened (perhaps a call to get a directory handle
> to the corresponding /proc directory?)


2011-06-22 15:29:39

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Mon, 2011-06-20 at 15:44 -0700, Eric W. Biederman wrote:
> Bryan Donlan <[email protected]> writes:
>
> > On Mon, Jun 20, 2011 at 07:45, Greg Kurz <[email protected]> wrote:
> >> On Thu, 2011-06-16 at 13:54 -0400, Bryan Donlan wrote:
> >
> >>> Although getting the in-namespace PID is a useful thing, wouldn't a
> >>> truly race-free API be preferable? Any access by PID has the race
> >>> condition in which the target process could die, and its PID get
> >>> recycled between retrieving the PID and doing something with it.
> >>
> >> Well the PID is a racy construct when used by another task than the
> >> parent... fortunately, most userland code can cope with it ! :)
> >
> > That doesn't mean we shouldn't try to fix the race! :)
> >
> >>> Perhaps a file-descriptor API would be better, such as something like
> >>> this:
> >>>
> >>> int openpid(int id, int flags);
> >>> int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
> >>> int sigqueue_fd(int process_fd, int sig, const union sigval value); //
> >>> glibc wrapper
> >>>
> >>
> >> The race still exists: openpid() is being passed a PID... Only the
> >> parent can legitimately know that this PID identifies a specific
> >> unwaited child.
> >
> > Yes, the idea would be either the parent process, or the target
> > process itself would open the PID, then pass the resulting file
> > descriptor to whatever process is actually doing the killing.
> > Alternately, one could add additional calls to help identify whether
> > the right process was opened (perhaps a call to get a directory handle
> > to the corresponding /proc directory?)
>
> fd = open("/proc/self/", O_DIRECTORY);
> ?
>
> Doing something based on proc files seems like a reasonable direction to
> head if we are working on a race free api.
>
> I suspect all we need is a sigqueue file.
>

Are you referring to Bryan's rt_sigqueueinfo_fd() syscall or to a
new /proc/self/sigqueue file ?


2011-06-22 16:57:46

by Bryan Donlan

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Wed, Jun 22, 2011 at 11:00, Greg Kurz <[email protected]> wrote:
> On Mon, 2011-06-20 at 13:37 -0400, Bryan Donlan wrote:
>> On Mon, Jun 20, 2011 at 07:45, Greg Kurz <[email protected]> wrote:
>> > On Thu, 2011-06-16 at 13:54 -0400, Bryan Donlan wrote:
>>
>> >> Although getting the in-namespace PID is a useful thing, wouldn't a
>> >> truly race-free API be preferable? Any access by PID has the race
>> >> condition in which the target process could die, and its PID get
>> >> recycled between retrieving the PID and doing something with it.
>> >
>> > Well the PID is a racy construct when used by another task than the
>> > parent... fortunately, most userland code can cope with it ! :)
>>
>> That doesn't mean we shouldn't try to fix the race! :)
>>
>> >> Perhaps a file-descriptor API would be better, such as something like
>> >> this:
>> >>
>> >> int openpid(int id, int flags);
>> >> int rt_sigqueueinfo_fd(int process_fd, int sig, siginfo_t *info);
>> >> int sigqueue_fd(int process_fd, int sig, const union sigval value); //
>> >> glibc wrapper
>> >>
>> >
>> > The race still exists: openpid() is being passed a PID... Only the
>> > parent can legitimately know that this PID identifies a specific
>> > unwaited child.
>>
>> Yes, the idea would be either the parent process, or the target
>> process itself would open the PID, then pass the resulting file
>> descriptor to whatever process is actually doing the killing.
>
> Agreed. Such an API would be useful in a scenario where the task to be
> killed and the killing task can share a file descriptor: same thread
> group or inherited with clone() or connected with an AF_UNIX socket.
> My point was just that the racy pid based API will still be needed to
> handle all the other scenarios. But maybe it's fine to have two sets of
> process handling calls.

Actually, with a /proc/self/sigqueue file as Eric suggested, it would
be possible to mitigate some races even if you're not passed an fd
from the target or its parent. Consider:

1) Read somedaemon.pid
2) Open the /proc/$pid directory
3) Use openat to open /proc/$pid/status; check that Name: matches the
daemon's name; if not go to 1
4) Use openat to open /proc/$pid/sigqueue; send a signal

It's not foolproof, in that you might hit a different process with the
same name+uid+gid+groups+cmdline+environment+etc, but any such
cross-process operation is racy to the extent that, at some level, you
need to have criteria for selecting your target, and such criteria
might allow for false positives.

2011-06-23 00:39:51

by Eric W. Biederman

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Greg Kurz <[email protected]> writes:

> On Mon, 2011-06-20 at 15:44 -0700, Eric W. Biederman wrote:
>> fd = open("/proc/self/", O_DIRECTORY);
>> ?
>>
>> Doing something based on proc files seems like a reasonable direction to
>> head if we are working on a race free api.
>>
>> I suspect all we need is a sigqueue file.
>>
>
> Are you referring to Bryan's rt_sigqueueinfo_fd() syscall or to a
> new /proc/self/sigqueue file ?

I was suggesting implement rt_sigqueueinfo_fd as a proc file instead.

Getting a file descriptor api one way or another for delivering signals
sounds nice in principle. I don't know if it is useful enough to
justify the cost of implementing and supporting it.

Eric

2011-06-23 13:43:40

by Greg Kurz

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

On Wed, 2011-06-22 at 17:39 -0700, Eric W. Biederman wrote:
> I was suggesting implement rt_sigqueueinfo_fd as a proc file instead.
>
> Getting a file descriptor api one way or another for delivering signals
> sounds nice in principle. I don't know if it is useful enough to
> justify the cost of implementing and supporting it.
>

That's my concern: it will take some time to work it out and push it. In
comparison, ActivePid: has no support overhead and is enough for
userland code to relate pids between parent and child namespaces.
Mitigating the race with pids is cool but it is another topic IMHO.

Thanks.


2011-06-23 14:38:10

by Serge Hallyn

Subject: Re: [PATCH] Introduce ActivePid: in /proc/self/status (v2, was Vpid:)

Quoting Greg Kurz ([email protected]):
> On Wed, 2011-06-22 at 17:39 -0700, Eric W. Biederman wrote:
> > I was suggesting implement rt_sigqueueinfo_fd as a proc file instead.
> >
> > Getting a file descriptor api one way or another for delivering signals
> > sounds nice in principle. I don't know if it is useful enough to
> > justify the cost of implementing and supporting it.
> >
>
> That's my concern: it will take some time to work it out and push it. In
> comparison, ActivePid: has no support overhead and is enough for
> userland code to relate pids betwen parent and child namespaces.
> Mitigating the race with pids is cool but it is another topic IMHO.

I think it's perfectly understandable for management / monitoring
software on the host to want to correlate pids on the host and in the
container. And it's not reasonable that there is no way to do that.
This patch makes sense.

I've carelessly deleted patch v3 from my inbox, but Greg please add
my

Acked-by: Serge Hallyn <[email protected]>

to v3 (https://lkml.org/lkml/2011/6/22/283)

thanks,
-serge