LinuxLists.cc - [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

Too late for me, but I don't understand this patch after a quick glance.
perhaps I missed something...

On 01/23, Tycho Andersen wrote:
>
> @@ -256,6 +256,17 @@ void release_task(struct task_struct *p)
> write_lock_irq(&tasklist_lock);
> ptrace_release_task(p);
> thread_pid = get_pid(p->thread_pid);
> +
> + /*
> + * If we're not the leader, notify any waiters on our pidfds. Note that
> + * we don't want to notify the leader until /everyone/ in the thread
> + * group is dead, viz. the condition below.
> + *
> + * We have to do this here, since __exit_signal() will
> + * __unhash_processes(), and break do_notify_pidfd()'s lookup.
> + */
> + if (!thread_group_leader(p))
> + do_notify_pidfd(p);

This doesn't look consistent.

If the task is a group leader do_notify_pidfd() is called by exit_notify()
when it becomes a zombie (if no other threads), before it is reaped by its
parent (unless autoreap).

If it is a sub-thread, it is called by release_task() above. Note that a
sub-thread can become a zombie too if it is traced.

> __exit_signal(p);

and, do_notify_pidfd() is called before __exit_signal() which does
__unhash_process() -> detach_pid(PIDTYPE_PID).

Doesn't this mean that pidfd_poll() can hang? thread_group_exited()
won't return true after do_notify_pidfd() above, not to mention that
thread_group_empty() is not possible if !thread_group_leader().

So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not)
becomes a zombie (passes exit_notify) or when it is reaped by release_task?

Either way pidfd_poll() needs more changes with this patch and it can't
use thread_group_exited(). If do_notify_pidfd() is called by release_task()
after __exit_signal(), it can just check pid_has_task(PIDTYPE_PID).

Oleg.

2024-01-23 21:17:11

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Tue, Jan 23, 2024 at 08:56:08PM +0100, Oleg Nesterov wrote:
> Too late for me, but I don't understand this patch after a quick glance.
> perhaps I missed something...

Thanks for taking a look.

> On 01/23, Tycho Andersen wrote:
> >
> > @@ -256,6 +256,17 @@ void release_task(struct task_struct *p)
> > write_lock_irq(&tasklist_lock);
> > ptrace_release_task(p);
> > thread_pid = get_pid(p->thread_pid);
> > +
> > + /*
> > + * If we're not the leader, notify any waiters on our pidfds. Note that
> > + * we don't want to notify the leader until /everyone/ in the thread
> > + * group is dead, viz. the condition below.
> > + *
> > + * We have to do this here, since __exit_signal() will
> > + * __unhash_processes(), and break do_notify_pidfd()'s lookup.
> > + */
> > + if (!thread_group_leader(p))
> > + do_notify_pidfd(p);
>
> This doesn't look consistent.
>
> If the task is a group leader do_notify_pidfd() is called by exit_notify()
> when it becomes a zombie (if no other threads), before it is reaped by its
> parent (unless autoreap).

There is another path, also in release_task(), that I was trying to
mirror since it deals explicitly with sub-threads but,

> If it is a sub-thread, it is called by release_task() above. Note that a
> sub-thread can become a zombie too if it is traced.

I didn't know about this.

> > __exit_signal(p);
>
> and, do_notify_pidfd() is called before __exit_signal() which does
> __unhash_process() -> detach_pid(PIDTYPE_PID).
>
> Doesn't this mean that pidfd_poll() can hang? thread_group_exited()
> won't return true after do_notify_pidfd() above, not to mention that
> thread_group_empty() is not possible if !thread_group_leader().

I was wondering about this too, but the test_non_tgl_poll_exit test in
the next patch tests exactly this and works as expected.

> So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not)
> becomes a zombie (passes exit_notify) or when it is reaped by release_task?

It seems like we'd want it when exit_notify() is called in principle,
since that's when the pid actually dies. When it is reaped is "mostly
unrelated". Something like,

1. in the "normal" exit_notify() paths via do_notify_parent()
2. if none of those cases are true (aka the final else in
exit_notify()) and the thread is not ptraced
3. via release_task() finally if this was the thread group leader and
it died before some sub-thread

then in pidfd_poll(), we can do:

if (!tsk || (tsk->exit_state >= 0) || thread_group_exited())
do_notify_pidfd();

?

> Either way pidfd_poll() needs more changes with this patch and it can't
> use thread_group_exited(). If do_notify_pidfd() is called by release_task()
> after __exit_signal(), it can just check pid_has_task(PIDTYPE_PID).

I suppose this is why my test works, since pid_task(PIDTYPE_PID) is null
after release_task(). But if we want it to happen earlier, we'll have
to do something like the above.

Tycho

2024-01-23 22:24:03

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

I am already sleeping. I'll try to reply to other parts of your email
tomorrow but I am not sure, I will be very busy with family duties.

On 01/23, Tycho Andersen wrote:
>
> > > __exit_signal(p);
> >
> > and, do_notify_pidfd() is called before __exit_signal() which does
> > __unhash_process() -> detach_pid(PIDTYPE_PID).
> >
> > Doesn't this mean that pidfd_poll() can hang? thread_group_exited()
> > won't return true after do_notify_pidfd() above, not to mention that
> > thread_group_empty() is not possible if !thread_group_leader().
>
> I was wondering about this too, but the test_non_tgl_poll_exit test in
> the next patch tests exactly this and works as expected.

Well, if release_task() completes __exit_signal() before the woken task
does thread_group_exited(), pid_task(PIDTYPE_PID) will return 0 and
pidfd_poll() won't hang.

But to be honest I can't understand test_non_tgl_poll_exit() at all. I don't
even understand why the process/thread created by fork_task_with_thread()
should ever exit. And why it creates the "writer" child... Never mind, too
late for me to read the code.

Oleg.

2024-01-24 01:27:04

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/23, Oleg Nesterov wrote:
>
> But to be honest I can't understand test_non_tgl_poll_exit() at all. I don't
> even understand why the process/thread created by fork_task_with_thread()
> should ever exit. And why it creates the "writer" child... Never mind, too
> late for me to read the code.

Ah, OK, it passes thread_wait_exit to fork_task_with_thread(), and this "fn"
reads sk_pair and does exit() which is actually exit_group().

Oleg.

2024-01-25 14:10:03

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

Add Eric.

On 01/23, Tycho Andersen wrote:
>
> On Tue, Jan 23, 2024 at 08:56:08PM +0100, Oleg Nesterov wrote:
>
> > So. When do we want to do do_notify_pidfd() ? Whe the task (leader or not)
> > becomes a zombie (passes exit_notify) or when it is reaped by release_task?
>
> It seems like we'd want it when exit_notify() is called in principle,
> since that's when the pid actually dies.

No the pid "dies" after this task is reaped, until then its nr is still
in use and pid_task(PIDTYPE_PID) returns the exiting/exited task.

> When it is reaped is "mostly unrelated".

Then why pidfd_poll() can't simply check !task || task->exit_state ?

Nevermind. So, currently pidfd_poll() succeeds when the leader can be
reaped, iow the whole thread group has exited. But even if you are the
parent, you can't expect that wait(WNOHANG) must succeed, the leader
can be traced. I guess it is too late to change this behaviour.

What if we add the new PIDFD_THREAD flag? With this flag

- sys_pidfd_open() doesn't require the must be a group leader

- pidfd_poll() succeeds when the task passes exit_notify() and
becomes a zombie, even if it is a leader and has other threads.

Please the the incomplete/untested patch below.

- The change in exit_notify() is sub-optimal, we can do better
to avoid 2 do_notify_pidfd() calls from exit_notify(). But
so far this is only for discussion, lets keep it simple.

- __pidfd_prepare() needs some minor cleanups regardless of
this change, I'll send the patch...

What do you think?

And why is thread_group_exited() exported?

Oleg.

diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 5406fbc13074..2e6461459877 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -7,6 +7,7 @@
#include <linux/fcntl.h>

/* Flags for pidfd_open(). */
-#define PIDFD_NONBLOCK O_NONBLOCK
+#define PIDFD_NONBLOCK O_NONBLOCK
+#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's

#endif /* _UAPI_LINUX_PIDFD_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index dfb963d2f862..9f8526b7d717 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -752,6 +752,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
autoreap = true;
}

+ /* unnecessary if do_notify_parent() was already called,
+ we can do better */
+ do_notify_pidfd(tsk);
+
if (autoreap) {
tsk->exit_state = EXIT_DEAD;
list_add(&tsk->ptrace_entry, &dead);
diff --git a/kernel/fork.c b/kernel/fork.c
index c981fa6171c1..38f2c7423fb4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -101,6 +101,7 @@
#include <linux/user_events.h>
#include <linux/iommu.h>
#include <linux/rseq.h>
+#include <uapi/linux/pidfd.h>

#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -2068,12 +2069,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif

+static bool xxx_exited(struct pid *pid, int excl)
+{
+ struct task_struct *task;
+ bool exited;
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ exited = !task ||
+ (READ_ONCE(task->exit_state) && (excl || thread_group_empty(task)));
+ rcu_read_unlock();
+
+ return exited;
+}
+
/*
* Poll support for process exit notification.
*/
static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
{
struct pid *pid = file->private_data;
+ int excl = file->f_flags & PIDFD_THREAD;
__poll_t poll_flags = 0;

poll_wait(file, &pid->wait_pidfd, pts);
@@ -2083,7 +2099,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
* If the thread group leader exits before all other threads in the
* group, then poll(2) should block, similar to the wait(2) family.
*/
- if (thread_group_exited(pid))
+ if (xxx_exited(pid, excl))
poll_flags = EPOLLIN | EPOLLRDNORM;

return poll_flags;
@@ -2129,7 +2145,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
{
int pidfd;
struct file *pidfd_file;
+ unsigned excl = flags & PIDFD_THREAD;

+ flags &= ~PIDFD_THREAD;
if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC))
return -EINVAL;

@@ -2144,6 +2162,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
return PTR_ERR(pidfd_file);
}
get_pid(pid); /* held by pidfd_file now */
+ pidfd_file->f_flags |= excl;
*ret = pidfd_file;
return pidfd;
}
@@ -2176,7 +2195,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
*/
int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
{
- if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+ unsigned excl = flags & PIDFD_THREAD;
+
+ if (!pid || !pid_has_task(pid, excl ? PIDTYPE_PID : PIDTYPE_TGID))
return -EINVAL;

return __pidfd_prepare(pid, flags, ret);
diff --git a/kernel/pid.c b/kernel/pid.c
index b52b10865454..5257197f9493 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
int fd;
struct pid *p;

- if (flags & ~PIDFD_NONBLOCK)
+ if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD))
return -EINVAL;

if (pid <= 0)

2024-01-25 17:19:24

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

> > When it is reaped is "mostly unrelated".
>
> Then why pidfd_poll() can't simply check !task || task->exit_state ?
>
> Nevermind. So, currently pidfd_poll() succeeds when the leader can be

Hm, the comment right above mentions:

/*
* Inform pollers only when the whole thread group exits.
* If the thread group leader exits before all other threads in the
* group, then poll(2) should block, similar to the wait(2) family.
*/

> reaped, iow the whole thread group has exited. But even if you are the
> parent, you can't expect that wait(WNOHANG) must succeed, the leader
> can be traced. I guess it is too late to change this behaviour.

Hm, why is that an issue though? And if it is an issue why shouldn't we
be able to change it? Because a program would rely on WNOHANG to hang on
a ptraced leader? That seems esoteric imho. I might just misunderstand.

>
> What if we add the new PIDFD_THREAD flag? With this flag
>
> - sys_pidfd_open() doesn't require the must be a group leader

Yes.

>
> - pidfd_poll() succeeds when the task passes exit_notify() and
> becomes a zombie, even if it is a leader and has other threads.

Iiuc, if an existing user creates a pidfd for a thread-group leader and
then polls that pidfd they would currently only get notified if the
thread-group is empty and the leader has exited.

If we now start notifying when the thread-group leader exits but the
thread-group isn't empty then this would be a fairly big api change and
I would expect this to cause regressions as that surely is something
that userspace relies on. Am I understand this right?

2024-01-25 17:52:44

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/25, Christian Brauner wrote:
>
> > > When it is reaped is "mostly unrelated".
> >
> > Then why pidfd_poll() can't simply check !task || task->exit_state ?
> >
> > Nevermind. So, currently pidfd_poll() succeeds when the leader can be
>
> Hm, the comment right above mentions:
>
> /*
> * Inform pollers only when the whole thread group exits.
> * If the thread group leader exits before all other threads in the
> * group, then poll(2) should block, similar to the wait(2) family.
> */
> > reaped, iow the whole thread group has exited.

Yes, but the comment doesn't contradict with what I have said?

> > But even if you are the
> > parent, you can't expect that wait(WNOHANG) must succeed, the leader
> > can be traced. I guess it is too late to change this behaviour.
>
> Hm, why is that an issue though?

Well, I didn't say this is a problem. I simply do not know how/why people
use pidfd_poll().

I mostly tried to explain why do I think that do_notify_pidfd() should
be always called from exit_notify() path, not by release_task(), even
if the task is not a leader.

> Because a program would rely on WNOHANG to hang on
> a ptraced leader? That seems esoteric imho.

To me it would be usefule, but lets not discuss this now. The "patch"
I sent doesn't change the current behaviour.

> > What if we add the new PIDFD_THREAD flag? With this flag
> >
> > - sys_pidfd_open() doesn't require the must be a group leader
>
> Yes.
>
> >
> > - pidfd_poll() succeeds when the task passes exit_notify() and
> > becomes a zombie, even if it is a leader and has other threads.
>
> Iiuc, if an existing user creates a pidfd for a thread-group leader and
> then polls that pidfd they would currently only get notified if the
> thread-group is empty and the leader has exited.
>
> If we now start notifying when the thread-group leader exits but the
> thread-group isn't empty then this would be a fairly big api change

Hmm... again, this patch doesn't (shouldn't) change the current behavior.

Please note "with this flag" above. If sys_pidfd_open() was called
without PIDFD_THREAD, then sys_pidfd_open() still requires that the
target task must be a group leader, and pidfd_poll() won't succeed
until the leader exits and thread_group_empty() is true.

Oleg.

2024-01-25 18:04:02

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Thu, Jan 25, 2024 at 06:51:14PM +0100, Oleg Nesterov wrote:
> > > What if we add the new PIDFD_THREAD flag? With this flag
> > >
> > > - sys_pidfd_open() doesn't require the must be a group leader
> >
> > Yes.
> >
> > >
> > > - pidfd_poll() succeeds when the task passes exit_notify() and
> > > becomes a zombie, even if it is a leader and has other threads.
> >
> > Iiuc, if an existing user creates a pidfd for a thread-group leader and
> > then polls that pidfd they would currently only get notified if the
> > thread-group is empty and the leader has exited.
> >
> > If we now start notifying when the thread-group leader exits but the
> > thread-group isn't empty then this would be a fairly big api change
>
> Hmm... again, this patch doesn't (shouldn't) change the current behavior.
>
> Please note "with this flag" above. If sys_pidfd_open() was called
> without PIDFD_THREAD, then sys_pidfd_open() still requires that the
> target task must be a group leader, and pidfd_poll() won't succeed
> until the leader exits and thread_group_empty() is true.

Thanks for sending your patch, I'll take a look at it (probably
tomorrow at this rate).

One of the things I don't like about PIDFD_THREAD is that it's hard to
tell whether an arbitrary thread is a leader or not. Right now we do
it by parsing /proc/pid/status, which shows all the stuff from
do_task_stat() that we don't care about but which is quite expensive
to compute. (Maybe there's a better way?)

With PIDFD_THREAD we could could do it twice, once with the flag, get
EINVAL, and then do it again. But ideally we wouldn't have to.

Still, if that's the only way that makes sense, that's fine.

Tycho

2024-01-25 18:31:50

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/25, Tycho Andersen wrote:
>
> One of the things I don't like about PIDFD_THREAD is that it's hard to
> tell whether an arbitrary thread is a leader or not. Right now we do
> it by parsing /proc/pid/status, which shows all the stuff from
> do_task_stat() that we don't care about but which is quite expensive
> to compute. (Maybe there's a better way?)
>
> With PIDFD_THREAD we could could do it twice, once with the flag, get
> EINVAL, and then do it again. But ideally we wouldn't have to.

Too late for me, most probably I misunderstood.

If you want the PIDFD_THREAD behaviour, you can always use this flag
without any check...

Could you spell?

Oleg.

2024-01-25 18:34:47

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/25, Oleg Nesterov wrote:
>
> On 01/25, Tycho Andersen wrote:
> >
> > One of the things I don't like about PIDFD_THREAD is that it's hard to
> > tell whether an arbitrary thread is a leader or not. Right now we do
> > it by parsing /proc/pid/status, which shows all the stuff from
> > do_task_stat() that we don't care about but which is quite expensive
> > to compute. (Maybe there's a better way?)
> >
> > With PIDFD_THREAD we could could do it twice, once with the flag, get
> > EINVAL, and then do it again. But ideally we wouldn't have to.
>
> Too late for me, most probably I misunderstood.
>
> If you want the PIDFD_THREAD behaviour, you can always use this flag
> without any check...
>
> Could you spell?

Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as
PIDFD_THREAD if the target task is not a leader or 0 (current behaviour)
otherwise. Trivial.

Oleg.

2024-01-25 18:38:11

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Thu, Jan 25, 2024 at 07:30:46PM +0100, Oleg Nesterov wrote:
> On 01/25, Oleg Nesterov wrote:
> >
> > On 01/25, Tycho Andersen wrote:
> > >
> > > One of the things I don't like about PIDFD_THREAD is that it's hard to
> > > tell whether an arbitrary thread is a leader or not. Right now we do
> > > it by parsing /proc/pid/status, which shows all the stuff from
> > > do_task_stat() that we don't care about but which is quite expensive
> > > to compute. (Maybe there's a better way?)
> > >
> > > With PIDFD_THREAD we could could do it twice, once with the flag, get
> > > EINVAL, and then do it again. But ideally we wouldn't have to.
> >
> > Too late for me, most probably I misunderstood.
> >
> > If you want the PIDFD_THREAD behaviour, you can always use this flag
> > without any check...

Sorry, I hadn't read the patch. If it's ok to use PIDFD_THREAD on a
leader, then we can just always specify it. (We don't care about the
behavior of pidfd_poll().)

> > Could you spell?
>
> Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as
> PIDFD_THREAD if the target task is not a leader or 0 (current behaviour)
> otherwise. Trivial.

Yep, or given the above, maybe it'll work as-is, thank you.

Tycho

2024-01-26 10:24:35

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

> Please the the incomplete/untested patch below.
>
> - The change in exit_notify() is sub-optimal, we can do better
> to avoid 2 do_notify_pidfd() calls from exit_notify(). But
> so far this is only for discussion, lets keep it simple.
>
> - __pidfd_prepare() needs some minor cleanups regardless of
> this change, I'll send the patch...
>
> What do you think?
>
> And why is thread_group_exited() exported?
>
> Oleg.
>
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 5406fbc13074..2e6461459877 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -7,6 +7,7 @@
> #include <linux/fcntl.h>
>
> /* Flags for pidfd_open(). */
> -#define PIDFD_NONBLOCK O_NONBLOCK
> +#define PIDFD_NONBLOCK O_NONBLOCK
> +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's

I like it!

The only request I would have is to not alias O_EXCL and PIDFD_THREAD.
Because it doesn't map as clearly as NONBLOCK did.

>
> #endif /* _UAPI_LINUX_PIDFD_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index dfb963d2f862..9f8526b7d717 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -752,6 +752,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> autoreap = true;
> }
>
> + /* unnecessary if do_notify_parent() was already called,
> + we can do better */
> + do_notify_pidfd(tsk);
> +
> if (autoreap) {
> tsk->exit_state = EXIT_DEAD;
> list_add(&tsk->ptrace_entry, &dead);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c981fa6171c1..38f2c7423fb4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -101,6 +101,7 @@
> #include <linux/user_events.h>
> #include <linux/iommu.h>
> #include <linux/rseq.h>
> +#include <uapi/linux/pidfd.h>
>
> #include <asm/pgalloc.h>
> #include <linux/uaccess.h>
> @@ -2068,12 +2069,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
> }
> #endif
>
> +static bool xxx_exited(struct pid *pid, int excl)
> +{
> + struct task_struct *task;
> + bool exited;
> +
> + rcu_read_lock();
> + task = pid_task(pid, PIDTYPE_PID);
> + exited = !task ||
> + (READ_ONCE(task->exit_state) && (excl || thread_group_empty(task)));
> + rcu_read_unlock();
> +
> + return exited;
> +}
> +
> /*
> * Poll support for process exit notification.
> */
> static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> {
> struct pid *pid = file->private_data;
> + int excl = file->f_flags & PIDFD_THREAD;
> __poll_t poll_flags = 0;
>
> poll_wait(file, &pid->wait_pidfd, pts);
> @@ -2083,7 +2099,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> * If the thread group leader exits before all other threads in the
> * group, then poll(2) should block, similar to the wait(2) family.
> */
> - if (thread_group_exited(pid))
> + if (xxx_exited(pid, excl))
> poll_flags = EPOLLIN | EPOLLRDNORM;
>
> return poll_flags;
> @@ -2129,7 +2145,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
> {
> int pidfd;
> struct file *pidfd_file;
> + unsigned excl = flags & PIDFD_THREAD;
>
> + flags &= ~PIDFD_THREAD;
> if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC))
> return -EINVAL;
>
> @@ -2144,6 +2162,7 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
> return PTR_ERR(pidfd_file);
> }
> get_pid(pid); /* held by pidfd_file now */
> + pidfd_file->f_flags |= excl;
> *ret = pidfd_file;
> return pidfd;
> }
> @@ -2176,7 +2195,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
> */
> int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> {
> - if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> + unsigned excl = flags & PIDFD_THREAD;
> +
> + if (!pid || !pid_has_task(pid, excl ? PIDTYPE_PID : PIDTYPE_TGID))
> return -EINVAL;
>
> return __pidfd_prepare(pid, flags, ret);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b52b10865454..5257197f9493 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> int fd;
> struct pid *p;
>
> - if (flags & ~PIDFD_NONBLOCK)
> + if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD))
> return -EINVAL;
>
> if (pid <= 0)
>

2024-01-26 10:24:43

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Thu, Jan 25, 2024 at 11:36:50AM -0700, Tycho Andersen wrote:
> On Thu, Jan 25, 2024 at 07:30:46PM +0100, Oleg Nesterov wrote:
> > On 01/25, Oleg Nesterov wrote:
> > >
> > > On 01/25, Tycho Andersen wrote:
> > > >
> > > > One of the things I don't like about PIDFD_THREAD is that it's hard to
> > > > tell whether an arbitrary thread is a leader or not. Right now we do
> > > > it by parsing /proc/pid/status, which shows all the stuff from
> > > > do_task_stat() that we don't care about but which is quite expensive
> > > > to compute. (Maybe there's a better way?)
> > > >
> > > > With PIDFD_THREAD we could could do it twice, once with the flag, get
> > > > EINVAL, and then do it again. But ideally we wouldn't have to.
> > >
> > > Too late for me, most probably I misunderstood.
> > >
> > > If you want the PIDFD_THREAD behaviour, you can always use this flag
> > > without any check...
>
> Sorry, I hadn't read the patch. If it's ok to use PIDFD_THREAD on a
> leader, then we can just always specify it. (We don't care about the
> behavior of pidfd_poll().)
>
> > > Could you spell?
> >
> > Just in case, we can even add PIDFD_AUTO (modulo naming) which acts as
> > PIDFD_THREAD if the target task is not a leader or 0 (current behaviour)
> > otherwise. Trivial.
>
> Yep, or given the above, maybe it'll work as-is, thank you.

Yes, let's rather do the explicit PIDFD_THREAD.

2024-01-26 10:28:04

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Thu, Jan 25, 2024 at 06:51:14PM +0100, Oleg Nesterov wrote:
> On 01/25, Christian Brauner wrote:
> >
> > > > When it is reaped is "mostly unrelated".
> > >
> > > Then why pidfd_poll() can't simply check !task || task->exit_state ?
> > >
> > > Nevermind. So, currently pidfd_poll() succeeds when the leader can be
> >
> > Hm, the comment right above mentions:
> >
> > /*
> > * Inform pollers only when the whole thread group exits.
> > * If the thread group leader exits before all other threads in the
> > * group, then poll(2) should block, similar to the wait(2) family.
> > */
> > > reaped, iow the whole thread group has exited.
>
> Yes, but the comment doesn't contradict with what I have said?

No, it doesn't. I'm trying to understand what you are suggesting though.
Are you saying !task || tas->exit_state is enough and we shouldn't use
the helper that was added in commit 38fd525a4c61 ("exit: Factor
thread_group_exited out of pidfd_poll"). If so what does that buy us
open-coding the check instead of using that helper? Is there an actual
bug here?

> > > But even if you are the
> > > parent, you can't expect that wait(WNOHANG) must succeed, the leader
> > > can be traced. I guess it is too late to change this behaviour.
> >
> > Hm, why is that an issue though?
>
> Well, I didn't say this is a problem. I simply do not know how/why people
> use pidfd_poll().

Sorry, I just have a hard time understanding what you wanted then. :)

"I guess it is too late to change this behavior." made it sound like a)
there's a problem and b) that you would prefer to change behavior. Thus,
it seems that wait(WNOHANG) hanging when a traced leader of an empty
thread-group has exited is a problem in your eyes.

>
> I mostly tried to explain why do I think that do_notify_pidfd() should
> be always called from exit_notify() path, not by release_task(), even
> if the task is not a leader.
>
> > Because a program would rely on WNOHANG to hang on
> > a ptraced leader? That seems esoteric imho.
>
> To me it would be usefule, but lets not discuss this now. The "patch"

Ok, that's good then. I would expect that at least stuff like rr makes
use of pidfd and they might rely on this behavior - although I haven't
checked their code.

> I sent doesn't change the current behaviour.

Yeah, I got that but it would still be useful to understand the wider
context you were adressing.

>
> > > What if we add the new PIDFD_THREAD flag? With this flag
> > >
> > > - sys_pidfd_open() doesn't require the must be a group leader
> >
> > Yes.
> >
> > >
> > > - pidfd_poll() succeeds when the task passes exit_notify() and
> > > becomes a zombie, even if it is a leader and has other threads.
> >
> > Iiuc, if an existing user creates a pidfd for a thread-group leader and
> > then polls that pidfd they would currently only get notified if the
> > thread-group is empty and the leader has exited.
> >
> > If we now start notifying when the thread-group leader exits but the
> > thread-group isn't empty then this would be a fairly big api change
>
> Hmm... again, this patch doesn't (shouldn't) change the current behavior.
>
> Please note "with this flag" above. If sys_pidfd_open() was called
> without PIDFD_THREAD, then sys_pidfd_open() still requires that the
> target task must be a group leader, and pidfd_poll() won't succeed
> until the leader exits and thread_group_empty() is true.

Yeah, I missed the PIDFD_THREAD flag suggestion. Sorry about that. Btw,
I'm not sure whether you remember that but when we originally did the
pidfd work you and I discussed thread support and already decided back
then that having a flag like PIDFD_THREAD would likely be the way to go.

The PIDFD_THREAD flag would be would be interesting because we could
make pidfd_send_signal() support this flag as well to allow sending a
signal to a specific thread. That's something that I had also wanted to
support. And I've been asked for this a few times already. What do you
think?

2024-01-26 14:35:11

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/26, Christian Brauner wrote:
>
> No, it doesn't. I'm trying to understand what you are suggesting though.
> Are you saying !task || tas->exit_state is enough

If PIDFD_THREAD then I think it is enough. Otherwise we still need
!task || (exit_state && thread_group_empty)

> and we shouldn't use
> the helper that was added in commit 38fd525a4c61 ("exit: Factor
> thread_group_exited out of pidfd_poll"). If so what does that buy us
> open-coding the check instead of using that helper? Is there an actual
> bug here?

The patch adds the new xxx_exited(task, excl) helper which checks

!task || (exit_state && (excl || thread_group_empty))

yes, the naming is not good.

> > Well, I didn't say this is a problem. I simply do not know how/why people
> > use pidfd_poll().
>
> Sorry, I just have a hard time understanding what you wanted then. :)
>
> "I guess it is too late to change this behavior." made it sound like a)
> there's a problem and b) that you would prefer to change behavior. Thus,
> it seems that wait(WNOHANG) hanging when a traced leader of an empty
> thread-group has exited is a problem in your eyes.

Again, I mostly tried to argue with do_notify_pidfd() called by realese_task().

I think that with PIDFD_THREAD set pidfd_poll() should succeed right
after the exiting thread becomes a zombie (passes exit_notify), whether
it is a leader or not.

Let me quote part of my reply to Tycho's patch

> + /*
> + * If we're not the leader, notify any waiters on our pidfds. Note that
> + * we don't want to notify the leader until /everyone/ in the thread
> + * group is dead, viz. the condition below.
> + *
> + * We have to do this here, since __exit_signal() will
> + * __unhash_processes(), and break do_notify_pidfd()'s lookup.
> + */
> + if (!thread_group_leader(p))
> + do_notify_pidfd(p);

This doesn't look consistent.

If the task is a group leader do_notify_pidfd() is called by exit_notify()
when it becomes a zombie (if no other threads), before it is reaped by its
parent (unless autoreap).

If it is a sub-thread, it is called by release_task() above. Note that a
sub-thread can become a zombie too if it is traced.

Not to mention that this is racy.

I would not mind if we simply move do_notify_pidfd() from exit_notify() to
release_task() and do it regardless of thread_group_leader(). And in some
sense this looks more logical to me.

But as I said:

- I do not know how/why people actually use poll(pidfd)

- it is too late to change the current behaviour

Sorry for confusion.

> I'm not sure whether you remember that but when we originally did the
> pidfd work you and I discussed thread support and already decided back
> then that having a flag like PIDFD_THREAD would likely be the way to go.

All I can recall is that, yes, we had some discussions about pidfd in
the past ;)

> The PIDFD_THREAD flag would be would be interesting because we could
> make pidfd_send_signal() support this flag

Agreed,

Oleg.

2024-01-26 14:51:39

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/26, Christian Brauner wrote:
>
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -7,6 +7,7 @@
> > #include <linux/fcntl.h>
> >
> > /* Flags for pidfd_open(). */
> > -#define PIDFD_NONBLOCK O_NONBLOCK
> > +#define PIDFD_NONBLOCK O_NONBLOCK
> > +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's
>
> I like it!
>
> The only request I would have is to not alias O_EXCL and PIDFD_THREAD.
> Because it doesn't map as clearly as NONBLOCK did.

But it would be nice to have PIDFD_THREAD in file->f_flags. Where else
can we keep it?

I chose O_EXCL because it can only be used at open time, it can never
be used or changed after anon_inode_getfile(), so we can safely do

pidfd_file->f_flags |= PIDFD_THREAD;

in __pidfd_prepare() and then check in pidfd_poll/pidfd_send_signal.

What do you suggest instead?

Oleg.

2024-01-26 21:51:39

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

Hi Oleg,

On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> What do you think?

Thank you, it passes all my tests.

> + /* unnecessary if do_notify_parent() was already called,
> + we can do better */
> + do_notify_pidfd(tsk);

"do better" here could be something like,

diff --git a/kernel/exit.c b/kernel/exit.c
index efe8f1d3a6af..7e545393f2f5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -742,6 +742,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
bool autoreap;
struct task_struct *p, *n;
LIST_HEAD(dead);
+ bool needs_notify = true;

write_lock_irq(&tasklist_lock);
forget_original_parent(tsk, &dead);
@@ -756,16 +757,21 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
!ptrace_reparented(tsk) ?
tsk->exit_signal : SIGCHLD;
autoreap = do_notify_parent(tsk, sig);
+ needs_notify = false;
} else if (thread_group_leader(tsk)) {
- autoreap = thread_group_empty(tsk) &&
- do_notify_parent(tsk, tsk->exit_signal);
+ autoreap = false;
+ if (thread_group_empty(tsk)) {
+ autoreap = do_notify_parent(tsk, tsk->exit_signal);
+ needs_notify = false;
+ }
} else {
autoreap = true;
}

/* unnecessary if do_notify_parent() was already called,
we can do better */
- do_notify_pidfd(tsk);
+ if (needs_notify)
+ do_notify_pidfd(tsk);

if (autoreap) {
tsk->exit_state = EXIT_DEAD;

but even with that, there's other calls in the tree to
do_notify_parent() that might double notify.

This brings up another interesting behavior that I noticed while
testing this, if you do a poll() on pidfd, followed quickly by a
pidfd_getfd() on the same thread you just got an event on, you can
sometimes get an EBADF from __pidfd_fget() instead of the more
expected ESRCH higher up the stack.

I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
Then we can refuse further pidfd syscall operations in a sane way, and
also "do better" above by checking this flag from do_pidfd_notify()
before doing it again?

Tycho

2024-01-27 10:56:06

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

Hi Tycho,

On 01/26, Tycho Andersen wrote:
>
> On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> > What do you think?
>
> Thank you, it passes all my tests.

Great, thanks!

OK, I'll make v2 on top of the recent
"pidfd: cleanup the usage of __pidfd_prepare's flags"

but we need to finish our discussion with Christian about the
usage of O_EXCL.

As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but
I think this needs another discussion too, lets do this later.

> > + /* unnecessary if do_notify_parent() was already called,
> > + we can do better */
> > + do_notify_pidfd(tsk);
>
> "do better" here could be something like,
>
> [...snip...]

No, no, please see below.

For the moment, please forget about PIDFD_THREAD, lets discuss
the current behaviour.

> but even with that, there's other calls in the tree to
> do_notify_parent() that might double notify.

Yes, and we can't avoid this. Well, perhaps do_notify_parent()
can do something like

if (ptrace_reparented())
do_notify_pidfd();

so that only the "final" do_notify_parent() does do_notify_pidfd()
but this needs another discussion and in fact I don't think this
would be right or make much sense. Lets forget this for now.

Now. Even without PIDFD_THREAD, I think it makes sense to change
do_notify_parent() to do

if (thread_group_empty(tsk))
do_notify_pidfd(tsk);

thread_group_empty(tsk) can only be true if tsk is a group leader
and it is the last thread. And this is exactly what pidfd_poll()
currently needs.

In fact I'd even prefer to do this in a separate patch for the
documentation purposes.

Now, PIDFD_THREAD can just add

if (!thread_group_empty(tsk))
do_notify_pidfd(tsk);

right after "tsk->exit_state = EXIT_ZOMBIE", that is all.

This also preserves the do_notify_pidfd/__wake_up_parent ordering.
Not that I think this is important, just for consistency.

> This brings up another interesting behavior that I noticed while
> testing this, if you do a poll() on pidfd, followed quickly by a
> pidfd_getfd() on the same thread you just got an event on, you can
> sometimes get an EBADF from __pidfd_fget() instead of the more
> expected ESRCH higher up the stack.

exit_notify() is called after exit_files(). pidfd_getfd() returns
ESRCH if the exiting thread completes release_task(), otherwise it
returns EBADF because ->files == NULL. This too doesn't really
depend on PIDFD_THREAD.

> I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
> Then we can refuse further pidfd syscall operations in a sane way, and

But how? We only have "struct pid *", how can we find all files
"attached" to this pid?

> also "do better" above by checking this flag from do_pidfd_notify()
> before doing it again?

and even it was possible, I don't think it makes a lot of sense, see
also above.

but perhaps I understood you...

Oleg.

2024-01-27 14:26:50

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Fri, Jan 26, 2024 at 03:33:50PM +0100, Oleg Nesterov wrote:
> On 01/26, Christian Brauner wrote:
> >
> > > --- a/include/uapi/linux/pidfd.h
> > > +++ b/include/uapi/linux/pidfd.h
> > > @@ -7,6 +7,7 @@
> > > #include <linux/fcntl.h>
> > >
> > > /* Flags for pidfd_open(). */
> > > -#define PIDFD_NONBLOCK O_NONBLOCK
> > > +#define PIDFD_NONBLOCK O_NONBLOCK
> > > +#define PIDFD_THREAD O_EXCL // or anything else not used by anon_inode's
> >
> > I like it!
> >
> > The only request I would have is to not alias O_EXCL and PIDFD_THREAD.
> > Because it doesn't map as clearly as NONBLOCK did.
>
> But it would be nice to have PIDFD_THREAD in file->f_flags. Where else
> can we keep it?

No, I did just mean that the uapi value doesn't necessarily have to
reflect what we do internally. IOW, we can still raise O_EXCL internally
in ->f_flags but there's no need to expose it as O_EXCL to userspace. We
often have internal and external flag spaces. If you prefer it your way
I'm not going argue.

> I chose O_EXCL because it can only be used at open time, it can never
> be used or changed after anon_inode_getfile(), so we can safely do
>
> pidfd_file->f_flags |= PIDFD_THREAD;
>
> in __pidfd_prepare() and then check in pidfd_poll/pidfd_send_signal.
>
> What do you suggest instead?

(Long-term and unrelated to this here, I think we will need to consider
not just stashing struct pid in pidfd_file->private_data but instead
struct pid with additional data because there's various functionality
that users would like that requires additional state to be stored and we
can't or don't want to do that in struct pid directly.)

2024-01-27 14:33:17

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Sat, Jan 27, 2024 at 11:54:32AM +0100, Oleg Nesterov wrote:
> Hi Tycho,
>
> On 01/26, Tycho Andersen wrote:
> >
> > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> > > What do you think?
> >
> > Thank you, it passes all my tests.
>
> Great, thanks!
>
> OK, I'll make v2 on top of the recent
> "pidfd: cleanup the usage of __pidfd_prepare's flags"
>
> but we need to finish our discussion with Christian about the
> usage of O_EXCL.

Just write the patch exactly like you want. If it's as a little a detail
as the uapi flag value we can just always change that when applying.
There's no reason to introduce artificial delays because of my
preference here..

>
> As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but
> I think this needs another discussion too, lets do this later.
>
> > > + /* unnecessary if do_notify_parent() was already called,
> > > + we can do better */
> > > + do_notify_pidfd(tsk);
> >
> > "do better" here could be something like,
> >
> > [...snip...]
>
> No, no, please see below.
>
> For the moment, please forget about PIDFD_THREAD, lets discuss
> the current behaviour.
>
> > but even with that, there's other calls in the tree to
> > do_notify_parent() that might double notify.
>
> Yes, and we can't avoid this. Well, perhaps do_notify_parent()
> can do something like
>
> if (ptrace_reparented())
> do_notify_pidfd();
>
> so that only the "final" do_notify_parent() does do_notify_pidfd()
> but this needs another discussion and in fact I don't think this
> would be right or make much sense. Lets forget this for now.
>
> Now. Even without PIDFD_THREAD, I think it makes sense to change
> do_notify_parent() to do
>
> if (thread_group_empty(tsk))
> do_notify_pidfd(tsk);
>
> thread_group_empty(tsk) can only be true if tsk is a group leader
> and it is the last thread. And this is exactly what pidfd_poll()
> currently needs.
>
> In fact I'd even prefer to do this in a separate patch for the
> documentation purposes.
>
> Now, PIDFD_THREAD can just add
>
> if (!thread_group_empty(tsk))
> do_notify_pidfd(tsk);
>
> right after "tsk->exit_state = EXIT_ZOMBIE", that is all.

Sounds good.

>
> This also preserves the do_notify_pidfd/__wake_up_parent ordering.
> Not that I think this is important, just for consistency.
>
> > This brings up another interesting behavior that I noticed while
> > testing this, if you do a poll() on pidfd, followed quickly by a
> > pidfd_getfd() on the same thread you just got an event on, you can
> > sometimes get an EBADF from __pidfd_fget() instead of the more
> > expected ESRCH higher up the stack.
>
> exit_notify() is called after exit_files(). pidfd_getfd() returns
> ESRCH if the exiting thread completes release_task(), otherwise it
> returns EBADF because ->files == NULL. This too doesn't really
> depend on PIDFD_THREAD.
>
> > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
> > Then we can refuse further pidfd syscall operations in a sane way, and
>
> But how? We only have "struct pid *", how can we find all files
> "attached" to this pid?

We can't. There's some use-cases that would make this desirable but that
would mean we would need another data structure to track this. I once
toyed with a patch for this but never got so excited about it to
actually finish it.

2024-01-27 15:56:03

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Sat, Jan 27, 2024 at 11:54:32AM +0100, Oleg Nesterov wrote:
> Hi Tycho,
>
> On 01/26, Tycho Andersen wrote:
> >
> > On Thu, Jan 25, 2024 at 03:08:31PM +0100, Oleg Nesterov wrote:
> > > What do you think?
> >
> > Thank you, it passes all my tests.
>
> Great, thanks!
>
> OK, I'll make v2 on top of the recent
> "pidfd: cleanup the usage of __pidfd_prepare's flags"
>
> but we need to finish our discussion with Christian about the
> usage of O_EXCL.
>
> As for clone(CLONE_PIDFD | CLONE_THREAD), this is trivial but
> I think this needs another discussion too, lets do this later.
>
> > > + /* unnecessary if do_notify_parent() was already called,
> > > + we can do better */
> > > + do_notify_pidfd(tsk);
> >
> > "do better" here could be something like,
> >
> > [...snip...]
>
> No, no, please see below.
>
> For the moment, please forget about PIDFD_THREAD, lets discuss
> the current behaviour.
>
> > but even with that, there's other calls in the tree to
> > do_notify_parent() that might double notify.
>
> Yes, and we can't avoid this. Well, perhaps do_notify_parent()
> can do something like
>
> if (ptrace_reparented())
> do_notify_pidfd();
>
> so that only the "final" do_notify_parent() does do_notify_pidfd()
> but this needs another discussion and in fact I don't think this
> would be right or make much sense. Lets forget this for now.

It seems like (and the current pidfd_test enforces for some cases) we
want exactly one notification for a task dying. I don't understand
how we guarantee this now, with all of these calls.

> > This brings up another interesting behavior that I noticed while
> > testing this, if you do a poll() on pidfd, followed quickly by a
> > pidfd_getfd() on the same thread you just got an event on, you can
> > sometimes get an EBADF from __pidfd_fget() instead of the more
> > expected ESRCH higher up the stack.
>
> exit_notify() is called after exit_files(). pidfd_getfd() returns
> ESRCH if the exiting thread completes release_task(), otherwise it
> returns EBADF because ->files == NULL. This too doesn't really
> depend on PIDFD_THREAD.

Yup, understood. It just seems like an inconsistency we might want to
fix.

> > I wonder if it makes sense to abuse ->f_flags to add a PIDFD_NOTIFIED?
> > Then we can refuse further pidfd syscall operations in a sane way, and
>
> But how? We only have "struct pid *", how can we find all files
> "attached" to this pid?

Yeah, we'd need some other linkage as Christian points out. But if
there is a predicate we can write that says whether this task has been
notified or not, it's not necessary. I just don't understand what that
is. But maybe your patch will make it clearer.

Tycho

2024-01-27 16:35:22

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/27, Tycho Andersen wrote:
>
> It seems like (and the current pidfd_test enforces for some cases)

Which pidfd_test ?

> we
> want exactly one notification for a task dying.

This can't be right. EVERY user of poll_wait() or wait_event/etc
must handle/tolerate the false wakeups.

> I don't understand
> how we guarantee this now, with all of these calls.

I don't understand why do we need or even want to guarantee this.

The extra wakeup must be always fine correctness-wise. Sure, it
would be nice to avoid the unnecessary wakeups, and perhaps we
can change wake_up_all() to pass a key to, say, only wake_up the
PIDFD_THREAD waiters from exit_notify(). But certainly this is
outside the scope of PIDFD_THREAD change we discuss.

The changes in do_notify_parent() (I have already sent the patch) and
in exit_notify() (proposed in my previous email) just ensure that,
with the minimal changes, we avoid 2 do_notify_pidfd's from the same
exit_notify() path.

> > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > ESRCH if the exiting thread completes release_task(), otherwise it
> > returns EBADF because ->files == NULL. This too doesn't really
> > depend on PIDFD_THREAD.
>
> Yup, understood. It just seems like an inconsistency we might want to
> fix.

Not sure this worth "fixing"...

Oleg.

2024-01-27 17:21:09

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Sat, Jan 27, 2024 at 05:31:39PM +0100, Oleg Nesterov wrote:
> On 01/27, Tycho Andersen wrote:
> >
> > It seems like (and the current pidfd_test enforces for some cases)
>
> Which pidfd_test ?

I was thinking of poll_pidfd() in pidfd_test.c, but,

> > we
> > want exactly one notification for a task dying.
>
> This can't be right. EVERY user of poll_wait() or wait_event/etc
> must handle/tolerate the false wakeups.

you're right, it doesn't enforce exactly once.

> > I don't understand
> > how we guarantee this now, with all of these calls.
>
> I don't understand why do we need or even want to guarantee this.
>
> The extra wakeup must be always fine correctness-wise. Sure, it
> would be nice to avoid the unnecessary wakeups, and perhaps we
> can change wake_up_all() to pass a key to, say, only wake_up the
> PIDFD_THREAD waiters from exit_notify(). But certainly this is
> outside the scope of PIDFD_THREAD change we discuss.
>
> The changes in do_notify_parent() (I have already sent the patch) and
> in exit_notify() (proposed in my previous email) just ensure that,
> with the minimal changes, we avoid 2 do_notify_pidfd's from the same
> exit_notify() path.

Sounds good.

> > > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > > ESRCH if the exiting thread completes release_task(), otherwise it
> > > returns EBADF because ->files == NULL. This too doesn't really
> > > depend on PIDFD_THREAD.
> >
> > Yup, understood. It just seems like an inconsistency we might want to
> > fix.
>
> Not sure this worth "fixing"...

Yep, maybe not. Just wanted to point it out.

Tycho

2024-01-27 19:33:11

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/27, Tycho Andersen wrote:
>
> > > > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > > > ESRCH if the exiting thread completes release_task(), otherwise it
> > > > returns EBADF because ->files == NULL. This too doesn't really
> > > > depend on PIDFD_THREAD.
> > >
> > > Yup, understood. It just seems like an inconsistency we might want to
> > > fix.
> >
> > Not sure this worth "fixing"...
>
> Yep, maybe not. Just wanted to point it out.

On the second thought I am starting to understand your concern...

Indeed, in this case -EBADF is technically correct but it can confuse
the user which doesn't or can't know that this task/thread is exiting,
because EBADF looks as if the "int fd" argument was wrong.

Sorry I missed your point before.

Oleg.

2024-01-27 20:45:09

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On Sat, Jan 27, 2024 at 08:31:27PM +0100, Oleg Nesterov wrote:
> On 01/27, Tycho Andersen wrote:
> >
> > > > > exit_notify() is called after exit_files(). pidfd_getfd() returns
> > > > > ESRCH if the exiting thread completes release_task(), otherwise it
> > > > > returns EBADF because ->files == NULL. This too doesn't really
> > > > > depend on PIDFD_THREAD.
> > > >
> > > > Yup, understood. It just seems like an inconsistency we might want to
> > > > fix.
> > >
> > > Not sure this worth "fixing"...
> >
> > Yep, maybe not. Just wanted to point it out.
>
> On the second thought I am starting to understand your concern...
>
> Indeed, in this case -EBADF is technically correct but it can confuse
> the user which doesn't or can't know that this task/thread is exiting,
> because EBADF looks as if the "int fd" argument was wrong.
>
> Sorry I missed your point before.

No worries. I realized it's not so hard to fix with your new
xxx_exited() helper from the PIDFD_THREAD patch, so maybe worth
cleaning up after all?

Tycho

2024-01-27 21:11:42

[permalink] [raw]

Subject: Re: [PATCH v3 1/3] pidfd: allow pidfd_open() on non-thread-group leaders

On 01/27, Tycho Andersen wrote:
>
> On Sat, Jan 27, 2024 at 08:31:27PM +0100, Oleg Nesterov wrote:
> >
> > On the second thought I am starting to understand your concern...
> >
> > Indeed, in this case -EBADF is technically correct but it can confuse
> > the user which doesn't or can't know that this task/thread is exiting,
> > because EBADF looks as if the "int fd" argument was wrong.
> >
> > Sorry I missed your point before.
>
> No worries. I realized it's not so hard to fix with your new
> xxx_exited() helper from the PIDFD_THREAD patch, so maybe worth
> cleaning up after all?

OK, lets discuss this later.

I'll (hopefully) send v2 on top of

pidfd: cleanup the usage of __pidfd_prepare's flags
pidfd: don't do_notify_pidfd() if !thread_group_empty()

on Monday, will be busy tomorrow (family duties ;)

Oleg.

2024-01-29 11:25:01

[permalink] [raw]

Subject: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On 01/27, Oleg Nesterov wrote:
>
> I'll (hopefully) send v2 on top of
>
> pidfd: cleanup the usage of __pidfd_prepare's flags
> pidfd: don't do_notify_pidfd() if !thread_group_empty()
>
> on Monday

Sorry, I don't have time to finish v2 today, I need to update the comments
and write the changelog.

But the patch itself is ready, I am sending it for review.

Tycho, Christian, any comments?

Oleg.

From c31780f6c1136a72048d24701ac6d8401fc1afda Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <[email protected]>
Date: Sat, 27 Jan 2024 16:59:18 +0100
Subject: [PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

---
include/uapi/linux/pidfd.h | 3 ++-
kernel/exit.c | 7 +++++++
kernel/fork.c | 29 +++++++++++++++++++++++++++--
kernel/pid.c | 2 +-
kernel/signal.c | 4 +++-
5 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 5406fbc13074..2e6461459877 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -7,6 +7,7 @@
#include <linux/fcntl.h>

/* Flags for pidfd_open(). */
-#define PIDFD_NONBLOCK O_NONBLOCK
+#define PIDFD_NONBLOCK O_NONBLOCK
+#define PIDFD_THREAD O_EXCL

#endif /* _UAPI_LINUX_PIDFD_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index dfb963d2f862..74fe6bfb9577 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -739,6 +739,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);

tsk->exit_state = EXIT_ZOMBIE;
+ /*
+ * sub-thread or delay_group_leader(), wake up the PIDFD_THREAD
+ * waiters.
+ */
+ if (!thread_group_empty(tsk))
+ do_notify_pidfd(tsk);
+
if (unlikely(tsk->ptrace)) {
int sig = thread_group_leader(tsk) &&
thread_group_empty(tsk) &&
diff --git a/kernel/fork.c b/kernel/fork.c
index 347641398f9d..977b58c0eac6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -101,6 +101,7 @@
#include <linux/user_events.h>
#include <linux/iommu.h>
#include <linux/rseq.h>
+#include <uapi/linux/pidfd.h>

#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -2050,6 +2051,8 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)

seq_put_decimal_ll(m, "Pid:\t", nr);

+ /* TODO: report PIDFD_THREAD */
+
#ifdef CONFIG_PID_NS
seq_put_decimal_ll(m, "\nNSpid:\t", nr);
if (nr > 0) {
@@ -2068,12 +2071,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif

+static bool pidfd_task_exited(struct pid *pid, bool thread)
+{
+ struct task_struct *task;
+ bool exited;
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ exited = !task ||
+ (READ_ONCE(task->exit_state) && (thread || thread_group_empty(task)));
+ rcu_read_unlock();
+
+ return exited;
+}
+
/*
* Poll support for process exit notification.
*/
static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
{
struct pid *pid = file->private_data;
+ bool thread = file->f_flags & PIDFD_THREAD;
__poll_t poll_flags = 0;

poll_wait(file, &pid->wait_pidfd, pts);
@@ -2083,7 +2101,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
* If the thread group leader exits before all other threads in the
* group, then poll(2) should block, similar to the wait(2) family.
*/
- if (thread_group_exited(pid))
+ if (pidfd_task_exited(pid, thread))
poll_flags = EPOLLIN | EPOLLRDNORM;

return poll_flags;
@@ -2141,6 +2159,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
return PTR_ERR(pidfd_file);
}
get_pid(pid); /* held by pidfd_file now */
+ /*
+ * anon_inode_getfile() ignores everything outside of the
+ * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
+ */
+ pidfd_file->f_flags |= (flags & PIDFD_THREAD);
*ret = pidfd_file;
return pidfd;
}
@@ -2173,7 +2196,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
*/
int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
{
- if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+ bool thread = flags & PIDFD_THREAD;
+
+ if (!pid || !pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID));
return -EINVAL;

return __pidfd_prepare(pid, flags, ret);
diff --git a/kernel/pid.c b/kernel/pid.c
index c7a3e359f8f5..04bdd5ecf183 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
int fd;
struct pid *p;

- if (flags & ~PIDFD_NONBLOCK)
+ if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD))
return -EINVAL;

if (pid <= 0)
diff --git a/kernel/signal.c b/kernel/signal.c
index 9561a3962ca6..919cd33a0405 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2051,7 +2051,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
WARN_ON_ONCE(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
/*
- * tsk is a group leader and has no threads, wake up the pidfd waiters.
+ * tsk is a group leader and has no threads, wake up the !PIDFD_THREAD
+ * waiters.
*/
if (thread_group_empty(tsk))
do_notify_pidfd(tsk);
@@ -3926,6 +3927,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
prepare_kill_siginfo(sig, &kinfo);
}

+ /* TODO: respect PIDFD_THREAD */
ret = kill_pid_info(sig, &kinfo, pid);

err:
--
2.25.1.362.g51ebf55

2024-01-29 13:41:22

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Mon, Jan 29, 2024 at 12:23:15PM +0100, Oleg Nesterov wrote:
> On 01/27, Oleg Nesterov wrote:
> >
> > I'll (hopefully) send v2 on top of
> >
> > pidfd: cleanup the usage of __pidfd_prepare's flags
> > pidfd: don't do_notify_pidfd() if !thread_group_empty()
> >
> > on Monday
>
> Sorry, I don't have time to finish v2 today, I need to update the comments
> and write the changelog.
>
> But the patch itself is ready, I am sending it for review.
>
> Tycho, Christian, any comments?
>
> Oleg.
>
>
> From c31780f6c1136a72048d24701ac6d8401fc1afda Mon Sep 17 00:00:00 2001
> From: Oleg Nesterov <[email protected]>
> Date: Sat, 27 Jan 2024 16:59:18 +0100
> Subject: [PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()
>
> ---
> include/uapi/linux/pidfd.h | 3 ++-
> kernel/exit.c | 7 +++++++
> kernel/fork.c | 29 +++++++++++++++++++++++++++--
> kernel/pid.c | 2 +-
> kernel/signal.c | 4 +++-
> 5 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> index 5406fbc13074..2e6461459877 100644
> --- a/include/uapi/linux/pidfd.h
> +++ b/include/uapi/linux/pidfd.h
> @@ -7,6 +7,7 @@
> #include <linux/fcntl.h>
>
> /* Flags for pidfd_open(). */
> -#define PIDFD_NONBLOCK O_NONBLOCK
> +#define PIDFD_NONBLOCK O_NONBLOCK
> +#define PIDFD_THREAD O_EXCL
>
> #endif /* _UAPI_LINUX_PIDFD_H */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index dfb963d2f862..74fe6bfb9577 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -739,6 +739,13 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> kill_orphaned_pgrp(tsk->group_leader, NULL);
>
> tsk->exit_state = EXIT_ZOMBIE;
> + /*
> + * sub-thread or delay_group_leader(), wake up the PIDFD_THREAD
> + * waiters.
> + */
> + if (!thread_group_empty(tsk))
> + do_notify_pidfd(tsk);
> +
> if (unlikely(tsk->ptrace)) {
> int sig = thread_group_leader(tsk) &&
> thread_group_empty(tsk) &&
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 347641398f9d..977b58c0eac6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -101,6 +101,7 @@
> #include <linux/user_events.h>
> #include <linux/iommu.h>
> #include <linux/rseq.h>
> +#include <uapi/linux/pidfd.h>
>
> #include <asm/pgalloc.h>
> #include <linux/uaccess.h>
> @@ -2050,6 +2051,8 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
>
> seq_put_decimal_ll(m, "Pid:\t", nr);
>
> + /* TODO: report PIDFD_THREAD */

Ah yes, very good point. We should give userspace an indicator whether
something is thread pidfd or not.

> +
> #ifdef CONFIG_PID_NS
> seq_put_decimal_ll(m, "\nNSpid:\t", nr);
> if (nr > 0) {
> @@ -2068,12 +2071,27 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
> }
> #endif
>
> +static bool pidfd_task_exited(struct pid *pid, bool thread)
> +{
> + struct task_struct *task;
> + bool exited;
> +
> + rcu_read_lock();
> + task = pid_task(pid, PIDTYPE_PID);
> + exited = !task ||
> + (READ_ONCE(task->exit_state) && (thread || thread_group_empty(task)));
> + rcu_read_unlock();
> +
> + return exited;
> +}
> +
> /*
> * Poll support for process exit notification.
> */
> static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> {
> struct pid *pid = file->private_data;
> + bool thread = file->f_flags & PIDFD_THREAD;
> __poll_t poll_flags = 0;
>
> poll_wait(file, &pid->wait_pidfd, pts);
> @@ -2083,7 +2101,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> * If the thread group leader exits before all other threads in the
> * group, then poll(2) should block, similar to the wait(2) family.
> */
> - if (thread_group_exited(pid))
> + if (pidfd_task_exited(pid, thread))
> poll_flags = EPOLLIN | EPOLLRDNORM;
>
> return poll_flags;
> @@ -2141,6 +2159,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
> return PTR_ERR(pidfd_file);
> }
> get_pid(pid); /* held by pidfd_file now */
> + /*
> + * anon_inode_getfile() ignores everything outside of the
> + * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
> + */
> + pidfd_file->f_flags |= (flags & PIDFD_THREAD);
> *ret = pidfd_file;
> return pidfd;
> }
> @@ -2173,7 +2196,9 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
> */
> int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> {
> - if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> + bool thread = flags & PIDFD_THREAD;
> +
> + if (!pid || !pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID));
> return -EINVAL;
>
> return __pidfd_prepare(pid, flags, ret);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c7a3e359f8f5..04bdd5ecf183 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -629,7 +629,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> int fd;
> struct pid *p;
>
> - if (flags & ~PIDFD_NONBLOCK)
> + if (flags & ~(PIDFD_NONBLOCK | PIDFD_THREAD))
> return -EINVAL;
>
> if (pid <= 0)
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9561a3962ca6..919cd33a0405 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2051,7 +2051,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> WARN_ON_ONCE(!tsk->ptrace &&
> (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> /*
> - * tsk is a group leader and has no threads, wake up the pidfd waiters.
> + * tsk is a group leader and has no threads, wake up the !PIDFD_THREAD
> + * waiters.
> */
> if (thread_group_empty(tsk))
> do_notify_pidfd(tsk);
> @@ -3926,6 +3927,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> prepare_kill_siginfo(sig, &kinfo);
> }
>
> + /* TODO: respect PIDFD_THREAD */

So I've been thinking about this at the end of last week. Do we need to
give userspace a way to send a thread-group wide signal even when a
PIDFD_THREAD pidfd is passed? Or should we just not worry about this
right now and wait until someone needs this?

> ret = kill_pid_info(sig, &kinfo, pid);

Otherwise this looks good to me!

2024-01-29 14:34:16

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Mon, Jan 29, 2024 at 02:41:11PM +0100, Christian Brauner wrote:
> On Mon, Jan 29, 2024 at 12:23:15PM +0100, Oleg Nesterov wrote:
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2051,7 +2051,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> > WARN_ON_ONCE(!tsk->ptrace &&
> > (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> > /*
> > - * tsk is a group leader and has no threads, wake up the pidfd waiters.
> > + * tsk is a group leader and has no threads, wake up the !PIDFD_THREAD
> > + * waiters.
> > */
> > if (thread_group_empty(tsk))
> > do_notify_pidfd(tsk);
> > @@ -3926,6 +3927,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > prepare_kill_siginfo(sig, &kinfo);
> > }
> >
> > + /* TODO: respect PIDFD_THREAD */
>
> So I've been thinking about this at the end of last week. Do we need to
> give userspace a way to send a thread-group wide signal even when a
> PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> right now and wait until someone needs this?

I don't need it currently, but it would have been handy for some of
the tests I wrote.

Should I fix those up and send them too on top of Oleg's v2?

Tycho

2024-01-29 15:14:43

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Mon, Jan 29, 2024 at 07:31:35AM -0700, Tycho Andersen wrote:
> On Mon, Jan 29, 2024 at 02:41:11PM +0100, Christian Brauner wrote:
> > On Mon, Jan 29, 2024 at 12:23:15PM +0100, Oleg Nesterov wrote:
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -2051,7 +2051,8 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> > > WARN_ON_ONCE(!tsk->ptrace &&
> > > (tsk->group_leader != tsk || !thread_group_empty(tsk)));
> > > /*
> > > - * tsk is a group leader and has no threads, wake up the pidfd waiters.
> > > + * tsk is a group leader and has no threads, wake up the !PIDFD_THREAD
> > > + * waiters.
> > > */
> > > if (thread_group_empty(tsk))
> > > do_notify_pidfd(tsk);
> > > @@ -3926,6 +3927,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > > prepare_kill_siginfo(sig, &kinfo);
> > > }
> > >
> > > + /* TODO: respect PIDFD_THREAD */
> >
> > So I've been thinking about this at the end of last week. Do we need to
> > give userspace a way to send a thread-group wide signal even when a
> > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > right now and wait until someone needs this?
>
> I don't need it currently, but it would have been handy for some of
> the tests I wrote.
>
> Should I fix those up and send them too on top of Oleg's v2?

Sure, I don't mind.

2024-01-30 11:23:53

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On 01/29, Christian Brauner wrote:
>
> On Mon, Jan 29, 2024 at 12:23:15PM +0100, Oleg Nesterov wrote:
> > @@ -3926,6 +3927,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > prepare_kill_siginfo(sig, &kinfo);
> > }
> >
> > + /* TODO: respect PIDFD_THREAD */
>
> So I've been thinking about this at the end of last week. Do we need to
> give userspace a way to send a thread-group wide signal even when a
> PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> right now and wait until someone needs this?

I don't know. I am fine either way, but I think this needs a separate
patch and another discussion in any case. Anyway should be trivial,
pidfd_send_signal() has the "flags" argument.

On a related note, should copy_process(CLONE_PIDFD | CLONE_THREAD) add
PIDFD_THREAD flag "automatically" depending on CLONE_THREAD? Or do we
want another CLONE_PIDFD_THREAD flag so that PIDFD_THREAD can be used
without CLONE_THREAD? Again, I do not know, needs another discussion.

> Otherwise this looks good to me!

OK, thanks, I'll send v2 in a minute. The patch is the same, I only
updated the comments.

Oleg.

2024-01-31 18:12:30

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Mon, Jan 29, 2024 at 3:24 AM Oleg Nesterov <[email protected]> wrote:
>
> On 01/27, Oleg Nesterov wrote:
> >
> > I'll (hopefully) send v2 on top of
> >
> > pidfd: cleanup the usage of __pidfd_prepare's flags
> > pidfd: don't do_notify_pidfd() if !thread_group_empty()
> >
> > on Monday
>
> Sorry, I don't have time to finish v2 today, I need to update the comments
> and write the changelog.
>
> But the patch itself is ready, I am sending it for review.
>
> Tycho, Christian, any comments?

Right now, pidfd_send_signal() sends signals to processes, like so:

* The syscall currently only signals via PIDTYPE_PID which covers
* kill(<positive-pid>, <signal>. It does not signal threads or process
* groups.

This patch adds PIDFD_THREAD which, potentially confusingly, doesn't
change this (AFAICS). So at least that should be documented loudly
and clearly, IMO. But I actually just bumped in to this limitation in
pidfd_send_signal(), like so:

https://github.com/systemd/systemd/issues/31093

Specifically, systemd can't properly emulate Ctrl-C using pidfd_send_signal().

I don't know whether implementing the other signal types belongs as
part of this patch, but they're at least thematically related.

--Andy

2024-01-31 18:50:01

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On 01/31, Andy Lutomirski wrote:
>
> Right now, pidfd_send_signal() sends signals to processes, like so:
>
> * The syscall currently only signals via PIDTYPE_PID which covers
> * kill(<positive-pid>, <signal>. It does not signal threads or process
> * groups.
>
> This patch adds PIDFD_THREAD which, potentially confusingly, doesn't
> change this (AFAICS).

Yes,

> So at least that should be documented loudly
> and clearly, IMO.

Please note

/* TODO: respect PIDFD_THREAD */

this patch adds into pidfd_send_signal().

See also this part of discussion

> > + /* TODO: respect PIDFD_THREAD */
>
> So I've been thinking about this at the end of last week. Do we need to
> give userspace a way to send a thread-group wide signal even when a
> PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> right now and wait until someone needs this?

I don't know. I am fine either way, but I think this needs a separate
patch and another discussion in any case. Anyway should be trivial,
pidfd_send_signal() has the "flags" argument.

with Christian in https://lore.kernel.org/all/[email protected]/

Or did I misunderstand you?

Oleg.

2024-01-31 19:15:34

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

Forgot to mention...

And I agree that pidfd_send_signal(flags => PGID/SID) can make
some sense too.

But this a) doesn't depend on PIDFD_THREAD, and b) needs another
patch/discussion.

But again, I am not sure I understood you correctly.

On 01/31, Oleg Nesterov wrote:
>
> On 01/31, Andy Lutomirski wrote:
> >
> > Right now, pidfd_send_signal() sends signals to processes, like so:
> >
> > * The syscall currently only signals via PIDTYPE_PID which covers
> > * kill(<positive-pid>, <signal>. It does not signal threads or process
> > * groups.
> >
> > This patch adds PIDFD_THREAD which, potentially confusingly, doesn't
> > change this (AFAICS).
>
> Yes,
>
> > So at least that should be documented loudly
> > and clearly, IMO.
>
> Please note
>
> /* TODO: respect PIDFD_THREAD */
>
> this patch adds into pidfd_send_signal().
>
> See also this part of discussion
>
> > > + /* TODO: respect PIDFD_THREAD */
> >
> > So I've been thinking about this at the end of last week. Do we need to
> > give userspace a way to send a thread-group wide signal even when a
> > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > right now and wait until someone needs this?
>
> I don't know. I am fine either way, but I think this needs a separate
> patch and another discussion in any case. Anyway should be trivial,
> pidfd_send_signal() has the "flags" argument.
>
> with Christian in https://lore.kernel.org/all/[email protected]/
>
> Or did I misunderstand you?
>
> Oleg.

2024-01-31 19:25:22

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

> On 01/31, Oleg Nesterov wrote:
> >
> > On 01/31, Andy Lutomirski wrote:
> > Please note
> >
> > /* TODO: respect PIDFD_THREAD */
> >
> > this patch adds into pidfd_send_signal().
> >
> > See also this part of discussion
> >
> > > > + /* TODO: respect PIDFD_THREAD */
> > >
> > > So I've been thinking about this at the end of last week. Do we need to
> > > give userspace a way to send a thread-group wide signal even when a
> > > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > > right now and wait until someone needs this?
> >
> > I don't know. I am fine either way, but I think this needs a separate
> > patch and another discussion in any case. Anyway should be trivial,
> > pidfd_send_signal() has the "flags" argument.
> >
> > with Christian in https://lore.kernel.org/all/[email protected]/

I missed that. Whoops.

On Wed, Jan 31, 2024 at 11:15 AM Oleg Nesterov <[email protected]> wrote:
>
> Forgot to mention...
>
> And I agree that pidfd_send_signal(flags => PGID/SID) can make
> some sense too.
>
> But this a) doesn't depend on PIDFD_THREAD, and b) needs another
> patch/discussion.
>
> But again, I am not sure I understood you correctly.
>

Hmm.

When one works with regular (non-fd) pids / pgids etc, one specifies
the signal domain at the time that one sends the signal. I don't know
what pidfds should do. It seems a bit inefficient for anything that
wants a pidfd and might send a signal in a different mode in the
future to have to hold on to multiple pidfds, so it probably should be
a pidfd_send_signal flag.

Which leaves the question of what the default should be. Should
pidfd_send_signal with flags = 0 on a PIDFD_THREAD signal the process
or the thread? I guess there are two reasonable solutions:

1. flags = 0 always means process. And maybe there's a special flag
to send a signal that matches the pidfd type, or maybe not.

2. flags = 0 does what the pidfd seems to imply, and a new
PIDFD_SIGNAL_PID flag overrides it to signal the whole PID even if the
pidfd is PIDFD_THREAD.

Do any of you have actual use cases in mind where one choice is
clearly better than the other choice?

--Andy

2024-01-31 19:46:58

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Wed, Jan 31, 2024 at 11:24:48AM -0800, Andy Lutomirski wrote:
> > On 01/31, Oleg Nesterov wrote:
> > >
> > > On 01/31, Andy Lutomirski wrote:
> > > Please note
> > >
> > > /* TODO: respect PIDFD_THREAD */
> > >
> > > this patch adds into pidfd_send_signal().
> > >
> > > See also this part of discussion
> > >
> > > > > + /* TODO: respect PIDFD_THREAD */
> > > >
> > > > So I've been thinking about this at the end of last week. Do we need to
> > > > give userspace a way to send a thread-group wide signal even when a
> > > > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > > > right now and wait until someone needs this?
> > >
> > > I don't know. I am fine either way, but I think this needs a separate
> > > patch and another discussion in any case. Anyway should be trivial,
> > > pidfd_send_signal() has the "flags" argument.
> > >
> > > with Christian in https://lore.kernel.org/all/[email protected]/
>
> I missed that. Whoops.
>
> On Wed, Jan 31, 2024 at 11:15 AM Oleg Nesterov <[email protected]> wrote:
> >
> > Forgot to mention...
> >
> > And I agree that pidfd_send_signal(flags => PGID/SID) can make
> > some sense too.
> >
> > But this a) doesn't depend on PIDFD_THREAD, and b) needs another
> > patch/discussion.
> >
> > But again, I am not sure I understood you correctly.
> >
>
> Hmm.
>
> When one works with regular (non-fd) pids / pgids etc, one specifies
> the signal domain at the time that one sends the signal. I don't know
> what pidfds should do. It seems a bit inefficient for anything that
> wants a pidfd and might send a signal in a different mode in the
> future to have to hold on to multiple pidfds, so it probably should be
> a pidfd_send_signal flag.
>
> Which leaves the question of what the default should be. Should
> pidfd_send_signal with flags = 0 on a PIDFD_THREAD signal the process
> or the thread? I guess there are two reasonable solutions:
>
> 1. flags = 0 always means process. And maybe there's a special flag
> to send a signal that matches the pidfd type, or maybe not.
>
> 2. flags = 0 does what the pidfd seems to imply, and a new
> PIDFD_SIGNAL_PID flag overrides it to signal the whole PID even if the
> pidfd is PIDFD_THREAD.
>
> Do any of you have actual use cases in mind where one choice is
> clearly better than the other choice?

So conceptually I think having the type of pidfd dictate the default
scope of the signal is the most elegant approach. And then very likely
we should just have:

PIDFD_SIGNAL_THREAD
PIDFD_SIGNAL_THREAD_GROUP
PIDFD_SIGNAL_PROCESS_GROUP

I think for userspace it doesn't really matter as long as we clearly
document what's going on.

Thoughts?

2024-01-31 19:50:47

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Wed, Jan 31, 2024 at 11:46 AM Christian Brauner <[email protected]> wrote:
>
> On Wed, Jan 31, 2024 at 11:24:48AM -0800, Andy Lutomirski wrote:
> > > On 01/31, Oleg Nesterov wrote:
> > > >
> > > > On 01/31, Andy Lutomirski wrote:
> > > > Please note
> > > >
> > > > /* TODO: respect PIDFD_THREAD */
> > > >
> > > > this patch adds into pidfd_send_signal().
> > > >
> > > > See also this part of discussion
> > > >
> > > > > > + /* TODO: respect PIDFD_THREAD */
> > > > >
> > > > > So I've been thinking about this at the end of last week. Do we need to
> > > > > give userspace a way to send a thread-group wide signal even when a
> > > > > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > > > > right now and wait until someone needs this?
> > > >
> > > > I don't know. I am fine either way, but I think this needs a separate
> > > > patch and another discussion in any case. Anyway should be trivial,
> > > > pidfd_send_signal() has the "flags" argument.
> > > >
> > > > with Christian in https://lore.kernel.org/all/[email protected]/
> >
> > I missed that. Whoops.
> >
> > On Wed, Jan 31, 2024 at 11:15 AM Oleg Nesterov <[email protected]> wrote:
> > >
> > > Forgot to mention...
> > >
> > > And I agree that pidfd_send_signal(flags => PGID/SID) can make
> > > some sense too.
> > >
> > > But this a) doesn't depend on PIDFD_THREAD, and b) needs another
> > > patch/discussion.
> > >
> > > But again, I am not sure I understood you correctly.
> > >
> >
> > Hmm.
> >
> > When one works with regular (non-fd) pids / pgids etc, one specifies
> > the signal domain at the time that one sends the signal. I don't know
> > what pidfds should do. It seems a bit inefficient for anything that
> > wants a pidfd and might send a signal in a different mode in the
> > future to have to hold on to multiple pidfds, so it probably should be
> > a pidfd_send_signal flag.
> >
> > Which leaves the question of what the default should be. Should
> > pidfd_send_signal with flags = 0 on a PIDFD_THREAD signal the process
> > or the thread? I guess there are two reasonable solutions:
> >
> > 1. flags = 0 always means process. And maybe there's a special flag
> > to send a signal that matches the pidfd type, or maybe not.
> >
> > 2. flags = 0 does what the pidfd seems to imply, and a new
> > PIDFD_SIGNAL_PID flag overrides it to signal the whole PID even if the
> > pidfd is PIDFD_THREAD.
> >
> > Do any of you have actual use cases in mind where one choice is
> > clearly better than the other choice?
>
> So conceptually I think having the type of pidfd dictate the default
> scope of the signal is the most elegant approach. And then very likely
> we should just have:
>
> PIDFD_SIGNAL_THREAD
> PIDFD_SIGNAL_THREAD_GROUP
> PIDFD_SIGNAL_PROCESS_GROUP
>
> I think for userspace it doesn't really matter as long as we clearly
> document what's going on.
>

This seems reasonable unless we're likely to end up with a pidfd mode
that doesn't actually make sense in a send_signal context. But I'm
not immediately seeing any reason that that would happen.

--Andy

2024-02-01 13:30:59

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Wed, Jan 31, 2024 at 11:50:23AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 31, 2024 at 11:46 AM Christian Brauner <[email protected]> wrote:
> >
> > On Wed, Jan 31, 2024 at 11:24:48AM -0800, Andy Lutomirski wrote:
> > > > On 01/31, Oleg Nesterov wrote:
> > > > >
> > > > > On 01/31, Andy Lutomirski wrote:
> > > > > Please note
> > > > >
> > > > > /* TODO: respect PIDFD_THREAD */
> > > > >
> > > > > this patch adds into pidfd_send_signal().
> > > > >
> > > > > See also this part of discussion
> > > > >
> > > > > > > + /* TODO: respect PIDFD_THREAD */
> > > > > >
> > > > > > So I've been thinking about this at the end of last week. Do we need to
> > > > > > give userspace a way to send a thread-group wide signal even when a
> > > > > > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > > > > > right now and wait until someone needs this?
> > > > >
> > > > > I don't know. I am fine either way, but I think this needs a separate
> > > > > patch and another discussion in any case. Anyway should be trivial,
> > > > > pidfd_send_signal() has the "flags" argument.
> > > > >
> > > > > with Christian in https://lore.kernel.org/all/[email protected]/
> > >
> > > I missed that. Whoops.
> > >
> > > On Wed, Jan 31, 2024 at 11:15 AM Oleg Nesterov <[email protected]> wrote:
> > > >
> > > > Forgot to mention...
> > > >
> > > > And I agree that pidfd_send_signal(flags => PGID/SID) can make
> > > > some sense too.
> > > >
> > > > But this a) doesn't depend on PIDFD_THREAD, and b) needs another
> > > > patch/discussion.
> > > >
> > > > But again, I am not sure I understood you correctly.
> > > >
> > >
> > > Hmm.
> > >
> > > When one works with regular (non-fd) pids / pgids etc, one specifies
> > > the signal domain at the time that one sends the signal. I don't know
> > > what pidfds should do. It seems a bit inefficient for anything that
> > > wants a pidfd and might send a signal in a different mode in the
> > > future to have to hold on to multiple pidfds, so it probably should be
> > > a pidfd_send_signal flag.
> > >
> > > Which leaves the question of what the default should be. Should
> > > pidfd_send_signal with flags = 0 on a PIDFD_THREAD signal the process
> > > or the thread? I guess there are two reasonable solutions:
> > >
> > > 1. flags = 0 always means process. And maybe there's a special flag
> > > to send a signal that matches the pidfd type, or maybe not.
> > >
> > > 2. flags = 0 does what the pidfd seems to imply, and a new
> > > PIDFD_SIGNAL_PID flag overrides it to signal the whole PID even if the
> > > pidfd is PIDFD_THREAD.
> > >
> > > Do any of you have actual use cases in mind where one choice is
> > > clearly better than the other choice?
> >
> > So conceptually I think having the type of pidfd dictate the default
> > scope of the signal is the most elegant approach. And then very likely
> > we should just have:
> >
> > PIDFD_SIGNAL_THREAD
> > PIDFD_SIGNAL_THREAD_GROUP
> > PIDFD_SIGNAL_PROCESS_GROUP
> >
> > I think for userspace it doesn't really matter as long as we clearly
> > document what's going on.
> >
>
> This seems reasonable unless we're likely to end up with a pidfd mode
> that doesn't actually make sense in a send_signal context. But I'm
> not immediately seeing any reason that that would happen.

Yeah, I think that's very unlikely and we could reject it obased on api
design considerations.

2024-02-01 13:57:52

[permalink] [raw]

Subject: Re: [RFC PATCH] pidfd: implement PIDFD_THREAD flag for pidfd_open()

On Thu, Feb 01, 2024 at 02:30:46PM +0100, Christian Brauner wrote:
> On Wed, Jan 31, 2024 at 11:50:23AM -0800, Andy Lutomirski wrote:
> > On Wed, Jan 31, 2024 at 11:46 AM Christian Brauner <[email protected]> wrote:
> > >
> > > On Wed, Jan 31, 2024 at 11:24:48AM -0800, Andy Lutomirski wrote:
> > > > > On 01/31, Oleg Nesterov wrote:
> > > > > >
> > > > > > On 01/31, Andy Lutomirski wrote:
> > > > > > Please note
> > > > > >
> > > > > > /* TODO: respect PIDFD_THREAD */
> > > > > >
> > > > > > this patch adds into pidfd_send_signal().
> > > > > >
> > > > > > See also this part of discussion
> > > > > >
> > > > > > > > + /* TODO: respect PIDFD_THREAD */
> > > > > > >
> > > > > > > So I've been thinking about this at the end of last week. Do we need to
> > > > > > > give userspace a way to send a thread-group wide signal even when a
> > > > > > > PIDFD_THREAD pidfd is passed? Or should we just not worry about this
> > > > > > > right now and wait until someone needs this?
> > > > > >
> > > > > > I don't know. I am fine either way, but I think this needs a separate
> > > > > > patch and another discussion in any case. Anyway should be trivial,
> > > > > > pidfd_send_signal() has the "flags" argument.
> > > > > >
> > > > > > with Christian in https://lore.kernel.org/all/[email protected]/
> > > >
> > > > I missed that. Whoops.
> > > >
> > > > On Wed, Jan 31, 2024 at 11:15 AM Oleg Nesterov <[email protected]> wrote:
> > > > >
> > > > > Forgot to mention...
> > > > >
> > > > > And I agree that pidfd_send_signal(flags => PGID/SID) can make
> > > > > some sense too.
> > > > >
> > > > > But this a) doesn't depend on PIDFD_THREAD, and b) needs another
> > > > > patch/discussion.
> > > > >
> > > > > But again, I am not sure I understood you correctly.
> > > > >
> > > >
> > > > Hmm.
> > > >
> > > > When one works with regular (non-fd) pids / pgids etc, one specifies
> > > > the signal domain at the time that one sends the signal. I don't know
> > > > what pidfds should do. It seems a bit inefficient for anything that
> > > > wants a pidfd and might send a signal in a different mode in the
> > > > future to have to hold on to multiple pidfds, so it probably should be
> > > > a pidfd_send_signal flag.
> > > >
> > > > Which leaves the question of what the default should be. Should
> > > > pidfd_send_signal with flags = 0 on a PIDFD_THREAD signal the process
> > > > or the thread? I guess there are two reasonable solutions:
> > > >
> > > > 1. flags = 0 always means process. And maybe there's a special flag
> > > > to send a signal that matches the pidfd type, or maybe not.
> > > >
> > > > 2. flags = 0 does what the pidfd seems to imply, and a new
> > > > PIDFD_SIGNAL_PID flag overrides it to signal the whole PID even if the
> > > > pidfd is PIDFD_THREAD.
> > > >
> > > > Do any of you have actual use cases in mind where one choice is
> > > > clearly better than the other choice?
> > >
> > > So conceptually I think having the type of pidfd dictate the default
> > > scope of the signal is the most elegant approach. And then very likely
> > > we should just have:
> > >
> > > PIDFD_SIGNAL_THREAD
> > > PIDFD_SIGNAL_THREAD_GROUP
> > > PIDFD_SIGNAL_PROCESS_GROUP
> > >
> > > I think for userspace it doesn't really matter as long as we clearly
> > > document what's going on.
> > >
> >
> > This seems reasonable unless we're likely to end up with a pidfd mode
> > that doesn't actually make sense in a send_signal context. But I'm
> > not immediately seeing any reason that that would happen.
>
> Yeah, I think that's very unlikely and we could reject it obased on api
> design considerations.

Ah, forgot to ask. Did you intend to send a patch for this?

2024-02-01 19:33:58