LinuxLists.cc - [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

2020-04-23 19:46:41

Subject: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization"). Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently. Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
has_group_leader_pid is called, and reording some of the assignments
should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
> /*
> * This can only happen if the caller is de_thread().
> * FIXME: this is the temporary hack, we should teach
> * posix-cpu-timers to handle this case correctly.
> */
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/exec.c | 5 +----
include/linux/pid.h | 1 +
kernel/pid.c | 16 ++++++++++++++++
3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..9b60f927afd7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,11 +1186,8 @@ static int de_thread(struct task_struct *tsk)

/* Become a process group leader with the old leader's pid.
* The old leader becomes a thread of the this thread group.
- * Note: The old leader also uses this pid until release_task
- * is called. Odd but simple and correct.
*/
- tsk->pid = leader->pid;
- change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+ exchange_tids(tsk, leader);
transfer_pid(leader, tsk, PIDTYPE_TGID);
transfer_pid(leader, tsk, PIDTYPE_PGID);
transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..2159ffca63fc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -102,6 +102,7 @@ extern void attach_pid(struct task_struct *task, enum pid_type);
extern void detach_pid(struct task_struct *task, enum pid_type);
extern void change_pid(struct task_struct *task, enum pid_type,
struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
extern void transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type);

diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..4ece32d8791a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -363,6 +363,22 @@ void change_pid(struct task_struct *task, enum pid_type type,
attach_pid(task, type);
}

+void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
+{
+ /* pid_links[PIDTYPE_PID].next is always NULL */
+ struct pid *npid = READ_ONCE(ntask->thread_pid);
+ struct pid *opid = READ_ONCE(otask->thread_pid);
+
+ rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
+ rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
+ rcu_assign_pointer(ntask->thread_pid, opid);
+ rcu_assign_pointer(otask->thread_pid, npid);
+ WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
+ WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
+ WRITE_ONCE(ntask->pid, pid_nr(opid));
+ WRITE_ONCE(otask->pid, pid_nr(npid));
+}
+
/* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
void transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type type)
--
2.20.1

2020-04-23 20:33:03

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman
<[email protected]> wrote:
>
> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> + /* pid_links[PIDTYPE_PID].next is always NULL */
> + struct pid *npid = READ_ONCE(ntask->thread_pid);
> + struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> + rcu_assign_pointer(ntask->thread_pid, opid);
> + rcu_assign_pointer(otask->thread_pid, npid);
> + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
> + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
> + WRITE_ONCE(ntask->pid, pid_nr(opid));
> + WRITE_ONCE(otask->pid, pid_nr(npid));
> +}

This function is _very_ hard to read as written.

It really wants a helper function to do the swapping per hlist_head
and hlist_node, I think. And "opid/npid" is very hard to see, and the
naming doesn't make much sense (if it's an "exchange", then why is it
"old/new" - they're symmetric).

At least something like

struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID;
struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID;
struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID;
struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID;

struct hlist_node *old_first_node = old_pid_hlist->first;
struct hlist_node *new_first_node = new_pid_hlist->first;

and then trying to group up the first/pprev/thread_pid/pid accesses
so that you them together, and using a helper function that does the
whole switch, so that you'd have

/* Move new node to old hlist, and update thread_pid/pid fields */
insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node);
rcu_assign_pointer(ntask->thread_pid, opid);
WRITE_ONCE(ntask->pid, pid_nr(opid));

/* Move old new to new hlist, and update thread_pid/pid fields */
insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node);
rcu_assign_pointer(otask->thread_pid, npid);
WRITE_ONCE(otask->pid, pid_nr(npid));

or something roughly like that.

(And the above still uses "old/new", which as mentioned sounds wrong
to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I
did this in my MUA, so I could have gotten the names and types wrong
etc).

I think that would make it look at least _slightly_ less like random
line noise and easier to follow.

But maybe even a rcu_hlist_swap() helper? We have one for regular
lists. Do we really have to do it all written out, not do it with a
"remove and reinsert" model?

Linus

2020-04-24 03:40:09

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

Linus Torvalds <[email protected]> writes:

> On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman
> <[email protected]> wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> + /* pid_links[PIDTYPE_PID].next is always NULL */
>> + struct pid *npid = READ_ONCE(ntask->thread_pid);
>> + struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> + rcu_assign_pointer(ntask->thread_pid, opid);
>> + rcu_assign_pointer(otask->thread_pid, npid);
>> + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
>> + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
>> + WRITE_ONCE(ntask->pid, pid_nr(opid));
>> + WRITE_ONCE(otask->pid, pid_nr(npid));
>> +}
>
> This function is _very_ hard to read as written.
>
> It really wants a helper function to do the swapping per hlist_head
> and hlist_node, I think. And "opid/npid" is very hard to see, and the
> naming doesn't make much sense (if it's an "exchange", then why is it
> "old/new" - they're symmetric).
>
> At least something like
>
> struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID;
> struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID;
> struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID;
> struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID;
>
> struct hlist_node *old_first_node = old_pid_hlist->first;
> struct hlist_node *new_first_node = new_pid_hlist->first;
>
> and then trying to group up the first/pprev/thread_pid/pid accesses
> so that you them together, and using a helper function that does the
> whole switch, so that you'd have
>
> /* Move new node to old hlist, and update thread_pid/pid fields */
> insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node);
> rcu_assign_pointer(ntask->thread_pid, opid);
> WRITE_ONCE(ntask->pid, pid_nr(opid));
>
> /* Move old new to new hlist, and update thread_pid/pid fields */
> insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node);
> rcu_assign_pointer(otask->thread_pid, npid);
> WRITE_ONCE(otask->pid, pid_nr(npid));
>
> or something roughly like that.
>
> (And the above still uses "old/new", which as mentioned sounds wrong
> to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I
> did this in my MUA, so I could have gotten the names and types wrong
> etc).
>
> I think that would make it look at least _slightly_ less like random
> line noise and easier to follow.
>
> But maybe even a rcu_hlist_swap() helper? We have one for regular
> lists. Do we really have to do it all written out, not do it with a
> "remove and reinsert" model?

At one point my brain I had forgetten that xchg can not take two memory
arguments and had hoped to be able to provide stronger guarnatees than I
can. Which is where I think the structure of exchange_pids came from.

I do agree the clearer we can write things, the easier it is for
someone else to come along and follow.

We can not use a remove and reinser model because that does break rcu
accesses, and complicates everything else. With a swap model we have
the struct pids pointer at either of the tasks that are swapped but
never at nothing. With a remove/reinsert model we have to deal the
addittional possibility of the pids not pointing at a thread at all
which can result in things like signals not being delivered at all.

I played with it a bit and the best I have been able to come up is:

void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
{
struct hlist_node **lpprev = left->pprev;
struct hlist_node **rpprev = right->pprev;

rcu_assign_pointer(*lpprev, right);
rcu_assign_pointer(*rpprev, left);
WRITE_ONCE(left->pprev, rpprev);
WRITE_ONCE(right->pprev, lpprev);
}

void exchange_tids(struct task_struct *left, struct task_struct *right)
{
struct hlist_node *lnode = &left->pid_links[PIDTYPE_PID];
struct hlist_node *rnode = &right->pid_links[PIDTYPE_PID];
struct pid *lpid, *rpid;

/* Replace the single entry tid lists with each other */
hlist_swap_before_rcu(lnode, rnode);

/* Swap thread_pid */
rpid = left->thread_pid;
lpid = right->thread_pid;
rcu_assign_pointer(left->thread_pid, lpid);
rcu_assign_pointer(right->thread_pid, rpid);

/* Swap the cached pid value */
WRITE_ONCE(left->pid, pid_nr(lpid));
WRITE_ONCE(right->pid, pid_nr(rpid));
}

hlists because they are not doubly linked can legitimately swap their
beginnings or their tails. Something that regular lists can not,
and I think that is exactly the general purpose semantic I want.

Does that look a little more readable?

Eric

2020-04-24 17:41:32

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

On 04/23, Eric W. Biederman wrote:
>
> When the thread group leader changes during exec and the old leaders
> thread is reaped proc_flush_pid

This is off-topic, but let me mention this before I forget...

Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean
that in this case release_task() leaks thread_pid.

> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> + /* pid_links[PIDTYPE_PID].next is always NULL */
> + struct pid *npid = READ_ONCE(ntask->thread_pid);
> + struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> + rcu_assign_pointer(ntask->thread_pid, opid);
> + rcu_assign_pointer(otask->thread_pid, npid);
> + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
> + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
> + WRITE_ONCE(ntask->pid, pid_nr(opid));
> + WRITE_ONCE(otask->pid, pid_nr(npid));
> +}

Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last
user of has_group_leader_pid().

I think that we should change lookup_task() to return "struct *pid", this
should simplify the code... Note that none of its callers needs task_struct.

And, instead of thread_group_leader/has_group_leader_pid checks we should
use pid_has_task(TGID).

After that, this patch should kill has_group_leader_pid().

What do you think?

Oleg.

2020-04-24 18:06:16

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <[email protected]> wrote:
>
> At one point my brain I had forgetten that xchg can not take two memory
> arguments and had hoped to be able to provide stronger guarnatees than I
> can. Which is where I think the structure of exchange_pids came from.

Note that even if we were to have a "exchange two memory locations
atomically" instruction (and we don't - even a "double cmpxchg" is
actually just a double-_sized_ one, not a two different locations
one), I'm not convinced it makes sense.

There's no way to _walk_ two lists atomically. Any user will only ever
walk one or the other, so it's not sensible to try to make the two
list updates be atomic.

And if a user for some reason walks both, the walking itself will
obviously then be racy - it does one or the other first, and can see
either the old state, or the new state - or see _neither_ (ie if you
walk it twice, you might see neither task, or you might see both, just
depending on order or walk).

> I do agree the clearer we can write things, the easier it is for
> someone else to come along and follow.

Your alternate write of the function seems a bit more readable to me,
even if the main effect might be just that it was split up a bit and
added a few comments and whitespace.

So I'm more happier with that one. That said:

> We can not use a remove and reinser model because that does break rcu
> accesses, and complicates everything else. With a swap model we have
> the struct pids pointer at either of the tasks that are swapped but
> never at nothing.

I'm not suggesting removing the pid entirely - like making task->pid
be NULL. I'm literally suggesting just doing the RCU list operations
as "remove and re-insert".

And that shouldn't break anything, for the same reason that an atomic
exchange doesn't make sense: you can only ever walk one of the lists
at a time. And regardless of how you walk it, you might not see the
new state (or the old state) reliably.

Put another way:

> void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
> {
> struct hlist_node **lpprev = left->pprev;
> struct hlist_node **rpprev = right->pprev;
>
> rcu_assign_pointer(*lpprev, right);
> rcu_assign_pointer(*rpprev, left);

These are the only two assignments that matter for anything that walks
the list (the pprev ones are for things that change the list, and they
have to have exclusions in place).

And those two writes cannot be atomic anyway, so you fundamentally
will always be in the situation that a walker can miss one of the
tasks.

Which is why I think it would be ok to just do the RCU list swap as a
"remove left, remove right, add left, add right" operation. It doesn't
seem fundamentally different to a walker than the "switch left/right"
operation, and it seems much simpler.

Is there something I'm missing?

But I'm *not* suggesting that we change these simple parts to be
"remove thread_pid or pid pointer, and then insert a new one":

> /* Swap thread_pid */
> rpid = left->thread_pid;
> lpid = right->thread_pid;
> rcu_assign_pointer(left->thread_pid, lpid);
> rcu_assign_pointer(right->thread_pid, rpid);
>
> /* Swap the cached pid value */
> WRITE_ONCE(left->pid, pid_nr(lpid));
> WRITE_ONCE(right->pid, pid_nr(rpid));
> }

because I agree that for things that don't _walk_ the list, but just
look up "thread_pid" vs "pid" atomically but asynchronously, we
obviously need to get one or the other, not some kind of "empty"
state.

> Does that look a little more readable?

Regardless, I find your new version at least a lot more readable, so
I'm ok with it.

It looks like Oleg found an independent issue, though.

Linus

2020-04-24 18:16:46

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

Oleg Nesterov <[email protected]> writes:

> On 04/23, Eric W. Biederman wrote:
>>
>> When the thread group leader changes during exec and the old leaders
>> thread is reaped proc_flush_pid
>
> This is off-topic, but let me mention this before I forget...
>
> Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean
> that in this case release_task() leaks thread_pid.

Good catch. Wow. I seem to be introducing almost as many bugs as I am
fixing right now. Ouch.

>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> + /* pid_links[PIDTYPE_PID].next is always NULL */
>> + struct pid *npid = READ_ONCE(ntask->thread_pid);
>> + struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> + rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> + rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> + rcu_assign_pointer(ntask->thread_pid, opid);
>> + rcu_assign_pointer(otask->thread_pid, npid);
>> + WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
>> + WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
>> + WRITE_ONCE(ntask->pid, pid_nr(opid));
>> + WRITE_ONCE(otask->pid, pid_nr(npid));
>> +}
>
> Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last
> user of has_group_leader_pid().
>
> I think that we should change lookup_task() to return "struct *pid", this
> should simplify the code... Note that none of its callers needs task_struct.
>
> And, instead of thread_group_leader/has_group_leader_pid checks we should
> use pid_has_task(TGID).

Somehow I thought we could get away without fiddling with that right
now, but on second glance I can see the races.

I played with this earlier and I agree returning a struct pid *
is desirable. I will see if I can track down the patches I was
playing with as that definitely needs to get fixed first.

> After that, this patch should kill has_group_leader_pid().
>
> What do you think?

I agree completely. has_group_leader_pid is the same as
thread_group_leader after this so should be removed. Especially as it
won't have any users.

There are several other potential cleanups as well. Such as not
using a hlist for PIDTYPE_PID. Which would allow us to run the hlists
through struct signal_struct instead. I think that would clean things
up but that touches so many things it may just be pointless code churn.

Just for mentioning I am thinking we should rename PIDTYPE_PID to
PIDTYPE_TID just to create a distance in peoples minds between
the kernel concepts and the user space concepts.

Eric

2020-04-24 18:50:47

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

On Fri, Apr 24, 2020 at 11:02 AM Linus Torvalds
<[email protected]> wrote:
>
> [..] even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one

Historical accuracy side note: the 68020 actually had a CAS2 that was
"two different locations".

Maybe somebody else did too.

Linus

2020-04-24 19:57:07

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

Linus Torvalds <[email protected]> writes:

> On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <[email protected]> wrote:
>>
>> At one point my brain I had forgetten that xchg can not take two memory
>> arguments and had hoped to be able to provide stronger guarnatees than I
>> can. Which is where I think the structure of exchange_pids came from.
>
> Note that even if we were to have a "exchange two memory locations
> atomically" instruction (and we don't - even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one), I'm not convinced it makes sense.
>
> There's no way to _walk_ two lists atomically. Any user will only ever
> walk one or the other, so it's not sensible to try to make the two
> list updates be atomic.
>
> And if a user for some reason walks both, the walking itself will
> obviously then be racy - it does one or the other first, and can see
> either the old state, or the new state - or see _neither_ (ie if you
> walk it twice, you might see neither task, or you might see both, just
> depending on order or walk).
>
>> I do agree the clearer we can write things, the easier it is for
>> someone else to come along and follow.
>
> Your alternate write of the function seems a bit more readable to me,
> even if the main effect might be just that it was split up a bit and
> added a few comments and whitespace.
>
> So I'm more happier with that one. That said:
>
>> We can not use a remove and reinser model because that does break rcu
>> accesses, and complicates everything else. With a swap model we have
>> the struct pids pointer at either of the tasks that are swapped but
>> never at nothing.
>
> I'm not suggesting removing the pid entirely - like making task->pid
> be NULL. I'm literally suggesting just doing the RCU list operations
> as "remove and re-insert".
>
> And that shouldn't break anything, for the same reason that an atomic
> exchange doesn't make sense: you can only ever walk one of the lists
> at a time. And regardless of how you walk it, you might not see the
> new state (or the old state) reliably.
>
> Put another way:
>
>> void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>> {
>> struct hlist_node **lpprev = left->pprev;
>> struct hlist_node **rpprev = right->pprev;
>>
>> rcu_assign_pointer(*lpprev, right);
>> rcu_assign_pointer(*rpprev, left);
>
> These are the only two assignments that matter for anything that walks
> the list (the pprev ones are for things that change the list, and they
> have to have exclusions in place).
>
> And those two writes cannot be atomic anyway, so you fundamentally
> will always be in the situation that a walker can miss one of the
> tasks.
>
> Which is why I think it would be ok to just do the RCU list swap as a
> "remove left, remove right, add left, add right" operation. It doesn't
> seem fundamentally different to a walker than the "switch left/right"
> operation, and it seems much simpler.
>
> Is there something I'm missing?

The problem with

remove
remove
add
add
is:

A lookup that hit between the remove and the add could return nothing.

The function kill_pid_info does everything it can to handle this case
today does:

int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
{
int error = -ESRCH;
struct task_struct *p;

for (;;) {
rcu_read_lock();
p = pid_task(pid, PIDTYPE_PID);
if (p)
error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
rcu_read_unlock();
if (likely(!p || error != -ESRCH))
return error;

/*
* The task was unhashed in between, try again. If it
* is dead, pid_task() will return NULL, if we race with
* de_thread() it will find the new leader.
*/
}
}

Now kill_pid_info is signalling the entire task and is just using
PIDTYPE_PID to find a thread in the task.

With the remove then add model there will be a point where pid_task
will return nothing, because ever so briefly the lists will be
empty.

However with an actually swap we will find a task and kill_pid_info
will work. It pathloglical cases lock_task_sighand might have to loop
and we would need to find the new task that has the given pid. But
kill_pid_info is guaranteed to work with swaps and will fail with
remove add.

> But I'm *not* suggesting that we change these simple parts to be
> "remove thread_pid or pid pointer, and then insert a new one":
>
>> /* Swap thread_pid */
>> rpid = left->thread_pid;
>> lpid = right->thread_pid;
>> rcu_assign_pointer(left->thread_pid, lpid);
>> rcu_assign_pointer(right->thread_pid, rpid);
>>
>> /* Swap the cached pid value */
>> WRITE_ONCE(left->pid, pid_nr(lpid));
>> WRITE_ONCE(right->pid, pid_nr(rpid));
>> }
>
> because I agree that for things that don't _walk_ the list, but just
> look up "thread_pid" vs "pid" atomically but asynchronously, we
> obviously need to get one or the other, not some kind of "empty"
> state.

For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
pointers to the appropriate task. Only for PIDTYPE_PGID and PIDTYPE_SID
do these become lists in practice.

That not-really-a-list status allows for signel delivery to indivdual
processes to happen in rcu context. Which is where we would get into
trouble with add/remove.

Since signals are guaranteed to be delivered to the entire session
or the entire process group all of the list walking happens under
the tasklist_lock currently. Which really keeps list walking from
being a concern.

>> Does that look a little more readable?
>
> Regardless, I find your new version at least a lot more readable, so
> I'm ok with it.

Good. Then I will finish cleaning it up and go with that version.

> It looks like Oleg found an independent issue, though.

Yes, and I will definitely work through those.

Eric

2020-04-24 20:15:07

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

On Fri, Apr 24, 2020 at 12:54 PM Eric W. Biederman
<[email protected]> wrote:
>
> The problem with
>
> remove
> remove
> add
> add
> is:
>
> A lookup that hit between the remove and the add could return nothing.

Argh. Because that thing doesn't actually _search_ the list, it just
wants to pick any (first) entry. So it doesn't actually care what it
gets, just that it gets something.

Ok, I see.

> For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
> pointers to the appropriate task. Only for PIDTYPE_PGID and PIDTYPE_SID
> do these become lists in practice.

Yeah, that's what confused me. Ok, please add a comment about the
"single-entry list" issue, and I'm happy.

Linus

2020-04-24 20:57:34

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH] proc: Put thread_pid in release_task not proc_flush_pid

Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.

Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.

When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.

Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/base.c | 1 -
kernel/exit.c | 1 +
2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6042b646ab27..42f43c7b9669 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3274,7 +3274,6 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
void proc_flush_pid(struct pid *pid)
{
proc_invalidate_siblings_dcache(&pid->inodes, &pid->lock);
- put_pid(pid);
}

static struct dentry *proc_pid_instantiate(struct dentry * dentry,
diff --git a/kernel/exit.c b/kernel/exit.c
index 389a88cb3081..ce2a75bc0ade 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -219,6 +219,7 @@ void release_task(struct task_struct *p)

write_unlock_irq(&tasklist_lock);
proc_flush_pid(thread_pid);
+ put_pid(thread_pid);
release_thread(p);
put_task_struct_rcu_user(p);

--
2.20.1