2020-04-26 17:42:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu

On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <[email protected]> wrote:
>
> To support this add hlist_swap_before_rcu. An hlist primitive that
> will allow swaping the leading sections of two tasks. For exchanging
> the task pids it will just be swapping the hlist_heads of two single
> entry lists. But the functionality is more general.

So I have no problems with the series now - the code is much more
understandable. Partly because of the split-up, partly because of the
comments, and partly because you explained the special case and why it
was a valid thing to do...

However, I did start thinking about this case again.

I still don't think the "swap entry" macro is necessarily useful in
_general_ - any time it's an actual individual entry, that swap macro
doesn't really work.

So the only reason it works here is because you're actually swapping
the whole list.

But that, in turn, shouldn't be using that "first node" model at all,
it should use the hlist_head. That would have made it a lot more
obvious what is actually going on to me.

Now, the comment very much talks about the head case, but the code
still looks like it's swapping a non-head thing.

I guess the code technically _works_ with "swap two list ends", but
does that actually really make sense as an operation?

So I no longer hate how this patch looks, but I wonder if we should
just make the whole "this node is the *first* node" a bit more
explicit in both the caller and in the swapping code.

It could be as simple as replacing just the conceptual types and
names, so instead of some "pnode1" double-indirect node pointer, we'd
have

struct hlist_head *left_head = container_of(left->pprev,
struct hlist_head, first);
struct hlist_head *right_head = container_of(right->pprev,
struct hlist_head, first);

and then the code would do

rcu_assign_pointer(right_head->first, left);
rcu_assign_pointer(left_head->first, right);
WRITE_ONCE(left->pprev, &right_head->first);
WRITE_ONCE(right->pprev, &left_head->first);

which should generate the exact same code, but makes it clear that
what we're doing is switching the whole hlist when given the first
entries.

Doesn't that make what it actually does a lot more understandable? The
*pnode1/pnode2 games are somewhat opaque, but with that type and name
change and using "container_of()", the code now fairly naturally reads
as "oh, we're changing the first pointers in the list heads, and
making the nodes point back to them" .

Again - the current function _works_ with swapping two hlists in the
middle (not two entries - it swaps the whole list starting at that
entry!), so your current patch is in some ways "more generic". I'm
just suggesting that the generic case doesn't make much sense, and
that the "we know the first entries, swap the lists" actually is what
the real use is, and writing it as such makes the code easier to
understand.

But I'm not going to insist on this, so this is more an RFC. Maybe
people disagree, and/or have an actual use case for that "break two
hlists in the middle, swap the ends" that I find unlikely...

(NOTE: My "convert to hlist_head" code _works_ for that case too
because the code generation is the same! But it would be really really
confusing for that code to be used for anything but the first entry).

Linus


2020-04-27 14:34:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu

Linus Torvalds <[email protected]> writes:

> On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <[email protected]> wrote:
>>
>> To support this add hlist_swap_before_rcu. An hlist primitive that
>> will allow swaping the leading sections of two tasks. For exchanging
>> the task pids it will just be swapping the hlist_heads of two single
>> entry lists. But the functionality is more general.
>
> So I have no problems with the series now - the code is much more
> understandable. Partly because of the split-up, partly because of the
> comments, and partly because you explained the special case and why it
> was a valid thing to do...
>
> However, I did start thinking about this case again.
>
> I still don't think the "swap entry" macro is necessarily useful in
> _general_ - any time it's an actual individual entry, that swap macro
> doesn't really work.

But it isn't a "swap entry" macro/function. I did not even attempt
to make it a "swap entry" function.

I made a chop two lists into two and swap the pieces function.

> So the only reason it works here is because you're actually swapping
> the whole list.
>
> But that, in turn, shouldn't be using that "first node" model at all,
> it should use the hlist_head. That would have made it a lot more
> obvious what is actually going on to me.
>
> Now, the comment very much talks about the head case, but the code
> still looks like it's swapping a non-head thing.
>
> I guess the code technically _works_ with "swap two list ends", but
> does that actually really make sense as an operation?

As an operation yes. Will anyone else want that operation I don't know.

> So I no longer hate how this patch looks, but I wonder if we should
> just make the whole "this node is the *first* node" a bit more
> explicit in both the caller and in the swapping code.
>
> It could be as simple as replacing just the conceptual types and
> names, so instead of some "pnode1" double-indirect node pointer, we'd
> have
>
> struct hlist_head *left_head = container_of(left->pprev,
> struct hlist_head, first);
> struct hlist_head *right_head = container_of(right->pprev,
> struct hlist_head, first);
>
> and then the code would do
>
> rcu_assign_pointer(right_head->first, left);
> rcu_assign_pointer(left_head->first, right);
> WRITE_ONCE(left->pprev, &right_head->first);
> WRITE_ONCE(right->pprev, &left_head->first);
>
> which should generate the exact same code, but makes it clear that
> what we're doing is switching the whole hlist when given the first
> entries.
>
> Doesn't that make what it actually does a lot more understandable?

Understandable is a bit subjective. I think having a well defined hlist
operation I can call makes things more understandable.

I think the getting the list head as:
"head = &task->thread_pid->tasks[PIDTYPE_PID];" is more understandable
and less risky than container_of.

My concern and probably unreasonbable as this is a slow path
with getting the list heads after looking up the pid is that it seems
to add a wait for an additional cache line to load before anything can
happen.

The only way I really know to make this code much more understandable is
to remove the lists entirely for this case. But that is a much larger
change and it is not clear that it makes the kernel code overall better.
I stared at that for a while and it is an interesting follow on but not
something I want or we even can do before exchange_tids is in place.

> The
> *pnode1/pnode2 games are somewhat opaque, but with that type and name
> change and using "container_of()", the code now fairly naturally reads
> as "oh, we're changing the first pointers in the list heads, and
> making the nodes point back to them" .
>
> Again - the current function _works_ with swapping two hlists in the
> middle (not two entries - it swaps the whole list starting at that
> entry!), so your current patch is in some ways "more generic". I'm
> just suggesting that the generic case doesn't make much sense, and
> that the "we know the first entries, swap the lists" actually is what
> the real use is, and writing it as such makes the code easier to
> understand.

Yep. That is waht I designed it to do. I sort of went the other
direction when writing this. I could start with the list heads and swap
the rest of the lists and get the same code. But it looked like it
would be a little slower to find the hlist_heads, and I couldn't think
of a good name for the function. So I figured if I was writing a
fucntion for this case I would write one that was convinient.

For understandability that is my real challenge what is a good name
that people can read and understand what is happening for this swapping
function.

> But I'm not going to insist on this, so this is more an RFC. Maybe
> people disagree, and/or have an actual use case for that "break two
> hlists in the middle, swap the ends" that I find unlikely...
>
> (NOTE: My "convert to hlist_head" code _works_ for that case too
> because the code generation is the same! But it would be really really
> confusing for that code to be used for anything but the first entry).

Yes.

I am open to improvements. Especially in the naming.

Would hlists_swap_heads_rcu be noticably better?

Eric

2020-04-27 20:31:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu

On Mon, Apr 27, 2020 at 7:32 AM Eric W. Biederman <[email protected]> wrote:
>
> Would hlists_swap_heads_rcu be noticably better?

Edited out most of the rest because I think we're in agreement and
it's not a huge deal.

And yes, I think it might be nice to just call it "swap_heads()" and
make the naming match what the user wants, so that people who don't
care about the implementation can just guess from the function name.

But I also think that by now it's mostly bikeshedding, and I'm
probably also now biased by the fact that I have looked at that code
and read your explanations so it's all fresh in my mind.

A year from now, when I've forgotten the details, who knows which part
I'd find confusing? ;)

Linus

2020-04-28 12:21:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly


In the work to remove proc_mnt I noticed that we were calling
proc_flush_task now proc_flush_pid possibly multiple times for the same
pid because of how de_thread works.

This is a bare minimal patchset to sort out de_thread, by introducing
exchange_tids and the helper of exchange_tids hlists_swap_heads_rcu.

The actual call of exchange_tids should be slowpath so I have
prioritized readability over getting every last drop of performance.

I have also read through a bunch of the code to see if I could find
anything that would be affected by this change. Users of
has_group_leader_pid were a good canidates. But I also looked at other
cases that might have a pid->task->pid transition. I ignored other
sources of races with de_thread and exec as those are preexisting.

I found a close call with send_signals user of task_active_pid_ns, but
all pids of a thread group are guaranteeds to be in the same pid
namespace so there is not a problem.

I found a few pieces of debugging code that do:

task = pid_task(pid, PIDTYPE_PID);
if (task) {
printk("%u\n", task->pid);
}

But I can't see how we care if it happens at the wrong moment that
task->pid might not match pid_nr(pid);

Similarly because the code in posix-cpu-timers goes pid->task->pid it
feels like there should be a problem. But as the code that works with
PIDTYPE_PID is only available within the thread group, and as de_thread
kills all of the other threads before it makes any changes of this
kind the race can not happen.

In short I don't think this change will introduce any regressions.

Eric W. Biederman (2):
rculist: Add hlists_swap_heads_rcu
proc: Ensure we see the exit of each process tid exactly once

fs/exec.c | 5 +----
include/linux/pid.h | 1 +
include/linux/rculist.h | 21 +++++++++++++++++++++
kernel/pid.c | 19 +++++++++++++++++++
4 files changed, 42 insertions(+), 4 deletions(-)

Eric

2020-04-28 12:23:26

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v4 1/2] rculist: Add hlists_swap_heads_rcu


Using the struct pid to refer to two tasks in de_thread was a clever
idea and ultimately too clever, as it has lead to proc_flush_task
being called inconsistently.

To support rectifying this add hlists_swap_heads_rcu. An hlist
primitive that just swaps the hlist heads of two lists. This is
exactly what is needed for exchanging the pids of two tasks.

Only consideration of correctness of the code has been given,
as the caller is expected to be a slowpath.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/rculist.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8214cdc715f2..67867e0d4cec 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -506,6 +506,27 @@ static inline void hlist_replace_rcu(struct hlist_node *old,
WRITE_ONCE(old->pprev, LIST_POISON2);
}

+/**
+ * hlists_swap_heads_rcu - swap the lists the hlist heads point to
+ * @left: The hlist head on the left
+ * @right: The hlist head on the right
+ *
+ * The lists start out as [@left ][node1 ... ] and
+ [@right ][node2 ... ]
+ * The lists end up as [@left ][node2 ... ]
+ * [@right ][node1 ... ]
+ */
+static inline void hlists_swap_heads_rcu(struct hlist_head *left, struct hlist_head *right)
+{
+ struct hlist_node *node1 = left->first;
+ struct hlist_node *node2 = right->first;
+
+ rcu_assign_pointer(left->first, node2);
+ rcu_assign_pointer(right->first, node1);
+ WRITE_ONCE(node2->pprev, &left->first);
+ WRITE_ONCE(node1->pprev, &right->first);
+}
+
/*
* return the first or the next element in an RCU protected hlist
*/
--
2.20.1

2020-04-28 12:24:59

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v4 2/2] proc: Ensure we see the exit of each process tid exactly once


When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization"). Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently. Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
has_group_leader_pid is called, and reording some of the assignments
should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
> /*
> * This can only happen if the caller is de_thread().
> * FIXME: this is the temporary hack, we should teach
> * posix-cpu-timers to handle this case correctly.
> */
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/exec.c | 5 +----
include/linux/pid.h | 1 +
kernel/pid.c | 19 +++++++++++++++++++
3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..9b60f927afd7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,11 +1186,8 @@ static int de_thread(struct task_struct *tsk)

/* Become a process group leader with the old leader's pid.
* The old leader becomes a thread of the this thread group.
- * Note: The old leader also uses this pid until release_task
- * is called. Odd but simple and correct.
*/
- tsk->pid = leader->pid;
- change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+ exchange_tids(tsk, leader);
transfer_pid(leader, tsk, PIDTYPE_TGID);
transfer_pid(leader, tsk, PIDTYPE_PGID);
transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..2159ffca63fc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -102,6 +102,7 @@ extern void attach_pid(struct task_struct *task, enum pid_type);
extern void detach_pid(struct task_struct *task, enum pid_type);
extern void change_pid(struct task_struct *task, enum pid_type,
struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
extern void transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type);

diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..6d5d0a5bda82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -363,6 +363,25 @@ void change_pid(struct task_struct *task, enum pid_type type,
attach_pid(task, type);
}

+void exchange_tids(struct task_struct *left, struct task_struct *right)
+{
+ struct pid *pid1 = left->thread_pid;
+ struct pid *pid2 = right->thread_pid;
+ struct hlist_head *head1 = &pid1->tasks[PIDTYPE_PID];
+ struct hlist_head *head2 = &pid2->tasks[PIDTYPE_PID];
+
+ /* Swap the single entry tid lists */
+ hlists_swap_heads_rcu(head1, head2);
+
+ /* Swap the per task_struct pid */
+ rcu_assign_pointer(left->thread_pid, pid2);
+ rcu_assign_pointer(right->thread_pid, pid1);
+
+ /* Swap the cached value */
+ WRITE_ONCE(left->pid, pid_nr(pid2));
+ WRITE_ONCE(right->pid, pid_nr(pid1));
+}
+
/* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
void transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type type)
--
2.20.1

2020-04-28 16:56:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly

On Tue, Apr 28, 2020 at 5:20 AM Eric W. Biederman <[email protected]> wrote:
>
> In short I don't think this change will introduce any regressions.

I think the series looks fine, but I also think the long explanation
(that I snipped in this reply) in the cover letter should be there in
the kernel tree.

So if you send me this as a single pull request, with that explanation
(either in the email or in the signed tag - although you don't seem to
use tags normally - so that we have that extra commentary for
posterity, that sounds good.

That said, this fix seems to not matter for normal operation, so
unless it's holding up something important, maybe it's 5.8 material?

Linus

2020-04-28 18:00:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly

Linus Torvalds <[email protected]> writes:

> On Tue, Apr 28, 2020 at 5:20 AM Eric W. Biederman <[email protected]> wrote:
>>
>> In short I don't think this change will introduce any regressions.
>
> I think the series looks fine, but I also think the long explanation
> (that I snipped in this reply) in the cover letter should be there in
> the kernel tree.

When I have been adding patchsets like this to my tree I have been doing
merge --no-ff so I can create a place for explanations like this, and I
will do the same with this.

I already have Alexey Gladkov's proc changes, and my next_tgid cleanup
on a branch of proc changes in my tree already.

> So if you send me this as a single pull request, with that explanation
> (either in the email or in the signed tag - although you don't seem to
> use tags normally - so that we have that extra commentary for
> posterity, that sounds good.

I hope you don't mind if I combind this with some other proc changes.
If you do mind I will put this on a separate topic branch.

Right now it just seems easier for me to keep track of if I keep my
number of topics limited.

> That said, this fix seems to not matter for normal operation, so
> unless it's holding up something important, maybe it's 5.8 material?

Yes, this is 5.8 material.

I am just aiming to get review before I put in linux-next, and later
send it to your for merging. I should have mentioned that in the cover
letter.

I am noticing that removing technical debt without adding more technical
debt is quite a challenge.

Eric


2020-04-28 18:08:52

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly

On 04/28, Eric W. Biederman wrote:
>
> In short I don't think this change will introduce any regressions.
>
> Eric W. Biederman (2):
> rculist: Add hlists_swap_heads_rcu
> proc: Ensure we see the exit of each process tid exactly once

Eric, sorry, I got lost.

Both changes look good to me, feel free to add my ack, but I presume
this is on top of next_tgid/lookup_task changes ? If yes, why did not
you remove has_group_leader_pid?

Oleg.

2020-04-28 18:59:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly

Oleg Nesterov <[email protected]> writes:

> On 04/28, Eric W. Biederman wrote:
>>
>> In short I don't think this change will introduce any regressions.
>>
>> Eric W. Biederman (2):
>> rculist: Add hlists_swap_heads_rcu
>> proc: Ensure we see the exit of each process tid exactly once
>
> Eric, sorry, I got lost.
>
> Both changes look good to me, feel free to add my ack, but I presume
> this is on top of next_tgid/lookup_task changes ? If yes, why did not
> you remove has_group_leader_pid?

On top of next_tgid.

Upon a close examination there are not any current bugs in
posix-cpu-timers nor is there anything that exchange_tids
will make worse.

I am preparing a follow on patchset to kill has_group_leader_pid.

I am preparing a further follow on patchset to see if I can get that
code to start returning pids, because that is cheaper and clearer.


I pushed those changes a little farther out so I could maintain focus on
what I am accomplishing.

Adding exchange_tids was difficult because I had to audit pretty much
all of the pid use in the kernel to see if the small change in behavior
would make anything worse. The rest of the changes should be simpler
and more localized so I hope they go faster.

Eric

2020-04-28 19:01:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly

Linus Torvalds <[email protected]> writes:

> On Tue, Apr 28, 2020 at 5:20 AM Eric W. Biederman <[email protected]> wrote:
>>
>> In short I don't think this change will introduce any regressions.
>
> I think the series looks fine.

Mind if I translate that into

Acked-by: Linus Torvalds <[email protected]>
on the patches?

Eric

2020-04-28 19:38:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly

On Tue, Apr 28, 2020 at 11:59 AM Eric W. Biederman
<[email protected]> wrote:
>
> Linus Torvalds <[email protected]> writes:
> >
> > I think the series looks fine.
>
> Mind if I translate that into
>
> Acked-by: Linus Torvalds <[email protected]>
> on the patches?

Sure, go right ahead.

Linus

2020-04-28 21:49:39

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 0/4] signal: Removing has_group_leader_pid


With de_thread now calling exchange_tids has_group_leader_pid no longer
makes any sense and is equivalent to calling thread_group_leader.

As there are only 2 remaining users of has_group_leader_pid let's
update the code and get rid of has_group_leader_pid.

There is one extra patch to lookup_task that performs that unifies
to code paths that become identical when has_group_leader_pid went
away.

Eric W. Biederman (4):
posix-cpu-timer: Tidy up group_leader logic in lookup_task
posix-cpu-timer: Unify the now redundant code in lookup_task
exec: Remove BUG_ON(has_group_leader_pid)
signal: Remove has_group_leader_pid

fs/exec.c | 1 -
include/linux/sched/signal.h | 11 -----------
kernel/time/posix-cpu-timers.c | 21 ++++++++-------------
3 files changed, 8 insertions(+), 25 deletions(-)

2020-04-28 21:55:01

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 1/4] posix-cpu-timer: Tidy up group_leader logic in lookup_task


Replace has_group_leader_pid with thread_group_leader. Years ago Oleg
suggested changing thread_group_leader to has_group_leader_pid to handle
races. Looking at the code then and now I don't see how it ever helped.
Especially as then the code really did need to be the
thread_group_leader.

Today it doesn't make a difference if thread_group_leader races with
de_thread as the task returned from lookup_task in the non-thread case
is just used to find values in task->signal.

Since the races with de_thread have never been handled revert
has_group_header_pid to thread_group_leader for clarity.

Update the comment in lookup_task to remove implementation details that
are no longer true and to mention task->signal instead of task->sighand,
as the relevant cpu timer details are all in task->signal.

Ref: 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")
Ref: c0deae8c9587 ("posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/time/posix-cpu-timers.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 2fd3b3fa68bf..e4051e417bcb 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -69,12 +69,8 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
if (gettime) {
/*
* For clock_gettime(PROCESS) the task does not need to be
- * the actual group leader. tsk->sighand gives
+ * the actual group leader. task->signal gives
* access to the group's clock.
- *
- * Timers need the group leader because they take a
- * reference on it and store the task pointer until the
- * timer is destroyed.
*/
return (p == current || thread_group_leader(p)) ? p : NULL;
}
@@ -82,7 +78,7 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
/*
* For processes require that p is group leader.
*/
- return has_group_leader_pid(p) ? p : NULL;
+ return thread_group_leader(p) ? p : NULL;
}

static struct task_struct *__get_task_for_clock(const clockid_t clock,
--
2.20.1

2020-04-28 21:56:38

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/4] posix-cpu-timer: Unify the now redundant code in lookup_task


Now that both !thread paths through lookup_task call
thread_group_leader, unify them into the single test at the end of
lookup_task.

This unification just makes it clear what is happening in the gettime
special case of lookup_task.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/time/posix-cpu-timers.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index e4051e417bcb..b7f972fb115e 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -66,14 +66,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
if (thread)
return same_thread_group(p, current) ? p : NULL;

- if (gettime) {
- /*
- * For clock_gettime(PROCESS) the task does not need to be
- * the actual group leader. task->signal gives
- * access to the group's clock.
- */
- return (p == current || thread_group_leader(p)) ? p : NULL;
- }
+ /*
+ * For clock_gettime(PROCESS) the task does not need to be
+ * the actual group leader. task->signal gives
+ * access to the group's clock.
+ */
+ if (gettime && (p == current))
+ return p;

/*
* For processes require that p is group leader.
--
2.20.1

2020-04-28 21:58:51

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 3/4] exec: Remove BUG_ON(has_group_leader_pid)


With the introduction of exchange_tids thread_group_leader and
has_group_leader_pid have become equivalent. Further at this point in the
code a thread group has exactly two threads, the previous thread_group_leader
that is waiting to be reaped and tsk. So we know it is impossible for tsk to
be the thread_group_leader.

This is also the last user of has_group_leader_pid so removing this check
will allow has_group_leader_pid to be removed.

So remove the "BUG_ON(has_group_leader_pid)" that will never fire.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 9b60f927afd7..6ab1c19d84fa 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1176,7 +1176,6 @@ static int de_thread(struct task_struct *tsk)
tsk->start_boottime = leader->start_boottime;

BUG_ON(!same_thread_group(leader, tsk));
- BUG_ON(has_group_leader_pid(tsk));
/*
* An exec() starts a new thread group with the
* TGID of the previous thread group. Rehash the
--
2.20.1

2020-04-28 22:01:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v4 4/4] signal: Remove has_group_leader_pid


After the introduction of exchange_tids has_group_leader_pid is
equivalent to thread_group_leader. After the last couple of cleanups
has_group_leader_pid has no more callers.

So remove the now unused and redundant has_group_leader_pid.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3e5b090c16d4..0ee5e696c5d8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -654,17 +654,6 @@ static inline bool thread_group_leader(struct task_struct *p)
return p->exit_signal >= 0;
}

-/* Do to the insanities of de_thread it is possible for a process
- * to have the pid of the thread group leader without actually being
- * the thread group leader. For iteration through the pids in proc
- * all we care about is that we have a task with the appropriate
- * pid, we don't actually care if we have the right task.
- */
-static inline bool has_group_leader_pid(struct task_struct *p)
-{
- return task_pid(p) == task_tgid(p);
-}
-
static inline
bool same_thread_group(struct task_struct *p1, struct task_struct *p2)
{
--
2.20.1

2020-04-30 11:59:30

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup


The current posix-cpu-timer code uses pids when holding persistent
references in timers. However the lookups from clockid_t still return
tasks that need to be converted into pids for use.

This results in usage being pid->task->pid and that can race with
release_task and de_thread. This can lead to some not wrong but
surprising results. Surprising enough that Oleg and I both thought
there were some bugs in the code for a while.

This set of changes modifies the code to just lookup, verify, and return
pids from the clockid_t lookups to remove those potentialy troublesome
races.

Eric W. Biederman (3):
posix-cpu-timers: Extend rcu_read_lock removing task_struct references
posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock

kernel/time/posix-cpu-timers.c | 102 ++++++++++++++++++-----------------------
1 file changed, 45 insertions(+), 57 deletions(-)

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

---

The changes can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git proc-testing

2020-04-30 12:01:30

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 2/3] posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type


Taking a clock and returning a pid_type is a more general and
a superset of taking a timer and returning a pid_type.

Perform this generalization so that future changes may use
this code on clocks as well as timers.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/time/posix-cpu-timers.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 91996dd089a4..42f673974d71 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -113,14 +113,14 @@ static inline int validate_clock_permissions(const clockid_t clock)
return ret;
}

-static inline enum pid_type cpu_timer_pid_type(struct k_itimer *timer)
+static inline enum pid_type clock_pid_type(const clockid_t clock)
{
- return CPUCLOCK_PERTHREAD(timer->it_clock) ? PIDTYPE_PID : PIDTYPE_TGID;
+ return CPUCLOCK_PERTHREAD(clock) ? PIDTYPE_PID : PIDTYPE_TGID;
}

static inline struct task_struct *cpu_timer_task_rcu(struct k_itimer *timer)
{
- return pid_task(timer->it.cpu.pid, cpu_timer_pid_type(timer));
+ return pid_task(timer->it.cpu.pid, clock_pid_type(timer->it_clock));
}

/*
@@ -403,7 +403,7 @@ static int posix_cpu_timer_create(struct k_itimer *new_timer)

new_timer->kclock = &clock_posix_cpu;
timerqueue_init(&new_timer->it.cpu.node);
- new_timer->it.cpu.pid = get_task_pid(p, cpu_timer_pid_type(new_timer));
+ new_timer->it.cpu.pid = get_task_pid(p, clock_pid_type(new_timer->it_clock));
rcu_read_unlock();
return 0;
}
--
2.25.0

2020-04-30 12:02:35

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 3/3] posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock


Now that the codes store references to pids instead of referendes to
tasks. Looking up a task for a clock instead of looking up a struct
pid makes the code more difficult to verify it is correct than
necessary.

In posix_cpu_timers_create get_task_pid can race with release_task for
threads and return a NULL pid. As put_pid and cpu_timer_task_rcu
handle NULL pids just fine the code works without problems but it is
an extra case to consider and keep in mind while verifying and
modifying the code.

There are races with de_thread to consider that only don't apply
because thread clocks are only allowed for threads in the same
thread_group.

So instead of leaving a burden for people making modification to the
code in the future return a rcu protected struct pid for the clock
instead.

The logic for __get_task_for_pid and lookup_task has been folded into
the new function pid_for_clock with the only change being the logic
has been modified from working on a task to working on a pid that
will be returned.

In posix_cpu_clock_get instead of calling pid_for_clock checking the
result and then calling pid_task to get the task. The result of
pid_for_clock is fed directly into pid_task. This is safe because
pid_task handles NULL pids. As such an extra error check was
unnecessary.

Instead of hiding the flag that enables the special clock_gettime
handling, I have made the 3 callers just pass the flag in themselves.
That is less code and seems just as simple to work with as the
wrapper functions.

Historically the clock_gettime special case of allowing a process
clock to be found by the thread id did not even exist [33ab0fec3352]
but Thomas Gleixner reports that he has found code that uses that
functionality [55e8c8eb2c7b].

Link: https://lkml.kernel.org/r/[email protected]/
Ref: 33ab0fec3352 ("posix-timers: Consolidate posix_cpu_clock_get()")
Ref: 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/time/posix-cpu-timers.c | 75 ++++++++++++++--------------------
1 file changed, 30 insertions(+), 45 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 42f673974d71..165117996ea0 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -47,59 +47,44 @@ void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new)
/*
* Functions for validating access to tasks.
*/
-static struct task_struct *lookup_task(const pid_t pid, bool thread,
- bool gettime)
+static struct pid *pid_for_clock(const clockid_t clock, bool gettime)
{
- struct task_struct *p;
+ const bool thread = !!CPUCLOCK_PERTHREAD(clock);
+ const pid_t upid = CPUCLOCK_PID(clock);
+ struct pid *pid;
+
+ if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
+ return NULL;

/*
* If the encoded PID is 0, then the timer is targeted at current
* or the process to which current belongs.
*/
- if (!pid)
- return thread ? current : current->group_leader;
+ if (upid == 0)
+ return thread ? task_pid(current) : task_tgid(current);

- p = find_task_by_vpid(pid);
- if (!p)
- return p;
+ pid = find_vpid(upid);
+ if (!pid)
+ return NULL;

- if (thread)
- return same_thread_group(p, current) ? p : NULL;
+ if (thread) {
+ struct task_struct *tsk = pid_task(pid, PIDTYPE_PID);
+ return (tsk && same_thread_group(tsk, current)) ? pid : NULL;
+ }

/*
- * For clock_gettime(PROCESS) the task does not need to be
- * the actual group leader. task->signal gives
- * access to the group's clock.
+ * For clock_gettime(PROCESS) allow finding the process by
+ * with the pid of the current task. The code needs the tgid
+ * of the process so that pid_task(pid, PIDTYPE_TGID) can be
+ * used to find the process.
*/
- if (gettime && (p == current))
- return p;
+ if (gettime && (pid == task_pid(current)))
+ return task_tgid(current);

/*
- * For processes require that p is group leader.
+ * For processes require that pid identifies a process.
*/
- return thread_group_leader(p) ? p : NULL;
-}
-
-static struct task_struct *__get_task_for_clock(const clockid_t clock,
- bool gettime)
-{
- const bool thread = !!CPUCLOCK_PERTHREAD(clock);
- const pid_t pid = CPUCLOCK_PID(clock);
-
- if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
- return NULL;
-
- return lookup_task(pid, thread, gettime);
-}
-
-static inline struct task_struct *get_task_for_clock(const clockid_t clock)
-{
- return __get_task_for_clock(clock, false);
-}
-
-static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
-{
- return __get_task_for_clock(clock, true);
+ return pid_has_task(pid, PIDTYPE_TGID) ? pid : NULL;
}

static inline int validate_clock_permissions(const clockid_t clock)
@@ -107,7 +92,7 @@ static inline int validate_clock_permissions(const clockid_t clock)
int ret;

rcu_read_lock();
- ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
+ ret = pid_for_clock(clock, false) ? 0 : -EINVAL;
rcu_read_unlock();

return ret;
@@ -369,7 +354,7 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
u64 t;

rcu_read_lock();
- tsk = get_task_for_clock_get(clock);
+ tsk = pid_task(pid_for_clock(clock, true), clock_pid_type(clock));
if (!tsk) {
rcu_read_unlock();
return -EINVAL;
@@ -392,18 +377,18 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
*/
static int posix_cpu_timer_create(struct k_itimer *new_timer)
{
- struct task_struct *p;
+ struct pid *pid;

rcu_read_lock();
- p = get_task_for_clock(new_timer->it_clock);
- if (!p) {
+ pid = pid_for_clock(new_timer->it_clock, false);
+ if (!pid) {
rcu_read_unlock();
return -EINVAL;
}

new_timer->kclock = &clock_posix_cpu;
timerqueue_init(&new_timer->it.cpu.node);
- new_timer->it.cpu.pid = get_task_pid(p, clock_pid_type(new_timer->it_clock));
+ new_timer->it.cpu.pid = get_pid(pid);
rcu_read_unlock();
return 0;
}
--
2.25.0

2020-04-30 12:04:06

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v1 1/3] posix-cpu-timers: Extend rcu_read_lock removing task_struct references


Now that the code stores of pid references it is no longer necessary
or desirable to take a reference on task_struct in __get_task_for_clock.

Instead extend the scope of rcu_read_lock and remove the reference
counting on struct task_struct entirely.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/time/posix-cpu-timers.c | 43 ++++++++++++++++++----------------
1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index b7f972fb115e..91996dd089a4 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -81,36 +81,36 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
}

static struct task_struct *__get_task_for_clock(const clockid_t clock,
- bool getref, bool gettime)
+ bool gettime)
{
const bool thread = !!CPUCLOCK_PERTHREAD(clock);
const pid_t pid = CPUCLOCK_PID(clock);
- struct task_struct *p;

if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
return NULL;

- rcu_read_lock();
- p = lookup_task(pid, thread, gettime);
- if (p && getref)
- get_task_struct(p);
- rcu_read_unlock();
- return p;
+ return lookup_task(pid, thread, gettime);
}

static inline struct task_struct *get_task_for_clock(const clockid_t clock)
{
- return __get_task_for_clock(clock, true, false);
+ return __get_task_for_clock(clock, false);
}

static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
{
- return __get_task_for_clock(clock, true, true);
+ return __get_task_for_clock(clock, true);
}

static inline int validate_clock_permissions(const clockid_t clock)
{
- return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL;
+ int ret;
+
+ rcu_read_lock();
+ ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
+ rcu_read_unlock();
+
+ return ret;
}

static inline enum pid_type cpu_timer_pid_type(struct k_itimer *timer)
@@ -368,15 +368,18 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
struct task_struct *tsk;
u64 t;

+ rcu_read_lock();
tsk = get_task_for_clock_get(clock);
- if (!tsk)
+ if (!tsk) {
+ rcu_read_unlock();
return -EINVAL;
+ }

if (CPUCLOCK_PERTHREAD(clock))
t = cpu_clock_sample(clkid, tsk);
else
t = cpu_clock_sample_group(clkid, tsk, false);
- put_task_struct(tsk);
+ rcu_read_unlock();

*tp = ns_to_timespec64(t);
return 0;
@@ -389,19 +392,19 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
*/
static int posix_cpu_timer_create(struct k_itimer *new_timer)
{
- struct task_struct *p = get_task_for_clock(new_timer->it_clock);
+ struct task_struct *p;

- if (!p)
+ rcu_read_lock();
+ p = get_task_for_clock(new_timer->it_clock);
+ if (!p) {
+ rcu_read_unlock();
return -EINVAL;
+ }

new_timer->kclock = &clock_posix_cpu;
timerqueue_init(&new_timer->it.cpu.node);
new_timer->it.cpu.pid = get_task_pid(p, cpu_timer_pid_type(new_timer));
- /*
- * get_task_for_clock() took a reference on @p. Drop it as the timer
- * holds a reference on the pid of @p.
- */
- put_task_struct(p);
+ rcu_read_unlock();
return 0;
}

--
2.25.0