2020-05-12 00:11:05

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH -next v2 0/2] kernel/sys: reduce tasklist_lock usage get/set priorities

Hi,

The following are two patches that deal with removing the tasklist_lock entirely in
getpriority(2), and reducing the scope of it in setpriority(2). This also aligns
somewhat to what we already do with io priorities - although there both set and get
rely entirely on rcu.

Details in the changelog but this passes ltp tests. Please consider for v5.8.

Changes from v1:
- split get and set into two patches.
- improved changelog.

Thanks!

Davidlohr Bueso (2):
kernel/sys: only rely on rcu for getpriority(2)
kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS)

kernel/sys.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

--
2.26.1


2020-05-12 00:11:19

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2)

Currently the tasklist_lock is shared mainly in order to observe
the list atomically for the PRIO_PGRP and PRIO_USER cases, as
the actual lookups are already rcu-safe, providing a stable task
pointer. By removing the lock, we can race with:

(i) fork (insertion), but this is benign as the child's nice
is inherited and the actual task is not observable by the user
yet either. The return semantics do not differ.

and;

(ii) exit (deletion), this window is small but if a task is
deleted with the highest nice and it is not observed this would
cause a change in return semantics. To further reduce the window
we ignore any tasks that are PF_EXITING in the 'old' version of
the list.

The case for PRIO_PROCESS does not need the lock at all as it only
looks up the pointer.

The following raw microbenchmark improvements on a 40-core box
were seen running the stressng-get workload, which pathologically
pounds on various syscalls that get information from the kernel.
Increasing thread counts of course shows more wins, albeit probably
not something that would be seen in a real workload.

5.7.0-rc3 5.7.0-rc3
getpriority-v1
Hmean get-1 3443.65 ( 0.00%) 3314.08 * -3.76%*
Hmean get-2 7809.99 ( 0.00%) 8547.60 * 9.44%*
Hmean get-4 15498.01 ( 0.00%) 17396.85 * 12.25%*
Hmean get-8 28001.37 ( 0.00%) 31137.53 * 11.20%*
Hmean get-16 31460.88 ( 0.00%) 40284.35 * 28.05%*
Hmean get-32 30036.64 ( 0.00%) 40657.88 * 35.36%*
Hmean get-64 31429.86 ( 0.00%) 41021.73 * 30.52%*
Hmean get-80 31804.13 ( 0.00%) 39188.55 * 23.22%*

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/sys.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..0b72184f5e3e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -277,7 +277,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
return -EINVAL;

rcu_read_lock();
- read_lock(&tasklist_lock);
switch (which) {
case PRIO_PROCESS:
if (who)
@@ -296,6 +295,9 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
else
pgrp = task_pgrp(current);
do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
+ if (unlikely(p->flags & PF_EXITING))
+ continue;
+
niceval = nice_to_rlimit(task_nice(p));
if (niceval > retval)
retval = niceval;
@@ -313,6 +315,9 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
}
do_each_thread(g, p) {
if (uid_eq(task_uid(p), uid) && task_pid_vnr(p)) {
+ if (unlikely(p->flags & PF_EXITING))
+ continue;
+
niceval = nice_to_rlimit(task_nice(p));
if (niceval > retval)
retval = niceval;
@@ -323,7 +328,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
break;
}
out_unlock:
- read_unlock(&tasklist_lock);
rcu_read_unlock();

return retval;
--
2.26.1

2020-05-12 00:11:54

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS)

This option does not iterate the tasklist and we already have
an rcu aware stable pointer. Also, the nice value is not serialized
by this lock. Reduce the scope of this lock to just PRIO_PGRP
and PRIO_USER - which need to to set the priorities atomically
to the list of tasks, at least vs fork().

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/sys.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 0b72184f5e3e..f9f87775d6d2 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -214,16 +214,19 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
niceval = MAX_NICE;

rcu_read_lock();
- read_lock(&tasklist_lock);
- switch (which) {
- case PRIO_PROCESS:
+
+ if (which == PRIO_PROCESS) {
if (who)
p = find_task_by_vpid(who);
else
p = current;
if (p)
error = set_one_prio(p, niceval, error);
- break;
+ goto out_rcu;
+ }
+
+ read_lock(&tasklist_lock);
+ switch (which) {
case PRIO_PGRP:
if (who)
pgrp = find_vpid(who);
@@ -253,6 +256,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
}
out_unlock:
read_unlock(&tasklist_lock);
+out_rcu:
rcu_read_unlock();
out:
return error;
--
2.26.1

2020-05-12 15:11:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2)

On 05/11, Davidlohr Bueso wrote:
>
> Currently the tasklist_lock is shared mainly in order to observe
> the list atomically for the PRIO_PGRP and PRIO_USER cases, as
> the actual lookups are already rcu-safe,

not really...

do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID)
which moves the task from one hlist to another. Yes, it is safe in
that task_struct can't go away. But still this is not right because
do_each_pid_task() can scan the wrong (2nd) hlist.

> (ii) exit (deletion), this window is small but if a task is
> deleted with the highest nice and it is not observed this would
> cause a change in return semantics. To further reduce the window
> we ignore any tasks that are PF_EXITING in the 'old' version of
> the list.

can't understand...

could you explain in details why do you think this PF_EXITING check
makes any sense?

Oleg.

2020-05-12 16:13:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernel/sys: do not grab tasklist_lock for sys_setpriority(PRIO_PROCESS)

On 05/11, Davidlohr Bueso wrote:
>
> This option does not iterate the tasklist and we already have
> an rcu aware stable pointer. Also, the nice value is not serialized
> by this lock. Reduce the scope of this lock to just PRIO_PGRP
> and PRIO_USER - which need to to set the priorities atomically
> to the list of tasks, at least vs fork().

looks correct, but probably the PRIO_USER case can avoid tasklist too?
It can't help to avoid the race with setuid().

(PRIO_PGRP needs tasklist_lock (see my previous email) but afaics it
can race with fork anyway, it can miss the new child which was not
added to the list yet, I hope we do not care).

So I'd suggest a single patch instead of 1-2, but I still don't understand
your PF_EXITING check in 1/2.

Oleg.

--- x/kernel/sys.c
+++ x/kernel/sys.c
@@ -214,7 +214,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
niceval = MAX_NICE;

rcu_read_lock();
- read_lock(&tasklist_lock);
switch (which) {
case PRIO_PROCESS:
if (who)
@@ -229,9 +228,11 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
pgrp = find_vpid(who);
else
pgrp = task_pgrp(current);
+ read_lock(&tasklist_lock);
do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
error = set_one_prio(p, niceval, error);
} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+ read_unlock(&tasklist_lock);
break;
case PRIO_USER:
uid = make_kuid(cred->user_ns, who);
@@ -243,16 +244,15 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
if (!user)
goto out_unlock; /* No processes for this user */
}
- do_each_thread(g, p) {
+ for_each_process_thread(g, p) {
if (uid_eq(task_uid(p), uid) && task_pid_vnr(p))
error = set_one_prio(p, niceval, error);
- } while_each_thread(g, p);
+ }
if (!uid_eq(uid, cred->uid))
free_uid(user); /* For find_user() */
break;
}
out_unlock:
- read_unlock(&tasklist_lock);
rcu_read_unlock();
out:
return error;
@@ -277,7 +277,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
return -EINVAL;

rcu_read_lock();
- read_lock(&tasklist_lock);
switch (which) {
case PRIO_PROCESS:
if (who)
@@ -295,11 +294,13 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
pgrp = find_vpid(who);
else
pgrp = task_pgrp(current);
+ read_lock(&tasklist_lock);
do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
niceval = nice_to_rlimit(task_nice(p));
if (niceval > retval)
retval = niceval;
} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+ read_unlock(&tasklist_lock);
break;
case PRIO_USER:
uid = make_kuid(cred->user_ns, who);
@@ -311,19 +312,18 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
if (!user)
goto out_unlock; /* No processes for this user */
}
- do_each_thread(g, p) {
+ for_each_process_thread(g, p) {
if (uid_eq(task_uid(p), uid) && task_pid_vnr(p)) {
niceval = nice_to_rlimit(task_nice(p));
if (niceval > retval)
retval = niceval;
}
- } while_each_thread(g, p);
+ }
if (!uid_eq(uid, cred->uid))
free_uid(user); /* for find_user() */
break;
}
out_unlock:
- read_unlock(&tasklist_lock);
rcu_read_unlock();

return retval;

2020-05-12 16:16:55

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2)

On Tue, 12 May 2020, Oleg Nesterov wrote:

>On 05/11, Davidlohr Bueso wrote:
>>
>> Currently the tasklist_lock is shared mainly in order to observe
>> the list atomically for the PRIO_PGRP and PRIO_USER cases, as
>> the actual lookups are already rcu-safe,
>
>not really...
>
>do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID)
>which moves the task from one hlist to another. Yes, it is safe in
>that task_struct can't go away. But still this is not right because
>do_each_pid_task() can scan the wrong (2nd) hlist.

Hmm I didn't think about this case, I guess this is also busted in
ioprio_get(2) then.

>
>> (ii) exit (deletion), this window is small but if a task is
>> deleted with the highest nice and it is not observed this would
>> cause a change in return semantics. To further reduce the window
>> we ignore any tasks that are PF_EXITING in the 'old' version of
>> the list.
>
>can't understand...
>
>could you explain in details why do you think this PF_EXITING check
>makes any sense?

My logic was that if the task with the highest prio exited while we
were iterating the list, it would not be necessarily seen with rcu
and the syscall would return the highest prio of a task that exited;
and checking against PF_EXITING was a way to ignore such scenarios
as we were going to race with it anyway.

At this point it seems that we can just remove the lock for the
PRIO_PROCESS case.

Thanks,
Davidlohr

2020-05-12 16:45:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2)

On 05/12, Davidlohr Bueso wrote:
>
> On Tue, 12 May 2020, Oleg Nesterov wrote:
>
> >do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID)
> >which moves the task from one hlist to another. Yes, it is safe in
> >that task_struct can't go away. But still this is not right because
> >do_each_pid_task() can scan the wrong (2nd) hlist.
>
> Hmm I didn't think about this case, I guess this is also busted in
> ioprio_get(2) then.

agreed...

> >
> >could you explain in details why do you think this PF_EXITING check
> >makes any sense?
>
> My logic was that if the task with the highest prio exited while we
> were iterating the list, it would not be necessarily seen with rcu
> and the syscall would return the highest prio of a task that exited;
> and checking against PF_EXITING was a way to ignore such scenarios
> as we were going to race with it anyway.

Sorry, still can't understand. The PF_EXITING flag is not protected by
tasklist_lock or rcu_lock.


OK, if nothing else. Suppose that a prgp has a single process P, this
proces has already exited but its parent didn't do wait().

Currently getpriority() returns task_nice(P). With the PF_EXITING check
it will return -ESRCH. Hmm?

Oleg.

2020-05-12 17:06:10

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2)

On Tue, 12 May 2020, Oleg Nesterov wrote:

>On 05/12, Davidlohr Bueso wrote:
>>
>> On Tue, 12 May 2020, Oleg Nesterov wrote:
>>
>> >do_each_pid_task(PIDTYPE_PGID) can race with change_pid(PIDTYPE_PGID)
>> >which moves the task from one hlist to another. Yes, it is safe in
>> >that task_struct can't go away. But still this is not right because
>> >do_each_pid_task() can scan the wrong (2nd) hlist.
>>
>> Hmm I didn't think about this case, I guess this is also busted in
>> ioprio_get(2) then.
>
>agreed...
>
>> >
>> >could you explain in details why do you think this PF_EXITING check
>> >makes any sense?
>>
>> My logic was that if the task with the highest prio exited while we
>> were iterating the list, it would not be necessarily seen with rcu
>> and the syscall would return the highest prio of a task that exited;
>> and checking against PF_EXITING was a way to ignore such scenarios
>> as we were going to race with it anyway.
>
>Sorry, still can't understand. The PF_EXITING flag is not protected by
>tasklist_lock or rcu_lock.

Sorry for not making my idea clear, perhaps it's complete garbage.

Right, but setting the flag is an indication that the tasklist_lock
will be taken and removed from the list, and therefore we could
optimistically avoid considering that task altogether instead of
relying on the old copy of the list. It's not perfect, but it does
reduce the window in which getpriority() can return a stale value(?).

At least this is how I justify it. Otoh this also opens a window in
where the lockless version can ignore highest prio task when the locked
version would otherwise consider it. So it might not be worth it.

>
>OK, if nothing else. Suppose that a prgp has a single process P, this
>proces has already exited but its parent didn't do wait().
>
>Currently getpriority() returns task_nice(P). With the PF_EXITING check
>it will return -ESRCH. Hmm?

Yes, that would need fixing but you don't seem to be buying the idea
in the first place.

Thanks,
Davidlohr

2020-05-12 18:18:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/sys: only rely on rcu for getpriority(2)

On 05/12, Davidlohr Bueso wrote:
>
> Right, but setting the flag is an indication that the tasklist_lock
> will be taken

Yes,

> and removed from the list,

Well no. If this task is not a group leader, or if it is traced this won't
happen "soon", this will happen when parent or debugger call wait().

But this doesn't matter. Lets suppose that the task is always removed from
the list right after it sets PF_EXITING. Now,

> and therefore we could
> optimistically avoid considering that task altogether

Why?? This is what I can't understand.

If sys_getpriority() sees PF_EXITING, we can pretend it was called before
this task has exited, or even right before this flag was set. Why should we
skip this task?

And otoh, this check can not help in any case, PF_EXITING can be set right
after the check.

So I still think this check can only add the unnecessary confusion, even if
we forget about change in behaviour.

And finally, whatever I missed, I do not understand how this connects to
"avoid the tasklist_lock". Whether we want it or not does not depend on
the locking, at all.

Oleg.