2022-01-05 21:31:34

by Barret Rhoden

Subject: [PATCH v2 0/3] prlimit and set/getpriority tasklist_lock optimizations

The tasklist_lock popped up as a scalability bottleneck on some testing
workloads. The read locks in do_prlimit() and set/getpriority() are not
necessary in all cases.

Based on a cycles profile, it looked like ~87% of the time was spent in
the kernel, ~42% of which was just trying to get *some* spinlock
(queued_spin_lock_slowpath, not necessarily the tasklist_lock).

The big offenders (with rough percentages in cycles of the overall trace):

- do_wait 11%
- setpriority 8% (this patchset)
- kill 8%
- do_exit 5%
- clone 3%
- prlimit64 2% (this patchset)
- getrlimit 1% (this patchset)

I can't easily test this patchset on the original workload for various
reasons. Instead, I used the microbenchmark below to at least verify
there was some improvement. This patchset gave a 28% speedup (12% from
baseline to set/getpriority, then another 14% from prlimit).

One interesting thing is that my libc's getrlimit() was calling
prlimit64, so hoisting the read_lock(tasklist_lock) into sys_prlimit64
had no effect - it essentially optimized only the older syscalls. I
didn't do that in this patchset, but figured I'd mention it, since it
was an option raised in the discussion of the previous patch.

v1: https://lore.kernel.org/lkml/[email protected]/

#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	pid_t child;
	struct rlimit rlim[1];

	fork(); fork(); fork(); fork(); fork(); fork();

	for (int i = 0; i < 5000; i++) {
		child = fork();
		if (child < 0)
			exit(1);
		if (child > 0) {
			usleep(1000);
			kill(child, SIGTERM);
			waitpid(child, NULL, 0);
		} else {
			for (;;) {
				setpriority(PRIO_PROCESS, 0,
					    getpriority(PRIO_PROCESS, 0));
				getrlimit(RLIMIT_CPU, rlim);
			}
		}
	}

	return 0;
}


Barret Rhoden (3):
setpriority: only grab the tasklist_lock for PRIO_PGRP
prlimit: make do_prlimit() static
prlimit: do not grab the tasklist_lock

 include/linux/posix-timers.h   |   2 +-
 include/linux/resource.h       |   2 -
 kernel/sys.c                   | 134 ++++++++++++++++++---------------
 kernel/time/posix-cpu-timers.c |  12 ++-
 4 files changed, 83 insertions(+), 67 deletions(-)

--
2.34.1.448.ga2b2bfdf31-goog



2022-01-05 21:31:36

by Barret Rhoden

Subject: [PATCH v2 1/3] setpriority: only grab the tasklist_lock for PRIO_PGRP

The tasklist_lock is necessary only for PRIO_PGRP for both setpriority()
and getpriority().

Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
for workloads that also must grab the tasklist_lock for waiting,
killing, and cloning.

This change resulted in a 12% speedup on a microbenchmark where parents
kill and wait on their children, and children getpriority, setpriority,
and getrlimit.

Signed-off-by: Barret Rhoden <[email protected]>
---
kernel/sys.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 8fdac0d90504..558e52fa5bbd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -220,7 +220,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 		niceval = MAX_NICE;
 
 	rcu_read_lock();
-	read_lock(&tasklist_lock);
 	switch (which) {
 	case PRIO_PROCESS:
 		if (who)
@@ -231,6 +230,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 			error = set_one_prio(p, niceval, error);
 		break;
 	case PRIO_PGRP:
+		read_lock(&tasklist_lock);
 		if (who)
 			pgrp = find_vpid(who);
 		else
@@ -238,6 +238,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 		do_each_pid_thread(pgrp, PIDTYPE_PGID, p) {
 			error = set_one_prio(p, niceval, error);
 		} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+		read_unlock(&tasklist_lock);
 		break;
 	case PRIO_USER:
 		uid = make_kuid(cred->user_ns, who);
@@ -258,7 +259,6 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 		break;
 	}
 out_unlock:
-	read_unlock(&tasklist_lock);
 	rcu_read_unlock();
 out:
 	return error;
@@ -283,7 +283,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 		return -EINVAL;
 
 	rcu_read_lock();
-	read_lock(&tasklist_lock);
 	switch (which) {
 	case PRIO_PROCESS:
 		if (who)
@@ -297,6 +296,7 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 		}
 		break;
 	case PRIO_PGRP:
+		read_lock(&tasklist_lock);
 		if (who)
 			pgrp = find_vpid(who);
 		else
@@ -306,6 +306,7 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 			if (niceval > retval)
 				retval = niceval;
 		} while_each_pid_thread(pgrp, PIDTYPE_PGID, p);
+		read_unlock(&tasklist_lock);
 		break;
 	case PRIO_USER:
 		uid = make_kuid(cred->user_ns, who);
@@ -329,7 +330,6 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 		break;
 	}
 out_unlock:
-	read_unlock(&tasklist_lock);
 	rcu_read_unlock();
 
 	return retval;
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-05 21:31:37

by Barret Rhoden

Subject: [PATCH v2 2/3] prlimit: make do_prlimit() static

There are no other callers in the kernel.

Fixed up a comment format and whitespace issue when moving do_prlimit()
higher in sys.c.

Signed-off-by: Barret Rhoden <[email protected]>
---
include/linux/resource.h | 2 -
kernel/sys.c | 116 ++++++++++++++++++++-------------------
2 files changed, 59 insertions(+), 59 deletions(-)

diff --git a/include/linux/resource.h b/include/linux/resource.h
index bdf491cbcab7..4fdbc0c3f315 100644
--- a/include/linux/resource.h
+++ b/include/linux/resource.h
@@ -8,7 +8,5 @@
 struct task_struct;
 
 void getrusage(struct task_struct *p, int who, struct rusage *ru);
-int do_prlimit(struct task_struct *tsk, unsigned int resource,
-		struct rlimit *new_rlim, struct rlimit *old_rlim);
 
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index 558e52fa5bbd..fb2a5e7c0589 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1415,6 +1415,65 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
 	return errno;
 }
 
+/* make sure you are allowed to change @tsk limits before calling this */
+static int do_prlimit(struct task_struct *tsk, unsigned int resource,
+		      struct rlimit *new_rlim, struct rlimit *old_rlim)
+{
+	struct rlimit *rlim;
+	int retval = 0;
+
+	if (resource >= RLIM_NLIMITS)
+		return -EINVAL;
+	if (new_rlim) {
+		if (new_rlim->rlim_cur > new_rlim->rlim_max)
+			return -EINVAL;
+		if (resource == RLIMIT_NOFILE &&
+		    new_rlim->rlim_max > sysctl_nr_open)
+			return -EPERM;
+	}
+
+	/* protect tsk->signal and tsk->sighand from disappearing */
+	read_lock(&tasklist_lock);
+	if (!tsk->sighand) {
+		retval = -ESRCH;
+		goto out;
+	}
+
+	rlim = tsk->signal->rlim + resource;
+	task_lock(tsk->group_leader);
+	if (new_rlim) {
+		/*
+		 * Keep the capable check against init_user_ns until cgroups can
+		 * contain all limits.
+		 */
+		if (new_rlim->rlim_max > rlim->rlim_max &&
+		    !capable(CAP_SYS_RESOURCE))
+			retval = -EPERM;
+		if (!retval)
+			retval = security_task_setrlimit(tsk, resource, new_rlim);
+	}
+	if (!retval) {
+		if (old_rlim)
+			*old_rlim = *rlim;
+		if (new_rlim)
+			*rlim = *new_rlim;
+	}
+	task_unlock(tsk->group_leader);
+
+	/*
+	 * RLIMIT_CPU handling. Arm the posix CPU timer if the limit is not
+	 * infinite. In case of RLIM_INFINITY the posix CPU timer code
+	 * ignores the rlimit.
+	 */
+	if (!retval && new_rlim && resource == RLIMIT_CPU &&
+	    new_rlim->rlim_cur != RLIM_INFINITY &&
+	    IS_ENABLED(CONFIG_POSIX_TIMERS))
+		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
+out:
+	read_unlock(&tasklist_lock);
+	return retval;
+}
+
 SYSCALL_DEFINE2(getrlimit, unsigned int, resource, struct rlimit __user *, rlim)
 {
 	struct rlimit value;
@@ -1558,63 +1617,6 @@ static void rlim64_to_rlim(const struct rlimit64 *rlim64, struct rlimit *rlim)
 	rlim->rlim_max = (unsigned long)rlim64->rlim_max;
 }
 
-/* make sure you are allowed to change @tsk limits before calling this */
-int do_prlimit(struct task_struct *tsk, unsigned int resource,
-		struct rlimit *new_rlim, struct rlimit *old_rlim)
-{
-	struct rlimit *rlim;
-	int retval = 0;
-
-	if (resource >= RLIM_NLIMITS)
-		return -EINVAL;
-	if (new_rlim) {
-		if (new_rlim->rlim_cur > new_rlim->rlim_max)
-			return -EINVAL;
-		if (resource == RLIMIT_NOFILE &&
-		    new_rlim->rlim_max > sysctl_nr_open)
-			return -EPERM;
-	}
-
-	/* protect tsk->signal and tsk->sighand from disappearing */
-	read_lock(&tasklist_lock);
-	if (!tsk->sighand) {
-		retval = -ESRCH;
-		goto out;
-	}
-
-	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
-	if (new_rlim) {
-		/* Keep the capable check against init_user_ns until
-		   cgroups can contain all limits */
-		if (new_rlim->rlim_max > rlim->rlim_max &&
-		    !capable(CAP_SYS_RESOURCE))
-			retval = -EPERM;
-		if (!retval)
-			retval = security_task_setrlimit(tsk, resource, new_rlim);
-	}
-	if (!retval) {
-		if (old_rlim)
-			*old_rlim = *rlim;
-		if (new_rlim)
-			*rlim = *new_rlim;
-	}
-	task_unlock(tsk->group_leader);
-
-	/*
-	 * RLIMIT_CPU handling. Arm the posix CPU timer if the limit is not
-	 * infinite. In case of RLIM_INFINITY the posix CPU timer code
-	 * ignores the rlimit.
-	 */
-	if (!retval && new_rlim && resource == RLIMIT_CPU &&
-	    new_rlim->rlim_cur != RLIM_INFINITY &&
-	    IS_ENABLED(CONFIG_POSIX_TIMERS))
-		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
-out:
-	read_unlock(&tasklist_lock);
-	return retval;
-}
-
 /* rcu lock must be held */
 static int check_prlimit_permission(struct task_struct *task,
 				    unsigned int flags)
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-05 21:31:48

by Barret Rhoden

Subject: [PATCH v2 3/3] prlimit: do not grab the tasklist_lock

Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
for workloads that also must grab the tasklist_lock for waiting,
killing, and cloning.

The tasklist_lock was grabbed to protect tsk->sighand from disappearing
(becoming NULL). tsk->signal was already protected by holding a
reference to tsk.

update_rlimit_cpu() assumed tsk->sighand != NULL. With this commit, it
attempts to lock_task_sighand(). However, this means that
update_rlimit_cpu() can fail. This only happens when a task is exiting.
Note that during exec, sighand may *change*, but it will not be NULL.

Prior to this commit, do_prlimit() ensured that update_rlimit_cpu()
would not fail by read locking the tasklist_lock and checking tsk->sighand
!= NULL.

If update_rlimit_cpu() fails, there may be other tasks that are not
exiting that share tsk->signal. We need to run update_rlimit_cpu() on
one of them. We can't "back out" the new rlim - once we unlocked
task_lock(group_leader), the rlim is essentially changed.

The only other caller of update_rlimit_cpu() is
selinux_bprm_committing_creds(). It has tsk == current, so
update_rlimit_cpu() cannot fail (current->sighand cannot disappear
until current exits).

This change resulted in a 14% speedup on a microbenchmark where parents
kill and wait on their children, and children getpriority, setpriority,
and getrlimit.

Signed-off-by: Barret Rhoden <[email protected]>
---
include/linux/posix-timers.h | 2 +-
kernel/sys.c | 32 +++++++++++++++++++++-----------
kernel/time/posix-cpu-timers.c | 12 +++++++++---
3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 5bbcd280bfd2..9cf126c3b27f 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -253,7 +253,7 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
 			   u64 *newval, u64 *oldval);
 
-void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
+int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
 
 void posixtimer_rearm(struct kernel_siginfo *info);
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index fb2a5e7c0589..073ae9db192f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1432,13 +1432,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 			return -EPERM;
 	}
 
-	/* protect tsk->signal and tsk->sighand from disappearing */
-	read_lock(&tasklist_lock);
-	if (!tsk->sighand) {
-		retval = -ESRCH;
-		goto out;
-	}
-
+	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
 	task_lock(tsk->group_leader);
 	if (new_rlim) {
@@ -1467,10 +1461,26 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 	 */
 	if (!retval && new_rlim && resource == RLIMIT_CPU &&
 	    new_rlim->rlim_cur != RLIM_INFINITY &&
-	    IS_ENABLED(CONFIG_POSIX_TIMERS))
-		update_rlimit_cpu(tsk, new_rlim->rlim_cur);
-out:
-	read_unlock(&tasklist_lock);
+	    IS_ENABLED(CONFIG_POSIX_TIMERS)) {
+		if (update_rlimit_cpu(tsk, new_rlim->rlim_cur)) {
+			/*
+			 * update_rlimit_cpu can fail if the task is exiting.
+			 * We already set the task group's rlim, so we need to
+			 * update_rlimit_cpu for some other task in the process.
+			 * If all of the tasks are exiting, then we don't need
+			 * to update_rlimit_cpu.
+			 */
+			struct task_struct *t_i;
+
+			rcu_read_lock();
+			for_each_thread(tsk, t_i) {
+				if (!update_rlimit_cpu(t_i, new_rlim->rlim_cur))
+					break;
+			}
+			rcu_read_unlock();
+		}
+	}
+
 	return retval;
 }

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 96b4e7810426..e13e628509fb 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -34,14 +34,20 @@ void posix_cputimers_group_init(struct posix_cputimers *pct, u64 cpu_limit)
  * tsk->signal->posix_cputimers.bases[clock].nextevt expiration cache if
  * necessary. Needs siglock protection since other code may update the
  * expiration cache as well.
+ *
+ * Returns 0 on success, -ESRCH on failure. Can fail if the task is exiting and
+ * we cannot lock_task_sighand. Cannot fail if task is current.
  */
-void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new)
+int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new)
 {
 	u64 nsecs = rlim_new * NSEC_PER_SEC;
+	unsigned long irq_fl;
 
-	spin_lock_irq(&task->sighand->siglock);
+	if (!lock_task_sighand(task, &irq_fl))
+		return -ESRCH;
 	set_process_cpu_timer(task, CPUCLOCK_PROF, &nsecs, NULL);
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &irq_fl);
+	return 0;
 }
 
 /*
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-05 22:10:55

by Eric W. Biederman

Subject: Re: [PATCH v2 3/3] prlimit: do not grab the tasklist_lock

Barret Rhoden <[email protected]> writes:

> Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
> for workloads that also must grab the tasklist_lock for waiting,
> killing, and cloning.
>
> The tasklist_lock was grabbed to protect tsk->sighand from disappearing
> (becoming NULL). tsk->signal was already protected by holding a
> reference to tsk.
>
> update_rlimit_cpu() assumed tsk->sighand != NULL. With this commit, it
> attempts to lock_task_sighand(). However, this means that
> update_rlimit_cpu() can fail. This only happens when a task is exiting.
> Note that during exec, sighand may *change*, but it will not be NULL.
>
> Prior to this commit, the do_prlimit() ensured that update_rlimit_cpu()
> would not fail by read locking the tasklist_lock and checking tsk->sighand
> != NULL.
>
> If update_rlimit_cpu() fails, there may be other tasks that are not
> exiting that share tsk->signal. We need to run update_rlimit_cpu() on
> one of them. We can't "back out" the new rlim - once we unlocked
> task_lock(group_leader), the rlim is essentially changed.
>
> The only other caller of update_rlimit_cpu() is
> selinux_bprm_committing_creds(). It has tsk == current, so
> update_rlimit_cpu() cannot fail (current->sighand cannot disappear
> until current exits).
>
> This change resulted in a 14% speedup on a microbenchmark where parents
> kill and wait on their children, and children getpriority, setpriority,
> and getrlimit.
>
> Signed-off-by: Barret Rhoden <[email protected]>
> ---
> include/linux/posix-timers.h | 2 +-
> kernel/sys.c | 32 +++++++++++++++++++++-----------
> kernel/time/posix-cpu-timers.c | 12 +++++++++---
> 3 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
> index 5bbcd280bfd2..9cf126c3b27f 100644
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -253,7 +253,7 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
> void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
> u64 *newval, u64 *oldval);
>
> -void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
> +int update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
>
> void posixtimer_rearm(struct kernel_siginfo *info);
> #endif
> diff --git a/kernel/sys.c b/kernel/sys.c
> index fb2a5e7c0589..073ae9db192f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1432,13 +1432,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> return -EPERM;
> }
>
> - /* protect tsk->signal and tsk->sighand from disappearing */
> - read_lock(&tasklist_lock);
> - if (!tsk->sighand) {
> - retval = -ESRCH;
> - goto out;
> - }
> -
> + /* Holding a refcount on tsk protects tsk->signal from disappearing. */
> rlim = tsk->signal->rlim + resource;
> task_lock(tsk->group_leader);
> if (new_rlim) {
> @@ -1467,10 +1461,26 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> */
> if (!retval && new_rlim && resource == RLIMIT_CPU &&
> new_rlim->rlim_cur != RLIM_INFINITY &&
> - IS_ENABLED(CONFIG_POSIX_TIMERS))
> - update_rlimit_cpu(tsk, new_rlim->rlim_cur);
> -out:
> - read_unlock(&tasklist_lock);
> + IS_ENABLED(CONFIG_POSIX_TIMERS)) {
> + if (update_rlimit_cpu(tsk, new_rlim->rlim_cur)) {
> + /*
> + * update_rlimit_cpu can fail if the task is exiting.
> + * We already set the task group's rlim, so we need to
> + * update_rlimit_cpu for some other task in the process.
> + * If all of the tasks are exiting, then we don't need
> + * to update_rlimit_cpu.
> + */
> + struct task_struct *t_i;
> +
> + rcu_read_lock();
> + for_each_thread(tsk, t_i) {
> + if (!update_rlimit_cpu(t_i, new_rlim->rlim_cur))
> + break;
> + }
> + rcu_read_unlock();
> + }

I look at this and I ask: can't we do this better?

Because you are right that if the thread you landed on is exiting, this
is a problem. It is only a problem for prlimit64, as all of the rest
of the calls to do_prlimit happen from current, so you know they are not
exiting.

I think the simple solution is just:
update_rlimit_cpu(tsk->group_leader)

As the group leader is guaranteed to be the last thread of the thread
group to be processed in release_task, and thus the last thread with a
sighand. Nothing needs to be done if it does not have a sighand.

How does that sound?

> + }
> +
> return retval;
> }

Eric

2022-01-06 16:43:16

by Barret Rhoden

Subject: Re: [PATCH v2 3/3] prlimit: do not grab the tasklist_lock

On 1/5/22 17:05, Eric W. Biederman wrote:
>
> I think the simple solution is just:
> update_rlimit_cpu(tsk->group_leader)
>
> As the group leader is guaranteed to be the last thread of the thread
> group to be processed in release_task, and thus the last thread with a
> sighand. Nothing needs to be done if it does not have a sighand.

Ah, good to know. I didn't know the group_leader stuck around as a
zombie. That makes life easier. I'll make the change and repost the
patches.

Thanks,

Barret