2024-01-19 14:18:05

by Oleg Nesterov

Subject: [PATCH 2/2] getrusage: use sig->stats_lock

Rather than lock_task_sighand(), sig->stats_lock was specifically designed
for this type of use. This way getrusage runs lockless in the likely case.

TODO:
- Change do_task_stat() to use sig->stats_lock too, then we can
remove spin_lock_irq(siglock) in wait_task_zombie().

- Turn sig->stats_lock into seqcount_rwlock_t, this way the
readers in the slow mode won't exclude each other. See
https://lore.kernel.org/all/[email protected]/

- stats_lock has to disable irqs because ->siglock can be taken
in irq context, it would be very nice to change __exit_signal()
to avoid the siglock->stats_lock dependency.

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/sys.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 70ad06ad852e..f8e543f1e38a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1788,7 +1788,9 @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
unsigned long maxrss;
struct mm_struct *mm;
struct signal_struct *sig = p->signal;
+ unsigned int seq = 0;

+retry:
memset(r, 0, sizeof(*r));
utime = stime = 0;
maxrss = 0;
@@ -1800,8 +1802,7 @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
goto out_thread;
}

- if (!lock_task_sighand(p, &flags))
- return;
+ flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);

switch (who) {
case RUSAGE_BOTH:
@@ -1829,14 +1830,23 @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
r->ru_oublock += sig->oublock;
if (maxrss < sig->maxrss)
maxrss = sig->maxrss;
+
+ rcu_read_lock();
__for_each_thread(sig, t)
accumulate_thread_rusage(t, r);
+ rcu_read_unlock();
+
break;

default:
BUG();
}
- unlock_task_sighand(p, &flags);
+
+ if (need_seqretry(&sig->stats_lock, seq)) {
+ seq = 1;
+ goto retry;
+ }
+ done_seqretry_irqrestore(&sig->stats_lock, seq, flags);

if (who == RUSAGE_CHILDREN)
goto out_children;
--
2.25.1.362.g51ebf55




2024-01-20 03:28:13

by Dylan Hatch

Subject: Re: [PATCH 2/2] getrusage: use sig->stats_lock

On Fri, Jan 19, 2024 at 6:16 AM Oleg Nesterov <[email protected]> wrote:
>
> Rather than lock_task_sighand(), sig->stats_lock was specifically designed
> for this type of use. This way getrusage runs lockless in the likely case.
>
> TODO:
> - Change do_task_stat() to use sig->stats_lock too, then we can
> remove spin_lock_irq(siglock) in wait_task_zombie().
>
> - Turn sig->stats_lock into seqcount_rwlock_t, this way the
> readers in the slow mode won't exclude each other. See
> https://lore.kernel.org/all/[email protected]/
>
> - stats_lock has to disable irqs because ->siglock can be taken
> in irq context, it would be very nice to change __exit_signal()
> to avoid the siglock->stats_lock dependency.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> kernel/sys.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 70ad06ad852e..f8e543f1e38a 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1788,7 +1788,9 @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> unsigned long maxrss;
> struct mm_struct *mm;
> struct signal_struct *sig = p->signal;
> + unsigned int seq = 0;
>
> +retry:
> memset(r, 0, sizeof(*r));
> utime = stime = 0;
> maxrss = 0;
> @@ -1800,8 +1802,7 @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> goto out_thread;
> }
>
> - if (!lock_task_sighand(p, &flags))
> - return;
> + flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
>
> switch (who) {
> case RUSAGE_BOTH:
> @@ -1829,14 +1830,23 @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> r->ru_oublock += sig->oublock;
> if (maxrss < sig->maxrss)
> maxrss = sig->maxrss;
> +
> + rcu_read_lock();
> __for_each_thread(sig, t)
> accumulate_thread_rusage(t, r);
> + rcu_read_unlock();
> +
> break;
>
> default:
> BUG();
> }
> - unlock_task_sighand(p, &flags);
> +
> + if (need_seqretry(&sig->stats_lock, seq)) {
> + seq = 1;
> + goto retry;
> + }
> + done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
>
> if (who == RUSAGE_CHILDREN)
> goto out_children;
> --
> 2.25.1.362.g51ebf55
>
>

I applied these to a 5.10 kernel, and my repro (calling getrusage(RUSAGE_SELF)
from 200K threads) is no longer triggering a hard lockup.

Tested-by: Dylan Hatch <[email protected]>

2024-01-21 04:46:31

by Andrew Morton

Subject: Re: [PATCH 2/2] getrusage: use sig->stats_lock

On Fri, 19 Jan 2024 19:27:49 -0800 Dylan Hatch <[email protected]> wrote:

>
> I applied these to a 5.10 kernel, and my repro (calling getrusage(RUSAGE_SELF)
> from 200K threads) is no longer triggering a hard lockup.

Thanks, but...

The changelogs don't actually describe any hard lockup. [1/2] does
mention "the deadlock" but that's all the info we have.

So could we please have a suitable description of the bug which these are
addressing? And a Reported-by:, a Closes: and a Fixes would be great too.



2024-01-21 12:09:33

by Oleg Nesterov

Subject: Re: [PATCH 2/2] getrusage: use sig->stats_lock

On 01/20, Andrew Morton wrote:
>
> On Fri, 19 Jan 2024 19:27:49 -0800 Dylan Hatch <[email protected]> wrote:
>
> >
> > I applied these to a 5.10 kernel, and my repro (calling getrusage(RUSAGE_SELF)
> > from 200K threads) is no longer triggering a hard lockup.
>
> Thanks, but...
>
> The changelogs don't actually describe any hard lockup. [1/2] does
> mention "the deadlock" but that's all the info we have.

Sorry for the confusion... 1/2 tries to explain that this change is not
strictly necessary for 2/2: it is safe to call thread_group_cputime()
with sig->stats_lock held for writing even if thread_group_cputime()
takes the same lock, because in this case thread_group_cputime() can't
enter the slow mode.

> So could we please have a suitable description of the bug which these are
> addressing? And a Reported-by:, a Closes: and a Fixes would be great too.

Yes, sorry, I forgot to add Reported-by. I'll update the changelog and
add Reported-and-tested-by.

But the problem is known and old. I think do_io_accounting() had the same
problem until 1df4bd83cdfdbd0 ("do_io_accounting: use sig->stats_lock"),
and do_task_stat() ...

getrusage() takes siglock and does for_each_thread() twice. If NR_THREADS
threads call sys_getrusage() in an endless loop on NR_CPUS CPUs,
lock_task_sighand() can trigger a hard lockup because it spins with irqs
disabled waiting for the other NR_CPUS-1 CPUs, which all need the same
siglock. So the time it spins with irqs disabled is O(NR_CPUS * NR_THREADS).

With this patch all the threads can run lockless in parallel in the
likely case.

Dylan, do you have a better description? Can you share your repro?
Although I think that something simple like
	#define NT BIG_NUMBER

	pthread_barrier_t barr;

	void *thread(void *arg)
	{
		struct rusage ru;

		pthread_barrier_wait(&barr);
		for (;;)
			getrusage(RUSAGE_SELF, &ru);
		return NULL;
	}

	int main(void)
	{
		pthread_barrier_init(&barr, NULL, NT);

		for (int n = 0; n < NT-1; ++n) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
		}
		thread(NULL);

		return 0;
	}

should work if you have a machine with a lot of memory/cpus.

Oleg.



2024-01-23 02:53:41

by Dylan Hatch

Subject: Re: [PATCH 2/2] getrusage: use sig->stats_lock

On Sun, Jan 21, 2024 at 4:09 AM Oleg Nesterov <[email protected]> wrote:
>
> Dylan, do you have a better description? Can you share your repro?

That description seems accurate to me.

> although I think that something simple like
>
> 	#define NT BIG_NUMBER
>
> 	pthread_barrier_t barr;
>
> 	void *thread(void *arg)
> 	{
> 		struct rusage ru;
>
> 		pthread_barrier_wait(&barr);
> 		for (;;)
> 			getrusage(RUSAGE_SELF, &ru);
> 		return NULL;
> 	}
>
> 	int main(void)
> 	{
> 		pthread_barrier_init(&barr, NULL, NT);
>
> 		for (int n = 0; n < NT-1; ++n) {
> 			pthread_t pt;
> 			pthread_create(&pt, NULL, thread, NULL);
> 		}
> 		thread(NULL);
>
> 		return 0;
> 	}
>
> should work if you have a machine with a lot of memory/cpus.
>
> Oleg.
>

Here's my repro, very similar to what you've sent:

#define _GNU_SOURCE
#include <sys/resource.h>
#include <sched.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>

int thrd_func(void *data) {
	struct rusage usage;
	volatile int *complete = (void *)data;	/* volatile so the spin below isn't optimized away */

	while (!*complete);
	while (1) {
		getrusage(RUSAGE_SELF, &usage);
	}
}

#define STACK_SIZE (1024)

int main(int argc, char **argv) {
	if (argc != 2) {
		printf("Usage: %s <thread count>\n", argv[0]);
		exit(EXIT_SUCCESS);
	}
	const int cnt = atoi(argv[1]);
	int pids[cnt];
	volatile int complete = 0;
	printf("Starting test with %d threads...\n", cnt);
	for (int i = 0; i < cnt; i++) {
		char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK,
				   -1, 0);
		if (stack == MAP_FAILED) {
			perror("mmap() failed");
			return -1;
		}

		pids[i] = clone(thrd_func, stack + STACK_SIZE,
				CLONE_THREAD | CLONE_SIGHAND | CLONE_FS |
				CLONE_VM | CLONE_FILES, (void *)&complete);

		if (pids[i] == -1) {
			perror("clone() failed");
			return pids[i];
		}
	}
	complete = 1;
	printf("waiting on threads...\n");
	sleep(100);
	complete = 0;
	printf("test finished.\n");
	exit(EXIT_SUCCESS);
}

I can't remember exactly why I chose to call mmap() and clone() directly
instead of using pthreads, but I do know that mmap'ing a smaller stack
makes the repro more reliable, since you can create more threads. Around
250K threads seemed to be enough to reliably produce the lockup, but your
mileage may vary.