2015-04-09 17:59:22

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH] exit: Use read lock for do_notify_parent() instead of write lock

The idea is that write lock should be used only for modification of something.
Notification of parent does not change graph of tasks, it just says parent that
child's became dead. So, in ideally it shouldn't be executed under write lock.

Other side is that we take several spin locks inside do_notify_parent(). This
increases the time we're holding tasklist_lock, and all the time the lock is
unavailable for anyone else. It's not good, because tasklist_lock is one of
the most often used locks in the system.

I suggest to execute do_notify_parent() under read_lock(). It allows more tasks
to use it in parallel. Read lock gives enough guarantees for us: child's parent
won't change during the notification.

The only code affected by this change is do_wait() and its child-relative
functions. We execute them with read_lock() held, and this used to guarantee
parallel exit_notify() is impossible. Now they can race, so it's necessary
to synchronize them. We use new EXIT_NOTIFY exit_state for that. It says wait
functions that task is notifying its parent, so they should wait till it set
EXIT_DEAD or EXIT_ZOMBIE exit_status. This doesn't worsen performance. Yes,
we're spinning between wait_consider_task() and do_wait, but there were the
same spin on read_lock() in do_wait(). Really, the new spin is very unlikely
case.

This patch only desribes the idea, it changes exit_notify() only. We can use
the same technics in other places where do_notify_parent() is used.

We need "[PATCH] de_thread: Move notify_count write under lock" from this link:
http://permalink.gmane.org/gmane.linux.kernel/1896220. It's already in akpm
branch of linux-next tree. That patch and current changes of de_thread()
guarantee leader sees right notify_count.

Also, in the future we may think about new rwlock primitive, which atomically
drops write lock and acquires read lock. Something like this for example:

include/asm-generic/qrwlock.h:
static inline void queue_reduce_locked_write_to_read(struct qrwlock *lock)
{
smp_mb__before_atomic();
atomic_add(_QR_BIAS - _QW_LOCKED, &lock->cnts);
}

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/exec.c | 4 ++--
include/linux/sched.h | 6 ++++--
kernel/exit.c | 26 ++++++++++++++++++++++----
3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 314e8d8..7cb8313 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1039,10 +1039,10 @@ static int de_thread(struct task_struct *tsk)

killed:
/* protects against exit_notify() and __exit_signal() */
- read_lock(&tasklist_lock);
+ write_lock_irq(&tasklist_lock);
sig->group_exit_task = NULL;
sig->notify_count = 0;
- read_unlock(&tasklist_lock);
+ write_unlock_irq(&tasklist_lock);
return -EAGAIN;
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0eabab9..0fc3154 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -214,9 +214,11 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
#define TASK_PARKED 512
-#define TASK_STATE_MAX 1024
+/* in tsk->exit_state again */
+#define EXIT_NOTIFY 1024
+#define TASK_STATE_MAX 2048

-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPn"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/kernel/exit.c b/kernel/exit.c
index feff10b..f6465ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -59,6 +59,9 @@
#include <asm/pgtable.h>
#include <asm/mmu_context.h>

+/* Bigger than any errno to differ from real errors */
+#define REPEAT_DOWAIT (MAX_ERRNO + 1)
+
static void exit_mm(struct task_struct *tsk);

static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -594,7 +597,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)

write_lock_irq(&tasklist_lock);
forget_original_parent(tsk, &dead);
+ tsk->exit_state = EXIT_NOTIFY;
+ write_unlock_irq(&tasklist_lock);

+ read_lock(&tasklist_lock);
if (group_dead)
kill_orphaned_pgrp(tsk->group_leader, NULL);

@@ -612,13 +618,14 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
}

tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
- if (tsk->exit_state == EXIT_DEAD)
+ smp_wmb(); /* Pairs with read_lock() in do_wait() */
+ if (autoreap)
list_add(&tsk->ptrace_entry, &dead);

/* mt-exec, de_thread() is waiting for group leader */
if (unlikely(tsk->signal->notify_count < 0))
wake_up_process(tsk->signal->group_exit_task);
- write_unlock_irq(&tasklist_lock);
+ read_unlock(&tasklist_lock);

list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
list_del_init(&p->ptrace_entry);
@@ -1317,6 +1324,13 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
return 0;
}

+ if (unlikely(exit_state == EXIT_NOTIFY)) {
+ if (wo->wo_flags & WNOHANG)
+ return 0;
+ read_unlock(&tasklist_lock);
+ return -REPEAT_DOWAIT;
+ }
+
if (unlikely(exit_state == EXIT_TRACE)) {
/*
* ptrace == 0 means we are the natural parent. In this case
@@ -1488,11 +1502,15 @@ static long do_wait(struct wait_opts *wo)
tsk = current;
do {
retval = do_wait_thread(wo, tsk);
- if (retval)
+ if (unlikely(retval == -REPEAT_DOWAIT))
+ goto repeat;
+ else if (retval)
goto end;

retval = ptrace_do_wait(wo, tsk);
- if (retval)
+ if (unlikely(retval == -REPEAT_DOWAIT))
+ goto repeat;
+ else if (retval)
goto end;

if (wo->wo_flags & __WNOTHREAD)



2015-04-09 18:28:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] exit: Use read lock for do_notify_parent() instead of write lock

On Thu, Apr 09, 2015 at 08:59:08PM +0300, Kirill Tkhai wrote:

I've not really read your email yet, however:

> Also, in the future we may think about new rwlock primitive, which atomically
> drops write lock and acquires read lock. Something like this for example:
>
> include/asm-generic/qrwlock.h:
> static inline void queue_reduce_locked_write_to_read(struct qrwlock *lock)
> {
> smp_mb__before_atomic();
> atomic_add(_QR_BIAS - _QW_LOCKED, &lock->cnts);
> }

we actually have that for the rwsems: downgrade_write(). So the
consistent naming would be: queue_downgrade_write().

2015-04-10 08:22:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] exit: Use read lock for do_notify_parent() instead of write lock


* Peter Zijlstra <[email protected]> wrote:

> On Thu, Apr 09, 2015 at 08:59:08PM +0300, Kirill Tkhai wrote:
>
> I've not really read your email yet, however:
>
> > Also, in the future we may think about new rwlock primitive, which atomically
> > drops write lock and acquires read lock. Something like this for example:
> >
> > include/asm-generic/qrwlock.h:
> > static inline void queue_reduce_locked_write_to_read(struct qrwlock *lock)
> > {
> > smp_mb__before_atomic();
> > atomic_add(_QR_BIAS - _QW_LOCKED, &lock->cnts);
> > }
>
> we actually have that for the rwsems: downgrade_write(). So the
> consistent naming would be: queue_downgrade_write().

and 'downgrade_write_lock()' on the rwlock side.

Thanks,

Ingo

2015-04-10 17:51:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] exit: Use read lock for do_notify_parent() instead of write lock

Kirill,

I'll try to read this patch tomorrow, currently I am hopelessly buried
in user-space programming :/

But I have to admit that so far I dislike this patch very much. It adds
a lot of complications and for what?

Yes, yes, yes. tasklist_lock is another BKL and must die. We need the
per-process lock. Until then I do not think the hacks like this make
any sense, unless you have the "real" workload with before/after
performance numbers.

On 04/09, Kirill Tkhai wrote:
>
> I suggest to execute do_notify_parent() under read_lock(). It allows more tasks
> to use it in parallel. Read lock gives enough guarantees for us: child's parent
> won't change during the notification.

Well, write_unlock() + read_lock() is not nice too...

> include/asm-generic/qrwlock.h:
> static inline void queue_reduce_locked_write_to_read(struct qrwlock *lock)
> {
> smp_mb__before_atomic();
> atomic_add(_QR_BIAS - _QW_LOCKED, &lock->cnts);
> }

Yes, downgrade() will be better.

Still, this only removes do_notify_parent() from the write_lock'ed section.

(lets ignore kill_orphaned_pgrp(), we want to make will_become_orphaned_pgrp
lockless. Look at get_signal).

And this changes the rules: currently ->exit_state is stable under read_lock,
except -> EXIT_DEAD transition. OK, this is probably fine, but we need to
recheck. At least this was certainly wrong some time before iirc.

> @@ -594,7 +597,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>
> write_lock_irq(&tasklist_lock);
> forget_original_parent(tsk, &dead);
> + tsk->exit_state = EXIT_NOTIFY;
> + write_unlock_irq(&tasklist_lock);

And unless I missed something this EXIT_NOTIFY turns the concurrent
do_wait() into the busy-wait loop.

Now suppose that CONFIG_SMP=n and the rt parent preempts the exiting
child right after it drops tasklist: deadlock?

> + read_lock(&tasklist_lock);
> if (group_dead)
> kill_orphaned_pgrp(tsk->group_leader, NULL);
>
> @@ -612,13 +618,14 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> }
>
> tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;

This needs WRITE_ONCE(). Otherwise gcc can do, say,

tsk->exit_state = EXIT_ZOMBIE;
if (autoreap)
tsk->exit_state = EXIT_DEAD;

which will lead to kernel crash (both parent and child can release this
task).


> - if (tsk->exit_state == EXIT_DEAD)
> + smp_wmb(); /* Pairs with read_lock() in do_wait() */

Why? this barries looks unnecessary.

OTOH. We need to set EXIT_XXX before __wake_up_parent(). OK, OK, we do not
because of the busy-wait loop, but busy-wait is not an option.

> @@ -1317,6 +1324,13 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
> return 0;
> }
>
> + if (unlikely(exit_state == EXIT_NOTIFY)) {
> + if (wo->wo_flags & WNOHANG)
> + return 0;
> + read_unlock(&tasklist_lock);
> + return -REPEAT_DOWAIT;
> + }

No, no, no. If you do something like this, please (ab)use wo->notask_error.
And wait_consider_task() should continue after that,

Oleg.

2015-04-13 16:34:23

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] exit: Use read lock for do_notify_parent() instead of write lock

Hi, Oleg,

thanks for your review.

В Пт, 10/04/2015 в 19:50 +0200, Oleg Nesterov пишет:
> Kirill,
>
> I'll try to read this patch tomorrow, currently I am hopelessly buried
> in user-space programming :/
>
> But I have to admit that so far I dislike this patch very much. It adds
> a lot of complications and for what?
>
> Yes, yes, yes. tasklist_lock is another BKL and must die. We need the
> per-process lock. Until then I do not think the hacks like this make
> any sense, unless you have the "real" workload with before/after
> performance numbers.

I don't think the complication is very huge. We add one rule about
exit_state. Yes, the state becomes unstable under read_lock(), but
only wait_consider_task() is affected by this.

Ok, what do you mean when you're speaking about killing tasklist_lock?
Can't we leave it for fork() and __unhash_process() only, but change
other places which lock it for write? Every of the places will get rid
of it by its own way. EXIT_NOTIFY is for do_exit().

Or you want to kill it completelly?

I didn't test the patch on special workload or large SMP systems.
The results for 4 CPU box (kernel compilation):

[origin]
1)534.37user 32.15system 2:29.32elapsed 379%CPU (0avgtext+0avgdata 142488maxresident)k
0inputs+724264outputs (0major+23852891minor)pagefaults 0swaps
2)534.85user 32.81system 2:28.67elapsed 381%CPU (0avgtext+0avgdata 142476maxresident)k
0inputs+724264outputs (0major+23853531minor)pagefaults 0swaps

[patched]
1)531.65user 32.69system 2:27.41elapsed 382%CPU (0avgtext+0avgdata 142580maxresident)k
0inputs+724256outputs (0major+23854620minor)pagefaults 0swaps
2)530.92user 32.51system 2:28.18elapsed 380%CPU (0avgtext+0avgdata 142544maxresident)k
0inputs+724256outputs (0major+23852925minor)pagefaults 0swaps

My test machine has HDD, so it's not the best test for the patch. I'll try something
else later. But I don't expect exciting results on workloads like this.

>
> On 04/09, Kirill Tkhai wrote:
> >
> > I suggest to execute do_notify_parent() under read_lock(). It allows more tasks
> > to use it in parallel. Read lock gives enough guarantees for us: child's parent
> > won't change during the notification.
>
> Well, write_unlock() + read_lock() is not nice too...
>
> > include/asm-generic/qrwlock.h:
> > static inline void queue_reduce_locked_write_to_read(struct qrwlock *lock)
> > {
> > smp_mb__before_atomic();
> > atomic_add(_QR_BIAS - _QW_LOCKED, &lock->cnts);
> > }
>
> Yes, downgrade() will be better.
>
> Still, this only removes do_notify_parent() from the write_lock'ed section.

Yeah, but the plan is to go successively to removing write lock from every
place it's used, except of hashing in fork() and unhashing in __unhash_process().

> (lets ignore kill_orphaned_pgrp(), we want to make will_become_orphaned_pgrp
> lockless. Look at get_signal).
>
> And this changes the rules: currently ->exit_state is stable under read_lock,
> except -> EXIT_DEAD transition. OK, this is probably fine, but we need to
> recheck. At least this was certainly wrong some time before iirc.
>
> > @@ -594,7 +597,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >
> > write_lock_irq(&tasklist_lock);
> > forget_original_parent(tsk, &dead);
> > + tsk->exit_state = EXIT_NOTIFY;
> > + write_unlock_irq(&tasklist_lock);
>
> And unless I missed something this EXIT_NOTIFY turns the concurrent
> do_wait() into the busy-wait loop.
>
> Now suppose that CONFIG_SMP=n and the rt parent preempts the exiting
> child right after it drops tasklist: deadlock?

You sure, thank. We need to disable preemption there.

> > + read_lock(&tasklist_lock);
> > if (group_dead)
> > kill_orphaned_pgrp(tsk->group_leader, NULL);
> >
> > @@ -612,13 +618,14 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > }
> >
> > tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
>
> This needs WRITE_ONCE(). Otherwise gcc can do, say,
>
> tsk->exit_state = EXIT_ZOMBIE;
> if (autoreap)
> tsk->exit_state = EXIT_DEAD;
>
> which will lead to kernel crash (both parent and child can release this
> task).

Ah, thanks.
>
>
> > - if (tsk->exit_state == EXIT_DEAD)
> > + smp_wmb(); /* Pairs with read_lock() in do_wait() */
>
> Why? this barries looks unnecessary.

Sure, it's unnecessary for do_wait().

> OTOH. We need to set EXIT_XXX before __wake_up_parent(). OK, OK, we do not
> because of the busy-wait loop, but busy-wait is not an option.
>
> > @@ -1317,6 +1324,13 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
> > return 0;
> > }
> >
> > + if (unlikely(exit_state == EXIT_NOTIFY)) {
> > + if (wo->wo_flags & WNOHANG)
> > + return 0;
> > + read_unlock(&tasklist_lock);
> > + return -REPEAT_DOWAIT;
> > + }
>
> No, no, no. If you do something like this, please (ab)use wo->notask_error.
> And wait_consider_task() should continue after that,

Kirill