2005-05-17 22:46:58

by Christoph Lameter

Subject: [PATCH] Optimize sys_times for a single thread process

Avoid taking the tasklist_lock in sys_times if the process is
single threaded. In a NUMA system taking the tasklist_lock may
cause a bouncing cacheline if multiple independent processes
continually call sys_times to measure their performance.

Patch against 2.6.12-rc4

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.11/kernel/sys.c
===================================================================
--- linux-2.6.11.orig/kernel/sys.c 2005-05-17 12:46:12.000000000 -0700
+++ linux-2.6.11/kernel/sys.c 2005-05-17 12:59:36.000000000 -0700
@@ -894,35 +894,49 @@ asmlinkage long sys_times(struct tms __u
*/
if (tbuf) {
struct tms tmp;
- struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;

- read_lock(&tasklist_lock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
-
- /*
- * While we have tasklist_lock read-locked, no dying thread
- * can be updating current->signal->[us]time. Instead,
- * we got their counts included in the live thread loop.
- * However, another thread can come in right now and
- * do a wait call that updates current->signal->c[us]time.
- * To make sure we always see that pair updated atomically,
- * we take the siglock around fetching them.
- */
- spin_lock_irq(&tsk->sighand->siglock);
- cutime = tsk->signal->cutime;
- cstime = tsk->signal->cstime;
- spin_unlock_irq(&tsk->sighand->siglock);
- read_unlock(&tasklist_lock);
+ if (current == next_thread(current)) {
+ /*
+ * Single thread case. We do not need to scan the tasklist
+ * and thus can avoid the read_lock(&tasklist_lock). We
+ * also do not need to take the siglock since we
+ * are the only thread in this process.
+ */
+ utime = cputime_add(current->signal->utime, current->utime);
+ stime = cputime_add(current->signal->stime, current->stime);
+ cutime = current->signal->cutime;
+ cstime = current->signal->cstime;
+ } else {
+ /* Process with multiple threads */
+ struct task_struct *tsk = current;
+ struct task_struct *t;
+
+ read_lock(&tasklist_lock);
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ t = next_thread(t);
+ } while (t != tsk);

+ /*
+ * While we have tasklist_lock read-locked, no dying thread
+ * can be updating current->signal->[us]time. Instead,
+ * we got their counts included in the live thread loop.
+ * However, another thread can come in right now and
+ * do a wait call that updates current->signal->c[us]time.
+ * To make sure we always see that pair updated atomically,
+ * we take the siglock around fetching them.
+ */
+ spin_lock_irq(&tsk->sighand->siglock);
+ cutime = tsk->signal->cutime;
+ cstime = tsk->signal->cstime;
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ }
tmp.tms_utime = cputime_to_clock_t(utime);
tmp.tms_stime = cputime_to_clock_t(stime);
tmp.tms_cutime = cputime_to_clock_t(cutime);


2005-05-17 23:09:35

by Andrew Morton

Subject: Re: [PATCH] Optimize sys_times for a single thread process

Christoph Lameter <[email protected]> wrote:
>
> Avoid taking the tasklist_lock in sys_times if the process is
> single threaded. In a NUMA system taking the tasklist_lock may
> cause a bouncing cacheline if multiple independent processes
> continually call sys_times to measure their performance.
>
> ...
> + if (current == next_thread(current)) {
> + /*
> + * Single thread case. We do not need to scan the tasklist
> + * and thus can avoid the read_lock(&tasklist_lock). We
> + * also do not need to take the siglock since we
> + * are the only thread in this process.
> + */
> + utime = cputime_add(current->signal->utime, current->utime);
> + stime = cputime_add(current->signal->stime, current->stime);
> + cutime = current->signal->cutime;
> + cstime = current->signal->cstime;
> + } else {

Well, hrm, maybe. If this task has one sibling thread, and that thread is
in the process of exiting, then (current == next_thread(current)) may
become true before that sibling thread has had a chance to dump its process
accounting info into the signal structure.

If that dumping happens prior to the __detach_pid() call then things are
probably OK (modulo memory ordering issues). Otherwise there's a little
window where the accounting will go wrong.
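
To sketch the interleaving in question (a hypothetical ordering, using
the names from kernel/exit.c):

	CPU0 (exiting sibling)            CPU1 (in sys_times)
	----------------------            -------------------
	__detach_pid()                    current == next_thread(current)
	  /* leaves the thread list */      /* now true: lockless path */
	                                  utime = signal->utime + ...
	                                    /* sibling's CPU time missed */
	signal->utime += tsk->utime
	  /* accounting dumped too late */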

Have you audited that code to ensure that the desired sequencing occurs in
all cases and that the appropriate barriers are in place?

It all looks a bit fast-and-loose. If there are significant performance
benefits and these issues are loudly commented (they aren't at present)
then maybe-OK, I guess.

2005-05-18 00:13:41

by Christoph Lameter

Subject: Re: [PATCH] Optimize sys_times for a single thread process

On Tue, 17 May 2005, Andrew Morton wrote:

> Well, hrm, maybe. If this task has one sibling thread, and that thread is
> in the process of exiting, then (current == next_thread(current)) may
> become true before that sibling thread has had a chance to dump its process
> accounting info into the signal structure.

The task is only "unhashed" after the counters have been added in
__exit_signal. See release_task in kernel/exit.c.

> If that dumping happens prior to the __detach_pid() call then things are
> probably OK (modulo memory ordering issues). Otherwise there's a little
> window where the accounting will go wrong.

__exit_signal takes various locks that will ensure the proper sequencing.

> Have you audited that code to ensure that the desired sequencing occurs in
> all cases and that the appropriate barriers are in place?

AFAIK release_task is always called for task removal.
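
Roughly, the sequence there looks like this (a simplified sketch of the
2.6-era kernel/exit.c, with unrelated details elided):

	void release_task(struct task_struct *p)
	{
		...
		write_lock_irq(&tasklist_lock);
		__exit_signal(p);     /* folds p->utime/p->stime into p->signal */
		__exit_sighand(p);
		__unhash_process(p);  /* only now does p leave the thread list */
		write_unlock_irq(&tasklist_lock);
		...
	}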

> It all looks a bit fast-and-loose. If there are significant performance
> benefits and these issues are loudly commented (they aren't at present)
> then maybe-OK, I guess.

There are significant performance benefits in particular for one standard
NUMA benchmark that keeps calling sys_times over and over. I believe other
programs may exhibit the same brain-dead behavior.
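
The pattern is essentially a tight measurement loop; a hypothetical
reconstruction of such a benchmark's hot path:

	#include <stdio.h>
	#include <sys/times.h>
	#include <unistd.h>

	/* One copy of this per node is enough to make the tasklist_lock
	 * cacheline ping-pong across the machine, since every times()
	 * call takes the read lock. */
	int main(void)
	{
		struct tms t;
		long hz = sysconf(_SC_CLK_TCK);

		for (;;) {
			clock_t before = times(&t);
			/* ... the work being measured ... */
			clock_t elapsed = times(&t) - before;
			printf("%.3f seconds\n", (double)elapsed / hz);
		}
	}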

2005-05-18 00:20:31

by Andrew Morton

Subject: Re: [PATCH] Optimize sys_times for a single thread process

Christoph Lameter <[email protected]> wrote:
>
> > It all looks a bit fast-and-loose. If there are significant performance
> > benefits and these issues are loudly commented (they aren't at present)
> > then maybe-OK, I guess.
>
> There are significant performance benefits in particular for one standard
> NUMA benchmark that keeps calling sys_times over and over. I believe other
> programs may exhibit the same brain-dead behavior.


hrm, OK. Please redo the patch with nice comments which explain what's
going on and why the end result is correct and safe, thanks.

2005-05-18 00:46:10

by Mitchell Blank Jr

Subject: Re: [PATCH] Optimize sys_times for a single thread process

Christoph Lameter wrote:
> + if (current == next_thread(current)) {
> + /*
> + * Single thread case. We do not need to scan the tasklist
> + * and thus can avoid the read_lock(&tasklist_lock). We
> + * also do not need to take the siglock since we
> + * are the only thread in this process.
> + */
> + utime = cputime_add(current->signal->utime, current->utime);
> + stime = cputime_add(current->signal->stime, current->stime);
> + cutime = current->signal->cutime;
> + cstime = current->signal->cstime;
> + } else

Maybe #ifdef CONFIG_SMP around this? On uniproc you're still saving the
sti/cli around reading the tsk->signal stuff but that's probably not
enough to warrant the code bloat.
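
(On a uniprocessor build the _irq lock variants boil down to just the
interrupt toggling, roughly:

	/* what !CONFIG_SMP compiles the lock ops down to; the lock
	 * itself is optimized away: */
	#define spin_lock_irq(lock)	local_irq_disable()
	#define spin_unlock_irq(lock)	local_irq_enable()

so the cli/sti pair really is all the lockless path saves there.)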

Or maybe this is a small enough amount of code not to matter... just a
suggestion.

-Mitch

2005-05-18 01:01:09

by Christoph Lameter

Subject: Re: [PATCH] Optimize sys_times for a single thread process

On Tue, 17 May 2005, Mitchell Blank Jr wrote:

> Maybe #ifdef CONFIG_SMP around this? On uniproc you're still saving the
> sti/cli around reading the tsk->signal stuff but that's probably not
> enough to warrant the code bloat.
>
> Or maybe this is a small enough amount of code not to matter... just a
> suggestion.

Here is the patch with more explanations and #ifdef CONFIG_SMP.

Avoid taking the tasklist_lock in sys_times if the process is
single threaded. In a NUMA system taking the tasklist_lock may
cause a bouncing cacheline if multiple independent processes
continually call sys_times to measure their performance.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.11/kernel/sys.c
===================================================================
--- linux-2.6.11.orig/kernel/sys.c 2005-05-17 15:21:50.000000000 -0700
+++ linux-2.6.11/kernel/sys.c 2005-05-17 17:00:42.000000000 -0700
@@ -894,35 +894,69 @@ asmlinkage long sys_times(struct tms __u
*/
if (tbuf) {
struct tms tmp;
- struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;

- read_lock(&tasklist_lock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
-
- /*
- * While we have tasklist_lock read-locked, no dying thread
- * can be updating current->signal->[us]time. Instead,
- * we got their counts included in the live thread loop.
- * However, another thread can come in right now and
- * do a wait call that updates current->signal->c[us]time.
- * To make sure we always see that pair updated atomically,
- * we take the siglock around fetching them.
- */
- spin_lock_irq(&tsk->sighand->siglock);
- cutime = tsk->signal->cutime;
- cstime = tsk->signal->cstime;
- spin_unlock_irq(&tsk->sighand->siglock);
- read_unlock(&tasklist_lock);
+#ifdef CONFIG_SMP
+ if (current == next_thread(current)) {
+ /*
+ * Single thread case without the use of any locks.
+ *
+ * We may race with release_task if two threads are
+ * executing. However, release_task first adds up the
+ * counters (__exit_signal) before removing the task
+ * from the process tasklist (__unhash_process).
+ * __exit_signal also acquires and releases the
+ * siglock, which results in the proper memory ordering
+ * so that the list modifications are always visible
+ * after the counters have been updated.
+ *
+ * If the counters have been updated by the second thread
+ * but the thread has not yet been removed from the list,
+ * then the other branch will be taken, which will
+ * block on tasklist_lock until the exit handling of the
+ * other task is finished.
+ *
+ * This also implies that the sighand->siglock cannot
+ * be held by another processor. So we can also
+ * skip acquiring that lock.
+ */
+ utime = cputime_add(current->signal->utime, current->utime);
+ stime = cputime_add(current->signal->stime, current->stime);
+ cutime = current->signal->cutime;
+ cstime = current->signal->cstime;
+ } else
+#endif
+ {

+ /* Process with multiple threads */
+ struct task_struct *tsk = current;
+ struct task_struct *t;
+
+ read_lock(&tasklist_lock);
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ t = next_thread(t);
+ } while (t != tsk);
+
+ /*
+ * While we have tasklist_lock read-locked, no dying thread
+ * can be updating current->signal->[us]time. Instead,
+ * we got their counts included in the live thread loop.
+ * However, another thread can come in right now and
+ * do a wait call that updates current->signal->c[us]time.
+ * To make sure we always see that pair updated atomically,
+ * we take the siglock around fetching them.
+ */
+ spin_lock_irq(&tsk->sighand->siglock);
+ cutime = tsk->signal->cutime;
+ cstime = tsk->signal->cstime;
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ }
tmp.tms_utime = cputime_to_clock_t(utime);
tmp.tms_stime = cputime_to_clock_t(stime);
tmp.tms_cutime = cputime_to_clock_t(cutime);

2005-05-18 09:16:14

by Oleg Nesterov

Subject: Re: [PATCH] Optimize sys_times for a single thread process

Christoph Lameter wrote:
>
> +#ifdef CONFIG_SMP
> + if (current == next_thread(current)) {
> + /*
> + * Single thread case without the use of any locks.

A nitpick, but wouldn't it be clearer to use
thread_group_empty(current)?

Oleg.

2005-05-18 22:06:09

by Christoph Lameter

Subject: Re: [PATCH] Optimize sys_times for a single thread process

On Wed, 18 May 2005, Oleg Nesterov wrote:

> Christoph Lameter wrote:
> >
> > +#ifdef CONFIG_SMP
> > + if (current == next_thread(current)) {
> > + /*
> > + * Single thread case without the use of any locks.
>
> A nitpick, but wouldn't it be clearer to use
> thread_group_empty(current)?

The thread list needs to contain only one element, which is current.
thread_group_empty checks for no threads.

Do we need a new macro?

2005-05-19 06:57:49

by Oleg Nesterov

Subject: Re: [PATCH] Optimize sys_times for a single thread process

Christoph Lameter wrote:
>
> On Wed, 18 May 2005, Oleg Nesterov wrote:
>
> > Christoph Lameter wrote:
> > >
> > > +#ifdef CONFIG_SMP
> > > + if (current == next_thread(current)) {
> > > + /*
> > > + * Single thread case without the use of any locks.
> >
> > A nitpick, but wouldn't it be clearer to use
> > thread_group_empty(current)?
>
> The thread list needs to contain only one element, which is current.
> thread_group_empty checks for no threads.

I think that thread_group_empty() means that there are no *other*
threads in the thread group, which means that we have only one thread.

In any case (p == next_thread(p)) == thread_group_empty(p).
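
Both helpers walk the same per-TGID list, roughly (paraphrased from the
2.6 sources):

	/* An empty circular list points back at itself, so
	 * next_thread(p) == p exactly when the list is empty. */
	static inline task_t *next_thread(const task_t *p)
	{
		return pid_task(p->pids[PIDTYPE_TGID].pid_list.next,
				PIDTYPE_TGID);
	}

	#define thread_group_empty(p) \
		list_empty(&(p)->pids[PIDTYPE_TGID].pid_list)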

But it is a very minor issue indeed, let's forget it.

Oleg.

2005-05-19 14:43:04

by Christoph Lameter

Subject: Re: [PATCH] Optimize sys_times for a single thread process

On Thu, 19 May 2005, Oleg Nesterov wrote:

> > The thread list needs to contain only one element, which is current.
> > thread_group_empty checks for no threads.
>
> I think that thread_group_empty() means that there are no *other*
> threads in the thread group, which means that we have only one thread.
>
> In any case (p == next_thread(p)) == thread_group_empty(p).
>
> But it is a very minor issue indeed, let's forget it.

No no. If you are right then you are right and I am
wrong.

Index: linux-2.6.12-rc4/kernel/sys.c
===================================================================
--- linux-2.6.12-rc4.orig/kernel/sys.c 2005-05-19 03:23:29.000000000 +0000
+++ linux-2.6.12-rc4/kernel/sys.c 2005-05-19 14:40:32.000000000 +0000
@@ -920,7 +920,7 @@
cputime_t utime, stime, cutime, cstime;

#ifdef CONFIG_SMP
- if (current == next_thread(current)) {
+ if (thread_group_empty(current)) {
/*
* Single thread case without the use of any locks.
*