2009-11-04 00:30:05

by Spencer Candland

Subject: utime/stime decreasing on thread exit

I am seeing a problem with utime/stime decreasing on thread exit in a
multi-threaded process. I have been able to track this regression down
to the "process wide cpu clocks/timers" changes introduced in
2.6.29-rc5; specifically, when I revert the following commits I no
longer see decreasing utime/stime values:

4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
3fccfd67df79c6351a156eb25a7a514e5f39c4d9
7d8e23df69820e6be42bcc41d441f4860e8c76f7
4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0

I poked around a little, but I am afraid I have to admit that I am not
familiar enough with how this works to resolve this or suggest a fix.

I have verified this is happening in kernels 2.6.29-rc5 - 2.6.32-rc6. I
have been testing this on x86 vanilla kernels, but have also verified it
on several x86 2.6.29+ distro kernels (Fedora and Ubuntu).

I first noticed this on a production environment running Apache with the
worker MPM. While tracking this down I put together a simple program
that has been reliable in showing me utime decreasing; hopefully it will
be helpful in demonstrating the issue:

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

#define NUM_THREADS 500

struct tms start;
int oldutime;

void *pound (void *threadid)
{
        struct tms end;
        int utime;
        int c;

        /* burn some CPU time in user-land */
        for (c = 0; c < 10000000; c++);

        times(&end);
        utime = ((int)end.tms_utime - (int)start.tms_utime);
        if (oldutime > utime) {
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
        }
        oldutime = utime;
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long i;

        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create (&th[i], NULL, pound, (void *)i);
        }
        /* let main exit while the workers keep running */
        pthread_exit(NULL);
        return 0;
}
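(For reference, the test builds with something like
"gcc -o decrease_utime decrease_utime.c -lpthread" - the file name here is
an assumption. Compile without optimization: an optimizing compiler may
drop the empty delay loop entirely, and then the threads accumulate almost
no utime.)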

Output:

# ./decrease_utime
utime decreased, was 1288, now 1287!
utime decreased, was 1294, now 1293!
utime decreased, was 1296, now 1295!
utime decreased, was 1297, now 1296!
utime decreased, was 1298, now 1297!
utime decreased, was 1299, now 1298!

Please let me know if you need any additional information.

Thank you,
Spencer Candland


2009-11-04 06:49:57

by Hidetoshi Seto

Subject: Re: utime/stime decreasing on thread exit

Spencer Candland wrote:
> I am seeing a problem with utime/stime decreasing on thread exit in a
> multi-threaded process. I have been able to track this regression down
> to the "process wide cpu clocks/timers" changes introduced in
> 2.6.29-rc5; specifically, when I revert the following commits I no
> longer see decreasing utime/stime values:
>
> 4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
> 3fccfd67df79c6351a156eb25a7a514e5f39c4d9
> 7d8e23df69820e6be42bcc41d441f4860e8c76f7
> 4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
> 32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0
>
> I poked around a little, but I am afraid I have to admit that I am not
> familiar enough with how this works to resolve this or suggest a fix.
>
> I have verified this is happening in kernels 2.6.29-rc5 - 2.6.32-rc6. I
> have been testing this on x86 vanilla kernels, but have also verified it
> on several x86 2.6.29+ distro kernels (Fedora and Ubuntu).
>
> I first noticed this on a production environment running Apache with the
> worker MPM. While tracking this down I put together a simple program
> that has been reliable in showing me utime decreasing; hopefully it will
> be helpful in demonstrating the issue:
>
> #include <stdio.h>
> #include <pthread.h>
> #include <sys/times.h>
>
> #define NUM_THREADS 500
>
> struct tms start;
> int oldutime;
>
> void *pound (void *threadid)
> {
>         struct tms end;
>         int utime;
>         int c;
>         for (c = 0; c < 10000000; c++);
>         times(&end);
>         utime = ((int)end.tms_utime - (int)start.tms_utime);
>         if (oldutime > utime) {
>                 printf("utime decreased, was %d, now %d!\n", oldutime, utime);
>         }
>         oldutime = utime;
>         pthread_exit(NULL);
> }
>
>
> int main()
> {
>         pthread_t th[NUM_THREADS];
>         long i;
>         times(&start);
>         for (i = 0; i < NUM_THREADS; i++) {
>                 pthread_create (&th[i], NULL, pound, (void *)i);
>         }
>         pthread_exit(NULL);
>         return 0;
> }

This test program doesn't show the problem correctly, because oldutime
can be modified by any of the threads running simultaneously:

Thread 1:                                Thread 2:

> times(&end);
                                         > times(&end);
                                         > utime = ((int)end.tms_utime -
                                         >          (int)start.tms_utime);
                                         > if (oldutime > utime) {
                                         >         printf("utime decreased, ...");
                                         > }
                                         > oldutime = utime;

> utime = ((int)end.tms_utime - (int)start.tms_utime);
> if (oldutime > utime) {
>         printf("utime decreased, was %d, now %d!\n", oldutime, utime);
> }
> oldutime = utime;

So at first I thought it was not a bug (oldutime is a shared global that
is read and updated without any synchronization, so thread 1 above ends up
comparing a stale snapshot against a value thread 2 has already advanced),
but....

>
> Output:
>
> # ./decrease_utime
> utime decreased, was 1288, now 1287!
> utime decreased, was 1294, now 1293!
> utime decreased, was 1296, now 1295!
> utime decreased, was 1297, now 1296!
> utime decreased, was 1298, now 1297!
> utime decreased, was 1299, now 1298!
>
> Please let me know if you need any additional information.
>
> Thank you,
> Spencer Candland

I revised the test program, but I can still see the problem.
So I agree that something is actually wrong there.

[seto@localhost work]$ cat test.c
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

#define NUM_THREADS 500

void *pound (void *threadid)
{
        struct tms t1, t2;
        int c;

        for (c = 0; c < 10000000; c++);

        /* two back-to-back snapshots taken by the same thread */
        times(&t1);
        times(&t2);
        if (t1.tms_utime > t2.tms_utime) {
                printf("utime decreased, was %d, now %d! : (%d %d %d %d) (%d %d %d %d)\n",
                       (int)t1.tms_utime, (int)t2.tms_utime,
                       (int)t1.tms_stime, (int)t1.tms_utime,
                       (int)t1.tms_cstime, (int)t1.tms_cutime,
                       (int)t2.tms_stime, (int)t2.tms_utime,
                       (int)t2.tms_cstime, (int)t2.tms_cutime);
        }
        pthread_exit(NULL);
}

int main()
{
        pthread_t th[NUM_THREADS];
        long i;

        for (i = 0; i < NUM_THREADS; i++) {
                pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
}

[seto@localhost work]$ ./a.out
utime decreased, was 1066, now 1063! : (102 1066 0 0) (102 1063 0 0)
utime decreased, was 1101, now 1099! : (114 1101 0 0) (114 1099 0 0)
utime decreased, was 1095, now 1093! : (146 1095 0 0) (146 1093 0 0)
utime decreased, was 1089, now 1086! : (166 1089 0 0) (166 1086 0 0)
utime decreased, was 1071, now 1069! : (212 1071 0 0) (212 1069 0 0)
utime decreased, was 1057, now 1054! : (274 1057 0 0) (274 1054 0 0)
utime decreased, was 1054, now 1049! : (79 1054 0 0) (303 1049 0 0)
utime decreased, was 1050, now 1048! : (313 1050 0 0) (313 1048 0 0)
utime decreased, was 1049, now 1046! : (319 1049 0 0) (319 1046 0 0)
utime decreased, was 1058, now 1038! : (83 1058 0 0) (369 1038 0 0)
utime decreased, was 1038, now 1036! : (378 1038 0 0) (378 1036 0 0)
utime decreased, was 1037, now 1035! : (384 1037 0 0) (384 1035 0 0)
utime decreased, was 1035, now 1034! : (385 1035 0 0) (386 1034 0 0)
utime decreased, was 1037, now 1035! : (385 1037 0 0) (385 1035 0 0)
utime decreased, was 1035, now 1032! : (389 1035 0 0) (390 1032 0 0)
utime decreased, was 1030, now 1028! : (402 1030 0 0) (402 1028 0 0)
utime decreased, was 1029, now 1026! : (405 1029 0 0) (405 1026 0 0)
utime decreased, was 1025, now 1024! : (409 1025 0 0) (410 1024 0 0)
utime decreased, was 1023, now 1021! : (420 1023 0 0) (420 1021 0 0)
utime decreased, was 1022, now 1020! : (423 1022 0 0) (423 1020 0 0)
utime decreased, was 1037, now 1017! : (372 1037 0 0) (432 1017 0 0)
utime decreased, was 1019, now 1017! : (431 1019 0 0) (432 1017 0 0)
utime decreased, was 1053, now 1015! : (79 1053 0 0) (434 1015 0 0)
utime decreased, was 1013, now 1011! : (446 1013 0 0) (446 1011 0 0)
utime decreased, was 1013, now 1010! : (448 1013 0 0) (448 1010 0 0)
utime decreased, was 1053, now 1009! : (79 1053 0 0) (451 1009 0 0)
[seto@localhost work]$ taskset 0x1 ./a.out
utime decreased, was 1025, now 1021! : (59 1025 0 0) (417 1021 0 0)
utime decreased, was 1023, now 1021! : (59 1023 0 0) (419 1021 0 0)
utime decreased, was 1027, now 1020! : (60 1027 0 0) (420 1020 0 0)
utime decreased, was 1068, now 1000! : (173 1068 0 0) (500 1000 0 0)


I'll check the commits you pointed out/reverted.


Thanks,
H.Seto

2009-11-05 05:24:47

by Hidetoshi Seto

Subject: Re: utime/stime decreasing on thread exit

Hidetoshi Seto wrote:
> Spencer Candland wrote:
>> I am seeing a problem with utime/stime decreasing on thread exit in a
>> multi-threaded process. I have been able to track this regression down
>> to the "process wide cpu clocks/timers" changes introduced in
>> 2.6.29-rc5; specifically, when I revert the following commits I no
>> longer see decreasing utime/stime values:
>>
>> 4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
>> 3fccfd67df79c6351a156eb25a7a514e5f39c4d9
>> 7d8e23df69820e6be42bcc41d441f4860e8c76f7
>> 4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
>> 32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0
(snip)
>
> I'll check commits you pointed/reverted.

These are likely part of the series containing:

> commit 4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
> Author: Peter Zijlstra <[email protected]>
> Date: Thu Feb 5 12:24:16 2009 +0100
>
> timers: split process wide cpu clocks/timers

This results in sys_times() using "clocks" instead of "timers". Please
refer to the description of the above commit.

I found 2 problems through my review.


Problem [1]:
thread_group_cputime() vs exit

+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
+{
+        struct sighand_struct *sighand;
+        struct signal_struct *sig;
+        struct task_struct *t;
+
+        *times = INIT_CPUTIME;
+
+        rcu_read_lock();
+        sighand = rcu_dereference(tsk->sighand);
+        if (!sighand)
+                goto out;
+
+        sig = tsk->signal;
+
+        t = tsk;
+        do {
+                times->utime = cputime_add(times->utime, t->utime);
+                times->stime = cputime_add(times->stime, t->stime);
+                times->sum_exec_runtime += t->se.sum_exec_runtime;
+
+                t = next_thread(t);
+        } while (t != tsk);
+
+        times->utime = cputime_add(times->utime, sig->utime);
+        times->stime = cputime_add(times->stime, sig->stime);
+        times->sum_exec_runtime += sig->sum_sched_runtime;
+out:
+        rcu_read_unlock();
+}

If one of the (thousands of) threads exits while another thread is in the
do-while loop above, the s/utime of the exited thread can be accounted
twice: once in the do-while loop (before exit) and again in the final
cputime_add() (after exit).

I suppose this is hard to fix: taking a lock on the signal struct would
solve the problem, but it could block all other threads for a long time
and cause serious performance issues, and so on...
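For reference, a minimal sketch of what the locking would look like on the
sys_times() path - essentially what a later patch in this thread ends up
doing, by moving the call under ->siglock in do_sys_times():

void do_sys_times(struct tms *tms)
{
        struct task_cputime cputime;
        cputime_t cutime, cstime;

        spin_lock_irq(&current->sighand->siglock);
        /* no thread can exit and fold its counters into sig->s/utime
         * while we iterate, so nothing is counted twice */
        thread_group_cputime(current, &cputime);
        cutime = current->signal->cutime;
        cstime = current->signal->cstime;
        spin_unlock_irq(&current->sighand->siglock);
        /* ... fill *tms as before ... */
}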


Problem [2]:
use of task_s/utime()

I modified the test program further, to call times() 6 times and print
the values if utime decreased between the 3rd and 4th calls.
I noticed that if problem [1] were the root cause, I cannot explain why
the results show a continuously decreased value, instead of an increased
value at a single point (like (v)(v)(V)(v)(v)(v)) as would be expected.

:
times decreased : (104 984) (104 984) (104 984) (105 983) (105 983) (105 983)
times decreased : (115 981) (116 980) (116 978) (117 977) (117 977) (119 979)
times decreased : (116 980) (117 980) (117 980) (117 977) (118 979) (118 977)
:

And it seems that the more threads exit, the more utime decreases.

Soon I found:

[kernel/exit.c]
+ sig->utime = cputime_add(sig->utime, task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, task_stime(tsk));

While thread_group_cputime() accumulates raw s/utime in its do-while
loop, the signal struct accumulates the adjusted s/utime of exited
threads.

I'm not sure how this adjustment works, but applying the following patch
makes the result a little bit better:

:
times decreased : (436 741) (436 741) (437 744) (436 742) (436 742) (436 742)
times decreased : (454 792) (454 792) (455 794) (454 792) (454 792) (454 792)
times decreased : (503 941) (503 941) (504 943) (503 941) (503 941) (503 941)
:

But the decreasing (or increasing) still continues, because problem [1]
is still there, at least.

I don't think I can take this problem any further... Can anybody help?


Thanks,
H.Seto

===

Subject: [PATCH] thread_group_cputime() should use task_s/utime()

The signal struct accumulates adjusted cputime of exited threads,
so thread_group_cputime() should use task_s/utime() instead of raw
task->s/utime, to accumulate adjusted cputime of live threads.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
kernel/posix-cpu-timers.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..e065b8a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 
         t = tsk;
         do {
-                times->utime = cputime_add(times->utime, t->utime);
-                times->stime = cputime_add(times->stime, t->stime);
+                times->utime = cputime_add(times->utime, task_utime(t));
+                times->stime = cputime_add(times->stime, task_stime(t));
                 times->sum_exec_runtime += t->se.sum_exec_runtime;
 
         t = next_thread(t);
--
1.6.5.2

2009-11-09 14:49:18

by Peter Zijlstra

Subject: Re: utime/stime decreasing on thread exit

On Thu, 2009-11-05 at 14:24 +0900, Hidetoshi Seto wrote:

> Problem [1]:
> thread_group_cputime() vs exit
>
> +void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> +{
> +        struct sighand_struct *sighand;
> +        struct signal_struct *sig;
> +        struct task_struct *t;
> +
> +        *times = INIT_CPUTIME;
> +
> +        rcu_read_lock();
> +        sighand = rcu_dereference(tsk->sighand);
> +        if (!sighand)
> +                goto out;
> +
> +        sig = tsk->signal;
> +
> +        t = tsk;
> +        do {
> +                times->utime = cputime_add(times->utime, t->utime);
> +                times->stime = cputime_add(times->stime, t->stime);
> +                times->sum_exec_runtime += t->se.sum_exec_runtime;
> +
> +                t = next_thread(t);
> +        } while (t != tsk);
> +
> +        times->utime = cputime_add(times->utime, sig->utime);
> +        times->stime = cputime_add(times->stime, sig->stime);
> +        times->sum_exec_runtime += sig->sum_sched_runtime;
> +out:
> +        rcu_read_unlock();
> +}
>
> If one of (thousands) threads do exit while a thread is doing do-while
> above, the s/utime of exited thread can be accounted twice, at do-while
> (before exit) and at cputime_add() at last (after exit).
>
> I suppose this is hard to fix: Taking lock on signal would solve this
> problem, but it could block all other threads long and cause serious
> performance issue and so on...

I just checked .22 and there we seem to hold p->sighand->siglock over
the full task iteration. So we might as well revert to that if people
really mind counting things twice :-)

FWIW getrusage() also takes siglock over the task iteration.

Alternatively, we could try reading the sig->[us]time before doing the
loop, but I guess that's still racy in that we can then miss someone
altogether.

> Problem [2]:
> use of task_s/utime()
>
> I modified the test program more, to take times() 6 times and print them
> if utime decreased between 3rd and 4th.
> I noticed that I cannot explain that if the problem [1] was the root cause
> then why results show decreased value continuously, instead of an increased
> value at a point (like (v)(v)(V)(v)(v)(v)) which is expected.
>
> :
> times decreased : (104 984) (104 984) (104 984) (105 983) (105 983) (105 983)
> times decreased : (115 981) (116 980) (116 978) (117 977) (117 977) (119 979)
> times decreased : (116 980) (117 980) (117 980) (117 977) (118 979) (118 977)
> :
>
> And it seems that the more thread exits the more utime decreases.
>
> Soon I found:
>
> [kernel/exit.c]
> + sig->utime = cputime_add(sig->utime, task_utime(tsk));
> + sig->stime = cputime_add(sig->stime, task_stime(tsk));
>
> While the thread_group_cputime() accumulates raw s/utime in do-while loop,
> the signal struct accumulates adjusted s/utime of exited threads.
>
> I'm not sure how this adjustment works but applying the following patch
> makes the result little bit better:
>
> :
> times decreased : (436 741) (436 741) (437 744) (436 742) (436 742) (436 742)
> times decreased : (454 792) (454 792) (455 794) (454 792) (454 792) (454 792)
> times decreased : (503 941) (503 941) (504 943) (503 941) (503 941) (503 941)
> :
>
> But still decreasing(or increasing) continues, because there is a problem [1]
> at least.
>
> I think I couldn't handle this problem any more... Anybody can help?

Stick in a few trace_printk()s and see what happens?

> Subject: [PATCH] thread_group_cputime() should use task_s/utime()
>
> The signal struct accumulates adjusted cputime of exited threads,
> so thread_group_cputime() should use task_s/utime() instead of raw
> task->s/utime, to accumulate adjusted cputime of live threads.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
> kernel/posix-cpu-timers.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 5c9dc22..e065b8a 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>
>          t = tsk;
>          do {
> -                times->utime = cputime_add(times->utime, t->utime);
> -                times->stime = cputime_add(times->stime, t->stime);
> +                times->utime = cputime_add(times->utime, task_utime(t));
> +                times->stime = cputime_add(times->stime, task_stime(t));
>                  times->sum_exec_runtime += t->se.sum_exec_runtime;
>
>          t = next_thread(t);

So what you're trying to say is that because __exit_signal() uses
task_[usg]time() to accumulate sig->[usg]time, we should use it too in
the loop over the live threads?

I'm thinking it's the task_[usg]time() usage in __exit_signal() that's
the issue.

I tried running the modified test.c on a current -tip kernel but could
not observe the problem (dual-core opteron).

2009-11-09 17:25:44

by Oleg Nesterov

Subject: Re: utime/stime decreasing on thread exit

On 11/09, Peter Zijlstra wrote:
>
> On Thu, 2009-11-05 at 14:24 +0900, Hidetoshi Seto wrote:
>
> > Problem [1]:
> > thread_group_cputime() vs exit
> >
> > +void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> > +{
> > +        struct sighand_struct *sighand;
> > +        struct signal_struct *sig;
> > +        struct task_struct *t;
> > +
> > +        *times = INIT_CPUTIME;
> > +
> > +        rcu_read_lock();
> > +        sighand = rcu_dereference(tsk->sighand);
> > +        if (!sighand)
> > +                goto out;
> > +
> > +        sig = tsk->signal;
> > +
> > +        t = tsk;
> > +        do {
> > +                times->utime = cputime_add(times->utime, t->utime);
> > +                times->stime = cputime_add(times->stime, t->stime);
> > +                times->sum_exec_runtime += t->se.sum_exec_runtime;
> > +
> > +                t = next_thread(t);
> > +        } while (t != tsk);
> > +
> > +        times->utime = cputime_add(times->utime, sig->utime);
> > +        times->stime = cputime_add(times->stime, sig->stime);
> > +        times->sum_exec_runtime += sig->sum_sched_runtime;
> > +out:
> > +        rcu_read_unlock();
> > +}
> >
> > If one of (thousands) threads do exit while a thread is doing do-while
> > above, the s/utime of exited thread can be accounted twice, at do-while
> > (before exit) and at cputime_add() at last (after exit).
> >
> > I suppose this is hard to fix: Taking lock on signal would solve this
> > problem, but it could block all other threads long and cause serious
> > performance issue and so on...
>
> I just checked .22 and there we seem to hold p->sighand->siglock over
> the full task iteration.

Yes. Then thread_group_cputime() was changed so that it didn't iterate
over sub-threads, then this callsite was moved outside of ->siglock, and
now it does while_each_thread() again.

> So we might as well revert back to that if
> people really mind counting things twice :-)

Stanislaw has already sent the patch, but I don't know what happened
with this patch:

[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145

Oleg.

2009-11-09 17:32:40

by Oleg Nesterov

Subject: Re: utime/stime decreasing on thread exit

Sorry, forgot to cc Stanislaw...

Oleg.

2009-11-09 17:31:49

by Peter Zijlstra

Subject: Re: utime/stime decreasing on thread exit

On Mon, 2009-11-09 at 18:20 +0100, Oleg Nesterov wrote:

> > I just checked .22 and there we seem to hold p->sighand->siglock over
> > the full task iteration.
>
> Yes. Then thread_group_cputime() was changed so that it didn't iterate
> over sub-threads, then this callsite was moved outside of ->siglock, and
> now it does while_each_thread() again.
>
> > So we might as well revert back to that if
> > people really mind counting things twice :-)
>
> Stanislaw has already sent the patch, but I don't know what happened
> with this patch:
>
> [PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
> http://marc.info/?l=linux-kernel&m=124505545131145

That patch has the siglock in the function calling
thread_group_cputime(), the 22 code had it near the loop proper, which
to me seems a more sensible thing, since there could be more callers,
no?

2009-11-09 19:29:18

by Oleg Nesterov

Subject: Re: utime/stime decreasing on thread exit

On 11/09, Peter Zijlstra wrote:
>
> On Mon, 2009-11-09 at 18:20 +0100, Oleg Nesterov wrote:
>
> > Stanislaw has already sent the patch, but I don't know what happened
> > with this patch:
> >
> > [PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
> > http://marc.info/?l=linux-kernel&m=124505545131145
>
> That patch has the siglock in the function calling
> thread_group_cputime(), the 22 code had it near the loop proper, which
> to me seems a more sensible thing, since there could be more callers,
> no?

Well, we can't take ->siglock in thread_group_cputime(); sometimes it
is called under ->siglock - do_task_stat() and get_cpu_itimer(), at least.

IIRC, Stanislaw verified other callers have no problems with this helper.

Oleg.

2009-11-09 19:32:25

by Peter Zijlstra

Subject: Re: utime/stime decreasing on thread exit

On Mon, 2009-11-09 at 20:23 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > On Mon, 2009-11-09 at 18:20 +0100, Oleg Nesterov wrote:
> >
> > > Stanislaw has already sent the patch, but I don't know what happened
> > > with this patch:
> > >
> > > [PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
> > > http://marc.info/?l=linux-kernel&m=124505545131145
> >
> > That patch has the siglock in the function calling
> > thread_group_cputime(), the 22 code had it near the loop proper, which
> > to me seems a more sensible thing, since there could be more callers,
> > no?
>
> Well, we can't take ->siglock in thread_group_cputime(), sometimes it
> is called under ->siglock. do_task_stat(), get_cpu_itimer() at least.
>
> IIRC, Stanislaw verified other callers have no problems with this helper.

Would have made fine changelog material :-)

2009-11-10 05:42:43

by Hidetoshi Seto

Subject: Re: utime/stime decreasing on thread exit

Peter Zijlstra wrote:
> On Thu, 2009-11-05 at 14:24 +0900, Hidetoshi Seto wrote:
>
>> Problem [1]:
>> thread_group_cputime() vs exit
(snip)
>
> I just checked .22 and there we seem to hold p->sighand->siglock over
> the full task iteration. So we might as well revert back to that if
> people really mind counting things twice :-)
>
> FWIW getrusage() also takes siglock over the task iteration.
>
> Alternatively, we could try reading the sig->[us]time before doing the
> loop, but I guess that's still racy in that we can then miss someone
> altogether.

Right. So "before or after" will not be any help.

As you pointed out, there are some other functions taking siglock over the
task iteration, so making do_sys_times() do the same will be acceptable.
In other words, using many threads has risks, which might be solved in
the future.

I agree that the following patch is the right fix for this.

[PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
http://marc.info/?l=linux-kernel&m=124505545131145



>> Problem [2]:
>> use of task_s/utime()
>>
(snip)
>> [kernel/exit.c]
>> + sig->utime = cputime_add(sig->utime, task_utime(tsk));
>> + sig->stime = cputime_add(sig->stime, task_stime(tsk));
>>
>> While the thread_group_cputime() accumulates raw s/utime in do-while loop,
>> the signal struct accumulates adjusted s/utime of exited threads.
>>
>> I'm not sure how this adjustment works but applying the following patch
>> makes the result little bit better:
>>
>> :
>> times decreased : (436 741) (436 741) (437 744) (436 742) (436 742) (436 742)
>> times decreased : (454 792) (454 792) (455 794) (454 792) (454 792) (454 792)
>> times decreased : (503 941) (503 941) (504 943) (503 941) (503 941) (503 941)
>> :
>>
>> But still decreasing(or increasing) continues, because there is a problem [1]
>> at least.
>>
>> I think I couldn't handle this problem any more... Anybody can help?
>
> Stick in a few trace_printk()s and see what happens?

Nice idea.
I tried it and show the result later in this mail.

>> Subject: [PATCH] thread_group_cputime() should use task_s/utime()
>>
>> The signal struct accumulates adjusted cputime of exited threads,
>> so thread_group_cputime() should use task_s/utime() instead of raw
>> task->s/utime, to accumulate adjusted cputime of live threads.
>>
>> Signed-off-by: Hidetoshi Seto <[email protected]>
>> ---
>> kernel/posix-cpu-timers.c | 4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>> index 5c9dc22..e065b8a 100644
>> --- a/kernel/posix-cpu-timers.c
>> +++ b/kernel/posix-cpu-timers.c
>> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>>
>>          t = tsk;
>>          do {
>> -                times->utime = cputime_add(times->utime, t->utime);
>> -                times->stime = cputime_add(times->stime, t->stime);
>> +                times->utime = cputime_add(times->utime, task_utime(t));
>> +                times->stime = cputime_add(times->stime, task_stime(t));
>>                  times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>>          t = next_thread(t);
>
> So what you're trying to say is that because __exit_signal() uses
> task_[usg]time() to accumulate sig->[usg]time, we should use it too in
> the loop over the live threads?

Right. Thank you for trying to understand.

>
> I'm thinking it's the task_[usg]time() usage in __exit_signal() that's
> the issue.

It likely means reverting:

commit 49048622eae698e5c4ae61f7e71200f265ccc529
Author: Balbir Singh <[email protected]>
Date:   Fri Sep 5 18:12:23 2008 +0200

    sched: fix process time monotonicity

I'm not sure why the task_[usg]time() functions were introduced, but
removing them would be one possible solution.

> I tried running the modified test.c on a current -tip kernel but could
> not observe the problem (dual-core opteron).

It might not happen if your box is quite fast (or otherwise slow).
Changing the loop in test.c (i.e. the cycles spent in user-land) might
improve the reproducibility...



Here is the result of an additional test:

I put a trace_printk() in __exit_signal(), to print tsk->s/utime,
task_s/utime() and tsk->se.sum_exec_runtime.
(And tsk->prev_s/utime before calling task_s/utime().)

<...>-2857 [006] 112.731732: release_task: (37 22)to(40 20), sxr 57480477, (0 0)
<...>-5077 [009] 526.272338: release_task: (0 27)to(10 20), sxr 27997019, (0 0)
<...>-4999 [009] 526.272396: release_task: (1 27)to(10 20), sxr 27967513, (0 0)
<...>-4992 [006] 526.328591: release_task: (2 34)to(10 30), sxr 35823013, (0 0)
<...>-5022 [012] 526.329183: release_task: (0 27)to(10 20), sxr 27761854, (0 0)
<...>-4996 [010] 526.329203: release_task: (3 38)to(10 30), sxr 39200207, (0 0)

... It clearly proves that there is a 3rd problem.

Problem [3]:
granularity of task_s/utime()

According to the git log, task_s/utime() were originally designed to
return clock_t, but were later changed to return cputime_t by the
following commit:

commit efe567fc8281661524ffa75477a7c4ca9b466c63
Author: Christian Borntraeger <[email protected]>
Date: Thu Aug 23 15:18:02 2007 +0200

It only changed the type of the return value, but not the implementation.
As a result the granularity of task_s/utime() is still that of clock_t,
not that of cputime_t.

So using task_s/utime() in __exit_signal() causes the values accumulated
in the signal struct to be rounded and coarse grained.
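To put rough numbers on it (assuming a typical x86 configuration where
cputime_t counts jiffies, HZ=1000 and USER_HZ=100): cputime_to_clock_t()
divides by 10 and truncates, so a thread that exits with, say, 1009
jiffies of utime contributes only 1000 jiffies worth after the round trip
through clock_t - up to 9 jiffies are lost per exiting thread. Summed over
hundreds of exiting threads, the total in sig->utime falls visibly below
what the do-while loop over live threads would have produced, which
matches the observation above that the more threads exit, the more utime
decreases.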

After applying a fix (which will follow this mail), the return values
become fine grained:

<...>-5438 [006] 135.212289: release_task: (0 28)to(0 26), sxr 26648558, (0 0)
<...>-5402 [015] 135.213193: release_task: (0 27)to(0 26), sxr 26725886, (0 0)
<...>-5408 [011] 135.214172: release_task: (0 28)to(0 26), sxr 26607882, (0 0)
<...>-5419 [005] 135.214410: release_task: (1 27)to(1 25), sxr 26612615, (0 0)
<...>-5350 [009] 135.214937: release_task: (0 28)to(0 27), sxr 27028388, (0 0)
<...>-5443 [008] 135.216748: release_task: (0 28)to(0 26), sxr 26372691, (0 0)

But it cannot stop the adjusted values from becoming smaller than
tsk->s/utime. So some approach to solving problem [2] is still required.


Thanks,
H.Seto

2009-11-10 05:47:41

by Hidetoshi Seto

Subject: [PATCH] fix granularity of task_u/stime()

Remove the casts to clock_t from task_u/stime(), and keep the granularity
of cputime_t throughout the calculation.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
kernel/sched.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 28dd4f4..1632e82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
         return p->stime;
 }
 #else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
+#endif
+
 cputime_t task_utime(struct task_struct *p)
 {
-        clock_t utime = cputime_to_clock_t(p->utime),
-                total = utime + cputime_to_clock_t(p->stime);
+        cputime_t utime = p->utime, total = utime + p->stime;
         u64 temp;
 
         /*
          * Use CFS's precise accounting:
          */
-        temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+        temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
 
         if (total) {
                 temp *= utime;
                 do_div(temp, total);
         }
-        utime = (clock_t)temp;
+        utime = (cputime_t)temp;
 
-        p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+        p->prev_utime = max(p->prev_utime, utime);
         return p->prev_utime;
 }
 
 cputime_t task_stime(struct task_struct *p)
 {
-        clock_t stime;
+        cputime_t stime;
 
         /*
          * Use CFS's precise accounting. (we subtract utime from
          * the total, to make sure the total observed by userspace
          * grows monotonically - apps rely on that):
          */
-        stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-                cputime_to_clock_t(task_utime(p));
+        stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);
 
         if (stime >= 0)
-                p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+                p->prev_stime = max(p->prev_stime, stime);
 
         return p->prev_stime;
 }
--
1.6.5.2


2009-11-10 10:47:20

by Stanislaw Gruszka

Subject: Re: utime/stime decreasing on thread exit

On Mon, Nov 09, 2009 at 08:23:55PM +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > On Mon, 2009-11-09 at 18:20 +0100, Oleg Nesterov wrote:
> >
> > > Stanislaw has already sent the patch, but I don't know what happened
> > > with this patch:
> > >
> > > [PATCH 1/2] posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
> > > http://marc.info/?l=linux-kernel&m=124505545131145

I did not resubmit it enough times :) But I didn't think it was worth
it, since performance can be degraded.

> > That patch has the siglock in the function calling
> > thread_group_cputime(), the 22 code had it near the loop proper, which
> > to me seems a more sensible thing, since there could be more callers,
> > no?
>
> Well, we can't take ->siglock in thread_group_cputime(), sometimes it
> is called under ->siglock. do_task_stat(), get_cpu_itimer() at least.
>
> IIRC, Stanislaw verified other callers have no problems with this helper.

Actually I didn't. Trying to do so now.

Most calls are done with the sighand or tasklist lock taken, except:

Cscope tag: thread_group_cputime
   #   line  filename / context / line
   1   1353  fs/binfmt_elf.c <<fill_prstatus>>
             thread_group_cputime(p, &cputime);
   2   1403  fs/binfmt_elf_fdpic.c <<fill_prstatus>>
             thread_group_cputime(p, &cputime);
  10    129  mm/oom_kill.c <<badness>>
             thread_group_cputime(p, &task_time);

I do not think the lock is needed in cases 1 and 2, since those appear to
be core dumps with all threads dead. Not sure about 10, the oom killer.

One other exception is:
fastpath_timer_check() -> thread_group_cputimer() -> thread_group_cputime()

We can solve it like this:

--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1375,13 +1375,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
         }
 
         sig = tsk->signal;
-        if (!task_cputime_zero(&sig->cputime_expires)) {
-                struct task_cputime group_sample;
-
-                thread_group_cputimer(tsk, &group_sample);
-                if (task_cputime_expired(&group_sample, &sig->cputime_expires))
-                        return 1;
-        }
+        if (!task_cputime_zero(&sig->cputime_expires))
+                return 1;
 
         return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
 }

Or stay with task_cputime_expired() but only if cputimer is currently running.
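A minimal sketch of that second variant (assuming the field layout of this
era, where the flag lives in sig->cputimer.running):

        sig = tsk->signal;
        if (sig->cputimer.running &&
            !task_cputime_zero(&sig->cputime_expires)) {
                struct task_cputime group_sample;

                /* sample only while group accounting is already running,
                 * so the timer-interrupt path never falls back to
                 * iterating over every thread */
                thread_group_cputimer(tsk, &group_sample);
                if (task_cputime_expired(&group_sample, &sig->cputime_expires))
                        return 1;
        }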

Cheers
Stanislaw

2009-11-10 17:45:21

by Oleg Nesterov

Subject: Re: utime/stime decreasing on thread exit

On 11/10, Stanislaw Gruszka wrote:
>
> 10 129 mm/oom_kill.c <<badness>>
> thread_group_cputime(p, &task_time);
>
> Not sure about 10 - oom killer.

Not sure either, but I don't think we should worry about badness().

> One other exception is:
> fastpath_timer_check() -> thread_group_cputimer() -> thread_group_cputime()
>
> We can solve this like that:
>
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -1375,13 +1375,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
>          }
>
>          sig = tsk->signal;
> -        if (!task_cputime_zero(&sig->cputime_expires)) {
> -                struct task_cputime group_sample;
> -
> -                thread_group_cputimer(tsk, &group_sample);
> -                if (task_cputime_expired(&group_sample, &sig->cputime_expires))
> -                        return 1;
> -        }
> +        if (!task_cputime_zero(&sig->cputime_expires))
> +                return 1;
>
>          return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
>  }
>
> Or stay with task_cputime_expired() but only if cputimer is currently running.

Oh. I forgot this code completely, can't comment.

Can't we ensure that fastpath_timer_check() never does while_each_thread()?
IOW, if sig->cputime_expires != 0 then ->running must be true.

At least, shouldn't stop_process_timers() clear sig->cputime_expires?

Oleg.

2009-11-10 18:25:57

by Stanislaw Gruszka

Subject: Re: utime/stime decreasing on thread exit

On Tue, Nov 10, 2009 at 06:40:08PM +0100, Oleg Nesterov wrote:
> > One other exception is:
> > fastpath_timer_check() -> thread_group_cputimer() -> thread_group_cputime()
> >
> > We can solve this like that:
> >
> > --- a/kernel/posix-cpu-timers.c
> > +++ b/kernel/posix-cpu-timers.c
> > @@ -1375,13 +1375,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
> >          }
> >
> >          sig = tsk->signal;
> > -        if (!task_cputime_zero(&sig->cputime_expires)) {
> > -                struct task_cputime group_sample;
> > -
> > -                thread_group_cputimer(tsk, &group_sample);
> > -                if (task_cputime_expired(&group_sample, &sig->cputime_expires))
> > -                        return 1;
> > -        }
> > +        if (!task_cputime_zero(&sig->cputime_expires))
> > +                return 1;
> >
> >          return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
> >  }
> >
> > Or stay with task_cputime_expired() but only if cputimer is currently running.
>
> Oh. I forgot this code completely, can't comment.
>
> Can't we ensure that fastpath_timer_check() never does while_each_thread()?

Removing the possibility of calling while_each_thread() from
fastpath_timer_check() was exactly my intention here; perhaps I was not
clear.

> IOW, if sig->cputime_expires != 0 then ->running must be true.
> At least, shouldn't stop_process_timers() clear sig->cputime_expires ?

I'm going to think about that. However, as far as I can see, checking
->running explicitly and going to the slow path when it is not true is
the safer solution.

Stanislaw

2009-11-10 19:28:40

by Oleg Nesterov

Subject: Re: utime/stime decreasing on thread exit

On 11/10, Stanislaw Gruszka wrote:
>
> On Tue, Nov 10, 2009 at 06:40:08PM +0100, Oleg Nesterov wrote:
> > >
> > > Or stay with task_cputime_expired() but only if cputimer is currently running.
> >
> > Oh. I forgot this code completely, can't comment.
> >
> > Can't we ensure that fastpath_timer_check() never does while_each_thread()?
>
> Removing the possibility of calling while_each_thread() from
> fastpath_timer_check() was exactly my intention here; perhaps I was not
> clear.

Yes, yes, I understand.

I meant, perhaps we can ensure this shouldn't happen "by design", instead
of checking ->running in fastpath_timer_check().

> > IOW, if sig->cputime_expires != 0 then ->running must be true.
> > At least, shouldn't stop_process_timers() clear sig->cputime_expires?
>
> I'm going to think about that. However, as far as I can see, checking
> ->running explicitly and going to the slow path when it is not true is
> the safer solution.

Yes, agreed.


Still, check_process_timers() updates sig->cputime_expires at the end,
but it never clears it. For example:

        if (sched_expires != 0 &&
            (sig->cputime_expires.sched_exp == 0 ||
             sig->cputime_expires.sched_exp > sched_expires))
                sig->cputime_expires.sched_exp = sched_expires;

Why?

Now suppose that (say) sig->cputime_expires.sched_exp != 0, there are
no cpu timers, and ->running == F.

In this case fastpath_timer_check() always returns T and triggers the
slow path, which does nothing. Not good.

But yes, I agree, probably we should start with the simple/safe change
as you suggested.

Oleg.

2009-11-11 12:13:26

by Stanislaw Gruszka

Subject: Re: [PATCH] fix granularity of task_u/stime()

Hi Hidetoshi

On Tue, Nov 10, 2009 at 02:47:34PM +0900, Hidetoshi Seto wrote:
> Remove casting to clock_t from task_u/stime(), and keep granularity of
> cputime_t over the calculation.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
> kernel/sched.c | 21 ++++++++++++---------
> 1 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 28dd4f4..1632e82 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
> return p->stime;
> }
> #else
> +
> +#ifndef nsecs_to_cputime
> +# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
> +#endif
> +

I would like to test, but unfortunately with the patch the kernel does
not build on my system:

kernel/built-in.o: In function `task_utime':
/usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to
`__udivdi3'
kernel/built-in.o: In function `task_stime':
/usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to
`__udivdi3'
make: *** [.tmp_vmlinux1] Error 1

Cheers
Stanislaw

2009-11-12 00:00:42

by Hidetoshi Seto

Subject: Re: [PATCH] fix granularity of task_u/stime()

Stanislaw Gruszka wrote:
> On Tue, Nov 10, 2009 at 02:47:34PM +0900, Hidetoshi Seto wrote:
>> Remove casting to clock_t from task_u/stime(), and keep granularity of
>> cputime_t over the calculation.
>>
>> Signed-off-by: Hidetoshi Seto <[email protected]>
>> ---
>> kernel/sched.c | 21 ++++++++++++---------
>> 1 files changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 28dd4f4..1632e82 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
>> return p->stime;
>> }
>> #else
>> +
>> +#ifndef nsecs_to_cputime
>> +# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
>> +#endif
>> +
>
> I would like to test, but unfortunately with the patch the kernel does
> not build on my system:
>
> kernel/built-in.o: In function `task_utime':
> /usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to
> `__udivdi3'
> kernel/built-in.o: In function `task_stime':
> /usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to
> `__udivdi3'
> make: *** [.tmp_vmlinux1] Error 1

Wow, what arch?
Could you provide your .config?


Thanks,
H.Seto

2009-11-12 02:50:00

by Hidetoshi Seto

Subject: Re: [PATCH] fix granularity of task_u/stime()

Hidetoshi Seto wrote:
> Stanislaw Gruszka wrote:
>> On Tue, Nov 10, 2009 at 02:47:34PM +0900, Hidetoshi Seto wrote:
>>> Remove casting to clock_t from task_u/stime(), and keep granularity of
>>> cputime_t over the calculation.
>>>
>>> Signed-off-by: Hidetoshi Seto <[email protected]>
>>> ---
>>> kernel/sched.c | 21 ++++++++++++---------
>>> 1 files changed, 12 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index 28dd4f4..1632e82 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>>> @@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
>>> return p->stime;
>>> }
>>> #else
>>> +
>>> +#ifndef nsecs_to_cputime
>>> +# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
>>> +#endif
>>> +
>> I would like to test, but unfortunately with the patch the kernel does
>> not build on my system:
>>
>> kernel/built-in.o: In function `task_utime':
>> /usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to
>> `__udivdi3'
>> kernel/built-in.o: In function `task_stime':
>> /usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to
>> `__udivdi3'
>> make: *** [.tmp_vmlinux1] Error 1
>
> Wow, what arch?
> Could you provide your .config?

Oh, I found I can hit this error on x86_32 (but not on x86_64).
Maybe something about dividing a u64...

I'll try to find a fix/workaround for this, but would appreciate it if
someone could tell me what is happening here.


Thanks,
H.Seto

2009-11-12 02:55:34

by Cong Wang

Subject: Re: [PATCH] fix granularity of task_u/stime()

On Thu, Nov 12, 2009 at 10:49 AM, Hidetoshi Seto
<[email protected]> wrote:
> Hidetoshi Seto wrote:
>> Stanislaw Gruszka wrote:
>>> On Tue, Nov 10, 2009 at 02:47:34PM +0900, Hidetoshi Seto wrote:
>>>> Remove casting to clock_t from task_u/stime(), and keep granularity of
>>>> cputime_t over the calculation.
>>>>
>>>> Signed-off-by: Hidetoshi Seto <[email protected]>
>>>> ---
>>>>  kernel/sched.c |   21 ++++++++++++---------
>>>>  1 files changed, 12 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index 28dd4f4..1632e82 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -5171,41 +5171,44 @@ cputime_t task_stime(struct task_struct *p)
>>>>     return p->stime;
>>>>  }
>>>>  #else
>>>> +
>>>> +#ifndef nsecs_to_cputime
>>>> +# define nsecs_to_cputime(__nsecs) msecs_to_cputime((__nsecs) / NSEC_PER_MSEC)
>>>> +#endif
>>>> +
>>> I would like to test, but unfortunately with the patch the kernel does
>>> not build on my system:
>>>
>>> kernel/built-in.o: In function `task_utime':
>>> /usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to
>>> `__udivdi3'
>>> kernel/built-in.o: In function `task_stime':
>>> /usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to
>>> `__udivdi3'
>>> make: *** [.tmp_vmlinux1] Error 1
>>
>> Wow, what arch?
>> Could you provide your .config?
>
> Oh, I found I can hit this error on x86_32 (but not on x86_64).
> Maybe something about dividing a u64...
>
> I'll try to find a fix/workaround for this, but would appreciate it if
> someone could tell me what is happening here.

Yeah, this is because on 32-bit gcc tries to use libgcc to do the 64-bit
integer arithmetic. You can insert inline asm to force gcc not to
optimize this code. ;)
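For illustration, a minimal sketch of the workaround (assuming
linux/math64.h for div_u64()):

        #include <linux/math64.h>

        static u64 ns_to_ms(u64 nsecs)
        {
                /*
                 * "return nsecs / NSEC_PER_MSEC;" would make gcc emit a
                 * call to libgcc's __udivdi3 on 32-bit, which the kernel
                 * does not link against; div_u64() does the 64-by-32
                 * divide with the kernel's own helpers instead.
                 */
                return div_u64(nsecs, NSEC_PER_MSEC);
        }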

2009-11-12 04:16:46

by Hidetoshi Seto

Subject: Re: [PATCH] fix granularity of task_u/stime()

Américo Wang wrote:
>>>> kernel/built-in.o: In function `task_utime':
>>>> /usr/src/rhel5/linux-2.6/kernel/sched.c:5187: undefined reference to
>>>> `__udivdi3'
>>>> kernel/built-in.o: In function `task_stime':
>>>> /usr/src/rhel5/linux-2.6/kernel/sched.c:5208: undefined reference to
>>>> `__udivdi3'
>>>> make: *** [.tmp_vmlinux1] Error 1

>> Oh, I found I can hit this error on x86_32 (but not on x86_64).
>> Maybe something on dividing u64...
>>
>> I'll try to find fix/workaround for this, but would appreciate if someone
>> could tell me what is happening here.
>
> Yeah, this is because on 32-bit gcc tries to use libgcc to do 64-bit
> integer computing. You can insert inline asm to force gcc not to
> optimize this code. ;)

Thank you for the information.

According to the following commit, which I found addresses the same
"__udivdi3" error, using div_u64() will fix this problem, right?

commit 956dba3caaf66b84fe5f6180e0e4dd03902c7980
Author: Andy Whitcroft <[email protected]>
Date: Wed Jul 1 15:20:59 2009 +0100

I'm now testing modified patch...


Thanks,
H.Seto

2009-11-12 04:33:58

by Hidetoshi Seto

Subject: [PATCH] fix granularity of task_u/stime(), v2

Originally task_s/utime() were designed to return clock_t, but were
later changed to return cputime_t by the following commit:

commit efe567fc8281661524ffa75477a7c4ca9b466c63
Author: Christian Borntraeger <[email protected]>
Date: Thu Aug 23 15:18:02 2007 +0200

It only changed the type of the return value, but not the implementation.
As a result the granularity of task_s/utime() is still that of
clock_t, not that of cputime_t.

So using task_s/utime() in __exit_signal() causes the values accumulated
in the signal struct to be rounded and coarse grained.

This patch removes the casts to clock_t in task_u/stime(), to keep
the granularity of cputime_t throughout the calculation.

v2:
Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
on some 32-bit systems.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
kernel/sched.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3c11ae0..1f8d028 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5172,41 +5172,45 @@ cputime_t task_stime(struct task_struct *p)
         return p->stime;
 }
 #else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) \
+        msecs_to_cputime(div_u64((__nsecs), NSEC_PER_MSEC))
+#endif
+
 cputime_t task_utime(struct task_struct *p)
 {
-        clock_t utime = cputime_to_clock_t(p->utime),
-                total = utime + cputime_to_clock_t(p->stime);
+        cputime_t utime = p->utime, total = utime + p->stime;
         u64 temp;
 
         /*
          * Use CFS's precise accounting:
          */
-        temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+        temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
 
         if (total) {
                 temp *= utime;
                 do_div(temp, total);
         }
-        utime = (clock_t)temp;
+        utime = (cputime_t)temp;
 
-        p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+        p->prev_utime = max(p->prev_utime, utime);
         return p->prev_utime;
 }
 
 cputime_t task_stime(struct task_struct *p)
 {
-        clock_t stime;
+        cputime_t stime;
 
         /*
          * Use CFS's precise accounting. (we subtract utime from
          * the total, to make sure the total observed by userspace
          * grows monotonically - apps rely on that):
          */
-        stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-                cputime_to_clock_t(task_utime(p));
+        stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);
 
         if (stime >= 0)
-                p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+                p->prev_stime = max(p->prev_stime, stime);
 
         return p->prev_stime;
 }
--
1.6.5.2

2009-11-12 14:16:14

by Peter Zijlstra

Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Thu, 2009-11-12 at 13:33 +0900, Hidetoshi Seto wrote:
> Originally task_s/utime() were designed to return clock_t, but were
> later changed to return cputime_t by the following commit:
>
> commit efe567fc8281661524ffa75477a7c4ca9b466c63
> Author: Christian Borntraeger <[email protected]>
> Date: Thu Aug 23 15:18:02 2007 +0200
>
> It only changed the type of the return value, but not the implementation.
> As a result the granularity of task_s/utime() is still that of
> clock_t, not that of cputime_t.
>
> So using task_s/utime() in __exit_signal() causes the values accumulated
> in the signal struct to be rounded and coarse grained.
>
> This patch removes the casts to clock_t in task_u/stime(), to keep
> the granularity of cputime_t throughout the calculation.
>
> v2:
> Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
> on some 32-bit systems.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>

/me hates all the cputime_t and clock_t mess.. but I guess the patch is
good.

Thanks.

2009-11-12 14:50:54

by Stanislaw Gruszka

Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

Hi

On Thu, Nov 12, 2009 at 01:33:45PM +0900, Hidetoshi Seto wrote:
> Originally task_s/utime() were designed to return clock_t, but were
> later changed to return cputime_t by the following commit:
>
> commit efe567fc8281661524ffa75477a7c4ca9b466c63
> Author: Christian Borntraeger <[email protected]>
> Date: Thu Aug 23 15:18:02 2007 +0200
>
> It only changed the type of the return value, but not the implementation.
> As a result the granularity of task_s/utime() is still that of
> clock_t, not that of cputime_t.
>
> So using task_s/utime() in __exit_signal() causes the values accumulated
> in the signal struct to be rounded and coarse grained.
>
> This patch removes the casts to clock_t in task_u/stime(), to keep
> the granularity of cputime_t throughout the calculation.
>
> v2:
> Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
> on some 32-bit systems.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
> kernel/sched.c | 22 +++++++++++++---------
> 1 files changed, 13 insertions(+), 9 deletions(-)

The patch does not fix the issue on my system. I tested it alone, together
with

posix-cpu-timers: avoid do_sys_times() races with __exit_signal()

and (further) together with:

--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 
         t = tsk;
         do {
-                times->utime = cputime_add(times->utime, t->utime);
-                times->stime = cputime_add(times->stime, t->stime);
+                times->utime = cputime_add(times->utime, task_utime(t));
+                times->stime = cputime_add(times->stime, task_stime(t));
                 times->sum_exec_runtime += t->se.sum_exec_runtime;
 
         t = next_thread(t);

The only thing that changed was the probability of hitting the issue. I
cannot reproduce the bug with the patch below:

diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..b85e384 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -110,8 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
                  * We won't ever get here for the group leader, since it
                  * will have been the last reference on the signal_struct.
                  */
-                sig->utime = cputime_add(sig->utime, task_utime(tsk));
-                sig->stime = cputime_add(sig->stime, task_stime(tsk));
+                sig->utime = cputime_add(sig->utime, tsk->utime); //task_utime(tsk));
+                sig->stime = cputime_add(sig->stime, tsk->stime); //task_stime(tsk));
                 sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
                 sig->min_flt += tsk->min_flt;
                 sig->maj_flt += tsk->maj_flt;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
         struct task_cputime cputime;
         cputime_t cutime, cstime;
 
-        thread_group_cputime(current, &cputime);
         spin_lock_irq(&current->sighand->siglock);
+        thread_group_cputime(current, &cputime);
         cutime = current->signal->cutime;
         cstime = current->signal->cstime;
         spin_unlock_irq(&current->sighand->siglock);

Perhaps we can remove task_{u,s}time() in some places, or maybe entirely?

Stanislaw

2009-11-12 15:00:43

by Peter Zijlstra

Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Thu, 2009-11-12 at 15:49 +0100, Stanislaw Gruszka wrote:
> I cannot reproduce
> the bug with the patch below:
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f7864ac..b85e384 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -110,8 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
>                   * We won't ever get here for the group leader, since it
>                   * will have been the last reference on the signal_struct.
>                   */
> -                sig->utime = cputime_add(sig->utime, task_utime(tsk));
> -                sig->stime = cputime_add(sig->stime, task_stime(tsk));
> +                sig->utime = cputime_add(sig->utime, tsk->utime); //task_utime(tsk));
> +                sig->stime = cputime_add(sig->stime, tsk->stime); //task_stime(tsk));
>                  sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
>                  sig->min_flt += tsk->min_flt;
>                  sig->maj_flt += tsk->maj_flt;

Yes, this is the thing I suggested, and it makes sense.

> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce17760..8be5b75 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
>          struct task_cputime cputime;
>          cputime_t cutime, cstime;
>
> -        thread_group_cputime(current, &cputime);
>          spin_lock_irq(&current->sighand->siglock);
> +        thread_group_cputime(current, &cputime);
>          cutime = current->signal->cutime;
>          cstime = current->signal->cstime;
>          spin_unlock_irq(&current->sighand->siglock);
>
> Perhaps we can remove task_{u,s}time() in some places, or maybe entirely?

I think task_[us]time() still has value for getrusage(), since it defeats
that crap where a task hides from top by never running during the tick.
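A concrete reading of the task_utime() code quoted earlier in this thread:
if a task manages to sleep across every tick, its sampled p->utime and
p->stime stay at 0 while p->se.sum_exec_runtime keeps growing; task_utime()
then returns the whole (converted) sum_exec_runtime, since the utime/total
scaling is skipped when total is zero. So a task that burned, say, 5
seconds of CPU while dodging every tick is still reported as about 5
seconds by getrusage(), whereas the raw tick-based counters would show 0.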

2009-11-12 15:42:29

by Stanislaw Gruszka

Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Thu, Nov 12, 2009 at 04:00:38PM +0100, Peter Zijlstra wrote:
> On Thu, 2009-11-12 at 15:49 +0100, Stanislaw Gruszka wrote:
> > I cannot reproduce
> > the bug with the patch below:
> >
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index f7864ac..b85e384 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -110,8 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
> >                   * We won't ever get here for the group leader, since it
> >                   * will have been the last reference on the signal_struct.
> >                   */
> > -                sig->utime = cputime_add(sig->utime, task_utime(tsk));
> > -                sig->stime = cputime_add(sig->stime, task_stime(tsk));
> > +                sig->utime = cputime_add(sig->utime, tsk->utime); //task_utime(tsk));
> > +                sig->stime = cputime_add(sig->stime, tsk->stime); //task_stime(tsk));
> >                  sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
> >                  sig->min_flt += tsk->min_flt;
> >                  sig->maj_flt += tsk->maj_flt;
>
> Yes, this is the thing I suggested, and it makes sense.

Ok, I'm going to post this.

Spencer,

it seems you have more test cases for utime decreasing issues;
could you send links to me? Somehow I could not find them
on my own. In particular, the test case used in developing this commit
is of interest:

commit 49048622eae698e5c4ae61f7e71200f265ccc529
Author: Balbir Singh <[email protected]>
Date:   Fri Sep 5 18:12:23 2008 +0200

    sched: fix process time monotonicity

Stanislaw

2009-11-12 18:13:11

by Hidetoshi Seto

Subject: [tip:sched/core] sched: Fix granularity of task_u/stime()

Commit-ID: 761b1d26df542fd5eb348837351e4d2f3bc7bffe
Gitweb: http://git.kernel.org/tip/761b1d26df542fd5eb348837351e4d2f3bc7bffe
Author: Hidetoshi Seto <[email protected]>
AuthorDate: Thu, 12 Nov 2009 13:33:45 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 12 Nov 2009 15:23:47 +0100

sched: Fix granularity of task_u/stime()

Originally task_s/utime() were designed to return clock_t but
were later changed to return cputime_t by the following commit:

commit efe567fc8281661524ffa75477a7c4ca9b466c63
Author: Christian Borntraeger <[email protected]>
Date: Thu Aug 23 15:18:02 2007 +0200

It only changed the type of the return value, but not the
implementation. As a result the granularity of task_s/utime()
is still that of clock_t, not that of cputime_t.

So using task_s/utime() in __exit_signal() makes the values
accumulated into the signal struct rounded and coarse
grained.

This patch removes casts to clock_t in task_u/stime(), to keep
granularity of cputime_t over the calculation.

v2:
Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
on some 32-bit systems.

Signed-off-by: Hidetoshi Seto <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: Spencer Candland <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 22 +++++++++++++---------
1 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 43e61fa..ab9a034 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5156,41 +5156,45 @@ cputime_t task_stime(struct task_struct *p)
return p->stime;
}
#else
+
+#ifndef nsecs_to_cputime
+# define nsecs_to_cputime(__nsecs) \
+ msecs_to_cputime(div_u64((__nsecs), NSEC_PER_MSEC))
+#endif
+
cputime_t task_utime(struct task_struct *p)
{
- clock_t utime = cputime_to_clock_t(p->utime),
- total = utime + cputime_to_clock_t(p->stime);
+ cputime_t utime = p->utime, total = utime + p->stime;
u64 temp;

/*
* Use CFS's precise accounting:
*/
- temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
+ temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);

if (total) {
temp *= utime;
do_div(temp, total);
}
- utime = (clock_t)temp;
+ utime = (cputime_t)temp;

- p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
+ p->prev_utime = max(p->prev_utime, utime);
return p->prev_utime;
}

cputime_t task_stime(struct task_struct *p)
{
- clock_t stime;
+ cputime_t stime;

/*
* Use CFS's precise accounting. (we subtract utime from
* the total, to make sure the total observed by userspace
* grows monotonically - apps rely on that):
*/
- stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
- cputime_to_clock_t(task_utime(p));
+ stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);

if (stime >= 0)
- p->prev_stime = max(p->prev_stime, clock_t_to_cputime(stime));
+ p->prev_stime = max(p->prev_stime, stime);

return p->prev_stime;
}
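
To illustrate the granularity loss this patch removes, here is a small
userspace sketch (assuming HZ=1000, i.e. cputime_t counts 1 ms jiffies,
and USER_HZ=100, i.e. clock_t counts 10 ms units; both constants are
illustrative, not taken from the patch itself):

#include <stdio.h>

int main(void)
{
        unsigned long utime_jiffies = 1009;              /* 1009 ms of user time */
        unsigned long as_clock_t = utime_jiffies / 10;   /* cputime_to_clock_t() */
        unsigned long round_trip = as_clock_t * 10;      /* clock_t_to_cputime() */

        /* Up to USER_HZ granularity (here 9 ms) is lost per conversion,
         * and __exit_signal() accumulated such rounded values per thread. */
        printf("kept %lu ms, lost %lu ms\n",
               round_trip, utime_jiffies - round_trip);
        return 0;
}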

2009-11-13 09:42:29

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Fix granularity of task_u/stime()

On Thu, Nov 12, 2009 at 06:12:41PM +0000, tip-bot for Hidetoshi Seto wrote:
> sched: Fix granularity of task_u/stime()
>
> Originally task_s/utime() were designed to return clock_t but
> were later changed to return cputime_t by the following commit:
>
> commit efe567fc8281661524ffa75477a7c4ca9b466c63
> Author: Christian Borntraeger <[email protected]>
> Date: Thu Aug 23 15:18:02 2007 +0200
>
> It only changed the type of the return value, but not the
> implementation. As a result the granularity of task_s/utime()
> is still that of clock_t, not that of cputime_t.
>
> So using task_s/utime() in __exit_signal() makes the values
> accumulated into the signal struct rounded and coarse
> grained.
>
> This patch removes casts to clock_t in task_u/stime(), to keep
> granularity of cputime_t over the calculation.

The patch only makes a difference on 64-bit arches and does not fix the
original utime decreasing problem on 32 bits. Anyway, except for the nit
below it looks good to me (though to be honest I don't understand these
{u,s}time adjusting functions at all).

> - p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
> + p->prev_utime = max(p->prev_utime, utime);
> return p->prev_utime;
> }
>
> cputime_t task_stime(struct task_struct *p)
> {
> - clock_t stime;
> + cputime_t stime;
>
> /*
> * Use CFS's precise accounting. (we subtract utime from
> * the total, to make sure the total observed by userspace
> * grows monotonically - apps rely on that):
> */
> - stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
> - cputime_to_clock_t(task_utime(p));
> + stime = nsecs_to_cputime(p->se.sum_exec_runtime) - task_utime(p);
>
> if (stime >= 0)

Since cputime_t is not a signed type, we should just remove this check or
write something like:

runtime = nsecs_to_cputime(p->se.sum_exec_runtime);
utime = task_utime(p);
if (cputime_ge(runtime, utime))
p->prev_stime = max(p->prev_stime, runtime - utime);

Stanislaw

2009-11-13 12:44:13

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH] sys_times: fix utime/stime decreasing on thread exit

When we have lots of exiting threads, two consecutive calls to sys_times()
can show utime/stime values decreasing. This can be shown by the program
provided in this thread:

http://lkml.org/lkml/2009/11/3/522

We have two bugs related to this problem; both need to be fixed to make
the issue go away.

Problem 1) Races between thread_group_cputime() and __exit_signal()

When a process exits in the middle of the thread_group_cputime() loop,
its {u,s}time values will be accounted twice: once in the all-threads
loop, and again in __exit_signal(). This makes sys_times() return values
bigger than they really are. The next call to sys_times() returns
correct values, so we see a {u,s}time decrease.

To fix, use sighand->siglock in do_sys_times().

Problem 2) Using adjusted stime/utime values in __exit_signal()

The adjusted task_{u,s}time() functions can return smaller values than
the corresponding tsk->{u,s}time. So when a thread exits, the {u,s}time
values accumulated in signal->{u,s}time can be smaller than the
tsk->{u,s}time values previously accounted in the thread_group_cputime()
loop. Hence two consecutive sys_times() calls can show a decrease.

To fix, we use the raw tsk->{u,s}time values in __exit_signal(). This
means reverting:

commit 49048622eae698e5c4ae61f7e71200f265ccc529
Author: Balbir Singh <[email protected]>
Date: Fri Sep 5 18:12:23 2008 +0200

sched: fix process time monotonicity

which is also a fix for some utime/stime decreasing issues. However,
I _believe_ the issues that commit wanted to fix were caused by
Problem 1), and this patch will not make them happen again.

The patch was heavily inspired by Hidetoshi and Peter.

Reported-by: Spencer Candland <[email protected]>
Signed-off-by: Stanislaw Gruszka <[email protected]>
---
kernel/exit.c | 6 +++---
kernel/sys.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..b0a28a5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -110,9 +110,9 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, task_utime(tsk));
- sig->stime = cputime_add(sig->stime, task_stime(tsk));
- sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
+ sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
sig->nvcsw += tsk->nvcsw;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
--
1.6.2.5

2009-11-13 13:17:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sys_times: fix utime/stime decreasing on thread exit

On Fri, 2009-11-13 at 13:42 +0100, Stanislaw Gruszka wrote:
> When we have lots of exiting threads, two consecutive calls to sys_times()
> can show utime/stime values decreasing. This can be shown by the program
> provided in this thread:
>
> http://lkml.org/lkml/2009/11/3/522
>
> We have two bugs related to this problem; both need to be fixed to make
> the issue go away.
>
> Problem 1) Races between thread_group_cputime() and __exit_signal()
>
> When a process exits in the middle of the thread_group_cputime() loop,
> its {u,s}time values will be accounted twice: once in the all-threads
> loop, and again in __exit_signal(). This makes sys_times() return values
> bigger than they really are. The next call to sys_times() returns
> correct values, so we see a {u,s}time decrease.
>
> To fix, use sighand->siglock in do_sys_times().
>
> Problem 2) Using adjusted stime/utime values in __exit_signal()
>
> The adjusted task_{u,s}time() functions can return smaller values than
> the corresponding tsk->{u,s}time. So when a thread exits, the {u,s}time
> values accumulated in signal->{u,s}time can be smaller than the
> tsk->{u,s}time values previously accounted in the thread_group_cputime()
> loop. Hence two consecutive sys_times() calls can show a decrease.
>
> To fix, we use the raw tsk->{u,s}time values in __exit_signal(). This
> means reverting:
>
> commit 49048622eae698e5c4ae61f7e71200f265ccc529
> Author: Balbir Singh <[email protected]>
> Date: Fri Sep 5 18:12:23 2008 +0200
>
> sched: fix process time monotonicity
>
> which is also a fix for some utime/stime decreasing issues. However,
> I _believe_ the issues that commit wanted to fix were caused by
> Problem 1), and this patch will not make them happen again.

It would be very good to verify that belief and make it a certainty.

Otherwise we need to do the opposite and propagate task_[usg]time() to
all other places... :/

/me quickly stares at fs/proc/array.c:do_task_stat(), which is what top
uses to get the times..

That simply uses thread_group_cputime() properly under siglock and would
thus indeed require the use of task_[usg]time() in order to avoid the
stupid hiding 'exploit'..

Oh bugger,..

I think we do indeed need something like the below, not sure if all
task_[usg]time() calls are now under siglock, if not they ought to be,
otherwise there's a race with them updating p->prev_[us]time.


---

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..9b1d715 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -170,11 +170,11 @@ static void bump_cpu_timer(struct k_itimer *timer,

static inline cputime_t prof_ticks(struct task_struct *p)
{
- return cputime_add(p->utime, p->stime);
+ return cputime_add(task_utime(p), task_stime(p));
}
static inline cputime_t virt_ticks(struct task_struct *p)
{
- return p->utime;
+ return task_utime(p);
}

int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
@@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)

t = tsk;
do {
- times->utime = cputime_add(times->utime, t->utime);
- times->stime = cputime_add(times->stime, t->stime);
+ times->utime = cputime_add(times->utime, task_utime(t));
+ times->stime = cputime_add(times->stime, task_stime(t));
times->sum_exec_runtime += t->se.sum_exec_runtime;

t = next_thread(t);
@@ -517,7 +517,8 @@ static void cleanup_timers(struct list_head *head,
void posix_cpu_timers_exit(struct task_struct *tsk)
{
cleanup_timers(tsk->cpu_timers,
- tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
+ task_utime(tsk), task_stime(tsk),
+ tsk->se.sum_exec_runtime);

}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -525,8 +526,8 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
struct signal_struct *const sig = tsk->signal;

cleanup_timers(tsk->signal->cpu_timers,
- cputime_add(tsk->utime, sig->utime),
- cputime_add(tsk->stime, sig->stime),
+ cputime_add(task_utime(tsk), sig->utime),
+ cputime_add(task_stime(tsk), sig->stime),
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}

@@ -1365,8 +1366,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)

if (!task_cputime_zero(&tsk->cputime_expires)) {
struct task_cputime task_sample = {
- .utime = tsk->utime,
- .stime = tsk->stime,
+ .utime = task_utime(tsk),
+ .stime = task_stime(tsk),
.sum_exec_runtime = tsk->se.sum_exec_runtime
};



2009-11-13 14:12:06

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] sys_times: fix utime/stime decreasing on thread exit

On Fri, Nov 13, 2009 at 6:46 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2009-11-13 at 13:42 +0100, Stanislaw Gruszka wrote:
>> When we have lots of exiting threads, two consecutive calls to sys_times()
>> can show utime/stime values decreasing. This can be shown by the program
>> provided in this thread:
>>
>> http://lkml.org/lkml/2009/11/3/522
>>
>> We have two bugs related to this problem; both need to be fixed to make
>> the issue go away.
>>
>> Problem 1) Races between thread_group_cputime() and __exit_signal()
>>
>> When a process exits in the middle of the thread_group_cputime() loop,
>> its {u,s}time values will be accounted twice: once in the all-threads
>> loop, and again in __exit_signal(). This makes sys_times() return values
>> bigger than they really are. The next call to sys_times() returns
>> correct values, so we see a {u,s}time decrease.
>>
>> To fix, use sighand->siglock in do_sys_times().
>>
>> Problem 2) Using adjusted stime/utime values in __exit_signal()
>>
>> The adjusted task_{u,s}time() functions can return smaller values than
>> the corresponding tsk->{u,s}time. So when a thread exits, the {u,s}time
>> values accumulated in signal->{u,s}time can be smaller than the
>> tsk->{u,s}time values previously accounted in the thread_group_cputime()
>> loop. Hence two consecutive sys_times() calls can show a decrease.
>>
>> To fix, we use the raw tsk->{u,s}time values in __exit_signal(). This
>> means reverting:
>>
>> commit 49048622eae698e5c4ae61f7e71200f265ccc529
>> Author: Balbir Singh <[email protected]>
>> Date: Fri Sep 5 18:12:23 2008 +0200
>>
>>     sched: fix process time monotonicity
>>
>> which is also a fix for some utime/stime decreasing issues. However,
>> I _believe_ the issues that commit wanted to fix were caused by
>> Problem 1), and this patch will not make them happen again.
>
> It would be very good to verify that belief and make it a certainty.
>
> Otherwise we need to do the opposite and propagate task_[usg]time() to
> all other places... :/
>
> /me quickly stares at fs/proc/array.c:do_task_stat(), which is what top
> uses to get the times..
>
> That simply uses thread_group_cputime() properly under siglock and would
> thus indeed require the use of task_[usg]time() in order to avoid the
> stupid hiding 'exploit'..
>
> Oh bugger,..
>
> I think we do indeed need something like the below, not sure if all
> task_[usg]time() calls are now under siglock, if not they ought to be,
> otherwise there's a race with them updating p->prev_[us]time.
>
>
> ---
>
> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 5c9dc22..9b1d715 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -170,11 +170,11 @@ static void bump_cpu_timer(struct k_itimer *timer,
>
>  static inline cputime_t prof_ticks(struct task_struct *p)
>  {
> - return cputime_add(p->utime, p->stime);
> + return cputime_add(task_utime(p), task_stime(p));
>  }
>  static inline cputime_t virt_ticks(struct task_struct *p)
>  {
> - return p->utime;
> + return task_utime(p);
>  }
>
>  int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>
>  t = tsk;
>  do {
> - times->utime = cputime_add(times->utime, t->utime);
> - times->stime = cputime_add(times->stime, t->stime);
> + times->utime = cputime_add(times->utime, task_utime(t));
> + times->stime = cputime_add(times->stime, task_stime(t));
>  times->sum_exec_runtime += t->se.sum_exec_runtime;
>
>  t = next_thread(t);
> @@ -517,7 +517,8 @@ static void cleanup_timers(struct list_head *head,
>  void posix_cpu_timers_exit(struct task_struct *tsk)
>  {
>  cleanup_timers(tsk->cpu_timers,
> - tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
> + task_utime(tsk), task_stime(tsk),
> + tsk->se.sum_exec_runtime);
>
>  }
>  void posix_cpu_timers_exit_group(struct task_struct *tsk)
> @@ -525,8 +526,8 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
>  struct signal_struct *const sig = tsk->signal;
>
>  cleanup_timers(tsk->signal->cpu_timers,
> - cputime_add(tsk->utime, sig->utime),
> - cputime_add(tsk->stime, sig->stime),
> + cputime_add(task_utime(tsk), sig->utime),
> + cputime_add(task_stime(tsk), sig->stime),
>  tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
>  }
>
> @@ -1365,8 +1366,8 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
>
>  if (!task_cputime_zero(&tsk->cputime_expires)) {
>  struct task_cputime task_sample = {
> - .utime = tsk->utime,
> - .stime = tsk->stime,
> + .utime = task_utime(tsk),
> + .stime = task_stime(tsk),
>  .sum_exec_runtime = tsk->se.sum_exec_runtime
>  };

The patch looks correct at first glance. My fault for missing these
call sites; thanks for catching them, Peter. I wonder if we should
change utime and stime to __utime and __stime and force everyone to
use the accessor functions.

Acked-by: Balbir Singh <[email protected]>

Balbir

2009-11-13 15:38:55

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] sys_times: fix utime/stime decreasing on thread exit

On Fri, Nov 13, 2009 at 02:16:59PM +0100, Peter Zijlstra wrote:
> > To fix we use pure tsk->{u,s}time values in __exit_signal(). This mean
> > reverting:
> >
> > commit 49048622eae698e5c4ae61f7e71200f265ccc529
> > Author: Balbir Singh <[email protected]>
> > Date: Fri Sep 5 18:12:23 2008 +0200
> >
> > sched: fix process time monotonicity
> >
> > which is also fix for some utime/stime decreasing issues. However
> > I _believe_ issues which want to be fixed in this commit, was caused
> > by Problem 1) and this patch not make them happen again.
>
> It would be very good to verify that believe and make it a certainty.

Balbir, is there any chance to avoid task_[usg]time() usage here? Could
you be so kind as to point me to the reproducer program/script you used
when working on the "sched: fix process time monotonicity" commit?

> Otherwise we need to do the opposite and propagate task_[usg]time() to
> all other places... :/
>
> /me quickly stares at fs/proc/array.c:do_task_stat(), which is what top
> uses to get the times..
>
> That simply uses thread_group_cputime() properly under siglock and would
> thus indeed require the use of task_[usg]time() in order to avoid the
> stupid hiding 'exploit'..
>
> Oh bugger,..
>
> I think we do indeed need something like the below, not sure if all
> task_[usg]time() calls are now under siglock, if not they ought to be,
> otherwise there's a race with them updating p->prev_[us]time.
>
> ---
>
> ---diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 5c9dc22..9b1d715 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -170,11 +170,11 @@ static void bump_cpu_timer(struct k_itimer *timer,
>
> static inline cputime_t prof_ticks(struct task_struct *p)
> {
> - return cputime_add(p->utime, p->stime);
> + return cputime_add(task_utime(p), task_stime(p));
> }
> static inline cputime_t virt_ticks(struct task_struct *p)
> {
> - return p->utime;
> + return task_utime(p);
> }
>
> int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec
> *tp)

Something wrong with formatting.

> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk,
> struct task_cputime *times)
>
> t = tsk;
> do {
> - times->utime = cputime_add(times->utime, t->utime);
> - times->stime = cputime_add(times->stime, t->stime);
> + times->utime = cputime_add(times->utime, task_utime(t));
> + times->stime = cputime_add(times->stime, task_stime(t));
> times->sum_exec_runtime += t->se.sum_exec_runtime;
>
> t = next_thread(t);
[snip]

Confirmed: the patch fixes the problem, using the reproducer from this thread.

But I don't like it much. It's sad we cannot transition in the opposite
direction and remove task_{u,s}time().

A few months ago I was thinking about removing cputime_t and using
long long instead; now I see many more reasons for doing this, but still
lack the skills/time for that - oh dear.

Stanislaw

2009-11-13 17:05:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sys_times: fix utime/stime decreasing on thread exit

On Fri, 2009-11-13 at 16:36 +0100, Stanislaw Gruszka wrote:
>
> But I don't like it much. It's sad we cannot transition in the opposite
> direction and remove task_{u,s}time().

We need task_[us]time() for things like process monitors such as top, to
keep tasks from hiding from them.

Since the regular [us]time counters are only incremented on the tick, a
task that never runs when the tick hits will never accumulate any ticks,
and hence will be reported as not using any cpu time.

task_[us]time() solves this by taking the ns-granular sum_exec_runtime
and scaling it according to the [us]time tick ratio -- the only trouble
is keeping the result monotonic.

CONFIG_VIRT_CPU_ACCOUNTING avoids this whole issue by tracking time on
each user/kernel transition, but at the obvious cost of having to track
time at every such transition.

Now since we clearly don't want to incur that overhead on any regular
setup (kernel entry/exit paths are extremely hot), we're (afais) stuck
using the current scheme of dividing sum_exec_runtime into two
components, system/user, according to the tick ratio.

Now if we can reduce the task_[us]time muckery to the bare minimum that
would be good, but the way it looks now all this accounting seems too
interwoven to pull that off.
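
For reference, the scaling scheme described above reduces to a few lines.
Here is a userspace sketch of the arithmetic (assuming 1 jiffy = 1 ms; it
mirrors the logic of task_utime() but is not the kernel code itself):

#include <stdio.h>

static unsigned long long split_utime(unsigned long long utime_ticks,
                                      unsigned long long stime_ticks,
                                      unsigned long long sum_exec_ns)
{
        unsigned long long total = utime_ticks + stime_ticks;
        unsigned long long temp = sum_exec_ns / 1000000;   /* ns -> ms */

        if (total) {
                temp *= utime_ticks;
                temp /= total;          /* do_div() in the kernel */
        }
        return temp;                    /* "user" share of the runtime */
}

int main(void)
{
        /* 1 s of precise runtime, ticks sampled 50:50 user/system. */
        printf("utime ~ %llu ms\n", split_utime(50, 50, 1000000000ULL));
        return 0;
}

The monotonicity trouble comes from the fact that this split can move
backwards between two samples, which is what prev_[us]time papers over.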

2009-11-13 23:09:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Fix granularity of task_u/stime()


* tip-bot for Hidetoshi Seto <[email protected]> wrote:

> Commit-ID: 761b1d26df542fd5eb348837351e4d2f3bc7bffe
> Gitweb: http://git.kernel.org/tip/761b1d26df542fd5eb348837351e4d2f3bc7bffe
> Author: Hidetoshi Seto <[email protected]>
> AuthorDate: Thu, 12 Nov 2009 13:33:45 +0900
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Thu, 12 Nov 2009 15:23:47 +0100
>
> sched: Fix granularity of task_u/stime()
>
> Originally task_s/utime() were designed to return clock_t but
> were later changed to return cputime_t by the following commit:

I think this one causes this new warning in -tip:

include/trace/events/timer.h:279: warning: format '%lu'
expects type 'long unsigned int', but argument 4 has type 'cputime_t'

Ingo

2009-11-16 02:44:53

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Fix granularity of task_u/stime()

Ingo Molnar wrote:
> * tip-bot for Hidetoshi Seto <[email protected]> wrote:
>
>> Commit-ID: 761b1d26df542fd5eb348837351e4d2f3bc7bffe
>> Gitweb: http://git.kernel.org/tip/761b1d26df542fd5eb348837351e4d2f3bc7bffe
>> Author: Hidetoshi Seto <[email protected]>
>> AuthorDate: Thu, 12 Nov 2009 13:33:45 +0900
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Thu, 12 Nov 2009 15:23:47 +0100
>>
>> sched: Fix granularity of task_u/stime()
>>
>> Originally task_s/utime() were designed to return clock_t but
>> were later changed to return cputime_t by the following commit:
>
> I think this one causes this new warning in -tip:
>
> include/trace/events/timer.h:279: warning: format '%lu'
> expects type 'long unsigned int', but argument 4 has type 'cputime_t'

No, I believe the warning was already there.
The following patch, based on tip:sched/core, will remove the warning.


Thanks,
H.Seto

===

Subject: [PATCH] trace: cputime_t is not always unsigned long

The type of cputime_t can be u64, unsigned long long, and so on,
e.g. if the kernel is configured with VIRT_CPU_ACCOUNTING=y.
So it can cause the following warning:

include/trace/events/timer.h:279: warning: format '%lu'
expects type 'long unsigned int', but argument 4 has type 'cputime_t'

For printk(), it is better to use cputime_to_cputime64() and %llu,
since cputime64_t is always u64.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
include/trace/events/timer.h | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 1844c48..8f4ec13 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -302,8 +302,8 @@ TRACE_EVENT(itimer_state,
__entry->interval_usec = value->it_interval.tv_usec;
),

- TP_printk("which %d, expires %lu, it_value %lu.%lu, it_interval %lu.%lu",
- __entry->which, __entry->expires,
+ TP_printk("which %d, expires %llu, it_value %lu.%lu, it_interval %lu.%lu",
+ __entry->which, cputime_to_cputime64(__entry->expires),
__entry->value_sec, __entry->value_usec,
__entry->interval_sec, __entry->interval_usec)
);
@@ -332,8 +332,8 @@ TRACE_EVENT(itimer_expire,
__entry->pid = pid_nr(pid);
),

- TP_printk("which %d, pid %d, now %lu", __entry->which,
- (int) __entry->pid, __entry->now)
+ TP_printk("which %d, pid %d, now %llu", __entry->which,
+ (int) __entry->pid, cputime_to_cputime64(__entry->now))
);

#endif /* _TRACE_TIMER_H */
--
1.6.5.2

2009-11-16 19:32:39

by Spencer Candland

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2



Stanislaw Gruszka wrote:
> Ok, I'm going to post this.
>
> Spencer,
>
> it seems you have more test cases for utime decreasing issues;
> could you send me links? Somehow I could not find them
> on my own. The test case used in developing this commit
> is particularly interesting:
>
> commit 49048622eae698e5c4ae61f7e71200f265ccc529
> Author: Balbir Singh <[email protected]>
> Date: Fri Sep 5 18:12:23 2008 +0200
> sched: fix process time monotonicity

I had originally noticed that in a production web server, so my test
case was designed to mirror what I was seeing there, which was just
running apache with worker mpm, and running a simple apache bench while
watching the utime/stime of the apache children. Unfortunately that
method was not terribly reliable at reproducing the issue, which is why
I felt it necessary to try to come up with a better test case this time
around.

- Spencer

2009-11-17 12:50:52

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: utime/stime decreasing on thread exit

On Tue, Nov 10, 2009 at 08:23:27PM +0100, Oleg Nesterov wrote:
> > > Can't we ensure that fastpath_timer_check() never does while_each_thread() ?
> >
> > Removing the possibility of calling while_each_thread() from fastpath_timer_check()
> > was exactly my intention here; perhaps I was not clear.
>
> Yes, yes, I understand.
>
> I meant, perhaps we can ensure this shouldn't happen "by design", instead
> of checking ->running in fastpath_timer_check().

Rule "sig->cputimer_expire != zero implies sig->cputimer.running == true" is
_almost_ assured (after fix in next mail). IMHO there is only one problem with
that functions: posix_cpu_timer_set() and posix_cpu_timer_schedule().

These functions first call thread_group_cputimer() without
tsk->sighand->siglock (only tasklist_lock is taken) and then
arm_timer(), which setups list and cputime_expires cache.

When there is some timer expiring already we can have situation like below:

cpu_timer_sample_group()
check_process_timers()
stop_process_timers()
arm_timer()

At the end we end with cputimer_expire != zero and ->running == false.
Very unlikely situation indeed, but possible. To address this
we can do in arm_timer() something like that:

if (unlikely(!sig->cputimer.running)) {
cpu_timer_sample_group()
bump_cpu_timer();
}

Once we have this, we can do the optimization you proposed here:

http://lkml.org/lkml/2009/3/23/381

i.e. use cputimer->running in fastpath_timer_check(). I'm going to work on
it, as well as on some other optimizations in posix-cpu-timers.c.


> Still. check_process_timers() updates sig->cputime_expires at the end,
> but it never clears it. For example,
>
> if (sched_expires != 0 &&
> (sig->cputime_expires.sched_exp == 0 ||
> sig->cputime_expires.sched_exp > sched_expires))
> sig->cputime_expires.sched_exp = sched_expires;
>
> Why?
>
> Now suppose that (say) sig->cputime_expires.sched_exp != 0, there are
> no cpu timers, ->running == F.
>
> In this case fastpath_timer_check() always returns T and triggers the
> slow path which does nothing, not good.

This is a real bug. I will fix it in the next patch.

Stanislaw

2009-11-17 12:58:37

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH] posix-cpu-timers: reset expire cache when no timer is running

When a process deletes a cpu timer or a timer expires, we do not clear the
expiration cache sig->cputime_expires. This creates a situation where in
fastpath_timer_check() we turn _on_ the cputimer (together with a loop
over all threads) just to turn it off again in check_process_timers(). We
unnecessarily go to the slowpath, and unnecessarily loop over all threads
in the fastpath - and this is repeated on every tick.

To fix, we zero sig->cputime_expires in stop_process_timers().

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
kernel/posix-cpu-timers.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..ac54986 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1061,9 +1061,9 @@ static void check_thread_timers(struct task_struct *tsk,
}
}

-static void stop_process_timers(struct task_struct *tsk)
+static void stop_process_timers(struct signal_struct *sig)
{
- struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;
+ struct thread_group_cputimer *cputimer = &sig->cputimer;
unsigned long flags;

if (!cputimer->running)
@@ -1072,6 +1072,10 @@ static void stop_process_timers(struct task_struct *tsk)
spin_lock_irqsave(&cputimer->lock, flags);
cputimer->running = 0;
spin_unlock_irqrestore(&cputimer->lock, flags);
+
+ sig->cputime_expires.prof_exp = cputime_zero;
+ sig->cputime_expires.virt_exp = cputime_zero;
+ sig->cputime_expires.sched_exp = 0;
}

static u32 onecputick;
@@ -1132,7 +1136,7 @@ static void check_process_timers(struct task_struct *tsk,
list_empty(&timers[CPUCLOCK_VIRT]) &&
cputime_eq(sig->it[CPUCLOCK_VIRT].expires, cputime_zero) &&
list_empty(&timers[CPUCLOCK_SCHED])) {
- stop_process_timers(tsk);
+ stop_process_timers(sig);
return;
}

--
1.6.2.5

2009-11-17 13:11:00

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Mon, Nov 16, 2009 at 12:32:43PM -0700, Spencer Candland wrote:
> > it seems you have more test cases for utime decreasing issues;
> > could you send me links? Somehow I could not find them
> > on my own. The test case used in developing this commit
> > is particularly interesting:
> >
> > commit 49048622eae698e5c4ae61f7e71200f265ccc529
> > Author: Balbir Singh <[email protected]>
> > Date: Fri Sep 5 18:12:23 2008 +0200
> > sched: fix process time monotonicity
>
> I had originally noticed that in a production web server, so my test
> case was designed to mirror what I was seeing there, which was just
> running apache with worker mpm, and running a simple apache bench while
> watching the utime/stime of the apache children. Unfortunately that
> method was not terribly reliable at reproducing the issue, which is why
> I felt it necessary to try to come up with a better test case this time
> around.

No wonder I could not find anything on Google or in the mailing list
archives :)

It seems the issue reported then was exactly the same as the one you
report now. It looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529
just made the probability of hitting the bug smaller, so you did not
notice it until now.

Could you please test this patch, to see if it solves all the utime
decrease problems for you:

http://patchwork.kernel.org/patch/59795/

If you confirm it works, I think we should apply it. Otherwise
we need to propagate task_{u,s}time() everywhere, which is not
(my) preferred solution.

Stanislaw

2009-11-17 13:25:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Tue, 2009-11-17 at 14:08 +0100, Stanislaw Gruszka wrote:
> On Mon, Nov 16, 2009 at 12:32:43PM -0700, Spencer Candland wrote:
> > > seems you have more test cases for utime decreasing issues,
> > > could you send links to me ? Somehow I could not find them
> > > by my own. Particularly test case used in development this commit
> > > is interested:
> > >
> > > commit 49048622eae698e5c4ae61f7e71200f265ccc529
> > > Author: Balbir Singh <[email protected]>
> > > Date: Fri Sep 5 18:12:23 2008 +0200
> > > sched: fix process time monotonicity
> >
> > I had originally noticed that in a production web server, so my test
> > case was designed to mirror what I was seeing there, which was just
> > running apache with worker mpm, and running a simple apache bench while
> > watching the utime/stime of the apache children. Unfortunately that
> > method was not terribly reliable at reproducing the issue, which is why
> > I felt it necessary to try to come up with a better test case this time
> > around.
>
> No wonder I could not find anything on Google or in the mailing list
> archives :)
>
> It seems the issue reported then was exactly the same as the one you
> report now. It looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529
> just made the probability of hitting the bug smaller, so you did not
> notice it until now.
>
> Could you please test this patch, to see if it solves all the utime
> decrease problems for you:
>
> http://patchwork.kernel.org/patch/59795/
>
> If you confirm it works, I think we should apply it. Otherwise
> we need to propagate task_{u,s}time() everywhere, which is not
> (my) preferred solution.

That patch will create another issue: it will allow a process to hide
from top by arranging to never run when the tick hits.

2009-11-18 22:38:14

by Spencer Candland

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

Stanislaw Gruszka wrote:
> Could you please test this patch, to see if it solves all the utime
> decrease problems for you:
>
> http://patchwork.kernel.org/patch/59795/

Yes, this seems to solve the problem.

- Spencer

2009-11-19 18:20:47

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Tue, Nov 17, 2009 at 02:24:48PM +0100, Peter Zijlstra wrote:
> > It seems the issue reported then was exactly the same as the one you
> > report now. It looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529
> > just made the probability of hitting the bug smaller, so you did not
> > notice it until now.
> >
> > Could you please test this patch, to see if it solves all the utime
> > decrease problems for you:
> >
> > http://patchwork.kernel.org/patch/59795/
> >
> > If you confirm it works, I think we should apply it. Otherwise
> > we need to propagate task_{u,s}time() everywhere, which is not
> > (my) preferred solution.
>
> That patch will create another issue: it will allow a process to hide
> from top by arranging to never run when the tick hits.

What about that?

diff --git a/kernel/sched.c b/kernel/sched.c
index 1f8d028..9db1cbc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
}
utime = (cputime_t)temp;

- p->prev_utime = max(p->prev_utime, utime);
+ p->prev_utime = max(p->prev_utime, max(p->utime, utime));
return p->prev_utime;
}

diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);

It's on top of Hidetoshi's patch, and fixes the utime decrease problem
on my system.

Are we not doing something nasty here?

cputime_t utime = p->utime, total = utime + p->stime;
u64 temp;

/*
* Use CFS's precise accounting:
*/
temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);

if (total) {
temp *= utime;
do_div(temp, total);
}
utime = (cputime_t)temp;

Stanislaw

2009-11-20 02:00:51

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

Stanislaw Gruszka wrote:
> On Tue, Nov 17, 2009 at 02:24:48PM +0100, Peter Zijlstra wrote:
>>> It seems the issue reported then was exactly the same as the one you
>>> report now. It looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529
>>> just made the probability of hitting the bug smaller, so you did not
>>> notice it until now.
>>>
>>> Could you please test this patch, to see if it solves all the utime
>>> decrease problems for you:
>>>
>>> http://patchwork.kernel.org/patch/59795/
>>>
>>> If you confirm it works, I think we should apply it. Otherwise
>>> we need to propagate task_{u,s}time() everywhere, which is not
>>> (my) preferred solution.
>> That patch will create another issue: it will allow a process to hide
>> from top by arranging to never run when the tick hits.
>

Yes, nowadays there are many threads on high-speed hardware; such
processes can exist all around, more easily than before.

E.g. assume that there are 2 tasks:

Task A: interrupted by timer few times
(utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
=> total of runtime is 1 sec, but utime + stime is 100 ms

Task B: interrupted by timer many times
(utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
=> total of runtime is 10 ms, but utime + stime is 100 ms

You can see task_[su]time() works well for these tasks.
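
(For a standalone check of those numbers, here is a userspace sketch of
the same scaling rule, assuming 1 tick = 1 ms; not kernel code:

#include <stdio.h>

int main(void)
{
        /* Task A: 1 s runtime, ticks (50u, 50s) -> 500 ms user */
        printf("A: utime = %llu ms\n",
               (1000000000ULL / 1000000) * 50 / (50 + 50));
        /* Task B: 10 ms runtime, ticks (50u, 50s) -> 5 ms user */
        printf("B: utime = %llu ms\n",
               (10000000ULL / 1000000) * 50 / (50 + 50));
        return 0;
}

which prints 500 and 5, matching the per-task figures below.)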

> What about that?
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f8d028..9db1cbc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
> }
> utime = (cputime_t)temp;
>
> - p->prev_utime = max(p->prev_utime, utime);
> + p->prev_utime = max(p->prev_utime, max(p->utime, utime));
> return p->prev_utime;
> }

I think this makes things worse.

without this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 5 ms (= accurate)
with this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 50 ms (= not accurate)

Note that task_stime() calculates prev_stime using this prev_utime:

without this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 5 ms (= not accurate)
with this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 0 ms (= not accurate)

>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce17760..8be5b75 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
> struct task_cputime cputime;
> cputime_t cutime, cstime;
>
> - thread_group_cputime(current, &cputime);
> spin_lock_irq(&current->sighand->siglock);
> + thread_group_cputime(current, &cputime);
> cutime = current->signal->cutime;
> cstime = current->signal->cstime;
> spin_unlock_irq(&current->sighand->siglock);
>
> It's on top of Hidetoshi's patch, and fixes the utime decrease problem
> on my system.

How about the stime decrease problem, which can be caused by the same
logic?

According to my labeling, there are 2 unresolved problems: [1]
"thread_group_cputime() vs exit" and [2] "use of task_s/utime()".

Still, I believe the real fix for this problem is the combination of the
above fix for do_sys_times() (for problem [1]) and (I know it is
not preferred, but for [2]) the following:

>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>> >> index 5c9dc22..e065b8a 100644
>> >> --- a/kernel/posix-cpu-timers.c
>> >> +++ b/kernel/posix-cpu-timers.c
>> >> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> >>
>> >> t = tsk;
>> >> do {
>> >> - times->utime = cputime_add(times->utime, t->utime);
>> >> - times->stime = cputime_add(times->stime, t->stime);
>> >> + times->utime = cputime_add(times->utime, task_utime(t));
>> >> + times->stime = cputime_add(times->stime, task_stime(t));
>> >> times->sum_exec_runtime += t->se.sum_exec_runtime;
>> >>
>> >> t = next_thread(t);

Think about this diff, assuming task C is in the same group as tasks A and B:

sys_times() on C while A and B are living returns:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (50,50) + (50,50) +...) + in_signal(exited)
If A exited, it increases:
(utime, stime)
= task_[su]time(C) + ([su]time(B)+...) + in_signal(exited)+task_[su]time(A)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(500,500)
Otherwise if B exited, it decreases:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+...) + in_signal(exited)+task_[su]time(B)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(5,5)

With this fix, sys_times() returns:
(utime, stime)
= task_[su]time(C) + (task_[su]time(A)+task_[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (500,500) + (5,5) +...) + in_signal(exited)

> Are we not doing something nasty here?
>
> cputime_t utime = p->utime, total = utime + p->stime;
> u64 temp;
>
> /*
> * Use CFS's precise accounting:
> */
> temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
>
> if (total) {
> temp *= utime;
> do_div(temp, total);
> }
> utime = (cputime_t)temp;

Not here, but doing do_div() for each thread could be called nasty.
I mean
__task_[su]time(sum(A, B, ...))
would be better than:
sum(task_[su]time(A) + task_[su]time(B) + ...)

However it would bring another issue, because
__task_[su]time(sum(A, B, ...))
might not equal
__task_[su]time(sum(B, ...)) + task_[su]time(A)
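
A small userspace sketch makes the inequality concrete (model only; the
numbers are made up and adj() stands in for the scaling in task_utime()):

#include <stdio.h>

static unsigned long adj(unsigned long runtime_ms,
                         unsigned long utime, unsigned long stime)
{
        unsigned long total = utime + stime;
        return total ? runtime_ms * utime / total : runtime_ms;
}

int main(void)
{
        /* Thread A: 100 ms runtime, ticks (1u, 1s);
         * thread B: 10 ms runtime, ticks (9u, 1s). */
        unsigned long per_thread = adj(100, 1, 1) + adj(10, 9, 1); /* 50 + 9 */
        unsigned long combined = adj(100 + 10, 1 + 9, 1 + 1);      /* 91 */

        printf("sum of adj = %lu ms, adj of sum = %lu ms\n",
               per_thread, combined);
        return 0;
}

The two results (59 ms vs 91 ms) differ because the per-thread tick
ratios are lost once the samples are summed.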


Thanks,
H.Seto

2009-11-23 09:57:13

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Thu, Nov 12, 2009 at 03:49:19PM +0100, Stanislaw Gruszka wrote:
> On Thu, Nov 12, 2009 at 01:33:45PM +0900, Hidetoshi Seto wrote:
> > Originally task_s/utime() were designed to return clock_t but were later
> > changed to return cputime_t by the following commit:
> >
> > commit efe567fc8281661524ffa75477a7c4ca9b466c63
> > Author: Christian Borntraeger <[email protected]>
> > Date: Thu Aug 23 15:18:02 2007 +0200
> >
> > It only changed the type of the return value, but not the implementation.
> > As a result the granularity of task_s/utime() is still that of
> > clock_t, not that of cputime_t.
> >
> > So using task_s/utime() in __exit_signal() makes the values accumulated
> > into the signal struct rounded and coarse grained.
> >
> > This patch removes casts to clock_t in task_u/stime(), to keep
> > granularity of cputime_t over the calculation.
> >
> > v2:
> > Use div_u64() to avoid the error "undefined reference to `__udivdi3`"
> > on some 32-bit systems.
> >
> > Signed-off-by: Hidetoshi Seto <[email protected]>
> > ---
> > kernel/sched.c | 22 +++++++++++++---------
> > 1 files changed, 13 insertions(+), 9 deletions(-)
>
> The patch does not fix the issue on my system. I tested it alone, together with
>
> posix-cpu-timers: avoid do_sys_times() races with __exit_signal()
>
> and (further) together with
>
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>
> t = tsk;
> do {
> - times->utime = cputime_add(times->utime, t->utime);
> - times->stime = cputime_add(times->stime, t->stime);
> + times->utime = cputime_add(times->utime, task_utime(t));
> + times->stime = cputime_add(times->stime, task_stime(t));
> times->sum_exec_runtime += t->se.sum_exec_runtime;
>
> t = next_thread(t);
>
> The only thing that changed was the probability of hitting the issue.

I was wrong here; that combination fixes the problem on my system. I don't
know how I was testing it before; perhaps I booted the wrong kernel.

Stanislaw

2009-11-23 10:12:52

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Fri, Nov 20, 2009 at 11:00:21AM +0900, Hidetoshi Seto wrote:
> >>> Could you please test this patch, to see if it solves all the utime
> >>> decrease problems for you:
> >>>
> >>> http://patchwork.kernel.org/patch/59795/
> >>>
> >>> If you confirm it works, I think we should apply it. Otherwise
> >>> we need to propagate task_{u,s}time() everywhere, which is not
> >>> (my) preferred solution.
> >> That patch will create another issue: it will allow a process to hide
> >> from top by arranging to never run when the tick hits.
> >
>
> Yes, nowadays there are many threads on high-speed hardware; such
> processes can exist all around, more easily than before.
>
> E.g. assume that there are 2 tasks:
>
> Task A: interrupted by timer few times
> (utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
> => total of runtime is 1 sec, but utime + stime is 100 ms
>
> Task B: interrupted by timer many times
> (utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
> => total of runtime is 10 ms, but utime + stime is 100 ms

How probable is it that a task runs for a very long time without getting
ticks? I know it is possible, otherwise we would not see utime
decreasing after the do_sys_times() siglock fix, but how probable?

> You can see task_[su]time() works well for these tasks.
>
> > What about that?
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 1f8d028..9db1cbc 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
> > }
> > utime = (cputime_t)temp;
> >
> > - p->prev_utime = max(p->prev_utime, utime);
> > + p->prev_utime = max(p->prev_utime, max(p->utime, utime));
> > return p->prev_utime;
> > }
>
> I think this makes things worse.
>
> without this patch:
> Task A prev_utime: 500 ms (= accurate)
> Task B prev_utime: 5 ms (= accurate)
> with this patch:
> Task A prev_utime: 500 ms (= accurate)
> Task B prev_utime: 50 ms (= not accurate)
>
> Note that task_stime() calculates prev_stime using this prev_utime:
>
> without this patch:
> Task A prev_stime: 500 ms (= accurate)
> Task B prev_stime: 5 ms (= not accurate)
> with this patch:
> Task A prev_stime: 500 ms (= accurate)
> Task B prev_stime: 0 ms (= not accurate)
>
> >
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index ce17760..8be5b75 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
> > struct task_cputime cputime;
> > cputime_t cutime, cstime;
> >
> > - thread_group_cputime(current, &cputime);
> > spin_lock_irq(&current->sighand->siglock);
> > + thread_group_cputime(current, &cputime);
> > cutime = current->signal->cutime;
> > cstime = current->signal->cstime;
> > spin_unlock_irq(&current->sighand->siglock);
> >
> > It's on top of Hidetoshi's patch, and fixes the utime decrease problem
> > on my system.
>
> How about the stime decrease problem which can be caused by same
> logic?

Yes, the above patch screws up stime. The below should be a bit better,
but does not resolve the objections you have:

diff --git a/kernel/exit.c b/kernel/exit.c
index f7864ac..17491ad 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,6 +91,8 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
+ cputime_t utime, stime;
+
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -110,8 +112,16 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- sig->utime = cputime_add(sig->utime, task_utime(tsk));
- sig->stime = cputime_add(sig->stime, task_stime(tsk));
+
+ utime = task_utime(tsk);
+ stime = task_stime(tsk);
+ if (tsk->utime > utime || tsk->stime > stime) {
+ utime = tsk->utime;
+ stime = tsk->stime;
+ }
+
+ sig->utime = cputime_add(sig->utime, utime);
+ sig->stime = cputime_add(sig->stime, stime);
sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;

> According to my labeling, there are 2 unresolved problem [1]
> "thread_group_cputime() vs exit" and [2] "use of task_s/utime()".
>
> Still I believe the real fix for this problem is combination of
> above fix for do_sys_times() (for problem[1]) and (I know it is
> not preferred, but for [2]) the following:
>
> >> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> >> >> index 5c9dc22..e065b8a 100644
> >> >> --- a/kernel/posix-cpu-timers.c
> >> >> +++ b/kernel/posix-cpu-timers.c
> >> >> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> >> >>
> >> >> t = tsk;
> >> >> do {
> >> >> - times->utime = cputime_add(times->utime, t->utime);
> >> >> - times->stime = cputime_add(times->stime, t->stime);
> >> >> + times->utime = cputime_add(times->utime, task_utime(t));
> >> >> + times->stime = cputime_add(times->stime, task_stime(t));
> >> >> times->sum_exec_runtime += t->se.sum_exec_runtime;
> >> >>
> >> >> t = next_thread(t);
>

That works for me, and I agree that this is the right fix. Peter had concerns
about p->prev_utime races and the additional need to propagate
task_{s,u}time() further into the posix-cpu-timers code. However, I do not
understand these problems.

Stanislaw

2009-11-23 10:18:49

by Stanislaw Gruszka

[permalink] [raw]
Subject: [PATCH] cputime: avoid do_sys_times() races with __exit_signal()

When a process exits in the middle of the thread_group_cputime() loop,
its {u,s}time values will be accounted twice: once in the all-threads
loop, and again in __exit_signal(). This makes sys_times() return values
bigger than they really are. The next call to sys_times() can return
correct values, so we see a {u,s}time decrease.

To fix, use sighand->siglock in do_sys_times().

This is a partial fix for the problem of utime/stime values decreasing
described in this thread:

http://lkml.org/lkml/2009/11/3/522

To fully fix the problem, we need a second fix for the mismatch between
task_{u,s}time() and tsk->{u,s}time. It is not yet clear how this
mismatch should be fixed.

Signed-off-by: Stanislaw Gruszka <[email protected]>
---
kernel/sys.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index ce17760..8be5b75 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
struct task_cputime cputime;
cputime_t cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_cputime(current, &cputime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
--
1.6.2.5

2009-11-23 10:25:28

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

* Stanislaw Gruszka <[email protected]> [2009-11-23 11:09:26]:
[snip]
> +
> + utime = task_utime(tsk);
> + stime = task_stime(tsk);
> + if (tsk->utime > utime || tsk->stime > stime) {
> + utime = tsk->utime;
> + stime = tsk->stime;
> + }

Won't this condition be problematic, since it can reset stime
when tsk->utime > utime for example?
--
Balbir

2009-11-23 10:48:59

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

On Mon, 23 Nov 2009 15:55:28 +0530
Balbir Singh <[email protected]> wrote:

> * Stanislaw Gruszka <[email protected]> [2009-11-23 11:09:26]:
> [snip]
> > +
> > + utime = task_utime(tsk);
> > + stime = task_stime(tsk);
> > + if (tsk->utime > utime || tsk->stime > stime) {
> > + utime = tsk->utime;
> > + stime = tsk->stime;
> > + }
>
> Won't this condition be problematic, since it can reset stime
> when tsk->utime > utime for example?

We use either both adjusted {u,s}time values or both unadjusted values.
This seems right to me. Adjusting does not help anyway if we have a
stime/utime tick disproportion, e.g. a task gets only stime ticks and
no utime ticks, yet runs in user and kernel space for the same amount
of time.

If this solution were accepted, the patch would contain an additional
comment.

Stanislaw

2009-11-24 05:34:25

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2

Stanislaw Gruszka wrote:
> On Fri, Nov 20, 2009 at 11:00:21AM +0900, Hidetoshi Seto wrote:
>> E.g. assume that there are 2 tasks:
>>
>> Task A: interrupted by timer few times
>> (utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
>> => total of runtime is 1 sec, but utime + stime is 100 ms
>>
>> Task B: interrupted by timer many times
>> (utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
>> => total of runtime is 10 ms, but utime + stime is 100 ms
>
> How probable is it that a task runs for a very long time without getting
> ticks? I know it is possible, otherwise we would not see utime
> decreasing after the do_sys_times() siglock fix, but how probable?

For example, assume a watchdog-like task that calls sleep soon after doing
its work. Such a task will be woken up by a timer interrupt on another
task and queued on the run queue. Once it gets a cpu it can finish its
work before the next tick, so it can run for a long time without any
ticks landing on it. I suppose you can find such tasks in monitoring
tools that contain sampling threads behaving like watchdogs.

As a side effect, since such tasks tend to use the cpu between tick
periods, they make other tasks more likely to be interrupted by ticks.
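
A userspace sketch of that watchdog pattern (illustrative only; the work
loop and the ~9 ms nap are invented numbers, chosen so the thread tends
to be asleep whenever a 100 Hz or 1000 Hz tick fires):

#include <time.h>

static volatile unsigned long sink;

static void do_small_work(void)         /* stand-in for the real work */
{
        for (int i = 0; i < 100000; i++)
                sink += i;
}

int main(void)
{
        struct timespec nap = { .tv_sec = 0, .tv_nsec = 9000000 }; /* ~9 ms */

        for (;;) {
                do_small_work();        /* finishes well under one tick */
                nanosleep(&nap, NULL);  /* parked when the tick samples */
        }
}

Under pure tick accounting such a process can show near-zero utime even
though it runs on every iteration of the loop.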

>>>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>>>>>> index 5c9dc22..e065b8a 100644
>>>>>> --- a/kernel/posix-cpu-timers.c
>>>>>> +++ b/kernel/posix-cpu-timers.c
>>>>>> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>>>>>>
>>>>>> t = tsk;
>>>>>> do {
>>>>>> - times->utime = cputime_add(times->utime, t->utime);
>>>>>> - times->stime = cputime_add(times->stime, t->stime);
>>>>>> + times->utime = cputime_add(times->utime, task_utime(t));
>>>>>> + times->stime = cputime_add(times->stime, task_stime(t));
>>>>>> times->sum_exec_runtime += t->se.sum_exec_runtime;
>>>>>>
>>>>>> t = next_thread(t);
>
> That works for me, and I agree that this is the right fix. Peter had concerns
> about p->prev_utime races and the additional need to propagate
> task_{s,u}time() further into the posix-cpu-timers code. However, I do not
> understand these problems.

I think one of our concerns is the cost of task_{s,u}time(), which
might bring other problems if propagated. But I found we can reduce
the cost (by about half, or more), which is why I posted the task_times()
patch in another thread on LKML.


Thanks,
H.Seto

2009-11-30 09:20:47

by Hidetoshi Seto

[permalink] [raw]
Subject: [PATCH 1/2] cputime: remove prev_{u,s}time if VIRT_CPU_ACCOUNTING

prev_{u,s}time are used in task_times() only when !VIRT_CPU_ACCOUNTING.
Add #ifndef/#endif around them.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0395b0f..dff85e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,7 +1331,9 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..ad7cb6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,8 +1066,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+#endif

p->default_timer_slack_ns = current->timer_slack_ns;

--
1.6.5.3

2009-11-30 09:21:59

by Hidetoshi Seto

[permalink] [raw]
Subject: [PATCH 2/2] cputime: introduce thread_group_times()

This is a real fix for problem of utime/stime values decreasing
described in the thread:
http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in the task_struct are increased every time the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
(and also c{u,s}time) in the signal struct are added to c{u,s}time
in the signal struct of the group's parent.

So {u,s}time in the task struct are "raw" tick counts, while {u,s}time
and c{u,s}time in the signal struct are "adjusted" values.

And accounted values are used by:

- task_times(), to get cputime of a thread:
This function returns adjusted values that originates from raw
{u,s}time and scaled by sum_exec_runtime that accounted by CFS.

- thread_group_cputime(), to get cputime of a thread group:
This function returns sum of all {u,s}time of living threads in
the group, plus {u,s}time in the signal struct that is sum of
adjusted cputimes of all exited threads belonged to the group.

The problem is the return value of thread_group_cputime(), because
it is mixed sum of "raw" value and "adjusted" value:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater than
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only
runs 45ms) and if it exits, cputime will decrease (e.g. -5ms).

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for every
thread should be avoided.
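
For illustration, with made-up numbers: a thread with raw utime = 50
ticks (50ms at 1000Hz) and raw stime = 0, but an actual runtime of
rtime = 45ms, is adjusted to utime = rtime * utime / (utime + stime)
= 45 * 50 / 50 = 45ms. It is this per-thread 64-bit division that we
want to avoid repeating for every living thread in the group.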

This patch fixes the above problem in the following way:

- Modify thread exit (= __exit_signal()) not to use task_times().
This means {u,s}time in the signal struct accumulate raw values
instead of adjusted values. As a result, thread_group_cputime()
returns a pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts the "raw" values of thread_group_cputime() to
"adjusted" values, using the same calculation procedure as
task_times().

- Modify group exit (= wait_task_zombie()) to use the newly
introduced thread_group_times(). This keeps c{u,s}time in the
signal struct adjusted, as before this patch.

- Replace some thread_group_cputime() calls with thread_group_times().
These replacements are applied only where the "adjusted" cputime is
conveyed to users, and where task_times() is already used nearby
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat).

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
(e.g. each runs 0.9ms and is never interrupted by a tick), the
group's cputime could be invisible, since each thread's cputime was
accumulated after being adjusted: imagine the adjustment function
as adj(ticks, runtime), then
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch this does not happen, because the adjustment is
applied after accumulation.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
fs/proc/array.c | 5 +---
include/linux/sched.h | 3 +-
kernel/exit.c | 23 +++++++++++----------
kernel/fork.c | 2 +
kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sys.c | 18 +++++++---------
6 files changed, 75 insertions(+), 28 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ca61a88..2571da4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -506,7 +506,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -517,9 +516,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dff85e5..c5c68ed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,7 +1332,7 @@ struct task_struct {
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
- cputime_t prev_utime, prev_stime;
+ cputime_t prev_utime, prev_stime, prev_tgutime, prev_tgstime;
#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
@@ -1723,6 +1723,7 @@ static inline void put_task_struct(struct task_struct *t)
}

extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
- cputime_t utime, stime;
-
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- task_times(tsk, &utime, &stime);
- sig->utime = cputime_add(sig->utime, utime);
- sig->stime = cputime_add(sig->stime, stime);
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
+ cputime_t tgutime, tgstime;

/*
* The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* need to protect the access to parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
*/
+ thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(p->utime,
- cputime_add(sig->utime,
- sig->cutime)));
+ cputime_add(tgutime,
+ sig->cutime));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(p->stime,
- cputime_add(sig->stime,
- sig->cstime)));
+ cputime_add(tgstime,
+ sig->cstime));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..1753cac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1069,6 +1069,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+ p->prev_tgutime = cputime_zero;
+ p->prev_tgstime = cputime_zero;
#endif

p->default_timer_slack_ns = current->timer_slack_ns;
diff --git a/kernel/sched.c b/kernel/sched.c
index b3d4e2b..dabdeae 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5189,6 +5189,12 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
if (st)
*st = p->stime;
}
+
+static inline void __thread_group_times(struct task_struct *p,
+ struct task_cputime *cputime)
+{
+ thread_group_cputime(p, cputime);
+}
#else

#ifndef nsecs_to_cputime
@@ -5197,7 +5203,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)

void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- cputime_t rtime, utime = p->utime, total = utime + p->stime;
+ cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);

/*
* Use CFS's precise accounting:
@@ -5217,16 +5223,58 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
* Compare with previous values, to keep monotonicity:
*/
p->prev_utime = max(p->prev_utime, utime);
- p->prev_stime = max(p->prev_stime, rtime - p->prev_utime);
+ p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));

if (ut)
*ut = p->prev_utime;
if (st)
*st = p->prev_stime;
}
+
+static void __thread_group_times(struct task_struct *p,
+ struct task_cputime *cputime)
+{
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, cputime);
+
+ total = cputime_add(cputime->utime, cputime->stime);
+ rtime = nsecs_to_cputime(cputime->sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime->utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ p->prev_tgutime = max(p->prev_tgutime, utime);
+ p->prev_tgstime = max(p->prev_tgstime,
+ cputime_sub(rtime, p->prev_tgutime));
+
+ cputime->utime = p->prev_tgutime;
+ cputime->stime = p->prev_tgstime;
+}
#endif

/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ __thread_group_times(p, &cputime);
+
+ if (ut)
+ *ut = cputime.utime;
+ if (st)
+ *st = cputime.stime;
+}
+
+/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
diff --git a/kernel/sys.c b/kernel/sys.c
index d988abe..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
+ cputime_t tgutime, tgstime, cutime, cstime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime(current, &cputime);
+ thread_group_times(current, &tgutime, &tgstime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
+ tms->tms_utime = cputime_to_clock_t(tgutime);
+ tms->tms_stime = cputime_to_clock_t(tgstime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t utime, stime;
- struct task_cputime cputime;
+ cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;

memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ thread_group_times(p, &tgutime, &tgstime);
+ utime = cputime_add(utime, tgutime);
+ stime = cputime_add(stime, tgstime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
--
1.6.5.3

2009-11-30 14:57:34

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [PATCH 2/2] cputime: introduce thread_group_times()

On Mon, Nov 30, 2009 at 06:21:36PM +0900, Hidetoshi Seto wrote:
> This patch fixes the above problem in the following way:
>
> - Modify thread exit (= __exit_signal()) not to use task_times().
> This means {u,s}time in the signal struct accumulate raw values
> instead of adjusted values. As a result, thread_group_cputime()
> returns a pure sum of "raw" values.
>
> - Introduce a new function thread_group_times(*task, *utime, *stime)
> that converts the "raw" values of thread_group_cputime() to
> "adjusted" values, using the same calculation procedure as
> task_times().
>
> - Modify group exit (= wait_task_zombie()) to use the newly
> introduced thread_group_times(). This keeps c{u,s}time in the
> signal struct adjusted, as before this patch.
>
> - Replace some thread_group_cputime() calls with thread_group_times().
> These replacements are applied only where the "adjusted" cputime is
> conveyed to users, and where task_times() is already used nearby
> (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).
>
> This patch has a positive side effect:
>
> - Before this patch, if a group contained many short-lived threads
> (e.g. each runs 0.9ms and is never interrupted by a tick), the
> group's cputime could be invisible, since each thread's cputime was
> accumulated after being adjusted: imagine the adjustment function
> as adj(ticks, runtime), then
> {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
> After this patch this does not happen, because the adjustment is
> applied after accumulation.

The idea is very good IMHO.

> Signed-off-by: Hidetoshi Seto <[email protected]>
> ---
> fs/proc/array.c | 5 +---
> include/linux/sched.h | 3 +-
> kernel/exit.c | 23 +++++++++++----------
> kernel/fork.c | 2 +
> kernel/sched.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
> kernel/sys.c | 18 +++++++---------
> 6 files changed, 75 insertions(+), 28 deletions(-)

[snip]

> #ifndef CONFIG_VIRT_CPU_ACCOUNTING
> - cputime_t prev_utime, prev_stime;
> + cputime_t prev_utime, prev_stime, prev_tgutime, prev_tgstime;
> #endif

I think the new values should be part of signal_struct (see below)

> +static void __thread_group_times(struct task_struct *p,
> + struct task_cputime *cputime)
> +{
> + cputime_t rtime, utime, total;
> +
> + thread_group_cputime(p, cputime);
> +
> + total = cputime_add(cputime->utime, cputime->stime);
> + rtime = nsecs_to_cputime(cputime->sum_exec_runtime);
> +
> + if (total) {
> + u64 temp;
> +
> + temp = (u64)(rtime * cputime->utime);
> + do_div(temp, total);
> + utime = (cputime_t)temp;
> + } else
> + utime = rtime;
> +
> + p->prev_tgutime = max(p->prev_tgutime, utime);
> + p->prev_tgstime = max(p->prev_tgstime,
> + cputime_sub(rtime, p->prev_tgutime));

p->sig->prev_utime = max(p->sig->prev_utime, utime);
p->sig->prev_stime = max(p->sig->prev_stime,
cputime_sub(rtime, p->sig->prev_utime));

> /*
> + * Must be called with siglock held.
> + */
> +void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
> +{
> + struct task_cputime cputime;
> +
> + __thread_group_times(p, &cputime);
> +
> + if (ut)
> + *ut = cputime.utime;
> + if (st)
> + *st = cputime.stime;

Neither thread_group_times() nor task_times() is called with NULL
arguments, so we can get rid of the "if ({u,s}t)" checks. Perhaps
thread_group_times() should take a "struct task_cputime" argument,
as it is a wrapper for thread_group_cputime().

Thanks
Stanislaw

2009-12-01 01:02:59

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH 2/2] cputime: introduce thread_group_times()

Stanislaw Gruszka wrote:
> The idea is very good IMHO.

Thank you very much!

>> #ifndef CONFIG_VIRT_CPU_ACCOUNTING
>> - cputime_t prev_utime, prev_stime;
>> + cputime_t prev_utime, prev_stime, prev_tgutime, prev_tgstime;
>> #endif
>
> I think the new values should be part of signal_struct (see below)

Good point. I'll update the patch to do so.

>> /*
>> + * Must be called with siglock held.
>> + */
>> +void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
>> +{
>> + struct task_cputime cputime;
>> +
>> + __thread_group_times(p, &cputime);
>> +
>> + if (ut)
>> + *ut = cputime.utime;
>> + if (st)
>> + *st = cputime.stime;
>
> Neither thread_group_times() nor task_times() is called with NULL
> arguments, so we can get rid of the "if ({u,s}t)" checks. Perhaps
> thread_group_times() should take a "struct task_cputime" argument,
> as it is a wrapper for thread_group_cputime().

Removing "if ({u,s}t)" is OK with me.
I expect all thread_group_times() user should have no interest in members
of struct task_cputime other than {u,s}time, so I'd like to keep the
argument as is.


Thanks,
H.Seto

2009-12-02 08:27:13

by Hidetoshi Seto

[permalink] [raw]
Subject: [PATCH -v2 1/2] sched, cputime: cleanups related to task_times()

- Remove if({u,s}t)s because no one calls it with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
only when !VIRT_CPU_ACCOUNTING.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
kernel/sched.c | 16 ++++++----------
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d47f394..12dbb5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,7 +1331,9 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..ad7cb6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,8 +1066,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+#endif

p->default_timer_slack_ns = current->timer_slack_ns;

diff --git a/kernel/sched.c b/kernel/sched.c
index db1f6b2..3674f77 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5164,10 +5164,8 @@ void account_idle_ticks(unsigned long ticks)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING
void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- if (ut)
- *ut = p->utime;
- if (st)
- *st = p->stime;
+ *ut = p->utime;
+ *st = p->stime;
}
#else

@@ -5177,7 +5175,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)

void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- cputime_t rtime, utime = p->utime, total = utime + p->stime;
+ cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);

/*
* Use CFS's precise accounting:
@@ -5197,12 +5195,10 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
* Compare with previous values, to keep monotonicity:
*/
p->prev_utime = max(p->prev_utime, utime);
- p->prev_stime = max(p->prev_stime, rtime - p->prev_utime);
+ p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));

- if (ut)
- *ut = p->prev_utime;
- if (st)
- *st = p->prev_stime;
+ *ut = p->prev_utime;
+ *st = p->prev_stime;
}
#endif

--
1.6.5.3

2009-12-02 08:28:23

by Hidetoshi Seto

[permalink] [raw]
Subject: [PATCH -v2 2/2] sched, cputime: introduce thread_group_times()

This is a real fix for the problem of utime/stime values decreasing,
described in the thread:
http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
(and also c{u,s}time) in the signal struct are added to c{u,s}time
in the signal struct of the group's parent.

So {u,s}time in the task struct are "raw" tick counts, while {u,s}time
and c{u,s}time in the signal struct are "adjusted" values.

And the accounted values are used by:

- task_times(), to get the cputime of a thread:
This function returns adjusted values that originate from the raw
{u,s}time, scaled by the sum_exec_runtime accounted by CFS.

- thread_group_cputime(), to get the cputime of a thread group:
This function returns the sum of all {u,s}time of living threads in
the group, plus the {u,s}time in the signal struct, which is the sum
of the adjusted cputimes of all exited threads that belonged to the
group.

The problem is the return value of thread_group_cputime(), because
it is a mixed sum of "raw" and "adjusted" values:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume there is a thread whose raw values are greater than its
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but it
only ran 45ms); when it exits, the group's cputime will decrease
(e.g. by -5ms).

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it to every
thread should be avoided.

This patch fixes the above problem in the following way:

- Modify thread exit (= __exit_signal()) not to use task_times().
This means {u,s}time in the signal struct accumulate raw values
instead of adjusted values. As a result, thread_group_cputime()
returns a pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts the "raw" values of thread_group_cputime() to
"adjusted" values, using the same calculation procedure as
task_times().

- Modify group exit (= wait_task_zombie()) to use the newly
introduced thread_group_times(). This keeps c{u,s}time in the
signal struct adjusted, as before this patch.

- Replace some thread_group_cputime() calls with thread_group_times().
These replacements are applied only where the "adjusted" cputime is
conveyed to users, and where task_times() is already used nearby
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat).

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
(e.g. each runs 0.9ms and is never interrupted by a tick), the
group's cputime could be invisible, since each thread's cputime was
accumulated after being adjusted: imagine the adjustment function
as adj(ticks, runtime), then
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch this does not happen, because the adjustment is
applied after accumulation.

v2:
- remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <[email protected]>
---
fs/proc/array.c | 5 +----
include/linux/sched.h | 4 ++++
kernel/exit.c | 23 ++++++++++++-----------
kernel/fork.c | 3 +++
kernel/sched.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 18 ++++++++----------
6 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index b37121a..9e5d173 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -495,7 +495,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -506,9 +505,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12dbb5c..7ae1274 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -628,6 +628,9 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1723,6 +1726,7 @@ static inline void put_task_struct(struct task_struct *t)
}

extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
- cputime_t utime, stime;
-
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- task_times(tsk, &utime, &stime);
- sig->utime = cputime_add(sig->utime, utime);
- sig->stime = cputime_add(sig->stime, stime);
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
+ cputime_t tgutime, tgstime;

/*
* The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* need to protect the access to parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
*/
+ thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(p->utime,
- cputime_add(sig->utime,
- sig->cutime)));
+ cputime_add(tgutime,
+ sig->cutime));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(p->stime,
- cputime_add(sig->stime,
- sig->cstime)));
+ cputime_add(tgstime,
+ sig->cstime));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..3d6f121 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ sig->prev_utime = sig->prev_stime = cputime_zero;
+#endif
sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index 3674f77..aa733d5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5167,6 +5167,16 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->utime;
*st = p->stime;
}
+
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ thread_group_cputime(p, &cputime);
+
+ *ut = cputime.utime;
+ *st = cputime.stime;
+}
#else

#ifndef nsecs_to_cputime
@@ -5200,6 +5210,37 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->prev_utime;
*st = p->prev_stime;
}
+
+/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct signal_struct *sig = p->signal;
+ struct task_cputime cputime;
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, &cputime);
+
+ total = cputime_add(cputime.utime, cputime.stime);
+ rtime = nsecs_to_cputime(cputime.sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime.utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ sig->prev_utime = max(sig->prev_utime, utime);
+ sig->prev_stime = max(sig->prev_stime,
+ cputime_sub(rtime, sig->prev_utime));
+
+ *ut = sig->prev_utime;
+ *st = sig->prev_stime;
+}
#endif

/*
diff --git a/kernel/sys.c b/kernel/sys.c
index d988abe..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
+ cputime_t tgutime, tgstime, cutime, cstime;

spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime(current, &cputime);
+ thread_group_times(current, &tgutime, &tgstime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
+ tms->tms_utime = cputime_to_clock_t(tgutime);
+ tms->tms_stime = cputime_to_clock_t(tgstime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t utime, stime;
- struct task_cputime cputime;
+ cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;

memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ thread_group_times(p, &tgutime, &tgstime);
+ utime = cputime_add(utime, tgutime);
+ stime = cputime_add(stime, tgstime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
--
1.6.5.3

2009-12-02 08:29:19

by Hidetoshi Seto

[permalink] [raw]
Subject: reproducer: utime decreasing

The following is a sample program that can reproduce the problem
in several ways.

I tested my patch using this reproducer, and confirmed that applying
both Stanislaw's do_sys_times() patch and my thread_group_times()
patch is required to fix the problem.


Thanks,
H.Seto
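
(For reference: the program below builds with something like
"gcc -pthread -o time_test time_test.c" and takes an optional
thread-count argument; the file name and flags are just an example.)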

===
/*
* Sample program to demonstrate time decreasing on thread exit
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/times.h>

#define DEFAULT_THREADS 500

unsigned long lpt;
int looptype;
int samplesleep;
int threads;

void *child (void *thread_id)
{
struct tms t[12];
int i, j;
unsigned long ret = 0, id = (unsigned long)thread_id;

if (looptype) {
/*
* discont:
*
* Loop tricky, to make a significant gap between
* task->{u,s}time and task_times().
*
* runtime of a thread = 0.5 tick * 1000 = 500 ms
* task->{u,s}time = (??? , 0)
* task_times() = ideally (500, 0)
*/
for (j = 0; j < 1000; j++) {
for (i = 0; i < lpt/2; i++)
;
usleep(0);
}
} else {
/*
* cont:
*
* Loop without tricks.
*
* runtime of a thread = 500 ms
* task->{u,s}time = (500, 0)
* task_times() = (500, 0)
*/
for (i = 0; i < lpt * 500; i++)
;
}

if (!(id % 4))
/* bother other threads */
pthread_exit((void *)ret);

for (i = 0; i < 12; i++) {
times(&t[i]);
if (samplesleep)
usleep(0);
}

for (i = 0; i + 5 < 12; i++) {
/*
* +----+----+----+----+----+
* i+0 i+1 i+2 i+3 i+4 i+5
* ^^^^^^
* check here
*/
if (t[i+2].tms_utime > t[i+3].tms_utime
|| t[i+2].tms_stime > t[i+3].tms_stime) {

printf("[%4ld] %s decreased %3d: "
"(%d %d) (%d %d) [%d %d]<->[%d %d] (%d %d) (%d %d)\n",
id,
t[i+2].tms_utime > t[i+3].tms_utime ?
"utime" : "stime",
t[i+2].tms_utime > t[i+3].tms_utime ?
t[i+3].tms_utime - t[i+2].tms_utime :
t[i+3].tms_stime - t[i+2].tms_stime,
t[i+0].tms_utime, t[i+0].tms_stime,
t[i+1].tms_utime, t[i+1].tms_stime,
t[i+2].tms_utime, t[i+2].tms_stime,
t[i+3].tms_utime, t[i+3].tms_stime,
t[i+4].tms_utime, t[i+4].tms_stime,
t[i+5].tms_utime, t[i+5].tms_stime);
ret = 1;
}
}

pthread_exit((void *)ret);
}

void get_loops_per_tick(void)
{
struct tms t1, t2;
unsigned long i, mloop = 1000 * 1000 * 1000;

times(&t1);
for (i = 0; i < mloop; i++)
;
times(&t2);

lpt = mloop / ((t2.tms_utime - t1.tms_utime) * 10);
}

void do_test(int c)
{
struct tms t1, t2;
clock_t j1, j2;
pthread_t *th;
unsigned long i, ret = 0;

th = calloc(threads, sizeof(pthread_t));
if (!th)
return;

looptype = !!(c & 0x1) ? 1 : 0;
samplesleep = !!(c & 0x2) ? 1 : 0;
printf("looptype : %s\n", looptype ? "discont" : "cont");
printf("samplesleep : %s\n", samplesleep ? "yes" : "no");

printf(" ## start ##\n");
j1 = times(&t1);
for (i = 0; i < threads; i++)
pthread_create (&th[i], NULL, child, (void *)i);
for (i = 0; i < threads; i++) {
void *r;
pthread_join(th[i], &r);
ret += (unsigned long)r;
}
j2 = times(&t2);
printf(" ## done. ##\n");
printf(" loop total:\n");
printf(" user : %7d ms\n", (t2.tms_utime - t1.tms_utime) * 10);
printf(" system : %7d ms\n", (t2.tms_stime - t1.tms_stime) * 10);
printf(" elapse : %7d ms\n", (j2 - j1) * 10);
printf(" error : %d\n\n", ret);

printf("result: %s\n\n", ret ? "BAD" : "GOOD");
}

int main(int argc, char **argv)
{
int i;

threads = argc > 1 ? atoi(argv[1]) : DEFAULT_THREADS;

printf("### Prep:\n");
get_loops_per_tick();
printf("loops_per_tick: %ld\n", lpt);
printf("threads : %d\n\n", threads);

printf("### Test:\n");
for (i = 0; i < 4; i++)
do_test(i);

return 0;
}
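
(On an affected kernel, the "discont" runs are the ones expected to
report BAD, since they create the raw-vs-adjusted gap described above;
with both fixes applied, all four combinations should report GOOD.)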

2009-12-02 08:33:11

by Hidetoshi Seto

[permalink] [raw]
Subject: reproducer: invisible utime

This program can reproduce another problem, the one that
task_{u,s}time() was originally intended to solve, I think.

The adjustment by task_{u,s}time() was applied only in getrusage()
and /proc/<pid>/stat. So even in .32-rc8 we can reproduce this
problem via sys_times(), which has no adjustment.

I confirmed that my patches fix it, via thread_group_times().


Thanks,
H.Seto

===
/*
* Sample program to demonstrate invisible utime
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/times.h>

/* 300 million, arbitrary */
#define LOOP (300 * 1000 * 1000)

unsigned long lpt, l, c;

int do_loop(void)
{
struct tms t1, t2;
clock_t j1, j2;
unsigned long i;

printf("Loop %d times, sleeping %d times every %d.\n", l, l/c, c);
printf(" start ...\n");
j1 = times(&t1);
for (i = 1; i <= l; i++)
if (!(i % c))
usleep(0);
j2 = times(&t2);
printf(" ... done.\n");

/* tms_{u,s}time is clock_t */
printf(" user : %5d ms\n", (t2.tms_utime - t1.tms_utime) * 10);
printf(" system : %5d ms\n", (t2.tms_stime - t1.tms_stime) * 10);
printf(" elapse : %5d ms\n\n", (j2 - j1) * 10);

return (t2.tms_utime - t1.tms_utime) * 10;
}

int main(int argc, char **argv)
{
int u0, u1, u2, u3;
int ticks;

l = argc > 1 ? atoi(argv[1]) : LOOP;

printf("### Prep:\n");
c = l;
ticks = do_loop();
lpt = l / ticks;
printf("loops/tick: %d\n", lpt);
l = lpt * 1000;
printf("change loop to %d to short test.\n\n", l);

printf("### Test:\n");
c = l;
u0 = do_loop();
c = lpt;
u1 = do_loop();
c = lpt / 2;
u2 = do_loop();
c = lpt / 8;
u3 = do_loop();
printf("result: %s\n\n",
(u0 <= u1) && (u1 <= u2) && (u2 <= u3) ? "GOOD" : "BAD");

return 0;
}
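
(Before the fix, the runs that sleep more often dodge the tick and show
less utime through sys_times() despite doing at least as much work, so
the u0 <= u1 <= u2 <= u3 check is expected to report BAD; with the
adjustment done by thread_group_times() it should report GOOD.)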

2009-12-02 15:17:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2 1/2] sched, cputime: cleanups related to task_times()

Could you, in the future, please start a new thread for new patches?

2009-12-02 15:29:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH -v2 1/2] sched, cputime: cleanups related to task_times()

* Peter Zijlstra <[email protected]> [2009-12-02 16:17:35]:

> Could you, in the future, please start a new thread for new patches?
>

Yes, I almost missed it, because I marked the thread as read. Let me
take a look.


--
Balbir

2009-12-02 15:57:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2 1/2] sched, cputime: cleanups related to task_times()

On Wed, 2009-12-02 at 17:26 +0900, Hidetoshi Seto wrote:
> - Remove if({u,s}t)s because no one calls it with NULL now.
> - Use cputime_{add,sub}().
> - Add ifndef-endif for prev_{u,s}time since they are used
> only when !VIRT_CPU_ACCOUNTING.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>

2009-12-02 15:58:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -v2 2/2] sched, cputime: introduce thread_group_times()

On Wed, 2009-12-02 at 17:28 +0900, Hidetoshi Seto wrote:
> This is a real fix for the problem of utime/stime values decreasing,
> described in the thread:
> http://lkml.org/lkml/2009/11/3/522
>
> Now cputime is accounted in the following way:
>
> - {u,s}time in task_struct are increased every time the thread
> is interrupted by a tick (timer interrupt).
>
> - When a thread exits, its {u,s}time are added to signal->{u,s}time,
> after being adjusted by task_times().
>
> - When all threads in a thread_group exit, the accumulated {u,s}time
> (and also c{u,s}time) in the signal struct are added to c{u,s}time
> in the signal struct of the group's parent.
>
> So {u,s}time in the task struct are "raw" tick counts, while {u,s}time
> and c{u,s}time in the signal struct are "adjusted" values.
>
> And the accounted values are used by:
>
> - task_times(), to get the cputime of a thread:
> This function returns adjusted values that originate from the raw
> {u,s}time, scaled by the sum_exec_runtime accounted by CFS.
>
> - thread_group_cputime(), to get the cputime of a thread group:
> This function returns the sum of all {u,s}time of living threads in
> the group, plus the {u,s}time in the signal struct, which is the sum
> of the adjusted cputimes of all exited threads that belonged to the
> group.
>
> The problem is the return value of thread_group_cputime(), because
> it is a mixed sum of "raw" and "adjusted" values:
>
> group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
>
> This misbehavior can break {u,s}time monotonicity.
> Assume there is a thread whose raw values are greater than its
> adjusted values (e.g. interrupted by 1000Hz ticks 50 times but it
> only ran 45ms); when it exits, the group's cputime will decrease
> (e.g. by -5ms).
>
> To fix this, we could do:
>
> group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
>
> But task_times() contains hard divisions, so applying it to every
> thread should be avoided.
>
> This patch fixes the above problem in the following way:
>
> - Modify thread exit (= __exit_signal()) not to use task_times().
> This means {u,s}time in the signal struct accumulate raw values
> instead of adjusted values. As a result, thread_group_cputime()
> returns a pure sum of "raw" values.
>
> - Introduce a new function thread_group_times(*task, *utime, *stime)
> that converts the "raw" values of thread_group_cputime() to
> "adjusted" values, using the same calculation procedure as
> task_times().
>
> - Modify group exit (= wait_task_zombie()) to use the newly
> introduced thread_group_times(). This keeps c{u,s}time in the
> signal struct adjusted, as before this patch.
>
> - Replace some thread_group_cputime() calls with thread_group_times().
> These replacements are applied only where the "adjusted" cputime is
> conveyed to users, and where task_times() is already used nearby
> (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).
>
> This patch has a positive side effect:
>
> - Before this patch, if a group contained many short-lived threads
> (e.g. each runs 0.9ms and is never interrupted by a tick), the
> group's cputime could be invisible, since each thread's cputime was
> accumulated after being adjusted: imagine the adjustment function
> as adj(ticks, runtime), then
> {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
> After this patch this does not happen, because the adjustment is
> applied after accumulation.
>
> v2:
> - remove if()s, put new variables into signal_struct.
>
> Signed-off-by: Hidetoshi Seto <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>


Thanks for taking care of this!

2009-12-02 17:34:37

by Hidetoshi Seto

[permalink] [raw]
Subject: [tip:sched/core] sched, cputime: Cleanups related to task_times()

Commit-ID: d99ca3b977fc5a93141304f571475c2af9e6c1c5
Gitweb: http://git.kernel.org/tip/d99ca3b977fc5a93141304f571475c2af9e6c1c5
Author: Hidetoshi Seto <[email protected]>
AuthorDate: Wed, 2 Dec 2009 17:26:47 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 2 Dec 2009 17:32:39 +0100

sched, cputime: Cleanups related to task_times()

- Remove if({u,s}t)s because no one calls it with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
only when !VIRT_CPU_ACCOUNTING.

Signed-off-by: Hidetoshi Seto <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Spencer Candland <[email protected]>
Cc: Americo Wang <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/fork.c | 2 ++
kernel/sched.c | 16 ++++++----------
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0395b0f..dff85e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,7 +1331,9 @@ struct task_struct {

cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw; /* context switch counts */
struct timespec start_time; /* monotonic time */
struct timespec real_start_time; /* boot based time */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..ad7cb6d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,8 +1066,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->gtime = cputime_zero;
p->utimescaled = cputime_zero;
p->stimescaled = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
p->prev_utime = cputime_zero;
p->prev_stime = cputime_zero;
+#endif

p->default_timer_slack_ns = current->timer_slack_ns;

diff --git a/kernel/sched.c b/kernel/sched.c
index 4883fee..17e2c1d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5184,10 +5184,8 @@ void account_idle_ticks(unsigned long ticks)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING
void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- if (ut)
- *ut = p->utime;
- if (st)
- *st = p->stime;
+ *ut = p->utime;
+ *st = p->stime;
}
#else

@@ -5197,7 +5195,7 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)

void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
- cputime_t rtime, utime = p->utime, total = utime + p->stime;
+ cputime_t rtime, utime = p->utime, total = cputime_add(utime, p->stime);

/*
* Use CFS's precise accounting:
@@ -5217,12 +5215,10 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
* Compare with previous values, to keep monotonicity:
*/
p->prev_utime = max(p->prev_utime, utime);
- p->prev_stime = max(p->prev_stime, rtime - p->prev_utime);
+ p->prev_stime = max(p->prev_stime, cputime_sub(rtime, p->prev_utime));

- if (ut)
- *ut = p->prev_utime;
- if (st)
- *st = p->prev_stime;
+ *ut = p->prev_utime;
+ *st = p->prev_stime;
}
#endif

2009-12-02 17:34:28

by Hidetoshi Seto

[permalink] [raw]
Subject: [tip:sched/core] sched, cputime: Introduce thread_group_times()

Commit-ID: 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
Gitweb: http://git.kernel.org/tip/0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
Author: Hidetoshi Seto <[email protected]>
AuthorDate: Wed, 2 Dec 2009 17:28:07 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 2 Dec 2009 17:32:40 +0100

sched, cputime: Introduce thread_group_times()

This is a real fix for the problem of utime/stime values decreasing,
described in the thread:

http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after being adjusted by task_times().

- When all threads in a thread_group exit, the accumulated {u,s}time
(and also c{u,s}time) in the signal struct are added to c{u,s}time
in the signal struct of the group's parent.

So {u,s}time in the task struct are "raw" tick counts, while
{u,s}time and c{u,s}time in the signal struct are "adjusted" values.

And the accounted values are used by:

- task_times(), to get the cputime of a thread:
This function returns adjusted values that originate from the raw
{u,s}time, scaled by the sum_exec_runtime accounted by CFS.

- thread_group_cputime(), to get the cputime of a thread group:
This function returns the sum of all {u,s}time of living threads in
the group, plus the {u,s}time in the signal struct, which is the sum
of the adjusted cputimes of all exited threads that belonged to the
group.

The problem is the return value of thread_group_cputime(),
because it is a mixed sum of "raw" and "adjusted" values:

group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume there is a thread whose raw values are greater than its
adjusted values (e.g. interrupted by 1000Hz ticks 50 times but it
only ran 45ms); when it exits, the group's cputime will decrease
(e.g. by -5ms).

To fix this, we could do:

group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it to
every thread should be avoided.

This patch fixes the above problem in the following way:

- Modify thread exit (= __exit_signal()) not to use task_times().
This means {u,s}time in the signal struct accumulate raw values
instead of adjusted values. As a result, thread_group_cputime()
returns a pure sum of "raw" values.

- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts the "raw" values of thread_group_cputime() to
"adjusted" values, using the same calculation procedure as
task_times().

- Modify group exit (= wait_task_zombie()) to use the newly
introduced thread_group_times(). This keeps c{u,s}time in the
signal struct adjusted, as before this patch.

- Replace some thread_group_cputime() calls with thread_group_times().
These replacements are applied only where the "adjusted" cputime is
conveyed to users, and where task_times() is already used nearby
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat).

This patch has a positive side effect:

- Before this patch, if a group contained many short-lived threads
(e.g. each runs 0.9ms and is never interrupted by a tick), the
group's cputime could be invisible, since each thread's cputime was
accumulated after being adjusted: imagine the adjustment function
as adj(ticks, runtime), then
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch this does not happen, because the adjustment is
applied after accumulation.

v2:
- remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Spencer Candland <[email protected]>
Cc: Americo Wang <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
fs/proc/array.c | 5 +----
include/linux/sched.h | 4 ++++
kernel/exit.c | 23 ++++++++++++-----------
kernel/fork.c | 3 +++
kernel/sched.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 18 ++++++++----------
6 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ca61a88..2571da4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -506,7 +506,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
@@ -517,9 +516,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ thread_group_times(task, &utime, &stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dff85e5..34238bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,6 +624,9 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ cputime_t prev_utime, prev_stime;
+#endif
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1723,6 +1726,7 @@ static inline void put_task_struct(struct task_struct *t)
}

extern void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st);
+extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st);

/*
* Per process flags
diff --git a/kernel/exit.c b/kernel/exit.c
index 2eaf68b..b221ad6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -91,8 +91,6 @@ static void __exit_signal(struct task_struct *tsk)
if (atomic_dec_and_test(&sig->count))
posix_cpu_timers_exit_group(tsk);
else {
- cputime_t utime, stime;
-
/*
* If there is any task waiting for the group exit
* then notify it:
@@ -112,9 +110,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
- task_times(tsk, &utime, &stime);
- sig->utime = cputime_add(sig->utime, utime);
- sig->stime = cputime_add(sig->stime, stime);
+ sig->utime = cputime_add(sig->utime, tsk->utime);
+ sig->stime = cputime_add(sig->stime, tsk->stime);
sig->gtime = cputime_add(sig->gtime, tsk->gtime);
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -1208,6 +1205,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
struct signal_struct *psig;
struct signal_struct *sig;
unsigned long maxrss;
+ cputime_t tgutime, tgstime;

/*
* The resource counters for the group leader are in its
@@ -1223,20 +1221,23 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
* need to protect the access to parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
*/
+ thread_group_times(p, &tgutime, &tgstime);
spin_lock_irq(&p->real_parent->sighand->siglock);
psig = p->real_parent->signal;
sig = p->signal;
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(p->utime,
- cputime_add(sig->utime,
- sig->cutime)));
+ cputime_add(tgutime,
+ sig->cutime));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(p->stime,
- cputime_add(sig->stime,
- sig->cstime)));
+ cputime_add(tgstime,
+ sig->cstime));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index ad7cb6d..3d6f121 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+ sig->prev_utime = sig->prev_stime = cputime_zero;
+#endif
sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index 17e2c1d..e6ba726 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5187,6 +5187,16 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->utime;
*st = p->stime;
}
+
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct task_cputime cputime;
+
+ thread_group_cputime(p, &cputime);
+
+ *ut = cputime.utime;
+ *st = cputime.stime;
+}
#else

#ifndef nsecs_to_cputime
@@ -5220,6 +5230,37 @@ void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
*ut = p->prev_utime;
*st = p->prev_stime;
}
+
+/*
+ * Must be called with siglock held.
+ */
+void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
+{
+ struct signal_struct *sig = p->signal;
+ struct task_cputime cputime;
+ cputime_t rtime, utime, total;
+
+ thread_group_cputime(p, &cputime);
+
+ total = cputime_add(cputime.utime, cputime.stime);
+ rtime = nsecs_to_cputime(cputime.sum_exec_runtime);
+
+ if (total) {
+ u64 temp;
+
+ temp = (u64)(rtime * cputime.utime);
+ do_div(temp, total);
+ utime = (cputime_t)temp;
+ } else
+ utime = rtime;
+
+ sig->prev_utime = max(sig->prev_utime, utime);
+ sig->prev_stime = max(sig->prev_stime,
+ cputime_sub(rtime, sig->prev_utime));
+
+ *ut = sig->prev_utime;
+ *st = sig->prev_stime;
+}
#endif

/*
diff --git a/kernel/sys.c b/kernel/sys.c
index bbdfce0..9968c5f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -911,16 +911,15 @@ change_okay:

void do_sys_times(struct tms *tms)
{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
+ cputime_t tgutime, tgstime, cutime, cstime;

- thread_group_cputime(current, &cputime);
spin_lock_irq(&current->sighand->siglock);
+ thread_group_times(current, &tgutime, &tgstime);
cutime = current->signal->cutime;
cstime = current->signal->cstime;
spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
+ tms->tms_utime = cputime_to_clock_t(tgutime);
+ tms->tms_stime = cputime_to_clock_t(tgstime);
tms->tms_cutime = cputime_to_clock_t(cutime);
tms->tms_cstime = cputime_to_clock_t(cstime);
}
@@ -1338,8 +1337,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
{
struct task_struct *t;
unsigned long flags;
- cputime_t utime, stime;
- struct task_cputime cputime;
+ cputime_t tgutime, tgstime, utime, stime;
unsigned long maxrss = 0;

memset((char *) r, 0, sizeof *r);
@@ -1372,9 +1370,9 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ thread_group_times(p, &tgutime, &tgstime);
+ utime = cputime_add(utime, tgutime);
+ stime = cputime_add(stime, tgstime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;

2009-12-03 00:21:44

by Hidetoshi Seto

[permalink] [raw]
Subject: Re: [PATCH -v2 1/2] sched, cputime: cleanups related to task_times()

Balbir Singh wrote:
> * Peter Zijlstra <[email protected]> [2009-12-02 16:17:35]:
>
>> Could you, in the future, please start a new thread for new patches?
>>
>
> Yes, I almost missed it, because I marked the thread as read. Let me
> take a look.

OK, I will.

Maybe I could post a link to the new thread here instead, to make it
easy to track the problem from the original thread to the patch.


Thanks,
H.Seto