LinuxLists.cc - Re: regression introduced by - timers: fix itimer/many thread hang

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Wed, 2008-11-05 at 17:58 -0800, Frank Mayhar wrote:
> On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> > On Mon, 2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > > On Wed, 2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > > Unable to handle kernel paging request at virtual address
> > > > 94949494949494a4
> > >
> > > I take it this can be read as an uninitialized (or cleared) pointer?
> > >
> > > It certainly looks like this is a race in thread (process?) teardown. I
> > > don't have hardware on which to reproduce this but _looks_ like another
> > > thread has gotten in and torn down the process while we've been busy.
> >
> > I finally managed to get kdump working and caught this in the act. I
> > still need to dig into this more but I think these 2 threads will show
> > us the race condition. Note that this is a slightly hacked kernel in
> > that I removed "static" from a few functions to better see what was
> > going on but no real functional changes when compared to a recent (day
> > old or so) git pull from Linus's tree.
>
> After digging through this a bit, I've concluded that it's probably a
> race between process reap and the dequeue_entity() call to update_curr()
> combined with a side effect of the slab debug stuff. The
> account_group_exec_runtime() routine (like the rest of these routines)
> checks tsk->signal and tsk->signal->cputime.totals for NULL to make sure
> they're still valid. It looks like at this point tsk->signal is valid
> (since the tsk->signal->cputime dereference succeeded) but
> tsk->signal->cputime.totals is invalid. That can't happen unless the
> process is being reaped, and in fact can only happen in a narrow window
> in __cleanup_signal() between the call to thread_group_cputime_free()
> and the kmem_cache_free() of the signal struct itself. Without the slab
> debug stuff it would either skip the update (by noticing that pointers
> were NULL) or blithely update freed structures.
>
> I can't see anything that would prevent this from happening in the
> general case. I don't see what would stop the parent from coming in on
> another CPU and reaping the process after schedule() has picked it off
> the rq but before the deactivate_task->dequeue_task->dequeue_entity->
> update_curr chain completes. I see that schedule() disables preemption
> on that CPU but will that really do the job? I also suspect that with
> hyperthreading both of these processes are on the same silicon, meaning
> that one can be unexpectedly suspended while the other runs, thereby
> making the window wider.
>
> Unfortunately I don't know the code well enough to know what the right
> fix is. Maybe account_group_exec_runtime() should check for PF_EXITED?

That is just plain ugly. The right fix to me seems to destroy the
signal/thread group stuff _after_ the last task goes away.

> Maybe update_curr() should do that? I think that it makes more sense
> for dequeue_entity() to do the check and then not call update_curr() if
> the task is exiting but I defer to others with more knowledge of this
> area.

Hell no. Its a race in this signal/thread group stuff, fix it there.

But now you made me look at this patch:

commit f06febc96ba8e0af80bcc3eaec0a109e88275fac
Author: Frank Mayhar <[email protected]>
Date: Fri Sep 12 09:54:39 2008 -0700

timers: fix itimer/many thread hang

and I'm thinking you just rendered big iron unbootable.

You replaced a loop-over-threads by a loop-over-cpus, on every tick. Did
you stop to think what would happen on 4096 cpu machines?

Also, you just introduced per-cpu allocations for each thread-group,
while Christoph is reworking the per-cpu allocator, with one unfortunate
side-effect - its going to have a limited size pool. Therefore this will
limit the number of thread-groups we can have.

Hohumm, not at all liking this..

2008-11-06 15:04:46

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Thu, 6 Nov 2008, Peter Zijlstra wrote:

> Also, you just introduced per-cpu allocations for each thread-group,
> while Christoph is reworking the per-cpu allocator, with one unfortunate
> side-effect - its going to have a limited size pool. Therefore this will
> limit the number of thread-groups we can have.

Patches exist that implement a dynamically growable percpu pool (using
virtual mappings though). If the cost of the additional complexity /
overhead is justifiable then we can make the percpu pool dynamically
extendable.

2008-11-06 15:08:14

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Thu, 2008-11-06 at 09:03 -0600, Christoph Lameter wrote:
> On Thu, 6 Nov 2008, Peter Zijlstra wrote:
>
> > Also, you just introduced per-cpu allocations for each thread-group,
> > while Christoph is reworking the per-cpu allocator, with one unfortunate
> > side-effect - its going to have a limited size pool. Therefore this will
> > limit the number of thread-groups we can have.
>
> Patches exist that implement a dynamically growable percpu pool (using
> virtual mappings though). If the cost of the additional complexity /
> overhead is justifiable then we can make the percpu pool dynamically
> extendable.

Right, but I don't think the patch under consideration will fly anyway,
doing a for_each_possible_cpu() loop on every tick on all cpus isn't
really healthy, even for moderate sized machines.

2008-11-06 16:09:43

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Thu, 6 Nov 2008, Peter Zijlstra wrote:

> Right, but I don't think the patch under consideration will fly anyway,
> doing a for_each_possible_cpu() loop on every tick on all cpus isn't
> really healthy, even for moderate sized machines.

Correct.

2008-11-06 16:31:30

[permalink] [raw]

Subject: [PATCH] revert: timers: fix itimer/many thread hang

This patch reverts all the itimer/many thread patches:

7086efe1c1536f6bc160e7d60a9bfd645b91f279
bb34d92f643086d546b49cef680f6f305ed84414
5ce73a4a5a4893a1aa4cdeed1b1a5a6de42c43b6
0a8eaa4f9b58759595a1bfe13a1295fdc25ba026
f06febc96ba8e0af80bcc3eaec0a109e88275fac

Because I think the per-cpu accounting approach is wrong and makes
things worse for people with a machine that has more than a hand-full of
CPUs.

Build and boot tested on my favourite x86_64 config.

Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8fcfa39..e215906 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1341,15 +1341,20 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
prstatus->pr_pgrp = task_pgrp_vnr(p);
prstatus->pr_sid = task_session_vnr(p);
if (thread_group_leader(p)) {
- struct task_cputime cputime;
-
/*
- * This is the record for the group leader. It shows the
- * group-wide total, not its individual thread total.
+ * This is the record for the group leader. Add in the
+ * cumulative times of previous dead threads. This total
+ * won't include the time of each live thread whose state
+ * is included in the core dump. The final total reported
+ * to our parent process when it calls wait4 will include
+ * those sums as well as the little bit more time it takes
+ * this and each other thread to finish dying after the
+ * core dump synchronization phase.
*/
- thread_group_cputime(p, &cputime);
- cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
- cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
+ cputime_to_timeval(cputime_add(p->utime, p->signal->utime),
+ &prstatus->pr_utime);
+ cputime_to_timeval(cputime_add(p->stime, p->signal->stime),
+ &prstatus->pr_stime);
} else {
cputime_to_timeval(p->utime, &prstatus->pr_utime);
cputime_to_timeval(p->stime, &prstatus->pr_stime);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 6af7fba..efd68c5 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -388,20 +388,20 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

/* add up live thread stats at the group level */
if (whole) {
- struct task_cputime cputime;
struct task_struct *t = task;
do {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
+ utime = cputime_add(utime, task_utime(t));
+ stime = cputime_add(stime, task_stime(t));
gtime = cputime_add(gtime, task_gtime(t));
t = next_thread(t);
} while (t != task);

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- thread_group_cputime(task, &cputime);
- utime = cputime.utime;
- stime = cputime.stime;
+ utime = cputime_add(utime, sig->utime);
+ stime = cputime_add(stime, sig->stime);
gtime = cputime_add(gtime, sig->gtime);
}

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 4a145ca..89b6ecd 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -66,7 +66,6 @@ static inline unsigned int kstat_irqs(unsigned int irq)
return sum;
}

-extern unsigned long long task_delta_exec(struct task_struct *);
extern void account_user_time(struct task_struct *, cputime_t);
extern void account_user_time_scaled(struct task_struct *, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t);
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7c7213..04c2e43 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -113,6 +113,4 @@ void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,

long clock_nanosleep_restart(struct restart_block *restart_block);

-void update_rlimit_cpu(unsigned long rlim_new);
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc07f9a..a739747 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -433,39 +433,6 @@ struct pacct_struct {
unsigned long ac_minflt, ac_majflt;
};

-/**
- * struct task_cputime - collected CPU time counts
- * @utime: time spent in user mode, in &cputime_t units
- * @stime: time spent in kernel mode, in &cputime_t units
- * @sum_exec_runtime: total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are
- * tracked for threads and thread groups. Most things considering
- * CPU time want to group these counts together and treat all three
- * of them in parallel.
- */
-struct task_cputime {
- cputime_t utime;
- cputime_t stime;
- unsigned long long sum_exec_runtime;
-};
-/* Alternate field names when used to cache expirations. */
-#define prof_exp stime
-#define virt_exp utime
-#define sched_exp sum_exec_runtime
-
-/**
- * struct thread_group_cputime - thread group interval timer counts
- * @totals: thread group interval timers; substructure for
- * uniprocessor kernel, per-cpu for SMP kernel.
- *
- * This structure contains the version of task_cputime, above, that is
- * used for thread group CPU clock calculations.
- */
-struct thread_group_cputime {
- struct task_cputime *totals;
-};
-
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -511,17 +478,6 @@ struct signal_struct {
cputime_t it_prof_expires, it_virt_expires;
cputime_t it_prof_incr, it_virt_incr;

- /*
- * Thread group totals for process CPU clocks.
- * See thread_group_cputime(), et al, for details.
- */
- struct thread_group_cputime cputime;
-
- /* Earliest-expiration cache. */
- struct task_cputime cputime_expires;
-
- struct list_head cpu_timers[3];
-
/* job control IDs */

/*
@@ -552,7 +508,7 @@ struct signal_struct {
* Live threads maintain their own counters and add to these
* in __exit_signal, except for the group leader.
*/
- cputime_t cutime, cstime;
+ cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
@@ -561,6 +517,14 @@ struct signal_struct {
struct task_io_accounting ioac;

/*
+ * Cumulative ns of scheduled CPU time for dead threads in the
+ * group, not including a zombie group leader. (This only differs
+ * from jiffies_to_ns(utime + stime) if sched_clock uses something
+ * other than jiffies.)
+ */
+ unsigned long long sum_sched_runtime;
+
+ /*
* We don't bother to synchronize most readers of this at all,
* because there is no reader checking a limit that actually needs
* to get both rlim_cur and rlim_max atomically, and either one
@@ -571,6 +535,8 @@ struct signal_struct {
*/
struct rlimit rlim[RLIM_NLIMITS];

+ struct list_head cpu_timers[3];
+
/* keep the process-shared keyrings here so that they do the right
* thing in threads created with CLONE_THREAD */
#ifdef CONFIG_KEYS
@@ -1176,7 +1142,8 @@ struct task_struct {
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt;

- struct task_cputime cputime_expires;
+ cputime_t it_prof_expires, it_virt_expires;
+ unsigned long long it_sched_expires;
struct list_head cpu_timers[3];

/* process credentials */
@@ -1632,7 +1599,6 @@ extern unsigned long long cpu_clock(int cpu);

extern unsigned long long
task_sched_runtime(struct task_struct *task);
-extern unsigned long long thread_group_sched_runtime(struct task_struct *task);

/* sched_exec is called by processes performing an exec */
#ifdef CONFIG_SMP
@@ -2144,30 +2110,6 @@ static inline int spin_needbreak(spinlock_t *lock)
}

/*
- * Thread group CPU time accounting.
- */
-
-extern int thread_group_cputime_alloc(struct task_struct *);
-extern void thread_group_cputime(struct task_struct *, struct task_cputime *);
-
-static inline void thread_group_cputime_init(struct signal_struct *sig)
-{
- sig->cputime.totals = NULL;
-}
-
-static inline int thread_group_cputime_clone_thread(struct task_struct *curr)
-{
- if (curr->signal->cputime.totals)
- return 0;
- return thread_group_cputime_alloc(curr);
-}
-
-static inline void thread_group_cputime_free(struct signal_struct *sig)
-{
- free_percpu(sig->cputime.totals);
-}
-
-/*
* Reevaluate whether the task has signals pending delivery.
* Wake the task if so.
* This is required every time the blocked sigset_t changes.
diff --git a/include/linux/time.h b/include/linux/time.h
index ce321ac..d2c578d 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -132,9 +132,6 @@ extern int timekeeping_valid_for_hres(void);
extern void update_wall_time(void);
extern void update_xtime_cache(u64 nsec);

-struct tms;
-extern void do_sys_times(struct tms *);
-
/**
* timespec_to_ns - Convert timespec to nanoseconds
* @ts: pointer to the timespec variable to be converted
diff --git a/kernel/compat.c b/kernel/compat.c
index 8eafe3e..143990e 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -23,7 +23,6 @@
#include <linux/timex.h>
#include <linux/migrate.h>
#include <linux/posix-timers.h>
-#include <linux/times.h>

#include <asm/uaccess.h>

@@ -209,23 +208,49 @@ asmlinkage long compat_sys_setitimer(int which,
return 0;
}

-static compat_clock_t clock_t_to_compat_clock_t(clock_t x)
-{
- return compat_jiffies_to_clock_t(clock_t_to_jiffies(x));
-}
-
asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
{
+ /*
+ * In the SMP world we might just be unlucky and have one of
+ * the times increment as we use it. Since the value is an
+ * atomically safe type this is just fine. Conceptually its
+ * as if the syscall took an instant longer to occur.
+ */
if (tbuf) {
- struct tms tms;
struct compat_tms tmp;
-
- do_sys_times(&tms);
- /* Convert our struct tms to the compat version. */
- tmp.tms_utime = clock_t_to_compat_clock_t(tms.tms_utime);
- tmp.tms_stime = clock_t_to_compat_clock_t(tms.tms_stime);
- tmp.tms_cutime = clock_t_to_compat_clock_t(tms.tms_cutime);
- tmp.tms_cstime = clock_t_to_compat_clock_t(tms.tms_cstime);
+ struct task_struct *tsk = current;
+ struct task_struct *t;
+ cputime_t utime, stime, cutime, cstime;
+
+ read_lock(&tasklist_lock);
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ t = next_thread(t);
+ } while (t != tsk);
+
+ /*
+ * While we have tasklist_lock read-locked, no dying thread
+ * can be updating current->signal->[us]time. Instead,
+ * we got their counts included in the live thread loop.
+ * However, another thread can come in right now and
+ * do a wait call that updates current->signal->c[us]time.
+ * To make sure we always see that pair updated atomically,
+ * we take the siglock around fetching them.
+ */
+ spin_lock_irq(&tsk->sighand->siglock);
+ cutime = tsk->signal->cutime;
+ cstime = tsk->signal->cstime;
+ spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
+
+ tmp.tms_utime = compat_jiffies_to_clock_t(cputime_to_jiffies(utime));
+ tmp.tms_stime = compat_jiffies_to_clock_t(cputime_to_jiffies(stime));
+ tmp.tms_cutime = compat_jiffies_to_clock_t(cputime_to_jiffies(cutime));
+ tmp.tms_cstime = compat_jiffies_to_clock_t(cputime_to_jiffies(cstime));
if (copy_to_user(tbuf, &tmp, sizeof(tmp)))
return -EFAULT;
}
diff --git a/kernel/exit.c b/kernel/exit.c
index b361006..9d2f87b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -113,6 +113,8 @@ static void __exit_signal(struct task_struct *tsk)
* We won't ever get here for the group leader, since it
* will have been the last reference on the signal_struct.
*/
+ sig->utime = cputime_add(sig->utime, task_utime(tsk));
+ sig->stime = cputime_add(sig->stime, task_stime(tsk));
sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
@@ -121,6 +123,7 @@ static void __exit_signal(struct task_struct *tsk)
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
+ sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
sig = NULL; /* Marker for below. */
}

@@ -1301,7 +1304,6 @@ static int wait_task_zombie(struct task_struct *p, int options,
if (likely(!traced)) {
struct signal_struct *psig;
struct signal_struct *sig;
- struct task_cputime cputime;

/*
* The resource counters for the group leader are in its
@@ -1317,23 +1319,20 @@ static int wait_task_zombie(struct task_struct *p, int options,
* need to protect the access to p->parent->signal fields,
* as other threads in the parent group can be right
* here reaping other children at the same time.
- *
- * We use thread_group_cputime() to get times for the thread
- * group, which consolidates times for all threads in the
- * group including the group leader.
*/
spin_lock_irq(&p->parent->sighand->siglock);
psig = p->parent->signal;
sig = p->signal;
- thread_group_cputime(p, &cputime);
psig->cutime =
cputime_add(psig->cutime,
- cputime_add(cputime.utime,
- sig->cutime));
+ cputime_add(p->utime,
+ cputime_add(sig->utime,
+ sig->cutime)));
psig->cstime =
cputime_add(psig->cstime,
- cputime_add(cputime.stime,
- sig->cstime));
+ cputime_add(p->stime,
+ cputime_add(sig->stime,
+ sig->cstime)));
psig->cgtime =
cputime_add(psig->cgtime,
cputime_add(p->gtime,
diff --git a/kernel/fork.c b/kernel/fork.c
index 4b964d7..1e13d05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -765,44 +765,15 @@ void __cleanup_sighand(struct sighand_struct *sighand)
kmem_cache_free(sighand_cachep, sighand);
}

-
-/*
- * Initialize POSIX timer handling for a thread group.
- */
-static void posix_cpu_timers_init_group(struct signal_struct *sig)
-{
- /* Thread group counters. */
- thread_group_cputime_init(sig);
-
- /* Expiration times and increments. */
- sig->it_virt_expires = cputime_zero;
- sig->it_virt_incr = cputime_zero;
- sig->it_prof_expires = cputime_zero;
- sig->it_prof_incr = cputime_zero;
-
- /* Cached expiration times. */
- sig->cputime_expires.prof_exp = cputime_zero;
- sig->cputime_expires.virt_exp = cputime_zero;
- sig->cputime_expires.sched_exp = 0;
-
- /* The timer lists. */
- INIT_LIST_HEAD(&sig->cpu_timers[0]);
- INIT_LIST_HEAD(&sig->cpu_timers[1]);
- INIT_LIST_HEAD(&sig->cpu_timers[2]);
-}
-
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
struct signal_struct *sig;
int ret;

if (clone_flags & CLONE_THREAD) {
- ret = thread_group_cputime_clone_thread(current);
- if (likely(!ret)) {
- atomic_inc(&current->signal->count);
- atomic_inc(&current->signal->live);
- }
- return ret;
+ atomic_inc(&current->signal->count);
+ atomic_inc(&current->signal->live);
+ return 0;
}
sig = kmem_cache_alloc(signal_cachep, GFP_KERNEL);
tsk->signal = sig;
@@ -830,25 +801,39 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->it_real_incr.tv64 = 0;
sig->real_timer.function = it_real_fn;

+ sig->it_virt_expires = cputime_zero;
+ sig->it_virt_incr = cputime_zero;
+ sig->it_prof_expires = cputime_zero;
+ sig->it_prof_incr = cputime_zero;
+
sig->leader = 0; /* session leadership doesn't inherit */
sig->tty_old_pgrp = NULL;
sig->tty = NULL;

- sig->cutime = sig->cstime = cputime_zero;
+ sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
sig->gtime = cputime_zero;
sig->cgtime = cputime_zero;
sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
task_io_accounting_init(&sig->ioac);
+ INIT_LIST_HEAD(&sig->cpu_timers[0]);
+ INIT_LIST_HEAD(&sig->cpu_timers[1]);
+ INIT_LIST_HEAD(&sig->cpu_timers[2]);
taskstats_tgid_init(sig);

task_lock(current->group_leader);
memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
task_unlock(current->group_leader);

- posix_cpu_timers_init_group(sig);
-
+ if (sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY) {
+ /*
+ * New sole thread in the process gets an expiry time
+ * of the whole CPU time limit.
+ */
+ tsk->it_prof_expires =
+ secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
+ }
acct_init_pacct(&sig->pacct);

tty_audit_fork(sig);
@@ -858,7 +843,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)

void __cleanup_signal(struct signal_struct *sig)
{
- thread_group_cputime_free(sig);
exit_thread_group_keys(sig);
tty_kref_put(sig->tty);
kmem_cache_free(signal_cachep, sig);
@@ -909,19 +893,6 @@ void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
#endif /* CONFIG_MM_OWNER */

/*
- * Initialize POSIX timer handling for a single task.
- */
-static void posix_cpu_timers_init(struct task_struct *tsk)
-{
- tsk->cputime_expires.prof_exp = cputime_zero;
- tsk->cputime_expires.virt_exp = cputime_zero;
- tsk->cputime_expires.sched_exp = 0;
- INIT_LIST_HEAD(&tsk->cpu_timers[0]);
- INIT_LIST_HEAD(&tsk->cpu_timers[1]);
- INIT_LIST_HEAD(&tsk->cpu_timers[2]);
-}
-
-/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
*
@@ -1033,7 +1004,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);

- posix_cpu_timers_init(p);
+ p->it_virt_expires = cputime_zero;
+ p->it_prof_expires = cputime_zero;
+ p->it_sched_expires = 0;
+ INIT_LIST_HEAD(&p->cpu_timers[0]);
+ INIT_LIST_HEAD(&p->cpu_timers[1]);
+ INIT_LIST_HEAD(&p->cpu_timers[2]);

p->lock_depth = -1; /* -1 = no lock */
do_posix_clock_monotonic_gettime(&p->start_time);
@@ -1234,6 +1210,21 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD) {
p->group_leader = current->group_leader;
list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
+
+ if (!cputime_eq(current->signal->it_virt_expires,
+ cputime_zero) ||
+ !cputime_eq(current->signal->it_prof_expires,
+ cputime_zero) ||
+ current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
+ !list_empty(&current->signal->cpu_timers[0]) ||
+ !list_empty(&current->signal->cpu_timers[1]) ||
+ !list_empty(&current->signal->cpu_timers[2])) {
+ /*
+ * Have child wake up on its first tick to check
+ * for process CPU timers.
+ */
+ p->it_prof_expires = jiffies_to_cputime(1);
+ }
}

if (likely(p->pid)) {
diff --git a/kernel/itimer.c b/kernel/itimer.c
index db7c358..ab98274 100644
--- a/kernel/itimer.c
+++ b/kernel/itimer.c
@@ -55,15 +55,17 @@ int do_getitimer(int which, struct itimerval *value)
spin_unlock_irq(&tsk->sighand->siglock);
break;
case ITIMER_VIRTUAL:
+ read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_virt_expires;
cinterval = tsk->signal->it_virt_incr;
if (!cputime_eq(cval, cputime_zero)) {
- struct task_cputime cputime;
- cputime_t utime;
-
- thread_group_cputime(tsk, &cputime);
- utime = cputime.utime;
+ struct task_struct *t = tsk;
+ cputime_t utime = tsk->signal->utime;
+ do {
+ utime = cputime_add(utime, t->utime);
+ t = next_thread(t);
+ } while (t != tsk);
if (cputime_le(cval, utime)) { /* about to fire */
cval = jiffies_to_cputime(1);
} else {
@@ -71,19 +73,25 @@ int do_getitimer(int which, struct itimerval *value)
}
}
spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
cputime_to_timeval(cval, &value->it_value);
cputime_to_timeval(cinterval, &value->it_interval);
break;
case ITIMER_PROF:
+ read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_prof_expires;
cinterval = tsk->signal->it_prof_incr;
if (!cputime_eq(cval, cputime_zero)) {
- struct task_cputime times;
- cputime_t ptime;
-
- thread_group_cputime(tsk, &times);
- ptime = cputime_add(times.utime, times.stime);
+ struct task_struct *t = tsk;
+ cputime_t ptime = cputime_add(tsk->signal->utime,
+ tsk->signal->stime);
+ do {
+ ptime = cputime_add(ptime,
+ cputime_add(t->utime,
+ t->stime));
+ t = next_thread(t);
+ } while (t != tsk);
if (cputime_le(cval, ptime)) { /* about to fire */
cval = jiffies_to_cputime(1);
} else {
@@ -91,6 +99,7 @@ int do_getitimer(int which, struct itimerval *value)
}
}
spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
cputime_to_timeval(cval, &value->it_value);
cputime_to_timeval(cinterval, &value->it_interval);
break;
@@ -176,6 +185,7 @@ again:
case ITIMER_VIRTUAL:
nval = timeval_to_cputime(&value->it_value);
ninterval = timeval_to_cputime(&value->it_interval);
+ read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_virt_expires;
cinterval = tsk->signal->it_virt_incr;
@@ -190,6 +200,7 @@ again:
tsk->signal->it_virt_expires = nval;
tsk->signal->it_virt_incr = ninterval;
spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
if (ovalue) {
cputime_to_timeval(cval, &ovalue->it_value);
cputime_to_timeval(cinterval, &ovalue->it_interval);
@@ -198,6 +209,7 @@ again:
case ITIMER_PROF:
nval = timeval_to_cputime(&value->it_value);
ninterval = timeval_to_cputime(&value->it_interval);
+ read_lock(&tasklist_lock);
spin_lock_irq(&tsk->sighand->siglock);
cval = tsk->signal->it_prof_expires;
cinterval = tsk->signal->it_prof_incr;
@@ -212,6 +224,7 @@ again:
tsk->signal->it_prof_expires = nval;
tsk->signal->it_prof_incr = ninterval;
spin_unlock_irq(&tsk->sighand->siglock);
+ read_unlock(&tasklist_lock);
if (ovalue) {
cputime_to_timeval(cval, &ovalue->it_value);
cputime_to_timeval(cinterval, &ovalue->it_interval);
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 153dcb2..c42a03a 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -7,93 +7,6 @@
#include <linux/errno.h>
#include <linux/math64.h>
#include <asm/uaccess.h>
-#include <linux/kernel_stat.h>
-
-/*
- * Allocate the thread_group_cputime structure appropriately and fill in the
- * current values of the fields. Called from copy_signal() via
- * thread_group_cputime_clone_thread() when adding a second or subsequent
- * thread to a thread group. Assumes interrupts are enabled when called.
- */
-int thread_group_cputime_alloc(struct task_struct *tsk)
-{
- struct signal_struct *sig = tsk->signal;
- struct task_cputime *cputime;
-
- /*
- * If we have multiple threads and we don't already have a
- * per-CPU task_cputime struct (checked in the caller), allocate
- * one and fill it in with the times accumulated so far. We may
- * race with another thread so recheck after we pick up the sighand
- * lock.
- */
- cputime = alloc_percpu(struct task_cputime);
- if (cputime == NULL)
- return -ENOMEM;
- spin_lock_irq(&tsk->sighand->siglock);
- if (sig->cputime.totals) {
- spin_unlock_irq(&tsk->sighand->siglock);
- free_percpu(cputime);
- return 0;
- }
- sig->cputime.totals = cputime;
- cputime = per_cpu_ptr(sig->cputime.totals, smp_processor_id());
- cputime->utime = tsk->utime;
- cputime->stime = tsk->stime;
- cputime->sum_exec_runtime = tsk->se.sum_exec_runtime;
- spin_unlock_irq(&tsk->sighand->siglock);
- return 0;
-}
-
-/**
- * thread_group_cputime - Sum the thread group time fields across all CPUs.
- *
- * @tsk: The task we use to identify the thread group.
- * @times: task_cputime structure in which we return the summed fields.
- *
- * Walk the list of CPUs to sum the per-CPU time fields in the thread group
- * time structure.
- */
-void thread_group_cputime(
- struct task_struct *tsk,
- struct task_cputime *times)
-{
- struct signal_struct *sig;
- int i;
- struct task_cputime *tot;
-
- sig = tsk->signal;
- if (unlikely(!sig) || !sig->cputime.totals) {
- times->utime = tsk->utime;
- times->stime = tsk->stime;
- times->sum_exec_runtime = tsk->se.sum_exec_runtime;
- return;
- }
- times->stime = times->utime = cputime_zero;
- times->sum_exec_runtime = 0;
- for_each_possible_cpu(i) {
- tot = per_cpu_ptr(tsk->signal->cputime.totals, i);
- times->utime = cputime_add(times->utime, tot->utime);
- times->stime = cputime_add(times->stime, tot->stime);
- times->sum_exec_runtime += tot->sum_exec_runtime;
- }
-}
-
-/*
- * Called after updating RLIMIT_CPU to set timer expiration if necessary.
- */
-void update_rlimit_cpu(unsigned long rlim_new)
-{
- cputime_t cputime;
-
- cputime = secs_to_cputime(rlim_new);
- if (cputime_eq(current->signal->it_prof_expires, cputime_zero) ||
- cputime_lt(current->signal->it_prof_expires, cputime)) {
- spin_lock_irq(&current->sighand->siglock);
- set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
- spin_unlock_irq(&current->sighand->siglock);
- }
-}

static int check_clock(const clockid_t which_clock)
{
@@ -245,6 +158,10 @@ static inline cputime_t virt_ticks(struct task_struct *p)
{
return p->utime;
}
+static inline unsigned long long sched_ns(struct task_struct *p)
+{
+ return task_sched_runtime(p);
+}

int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
{
@@ -294,7 +211,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
cpu->cpu = virt_ticks(p);
break;
case CPUCLOCK_SCHED:
- cpu->sched = p->se.sum_exec_runtime + task_delta_exec(p);
+ cpu->sched = sched_ns(p);
break;
}
return 0;
@@ -303,30 +220,59 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
/*
* Sample a process (thread group) clock for the given group_leader task.
* Must be called with tasklist_lock held for reading.
+ * Must be called with tasklist_lock held for reading, and p->sighand->siglock.
*/
-static int cpu_clock_sample_group(const clockid_t which_clock,
- struct task_struct *p,
- union cpu_time_count *cpu)
+static int cpu_clock_sample_group_locked(unsigned int clock_idx,
+ struct task_struct *p,
+ union cpu_time_count *cpu)
{
- struct task_cputime cputime;
-
- thread_group_cputime(p, &cputime);
- switch (which_clock) {
+ struct task_struct *t = p;
+ switch (clock_idx) {
default:
return -EINVAL;
case CPUCLOCK_PROF:
- cpu->cpu = cputime_add(cputime.utime, cputime.stime);
+ cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
+ do {
+ cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
+ t = next_thread(t);
+ } while (t != p);
break;
case CPUCLOCK_VIRT:
- cpu->cpu = cputime.utime;
+ cpu->cpu = p->signal->utime;
+ do {
+ cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
+ t = next_thread(t);
+ } while (t != p);
break;
case CPUCLOCK_SCHED:
- cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
+ cpu->sched = p->signal->sum_sched_runtime;
+ /* Add in each other live thread. */
+ while ((t = next_thread(t)) != p) {
+ cpu->sched += t->se.sum_exec_runtime;
+ }
+ cpu->sched += sched_ns(p);
break;
}
return 0;
}

+/*
+ * Sample a process (thread group) clock for the given group_leader task.
+ * Must be called with tasklist_lock held for reading.
+ */
+static int cpu_clock_sample_group(const clockid_t which_clock,
+ struct task_struct *p,
+ union cpu_time_count *cpu)
+{
+ int ret;
+ unsigned long flags;
+ spin_lock_irqsave(&p->sighand->siglock, flags);
+ ret = cpu_clock_sample_group_locked(CPUCLOCK_WHICH(which_clock), p,
+ cpu);
+ spin_unlock_irqrestore(&p->sighand->siglock, flags);
+ return ret;
+}
+

int posix_cpu_clock_get(const clockid_t which_clock, struct timespec *tp)
{
@@ -525,11 +471,80 @@ void posix_cpu_timers_exit(struct task_struct *tsk)
}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
{
- struct task_cputime cputime;
-
- thread_group_cputime(tsk, &cputime);
cleanup_timers(tsk->signal->cpu_timers,
- cputime.utime, cputime.stime, cputime.sum_exec_runtime);
+ cputime_add(tsk->utime, tsk->signal->utime),
+ cputime_add(tsk->stime, tsk->signal->stime),
+ tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
+}
+
+
+/*
+ * Set the expiry times of all the threads in the process so one of them
+ * will go off before the process cumulative expiry total is reached.
+ */
+static void process_timer_rebalance(struct task_struct *p,
+ unsigned int clock_idx,
+ union cpu_time_count expires,
+ union cpu_time_count val)
+{
+ cputime_t ticks, left;
+ unsigned long long ns, nsleft;
+ struct task_struct *t = p;
+ unsigned int nthreads = atomic_read(&p->signal->live);
+
+ if (!nthreads)
+ return;
+
+ switch (clock_idx) {
+ default:
+ BUG();
+ break;
+ case CPUCLOCK_PROF:
+ left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
+ nthreads);
+ do {
+ if (likely(!(t->flags & PF_EXITING))) {
+ ticks = cputime_add(prof_ticks(t), left);
+ if (cputime_eq(t->it_prof_expires,
+ cputime_zero) ||
+ cputime_gt(t->it_prof_expires, ticks)) {
+ t->it_prof_expires = ticks;
+ }
+ }
+ t = next_thread(t);
+ } while (t != p);
+ break;
+ case CPUCLOCK_VIRT:
+ left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
+ nthreads);
+ do {
+ if (likely(!(t->flags & PF_EXITING))) {
+ ticks = cputime_add(virt_ticks(t), left);
+ if (cputime_eq(t->it_virt_expires,
+ cputime_zero) ||
+ cputime_gt(t->it_virt_expires, ticks)) {
+ t->it_virt_expires = ticks;
+ }
+ }
+ t = next_thread(t);
+ } while (t != p);
+ break;
+ case CPUCLOCK_SCHED:
+ nsleft = expires.sched - val.sched;
+ do_div(nsleft, nthreads);
+ nsleft = max_t(unsigned long long, nsleft, 1);
+ do {
+ if (likely(!(t->flags & PF_EXITING))) {
+ ns = t->se.sum_exec_runtime + nsleft;
+ if (t->it_sched_expires == 0 ||
+ t->it_sched_expires > ns) {
+ t->it_sched_expires = ns;
+ }
+ }
+ t = next_thread(t);
+ } while (t != p);
+ break;
+ }
}

static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -593,32 +608,29 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
default:
BUG();
case CPUCLOCK_PROF:
- if (cputime_eq(p->cputime_expires.prof_exp,
+ if (cputime_eq(p->it_prof_expires,
cputime_zero) ||
- cputime_gt(p->cputime_expires.prof_exp,
+ cputime_gt(p->it_prof_expires,
nt->expires.cpu))
- p->cputime_expires.prof_exp =
- nt->expires.cpu;
+ p->it_prof_expires = nt->expires.cpu;
break;
case CPUCLOCK_VIRT:
- if (cputime_eq(p->cputime_expires.virt_exp,
+ if (cputime_eq(p->it_virt_expires,
cputime_zero) ||
- cputime_gt(p->cputime_expires.virt_exp,
+ cputime_gt(p->it_virt_expires,
nt->expires.cpu))
- p->cputime_expires.virt_exp =
- nt->expires.cpu;
+ p->it_virt_expires = nt->expires.cpu;
break;
case CPUCLOCK_SCHED:
- if (p->cputime_expires.sched_exp == 0 ||
- p->cputime_expires.sched_exp >
- nt->expires.sched)
- p->cputime_expires.sched_exp =
- nt->expires.sched;
+ if (p->it_sched_expires == 0 ||
+ p->it_sched_expires > nt->expires.sched)
+ p->it_sched_expires = nt->expires.sched;
break;
}
} else {
/*
- * For a process timer, set the cached expiration time.
+ * For a process timer, we must balance
+ * all the live threads' expirations.
*/
switch (CPUCLOCK_WHICH(timer->it_clock)) {
default:
@@ -629,9 +641,7 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
cputime_lt(p->signal->it_virt_expires,
timer->it.cpu.expires.cpu))
break;
- p->signal->cputime_expires.virt_exp =
- timer->it.cpu.expires.cpu;
- break;
+ goto rebalance;
case CPUCLOCK_PROF:
if (!cputime_eq(p->signal->it_prof_expires,
cputime_zero) &&
@@ -642,12 +652,13 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
if (i != RLIM_INFINITY &&
i <= cputime_to_secs(timer->it.cpu.expires.cpu))
break;
- p->signal->cputime_expires.prof_exp =
- timer->it.cpu.expires.cpu;
- break;
+ goto rebalance;
case CPUCLOCK_SCHED:
- p->signal->cputime_expires.sched_exp =
- timer->it.cpu.expires.sched;
+ rebalance:
+ process_timer_rebalance(
+ timer->it.cpu.task,
+ CPUCLOCK_WHICH(timer->it_clock),
+ timer->it.cpu.expires, now);
break;
}
}
@@ -958,13 +969,13 @@ static void check_thread_timers(struct task_struct *tsk,
struct signal_struct *const sig = tsk->signal;

maxfire = 20;
- tsk->cputime_expires.prof_exp = cputime_zero;
+ tsk->it_prof_expires = cputime_zero;
while (!list_empty(timers)) {
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
entry);
if (!--maxfire || cputime_lt(prof_ticks(tsk), t->expires.cpu)) {
- tsk->cputime_expires.prof_exp = t->expires.cpu;
+ tsk->it_prof_expires = t->expires.cpu;
break;
}
t->firing = 1;
@@ -973,13 +984,13 @@ static void check_thread_timers(struct task_struct *tsk,

++timers;
maxfire = 20;
- tsk->cputime_expires.virt_exp = cputime_zero;
+ tsk->it_virt_expires = cputime_zero;
while (!list_empty(timers)) {
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
entry);
if (!--maxfire || cputime_lt(virt_ticks(tsk), t->expires.cpu)) {
- tsk->cputime_expires.virt_exp = t->expires.cpu;
+ tsk->it_virt_expires = t->expires.cpu;
break;
}
t->firing = 1;
@@ -988,13 +999,13 @@ static void check_thread_timers(struct task_struct *tsk,

++timers;
maxfire = 20;
- tsk->cputime_expires.sched_exp = 0;
+ tsk->it_sched_expires = 0;
while (!list_empty(timers)) {
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
entry);
if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
- tsk->cputime_expires.sched_exp = t->expires.sched;
+ tsk->it_sched_expires = t->expires.sched;
break;
}
t->firing = 1;
@@ -1044,10 +1055,10 @@ static void check_process_timers(struct task_struct *tsk,
{
int maxfire;
struct signal_struct *const sig = tsk->signal;
- cputime_t utime, ptime, virt_expires, prof_expires;
+ cputime_t utime, stime, ptime, virt_expires, prof_expires;
unsigned long long sum_sched_runtime, sched_expires;
+ struct task_struct *t;
struct list_head *timers = sig->cpu_timers;
- struct task_cputime cputime;

/*
* Don't sample the current process CPU clocks if there are no timers.
@@ -1063,10 +1074,18 @@ static void check_process_timers(struct task_struct *tsk,
/*
* Collect the current process totals.
*/
- thread_group_cputime(tsk, &cputime);
- utime = cputime.utime;
- ptime = cputime_add(utime, cputime.stime);
- sum_sched_runtime = cputime.sum_exec_runtime;
+ utime = sig->utime;
+ stime = sig->stime;
+ sum_sched_runtime = sig->sum_sched_runtime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ sum_sched_runtime += t->se.sum_exec_runtime;
+ t = next_thread(t);
+ } while (t != tsk);
+ ptime = cputime_add(utime, stime);
+
maxfire = 20;
prof_expires = cputime_zero;
while (!list_empty(timers)) {
@@ -1174,18 +1193,60 @@ static void check_process_timers(struct task_struct *tsk,
}
}

- if (!cputime_eq(prof_expires, cputime_zero) &&
- (cputime_eq(sig->cputime_expires.prof_exp, cputime_zero) ||
- cputime_gt(sig->cputime_expires.prof_exp, prof_expires)))
- sig->cputime_expires.prof_exp = prof_expires;
- if (!cputime_eq(virt_expires, cputime_zero) &&
- (cputime_eq(sig->cputime_expires.virt_exp, cputime_zero) ||
- cputime_gt(sig->cputime_expires.virt_exp, virt_expires)))
- sig->cputime_expires.virt_exp = virt_expires;
- if (sched_expires != 0 &&
- (sig->cputime_expires.sched_exp == 0 ||
- sig->cputime_expires.sched_exp > sched_expires))
- sig->cputime_expires.sched_exp = sched_expires;
+ if (!cputime_eq(prof_expires, cputime_zero) ||
+ !cputime_eq(virt_expires, cputime_zero) ||
+ sched_expires != 0) {
+ /*
+ * Rebalance the threads' expiry times for the remaining
+ * process CPU timers.
+ */
+
+ cputime_t prof_left, virt_left, ticks;
+ unsigned long long sched_left, sched;
+ const unsigned int nthreads = atomic_read(&sig->live);
+
+ if (!nthreads)
+ return;
+
+ prof_left = cputime_sub(prof_expires, utime);
+ prof_left = cputime_sub(prof_left, stime);
+ prof_left = cputime_div_non_zero(prof_left, nthreads);
+ virt_left = cputime_sub(virt_expires, utime);
+ virt_left = cputime_div_non_zero(virt_left, nthreads);
+ if (sched_expires) {
+ sched_left = sched_expires - sum_sched_runtime;
+ do_div(sched_left, nthreads);
+ sched_left = max_t(unsigned long long, sched_left, 1);
+ } else {
+ sched_left = 0;
+ }
+ t = tsk;
+ do {
+ if (unlikely(t->flags & PF_EXITING))
+ continue;
+
+ ticks = cputime_add(cputime_add(t->utime, t->stime),
+ prof_left);
+ if (!cputime_eq(prof_expires, cputime_zero) &&
+ (cputime_eq(t->it_prof_expires, cputime_zero) ||
+ cputime_gt(t->it_prof_expires, ticks))) {
+ t->it_prof_expires = ticks;
+ }
+
+ ticks = cputime_add(t->utime, virt_left);
+ if (!cputime_eq(virt_expires, cputime_zero) &&
+ (cputime_eq(t->it_virt_expires, cputime_zero) ||
+ cputime_gt(t->it_virt_expires, ticks))) {
+ t->it_virt_expires = ticks;
+ }
+
+ sched = t->se.sum_exec_runtime + sched_left;
+ if (sched_expires && (t->it_sched_expires == 0 ||
+ t->it_sched_expires > sched)) {
+ t->it_sched_expires = sched;
+ }
+ } while ((t = next_thread(t)) != tsk);
+ }
}

/*
@@ -1253,86 +1314,6 @@ out:
++timer->it_requeue_pending;
}

-/**
- * task_cputime_zero - Check a task_cputime struct for all zero fields.
- *
- * @cputime: The struct to compare.
- *
- * Checks @cputime to see if all fields are zero. Returns true if all fields
- * are zero, false if any field is nonzero.
- */
-static inline int task_cputime_zero(const struct task_cputime *cputime)
-{
- if (cputime_eq(cputime->utime, cputime_zero) &&
- cputime_eq(cputime->stime, cputime_zero) &&
- cputime->sum_exec_runtime == 0)
- return 1;
- return 0;
-}
-
-/**
- * task_cputime_expired - Compare two task_cputime entities.
- *
- * @sample: The task_cputime structure to be checked for expiration.
- * @expires: Expiration times, against which @sample will be checked.
- *
- * Checks @sample against @expires to see if any field of @sample has expired.
- * Returns true if any field of the former is greater than the corresponding
- * field of the latter if the latter field is set. Otherwise returns false.
- */
-static inline int task_cputime_expired(const struct task_cputime *sample,
- const struct task_cputime *expires)
-{
- if (!cputime_eq(expires->utime, cputime_zero) &&
- cputime_ge(sample->utime, expires->utime))
- return 1;
- if (!cputime_eq(expires->stime, cputime_zero) &&
- cputime_ge(cputime_add(sample->utime, sample->stime),
- expires->stime))
- return 1;
- if (expires->sum_exec_runtime != 0 &&
- sample->sum_exec_runtime >= expires->sum_exec_runtime)
- return 1;
- return 0;
-}
-
-/**
- * fastpath_timer_check - POSIX CPU timers fast path.
- *
- * @tsk: The task (thread) being checked.
- *
- * Check the task and thread group timers. If both are zero (there are no
- * timers set) return false. Otherwise snapshot the task and thread group
- * timers and compare them with the corresponding expiration times. Return
- * true if a timer has expired, else return false.
- */
-static inline int fastpath_timer_check(struct task_struct *tsk)
-{
- struct signal_struct *sig = tsk->signal;
-
- if (unlikely(!sig))
- return 0;
-
- if (!task_cputime_zero(&tsk->cputime_expires)) {
- struct task_cputime task_sample = {
- .utime = tsk->utime,
- .stime = tsk->stime,
- .sum_exec_runtime = tsk->se.sum_exec_runtime
- };
-
- if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
- return 1;
- }
- if (!task_cputime_zero(&sig->cputime_expires)) {
- struct task_cputime group_sample;
-
- thread_group_cputime(tsk, &group_sample);
- if (task_cputime_expired(&group_sample, &sig->cputime_expires))
- return 1;
- }
- return 0;
-}
-
/*
* This is called from the timer interrupt handler. The irq handler has
* already updated our counts. We need to check if any timers fire now.
@@ -1345,31 +1326,42 @@ void run_posix_cpu_timers(struct task_struct *tsk)

BUG_ON(!irqs_disabled());

- /*
- * The fast path checks that there are no expired thread or thread
- * group timers. If that's so, just return.
- */
- if (!fastpath_timer_check(tsk))
+#define UNEXPIRED(clock) \
+ (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
+ cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
+
+ if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
+ (tsk->it_sched_expires == 0 ||
+ tsk->se.sum_exec_runtime < tsk->it_sched_expires))
return;

- spin_lock(&tsk->sighand->siglock);
- /*
- * Here we take off tsk->signal->cpu_timers[N] and
- * tsk->cpu_timers[N] all the timers that are firing, and
- * put them on the firing list.
- */
- check_thread_timers(tsk, &firing);
- check_process_timers(tsk, &firing);
+#undef UNEXPIRED

/*
- * We must release these locks before taking any timer's lock.
- * There is a potential race with timer deletion here, as the
- * siglock now protects our private firing list. We have set
- * the firing flag in each timer, so that a deletion attempt
- * that gets the timer lock before we do will give it up and
- * spin until we've taken care of that timer below.
+ * Double-check with locks held.
*/
- spin_unlock(&tsk->sighand->siglock);
+ read_lock(&tasklist_lock);
+ if (likely(tsk->signal != NULL)) {
+ spin_lock(&tsk->sighand->siglock);
+
+ /*
+ * Here we take off tsk->cpu_timers[N] and tsk->signal->cpu_timers[N]
+ * all the timers that are firing, and put them on the firing list.
+ */
+ check_thread_timers(tsk, &firing);
+ check_process_timers(tsk, &firing);
+
+ /*
+ * We must release these locks before taking any timer's lock.
+ * There is a potential race with timer deletion here, as the
+ * siglock now protects our private firing list. We have set
+ * the firing flag in each timer, so that a deletion attempt
+ * that gets the timer lock before we do will give it up and
+ * spin until we've taken care of that timer below.
+ */
+ spin_unlock(&tsk->sighand->siglock);
+ }
+ read_unlock(&tasklist_lock);

/*
* Now that all the timers on our list have the firing flag,
@@ -1397,9 +1389,10 @@ void run_posix_cpu_timers(struct task_struct *tsk)

/*
* Set one of the process-wide special case CPU timers.
- * The tsk->sighand->siglock must be held by the caller.
- * The *newval argument is relative and we update it to be absolute, *oldval
- * is absolute and we update it to be relative.
+ * The tasklist_lock and tsk->sighand->siglock must be held by the caller.
+ * The oldval argument is null for the RLIMIT_CPU timer, where *newval is
+ * absolute; non-null for ITIMER_*, where *newval is relative and we update
+ * it to be absolute, *oldval is absolute and we update it to be relative.
*/
void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
cputime_t *newval, cputime_t *oldval)
@@ -1408,7 +1401,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
struct list_head *head;

BUG_ON(clock_idx == CPUCLOCK_SCHED);
- cpu_clock_sample_group(clock_idx, tsk, &now);
+ cpu_clock_sample_group_locked(clock_idx, tsk, &now);

if (oldval) {
if (!cputime_eq(*oldval, cputime_zero)) {
@@ -1442,14 +1435,13 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
cputime_ge(list_first_entry(head,
struct cpu_timer_list, entry)->expires.cpu,
*newval)) {
- switch (clock_idx) {
- case CPUCLOCK_PROF:
- tsk->signal->cputime_expires.prof_exp = *newval;
- break;
- case CPUCLOCK_VIRT:
- tsk->signal->cputime_expires.virt_exp = *newval;
- break;
- }
+ /*
+ * Rejigger each thread's expiry time so that one will
+ * notice before we hit the process-cumulative expiry time.
+ */
+ union cpu_time_count expires = { .sched = 0 };
+ expires.cpu = *newval;
+ process_timer_rebalance(tsk, clock_idx, expires, now);
}
}

diff --git a/kernel/sched.c b/kernel/sched.c
index 9d50bd4..70f98c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4033,26 +4033,23 @@ DEFINE_PER_CPU(struct kernel_stat, kstat);
EXPORT_PER_CPU_SYMBOL(kstat);

/*
- * Return any ns on the sched_clock that have not yet been banked in
- * @p in case that task is currently running.
+ * Return p->sum_exec_runtime plus any more ns on the sched_clock
+ * that have not yet been banked in case the task is currently running.
*/
-unsigned long long task_delta_exec(struct task_struct *p)
+unsigned long long task_sched_runtime(struct task_struct *p)
{
unsigned long flags;
+ u64 ns, delta_exec;
struct rq *rq;
- u64 ns = 0;

rq = task_rq_lock(p, &flags);
-
+ ns = p->se.sum_exec_runtime;
if (task_current(rq, p)) {
- u64 delta_exec;
-
update_rq_clock(rq);
delta_exec = rq->clock - p->se.exec_start;
if ((s64)delta_exec > 0)
- ns = delta_exec;
+ ns += delta_exec;
}
-
task_rq_unlock(rq, &flags);

return ns;
@@ -4069,7 +4066,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
cputime64_t tmp;

p->utime = cputime_add(p->utime, cputime);
- account_group_user_time(p, cputime);

/* Add user time to cpustat. */
tmp = cputime_to_cputime64(cputime);
@@ -4094,7 +4090,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
tmp = cputime_to_cputime64(cputime);

p->utime = cputime_add(p->utime, cputime);
- account_group_user_time(p, cputime);
p->gtime = cputime_add(p->gtime, cputime);

cpustat->user = cputime64_add(cpustat->user, tmp);
@@ -4130,7 +4125,6 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
}

p->stime = cputime_add(p->stime, cputime);
- account_group_system_time(p, cputime);

/* Add system time to cpustat. */
tmp = cputime_to_cputime64(cputime);
@@ -4172,7 +4166,6 @@ void account_steal_time(struct task_struct *p, cputime_t steal)

if (p == rq->idle) {
p->stime = cputime_add(p->stime, steal);
- account_group_system_time(p, steal);
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
else
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 51aa3e1..5781abb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -500,7 +500,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
struct task_struct *curtask = task_of(curr);

cpuacct_charge(curtask, delta_exec);
- account_group_exec_runtime(curtask, delta_exec);
}
}

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index c7963d5..98b1a19 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -526,8 +526,6 @@ static void update_curr_rt(struct rq *rq)
schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
- account_group_exec_runtime(curr, delta_exec);
-
curr->se.exec_start = rq->clock;
cpuacct_charge(curr, delta_exec);

@@ -1460,7 +1458,7 @@ static void watchdog(struct rq *rq, struct task_struct *p)
p->rt.timeout++;
next = DIV_ROUND_UP(min(soft, hard), USEC_PER_SEC/HZ);
if (p->rt.timeout > next)
- p->cputime_expires.sched_exp = p->se.sum_exec_runtime;
+ p->it_sched_expires = p->se.sum_exec_runtime;
}
}

diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index ee71bec..a93ef66 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -277,89 +277,3 @@ sched_info_switch(struct task_struct *prev, struct task_struct *next)
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

-/*
- * The following are functions that support scheduler-internal time accounting.
- * These functions are generally called at the timer tick. None of this depends
- * on CONFIG_SCHEDSTATS.
- */
-
-/**
- * account_group_user_time - Maintain utime for a thread group.
- *
- * @tsk: Pointer to task structure.
- * @cputime: Time value by which to increment the utime field of the
- * thread_group_cputime structure.
- *
- * If thread group time is being maintained, get the structure for the
- * running CPU and update the utime field there.
- */
-static inline void account_group_user_time(struct task_struct *tsk,
- cputime_t cputime)
-{
- struct signal_struct *sig;
-
- sig = tsk->signal;
- if (unlikely(!sig))
- return;
- if (sig->cputime.totals) {
- struct task_cputime *times;
-
- times = per_cpu_ptr(sig->cputime.totals, get_cpu());
- times->utime = cputime_add(times->utime, cputime);
- put_cpu_no_resched();
- }
-}
-
-/**
- * account_group_system_time - Maintain stime for a thread group.
- *
- * @tsk: Pointer to task structure.
- * @cputime: Time value by which to increment the stime field of the
- * thread_group_cputime structure.
- *
- * If thread group time is being maintained, get the structure for the
- * running CPU and update the stime field there.
- */
-static inline void account_group_system_time(struct task_struct *tsk,
- cputime_t cputime)
-{
- struct signal_struct *sig;
-
- sig = tsk->signal;
- if (unlikely(!sig))
- return;
- if (sig->cputime.totals) {
- struct task_cputime *times;
-
- times = per_cpu_ptr(sig->cputime.totals, get_cpu());
- times->stime = cputime_add(times->stime, cputime);
- put_cpu_no_resched();
- }
-}
-
-/**
- * account_group_exec_runtime - Maintain exec runtime for a thread group.
- *
- * @tsk: Pointer to task structure.
- * @ns: Time value by which to increment the sum_exec_runtime field
- * of the thread_group_cputime structure.
- *
- * If thread group time is being maintained, get the structure for the
- * running CPU and update the sum_exec_runtime field there.
- */
-static inline void account_group_exec_runtime(struct task_struct *tsk,
- unsigned long long ns)
-{
- struct signal_struct *sig;
-
- sig = tsk->signal;
- if (unlikely(!sig))
- return;
- if (sig->cputime.totals) {
- struct task_cputime *times;
-
- times = per_cpu_ptr(sig->cputime.totals, get_cpu());
- times->sum_exec_runtime += ns;
- put_cpu_no_resched();
- }
-}
diff --git a/kernel/signal.c b/kernel/signal.c
index 4530fc6..37ce260 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1342,7 +1342,6 @@ int do_notify_parent(struct task_struct *tsk, int sig)
struct siginfo info;
unsigned long flags;
struct sighand_struct *psig;
- struct task_cputime cputime;
int ret = sig;

BUG_ON(sig == -1);
@@ -1373,9 +1372,10 @@ int do_notify_parent(struct task_struct *tsk, int sig)

info.si_uid = tsk->uid;

- thread_group_cputime(tsk, &cputime);
- info.si_utime = cputime_to_jiffies(cputime.utime);
- info.si_stime = cputime_to_jiffies(cputime.stime);
+ info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime,
+ tsk->signal->utime));
+ info.si_stime = cputime_to_clock_t(cputime_add(tsk->stime,
+ tsk->signal->stime));

info.si_status = tsk->exit_code & 0x7f;
if (tsk->exit_code & 0x80)
diff --git a/kernel/sys.c b/kernel/sys.c
index 31deba8..fc71f99 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -853,28 +853,38 @@ asmlinkage long sys_setfsgid(gid_t gid)
return old_fsgid;
}

-void do_sys_times(struct tms *tms)
-{
- struct task_cputime cputime;
- cputime_t cutime, cstime;
-
- spin_lock_irq(&current->sighand->siglock);
- thread_group_cputime(current, &cputime);
- cutime = current->signal->cutime;
- cstime = current->signal->cstime;
- spin_unlock_irq(&current->sighand->siglock);
- tms->tms_utime = cputime_to_clock_t(cputime.utime);
- tms->tms_stime = cputime_to_clock_t(cputime.stime);
- tms->tms_cutime = cputime_to_clock_t(cutime);
- tms->tms_cstime = cputime_to_clock_t(cstime);
-}
-
asmlinkage long sys_times(struct tms __user * tbuf)
{
+ /*
+ * In the SMP world we might just be unlucky and have one of
+ * the times increment as we use it. Since the value is an
+ * atomically safe type this is just fine. Conceptually its
+ * as if the syscall took an instant longer to occur.
+ */
if (tbuf) {
struct tms tmp;
-
- do_sys_times(&tmp);
+ struct task_struct *tsk = current;
+ struct task_struct *t;
+ cputime_t utime, stime, cutime, cstime;
+
+ spin_lock_irq(&tsk->sighand->siglock);
+ utime = tsk->signal->utime;
+ stime = tsk->signal->stime;
+ t = tsk;
+ do {
+ utime = cputime_add(utime, t->utime);
+ stime = cputime_add(stime, t->stime);
+ t = next_thread(t);
+ } while (t != tsk);
+
+ cutime = tsk->signal->cutime;
+ cstime = tsk->signal->cstime;
+ spin_unlock_irq(&tsk->sighand->siglock);
+
+ tmp.tms_utime = cputime_to_clock_t(utime);
+ tmp.tms_stime = cputime_to_clock_t(stime);
+ tmp.tms_cutime = cputime_to_clock_t(cutime);
+ tmp.tms_cstime = cputime_to_clock_t(cstime);
if (copy_to_user(tbuf, &tmp, sizeof(struct tms)))
return -EFAULT;
}
@@ -1439,6 +1449,7 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r
asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
{
struct rlimit new_rlim, *old_rlim;
+ unsigned long it_prof_secs;
int retval;

if (resource >= RLIM_NLIMITS)
@@ -1492,7 +1503,18 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
if (new_rlim.rlim_cur == RLIM_INFINITY)
goto out;

- update_rlimit_cpu(new_rlim.rlim_cur);
+ it_prof_secs = cputime_to_secs(current->signal->it_prof_expires);
+ if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) {
+ unsigned long rlim_cur = new_rlim.rlim_cur;
+ cputime_t cputime;
+
+ cputime = secs_to_cputime(rlim_cur);
+ read_lock(&tasklist_lock);
+ spin_lock_irq(&current->sighand->siglock);
+ set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
+ spin_unlock_irq(&current->sighand->siglock);
+ read_unlock(&tasklist_lock);
+ }
out:
return 0;
}
@@ -1530,8 +1552,11 @@ out:
*
*/

-static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
+static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r,
+ cputime_t *utimep, cputime_t *stimep)
{
+ *utimep = cputime_add(*utimep, t->utime);
+ *stimep = cputime_add(*stimep, t->stime);
r->ru_nvcsw += t->nvcsw;
r->ru_nivcsw += t->nivcsw;
r->ru_minflt += t->min_flt;
@@ -1545,13 +1570,12 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
struct task_struct *t;
unsigned long flags;
cputime_t utime, stime;
- struct task_cputime cputime;

memset((char *) r, 0, sizeof *r);
utime = stime = cputime_zero;

if (who == RUSAGE_THREAD) {
- accumulate_thread_rusage(p, r);
+ accumulate_thread_rusage(p, r, &utime, &stime);
goto out;
}

@@ -1574,9 +1598,8 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
break;

case RUSAGE_SELF:
- thread_group_cputime(p, &cputime);
- utime = cputime_add(utime, cputime.utime);
- stime = cputime_add(stime, cputime.stime);
+ utime = cputime_add(utime, p->signal->utime);
+ stime = cputime_add(stime, p->signal->stime);
r->ru_nvcsw += p->signal->nvcsw;
r->ru_nivcsw += p->signal->nivcsw;
r->ru_minflt += p->signal->min_flt;
@@ -1585,7 +1608,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
r->ru_oublock += p->signal->oublock;
t = p;
do {
- accumulate_thread_rusage(t, r);
+ accumulate_thread_rusage(t, r, &utime, &stime);
t = next_thread(t);
} while (t != p);
break;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index f85597a..d5dd93f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -75,7 +75,6 @@
#include <linux/string.h>
#include <linux/selinux.h>
#include <linux/mutex.h>
-#include <linux/posix-timers.h>

#include "avc.h"
#include "objsec.h"
@@ -2325,7 +2324,13 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
initrlim = init_task.signal->rlim+i;
rlim->rlim_cur = min(rlim->rlim_max, initrlim->rlim_cur);
}
- update_rlimit_cpu(rlim->rlim_cur);
+ if (current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY) {
+ /*
+ * This will cause RLIMIT_CPU calculations
+ * to be refigured.
+ */
+ current->it_prof_expires = jiffies_to_cputime(1);
+ }
}

/* Wake up the parent if it is waiting so that it can

2008-11-06 21:45:10

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH] revert: timers: fix itimer/many thread hang

* Peter Zijlstra <[email protected]> wrote:

> This patch reverts all the itimer/many thread patches:
>
> 7086efe1c1536f6bc160e7d60a9bfd645b91f279
> bb34d92f643086d546b49cef680f6f305ed84414
> 5ce73a4a5a4893a1aa4cdeed1b1a5a6de42c43b6
> 0a8eaa4f9b58759595a1bfe13a1295fdc25ba026
> f06febc96ba8e0af80bcc3eaec0a109e88275fac
>
> Because I think the per-cpu accounting approach is wrong and makes
> things worse for people with a machine that has more than a
> hand-full of CPUs.

hm, the revert is rather large but i guess the best way forward.
Unless we can avoid this loop:

- for_each_possible_cpu(i) {
- tot = per_cpu_ptr(tsk->signal->cputime.totals, i);
- times->utime = cputime_add(times->utime, tot->utime);
- times->stime = cputime_add(times->stime, tot->stime);
- times->sum_exec_runtime += tot->sum_exec_runtime;
- }

we have to revert it.

That loop could _perhaps_ be avoided by maintaining the sums from the
scheduler tick: by just adding the latest delta values for the current
task to times->*time. Plus at exit time cleaning up the remaining
delta. That would be a far smaller patch.

Ingo

2008-11-06 21:55:33

[permalink] [raw]

Subject: Re: [PATCH] revert: timers: fix itimer/many thread hang

On Thu, 6 Nov 2008, Ingo Molnar wrote:

> That loop could _perhaps_ be avoided by maintaining the sums from the
> scheduler tick: by just adding the latest delta values for the current
> task to times->*time. Plus at exit time cleaning up the remaining
> delta. That would be a far smaller patch.

That is a similar scheme to the ZVC (see mm/vmstat.c). Peter: Dont you
have an implementation of a ZVC like scheme for you dirty throttling
patchset that may be useful here?

2008-11-06 23:57:31

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Thu, 2008-11-06 at 16:08 +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-06 at 09:03 -0600, Christoph Lameter wrote:
> > On Thu, 6 Nov 2008, Peter Zijlstra wrote:
> >
> > > Also, you just introduced per-cpu allocations for each thread-group,
> > > while Christoph is reworking the per-cpu allocator, with one unfortunate
> > > side-effect - its going to have a limited size pool. Therefore this will
> > > limit the number of thread-groups we can have.
> >
> > Patches exist that implement a dynamically growable percpu pool (using
> > virtual mappings though). If the cost of the additional complexity /
> > overhead is justifiable then we can make the percpu pool dynamically
> > extendable.
>
> Right, but I don't think the patch under consideration will fly anyway,
> doing a for_each_possible_cpu() loop on every tick on all cpus isn't
> really healthy, even for moderate sized machines.

I personally think that you're overstating this. First, the current
implementation walks all threads for each tick, which is simply not
scalable and results in soft lockups with large numbers of threads.
This patch fixes a real bug. Second, this only happens "on every tick"
for processes that have more than one thread _and_ that use posix
interval timers. Roland and I went to some effort to keep loops like
the on you're referring to out of the common paths.

In any event, while this particular implementation may not be optimal,
at least it's _right_. Whatever happened to "make it right, then make
it fast?"
--
Frank Mayhar <[email protected]>
Google, Inc.

2008-11-07 08:36:40

by Ingo Molnar

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

* Frank Mayhar <[email protected]> wrote:

> On Thu, 2008-11-06 at 16:08 +0100, Peter Zijlstra wrote:
> > On Thu, 2008-11-06 at 09:03 -0600, Christoph Lameter wrote:
> > > On Thu, 6 Nov 2008, Peter Zijlstra wrote:
> > >
> > > > Also, you just introduced per-cpu allocations for each thread-group,
> > > > while Christoph is reworking the per-cpu allocator, with one unfortunate
> > > > side-effect - its going to have a limited size pool. Therefore this will
> > > > limit the number of thread-groups we can have.
> > >
> > > Patches exist that implement a dynamically growable percpu pool (using
> > > virtual mappings though). If the cost of the additional complexity /
> > > overhead is justifiable then we can make the percpu pool dynamically
> > > extendable.
> >
> > Right, but I don't think the patch under consideration will fly anyway,
> > doing a for_each_possible_cpu() loop on every tick on all cpus isn't
> > really healthy, even for moderate sized machines.
>
> I personally think that you're overstating this. First, the current
> implementation walks all threads for each tick, which is simply not
> scalable and results in soft lockups with large numbers of threads.
> This patch fixes a real bug. Second, this only happens "on every
> tick" for processes that have more than one thread _and_ that use
> posix interval timers. Roland and I went to some effort to keep
> loops like the on you're referring to out of the common paths.
>
> In any event, while this particular implementation may not be
> optimal, at least it's _right_. Whatever happened to "make it
> right, then make it fast?"

Well, you pushed the lockup to another place: previously we locked up
with enough threads added, now we'll lock up with enough CPUs added.

So ... please get rid of the for-each-cpu loop for good? (Also, the
task-exit race needs to be fixed first i guess, before we worry about
loops.)

Ingo

2008-11-07 10:19:07

[permalink] [raw]

Subject: Re: [PATCH] revert: timers: fix itimer/many thread hang

On Thu, 2008-11-06 at 15:53 -0600, Christoph Lameter wrote:
> On Thu, 6 Nov 2008, Ingo Molnar wrote:
>
> > That loop could _perhaps_ be avoided by maintaining the sums from the
> > scheduler tick: by just adding the latest delta values for the current
> > task to times->*time. Plus at exit time cleaning up the remaining
> > delta. That would be a far smaller patch.
>
> That is a similar scheme to the ZVC (see mm/vmstat.c). Peter: Dont you
> have an implementation of a ZVC like scheme for you dirty throttling
> patchset that may be useful here?

Sure, we can do something similar to that, but mind you, that's an
approximation. Approximations work just fine for vmstats and dirty
balancing, I'm just not sure its something people appreciate wrt
timers :-)

The trick is to only update the sum when a per-cpu delta overflows,
which means the sum will have an error in the order of cpus*delta/2.

2008-11-07 10:28:38

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

(fwiw your email doesn't come across properly, evo refuses to display
them, there's some mangling of headers which makes it think there's an
attachment)

On Thu, 2008-11-06 at 15:52 -0800, Frank Mayhar wrote:
> On Thu, 2008-11-06 at 16:08 +0100, Peter Zijlstra wrote:
> > On Thu, 2008-11-06 at 09:03 -0600, Christoph Lameter wrote:
> > > On Thu, 6 Nov 2008, Peter Zijlstra wrote:
> > >
> > > > Also, you just introduced per-cpu allocations for each thread-group,
> > > > while Christoph is reworking the per-cpu allocator, with one unfortunate
> > > > side-effect - its going to have a limited size pool. Therefore this will
> > > > limit the number of thread-groups we can have.
> > >
> > > Patches exist that implement a dynamically growable percpu pool (using
> > > virtual mappings though). If the cost of the additional complexity /
> > > overhead is justifiable then we can make the percpu pool dynamically
> > > extendable.
> >
> > Right, but I don't think the patch under consideration will fly anyway,
> > doing a for_each_possible_cpu() loop on every tick on all cpus isn't
> > really healthy, even for moderate sized machines.
>
> I personally think that you're overstating this. First, the current
> implementation walks all threads for each tick, which is simply not
> scalable and results in soft lockups with large numbers of threads.
> This patch fixes a real bug. Second, this only happens "on every tick"
> for processes that have more than one thread _and_ that use posix
> interval timers. Roland and I went to some effort to keep loops like
> the on you're referring to out of the common paths.
>
> In any event, while this particular implementation may not be optimal,
> at least it's _right_. Whatever happened to "make it right, then make
> it fast?"

Well, I'm not thinking you did it right ;-)

While I agree that the linear loop is sub-optimal, but it only really
becomes a problem when you have hundreds or thousands of threads in your
application, which I'll argue to be insane anyway.

But with your new scheme it'll be a problem regardless of how many
threads you have, as long as each running application will have at least
2 (not uncommon these days).

Furthermore, the memory requirements for your solution now also scale
with cpus instead of threads, again something not really appreciated.

Therefore I say your solution is worse than the one we had.

You should optimize for the common case, and ensure the worst case
doesn't suck. You did it backwards.

2008-11-07 18:12:16

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Fri, 2008-11-07 at 11:29 +0100, Peter Zijlstra wrote:
> (fwiw your email doesn't come across properly, evo refuses to display
> them, there's some mangling of headers which makes it think there's an
> attachment)

Strange, evolution is what I used. It's what I'm using to write this.

> On Thu, 2008-11-06 at 15:52 -0800, Frank Mayhar wrote:
> Well, I'm not thinking you did it right ;-)

Well you would be wrong, then, wouldn't you? :-)

> While I agree that the linear loop is sub-optimal, but it only really
> becomes a problem when you have hundreds or thousands of threads in your
> application, which I'll argue to be insane anyway.

You might argue that (and a few months ago I would have agreed with you)
but you would, I'm afraid, be wrong. It very much depends on the
application. We ran into the problem when we were running applications
with more than 4500 threads; there are good reasons to have lots-n-lots
of threads having to do with efficient use of resources, which I can't
go into at the moment.

> But with your new scheme it'll be a problem regardless of how many
> threads you have, as long as each running application will have at least
> 2 (not uncommon these days).

If and only if the process on the CPU at the tick is using a POSIX
interval timer.

Further, the former implementation had quadratic (or nearly so)
performance varying by the number of threads in a process (making it
impossible to predict how long the processing will take). This
implementation has linear performance based on the number of CPUs which
is fixed over the life of the running kernel instance.

> Furthermore, the memory requirements for your solution now also scale
> with cpus instead of threads, again something not really appreciated.
>
> Therefore I say your solution is worse than the one we had.

I'm sorry, I can't agree. While I again admit that it may not be
optimal, it's certainly never going to hit a soft lockup until the
numbers of CPUs are far greater than 4096 and even then it's going to be
difficult to make it fail since the algorithm is more efficient.

The simple fact is that with the old implementation it is possible to
wedge the kernel simply by having a single process spawn a large number
of threads with an interval timer running.

> You should optimize for the common case, and ensure the worst case
> doesn't suck. You did it backwards.

You're saying that having four or fewer CPUs isn't the common case?

Oh, and you keep calling it "my" solution. I should really share credit
(or blame) with Roland McGrath, who collaborated with me to write this
thing.
--
Frank Mayhar <[email protected]>
Google, Inc.

2008-11-07 20:26:42

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

OK, so how about going about this differently.

How about we group the threads per cpu, and appoint one the cpu-leader
and collect the per cpu sum in there.

Then, on tick time we sum the cpu-leaders to obtain the total.

This would require finding the cpu-leader for any particular cpu, which
for example, we could do by maintaining a rb-tree ordered on cpu. If you
find an entry, its the leader and you store a pointer to it, if its
empty, insert yourself.

We'd have to update this whenever a task migrates.

When the cpu-leader migrates it can hand off its sum to the cpu-leader
of the target cpu (assuming there is one, otherwise it will again be
cpu-leader).

The advantage is that the memory foot-print scales with nr_tasks and the
runtime cost is min(nr_tasks, nr_cpus) where nr_cpus is limited to the
cpus the process actually runs on, so this takes full advantage of
things like cpusets.

2008-11-10 14:42:30

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Fri, 7 Nov 2008, Peter Zijlstra wrote:

> The advantage is that the memory foot-print scales with nr_tasks and the
> runtime cost is min(nr_tasks, nr_cpus) where nr_cpus is limited to the
> cpus the process actually runs on, so this takes full advantage of
> things like cpusets.

Typically you want threads of a process to spread out as far as possible.
The point of having multiple threads is concurrency after all. So this
will deteriorate in the common cases where you want the full aggregate
processing power of a machine to work on something. Timer processing is
already a latency problem (isnt there some option to switch that off?) and
a solution like this is going to make things worse.

Can we at least somehow make sure that nothing significantly happens in a
timer interrupt on a processor if the thread has not scheduled any events
or not odone any system calls?

2008-11-10 14:43:22

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Mon, 2008-11-10 at 08:38 -0600, Christoph Lameter wrote:
> On Fri, 7 Nov 2008, Peter Zijlstra wrote:
>
> > The advantage is that the memory foot-print scales with nr_tasks and the
> > runtime cost is min(nr_tasks, nr_cpus) where nr_cpus is limited to the
> > cpus the process actually runs on, so this takes full advantage of
> > things like cpusets.
>
> Typically you want threads of a process to spread out as far as possible.
> The point of having multiple threads is concurrency after all. So this
> will deteriorate in the common cases where you want the full aggregate
> processing power of a machine to work on something. Timer processing is
> already a latency problem (isnt there some option to switch that off?) and
> a solution like this is going to make things worse.
>
> Can we at least somehow make sure that nothing significantly happens in a
> timer interrupt on a processor if the thread has not scheduled any events
> or not odone any system calls?

Do threads actually scale that far? I thought mmap_sem contention and
other shared state would render threads basically useless on these very
large machines.

But afaiu this stuff, the per-cpu loop is only done when an itimer is
actually active.

The detail I've not looked at is, if when this itimer is indeed active
and we are running 256 threads of the same application on all cpus do we
then do the per-cpu loop for each tick on each cpu?

2008-11-10 15:42:33

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Mon, 10 Nov 2008, Peter Zijlstra wrote:

> > Can we at least somehow make sure that nothing significantly happens in a
> > timer interrupt on a processor if the thread has not scheduled any events
> > or not odone any system calls?
>
> Do threads actually scale that far? I thought mmap_sem contention and
> other shared state would render threads basically useless on these very
> large machines.

They scale well. The problem is startup when they concurrently allocate
memory. That has been solved with distributing the pte locks. mmap_sem is
taken for read in the fault handler. So you have a wildly bouncing
cacheline at startup time that causes performance issues but no busy
spinning.

> But afaiu this stuff, the per-cpu loop is only done when an itimer is
> actually active.
>
> The detail I've not looked at is, if when this itimer is indeed active
> and we are running 256 threads of the same application on all cpus do we
> then do the per-cpu loop for each tick on each cpu?

I would check how 8p and 16p fare with this? How much potential latency is
added to thread because it is interrupted and the timer interrupt goes
through this loop?

2008-11-10 18:01:39

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Mon, 2008-11-10 at 15:42 +0100, Peter Zijlstra wrote:
> On Mon, 2008-11-10 at 08:38 -0600, Christoph Lameter wrote:
> > Can we at least somehow make sure that nothing significantly happens in a
> > timer interrupt on a processor if the thread has not scheduled any events
> > or not odone any system calls?
> Do threads actually scale that far? I thought mmap_sem contention and
> other shared state would render threads basically useless on these very
> large machines.
>
> But afaiu this stuff, the per-cpu loop is only done when an itimer is
> actually active.

Correct.

> The detail I've not looked at is, if when this itimer is indeed active
> and we are running 256 threads of the same application on all cpus do we
> then do the per-cpu loop for each tick on each cpu?

The answer to this question is, "that depends." You can have an itimer
for a single thread or for the whole thread group. In the former case,
it never happens; it only does the loops for the thread group case. If
there is a thread group itimer then of course we have to sum the tick
count across all CPUs to determine whether the timer has expired.

Personally, I would argue that it's silly to have an itimer running when
you have many threads, and if you care about performance it's even
_more_ silly. But it's sillier yet to be able to wedge the kernel by
running a program in user space.

As far as Christoph's concern regarding latency for 8- and 16-processor
systems, my belief (supported by data I can't discuss, sigh) is that the
loop adds negligible latency. In fact, it can't really be discussed in
this way since the existing implementation adds *lots* of latency when
an itimer is running, since it sums the values across all threads. I
never collected latency versus number of threads data but it's bad
enough that at about 4500 threads (on a dual amd64) it took longer than
a tick to do a tick's worth of processing.
--
Frank Mayhar <[email protected]>
Google, Inc.

2008-11-11 00:20:32

by Ingo Oeser

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Monday 10 November 2008, Peter Zijlstra wrote:
> The detail I've not looked at is, if when this itimer is indeed active
> and we are running 256 threads of the same application on all cpus do we
> then do the per-cpu loop for each tick on each cpu?

Can't this kind of aggregation not be done in a tree-like manner?

I mean these CPUs are not on the same node, right?

Best Regards

Ingo Oeser

2008-11-11 13:59:57

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Tue, 11 Nov 2008, Ingo Oeser wrote:

> I mean these CPUs are not on the same node, right?

Intels roadmap calls for 128 core cpus by 2015....

2008-11-13 16:00:40

by Doug Chapman

[permalink] [raw]

Subject: Re: [PATCH] revert: timers: fix itimer/many thread hang

Peter didn't mention here that the itimer/many thread patches cause a
race-condition panic which is easily reproduced (in fact unavoidable
with SLUB debugging on) on hyperthreaded ia64 systems (but certainly
possible on all systems). Can we get this revert patch pulled in? The
other attempts to fix this so far have been unsuccessful. I have tested
with this revert patch and it does indeed fix that issue.

thanks,

- Doug

On Thu, 2008-11-06 at 17:31 +0100, Peter Zijlstra wrote:
> This patch reverts all the itimer/many thread patches:
>
> 7086efe1c1536f6bc160e7d60a9bfd645b91f279
> bb34d92f643086d546b49cef680f6f305ed84414
> 5ce73a4a5a4893a1aa4cdeed1b1a5a6de42c43b6
> 0a8eaa4f9b58759595a1bfe13a1295fdc25ba026
> f06febc96ba8e0af80bcc3eaec0a109e88275fac
>
> Because I think the per-cpu accounting approach is wrong and makes
> things worse for people with a machine that has more than a hand-full of
> CPUs.
>
> Build and boot tested on my favourite x86_64 config.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8fcfa39..e215906 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1341,15 +1341,20 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
> prstatus->pr_pgrp = task_pgrp_vnr(p);
> prstatus->pr_sid = task_session_vnr(p);
> if (thread_group_leader(p)) {
> - struct task_cputime cputime;
> -
> /*
> - * This is the record for the group leader. It shows the
> - * group-wide total, not its individual thread total.
> + * This is the record for the group leader. Add in the
> + * cumulative times of previous dead threads. This total
> + * won't include the time of each live thread whose state
> + * is included in the core dump. The final total reported
> + * to our parent process when it calls wait4 will include
> + * those sums as well as the little bit more time it takes
> + * this and each other thread to finish dying after the
> + * core dump synchronization phase.
> */
> - thread_group_cputime(p, &cputime);
> - cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
> - cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
> + cputime_to_timeval(cputime_add(p->utime, p->signal->utime),
> + &prstatus->pr_utime);
> + cputime_to_timeval(cputime_add(p->stime, p->signal->stime),
> + &prstatus->pr_stime);
> } else {
> cputime_to_timeval(p->utime, &prstatus->pr_utime);
> cputime_to_timeval(p->stime, &prstatus->pr_stime);
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 6af7fba..efd68c5 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -388,20 +388,20 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
>
> /* add up live thread stats at the group level */
> if (whole) {
> - struct task_cputime cputime;
> struct task_struct *t = task;
> do {
> min_flt += t->min_flt;
> maj_flt += t->maj_flt;
> + utime = cputime_add(utime, task_utime(t));
> + stime = cputime_add(stime, task_stime(t));
> gtime = cputime_add(gtime, task_gtime(t));
> t = next_thread(t);
> } while (t != task);
>
> min_flt += sig->min_flt;
> maj_flt += sig->maj_flt;
> - thread_group_cputime(task, &cputime);
> - utime = cputime.utime;
> - stime = cputime.stime;
> + utime = cputime_add(utime, sig->utime);
> + stime = cputime_add(stime, sig->stime);
> gtime = cputime_add(gtime, sig->gtime);
> }
>
> diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
> index 4a145ca..89b6ecd 100644
> --- a/include/linux/kernel_stat.h
> +++ b/include/linux/kernel_stat.h
> @@ -66,7 +66,6 @@ static inline unsigned int kstat_irqs(unsigned int irq)
> return sum;
> }
>
> -extern unsigned long long task_delta_exec(struct task_struct *);
> extern void account_user_time(struct task_struct *, cputime_t);
> extern void account_user_time_scaled(struct task_struct *, cputime_t);
> extern void account_system_time(struct task_struct *, int, cputime_t);
> diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
> index a7c7213..04c2e43 100644
> --- a/include/linux/posix-timers.h
> +++ b/include/linux/posix-timers.h
> @@ -113,6 +113,4 @@ void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
>
> long clock_nanosleep_restart(struct restart_block *restart_block);
>
> -void update_rlimit_cpu(unsigned long rlim_new);
> -
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc07f9a..a739747 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -433,39 +433,6 @@ struct pacct_struct {
> unsigned long ac_minflt, ac_majflt;
> };
>
> -/**
> - * struct task_cputime - collected CPU time counts
> - * @utime: time spent in user mode, in &cputime_t units
> - * @stime: time spent in kernel mode, in &cputime_t units
> - * @sum_exec_runtime: total time spent on the CPU, in nanoseconds
> - *
> - * This structure groups together three kinds of CPU time that are
> - * tracked for threads and thread groups. Most things considering
> - * CPU time want to group these counts together and treat all three
> - * of them in parallel.
> - */
> -struct task_cputime {
> - cputime_t utime;
> - cputime_t stime;
> - unsigned long long sum_exec_runtime;
> -};
> -/* Alternate field names when used to cache expirations. */
> -#define prof_exp stime
> -#define virt_exp utime
> -#define sched_exp sum_exec_runtime
> -
> -/**
> - * struct thread_group_cputime - thread group interval timer counts
> - * @totals: thread group interval timers; substructure for
> - * uniprocessor kernel, per-cpu for SMP kernel.
> - *
> - * This structure contains the version of task_cputime, above, that is
> - * used for thread group CPU clock calculations.
> - */
> -struct thread_group_cputime {
> - struct task_cputime *totals;
> -};
> -
> /*
> * NOTE! "signal_struct" does not have it's own
> * locking, because a shared signal_struct always
> @@ -511,17 +478,6 @@ struct signal_struct {
> cputime_t it_prof_expires, it_virt_expires;
> cputime_t it_prof_incr, it_virt_incr;
>
> - /*
> - * Thread group totals for process CPU clocks.
> - * See thread_group_cputime(), et al, for details.
> - */
> - struct thread_group_cputime cputime;
> -
> - /* Earliest-expiration cache. */
> - struct task_cputime cputime_expires;
> -
> - struct list_head cpu_timers[3];
> -
> /* job control IDs */
>
> /*
> @@ -552,7 +508,7 @@ struct signal_struct {
> * Live threads maintain their own counters and add to these
> * in __exit_signal, except for the group leader.
> */
> - cputime_t cutime, cstime;
> + cputime_t utime, stime, cutime, cstime;
> cputime_t gtime;
> cputime_t cgtime;
> unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
> @@ -561,6 +517,14 @@ struct signal_struct {
> struct task_io_accounting ioac;
>
> /*
> + * Cumulative ns of scheduled CPU time for dead threads in the
> + * group, not including a zombie group leader. (This only differs
> + * from jiffies_to_ns(utime + stime) if sched_clock uses something
> + * other than jiffies.)
> + */
> + unsigned long long sum_sched_runtime;
> +
> + /*
> * We don't bother to synchronize most readers of this at all,
> * because there is no reader checking a limit that actually needs
> * to get both rlim_cur and rlim_max atomically, and either one
> @@ -571,6 +535,8 @@ struct signal_struct {
> */
> struct rlimit rlim[RLIM_NLIMITS];
>
> + struct list_head cpu_timers[3];
> +
> /* keep the process-shared keyrings here so that they do the right
> * thing in threads created with CLONE_THREAD */
> #ifdef CONFIG_KEYS
> @@ -1176,7 +1142,8 @@ struct task_struct {
> /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
> unsigned long min_flt, maj_flt;
>
> - struct task_cputime cputime_expires;
> + cputime_t it_prof_expires, it_virt_expires;
> + unsigned long long it_sched_expires;
> struct list_head cpu_timers[3];
>
> /* process credentials */
> @@ -1632,7 +1599,6 @@ extern unsigned long long cpu_clock(int cpu);
>
> extern unsigned long long
> task_sched_runtime(struct task_struct *task);
> -extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
>
> /* sched_exec is called by processes performing an exec */
> #ifdef CONFIG_SMP
> @@ -2144,30 +2110,6 @@ static inline int spin_needbreak(spinlock_t *lock)
> }
>
> /*
> - * Thread group CPU time accounting.
> - */
> -
> -extern int thread_group_cputime_alloc(struct task_struct *);
> -extern void thread_group_cputime(struct task_struct *, struct task_cputime *);
> -
> -static inline void thread_group_cputime_init(struct signal_struct *sig)
> -{
> - sig->cputime.totals = NULL;
> -}
> -
> -static inline int thread_group_cputime_clone_thread(struct task_struct *curr)
> -{
> - if (curr->signal->cputime.totals)
> - return 0;
> - return thread_group_cputime_alloc(curr);
> -}
> -
> -static inline void thread_group_cputime_free(struct signal_struct *sig)
> -{
> - free_percpu(sig->cputime.totals);
> -}
> -
> -/*
> * Reevaluate whether the task has signals pending delivery.
> * Wake the task if so.
> * This is required every time the blocked sigset_t changes.
> diff --git a/include/linux/time.h b/include/linux/time.h
> index ce321ac..d2c578d 100644
> --- a/include/linux/time.h
> +++ b/include/linux/time.h
> @@ -132,9 +132,6 @@ extern int timekeeping_valid_for_hres(void);
> extern void update_wall_time(void);
> extern void update_xtime_cache(u64 nsec);
>
> -struct tms;
> -extern void do_sys_times(struct tms *);
> -
> /**
> * timespec_to_ns - Convert timespec to nanoseconds
> * @ts: pointer to the timespec variable to be converted
> diff --git a/kernel/compat.c b/kernel/compat.c
> index 8eafe3e..143990e 100644
> --- a/kernel/compat.c
> +++ b/kernel/compat.c
> @@ -23,7 +23,6 @@
> #include <linux/timex.h>
> #include <linux/migrate.h>
> #include <linux/posix-timers.h>
> -#include <linux/times.h>
>
> #include <asm/uaccess.h>
>
> @@ -209,23 +208,49 @@ asmlinkage long compat_sys_setitimer(int which,
> return 0;
> }
>
> -static compat_clock_t clock_t_to_compat_clock_t(clock_t x)
> -{
> - return compat_jiffies_to_clock_t(clock_t_to_jiffies(x));
> -}
> -
> asmlinkage long compat_sys_times(struct compat_tms __user *tbuf)
> {
> + /*
> + * In the SMP world we might just be unlucky and have one of
> + * the times increment as we use it. Since the value is an
> + * atomically safe type this is just fine. Conceptually its
> + * as if the syscall took an instant longer to occur.
> + */
> if (tbuf) {
> - struct tms tms;
> struct compat_tms tmp;
> -
> - do_sys_times(&tms);
> - /* Convert our struct tms to the compat version. */
> - tmp.tms_utime = clock_t_to_compat_clock_t(tms.tms_utime);
> - tmp.tms_stime = clock_t_to_compat_clock_t(tms.tms_stime);
> - tmp.tms_cutime = clock_t_to_compat_clock_t(tms.tms_cutime);
> - tmp.tms_cstime = clock_t_to_compat_clock_t(tms.tms_cstime);
> + struct task_struct *tsk = current;
> + struct task_struct *t;
> + cputime_t utime, stime, cutime, cstime;
> +
> + read_lock(&tasklist_lock);
> + utime = tsk->signal->utime;
> + stime = tsk->signal->stime;
> + t = tsk;
> + do {
> + utime = cputime_add(utime, t->utime);
> + stime = cputime_add(stime, t->stime);
> + t = next_thread(t);
> + } while (t != tsk);
> +
> + /*
> + * While we have tasklist_lock read-locked, no dying thread
> + * can be updating current->signal->[us]time. Instead,
> + * we got their counts included in the live thread loop.
> + * However, another thread can come in right now and
> + * do a wait call that updates current->signal->c[us]time.
> + * To make sure we always see that pair updated atomically,
> + * we take the siglock around fetching them.
> + */
> + spin_lock_irq(&tsk->sighand->siglock);
> + cutime = tsk->signal->cutime;
> + cstime = tsk->signal->cstime;
> + spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> +
> + tmp.tms_utime = compat_jiffies_to_clock_t(cputime_to_jiffies(utime));
> + tmp.tms_stime = compat_jiffies_to_clock_t(cputime_to_jiffies(stime));
> + tmp.tms_cutime = compat_jiffies_to_clock_t(cputime_to_jiffies(cutime));
> + tmp.tms_cstime = compat_jiffies_to_clock_t(cputime_to_jiffies(cstime));
> if (copy_to_user(tbuf, &tmp, sizeof(tmp)))
> return -EFAULT;
> }
> diff --git a/kernel/exit.c b/kernel/exit.c
> index b361006..9d2f87b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -113,6 +113,8 @@ static void __exit_signal(struct task_struct *tsk)
> * We won't ever get here for the group leader, since it
> * will have been the last reference on the signal_struct.
> */
> + sig->utime = cputime_add(sig->utime, task_utime(tsk));
> + sig->stime = cputime_add(sig->stime, task_stime(tsk));
> sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
> sig->min_flt += tsk->min_flt;
> sig->maj_flt += tsk->maj_flt;
> @@ -121,6 +123,7 @@ static void __exit_signal(struct task_struct *tsk)
> sig->inblock += task_io_get_inblock(tsk);
> sig->oublock += task_io_get_oublock(tsk);
> task_io_accounting_add(&sig->ioac, &tsk->ioac);
> + sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
> sig = NULL; /* Marker for below. */
> }
>
> @@ -1301,7 +1304,6 @@ static int wait_task_zombie(struct task_struct *p, int options,
> if (likely(!traced)) {
> struct signal_struct *psig;
> struct signal_struct *sig;
> - struct task_cputime cputime;
>
> /*
> * The resource counters for the group leader are in its
> @@ -1317,23 +1319,20 @@ static int wait_task_zombie(struct task_struct *p, int options,
> * need to protect the access to p->parent->signal fields,
> * as other threads in the parent group can be right
> * here reaping other children at the same time.
> - *
> - * We use thread_group_cputime() to get times for the thread
> - * group, which consolidates times for all threads in the
> - * group including the group leader.
> */
> spin_lock_irq(&p->parent->sighand->siglock);
> psig = p->parent->signal;
> sig = p->signal;
> - thread_group_cputime(p, &cputime);
> psig->cutime =
> cputime_add(psig->cutime,
> - cputime_add(cputime.utime,
> - sig->cutime));
> + cputime_add(p->utime,
> + cputime_add(sig->utime,
> + sig->cutime)));
> psig->cstime =
> cputime_add(psig->cstime,
> - cputime_add(cputime.stime,
> - sig->cstime));
> + cputime_add(p->stime,
> + cputime_add(sig->stime,
> + sig->cstime)));
> psig->cgtime =
> cputime_add(psig->cgtime,
> cputime_add(p->gtime,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4b964d7..1e13d05 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -765,44 +765,15 @@ void __cleanup_sighand(struct sighand_struct *sighand)
> kmem_cache_free(sighand_cachep, sighand);
> }
>
> -
> -/*
> - * Initialize POSIX timer handling for a thread group.
> - */
> -static void posix_cpu_timers_init_group(struct signal_struct *sig)
> -{
> - /* Thread group counters. */
> - thread_group_cputime_init(sig);
> -
> - /* Expiration times and increments. */
> - sig->it_virt_expires = cputime_zero;
> - sig->it_virt_incr = cputime_zero;
> - sig->it_prof_expires = cputime_zero;
> - sig->it_prof_incr = cputime_zero;
> -
> - /* Cached expiration times. */
> - sig->cputime_expires.prof_exp = cputime_zero;
> - sig->cputime_expires.virt_exp = cputime_zero;
> - sig->cputime_expires.sched_exp = 0;
> -
> - /* The timer lists. */
> - INIT_LIST_HEAD(&sig->cpu_timers[0]);
> - INIT_LIST_HEAD(&sig->cpu_timers[1]);
> - INIT_LIST_HEAD(&sig->cpu_timers[2]);
> -}
> -
> static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> {
> struct signal_struct *sig;
> int ret;
>
> if (clone_flags & CLONE_THREAD) {
> - ret = thread_group_cputime_clone_thread(current);
> - if (likely(!ret)) {
> - atomic_inc(&current->signal->count);
> - atomic_inc(&current->signal->live);
> - }
> - return ret;
> + atomic_inc(&current->signal->count);
> + atomic_inc(&current->signal->live);
> + return 0;
> }
> sig = kmem_cache_alloc(signal_cachep, GFP_KERNEL);
> tsk->signal = sig;
> @@ -830,25 +801,39 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> sig->it_real_incr.tv64 = 0;
> sig->real_timer.function = it_real_fn;
>
> + sig->it_virt_expires = cputime_zero;
> + sig->it_virt_incr = cputime_zero;
> + sig->it_prof_expires = cputime_zero;
> + sig->it_prof_incr = cputime_zero;
> +
> sig->leader = 0; /* session leadership doesn't inherit */
> sig->tty_old_pgrp = NULL;
> sig->tty = NULL;
>
> - sig->cutime = sig->cstime = cputime_zero;
> + sig->utime = sig->stime = sig->cutime = sig->cstime = cputime_zero;
> sig->gtime = cputime_zero;
> sig->cgtime = cputime_zero;
> sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0;
> sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0;
> sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0;
> task_io_accounting_init(&sig->ioac);
> + INIT_LIST_HEAD(&sig->cpu_timers[0]);
> + INIT_LIST_HEAD(&sig->cpu_timers[1]);
> + INIT_LIST_HEAD(&sig->cpu_timers[2]);
> taskstats_tgid_init(sig);
>
> task_lock(current->group_leader);
> memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
> task_unlock(current->group_leader);
>
> - posix_cpu_timers_init_group(sig);
> -
> + if (sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY) {
> + /*
> + * New sole thread in the process gets an expiry time
> + * of the whole CPU time limit.
> + */
> + tsk->it_prof_expires =
> + secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
> + }
> acct_init_pacct(&sig->pacct);
>
> tty_audit_fork(sig);
> @@ -858,7 +843,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>
> void __cleanup_signal(struct signal_struct *sig)
> {
> - thread_group_cputime_free(sig);
> exit_thread_group_keys(sig);
> tty_kref_put(sig->tty);
> kmem_cache_free(signal_cachep, sig);
> @@ -909,19 +893,6 @@ void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> #endif /* CONFIG_MM_OWNER */
>
> /*
> - * Initialize POSIX timer handling for a single task.
> - */
> -static void posix_cpu_timers_init(struct task_struct *tsk)
> -{
> - tsk->cputime_expires.prof_exp = cputime_zero;
> - tsk->cputime_expires.virt_exp = cputime_zero;
> - tsk->cputime_expires.sched_exp = 0;
> - INIT_LIST_HEAD(&tsk->cpu_timers[0]);
> - INIT_LIST_HEAD(&tsk->cpu_timers[1]);
> - INIT_LIST_HEAD(&tsk->cpu_timers[2]);
> -}
> -
> -/*
> * This creates a new process as a copy of the old one,
> * but does not actually start it yet.
> *
> @@ -1033,7 +1004,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> task_io_accounting_init(&p->ioac);
> acct_clear_integrals(p);
>
> - posix_cpu_timers_init(p);
> + p->it_virt_expires = cputime_zero;
> + p->it_prof_expires = cputime_zero;
> + p->it_sched_expires = 0;
> + INIT_LIST_HEAD(&p->cpu_timers[0]);
> + INIT_LIST_HEAD(&p->cpu_timers[1]);
> + INIT_LIST_HEAD(&p->cpu_timers[2]);
>
> p->lock_depth = -1; /* -1 = no lock */
> do_posix_clock_monotonic_gettime(&p->start_time);
> @@ -1234,6 +1210,21 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> if (clone_flags & CLONE_THREAD) {
> p->group_leader = current->group_leader;
> list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
> +
> + if (!cputime_eq(current->signal->it_virt_expires,
> + cputime_zero) ||
> + !cputime_eq(current->signal->it_prof_expires,
> + cputime_zero) ||
> + current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY ||
> + !list_empty(&current->signal->cpu_timers[0]) ||
> + !list_empty(&current->signal->cpu_timers[1]) ||
> + !list_empty(&current->signal->cpu_timers[2])) {
> + /*
> + * Have child wake up on its first tick to check
> + * for process CPU timers.
> + */
> + p->it_prof_expires = jiffies_to_cputime(1);
> + }
> }
>
> if (likely(p->pid)) {
> diff --git a/kernel/itimer.c b/kernel/itimer.c
> index db7c358..ab98274 100644
> --- a/kernel/itimer.c
> +++ b/kernel/itimer.c
> @@ -55,15 +55,17 @@ int do_getitimer(int which, struct itimerval *value)
> spin_unlock_irq(&tsk->sighand->siglock);
> break;
> case ITIMER_VIRTUAL:
> + read_lock(&tasklist_lock);
> spin_lock_irq(&tsk->sighand->siglock);
> cval = tsk->signal->it_virt_expires;
> cinterval = tsk->signal->it_virt_incr;
> if (!cputime_eq(cval, cputime_zero)) {
> - struct task_cputime cputime;
> - cputime_t utime;
> -
> - thread_group_cputime(tsk, &cputime);
> - utime = cputime.utime;
> + struct task_struct *t = tsk;
> + cputime_t utime = tsk->signal->utime;
> + do {
> + utime = cputime_add(utime, t->utime);
> + t = next_thread(t);
> + } while (t != tsk);
> if (cputime_le(cval, utime)) { /* about to fire */
> cval = jiffies_to_cputime(1);
> } else {
> @@ -71,19 +73,25 @@ int do_getitimer(int which, struct itimerval *value)
> }
> }
> spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> cputime_to_timeval(cval, &value->it_value);
> cputime_to_timeval(cinterval, &value->it_interval);
> break;
> case ITIMER_PROF:
> + read_lock(&tasklist_lock);
> spin_lock_irq(&tsk->sighand->siglock);
> cval = tsk->signal->it_prof_expires;
> cinterval = tsk->signal->it_prof_incr;
> if (!cputime_eq(cval, cputime_zero)) {
> - struct task_cputime times;
> - cputime_t ptime;
> -
> - thread_group_cputime(tsk, &times);
> - ptime = cputime_add(times.utime, times.stime);
> + struct task_struct *t = tsk;
> + cputime_t ptime = cputime_add(tsk->signal->utime,
> + tsk->signal->stime);
> + do {
> + ptime = cputime_add(ptime,
> + cputime_add(t->utime,
> + t->stime));
> + t = next_thread(t);
> + } while (t != tsk);
> if (cputime_le(cval, ptime)) { /* about to fire */
> cval = jiffies_to_cputime(1);
> } else {
> @@ -91,6 +99,7 @@ int do_getitimer(int which, struct itimerval *value)
> }
> }
> spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> cputime_to_timeval(cval, &value->it_value);
> cputime_to_timeval(cinterval, &value->it_interval);
> break;
> @@ -176,6 +185,7 @@ again:
> case ITIMER_VIRTUAL:
> nval = timeval_to_cputime(&value->it_value);
> ninterval = timeval_to_cputime(&value->it_interval);
> + read_lock(&tasklist_lock);
> spin_lock_irq(&tsk->sighand->siglock);
> cval = tsk->signal->it_virt_expires;
> cinterval = tsk->signal->it_virt_incr;
> @@ -190,6 +200,7 @@ again:
> tsk->signal->it_virt_expires = nval;
> tsk->signal->it_virt_incr = ninterval;
> spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> if (ovalue) {
> cputime_to_timeval(cval, &ovalue->it_value);
> cputime_to_timeval(cinterval, &ovalue->it_interval);
> @@ -198,6 +209,7 @@ again:
> case ITIMER_PROF:
> nval = timeval_to_cputime(&value->it_value);
> ninterval = timeval_to_cputime(&value->it_interval);
> + read_lock(&tasklist_lock);
> spin_lock_irq(&tsk->sighand->siglock);
> cval = tsk->signal->it_prof_expires;
> cinterval = tsk->signal->it_prof_incr;
> @@ -212,6 +224,7 @@ again:
> tsk->signal->it_prof_expires = nval;
> tsk->signal->it_prof_incr = ninterval;
> spin_unlock_irq(&tsk->sighand->siglock);
> + read_unlock(&tasklist_lock);
> if (ovalue) {
> cputime_to_timeval(cval, &ovalue->it_value);
> cputime_to_timeval(cinterval, &ovalue->it_interval);
> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 153dcb2..c42a03a 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -7,93 +7,6 @@
> #include <linux/errno.h>
> #include <linux/math64.h>
> #include <asm/uaccess.h>
> -#include <linux/kernel_stat.h>
> -
> -/*
> - * Allocate the thread_group_cputime structure appropriately and fill in the
> - * current values of the fields. Called from copy_signal() via
> - * thread_group_cputime_clone_thread() when adding a second or subsequent
> - * thread to a thread group. Assumes interrupts are enabled when called.
> - */
> -int thread_group_cputime_alloc(struct task_struct *tsk)
> -{
> - struct signal_struct *sig = tsk->signal;
> - struct task_cputime *cputime;
> -
> - /*
> - * If we have multiple threads and we don't already have a
> - * per-CPU task_cputime struct (checked in the caller), allocate
> - * one and fill it in with the times accumulated so far. We may
> - * race with another thread so recheck after we pick up the sighand
> - * lock.
> - */
> - cputime = alloc_percpu(struct task_cputime);
> - if (cputime == NULL)
> - return -ENOMEM;
> - spin_lock_irq(&tsk->sighand->siglock);
> - if (sig->cputime.totals) {
> - spin_unlock_irq(&tsk->sighand->siglock);
> - free_percpu(cputime);
> - return 0;
> - }
> - sig->cputime.totals = cputime;
> - cputime = per_cpu_ptr(sig->cputime.totals, smp_processor_id());
> - cputime->utime = tsk->utime;
> - cputime->stime = tsk->stime;
> - cputime->sum_exec_runtime = tsk->se.sum_exec_runtime;
> - spin_unlock_irq(&tsk->sighand->siglock);
> - return 0;
> -}
> -
> -/**
> - * thread_group_cputime - Sum the thread group time fields across all CPUs.
> - *
> - * @tsk: The task we use to identify the thread group.
> - * @times: task_cputime structure in which we return the summed fields.
> - *
> - * Walk the list of CPUs to sum the per-CPU time fields in the thread group
> - * time structure.
> - */
> -void thread_group_cputime(
> - struct task_struct *tsk,
> - struct task_cputime *times)
> -{
> - struct signal_struct *sig;
> - int i;
> - struct task_cputime *tot;
> -
> - sig = tsk->signal;
> - if (unlikely(!sig) || !sig->cputime.totals) {
> - times->utime = tsk->utime;
> - times->stime = tsk->stime;
> - times->sum_exec_runtime = tsk->se.sum_exec_runtime;
> - return;
> - }
> - times->stime = times->utime = cputime_zero;
> - times->sum_exec_runtime = 0;
> - for_each_possible_cpu(i) {
> - tot = per_cpu_ptr(tsk->signal->cputime.totals, i);
> - times->utime = cputime_add(times->utime, tot->utime);
> - times->stime = cputime_add(times->stime, tot->stime);
> - times->sum_exec_runtime += tot->sum_exec_runtime;
> - }
> -}
> -
> -/*
> - * Called after updating RLIMIT_CPU to set timer expiration if necessary.
> - */
> -void update_rlimit_cpu(unsigned long rlim_new)
> -{
> - cputime_t cputime;
> -
> - cputime = secs_to_cputime(rlim_new);
> - if (cputime_eq(current->signal->it_prof_expires, cputime_zero) ||
> - cputime_lt(current->signal->it_prof_expires, cputime)) {
> - spin_lock_irq(&current->sighand->siglock);
> - set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
> - spin_unlock_irq(&current->sighand->siglock);
> - }
> -}
>
> static int check_clock(const clockid_t which_clock)
> {
> @@ -245,6 +158,10 @@ static inline cputime_t virt_ticks(struct task_struct *p)
> {
> return p->utime;
> }
> +static inline unsigned long long sched_ns(struct task_struct *p)
> +{
> + return task_sched_runtime(p);
> +}
>
> int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
> {
> @@ -294,7 +211,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
> cpu->cpu = virt_ticks(p);
> break;
> case CPUCLOCK_SCHED:
> - cpu->sched = p->se.sum_exec_runtime + task_delta_exec(p);
> + cpu->sched = sched_ns(p);
> break;
> }
> return 0;
> @@ -303,30 +220,59 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
> /*
> * Sample a process (thread group) clock for the given group_leader task.
> * Must be called with tasklist_lock held for reading.
> + * Must be called with tasklist_lock held for reading, and p->sighand->siglock.
> */
> -static int cpu_clock_sample_group(const clockid_t which_clock,
> - struct task_struct *p,
> - union cpu_time_count *cpu)
> +static int cpu_clock_sample_group_locked(unsigned int clock_idx,
> + struct task_struct *p,
> + union cpu_time_count *cpu)
> {
> - struct task_cputime cputime;
> -
> - thread_group_cputime(p, &cputime);
> - switch (which_clock) {
> + struct task_struct *t = p;
> + switch (clock_idx) {
> default:
> return -EINVAL;
> case CPUCLOCK_PROF:
> - cpu->cpu = cputime_add(cputime.utime, cputime.stime);
> + cpu->cpu = cputime_add(p->signal->utime, p->signal->stime);
> + do {
> + cpu->cpu = cputime_add(cpu->cpu, prof_ticks(t));
> + t = next_thread(t);
> + } while (t != p);
> break;
> case CPUCLOCK_VIRT:
> - cpu->cpu = cputime.utime;
> + cpu->cpu = p->signal->utime;
> + do {
> + cpu->cpu = cputime_add(cpu->cpu, virt_ticks(t));
> + t = next_thread(t);
> + } while (t != p);
> break;
> case CPUCLOCK_SCHED:
> - cpu->sched = cputime.sum_exec_runtime + task_delta_exec(p);
> + cpu->sched = p->signal->sum_sched_runtime;
> + /* Add in each other live thread. */
> + while ((t = next_thread(t)) != p) {
> + cpu->sched += t->se.sum_exec_runtime;
> + }
> + cpu->sched += sched_ns(p);
> break;
> }
> return 0;
> }
>
> +/*
> + * Sample a process (thread group) clock for the given group_leader task.
> + * Must be called with tasklist_lock held for reading.
> + */
> +static int cpu_clock_sample_group(const clockid_t which_clock,
> + struct task_struct *p,
> + union cpu_time_count *cpu)
> +{
> + int ret;
> + unsigned long flags;
> + spin_lock_irqsave(&p->sighand->siglock, flags);
> + ret = cpu_clock_sample_group_locked(CPUCLOCK_WHICH(which_clock), p,
> + cpu);
> + spin_unlock_irqrestore(&p->sighand->siglock, flags);
> + return ret;
> +}
> +
>
> int posix_cpu_clock_get(const clockid_t which_clock, struct timespec *tp)
> {
> @@ -525,11 +471,80 @@ void posix_cpu_timers_exit(struct task_struct *tsk)
> }
> void posix_cpu_timers_exit_group(struct task_struct *tsk)
> {
> - struct task_cputime cputime;
> -
> - thread_group_cputime(tsk, &cputime);
> cleanup_timers(tsk->signal->cpu_timers,
> - cputime.utime, cputime.stime, cputime.sum_exec_runtime);
> + cputime_add(tsk->utime, tsk->signal->utime),
> + cputime_add(tsk->stime, tsk->signal->stime),
> + tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
> +}
> +
> +
> +/*
> + * Set the expiry times of all the threads in the process so one of them
> + * will go off before the process cumulative expiry total is reached.
> + */
> +static void process_timer_rebalance(struct task_struct *p,
> + unsigned int clock_idx,
> + union cpu_time_count expires,
> + union cpu_time_count val)
> +{
> + cputime_t ticks, left;
> + unsigned long long ns, nsleft;
> + struct task_struct *t = p;
> + unsigned int nthreads = atomic_read(&p->signal->live);
> +
> + if (!nthreads)
> + return;
> +
> + switch (clock_idx) {
> + default:
> + BUG();
> + break;
> + case CPUCLOCK_PROF:
> + left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
> + nthreads);
> + do {
> + if (likely(!(t->flags & PF_EXITING))) {
> + ticks = cputime_add(prof_ticks(t), left);
> + if (cputime_eq(t->it_prof_expires,
> + cputime_zero) ||
> + cputime_gt(t->it_prof_expires, ticks)) {
> + t->it_prof_expires = ticks;
> + }
> + }
> + t = next_thread(t);
> + } while (t != p);
> + break;
> + case CPUCLOCK_VIRT:
> + left = cputime_div_non_zero(cputime_sub(expires.cpu, val.cpu),
> + nthreads);
> + do {
> + if (likely(!(t->flags & PF_EXITING))) {
> + ticks = cputime_add(virt_ticks(t), left);
> + if (cputime_eq(t->it_virt_expires,
> + cputime_zero) ||
> + cputime_gt(t->it_virt_expires, ticks)) {
> + t->it_virt_expires = ticks;
> + }
> + }
> + t = next_thread(t);
> + } while (t != p);
> + break;
> + case CPUCLOCK_SCHED:
> + nsleft = expires.sched - val.sched;
> + do_div(nsleft, nthreads);
> + nsleft = max_t(unsigned long long, nsleft, 1);
> + do {
> + if (likely(!(t->flags & PF_EXITING))) {
> + ns = t->se.sum_exec_runtime + nsleft;
> + if (t->it_sched_expires == 0 ||
> + t->it_sched_expires > ns) {
> + t->it_sched_expires = ns;
> + }
> + }
> + t = next_thread(t);
> + } while (t != p);
> + break;
> + }
> }
>
> static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
> @@ -593,32 +608,29 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
> default:
> BUG();
> case CPUCLOCK_PROF:
> - if (cputime_eq(p->cputime_expires.prof_exp,
> + if (cputime_eq(p->it_prof_expires,
> cputime_zero) ||
> - cputime_gt(p->cputime_expires.prof_exp,
> + cputime_gt(p->it_prof_expires,
> nt->expires.cpu))
> - p->cputime_expires.prof_exp =
> - nt->expires.cpu;
> + p->it_prof_expires = nt->expires.cpu;
> break;
> case CPUCLOCK_VIRT:
> - if (cputime_eq(p->cputime_expires.virt_exp,
> + if (cputime_eq(p->it_virt_expires,
> cputime_zero) ||
> - cputime_gt(p->cputime_expires.virt_exp,
> + cputime_gt(p->it_virt_expires,
> nt->expires.cpu))
> - p->cputime_expires.virt_exp =
> - nt->expires.cpu;
> + p->it_virt_expires = nt->expires.cpu;
> break;
> case CPUCLOCK_SCHED:
> - if (p->cputime_expires.sched_exp == 0 ||
> - p->cputime_expires.sched_exp >
> - nt->expires.sched)
> - p->cputime_expires.sched_exp =
> - nt->expires.sched;
> + if (p->it_sched_expires == 0 ||
> + p->it_sched_expires > nt->expires.sched)
> + p->it_sched_expires = nt->expires.sched;
> break;
> }
> } else {
> /*
> - * For a process timer, set the cached expiration time.
> + * For a process timer, we must balance
> + * all the live threads' expirations.
> */
> switch (CPUCLOCK_WHICH(timer->it_clock)) {
> default:
> @@ -629,9 +641,7 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
> cputime_lt(p->signal->it_virt_expires,
> timer->it.cpu.expires.cpu))
> break;
> - p->signal->cputime_expires.virt_exp =
> - timer->it.cpu.expires.cpu;
> - break;
> + goto rebalance;
> case CPUCLOCK_PROF:
> if (!cputime_eq(p->signal->it_prof_expires,
> cputime_zero) &&
> @@ -642,12 +652,13 @@ static void arm_timer(struct k_itimer *timer, union cpu_time_count now)
> if (i != RLIM_INFINITY &&
> i <= cputime_to_secs(timer->it.cpu.expires.cpu))
> break;
> - p->signal->cputime_expires.prof_exp =
> - timer->it.cpu.expires.cpu;
> - break;
> + goto rebalance;
> case CPUCLOCK_SCHED:
> - p->signal->cputime_expires.sched_exp =
> - timer->it.cpu.expires.sched;
> + rebalance:
> + process_timer_rebalance(
> + timer->it.cpu.task,
> + CPUCLOCK_WHICH(timer->it_clock),
> + timer->it.cpu.expires, now);
> break;
> }
> }
> @@ -958,13 +969,13 @@ static void check_thread_timers(struct task_struct *tsk,
> struct signal_struct *const sig = tsk->signal;
>
> maxfire = 20;
> - tsk->cputime_expires.prof_exp = cputime_zero;
> + tsk->it_prof_expires = cputime_zero;
> while (!list_empty(timers)) {
> struct cpu_timer_list *t = list_first_entry(timers,
> struct cpu_timer_list,
> entry);
> if (!--maxfire || cputime_lt(prof_ticks(tsk), t->expires.cpu)) {
> - tsk->cputime_expires.prof_exp = t->expires.cpu;
> + tsk->it_prof_expires = t->expires.cpu;
> break;
> }
> t->firing = 1;
> @@ -973,13 +984,13 @@ static void check_thread_timers(struct task_struct *tsk,
>
> ++timers;
> maxfire = 20;
> - tsk->cputime_expires.virt_exp = cputime_zero;
> + tsk->it_virt_expires = cputime_zero;
> while (!list_empty(timers)) {
> struct cpu_timer_list *t = list_first_entry(timers,
> struct cpu_timer_list,
> entry);
> if (!--maxfire || cputime_lt(virt_ticks(tsk), t->expires.cpu)) {
> - tsk->cputime_expires.virt_exp = t->expires.cpu;
> + tsk->it_virt_expires = t->expires.cpu;
> break;
> }
> t->firing = 1;
> @@ -988,13 +999,13 @@ static void check_thread_timers(struct task_struct *tsk,
>
> ++timers;
> maxfire = 20;
> - tsk->cputime_expires.sched_exp = 0;
> + tsk->it_sched_expires = 0;
> while (!list_empty(timers)) {
> struct cpu_timer_list *t = list_first_entry(timers,
> struct cpu_timer_list,
> entry);
> if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
> - tsk->cputime_expires.sched_exp = t->expires.sched;
> + tsk->it_sched_expires = t->expires.sched;
> break;
> }
> t->firing = 1;
> @@ -1044,10 +1055,10 @@ static void check_process_timers(struct task_struct *tsk,
> {
> int maxfire;
> struct signal_struct *const sig = tsk->signal;
> - cputime_t utime, ptime, virt_expires, prof_expires;
> + cputime_t utime, stime, ptime, virt_expires, prof_expires;
> unsigned long long sum_sched_runtime, sched_expires;
> + struct task_struct *t;
> struct list_head *timers = sig->cpu_timers;
> - struct task_cputime cputime;
>
> /*
> * Don't sample the current process CPU clocks if there are no timers.
> @@ -1063,10 +1074,18 @@ static void check_process_timers(struct task_struct *tsk,
> /*
> * Collect the current process totals.
> */
> - thread_group_cputime(tsk, &cputime);
> - utime = cputime.utime;
> - ptime = cputime_add(utime, cputime.stime);
> - sum_sched_runtime = cputime.sum_exec_runtime;
> + utime = sig->utime;
> + stime = sig->stime;
> + sum_sched_runtime = sig->sum_sched_runtime;
> + t = tsk;
> + do {
> + utime = cputime_add(utime, t->utime);
> + stime = cputime_add(stime, t->stime);
> + sum_sched_runtime += t->se.sum_exec_runtime;
> + t = next_thread(t);
> + } while (t != tsk);
> + ptime = cputime_add(utime, stime);
> +
> maxfire = 20;
> prof_expires = cputime_zero;
> while (!list_empty(timers)) {
> @@ -1174,18 +1193,60 @@ static void check_process_timers(struct task_struct *tsk,
> }
> }
>
> - if (!cputime_eq(prof_expires, cputime_zero) &&
> - (cputime_eq(sig->cputime_expires.prof_exp, cputime_zero) ||
> - cputime_gt(sig->cputime_expires.prof_exp, prof_expires)))
> - sig->cputime_expires.prof_exp = prof_expires;
> - if (!cputime_eq(virt_expires, cputime_zero) &&
> - (cputime_eq(sig->cputime_expires.virt_exp, cputime_zero) ||
> - cputime_gt(sig->cputime_expires.virt_exp, virt_expires)))
> - sig->cputime_expires.virt_exp = virt_expires;
> - if (sched_expires != 0 &&
> - (sig->cputime_expires.sched_exp == 0 ||
> - sig->cputime_expires.sched_exp > sched_expires))
> - sig->cputime_expires.sched_exp = sched_expires;
> + if (!cputime_eq(prof_expires, cputime_zero) ||
> + !cputime_eq(virt_expires, cputime_zero) ||
> + sched_expires != 0) {
> + /*
> + * Rebalance the threads' expiry times for the remaining
> + * process CPU timers.
> + */
> +
> + cputime_t prof_left, virt_left, ticks;
> + unsigned long long sched_left, sched;
> + const unsigned int nthreads = atomic_read(&sig->live);
> +
> + if (!nthreads)
> + return;
> +
> + prof_left = cputime_sub(prof_expires, utime);
> + prof_left = cputime_sub(prof_left, stime);
> + prof_left = cputime_div_non_zero(prof_left, nthreads);
> + virt_left = cputime_sub(virt_expires, utime);
> + virt_left = cputime_div_non_zero(virt_left, nthreads);
> + if (sched_expires) {
> + sched_left = sched_expires - sum_sched_runtime;
> + do_div(sched_left, nthreads);
> + sched_left = max_t(unsigned long long, sched_left, 1);
> + } else {
> + sched_left = 0;
> + }
> + t = tsk;
> + do {
> + if (unlikely(t->flags & PF_EXITING))
> + continue;
> +
> + ticks = cputime_add(cputime_add(t->utime, t->stime),
> + prof_left);
> + if (!cputime_eq(prof_expires, cputime_zero) &&
> + (cputime_eq(t->it_prof_expires, cputime_zero) ||
> + cputime_gt(t->it_prof_expires, ticks))) {
> + t->it_prof_expires = ticks;
> + }
> +
> + ticks = cputime_add(t->utime, virt_left);
> + if (!cputime_eq(virt_expires, cputime_zero) &&
> + (cputime_eq(t->it_virt_expires, cputime_zero) ||
> + cputime_gt(t->it_virt_expires, ticks))) {
> + t->it_virt_expires = ticks;
> + }
> +
> + sched = t->se.sum_exec_runtime + sched_left;
> + if (sched_expires && (t->it_sched_expires == 0 ||
> + t->it_sched_expires > sched)) {
> + t->it_sched_expires = sched;
> + }
> + } while ((t = next_thread(t)) != tsk);
> + }
> }
>
> /*
> @@ -1253,86 +1314,6 @@ out:
> ++timer->it_requeue_pending;
> }
>
> -/**
> - * task_cputime_zero - Check a task_cputime struct for all zero fields.
> - *
> - * @cputime: The struct to compare.
> - *
> - * Checks @cputime to see if all fields are zero. Returns true if all fields
> - * are zero, false if any field is nonzero.
> - */
> -static inline int task_cputime_zero(const struct task_cputime *cputime)
> -{
> - if (cputime_eq(cputime->utime, cputime_zero) &&
> - cputime_eq(cputime->stime, cputime_zero) &&
> - cputime->sum_exec_runtime == 0)
> - return 1;
> - return 0;
> -}
> -
> -/**
> - * task_cputime_expired - Compare two task_cputime entities.
> - *
> - * @sample: The task_cputime structure to be checked for expiration.
> - * @expires: Expiration times, against which @sample will be checked.
> - *
> - * Checks @sample against @expires to see if any field of @sample has expired.
> - * Returns true if any field of the former is greater than the corresponding
> - * field of the latter if the latter field is set. Otherwise returns false.
> - */
> -static inline int task_cputime_expired(const struct task_cputime *sample,
> - const struct task_cputime *expires)
> -{
> - if (!cputime_eq(expires->utime, cputime_zero) &&
> - cputime_ge(sample->utime, expires->utime))
> - return 1;
> - if (!cputime_eq(expires->stime, cputime_zero) &&
> - cputime_ge(cputime_add(sample->utime, sample->stime),
> - expires->stime))
> - return 1;
> - if (expires->sum_exec_runtime != 0 &&
> - sample->sum_exec_runtime >= expires->sum_exec_runtime)
> - return 1;
> - return 0;
> -}
> -
> -/**
> - * fastpath_timer_check - POSIX CPU timers fast path.
> - *
> - * @tsk: The task (thread) being checked.
> - *
> - * Check the task and thread group timers. If both are zero (there are no
> - * timers set) return false. Otherwise snapshot the task and thread group
> - * timers and compare them with the corresponding expiration times. Return
> - * true if a timer has expired, else return false.
> - */
> -static inline int fastpath_timer_check(struct task_struct *tsk)
> -{
> - struct signal_struct *sig = tsk->signal;
> -
> - if (unlikely(!sig))
> - return 0;
> -
> - if (!task_cputime_zero(&tsk->cputime_expires)) {
> - struct task_cputime task_sample = {
> - .utime = tsk->utime,
> - .stime = tsk->stime,
> - .sum_exec_runtime = tsk->se.sum_exec_runtime
> - };
> -
> - if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
> - return 1;
> - }
> - if (!task_cputime_zero(&sig->cputime_expires)) {
> - struct task_cputime group_sample;
> -
> - thread_group_cputime(tsk, &group_sample);
> - if (task_cputime_expired(&group_sample, &sig->cputime_expires))
> - return 1;
> - }
> - return 0;
> -}
> -
> /*
> * This is called from the timer interrupt handler. The irq handler has
> * already updated our counts. We need to check if any timers fire now.
> @@ -1345,31 +1326,42 @@ void run_posix_cpu_timers(struct task_struct *tsk)
>
> BUG_ON(!irqs_disabled());
>
> - /*
> - * The fast path checks that there are no expired thread or thread
> - * group timers. If that's so, just return.
> - */
> - if (!fastpath_timer_check(tsk))
> +#define UNEXPIRED(clock) \
> + (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \
> + cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires))
> +
> + if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
> + (tsk->it_sched_expires == 0 ||
> + tsk->se.sum_exec_runtime < tsk->it_sched_expires))
> return;
>
> - spin_lock(&tsk->sighand->siglock);
> - /*
> - * Here we take off tsk->signal->cpu_timers[N] and
> - * tsk->cpu_timers[N] all the timers that are firing, and
> - * put them on the firing list.
> - */
> - check_thread_timers(tsk, &firing);
> - check_process_timers(tsk, &firing);
> +#undef UNEXPIRED
>
> /*
> - * We must release these locks before taking any timer's lock.
> - * There is a potential race with timer deletion here, as the
> - * siglock now protects our private firing list. We have set
> - * the firing flag in each timer, so that a deletion attempt
> - * that gets the timer lock before we do will give it up and
> - * spin until we've taken care of that timer below.
> + * Double-check with locks held.
> */
> - spin_unlock(&tsk->sighand->siglock);
> + read_lock(&tasklist_lock);
> + if (likely(tsk->signal != NULL)) {
> + spin_lock(&tsk->sighand->siglock);
> +
> + /*
> + * Here we take off tsk->cpu_timers[N] and tsk->signal->cpu_timers[N]
> + * all the timers that are firing, and put them on the firing list.
> + */
> + check_thread_timers(tsk, &firing);
> + check_process_timers(tsk, &firing);
> +
> + /*
> + * We must release these locks before taking any timer's lock.
> + * There is a potential race with timer deletion here, as the
> + * siglock now protects our private firing list. We have set
> + * the firing flag in each timer, so that a deletion attempt
> + * that gets the timer lock before we do will give it up and
> + * spin until we've taken care of that timer below.
> + */
> + spin_unlock(&tsk->sighand->siglock);
> + }
> + read_unlock(&tasklist_lock);
>
> /*
> * Now that all the timers on our list have the firing flag,
> @@ -1397,9 +1389,10 @@ void run_posix_cpu_timers(struct task_struct *tsk)
>
> /*
> * Set one of the process-wide special case CPU timers.
> - * The tsk->sighand->siglock must be held by the caller.
> - * The *newval argument is relative and we update it to be absolute, *oldval
> - * is absolute and we update it to be relative.
> + * The tasklist_lock and tsk->sighand->siglock must be held by the caller.
> + * The oldval argument is null for the RLIMIT_CPU timer, where *newval is
> + * absolute; non-null for ITIMER_*, where *newval is relative and we update
> + * it to be absolute, *oldval is absolute and we update it to be relative.
> */
> void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
> cputime_t *newval, cputime_t *oldval)
> @@ -1408,7 +1401,7 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
> struct list_head *head;
>
> BUG_ON(clock_idx == CPUCLOCK_SCHED);
> - cpu_clock_sample_group(clock_idx, tsk, &now);
> + cpu_clock_sample_group_locked(clock_idx, tsk, &now);
>
> if (oldval) {
> if (!cputime_eq(*oldval, cputime_zero)) {
> @@ -1442,14 +1435,13 @@ void set_process_cpu_timer(struct task_struct *tsk, unsigned int clock_idx,
> cputime_ge(list_first_entry(head,
> struct cpu_timer_list, entry)->expires.cpu,
> *newval)) {
> - switch (clock_idx) {
> - case CPUCLOCK_PROF:
> - tsk->signal->cputime_expires.prof_exp = *newval;
> - break;
> - case CPUCLOCK_VIRT:
> - tsk->signal->cputime_expires.virt_exp = *newval;
> - break;
> - }
> + /*
> + * Rejigger each thread's expiry time so that one will
> + * notice before we hit the process-cumulative expiry time.
> + */
> + union cpu_time_count expires = { .sched = 0 };
> + expires.cpu = *newval;
> + process_timer_rebalance(tsk, clock_idx, expires, now);
> }
> }
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 9d50bd4..70f98c4 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4033,26 +4033,23 @@ DEFINE_PER_CPU(struct kernel_stat, kstat);
> EXPORT_PER_CPU_SYMBOL(kstat);
>
> /*
> - * Return any ns on the sched_clock that have not yet been banked in
> - * @p in case that task is currently running.
> + * Return p->sum_exec_runtime plus any more ns on the sched_clock
> + * that have not yet been banked in case the task is currently running.
> */
> -unsigned long long task_delta_exec(struct task_struct *p)
> +unsigned long long task_sched_runtime(struct task_struct *p)
> {
> unsigned long flags;
> + u64 ns, delta_exec;
> struct rq *rq;
> - u64 ns = 0;
>
> rq = task_rq_lock(p, &flags);
> -
> + ns = p->se.sum_exec_runtime;
> if (task_current(rq, p)) {
> - u64 delta_exec;
> -
> update_rq_clock(rq);
> delta_exec = rq->clock - p->se.exec_start;
> if ((s64)delta_exec > 0)
> - ns = delta_exec;
> + ns += delta_exec;
> }
> -
> task_rq_unlock(rq, &flags);
>
> return ns;
> @@ -4069,7 +4066,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
> cputime64_t tmp;
>
> p->utime = cputime_add(p->utime, cputime);
> - account_group_user_time(p, cputime);
>
> /* Add user time to cpustat. */
> tmp = cputime_to_cputime64(cputime);
> @@ -4094,7 +4090,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime)
> tmp = cputime_to_cputime64(cputime);
>
> p->utime = cputime_add(p->utime, cputime);
> - account_group_user_time(p, cputime);
> p->gtime = cputime_add(p->gtime, cputime);
>
> cpustat->user = cputime64_add(cpustat->user, tmp);
> @@ -4130,7 +4125,6 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> }
>
> p->stime = cputime_add(p->stime, cputime);
> - account_group_system_time(p, cputime);
>
> /* Add system time to cpustat. */
> tmp = cputime_to_cputime64(cputime);
> @@ -4172,7 +4166,6 @@ void account_steal_time(struct task_struct *p, cputime_t steal)
>
> if (p == rq->idle) {
> p->stime = cputime_add(p->stime, steal);
> - account_group_system_time(p, steal);
> if (atomic_read(&rq->nr_iowait) > 0)
> cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
> else
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 51aa3e1..5781abb 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -500,7 +500,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
> struct task_struct *curtask = task_of(curr);
>
> cpuacct_charge(curtask, delta_exec);
> - account_group_exec_runtime(curtask, delta_exec);
> }
> }
>
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index c7963d5..98b1a19 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -526,8 +526,6 @@ static void update_curr_rt(struct rq *rq)
> schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
>
> curr->se.sum_exec_runtime += delta_exec;
> - account_group_exec_runtime(curr, delta_exec);
> -
> curr->se.exec_start = rq->clock;
> cpuacct_charge(curr, delta_exec);
>
> @@ -1460,7 +1458,7 @@ static void watchdog(struct rq *rq, struct task_struct *p)
> p->rt.timeout++;
> next = DIV_ROUND_UP(min(soft, hard), USEC_PER_SEC/HZ);
> if (p->rt.timeout > next)
> - p->cputime_expires.sched_exp = p->se.sum_exec_runtime;
> + p->it_sched_expires = p->se.sum_exec_runtime;
> }
> }
>
> diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
> index ee71bec..a93ef66 100644
> --- a/kernel/sched_stats.h
> +++ b/kernel/sched_stats.h
> @@ -277,89 +277,3 @@ sched_info_switch(struct task_struct *prev, struct task_struct *next)
> #define sched_info_switch(t, next) do { } while (0)
> #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
>
> -/*
> - * The following are functions that support scheduler-internal time accounting.
> - * These functions are generally called at the timer tick. None of this depends
> - * on CONFIG_SCHEDSTATS.
> - */
> -
> -/**
> - * account_group_user_time - Maintain utime for a thread group.
> - *
> - * @tsk: Pointer to task structure.
> - * @cputime: Time value by which to increment the utime field of the
> - * thread_group_cputime structure.
> - *
> - * If thread group time is being maintained, get the structure for the
> - * running CPU and update the utime field there.
> - */
> -static inline void account_group_user_time(struct task_struct *tsk,
> - cputime_t cputime)
> -{
> - struct signal_struct *sig;
> -
> - sig = tsk->signal;
> - if (unlikely(!sig))
> - return;
> - if (sig->cputime.totals) {
> - struct task_cputime *times;
> -
> - times = per_cpu_ptr(sig->cputime.totals, get_cpu());
> - times->utime = cputime_add(times->utime, cputime);
> - put_cpu_no_resched();
> - }
> -}
> -
> -/**
> - * account_group_system_time - Maintain stime for a thread group.
> - *
> - * @tsk: Pointer to task structure.
> - * @cputime: Time value by which to increment the stime field of the
> - * thread_group_cputime structure.
> - *
> - * If thread group time is being maintained, get the structure for the
> - * running CPU and update the stime field there.
> - */
> -static inline void account_group_system_time(struct task_struct *tsk,
> - cputime_t cputime)
> -{
> - struct signal_struct *sig;
> -
> - sig = tsk->signal;
> - if (unlikely(!sig))
> - return;
> - if (sig->cputime.totals) {
> - struct task_cputime *times;
> -
> - times = per_cpu_ptr(sig->cputime.totals, get_cpu());
> - times->stime = cputime_add(times->stime, cputime);
> - put_cpu_no_resched();
> - }
> -}
> -
> -/**
> - * account_group_exec_runtime - Maintain exec runtime for a thread group.
> - *
> - * @tsk: Pointer to task structure.
> - * @ns: Time value by which to increment the sum_exec_runtime field
> - * of the thread_group_cputime structure.
> - *
> - * If thread group time is being maintained, get the structure for the
> - * running CPU and update the sum_exec_runtime field there.
> - */
> -static inline void account_group_exec_runtime(struct task_struct *tsk,
> - unsigned long long ns)
> -{
> - struct signal_struct *sig;
> -
> - sig = tsk->signal;
> - if (unlikely(!sig))
> - return;
> - if (sig->cputime.totals) {
> - struct task_cputime *times;
> -
> - times = per_cpu_ptr(sig->cputime.totals, get_cpu());
> - times->sum_exec_runtime += ns;
> - put_cpu_no_resched();
> - }
> -}
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 4530fc6..37ce260 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1342,7 +1342,6 @@ int do_notify_parent(struct task_struct *tsk, int sig)
> struct siginfo info;
> unsigned long flags;
> struct sighand_struct *psig;
> - struct task_cputime cputime;
> int ret = sig;
>
> BUG_ON(sig == -1);
> @@ -1373,9 +1372,10 @@ int do_notify_parent(struct task_struct *tsk, int sig)
>
> info.si_uid = tsk->uid;
>
> - thread_group_cputime(tsk, &cputime);
> - info.si_utime = cputime_to_jiffies(cputime.utime);
> - info.si_stime = cputime_to_jiffies(cputime.stime);
> + info.si_utime = cputime_to_clock_t(cputime_add(tsk->utime,
> + tsk->signal->utime));
> + info.si_stime = cputime_to_clock_t(cputime_add(tsk->stime,
> + tsk->signal->stime));
>
> info.si_status = tsk->exit_code & 0x7f;
> if (tsk->exit_code & 0x80)
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 31deba8..fc71f99 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -853,28 +853,38 @@ asmlinkage long sys_setfsgid(gid_t gid)
> return old_fsgid;
> }
>
> -void do_sys_times(struct tms *tms)
> -{
> - struct task_cputime cputime;
> - cputime_t cutime, cstime;
> -
> - spin_lock_irq(&current->sighand->siglock);
> - thread_group_cputime(current, &cputime);
> - cutime = current->signal->cutime;
> - cstime = current->signal->cstime;
> - spin_unlock_irq(&current->sighand->siglock);
> - tms->tms_utime = cputime_to_clock_t(cputime.utime);
> - tms->tms_stime = cputime_to_clock_t(cputime.stime);
> - tms->tms_cutime = cputime_to_clock_t(cutime);
> - tms->tms_cstime = cputime_to_clock_t(cstime);
> -}
> -
> asmlinkage long sys_times(struct tms __user * tbuf)
> {
> + /*
> + * In the SMP world we might just be unlucky and have one of
> + * the times increment as we use it. Since the value is an
> + * atomically safe type this is just fine. Conceptually its
> + * as if the syscall took an instant longer to occur.
> + */
> if (tbuf) {
> struct tms tmp;
> -
> - do_sys_times(&tmp);
> + struct task_struct *tsk = current;
> + struct task_struct *t;
> + cputime_t utime, stime, cutime, cstime;
> +
> + spin_lock_irq(&tsk->sighand->siglock);
> + utime = tsk->signal->utime;
> + stime = tsk->signal->stime;
> + t = tsk;
> + do {
> + utime = cputime_add(utime, t->utime);
> + stime = cputime_add(stime, t->stime);
> + t = next_thread(t);
> + } while (t != tsk);
> +
> + cutime = tsk->signal->cutime;
> + cstime = tsk->signal->cstime;
> + spin_unlock_irq(&tsk->sighand->siglock);
> +
> + tmp.tms_utime = cputime_to_clock_t(utime);
> + tmp.tms_stime = cputime_to_clock_t(stime);
> + tmp.tms_cutime = cputime_to_clock_t(cutime);
> + tmp.tms_cstime = cputime_to_clock_t(cstime);
> if (copy_to_user(tbuf, &tmp, sizeof(struct tms)))
> return -EFAULT;
> }
> @@ -1439,6 +1449,7 @@ asmlinkage long sys_old_getrlimit(unsigned int resource, struct rlimit __user *r
> asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
> {
> struct rlimit new_rlim, *old_rlim;
> + unsigned long it_prof_secs;
> int retval;
>
> if (resource >= RLIM_NLIMITS)
> @@ -1492,7 +1503,18 @@ asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit __user *rlim)
> if (new_rlim.rlim_cur == RLIM_INFINITY)
> goto out;
>
> - update_rlimit_cpu(new_rlim.rlim_cur);
> + it_prof_secs = cputime_to_secs(current->signal->it_prof_expires);
> + if (it_prof_secs == 0 || new_rlim.rlim_cur <= it_prof_secs) {
> + unsigned long rlim_cur = new_rlim.rlim_cur;
> + cputime_t cputime;
> +
> + cputime = secs_to_cputime(rlim_cur);
> + read_lock(&tasklist_lock);
> + spin_lock_irq(&current->sighand->siglock);
> + set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
> + spin_unlock_irq(&current->sighand->siglock);
> + read_unlock(&tasklist_lock);
> + }
> out:
> return 0;
> }
> @@ -1530,8 +1552,11 @@ out:
> *
> */
>
> -static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r)
> +static void accumulate_thread_rusage(struct task_struct *t, struct rusage *r,
> + cputime_t *utimep, cputime_t *stimep)
> {
> + *utimep = cputime_add(*utimep, t->utime);
> + *stimep = cputime_add(*stimep, t->stime);
> r->ru_nvcsw += t->nvcsw;
> r->ru_nivcsw += t->nivcsw;
> r->ru_minflt += t->min_flt;
> @@ -1545,13 +1570,12 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> struct task_struct *t;
> unsigned long flags;
> cputime_t utime, stime;
> - struct task_cputime cputime;
>
> memset((char *) r, 0, sizeof *r);
> utime = stime = cputime_zero;
>
> if (who == RUSAGE_THREAD) {
> - accumulate_thread_rusage(p, r);
> + accumulate_thread_rusage(p, r, &utime, &stime);
> goto out;
> }
>
> @@ -1574,9 +1598,8 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> break;
>
> case RUSAGE_SELF:
> - thread_group_cputime(p, &cputime);
> - utime = cputime_add(utime, cputime.utime);
> - stime = cputime_add(stime, cputime.stime);
> + utime = cputime_add(utime, p->signal->utime);
> + stime = cputime_add(stime, p->signal->stime);
> r->ru_nvcsw += p->signal->nvcsw;
> r->ru_nivcsw += p->signal->nivcsw;
> r->ru_minflt += p->signal->min_flt;
> @@ -1585,7 +1608,7 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
> r->ru_oublock += p->signal->oublock;
> t = p;
> do {
> - accumulate_thread_rusage(t, r);
> + accumulate_thread_rusage(t, r, &utime, &stime);
> t = next_thread(t);
> } while (t != p);
> break;
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index f85597a..d5dd93f 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -75,7 +75,6 @@
> #include <linux/string.h>
> #include <linux/selinux.h>
> #include <linux/mutex.h>
> -#include <linux/posix-timers.h>
>
> #include "avc.h"
> #include "objsec.h"
> @@ -2325,7 +2324,13 @@ static void selinux_bprm_post_apply_creds(struct linux_binprm *bprm)
> initrlim = init_task.signal->rlim+i;
> rlim->rlim_cur = min(rlim->rlim_max, initrlim->rlim_cur);
> }
> - update_rlimit_cpu(rlim->rlim_cur);
> + if (current->signal->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY) {
> + /*
> + * This will cause RLIMIT_CPU calculations
> + * to be refigured.
> + */
> + current->it_prof_expires = jiffies_to_cputime(1);
> + }
> }
>
> /* Wake up the parent if it is waiting so that it can
>

2008-11-13 16:08:44

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH] revert: timers: fix itimer/many thread hang

* Doug Chapman <[email protected]> wrote:

> Peter didn't mention here that the itimer/many thread patches cause
> a race-condition panic which is easily reproduced (in fact
> unavoidable with SLUB debugging on) on hyperthreaded ia64 systems
> (but certainly possible on all systems). Can we get this revert
> patch pulled in? The other attempts to fix this so far have been
> unsuccessful. I have tested with this revert patch and it does
> indeed fix that issue.

Could you please try with the very latest upstream kernel? It should
be fixed by Oleg's fix patch via:

ad474ca: fix for account_group_exec_runtime(), make sure ->signal can't be fre

which is already upstream.

Ingo

2008-11-14 02:43:50

by Roland McGrath

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

Hi folks. I'm catching up on this after a few weeks off-line.
I don't have anything really new to add, but a few thoughts.

An idea like taking siglock in account_group_*() should be a non-starter.
The sole purpose of the per-cpu count scheme is to avoid cache ping-ponging
from updating a shared group-wide count with atomics or locked data in the
tick path. If you take shared locks there, you have the same performance
hit in the tick path. If that were acceptable for the real-world range of
cases of multiple threads on multiple CPUs, it could all be far simpler.

Over the years this sort of problem has come up a few times, races between
release_task(T) and code in the tick path when current==T. The two kinds
of fixes are to make the hot tick path robust against simultaneous
teardown, with whatever adding of checks that entails, or else to
reorganize the teardown earlier in release_task or on the do_exit path so
that existing checks on the hot tick path already avoid entering the
problem parts of the tick path when teardown has begun. It's clearly
preferable to make choices that minimize the work done in the common case
tick path.

There are two kinds of failure mode for these race bugs. In one a
self-reaping thread calls release_task(current) and a tick interrupt
arrives somewhere in the middle. In the other, a reaping parent (or
exec'ing sibling) calls release_task(T) between T's exit_notify and its
final deschedule, and the tick interrupt arrives in that interval.

A third variety of possible fix that we haven't explored much is to delay
parts of the teardown to __put_task_struct or to finish_task_switch's
TASK_DEAD case. That is, make simpler code on the tick path remain safe
until it's no longer possible to have a tick (because it's after the final
deschedule).

The group-wide counts are used for two purposes: firing CPU timers, and
maintaining group CPU clocks for sampling (including wait4 rusage and
process accounting). For timers, it's always OK if they don't fire until
up to a whole tick's worth of CPU accumulates on that CPU clock after they
might have ideally fired. So for purposes of timers, it's fine to disable
firing on ticks anywhere from early in the exit path (a la PF_EXITING
checks) when that makes it easy to keep the tick path safe from teardown
races. If there are live threads left in the group and a group-wide CPU
timer should fire, it will be seen after one of them runs for another tick.

Maintaining the group CPU clocks is a little different. We do miss
whatever time a dying thread spends between when the code at the start of
__exit_signal runs and its final deschedule. For the group-leader (or only
thread in single-threaded processes), it's the time between its parent
calling wait_task_zombie and the dying leader doing its final deschedule.
But we wouldn't like to miss any more CPU time from these counts than we
absolutely have to--that time is effectively accounted to noone. Up
through 2.6.27, these were only summed from the per-thread counters
(updated on ticks, and exec_runtime also on context switches so that one is
quite precise). Now we don't sum those, and just have the parallel
aggregate per-CPU group counts. So the stretch of time lost to the final
accounting increases if account_group_*() bail out in a tick that hits
earlier in the exit path. In some sense it would be ideal to delay all the
final tallies until finish_task_switch's TASK_DEAD case, especially for
exec_runtime (which could then lose no cycles at all). (That sounds very
attractive for a self-reaping thread. On the other hand, for a regular
child dying and its parent on another CPU waking up in wait, we currently
have some opportunity for parallelism between the child's exit path after
exit_notify and the parent doing all the release_task work.)

If I'm understanding it correctly, Oleg's task_rq_unlock_wait change makes
sure that if any task_rq_lock is in effect when clearing ->signal, it's
effectively serialized either to:
CPU1(tsk) CPU2(parent)
task_rq_lock(tsk)...task_rq_unlock(tsk)
tsk->signal = NULL;
__cleanup_signal(sig);
or to:
CPU1(tsk) CPU2(parent)
tsk->signal = NULL;
task_rq_lock(tsk)...task_rq_unlock(tsk)
__cleanup_signal(sig);
so that the locked "..." code either sees NULL or sees a signal_struct
that cannot be passed to __cleanup_signal until after task_rq_unlock.
Is that right?

Doesn't the same bug exist for account_group_user_time and
account_group_system_time? Those aren't called with task_rq_lock(current)
held, I don't think. So Oleg's change doesn't address the whole problem,
unless I'm missing something (which is always likely).

The first thing that pops to my mind is to protect the freeing of
signal_struct and thread_group_cputime_free (i.e. some or most of the
__cleanup_signal worK) with RCU. Then use rcu_read_lock() around accesses
to current->signal in places that can run after exit_notify, including the
interrupt and tick paths. Since __cleanup_signal is only possible after
anything that could use most of the signal_struct is ruled out, the
rcu_head could even be thrown into a union with some other member(s),
e.g. shared_pending, to save the two words bloat in signal_struct.

On the other hand, there is no real use to account_group_* doing any work
after the beginning of __exit_signal, since the updated counts can't ever
be read again. Some new way of safely always bailing out would be fine
too, but I haven't thought of one off hand.

I've rambled on more than enough for one message. So I'll confine this one
to the issue of tick-vs-release_task races. If it's helpful to the
discussion, I can write separately about the trade-offs between the
original (<=2.6.27) approach to group CPU timers and clocks, and what
motivated the approach Frank implemented.

Thanks,
Roland

2008-11-14 14:11:16

by Doug Chapman

[permalink] [raw]

Subject: Re: [PATCH] revert: timers: fix itimer/many thread hang

On Thu, 2008-11-13 at 17:08 +0100, Ingo Molnar wrote:
> * Doug Chapman <[email protected]> wrote:
>
> > Peter didn't mention here that the itimer/many thread patches cause
> > a race-condition panic which is easily reproduced (in fact
> > unavoidable with SLUB debugging on) on hyperthreaded ia64 systems
> > (but certainly possible on all systems). Can we get this revert
> > patch pulled in? The other attempts to fix this so far have been
> > unsuccessful. I have tested with this revert patch and it does
> > indeed fix that issue.
>
> Could you please try with the very latest upstream kernel? It should
> be fixed by Oleg's fix patch via:
>
> ad474ca: fix for account_group_exec_runtime(), make sure ->signal can't be fre
>
> which is already upstream.
>
> Ingo

Ingo,

Thanks, somehow I had missed that. I tested with the latest GIT tree
and the issue is resolved now.

- Doug

2008-11-14 15:42:17

by Oleg Nesterov

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On 11/13, Roland McGrath wrote:
>
> An idea like taking siglock in account_group_*() should be a non-starter.

Yes sure. The patch was buggy anyway, but even _if_ was correct it was
only a temporary hack for 2.6.28.

> A third variety of possible fix that we haven't explored much is to delay
> parts of the teardown to __put_task_struct or to finish_task_switch's
> TASK_DEAD case. That is, make simpler code on the tick path remain safe
> until it's no longer possible to have a tick (because it's after the final
> deschedule).

This was already discussed a bit,
http://marc.info/?l=linux-kernel&m=122640473714466

and perhaps this makes sense. With this change we can simplify other code.

> If I'm understanding it correctly, Oleg's task_rq_unlock_wait change makes
> sure that if any task_rq_lock is in effect when clearing ->signal, it's
> effectively serialized either to:
> CPU1(tsk) CPU2(parent)
> task_rq_lock(tsk)...task_rq_unlock(tsk)
> tsk->signal = NULL;
> __cleanup_signal(sig);
> or to:
> CPU1(tsk) CPU2(parent)
> tsk->signal = NULL;
> task_rq_lock(tsk)...task_rq_unlock(tsk)
> __cleanup_signal(sig);
> so that the locked "..." code either sees NULL or sees a signal_struct
> that cannot be passed to __cleanup_signal until after task_rq_unlock.
> Is that right?
>
> Doesn't the same bug exist for account_group_user_time and
> account_group_system_time? Those aren't called with task_rq_lock(current)
> held, I don't think. So Oleg's change doesn't address the whole problem,
> unless I'm missing something (which is always likely).

You are right. (please see below).

Even run_posix_cpu_timers() becomes unsafe. And I must admit, I have read
this part of the patch carefully before, and I didn't notice the problem.
I'll try to finally read the whole patch carefully on Sunday, but I don't
trust myself ;)

> The first thing that pops to my mind is to protect the freeing of
> signal_struct and thread_group_cputime_free (i.e. some or most of the
> __cleanup_signal worK) with RCU. Then use rcu_read_lock() around accesses
> to current->signal in places that can run after exit_notify,

Yes, this was my initial intent, but needs more changes. (actually,
I personally like the idea to free ->signal from __put_task_struct()
more, but I have no good arguments).

Currently I am trying to find the ugly, but simple fixes for 2.6.28.

account_group_user_time(), run_posix_cpu_timers() are simpler to
fix. Again, I need to actually read the code, but afaics we can
rely on the fact that the task is current, so we can change the
code

- if (!->signal)
+ if (->exit_state)
return;

But of course, I do agree, we need a more clever fix for the long
term, even if the change above can really help.

Oleg.

2008-11-17 13:36:48

by Oleg Nesterov

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On 11/14, Oleg Nesterov wrote:
>
> Currently I am trying to find the ugly, but simple fixes for 2.6.28.
>
> account_group_user_time(), run_posix_cpu_timers() are simpler to
> fix. Again, I need to actually read the code, but afaics we can
> rely on the fact that the task is current, so we can change the
> code
>
> - if (!->signal)
> + if (->exit_state)
> return;

Yes, unless I missed something again, this should work. I'll send
the (simple) patches soon, but I have no idea how to test them.

However, I'm afraid there is another problem. On 32 bit cpus we can't
read "u64 sum_exec_runtime" atomically, and so thread_group_cputime()
can "overestimate" ->sum_exec_runtime by (up to) UINT_MAX if it races
with the thread which updates its per_cpu_ptr(.totals). This for example
means that check_process_timers() can fire the CPUCLOCK_SCHED timers
before time.

No?

Oleg.

2008-11-17 18:17:50

by Roland McGrath

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

> > - if (!->signal)
> > + if (->exit_state)
> > return;
>
> Yes, unless I missed something again, this should work. I'll send
> the (simple) patches soon, but I have no idea how to test them.

That certainly will exclude the problem of crashing in the tick interrupt
after exit_notify. Unfortunately, it's moving in an undesireable direction
for the long run. That is, it loses from the accounting even more of the
CPU time spent on the exit path.

> However, I'm afraid there is another problem. On 32 bit cpus we can't
> read "u64 sum_exec_runtime" atomically, and so thread_group_cputime()
> can "overestimate" ->sum_exec_runtime by (up to) UINT_MAX if it races
> with the thread which updates its per_cpu_ptr(.totals). This for example
> means that check_process_timers() can fire the CPUCLOCK_SCHED timers
> before time.
>
> No?

Yes, I think you're right. The best solution that comes to mind off hand
is to protect the update/read of that u64 with a seqcount_t on 32-bit.

Thanks,
Roland

2008-11-17 21:18:50

by Oleg Nesterov

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On 11/17, Roland McGrath wrote:
>
> > > - if (!->signal)
> > > + if (->exit_state)
> > > return;
> >
> > Yes, unless I missed something again, this should work. I'll send
> > the (simple) patches soon, but I have no idea how to test them.
>
> That certainly will exclude the problem of crashing in the tick interrupt
> after exit_notify. Unfortunately, it's moving in an undesireable direction
> for the long run. That is, it loses from the accounting even more of the
> CPU time spent on the exit path.

Yes, I thought about this too.

But please note that currently this already happens for sub-threads (and
if we protect ->signal with rcu too), the exiting sub-thread does not
contribute to accounting after release_task(self). Also, when the last
thread exits the process can be reaped by its parent, but after that
the threads can still use CPU.

IOW, when ->exit_signal != 0 we already sent the notification to parent
with utime/stime, the parent can reap current at any moment before it
does the final schedule. I don't think we can do something here.

But if we make ->signal refcountable, we can improve the case with the
exiting subthreads at least.

(Just in case, anyway I completeley agree, this hack (and unlock_wait)
should be killed in 2.6.29).

> > However, I'm afraid there is another problem. On 32 bit cpus we can't
> > read "u64 sum_exec_runtime" atomically, and so thread_group_cputime()
> > can "overestimate" ->sum_exec_runtime by (up to) UINT_MAX if it races
> > with the thread which updates its per_cpu_ptr(.totals). This for example
> > means that check_process_timers() can fire the CPUCLOCK_SCHED timers
> > before time.
> >
> > No?
>
> Yes, I think you're right. The best solution that comes to mind off hand
> is to protect the update/read of that u64 with a seqcount_t on 32-bit.

Oh, but we need them to be per-cpu, and both read and write need memory
barriers... Not that I argue, this will fix the problem of course, just
I don't know how this impacts the perfomance.

Oleg.

2008-11-17 21:50:25

by Roland McGrath

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

> But please note that currently this already happens for sub-threads (and
> if we protect ->signal with rcu too), the exiting sub-thread does not
> contribute to accounting after release_task(self). Also, when the last
> thread exits the process can be reaped by its parent, but after that
> the threads can still use CPU.

Understood (I think I mentioned this earlier). All this is what makes it
seem potentially attractive in the long run to reorganize this more thoroughly.

> > Yes, I think you're right. The best solution that comes to mind off hand
> > is to protect the update/read of that u64 with a seqcount_t on 32-bit.
>
> Oh, but we need them to be per-cpu, and both read and write need memory
> barriers... Not that I argue, this will fix the problem of course, just
> I don't know how this impacts the perfomance.

I agree it's a sticky question. Just the only thing I've thought of so far.

Thanks,
Roland

2008-11-21 18:41:54

by Petr Tesařík

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

Dne Friday 07 of November 2008 11:29:04 Peter Zijlstra napsal(a):
> (fwiw your email doesn't come across properly, evo refuses to display
> them, there's some mangling of headers which makes it think there's an
> attachment)
>
> On Thu, 2008-11-06 at 15:52 -0800, Frank Mayhar wrote:
> > On Thu, 2008-11-06 at 16:08 +0100, Peter Zijlstra wrote:
> > > On Thu, 2008-11-06 at 09:03 -0600, Christoph Lameter wrote:
> > > > On Thu, 6 Nov 2008, Peter Zijlstra wrote:
> > > > > Also, you just introduced per-cpu allocations for each
> > > > > thread-group, while Christoph is reworking the per-cpu allocator,
> > > > > with one unfortunate side-effect - its going to have a limited size
> > > > > pool. Therefore this will limit the number of thread-groups we can
> > > > > have.
> > > >
> > > > Patches exist that implement a dynamically growable percpu pool
> > > > (using virtual mappings though). If the cost of the additional
> > > > complexity / overhead is justifiable then we can make the percpu pool
> > > > dynamically extendable.
> > >
> > > Right, but I don't think the patch under consideration will fly anyway,
> > > doing a for_each_possible_cpu() loop on every tick on all cpus isn't
> > > really healthy, even for moderate sized machines.
> >
> > I personally think that you're overstating this. First, the current
> > implementation walks all threads for each tick, which is simply not
> > scalable and results in soft lockups with large numbers of threads.
> > This patch fixes a real bug. Second, this only happens "on every tick"
> > for processes that have more than one thread _and_ that use posix
> > interval timers. Roland and I went to some effort to keep loops like
> > the on you're referring to out of the common paths.
> >
> > In any event, while this particular implementation may not be optimal,
> > at least it's _right_. Whatever happened to "make it right, then make
> > it fast?"
>
> Well, I'm not thinking you did it right ;-)
>
> While I agree that the linear loop is sub-optimal, but it only really
> becomes a problem when you have hundreds or thousands of threads in your
> application, which I'll argue to be insane anyway.

This is just not true. I've seen a very real example of a lockup with a very
sane number of threads (one per CPU), but on a very large machine (1024 CPUs
IIRC). The application set per-process CPU profiling with an interval of 1
tick, which translates to 1024 timers firing off with each tick...

Well, yes, that was broken, too, but that's the way one quite popular FORTRAN
compiler works...

Petr

2008-11-21 19:28:54

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Fri, 2008-11-21 at 19:42 +0100, Petr Tesarik wrote:
> This is just not true. I've seen a very real example of a lockup with a very
> sane number of threads (one per CPU), but on a very large machine (1024 CPUs
> IIRC). The application set per-process CPU profiling with an interval of 1
> tick, which translates to 1024 timers firing off with each tick...
>
> Well, yes, that was broken, too, but that's the way one quite popular FORTRAN
> compiler works...

Petr, have you tried the fix in the latest kernel? Does it help? I'm
intensely curious to know how it affected your runs.
--
Frank Mayhar <[email protected]>
Google, Inc.

2008-11-23 14:25:22

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Fri, 2008-11-21 at 19:42 +0100, Petr Tesarik wrote:

> > > In any event, while this particular implementation may not be optimal,
> > > at least it's _right_. Whatever happened to "make it right, then make
> > > it fast?"
> >
> > Well, I'm not thinking you did it right ;-)
> >
> > While I agree that the linear loop is sub-optimal, but it only really
> > becomes a problem when you have hundreds or thousands of threads in your
> > application, which I'll argue to be insane anyway.
>
> This is just not true. I've seen a very real example of a lockup with a very
> sane number of threads (one per CPU), but on a very large machine (1024 CPUs
> IIRC). The application set per-process CPU profiling with an interval of 1
> tick, which translates to 1024 timers firing off with each tick...
>
> Well, yes, that was broken, too, but that's the way one quite popular FORTRAN
> compiler works...

I'm not sure what side you're arguing...

The current (per-cpu) code is utterly broken on large machines too, I've
asked SGI to run some tests on real numa machines (something multi-brick
altix) and even moderately small machines with 256 cpus in them grind to
a halt (or make progress at a snails pace) when the itimer stuff is
enabled.

Furthermore, I really dislike the per-process-per-cpu memory cost, it
bloats applications and makes the new per-cpu alloc work rather more
difficult than it already is.

I basically think the whole process wide itimer stuff is broken by
design, there is no way to make it work on reasonably large machines,
the whole problem space just doesn't scale. You simply cannot maintain a
global count without bouncing cachelines like mad, so you might as well
accept it and do the process wide counter and bounce only a single line,
instead of bouncing a line per-cpu.

Furthermore, I stand by my claims that anything that runs more than a
hand-full of threads per physical core is utterly braindead and deserves
all the pain it can get. (Yes, I'm a firm believer in state machines and
don't think just throwing threads at a problem is a sane solution).

2008-11-24 08:46:53

by Petr Tesařík

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

Peter Zijlstra píše v Ne 23. 11. 2008 v 15:24 +0100:
> On Fri, 2008-11-21 at 19:42 +0100, Petr Tesarik wrote:
>
> > > > In any event, while this particular implementation may not be optimal,
> > > > at least it's _right_. Whatever happened to "make it right, then make
> > > > it fast?"
> > >
> > > Well, I'm not thinking you did it right ;-)
> > >
> > > While I agree that the linear loop is sub-optimal, but it only really
> > > becomes a problem when you have hundreds or thousands of threads in your
> > > application, which I'll argue to be insane anyway.
> >
> > This is just not true. I've seen a very real example of a lockup with a very
> > sane number of threads (one per CPU), but on a very large machine (1024 CPUs
> > IIRC). The application set per-process CPU profiling with an interval of 1
> > tick, which translates to 1024 timers firing off with each tick...
> >
> > Well, yes, that was broken, too, but that's the way one quite popular FORTRAN
> > compiler works...
>
> I'm not sure what side you're arguing...

In this particular case I'm arguing against both, it seems. The old
behaviour is broken and the new one is not better. :(

> The current (per-cpu) code is utterly broken on large machines too, I've
> asked SGI to run some tests on real numa machines (something multi-brick
> altix) and even moderately small machines with 256 cpus in them grind to
> a halt (or make progress at a snails pace) when the itimer stuff is
> enabled.
>
> Furthermore, I really dislike the per-process-per-cpu memory cost, it
> bloats applications and makes the new per-cpu alloc work rather more
> difficult than it already is.
>
> I basically think the whole process wide itimer stuff is broken by
> design, there is no way to make it work on reasonably large machines,
> the whole problem space just doesn't scale. You simply cannot maintain a
> global count without bouncing cachelines like mad, so you might as well
> accept it and do the process wide counter and bounce only a single line,
> instead of bouncing a line per-cpu.

Very true. Unfortunately per-process itimers are prescribed by the
Single Unix Specification, so we have to cope with them in some way,
while not permitting a non-privileged process a DoS attack. This is
going to be hard, and we'll probably have to twist the specification a
bit to still conform to its wording. :((

I really don't think it's a good idea to set a per-process ITIMER_PROF
to one timer tick on a large machine, but the kernel does allow any
process to do it, and then it can even cause hard freeze on some
hardware. This is _not_ acceptable.

What is worse, we can't just limit the granularity of itimers, because
threads can come into being _after_ the itimer was set.

> Furthermore, I stand by my claims that anything that runs more than a
> hand-full of threads per physical core is utterly braindead and deserves
> all the pain it can get. (Yes, I'm a firm believer in state machines and
> don't think just throwing threads at a problem is a sane solution).

Yes, anything with many threads per-core is badly designed. My point is
that it's not the only broken case.

Petr Tesarik

2008-11-24 09:33:52

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Mon, 2008-11-24 at 09:46 +0100, Petr Tesarik wrote:
> Peter Zijlstra píše v Ne 23. 11. 2008 v 15:24 +0100:
> > On Fri, 2008-11-21 at 19:42 +0100, Petr Tesarik wrote:
> >
> > > > > In any event, while this particular implementation may not be optimal,
> > > > > at least it's _right_. Whatever happened to "make it right, then make
> > > > > it fast?"
> > > >
> > > > Well, I'm not thinking you did it right ;-)
> > > >
> > > > While I agree that the linear loop is sub-optimal, but it only really
> > > > becomes a problem when you have hundreds or thousands of threads in your
> > > > application, which I'll argue to be insane anyway.
> > >
> > > This is just not true. I've seen a very real example of a lockup with a very
> > > sane number of threads (one per CPU), but on a very large machine (1024 CPUs
> > > IIRC). The application set per-process CPU profiling with an interval of 1
> > > tick, which translates to 1024 timers firing off with each tick...
> > >
> > > Well, yes, that was broken, too, but that's the way one quite popular FORTRAN
> > > compiler works...
> >
> > I'm not sure what side you're arguing...
>
> In this particular case I'm arguing against both, it seems. The old
> behaviour is broken and the new one is not better. :(

OK, then we agree ;-)

> > The current (per-cpu) code is utterly broken on large machines too, I've
> > asked SGI to run some tests on real numa machines (something multi-brick
> > altix) and even moderately small machines with 256 cpus in them grind to
> > a halt (or make progress at a snails pace) when the itimer stuff is
> > enabled.
> >
> > Furthermore, I really dislike the per-process-per-cpu memory cost, it
> > bloats applications and makes the new per-cpu alloc work rather more
> > difficult than it already is.
> >
> > I basically think the whole process wide itimer stuff is broken by
> > design, there is no way to make it work on reasonably large machines,
> > the whole problem space just doesn't scale. You simply cannot maintain a
> > global count without bouncing cachelines like mad, so you might as well
> > accept it and do the process wide counter and bounce only a single line,
> > instead of bouncing a line per-cpu.
>
> Very true. Unfortunately per-process itimers are prescribed by the
> Single Unix Specification, so we have to cope with them in some way,
> while not permitting a non-privileged process a DoS attack. This is
> going to be hard, and we'll probably have to twist the specification a
> bit to still conform to its wording. :((

Feel like reading the actual spec and trying to come up with a creative
interpretation? :-)

> I really don't think it's a good idea to set a per-process ITIMER_PROF
> to one timer tick on a large machine, but the kernel does allow any
> process to do it, and then it can even cause hard freeze on some
> hardware. This is _not_ acceptable.
>
> What is worse, we can't just limit the granularity of itimers, because
> threads can come into being _after_ the itimer was set.

Currently it has jiffy granularity, right? And jiffies are different
depending on some compile time constant (HZ), so can't we, for the sake
of per-process itimers, pretend to have a 1 minute jiffie?

That should be as compliant as we are now, and utterly useless for
everybody, thereby discouraging its use, hmm? :-)

2008-11-24 12:33:21

by Petr Tesařík

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

Peter Zijlstra píše v Po 24. 11. 2008 v 10:33 +0100:
> On Mon, 2008-11-24 at 09:46 +0100, Petr Tesarik wrote:
> > Peter Zijlstra píše v Ne 23. 11. 2008 v 15:24 +0100:
> > >[...]
> > > The current (per-cpu) code is utterly broken on large machines too, I've
> > > asked SGI to run some tests on real numa machines (something multi-brick
> > > altix) and even moderately small machines with 256 cpus in them grind to
> > > a halt (or make progress at a snails pace) when the itimer stuff is
> > > enabled.
> > >
> > > Furthermore, I really dislike the per-process-per-cpu memory cost, it
> > > bloats applications and makes the new per-cpu alloc work rather more
> > > difficult than it already is.
> > >
> > > I basically think the whole process wide itimer stuff is broken by
> > > design, there is no way to make it work on reasonably large machines,
> > > the whole problem space just doesn't scale. You simply cannot maintain a
> > > global count without bouncing cachelines like mad, so you might as well
> > > accept it and do the process wide counter and bounce only a single line,
> > > instead of bouncing a line per-cpu.
> >
> > Very true. Unfortunately per-process itimers are prescribed by the
> > Single Unix Specification, so we have to cope with them in some way,
> > while not permitting a non-privileged process a DoS attack. This is
> > going to be hard, and we'll probably have to twist the specification a
> > bit to still conform to its wording. :((
>
> Feel like reading the actual spec and trying to come up with a creative
> interpretation? :-)

Yes, I've just spent a few hours doing that... And I feel very
depressed, as expected.

> > I really don't think it's a good idea to set a per-process ITIMER_PROF
> > to one timer tick on a large machine, but the kernel does allow any
> > process to do it, and then it can even cause hard freeze on some
> > hardware. This is _not_ acceptable.
> >
> > What is worse, we can't just limit the granularity of itimers, because
> > threads can come into being _after_ the itimer was set.
>
> Currently it has jiffy granularity, right? And jiffies are different
> depending on some compile time constant (HZ), so can't we, for the sake
> of per-process itimers, pretend to have a 1 minute jiffie?
>
> That should be as compliant as we are now, and utterly useless for
> everybody, thereby discouraging its use, hmm? :-)

I've got a copy of IEEE Std 10003.1-2004 here, and it suggests that this
should be generally possible. In particular, the description for
itimer_set says:

Implementations may place limitations on the granularity of timer values. For
each interval timer, if the requested timer value requires a finer granularity
than the implementation supports, the actual timer value shall be rounded up
to the next supported value.

However, it seems to be vaguely linked to CLOCK_PROCESS_CPUTIME_ID,
which is defined as:

The identifier of the CPU-time clock associated with the process making a
clock ( ) or timer*( ) function call.

POSIX does not specify whether this clock is identical to the one used
for setitimer et al., or not, but it seems logical that it should. Then,
the kernel should probably return the coarse granularity in
clock_getres(), too.

I tried to find out how this is currently implemented in Linux, and it's
broken. How else. :-/

1. clock_getres() always returns a resolution of 1ns

This is actually good news, because it means that nobody really cares
whether the actual granularity is greater, so I guess we can safely
return any bogus number in clock_getres().

What about using an actual granularity of NR_CPUS*HZ, which should be
safe for any (at least remotely) sane usage?

2. clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) returns -EINVAL

Should not happen. Looking further into it, I think this line in
cpu_clock_sample_group():

switch (which_clock) {

should look like a similar line in cpu_clock_sample(), ie:

switch (CPUCLOCK_WHICH(which_clock)) {

Shall I send a patch?

Petr Tesarik

2008-11-24 13:00:12

[permalink] [raw]

Subject: Re: regression introduced by - timers: fix itimer/many thread hang

On Mon, 2008-11-24 at 13:32 +0100, Petr Tesarik wrote:

> > Feel like reading the actual spec and trying to come up with a creative
> > interpretation? :-)
>
> Yes, I've just spent a few hours doing that... And I feel very
> depressed, as expected.

Thanks for doing that though!

> > > I really don't think it's a good idea to set a per-process ITIMER_PROF
> > > to one timer tick on a large machine, but the kernel does allow any
> > > process to do it, and then it can even cause hard freeze on some
> > > hardware. This is _not_ acceptable.
> > >
> > > What is worse, we can't just limit the granularity of itimers, because
> > > threads can come into being _after_ the itimer was set.
> >
> > Currently it has jiffy granularity, right? And jiffies are different
> > depending on some compile time constant (HZ), so can't we, for the sake
> > of per-process itimers, pretend to have a 1 minute jiffie?
> >
> > That should be as compliant as we are now, and utterly useless for
> > everybody, thereby discouraging its use, hmm? :-)
>
> I've got a copy of IEEE Std 10003.1-2004 here, and it suggests that this
> should be generally possible. In particular, the description for
> itimer_set says:
>
> Implementations may place limitations on the granularity of timer values. For
> each interval timer, if the requested timer value requires a finer granularity
> than the implementation supports, the actual timer value shall be rounded up
> to the next supported value.
>
> However, it seems to be vaguely linked to CLOCK_PROCESS_CPUTIME_ID,
> which is defined as:
>
> The identifier of the CPU-time clock associated with the process making a
> clock ( ) or timer*( ) function call.
>
> POSIX does not specify whether this clock is identical to the one used
> for setitimer et al., or not, but it seems logical that it should. Then,
> the kernel should probably return the coarse granularity in
> clock_getres(), too.
>
> I tried to find out how this is currently implemented in Linux, and it's
> broken. How else. :-/
>
> 1. clock_getres() always returns a resolution of 1ns
>
> This is actually good news, because it means that nobody really cares
> whether the actual granularity is greater, so I guess we can safely
> return any bogus number in clock_getres().
>
> What about using an actual granularity of NR_CPUS*HZ, which should be
> safe for any (at least remotely) sane usage?

nr_cpu_ids * 1/HZ should do I guess, although a cubic function would buy
us even more slack.

> 2. clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) returns -EINVAL
>
> Should not happen. Looking further into it, I think this line in
> cpu_clock_sample_group():
>
> switch (which_clock) {
>
> should look like a similar line in cpu_clock_sample(), ie:
>
> switch (CPUCLOCK_WHICH(which_clock)) {
>
> Shall I send a patch?

Feel free - its not an area I'm intimately familiar with, I'll look into
whipping up a patch removing all the per-cpu crap from there.

2008-11-24 16:07:36