Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of the parents, if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. This behaviour has the major drawback that
it is not possible to calculate all consumed CPU time of a system by
looking at the current tasks. CPU time can be lost.
To solve this problem, this patch set duplicates the cumulative accounting
data in the signal_struct. In the second set (cdata_acct) the complete
cumulative resource counters are stored. The new cumulative CPU time (utime
and stime) is then exported via the taskstats interface.
PATCH SET OVERVIEW
------------------
Patches apply on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
[1] Introduce "struct cdata"
[2] Introduce __account_cdata() function
[3] Introduce cdata_acct for complete cumulative accounting
[4] Export "cdata_acct" with taskstats
On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> to the cumulative time of the parents, if the parents ignore SIGCHLD
> or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> it is not possible to calculate all consumed CPU time of a system by
> looking at the current tasks. CPU time can be lost.
>
> To solve this problem, this patch set duplicates the cumulative accounting
> data in the signal_struct. In the second set (cdata_acct) the complete
> cumulative resource counters are stored. The new cumulative CPU time (utime
> and stime) is then exported via the taskstats interface.
Maybe this has been treated earlier in the threads and I missed it, but
the obvious solution doesn't get mentioned:
What would break if we violate this silly POSIX rule and account time of
childs regardless of SIGCHLD/SA_NOCLDWAIT?
On 11/19, Peter Zijlstra wrote:
>
> On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> > to the cumulative time of the parents, if the parents ignore SIGCHLD
> > or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> > it is not possible to calculate all consumed CPU time of a system by
> > looking at the current tasks. CPU time can be lost.
> >
> > To solve this problem, this patch set duplicates the cumulative accounting
> > data in the signal_struct. In the second set (cdata_acct) the complete
> > cumulative resource counters are stored. The new cumulative CPU time (utime
> > and stime) is then exported via the taskstats interface.
>
> Maybe this has been treated earlier in the threads and I missed it, but
> the obvious solution doesn't get mentioned:
IIRC, the first version did this.
And it was me who spoiled this approach. But! only because I wasn't sure
this user-visible change is acceptable, and because there was some
misunderstanding. See http://marc.info/?l=linux-kernel&m=128552495203050&w=2
But,
> What would break if we
say, any test-case which does getrusage() after fork() with ignored
SIGCHLD/SA_NOCLDWAIT?.
> violate this silly POSIX rule and account time of
> childs regardless of SIGCHLD/SA_NOCLDWAIT?
+1.
Personally, I'd certainly prefer this way, because I don't care about
POSIX at all ;)
Still. Once again, this breaks the current rules, and we never do
this without strong reason.
I think we should ask Roland. If he thinks this is OK, I'd certainly
agree.
Oleg.
* Oleg Nesterov <[email protected]> [2010-11-20 16:17:11]:
> On 11/19, Peter Zijlstra wrote:
> >
> > On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> > > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> > > to the cumulative time of the parents, if the parents ignore SIGCHLD
> > > or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> > > it is not possible to calculate all consumed CPU time of a system by
> > > looking at the current tasks. CPU time can be lost.
> > >
> > > To solve this problem, this patch set duplicates the cumulative accounting
> > > data in the signal_struct. In the second set (cdata_acct) the complete
> > > cumulative resource counters are stored. The new cumulative CPU time (utime
> > > and stime) is then exported via the taskstats interface.
> >
> > Maybe this has been treated earlier in the threads and I missed it, but
> > the obvious solution doesn't get mentioned:
>
> IIRC, the first version did this.
>
> And it was me who spoiled this approach. But! only because I wasn't sure
> this user-visible change is acceptable, and because there was some
> misunderstanding. See http://marc.info/?l=linux-kernel&m=128552495203050&w=2
>
> But,
>
> > What would break if we
>
> say, any test-case which does getrusage() after fork() with ignored
> SIGCHLD/SA_NOCLDWAIT?.
>
> > violate this silly POSIX rule and account time of
> > childs regardless of SIGCHLD/SA_NOCLDWAIT?
>
> +1.
>
> Personally, I'd certainly prefer this way, because I don't care about
> POSIX at all ;)
>
>
> Still. Once again, this breaks the current rules, and we never do
> this without strong reason.
>
> I think we should ask Roland. If he thinks this is OK, I'd certainly
> agree.
>
The Linux man page states
In Linux kernel versions before 2.6.9, if the disposition of SIGCHLD
is set to SIG_IGN then the resource usages of child processes are
automatically included in the value returned by RUSAGE_CHILDREN,
although POSIX.1-2001 explicitly prohibits this. This nonconformance
is rectified in Linux 2.6.9 and later.
--
Three Cheers,
Balbir
On Fri, 2010-11-19 at 21:19 +0100, Peter Zijlstra wrote:
> On Fri, 2010-11-19 at 21:11 +0100, Michael Holzheu wrote:
> > Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> > to the cumulative time of the parents, if the parents ignore SIGCHLD
> > or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> > it is not possible to calculate all consumed CPU time of a system by
> > looking at the current tasks. CPU time can be lost.
> >
> > To solve this problem, this patch set duplicates the cumulative accounting
> > data in the signal_struct. In the second set (cdata_acct) the complete
> > cumulative resource counters are stored. The new cumulative CPU time (utime
> > and stime) is then exported via the taskstats interface.
>
> Maybe this has been treated earlier in the threads and I missed it, but
> the obvious solution doesn't get mentioned:
>
> What would break if we violate this silly POSIX rule and account time of
> childs regardless of SIGCHLD/SA_NOCLDWAIT?
Or maybe we could add a sysctl that allows to switch between the two
semantics.
Michael
On Mon, 2010-11-22 at 12:03 +0100, Michael Holzheu wrote:
> Or maybe we could add a sysctl that allows to switch between the two
> semantics.
Then patch 03/04 would be something like the following:
---
Subject: taskstats: Introduce complete cumulative accounting
From: Michael Holzheu <[email protected]>
Currently the cumulative time accounting in Linux is not complete.
Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of the parents, if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. This behaviour has the major drawback that
it is not possible to calculate all consumed CPU time of a system by
looking at the current tasks. CPU time can be lost.
This patch adds a new sysctl "kernel.full_cdata" that allows to switch
between the POSIX behavior and complete cumulative accounting.
Signed-off-by: Michael Holzheu <[email protected]>
---
include/linux/sched.h | 1 +
kernel/exit.c | 12 ++++++++----
kernel/sysctl.c | 7 +++++++
3 files changed, 16 insertions(+), 4 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1907,6 +1907,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
+extern unsigned int full_cdata_enabled;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -57,6 +57,8 @@
#include <asm/pgtable.h>
#include <asm/mmu_context.h>
+unsigned int full_cdata_enabled = 1;
+
static void exit_mm(struct task_struct * tsk);
static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -77,7 +79,7 @@ static void __unhash_process(struct task
static void __account_cdata(struct task_struct *p)
{
struct cdata *cd, *pcd, *tcd;
- unsigned long maxrss;
+ unsigned long maxrss, flags;
cputime_t tgutime, tgstime;
/*
@@ -100,7 +102,7 @@ static void __account_cdata(struct task_
* group including the group leader.
*/
thread_group_times(p, &tgutime, &tgstime);
- spin_lock_irq(&p->real_parent->sighand->siglock);
+ spin_lock_irqsave(&p->real_parent->sighand->siglock, flags);
pcd = &p->real_parent->signal->cdata_wait;
tcd = &p->signal->cdata_threads;
cd = &p->signal->cdata_wait;
@@ -137,7 +139,7 @@ static void __account_cdata(struct task_
pcd->maxrss = maxrss;
task_io_accounting_add(&p->real_parent->signal->ioac, &p->ioac);
task_io_accounting_add(&p->real_parent->signal->ioac, &p->signal->ioac);
- spin_unlock_irq(&p->real_parent->sighand->siglock);
+ spin_unlock_irqrestore(&p->real_parent->sighand->siglock, flags);
}
/*
@@ -157,6 +159,8 @@ static void __exit_signal(struct task_st
posix_cpu_timers_exit(tsk);
if (group_dead) {
+ if (full_cdata_enabled)
+ __account_cdata(tsk);
posix_cpu_timers_exit_group(tsk);
tty = sig->tty;
sig->tty = NULL;
@@ -1292,7 +1296,7 @@ static int wait_task_zombie(struct wait_
* It can be ptraced but not reparented, check
* !task_detached() to filter out sub-threads.
*/
- if (likely(!traced) && likely(!task_detached(p)))
+ if (likely(!traced) && likely(!task_detached(p)) && !full_cdata_enabled)
__account_cdata(p);
/*
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -963,6 +963,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
#endif
+ {
+ .procname = "full_cdata",
+ .data = &full_cdata_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
On Mon, 22 Nov 2010 13:47:55 +0100, Michael Holzheu said:
> Currently the cumulative time accounting in Linux is not complete.
> Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
> to the cumulative time of the parents, if the parents ignore SIGCHLD
> or have set SA_NOCLDWAIT. This behaviour has the major drawback that
> it is not possible to calculate all consumed CPU time of a system by
> looking at the current tasks. CPU time can be lost.
>
> This patch adds a new sysctl "kernel.full_cdata" that allows to switch
> between the POSIX behavior and complete cumulative accounting.
> +unsigned int full_cdata_enabled = 1;
This probably needs to default to "current kernel behavior" but allow
sysadmins to change to the new behavior.