2007-08-02 13:53:24

by Guillaume Chazarain

[permalink] [raw]
Subject: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

[Resent with different recipients as [email protected] bounced with
a User unknown]

Hi,

This patch adds all thread accounting stats for the global tgid stats.
As a shameless plug, this fixes iotop -P (http://guichaz.free.fr/misc/iotop.py).

Signed-off-by: Guillaume Chazarain <[email protected]>
---

diff -r 22708012ca6e kernel/taskstats.c
--- a/kernel/taskstats.c Tue Jul 31 21:12:07 2007 -0700
+++ b/kernel/taskstats.c Wed Aug 01 17:43:54 2007 +0200
@@ -233,6 +233,7 @@ static int fill_tgid(pid_t tgid, struct
memset(stats, 0, sizeof(*stats));

tsk = first;
+ bacct_add_tsk(stats, first);
do {
if (tsk->exit_state)
continue;
@@ -246,6 +247,7 @@ static int fill_tgid(pid_t tgid, struct

stats->nvcsw += tsk->nvcsw;
stats->nivcsw += tsk->nivcsw;
+ xacct_add_tsk(stats, tsk);
} while_each_thread(first, tsk);

unlock_task_sighand(first, &flags);
diff -r 22708012ca6e kernel/tsacct.c
--- a/kernel/tsacct.c Tue Jul 31 21:12:07 2007 -0700
+++ b/kernel/tsacct.c Wed Aug 01 17:41:40 2007 +0200
@@ -81,8 +81,8 @@ void xacct_add_tsk(struct taskstats *sta
struct mm_struct *mm;

/* convert pages-jiffies to Mbyte-usec */
- stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
- stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+ stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+ stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
mm = get_task_mm(p);
if (mm) {
/* adjust to KB unit */
@@ -90,18 +90,14 @@ void xacct_add_tsk(struct taskstats *sta
stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
mmput(mm);
}
- stats->read_char = p->rchar;
- stats->write_char = p->wchar;
- stats->read_syscalls = p->syscr;
- stats->write_syscalls = p->syscw;
+ stats->read_char += p->rchar;
+ stats->write_char += p->wchar;
+ stats->read_syscalls += p->syscr;
+ stats->write_syscalls += p->syscw;
#ifdef CONFIG_TASK_IO_ACCOUNTING
- stats->read_bytes = p->ioac.read_bytes;
- stats->write_bytes = p->ioac.write_bytes;
- stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
-#else
- stats->read_bytes = 0;
- stats->write_bytes = 0;
- stats->cancelled_write_bytes = 0;
+ stats->read_bytes += p->ioac.read_bytes;
+ stats->write_bytes += p->ioac.write_bytes;
+ stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
#endif
}
#undef KB


2007-08-02 19:04:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

On Thu, 2 Aug 2007 15:53:13 +0200
"Guillaume Chazarain" <[email protected]> wrote:

> [Resent with different recipients as [email protected] bounced with
> a User unknown]
>
> Hi,
>
> This patch adds all thread accounting stats for the global tgid stats.
> As a shameless plug, this fixes iotop -P (http://guichaz.free.fr/misc/iotop.py).

uhm, that's a fairly skimpy changelog for a fairly significant functional
change.

> Signed-off-by: Guillaume Chazarain <[email protected]>
> ---
>
> diff -r 22708012ca6e kernel/taskstats.c
> --- a/kernel/taskstats.c Tue Jul 31 21:12:07 2007 -0700
> +++ b/kernel/taskstats.c Wed Aug 01 17:43:54 2007 +0200
> @@ -233,6 +233,7 @@ static int fill_tgid(pid_t tgid, struct
> memset(stats, 0, sizeof(*stats));
>
> tsk = first;
> + bacct_add_tsk(stats, first);
> do {
> if (tsk->exit_state)
> continue;
> @@ -246,6 +247,7 @@ static int fill_tgid(pid_t tgid, struct
>
> stats->nvcsw += tsk->nvcsw;
> stats->nivcsw += tsk->nivcsw;
> + xacct_add_tsk(stats, tsk);
> } while_each_thread(first, tsk);
>
> unlock_task_sighand(first, &flags);
> diff -r 22708012ca6e kernel/tsacct.c
> --- a/kernel/tsacct.c Tue Jul 31 21:12:07 2007 -0700
> +++ b/kernel/tsacct.c Wed Aug 01 17:41:40 2007 +0200
> @@ -81,8 +81,8 @@ void xacct_add_tsk(struct taskstats *sta
> struct mm_struct *mm;
>
> /* convert pages-jiffies to Mbyte-usec */
> - stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
> - stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
> + stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
> + stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
> mm = get_task_mm(p);
> if (mm) {
> /* adjust to KB unit */
> @@ -90,18 +90,14 @@ void xacct_add_tsk(struct taskstats *sta
> stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;

How come hiwater_rss and hiwater_vm weren't similarly treated?

> mmput(mm);
> }
> - stats->read_char = p->rchar;
> - stats->write_char = p->wchar;
> - stats->read_syscalls = p->syscr;
> - stats->write_syscalls = p->syscw;
> + stats->read_char += p->rchar;
> + stats->write_char += p->wchar;
> + stats->read_syscalls += p->syscr;
> + stats->write_syscalls += p->syscw;
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> - stats->read_bytes = p->ioac.read_bytes;
> - stats->write_bytes = p->ioac.write_bytes;
> - stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
> -#else
> - stats->read_bytes = 0;
> - stats->write_bytes = 0;
> - stats->cancelled_write_bytes = 0;
> + stats->read_bytes += p->ioac.read_bytes;
> + stats->write_bytes += p->ioac.write_bytes;
> + stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
> #endif

Suitable cc's added. Guys, please have a think about the implications of
this change? In particular, why didn't we do this on day one? It's so
obvious that I have a feeling we must have had a reason?

2007-08-15 07:17:42

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Guillaume Chazarain wrote:
> [Resent with different recipients as [email protected] bounced with
> a User unknown]
>
> Hi,
>
> This patch adds all thread accounting stats for the global tgid stats.
> As a shameless plug, this fixes iotop -P (http://guichaz.free.fr/misc/iotop.py).
>

Since this changes CSA behaviour, it would be nice to get Jay Lan
to confirm if the change looks ok to him..

> Signed-off-by: Guillaume Chazarain <[email protected]>
> ---
>
> diff -r 22708012ca6e kernel/taskstats.c
> --- a/kernel/taskstats.c Tue Jul 31 21:12:07 2007 -0700
> +++ b/kernel/taskstats.c Wed Aug 01 17:43:54 2007 +0200
> @@ -233,6 +233,7 @@ static int fill_tgid(pid_t tgid, struct
> memset(stats, 0, sizeof(*stats));
>
> tsk = first;
> + bacct_add_tsk(stats, first);
> do {
> if (tsk->exit_state)
> continue;
> @@ -246,6 +247,7 @@ static int fill_tgid(pid_t tgid, struct
>
> stats->nvcsw += tsk->nvcsw;
> stats->nivcsw += tsk->nivcsw;
> + xacct_add_tsk(stats, tsk);
> } while_each_thread(first, tsk);
>

using bacct and xacct might not work well in this case. You'll end
up with basic accounting data for the thread group leader (utime,
stime. For delay accounting, we accumulate data of tasks that have
exited the thread group in fill_tgid_exit(), that way when we
sum up data, we continue storing data of all tasks that exited.
You might want to revisit the code to do that as well.

> unlock_task_sighand(first, &flags);
> diff -r 22708012ca6e kernel/tsacct.c
> --- a/kernel/tsacct.c Tue Jul 31 21:12:07 2007 -0700
> +++ b/kernel/tsacct.c Wed Aug 01 17:41:40 2007 +0200
> @@ -81,8 +81,8 @@ void xacct_add_tsk(struct taskstats *sta
> struct mm_struct *mm;
>
> /* convert pages-jiffies to Mbyte-usec */
> - stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
> - stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
> + stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
> + stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
> mm = get_task_mm(p);
> if (mm) {
> /* adjust to KB unit */
> @@ -90,18 +90,14 @@ void xacct_add_tsk(struct taskstats *sta
> stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
> mmput(mm);
> }
> - stats->read_char = p->rchar;
> - stats->write_char = p->wchar;
> - stats->read_syscalls = p->syscr;
> - stats->write_syscalls = p->syscw;
> + stats->read_char += p->rchar;
> + stats->write_char += p->wchar;
> + stats->read_syscalls += p->syscr;
> + stats->write_syscalls += p->syscw;
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> - stats->read_bytes = p->ioac.read_bytes;
> - stats->write_bytes = p->ioac.write_bytes;
> - stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
> -#else
> - stats->read_bytes = 0;
> - stats->write_bytes = 0;
> - stats->cancelled_write_bytes = 0;
> + stats->read_bytes += p->ioac.read_bytes;
> + stats->write_bytes += p->ioac.write_bytes;
> + stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
> #endif
> }
> #undef KB


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-08-19 19:39:10

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats,
not the basic and extended accounting. With this patch,
TASKSTATS_CMD_ATTR_TGID returns also the min, max or sum (depending on
the semantic of the field) of the accounting info for all threads of a
thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads (hiwater_rss and hiwater_vm).

Signed-off-by: Guillaume Chazarain <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Jay Lan <[email protected]>
Cc: Jonathan Lim <[email protected]>
---

kernel/taskstats.c | 15 ++++++++++-----
kernel/tsacct.c | 43 ++++++++++++++++++++++---------------------
2 files changed, 32 insertions(+), 26 deletions(-)

diff -r 73eff559debc kernel/taskstats.c
--- a/kernel/taskstats.c Sat Aug 18 17:15:17 2007 -0700
+++ b/kernel/taskstats.c Sun Aug 19 17:20:15 2007 +0200
@@ -246,6 +246,8 @@ static int fill_tgid(pid_t tgid, struct

stats->nvcsw += tsk->nvcsw;
stats->nivcsw += tsk->nivcsw;
+ bacct_add_tsk(stats, tsk);
+ xacct_add_tsk(stats, tsk);
} while_each_thread(first, tsk);

unlock_task_sighand(first, &flags);
@@ -265,21 +267,24 @@ static void fill_tgid_exit(struct task_s
static void fill_tgid_exit(struct task_struct *tsk)
{
unsigned long flags;
+ struct taskstats *tg_stats;

spin_lock_irqsave(&tsk->sighand->siglock, flags);
- if (!tsk->signal->stats)
+ tg_stats = tsk->signal->stats;
+ if (!tg_stats)
goto ret;

/*
* Each accounting subsystem calls its functions here to
* accumalate its per-task stats for tsk, into the per-tgid structure
*
- * per-task-foo(tsk->signal->stats, tsk);
- */
- delayacct_add_tsk(tsk->signal->stats, tsk);
+ * per-task-foo(tg_stats, tsk);
+ */
+ bacct_add_tsk(tg_stats, tsk);
+ xacct_add_tsk(tg_stats, tsk);
+ delayacct_add_tsk(tg_stats, tsk);
ret:
spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
- return;
}

static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
diff -r 73eff559debc kernel/tsacct.c
--- a/kernel/tsacct.c Sat Aug 18 17:15:17 2007 -0700
+++ b/kernel/tsacct.c Sun Aug 19 17:20:43 2007 +0200
@@ -38,8 +38,9 @@ void bacct_add_tsk(struct taskstats *sta
/* rebase elapsed time to usec */
ac_etime = timespec_to_ns(&ts);
do_div(ac_etime, NSEC_PER_USEC);
- stats->ac_etime = ac_etime;
- stats->ac_btime = get_seconds() - ts.tv_sec;
+ stats->ac_etime = max_t(u64, stats->ac_etime, ac_etime);
+ stats->ac_btime = min_t(u32, stats->ac_btime,
+ get_seconds() - ts.tv_sec);
if (thread_group_leader(tsk)) {
stats->ac_exitcode = tsk->exit_code;
if (tsk->flags & PF_FORKNOEXEC)
@@ -60,10 +61,12 @@ void bacct_add_tsk(struct taskstats *sta
stats->ac_ppid = pid_alive(tsk) ?
rcu_dereference(tsk->real_parent)->tgid : 0;
rcu_read_unlock();
- stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
- stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
- stats->ac_minflt = tsk->min_flt;
- stats->ac_majflt = tsk->maj_flt;
+ stats->ac_utime = max_t(u64, stats->ac_utime,
+ cputime_to_msecs(tsk->utime) * USEC_PER_MSEC);
+ stats->ac_stime = max_t(u64, stats->ac_stime,
+ cputime_to_msecs(tsk->stime) * USEC_PER_MSEC);
+ stats->ac_minflt = max_t(u64, stats->ac_minflt, tsk->min_flt);
+ stats->ac_majflt = max_t(u64, stats->ac_majflt, tsk->maj_flt);

strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
}
@@ -81,27 +84,25 @@ void xacct_add_tsk(struct taskstats *sta
struct mm_struct *mm;

/* convert pages-jiffies to Mbyte-usec */
- stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
- stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+ stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+ stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
mm = get_task_mm(p);
if (mm) {
/* adjust to KB unit */
- stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
- stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
+ stats->hiwater_rss = max_t(u64, stats->hiwater_rss,
+ mm->hiwater_rss * PAGE_SIZE / KB);
+ stats->hiwater_vm = max_t(u64, stats->hiwater_vm,
+ mm->hiwater_vm * PAGE_SIZE / KB);
mmput(mm);
}
- stats->read_char = p->rchar;
- stats->write_char = p->wchar;
- stats->read_syscalls = p->syscr;
- stats->write_syscalls = p->syscw;
+ stats->read_char += p->rchar;
+ stats->write_char += p->wchar;
+ stats->read_syscalls += p->syscr;
+ stats->write_syscalls += p->syscw;
#ifdef CONFIG_TASK_IO_ACCOUNTING
- stats->read_bytes = p->ioac.read_bytes;
- stats->write_bytes = p->ioac.write_bytes;
- stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
-#else
- stats->read_bytes = 0;
- stats->write_bytes = 0;
- stats->cancelled_write_bytes = 0;
+ stats->read_bytes += p->ioac.read_bytes;
+ stats->write_bytes += p->ioac.write_bytes;
+ stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
#endif
}
#undef KB

2007-08-20 17:02:15

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Guillaume Chazarain wrote:
> TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats,
> not the basic and extended accounting. With this patch,
> TASKSTATS_CMD_ATTR_TGID returns also the min, max or sum (depending on
> the semantic of the field) of the accounting info for all threads of a
> thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
> fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
> (http://guichaz.free.fr/misc/iotop.py).
>
> Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
> - Update combined stats of exited threads in fill_tgid_exit() as
> suggested by Balbir Singh.
> - Very light cleanup of fill_tgid_exit() by the way.
> - bacct fields are also combined for all threads.
> - Instead of assuming memory stats are identical for all threads, we
> take the max of all threads (hiwater_rss and hiwater_vm).
>
> Signed-off-by: Guillaume Chazarain <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Jay Lan <[email protected]>
> Cc: Jonathan Lim <[email protected]>
> ---
>
> kernel/taskstats.c | 15 ++++++++++-----
> kernel/tsacct.c | 43 ++++++++++++++++++++++---------------------
> 2 files changed, 32 insertions(+), 26 deletions(-)
>
> diff -r 73eff559debc kernel/taskstats.c
> --- a/kernel/taskstats.c Sat Aug 18 17:15:17 2007 -0700
> +++ b/kernel/taskstats.c Sun Aug 19 17:20:15 2007 +0200
> @@ -246,6 +246,8 @@ static int fill_tgid(pid_t tgid, struct
>
> stats->nvcsw += tsk->nvcsw;
> stats->nivcsw += tsk->nivcsw;
> + bacct_add_tsk(stats, tsk);
> + xacct_add_tsk(stats, tsk);

I'm afraid this is still not good enough. bacct_add_tsk() will assign
values and do nothing in the loop (HINT: no summation). Ideally,
we should split up bacct_add_tsk(), so that the initialization
code is split up from values that can summed.

> } while_each_thread(first, tsk);
>
> unlock_task_sighand(first, &flags);
> @@ -265,21 +267,24 @@ static void fill_tgid_exit(struct task_s
> static void fill_tgid_exit(struct task_struct *tsk)
> {
> unsigned long flags;
> + struct taskstats *tg_stats;
>
> spin_lock_irqsave(&tsk->sighand->siglock, flags);
> - if (!tsk->signal->stats)
> + tg_stats = tsk->signal->stats;
> + if (!tg_stats)
> goto ret;
>
> /*
> * Each accounting subsystem calls its functions here to
> * accumalate its per-task stats for tsk, into the per-tgid structure
> *
> - * per-task-foo(tsk->signal->stats, tsk);
> - */
> - delayacct_add_tsk(tsk->signal->stats, tsk);
> + * per-task-foo(tg_stats, tsk);
> + */
> + bacct_add_tsk(tg_stats, tsk);

Ditto.. see above

> + xacct_add_tsk(tg_stats, tsk);
> + delayacct_add_tsk(tg_stats, tsk);
> ret:
> spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
> - return;
> }
>
> static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
> diff -r 73eff559debc kernel/tsacct.c
> --- a/kernel/tsacct.c Sat Aug 18 17:15:17 2007 -0700
> +++ b/kernel/tsacct.c Sun Aug 19 17:20:43 2007 +0200
> @@ -38,8 +38,9 @@ void bacct_add_tsk(struct taskstats *sta
> /* rebase elapsed time to usec */
> ac_etime = timespec_to_ns(&ts);
> do_div(ac_etime, NSEC_PER_USEC);
> - stats->ac_etime = ac_etime;
> - stats->ac_btime = get_seconds() - ts.tv_sec;
> + stats->ac_etime = max_t(u64, stats->ac_etime, ac_etime);
> + stats->ac_btime = min_t(u32, stats->ac_btime,
> + get_seconds() - ts.tv_sec);

I am not sure why we get the max value, ideally we should
be summing up utime, stime into the stats structure.


Could you please write a simple test case and ensure that the
summed up stats are indeed correct for TGID's?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-08-25 15:10:57

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Le Mon, 20 Aug 2007 22:31:08 +0530,
Balbir Singh <[email protected]> a écrit :

> > --- a/kernel/taskstats.c Sat Aug 18 17:15:17 2007 -0700
> > +++ b/kernel/taskstats.c Sun Aug 19 17:20:15 2007 +0200
> > @@ -246,6 +246,8 @@ static int fill_tgid(pid_t tgid, struct
> >
> > stats->nvcsw += tsk->nvcsw;
> > stats->nivcsw += tsk->nivcsw;
> > + bacct_add_tsk(stats, tsk);
> > + xacct_add_tsk(stats, tsk);
>
> I'm afraid this is still not good enough. bacct_add_tsk() will assign
> values and do nothing in the loop (HINT: no summation).

Hi Balbir, thank you for your review. I agree with everything you said
and am on my way to do it as time permits, but I have some trouble
understanding this part. You stated that bacct_add_tsk() would overwrite
the stats of each thread in the tgid stats, but the other part of the
patch is the (actually wrong) combination of stats in xxx_add_tsk()
using min/max/sum.

Also, I don't understand why the code to update btime:

/* calculate task elapsed time in timespec */
do_posix_clock_monotonic_gettime(&uptime);
ts = timespec_sub(uptime, tsk->start_time);
...
stats->ac_btime = get_seconds() - ts.tv_sec;

does not simply use tsk->start_time or tsk->real_start_time without
comparing it to the current time.

Thanks.

--
Guillaume

2007-08-26 04:59:01

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Guillaume Chazarain wrote:
> Le Mon, 20 Aug 2007 22:31:08 +0530,
> Balbir Singh <[email protected]> a écrit :
>
>>> --- a/kernel/taskstats.c Sat Aug 18 17:15:17 2007 -0700
>>> +++ b/kernel/taskstats.c Sun Aug 19 17:20:15 2007 +0200
>>> @@ -246,6 +246,8 @@ static int fill_tgid(pid_t tgid, struct
>>>
>>> stats->nvcsw += tsk->nvcsw;
>>> stats->nivcsw += tsk->nivcsw;
>>> + bacct_add_tsk(stats, tsk);
>>> + xacct_add_tsk(stats, tsk);
>> I'm afraid this is still not good enough. bacct_add_tsk() will assign
>> values and do nothing in the loop (HINT: no summation).
>
> Hi Balbir, thank you for your review. I agree with everything you said
> and am on my way to do it as time permits, but I have some trouble
> understanding this part. You stated that bacct_add_tsk() would overwrite
> the stats of each thread in the tgid stats, but the other part of the
> patch is the (actually wrong) combination of stats in xxx_add_tsk()
> using min/max/sum.
>

Hi, Guillaume,

The CSA code was written by Jay Lan, but I'll see if I can answer your
questions. In the current implementation, CSA does not use TGID exit
notifications. In fill_pid(), both bacct_add_tsk() and xacct_add_tsk()
are called (which is correct), they complement each other.

The code needs refactoring to ensure that they can work together
in the fill_tgid() scenario.

> Also, I don't understand why the code to update btime:
>
> /* calculate task elapsed time in timespec */
> do_posix_clock_monotonic_gettime(&uptime);
> ts = timespec_sub(uptime, tsk->start_time);
> ...
> stats->ac_btime = get_seconds() - ts.tv_sec;
>
> does not simply use tsk->start_time or tsk->real_start_time without
> comparing it to the current time.
>

>From what I understand, task->start_time and task->real_start_time
are taken from the realtime clock. The accounting in CSA seems
to be very similar to the accounting done in do_acct_process()
(kernel/acct.c).

Jay/Jonathan any comments?


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-08-26 09:45:26

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Le Sun, 26 Aug 2007 10:28:44 +0530,
Balbir Singh <[email protected]> a écrit :

> From what I understand, task->start_time and task->real_start_time
> are taken from the realtime clock. The accounting in CSA seems
> to be very similar to the accounting done in do_acct_process()
> (kernel/acct.c).
>
> Jay/Jonathan any comments?

Thanks, I see it is used to have a wall time version of btime.
Here is my current version of the refactoring for information.
The testcase is coming.

diff -r 27877ed82082 include/linux/tsacct_kern.h
--- a/include/linux/tsacct_kern.h Sat Aug 25 22:17:19 2007 +0200
+++ b/include/linux/tsacct_kern.h Sat Aug 25 21:48:51 2007 +0200
@@ -10,17 +10,23 @@
#include <linux/taskstats.h>

#ifdef CONFIG_TASKSTATS
-extern void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk);
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task);
#else
-static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static inline void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
+static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{}
#endif /* CONFIG_TASKSTATS */

#ifdef CONFIG_TASK_XACCT
-extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
extern void acct_update_integrals(struct task_struct *tsk);
extern void acct_clear_integrals(struct task_struct *tsk);
#else
+static inline void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
{}
static inline void acct_update_integrals(struct task_struct *tsk)
diff -r 27877ed82082 kernel/taskstats.c
--- a/kernel/taskstats.c Sat Aug 25 22:17:19 2007 +0200
+++ b/kernel/taskstats.c Sat Aug 25 22:11:23 2007 +0200
@@ -168,6 +168,55 @@ static void send_cpu_listeners(struct sk
up_write(&listeners->sem);
}

+/*
+ * Common stats for each thread in the thread group, @task is not necessarily
+ * a thread group leader.
+ */
+static void fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to initialize
+ * relevant parts of struct taskstsats for a single tgid as follows:
+ *
+ * per-task-foo-fill_threadgroup(stats, task);
+ */
+
+ stats->version = TASKSTATS_VERSION;
+
+ /* fill in basic acct fields */
+ bacct_fill_threadgroup(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_fill_threadgroup(stats, task);
+}
+
+/*
+ * Stats specific to each thread in the thread group. add_tsk() works in
+ * conjunction with fill_threadgroup(), both may write to the same field, so
+ * the ordering of these two calls indicates if the caller wants stats for the
+ * whole thread group or a specific thread.
+ */
+static void add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to combine
+ * relevant parts of struct taskstsats for a single pid as follows:
+ *
+ * per-task-foo-add_tsk(stats, task);
+ */
+ stats->nvcsw += task->nvcsw;
+ stats->nivcsw += task->nivcsw;
+
+ /* fill in delay acct fields */
+ delayacct_add_tsk(stats, task);
+
+ /* fill in basic acct fields */
+ bacct_add_tsk(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_add_tsk(stats, task);
+}
+
static int fill_pid(pid_t pid, struct task_struct *tsk,
struct taskstats *stats)
{
@@ -185,23 +234,8 @@ static int fill_pid(pid_t pid, struct ta
get_task_struct(tsk);

memset(stats, 0, sizeof(*stats));
- /*
- * Each accounting subsystem adds calls to its functions to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
-
- delayacct_add_tsk(stats, tsk);
-
- /* fill in basic acct fields */
- stats->version = TASKSTATS_VERSION;
- stats->nvcsw = tsk->nvcsw;
- stats->nivcsw = tsk->nivcsw;
- bacct_add_tsk(stats, tsk);
-
- /* fill in extended acct fields */
- xacct_add_tsk(stats, tsk);
+ fill_threadgroup(stats, tsk);
+ add_tsk(stats, tsk);

/* Define err: label here if needed */
put_task_struct(tsk);
@@ -233,31 +267,16 @@ static int fill_tgid(pid_t tgid, struct
memset(stats, 0, sizeof(*stats));

tsk = first;
- do {
- if (tsk->exit_state)
- continue;
- /*
- * Accounting subsystem can call its functions here to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
- delayacct_add_tsk(stats, tsk);
-
- stats->nvcsw += tsk->nvcsw;
- stats->nivcsw += tsk->nivcsw;
- } while_each_thread(first, tsk);
-
+ do
+ if (!tsk->exit_state)
+ add_tsk(stats, tsk);
+ while_each_thread(first, tsk);
+
+ fill_threadgroup(stats, first);
unlock_task_sighand(first, &flags);
rc = 0;
out:
rcu_read_unlock();
-
- stats->version = TASKSTATS_VERSION;
- /*
- * Accounting subsytems can also add calls here to modify
- * fields of taskstats.
- */
return rc;
}

@@ -265,21 +284,16 @@ static void fill_tgid_exit(struct task_s
static void fill_tgid_exit(struct task_struct *tsk)
{
unsigned long flags;
+ struct taskstats *tg_stats;

spin_lock_irqsave(&tsk->sighand->siglock, flags);
- if (!tsk->signal->stats)
- goto ret;
-
- /*
- * Each accounting subsystem calls its functions here to
- * accumalate its per-task stats for tsk, into the per-tgid structure
- *
- * per-task-foo(tsk->signal->stats, tsk);
- */
- delayacct_add_tsk(tsk->signal->stats, tsk);
-ret:
+ tg_stats = tsk->signal->stats;
+
+ /* fill_threadgroup() will be called by the stats requester */
+ if (tg_stats)
+ add_tsk(tg_stats, tsk);
+
spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
- return;
}

static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
diff -r 27877ed82082 kernel/tsacct.c
--- a/kernel/tsacct.c Sat Aug 25 22:17:19 2007 +0200
+++ b/kernel/tsacct.c Sat Aug 25 21:48:51 2007 +0200
@@ -22,50 +22,71 @@
#include <linux/acct.h>
#include <linux/jiffies.h>

-/*
- * fill in basic accounting fields
- */
-void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static void fill_wall_time(struct taskstats *stats, struct task_struct *task)
{
struct timespec uptime, ts;
s64 ac_etime;

+ /* calculate task elapsed time in timespec */
+ do_posix_clock_monotonic_gettime(&uptime);
+ ts = timespec_sub(uptime, task->start_time);
+
+ stats->ac_btime = get_seconds() - ts.tv_sec;
+
+ ac_etime = timespec_to_ns(&ts);
+ do_div(ac_etime, NSEC_PER_USEC);
+ stats->ac_etime = ac_etime;
+}
+
+/*
+ * fill in basic accounting fields
+ */
+
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ struct task_struct *leader;
+
BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);

- /* calculate task elapsed time in timespec */
- do_posix_clock_monotonic_gettime(&uptime);
- ts = timespec_sub(uptime, tsk->start_time);
- /* rebase elapsed time to usec */
- ac_etime = timespec_to_ns(&ts);
- do_div(ac_etime, NSEC_PER_USEC);
- stats->ac_etime = ac_etime;
- stats->ac_btime = get_seconds() - ts.tv_sec;
- if (thread_group_leader(tsk)) {
- stats->ac_exitcode = tsk->exit_code;
- if (tsk->flags & PF_FORKNOEXEC)
+ rcu_read_lock();
+ stats->ac_ppid = pid_alive(task) ?
+ rcu_dereference(task->real_parent)->tgid : 0;
+ rcu_read_unlock();
+
+ leader = task->group_leader;
+ get_task_struct(leader);
+ fill_wall_time(stats, leader);
+ put_task_struct(leader);
+
+ stats->ac_nice = task_nice(task);
+ stats->ac_sched = task->policy;
+ stats->ac_uid = task->uid;
+ stats->ac_gid = task->gid;
+ stats->ac_pid = task->pid;
+
+ strncpy(stats->ac_comm, task->comm, sizeof(stats->ac_comm));
+}
+
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ if (thread_group_leader(task)) {
+ stats->ac_exitcode = task->exit_code;
+ if (task->flags & PF_FORKNOEXEC)
stats->ac_flag |= AFORK;
}
- if (tsk->flags & PF_SUPERPRIV)
+ if (task->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
- if (tsk->flags & PF_DUMPCORE)
+ if (task->flags & PF_DUMPCORE)
stats->ac_flag |= ACORE;
- if (tsk->flags & PF_SIGNALED)
+ if (task->flags & PF_SIGNALED)
stats->ac_flag |= AXSIG;
- stats->ac_nice = task_nice(tsk);
- stats->ac_sched = tsk->policy;
- stats->ac_uid = tsk->uid;
- stats->ac_gid = tsk->gid;
- stats->ac_pid = tsk->pid;
- rcu_read_lock();
- stats->ac_ppid = pid_alive(tsk) ?
- rcu_dereference(tsk->real_parent)->tgid : 0;
- rcu_read_unlock();
- stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
- stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
- stats->ac_minflt = tsk->min_flt;
- stats->ac_majflt = tsk->maj_flt;

- strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ fill_wall_time(stats, task);
+
+ stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
+ stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
+ stats->ac_minflt += task->min_flt;
+ stats->ac_majflt += task->maj_flt;
}


@@ -76,32 +97,34 @@ void bacct_add_tsk(struct taskstats *sta
/*
* fill in extended accounting fields
*/
-void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
{
struct mm_struct *mm;

- /* convert pages-jiffies to Mbyte-usec */
- stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
- stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
- mm = get_task_mm(p);
+ mm = get_task_mm(task);
if (mm) {
/* adjust to KB unit */
stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
- stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
+ stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
mmput(mm);
}
- stats->read_char = p->rchar;
- stats->write_char = p->wchar;
- stats->read_syscalls = p->syscr;
- stats->write_syscalls = p->syscw;
+}
+
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+{
+ /* convert pages-jiffies to Mbyte-usec */
+ stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+ stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+
+ stats->read_char += p->rchar;
+ stats->write_char += p->wchar;
+ stats->read_syscalls += p->syscr;
+ stats->write_syscalls += p->syscw;
+
#ifdef CONFIG_TASK_IO_ACCOUNTING
- stats->read_bytes = p->ioac.read_bytes;
- stats->write_bytes = p->ioac.write_bytes;
- stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
-#else
- stats->read_bytes = 0;
- stats->write_bytes = 0;
- stats->cancelled_write_bytes = 0;
+ stats->read_bytes += p->ioac.read_bytes;
+ stats->write_bytes += p->ioac.write_bytes;
+ stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
#endif
}
#undef KB

--
Guillaume

2007-08-31 03:01:54

by Jonathan Lim

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

On Sat Aug 25 21:58:44 2007, [email protected] wrote:
>
> > Also, I don't understand why the code to update btime:
> >
> > /* calculate task elapsed time in timespec */
> > do_posix_clock_monotonic_gettime(&uptime);
> > ts = timespec_sub(uptime, tsk->start_time);
> > ...
> > stats->ac_btime = get_seconds() - ts.tv_sec;
> >
> > does not simply use tsk->start_time or tsk->real_start_time without
> > comparing it to the current time.
>
> From what I understand, task->start_time and task->real_start_time
> are taken from the realtime clock. The accounting in CSA seems
> to be very similar to the accounting done in do_acct_process()
> (kernel/acct.c).

In CSA 3.0 ...

csa_acct_eop(int exitcode, struct task_struct *p)

csa->ac_btime = boottime +
((p->start_time.tv_nsec < NSEC_PER_SEC/2) ?
p->start_time.tv_sec :
p->start_time.tv_sec +1);

where

do_posix_clock_monotonic_gettime(&uptime);
boottime = xtime.tv_sec - uptime.tv_sec;

In an upcoming version of CSA ...

csa_acct_eop(struct taskstats *p)

csa->ac_btime = p->ac_btime;

where

do_posix_clock_monotonic_gettime(&uptime);
ts = uptime - tsk->start_time;
p->ac_btime = get_seconds() - ts.tv_sec;
= xtime.tv_sec - (uptime - tsk->start_time);
= (xtime.tv_sec - uptime) + tsk->start_time;

So they're basically equivalent.

Jonathan

2007-08-31 07:25:19

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Jonathan Lim wrote:
> On Sat Aug 25 21:58:44 2007, [email protected] wrote:
>>> Also, I don't understand why the code to update btime:
>>>
>>> /* calculate task elapsed time in timespec */
>>> do_posix_clock_monotonic_gettime(&uptime);
>>> ts = timespec_sub(uptime, tsk->start_time);
>>> ...
>>> stats->ac_btime = get_seconds() - ts.tv_sec;
>>>
>>> does not simply use tsk->start_time or tsk->real_start_time without
>>> comparing it to the current time.
>> From what I understand, task->start_time and task->real_start_time
>> are taken from the realtime clock. The accounting in CSA seems
>> to be very similar to the accounting done in do_acct_process()
>> (kernel/acct.c).
>
> In CSA 3.0 ...
>
> csa_acct_eop(int exitcode, struct task_struct *p)
>
> csa->ac_btime = boottime +
> ((p->start_time.tv_nsec < NSEC_PER_SEC/2) ?
> p->start_time.tv_sec :
> p->start_time.tv_sec +1);
>
> where
>
> do_posix_clock_monotonic_gettime(&uptime);
> boottime = xtime.tv_sec - uptime.tv_sec;
>
> In an upcoming version of CSA ...
>
> csa_acct_eop(struct taskstats *p)
>
> csa->ac_btime = p->ac_btime;
>
> where
>
> do_posix_clock_monotonic_gettime(&uptime);
> ts = uptime - tsk->start_time;
> p->ac_btime = get_seconds() - ts.tv_sec;
> = xtime.tv_sec - (uptime - tsk->start_time);
> = (xtime.tv_sec - uptime) + tsk->start_time;
>
> So they're basically equivalent.

Excellent, so can Guillaume change ac_btime to be just tsk->start_time?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-08-31 12:36:34

by Guillaume Chazarain

[permalink] [raw]
Subject: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v3)

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats,
not the basic and extended accounting. With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a
thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Here is the output of the testcase before the patch:

Name | System| User| Cached I/O| Self| Group|
--------------+------------+------------+------------+------------+------------+
version | 5| 5| 5| 5| 5|
ac_exitcode | 0| 0| 0| 0| 0|
ac_flag | 0| 0| 0| 0| 0|
ac_nice | 5| 10| 15| 0| 0|
cpu_count | 2020| 1378| 360| 70645| 74428|
cpu_delay_tota| 2327| 14281152| 14225866| 30475| 44505841|
blkio_delay_to| 0| 0| 0| 0| 1146901308|
swapin_count | 0| 0| 0| 0| 0|
swapin_delay_t| 0| 0| 0| 0| 0|
cpu_run_real_t| 4794271160| 1269806960| 331949536| 44993160| 6562002424|
cpu_run_virtua| 4792937602| 1247020977| 330330065| 43082258| 6530204192|
ac_comm | cached| cached| cached| cached| |
ac_uid | 500| 500| 500| 500| 0|
ac_gid | 500| 500| 500| 500| 0|
ac_pid | 2513| 2514| 2515| 2512| 0|
ac_ppid | 2445| 2445| 2445| 2445| 0|
ac_btime | 1188326011| 1188326012| 1188326013| 1188326011| 0|
ac_etime | 6619382| 5615437| 4600494| 6622066| 0|
ac_utime | 25000| 1264000| 20000| 2000| 0|
ac_stime | 4770000| 6000| 312000| 43000| 0|
ac_minflt | 2| 0| 1| 320| 0|
ac_majflt | 0| 0| 0| 0| 0|
coremem | 3436| 286| 2302| 1285| 0|
virtmem | 4012| 301| 740| 2198| 0|
hiwater_rss | 756| 756| 756| 756| 0|
hiwater_vm | 35004| 35004| 35004| 35004| 0|
read_char | 208877984| 0| 825517700| 2807| 0|
write_char | 0| 0| 0| 0| 0|
read_syscalls | 244730| 0| 215782| 10| 0|
write_syscalls| 0| 0| 0| 0| 0|
read_bytes | 0| 0| 0| 0| 0|
write_bytes | 0| 0| 0| 0| 0|
cancelled_writ| 0| 0| 0| 0| 0|
nvcsw | 0| 0| 0| 8| 8|
nivcsw | 2019| 1378| 360| 9| 3766|

and after the patch:

Name | System| User| Cached I/O| Self| Group|
--------------+------------+------------+------------+------------+------------+
version | 5| 5| 5| 5| 5|
ac_exitcode | 0| 0| 0| 0| 0|
ac_flag | 0| 0| 0| 0| 0|
ac_nice | 5| 10| 15| 0| 0|
cpu_count | 1716| 1189| 293| 72795| 76036|
cpu_delay_tota| 3635| 15921729| 15269951| 277683| 46143870|
blkio_delay_to| 0| 0| 0| 113237372| 113237372|
swapin_count | 0| 0| 0| 0| 0|
swapin_delay_t| 0| 0| 0| 0| 0|
cpu_run_real_t| 4338340472| 1117830064| 284956680| 49992400| 5915100768|
cpu_run_virtua| 4333641573| 1099471940| 282075705| 46039754| 5882048860|
ac_comm | cached| cached| cached| cached| cached|
ac_uid | 500| 500| 500| 500| 500|
ac_gid | 500| 500| 500| 500| 500|
ac_pid | 7144| 7145| 7146| 7143| 7143|
ac_ppid | 2435| 2435| 2435| 2435| 2435|
ac_btime | 1188563357| 1188563358| 1188563359| 1188563356| 1188563356|
ac_etime | 5960119| 4954580| 3938620| 6078187| 6078199|
ac_utime | 33000| 1113000| 35000| 4000| 1193000|
ac_stime | 4306000| 5000| 250000| 46000| 4723000|
ac_minflt | 2| 0| 1| 317| 323|
ac_majflt | 0| 0| 0| 3| 3|
coremem | 363| 3617| 2127| 2320| 11249|
virtmem | 1973| 3546| 2144| 151| 8182|
hiwater_rss | 756| 756| 756| 756| 756|
hiwater_vm | 35004| 35004| 35004| 35004| 35004|
read_char | 188200698| 0| 696308768| 2807| 884512273|
write_char | 0| 0| 0| 0| 33554432|
read_syscalls | 221608| 0| 182008| 10| 403626|
write_syscalls| 0| 0| 0| 0| 32768|
read_bytes | 0| 0| 0| 323584| 323584|
write_bytes | 0| 0| 0| 0| 33554432|
cancelled_writ| 0| 0| 0| 0| 0|
nvcsw | 0| 0| 0| 20| 22|
nivcsw | 1715| 1189| 293| 12| 3249|

The TGID stats now contain all the fields.

Changelog since V2 (http://lkml.org/lkml/2007/8/19/96):
- Added a testcase
- Added an indirection between the stats producer and consumer:
add_task() & fill_threadgroup()
- TGID stats are either summed from all the threads or taken from the leader

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads.

Signed-off-by: Guillaume Chazarain <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Jay Lan <[email protected]>
Cc: Jonathan Lim <[email protected]>
---

Documentation/accounting/dump-test.c | 314 +++++++++++++++++++++++++++++++++++
kernel/taskstats.c | 125 ++++++++-----
kernel/tsacct.c | 108 ++++++------
3 files changed, 452 insertions(+), 95 deletions(-)

diff -r 22a60bb0ecf2 Documentation/accounting/dump-test.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/Documentation/accounting/dump-test.c Fri Aug 31 13:32:15 2007 +0200
@@ -0,0 +1,314 @@
+/*
+ * Create some threads with specific behaviours and dump all of their stats.
+ * This needs a libnl recent enough to support Generic Netlink, made with
+ * svn r102 of libnl.
+ *
+ * Compile with:
+ * cc -Wall -I../../include -lnl -lpthread dump-test.c -o dump-test
+ */
+
+#define _GNU_SOURCE /* for strndup() */
+#include <sys/syscall.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/types.h>
+#include <linux/taskstats.h>
+#include <linux/genetlink.h>
+#include <netlink/genl/genl.h>
+#include <netlink/genl/ctrl.h>
+
+/* Utility functions for the created threads */
+
+static void loop_reading(const char *filename)
+{
+ int fd = open(filename, O_RDONLY);
+ char buffer[4096];
+
+ if (fd < 0) {
+ perror(filename);
+ return;
+ }
+
+ for (;;) {
+ lseek(fd, 0, SEEK_SET);
+ while (read(fd, buffer, sizeof(buffer)) > 0) ;
+ }
+}
+
+/*
+ * We want to know the TID of the threads we launch, not to complicate the code
+ * with a real synchronization.
+ */
+static volatile pid_t tid;
+static void set_tid(void)
+{
+ if (tid) {
+ fputs("Unexpected non null tid\n", stderr);
+ exit(1);
+ }
+
+ tid = syscall(SYS_gettid);
+}
+
+static pid_t start_thread(void (*fun) (void), int wait)
+{
+ pthread_t thread;
+
+ tid = 0;
+ if (pthread_create(&thread, NULL, (void *(*)(void *))fun, NULL) < 0) {
+ perror("pthread_create");
+ exit(1);
+ }
+
+ /* Busy wait for the thread to update its tid. */
+ while (!tid)
+ sched_yield();
+
+ if (wait && pthread_join(thread, NULL) < 0) {
+ perror("pthread_join");
+ exit(1);
+ }
+
+ /* Space the starting time of the threads */
+ sleep(1);
+
+ return tid;
+}
+
+/* The created threads */
+
+/*
+ * Should loop in kernel mode
+ */
+static void niced_sys_thread(void)
+{
+ set_tid();
+ nice(5);
+ loop_reading("/proc/self/maps");
+}
+
+/*
+ * Should loop in user mode
+ */
+static void niced_user_thread(void)
+{
+ set_tid();
+ nice(10);
+ for (;;) ;
+}
+
+/*
+ * Should show some I/O syscalls
+ */
+static void cached_io_thread(void)
+{
+ set_tid();
+ nice(15);
+ loop_reading("/bin/ls");
+}
+
+/*
+ * This thread terminates after doing some real I/O so it should be visible in
+ * the thread group stats.
+ */
+#define TMP_PATH "/tmp/taskstats.test.tmp"
+static void dead_thread_having_written(void)
+{
+ int fd, i;
+ char buffer[1024];
+
+ set_tid();
+ fd = creat(TMP_PATH, 0600);
+ if (fd < 0) {
+ perror(TMP_PATH);
+ return;
+ }
+
+ /* Write 32M */
+ for (i = 0; i < 32 * 1024; i++) {
+ int remaining = sizeof(buffer);
+ while (remaining) {
+ int written = write(fd, buffer, remaining);
+ if (written > 0)
+ remaining -= written;
+ else {
+ perror("write");
+ close(fd);
+ return;
+ }
+ }
+ }
+
+ if (fsync(fd) < 0)
+ perror("fsync");
+
+ if (close(fd) < 0)
+ perror("close");
+
+ if (unlink(TMP_PATH))
+ perror("unlink");
+}
+
+/* taskstats reading using libnl */
+
+static int taskstats_netlink_family;
+static struct nl_handle *netlink_handle;
+
+static void init_taskstats(void)
+{
+ netlink_handle = nl_handle_alloc();
+ if (!netlink_handle) {
+ nl_perror("nl_handle_alloc");
+ exit(1);
+ }
+
+ if (genl_connect(netlink_handle) < 0) {
+ nl_perror("genl_connect");
+ exit(1);
+ }
+
+ taskstats_netlink_family = genl_ctrl_resolve(netlink_handle,
+ TASKSTATS_GENL_NAME);
+ if (taskstats_netlink_family < 0) {
+ nl_perror("genl_ctrl_resolve(TASKSTATS_GENL_NAME)");
+ exit(1);
+ }
+}
+
+static struct taskstats get_taskstats(int attr, int pid)
+{
+ struct taskstats stats;
+ int len;
+ struct nl_msg *request;
+ void *request_header;
+ unsigned char *buf;
+ struct nlattr *nlattr, *reply;
+
+ request = nlmsg_alloc();
+ if (request == NULL) {
+ nl_perror("nlmsg_alloc");
+ exit(1);
+ }
+
+ request_header = genlmsg_put(request, 0, 0, taskstats_netlink_family,
+ 0, NLM_F_REQUEST, TASKSTATS_CMD_GET,
+ TASKSTATS_GENL_VERSION);
+ if (request_header == NULL) {
+ nl_perror("genlmsg_put");
+ exit(1);
+ }
+
+ if (nla_put_u32(request, attr, pid) < 0) {
+ nl_perror("nla_put_u32");
+ exit(1);
+ }
+
+ len = nl_send_auto_complete(netlink_handle, request);
+ if (len < 0) {
+ nl_perror("nl_send_auto_complete");
+ exit(1);
+ }
+ nlmsg_free(request);
+
+ len = nl_recv(netlink_handle, NULL, &buf, NULL);
+ if (len < 0) {
+ nl_perror("nl_recv");
+ exit(1);
+ }
+ if (nl_wait_for_ack(netlink_handle) < 0) {
+ nl_perror("nl_wait_for_ack");
+ exit(1);
+ }
+
+ len = 0;
+ reply = (struct nlattr *)buf;
+ nla_for_each_nested(nlattr, reply, len)
+ if (nlattr->nla_type == TASKSTATS_TYPE_STATS) {
+ memcpy(&stats, nla_data(nlattr), sizeof(stats));
+ free(buf);
+ return stats;
+ }
+
+ fputs("TASKSTATS_TYPE_STATS not found\n", stderr);
+ free(buf);
+ exit(1);
+}
+
+/* Output formatting */
+
+#define LINE(fo, f) do { \
+ char *stripped = strndup(#f, 14); \
+ printf("%-14s|" fo "|" fo "|" fo "|" fo "|" fo "|\n", stripped, \
+ stats[0].f, stats[1].f, stats[2].f, stats[3].f, stats[4].f); \
+ free(stripped); \
+} while (0)
+
+static void show_table(struct taskstats stats[])
+{
+ puts("Name | System| User| Cached I/O"
+ "| Self| Group|");
+ puts("--------------+------------+------------+------------"
+ "+------------+------------+");
+ LINE("%12hu", version);
+ LINE("%12u", ac_exitcode);
+ LINE("%12x", ac_flag);
+ LINE("%12u", ac_nice);
+ LINE("%12llu", cpu_count);
+ LINE("%12llu", cpu_delay_total);
+ LINE("%12llu", blkio_delay_total);
+ LINE("%12llu", swapin_count);
+ LINE("%12llu", swapin_delay_total);
+ LINE("%12llu", cpu_run_real_total);
+ LINE("%12llu", cpu_run_virtual_total);
+ LINE("%12s", ac_comm);
+ LINE("%12u", ac_uid);
+ LINE("%12u", ac_gid);
+ LINE("%12u", ac_pid);
+ LINE("%12u", ac_ppid);
+ LINE("%12u", ac_btime);
+ LINE("%12llu", ac_etime);
+ LINE("%12llu", ac_utime);
+ LINE("%12llu", ac_stime);
+ LINE("%12llu", ac_minflt);
+ LINE("%12llu", ac_majflt);
+ LINE("%12llu", coremem);
+ LINE("%12llu", virtmem);
+ LINE("%12llu", hiwater_rss);
+ LINE("%12llu", hiwater_vm);
+ LINE("%12llu", read_char);
+ LINE("%12llu", write_char);
+ LINE("%12llu", read_syscalls);
+ LINE("%12llu", write_syscalls);
+ LINE("%12llu", read_bytes);
+ LINE("%12llu", write_bytes);
+ LINE("%12llu", cancelled_write_bytes);
+ LINE("%12llu", nvcsw);
+ LINE("%12llu", nivcsw);
+}
+
+int main(int argc, char *argv[])
+{
+ struct taskstats stats[5];
+ pid_t niced_sys = start_thread(niced_sys_thread, 0);
+ pid_t niced_user = start_thread(niced_user_thread, 0);
+ pid_t cached_io = start_thread(cached_io_thread, 0);
+ start_thread(dead_thread_having_written, 1);
+
+ init_taskstats();
+
+ stats[0] = get_taskstats(TASKSTATS_CMD_ATTR_PID, niced_sys);
+ stats[1] = get_taskstats(TASKSTATS_CMD_ATTR_PID, niced_user);
+ stats[2] = get_taskstats(TASKSTATS_CMD_ATTR_PID, cached_io);
+ stats[3] = get_taskstats(TASKSTATS_CMD_ATTR_PID, getpid());
+ stats[4] = get_taskstats(TASKSTATS_CMD_ATTR_TGID, cached_io);
+
+ show_table(stats);
+
+ return 0;
+}
diff -r 22a60bb0ecf2 kernel/taskstats.c
--- a/kernel/taskstats.c Fri Aug 31 01:42:23 2007 -0700
+++ b/kernel/taskstats.c Fri Aug 31 13:36:29 2007 +0200
@@ -168,6 +168,60 @@ static void send_cpu_listeners(struct sk
up_write(&listeners->sem);
}

+/*
+ * There are two types of taskstats fields when considering a thread group:
+ * - those that can be aggregated from each thread in the group (like CPU
+ * times),
+ * - those that cannot be aggregated (like UID) or are identical (like
+ * memory usage), so are taken from the group leader.
+ * XXX_threadgroup() methods deal with the first type while XXX_add_tsk() with
+ * the second.
+ */
+static void fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to initialize
+ * relevant parts of struct taskstsats for a single tgid as follows:
+ *
+ * per-task-foo-fill_threadgroup(stats, task);
+ */
+
+ stats->version = TASKSTATS_VERSION;
+
+ /* fill in basic acct fields */
+ bacct_fill_threadgroup(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_fill_threadgroup(stats, task);
+}
+
+/*
+ * Stats specific to each thread in the thread group. Stats of @task should be
+ * combined with those already present in @stats. add_tsk() works in
+ * conjunction with fill_threadgroup(), taskstats fields should not be touched
+ * by both functions.
+ */
+static void add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to combine
+ * relevant parts of struct taskstsats for a single pid as follows:
+ *
+ * per-task-foo-add_tsk(stats, task);
+ */
+ stats->nvcsw += task->nvcsw;
+ stats->nivcsw += task->nivcsw;
+
+ /* fill in delay acct fields */
+ delayacct_add_tsk(stats, task);
+
+ /* fill in basic acct fields */
+ bacct_add_tsk(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_add_tsk(stats, task);
+}
+
static int fill_pid(pid_t pid, struct task_struct *tsk,
struct taskstats *stats)
{
@@ -185,23 +239,8 @@ static int fill_pid(pid_t pid, struct ta
get_task_struct(tsk);

memset(stats, 0, sizeof(*stats));
- /*
- * Each accounting subsystem adds calls to its functions to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
-
- delayacct_add_tsk(stats, tsk);
-
- /* fill in basic acct fields */
- stats->version = TASKSTATS_VERSION;
- stats->nvcsw = tsk->nvcsw;
- stats->nivcsw = tsk->nivcsw;
- bacct_add_tsk(stats, tsk);
-
- /* fill in extended acct fields */
- xacct_add_tsk(stats, tsk);
+ fill_threadgroup(stats, tsk);
+ add_tsk(stats, tsk);

/* Define err: label here if needed */
put_task_struct(tsk);
@@ -213,6 +252,7 @@ static int fill_tgid(pid_t tgid, struct
struct taskstats *stats)
{
struct task_struct *tsk;
+ struct task_struct *leader;
unsigned long flags;
int rc = -ESRCH;

@@ -232,32 +272,21 @@ static int fill_tgid(pid_t tgid, struct
else
memset(stats, 0, sizeof(*stats));

+ leader = first->group_leader;
+ get_task_struct(leader);
+ fill_threadgroup(stats, leader);
+ put_task_struct(leader);
+
tsk = first;
- do {
- if (tsk->exit_state)
- continue;
- /*
- * Accounting subsystem can call its functions here to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
- delayacct_add_tsk(stats, tsk);
-
- stats->nvcsw += tsk->nvcsw;
- stats->nivcsw += tsk->nivcsw;
- } while_each_thread(first, tsk);
+ do
+ if (!tsk->exit_state)
+ add_tsk(stats, tsk);
+ while_each_thread(first, tsk);

unlock_task_sighand(first, &flags);
rc = 0;
out:
rcu_read_unlock();
-
- stats->version = TASKSTATS_VERSION;
- /*
- * Accounting subsytems can also add calls here to modify
- * fields of taskstats.
- */
return rc;
}

@@ -265,21 +294,19 @@ static void fill_tgid_exit(struct task_s
static void fill_tgid_exit(struct task_struct *tsk)
{
unsigned long flags;
+ struct taskstats *tg_stats;

spin_lock_irqsave(&tsk->sighand->siglock, flags);
- if (!tsk->signal->stats)
- goto ret;
-
- /*
- * Each accounting subsystem calls its functions here to
- * accumalate its per-task stats for tsk, into the per-tgid structure
- *
- * per-task-foo(tsk->signal->stats, tsk);
- */
- delayacct_add_tsk(tsk->signal->stats, tsk);
-ret:
+ tg_stats = tsk->signal->stats;
+
+ /*
+ * fill_threadgroup() will be called by the stats requester,
+ * i.e. fill_tgid()
+ */
+ if (tg_stats)
+ add_tsk(tg_stats, tsk);
+
spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
- return;
}

static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
diff -r 22a60bb0ecf2 kernel/tsacct.c
--- a/kernel/tsacct.c Fri Aug 31 01:42:23 2007 -0700
+++ b/kernel/tsacct.c Tue Aug 28 20:35:27 2007 +0200
@@ -22,50 +22,64 @@
#include <linux/acct.h>
#include <linux/jiffies.h>

-/*
- * fill in basic accounting fields
- */
-void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static void fill_wall_times(struct taskstats *stats, struct task_struct *task)
{
struct timespec uptime, ts;
s64 ac_etime;

- BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
-
/* calculate task elapsed time in timespec */
do_posix_clock_monotonic_gettime(&uptime);
- ts = timespec_sub(uptime, tsk->start_time);
+ ts = timespec_sub(uptime, task->start_time);
/* rebase elapsed time to usec */
ac_etime = timespec_to_ns(&ts);
do_div(ac_etime, NSEC_PER_USEC);
- stats->ac_etime = ac_etime;
- stats->ac_btime = get_seconds() - ts.tv_sec;
- if (thread_group_leader(tsk)) {
- stats->ac_exitcode = tsk->exit_code;
- if (tsk->flags & PF_FORKNOEXEC)
+ stats->ac_etime = ac_etime;
+ stats->ac_btime = get_seconds() - ts.tv_sec;
+}
+
+/*
+ * fill in basic accounting fields
+ */
+
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
+
+ rcu_read_lock();
+ stats->ac_ppid = pid_alive(task) ?
+ rcu_dereference(task->real_parent)->tgid : 0;
+ rcu_read_unlock();
+
+ if (thread_group_leader(task)) {
+ stats->ac_exitcode = task->exit_code;
+ if (task->flags & PF_FORKNOEXEC)
stats->ac_flag |= AFORK;
}
- if (tsk->flags & PF_SUPERPRIV)
+
+ fill_wall_times(stats, task);
+
+ stats->ac_nice = task_nice(task);
+ stats->ac_sched = task->policy;
+ stats->ac_uid = task->uid;
+ stats->ac_gid = task->gid;
+ stats->ac_pid = task->pid;
+
+ strncpy(stats->ac_comm, task->comm, sizeof(stats->ac_comm));
+}
+
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ if (task->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
- if (tsk->flags & PF_DUMPCORE)
+ if (task->flags & PF_DUMPCORE)
stats->ac_flag |= ACORE;
- if (tsk->flags & PF_SIGNALED)
+ if (task->flags & PF_SIGNALED)
stats->ac_flag |= AXSIG;
- stats->ac_nice = task_nice(tsk);
- stats->ac_sched = tsk->policy;
- stats->ac_uid = tsk->uid;
- stats->ac_gid = tsk->gid;
- stats->ac_pid = tsk->pid;
- rcu_read_lock();
- stats->ac_ppid = pid_alive(tsk) ?
- rcu_dereference(tsk->real_parent)->tgid : 0;
- rcu_read_unlock();
- stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
- stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
- stats->ac_minflt = tsk->min_flt;
- stats->ac_majflt = tsk->maj_flt;

- strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
+ stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
+ stats->ac_minflt += task->min_flt;
+ stats->ac_majflt += task->maj_flt;
}


@@ -76,32 +90,34 @@ void bacct_add_tsk(struct taskstats *sta
/*
* fill in extended accounting fields
*/
-void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
{
struct mm_struct *mm;

- /* convert pages-jiffies to Mbyte-usec */
- stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
- stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
- mm = get_task_mm(p);
+ mm = get_task_mm(task);
if (mm) {
/* adjust to KB unit */
stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
- stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
+ stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
mmput(mm);
}
- stats->read_char = p->rchar;
- stats->write_char = p->wchar;
- stats->read_syscalls = p->syscr;
- stats->write_syscalls = p->syscw;
+}
+
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+{
+ /* convert pages-jiffies to Mbyte-usec */
+ stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+ stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+
+ stats->read_char += p->rchar;
+ stats->write_char += p->wchar;
+ stats->read_syscalls += p->syscr;
+ stats->write_syscalls += p->syscw;
+
#ifdef CONFIG_TASK_IO_ACCOUNTING
- stats->read_bytes = p->ioac.read_bytes;
- stats->write_bytes = p->ioac.write_bytes;
- stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
-#else
- stats->read_bytes = 0;
- stats->write_bytes = 0;
- stats->cancelled_write_bytes = 0;
+ stats->read_bytes += p->ioac.read_bytes;
+ stats->write_bytes += p->ioac.write_bytes;
+ stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
#endif
}
#undef KB

2007-09-07 23:37:21

by Jonathan Lim

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

On Fri Aug 31 00:24:47 2007, [email protected] wrote:
>
> Jonathan Lim wrote:
> > On Sat Aug 25 21:58:44 2007, [email protected] wrote:
> >>> Also, I don't understand why the code to update btime:
> >>>
> >>> /* calculate task elapsed time in timespec */
> >>> do_posix_clock_monotonic_gettime(&uptime);
> >>> ts = timespec_sub(uptime, tsk->start_time);
> >>> ...
> >>> stats->ac_btime = get_seconds() - ts.tv_sec;
> >>>
> >>> does not simply use tsk->start_time or tsk->real_start_time without
> >>> comparing it to the current time.
> >> From what I understand, task->start_time and task->real_start_time
> >> are taken from the realtime clock. The accounting in CSA seems
> >> to be very similar to the accounting done in do_acct_process()
> >> (kernel/acct.c).
> >
> > In CSA 3.0 ...
> >
> > csa_acct_eop(int exitcode, struct task_struct *p)
> >
> > csa->ac_btime = boottime +
> > ((p->start_time.tv_nsec < NSEC_PER_SEC/2) ?
> > p->start_time.tv_sec :
> > p->start_time.tv_sec +1);
> >
> > where
> >
> > do_posix_clock_monotonic_gettime(&uptime);
> > boottime = xtime.tv_sec - uptime.tv_sec;
> >
> > In an upcoming version of CSA ...
> >
> > csa_acct_eop(struct taskstats *p)
> >
> > csa->ac_btime = p->ac_btime;
> >
> > where
> >
> > do_posix_clock_monotonic_gettime(&uptime);
> > ts = uptime - tsk->start_time;
> > p->ac_btime = get_seconds() - ts.tv_sec;
> > = xtime.tv_sec - (uptime - tsk->start_time);
> > = (xtime.tv_sec - uptime) + tsk->start_time;
> >
> > So they're basically equivalent.
>
> Excellent, so can Guillaume change ac_btime to be just tsk->start_time?

I don't think so. Current time (xtime) is relative to the epoch; uptime and
tsk->start_time (jiffies) are both relative to some boot time. So you need to
subtract uptime from xtime to get the boot time relative to the epoch, then add
tsk->start_time. The result is what ac_btime should be set to.

I think his recent changes are as follows:

--- a/kernel/tsacct.c Fri Aug 31 01:42:23 2007 -0700
+++ b/kernel/tsacct.c Tue Aug 28 20:35:27 2007 +0200
...
-void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static void fill_wall_times(struct taskstats *stats, struct task_struct *task)
...
- ts = timespec_sub(uptime, tsk->start_time);
+ ts = timespec_sub(uptime, task->start_time);
...
- stats->ac_btime = get_seconds() - ts.tv_sec;
...
+ stats->ac_btime = get_seconds() - ts.tv_sec;

So really no different from before, which is correct.

Jonathan

2007-09-10 13:04:27

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Le Fri, 7 Sep 2007 16:37:57 -0700 (PDT),
Jonathan Lim <[email protected]> a écrit :

> > Excellent, so can Guillaume change ac_btime to be just
> > tsk->start_time?
>
> I don't think so. Current time (xtime) is relative to the epoch;
> uptime and tsk->start_time (jiffies) are both relative to some boot
> time. So you need to subtract uptime from xtime to get the boot time
> relative to the epoch, then add tsk->start_time. The result is what
> ac_btime should be set to.
>
> I think his recent changes are as follows:
>
> --- a/kernel/tsacct.c Fri Aug 31 01:42:23 2007 -0700
> +++ b/kernel/tsacct.c Tue Aug 28 20:35:27 2007 +0200
> ...
> -void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
> +static void fill_wall_times(struct taskstats *stats, struct
> task_struct *task) ...
> - ts = timespec_sub(uptime, tsk->start_time);
> + ts = timespec_sub(uptime, task->start_time);
> ...
> - stats->ac_btime = get_seconds() - ts.tv_sec;
> ...
> + stats->ac_btime = get_seconds() - ts.tv_sec;
>
> So really no different from before, which is correct.

Yes, I just tried to make it clearer that the computations were needed
to get a wall time. Otherwise, as CSA seems to start using taskstats,
do my changes make sense for your usage?

Thanks.

--
Guillaume

2007-09-13 00:18:42

by Andrew Morton

[permalink] [raw]
Subject: Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v3)

On Fri, 31 Aug 2007 14:35:35 +0200
Guillaume Chazarain <[email protected]> wrote:

> TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats,
> not the basic and extended accounting. With this patch,
> TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a
> thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
> fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
> (http://guichaz.free.fr/misc/iotop.py).
>
> Here is the output of the testcase before the patch:
>
> ...

> Documentation/accounting/dump-test.c | 314 +++++++++++++++++++++++++++++++++++

Another C file in the Documentation directory. Sigh. At kernel summit I
suggested that we should be putting these things in a place from where we
can actually build and install them. People said there was no need to do
that because kernel developers can now easily get new stuff into
util-linux. I don't believe them. Wanna be a guinea pig?


> +static void loop_reading(const char *filename)
> +{
> + int fd = open(filename, O_RDONLY);
> + char buffer[4096];
> +
> + if (fd < 0) {
> + perror(filename);
> + return;
> + }
> +
> + for (;;) {
> + lseek(fd, 0, SEEK_SET);
> + while (read(fd, buffer, sizeof(buffer)) > 0) ;

newline here. Just because it's userspace doesn't mean that it needs to
look crappy, despite all the code out there which disproves this ;)

I think you just invented pread().

> + }
> +}
> +
>
> ...
>
> --- a/kernel/taskstats.c Fri Aug 31 01:42:23 2007 -0700
> +++ b/kernel/taskstats.c Fri Aug 31 13:36:29 2007 +0200
> @@ -168,6 +168,60 @@ static void send_cpu_listeners(struct sk
> up_write(&listeners->sem);
> }
>
> +/*
> + * There are two types of taskstats fields when considering a thread group:
> + * - those that can be aggregated from each thread in the group (like CPU
> + * times),
> + * - those that cannot be aggregated (like UID) or are identical (like
> + * memory usage), so are taken from the group leader.
> + * XXX_threadgroup() methods deal with the first type while XXX_add_tsk() with
> + * the second.
> + */
> +static void fill_threadgroup(struct taskstats *stats, struct task_struct *task)
> +{
> + /*
> + * Each accounting subsystem adds calls to its functions to initialize
> + * relevant parts of struct taskstsats for a single tgid as follows:
> + *
> + * per-task-foo-fill_threadgroup(stats, task);
> + */
> +
> + stats->version = TASKSTATS_VERSION;
> +
> + /* fill in basic acct fields */
> + bacct_fill_threadgroup(stats, task);
> +
> + /* fill in extended acct fields */
> + xacct_fill_threadgroup(stats, task);
> +}
> +
> +/*
> + * Stats specific to each thread in the thread group. Stats of @task should be
> + * combined with those already present in @stats. add_tsk() works in
> + * conjunction with fill_threadgroup(), taskstats fields should not be touched
> + * by both functions.
> + */

It's odd to use kerneldoc-style markup in a non-kerneldoc comment.

> @@ -232,32 +272,21 @@ static int fill_tgid(pid_t tgid, struct
> else
> memset(stats, 0, sizeof(*stats));
>
> + leader = first->group_leader;
> + get_task_struct(leader);
> + fill_threadgroup(stats, leader);
> + put_task_struct(leader);
> +

Are the get_task_struct/put_task_struct here actually needed?


2007-09-15 18:44:30

by Guillaume Chazarain

[permalink] [raw]
Subject: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v4)

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
the basic and extended accounting. With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V3 (http://lkml.org/lkml/2007/8/31/121):
- Removed userspace example, either it gets accepted in util-linux-ng or I'll
maintain it elsewhere.
- Added kerneldoc for fill_threadgroup() and add_tsk().
- Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton
and Oleg Nesterov.
- Use lock_task_sighand() instead of spin_lock_irqsave(&tsk->sighand->siglock)
for consistency with the locking of task->signal->stats in fill_tgid().
- Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg.
- Documented double accounting race seen by Oleg.
- Rephrased the fill_tgid_exit() comment as per Oleg's recommendation.
- Documented the special case for the AFORK ac_flag.
- Use the exit status (code >> 8) instead of the exit code as documented in
Documentation/accounting/taskstats-struct.txt.
- Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as
suggested by Oleg.

Changelog since V2 (http://lkml.org/lkml/2007/8/19/96):
- Added a testcase
- Added an indirection between the stats producer and consumer:
add_task() & fill_threadgroup()
- TGID stats are either summed from all the threads or taken from the leader

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads.

Signed-off-by: Guillaume Chazarain <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Jay Lan <[email protected]>
Cc: Jonathan Lim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---

include/linux/tsacct_kern.h | 12 ++
kernel/taskstats.c | 139 +++++++++++++++++++++-------------
kernel/tsacct.c | 113 +++++++++++++++------------
3 files changed, 162 insertions(+), 102 deletions(-)

diff -r ec584ed21efa include/linux/tsacct_kern.h
--- a/include/linux/tsacct_kern.h Fri Sep 14 14:04:13 2007 -0700
+++ b/include/linux/tsacct_kern.h Tue Aug 28 20:35:27 2007 +0200
@@ -10,17 +10,23 @@
#include <linux/taskstats.h>

#ifdef CONFIG_TASKSTATS
-extern void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk);
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task);
#else
-static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static inline void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
+static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{}
#endif /* CONFIG_TASKSTATS */

#ifdef CONFIG_TASK_XACCT
-extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
extern void acct_update_integrals(struct task_struct *tsk);
extern void acct_clear_integrals(struct task_struct *tsk);
#else
+static inline void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
{}
static inline void acct_update_integrals(struct task_struct *tsk)
diff -r ec584ed21efa kernel/taskstats.c
--- a/kernel/taskstats.c Fri Sep 14 14:04:13 2007 -0700
+++ b/kernel/taskstats.c Sat Sep 15 20:22:47 2007 +0200
@@ -168,6 +168,68 @@ static void send_cpu_listeners(struct sk
up_write(&listeners->sem);
}

+/**
+ * fill_threadgroup - initialize some common stats for the thread group
+ * @stats: the taskstats to write into
+ * @task: the thread representing the whole group
+ *
+ * There are two types of taskstats fields when considering a thread group:
+ * - those that can be aggregated from each thread in the group (like CPU
+ * times),
+ * - those that cannot be aggregated (like UID) or are identical (like
+ * memory usage), so are taken from the group leader.
+ * XXX_threadgroup() methods deal with the first type while XXX_add_tsk() with
+ * the second.
+ */
+static void fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to initialize
+ * relevant parts of struct taskstsats for a single tgid as follows:
+ *
+ * per-task-foo-fill_threadgroup(stats, task);
+ */
+
+ stats->version = TASKSTATS_VERSION;
+
+ /* fill in basic acct fields */
+ bacct_fill_threadgroup(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_fill_threadgroup(stats, task);
+}
+
+/**
+ * add_tsk - combine some thread specific stats in a taskstats
+ * @stats: the taskstats to write into
+ * @task: the thread to combine
+ *
+ * Stats specific to each thread in the thread group. Stats of @task should be
+ * combined with those already present in @stats. add_tsk() works in
+ * conjunction with fill_threadgroup(), taskstats fields should not be touched
+ * by both functions.
+ */
+static void add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to combine
+ * relevant parts of struct taskstsats for a single pid as follows:
+ *
+ * per-task-foo-add_tsk(stats, task);
+ */
+ stats->nvcsw += task->nvcsw;
+ stats->nivcsw += task->nivcsw;
+
+ /* fill in delay acct fields */
+ delayacct_add_tsk(stats, task);
+
+ /* fill in basic acct fields */
+ bacct_add_tsk(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_add_tsk(stats, task);
+}
+
static int fill_pid(pid_t pid, struct task_struct *tsk,
struct taskstats *stats)
{
@@ -185,23 +247,8 @@ static int fill_pid(pid_t pid, struct ta
get_task_struct(tsk);

memset(stats, 0, sizeof(*stats));
- /*
- * Each accounting subsystem adds calls to its functions to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
-
- delayacct_add_tsk(stats, tsk);
-
- /* fill in basic acct fields */
- stats->version = TASKSTATS_VERSION;
- stats->nvcsw = tsk->nvcsw;
- stats->nivcsw = tsk->nivcsw;
- bacct_add_tsk(stats, tsk);
-
- /* fill in extended acct fields */
- xacct_add_tsk(stats, tsk);
+ fill_threadgroup(stats, tsk);
+ add_tsk(stats, tsk);

/* Define err: label here if needed */
put_task_struct(tsk);
@@ -232,32 +279,25 @@ static int fill_tgid(pid_t tgid, struct
else
memset(stats, 0, sizeof(*stats));

+ fill_threadgroup(stats, first->group_leader);
+
tsk = first;
- do {
- if (tsk->exit_state)
- continue;
- /*
- * Accounting subsystem can call its functions here to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
- delayacct_add_tsk(stats, tsk);
-
- stats->nvcsw += tsk->nvcsw;
- stats->nivcsw += tsk->nivcsw;
- } while_each_thread(first, tsk);
+ do
+ if (!tsk->exit_state)
+ /*
+ * This check is racy as a thread could exit just right
+ * now and have its statistics accounted twice.
+ */
+ add_tsk(stats, tsk);
+ while_each_thread(first, tsk);
+
+ if (first->signal->group_exit_code >> 8)
+ stats->ac_exitcode = first->signal->group_exit_code >> 8;

unlock_task_sighand(first, &flags);
rc = 0;
out:
rcu_read_unlock();
-
- stats->version = TASKSTATS_VERSION;
- /*
- * Accounting subsytems can also add calls here to modify
- * fields of taskstats.
- */
return rc;
}

@@ -266,20 +306,17 @@ static void fill_tgid_exit(struct task_s
{
unsigned long flags;

- spin_lock_irqsave(&tsk->sighand->siglock, flags);
- if (!tsk->signal->stats)
- goto ret;
-
- /*
- * Each accounting subsystem calls its functions here to
- * accumalate its per-task stats for tsk, into the per-tgid structure
- *
- * per-task-foo(tsk->signal->stats, tsk);
- */
- delayacct_add_tsk(tsk->signal->stats, tsk);
-ret:
- spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
- return;
+ rcu_read_lock();
+ lock_task_sighand(tsk, &flags);
+
+ /*
+ * The fill_threadgroup() part of the statistics will be added by the
+ * stats requester, i.e. fill_tgid()
+ */
+ add_tsk(tsk->signal->stats, tsk);
+
+ unlock_task_sighand(tsk, &flags);
+ rcu_read_unlock();
}

static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
diff -r ec584ed21efa kernel/tsacct.c
--- a/kernel/tsacct.c Fri Sep 14 14:04:13 2007 -0700
+++ b/kernel/tsacct.c Sat Sep 15 20:22:55 2007 +0200
@@ -22,50 +22,65 @@
#include <linux/acct.h>
#include <linux/jiffies.h>

-/*
- * fill in basic accounting fields
- */
-void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static void fill_wall_times(struct taskstats *stats, struct task_struct *task)
{
struct timespec uptime, ts;
s64 ac_etime;

- BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
-
/* calculate task elapsed time in timespec */
do_posix_clock_monotonic_gettime(&uptime);
- ts = timespec_sub(uptime, tsk->start_time);
+ ts = timespec_sub(uptime, task->start_time);
/* rebase elapsed time to usec */
ac_etime = timespec_to_ns(&ts);
do_div(ac_etime, NSEC_PER_USEC);
- stats->ac_etime = ac_etime;
- stats->ac_btime = get_seconds() - ts.tv_sec;
- if (thread_group_leader(tsk)) {
- stats->ac_exitcode = tsk->exit_code;
- if (tsk->flags & PF_FORKNOEXEC)
- stats->ac_flag |= AFORK;
- }
- if (tsk->flags & PF_SUPERPRIV)
+ stats->ac_etime = ac_etime;
+ stats->ac_btime = get_seconds() - ts.tv_sec;
+}
+
+/*
+ * fill in basic accounting fields
+ */
+
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
+
+ rcu_read_lock();
+ stats->ac_ppid = pid_alive(task) ?
+ rcu_dereference(task->real_parent)->tgid : 0;
+ rcu_read_unlock();
+
+ fill_wall_times(stats, task);
+
+ stats->ac_exitcode = task->exit_code >> 8;
+ stats->ac_nice = task_nice(task);
+ stats->ac_sched = task->policy;
+ stats->ac_uid = task->uid;
+ stats->ac_gid = task->gid;
+ stats->ac_pid = task->pid;
+
+ strncpy(stats->ac_comm, task->comm, sizeof(stats->ac_comm));
+}
+
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ if (task->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
- if (tsk->flags & PF_DUMPCORE)
+ if (task->flags & PF_DUMPCORE)
stats->ac_flag |= ACORE;
- if (tsk->flags & PF_SIGNALED)
+ if (task->flags & PF_SIGNALED)
stats->ac_flag |= AXSIG;
- stats->ac_nice = task_nice(tsk);
- stats->ac_sched = tsk->policy;
- stats->ac_uid = tsk->uid;
- stats->ac_gid = tsk->gid;
- stats->ac_pid = tsk->pid;
- rcu_read_lock();
- stats->ac_ppid = pid_alive(tsk) ?
- rcu_dereference(tsk->real_parent)->tgid : 0;
- rcu_read_unlock();
- stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
- stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
- stats->ac_minflt = tsk->min_flt;
- stats->ac_majflt = tsk->maj_flt;
+ if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
+ /*
+ * Threads are created by do_fork() and don't exec but not in
+ * the AFORK sense, as the latter involves fork(2).
+ */
+ stats->ac_flag |= AFORK;

- strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
+ stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
+ stats->ac_minflt += task->min_flt;
+ stats->ac_majflt += task->maj_flt;
}


@@ -76,32 +91,34 @@ void bacct_add_tsk(struct taskstats *sta
/*
* fill in extended accounting fields
*/
-void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
{
struct mm_struct *mm;

- /* convert pages-jiffies to Mbyte-usec */
- stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
- stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
- mm = get_task_mm(p);
+ mm = get_task_mm(task);
if (mm) {
/* adjust to KB unit */
stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
- stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
+ stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
mmput(mm);
}
- stats->read_char = p->rchar;
- stats->write_char = p->wchar;
- stats->read_syscalls = p->syscr;
- stats->write_syscalls = p->syscw;
+}
+
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+{
+ /* convert pages-jiffies to Mbyte-usec */
+ stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+ stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+
+ stats->read_char += p->rchar;
+ stats->write_char += p->wchar;
+ stats->read_syscalls += p->syscr;
+ stats->write_syscalls += p->syscw;
+
#ifdef CONFIG_TASK_IO_ACCOUNTING
- stats->read_bytes = p->ioac.read_bytes;
- stats->write_bytes = p->ioac.write_bytes;
- stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
-#else
- stats->read_bytes = 0;
- stats->write_bytes = 0;
- stats->cancelled_write_bytes = 0;
+ stats->read_bytes += p->ioac.read_bytes;
+ stats->write_bytes += p->ioac.write_bytes;
+ stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
#endif
}
#undef KB

2007-09-17 22:25:41

by Guillaume Chazarain

[permalink] [raw]
Subject: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
the basic and extended accounting. With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V4 (http://lkml.org/lkml/2007/9/15/171):
- Revert gratuitous user interface change (returning exit_code >> 8 instead of
exit_code). Thanks Oleg Nesterov.
- Revert useless heavyweight locking (lock_task_sighand() in fill_tgid_exit).
Thanks Oleg.
- Correctly fill the TGID in taskstats_exit(). Thanks Oleg.

Changelog since V3 (http://lkml.org/lkml/2007/8/31/121):
- Removed userspace example, either it gets accepted in util-linux-ng or I'll
maintain it elsewhere.
- Added kerneldoc for fill_threadgroup() and add_tsk().
- Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton
and Oleg Nesterov.
- Use lock_task_sighand() instead of spin_lock_irqsave(&tsk->sighand->siglock)
for consistency with the locking of task->signal->stats in fill_tgid().
- Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg.
- Documented double accounting race seen by Oleg.
- Rephrased the fill_tgid_exit() comment as per Oleg's recommendation.
- Documented the special case for the AFORK ac_flag.
- Use the exit status (code >> 8) instead of the exit code as documented in
Documentation/accounting/taskstats-struct.txt.
- Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as
suggested by Oleg.

Changelog since V2 (http://lkml.org/lkml/2007/8/19/96):
- Added a testcase
- Added an indirection between the stats producer and consumer:
add_task() & fill_threadgroup()
- TGID stats are either summed from all the threads or taken from the leader

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads.

Signed-off-by: Guillaume Chazarain <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Jay Lan <[email protected]>
Cc: Jonathan Lim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---

include/linux/tsacct_kern.h | 12 ++-
kernel/taskstats.c | 135 +++++++++++++++++++++-------------
kernel/tsacct.c | 113 ++++++++++++++++------------
3 files changed, 159 insertions(+), 101 deletions(-)

diff -r 2908770b8fc2 include/linux/tsacct_kern.h
--- a/include/linux/tsacct_kern.h Sun Sep 16 22:24:49 2007 -0700
+++ b/include/linux/tsacct_kern.h Tue Aug 28 20:35:27 2007 +0200
@@ -10,17 +10,23 @@
#include <linux/taskstats.h>

#ifdef CONFIG_TASKSTATS
-extern void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk);
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task);
#else
-static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static inline void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
+static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{}
#endif /* CONFIG_TASKSTATS */

#ifdef CONFIG_TASK_XACCT
-extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
extern void acct_update_integrals(struct task_struct *tsk);
extern void acct_clear_integrals(struct task_struct *tsk);
#else
+static inline void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
{}
static inline void acct_update_integrals(struct task_struct *tsk)
diff -r 2908770b8fc2 kernel/taskstats.c
--- a/kernel/taskstats.c Sun Sep 16 22:24:49 2007 -0700
+++ b/kernel/taskstats.c Mon Sep 17 22:55:04 2007 +0200
@@ -168,6 +168,68 @@ static void send_cpu_listeners(struct sk
up_write(&listeners->sem);
}

+/**
+ * fill_threadgroup - initialize some common stats for the thread group
+ * @stats: the taskstats to write into
+ * @task: the thread representing the whole group
+ *
+ * There are two types of taskstats fields when considering a thread group:
+ * - those that can be aggregated from each thread in the group (like CPU
+ * times),
+ * - those that cannot be aggregated (like UID) or are identical (like
+ * memory usage), so are taken from the group leader.
+ * XXX_threadgroup() methods deal with the first type while XXX_add_tsk() with
+ * the second.
+ */
+static void fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to initialize
+ * relevant parts of struct taskstsats for a single tgid as follows:
+ *
+ * per-task-foo-fill_threadgroup(stats, task);
+ */
+
+ stats->version = TASKSTATS_VERSION;
+
+ /* fill in basic acct fields */
+ bacct_fill_threadgroup(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_fill_threadgroup(stats, task);
+}
+
+/**
+ * add_tsk - combine some thread specific stats in a taskstats
+ * @stats: the taskstats to write into
+ * @task: the thread to combine
+ *
+ * Stats specific to each thread in the thread group. Stats of @task should be
+ * combined with those already present in @stats. add_tsk() works in
+ * conjunction with fill_threadgroup(), taskstats fields should not be touched
+ * by both functions.
+ */
+static void add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ /*
+ * Each accounting subsystem adds calls to its functions to combine
+ * relevant parts of struct taskstsats for a single pid as follows:
+ *
+ * per-task-foo-add_tsk(stats, task);
+ */
+ stats->nvcsw += task->nvcsw;
+ stats->nivcsw += task->nivcsw;
+
+ /* fill in delay acct fields */
+ delayacct_add_tsk(stats, task);
+
+ /* fill in basic acct fields */
+ bacct_add_tsk(stats, task);
+
+ /* fill in extended acct fields */
+ xacct_add_tsk(stats, task);
+}
+
static int fill_pid(pid_t pid, struct task_struct *tsk,
struct taskstats *stats)
{
@@ -185,23 +247,8 @@ static int fill_pid(pid_t pid, struct ta
get_task_struct(tsk);

memset(stats, 0, sizeof(*stats));
- /*
- * Each accounting subsystem adds calls to its functions to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
-
- delayacct_add_tsk(stats, tsk);
-
- /* fill in basic acct fields */
- stats->version = TASKSTATS_VERSION;
- stats->nvcsw = tsk->nvcsw;
- stats->nivcsw = tsk->nivcsw;
- bacct_add_tsk(stats, tsk);
-
- /* fill in extended acct fields */
- xacct_add_tsk(stats, tsk);
+ fill_threadgroup(stats, tsk);
+ add_tsk(stats, tsk);

/* Define err: label here if needed */
put_task_struct(tsk);
@@ -232,32 +279,25 @@ static int fill_tgid(pid_t tgid, struct
else
memset(stats, 0, sizeof(*stats));

+ fill_threadgroup(stats, first->group_leader);
+
tsk = first;
- do {
- if (tsk->exit_state)
- continue;
- /*
- * Accounting subsystem can call its functions here to
- * fill in relevant parts of struct taskstsats as follows
- *
- * per-task-foo(stats, tsk);
- */
- delayacct_add_tsk(stats, tsk);
-
- stats->nvcsw += tsk->nvcsw;
- stats->nivcsw += tsk->nivcsw;
- } while_each_thread(first, tsk);
+ do
+ if (!tsk->exit_state)
+ /*
+ * This check is racy as a thread could exit just right
+ * now and have its statistics accounted twice.
+ */
+ add_tsk(stats, tsk);
+ while_each_thread(first, tsk);
+
+ if (first->signal->group_exit_code)
+ stats->ac_exitcode = first->signal->group_exit_code;

unlock_task_sighand(first, &flags);
rc = 0;
out:
rcu_read_unlock();
-
- stats->version = TASKSTATS_VERSION;
- /*
- * Accounting subsytems can also add calls here to modify
- * fields of taskstats.
- */
return rc;
}

@@ -267,19 +307,14 @@ static void fill_tgid_exit(struct task_s
unsigned long flags;

spin_lock_irqsave(&tsk->sighand->siglock, flags);
- if (!tsk->signal->stats)
- goto ret;
-
- /*
- * Each accounting subsystem calls its functions here to
- * accumalate its per-task stats for tsk, into the per-tgid structure
- *
- * per-task-foo(tsk->signal->stats, tsk);
- */
- delayacct_add_tsk(tsk->signal->stats, tsk);
-ret:
+
+ /*
+ * The fill_threadgroup() part of the statistics will be added by the
+ * stats requester, i.e. fill_tgid()
+ */
+ add_tsk(tsk->signal->stats, tsk);
+
spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
- return;
}

static int add_del_listener(pid_t pid, cpumask_t *maskp, int isadd)
@@ -508,7 +543,7 @@ void taskstats_exit(struct task_struct *
if (!stats)
goto err;

- memcpy(stats, tsk->signal->stats, sizeof(*stats));
+ fill_tgid(tsk->pid, tsk, stats);

send:
send_cpu_listeners(rep_skb, listeners);
diff -r 2908770b8fc2 kernel/tsacct.c
--- a/kernel/tsacct.c Sun Sep 16 22:24:49 2007 -0700
+++ b/kernel/tsacct.c Mon Sep 17 20:44:05 2007 +0200
@@ -22,50 +22,65 @@
#include <linux/acct.h>
#include <linux/jiffies.h>

-/*
- * fill in basic accounting fields
- */
-void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static void fill_wall_times(struct taskstats *stats, struct task_struct *task)
{
struct timespec uptime, ts;
s64 ac_etime;

- BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
-
/* calculate task elapsed time in timespec */
do_posix_clock_monotonic_gettime(&uptime);
- ts = timespec_sub(uptime, tsk->start_time);
+ ts = timespec_sub(uptime, task->start_time);
/* rebase elapsed time to usec */
ac_etime = timespec_to_ns(&ts);
do_div(ac_etime, NSEC_PER_USEC);
- stats->ac_etime = ac_etime;
- stats->ac_btime = get_seconds() - ts.tv_sec;
- if (thread_group_leader(tsk)) {
- stats->ac_exitcode = tsk->exit_code;
- if (tsk->flags & PF_FORKNOEXEC)
- stats->ac_flag |= AFORK;
- }
- if (tsk->flags & PF_SUPERPRIV)
+ stats->ac_etime = ac_etime;
+ stats->ac_btime = get_seconds() - ts.tv_sec;
+}
+
+/*
+ * fill in basic accounting fields
+ */
+
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{
+ BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
+
+ rcu_read_lock();
+ stats->ac_ppid = pid_alive(task) ?
+ rcu_dereference(task->real_parent)->tgid : 0;
+ rcu_read_unlock();
+
+ fill_wall_times(stats, task);
+
+ stats->ac_exitcode = task->exit_code;
+ stats->ac_nice = task_nice(task);
+ stats->ac_sched = task->policy;
+ stats->ac_uid = task->uid;
+ stats->ac_gid = task->gid;
+ stats->ac_pid = task->pid;
+
+ strncpy(stats->ac_comm, task->comm, sizeof(stats->ac_comm));
+}
+
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
+{
+ if (task->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
- if (tsk->flags & PF_DUMPCORE)
+ if (task->flags & PF_DUMPCORE)
stats->ac_flag |= ACORE;
- if (tsk->flags & PF_SIGNALED)
+ if (task->flags & PF_SIGNALED)
stats->ac_flag |= AXSIG;
- stats->ac_nice = task_nice(tsk);
- stats->ac_sched = tsk->policy;
- stats->ac_uid = tsk->uid;
- stats->ac_gid = tsk->gid;
- stats->ac_pid = tsk->pid;
- rcu_read_lock();
- stats->ac_ppid = pid_alive(tsk) ?
- rcu_dereference(tsk->real_parent)->tgid : 0;
- rcu_read_unlock();
- stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
- stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
- stats->ac_minflt = tsk->min_flt;
- stats->ac_majflt = tsk->maj_flt;
+ if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
+ /*
+ * Threads are created by do_fork() and don't exec but not in
+ * the AFORK sense, as the latter involves fork(2).
+ */
+ stats->ac_flag |= AFORK;

- strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
+ stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
+ stats->ac_minflt += task->min_flt;
+ stats->ac_majflt += task->maj_flt;
}


@@ -76,32 +91,34 @@ void bacct_add_tsk(struct taskstats *sta
/*
* fill in extended accounting fields
*/
-void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
{
struct mm_struct *mm;

- /* convert pages-jiffies to Mbyte-usec */
- stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
- stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
- mm = get_task_mm(p);
+ mm = get_task_mm(task);
if (mm) {
/* adjust to KB unit */
stats->hiwater_rss = mm->hiwater_rss * PAGE_SIZE / KB;
- stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
+ stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB;
mmput(mm);
}
- stats->read_char = p->rchar;
- stats->write_char = p->wchar;
- stats->read_syscalls = p->syscr;
- stats->write_syscalls = p->syscw;
+}
+
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
+{
+ /* convert pages-jiffies to Mbyte-usec */
+ stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+ stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+
+ stats->read_char += p->rchar;
+ stats->write_char += p->wchar;
+ stats->read_syscalls += p->syscr;
+ stats->write_syscalls += p->syscw;
+
#ifdef CONFIG_TASK_IO_ACCOUNTING
- stats->read_bytes = p->ioac.read_bytes;
- stats->write_bytes = p->ioac.write_bytes;
- stats->cancelled_write_bytes = p->ioac.cancelled_write_bytes;
-#else
- stats->read_bytes = 0;
- stats->write_bytes = 0;
- stats->cancelled_write_bytes = 0;
+ stats->read_bytes += p->ioac.read_bytes;
+ stats->write_bytes += p->ioac.write_bytes;
+ stats->cancelled_write_bytes += p->ioac.cancelled_write_bytes;
#endif
}
#undef KB

2007-09-18 15:26:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

On 09/18, Guillaume Chazarain wrote:
>
> @@ -508,7 +543,7 @@ void taskstats_exit(struct task_struct *
> if (!stats)
> goto err;
>
> - memcpy(stats, tsk->signal->stats, sizeof(*stats));
> + fill_tgid(tsk->pid, tsk, stats);

No, no, this is wrong.

tsk->signal->stats contains the accumulated info about the already exited
threads, we shouldn't throw it out.

Also, fill_tgid() doesn't make sense here, current is the last live sub-thread.


Hmm. We have another race here. There is no guarantee that tsk->signal->stats
covers all sub-threads, as it is supposed to be. It is quite possible that
another sub-thread decremented ->signal->live before us, but didn't complete
taskstats_exit()->fill_tgid_exit() yet.

Oleg.

2007-09-20 06:21:41

by Andrew Morton

[permalink] [raw]
Subject: Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

On Tue, 18 Sep 2007 00:23:39 +0200 Guillaume Chazarain <[email protected]> wrote:

> TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
> the basic and extended accounting. With this patch,
> TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
> of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
> fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
> (http://guichaz.free.fr/misc/iotop.py).

This patch conflicts somewhat with
add-scaled-time-to-taskstats-based-process-accounting.patch

I fixed it up like this:

void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{
if (task->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
if (task->flags & PF_DUMPCORE)
stats->ac_flag |= ACORE;
if (task->flags & PF_SIGNALED)
stats->ac_flag |= AXSIG;
if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
/*
* Threads are created by do_fork() and don't exec but not in
* the AFORK sense, as the latter involves fork(2).
*/
stats->ac_flag |= AFORK;

stats->ac_utimescaled +=
cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
stats->ac_stimescaled +=
cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
stats->ac_minflt += task->min_flt;
stats->ac_majflt += task->maj_flt;
}

(note the s/=/+=/ in there) but it all needs reviewing and checking and
testing please.

2007-09-20 08:55:36

by Balbir Singh

[permalink] [raw]
Subject: Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

Andrew Morton wrote:
> On Tue, 18 Sep 2007 00:23:39 +0200 Guillaume Chazarain <[email protected]> wrote:
>
>> TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
>> the basic and extended accounting. With this patch,
>> TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
>> of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
>> fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
>> (http://guichaz.free.fr/misc/iotop.py).
>
> This patch conflicts somewhat with
> add-scaled-time-to-taskstats-based-process-accounting.patch
>
> I fixed it up like this:
>
> void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
> {
> if (task->flags & PF_SUPERPRIV)
> stats->ac_flag |= ASU;
> if (task->flags & PF_DUMPCORE)
> stats->ac_flag |= ACORE;
> if (task->flags & PF_SIGNALED)
> stats->ac_flag |= AXSIG;
> if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
> /*
> * Threads are created by do_fork() and don't exec but not in
> * the AFORK sense, as the latter involves fork(2).
> */
> stats->ac_flag |= AFORK;
>
> stats->ac_utimescaled +=
> cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
> stats->ac_stimescaled +=
> cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
> stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
> stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
> stats->ac_minflt += task->min_flt;
> stats->ac_majflt += task->maj_flt;
> }
>
> (note the s/=/+=/ in there) but it all needs reviewing and checking and
> testing please.

Andrew,

Thanks for reviewing the patchset, this patch is on my review and test
queue (which has gotten rather long of late). I'll test it further and
get back.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-09-20 12:12:38

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

On 09/20, Balbir Singh wrote:
>
> Andrew Morton wrote:
> > On Tue, 18 Sep 2007 00:23:39 +0200 Guillaume Chazarain <[email protected]> wrote:
> >
> >> TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
> >> the basic and extended accounting. With this patch,
> >> TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
> >> of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
> >> fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
> >> (http://guichaz.free.fr/misc/iotop.py).
> >
> > This patch conflicts somewhat with
> > add-scaled-time-to-taskstats-based-process-accounting.patch
> >
> > I fixed it up like this:
> >
> > void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
> > {
> > if (task->flags & PF_SUPERPRIV)
> > stats->ac_flag |= ASU;
> > if (task->flags & PF_DUMPCORE)
> > stats->ac_flag |= ACORE;
> > if (task->flags & PF_SIGNALED)
> > stats->ac_flag |= AXSIG;
> > if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
> > /*
> > * Threads are created by do_fork() and don't exec but not in
> > * the AFORK sense, as the latter involves fork(2).
> > */
> > stats->ac_flag |= AFORK;
> >
> > stats->ac_utimescaled +=
> > cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
> > stats->ac_stimescaled +=
> > cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
> > stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
> > stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
> > stats->ac_minflt += task->min_flt;
> > stats->ac_majflt += task->maj_flt;
> > }
> >
> > (note the s/=/+=/ in there) but it all needs reviewing and checking and
> > testing please.
>
> Andrew,
>
> Thanks for reviewing the patchset, this patch is on my review and test
> queue (which has gotten rather long of late). I'll test it further and
> get back.

I still think this version is very wrong. It makes the ->signal->stats
absolutely meaningless. Quoting myself:

On 09/18, Guillaume Chazarain wrote:
>
> @@ -508,7 +543,7 @@ void taskstats_exit(struct task_struct *
> if (!stats)
> goto err;
>
> - memcpy(stats, tsk->signal->stats, sizeof(*stats));
> + fill_tgid(tsk->pid, tsk, stats);

No, no, this is wrong.

tsk->signal->stats contains the accumulated info about the already exited
threads, we shouldn't throw it out.

Also, fill_tgid() doesn't make sense here, current is the last live sub-thread.

Oleg.

2007-09-20 12:18:45

by Balbir Singh

[permalink] [raw]
Subject: Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

>> Andrew,
>>
>> Thanks for reviewing the patchset, this patch is on my review and test
>> queue (which has gotten rather long of late). I'll test it further and
>> get back.
>
> I still think this version is very wrong. It makes the ->signal->stats
> absolutely meaningless. Quoting myself:
>


Hi, Oleg,

Yes, I see, removing the memcpy is definitely wrong. Thanks for catching
it. I did not get a chance to review the patch, it's on my review queue,
but since you've reviewed it, I am very glad that you have and
identified potential issues.

Big Thanks!

--
Balbir Singh
Linux Technology Center
IBM, ISTL