From: Michael Holzheu <[email protected]>
Version 2
---------
* Use thread_group_leader() instead of (tsk->pid == tsk->tgid)
* Use cdata_wait instead of cdata_acct
* I left the struct signal locking in the patch, because I think
that in order to get consistent data this is necessary. See also
do_task_stat() in fs/proc/array.c. One problem is that we
report wrong values (zero) for cdata, if lock_task_sighand()
fails. This will be the same behavior as for /proc/<pid>/stat.
But maybe we should somehow return to userspace that the
information is not correct. Any ideas?
Description
-----------
With this patch the cumulative CPU time is added to "struct taskstats".
The CPU time is only returned for the thread group leader.
Signed-off-by: Michael Holzheu <[email protected]>
---
include/linux/taskstats.h | 7 ++++++-
kernel/tsacct.c | 7 +++++++
2 files changed, 13 insertions(+), 1 deletion(-)
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -33,7 +33,7 @@
*/
-#define TASKSTATS_VERSION 7
+#define TASKSTATS_VERSION 8
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -163,6 +163,11 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64 freepages_count;
__u64 freepages_delay_total;
+ /* version 7 ends here */
+
+ /* Cumulative CPU time of dead children */
+ __u64 ac_cutime; /* User CPU time [usec] */
+ __u64 ac_cstime; /* System CPU time [usec] */
};
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -30,6 +30,7 @@ void bacct_add_tsk(struct taskstats *sta
{
const struct cred *tcred;
struct timespec uptime, ts;
+ unsigned long flags;
u64 ac_etime;
BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
@@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta
stats->ac_majflt = tsk->maj_flt;
strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ if (thread_group_leader(tsk) && lock_task_sighand(tsk, &flags)) {
+ struct cdata *cd = &tsk->signal->cdata_wait;
+ stats->ac_cutime = cputime_to_usecs(cd->utime);
+ stats->ac_cstime = cputime_to_usecs(cd->stime);
+ unlock_task_sighand(tsk, &flags);
+ }
}
On 11/29, Michael Holzheu wrote:
>
> * I left the struct signal locking in the patch, because I think
> that in order to get consistent data this is necessary.
OK, I guess you mean that we want to read utime/stime "atomically",
and thus we need ->siglock to prevent the race with __exit_signal().
> @@ -30,6 +30,7 @@ void bacct_add_tsk(struct taskstats *sta
> {
> const struct cred *tcred;
> struct timespec uptime, ts;
> + unsigned long flags;
> u64 ac_etime;
>
> BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
> @@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta
> stats->ac_majflt = tsk->maj_flt;
>
> strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
> + if (thread_group_leader(tsk) && lock_task_sighand(tsk, &flags)) {
> + struct cdata *cd = &tsk->signal->cdata_wait;
> + stats->ac_cutime = cputime_to_usecs(cd->utime);
> + stats->ac_cstime = cputime_to_usecs(cd->stime);
perhaps we need a small comment to explain ->siglock...
But in fact I don't really understand this anyway. This is called
before we reparent our children. This means that ac_cutime/ac_cstime
can be changed after that (multithreading, or full_cdata_enabled).
Say, taskstats_exit()->fill_stats()->bacct_add_tsk(). Every thread
does this, including the group_leader. But, it is possible that
group_leader exits first, before other threads. IOW, what does
stats->ac_cXtime actually mean?
Nevermind, the whole series looks correct to me. Although I still
can't find the time to read it ;) But at first glance this version
addresses all concerns we discussed before.
Oleg.
On Wed, 2010-12-01 at 19:51 +0100, Oleg Nesterov wrote:
> But in fact I don't really understand this anyway. This is called
> before we reparent our children. This means that ac_cutime/ac_cstime
> can be changed after that (multithreading, or full_cdata_enabled).
>
> Say, taskstats_exit()->fill_stats()->bacct_add_tsk(). Every thread
> does this, including the group_leader. But, it is possible that
> group_leader exits first, before other threads. IOW, what does
> stats->ac_cXtime actually mean?
Because I worked mostly with the ptop tool, I was not so much focused on
the taskstats exit events, but instead more on the taskstats commands to
query data for running tasks.
For the query scenario stats->ac_cXtime means:
1) full_cdata=0: "Sum of CPU time of exited child processes for which
sys_wait() has been done (up to this time)"
2) full_cdata=1: "Sum of CPU time of exited child processes for which
sys_wait() has been done, plus exited child processes whose
parents ignored SIGCHLD or have set SA_NOCLDWAIT (up to
this time)"
Regarding taskstats_exit(): Do you have something like the following
scenario in mind?
1) You have a thread group with several threads
2) Thread group leader dies and reports cdata_wait in taskstats_exit()
3) Thread group leader stays around as zombie until the thread
group dies
4) Other forked processes of this thread group die
5) cdata_wait of thread group is increased
6) The new cdata is not reported by any exit event of the thread group
So maybe we should remove the thread_group_leader() check and report
cdata_wait for all threads and not only for the thread group leader? We
also should add ac_tgid to taskstats so that userspace can find the
corresponding thread group for each thread.
When the last thread exits and the process/thread group dies,
taskstats_exit() sends an additional taskstats struct to userspace that
aggregates the thread accounting data. Currently only the delay
accounting data is aggregated (see
taskstats_exit->fill_tgid_exit->delayacct_add_tsk). Not sure, why the
other information is not aggregated. We perhaps also should include
ac_cXtime in the aggregated taskstats.
Michael
* Oleg Nesterov <[email protected]> [2010-12-01 19:51:28]:
> On 11/29, Michael Holzheu wrote:
> >
> > * I left the struct signal locking in the patch, because I think
> > that in order to get consistent data this is necessary.
>
> OK, I guess you mean that we want to read utime/stime "atomically",
> and thus we need ->siglock to prevent the race with __exit_signal().
>
> > @@ -30,6 +30,7 @@ void bacct_add_tsk(struct taskstats *sta
> > {
> > const struct cred *tcred;
> > struct timespec uptime, ts;
> > + unsigned long flags;
> > u64 ac_etime;
> >
> > BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);
> > @@ -71,6 +72,12 @@ void bacct_add_tsk(struct taskstats *sta
> > stats->ac_majflt = tsk->maj_flt;
> >
> > strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
> > + if (thread_group_leader(tsk) && lock_task_sighand(tsk, &flags)) {
> > + struct cdata *cd = &tsk->signal->cdata_wait;
> > + stats->ac_cutime = cputime_to_usecs(cd->utime);
> > + stats->ac_cstime = cputime_to_usecs(cd->stime);
>
> perhaps we need a small comment to explain ->siglock...
>
> But in fact I don't really understand this anyway. This is called
> before we reparent our children. This means that ac_cutime/ac_cstime
> can be changed after that (multithreading, or full_cdata_enabled).
>
> Say, taskstats_exit()->fill_stats()->bacct_add_tsk(). Every thread
> does this, including the group_leader. But, it is possible that
> group_leader exits first, before other threads. IOW, what does
> stats->ac_cXtime actually mean?
>
stats->ac_* time was designed only for tgid's to begin with, so I am
not sure if ac_cXtime makes sense for threads
>
> Nevermind, the whole series looks correct to me. Although I still
> can't find the time to read it ;) But at first glance this version
> addresses all concerns we discussed before.
>
> Oleg.
>
--
Three Cheers,
Balbir
* Michael Holzheu <[email protected]> [2010-12-02 17:34:01]:
> On Wed, 2010-12-01 at 19:51 +0100, Oleg Nesterov wrote:
> > But in fact I don't really understand this anyway. This is called
> > before we reparent our children. This means that ac_cutime/ac_cstime
> > can be changed after that (multithreading, or full_cdata_enabled).
> >
> > Say, taskstats_exit()->fill_stats()->bacct_add_tsk(). Every thread
> > does this, including the group_leader. But, it is possible that
> > group_leader exits first, before other threads. IOW, what does
> > stats->ac_cXtime actually mean?
>
> Because I worked mostly with the ptop tool, I was not so much focused on
> the taskstats exit events, but instead more on the taskstats commands to
> query data for running tasks.
>
> For the query scenario stats->ac_cXtime means:
>
> 1) full_cdata=0: "Sum of CPU time of exited child processes for which
> sys_wait() has been done (up to this time)"
> 2) full_cdata=1: "Sum of CPU time of exited child processes for which
> sys_wait() has been done, plus exited child processes whose
> parents ignored SIGCHLD or have set SA_NOCLDWAIT (up to
> this time)"
>
> Regarding taskstats_exit(): Do you have something like the following
> scenario in mind?
>
> 1) You have a thread group with several threads
> 2) Thread group leader dies and reports cdata_wait in taskstats_exit()
> 3) Thread group leader stays around as zombie until the thread
> group dies
> 4) Other forked processes of this thread group die
> 5) cdata_wait of thread group is increased
> 6) The new cdata is not reported by any exit event of the thread group
>
> So maybe we should remove the thread_group_leader() check and report
> cdata_wait for all threads and not only for the thread group leader? We
> also should add ac_tgid to taskstats so that userspace can find the
> corresponding thread group for each thread.
>
> When the last thread exits and the process/thread group dies,
> taskstats_exit() sends an additional taskstats struct to userspace that
> aggregates the thread accounting data. Currently only the delay
> accounting data is aggregated (see
> taskstats_exit->fill_tgid_exit->delayacct_add_tsk). Not sure, why the
> other information is not aggregated. We perhaps also should include
> ac_cXtime in the aggregated taskstats.
>
The delay accounting data is aggregated; the other part, tsacct, does
not care much about aggregation (IIRC). tsacct was added to export pid
data and never tgid data, again IIRC. This subsystem is the one that
owns ac_*; perhaps if tsacct supported tgids, then what you say makes
sense.
--
Three Cheers,
Balbir
Hello Balbir,
On Fri, 2010-12-03 at 13:03 +0530, Balbir Singh wrote:
> * Oleg Nesterov <[email protected]> [2010-12-01 19:51:28]:
>
> > On 11/29, Michael Holzheu wrote:
[snip]
> > Say, taskstats_exit()->fill_stats()->bacct_add_tsk(). Every thread
> > does this, including the group_leader. But, it is possible that
> > group_leader exits first, before other threads. IOW, what does
> > stats->ac_cXtime actually mean?
> >
>
> stats->ac_* time was designed only for tgid's to begin with,
You mean for tids (threads/tasks), no? stats->ac_* time is only reported
for threads in bacct_add_tsk() and not for tgids.
> so I am
> not sure if ac_cXtime makes sense for threads
I would suggest doing it the same way as /proc/<tgid>/task/<tid>/stat.
It reports the (same) cumulative time for each thread. See
do_task_stat() in fs/proc/array.c. So IMHO it makes sense for taskstats,
too, to include the cXtime for all threads and not only for the thread
group leader.
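For reference, a minimal sketch of reading those per-thread values from
procfs (an illustration of mine; field positions per proc(5), where
cutime and cstime are fields 16 and 17):

#include <stdio.h>
#include <string.h>

/* Read cutime/cstime (in clock ticks) from /proc/<tgid>/task/<tid>/stat.
 * Scan from the last ')' because the command name (field 2) may itself
 * contain spaces or parentheses. */
static int read_ctimes(int tgid, int tid, long *cutime, long *cstime)
{
	char path[64], buf[1024], *p;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", tgid, tid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	p = fgets(buf, sizeof(buf), f);
	fclose(f);
	if (!p || !(p = strrchr(buf, ')')))
		return -1;
	/* after ')': state(3), five signed and seven unsigned fields,
	 * then cutime(16) and cstime(17) */
	if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %*u %*u %ld %ld",
		   cutime, cstime) != 2)
		return -1;
	return 0;
}

Calling this for every tid under /proc/<tgid>/task/ returns the same
cutime/cstime pair, which is the behavior proposed here for taskstats
as well.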
Also I would include tgid in taskstats so that userspace can group the
tasks into thread groups.
I am not sure about the aggregation for tgids in
TASKSTATS_CMD_ATTR_TGID and exit events with group_dead=1. Currently
only the delay accounting numbers are aggregated. If we did it
like /proc/<tgid>/stat (do_task_stat() with whole=1), we could also
aggregate the other values, e.g. the CPU time.
I think the following tgid data would make sense to be returned for
TASKSTATS_CMD_ATTR_TGID and exit events with group_dead=1:
bacct_add_tsk():
----------------
ac_pid thread group leader (tsk->tgid)
ac_etime thread group leader
ac_btime thread group leader
ac_nice thread group leader
ac_sched thread group leader
ac_uid thread group leader
ac_gid thread group leader
ac_ppid thread group leader
ac_comm thread group leader
ac_exit_code thread group leader
ac_flags thread group leader
ac_utimescaled ??
ac_stimescaled ??
ac_utime sum for all live threads + cdata_threads
ac_stime sum for all live threads + cdata_threads
ac_minflt sum for all live threads + cdata_threads
ac_majflt sum for all live threads + cdata_threads
new:
ac_cutime cdata_wait
ac_cstime cdata_wait
ac_tgid thread group leader
xacct_add_tsk():
----------------
coremem ?
virtmem ?
hiwater_rss thread group leader
hiwater_vm thread group leader
read_char sum for all live threads + tsk->signal.ioac
write_char sum for all live threads + tsk->signal.ioac
read_syscalls sum for all live threads + tsk->signal.ioac
write_syscalls sum for all live threads + tsk->signal.ioac
read_bytes sum for all live threads + tsk->signal.ioac
write_bytes sum for all live threads + tsk->signal.ioac
cancelled_write_bytes sum for all live threads + tsk->signal.ioac
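A rough sketch of how the proposed ac_utime/ac_stime aggregation could
look (illustration only, not a tested patch; it uses the cdata_threads
field from this series and assumes the caller holds a lock that keeps
the thread list stable, e.g. ->siglock or tasklist_lock):

/* illustration only: aggregate CPU time for a whole thread group */
static void fill_tgid_times(struct taskstats *stats, struct task_struct *first)
{
	struct task_struct *tsk = first;
	/* start with the CPU time of already dead threads ... */
	cputime_t utime = first->signal->cdata_threads.utime;
	cputime_t stime = first->signal->cdata_threads.stime;

	/* ... and add the time of all live threads on top */
	do {
		utime = cputime_add(utime, tsk->utime);
		stime = cputime_add(stime, tsk->stime);
	} while_each_thread(first, tsk);

	stats->ac_utime = cputime_to_usecs(utime);
	stats->ac_stime = cputime_to_usecs(stime);
}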
If we leave everything as it currently is for tgid accounting, we
probably should also not include the cXtime for tgids and just include
the data for threads (TASKSTATS_CMD_ATTR_PID and task exit events).
What do you think?
Michael
* Michael Holzheu <[email protected]> [2010-12-06 13:37:37]:
> Hello Balbir,
>
> On Fri, 2010-12-03 at 13:03 +0530, Balbir Singh wrote:
> > * Oleg Nesterov <[email protected]> [2010-12-01 19:51:28]:
> >
> > > On 11/29, Michael Holzheu wrote:
> [snip]
> > > Say, taskstats_exit()->fill_stats()->bacct_add_tsk(). Every thread
> > > does this, including the group_leader. But, it is possible that
> > > group_leader exits first, before other threads. IOW, what does
> > > stats->ac_cXtime actually mean?
> > >
> >
> > stats->ac_* time was designed only for tgid's to begin with,
>
> You mean for tids (threads/tasks), no? stats->ac_* time is only reported
> for threads in bacct_add_tsk() and not for tgids.
Sorry, I meant tid's or pid
>
> > so I am
> > not sure if ac_cXtime makes sense for threads
>
> I would suggest doing it the same way as /proc/<tgid>/task/<tid>/stat.
> It reports the (same) cumulative time for each thread. See
> do_task_stat() in fs/proc/array.c. So IMHO it makes sense for taskstats,
> too, to include the cXtime for all threads and not only for the thread
> group leader.
>
> Also I would include tgid in taskstats so that userspace can group the
> tasks into thread groups.
>
> I am not sure about the aggregation for tgids in
> TASKSTATS_CMD_ATTR_TGID and exit events with group_dead=1. Currently
> only the delay accounting numbers are aggregated. If we did it
> like /proc/<tgid>/stat (do_task_stat() with whole=1), we could also
> aggregate the other values, e.g. the CPU time.
>
> I think the following tgid data would make sense to be returned for
> TASKSTATS_CMD_ATTR_TGID and exit events with group_dead=1:
>
> bacct_add_tsk():
> ----------------
> ac_pid thread group leader (tsk->tgid)
> ac_etime thread group leader
> ac_btime thread group leader
> ac_nice thread group leader
> ac_sched thread group leader
> ac_uid thread group leader
> ac_gid thread group leader
> ac_ppid thread group leader
> ac_comm thread group leader
> ac_exit_code thread group leader
> ac_flags thread group leader
> ac_utimescaled ??
> ac_stimescaled ??
> ac_utime sum for all live threads + cdata_threads
> ac_stime sum for all live threads + cdata_threads
> ac_minflt sum for all live threads + cdata_threads
> ac_majflt sum for all live threads + cdata_threads
>
> new:
> ac_cutime cdata_wait
> ac_cstime cdata_wait
> ac_tgid thread group leader
>
This seems to make sense.
> xacct_add_tsk():
> ----------------
> coremem ?
> virtmem ?
> hiwater_rss thread group leader
> hiwater_vm thread group leader
> read_char sum for all live threads + tsk->signal.ioac
> write_char sum for all live threads + tsk->signal.ioac
> read_syscalls sum for all live threads + tsk->signal.ioac
> write_syscalls sum for all live threads + tsk->signal.ioac
> read_bytes sum for all live threads + tsk->signal.ioac
> write_bytes sum for all live threads + tsk->signal.ioac
> cancelled_write_bytes sum for all live threads + tsk->signal.ioac
>
> If we leave everything as it currently is for tgid accounting, we
> probably should also not include the cXtime for tgids and just include
> the data for threads (TASKSTATS_CMD_ATTR_PID and task exit events).
>
> What do you think?
>
This makes sense to me; we'd need to bump up the API version if we are
going to make this change.
--
Three Cheers,
Balbir
On Wed, 2010-12-01 at 19:51 +0100, Oleg Nesterov wrote:
> On 11/29, Michael Holzheu wrote:
> >
> > * I left the struct signal locking in the patch, because I think
> > that in order to get consistent data this is necessary.
>
> OK, I guess you mean that we want to read utime/stime "atomically",
> and thus we need ->siglock to prevent the race with __exit_signal().
Yes.
But I changed my mind. If I understand the code right, currently no
locking is done for reading the task statistics, neither in the
taskstats nor in the procfs interface. So e.g. it is not guaranteed
that you get a consistent set of data when reading /proc/<pid>/stat,
because of the race with account_user/system/whatever_time() and other
accounting functions.
I assume that is intentional. Or am I missing something?
Because you said that accessing the signal_struct is safe since 2.6.35,
I think that for cdata, too, it is not necessary to lock anything. As a
result of this and the discussion regarding the group leader and cdata,
I would propose the following patch:
---
Subject: taskstats: Export "cdata_wait" CPU times with taskstats
From: Michael Holzheu <[email protected]>
Version 3
---------
* Remove signal struct locking because accessing signal_struct is safe
since 2.6.35.
* Report cdata for all tasks, not only for the thread group leader.
This is then the same behavior as for /proc/<tgid>/task/<tid>/stat.
* Add tgid to taskstats to allow userspace to group threads.
Version 2
---------
* Use thread_group_leader() instead of (tsk->pid == tsk->tgid)
* Use cdata_wait instead of cdata_acct
* I left the struct signal locking in the patch, because I think
that in order to get consistent data this is necessary. See also
do_task_stat() in fs/proc/array.c. One problem is that we
report wrong values (zero) for cdata, if lock_task_sighand()
fails. This will be the same behavior as for /proc/<pid>/stat.
But maybe we should somehow return to userspace that the
information is not correct. Any ideas?
Description
-----------
With this patch the cumulative CPU time and the tgid (thread group ID)
are added to "struct taskstats".
Signed-off-by: Michael Holzheu <[email protected]>
---
include/linux/taskstats.h | 10 +++++++++-
kernel/tsacct.c | 3 +++
2 files changed, 12 insertions(+), 1 deletion(-)
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -33,7 +33,7 @@
*/
-#define TASKSTATS_VERSION 7
+#define TASKSTATS_VERSION 8
#define TS_COMM_LEN 32 /* should be >= TASK_COMM_LEN
* in linux/sched.h */
@@ -163,6 +163,14 @@ struct taskstats {
/* Delay waiting for memory reclaim */
__u64 freepages_count;
__u64 freepages_delay_total;
+ /* version 7 ends here */
+
+ __u32 ac_tgid; /* Thread group ID */
+
+ /* Cumulative CPU time of dead children */
+ __u64 ac_cutime; /* User CPU time [usec] */
+ __u64 ac_cstime; /* System CPU time [usec] */
+ /* version 8 ends here */
};
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -71,6 +71,9 @@ void bacct_add_tsk(struct taskstats *sta
stats->ac_majflt = tsk->maj_flt;
strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
+ stats->ac_cutime = cputime_to_usecs(tsk->signal->cdata_wait.utime);
+ stats->ac_cstime = cputime_to_usecs(tsk->signal->cdata_wait.stime);
+ stats->ac_tgid = tsk->tgid;
}
Oh, sorry for the delay. Again ;)
On 12/02, Michael Holzheu wrote:
>
> 1) You have a thread group with several threads
> 2) Thread group leader dies and reports cdata_wait in taskstats_exit()
> 3) Thread group leader stays around as zombie until the thread
> group dies
> 4) Other forked processes of this thread group die
> 5) cdata_wait of thread group is increased
> 6) The new cdata is not reported by any exit event of the thread group
Yes.
> So maybe we should remove the thread_group_leader() check and report
> cdata_wait for all threads and not only for the thread group leader? We
> also should add ac_tgid to taskstats so that userspace can find the
> corresponding thread group for each thread.
I do not know. My only point was, this thread_group_leader() looks
a bit confusing, at least if we are talking about the do_exit() path.
> When the last thread exits and the process/thread group dies,
> taskstats_exit() sends an additional taskstats struct to userspace that
> aggregates the thread accounting data. Currently only the delay
> accounting data is aggregated (see
> taskstats_exit->fill_tgid_exit->delayacct_add_tsk). Not sure, why the
> other information is not aggregated. We perhaps also should include
> ac_cXtime in the aggregated taskstats.
Not sure I understand... Do you mean
if (is_thread_group)
fill_tgid_exit(tsk);
? Afaics, this is not "When the last thread exits", this is
"this program is multithreaded, it has (or had) other threads".
Oleg.
On 12/07, Michael Holzheu wrote:
>
> On Wed, 2010-12-01 at 19:51 +0100, Oleg Nesterov wrote:
> > On 11/29, Michael Holzheu wrote:
> > >
> > > * I left the struct signal locking in the patch, because I think
> > > that in order to get consistent data this is necessary.
> >
> > OK, I guess you mean that we want to read utime/stime "atomically",
> > and thus we need ->siglock to prevent the race with __exit_signal().
>
> Yes.
>
> But I changed my mind. If I understand the code right, currently no
> locking is done for reading the task statistics, neither in the
> taskstats nor in the procfs interface. So e.g. it is not guaranteed
> that you get a consistent set of data when reading /proc/<pid>/stat,
> because of the race with account_user/system/whatever_time() and other
> accounting functions.
Not sure, I think account_user/system/whatever_time() is a bit different.
And, say, do_task_stat() takes ->siglock for thread_group_times(). Btw,
I tried to remove this lock_task_sighand(), and I still hope we can
do this ;) But the patch wasn't acked (iiuc) exactly because it can
make top unhappy..
> Because you said that accessing the signal_struct is safe since 2.6.35,
> I think that for cdata, too, it is not necessary to lock anything. As a
> result of this and the discussion regarding the group leader and cdata,
> I would propose the following patch:
To me, this certainly makes sense. I do not really understand why
it is so important to prevent the case when, say, bacct_add_tsk()
sees the updated .utime and !updated .stime.
If we can race with the exiting tasks, then these numbers are not
"final" anyway.
But again, this all is up to you.
Oleg.
Hello Oleg,
On Wed, 2010-12-08 at 21:23 +0100, Oleg Nesterov wrote:
> Oh, sorry for the delay. Again ;)
No problem, I know this can be exhausting ;-)
> On 12/02, Michael Holzheu wrote:
> > When the last thread exits and the process/thread group dies,
> > taskstats_exit() sends an additional taskstats struct to userspace that
> > aggregates the thread accounting data. Currently only the delay
> > accounting data is aggregated (see
> > taskstats_exit->fill_tgid_exit->delayacct_add_tsk). Not sure, why the
> > other information is not aggregated. We perhaps also should include
> > ac_cXtime in the aggregated taskstats.
>
> Not sure I understand... Do you mean
>
> if (is_thread_group)
> fill_tgid_exit(tsk);
>
> ? Afaics, this is not "When the last thread exits", this is
> "this program is multithreaded, it has (or had) other threads".
fill_tgid_exit() adds the data of the dying thread to the thread group
data (tsk->signal->stats). Currently only for delay accounting.
But the accumulated data is sent to userspace only after the last thread
has died:
if (!is_thread_group || >>>!group_dead<<<)
goto send;
stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tsk->tgid);
if (!stats)
goto err;
Michael
Hello Oleg,
On Sat, 2010-12-11 at 18:39 +0100, Oleg Nesterov wrote:
> On 12/10, Michael Holzheu wrote:
> >
> > > > When the last thread exits and the process/thread group dies,
> > > > taskstats_exit() sends an additional taskstats struct to userspace that
> > > > aggregates the thread accounting data. Currently only the delay
> > > > accounting data is aggregated (see
> > > > taskstats_exit->fill_tgid_exit->delayacct_add_tsk). Not sure, why the
> > > > other information is not aggregated. We perhaps also should include
> > > > ac_cXtime in the aggregated taskstats.
> > >
> > > Not sure I understand... Do you mean
> > >
> > > if (is_thread_group)
> > > fill_tgid_exit(tsk);
> > >
> > > ? Afaics, this is not "When the last thread exits", this is
> > > "this program is multithreaded, it has (or had) other threads".
> >
> > fill_tgid_exit() adds the data of the dying thread to the thread group
> > data (tsk->signal->stats). Currently only for delay accounting.
> >
> > But the accumulated data is sent to userspace only after the last thread
> > has died:
> >
> > if (!is_thread_group || >>>!group_dead<<<)
> > goto send;
>
> Ah, right you are.
>
> And this looks racy, or I missed something again. group_dead can be
> true, but this doesn't mean all other threads have already passed
> taskstats_exit()->fill_tgid_exit()->delayacct_add_tsk().
I think you are right.
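To spell out the interleaving, with the current ordering in do_exit():

	thread A			thread B (last exiter)
	--------			----------------------
	atomic_dec_and_test(&sig->live)
	  -> group_dead == 0
	(preempted before reaching
	 taskstats_exit())
					atomic_dec_and_test(&sig->live)
					  -> group_dead == 1
					taskstats_exit():
					  fill_tgid_exit() /* adds B only */
					  send TGID aggregate
	taskstats_exit():
	  fill_tgid_exit()		/* too late, A's delays are lost */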
One way to fix that could be to separate the aggregation from the
sending. We could call fill_tgid_exit()->delayacct_add_tsk() before
atomic_dec_and_test(&tsk->signal->live) in do_exit(), and call
taskstats_exit() with the sender part afterwards.
> And. Why "if (is_thread_group)" does "size *= 2" unconditionally?
> IOW, perhaps the patch below makes sense?
Yes, I think the patch makes sense. Balbir?
>
> Oleg.
>
> --- x/kernel/taskstats.c
> +++ x/kernel/taskstats.c
> @@ -576,7 +576,8 @@ void taskstats_exit(struct task_struct *
> is_thread_group = !!taskstats_tgid_alloc(tsk);
> if (is_thread_group) {
> /* PID + STATS + TGID + STATS */
> - size = 2 * size;
> + if (group_dead)
> + size *= 2;
> /* fill the tsk->signal->stats structure */
> fill_tgid_exit(tsk);
> }
>
On Mon, 2010-12-13 at 14:05 +0100, Michael Holzheu wrote:
> > And this looks racy, or I missed something again. group_dead can be
> > true, but this doesn't mean all other threads have already passed
> > taskstats_exit()->fill_tgid_exit()->delayacct_add_tsk().
>
> I think you are right.
>
> One way to fix that could be to separate the aggregation from the
> sending. We could call fill_tgid_exit()->delayacct_add_tsk() before
> atomic_dec_and_test(&tsk->signal->live) in do_exit(), and call
> taskstats_exit() with the sender part afterwards.
Something like the following patch...
---
include/linux/taskstats_kern.h | 3 +
kernel/exit.c | 3 +
kernel/taskstats.c | 62 +++++++++++++++++------------------------
3 files changed, 31 insertions(+), 37 deletions(-)
--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -21,7 +21,8 @@ static inline void taskstats_tgid_free(s
kmem_cache_free(taskstats_cache, sig->stats);
}
-extern void taskstats_exit(struct task_struct *, int group_dead);
+extern void taskstats_exit_notify(struct task_struct *, int group_dead);
+extern void taskstats_exit_add_thread(struct task_struct *);
extern void taskstats_init_early(void);
#else
static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1030,6 +1030,7 @@ NORET_TYPE void do_exit(long code)
/* sync mm's RSS info before statistics gathering */
if (tsk->mm)
sync_mm_rss(tsk, tsk->mm);
+ taskstats_exit_add_thread(tsk);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
struct cdata *tcd = &tsk->signal->cdata_threads;
@@ -1045,7 +1046,7 @@ NORET_TYPE void do_exit(long code)
audit_free(tsk);
tsk->exit_code = code;
- taskstats_exit(tsk, group_dead);
+ taskstats_exit_notify(tsk, group_dead);
exit_mm(tsk);
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -263,10 +263,32 @@ out:
return rc;
}
-static void fill_tgid_exit(struct task_struct *tsk)
+static void alloc_signal_stats(struct task_struct *tsk)
+{
+ struct signal_struct *sig = tsk->signal;
+ struct taskstats *stats;
+
+ /* No problem if kmem_cache_zalloc() fails */
+ stats = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
+
+ spin_lock_irq(&tsk->sighand->siglock);
+ if (!sig->stats) {
+ sig->stats = stats;
+ stats = NULL;
+ }
+ spin_unlock_irq(&tsk->sighand->siglock);
+
+ if (stats)
+ kmem_cache_free(taskstats_cache, stats);
+}
+
+void taskstats_exit_add_thread(struct task_struct *tsk)
{
unsigned long flags;
+ if (tsk->signal->stats == NULL && !thread_group_empty(tsk))
+ alloc_signal_stats(tsk);
+
spin_lock_irqsave(&tsk->sighand->siglock, flags);
if (!tsk->signal->stats)
goto ret;
@@ -530,39 +552,14 @@ static int taskstats_user_cmd(struct sk_
return -EINVAL;
}
-static struct taskstats *taskstats_tgid_alloc(struct task_struct *tsk)
-{
- struct signal_struct *sig = tsk->signal;
- struct taskstats *stats;
-
- if (sig->stats || thread_group_empty(tsk))
- goto ret;
-
- /* No problem if kmem_cache_zalloc() fails */
- stats = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
-
- spin_lock_irq(&tsk->sighand->siglock);
- if (!sig->stats) {
- sig->stats = stats;
- stats = NULL;
- }
- spin_unlock_irq(&tsk->sighand->siglock);
-
- if (stats)
- kmem_cache_free(taskstats_cache, stats);
-ret:
- return sig->stats;
-}
-
/* Send pid data out on exit */
-void taskstats_exit(struct task_struct *tsk, int group_dead)
+void taskstats_exit_notify(struct task_struct *tsk, int group_dead)
{
int rc;
struct listener_list *listeners;
struct taskstats *stats;
struct sk_buff *rep_skb;
size_t size;
- int is_thread_group;
if (!family_registered)
return;
@@ -573,13 +570,8 @@ void taskstats_exit(struct task_struct *
size = nla_total_size(sizeof(u32)) +
nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
- is_thread_group = !!taskstats_tgid_alloc(tsk);
- if (is_thread_group) {
- /* PID + STATS + TGID + STATS */
- size = 2 * size;
- /* fill the tsk->signal->stats structure */
- fill_tgid_exit(tsk);
- }
+ if (group_dead && tsk->signal->stats)
+ size = 2 * size; /* PID + STATS + TGID + STATS */
listeners = &__raw_get_cpu_var(listener_array);
if (list_empty(&listeners->list))
@@ -598,7 +590,7 @@ void taskstats_exit(struct task_struct *
/*
* Doesn't matter if tsk is the leader or the last group member leaving
*/
- if (!is_thread_group || !group_dead)
+ if (!group_dead || tsk->signal->stats == NULL)
goto send;
stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tsk->tgid);
On 12/13, Michael Holzheu wrote:
>
> On Mon, 2010-12-13 at 14:05 +0100, Michael Holzheu wrote:
> > > And this looks racy, or I missed something again. group_dead can be
> > > true, but this doesn't mean all other threads have already passed
> > > taskstats_exit()->fill_tgid_exit()->delayacct_add_tsk().
> >
> > I think you are right.
> >
> > One way to fix that could be to separate the aggregation from the
> > sending. We could call fill_tgid_exit()->delayacct_add_tsk() before
> > atomic_dec_and_test(&tsk->signal->live) in do_exit(), and call
> > taskstats_exit() with the sender part afterwards.
Yes, I think this should fix the race. Some nits below...
> --- a/include/linux/taskstats_kern.h
> +++ b/include/linux/taskstats_kern.h
> @@ -21,7 +21,8 @@ static inline void taskstats_tgid_free(s
> kmem_cache_free(taskstats_cache, sig->stats);
> }
>
> -extern void taskstats_exit(struct task_struct *, int group_dead);
> +extern void taskstats_exit_notify(struct task_struct *, int group_dead);
> +extern void taskstats_exit_add_thread(struct task_struct *);
You forgot to update the !CONFIG_TASKSTATS case ;)
> -static void fill_tgid_exit(struct task_struct *tsk)
> +static void alloc_signal_stats(struct task_struct *tsk)
> +{
> + struct signal_struct *sig = tsk->signal;
> + struct taskstats *stats;
> +
> + /* No problem if kmem_cache_zalloc() fails */
> + stats = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
> +
> + spin_lock_irq(&tsk->sighand->siglock);
> + if (!sig->stats) {
> + sig->stats = stats;
> + stats = NULL;
> + }
> + spin_unlock_irq(&tsk->sighand->siglock);
> +
> + if (stats)
> + kmem_cache_free(taskstats_cache, stats);
> +}
> +
> +void taskstats_exit_add_thread(struct task_struct *tsk)
> {
> unsigned long flags;
>
> + if (tsk->signal->stats == NULL && !thread_group_empty(tsk))
> + alloc_signal_stats(tsk);
> +
> spin_lock_irqsave(&tsk->sighand->siglock, flags);
> if (!tsk->signal->stats)
> goto ret;
Well. I do not like the fact we take ->siglock unconditionally.
And _irqsave is not needed. And we take it twice if sig->stats == NULL.
And "if (!tsk->signal->stats)" under ->siglock in
taskstats_exit_add_thread() looks a bit ugly...
How about
void taskstats_exit_add_thread(struct task_struct *tsk)
{
struct taskstats *stats = NULL;
if (!tsk->signal->stats) {
if (thread_group_empty(tsk))
return;
stats = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
if (!stats)
return;
}
spin_lock_irq(&tsk->sighand->siglock);
if (!tsk->signal->stats) {
tsk->signal->stats = stats;
stats = NULL;
}
/*
* Each accounting subsystem calls its functions here to
* accumalate its per-task stats for tsk, into the per-tgid structure
*
* per-task-foo(tsk->signal->stats, tsk);
*/
delayacct_add_tsk(tsk->signal->stats, tsk);
spin_unlock_irq(&tsk->sighand->siglock);
if (unlikely(stats))
kmem_cache_free(taskstats_cache, stats);
}
?
Note that it absorbs alloc_signal_stats().
But up to you, probably this is a matter of taste.
Oleg.
On Mon, 2010-12-13 at 15:33 +0100, Oleg Nesterov wrote:
> > --- a/include/linux/taskstats_kern.h
> > +++ b/include/linux/taskstats_kern.h
> > @@ -21,7 +21,8 @@ static inline void taskstats_tgid_free(s
> > kmem_cache_free(taskstats_cache, sig->stats);
> > }
> >
> > -extern void taskstats_exit(struct task_struct *, int group_dead);
> > +extern void taskstats_exit_notify(struct task_struct *, int group_dead);
> > +extern void taskstats_exit_add_thread(struct task_struct *);
> You forgot to update the !CONFIG_TASKSTATS case ;)
... of course, thanks!
[snip]
> Well. I do not like the fact we take ->siglock unconditionally.
> And _irqsave is not needed. And we take it twice if sig->stats == NULL.
> And "if (!tsk->signal->stats)" under ->siglock in
> taskstats_exit_add_thread() looks a bit ugly...
Yes, I also saw that, but I didn't want to change too much. In fact your
code is much better. I added it to the patch and also replaced
_add_thread with _add_task, because "task" seems to be the term used for
"thread" in the taskstats code. I also did some testing. Everything
still seems to work well.
The new patch is attached below. Balbir, do you agree?
Michael
---
include/linux/taskstats_kern.h | 8 +++-
kernel/exit.c | 3 +
kernel/taskstats.c | 67 ++++++++++++++---------------------------
3 files changed, 32 insertions(+), 46 deletions(-)
--- a/include/linux/taskstats_kern.h
+++ b/include/linux/taskstats_kern.h
@@ -21,10 +21,14 @@ static inline void taskstats_tgid_free(s
kmem_cache_free(taskstats_cache, sig->stats);
}
-extern void taskstats_exit(struct task_struct *, int group_dead);
+extern void taskstats_exit_notify(struct task_struct *tsk, int group_dead);
+extern void taskstats_exit_add_task(struct task_struct *tsk);
extern void taskstats_init_early(void);
#else
-static inline void taskstats_exit(struct task_struct *tsk, int group_dead)
+static inline void taskstats_exit_notify(struct task_struct *tsk,
+ int group_dead)
+{}
+static inline void taskstats_exit_add_task(struct task_struct *tsk)
{}
static inline void taskstats_tgid_free(struct signal_struct *sig)
{}
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1030,6 +1030,7 @@ NORET_TYPE void do_exit(long code)
/* sync mm's RSS info before statistics gathering */
if (tsk->mm)
sync_mm_rss(tsk, tsk->mm);
+ taskstats_exit_add_task(tsk);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
struct cdata *tcd = &tsk->signal->cdata_threads;
@@ -1045,7 +1046,7 @@ NORET_TYPE void do_exit(long code)
audit_free(tsk);
tsk->exit_code = code;
- taskstats_exit(tsk, group_dead);
+ taskstats_exit_notify(tsk, group_dead);
exit_mm(tsk);
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -263,24 +263,35 @@ out:
return rc;
}
-static void fill_tgid_exit(struct task_struct *tsk)
+void taskstats_exit_add_task(struct task_struct *tsk)
{
- unsigned long flags;
+ struct taskstats *stats = NULL;
- spin_lock_irqsave(&tsk->sighand->siglock, flags);
- if (!tsk->signal->stats)
- goto ret;
+ if (!tsk->signal->stats) {
+ if (thread_group_empty(tsk))
+ return;
+ stats = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
+ if (!stats)
+ return; /* Bad luck, we will lose this task */
+ }
+
+ spin_lock_irq(&tsk->sighand->siglock);
+ if (!tsk->signal->stats) {
+ tsk->signal->stats = stats;
+ stats = NULL;
+ }
/*
* Each accounting subsystem calls its functions here to
* accumalate its per-task stats for tsk, into the per-tgid structure
*
- * per-task-foo(tsk->signal->stats, tsk);
+ * per-task-foo(tsk->signal->stats, tsk);
*/
delayacct_add_tsk(tsk->signal->stats, tsk);
-ret:
- spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
- return;
+ spin_unlock_irq(&tsk->sighand->siglock);
+
+ if (unlikely(stats))
+ kmem_cache_free(taskstats_cache, stats);
}
static int add_del_listener(pid_t pid, const struct cpumask *mask, int isadd)
@@ -530,39 +541,14 @@ static int taskstats_user_cmd(struct sk_
return -EINVAL;
}
-static struct taskstats *taskstats_tgid_alloc(struct task_struct *tsk)
-{
- struct signal_struct *sig = tsk->signal;
- struct taskstats *stats;
-
- if (sig->stats || thread_group_empty(tsk))
- goto ret;
-
- /* No problem if kmem_cache_zalloc() fails */
- stats = kmem_cache_zalloc(taskstats_cache, GFP_KERNEL);
-
- spin_lock_irq(&tsk->sighand->siglock);
- if (!sig->stats) {
- sig->stats = stats;
- stats = NULL;
- }
- spin_unlock_irq(&tsk->sighand->siglock);
-
- if (stats)
- kmem_cache_free(taskstats_cache, stats);
-ret:
- return sig->stats;
-}
-
/* Send pid data out on exit */
-void taskstats_exit(struct task_struct *tsk, int group_dead)
+void taskstats_exit_notify(struct task_struct *tsk, int group_dead)
{
int rc;
struct listener_list *listeners;
struct taskstats *stats;
struct sk_buff *rep_skb;
size_t size;
- int is_thread_group;
if (!family_registered)
return;
@@ -573,13 +559,8 @@ void taskstats_exit(struct task_struct *
size = nla_total_size(sizeof(u32)) +
nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
- is_thread_group = !!taskstats_tgid_alloc(tsk);
- if (is_thread_group) {
- /* PID + STATS + TGID + STATS */
- size = 2 * size;
- /* fill the tsk->signal->stats structure */
- fill_tgid_exit(tsk);
- }
+ if (group_dead && tsk->signal->stats)
+ size = 2 * size; /* PID + STATS + TGID + STATS */
listeners = &__raw_get_cpu_var(listener_array);
if (list_empty(&listeners->list))
@@ -598,7 +579,7 @@ void taskstats_exit(struct task_struct *
/*
* Doesn't matter if tsk is the leader or the last group member leaving
*/
- if (!is_thread_group || !group_dead)
+ if (!group_dead || tsk->signal->stats == NULL)
goto send;
stats = mk_reply(rep_skb, TASKSTATS_TYPE_TGID, tsk->tgid);