2006-03-14 00:40:43

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 0/9] Per-task delay accounting

This is the next iteration of the delay accounting patches
last posted at
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html

The major changes to the patches (and names of commenters whose concerns
have been addressed) are:

- considerable revision of the synchronous block I/O delay collection (Arjan)
Now most delays seen in I/O submission are not being collected.
- removal of dependency on schedstats by using common code (Nick)
- removal of sysctl or dynamic configurability of delay accounting (Arjan)
Now it it can be set on/off at boot time only.
- providing I/O delays through /proc/<tgid>/stats (Andi)
- revisions to the generic netlink interface (Jamal)

Unaddressed comments are suggestions from Jamal on
further improving the generic netlink interface which is ongoing.

Series

nstimestamp-diff.patch
delayacct-setup.patch
delayacct-blkio-init.patch
delayacct-blkio-collect.patch
delayacct-swapin.patch
delayacct-procfs.patch
delayacct-schedstats.patch
genetlink-utils.patch
delayacct-genetlink.patch


A short note on expected usage of this functionality follows which
addresses a comment from Nick in the previous iteration.
Basically, the patches measures a subset of the delays encountered
by a task waiting for a resource such as cpu, disk I/O and page
frames. Which subset is chosen depends on whether a task's delay
is something that can be controlled by playing with its priority
or rss limit etc.
e.g. cpu delays can directly be affected by its static CPU priority.
Similarly I/O delay can be affected by its I/O priority (assuming
one uses the right iosched). In page frame delay, we only count
the page faults due to swapin since that is directly affected by
the rss limits for a task (well, ok, process).

Delays can be used as feedback to decide whether something can/should
be done about a task (or group of tasks) which is not performing
as one expects.

http://www.research.ibm.com/journal/sj/362/amanaut.html
lists a fairly complicated way of using such data in
workload management.

At a simpler level, one can think of utilities that can use this data
to control individual apps using a simple feedback loop.
At this point we do not have such a utility.

Of course, its useful to visually inspect such data. Some of it is
being exported through /proc as suggested by Andi, but the primary
interface is through a netlink socket so that userspace can get
data for exiting tasks too.

The netlink socket also allows userspace apps to efficiently
collect delay data and group it in arbitrary ways
(as envisaged by CKRM, CSA or ELSA) for reporting or
control purposes.


--Shailabh


2006-03-14 00:42:21

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 1/9] timestamp diff

nstimestamp_diff.patch

Add kernel utility function for measuring the
interval (diff) between two timespec values

Comments addressed (commenter)

- returns difference as timespec instead of nsecs (Arjan)

Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Shailabh Nagar <[email protected]>


include/linux/time.h | 14 ++++++++++++++
1 files changed, 14 insertions(+)

Index: linux-2.6.16-rc5/include/linux/time.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/time.h 2006-03-10 17:56:56.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/time.h 2006-03-10 17:57:26.000000000 -0500
@@ -147,6 +147,20 @@ extern struct timespec ns_to_timespec(co
*/
extern struct timeval ns_to_timeval(const nsec_t nsec);

+/**
+ * timespec_diff - Return difference of timespecs as a timespec
+ * @start: first timespec
+ * @end: second timespec
+ * @ret: result timespec
+ *
+ * Returns the difference between @start and @end in @ret
+ */
+static inline void timespec_diff(struct timespec *start, struct timespec *end,
+ struct timespec *ret)
+{
+ ret->tv_sec = end->tv_sec - start->tv_sec;
+ ret->tv_nsec = end->tv_nsec - start->tv_nsec;
+}
#endif /* __KERNEL__ */

#define NFDBITS __NFDBITS


2006-03-14 00:45:05

by Shailabh Nagar

[permalink] [raw]
Subject: Patch 2/9] Initialization

delayacct-setup.patch

Initialization code related to collection of per-task "delay"
statistics which measure how long it had to wait for cpu,
sync block io, swapping etc. The collection of statistics and
the interface are in other patches. This patch sets up the data
structures and allows the statistics collection to be disabled
through a kernel boot paramater.

Signed-off-by: Shailabh Nagar <[email protected]>

Documentation/kernel-parameters.txt | 2
include/linux/delayacct.h | 48 ++++++++++++++++++
include/linux/sched.h | 17 ++++++
init/Kconfig | 13 +++++
init/main.c | 2
kernel/Makefile | 1
kernel/delayacct.c | 92 ++++++++++++++++++++++++++++++++++++
kernel/exit.c | 3 +
kernel/fork.c | 2
9 files changed, 180 insertions(+)

Index: linux-2.6.16-rc5/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/kernel-parameters.txt 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/Documentation/kernel-parameters.txt 2006-03-11 07:41:37.000000000 -0500
@@ -410,6 +410,8 @@ running once the system is up.
Format: <area>[,<node>]
See also Documentation/networking/decnet.txt.

+ delayacct [KNL] Enable per-task delay accounting
+
devfs= [DEVFS]
See Documentation/filesystems/devfs/boot-options.

Index: linux-2.6.16-rc5/kernel/Makefile
===================================================================
--- linux-2.6.16-rc5.orig/kernel/Makefile 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/Makefile 2006-03-11 07:41:37.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.16-rc5/include/linux/delayacct.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/include/linux/delayacct.h 2006-03-11 07:41:37.000000000 -0500
@@ -0,0 +1,48 @@
+/* delayacct.h - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKDELAYS_H
+#define _LINUX_TASKDELAYS_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASK_DELAY_ACCT
+extern int delayacct_on; /* Delay accounting turned on/off */
+extern kmem_cache_t *delayacct_cache;
+extern int delayacct_init(void);
+extern void __delayacct_tsk_init(struct task_struct *);
+extern void __delayacct_tsk_exit(struct task_struct *);
+
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{
+ /* reinitialize in case parent's non-null pointer was dup'ed*/
+ tsk->delays = NULL;
+ if (unlikely(delayacct_on))
+ __delayacct_tsk_init(tsk);
+}
+
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{
+ if (tsk->delays)
+ __delayacct_tsk_exit(tsk);
+}
+
+#else
+static inline int delayacct_init(void)
+{}
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{}
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{}
+#endif /* CONFIG_TASK_DELAY_ACCT */
+#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:37.000000000 -0500
@@ -539,6 +539,20 @@ struct sched_info {
extern struct file_operations proc_schedstat_operations;
#endif

+#ifdef CONFIG_TASK_DELAY_ACCT
+struct task_delay_info {
+ spinlock_t lock;
+
+ /* For each stat XXX, add following, aligned appropriately
+ *
+ * struct timespec XXX_start, XXX_end;
+ * u64 XXX_delay;
+ * u32 XXX_count;
+ */
+};
+#endif
+
+
enum idle_type
{
SCHED_IDLE,
@@ -870,6 +884,9 @@ struct task_struct {
#endif
atomic_t fs_excl; /* holding fs exclusive resources */
struct rcu_head rcu;
+#ifdef CONFIG_TASK_DELAY_ACCT
+ struct task_delay_info *delays;
+#endif
};

static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.16-rc5/init/Kconfig
===================================================================
--- linux-2.6.16-rc5.orig/init/Kconfig 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/init/Kconfig 2006-03-11 07:41:37.000000000 -0500
@@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.

+config TASK_DELAY_ACCT
+ bool "Enable per-task delay accounting (EXPERIMENTAL)"
+ help
+ Collect information on time spent by a task waiting for system
+ resources like cpu, synchronous block I/O completion and swapping
+ in pages. Such statistics can help in setting a task's priorities
+ relative to other tasks for cpu, io, rss limits etc.
+
+ Unlike BSD process accounting, this information is available
+ continuously during the lifetime of a task.
+
+ Say N if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux-2.6.16-rc5/init/main.c
===================================================================
--- linux-2.6.16-rc5.orig/init/main.c 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/init/main.c 2006-03-11 07:41:37.000000000 -0500
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/delayacct.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void
proc_root_init();
#endif
cpuset_init();
+ delayacct_init();

check_bugs();

Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:37.000000000 -0500
@@ -0,0 +1,92 @@
+/* delayacct.c - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/time.h>
+#include <linux/sysctl.h>
+#include <linux/delayacct.h>
+
+int delayacct_on = 0; /* Delay accounting turned on/off */
+kmem_cache_t *delayacct_cache;
+
+static int __init delayacct_setup_enable(char *str)
+{
+ delayacct_on = 1;
+ return 1;
+}
+__setup("delayacct", delayacct_setup_enable);
+
+int delayacct_init(void)
+{
+ delayacct_cache = kmem_cache_create("delayacct_cache",
+ sizeof(struct task_delay_info),
+ 0,
+ SLAB_PANIC,
+ NULL, NULL);
+ if (!delayacct_cache)
+ return -ENOMEM;
+ delayacct_tsk_init(&init_task);
+ return 0;
+}
+
+void __delayacct_tsk_init(struct task_struct *tsk)
+{
+ tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL);
+ if (tsk->delays) {
+ memset(tsk->delays, 0, sizeof(*tsk->delays));
+ spin_lock_init(&tsk->delays->lock);
+ }
+}
+
+void __delayacct_tsk_exit(struct task_struct *tsk)
+{
+ if (tsk->delays) {
+ kmem_cache_free(delayacct_cache, tsk->delays);
+ tsk->delays = NULL;
+ }
+}
+
+/*
+ * Start accounting for a delay statistic using
+ * its starting timestamp (@start)
+ */
+
+static inline void delayacct_start(struct timespec *start)
+{
+ do_posix_clock_monotonic_gettime(start);
+}
+
+/*
+ * Finish delay accounting for a statistic using
+ * its timestamps (@start, @end), accumalator (@total) and @count
+ */
+
+static inline void delayacct_end(struct timespec *start, struct timespec *end,
+ u64 *total, u32 *count)
+{
+ struct timespec ts;
+ nsec_t ns;
+
+ do_posix_clock_monotonic_gettime(end);
+ timespec_diff(start, end, &ts);
+ ns = timespec_to_ns(&ts);
+ if (ns < 0)
+ return;
+
+ spin_lock(&current->delays->lock);
+ *total += ns;
+ (*count)++;
+ spin_unlock(&current->delays->lock);
+}
+
Index: linux-2.6.16-rc5/kernel/fork.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/fork.c 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/fork.c 2006-03-11 07:41:37.000000000 -0500
@@ -44,6 +44,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/delayacct.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -970,6 +971,7 @@ static task_t *copy_process(unsigned lon
goto bad_fork_cleanup_put_domain;

p->did_exec = 0;
+ delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
copy_flags(clone_flags, p);
p->pid = pid;
retval = -EFAULT;
Index: linux-2.6.16-rc5/kernel/exit.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/exit.c 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/exit.c 2006-03-11 07:41:37.000000000 -0500
@@ -31,6 +31,7 @@
#include <linux/signal.h>
#include <linux/cn_proc.h>
#include <linux/mutex.h>
+#include <linux/delayacct.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -842,6 +843,8 @@ fastcall NORET_TYPE void do_exit(long co
preempt_count());

acct_update_integrals(tsk);
+ delayacct_tsk_exit(tsk);
+
if (tsk->mm) {
update_hiwater_rss(tsk->mm);
update_hiwater_vm(tsk->mm);


2006-03-14 00:47:10

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 3/9] Block I/O accounting initialization

delayacct-blkio-init.patch

Setup variables and functions to collect per-task
block I/O delays.

Separating the collection part to a later patch
for easier review and discussion.

Signed-off-by: Shailabh Nagar <[email protected]>


include/linux/delayacct.h | 16 ++++++++++++++++
include/linux/sched.h | 4 ++++
kernel/delayacct.c | 14 ++++++++++++++
3 files changed, 34 insertions(+)

Index: linux-2.6.16-rc5/include/linux/delayacct.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/delayacct.h 2006-03-11 07:41:37.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/delayacct.h 2006-03-11 07:41:38.000000000 -0500
@@ -22,6 +22,8 @@ extern kmem_cache_t *delayacct_cache;
extern int delayacct_init(void);
extern void __delayacct_tsk_init(struct task_struct *);
extern void __delayacct_tsk_exit(struct task_struct *);
+extern void __delayacct_blkio_start(void);
+extern void __delayacct_blkio_end(void);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -37,6 +39,16 @@ static inline void delayacct_tsk_exit(st
__delayacct_tsk_exit(tsk);
}

+static inline void delayacct_blkio_start(void)
+{
+ if (unlikely(delayacct_on))
+ __delayacct_blkio_start();
+}
+static inline void delayacct_blkio_end(void)
+{
+ if (unlikely(delayacct_on))
+ __delayacct_blkio_end();
+}
#else
static inline int delayacct_init(void)
{}
@@ -44,5 +56,9 @@ static inline void delayacct_tsk_init(st
{}
static inline void delayacct_tsk_exit(struct task_struct *tsk)
{}
+static inline void delayacct_blkio_start(void)
+{}
+static inline void delayacct_blkio_end(void)
+{}
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/delayacct.c 2006-03-11 07:41:37.000000000 -0500
+++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:38.000000000 -0500
@@ -90,3 +90,17 @@ static inline void delayacct_end(struct
spin_unlock(&current->delays->lock);
}

+void __delayacct_blkio_start(void)
+{
+ if (current->delays)
+ delayacct_start(&current->delays->blkio_start);
+}
+
+void __delayacct_blkio_end(void)
+{
+ if (current->delays)
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->blkio_delay,
+ &current->delays->blkio_count);
+}
Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:37.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:38.000000000 -0500
@@ -549,6 +549,10 @@ struct task_delay_info {
* u64 XXX_delay;
* u32 XXX_count;
*/
+
+ struct timespec blkio_start, blkio_end;
+ u64 blkio_delay; /* wait for sync block io completion */
+ u32 blkio_count;
};
#endif



2006-03-14 00:48:25

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 4/9] Block I/O accounting collection

delayacct-blkio-collect.patch

Collect per-task block I/O delay statistics.

Unlike earlier iterations of the delay accounting
patches, now delays are only collected for the actual
I/O waits rather than try and cover the delays seen in
I/O submission paths.

Signed-off-by: Shailabh Nagar <[email protected]>

kernel/sched.c | 5 +++++
1 files changed, 5 insertions(+)

Index: linux-2.6.16-rc5/kernel/sched.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/sched.c 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/sched.c 2006-03-11 07:41:39.000000000 -0500
@@ -49,6 +49,7 @@
#include <linux/syscalls.h>
#include <linux/times.h>
#include <linux/acct.h>
+#include <linux/delayacct.h>
#include <asm/tlb.h>

#include <asm/unistd.h>
@@ -4117,9 +4118,11 @@ void __sched io_schedule(void)
{
struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());

+ delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
schedule();
atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
}

EXPORT_SYMBOL(io_schedule);
@@ -4129,9 +4132,11 @@ long __sched io_schedule_timeout(long ti
struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
long ret;

+ delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
return ret;
}



2006-03-14 00:51:29

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 5/9] Swapin delays

delayacct-swapin.patch

Account separately for block I/O delays
incurred as a result of swapin page faults since
these are a result of insufficient rss limit.

Signed-off-by: Shailabh Nagar <[email protected]>

include/linux/sched.h | 5 ++++-
kernel/delayacct.c | 17 ++++++++++++-----
mm/memory.c | 4 ++++
3 files changed, 20 insertions(+), 6 deletions(-)

Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:38.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:39.000000000 -0500
@@ -550,9 +550,11 @@ struct task_delay_info {
* u32 XXX_count;
*/

- struct timespec blkio_start, blkio_end;
+ struct timespec blkio_start, blkio_end; /* Shared by blkio, swapin */
u64 blkio_delay; /* wait for sync block io completion */
+ u64 swapin_delay; /* wait for sync block io completion */
u32 blkio_count;
+ u32 swapin_count;
};
#endif

@@ -949,6 +951,7 @@ static inline void put_task_struct(struc
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
#define PF_SWAPWRITE 0x01000000 /* Allowed to write to swap */
+#define PF_SWAPIN 0x02000000 /* I am doing a swap in */

/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/delayacct.c 2006-03-11 07:41:38.000000000 -0500
+++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:39.000000000 -0500
@@ -98,9 +98,16 @@ void __delayacct_blkio_start(void)

void __delayacct_blkio_end(void)
{
- if (current->delays)
- delayacct_end(&current->delays->blkio_start,
- &current->delays->blkio_end,
- &current->delays->blkio_delay,
- &current->delays->blkio_count);
+ if (current->delays) {
+ if (current->flags & PF_SWAPIN) /* Swapping a page in */
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->swapin_delay,
+ &current->delays->swapin_count);
+ else /* Other block I/O */
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->blkio_delay,
+ &current->delays->blkio_count);
+ }
}
Index: linux-2.6.16-rc5/mm/memory.c
===================================================================
--- linux-2.6.16-rc5.orig/mm/memory.c 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/mm/memory.c 2006-03-11 07:41:39.000000000 -0500
@@ -1882,6 +1882,7 @@ static int do_swap_page(struct mm_struct

entry = pte_to_swp_entry(orig_pte);
again:
+ current->flags |= PF_SWAPIN;
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1894,6 +1895,7 @@ again:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte)))
ret = VM_FAULT_OOM;
+ current->flags &= ~PF_SWAPOFF;
goto unlock;
}

@@ -1905,6 +1907,8 @@ again:

mark_page_accessed(page);
lock_page(page);
+ current->flags &= ~PF_SWAPOFF;
+
if (!PageSwapCache(page)) {
/* Page migration has occured */
unlock_page(page);


2006-03-14 00:53:49

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 7/9] /proc interface for all I/O delays

delayacct-procfs.patch

Export I/O delays seen by a task through /proc/<tgid>/stats
for use in top etc.

Note that delays for I/O done for swapping in pages (swapin I/O) is
clubbed together with all other I/O here (this is not the
case for the other interface where the swapin I/O is kept distinct)

Signed-off-by: Shailabh Nagar <[email protected]>


fs/proc/array.c | 6 ++++--
include/linux/delayacct.h | 10 ++++++++++
include/linux/jiffies.h | 6 ++++++
kernel/delayacct.c | 13 +++++++++++++
4 files changed, 33 insertions(+), 2 deletions(-)

Index: linux-2.6.16-rc5/fs/proc/array.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/proc/array.c 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/fs/proc/array.c 2006-03-11 07:41:40.000000000 -0500
@@ -75,6 +75,7 @@
#include <linux/times.h>
#include <linux/cpuset.h>
#include <linux/rcupdate.h>
+#include <linux/delayacct.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -414,7 +415,7 @@ static int do_task_stat(struct task_stru

res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
task->pid,
tcomm,
state,
@@ -459,7 +460,8 @@ static int do_task_stat(struct task_stru
task->exit_signal,
task_cpu(task),
task->rt_priority,
- task->policy);
+ task->policy,
+ delayacct_blkio_ticks(task));
if(mm)
mmput(mm);
return res;
Index: linux-2.6.16-rc5/include/linux/delayacct.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/delayacct.h 2006-03-11 07:41:38.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/delayacct.h 2006-03-11 07:41:40.000000000 -0500
@@ -24,6 +24,7 @@ extern void __delayacct_tsk_init(struct
extern void __delayacct_tsk_exit(struct task_struct *);
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
+extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -49,6 +50,11 @@ static inline void delayacct_blkio_end(v
if (unlikely(delayacct_on))
__delayacct_blkio_end();
}
+static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ if (unlikely(delayacct_on))
+ return __delayacct_blkio_ticks(tsk);
+}
#else
static inline int delayacct_init(void)
{}
@@ -60,5 +66,9 @@ static inline void delayacct_blkio_start
{}
static inline void delayacct_blkio_end(void)
{}
+static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ return 0;
+}
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16-rc5/include/linux/jiffies.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/jiffies.h 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/jiffies.h 2006-03-11 07:41:40.000000000 -0500
@@ -291,6 +291,12 @@ static inline unsigned long usecs_to_jif
#endif
}

+static inline unsigned long nsecs_to_jiffies(const nsec_t n)
+{
+ return (((u64)n * NSEC_CONVERSION) >>
+ (NSEC_JIFFIE_SC - SEC_JIFFIE_SC)) >> SEC_JIFFIE_SC;
+}
+
/*
* The TICK_NSEC - 1 rounds up the value to the next resolution. Note
* that a remainder subtract here would not do the right thing as the
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/delayacct.c 2006-03-11 07:41:39.000000000 -0500
+++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:40.000000000 -0500
@@ -111,3 +111,16 @@ void __delayacct_blkio_end(void)
&current->delays->blkio_count);
}
}
+
+unsigned long long __delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ unsigned long long ret;
+
+ if (!tsk->delays)
+ return 0;
+
+ spin_lock(&tsk->delays->lock);
+ ret = nsec_to_clock_t(tsk->delays->blkio_delay + tsk->delays->swapin_delay);
+ spin_unlock(&tsk->delays->lock);
+ return ret;
+}


2006-03-14 00:55:12

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 8/9] generic netlink utility functions

genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Shailabh Nagar <[email protected]>

include/net/genetlink.h | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+)

Index: linux-2.6.16-rc5/include/net/genetlink.h
===================================================================
--- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/net/genetlink.h 2006-03-11 07:41:41.000000000 -0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
}

+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+ return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+ struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+ NLMSG_HDRLEN);
+ return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
#endif /* __NET_GENERIC_NETLINK_H */


2006-03-14 00:56:41

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 9/9] Generic netlink interface for delay accounting

delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.

Comments addressed (all in response to Jamal)

- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
- Use multicast only for events generated due to task exit, not for
replies to commands
- Align taskstats and taskstats_reply structures to 64 bit aligned.
- Use dynamic ID generation for genetlink family

More changes expected. Following comments will go into a
Documentation file:

When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.

The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>

include/linux/taskstats.h | 128 ++++++++++++++++++++++++
init/Kconfig | 16 ++-
kernel/taskstats.c | 244 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 385 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc5/include/linux/taskstats.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/include/linux/taskstats.h 2006-03-13 19:01:35.000000000 -0500
@@ -0,0 +1,128 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+
+ /* Version 1 */
+#define TASKSTATS_NOPID -1
+ __s64 pid;
+ __s64 tgid;
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC, /* Reserved */
+ TASKSTATS_CMD_NONE, /* Not a valid cmd to send
+ * Marks data sent on task/tgid exit */
+ TASKSTATS_CMD_LISTEN, /* Start listening */
+ TASKSTATS_CMD_IGNORE, /* Stop listening */
+ TASKSTATS_CMD_PID, /* Send stats for a pid */
+ TASKSTATS_CMD_TGID, /* Send stats for a tgid */
+};
+
+/* Parameters for commands
+ * New parameters should only be inserted at the struct's end
+ */
+
+struct taskstats_cmd_param {
+ /* Maintain 64-bit alignment while extending */
+ union {
+ __s64 pid;
+ __s64 tgid;
+ } id;
+};
+
+enum outtype {
+ TASKSTATS_REPLY_NONE = 1, /* Control cmd response */
+ TASKSTATS_REPLY_PID, /* per-pid data cmd response*/
+ TASKSTATS_REPLY_TGID, /* per-tgid data cmd response*/
+ TASKSTATS_REPLY_EXIT_PID, /* Exiting task's stats */
+ TASKSTATS_REPLY_EXIT_TGID, /* Exiting tgid's stats
+ * (sent on each tid's exit) */
+};
+
+/*
+ * Reply sent from kernel
+ * Version number affects size/format of struct taskstats only
+ */
+
+struct taskstats_reply {
+ /* Maintain 64-bit alignment while extending */
+ __u16 outtype; /* Must be one of enum outtype */
+ __u16 version;
+ __u32 err;
+ struct taskstats stats; /* Invalid if err != 0 */
+};
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#define TASKSTATS_HDRLEN (NLMSG_SPACE(GENL_HDRLEN))
+#define TASKSTATS_BODYLEN (sizeof(struct taskstats_reply))
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.16-rc5/kernel/taskstats.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/kernel/taskstats.c 2006-03-13 19:01:35.000000000 -0500
@@ -0,0 +1,244 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = 0,
+};
+
+/* Taskstat specific functions */
+static int prepare_reply(struct genl_info *info, u8 cmd,
+ struct sk_buff **skbp, struct taskstats_reply **replyp)
+{
+ struct sk_buff *skb;
+ struct taskstats_reply *reply;
+
+ skb = nlmsg_new(TASKSTATS_HDRLEN + TASKSTATS_BODYLEN);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, NLM_F_REQUEST,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, info->nlhdr->nlmsg_flags,
+ info->genlhdr->cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+ skb_put(skb, TASKSTATS_BODYLEN);
+
+ memset(reply, 0, sizeof(*reply));
+ reply->version = taskstats_version;
+ reply->err = 0;
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, int replytype, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ struct taskstats_reply *reply;
+ int rc;
+
+ reply = (struct taskstats_reply *)genlmsg_data(genlhdr);
+ reply->outtype = replytype;
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ else
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline void fill_pid(struct taskstats_reply *reply, pid_t pid,
+ struct task_struct *pidtsk)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ reply->err = EINVAL;
+ return;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(reply, tsk);
+ if (!rc) {
+ reply->stats.pid = (s64)tsk->pid;
+ reply->stats.tgid = (s64)tsk->tgid;
+ } else
+ reply->err = (rc < 0) ? -rc : rc ;
+
+ put_task_struct(tsk);
+}
+
+static int taskstats_send_pid(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats_reply *reply;
+ struct taskstats_cmd_param *param= info->userhdr;
+
+ rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+ if (rc)
+ return rc;
+ fill_pid(reply, param->id.pid, NULL);
+ return send_reply(rep_skb, TASKSTATS_REPLY_PID, info->snd_pid, 0);
+}
+
+static inline void fill_tgid(struct taskstats_reply *reply, pid_t tgid,
+ struct task_struct *tgidtsk)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ reply->err = EINVAL;
+ return;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(reply, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ if (!rc) {
+ reply->stats.pid = (s64)TASKSTATS_NOPID;
+ reply->stats.tgid = (s64)tgid;
+ } else
+ reply->err = (rc < 0) ? -rc : rc ;
+}
+
+static int taskstats_send_tgid(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats_reply *reply;
+ struct taskstats_cmd_param *param= info->userhdr;
+
+ rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+ if (rc)
+ return rc;
+ fill_tgid(reply, param->id.tgid, NULL);
+ return send_reply(rep_skb, TASKSTATS_REPLY_TGID, info->snd_pid, 0);
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats_reply *reply;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+ if (rc)
+ return;
+ fill_pid(reply, tsk->pid, tsk);
+ rc = send_reply(rep_skb, TASKSTATS_REPLY_EXIT_PID, 0, 1);
+
+ if (rc || thread_group_empty(tsk))
+ return;
+
+ /* Send tgid data too */
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+ if (rc)
+ return;
+ fill_tgid(reply, tsk->tgid, tsk);
+ send_reply(rep_skb, TASKSTATS_REPLY_EXIT_TGID, 0, 1);
+}
+
+static struct genl_ops pid_ops = {
+ .cmd = TASKSTATS_CMD_PID,
+ .doit = taskstats_send_pid,
+};
+
+static struct genl_ops tgid_ops = {
+ .cmd = TASKSTATS_CMD_TGID,
+ .doit = taskstats_send_tgid,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &pid_ops))
+ goto err;
+ if (genl_register_ops(&family, &tgid_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
+
Index: linux-2.6.16-rc5/init/Kconfig
===================================================================
--- linux-2.6.16-rc5.orig/init/Kconfig 2006-03-13 18:51:30.000000000 -0500
+++ linux-2.6.16-rc5/init/Kconfig 2006-03-13 19:04:52.000000000 -0500
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.

- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.

+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---


2006-03-14 01:01:17

by Lee Revell

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote:
> + ret->tv_sec = end->tv_sec - start->tv_sec;
> + ret->tv_nsec = end->tv_nsec - start->tv_nsec;

What if end->tv_nsec is less than start->tv_nsec?

Lee

2006-03-14 01:01:51

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 6/9] cpu delay collection

delayacct-schedstats.patch

Collect cpu delays through the
task-related schedstats functions, extracted
as common code to remove dependence of delay
accounting on rest of schedstats.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Shailabh Nagar <[email protected]>

include/linux/sched.h | 8 +++++---
kernel/sched.c | 49 +++++++++++++++++++++++++++++++++++++------------
2 files changed, 42 insertions(+), 15 deletions(-)

Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:39.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:40.000000000 -0500
@@ -524,7 +524,7 @@ typedef struct prio_array prio_array_t;
struct backing_dev_info;
struct reclaim_state;

-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info {
/* cumulative counters */
unsigned long cpu_time, /* time spent on the cpu */
@@ -536,8 +536,10 @@ struct sched_info {
last_queued; /* when we were last queued to run */
};

+#ifdef CONFIG_SCHEDSTATS
extern struct file_operations proc_schedstat_operations;
-#endif
+#endif /* CONFIG_SCHEDSTATS */
+#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */

#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info {
@@ -735,7 +737,7 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;

-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
#endif

Index: linux-2.6.16-rc5/kernel/sched.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/sched.c 2006-03-11 07:41:39.000000000 -0500
+++ linux-2.6.16-rc5/kernel/sched.c 2006-03-11 07:41:40.000000000 -0500
@@ -473,9 +473,26 @@ struct file_operations proc_schedstat_op
.release = single_release,
};

+static inline void rq_sched_info_arrive(struct runqueue *rq,
+ unsigned long diff)
+{
+ if (rq) {
+ rq->rq_sched_info.run_delay += diff;
+ rq->rq_sched_info.pcnt++;
+ }
+}
+
+static inline void rq_sched_info_depart(struct runqueue *rq,
+ unsigned long diff)
+{
+ if (rq)
+ rq->rq_sched_info.cpu_time += diff;
+}
# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
#else /* !CONFIG_SCHEDSTATS */
+static inline void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff) {}
+static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff) {}
# define schedstat_inc(rq, field) do { } while (0)
# define schedstat_add(rq, field, amt) do { } while (0)
#endif
@@ -495,7 +512,19 @@ static inline runqueue_t *this_rq_lock(v
return rq;
}

+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+
+static inline int sched_info_on(void)
+{
#ifdef CONFIG_SCHEDSTATS
+ return 1;
+#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+ extern int delayacct_on;
+ return delayacct_on;
+#endif
+}
+
/*
* Called when a process is dequeued from the active array and given
* the cpu. We should note that with the exception of interactive
@@ -524,7 +553,6 @@ static inline void sched_info_dequeued(t
static void sched_info_arrive(task_t *t)
{
unsigned long now = jiffies, diff = 0;
- struct runqueue *rq = task_rq(t);

if (t->sched_info.last_queued)
diff = now - t->sched_info.last_queued;
@@ -533,11 +561,7 @@ static void sched_info_arrive(task_t *t)
t->sched_info.last_arrival = now;
t->sched_info.pcnt++;

- if (!rq)
- return;
-
- rq->rq_sched_info.run_delay += diff;
- rq->rq_sched_info.pcnt++;
+ rq_sched_info_arrive(task_rq(t), diff);
}

/*
@@ -557,8 +581,9 @@ static void sched_info_arrive(task_t *t)
*/
static inline void sched_info_queued(task_t *t)
{
- if (!t->sched_info.last_queued)
- t->sched_info.last_queued = jiffies;
+ if (sched_info_on())
+ if (!t->sched_info.last_queued)
+ t->sched_info.last_queued = jiffies;
}

/*
@@ -567,13 +592,10 @@ static inline void sched_info_queued(tas
*/
static inline void sched_info_depart(task_t *t)
{
- struct runqueue *rq = task_rq(t);
unsigned long diff = jiffies - t->sched_info.last_arrival;

t->sched_info.cpu_time += diff;
-
- if (rq)
- rq->rq_sched_info.cpu_time += diff;
+ rq_sched_info_depart(task_rq(t), diff);
}

/*
@@ -585,6 +607,9 @@ static inline void sched_info_switch(tas
{
struct runqueue *rq = task_rq(prev);

+ if (!sched_info_on())
+ return;
+
/*
* prev now departs the cpu. It's not interesting to record
* stats about how efficient we were at scheduling the idle


2006-03-14 01:05:28

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

On Mon, 2006-03-13 at 20:01, Lee Revell wrote:
> On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote:
> > + ret->tv_sec = end->tv_sec - start->tv_sec;
> > + ret->tv_nsec = end->tv_nsec - start->tv_nsec;
>
> What if end->tv_nsec is less than start->tv_nsec?

The caller of the func can decide what to do.
In our usage, we choose to not use the result of such a diff
but others may want to return an error etc.

--Shailabh

>
> Lee
>

2006-03-14 01:12:52

by Lee Revell

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

On Mon, 2006-03-13 at 20:05 -0500, Shailabh Nagar wrote:
> On Mon, 2006-03-13 at 20:01, Lee Revell wrote:
> > On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote:
> > > + ret->tv_sec = end->tv_sec - start->tv_sec;
> > > + ret->tv_nsec = end->tv_nsec - start->tv_nsec;
> >
> > What if end->tv_nsec is less than start->tv_nsec?
>
> The caller of the func can decide what to do.
> In our usage, we choose to not use the result of such a diff
> but others may want to return an error etc.

You don't think it's a problem that 2.0000001s - 1.9999999s would return
garbage rather than 0.0000002?

Lee

2006-03-14 02:29:12

by jamal

[permalink] [raw]
Subject: Re: [Patch 9/9] Generic netlink interface for delay accounting

On Mon, 2006-13-03 at 19:56 -0500, Shailabh Nagar wrote:
> delayacct-genetlink.patch
>
> Create a generic netlink interface (NETLINK_GENERIC family),
> called "taskstats", for getting delay and cpu statistics of
> tasks and thread groups during their lifetime and when they exit.
>
> Comments addressed (all in response to Jamal)
>

Note, you are still not following the standard scheme of doing things.
Example: using command = GET and the message carrying the TGID to note
which TGID is of interest. Instead you have command = TGID.

cheers,
jamal

2006-03-14 02:41:43

by Matt Helsley

[permalink] [raw]
Subject: Re: [Patch 9/9] Generic netlink interface for delay accounting

On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
<snip>

> Comments addressed (all in response to Jamal)
>
> - Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE

The enums for these are still in the patch. See below.

<snip>

> +/*
> + * Commands sent from userspace
> + * Not versioned. New commands should only be inserted at the enum's end
> + */
> +
> +enum {
> + TASKSTATS_CMD_UNSPEC, /* Reserved */
> + TASKSTATS_CMD_NONE, /* Not a valid cmd to send
> + * Marks data sent on task/tgid exit */
> + TASKSTATS_CMD_LISTEN, /* Start listening */
> + TASKSTATS_CMD_IGNORE, /* Stop listening */

>From the description I thought you had eliminated these.

> + TASKSTATS_CMD_PID, /* Send stats for a pid */
> + TASKSTATS_CMD_TGID, /* Send stats for a tgid */
> +};

Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
> Note, you are still not following the standard scheme of doing things.
> Example: using command = GET and the message carrying the TGID to note
> which TGID is of interest. Instead you have command = TGID.
>
> cheers,
> jamal

meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
misunderstanding?

<snip>

Cheers,
-Matt Helsley

2006-03-14 02:48:32

by jamal

[permalink] [raw]
Subject: Re: [Patch 9/9] Generic netlink interface for delay accounting

On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:

> Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
> > Note, you are still not following the standard scheme of doing things.
> > Example: using command = GET and the message carrying the TGID to note
> > which TGID is of interest. Instead you have command = TGID.
> >

> meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
> TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
> misunderstanding?
>

I had a long description in an earlier email feedback; but the summary
of it is the GET command is generic like TASKSTATS_CMD_GET; the message
itself carries TLVs of what needs to be gotten which are
either PID and/or TGID etc. Anyways, theres a long spill of what i am
saying in that earlier email. Perhaps the current patch is a transition
towards that?

cheers,
jamal

2006-03-14 03:42:27

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

> You don't think it's a problem that 2.0000001s - 1.9999999s would return
> garbage rather than 0.0000002?
>
> Lee
>

The caller can use set_normalized_timespec() to fix that.

Warm Regards,
Balbir

2006-03-14 04:18:11

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 9/9] Generic netlink interface for delay accounting

jamal wrote:

>On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
>
>
>>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>>
>>
>
>
>
>>Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
>>
>>
>>>Note, you are still not following the standard scheme of doing things.
>>>Example: using command = GET and the message carrying the TGID to note
>>>which TGID is of interest. Instead you have command = TGID.
>>>
>>>
>>>
>
>
>
>>meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
>>TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
>>misunderstanding?
>>
>>
>>
>
>I had a long description in an earlier email feedback; but the summary
>of it is the GET command is generic like TASKSTATS_CMD_GET; the message
>itself carries TLVs of what needs to be gotten which are
>either PID and/or TGID etc. Anyways, theres a long spill of what i am
>saying in that earlier email. Perhaps the current patch is a transition
>towards that?
>
>

Yes, the comments you'd made in the previous mail have not been
incorporated and this is still the older version of the patch.
We'd been avoiding TLV usage so far :-)

>cheers,
>jamal
>
>
>

2006-03-14 04:29:13

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 9/9] Generic netlink interface for delay accounting

Matt Helsley wrote:

>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
><snip>
>
>
>
>>Comments addressed (all in response to Jamal)
>>
>>- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
>>
>>
>
>The enums for these are still in the patch. See below.
>
><snip>
>
>
>
>>+/*
>>+ * Commands sent from userspace
>>+ * Not versioned. New commands should only be inserted at the enum's end
>>+ */
>>+
>>+enum {
>>+ TASKSTATS_CMD_UNSPEC, /* Reserved */
>>+ TASKSTATS_CMD_NONE, /* Not a valid cmd to send
>>+ * Marks data sent on task/tgid exit */
>>+ TASKSTATS_CMD_LISTEN, /* Start listening */
>>+ TASKSTATS_CMD_IGNORE, /* Stop listening */
>>
>>
>
>>From the description I thought you had eliminated these.
>
>

Yup, typo. the commands aren't registered but the enums hang on.
Will fix.

--Shailabh

2006-03-14 04:27:24

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

Balbir Singh wrote:

>>You don't think it's a problem that 2.0000001s - 1.9999999s would return
>>garbage rather than 0.0000002?
>>
>>
Sorry....you're right. I was thinking of invalid deltas (where end was
before start).

>>Lee
>>
>>
>>
>
>The caller can use set_normalized_timespec() to fix that.
>
>
Lee's point is valid....we should be doing the set_normalized_timespec.

Will fix.
Thanks,
Shailabh

>Warm Regards,
>Balbir
>
>

2006-03-14 06:50:40

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

> Lee's point is valid....we should be doing the set_normalized_timespec.
>

Semantically that sounds correct. But the intension of the caller as in
our case could be to convert the timespec to nsec_t. Calling
set_normalized_timespec() might be an overhead in this case.

Balbir

2006-03-14 10:54:48

by Jes Sorensen

[permalink] [raw]
Subject: Re: Patch 2/9] Initialization

>>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:

Shailabh> delayacct-setup.patch Initialization code related to
Shailabh> collection of per-task "delay" statistics which measure how
Shailabh> long it had to wait for cpu, sync block io, swapping
Shailabh> etc. The collection of statistics and the interface are in
Shailabh> other patches. This patch sets up the data structures and
Shailabh> allows the statistics collection to be disabled through a
Shailabh> kernel boot paramater.

Shailabh> +#ifdef CONFIG_TASK_DELAY_ACCT
Shailabh> +struct task_delay_info {
Shailabh> + spinlock_t lock;
Shailabh> +
Shailabh> + /* For each stat XXX, add following, aligned appropriately
Shailabh> + *
Shailabh> + * struct timespec XXX_start, XXX_end;
Shailabh> + * u64 XXX_delay;
Shailabh> + * u32 XXX_count;
Shailabh> + */
Shailabh> +};
Shailabh> +#endif

Hmmm

I thought you were going to change this to do

u64 some_delay
u32 foo_count
u32 bar_count
u64 another_delay

To avoid wasting space on 64 bit platforms.

Jes

2006-03-14 15:20:46

by Shailabh Nagar

[permalink] [raw]
Subject: Re: Patch 2/9] Initialization

Jes Sorensen wrote:

>>>>>>"Shailabh" == Shailabh Nagar <[email protected]> writes:
>>>>>>
>>>>>>
>
>Shailabh> delayacct-setup.patch Initialization code related to
>Shailabh> collection of per-task "delay" statistics which measure how
>Shailabh> long it had to wait for cpu, sync block io, swapping
>Shailabh> etc. The collection of statistics and the interface are in
>Shailabh> other patches. This patch sets up the data structures and
>Shailabh> allows the statistics collection to be disabled through a
>Shailabh> kernel boot paramater.
>
>Shailabh> +#ifdef CONFIG_TASK_DELAY_ACCT
>Shailabh> +struct task_delay_info {
>Shailabh> + spinlock_t lock;
>Shailabh> +
>Shailabh> + /* For each stat XXX, add following, aligned appropriately
>Shailabh> + *
>Shailabh> + * struct timespec XXX_start, XXX_end;
>Shailabh> + * u64 XXX_delay;
>Shailabh> + * u32 XXX_count;
>Shailabh> + */
>Shailabh> +};
>Shailabh> +#endif
>
>Hmmm
>
>I thought you were going to change this to do
>
>u64 some_delay
>u32 foo_count
>u32 bar_count
>u64 another_delay
>
>To avoid wasting space on 64 bit platforms.
>
>

Well, the "aligned appropriately" part of the comment was intended to let
future users know that something like the above should be done.

e.g in a subsequent patch (5/9)
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1919.html
when we introduce the additional stat swapin_delay/count, a 64-bit friendly
alignment is being done thus:

u64 blkio_delay; /* wait for sync block io completion */
+ u64 swapin_delay; /* wait for sync block io completion */
u32 blkio_count;
+ u32 swapin_count;

The need for each stat, potentially, to need a separate set of timespec
variables
to reduce nesting problems does introduce another wrinkle in
cache-friendliness
since you'd ideally like to have all the XXX_* variables close together.
Right now
we're good since swapin doesn't need extra timespecs of its own. But future
additions might need to be more careful.

--Shailabh


>Jes
>
>

2006-03-14 19:28:30

by Greg KH

[permalink] [raw]
Subject: Re: [Patch 0/9] Per-task delay accounting

On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
> This is the next iteration of the delay accounting patches
> last posted at
> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html

Do you have any benchmark numbers with this patch applied and with it
not applied? Last I heard it was a measurable decrease for some
"important" benchmark results...

thanks,

greg k-h

2006-03-14 20:49:20

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/9] Per-task delay accounting

Greg KH wrote:

>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>
>
>>This is the next iteration of the delay accounting patches
>>last posted at
>> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>>
>>
>
>Do you have any benchmark numbers with this patch applied and with it
>not applied?
>
None yet. Wanted to iron out the collection/utility aspects a bit before
going into
the performance impact.

But this seems as good a time as any to collect some stats
using the usual suspects lmbench, kernbench, hackbench etc.

> Last I heard it was a measurable decrease for some
>"important" benchmark results...
>
>
Might have been from an older iteration where schedstats was fully enabled.
But no point speculating....will run with this set of patches and see
what shakes out.

One point about the overhead is that it depends on the frequency with
which data is
collected. So a proper test would probably be a comparison of a
non-patched kernel
with
a) patches applied but delay accounting not turned on at boot i.e. cost
of the checks
b) delay accounting turned on but not being read
c) delay accounting turned on and data read for all tasks at some
"reasonable" rate

Will that be good ? Other suggestions welcome.

>thanks,
>
>greg k-h
>
>

2006-03-14 21:24:20

by Greg KH

[permalink] [raw]
Subject: Re: [Patch 0/9] Per-task delay accounting

On Tue, Mar 14, 2006 at 03:49:16PM -0500, Shailabh Nagar wrote:
> Greg KH wrote:
>
> >On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
> >
> >
> >>This is the next iteration of the delay accounting patches
> >>last posted at
> >> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
> >>
> >>
> >
> >Do you have any benchmark numbers with this patch applied and with it
> >not applied?
> >
> None yet. Wanted to iron out the collection/utility aspects a bit before
> going into
> the performance impact.
>
> But this seems as good a time as any to collect some stats
> using the usual suspects lmbench, kernbench, hackbench etc.
>
> >Last I heard it was a measurable decrease for some
> >"important" benchmark results...
> >
> >
> Might have been from an older iteration where schedstats was fully enabled.
> But no point speculating....will run with this set of patches and see
> what shakes out.
>
> One point about the overhead is that it depends on the frequency with
> which data is
> collected. So a proper test would probably be a comparison of a
> non-patched kernel
> with
> a) patches applied but delay accounting not turned on at boot i.e. cost
> of the checks
> b) delay accounting turned on but not being read

This is probably the most important one, as that is what distros will be
looking at. They will have to enable the option, but will not "turn it
on".

> c) delay accounting turned on and data read for all tasks at some
> "reasonable" rate
>
> Will that be good ? Other suggestions welcome.

How about real benchmarks? The ones that the big companies look at? I
know you have access to them :)

thanks,

greg k-h

2006-03-14 21:59:55

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/9] Per-task delay accounting

Greg KH wrote:

>On Tue, Mar 14, 2006 at 03:49:16PM -0500, Shailabh Nagar wrote:
>
>
>>Greg KH wrote:
>>
>>
>>
>>>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>>>
>>>
>>>
>>>
>>>>This is the next iteration of the delay accounting patches
>>>>last posted at
>>>> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>>>>
>>>>
>>>>
>>>>
>>>Do you have any benchmark numbers with this patch applied and with it
>>>not applied?
>>>
>>>
>>>
>>None yet. Wanted to iron out the collection/utility aspects a bit before
>>going into
>>the performance impact.
>>
>>But this seems as good a time as any to collect some stats
>>using the usual suspects lmbench, kernbench, hackbench etc.
>>
>>
>>
>>>Last I heard it was a measurable decrease for some
>>>"important" benchmark results...
>>>
>>>
>>>
>>>
>>Might have been from an older iteration where schedstats was fully enabled.
>>But no point speculating....will run with this set of patches and see
>>what shakes out.
>>
>>One point about the overhead is that it depends on the frequency with
>>which data is
>>collected. So a proper test would probably be a comparison of a
>>non-patched kernel
>>with
>>a) patches applied but delay accounting not turned on at boot i.e. cost
>>of the checks
>>b) delay accounting turned on but not being read
>>
>>
>
>This is probably the most important one, as that is what distros will be
>looking at. They will have to enable the option, but will not "turn it
>on".
>
>
I guess you meant a), not b) but yes, will run them in all these modes.

>
>
>>c) delay accounting turned on and data read for all tasks at some
>>"reasonable" rate
>>
>>Will that be good ? Other suggestions welcome.
>>
>>
>
>How about real benchmarks? The ones that the big companies look at? I
>know you have access to them :)
>
>
Hmm...though you also know, from working for some "big company", that
it might
take a while to get such data since one has to stand in queue :-)
I'll try, and also explore the OSDL STP's DBT tests.


Thanks,
Shailabh

>thanks,
>
>greg k-h
>
>

2006-03-15 10:23:17

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff


> +static inline void timespec_diff(struct timespec *start, struct timespec *end,
> + struct timespec *ret)
> +{
> + ret->tv_sec = end->tv_sec - start->tv_sec;
> + ret->tv_nsec = end->tv_nsec - start->tv_nsec;
> +}
> #endif /* __KERNEL__ */

I'd suggest normalizing the timespec; better to do it in such a function
than in all callers of it..


2006-03-15 10:24:35

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Patch 2/9] Initialization


> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *

LGPL inside the kernel doesn't make a whole lot of sense.... better make
it GPL.

2006-03-15 10:27:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 3/9] Block I/O accounting initialization


>
> +static inline void delayacct_blkio_start(void)
> +{
> + if (unlikely(delayacct_on))
> + __delayacct_blkio_start();
> +}


I still think the unlikely() makes no sense here; at runtime it's either
going to be always on or off (in the sense that switching it will be
RARE). The cpus branch predictor will get that right; while if you force
it unlikely you can even run the risk of getting a 100% miss on some
architectures (not x86/x86-64) when you enable the accounting


2006-03-15 10:28:58

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 1/9] timestamp diff

>
> > +static inline void timespec_diff(struct timespec *start, struct timespec *end,
> > + struct timespec *ret)
> > +{
> > + ret->tv_sec = end->tv_sec - start->tv_sec;
> > + ret->tv_nsec = end->tv_nsec - start->tv_nsec;
> > +}
> > #endif /* __KERNEL__ */
>
> I'd suggest normalizing the timespec; better to do it in such a function
> than in all callers of it..
>

Yep, that also looks semantically correct. We will do so.

Thanks,
Balbir

2006-03-15 12:31:18

by Alan

[permalink] [raw]
Subject: Re: Patch 2/9] Initialization

On Mer, 2006-03-15 at 11:24 +0100, Arjan van de Ven wrote:
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of version 2.1 of the GNU Lesser General Public License
> > + * as published by the Free Software Foundation.
> > + *
>
> LGPL inside the kernel doesn't make a whole lot of sense.... better make
> it GPL.

When you combine an LGPL and GPL work you get a GPL work so yes it would
be clearer to mark it GPL as that is what it became as it was merged,
but perhaps to add a note stating where it can be obtained under other
licenses for other projects.


2006-03-15 15:53:21

by Shailabh Nagar

[permalink] [raw]
Subject: Re: Patch 2/9] Initialization

Alan Cox wrote:

>On Mer, 2006-03-15 at 11:24 +0100, Arjan van de Ven wrote:
>
>
>>>+ * This program is free software; you can redistribute it and/or modify it
>>>+ * under the terms of version 2.1 of the GNU Lesser General Public License
>>>+ * as published by the Free Software Foundation.
>>>+ *
>>>
>>>
>>LGPL inside the kernel doesn't make a whole lot of sense.... better make
>>it GPL.
>>
>>
>
>When you combine an LGPL and GPL work you get a GPL work so yes it would
>be clearer to mark it GPL as that is what it became as it was merged,
>but perhaps to add a note stating where it can be obtained under other
>licenses for other projects.
>
>
>
Thanks. The LGPL usage is a mistake in the core code....will change to GPL
everywhere except the case below. There's no intent to have the kernel
code available
under other licenses etc. so thats not a problem.

However, the confusion about what license to use for a header file that
will need to be
included in a potentially non-GPL user application persists. The header file
include/linux/taskstats.h, created in patch 9/9
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1925.html

defines the user-kernel messaging interface and should probably continue
to have LGPL,
just to be absolutely safe legally. I've not been following the legal
twists too carefully so
its probably overkill.

--Shailabh






2006-03-15 16:27:37

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 3/9] Block I/O accounting initialization

Arjan van de Ven wrote:

>>
>>+static inline void delayacct_blkio_start(void)
>>+{
>>+ if (unlikely(delayacct_on))
>>+ __delayacct_blkio_start();
>>+}
>>
>>
>
>
>I still think the unlikely() makes no sense here; at runtime it's either
>going to be always on or off (in the sense that switching it will be
>RARE). The cpus branch predictor will get that right; while if you force
>it unlikely you can even run the risk of getting a 100% miss on some
>architectures (not x86/x86-64) when you enable the accounting
>
>
>
I don't understand why the fact that delayacct_on is not going to vary
dynamically once a kernel has been booted should prevent the usage of
"unlikely/likely"
Perhaps you can help me understand. Here's the logic I was using:

There are two kinds of users

1. Those who will not turn delay accounting on at boot time for a number of
reasons (don't care about the functionality, don't want to incur any
overhead
however small).

2. The ones who do turn it on at boot time are willing to pay some overhead
for the benefits such functionality brings. What that overhead exactly
is will be known
when we finish running some tests but we can safely assume it'll be
non-zero.

A distro will need to keep delay accounting configured to accomodate
both kinds
of users but because 1 is the common case, the default boot time option
will be off and
there is a strong incentive to reduce overhead for non-users.

Using unlikely reduces the overhead for the common case 1 and increases
overhead for
case 2 (because of what you pointed out of 100% wrong decisions by the
branch predictor).

I don't know how much of overhead reduction is achieved by using
unlikely (I'm assuming its
non-trivial) or how much penalty is imposed by a 100% wrong prediction
(I'm assuming its not
very high).

But under these assumptions, its better to use unlikely, make type 2
users pay extra overhead
and save type 1 users from any.

Is this reasoning accurate ? If not, we could easily switch the unlikely
off.

Regards,
Shailabh

(cc'ing Greg since he'd brought up the overhead for real benchmarks etc.)

2006-03-22 07:51:42

by Balbir Singh

[permalink] [raw]
Subject: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>
>
> I had a long description in an earlier email feedback; but the summary
> of it is the GET command is generic like TASKSTATS_CMD_GET; the message
> itself carries TLVs of what needs to be gotten which are
> either PID and/or TGID etc. Anyways, theres a long spill of what i am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
>

Hi, Jamal,

Please find the updated version of delayacct-genetlink.patch. We hope
this iteration is closer to your expectation. I have copied the enums
you suggested in your previous review comments and used them.

Comments addressed (in this patch)

- Changed the code to use TLV's for data exchange between kernel and
user space

Thanks,
Balbir


Documentation for the patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.

More changes expected. Following comments will go into a
Documentation file:

When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.

The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>

---

include/linux/delayacct.h | 11 ++
include/linux/taskstats.h | 112 ++++++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 ++++++++
kernel/taskstats.c | 251 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 432 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
#define _LINUX_TASKDELAYS_H

#include <linux/sched.h>
+#include <linux/taskstats.h>

#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-22 13:12:01.000000000 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.

- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.

+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-22 13:12:01.000000000 +0530
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>

int delayacct_on = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);

static int __init delayacct_setup_enable(char *str)
{
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st

void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}

/*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
spin_unlock(&tsk->delays->lock);
return ret;
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-22 13:12:07.000000000 +0530
@@ -0,0 +1,251 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ skb = nlmsg_new(NLMSG_GOODSIZE);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, NLM_F_REQUEST,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, info->nlhdr->nlmsg_flags,
+ info->genlhdr->cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+ }
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+ if (rc || thread_group_empty(tsk))
+ return;
+
+ /* Send tgid data too */
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

2006-03-23 14:08:04

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting


Hi Balbir,

Looking good.
This is a quick scan, so i didnt look at little details.
Some comments embedded.

On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:




>

> diff -puN /dev/null include/linux/taskstats.h

> + * The struct is versioned. Newer versions should only add fields to
> + * the bottom of the struct to maintain backward compatibility.
> + *
> + * To create the next version, bump up the taskstats_version variable
> + * and delineate the start of newly added fields with a comment indicating
> + * the version number.
> + */
> +



> +enum {
> + TASKSTATS_MSG_UNICAST, /* send data only to requester */
> + TASKSTATS_MSG_MULTICAST, /* send data to a group */
> +};
> +

Above will never be used outside of the kernel.
Should it go under the ifdef kernel below?


> +#ifdef __KERNEL__
> +

Note: some people will argue that you should probably have
two header files. One for all kernel things and includes another
which contains all the stuff above ifdef __KERNEL__

> +
> +#endif /* __KERNEL__ */
> +#endif /* _LINUX_TASKSTATS_H */




> diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
> +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530

> +
> +const int taskstats_version = TASKSTATS_VERSION;
> +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> +static int family_registered = 0;
> +
> +static struct genl_family family = {
> + .id = GENL_ID_GENERATE,
> + .name = TASKSTATS_GENL_NAME,
> + .version = TASKSTATS_GENL_VERSION,
> + .hdrsize = 0,

Do you need to specify hdrsize of 0?


> +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
> + void **replyp)
> +{
> + struct sk_buff *skb;
> + void *reply;
> +
> + skb = nlmsg_new(NLMSG_GOODSIZE);

Ok, getting a size of NLMSG_GOODSIZE is not a good idea.
The max size youll ever get it seems to me is 2*32-bit-data TLVs +
sizeof struct stats. Why dont you allocate that size?

> + if (!skb)
> + return -ENOMEM;
> +
> + if (!info) {
> + int seq = get_cpu_var(taskstats_seqnum)++;
> + put_cpu_var(taskstats_seqnum);
> +
> + reply = genlmsg_put(skb, 0, seq,
> + family.id, 0, NLM_F_REQUEST,
> + cmd, family.version);

Double check if you need NLM_F_REQUEST

> + } else
> + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> + family.id, 0, info->nlhdr->nlmsg_flags,
> + info->genlhdr->cmd, family.version);

A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
right thing?





> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{
> + int rc;
> + struct sk_buff *rep_skb;
> + struct taskstats stats;
> + void *reply;
> +
> + memset(&stats, 0, sizeof(stats));
> + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);

Same comment as before: a response to a GET is a NEW; so
info->genlhdr->cmd doesnt seem right.

> + if (rc < 0)
> + return rc;
> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> + rc = fill_pid((pid_t)pid, NULL, &stats);
> + if (rc < 0)
> + return rc;
> + }
> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> + rc = fill_tgid((pid_t)tgid, NULL, &stats);
> + if (rc < 0)
> + return rc;
> + }
> +

Should there be at least either a pid or tgid? If yes, you need to
validate here...

> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> +
> +nla_put_failure:
> + return genlmsg_cancel(rep_skb, reply);
> +

As a general comment double check your logic for errors; if you already
have stashed something in the skb, you need to remove it etc.

> +}
> +
> +
> +/* Send pid data out on exit */
> +void taskstats_exit_pid(struct task_struct *tsk)
> +{
> + int rc;
> + struct sk_buff *rep_skb;
> + void *reply;
> + struct taskstats stats;
> +
> + /*
> + * tasks can start to exit very early. Ensure that the family
> + * is registered before notifications are sent out
> + */
> + if (!family_registered)
> + return;
> +
> + memset(&stats, 0, sizeof(stats));
> + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> + if (rc < 0)
> + return;
> +
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
> + rc = fill_pid(tsk->pid, tsk, &stats);
> + if (rc < 0)
> + return;
> +
> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> +
> + if (rc || thread_group_empty(tsk))
> + return;
> +
> + /* Send tgid data too */
> + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> + if (rc < 0)
> + return;
> +

A single message with PID+TGID sounds reasonable. Why two messages with
two stats? all you will need to do is get rid of the prepare_reply()
above and NLA_PUT_U32() below (just like you do in a response to a GET.

> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
> + rc = fill_tgid(tsk->tgid, tsk, &stats);
> + if (rc < 0)
> + return;
> +
> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> +
> +nla_put_failure:
> + genlmsg_cancel(rep_skb, reply);
> +}


Other than the above comments - I believe you have it right.

cheers,
jamal

2006-03-23 15:16:51

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/9] Performance

Greg KH wrote:
> On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>
>>This is the next iteration of the delay accounting patches
>>last posted at
>> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>
>
> Do you have any benchmark numbers with this patch applied and with it
> not applied? Last I heard it was a measurable decrease for some
> "important" benchmark results...
>
> thanks,
>
> greg k-h

Here are some numbers for the latest set of posted patches
using microbenchmarks hackbench, kernbench and lmbench.

I was trying to get the real/big benchmark numbers too but
it looks like getting a run whose numbers can be trusted
will take a bit longer than expected. Preliminary runs of
transaction processing benchmarks indicate that overhead
actually decreases with the patch (as also seen in some of
the lmbench numbers below).

--Shailabh



Results highlights

- Configuring delay accounting adds < 0.5%
overhead in most cases and even reduces overhead
in some cases

- Enabling delay accounting has similar results
with a maximum overhead of 1.2% for hackbench
, most other overheads < 1% and reduction in
overhead in some cases



Base
Vanilla 2.6.16-rc6 kernel
without any patches applied
+patch
Delay accounting configured
but not enabled at boot
+patch+enable
Delay accounting enabled at boot
but no stats read

Hackbench
---------
200 groups, using sockets
Elapsed time, in seconds, lower better

%Ovhd Time
Base 0 12.468
+patch 0.4% 12.523
+patch+enable 1.2% 12.622

Kernbench
---------
Average of 5 iterations
Elapsed time, in seconds, lower better

%Ovhd Elapsed
Base 0 195.776
+patch 0.2% 196.246
+patch+enable 0.3% 196.282

Lmbench
-------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
base Linux 2.6.16- 2783 0.17 0.33 5.17 6.49 13.4 0.64 2.61 146. 610. 9376
+patch Linux 2.6.16- 2781 0.17 0.32 4.75 5.85 13.0 0.64 2.62 145. 628. 9393
+patch+en Linux 2.6.16- 2784 0.17 0.32 4.71 6.14 13.4 0.64 2.60 150. 616. 9402

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
base Linux 2.6.16- 4.340 4.9600 7.3300 6.5700 30.3 10.4 36.0
+patch Linux 2.6.16- 4.390 4.9800 7.3100 6.5900 29.7 9.62000 35.8
+patch+en Linux 2.6.16- 4.560 5.0800 7.2400 5.6900 22.7 10.3 33.8

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
base Linux 2.6.16- 4.340 15.9 12.2 18.3 24.9 21.5 29.1 45.3
+patch Linux 2.6.16- 4.390 15.7 11.8 18.6 22.2 22.0 29.1 44.8
+patch+en Linux 2.6.16- 4.560 15.6 12.1 18.9 25.3 21.9 27.1 45.1

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
base Linux 2.6.16- 39.8 58.0 112.0 82.6 8417.0 0.838 2.00000
+patch Linux 2.6.16- 39.6 58.2 111.0 82.3 8392.0 0.864 2.00000
+patch+en Linux 2.6.16- 39.6 59.1 112.8 83.2 8308.0 0.821 2.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
base Linux 2.6.16- 676. 616. 620. 1658.0 2030.6 759.6 825.9 2032 1177.
+patch Linux 2.6.16- 627. 165. 616. 1649.9 2030.9 766.1 834.1 2030 1187.
+patch+en Linux 2.6.16- 633. 148. 603. 1569.7 2030.9 757.2 835.3 2030 1174.

Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
base Linux 2.6.16- 2783 0.719 6.5960 110.5
+patch Linux 2.6.16- 2781 0.720 6.5980 111.0
+patch+en Linux 2.6.16- 2784 0.720 6.5970 110.7


2006-03-23 15:41:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> Hi Balbir,
>
> Looking good.
> This is a quick scan, so i didnt look at little details.

Thanks for your detailed feedback. Please find my response interspersed
along with your comments.

> Some comments embedded.
>
> On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:
>
>
>
>
> >
>
> > diff -puN /dev/null include/linux/taskstats.h
>
> > + * The struct is versioned. Newer versions should only add fields to
> > + * the bottom of the struct to maintain backward compatibility.
> > + *
> > + * To create the next version, bump up the taskstats_version variable
> > + * and delineate the start of newly added fields with a comment indicating
> > + * the version number.
> > + */
> > +
>
>
>
> > +enum {
> > + TASKSTATS_MSG_UNICAST, /* send data only to requester */
> > + TASKSTATS_MSG_MULTICAST, /* send data to a group */
> > +};
> > +
>
> Above will never be used outside of the kernel.
> Should it go under the ifdef kernel below?
>

Will do

>
> > +#ifdef __KERNEL__
> > +
>
> Note: some people will argue that you should probably have
> two header files. One for all kernel things and includes another
> which contains all the stuff above ifdef __KERNEL__
>
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* _LINUX_TASKSTATS_H */
>
>
>
>
> > diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> > --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
> > +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
>
> > +
> > +const int taskstats_version = TASKSTATS_VERSION;
> > +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> > +static int family_registered = 0;
> > +
> > +static struct genl_family family = {
> > + .id = GENL_ID_GENERATE,
> > + .name = TASKSTATS_GENL_NAME,
> > + .version = TASKSTATS_GENL_VERSION,
> > + .hdrsize = 0,
>
> Do you need to specify hdrsize of 0?

I will remove it

>
>
> > +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
> > + void **replyp)
> > +{
> > + struct sk_buff *skb;
> > + void *reply;
> > +
> > + skb = nlmsg_new(NLMSG_GOODSIZE);
>
> Ok, getting a size of NLMSG_GOODSIZE is not a good idea.
> The max size youll ever get it seems to me is 2*32-bit-data TLVs +
> sizeof struct stats. Why dont you allocate that size?
>

Will do

> > + if (!skb)
> > + return -ENOMEM;
> > +
> > + if (!info) {
> > + int seq = get_cpu_var(taskstats_seqnum)++;
> > + put_cpu_var(taskstats_seqnum);
> > +
> > + reply = genlmsg_put(skb, 0, seq,
> > + family.id, 0, NLM_F_REQUEST,
> > + cmd, family.version);
>
> Double check if you need NLM_F_REQUEST
>

It is not required, I will remove it

> > + } else
> > + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> > + family.id, 0, info->nlhdr->nlmsg_flags,
> > + info->genlhdr->cmd, family.version);
>
> A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
> right thing?
>
>

I will change it

>
>
>
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
> > + int rc;
> > + struct sk_buff *rep_skb;
> > + struct taskstats stats;
> > + void *reply;
> > +
> > + memset(&stats, 0, sizeof(stats));
> > + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
>
> Same comment as before: a response to a GET is a NEW; so
> info->genlhdr->cmd doesnt seem right.
>

I will change it

> > + if (rc < 0)
> > + return rc;
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > + rc = fill_pid((pid_t)pid, NULL, &stats);
> > + if (rc < 0)
> > + return rc;
> > + }
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> > + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> > + rc = fill_tgid((pid_t)tgid, NULL, &stats);
> > + if (rc < 0)
> > + return rc;
> > + }
> > +
>
> Should there be at least either a pid or tgid? If yes, you need to
> validate here...
>

Yes, you are correct. One of my test cases caught it too.. But I did
not want to untidy the code with if-else's which will keep growing if
the attributes change in the future. I just followed the controller
example. I will change it and validate it. Currently if the attribute
is not valid, a stat of all zero's is returned back.

> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> > +
> > +nla_put_failure:
> > + return genlmsg_cancel(rep_skb, reply);
> > +
>
> As a general comment double check your logic for errors; if you already
> have stashed something in the skb, you need to remove it etc.
>

Wouldn't genlmsg_cancel() take care of cleaning all attributes?


> > +}
> > +
> > +
> > +/* Send pid data out on exit */
> > +void taskstats_exit_pid(struct task_struct *tsk)
> > +{
> > + int rc;
> > + struct sk_buff *rep_skb;
> > + void *reply;
> > + struct taskstats stats;
> > +
> > + /*
> > + * tasks can start to exit very early. Ensure that the family
> > + * is registered before notifications are sent out
> > + */
> > + if (!family_registered)
> > + return;
> > +
> > + memset(&stats, 0, sizeof(stats));
> > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> > + if (rc < 0)
> > + return;
> > +
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
> > + rc = fill_pid(tsk->pid, tsk, &stats);
> > + if (rc < 0)
> > + return;
> > +
> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> > +
> > + if (rc || thread_group_empty(tsk))
> > + return;
> > +
> > + /* Send tgid data too */
> > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> > + if (rc < 0)
> > + return;
> > +
>
> A single message with PID+TGID sounds reasonable. Why two messages with
> two stats? all you will need to do is get rid of the prepare_reply()
> above and NLA_PUT_U32() below (just like you do in a response to a GET.
>

The reason for two stats is that for TGID, we return accumulated values
(of all threads in the group) and for PID we return the value just
for that pid. The return value is

pid
<stats for pid>
tgid
<accumulated stats for tgid>

> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
> > + rc = fill_tgid(tsk->tgid, tsk, &stats);
> > + if (rc < 0)
> > + return;
> > +
> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> > +
> > +nla_put_failure:
> > + genlmsg_cancel(rep_skb, reply);
> > +}
>
>
> Other than the above comments - I believe you have it right.
>

Thanks, I will get back to you with the updated patch soon.

> cheers,
> jamal

Thanks,
Balbir

2006-03-24 01:33:18

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> Hi Balbir,
>
> Looking good.
> This is a quick scan, so i didnt look at little details.
> Some comments embedded.

Hi, Jamal,

I tried addressing your comments in this new version.

Changelog
---------
1. Moved TASKSTATS_MSG_* to under #ifdef __KERNEL__
2. Got rid of .hdrsize = 0 in genl_family family
3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
4. Got rid of NLM_F_REQUEST, all flags passed down to user space are now 0
5. The response to TASKSTATS_CMD_GET is TASKSTATS_CMD_NEW
6. taskstats_send_stats() now validates the command attributes and ensures
that it either gets a PID or a TGID. If it gets both simultaneously
the PID stats are sent.
7. Do not put the PID/TGID into the skb if there are errors in fill_pid() or
fill_tgid().

Thanks,
Balbir


Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/delayacct.h | 11 +
include/linux/taskstats.h | 111 ++++++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 +++++++
kernel/taskstats.c | 255 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 435 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
#define _LINUX_TASKDELAYS_H

#include <linux/sched.h>
+#include <linux/taskstats.h>

#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-24 06:49:24.000000000 +0530
@@ -0,0 +1,111 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.

- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.

+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-24 06:49:24.000000000 +0530
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>

int delayacct_on = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);

static int __init delayacct_setup_enable(char *str)
{
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st

void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}

/*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
spin_unlock(&tsk->delays->lock);
return ret;
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-24 06:50:03.000000000 +0530
@@ -0,0 +1,255 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new((2 * sizeof(u32)) + sizeof(struct taskstats));
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ } else {
+ return -EINVAL;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+ if (rc || thread_group_empty(tsk))
+ return;
+
+ /* Send tgid data too */
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

2006-03-24 14:13:38

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:

> > Should there be at least either a pid or tgid? If yes, you need to
> > validate here...
> >
>
> Yes, you are correct. One of my test cases caught it too.. But I did
> not want to untidy the code with if-else's which will keep growing if
> the attributes change in the future. I just followed the controller
> example. I will change it and validate it. Currently if the attribute
> is not valid, a stat of all zero's is returned back.
>

There are many ways to skin this cat.
As an example: you could make pid and tgid global to the function and
set them to 0. At the end of the if statements, you could check if at
least one of them is set - if not you know none was passed and bail out.

> > As a general comment double check your logic for errors; if you already
> > have stashed something in the skb, you need to remove it etc.
> >
>
> Wouldn't genlmsg_cancel() take care of cleaning all attributes?
>

it would for attribute setting - but not for others. As a general
comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
could result in errors.


> > A single message with PID+TGID sounds reasonable. Why two messages with
> > two stats? all you will need to do is get rid of the prepare_reply()
> > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> >
>
> The reason for two stats is that for TGID, we return accumulated values
> (of all threads in the group) and for PID we return the value just
> for that pid. The return value is
>

Ok, I understand the dilemma now - but still not thrilled with having
two messages.
Perhaps you could have nesting of TLVs? This is widely used in the net
code for example
i.e:

TLV = TASKSTATS_TYPE_TGID/PID
TLV = TASKSTATS_TYPE_STATS

Look at using nla_nest_start/end/cancel

cheers,
jamal

2006-03-24 14:13:34

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:

> 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)

Not the right size; the u32 covers the V part of TLV. The T = 16 bits
and L = 16 bits. And if you nest TLVs, then it gets more interesting.

Look at using proper macros instead of hard coding like you did.
grep for something like RTA_SPACE and perhaps send a patch to make it
generic for netlink.h

cheers,
jamal

2006-03-24 14:19:46

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Fri, 2006-24-03 at 09:11 -0500, jamal wrote:

> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
>

actually Thomas already has this in netlink.h
Look at using things like:
nla_attr_size()

make sure padding is taken care of etc (read: use the right macros).

cheers,
jamal

2006-03-24 14:55:23

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Fri, Mar 24, 2006 at 09:04:21AM -0500, jamal wrote:
> On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> > > Should there be at least either a pid or tgid? If yes, you need to
> > > validate here...
> > >
> >
> > Yes, you are correct. One of my test cases caught it too.. But I did
> > not want to untidy the code with if-else's which will keep growing if
> > the attributes change in the future. I just followed the controller
> > example. I will change it and validate it. Currently if the attribute
> > is not valid, a stat of all zero's is returned back.
> >
>
> There are many ways to skin this cat.
> As an example: you could make pid and tgid global to the function and
> set them to 0. At the end of the if statements, you could check if at
> least one of them is set - if not you know none was passed and bail out.

The latest patch does fix it this issue. In the Changelog
6. taskstats_send_stats() now validates the command attributes and ensures
that it either gets a PID or a TGID. If it gets both simultaneously
the PID stats are sent.

Is this change ok with you?

>
> > > As a general comment double check your logic for errors; if you already
> > > have stashed something in the skb, you need to remove it etc.
> > >
> >
> > Wouldn't genlmsg_cancel() take care of cleaning all attributes?
> >
>
> it would for attribute setting - but not for others. As a general
> comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
> could result in errors.

:-) I understand.
What I have done is moved all the NLA_PUT_U32 to after verifying the
return values of functions fill_*(). That way we do not stash anything into the
skb if there are pending errors.

>
>
> > > A single message with PID+TGID sounds reasonable. Why two messages with
> > > two stats? all you will need to do is get rid of the prepare_reply()
> > > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> > >
> >
> > The reason for two stats is that for TGID, we return accumulated values
> > (of all threads in the group) and for PID we return the value just
> > for that pid. The return value is
> >
>
> Ok, I understand the dilemma now - but still not thrilled with having
> two messages.
> Perhaps you could have nesting of TLVs? This is widely used in the net
> code for example
> i.e:
>
> TLV = TASKSTATS_TYPE_TGID/PID
> TLV = TASKSTATS_TYPE_STATS
>
> Look at using nla_nest_start/end/cancel

Hmm... Would it be ok to send one message with the following format

1. TLV=TASKSTATS_TYPE_PID
2. TLV=TASKSTATS_TYPE_STATS
3. TLV=TASKSTATS_TYPE_TGID
4. TLV=TASKSTATS_TYPE_STATS

It would still be one message, except that 3 and 4 would be optional.
What do you think?

>
> cheers,
> jamal

Thanks for your comments,
Balbir

2006-03-24 14:59:38

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Fri, Mar 24, 2006 at 09:11:58AM -0500, jamal wrote:
> On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> > 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
>
> Not the right size; the u32 covers the V part of TLV. The T = 16 bits
> and L = 16 bits. And if you nest TLVs, then it gets more interesting.
>
> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
>
> cheers,
> jamal
>

My bad, but I was wondering why my test case did not segfault until
I saw the implementation of nlmsg_new :-)

I will fix this and use nla_total_size() to calculate the correct sizes
including padding and TLV.

Thanks again,
Balbir


2006-03-25 01:20:04

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:

> Hmm... Would it be ok to send one message with the following format
>
> 1. TLV=TASKSTATS_TYPE_PID
> 2. TLV=TASKSTATS_TYPE_STATS
> 3. TLV=TASKSTATS_TYPE_TGID
> 4. TLV=TASKSTATS_TYPE_STATS
>
> It would still be one message, except that 3 and 4 would be optional.
> What do you think?
>

No, that wont work since #2 and #4 are basically the same TLV. [Recall
that "T" is used to index an array]. Your other alternative is to have
#4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
although that would smell a little.
Dont be afraid to do the nest, it will be a little painful initially but
i am sure once you figure it out you will appreciate it.

cheers,
jamal

2006-03-25 02:38:29

by Greg KH

[permalink] [raw]
Subject: Re: [Patch 0/9] Performance

On Thu, Mar 23, 2006 at 10:16:41AM -0500, Shailabh Nagar wrote:
> Greg KH wrote:
> > On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
> >
> >>This is the next iteration of the delay accounting patches
> >>last posted at
> >> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
> >
> >
> > Do you have any benchmark numbers with this patch applied and with it
> > not applied? Last I heard it was a measurable decrease for some
> > "important" benchmark results...
> >
> > thanks,
> >
> > greg k-h
>
> Here are some numbers for the latest set of posted patches
> using microbenchmarks hackbench, kernbench and lmbench.
>
> I was trying to get the real/big benchmark numbers too but
> it looks like getting a run whose numbers can be trusted
> will take a bit longer than expected. Preliminary runs of
> transaction processing benchmarks indicate that overhead
> actually decreases with the patch (as also seen in some of
> the lmbench numbers below).

That's good to hear.

But your .5% is noticable on the +patch results, which I don't think
people who take performance issues seriously will like (that's real
money for the big vendors.) And distros will be forced to enable that
option in their kernels, so those vendors will have to get that
percentage back some other way...

thanks,

greg k-h

2006-03-25 09:42:10

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Fri, Mar 24, 2006 at 08:19:25PM -0500, jamal wrote:
> On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:
>
> > Hmm... Would it be ok to send one message with the following format
> >
> > 1. TLV=TASKSTATS_TYPE_PID
> > 2. TLV=TASKSTATS_TYPE_STATS
> > 3. TLV=TASKSTATS_TYPE_TGID
> > 4. TLV=TASKSTATS_TYPE_STATS
> >
> > It would still be one message, except that 3 and 4 would be optional.
> > What do you think?
> >
>
> No, that wont work since #2 and #4 are basically the same TLV. [Recall
> that "T" is used to index an array]. Your other alternative is to have
> #4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
> although that would smell a little.
> Dont be afraid to do the nest, it will be a little painful initially but
> i am sure once you figure it out you will appreciate it.
>

Thanks for the advice, I will dive into nesting. I could not find any
in tree users who use nesting, so I have a few questions

nla_nest_start() accepts two parameters an skb and an attribute type.
Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
contain the nested attributes

TASKSTATS_TYPE_AGGR
TASKSTATS_TYPE_PID/TGID
TASKSTATS_TYPE_STATS

but this will lead to


TASKSTATS_TYPE_AGGR
TASKSTATS_TYPE_PID
TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_AGGR
TASKSTATS_TYPE_TGID
TASKSTATS_TYPE_STATS

being returned from taskstats_exit_pid().

The other option is to nest

TASKSTATS_TYPE_PID/TGID
TASKSTATS_TYPE_STATS

but the problem with this approach is, nla_len contains the length of
all attributes including the nested attribute. So it is hard to find
the offset of TASKSTATS_TYPE_STATS in the buffer.

Do I understand NLA nesting at all? May be I am missing something obvious.

Thanks,
Balbir

2006-03-25 12:52:25

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:

>
> Thanks for the advice, I will dive into nesting. I could not find any
> in tree users who use nesting, so I have a few questions
>

Hrm - I have to say i am suprised theres nothing; i could have sworn
Thomas had done some conversions already.

> nla_nest_start() accepts two parameters an skb and an attribute type.
> Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> contain the nested attributes
>
> TASKSTATS_TYPE_AGGR
> TASKSTATS_TYPE_PID/TGID
> TASKSTATS_TYPE_STATS
>
>
> but this will lead to
>
> TASKSTATS_TYPE_AGGR
> TASKSTATS_TYPE_PID
> TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_AGGR
> TASKSTATS_TYPE_TGID
> TASKSTATS_TYPE_STATS
>
> being returned from taskstats_exit_pid().
>

no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
Again invoke the rule i cited earlier.
What you could do instead is a second AGGR; and your nesting would be:

TASKSTATS_TYPE_AGGR1 <--- nest start with this type
TASKSTATS_TYPE_PID <-- NLA_U32_PUT
TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
<-- nest end of TASKSTATS_TYPE_AGGR1
TASKSTATS_TYPE_AGGR2 <--- nest start with this type
TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
<-- nest end of TASKSTATS_TYPE_AGGR2

> The other option is to nest
>
> TASKSTATS_TYPE_PID/TGID
> TASKSTATS_TYPE_STATS
>

The advantage being you dont introduce another T.

> but the problem with this approach is, nla_len contains the length of
> all attributes including the nested attribute. So it is hard to find
> the offset of TASKSTATS_TYPE_STATS in the buffer.
>

So you would distinguish the two as have something like:

TASKSTATS_TYPE_PID
u32 pid
TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_TGID
u32 tgid
TASKSTATS_TYPE_STATS
or
TASKSTATS_TYPE_PID
u32 pid
TASKSTATS_TYPE_TGID
u32 tgid

both should be fine. The difference between the two is the length in the
second case will be 4 and in the other case will be larger.

But come to think of it, this will introduce unneeded semantics; you
have very few items to do, so forget it. Go with scheme #1 but change
the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.

cheers,
jamal


2006-03-25 15:36:56

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:
>
> >
> > Thanks for the advice, I will dive into nesting. I could not find any
> > in tree users who use nesting, so I have a few questions
> >
>
> Hrm - I have to say i am suprised theres nothing; i could have sworn
> Thomas had done some conversions already.

I used cscope to check for global references and callers of nla_nest_.*.
I could not find anything.

>
> > nla_nest_start() accepts two parameters an skb and an attribute type.
> > Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> > contain the nested attributes
> >
> > TASKSTATS_TYPE_AGGR
> > TASKSTATS_TYPE_PID/TGID
> > TASKSTATS_TYPE_STATS
> >
> >
> > but this will lead to
> >
> > TASKSTATS_TYPE_AGGR
> > TASKSTATS_TYPE_PID
> > TASKSTATS_TYPE_STATS
> > TASKSTATS_TYPE_AGGR
> > TASKSTATS_TYPE_TGID
> > TASKSTATS_TYPE_STATS
> >
> > being returned from taskstats_exit_pid().
> >
>
> no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
> Again invoke the rule i cited earlier.

Yes, thats why I wanted to point it out to you. Thanks for explaining the
rule.

> What you could do instead is a second AGGR; and your nesting would be:
>
> TASKSTATS_TYPE_AGGR1 <--- nest start with this type
> TASKSTATS_TYPE_PID <-- NLA_U32_PUT
> TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
> <-- nest end of TASKSTATS_TYPE_AGGR1
> TASKSTATS_TYPE_AGGR2 <--- nest start with this type
> TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
> TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
> <-- nest end of TASKSTATS_TYPE_AGGR2
>
> > The other option is to nest
> >
> > TASKSTATS_TYPE_PID/TGID
> > TASKSTATS_TYPE_STATS
> >
>
> The advantage being you dont introduce another T.
>
> > but the problem with this approach is, nla_len contains the length of
> > all attributes including the nested attribute. So it is hard to find
> > the offset of TASKSTATS_TYPE_STATS in the buffer.
> >
>
> So you would distinguish the two as have something like:
>
> TASKSTATS_TYPE_PID
> u32 pid
> TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_TGID
> u32 tgid
> TASKSTATS_TYPE_STATS
> or
> TASKSTATS_TYPE_PID
> u32 pid
> TASKSTATS_TYPE_TGID
> u32 tgid
>
> both should be fine. The difference between the two is the length in the
> second case will be 4 and in the other case will be larger.
>
> But come to think of it, this will introduce unneeded semantics; you
> have very few items to do, so forget it. Go with scheme #1 but change
> the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.
>

I prefer #1 as well. The overloaded use of the same type with different lengths
can be confusing.

> cheers,
> jamal
>

Here is another attempt (one more iteration) at trying to get it right.
Thank you for your patience and help in getting it right.

Changelog
---------

As discussed in our email.

Thanks,
Balbir

Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>

---

include/linux/delayacct.h | 11 +
include/linux/taskstats.h | 113 +++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 ++++++
kernel/taskstats.c | 291 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 473 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
#define _LINUX_TASKDELAYS_H

#include <linux/sched.h>
+#include <linux/taskstats.h>

#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-25 20:56:55.000000000 +0530
@@ -0,0 +1,113 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */
+ TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.

- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.

+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-25 20:56:55.000000000 +0530
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>

int delayacct_on = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);

static int __init delayacct_setup_enable(char *str)
{
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st

void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}

/*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
spin_unlock(&tsk->delays->lock);
return ret;
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-25 20:57:19.000000000 +0530
@@ -0,0 +1,291 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp, size_t size)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new(size);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+ size_t size;
+ struct nlattr *na;
+
+ /*
+ * Size includes space for nested attribute as well
+ * The returned data is of the format
+ * TASKSTATS_TYPE_AGGR_PID/TGID
+ * --> TASKSTATS_TYPE_PID/TGID
+ * --> TASKSTATS_TYPE_STATS
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ } else {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+err:
+ nlmsg_free(rep_skb);
+ return rc;
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+ size_t size;
+ int is_thread_group = !thread_group_empty(tsk);
+ struct nlattr *na;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ if (is_thread_group)
+ size = 2 * size; // PID + STATS + TGID + STATS
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return;
+
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ if (!is_thread_group) {
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ return;
+ }
+
+ memset(&stats, 0, sizeof(stats));
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+ return;
+err:
+ nlmsg_free(rep_skb);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

2006-03-25 17:48:35

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:


I didnt pay attention to failure paths etc; i suppose your testing
should catch those. Getting there, a couple more comments:


> +enum {
> + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
> + TASKSTATS_CMD_GET, /* user->kernel request */
> + TASKSTATS_CMD_NEW, /* kernel->user event */

Should the comment read "kernel->user event/get-response"


> +
> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{


> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> + rc = fill_pid((pid_t)pid, NULL, &stats);
> + if (rc < 0)
> + goto err;
> +
> + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {

in regards to the elseif above:
Could you not have both PID and TGID passed? From my earlier
understanding it seemed legit, no? if answer is yes, then you will have
to do your sizes + reply TLVs at the end.

Also in regards to the nesting, isnt there a need for nla_nest_cancel in
case of failures to add TLVs?

cheers,
jamal


2006-03-25 18:22:14

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On 3/25/06, jamal <[email protected]> wrote:
> On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> > On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
>
>
> I didnt pay attention to failure paths etc; i suppose your testing
> should catch those. Getting there, a couple more comments:
>

Yes, I have tried several negative test cases.

>
> > +enum {
> > + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
> > + TASKSTATS_CMD_GET, /* user->kernel request */
> > + TASKSTATS_CMD_NEW, /* kernel->user event */
>
> Should the comment read "kernel->user event/get-response"
>

Yes, good catch. I will update the comment.

>
> > +
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
>
>
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > + rc = fill_pid((pid_t)pid, NULL, &stats);
> > + if (rc < 0)
> > + goto err;
> > +
> > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
>
> in regards to the elseif above:
> Could you not have both PID and TGID passed? From my earlier
> understanding it seemed legit, no? if answer is yes, then you will have
> to do your sizes + reply TLVs at the end.

No, we cannot have both passed. If we pass both a PID and a TGID and
then the code returns just the stats for the PID.

>
> Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> case of failures to add TLVs?
>

I thought about it, but when I looked at the code of genlmsg_cancel()
and nla_nest_cancel(). It seemed that genlmsg_cancel() should
suffice.

<snippet>
static inline int genlmsg_cancel(struct sk_buff *skb, void *hdr)
{
return nlmsg_cancel(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN);
}

static inline int nlmsg_cancel(struct sk_buff *skb, struct nlmsghdr *nlh)
{
skb_trim(skb, (unsigned char *) nlh - skb->data);

return -1;
}

static inline int nla_nest_cancel(struct sk_buff *skb, struct nlattr *start)
{
if (start)
skb_trim(skb, (unsigned char *) start - skb->data);

return -1;
}

</snippet>

genlmsg_cancel() seemed more generic, since it handles skb_trim from
the nlmsghdr down to skb->data, where as nla_test_cancel() does it
only from the start of the nested attributes to skb->data.

Is my understanding correct?


> cheers,
> jamal
>

Thanks,
Balbir

2006-03-26 14:05:37

by jamal

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:


> No, we cannot have both passed. If we pass both a PID and a TGID and
> then the code returns just the stats for the PID.
>

ok, that clears it then; i think you are ready to go.

> >
> > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > case of failures to add TLVs?
> >
>
> I thought about it, but when I looked at the code of genlmsg_cancel()
> and nla_nest_cancel(). It seemed that genlmsg_cancel() should
> suffice.
>

If your policy is to never send a message if anything fails, then you
are fine.

What would be really useful now that you understand this, is if you can
help extending/cleaning the document i sent you. Or send me a table of
contents of how it would have flowed better for you.

cheers,
jamal



2006-03-26 16:43:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

On Sun, Mar 26, 2006 at 09:05:18AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:
>
>
> > No, we cannot have both passed. If we pass both a PID and a TGID and
> > then the code returns just the stats for the PID.
> >
>
> ok, that clears it then; i think you are ready to go.

Cool! Thanks for all your help and review.

>
> > >
> > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > > case of failures to add TLVs?
> > >
> >
> > I thought about it, but when I looked at the code of genlmsg_cancel()
> > and nla_nest_cancel(). It seemed that genlmsg_cancel() should
> > suffice.
> >
>
> If your policy is to never send a message if anything fails, then you
> are fine.
>
> What would be really useful now that you understand this, is if you can
> help extending/cleaning the document i sent you. Or send me a table of
> contents of how it would have flowed better for you.

I will start looking at the document and see what changes can be made.
I will try and update the document from a genetlink programmers perspective
i.e. things to know, avoid, etc.

>
> cheers,
> jamal
>
>

Thanks,
Balbir

2006-03-26 16:47:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 8/9] generic netlink utility functions

On Mon, Mar 13, 2006 at 07:55:05PM -0500, Shailabh Nagar wrote:
> genetlink-utils.patch
>
> Two utilities for simplifying usage of NETLINK_GENERIC
> interface.
>
> Signed-off-by: Balbir Singh <[email protected]>
> Signed-off-by: Shailabh Nagar <[email protected]>
>
> include/net/genetlink.h | 20 ++++++++++++++++++++
> 1 files changed, 20 insertions(+)
>
> Index: linux-2.6.16-rc5/include/net/genetlink.h
> ===================================================================
> --- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 07:41:32.000000000 -0500
> +++ linux-2.6.16-rc5/include/net/genetlink.h 2006-03-11 07:41:41.000000000 -0500
> @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
> return nlmsg_unicast(genl_sock, skb, pid);
> }
>
> +/**
> + * gennlmsg_data - head of message payload
> + * @gnlh: genetlink messsage header
> + */
> +static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
> +{
> + return ((unsigned char *) gnlh + GENL_HDRLEN);
> +}
> +
> +/**
> + * genlmsg_len - length of message payload
> + * @gnlh: genetlink message header
> + */
> +static inline int genlmsg_len(const struct genlmsghdr *gnlh)
> +{
> + struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
> + NLMSG_HDRLEN);
> + return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
> +}
> +
> #endif /* __NET_GENERIC_NETLINK_H */
>
>

Jamal,

Does the implementation of these utilities look ok? We use genlmsg_data()
in the delay accounting code but not genlmsg_len(), hence it might not
be very well tested (just reviewed).

Thanks,
Balbir

2006-03-26 17:07:19

by jamal

[permalink] [raw]
Subject: Re: [Patch 8/9] generic netlink utility functions

On Sun, 2006-26-03 at 22:14 +0530, Balbir Singh wrote:

> Jamal,
>
> Does the implementation of these utilities look ok? We use genlmsg_data()
> in the delay accounting code but not genlmsg_len(), hence it might not
> be very well tested (just reviewed).
>

They look fine to me - please resubmit and CC Thomas Graf in case he
sees it different.

cheers,
jamal

2006-03-27 18:28:41

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/9] Performance

Greg KH wrote:

>On Thu, Mar 23, 2006 at 10:16:41AM -0500, Shailabh Nagar wrote:
>
>
>>Greg KH wrote:
>>
>>
>>>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>>>
>>>
>>>
>>>>This is the next iteration of the delay accounting patches
>>>>last posted at
>>>> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>>>>
>>>>
>>>Do you have any benchmark numbers with this patch applied and with it
>>>not applied? Last I heard it was a measurable decrease for some
>>>"important" benchmark results...
>>>
>>>thanks,
>>>
>>>greg k-h
>>>
>>>
>>Here are some numbers for the latest set of posted patches
>>using microbenchmarks hackbench, kernbench and lmbench.
>>
>>I was trying to get the real/big benchmark numbers too but
>>it looks like getting a run whose numbers can be trusted
>>will take a bit longer than expected. Preliminary runs of
>>transaction processing benchmarks indicate that overhead
>>actually decreases with the patch (as also seen in some of
>>the lmbench numbers below).
>>
>>
>
>That's good to hear.
>
>But your .5% is noticable on the +patch results, which I don't think
>people who take performance issues seriously will like (that's real
>money for the big vendors.) And distros will be forced to enable that
>option in their kernels, so those vendors will have to get that
>percentage back some other way...
>
>

Sorry, missed your response.

Yes, even the slight deterioration might be an issue for distros. We
discovered one memcpy,
lack of use of "__read_mostly" and another "unlikely" that might help
with the 0.5% but other than
that don't see any major way of reducing overhead further for the +patch
case.

I'll be posting another iteration of the patches with these changes and
corresponding results
(as well as the changes for the netlink interface which has been
stabilized after incorporating Jamal's
comments). Lets see what that does.

Thanks,
Shailabh

>thanks,
>
>greg k-h
>
>