2006-03-30 00:32:49

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 0/8] per-task delay accounting

Andrew,

Could you please include the following delay accounting patches
in -mm ?

The patches have gone through several iterations on lkml and
numerous comments raised by reviewers have been addressed

- several netlink interface comments (Jamal)
- block I/O collection method (Arjan)
- block I/O delays export through /proc (Andi)
- performance issues (Greg) (just addressed, see below)
- GPL headers (Arjan)

Most of the descriptions of the patches are either in the
patch itself or in the documentation patch at the end.

Thanks
--Shailabh


Patch series

delayacct-setup.patch
delayacct-blkio-swapin.patch
delayacct-schedstats.patch
genetlink-utils.patch
delayacct-genetlink.patch
delayacct-virtcpu.patch
delayacct-procfs.patch
delayacct-doc.patch



Results highlights

- No statistically significant performance degradation is seen in
kernbench, hackbench and large OLTP benchmark when delay
accounting is configured.

The overheads of configuring delay accounting,
without enabling at boot time, are statistically negligible
for hackbench and a large OLTP benchmark and negative
(i.e. performance improves) in kernbench.

- Similar lack of degradation is seen in kernbench and hackbench
even when delay accounting is enabled at boot.

No data could be collected for the large OLTP benchmark (efforts
ongoing).

Legend

Base
Vanilla 2.6.16 kernel
without any patches applied
+patch
Delay accounting configured
but not enabled at boot
+patch+enable
Delay accounting enabled at boot
but no stats read



Time Elapsed time, averaged over 10 runs
Stddev Standard deviation of elapsed times
Ovhd % difference of elapsed time with respect to base kernel
t-value Used to measure statistical significance
of difference of two mean values (in this
case mean elapsed time). Low t-values indicate
insignificant difference. The t-values here were
calculated at 95% confidence interval using the tool at
http://www.polarismr.com/education/tools_stat_diff_means.html


Hackbench
---------
200 groups, using pipes
Elapsed time, in seconds, lower better

Ovhd Time Stddev Ovhd significant (t-value)
Base 0% 43.483 0.178 na
+patch 0.1% 43.517 0.265 No (0.337)
+patch+enable 0.3% 43.629 0.167 No (1.892)

Kernbench
---------
Average of 10 iterations
Elapsed time, in seconds, lower better

Ovhd Time Stddev Ovhd significant (t-value)
Base 0% 196.704 0.459 na
+patch -0.5% 195.812 0.477 Yes (4.261)
+patch+enable 0.02% 196.752 0.356 No (0.261)


Large OLTP benchmark
--------------------
An industry standard large database online transaction processing
workload was run with delay accounting patches configured
ON and OFF.

The performance degradation of delay accounting was about 0.2%,
which was well within the normal range of variation between
similar runs.

No runs were taken with delay accounting enabled at boot time.


2006-03-30 00:36:00

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 1/8] Setup

delayacct-setup.patch

Initialization code related to collection of per-task "delay"
statistics which measure how long it had to wait for cpu,
sync block io, swapping etc. The collection of statistics and
the interface are in other patches. This patch sets up the data
structures and allows the statistics collection to be disabled
through a kernel boot paramater.

Signed-off-by: Shailabh Nagar <[email protected]>

Documentation/kernel-parameters.txt | 2
include/linux/delayacct.h | 51 +++++++++++++++++++
include/linux/sched.h | 17 ++++++
init/Kconfig | 13 +++++
init/main.c | 2
kernel/Makefile | 1
kernel/delayacct.c | 92 ++++++++++++++++++++++++++++++++++++
kernel/exit.c | 3 +
kernel/fork.c | 2
9 files changed, 183 insertions(+)

Index: linux-2.6.16/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.16.orig/Documentation/kernel-parameters.txt 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/Documentation/kernel-parameters.txt 2006-03-29 18:12:57.000000000 -0500
@@ -416,6 +416,8 @@ running once the system is up.
Format: <area>[,<node>]
See also Documentation/networking/decnet.txt.

+ delayacct [KNL] Enable per-task delay accounting
+
devfs= [DEVFS]
See Documentation/filesystems/devfs/boot-options.

Index: linux-2.6.16/kernel/Makefile
===================================================================
--- linux-2.6.16.orig/kernel/Makefile 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/kernel/Makefile 2006-03-29 18:12:57.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.16/include/linux/delayacct.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/include/linux/delayacct.h 2006-03-29 18:12:57.000000000 -0500
@@ -0,0 +1,51 @@
+/* delayacct.h - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ *
+ */
+
+#ifndef _LINUX_TASKDELAYS_H
+#define _LINUX_TASKDELAYS_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASK_DELAY_ACCT
+extern int delayacct_on; /* Delay accounting turned on/off */
+extern kmem_cache_t *delayacct_cache;
+extern void delayacct_init(void);
+extern void __delayacct_tsk_init(struct task_struct *);
+extern void __delayacct_tsk_exit(struct task_struct *);
+
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{
+ /* reinitialize in case parent's non-null pointer was dup'ed*/
+ tsk->delays = NULL;
+ if (unlikely(delayacct_on))
+ __delayacct_tsk_init(tsk);
+}
+
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{
+ if (tsk->delays)
+ __delayacct_tsk_exit(tsk);
+}
+
+#else
+static inline void delayacct_init(void)
+{}
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{}
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{}
+#endif /* CONFIG_TASK_DELAY_ACCT */
+#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16/include/linux/sched.h
===================================================================
--- linux-2.6.16.orig/include/linux/sched.h 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/include/linux/sched.h 2006-03-29 18:12:57.000000000 -0500
@@ -540,6 +540,20 @@ struct sched_info {
extern struct file_operations proc_schedstat_operations;
#endif

+#ifdef CONFIG_TASK_DELAY_ACCT
+struct task_delay_info {
+ spinlock_t lock;
+
+ /* For each stat XXX, add following, aligned appropriately
+ *
+ * struct timespec XXX_start, XXX_end;
+ * u64 XXX_delay;
+ * u32 XXX_count;
+ */
+};
+#endif
+
+
enum idle_type
{
SCHED_IDLE,
@@ -871,6 +885,9 @@ struct task_struct {
#endif
atomic_t fs_excl; /* holding fs exclusive resources */
struct rcu_head rcu;
+#ifdef CONFIG_TASK_DELAY_ACCT
+ struct task_delay_info *delays;
+#endif
};

static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.16/init/Kconfig
===================================================================
--- linux-2.6.16.orig/init/Kconfig 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/init/Kconfig 2006-03-29 18:12:57.000000000 -0500
@@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.

+config TASK_DELAY_ACCT
+ bool "Enable per-task delay accounting (EXPERIMENTAL)"
+ help
+ Collect information on time spent by a task waiting for system
+ resources like cpu, synchronous block I/O completion and swapping
+ in pages. Such statistics can help in setting a task's priorities
+ relative to other tasks for cpu, io, rss limits etc.
+
+ Unlike BSD process accounting, this information is available
+ continuously during the lifetime of a task.
+
+ Say N if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux-2.6.16/init/main.c
===================================================================
--- linux-2.6.16.orig/init/main.c 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/init/main.c 2006-03-29 18:12:57.000000000 -0500
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/delayacct.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void
proc_root_init();
#endif
cpuset_init();
+ delayacct_init();

check_bugs();

Index: linux-2.6.16/kernel/delayacct.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:12:57.000000000 -0500
@@ -0,0 +1,92 @@
+/* delayacct.c - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/time.h>
+#include <linux/sysctl.h>
+#include <linux/delayacct.h>
+
+int delayacct_on __read_mostly = 0; /* Delay accounting turned on/off */
+kmem_cache_t *delayacct_cache;
+
+static int __init delayacct_setup_enable(char *str)
+{
+ delayacct_on = 1;
+ return 1;
+}
+__setup("delayacct", delayacct_setup_enable);
+
+void delayacct_init(void)
+{
+ delayacct_cache = kmem_cache_create("delayacct_cache",
+ sizeof(struct task_delay_info),
+ 0,
+ SLAB_PANIC,
+ NULL, NULL);
+ delayacct_tsk_init(&init_task);
+}
+
+void __delayacct_tsk_init(struct task_struct *tsk)
+{
+ tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL);
+ if (tsk->delays) {
+ memset(tsk->delays, 0, sizeof(*tsk->delays));
+ spin_lock_init(&tsk->delays->lock);
+ }
+}
+
+void __delayacct_tsk_exit(struct task_struct *tsk)
+{
+ if (tsk->delays) {
+ kmem_cache_free(delayacct_cache, tsk->delays);
+ tsk->delays = NULL;
+ }
+}
+
+/*
+ * Start accounting for a delay statistic using
+ * its starting timestamp (@start)
+ */
+
+static inline void delayacct_start(struct timespec *start)
+{
+ do_posix_clock_monotonic_gettime(start);
+}
+
+/*
+ * Finish delay accounting for a statistic using
+ * its timestamps (@start, @end), accumalator (@total) and @count
+ */
+
+static inline void delayacct_end(struct timespec *start, struct timespec *end,
+ u64 *total, u32 *count)
+{
+ struct timespec ts;
+ nsec_t ns;
+
+ do_posix_clock_monotonic_gettime(end);
+ ts.tv_sec = end->tv_sec - start->tv_sec;
+ ts.tv_nsec = end->tv_nsec - start->tv_nsec;
+ ns = timespec_to_ns(&ts);
+ if (ns < 0)
+ return;
+
+ spin_lock(&current->delays->lock);
+ *total += ns;
+ (*count)++;
+ spin_unlock(&current->delays->lock);
+}
+
Index: linux-2.6.16/kernel/fork.c
===================================================================
--- linux-2.6.16.orig/kernel/fork.c 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/kernel/fork.c 2006-03-29 18:12:57.000000000 -0500
@@ -44,6 +44,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/delayacct.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -972,6 +973,7 @@ static task_t *copy_process(unsigned lon
goto bad_fork_cleanup_put_domain;

p->did_exec = 0;
+ delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
copy_flags(clone_flags, p);
p->pid = pid;
retval = -EFAULT;
Index: linux-2.6.16/kernel/exit.c
===================================================================
--- linux-2.6.16.orig/kernel/exit.c 2006-03-29 18:12:55.000000000 -0500
+++ linux-2.6.16/kernel/exit.c 2006-03-29 18:12:57.000000000 -0500
@@ -31,6 +31,7 @@
#include <linux/signal.h>
#include <linux/cn_proc.h>
#include <linux/mutex.h>
+#include <linux/delayacct.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -842,6 +843,8 @@ fastcall NORET_TYPE void do_exit(long co
preempt_count());

acct_update_integrals(tsk);
+ delayacct_tsk_exit(tsk);
+
if (tsk->mm) {
update_hiwater_rss(tsk->mm);
update_hiwater_vm(tsk->mm);

2006-03-30 00:37:27

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 2/8] Block I/O, swapin delays

delayacct-blkio-swapin.patch

Collect per-task block I/O delay statistics.

Unlike earlier iterations of the delay accounting
patches, now delays are only collected for the actual
I/O waits rather than try and cover the delays seen in
I/O submission paths.

Account separately for block I/O delays
incurred as a result of swapin page faults whose
frequency can be affected by the task/process' rss limit.
Hence swapin delays can act as feedback for rss limit changes
independent of I/O priority changes.

Signed-off-by: Shailabh Nagar <[email protected]>

include/linux/delayacct.h | 16 ++++++++++++++++
include/linux/sched.h | 7 +++++++
kernel/delayacct.c | 18 ++++++++++++++++++
kernel/sched.c | 5 +++++
mm/memory.c | 4 ++++
5 files changed, 50 insertions(+)

Index: linux-2.6.16/include/linux/delayacct.h
===================================================================
--- linux-2.6.16.orig/include/linux/delayacct.h 2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/include/linux/delayacct.h 2006-03-29 18:13:13.000000000 -0500
@@ -25,6 +25,8 @@ extern kmem_cache_t *delayacct_cache;
extern void delayacct_init(void);
extern void __delayacct_tsk_init(struct task_struct *);
extern void __delayacct_tsk_exit(struct task_struct *);
+extern void __delayacct_blkio_start(void);
+extern void __delayacct_blkio_end(void);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -40,6 +42,16 @@ static inline void delayacct_tsk_exit(st
__delayacct_tsk_exit(tsk);
}

+static inline void delayacct_blkio_start(void)
+{
+ if (current->delays)
+ __delayacct_blkio_start();
+}
+static inline void delayacct_blkio_end(void)
+{
+ if (current->delays)
+ __delayacct_blkio_end();
+}
#else
static inline void delayacct_init(void)
{}
@@ -47,5 +59,9 @@ static inline void delayacct_tsk_init(st
{}
static inline void delayacct_tsk_exit(struct task_struct *tsk)
{}
+static inline void delayacct_blkio_start(void)
+{}
+static inline void delayacct_blkio_end(void)
+{}
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16/kernel/delayacct.c
===================================================================
--- linux-2.6.16.orig/kernel/delayacct.c 2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:13:13.000000000 -0500
@@ -90,3 +90,21 @@ static inline void delayacct_end(struct
spin_unlock(&current->delays->lock);
}

+void __delayacct_blkio_start(void)
+{
+ delayacct_start(&current->delays->blkio_start);
+}
+
+void __delayacct_blkio_end(void)
+{
+ if (current->flags & PF_SWAPIN) /* Swapping a page in */
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->swapin_delay,
+ &current->delays->swapin_count);
+ else /* Other block I/O */
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->blkio_delay,
+ &current->delays->blkio_count);
+}
Index: linux-2.6.16/include/linux/sched.h
===================================================================
--- linux-2.6.16.orig/include/linux/sched.h 2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/include/linux/sched.h 2006-03-29 18:13:13.000000000 -0500
@@ -550,6 +550,12 @@ struct task_delay_info {
* u64 XXX_delay;
* u32 XXX_count;
*/
+
+ struct timespec blkio_start, blkio_end; /* Shared by blkio, swapin */
+ u64 blkio_delay; /* wait for sync block io completion */
+ u64 swapin_delay; /* wait for sync block io completion */
+ u32 blkio_count;
+ u32 swapin_count;
};
#endif

@@ -945,6 +951,7 @@ static inline void put_task_struct(struc
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
#define PF_SWAPWRITE 0x01000000 /* Allowed to write to swap */
+#define PF_SWAPIN 0x02000000 /* I am doing a swap in */

/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6.16/kernel/sched.c
===================================================================
--- linux-2.6.16.orig/kernel/sched.c 2006-03-29 18:12:54.000000000 -0500
+++ linux-2.6.16/kernel/sched.c 2006-03-29 18:13:13.000000000 -0500
@@ -49,6 +49,7 @@
#include <linux/syscalls.h>
#include <linux/times.h>
#include <linux/acct.h>
+#include <linux/delayacct.h>
#include <asm/tlb.h>

#include <asm/unistd.h>
@@ -4112,9 +4113,11 @@ void __sched io_schedule(void)
{
struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());

+ delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
schedule();
atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
}

EXPORT_SYMBOL(io_schedule);
@@ -4124,9 +4127,11 @@ long __sched io_schedule_timeout(long ti
struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
long ret;

+ delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
return ret;
}

Index: linux-2.6.16/mm/memory.c
===================================================================
--- linux-2.6.16.orig/mm/memory.c 2006-03-29 18:12:54.000000000 -0500
+++ linux-2.6.16/mm/memory.c 2006-03-29 18:13:13.000000000 -0500
@@ -1883,6 +1883,7 @@ static int do_swap_page(struct mm_struct

entry = pte_to_swp_entry(orig_pte);
again:
+ current->flags |= PF_SWAPIN;
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1895,6 +1896,7 @@ again:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte)))
ret = VM_FAULT_OOM;
+ current->flags &= ~PF_SWAPIN;
goto unlock;
}

@@ -1906,6 +1908,8 @@ again:

mark_page_accessed(page);
lock_page(page);
+ current->flags &= ~PF_SWAPIN;
+
if (!PageSwapCache(page)) {
/* Page migration has occured */
unlock_page(page);

2006-03-30 00:42:32

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 3/8] cpu delays

delayacct-schedstats.patch

Make the task-related schedstats functions
callable by delay accounting even if schedstats
collection isn't turned on. This removes the
dependency of delay accounting on schedstats and allows
cpu delays to be exported.

Signed-off-by: Chandra Seetharaman <[email protected]>
Signed-off-by: Shailabh Nagar <[email protected]>

include/linux/sched.h | 10 +++++---
kernel/sched.c | 59 ++++++++++++++++++++++++++++++++++++--------------
2 files changed, 50 insertions(+), 19 deletions(-)

Index: linux-2.6.16/include/linux/sched.h
===================================================================
--- linux-2.6.16.orig/include/linux/sched.h 2006-03-29 18:13:13.000000000 -0500
+++ linux-2.6.16/include/linux/sched.h 2006-03-29 18:13:15.000000000 -0500
@@ -525,7 +525,7 @@ typedef struct prio_array prio_array_t;
struct backing_dev_info;
struct reclaim_state;

-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info {
/* cumulative counters */
unsigned long cpu_time, /* time spent on the cpu */
@@ -537,10 +537,14 @@ struct sched_info {
last_queued; /* when we were last queued to run */
};

+#ifdef CONFIG_SCHEDSTATS
extern struct file_operations proc_schedstat_operations;
-#endif
+#endif /* CONFIG_SCHEDSTATS */
+#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */

#ifdef CONFIG_TASK_DELAY_ACCT
+extern int delayacct_on;
+
struct task_delay_info {
spinlock_t lock;

@@ -736,7 +740,7 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;

-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
#endif

Index: linux-2.6.16/kernel/sched.c
===================================================================
--- linux-2.6.16.orig/kernel/sched.c 2006-03-29 18:13:13.000000000 -0500
+++ linux-2.6.16/kernel/sched.c 2006-03-29 18:13:15.000000000 -0500
@@ -466,9 +466,26 @@ struct file_operations proc_schedstat_op
.release = single_release,
};

+static inline void rq_sched_info_arrive(struct runqueue *rq,
+ unsigned long diff)
+{
+ if (rq) {
+ rq->rq_sched_info.run_delay += diff;
+ rq->rq_sched_info.pcnt++;
+ }
+}
+
+static inline void rq_sched_info_depart(struct runqueue *rq,
+ unsigned long diff)
+{
+ if (rq)
+ rq->rq_sched_info.cpu_time += diff;
+}
# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
#else /* !CONFIG_SCHEDSTATS */
+static inline void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff) {}
+static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff) {}
# define schedstat_inc(rq, field) do { } while (0)
# define schedstat_add(rq, field, amt) do { } while (0)
#endif
@@ -488,7 +505,18 @@ static inline runqueue_t *this_rq_lock(v
return rq;
}

+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+
+static inline int sched_info_on(void)
+{
#ifdef CONFIG_SCHEDSTATS
+ return 1;
+#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+ return delayacct_on;
+#endif
+}
+
/*
* Called when a process is dequeued from the active array and given
* the cpu. We should note that with the exception of interactive
@@ -517,7 +545,6 @@ static inline void sched_info_dequeued(t
static void sched_info_arrive(task_t *t)
{
unsigned long now = jiffies, diff = 0;
- struct runqueue *rq = task_rq(t);

if (t->sched_info.last_queued)
diff = now - t->sched_info.last_queued;
@@ -526,11 +553,7 @@ static void sched_info_arrive(task_t *t)
t->sched_info.last_arrival = now;
t->sched_info.pcnt++;

- if (!rq)
- return;
-
- rq->rq_sched_info.run_delay += diff;
- rq->rq_sched_info.pcnt++;
+ rq_sched_info_arrive(task_rq(t), diff);
}

/*
@@ -550,8 +573,9 @@ static void sched_info_arrive(task_t *t)
*/
static inline void sched_info_queued(task_t *t)
{
- if (!t->sched_info.last_queued)
- t->sched_info.last_queued = jiffies;
+ if (unlikely(sched_info_on()))
+ if (!t->sched_info.last_queued)
+ t->sched_info.last_queued = jiffies;
}

/*
@@ -560,13 +584,10 @@ static inline void sched_info_queued(tas
*/
static inline void sched_info_depart(task_t *t)
{
- struct runqueue *rq = task_rq(t);
unsigned long diff = jiffies - t->sched_info.last_arrival;

t->sched_info.cpu_time += diff;
-
- if (rq)
- rq->rq_sched_info.cpu_time += diff;
+ rq_sched_info_depart(task_rq(t), diff);
}

/*
@@ -574,7 +595,7 @@ static inline void sched_info_depart(tas
* their time slice. (This may also be called when switching to or from
* the idle task.) We are only called when prev != next.
*/
-static inline void sched_info_switch(task_t *prev, task_t *next)
+static inline void __sched_info_switch(task_t *prev, task_t *next)
{
struct runqueue *rq = task_rq(prev);

@@ -589,10 +610,15 @@ static inline void sched_info_switch(tas
if (next != rq->idle)
sched_info_arrive(next);
}
+static inline void sched_info_switch(task_t *prev, task_t *next)
+{
+ if (unlikely(sched_info_on()))
+ __sched_info_switch(prev, next);
+}
#else
#define sched_info_queued(t) do { } while (0)
#define sched_info_switch(t, next) do { } while (0)
-#endif /* CONFIG_SCHEDSTATS */
+#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

/*
* Adding/removing a task to/from a priority array:
@@ -1346,8 +1372,9 @@ void fastcall sched_fork(task_t *p, int
p->state = TASK_RUNNING;
INIT_LIST_HEAD(&p->run_list);
p->array = NULL;
-#ifdef CONFIG_SCHEDSTATS
- memset(&p->sched_info, 0, sizeof(p->sched_info));
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+ if (unlikely(sched_info_on()))
+ memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
p->oncpu = 0;

2006-03-30 00:49:11

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 4/8] generic netlink utility functions

genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Shailabh Nagar <[email protected]>

include/net/genetlink.h | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+)

Index: linux-2.6.16/include/net/genetlink.h
===================================================================
--- linux-2.6.16.orig/include/net/genetlink.h 2006-03-29 18:12:54.000000000 -0500
+++ linux-2.6.16/include/net/genetlink.h 2006-03-29 18:13:17.000000000 -0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
}

+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+ return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+ struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+ NLMSG_HDRLEN);
+ return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
#endif /* __NET_GENERIC_NETLINK_H */

2006-03-30 00:52:26

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 5/8] generic netlink interface for delay accounting

delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.


Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>

include/linux/delayacct.h | 11 +
include/linux/taskstats.h | 113 +++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 ++++++
kernel/taskstats.c | 292 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 474 insertions(+), 3 deletions(-)

Index: linux-2.6.16/include/linux/delayacct.h
===================================================================
--- linux-2.6.16.orig/include/linux/delayacct.h 2006-03-29 18:13:13.000000000 -0500
+++ linux-2.6.16/include/linux/delayacct.h 2006-03-29 18:13:18.000000000 -0500
@@ -18,6 +18,7 @@
#define _LINUX_TASKDELAYS_H

#include <linux/sched.h>
+#include <linux/taskstats.h>

#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -27,6 +28,7 @@ extern void __delayacct_tsk_init(struct
extern void __delayacct_tsk_exit(struct task_struct *);
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -64,4 +66,13 @@ static inline void delayacct_blkio_start
static inline void delayacct_blkio_end(void)
{}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16/include/linux/taskstats.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/include/linux/taskstats.h 2006-03-29 18:13:18.000000000 -0500
@@ -0,0 +1,113 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request/get-response */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */
+ TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.16/init/Kconfig
===================================================================
--- linux-2.6.16.orig/init/Kconfig 2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/init/Kconfig 2006-03-29 18:13:18.000000000 -0500
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.

- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.

+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux-2.6.16/kernel/delayacct.c
===================================================================
--- linux-2.6.16.orig/kernel/delayacct.c 2006-03-29 18:13:13.000000000 -0500
+++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:13:18.000000000 -0500
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -18,9 +19,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>

int delayacct_on __read_mostly = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);

static int __init delayacct_setup_enable(char *str)
{
@@ -50,10 +54,16 @@ void __delayacct_tsk_init(struct task_st

void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}

/*
@@ -108,3 +118,37 @@ void __delayacct_blkio_end(void)
&current->delays->blkio_delay,
&current->delays->blkio_count);
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
Index: linux-2.6.16/kernel/Makefile
===================================================================
--- linux-2.6.16.orig/kernel/Makefile 2006-03-29 18:12:57.000000000 -0500
+++ linux-2.6.16/kernel/Makefile 2006-03-29 18:13:18.000000000 -0500
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.16/kernel/taskstats.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/kernel/taskstats.c 2006-03-29 18:13:18.000000000 -0500
@@ -0,0 +1,292 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp, size_t size)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new(size);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+ size_t size;
+ struct nlattr *na;
+
+ /*
+ * Size includes space for nested attribute as well
+ * The returned data is of the format
+ * TASKSTATS_TYPE_AGGR_PID/TGID
+ * --> TASKSTATS_TYPE_PID/TGID
+ * --> TASKSTATS_TYPE_STATS
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ } else {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+err:
+ nlmsg_free(rep_skb);
+ return rc;
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+ size_t size;
+ int is_thread_group = !thread_group_empty(tsk);
+ struct nlattr *na;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ if (is_thread_group)
+ size = 2 * size; // PID + STATS + TGID + STATS
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return;
+
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ if (!is_thread_group) {
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ return;
+ }
+
+ memset(&stats, 0, sizeof(stats));
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ return;
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+ return;
+err:
+ nlmsg_free(rep_skb);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);

2006-03-30 00:55:12

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 6/8] virtual cpu run time

delayacct-virtcpu.patch

Distinguish between "wall-clock" and "virtual" cpu run times and return
both, at per-task and per-tgid granularity.

Some architectures adjust tsk->utime+tsk->stime to reflect the time that
the kernel wasn't scheduled in hypervised environments and this is the
"wall-clock" cpu run time. "Virtual" cpu run time, on the other hand, does
not account for the kernel being descheduled.

This patch allows the most accurate "virtual" cpu run time, collected by
the schedstats code (now shared with delay accounting code), to be returned
to user space, in addition to the "wall-clock" cpu time that was being exported
earlier. Both these times are useful for workload management in different
situations.

In a non-virtualized environment, or on architectures which do not adjust
tsk->utime/stime, these will effectively be the same value but at different
granularities.


Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>

include/linux/taskstats.h | 10 ++++++++--
kernel/delayacct.c | 12 +++++++++---
2 files changed, 17 insertions(+), 5 deletions(-)

Index: linux-2.6.16/include/linux/taskstats.h
===================================================================
--- linux-2.6.16.orig/include/linux/taskstats.h 2006-03-29 18:13:18.000000000 -0500
+++ linux-2.6.16/include/linux/taskstats.h 2006-03-29 18:13:20.000000000 -0500
@@ -46,8 +46,14 @@ struct taskstats {
__u64 swapin_count;
__u64 swapin_delay_total; /* swapin page fault wait*/

- __u64 cpu_run_total; /* cpu running time
- * no count available/provided */
+ __u64 cpu_run_real_total; /* cpu "wall-clock" running time
+ * Potentially accounts for cpu
+ * virtualization, on some arches
+ */
+ __u64 cpu_run_virtual_total; /* cpu "virtual" running time
+ * Uses time intervals as seen by
+ * the kernel
+ */
};


Index: linux-2.6.16/kernel/delayacct.c
===================================================================
--- linux-2.6.16.orig/kernel/delayacct.c 2006-03-29 18:13:18.000000000 -0500
+++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:13:20.000000000 -0500
@@ -123,17 +123,18 @@ int __delayacct_add_tsk(struct taskstats
{
nsec_t tmp;
struct timespec ts;
- unsigned long t1,t2;
+ unsigned long t1,t2,t3;

/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */

- tmp = (nsec_t)d->cpu_run_total ;
+ tmp = (nsec_t)d->cpu_run_real_total ;
tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
- d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+ d->cpu_run_real_total = (tmp < (nsec_t)d->cpu_run_real_total)? 0: tmp;

/* No locking available for sched_info. Take snapshot first. */
t1 = tsk->sched_info.pcnt;
t2 = tsk->sched_info.run_delay;
+ t3 = tsk->sched_info.cpu_time;

d->cpu_count += t1;

@@ -141,6 +142,11 @@ int __delayacct_add_tsk(struct taskstats
tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;

+ tmp = (nsec_t)d->cpu_run_virtual_total
+ + (nsec_t)jiffies_to_usecs(t3) * 1000;
+ d->cpu_run_virtual_total = (tmp < (nsec_t)d->cpu_run_virtual_total) ?
+ 0 : tmp;
+
spin_lock(&tsk->delays->lock);
tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;

2006-03-30 00:57:03

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 7/8] proc interface for block I/O delays

delayacct-procfs.patch

Export I/O delays seen by a task through /proc/<tgid>/stats
for use in top etc.

Note that delays for I/O done for swapping in pages (swapin I/O) is
clubbed together with all other I/O here (this is not the
case in the netlink interface where the swapin I/O is kept distinct)

Signed-off-by: Shailabh Nagar <[email protected]>


fs/proc/array.c | 6 ++++--
include/linux/delayacct.h | 11 +++++++++++
kernel/delayacct.c | 15 +++++++++++++++
3 files changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.16/fs/proc/array.c
===================================================================
--- linux-2.6.16.orig/fs/proc/array.c 2006-03-29 18:12:54.000000000 -0500
+++ linux-2.6.16/fs/proc/array.c 2006-03-29 18:13:21.000000000 -0500
@@ -75,6 +75,7 @@
#include <linux/times.h>
#include <linux/cpuset.h>
#include <linux/rcupdate.h>
+#include <linux/delayacct.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -414,7 +415,7 @@ static int do_task_stat(struct task_stru

res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
task->pid,
tcomm,
state,
@@ -459,7 +460,8 @@ static int do_task_stat(struct task_stru
task->exit_signal,
task_cpu(task),
task->rt_priority,
- task->policy);
+ task->policy,
+ delayacct_blkio_ticks(task));
if(mm)
mmput(mm);
return res;
Index: linux-2.6.16/include/linux/delayacct.h
===================================================================
--- linux-2.6.16.orig/include/linux/delayacct.h 2006-03-29 18:13:18.000000000 -0500
+++ linux-2.6.16/include/linux/delayacct.h 2006-03-29 18:13:21.000000000 -0500
@@ -29,6 +29,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
+extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);

static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -54,6 +55,12 @@ static inline void delayacct_blkio_end(v
if (current->delays)
__delayacct_blkio_end();
}
+static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ if (unlikely(delayacct_on))
+ return __delayacct_blkio_ticks(tsk);
+ return 0;
+}
#else
static inline void delayacct_init(void)
{}
@@ -65,6 +72,10 @@ static inline void delayacct_blkio_start
{}
static inline void delayacct_blkio_end(void)
{}
+static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ return 0;
+}
#endif /* CONFIG_TASK_DELAY_ACCT */
#ifdef CONFIG_TASKSTATS
static inline int delayacct_add_tsk(struct taskstats *d,
Index: linux-2.6.16/kernel/delayacct.c
===================================================================
--- linux-2.6.16.orig/kernel/delayacct.c 2006-03-29 18:13:20.000000000 -0500
+++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:13:21.000000000 -0500
@@ -118,6 +118,21 @@ void __delayacct_blkio_end(void)
&current->delays->blkio_delay,
&current->delays->blkio_count);
}
+
+unsigned long long __delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ unsigned long long ret;
+
+ if (!tsk->delays)
+ return 0;
+
+ spin_lock(&tsk->delays->lock);
+ ret = nsec_to_clock_t(tsk->delays->blkio_delay +
+ tsk->delays->swapin_delay);
+ spin_unlock(&tsk->delays->lock);
+ return ret;
+}
+
#ifdef CONFIG_TASKSTATS
int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
{

2006-03-30 00:59:38

by Shailabh Nagar

[permalink] [raw]
Subject: [Patch 8/8] documentation, userspace utility

delayacct-doc.patch

Some documentation for delay accounting.


Signed-off-by: Shailabh Nagar <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>

Documentation/delay-accounting.txt | 489 +++++++++++++++++++++++++++++++++++++
1 files changed, 489 insertions(+)

Index: linux-2.6.16/Documentation/delay-accounting.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16/Documentation/delay-accounting.txt 2006-03-29 18:57:14.000000000 -0500
@@ -0,0 +1,489 @@
+Delay accounting
+----------------
+
+Tasks encounter delays in execution when they wait
+for some kernel resource to become available e.g. a
+runnable task may wait for a free CPU to run on.
+
+The per-task delay accounting functionality measures
+the delays experienced by a task while
+
+a) waiting for a CPU (while being runnable)
+b) completion of synchronous block I/O initiated by the task
+c) swapping in pages
+
+and makes these statistics available to userspace through
+an efficient netlink interface that permits continuous monitoring
+of a large number of tasks.
+
+Such delays provide feedback for setting a task's cpu priority,
+io priority and rss limit values appropriately. Long delays for
+important tasks could be a trigger for raising its corresponding priority.
+
+Delay statistics can also be collected for all tasks and aggregated into
+arbitrary groups by workload management applications.
+
+
+Interface
+---------
+
+A netlink interface (NETLINK_GENERIC family) has been provided to
+access the delay statistics efficiently.
+
+See include/linux/taskstats.h
+for the format of data returned per-task or per-tgid.
+It will generally be in the form of counters returning the cumulative
+delay seen for cpu, sync block I/O, swapin etc.
+
+Taking the difference of two successive readings of a given
+counter (say cpu_delay_total) for a task will give the delay
+experienced by the task waiting for the corresponding resource
+in that interval.
+
+The getdelay utility described below allows simple commands to be run
+and the corresponding delay statistics to be displayed.
+
+
+Usage
+-----
+
+Compile the kernel with
+ CONFIG_TASK_DELAY_ACCT=y
+ CONFIG_TASKSTATS=y
+
+Enable the accounting at boot time by adding
+the following to the kernel boot options
+ delayacct
+
+and after the system is up, use the getdelay
+utility or similar to access the delays
+seen by a given task or a task group (tgid).
+The utility also allows a given command to be
+executed and the corresponding delays to be
+seen.
+
+
+getdelays [-t tgid] [-p pid] [-c cmd...]
+
+
+Get delays, since system boot, for pid 10
+# ./getdelays -p 10
+(output similar to next case)
+
+Get sum of delays, since system boot, for all pids with tgid 5
+# ./getdelays -t 5
+
+
+CPU count real total virtual total delay total
+ 7876 92005750 100000000 24001500
+IO count delay total
+ 0 0
+MEM count delay total
+ 0 0
+
+Get delays seen in executing a given simple command
+# ./getdelays -c ls /
+
+bin data1 data3 data5 dev home media opt root srv sys usr
+boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
+
+
+CPU count real total virtual total delay total
+ 6 4000250 4000000 0
+IO count delay total
+ 0 0
+MEM count delay total
+ 0 0
+
+getdelays utility source
+------------------------
+
+/* Genetlink test: user program
+ *
+ * Illustrates usage of NETLINK_GENERIC interface to delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <signal.h>
+
+#include <linux/genetlink.h>
+#include <linux/taskstats.h>
+
+/*
+ * Generic macros for dealing with netlink sockets. Might be duplicated
+ * elsewhere. It is recommended that commercial grade applications use
+ * libnl or libnetlink and use the interfaces provided by the library
+ */
+#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
+#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
+
+#define err(code, fmt, arg...) do { printf(fmt, ##arg); exit(code); } while (0)
+int done = 0;
+
+/*
+ * Create a raw netlink socket and bind
+ */
+static int create_nl_socket(int protocol, int groups)
+{
+ socklen_t addr_len;
+ int fd;
+ struct sockaddr_nl local;
+
+ fd = socket(AF_NETLINK, SOCK_RAW, protocol);
+ if (fd < 0)
+ return -1;
+
+ memset(&local, 0, sizeof(local));
+ local.nl_family = AF_NETLINK;
+ local.nl_groups = groups;
+
+ if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
+ goto error;
+
+ return fd;
+ error:
+ close(fd);
+ return -1;
+}
+
+int sendto_fd(int s, const char *buf, int bufLen)
+{
+ struct sockaddr_nl nladdr;
+ int r;
+
+ memset(&nladdr, 0, sizeof(nladdr));
+ nladdr.nl_family = AF_NETLINK;
+
+ while ((r = sendto(s, buf, bufLen, 0, (struct sockaddr *) &nladdr,
+ sizeof(nladdr))) < bufLen) {
+ if (r > 0) {
+ buf += r;
+ bufLen -= r;
+ } else if (errno != EAGAIN)
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * Probe the controller in genetlink to find the family id
+ * for the TASKSTATS family
+ */
+int get_family_id(int sd)
+{
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } family_req;
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } ans;
+
+ int id;
+ struct nlattr *na;
+ int rep_len;
+
+ /* Get family name */
+ family_req.n.nlmsg_type = GENL_ID_CTRL;
+ family_req.n.nlmsg_flags = NLM_F_REQUEST;
+ family_req.n.nlmsg_seq = 0;
+ family_req.n.nlmsg_pid = getpid();
+ family_req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+ family_req.g.cmd = CTRL_CMD_GETFAMILY;
+ family_req.g.version = 0x1;
+ na = (struct nlattr *) GENLMSG_DATA(&family_req);
+ na->nla_type = CTRL_ATTR_FAMILY_NAME;
+ na->nla_len = strlen(TASKSTATS_GENL_NAME) + 1 + NLA_HDRLEN;
+ strcpy(NLA_DATA(na), TASKSTATS_GENL_NAME);
+ family_req.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+ if (sendto_fd(sd, (char *) &family_req, family_req.n.nlmsg_len) < 0)
+ err(1, "error sending message via Netlink\n");
+
+ rep_len = recv(sd, &ans, sizeof(ans), 0);
+
+ if (rep_len < 0)
+ err(1, "error receiving reply message via Netlink\n");
+
+
+ /* Validate response message */
+ if (!NLMSG_OK((&ans.n), rep_len))
+ err(1, "invalid reply message received via Netlink\n");
+
+ if (ans.n.nlmsg_type == NLMSG_ERROR) { /* error */
+ printf("error received NACK - leaving\n");
+ exit(1);
+ }
+
+
+ na = (struct nlattr *) GENLMSG_DATA(&ans);
+ na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
+ if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
+ id = *(__u16 *) NLA_DATA(na);
+ }
+ return id;
+}
+
+void print_taskstats(struct taskstats *t)
+{
+ printf("\n\nCPU %15s%15s%15s%15s\n"
+ " %15llu%15llu%15llu%15llu\n"
+ "IO %15s%15s\n"
+ " %15llu%15llu\n"
+ "MEM %15s%15s\n"
+ " %15llu%15llu\n\n",
+ "count", "real total", "virtual total", "delay total",
+ t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total,
+ t->cpu_delay_total,
+ "count", "delay total",
+ t->blkio_count, t->blkio_delay_total,
+ "count", "delay total", t->swapin_count, t->swapin_delay_total);
+}
+
+void sigchld(int sig)
+{
+ done = 1;
+}
+
+int main(int argc, char *argv[])
+{
+ int rc;
+ int sk_nl;
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *genlhdr;
+ char *buf;
+ struct taskstats_cmd_param *param;
+ __u16 id;
+ struct nlattr *na;
+
+ /* For receiving */
+ struct sockaddr_nl kern_nla, from_nla;
+ socklen_t from_nla_len;
+ int recv_len;
+ struct taskstats_reply *reply;
+
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } req;
+
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } ans;
+
+ int nl_sd = -1;
+ int rep_len;
+ int len = 0;
+ int aggr_len, len2;
+ struct sockaddr_nl nladdr;
+ pid_t tid = 0;
+ pid_t rtid = 0;
+ int cmd_type = TASKSTATS_TYPE_TGID;
+ int c, status;
+ int forking = 0;
+ struct sigaction act = {
+ .sa_handler = SIG_IGN,
+ .sa_mask = SA_NOMASK,
+ };
+ struct sigaction tact = {
+ .sa_handler = sigchld,
+ .sa_mask = SA_NOMASK,
+ };
+
+
+ if (argc < 3) {
+ printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
+ exit(-1);
+ }
+
+ if (sigaction(SIGCHLD, &tact, NULL) < 0)
+ err(1, "sigaction failed for SIGCHLD\n");
+
+ while (1) {
+
+ c = getopt(argc, argv, "t:p:c:");
+ if (c < 0)
+ break;
+
+ switch (c) {
+ case 't':
+ tid = atoi(optarg);
+ if (!tid)
+ err(1, "Invalid tgid\n");
+ cmd_type = TASKSTATS_CMD_ATTR_TGID;
+ break;
+ case 'p':
+ tid = atoi(optarg);
+ if (!tid)
+ err(1, "Invalid pid\n");
+ cmd_type = TASKSTATS_CMD_ATTR_TGID;
+ break;
+ case 'c':
+ opterr = 0;
+ tid = fork();
+ if (tid < 0)
+ err(1, "fork failed\n");
+
+ if (tid == 0) { /* child process */
+ if (execvp(argv[optind - 1], &argv[optind - 1]) < 0) {
+ exit(-1);
+ }
+ }
+ forking = 1;
+ break;
+ default:
+ case '?':
+ //printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
+ //exit(-1);
+ break;
+ }
+ if (c == 'c')
+ break;
+ }
+
+ /* Construct Netlink request message */
+
+ /* Send Netlink request message & get reply */
+
+ if ((nl_sd =
+ create_nl_socket(NETLINK_GENERIC, TASKSTATS_LISTEN_GROUP)) < 0)
+ err(1, "error creating Netlink socket\n");
+
+
+ id = get_family_id(nl_sd);
+
+ /* Send command needed */
+ req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+ req.n.nlmsg_type = id;
+ req.n.nlmsg_flags = NLM_F_REQUEST;
+ req.n.nlmsg_seq = 0;
+ req.n.nlmsg_pid = tid;
+ req.g.cmd = TASKSTATS_CMD_GET;
+ na = (struct nlattr *) GENLMSG_DATA(&req);
+ na->nla_type = cmd_type;
+ na->nla_len = sizeof(unsigned int) + NLA_HDRLEN;
+ *(__u32 *) NLA_DATA(na) = tid;
+ req.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+
+ if (!forking && sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0)
+ err(1, "error sending message via Netlink\n");
+
+ if (sigaction(SIGINT, &act, NULL) < 0)
+ err(1, "sigaction failed for SIGINT\n");
+ do {
+ int i;
+ struct pollfd pfd;
+ int pollres;
+
+ pfd.events = 0xffff & ~POLLOUT;
+ pfd.fd = nl_sd;
+ pollres = poll(&pfd, 1, 5000);
+ if (pollres < 0 || done) {
+ break;
+ }
+
+ rep_len = recv(nl_sd, &ans, sizeof(ans), 0);
+ nladdr.nl_family = AF_NETLINK;
+ nladdr.nl_groups = TASKSTATS_LISTEN_GROUP;
+
+ if (ans.n.nlmsg_type == NLMSG_ERROR) { /* error */
+ printf("error received NACK - leaving\n");
+ exit(1);
+ }
+
+ if (rep_len < 0) {
+ err(1, "error receiving reply message via Netlink\n");
+ break;
+ }
+
+ /* Validate response message */
+ if (!NLMSG_OK((&ans.n), rep_len))
+ err(1, "invalid reply message received via Netlink\n");
+
+ rep_len = GENLMSG_PAYLOAD(&ans.n);
+
+ na = (struct nlattr *) GENLMSG_DATA(&ans);
+ len = 0;
+ i = 0;
+ while (len < rep_len) {
+ len += NLA_ALIGN(na->nla_len);
+ switch (na->nla_type) {
+ case TASKSTATS_TYPE_AGGR_PID:
+ /* Fall through */
+ case TASKSTATS_TYPE_AGGR_TGID:
+ aggr_len = NLA_PAYLOAD(na->nla_len);
+ len2 = 0;
+ /* For nested attributes, na follows */
+ na = (struct nlattr *) NLA_DATA(na);
+ done = 0;
+ while (len2 < aggr_len) {
+ switch (na->nla_type) {
+ case TASKSTATS_TYPE_PID:
+ rtid = *(int *) NLA_DATA(na);
+ break;
+ case TASKSTATS_TYPE_TGID:
+ rtid = *(int *) NLA_DATA(na);
+ break;
+ case TASKSTATS_TYPE_STATS:
+ if (rtid == tid) {
+ print_taskstats((struct taskstats *)
+ NLA_DATA(na));
+ done = 1;
+ }
+ break;
+ }
+ len2 += NLA_ALIGN(na->nla_len);
+ na = (struct nlattr *) ((char *) na + len2);
+ if (done)
+ break;
+ }
+ }
+ na = (struct nlattr *) (GENLMSG_DATA(&ans) + len);
+ if (done)
+ break;
+ }
+ if (done)
+ break;
+ }
+ while (1);
+
+ close(nl_sd);
+ return 0;
+}

2006-03-30 05:03:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 2/8] Block I/O, swapin delays

Shailabh Nagar <[email protected]> wrote:
>
> delayacct-blkio-swapin.patch
>
> Collect per-task block I/O delay statistics.
>
> Unlike earlier iterations of the delay accounting
> patches, now delays are only collected for the actual
> I/O waits rather than try and cover the delays seen in
> I/O submission paths.
>
> Account separately for block I/O delays
> incurred as a result of swapin page faults whose
> frequency can be affected by the task/process' rss limit.
> Hence swapin delays can act as feedback for rss limit changes
> independent of I/O priority changes.
>
> ..
>
> +#define PF_SWAPIN 0x02000000 /* I am doing a swap in */
>

Is there really no sane alternative to doing it this way?

2006-03-30 05:03:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 1/8] Setup

Shailabh Nagar <[email protected]> wrote:
>
> delayacct-setup.patch
>
> Initialization code related to collection of per-task "delay"
> statistics which measure how long it had to wait for cpu,
> sync block io, swapping etc. The collection of statistics and
> the interface are in other patches. This patch sets up the data
> structures and allows the statistics collection to be disabled
> through a kernel boot paramater.
>
> ...
>
> + delayacct [KNL] Enable per-task delay accounting
> +

Why does this boot parameter exist?

The code is neat-looking.

> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:12:57.000000000 -0500
> @@ -0,0 +1,92 @@
> +/* delayacct.c - per-task delay accounting
> + *
> + * Copyright (C) Shailabh Nagar, IBM Corp. 2006
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
> + * the GNU General Public License for more details.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/time.h>
> +#include <linux/sysctl.h>
> +#include <linux/delayacct.h>
> +
> +int delayacct_on __read_mostly = 0; /* Delay accounting turned on/off */

Yes, it should be __read_mostly. But it shouldn't be initialised to zero.

> +void __delayacct_tsk_init(struct task_struct *tsk)
> +{
> + tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL);
> + if (tsk->delays) {
> + memset(tsk->delays, 0, sizeof(*tsk->delays));
> + spin_lock_init(&tsk->delays->lock);
> + }
> +}

We have kmem_cache_zalloc() now.

> +void __delayacct_tsk_exit(struct task_struct *tsk)
> +{
> + if (tsk->delays) {
> + kmem_cache_free(delayacct_cache, tsk->delays);
> + tsk->delays = NULL;
> + }
> +}

delayacct_tsk_exit() already checked tsk->delays.

> +/*
> + * Finish delay accounting for a statistic using
> + * its timestamps (@start, @end), accumalator (@total) and @count
> + */
> +
> +static inline void delayacct_end(struct timespec *start, struct timespec *end,
> + u64 *total, u32 *count)
> +{
> + struct timespec ts;
> + nsec_t ns;
> +
> + do_posix_clock_monotonic_gettime(end);
> + ts.tv_sec = end->tv_sec - start->tv_sec;
> + ts.tv_nsec = end->tv_nsec - start->tv_nsec;
> + ns = timespec_to_ns(&ts);
> + if (ns < 0)
> + return;
> +
> + spin_lock(&current->delays->lock);
> + *total += ns;
> + (*count)++;
> + spin_unlock(&current->delays->lock);
> +}

It's strange to have a static inline function at the end of a .c file. I
guess this gets used in later patches.

It looks rather too big to be inlined.

I'm surprised that we don't already have a timeval_sub() function
somewhere.

The code you have there will cause ts.tv_nsec to go negative half the time.
It looks like timespec_to_ns() will happily fix that up for us, but it's
all a bit fragile.

2006-03-30 05:04:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Shailabh Nagar <[email protected]> wrote:
>
> Could you please include the following delay accounting patches
> in -mm ?

I'm at a loss to evaluate the suitability of this work, really. I always
am when accounting patches come along.

There are various people and various groups working on various different
things and there appears to be no coordination and little commonality of
aims. I worry that picking one submission basically at random will provide
nothing which the other groups can work on to build up their feature.

On the other hand, we don't want to do nothing until some uber-grand
all-singing, all-dancing statistics-gathering infrastructure comes along.

So I'm a bit stuck. What I would like to see happen is that there be some
coordination between the various stakeholders, and some vague plan which
they're all happy with as a basis for the eventual grand solution.

We already have various bits and pieces of statistics gathering in the
kernel and it's already a bit ad-hoc. Adding more one-requirement-specific
accounting code won't improve that situation.

But then, I said all this a year or two ago and nothing much has happened
since then. It's not your fault, but it's a problem.

Perhaps a good starting point would be a one-page bullet-point-form
wishlist of all the accounting which people want to get out of the kernel,
and a description of what the kernel<->user interface should look like.
Right now, I don't think we even have a picture of that.

We need a statistics maintainer, too, to pull together the plan,
coordinate, push things forwards. The first step would be to identify the
stakeholders, come up with that page of bullet-points.

Then again, maybe the right thing to do is to keep adding low-impact
requirement-specific statistics patches as they come along. But if we're
going to do it that way, we need an up-front reason for doing so, and I
don't know what that would be.

See my problem?

2006-03-30 05:04:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 7/8] proc interface for block I/O delays

Shailabh Nagar <[email protected]> wrote:
>
> delayacct-procfs.patch
>
> Export I/O delays seen by a task through /proc/<tgid>/stats
> for use in top etc.
>
> Note that delays for I/O done for swapping in pages (swapin I/O) is
> clubbed together with all other I/O here (this is not the
> case in the netlink interface where the swapin I/O is kept distinct)
>
> ...
>
> +
> +unsigned long long __delayacct_blkio_ticks(struct task_struct *tsk)

Why unsigned long long here, rather than __u64?

> +{
> + unsigned long long ret;
> +
> + if (!tsk->delays)
> + return 0;

delayacct_blkio_ticks() already checked that.


2006-03-30 05:04:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 3/8] cpu delays

Shailabh Nagar <[email protected]> wrote:
>
> delayacct-schedstats.patch
>
> Make the task-related schedstats functions
> callable by delay accounting even if schedstats
> collection isn't turned on. This removes the
> dependency of delay accounting on schedstats and allows
> cpu delays to be exported.
>
> ..
>
> Index: linux-2.6.16/include/linux/sched.h
> ===================================================================
> --- linux-2.6.16.orig/include/linux/sched.h 2006-03-29 18:13:13.000000000 -0500
> +++ linux-2.6.16/include/linux/sched.h 2006-03-29 18:13:15.000000000 -0500
> @@ -525,7 +525,7 @@ typedef struct prio_array prio_array_t;
> struct backing_dev_info;
> struct reclaim_state;
>
> -#ifdef CONFIG_SCHEDSTATS
> +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> struct sched_info {
> /* cumulative counters */
> unsigned long cpu_time, /* time spent on the cpu */
> @@ -537,10 +537,14 @@ struct sched_info {
> last_queued; /* when we were last queued to run */
> };
>
> +#ifdef CONFIG_SCHEDSTATS
> extern struct file_operations proc_schedstat_operations;
> -#endif
> +#endif /* CONFIG_SCHEDSTATS */
> +#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
>
> #ifdef CONFIG_TASK_DELAY_ACCT
> +extern int delayacct_on;
> +

This was already declared in delayacct.h and nothing in sched.h needs it.
So it's better if .c files pick this declaration up from delayacct.h.

>
> +static inline void rq_sched_info_arrive(struct runqueue *rq,
> + unsigned long diff)
> +{
> + if (rq) {
> + rq->rq_sched_info.run_delay += diff;
> + rq->rq_sched_info.pcnt++;
> + }
> +}

The nonatomic updates need locking. I assume it's runqueue.lock, but a
comment describing the rules would be good.

> +static inline void rq_sched_info_depart(struct runqueue *rq,
> + unsigned long diff)
> +{
> + if (rq)
> + rq->rq_sched_info.cpu_time += diff;
> +}

Ditto.

> static inline void sched_info_queued(task_t *t)
> {
> - if (!t->sched_info.last_queued)
> - t->sched_info.last_queued = jiffies;
> + if (unlikely(sched_info_on()))
> + if (!t->sched_info.last_queued)
> + t->sched_info.last_queued = jiffies;
> }

It might be better to put the unlikely() into sched_info_on() itself.

2006-03-30 05:05:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 5/8] generic netlink interface for delay accounting

Shailabh Nagar <[email protected]> wrote:
>
> delayacct-genetlink.patch
>
> Create a generic netlink interface (NETLINK_GENERIC family),
> called "taskstats", for getting delay and cpu statistics of
> tasks and thread groups during their lifetime and when they exit.
>
>

It's be best to have a netlink person review the netlinkisms here.

> +static inline int delayacct_add_tsk(struct taskstats *d,
> + struct task_struct *tsk)
> +{
> + if (!tsk->delays)
> + return -EINVAL;
> + return __delayacct_add_tsk(d, tsk);
> +}

hm. It's a worry that this can return an error if delay accounting simply
isn't enabled.

> +struct taskstats {
> + /* Maintain 64-bit alignment while extending */
> + /* Version 1 */
> +
> + /* XXX_count is number of delay values recorded.
> + * XXX_total is corresponding cumulative delay in nanoseconds
> + */
> +
> +#define TASKSTATS_NOCPUSTATS 1
> + __u64 cpu_count;
> + __u64 cpu_delay_total; /* wait, while runnable, for cpu */
> + __u64 blkio_count;
> + __u64 blkio_delay_total; /* sync,block io completion wait*/
> + __u64 swapin_count;
> + __u64 swapin_delay_total; /* swapin page fault wait*/
> +
> + __u64 cpu_run_total; /* cpu running time
> + * no count available/provided */
> +};

What locking is used to make updates to these u64's appear to be atomic?
Maybe it's deliberately nonatomic. Either way, it needs a comment.

> void __delayacct_tsk_exit(struct task_struct *tsk)
> {
> + /*
> + * Protect against racing thread group exits
> + */
> + mutex_lock(&delayacct_exit_mutex);
> + taskstats_exit_pid(tsk);
> if (tsk->delays) {
> kmem_cache_free(delayacct_cache, tsk->delays);
> tsk->delays = NULL;
> }
> + mutex_unlock(&delayacct_exit_mutex);
> }

hm, I wonder how contended that lock is likely to be.

The kmem_cache_free() can happen outside the lock.

> +#ifdef CONFIG_TASKSTATS
> +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
> +{
> + nsec_t tmp;
> + struct timespec ts;
> + unsigned long t1,t2;
> +
> + /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
> +
> + tmp = (nsec_t)d->cpu_run_total ;

stray space before semicolon.

> + tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
> + d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;

Missing space before ?, missing space before :

> + d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
> + d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;

dittos.

> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.16/kernel/taskstats.c 2006-03-29 18:13:18.000000000 -0500
> @@ -0,0 +1,292 @@
> +/*
> + * taskstats.c - Export per-task statistics to userland
> + *
> + * Copyright (C) Shailabh Nagar, IBM Corp. 2006
> + * (C) Balbir Singh, IBM Corp. 2006
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/taskstats.h>
> +#include <linux/delayacct.h>
> +#include <net/genetlink.h>
> +#include <asm/atomic.h>
> +
> +const int taskstats_version = TASKSTATS_VERSION;

Odd. It'd be better to put TASKSTATS_VERSION in the header file, use that
directly.

> +
> +static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
> + struct taskstats *stats)
> +{
> + int rc;
> + struct task_struct *tsk = pidtsk;
> +
> + if (!pidtsk) {
> + read_lock(&tasklist_lock);
> + tsk = find_task_by_pid(pid);
> + if (!tsk) {
> + read_unlock(&tasklist_lock);
> + return -ESRCH;
> + }
> + get_task_struct(tsk);
> + read_unlock(&tasklist_lock);
> + } else
> + get_task_struct(tsk);
> +
> + rc = delayacct_add_tsk(stats, tsk);
> + put_task_struct(tsk);
> +
> + return rc;
> +
> +}

Has two callsites, should be uninlined.

> +static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
> + struct taskstats *stats)
> +{
> + int rc;
> + struct task_struct *tsk, *first;
> +
> + first = tgidtsk;
> + read_lock(&tasklist_lock);
> + if (!first) {
> + first = find_task_by_pid(tgid);
> + if (!first) {
> + read_unlock(&tasklist_lock);
> + return -ESRCH;
> + }
> + }
> + tsk = first;
> + do {
> + rc = delayacct_add_tsk(stats, tsk);
> + if (rc)
> + break;
> + } while_each_thread(first, tsk);
> + read_unlock(&tasklist_lock);
> +
> + return rc;
> +}

Ditto.

It's somewhat similar to fill_pid() - perhaps they can be combined, halving
the overhead?

> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{
> + int rc = 0;
> + struct sk_buff *rep_skb;
> + struct taskstats stats;
> + void *reply;
> + size_t size;
> + struct nlattr *na;
> +
> + /*
> + * Size includes space for nested attribute as well
> + * The returned data is of the format
> + * TASKSTATS_TYPE_AGGR_PID/TGID
> + * --> TASKSTATS_TYPE_PID/TGID
> + * --> TASKSTATS_TYPE_STATS
> + */
> + size = nla_total_size(sizeof(u32)) +
> + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
> +
> + memset(&stats, 0, sizeof(stats));
> + rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
> + if (rc < 0)
> + return rc;
> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> + rc = fill_pid((pid_t)pid, NULL, &stats);

We shouldn't have a typecast here. If it generates a warning then we need
to get in there and find out why.

> + if (rc < 0)
> + goto err;
> +
> + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> + rc = fill_tgid((pid_t)tgid, NULL, &stats);

Ditto.

> + if (rc < 0)
> + goto err;
> +
> + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> + } else {
> + rc = -EINVAL;
> + goto err;
> + }
> +
> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + nla_nest_end(rep_skb, na);
> +
> + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> +
> +nla_put_failure:
> + return genlmsg_cancel(rep_skb, reply);

Extra space.

> +err:
> + nlmsg_free(rep_skb);
> + return rc;
> +}
> +
> +
> +/* Send pid data out on exit */
> +void taskstats_exit_pid(struct task_struct *tsk)
> +{
> + int rc = 0;
> + struct sk_buff *rep_skb;
> + void *reply;
> + struct taskstats stats;
> + size_t size;
> + int is_thread_group = !thread_group_empty(tsk);
> + struct nlattr *na;
> +
> + /*
> + * tasks can start to exit very early. Ensure that the family
> + * is registered before notifications are sent out
> + */
> + if (!family_registered)
> + return;

This code risks evaluating thread_group_empty() even if !family_registered.
The compiler will most likely sort that out in this case, but it's a risk
when using these initialisers.

> +
> +static int __init taskstats_init(void)
> +{
> + if (genl_register_family(&family))
> + return -EFAULT;

EFAULT?

> + family_registered = 1;

whitespace broke.

> +
> + if (genl_register_ops(&family, &taskstats_ops))
> + goto err;
> +
> + return 0;
> +err:
> + genl_unregister_family(&family);
> + family_registered = 0;
> + return -EFAULT;
> +}
> +
> +late_initcall(taskstats_init);

Why late_initcall()? (A comment would be appropriate)

2006-03-30 05:05:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 6/8] virtual cpu run time

Shailabh Nagar <[email protected]> wrote:
>
> delayacct-virtcpu.patch
>
> Distinguish between "wall-clock" and "virtual" cpu run times and return
> both, at per-task and per-tgid granularity.
>
> Some architectures adjust tsk->utime+tsk->stime to reflect the time that
> the kernel wasn't scheduled in hypervised environments and this is the
> "wall-clock" cpu run time. "Virtual" cpu run time, on the other hand, does
> not account for the kernel being descheduled.
>
> This patch allows the most accurate "virtual" cpu run time, collected by
> the schedstats code (now shared with delay accounting code), to be returned
> to user space, in addition to the "wall-clock" cpu time that was being exported
> earlier. Both these times are useful for workload management in different
> situations.
>
> In a non-virtualized environment, or on architectures which do not adjust
> tsk->utime/stime, these will effectively be the same value but at different
> granularities.
>
> ...
>
> Index: linux-2.6.16/include/linux/taskstats.h
> ===================================================================
> --- linux-2.6.16.orig/include/linux/taskstats.h 2006-03-29 18:13:18.000000000 -0500
> +++ linux-2.6.16/include/linux/taskstats.h 2006-03-29 18:13:20.000000000 -0500
> @@ -46,8 +46,14 @@ struct taskstats {
> __u64 swapin_count;
> __u64 swapin_delay_total; /* swapin page fault wait*/
>
> - __u64 cpu_run_total; /* cpu running time
> - * no count available/provided */
> + __u64 cpu_run_real_total; /* cpu "wall-clock" running time
> + * Potentially accounts for cpu
> + * virtualization, on some arches
> + */
> + __u64 cpu_run_virtual_total; /* cpu "virtual" running time
> + * Uses time intervals as seen by
> + * the kernel
> + */
> };
>

Again, the reader of this struct wants to know what the atomicity rules are.

> + d->cpu_run_real_total = (tmp < (nsec_t)d->cpu_run_real_total)? 0: tmp;

lval = expr1 ? expr2 : expr3;

> + tmp = (nsec_t)d->cpu_run_virtual_total
> + + (nsec_t)jiffies_to_usecs(t3) * 1000;

umm, Linux doesn't have nsec_t any more.

2006-03-30 06:13:08

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 5/8] generic netlink interface for delay accounting

Hi, Andrew


On Wed, Mar 29, 2006 at 09:04:06PM -0800, Andrew Morton wrote:
> Shailabh Nagar <[email protected]> wrote:
> >
> > delayacct-genetlink.patch
> >
> > Create a generic netlink interface (NETLINK_GENERIC family),
> > called "taskstats", for getting delay and cpu statistics of
> > tasks and thread groups during their lifetime and when they exit.
> >
> >
>
> It's be best to have a netlink person review the netlinkisms here.

Jamal did review this code, his comments are available at
http://lkml.org/lkml/2006/3/26/71. His comments were very helpful in
doing the correct thing and understanding the design and usage of
genetlink.

>
> > +static inline int delayacct_add_tsk(struct taskstats *d,
> > + struct task_struct *tsk)
> > +{
> > + if (!tsk->delays)
> > + return -EINVAL;
> > + return __delayacct_add_tsk(d, tsk);
> > +}
>
> hm. It's a worry that this can return an error if delay accounting simply
> isn't enabled.

Yes, if CONFIG_TASKSTATS is enabled and the kernel is booted without
delayacct. A user space utility trying to extract statistics will get
an error.

>
> > +struct taskstats {
> > + /* Maintain 64-bit alignment while extending */
> > + /* Version 1 */
> > +
> > + /* XXX_count is number of delay values recorded.
> > + * XXX_total is corresponding cumulative delay in nanoseconds
> > + */
> > +
> > +#define TASKSTATS_NOCPUSTATS 1
> > + __u64 cpu_count;
> > + __u64 cpu_delay_total; /* wait, while runnable, for cpu */
> > + __u64 blkio_count;
> > + __u64 blkio_delay_total; /* sync,block io completion wait*/
> > + __u64 swapin_count;
> > + __u64 swapin_delay_total; /* swapin page fault wait*/
> > +
> > + __u64 cpu_run_total; /* cpu running time
> > + * no count available/provided */
> > +};
>
> What locking is used to make updates to these u64's appear to be atomic?
> Maybe it's deliberately nonatomic. Either way, it needs a comment.

These fields are protected by tsk->delays->lock. We will add comments
to indicate the same.

>
> > void __delayacct_tsk_exit(struct task_struct *tsk)
> > {
> > + /*
> > + * Protect against racing thread group exits
> > + */
> > + mutex_lock(&delayacct_exit_mutex);
> > + taskstats_exit_pid(tsk);
> > if (tsk->delays) {
> > kmem_cache_free(delayacct_cache, tsk->delays);
> > tsk->delays = NULL;
> > }
> > + mutex_unlock(&delayacct_exit_mutex);
> > }
>
> hm, I wonder how contended that lock is likely to be.
>

It is taken for every exiting task. We did not measure the contention
using any tool.

> The kmem_cache_free() can happen outside the lock.


kmem_cache_free() and setting to NULL outside the lock is prone to
race conditions. Consider the following scenario

A thread group T1 has exiting processes P1 and P2

P1 is exiting, finishes the delay accounting by calling taskstats_exit_pid()
and gives up the mutex and calls kmem_cache_free(), but before it can set
tsk->delays to NULL, we try to get statistics for the entire thread group.
This task will show up in the thread group with a dangling tsk->delays.

>
> > +#ifdef CONFIG_TASKSTATS
> > +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
> > +{
> > + nsec_t tmp;
> > + struct timespec ts;
> > + unsigned long t1,t2;
> > +
> > + /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
> > +
> > + tmp = (nsec_t)d->cpu_run_total ;
>
> stray space before semicolon.

Will fix it.

>
> > + tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
> > + d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
>
> Missing space before ?, missing space before :

Will fix it.

>
> > + d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
> > + d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
>
> dittos.

Will fix it.

>
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6.16/kernel/taskstats.c 2006-03-29 18:13:18.000000000 -0500
> > @@ -0,0 +1,292 @@
> > +/*
> > + * taskstats.c - Export per-task statistics to userland
> > + *
> > + * Copyright (C) Shailabh Nagar, IBM Corp. 2006
> > + * (C) Balbir Singh, IBM Corp. 2006
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/taskstats.h>
> > +#include <linux/delayacct.h>
> > +#include <net/genetlink.h>
> > +#include <asm/atomic.h>
> > +
> > +const int taskstats_version = TASKSTATS_VERSION;
>
> Odd. It'd be better to put TASKSTATS_VERSION in the header file, use that
> directly.

We will fix this.

>
> > +
> > +static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
> > + struct taskstats *stats)
> > +{
> > + int rc;
> > + struct task_struct *tsk = pidtsk;
> > +
> > + if (!pidtsk) {
> > + read_lock(&tasklist_lock);
> > + tsk = find_task_by_pid(pid);
> > + if (!tsk) {
> > + read_unlock(&tasklist_lock);
> > + return -ESRCH;
> > + }
> > + get_task_struct(tsk);
> > + read_unlock(&tasklist_lock);
> > + } else
> > + get_task_struct(tsk);
> > +
> > + rc = delayacct_add_tsk(stats, tsk);
> > + put_task_struct(tsk);
> > +
> > + return rc;
> > +
> > +}
>
> Has two callsites, should be uninlined.

Will do.

>
> > +static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
> > + struct taskstats *stats)
> > +{
> > + int rc;
> > + struct task_struct *tsk, *first;
> > +
> > + first = tgidtsk;
> > + read_lock(&tasklist_lock);
> > + if (!first) {
> > + first = find_task_by_pid(tgid);
> > + if (!first) {
> > + read_unlock(&tasklist_lock);
> > + return -ESRCH;
> > + }
> > + }
> > + tsk = first;
> > + do {
> > + rc = delayacct_add_tsk(stats, tsk);
> > + if (rc)
> > + break;
> > + } while_each_thread(first, tsk);
> > + read_unlock(&tasklist_lock);
> > +
> > + return rc;
> > +}
>
> Ditto.

Will do.

>
> It's somewhat similar to fill_pid() - perhaps they can be combined, halving
> the overhead?

Yes, there is a lot of synergy between the two. The main difference is
the way we lock (using get/put_task_struct for pids and tasklist_lock
for tgids). Using a flag to do the correct thing looked a bit ugly,
so we split the functions.

>
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
> > + int rc = 0;
> > + struct sk_buff *rep_skb;
> > + struct taskstats stats;
> > + void *reply;
> > + size_t size;
> > + struct nlattr *na;
> > +
> > + /*
> > + * Size includes space for nested attribute as well
> > + * The returned data is of the format
> > + * TASKSTATS_TYPE_AGGR_PID/TGID
> > + * --> TASKSTATS_TYPE_PID/TGID
> > + * --> TASKSTATS_TYPE_STATS
> > + */
> > + size = nla_total_size(sizeof(u32)) +
> > + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
> > +
> > + memset(&stats, 0, sizeof(stats));
> > + rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
> > + if (rc < 0)
> > + return rc;
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > + rc = fill_pid((pid_t)pid, NULL, &stats);
>
> We shouldn't have a typecast here. If it generates a warning then we need
> to get in there and find out why.

The reason for a typecast is that pid is passed as a u32 from userspace.
genetlink currently supports most unsigned types with little or no
support for signed types. We exchange data as u32 and do the correct
thing in the kernel. Would you like us to move away from this?

>
> > + if (rc < 0)
> > + goto err;
> > +
> > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> > + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> > + rc = fill_tgid((pid_t)tgid, NULL, &stats);
>
> Ditto.
>
> > + if (rc < 0)
> > + goto err;
> > +
> > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> > + } else {
> > + rc = -EINVAL;
> > + goto err;
> > + }
> > +
> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + nla_nest_end(rep_skb, na);
> > +
> > + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> > +
> > +nla_put_failure:
> > + return genlmsg_cancel(rep_skb, reply);
>
> Extra space.

Will fix.

>
> > +err:
> > + nlmsg_free(rep_skb);
> > + return rc;
> > +}
> > +
> > +
> > +/* Send pid data out on exit */
> > +void taskstats_exit_pid(struct task_struct *tsk)
> > +{
> > + int rc = 0;
> > + struct sk_buff *rep_skb;
> > + void *reply;
> > + struct taskstats stats;
> > + size_t size;
> > + int is_thread_group = !thread_group_empty(tsk);
> > + struct nlattr *na;
> > +
> > + /*
> > + * tasks can start to exit very early. Ensure that the family
> > + * is registered before notifications are sent out
> > + */
> > + if (!family_registered)
> > + return;
>
> This code risks evaluating thread_group_empty() even if !family_registered.
> The compiler will most likely sort that out in this case, but it's a risk
> when using these initialisers.
>

Yes, this can be optimized and we can initialize it after the check for
!family_registerd

> > +
> > +static int __init taskstats_init(void)
> > +{
> > + if (genl_register_family(&family))
> > + return -EFAULT;
>
> EFAULT?

It shouldn't be (Shailabh please comment). We will fix it.

>
> > + family_registered = 1;
>
> whitespace broke.

Will fix

>
> > +
> > + if (genl_register_ops(&family, &taskstats_ops))
> > + goto err;
> > +
> > + return 0;
> > +err:
> > + genl_unregister_family(&family);
> > + family_registered = 0;
> > + return -EFAULT;
> > +}
> > +
> > +late_initcall(taskstats_init);
>
> Why late_initcall()? (A comment would be appropriate)

We will add a comment.

Thanks for your detailed review,
Balbir

2006-03-30 06:26:34

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

On Wed, Mar 29, 2006 at 09:03:14PM -0800, Andrew Morton wrote:
> Shailabh Nagar <[email protected]> wrote:
> >
> > Could you please include the following delay accounting patches
> > in -mm ?
>
> I'm at a loss to evaluate the suitability of this work, really. I always
> am when accounting patches come along.
>
> There are various people and various groups working on various different
> things and there appears to be no coordination and little commonality of
> aims. I worry that picking one submission basically at random will provide
> nothing which the other groups can work on to build up their feature.
>
> On the other hand, we don't want to do nothing until some uber-grand
> all-singing, all-dancing statistics-gathering infrastructure comes along.
>
> So I'm a bit stuck. What I would like to see happen is that there be some
> coordination between the various stakeholders, and some vague plan which
> they're all happy with as a basis for the eventual grand solution.
>
> We already have various bits and pieces of statistics gathering in the
> kernel and it's already a bit ad-hoc. Adding more one-requirement-specific
> accounting code won't improve that situation.
>
> But then, I said all this a year or two ago and nothing much has happened
> since then. It's not your fault, but it's a problem.
>
> Perhaps a good starting point would be a one-page bullet-point-form
> wishlist of all the accounting which people want to get out of the kernel,
> and a description of what the kernel<->user interface should look like.
> Right now, I don't think we even have a picture of that.
>
> We need a statistics maintainer, too, to pull together the plan,
> coordinate, push things forwards. The first step would be to identify the
> stakeholders, come up with that page of bullet-points.
>
> Then again, maybe the right thing to do is to keep adding low-impact
> requirement-specific statistics patches as they come along. But if we're
> going to do it that way, we need an up-front reason for doing so, and I
> don't know what that would be.
>
> See my problem?

One of the issues we have tried to address is the ability to provide some
form of a common ground for all the statistics to co-exist. Various methods
were discussed for exchanging data between kernel and user space, genetlink
was suggested often and the clear winner.

To that end, we have created a taskstats.c file. Any subsystem wanting
to add their statistics and sending it to user space can add their own
types by extending taskstats.c (changing the version number) and creating
their own types using genetlink. They will have to do the following

1. Add statistics gathering in their own subsystem
2. Add a type to taskstats.c, extend it and use data from (1) and send
it to user space.

The data from various subsystems can co-exist. I feel that this could serve as
the basic common infrastructure to begin with and refined later (depending on
the needs of other people).

Thoughts?

Balbir

2006-03-30 06:27:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 5/8] generic netlink interface for delay accounting

Balbir Singh <[email protected]> wrote:
>
> > The kmem_cache_free() can happen outside the lock.
>
>
> kmem_cache_free() and setting to NULL outside the lock is prone to
> race conditions. Consider the following scenario
>
> A thread group T1 has exiting processes P1 and P2
>
> P1 is exiting, finishes the delay accounting by calling taskstats_exit_pid()
> and gives up the mutex and calls kmem_cache_free(), but before it can set
> tsk->delays to NULL, we try to get statistics for the entire thread group.
> This task will show up in the thread group with a dangling tsk->delays.

Yes, the `tsk->delays = NULL;' needs to happen inside the lock. But the
kmem_cache_free() does not. It pointlessly increases the lock hold time.

> > > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > > + rc = fill_pid((pid_t)pid, NULL, &stats);
> >
> > We shouldn't have a typecast here. If it generates a warning then we need
> > to get in there and find out why.
>
> The reason for a typecast is that pid is passed as a u32 from userspace.
> genetlink currently supports most unsigned types with little or no
> support for signed types. We exchange data as u32 and do the correct
> thing in the kernel. Would you like us to move away from this?
>

I think it's best to avoid the cast unless it's actually needed to avoid a
warning or compile error, or to do special things with sign extension.
Because casts clutter up the code and can hide real bugs. In this case the
compiler should silently perform the conversion.

2006-03-30 06:32:18

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 5/8] generic netlink interface for delay accounting

On Wed, Mar 29, 2006 at 10:26:29PM -0800, Andrew Morton wrote:
> Balbir Singh <[email protected]> wrote:
> >
> > > The kmem_cache_free() can happen outside the lock.
> >
> >
> > kmem_cache_free() and setting to NULL outside the lock is prone to
> > race conditions. Consider the following scenario
> >
> > A thread group T1 has exiting processes P1 and P2
> >
> > P1 is exiting, finishes the delay accounting by calling taskstats_exit_pid()
> > and gives up the mutex and calls kmem_cache_free(), but before it can set
> > tsk->delays to NULL, we try to get statistics for the entire thread group.
> > This task will show up in the thread group with a dangling tsk->delays.
>
> Yes, the `tsk->delays = NULL;' needs to happen inside the lock. But the
> kmem_cache_free() does not. It pointlessly increases the lock hold time.

Understood will fix it

>
> > > > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > > > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > > > + rc = fill_pid((pid_t)pid, NULL, &stats);
> > >
> > > We shouldn't have a typecast here. If it generates a warning then we need
> > > to get in there and find out why.
> >
> > The reason for a typecast is that pid is passed as a u32 from userspace.
> > genetlink currently supports most unsigned types with little or no
> > support for signed types. We exchange data as u32 and do the correct
> > thing in the kernel. Would you like us to move away from this?
> >
>
> I think it's best to avoid the cast unless it's actually needed to avoid a
> warning or compile error, or to do special things with sign extension.
> Because casts clutter up the code and can hide real bugs. In this case the
> compiler should silently perform the conversion.

Yep, the compiler was doing it for me, but I tried to be smart and cast
things around. Will fix it.

Thanks,
Balbir

2006-03-30 06:48:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Balbir Singh <[email protected]> wrote:
>
> On Wed, Mar 29, 2006 at 09:03:14PM -0800, Andrew Morton wrote:
> > Shailabh Nagar <[email protected]> wrote:
> > >
> > > Could you please include the following delay accounting patches
> > > in -mm ?
> >
> > I'm at a loss to evaluate the suitability of this work, really. I always
> > am when accounting patches come along.
> >
> > There are various people and various groups working on various different
> > things and there appears to be no coordination and little commonality of
> > aims. I worry that picking one submission basically at random will provide
> > nothing which the other groups can work on to build up their feature.
> >
> > On the other hand, we don't want to do nothing until some uber-grand
> > all-singing, all-dancing statistics-gathering infrastructure comes along.
> >
> > So I'm a bit stuck. What I would like to see happen is that there be some
> > coordination between the various stakeholders, and some vague plan which
> > they're all happy with as a basis for the eventual grand solution.
> >
> > We already have various bits and pieces of statistics gathering in the
> > kernel and it's already a bit ad-hoc. Adding more one-requirement-specific
> > accounting code won't improve that situation.
> >
> > But then, I said all this a year or two ago and nothing much has happened
> > since then. It's not your fault, but it's a problem.
> >
> > Perhaps a good starting point would be a one-page bullet-point-form
> > wishlist of all the accounting which people want to get out of the kernel,
> > and a description of what the kernel<->user interface should look like.
> > Right now, I don't think we even have a picture of that.
> >
> > We need a statistics maintainer, too, to pull together the plan,
> > coordinate, push things forwards. The first step would be to identify the
> > stakeholders, come up with that page of bullet-points.
> >
> > Then again, maybe the right thing to do is to keep adding low-impact
> > requirement-specific statistics patches as they come along. But if we're
> > going to do it that way, we need an up-front reason for doing so, and I
> > don't know what that would be.
> >
> > See my problem?
>
> One of the issues we have tried to address is the ability to provide some
> form of a common ground for all the statistics to co-exist. Various methods
> were discussed for exchanging data between kernel and user space, genetlink
> was suggested often and the clear winner.
>
> To that end, we have created a taskstats.c file. Any subsystem wanting
> to add their statistics and sending it to user space can add their own
> types by extending taskstats.c (changing the version number) and creating
> their own types using genetlink. They will have to do the following
>
> 1. Add statistics gathering in their own subsystem
> 2. Add a type to taskstats.c, extend it and use data from (1) and send
> it to user space.
>
> The data from various subsystems can co-exist. I feel that this could serve as
> the basic common infrastructure to begin with and refined later (depending on
> the needs of other people).
>

Sounds fine to me, but I'm not a stakeholder.

Trolling back through lse-tech gives us:

pnotify:
Erik Jacobson <[email protected]>

CSA accounting/PAGG/JOB:
Jay Lan <[email protected]>
Limin Gu <[email protected]>

per-process IO statistics:
Levent Serinol <[email protected]>

ELSA:
Guillaume Thouvenin <[email protected]>

per-cpu time statistics:
Erich Focht <[email protected]>

Scalable statistics counters with /proc reporting:
Ravikiran G Thirumalai <[email protected]>
(Kiran feft IBM, but presumably the requirement lives on)

There was a long thread "A common layer for Accounting packages". Did it
come to a conclusion?

Anyway, if mostly everyone is mostly happy with what you propose then that
it good news.

2006-03-30 09:55:56

by Paul Jackson

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Andrew wrote:
> CSA accounting/PAGG/JOB:
> Jay Lan <[email protected]>
> Limin Gu <[email protected]>

You can remove Limin Gu from this list. She has joined
the ranks of former-SGI employees, some time back. We
wish her well.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-03-30 13:25:35

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [Patch 0/8] per-task delay accounting

On Wed, Mar 29, 2006 at 10:47:37PM -0800, Andrew Morton wrote:
>
> Sounds fine to me, but I'm not a stakeholder.
>
> Trolling back through lse-tech gives us:
>
> Scalable statistics counters with /proc reporting:
> Ravikiran G Thirumalai <[email protected]>
> (Kiran feft IBM, but presumably the requirement lives on)

Not necessarily in that form. A lot of statistics has now become
per-cpu, something we wanted to achieve back then. Automatic
/proc reporting was an idea only tossed around, but /proc is now
deprecated for such things. There may be a need for fast export of
counters to userspace, but those requirements are not yet clear.

This is different from per-task accounting infrastructure that
people are trying to develop.

Thanks
Dipankar

2006-03-30 15:07:30

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 1/8] Setup

Andrew Morton wrote:

>Shailabh Nagar <[email protected]> wrote:
>
>
>>delayacct-setup.patch
>>
>>Initialization code related to collection of per-task "delay"
>>statistics which measure how long it had to wait for cpu,
>>sync block io, swapping etc. The collection of statistics and
>>the interface are in other patches. This patch sets up the data
>>structures and allows the statistics collection to be disabled
>>through a kernel boot paramater.
>>
>>...
>>
>>+ delayacct [KNL] Enable per-task delay accounting
>>+
>>
>>
>
>Why does this boot parameter exist?
>
>

To allow people who aren't interested in these statistics from paying the
overhead of collecting the stats for each task, the memory consumed by
per-task
accounting structures that get allocated etc.

>The code is neat-looking.
>
>
Thanks !

>>--- /dev/null 1970-01-01 00:00:00.000000000 +0000
>>+++ linux-2.6.16/kernel/delayacct.c 2006-03-29 18:12:57.000000000 -0500
>>@@ -0,0 +1,92 @@
>>+/* delayacct.c - per-task delay accounting
>>+ *
>>+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ *
>>+ * This program is distributed in the hope that it would be useful, but
>>+ * WITHOUT ANY WARRANTY; without even the implied warranty of
>>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
>>+ * the GNU General Public License for more details.
>>+ */
>>+
>>+#include <linux/sched.h>
>>+#include <linux/slab.h>
>>+#include <linux/time.h>
>>+#include <linux/sysctl.h>
>>+#include <linux/delayacct.h>
>>+
>>+int delayacct_on __read_mostly = 0; /* Delay accounting turned on/off */
>>
>>
>
>Yes, it should be __read_mostly. But it shouldn't be initialised to zero.
>
>
Will fix.

>
>
>>+void __delayacct_tsk_init(struct task_struct *tsk)
>>+{
>>+ tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL);
>>+ if (tsk->delays) {
>>+ memset(tsk->delays, 0, sizeof(*tsk->delays));
>>+ spin_lock_init(&tsk->delays->lock);
>>+ }
>>+}
>>
>>
>
>We have kmem_cache_zalloc() now.
>
>
Will use.

>
>
>>+void __delayacct_tsk_exit(struct task_struct *tsk)
>>+{
>>+ if (tsk->delays) {
>>+ kmem_cache_free(delayacct_cache, tsk->delays);
>>+ tsk->delays = NULL;
>>+ }
>>+}
>>
>>
>
>delayacct_tsk_exit() already checked tsk->delays.
>
>
Oops. Will fix.

>
>
>>+/*
>>+ * Finish delay accounting for a statistic using
>>+ * its timestamps (@start, @end), accumalator (@total) and @count
>>+ */
>>+
>>+static inline void delayacct_end(struct timespec *start, struct timespec *end,
>>+ u64 *total, u32 *count)
>>+{
>>+ struct timespec ts;
>>+ nsec_t ns;
>>+
>>+ do_posix_clock_monotonic_gettime(end);
>>+ ts.tv_sec = end->tv_sec - start->tv_sec;
>>+ ts.tv_nsec = end->tv_nsec - start->tv_nsec;
>>+ ns = timespec_to_ns(&ts);
>>+ if (ns < 0)
>>+ return;
>>+
>>+ spin_lock(&current->delays->lock);
>>+ *total += ns;
>>+ (*count)++;
>>+ spin_unlock(&current->delays->lock);
>>+}
>>
>>
>
>It's strange to have a static inline function at the end of a .c file. I
>guess this gets used in later patches.
>
>
Yes. It gets called as part of the __delayacct_blkio_end call in a later
patch.

>It looks rather too big to be inlined.
>
>
It gets used only once currently so defining it as a separate function
was more to
aid understanding.

>I'm surprised that we don't already have a timeval_sub() function
>somewhere.
>
>
There isn't one currently though we'd added a generic function timespec_diff
in an earlier iteration of the patches
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1914.html

In the ensuing discussion (under the same thread above), the problem you
mention
below came up - whether a negative ts.tv_nsec should be returned as is
or should the
entire timespec be normalized. The consensus was that a normalized value
would be
appropriate for a generic function.

However, we didn't want to pay the extra cycles of normalizing the
return value since our
usage (using timespec_to_ns) wouldn't need it.

So we chose to remove the generic function and use the two line
subtraction directly.
Especially since we don't really care to use truly negative differences
(which shouldn't happen
since the timestamps are collected assuming monotonically increasing
values).

>The code you have there will cause ts.tv_nsec to go negative half the time.
>It looks like timespec_to_ns() will happily fix that up for us, but it's
>all a bit fragile.
>
>
If you think relying on timespec_to_ns as an "auto" normalizer is flaky,
we can go back to
introducing a timespec_sub (which is a better name than timespec_diff)
which returns a normalized
diff and use that.

--Shailabh

2006-03-30 15:21:28

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 2/8] Block I/O, swapin delays

Andrew Morton wrote:

>Shailabh Nagar <[email protected]> wrote:
>
>
>>delayacct-blkio-swapin.patch
>>
>>Collect per-task block I/O delay statistics.
>>
>>Unlike earlier iterations of the delay accounting
>>patches, now delays are only collected for the actual
>>I/O waits rather than try and cover the delays seen in
>>I/O submission paths.
>>
>>Account separately for block I/O delays
>>incurred as a result of swapin page faults whose
>>frequency can be affected by the task/process' rss limit.
>>Hence swapin delays can act as feedback for rss limit changes
>>independent of I/O priority changes.
>>
>>..
>>
>>+#define PF_SWAPIN 0x02000000 /* I am doing a swap in */
>>
>>
>>
>
>Is there really no sane alternative to doing it this way?
>
>

I'm not sure there is. This goes back to what is really being measured
as "block I/O delay".

Earlier iterations of the code tried to measure total block I/O delay
including I/O submission
and the wait for completion:
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0602.3/0905.html
In that design, the swapin delay was being measured as time spent in the
do_swap_page
function:
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0602.3/0906.html

However, Arjan pointed out the difficulties in trying to account for all
I/O submission paths
so we reverted to measuring just the block I/O wait time within
io_schedule/io_schedule_timeout.

Since these functions will be called as a result of do_swap_page, the
only way to distinguish a
block I/O being done for a swapin vs. others was to mark the task struct
in some way and
task->flags seemed like a convenient way to do it.

We could easily introduce and use a separate flag variable in the delay
accounting structure itself if
using up another bit of task->flags is a concern ?

Or do you have doubts about the methodology of using a flag itself ?
Don't see any other lightweight
way of doing this.

--Shailabh


2006-03-30 16:00:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [Patch 3/8] cpu delays

On Wed, 2006-03-29 at 19:42 -0500, Shailabh Nagar wrote:
>
> -#ifdef CONFIG_SCHEDSTATS
> - memset(&p->sched_info, 0, sizeof(p->sched_info));
> +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> + if (unlikely(sched_info_on()))
> + memset(&p->sched_info, 0, sizeof(p->sched_info));
> #endif

If you unconditionally define sched_info_on(), you can get get rid of
this #ifdef.

+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+
+static inline int sched_info_on(void)
+{
+ #ifdef CONFIG_SCHEDSTATS
+ return 1;
+#elif defined(CONFIG_TASK_DELAY_ACCT)
+ return delayacct_on;
+#else
+ return 0;
+#endif
+}

Might as well hide the junk in a header.

-- Dave

2006-03-30 16:01:33

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 3/8] cpu delays

Andrew Morton wrote:

>Shailabh Nagar <[email protected]> wrote:
>
>
>>delayacct-schedstats.patch
>>
>>Make the task-related schedstats functions
>>callable by delay accounting even if schedstats
>>collection isn't turned on. This removes the
>>dependency of delay accounting on schedstats and allows
>>cpu delays to be exported.
>>
>>..
>>
>>Index: linux-2.6.16/include/linux/sched.h
>>===================================================================
>>--- linux-2.6.16.orig/include/linux/sched.h 2006-03-29 18:13:13.000000000 -0500
>>+++ linux-2.6.16/include/linux/sched.h 2006-03-29 18:13:15.000000000 -0500
>>@@ -525,7 +525,7 @@ typedef struct prio_array prio_array_t;
>> struct backing_dev_info;
>> struct reclaim_state;
>>
>>-#ifdef CONFIG_SCHEDSTATS
>>+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>> struct sched_info {
>> /* cumulative counters */
>> unsigned long cpu_time, /* time spent on the cpu */
>>@@ -537,10 +537,14 @@ struct sched_info {
>> last_queued; /* when we were last queued to run */
>> };
>>
>>+#ifdef CONFIG_SCHEDSTATS
>> extern struct file_operations proc_schedstat_operations;
>>-#endif
>>+#endif /* CONFIG_SCHEDSTATS */
>>+#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
>>
>> #ifdef CONFIG_TASK_DELAY_ACCT
>>+extern int delayacct_on;
>>+
>>
>>
>
>This was already declared in delayacct.h and nothing in sched.h needs it.
>So it's better if .c files pick this declaration up from delayacct.h.
>
>
>
>>+static inline void rq_sched_info_arrive(struct runqueue *rq,
>>+ unsigned long diff)
>>+{
>>+ if (rq) {
>>+ rq->rq_sched_info.run_delay += diff;
>>+ rq->rq_sched_info.pcnt++;
>>+ }
>>+}
>>
>>
>
>The nonatomic updates need locking. I assume it's runqueue.lock, but a
>comment describing the rules would be good.
>
>
>
Yes, it is rq.lock. Will add comment.

>>+static inline void rq_sched_info_depart(struct runqueue *rq,
>>+ unsigned long diff)
>>+{
>>+ if (rq)
>>+ rq->rq_sched_info.cpu_time += diff;
>>+}
>>
>>
>
>Ditto.
>
>
Will add comment.

>
>
>> static inline void sched_info_queued(task_t *t)
>> {
>>- if (!t->sched_info.last_queued)
>>- t->sched_info.last_queued = jiffies;
>>+ if (unlikely(sched_info_on()))
>>+ if (!t->sched_info.last_queued)
>>+ t->sched_info.last_queued = jiffies;
>> }
>>
>>
>
>It might be better to put the unlikely() into sched_info_on() itself.
>
>
The function sched_info_on has only got a return statement so putting
the unlikely within
it doesn't seem possible.

This is a rather aggressive use of the unlikely. In the case where
CONFIG_SCHEDSTATS is
defined, sched_info_on always returns 1 so the unlikely will always
cause a mispredict. I'm assuming
the extra penalty for that is acceptable for those using schedstats
since its never turned on in a production
kernel.
If only CONFIG_TASK_DELAY_ACCT is configured, the value of delayacct_on,
which is very
likely going to be 0, will be returned so the unlikely is potentially
helpful to reduce overhead.

Thoughts ?

As I write this, I realize the function sched_info_on() is also wrongly
written and doesn't account for the
case where both SCHEDSTATS and TASK_DELAY_ACCT are configured. Will fix
that. But thats
orthogonal to the above discussion of whether the unlikely should be used.

--Shailabh


2006-03-30 16:04:05

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 3/8] cpu delays

Dave Hansen wrote:

>On Wed, 2006-03-29 at 19:42 -0500, Shailabh Nagar wrote:
>
>
>>-#ifdef CONFIG_SCHEDSTATS
>>- memset(&p->sched_info, 0, sizeof(p->sched_info));
>>+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>>+ if (unlikely(sched_info_on()))
>>+ memset(&p->sched_info, 0, sizeof(p->sched_info));
>> #endif
>>
>>
>
>If you unconditionally define sched_info_on(), you can get get rid of
>this #ifdef.
>
>+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
>+
>+static inline int sched_info_on(void)
>+{
>+ #ifdef CONFIG_SCHEDSTATS
>+ return 1;
>+#elif defined(CONFIG_TASK_DELAY_ACCT)
>+ return delayacct_on;
>+#else
>+ return 0;
>+#endif
>+}
>
>Might as well hide the junk in a header.
>
>
Thanks, good point.
The sched_info_on() was wrong anyway and your snippet fixes that and
gives the
junk reduction.

Will incorporate.

--Shailabh

>-- Dave
>
>
>

2006-03-30 16:10:48

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 6/8] virtual cpu run time

Andrew Morton wrote:

>Shailabh Nagar <[email protected]> wrote:
>
>
>>delayacct-virtcpu.patch
>>
>>Distinguish between "wall-clock" and "virtual" cpu run times and return
>>both, at per-task and per-tgid granularity.
>>
>>Some architectures adjust tsk->utime+tsk->stime to reflect the time that
>>the kernel wasn't scheduled in hypervised environments and this is the
>>"wall-clock" cpu run time. "Virtual" cpu run time, on the other hand, does
>>not account for the kernel being descheduled.
>>
>>This patch allows the most accurate "virtual" cpu run time, collected by
>>the schedstats code (now shared with delay accounting code), to be returned
>>to user space, in addition to the "wall-clock" cpu time that was being exported
>>earlier. Both these times are useful for workload management in different
>>situations.
>>
>>In a non-virtualized environment, or on architectures which do not adjust
>>tsk->utime/stime, these will effectively be the same value but at different
>>granularities.
>>
>>...
>>
>>Index: linux-2.6.16/include/linux/taskstats.h
>>===================================================================
>>--- linux-2.6.16.orig/include/linux/taskstats.h 2006-03-29 18:13:18.000000000 -0500
>>+++ linux-2.6.16/include/linux/taskstats.h 2006-03-29 18:13:20.000000000 -0500
>>@@ -46,8 +46,14 @@ struct taskstats {
>> __u64 swapin_count;
>> __u64 swapin_delay_total; /* swapin page fault wait*/
>>
>>- __u64 cpu_run_total; /* cpu running time
>>- * no count available/provided */
>>+ __u64 cpu_run_real_total; /* cpu "wall-clock" running time
>>+ * Potentially accounts for cpu
>>+ * virtualization, on some arches
>>+ */
>>+ __u64 cpu_run_virtual_total; /* cpu "virtual" running time
>>+ * Uses time intervals as seen by
>>+ * the kernel
>>+ */
>> };
>>
>>
>>
>
>Again, the reader of this struct wants to know what the atomicity rules are.
>
>
Will add comment.

>
>
>>+ d->cpu_run_real_total = (tmp < (nsec_t)d->cpu_run_real_total)? 0: tmp;
>>
>>
>
> lval = expr1 ? expr2 : expr3;
>
>
didn't get whats wrong ?

>
>
>>+ tmp = (nsec_t)d->cpu_run_virtual_total
>>+ + (nsec_t)jiffies_to_usecs(t3) * 1000;
>>
>>
>
>umm, Linux doesn't have nsec_t any more.
>
>
>
Ok, will switch to s64 everywhere.

--Shailabh


2006-03-30 16:24:42

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 5/8] generic netlink interface for delay accounting

Balbir Singh wrote:

>Hi, Andrew
>
>
>On Wed, Mar 29, 2006 at 09:04:06PM -0800, Andrew Morton wrote:
>
>
>>Shailabh Nagar <[email protected]> wrote:
>>
>>
>>>delayacct-genetlink.patch
>>>
>>>Create a generic netlink interface (NETLINK_GENERIC family),
>>>called "taskstats", for getting delay and cpu statistics of
>>>tasks and thread groups during their lifetime and when they exit.
>>>
>>>
>>>
>>>
>>>
<snip>

>>>+
>>>+static int __init taskstats_init(void)
>>>+{
>>>+ if (genl_register_family(&family))
>>>+ return -EFAULT;
>>>
>>>
>>EFAULT?
>>
>>
>
>It shouldn't be (Shailabh please comment). We will fix it.
>
>
Sorry, it should return the value returned by genl_register_family.

I copied the code from net/tipc/netlink.c
where a similarly erroneous errno is being used. We'll submit a fix
for that as well.

--Shailabh

2006-03-30 17:24:07

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Andrew Morton wrote:

>Balbir Singh <[email protected]> wrote:
>
>
>>On Wed, Mar 29, 2006 at 09:03:14PM -0800, Andrew Morton wrote:
>>
>>
>>>Shailabh Nagar <[email protected]> wrote:
>>>
>>>
>>>>Could you please include the following delay accounting patches
>>>> in -mm ?
>>>>
>>>>
>>>I'm at a loss to evaluate the suitability of this work, really. I always
>>>am when accounting patches come along.
>>>
>>>There are various people and various groups working on various different
>>>things and there appears to be no coordination and little commonality of
>>>aims. I worry that picking one submission basically at random will provide
>>>nothing which the other groups can work on to build up their feature.
>>>
>>>On the other hand, we don't want to do nothing until some uber-grand
>>>all-singing, all-dancing statistics-gathering infrastructure comes along.
>>>
>>>So I'm a bit stuck. What I would like to see happen is that there be some
>>>coordination between the various stakeholders, and some vague plan which
>>>they're all happy with as a basis for the eventual grand solution.
>>>
>>>We already have various bits and pieces of statistics gathering in the
>>>kernel and it's already a bit ad-hoc. Adding more one-requirement-specific
>>>accounting code won't improve that situation.
>>>
>>>But then, I said all this a year or two ago and nothing much has happened
>>>since then. It's not your fault, but it's a problem.
>>>
>>>
Yes, I agree it is a problem. We found it ourselves while developing
this patchset. BSD accounting had
some properties we liked (like availability of stats for a process after
it died) but the way to extend it or
get access to those stats while a process was alive wasn't all that
good. Similarly CSA had needs like ours
but not quite the same.

Our compromise solution, prompted by your comments on getting a
consensus for the use of a "statistics
connector" for all accounting stakeholders, was the taskstats interface,
as described by Balbir below.

But it is not the complete solution or an attempt to get some common
accounting infrastructure, true :-(


>>>Perhaps a good starting point would be a one-page bullet-point-form
>>>wishlist of all the accounting which people want to get out of the kernel,
>>>and a description of what the kernel<->user interface should look like.
>>>Right now, I don't think we even have a picture of that.
>>>
>>>We need a statistics maintainer, too, to pull together the plan,
>>>coordinate, push things forwards. The first step would be to identify the
>>>stakeholders, come up with that page of bullet-points.
>>>
>>>
>>>Then again, maybe the right thing to do is to keep adding low-impact
>>>requirement-specific statistics patches as they come along.
>>>
Personally, this is the approach I favor with unification happening
piecewise, atleast as far as the
collection of statistics is concerned.

The interface for making stats available outside would seem to be more
in need of a unified
approach since we already have a profusion of export methods, some
legacy and some being
introduced by folks like us.


>>>But if we're
>>>going to do it that way, we need an up-front reason for doing so, and I
>>>don't know what that would be.
>>>
>>>See my problem?
>>>
>>>

>>One of the issues we have tried to address is the ability to provide some
>>form of a common ground for all the statistics to co-exist. Various methods
>>were discussed for exchanging data between kernel and user space, genetlink
>>was suggested often and the clear winner.
>>
>>To that end, we have created a taskstats.c file. Any subsystem wanting
>>to add their statistics and sending it to user space can add their own
>>types by extending taskstats.c (changing the version number) and creating
>>their own types using genetlink. They will have to do the following
>>
>>1. Add statistics gathering in their own subsystem
>>2. Add a type to taskstats.c, extend it and use data from (1) and send
>> it to user space.
>>
>>The data from various subsystems can co-exist. I feel that this could serve as
>>the basic common infrastructure to begin with and refined later (depending on
>>the needs of other people).
>>
>>
>>
>
>Sounds fine to me, but I'm not a stakeholder.
>
>Trolling back through lse-tech gives us:
>
>pnotify:
> Erik Jacobson <[email protected]>
>
>CSA accounting/PAGG/JOB:
> Jay Lan <[email protected]>
> Limin Gu <[email protected]>
>
>per-process IO statistics:
> Levent Serinol <[email protected]>
>
>ELSA:
> Guillaume Thouvenin <[email protected]>
>
>per-cpu time statistics:
> Erich Focht <[email protected]>
>
>Scalable statistics counters with /proc reporting:
> Ravikiran G Thirumalai <[email protected]>
> (Kiran feft IBM, but presumably the requirement lives on)
>
>
To this list we can also add

Microstate accounting
Peter Chubb <[email protected]>
I don't know if Peter is still interested in pursuing this or it
was rejected.

>There was a long thread "A common layer for Accounting packages". Did it
>come to a conclusion?
>
>
Unfortunately, not.

>Anyway, if mostly everyone is mostly happy with what you propose then that
>it good news.
>
>
It would seem like a good first step then, for me to contact the folks
above and see if they are able to
use the interface we're proposing and modify it if needed.

--Shailabh

2006-03-31 02:56:34

by Peter Chubb

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

>>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:

>>
Shailabh> To this list we can also add

Shailabh> Microstate accounting Peter Chubb
Shailabh> <[email protected]> I don't know if Peter is still
Shailabh> interested in pursuing this or it was rejected.

It's still maintained in a sporadic sort of way --- I update it when
either I need it for something, or someone's downloaded it and asks
why it doesn't work agains kernel X.Y.Z. I see a few downloads a
month.

My microstate accounting patch overlaps the delay accounting patch quite a
lot in functionality, (but I thnk mine is cleaner except for interrupt
time accounting... which the delay accounting patch doesn't do. I
wanted to know how much time a thread *really* had on the processor,
subtracting off the time spent in interrupt handlers for some other
process).

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2006-03-31 05:27:45

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Peter Chubb wrote:

>>>>>>"Shailabh" == Shailabh Nagar <[email protected]> writes:
>>>>>>
>>>>>>
>
>
>
>Shailabh> To this list we can also add
>
>Shailabh> Microstate accounting Peter Chubb
>Shailabh> <[email protected]> I don't know if Peter is still
>Shailabh> interested in pursuing this or it was rejected.
>
>It's still maintained in a sporadic sort of way --- I update it when
>either I need it for something, or someone's downloaded it and asks
>why it doesn't work agains kernel X.Y.Z. I see a few downloads a
>month.
>
>
So do you intend to pursue acceptance ? If so, do you think the
netlink-based taskstats
interface provided by the delay accounting patches could be an
acceptable substitute for the
interfaces you had (from an old lkml post, they appear to be
/proc/tgid/msa and a syscall
based one) ?


>My microstate accounting patch overlaps the delay accounting patch quite a
>lot in functionality, (but I thnk mine is cleaner except for interrupt
>time accounting... which the delay accounting patch doesn't do. I
>wanted to know how much time a thread *really* had on the processor,
>subtracting off the time spent in interrupt handlers for some other
>process).
>
>
Thanks. Will incorporate into a note on the mechanisms of the other
accounting patches.

--Shailabh


2006-03-31 07:32:05

by Guillaume Thouvenin

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

On Fri, 31 Mar 2006 01:42:28 -0500
Shailabh Nagar <[email protected]> wrote:

> Following Andrew's suggestion, here's my quick overview
> of the various other accounting packages that have been
> proposed on lse-tech with a focus on whether they can
> utilize the netlink-based taskstats interface being proposed
> by the delay accounting patches.
>
> Please note that unification of statistics *collection* is not
> being discussed since that kind of merger can be done as these
> patches get accepted, if at all, into the kernel. To try and
> unify right away would hold every patch (esp. delay accounting !)
> hostage to the problems in every other patch unnecessarily. As
> long as the interface can be unified, the merger of the
> collection bits can always happen without affecting user space.
>
> Stakeholders of each of these patches, on cc, are requested to
> please correct any misunderstandings of what their patches do.
>
> Also, please comment on the observations about their patch's
> ability to use the netlink-based taskstats interface, code for which
> was posted at
>
> http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/1787.html
>
[...]
>
> 5. Enhanced Linux System Accounting (Guillaume Thouvenine)
^^^^^^^^^^
Thouvenin
> ----------------------------------------------------------
>
> - Group tasks at a user level into "jobs" and aggregate,
> at user level, per-task statistics collected by CSA and/or BSD
> process accounting.
>
> - ELSA does not introduce any new requirement for either
> collection or export of statistics from the kernel. It can use
> either BSD and/or CSA's method of using an accounting file.
>
> - ELSA needs notification of forks and exits which it can already
> get through the process events connector in the kernel.
>
> Hence ELSA's needs are either met by the kernel today or are a
> strict subset of CSA (since BSD accounting is already there).

The overview is very interesting and you have a very good comprehension
of ELSA. As you said ELSA is a group tasks at a user level and
everything is already in the kernel so your patches don't generate
troubles to ELSA. As you said in the delay accounting documentation,
delay statistics can also be collected for all tasks and a tool like
ELSA can aggregate results for groups of processes.


Chears,
Guillaume



2006-03-31 08:19:35

by Peter Chubb

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

>>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:

Shailabh> Peter Chubb wrote:
(microstate accounting patch)
>> It's still maintained in a sporadic sort of way --- I update it
>> when either I need it for something, or someone's downloaded it and
>> asks why it doesn't work agains kernel X.Y.Z. I see a few
>> downloads a month.
>>
>>
Shailabh> So do you intend to pursue acceptance ? If so, do you think
Shailabh> the netlink-based taskstats interface provided by the delay
Shailabh> accounting patches could be an acceptable substitute for the
Shailabh> interfaces you had (from an old lkml post, they appear to be
Shailabh> /proc/tgid/msa and a syscall based one) ?

I'd have to take a close look. The syscall interface is modelled on
getrusage(), and only lets you get your own or your children's data;
I'm not too worried about trashing it, as it should be possible to
emulate in terms of netlink (albeit at a cost; system calls are
relatively cheap)

/proc/<pid>/task/<tid>/msa lets you get at anything you own. I use
awk scripts to process the msa file in /proc/... and pipe it into
gnuplot at n second intervals; a netlink interface would need to have
an auxiliary program to read it and then squirt it into the scripts, I
think --- or is there a way to get ASCII out on demand? I quite often
use cat to do quick checks on whats going on too --- so overall I think
the /proc interface is desirable.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2006-03-31 16:03:57

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Peter Chubb wrote:

>>>>>>"Shailabh" == Shailabh Nagar <[email protected]> writes:
>>>>>>
>>>>>>
>
>Shailabh> Peter Chubb wrote:
> (microstate accounting patch)
>
>
>>> It's still maintained in a sporadic sort of way --- I update it
>>>when either I need it for something, or someone's downloaded it and
>>>asks why it doesn't work agains kernel X.Y.Z. I see a few
>>>downloads a month.
>>>
>>>
>>>
>>>
>Shailabh> So do you intend to pursue acceptance ? If so, do you think
>Shailabh> the netlink-based taskstats interface provided by the delay
>Shailabh> accounting patches could be an acceptable substitute for the
>Shailabh> interfaces you had (from an old lkml post, they appear to be
>Shailabh> /proc/tgid/msa and a syscall based one) ?
>
>I'd have to take a close look.
>
Please do ! As I mentioned in the other note where I summarize the
various accounting packages
I think it should be fairly easy for microstate accounting to extend the
structure returned by the
taskstats interface.

> The syscall interface is modelled on
>getrusage(), and only lets you get your own or your children's data;
>I'm not too worried about trashing it, as it should be possible to
>emulate in terms of netlink (albeit at a cost; system calls are
>relatively cheap)
>
>/proc/<pid>/task/<tid>/msa lets you get at anything you own. I use
>awk scripts to process the msa file in /proc/... and pipe it into
>gnuplot at n second intervals; a netlink interface would need to have
>an auxiliary program to read it and then squirt it into the scripts, I
>think --- or is there a way to get ASCII out on demand?
>
No. The use of netlink pretty much means you have to use an auxiliary
program. We provide
one already (as part of the documentation to the patches).

What netlink buys you is the ability to
- get data for a task after it has exited (ie netlink serves as a buffer)
- get data for large number of tasks more efficiently than /proc


>I quite often
>use cat to do quick checks on whats going on too --- so overall I think
>the /proc interface is desirable.
>
>
Yes, /proc is more convenient both for cat'ting and also since its used
by tools like top.
Delay accounting patches also provide the "block I/O wait (including
swapin)" statistic through
/proc/tgid/stat for convenience and so that top etc. can use it while
displaying per-task stats.


However, the question here is this:

*if* a single, unified interface for per-task statistics was deemed to
be desirable (as Andrew is
effectively suggesting we explore), what would that interface be ?
/proc-based, netlink based or syscall-based ?

I would submit it is netlink-based since it is a superset of /proc and
syscalls.
Neither of the latter two can return data after a task has exited
(atleast not easily...you can always invent
infrastructure to buffer per-task stats but it would be cumbersome)
Whereas the former can, with the help of an auxiliary program, provide
the same data that /proc and syscalls can.
The price paid by /proc and syscall users for unification is
convenience, not loss of functionality.

Would you agree ?

--Shailabh







2006-03-31 17:02:21

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Patch 0/8] per-task delay accounting

Guillaume Thouvenin wrote:

>On Fri, 31 Mar 2006 01:42:28 -0500
>Shailabh Nagar <[email protected]> wrote:
>
>
>
>>Following Andrew's suggestion, here's my quick overview
>>of the various other accounting packages that have been
>>proposed on lse-tech with a focus on whether they can
>>utilize the netlink-based taskstats interface being proposed
>>by the delay accounting patches.
>>
>>Please note that unification of statistics *collection* is not
>>being discussed since that kind of merger can be done as these
>>patches get accepted, if at all, into the kernel. To try and
>>unify right away would hold every patch (esp. delay accounting !)
>>hostage to the problems in every other patch unnecessarily. As
>>long as the interface can be unified, the merger of the
>>collection bits can always happen without affecting user space.
>>
>>Stakeholders of each of these patches, on cc, are requested to
>>please correct any misunderstandings of what their patches do.
>>
>>Also, please comment on the observations about their patch's
>>ability to use the netlink-based taskstats interface, code for which
>>was posted at
>>
>>http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/1787.html
>>
>>
>>
>[...]
>
>
>>5. Enhanced Linux System Accounting (Guillaume Thouvenine)
>>
>>
> ^^^^^^^^^^
> Thouvenin
>
>
>>----------------------------------------------------------
>>
>>- Group tasks at a user level into "jobs" and aggregate,
>>at user level, per-task statistics collected by CSA and/or BSD
>>process accounting.
>>
>>- ELSA does not introduce any new requirement for either
>>collection or export of statistics from the kernel. It can use
>>either BSD and/or CSA's method of using an accounting file.
>>
>>- ELSA needs notification of forks and exits which it can already
>>get through the process events connector in the kernel.
>>
>>Hence ELSA's needs are either met by the kernel today or are a
>>strict subset of CSA (since BSD accounting is already there).
>>
>>
>
>The overview is very interesting and you have a very good comprehension
>of ELSA. As you said ELSA is a group tasks at a user level and
>everything is already in the kernel so your patches don't generate
>troubles to ELSA. As you said in the delay accounting documentation,
>delay statistics can also be collected for all tasks and a tool like
>ELSA can aggregate results for groups of processes.
>
>
>Chears,
>Guillaume
>
>
Thanks Guillaume. Thats one "sign-off" on the taskstats interface then :-)

--Shailabh