2006-01-03 23:17:04

by Shailabh Nagar

Subject: [Patch 0/6] Per-task delay accounting

Andrew,

Could you please consider these patches for inclusion in -mm?
The comments from earlier postings of these patches have been addressed,
including the one you made about making the connector interface generic
(more about that in the connector patch).

Thanks,
Shailabh


The following patches add accounting for the delays seen by tasks in
a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses).

Such delays provide feedback for a task's cpu priority, io priority and
rss limit values. Long delays, especially relative to those of other tasks,
can be a trigger for changing a task's cpu/io priorities and modifying its
rss usage (either directly, through the sys_getprlimit() interface proposed
earlier on lkml, or indirectly, by throttling cpu consumption or having the
process call sys_setrlimit()).

The major change since the previous posting of these patches
(http://www.ussg.iu.edu/hypermail/linux/kernel/0512.0/2152.html)
is the resurrection of the connector interface (in addition to /proc)
and, as part of the same patch, the ability to get stats per-tgid in
addition to per-pid.

More comments in individual patches.

Series

nstimestamp-diff.patch
delayacct-init.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-procfs.patch
delayacct-connector.patch




2006-01-03 23:24:18

by Shailabh Nagar

Subject: [Patch 1/6] Delay accounting: timespec diff

nstimestamp-diff.patch

Add a kernel utility function for measuring the
interval (diff) between two timespec values, adjusting for overflow.

Signed-off-by: Shailabh Nagar <[email protected]>

include/linux/time.h | 15 +++++++++++++++
1 files changed, 15 insertions(+)

Index: linux-2.6.15-rc7/include/linux/time.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/time.h
+++ linux-2.6.15-rc7/include/linux/time.h
@@ -114,6 +114,21 @@ set_normalized_timespec (struct timespec
ts->tv_nsec = nsec;
}

+/*
+ * timespec_diff_ns - Return difference of two timestamps in nanoseconds
+ * In the rare case of @end being earlier than @start, return zero
+ */
+static inline unsigned long long
+timespec_diff_ns(struct timespec *start, struct timespec *end)
+{
+ long long ret;
+
+ ret = end->tv_sec * 1000000000LL + end->tv_nsec;
+ ret -= start->tv_sec * 1000000000LL + start->tv_nsec;
+ if (ret < 0)
+ return 0;
+ return ret;
+}
#endif /* __KERNEL__ */

#define NFDBITS __NFDBITS

2006-01-03 23:27:07

by Shailabh Nagar

Subject: [Patch 2/6] Delay accounting: Initialization, kernel boot option

Changes since 11/14/05

- use nanosecond resolution, adjusted wall clock time for timestamps
instead of sched_clock (akpm, andi, marcelo)
- kernel boot param to control delay stats collection (parag)
- better CONFIG parameter name (parag)

11/14/05: First post

delayacct-init.patch

Initialization code related to collection of per-task "delay"
statistics which measure how long a task had to wait for cpu,
sync block io, swapping etc. The collection of statistics and
the interfaces are in other patches. This patch sets up the data
structures and allows the statistics collection to be disabled
through a kernel boot parameter.


Signed-off-by: Shailabh Nagar <[email protected]>

Documentation/kernel-parameters.txt | 2 ++
include/linux/delayacct.h | 26 ++++++++++++++++++++++++++
include/linux/sched.h | 11 +++++++++++
init/Kconfig | 13 +++++++++++++
kernel/Makefile | 1 +
kernel/delayacct.c | 36 ++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 ++
7 files changed, 91 insertions(+)

Index: linux-2.6.15-rc7/init/Kconfig
===================================================================
--- linux-2.6.15-rc7.orig/init/Kconfig
+++ linux-2.6.15-rc7/init/Kconfig
@@ -162,6 +162,19 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.

+config TASK_DELAY_ACCT
+ bool "Enable per-task delay accounting (EXPERIMENTAL)"
+ help
+ Collect information on time spent by a task waiting for system
+ resources like cpu, synchronous block I/O completion and swapping
+ in pages. Such statistics can help in setting a task's priorities
+ relative to other tasks for cpu, io, rss limits etc.
+
+ Unlike BSD process accounting, this information is available
+ continuously during the lifetime of a task.
+
+ Say N if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux-2.6.15-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/sched.h
+++ linux-2.6.15-rc7/include/linux/sched.h
@@ -541,6 +541,14 @@ struct sched_info {
extern struct file_operations proc_schedstat_operations;
#endif

+#ifdef CONFIG_TASK_DELAY_ACCT
+struct task_delay_info {
+ spinlock_t lock;
+
+ /* Add stats in pairs: uint64_t delay, uint32_t count */
+};
+#endif
+
enum idle_type
{
SCHED_IDLE,
@@ -857,6 +865,9 @@ struct task_struct {
int cpuset_mems_generation;
#endif
atomic_t fs_excl; /* holding fs exclusive resources */
+#ifdef CONFIG_TASK_DELAY_ACCT
+ struct task_delay_info delays;
+#endif
};

static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.15-rc7/kernel/fork.c
===================================================================
--- linux-2.6.15-rc7.orig/kernel/fork.c
+++ linux-2.6.15-rc7/kernel/fork.c
@@ -43,6 +43,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/delayacct.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -923,6 +924,7 @@ static task_t *copy_process(unsigned lon
if (p->binfmt && !try_module_get(p->binfmt->module))
goto bad_fork_cleanup_put_domain;

+ delayacct_tsk_init(p);
p->did_exec = 0;
copy_flags(clone_flags, p);
p->pid = pid;
Index: linux-2.6.15-rc7/include/linux/delayacct.h
===================================================================
--- /dev/null
+++ linux-2.6.15-rc7/include/linux/delayacct.h
@@ -0,0 +1,26 @@
+/* delayacct.h - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKDELAYS_H
+#define _LINUX_TASKDELAYS_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASK_DELAY_ACCT
+extern int delayacct_on; /* Delay accounting turned on/off */
+extern void delayacct_tsk_init(struct task_struct *tsk);
+#else
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{}
+#endif /* CONFIG_TASK_DELAY_ACCT */
+#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.15-rc7/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.15-rc7.orig/Documentation/kernel-parameters.txt
+++ linux-2.6.15-rc7/Documentation/kernel-parameters.txt
@@ -921,6 +921,8 @@ running once the system is up.

nocache [ARM]

+ nodelayacct [KNL] Disable per-task delay accounting
+
nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects.

noexec [IA-64]
Index: linux-2.6.15-rc7/kernel/Makefile
===================================================================
--- linux-2.6.15-rc7.orig/kernel/Makefile
+++ linux-2.6.15-rc7/kernel/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o

ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.15-rc7/kernel/delayacct.c
===================================================================
--- /dev/null
+++ linux-2.6.15-rc7/kernel/delayacct.c
@@ -0,0 +1,36 @@
+/* delayacct.c - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/sched.h>
+
+int delayacct_on = 1; /* Delay accounting turned on/off */
+
+static int __init delayacct_setup_disable(char *str)
+{
+ delayacct_on = 0;
+ return 1;
+}
+__setup("nodelayacct", delayacct_setup_disable);
+
+inline void delayacct_tsk_init(struct task_struct *tsk)
+{
+ memset(&tsk->delays, 0, sizeof(tsk->delays));
+ spin_lock_init(&tsk->delays.lock);
+}
+
+static int __init delayacct_init(void)
+{
+ delayacct_tsk_init(&init_task);
+ return 0;
+}
+core_initcall(delayacct_init);

2006-01-03 23:31:11

by Shailabh Nagar

Subject: [Patch 4/6] Delay accounting: Swap in delays

Changes since 12/7/05

- removed __attribute((unused)) qualifiers from timespec vars (dave hansen)
- use existing timestamping function do_posix_clock_monotonic_gettime() (jay lan)

Changes since 11/14/05

- use nanosecond resolution, adjusted wall clock time for timestamps
instead of sched_clock (akpm, andi, marcelo)
- collect stats only if delay accounting enabled (parag)
- collect delays for only swapin page faults instead of all page faults.

11/14/05: First post


delayacct-swapin.patch

Record time spent by a task waiting for its pages to be swapped in.
This statistic can help in adjusting the rss limits of
tasks (processes), especially relative to each other, when the system is
under memory pressure.

Signed-off-by: Shailabh Nagar <[email protected]>

include/linux/delayacct.h | 3 +++
include/linux/sched.h | 2 ++
kernel/delayacct.c | 15 +++++++++++++++
mm/memory.c | 16 +++++++++-------
4 files changed, 29 insertions(+), 7 deletions(-)

Index: linux-2.6.15-rc7/include/linux/delayacct.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/delayacct.h
+++ linux-2.6.15-rc7/include/linux/delayacct.h
@@ -20,11 +20,14 @@
extern int delayacct_on; /* Delay accounting turned on/off */
extern void delayacct_tsk_init(struct task_struct *tsk);
extern void delayacct_blkio(struct timespec *start, struct timespec *end);
+extern void delayacct_swapin(struct timespec *start, struct timespec *end);
#else
static inline void delayacct_tsk_init(struct task_struct *tsk)
{}
static inline void delayacct_blkio(struct timespec *start, struct timespec *end)
{}
+static inline void delayacct_swapin(struct timespec *start, struct timespec *end)
+{}

#endif /* CONFIG_TASK_DELAY_ACCT */
#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.15-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/sched.h
+++ linux-2.6.15-rc7/include/linux/sched.h
@@ -548,6 +548,8 @@ struct task_delay_info {
/* Add stats in pairs: uint64_t delay, uint32_t count */
uint64_t blkio_delay; /* wait for sync block io completion */
uint32_t blkio_count;
+ uint64_t swapin_delay; /* wait for pages to be swapped in */
+ uint32_t swapin_count;
};
#endif

Index: linux-2.6.15-rc7/mm/memory.c
===================================================================
--- linux-2.6.15-rc7.orig/mm/memory.c
+++ linux-2.6.15-rc7/mm/memory.c
@@ -2171,16 +2171,15 @@ static inline int handle_pte_fault(struc

old_entry = entry = *pte;
if (!pte_present(entry)) {
- if (pte_none(entry)) {
- int ret;
- struct timespec start, end;
+ int ret;
+ struct timespec start, end;

+ do_posix_clock_monotonic_gettime(&start);
+ if (pte_none(entry)) {
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, address,
pte, pmd, write_access);

- if (vma->vm_file)
- do_posix_clock_monotonic_gettime(&start);
ret = do_no_page(mm, vma, address,
pte, pmd, write_access);
if (vma->vm_file) {
@@ -2192,8 +2191,11 @@ static inline int handle_pte_fault(struc
if (pte_file(entry))
return do_file_page(mm, vma, address,
pte, pmd, write_access, entry);
- return do_swap_page(mm, vma, address,
- pte, pmd, write_access, entry);
+ ret = do_swap_page(mm, vma, address,
+ pte, pmd, write_access, entry);
+ do_posix_clock_monotonic_gettime(&end);
+ delayacct_swapin(&start, &end);
+ return ret;
}

ptl = pte_lockptr(mm, pmd);
Index: linux-2.6.15-rc7/kernel/delayacct.c
===================================================================
--- linux-2.6.15-rc7.orig/kernel/delayacct.c
+++ linux-2.6.15-rc7/kernel/delayacct.c
@@ -50,3 +50,18 @@ inline void delayacct_blkio(struct times
current->delays.blkio_count++;
spin_unlock(&current->delays.lock);
}
+
+inline void delayacct_swapin(struct timespec *start, struct timespec *end)
+{
+ unsigned long long delay;
+
+ if (!delayacct_on)
+ return;
+
+ delay = timespec_diff_ns(start, end);
+
+ spin_lock(&current->delays.lock);
+ current->delays.swapin_delay += delay;
+ current->delays.swapin_count++;
+ spin_unlock(&current->delays.lock);
+}

2006-01-03 23:32:43

by Shailabh Nagar

Subject: [Patch 5/6] Delay accounting: /proc interface

Changes since 12/7/05
- Return per-tgid stats as sum of constituent tid data
- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)

12/07/05: First post

delayacct-procfs.patch

Creates /proc/<pid>/delay interface for getting delay and cpu statistics
for tasks. The cpu stats are available (non-zero) only if CONFIG_SCHEDSTATS
is enabled.

Signed-off-by: Shailabh Nagar <[email protected]>

fs/proc/array.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/proc/base.c | 21 +++++++++++++++-
fs/proc/internal.h | 2 +
3 files changed, 90 insertions(+), 1 deletion(-)

Index: linux-2.6.15-rc7/fs/proc/array.c
===================================================================
--- linux-2.6.15-rc7.orig/fs/proc/array.c
+++ linux-2.6.15-rc7/fs/proc/array.c
@@ -488,3 +488,71 @@ int proc_pid_statm(struct task_struct *t
return sprintf(buffer,"%d %d %d %d %d %d %d\n",
size, resident, shared, text, lib, data, 0);
}
+
+#ifdef CONFIG_TASK_DELAY_ACCT
+int proc_pid_delay(struct task_struct *task, char * buffer)
+{
+ unsigned long long run_delay, run_time;
+ unsigned long run_count;
+
+#ifdef CONFIG_SCHEDSTATS
+ run_count = task->sched_info.pcnt;
+ run_time = (unsigned long long)jiffies_to_usecs(task->sched_info.cpu_time) * 1000;
+ run_delay = (unsigned long long)jiffies_to_usecs(task->sched_info.run_delay) * 1000;
+#else
+ /* Non-zero values, zero count indicates data not collected */
+ run_count = 0;
+ run_time = run_delay = 1;
+#endif
+ return sprintf(buffer,"%lu %llu %llu %u %llu %u %llu\n",
+ run_count, (uint64_t) run_time, (uint64_t) run_delay,
+ (unsigned int) task->delays.blkio_count,
+ (uint64_t) task->delays.blkio_delay,
+ (unsigned int) task->delays.swapin_count,
+ (uint64_t) task->delays.swapin_delay);
+}
+
+/*
+ * Sum up delay stats for tid's of the tgid
+ * For any delay stat, a zero delay and non-zero count indicates overflow
+ *
+ */
+int proc_tgid_delay(struct task_struct *task, char * buffer)
+{
+ uint64_t run_time = 0, run_delay = 0, run_count = 0;
+ uint64_t blkio_delay = 0, blkio_count = 0;
+ uint64_t swapin_delay = 0, swapin_count = 0, tmp;
+ struct task_struct *t = task;
+
+ read_lock(&tasklist_lock);
+ do {
+#ifdef CONFIG_SCHEDSTATS
+ run_count += t->sched_info.pcnt;
+ tmp = run_time + (unsigned long long)jiffies_to_usecs(t->sched_info.cpu_time) * 1000;
+ run_time = (tmp < run_time) ? 0 : tmp;
+ tmp = run_delay + (unsigned long long)jiffies_to_usecs(t->sched_info.run_delay) * 1000;
+ run_delay = (tmp < run_delay) ? 0 : tmp;
+#else
+ /* Non-zero values, zero count indicates data not collected */
+ run_count = 0;
+ run_time = run_delay = 1;
+#endif
+ spin_lock(&t->delays.lock);
+ tmp = blkio_delay + t->delays.blkio_delay;
+ blkio_delay = (tmp < blkio_delay)? 0:tmp ;
+ tmp = swapin_delay + t->delays.swapin_delay;
+ swapin_delay = (tmp < swapin_delay)? 0:tmp ;
+ blkio_count += t->delays.blkio_count;
+ swapin_count += t->delays.swapin_count;
+ spin_unlock(&t->delays.lock);
+
+ } while_each_thread(task, t);
+ read_unlock(&tasklist_lock);
+
+ return sprintf(buffer,"%llu %llu %llu %llu %llu %llu %llu\n",
+ run_count, run_time, run_delay,
+ blkio_count, blkio_delay,
+ swapin_count, swapin_delay);
+}
+
+#endif
Index: linux-2.6.15-rc7/fs/proc/base.c
===================================================================
--- linux-2.6.15-rc7.orig/fs/proc/base.c
+++ linux-2.6.15-rc7/fs/proc/base.c
@@ -165,7 +165,10 @@ enum pid_directory_inos {
#endif
PROC_TID_OOM_SCORE,
PROC_TID_OOM_ADJUST,
-
+#ifdef CONFIG_TASK_DELAY_ACCT
+ PROC_TID_DELAY_ACCT,
+ PROC_TGID_DELAY_ACCT,
+#endif
/* Add new entries before this */
PROC_TID_FD_DIR = 0x8000, /* 0x8000-0xffff */
};
@@ -220,6 +223,9 @@ static struct pid_entry tgid_base_stuff[
#ifdef CONFIG_AUDITSYSCALL
E(PROC_TGID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+ E(PROC_TGID_DELAY_ACCT,"delay", S_IFREG|S_IRUGO),
+#endif
{0,0,NULL,0}
};
static struct pid_entry tid_base_stuff[] = {
@@ -262,6 +268,9 @@ static struct pid_entry tid_base_stuff[]
#ifdef CONFIG_AUDITSYSCALL
E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+ E(PROC_TID_DELAY_ACCT,"delay", S_IFREG|S_IRUGO),
+#endif
{0,0,NULL,0}
};

@@ -1786,6 +1795,16 @@ static struct dentry *proc_pident_lookup
inode->i_fop = &proc_loginuid_operations;
break;
#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+ case PROC_TID_DELAY_ACCT:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pid_delay;
+ break;
+ case PROC_TGID_DELAY_ACCT:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_tgid_delay;
+ break;
+#endif
default:
printk("procfs: impossible type (%d)",p->type);
iput(inode);
Index: linux-2.6.15-rc7/fs/proc/internal.h
===================================================================
--- linux-2.6.15-rc7.orig/fs/proc/internal.h
+++ linux-2.6.15-rc7/fs/proc/internal.h
@@ -36,6 +36,8 @@ extern int proc_tid_stat(struct task_str
extern int proc_tgid_stat(struct task_struct *, char *);
extern int proc_pid_status(struct task_struct *, char *);
extern int proc_pid_statm(struct task_struct *, char *);
+extern int proc_pid_delay(struct task_struct *, char*);
+extern int proc_tgid_delay(struct task_struct *, char*);

static inline struct task_struct *proc_task(struct inode *inode)
{

2006-01-03 23:34:46

by Shailabh Nagar

Subject: [Patch 6/6] Delay accounting: Connector interface

Changes since 11/14/05:

- explicit versioning of statistics data returned
- new command type for returning per-tgid stats
- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)

First post 11/14/05

delayacct-connector.patch

Creates a connector interface for getting delay and cpu statistics of tasks
during their lifetime and when they exit. The cpu stats are available only if
CONFIG_SCHEDSTATS is enabled.

When a task is alive, userspace can get its stats by sending a command
containing its pid. Sending a tgid returns the sum of the stats of the
tasks belonging to that tgid (where such a sum makes sense). Together,
these commands allow stats for a large number of tasks to be collected more
efficiently than would be possible through /proc or any other per-pid
interface.

The connector interface also sends the stats for each task to userspace when
the task is exiting. This permits fine-grain accounting for short-lived tasks,
which is important if userspace is doing its own aggregation of statistics
based on some grouping of tasks (e.g. CSA jobs, ELSA banks or CKRM classes).

While this patch is included as part of the delay stats collection patches,
it is intended for general use by any kernel component that needs per-pid/tgid
stats sent at task exit. This limits the growth in the number of connectors
called on task exit, at least for stats collection purposes.

Signed-off-by: Shailabh Nagar <[email protected]>

drivers/connector/Kconfig | 9 +
drivers/connector/Makefile | 1
drivers/connector/cn_stats.c | 243 +++++++++++++++++++++++++++++++++++++++++++
include/linux/cn_stats.h | 115 ++++++++++++++++++++
include/linux/connector.h | 4
kernel/exit.c | 2
6 files changed, 374 insertions(+)

Index: linux-2.6.15-rc7/drivers/connector/Kconfig
===================================================================
--- linux-2.6.15-rc7.orig/drivers/connector/Kconfig
+++ linux-2.6.15-rc7/drivers/connector/Kconfig
@@ -18,4 +18,13 @@ config PROC_EVENTS
Provide a connector that reports process events to userspace. Send
events such as fork, exec, id change (uid, gid, suid, etc), and exit.

+config STATS_CONNECTOR
+ bool "Report per-task statistics to userspace"
+ depends on CONNECTOR=y && TASK_DELAY_ACCT
+ ---help---
+ Provide a connector interface that reports per-task statistics to
+ userspace. While a task is running, userspace can get the stats by
+ sending a command to the connector. At task exit, the final value of
+ the stats is sent automatically.
+
endmenu
Index: linux-2.6.15-rc7/drivers/connector/Makefile
===================================================================
--- linux-2.6.15-rc7.orig/drivers/connector/Makefile
+++ linux-2.6.15-rc7/drivers/connector/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_CONNECTOR) += cn.o
+obj-$(CONFIG_STATS_CONNECTOR) += cn_stats.o
obj-$(CONFIG_PROC_EVENTS) += cn_proc.o

cn-y += cn_queue.o connector.o
Index: linux-2.6.15-rc7/include/linux/cn_stats.h
===================================================================
--- /dev/null
+++ linux-2.6.15-rc7/include/linux/cn_stats.h
@@ -0,0 +1,115 @@
+/*
+ * cn_stats.h - Task statistics connector
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ * Based on work by Matt Helsley, Nguyen Anh Quynh and Guillaume Thouvenin
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef CN_STATS_H
+#define CN_STATS_H
+
+#include <linux/types.h>
+#include <linux/connector.h>
+
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for all tasks
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the version count macro
+ * and delineate the start of newly added fields with a comment indicating the
+ * version number.
+ */
+#define CNSTATS_DATA_VERSION 1
+
+struct cnstats_data {
+
+ /* Version 1 */
+ pid_t pid;
+ pid_t tgid;
+
+ /* *_delay_total is cumulative delay (in nanosecs) of a
+ * task waiting for cpu to be available, block io
+ * completion, page fault to be serviced etc.
+ * *_count is number of delay intervals recorded.
+ * cpu_run_total is cumulative cpu run time, measured by timestamp
+ * intervals and hence more accurate than sampling-based cpu times.
+ * All *_total values are in nanoseconds though cpu data may not be
+ * collected at that granularity.
+ */
+
+ __u64 cpu_count;
+#define CNSTATS_NOCPUSTATS 1
+ __u64 cpu_run_total;
+ __u64 cpu_delay_total;
+
+ __u64 blkio_count;
+ __u64 blkio_delay_total;
+ __u64 swapin_count;
+ __u64 swapin_delay_total;
+
+};
+
+
+/*
+ * Commands sent from userspace
+ */
+
+struct cnstats_cmd {
+ enum intype {
+ CNSTATS_CMD_LISTEN = 1, /* Start listening on connector */
+ CNSTATS_CMD_IGNORE, /* Stop listening */
+ CNSTATS_CMD_STATS_PID, /* Stats for a pid */
+ CNSTATS_CMD_STATS_TGID, /* Stats for a tgid */
+ } intype;
+
+ union {
+ pid_t pid;
+ pid_t tgid;
+ } param;
+};
+
+/*
+ * Response or data sent from kernel
+ * Versioned for backward compatibility
+ */
+
+struct cnstats {
+ __u32 version;
+ enum outtype {
+ CNSTATS_DATA_NONE = 1, /* Control cmd response */
+ CNSTATS_DATA_EXIT, /* Exiting task's stats */
+ CNSTATS_DATA_PID, /* per-pid data cmd response*/
+ CNSTATS_DATA_TGID, /* per-tgid data cmd response*/
+ } outtype;
+ union {
+ struct cnstats_ack {
+ __u32 err;
+ } ack;
+
+ struct cnstats_data stats;
+ } data;
+};
+
+#ifdef __KERNEL__
+#ifdef CONFIG_STATS_CONNECTOR
+void cnstats_exit_connector(struct task_struct *tsk);
+#else
+static inline void cnstats_exit_connector(struct task_struct *tsk)
+{}
+#endif /* CONFIG_STATS_CONNECTOR */
+#endif /* __KERNEL__ */
+#endif /* CN_STATS_H */
Index: linux-2.6.15-rc7/include/linux/connector.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/connector.h
+++ linux-2.6.15-rc7/include/linux/connector.h
@@ -35,6 +35,10 @@
#define CN_IDX_CIFS 0x2
#define CN_VAL_CIFS 0x1

+/* Statistics connector ids */
+#define CN_IDX_STATS 0x2
+#define CN_VAL_STATS 0x2
+
#define CN_NETLINK_USERS 1

/*
Index: linux-2.6.15-rc7/drivers/connector/cn_stats.c
===================================================================
--- /dev/null
+++ linux-2.6.15-rc7/drivers/connector/cn_stats.c
@@ -0,0 +1,243 @@
+/*
+ * cn_stats.c - Task statistics connector
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ * Based on work by Matt Helsley, Nguyen Anh Quynh and Guillaume Thouvenin
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/cn_stats.h>
+#include <asm/atomic.h>
+
+#define CN_STATS_NOCPU (-1)
+#define CN_STATS_NOACK 0
+#define CN_STATS_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct cnstats))
+
+static atomic_t cnstats_num_listeners = ATOMIC_INIT(0);
+static struct cb_id cnstats_id = { CN_IDX_STATS, CN_VAL_STATS };
+/* cnstats_counts is used as the sequence number of the netlink message */
+static DEFINE_PER_CPU(__u32, cnstats_counts) = { 0 };
+
+void cnstats_init_msg(struct cn_msg *msg, int seq, int ack)
+{
+ memset(msg, 0, CN_STATS_MSG_SIZE);
+ msg->seq = seq;
+ msg->ack = ack;
+ memcpy(&msg->id, &cnstats_id, sizeof(msg->id));
+ msg->len = sizeof(struct cnstats);
+}
+
+/*
+ * Accumulate one task's statistics
+ *
+ */
+static inline void cnstats_add_tsk_data(struct cnstats_data *d, struct task_struct *tsk)
+{
+ uint64_t tmp;
+
+ d->pid = tsk->pid;
+ d->tgid = tsk->tgid;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+#ifdef CONFIG_SCHEDSTATS
+ d->cpu_count += tsk->sched_info.pcnt;
+ tmp = d->cpu_run_total + (unsigned long long)jiffies_to_usecs(tsk->sched_info.cpu_time) * 1000;
+ d->cpu_run_total = (tmp < d->cpu_run_total) ? 0 : tmp;
+ tmp = d->cpu_delay_total + (unsigned long long)jiffies_to_usecs(tsk->sched_info.run_delay) * 1000;
+ d->cpu_delay_total = (tmp < d->cpu_delay_total) ? 0 : tmp;
+#else
+ /* Non-zero XXX_total,zero XXX_count implies XXX stat unavailable */
+ d->cpu_count = 0;
+ d->cpu_run_total = d->cpu_delay_total = CNSTATS_NOCPUSTATS;
+#endif
+ spin_lock(&tsk->delays.lock);
+ tmp = d->blkio_delay_total + tsk->delays.blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0:tmp;
+ tmp = d->swapin_delay_total + tsk->delays.swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0:tmp;
+ d->blkio_count += tsk->delays.blkio_count;
+ d->swapin_count += tsk->delays.swapin_count;
+ spin_unlock(&tsk->delays.lock);
+}
+
+/*
+ * Send acknowledgement (error)
+ *
+ */
+static void cnstats_send_ack(int err, int rcvd_seq, int rcvd_ack)
+{
+ __u8 buffer[CN_STATS_MSG_SIZE];
+ struct cn_msg *msg = (struct cn_msg *)buffer;
+ struct cnstats *c = (struct cnstats *)msg->data;
+
+ cnstats_init_msg(msg, rcvd_seq, rcvd_ack+1);
+ c->version = CNSTATS_DATA_VERSION;
+ c->outtype = CNSTATS_DATA_NONE;
+
+ /* Following allows other functions to continue returning -ve errors */
+ c->data.ack.err = abs(err);
+
+ cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
+}
+
+/*
+ * Send one pid's stats
+ *
+ */
+static int cnstats_send_pid(pid_t pid, int seq, int ack)
+{
+ __u8 buffer[CN_STATS_MSG_SIZE];
+ struct cn_msg *msg = (struct cn_msg *)buffer;
+ struct cnstats *c = (struct cnstats *)msg->data;
+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
+ struct task_struct *tsk;
+
+ cnstats_init_msg(msg, seq, ack);
+ c->version = CNSTATS_DATA_VERSION;
+ c->outtype = CNSTATS_DATA_PID;
+
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+
+ cnstats_add_tsk_data(d, tsk);
+ put_task_struct(tsk);
+
+ return cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
+}
+
+/*
+ * Send one tgid's stats (sum of appropriate per-pid stats)
+ *
+ */
+static int cnstats_send_tgid(pid_t tgid, int seq, int ack)
+{
+ __u8 buffer[CN_STATS_MSG_SIZE];
+ struct cn_msg *msg = (struct cn_msg *)buffer;
+ struct cnstats *c = (struct cnstats *)msg->data;
+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
+ struct task_struct *first, *tsk;
+
+ cnstats_init_msg(msg, seq, ack);
+ c->version = CNSTATS_DATA_VERSION;
+ c->outtype = CNSTATS_DATA_TGID;
+
+ read_lock(&tasklist_lock);
+ first = tsk = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ do
+ cnstats_add_tsk_data(d, tsk);
+ while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ d->pid = -1;
+ d->tgid = tgid;
+
+ return cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
+}
+
+/***
+ * cnstats_ctl - handle command sent via CN_IDX_STATS connector
+ * @data: command
+ */
+static void cnstats_ctl(void *data)
+{
+ struct cn_msg *msg = data;
+ struct cnstats_cmd *cmd;
+ int err = 0;
+
+ if (msg->len != sizeof(*cmd))
+ return;
+
+ cmd = (struct cnstats_cmd *)msg->data;
+ switch (cmd->intype) {
+ case CNSTATS_CMD_LISTEN:
+ atomic_inc(&cnstats_num_listeners);
+ break;
+
+ case CNSTATS_CMD_IGNORE:
+ atomic_dec(&cnstats_num_listeners);
+ break;
+
+ case CNSTATS_CMD_STATS_PID:
+ if (atomic_read(&cnstats_num_listeners) < 1)
+ return;
+ err = cnstats_send_pid(cmd->param.pid, msg->seq, msg->ack+1);
+ if (!err)
+ return;
+ break;
+
+ case CNSTATS_CMD_STATS_TGID:
+ if (atomic_read(&cnstats_num_listeners) < 1)
+ return;
+ err = cnstats_send_tgid(cmd->param.tgid, msg->seq, msg->ack+1);
+ if (!err)
+ return;
+ break;
+
+ default:
+ err = -EINVAL;
+ break;
+ }
+ cnstats_send_ack(err, msg->seq, msg->ack);
+}
+
+/***
+ * cnstats_exit_connector - send task's statistics to userspace when it exits
+ * @tsk: exiting task
+ */
+void cnstats_exit_connector(struct task_struct *tsk)
+{
+ int seq;
+ __u8 buffer[CN_STATS_MSG_SIZE];
+ struct cn_msg *msg = (struct cn_msg *)buffer;
+ struct cnstats *c = (struct cnstats *)msg->data;
+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
+
+ if (atomic_read(&cnstats_num_listeners) < 1)
+ return;
+
+ seq = get_cpu_var(cnstats_counts)++;
+ put_cpu_var(cnstats_counts);
+
+ cnstats_init_msg(msg, seq, CN_STATS_NOACK);
+ c->version = CNSTATS_DATA_VERSION;
+ c->outtype = CNSTATS_DATA_EXIT;
+ cnstats_add_tsk_data(d, tsk);
+
+ cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
+}
+
+static int __init cnstats_init(void)
+{
+ int err;
+
+ if ((err = cn_add_callback(&cnstats_id, "cn_stats", &cnstats_ctl))) {
+ printk(KERN_WARNING "cn_stats failed to register\n");
+ return err;
+ }
+ return 0;
+}
+
+module_init(cnstats_init);
Index: linux-2.6.15-rc7/kernel/exit.c
===================================================================
--- linux-2.6.15-rc7.orig/kernel/exit.c
+++ linux-2.6.15-rc7/kernel/exit.c
@@ -29,6 +29,7 @@
#include <linux/syscalls.h>
#include <linux/signal.h>
#include <linux/cn_proc.h>
+#include <linux/cn_stats.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -865,6 +866,7 @@ fastcall NORET_TYPE void do_exit(long co

tsk->exit_code = code;
proc_exit_connector(tsk);
+ cnstats_exit_connector(tsk);
exit_notify(tsk);
#ifdef CONFIG_NUMA
mpol_free(tsk->mempolicy);

2006-01-03 23:36:32

by Shailabh Nagar

Subject: [Patch 3/6] Delay accounting: Sync block I/O delays

Changes since 12/7/05

- removed __attribute((unused)) qualifiers from timespec vars (dave hansen)
- use existing timestamping function do_posix_clock_monotonic_gettime() (jay lan)

Changes since 11/14/05

- use nanosecond resolution, adjusted wall clock time for timestamps
instead of sched_clock (akpm, andi, marcelo)
- collect stats only if delay accounting enabled (parag)
- stats collected for delays in all userspace-initiated block I/O,
including fsync/fdatasync, but not for waits on async block I/O events.

11/14/05: First post


delayacct-blkio.patch

Record the time a task spends waiting for completion of
userspace-initiated synchronous block I/O. This can help
determine the right I/O priority for the task.

Signed-off-by: Shailabh Nagar <[email protected]>

fs/buffer.c | 6 ++++++
fs/read_write.c | 10 +++++++++-
include/linux/delayacct.h | 4 ++++
include/linux/sched.h | 2 ++
kernel/delayacct.c | 16 ++++++++++++++++
mm/filemap.c | 10 +++++++++-
mm/memory.c | 17 +++++++++++++++--
7 files changed, 61 insertions(+), 4 deletions(-)

Index: linux-2.6.15-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/sched.h
+++ linux-2.6.15-rc7/include/linux/sched.h
@@ -546,6 +546,8 @@ struct task_delay_info {
spinlock_t lock;

/* Add stats in pairs: uint64_t delay, uint32_t count */
+ uint64_t blkio_delay; /* wait for sync block io completion */
+ uint32_t blkio_count;
};
#endif

Index: linux-2.6.15-rc7/fs/read_write.c
===================================================================
--- linux-2.6.15-rc7.orig/fs/read_write.c
+++ linux-2.6.15-rc7/fs/read_write.c
@@ -14,6 +14,8 @@
#include <linux/security.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/time.h>
+#include <linux/delayacct.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -224,8 +226,14 @@ ssize_t do_sync_read(struct file *filp,
(ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos)))
wait_on_retry_sync_kiocb(&kiocb);

- if (-EIOCBQUEUED == ret)
+ if (-EIOCBQUEUED == ret) {
+ struct timespec start, end;
+
+ do_posix_clock_monotonic_gettime(&start);
ret = wait_on_sync_kiocb(&kiocb);
+ do_posix_clock_monotonic_gettime(&end);
+ delayacct_blkio(&start, &end);
+ }
*ppos = kiocb.ki_pos;
return ret;
}
Index: linux-2.6.15-rc7/mm/filemap.c
===================================================================
--- linux-2.6.15-rc7.orig/mm/filemap.c
+++ linux-2.6.15-rc7/mm/filemap.c
@@ -28,6 +28,8 @@
#include <linux/blkdev.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/time.h>
+#include <linux/delayacct.h>
#include "filemap.h"
/*
* FIXME: remove all knowledge of the buffer layer from the core VM
@@ -1062,8 +1064,14 @@ generic_file_read(struct file *filp, cha

init_sync_kiocb(&kiocb, filp);
ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
- if (-EIOCBQUEUED == ret)
+ if (-EIOCBQUEUED == ret) {
+ struct timespec start, end;
+
+ do_posix_clock_monotonic_gettime(&start);
ret = wait_on_sync_kiocb(&kiocb);
+ do_posix_clock_monotonic_gettime(&end);
+ delayacct_blkio(&start, &end);
+ }
return ret;
}

Index: linux-2.6.15-rc7/mm/memory.c
===================================================================
--- linux-2.6.15-rc7.orig/mm/memory.c
+++ linux-2.6.15-rc7/mm/memory.c
@@ -48,6 +48,8 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/time.h>
+#include <linux/delayacct.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -2170,11 +2172,22 @@ static inline int handle_pte_fault(struc
old_entry = entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
+ int ret;
+ struct timespec start, end;
+
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, address,
pte, pmd, write_access);
- return do_no_page(mm, vma, address,
- pte, pmd, write_access);
+
+ if (vma->vm_file)
+ do_posix_clock_monotonic_gettime(&start);
+ ret = do_no_page(mm, vma, address,
+ pte, pmd, write_access);
+ if (vma->vm_file) {
+ do_posix_clock_monotonic_gettime(&end);
+ delayacct_blkio(&start, &end);
+ }
+ return ret;
}
if (pte_file(entry))
return do_file_page(mm, vma, address,
Index: linux-2.6.15-rc7/fs/buffer.c
===================================================================
--- linux-2.6.15-rc7.orig/fs/buffer.c
+++ linux-2.6.15-rc7/fs/buffer.c
@@ -41,6 +41,8 @@
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
+#include <linux/time.h>
+#include <linux/delayacct.h>

static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
static void invalidate_bh_lrus(void);
@@ -337,6 +339,7 @@ static long do_fsync(unsigned int fd, in
struct file * file;
struct address_space *mapping;
int ret, err;
+ struct timespec start, end;

ret = -EBADF;
file = fget(fd);
@@ -349,6 +352,7 @@ static long do_fsync(unsigned int fd, in
goto out_putf;
}

+ do_posix_clock_monotonic_gettime(&start);
mapping = file->f_mapping;

current->flags |= PF_SYNCWRITE;
@@ -371,6 +375,8 @@ static long do_fsync(unsigned int fd, in
+ do_posix_clock_monotonic_gettime(&end);
+ delayacct_blkio(&start, &end);
out_putf:
fput(file);
out:
return ret;
}

Index: linux-2.6.15-rc7/include/linux/delayacct.h
===================================================================
--- linux-2.6.15-rc7.orig/include/linux/delayacct.h
+++ linux-2.6.15-rc7/include/linux/delayacct.h
@@ -19,8 +19,12 @@
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
extern void delayacct_tsk_init(struct task_struct *tsk);
+extern void delayacct_blkio(struct timespec *start, struct timespec *end);
#else
static inline void delayacct_tsk_init(struct task_struct *tsk)
{}
+static inline void delayacct_blkio(struct timespec *start, struct timespec *end)
+{}
+
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.15-rc7/kernel/delayacct.c
===================================================================
--- linux-2.6.15-rc7.orig/kernel/delayacct.c
+++ linux-2.6.15-rc7/kernel/delayacct.c
@@ -12,6 +12,7 @@
*/

#include <linux/sched.h>
+#include <linux/time.h>

int delayacct_on=1; /* Delay accounting turned on/off */

@@ -34,3 +35,18 @@ static int __init delayacct_init(void)
return 0;
}
core_initcall(delayacct_init);
+
+inline void delayacct_blkio(struct timespec *start, struct timespec *end)
+{
+ unsigned long long delay;
+
+ if (!delayacct_on)
+ return;
+
+ delay = timespec_diff_ns(start, end);
+
+ spin_lock(&current->delays.lock);
+ current->delays.blkio_delay += delay;
+ current->delays.blkio_count++;
+ spin_unlock(&current->delays.lock);
+}

2006-01-04 00:21:27

by Greg KH

Subject: Re: [Patch 6/6] Delay accounting: Connector interface

On Tue, Jan 03, 2006 at 11:33:40PM +0000, Shailabh Nagar wrote:
> Changes since 11/14/05:
>
> - explicit versioning of statistics data returned
> - new command type for returning per-tgid stats
> - for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)
>
> First post 11/14/05
>
> delayacct-connector.patch
>
> Creates a connector interface for getting delay and cpu statistics of tasks
> during their lifetime and when they exit. The cpu stats are available only if
> CONFIG_SCHEDSTATS is enabled.

Why do you use this when we can send typesafe data through netlink
messages now? Because of that, I think the whole connector code can be
deleted :)

Unless I missed something in the connector code that is not present in
netlink...

thanks,

greg k-h

2006-01-04 00:43:00

by Shailabh Nagar

Subject: Re: [Patch 6/6] Delay accounting: Connector interface

Greg KH wrote:
> On Tue, Jan 03, 2006 at 11:33:40PM +0000, Shailabh Nagar wrote:
>
>>Changes since 11/14/05:
>>
>>- explicit versioning of statistics data returned
>>- new command type for returning per-tgid stats
>>- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)
>>
>>First post 11/14/05
>>
>>delayacct-connector.patch
>>
>>Creates a connector interface for getting delay and cpu statistics of tasks
>>during their lifetime and when they exit. The cpu stats are available only if
>>CONFIG_SCHEDSTATS is enabled.
>
>
> Why do you use this when we can send typesafe data through netlink
> messages now?

AFAIK, adding new netlink types was frowned upon, which is one of the
reasons connectors were proposed (besides making it easier to use the
netlink interface)?

> Because of that, I think the whole connector code can be
> deleted :)
>
> Unless I missed something in the connector code that is not present in
> netlink...
>
> thanks,
>
> greg k-h
>

2006-01-04 00:51:42

by Greg KH

Subject: Re: [Patch 6/6] Delay accounting: Connector interface

On Wed, Jan 04, 2006 at 12:42:36AM +0000, Shailabh Nagar wrote:
> Greg KH wrote:
> > On Tue, Jan 03, 2006 at 11:33:40PM +0000, Shailabh Nagar wrote:
> >
> >>Changes since 11/14/05:
> >>
> >>- explicit versioning of statistics data returned
> >>- new command type for returning per-tgid stats
> >>- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)
> >>
> >>First post 11/14/05
> >>
> >>delayacct-connector.patch
> >>
> >>Creates a connector interface for getting delay and cpu statistics of tasks
> >>during their lifetime and when they exit. The cpu stats are available only if
> >>CONFIG_SCHEDSTATS is enabled.
> >
> >
> > Why do you use this when we can send typesafe data through netlink
> > messages now?
>
> AFAIK, adding new netlink types was frowned upon which is one of the reasons why
> connectors were proposed (besides making it easier to use the netlink interface) ?

I don't know about the issue of creating new types (have you tried?),
but there is a new netlink message format that should make it pretty
much as easy as the connector stuff to send complex message types.

thanks,

greg k-h

2006-01-04 07:49:36

by Shailabh Nagar

Subject: Re: [Lse-tech] Re: [Patch 6/6] Delay accounting: Connector interface

Greg KH wrote:

>On Wed, Jan 04, 2006 at 12:42:36AM +0000, Shailabh Nagar wrote:
>
>
>>Greg KH wrote:
>>
>>
>>>On Tue, Jan 03, 2006 at 11:33:40PM +0000, Shailabh Nagar wrote:
>>>
>>>
>>>
>>>>Changes since 11/14/05:
>>>>
>>>>- explicit versioning of statistics data returned
>>>>- new command type for returning per-tgid stats
>>>>- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)
>>>>
>>>>First post 11/14/05
>>>>
>>>>delayacct-connector.patch
>>>>
>>>>Creates a connector interface for getting delay and cpu statistics of tasks
>>>>during their lifetime and when they exit. The cpu stats are available only if
>>>>CONFIG_SCHEDSTATS is enabled.
>>>>
>>>>
>>>Why do you use this when we can send typesafe data through netlink
>>>messages now?
>>>
>>>
>>AFAIK, adding new netlink types was frowned upon which is one of the reasons why
>>connectors were proposed (besides making it easier to use the netlink interface) ?
>>
>>
>
>I don't know about the issue of creating new types (have you tried?),
>but there is a new netlink message format that should make it pretty
>much as easy as the connector stuff to send complex message types.
>
>

Thanks - just saw the genetlink patches/interface, which seem to handle
the problem of creating new netlink types as well.

But why do I get the feeling that the delay accounting patches are
becoming a pawn in deprecating connector usage? :-)
I'd rather not run off and implement over genetlink (which no one else
seems to be using right now) unless there's some consensus that
connectors are to be deprecated.

Andrew - any opinions?

-- Shailabh

>thanks,
>
>greg k-h
>
>


2006-01-04 19:04:47

by Jay Lan

Subject: Re: [Patch 6/6] Delay accounting: Connector interface

Shailabh Nagar wrote:
>Changes since 11/14/05:
>
>- explicit versioning of statistics data returned
>- new command type for returning per-tgid stats
>- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)
>
>First post 11/14/05
>
>delayacct-connector.patch
>
>Creates a connector interface for getting delay and cpu statistics of tasks
>during their lifetime and when they exit. The cpu stats are available only if
>CONFIG_SCHEDSTATS is enabled.
>
>When a task is alive, userspace can get its stats by sending a command
>containing its pid. Sending a tgid returns the sum of stats of the tasks
>belonging to that tgid (where such a sum makes sense). Together, the command
>interface allows stats for a large number of tasks to be collected more
>efficiently than would be possible through /proc or any per-pid interface.
>
>The connector interface also sends the stats for each task to userspace when
>the task is exiting. This permits fine-grain accounting for short-lived tasks,
>which is important if userspace is doing its own aggregation of statistics
>based on some grouping of tasks (e.g. CSA jobs, ELSA banks or CKRM classes).
>
>While this patch is included as part of the delay stats collection patches,
>it is intended for general use by any kernel component that needs per-pid/tgid
>stats sent at task exit. This limits the increase in connectors called on task
>exit, at least for stats collection purposes.
>
>Signed-off-by: Shailabh Nagar <[email protected]>
>
> drivers/connector/Kconfig | 9 +
> drivers/connector/Makefile | 1
> drivers/connector/cn_stats.c | 243 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/cn_stats.h | 115 ++++++++++++++++++++
> include/linux/connector.h | 4
> kernel/exit.c | 2
> 6 files changed, 374 insertions(+)
>
>Index: linux-2.6.15-rc7/drivers/connector/Kconfig
>===================================================================
>--- linux-2.6.15-rc7.orig/drivers/connector/Kconfig
>+++ linux-2.6.15-rc7/drivers/connector/Kconfig
>@@ -18,4 +18,13 @@ config PROC_EVENTS
> Provide a connector that reports process events to userspace. Send
> events such as fork, exec, id change (uid, gid, suid, etc), and exit.
>
>+config STATS_CONNECTOR
>+ bool "Report per-task statistics to userspace"
>+ depends on CONNECTOR=y && TASK_DELAY_ACCT
>+ ---help---
>+ Provide a connector interface that reports per-task statistics to
>+ userspace. While a task is running, userspace can get the stats by
>+ sending a command to the connector. At task exit, the final value of
>+ the stats is sent automatically.
>+
> endmenu
>Index: linux-2.6.15-rc7/drivers/connector/Makefile
>===================================================================
>--- linux-2.6.15-rc7.orig/drivers/connector/Makefile
>+++ linux-2.6.15-rc7/drivers/connector/Makefile
>@@ -1,4 +1,5 @@
> obj-$(CONFIG_CONNECTOR) += cn.o
>+obj-$(CONFIG_STATS_CONNECTOR) += cn_stats.o
> obj-$(CONFIG_PROC_EVENTS) += cn_proc.o
>
> cn-y += cn_queue.o connector.o
>Index: linux-2.6.15-rc7/include/linux/cn_stats.h
>===================================================================
>--- /dev/null
>+++ linux-2.6.15-rc7/include/linux/cn_stats.h
>@@ -0,0 +1,115 @@
>+/*
>+ * cn_stats.h - Task statistics connector
>+ *
>+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
>+ * Based on work by Matt Helsley, Nguyen Anh Quynh and Guillaume Thouvenin
>+ *
>+ * This program is free software; you can redistribute it and/or modify
>+ * it under the terms of the GNU General Public License as published by
>+ * the Free Software Foundation; either version 2 of the License, or
>+ * (at your option) any later version.
>+ *
>+ * This program is distributed in the hope that it will be useful,
>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>+ * GNU General Public License for more details.
>+ */
>+
>+#ifndef CN_STATS_H
>+#define CN_STATS_H
>+
>+#include <linux/types.h>
>+#include <linux/connector.h>
>+
>+
>+/* Format for per-task data returned to userland when
>+ * - a task exits
>+ * - listener requests stats for all tasks
>+ *
>+ * The struct is versioned. Newer versions should only add fields to
>+ * the bottom of the struct to maintain backward compatibility.
>+ *
>+ * To create the next version, bump up the version count macro
>+ * and delineate the start of newly added fields with a comment indicating the
>+ * version number.
>+ */
>+#define CNSTATS_DATA_VERSION 1
>+
>+struct cnstats_data {
>+
>+ /* Version 1 */
>+ pid_t pid;
>+ pid_t tgid;
>+
>+ /* *_delay_total is cumulative delay (in nanosecs) of a
>+ * task waiting for cpu to be available, block io
>+ * completion, page fault to be serviced etc.
>+ * *_count is number of delay intervals recorded.
>+ * cpu_run_total is cumulative cpu run time, measured by timestamp
>+ * intervals and hence more accurate than sampling-based cpu times.
>+ * All *_total values are in nanoseconds though cpu data may not be
>+ * collected at that granularity.
>+ */
>+
>+ __u64 cpu_count;
>+#define CNSTATS_NOCPUSTATS 1
>+ __u64 cpu_run_total;
>+ __u64 cpu_delay_total;
>+
>+ __u64 blkio_count;
>+ __u64 blkio_delay_total;
>+ __u64 swapin_count;
>+ __u64 swapin_delay_total;
>+
>+};
>+
>+
>+/*
>+ * Commands sent from userspace
>+ */
>+
>+struct cnstats_cmd {
>+ enum intype {
>+ CNSTATS_CMD_LISTEN = 1, /* Start listening on connector */
>+ CNSTATS_CMD_IGNORE, /* Stop listening */
>+ CNSTATS_CMD_STATS_PID, /* Stats for a pid */
>+ CNSTATS_CMD_STATS_TGID, /* Stats for a tgid */
>+ } intype;
>+
>+ union {
>+ pid_t pid;
>+ pid_t tgid;
>+ } param;
>+};
>+
>+/*
>+ * Response or data sent from kernel
>+ * Versioned for backward compatibility
>+ */
>+
>+struct cnstats {
>+ __u32 version;
>+ enum outtype {
>+ CNSTATS_DATA_NONE = 1, /* Control cmd response */
>+ CNSTATS_DATA_EXIT, /* Exiting task's stats */
>+ CNSTATS_DATA_PID, /* per-pid data cmd response*/
>+ CNSTATS_DATA_TGID, /* per-tgid data cmd response*/
>+ } outtype;
>+ union {
>+ struct cnstats_ack {
>+ __u32 err;
>+ } ack;
>+
>+ struct cnstats_data stats;
>+ } data;
>+};
>+
>+#ifdef __KERNEL__
>+#ifdef CONFIG_STATS_CONNECTOR
>+void cnstats_exit_connector(struct task_struct *tsk);
>+#else
>+static inline void cnstats_exit_connector(struct task_struct *tsk)
>+{}
>+#endif /* CONFIG_STATS_CONNECTOR */
>+#endif /* __KERNEL__ */
>+#endif /* CN_STATS_H */
>Index: linux-2.6.15-rc7/include/linux/connector.h
>===================================================================
>--- linux-2.6.15-rc7.orig/include/linux/connector.h
>+++ linux-2.6.15-rc7/include/linux/connector.h
>@@ -35,6 +35,10 @@
> #define CN_IDX_CIFS 0x2
> #define CN_VAL_CIFS 0x1
>
>+/* Statistics connector ids */
>+#define CN_IDX_STATS 0x3
>+#define CN_VAL_STATS 0x2
>+
> #define CN_NETLINK_USERS 1
>
> /*
>Index: linux-2.6.15-rc7/drivers/connector/cn_stats.c
>===================================================================
>--- /dev/null
>+++ linux-2.6.15-rc7/drivers/connector/cn_stats.c
>@@ -0,0 +1,243 @@
>+/*
>+ * cn_stats.c - Task statistics connector
>

It looks like you created this file to report task stats through the
connector, including the delay accounting stats you need as a starter.

This sounds like a good approach to me. However, we need to move
where cnstats_exit_connector(tsk) is invoked. See below.
>+ *
>+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
>+ * Based on work by Matt Helsley, Nguyen Anh Quynh and Guillaume Thouvenin
>+ *
>+ * This program is free software; you can redistribute it and/or modify
>+ * it under the terms of the GNU General Public License as published by
>+ * the Free Software Foundation; either version 2 of the License, or
>+ * (at your option) any later version.
>+ *
>+ * This program is distributed in the hope that it will be useful,
>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>+ * GNU General Public License for more details.
>+ *
>+ */
>+
>+#include <linux/module.h>
>+#include <linux/kernel.h>
>+#include <linux/init.h>
>+#include <linux/cn_stats.h>
>+#include <asm/atomic.h>
>+
>+#define CN_STATS_NOCPU (-1)
>+#define CN_STATS_NOACK 0
>+#define CN_STATS_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct cnstats))
>+
>+static atomic_t cnstats_num_listeners = ATOMIC_INIT(0);
>+static struct cb_id cnstats_id = { CN_IDX_STATS, CN_VAL_STATS };
>+/* cnstats_counts is used as the sequence number of the netlink message */
>+static DEFINE_PER_CPU(__u32, cnstats_counts) = { 0 };
>+
>+void cnstats_init_msg(struct cn_msg *msg, int seq, int ack)
>+{
>+ memset(msg, 0, CN_STATS_MSG_SIZE);
>+ msg->seq = seq;
>+ msg->ack = ack;
>+ memcpy(&msg->id, &cnstats_id, sizeof(msg->id));
>+ msg->len = sizeof(struct cnstats);
>+}
>+
>+/*
>+ * Accumulate one task's statistics
>+ *
>+ */
>+static inline void cnstats_add_tsk_data(struct cnstats_data *d, struct task_struct *tsk)
>+{
>+ uint64_t tmp;
>+
>+ d->pid = tsk->pid;
>+ d->tgid = tsk->tgid;
>+
>+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
>+#ifdef CONFIG_SCHEDSTATS
>+ d->cpu_count += tsk->sched_info.pcnt;
>+ tmp = d->cpu_run_total + jiffies_to_usecs(tsk->sched_info.cpu_time)*1000;
>+ d->cpu_run_total = (tmp < d->cpu_run_total)? 0:tmp;
>+ tmp = d->cpu_delay_total + jiffies_to_usecs(tsk->sched_info.run_delay)*1000;
>+ d->cpu_delay_total = (tmp < d->cpu_delay_total)? 0:tmp;
>+#else
>+ /* Non-zero XXX_total,zero XXX_count implies XXX stat unavailable */
>+ d->cpu_count = 0;
>+ d->cpu_run_total = d->cpu_delay_total = CNSTATS_NOCPUSTATS;
>+#endif
>+ spin_lock(&tsk->delays.lock);
>+ tmp = d->blkio_delay_total + tsk->delays.blkio_delay;
>+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0:tmp;
>+ tmp = d->swapin_delay_total + tsk->delays.swapin_delay;
>+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0:tmp;
>+ d->blkio_count += tsk->delays.blkio_count;
>+ d->swapin_count += tsk->delays.swapin_count;
>+ spin_unlock(&tsk->delays.lock);
>+}
>+
>+/*
>+ * Send acknowledgement (error)
>+ *
>+ */
>+static void cnstats_send_ack(int err, int rcvd_seq, int rcvd_ack)
>+{
>+ __u8 buffer[CN_STATS_MSG_SIZE];
>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>+ struct cnstats *c = (struct cnstats *)msg->data;
>+
>+ cnstats_init_msg(msg, rcvd_seq, rcvd_ack+1);
>+ c->version = CNSTATS_DATA_VERSION;
>+ c->outtype = CNSTATS_DATA_NONE;
>+
>+ /* Following allows other functions to continue returning -ve errors */
>+ c->data.ack.err = abs(err);
>+
>+ cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>+}
>+
>+/*
>+ * Send one pid's stats
>+ *
>+ */
>+static int cnstats_send_pid(pid_t pid, int seq, int ack)
>+{
>+ __u8 buffer[CN_STATS_MSG_SIZE];
>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>+ struct cnstats *c = (struct cnstats *)msg->data;
>+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
>+ struct task_struct *tsk;
>+
>+ cnstats_init_msg(msg, seq, ack);
>+ c->version = CNSTATS_DATA_VERSION;
>+ c->outtype = CNSTATS_DATA_PID;
>+
>+ read_lock(&tasklist_lock);
>+ tsk = find_task_by_pid(pid);
>+ if (!tsk) {
>+ read_unlock(&tasklist_lock);
>+ return -ESRCH;
>+ }
>+ get_task_struct(tsk);
>+ read_unlock(&tasklist_lock);
>+
>+ cnstats_add_tsk_data(d, tsk);
>+ put_task_struct(tsk);
>+
>+ return cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>+}
>+
>+/*
>+ * Send one tgid's stats (sum of appropriate per-pid stats)
>+ *
>+ */
>+static int cnstats_send_tgid(pid_t tgid, int seq, int ack)
>+{
>+ __u8 buffer[CN_STATS_MSG_SIZE];
>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>+ struct cnstats *c = (struct cnstats *)msg->data;
>+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
>+ struct task_struct *first, *tsk;
>+
>+ cnstats_init_msg(msg, seq, ack);
>+ c->version = CNSTATS_DATA_VERSION;
>+ c->outtype = CNSTATS_DATA_TGID;
>+
>+ read_lock(&tasklist_lock);
>+ first = tsk = find_task_by_pid(tgid);
>+ if (!first) {
>+ read_unlock(&tasklist_lock);
>+ return -ESRCH;
>+ }
>+ do
>+ cnstats_add_tsk_data(d, tsk);
>+ while_each_thread(first, tsk);
>+ read_unlock(&tasklist_lock);
>+
>+ d->pid = -1;
>+ d->tgid = tgid;
>+
>+ return cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>+}
>+
>+/***
>+ * cnstats_ctl - handle command sent via CN_IDX_STATS connector
>+ * @data: command
>+ */
>+static void cnstats_ctl(void *data)
>+{
>+ struct cn_msg *msg = data;
>+ struct cnstats_cmd *cmd;
>+ int err = 0;
>+
>+ if (msg->len != sizeof(*cmd))
>+ return;
>+
>+ cmd = (struct cnstats_cmd *)msg->data;
>+ switch (cmd->intype) {
>+ case CNSTATS_CMD_LISTEN:
>+ atomic_inc(&cnstats_num_listeners);
>+ break;
>+
>+ case CNSTATS_CMD_IGNORE:
>+ atomic_dec(&cnstats_num_listeners);
>+ break;
>+
>+ case CNSTATS_CMD_STATS_PID:
>+ if (atomic_read(&cnstats_num_listeners) < 1)
>+ return;
>+ err = cnstats_send_pid(cmd->param.pid, msg->seq, msg->ack+1);
>+ if (!err)
>+ return;
>+ break;
>+
>+ case CNSTATS_CMD_STATS_TGID:
>+ if (atomic_read(&cnstats_num_listeners) < 1)
>+ return;
>+ err = cnstats_send_tgid(cmd->param.pid, msg->seq, msg->ack+1);
>+ if (!err)
>+ return;
>+ break;
>+
>+ default:
>+ err = -EINVAL;
>+ break;
>+ }
>+ cnstats_send_ack(err, msg->seq, msg->ack);
>+}
>+
>+/***
>+ * cnstats_exit_connector - send task's statistics to userspace when it exits
>+ * @tsk: exiting task
>+ */
>+void cnstats_exit_connector(struct task_struct *tsk)
>+{
>+ int seq;
>+ __u8 buffer[CN_STATS_MSG_SIZE];
>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>+ struct cnstats *c = (struct cnstats *)msg->data;
>+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
>+
>+ if (atomic_read(&cnstats_num_listeners) < 1)
>+ return;
>+
>+ seq = get_cpu_var(cnstats_counts)++;
>+ put_cpu_var(cnstats_counts);
>+
>+ cnstats_init_msg(msg, seq, CN_STATS_NOACK);
>+ c->version = CNSTATS_DATA_VERSION;
>+ c->outtype = CNSTATS_DATA_EXIT;
>+ cnstats_add_tsk_data(d, tsk);
>+
>+ cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>+}
>+
>+static int __init cnstats_init(void)
>+{
>+ int err;
>+
>+ if ((err = cn_add_callback(&cnstats_id, "cn_stats", &cnstats_ctl))) {
>+ printk(KERN_WARNING "cn_stats failed to register\n");
>+ return err;
>+ }
>+ return 0;
>+}
>+
>+module_init(cnstats_init);
>Index: linux-2.6.15-rc7/kernel/exit.c
>===================================================================
>--- linux-2.6.15-rc7.orig/kernel/exit.c
>+++ linux-2.6.15-rc7/kernel/exit.c
>@@ -29,6 +29,7 @@
> #include <linux/syscalls.h>
> #include <linux/signal.h>
> #include <linux/cn_proc.h>
>+#include <linux/cn_stats.h>
>
> #include <asm/uaccess.h>
> #include <asm/unistd.h>
>@@ -865,6 +866,7 @@ fastcall NORET_TYPE void do_exit(long co
>
> tsk->exit_code = code;
> proc_exit_connector(tsk);
>+ cnstats_exit_connector(tsk);
>

We need to move both proc_exit_connector(tsk) and
cnstats_exit_connector(tsk) up to before the exit_mm(tsk) statement.
There are task statistics collected in task->mm, and those stats
will be lost after exit_mm(tsk).

Thanks,
- jay

> exit_notify(tsk);
> #ifdef CONFIG_NUMA
> mpol_free(tsk->mempolicy);

2006-01-04 21:31:44

by Shailabh Nagar

Subject: Re: [Patch 6/6] Delay accounting: Connector interface

Jay Lan wrote:

>Shailabh Nagar wrote:
>
>
>>Changes since 11/14/05:
>>
>>- explicit versioning of statistics data returned
>>- new command type for returning per-tgid stats
>>- for cpu running time, use tsk->sched_info.cpu_time (collected by schedstats)
>>
>>First post 11/14/05
>>
>>delayacct-connector.patch
>>
>>
<snip>

>>Index: linux-2.6.15-rc7/include/linux/connector.h
>>===================================================================
>>--- linux-2.6.15-rc7.orig/include/linux/connector.h
>>+++ linux-2.6.15-rc7/include/linux/connector.h
>>@@ -35,6 +35,10 @@
>>#define CN_IDX_CIFS 0x2
>>#define CN_VAL_CIFS 0x1
>>
>>+/* Statistics connector ids */
>>+#define CN_IDX_STATS 0x3
>>+#define CN_VAL_STATS 0x2
>>+
>>#define CN_NETLINK_USERS 1
>>
>>/*
>>Index: linux-2.6.15-rc7/drivers/connector/cn_stats.c
>>===================================================================
>>--- /dev/null
>>+++ linux-2.6.15-rc7/drivers/connector/cn_stats.c
>>@@ -0,0 +1,243 @@
>>+/*
>>+ * cn_stats.c - Task statistics connector
>>
>>
>>
>
>It looks like you create this file to report task stats through connector
>and you include stats for delay accounting you need as a starter.
>
>

Yes. The idea is that the connector interface will be used by everyone
interested in exporting per-task/process stats. The stats that I care
about right now are delay stats, and those are being exported now.
Adding to the exported set (by extending struct cnstats_data and
possibly struct cnstats_cmd) can happen in the future.

>This sounds like a good approach to me. However, we need to move
>where cnstats_exit_connector(tsk) is invoked. See below.
>
>

>>+ *
>>+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
>>+ * Based on work by Matt Helsley, Nguyen Anh Quynh and Guillaume Thouvenin
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ *
>>+ * This program is distributed in the hope that it will be useful,
>>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>>+ * GNU General Public License for more details.
>>+ *
>>+ */
>>+
>>+#include <linux/module.h>
>>+#include <linux/kernel.h>
>>+#include <linux/init.h>
>>+#include <linux/cn_stats.h>
>>+#include <asm/atomic.h>
>>+
>>+#define CN_STATS_NOCPU (-1)
>>+#define CN_STATS_NOACK 0
>>+#define CN_STATS_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct cnstats))
>>+
>>+static atomic_t cnstats_num_listeners = ATOMIC_INIT(0);
>>+static struct cb_id cnstats_id = { CN_IDX_STATS, CN_VAL_STATS };
>>+/* cnstats_counts is used as the sequence number of the netlink message */
>>+static DEFINE_PER_CPU(__u32, cnstats_counts) = { 0 };
>>+
>>+void cnstats_init_msg(struct cn_msg *msg, int seq, int ack)
>>+{
>>+ memset(msg, 0, CN_STATS_MSG_SIZE);
>>+ msg->seq = seq;
>>+ msg->ack = ack;
>>+ memcpy(&msg->id, &cnstats_id, sizeof(msg->id));
>>+ msg->len = sizeof(struct cnstats);
>>+}
>>+
>>+/*
>>+ * Accumulate one task's statistics
>>+ *
>>+ */
>>+static inline void cnstats_add_tsk_data(struct cnstats_data *d, struct task_struct *tsk)
>>+{
>>+ uint64_t tmp;
>>+
>>+ d->pid = tsk->pid;
>>+ d->tgid = tsk->tgid;
>>+
>>+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
>>+#ifdef CONFIG_SCHEDSTATS
>>+ d->cpu_count += tsk->sched_info.pcnt;
>>+ tmp = d->cpu_run_total + jiffies_to_usecs(tsk->sched_info.cpu_time)*1000;
>>+ d->cpu_run_total = (tmp < d->cpu_run_total)? 0:tmp;
>>+ tmp = d->cpu_delay_total + jiffies_to_usecs(tsk->sched_info.run_delay)*1000;
>>+ d->cpu_delay_total = (tmp < d->cpu_delay_total)? 0:tmp;
>>+#else
>>+ /* Non-zero XXX_total,zero XXX_count implies XXX stat unavailable */
>>+ d->cpu_count = 0;
>>+ d->cpu_run_total = d->cpu_delay_total = CNSTATS_NOCPUSTATS;
>>+#endif
>>+ spin_lock(&tsk->delays.lock);
>>+ tmp = d->blkio_delay_total + tsk->delays.blkio_delay;
>>+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0:tmp;
>>+ tmp = d->swapin_delay_total + tsk->delays.swapin_delay;
>>+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0:tmp;
>>+ d->blkio_count += tsk->delays.blkio_count;
>>+ d->swapin_count += tsk->delays.swapin_count;
>>+ spin_unlock(&tsk->delays.lock);
>>+}
>>+
>>+/*
>>+ * Send acknowledgement (error)
>>+ *
>>+ */
>>+static void cnstats_send_ack(int err, int rcvd_seq, int rcvd_ack)
>>+{
>>+ __u8 buffer[CN_STATS_MSG_SIZE];
>>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>>+ struct cnstats *c = (struct cnstats *)msg->data;
>>+
>>+ cnstats_init_msg(msg, rcvd_seq, rcvd_ack+1);
>>+ c->version = CNSTATS_DATA_VERSION;
>>+ c->outtype = CNSTATS_DATA_NONE;
>>+
>>+ /* Following allows other functions to continue returning -ve errors */
>>+ c->data.ack.err = abs(err);
>>+
>>+ cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>>+}
>>+
>>+/*
>>+ * Send one pid's stats
>>+ *
>>+ */
>>+static int cnstats_send_pid(pid_t pid, int seq, int ack)
>>+{
>>+ __u8 buffer[CN_STATS_MSG_SIZE];
>>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>>+ struct cnstats *c = (struct cnstats *)msg->data;
>>+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
>>+ struct task_struct *tsk;
>>+
>>+ cnstats_init_msg(msg, seq, ack);
>>+ c->version = CNSTATS_DATA_VERSION;
>>+ c->outtype = CNSTATS_DATA_PID;
>>+
>>+ read_lock(&tasklist_lock);
>>+ tsk = find_task_by_pid(pid);
>>+ if (!tsk) {
>>+ read_unlock(&tasklist_lock);
>>+ return -ESRCH;
>>+ }
>>+ get_task_struct(tsk);
>>+ read_unlock(&tasklist_lock);
>>+
>>+ cnstats_add_tsk_data(d, tsk);
>>+ put_task_struct(tsk);
>>+
>>+ return cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>>+}
>>+
>>+/*
>>+ * Send one tgid's stats (sum of appropriate per-pid stats)
>>+ *
>>+ */
>>+static int cnstats_send_tgid(pid_t tgid, int seq, int ack)
>>+{
>>+ __u8 buffer[CN_STATS_MSG_SIZE];
>>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>>+ struct cnstats *c = (struct cnstats *)msg->data;
>>+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
>>+ struct task_struct *first, *tsk;
>>+
>>+ cnstats_init_msg(msg, seq, ack);
>>+ c->version = CNSTATS_DATA_VERSION;
>>+ c->outtype = CNSTATS_DATA_TGID;
>>+
>>+ read_lock(&tasklist_lock);
>>+ first = tsk = find_task_by_pid(tgid);
>>+ if (!first) {
>>+ read_unlock(&tasklist_lock);
>>+ return -ESRCH;
>>+ }
>>+ do
>>+ cnstats_add_tsk_data(d, tsk);
>>+ while_each_thread(first, tsk);
>>+ read_unlock(&tasklist_lock);
>>+
>>+ d->pid = -1;
>>+ d->tgid = tgid;
>>+
>>+ return cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>>+}
>>+
>>+/***
>>+ * cnstats_ctl - handle command sent via CN_IDX_STATS connector
>>+ * @data: command
>>+ */
>>+static void cnstats_ctl(void *data)
>>+{
>>+ struct cn_msg *msg = data;
>>+ struct cnstats_cmd *cmd;
>>+ int err = 0;
>>+
>>+ if (msg->len != sizeof(*cmd))
>>+ return;
>>+
>>+ cmd = (struct cnstats_cmd *)msg->data;
>>+ switch (cmd->intype) {
>>+ case CNSTATS_CMD_LISTEN:
>>+ atomic_inc(&cnstats_num_listeners);
>>+ break;
>>+
>>+ case CNSTATS_CMD_IGNORE:
>>+ atomic_dec(&cnstats_num_listeners);
>>+ break;
>>+
>>+ case CNSTATS_CMD_STATS_PID:
>>+ if (atomic_read(&cnstats_num_listeners) < 1)
>>+ return;
>>+ err = cnstats_send_pid(cmd->param.pid, msg->seq, msg->ack+1);
>>+ if (!err)
>>+ return;
>>+ break;
>>+
>>+ case CNSTATS_CMD_STATS_TGID:
>>+ if (atomic_read(&cnstats_num_listeners) < 1)
>>+ return;
>>+ err = cnstats_send_tgid(cmd->param.pid, msg->seq, msg->ack+1);
>>+ if (!err)
>>+ return;
>>+ break;
>>+
>>+ default:
>>+ err = -EINVAL;
>>+ break;
>>+ }
>>+ cnstats_send_ack(err, msg->seq, msg->ack);
>>+}
>>+
>>+/***
>>+ * cnstats_exit_connector - send task's statistics to userspace when it exits
>>+ * @tsk: exiting task
>>+ */
>>+void cnstats_exit_connector(struct task_struct *tsk)
>>+{
>>+ int seq;
>>+ __u8 buffer[CN_STATS_MSG_SIZE];
>>+ struct cn_msg *msg = (struct cn_msg *)buffer;
>>+ struct cnstats *c = (struct cnstats *)msg->data;
>>+ struct cnstats_data *d = (struct cnstats_data *)&c->data.stats;
>>+
>>+ if (atomic_read(&cnstats_num_listeners) < 1)
>>+ return;
>>+
>>+ seq = get_cpu_var(cnstats_counts)++;
>>+ put_cpu_var(cnstats_counts);
>>+
>>+ cnstats_init_msg(msg, seq, CN_STATS_NOACK);
>>+ c->version = CNSTATS_DATA_VERSION;
>>+ c->outtype = CNSTATS_DATA_EXIT;
>>+ cnstats_add_tsk_data(d, tsk);
>>+
>>+ cn_netlink_send(msg, CN_IDX_STATS, GFP_KERNEL);
>>+}
>>+
>>+static int __init cnstats_init(void)
>>+{
>>+ int err;
>>+
>>+ if ((err = cn_add_callback(&cnstats_id, "cn_stats", &cnstats_ctl))) {
>>+ printk(KERN_WARNING "cn_stats failed to register\n");
>>+ return err;
>>+ }
>>+ return 0;
>>+}
>>+
>>+module_init(cnstats_init);
>>Index: linux-2.6.15-rc7/kernel/exit.c
>>===================================================================
>>--- linux-2.6.15-rc7.orig/kernel/exit.c
>>+++ linux-2.6.15-rc7/kernel/exit.c
>>@@ -29,6 +29,7 @@
>>#include <linux/syscalls.h>
>>#include <linux/signal.h>
>>#include <linux/cn_proc.h>
>>+#include <linux/cn_stats.h>
>>
>>#include <asm/uaccess.h>
>>#include <asm/unistd.h>
>>@@ -865,6 +866,7 @@ fastcall NORET_TYPE void do_exit(long co
>>
>> tsk->exit_code = code;
>> proc_exit_connector(tsk);
>>+ cnstats_exit_connector(tsk);
>>
>>
>>
>
>We need to move both proc_exit_connector(tsk) and
>cnstats_exit_connector(tsk) up to before exit_mm(tsk) statement.
>There are task statistics collected in task->mm and those stats
>will be lost after exit_mm(tsk).
>
>
Good point. The cnstats connector can be moved.
I'm not sure if proc_exit_connector needs to move as well, but Matt can
comment on that.

Do let me know if you think this interface would be good enough for
CSA's needs, or what changes would be needed.

--Shailabh

>Thanks,
> - jay
>
>
>
>> exit_notify(tsk);
>>#ifdef CONFIG_NUMA
>> mpol_free(tsk->mempolicy);
>>-
>>
>>
>>
>
>
>
>


2006-01-04 22:46:09

by Matt Helsley

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Patch 6/6] Delay accounting: Connector interface

On Wed, 2006-01-04 at 11:04 -0800, Jay Lan wrote:
> Shailabh Nagar wrote:
<snip>
> >Index: linux-2.6.15-rc7/kernel/exit.c
> >===================================================================
> >--- linux-2.6.15-rc7.orig/kernel/exit.c
> >+++ linux-2.6.15-rc7/kernel/exit.c
> >@@ -29,6 +29,7 @@
> > #include <linux/syscalls.h>
> > #include <linux/signal.h>
> > #include <linux/cn_proc.h>
> >+#include <linux/cn_stats.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/unistd.h>
> >@@ -865,6 +866,7 @@ fastcall NORET_TYPE void do_exit(long co
> >
> > tsk->exit_code = code;
> > proc_exit_connector(tsk);
> >+ cnstats_exit_connector(tsk);
> >
>
> We need to move both proc_exit_connector(tsk) and
> cnstats_exit_connector(tsk) up to before exit_mm(tsk) statement.
> There are task statistics collected in task->mm and those stats
> will be lost after exit_mm(tsk).
>
> Thanks,
> - jay
>
> > exit_notify(tsk);
> > #ifdef CONFIG_NUMA
> > mpol_free(tsk->mempolicy);
> >-

Good point. The assignment of the task exit code will also have to move
up before exit_mm(tsk) because the process event connector exit function
retrieves the exit code from the task struct.

Moving these may also affect the job/pagg/task_notify/cpuset exit
notification if we're eventually going to remove *direct* calls to these
from kernel/exit.c.

Cheers,
-Matt Helsley

2006-01-04 23:15:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Patch 6/6] Delay accounting: Connector interface

Matt Helsley <[email protected]> wrote:
>
> > We need to move both proc_exit_connector(tsk) and
> > cnstats_exit_connector(tsk) up to before exit_mm(tsk) statement.
> > There are task statistics collected in task->mm and those stats
> > will be lost after exit_mm(tsk).
> >
> > Thanks,
> > - jay
> >
> > > exit_notify(tsk);
> > > #ifdef CONFIG_NUMA
> > > mpol_free(tsk->mempolicy);
> > >-
>
> Good point. The assignment of the task exit code will also have to move
> up before exit_mm(tsk) because the process event connector exit function
> retrieves the exit code from the task struct.

Could someone please volunteer to do the patch?

2006-01-05 00:01:51

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [Patch 6/6] Delay accounting: Connector interface

Matt Helsley wrote:

>On Wed, 2006-01-04 at 11:04 -0800, Jay Lan wrote:
>
>
>>Shailabh Nagar wrote:
>>
>>
><snip>
>
>
>>>Index: linux-2.6.15-rc7/kernel/exit.c
>>>===================================================================
>>>--- linux-2.6.15-rc7.orig/kernel/exit.c
>>>+++ linux-2.6.15-rc7/kernel/exit.c
>>>@@ -29,6 +29,7 @@
>>>#include <linux/syscalls.h>
>>>#include <linux/signal.h>
>>>#include <linux/cn_proc.h>
>>>+#include <linux/cn_stats.h>
>>>
>>>#include <asm/uaccess.h>
>>>#include <asm/unistd.h>
>>>@@ -865,6 +866,7 @@ fastcall NORET_TYPE void do_exit(long co
>>>
>>> tsk->exit_code = code;
>>> proc_exit_connector(tsk);
>>>+ cnstats_exit_connector(tsk);
>>>
>>>
>>>
>>We need to move both proc_exit_connector(tsk) and
>>cnstats_exit_connector(tsk) up to before exit_mm(tsk) statement.
>>There are task statistics collected in task->mm and those stats
>>will be lost after exit_mm(tsk).
>>
>>Thanks,
>> - jay
>>
>>
>>
>>> exit_notify(tsk);
>>>#ifdef CONFIG_NUMA
>>> mpol_free(tsk->mempolicy);
>>>-
>>>
>>>
>
> Good point. The assignment of the task exit code will also have to move
>up before exit_mm(tsk) because the process event connector exit function
>retrieves the exit code from the task struct.
>
>
Why does proc_exit_connector need to move? It only uses
task->{pid,tgid,exit_code,exit_signal}, none of which
should be affected by exit_mm(), right?

-- Shailabh

> Moving these may also affect the job/pagg/task_notify/cpuset exit
>notification if we're eventually going to remove *direct* calls to these
>from kernel/exit.c.
>
>Cheers,
> -Matt Helsley
>
>
>
>


2006-01-05 18:45:42

by Matt Helsley

[permalink] [raw]
Subject: [PATCH 00/01] Move Exit Connectors

On Wed, 2006-01-04 at 15:17 -0800, Andrew Morton wrote:
> Matt Helsley <[email protected]> wrote:
> >
> > > We need to move both proc_exit_connector(tsk) and
> > > cnstats_exit_connector(tsk) up to before exit_mm(tsk) statement.
> > > There are task statistics collected in task->mm and those stats
> > > will be lost after exit_mm(tsk).
> > >
> > > Thanks,
> > > - jay
> > >
> > > > exit_notify(tsk);
> > > > #ifdef CONFIG_NUMA
> > > > mpol_free(tsk->mempolicy);
> > > >-
> >
> > Good point. The assignment of the task exit code will also have to move
> > up before exit_mm(tsk) because the process event connector exit function
> > retrieves the exit code from the task struct.
>
> Could someone please volunteer to do the patch?

Here are two separate patches (not a series).

The first simply moves the process event connector north of exit_mm().
It applies to a clean 2.6.15 kernel. Please consider it for -mm.

The second patch moves both -- it's intended to be applied on top of
Shailabh's series of patches.

Cheers,
-Matt Helsley

2006-01-05 19:21:04

by Matt Helsley

[permalink] [raw]
Subject: [PATCH 01/01][RFC] Move Exit Connectors

This patch moves the process event connector exit function above
exit_mm() with the expectation that it will later be removed from the
exit function with other calls that need to take place before exit_mm().

I'm looking into how this affects the exit_signal field. Please review
but do not apply.

Signed-off-by: Matt Helsley <[email protected]>

--

Index: linux-2.6.15/kernel/exit.c
===================================================================
--- linux-2.6.15.orig/kernel/exit.c
+++ linux-2.6.15/kernel/exit.c
@@ -844,10 +844,13 @@ fastcall NORET_TYPE void do_exit(long co
 	if (group_dead) {
 		del_timer_sync(&tsk->signal->real_timer);
 		exit_itimers(tsk->signal);
 		acct_process(code);
 	}
+
+	tsk->exit_code = code;
+	proc_exit_connector(tsk);
 	exit_mm(tsk);
 
 	exit_sem(tsk);
 	__exit_files(tsk);
 	__exit_fs(tsk);
@@ -860,13 +863,10 @@ fastcall NORET_TYPE void do_exit(long co
 	disassociate_ctty(1);
 
 	module_put(task_thread_info(tsk)->exec_domain->module);
 	if (tsk->binfmt)
 		module_put(tsk->binfmt->module);
-
-	tsk->exit_code = code;
-	proc_exit_connector(tsk);
 	exit_notify(tsk);
 #ifdef CONFIG_NUMA
 	mpol_free(tsk->mempolicy);
 	tsk->mempolicy = NULL;
 #endif


2006-01-05 19:24:05

by Matt Helsley

[permalink] [raw]
Subject: [PATCH 00/01] Move Exit Connectors

This patch moves both the process exit event and per-process stats
connectors above exit_mm() since the latter needs values from the
mm_struct which will be lost after exit_mm().

Signed-off-by: Matt Helsley <[email protected]>

--

Index: linux-2.6.15/kernel/exit.c
===================================================================
--- linux-2.6.15.orig/kernel/exit.c
+++ linux-2.6.15/kernel/exit.c
@@ -845,10 +845,14 @@ fastcall NORET_TYPE void do_exit(long co
 	if (group_dead) {
 		del_timer_sync(&tsk->signal->real_timer);
 		exit_itimers(tsk->signal);
 		acct_process(code);
 	}
+
+	tsk->exit_code = code;
+	proc_exit_connector(tsk);
+	cnstats_exit_connector(tsk);
 	exit_mm(tsk);
 
 	exit_sem(tsk);
 	__exit_files(tsk);
 	__exit_fs(tsk);
@@ -861,14 +865,10 @@ fastcall NORET_TYPE void do_exit(long co
 	disassociate_ctty(1);
 
 	module_put(task_thread_info(tsk)->exec_domain->module);
 	if (tsk->binfmt)
 		module_put(tsk->binfmt->module);
-
-	tsk->exit_code = code;
-	proc_exit_connector(tsk);
-	cnstats_exit_connector(tsk);
 	exit_notify(tsk);
 #ifdef CONFIG_NUMA
 	mpol_free(tsk->mempolicy);
 	tsk->mempolicy = NULL;
 #endif


2006-01-05 23:10:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/01] Move Exit Connectors

Matt Helsley <[email protected]> wrote:
>
> This patch moves both the process exit event and per-process stats
> connectors above exit_mm() since the latter needs values from the
> mm_struct which will be lost after exit_mm().
>
> Signed-off-by: Matt Helsley <[email protected]>
>
> --
>
> Index: linux-2.6.15/kernel/exit.c
> ===================================================================
> --- linux-2.6.15.orig/kernel/exit.c
> +++ linux-2.6.15/kernel/exit.c
> @@ -845,10 +845,14 @@ fastcall NORET_TYPE void do_exit(long co
> if (group_dead) {
> del_timer_sync(&tsk->signal->real_timer);
> exit_itimers(tsk->signal);
> acct_process(code);
> }
> +
> + tsk->exit_code = code;
> + proc_exit_connector(tsk);
> + cnstats_exit_connector(tsk);

cnstats_exit_connector() doesn't exist yet...


2006-01-06 00:12:49

by Matt Helsley

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, 2006-01-05 at 15:10 -0800, Andrew Morton wrote:
> Matt Helsley <[email protected]> wrote:
> >
> > This patch moves both the process exit event and per-process stats
> > connectors above exit_mm() since the latter needs values from the
> > mm_struct which will be lost after exit_mm().
> >
> > Signed-off-by: Matt Helsley <[email protected]>
> >
> > --
> >
> > Index: linux-2.6.15/kernel/exit.c
> > ===================================================================
> > --- linux-2.6.15.orig/kernel/exit.c
> > +++ linux-2.6.15/kernel/exit.c
> > @@ -845,10 +845,14 @@ fastcall NORET_TYPE void do_exit(long co
> > if (group_dead) {
> > del_timer_sync(&tsk->signal->real_timer);
> > exit_itimers(tsk->signal);
> > acct_process(code);
> > }
> > +
> > + tsk->exit_code = code;
> > + proc_exit_connector(tsk);
> > + cnstats_exit_connector(tsk);
>
> cnstats_exit_connector() doesn't exist yet...

Right. I forgot to repeat what I mentioned in the parent email -- that
this patch is intended to be applied on top of Shailabh's patches.

The first patch I posted (01/01) is intended for plain 2.6.15. Before
proposing 01/01 for -mm I've been trying to see if there are any
problems with the value of tsk->exit_signal before exit_mm() -- hence
the "[RFC]" in the subject line of that one.

Thanks,
-Matt Helsley

2006-01-06 08:57:10

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Matt" == Matt Helsley <[email protected]> writes:

Matt> Right. I forgot to repeat what I mentioned in the parent email
Matt> -- that this patch is intended to be applied on top of
Matt> Shailabh's patches.

Matt> The first patch I posted (01/01) is intended for plain
Matt> 2.6.15. Before proposing 01/01 for -mm I've been trying to see
Matt> if there are any problems with the value of tsk->exit_signal
Matt> before exit_mm() -- hence the "[RFC]" in the subject line of
Matt> that one.

Matt,

Any chance one of you could put up a set of current patches somewhere?
I am trying to make heads and tails of them and it's pretty hard as I
haven't been on lse-tech for long and the lse-tech mailing list
archives are useless due to the 99 to 1 SPAM ratio ;-(

I am quite concerned about that lock your patches put into struct
task_struct through struct task_delay_info. Have you done any
measurements on how this impacts performance on highly threaded apps
on larger systems?

IMHO it seems to make more sense to use something like Jack's proposed
task_notifier code to lock-less collect the data into task local data
structures and then take the data from there and ship off to userland
through netlink or similar like you are doing?

I am working on modifying Jack's patch to carry task local data and
use it for not just accounting but other areas that need optional
callbacks (optional in the sense that it's a feature that can be
enabled or disabled). Looking at Shailabh's delayacct_blkio() changes
it seems that it would be really easy to fit those into that
framework.

Guess I should post some of this code .....

Cheers,
Jes

2006-01-06 16:44:59

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

Jes Sorensen wrote:
>>>>>>"Matt" == Matt Helsley <[email protected]> writes:
>
>
> Matt> Right. I forgot to repeat what I mentioned in the parent email
> Matt> -- that this patch is intended to be applied on top of
> Matt> Shailabh's patches.
>
> Matt> The first patch I posted (01/01) is intended for plain
> Matt> 2.6.15. Before proposing 01/01 for -mm I've been trying to see
> Matt> if there are any problems with the value of tsk->exit_signal
> Matt> before exit_mm() -- hence the "[RFC]" in the subject line of
> Matt> that one.
>
> Matt,
>
> Any chance one of you could put up a set of current patches somewhere?

I'll upload the delay accounting patches to a newly created lse-tech package.

> I am trying to make heads and tails of them and it's pretty hard as I
> haven't been on lse-tech for long and the lse-tech mailing list
> archives are useless due to the 99 to 1 SPAM ratio ;-(
>


> I am quite concerned about that lock your patches put into struct
> task_struct through struct task_delay_info. Have you done any
> measurements on how this impacts performance on highly threaded apps
> on larger system?

I don't expect the lock contention to be high. The lock is held for a
very short time (across two additions/increments). Moreover, it gets
contended only when the stats are being read (either through /proc or connectors).
Since the reading of stats won't be that frequent (the utility of these
numbers is to influence the I/O priority/rss limit etc. which won't be done
at a very small granularity anyway), I wouldn't expect a problem.

But it's better to take some measurements anyway. Any suggestions on a
benchmark?

> IMHO it seems to make more sense to use something like Jack's proposed
> task_notifier code to lock-less collect the data into task local data
> structures and then take the data from there and ship off to userland
> through netlink or similar like you are doing?
>
> I am working on modifying Jack's patch to carry task local data and
> use it for not just accounting but other areas that need optional
> callbacks (optional in the sense that it's a feature that can be
> enabled or disabled). Looking at Shailabh's delayacct_blkio() changes
> it seems that it would be really easy to fit those into that
> framework.
>
> Guess I should post some of this code .....

Please do. If this accounting can fit into some other framework, that's fine too.

-- Shailabh

> Cheers,
> Jes
>

2006-01-11 10:36:36

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:

Shailabh> Jes Sorensen wrote:
>> I am quite concerned about that lock your patches put into struct
>> task_struct through struct task_delay_info. Have you done any
>> measurements on how this impacts performance on highly threaded
>> apps on larger system?

Shailabh> I don't expect the lock contention to be high. The lock is
Shailabh> held for a very short time (across two
Shailabh> additions/increments). Moreover, it gets contended only when
Shailabh> the stats are being read (either through /proc or
Shailabh> connectors). Since the reading of stats won't be that
Shailabh> frequent (the utility of these numbers is to influence the
Shailabh> I/O priority/rss limit etc. which won't be done at a very
Shailabh> small granularity anyway), I wouldn't expect a problem.

Hi Shailabh,

When this is read through connectors, it's initiated by the connectors
code which is called from the task's context hence we don't need
locking for that. It's very similar to the task_notify code I am about
to post and I think the connector code could fit into that
framework. The main issue is /proc, but then one could even have a
mechanism with a hook when the task exits that pushes the data to a
storage point which is lock protected.

Even if a lock isn't contended, you are still going to see the cache
lines bounce around due to the writes. It may not show up on a 4-way
box but what happens on a 64-way? We have seen some pretty nasty
effects on the bigger SN2 boxes with locks that were taken far too
frequently, to the point where it would prevent the box from booting
(now I don't expect it to that severe here, but I'd still like to
explore an approach of doing it lock free).

Shailabh> But its better to take some measurements anyway. Any
Shailabh> suggestions on a benchmark ?

>> IMHO it seems to make more sense to use something like Jack's
>> proposed task_notifier code to lock-less collect the data into task
>> local data structures and then take the data from there and ship
>> off to userland through netlink or similar like you are doing?
>>
>> I am working on modifying Jack's patch to carry task local data and
>> use it for not just accounting but other areas that need optional
>> callbacks (optional in the sense that it's a feature that can be
>> enabled or disabled). Looking at Shailabh's delayacct_blkio()
>> changes it seems that it would be really easy to fit those into
>> that framework.
>>
>> Guess I should post some of this code .....

Shailabh> Please do. If this accounting can fit into some other
Shailabh> framework, thats fine too.

Ok, finally, sorry for the delay. My current code snapshot is
available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
it's a modified version of Jack's task_notify code, and three example
users of it (the SysV IPC semundo semaphore, the key infrastructure
and SGI's JOB module). The patch order should be task_notify.diff,
task-notify-keys.diff, task-notify-semundo.diff, and
task_notify-job.diff last.

I think task_notify should be usable for statistics gathering as
well, the only issue is how to attach it to the processes we wish to
gather accounting for. Personally I am not a big fan of the current
concept where statistics are gathered for all tasks at all time but
just not exported until accounting is enabled.

I just had a quick look at the connector code and I think it could
possibly be hooked into the task_notify code as well as connectors
seem to be another optional feature.

The API for the task_notify code is not set in stone and we can add
more notifier hooks where needed. If someone has strong reasoning for
making changes to the API, then I'll be very open to that.

Anyway, let me know what you think?

Cheers,
Jes

2006-01-11 12:55:33

by John Hesterberg

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Wed, Jan 11, 2006 at 05:36:29AM -0500, Jes Sorensen wrote:
> I think task_notify it should be usable for statistics gathering as
> well, the only issue is how to attach it to the processes we wish to
> gather accounting for. Personally I am not a big fan of the current
> concept where statistics are gathered for all tasks at all time but
> just not exported until accounting is enabled.

I believe the accounting our customers require needs to be turned on
system-wide. In fact, I recall getting problem reports if there are
some processes not 'accounted' for. If you do it on a task basis,
and accounting gets turned on, you'd have to have a fool-proof way
of tracking down all the tasks in a system and turn on their accounting.

I would expect sites either want accounting on all the time for
everything, or not at all.

John

2006-01-11 13:50:38

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "John" == John Hesterberg <[email protected]> writes:

John> I believe the accounting our customers require needs to be
John> turned on system-wide. In fact, I recall getting problems
John> reports if there are some processes not 'accounted' for. If you
John> do it on a task basis, and accounting gets turned on, you'd have
John> to have a fool-proof way of tracking down all the tasks in a
John> system and turn on their accounting.

That's what I was scared someone would say ;-( The problem I am seeing
is that SGI wants certain things in the accounting block, I am sure
IBM has other things they wish to count and someone else will want to
count something else again. With all those numbers we may end up with
a relatively large block for accounting numbers in struct task_struct.

However it makes it a lot harder to use the task_notifiers for it ;-(

John> I would expect sites either want accounting on all the time for
John> everything, or not at all.

Right now a lot of the accounting is done by calling into a function
which checks a flag for whether accounting is enabled and returns if
it is not. That could easily be extended to check for the global
accounting flag + check that the task's accounting data structure has
been allocated. Then in certain places, like fork() and schedule(),
allocate it if it's not in place already and accounting has been
switched on. That way we could reduce the overhead to a single pointer
in struct task_struct and the accounting structure could (in theory)
grow indefinitely large. Or is this too much of a hack?

Comments?

Cheers,
Jes

2006-01-11 21:06:40

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> >>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:
>
> Shailabh> Jes Sorensen wrote:
> >> I am quite concerned about that lock your patches put into struct
> >> task_struct through struct task_delay_info. Have you done any
> >> measurements on how this impacts performance on highly threaded
> >> apps on larger system?
>
> Shailabh> I don't expect the lock contention to be high. The lock is
> Shailabh> held for a very short time (across two
> Shailabh> additions/increments). Moreover, it gets contended only when
> Shailabh> the stats are being read (either through /proc or
> Shailabh> connectors). Since the reading of stats won't be that
> Shailabh> frequent (the utility of these numbers is to influence the
> Shailabh> I/O priority/rss limit etc. which won't be done at a very
> Shailabh> small granularity anyway), I wouldn't expect a problem.
>
> Hi Shailabh,
>
> When this is read through connectors, it's initiated by the connectors
> code which is called from the task's context hence we don't need
> locking for that. It's very similar to the task_notify code I am about
> to post and I think the connector code could fit into that
> framework. The main issue is /proc, but then one could even have a
> mechanism with a hook when the task exits that pushes the data to a
> storage point which is lock protected.
>
> Even if a lock isn't contended, you are still going to see the cache
> lines bounce around due to the writes. It may not show up on a 4-way
> box but what happens on a 64-way? We have seen some pretty nasty
> effects on the bigger SN2 boxes with locks that were taken far too
> frequently, to the point where it would prevent the box from booting
> (now I don't expect it to that severe here, but I'd still like to
> explore an approach of doing it lock free).
>
> Shailabh> But its better to take some measurements anyway. Any
> Shailabh> suggestions on a benchmark ?
>
> >> IMHO it seems to make more sense to use something like Jack's
> >> proposed task_notifier code to lock-less collect the data into task
> >> local data structures and then take the data from there and ship
> >> off to userland through netlink or similar like you are doing?
> >>
> >> I am working on modifying Jack's patch to carry task local data and
> >> use it for not just accounting but other areas that need optional
> >> callbacks (optional in the sense that it's a feature that can be
> >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> >> changes it seems that it would be really easy to fit those into
> >> that framework.
> >>
> >> Guess I should post some of this code .....
>
> Shailabh> Please do. If this accounting can fit into some other
> Shailabh> framework, thats fine too.
>
> Ok, finally, sorry for the delay. My current code snapshot is
> available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> it's a modified version of Jack's task_notify code, and three example
> users of it (the SysV IPC semundo semaphore, the key infrastructure
> and SGI's JOB module). The patch order should be task_notify.diff,
> task-notify-keys.diff, task-notify-semundo.diff, and
> task_notify-job.diff last.

I can already tell you I don't like the "magic" mechanism to identify
notifier blocks. The problem is that it's yet another space of id
numbers that we have to manage -- either manually, by having a person
hand the numbers out to developers, or automatically using the idr code.
You could use the notifier block's address and avoid an additional id
space.

Also, even if this mechanism goes into task_notify it needs a better
name than "magic".

> I think task_notify it should be usable for statistics gathering as
> well, the only issue is how to attach it to the processes we wish to
> gather accounting for. Personally I am not a big fan of the current
> concept where statistics are gathered for all tasks at all time but
> just not exported until accounting is enabled.

Have you looked at Alan Stern's notifier chain fix patch? Could that be
used in task_notify?

If not, perhaps the patch could use the standard kernel list idioms.

Another potential user for the task_notify functionality is the process
events connector. The problem is it requires the ability to receive
notifications about all processes. Also, there's a chance that future
CKRM code could use all-task and per-task notification. Combined with
John Hesterberg's feedback I think there is strong justification for an
all-tasks notification interface.

If there was some way to quickly check for registration on an all-tasks
list you could do both all-task and per-task notification in the same
code. I took a shot at this a while back but the locking was incomplete.
Perhaps a per-cpu all-task notification list would satisfy your
performance-impact concerns.

<snip>

> The API for the task_notify code is not set in stone and we can add
> more notifier hooks where needed. If someone has strong reasoning for
> making changes to the API, then I'll be very open to that.

I'd like to see the all-task notification I mentioned above.
Notifications where uid/gids change could be useful for auditing and
process events connector.

<snip>

Cheers,
-Matt Helsley

2006-01-11 21:40:59

by John Hesterberg

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> > >>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:
> >
> > Shailabh> Jes Sorensen wrote:
> > >> I am quite concerned about that lock your patches put into struct
> > >> task_struct through struct task_delay_info. Have you done any
> > >> measurements on how this impacts performance on highly threaded
> > >> apps on larger system?
> >
> > Shailabh> I don't expect the lock contention to be high. The lock is
> > Shailabh> held for a very short time (across two
> > Shailabh> additions/increments). Moreover, it gets contended only when
> > Shailabh> the stats are being read (either through /proc or
> > Shailabh> connectors). Since the reading of stats won't be that
> > Shailabh> frequent (the utility of these numbers is to influence the
> > Shailabh> I/O priority/rss limit etc. which won't be done at a very
> > Shailabh> small granularity anyway), I wouldn't expect a problem.
> >
> > Hi Shailabh,
> >
> > When this is read through connectors, it's initiated by the connectors
> > code which is called from the task's context hence we don't need
> > locking for that. It's very similar to the task_notify code I am about
> > to post and I think the connector code could fit into that
> > framework. The main issue is /proc, but then one could even have a
> > mechanism with a hook when the task exits that pushes the data to a
> > storage point which is lock protected.
> >
> > Even if a lock isn't contended, you are still going to see the cache
> > lines bounce around due to the writes. It may not show up on a 4-way
> > box but what happens on a 64-way? We have seen some pretty nasty
> > effects on the bigger SN2 boxes with locks that were taken far too
> > frequently, to the point where it would prevent the box from booting
> > (now I don't expect it to be that severe here, but I'd still like to
> > explore an approach of doing it lock free).
> >
> > Shailabh> But its better to take some measurements anyway. Any
> > Shailabh> suggestions on a benchmark ?
> >
> > >> IMHO it seems to make more sense to use something like Jack's
> > >> proposed task_notifier code to lock-less collect the data into task
> > >> local data structures and then take the data from there and ship
> > >> off to userland through netlink or similar like you are doing?
> > >>
> > >> I am working on modifying Jack's patch to carry task local data and
> > >> use it for not just accounting but other areas that need optional
> > >> callbacks (optional in the sense that it's a feature that can be
> > >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> > >> changes it seems that it would be really easy to fit those into
> > >> that framework.
> > >>
> > >> Guess I should post some of this code .....
> >
> > Shailabh> Please do. If this accounting can fit into some other
> > Shailabh> framework, that's fine too.
> >
> > Ok, finally, sorry for the delay. My current code snapshot is
> > available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> > it's a modified version of Jack's task_notify code, and three example
> > users of it (the SysV IPC semundo semaphore, the key infrastructure
> > and SGI's JOB module). The patch order should be task_notify.diff,
> > task-notify-keys.diff, task-notify-semundo.diff, and
> > task_notify-job.diff last.
>
> I can already tell you I don't like the "magic" mechanism to identify
> notifier blocks. The problem is that it's yet another space of id
> numbers that we have to manage -- either manually, by having a person
> hand the numbers out to developers, or automatically using the idr code.
> You could use the notifier block's address and avoid an additional id
> space.
>
> Also, even if this mechanism goes into task_notify it needs a better
> name than "magic".
>
> > I think task_notify should be usable for statistics gathering as
> > well; the only issue is how to attach it to the processes we wish to
> > gather accounting for. Personally I am not a big fan of the current
> > concept where statistics are gathered for all tasks at all times but
> > just not exported until accounting is enabled.
>
> Have you looked at Alan Stern's notifier chain fix patch? Could that be
> used in task_notify?
>
> If not, perhaps the patch could use the standard kernel list idioms.
>
> Another potential user for the task_notify functionality is the process
> events connector. The problem is it requires the ability to receive
> notifications about all processes. Also, there's a chance that future
> CKRM code could use all-task and per-task notification. Combined with
> John Hesterberg's feedback I think there is strong justification for an
> all-tasks notification interface.

I have two concerns about an all-tasks notification interface.
First, we want this to scale, so we don't want more global locks.
One unique part of task_notify is that it doesn't use locks.

Second, in at least some of the cases we're familiar with,
even when we might need all-tasks notification we still need task-local
data. That's been a problem with some of the global mechanisms I've
seen discussed.

Cheers,

John

2006-01-11 22:48:25

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Wed, 2006-01-11 at 15:39 -0600, John Hesterberg wrote:
> On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> > On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> > > >>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:
> > >
> > > Shailabh> Jes Sorensen wrote:
> > > >> I am quite concerned about that lock your patches put into struct
> > > >> task_struct through struct task_delay_info. Have you done any
> > > >> measurements on how this impacts performance on highly threaded
> > > >> apps on larger system?
> > >
> > > Shailabh> I don't expect the lock contention to be high. The lock is
> > > Shailabh> held for a very short time (across two
> > > Shailabh> additions/increments). Moreover, it gets contended only when
> > > Shailabh> the stats are being read (either through /proc or
> > > Shailabh> connectors). Since the reading of stats won't be that
> > > Shailabh> frequent (the utility of these numbers is to influence the
> > > Shailabh> I/O priority/rss limit etc. which won't be done at a very
> > > Shailabh> small granularity anyway), I wouldn't expect a problem.
> > >
> > > Hi Shailabh,
> > >
> > > When this is read through connectors, it's initiated by the connectors
> > > code which is called from the task's context hence we don't need
> > > locking for that. It's very similar to the task_notify code I am about
> > > to post and I think the connector code could fit into that
> > > framework. The main issue is /proc, but then one could even have a
> > > mechanism with a hook when the task exits that pushes the data to a
> > > storage point which is lock protected.
> > >
> > > Even if a lock isn't contended, you are still going to see the cache
> > > lines bounce around due to the writes. It may not show up on a 4-way
> > > box but what happens on a 64-way? We have seen some pretty nasty
> > > effects on the bigger SN2 boxes with locks that were taken far too
> > > frequently, to the point where it would prevent the box from booting
> > > (now I don't expect it to be that severe here, but I'd still like to
> > > explore an approach of doing it lock free).
> > >
> > > Shailabh> But its better to take some measurements anyway. Any
> > > Shailabh> suggestions on a benchmark ?
> > >
> > > >> IMHO it seems to make more sense to use something like Jack's
> > > >> proposed task_notifier code to lock-less collect the data into task
> > > >> local data structures and then take the data from there and ship
> > > >> off to userland through netlink or similar like you are doing?
> > > >>
> > > >> I am working on modifying Jack's patch to carry task local data and
> > > >> use it for not just accounting but other areas that need optional
> > > >> callbacks (optional in the sense that it's a feature that can be
> > > >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> > > >> changes it seems that it would be really easy to fit those into
> > > >> that framework.
> > > >>
> > > >> Guess I should post some of this code .....
> > >
> > > Shailabh> Please do. If this accounting can fit into some other
> > > Shailabh> framework, that's fine too.
> > >
> > > Ok, finally, sorry for the delay. My current code snapshot is
> > > available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> > > it's a modified version of Jack's task_notify code, and three example
> > > users of it (the SysV IPC semundo semaphore, the key infrastructure
> > > and SGI's JOB module). The patch order should be task_notify.diff,
> > > task-notify-keys.diff, task-notify-semundo.diff, and
> > > task_notify-job.diff last.
> >
> > I can already tell you I don't like the "magic" mechanism to identify
> > notifier blocks. The problem is that it's yet another space of id
> > numbers that we have to manage -- either manually, by having a person
> > hand the numbers out to developers, or automatically using the idr code.
> > You could use the notifier block's address and avoid an additional id
> > space.
> >
> > Also, even if this mechanism goes into task_notify it needs a better
> > name than "magic".
> >
> > > I think task_notify should be usable for statistics gathering as
> > > well; the only issue is how to attach it to the processes we wish to
> > > gather accounting for. Personally I am not a big fan of the current
> > > concept where statistics are gathered for all tasks at all times but
> > > just not exported until accounting is enabled.
> >
> > Have you looked at Alan Stern's notifier chain fix patch? Could that be
> > used in task_notify?
> >
> > If not, perhaps the patch could use the standard kernel list idioms.
> >
> > Another potential user for the task_notify functionality is the process
> > events connector. The problem is it requires the ability to receive
> > notifications about all processes. Also, there's a chance that future
> > CKRM code could use all-task and per-task notification. Combined with
> > John Hesterberg's feedback I think there is strong justification for an
> > all-tasks notification interface.
>
> I have two concerns about an all-tasks notification interface.
> First, we want this to scale, so don't want more global locks.
> One unique part of the task notify is that it doesn't use locks.

Are you against global locks in these paths based solely on principle
or have you measured their impact on scalability in those locations?

Without measurement there are two conflicting principles here: code
complexity vs. scalability.

If you've made measurements I'm curious to know what kind of locks were
measured, where they were used, what the load was doing -- as a rough
characterization of the pattern of notifications -- and how much it
impacted scalability. This information might help suggest a better
solution.

> Second, in at least some of the cases we're familiar with,
> even when we might need all-tasks notification we still need task-local
> data. That's been a problem with some of the global mechanisms I've
> seen discussed.

The per-task notification can allow you to have task-local data. Would
registration of per-task notification from within an all-task
notification be sufficient?

Cheers,
-Matt Helsley

2006-01-12 03:30:23

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

John Hesterberg (on Wed, 11 Jan 2006 15:39:10 -0600) wrote:
>On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
>> Have you looked at Alan Stern's notifier chain fix patch? Could that be
>> used in task_notify?
>
>I have two concerns about an all-tasks notification interface.
>First, we want this to scale, so don't want more global locks.
>One unique part of the task notify is that it doesn't use locks.

Neither does Alan Stern's atomic notifier chain. Indeed it cannot use
locks, because the atomic notifier chains can be called from anywhere,
including non maskable interrupts. The downside is that Alan's atomic
notifier chains require RCU.

An alternative patch that requires no locks and does not even require
RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2

2006-01-12 05:04:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, Jan 12, 2006 at 02:29:52PM +1100, Keith Owens wrote:
> John Hesterberg (on Wed, 11 Jan 2006 15:39:10 -0600) wrote:
> >On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> >> Have you looked at Alan Stern's notifier chain fix patch? Could that be
> >> used in task_notify?
> >
> >I have two concerns about an all-tasks notification interface.
> >First, we want this to scale, so don't want more global locks.
> >One unique part of the task notify is that it doesn't use locks.
>
> Neither does Alan Stern's atomic notifier chain. Indeed it cannot use
> locks, because the atomic notifier chains can be called from anywhere,
> including non maskable interrupts. The downside is that Alan's atomic
> notifier chains require RCU.
>
> An alternative patch that requires no locks and does not even require
> RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2

Interesting! Missed this on the first time around...

But doesn't notifier_call_chain_lockfree() need to either disable
preemption or use atomic operations to update notifier_chain_lockfree_inuse[]
in order to avoid problems with preemption? If I understand the
code, one such problem could be caused by the following sequence
of events:

1. Task A enters notifier_call_chain_lockfree(), gets a copy
of the current CPU in local variable "cpu", snapshots the
(initially zero) value of notifier_chain_lockfree_inuse[cpu]
into local variable "nested", then is preempted.

2. Task B enters notifier_call_chain_lockfree(), gets a copy
of the current CPU in local variable "cpu", snapshots the
(still zero) value of notifier_chain_lockfree_inuse[cpu]
into local variable "nested", sets the value of
notifier_chain_lockfree_inuse[cpu] to 1.

3. Task A runs again, perhaps because Task B's priority dropped,
perhaps because some other CPU became available. It also
sets the value of notifier_chain_lockfree_inuse[cpu] to 1.
It then gains a reference to a notifier_block (call it Fred).

4. Task B completes running through the notifier chain and sets
notifier_chain_lockfree_inuse[cpu] = nested, which is zero.

5. Task C invokes notifier_chain_unregister_lockfree() in order
to remove Fred. Task C finds all notifier_chain_lockfree_inuse[cpu]
entries equal to zero, so removes Fred while Task A is still
referencing it. Which I believe is what was to be prevented.

If one updates notifier_chain_lockfree_inuse[cpu] using atomics,
then one could imagine a sequence of calls to notifier_call_chain_lockfree()
and preemptions that prevented one of the notifier_chain_lockfree_inuse[]
elements from ever reaching zero (though maybe this is being overly
paranoid). If one disables preemption, then latency might become
excessive.

So what am I missing?

Thanx, Paul

2006-01-12 05:36:29

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, 2006-01-12 at 14:29 +1100, Keith Owens wrote:
> John Hesterberg (on Wed, 11 Jan 2006 15:39:10 -0600) wrote:
> >On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> >> Have you looked at Alan Stern's notifier chain fix patch? Could that be
> >> used in task_notify?
> >
> >I have two concerns about an all-tasks notification interface.
> >First, we want this to scale, so don't want more global locks.
> >One unique part of the task notify is that it doesn't use locks.
>
> Neither does Alan Stern's atomic notifier chain. Indeed it cannot use
> locks, because the atomic notifier chains can be called from anywhere,
> including non maskable interrupts. The downside is that Alan's atomic
> notifier chains require RCU.
>
> An alternative patch that requires no locks and does not even require
> RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2
>

Interesting. Might the 'inuse' flags suffer from bouncing due to false
sharing?

Cheers,
-Matt Helsley

2006-01-12 05:39:47

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

"Paul E. McKenney" (on Wed, 11 Jan 2006 21:04:53 -0800) wrote:
>On Thu, Jan 12, 2006 at 02:29:52PM +1100, Keith Owens wrote:
>> An alternative patch that requires no locks and does not even require
>> RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2
>
>But doesn't notifier_call_chain_lockfree() need to either disable
>preemption or use atomic operations to update notifier_chain_lockfree_inuse[]
>in order to avoid problems with preemption?

Yes it does :(. Originally the atomic notifier chains were only called
from atomic contexts (typically interrupt context), but if they are
going to be generalized to all contexts, then the code is not good
enough. These lines either assume no preemption or that preemption is
stacked in LIFO order, which is not guaranteed.

int notifier_call_chain_lockfree(struct notifier_block **list,
                                 unsigned long val, void *v)
{
        int ret = NOTIFY_DONE, cpu = smp_processor_id(), nested;
        struct notifier_block *nb;

        nested = notifier_chain_lockfree_inuse[cpu];
        notifier_chain_lockfree_inuse[cpu] = 1;
        wmb();
        nb = *list;
        while (nb) {
                smp_read_barrier_depends();
                ret = nb->notifier_call(nb, val, v);
                if (ret & NOTIFY_STOP_MASK)
                        break;
                nb = nb->next;
        }
        barrier();
        notifier_chain_lockfree_inuse[cpu] = nested;
        return ret;
}

So either disable preemption in notifier_call_chain_lockfree (including
all the callbacks that it invokes) or notifier_chain_lockfree_inuse has
to be an atomic_t. atomic_t would be better, but it could cause a
problem on architectures that implement atomic_t via hashed spinlocks
_and_ have non maskable interrupts that call notifier_call_chain_lockfree().

Going away to think about this ...

2006-01-12 05:47:30

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

Matt Helsley (on Wed, 11 Jan 2006 21:26:08 -0800) wrote:
>On Thu, 2006-01-12 at 14:29 +1100, Keith Owens wrote:
>> An alternative patch that requires no locks and does not even require
>> RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2
>>
>
> Interesting. Might the 'inuse' flags suffer from bouncing due to false
>sharing?

It would, but that was the least of my worries. Algorithm correctness
first, then tune it. If necessary it could be converted to a per cpu
variable, as long as the timing between creating the per cpu variables
and invoking callbacks from hotplug cpu was correct.

2006-01-12 06:21:43

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

"Paul E. McKenney" (on Wed, 11 Jan 2006 21:04:53 -0800) wrote:
>On Thu, Jan 12, 2006 at 02:29:52PM +1100, Keith Owens wrote:
>> John Hesterberg (on Wed, 11 Jan 2006 15:39:10 -0600) wrote:
>> >On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
>> >> Have you looked at Alan Stern's notifier chain fix patch? Could that be
>> >> used in task_notify?
>> >
>> >I have two concerns about an all-tasks notification interface.
>> >First, we want this to scale, so don't want more global locks.
>> >One unique part of the task notify is that it doesn't use locks.
>>
>> Neither does Alan Stern's atomic notifier chain. Indeed it cannot use
>> locks, because the atomic notifier chains can be called from anywhere,
>> including non maskable interrupts. The downside is that Alan's atomic
>> notifier chains require RCU.
>>
>> An alternative patch that requires no locks and does not even require
>> RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2
>
>Interesting! Missed this on the first time around...
>
>But doesn't notifier_call_chain_lockfree() need to either disable
>preemption or use atomic operations to update notifier_chain_lockfree_inuse[]
>in order to avoid problems with preemption?

OK, I have thought about it and ...

notifier_call_chain_lockfree() must be called with preempt disabled.

The justification for this routine is to handle all the nasty
corner cases in the notify_die() and similar chains, including panic,
spinlocks being held and even non maskable interrupts. It is silly to
try to make notifier_call_chain_lockfree() handle the preemptible case
as well.

If notifier_call_chain_lockfree() is to be called for task
notification, then the caller must disable preempt around the call to
notifier_call_chain_lockfree(). Scalability over lots of cpus should
not be an issue, especially if notifier_chain_lockfree_inuse[] is
converted to a per cpu variable. The amount of time spent with preempt
disabled is proportional to the number of registered callbacks on the
task notification chain and to the amount of work performed by those
callbacks, neither of which should be high.

2006-01-12 06:50:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
> "Paul E. McKenney" (on Wed, 11 Jan 2006 21:04:53 -0800) wrote:
> >On Thu, Jan 12, 2006 at 02:29:52PM +1100, Keith Owens wrote:
> >> John Hesterberg (on Wed, 11 Jan 2006 15:39:10 -0600) wrote:
> >> >On Wed, Jan 11, 2006 at 01:02:10PM -0800, Matt Helsley wrote:
> >> >> Have you looked at Alan Stern's notifier chain fix patch? Could that be
> >> >> used in task_notify?
> >> >
> >> >I have two concerns about an all-tasks notification interface.
> >> >First, we want this to scale, so don't want more global locks.
> >> >One unique part of the task notify is that it doesn't use locks.
> >>
> >> Neither does Alan Stern's atomic notifier chain. Indeed it cannot use
> >> locks, because the atomic notifier chains can be called from anywhere,
> >> including non maskable interrupts. The downside is that Alan's atomic
> >> notifier chains require RCU.
> >>
> >> An alternative patch that requires no locks and does not even require
> >> RCU is in http://marc.theaimsgroup.com/?l=linux-kernel&m=113392370322545&w=2
> >
> >Interesting! Missed this on the first time around...
> >
> >But doesn't notifier_call_chain_lockfree() need to either disable
> >preemption or use atomic operations to update notifier_chain_lockfree_inuse[]
> >in order to avoid problems with preemption?
>
> OK, I have thought about it and ...
>
> notifier_call_chain_lockfree() must be called with preempt disabled.
>
> The justification for this routine is to handle all the nasty
> corner cases in the notify_die() and similar chains, including panic,
> spinlocks being held and even non maskable interrupts. It is silly to
> try to make notifier_call_chain_lockfree() handle the preemptible case
> as well.

Fair enough! A comment, perhaps? In a former life I would have also
demanded debug code to verify that preemption/interrupts/whatever were
actually disabled, given the very subtle nature of any resulting bugs...

> If notifier_call_chain_lockfree() is to be called for task
> notification, then the caller must disable preempt around the call to
> notifier_call_chain_lockfree(). Scalability over lots of cpus should
> not be an issue, especially if notifier_chain_lockfree_inuse[] is
> converted to a per cpu variable. The amount of time spent with preempt
> disabled is proportional to the number of registered callbacks on the
> task notification chain and to the amount of work performed by those
> callbacks, neither of which should be high.

Ah, but the guys doing the latency measurements will be the judge of
that, right? ;-)

Thanx, Paul

2006-01-12 07:51:07

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

"Paul E. McKenney" (on Wed, 11 Jan 2006 22:51:15 -0800) wrote:
>On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
>> OK, I have thought about it and ...
>>
>> notifier_call_chain_lockfree() must be called with preempt disabled.
>>
>Fair enough! A comment, perhaps? In a former life I would have also
>demanded debug code to verify that preemption/interrupts/whatever were
>actually disabled, given the very subtle nature of any resulting bugs...

Comment - OK. Debug code is not required, the reference to
smp_processor_id() already does all the debug checks that
notifier_call_chain_lockfree() needs. CONFIG_DEBUG_PREEMPT is your
friend.

2006-01-12 09:51:32

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Matt" == Matt Helsley <[email protected]> writes:

Matt> On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:

Matt> I can already tell you I don't like the "magic" mechanism to
Matt> identify notifier blocks. The problem is that it's yet another
Matt> space of id numbers that we have to manage -- either manually,
Matt> by having a person hand the numbers out to developers, or
Matt> automatically using the idr code. You could use the notifier
Matt> block's address and avoid an additional id space.

Hi Matt,

I did debate with myself whether or not to take the 'magic' approach.
I don't think the allocation is a big issue, it's all compile time
determined and as long as people stick it in task_notifier.h it
doesn't matter if their magic number is changed when it goes into the
official tree.

Using the notifier block's address won't work, they are often
dynamically allocated so a client searching for it will rarely know
the address.

The alternative is to search for a function pointer of one of the call
out functions; however, it requires all users to provide a specific
function pointer, or we need to add some flags for the search function
(eg. one of the TN_ ones could be passed in), but this would
complicate the search function and make it slower. This is the
approach Jack took originally, however his patch had
just a single notifier function and it was the user's responsibility
to write the code for the demultiplexing of the callouts. I do not
like this approach as it will be a lot more error prone with everyone
doing their own version.

Matt> Also, even if this mechanism goes into task_notify it needs a
Matt> better name than "magic".

Magic is fairly standard for this kind of feature in the kernel,
grep MAGIC include/linux/*.h ;-)

>> I think task_notify should be usable for statistics gathering as
>> well; the only issue is how to attach it to the processes we wish
>> to gather accounting for. Personally I am not a big fan of the
>> current concept where statistics are gathered for all tasks at all
>> times but just not exported until accounting is enabled.

Matt> Have you looked at Alan Stern's notifier chain fix patch?
Matt> Could that be used in task_notify?

No sorry, do you have a pointer?

Matt> If not, perhaps the patch could use the standard kernel list idioms.

Matt> Another potential user for the task_notify functionality is
Matt> the process events connector. The problem is it requires the
Matt> ability to receive notifications about all processes. Also,
Matt> there's a chance that future CKRM code could use all-task and
Matt> per-task notification. Combined with John Hesterberg's feedback
Matt> I think there is strong justification for an all-tasks
Matt> notification interface.

I am wary of this. I don't see how we will be able to do all-task
notification efficiently without having to traverse the full task list
and take a ton of global locks. If anybody has any idea how to get
around this problem I'd be very interested in hearing their suggestions.

Matt> If there was some way to quickly check for registration on an
Matt> all-tasks list you could do both all-task and per-task
Matt> notification in the same code. I took a shot at this a while
Matt> back but the locking was incomplete. Perhaps a per-cpu all-task
Matt> notification list would satisfy your performance-impact
Matt> concerns.

That could be the approach to take to get around it, but I must admit
I don't know the scheduler well enough to say whether every task can
be assigned to a single CPU at any given time, and then there's
the issue of locking when they migrate. But again, the latter may
already be handled by the scheduler?

>> The API for the task_notify code is not set in stone and we can add
>> more notifier hooks where needed. If someone has strong reasoning
>> for making changes to the API, then I'll be very open to that.

Matt> I'd like to see the all-task notification I mentioned above.
Matt> Notifications where uid/gids change could be useful for auditing
Matt> and process events connector.

If anybody is willing to step up and propose a patch for this I'd be
interested in taking a look. I don't offhand see a simple solution
for it.

Regards,
Jes

2006-01-12 10:01:04

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Matt" == Matt Helsley <[email protected]> writes:

Matt> On Wed, 2006-01-11 at 15:39 -0600, John Hesterberg wrote:
>> I have two concerns about an all-tasks notification interface.
>> First, we want this to scale, so don't want more global locks. One
>> unique part of the task notify is that it doesn't use locks.

Matt> Are you against global locks in these paths based solely on
Matt> principle or have you measured their impact on scalability in
Matt> those locations?

Matt,

It all depends on the specific location of the locks and how often
they are taken. As long as something is taken by the same CPU all the
time it is not going to be a major issue, but if we end up with
anything resembling a global lock, even if it is only held for a very
short time, it is going to bite badly. On a 4-way you probably won't notice,
but go to a 64-way and it bites, on a 512-way it eats you alive (we
had a problem in the timer code quite a while back that prevented the
machine from booting - it wasn't a lock that was held for a long time,
just the fact that every CPU would take it every HZ was enough).

Matt> Without measurement there are two conflicting principles here:
Matt> code complexity vs. scalability.

The two should never be mutually exclusive as long as we are careful.

Matt> If you've made measurements I'm curious to know what kind of
Matt> locks were measured, where they were used, what the load was
Matt> doing -- as a rough characterization of the pattern of
Matt> notifications -- and how much it impacted scalability. This
Matt> information might help suggest a better solution.

The one I mentioned above was in the timer interrupt path.

Cheers,
Jes

2006-01-12 15:18:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, Jan 12, 2006 at 06:50:34PM +1100, Keith Owens wrote:
> "Paul E. McKenney" (on Wed, 11 Jan 2006 22:51:15 -0800) wrote:
> >On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
> >> OK, I have thought about it and ...
> >>
> >> notifier_call_chain_lockfree() must be called with preempt disabled.
> >>
> >Fair enough! A comment, perhaps? In a former life I would have also
> >demanded debug code to verify that preemption/interrupts/whatever were
> >actually disabled, given the very subtle nature of any resulting bugs...
>
> Comment - OK. Debug code is not required, the reference to
> smp_processor_id() already does all the debug checks that
> notifier_call_chain_lockfree() needs. CONFIG_PREEMPT_DEBUG is your
> friend.

Ah, debug_smp_processor_id(). Very cool!!!

Thanx, Paul

2006-01-12 23:07:07

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <[email protected]> writes:
>
> Matt> On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
>
> Matt> I can already tell you I don't like the "magic" mechanism to
> Matt> identify notifier blocks. The problem is that it's yet another
> Matt> space of id numbers that we have to manage -- either manually,
> Matt> by having a person hand the numbers out to developers, or
> Matt> automatically using the idr code. You could use the notifier
> Matt> block's address and avoid an additional id space.
>
> Hi Matt,
>
> I did debate with myself whether or not to take the 'magic' approach.
> I don't think the allocation is a big issue, it's all compile time
> determined and as long as people stick it in task_notifier.h it
> doesn't matter if their magic number is changed when it goes into the
> official tree.
>
> Using the notifier block's address won't work, they are often
> dynamically allocated so a client searching for it will rarely know
> the address.

If the task_notify user is responsible for allocating the notifier
block then it could easily keep a copy of the pointer handy for the
corresponding deregistration.

> The alternative is to search for a function pointer of one of the call
> out functions, however it requires that all users provide a specific
> function pointer or we need to add some flags for the search function
> (eg. one of the TN_ ones could be passed in), but this would
> complicate the search function and make it slower. This is the
> approach Jack took in his original approach, however Jack's patch had
> just a single notifier function and it was the user's responsibility
> to write the code for the demultiplexing of the callouts. I do not
> like this approach as it will be a lot more error prone with everyone
> doing their own version.

I don't see how it will be error prone. Jack's interface was simple.
And if we're really worried about users messing it up we could generate a
set of macros that users must use to demultiplex the call.

One function pointer per event does bloat the notifier block structure.
Given that this notifier block may need to be replicated per-task this
is a significant amount of space. Jack's demultiplexing approach
requires that the space be proportional to the number of event functions
instead of proportional to the number of notifier blocks.

Furthermore, if new task events are added this structure would need to
be expanded. In contrast, Jack's approach only required the addition of
a new event value.

For these reasons I think using a single function pointer will be much
more effective.
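The two layouts under discussion can be sketched in plain userland C. This is only an illustration of the demultiplexing idea; all names here (tn_event, tn_block, TN_FORK, and so on) are invented and are not taken from Jack's or Jes's actual patches:

```c
/* Hypothetical names for illustration only -- not from any posted patch. */
enum tn_event { TN_FORK, TN_EXEC, TN_EXIT, TN_NEVENTS };

/* Jack's style: one callback per block, demultiplexed on an event code.
 * Adding a new event grows the enum, not this structure. */
struct tn_block {
	int (*notify)(struct tn_block *self, enum tn_event ev, void *data);
	struct tn_block *next;
	void *private_data;	/* per-client data pointer */
};

static int seen[TN_NEVENTS];

static int my_notify(struct tn_block *self, enum tn_event ev, void *data)
{
	(void)self; (void)data;
	seen[ev]++;		/* client demultiplexes here, e.g. with a switch */
	return 0;
}

/* One indirect call per registered block, whatever the event is. */
static void tn_call_chain(struct tn_block *head, enum tn_event ev, void *data)
{
	struct tn_block *b;

	for (b = head; b; b = b->next)
		b->notify(b, ev, data);
}

int tn_demo(void)
{
	struct tn_block blk = { .notify = my_notify };

	tn_call_chain(&blk, TN_FORK, 0);
	tn_call_chain(&blk, TN_FORK, 0);
	tn_call_chain(&blk, TN_EXIT, 0);
	return seen[TN_FORK] * 10 + seen[TN_EXIT];
}
```

With the per-event alternative, struct tn_block would instead carry one function pointer per TN_* value, so every new event enlarges every notifier block, which is the space argument made above.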

> Matt> Also, even if this mechanism goes into task_notify it needs a
> Matt> better name than "magic".
>
> Magic is fairly standard for this kind of feature in the kernel,
> grep MAGIC include/linux/*.h ;-)

Using ids and pointers in the kernel is pretty standard too. I invite
you to grep for those and count them too ;)

Seriously, it's a nit but I think the name could better reflect the
purpose of the field.

> >> I think task_notify it should be usable for statistics gathering as
> >> well, the only issue is how to attach it to the processes we wish
> >> to gather accounting for. Personally I am not a big fan of the
> >> current concept where statistics are gathered for all tasks at all
> >> time but just not exported until accounting is enabled.
>
> Matt> Have you looked at Alan Stern's notifier chain fix patch?
> Matt> Could that be used in task_notify?
>
> No sorry, do you have a pointer?

No problem. Here it is:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2

I think it would be ideal if task_notify could simply be a notifier
chain for notifying users of task events/changes.

> Matt> If not, perhaps the patch could use the standard kernel list idioms.
> Matt> Another potential user for the task_notify functionality is
> Matt> the process events connector. The problem is it requires the
> Matt> ability to receive notifications about all processes. Also,
> Matt> there's a chance that future CKRM code could use all-task and
> Matt> per-task notification. Combined with John Hesterberg's feedback
> Matt> I think there is strong justification for an all-tasks
> Matt> notification interface.
>
> I am wary of this. I don't see how we effectively will be able to do
> an all task notification without having to traverse the full task list
and take a ton of global locks. If anybody has any idea to get around
> this problem I'd be very interested in hearing their suggestions.

Why would we have to traverse the full task list?! Just add one
notifier block to a single, global list of notifiers. Keith's patch
demonstrates one potential method of avoiding a lock in the common path:
traversing the notification chain and making the calls.
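The lock-free call path being argued for here can be modelled in userland with C11 atomics. This is a sketch of the general idea only, with invented names: registration pushes at an atomic list head, and the call path simply walks the chain without taking any lock. Safe unregistration while calls are in flight, which is the part Keith's patch addresses with per-CPU reference counts, is deliberately omitted:

```c
#include <stdatomic.h>

/* Illustrative names only -- not Keith's actual interface. */
struct notifier {
	void (*fn)(void *data);
	struct notifier *_Atomic next;
};

static struct notifier *_Atomic chain_head;

/* Registration: lock-free push at the head via compare-exchange. */
static void notifier_register(struct notifier *nb)
{
	struct notifier *old = atomic_load(&chain_head);

	do {
		atomic_store(&nb->next, old);
	} while (!atomic_compare_exchange_weak(&chain_head, &old, nb));
}

/* The common path -- traversing the chain and making the calls --
 * takes no lock at all. */
static void notifier_call_chain_lockfree(void *data)
{
	struct notifier *nb;

	for (nb = atomic_load(&chain_head); nb; nb = atomic_load(&nb->next))
		nb->fn(data);
}

static int calls;

static void bump(void *data)
{
	calls += *(int *)data;
}

int chain_demo(void)
{
	static struct notifier a = { .fn = bump }, b = { .fn = bump };
	int ev = 5;

	notifier_register(&a);
	notifier_register(&b);
	notifier_call_chain_lockfree(&ev);
	return calls;
}
```

Registration is rare, so the compare-exchange retry loop there is cheap; what matters for scalability is that the call path does no stores to shared state and therefore does not bounce cache lines between CPUs.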

> Matt> If there was some way to quickly check for registration on an
> Matt> all-tasks list you could do both all-task and per-task
> Matt> notification in the same code. I took a shot at this a while
> Matt> back but the locking was incomplete. Perhaps a per-cpu all-task
> Matt> notification list would satisfy your performance-impact
> Matt> concerns.
>
> That could be the approach to take to get around it, but I must admit
> I don't know the schedule enough to say whether we will be able to
> assign all tasks to one single CPU at any given time and then there's
> the issue of locking when they migrate. But again, the latter may
> already be handled by the scheduler?

"all-task" means the notification block calls its function when any
task triggers a traversal of the notification lists. This does not imply
that registration/call/deregister of an all-task notification must
traverse all tasks.

> >> The API for the task_notify code is not set in stone and we can add
> >> more notifier hooks where needed. If someone has strong reasoning
> >> for making changes to the API, then I'll be very open to that.
>
> Matt> I'd like to see the all-task notification I mentioned above.
> Matt> Notifications where uid/gids change could be useful for auditing
> Matt> and process events connector.
>
> If anybody is willing to step up and propose a patch for this I'd be
> interested in taking a look. I don't off hand see a simple solution
> for it.

Keith posted an interesting patch for notification chains that I
believe could address your concerns -- though from what I've read it
needs to disable preemption. As far as adding notifications for uid/gid
change, that's relatively trivial. You might look at some revisions of
task_notify that Chandra Seetharaman and I have posted.

Cheers,
-Matt

2006-01-12 23:26:37

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, 2006-01-12 at 05:01 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <[email protected]> writes:
>
> Matt> On Wed, 2006-01-11 at 15:39 -0600, John Hesterberg wrote:
> >> I have two concerns about an all-tasks notification interface.
> >> First, we want this to scale, so don't want more global locks. One
> >> unique part of the task notify is that it doesn't use locks.
>
> Matt> Are you against global locks in these paths based solely on
> Matt> principle or have you measured their impact on scalability in
> Matt> those locations?
>
> Matt,
>
> It all depends on the specific location of the locks and how often
> they are taken. As long as something is taken by the same CPU all the
> time, it is not going to be a major issue, but if we end up with anything
> resembling a global lock, even if it is only held for a very short
> time, it is going to bite badly. On a 4-way you probably won't notice,
> but go to a 64-way and it bites, on a 512-way it eats you alive (we
> had a problem in the timer code quite a while back that prevented the
> machine from booting - it wasn't a lock that was held for a long time,
> just the fact that every CPU would take it every HZ was enough).

OK, so you've established that global locks in timer paths are Bad.
However you haven't established that timer paths are almost the same as
the task paths we're talking about. I suspect timer paths are used much
more frequently than fork, exec, or exit.

I've appended a small patch that adds a global lock to the task_notify
paths for testing purposes. I'm curious to know what kind of a
performance difference you would see on your 64 or 512-way if you were
to run with it.

Looking at the ideas for lockless implementations I'm curious how well
Keith's notifier chains patch would work in this case. It does not
acquire a global lock in the "call" path and, as Keith suggested,
probably can be modified to avoid cache bouncing.

Cheers,
-Matt Helsley

> Matt> Without measurement there are two conflicting principles here:
> Matt> code complexity vs. scalability.
>
> The two should never be mutually exclusive as long as we are careful.
>
> Matt> If you've made measurements I'm curious to know what kind of
> Matt> locks were measured, where they were used, what the load was
> Matt> doing -- as a rough characterization of the pattern of
> Matt> notifications -- and how much it impacted scalability. This
> Matt> information might help suggest a better solution.
>
> The one I mentioned above was in the timer interrupt path.
>
> Cheers,
> Jes


Index: linux-2.6.15/kernel/fork.c
===================================================================
--- linux-2.6.15.orig/kernel/fork.c
+++ linux-2.6.15/kernel/fork.c
@@ -849,10 +849,12 @@ asmlinkage long sys_set_tid_address(int
current->clear_child_tid = tidptr;

return current->pid;
}

+extern spinlock_t proposed_global_lock;
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
*
* It copies the registers, and all the appropriate
@@ -1137,12 +1139,14 @@ static task_t *copy_process(unsigned lon
p->signal->tty = NULL;

nr_threads++;
total_forks++;
write_unlock_irq(&tasklist_lock);
+ spin_lock(&proposed_global_lock);
proc_fork_connector(p);
cpuset_fork(p);
+ spin_unlock(&proposed_global_lock);
retval = 0;

fork_out:
if (retval)
return ERR_PTR(retval);
Index: linux-2.6.15/kernel/exit.c
===================================================================
--- linux-2.6.15.orig/kernel/exit.c
+++ linux-2.6.15/kernel/exit.c
@@ -784,10 +784,11 @@ static void exit_notify(struct task_stru
/* If the process is dead, release it - nobody will wait for it */
if (state == EXIT_DEAD)
release_task(tsk);
}

+extern spinlock_t proposed_global_lock;
fastcall NORET_TYPE void do_exit(long code)
{
struct task_struct *tsk = current;
int group_dead;

@@ -844,10 +845,12 @@ fastcall NORET_TYPE void do_exit(long co
if (group_dead) {
del_timer_sync(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
acct_process(code);
}
+ spin_lock(&proposed_global_lock);
+ spin_unlock(&proposed_global_lock);
exit_mm(tsk);

exit_sem(tsk);
__exit_files(tsk);
__exit_fs(tsk);
Index: linux-2.6.15/fs/exec.c
===================================================================
--- linux-2.6.15.orig/fs/exec.c
+++ linux-2.6.15/fs/exec.c
@@ -1121,10 +1121,12 @@ int search_binary_handler(struct linux_b
return retval;
}

EXPORT_SYMBOL(search_binary_handler);

+extern spinlock_t proposed_global_lock;
+
/*
* sys_execve() executes a new program.
*/
int do_execve(char * filename,
char __user *__user *argv,
@@ -1190,10 +1192,12 @@ int do_execve(char * filename,

retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
goto out;

+ spin_lock(&proposed_global_lock);
+ spin_unlock(&proposed_global_lock);
retval = search_binary_handler(bprm,regs);
if (retval >= 0) {
free_arg_pages(bprm);

/* execve success */
Index: linux-2.6.15/init/main.c
===================================================================
--- linux-2.6.15.orig/init/main.c
+++ linux-2.6.15/init/main.c
@@ -440,10 +440,11 @@ void __init parse_early_param(void)

/*
* Activate the first processor.
*/

+spinlock_t proposed_global_lock = SPIN_LOCK_UNLOCKED;
asmlinkage void __init start_kernel(void)
{
char * command_line;
extern struct kernel_param __start___param[], __stop___param[];
/*


2006-01-12 23:55:37

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Wed, 2006-01-11 at 05:36 -0500, Jes Sorensen wrote:
> >>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:
>
> Shailabh> Jes Sorensen wrote:
> >> I am quite concerned about that lock your patches put into struct
> >> task_struct through struct task_delay_info. Have you done any
> >> measurements on how this impacts performance on highly threaded
> >> apps on larger system?
>
> Shailabh> I don't expect the lock contention to be high. The lock is
> Shailabh> held for a very short time (across two
> Shailabh> additions/increments). Moreover, it gets contended only when
> Shailabh> the stats are being read (either through /proc or
> Shailabh> connectors). Since the reading of stats won't be that
> Shailabh> frequent (the utility of these numbers is to influence the
> Shailabh> I/O priority/rss limit etc. which won't be done at a very
> Shailabh> small granularity anyway), I wouldn't expect a problem.

Not just when the stats are being read, but when the stats of that
particular task (A) are being read by one task (B) AND updated by the
task itself (A).

> Hi Shailabh,
>
> When this is read through connectors, it's initiated by the connectors
> code which is called from the task's context hence we don't need
> locking for that. It's very similar to the task_notify code I am about
> to post and I think the connector code could fit into that
> framework. The main issue is /proc, but then one could even have a
> mechanism with a hook when the task exits that pushes the data to a
> storage point which is lock protected.

Hmm, which task? The request for stats does not necessarily/usually
originate from the task we desire stats on. Hence the synchronization.

In this way it's significantly different from lockless task_notify.

> Even if a lock isn't contended, you are still going to see the cache
> lines bounce around due to the writes. It may not show up on a 4-way
> box but what happens on a 64-way? We have seen some pretty nasty
> effects on the bigger SN2 boxes with locks that were taken far too
> frequently, to the point where it would prevent the box from booting
> (now I don't expect it to that severe here, but I'd still like to
> explore an approach of doing it lock free).

You're referring to the performance impact of a global lock on a large
system. The lock Shailabh introduced is per-task. Those won't bounce
around nearly as much -- I think they bounce under the following
conditions:

- The task in which the lock is embedded is in the cache of processor A
and the task reading the stats of that task is on processor B.

- The scheduler bounces a task between processor A and processor B.

Am I missing any other circumstances under which it could bounce?

> Shailabh> But its better to take some measurements anyway. Any
> Shailabh> suggestions on a benchmark ?
>
> >> IMHO it seems to make more sense to use something like Jack's
> >> proposed task_notifier code to lock-less collect the data into task
> >> local data structures and then take the data from there and ship
> >> off to userland through netlink or similar like you are doing?
> >>
> >> I am working on modifying Jack's patch to carry task local data and
> >> use it for not just accounting but other areas that need optional
> >> callbacks (optional in the sense that it's a feature that can be
> >> enabled or disabled). Looking at Shailabh's delayacct_blkio()
> >> changes it seems that it would be really easy to fit those into
> >> that framework.
> >>
> >> Guess I should post some of this code .....
>
> Shailabh> Please do. If this accounting can fit into some other
> Shailabh> framework, thats fine too.
>
> Ok, finally, sorry for the delay. My current code snapshot is
> available at http://www.trained-monkey.org/~jes/patches/task_notify/ -
> it's a modified version of Jack's task_notify code, and three example
> users of it (the SysV IPC semundo semaphore, the key infrastructure
> and SGI's JOB module). The patch order should be task_notify.diff,
> task-notify-keys.diff, task-notify-semundo.diff, and
> task_notify-job.diff last.
>
> I think task_notify it should be usable for statistics gathering as
> well, the only issue is how to attach it to the processes we wish to
> gather accounting for. Personally I am not a big fan of the current
> concept where statistics are gathered for all tasks at all time but
> just not exported until accounting is enabled.

The only potential problem I can see with using task_notify for
gathering statistics is each section of code that increments stats would
have to look for its notify block in the task's list.

Cheers,
-Matt Helsley

2006-01-13 09:35:10

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Matt" == Matt Helsley <[email protected]> writes:

Matt> On Thu, 2006-01-12 at 05:01 -0500, Jes Sorensen wrote:
>> It all depends on the specific location of the locks and how often
>> they are taken. As long as something is taken by the same CPU all
>> the time, it is not going to be a major issue, but if we end up with
>> anything resembling a global lock, even if it is only held for a
>> very short time, it is going to bite badly. On a 4-way you probably
>> won't notice, but go to a 64-way and it bites, on a 512-way it eats
>> you alive (we had a problem in the timer code quite a while back
>> that prevented the machine from booting - it wasn't a lock that was
>> held for a long time, just the fact that every CPU would take it
>> every HZ was enough).

Matt> OK, so you've established that global locks in timer paths are
Matt> Bad. However you haven't established that timer paths are
Matt> almost the same as the task paths we're talking about. I suspect
Matt> timer paths are used much more frequently than fork, exec, or
Matt> exit.

Hi Matt,

I wasn't trying to make it sound like this was an apples to apples
comparison, what I am saying is simply that locks aren't free.

You are totally right that fork/exec should be called a lot less
frequently, but the delay accounting data collection points will be in far
more places than that and they will all go for the lock.

Matt> I've appended a small patch that adds a global lock to the
Matt> task_notify paths for testing purposes. I'm curious to know what
Matt> kind of a performance difference you would see on your 64 or
Matt> 512-way if you were to run with it.

I don't have a 512-way to play with at the moment, but again I don't
think it makes sense to benchmark things which aren't matching what we
are looking at. If we can avoid the locks in the first place then
there's really no reason for not doing that.

Matt> Looking at the ideas for lockless implementations I'm curious
Matt> how well Keith's notifier chains patch would work in this
Matt> case. It does not acquire a global lock in the "call" path and,
Matt> as Keith suggested, probably can be modified to avoid cache
Matt> bouncing.

Yup, I was curious about that. I haven't had a chance to look at it
carefully yet.

Cheers,
Jes

2006-01-13 09:59:04

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Matt" == Matt Helsley <[email protected]> writes:

Matt> On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
>> Using the notifier block's address won't work, they are often
>> dynamically allocated so a client searching for it will rarely know
>> the address.

Matt> If the task_notify user is responsible for allocating the
Matt> notifier block then it could easily keep a copy of the pointer
Matt> handy for the corresponding deregistration.

And when they are dynamically allocated on a per-object level? Then
you'd have to keep linked lists around to keep track of them. Sorry,
doesn't work.

>> The alternative is to search for a function pointer of one of the
>> call out functions, however it requires that all users provide a
>> specific function pointer or we need to add some flags for the
>> search function (eg. one of the TN_ ones could be passed in), but
>> this would complicate the search function and make it slower. This
>> is the approach Jack took in his original approach, however Jack's
>> patch had just a single notifier function and it was the user's
>> responsibility to write the code for the demultiplexing of the
>> callouts. I do not like this approach as it will be a lot more
>> error prone with everyone doing their own version.

Matt> I don't see how it will be error prone. Jack's interface was
Matt> simple. And if we're really worried about users messing it up we
Matt> could generate a set of macros that users must use to
Matt> demultiplex the call.

Because experience shows that the more people copy the same code, the
more likely it is that someone will get it wrong.

Matt> One function pointer per event does bloat the notifier block
Matt> structure. Given that this notifier block may need to be
Matt> replicated per-task this is a significant amount of
Matt> space. Jack's demultiplexing approach requires that the space be
Matt> proportional to the number of event functions instead of
Matt> proportional to the number of notifier blocks.

Matt> Furthermore, if new task events are added this structure would
Matt> need to be expanded. In contrast, Jack's approach only required
Matt> the addition of a new event value.

Matt> For these reasons I think using a single function pointer will
Matt> be much more effective.

So you add a few pointers per task compared to the code that does the
demultiplexing. We're talking 3-4 extra pointers in comparison to the
demultiplexing code *and* the extra function call.

My approach is consistent with what the rest of the kernel does for most
structures: struct file_operations, etc.

But if that's what makes the difference as to whether we go for this
type of API, then lets change it back. I am not married to the API,
but rather the functionality.

Matt> No problem. Here it is:
Matt> http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2

Matt> I think it would be ideal if task_notify could simply be a
Matt> notifier chain for notifying users of task events/changes.

Thats really what it was intended for, but of course only in the
task's own context.

>> That could be the approach to take to get around it, but I must
>> admit I don't know the schedule enough to say whether we will be
>> able to assign all tasks to one single CPU at any given time and
>> then there's the issue of locking when they migrate. But again, the
>> latter may already be handled by the scheduler?

Matt> "all-task" means the notification block calls its function
Matt> when any task triggers a traversal of the notification
Matt> lists. This does not imply that registration/call/deregister of
Matt> an all-task notification must traverse all tasks.

Well then the name is somewhat misleading ;-) Sounds almost like
something that could fit into Erik's Job stuff, but I guess it's a
question of what criteria those tasks are put onto that list with.

Being a little short on the history, are there any pointers to
examples or descriptions of what this would be used for? Curious to
understand the usage pattern.

Matt> I'd like to see the all-task notification I mentioned above.
Matt> Notifications where uid/gids change could be useful for auditing
Matt> and process events connector.
>> If anybody is willing to step up and propose a patch for this I'd
>> be interested in taking a look. I don't off hand see a simple
>> solution for it.

Matt> Keith posted an interesting patch for notification chains that
Matt> I believe could address your concerns -- though from what I've
Matt> read it needs to disable preemption. As far as adding
Matt> notifications for uid/gid change, that's relatively trivial. You
Matt> might look at some revisions of task_notify that Chandra
Matt> Seetharaman and I have posted.

I did go through the archives a while back and I did notice that both
Jack and Erik pointed out a number of issues with those approaches. If
we are going to do the task-group-notify thing, then going to Keith's
approach is probably the best.

Cheers,
Jes

2006-01-13 10:38:42

by Jes Sorensen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

>>>>> "Matt" == Matt Helsley <[email protected]> writes:

Matt> On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
Matt> Have you looked at Alan Stern's notifier chain fix patch? Could
Matt> that be used in task_notify?
>> No sorry, do you have a pointer?

Matt> No problem. Here it is:
Matt> http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2

Matt> I think it would be ideal if task_notify could simply be a
Matt> notifier chain for notifying users of task events/changes.

Ok, went back and looked at this. I think the core concept is fine,
but there are details such as having a data pointer associated with
the notifier block which is too important to leave out. Otherwise we
have to stick things into the task struct in many cases which is a
waste of space. I also think it needs to be possible to search the
list for special slow path uses to avoid us adding excessive amounts
of callbacks that are only used in one place by one client.

If we can cross-API it for task-group-notifiers then that should be
fine.

Cheers,
Jes

2006-01-13 23:28:59

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Fri, 2006-01-13 at 05:38 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <[email protected]> writes:
>
> Matt> On Thu, 2006-01-12 at 04:51 -0500, Jes Sorensen wrote:
> Matt> Have you looked at Alan Stern's notifier chain fix patch? Could
> Matt> that be used in task_notify?
> >> No sorry, do you have a pointer?
>
> Matt> No problem. Here it is:
> Matt> http://marc.theaimsgroup.com/?l=linux-kernel&m=113407207418475&w=2
>
> Matt> I think it would be ideal if task_notify could simply be a
> Matt> notifier chain for notifying users of task events/changes.
>
> Ok, went back and looked at this. I think the core concept is fine,
> but there are details such as having a data pointer associated with
> the notifier block which is too important to leave out. Otherwise we
> have to stick things into the task struct in many cases which is a
> waste of space. I also think it needs to be possible to search the
> list for special slow path uses to avoid us adding excessive amounts
> of callbacks that are only used in one place by one client.

I agree that having a data pointer associated with the notifier block
is important. It helps us avoid increasing the size of task struct for
each task_notify user and makes modularization of them possible.

Hmm, yes excessive amounts of callbacks for those used by only one
client could be a problem. My first approach to that problem would be to
have one task_notify list per-event rather than just a single list for
all events. This has ugly space implications.

More importantly, I don't think it's a problem yet. Until it's a
problem we ought to go with the simpler implementations. When it is a
problem the task_notify interface should insulate the users from such a
change.

> If we can cross-API it for task-group-notifiers then that should be
> fine.
>
> Cheers,
> Jes

Yup.

Cheers,
-Matt Helsley

2006-01-14 07:34:50

by Matt Helsley

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Fri, 2006-01-13 at 04:35 -0500, Jes Sorensen wrote:
> >>>>> "Matt" == Matt Helsley <[email protected]> writes:
>
> Matt> On Thu, 2006-01-12 at 05:01 -0500, Jes Sorensen wrote:
> >> It all depends on the specific location of the locks and how often
> >> they are taken. As long as something is taken by the same CPU all
>> the time, it is not going to be a major issue, but if we end up with
> >> anything resembling a global lock, even if it is only held for a
> >> very short time, it is going to bite badly. On a 4-way you probably
> >> won't notice, but go to a 64-way and it bites, on a 512-way it eats
> >> you alive (we had a problem in the timer code quite a while back
> >> that prevented the machine from booting - it wasn't a lock that was
> >> held for a long time, just the fact that every CPU would take it
> >> every HZ was enough).
>
> Matt> OK, so you've established that global locks in timer paths are
> Matt> Bad. However you haven't established that timer paths are
> Matt> almost the same as the task paths we're talking about. I suspect
> Matt> timer paths are used much more frequently than fork, exec, or
> Matt> exit.
>
> Hi Matt,
>
> I wasn't trying to make it sound like this was an apples to apples
> comparison, what I am saying is simply that locks aren't free.

OK. I'm not sure what gave you the impression I thought they were.

> You are totally right that fork/exec should be called a lot less
> frequently, but the delay account data collection points will be in far
> more places than that and they will all go for the lock.

All of these places are highly likely to also be in the context of the
task that has the lock in its task->delays struct. This could be a
strongly recommended practice for taking the lock -- we can add a
comment next to the lock suggesting as much.

> Matt> I've appended a small patch that adds a global lock to the
> Matt> task_notify paths for testing purposes. I'm curious to know what
> Matt> kind of a performance difference you would see on your 64 or
> Matt> 512-way if you were to run with it.
>
> I don't have a 512-way to play with at the moment, but again I don't
> think it makes sense to benchmark things which aren't matching what we
> are looking at. If we can avoid the locks in the first place then
> there's really no reason for not doing that.

It could be good for establishing a lower bound on contention resulting
from locking implementations of task_notify. I think in that respect
it's comparing two species of apples.

We need to clarify what locking we're talking about -- locking in
task_notify to protect the notification list vs. locking in delayacct to
protect the accounting data consistency. I think that a simple lockless
approach may only be possible for task_notify.

Cheers,
-Matt Helsley

2006-01-17 17:26:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Thu, Jan 12, 2006 at 07:17:41AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 12, 2006 at 06:50:34PM +1100, Keith Owens wrote:
> > "Paul E. McKenney" (on Wed, 11 Jan 2006 22:51:15 -0800) wrote:
> > >On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
> > >> OK, I have thought about it and ...
> > >>
> > >> notifier_call_chain_lockfree() must be called with preempt disabled.
> > >>
> > >Fair enough! A comment, perhaps? In a former life I would have also
> > >demanded debug code to verify that preemption/interrupts/whatever were
> > >actually disabled, given the very subtle nature of any resulting bugs...
> >
> > Comment - OK. Debug code is not required, the reference to
> > smp_processor_id() already does all the debug checks that
> > notifier_call_chain_lockfree() needs. CONFIG_PREEMPT_DEBUG is your
> > friend.
>
> Ah, debug_smp_processor_id(). Very cool!!!

One other thing -- given that you are requiring that the read side
have preemption disabled, another update-side option would be to
use synchronize_sched() to wait for all preemption-disabled code
segments to complete. This would allow you to remove the per-CPU
reference counters from the read side, but would require that the
update side either (1) be able to block or (2) be able to defer
the cleanup to process context.

Just another option...
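As a loose userspace analog of what synchronize_sched() guarantees (not
how it is implemented): the update side simply waits until every reader
has left its current "non-preemptible" section. The per-reader flags
below stand in for the preempt-disabled state the kernel already tracks,
which is why the explicit read-side counters can be dropped.

```c
#include <stdatomic.h>
#include <sched.h>
#include <assert.h>

#define NREADERS 4

/* One flag per reader, standing in for "this CPU has preemption
 * disabled". */
static atomic_int in_cs[NREADERS];

static void reader_enter(int id) { atomic_store(&in_cs[id], 1); }
static void reader_exit(int id)  { atomic_store(&in_cs[id], 0); }

/* Analog of synchronize_sched(): block until every reader observed
 * in a critical section on entry has exited it. */
static void synchronize_readers(void)
{
	for (int i = 0; i < NREADERS; i++)
		while (atomic_load(&in_cs[i]))
			sched_yield();	/* wait for reader i to go quiescent */
}
```

The caller of synchronize_readers() must be able to block (or defer the
cleanup to process context), which is exactly the trade-off described
above.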

Thanx, Paul

2006-01-17 23:57:50

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

"Paul E. McKenney" (on Tue, 17 Jan 2006 09:26:17 -0800) wrote:
>On Thu, Jan 12, 2006 at 07:17:41AM -0800, Paul E. McKenney wrote:
>> On Thu, Jan 12, 2006 at 06:50:34PM +1100, Keith Owens wrote:
>> > "Paul E. McKenney" (on Wed, 11 Jan 2006 22:51:15 -0800) wrote:
>> > >On Thu, Jan 12, 2006 at 05:19:01PM +1100, Keith Owens wrote:
>> > >> OK, I have thought about it and ...
>> > >>
>> > >> notifier_call_chain_lockfree() must be called with preempt disabled.
>> > >>
>> > >Fair enough! A comment, perhaps? In a former life I would have also
>> > >demanded debug code to verify that preemption/interrupts/whatever were
>> > >actually disabled, given the very subtle nature of any resulting bugs...
>> >
>> > Comment - OK. Debug code is not required, the reference to
>> > smp_processor_id() already does all the debug checks that
>> > notifier_call_chain_lockfree() needs. CONFIG_PREEMPT_DEBUG is your
>> > friend.
>>
>> Ah, debug_smp_processor_id(). Very cool!!!
>
>One other thing -- given that you are requiring that the read side
>have preemption disabled, another update-side option would be to
>use synchronize_sched() to wait for all preemption-disabled code
>segments to complete. This would allow you to remove the per-CPU
>reference counters from the read side, but would require that the
>update side either (1) be able to block or (2) be able to defer
>the cleanup to process context.

Originally I looked at that code but the comment scared me off.
synchronize_sched() maps to synchronize_rcu() and the comment says that
this only protects against rcu_read_lock(), not against preempt_disable().

/**
* synchronize_sched - block until all CPUs have exited any non-preemptive
* kernel code sequences.
*
* This means that all preempt_disable code sequences, including NMI and
* hardware-interrupt handlers, in progress on entry will have completed
* before this primitive returns. However, this does not guarantee that
* softirq handlers will have completed, since in some kernels
*
* This primitive provides the guarantees made by the (deprecated)
* synchronize_kernel() API. In contrast, synchronize_rcu() only
* guarantees that rcu_read_lock() sections will have completed.
*/
#define synchronize_sched() synchronize_rcu()


2006-01-18 02:49:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Wed, Jan 18, 2006 at 10:57:47AM +1100, Keith Owens wrote:
> "Paul E. McKenney" (on Tue, 17 Jan 2006 09:26:17 -0800) wrote:
[ . . .]
> >One other thing -- given that you are requiring that the read side
> >have preemption disabled, another update-side option would be to
> >use synchronize_sched() to wait for all preemption-disabled code
> >segments to complete. This would allow you to remove the per-CPU
> >reference counters from the read side, but would require that the
> >update side either (1) be able to block or (2) be able to defer
> >the cleanup to process context.
>
> Originally I looked at that code but the comment scared me off.
> synchronize_sched() maps to synchronize_rcu() and the comment says that
> this only protects against rcu_read_lock(), not against preempt_disable().

Good point -- in mainline kernels (but not in some realtime
variants), the two guarantees happen to be one and the same.
But the comment certainly does not make this clear.

> /**
> * synchronize_sched - block until all CPUs have exited any non-preemptive
> * kernel code sequences.
> *
> * This means that all preempt_disable code sequences, including NMI and
> * hardware-interrupt handlers, in progress on entry will have completed
> * before this primitive returns. However, this does not guarantee that
> * softirq handlers will have completed, since in some kernels
> *
> * This primitive provides the guarantees made by the (deprecated)
> * synchronize_kernel() API. In contrast, synchronize_rcu() only
> * guarantees that rcu_read_lock() sections will have completed.
> */
> #define synchronize_sched() synchronize_rcu()

Does the following change help?

Thanx, Paul

diff -urpNa -X dontdiff linux-2.6.15/include/linux/rcupdate.h linux-2.6.15-RCUcomment/include/linux/rcupdate.h
--- linux-2.6.15/include/linux/rcupdate.h 2006-01-02 19:21:10.000000000 -0800
+++ linux-2.6.15-RCUcomment/include/linux/rcupdate.h 2006-01-17 18:48:33.000000000 -0800
@@ -265,11 +265,14 @@ static inline int rcu_pending(int cpu)
* This means that all preempt_disable code sequences, including NMI and
* hardware-interrupt handlers, in progress on entry will have completed
* before this primitive returns. However, this does not guarantee that
- * softirq handlers will have completed, since in some kernels
+ * softirq handlers will have completed, since in some kernels, these
+ * handlers can run in process context, and can block.
*
* This primitive provides the guarantees made by the (deprecated)
* synchronize_kernel() API. In contrast, synchronize_rcu() only
* guarantees that rcu_read_lock() sections will have completed.
+ * In "classic RCU", these two guarantees happen to be one and
+ * the same, but can differ in realtime RCU implementations.
*/
#define synchronize_sched() synchronize_rcu()

2006-01-18 02:55:13

by Lee Revell

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Tue, 2006-01-17 at 18:49 -0800, Paul E. McKenney wrote:
> - * softirq handlers will have completed, since in some kernels
> + * softirq handlers will have completed, since in some kernels, these
> + * handlers can run in process context, and can block.
> *

I was under the impression that softirq handlers can run in process
context in all kernels. Specifically, in realtime variants softirqs
always run in process context, and in mainline this only happens under
high load.

Lee

2006-01-18 06:29:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [ckrm-tech] Re: [PATCH 00/01] Move Exit Connectors

On Tue, Jan 17, 2006 at 09:55:06PM -0500, Lee Revell wrote:
> On Tue, 2006-01-17 at 18:49 -0800, Paul E. McKenney wrote:
> > - * softirq handlers will have completed, since in some kernels
> > + * softirq handlers will have completed, since in some kernels, these
> > + * handlers can run in process context, and can block.
> > *
>
> I was under the impression that softirq handlers can run in process
> context in all kernels. Specifically, in realtime variants softirqs
> always run in process context, and in mainline this only happens under
> high load.

We might be talking past each other on this one. If I am not getting
too confused, it is possible to configure a mainline kernel so that
the load cannot rise high enough to force softirqs into process
context. Although looking at 2.6.15, it appears that this would
require rebuilding after hand-editing the value of MAX_SOFTIRQ_RESTART,
which some might feel is too brutal a tweak to be considered mere
"configuration".

In any case, the key point of the comment is that synchronize_sched()
is not guaranteed to wait for all pending softirq handlers to complete.
Does the comment make that sufficiently clear?

Thanx, Paul