LinuxLists.cc - [PATCH 1/3] Process Notification / pnotify

2005-10-03 18:46:51

Subject: [PATCH 1/3] Process Notification / pnotify

Here is a new patch for Process Notification (pnotify). This patch is
derived from PAGG. We've changed the name to better reflect its purpose and
changed the function and structure names to make them easier to understand.

The pnotify patch allows kernel module authors to associate data with tasks
and receive notification about certain stages in the life of a task through
callbacks.

After applying the patch, please see Documentation/pnotify.txt for more
details and examples.
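
As a rough illustration of the model described above, here is a userspace
sketch (not kernel code, and not taken from Documentation/pnotify.txt). It
mimics how a subscriber links a task to a module's callbacks, how a child
inherits its parent's subscribers at fork, and how exit tears them down. The
model_* functions, the singly linked list, and the "counter" module are all
hypothetical names invented for this example; only the return-code convention
(negative fails the fork, positive means "do not subscribe") follows the patch.

```c
/* Userspace model of the pnotify flow; illustrative only, not kernel code. */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define PNOTIFY_OK    0	/* stay subscribed */
#define PNOTIFY_NOSUB 1	/* success, but do not subscribe */

struct task;
struct subscriber;

/* One per module: name key plus callbacks (modeled on pnotify_events). */
struct events {
	const char *name;
	int refcnt;	/* plain int here; the patch uses atomic_t */
	int (*fork)(struct task *child, struct subscriber *sub, void *parent_data);
	void (*exit)(struct task *task, struct subscriber *sub);
};

/* One per (task, module) pair (modeled on pnotify_subscriber). */
struct subscriber {
	struct events *events;
	void *data;
	struct subscriber *next;	/* hypothetical; the patch uses list_head */
};

struct task {
	int pid;
	struct subscriber *subs;
};

struct subscriber *model_subscribe(struct task *t, struct events *ev)
{
	struct subscriber **tail = &t->subs;
	struct subscriber *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->events = ev;
	ev->refcnt++;
	while (*tail)		/* append at the tail of the task's list */
		tail = &(*tail)->next;
	*tail = s;
	return s;
}

void model_unsubscribe(struct task *t, struct subscriber *s)
{
	struct subscriber **pp = &t->subs;

	while (*pp && *pp != s)
		pp = &(*pp)->next;
	assert(*pp == s);	/* caller must pass a subscriber of this task */
	*pp = s->next;
	s->events->refcnt--;
	free(s);
}

struct subscriber *model_get_subscriber(struct task *t, const char *key)
{
	struct subscriber *s;

	for (s = t->subs; s; s = s->next)
		if (!strcmp(s->events->name, key))
			return s;
	return NULL;
}

/* Child inherits the parent's subscribers; a negative return fails the fork. */
int model_fork(struct task *child, struct task *parent)
{
	struct subscriber *from;

	child->subs = NULL;
	for (from = parent->subs; from; from = from->next) {
		struct subscriber *to = model_subscribe(child, from->events);
		int ret = to ? to->events->fork(child, to, from->data) : -ENOMEM;

		if (ret < 0) {
			/* Assumption of this sketch: drop all child subscribers. */
			while (child->subs)
				model_unsubscribe(child, child->subs);
			return ret;
		}
		if (ret > 0)	/* e.g. PNOTIFY_NOSUB */
			model_unsubscribe(child, to);
	}
	return 0;
}

/* Run each exit callback, then drop the subscriber. */
void model_exit(struct task *t)
{
	while (t->subs) {
		t->subs->events->exit(t, t->subs);
		model_unsubscribe(t, t->subs);
	}
}

/* Hypothetical module: counts live tasks through a shared int. */
int counter_fork(struct task *child, struct subscriber *sub, void *parent_data)
{
	(void)child;
	sub->data = parent_data;
	++*(int *)parent_data;
	return PNOTIFY_OK;
}

void counter_exit(struct task *task, struct subscriber *sub)
{
	(void)task;
	--*(int *)sub->data;
}
```

In this sketch a caller would subscribe one task by hand, point its data at a
shared counter, and then let model_fork and model_exit keep the counter in
step with the set of live tasks. Locking is left out entirely; the real patch
protects each task's list with an rw_semaphore.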

I implemented feedback received from lse-tech and the pagg mailing list and
this patch is the result.

We have received positive feedback from multiple sources about this patch.
This patch, in its PAGG form, also has received exposure on ia64 by being
part of SLES9.

We're hoping this patch will be considered for mm inclusion and eventually
make it into the kernel.

There will be two more postings showing two pnotify users:
Linux Job and keyrings. keyrings is a proof of concept patch showing
how one might change an existing kernel component to use pnotify. Job
is the inescapable job container implementation.

I'm hoping we can move beyond this being seen as only a tool for accounting
and consider its broader applications.

This was discussed in lse-tech recently. Unfortunately, the threads got a
bit broken up in the archives but the major pieces are:

main thread:
http://marc.theaimsgroup.com/?l=lse-tech&m=112733871706425&w=2

Performance data, 2p system:
http://marc.theaimsgroup.com/?l=lse-tech&m=112801353626865&w=2

Performance data, 32p system:
http://marc.theaimsgroup.com/?l=lse-tech&m=112810597609896&w=2

In addition, there has been activity on the PAGG mailing list about
pnotify. Find the list archives here:
http://oss.sgi.com/archives/pagg/

Note: In lse-tech, I was asked to implement pnotify using RCU protections
instead of semaphore read locks. I did this in a proof of concept patch
but I feel RCU is too restrictive for the kernel module users because their
callbacks could not sleep and, as I showed in the performance data, it
doesn't buy us anything in terms of performance.

Signed-off-by: Erik Jacobson <erikj.sgi.com>
---

Documentation/pnotify.txt | 368 +++++++++++++++++++++++++++++
fs/exec.c | 2
include/linux/init_task.h | 2
include/linux/pnotify.h | 227 ++++++++++++++++++
include/linux/sched.h | 5
init/Kconfig | 8
kernel/Makefile | 1
kernel/exit.c | 4
kernel/fork.c | 17 +
kernel/pnotify.c | 568 ++++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 1201 insertions(+), 1 deletion(-)

Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c 2005-09-30 14:57:55.097213456 -0500
+++ linux/fs/exec.c 2005-09-30 14:57:57.629184199 -0500
@@ -48,6 +48,7 @@
#include <linux/syscalls.h>
#include <linux/rmap.h>
#include <linux/acct.h>
+#include <linux/pnotify.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1203,6 +1204,7 @@
retval = search_binary_handler(bprm,regs);
if (retval >= 0) {
free_arg_pages(bprm);
+ pnotify_exec(current);

/* execve success */
security_bprm_free(bprm);
Index: linux/include/linux/init_task.h
===================================================================
--- linux.orig/include/linux/init_task.h 2005-09-30 14:57:55.098189920 -0500
+++ linux/include/linux/init_task.h 2005-09-30 14:57:57.636019445 -0500
@@ -2,6 +2,7 @@
#define _LINUX__INIT_TASK_H

#include <linux/file.h>
+#include <linux/pnotify.h>
#include <linux/rcupdate.h>

#define INIT_FDTABLE \
@@ -121,6 +122,7 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
+ INIT_TASK_PNOTIFY(tsk) \
.fs_excl = ATOMIC_INIT(0), \
}

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h 2005-09-30 14:57:55.098189920 -0500
+++ linux/include/linux/sched.h 2005-09-30 15:25:34.616252251 -0500
@@ -795,6 +795,11 @@
struct mempolicy *mempolicy;
short il_next;
#endif
+#ifdef CONFIG_PNOTIFY
+/* List of pnotify kernel module subscribers */
+ struct list_head pnotify_subscriber_list;
+ struct rw_semaphore pnotify_subscriber_list_sem;
+#endif
#ifdef CONFIG_CPUSETS
struct cpuset *cpuset;
nodemask_t mems_allowed;
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig 2005-09-30 14:57:55.099166384 -0500
+++ linux/init/Kconfig 2005-09-30 15:25:34.489311959 -0500
@@ -162,6 +162,14 @@
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.

+config PNOTIFY
+ bool "Support for Process Notification"
+ help
+ Say Y here if you will be loading modules which provide support
+ for process notification. Examples of such modules include the
+ Linux Jobs module and the Linux Array Sessions module. If you will not
+ be using such modules, say N.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile 2005-09-30 14:57:55.100142848 -0500
+++ linux/kernel/Makefile 2005-09-30 15:25:34.490288423 -0500
@@ -20,6 +20,7 @@
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o
+obj-$(CONFIG_PNOTIFY) += pnotify.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_IKCONFIG_PROC) += configs.o
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c 2005-09-30 14:57:55.100142848 -0500
+++ linux/kernel/fork.c 2005-09-30 15:54:50.502255817 -0500
@@ -42,6 +42,7 @@
#include <linux/profile.h>
#include <linux/rmap.h>
#include <linux/acct.h>
+#include <linux/pnotify.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -151,6 +152,9 @@
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];
+
+ /* Initialize the pnotify list in pid 0 before it can clone itself. */
+ INIT_PNOTIFY_LIST(current);
}

static struct task_struct *dup_task_struct(struct task_struct *orig)
@@ -1039,6 +1043,15 @@
p->exit_state = 0;

/*
+ * Call pnotify kernel module subscribers and add the same subscribers the
+ * parent has to the new process.
+ * Fail the fork on error.
+ */
+ retval = pnotify_fork(p, current);
+ if (retval)
+ goto bad_fork_cleanup_namespace;
+
+ /*
* Ok, make it visible to the rest of the system.
* We dont wake it up yet.
*/
@@ -1073,7 +1086,7 @@
if (sigismember(&current->pending.signal, SIGKILL)) {
write_unlock_irq(&tasklist_lock);
retval = -EINTR;
- goto bad_fork_cleanup_namespace;
+ goto bad_fork_cleanup_pnotify;
}

/* CLONE_PARENT re-uses the old parent */
@@ -1159,6 +1172,8 @@
return ERR_PTR(retval);
return p;

+bad_fork_cleanup_pnotify:
+ pnotify_exit(p);
bad_fork_cleanup_namespace:
exit_namespace(p);
bad_fork_cleanup_keys:
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c 2005-09-30 14:57:55.100142848 -0500
+++ linux/kernel/exit.c 2005-09-30 15:25:34.617228715 -0500
@@ -29,6 +29,7 @@
#include <linux/proc_fs.h>
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
+#include <linux/pnotify.h>
#include <linux/syscalls.h>
#include <linux/signal.h>

@@ -866,6 +867,9 @@
module_put(tsk->binfmt->module);

tsk->exit_code = code;
+
+ pnotify_exit(tsk);
+
exit_notify(tsk);
#ifdef CONFIG_NUMA
mpol_free(tsk->mempolicy);
Index: linux/kernel/pnotify.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/kernel/pnotify.c 2005-09-30 16:01:50.518408112 -0500
@@ -0,0 +1,568 @@
+/*
+ * Process Notification (pnotify) interface
+ *
+ *
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/pnotify.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <asm/semaphore.h>
+
+/* list of pnotify event list entries that reference the "module"
+ * implementations */
+static LIST_HEAD(pnotify_event_list);
+static DECLARE_RWSEM(pnotify_event_list_sem);
+
+
+/**
+ * pnotify_get_subscriber - get a pnotify subscriber given a search key
+ * @task: We examine the pnotify_subscriber_list from the given task
+ * @key: Key name of kernel module subscriber we wish to retrieve
+ *
+ * Given a pnotify_subscriber_list structure, this function will return
+ * a pointer to the kernel module pnotify_subscriber struct that matches the
+ * search key. If the key is not found, the function will return NULL.
+ *
+ * Locking: This is a pnotify_subscriber_list reader. This function should
+ * be called with at least a read lock on the pnotify_subscriber_list using
+ * down_read(&task->pnotify_subscriber_list_sem).
+ *
+ */
+struct pnotify_subscriber *
+pnotify_get_subscriber(struct task_struct *task, char *key)
+{
+ struct pnotify_subscriber *subscriber;
+
+ list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) {
+ if (!strcmp(subscriber->events->name,key))
+ return subscriber;
+ }
+ return NULL;
+}
+
+
+/**
+ * pnotify_subscribe - Add kernel module to the subscriber list for process
+ * @task: Task that gets the new kernel module subscriber added to the list
+ * @events: pnotify_events structure to associate with kernel module
+ *
+ * Given a task and a pnotify_events structure, this function will allocate
+ * a new pnotify_subscriber, initialize the settings, and insert it into
+ * the pnotify_subscriber_list for the task.
+ *
+ * Locking:
+ * The caller for this function should hold at least a read lock on the
+ * pnotify_event_list_sem - or ensure that the pnotify_events entry cannot be
+ * removed. If this function was called from the pnotify module (usually the
+ * case), then the caller need not hold this lock because the event
+ * structure won't disappear until pnotify_unregister is called.
+ *
+ * This is a pnotify_subscriber_list WRITER. The caller must hold a write
+ * lock on the task's pnotify_subscriber_list_sem. This can be locked
+ * using down_write(&task->pnotify_subscriber_list_sem).
+ */
+struct pnotify_subscriber *
+pnotify_subscribe(struct task_struct *task, struct pnotify_events *events)
+{
+ struct pnotify_subscriber *subscriber;
+
+ subscriber = kmalloc(sizeof(struct pnotify_subscriber), GFP_KERNEL);
+ if (!subscriber)
+ return NULL;
+
+ subscriber->events = events;
+ subscriber->data = NULL;
+ atomic_inc(&events->refcnt); /* Increase hook's reference count */
+ list_add_tail(&subscriber->entry, &task->pnotify_subscriber_list);
+ return subscriber;
+}
+
+
+/**
+ * pnotify_unsubscribe - Remove kernel module subscriber from process
+ * @subscriber: The subscriber to remove
+ *
+ * This function will ensure the subscriber is deleted from
+ * the list of subscribers for the task. Finally, the memory for the
+ * subscriber is discarded.
+ *
+ * Prior to calling pnotify_unsubscribe, the subscriber should have been
+ * detached from any uses the kernel module may have. This is often done using
+ * p->events->exit(task, subscriber);
+ *
+ * Locking:
+ * This is a pnotify_subscriber_list WRITER. The caller of this function must
+ * hold a write lock on the pnotify_subscriber_list_sem for the task. This can
+ * be locked using down_write(&task->pnotify_subscriber_list_sem). Because
+ * events are referenced, the caller should ensure the events structure
+ * doesn't disappear. If the caller is a pnotify module, the events
+ * structure won't disappear until pnotify_unregister is called so it's safe
+ * not to lock the pnotify_event_list_sem.
+ *
+ *
+ */
+void
+pnotify_unsubscribe(struct pnotify_subscriber *subscriber)
+{
+ atomic_dec(&subscriber->events->refcnt); /* dec events ref count */
+ list_del(&subscriber->entry);
+ kfree(subscriber);
+}
+
+
+/**
+ * pnotify_get_events - Get the pnotify_events struct matching requested name
+ * @key: The name of the events structure to get
+ *
+ * Given a pnotify_events struct name that represents the kernel module name,
+ * this function will return a pointer to the pnotify_events structure that
+ * matches the name.
+ *
+ * Locking:
+ * You should hold either the write or read lock for pnotify_event_list_sem
+ * before using this function. This will ensure that the pnotify_event_list
+ * does not change while iterating through the list entries.
+ *
+ */
+static struct pnotify_events *
+pnotify_get_events(char *key)
+{
+ struct pnotify_events *events;
+
+ list_for_each_entry(events, &pnotify_event_list, entry) {
+ if (!strcmp(events->name, key)) {
+ return events;
+ }
+ }
+ return NULL;
+}
+
+/**
+ * remove_subscriber_from_all_tasks - Remove subscribers for given events struct
+ * @events: pnotify_events struct for subscribers to remove
+ *
+ * Given a kernel module events struct registered with pnotify,
+ * this function will remove all subscribers matching the events struct from
+ * all tasks.
+ *
+ * If there is an exit function associated with the subscriber, it is called
+ * before the subscriber is unsubscribed/freed.
+ *
+ * This is meant to be used by pnotify_register and pnotify_unregister
+ *
+ * Locking: This is a pnotify_subscriber_list WRITER and this function
+ * handles locking of the pnotify_subscriber_list_sem so callers don't
+ * need to.
+ *
+ */
+static void
+remove_subscriber_from_all_tasks(struct pnotify_events *events)
+{
+ if (events == NULL)
+ return;
+
+ /* Because of internal race conditions we can't guarantee
+ * getting every task in just one pass so we just keep going
+ * until there are no tasks with subscribers from this events struct
+ * attached. The inefficiency of this should be tempered by the fact
+ * that this happens at most once for each registered client.
+ */
+ while (atomic_read(&events->refcnt) != 0) {
+ struct task_struct *g = NULL, *p = NULL;
+
+ read_lock(&tasklist_lock);
+ do_each_thread(g, p) {
+ struct pnotify_subscriber *subscriber;
+ int task_exited;
+
+ get_task_struct(p);
+ read_unlock(&tasklist_lock);
+ down_write(&p->pnotify_subscriber_list_sem);
+ subscriber = pnotify_get_subscriber(p, events->name);
+ if (subscriber != NULL) {
+ (void)events->exit(p, subscriber);
+ pnotify_unsubscribe(subscriber);
+ }
+ up_write(&p->pnotify_subscriber_list_sem);
+ read_lock(&tasklist_lock);
+ /* If a task exited while we were looping, its sibling
+ * list would be empty. In that case, we jump out of
+ * the do_each_thread and loop again in the outer
+ * while because the reference count probably isn't
+ * zero for the pnotify events yet. Doing it this way
+ * makes it so we don't hold the tasklist lock too
+ * long.
+ */
+
+
+ task_exited = list_empty(&p->sibling);
+ put_task_struct(p);
+ if (task_exited)
+ goto endloop;
+ } while_each_thread(g, p);
+ endloop:
+ read_unlock(&tasklist_lock);
+ }
+}
+
+/**
+ * pnotify_register - Register a new module subscriber and enter it in the list
+ * @events_new: The new pnotify events structure to register.
+ *
+ * Used to register a new module subscriber pnotify_events structure and enter
+ * it into the pnotify_event_list. The service name for a pnotify_events
+ * struct is restricted to 32 characters.
+ *
+ * If an "init()" function is supplied in the events struct being registered
+ * then the kernel module will be subscribed to all existing tasks and the
+ * supplied "init()" function will be applied to it. If any call to the
+ * supplied "init()" function returns a non zero result, the registration will
+ * be aborted. As part of the abort process, all subscribers belonging to the
+ * new client will be removed from all tasks and the supplied "exit()"
+ * function will be called on them.
+ *
+ * If a memory error is encountered, the module (pnotify_events structure)
+ * is unregistered and any tasks we became subscribed to are detached.
+ *
+ * Locking: This function is an event list writer as well as a
+ * pnotify_subscriber_list writer. This function does the locks itself.
+ * Callers don't need to.
+ *
+ */
+int
+pnotify_register(struct pnotify_events *events_new)
+{
+ struct pnotify_events *events = NULL;
+
+ /* Add new pnotify module to access list */
+ if (!events_new)
+ return -EINVAL; /* error */
+ if (!list_empty(&events_new->entry))
+ return -EINVAL; /* error */
+ if (events_new->name == NULL || strlen(events_new->name) >
+ PNOTIFY_NAMELN)
+ return -EINVAL; /* error */
+ if (!events_new->fork || !events_new->exit)
+ return -EINVAL; /* error */
+
+ /* Try to insert new events entry into the events list */
+ down_write(&pnotify_event_list_sem);
+
+ events = pnotify_get_events(events_new->name);
+
+ if (events) {
+ up_write(&pnotify_event_list_sem);
+ printk(KERN_WARNING "Attempt to register duplicate"
+ " pnotify support (name=%s)\n", events_new->name);
+ return -EBUSY;
+ }
+
+ /* Okay, we can insert into the events list */
+ list_add_tail(&events_new->entry, &pnotify_event_list);
+ /* set the ref count to zero */
+ atomic_set(&events_new->refcnt, 0);
+
+ /* Now we can call the init function (if present) for each task */
+ if (events_new->init != NULL) {
+ struct task_struct *g = NULL, *p = NULL;
+ int init_result = 0;
+
+ /* Because of internal race conditions we can't guarantee
+ * getting every task in just one pass so we just keep going
+ * until we don't find any uninitialized tasks. The inefficiency
+ * of this should be tempered by the fact that this happens
+ * at most once for each registered client.
+ */
+ read_lock(&tasklist_lock);
+ repeat:
+ do_each_thread(g, p) {
+ struct pnotify_subscriber *subscriber;
+ int task_exited;
+
+ get_task_struct(p);
+ read_unlock(&tasklist_lock);
+ down_write(&p->pnotify_subscriber_list_sem);
+ subscriber = pnotify_get_subscriber(p,
+ events_new->name);
+ if (!subscriber && !(p->flags & PF_EXITING)) {
+ subscriber = pnotify_subscribe(p, events_new);
+ if (subscriber != NULL) {
+ init_result = events_new->init(p,
+ subscriber);
+
+ /* Success, but init function pointer
+ * doesn't want kernel module on the
+ * subscriber list. */
+ if (init_result > 0) {
+ pnotify_unsubscribe(subscriber);
+ }
+ }
+ else {
+ init_result = -ENOMEM;
+ }
+ }
+ up_write(&p->pnotify_subscriber_list_sem);
+ read_lock(&tasklist_lock);
+ /* Like in remove_subscriber_from_all_tasks, if the
+ * task disappeared on us while we were going through
+ * the do_each_thread loop, we need to start over
+ * with that loop. That's why we have the list_empty
+ * here */
+ task_exited = list_empty(&p->sibling);
+ put_task_struct(p);
+ if (init_result < 0)
+ goto endloop;
+ if (task_exited)
+ goto repeat;
+ } while_each_thread(g, p);
+ endloop:
+ read_unlock(&tasklist_lock);
+
+ /*
+ * if anything went wrong during initialisation abandon the
+ * registration process
+ */
+ if (init_result < 0) {
+ remove_subscriber_from_all_tasks(events_new);
+ list_del_init(&events_new->entry);
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_WARNING "Registering pnotify support for"
+ " (name=%s) failed\n", events_new->name);
+
+ return init_result; /* init function error result */
+ }
+ }
+
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_INFO "Registering pnotify support for (name=%s)\n",
+ events_new->name);
+
+ return 0; /* success */
+
+}
+
+/**
+ * pnotify_unregister - Unregister kernel module/pnotify_events struct
+ * @events_old: pnotify_events struct for the kernel module we're unregistering
+ *
+ * Used to unregister kernel module subscribers indicated by
+ * pnotify_events struct. Removes them from the list of kernel modules
+ * in pnotify_event_list.
+ *
+ * Once the events entry in the pnotify_event_list is found, subscribers for
+ * this kernel module have their exit functions called and will then be
+ * removed from the list.
+ *
+ * Locking: This function is a pnotify_event_list writer. It also calls
+ * remove_subscriber_from_all_tasks, which is a pnotify_subscriber_list
+ * writer. Callers don't need to hold these locks ahead of calling this
+ * function.
+ *
+ */
+int
+pnotify_unregister(struct pnotify_events *events_old)
+{
+ struct pnotify_events *events;
+
+ /* Check the validity of the arguments */
+ if (!events_old)
+ return -EINVAL; /* error */
+ if (list_empty(&events_old->entry))
+ return -EINVAL; /* error */
+ if (events_old->name == NULL)
+ return -EINVAL; /* error */
+
+ down_write(&pnotify_event_list_sem);
+
+ events = pnotify_get_events(events_old->name);
+
+ if (events && events == events_old) {
+ remove_subscriber_from_all_tasks(events);
+ list_del_init(&events->entry);
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_INFO "Unregistering pnotify support for"
+ " (name=%s)\n", events_old->name);
+
+ return 0; /* success */
+ }
+
+ up_write(&pnotify_event_list_sem);
+
+ printk(KERN_WARNING "Attempt to unregister pnotify support (name=%s)"
+ " failed - not found\n", events_old->name);
+
+ return -EINVAL; /* error */
+}
+
+
+/**
+ * __pnotify_fork - Subscribe child to the same kernel modules as its parent
+ * @to_task: The child task that will inherit the parent's subscribers
+ * @from_task: The parent task
+ *
+ * Make it so a new task being constructed has the same kernel module
+ * subscribers of its parent.
+ *
+ * The "from" argument is the parent task. The "to" argument is the child
+ * task.
+ *
+ * See Documentation/pnotify.txt for details on
+ * how to handle return codes from the fork function pointer.
+ *
+ * Locking: The to_task is currently in-construction, so we don't
+ * need to worry about write-locks. We do need to be sure the parent's
+ * subscriber list, which we copy here, doesn't go away on us. This function
+ * read-locks the pnotify_subscriber_list. Callers don't need to lock.
+ *
+ */
+int
+__pnotify_fork(struct task_struct *to_task, struct task_struct *from_task)
+{
+ struct pnotify_subscriber *from_subscriber;
+ int ret;
+
+ /* lock the parents subscriber list we are copying from */
+ down_read(&from_task->pnotify_subscriber_list_sem);
+
+ list_for_each_entry(from_subscriber,
+ &from_task->pnotify_subscriber_list, entry) {
+ struct pnotify_subscriber *to_subscriber = NULL;
+
+ to_subscriber = pnotify_subscribe(to_task,
+ from_subscriber->events);
+ if (!to_subscriber) {
+ /* Failed to get memory.
+ * we don't force __pnotify_exit to run here because
+ * the child is in-construction and not running yet.
+ * We don't need a write lock on the subscriber
+ * list because the child is in construction.
+ */
+ pnotify_unsubscribe(to_task);
+ up_read(&from_task->pnotify_subscriber_list_sem);
+ return -ENOMEM;
+ }
+ ret = to_subscriber->events->fork(to_task, to_subscriber,
+ from_subscriber->data);
+
+ if (ret < 0) {
+ /* Propagates to copy_process as a fork failure.
+ * Since the child is in construction, we don't
+ * need a write lock on the subscriber list.
+ * __pnotify_exit isn't run because the child
+ * never got running, exit doesn't make sense.
+ */
+ pnotify_unsubscribe(to_subscriber);
+ up_read(&from_task->pnotify_subscriber_list_sem);
+ return ret; /* Fork failure */
+ }
+ else if (ret > 0) {
+ /* Success, but fork function pointer in the
+ * pnotify_events structure doesn't want the kernel
+ * module subscribed. This is an in-construction
+ * child so we don't need to write lock */
+ pnotify_unsubscribe(to_subscriber);
+ }
+ }
+
+ /* unlock parent's subscriber list */
+ up_read(&from_task->pnotify_subscriber_list_sem);
+
+ return 0; /* success */
+}
+
+/**
+ * __pnotify_exit - Remove all subscribers from given task
+ * @task: Task to remove subscribers from
+ *
+ * For each subscriber for the given task, we run the function pointer
+ * for exit in the associated pnotify_events structure then remove
+ * it from the task's subscriber list until all subscribers are gone.
+ *
+ * Locking: This is a pnotify_subscriber_list writer. This function
+ * write locks the pnotify_subscriber_list. Callers don't have to do their own
+ * locking. The pnotify_events structure referenced exit function is called
+ * with the pnotify_subscriber_list write lock held.
+ *
+ */
+void
+__pnotify_exit(struct task_struct *task)
+{
+ struct pnotify_subscriber *subscriber;
+ struct pnotify_subscriber *subscribertmp;
+
+ /* Remove ref. to subscribers from task immediately */
+ down_write(&task->pnotify_subscriber_list_sem);
+
+ list_for_each_entry_safe(subscriber, subscribertmp,
+ &task->pnotify_subscriber_list, entry) {
+ subscriber->events->exit(task, subscriber);
+ pnotify_unsubscribe(subscriber);
+ }
+
+ up_write(&task->pnotify_subscriber_list_sem);
+
+ return;
+}
+
+
+/**
+ * __pnotify_exec - Execute exec callback for each subscriber in this task
+ * @task: We go through the subscriber list in the given task
+ *
+ * Used when a process that has a subscriber list does an exec.
+ * The exec pointer in the events structure is optional.
+ *
+ * Locking: This is a pnotify_subscriber_list reader and implements the
+ * read locks itself. Callers don't need to do their own locking. The
+ * pnotify_events referenced exec function pointer is called in an
+ * environment where the pnotify_subscriber_list is read locked.
+ *
+ */
+int
+__pnotify_exec(struct task_struct *task)
+{
+ struct pnotify_subscriber *subscriber;
+
+ down_read(&task->pnotify_subscriber_list_sem);
+
+ list_for_each_entry(subscriber, &task->pnotify_subscriber_list, entry) {
+ if (subscriber->events->exec) /* exec funct. ptr is optional */
+ subscriber->events->exec(task, subscriber);
+ }
+
+ up_read(&task->pnotify_subscriber_list_sem);
+ return 0;
+}
+
+
+EXPORT_SYMBOL_GPL(pnotify_get_subscriber);
+EXPORT_SYMBOL_GPL(pnotify_subscribe);
+EXPORT_SYMBOL_GPL(pnotify_unsubscribe);
+EXPORT_SYMBOL_GPL(pnotify_register);
+EXPORT_SYMBOL_GPL(pnotify_unregister);
Index: linux/include/linux/pnotify.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/linux/pnotify.h 2005-09-30 14:57:57.651642867 -0500
@@ -0,0 +1,227 @@
+/*
+ * Process Notification (pnotify) interface
+ *
+ *
+ * Copyright (c) 2000-2002, 2004-2005 Silicon Graphics, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ *
+ * For further information regarding this notice, see:
+ *
+ * http://oss.sgi.com/projects/GenInfo/NoticeExplan
+ */
+
+/*
+ * Data structure definitions and function prototypes used to implement
+ * process notification (pnotify).
+ *
+ * pnotify provides a method (service) for kernel modules to be notified when
+ * certain events happen in the life of a process. It also provides a
+ * data pointer that is associated with a given process. See
+ * Documentation/pnotify.txt for a full description.
+ */
+
+#ifndef _LINUX_PNOTIFY_H
+#define _LINUX_PNOTIFY_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_PNOTIFY
+
+#define PNOTIFY_NAMELN 32 /* Max chars in PNOTIFY kernel module name */
+
+#define PNOTIFY_ERROR -1 /* Error. Fork fail for pnotify_fork */
+#define PNOTIFY_OK 0 /* All is well, stay subscribed */
+#define PNOTIFY_NOSUB 1 /* All is well but don't subscribe module
+ * to subscriber list for the process */
+
+
+/**
+ * INIT_PNOTIFY_LIST - init a pnotify subscriber list struct after declaration
+ * @_l: Task struct to init the pnotify_subscriber_list and semaphore
+ *
+ */
+#define INIT_PNOTIFY_LIST(_l) \
+do { \
+ INIT_LIST_HEAD(&(_l)->pnotify_subscriber_list); \
+ init_rwsem(&(_l)->pnotify_subscriber_list_sem); \
+} while(0)
+
+/*
+ * Used by task_struct to manage list of subscriber kernel modules for the
+ * process. Each pnotify_subscriber provides the link between the process
+ * and the correct kernel module subscriber.
+ *
+ * STRUCT MEMBERS:
+ * events: Reference to pnotify_events structure, which
+ * holds the name key and function pointers.
+ * data: Opaque data pointer - defined by pnotify kernel modules.
+ * entry: List pointers
+ */
+struct pnotify_subscriber {
+ struct pnotify_events *events;
+ void *data;
+ struct list_head entry;
+};
+
+/*
+ * Used by pnotify modules to define the callback functions into the
+ * module. See Documentation/pnotify.txt for details.
+ *
+ * STRUCT MEMBERS:
+ * name: The name of the pnotify container type provided by
+ * the module. This will be set by the pnotify module.
+ * fork: Function pointer to function used when associating
+ * a forked process with a kernel module referenced by
+ * this struct. pnotify.txt will provide details on
+ * special return codes interpreted by pnotify.
+ *
+ * exit: Function pointer to function used when a process
+ * associated with the kernel module owning this struct
+ * exits.
+ *
+ * init: Function pointer to initialization function. This
+ * function is used when the module registers with pnotify
+ * to associate existing processes with the referring
+ * kernel module. This is optional and may be set to NULL
+ * if it is not needed by the pnotify kernel module.
+ *
+ * Note: The return values are managed the same way as in
+ * fork above. Except, of course, an error doesn't
+ * result in a fork failure.
+ *
+ * Note: The implementation of pnotify_register causes
+ * us to evaluate some tasks more than once in some cases.
+ * See the comments in pnotify_register for why.
+ * Therefore, if the init function pointer returns
+ * PNOTIFY_NOSUB, which means that it doesn't want this
+ * process associated with the kernel module, that init
+ * function must be prepared to possibly look at the same
+ * "skipped" task more than once.
+ *
+ * data: Opaque data pointer - defined by pnotify modules.
+ * module: Pointer to kernel module struct. Used to increment &
+ * decrement the use count for the module.
+ * entry: List pointers
+ * exec: Function pointer to function used when a process
+ * this kernel module is subscribed to execs. This
+ * is optional and may be set to NULL if it is not
+ * needed by the pnotify module.
+ * refcnt: Keep track of user count of pnotify_events
+ */
+struct pnotify_events {
+ struct module *module;
+ char *name; /* Name Key - restricted to 32 chars */
+ void *data; /* Opaque module specific data */
+ struct list_head entry; /* List pointers */
+ atomic_t refcnt; /* usage counter */
+ int (*init)(struct task_struct *, struct pnotify_subscriber *);
+ int (*fork)(struct task_struct *, struct pnotify_subscriber *, void*);
+ void (*exit)(struct task_struct *, struct pnotify_subscriber *);
+ void (*exec)(struct task_struct *, struct pnotify_subscriber *);
+};
+
+
+/* Kernel service functions for providing pnotify support */
+extern struct pnotify_subscriber *pnotify_get_subscriber(struct task_struct
+ *task, char *key);
+extern struct pnotify_subscriber *pnotify_subscribe(struct task_struct *task,
+ struct pnotify_events *pt);
+extern void pnotify_unsubscribe(struct pnotify_subscriber *subscriber);
+extern int pnotify_register(struct pnotify_events *pt_new);
+extern int pnotify_unregister(struct pnotify_events *pt_old);
+extern int __pnotify_fork(struct task_struct *to_task,
+ struct task_struct *from_task);
+extern void __pnotify_exit(struct task_struct *task);
+extern int __pnotify_exec(struct task_struct *task);
+
+/**
+ * pnotify_fork - child inherits subscriber list associations of its parent
+ * @child: child task - to inherit
+ * @parent: parent task - child inherits subscriber list from this parent
+ *
+ * Function used when a child process must inherit subscriber list associations
+ * from the parent. The return code is propagated as a fork failure.
+ *
+ */
+static inline int pnotify_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ INIT_PNOTIFY_LIST(child);
+ if (!list_empty(&parent->pnotify_subscriber_list))
+ return __pnotify_fork(child, parent);
+
+ return 0;
+}
+
+
+/**
+ * pnotify_exit - Detach subscriber kernel modules from this process
+ * @task: The task the subscribers will be detached from
+ *
+ */
+static inline void pnotify_exit(struct task_struct *task)
+{
+ if (!list_empty(&task->pnotify_subscriber_list))
+ __pnotify_exit(task);
+}
+
+/**
+ * pnotify_exec - Used when a process exec's
+ * @task: The process doing the exec
+ *
+ */
+static inline void pnotify_exec(struct task_struct *task)
+{
+ if (!list_empty(&task->pnotify_subscriber_list))
+ __pnotify_exec(task);
+}
+
+/**
+ * INIT_TASK_PNOTIFY - Used in INIT_TASK to set head and sem of subscriber list
+ * @tsk: The task to work with
+ *
+ * Macro used in INIT_TASK to set the head and sem of pnotify_subscriber_list.
+ * If CONFIG_PNOTIFY is off, it is defined as an empty macro below.
+ *
+ */
+#define INIT_TASK_PNOTIFY(tsk) \
+ .pnotify_subscriber_list = LIST_HEAD_INIT(tsk.pnotify_subscriber_list),\
+ .pnotify_subscriber_list_sem = \
+ __RWSEM_INITIALIZER(tsk.pnotify_subscriber_list_sem),
+
+#else /* CONFIG_PNOTIFY */
+
+/*
+ * Replacement macros used when pnotify (Process Notification) support is not
+ * compiled into the kernel.
+ */
+#define INIT_TASK_PNOTIFY(tsk)
+#define INIT_PNOTIFY_LIST(l) do { } while(0)
+#define pnotify_fork(ct, pt) ({ 0; })
+#define pnotify_exit(t) do { } while(0)
+#define pnotify_exec(t) do { } while(0)
+#define pnotify_unsubscribe(t) do { } while(0)
+
+#endif /* CONFIG_PNOTIFY */
+
+#endif /* _LINUX_NOTIFY_H */
Index: linux/Documentation/pnotify.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/Documentation/pnotify.txt 2005-09-30 14:57:57.655548722 -0500
@@ -0,0 +1,368 @@
+Process Notification (pnotify)
+------------------------------
+pnotify provides a method (service) for kernel modules to be notified when
+certain events happen in the life of a process. Events we support include
+fork, exit, and exec. A special init event is also supported (see events
+below). More events could be added. pnotify also provides a generic data
+pointer for the modules to work with so that data can be associated per
+process.
+
+A kernel module registers a service request (a pnotify_events structure)
+describing the events it cares about by calling pnotify_register. The request
+tells pnotify which notifications the kernel module wants. The kernel module
+passes along function pointers to be called for these events (exit, fork, exec)
+in the pnotify_events service request.
+
+From the process point of view, each process has a kernel module subscriber
+list (pnotify_subscriber_list). These kernel modules are the ones who want
+notification about the life of the process. As described above, each kernel
+module subscriber on the list has a generic data pointer to point to data
+associated with the process.
+
+In the case of fork, pnotify will give the new child the same kernel module
+subscriber list that existed for the parent. The kernel module's
+function pointer for fork is also called for the child being constructed so
+the kernel module can do whatever it needs to do when a parent forks this
+child. Special return values apply to the fork and init events that don't apply
+to the others. They are described in the fork and init examples below.
+
+For exit, similar things happen but the exit function pointer for each
+kernel module subscriber is called and the kernel module subscriber entry for
+that process is deleted.
+
+
+Events
+------
+Events are stages of a process's life that kernel modules care about. The
+fork event is triggered in a certain location in copy_process when a parent
+forks. The exit event happens when a process is going away. We also support
+an exec event, which happens when a process execs. Finally, there is an init
+event. This special event makes it so this kernel module will be associated
+with all current processes in the system at the time of registration. This is
+used when a kernel module wants to keep track of all current processes as
+opposed to just those it associates by itself (and children that follow). The
+events a kernel module cares about are set up in the pnotify_events
+structure - see usage below.
+
+When setting up a pnotify_events, you designate which events you care about
+by either associating NULL (meaning you don't care about that event) or a
+pointer to the function to run when the event is triggered. The fork event
+and the exit event are currently required.
+
+
+How do processes become associated with kernel modules?
+-------------------------------------------------------
+Your kernel module itself can use the pnotify_subscribe function to associate
+a given process with a given pnotify_events structure. This adds
+your kernel module to the subscriber list of the process. In the case
+of inescapable job containers making use of PAM, when PAM allows a person to
+log in, PAM contacts job (via a PAM job module which uses the job userland
+library) and the kernel Job code will call pnotify_subscribe to associate the
+process with pnotify. From that point on, the kernel module will be notified
+about events in the process's life that the module cares about (as well as
+events in the lives of any children that process may later have).
+
+Likewise, your kernel module can remove an association between it and
+a given process by using pnotify_unsubscribe.
+
+
+Example Usage
+-------------
+
+=== filling out the pnotify_events structure ===
+
+A kernel module wishing to use pnotify needs to set up a pnotify_events
+structure. This structure tells pnotify which events you care about and what
+functions to call when those events are triggered. In addition, you supply a
+name (usually the kernel module name). The .entry field is always filled out as
+shown below. .module is usually set to THIS_MODULE. .data can optionally be
+used to store a pointer with the pnotify_events structure.
+
+Example of a filled out pnotify_events:
+
+static struct pnotify_events pnotify_events = {
+ .module = THIS_MODULE,
+ .name = "test_module",
+ .data = NULL,
+ .entry = LIST_HEAD_INIT(pnotify_events.entry),
+ .init = test_init,
+ .fork = test_attach,
+ .exit = test_detach,
+ .exec = test_exec,
+};
+
+The above pnotify_events structure says the kernel module "test_module" cares
+about events fork, exit, exec, and init. In fork, call the kernel module's
+test_attach function. In exec, call test_exec. In exit, call test_detach.
+The init event is specified, so all processes on the system will be associated
+with this kernel module during registration and the test_init function will
+be run for each.
+
+
+=== Registering with pnotify ===
+
+You will likely register with pnotify in your kernel module's module_init
+function. Here is an example:
+
+static int __init test_module_init(void)
+{
+ int rc = pnotify_register(&pnotify_events);
+ if (rc < 0) {
+ return rc; /* propagate the error from pnotify_register */
+ }
+
+ return 0;
+}
+
+
+=== Example init event function ===
+
+Since the init event is defined, it means this kernel module is added
+to the subscriber list of all processes -- it will receive notification
+about events it cares about for all processes and all children that
+follow.
+
+Of course, if a kernel module doesn't need to know about all current
+processes, that module shouldn't implement this and '.init' in the
+pnotify_events structure would be NULL.
+
+This is as opposed to the normal method where the kernel module adds itself
+to the subscriber list of a process using pnotify_subscribe.
+
+Important:
+Note: The implementation of pnotify_register causes us to evaluate some tasks
+more than once in some cases. See the comments in pnotify_register for why.
+Therefore, if the init function pointer returns PNOTIFY_NOSUB, which means
+that it doesn't want a process association, that init function must be
+prepared to possibly look at the same "skipped" task more than once.
+
+Note that the return value here is similar to the fork function pointer
+below except there is no notion of failing the fork since existing processes
+aren't forking.
+
+PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
+PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
+
+static int test_init(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
+{
+ if (pnotify_get_subscriber(tsk, "test_module") == NULL)
+ dprintk("ERROR pnotify expected \"%s\" PID = %d\n", "test_module", tsk->pid);
+
+ dprintk("FYI pnotify init hook fired for PID = %d\n", tsk->pid);
+ atomic_inc(&init_count);
+ return PNOTIFY_OK;
+}
+
+
+=== Example fork (test_attach) function ===
+
+This function is executed when a process forks - it is invoked from the
+pnotify_fork call in copy_process. There would be a very
+similar test_detach function (not shown).
+
+pnotify will add the kernel module to the notification list for the child
+process automatically and then execute this fork function pointer (test_attach
+in this example). However, the kernel module can control whether the kernel
+module stays on the process's subscriber list and wants notification by the
+return value.
+
+PNOTIFY_ERROR - prevent the process from continuing - failing the fork
+PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
+PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
+
+
+static int test_attach(struct task_struct *tsk, struct pnotify_subscriber *subscriber, void *vp)
+{
+ dprintk("pnotify attach hook fired for PID = %d\n", tsk->pid);
+ atomic_inc(&attach_count);
+
+ return PNOTIFY_OK;
+}
+
+
+=== Example exec event function ===
+
+And here is an example function to run when a task gets to exec. So any
+time a "tracked" process gets to exec, this would execute.
+
+static void test_exec(struct task_struct *tsk, struct pnotify_subscriber *subscriber)
+{
+ dprintk("pnotify exec hook fired for PID %d\n", tsk->pid);
+ atomic_inc(&exec_count);
+}
+
+
+=== Unregistering with pnotify ===
+
+You will likely wish to unregister with pnotify in the kernel module's
+module_exit function. Here is an example:
+
+static void __exit test_module_cleanup(void)
+{
+ pnotify_unregister(&pnotify_events);
+ printk("detach called %d times...\n", atomic_read(&detach_count));
+ printk("attach called %d times...\n", atomic_read(&attach_count));
+ printk("init called %d times...\n", atomic_read(&init_count));
+ printk("exec called %d times ...\n", atomic_read(&exec_count));
+ if (atomic_read(&attach_count) + atomic_read(&init_count) !=
+ atomic_read(&detach_count))
+ printk("pnotify PROBLEM: attach count + init count SHOULD equal detach count and doesn't\n");
+ else
+ printk("Good - attach count + init count equals detach count.\n");
+}
+
+
+
+=== Actually using data associated with the process in your module ===
+
+The above examples show you how to create an example kernel module using
+pnotify, but they didn't show what you might do with the data pointer
+associated with a given process. Below, find an example of accessing
+the data pointer for a given process from within a kernel module making use
+of pnotify.
+
+pnotify_get_subscriber is used to retrieve the pnotify subscriber for a given
+process and kernel module. Like this:
+
+subscriber = pnotify_get_subscriber(task, name);
+
+Where name is your kernel module's name (as provided in the pnotify_events
+structure) and task is the process you're interested
+in.
+
+Please be careful about locking. The task structure has a
+pnotify_subscriber_list_sem to be used for locking. This example retrieves
+a given task in a way that ensures it doesn't disappear while we try to
+access it (that's why we do locking for the tasklist_lock and task). The
+pnotify subscriber list is locked to ensure the list doesn't change as we
+search it with pnotify_get_subscriber.
+
+ read_lock(&tasklist_lock);
+ get_task_struct(task); /* Ensure the task doesn't vanish on us */
+ read_unlock(&tasklist_lock); /* Unlock the tasklist */
+ down_read(&task->pnotify_subscriber_list_sem); /* readlock subscriber list */
+
+ subscriber = pnotify_get_subscriber(task, name);
+ if (subscriber) {
+ /* Get the widgitId associated with this task */
+ widgitId = ((widgitId_t *)subscriber->data);
+ }
+ up_read(&task->pnotify_subscriber_list_sem); /* unlock subscriber list */
+ put_task_struct(task); /* Done accessing the task; drop our reference last */
+
+
+Future Events
+-------------
+Kingsley Cheung suggested that we add events for uid and gid changes and this
+may inspire broader use. Depending on how the discussion goes, I'll post a
+patch to add this functionality in the next day or two.
+
+History
+-------
+Process Notification used to be known as PAGG (Process Aggregates).
+It was re-written to be called Process Notification because we believe this
+better describes its purpose. Structures and functions were re-named to
+be more clear and to reflect the new name.
+
+
+Why Not Notifier Lists?
+-----------------------
+We investigated the use of notifier lists, available in newer kernels.
+
+Notifier lists would not be as efficient as pnotify for kernel modules
+wishing to associate data with processes. With pnotify, if the
+pnotify_subscriber_list of a given task is empty, we can instantly know
+there are no kernel modules that care about the process. Further, the
+callbacks happen in places where the task struct is likely to be cached.
+So this is a quick operation. With notifier lists, the scope is system
+wide rather than per process. As long as one kernel module wants to be
+notified, we have to walk the notifier list and potentially waste cycles.
+In the case of pnotify, we only walk lists if we're interested in
+a specific task.
+
+On a system where pnotify is used to track only a few processes, the
+overhead of walking the notifier list is high compared to the overhead
+of walking the kernel module subscriber list only when a kernel module
+is interested in a given process.
+
+I don't believe this is easily solved in notifier lists themselves as
+they are meant to be global resources, not per-task resources.
+
+Overlooking performance issues, notifier lists in and of themselves wouldn't
+solve the problem pnotify solves anyway. Although you could argue notifier
+lists can implement the callback portion of pnotify, there is no association
+of data with a given process. This is needed for kernel modules to
+efficiently associate a task with a data pointer without cluttering up
+the task struct.
+
+In addition to data associated with a process, we desire the ability for
+kernel modules to add themselves to the subscriber list for any arbitrary
+process - not just current or a child of current.
+
+
+Some Justification
+------------------
+We feel that pnotify could be used to reduce the size of the task struct or
+the number of functions in copy_process. For example, if another part of the
+kernel needs to know when a process is forking or exiting, they could use
+pnotify instead of adding additional code to task struct, copy_process, or
+exit.
+
+Some have argued in the past that PAGG shouldn't be used because it will
+allow interesting things to be implemented outside of the kernel. While this
+might be a small risk, having these hooks in place allows customers and users to
+implement kernel components that you don't want to see in the kernel anyway.
+
+For example, a certain vendor may have an urgent need to implement kernel
+functionality or special types of accounting that nobody else is interested
+in. That doesn't mean the code isn't open-source, it just means it isn't
+applicable to all of Linux because it satisfies a niche.
+
+All of pnotify's functionality that needs to be exported is exported with
+EXPORT_SYMBOL_GPL to discourage abuse.
+
+The risk already exists in the kernel for people to implement modules outside
+the kernel that suffer from less peer review and possibly bad programming
+practice. pnotify could add more opportunities for out-of-tree kernel module
+authors to make new modules. I believe this is somewhat mitigated by the
+already-existing 'tainted' warnings in the kernel.
+
+Other Ideas?
+------------
+There have been similar proposals to provide pieces of the pnotify
+functionality. If there is a better proposal out there, let's explore it.
+Here are some key functions I hope to see in any proposal:
+
+ - Ability to have notification for exec, fork, exit at minimum
+ - Ability to extend to other callouts later (such as uid/gid changes as
+ I described earlier)
+ - Ability for pnotify user modules to implement code that ends up adding
+ a kernel module subscriber to any arbitrary process (not just current and
+ its children).
+
+I believe, if the above are more or less met, we should be in good shape for
+our other open source projects such as linux job.
+
+Variable Name Changes from PAGG to pnotify
+------------------------------------------
+PAGG_NAMELEN -> PNOTIFY_NAMELEN
+struct pagg -> pnotify_subscriber
+pagg_get -> pnotify_get_subscriber
+pagg_alloc -> pnotify_subscribe
+pagg_free -> pnotify_unsubscribe
+pagg_hook_register -> pnotify_register
+pagg_hook_unregister -> pnotify_unregister
+pagg_attach -> pnotify_fork
+pagg_detach -> pnotify_exit
+pagg_exec -> pnotify_exec
+struct pagg_hook -> pnotify_events
+
+With pnotify_events (formerly pagg_hook):
+ attach -> fork
+ detach -> exit
+
+Return codes for the init and fork function pointers should use:
+PNOTIFY_ERROR - prevent the process from continuing - failing the fork
+PNOTIFY_OK - good, adds the kernel module to the subscriber list for process
+PNOTIFY_NOSUB - good, but don't add kernel module to subscriber list for process
+

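As a rough illustration of the return codes listed at the end of pnotify.txt, here is a minimal user-space sketch of how a fork callback's result could decide whether a subscriber ends up on the child's list. It is not kernel code and not part of the patch: the `model_*` names, the `MODEL_*` values and the fixed-size array are all assumptions made for the example, and the real pnotify adds the subscriber first and then removes it if the callback declines, which should give the same end result.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real PNOTIFY_* values are defined by the patch. */
enum model_rc { MODEL_OK, MODEL_NOSUB, MODEL_ERROR };

#define MODEL_MAX_SUBS 8

struct model_task;
typedef enum model_rc (*model_fork_fn)(struct model_task *child);

/* A task reduced to a pid and an array of subscriber fork callbacks. */
struct model_task {
	int pid;
	size_t nsubs;
	model_fork_fn subs[MODEL_MAX_SUBS];
};

/*
 * Offer the child to every subscriber of the parent.
 * Returns 0 on success, -1 if any callback asks for the fork to fail.
 */
static int model_fork(struct model_task *child, const struct model_task *parent)
{
	size_t i;

	child->nsubs = 0;
	for (i = 0; i < parent->nsubs; i++) {
		switch (parent->subs[i](child)) {
		case MODEL_OK:		/* subscriber stays on the child's list */
			child->subs[child->nsubs++] = parent->subs[i];
			break;
		case MODEL_NOSUB:	/* fine, but not subscribed to the child */
			break;
		case MODEL_ERROR:	/* veto: the fork fails as a whole */
		default:
			child->nsubs = 0;
			return -1;
		}
	}
	return 0;
}

/* Three toy subscribers, one per return code. */
static enum model_rc always_ok(struct model_task *child)
{
	(void)child;
	return MODEL_OK;
}

static enum model_rc odd_pids_only(struct model_task *child)
{
	return (child->pid % 2) ? MODEL_OK : MODEL_NOSUB;
}

static enum model_rc always_fail(struct model_task *child)
{
	(void)child;
	return MODEL_ERROR;
}
```

With a parent subscribed to `always_ok` and `odd_pids_only`, a child with an even pid inherits only the first subscriber, a child with an odd pid inherits both, and adding `always_fail` to the parent makes `model_fork` report failure and leave the child's list empty.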
2005-10-03 18:51:58

by Erik Jacobson

Subject: [PATCH 2/3] Process Notification / pnotify user: keyrings

Here is an example implementation showing keyrings using pnotify as
a proof of concept. Not all the callouts that keyrings needs are
currently implemented in pnotify (but most could be if desired).

Signed-off-by: Erik Jacobson <[email protected]>
---

include/linux/key.h | 21 +++++
include/linux/sched.h | 4 -
kernel/exit.c | 1
kernel/fork.c | 6 -
security/keys/key.c | 22 ++++++
security/keys/keyctl.c | 28 ++++++-
security/keys/process_keys.c | 152 ++++++++++++++++++++++++++++++++++++-------
security/keys/request_key.c | 31 +++++++-
8 files changed, 222 insertions(+), 43 deletions(-)

Index: linux/include/linux/key.h
===================================================================
--- linux.orig/include/linux/key.h 2005-09-19 22:00:41.000000000 -0500
+++ linux/include/linux/key.h 2005-09-27 09:46:07.501801117 -0500
@@ -19,6 +19,7 @@
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/rcupdate.h>
+#include <linux/pnotify.h>
#include <asm/atomic.h>

#ifdef __KERNEL__
@@ -262,9 +263,9 @@
extern struct key root_user_keyring, root_session_keyring;
extern int alloc_uid_keyring(struct user_struct *user);
extern void switch_uid_keyring(struct user_struct *new_user);
-extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
+extern int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata);
extern int copy_thread_group_keys(struct task_struct *tsk);
-extern void exit_keys(struct task_struct *tsk);
+extern void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub);
extern void exit_thread_group_keys(struct signal_struct *tg);
extern int suid_keys(struct task_struct *tsk);
extern int exec_keys(struct task_struct *tsk);
@@ -279,6 +280,22 @@
old_session; \
})

+/* pnotify subscriber service request */
+static struct pnotify_events key_events = {
+ .module = NULL,
+ .name = "key",
+ .data = NULL,
+ .entry = LIST_HEAD_INIT(key_events.entry),
+ .fork = copy_keys,
+ .exit = exit_keys,
+};
+
+/* key info associated with the task struct and managed by pnotify */
+struct key_task {
+ struct key *thread_keyring; /* keyring private to this thread */
+ unsigned char jit_keyring; /* default keyring to attach requested keys to */
+};
+
#else /* CONFIG_KEYS */

#define key_validate(k) 0
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c 2005-09-27 09:45:59.500655412 -0500
+++ linux/kernel/exit.c 2005-09-27 09:46:07.505706973 -0500
@@ -857,7 +857,6 @@
exit_namespace(tsk);
exit_thread();
cpuset_exit(tsk);
- exit_keys(tsk);

if (group_dead && tsk->signal->leader)
disassociate_ctty(1);
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c 2005-09-27 09:40:40.824808451 -0500
+++ linux/kernel/fork.c 2005-09-27 09:46:07.506683436 -0500
@@ -1009,10 +1009,8 @@
goto bad_fork_cleanup_sighand;
if ((retval = copy_mm(clone_flags, p)))
goto bad_fork_cleanup_signal;
- if ((retval = copy_keys(clone_flags, p)))
- goto bad_fork_cleanup_mm;
if ((retval = copy_namespace(clone_flags, p)))
- goto bad_fork_cleanup_keys;
+ goto bad_fork_cleanup_mm;
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
@@ -1175,8 +1173,6 @@
bad_fork_cleanup_namespace:
pnotify_exit(p);
exit_namespace(p);
-bad_fork_cleanup_keys:
- exit_keys(p);
bad_fork_cleanup_mm:
if (p->mm)
mmput(p->mm);
Index: linux/security/keys/key.c
===================================================================
--- linux.orig/security/keys/key.c 2005-09-19 22:00:41.000000000 -0500
+++ linux/security/keys/key.c 2005-09-27 09:46:07.511565756 -0500
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <linux/err.h>
+#include <linux/pnotify.h>
#include "internal.h"

static kmem_cache_t *key_jar;
@@ -1009,6 +1010,9 @@
*/
void __init key_init(void)
{
+ struct key_task *kt;
+ struct pnotify_subscriber *sub;
+
/* allocate a slab in which we can store keys */
key_jar = kmem_cache_create("key_jar", sizeof(struct key),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
@@ -1039,4 +1043,22 @@
/* link the two root keyrings together */
key_link(&root_session_keyring, &root_user_keyring);

+ /* Allocate memory for task assocated key_task structure */
+ kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL);
+ if (!kt) {
+ printk(KERN_ERR "Insufficient memory to allocate key_task structure"
+ " in key_init function.\n");
+ return;
+ }
+ kt->thread_keyring = NULL;
+
+ /* subscribe this kernel entity to the subscriber list for current task */
+ sub = pnotify_subscribe(current, &key_events);
+ if (!sub) {
+ printk(KERN_ERR "Insufficient memory to add to subscriber list in key_init function.\n");
+ kfree(kt); /* don't leak kt, and don't dereference a NULL sub below */
+ return;
+ }
+ sub->data = (void *)kt; /* associate kt with this task via the pnotify subscriber */
+
} /* end key_init() */
Index: linux/security/keys/process_keys.c
===================================================================
--- linux.orig/security/keys/process_keys.c 2005-09-19 22:00:41.000000000 -0500
+++ linux/security/keys/process_keys.c 2005-09-27 09:46:07.513518684 -0500
@@ -16,6 +16,7 @@
#include <linux/keyctl.h>
#include <linux/fs.h>
#include <linux/err.h>
+#include <linux/pnotify.h>
#include <asm/uaccess.h>
#include "internal.h"

@@ -137,6 +138,8 @@
int install_thread_keyring(struct task_struct *tsk)
{
struct key *keyring, *old;
+ struct key_task *kt;
+ struct pnotify_subscriber *sub;
char buf[20];
int ret;

@@ -149,9 +152,21 @@
}

task_lock(tsk);
- old = tsk->thread_keyring;
- tsk->thread_keyring = keyring;
+ down_write(&tsk->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(tsk, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "install_thread_keyring pnotify subscriber or data ptr null, task: %d\n", tsk->pid);
+ task_unlock(tsk);
+ up_write(&tsk->pnotify_subscriber_list_sem);
+ ret = -EFAULT; /* PTR_ERR(NULL) would be 0, which reads as success */
+ goto error;
+ }
+ kt = (struct key_task *)sub->data;
+
+ old = kt->thread_keyring;
+ kt->thread_keyring = keyring;
task_unlock(tsk);
+ up_write(&tsk->pnotify_subscriber_list_sem);

ret = 0;

@@ -267,13 +282,25 @@
/*
* copy the keys for fork
*/
-int copy_keys(unsigned long clone_flags, struct task_struct *tsk)
+int copy_keys(struct task_struct *tsk, struct pnotify_subscriber *sub, void *olddata)
{
- key_check(tsk->thread_keyring);
+ struct key_task *kt = ((struct key_task *)(sub->data));
+
+ /* Allocate memory for task-associated key_task structure */
+ kt = (struct key_task *)kmalloc(sizeof(struct key_task),GFP_KERNEL);
+ if (!kt) {
+ printk(KERN_ERR "Insufficient memory to allocate key_task structure"
+ " in copy_keys function. Task was: %d\n", tsk->pid);
+ return PNOTIFY_ERROR;
+ }
+ /* Associate key_task structure with the new child via pnotify subscriber */
+ sub->data = (void *)kt;
+
+ key_check(kt->thread_keyring);

/* no thread keyring yet */
- tsk->thread_keyring = NULL;
- return 0;
+ kt->thread_keyring = NULL;
+ return PNOTIFY_OK;

} /* end copy_keys() */

@@ -292,9 +319,16 @@
/*
* dispose of keys upon thread exit
*/
-void exit_keys(struct task_struct *tsk)
+void exit_keys(struct task_struct *task, struct pnotify_subscriber *sub)
{
- key_put(tsk->thread_keyring);
+ struct key_task *kt = ((struct key_task *)(sub->data));
+ if (kt == NULL) { /* shouldn't ever happen */
+ printk(KERN_ERR "exit_keys pnotify subscriber data ptr null, task: %d\n", task->pid);
+ return;
+ }
+ key_put(kt->thread_keyring);
+ kfree(kt); /* Free pnotify subscriber data for this task */
+ sub->data = NULL;

} /* end exit_keys() */

@@ -306,12 +340,28 @@
{
unsigned long flags;
struct key *old;
+ struct key_task *kt;
+ struct pnotify_subscriber *sub;

- /* newly exec'd tasks don't get a thread keyring */
task_lock(tsk);
- old = tsk->thread_keyring;
- tsk->thread_keyring = NULL;
+ /* pnotify doesn't have a compute_creds event at this time, so we
+ * need to retrieve the data */
+
+ down_write(&tsk->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(tsk, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "exec_keys pnotify subscriber or data ptr null, task: %d\n", tsk->pid);
+ task_unlock(tsk);
+ up_write(&tsk->pnotify_subscriber_list_sem);
+ return PNOTIFY_OK; /* key structures not populated yet */
+ }
+ kt = (struct key_task *)sub->data;
+
+ /* newly exec'd tasks don't get a thread keyring */
+ old = kt->thread_keyring;
+ kt->thread_keyring = NULL;
task_unlock(tsk);
+ up_write(&tsk->pnotify_subscriber_list_sem);

key_put(old);

@@ -344,12 +394,26 @@
*/
void key_fsuid_changed(struct task_struct *tsk)
{
+ struct key_task *kt;
+ struct pnotify_subscriber *sub;
+
+ /* no pnotify event for this, so we need to grab the data */
+ down_write(&tsk->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(tsk, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "key_fsuid_changed pnotify subscriber or data ptr null, task: %d\n", tsk->pid);
+ up_write(&tsk->pnotify_subscriber_list_sem);
+ return;
+ }
+ kt = (struct key_task *)sub->data;
+
/* update the ownership of the thread keyring */
- if (tsk->thread_keyring) {
- down_write(&tsk->thread_keyring->sem);
- tsk->thread_keyring->uid = tsk->fsuid;
- up_write(&tsk->thread_keyring->sem);
+ if (kt->thread_keyring) {
+ down_write(&kt->thread_keyring->sem);
+ kt->thread_keyring->uid = tsk->fsuid;
+ up_write(&kt->thread_keyring->sem);
}
+ up_write(&tsk->pnotify_subscriber_list_sem);

} /* end key_fsuid_changed() */

@@ -359,12 +423,26 @@
*/
void key_fsgid_changed(struct task_struct *tsk)
{
+ struct key_task *kt;
+ struct pnotify_subscriber *sub;
+
+ /* pnotify doesn't have an event for this, so we need to grab the data */
+ down_write(&tsk->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(tsk, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "key_fsgid_changed pnotify subscriber or data ptr was null, task: %d\n", tsk->pid);
+ up_write(&tsk->pnotify_subscriber_list_sem);
+ return;
+ }
+ kt = (struct key_task *)sub->data;
+
/* update the ownership of the thread keyring */
- if (tsk->thread_keyring) {
- down_write(&tsk->thread_keyring->sem);
- tsk->thread_keyring->gid = tsk->fsgid;
- up_write(&tsk->thread_keyring->sem);
+ if (kt->thread_keyring) {
+ down_write(&kt->thread_keyring->sem);
+ kt->thread_keyring->gid = tsk->fsgid;
+ up_write(&kt->thread_keyring->sem);
}
+ up_write(&tsk->pnotify_subscriber_list_sem);

} /* end key_fsgid_changed() */

@@ -383,6 +461,8 @@
{
struct request_key_auth *rka;
struct key *key, *ret, *err, *instkey;
+ struct pnotify_subscriber *sub;
+ struct key_task *kt;

/* we want to return -EAGAIN or -ENOKEY if any of the keyrings were
* searchable, but we failed to find a key or we found a negative key;
@@ -395,12 +475,23 @@
ret = NULL;
err = ERR_PTR(-EAGAIN);

+ down_write(&context->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(context, key_events.name);
+ if (sub == NULL || sub->data == NULL) {
+ printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid);
+ up_write(&context->pnotify_subscriber_list_sem);
+ return ERR_PTR(-EFAULT);
+ }
+ kt = (struct key_task *)sub->data;
+
/* search the thread keyring first */
- if (context->thread_keyring) {
- key = keyring_search_aux(context->thread_keyring,
+ if (kt->thread_keyring) {
+ key = keyring_search_aux(kt->thread_keyring,
context, type, description, match);
- if (!IS_ERR(key))
+ if (!IS_ERR(key)) {
+ up_write(&context->pnotify_subscriber_list_sem);
goto found;
+ }

switch (PTR_ERR(key)) {
case -EAGAIN: /* no key */
@@ -414,6 +505,7 @@
break;
}
}
+ up_write(&context->pnotify_subscriber_list_sem);

/* search the process keyring second */
if (context->signal->process_keyring) {
@@ -535,15 +627,26 @@
{
struct key *key;
int ret;
+ struct pnotify_subscriber *sub;
+ struct key_task *kt;

if (!context)
context = current;

key = ERR_PTR(-ENOKEY);

+ down_write(&context->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(context, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "search_process_keyrings pnotify subscriber or data ptr null, task: %d\n", context->pid);
+ up_write(&context->pnotify_subscriber_list_sem);
+ return ERR_PTR(-EFAULT);
+ }
+ kt = (struct key_task *)sub->data;
+
switch (id) {
case KEY_SPEC_THREAD_KEYRING:
- if (!context->thread_keyring) {
+ if (!kt->thread_keyring) {
if (!create)
goto error;

@@ -554,7 +657,7 @@
}
}

- key = context->thread_keyring;
+ key = kt->thread_keyring;
atomic_inc(&key->usage);
break;

@@ -634,6 +737,7 @@
goto invalid_key;

error:
+ up_write(&context->pnotify_subscriber_list_sem);
return key;

invalid_key:
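
The hunks above return errors through the `struct key *` return value and test them with IS_ERR()/PTR_ERR(). A minimal user-space sketch of that convention follows, assuming errno values fit in the top 4095 addresses. The `model_*` names, `MODEL_MAX_ERRNO`, and the lookup function are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumption: the highest 4095 addresses are never valid object pointers,
 * so a small negative errno can be carried inside a pointer value. */
#define MODEL_MAX_ERRNO 4095UL

static inline void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_key {
	int serial;
};

static struct model_key model_thread_key = { 42 };

/* Hypothetical lookup: returns a key, or an errno encoded in the pointer. */
struct model_key *model_lookup(int have_subscriber, int have_key)
{
	if (!have_subscriber)
		return model_err_ptr(-EFAULT);	/* subscriber data missing */
	if (!have_key)
		return model_err_ptr(-EAGAIN);	/* no key found */
	return &model_thread_key;
}

/* Example caller: distinguishes the error case before dereferencing. */
int model_lookup_serial(int have_subscriber, int have_key)
{
	struct model_key *key = model_lookup(have_subscriber, have_key);

	if (model_is_err(key))
		return (int)model_ptr_err(key);
	assert(key != 0);
	return key->serial;
}
```

The caller has to check the returned pointer with the IS_ERR-style test before using it; a plain NULL check would miss these encoded errors.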
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h 2005-09-27 09:40:40.817973203 -0500
+++ linux/include/linux/sched.h 2005-09-27 09:46:07.515471612 -0500
@@ -718,10 +718,6 @@
kernel_cap_t cap_effective, cap_inheritable, cap_permitted;
unsigned keep_capabilities:1;
struct user_struct *user;
-#ifdef CONFIG_KEYS
- struct key *thread_keyring; /* keyring private to this thread */
- unsigned char jit_keyring; /* default keyring to attach requested keys to */
-#endif
int oomkilladj; /* OOM kill score adjustment (bit shift). */
char comm[TASK_COMM_LEN]; /* executable name excluding path
- access with [gs]et_task_comm (which lock
Index: linux/security/keys/keyctl.c
===================================================================
--- linux.orig/security/keys/keyctl.c 2005-09-19 22:00:41.000000000 -0500
+++ linux/security/keys/keyctl.c 2005-09-27 09:46:07.517424540 -0500
@@ -931,31 +931,51 @@
long keyctl_set_reqkey_keyring(int reqkey_defl)
{
int ret;
+ unsigned char jit_return;
+ struct pnotify_subscriber *sub;
+ struct key_task *kt;
+
+ down_write(&current->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(current, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "keyctl_set_reqkey_keyring pnotify subscriber or data ptr null, task: %d\n", current->pid);
+ up_write(&current->pnotify_subscriber_list_sem);
+ return -EFAULT;
+ }
+ kt = (struct key_task *)sub->data;

switch (reqkey_defl) {
case KEY_REQKEY_DEFL_THREAD_KEYRING:
ret = install_thread_keyring(current);
- if (ret < 0)
+ if (ret < 0) {
+ up_write(&current->pnotify_subscriber_list_sem);
return ret;
+ }
goto set;

case KEY_REQKEY_DEFL_PROCESS_KEYRING:
ret = install_process_keyring(current);
- if (ret < 0)
+ if (ret < 0) {
+ up_write(&current->pnotify_subscriber_list_sem);
return ret;
+ }

case KEY_REQKEY_DEFL_DEFAULT:
case KEY_REQKEY_DEFL_SESSION_KEYRING:
case KEY_REQKEY_DEFL_USER_KEYRING:
case KEY_REQKEY_DEFL_USER_SESSION_KEYRING:
set:
- current->jit_keyring = reqkey_defl;
+
+ kt->jit_keyring = reqkey_defl;

case KEY_REQKEY_DEFL_NO_CHANGE:
- return current->jit_keyring;
+ jit_return = kt->jit_keyring;
+ up_write(&current->pnotify_subscriber_list_sem);
+ return jit_return;

case KEY_REQKEY_DEFL_GROUP_KEYRING:
default:
+ up_write(&current->pnotify_subscriber_list_sem);
return -EINVAL;
}

Index: linux/security/keys/request_key.c
===================================================================
--- linux.orig/security/keys/request_key.c 2005-09-19 22:00:41.000000000 -0500
+++ linux/security/keys/request_key.c 2005-09-27 09:46:07.517424540 -0500
@@ -14,6 +14,7 @@
#include <linux/kmod.h>
#include <linux/err.h>
#include <linux/keyctl.h>
+#include <linux/pnotify.h>
#include "internal.h"

struct key_construction {
@@ -39,6 +40,17 @@
char *argv[10], *envp[3], uid_str[12], gid_str[12];
char key_str[12], keyring_str[3][12];
int ret, i;
+ struct pnotify_subscriber *sub;
+ struct key_task *kt;
+
+ down_write(&tsk->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(tsk, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "call_request_key pnotify subscriber or data ptr null, task: %d\n", tsk->pid);
+ up_write(&tsk->pnotify_subscriber_list_sem);
+ return -EFAULT;
+ }
+ kt = (struct key_task *)sub->data;

kenter("{%d},%s,%s", key->serial, op, callout_info);

@@ -58,7 +70,7 @@

/* we specify the process's default keyrings */
sprintf(keyring_str[0], "%d",
- tsk->thread_keyring ? tsk->thread_keyring->serial : 0);
+ kt->thread_keyring ? kt->thread_keyring->serial : 0);

prkey = 0;
if (tsk->signal->process_keyring)
@@ -105,6 +117,7 @@
key_put(session_keyring);

error:
+ up_write(&tsk->pnotify_subscriber_list_sem);
kleave(" = %d", ret);
return ret;

@@ -300,15 +313,26 @@
{
struct task_struct *tsk = current;
struct key *drop = NULL;
+ struct pnotify_subscriber *sub;
+ struct key_task *kt;
+
+ down_write(&tsk->pnotify_subscriber_list_sem);
+ sub = pnotify_get_subscriber(tsk, key_events.name);
+ if (sub == NULL || sub->data == NULL) { /* shouldn't happen */
+ printk(KERN_ERR "request_key_link pnotify subscriber or data ptr null, task: %d\n", tsk->pid);
+ up_write(&tsk->pnotify_subscriber_list_sem);
+ return;
+ }
+ kt = (struct key_task *)sub->data;

kenter("{%d},%p", key->serial, dest_keyring);

/* find the appropriate keyring */
if (!dest_keyring) {
- switch (tsk->jit_keyring) {
+ switch (kt->jit_keyring) {
case KEY_REQKEY_DEFL_DEFAULT:
case KEY_REQKEY_DEFL_THREAD_KEYRING:
- dest_keyring = tsk->thread_keyring;
+ dest_keyring = kt->thread_keyring;
if (dest_keyring)
break;

@@ -347,6 +371,7 @@
key_put(drop);

kleave("");
+ up_write(&tsk->pnotify_subscriber_list_sem);

} /* end request_key_link() */
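
Every converted function in this patch follows the same pattern: take the task's subscriber list lock, look up the keys subscriber, validate its data pointer, use the data, and release the lock on every return path. Below is a single-threaded user-space sketch of that pattern with one unlock site. A counter stands in for the rw_semaphore so the "always released" property can be checked; all `model_*` names and the "keys" subscriber name are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the per-task rw_semaphore: counts outstanding write locks. */
struct model_sem {
	int held;
};

static void model_down_write(struct model_sem *s) { s->held++; }
static void model_up_write(struct model_sem *s) { s->held--; }

struct model_subscriber {
	const char *name;	/* subscriber name used for lookup */
	void *data;		/* per-task private data */
};

struct model_key_task {
	int thread_keyring_serial;
	unsigned char jit_keyring;
};

struct model_task {
	struct model_sem subscriber_list_sem;
	struct model_subscriber *subs;
	size_t nsubs;
};

/* Linear search of the task's subscriber list by name. */
static struct model_subscriber *
model_get_subscriber(struct model_task *t, const char *name)
{
	size_t i;

	for (i = 0; i < t->nsubs; i++)
		if (strcmp(t->subs[i].name, name) == 0)
			return &t->subs[i];
	return NULL;
}

/* Returns jit_keyring (>= 0) or a negative errno. The lock is dropped at a
 * single exit label, so no return path can leave it held. */
int model_read_jit_keyring(struct model_task *t)
{
	struct model_subscriber *sub;
	int ret;

	model_down_write(&t->subscriber_list_sem);
	sub = model_get_subscriber(t, "keys");
	if (sub == NULL || sub->data == NULL) {
		ret = -EFAULT;
		goto out;
	}
	ret = ((struct model_key_task *)sub->data)->jit_keyring;
out:
	model_up_write(&t->subscriber_list_sem);
	return ret;
}
```

Routing every path through one `out:` label is only one way to structure this; the patch instead unlocks before each return, which works but makes a missed unlock easier to introduce.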

2005-10-03 19:02:26

by Erik Jacobson

[permalink] [raw]

Subject: [PATCH 3/3] Process Notification / pnotify user: Job

Here is the Linux job implementation. It provides inescapable job
containers built on top of pnotify.

More information can be found in Documentation/job.txt and
http://oss.sgi.com/projects/pagg/

Note: In the past, the community suggested using a virtual filesystem
interface to userland and moving much of the logic to user space. We
implemented and posted this, but we have found that it performs poorly
in certain situations. For that reason, I'm posting the stable non-jobfs
version here because it works better. We're open to more discussion of
the jobfs version if people have ideas on how to make it faster. In
particular, we ran into performance problems when processes were started
in rapid succession and the site administrator did something that
required a job operation for most tasks. We'd end up spending lots of
time doing inode operations in the virtual filesystem. This patch
doesn't suffer from that problem.

Signed-off-by: Erik Jacobson <[email protected]>

---
Documentation/job.txt | 104 ++
include/linux/job_acct.h | 124 +++
include/linux/jobctl.h | 185 ++++
init/Kconfig | 29
kernel/Makefile | 1
kernel/fork.c | 1
kernel/job.c | 1871 +++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 2315 insertions(+)

Index: linux/Documentation/job.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/Documentation/job.txt 2005-09-30 13:59:14.034582264 -0500
@@ -0,0 +1,104 @@
+Linux Jobs - A Process Notification (pnotify) Module
+----------------------------------------------------
+
+1. Overview
+
+This document provides two additional sections. Section 2 provides a
+listing of the manual page that describes the particulars of the Linux
+job implementation. Section 3 provides some information about using
+the user job library to interface to jobs.
+
+2. Job Man Page
+
+
+JOB(7) Linux User's Manual JOB(7)
+
+
+NAME
+ job - Linux Jobs kernel module overview
+
+DESCRIPTION
+ A job is a group of related processes all descended from a
+ point of entry process and identified by a unique job
+ identifier (jid). A job can contain multiple process
+ groups or sessions, and all processes in one of these sub-
+ groups can only be contained within a single job.
+
+ The primary purpose for having jobs is to provide job
+ based resource limits. The current implementation only
+ provides the job container and resource limits will be
+ provided in a later implementation. When an implementa-
+ tion that provides job limits is available, this descrip-
+ tion will be expanded to provide further explanation of
+ job based limits.
+
+ Not every process on the system is part of a job. That
+ is, only processes which are started by a login initiator
+ like login, rlogin, rsh and so on, get assigned a job ID.
+ In the Linux environment, jobs are created via a PAM mod-
+ ule.
+
+ Jobs on Linux are provided using a loadable kernel module.
+ Linux jobs have the following characteristics:
+
+ o A job is an inescapable container. A process cannot
+ leave the job nor can a new process be created outside
+ the job without explicit action, that is, a system
+ call with root privilege.
+
+ o Each new process inherits the jid and limits [when
+ implemented] from its parent process.
+
+ o All point of entry processes (job initiators) create a
+ new job and set the job limits [when implemented]
+ appropriately.
+
+ o Job initiation on Linux is performed via a PAM session
+ module.
+
+ o The job initiator performs authentication and security
+ checks.
+
+ o Users can raise and lower their own job limits within
+ maximum values specified by the system administrator
+ [when implemented].
+
+ o Not all processes on a system need be members of a job.
+
+ o The process control initialization process (init(1M))
+ and startup scripts called by init are not part of a
+ job.
+
+
+ Job initiators can be categorized as either interactive or
+ batch processes. Limit domain names are defined by the
+ system administrator when the user limits database (ULDB)
+ is created. [The ULDB will be implemented in conjunction
+ with future job limits work.]
+
+ Note: The existing command jobs(1) applies to shell "jobs"
+ and it is not related to the Linux Kernel Module jobs.
+ The at(1), atd(8), atq(1), batch(1), atrun(8), and atrm(1)
+ man pages refer to a shell script as a job.
+
+SEE ALSO
+ job(1), jwait(1), jstat(1), jkill(1)
+
+
+
+
+
+
+
+
+
+3. User Job Library
+
+For developers who wish to make software using Linux Jobs, there exists
+a user job library. This library contains functions for obtaining information
+about running jobs, creating jobs, detaching, etc.
+
+The library is part of the job package and can be obtained from oss.sgi.com
+using anonymous ftp. Look in the /projects/pagg/download directory. See the
+README in the job source package for more information.
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig 2005-09-30 12:14:10.989916853 -0500
+++ linux/init/Kconfig 2005-09-30 13:59:19.749826026 -0500
@@ -170,6 +170,35 @@
Linux Jobs module and the Linux Array Sessions module. If you will not
be using such modules, say N.

+config JOB
+ tristate " Process Notification (pnotify) based jobs"
+ depends on PNOTIFY
+ help
+ The Job feature implements a type of process aggregate,
+ or grouping. A job is the collection of all processes that
+ are descended from a point-of-entry process. Examples of such
+ points-of-entry include telnet, rlogin, and console logins.
+
+ Batch schedulers such as LSF also make use of Job for containing,
+ maintaining, and signaling a job as one entity.
+
+ A job differs from a session and process group since the job
+ container (or group) is inescapable. Only root level processes,
+ or those with the CAP_SYS_RESOURCE capability, can create new jobs
+ or escape from a job.
+
+ A job is identified by a unique job identifier (jid). Currently,
+ that jid can be used to obtain status information about the job
+ and the processes it contains. The jid can also be used to send
+ signals to all processes contained in the job. In addition,
+ other processes can wait for the completion of a job - the event
+ where the last process contained in the job has exited.
+
+ If you want to compile support for jobs into the kernel, select
+ this entry using Y. If you want the support for jobs provided as
+ a module, select this entry using M. If you do not want support
+ for jobs, select N.
+
config SYSCTL
bool "Sysctl support"
---help---
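
The help text above says a job is identified by a unique jid. In the job.c diff that follows, the 64-bit jid is treated as two 32-bit halves, a host ID (hid) in the upper half and a sequence ID (sid) in the lower half, and the hash bucket is the sid modulo HASH_SIZE (1024). The sketch below shows the same layout with shifts instead of the patch's endian-specific pointer macros; the `model_*` names are hypothetical and this is an illustration, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HASH_SIZE 1024u	/* same value as HASH_SIZE in job.c */

/* Build a jid from a host ID (upper 32 bits) and sequence ID (lower 32). */
static inline uint64_t model_make_jid(uint32_t hid, uint32_t sid)
{
	return ((uint64_t)hid << 32) | sid;
}

/* Extract the host ID half. */
static inline uint32_t model_jid_hid(uint64_t jid)
{
	return (uint32_t)(jid >> 32);
}

/* Extract the sequence ID half. */
static inline uint32_t model_jid_sid(uint64_t jid)
{
	return (uint32_t)jid;
}

/* Hash bucket index: only the sequence ID takes part in the hash. */
static inline uint32_t model_jid_hash(uint64_t jid)
{
	return model_jid_sid(jid) % MODEL_HASH_SIZE;
}

/* Example: two hosts using the same sid land in the same bucket but
 * still have distinct jids. */
int model_same_bucket_distinct_jid(void)
{
	uint64_t a = model_make_jid(1, 5);
	uint64_t b = model_make_jid(2, 5);

	assert(a != b);
	return model_jid_hash(a) == model_jid_hash(b);
}
```

Because only the sid is hashed, jids that differ only in their host ID share a bucket, so a lookup still has to compare the full 64-bit jid, as job_getjob() does.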
Index: linux/kernel/job.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/kernel/job.c 2005-09-30 13:59:14.049229224 -0500
@@ -0,0 +1,1871 @@
+/*
+ * Linux Job kernel module
+ *
+ *
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ *
+ * For further information regarding this notice, see:
+ *
+ * http://oss.sgi.com/projects/GenInfo/NoticeExplan
+ */
+
+/*
+ * Description: This file implements a type of process grouping called jobs.
+ * For further information about jobs, consult the file
+ * Documentation/job.txt. Jobs are implemented using Process Notification
+ * (pnotify). For more information about pnotify, see
+ * Documentation/pnotify.txt.
+ */
+
+/*
+ * LOCKING INFO
+ *
+ * There are currently two levels of locking in this module. So, we
+ * have two classes of locks:
+ *
+ * (1) job table lock (always, job_table_sem)
+ * (2) job entry lock (usually, job->sem)
+ *
+ * Most of the locking used is read/write semaphores. In rare cases, a
+ * spinlock is also used. Those cases requiring a spinlock concern when the
+ * tasklist_lock must be locked (such as when looping over all tasks on the
+ * system).
+ *
+ * There is only one job_table_sem. There is a job->sem for each job
+ * entry in the job_table. This job module uses Process Notification
+ * (pnotify). Each task has a special lock that protects its pnotify
+ * information - this is called the pnotify_subscriber_list lock. There are
+ * special macros used to lock/unlock a task's subscriber list lock. The
+ * subscriber list lock is really a semaphore.
+ *
+ * Purpose:
+ *
+ * (1) The job_table_sem protects all entries in the table.
+ * (2) The job->sem protects all data and task attachments for the job.
+ *
+ * Truths we hold to be self-evident:
+ *
+ * Only the holder of a write lock for the job_table_sem may add or
+ * delete a job entry from the job_table. The job_table includes all job
+ * entries in the hash table and chains off the hash table locations.
+ *
+ * Only the holder of a write lock for a job->sem may attach or detach
+ * processes/tasks from the attached list for the job.
+ *
+ * If you hold a read lock of job_table_sem, you can assume that the
+ * job entries in the table will not change. The link pointers for
+ * the chains of job entries will not change, the job ID (jid) value
+ * will not change, and data changes will be (mostly) atomic.
+ *
+ * If you hold a read lock of a job->sem, you can assume that the
+ * attachments to the job will not change. The link pointers for the
+ * attachment list will not change and the attachments will not change.
+ *
+ * If you are going to grab nested locks, the nesting order is:
+ *
+ * down_write/up_write/down_read/up_read(&task->pnotify_subscriber_list_sem)
+ * job_table_sem
+ * job->sem
+ *
+ * However, it is not strictly necessary to down the job_table_sem
+ * before downing job->sem.
+ *
+ * Also, the nesting order allows you to lock in this order:
+ *
+ * down_write/up_write/down_read/up_read(&task->pnotify_subscriber_list_sem)
+ * job->sem
+ *
+ * without locking job_table_sem between the two.
+ *
+ */
+
+/* standard for kernel modules */
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kmod.h>
+#include <linux/init.h>
+#include <linux/list.h>
+
+#include <asm/uaccess.h> /* for get_user & put_user */
+
+#include <linux/sched.h> /* for current */
+#include <linux/tty.h> /* for the tty declarations */
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include <linux/proc_fs.h>
+
+#include <linux/string.h>
+#include <asm/semaphore.h>
+#include <linux/moduleparam.h>
+
+#include <linux/pnotify.h> /* to use process notification service */
+#include <linux/jobctl.h>
+#include <linux/job_acct.h>
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("pnotify based inescapable jobs");
+MODULE_LICENSE("GPL");
+
+#define HASH_SIZE 1024
+
+/* The states for a job */
+#define RUNNING 1 /* Running job */
+#define ZOMBIE 2 /* Dead job */
+
+/* Job creation tags for the job HID (host ID) */
+#define DISABLED 0xffffffff /* New job creation disabled */
+#define LOCAL 0x0 /* Only creating local sys jobs */
+
+
+#ifdef __BIG_ENDIAN
+#define iptr_hid(ll) ((u32 *)&(ll))
+#define iptr_sid(ll) (((u32 *)(&(ll) + 1)) - 1)
+#else /* __LITTLE_ENDIAN */
+#define iptr_hid(ll) (((u32 *)(&(ll) + 1)) - 1)
+#define iptr_sid(ll) ((u32 *)&(ll))
+#endif /* __BIG_ENDIAN */
+
+#define jid_hash(ll) (*(iptr_sid(ll)) % HASH_SIZE)
+
+
+/* Job info entry for member tasks */
+struct job_attach {
+ struct task_struct *task; /* task we are attaching to job */
+ struct pnotify_subscriber *subscriber; /* our subscriber entry in task */
+ struct job_entry *job; /* the job we are attaching task to */
+ struct list_head entry; /* list stuff */
+};
+
+struct job_waitinfo {
+ int status; /* For tasks waiting on job exit */
+};
+
+struct job_csainfo {
+ u64 corehimem; /* Accounting - highpoint, phys mem */
+ u64 virthimem; /* Accounting - highpoint, virt mem */
+ struct file *acctfile; /* The accounting file for job */
+};
+
+/* Job table entry type */
+struct job_entry {
+ u64 jid; /* Our job ID */
+ int refcnt; /* Number of tasks attached to job */
+ int state; /* State of job - RUNNING,... */
+ struct rw_semaphore sem; /* lock for the job */
+ uid_t user; /* user that owns the job */
+ time_t start; /* When the job began */
+ struct job_csainfo csa; /* CSA accounting info */
+ wait_queue_head_t zombie; /* queue last task - during wait */
+ wait_queue_head_t wait; /* queue of tasks waiting on job */
+ int waitcnt; /* Number of tasks waiting on job */
+ struct job_waitinfo waitinfo; /* Status info for waiting tasks */
+ struct list_head attached; /* List of attached tasks */
+ struct list_head entry; /* List of other jobs - same hash */
+};
+
+
+/* Job container tables */
+static struct list_head job_table[HASH_SIZE];
+static int job_table_refcnt = 0;
+static DECLARE_RWSEM(job_table_sem);
+
+
+/* Accounting subscriber list */
+static struct job_acctmod *acct_list[JOB_ACCT_COUNT];
+static DECLARE_RWSEM(acct_list_sem);
+
+
+/* Host ID for the localhost */
+static u32 jid_hid;
+
+static char *hid = NULL;
+module_param(hid, charp, 0);
+
+/* Function prototypes */
+static int job_dispatch_create(struct job_create *);
+static int job_dispatch_getjid(struct job_getjid *);
+static int job_dispatch_waitjid(struct job_waitjid *);
+static int job_dispatch_killjid(struct job_killjid *);
+static int job_dispatch_getjidcnt(struct job_jidcnt *);
+static int job_dispatch_getjidlst(struct job_jidlst *);
+static int job_dispatch_getpidcnt(struct job_pidcnt *);
+static int job_dispatch_getpidlst(struct job_pidlst *);
+static int job_dispatch_getuser(struct job_user *);
+static int job_dispatch_getprimepid(struct job_primepid *);
+static int job_dispatch_sethid(struct job_sethid *);
+static int job_dispatch_detachjid(struct job_detachjid *);
+static int job_dispatch_detachpid(struct job_detachpid *);
+static int job_dispatch_attachpid(struct job_attachpid *);
+static int job_attach(struct task_struct *, struct pnotify_subscriber *, void *);
+static void job_detach(struct task_struct *, struct pnotify_subscriber *);
+static struct job_entry *job_getjob(u64 jid);
+static int job_dispatcher(unsigned int, unsigned long);
+
+u64 job_getjid(struct task_struct *);
+
+int job_ioctl(struct inode *, struct file *, unsigned int, unsigned long);
+
+/* Job container pnotify service request */
+static struct pnotify_events events = {
+ .module = THIS_MODULE,
+ .name = PNOTIFY_JOB,
+ .data = &job_table,
+ .entry = LIST_HEAD_INIT(events.entry),
+ .fork = job_attach,
+ .exit = job_detach,
+};
+
+/* proc dir entry */
+struct proc_dir_entry *job_proc_entry;
+
+/* file operations for proc file */
+static struct file_operations job_file_ops = {
+ .owner = THIS_MODULE,
+ .ioctl = job_ioctl
+};
+
+
+/*
+ * job_getjob - return job_entry given a jid
+ * @jid: The jid of the job entry we wish to retrieve
+ *
+ * Given a jid value, find the entry in the job_table and return a pointer
+ * to the job entry or NULL if not found.
+ *
+ * You should normally down_read the job_table_sem before calling this
+ * function.
+ */
+struct job_entry *
+job_getjob(u64 jid)
+{
+ struct list_head *entry = NULL;
+ struct job_entry *tjob = NULL;
+ struct job_entry *job = NULL;
+
+ list_for_each(entry, &job_table[ jid_hash(jid) ]) {
+ tjob = list_entry(entry, struct job_entry, entry);
+ if (tjob->jid == jid) {
+ job = tjob;
+ break;
+ }
+ }
+ return job;
+}
+
+
+/*
+ * job_attach - Attach a task to a specified job
+ * @task: Task we want to attach to the job
+ * @new_subscriber: The already allocated subscriber struct for the task
+ * @old_data: ((struct job_attach *)old_data)->job is the specified job
+ *
+ * Attach the task to the job specified in the target data (old_data).
+ * This function will add the task to the list of attached tasks for the job.
+ * In addition, a link from the task to the job is created and added to the
+ * task via the data pointer reference.
+ *
+ * The process that owns the target data should be at least read locked (using
+ * down_read(&task->pnotify_subscriber_list_sem)) during this call. This helps
+ * ensure that the job cannot be removed, since at least one process will
+ * still be referencing the job (the one owning the target data).
+ *
+ * It is expected that this function will be called from within the
+ * pnotify_fork() function in the kernel, when forking (do_fork) a child
+ * process represented by task.
+ *
+ * If this function is called from some other point, then it is possible that
+ * task and data could be altered while going through this function. In such
+ * a case, the caller should also lock the pnotify_subscriber_list for the task
+ * task_struct.
+ *
+ * Returns 0 upon success, and a negative errno value upon failure.
+ */
+static int
+job_attach(struct task_struct *task, struct pnotify_subscriber *new_subscriber,
+ void *old_data)
+{
+ struct job_entry *job = ((struct job_attach *)old_data)->job;
+ struct job_attach *attached = NULL;
+ int errcode = 0;
+
+ /*
+ * Lock the job for writing. The task owning target_data has its
+ * pnotify_subscriber_list_sem locked, so we know there is at least one
+ * active reference to the job - therefore, it cannot have been removed
+ * before we have gotten this write lock established.
+ */
+ down_write(&job->sem);
+
+ if (job->state == ZOMBIE) {
+ /* If the job is a zombie (dying), bail out of the attach */
+ printk(KERN_WARNING "Attach task(pid=%d) to job"
+ " failed - job is ZOMBIE\n",
+ task->pid);
+ errcode = -EINPROGRESS;
+ up_write(&job->sem);
+ goto error_return;
+ }
+
+
+ /* Allocate memory that we will need */
+
+ attached = (struct job_attach *)kmalloc(sizeof(struct job_attach),
+ GFP_KERNEL);
+ if (!attached) {
+ /* error */
+ printk(KERN_ERR "Attach task(pid=%d) to job"
+ " failed on memory error in kernel\n",
+ task->pid);
+ errcode = -ENOMEM;
+ up_write(&job->sem);
+ goto error_return;
+ }
+
+
+ attached->task = task;
+ attached->subscriber = new_subscriber;
+ attached->job = job;
+ new_subscriber->data = (void *)attached;
+ list_add_tail(&attached->entry, &job->attached);
+ ++job->refcnt;
+
+ up_write(&job->sem);
+
+ return 0;
+
+error_return:
+ kfree(attached);
+ return errcode;
+}
+
+
+/*
+ * job_detach - Detach a task via the pnotify subscriber reference
+ * @task: The task to be detached
+ * @subscriber: The pnotify subscriber reference
+ *
+ * Detach the task from the job attached to via the pnotify reference.
+ * This function will remove the task from the list of attached tasks for the
+ * job specified via the pnotify data pointer. In addition, the link to the
+ * job provided via the data pointer will also be removed.
+ *
+ * The pnotify_subscriber_list should be write locked for task before entering
+ * this function (using down_write(&task->pnotify_subscriber_list_sem)).
+ *
+ * This function does not return a value.
+ */
+static void
+job_detach(struct task_struct *task, struct pnotify_subscriber *subscriber)
+{
+ struct job_attach *attached = ((struct job_attach *)(subscriber->data));
+ struct job_entry *job = attached->job;
+ struct job_csa csa;
+ struct job_acctmod *acct;
+
+ /*
+ * Obtain the lock on the job_table_sem and the job->sem for
+ * this job.
+ */
+ down_write(&job_table_sem);
+ down_write(&job->sem);
+
+ /* - CSA accounting */
+ if (acct_list[JOB_ACCT_CSA]) {
+ acct = acct_list[JOB_ACCT_CSA];
+ if (acct->module) {
+ if (try_module_get(acct->module) == 0) {
+ printk(KERN_WARNING
+ "job_detach: Tried to get non-living acct module\n");
+ }
+ }
+ if (acct->eop) {
+ csa.job_id = job->jid;
+ csa.job_uid = job->user;
+ csa.job_start = job->start;
+ csa.job_corehimem = job->csa.corehimem;
+ csa.job_virthimem = job->csa.virthimem;
+ csa.job_acctfile = job->csa.acctfile;
+ acct->eop(task->exit_code, task, &csa);
+ }
+ if (acct->module)
+ module_put(acct->module);
+ }
+ job->refcnt--;
+ list_del(&attached->entry);
+ subscriber->data = NULL;
+ kfree(attached);
+
+ if (job->refcnt == 0) {
+ int waitcnt;
+
+ list_del(&job->entry);
+ --job_table_refcnt;
+
+ /*
+ * The job is removed from the job_table.
+ * We can remove the job_table_sem now since
+ * nobody can access the job via the table.
+ */
+ up_write(&job_table_sem);
+
+ job->state = ZOMBIE;
+ job->waitinfo.status = task->exit_code;
+
+ waitcnt = job->waitcnt;
+
+ /*
+ * Release the job semaphore. You cannot hold
+ * this lock if you want the wakeup to work
+ * properly.
+ */
+ up_write(&job->sem);
+
+ if (waitcnt > 0) {
+ wake_up_interruptible(&job->wait);
+ wait_event(job->zombie, job->waitcnt == 0);
+ }
+
+ /*
+ * Job is exiting, all processes waiting for job to exit
+ * have been notified. Now we call the accounting
+ * subscribers.
+ */
+
+ /* - CSA accounting */
+ if (acct_list[JOB_ACCT_CSA]) {
+ acct = acct_list[JOB_ACCT_CSA];
+ if (acct->module) {
+ if (try_module_get(acct->module) == 0) {
+ printk(KERN_WARNING
+ "job_detach: Tried to get non-living acct module\n");
+ }
+ }
+ if (acct->jobend) {
+ int res = 0;
+
+ csa.job_id = job->jid;
+ csa.job_uid = job->user;
+ csa.job_start = job->start;
+ csa.job_corehimem = job->csa.corehimem;
+ csa.job_virthimem = job->csa.virthimem;
+ csa.job_acctfile = job->csa.acctfile;
+
+ res = acct->jobend(JOB_EVENT_END,
+ &csa);
+ if (res) {
+ printk(KERN_WARNING
+ "job_detach: CSA -"
+ " jobend failed.\n");
+ }
+ }
+ if (acct->module)
+ module_put(acct->module);
+ }
+ /*
+ * Every process attached or waiting on this job should be
+ * detached and finished waiting, so now we can free the
+ * memory for the job.
+ */
+ kfree(job);
+
+ } else {
+ /* This is the case where job->refcnt was greater than 1 before
+ * the detach, so we are not going to delete the job.
+ * Both the job->sem and the job_table_sem are still held,
+ * so release them here.
+ */
+ up_write(&job->sem);
+ up_write(&job_table_sem);
+ }
+
+ return;
+}
+
+/*
+ * job_dispatch_create - create a new job and attach the calling process to it
+ * @create_args: Pointer to the job_create struct which stores the create request
+ *
+ * Returns 0 on success, and negative on failure (negative errno value).
+ */
+static int
+job_dispatch_create(struct job_create *create_args)
+{
+ struct job_create create;
+ struct job_entry *job = NULL;
+ struct job_attach *attached = NULL;
+ struct pnotify_subscriber *subscriber = NULL;
+ struct pnotify_subscriber *old_subscriber = NULL;
+ int errcode = 0;
+ struct job_acctmod *acct = NULL;
+ static u32 jid_count = 0;
+ u32 initial_jid_count;
+
+ if (!create_args)
+ return -EINVAL;
+
+ /*
+ * Copy the request in first, so that error_return never copies
+ * uninitialized stack data back to user space.
+ */
+ if (copy_from_user(&create, create_args, sizeof(create)))
+ return -EFAULT;
+
+ /*
+ * if the job ID - host ID segment is set to DISABLED, we will
+ * not be creating new jobs. We don't mark it as an error, but
+ * the jid value returned will be 0.
+ */
+ if (jid_hid == DISABLED) {
+ errcode = 0;
+ goto error_return;
+ }
+ if (!capable(CAP_SYS_RESOURCE)) {
+ errcode = -EPERM;
+ goto error_return;
+ }
+
+ /*
+ * Allocate some of the memory we might need, before we start
+ * locking
+ */
+
+ attached = (struct job_attach *)kmalloc(sizeof(struct job_attach), GFP_KERNEL);
+ if (!attached) {
+ /* error */
+ errcode = -ENOMEM;
+ goto error_return;
+ }
+
+ job = (struct job_entry *)kmalloc(sizeof(struct job_entry), GFP_KERNEL);
+ if (!job) {
+ /* error */
+ errcode = -ENOMEM;
+ goto error_return;
+ }
+
+ /* We keep the old pnotify subscriber reference around in case we need it
+ * in an error condition. If, for example, a job_getjob call fails because
+ * the requested JID is already in use, we don't want to detach that job.
+ * Having this ability is complicated by the locking.
+ */
+ down_write(&current->pnotify_subscriber_list_sem);
+ old_subscriber = pnotify_get_subscriber(current, events.name);
+
+ /*
+ * Lock the job_table and add the pointers for the new job.
+ * Since the job is new, we won't need to lock the job.
+ */
+ down_write(&job_table_sem);
+
+ /*
+ * Determine if create should use specified JID or one that is
+ * generated.
+ */
+ if (create.jid != 0) {
+ /* We use the specified JID value */
+ job->jid = create.jid;
+ /* Does the supplied JID conflict with an existing one? */
+ if (job_getjob(job->jid)) {
+ /* JID already in use, bail. error_return tosses/frees job */
+
+ /* error_return doesn't do up_write() */
+ up_write(&job_table_sem);
+ /* we haven't allocated a new pnotify subscriber reference yet so
+ * error_return won't unlock this. We'll unlock here */
+ up_write(&current->pnotify_subscriber_list_sem);
+ errcode = -EBUSY;
+ /* error_return doesn't touch old_subscriber so we don't detach */
+ goto error_return;
+ }
+ } else {
+ /* We generate a new JID value from the host ID and a sequence count */
+ *(iptr_hid(job->jid)) = jid_hid;
+ *(iptr_sid(job->jid)) = jid_count;
+ initial_jid_count = jid_count++;
+ while (((job->jid == 0) || (job_getjob(job->jid))) &&
+ jid_count != initial_jid_count) {
+
+ /* JID was in use or was zero, try a new one */
+ *(iptr_sid(job->jid)) = jid_count++;
+ }
+ /* If all the JIDs are in use, fail */
+ if (jid_count == initial_jid_count) {
+ /* error_return tosses/frees job */
+ /* error_return doesn't do up_write() */
+ up_write(&job_table_sem);
+ /* we haven't allocated a new pnotify subscriber yet so error_return
+ * won't unlock this. We'll unlock here */
+ up_write(&current->pnotify_subscriber_list_sem);
+ errcode = -EBUSY;
+ /* error_return doesn't touch old_subscriber so we don't detach */
+ goto error_return;
+ }
+
+ }
+
+ subscriber = pnotify_subscribe(current, &events);
+ if (!subscriber) {
+ /* error */
+ up_write(&job_table_sem); /* unlock since error_return doesn't */
+ up_write(&current->pnotify_subscriber_list_sem);
+ errcode = -ENOMEM;
+ goto error_return;
+ }
+
+ /* Initialize job entry values & lists */
+ job->refcnt = 1;
+ job->user = create.user;
+ job->start = jiffies;
+ job->csa.corehimem = 0;
+ job->csa.virthimem = 0;
+ job->csa.acctfile = NULL;
+ job->state = RUNNING;
+ init_rwsem(&job->sem);
+ INIT_LIST_HEAD(&job->attached);
+ list_add_tail(&attached->entry, &job->attached);
+ init_waitqueue_head(&job->wait);
+ init_waitqueue_head(&job->zombie);
+ job->waitcnt = 0;
+ job->waitinfo.status = 0;
+
+ /* set link from entry in attached list to task and job entry */
+ attached->task = current;
+ attached->job = job;
+ attached->subscriber = subscriber;
+ subscriber->data = (void *)attached;
+
+ /* Append the new job to the tail of its hash chain */
+ list_add_tail(&job->entry, &job_table[ jid_hash(job->jid) ]);
+ ++job_table_refcnt;
+
+ up_write(&job_table_sem);
+ /* At this point, the possible error conditions where we would need the
+ * old pnotify subscriber are gone. So we can remove it. We remove after
+ * we unlock because the detach function takes the job table lock itself.
+ */
+ if (old_subscriber) {
+ /*
+ * Detaching subscribers for jobs never has a failure case,
+ * so we don't need to worry about error codes.
+ */
+ old_subscriber->events->exit(current, old_subscriber);
+ pnotify_unsubscribe(old_subscriber);
+ }
+ up_write(&current->pnotify_subscriber_list_sem);
+
+ /* Issue callbacks into accounting subscribers */
+
+ /* - CSA subscriber */
+ if (acct_list[JOB_ACCT_CSA]) {
+ acct = acct_list[JOB_ACCT_CSA];
+ if (acct->module) {
+ if (try_module_get(acct->module) == 0) {
+ printk(KERN_WARNING
+ "job_dispatch_create: Tried to get non-living acct module\n");
+ }
+ }
+ if (acct->jobstart) {
+ int res;
+ struct job_csa csa;
+
+ csa.job_id = job->jid;
+ csa.job_uid = job->user;
+ csa.job_start = job->start;
+ csa.job_corehimem = job->csa.corehimem;
+ csa.job_virthimem = job->csa.virthimem;
+ csa.job_acctfile = job->csa.acctfile;
+
+ res = acct->jobstart(JOB_EVENT_START, &csa);
+ if (res < 0) {
+ printk(KERN_WARNING "job_dispatch_create: CSA -"
+ " jobstart failed.\n");
+ }
+ }
+ if (acct->module)
+ module_put(acct->module);
+ }
+
+
+ create.r_jid = job->jid;
+ if (copy_to_user(create_args, &create, sizeof(create))) {
+ return -EFAULT;
+ }
+
+ return 0;
+
+error_return:
+ kfree(attached);
+ kfree(job);
+ create.r_jid = 0;
+ if (copy_to_user(create_args, &create, sizeof(create)))
+ return -EFAULT;
+
+ return errcode;
+}
+
+
+/*
+ * job_dispatch_getjid - retrieves the job ID (jid) for the specified process (pid)
+ * @getjid_args: Pointer of job_getjid struct which stores the get request
+ *
+ * returns 0 on success, negative errno value on failure.
+ */
+static int
+job_dispatch_getjid(struct job_getjid *getjid_args)
+{
+ struct job_getjid getjid;
+ int errcode = 0;
+ struct task_struct *task;
+
+ if (copy_from_user(&getjid, getjid_args, sizeof(getjid)))
+ return -EFAULT;
+
+ /* lock the tasklist until we grab the specific task */
+ read_lock(&tasklist_lock);
+
+ if (getjid.pid == current->pid) {
+ task = current;
+ } else {
+ task = find_task_by_pid(getjid.pid);
+ }
+ if (task) {
+ get_task_struct(task); /* Ensure the task doesn't vanish on us */
+ read_unlock(&tasklist_lock); /* unlock the task list */
+ getjid.r_jid = job_getjid(task);
+ put_task_struct(task); /* We're done accessing the task */
+ if (getjid.r_jid == 0) {
+ errcode = -ENODATA;
+ }
+ } else {
+ read_unlock(&tasklist_lock);
+ getjid.r_jid = 0;
+ errcode = -ESRCH;
+ }
+
+
+ if (copy_to_user(getjid_args, &getjid, sizeof(getjid)))
+ return -EFAULT;
+ return errcode;
+}
+
+
+/*
+ * job_dispatch_waitjid - allows a process to wait until a job exits
+ * @waitjid_args: Pointer of job_waitjid struct which stores the wait request
+ *
+ * Returns 0 on success, negative errno value on failure.
+ */
+static int
+job_dispatch_waitjid(struct job_waitjid *waitjid_args)
+{
+ struct job_waitjid waitjid;
+ struct job_entry *job;
+ int retcode = 0;
+
+ if (copy_from_user(&waitjid, waitjid_args, sizeof(waitjid)))
+ return -EFAULT;
+
+ waitjid.r_jid = waitjid.stat = 0;
+
+ if (waitjid.options != 0) {
+ retcode = -EINVAL;
+ goto general_return;
+ }
+
+ /* Lock the job table so that the current jobs don't change */
+ down_read(&job_table_sem);
+
+ if ((job = job_getjob(waitjid.jid)) == NULL ) {
+ up_read(&job_table_sem);
+ retcode = -ENODATA;
+ goto general_return;
+ }
+
+ /*
+ * We got the job we need, we can release the job_table_sem
+ */
+ down_write(&job->sem);
+ up_read(&job_table_sem);
+
+ ++job->waitcnt;
+
+ up_write(&job->sem);
+
+ /* We shouldn't hold any locks at this point! The increment of the
+ * job's waitcnt will ensure that the job is not removed without
+ * first notifying this current task */
+ retcode = wait_event_interruptible(job->wait,
+ job->refcnt == 0);
+
+ if (!retcode) {
+ /*
+ * This data is static at this point, we will
+ * not need a lock to read it.
+ */
+ waitjid.stat = job->waitinfo.status;
+ waitjid.r_jid = job->jid;
+ }
+
+ down_write(&job->sem);
+ --job->waitcnt;
+
+ if (job->waitcnt == 0) {
+ up_write(&job->sem);
+
+ /*
+ * We shouldn't hold any locks at this point! Else, the
+ * last process in the job will not be able to remove the
+ * job entry.
+ *
+ * That process is stuck waiting for this wake_up, so the
+ * job shouldn't disappear until after this function call.
+ * The job entry is no longer in the job table, so no
+ * other process can get to the entry to foul things up.
+ */
+ wake_up(&job->zombie);
+ } else {
+ up_write(&job->sem);
+ }
+
+general_return:
+ if (copy_to_user(waitjid_args, &waitjid, sizeof(waitjid)))
+ return -EFAULT;
+ return retcode;
+}
+
+
+/*
+ * job_dispatch_killjid - send a signal to all processes in a job
+ * @killjid_args: Pointer of job_killjid struct which stores the kill request
+ *
+ * returns 0 on success, negative of errno on failure.
+ */
+static int
+job_dispatch_killjid(struct job_killjid *killjid_args)
+{
+ struct job_killjid killjid;
+ struct job_entry *job;
+ struct list_head *attached_entry;
+ struct siginfo info;
+ int retcode = 0;
+
+ if (copy_from_user(&killjid, killjid_args, sizeof(killjid))) {
+ retcode = -EFAULT;
+ goto cleanup_0locks_return;
+ }
+
+ killjid.r_val = -1;
+
+ /* A signal of zero is really a status check and is handled as such
+ * by send_sig_info. So we have < 0 instead of <= 0 here.
+ */
+ if (killjid.sig < 0) {
+ retcode = -EINVAL;
+ goto cleanup_0locks_return;
+ }
+
+ down_read(&job_table_sem);
+ job = job_getjob(killjid.jid);
+ if (!job) {
+ /* Job not found, copy back data & bail with error */
+ retcode = -ENODATA;
+ goto cleanup_1locks_return;
+ }
+
+ down_read(&job->sem);
+
+ /*
+ * Check capability to signal job. The signaling user must be
+ * the owner of the job or have CAP_SYS_RESOURCE capability.
+ */
+ if (!capable(CAP_SYS_RESOURCE)) {
+ if (current->uid != job->user) {
+ retcode = -EPERM;
+ goto cleanup_2locks_return;
+ }
+ }
+
+ info.si_signo = killjid.sig;
+ info.si_errno = 0;
+ info.si_code = SI_USER;
+ info.si_pid = current->pid;
+ info.si_uid = current->uid;
+
+ /* send_group_sig_info needs the tasklist lock locked */
+ read_lock(&tasklist_lock);
+ list_for_each(attached_entry, &job->attached) {
+ int err;
+ struct job_attach *attached;
+
+ attached = list_entry(attached_entry, struct job_attach, entry);
+ err = send_group_sig_info(killjid.sig, &info,
+ attached->task);
+ if (err != 0) {
+ /*
+ * XXX - the "prime" process, or initiating process
+ * for the job may not be owned by the user. So,
+ * we would get an error in this case. However, we
+ * ignore the error for that specific process - it
+ * should exit when all the child processes exit. It
+ * should ignore all signals from the user.
+ *
+ */
+ if (attached->entry.prev != &job->attached) {
+ retcode = err;
+ }
+ }
+
+ }
+ read_unlock(&tasklist_lock);
+
+cleanup_2locks_return:
+ up_read(&job->sem);
+cleanup_1locks_return:
+ up_read(&job_table_sem);
+cleanup_0locks_return:
+ killjid.r_val = retcode;
+
+ if (copy_to_user(killjid_args, &killjid, sizeof(killjid)))
+ return -EFAULT;
+ return retcode;
+}
+
+
+/*
+ * job_dispatch_getjidcnt - return the number of jobs currently on the system
+ * @jidcnt_args: Pointer of job_jidcnt struct which stores the get request
+ *
+ * returns 0 on success, -EFAULT if the count cannot be copied to user space.
+ */
+static int
+job_dispatch_getjidcnt(struct job_jidcnt *jidcnt_args)
+{
+ struct job_jidcnt jidcnt;
+
+ /* read lock might be overdoing it in this case */
+ down_read(&job_table_sem);
+ jidcnt.r_val = job_table_refcnt;
+ up_read(&job_table_sem);
+
+ if (copy_to_user(jidcnt_args, &jidcnt, sizeof(jidcnt)))
+ return -EFAULT;
+
+ return 0;
+}
+
+
+/*
+ * job_dispatch_getjidlst - get the list of all jids currently on the system
+ * @jidlist_args: Pointer of job_jidlst struct which stores the get request
+ */
+static int
+job_dispatch_getjidlst(struct job_jidlst *jidlst_args)
+{
+ struct job_jidlst jidlst;
+ u64 *jid;
+ struct job_entry *job;
+ struct list_head *job_entry;
+ int i;
+ int count;
+
+ if (copy_from_user(&jidlst, jidlst_args, sizeof(jidlst)))
+ return -EFAULT;
+
+ if (jidlst.r_val == 0)
+ return 0;
+
+ jid = (u64 *)kmalloc(sizeof(u64)*jidlst.r_val, GFP_KERNEL);
+ if (!jid) {
+ jidlst.r_val = 0;
+ if (copy_to_user(jidlst_args, &jidlst, sizeof(jidlst)))
+ return -EFAULT;
+ return -ENOMEM;
+ }
+
+ count = 0;
+ down_read(&job_table_sem);
+ for (i = 0; i < HASH_SIZE && count < jidlst.r_val; i++) {
+ list_for_each(job_entry, &job_table[i]) {
+ job = list_entry(job_entry, struct job_entry, entry);
+ jid[count++] = job->jid;
+ if (count == jidlst.r_val) {
+ break;
+ }
+ }
+ }
+ up_read(&job_table_sem);
+
+ jidlst.r_val = count;
+
+ if (copy_to_user(jidlst.jid, jid, sizeof(u64) * count)) {
+ kfree(jid); /* don't leak the buffer on a failed copy */
+ return -EFAULT;
+ }
+ kfree(jid);
+
+ if (copy_to_user(jidlst_args, &jidlst, sizeof(jidlst)))
+ return -EFAULT;
+ return 0;
+}
+
+
+/*
+ * job_dispatch_getpidcnt - get the process count in the specified job
+ * @pidcnt_args: Pointer of job_pidcnt struct which stores the get request
+ *
+ * returns 0 on success, or negative errno value on failure.
+ */
+static int
+job_dispatch_getpidcnt(struct job_pidcnt *pidcnt_args)
+{
+ struct job_pidcnt pidcnt;
+ struct job_entry *job;
+ int retcode = 0;
+
+ if (copy_from_user(&pidcnt, pidcnt_args, sizeof(pidcnt)))
+ return -EFAULT;
+
+ pidcnt.r_val = 0;
+
+ down_read(&job_table_sem);
+ job = job_getjob(pidcnt.jid);
+ if (!job) {
+ retcode = -ENODATA;
+ } else {
+ /* Read lock might be overdoing it for this case */
+ down_read(&job->sem);
+ pidcnt.r_val = job->refcnt;
+ up_read(&job->sem);
+ }
+ up_read(&job_table_sem);
+
+ if (copy_to_user(pidcnt_args, &pidcnt, sizeof(pidcnt)))
+ return -EFAULT;
+ return retcode;
+}
+
+/*
+ * job_dispatch_getpidlst - get the process list in the specified job
+ * @pidlst_args: Pointer of job_pidlst struct which stores the get request
+ *
+ * This function returns the list of processes that are part of the job.
+ * The number of processes provided by this function could be trimmed if
+ * max size specified in r_val is not large enough to hold the entire list.
+ *
+ * returns 0 on success, negative errno value on failure.
+ */
+static int
+job_dispatch_getpidlst(struct job_pidlst *pidlst_args)
+{
+ struct job_pidlst pidlst;
+ struct job_entry *job;
+ struct job_attach *attached;
+ struct list_head *attached_entry;
+ pid_t *pid;
+ int max;
+ int i;
+
+ if (copy_from_user(&pidlst, pidlst_args, sizeof(pidlst)))
+ return -EFAULT;
+
+ if (pidlst.r_val == 0)
+ return 0;
+
+ max = pidlst.r_val;
+ pidlst.r_val = 0;
+ pid = (pid_t *)kmalloc(sizeof(pid_t)*max, GFP_KERNEL);
+ if (!pid) {
+ if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst)))
+ return -EFAULT;
+ return -ENOMEM;
+ }
+
+ down_read(&job_table_sem);
+
+ job = job_getjob(pidlst.jid);
+ if (!job) {
+ up_read(&job_table_sem);
+ kfree(pid); /* don't leak the pid buffer when the job is missing */
+ if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst)))
+ return -EFAULT;
+ return -ENODATA;
+ } else {
+ down_read(&job->sem);
+ up_read(&job_table_sem);
+
+ i = 0;
+ list_for_each(attached_entry, &job->attached) {
+ if (i == max) {
+ break;
+ }
+ attached = list_entry(attached_entry, struct job_attach,
+ entry);
+ pid[i++] = attached->task->pid;
+ }
+ pidlst.r_val = i;
+
+ up_read(&job->sem);
+ }
+ if (copy_to_user(pidlst.pid, pid, sizeof(pid_t) * pidlst.r_val)) {
+ kfree(pid); /* don't leak the buffer on a failed copy */
+ return -EFAULT;
+ }
+ kfree(pid);
+ if (copy_to_user(pidlst_args, &pidlst, sizeof(pidlst)))
+ return -EFAULT;
+ return 0;
+}
+
+
+/*
+ * job_dispatch_getuser - get the uid of the user that owns the job
+ * @user_args: Pointer of job_user struct which stores the get request
+ *
+ * returns 0 on success, returns negative errno on failure.
+ */
+static int
+job_dispatch_getuser(struct job_user *user_args)
+{
+ struct job_entry *job;
+ struct job_user user;
+ int retcode = 0;
+
+ if (copy_from_user(&user, user_args, sizeof(user)))
+ return -EFAULT;
+ user.r_user = 0;
+
+ down_read(&job_table_sem);
+
+ job = job_getjob(user.jid);
+ if (!job) {
+ retcode = -ENODATA;
+ } else {
+ down_read(&job->sem);
+ user.r_user = job->user;
+ up_read(&job->sem);
+ }
+
+ up_read(&job_table_sem);
+
+ if (copy_to_user(user_args, &user, sizeof(user)))
+ return -EFAULT;
+ return retcode;
+}
+
+
+/*
+ * job_dispatch_getprimepid - get the oldest process (primepid) in the job
+ * @primepid_args: Pointer of job_primepid struct which stores the get request
+ *
+ * returns 0 on success, negative errno on failure.
+ */
+static int
+job_dispatch_getprimepid(struct job_primepid *primepid_args)
+{
+ struct job_primepid primepid;
+ struct job_entry *job = NULL;
+ struct job_attach *attached = NULL;
+ int retcode = 0;
+
+ if (copy_from_user(&primepid, primepid_args, sizeof(primepid)))
+ return -EFAULT;
+
+ primepid.r_pid = 0;
+
+ down_read(&job_table_sem);
+
+ job = job_getjob(primepid.jid);
+ if (!job) {
+ up_read(&job_table_sem);
+ /* Job not found, return INVALID VALUE */
+ return -ENODATA;
+ }
+
+ /*
+ * Job found, now look at first pid entry in the
+ * attached list.
+ */
+ down_read(&job->sem);
+ up_read(&job_table_sem);
+ if (list_empty(&job->attached)) {
+ retcode = -ESRCH;
+ primepid.r_pid = 0;
+ } else {
+ attached = list_entry(job->attached.next, struct job_attach, entry);
+ if (!attached->task) {
+ retcode = -ESRCH;
+ } else {
+ primepid.r_pid = attached->task->pid;
+ }
+ }
+ up_read(&job->sem);
+
+ if (copy_to_user(primepid_args, &primepid, sizeof(primepid)))
+ return -EFAULT;
+ return retcode;
+}
+
+
+/*
+ * job_dispatch_sethid - set the host ID segment for the job IDs (jid)
+ * @sethid_args: Pointer of job_sethid struct which stores the set request
+ *
+ * If hid does not get set, then the jid's upper 32 bits will be set to
+ * 0 and the jid cannot be used reliably in a cluster environment.
+ *
+ * returns -errno value on fail, 0 on success
+ */
+static int
+job_dispatch_sethid(struct job_sethid *sethid_args)
+{
+ struct job_sethid sethid;
+ int errcode = 0;
+
+ if (copy_from_user(&sethid, sethid_args, sizeof(sethid)))
+ return -EFAULT;
+
+ if (!capable(CAP_SYS_RESOURCE)) {
+ errcode = -EPERM;
+ sethid.r_hid = 0;
+ goto cleanup_return;
+ }
+
+ /*
+ * Set job_table_sem, so no jobs can be deleted while doing
+ * this operation.
+ */
+ down_write(&job_table_sem);
+
+ sethid.r_hid = jid_hid = sethid.hid;
+
+ up_write(&job_table_sem);
+
+cleanup_return:
+ if (copy_to_user(sethid_args, &sethid, sizeof(sethid)))
+ return -EFAULT;
+ return errcode;
+}
+
+
+/*
+ * job_dispatch_detachjid - detach all processes attached to the specified job
+ * @detachjid_args: Pointer of job_detachjid struct
+ *
+ * The job will exit after the detach. The processes are allowed to
+ * continue running. You need CAP_SYS_RESOURCE capability for this to succeed.
+ *
+ * returns -errno value on fail, 0 on success
+ */
+static int
+job_dispatch_detachjid(struct job_detachjid *detachjid_args)
+{
+ struct job_detachjid detachjid;
+ struct job_entry *job;
+ struct list_head *entry;
+ int count;
+ int errcode = 0;
+ struct task_struct *task;
+ struct pnotify_subscriber *subscriber;
+
+ if (copy_from_user(&detachjid, detachjid_args, sizeof(detachjid)))
+ return -EFAULT;
+
+ detachjid.r_val = 0;
+
+ if (!capable(CAP_SYS_RESOURCE)) {
+ errcode = -EPERM;
+ goto cleanup_return;
+ }
+
+ /*
+ * Set job_table_sem, so no jobs can be deleted while doing
+ * this operation.
+ */
+ down_write(&job_table_sem);
+
+ job = job_getjob(detachjid.jid);
+
+ if (job) {
+
+ down_write(&job->sem);
+
+ /* Mark job as ZOMBIE so no new processes can attach to it */
+ job->state = ZOMBIE;
+
+ count = job->refcnt;
+
+ /* Okay, no new processes can attach to the job. We can
+ * release the locks on the job_table and job since the only
+ * way for the job to change now is for tasks to detach and
+ * the job to be removed. And this is what we want to happen
+ */
+ up_write(&job_table_sem);
+ up_write(&job->sem);
+
+
+ /* Walk through list of attached tasks and unset the
+ * pnotify subscriber entries.
+ *
+ * We don't test with list_empty because that actually means NO tasks
+ * left rather than one task. If we used !list_empty or list_for_each,
+ * we could reference memory freed by the pnotify hook detach function
+ * (job_detach).
+ *
+ * We know there is only one task left when job->attached.next and
+ * job->attached.prev both point to the same place.
+ */
+ while (job->attached.next != job->attached.prev) {
+ entry = job->attached.next;
+
+ task = (list_entry(entry, struct job_attach, entry))->task;
+ subscriber = (list_entry(entry, struct job_attach, entry))->subscriber;
+
+ down_write(&task->pnotify_subscriber_list_sem);
+ subscriber->events->exit(task, subscriber);
+ pnotify_unsubscribe(subscriber);
+ up_write(&task->pnotify_subscriber_list_sem);
+
+ }
+ /* At this point, there is only one task left */
+
+ entry = job->attached.next;
+
+ task = (list_entry(entry, struct job_attach, entry))->task;
+ subscriber = (list_entry(entry, struct job_attach, entry))->subscriber;
+
+ down_write(&task->pnotify_subscriber_list_sem);
+ subscriber->events->exit(task, subscriber);
+ pnotify_unsubscribe(subscriber);
+ up_write(&task->pnotify_subscriber_list_sem);
+
+ detachjid.r_val = count;
+
+ } else {
+ errcode = -ENODATA;
+ up_write(&job_table_sem);
+ }
+
+cleanup_return:
+ if (copy_to_user(detachjid_args, &detachjid, sizeof(detachjid)))
+ return -EFAULT;
+ return errcode;
+}
+
+
+/*
+ * job_dispatch_detachpid - detach a process from the job it is attached to
+ * @detachpid_args: Pointer of job_detachpid struct.
+ *
+ * That process is allowed to continue running. You need
+ * CAP_SYS_RESOURCE capability for this to succeed.
+ *
+ * returns -errno value on fail, 0 on success
+ */
+static int
+job_dispatch_detachpid(struct job_detachpid *detachpid_args)
+{
+ struct job_detachpid detachpid;
+ struct task_struct *task;
+ struct pnotify_subscriber *subscriber;
+ int errcode = 0;
+
+ if (copy_from_user(&detachpid, detachpid_args, sizeof(detachpid)))
+ return -EFAULT;
+
+ detachpid.r_jid = 0;
+
+ if (!capable(CAP_SYS_RESOURCE)) {
+ errcode = -EPERM;
+ goto cleanup_return;
+ }
+
+ /* Lock the task list while we find a specific task */
+ read_lock(&tasklist_lock);
+ task = find_task_by_pid(detachpid.pid);
+ if (!task) {
+ errcode = -ESRCH;
+ /* We need to unlock the tasklist here too or the lock is held forever */
+ read_unlock(&tasklist_lock);
+ goto cleanup_return;
+ }
+
+ /* We have a valid task now */
+ get_task_struct(task); /* Ensure the task doesn't vanish on us */
+ read_unlock(&tasklist_lock); /* Unlock the tasklist */
+ down_write(&task->pnotify_subscriber_list_sem);
+
+ subscriber = pnotify_get_subscriber(task, events.name);
+ if (subscriber) {
+ detachpid.r_jid = ((struct job_attach *)subscriber->data)->job->jid;
+ subscriber->events->exit(task, subscriber);
+ pnotify_unsubscribe(subscriber);
+ } else {
+ errcode = -ENODATA;
+ }
+ up_write(&task->pnotify_subscriber_list_sem);
+ put_task_struct(task); /* Drop our reference only after the last access to task */
+
+cleanup_return:
+ if (copy_to_user(detachpid_args, &detachpid, sizeof(detachpid)))
+ return -EFAULT;
+ return errcode;
+}
+
+/*
+ * job_dispatch_attachpid - attach a process to the specified job
+ * @attachpid_args: Pointer of job_attachpid struct.
+ *
+ * The attaching process must not belong to any job and the specified job
+ * must exist. You need CAP_SYS_RESOURCE capability for this to succeed.
+ *
+ * returns -errno value on fail, 0 on success
+ */
+static int
+job_dispatch_attachpid(struct job_attachpid *attachpid_args)
+{
+ struct job_attachpid attachpid;
+ struct task_struct *task;
+ struct pnotify_subscriber *subscriber;
+ struct job_entry *job = NULL;
+ struct job_attach *attached = NULL;
+ int errcode = 0;
+
+ if (copy_from_user(&attachpid, attachpid_args, sizeof(attachpid)))
+ return -EFAULT;
+
+ if (!capable(CAP_SYS_RESOURCE)) {
+ errcode = -EPERM;
+ goto cleanup_return;
+ }
+
+ /* lock the tasklist until we grab the specific task */
+ read_lock(&tasklist_lock);
+ task = find_task_by_pid(attachpid.pid);
+ if (!task) {
+ errcode = -ESRCH;
+ /* We need to unlock the tasklist here too or the lock is held forever */
+ read_unlock(&tasklist_lock);
+ goto cleanup_return;
+ }
+
+ /* We have a valid task now */
+ get_task_struct(task); /* Ensure the task doesn't vanish on us */
+ read_unlock(&tasklist_lock); /* Unlock the tasklist */
+
+ down_write(&task->pnotify_subscriber_list_sem);
+ /* Check whether the task already belongs to a job */
+ subscriber = pnotify_get_subscriber(task, events.name);
+ if (subscriber) {
+ up_write(&task->pnotify_subscriber_list_sem);
+ put_task_struct(task); /* Drop our reference only after the last access to task */
+ errcode = -EINVAL;
+ goto cleanup_return;
+ }
+
+ /* Alloc subscriber list entry for it */
+ subscriber = pnotify_subscribe(task, &events);
+ if (subscriber) {
+ down_read(&job_table_sem);
+ /* Check on the requested job */
+ job = job_getjob(attachpid.r_jid);
+ if (!job) {
+ pnotify_unsubscribe(subscriber);
+ errcode = -ENODATA;
+ }
+ else {
+ attached = list_entry(job->attached.next, struct job_attach, entry);
+ if(attached) {
+ if (subscriber->events->fork(task, subscriber, attached) != 0) {
+ pnotify_unsubscribe(subscriber);
+ errcode = -EFAULT;
+ }
+ }
+ }
+ up_read(&job_table_sem);
+ } else
+ errcode = -ENOMEM;
+ up_write(&task->pnotify_subscriber_list_sem);
+ put_task_struct(task); /* Drop our reference only after the last access to task */
+
+cleanup_return:
+ if (copy_to_user(attachpid_args, &attachpid, sizeof(attachpid)))
+ return -EFAULT;
+ return errcode;
+}
+
+
+/*
+ * job_register_acct - accounting modules register to job module
+ * @am: The registering accounting module's job_acctmod pointer
+ *
+ * returns -errno value on fail, 0 on success.
+ */
+int
+job_register_acct(struct job_acctmod *am)
+{
+ if (!am)
+ return -EINVAL; /* error, invalid value */
+ if (am->type < 0 || am->type > (JOB_ACCT_COUNT-1))
+ return -EINVAL; /* error, invalid value */
+
+ down_write(&acct_list_sem);
+ if (acct_list[am->type] != NULL) {
+ up_write(&acct_list_sem);
+ return -EBUSY; /* error, duplicate entry */
+ }
+
+ acct_list[am->type] = am;
+ up_write(&acct_list_sem);
+ return 0;
+}
+
+
+/*
+ * job_unregister_acct - accounting modules unregister from the job module
+ * @am: The unregistering accounting module's job_acctmod pointer
+ *
+ * Returns -errno on failure and 0 on success.
+ */
+int
+job_unregister_acct(struct job_acctmod *am)
+{
+ if (!am)
+ return -EINVAL; /* error, invalid value */
+ if (am->type < 0 || am->type > (JOB_ACCT_COUNT-1))
+ return -EINVAL; /* error, invalid value */
+
+ down_write(&acct_list_sem);
+
+ if (acct_list[am->type] != am) {
+ up_write(&acct_list_sem);
+ return -EFAULT; /* error, not matching entry */
+ }
+
+ acct_list[am->type] = NULL;
+ up_write(&acct_list_sem);
+ return 0;
+}
+
+/*
+ * job_getjid - return the Job ID for the given task.
+ * @task: The given task
+ *
+ * If the task is not attached to a job, then 0 is returned.
+ *
+ */
+u64 job_getjid(struct task_struct *task)
+{
+ struct pnotify_subscriber *subscriber = NULL;
+ struct job_entry *job = NULL;
+ u64 jid = 0;
+
+ down_read(&task->pnotify_subscriber_list_sem);
+ subscriber = pnotify_get_subscriber(task, events.name);
+ if (subscriber) {
+ job = ((struct job_attach *)subscriber->data)->job;
+ down_read(&job->sem);
+ jid = job->jid;
+ up_read(&job->sem);
+ }
+ up_read(&task->pnotify_subscriber_list_sem);
+
+ return jid;
+}
+
+
+/*
+ * job_getacct - accounting subscribers get accounting information about a job.
+ * @jid: the job id
+ * @type: the accounting subscriber type
+ * @data: the accounting data that subscriber wants.
+ *
+ * The caller must supply the Job ID (jid) that specifies the job. The
+ * "type" argument indicates the type of accounting data to be returned.
+ * The data will be returned in the memory accessed via the data pointer
+ * argument. The data pointer is void so that this function interface
+ * can handle different types of accounting data.
+ *
+ */
+int job_getacct(u64 jid, int type, void *data)
+{
+ struct job_entry *job;
+
+ if (!data)
+ return -EINVAL;
+
+ if (!jid)
+ return -EINVAL;
+
+ down_read(&job_table_sem);
+ job = job_getjob(jid);
+ if (!job) {
+ up_read(&job_table_sem);
+ return -ENODATA;
+ }
+
+ down_read(&job->sem);
+ up_read(&job_table_sem);
+
+ switch (type) {
+ case JOB_ACCT_CSA:
+ {
+ struct job_csa *csa = (struct job_csa *)data;
+
+ csa->job_id = job->jid;
+ csa->job_uid = job->user;
+ csa->job_start = job->start;
+ csa->job_corehimem = job->csa.corehimem;
+ csa->job_virthimem = job->csa.virthimem;
+ csa->job_acctfile = job->csa.acctfile;
+ break;
+ }
+ default:
+ up_read(&job->sem);
+ return -EINVAL;
+ break;
+ }
+ up_read(&job->sem);
+ return 0;
+}
+
+/*
+ * job_setacct - accounting subscribers set accounting info in the job
+ * @jid: the job id
+ * @type: the accounting subscriber type.
+ * @subfield: the accounting information subfield for this set call
+ * @data: the accounting information to be set
+ *
+ * The job is identified by the jid argument. The type indicates the
+ * type of accounting the information is associated with. The subfield
+ * is a bitmask that indicates exactly what subfields are to be changed.
+ * The data that is used to set the values is supplied by the data pointer.
+ * The data pointer is a void type so that the interface can be used for
+ * different types of accounting information.
+ */
+int job_setacct(u64 jid, int type, int subfield, void *data)
+{
+ struct job_entry *job;
+
+ if (!data)
+ return -EINVAL;
+
+ if (!jid)
+ return -EINVAL;
+
+ down_read(&job_table_sem);
+ job = job_getjob(jid);
+ if (!job) {
+ up_read(&job_table_sem);
+ return -ENODATA;
+ }
+
+ down_write(&job->sem); /* we modify job->csa below, so take the write lock */
+ up_read(&job_table_sem);
+
+ switch (type) {
+ case JOB_ACCT_CSA:
+ {
+ struct job_csa *csa = (struct job_csa *)data;
+
+ if (subfield & JOB_CSA_ACCTFILE) {
+ job->csa.acctfile = csa->job_acctfile;
+ }
+ if (subfield & JOB_CSA_COREHIMEM) {
+ job->csa.corehimem = csa->job_corehimem;
+ }
+ if (subfield & JOB_CSA_VIRTHIMEM) {
+ job->csa.virthimem = csa->job_virthimem;
+ }
+ break;
+ }
+ default:
+ up_write(&job->sem);
+ return -EINVAL;
+ break;
+ }
+ up_write(&job->sem);
+ return 0;
+}
+
+
+
+/*
+ * job_dispatcher - handles job ioctl requests
+ * @request: The syscall request type
+ * @data: The syscall request data
+ *
+ * Returns 0 on success and -(ERRNO VALUE) upon failure.
+ */
+int
+job_dispatcher(unsigned int request, unsigned long data)
+{
+ int rc=0;
+
+ switch (request) {
+ case JOB_CREATE:
+ rc = job_dispatch_create((struct job_create *)data);
+ break;
+ case JOB_ATTACH:
+ case JOB_DETACH:
+ /* RESERVED */
+ rc = -EBADRQC;
+ break;
+ case JOB_GETJID:
+ rc = job_dispatch_getjid((struct job_getjid *)data);
+ break;
+ case JOB_WAITJID:
+ rc = job_dispatch_waitjid((struct job_waitjid *)data);
+ break;
+ case JOB_KILLJID:
+ rc = job_dispatch_killjid((struct job_killjid *)data);
+ break;
+ case JOB_GETJIDCNT:
+ rc = job_dispatch_getjidcnt((struct job_jidcnt *)data);
+ break;
+ case JOB_GETJIDLST:
+ rc = job_dispatch_getjidlst((struct job_jidlst *)data);
+ break;
+ case JOB_GETPIDCNT:
+ rc = job_dispatch_getpidcnt((struct job_pidcnt *)data);
+ break;
+ case JOB_GETPIDLST:
+ rc = job_dispatch_getpidlst((struct job_pidlst *)data);
+ break;
+ case JOB_GETUSER:
+ rc = job_dispatch_getuser((struct job_user *)data);
+ break;
+ case JOB_GETPRIMEPID:
+ rc = job_dispatch_getprimepid((struct job_primepid *)data);
+ break;
+ case JOB_SETHID:
+ rc = job_dispatch_sethid((struct job_sethid *)data);
+ break;
+ case JOB_DETACHJID:
+ rc = job_dispatch_detachjid((struct job_detachjid *)data);
+ break;
+ case JOB_DETACHPID:
+ rc = job_dispatch_detachpid((struct job_detachpid *)data);
+ break;
+ case JOB_ATTACHPID:
+ rc = job_dispatch_attachpid((struct job_attachpid *)data);
+ break;
+ case JOB_SETJLIMIT:
+ case JOB_GETJLIMIT:
+ case JOB_GETJUSAGE:
+ case JOB_FREE:
+ default:
+ rc = -EBADRQC;
+ break;
+ }
+
+ return rc;
+}
+
+
+/*
+ * job_ioctl - handles job ioctl call requests
+ *
+ *
+ * Returns 0 on success and -(ERRNO VALUE) upon failure.
+ */
+int
+job_ioctl(struct inode *inode, struct file *file, unsigned int request,
+ unsigned long data)
+{
+ return job_dispatcher(request, data);
+}
+
+
+/*
+ * init_job
+ *
+ * This function is called when a module is inserted into a kernel. This
+ * function allocates any necessary structures and sets initial values for
+ * module data.
+ *
+ * If the function succeeds, then 0 is returned. On failure, -1 is returned.
+ */
+static int __init
+init_job(void)
+{
+ int i,rc;
+
+
+ /* Initialize the job table chains */
+ for (i = 0; i < HASH_SIZE; i++) {
+ INIT_LIST_HEAD(&job_table[i]);
+ }
+
+ /* Get hostID string and fill in jid_template hostID segment */
+ if (hid) {
+ jid_hid = (int)simple_strtoul(hid, &hid, 16);
+ } else {
+ jid_hid = 0;
+ }
+
+ rc = pnotify_register(&events);
+ if (rc < 0) {
+ return -1;
+ }
+
+ /* Setup our /proc entry file */
+ job_proc_entry = create_proc_entry(JOB_PROC_ENTRY,
+ S_IFREG | S_IRUGO, &proc_root);
+
+ if (!job_proc_entry) {
+ pnotify_unregister(&events);
+ return -1;
+ }
+
+ job_proc_entry->proc_fops = &job_file_ops;
+ job_proc_entry->proc_iops = NULL;
+
+
+ return 0;
+}
+module_init(init_job);
+
+/*
+ * cleanup_job
+ *
+ * This function is called to cleanup after a module when it is removed.
+ * All memory allocated for this module will be freed.
+ *
+ * This function does not take any inputs or produce any output.
+ */
+static void __exit
+cleanup_job(void)
+{
+ remove_proc_entry(JOB_PROC_ENTRY, &proc_root);
+ pnotify_unregister(&events);
+ return;
+}
+module_exit(cleanup_job);
+
+EXPORT_SYMBOL(job_register_acct);
+EXPORT_SYMBOL(job_unregister_acct);
+EXPORT_SYMBOL(job_getjid);
+EXPORT_SYMBOL(job_getacct);
+EXPORT_SYMBOL(job_setacct);
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile 2005-09-30 12:14:10.990893317 -0500
+++ linux/kernel/Makefile 2005-09-30 13:59:19.748849562 -0500
@@ -21,6 +21,7 @@
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_PNOTIFY) += pnotify.o
+obj-$(CONFIG_JOB) += job.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_IKCONFIG_PROC) += configs.o
Index: linux/include/linux/jobctl.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/linux/jobctl.h 2005-09-30 13:59:14.059970327 -0500
@@ -0,0 +1,185 @@
+/*
+ *
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ *
+ * For further information regarding this notice, see:
+ *
+ * http://oss.sgi.com/projects/GenInfo/NoticeExplan
+ *
+ *
+ * Description: This file, include/linux/jobctl.h, contains the data
+ * definitions used by job to communicate with pnotify via the /proc/job
+ * ioctl interface.
+ *
+ */
+
+#ifndef _LINUX_JOBCTL_H
+#define _LINUX_JOBCTL_H
+#ifndef __KERNEL__
+#include <stdint.h>
+#include <sys/types.h>
+#include <asm/unistd.h>
+#endif
+
+#define PNOTIFY_NAMELEN 32 /* Max chars in PNOTIFY module name */
+#define PNOTIFY_NAMESTR (PNOTIFY_NAMELEN+1) /* PNOTIFY mod name string including
+ * room for end-of-string = '\0' */
+
+/*
+ * =======================
+ * JOB PNOTIFY definitions
+ * =======================
+ */
+#define PNOTIFY_JOB "job" /* PNOTIFY module identifier string */
+
+
+
+/*
+ * ================
+ * KERNEL INTERFACE
+ * ================
+ */
+#define JOB_PROC_ENTRY "job" /* /proc entry name */
+#define JOB_IOCTL_NUM 'A'
+
+
+/*
+ *
+ * Define ioctl options available in the job module
+ *
+ */
+
+#define JOB_NOOP _IOWR(JOB_IOCTL_NUM, 0, void *) /* No-op options */
+
+#define JOB_CREATE _IOWR(JOB_IOCTL_NUM, 1, void *) /* Create a job - uid = 0 only */
+#define JOB_ATTACH _IOWR(JOB_IOCTL_NUM, 2, void *) /* RESERVED */
+#define JOB_DETACH _IOWR(JOB_IOCTL_NUM, 3, void *) /* RESERVED */
+#define JOB_GETJID _IOWR(JOB_IOCTL_NUM, 4, void *) /* Get Job ID for specified pid */
+#define JOB_WAITJID _IOWR(JOB_IOCTL_NUM, 5, void *) /* Wait for job to complete */
+#define JOB_KILLJID _IOWR(JOB_IOCTL_NUM, 6, void *) /* Send signal to job */
+#define JOB_GETJIDCNT _IOWR(JOB_IOCTL_NUM, 9, void *) /* Get number of JIDs on system */
+#define JOB_GETJIDLST _IOWR(JOB_IOCTL_NUM, 10, void *) /* Get list of JIDs on system */
+#define JOB_GETPIDCNT _IOWR(JOB_IOCTL_NUM, 11, void *) /* Get number of PIDs in JID */
+#define JOB_GETPIDLST _IOWR(JOB_IOCTL_NUM, 12, void *) /* Get list of PIDs in JID */
+#define JOB_SETJLIMIT _IOWR(JOB_IOCTL_NUM, 13, void *) /* Future: set job limits info */
+#define JOB_GETJLIMIT _IOWR(JOB_IOCTL_NUM, 14, void *) /* Future: get job limits info */
+#define JOB_GETJUSAGE _IOWR(JOB_IOCTL_NUM, 15, void *) /* Future: get job res. usage */
+#define JOB_FREE _IOWR(JOB_IOCTL_NUM, 16, void *) /* Future: Free job entry */
+#define JOB_GETUSER _IOWR(JOB_IOCTL_NUM, 17, void *) /* Get owner for job */
+#define JOB_GETPRIMEPID _IOWR(JOB_IOCTL_NUM, 18, void *) /* Get prime pid for job */
+#define JOB_SETHID _IOWR(JOB_IOCTL_NUM, 19, void *) /* Set HID for jid values */
+#define JOB_DETACHJID _IOWR(JOB_IOCTL_NUM, 20, void *) /* Detach all tasks from job */
+#define JOB_DETACHPID _IOWR(JOB_IOCTL_NUM, 21, void *) /* Detach a task from job */
+#define JOB_ATTACHPID _IOWR(JOB_IOCTL_NUM, 22, void *) /* Attach a task to a job */
+#define JOB_OPT_MAX _IOWR(JOB_IOCTL_NUM, 23 , void *) /* Should always be highest number */
+
+
+/*
+ * Define ioctl request structures for job module
+ */
+
+struct job_create {
+ u64 r_jid; /* Return value of JID */
+ u64 jid; /* Jid value requested */
+ int user; /* UID of user associated with job */
+ int options;/* creation options - unused */
+};
+
+
+struct job_getjid {
+ u64 r_jid; /* Returned value of JID */
+ pid_t pid; /* Info requested for PID */
+};
+
+
+struct job_waitjid {
+ u64 r_jid; /* Returned value of JID */
+ u64 jid; /* Waiting on specified JID */
+ int stat; /* Status information on JID */
+ int options;/* Waiting options */
+};
+
+
+struct job_killjid {
+ int r_val; /* Return value of kill request */
+ u64 jid; /* Sending signal to all PIDs in JID */
+ int sig; /* Signal to send */
+};
+
+
+struct job_jidcnt {
+ int r_val; /* Number of JIDs on system */
+};
+
+
+struct job_jidlst {
+ int r_val; /* Number of JIDs in list */
+ u64 *jid; /* List of JIDs */
+};
+
+
+struct job_pidcnt {
+ int r_val; /* Number of PIDs in JID */
+ u64 jid; /* Getting count of JID */
+};
+
+
+struct job_pidlst {
+ int r_val; /* Number of PIDs in list */
+ pid_t *pid; /* List of PIDs */
+ u64 jid;
+};
+
+
+struct job_user {
+ int r_user; /* The UID of the owning user */
+ u64 jid; /* Get the UID for this job */
+};
+
+struct job_primepid {
+ pid_t r_pid; /* The prime pid */
+ u64 jid; /* Get the prime pid for this job */
+};
+
+struct job_sethid {
+ unsigned long r_hid; /* Value that was set */
+ unsigned long hid; /* Value to set to */
+};
+
+
+struct job_detachjid {
+ int r_val; /* Number of tasks detached from job */
+ u64 jid; /* Job to detach processes from */
+};
+
+struct job_detachpid {
+ u64 r_jid; /* Job ID task was attached to */
+ pid_t pid; /* Task to detach from job */
+};
+
+struct job_attachpid {
+ u64 r_jid; /* Job ID task is to attach to */
+ pid_t pid; /* Task to be attached */
+};
+
+#endif /* _LINUX_JOBCTL_H */
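
For readers following the ioctl interface defined in jobctl.h above, here is a minimal user-space sketch of a JOB_GETJID request. It is not part of the patch. It assumes that the request structure is passed by pointer as the ioctl argument, that /proc/job can be opened read-only, and that uint64_t is an acceptable user-space stand-in for u64. The helper name job_lookup_jid is hypothetical, and the definitions are copied locally because the header is only present on a patched tree.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copies of definitions from the jobctl.h hunk; uint64_t stands in
 * for the kernel's u64 (an assumption for user space). */
#define JOB_IOCTL_NUM 'A'
#define JOB_GETJID _IOWR(JOB_IOCTL_NUM, 4, void *)

struct job_getjid {
	uint64_t r_jid;	/* Returned value of JID */
	pid_t pid;	/* Info requested for PID */
};

/* Hypothetical helper: look up the job ID for pid through /proc/job.
 * Returns 0 and stores the JID in *jid on success, -1 on any failure,
 * including a kernel that does not carry the job patch. */
static int job_lookup_jid(pid_t pid, uint64_t *jid)
{
	struct job_getjid req = { .r_jid = 0, .pid = pid };
	int fd, rc;

	assert(jid != NULL);

	fd = open("/proc/job", O_RDONLY);	/* open mode is an assumption */
	if (fd < 0)
		return -1;

	rc = ioctl(fd, JOB_GETJID, &req);
	close(fd);
	if (rc < 0)
		return -1;

	*jid = req.r_jid;
	return 0;
}
```

On a kernel without the job patch the open() fails and the helper simply returns -1, so callers can treat "no job support" and "no such job" the same way if that suits them.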
Index: linux/include/linux/job_acct.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/linux/job_acct.h 2005-09-30 13:59:14.063876183 -0500
@@ -0,0 +1,124 @@
+/*
+ * Linux Job kernel definitions & interfaces using pnotify
+ *
+ *
+ * Copyright (c) 2000-2005 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ *
+ * Contact information: Silicon Graphics, Inc., 1500 Crittenden Lane,
+ * Mountain View, CA 94043, or:
+ *
+ * http://www.sgi.com
+ *
+ * For further information regarding this notice, see:
+ *
+ * http://oss.sgi.com/projects/GenInfo/NoticeExplan
+ */
+
+/*
+ * Description: This file, include/linux/job_acct.h, contains the data
+ * structure definitions and function prototypes used
+ * by other kernel bits that communicate with the job
+ * module. One such example is Comprehensive System
+ * Accounting (CSA).
+ */
+
+#ifndef _LINUX_JOB_ACCT_H
+#define _LINUX_JOB_ACCT_H
+
+/*
+ * ================
+ * GENERAL USE INFO
+ * ================
+ */
+
+/*
+ * The job start/stop events: These will identify the
+ * reason the jobstart and jobend callbacks are being
+ * called.
+ */
+enum {
+ JOB_EVENT_IGNORE = 0,
+ JOB_EVENT_START = 1,
+ JOB_EVENT_RESTART = 2,
+ JOB_EVENT_END = 3,
+};
+
+
+/*
+ * =========================================
+ * INTERFACE INFO FOR ACCOUNTING SUBSCRIBERS
+ * =========================================
+ */
+
+/* To register as a job dependent accounting module */
+struct job_acctmod {
+ int type; /* CSA or something else */
+ int (*jobstart)(int event, void *data);
+ int (*jobend)(int event, void *data);
+ void (*eop)(int code, void *data1, void *data2);
+ struct module *module;
+};
+
+
+/*
+ * Subscriber type: Each module that registers as an accounting data
+ * "subscriber" has to have a type. This type will identify the
+ * appropriate structs and macros to use when exchanging data.
+ */
+#define JOB_ACCT_CSA 0
+#define JOB_ACCT_COUNT 1 /* Number of entries available */
+
+
+/*
+ * --------------
+ * CSA ACCOUNTING
+ * --------------
+ */
+
+/*
+ * For data exchange between job and csa. The embedded defines
+ * identify the sub-fields
+ */
+struct job_csa {
+#define JOB_CSA_JID 001
+ u64 job_id;
+#define JOB_CSA_UID 002
+ uid_t job_uid;
+#define JOB_CSA_START 004
+ time_t job_start;
+#define JOB_CSA_COREHIMEM 010
+ u64 job_corehimem;
+#define JOB_CSA_VIRTHIMEM 020
+ u64 job_virthimem;
+#define JOB_CSA_ACCTFILE 040
+ struct file *job_acctfile;
+};
+
+
+/*
+ * ===================
+ * FUNCTION PROTOTYPES
+ * ===================
+ */
+int job_register_acct(struct job_acctmod *);
+int job_unregister_acct(struct job_acctmod *);
+u64 job_getjid(struct task_struct *);
+int job_getacct(u64, int, void *);
+int job_setacct(u64, int, int, void *);
+
+#endif /* _LINUX_JOB_ACCT_H */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c 2005-09-30 13:59:13.993570776 -0500
+++ linux/kernel/fork.c 2005-09-30 13:59:14.064852647 -0500
@@ -122,6 +122,7 @@
if (!profile_handoff_task(tsk))
free_task(tsk);
}
+EXPORT_SYMBOL_GPL(__put_task_struct);

void __init fork_init(unsigned long mempages)
{

2005-10-03 19:49:28

by Paul Jackson

Subject: Re: [PATCH 1/3] Process Notification / pnotify

Hmmm ... I notice with interest two notification patches posted in
the last few days to lkml:

Matthew Helsley's Process Events Connector (posted 28 Sep 2005)
Erik Jacobson's pnotify (posted 3 Oct 2005)

I suspect Matthew and Erik will both instantly hate me for asking, but
does it make sense to integrate these two?

If I understand these two proposals correctly:

Helsley adds hooks in fork, exec, id change, and exit, to pass
events to userspace.

Jacobson adds hooks in fork, exec and exit, to pass events to
kernel routines and loadable modules.

Perhaps, just brainstorming here, it would make sense for Helsley to
register with pnotify instead of adding his own hooks in parallel.
This presumes that pnotify is accepted into the kernel, and that
pnotify adds the id change hook that Helsley requires.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-04 00:32:30

by Matt Helsley

Subject: Re: [PATCH 1/3] Process Notification / pnotify

On Mon, 2005-10-03 at 12:49 -0700, Paul Jackson wrote:
> Hmmm ... I notice with interest two notification patches posted in
> the last few days to lkml:
>
> Matthew Helsley's Process Events Connector (posted 28 Sep 2005)
> Erik Jacobson's pnotify (posted 3 Oct 2005)
>
> I suspect Matthew and Erik will both instantly hate me for asking, but
> does it make sense to integrate these two?
>
> If I understand these two proposals correctly:
>
> Helsley adds hooks in fork, exec, id change, and exit, to pass
> events to userspace.
>
> Jacobson adds hooks in fork, exec and exit, to pass events to
> kernel routines and loadable modules.
>
> Perhaps, just brainstorming here, it would make sense for Helsley to
> register with pnotify instead of adding his own hooks in parallel.
> This presumes that pnotify is accepted into the kernel, and that
> pnotify adds the id change hook that Helsley requires.

Paul,

pnotify is extreme overkill for the process events connector. The
per-task subscriber lists, data, inheritance of those lists, tasklist
locking, and iteration over the lists are all overhead compared to the
process events connector patch.

For the process events connector it makes more sense to have a global
list of subscribers interested in all tasks. If there are M kernel modules
interested in getting events for all N tasks this would save space
proportional to M*N compared to pnotify. Of course this means the
elements of this list could not have per-task data.
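
A minimal user-space sketch of the global-list idea described above, not code from either patch: each interested module registers one record, and every task event walks that single list. All names here (event_subscriber, subscribe, publish, demo_notify) are hypothetical, and locking is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subscriber record: one per interested module, not per task. */
struct event_subscriber {
	void (*notify)(int event, int pid);
	struct event_subscriber *next;
};

static struct event_subscriber *subscribers;	/* single global list */

/* Add a subscriber to the head of the global list. */
static void subscribe(struct event_subscriber *s)
{
	assert(s != NULL && s->notify != NULL);
	s->next = subscribers;
	subscribers = s;
}

/* Deliver one event for one task to every subscriber; returns how many
 * subscribers were called. */
static size_t publish(int event, int pid)
{
	size_t called = 0;
	struct event_subscriber *s;

	for (s = subscribers; s != NULL; s = s->next) {
		s->notify(event, pid);
		called++;
	}
	return called;
}

/* Demo callback that only counts invocations. */
static int demo_calls;

static void demo_notify(int event, int pid)
{
	(void)event;
	(void)pid;
	demo_calls++;
}
```

With M subscribers the list holds M records no matter how many tasks exist, whereas a per-task list needs one record per (module, task) pair. The trade-off named above also shows up here: the record has nowhere to keep per-task data.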

I think per-task data should be split out from pnotify and submitted as
a separate system used by pnotify. Maybe with a *_PER_TASK API similar
to *_PER_CPU.

Cheers,
-Matt Helsley

2005-10-04 15:34:23

by Serge E. Hallyn

Subject: Re: [PATCH 3/3] Process Notification / pnotify user: Job

Quoting Erik Jacobson ([email protected]):
> Index: linux/init/Kconfig
> ===================================================================
> --- linux.orig/init/Kconfig 2005-09-30 12:14:10.989916853 -0500
> +++ linux/init/Kconfig 2005-09-30 13:59:19.749826026 -0500
> @@ -170,6 +170,35 @@
> Linux Jobs module and the Linux Array Sessions module. If you will not
> be using such modules, say N.
>
> +config JOB
> + tristate " Process Notification (pnotify) based jobs"

Should it be possible to compile job as a module, or should
this not be "tristate"?

It makes use of send_group_sig_info, which is not EXPORTed.

thanks,
-serge

2005-10-04 15:42:15

by Erik Jacobson

Subject: Re: [PATCH 3/3] Process Notification / pnotify user: Job

> Should it be possible to compile job as a module, or should
> this not be "tristate"?
>
> It makes use of send_group_sig_info, which is not EXPORTed.

You are right, the patch is supposed to export that symbol as well.
I must have forgotten to add it to the file list using quilt before sending
it out, and I tested it as built-in so I didn't catch it.

I'll fix this. Thank you.

We're only using send_group_sig_info because it does a check for signal
'zero' which means status check. send_sig_info doesn't do that check any
more. We could probably just handle this inside job somehow and use
send_sig_info instead.
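
One way the "handle this inside job" idea might look, as a user-space model rather than kernel code: treat signal zero as a status check before calling the delivery routine. The names job_demo_kill and demo_deliver are hypothetical, the delivery routine is a stub standing in for send_sig_info, and a real kernel version would still need its own permission and existence checks, which this sketch skips.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Counts how often the stub delivery routine was reached. */
static int demo_deliveries;

/* Stub standing in for the real delivery path; it sends nothing. */
static int demo_deliver(pid_t pid, int sig)
{
	(void)pid;
	(void)sig;
	demo_deliveries++;
	return 0;
}

/* Hypothetical wrapper: signal 0 is a status check only, so report
 * success without delivering anything; other signals go to deliver(). */
static int job_demo_kill(pid_t pid, int sig, int (*deliver)(pid_t, int))
{
	assert(deliver != NULL);

	if (sig == 0)
		return 0;
	return deliver(pid, sig);
}
```

This mirrors what user space already sees from kill(2): calling kill(pid, 0) performs the checks but delivers no signal.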

--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

2005-10-04 16:03:32

by Erik Jacobson

Subject: Re: [PATCH 3/3] Process Notification / pnotify user: Job

> Should it be possible to compile job as a module, or should
> this not be "tristate"?

So I added this to the patch. I tested it as a kernel module. Sorry this
was missed in the patch I sent out.

Index: linux/kernel/signal.c
===================================================================
--- linux.orig/kernel/signal.c 2005-09-30 16:05:55.407727738 -0500
+++ linux/kernel/signal.c 2005-10-04 10:48:50.626485968 -0500
@@ -1978,6 +1978,7 @@
EXPORT_SYMBOL(ptrace_notify);
EXPORT_SYMBOL(send_sig);
EXPORT_SYMBOL(send_sig_info);
+EXPORT_SYMBOL(send_group_sig_info);
EXPORT_SYMBOL(sigprocmask);
EXPORT_SYMBOL(block_all_signals);
EXPORT_SYMBOL(unblock_all_signals);