Here, I am proposing Process Aggregates support for the 2.6 kernel.
What is Process Aggregates (PAGG)?
----------------------------------
PAGG provides for the implementation of arbitrary process groups in Linux.
It is a building block for kernel modules that can group processes
together into a single set for specific purposes beyond the traditional
process groups.
What types of things could make use of PAGG?
--------------------------------------------
Some example uses for PAGG include:
- System accounting - the CSA module makes use of it (see below)
- Clustering - Keeping track of processes for clustering
- NUMA placement - Making sure groups of processes run on specific CPUs or
chunks of memory
- Batch Queuing systems - Ensure specific jobs run in their designated
places.
More information about PAGG
---------------------------
There is a web page for the PAGG project at SGI:
http://oss.sgi.com/projects/pagg/
Also, the patch includes the file Documentation/pagg.txt. You may look there
for detailed information.
One project that makes extensive use of PAGG is Comprehensive System
Accounting (CSA). Details on that can be found here:
http://oss.sgi.com/projects/csa/
Info on the attached patch
--------------------------
This patch is made to apply to the 2.6.5 kernel. Patches for some other
versions of the kernel (including 2.4 and 2.6 versions) may be found on
the PAGG web site above.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota
* Erik Jacobson ([email protected]) wrote:
> Here, I am proposing Process Aggregates support for the 2.6 kernel.
This looks like it's just the infrastructure, i.e. nothing is using it.
It seems like PAGG could be done on top of CKRM (albeit, with more
code). But if the goal is to do some basic accounting, scheduling, etc.
on a resource group, wouldn't CKRM be more generic?
Couple quick comments below.
> +struct pagg_hook {
> + struct module *module;
doesn't seem used.
> + char *name; /* Name Key - restricted to 32 characters */
why the restriction?
> +#define attach_pagg_list_chk(ct, pt) \
> +do { \
> + INIT_PAGG_LIST(&ct->pagg_list); \
> + if (!list_empty(&pt->pagg_list.head)) { \
> + if (attach_pagg_list(ct, pt) != 0) \
> + goto bad_fork_cleanup; \
> + } \
> +} while(0)
Goto a label defined elsewhere, buried in a macro. Please code this
openly.
> +#define detach_pagg_list_chk(t) \
> +do { \
> + if (!list_empty(&t->pagg_list.head)) { \
> + detach_pagg_list(t); \
> + } \
> +} while(0)
All these macros could be type safe inlined functions, and when config'd
off, use your alt. no-op macros.
> +#define read_lock_pagg_list(t) down_read(&t->pagg_list.sem)
> + /* Up the read semaphore for the task's pagg_list */
> +#define read_unlock_pagg_list(t) up_read(&t->pagg_list.sem)
> + /* Down the write semaphore for the task's pagg_list */
> +#define write_lock_pagg_list(t) down_write(&t->pagg_list.sem)
> + /* Up the write semaphore for the task's pagg_list */
> +#define write_unlock_pagg_list(t) up_write(&t->pagg_list.sem)
Just open code these. There's too much hidden in macros.
> @@ -488,11 +489,15 @@
>
> struct dentry *proc_dentry;
> struct backing_dev_info *backing_dev_info;
> -
> struct io_context *io_context;
>
> unsigned long ptrace_message;
> siginfo_t *last_siginfo; /* For ptrace use. */
> +
> +#if defined(CONFIG_PAGG)
> +/* List of pagg (process aggregate) attachments */
> + struct pagg_list pagg_list;
> +#endif
unused?
> +unregister_pagg_hook(struct pagg_hook *pagg_hook_old)
<snip>
> + down_write(&pagg_hook_list_sem);
> +
> + pagg_hook = get_pagg_hook(pagg_hook_old->name);
> + if (pagg_hook && pagg_hook == pagg_hook_old) {
> + /*
> + * Scan through processes on system and check for
> + * references to pagg containers for this pagg hook.
> + *
> + * The module cannot be unloaded if there are references.
> + */
> + read_lock(&tasklist_lock);
> + for_each_process(task) {
> + struct pagg *pagg = NULL;
> +
> + read_lock_pagg_list(task);
Uh-oh, grabbing a semaphore while holding tasklist_lock.
There's too much hidden in macros (like read_lock_pagg_list).
> +attach_pagg_list(struct task_struct *to_task, struct task_struct *from_task)
<snip>
> + to_pagg = alloc_pagg(to_task, from_pagg->hook);
> + if (!to_pagg) {
> + retcode = -ENOMEM;
> + goto error_return;
> + }
> + retcode = attach_pagg(to_task, to_pagg, from_pagg->data);
> + if (retcode != 0) {
> + /* attach should issue error message */
> + goto error_return;
> + }
This looks like it leaks the just alloc'd to_pagg.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Monday, April 26, 2004 4:39 pm, Chris Wright wrote:
> * Erik Jacobson ([email protected]) wrote:
> > Here, I am proposing Process Aggregates support for the 2.6 kernel.
>
> This looks like it's just the infrastructure, i.e. nothing is using it.
> It seems like PAGG could be done on top of CKRM (albeit, with more
> code). But if the goal is to do some basic accounting, scheduling, etc.
> on a resource group, wouldn't CKRM be more generic?
Quite possibly. Do you have a pointer to the latest bits/design docs?
Thanks,
Jesse
* Jesse Barnes ([email protected]) wrote:
> On Monday, April 26, 2004 4:39 pm, Chris Wright wrote:
> > * Erik Jacobson ([email protected]) wrote:
> > > Here, I am proposing Process Aggregates support for the 2.6 kernel.
> >
> > This looks like it's just the infrastructure, i.e. nothing is using it.
> > It seems like PAGG could be done on top of CKRM (albeit, with more
> > code). But if the goal is to do some basic accounting, scheduling, etc.
> > on a resource group, wouldn't CKRM be more generic?
>
> Quite possibly. Do you have a pointer to the latest bits/design docs?
Nothing aside from what's on ckrm.sf.net. I know they've been retooling
it a bit, but I'm not up on the current status.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
Thanks for the comments.
I made a change and responded to your other comments. Let me know if I
missed something.
I didn't choose to change the macros at this time - however, I'm not against
changing them either - I just haven't done it yet.
I'll attach a new patch.
On Mon, 26 Apr 2004, Chris Wright wrote:
> This looks like it's just the infrastructure, i.e. nothing is using it.
> It seems like PAGG could be done on top of CKRM (albeit, with more
> code). But if the goal is to do some basic accounting, scheduling, etc.
> on a resource group, wouldn't CKRM be more generic?
Right. A couple examples of things we have that use it are CSA
(oss.sgi.com/csa) and job. job provides inescapable job containers that
are also used by csa.
But what I presented here was just the infrastructure as you said.
Patches for inescapable job containers ('job') are available on the pagg web
site as well (oss.sgi.com/pagg).
> Couple quick comments below.
>
> > +struct pagg_hook {
> > + struct module *module;
When another kernel module makes use of PAGG, it sets this in the pagg
hook.
> > + char *name; /* Name Key - restricted to 32 characters */
>
> why the restriction?
I'm open to suggestions. Right now, this is usually set to something like
"job" or similar. The max length is enforced by the module that makes use
of pagg. For example, with the job package:
...
#define PAGG_NAMELN 32 /* Max chars in PAGG module name */
...
#define PAGG_JOB "job" /* PAGG module identifier string */
...
> > +unregister_pagg_hook(struct pagg_hook *pagg_hook_old)
> <snip>
> > + down_write(&pagg_hook_list_sem);
> > +
> > + pagg_hook = get_pagg_hook(pagg_hook_old->name);
> > + if (pagg_hook && pagg_hook == pagg_hook_old) {
> > + /*
> > + * Scan through processes on system and check for
> > + * references to pagg containers for this pagg hook.
> > + *
> > + * The module cannot be unloaded if there are references.
> > + */
> > + read_lock(&tasklist_lock);
> > + for_each_process(task) {
> > + struct pagg *pagg = NULL;
> > +
> > + read_lock_pagg_list(task);
>
> Uh-oh, grabbing a semaphore while holding tasklist_lock.
>
> There's too much hidden in macros (like read_lock_pagg_list).
I fixed the tasklist issue you were concerned about. Again, I didn't address
the macro issue at this moment.
> > +attach_pagg_list(struct task_struct *to_task, struct task_struct *from_task)
> <snip>
> > + to_pagg = alloc_pagg(to_task, from_pagg->hook);
> > + if (!to_pagg) {
> > + retcode = -ENOMEM;
> > + goto error_return;
> > + }
> > + retcode = attach_pagg(to_task, to_pagg, from_pagg->data);
> > + if (retcode != 0) {
> > + /* attach should issue error message */
> > + goto error_return;
> > + }
>
> This looks like it leaks the just alloc'd to_pagg.
I agree that it looks suspect but I think it's OK.
You're talking about the case where the pagg was allocated, but couldn't
attach I assume.
The alloc_pagg function adds that allocated pagg to the pagg list. In
error_return, detach_pagg_list is called so this pagg should be freed then.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota
> Nothing aside from what's on ckrm.sf.net. I know they've been retooling
> it a bit, but I'm not up on the current status.
The web site contains API docs and some out-dated patches. I joined the
mailing list and asked for the latest stuff. They told me they'd be posting
a new patch to their mailing list soon.
I expect the "new" stuff uses the virtual filesystem interface and other
things suggested by the API docs they have.
I plan to look at this in more detail when I have the latest stuff to
think about.
My first impression is that pagg itself could be used to implement parts of
what ckrm is doing if they desired and not necessarily the other way around.
It's not clear to me if this is something that will be accepted in to the
main kernel source or not.
I'll post again when I learn more about it and can form a better opinion.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota
* Erik Jacobson ([email protected]) wrote:
> I expect the "new" stuff uses the virtual filesystem interface and other
> things suggested by the API docs they have.
*nod*
> My first impression is that pagg itself could be used to implement parts of
> what ckrm is doing if they desired and not necessarily the other way around.
Guess the key point is that many folks are interested in some sort of
aggregate resource container. QoS on virtual servers, make rlimit type
of limits acutally useful, your needs, etc. Be nice to come from
common infrastructure.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
* Erik Jacobson ([email protected]) wrote:
> I didn't choose to change the macros at this time - however, I'm not against
> changing them either - I just haven't done it yet.
OK. I still think it's a good idea for readability and type safety.
> On Mon, 26 Apr 2004, Chris Wright wrote:
> > This looks like it's just the infrastructure, i.e. nothing is using it.
> > It seems like PAGG could be done on top of CKRM (albeit, with more
> > code). But if the goal is to do some basic accounting, scheduling, etc.
> > on a resource group, wouldn't CKRM be more generic?
>
> Right. A couple examples of things we have that use it are CSA
> (oss.sgi.com/csa) and job. job provides inescapable job containers that
> are also used by csa.
>
> But what I presented here was just the infrastructure as you said.
>
> Patches for inescapable job containers ('job') are available on the pagg web
> site as well (oss.sgi.com/pagg).
OK, thanks, I'll take a look.
> > > + char *name; /* Name Key - restricted to 32 characters */
> >
> > why the restriction?
>
> I'm open to suggestions. Right now, this is usually set to something like
> "job" or similar. The max length is enforced by the module that makes use
> of pagg. For example, with the job package:
Right, if it's a pointer, and you guarantee it's NULL terminated, then I
don't see the point. Otherwise, strncmp() or something?
> I fixed the tasklist issue you were concerned about. Again, I didn't address
> the macro issue at this moment.
Alright, see below.
> > This looks like it leaks the just alloc'd to_pagg.
>
> I agree that it looks suspect but I think it's OK.
>
> You're talking about the case where the pagg was allocated, but couldn't
> attach I assume.
>
> The alloc_pagg function adds that allocated pagg to the pagg list. In
> error_return, detach_pagg_list is called so this pagg should be freed then.
Yes, I see it now, thanks. BTW, I see a common idiom here of:
if (!list_empty) {
list_for_each() {
}
}
Seems like mostly an empty optimization, since list_for_each essentially
does that list_empty() check, no?
> +unregister_pagg_hook(struct pagg_hook *pagg_hook_old)
> +{
<snip>
> + down_write(&pagg_hook_list_sem);
<snip>
> + read_lock(&tasklist_lock);
> + for_each_process(task) {
> + struct pagg *pagg = NULL;
> +
> + get_task_struct(task); /* So the task doesn't vanish on us */
> + read_unlock(&tasklist_lock);
dropped tasklist_lock, task could exit, and unlink and potentially drop the
only other ref.
> + read_lock_pagg_list(task);
> + pagg = get_pagg(task, pagg_hook_old->name);
> + put_task_struct(task);
and this could have totally freed the memory to task.
still looks unsafe to me.
> + /*
> + * We won't be accessing pagg's memory, just need
> + * to see if one existed - so we can release the task
> + * lock now.
> + */
> + read_unlock_pagg_list(task);
> + if (pagg) {
> + up_write(&pagg_hook_list_sem);
> + return -EBUSY;
> + }
> +
> + /* lock the task list again so we get a valid task in the loop */
> + read_lock(&tasklist_lock);
> + }
> + read_unlock(&tasklist_lock);
> + list_del_init(&pagg_hook->entry);
> + up_write(&pagg_hook_list_sem);
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
Highlevel comments:
- without any user merging doesn't make sense
- you probably want to update all the function/data structure comments
to the normal kernel-doc style.
> +- init/Config.help
> +- init/Config.in
Where did you find these files? :)
> +- Documentation/Configure.help
Dito.
> +This implementation of PAGG supports the i386 and ia64 architecture.
Can't find anything architecture-specific here.
The whole chapter 2 of this document doesn't belong into the kernel
tree.
> @@ -1151,6 +1152,7 @@
> retval = search_binary_handler(&bprm,regs);
> if (retval >= 0) {
> free_arg_pages(&bprm);
> + exec_pagg_list_chk(current);
This looks rather misnamed. pagg_exec sounds like a better name,
with __pagg_exec for the implementation after the inline list_empty
check.
> +#ifndef _PAGG_H
> +#define _PAGG_H
should be _LINUX_PAGG_H
> +#define INIT_PAGG_LIST(l) \
> +do { \
> + INIT_LIST_HEAD(l.head); \
> + init_rwsem(l.sem); \
> +} while(0)
braces around l here to avoid too much trouble?
> +struct pagg_hook {
> + struct module *module;
> + char *name; /* Name Key - restricted to 32 characters */
> + int (*attach)(struct task_struct *, struct pagg *, void*);
> + int (*detach)(struct task_struct *, struct pagg *);
> + int (*init)(struct task_struct *, struct pagg *);
> + void *data; /* Opaque module specific data */
> + struct list_head entry; /* List pointers */
> + void (*exec)(struct task_struct *, struct pagg *);
> +};
The ordering here looks strange, please keep data and methods ordered,
ala:
struct pagg_hook {
struct module *module;
char *name; /* Name Key - restricted to 32 characters */
void *data; /* Opaque module specific data */
struct list_head entry; /* List pointers */
int (*init)(struct task_struct *, struct pagg *);
int (*attach)(struct task_struct *, struct pagg *, void*);
int (*detach)(struct task_struct *, struct pagg *);
void (*exec)(struct task_struct *, struct pagg *);
};
> +extern struct pagg *get_pagg(struct task_struct *task, char *key);
> +extern struct pagg *alloc_pagg(struct task_struct *task,
> + struct pagg_hook *pt);
> +extern void free_pagg(struct pagg *pagg);
> +extern int register_pagg_hook(struct pagg_hook *pt_new);
> +extern int unregister_pagg_hook(struct pagg_hook *pt_old);
> +extern int attach_pagg_list(struct task_struct *to_task,
> + struct task_struct *from_task);
> +extern int detach_pagg_list(struct task_struct *task);
> +extern int exec_pagg_list(struct task_struct *task);
I'd call all these pagg_*. Also please kill the _list postfixes,
they're extremly confusing.
> +/*
> + * Macro used when a child process must inherit attachment to pagg
> + * containers from the parent.
> + *
> + * Arguments:
> + * ct: child (struct task_struct *)
> + * pt: parent (struct task_struct *)
> + * cf: clone_flags
> + */
> +#define attach_pagg_list_chk(ct, pt) \
> +do { \
> + INIT_PAGG_LIST(&ct->pagg_list); \
> + if (!list_empty(&pt->pagg_list.head)) { \
> + if (attach_pagg_list(ct, pt) != 0) \
> + goto bad_fork_cleanup; \
> + } \
> +} while(0)
Should probably be an inline, ala:
static inline int pagg_attach(struct task_struct *child,
struct task_struct *parent)
{
INIT_PAGG_LIST(&child->pagg_list);
if (!list_empty(&parent->pagg_list.head))
return __pagg_attach(child, parent));
return 0;
}
and then handle the error in the caller.
> +#define detach_pagg_list_chk(t) \
> +do { \
> + if (!list_empty(&t->pagg_list.head)) { \
> + detach_pagg_list(t); \
> + } \
> +} while(0)
static inline void pagg_detach(struct task_struct *task)
{
if (!list_empty(&task->pagg_list.head))
__pagg_detach(task);
}
> +#define exec_pagg_list_chk(t) \
> +do { \
> + if (!list_empty(&t->pagg_list.head)) { \
> + exec_pagg_list(t); \
> + } \
> +} while(0)
Dito.
> + /* Invoke module detach callback for the pagg & task */
> +#define detach_pagg(t, p) p->hook->detach(t, p)
> + /* Invoke module attach callback for the pagg & task */
> +#define attach_pagg(t, p, d) p->hook->attach(t, p, (void *)d)
> + /* Allows optional callout at exec */
> +#define exec_pagg(t, p) do { \
> + if (p->hook->exec) \
> + p->hook->exec(t, p);\
> + } while(0)
please kill all these wrappers. in linux we call methods directly,
unlike the sysv style :) Also why is the exec hook conditional and the
others not? please make that coherent.
>
> + /* Allows module to set data item for pagg */
> +#define set_pagg(p, d) p->data = (void *)d
> + /* Down the read semaphore for the task's pagg_list */
> +#define read_lock_pagg_list(t) down_read(&t->pagg_list.sem)
> + /* Up the read semaphore for the task's pagg_list */
> +#define read_unlock_pagg_list(t) up_read(&t->pagg_list.sem)
> + /* Down the write semaphore for the task's pagg_list */
> +#define write_lock_pagg_list(t) down_write(&t->pagg_list.sem)
> + /* Up the write semaphore for the task's pagg_list */
> +#define write_unlock_pagg_list(t) up_write(&t->pagg_list.sem)
thos were already mentioned, please kill all those accesors..
> +#if defined(CONFIG_PAGG)
#ifdef CONFIG_PAGG is preferred style in linux.
> +++ 2.6pagg-patch/kernel/Makefile 2004-04-13 21:42:35.000000000 -0500
> @@ -7,7 +7,7 @@
> sysctl.o capability.o ptrace.o timer.o user.o \
> signal.o sys.o kmod.o workqueue.o pid.o \
> rcupdate.o intermodule.o extable.o params.o posix-timers.o \
> - kthread.o
> + kthread.o pagg.o
do you really want to build in pagg.o all the time, even without
CONFIG_PAGG set?
> obj-$(CONFIG_COMPAT) += compat.o
> +obj-$(CONFIG_PAGG) += pagg.o
.. then you wouldn't need this line at least :)
> + * structure maintains pointers to callback functions and
> + * data strucures maintained in modules that have
> + * registered with the kernel as pagg container
> + * providers.
> + */
> +
> +#include <linux/config.h>
> +
> +#ifdef CONFIG_PAGG
this one isn't needed if you properly compile pagg.o only if CONFIG_PAGG
is set..
> +#include <asm/uaccess.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <asm/semaphore.h>
> +#include <linux/smp_lock.h>
> +#include <linux/proc_fs.h>
> +#include <linux/module.h>
> +#include <linux/pagg.h>
Please include asm/ headers after linux/. smp_lock.h, proc_fs.h amd
uaccess.h don't seem to be needed.
> +struct pagg *
> +get_pagg(struct task_struct *task, char *key)
> +{
> + struct list_head *entry;
> +
> + list_for_each(entry, &task->pagg_list.head) {
> + struct pagg *pagg = list_entry(entry, struct pagg, entry);
list_for_each_entry()
> + if (!strcmp(pagg->hook->name,key)) {
> + return pagg;
> + }
superflous braces here.
> + pagg = (struct pagg *)kmalloc(sizeof(struct pagg), GFP_KERNEL);
no need to cast.
> +free_pagg(struct pagg *pagg)
> +{
> +
> + list_del(&pagg->entry);
> + kfree(pagg);
> +}
that blank line over the list_del looks rather strange..
> + list_for_each(entry, &pagg_hook_list) {
> + pagg_hook = list_entry(entry, struct pagg_hook, entry);
list_for_each_entry again.
> + /* Try to insert new hook entry into the pagg hook list */
> + down_write(&pagg_hook_list_sem);
does this really need a semaphore? a spinlock looks like it could do it
aswell - or am I missing a blocking function somewhere?
> + printk(KERN_INFO "Registering PAGG support for (name=%s)\n",
> + pagg_hook_new->name);
sounds rather verbose, no?
> + for_each_process(task) {
> + struct pagg *pagg = NULL;
> +
> + get_task_struct(task); /* So the task doesn't vanish on us */
> + read_unlock(&tasklist_lock);
> + read_lock_pagg_list(task);
> + pagg = get_pagg(task, pagg_hook_old->name);
> + put_task_struct(task);
> + /*
> + * We won't be accessing pagg's memory, just need
> + * to see if one existed - so we can release the task
> + * lock now.
> + */
> + read_unlock_pagg_list(task);
> + if (pagg) {
> + up_write(&pagg_hook_list_sem);
> + return -EBUSY;
> + }
> +
if the pagg list lock wasn't a sleeping lock this could be much simpler,
no?
> + printk(KERN_INFO "Unregistering PAGG support for"
> + " (name=%s)\n", pagg_hook_old->name);
also overly verbose.
> + /* Remove ref. to paggs from task immediately */
> + write_lock_pagg_list(task);
> +
> + if (list_empty(&task->pagg_list.head)) {
> + write_unlock_pagg_list(task);
> + return retcode;
> + }
> +
> + list_for_each(entry, &task->pagg_list.head) {
> + int rettemp = 0;
> + struct pagg *pagg = list_entry(entry, struct pagg, entry);
list_for_each* is a noop for an empty list. also you want
list_for_each_entry again.
> +int exec_pagg_list(struct task_struct *task) {
brace wants to be on the next line.
>
> + struct list_head *entry;
> +
> +
> +
huh?
> +EXPORT_SYMBOL(get_pagg);
> +EXPORT_SYMBOL(alloc_pagg);
> +EXPORT_SYMBOL(free_pagg);
> +EXPORT_SYMBOL(attach_pagg_list);
> +EXPORT_SYMBOL(detach_pagg_list);
> +EXPORT_SYMBOL(exec_pagg_list);
> +EXPORT_SYMBOL(register_pagg_hook);
> +EXPORT_SYMBOL(unregister_pagg_hook);
should probably be _GPL as this directly messed with highly kernel
internal process managment.
> - without any user merging doesn't make sense
Could you try restating this particular comment, Christoph?
I am failing to make any sense of it.
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
* Paul Jackson ([email protected]) wrote:
> > - without any user merging doesn't make sense
>
> Could you try restating this particular comment, Christoph?
> I am failing to make any sense of it.
Same thing I was trying to get across. Merging an infrastructure with
no users is a tough sell.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Thu, Apr 29, 2004 at 12:20:26PM -0700, Paul Jackson wrote:
> > - without any user merging doesn't make sense
>
> Could you try restating this particular comment, Christoph?
> I am failing to make any sense of it.
without merging anything that actually uses pagg getting pagg itself
into the kernel doesn't make a lot of sense.
Thanks for your clarifications, Christoph and Chris.
Crystal clear now ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
> without merging anything that actually uses pagg getting pagg itself
> into the kernel doesn't make a lot of sense.
We have job and CSA that make use of it but I didn't know if I should
push them all at the same time, or one at a time. As it is, the suggestions
provided for PAGG are making for some significant revisions. The changes
require adjustments to job as well.
I'm hoping to get something with nearly all the suggestions implemented for
PAGG posted, then I can post job if you think that will help. I'm sure
there will be lots of comments on that as well -- and those comments will
probably require revisions to CSA :-)
If you're saying there really is zero chance even if I implement all the
suggestions and have things that use it, I guess I'll just have to live with
that (what's my other choice? :). At least the patches will be better off
even if they exist as patches forever. We're really hoping we can get this in
though... So I want to do everything I can to give it the best shot possible.
--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota
On Mon, 26 Apr 2004, Chris Wright wrote:
> > Quite possibly. Do you have a pointer to the latest bits/design docs?
>
> Nothing aside from what's on ckrm.sf.net. I know they've been retooling
> it a bit, but I'm not up on the current status.
Shailabh posted the latest CKRM code to lkml yesterday.
CKRM + rcfs seems to be slightly more capable than the PAGG code;
furthermore, CKRM already has a number of resource controller
modules while I'm only seeing very basic PAGG infrastructure.
Now would be a good time to join forces...
I admit that the CKRM project so far seems to have been mostly
an IBM internal project, but since there is community interest
it'd be time to get the community involved.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, 29 Apr 2004, Erik Jacobson wrote:
> If you're saying there really is zero chance even if I implement all the
> suggestions and have things that use it, I guess I'll just have to live
> with that (what's my other choice? :).
I suspect there's a rather good chance of merging a common
PAGG/CKRM infrastructure, since they pretty much do the same
thing at the core and they both have different functionality
implemented on top of the core process grouping.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
> I suspect there's a rather good chance of merging a common
> PAGG/CKRM infrastructure, since they pretty much do the same
> thing at the core and they both have different functionality
> implemented on top of the core process grouping.
Still doesn't make a lot of sense. CKRM is a huge cludgy beast poking
everywhere while PAGG is a really small layer to allow kernel modules
keeping per-process state. If CKRM gets merged at all (and the current
looks far to horrible and the gains are rather unclear) it should layer
ontop of something like PAGG for the functionality covered by it.
Selon Erik Jacobson <[email protected]>:
> What is Process Aggregates (PAGG)?
> ----------------------------------
> PAGG provides for the implementation of arbitrary process groups in Linux.
> It is a building block for kernel modules that can group processes
> together into a single set for specific purposes beyond the traditional
> process groups.
Hello,
I'm working on a project (ELSA -Enhanced Linux System Accounting) that is
very similar to PAGG and CSA. My goal is to improve accounting on Linux by using
containers to group processes according to the system administrator policy.
Currently I provide a patch to 2.6.5 kernel ( http://elsa.sourceforge.net ) that
provides infrastructures to manipulate containers. The advantage of my patch is
that manipulating containers is very simple. I think that it could be very
interesting to share our work to see what could be done. Thus I will take a look
to PAGG implementation and also to the CKRM solution proposed by Chris Wright.
Best,
Guillaume
Selon Christoph Hellwig <[email protected]>:
> > I suspect there's a rather good chance of merging a common
> > PAGG/CKRM infrastructure, since they pretty much do the same
> > thing at the core and they both have different functionality
> > implemented on top of the core process grouping.
>
> Still doesn't make a lot of sense. CKRM is a huge cludgy beast poking
> everywhere while PAGG is a really small layer to allow kernel modules
> keeping per-process state. If CKRM gets merged at all (and the current
> looks far to horrible and the gains are rather unclear) it should layer
> ontop of something like PAGG for the functionality covered by it.
And what about put the management of containers outside the kernel. We could for
exemple use a program that will listen file /proc/acct_event and execute a
programs to handle the event like ACPID does. Of course it will need some kernel
modifications but those modifications will be small as process aggregation will
be done outside the kernel. We could also use relayfs to exchange datas between
user program and the kernel.
Guillaume
On Fri, 30 Apr 2004, Christoph Hellwig wrote:
> Still doesn't make a lot of sense. CKRM is a huge cludgy beast poking
> everywhere while PAGG is a really small layer to allow kernel modules
> keeping per-process state. If CKRM gets merged at all (and the current
> looks far to horrible and the gains are rather unclear) it should layer
> ontop of something like PAGG for the functionality covered by it.
What was the last time you looked at the CKRM source?
Sure it's a bit bigger than PAGG, but that's also because
it includes the functionality to change the group a process
belongs to and other things that don't seem to be included
in the PAGG patch.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Fri, Apr 30, 2004 at 08:54:08AM -0400, Rik van Riel wrote:
> What was the last time you looked at the CKRM source?
the day before yesterday (the patch in SuSE's tree because there
doesn't seem to be any official patch on their website)
> Sure it's a bit bigger than PAGG, but that's also because
> it includes the functionality to change the group a process
> belongs to and other things that don't seem to be included
> in the PAGG patch.
Again, pagg doesn't even play in that league. It's really just a tiny
meachnism to allow a kernel module keep per-process data. Policies
like process-groups can be implemented ontop of that.
On Fri, 2004-04-30 at 09:06, Christoph Hellwig wrote:
> On Fri, Apr 30, 2004 at 08:54:08AM -0400, Rik van Riel wrote:
> > What was the last time you looked at the CKRM source?
>
> the day before yesterday (the patch in SuSE's tree because there
> doesn't seem to be any official patch on their website)
>
Somewhat unrelated, but the day before yesterday suse was at ckrm-e5,
we're now at ckrm-e12.
-chris
On Fri, 30 Apr 2004, Christoph Hellwig wrote:
> Again, pagg doesn't even play in that league. It's really just a tiny
> meachnism to allow a kernel module keep per-process data. Policies like
> process-groups can be implemented on top of that.
So basically you're arguing that PAGG is better because it
doesn't do what's needed ? ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Rik van Riel ([email protected]) wrote:
> On Fri, 30 Apr 2004, Christoph Hellwig wrote:
>
> > Still doesn't make a lot of sense. CKRM is a huge cludgy beast poking
> > everywhere while PAGG is a really small layer to allow kernel modules
> > keeping per-process state. If CKRM gets merged at all (and the current
> > looks far to horrible and the gains are rather unclear) it should layer
> > ontop of something like PAGG for the functionality covered by it.
>
> What was the last time you looked at the CKRM source?
>
> Sure it's a bit bigger than PAGG, but that's also because
> it includes the functionality to change the group a process
> belongs to and other things that don't seem to be included
> in the PAGG patch.
I looked briefly at one of the PAGG modules called job. It contains the
grouping functionalities. I suspect that this is something that would
be a common need for all users of a resource grouping mechanism.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Fri, Apr 30, 2004 at 11:22:49AM -0400, Rik van Riel wrote:
> > Again, pagg doesn't even play in that league. It's really just a tiny
> > meachnism to allow a kernel module keep per-process data. Policies like
> > process-groups can be implemented on top of that.
>
> So basically you're arguing that PAGG is better because it
> doesn't do what's needed ? ;)
I told you a bunch of times that's it's a different thing. Simply keeping
per-process state might be a useful building block for some monster resource
whatever fuckup, but certainly not the other way around.
Chris Mason wrote:
> On Fri, 2004-04-30 at 09:06, Christoph Hellwig wrote:
>
>>On Fri, Apr 30, 2004 at 08:54:08AM -0400, Rik van Riel wrote:
>>
>>>What was the last time you looked at the CKRM source?
>>
>>the day before yesterday (the patch in SuSE's tree because there
>>doesn't seem to be any official patch on their website)
That was rectified concommitant with the lkml posting of the patches
for ckrm-E12. Please see the Implementation section of
http://ckrm.sf.net for all the current patches.
>>
>
> Somewhat unrelated, but the day before yesterday suse was at ckrm-e5,
> we're now at ckrm-e12.
Good point. One of the major changes between ckrm-e5 and ckrm-e12 is a
serious attempt at modularizing and cleaning up the internal
interfaces which should help allay concerns about it being a big piece
of code which has to be taken in whole.
From the view of kernel developers considering merging CKRM into the
kernel, only two components are essential:
core
rcfs
Of course, to do anything useful, you need to have either one of
task_class: groups tasks together
socket_class: groups sockets together
the two are completely independent.
Once a particular grouping is chosen, one can further selectively
include one or more resource controllers associated with the grouping.
i.e. for task_classes, choose one or more of cpu, mem, io,
numtasks....; for socket_class, choose one or more of listenaq,<future
socket based controllers, potentially including outbound network
control)....
The same kind of flexibility that is available to kernel developers
for integrating parts of ckrm selectively, will also remain available
to users, even if all of CKRM is included. So a user could enable just
task_classes and the cpu controller if s/he doesn't care about memory,
io or any other kind of control.
-- Shailabh
Christoph Hellwig wrote:
> On Fri, Apr 30, 2004 at 08:54:08AM -0400, Rik van Riel wrote:
>
>>What was the last time you looked at the CKRM source?
>
>
> the day before yesterday (the patch in SuSE's tree because there
> doesn't seem to be any official patch on their website)
>
>
>>Sure it's a bit bigger than PAGG, but that's also because
>>it includes the functionality to change the group a process
>>belongs to and other things that don't seem to be included
>>in the PAGG patch.
>
>
> Again, pagg doesn't even play in that league. It's really just a tiny
> meachnism to allow a kernel module keep per-process data.
Speaking of per-process data, one of the classification engines of
CKRM called crbce, implemented as a module, allows per-process data to
be sent to userland. crbce in particular, exports data on the delays
seen by processes in a) waiting for cpu time after being runnable
b) page fault service time c) io service time etc. (getting the data
requires another kernel patch)....so per-process data needs can be met
through CKRM, though that is not the intent or main objective of the
project.
> Policies
> like process-groups can be implemented ontop of that.
This is true if one is only interested in data gathering or
coarse-grain control. One could monitor per-process stats and fiddle
with each process' rlimits (assuming all the ones needed are
available) and achieve coarse-grain group control.
But if processes leave and join process groups rapidly, you need the
schedulers and the core kernel to be aware of the groupings and
schedule resources accordingly.
In CKRM, the premise is that the privileged user defines the way
processes get grouped and could do so in a way that leads to rapid
changes in group membership. So having group control/monitoring
policies implemented as an externally loaded module (not talking of
scheduler modifications as modules, which is a no-no) is not a
palatable option.
-- Shailabh
Guillaume Thouvenin wrote:
> Selon Christoph Hellwig <[email protected]>:
>
>
>>>I suspect there's a rather good chance of merging a common
>>>PAGG/CKRM infrastructure, since they pretty much do the same
>>>thing at the core and they both have different functionality
>>>implemented on top of the core process grouping.
>>
>>Still doesn't make a lot of sense. CKRM is a huge cludgy beast poking
>>everywhere while PAGG is a really small layer to allow kernel modules
>>keeping per-process state. If CKRM gets merged at all (and the current
>>looks far to horrible and the gains are rather unclear) it should layer
>>ontop of something like PAGG for the functionality covered by it.
>
>
> And what about put the management of containers outside the kernel. We could for
> exemple use a program that will listen file /proc/acct_event and execute a
> programs to handle the event like ACPID does. Of course it will need some kernel
> modifications but those modifications will be small as process aggregation will
> be done outside the kernel. We could also use relayfs to exchange datas between
> user program and the kernel.
>
> Guillaume
Guillaume,
As mentioned in my response to Christoph, keeping process aggregation
outside the kernel (or as a module that sits atop process-centric
patches) will work only for statistics gathering and coarse-grain
control.
CKRM's crbce controller (will be put up on http://ckrm.sf.net within a
day...) uses relayfs to send per-process data to a privileged user
program (will also be included) that can use the data as it pleases,
including doing aggregation.
We think a class-aware kernel is the right way to go and it can be
done with sufficiently low impact that one doesn't have to
unnecessarily limit the flexibility of users in defining process
groups (=classes) or the time periods over which shares can be enforced.
-- Shailabh
* Shailabh ([email protected]) wrote:
> In CKRM, the premise is that the privileged user defines the way
> processes get grouped and could do so in a way that leads to rapid
> changes in group membership. So having group control/monitoring
> policies implemented as an externally loaded module (not talking of
> scheduler modifications as modules, which is a no-no) is not a
> palatable option.
Yes, this is why I looked at the PAGG job module. I was looking for how
it might have mucked (externally) with scheduler. At any rate, I found
all the primitives for joining/leaving/defining groups here which I'd
have expected closer to core.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
On Fri, 30 Apr 2004, Guillaume Thouvenin wrote:
> And what about put the management of containers outside the kernel.
User Mode Linux would be an option indeed, if the overhead
was low enough for general use. It's quite possible that
this can be done...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan