2007-05-01 18:12:46

by Balbir Singh

Subject: Re: [PATCH 3/9] Containers (V9): Add tasks file interface

> +static int attach_task_by_pid(struct container *cont, char *pidbuf)
> +{
> +        pid_t pid;
> +        struct task_struct *tsk;
> +        int ret;
> +
> +        if (sscanf(pidbuf, "%d", &pid) != 1)
> +                return -EIO;
> +
> +        if (pid) {
> +                read_lock(&tasklist_lock);

You could just use rcu_read_lock() and rcu_read_unlock() instead
of read_lock(&tasklist_lock) and read_unlock(&tasklist_lock).
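For illustration, a minimal sketch of the suggested change (assuming the
RCU-based pid lookup in this tree makes find_task_by_pid() safe to call
under rcu_read_lock(), which is the premise of the suggestion):

        rcu_read_lock();
        tsk = find_task_by_pid(pid);
        if (!tsk || (tsk->flags & PF_EXITING)) {
                rcu_read_unlock();
                return -ESRCH;
        }
        /* Pin the task before leaving the RCU read-side section. */
        get_task_struct(tsk);
        rcu_read_unlock();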

> +
> +                tsk = find_task_by_pid(pid);
> +                if (!tsk || tsk->flags & PF_EXITING) {
> +                        read_unlock(&tasklist_lock);
> +                        return -ESRCH;
> +                }
> +
> +                get_task_struct(tsk);
> +                read_unlock(&tasklist_lock);
> +
> +                if ((current->euid) && (current->euid != tsk->uid)
> +                    && (current->euid != tsk->suid)) {
> +                        put_task_struct(tsk);
> +                        return -EACCES;
> +                }
> +        } else {
> +                tsk = current;
> +                get_task_struct(tsk);
> +        }
> +
> +        ret = attach_task(cont, tsk);
> +        put_task_struct(tsk);
> +        return ret;
> +}
> +
> /* The various types of files and directories in a container file system */
>
> typedef enum {
> @@ -684,6 +789,54 @@ typedef enum {
>         FILE_TASKLIST,
> } container_filetype_t;
>
> +static ssize_t container_common_file_write(struct container *cont,
> +                                           struct cftype *cft,
> +                                           struct file *file,
> +                                           const char __user *userbuf,
> +                                           size_t nbytes, loff_t *unused_ppos)
> +{
> +        container_filetype_t type = cft->private;
> +        char *buffer;
> +        int retval = 0;
> +
> +        if (nbytes >= PATH_MAX)
> +                return -E2BIG;
> +
> +        /* +1 for nul-terminator */
> +        if ((buffer = kmalloc(nbytes + 1, GFP_KERNEL)) == 0)
> +                return -ENOMEM;
> +
> +        if (copy_from_user(buffer, userbuf, nbytes)) {
> +                retval = -EFAULT;
> +                goto out1;
> +        }
> +        buffer[nbytes] = 0;        /* nul-terminate */
> +
> +        mutex_lock(&container_mutex);
> +
> +        if (container_is_removed(cont)) {
> +                retval = -ENODEV;
> +                goto out2;
> +        }

Can't we make this check prior to kmalloc() and copy_from_user()?
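Something like this, say (just a sketch - the early test would be
advisory only, since it is racy without container_mutex, so the locked
check later in the function would still have to stay):

        /* Cheap, advisory early-out; the authoritative check happens
         * again under container_mutex below. */
        if (container_is_removed(cont))
                return -ENODEV;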



> +int container_task_count(const struct container *cont) {
> +        int count = 0;
> +        struct task_struct *g, *p;
> +        struct container_subsys_state *css;
> +        int subsys_id;
> +        get_first_subsys(cont, &css, &subsys_id);
> +
> +        read_lock(&tasklist_lock);

Can be replaced with rcu_read_lock() and rcu_read_unlock()

> +        do_each_thread(g, p) {
> +                if (task_subsys_state(p, subsys_id) == css)
> +                        count++;
> +        } while_each_thread(g, p);
> +        read_unlock(&tasklist_lock);
> +        return count;
> +}
> +
> +static int pid_array_load(pid_t *pidarray, int npids, struct container *cont)
> +{
> +        int n = 0;
> +        struct task_struct *g, *p;
> +        struct container_subsys_state *css;
> +        int subsys_id;
> +        get_first_subsys(cont, &css, &subsys_id);
> +        rcu_read_lock();
> +        read_lock(&tasklist_lock);

The read_lock() and read_unlock() on tasklist_lock are redundant here,
since rcu_read_lock() is already held.

> +
> +        do_each_thread(g, p) {
> +                if (task_subsys_state(p, subsys_id) == css) {
> +                        pidarray[n++] = pid_nr(task_pid(p));
> +                        if (unlikely(n == npids))
> +                                goto array_full;
> +                }
> +        } while_each_thread(g, p);
> +
> +array_full:
> +        read_unlock(&tasklist_lock);
> +        rcu_read_unlock();
> +        return n;
> +}
> +
[snip]

> +static int container_tasks_open(struct inode *unused, struct file *file)
> +{
> +        struct container *cont = __d_cont(file->f_dentry->d_parent);
> +        struct ctr_struct *ctr;
> +        pid_t *pidarray;
> +        int npids;
> +        char c;
> +
> +        if (!(file->f_mode & FMODE_READ))
> +                return 0;
> +
> +        ctr = kmalloc(sizeof(*ctr), GFP_KERNEL);
> +        if (!ctr)
> +                goto err0;
> +
> +        /*
> +         * If container gets more users after we read count, we won't have
> +         * enough space - tough. This race is indistinguishable to the
> +         * caller from the case that the additional container users didn't
> +         * show up until sometime later on.
> +         */
> +        npids = container_task_count(cont);
> +        pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
> +        if (!pidarray)
> +                goto err1;
> +
> +        npids = pid_array_load(pidarray, npids, cont);
> +        sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
> +
> +        /* Call pid_array_to_buf() twice, first just to get bufsz */
> +        ctr->bufsz = pid_array_to_buf(&c, sizeof(c), pidarray, npids) + 1;
> +        ctr->buf = kmalloc(ctr->bufsz, GFP_KERNEL);
> +        if (!ctr->buf)
> +                goto err2;
> +        ctr->bufsz = pid_array_to_buf(ctr->buf, ctr->bufsz, pidarray, npids);
> +
> +        kfree(pidarray);
> +        file->private_data = ctr;
> +        return 0;
> +
> +err2:
> +        kfree(pidarray);
> +err1:
> +        kfree(ctr);
> +err0:
> +        return -ENOMEM;
> +}
> +

Any chance we could get a per-container task list? It will
help subsystem writers as well. Alternatively, subsystems
could use the attach_task() callback to track all tasks,
but a per-container list will avoid duplication.
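For the sake of discussion, a minimal sketch of what such a list might
look like (all names here are hypothetical, not part of the posted
patches):

        struct container {
                /* ... existing fields ... */
                struct list_head tasks;        /* every task attached here */
                spinlock_t tasks_lock;         /* protects the list */
        };

        /* task_struct would grow a matching node, e.g.
         * struct list_head container_node; and attach_task()
         * would move the task between the two lists: */
        static void container_move_task(struct task_struct *tsk,
                                        struct container *from,
                                        struct container *to)
        {
                spin_lock(&from->tasks_lock);
                list_del(&tsk->container_node);
                spin_unlock(&from->tasks_lock);

                spin_lock(&to->tasks_lock);
                list_add_tail(&tsk->container_node, &to->tasks);
                spin_unlock(&to->tasks_lock);
        }

fork() and exit() would need matching hooks, which is where the
bookkeeping overhead would come from.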



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


2007-05-01 20:38:12

by Paul Menage

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

On 5/1/07, Balbir Singh <[email protected]> wrote:
> > +        if (container_is_removed(cont)) {
> > +                retval = -ENODEV;
> > +                goto out2;
> > +        }
>
> Can't we make this check prior to kmalloc() and copy_from_user()?

We could, but I'm not sure what it would buy us - we'd be optimizing
for the case that essentially never occurs.

>
>
>
> > +int container_task_count(const struct container *cont) {
> > +        int count = 0;
> > +        struct task_struct *g, *p;
> > +        struct container_subsys_state *css;
> > +        int subsys_id;
> > +        get_first_subsys(cont, &css, &subsys_id);
> > +
> > +        read_lock(&tasklist_lock);
>
> Can be replaced with rcu_read_lock() and rcu_read_unlock()

Are you sure about that? I see many users of
do_each_thread()/while_each_thread() taking a lock on tasklist_lock,
and only one (fs/binfmt_elf.c) that's clearly relying on an RCU
critical section. Documentation?

>
> Any chance we could get a per-container task list? It will
> help subsystem writers as well.

It would be possible, yes - but we probably wouldn't want the overhead
(additional ref counts and list manipulations on every fork/exit) of
it on by default. We could make it a config option that particular
subsystems could select.

I guess the question is how useful is this really, compared to just
doing a do_each_thread() and seeing which tasks are in the container?
Certainly that's a non-trivial operation, but in what circumstances is
it really necessary to do it?

Paul

2007-05-02 03:18:13

by Srivatsa Vaddagiri

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

On Tue, May 01, 2007 at 01:37:24PM -0700, Paul Menage wrote:
> > Any chance we could get a per-container task list? It will
> > help subsystem writers as well.
>
> It would be possible, yes - but we probably wouldn't want the overhead
> (additional ref counts and list manipulations on every fork/exit) of
> it on by default. We could make it a config option that particular
> subsystems could select.
>
> I guess the question is how useful is this really, compared to just
> doing a do_each_thread() and seeing which tasks are in the container?
> Certainly that's a non-trivial operation, but in what circumstances is
> it really necessary to do it?

For the CPU controller I was working on, (fast access to) such a list would
have been valuable. Basically each task has a weight associated with it
(p->load_weight) which is made to depend upon its class limit. Whenever
the class limit changes, we need to go and change all the member tasks'
->load_weight values.

If you don't maintain the per-container task list, I guess I could still
work around it, by either:

  - Walk the task table and find relevant members (a rough sketch of
    this follows below), OR better perhaps
  - Move p->load_weight to a class structure
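A rough sketch of the first option (recalc_load_weight() is a
hypothetical helper, and get_first_subsys() from the patchset is used
here just to pick a subsystem id - the CPU controller would use its own):

        static void update_class_load_weights(struct container *cont)
        {
                struct task_struct *g, *p;
                struct container_subsys_state *css;
                int subsys_id;

                get_first_subsys(cont, &css, &subsys_id);

                read_lock(&tasklist_lock);
                do_each_thread(g, p) {
                        if (task_subsys_state(p, subsys_id) == css)
                                p->load_weight = recalc_load_weight(p);
                } while_each_thread(g, p);
                read_unlock(&tasklist_lock);
        }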

--
Regards,
vatsa

2007-05-02 03:25:56

by Paul Menage

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

On 5/1/07, Srivatsa Vaddagiri <[email protected]> wrote:
>
> For the CPU controller I was working on, (fast access to) such a list would
> have been valuable. Basically each task has a weight associated with it
> (p->load_weight) which is made to depend upon its class limit. Whenever
> the class limit changes, we need to go and change all the member tasks'
> ->load_weight values.
>
> If you don't maintain the per-container task list, I guess I could still
> work around it, by either:
>
> - Walk the task table and find relevant members

That doesn't seem like a terrible solution to me, unless you expect
the class limit to be changing incredibly frequently.

If we had multiple subsystems that needed to walk the container member
list on a fast-path operation (e.g. to make a scheduling decision)
that would be a good reason to maintain such a list.

> perhaps
> - Move p->load_weight to a class structure

Sounds like a good idea if you can do it - but if it's per-process,
how would it fit in the class structure?

Paul

2007-05-02 03:38:32

by Srivatsa Vaddagiri

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

On Tue, May 01, 2007 at 08:25:35PM -0700, Paul Menage wrote:
> > - Walk the task table and find relevant members
>
> That doesn't seem like a terrible solution to me, unless you expect
> the class limit to be changing incredibly frequently.

Yeah, I agree. Group limit(s) should not be changing so frequently.

> > perhaps
> > - Move p->load_weight to a class structure
>
> Sounds like a good idea if you can do it - but if it's per-process,
> how would it fit in the class structure?

p->load_weight essentially depends on two things:

  - the nice value or static priority (which is per process, already
    present in task_struct)
  - the class limit (which is per class)

So in theory we can eliminate the load_weight field in task_struct and
compute it at runtime from the above two fields, although I guess it
will be slightly inefficient to compute the value every time a task is
added to the runqueue. If that is not desirable, then we can stick with
option 1 (walk the task list and change the member tasks' ->load_weight
upon a class limit change).
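For illustration, the runtime computation might look roughly like this
(task_class() and its ->limit field are hypothetical, and the
nice-to-weight mapping stands in for whatever the scheduler already
uses):

        static inline unsigned int task_load_weight(struct task_struct *p)
        {
                /* Per-process input: the static priority / nice value. */
                unsigned int w = PRIO_TO_LOAD_WEIGHT(p->static_prio);

                /* Per-class input: scale by the class limit (a
                 * percentage in this sketch). */
                return w * task_class(p)->limit / 100;
        }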

--
Regards,
vatsa

2007-05-02 03:58:26

by Balbir Singh

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

Paul Menage wrote:
> On 5/1/07, Balbir Singh <[email protected]> wrote:
>> > +        if (container_is_removed(cont)) {
>> > +                retval = -ENODEV;
>> > +                goto out2;
>> > +        }
>>
>> Can't we make this check prior to kmalloc() and copy_from_user()?
>
> We could but I'm not sure what it would buy us - we'd be optimizing
> for the case that essentially never occurs.
>

I am not sure about the "never occurs" part, because we check
for the condition, so it could occur. I agree it is a premature
optimization and could wait a little longer before going in.

>>
>>
>>
>> > +int container_task_count(const struct container *cont) {
>> > +        int count = 0;
>> > +        struct task_struct *g, *p;
>> > +        struct container_subsys_state *css;
>> > +        int subsys_id;
>> > +        get_first_subsys(cont, &css, &subsys_id);
>> > +
>> > +        read_lock(&tasklist_lock);
>>
>> Can be replaced with rcu_read_lock() and rcu_read_unlock()
>
> Are you sure about that? I see many users of
> do_each_thread()/while_each_thread() taking a lock on tasklist_lock,
> and only one (fs/binfmt_elf.c) that's clearly relying on an RCU
> critical section. Documentation?
>

I suspect they are all conversions still waiting to be made;
Eric is the expert on this. Meanwhile, here are a couple of
pointers. Quoting from the second URL:

"We don't need the tasklist_lock to safely iterate through processes
anymore."

http://www.linuxjournal.com/article/6993 (please see incremental use
of RCU) and
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17/2.6.17-mm2/broken-out/proc-remove-tasklist_lock-from-proc_pid_readdir.patch

>>
>> Any chance we could get a per-container task list? It will
>> help subsystem writers as well.
>
> It would be possible, yes - but we probably wouldn't want the overhead
> (additional ref counts and list manipulations on every fork/exit) of
> it on by default. We could make it a config option that particular
> subsystems could select.
>
> I guess the question is how useful is this really, compared to just
> doing a do_each_thread() and seeing which tasks are in the container?
> Certainly that's a non-trivial operation, but in what circumstances is
> it really necessary to do it?
>
> Paul


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-05-08 14:51:22

by Balbir Singh

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

Paul Menage wrote:
> On 5/1/07, Srivatsa Vaddagiri <[email protected]> wrote:
>> For the CPU controller I was working on, (fast access to) such a list would
>> have been valuable. Basically each task has a weight associated with it
>> (p->load_weight) which is made to depend upon its class limit. Whenever
>> the class limit changes, we need to go and change all the member tasks'
>> ->load_weight values.
>>
>> If you don't maintain the per-container task list, I guess I could still
>> work around it, by either:
>>
>> - Walk the task table and find relevant members
>
> That doesn't seem like a terrible solution to me, unless you expect
> the class limit to be changing incredibly frequently.
>
> If we had multiple subsystems that needed to walk the container member
> list on a fast-path operation (e.g. to make a scheduling decision)
> that would be a good reason to maintain such a list.
>

I now have a use case for maintaining a per-container task list.
I am trying to build per-container stats similar to taskstats.
I intend to support container accounting of:

1. Tasks running
2. Tasks stopped
3. Tasks un-interruptible
4. Tasks blocked on IO
5. Tasks sleeping

This would provide statistics similar to the patch that Pavel had sent out.
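As a sketch of what the accounting might compute (the state
classification is simplified, and counting tasks blocked on IO would
need delay-accounting hooks, so it is omitted here):

        struct containerstats {
                u64 nr_running;
                u64 nr_stopped;
                u64 nr_uninterruptible;
                u64 nr_sleeping;
        };

        /* Walk every thread and bucket the ones in this container. */
        static void container_fill_stats(struct container *cont,
                                         struct containerstats *stats)
        {
                struct task_struct *g, *p;
                struct container_subsys_state *css;
                int subsys_id;

                get_first_subsys(cont, &css, &subsys_id);
                memset(stats, 0, sizeof(*stats));

                read_lock(&tasklist_lock);
                do_each_thread(g, p) {
                        if (task_subsys_state(p, subsys_id) != css)
                                continue;
                        switch (p->state) {
                        case TASK_RUNNING:
                                stats->nr_running++;
                                break;
                        case TASK_INTERRUPTIBLE:
                                stats->nr_sleeping++;
                                break;
                        case TASK_UNINTERRUPTIBLE:
                                stats->nr_uninterruptible++;
                                break;
                        case TASK_STOPPED:
                                stats->nr_stopped++;
                                break;
                        }
                } while_each_thread(g, p);
                read_unlock(&tasklist_lock);
        }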

I faced the following problems while trying to implement this feature

1. There is no easy way to get a list of all tasks belonging to a container
(we need to walk all threads)
2. There is no concept of a container identifier. When a user issues a command
to extract statistics, the only unique container identifier is the container
path, which means that we need to do a path lookup to determine the dentry
for the container (which gets quite ugly with all the string manipulation)

Adding a container id will make it easier to find a container and return
statistics belonging to the container.

If we get these two features added to the current patches, it will make the
infrastructure more developer friendly and eventually more user friendly :-)


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2007-05-10 21:22:08

by Paul Menage

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

On 5/8/07, Balbir Singh <[email protected]> wrote:
>
> I now have a use case for maintaining a per-container task list.
> I am trying to build a per-container stats similar to taskstats.
> I intend to support container accounting of
>
> 1. Tasks running
> 2. Tasks stopped
> 3. Tasks un-interruptible
> 4. Tasks blocked on IO
> 5. Tasks sleeping
>
> This would provide statistics similar to the patch that Pavel had sent out.
>
> I faced the following problems while trying to implement this feature
>
> 1. There is no easy way to get a list of all tasks belonging to a container
> (we need to walk all threads)

Well, walking the task list is pretty easy - but yes, it could become
inefficient when there are many small containers in use.

I've got some ideas for a way of tracking this specifically for
containers with subsystems that want this, while avoiding the overhead
for subsystems that don't really need it. I'll try to add them to the
next patchset.

> 2. There is no concept of a container identifier. When a user issues a command
> to extract statistics, the only unique container identifier is the container
> path, which means that we need to do a path lookup to determine the dentry
> for the container (which gets quite ugly with all the string manipulation)

We could just cache the container path permanently in the container,
and invalidate it if any of its parents gets renamed. (I imagine this
happens almost never.)
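Something along these lines, say (cached_path, the child list walk and
the rename hook are all hypothetical here):

        /* Cache the full path when the container is created ... */
        cont->cached_path = kstrdup(path, GFP_KERNEL);

        /* ... and invalidate the whole subtree on a rename: */
        static void container_invalidate_paths(struct container *cont)
        {
                struct container *child;

                kfree(cont->cached_path);
                cont->cached_path = NULL;  /* recomputed on next use */

                list_for_each_entry(child, &cont->children, sibling)
                        container_invalidate_paths(child);
        }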

>
> Adding a container id, will make it easier to find a container and return
> statistics belonging to the container.

Not unreasonable, but there are a few questions that would have to be answered:

- how is the container id picked? Like a pid, or user-defined? Or some
kind of string?

- how would it be exposed to userspace? A generic control file
provided by the container filesystem in all container directories?

- can you give a more concrete example of how this would actually be
useful? For your container stats, it seems that just reading a control
file in the container's directory would give you the stats that you
want, and userspace already knows the container's name/id since it
opened the control file.

Paul

2007-05-11 02:31:51

by Balbir Singh

Subject: Re: [ckrm-tech] [PATCH 3/9] Containers (V9): Add tasks file interface

Paul Menage wrote:
> On 5/8/07, Balbir Singh <[email protected]> wrote:
>>
>> I now have a use case for maintaining a per-container task list.
>> I am trying to build a per-container stats similar to taskstats.
>> I intend to support container accounting of
>>
>> 1. Tasks running
>> 2. Tasks stopped
>> 3. Tasks un-interruptible
>> 4. Tasks blocked on IO
>> 5. Tasks sleeping
>>
>> This would provide statistics similar to the patch that Pavel had sent
>> out.
>>
>> I faced the following problems while trying to implement this feature
>>
>> 1. There is no easy way to get a list of all tasks belonging to a
>> container
>> (we need to walk all threads)
>
> Well, walking the task list is pretty easy - but yes, it could become
> inefficient when there are many small containers in use.
>
> I've got some ideas for a way of tracking this specifically for
> containers with subsystems that want this, while avoiding the overhead
> for subsystems that don't really need it. I'll try to add them to the
> next patchset.

Super!

>
>> 2. There is no concept of a container identifier. When a user issues a
>> command
>> to extract statistics, the only unique container identifier is the
>> container
>> path, which means that we need to do a path lookup to determine the
>> dentry
>> for the container (which gets quite ugly with all the string
>> manipulation)
>
> We could just cache the container path permanently in the container,
> and invalidate it if any of its parents gets renamed. (I imagine this
> happens almost never.)
>

Here's what I have so far: I cache the mount point of the container
and add the container path to it. I'm now stuck examining tasks -
while walking through a bunch of tasks, there is no easy way of
knowing the container path of a task without walking all subsystems
and then extracting the container's absolute path.

>>
>> Adding a container id, will make it easier to find a container and
>> return
>> statistics belonging to the container.
>
> Not unreasonable, but there are a few questions that would have to be
> answered:
>
> - how is the container id picked? Like a pid, or user-defined? Or some
> kind of string?
>

I was planning on using a hierarchical scheme: the top 8 bits for
the container hierarchy and the bottom 24 for a unique id. The id
is selected automatically. Once we know the container id, we'll
need an efficient mechanism to map the id back to the container.
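In code, the encoding itself would be trivial - something like this
sketch (all names made up):

        #define CONTAINER_ID_BITS        24
        #define CONTAINER_ID_MASK        ((1U << CONTAINER_ID_BITS) - 1)

        static inline u32 make_container_id(unsigned int hier,
                                            unsigned int id)
        {
                return (hier << CONTAINER_ID_BITS) |
                       (id & CONTAINER_ID_MASK);
        }

        static inline unsigned int container_id_hier(u32 cid)
        {
                return cid >> CONTAINER_ID_BITS;
        }

        static inline unsigned int container_id_index(u32 cid)
        {
                return cid & CONTAINER_ID_MASK;
        }

The harder part is the reverse map from id to container, e.g. an idr
tree keyed on the low 24 bits.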

> - how would it be exposed to userspace? A generic control file
> provided by the container filesystem in all container directories?
>

A file in all container directories is an option.

> - can you give a more concrete example of how this would actually be
> useful? For your container stats, it seems that just reading a control
> file in the container's directory would give you the stats that you
> want, and userspace already knows the container's name/id since it
> opened the control file.
>

Sure, the plan is to build a containerstats interface like taskstats.
In taskstats, we exchange data between user space and kernel space
using genetlink sockets. We have a push and pull mechanism for statistics.


> Paul


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL