2009-07-02 23:26:30

by Paul Menage

Subject: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

The following series (written by Ben Blum) adds a "cgroup.procs" file
to each cgroup that reports unique tgids rather than pids, and fixes a
pid namespace bug in the existing "tasks" file that could cause
readers in different namespaces to interfere with one another.

The patches as a pair provide similar functionality to Li Zefan's
patch posted yesterday, with the addition of the "cgroup.procs" file.
If it's decided that Li's patch needs to be fast-tracked to 2.6.31,
then these patches can be rebased as a small extension of Li's patch;
if Li's patch doesn't need to go to 2.6.31, it makes more sense to
take this pair since they provide more overall functionality.

---

Ben Blum (2):
Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
Ensures correct concurrent opening/reading of pidlists across pid namespaces


include/linux/cgroup.h | 46 +++++-
kernel/cgroup.c | 355 +++++++++++++++++++++++++++++++++---------------
2 files changed, 285 insertions(+), 116 deletions(-)


2009-07-02 23:26:43

by Paul Menage

Subject: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the pidlist
for the tasks file. Those are now separated into a new struct cgroup_pidlist,
of which each cgroup holds two: one for procs and one for tasks. The way the
seq_file operations are set up is changed so that just the pidlist struct gets
passed around as the private data.

Possible future functionality is making the procs file writable for purposes
of adding all threads with the same tgid at once.

Signed-off-by: Ben Blum <[email protected]>
Reviewed-by: Paul Menage <[email protected]>
Signed-off-by: Paul Menage <[email protected]>

---

include/linux/cgroup.h | 23 +++-
kernel/cgroup.c | 287 ++++++++++++++++++++++++++++++------------------
2 files changed, 195 insertions(+), 115 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 665fa70..3b937c3 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,6 +141,18 @@ enum {
CGRP_WAIT_ON_RMDIR,
};

+
+struct cgroup_pidlist {
+ /* protects the other fields */
+ struct rw_semaphore mutex;
+ /* array of xids */
+ pid_t *list;
+ /* how many elements the above list has */
+ int length;
+ /* how many files are using the current array */
+ int use_count;
+};
+
struct cgroup {
unsigned long flags; /* "unsigned long" so bitops work */

@@ -179,14 +191,9 @@ struct cgroup {
*/
struct list_head release_list;

- /* pids_mutex protects the fields below */
- struct rw_semaphore pids_mutex;
- /* Array of process ids in the cgroup */
- pid_t *tasks_pids;
- /* How many files are using the current tasks_pids array */
- int pids_use_count;
- /* Length of the current tasks_pids array */
- int pids_length;
+ /* we will have two separate pidlists, one for pids (the tasks file)
+ * and one for tgids (the procs file). */
+ struct cgroup_pidlist tasks, procs;

/* For RCU-protected deletion */
struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ea255fe..3cf0faa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -961,7 +961,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->children);
INIT_LIST_HEAD(&cgrp->css_sets);
INIT_LIST_HEAD(&cgrp->release_list);
- init_rwsem(&cgrp->pids_mutex);
+ init_rwsem(&(cgrp->tasks.mutex));
+ init_rwsem(&(cgrp->procs.mutex));
}
static void init_cgroup_root(struct cgroupfs_root *root)
{
@@ -1409,15 +1410,6 @@ static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
return ret;
}

-/* The various types of files and directories in a cgroup file system */
-enum cgroup_filetype {
- FILE_ROOT,
- FILE_DIR,
- FILE_TASKLIST,
- FILE_NOTIFY_ON_RELEASE,
- FILE_RELEASE_AGENT,
-};
-
/**
* cgroup_lock_live_group - take cgroup_mutex and check that cgrp is alive.
* @cgrp: the cgroup to be checked for liveness
@@ -2115,7 +2107,7 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
}

/*
- * Stuff for reading the 'tasks' file.
+ * Stuff for reading the 'tasks'/'procs' files.
*
* Reading this file can return large amounts of data if a cgroup has
* *lots* of attached tasks. So it may need several calls to read(),
@@ -2124,30 +2116,115 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
*
*/

-/*
- * Load into 'pidarray' up to 'npids' of the tasks using cgroup
- * 'cgrp'. Return actual number of pids loaded. No need to
- * task_lock(p) when reading out p->cgroup, since we're in an RCU
- * read section, so the css_set can't go away, and is
- * immutable after creation.
+/**
+ * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
+ * returns -ENOMEM if the malloc fails; 0 on success
+ */
+/* this implementation relies on the fact that pids will never be -1 */
+#define PIDLIST_VALUE_NONE ((pid_t)(-1))
+static int pidlist_uniq(pid_t **p, int *length)
+{
+ int i, j = 0; /* array index */
+ int last = 0; /* index of last unique entry */
+ int count = 1; /* count of total unique entries */
+ pid_t *list, *newlist;
+ BUG_ON(p == NULL || *p == NULL || length == NULL);
+ list = *p;
+ /*
+ * we presume the 0th element is unique, so i starts at 1. trivial
+ * edge cases first; no work needs to be done for either
+ */
+ if (*length == 0 || *length == 1)
+ return 0;
+ for (i = 1; i < *length; i++) {
+ BUG_ON(list[i] == PIDLIST_VALUE_NONE);
+ if (list[i] == list[last]) {
+ list[i] = PIDLIST_VALUE_NONE;
+ } else {
+ last = i;
+ count++;
+ }
+ }
+ newlist = kmalloc(count * sizeof(pid_t), GFP_KERNEL);
+ if (!newlist)
+ return -ENOMEM;
+ /* copy to new array */
+ for (i = 0; i < *length; i++) {
+ if (list[i] != PIDLIST_VALUE_NONE) {
+ newlist[j] = list[i];
+ j++;
+ }
+ }
+ BUG_ON(j != count); /* this would fail on a zero-length array */
+ kfree(list);
+ *p = newlist;
+ *length = count;
+ return 0;
+}
+
+static int cmppid(const void *a, const void *b)
+{
+ return *(pid_t *)a - *(pid_t *)b;
+}
+
+/**
+ * Load a cgroup's pidarray with either procs' tgids or tasks' pids
*/
-static int pid_array_load(pid_t *pidarray, int npids, struct cgroup *cgrp)
+static int pidlist_array_load(struct cgroup *cgrp, bool procs)
{
- int n = 0, pid;
+ pid_t *array;
+ int length;
+ int retval;
+ int pid, n = 0; /* used for populating the array */
struct cgroup_iter it;
struct task_struct *tsk;
+ struct cgroup_pidlist *l;
+
+ /*
+ * If cgroup gets more users after we read count, we won't have
+ * enough space - tough. This race is indistinguishable to the
+ * caller from the case that the additional cgroup users didn't
+ * show up until sometime later on.
+ */
+ length = cgroup_task_count(cgrp);
+ array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+ if (!array)
+ return -ENOMEM;
+ /* now, populate the array */
cgroup_iter_start(cgrp, &it);
while ((tsk = cgroup_iter_next(cgrp, &it))) {
- if (unlikely(n == npids))
+ if (unlikely(n == length))
break;
- pid = task_pid_vnr(tsk);
- if (pid > 0)
- pidarray[n++] = pid;
+ /* get tgid or pid for procs or tasks file respectively */
+ pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+ if (pid > 0) /* make sure to only use valid results */
+ array[n++] = pid;
}
cgroup_iter_end(cgrp, &it);
- return n;
+ length = n;
+ /* now sort & (if procs) strip out duplicates */
+ sort(array, length, sizeof(pid_t), cmppid, NULL);
+ if (procs) {
+ retval = pidlist_uniq(&array, &length);
+ if (retval) { /* the malloc inside uniq can fail */
+ kfree(array);
+ return retval;
+ }
+ l = &(cgrp->procs);
+ } else {
+ l = &(cgrp->tasks);
+ }
+ /* store array in cgroup, freeing old if necessary */
+ down_write(&l->mutex);
+ kfree(l->list);
+ l->list = array;
+ l->length = length;
+ l->use_count++;
+ up_write(&l->mutex);
+ return 0;
}

+
/**
* cgroupstats_build - build and fill cgroupstats
* @stats: cgroupstats to fill information into
@@ -2202,19 +2279,14 @@ err:
return ret;
}

-static int cmppid(const void *a, const void *b)
-{
- return *(pid_t *)a - *(pid_t *)b;
-}
-

/*
- * seq_file methods for the "tasks" file. The seq_file position is the
+ * seq_file methods for the tasks/procs files. The seq_file position is the
* next pid to display; the seq_file iterator is a pointer to the pid
- * in the cgroup->tasks_pids array.
+ * in the cgroup->l->list array.
*/

-static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
+static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
{
/*
* Initially we receive a position value that corresponds to
@@ -2222,46 +2294,45 @@ static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
* after a seek to the start). Use a binary-search to find the
* next pid to display, if any
*/
- struct cgroup *cgrp = s->private;
+ struct cgroup_pidlist *l = s->private;
int index = 0, pid = *pos;
int *iter;

- down_read(&cgrp->pids_mutex);
+ down_read(&l->mutex);
if (pid) {
- int end = cgrp->pids_length;
+ int end = l->length;

while (index < end) {
int mid = (index + end) / 2;
- if (cgrp->tasks_pids[mid] == pid) {
+ if (l->list[mid] == pid) {
index = mid;
break;
- } else if (cgrp->tasks_pids[mid] <= pid)
+ } else if (l->list[mid] <= pid)
index = mid + 1;
else
end = mid;
}
}
/* If we're off the end of the array, we're done */
- if (index >= cgrp->pids_length)
+ if (index >= l->length)
return NULL;
/* Update the abstract position to be the actual pid that we found */
- iter = cgrp->tasks_pids + index;
+ iter = l->list + index;
*pos = *iter;
return iter;
}

-static void cgroup_tasks_stop(struct seq_file *s, void *v)
+static void cgroup_pidlist_stop(struct seq_file *s, void *v)
{
- struct cgroup *cgrp = s->private;
- up_read(&cgrp->pids_mutex);
+ struct cgroup_pidlist *l = s->private;
+ up_read(&l->mutex);
}

-static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
+static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
{
- struct cgroup *cgrp = s->private;
- int *p = v;
- int *end = cgrp->tasks_pids + cgrp->pids_length;
-
+ struct cgroup_pidlist *l = s->private;
+ pid_t *p = v;
+ pid_t *end = l->list + l->length;
/*
* Advance to the next pid in the array. If this goes off the
* end, we're done
@@ -2275,98 +2346,95 @@ static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
}
}

-static int cgroup_tasks_show(struct seq_file *s, void *v)
+static int cgroup_pidlist_show(struct seq_file *s, void *v)
{
return seq_printf(s, "%d\n", *(int *)v);
}

-static struct seq_operations cgroup_tasks_seq_operations = {
- .start = cgroup_tasks_start,
- .stop = cgroup_tasks_stop,
- .next = cgroup_tasks_next,
- .show = cgroup_tasks_show,
+/*
+ * seq_operations functions for iterating on pidlists through seq_file -
+ * independent of whether it's tasks or procs
+ */
+static const struct seq_operations cgroup_pidlist_seq_operations = {
+ .start = cgroup_pidlist_start,
+ .stop = cgroup_pidlist_stop,
+ .next = cgroup_pidlist_next,
+ .show = cgroup_pidlist_show,
};

-static void release_cgroup_pid_array(struct cgroup *cgrp)
+static void cgroup_release_pid_array(struct cgroup_pidlist *l)
{
- down_write(&cgrp->pids_mutex);
- BUG_ON(!cgrp->pids_use_count);
- if (!--cgrp->pids_use_count) {
- kfree(cgrp->tasks_pids);
- cgrp->tasks_pids = NULL;
- cgrp->pids_length = 0;
+ down_write(&l->mutex);
+ BUG_ON(!l->use_count);
+ if (!--l->use_count) {
+ kfree(l->list);
+ l->list = NULL;
+ l->length = 0;
}
- up_write(&cgrp->pids_mutex);
+ up_write(&l->mutex);
}

-static int cgroup_tasks_release(struct inode *inode, struct file *file)
+static int cgroup_pidlist_release(struct inode *inode, struct file *file)
{
- struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-
+ struct cgroup_pidlist *l;
if (!(file->f_mode & FMODE_READ))
return 0;
-
- release_cgroup_pid_array(cgrp);
+ /*
+ * the seq_file will only be initialized if the file was opened for
+ * reading; hence we check if it's not null only in that case.
+ */
+ BUG_ON(!file->private_data);
+ l = ((struct seq_file *)file->private_data)->private;
+ cgroup_release_pid_array(l);
return seq_release(inode, file);
}

-static struct file_operations cgroup_tasks_operations = {
+static const struct file_operations cgroup_pidlist_operations = {
.read = seq_read,
.llseek = seq_lseek,
.write = cgroup_file_write,
- .release = cgroup_tasks_release,
+ .release = cgroup_pidlist_release,
};

/*
- * Handle an open on 'tasks' file. Prepare an array containing the
- * process id's of tasks currently attached to the cgroup being opened.
+ * The following functions handle opens on a file that displays a pidlist
+ * (tasks or procs). Prepare an array of the process/thread IDs of whoever's
+ * in the cgroup.
*/
-
-static int cgroup_tasks_open(struct inode *unused, struct file *file)
+/* helper function for the two below it */
+static int cgroup_pidlist_open(struct file *file, bool procs)
{
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
- pid_t *pidarray;
- int npids;
+ struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
int retval;

/* Nothing to do for write-only files */
if (!(file->f_mode & FMODE_READ))
return 0;

- /*
- * If cgroup gets more users after we read count, we won't have
- * enough space - tough. This race is indistinguishable to the
- * caller from the case that the additional cgroup users didn't
- * show up until sometime later on.
- */
- npids = cgroup_task_count(cgrp);
- pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
- if (!pidarray)
- return -ENOMEM;
- npids = pid_array_load(pidarray, npids, cgrp);
- sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
-
- /*
- * Store the array in the cgroup, freeing the old
- * array if necessary
- */
- down_write(&cgrp->pids_mutex);
- kfree(cgrp->tasks_pids);
- cgrp->tasks_pids = pidarray;
- cgrp->pids_length = npids;
- cgrp->pids_use_count++;
- up_write(&cgrp->pids_mutex);
-
- file->f_op = &cgroup_tasks_operations;
+ /* have the array populated */
+ retval = pidlist_array_load(cgrp, procs);
+ if (retval)
+ return retval;
+ /* configure file information */
+ file->f_op = &cgroup_pidlist_operations;

- retval = seq_open(file, &cgroup_tasks_seq_operations);
+ retval = seq_open(file, &cgroup_pidlist_seq_operations);
if (retval) {
- release_cgroup_pid_array(cgrp);
+ cgroup_release_pid_array(l);
return retval;
}
- ((struct seq_file *)file->private_data)->private = cgrp;
+ ((struct seq_file *)file->private_data)->private = l;
return 0;
}
+static int cgroup_tasks_open(struct inode *unused, struct file *file)
+{
+ return cgroup_pidlist_open(file, false);
+}
+static int cgroup_procs_open(struct inode *unused, struct file *file)
+{
+ return cgroup_pidlist_open(file, true);
+}

static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,
struct cftype *cft)
@@ -2389,21 +2457,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
/*
* for the common functions, 'private' gives the type of file
*/
+/* for hysterical reasons, we can't put this on the older files */
+#define CGROUP_FILE_GENERIC_PREFIX "cgroup."
static struct cftype files[] = {
{
.name = "tasks",
.open = cgroup_tasks_open,
.write_u64 = cgroup_tasks_write,
- .release = cgroup_tasks_release,
- .private = FILE_TASKLIST,
+ .release = cgroup_pidlist_release,
.mode = S_IRUGO | S_IWUSR,
},
-
+ {
+ .name = CGROUP_FILE_GENERIC_PREFIX "procs",
+ .open = cgroup_procs_open,
+ /* .write_u64 = cgroup_procs_write, TODO */
+ .release = cgroup_pidlist_release,
+ .mode = S_IRUGO,
+ },
{
.name = "notify_on_release",
.read_u64 = cgroup_read_notify_on_release,
.write_u64 = cgroup_write_notify_on_release,
- .private = FILE_NOTIFY_ON_RELEASE,
},
};

@@ -2412,7 +2486,6 @@ static struct cftype cft_release_agent = {
.read_seq_string = cgroup_release_agent_show,
.write_string = cgroup_release_agent_write,
.max_write_len = PATH_MAX,
- .private = FILE_RELEASE_AGENT,
};

static int cgroup_populate_dir(struct cgroup *cgrp)

2009-07-02 23:26:55

by Paul Menage

Subject: [PATCH 2/2] Ensures correct concurrent opening/reading of pidlists across pid namespaces

Ensures correct concurrent opening/reading of pidlists across pid namespaces

Previously there was a problem in which two processes from different pid
namespaces reading the tasks or procs file could see results from each
other's namespace. Rather than one pidlist for each file in a cgroup, we now
keep a list of pidlists keyed by namespace and file type (tasks versus procs)
in which entries are placed on demand. Each pidlist has its own lock, and
because the pidlists themselves are passed around in the seq_file's private
pointer, we don't have to touch the cgroup or its master list except when
creating and destroying entries.

Signed-off-by: Ben Blum <[email protected]>
Reviewed-by: Paul Menage <[email protected]>
Signed-off-by: Paul Menage <[email protected]>

---

include/linux/cgroup.h | 33 +++++++++++++---
kernel/cgroup.c | 102 ++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3b937c3..b934b72 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,16 +141,36 @@ enum {
CGRP_WAIT_ON_RMDIR,
};

+/* which pidlist file are we talking about? */
+enum cgroup_filetype {
+ CGROUP_FILE_PROCS,
+ CGROUP_FILE_TASKS,
+};

+/*
+ * A pidlist is a list of pids that virtually represents the contents of one
+ * of the cgroup files ("procs" or "tasks"). We keep a list of such pidlists,
+ * a pair (one each for procs, tasks) for each pid namespace that's relevant
+ * to the cgroup.
+ */
struct cgroup_pidlist {
- /* protects the other fields */
- struct rw_semaphore mutex;
+ /*
+ * used to find which pidlist is wanted. doesn't change as long as
+ * this particular list stays in the list.
+ */
+ struct { enum cgroup_filetype type; struct pid_namespace *ns; } key;
/* array of xids */
pid_t *list;
/* how many elements the above list has */
int length;
/* how many files are using the current array */
int use_count;
+ /* each of these stored in a list by its cgroup */
+ struct list_head links;
+ /* pointer to the cgroup we belong to, for list removal purposes */
+ struct cgroup *owner;
+ /* protects the other fields */
+ struct rw_semaphore mutex;
};

struct cgroup {
@@ -191,9 +211,12 @@ struct cgroup {
*/
struct list_head release_list;

- /* we will have two separate pidlists, one for pids (the tasks file)
- * and one for tgids (the procs file). */
- struct cgroup_pidlist tasks, procs;
+ /*
+ * list of pidlists, up to two for each namespace (one for procs, one
+ * for tasks); created on demand.
+ */
+ struct list_head pidlists;
+ struct mutex pidlist_mutex;

/* For RCU-protected deletion */
struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3cf0faa..9b739ff 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -47,6 +47,7 @@
#include <linux/hash.h>
#include <linux/namei.h>
#include <linux/smp_lock.h>
+#include <linux/pid_namespace.h>

#include <asm/atomic.h>

@@ -641,7 +642,11 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp)
static void free_cgroup_rcu(struct rcu_head *obj)
{
struct cgroup *cgrp = container_of(obj, struct cgroup, rcu_head);
-
+ /*
+ * if we're getting rid of the cgroup, refcount must have ensured
+ * that anybody using a pidlist will have cleaned it up by now.
+ */
+ BUG_ON(!list_empty(&cgrp->pidlists));
kfree(cgrp);
}

@@ -961,8 +966,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->children);
INIT_LIST_HEAD(&cgrp->css_sets);
INIT_LIST_HEAD(&cgrp->release_list);
- init_rwsem(&(cgrp->tasks.mutex));
- init_rwsem(&(cgrp->procs.mutex));
+ INIT_LIST_HEAD(&cgrp->pidlists);
+ mutex_init(&cgrp->pidlist_mutex);
}
static void init_cgroup_root(struct cgroupfs_root *root)
{
@@ -2168,9 +2173,52 @@ static int cmppid(const void *a, const void *b)
}

/**
+ * find the appropriate pidlist for our purpose (given procs vs tasks)
+ * returns with the lock on that pidlist already held, and takes care
+ * of the use count, or returns NULL with no locks held if we're out of
+ * memory.
+ */
+static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
+ enum cgroup_filetype type)
+{
+ struct cgroup_pidlist *l;
+ /* don't need task_nsproxy() if we're looking at ourself */
+ struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
+ mutex_lock(&cgrp->pidlist_mutex);
+ list_for_each_entry(l, &cgrp->pidlists, links) {
+ if (l->key.type == type && l->key.ns == ns) {
+ /* found a matching list - drop the extra refcount */
+ put_pid_ns(ns);
+ /* make sure l doesn't vanish out from under us */
+ down_write(&l->mutex);
+ mutex_unlock(&cgrp->pidlist_mutex);
+ l->use_count++;
+ return l;
+ }
+ }
+ /* entry not found; create a new one */
+ l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
+ if (!l) {
+ mutex_unlock(&cgrp->pidlist_mutex);
+ return l;
+ }
+ init_rwsem(&l->mutex);
+ down_write(&l->mutex);
+ l->key.type = type;
+ l->key.ns = ns;
+ l->use_count = 0; /* don't increment here */
+ l->list = NULL;
+ l->owner = cgrp;
+ list_add(&l->links, &cgrp->pidlists);
+ mutex_unlock(&cgrp->pidlist_mutex);
+ return l;
+}
+
+/**
* Load a cgroup's pidarray with either procs' tgids or tasks' pids
*/
-static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
+ struct cgroup_pidlist **lp)
{
pid_t *array;
int length;
@@ -2196,7 +2244,10 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
if (unlikely(n == length))
break;
/* get tgid or pid for procs or tasks file respectively */
- pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+ if (type == CGROUP_FILE_PROCS)
+ pid = task_tgid_vnr(tsk);
+ else
+ pid = task_pid_vnr(tsk);
if (pid > 0) /* make sure to only use valid results */
array[n++] = pid;
}
@@ -2204,23 +2255,25 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
length = n;
/* now sort & (if procs) strip out duplicates */
sort(array, length, sizeof(pid_t), cmppid, NULL);
- if (procs) {
+ if (type == CGROUP_FILE_PROCS) {
retval = pidlist_uniq(&array, &length);
if (retval) { /* the malloc inside uniq can fail */
kfree(array);
return retval;
}
- l = &(cgrp->procs);
- } else {
- l = &(cgrp->tasks);
}
- /* store array in cgroup, freeing old if necessary */
- down_write(&l->mutex);
+ l = cgroup_pidlist_find(cgrp, type);
+ if (!l) {
+ kfree(array);
+ return -ENOMEM;
+ }
+ /* store array, freeing old if necessary - lock already held */
kfree(l->list);
l->list = array;
l->length = length;
l->use_count++;
up_write(&l->mutex);
+ *lp = l;
return 0;
}

@@ -2364,13 +2417,26 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {

static void cgroup_release_pid_array(struct cgroup_pidlist *l)
{
+ /*
+ * the case where we're the last user of this particular pidlist will
+ * have us remove it from the cgroup's list, which entails taking the
+ * mutex. since in pidlist_find the pidlist->lock depends on cgroup->
+ * pidlist_mutex, we have to take pidlist_mutex first.
+ */
+ mutex_lock(&l->owner->pidlist_mutex);
down_write(&l->mutex);
BUG_ON(!l->use_count);
if (!--l->use_count) {
+ /* we're the last user if refcount is 0; remove and free */
+ list_del(&l->links);
+ mutex_unlock(&l->owner->pidlist_mutex);
kfree(l->list);
- l->list = NULL;
- l->length = 0;
+ put_pid_ns(l->key.ns);
+ up_write(&l->mutex);
+ kfree(l);
+ return;
}
+ mutex_unlock(&l->owner->pidlist_mutex);
up_write(&l->mutex);
}

@@ -2402,10 +2468,10 @@ static const struct file_operations cgroup_pidlist_operations = {
* in the cgroup.
*/
/* helper function for the two below it */
-static int cgroup_pidlist_open(struct file *file, bool procs)
+static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
{
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
- struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
+ struct cgroup_pidlist *l;
int retval;

/* Nothing to do for write-only files */
@@ -2413,7 +2479,7 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
return 0;

/* have the array populated */
- retval = pidlist_array_load(cgrp, procs);
+ retval = pidlist_array_load(cgrp, type, &l);
if (retval)
return retval;
/* configure file information */
@@ -2429,11 +2495,11 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
}
static int cgroup_tasks_open(struct inode *unused, struct file *file)
{
- return cgroup_pidlist_open(file, false);
+ return cgroup_pidlist_open(file, CGROUP_FILE_TASKS);
}
static int cgroup_procs_open(struct inode *unused, struct file *file)
{
- return cgroup_pidlist_open(file, true);
+ return cgroup_pidlist_open(file, CGROUP_FILE_PROCS);
}

static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,

2009-07-02 23:47:25

by Andrew Morton

Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 02 Jul 2009 16:26:20 -0700
Paul Menage <[email protected]> wrote:

> Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
>
> struct cgroup used to have a bunch of fields for keeping track of the pidlist
> for the tasks file. Those are now separated into a new struct cgroup_pidlist,
> of which each cgroup holds two: one for procs and one for tasks. The way the
> seq_file operations are set up is changed so that just the pidlist struct gets
> passed around as the private data.
>
> Possible future functionality is making the procs file writable for purposes
> of adding all threads with the same tgid at once.
>

It'd be nice were the changelog to spell out the contents of this file
(via an example) so we can review the proposed userspace interface
change.

> ---
>
> include/linux/cgroup.h | 23 +++-
> kernel/cgroup.c | 287 ++++++++++++++++++++++++++++++------------------

It'd be nice if the proposed userspace interface were documented too.

>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 665fa70..3b937c3 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
>
> ...
>
> -/*
> - * Load into 'pidarray' up to 'npids' of the tasks using cgroup
> - * 'cgrp'. Return actual number of pids loaded. No need to
> - * task_lock(p) when reading out p->cgroup, since we're in an RCU
> - * read section, so the css_set can't go away, and is
> - * immutable after creation.
> +/**
> + * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
> + * returns -ENOMEM if the malloc fails; 0 on success
> + */

The comment purports to be kerneldoc ("/**") but didn't document the
function's arguments.

> +/* this implementation relies on the fact that pids will never be -1 */
> +#define PIDLIST_VALUE_NONE ((pid_t)(-1))
> +static int pidlist_uniq(pid_t **p, int *length)
> +{
> + int i, j = 0; /* array index */
> + int last = 0; /* index of last unique entry */
> + int count = 1; /* count of total unique entries */
> + pid_t *list, *newlist;

a newline here is conventional.

> + BUG_ON(p == NULL || *p == NULL || length == NULL);

Unneeded, really. The kernel will reliably oops if any of these are
true, which provides us with the same info.

In fact debuggability would be enhanced were this assertion to be
removed, because if this goes blam then it will be very hard to work
out which of these three tests went wrong.

> + list = *p;
> + /*
> + * we presume the 0th element is unique, so i starts at 1. trivial
> + * edge cases first; no work needs to be done for either
> + */
> + if (*length == 0 || *length == 1)
> + return 0;
> + for (i = 1; i < *length; i++) {
> + BUG_ON(list[i] == PIDLIST_VALUE_NONE);
> + if (list[i] == list[last]) {
> + list[i] = PIDLIST_VALUE_NONE;
> + } else {
> + last = i;
> + count++;
> + }
> + }
> + newlist = kmalloc(count * sizeof(pid_t), GFP_KERNEL);

What is the maximum possible value of `count' here?

Is there any way in which malicious code can exploit the potential
multiplicative overflow in this statement? (kcalloc() checks for
this).

> + if (!newlist)
> + return -ENOMEM;
> + /* copy to new array */
> + for (i = 0; i < *length; i++) {
> + if (list[i] != PIDLIST_VALUE_NONE) {
> + newlist[j] = list[i];
> + j++;
> + }
> + }
> + BUG_ON(j != count); /* this would fail on a zero-length array */
> + kfree(list);
> + *p = newlist;
> + *length = count;
> + return 0;
> +}
> +
> +static int cmppid(const void *a, const void *b)
> +{
> + return *(pid_t *)a - *(pid_t *)b;
> +}
> +
> +/**
> + * Load a cgroup's pidarray with either procs' tgids or tasks' pids
> */

again, that ain't kerneldoc.

> -static int pid_array_load(pid_t *pidarray, int npids, struct cgroup *cgrp)
> +static int pidlist_array_load(struct cgroup *cgrp, bool procs)
> {
> - int n = 0, pid;
> + pid_t *array;
> + int length;
> + int retval;
> + int pid, n = 0; /* used for populating the array */
> struct cgroup_iter it;
> struct task_struct *tsk;
> + struct cgroup_pidlist *l;
> +
> + /*
> + * If cgroup gets more users after we read count, we won't have
> + * enough space - tough. This race is indistinguishable to the
> + * caller from the case that the additional cgroup users didn't
> + * show up until sometime later on.
> + */
> + length = cgroup_task_count(cgrp);
> + array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);

max size?

overflowable?

> + if (!array)
> + return -ENOMEM;
> + /* now, populate the array */
> cgroup_iter_start(cgrp, &it);
> while ((tsk = cgroup_iter_next(cgrp, &it))) {
> - if (unlikely(n == npids))
> + if (unlikely(n == length))
> break;
> - pid = task_pid_vnr(tsk);
> - if (pid > 0)
> - pidarray[n++] = pid;
> + /* get tgid or pid for procs or tasks file respectively */
> + pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
> + if (pid > 0) /* make sure to only use valid results */
> + array[n++] = pid;
> }
> cgroup_iter_end(cgrp, &it);
> - return n;
> + length = n;
> + /* now sort & (if procs) strip out duplicates */
> + sort(array, length, sizeof(pid_t), cmppid, NULL);
> + if (procs) {
> + retval = pidlist_uniq(&array, &length);
> + if (retval) { /* the malloc inside uniq can fail */
> + kfree(array);
> + return retval;
> + }
> + l = &(cgrp->procs);
> + } else {
> + l = &(cgrp->tasks);
> + }
> + /* store array in cgroup, freeing old if necessary */
> + down_write(&l->mutex);
> + kfree(l->list);
> + l->list = array;
> + l->length = length;
> + l->use_count++;
> + up_write(&l->mutex);
> + return 0;
> }
>
> +
> /**
> * cgroupstats_build - build and fill cgroupstats
> * @stats: cgroupstats to fill information into
>
> ...
>
> -static int cgroup_tasks_release(struct inode *inode, struct file *file)
> +static int cgroup_pidlist_release(struct inode *inode, struct file *file)
> {
> - struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
> -
> + struct cgroup_pidlist *l;
> if (!(file->f_mode & FMODE_READ))
> return 0;
> -
> - release_cgroup_pid_array(cgrp);
> + /*
> + * the seq_file will only be initialized if the file was opened for
> + * reading; hence we check if it's not null only in that case.
> + */
> + BUG_ON(!file->private_data);

Unneeded assertion for reasons described above.

> + l = ((struct seq_file *)file->private_data)->private;
> + cgroup_release_pid_array(l);
> return seq_release(inode, file);
> }
>
>
> ...
>
> @@ -2389,21 +2457,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
> /*
> * for the common functions, 'private' gives the type of file
> */
> +/* for hysterical reasons, we can't put this on the older files */

"raisins" ;)

> +#define CGROUP_FILE_GENERIC_PREFIX "cgroup."
> static struct cftype files[] = {
> {
> .name = "tasks",
> .open = cgroup_tasks_open,
> .write_u64 = cgroup_tasks_write,
> - .release = cgroup_tasks_release,
> - .private = FILE_TASKLIST,
> + .release = cgroup_pidlist_release,
> .mode = S_IRUGO | S_IWUSR,
> },
> -
> + {
> + .name = CGROUP_FILE_GENERIC_PREFIX "procs",
> + .open = cgroup_procs_open,
> + /* .write_u64 = cgroup_procs_write, TODO */
> + .release = cgroup_pidlist_release,
> + .mode = S_IRUGO,
> + },
> {
> .name = "notify_on_release",
> .read_u64 = cgroup_read_notify_on_release,
> .write_u64 = cgroup_write_notify_on_release,
> - .private = FILE_NOTIFY_ON_RELEASE,
> },
> };
>
>
> ...
>

2009-07-02 23:54:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] Ensures correct concurrent opening/reading of pidlists across pid namespaces

On Thu, 02 Jul 2009 16:26:25 -0700
Paul Menage <[email protected]> wrote:

> Ensures correct concurrent opening/reading of pidlists across pid namespaces
>
> Previously there was the problem in which two processes from different pid
> namespaces reading the tasks or procs file could result in one process seeing
> results from the other's namespace. Rather than one pidlist for each file in a
> cgroup, we now keep a list of pidlists keyed by namespace and file type (tasks
> versus procs) in which entries are placed on demand. Each pidlist has its own
> lock, and because the pidlists themselves are passed around in the seq_file's
> private pointer, we don't have to touch the cgroup or its master list
> except when creating and destroying entries.
>
> Signed-off-by: Ben Blum <[email protected]>
> Reviewed-by: Paul Menage <[email protected]>
> Signed-off-by: Paul Menage <[email protected]>

The way these patches were sent states that you were their primary
author. Is that accurate? If not, they should have had

From: Ben Blum <[email protected]>

at the very top of the changelog.

>
> ...
>
> /**
> + * find the appropriate pidlist for our purpose (given procs vs tasks)
> + * returns with the lock on that pidlist already held, and takes care
> + * of the use count, or returns NULL with no locks held if we're out of
> + * memory.
> + */

Comment purports to be kerneldoc, but isn't.

> +static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
> + enum cgroup_filetype type)
> +{
> + struct cgroup_pidlist *l;
> + /* don't need task_nsproxy() if we're looking at ourself */
> + struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
> + mutex_lock(&cgrp->pidlist_mutex);
> + list_for_each_entry(l, &cgrp->pidlists, links) {
> + if (l->key.type == type && l->key.ns == ns) {
> + /* found a matching list - drop the extra refcount */
> + put_pid_ns(ns);
> + /* make sure l doesn't vanish out from under us */

This looks fishy.

> + down_write(&l->mutex);
> + mutex_unlock(&cgrp->pidlist_mutex);
> + l->use_count++;
> + return l;

The caller of cgroup_pidlist_find() must ensure that l->use_count > 0,
otherwise cgroup_pidlist_find() cannot safely use `l' - it could be
freed at any time. But if l->use_count > 0, there is no risk of `l'
"vanishing out from under us".

I'm probably wrong there, but that's the usual pattern and this code
looks like it's doing something different. Please check?

> + }
> + }
> + /* entry not found; create a new one */
> + l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
> + if (!l) {
> + mutex_unlock(&cgrp->pidlist_mutex);
> + return l;
> + }
> + init_rwsem(&l->mutex);
> + down_write(&l->mutex);
> + l->key.type = type;
> + l->key.ns = ns;
> + l->use_count = 0; /* don't increment here */
> + l->list = NULL;
> + l->owner = cgrp;
> + list_add(&l->links, &cgrp->pidlists);
> + mutex_unlock(&cgrp->pidlist_mutex);
> + return l;
> +}
> +
>
> ...
>

2009-07-03 00:22:18

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 2/2] Ensures correct concurrent opening/reading of pidlists across pid namespaces

On Thu, Jul 2, 2009 at 4:54 PM, Andrew Morton<[email protected]> wrote:
>
> The way these patches were sent states that you were their primary
> author.  Is that accurate?  If not, they should have had
>
> From: Ben Blum <[email protected]>
>
> at the very top of the changelog.

Yes, they're originally by Ben.

Hmm. "git log" in my tree shows:

commit 673496e073c88216b3d2d9fe76eb1163dedb5fa3
Author: Ben Blum <[email protected]>
Date: Thu Jul 2 17:19:24 2009 -0700

Ensures correct concurrent opening/reading of pidlists across pid namespaces

and "stg edit" shows the first line as:

From: Ben Blum <[email protected]>

So I'm not sure how "stg mail" managed to lose that.

Paul

2009-07-03 00:26:49

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 2/2] Ensures correct concurrent opening/reading of pidlists across pid namespaces

On Thu, Jul 2, 2009 at 5:22 PM, Paul Menage<[email protected]> wrote:
>
> So I'm not sure how "stg mail" managed to lose that.
>

Ah - I was missing a

%(fromauth)s

tag in my stgit patchmail template so it wasn't mentioning the real
"From" author. Now fixed for next time I send.

Paul

2009-07-03 00:31:49

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 4:46 PM, Andrew Morton<[email protected]> wrote:
>> +/**
>> + * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
>> + * returns -ENOMEM if the malloc fails; 0 on success
>> + */
>
> The comment purports to be kerneldoc ("/**") but didn't document the
> function's arguments.

Wasn't aware of that restriction. Recommend making
scripts/checkpatch.pl look for that sort of thing?

>> +     list = *p;
>> +     /*
>> +      * we presume the 0th element is unique, so i starts at 1. trivial
>> +      * edge cases first; no work needs to be done for either
>> +      */
>> +     if (*length == 0 || *length == 1)
>> +             return 0;
>> +     for (i = 1; i < *length; i++) {
>> +             BUG_ON(list[i] == PIDLIST_VALUE_NONE);
>> +             if (list[i] == list[last]) {
>> +                     list[i] = PIDLIST_VALUE_NONE;
>> +             } else {
>> +                     last = i;
>> +                     count++;
>> +             }
>> +     }
>> +     newlist = kmalloc(count * sizeof(pid_t), GFP_KERNEL);
>
> What is the maximum possible value of `count' here?
>
> Is there any way in which malicious code can exploit the potential
> multiplicative overflow in this statement?  (kcalloc() checks for
> this).
>> +     /*
>> +      * If cgroup gets more users after we read count, we won't have
>> +      * enough space - tough.  This race is indistinguishable to the
>> +      * caller from the case that the additional cgroup users didn't
>> +      * show up until sometime later on.
>> +      */
>> +     length = cgroup_task_count(cgrp);
>> +     array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
>
> max size?
>
> overflowable?

In the first snippet, count will be at most equal to length. As length
is determined from cgroup_task_count, it can be no greater than the
total number of pids on the system. (Also, the second snippet of code
was there before, just relocated, so if there's an overflow bug in
either it'll have already been there.)

>> @@ -2389,21 +2457,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
>> ?/*
>> ? * for the common functions, 'private' gives the type of file
>> ? */
>> +/* for hysterical reasons, we can't put this on the older files */
>
> "raisins" ;)

The keys are right next to each other, I promise.

There was a bit of discussion on how to name these files. Paul wanted
to start naming the generic cgroup files with the "cgroup." prefix,
but we can't change "tasks" and "notify_on_release" etc. We decided to
use the new name format but only for the new file - can anything be
done about the other ones, or do they have to stay as is?

2009-07-03 00:44:02

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 2/2] Ensures correct concurrent opening/reading of pidlists across pid namespaces

On Thu, Jul 2, 2009 at 4:54 PM, Andrew Morton<[email protected]> wrote:
>> +static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
>> +                                              enum cgroup_filetype type)
>> +{
>> +     struct cgroup_pidlist *l;
>> +     /* don't need task_nsproxy() if we're looking at ourself */
>> +     struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
>> +     mutex_lock(&cgrp->pidlist_mutex);
>> +     list_for_each_entry(l, &cgrp->pidlists, links) {
>> +             if (l->key.type == type && l->key.ns == ns) {
>> +                     /* found a matching list - drop the extra refcount */
>> +                     put_pid_ns(ns);
>> +                     /* make sure l doesn't vanish out from under us */
>
> This looks fishy.
>
>> +                     down_write(&l->mutex);
>> +                     mutex_unlock(&cgrp->pidlist_mutex);
>> +                     l->use_count++;
>> +                     return l;
>
> The caller of cgroup_pidlist_find() must ensure that l->use_count > 0,
> otherwise cgroup_pidlist_find() cannot safely use `l' - it could be
> freed at any time.  But if l->use_count > 0, there is no risk of `l'
> "vanishing out from under us".
>
> I'm probably wrong there, but that's the usual pattern and this code
> looks like it's doing something different.  Please check?
>

That comment is vague, and should be rewritten. Individual pidlist
locks depend on the cgroup->pidlist_mutex; the main idea here is that
we can't drop the pidlist_mutex before picking up l->lock in case
somebody's trying to remove it from the list at the same time (compare
with cgroup_release_pid_array, the destroyer). The pid_namespace
refcount is also safe, because having found the existing list means
whoever put it there has a reference on the namespace in l->key, which
hasn't gone away yet and also is protected by the
cgroup->pidlist_mutex.

The only ordering that's important here is that incrementing
l->use_count and dropping cgroup->pidlist_mutex both have to come
after taking l->mutex.

2009-07-03 00:53:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 2 Jul 2009 17:31:38 -0700 Benjamin Blum <[email protected]> wrote:

> On Thu, Jul 2, 2009 at 4:46 PM, Andrew Morton<[email protected]> wrote:
> >> +/**
> >> + * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
> >> + * returns -ENOMEM if the malloc fails; 0 on success
> >> + */
> >
> > The comment purports to be kerneldoc ("/**") but didn't document the
> > function's arguments.
>
> Wasn't aware of that restriction. Recommend making
> scripts/checkpatch.pl look for that sort of thing?

ooh, hard.

Probably the kerneldoc parsing tools are the place to do this checking
- there's no point in duplicating it. But they might not be smart
enough to detect missing arguments.

> >> + __ __ list = *p;
> >> + __ __ /*
> >> + __ __ __* we presume the 0th element is unique, so i starts at 1. trivial
> >> + __ __ __* edge cases first; no work needs to be done for either
> >> + __ __ __*/
> >> + __ __ if (*length == 0 || *length == 1)
> >> + __ __ __ __ __ __ return 0;
> >> + __ __ for (i = 1; i < *length; i++) {
> >> + __ __ __ __ __ __ BUG_ON(list[i] == PIDLIST_VALUE_NONE);
> >> + __ __ __ __ __ __ if (list[i] == list[last]) {
> >> + __ __ __ __ __ __ __ __ __ __ list[i] = PIDLIST_VALUE_NONE;
> >> + __ __ __ __ __ __ } else {
> >> + __ __ __ __ __ __ __ __ __ __ last = i;
> >> + __ __ __ __ __ __ __ __ __ __ count++;
> >> + __ __ __ __ __ __ }
> >> + __ __ }

Someone's email client is doing s/0x20/0xa0/grr

> >> + __ __ newlist = kmalloc(count * sizeof(pid_t), GFP_KERNEL);
> >
> > What is the maximum possible value of `count' here?
> >
> > Is there any way in which malicious code can exploit the potential
> > multiplicative overflow in this statement? __(kcalloc() checks for
> > this).
> >> + __ __ /*
> >> + __ __ __* If cgroup gets more users after we read count, we won't have
> >> + __ __ __* enough space - tough. __This race is indistinguishable to the
> >> + __ __ __* caller from the case that the additional cgroup users didn't
> >> + __ __ __* show up until sometime later on.
> >> + __ __ __*/
> >> + __ __ length = cgroup_task_count(cgrp);
> >> + __ __ array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
> >
> > max size?
> >
> > overflowable?
>
> In the first snippet, count will be at most equal to length. As length
> is determined from cgroup_task_count, it can be no greater than the
> total number of pids on the system.

Well that's a problem, because there can be tens or hundreds of
thousands of pids, and there's a fairly low maximum size for kmalloc()s
(include/linux/kmalloc_sizes.h).

And even if this allocation attempt doesn't exceed KMALLOC_MAX_SIZE,
large allocations are less reliable. There is a large break point at
8*PAGE_SIZE (PAGE_ALLOC_COSTLY_ORDER).

It would be really nice if we could avoid "fixing" this via vmalloc(),
because vmalloc() causes, and is vulnerable to, internal fragmentation
problems.

> (Also, the second snippet of code
> was there before, just relocated, so if there's an overflow bug in
> either it'll have already been there.)
>
> >> @@ -2389,21 +2457,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
> >> __/*
> >> __ * for the common functions, 'private' gives the type of file
> >> __ */
> >> +/* for hysterical reasons, we can't put this on the older files */
> >
> > "raisins" ;)
>
> They keys are right next to each other, I promise.
>
> There was a bit of discussion on how to name these files. Paul wanted
> to start naming the generic cgroup files with the "cgroup." prefix,
> but we can't change "tasks" and "notify_on_release" etc. We decided to
> use the new name format but only for the new file - can anything be
> done about the other ones, or do they have to stay as is?

One could perhaps create an alias (symlink?) and leave that in place
for a few kernel releases and then remove the old names. The trick to
doing this politely is to arrange for a friendly printk to come out
when userspace uses the old filename, so people know to change their
tools. That printk should come out once-per-boot, not once-per-access.


2009-07-03 01:08:38

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 5:53 PM, Andrew Morton<[email protected]> wrote:
>> In the first snippet, count will be at most equal to length. As length
>> is determined from cgroup_task_count, it can be no greater than the
>> total number of pids on the system.
>
> Well that's a problem, because there can be tens or hundreds of
> thousands of pids, and there's a fairly low maximum size for kmalloc()s
> (include/linux/kmalloc_sizes.h).
>
> And even if this allocation attempt doesn't exceed KMALLOC_MAX_SIZE,
> large allocations are less reliable.  There is a large break point at
> 8*PAGE_SIZE (PAGE_ALLOC_COSTLY_ORDER).

This has been a long-standing problem with the tasks file, ever since
the cpusets days.

There are ways around it - Lai Jiangshan <[email protected]> posted
a patch that allocated an array of pages to store pids in, with a
custom sorting function that let you specify indirection rather than
assuming everything was in one contiguous array. This was technically
the right approach in terms of not needing vmalloc and never doing
large allocations, but it was very complex; an alternative that was
mooted was to use kmalloc for small cgroups and vmalloc for large
ones, so the vmalloc penalty wouldn't be paid generally. The thread
fizzled AFAICS.

>
> One could perhaps create an alias (symlink?) and leave that in place
> for a few kernel releases and then remove the old names.  The trick to
> doing this politely is to arrange for a friendly printk to come out
> when userspace uses the old filename, so people know to change their
> tools.  That printk should come out once-per-boot, not once-per-access.

Personally, I feel that a bit of ugliness in the naming inconsistency
is less painful than trying to deprecate something that people might
be using. If we could just flip the names without breaking anyone,
that would be great, but this is just a style issue rather than a
functional issue. My experience of such printk() statements scattered
around in code is that no-one takes much notice of them.

Paul

2009-07-03 01:14:17

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

Paul Menage wrote:
> The following series (written by Ben Blum) adds a "cgroup.procs" file

This is rather useful. :)

> to each cgroup that reports unique tgids rather than pids, and fixes a
> pid namespace bug in the existing "tasks" file that could cause
> readers in different namespaces to interfere with one another.
>
> The patches as a pair provide similar functionality to Li Zefan's
> patch posted yesterday, with the addition of the "cgroup.procs" file;
> if it's decided that Li's patch needs to be fast-tracked to 2.6.31,
> then these patches can be rebased as a small extension of Li's patch;
> if Li's patch doesn't need to go to 2.6.31 then it makes more sense to
> take this pair since they provide more overall functionality.
>

I'd prefer fixing it for 2.6.31. There is no reason not to fix it
sooner rather than later. Basing a fix on top of a new feature is no
good, and it makes backporting the fix harder if someone wants to do that.

2009-07-03 01:18:12

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 6:08 PM, Paul Menage<[email protected]> wrote:
> On Thu, Jul 2, 2009 at 5:53 PM, Andrew Morton<[email protected]> wrote:
>>> In the first snippet, count will be at most equal to length. As length
>>> is determined from cgroup_task_count, it can be no greater than the
>>> total number of pids on the system.
>>
>> Well that's a problem, because there can be tens or hundreds of
>> thousands of pids, and there's a fairly low maximum size for kmalloc()s
>> (include/linux/kmalloc_sizes.h).
>>
>> And even if this allocation attempt doesn't exceed KMALLOC_MAX_SIZE,
>> large allocations are less reliable.  There is a large break point at
>> 8*PAGE_SIZE (PAGE_ALLOC_COSTLY_ORDER).
>
> This has been a long-standing problem with the tasks file, ever since
> the cpusets days.
>
> There are ways around it - Lai Jiangshan <[email protected]> posted
> a patch that allocated an array of pages to store pids in, with a
> custom sorting function that let you specify indirection rather than
> assuming everything was in one contiguous array. This was technically
> the right approach in terms of not needing vmalloc and never doing
> large allocations, but it was very complex; an alternative that was
> mooted was to use kmalloc for small cgroups and vmalloc for large
> ones, so the vmalloc penalty wouldn't be paid generally. The thread
> fizzled AFAICS.

As it is currently, the kmalloc call will simply fail if there are too
many pids, correct? Do we prefer not being able to read the file in
this case, or would we rather use vmalloc?

>
>>
>> One could perhaps create an alias (symlink?) and leave that in place
>> for a few kernel releases and then remove the old names.  The trick to
>> doing this politely is to arrange for a friendly printk to come out
>> when userspace uses the old filename, so people know to change their
>> tools.  That printk should come out once-per-boot, not once-per-access.
>
> Personally, I feel that a bit of ugliness in the naming inconsistency
> is less painful than trying to deprecate something that people might
> be using.

That's what the people who designed x86 said :P

> If we could just flip the names without breaking anyone,
> that would be great, but this is just a style issue rather than a
> functional issue. My experience of such printk() statements scattered
> around in code is that no-one takes much notice of them.

Whether or not we get rid of the old ones, it would be good to put in
aliases with the new style now so there's the option of removing the
old style ones later.

>
> Paul
>

2009-07-03 01:30:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 2 Jul 2009 18:08:29 -0700 Paul Menage <[email protected]> wrote:

> On Thu, Jul 2, 2009 at 5:53 PM, Andrew Morton<[email protected]> wrote:
> >> In the first snippet, count will be at most equal to length. As length
> >> is determined from cgroup_task_count, it can be no greater than the
> >> total number of pids on the system.
> >
> > Well that's a problem, because there can be tens or hundreds of
> > thousands of pids, and there's a fairly low maximum size for kmalloc()s
> > (include/linux/kmalloc_sizes.h).
> >
> > And even if this allocation attempt doesn't exceed KMALLOC_MAX_SIZE,
> > large allocations are less reliable.  There is a large break point at
> > 8*PAGE_SIZE (PAGE_ALLOC_COSTLY_ORDER).
>
> This has been a long-standing problem with the tasks file, ever since
> the cpusets days.
>
> There are ways around it - Lai Jiangshan <[email protected]> posted
> a patch that allocated an array of pages to store pids in, with a
> custom sorting function that let you specify indirection rather than
> assuming everything was in one contiguous array. This was technically
> the right approach in terms of not needing vmalloc and never doing
> large allocations, but it was very complex; an alternative that was
> mooted was to use kmalloc for small cgroups and vmalloc for large
> ones, so the vmalloc penalty wouldn't be paid generally. The thread
> fizzled AFAICS.

It's a problem which occurs fairly regularly. Some sites are fairly
busted. Many gave up and used vmalloc(). Others use an open-coded
array-of-pages thing.

This happens enough that I expect the kernel would benefit from a
general dynamic-array library facility. Something whose interface
mimics the C-level array operations but which is internally implemented
via some data structure which uses PAGE_SIZE allocations. Probably a
simple two-level thing would suffice.

> >
> > One could perhaps create an alias (symlink?) and leave that in place
> > for a few kernel releases and then remove the old names.  The trick to
> > doing this politely is to arrange for a friendly printk to come out
> > when userspace uses the old filename, so people know to change their
> > tools.  That printk should come out once-per-boot, not once-per-access.
>
> Personally, I feel that a bit of ugliness in the naming inconsistency
> is less painful than trying to deprecate something that people might
> be using. If we could just flip the names without breaking anyone,
> that would be great, but this is just a style issue rather than a
> functional issue.

Sure, leaving things as they are won't kill us.

But I do expect it'd be pretty easy to migrate to new names.

> My experience of such printk() statements scattered
> around in code is that no-one takes much notice of them.

mm... I don't recall having any problems with the approach, the few
times we've tried it. This case is particularly easy because at this
stage the audience is developers, and usually developers who wrote
their own stuff and don't have to wait for providers/distros/etc to
make changes.

2009-07-03 02:08:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 2 Jul 2009 18:17:56 -0700 Benjamin Blum <[email protected]> wrote:

> On Thu, Jul 2, 2009 at 6:08 PM, Paul Menage<[email protected]> wrote:
> > On Thu, Jul 2, 2009 at 5:53 PM, Andrew Morton<[email protected]> wrote:
> >>> In the first snippet, count will be at most equal to length. As length
> >>> is determined from cgroup_task_count, it can be no greater than the
> >>> total number of pids on the system.
> >>
> >> Well that's a problem, because there can be tens or hundreds of
> >> thousands of pids, and there's a fairly low maximum size for kmalloc()s
> >> (include/linux/kmalloc_sizes.h).
> >>
> >> And even if this allocation attempt doesn't exceed KMALLOC_MAX_SIZE,
> >> large allocations are less reliable.  There is a large break point at
> >> 8*PAGE_SIZE (PAGE_ALLOC_COSTLY_ORDER).
> >
> > This has been a long-standing problem with the tasks file, ever since
> > the cpusets days.
> >
> > There are ways around it - Lai Jiangshan <[email protected]> posted
> > a patch that allocated an array of pages to store pids in, with a
> > custom sorting function that let you specify indirection rather than
> > assuming everything was in one contiguous array. This was technically
> > the right approach in terms of not needing vmalloc and never doing
> > large allocations, but it was very complex; an alternative that was
> > mooted was to use kmalloc for small cgroups and vmalloc for large
> > ones, so the vmalloc penalty wouldn't be paid generally. The thread
> > fizzled AFAICS.
>
> As it is currently, the kmalloc call will simply fail if there are too
> many pids, correct? Do we prefer not being able to read the file in
> this case, or would we rather use vmalloc?

We'd prefer that we not use vmalloc and that the reads not fail!



Why are we doing all this anyway? To avoid presenting duplicated pids
to userspace? Nothing else?

If so, why not stop doing that - userspace can remove dupes (if it
cares) more easily than the kernel can?


Or we can do it the other way? Create an initially-empty local IDR
tree or radix tree and, within that, mark off any pids which we've
already emitted? That'll have a worst-case memory consumption of
approximately PID_MAX_LIMIT bits -- presently that's half a megabyte.
With no large allocations needed?


btw, did pidlist_uniq() actually need to allocate new memory for the
output array?  Could it have done the filtering in-place?

2009-07-03 02:26:00

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 02, 2009 at 06:17:56PM -0700, Benjamin Blum wrote:
> On Thu, Jul 2, 2009 at 6:08 PM, Paul Menage<[email protected]> wrote:
> > On Thu, Jul 2, 2009 at 5:53 PM, Andrew Morton<[email protected]> wrote:

<snip>

> >
> >>
> >> One could perhaps create an alias (symlink?) and leave that in place
> >> for a few kernel releases and then remove the old names.  The trick to
> >> doing this politely is to arrange for a friendly printk to come out
> >> when userspace uses the old filename, so people know to change their
> >> tools.  That printk should come out once-per-boot, not once-per-access.
> >
> > Personally, I feel that a bit of ugliness in the naming inconsistency
> > is less painful than trying to deprecate something that people might
> > be using.
>
> That's what the people who designed x86 said :P

And look at how successful x86 has been! ;)

Seriously, I don't think the name "tasks" is ugly. I think "tasks"
strikes a nice balance between verbosity ("cgroup.tasks") and specificity.
If anything, I think the new file should be called "processes", not
"cgroup.procs". The established convention was "subsys.foo"; cgroup is not
a subsystem of itself, hence the names "tasks" and "processes" are just fine.

> > If we could just flip the names without breaking anyone,
> > that would be great, but this is just a style issue rather than a
> > functional issue. My experience of such printk() statements scattered
> > around in code is that no-one takes much notice of them.

I agree with Paul.

>
> Whether or not we get rid of the old ones, it would be good to put in
> aliases with the new style now so there's the option of removing the
> old style ones later.

What a terrible idea! If every alias has an uncertain future nobody will
know which they should use. As a consequence 50% may use "tasks" and 50%
may use "cgroup.tasks" and then we won't be able to remove either name!

Let's stick with what we have and not add endless numbers of aliases for
every possible naming convention.

Cheers,
-Matt Helsley

2009-07-03 03:49:59

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 7:25 PM, Matt Helsley<[email protected]> wrote:
>
> Seriously, I don't think the name "tasks" is ugly. I think "tasks"
> is a nice balance between overly verbose ("cgroup.tasks") and specificity.
> If anything I think the new file should be called "processes", not
> "cgroup.procs". The established convention was "subsys.foo". cgroup is not
> a subsystem of itself hence the names "tasks" and "processes" are just fine.

But that means that every time we add a new cgroup framework control
file, we risk breaking people who happen to already have setups that
use that name. At least if we prefix all new names with "cgroup." it's
easier for people to avoid future clashes. I consider it a mistake on
my part that I didn't give the "tasks" file the "cgroup" prefix when I
originally morphed cpusets into cgroups.

Paul

2009-07-03 04:16:26

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 7:08 PM, Andrew Morton<[email protected]> wrote:
>
> Why are we doing all this anyway?  To avoid presenting duplicated pids
> to userspace?  Nothing else?

To present the pids or tgids in sorted order. Removing duplicates is
only for the case of the "procs" file; that could certainly be left to
userspace, but it wouldn't by itself remove the existing requirement
for a contiguous array.

The seq_file iterator for these files relies on them being sorted so
that it can pick up where it left off even in the event of the pid set
changing between reads - it does a binary search to find the first pid
greater than the last one that was returned, so as to guarantee that
we return every pid that was in the cgroup before the scan started and
remained in the cgroup until after the scan finished; there are no
guarantees about pids that enter/leave the cgroup during the scan.

> Or we can do it the other way?  Create an initially-empty local IDR
> tree or radix tree and, within that, mark off any pids which we've
> already emitted?  That'll have a worst-case memory consumption of
> approximately PID_MAX_LIMIT bits -- presently that's half a megabyte.
> With no large allocations needed?
>

But that would be half a megabyte per open fd? That's a lot of memory
that userspace can pin down by opening fds. The reason for the current
> pid array approach is that there's only ever one pid array
> allocated at a time per cgroup, rather than per open fd.

There's actually a structure already for doing that - cgroup_scanner,
which uses a high-watermark and a priority heap to provide a similar
guarantee, with a constant memory size overhead (typically one page).
But it can take O(n^2) time to scan a large cgroup, as would, I
suspect, using an IDR, so it's only used for cases where we really
can't avoid it due to locking reasons. I'd rather have something that
accumulates unsorted pids in page-size chunks as we iterate through
the cgroup, and then sorts them using something like Lai Jiangshan's
patch did.

>
> btw, did pidlist_uniq() actually need to allocate new memory for the
> output array?  Could it have done the filtering in-place?

Yes - or just omit duplicates in the seq_file iterator, I guess

Paul

2009-07-03 05:56:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 2 Jul 2009 18:30:04 -0700
Andrew Morton <[email protected]> wrote:

> On Thu, 2 Jul 2009 18:08:29 -0700 Paul Menage <[email protected]> wrote:
>
> > On Thu, Jul 2, 2009 at 5:53 PM, Andrew Morton<[email protected]> wrote:
> > >> In the first snippet, count will be at most equal to length. As length
> > >> is determined from cgroup_task_count, it can be no greater than the
> > >> total number of pids on the system.
> > >
> > > Well that's a problem, because there can be tens or hundreds of
> > > thousands of pids, and there's a fairly low maximum size for kmalloc()s
> > > (include/linux/kmalloc_sizes.h).
> > >
> > > And even if this allocation attempt doesn't exceed KMALLOC_MAX_SIZE,
> > > large allocations are less reliable.  There is a large break point at
> > > 8*PAGE_SIZE (PAGE_ALLOC_COSTLY_ORDER).
> >
> > This has been a long-standing problem with the tasks file, ever since
> > the cpusets days.
> >
> > There are ways around it - Lai Jiangshan <[email protected]> posted
> > a patch that allocated an array of pages to store pids in, with a
> > custom sorting function that let you specify indirection rather than
> > assuming everything was in one contiguous array. This was technically
> > the right approach in terms of not needing vmalloc and never doing
> > large allocations, but it was very complex; an alternative that was
> > mooted was to use kmalloc for small cgroups and vmalloc for large
> > ones, so the vmalloc penalty wouldn't be paid generally. The thread
> > fizzled AFAICS.
>
> It's a problem which occurs fairly regularly. Some sites are fairly
> busted. Many gave up and used vmalloc(). Others use an open-coded
> array-of-pages thing.
>
> This happens enough that I expect the kernel would benefit from a
> general dynamic-array library facility. Something whose interface
> mimics the C-level array operations but which is internally implemented
> via some data structure which uses PAGE_SIZE allocations. Probably a
> simple two-level thing would suffice.
>
I think both of kmalloc usage here are very bad.

Why can't we do what readdir(/proc) does? I'm sorry if I misunderstand.
Following is an easy example.


0. at open, initialize f_pos to 0. f_pos is used as "pid";
remember "css_set with hole" as a template in f_private (or somewhere) at open
...like this.
--
struct cgroupfs_root *root = cgrp->root;
struct cgroup_subsys_state **template =
	kzalloc(sizeof(*template) * CGROUP_SUBSYS_COUNT, GFP_KERNEL);

for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
	if (root->subsys_bits & (1UL << i))
		template[i] = cgrp->subsys[i];
--


1. at read(), look up the task_struct for the "pid" in f_pos.
2. compare that task's subsystem state with the template in f_private
--
struct cgroup_subsys_state **template = f_private;

for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
	if (!template[i])
		continue;
	if (template[i] != task_subsys_state(task, i))
		break;
}
if (i == CGROUP_SUBSYS_COUNT)
	print task;
--

3. f_pos++ until the seq_file buffer is full.


Thanks,
-Kame




2009-07-03 06:55:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 2 Jul 2009 21:16:15 -0700 Paul Menage <[email protected]> wrote:

> On Thu, Jul 2, 2009 at 7:08 PM, Andrew Morton<[email protected]> wrote:
> >
> > Why are we doing all this anyway?  To avoid presenting duplicated pids
> > to userspace?  Nothing else?
>
> To present the pids or tgids in sorted order. Removing duplicates is
> only for the case of the "procs" file; that could certainly be left to
> userspace, but it wouldn't by itself remove the existing requirement
> for a contiguous array.
>
> The seq_file iterator for these files relies on them being sorted so
> that it can pick up where it left off even in the event of the pid set
> changing between reads - it does a binary search to find the first pid
> greater than the last one that was returned, so as to guarantee that
> we return every pid that was in the cgroup before the scan started and
> remained in the cgroup until after the scan finished; there are no
> guarantees about pids that enter/leave the cgroup during the scan.

OIC. Gee we made it hard for ourselves. That tears it.

> > Or we can do it the other way?  Create an initially-empty local IDR
> > tree or radix tree and, within that, mark off any pids which we've
> > already emitted?  That'll have a worst-case memory consumption of
> > approximately PID_MAX_LIMIT bits -- presently that's half a megabyte.
> > With no large allocations needed?
> >
>
> But that would be half a megabyte per open fd?

worst-case, assuming that there are 4,000,000/64 pids, spread evenly
across the pid space.

> That's a lot of memory
> that userspace can pin down by opening fds.

yup.

> The reason for the current
> pid array approach is to mean that there's only ever one pid array
> allocated at a time per cgroup, rather than per open fd.

Oh. Didn't know that. That would have been an interesting thing to
have documented, but the sorry little comment which tried to explain
this got deleted by this patch :(

> There's actually a structure already for doing that - cgroup_scanner,
> which uses a high-watermark and a priority heap to provide a similar
> guarantee, with a constant memory size overhead (typically one page).
> But it can take O(n^2) time to scan a large cgroup, as would, I
> suspect, using an IDR, so it's only used for cases where we really
> can't avoid it due to locking reasons. I'd rather have something that
> accumulates unsorted pids in page-size chunks as we iterate through
> the cgroup, and then sorts them using something like Lai Jiangshan's
> patch did.

Was it a mistake to try to present an ordered, dupes-removed view of a
large amount of data from the kernel?

> >
> > btw, did pidlist_uniq() actually need to allocate new memory for the
> > output array?  Could it have done the filtering in-place?
>
> Yes - or just omit duplicates in the seq_file iterator, I guess

OK.


So now what? lib/dynarray.c?

2009-07-03 07:08:20

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 10:25 PM, Matt Helsley<[email protected]> wrote:
>> Whether or not we get rid of the old ones, it would be good to put in
>> aliases with the new style now so there's the option of removing the
>> old style ones later.
>
> What a terrible idea! If every alias has an uncertain future nobody will
> know which they should use. As a consequence 50% may use "tasks" and 50%
> may use "cgroup.tasks" and then we won't be able to remove either name!
>
> Let's stick with what we have and not add endless numbers of aliases for
> every possible naming convention.

I agree that this is a problem, but Paul has a good point about
needing to avoid possible clashes, so it seems to me that the
"cgroups." prefix is necessary in at least all new files. Painful as
it would be to make the filenames consistent and correct, I'd rather
see us try to move towards that rather than leaving the filenames
split and ugly as they currently are.

Can we solve the "why lots of aliases are bad" problem during the
transition by putting a note in the documentation and also having
printk warnings when the deprecated filename is opened?

2009-07-03 07:56:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, 2 Jul 2009 23:55:27 -0700
Andrew Morton <[email protected]> wrote:

> On Thu, 2 Jul 2009 21:16:15 -0700 Paul Menage <[email protected]> wrote:
>
> > On Thu, Jul 2, 2009 at 7:08 PM, Andrew Morton<[email protected]> wrote:
> > >
> > > Why are we doing all this anyway?  To avoid presenting duplicated pids
> > > to userspace?  Nothing else?
> >
> > To present the pids or tgids in sorted order. Removing duplicates is
> > only for the case of the "procs" file; that could certainly be left to
> > userspace, but it wouldn't by itself remove the existing requirement
> > for a contiguous array.
> >
> > The seq_file iterator for these files relies on them being sorted so
> > that it can pick up where it left off even in the event of the pid set
> > changing between reads - it does a binary search to find the first pid
> > greater than the last one that was returned, so as to guarantee that
> > we return every pid that was in the cgroup before the scan started and
> > remained in the cgroup until after the scan finished; there are no
> > guarantees about pids that enter/leave the cgroup during the scan.
>
> OIC. Gee we made it hard for ourselves. That tears it.
>
With a "file" interface, it's not necessary to guarantee an
atomic-and-correct list of pids. It's impossible.

readdir(/proc) does a best-effort job based on pids; please do it the
same way. It uses find_ge_pid() to scan all existing pids.


> > >
> > > btw, did pidlist_uniq() actually need to allocate new memory for the
> > > output array?  Could it have done the filtering in-place?
> >
> > Yes - or just omit duplicates in the seq_file iterator, I guess
>
> OK.
>
>
> So now what? lib/dynarray.c?

We never need an array for the user interface, IIUC.

Thanks,
-Kame

2009-07-03 15:52:55

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 10:54 PM, KAMEZAWA
Hiroyuki<[email protected]> wrote:
>
> Why we can't do what readdir(/proc) does ? I'm sorry I misunderstand.
> Following is an easy example.
>
>
> 0. at open, initialize f_pos to 0. f_pos is used as "pid";
>    remember "css_set with hole" as a template in f_private (or somewhere) at open
>    ...like this.
> --
>    struct cgroupfs_root *root = cgrp->root;
>    struct cgroup_subsys_state **template =
>            kzalloc(sizeof(*template) * CGROUP_SUBSYS_COUNT, GFP_KERNEL);
>
>    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
>        if (root->subsys_bits & (1UL << i))
>            template[i] = cgrp->subsys[i];
> --
>
>
> 1. at read(), look up the task_struct for the "pid" in f_pos.
> 2. compare that task's subsystem state with the template in f_private
> --
>    struct cgroup_subsys_state **template = f_private;
>
>    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
>        if (!template[i])
>            continue;
>        if (template[i] != task_subsys_state(task, i))
>            break;
>    }
>    if (i == CGROUP_SUBSYS_COUNT)
>        print task;

The problem with this is that the time taken to scan a single cgroup
is linear in the total number of threads in the system, so if you have
a lot of threads and a lot of cgroups (even if most of the threads are
concentrated in a single cgroup) the time taken to scan all the tasks
files is O(N^2) in the number of threads in the system. The current
scheme is linear in the number of threads in a cgroup, so looking at
all cgroups is linear in the number of threads in the system. (This
O(N^2) problem is something that we've actually observed as an
overhead on some busy systems at Google).

Paul

2009-07-03 16:12:11

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Thu, Jul 2, 2009 at 11:55 PM, Andrew Morton<[email protected]> wrote:
>> The seq_file iterator for these files relies on them being sorted so
>> that it can pick up where it left off even in the event of the pid set
>> changing between reads - it does a binary search to find the first pid
>> greater than the last one that was returned, so as to guarantee that
>> we return every pid that was in the cgroup before the scan started and
>> remained in the cgroup until after the scan finished; there are no
>> guarantees about pids that enter/leave the cgroup during the scan.
>
> OIC.  Gee we made it hard for ourselves.  That tears it.

Without a guarantee like that, the file becomes much less useful - it
would just be "a random subset of the pids that were in the cgroup at
some time during the read".

>
> Was it a mistake to try to present an ordered, dupes-removed view of a
> large amount of data from the kernel?

The dupe-removal is a red-herring in this argument - that's only added
by this patch, and doesn't affect the existing "tasks" file. I think
we could even change it into a "mostly dupe-removed" view very cheaply
by simply ignoring a tgid when building the array if it's the same as
the previous one we just saw, and leaving userspace to deal with any
remaining dupes.

The ordered semantics of the tasks file comes from cpusets and I
maintained the API (which actually allocated a potentially even larger
array containing the pre-formatted data).

I'd be surprised if removing the ordering constraint broke anything in
userspace, but the more important part about the ordering is to allow
the kernel to present a consistent view in the presence of changing
pids.
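The resume step described above amounts to a binary search for the first pid strictly greater than the last one already reported. A minimal userspace sketch, assuming a sorted pid array (the function name is hypothetical):

```c
#include <stddef.h>

/* Return the index of the first pid strictly greater than `last`,
 * or `len` if every pid has already been reported. This is how the
 * seq_file iterator can pick up where it left off even if pids were
 * added to or removed from the array between reads. */
static size_t first_pid_after(const int *pids, size_t len, int last)
{
	size_t lo = 0, hi = len;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (pids[mid] <= last)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

Because the search only depends on ordering, a pid that leaves the cgroup mid-scan is simply skipped over, which is exactly the no-guarantee case described above.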

Hmm, I guess we could use a combination of the IDR approach that you
suggested and the present shared-array approach:

- when opening a tasks file:
- populate an IDR with all the pids/tgids in the cgroup
- find any existing IDR open for the cgroup in the list keyed by
namespace and filetype ("procs"/"tasks")
- replace (and free) the existing IDR or extend the list with a new one
- bump the refcount

- when reading:
- iterate from the last reported pid/tgid

- when closing:
- drop the refcount, and free the IDR if the count reaches 0

That maintains the property of preventing userspace from consuming
arbitrary amounts of memory just by holding an fd open on a large
cgroup, while also maintaining a consistency guarantee, and we get the
ordering for free as a side-effect of the IDR, with no large memory
allocations. It's not hugely different from the current solution - all
we're doing is replacing the large array in the cgroup_pidlist
structure with an IDR, and populating/reading it appropriately.

The downsides would be a higher fixed cost, I suspect - setting up an
IDR and populating/scanning it sounds like it has to be more expensive
than filling/reading a linear buffer. But it's probably not enough
extra overhead to worry too much about it.
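A userspace sketch of the open/close refcounting above (all names hypothetical; a plain array stands in for the IDR, and the locking around the per-cgroup list is omitted):

```c
#include <stdlib.h>

/* One shared pidlist per (cgroup, namespace, filetype): repopulated on
 * each open, shared by all readers, freed when the last fd closes. */
struct shared_pidlist {
	int *pids;        /* stand-in for the IDR / radix tree */
	size_t len;
	int refcount;
};

/* Open: reuse the existing structure for this key if there is one,
 * replacing its stale contents; otherwise allocate a fresh one. */
static struct shared_pidlist *pidlist_open(struct shared_pidlist *pl,
					   const int *pids, size_t len)
{
	if (!pl) {
		pl = calloc(1, sizeof(*pl));
		if (!pl)
			return NULL;
	} else {
		free(pl->pids);   /* drop contents from a prior open */
	}
	pl->pids = malloc(len * sizeof(int));
	if (!pl->pids)
		return NULL;
	for (size_t i = 0; i < len; i++)
		pl->pids[i] = pids[i];
	pl->len = len;
	pl->refcount++;
	return pl;
}

/* Close: free everything when the last reader goes away. */
static struct shared_pidlist *pidlist_close(struct shared_pidlist *pl)
{
	if (--pl->refcount == 0) {
		free(pl->pids);
		free(pl);
		return NULL;
	}
	return pl;
}
```

This captures the memory property being argued for: however many fds userspace opens on the same cgroup file, at most one populated structure exists per (cgroup, namespace, filetype) at a time.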

Paul

2009-07-03 16:50:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Fri, 3 Jul 2009 09:11:56 -0700 Paul Menage <[email protected]> wrote:

> Hmm, I guess we could use a combination of the IDR approach that you
> suggested and the present shared-array approach:
>
> - when opening a tasks file:
> - populate an IDR with all the pids/tgids in the cgroup
> - find any existing IDR open for the cgroup in the list keyed by
> namespace and filetype ("procs"/"tasks")
> - replace (and free) the existing IDR or extend the list with a new one
> - bump the refcount
>
> - when reading:
> - iterate from the last reported pid/tgid
>
> - when closing:
> - drop the refcount, and free the IDR if the count reaches 0
>
> That maintains the property of preventing userspace from consuming
> arbitrary amounts of memory just by holding an fd open on a large
> cgroup, while also maintaining a consistency guarantee, and we get the
> ordering for free as a side-effect of the IDR, with no large memory
> allocations. It's not hugely different from the current solution - all
> we're doing is replacing the large array in the cgroup_pidlist
> structure with an IDR, and populating/reading it appropriately.

I think you're saying "for each pid N in the cgroup, set the Nth
element in an IDR tree". That would work. And it automatically gives
ordered traversal and dupe removal.

I don't think IDRs permit in-order traversal, whereas radix-trees do
support this. Unfortunately radix-trees are presented as operating on
void* data, so one would need to do some typecasting when storing
BITS_PER_LONG-sized bitfields inside them.
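A userspace sketch of such a pid "scoreboard": a two-level structure where page-sized bitmap leaves are allocated only when a pid in their range is first marked, so sparse pid sets stay cheap. (A kernel version would hang the BITS_PER_LONG-sized words off a radix tree rather than using this fixed top-level array; all names here are hypothetical.)

```c
#include <stdlib.h>
#include <limits.h>

#define PID_LIMIT  (1 << 22)              /* on the order of PID_MAX_LIMIT */
#define LEAF_BITS  (4096 * CHAR_BIT)      /* one page worth of bits */
#define NR_LEAVES  (PID_LIMIT / LEAF_BITS)
#define WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)

struct pid_scoreboard {
	unsigned long *leaf[NR_LEAVES];   /* NULL until first pid in range */
};

/* Mark pid as seen, allocating its leaf on demand. */
static int sb_mark(struct pid_scoreboard *sb, int pid)
{
	unsigned long **leaf = &sb->leaf[pid / LEAF_BITS];
	int bit = pid % LEAF_BITS;

	if (!*leaf) {
		*leaf = calloc(LEAF_BITS / WORD_BITS, sizeof(unsigned long));
		if (!*leaf)
			return -1;
	}
	(*leaf)[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
	return 0;
}

/* Test whether pid has been marked; an absent leaf means "no". */
static int sb_test(const struct pid_scoreboard *sb, int pid)
{
	const unsigned long *leaf = sb->leaf[pid / LEAF_BITS];
	int bit = pid % LEAF_BITS;

	if (!leaf)
		return 0;
	return (leaf[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1;
}
```

Worst case this still approaches the half-megabyte figure discussed above, but a cgroup whose pids cluster in a few ranges only pays for the leaves it touches.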

> The downsides would be a higher fixed cost, I suspect - setting up an
> IDR and populating/scanning it sounds like it has to be more expensive
> than filling/reading a linear buffer. But it's probably not enough
> extra overhead to worry too much about it.

Yes, I expect it'd be fairly modest. There will be far more calls to
kmalloc() when using a tree, but that's the whole point..

2009-07-03 17:55:00

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Fri, Jul 3, 2009 at 9:50 AM, Andrew Morton<[email protected]> wrote:
>
> I think you're saying "for each pid N in the cgroup, set the Nth
> element in an IDR tree".

Right.

> That would work.  And it automatically gives
> ordered traversal and dupe removal.
>
> I don't think IDRs permit in-order traversal, whereas radix-trees do
> support this.

I thought that an IDR *was* a form of radix tree.

Looking at the IDR API, it doesn't appear to support the notion of
"insert an entry with id N" anyway.

> Unfortunately radix-trees are presented as operating on
> void* data, so one would need to do some typecasting when storing
> BITS_PER_LONG-sized bitfields inside them.

That would mean adding something a bit like the IDA wrapper that
converts IDR to deal with bitfields?

Is the benefit of avoiding a vmalloc() at all costs really worth the
additional complexity, versus just doing:

if (size > 2 * PAGE_SIZE) {
        array = vmalloc(size);
} else {
        array = kmalloc(size, GFP_KERNEL);
}

?

That would only require vmalloc on cgroups with >2048 tasks, which is
going to be pretty rare, and is way simpler.
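The hybrid could be wrapped in a hypothetical allocate/free pair that re-derives the allocator from the size, so callers never track which one was used. In this userspace sketch, trivial shims stand in for the kernel allocators; the helper names are assumptions, not existing kernel API.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096
#define GFP_KERNEL 0

/* Userspace shims standing in for the kernel allocators. */
static void *kmalloc(size_t size, int flags) { (void)flags; return malloc(size); }
static void kfree(void *p)                   { free(p); }
static void *vmalloc(size_t size)            { return malloc(size); }
static void vfree(void *p)                   { free(p); }

/* Small pidlists come from kmalloc; anything over two pages goes to
 * vmalloc so we never attempt a large physically-contiguous allocation. */
static void *pidlist_allocate(size_t size)
{
	if (size > 2 * PAGE_SIZE)
		return vmalloc(size);
	return kmalloc(size, GFP_KERNEL);
}

/* The free side repeats the size check, so callers need only remember
 * the size, not which allocator was picked. */
static void pidlist_free(void *array, size_t size)
{
	if (size > 2 * PAGE_SIZE)
		vfree(array);
	else
		kfree(array);
}
```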

Paul

2009-07-03 18:10:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Fri, 3 Jul 2009 10:54:48 -0700 Paul Menage <[email protected]> wrote:

> > Unfortunately radix-trees are presented as operating on
> > void* data, so one would need to do some typecasting when storing
> > BITS_PER_LONG-sized bitfields inside them.
>
> That would mean adding something a bit like the IDA wrapper that
> converts IDR to deal with bitfields?

I guess so.

> Is the benefit of avoiding a vmalloc() at all costs really worth the
> additional complexity

Well no. But nor was it worth the additional complexity the last twenty
times someone resorted to vmalloc to solve a problem of this nature. Taking
a kernel-wide perspective here gives a different answer.

However I don't think a little scoreboarding thing (what's the correct
term) built around radix-trees would suffice to solve many of those
past sins. Whereas a general dynamic array thing would be applicable
in many cases.

2009-07-04 02:07:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

Paul Menage wrote:
> On Thu, Jul 2, 2009 at 10:54 PM, KAMEZAWA
> Hiroyuki<[email protected]> wrote:
>>
>> Why we can't do what readdir(/proc) does ? I'm sorry I misunderstand.
>> Following is an easy example.
>>
>>
>> 0. at open, initialize f_pos to 0. f_pos is used as "pid";
>>    remember "css_set with hole" as a template in f_private (or somewhere) at open
>>    ...like this.
>> --
>>    struct cgroupfs_root *root = cgrp->root;
>>    struct cgroup_subsys_state **template =
>>            kzalloc(sizeof(*template) * CGROUP_SUBSYS_COUNT, GFP_KERNEL);
>>
>>    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
>>        if (root->subsys_bits & (1UL << i))
>>            template[i] = cgrp->subsys[i];
>> --
>>
>>
>> 1. at read(), look up the task_struct for the "pid" in f_pos.
>> 2. compare that task's subsystem state with the template in f_private
>> --
>>    struct cgroup_subsys_state **template = f_private;
>>
>>    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
>>        if (!template[i])
>>            continue;
>>        if (template[i] != task_subsys_state(task, i))
>>            break;
>>    }
>>    if (i == CGROUP_SUBSYS_COUNT)
>>        print task;
>
> The problem with this is that the time taken to scan a single cgroup
> is linear in the total number of threads in the system, so if you have
> a lot of threads and a lot of cgroups (even if most of the threads are
> concentrated in a single cgroup) the time taken to scan all the tasks
> files is O(N^2) in the number of threads in the system. The current
> scheme is linear in the number of threads in a cgroup, so looking at
> all cgroups is linear in the number of threads in the system. (This
> O(N^2) problem is something that we've actually observed as an
> overhead on some busy systems at Google).
>
Yes, that's a problem, but not far from ps's performance.
Can the kmalloc() scheme walk faster than this under heavy memory pressure?
Anyway, the above algorithm shows that it's enough to have a per-cgroup bitmap
(whose size can be changed dynamically) rather than a big table and an ugly sort().
How about adding a per-cgroup task-id bitmap?
clear/set is very easy.

Thanks,
-Kame

2009-07-04 16:11:14

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

2009/7/3 KAMEZAWA Hiroyuki <[email protected]>:
> Anyway, above algorithm shows that it's enough to have per-cgroup bitmap
> (size can be dinamically changed) rather than big table and ugly sort().
> How about adding per-cgroup taskid bitmap ?
> clear/set is very easy.
>

A per-cgroup bitmap of task ids would mean (assuming that it's
implemented as a sparse tree like an IDA) that you'd add extra
allocations and possible failure points in the fork path - right now
the fork overhead imposed by the cgroups framework itself is just
linking yourself into your parent's css_set and bumping its refcount.
I guess there probably are workloads where doing more work at
fork/exit time and less at "tasks" scan time is a win, but that has to
be balanced against those where fork/exit performance is more
critical, and the fact that it would be adding another way that fork
could fail.

Paul

2009-07-05 06:39:01

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

* [email protected] <[email protected]> [2009-07-02 16:26:15]:

> The following series (written by Ben Blum) adds a "cgroup.procs" file
> to each cgroup that reports unique tgids rather than pids, and fixes a
> pid namespace bug in the existing "tasks" file that could cause
> readers in different namespaces to interfere with one another.
>
> The patches as a pair provide similar functionality to Li Zefan's
> patch posted yesterday, with the addition of the "cgroup.procs" file;
> if it's decided that Li's patch needs to be fast-tracked to 2.6.31,
> then these patches can be rebased as a small extension of Li's patch;
> if Li's patch doesn't need to go to 2.6.31 then it makes more sense to
> take this pair since they provide more overall functionality.
>

Paul, I don't see an interface to migrate all procs, or at least I
can't see one in the changelog. As discussed in the containers
mini-summit in 2008, it would be a nice thing to have (I've not looked
at the implementation yet).

> ---
>
> Ben Blum (2):
> Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
> Ensures correct concurrent opening/reading of pidlists across pid namespaces
>
>
> include/linux/cgroup.h | 46 +++++-
> kernel/cgroup.c | 355 +++++++++++++++++++++++++++++++++---------------
> 2 files changed, 285 insertions(+), 116 deletions(-)
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

--
Balbir

2009-07-05 23:55:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Sat, 4 Jul 2009 09:10:58 -0700
Paul Menage <[email protected]> wrote:

> 2009/7/3 KAMEZAWA Hiroyuki <[email protected]>:
> > Anyway, above algorithm shows that it's enough to have per-cgroup bitmap
> > (size can be dinamically changed) rather than big table and ugly sort().
> > How about adding per-cgroup taskid bitmap ?
> > clear/set is very easy.
> >
>
> A per-cgroup bitmap of task ids would mean (assuming that it's
> implemented as a sparse tree like an IDA) that you'd add extra
> allocations and possible failure points in the fork path - right now
> the fork overhead imposed by the cgroups framework itself is just
> linking yourself into your parent's css_set and bumping its refcount.
> I guess there probably are workloads where doing more work at
> fork/exit time and less at "tasks" scan time is a win, but that has to
> be balanced against those where fork/exit performance is more
> critical, and the fact that it would be adding another way that fork
> could fail.
>
ok.

But "sorting" in the procs file seems not sane ;)

-Kame

> Paul
>

2009-07-10 23:58:36

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Sat, Jul 4, 2009 at 11:38 PM, Balbir Singh<[email protected]> wrote:
>
> Paul, I don't see an interface to migrate all procs or at-least I
> can't read it in the changelog. As discussed in the containers
> mini-summit in 2008, it would be a nice thing to have (I've not looked
> at the implementation yet).
>

We're still working on that bit.

We want to provide the guarantee that if you write a tgid to the procs
file, all its threads get migrated even if you race with a clone()
call, and are trying to figure out a nice way to do that without
introducing unnecessary overhead/synchronization in the fork path.

Paul

2009-07-13 12:16:04

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

* [email protected] <[email protected]> [2009-07-10 16:58:23]:

> On Sat, Jul 4, 2009 at 11:38 PM, Balbir Singh<[email protected]> wrote:
> >
> > Paul, I don't see an interface to migrate all procs or at-least I
> > can't read it in the changelog. As discussed in the containers
> > mini-summit in 2008, it would be a nice thing to have (I've not looked
> > at the implementation yet).
> >
>
> We're still working on that bit.
>
> We want to provide the guarantee that if you write a tgid to the procs
> file, all its threads get migrated even if you race with a clone()
> call, and are trying to figure out a nice way to do that without
> introducing unnecessary overhead/synchronization in the fork path.
>

How about lazy migration? Mark a group to be moved when the kernel next
sees it for scheduling. We'll still need synchronization, but we can
amortize the cost. I've not tried it yet. BTW, we have quite a bit of
logic in libcgroup to deal with migration problems, especially during
classification, which Kenichi has added. We needed timestamping to get
migration working correctly during classification.

--
Balbir

2009-07-13 16:26:33

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Mon, Jul 13, 2009 at 5:11 AM, Balbir Singh<[email protected]> wrote:
>
> How about lazy migration? Mark a group as to move when the kernel sees
> it next for scheduling.

Waiting for the next scheduling point might be too long, since a
thread can block for arbitrary amounts of time and keeping the marker
around for arbitrary time (unless we add a new task_struct field)
would be tricky. Marking the cgroup or tgid as being migrated which
then triggers the extra synchronization in the fork path (but which
isn't needed at other times) is probably where we'll end up.

Paul

2009-07-14 05:56:45

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

* [email protected] <[email protected]> [2009-07-13 09:26:26]:

> On Mon, Jul 13, 2009 at 5:11 AM, Balbir Singh<[email protected]> wrote:
> >
> > How about lazy migration? Mark a group as to move when the kernel sees
> > it next for scheduling.
>
> Waiting for the next scheduling point might be too long, since a
> thread can block for arbitrary amounts of time and keeping the marker
> around for arbitrary time (unless we add a new task_struct field)
> would be tricky. Marking the cgroup or tgid as being migrated which
> then triggers the extra synchronization in the fork path (but which
> isn't needed at other times) is probably where we'll end up.


Hmm... but we would not need that information till we schedule the
tasks; adding a field to task_struct is what I had in mind.

--
Balbir

2009-07-14 06:49:25

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Mon, Jul 13, 2009 at 10:56 PM, Balbir Singh<[email protected]> wrote:
>>
>> Waiting for the next scheduling point might be too long, since a
>> thread can block for arbitrary amounts of time and keeping the marker
>> around for arbitrary time (unless we add a new task_struct field)
>> would be tricky. Marking the cgroup or tgid as being migrated which
>> then triggers the extra synchronization in the fork path (but which
>> isn't needed at other times) is probably where we'll end up.
>
>
> Hmm... but we would not need that information till we schedule the
> tasks, adding a field to task_struct is what I had in mind.

Waiting until schedule to move the threads would result in them still
showing up in the old "tasks" file until they next ran, which would be
confusing and misleading.

As a first cut, we were planning to add an rwsem that gets taken for
read in cgroup_fork(), released in cgroup_post_fork(), and taken for
write when moving an entire process to a new cgroup; not ideal
performance-wise, but safe.

If adding a field to task_struct is an option, then the rwsem could be
per thread-group leader, which would reduce contention.

Paul

2009-07-14 07:16:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

* [email protected] <[email protected]> [2009-07-13 23:49:16]:

> On Mon, Jul 13, 2009 at 10:56 PM, Balbir Singh<[email protected]> wrote:
> >>
> >> Waiting for the next scheduling point might be too long, since a
> >> thread can block for arbitrary amounts of time and keeping the marker
> >> around for arbitrary time (unless we add a new task_struct field)
> >> would be tricky. Marking the cgroup or tgid as being migrated which
> >> then triggers the extra synchronization in the fork path (but which
> >> isn't needed at other times) is probably where we'll end up.
> >
> >
> > Hmm... but we would not need that information till we schedule the
> > tasks, adding a field to task_struct is what I had in mind.
>
> Waiting until schedule to move the threads would result in them still
> showing up in the old "tasks" file until they next ran, which would be
> confusing and misleading.
>

Yes, agreed; even lazy migration based on first access to the data
structure might not be too helpful, since it can be very context
dependent.

> As a first cut, we were planning to add an rwsem that gets taken for
> read in cgroup_fork(), released in cgroup_post_fork(), and taken for
> write when moving an entire process to a new cgroup; not ideal
> performance-wise, but safe.
>
> If adding a field to task_struct is an option, then the rwsem could be
> per thread-group leader, which would reduce contention.
>

We should also document that moving large processes with several
threads can be expensive.


--
Balbir

2009-07-14 17:34:19

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Tue, Jul 14, 2009 at 3:16 AM, Balbir Singh<[email protected]> wrote:
> * [email protected] <[email protected]> [2009-07-13 23:49:16]:
>> As a first cut, we were planning to add an rwsem that gets taken for
>> read in cgroup_fork(), released in cgroup_post_fork(), and taken for
>> write when moving an entire process to a new cgroup; not ideal
>> performance-wise, but safe.
>>
>> If adding a field to task_struct is an option, then the rwsem could be
>> per thread-group leader, which would reduce contention.

That would indeed help, but would only improve system-wide performance
in the case with write-contention (no improvement at all if, say, the
threadgroup is the only one forking, or when nobody is writing to the
procs file). While that's preferable to a global lock, if we can add a
field to task_struct, a (lockless) flag-based approach might be
possible.

> We should also document that moving large processes with several
> threads can be expensive.

Indeed. Or more specifically, that moving fast-growing threadgroups
can be expensive.

2009-07-14 17:44:03

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Tue, Jul 14, 2009 at 10:34 AM, Benjamin Blum<[email protected]> wrote:
> procs file). While that's preferable to a global lock, if we can add a
> field to task_struct, a (lockless) flag-based approach might be
> possible.
>

I've been trying to think of a way to do that. AFAICS the only way to
do that reliably would be to move the call to cgroup_fork() that hooks
into the parent's cgroup inside the lock on the group leader's thread
list, and move the fork callbacks into cgroup_fork(). (Which would
mean that they'd not be able to sleep/fail, etc).

Paul

2009-07-14 20:38:40

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Tue, Jul 14, 2009 at 10:43 AM, Paul Menage<[email protected]> wrote:
>
> I've been trying to think of a way to do that. AFAICS the only way to
> do that reliably would be to move the call to cgroup_fork() that hooks
> into the parent's cgroup inside the lock on the group leader's thread
> list, and move the fork callbacks into cgroup_fork(). (Which would
> mean that they'd not be able to sleep/fail, etc).

Currently the only user of the cgroup fork callbacks is the freezer cgroup.

Matt, if this callback was moved inside tasklist_lock, would that
present any problems? Given that in other places you call
freeze_task() from inside other low-level locks like css_set_lock
(within a cgroup iteration) then hopefully it would be OK.

The only question then would be whether anything between the point
where cgroup_fork() is currently called, and the point where the new
thread is added to its thread group list, cares about p->cgroups being
valid. We can probably flush out any such assumptions by clearing
tsk->cgroups in dup_task_struct, so that any attempts to reference it
would immediately oops.

Then (assuming this doesn't expose any such dependencies) we can move
the cgroup_fork() call inside the write_lock_irq(&tasklist_lock) in
do_fork(). The "procs" handler can (once it has done any necessary
allocations) take a read_lock on tasklist_lock, and atomically move
all threads to the new cgroup.

Paul

2009-07-14 23:08:47

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 0/2] CGroups: cgroup member list enhancement/fix

On Tue, Jul 14, 2009 at 01:38:30PM -0700, Paul Menage wrote:
> On Tue, Jul 14, 2009 at 10:43 AM, Paul Menage<[email protected]> wrote:
> >
> > I've been trying to think of a way to do that. AFAICS the only way to
> > do that reliably would be to move the call to cgroup_fork() that hooks
> > into the parent's cgroup inside the lock on the group leader's thread
> > list, and move the fork callbacks into cgroup_fork(). (Which would
> > mean that they'd not be able to sleep/fail, etc).
>
> Currently the only user of the cgroup fork callbacks is the freezer cgroup.
>
> Matt, if this callback was moved inside tasklist_lock, would that
> present any problems? Given that in other places you call
> freeze_task() from inside other low-level locks like css_set_lock
> (within a cgroup iteration) then hopefully it would be OK.

Yes, it should be OK.

>
> The only question then would be whether anything between the point
> where cgroup_fork() is currently called, and the point where the new
> thread is added to its thread group list, cares about p->cgroups being
> valid. We can probably flush out any such assumptions by clearing
> tsk->cgroups in dup_task_struct, so that any attempts to reference it
> would immediately oops.

This is harder to verify the correctness of. You're probably right but
I'm not completely convinced yet.

Cheers,
-Matt

2009-07-15 08:33:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

Andrew Morton <[email protected]> writes:

> On Fri, 3 Jul 2009 10:54:48 -0700 Paul Menage <[email protected]> wrote:
>
>> > Unfortunately radix-trees are presented as operating on
>> > void* data, so one would need to do some typecasting when storing
>> > BITS_PER_LONG-sized bitfields inside them.
>>
>> That would mean adding something a bit like the IDA wrapper that
>> converts IDR to deal with bitfields?
>
> I guess so.
>
>> Is the benefit of avoiding a vmalloc() at all costs really worth the
>> additional complexity
>
> Well no. But nor was it worth the additional complexity the last twenty
> times someone resorted to vmalloc to solve a problem of this nature. Taking
> a kernel-wide perspective here gives a different answer.
>
> However I don't think a little scoreboarding thing (what's the correct
> term) built around radix-trees would suffice to solve many of those
> past sins. Whereas a general dynamic array thing would be applicable
> in many cases.

It is even easier. Just grab the logic from proc_pid_readdir.
It uses rcu locking.
It returns pids in order.
It needs no mallocs to use.

The people who ran benchmarks tell me I actually sped up /proc
when I started traversing the existing bitmap of pids.

It makes the guarantee that for every process that existed for
the length of the operation you will see its pid. We don't say
anything about processes that die or are born halfway through,
but if they stay around you will get them next time.

I think guaranteeing a truly atomic snapshot is likely to be a
horrible idea, requiring all kinds of nasty locking and raising
SMP scalability issues. So please walk the list of pids and
just return those that belong to your cgroup.

Compared to the rest of the implementations everyone is balking
at, it should be simple.

Eric

2009-07-15 16:18:39

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

On Wed, Jul 15, 2009 at 1:33 AM, Eric W. Biederman<[email protected]> wrote:
>
> I think guaranteeing a truly atomic snapshot is likely to be a
> horrible idea requiring all kinds of nasty locking,

We don't guarantee a truly atomic snapshot unless you manage to read
the entire file with a single read() call.

> and smp
> scalability issues. So please walk the list of pids and
> just return those that belong to your cgroup.

The downside with that is that scanning any cgroup takes O(n) in the
number of threads on the machine, so scanning them all becomes O(n^2).
We've definitely seen problems (on older kernels using cpusets, which
did something similar, i.e. walking the tasklist) where we have lots
of small cpusets and a few huge ones, and this blew up the cost of
accessing any of them.

But having said that, the idea of being able to maintain just a cursor
is something that would definitely be nice.

Here's another idea that might work:

Currently, each cgroup has a list running through the attached css_set
objects, and each css_set has a list running through its tasks; we
iterate through these lists of lists to produce the cgroup's task list.

Since this list isn't sorted in any way, there's no convenient way to
save/restore your position between seq_file invocations; this is why
we currently generate a sorted snapshot, so that even if the snapshot
is updated by someone else before our next read, we know where to pick
up from (the next pid above the last one that we returned).

Instead, we could actually store cursor objects in the list itself
whenever we need to pause/resume iterating through a large cgroup (due
to hitting the limits of a single seq_file read, i.e. probably after
every 700 threads).

Then we'd just need to teach cgroup_iter_next() to distinguish between
real tasks and cursors, and skip the latter. A simple way to do that
would be to change the existing declarations in task_struct:

#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock */
struct css_set *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock */
struct list_head cg_list;
#endif

and instead define these two fields together as a struct
cgroup_css_list_elem. A cursor can just be a cgroup_css_list_elem
whose cgroups field points to a distinguishing address that identifies
it as a cursor.

So we're guaranteed to hit all threads that are in the cgroup before
we start, and stay there until we finish; there are no guarantees
about threads that move into or out of the cgroup while we're
iterating.

It's a bit more complex than just iterating over the machine's entire
pid list, but I think it scales better.

Paul