2009-07-24 03:21:50

by Ben Blum

Subject: [PATCH 0/6] CGroups: cgroup memberlist enhancement+fix

(This patch series contains revisions of the patches from
http://lkml.org/lkml/2009/7/10/396 and a few more.)

The following series adds a "cgroup.procs" file to each cgroup. It reports
unique tgids rather than pids, and can also be written to in order to move
all threads in a threadgroup at once.
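
For illustration, a minimal userspace sketch of the new interface (the mount
point below is made up for the example; only the "cgroup.procs" file name
comes from this series):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Move every thread in threadgroup 'tgid' at once by writing the tgid
 * into the target cgroup's cgroup.procs file. */
static int move_threadgroup(pid_t tgid)
{
	char buf[32];
	int fd, len, ret;

	fd = open("/dev/cgroup/mygroup/cgroup.procs", O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%d", (int)tgid);
	ret = (write(fd, buf, len) == len) ? 0 : -1;
	close(fd);
	return ret;
}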

Patch #5 introduces a new rwsem that must be taken for reading in the fork()
path, and patch #6 reveals a potential race in certain subsystems when a
thread forks before the subsystem's attach() function is called; the naive
solution would be to hold the fork rwsem until after the attach loop.
Suggestions for alternative approaches or tweaks to the current approach are
welcome. One potential fix is to make the fork rwsem per-threadgroup, which
would involve adding a field to task_struct but would drastically reduce
contention while a write to the procs file is in progress.
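
As a rough sketch of the locking scheme (simplified names for illustration;
the real lock added in patch 5 is cgroup_fork_mutex, and the write side is
additionally serialized by cgroup_mutex):

#include <linux/rwsem.h>

static DECLARE_RWSEM(fork_vs_procs_write_sem);

/* fork() path: cheap shared acquisition around the cgroup fork setup */
static void fork_side(void)
{
	down_read(&fork_vs_procs_write_sem);
	/* copy the parent's cgroup pointer, finish cgroup fork bookkeeping */
	up_read(&fork_vs_procs_write_sem);
}

/* procs-file write path: exclusive acquisition while migrating the group */
static void procs_write_side(void)
{
	down_write(&fork_vs_procs_write_sem);
	/* walk the threadgroup list and migrate each thread */
	up_write(&fork_vs_procs_write_sem);
}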

This patch series was written at the same time as Li Zefan's pid namespace
bugfix patch (from http://lkml.org/lkml/2009/7/1/559 ), and contains a similar
but finer-grained fix for the same bug. These patches can either be rewritten
to be applied on top of Li's patch, or be applied as they are with Li's patch
reversed.

---

Ben Blum (6):
Lets ss->can_attach and ss->attach do whole threadgroups at a time
Makes procs file writable to move all threads by tgid at once
Changes css_set freeing mechanism to be under RCU
Quick vmalloc vs kmalloc fix to the case where array size is too large
Ensures correct concurrent opening/reading of pidlists across pid namespaces
Adds a read-only "procs" file similar to "tasks" that shows only unique tgids


Documentation/cgroups/cgroups.txt | 12 -
include/linux/cgroup.h | 58 ++-
kernel/cgroup.c | 816 ++++++++++++++++++++++++++++++-------
kernel/cgroup_freezer.c | 15 +
kernel/cpuset.c | 65 ++-
kernel/fork.c | 2
kernel/ns_cgroup.c | 16 +
kernel/sched.c | 37 ++
mm/memcontrol.c | 3
security/device_cgroup.c | 3
10 files changed, 843 insertions(+), 184 deletions(-)


2009-07-24 03:22:32

by Ben Blum

Subject: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

Makes procs file writable to move all threads by tgid at once

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. The current implementation uses an rwsem that is taken for reading in
the fork() path to prevent newly forked threads within the threadgroup from
"escaping" while a move is in progress.

Signed-off-by: Ben Blum <[email protected]>

---

Documentation/cgroups/cgroups.txt | 12 +
include/linux/cgroup.h | 2
kernel/cgroup.c | 422 +++++++++++++++++++++++++++++++++----
kernel/fork.c | 2
4 files changed, 393 insertions(+), 45 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 6eb1a97..d579346 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -228,6 +228,7 @@ Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:

- tasks: list of tasks (by pid) attached to that cgroup
+ - cgroup.procs: list of unique tgids in the cgroup
- notify_on_release flag: run the release agent on exit?
- release_agent: the path to use for release notifications (this file
exists in the top cgroup only)
@@ -374,7 +375,7 @@ Now you want to do something with this cgroup.

In this directory you can find several files:
# ls
-notify_on_release tasks
+cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)

Now attach your shell to this cgroup:
@@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:

# echo 0 > tasks

+The cgroup.procs file is useful for managing all tasks in a threadgroup at
+once. It works the same way as the tasks file, but moves all tasks in the
+threadgroup with the specified tgid.
+
+Writing the pid of a task that's not the threadgroup leader (i.e., a pid
+that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
+attach the writing task and all tasks in its threadgroup, but is invalid if
+the writing task is not the leader of the threadgroup.
+
3. Kernel API
=============

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 24e3f1a..cae7d3e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -34,6 +34,7 @@ extern void cgroup_fork(struct task_struct *p);
extern void cgroup_fork_callbacks(struct task_struct *p);
extern void cgroup_post_fork(struct task_struct *p);
extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks);
extern int cgroupstats_build(struct cgroupstats *stats,
struct dentry *dentry);

@@ -554,6 +555,7 @@ static inline void cgroup_fork(struct task_struct *p) {}
static inline void cgroup_fork_callbacks(struct task_struct *p) {}
static inline void cgroup_post_fork(struct task_struct *p) {}
static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_fork_failed(struct task_struct *p, int callbacks) {}

static inline void cgroup_lock(void) {}
static inline void cgroup_unlock(void) {}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 637a54e..3f8d323 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -254,6 +254,18 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
* reduces the fork()/exit() overhead for people who have cgroups
* compiled into their kernel but not actually in use */
static int use_task_css_set_links __read_mostly;
+/* This rwsem locks out cgroup_attach_proc() from races with fork().
+ * If a thread with a tgid that's being moved via the procs file tries
+ * to fork, its child thread could escape the iteration across the
+ * threadgroup if it copies its parent cgroup pointer before the parent
+ * gets moved but doesn't add itself to the threadgroup list or finish
+ * cgroup fork routines until afterwards. The way we solve this is by
+ * taking this lock in read mode in the fork path across the sensitive
+ * section (at the cost of a cache miss when there's no contention),
+ * and as a write lock in cgroup_attach_proc(). Note of course that
+ * there will never be more than one cgroup_attach_proc contending, as
+ * it needs to be holding the cgroup_mutex to begin with. */
+static DECLARE_RWSEM(cgroup_fork_mutex);

/* When we create or destroy a css_set, the operation simply
* takes/releases a reference count on all the cgroups referenced
@@ -1297,6 +1309,87 @@ static void get_first_subsys(const struct cgroup *cgrp,
*subsys_id = test_ss->subsys_id;
}

+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail
+ * with -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+ struct task_struct *tsk, int guarantee)
+{
+ struct css_set *oldcg;
+ struct css_set *newcg;
+
+ /*
+ * get old css_set. we need to take task_lock and refcount it, because
+ * an exiting task can change its css_set to init_css_set and drop its
+ * old one without taking cgroup_mutex.
+ */
+ task_lock(tsk);
+ oldcg = tsk->cgroups;
+ get_css_set(oldcg);
+ task_unlock(tsk);
+ /*
+ * locate or allocate a new css_set for this task. 'guarantee' tells
+ * us whether or not we are sure that a new css_set already exists;
+ * in that case, we are not allowed to fail, as we won't need malloc.
+ */
+ if (guarantee) {
+ /*
+ * our caller promises us that the css_set we want already
+ * exists, so we use find_existing_css_set directly.
+ */
+ struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+ read_lock(&css_set_lock);
+ newcg = find_existing_css_set(oldcg, cgrp, template);
+ BUG_ON(!newcg);
+ get_css_set(newcg);
+ read_unlock(&css_set_lock);
+ } else {
+ might_sleep();
+ /* find_css_set will give us newcg already referenced. */
+ newcg = find_css_set(oldcg, cgrp);
+ if (!newcg) {
+ put_css_set(oldcg);
+ return -ENOMEM;
+ }
+ }
+ put_css_set(oldcg);
+
+ /*
+ * we cannot move a task that's declared itself as exiting, as once
+ * PF_EXITING is set, the tsk->cgroups pointer is no longer safe.
+ */
+ task_lock(tsk);
+ if (tsk->flags & PF_EXITING) {
+ task_unlock(tsk);
+ put_css_set(newcg);
+ return -ESRCH;
+ }
+ rcu_assign_pointer(tsk->cgroups, newcg);
+ task_unlock(tsk);
+
+ /* Update the css_set linked lists if we're using them */
+ write_lock(&css_set_lock);
+ if (!list_empty(&tsk->cg_list)) {
+ list_del(&tsk->cg_list);
+ list_add(&tsk->cg_list, &newcg->tasks);
+ }
+ write_unlock(&css_set_lock);
+
+ /*
+ * We just gained a reference on oldcg by taking it from the task. As
+ * trading it for newcg is protected by cgroup_mutex, we're safe to
+ * drop it here; it will be freed under RCU.
+ */
+ put_css_set(oldcg);
+
+ set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+ return 0;
+}
+
/**
* cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
* @cgrp: the cgroup the task is attaching to
@@ -1307,11 +1400,9 @@ static void get_first_subsys(const struct cgroup *cgrp,
*/
int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
- int retval = 0;
+ int retval;
struct cgroup_subsys *ss;
struct cgroup *oldcgrp;
- struct css_set *cg;
- struct css_set *newcg;
struct cgroupfs_root *root = cgrp->root;
int subsys_id;

@@ -1330,75 +1421,294 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
}
}

- task_lock(tsk);
- cg = tsk->cgroups;
- get_css_set(cg);
- task_unlock(tsk);
+ retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
+ if (retval)
+ return retval;
+
+ for_each_subsys(root, ss) {
+ if (ss->attach)
+ ss->attach(ss, cgrp, oldcgrp, tsk);
+ }
+
+ synchronize_rcu();
+
/*
- * Locate or allocate a new css_set for this task,
- * based on its final set of cgroups
+ * wake up rmdir() waiter. the rmdir should fail since the cgroup
+ * is no longer empty.
*/
+ cgroup_wakeup_rmdir_waiters(cgrp);
+ return 0;
+}
+
+/*
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list, of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+ struct css_set *cg;
+ struct list_head links;
+};
+
+static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
+ struct css_set *cg,
+ struct list_head *newcg_list)
+{
+ struct css_set *newcg;
+ struct cg_list_entry *cg_entry;
+ struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+ read_lock(&css_set_lock);
+ newcg = find_existing_css_set(cg, cgrp, template);
+ if (newcg)
+ get_css_set(newcg);
+ read_unlock(&css_set_lock);
+ /* doesn't exist at all? */
+ if (!newcg)
+ return 1;
+ /* see if it's already in the list */
+ list_for_each_entry(cg_entry, newcg_list, links) {
+ if (cg_entry->cg == newcg) {
+ put_css_set(newcg);
+ return 0;
+ }
+ }
+ /* not found */
+ put_css_set(newcg);
+ return 1;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving
+ * the given task to the given cgroup. Returns 0 on success, -ENOMEM if we
+ * run out of memory.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+ struct list_head *newcg_list)
+{
+ struct css_set *newcg;
+ struct cg_list_entry *cg_entry;
+ /* ensure a new css_set will exist for this thread */
newcg = find_css_set(cg, cgrp);
- put_css_set(cg);
if (!newcg)
return -ENOMEM;
-
- task_lock(tsk);
- if (tsk->flags & PF_EXITING) {
- task_unlock(tsk);
+ /* add new element to list */
+ cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+ if (!cg_entry) {
put_css_set(newcg);
- return -ESRCH;
+ return -ENOMEM;
}
- rcu_assign_pointer(tsk->cgroups, newcg);
- task_unlock(tsk);
+ cg_entry->cg = newcg;
+ list_add(&cg_entry->links, newcg_list);
+ return 0;
+}

- /* Update the css_set linked lists if we're using them */
- write_lock(&css_set_lock);
- if (!list_empty(&tsk->cg_list)) {
- list_del(&tsk->cg_list);
- list_add(&tsk->cg_list, &newcg->tasks);
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+ int retval;
+ struct cgroup_subsys *ss;
+ struct cgroup *oldcgrp;
+ struct css_set *oldcg;
+ struct cgroupfs_root *root = cgrp->root;
+ int subsys_id;
+ /* threadgroup list cursor */
+ struct task_struct *tsk;
+ /*
+ * we need to make sure we have css_sets for all the tasks we're
+ * going to move -before- we actually start moving them, so that in
+ * case we get an ENOMEM we can bail out before making any changes.
+ */
+ struct list_head newcg_list;
+ struct cg_list_entry *cg_entry;
+
+ /* first, make sure this came from a valid tgid */
+ if (!thread_group_leader(leader))
+ return -EINVAL;
+ /*
+ * check that we can legitimately attach to the cgroup.
+ */
+ for_each_subsys(root, ss) {
+ if (ss->can_attach) {
+ retval = ss->can_attach(ss, cgrp, leader);
+ if (retval)
+ return retval;
+ }
}
- write_unlock(&css_set_lock);

+ get_first_subsys(cgrp, NULL, &subsys_id);
+
+ /*
+ * step 1: make sure css_sets exist for all threads to be migrated.
+ * we use find_css_set, which allocates a new one if necessary.
+ */
+ INIT_LIST_HEAD(&newcg_list);
+ oldcgrp = task_cgroup(leader, subsys_id);
+ if (cgrp != oldcgrp) {
+ /* get old css_set */
+ task_lock(leader);
+ if (leader->flags & PF_EXITING) {
+ task_unlock(leader);
+ retval = -ESRCH;
+ goto list_teardown;
+ }
+ oldcg = leader->cgroups;
+ get_css_set(oldcg);
+ task_unlock(leader);
+ /* acquire new one */
+ retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+ put_css_set(oldcg);
+ if (retval)
+ goto list_teardown;
+ }
+again:
+ rcu_read_lock();
+ /*
+ * if we need to fetch a new css_set for this task, we must exit the
+ * rcu_read section because allocating it can sleep. afterwards, we'll
+ * need to restart iteration on the threadgroup list - the whole thing
+ * will be O(nm) in the number of threads and css_sets; as the typical
+ * case only has one css_set for all of them, usually O(n).
+ */
+ list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+ /* nothing to do if this task is already in the cgroup */
+ oldcgrp = task_cgroup(tsk, subsys_id);
+ if (cgrp == oldcgrp)
+ continue;
+ /* get old css_set pointer */
+ task_lock(tsk);
+ if (tsk->flags & PF_EXITING) {
+ /* ignore this task if it's going away */
+ task_unlock(tsk);
+ continue;
+ }
+ oldcg = tsk->cgroups;
+ get_css_set(oldcg);
+ task_unlock(tsk);
+ /* see if the new one for us is already in the list? */
+ retval = css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list);
+ if (retval) {
+ /* we don't already have it. get new one. */
+ rcu_read_unlock();
+ retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+ put_css_set(oldcg);
+ if (retval)
+ goto list_teardown;
+ /* begin iteration again. */
+ goto again;
+ } else {
+ /* was already there, nothing to do. */
+ put_css_set(oldcg);
+ }
+ }
+ rcu_read_unlock();
+
+ /*
+ * step 2: now that we're guaranteed success wrt the css_sets, proceed
+ * to move all tasks to the new cgroup. the only fail case henceforth
+ * is if the threadgroup leader has PF_EXITING set (in which case all
+ * the other threads get killed) - if other threads happen to be
+ * exiting, we just ignore them and move on.
+ */
+ oldcgrp = task_cgroup(leader, subsys_id);
+ /* if leader is already there, skip moving him */
+ if (cgrp != oldcgrp) {
+ retval = cgroup_task_migrate(cgrp, oldcgrp, leader, 1);
+ if (retval) {
+ BUG_ON(retval != -ESRCH);
+ goto list_teardown;
+ }
+ }
+ /*
+ * now move all the rest of the threads - need to lock against
+ * possible races with fork().
+ */
+ down_write(&cgroup_fork_mutex);
+ rcu_read_lock();
+ list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+ /* leave current thread as it is if it's already there */
+ oldcgrp = task_cgroup(tsk, subsys_id);
+ if (cgrp == oldcgrp)
+ continue;
+ /* we don't care whether these threads are exiting */
+ retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
+ BUG_ON(retval != 0 && retval != -ESRCH);
+ }
+ rcu_read_unlock();
+ up_write(&cgroup_fork_mutex);
+
+ /*
+ * step 3: attach whole threadgroup to each subsystem
+ */
for_each_subsys(root, ss) {
if (ss->attach)
- ss->attach(ss, cgrp, oldcgrp, tsk);
+ ss->attach(ss, cgrp, oldcgrp, leader);
}
- set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
- synchronize_rcu();
- put_css_set(cg);

/*
- * wake up rmdir() waiter. the rmdir should fail since the cgroup
- * is no longer empty.
+ * step 4: success! ...and cleanup
*/
+ synchronize_rcu();
cgroup_wakeup_rmdir_waiters(cgrp);
- return 0;
+ retval = 0;
+list_teardown:
+ /* no longer need the list of css_sets, so get rid of it */
+ while (!list_empty(&newcg_list)) {
+ /* pop from the list */
+ cg_entry = list_first_entry(&newcg_list, struct cg_list_entry,
+ links);
+ list_del(&cg_entry->links);
+ /* drop the refcount */
+ put_css_set(cg_entry->cg);
+ kfree(cg_entry);
+ }
+ /* done! */
+ return retval;
}

/*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
*/
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid,
+ int attach(struct cgroup *,
+ struct task_struct *))
{
struct task_struct *tsk;
const struct cred *cred = current_cred(), *tcred;
int ret;

+ if (!cgroup_lock_live_group(cgrp))
+ return -ENODEV;
+
if (pid) {
rcu_read_lock();
tsk = find_task_by_vpid(pid);
if (!tsk || tsk->flags & PF_EXITING) {
rcu_read_unlock();
+ cgroup_unlock();
return -ESRCH;
}
-
+ /*
+ * even if we're attaching all tasks in the thread group, we
+ * only need to check permissions on the group leader, because
+ * even if another task has different permissions, the group
+ * leader will have sufficient access to change it.
+ */
tcred = __task_cred(tsk);
if (cred->euid &&
cred->euid != tcred->uid &&
cred->euid != tcred->suid) {
rcu_read_unlock();
+ cgroup_unlock();
return -EACCES;
}
get_task_struct(tsk);
@@ -1408,19 +1718,25 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
get_task_struct(tsk);
}

- ret = cgroup_attach_task(cgrp, tsk);
+ /*
+ * Note that the check for whether the task is its threadgroup leader
+ * is done in cgroup_attach_proc. This means that writing 0 to the
+ * procs file will only work if the writing task is the leader.
+ */
+ ret = attach(cgrp, tsk);
put_task_struct(tsk);
+ cgroup_unlock();
return ret;
}

static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
{
- int ret;
- if (!cgroup_lock_live_group(cgrp))
- return -ENODEV;
- ret = attach_task_by_pid(cgrp, pid);
- cgroup_unlock();
- return ret;
+ return attach_task_by_pid(cgrp, pid, cgroup_attach_task);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
+ return attach_task_by_pid(cgrp, tgid, cgroup_attach_proc);
}

/**
@@ -2580,9 +2896,9 @@ static struct cftype files[] = {
{
.name = CGROUP_FILE_GENERIC_PREFIX "procs",
.open = cgroup_procs_open,
- /* .write_u64 = cgroup_procs_write, TODO */
+ .write_u64 = cgroup_procs_write,
.release = cgroup_pidlist_release,
- .mode = S_IRUGO,
+ .mode = S_IRUGO | S_IWUSR,
},
{
.name = "notify_on_release",
@@ -3185,6 +3501,7 @@ static struct file_operations proc_cgroupstats_operations = {
*/
void cgroup_fork(struct task_struct *child)
{
+ down_read(&cgroup_fork_mutex);
task_lock(current);
child->cgroups = current->cgroups;
get_css_set(child->cgroups);
@@ -3231,6 +3548,7 @@ void cgroup_post_fork(struct task_struct *child)
task_unlock(child);
write_unlock(&css_set_lock);
}
+ up_read(&cgroup_fork_mutex);
}
/**
* cgroup_exit - detach cgroup from exiting task
@@ -3302,6 +3620,24 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
}

/**
+ * cgroup_fork_failed - undo operations for fork failure
+ * @tsk: pointer to task_struct of exiting process
+ * @run_callback: run exit callbacks?
+ *
+ * Description: Undo cgroup operations after cgroup_fork in fork failure.
+ *
+ * We release the read lock that was taken in cgroup_fork(), since it is
+ * supposed to be dropped in cgroup_post_fork in the success case. The other
+ * thing that wants to be done is detaching the failed child task from the
+ * cgroup, so we wrap cgroup_exit.
+ */
+void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks)
+{
+ up_read(&cgroup_fork_mutex);
+ cgroup_exit(tsk, run_callbacks);
+}
+
+/**
* cgroup_clone - clone the cgroup the given subsystem is attached to
* @tsk: the task to be moved
* @subsys: the given subsystem
diff --git a/kernel/fork.c b/kernel/fork.c
index 926c117..027ec16 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
mpol_put(p->mempolicy);
bad_fork_cleanup_cgroup:
#endif
- cgroup_exit(p, cgroup_callbacks_done);
+ cgroup_fork_failed(p, cgroup_callbacks_done);
delayacct_tsk_free(p);
if (p->binfmt)
module_put(p->binfmt->module);

2009-07-24 03:22:23

by Ben Blum

Subject: [PATCH 6/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time

Lets ss->can_attach and ss->attach do whole threadgroups at a time

This patch alters the ss->can_attach and ss->attach functions so they can
handle a whole threadgroup at a time, for use in cgroup_attach_proc. (This
is a post-patch to cgroup-procs-writable.patch.)

Signed-off-by: Ben Blum <[email protected]>

---

include/linux/cgroup.h | 7 +++--
kernel/cgroup.c | 8 +++---
kernel/cgroup_freezer.c | 15 +++++++++--
kernel/cpuset.c | 65 ++++++++++++++++++++++++++++++++++++----------
kernel/ns_cgroup.c | 16 ++++++++++-
kernel/sched.c | 37 ++++++++++++++++++++++++--
mm/memcontrol.c | 3 +-
security/device_cgroup.c | 3 +-
8 files changed, 124 insertions(+), 30 deletions(-)
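
Every subsystem hunk below follows the same shape; as a condensed sketch,
with a hypothetical "example" subsystem and per-task helper standing in for
the real ones:

/* When 'threadgroup' is true, repeat the per-task check for every thread
 * in tsk's group, walking the thread_group list under rcu_read_lock(). */
static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			      struct task_struct *tsk, bool threadgroup)
{
	int ret = example_can_attach_task(cgrp, tsk);	/* hypothetical helper */
	if (ret)
		return ret;
	if (threadgroup) {
		struct task_struct *c;

		rcu_read_lock();
		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
			ret = example_can_attach_task(cgrp, c);
			if (ret) {
				rcu_read_unlock();
				return ret;
			}
		}
		rcu_read_unlock();
	}
	return 0;
}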

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index cae7d3e..5a4383c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -409,10 +409,11 @@ struct cgroup_subsys {
struct cgroup *cgrp);
int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
- int (*can_attach)(struct cgroup_subsys *ss,
- struct cgroup *cgrp, struct task_struct *tsk);
+ int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct task_struct *tsk, bool threadgroup);
void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct cgroup *old_cgrp, struct task_struct *tsk);
+ struct cgroup *old_cgrp, struct task_struct *tsk,
+ bool threadgroup);
void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3f8d323..abdfa5a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1415,7 +1415,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)

for_each_subsys(root, ss) {
if (ss->can_attach) {
- retval = ss->can_attach(ss, cgrp, tsk);
+ retval = ss->can_attach(ss, cgrp, tsk, false);
if (retval)
return retval;
}
@@ -1427,7 +1427,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)

for_each_subsys(root, ss) {
if (ss->attach)
- ss->attach(ss, cgrp, oldcgrp, tsk);
+ ss->attach(ss, cgrp, oldcgrp, tsk, false);
}

synchronize_rcu();
@@ -1537,7 +1537,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
*/
for_each_subsys(root, ss) {
if (ss->can_attach) {
- retval = ss->can_attach(ss, cgrp, leader);
+ retval = ss->can_attach(ss, cgrp, leader, true);
if (retval)
return retval;
}
@@ -1649,7 +1649,7 @@ again:
*/
for_each_subsys(root, ss) {
if (ss->attach)
- ss->attach(ss, cgrp, oldcgrp, leader);
+ ss->attach(ss, cgrp, oldcgrp, leader, true);
}

/*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..4e352ab 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -159,10 +159,9 @@ static bool is_task_frozen_enough(struct task_struct *task)
*/
static int freezer_can_attach(struct cgroup_subsys *ss,
struct cgroup *new_cgroup,
- struct task_struct *task)
+ struct task_struct *task, bool threadgroup)
{
struct freezer *freezer;
-
/*
* Anything frozen can't move or be moved to/from.
*
@@ -177,6 +176,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
if (freezer->state == CGROUP_FROZEN)
return -EBUSY;

+ if (threadgroup) {
+ struct task_struct *c;
+ rcu_read_lock();
+ list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+ if (is_task_frozen_enough(c)) {
+ rcu_read_unlock();
+ return -EBUSY;
+ }
+ }
+ rcu_read_unlock();
+ }
+
return 0;
}

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 7e75a41..86397f4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
static cpumask_var_t cpus_attach;

/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys *ss,
- struct cgroup *cont, struct task_struct *tsk)
+static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+ struct task_struct *tsk, bool threadgroup)
{
+ int ret;
struct cpuset *cs = cgroup_cs(cont);

if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1343,18 +1344,50 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
if (tsk->flags & PF_THREAD_BOUND)
return -EINVAL;

- return security_task_setscheduler(tsk, 0, NULL);
+ ret = security_task_setscheduler(tsk, 0, NULL);
+ if (ret)
+ return ret;
+ if (threadgroup) {
+ struct task_struct *c;
+ rcu_read_lock();
+ list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+ ret = security_task_setscheduler(c, 0, NULL);
+ if (ret) {
+ rcu_read_unlock();
+ return ret;
+ }
+ }
+ rcu_read_unlock();
+ }
+ return 0;
+}
+
+static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
+ struct cpuset *cs)
+{
+ int err;
+ /*
+ * can_attach beforehand should guarantee that this doesn't fail.
+ * TODO: have a better way to handle failure here
+ */
+ err = set_cpus_allowed_ptr(tsk, cpus_attach);
+ WARN_ON_ONCE(err);
+
+ task_lock(tsk);
+ cpuset_change_task_nodemask(tsk, to);
+ task_unlock(tsk);
+ cpuset_update_task_spread_flag(cs, tsk);
+
}

-static void cpuset_attach(struct cgroup_subsys *ss,
- struct cgroup *cont, struct cgroup *oldcont,
- struct task_struct *tsk)
+static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+ struct cgroup *oldcont, struct task_struct *tsk,
+ bool threadgroup)
{
nodemask_t from, to;
struct mm_struct *mm;
struct cpuset *cs = cgroup_cs(cont);
struct cpuset *oldcs = cgroup_cs(oldcont);
- int err;

if (cs == &top_cpuset) {
cpumask_copy(cpus_attach, cpu_possible_mask);
@@ -1363,15 +1396,19 @@ static void cpuset_attach(struct cgroup_subsys *ss,
guarantee_online_cpus(cs, cpus_attach);
guarantee_online_mems(cs, &to);
}
- err = set_cpus_allowed_ptr(tsk, cpus_attach);
- if (err)
- return;

- task_lock(tsk);
- cpuset_change_task_nodemask(tsk, &to);
- task_unlock(tsk);
- cpuset_update_task_spread_flag(cs, tsk);
+ /* do per-task migration stuff possibly for each in the threadgroup */
+ cpuset_attach_task(tsk, &to, cs);
+ if (threadgroup) {
+ struct task_struct *c;
+ rcu_read_lock();
+ list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+ cpuset_attach_task(c, &to, cs);
+ }
+ rcu_read_unlock();
+ }

+ /* change mm; only needs to be done once even if threadgroup */
from = oldcs->mems_allowed;
to = cs->mems_allowed;
mm = get_task_mm(tsk);
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 5aa854f..2a5dfec 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,8 +42,8 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
* (hence either you are in the same cgroup as task, or in an
* ancestor cgroup thereof)
*/
-static int ns_can_attach(struct cgroup_subsys *ss,
- struct cgroup *new_cgroup, struct task_struct *task)
+static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
+ struct task_struct *task, bool threadgroup)
{
if (current != task) {
if (!capable(CAP_SYS_ADMIN))
@@ -56,6 +56,18 @@ static int ns_can_attach(struct cgroup_subsys *ss,
if (!cgroup_is_descendant(new_cgroup, task))
return -EPERM;

+ if (threadgroup) {
+ struct task_struct *c;
+ rcu_read_lock();
+ list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+ if (!cgroup_is_descendant(new_cgroup, c)) {
+ rcu_read_unlock();
+ return -EPERM;
+ }
+ }
+ rcu_read_unlock();
+ }
+
return 0;
}

diff --git a/kernel/sched.c b/kernel/sched.c
index 3393c18..b5e371b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10194,8 +10194,7 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
}

static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
#ifdef CONFIG_RT_GROUP_SCHED
if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -10209,11 +10208,43 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
return 0;
}

+static int
+cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct task_struct *tsk, bool threadgroup)
+{
+ int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
+ if (retval)
+ return retval;
+ if (threadgroup) {
+ struct task_struct *c;
+ rcu_read_lock();
+ list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+ retval = cpu_cgroup_can_attach_task(cgrp, c);
+ if (retval) {
+ rcu_read_unlock();
+ return retval;
+ }
+ }
+ rcu_read_unlock();
+ }
+ return 0;
+
+}
+
static void
cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct cgroup *old_cont, struct task_struct *tsk)
+ struct cgroup *old_cont, struct task_struct *tsk,
+ bool threadgroup)
{
sched_move_task(tsk);
+ if (threadgroup) {
+ struct task_struct *c;
+ rcu_read_lock();
+ list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+ sched_move_task(c);
+ }
+ rcu_read_unlock();
+ }
}

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ceb6f2..d9e9cf4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2584,7 +2584,8 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
static void mem_cgroup_move_task(struct cgroup_subsys *ss,
struct cgroup *cont,
struct cgroup *old_cont,
- struct task_struct *p)
+ struct task_struct *p,
+ bool threadgroup)
{
mutex_lock(&memcg_tasklist);
/*
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index b8186ba..6cf8fd2 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,7 +61,8 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
struct cgroup_subsys devices_subsys;

static int devcgroup_can_attach(struct cgroup_subsys *ss,
- struct cgroup *new_cgroup, struct task_struct *task)
+ struct cgroup *new_cgroup, struct task_struct *task,
+ bool threadgroup)
{
if (current != task && !capable(CAP_SYS_ADMIN))
return -EPERM;

2009-07-24 03:22:19

by Ben Blum

Subject: [PATCH 4/6] Changes css_set freeing mechanism to be under RCU

Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free each
task's old css_set as the task is moved, the freeing mechanism must be
RCU-protected; otherwise we would need a synchronize_rcu() call for each task
before freeing its old css_set.

Signed-off-by: Ben Blum <[email protected]>

---

include/linux/cgroup.h | 3 +++
kernel/cgroup.c | 8 +++++++-
2 files changed, 10 insertions(+), 1 deletions(-)
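
To make the cost concrete, the alternative alluded to above would look
roughly like this (hypothetical helper name, purely illustrative, not code
from this series):

static void migrate_group_blocking(struct task_struct *leader,
				   struct cgroup *cgrp)
{
	struct task_struct *t;
	struct css_set *old;

	list_for_each_entry(t, &leader->thread_group, thread_group) {
		old = migrate_one_task(t, cgrp);  /* hypothetical helper */
		synchronize_rcu();                /* one grace period per thread */
		kfree(old);
	}
}

With call_rcu() on css_set->rcu_head (below), put_css_set() can instead drop
references from inside the migration loop and the actual kfree() happens
asynchronously after a grace period.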

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b934b72..24e3f1a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -260,6 +260,9 @@ struct css_set {
* during subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
+
+ /* For RCU-protected deletion */
+ struct rcu_head rcu_head;
};

/*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e6b183..637a54e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -288,6 +288,12 @@ static void unlink_css_set(struct css_set *cg)
}
}

+static void free_css_set_rcu(struct rcu_head *obj)
+{
+ struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+ kfree(cg);
+}
+
static void __put_css_set(struct css_set *cg, int taskexit)
{
int i;
@@ -317,7 +323,7 @@ static void __put_css_set(struct css_set *cg, int taskexit)
}
}
rcu_read_unlock();
- kfree(cg);
+ call_rcu(&cg->rcu_head, free_css_set_rcu);
}

/*

2009-07-24 03:22:09

by Ben Blum

Subject: [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces

Ensures correct concurrent opening/reading of pidlists across pid namespaces

Previously, two processes from different pid namespaces reading a cgroup's
tasks or procs file could end up with one of them seeing results from the
other's namespace. Rather than one pidlist per file in a cgroup, we now keep
a list of pidlists keyed by namespace and file type (tasks versus procs), in
which entries are created on demand. Each pidlist has its own lock, and
because the pidlists themselves are passed around in the seq_file's private
pointer, we never need to touch the cgroup or its master list except when
creating and destroying entries.

Signed-off-by: Ben Blum <[email protected]>

---

include/linux/cgroup.h | 34 +++++++++++++--
kernel/cgroup.c | 109 +++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 121 insertions(+), 22 deletions(-)
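
The core of the fix, condensed (abridged from cgroup_pidlist_find() in the
hunk below; the allocate-on-miss path is trimmed): match on (file type, pid
namespace), and take the per-list rwsem before dropping pidlist_mutex so the
last reader can't free the entry out from under us.

static struct cgroup_pidlist *pidlist_lookup(struct cgroup *cgrp,
					     enum cgroup_filetype type)
{
	struct pid_namespace *ns = current->nsproxy->pid_ns;
	struct cgroup_pidlist *l;

	mutex_lock(&cgrp->pidlist_mutex);
	list_for_each_entry(l, &cgrp->pidlists, links) {
		if (l->key.type == type && l->key.ns == ns) {
			down_write(&l->mutex);
			mutex_unlock(&cgrp->pidlist_mutex);
			l->use_count++;
			return l;
		}
	}
	mutex_unlock(&cgrp->pidlist_mutex);
	return NULL;	/* the real code allocates a fresh entry here */
}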

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8a3a3ac..b934b72 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,15 +141,36 @@ enum {
CGRP_WAIT_ON_RMDIR,
};

+/* which pidlist file are we talking about? */
+enum cgroup_filetype {
+ CGROUP_FILE_PROCS,
+ CGROUP_FILE_TASKS,
+};
+
+/*
+ * A pidlist is a list of pids that virtually represents the contents of one
+ * of the cgroup files ("procs" or "tasks"). We keep a list of such pidlists,
+ * a pair (one each for procs, tasks) for each pid namespace that's relevant
+ * to the cgroup.
+ */
struct cgroup_pidlist {
- /* protects the other fields */
- struct rw_semaphore mutex;
+ /*
+ * used to find which pidlist is wanted. doesn't change as long as
+ * this particular list stays in the list.
+ */
+ struct { enum cgroup_filetype type; struct pid_namespace *ns; } key;
/* array of xids */
pid_t *list;
/* how many elements the above list has */
int length;
/* how many files are using the current array */
int use_count;
+ /* each of these stored in a list by its cgroup */
+ struct list_head links;
+ /* pointer to the cgroup we belong to, for list removal purposes */
+ struct cgroup *owner;
+ /* protects the other fields */
+ struct rw_semaphore mutex;
};

struct cgroup {
@@ -190,9 +211,12 @@ struct cgroup {
*/
struct list_head release_list;

- /* we will have two separate pidlists, one for pids (the tasks file)
- * and one for tgids (the procs file). */
- struct cgroup_pidlist tasks, procs;
+ /*
+ * list of pidlists, up to two for each namespace (one for procs, one
+ * for tasks); created on demand.
+ */
+ struct list_head pidlists;
+ struct mutex pidlist_mutex;

/* For RCU-protected deletion */
struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 943d59b..b3773a5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -47,6 +47,7 @@
#include <linux/hash.h>
#include <linux/namei.h>
#include <linux/smp_lock.h>
+#include <linux/pid_namespace.h>

#include <asm/atomic.h>

@@ -675,6 +676,12 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
*/
deactivate_super(cgrp->root->sb);

+ /*
+ * if we're getting rid of the cgroup, refcount should ensure
+ * that there are no pidlists left.
+ */
+ BUG_ON(!list_empty(&cgrp->pidlists));
+
call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
}
iput(inode);
@@ -960,8 +967,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->children);
INIT_LIST_HEAD(&cgrp->css_sets);
INIT_LIST_HEAD(&cgrp->release_list);
- init_rwsem(&(cgrp->tasks.mutex));
- init_rwsem(&(cgrp->procs.mutex));
+ INIT_LIST_HEAD(&cgrp->pidlists);
+ mutex_init(&cgrp->pidlist_mutex);
}
static void init_cgroup_root(struct cgroupfs_root *root)
{
@@ -2167,9 +2174,59 @@ static int cmppid(const void *a, const void *b)
}

/*
+ * find the appropriate pidlist for our purpose (given procs vs tasks)
+ * returns with the lock on that pidlist already held, and takes care
+ * of the use count, or returns NULL with no locks held if we're out of
+ * memory.
+ */
+static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
+ enum cgroup_filetype type)
+{
+ struct cgroup_pidlist *l;
+ /* don't need task_nsproxy() if we're looking at ourself */
+ struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
+ /*
+ * We can't drop the pidlist_mutex before taking the l->mutex in case
+ * the last ref-holder is trying to remove l from the list at the same
+ * time. Holding the pidlist_mutex precludes somebody taking whichever
+ * list we find out from under us - compare release_pid_array().
+ */
+ mutex_lock(&cgrp->pidlist_mutex);
+ list_for_each_entry(l, &cgrp->pidlists, links) {
+ if (l->key.type == type && l->key.ns == ns) {
+ /* found a matching list - drop the extra refcount */
+ put_pid_ns(ns);
+ /* make sure l doesn't vanish out from under us */
+ down_write(&l->mutex);
+ mutex_unlock(&cgrp->pidlist_mutex);
+ l->use_count++;
+ return l;
+ }
+ }
+ /* entry not found; create a new one */
+ l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
+ if (!l) {
+ mutex_unlock(&cgrp->pidlist_mutex);
+ put_pid_ns(ns);
+ return l;
+ }
+ init_rwsem(&l->mutex);
+ down_write(&l->mutex);
+ l->key.type = type;
+ l->key.ns = ns;
+ l->use_count = 0; /* don't increment here */
+ l->list = NULL;
+ l->owner = cgrp;
+ list_add(&l->links, &cgrp->pidlists);
+ mutex_unlock(&cgrp->pidlist_mutex);
+ return l;
+}
+
+/*
* Load a cgroup's pidarray with either procs' tgids or tasks' pids
*/
-static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
+ struct cgroup_pidlist **lp)
{
pid_t *array;
int length;
@@ -2194,7 +2251,10 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
if (unlikely(n == length))
break;
/* get tgid or pid for procs or tasks file respectively */
- pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+ if (type == CGROUP_FILE_PROCS)
+ pid = task_tgid_vnr(tsk);
+ else
+ pid = task_pid_vnr(tsk);
if (pid > 0) /* make sure to only use valid results */
array[n++] = pid;
}
@@ -2202,19 +2262,21 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
length = n;
/* now sort & (if procs) strip out duplicates */
sort(array, length, sizeof(pid_t), cmppid, NULL);
- if (procs) {
+ if (type == CGROUP_FILE_PROCS) {
length = pidlist_uniq(&array, length);
- l = &(cgrp->procs);
- } else {
- l = &(cgrp->tasks);
}
- /* store array in cgroup, freeing old if necessary */
- down_write(&l->mutex);
+ l = cgroup_pidlist_find(cgrp, type);
+ if (!l) {
+ kfree(array);
+ return -ENOMEM;
+ }
+ /* store array, freeing old if necessary - lock already held */
kfree(l->list);
l->list = array;
l->length = length;
l->use_count++;
up_write(&l->mutex);
+ *lp = l;
return 0;
}

@@ -2357,13 +2419,26 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {

static void cgroup_release_pid_array(struct cgroup_pidlist *l)
{
+ /*
+ * the case where we're the last user of this particular pidlist will
+ * have us remove it from the cgroup's list, which entails taking the
+ * mutex. since in pidlist_find the pidlist->lock depends on cgroup->
+ * pidlist_mutex, we have to take pidlist_mutex first.
+ */
+ mutex_lock(&l->owner->pidlist_mutex);
down_write(&l->mutex);
BUG_ON(!l->use_count);
if (!--l->use_count) {
+ /* we're the last user if refcount is 0; remove and free */
+ list_del(&l->links);
+ mutex_unlock(&l->owner->pidlist_mutex);
kfree(l->list);
- l->list = NULL;
- l->length = 0;
+ put_pid_ns(l->key.ns);
+ up_write(&l->mutex);
+ kfree(l);
+ return;
}
+ mutex_unlock(&l->owner->pidlist_mutex);
up_write(&l->mutex);
}

@@ -2394,10 +2469,10 @@ static const struct file_operations cgroup_pidlist_operations = {
* in the cgroup.
*/
/* helper function for the two below it */
-static int cgroup_pidlist_open(struct file *file, bool procs)
+static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
{
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
- struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
+ struct cgroup_pidlist *l;
int retval;

/* Nothing to do for write-only files */
@@ -2405,7 +2480,7 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
return 0;

/* have the array populated */
- retval = pidlist_array_load(cgrp, procs);
+ retval = pidlist_array_load(cgrp, type, &l);
if (retval)
return retval;
/* configure file information */
@@ -2421,11 +2496,11 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
}
static int cgroup_tasks_open(struct inode *unused, struct file *file)
{
- return cgroup_pidlist_open(file, false);
+ return cgroup_pidlist_open(file, CGROUP_FILE_TASKS);
}
static int cgroup_procs_open(struct inode *unused, struct file *file)
{
- return cgroup_pidlist_open(file, true);
+ return cgroup_pidlist_open(file, CGROUP_FILE_PROCS);
}

static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,

2009-07-24 03:22:54

by Ben Blum

Subject: [PATCH 3/6] Quick vmalloc vs kmalloc fix to the case where array size is too large

Quick vmalloc vs kmalloc fix to the case where array size is too large

Routes all pidlist allocation requests through a helper that decides, based
on the requested size, whether the array needs to be vmalloc'ed or can be
obtained via kmalloc, and similarly for kfree/vfree; with the two-page
threshold used here and 4 KiB pages, lists of more than 2048 pids fall back
to vmalloc. Should be replaced entirely with a kernel-wide solution to this
general problem.

Depends on cgroup-pidlist-namespace.patch, cgroup-procs.patch

Signed-off-by: Ben Blum <[email protected]>

---

kernel/cgroup.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b3773a5..7e6b183 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -48,6 +48,7 @@
#include <linux/namei.h>
#include <linux/smp_lock.h>
#include <linux/pid_namespace.h>
+#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */

#include <asm/atomic.h>

@@ -2123,6 +2124,42 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
*/

/*
+ * The following two functions "fix" the issue where there are more pids
+ * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
+ * TODO: replace with a kernel-wide solution to this problem
+ */
+#define PIDLIST_TOO_LARGE(c) ((c) * sizeof(pid_t) > (PAGE_SIZE * 2))
+static void *pidlist_allocate(int count)
+{
+ if (PIDLIST_TOO_LARGE(count))
+ return vmalloc(count * sizeof(pid_t));
+ else
+ return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
+}
+static void pidlist_free(void *p)
+{
+ if (is_vmalloc_addr(p))
+ vfree(p);
+ else
+ kfree(p);
+}
+static void *pidlist_resize(void *p, int newcount)
+{
+ void *newlist;
+ /* note: if new alloc fails, old p will still be valid either way */
+ if (is_vmalloc_addr(p)) {
+ newlist = vmalloc(newcount * sizeof(pid_t));
+ if (!newlist)
+ return NULL;
+ memcpy(newlist, p, newcount * sizeof(pid_t));
+ vfree(p);
+ } else {
+ newlist = krealloc(p, newcount * sizeof(pid_t), GFP_KERNEL);
+ }
+ return newlist;
+}
+
+/*
* pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
* If the new stripped list is sufficiently smaller and there's enough memory
* to allocate a new buffer, will let go of the unneeded memory. Returns the
@@ -2161,7 +2198,7 @@ after:
* we'll just stay with what we've got.
*/
if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
- newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+ newlist = pidlist_resize(list, dest);
if (newlist)
*p = newlist;
}
@@ -2242,7 +2279,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
* show up until sometime later on.
*/
length = cgroup_task_count(cgrp);
- array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+ array = pidlist_allocate(length);
if (!array)
return -ENOMEM;
/* now, populate the array */
@@ -2267,11 +2304,11 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
}
l = cgroup_pidlist_find(cgrp, type);
if (!l) {
- kfree(array);
+ pidlist_free(array);
return -ENOMEM;
}
/* store array, freeing old if necessary - lock already held */
- kfree(l->list);
+ pidlist_free(l->list);
l->list = array;
l->length = length;
l->use_count++;
@@ -2432,7 +2469,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
/* we're the last user if refcount is 0; remove and free */
list_del(&l->links);
mutex_unlock(&l->owner->pidlist_mutex);
- kfree(l->list);
+ pidlist_free(l->list);
put_pid_ns(l->key.ns);
up_write(&l->mutex);
kfree(l);

2009-07-24 03:22:12

by Ben Blum

Subject: [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the pidlist
for the tasks file. Those are now separated into a new struct cgroup_pidlist,
of which each cgroup has two, one for procs and one for tasks. The seq_file
operations are changed so that just the pidlist struct gets passed around as
the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$

Signed-off-by: Ben Blum <[email protected]>

---

include/linux/cgroup.h | 22 ++--
kernel/cgroup.c | 278 ++++++++++++++++++++++++++++++------------------
2 files changed, 186 insertions(+), 114 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 665fa70..8a3a3ac 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,6 +141,17 @@ enum {
CGRP_WAIT_ON_RMDIR,
};

+struct cgroup_pidlist {
+ /* protects the other fields */
+ struct rw_semaphore mutex;
+ /* array of xids */
+ pid_t *list;
+ /* how many elements the above list has */
+ int length;
+ /* how many files are using the current array */
+ int use_count;
+};
+
struct cgroup {
unsigned long flags; /* "unsigned long" so bitops work */

@@ -179,14 +190,9 @@ struct cgroup {
*/
struct list_head release_list;

- /* pids_mutex protects the fields below */
- struct rw_semaphore pids_mutex;
- /* Array of process ids in the cgroup */
- pid_t *tasks_pids;
- /* How many files are using the current tasks_pids array */
- int pids_use_count;
- /* Length of the current tasks_pids array */
- int pids_length;
+ /* we will have two separate pidlists, one for pids (the tasks file)
+ * and one for tgids (the procs file). */
+ struct cgroup_pidlist tasks, procs;

/* For RCU-protected deletion */
struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3737a68..943d59b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -960,7 +960,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->children);
INIT_LIST_HEAD(&cgrp->css_sets);
INIT_LIST_HEAD(&cgrp->release_list);
- init_rwsem(&cgrp->pids_mutex);
+ init_rwsem(&(cgrp->tasks.mutex));
+ init_rwsem(&(cgrp->procs.mutex));
}
static void init_cgroup_root(struct cgroupfs_root *root)
{
@@ -1408,15 +1409,6 @@ static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
return ret;
}

-/* The various types of files and directories in a cgroup file system */
-enum cgroup_filetype {
- FILE_ROOT,
- FILE_DIR,
- FILE_TASKLIST,
- FILE_NOTIFY_ON_RELEASE,
- FILE_RELEASE_AGENT,
-};
-
/**
* cgroup_lock_live_group - take cgroup_mutex and check that cgrp is alive.
* @cgrp: the cgroup to be checked for liveness
@@ -2114,7 +2106,7 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
}

/*
- * Stuff for reading the 'tasks' file.
+ * Stuff for reading the 'tasks'/'procs' files.
*
* Reading this file can return large amounts of data if a cgroup has
* *lots* of attached tasks. So it may need several calls to read(),
@@ -2124,27 +2116,106 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
*/

/*
- * Load into 'pidarray' up to 'npids' of the tasks using cgroup
- * 'cgrp'. Return actual number of pids loaded. No need to
- * task_lock(p) when reading out p->cgroup, since we're in an RCU
- * read section, so the css_set can't go away, and is
- * immutable after creation.
+ * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
+ * If the new stripped list is sufficiently smaller and there's enough memory
+ * to allocate a new buffer, will let go of the unneeded memory. Returns the
+ * number of unique elements.
*/
-static int pid_array_load(pid_t *pidarray, int npids, struct cgroup *cgrp)
+/* is the size difference enough that we should re-allocate the array? */
+#define PIDLIST_REALLOC_DIFFERENCE(old,new) ((old) - PAGE_SIZE >= (new))
+static int pidlist_uniq(pid_t **p, int length)
{
- int n = 0, pid;
+ int src, dest = 1;
+ pid_t *list = *p;
+ pid_t *newlist;
+
+ /*
+ * we presume the 0th element is unique, so i starts at 1. trivial
+ * edge cases first; no work needs to be done for either
+ */
+ if (length == 0 || length == 1)
+ return length;
+ /* src and dest walk down the list; dest counts unique elements */
+ for (src = 1; src < length; src++) {
+ /* find next unique element */
+ while (list[src] == list[src-1]) {
+ src++;
+ if (src == length)
+ goto after;
+ }
+ /* dest always points to where the next unique element goes */
+ list[dest] = list[src];
+ dest++;
+ }
+after:
+ /*
+ * if the length difference is large enough, we want to allocate a
+ * smaller buffer to save memory. if this fails due to out of memory,
+ * we'll just stay with what we've got.
+ */
+ if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
+ newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+ if (newlist)
+ *p = newlist;
+ }
+ return dest;
+}
+
+static int cmppid(const void *a, const void *b)
+{
+ return *(pid_t *)a - *(pid_t *)b;
+}
+
+/*
+ * Load a cgroup's pidarray with either procs' tgids or tasks' pids
+ */
+static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+{
+ pid_t *array;
+ int length;
+ int pid, n = 0; /* used for populating the array */
struct cgroup_iter it;
struct task_struct *tsk;
+ struct cgroup_pidlist *l;
+
+ /*
+ * If cgroup gets more users after we read count, we won't have
+ * enough space - tough. This race is indistinguishable to the
+ * caller from the case that the additional cgroup users didn't
+ * show up until sometime later on.
+ */
+ length = cgroup_task_count(cgrp);
+ array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+ if (!array)
+ return -ENOMEM;
+ /* now, populate the array */
cgroup_iter_start(cgrp, &it);
while ((tsk = cgroup_iter_next(cgrp, &it))) {
- if (unlikely(n == npids))
+ if (unlikely(n == length))
break;
- pid = task_pid_vnr(tsk);
- if (pid > 0)
- pidarray[n++] = pid;
+ /* get tgid or pid for procs or tasks file respectively */
+ pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+ if (pid > 0) /* make sure to only use valid results */
+ array[n++] = pid;
}
cgroup_iter_end(cgrp, &it);
- return n;
+ length = n;
+ /* now sort & (if procs) strip out duplicates */
+ sort(array, length, sizeof(pid_t), cmppid, NULL);
+ if (procs) {
+ length = pidlist_uniq(&array, length);
+ l = &(cgrp->procs);
+ } else {
+ l = &(cgrp->tasks);
+ }
+ /* store array in cgroup, freeing old if necessary */
+ down_write(&l->mutex);
+ kfree(l->list);
+ l->list = array;
+ l->length = length;
+ l->use_count++;
+ up_write(&l->mutex);
+ return 0;
}

/**
@@ -2201,19 +2272,14 @@ err:
return ret;
}

-static int cmppid(const void *a, const void *b)
-{
- return *(pid_t *)a - *(pid_t *)b;
-}
-

/*
- * seq_file methods for the "tasks" file. The seq_file position is the
+ * seq_file methods for the tasks/procs files. The seq_file position is the
* next pid to display; the seq_file iterator is a pointer to the pid
- * in the cgroup->tasks_pids array.
+ * in the cgroup->l->list array.
*/

-static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
+static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
{
/*
* Initially we receive a position value that corresponds to
@@ -2221,46 +2287,45 @@ static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
* after a seek to the start). Use a binary-search to find the
* next pid to display, if any
*/
- struct cgroup *cgrp = s->private;
+ struct cgroup_pidlist *l = s->private;
int index = 0, pid = *pos;
int *iter;

- down_read(&cgrp->pids_mutex);
+ down_read(&l->mutex);
if (pid) {
- int end = cgrp->pids_length;
+ int end = l->length;

while (index < end) {
int mid = (index + end) / 2;
- if (cgrp->tasks_pids[mid] == pid) {
+ if (l->list[mid] == pid) {
index = mid;
break;
- } else if (cgrp->tasks_pids[mid] <= pid)
+ } else if (l->list[mid] <= pid)
index = mid + 1;
else
end = mid;
}
}
/* If we're off the end of the array, we're done */
- if (index >= cgrp->pids_length)
+ if (index >= l->length)
return NULL;
/* Update the abstract position to be the actual pid that we found */
- iter = cgrp->tasks_pids + index;
+ iter = l->list + index;
*pos = *iter;
return iter;
}

-static void cgroup_tasks_stop(struct seq_file *s, void *v)
+static void cgroup_pidlist_stop(struct seq_file *s, void *v)
{
- struct cgroup *cgrp = s->private;
- up_read(&cgrp->pids_mutex);
+ struct cgroup_pidlist *l = s->private;
+ up_read(&l->mutex);
}

-static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
+static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
{
- struct cgroup *cgrp = s->private;
- int *p = v;
- int *end = cgrp->tasks_pids + cgrp->pids_length;
-
+ struct cgroup_pidlist *l = s->private;
+ pid_t *p = v;
+ pid_t *end = l->list + l->length;
/*
* Advance to the next pid in the array. If this goes off the
* end, we're done
@@ -2274,98 +2339,94 @@ static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
}
}

-static int cgroup_tasks_show(struct seq_file *s, void *v)
+static int cgroup_pidlist_show(struct seq_file *s, void *v)
{
return seq_printf(s, "%d\n", *(int *)v);
}

-static struct seq_operations cgroup_tasks_seq_operations = {
- .start = cgroup_tasks_start,
- .stop = cgroup_tasks_stop,
- .next = cgroup_tasks_next,
- .show = cgroup_tasks_show,
+/*
+ * seq_operations functions for iterating on pidlists through seq_file -
+ * independent of whether it's tasks or procs
+ */
+static const struct seq_operations cgroup_pidlist_seq_operations = {
+ .start = cgroup_pidlist_start,
+ .stop = cgroup_pidlist_stop,
+ .next = cgroup_pidlist_next,
+ .show = cgroup_pidlist_show,
};

-static void release_cgroup_pid_array(struct cgroup *cgrp)
+static void cgroup_release_pid_array(struct cgroup_pidlist *l)
{
- down_write(&cgrp->pids_mutex);
- BUG_ON(!cgrp->pids_use_count);
- if (!--cgrp->pids_use_count) {
- kfree(cgrp->tasks_pids);
- cgrp->tasks_pids = NULL;
- cgrp->pids_length = 0;
+ down_write(&l->mutex);
+ BUG_ON(!l->use_count);
+ if (!--l->use_count) {
+ kfree(l->list);
+ l->list = NULL;
+ l->length = 0;
}
- up_write(&cgrp->pids_mutex);
+ up_write(&l->mutex);
}

-static int cgroup_tasks_release(struct inode *inode, struct file *file)
+static int cgroup_pidlist_release(struct inode *inode, struct file *file)
{
- struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-
+ struct cgroup_pidlist *l;
if (!(file->f_mode & FMODE_READ))
return 0;
-
- release_cgroup_pid_array(cgrp);
+ /*
+ * the seq_file will only be initialized if the file was opened for
+ * reading; hence we check if it's not null only in that case.
+ */
+ l = ((struct seq_file *)file->private_data)->private;
+ cgroup_release_pid_array(l);
return seq_release(inode, file);
}

-static struct file_operations cgroup_tasks_operations = {
+static const struct file_operations cgroup_pidlist_operations = {
.read = seq_read,
.llseek = seq_lseek,
.write = cgroup_file_write,
- .release = cgroup_tasks_release,
+ .release = cgroup_pidlist_release,
};

/*
- * Handle an open on 'tasks' file. Prepare an array containing the
- * process id's of tasks currently attached to the cgroup being opened.
+ * The following functions handle opens on a file that displays a pidlist
+ * (tasks or procs). Prepare an array of the process/thread IDs of whoever's
+ * in the cgroup.
*/
-
-static int cgroup_tasks_open(struct inode *unused, struct file *file)
+/* helper function for the two below it */
+static int cgroup_pidlist_open(struct file *file, bool procs)
{
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
- pid_t *pidarray;
- int npids;
+ struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
int retval;

/* Nothing to do for write-only files */
if (!(file->f_mode & FMODE_READ))
return 0;

- /*
- * If cgroup gets more users after we read count, we won't have
- * enough space - tough. This race is indistinguishable to the
- * caller from the case that the additional cgroup users didn't
- * show up until sometime later on.
- */
- npids = cgroup_task_count(cgrp);
- pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
- if (!pidarray)
- return -ENOMEM;
- npids = pid_array_load(pidarray, npids, cgrp);
- sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
-
- /*
- * Store the array in the cgroup, freeing the old
- * array if necessary
- */
- down_write(&cgrp->pids_mutex);
- kfree(cgrp->tasks_pids);
- cgrp->tasks_pids = pidarray;
- cgrp->pids_length = npids;
- cgrp->pids_use_count++;
- up_write(&cgrp->pids_mutex);
-
- file->f_op = &cgroup_tasks_operations;
+ /* have the array populated */
+ retval = pidlist_array_load(cgrp, procs);
+ if (retval)
+ return retval;
+ /* configure file information */
+ file->f_op = &cgroup_pidlist_operations;

- retval = seq_open(file, &cgroup_tasks_seq_operations);
+ retval = seq_open(file, &cgroup_pidlist_seq_operations);
if (retval) {
- release_cgroup_pid_array(cgrp);
+ cgroup_release_pid_array(l);
return retval;
}
- ((struct seq_file *)file->private_data)->private = cgrp;
+ ((struct seq_file *)file->private_data)->private = l;
return 0;
}
+static int cgroup_tasks_open(struct inode *unused, struct file *file)
+{
+ return cgroup_pidlist_open(file, false);
+}
+static int cgroup_procs_open(struct inode *unused, struct file *file)
+{
+ return cgroup_pidlist_open(file, true);
+}

static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,
struct cftype *cft)
@@ -2388,21 +2449,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
/*
* for the common functions, 'private' gives the type of file
*/
+/* for hysterical raisins, we can't put this on the older files */
+#define CGROUP_FILE_GENERIC_PREFIX "cgroup."
static struct cftype files[] = {
{
.name = "tasks",
.open = cgroup_tasks_open,
.write_u64 = cgroup_tasks_write,
- .release = cgroup_tasks_release,
- .private = FILE_TASKLIST,
+ .release = cgroup_pidlist_release,
.mode = S_IRUGO | S_IWUSR,
},
-
+ {
+ .name = CGROUP_FILE_GENERIC_PREFIX "procs",
+ .open = cgroup_procs_open,
+ /* .write_u64 = cgroup_procs_write, TODO */
+ .release = cgroup_pidlist_release,
+ .mode = S_IRUGO,
+ },
{
.name = "notify_on_release",
.read_u64 = cgroup_read_notify_on_release,
.write_u64 = cgroup_write_notify_on_release,
- .private = FILE_NOTIFY_ON_RELEASE,
},
};

@@ -2411,7 +2478,6 @@ static struct cftype cft_release_agent = {
.read_seq_string = cgroup_release_agent_show,
.write_string = cgroup_release_agent_write,
.max_write_len = PATH_MAX,
- .private = FILE_RELEASE_AGENT,
};

static int cgroup_populate_dir(struct cgroup *cgrp)

2009-07-24 10:02:25

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

Hi Ben,

On 23/07/09 20:22 -0700, Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a rwsem that's taken for
> reading in the fork() path to prevent newly forking threads within the
> threadgroup from "escaping" while moving is in progress.
>
> Signed-off-by: Ben Blum <[email protected]>

Thank you for working on this interface. This can indeed be very useful. Please
find comments below, hoping that this will help make it better.

[...]

> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 6eb1a97..d579346 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt

[...]

> @@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
>
> # echo 0 > tasks
>
> +The cgroup.procs file is useful for managing all tasks in a threadgroup at
> +once. It works the same way as the tasks file, but moves all tasks in the
> +threadgroup with the specified tgid.
> +
> +Writing the pid of a task that's not the threadgroup leader (i.e., a pid
> +that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
> +attach the writing task and all tasks in its threadgroup, but is invalid if
> +the writing task is not the leader of the threadgroup.
> +

This restriction sounds unfortunate and I'm not sure that there are good reasons
for it (see below).

[...]

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 637a54e..3f8d323 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c

[...]

> @@ -1330,75 +1421,294 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> }
> }
>
> - task_lock(tsk);
> - cg = tsk->cgroups;
> - get_css_set(cg);
> - task_unlock(tsk);
> + retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
> + if (retval)
> + return retval;
> +
> + for_each_subsys(root, ss) {
> + if (ss->attach)
> + ss->attach(ss, cgrp, oldcgrp, tsk);
> + }
> +
> + synchronize_rcu();
> +
> /*
> - * Locate or allocate a new css_set for this task,
> - * based on its final set of cgroups
> + * wake up rmdir() waiter. the rmdir should fail since the cgroup
> + * is no longer empty.
> */
> + cgroup_wakeup_rmdir_waiters(cgrp);
> + return 0;
> +}
> +
> +/*
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list, of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> + struct css_set *cg;
> + struct list_head links;
> +};
> +
> +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
> + struct css_set *cg,
> + struct list_head *newcg_list)
> +{
> + struct css_set *newcg;
> + struct cg_list_entry *cg_entry;
> + struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> + read_lock(&css_set_lock);
> + newcg = find_existing_css_set(cg, cgrp, template);
> + if (newcg)
> + get_css_set(newcg);
> + read_unlock(&css_set_lock);
> + /* doesn't exist at all? */
> + if (!newcg)
> + return 1;
> + /* see if it's already in the list */
> + list_for_each_entry(cg_entry, newcg_list, links) {
> + if (cg_entry->cg == newcg) {
> + put_css_set(newcg);
> + return 0;
> + }
> + }
> + /* not found */
> + put_css_set(newcg);
> + return 1;
> +}
> +
> +/*
> + * Find the new css_set and store it in the list in preparation for moving
> + * the given task to the given cgroup. Returns 0 on success, -ENOMEM if we
> + * run out of memory.
> + */
> +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
> + struct list_head *newcg_list)
> +{
> + struct css_set *newcg;
> + struct cg_list_entry *cg_entry;
> + /* ensure a new css_set will exist for this thread */
> newcg = find_css_set(cg, cgrp);
> - put_css_set(cg);
> if (!newcg)
> return -ENOMEM;
> -
> - task_lock(tsk);
> - if (tsk->flags & PF_EXITING) {
> - task_unlock(tsk);
> + /* add new element to list */
> + cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
> + if (!cg_entry) {
> put_css_set(newcg);
> - return -ESRCH;
> + return -ENOMEM;
> }
> - rcu_assign_pointer(tsk->cgroups, newcg);
> - task_unlock(tsk);
> + cg_entry->cg = newcg;
> + list_add(&cg_entry->links, newcg_list);
> + return 0;
> +}
>
> - /* Update the css_set linked lists if we're using them */
> - write_lock(&css_set_lock);
> - if (!list_empty(&tsk->cg_list)) {
> - list_del(&tsk->cg_list);
> - list_add(&tsk->cg_list, &newcg->tasks);
> +/**
> + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
> + * @cgrp: the cgroup to attach to
> + * @leader: the threadgroup leader task_struct of the group to be attached
> + *
> + * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
> + * threadgroup individually in turn.
> + */
> +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> +{
> + int retval;
> + struct cgroup_subsys *ss;
> + struct cgroup *oldcgrp;
> + struct css_set *oldcg;
> + struct cgroupfs_root *root = cgrp->root;
> + int subsys_id;
> + /* threadgroup list cursor */
> + struct task_struct *tsk;
> + /*
> + * we need to make sure we have css_sets for all the tasks we're
> + * going to move -before- we actually start moving them, so that in
> + * case we get an ENOMEM we can bail out before making any changes.
> + */
> + struct list_head newcg_list;
> + struct cg_list_entry *cg_entry;
> +
> + /* first, make sure this came from a valid tgid */
> + if (!thread_group_leader(leader))
> + return -EINVAL;
> + /*
> + * check that we can legitimately attach to the cgroup.
> + */
> + for_each_subsys(root, ss) {
> + if (ss->can_attach) {
> + retval = ss->can_attach(ss, cgrp, leader);
> + if (retval)
> + return retval;
> + }
> }

So the semantics of ->can_attach() becomes: if called for a thread group leader,
the result should be valid for the whole thread group, even if only the thread
group leader is being attached. This looks a bit fuzzy and thus not desirable.
Why not check ->can_attach() for all threads (and lock cgroup_fork_mutex
earlier)?

> - write_unlock(&css_set_lock);
>
> + get_first_subsys(cgrp, NULL, &subsys_id);
> +
> + /*
> + * step 1: make sure css_sets exist for all threads to be migrated.
> + * we use find_css_set, which allocates a new one if necessary.
> + */
> + INIT_LIST_HEAD(&newcg_list);
> + oldcgrp = task_cgroup(leader, subsys_id);
> + if (cgrp != oldcgrp) {
> + /* get old css_set */
> + task_lock(leader);
> + if (leader->flags & PF_EXITING) {
> + task_unlock(leader);
> + retval = -ESRCH;
> + goto list_teardown;
> + }
> + oldcg = leader->cgroups;
> + get_css_set(oldcg);
> + task_unlock(leader);
> + /* acquire new one */
> + retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> + put_css_set(oldcg);
> + if (retval)
> + goto list_teardown;
> + }

The only difference between the leader's case (above) and the other threads' case
(below) is the check for PF_EXITING. If all threads were handled equally in the
->can_attach check, this special handling would be pointless.

> +again:
> + rcu_read_lock();
> + /*
> + * if we need to fetch a new css_set for this task, we must exit the
> + * rcu_read section because allocating it can sleep. afterwards, we'll
> + * need to restart iteration on the threadgroup list - the whole thing
> + * will be O(nm) in the number of threads and css_sets; as the typical
> + * case only has one css_set for all of them, usually O(n).
> + */
> + list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> + /* nothing to do if this task is already in the cgroup */
> + oldcgrp = task_cgroup(tsk, subsys_id);
> + if (cgrp == oldcgrp)
> + continue;
> + /* get old css_set pointer */
> + task_lock(tsk);
> + if (tsk->flags & PF_EXITING) {
> + /* ignore this task if it's going away */
> + task_unlock(tsk);
> + continue;
> + }
> + oldcg = tsk->cgroups;
> + get_css_set(oldcg);
> + task_unlock(tsk);
> + /* see if the new one for us is already in the list? */
> + retval = css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list);
> + if (retval) {
> + /* we don't already have it. get new one. */
> + rcu_read_unlock();
> + retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> + put_css_set(oldcg);
> + if (retval)
> + goto list_teardown;
> + /* begin iteration again. */
> + goto again;
> + } else {
> + /* was already there, nothing to do. */
> + put_css_set(oldcg);
> + }
> + }
> + rcu_read_unlock();
> +
> + /*
> + * step 2: now that we're guaranteed success wrt the css_sets, proceed

I don't see how css_sets are guaranteed while cgroup_fork_mutex is not held and
thus does not prevent new threads from being created right now. Could you
elaborate on that?

> + * to move all tasks to the new cgroup. the only fail case henceforth
> + * is if the threadgroup leader has PF_EXITING set (in which case all
> + * the other threads get killed) - if other threads happen to be

This statement is wrong. A thread group leader can have PF_EXITING (and even
become zombie) while other sub-threads continue their execution. For instance it
is perfectly valid for a thread group leader to call sys_exit(), and no other
thread will be affected.

> + * exiting, we just ignore them and move on.
> + */
> + oldcgrp = task_cgroup(leader, subsys_id);
> + /* if leader is already there, skip moving him */
> + if (cgrp != oldcgrp) {
> + retval = cgroup_task_migrate(cgrp, oldcgrp, leader, 1);
> + if (retval) {
> + BUG_ON(retval != -ESRCH);
> + goto list_teardown;
> + }
> + }
> + /*
> + * now move all the rest of the threads - need to lock against
> + * possible races with fork().
> + */
> + down_write(&cgroup_fork_mutex);
> + rcu_read_lock();
> + list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> + /* leave current thread as it is if it's already there */
> + oldcgrp = task_cgroup(tsk, subsys_id);
> + if (cgrp == oldcgrp)
> + continue;
> + /* we don't care whether these threads are exiting */
> + retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
> + BUG_ON(retval != 0 && retval != -ESRCH);
> + }
> + rcu_read_unlock();
> + up_write(&cgroup_fork_mutex);
> +
> + /*
> + * step 3: attach whole threadgroup to each subsystem
> + */
> for_each_subsys(root, ss) {
> if (ss->attach)
> - ss->attach(ss, cgrp, oldcgrp, tsk);
> + ss->attach(ss, cgrp, oldcgrp, leader);
> }

So ->attach called for the leader should attach all sub-threads too? Does this mean
that all subsystems should be changed accordingly? Again this makes the ->attach
semantics fuzzy, and thus not desirable.

Thanks,

Louis

> - set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> - synchronize_rcu();
> - put_css_set(cg);
>
> /*
> - * wake up rmdir() waiter. the rmdir should fail since the cgroup
> - * is no longer empty.
> + * step 4: success! ...and cleanup
> */
> + synchronize_rcu();
> cgroup_wakeup_rmdir_waiters(cgrp);
> - return 0;
> + retval = 0;
> +list_teardown:
> + /* no longer need the list of css_sets, so get rid of it */
> + while (!list_empty(&newcg_list)) {
> + /* pop from the list */
> + cg_entry = list_first_entry(&newcg_list, struct cg_list_entry,
> + links);
> + list_del(&cg_entry->links);
> + /* drop the refcount */
> + put_css_set(cg_entry->cg);
> + kfree(cg_entry);
> + }
> + /* done! */
> + return retval;
> }
>
> /*
> - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
> - * held. May take task_lock of task
> + * Find the task_struct of the task to attach by vpid and pass it along to the
> + * function to attach either it or all tasks in its threadgroup. Will take
> + * cgroup_mutex; may take task_lock of task.
> */
> -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid,
> + int attach(struct cgroup *,
> + struct task_struct *))
> {
> struct task_struct *tsk;
> const struct cred *cred = current_cred(), *tcred;
> int ret;
>
> + if (!cgroup_lock_live_group(cgrp))
> + return -ENODEV;
> +
> if (pid) {
> rcu_read_lock();
> tsk = find_task_by_vpid(pid);
> if (!tsk || tsk->flags & PF_EXITING) {
> rcu_read_unlock();
> + cgroup_unlock();
> return -ESRCH;
> }
> -
> + /*
> + * even if we're attaching all tasks in the thread group, we
> + * only need to check permissions on the group leader, because
> + * even if another task has different permissions, the group
> + * leader will have sufficient access to change it.
> + */
> tcred = __task_cred(tsk);
> if (cred->euid &&
> cred->euid != tcred->uid &&
> cred->euid != tcred->suid) {
> rcu_read_unlock();
> + cgroup_unlock();
> return -EACCES;
> }
> get_task_struct(tsk);
> @@ -1408,19 +1718,25 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> get_task_struct(tsk);
> }
>
> - ret = cgroup_attach_task(cgrp, tsk);
> + /*
> + * Note that the check for whether the task is its threadgroup leader
> + * is done in cgroup_attach_proc. This means that writing 0 to the
> + * procs file will only work if the writing task is the leader.
> + */
> + ret = attach(cgrp, tsk);
> put_task_struct(tsk);
> + cgroup_unlock();
> return ret;
> }
>
> static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
> {
> - int ret;
> - if (!cgroup_lock_live_group(cgrp))
> - return -ENODEV;
> - ret = attach_task_by_pid(cgrp, pid);
> - cgroup_unlock();
> - return ret;
> + return attach_task_by_pid(cgrp, pid, cgroup_attach_task);
> +}
> +
> +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
> +{
> + return attach_task_by_pid(cgrp, tgid, cgroup_attach_proc);
> }
>
> /**
> @@ -2580,9 +2896,9 @@ static struct cftype files[] = {
> {
> .name = CGROUP_FILE_GENERIC_PREFIX "procs",
> .open = cgroup_procs_open,
> - /* .write_u64 = cgroup_procs_write, TODO */
> + .write_u64 = cgroup_procs_write,
> .release = cgroup_pidlist_release,
> - .mode = S_IRUGO,
> + .mode = S_IRUGO | S_IWUSR,
> },
> {
> .name = "notify_on_release",
> @@ -3185,6 +3501,7 @@ static struct file_operations proc_cgroupstats_operations = {
> */
> void cgroup_fork(struct task_struct *child)
> {
> + down_read(&cgroup_fork_mutex);
> task_lock(current);
> child->cgroups = current->cgroups;
> get_css_set(child->cgroups);
> @@ -3231,6 +3548,7 @@ void cgroup_post_fork(struct task_struct *child)
> task_unlock(child);
> write_unlock(&css_set_lock);
> }
> + up_read(&cgroup_fork_mutex);
> }
> /**
> * cgroup_exit - detach cgroup from exiting task
> @@ -3302,6 +3620,24 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
> }
>
> /**
> + * cgroup_fork_failed - undo operations for fork failure
> + * @tsk: pointer to task_struct of exiting process
> + * @run_callback: run exit callbacks?
> + *
> + * Description: Undo cgroup operations after cgroup_fork in fork failure.
> + *
> + * We release the read lock that was taken in cgroup_fork(), since it is
> + * supposed to be dropped in cgroup_post_fork in the success case. The other
> + * thing that wants to be done is detaching the failed child task from the
> + * cgroup, so we wrap cgroup_exit.
> + */
> +void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks)
> +{
> + up_read(&cgroup_fork_mutex);
> + cgroup_exit(tsk, run_callbacks);
> +}
> +
> +/**
> * cgroup_clone - clone the cgroup the given subsystem is attached to
> * @tsk: the task to be moved
> * @subsys: the given subsystem
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 926c117..027ec16 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
> mpol_put(p->mempolicy);
> bad_fork_cleanup_cgroup:
> #endif
> - cgroup_exit(p, cgroup_callbacks_done);
> + cgroup_fork_failed(p, cgroup_callbacks_done);
> delayacct_tsk_free(p);
> if (p->binfmt)
> module_put(p->binfmt->module);
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

--
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes



2009-07-24 10:08:11

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On 24/07/09 12:02 +0200, Louis Rilling wrote:
> Hi Ben,
>
> On 23/07/09 20:22 -0700, Ben Blum wrote:
> > Makes procs file writable to move all threads by tgid at once
> >
> > This patch adds functionality that enables users to move all threads in a
> > threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> > file. This current implementation makes use of a rwsem that's taken for
> > reading in the fork() path to prevent newly forking threads within the
> > threadgroup from "escaping" while moving is in progress.
> >
> > Signed-off-by: Ben Blum <[email protected]>
>

[...]

> > +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> > +{
> > + int retval;
> > + struct cgroup_subsys *ss;
> > + struct cgroup *oldcgrp;
> > + struct css_set *oldcg;
> > + struct cgroupfs_root *root = cgrp->root;
> > + int subsys_id;
> > + /* threadgroup list cursor */
> > + struct task_struct *tsk;
> > + /*
> > + * we need to make sure we have css_sets for all the tasks we're
> > + * going to move -before- we actually start moving them, so that in
> > + * case we get an ENOMEM we can bail out before making any changes.
> > + */
> > + struct list_head newcg_list;
> > + struct cg_list_entry *cg_entry;
> > +
> > + /* first, make sure this came from a valid tgid */
> > + if (!thread_group_leader(leader))
> > + return -EINVAL;
> > + /*
> > + * check that we can legitimately attach to the cgroup.
> > + */
> > + for_each_subsys(root, ss) {
> > + if (ss->can_attach) {
> > + retval = ss->can_attach(ss, cgrp, leader);
> > + if (retval)
> > + return retval;
> > + }
> > }
>
> So the semantics of ->can_attach() becomes: if called for a thread group leader,
> the result should be valid for the whole thread group, even if only the thread
> group leader is being attached. This looks a bit fuzzy and thus not desirable.
> Why not check ->can_attach() for all threads (and lock cgroup_fork_mutex
> earlier)?

Ok I've read the next patch. Patch 6 should really go before this one, both for
better understanding and safety.

Thanks,

Louis

--
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes



2009-07-24 15:50:47

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Thu, Jul 23, 2009 at 08:22:00PM -0700, Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a rwsem that's taken for
> reading in the fork() path to prevent newly forking threads within the

There is much ado about not taking additional "global locks" in fork()
paths.

* The fork and exit callbacks cgroup_fork() and cgroup_exit(), don't
* (usually) take cgroup_mutex. These are the two most performance
* critical pieces of code here.
...

and as I recall cgroup_fork() doesn't ever take cgroup_mutex because it is
so performance critical. Assuming the above comments in kernel/cgroup.c
are correct then this patch adds a performance regression by introducing a
global mutex in the fork path, doesn't it?

Cheers,
-Matt Helsley

2009-07-24 16:01:53

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 8:50 AM, Matt Helsley<[email protected]> wrote:
>
> There is much ado about not taking additional "global locks" in fork()
> paths.
>
> * The fork and exit callbacks cgroup_fork() and cgroup_exit(), don't
> * (usually) take cgroup_mutex. These are the two most performance
> * critical pieces of code here.
> ...
>
> and as I recall cgroup_fork() doesn't ever take cgroup_mutex because it is
> so performance critical.

cgroup_mutex is a much bigger and heavier mutex than the new rwsem
being introduced in this patch. It's sort of the BKL of cgroups,
although where possible I'm encouraging use of finer-grained
alternatives (such as subsystem-specific locks, the per-hierarchy
lock, etc).

> Assuming the above comments in kernel/cgroup.c
> are correct then this patch adds a performance regression by introducing a
> global mutex in the fork path, doesn't it?

Yes, although to what magnitude isn't clear.

Alternatives that we looked at were:

- add a clone_rwsem to task_struct, and require that a clone operation
that's adding to the same thread group take a read lock on the
leader's clone_rwsem; then the effect would be localised to a single
process; but for a system that has one big multi-threaded server on
it, the effect would still be similar to a global lock

- move the housekeeping done by cgroup_fork() inside the tasklist_lock
critical section in do_fork(); then cgroup_attach_proc() can rely on
the existing global tasklist_lock to provide the necessary
synchronization, rather than introducing a second global lock; the
downside is that it slightly increases the size of the section where
tasklist_lock is held for write.

Paul

2009-07-24 17:23:32

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 09:01:46AM -0700, Paul Menage wrote:
> On Fri, Jul 24, 2009 at 8:50 AM, Matt Helsley<[email protected]> wrote:
> >
> > There is much ado about not taking additional "global locks" in fork()
> > paths.
> >
> > * The fork and exit callbacks cgroup_fork() and cgroup_exit(), don't
> > * (usually) take cgroup_mutex. These are the two most performance
> > * critical pieces of code here.
> > ...
> >
> > and as I recall cgroup_fork() doesn't ever take cgroup_mutex because it is
> > so performance critical.
>
> cgroup_mutex is a much bigger and heavier mutex than the new rwsem
> being introduced in this patch. It's sort of the BKL of cgroups,
> although where possible I'm encouraging use of finer-grained
> alternatives (such as subsystem-specific locks, the per-hierarchy
> lock, etc).
>
> > Assuming the above comments in kernel/cgroup.c
> > are correct then this patch adds a performance regression by introducing a
> > global mutex in the fork path, doesn't it?
>
> Yes, although to what magnitude isn't clear.

OK.

> Alternatives that we looked at were:
>
> - add a clone_rwsem to task_struct, and require that a clone operation
> that's adding to the same thread group take a read lock on the
> leader's clone_rwsem; then the effect would be localised to a single
> process; but for a system that has one big multi-threaded server on
> it, the effect would still be similar to a global lock

Except most processes aren't big multi-threaded servers and not everyone
runs such processes. They'll experience the overhead of a global lock
when they don't have to. Again, it's a question of magnitudes we don't
know I think.

>
> - move the housekeeping done by cgroup_fork() inside the tasklist_lock
> critical section in do_fork(); then cgroup_attach_proc() can rely on
> the existing global tasklist_lock to provide the necessary
> synchronization, rather than introducing a second global lock; the
> downside is that it slightly increases the size of the section where
> tasklist_lock is held for write.

Well, I imagine holding tasklist_lock is worse than cgroup_mutex in some
ways since it's used even more widely. Makes sense not to use it here..

Cheers,
-Matt Helsley

2009-07-24 17:47:30

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 10:23 AM, Matt Helsley<[email protected]> wrote:
>
> Well, I imagine holding tasklist_lock is worse than cgroup_mutex in some
> ways since it's used even more widely. Makes sense not to use it here..

Just to clarify - the new "procs" code doesn't use cgroup_mutex for
its critical section, it uses a new cgroup_fork_mutex, which is only
taken for write during cgroup_proc_attach() (after all setup has been
done, to ensure that no new threads are created while we're updating
all the existing threads). So in general there'll be zero contention
on this lock - the cost will be the cache misses due to the rwlock
bouncing between the different CPUs that are taking it in read mode.

What happened to the big-reader lock concept from 2.4.x? That would be
applicable here - minimizing the overhead on the critical path when
the write operation is expected to be very rare.

Paul

2009-07-24 19:05:50

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 6:08 AM, Louis Rilling<[email protected]> wrote:
> Patch 6 should really go before this one, both for
> better understanding and safety.
>
> Thanks,
>
> Louis
>

I agree. I wrote patch 6 after patch 5 and didn't bother reworking the
code a bit to make it come first. I'll switch them around in the next
submission.

2009-07-24 20:53:58

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 1:47 PM, Paul Menage<[email protected]> wrote:
> On Fri, Jul 24, 2009 at 10:23 AM, Matt Helsley<[email protected]> wrote:
>>
>> Well, I imagine holding tasklist_lock is worse than cgroup_mutex in some
>> ways since it's used even more widely. Makes sense not to use it here..
>
> Just to clarify - the new "procs" code doesn't use cgroup_mutex for
> its critical section, it uses a new cgroup_fork_mutex, which is only
> taken for write during cgroup_proc_attach() (after all setup has been
> done, to ensure that no new threads are created while we're updating
> all the existing threads). So in general there'll be zero contention
> on this lock - the cost will be the cache misses due to the rwlock
> bouncing between the different CPUs that are taking it in read mode.

Right. The different options so far are:

Global rwsem: only needs one lock, but prevents all forking when a
write is in progress. It should be quick enough, if it's just "iterate
down the threadgroup list in O(n)". In the good case, fork() slows
down by a cache miss when taking the lock in read mode.
Threadgroup-local rwsem: Needs adding a field to task_struct. Only
forks within the same threadgroup would block on a write to the procs
file, and the zero-contention case is the same as before.
Using tasklist_lock: Currently, the call to cgroup_fork() (which
starts the race) is very far above where tasklist_lock is taken in
fork, so taking tasklist_lock earlier is very infeasible. Could
cgroup_fork() be moved downwards to inside it, and if so, how much
restructuring would be needed? Even if so, this still adds stuff that
is being done (unnecessarily) while holding a global mutex.

> What happened to the big-reader lock concept from 2.4.x? That would be
> applicable here - minimizing the overhead on the critical path when
> the write operation is expected to be very rare.

Seems like a good application, but it appears to be gone in the
current kernel. Also, from my understanding, it would have to be a
global (or at least not threadgroup-local) lock, no? Were we to use
this and try to write to the procs file while a bunch of forks are in
progress, how long would the write operation have to block? (that is,
at least with a rwsem, the writing thread seems to get the lock rather
quickly when there's contention.) Depending on just how slow
write-locking one of these is, it might kill any hopes of performing a
write while forks are in progress.
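
A minimal userspace sketch of the pattern all three options share (plain C with
pthreads; purely illustrative, not kernel code): the hot fork path takes the
lock shared, the rare whole-group move takes it exclusive.

#include <pthread.h>
#include <stdio.h>

/* stand-ins for the threadgroup size and the fork-path lock */
static pthread_rwlock_t group_fork_lock = PTHREAD_RWLOCK_INITIALIZER;
static int nthreads = 1;

static void fork_path(void)
{
        /* analogue of cgroup_fork()..cgroup_post_fork(): hold the lock
         * shared so a concurrent group move sees a stable thread list */
        pthread_rwlock_rdlock(&group_fork_lock);
        nthreads++;
        pthread_rwlock_unlock(&group_fork_lock);
}

static void move_whole_group(void)
{
        /* analogue of the write side during the procs-file write: no new
         * thread can appear while every existing one is migrated */
        pthread_rwlock_wrlock(&group_fork_lock);
        printf("migrating %d threads atomically\n", nthreads);
        pthread_rwlock_unlock(&group_fork_lock);
}

int main(void)
{
        fork_path();
        fork_path();
        move_whole_group();
        return 0;
}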

2009-07-24 21:07:01

by Matt Helsley

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 01:53:53PM -0700, Benjamin Blum wrote:
> On Fri, Jul 24, 2009 at 1:47 PM, Paul Menage<[email protected]> wrote:
> > On Fri, Jul 24, 2009 at 10:23 AM, Matt Helsley<[email protected]> wrote:
> >>
> >> Well, I imagine holding tasklist_lock is worse than cgroup_mutex in some
> >> ways since it's used even more widely. Makes sense not to use it here..
> >
> > Just to clarify - the new "procs" code doesn't use cgroup_mutex for
> > its critical section, it uses a new cgroup_fork_mutex, which is only
> > taken for write during cgroup_proc_attach() (after all setup has been
> > done, to ensure that no new threads are created while we're updating
> > all the existing threads). So in general there'll be zero contention
> > on this lock - the cost will be the cache misses due to the rwlock
> > bouncing between the different CPUs that are taking it in read mode.
>
> Right. The different options so far are:
>
> Global rwsem: only needs one lock, but prevents all forking when a
> write is in progress. It should be quick enough, if it's just "iterate
> down the threadgroup list in O(n)". In the good case, fork() slows
> down by a cache miss when taking the lock in read mode.

I noticed your point about only one process contending for write on
the new semaphore since cgroup_mutex is also held on the write side.
However won't there be cacheline bouncing as lots of readers contend not
for the read side of the lock itself but the cacheline needed to take it?

> Threadgroup-local rwsem: Needs adding a field to task_struct. Only
> forks within the same threadgroup would block on a write to the procs
> file, and the zero-contention case is the same as before.

This seems like it would be better.

> Using tasklist_lock: Currently, the call to cgroup_fork() (which
> starts the race) is very far above where tasklist_lock is taken in
> fork, so taking tasklist_lock earlier is very infeasible. Could
> cgroup_fork() be moved downwards to inside it, and if so, how much
> restructuring would be needed? Even if so, this still adds stuff that
> is being done (unnecessarily) while holding a global mutex.

Yup.

> > What happened to the big-reader lock concept from 2.4.x? That would be
> > applicable here - minimizing the overhead on the critical path when
> > the write operation is expected to be very rare.

Supplanted by RCU perhaps? *shrug*

Cheers,
-Matt Helsley

2009-07-24 21:36:23

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 2:06 PM, Matt Helsley<[email protected]> wrote:
>>
>> Global rwsem: only needs one lock, but prevents all forking when a
>> write is in progress. It should be quick enough, if it's just "iterate
>> down the threadgroup list in O(n)". In the good case, fork() slows
>> down by a cache miss when taking the lock in read mode.
>
> I noticed your point about only one process contending for write on
> the new semaphore since cgroup_mutex is also held on the write side.
> However won't there be cacheline bouncing as lots of readers contend not
> for the read side of the lock itself but the cacheline needed to take it?

Yes, that's the "cache miss when taking the lock in read mode"
referred to by Ben in the paragraph above yours.

Paul

2009-07-24 21:52:27

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 3:02 AM, Louis Rilling<[email protected]> wrote:
> On 23/07/09 20:22 -0700, Ben Blum wrote:
>> @@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
>>
>> # echo 0 > tasks
>>
>> +The cgroup.procs file is useful for managing all tasks in a threadgroup at
>> +once. It works the same way as the tasks file, but moves all tasks in the
>> +threadgroup with the specified tgid.
>> +
>> +Writing the pid of a task that's not the threadgroup leader (i.e., a pid
>> +that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
>> +attach the writing task and all tasks in its threadgroup, but is invalid if
>> +the writing task is not the leader of the threadgroup.
>> +
>
> This restriction sounds unfortunate and I'm not sure that there are good reasons
> for it (see below).

Would it be preferable to allow any thread in a threadgroup to move
the whole group to a new cgroup? Would that undermine the desired
restriction of the tasks file in which a thread needs to have
sufficient perms to move a thread that's not itself, or is the task
file's restriction there undesirable in the case of threadgroups?

>> + /*
>> + * step 2: now that we're guaranteed success wrt the css_sets, proceed
>
> I don't see how css_sets are guaranteed while cgroup_fork_mutex is not held and
> thus does not prevent new threads from being created right now. Could you
> elaborate on that?

Prefetching the css sets is independent of the fork lock/race issue.
The idea is that we build a list, kept locally, that has references on
all the css_sets we'll need to migrate each thread to the new cgroup.
Since ordinarily we might need to malloc a new css_set for a thread
before moving it, and it's possible that that could fail, we need to
do allocations for all threads to be moved before committing any of
them. As long as we have the list of prefetched css_sets, they'll stay
there, and at the end, we drop the extra references we took on them to
make that guarantee when tearing down the list.
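
A compressed sketch of that two-phase idea (plain C with made-up names, not the
actual cgroup code): everything that can fail is allocated up front, so the
commit phase only reassigns pointers and cannot fail part-way through.

#include <stdlib.h>

struct prefetched {
        void *res;                      /* stands in for a css_set reference */
        struct prefetched *next;
};

/* phase 1: grab everything that could fail; on error nothing else has
 * been touched yet, so the caller just tears down the partial list */
static int prefetch_all(int nthreads, struct prefetched **list)
{
        for (int i = 0; i < nthreads; i++) {
                struct prefetched *e = malloc(sizeof(*e));
                if (!e)
                        return -1;
                e->res = malloc(64);    /* stands in for find_css_set() */
                if (!e->res) {
                        free(e);
                        return -1;
                }
                e->next = *list;
                *list = e;
        }
        return 0;
}

int main(void)
{
        struct prefetched *list = NULL;
        /* phase 2 ("commit") would walk the group and attach, never
         * allocating; the sketch skips it and leaks the list on exit */
        return prefetch_all(4, &list) ? 1 : 0;
}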

>
>> + * to move all tasks to the new cgroup. the only fail case henceforth
>> + * is if the threadgroup leader has PF_EXITING set (in which case all
>> + * the other threads get killed) - if other threads happen to be
>
> This statement is wrong. A thread group leader can have PF_EXITING (and even
> become zombie) while other sub-threads continue their execution. For instance it
> is perfectly valid for a thread group leader to call sys_exit(), and no other
> thread will be affected.

Is that the case? Then, what becomes of his task_struct when he's
gone, since presumably all the sub-threads will need it. Does it just
hang around?

If so, would the desired behaviour be to proceed with moving all
sub-threads if the leader is exiting? (That's not a very big change
code-wise)

2009-07-24 21:57:41

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 2:52 PM, Benjamin Blum<[email protected]> wrote:
>> I don't see how css_sets are guaranteed while cgroup_fork_mutex is not held and
>> thus does not prevent new threads from being created right now. Could you
>> elaborate on that?
>
> Prefetching the css sets is independent of the fork lock/race issue.
> The idea is that we build a list, kept locally, that has references on
> all the css_sets we'll need to migrate each thread to the new cgroup.
> Since ordinarily we might need to malloc a new css_set for a thread
> before moving it, and it's possible that that could fail, we need to
> do allocations for all threads to be moved before committing any of
> them. As long as we have the list of prefetched css_sets, they'll stay
> there, and at the end, we drop the extra references we took on them to
> make that guarantee when tearing down the list.

And more specifically, since only the holder of cgroup_mutex can move
a thread to a new cgroup (and hence a potentially new unique css_set),
we know that once we've run through all the threads in the
thread_group and verified that we have the appropriate pre-fetched
css_set objects for all of them, it doesn't matter if any new threads
are created - they'll share one of the pre-fetched css_sets.

Paul

2009-07-27 05:15:52

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 3/6] Quick vmalloc vs kmalloc fix to the case where array size is too large

Ben Blum wrote:
> Quick vmalloc vs kmalloc fix to the case where array size is too large
>
> Separates all pidlist allocation requests to a separate function that judges
> based on the requested size whether or not the array needs to be vmalloced or
> can be gotten via kmalloc, and similar for kfree/vfree. Should be replaced
> entirely with a kernel-wide solution to this general problem.
>
> Depends on cgroup-pidlist-namespace.patch, cgroup-procs.patch
>
> Signed-off-by: Ben Blum <[email protected]>
>

We'll either use flex_array or implement Paul's proposal, so I think
we can drop this patch?

2009-07-27 15:50:00

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 3/6] Quick vmalloc vs kmalloc fix to the case where array size is too large

On Mon, Jul 27, 2009 at 1:14 AM, Li Zefan<[email protected]> wrote:
> Ben Blum wrote:
>> Quick vmalloc vs kmalloc fix to the case where array size is too large
>>
>> Separates all pidlist allocation requests to a separate function that judges
>> based on the requested size whether or not the array needs to be vmalloced or
>> can be gotten via kmalloc, and similar for kfree/vfree. Should be replaced
>> entirely with a kernel-wide solution to this general problem.
>>
>> Depends on cgroup-pidlist-namespace.patch, cgroup-procs.patch
>>
>> Signed-off-by: Ben Blum <[email protected]>
>>
>
> We'll either use flex_array or implement Paul's proposal, so I think
> we can drop this patch?
>
>

Depending on how long they'll take, yes. This is meant to be an
intermediate correctness patch.

2009-07-29 00:23:10

by Ben Blum

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

On Fri, Jul 24, 2009 at 2:52 PM, Benjamin Blum<[email protected]> wrote:
> On Fri, Jul 24, 2009 at 3:02 AM, Louis Rilling<[email protected]> wrote:
>> On 23/07/09 20:22 -0700, Ben Blum wrote:
>>> @@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
>>>
>>> # echo 0 > tasks
>>>
>>> +The cgroup.procs file is useful for managing all tasks in a threadgroup at
>>> +once. It works the same way as the tasks file, but moves all tasks in the
>>> +threadgroup with the specified tgid.
>>> +
>>> +Writing the pid of a task that's not the threadgroup leader (i.e., a pid
>>> +that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
>>> +attach the writing task and all tasks in its threadgroup, but is invalid if
>>> +the writing task is not the leader of the threadgroup.
>>> +
>>
>> This restriction sounds unfortunate and I'm not sure that there are good reasons
>> for it (see below).
>
> Would it be preferable to allow any thread in a threadgroup to move
> the whole group to a new cgroup? Would that undermine the desired
> restriction of the tasks file in which a thread needs to have
> sufficient perms to move a thread that's not itself, or is the task
> file's restriction there undesirable in the case of threadgroups?

Still wondering what to do about this. Noting that this restriction
approximately parallels the behaviour you'd get from using the tasks
file, I think it'd be safer to leave it like this and possibly lift the
restriction in the future, rather than lift it now and decide later that
we want to re-impose it.

>>> + * to move all tasks to the new cgroup. the only fail case henceforth
>>> + * is if the threadgroup leader has PF_EXITING set (in which case all
>>> + * the other threads get killed) - if other threads happen to be
>>
>> This statement is wrong. A thread group leader can have PF_EXITING (and even
>> become zombie) while other sub-threads continue their execution. For instance it
>> is perfectly valid for a thread group leader to call sys_exit(), and no other
>> thread will be affected.
>
> Is that the case? Then, what becomes of his task_struct when he's
> gone, since presumably all the sub-threads will need it. Does it just
> hang around?
>
> If so, would the desired behaviour be to proceed with moving all
> sub-threads if the leader is exiting? (That's not a very big change
> code-wise)
>

I've fixed this, and also implemented the rwsem as threadgroup-local.
If there's no more discussion on this version of the patch, I'll send
out the revised series soon.

2009-11-09 17:07:53

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a rwsem that's taken for
> reading in the fork() path to prevent newly forking threads within the
> threadgroup from "escaping" while moving is in progress.
>
> Signed-off-by: Ben Blum <[email protected]>
>
[ cut ]
> /**
> + * cgroup_fork_failed - undo operations for fork failure
> + * @tsk: pointer to task_struct of exiting process
> + * @run_callback: run exit callbacks?
> + *
> + * Description: Undo cgroup operations after cgroup_fork in fork failure.
> + *
> + * We release the read lock that was taken in cgroup_fork(), since it is
> + * supposed to be dropped in cgroup_post_fork in the success case. The other
> + * thing that wants to be done is detaching the failed child task from the
> + * cgroup, so we wrap cgroup_exit.
> + */
> +void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks)
> +{
> + up_read(&cgroup_fork_mutex);
> + cgroup_exit(tsk, run_callbacks);
> +}
> +
> +/**
> * cgroup_clone - clone the cgroup the given subsystem is attached to
> * @tsk: the task to be moved
> * @subsys: the given subsystem
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 926c117..027ec16 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
> mpol_put(p->mempolicy);
> bad_fork_cleanup_cgroup:
> #endif
> - cgroup_exit(p, cgroup_callbacks_done);
> + cgroup_fork_failed(p, cgroup_callbacks_done);
> delayacct_tsk_free(p);
> if (p->binfmt)
> module_put(p->binfmt->module);
>
Hi Ben,

The current code (with or without your patch) may lead to an error
because the fork hook can fail and the exit hook is called in all the
cases making the fork / exit asymmetric.

I will take the usual example with a cgroup with a counter of tasks, in
the fork hook it increments the counter, in the exit hook it decrements
the counter. There is one process in the cgroup, hence the counter value
is 1. Now this process forks and the fork hook fails before the task
counter is incremented to 2, this is not detected in copy process
function because the cgroup_fork_callbacks does not return an error, so
the process will be forked without error and when the process will exits
the counter will be decremented reaching 0 instead of 1.
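
To make the drift concrete, a tiny illustration (plain C with made-up names;
the fail-able fork hook is hypothetical here):

#include <stdio.h>

static int task_count = 1;      /* one task already in the cgroup */

/* hypothetical fail-able fork hook: on failure the counter is not bumped */
static int fork_hook(int fail)
{
        if (fail)
                return -1;
        task_count++;
        return 0;
}

/* the exit hook runs unconditionally when the child exits */
static void exit_hook(void)
{
        task_count--;
}

int main(void)
{
        fork_hook(1);   /* the hook fails, but the fork itself still succeeds */
        exit_hook();    /* the child exits later */
        printf("count = %d, expected 1\n", task_count);  /* prints 0 */
        return 0;
}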

IMO, the correct fix should be to make the fork hook to return an error
and have the cgroup to call the exit method of the subsystem where the
fork hook was called. For example, there are 10 subsystems using the
fork / exit hooks, when a process forks, the fork callback is
called for these subsystems but, let's say, the 3rd fails. So we undo,
by calling the exit hooks of the first two.
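
A rough sketch of that rollback (plain C with made-up types; it assumes the
hypothetical int-returning fork hook, which is not what mainline has today):

struct subsys {
        int  (*fork)(void *child);      /* hypothetical fail-able fork hook */
        void (*exit)(void *child);
};

static int run_fork_hooks(struct subsys *ss, int count, void *child)
{
        int i, err = 0;

        for (i = 0; i < count; i++) {
                if (ss[i].fork) {
                        err = ss[i].fork(child);
                        if (err)
                                break;
                }
        }
        if (err) {
                /* undo: call the exit hooks of the subsystems whose fork
                 * hooks already ran, in reverse order */
                while (--i >= 0)
                        if (ss[i].exit)
                                ss[i].exit(child);
        }
        return err;
}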

I wrote a patchset to consolidate the hooks called in the cgroup for
fork and exit, and one of them does a rollback for the fork hook when an
error occurs. I attached the patch as an example.

Thanks
-- Daniel
>



Attachments:
0001-cgroup-make-fork-hook-to-return-an-error.patch (6.95 kB)

2009-11-10 01:30:42

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

Daniel Lezcano wrote:
> Ben Blum wrote:
>> Makes procs file writable to move all threads by tgid at once
>>
>> This patch adds functionality that enables users to move all threads in a
>> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
>> file. This current implementation makes use of a rwsem that's taken for
>> reading in the fork() path to prevent newly forking threads within the
>> threadgroup from "escaping" while moving is in progress.
>>
>> Signed-off-by: Ben Blum <[email protected]>
>>
> [ cut ]
>> /**
>> + * cgroup_fork_failed - undo operations for fork failure
>> + * @tsk: pointer to task_struct of exiting process
>> + * @run_callback: run exit callbacks?
>> + *
>> + * Description: Undo cgroup operations after cgroup_fork in fork
>> failure.
>> + *
>> + * We release the read lock that was taken in cgroup_fork(), since it is
>> + * supposed to be dropped in cgroup_post_fork in the success case.
>> The other
>> + * thing that wants to be done is detaching the failed child task
>> from the
>> + * cgroup, so we wrap cgroup_exit.
>> + */
>> +void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks)
>> +{
>> + up_read(&cgroup_fork_mutex);
>> + cgroup_exit(tsk, run_callbacks);
>> +}
>> +
>> +/**
>> * cgroup_clone - clone the cgroup the given subsystem is attached to
>> * @tsk: the task to be moved
>> * @subsys: the given subsystem
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 926c117..027ec16 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
>> mpol_put(p->mempolicy);
>> bad_fork_cleanup_cgroup:
>> #endif
>> - cgroup_exit(p, cgroup_callbacks_done);
>> + cgroup_fork_failed(p, cgroup_callbacks_done);
>> delayacct_tsk_free(p);
>> if (p->binfmt)
>> module_put(p->binfmt->module);
>>
> Hi Ben,
>
> The current code (with or without your patch) may lead to an error
> because the fork hook can fail and the exit hook is called in all the
> cases making the fork / exit asymmetric.
>

The _current_ code won't lead to this error, because the fork hook
can't fail.

> I will take the usual example with a cgroup with a counter of tasks, in
> the fork hook it increments the counter, in the exit hook it decrements
> the counter. There is one process in the cgroup, hence the counter value
> is 1. Now this process forks and the fork hook fails before the task
> counter is incremented to 2, this is not detected in copy process
> function because the cgroup_fork_callbacks does not return an error, so
> the process will be forked without error and when the process will exits
> the counter will be decremented reaching 0 instead of 1.
>
> IMO, the correct fix should be to make the fork hook to return an error
> and have the cgroup to call the exit method of the subsystem where the
> fork hook was called. For example, there are 10 subsystems using the
> fork / exit hooks, when a process forks, the fork callback is
> called for these subsystems but, let's say, the 3rd fails. So we undo,
> by calling the exit hooks of the first two.
>
> I wrote a patchset to consolidate the hooks called in the cgroup for
> fork and exit, and one of them does a rollback for the fork hook when an
> error occurs. I attached the patch as an example.
>

I'd like to see this patch sent with another patch that needs this
fail-able fork() hook.

Note this patch is not doing a _fix_, but does an extension. And
for now, this extension is not needed.

2009-11-10 10:26:50

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

Li Zefan wrote:
> Daniel Lezcano wrote:
>
>> Ben Blum wrote:
>>
>>> Makes procs file writable to move all threads by tgid at once
>>>
>>> This patch adds functionality that enables users to move all threads in a
>>> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
>>> file. This current implementation makes use of a rwsem that's taken for
>>> reading in the fork() path to prevent newly forking threads within the
>>> threadgroup from "escaping" while moving is in progress.
>>>
>>> Signed-off-by: Ben Blum <[email protected]>
>>>
>>>
>> [ cut ]
>>
>>> /**
>>> + * cgroup_fork_failed - undo operations for fork failure
>>> + * @tsk: pointer to task_struct of exiting process
>>> + * @run_callback: run exit callbacks?
>>> + *
>>> + * Description: Undo cgroup operations after cgroup_fork in fork
>>> failure.
>>> + *
>>> + * We release the read lock that was taken in cgroup_fork(), since it is
>>> + * supposed to be dropped in cgroup_post_fork in the success case.
>>> The other
>>> + * thing that wants to be done is detaching the failed child task
>>> from the
>>> + * cgroup, so we wrap cgroup_exit.
>>> + */
>>> +void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks)
>>> +{
>>> + up_read(&cgroup_fork_mutex);
>>> + cgroup_exit(tsk, run_callbacks);
>>> +}
>>> +
>>> +/**
>>> * cgroup_clone - clone the cgroup the given subsystem is attached to
>>> * @tsk: the task to be moved
>>> * @subsys: the given subsystem
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 926c117..027ec16 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
>>> mpol_put(p->mempolicy);
>>> bad_fork_cleanup_cgroup:
>>> #endif
>>> - cgroup_exit(p, cgroup_callbacks_done);
>>> + cgroup_fork_failed(p, cgroup_callbacks_done);
>>> delayacct_tsk_free(p);
>>> if (p->binfmt)
>>> module_put(p->binfmt->module);
>>>
>>>
>> Hi Ben,
>>
>> The current code (with or without your patch) may lead to an error
>> because the fork hook can fail and the exit hook is called in all the
>> cases making the fork / exit asymmetric.
>>
>>
>
> The _current_ code won't lead to this error, because the fork hook
> can't fail.
>
Right, as no subsystem is using both hooks right now, the bug is never
triggered and the current code won't lead to an error.
But from my POV, there is a bug hidden in a corner waiting for a
subsystem to make use of the fail-able fork / exit :)

>> I will take the usual example with a cgroup with a counter of tasks, in
>> the fork hook it increments the counter, in the exit hook it decrements
>> the counter. There is one process in the cgroup, hence the counter value
>> is 1. Now this process forks and the fork hook fails before the task
>> counter is incremented to 2, this is not detected in copy process
>> function because the cgroup_fork_callbacks does not return an error, so
>> the process will be forked without error and when the process will exits
>> the counter will be decremented reaching 0 instead of 1.
>>
>> IMO, the correct fix should be to make the fork hook to return an error
>> and have the cgroup to call the exit method of the subsystem where the
>> fork hook was called. For example, there are 10 subsystems using the
>> fork / exit hooks, when a process forks, the fork callback is
>> called for these subsystems but, let's say, the 3rd fails. So we undo,
>> by calling the exit hooks of the first two.
>>
>> I wrote a patchset to consolidate the hooks called in the cgroup for
>> fork and exit, and one of them does a rollback for the fork hook when an
>> error occurs. I attached the patch as an example.
>>
>>
>
> I'd like to see this patch sent with another patch that needs this
> fail-able fork() hook.
>
> Note this patch is not doing a _fix_, but does an extension. And
> for now, this extension is not needed.
>
I don't know, maybe it could be interesting to implement that before
more subsystems make use of these hooks.
This is not critical, that can be sent later, separately from this
patchset of course.

Thanks
-- Daniel

2009-11-11 02:08:38

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

>>> Hi Ben,
>>>
>>> The current code (with or without your patch) may lead to an error
>>> because the fork hook can fail and the exit hook is called in all the
>>> cases making the fork / exit asymmetric.
>>>
>>>
>>
>> The _current_ code won't lead to this error, because the fork hook
>> can't fail.
>>
> Right, as no subsystem is using both hooks right now, the bug is never
> triggered and the current code won't lead to an error.
> But from my POV, there is a bug hidden in a corner waiting for a
> subsystem to make use of the fail-able fork / exit :)
>

Actually the freezer subsystem is using the fork hook, but it doesn't
need to be able to fail it.

I don't think we can claim this a bug. If there is a new subsystem
that needs a fail-able fork hook, it has to extend the hook interface
and adjust the code to meet its needs.

We always adjust our code to meet new needs, don't we?

>>> I will take the usual example of a cgroup with a counter of tasks: the
>>> fork hook increments the counter and the exit hook decrements it. There
>>> is one process in the cgroup, hence the counter value is 1. Now this
>>> process forks and the fork hook fails before the task counter is
>>> incremented to 2. This is not detected in copy_process() because
>>> cgroup_fork_callbacks() does not return an error, so the process is
>>> forked without error, and when it exits the counter is decremented,
>>> reaching 0 instead of 1.
>>>
>>> IMO, the correct fix would be to make the fork hook return an error and
>>> have the cgroup core call the exit method of the subsystems whose fork
>>> hook was already called. For example, say there are 10 subsystems using
>>> the fork / exit hooks: when a process forks, the fork callbacks are
>>> called for these subsystems but, let's say, the 3rd one fails. So we
>>> undo by calling the exit hooks of the first two.
>>>
>>> I wrote a patchset to consolidate the hooks the cgroup core calls for
>>> fork and exit, and one of the patches does a rollback for the fork hook
>>> when an error occurs. I attached that patch as an example.
>>>
>>>
>>
>> I'd like to see this patch sent with another patch that needs this
>> fail-able fork() hook.
>>
>> Note this patch is not a _fix_ but an extension. And for now, this
>> extension is not needed.
>>
> I don't know, maybe it would be worth implementing that before more
> subsystems make use of these hooks.
> This is not critical; it can be sent later, separately from this
> patchset of course.
>

We tend to remove code that is not used. For example, we may remove
the subsys->bind() interface, because no one is using it, though it
has been there for years.

So adding things that are not used is normally not good.
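
One possible shape for the rollback Daniel describes above, as a
minimal userspace sketch; the subsystem array, callback names and the
fork_callbacks() helper are hypothetical, not the real
cgroup_fork_callbacks(). If the i-th subsystem's fork hook fails, the
exit hooks of the subsystems that already succeeded are called in
reverse order and the error is propagated:

#include <stdio.h>

#define NR_SUBSYS 3

/* Hypothetical per-subsystem fork/exit callbacks (names are illustrative). */
struct subsys {
	const char *name;
	int (*fork_cb)(struct subsys *ss);
	void (*exit_cb)(struct subsys *ss);
};

static int ok_fork(struct subsys *ss)
{
	printf("%s: fork hook ok\n", ss->name);
	return 0;
}

static int failing_fork(struct subsys *ss)
{
	printf("%s: fork hook fails\n", ss->name);
	return -1;
}

static void undo_exit(struct subsys *ss)
{
	printf("%s: exit hook (rollback)\n", ss->name);
}

static struct subsys subsys[NR_SUBSYS] = {
	{ "first",   ok_fork,      undo_exit },
	{ "second",  ok_fork,      undo_exit },
	{ "counter", failing_fork, undo_exit },
};

/*
 * Fail-able fork callbacks: if the i-th hook fails, call the exit hooks
 * of the subsystems that already ran, in reverse order, then return the
 * error so the caller can abort the fork.
 */
static int fork_callbacks(void)
{
	int i, ret;

	for (i = 0; i < NR_SUBSYS; i++) {
		ret = subsys[i].fork_cb(&subsys[i]);
		if (ret) {
			while (--i >= 0)
				subsys[i].exit_cb(&subsys[i]);
			return ret;
		}
	}
	return 0;
}

int main(void)
{
	return fork_callbacks() ? 1 : 0;
}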

2009-11-11 20:06:47

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once

Li Zefan wrote:
>>>> Hi Ben,
>>>>
>>>> The current code (with or without your patch) may lead to an error
>>>> because the fork hook can fail and the exit hook is called in all the
>>>> cases making the fork / exit asymmetric.
>>>>
>>>>
>>>>
>>> The _current_ code won't lead to this error, because the fork hook
>>> can't fail.
>>>
>>>
>> Right, as no subsystem is using both hooks right now, the bug is never
>> triggered and the current code won't lead to an error.
>> But from my POV, there is a bug hidden in a corner waiting for a
>> subsystem to make use of the fail-able fork / exit :)
>>
>>
>
> Actually the freezer subsystem is using the fork hook, but it doesn't
> need the hook to be able to fail.
>
> I don't think we can call this a bug. If a new subsystem needs a
> fail-able fork hook, it has to extend the hook interface and adjust
> the code to meet its needs.
>
> We always adjust our code to meet new needs, don't we?
>
Sure.

>>>> I will take the usual example of a cgroup with a counter of tasks: the
>>>> fork hook increments the counter and the exit hook decrements it. There
>>>> is one process in the cgroup, hence the counter value is 1. Now this
>>>> process forks and the fork hook fails before the task counter is
>>>> incremented to 2. This is not detected in copy_process() because
>>>> cgroup_fork_callbacks() does not return an error, so the process is
>>>> forked without error, and when it exits the counter is decremented,
>>>> reaching 0 instead of 1.
>>>>
>>>> IMO, the correct fix would be to make the fork hook return an error and
>>>> have the cgroup core call the exit method of the subsystems whose fork
>>>> hook was already called. For example, say there are 10 subsystems using
>>>> the fork / exit hooks: when a process forks, the fork callbacks are
>>>> called for these subsystems but, let's say, the 3rd one fails. So we
>>>> undo by calling the exit hooks of the first two.
>>>>
>>>> I wrote a patchset to consolidate the hooks the cgroup core calls for
>>>> fork and exit, and one of the patches does a rollback for the fork hook
>>>> when an error occurs. I attached that patch as an example.
>>>>
>>>>
>>>>
>>> I'd like to see this patch sent with another patch that needs this
>>> fail-able fork() hook.
>>>
>>> Note this patch is not a _fix_ but an extension. And for now, this
>>> extension is not needed.
>>>
>>>
>> I don't know, maybe it would be worth implementing that before more
>> subsystems make use of these hooks.
>> This is not critical; it can be sent later, separately from this
>> patchset of course.
>>
>>
>
> We tend to remove code that is not used. For example, we may remove
> the subsys->bind() interface, because no one is using it, though it
> has been there for years.
>
> So adding things that are not used is normally not good.
>
Makes sense.

Thanks
-- Daniel