Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752262AbZGXKCZ (ORCPT ); Fri, 24 Jul 2009 06:02:25 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751725AbZGXKCY (ORCPT ); Fri, 24 Jul 2009 06:02:24 -0400 Received: from bohort.kerlabs.com ([62.160.40.57]:32773 "EHLO bohort.kerlabs.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751445AbZGXKCW (ORCPT ); Fri, 24 Jul 2009 06:02:22 -0400 Date: Fri, 24 Jul 2009 12:02:20 +0200 From: Louis Rilling To: Ben Blum Cc: linux-kernel@vger.kernel.org, containers@lists.linux-foundation.org, akpm@linux-foundation.org, serue@us.ibm.com, lizf@cn.fujitsu.com, menage@google.com Subject: Re: [PATCH 5/6] Makes procs file writable to move all threads by tgid at once Message-ID: <20090724100220.GF11101@hawkmoon.kerlabs.com> Mail-Followup-To: Ben Blum , linux-kernel@vger.kernel.org, containers@lists.linux-foundation.org, akpm@linux-foundation.org, serue@us.ibm.com, lizf@cn.fujitsu.com, menage@google.com References: <20090724032033.2463.79256.stgit@hastromil.mtv.corp.google.com> <20090724032200.2463.82408.stgit@hastromil.mtv.corp.google.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=_bohort-21039-1248429709-0001-2" Content-Disposition: inline In-Reply-To: <20090724032200.2463.82408.stgit@hastromil.mtv.corp.google.com> User-Agent: Mutt/1.5.17+20080114 (2008-01-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 17805 Lines: 581 This is a MIME-formatted message. If you see this text it means that your E-mail software does not support MIME-formatted messages. --=_bohort-21039-1248429709-0001-2 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hi Ben, On 23/07/09 20:22 -0700, Ben Blum wrote: > Makes procs file writable to move all threads by tgid at once >=20 > This patch adds functionality that enables users to move all threads in a > threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' > file. This current implementation makes use of a rwsem that's taken for > reading in the fork() path to prevent newly forking threads within the > threadgroup from "escaping" while moving is in progress. >=20 > Signed-off-by: Ben Blum Thank you for working on this interface. This can indeed be very useful. Pl= ease find comments below, hoping that this will help making it better. [...] > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cg= roups.txt > index 6eb1a97..d579346 100644 > --- a/Documentation/cgroups/cgroups.txt > +++ b/Documentation/cgroups/cgroups.txt [...] > @@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0: > =20 > # echo 0 > tasks > =20 > +The cgroup.procs file is useful for managing all tasks in a threadgroup = at > +once. It works the same way as the tasks file, but moves all tasks in the > +threadgroup with the specified tgid. > + > +Writing the pid of a task that's not the threadgroup leader (i.e., a pid > +that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs = will > +attach the writing task and all tasks in its threadgroup, but is invalid= if > +the writing task is not the leader of the threadgroup. > + This restriction sounds unfortunate and I'm not sure that there are good re= asons for it (see below). [...] > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 637a54e..3f8d323 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c [...] > @@ -1330,75 +1421,294 @@ int cgroup_attach_task(struct cgroup *cgrp, stru= ct task_struct *tsk) > } > } > =20 > - task_lock(tsk); > - cg =3D tsk->cgroups; > - get_css_set(cg); > - task_unlock(tsk); > + retval =3D cgroup_task_migrate(cgrp, oldcgrp, tsk, 0); > + if (retval) > + return retval; > + > + for_each_subsys(root, ss) { > + if (ss->attach) > + ss->attach(ss, cgrp, oldcgrp, tsk); > + } > + > + synchronize_rcu(); > + > /* > - * Locate or allocate a new css_set for this task, > - * based on its final set of cgroups > + * wake up rmdir() waiter. the rmdir should fail since the cgroup > + * is no longer empty. > */ > + cgroup_wakeup_rmdir_waiters(cgrp); > + return 0; > +} > + > +/* > + * cgroup_attach_proc works in two stages, the first of which prefetches= all > + * new css_sets needed (to make sure we have enough memory before commit= ting > + * to the move) and stores them in a list, of entries of the following t= ype. > + * TODO: possible optimization: use css_set->rcu_head for chaining inste= ad > + */ > +struct cg_list_entry { > + struct css_set *cg; > + struct list_head links; > +}; > + > +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct= *tsk, > + struct css_set *cg, > + struct list_head *newcg_list) > +{ > + struct css_set *newcg; > + struct cg_list_entry *cg_entry; > + struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT]; > + read_lock(&css_set_lock); > + newcg =3D find_existing_css_set(cg, cgrp, template); > + if (newcg) > + get_css_set(newcg); > + read_unlock(&css_set_lock); > + /* doesn't exist at all? */ > + if (!newcg) > + return 1; > + /* see if it's already in the list */ > + list_for_each_entry(cg_entry, newcg_list, links) { > + if (cg_entry->cg =3D=3D newcg) { > + put_css_set(newcg); > + return 0; > + } > + } > + /* not found */ > + put_css_set(newcg); > + return 1; > +} > + > +/* > + * Find the new css_set and store it in the list in preparation for movi= ng > + * the given task to the given cgroup. Returns 0 on success, -ENOMEM if = we > + * run out of memory. > + */ > +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg, > + struct list_head *newcg_list) > +{ > + struct css_set *newcg; > + struct cg_list_entry *cg_entry; > + /* ensure a new css_set will exist for this thread */ > newcg =3D find_css_set(cg, cgrp); > - put_css_set(cg); > if (!newcg) > return -ENOMEM; > - > - task_lock(tsk); > - if (tsk->flags & PF_EXITING) { > - task_unlock(tsk); > + /* add new element to list */ > + cg_entry =3D kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL); > + if (!cg_entry) { > put_css_set(newcg); > - return -ESRCH; > + return -ENOMEM; > } > - rcu_assign_pointer(tsk->cgroups, newcg); > - task_unlock(tsk); > + cg_entry->cg =3D newcg; > + list_add(&cg_entry->links, newcg_list); > + return 0; > +} > =20 > - /* Update the css_set linked lists if we're using them */ > - write_lock(&css_set_lock); > - if (!list_empty(&tsk->cg_list)) { > - list_del(&tsk->cg_list); > - list_add(&tsk->cg_list, &newcg->tasks); > +/** > + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup > + * @cgrp: the cgroup to attach to > + * @leader: the threadgroup leader task_struct of the group to be attach= ed > + * > + * Call holding cgroup_mutex. Will take task_lock of each thread in lead= er's > + * threadgroup individually in turn. > + */ > +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader) > +{ > + int retval; > + struct cgroup_subsys *ss; > + struct cgroup *oldcgrp; > + struct css_set *oldcg; > + struct cgroupfs_root *root =3D cgrp->root; > + int subsys_id; > + /* threadgroup list cursor */ > + struct task_struct *tsk; > + /* > + * we need to make sure we have css_sets for all the tasks we're > + * going to move -before- we actually start moving them, so that in > + * case we get an ENOMEM we can bail out before making any changes. > + */ > + struct list_head newcg_list; > + struct cg_list_entry *cg_entry; > + > + /* first, make sure this came from a valid tgid */ > + if (!thread_group_leader(leader)) > + return -EINVAL; > + /* > + * check that we can legitimately attach to the cgroup. > + */ > + for_each_subsys(root, ss) { > + if (ss->can_attach) { > + retval =3D ss->can_attach(ss, cgrp, leader); > + if (retval) > + return retval; > + } > } So the semantics of ->can_attach() becomes: if called for a thread group le= ader, the result should be valid for the whole thread group, even if only the thr= ead group leader is being attached. This looks a bit fuzzy and thus not desirab= le. Why not checking ->can_attach() for all threads (and lock cgroup_fork_mutex earlier)? > - write_unlock(&css_set_lock); > =20 > + get_first_subsys(cgrp, NULL, &subsys_id); > +=09 > + /* > + * step 1: make sure css_sets exist for all threads to be migrated. > + * we use find_css_set, which allocates a new one if necessary. > + */ > + INIT_LIST_HEAD(&newcg_list); > + oldcgrp =3D task_cgroup(leader, subsys_id); > + if (cgrp !=3D oldcgrp) { > + /* get old css_set */ > + task_lock(leader); > + if (leader->flags & PF_EXITING) { > + task_unlock(leader); > + retval =3D -ESRCH; > + goto list_teardown; > + } > + oldcg =3D leader->cgroups; > + get_css_set(oldcg); > + task_unlock(leader); > + /* acquire new one */ > + retval =3D css_set_prefetch(cgrp, oldcg, &newcg_list); > + put_css_set(oldcg); > + if (retval) > + goto list_teardown; > + } The only difference between leader's case (above) and other threads' case (below) is the check for PF_EXITING. If all threads were handled equally in= the ->can_attach check, this special handling would be pointless. > +again: > + rcu_read_lock(); > + /* > + * if we need to fetch a new css_set for this task, we must exit the > + * rcu_read section because allocating it can sleep. afterwards, we'll > + * need to restart iteration on the threadgroup list - the whole thing > + * will be O(nm) in the number of threads and css_sets; as the typical > + * case only has one css_set for all of them, usually O(n). > + */ > + list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) { > + /* nothing to do if this task is already in the cgroup */ > + oldcgrp =3D task_cgroup(tsk, subsys_id); > + if (cgrp =3D=3D oldcgrp) > + continue; > + /* get old css_set pointer */ > + task_lock(tsk); > + if (tsk->flags & PF_EXITING) { > + /* ignore this task if it's going away */ > + task_unlock(tsk); > + continue; > + } > + oldcg =3D tsk->cgroups; > + get_css_set(oldcg); > + task_unlock(tsk); > + /* see if the new one for us is already in the list? */ > + retval =3D css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list); > + if (retval) { > + /* we don't already have it. get new one. */ > + rcu_read_unlock(); > + retval =3D css_set_prefetch(cgrp, oldcg, &newcg_list); > + put_css_set(oldcg); > + if (retval) > + goto list_teardown; > + /* begin iteration again. */ > + goto again; > + } else { > + /* was already there, nothing to do. */ > + put_css_set(oldcg); > + } > + } > + rcu_read_unlock(); > + > + /* > + * step 2: now that we're guaranteed success wrt the css_sets, proceed I don't see how css_sets are guaranteed while cgroup_fork_mutex is not held= and thus does not prevent new threads from being created right now. Could you elaborate on that? > + * to move all tasks to the new cgroup. the only fail case henceforth > + * is if the threadgroup leader has PF_EXITING set (in which case all > + * the other threads get killed) - if other threads happen to be This statement is wrong. A thread group leader can have PF_EXITING (and even become zombie) while other sub-threads continue their execution. For instan= ce it is perfectly valid for a thread group leader to call sys_exit(), and no oth= er thread will be affected. > + * exiting, we just ignore them and move on. > + */ > + oldcgrp =3D task_cgroup(leader, subsys_id); > + /* if leader is already there, skip moving him */ > + if (cgrp !=3D oldcgrp) { > + retval =3D cgroup_task_migrate(cgrp, oldcgrp, leader, 1); > + if (retval) { > + BUG_ON(retval !=3D -ESRCH); > + goto list_teardown; > + } > + } > + /* > + * now move all the rest of the threads - need to lock against > + * possible races with fork(). > + */ > + down_write(&cgroup_fork_mutex); > + rcu_read_lock(); > + list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) { > + /* leave current thread as it is if it's already there */ > + oldcgrp =3D task_cgroup(tsk, subsys_id); > + if (cgrp =3D=3D oldcgrp) > + continue; > + /* we don't care whether these threads are exiting */ > + retval =3D cgroup_task_migrate(cgrp, oldcgrp, tsk, 1); > + BUG_ON(retval !=3D 0 && retval !=3D -ESRCH); > + } > + rcu_read_unlock(); > + up_write(&cgroup_fork_mutex); > +=09 > + /* > + * step 3: attach whole threadgroup to each subsystem=20 > + */ > for_each_subsys(root, ss) { > if (ss->attach) > - ss->attach(ss, cgrp, oldcgrp, tsk); > + ss->attach(ss, cgrp, oldcgrp, leader); > } So ->attach called for leader should attach all sub-threads too? Does this = mean that all subsystems should be changed accordingly? Again this looks making ->attach semantics fuzzy, and thus not desirable. Thank, Louis > - set_bit(CGRP_RELEASABLE, &oldcgrp->flags); > - synchronize_rcu(); > - put_css_set(cg); > =20 > /* > - * wake up rmdir() waiter. the rmdir should fail since the cgroup > - * is no longer empty. > + * step 4: success! ...and cleanup > */ > + synchronize_rcu(); > cgroup_wakeup_rmdir_waiters(cgrp); > - return 0; > + retval =3D 0; > +list_teardown: > + /* no longer need the list of css_sets, so get rid of it */ > + while (!list_empty(&newcg_list)) { > + /* pop from the list */ > + cg_entry =3D list_first_entry(&newcg_list, struct cg_list_entry, > + links); > + list_del(&cg_entry->links); > + /* drop the refcount */ > + put_css_set(cg_entry->cg); > + kfree(cg_entry); > + } > + /* done! */ > + return retval; > } > =20 > /* > - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex > - * held. May take task_lock of task > + * Find the task_struct of the task to attach by vpid and pass it along = to the > + * function to attach either it or all tasks in its threadgroup. Will ta= ke > + * cgroup_mutex; may take task_lock of task. > */ > -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid) > +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, > + int attach(struct cgroup *, > + struct task_struct *)) > { > struct task_struct *tsk; > const struct cred *cred =3D current_cred(), *tcred; > int ret; > =20 > + if (!cgroup_lock_live_group(cgrp)) > + return -ENODEV; > + > if (pid) { > rcu_read_lock(); > tsk =3D find_task_by_vpid(pid); > if (!tsk || tsk->flags & PF_EXITING) { > rcu_read_unlock(); > + cgroup_unlock(); > return -ESRCH; > } > - > + /* > + * even if we're attaching all tasks in the thread group, we > + * only need to check permissions on the group leader, because > + * even if another task has different permissions, the group > + * leader will have sufficient access to change it. > + */ > tcred =3D __task_cred(tsk); > if (cred->euid && > cred->euid !=3D tcred->uid && > cred->euid !=3D tcred->suid) { > rcu_read_unlock(); > + cgroup_unlock(); > return -EACCES; > } > get_task_struct(tsk); > @@ -1408,19 +1718,25 @@ static int attach_task_by_pid(struct cgroup *cgrp= , u64 pid) > get_task_struct(tsk); > } > =20 > - ret =3D cgroup_attach_task(cgrp, tsk); > + /* > + * Note that the check for whether the task is its threadgroup leader > + * is done in cgroup_attach_proc. This means that writing 0 to the > + * procs file will only work if the writing task is the leader. > + */ > + ret =3D attach(cgrp, tsk); > put_task_struct(tsk); > + cgroup_unlock(); > return ret; > } > =20 > static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u= 64 pid) > { > - int ret; > - if (!cgroup_lock_live_group(cgrp)) > - return -ENODEV; > - ret =3D attach_task_by_pid(cgrp, pid); > - cgroup_unlock(); > - return ret; > + return attach_task_by_pid(cgrp, pid, cgroup_attach_task); > +} > + > +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u= 64 tgid) > +{ > + return attach_task_by_pid(cgrp, tgid, cgroup_attach_proc); > } > =20 > /** > @@ -2580,9 +2896,9 @@ static struct cftype files[] =3D { > { > .name =3D CGROUP_FILE_GENERIC_PREFIX "procs", > .open =3D cgroup_procs_open, > - /* .write_u64 =3D cgroup_procs_write, TODO */ > + .write_u64 =3D cgroup_procs_write, > .release =3D cgroup_pidlist_release, > - .mode =3D S_IRUGO, > + .mode =3D S_IRUGO | S_IWUSR, > }, > { > .name =3D "notify_on_release", > @@ -3185,6 +3501,7 @@ static struct file_operations proc_cgroupstats_oper= ations =3D { > */ > void cgroup_fork(struct task_struct *child) > { > + down_read(&cgroup_fork_mutex); > task_lock(current); > child->cgroups =3D current->cgroups; > get_css_set(child->cgroups); > @@ -3231,6 +3548,7 @@ void cgroup_post_fork(struct task_struct *child) > task_unlock(child); > write_unlock(&css_set_lock); > } > + up_read(&cgroup_fork_mutex); > } > /** > * cgroup_exit - detach cgroup from exiting task > @@ -3302,6 +3620,24 @@ void cgroup_exit(struct task_struct *tsk, int run_= callbacks) > } > =20 > /** > + * cgroup_fork_failed - undo operations for fork failure > + * @tsk: pointer to task_struct of exiting process > + * @run_callback: run exit callbacks? > + * > + * Description: Undo cgroup operations after cgroup_fork in fork failure. > + * > + * We release the read lock that was taken in cgroup_fork(), since it is > + * supposed to be dropped in cgroup_post_fork in the success case. The o= ther > + * thing that wants to be done is detaching the failed child task from t= he > + * cgroup, so we wrap cgroup_exit. > + */ > +void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks) > +{ > + up_read(&cgroup_fork_mutex); > + cgroup_exit(tsk, run_callbacks); > +} > + > +/** > * cgroup_clone - clone the cgroup the given subsystem is attached to > * @tsk: the task to be moved > * @subsys: the given subsystem > diff --git a/kernel/fork.c b/kernel/fork.c > index 926c117..027ec16 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy: > mpol_put(p->mempolicy); > bad_fork_cleanup_cgroup: > #endif > - cgroup_exit(p, cgroup_callbacks_done); > + cgroup_fork_failed(p, cgroup_callbacks_done); > delayacct_tsk_free(p); > if (p->binfmt) > module_put(p->binfmt->module); >=20 > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers --=20 Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes --=_bohort-21039-1248429709-0001-2 Content-Type: application/pgp-signature; name="signature.asc" Content-Transfer-Encoding: 7bit Content-Description: Digital signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAkpphqsACgkQVKcRuvQ9Q1QgKgCgwvL8vaTC/lhzV7sRNrX1x7uc AN0AniCAitDGuXRA8SIdJTJqSmDquH0/ =RpVA -----END PGP SIGNATURE----- --=_bohort-21039-1248429709-0001-2-- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/