Date: Thu, 10 Apr 2014 05:08:55 +0200
From: "Serge E. Hallyn" <serge@hallyn.com>
To: Tejun Heo <tj@kernel.org>
Cc: gregkh@linuxfoundation.org, rlove@rlove.org,
        containers@lists.linux-foundation.org, serge.hallyn@ubuntu.com,
        kay@vrfy.org, linux-kernel@vger.kernel.org, lennart@poettering.net,
        cgroups@vger.kernel.org, eparis@parisplace.org, john@johnmccutchan.com
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the
 default hierarchy
Message-ID: <20140410030855.GA29658@mail.hallyn.com>
References: <1397056052-2829-1-git-send-email-tj@kernel.org>
 <1397056052-2829-4-git-send-email-tj@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1397056052-2829-4-git-send-email-tj@kernel.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org

Quoting Tejun Heo (tj@kernel.org):
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up.  cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.

Thanks, Tejun.

> * It delivers events by forking and execing a userland binary
>   specified as the release_agent.  This is a long deprecated method of
>   notification delivery.  It's extremely heavy, slow and cumbersome to
>   integrate with larger infrastructure.

(Not seriously worried about this, but it's a point worth considering)
It does have one advantage though:  if the userspace agent goes bad,
cgroups can still be removed on empty.

Do you plan on keeping release-on-empty around?  I assume only for a
while?

Do you think there is any value in having a simpler "remove-when-empty"
file?  Doesn't call out to userspace, just drops the cgroup when there
are no more tasks or sub-cgroups?

> * There is single monitoring point at the root.  There's no way to
>   delegate management of subtree.
> 
> * The event isn't recursive.  It triggers when a cgroup doesn't have
>   any tasks or child cgroups.  Events for internal nodes trigger only
>   after all children are removed.  This again makes it impossible to
>   delegate management of subtree.
> 
> * Events are filtered from the kernel side.  "notify_on_release" file
>   is used to subscribe to or suppres release event and events are not
>   generated if a cgroup becomes empty by moving the last task out of
>   it; however, event is generated if it becomes empty because the last
>   child cgroup is removed.  This is inconsistent, awkward and

Hm, maybe I'm misreading but this doesn't seem right.  If I move
a task into x1 and kill the task, x1 goes away.  Likewise if I
create x1/y1, and rmdir y1, x1 goes away.  I suspect I'm misunderstanding
the case in which you say it doesn't happen?

>   unnecessarily complicated and probably done this way because event
>   delivery itself was expensive.
> 
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not.  Its value is 1 if there is no task in the cgroup and its

I think you meant this backward?  It's 1 if there is *any task in
the cgroup and its descendants, else 0?

> descendants; otherwise, 0, and kernfs_notify() notificaiton is
> triggers when the value changes, which can be monitored through poll
> and [di]notify.
> 
> This is a lot ligther and simpler and trivially allows delegating
> management of subhierarchy - subhierarchy monitoring can block further
> propgation simply by putting itself or another process in the root of
> the subhierarchy and monitor events that it's interested in from there
> without interfering with monitoring higher in the tree.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>

Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>

> Cc: Lennart Poettering <lennart@poettering.net>
> ---
>  include/linux/cgroup.h | 15 ++++++++++++
>  kernel/cgroup.c        | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 76 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index dee6f3c..e45d87f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -154,6 +154,14 @@ struct cgroup {
>  	/* the number of attached css's */
>  	int nr_css;
>  
> +	/*
> +	 * If this cgroup contains any tasks, it contributes one to
> +	 * populated_cnt.  All children with non-zero popuplated_cnt of
> +	 * their own contribute one.  The count is zero iff there's no task
> +	 * in this cgroup or its subtree.
> +	 */
> +	int populated_cnt;
> +
>  	atomic_t refcnt;
>  
>  	/*
> @@ -166,6 +174,7 @@ struct cgroup {
>  	struct cgroup *parent;		/* my parent */
>  	struct kernfs_node *kn;		/* cgroup kernfs entry */
>  	struct kernfs_node *control_kn;	/* kn for "cgroup.subtree_control" */
> +	struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
>  
>  	/*
>  	 * Monotonically increasing unique serial number which defines a
> @@ -264,6 +273,12 @@ enum {
>  	 *
>  	 * - "cgroup.clone_children" is removed.
>  	 *
> +	 * - "cgroup.subtree_populated" is available.  Its value is 0 if
> +	 *   the cgroup and its descendants contain no task; otherwise, 1.
> +	 *   The file also generates kernfs notification which can be
> +	 *   monitored through poll and [di]notify when the value of the
> +	 *   file changes.
> +	 *
>  	 * - If mount is requested with sane_behavior but without any
>  	 *   subsystem, the default unified hierarchy is mounted.
>  	 *
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 4e958c7..17f0a09 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -411,6 +411,43 @@ static struct css_set init_css_set = {
>  
>  static int css_set_count	= 1;	/* 1 for init_css_set */
>  
> +/**
> + * cgroup_update_populated - updated populated count of a cgroup
> + * @cgrp: the target cgroup
> + * @populated: inc or dec populated count
> + *
> + * @cgrp is either getting the first task (css_set) or losing the last.
> + * Update @cgrp->populated_cnt accordingly.  The count is propagated
> + * towards root so that a given cgroup's populated_cnt is zero iff the
> + * cgroup and all its descendants are empty.
> + *
> + * @cgrp's interface file "cgroup.subtree_populated" is zero if
> + * @cgrp->populated_cnt is zero and 1 otherwise.  When @cgrp->populated_cnt
> + * changes from or to zero, userland is notified that the content of the
> + * interface file has changed.  This can be used to detect when @cgrp and
> + * its descendants become populated or empty.
> + */
> +static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
> +{
> +	lockdep_assert_held(&css_set_rwsem);
> +
> +	do {
> +		bool trigger;
> +
> +		if (populated)
> +			trigger = !cgrp->populated_cnt++;
> +		else
> +			trigger = !--cgrp->populated_cnt;
> +
> +		if (!trigger)
> +			break;
> +
> +		if (cgrp->populated_kn)
> +			kernfs_notify(cgrp->populated_kn);
> +		cgrp = cgrp->parent;
> +	} while (cgrp);
> +}
> +
>  /*
>   * hash table for cgroup groups. This improves the performance to find
>   * an existing css_set. This hash doesn't (currently) take into
> @@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
>  		list_del(&link->cgrp_link);
>  
>  		/* @cgrp can't go away while we're holding css_set_rwsem */
> -		if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
> -			if (taskexit)
> -				set_bit(CGRP_RELEASABLE, &cgrp->flags);
> -			check_for_release(cgrp);
> +		if (list_empty(&cgrp->cset_links)) {
> +			cgroup_update_populated(cgrp, false);
> +			if (notify_on_release(cgrp)) {
> +				if (taskexit)
> +					set_bit(CGRP_RELEASABLE, &cgrp->flags);
> +				check_for_release(cgrp);
> +			}
>  		}
>  
>  		kfree(link);
> @@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
>  	link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
>  	link->cset = cset;
>  	link->cgrp = cgrp;
> +
> +	if (list_empty(&cgrp->cset_links))
> +		cgroup_update_populated(cgrp, true);
>  	list_move(&link->cset_link, &cgrp->cset_links);
> +
>  	/*
>  	 * Always add links to the tail of the list so that the list
>  	 * is sorted by order of hierarchy creation
> @@ -2633,6 +2677,12 @@ err_undo_css:
>  	goto out_unlock;
>  }
>  
> +static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
> +{
> +	seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
> +	return 0;
> +}
> +
>  static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  				 size_t nbytes, loff_t off)
>  {
> @@ -2775,6 +2825,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
>  				  NULL, false, key);
>  	if (cft->seq_show == cgroup_subtree_control_show)
>  		cgrp->control_kn = kn;
> +	else if (cft->seq_show == cgroup_subtree_populated_show)
> +		cgrp->populated_kn = kn;
>  	return PTR_ERR_OR_ZERO(kn);
>  }
>  
> @@ -3883,6 +3935,11 @@ static struct cftype cgroup_base_files[] = {
>  		.seq_show = cgroup_subtree_control_show,
>  		.write_string = cgroup_subtree_control_write,
>  	},
> +	{
> +		.name = "cgroup.subtree_populated",
> +		.flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
> +		.seq_show = cgroup_subtree_populated_show,
> +	},
>  
>  	/*
>  	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
> -- 
> 1.9.0
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/