2014-04-14 21:44:14

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated, v2

Hello,

This is v2 of cgroup.populated patchset changes from v1[L] are,

* Rebased on top of v3.15-rc1

* Patch description updated for the third patch as per Serge.

(inotify folks, can you please review the second patch which hooks up
kernfs_notify() with fsnotify? It works but directly invoking
fsnotify_parent() and fsnotify() feels a bit dirty. Maybe something
like fsnotify_notify_modify_by_dentry() would make more sense?)

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues. This patchset implements a replacement
mechanism "cgroup.subtree_populated" which can be used to monitor
whether the cgroup's subhierarchy has tasks in it or not and triggers
poll and [di]notify events when its content changes.

This patchset contains the following three patches.

0001-kernfs-implement-kernfs_root-supers-list.patch
0002-kernfs-make-kernfs_notify-trigger-inotify-events-too.patch
0003-cgroup-implement-cgroup.subtree_populated-for-the-de.patch

0001-0002 add [di]notify notification to kernfs_notify().

0003 implements cgroup.subtree_populated.

This patchset is on top of "cgroup: implement unified hierarchy, v2"
patchset[1] and availalbe in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated

diffstat follows.

fs/kernfs/dir.c | 1
fs/kernfs/file.c | 41 +++++++++++++++++++++++----
fs/kernfs/kernfs-internal.h | 5 +++
fs/kernfs/mount.c | 11 +++++++
include/linux/cgroup.h | 15 ++++++++++
include/linux/kernfs.h | 4 ++
kernel/cgroup.c | 65 +++++++++++++++++++++++++++++++++++++++++---
7 files changed, 132 insertions(+), 10 deletions(-)

Thanks.

--
tejun

[L] http://lkml.kernel.org/g/[email protected]
[1] http://lkml.kernel.org/g/[email protected]


2014-04-14 21:44:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/3] kernfs: make kernfs_notify() trigger inotify events too

kernfs_notify() is used to indicate either new data is available or
the content of a file has changed. It currently only triggers poll
which may not be the most convenient to monitor especially when there
are a lot to monitor. Let's hook it up to fsnotify too so that the
events can be monitored via inotify too.

fsnotify_modify() requires file * but kernfs_notify() doesn't have any
specific file associated; however, we can walk all super_blocks
associated with a kernfs_root and as kernfs always associate one ino
with inode and one dentry with an inode, it's trivial to look up the
dentry associated with a given kernfs_node. As any active monitor
would pin dentry, just looking up existing dentry is enough. This
patch looks up the dentry associated with the specified kernfs_node
and generates events equivalent to fsnotify_modify().

Note that as fsnotify doesn't provide fsnotify_modify() equivalent
which can be called with dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify(). It might be better to add a wrapper
in fsnotify.h instead.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: John McCutchan <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Eric Paris <[email protected]>
---
fs/kernfs/file.c | 41 +++++++++++++++++++++++++++++++++++------
1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8034706..98bacd9 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -14,6 +14,7 @@
#include <linux/poll.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
+#include <linux/fsnotify.h>

#include "kernfs-internal.h"

@@ -785,20 +786,48 @@ static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
*/
void kernfs_notify(struct kernfs_node *kn)
{
+ struct kernfs_root *root = kernfs_root(kn);
struct kernfs_open_node *on;
+ struct kernfs_super_info *info;
unsigned long flags;

+ if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
+ return;
+
+ /* kick poll */
spin_lock_irqsave(&kernfs_open_node_lock, flags);

- if (!WARN_ON(kernfs_type(kn) != KERNFS_FILE)) {
- on = kn->attr.open;
- if (on) {
- atomic_inc(&on->event);
- wake_up_interruptible(&on->poll);
- }
+ on = kn->attr.open;
+ if (on) {
+ atomic_inc(&on->event);
+ wake_up_interruptible(&on->poll);
}

spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+ /* kick fsnotify */
+ mutex_lock(&kernfs_mutex);
+
+ list_for_each_entry(info, &root->supers, node) {
+ struct inode *inode;
+ struct dentry *dentry;
+
+ inode = ilookup(info->sb, kn->ino);
+ if (!inode)
+ continue;
+
+ dentry = d_find_any_alias(inode);
+ if (dentry) {
+ fsnotify_parent(NULL, dentry, FS_MODIFY);
+ fsnotify(inode, FS_MODIFY, inode, FSNOTIFY_EVENT_INODE,
+ NULL, 0);
+ dput(dentry);
+ }
+
+ iput(inode);
+ }
+
+ mutex_unlock(&kernfs_mutex);
}
EXPORT_SYMBOL_GPL(kernfs_notify);

--
1.9.0

2014-04-14 21:44:21

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.

* There is single monitoring point at the root. There's no way to
delegate management of subtree.

* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of subtree.

* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.

This patch implements interface file "cgroup.subtree_populated" which
can be used to monitor whether the cgroup's subhierarchy has tasks in
it or not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notificaiton is
triggers when the value changes, which can be monitored through poll
and [di]notify.

This is a lot ligther and simpler and trivially allows delegating
management of subhierarchy - subhierarchy monitoring can block further
propgation simply by putting itself or another process in the root of
the subhierarchy and monitor events that it's interested in from there
without interfering with monitoring higher in the tree.

v2: Patch description updated as per Serge.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Cc: Lennart Poettering <[email protected]>
---
include/linux/cgroup.h | 15 ++++++++++++
kernel/cgroup.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ada2392..4b38e2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,14 @@ struct cgroup {
/* the number of attached css's */
int nr_css;

+ /*
+ * If this cgroup contains any tasks, it contributes one to
+ * populated_cnt. All children with non-zero popuplated_cnt of
+ * their own contribute one. The count is zero iff there's no task
+ * in this cgroup or its subtree.
+ */
+ int populated_cnt;
+
atomic_t refcnt;

/*
@@ -166,6 +174,7 @@ struct cgroup {
struct cgroup *parent; /* my parent */
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *control_kn; /* kn for "cgroup.subtree_control" */
+ struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */

/*
* Monotonically increasing unique serial number which defines a
@@ -264,6 +273,12 @@ enum {
*
* - "cgroup.clone_children" is removed.
*
+ * - "cgroup.subtree_populated" is available. Its value is 0 if
+ * the cgroup and its descendants contain no task; otherwise, 1.
+ * The file also generates kernfs notification which can be
+ * monitored through poll and [di]notify when the value of the
+ * file changes.
+ *
* - If mount is requested with sane_behavior but without any
* subsystem, the default unified hierarchy is mounted.
*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 08c4439..e379e83 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -411,6 +411,43 @@ static struct css_set init_css_set = {

static int css_set_count = 1; /* 1 for init_css_set */

+/**
+ * cgroup_update_populated - updated populated count of a cgroup
+ * @cgrp: the target cgroup
+ * @populated: inc or dec populated count
+ *
+ * @cgrp is either getting the first task (css_set) or losing the last.
+ * Update @cgrp->populated_cnt accordingly. The count is propagated
+ * towards root so that a given cgroup's populated_cnt is zero iff the
+ * cgroup and all its descendants are empty.
+ *
+ * @cgrp's interface file "cgroup.subtree_populated" is zero if
+ * @cgrp->populated_cnt is zero and 1 otherwise. When @cgrp->populated_cnt
+ * changes from or to zero, userland is notified that the content of the
+ * interface file has changed. This can be used to detect when @cgrp and
+ * its descendants become populated or empty.
+ */
+static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+{
+ lockdep_assert_held(&css_set_rwsem);
+
+ do {
+ bool trigger;
+
+ if (populated)
+ trigger = !cgrp->populated_cnt++;
+ else
+ trigger = !--cgrp->populated_cnt;
+
+ if (!trigger)
+ break;
+
+ if (cgrp->populated_kn)
+ kernfs_notify(cgrp->populated_kn);
+ cgrp = cgrp->parent;
+ } while (cgrp);
+}
+
/*
* hash table for cgroup groups. This improves the performance to find
* an existing css_set. This hash doesn't (currently) take into
@@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
list_del(&link->cgrp_link);

/* @cgrp can't go away while we're holding css_set_rwsem */
- if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
- if (taskexit)
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
+ if (list_empty(&cgrp->cset_links)) {
+ cgroup_update_populated(cgrp, false);
+ if (notify_on_release(cgrp)) {
+ if (taskexit)
+ set_bit(CGRP_RELEASABLE, &cgrp->flags);
+ check_for_release(cgrp);
+ }
}

kfree(link);
@@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
link->cset = cset;
link->cgrp = cgrp;
+
+ if (list_empty(&cgrp->cset_links))
+ cgroup_update_populated(cgrp, true);
list_move(&link->cset_link, &cgrp->cset_links);
+
/*
* Always add links to the tail of the list so that the list
* is sorted by order of hierarchy creation
@@ -2643,6 +2687,12 @@ err_undo_css:
goto out_unlock;
}

+static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
+{
+ seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
+ return 0;
+}
+
static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -2809,6 +2859,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)

if (cft->seq_show == cgroup_subtree_control_show)
cgrp->control_kn = kn;
+ else if (cft->seq_show == cgroup_subtree_populated_show)
+ cgrp->populated_kn = kn;
return 0;
}

@@ -3922,6 +3974,11 @@ static struct cftype cgroup_base_files[] = {
.seq_show = cgroup_subtree_control_show,
.write_string = cgroup_subtree_control_write,
},
+ {
+ .name = "cgroup.subtree_populated",
+ .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_subtree_populated_show,
+ },

/*
* Historical crazy stuff. These don't have "cgroup." prefix and
--
1.9.0

2014-04-14 21:45:28

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/3] kernfs: implement kernfs_root->supers list

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
---
fs/kernfs/dir.c | 1 +
fs/kernfs/kernfs-internal.h | 5 +++++
fs/kernfs/mount.c | 11 +++++++++++
include/linux/kernfs.h | 4 ++++
4 files changed, 21 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 78f3403..43aa979 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -711,6 +711,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
return ERR_PTR(-ENOMEM);

ida_init(&root->ino_ida);
+ INIT_LIST_HEAD(&root->supers);

kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
KERNFS_DIR);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 8be13b2..dc84a3e 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -49,6 +49,8 @@ static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
* mount.c
*/
struct kernfs_super_info {
+ struct super_block *sb;
+
/*
* The root associated with this super_block. Each super_block is
* identified by the root and ns it's associated with.
@@ -62,6 +64,9 @@ struct kernfs_super_info {
* an array and compare kernfs_node tag against every entry.
*/
const void *ns;
+
+ /* anchored at kernfs_root->supers, protected by kernfs_mutex */
+ struct list_head node;
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 6a5f04a..f25a7c0 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -68,6 +68,7 @@ static int kernfs_fill_super(struct super_block *sb)
struct inode *inode;
struct dentry *root;

+ info->sb = sb;
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
sb->s_magic = SYSFS_MAGIC;
@@ -166,12 +167,18 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
*new_sb_created = !sb->s_root;

if (!sb->s_root) {
+ struct kernfs_super_info *info = kernfs_info(sb);
+
error = kernfs_fill_super(sb);
if (error) {
deactivate_locked_super(sb);
return ERR_PTR(error);
}
sb->s_flags |= MS_ACTIVE;
+
+ mutex_lock(&kernfs_mutex);
+ list_add(&info->node, &root->supers);
+ mutex_unlock(&kernfs_mutex);
}

return dget(sb->s_root);
@@ -190,6 +197,10 @@ void kernfs_kill_sb(struct super_block *sb)
struct kernfs_super_info *info = kernfs_info(sb);
struct kernfs_node *root_kn = sb->s_root->d_fsdata;

+ mutex_lock(&kernfs_mutex);
+ list_del(&info->node);
+ mutex_unlock(&kernfs_mutex);
+
/*
* Remove the superblock from fs_supers/s_instances
* so we can't find it, before freeing kernfs_super_info.
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index b0122dc..589318b 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -144,6 +144,10 @@ struct kernfs_root {
/* private fields, do not use outside kernfs proper */
struct ida ino_ida;
struct kernfs_syscall_ops *syscall_ops;
+
+ /* list of kernfs_super_info of this root, protected by kernfs_mutex */
+ struct list_head supers;
+
wait_queue_head_t deactivate_waitq;
};

--
1.9.0

2014-04-15 00:58:00

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On 2014/4/15 5:44, Tejun Heo wrote:
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up. cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
>
> * It delivers events by forking and execing a userland binary
> specified as the release_agent. This is a long deprecated method of
> notification delivery. It's extremely heavy, slow and cumbersome to
> integrate with larger infrastructure.
>
> * There is single monitoring point at the root. There's no way to
> delegate management of subtree.
>
> * The event isn't recursive. It triggers when a cgroup doesn't have
> any tasks or child cgroups. Events for internal nodes trigger only
> after all children are removed. This again makes it impossible to
> delegate management of subtree.
>
> * Events are filtered from the kernel side. "notify_on_release" file
> is used to subscribe to or suppress release event. This is
> unnecessarily complicated and probably done this way because event
> delivery itself was expensive.
>
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not. Its value is 0 if there is no task in the cgroup and its
> descendants; otherwise, 1,

Is cgroup.tree_populated a better name?

cgroup.subtree_control controls child cgroups only, but .subtree_populated
shows 1 if there're tasks in the cgroup or its children, so the two
are a bit inconsistent to me.

> and kernfs_notify() notificaiton is
> triggers when the value changes, which can be monitored through poll
> and [di]notify.
>
> This is a lot ligther and simpler and trivially allows delegating
> management of subhierarchy - subhierarchy monitoring can block further
> propgation simply by putting itself or another process in the root of
> the subhierarchy and monitor events that it's interested in from there
> without interfering with monitoring higher in the tree.
>
> v2: Patch description updated as per Serge.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>
> Cc: Lennart Poettering <[email protected]>

2014-04-15 14:54:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

Hello,

On Tue, Apr 15, 2014 at 08:57:21AM +0800, Li Zefan wrote:
> Is cgroup.tree_populated a better name?
>
> cgroup.subtree_control controls child cgroups only, but .subtree_populated
> shows 1 if there're tasks in the cgroup or its children, so the two
> are a bit inconsistent to me.

Yes, good catch. subtree_control affects subtree proper.
subtree_populated covers self too. The difference is subtle and the
trade off between shared pattern in names and clarifying the subtlety
didn't seem clear-cut to me. Hmmm....

--
tejun

2014-04-15 16:52:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On Tue, Apr 15, 2014 at 10:54:50AM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Apr 15, 2014 at 08:57:21AM +0800, Li Zefan wrote:
> > Is cgroup.tree_populated a better name?
> >
> > cgroup.subtree_control controls child cgroups only, but .subtree_populated
> > shows 1 if there're tasks in the cgroup or its children, so the two
> > are a bit inconsistent to me.
>
> Yes, good catch. subtree_control affects subtree proper.
> subtree_populated covers self too. The difference is subtle and the
> trade off between shared pattern in names and clarifying the subtlety
> didn't seem clear-cut to me. Hmmm....

How about cgroup.populated?

--
tejun

2014-04-16 01:31:11

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On 2014/4/16 0:52, Tejun Heo wrote:
> On Tue, Apr 15, 2014 at 10:54:50AM -0400, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Apr 15, 2014 at 08:57:21AM +0800, Li Zefan wrote:
>>> Is cgroup.tree_populated a better name?
>>>
>>> cgroup.subtree_control controls child cgroups only, but .subtree_populated
>>> shows 1 if there're tasks in the cgroup or its children, so the two
>>> are a bit inconsistent to me.
>>
>> Yes, good catch. subtree_control affects subtree proper.
>> subtree_populated covers self too. The difference is subtle and the
>> trade off between shared pattern in names and clarifying the subtlety
>> didn't seem clear-cut to me. Hmmm....
>
> How about cgroup.populated?
>

Yeah, fine for me.

2014-04-16 02:49:08

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

Hi Tejun,

On 2014/4/15 5:44, Tejun Heo wrote:
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up. cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
>
> * It delivers events by forking and execing a userland binary
> specified as the release_agent. This is a long deprecated method of
> notification delivery. It's extremely heavy, slow and cumbersome to
> integrate with larger infrastructure.
>
> * There is single monitoring point at the root. There's no way to
> delegate management of subtree.
>
> * The event isn't recursive. It triggers when a cgroup doesn't have
> any tasks or child cgroups. Events for internal nodes trigger only
> after all children are removed. This again makes it impossible to
> delegate management of subtree.
>
> * Events are filtered from the kernel side. "notify_on_release" file
> is used to subscribe to or suppress release event. This is
> unnecessarily complicated and probably done this way because event
> delivery itself was expensive.
>
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not. Its value is 0 if there is no task in the cgroup and its
> descendants; otherwise, 1, and kernfs_notify() notificaiton is
> triggers when the value changes, which can be monitored through poll
> and [di]notify.
>

For the old notification mechanism, the path of the cgroup that becomes
empty will be passed to the user specified release agent. Like this:

# cat /sbin/cpuset_release_agent
#!/bin/sh
rmdir /dev/cpuset/$1

How do we achieve this using inotify?

- monitor all the cgroups, or
- monitor all the leaf cgroups, and travel cgrp->parent to delete all
empty cgroups.
- monitor root cgroup only, and travel the whole hierarchy to find
empy cgroups when it gets an fs event.

Seems none of them is scalible.

> This is a lot ligther and simpler and trivially allows delegating
> management of subhierarchy - subhierarchy monitoring can block further
> propgation simply by putting itself or another process in the root of
> the subhierarchy and monitor events that it's interested in from there
> without interfering with monitoring higher in the tree.
>
> v2: Patch description updated as per Serge.
>

2014-04-16 03:33:50

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <[email protected]> wrote:
> On 2014/4/15 5:44, Tejun Heo wrote:
>> cgroup users often need a way to determine when a cgroup's
>> subhierarchy becomes empty so that it can be cleaned up. cgroup
>> currently provides release_agent for it; unfortunately, this mechanism
>> is riddled with issues.
>>
>> * It delivers events by forking and execing a userland binary
>> specified as the release_agent. This is a long deprecated method of
>> notification delivery. It's extremely heavy, slow and cumbersome to
>> integrate with larger infrastructure.
>>
>> * There is single monitoring point at the root. There's no way to
>> delegate management of subtree.
>>
>> * The event isn't recursive. It triggers when a cgroup doesn't have
>> any tasks or child cgroups. Events for internal nodes trigger only
>> after all children are removed. This again makes it impossible to
>> delegate management of subtree.
>>
>> * Events are filtered from the kernel side. "notify_on_release" file
>> is used to subscribe to or suppress release event. This is
>> unnecessarily complicated and probably done this way because event
>> delivery itself was expensive.
>>
>> This patch implements interface file "cgroup.subtree_populated" which
>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>> it or not. Its value is 0 if there is no task in the cgroup and its
>> descendants; otherwise, 1, and kernfs_notify() notificaiton is
>> triggers when the value changes, which can be monitored through poll
>> and [di]notify.
>>
>
> For the old notification mechanism, the path of the cgroup that becomes
> empty will be passed to the user specified release agent. Like this:
>
> # cat /sbin/cpuset_release_agent
> #!/bin/sh
> rmdir /dev/cpuset/$1
>
> How do we achieve this using inotify?
>
> - monitor all the cgroups, or
> - monitor all the leaf cgroups, and travel cgrp->parent to delete all
> empty cgroups.
> - monitor root cgroup only, and travel the whole hierarchy to find
> empy cgroups when it gets an fs event.
>
> Seems none of them is scalible.

The manager would add all cgroups as watches to one inotify file
descriptor, it should not be problem to do that.

Kay

2014-04-16 03:50:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

Kay Sievers <[email protected]> writes:

> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <[email protected]> wrote:
>> On 2014/4/15 5:44, Tejun Heo wrote:
>>> cgroup users often need a way to determine when a cgroup's
>>> subhierarchy becomes empty so that it can be cleaned up. cgroup
>>> currently provides release_agent for it; unfortunately, this mechanism
>>> is riddled with issues.
>>>
>>> * It delivers events by forking and execing a userland binary
>>> specified as the release_agent. This is a long deprecated method of
>>> notification delivery. It's extremely heavy, slow and cumbersome to
>>> integrate with larger infrastructure.
>>>
>>> * There is single monitoring point at the root. There's no way to
>>> delegate management of subtree.
>>>
>>> * The event isn't recursive. It triggers when a cgroup doesn't have
>>> any tasks or child cgroups. Events for internal nodes trigger only
>>> after all children are removed. This again makes it impossible to
>>> delegate management of subtree.
>>>
>>> * Events are filtered from the kernel side. "notify_on_release" file
>>> is used to subscribe to or suppress release event. This is
>>> unnecessarily complicated and probably done this way because event
>>> delivery itself was expensive.
>>>
>>> This patch implements interface file "cgroup.subtree_populated" which
>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>> it or not. Its value is 0 if there is no task in the cgroup and its
>>> descendants; otherwise, 1, and kernfs_notify() notificaiton is
>>> triggers when the value changes, which can be monitored through poll
>>> and [di]notify.
>>>
>>
>> For the old notification mechanism, the path of the cgroup that becomes
>> empty will be passed to the user specified release agent. Like this:
>>
>> # cat /sbin/cpuset_release_agent
>> #!/bin/sh
>> rmdir /dev/cpuset/$1
>>
>> How do we achieve this using inotify?
>>
>> - monitor all the cgroups, or
>> - monitor all the leaf cgroups, and travel cgrp->parent to delete all
>> empty cgroups.
>> - monitor root cgroup only, and travel the whole hierarchy to find
>> empy cgroups when it gets an fs event.
>>
>> Seems none of them is scalible.
>
> The manager would add all cgroups as watches to one inotify file
> descriptor, it should not be problem to do that.

inotify won't work on cgroupfs.

Eric

2014-04-16 04:15:37

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On Tue, Apr 15, 2014 at 8:50 PM, Eric W. Biederman
<[email protected]> wrote:
> Kay Sievers <[email protected]> writes:
>
>> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <[email protected]> wrote:
>>> On 2014/4/15 5:44, Tejun Heo wrote:
>>>> cgroup users often need a way to determine when a cgroup's
>>>> subhierarchy becomes empty so that it can be cleaned up. cgroup
>>>> currently provides release_agent for it; unfortunately, this mechanism
>>>> is riddled with issues.
>>>>
>>>> * It delivers events by forking and execing a userland binary
>>>> specified as the release_agent. This is a long deprecated method of
>>>> notification delivery. It's extremely heavy, slow and cumbersome to
>>>> integrate with larger infrastructure.
>>>>
>>>> * There is single monitoring point at the root. There's no way to
>>>> delegate management of subtree.
>>>>
>>>> * The event isn't recursive. It triggers when a cgroup doesn't have
>>>> any tasks or child cgroups. Events for internal nodes trigger only
>>>> after all children are removed. This again makes it impossible to
>>>> delegate management of subtree.
>>>>
>>>> * Events are filtered from the kernel side. "notify_on_release" file
>>>> is used to subscribe to or suppress release event. This is
>>>> unnecessarily complicated and probably done this way because event
>>>> delivery itself was expensive.
>>>>
>>>> This patch implements interface file "cgroup.subtree_populated" which
>>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>>> it or not. Its value is 0 if there is no task in the cgroup and its
>>>> descendants; otherwise, 1, and kernfs_notify() notificaiton is
>>>> triggers when the value changes, which can be monitored through poll
>>>> and [di]notify.
>>>>
>>>
>>> For the old notification mechanism, the path of the cgroup that becomes
>>> empty will be passed to the user specified release agent. Like this:
>>>
>>> # cat /sbin/cpuset_release_agent
>>> #!/bin/sh
>>> rmdir /dev/cpuset/$1
>>>
>>> How do we achieve this using inotify?
>>>
>>> - monitor all the cgroups, or
>>> - monitor all the leaf cgroups, and travel cgrp->parent to delete all
>>> empty cgroups.
>>> - monitor root cgroup only, and travel the whole hierarchy to find
>>> empy cgroups when it gets an fs event.
>>>
>>> Seems none of them is scalible.
>>
>> The manager would add all cgroups as watches to one inotify file
>> descriptor, it should not be problem to do that.
>
> inotify won't work on cgroupfs.

Inotify on kernfs will work.

Kay

2014-04-16 04:18:21

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On 2014/4/16 11:33, Kay Sievers wrote:
> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <[email protected]> wrote:
>> On 2014/4/15 5:44, Tejun Heo wrote:
>>> cgroup users often need a way to determine when a cgroup's
>>> subhierarchy becomes empty so that it can be cleaned up. cgroup
>>> currently provides release_agent for it; unfortunately, this mechanism
>>> is riddled with issues.
>>>
>>> * It delivers events by forking and execing a userland binary
>>> specified as the release_agent. This is a long deprecated method of
>>> notification delivery. It's extremely heavy, slow and cumbersome to
>>> integrate with larger infrastructure.
>>>
>>> * There is single monitoring point at the root. There's no way to
>>> delegate management of subtree.
>>>
>>> * The event isn't recursive. It triggers when a cgroup doesn't have
>>> any tasks or child cgroups. Events for internal nodes trigger only
>>> after all children are removed. This again makes it impossible to
>>> delegate management of subtree.
>>>
>>> * Events are filtered from the kernel side. "notify_on_release" file
>>> is used to subscribe to or suppress release event. This is
>>> unnecessarily complicated and probably done this way because event
>>> delivery itself was expensive.
>>>
>>> This patch implements interface file "cgroup.subtree_populated" which
>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>> it or not. Its value is 0 if there is no task in the cgroup and its
>>> descendants; otherwise, 1, and kernfs_notify() notificaiton is
>>> triggers when the value changes, which can be monitored through poll
>>> and [di]notify.
>>>
>>
>> For the old notification mechanism, the path of the cgroup that becomes
>> empty will be passed to the user specified release agent. Like this:
>>
>> # cat /sbin/cpuset_release_agent
>> #!/bin/sh
>> rmdir /dev/cpuset/$1
>>
>> How do we achieve this using inotify?
>>
>> - monitor all the cgroups, or
>> - monitor all the leaf cgroups, and travel cgrp->parent to delete all
>> empty cgroups.
>> - monitor root cgroup only, and travel the whole hierarchy to find
>> empy cgroups when it gets an fs event.
>>
>> Seems none of them is scalible.
>
> The manager would add all cgroups as watches to one inotify file
> descriptor, it should not be problem to do that.
>

Never use inotify. Thanks for explanation, so I think inotify can scale
to thounsands of cgroups after I googled a bit.

2014-04-16 04:22:22

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy

On 2014/4/16 11:50, Eric W. Biederman wrote:
> Kay Sievers <[email protected]> writes:
>
>> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <[email protected]> wrote:
>>> On 2014/4/15 5:44, Tejun Heo wrote:
>>>> cgroup users often need a way to determine when a cgroup's
>>>> subhierarchy becomes empty so that it can be cleaned up. cgroup
>>>> currently provides release_agent for it; unfortunately, this mechanism
>>>> is riddled with issues.
>>>>
>>>> * It delivers events by forking and execing a userland binary
>>>> specified as the release_agent. This is a long deprecated method of
>>>> notification delivery. It's extremely heavy, slow and cumbersome to
>>>> integrate with larger infrastructure.
>>>>
>>>> * There is single monitoring point at the root. There's no way to
>>>> delegate management of subtree.
>>>>
>>>> * The event isn't recursive. It triggers when a cgroup doesn't have
>>>> any tasks or child cgroups. Events for internal nodes trigger only
>>>> after all children are removed. This again makes it impossible to
>>>> delegate management of subtree.
>>>>
>>>> * Events are filtered from the kernel side. "notify_on_release" file
>>>> is used to subscribe to or suppress release event. This is
>>>> unnecessarily complicated and probably done this way because event
>>>> delivery itself was expensive.
>>>>
>>>> This patch implements interface file "cgroup.subtree_populated" which
>>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>>> it or not. Its value is 0 if there is no task in the cgroup and its
>>>> descendants; otherwise, 1, and kernfs_notify() notificaiton is
>>>> triggers when the value changes, which can be monitored through poll
>>>> and [di]notify.
>>>>
>>>
>>> For the old notification mechanism, the path of the cgroup that becomes
>>> empty will be passed to the user specified release agent. Like this:
>>>
>>> # cat /sbin/cpuset_release_agent
>>> #!/bin/sh
>>> rmdir /dev/cpuset/$1
>>>
>>> How do we achieve this using inotify?
>>>
>>> - monitor all the cgroups, or
>>> - monitor all the leaf cgroups, and travel cgrp->parent to delete all
>>> empty cgroups.
>>> - monitor root cgroup only, and travel the whole hierarchy to find
>>> empy cgroups when it gets an fs event.
>>>
>>> Seems none of them is scalible.
>>
>> The manager would add all cgroups as watches to one inotify file
>> descriptor, it should not be problem to do that.
>
> inotify won't work on cgroupfs.
>

Tejun's working on inotify support for cgroupfs, and I believe this patchset
has been tested, so it works.

So what do you mean by saying it won't work? Could you be more specific?

2014-04-16 14:50:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v3 3/3] cgroup: implement cgroup.populated for the default hierarchy

Hello,

Renamed to cgroup.populated. git branch updated accordingly.

Thanks.
------- 8< -------
>From bf3fd307267ce307c4c8ec41834d7bddb6441907 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Wed, 16 Apr 2014 10:48:38 -0400

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.

* There is single monitoring point at the root. There's no way to
delegate management of a subtree.

* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.

* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.

This patch implements interface file "cgroup.populated" which can be
used to monitor whether the cgroup's subhierarchy has tasks in it or
not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notificaiton is
triggers when the value changes, which can be monitored through poll
and [di]notify.

This is a lot ligther and simpler and trivially allows delegating
management of subhierarchy - subhierarchy monitoring can block further
propgation simply by putting itself or another process in the root of
the subhierarchy and monitor events that it's interested in from there
without interfering with monitoring higher in the tree.

v2: Patch description updated as per Serge.

v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
subtree_ prefix was a bit confusing because
"cgroup.subtree_control" uses it to denote the tree rooted at the
cgroup sans the cgroup itself while the populated state includes
the cgroup itself.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Cc: Lennart Poettering <[email protected]>
---
include/linux/cgroup.h | 15 ++++++++++++
kernel/cgroup.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ada2392..4b38e2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,14 @@ struct cgroup {
/* the number of attached css's */
int nr_css;

+ /*
+ * If this cgroup contains any tasks, it contributes one to
+ * populated_cnt. All children with non-zero popuplated_cnt of
+ * their own contribute one. The count is zero iff there's no task
+ * in this cgroup or its subtree.
+ */
+ int populated_cnt;
+
atomic_t refcnt;

/*
@@ -166,6 +174,7 @@ struct cgroup {
struct cgroup *parent; /* my parent */
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *control_kn; /* kn for "cgroup.subtree_control" */
+ struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */

/*
* Monotonically increasing unique serial number which defines a
@@ -264,6 +273,12 @@ enum {
*
* - "cgroup.clone_children" is removed.
*
+ * - "cgroup.subtree_populated" is available. Its value is 0 if
+ * the cgroup and its descendants contain no task; otherwise, 1.
+ * The file also generates kernfs notification which can be
+ * monitored through poll and [di]notify when the value of the
+ * file changes.
+ *
* - If mount is requested with sane_behavior but without any
* subsystem, the default unified hierarchy is mounted.
*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7986aa6..2412cb7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -411,6 +411,43 @@ static struct css_set init_css_set = {

static int css_set_count = 1; /* 1 for init_css_set */

+/**
+ * cgroup_update_populated - updated populated count of a cgroup
+ * @cgrp: the target cgroup
+ * @populated: inc or dec populated count
+ *
+ * @cgrp is either getting the first task (css_set) or losing the last.
+ * Update @cgrp->populated_cnt accordingly. The count is propagated
+ * towards root so that a given cgroup's populated_cnt is zero iff the
+ * cgroup and all its descendants are empty.
+ *
+ * @cgrp's interface file "cgroup.populated" is zero if
+ * @cgrp->populated_cnt is zero and 1 otherwise. When @cgrp->populated_cnt
+ * changes from or to zero, userland is notified that the content of the
+ * interface file has changed. This can be used to detect when @cgrp and
+ * its descendants become populated or empty.
+ */
+static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+{
+ lockdep_assert_held(&css_set_rwsem);
+
+ do {
+ bool trigger;
+
+ if (populated)
+ trigger = !cgrp->populated_cnt++;
+ else
+ trigger = !--cgrp->populated_cnt;
+
+ if (!trigger)
+ break;
+
+ if (cgrp->populated_kn)
+ kernfs_notify(cgrp->populated_kn);
+ cgrp = cgrp->parent;
+ } while (cgrp);
+}
+
/*
* hash table for cgroup groups. This improves the performance to find
* an existing css_set. This hash doesn't (currently) take into
@@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
list_del(&link->cgrp_link);

/* @cgrp can't go away while we're holding css_set_rwsem */
- if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
- if (taskexit)
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
+ if (list_empty(&cgrp->cset_links)) {
+ cgroup_update_populated(cgrp, false);
+ if (notify_on_release(cgrp)) {
+ if (taskexit)
+ set_bit(CGRP_RELEASABLE, &cgrp->flags);
+ check_for_release(cgrp);
+ }
}

kfree(link);
@@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
link->cset = cset;
link->cgrp = cgrp;
+
+ if (list_empty(&cgrp->cset_links))
+ cgroup_update_populated(cgrp, true);
list_move(&link->cset_link, &cgrp->cset_links);
+
/*
* Always add links to the tail of the list so that the list
* is sorted by order of hierarchy creation
@@ -2643,6 +2687,12 @@ err_undo_css:
goto out_unlock;
}

+static int cgroup_populated_show(struct seq_file *seq, void *v)
+{
+ seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
+ return 0;
+}
+
static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -2809,6 +2859,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)

if (cft->seq_show == cgroup_subtree_control_show)
cgrp->control_kn = kn;
+ else if (cft->seq_show == cgroup_populated_show)
+ cgrp->populated_kn = kn;
return 0;
}

@@ -3918,6 +3970,11 @@ static struct cftype cgroup_base_files[] = {
.seq_show = cgroup_subtree_control_show,
.write_string = cgroup_subtree_control_write,
},
+ {
+ .name = "cgroup.populated",
+ .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_populated_show,
+ },

/*
* Historical crazy stuff. These don't have "cgroup." prefix and
--
1.9.0

2014-04-17 01:24:04

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] cgroup: implement cgroup.populated for the default hierarchy

> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up. cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
>
> * It delivers events by forking and execing a userland binary
> specified as the release_agent. This is a long deprecated method of
> notification delivery. It's extremely heavy, slow and cumbersome to
> integrate with larger infrastructure.
>
> * There is single monitoring point at the root. There's no way to
> delegate management of a subtree.
>
> * The event isn't recursive. It triggers when a cgroup doesn't have
> any tasks or child cgroups. Events for internal nodes trigger only
> after all children are removed. This again makes it impossible to
> delegate management of a subtree.
>
> * Events are filtered from the kernel side. "notify_on_release" file
> is used to subscribe to or suppress release event. This is
> unnecessarily complicated and probably done this way because event
> delivery itself was expensive.
>
> This patch implements interface file "cgroup.populated" which can be
> used to monitor whether the cgroup's subhierarchy has tasks in it or
> not. Its value is 0 if there is no task in the cgroup and its
> descendants; otherwise, 1, and kernfs_notify() notificaiton is
> triggers when the value changes, which can be monitored through poll
> and [di]notify.
>
> This is a lot ligther and simpler and trivially allows delegating
> management of subhierarchy - subhierarchy monitoring can block further
> propgation simply by putting itself or another process in the root of
> the subhierarchy and monitor events that it's interested in from there
> without interfering with monitoring higher in the tree.
>
> v2: Patch description updated as per Serge.
>
> v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
> subtree_ prefix was a bit confusing because
> "cgroup.subtree_control" uses it to denote the tree rooted at the
> cgroup sans the cgroup itself while the populated state includes
> the cgroup itself.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>
> Cc: Lennart Poettering <[email protected]>

Acked-by: Li Zefan <[email protected]>