Hello,
(inotify folks, can you please review the second patch which hooks up
kernfs_notify() with fsnotify? It works but directly invoking
fsnotify_parent() and fsnotify() feels a bit dirty. Maybe something
like fsnotify_notify_modify_by_dentry() would make more sense?)
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for this; unfortunately, that
mechanism is riddled with issues. This patchset implements a
replacement mechanism, "cgroup.subtree_populated", which can be used
to monitor whether the cgroup's subhierarchy has tasks in it and which
triggers poll and [di]notify events when its content changes.
This patchset contains the following three patches.
0001-kernfs-implement-kernfs_root-supers-list.patch
0002-kernfs-make-kernfs_notify-trigger-inotify-events-too.patch
0003-cgroup-implement-cgroup.subtree_populated-for-the-de.patch
0001-0002 add [di]notify notification to kernfs_notify().
0003 implements cgroup.subtree_populated.
This patchset is on top of the "cgroup: implement unified hierarchy"
patchset[1] and available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated
diffstat follows.
fs/kernfs/dir.c | 1
fs/kernfs/file.c | 41 +++++++++++++++++++++++----
fs/kernfs/kernfs-internal.h | 5 +++
fs/kernfs/mount.c | 11 +++++++
include/linux/cgroup.h | 15 ++++++++++
include/linux/kernfs.h | 4 ++
kernel/cgroup.c | 65 +++++++++++++++++++++++++++++++++++++++++---
7 files changed, 132 insertions(+), 10 deletions(-)
--
tejun
[1] http://lkml.kernel.org/g/[email protected]
kernfs_notify() is used to indicate that new data is available or that
the content of a file has changed. It currently only triggers poll,
which may not be the most convenient way to monitor, especially when
there are many files to watch. Let's hook it up to fsnotify too so
that the events can also be monitored via inotify.
fsnotify_modify() requires a file * but kernfs_notify() doesn't have
any specific file associated with it; however, we can walk all
super_blocks associated with a kernfs_root, and as kernfs always
associates one inode with each ino and one dentry with each inode,
it's trivial to look up the dentry associated with a given
kernfs_node. As any active monitor would pin the dentry, just looking
up the existing dentry is enough. This patch looks up the dentry
associated with the specified kernfs_node and generates events
equivalent to fsnotify_modify().
Note that as fsnotify doesn't provide an fsnotify_modify() equivalent
which can be called with a dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify(). It might be better to add a wrapper
in fsnotify.h instead.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: John McCutchan <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Eric Paris <[email protected]>
---
fs/kernfs/file.c | 41 +++++++++++++++++++++++++++++++++++------
1 file changed, 35 insertions(+), 6 deletions(-)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ddcb471..774af54 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -14,6 +14,7 @@
#include <linux/poll.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
+#include <linux/fsnotify.h>
#include "kernfs-internal.h"
@@ -784,20 +785,48 @@ static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
*/
void kernfs_notify(struct kernfs_node *kn)
{
+ struct kernfs_root *root = kernfs_root(kn);
struct kernfs_open_node *on;
+ struct kernfs_super_info *info;
unsigned long flags;
+ if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
+ return;
+
+ /* kick poll */
spin_lock_irqsave(&kernfs_open_node_lock, flags);
- if (!WARN_ON(kernfs_type(kn) != KERNFS_FILE)) {
- on = kn->attr.open;
- if (on) {
- atomic_inc(&on->event);
- wake_up_interruptible(&on->poll);
- }
+ on = kn->attr.open;
+ if (on) {
+ atomic_inc(&on->event);
+ wake_up_interruptible(&on->poll);
}
spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+ /* kick fsnotify */
+ mutex_lock(&kernfs_mutex);
+
+ list_for_each_entry(info, &root->supers, node) {
+ struct inode *inode;
+ struct dentry *dentry;
+
+ inode = ilookup(info->sb, kn->ino);
+ if (!inode)
+ continue;
+
+ dentry = d_find_any_alias(inode);
+ if (dentry) {
+ fsnotify_parent(NULL, dentry, FS_MODIFY);
+ fsnotify(inode, FS_MODIFY, inode, FSNOTIFY_EVENT_INODE,
+ NULL, 0);
+ dput(dentry);
+ }
+
+ iput(inode);
+ }
+
+ mutex_unlock(&kernfs_mutex);
}
EXPORT_SYMBOL_GPL(kernfs_notify);
--
1.9.0
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.
* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.
* There is a single monitoring point at the root. There's no way to
delegate management of a subtree.
* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.
* Events are filtered from the kernel side. The "notify_on_release"
file is used to subscribe to or suppress release events, and events
are not generated if a cgroup becomes empty by moving the last task
out of it; however, an event is generated if it becomes empty because
the last child cgroup is removed. This is inconsistent, awkward and
unnecessarily complicated, and was probably done this way because
event delivery itself was expensive.
This patch implements the interface file "cgroup.subtree_populated"
which can be used to monitor whether the cgroup's subhierarchy has
tasks in it or not. Its value is 1 if there is no task in the cgroup
and its descendants; otherwise, 0, and a kernfs_notify() notification
is triggered when the value changes, which can be monitored through
poll and [di]notify.
This is a lot lighter and simpler and trivially allows delegating
management of a subhierarchy - a subhierarchy monitor can block
further propagation simply by putting itself or another process in the
root of the subhierarchy and monitoring the events it's interested in
from there without interfering with monitoring higher in the tree.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Lennart Poettering <[email protected]>
---
include/linux/cgroup.h | 15 ++++++++++++
kernel/cgroup.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index dee6f3c..e45d87f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,14 @@ struct cgroup {
/* the number of attached css's */
int nr_css;
+ /*
+ * If this cgroup contains any tasks, it contributes one to
+ * populated_cnt. All children with non-zero populated_cnt of
+ * their own contribute one. The count is zero iff there's no task
+ * in this cgroup or its subtree.
+ */
+ int populated_cnt;
+
atomic_t refcnt;
/*
@@ -166,6 +174,7 @@ struct cgroup {
struct cgroup *parent; /* my parent */
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *control_kn; /* kn for "cgroup.subtree_control" */
+ struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
/*
* Monotonically increasing unique serial number which defines a
@@ -264,6 +273,12 @@ enum {
*
* - "cgroup.clone_children" is removed.
*
+ * - "cgroup.subtree_populated" is available. Its value is 0 if
+ * the cgroup and its descendants contain no task; otherwise, 1.
+ * The file also generates kernfs notification which can be
+ * monitored through poll and [di]notify when the value of the
+ * file changes.
+ *
* - If mount is requested with sane_behavior but without any
* subsystem, the default unified hierarchy is mounted.
*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4e958c7..17f0a09 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -411,6 +411,43 @@ static struct css_set init_css_set = {
static int css_set_count = 1; /* 1 for init_css_set */
+/**
+ * cgroup_update_populated - update populated count of a cgroup
+ * @cgrp: the target cgroup
+ * @populated: inc or dec populated count
+ *
+ * @cgrp is either getting the first task (css_set) or losing the last.
+ * Update @cgrp->populated_cnt accordingly. The count is propagated
+ * towards root so that a given cgroup's populated_cnt is zero iff the
+ * cgroup and all its descendants are empty.
+ *
+ * @cgrp's interface file "cgroup.subtree_populated" is zero if
+ * @cgrp->populated_cnt is zero and 1 otherwise. When @cgrp->populated_cnt
+ * changes from or to zero, userland is notified that the content of the
+ * interface file has changed. This can be used to detect when @cgrp and
+ * its descendants become populated or empty.
+ */
+static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+{
+ lockdep_assert_held(&css_set_rwsem);
+
+ do {
+ bool trigger;
+
+ if (populated)
+ trigger = !cgrp->populated_cnt++;
+ else
+ trigger = !--cgrp->populated_cnt;
+
+ if (!trigger)
+ break;
+
+ if (cgrp->populated_kn)
+ kernfs_notify(cgrp->populated_kn);
+ cgrp = cgrp->parent;
+ } while (cgrp);
+}
+
/*
* hash table for cgroup groups. This improves the performance to find
* an existing css_set. This hash doesn't (currently) take into
@@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
list_del(&link->cgrp_link);
/* @cgrp can't go away while we're holding css_set_rwsem */
- if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
- if (taskexit)
- set_bit(CGRP_RELEASABLE, &cgrp->flags);
- check_for_release(cgrp);
+ if (list_empty(&cgrp->cset_links)) {
+ cgroup_update_populated(cgrp, false);
+ if (notify_on_release(cgrp)) {
+ if (taskexit)
+ set_bit(CGRP_RELEASABLE, &cgrp->flags);
+ check_for_release(cgrp);
+ }
}
kfree(link);
@@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
link->cset = cset;
link->cgrp = cgrp;
+
+ if (list_empty(&cgrp->cset_links))
+ cgroup_update_populated(cgrp, true);
list_move(&link->cset_link, &cgrp->cset_links);
+
/*
* Always add links to the tail of the list so that the list
* is sorted by order of hierarchy creation
@@ -2633,6 +2677,12 @@ err_undo_css:
goto out_unlock;
}
+static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
+{
+ seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
+ return 0;
+}
+
static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
@@ -2775,6 +2825,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
NULL, false, key);
if (cft->seq_show == cgroup_subtree_control_show)
cgrp->control_kn = kn;
+ else if (cft->seq_show == cgroup_subtree_populated_show)
+ cgrp->populated_kn = kn;
return PTR_ERR_OR_ZERO(kn);
}
@@ -3883,6 +3935,11 @@ static struct cftype cgroup_base_files[] = {
.seq_show = cgroup_subtree_control_show,
.write_string = cgroup_subtree_control_write,
},
+ {
+ .name = "cgroup.subtree_populated",
+ .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
+ .seq_show = cgroup_subtree_populated_show,
+ },
/*
* Historical crazy stuff. These don't have "cgroup." prefix and
--
1.9.0
Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root. Let's implement it - the planned
inotify extension to kernfs_notify() needs it.
Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.
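The shape of that bookkeeping can be modeled in plain C. The sketch
below is illustrative only: a hand-rolled singly linked list stands in
for the kernel's <linux/list.h> list_head, and the locking (kernfs_mutex
in the patch) is omitted.

```c
/*
 * Model of the supers bookkeeping: each mounted instance is chained on
 * the root's supers list so the root can later walk all of them (as
 * the planned kernfs_notify() extension will).
 */
#include <stddef.h>

struct super_info {
	struct super_info *next;	/* chained at root->supers */
	int sb_id;			/* stands in for super_block * */
};

struct root {
	struct super_info *supers;	/* currently mounted instances */
};

static void root_add_super(struct root *r, struct super_info *info)
{
	info->next = r->supers;		/* kernfs: list_add() on mount */
	r->supers = info;
}

static void root_del_super(struct root *r, struct super_info *info)
{
	struct super_info **p = &r->supers;

	while (*p && *p != info)
		p = &(*p)->next;
	if (*p)
		*p = info->next;	/* kernfs: list_del() on kill_sb */
}

/* walk every mounted instance, as kernfs_notify() will */
static int root_count_supers(struct root *r)
{
	int n = 0;

	for (struct super_info *i = r->supers; i; i = i->next)
		n++;
	return n;
}
```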
Signed-off-by: Tejun Heo <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
---
fs/kernfs/dir.c | 1 +
fs/kernfs/kernfs-internal.h | 5 +++++
fs/kernfs/mount.c | 11 +++++++++++
include/linux/kernfs.h | 4 ++++
4 files changed, 21 insertions(+)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 939684e..a8bf7ac 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -711,6 +711,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
return ERR_PTR(-ENOMEM);
ida_init(&root->ino_ida);
+ INIT_LIST_HEAD(&root->supers);
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
KERNFS_DIR);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index a91d7a1..61e0920 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -49,6 +49,8 @@ static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
* mount.c
*/
struct kernfs_super_info {
+ struct super_block *sb;
+
/*
* The root associated with this super_block. Each super_block is
* identified by the root and ns it's associated with.
@@ -62,6 +64,9 @@ struct kernfs_super_info {
* an array and compare kernfs_node tag against every entry.
*/
const void *ns;
+
+ /* anchored at kernfs_root->supers, protected by kernfs_mutex */
+ struct list_head node;
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index e5b28b0..1f80295 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -68,6 +68,7 @@ static int kernfs_fill_super(struct super_block *sb)
struct inode *inode;
struct dentry *root;
+ info->sb = sb;
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
sb->s_magic = SYSFS_MAGIC;
@@ -160,12 +161,18 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
if (IS_ERR(sb))
return ERR_CAST(sb);
if (!sb->s_root) {
+ struct kernfs_super_info *info = kernfs_info(sb);
+
error = kernfs_fill_super(sb);
if (error) {
deactivate_locked_super(sb);
return ERR_PTR(error);
}
sb->s_flags |= MS_ACTIVE;
+
+ mutex_lock(&kernfs_mutex);
+ list_add(&info->node, &root->supers);
+ mutex_unlock(&kernfs_mutex);
}
return dget(sb->s_root);
@@ -184,6 +191,10 @@ void kernfs_kill_sb(struct super_block *sb)
struct kernfs_super_info *info = kernfs_info(sb);
struct kernfs_node *root_kn = sb->s_root->d_fsdata;
+ mutex_lock(&kernfs_mutex);
+ list_del(&info->node);
+ mutex_unlock(&kernfs_mutex);
+
/*
* Remove the superblock from fs_supers/s_instances
* so we can't find it, before freeing kernfs_super_info.
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 649497a..d8ec4f2 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -144,6 +144,10 @@ struct kernfs_root {
/* private fields, do not use outside kernfs proper */
struct ida ino_ida;
struct kernfs_syscall_ops *syscall_ops;
+
+ /* list of kernfs_super_info of this root, protected by kernfs_mutex */
+ struct list_head supers;
+
wait_queue_head_t deactivate_waitq;
};
--
1.9.0
Quoting Tejun Heo ([email protected]):
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up. cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
Thanks, Tejun.
> * It delivers events by forking and execing a userland binary
> specified as the release_agent. This is a long deprecated method of
> notification delivery. It's extremely heavy, slow and cumbersome to
> integrate with larger infrastructure.
(Not seriously worried about this, but it's a point worth considering)
It does have one advantage though: if the userspace agent goes bad,
cgroups can still be removed on empty.
Do you plan on keeping release-on-empty around? I assume only for a
while?
Do you think there is any value in having a simpler "remove-when-empty"
file? Doesn't call out to userspace, just drops the cgroup when there
are no more tasks or sub-cgroups?
> * There is single monitoring point at the root. There's no way to
> delegate management of subtree.
>
> * The event isn't recursive. It triggers when a cgroup doesn't have
> any tasks or child cgroups. Events for internal nodes trigger only
> after all children are removed. This again makes it impossible to
> delegate management of subtree.
>
> * Events are filtered from the kernel side. "notify_on_release" file
> is used to subscribe to or suppres release event and events are not
> generated if a cgroup becomes empty by moving the last task out of
> it; however, event is generated if it becomes empty because the last
> child cgroup is removed. This is inconsistent, awkward and
Hm, maybe I'm misreading but this doesn't seem right. If I move
a task into x1 and kill the task, x1 goes away. Likewise if I
create x1/y1, and rmdir y1, x1 goes away. I suspect I'm misunderstanding
the case in which you say it doesn't happen?
> unnecessarily complicated and probably done this way because event
> delivery itself was expensive.
>
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not. Its value is 1 if there is no task in the cgroup and its
I think you meant this backward? It's 1 if there is *any task in
the cgroup and its descendants, else 0?
> descendants; otherwise, 0, and kernfs_notify() notificaiton is
> triggers when the value changes, which can be monitored through poll
> and [di]notify.
>
> This is a lot ligther and simpler and trivially allows delegating
> management of subhierarchy - subhierarchy monitoring can block further
> propgation simply by putting itself or another process in the root of
> the subhierarchy and monitor events that it's interested in from there
> without interfering with monitoring higher in the tree.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Serge Hallyn <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> Cc: Lennart Poettering <[email protected]>
> ---
> include/linux/cgroup.h | 15 ++++++++++++
> kernel/cgroup.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index dee6f3c..e45d87f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -154,6 +154,14 @@ struct cgroup {
> /* the number of attached css's */
> int nr_css;
>
> + /*
> + * If this cgroup contains any tasks, it contributes one to
> + * populated_cnt. All children with non-zero popuplated_cnt of
> + * their own contribute one. The count is zero iff there's no task
> + * in this cgroup or its subtree.
> + */
> + int populated_cnt;
> +
> atomic_t refcnt;
>
> /*
> @@ -166,6 +174,7 @@ struct cgroup {
> struct cgroup *parent; /* my parent */
> struct kernfs_node *kn; /* cgroup kernfs entry */
> struct kernfs_node *control_kn; /* kn for "cgroup.subtree_control" */
> + struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
>
> /*
> * Monotonically increasing unique serial number which defines a
> @@ -264,6 +273,12 @@ enum {
> *
> * - "cgroup.clone_children" is removed.
> *
> + * - "cgroup.subtree_populated" is available. Its value is 0 if
> + * the cgroup and its descendants contain no task; otherwise, 1.
> + * The file also generates kernfs notification which can be
> + * monitored through poll and [di]notify when the value of the
> + * file changes.
> + *
> * - If mount is requested with sane_behavior but without any
> * subsystem, the default unified hierarchy is mounted.
> *
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 4e958c7..17f0a09 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -411,6 +411,43 @@ static struct css_set init_css_set = {
>
> static int css_set_count = 1; /* 1 for init_css_set */
>
> +/**
> + * cgroup_update_populated - updated populated count of a cgroup
> + * @cgrp: the target cgroup
> + * @populated: inc or dec populated count
> + *
> + * @cgrp is either getting the first task (css_set) or losing the last.
> + * Update @cgrp->populated_cnt accordingly. The count is propagated
> + * towards root so that a given cgroup's populated_cnt is zero iff the
> + * cgroup and all its descendants are empty.
> + *
> + * @cgrp's interface file "cgroup.subtree_populated" is zero if
> + * @cgrp->populated_cnt is zero and 1 otherwise. When @cgrp->populated_cnt
> + * changes from or to zero, userland is notified that the content of the
> + * interface file has changed. This can be used to detect when @cgrp and
> + * its descendants become populated or empty.
> + */
> +static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
> +{
> + lockdep_assert_held(&css_set_rwsem);
> +
> + do {
> + bool trigger;
> +
> + if (populated)
> + trigger = !cgrp->populated_cnt++;
> + else
> + trigger = !--cgrp->populated_cnt;
> +
> + if (!trigger)
> + break;
> +
> + if (cgrp->populated_kn)
> + kernfs_notify(cgrp->populated_kn);
> + cgrp = cgrp->parent;
> + } while (cgrp);
> +}
> +
> /*
> * hash table for cgroup groups. This improves the performance to find
> * an existing css_set. This hash doesn't (currently) take into
> @@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
> list_del(&link->cgrp_link);
>
> /* @cgrp can't go away while we're holding css_set_rwsem */
> - if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
> - if (taskexit)
> - set_bit(CGRP_RELEASABLE, &cgrp->flags);
> - check_for_release(cgrp);
> + if (list_empty(&cgrp->cset_links)) {
> + cgroup_update_populated(cgrp, false);
> + if (notify_on_release(cgrp)) {
> + if (taskexit)
> + set_bit(CGRP_RELEASABLE, &cgrp->flags);
> + check_for_release(cgrp);
> + }
> }
>
> kfree(link);
> @@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
> link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
> link->cset = cset;
> link->cgrp = cgrp;
> +
> + if (list_empty(&cgrp->cset_links))
> + cgroup_update_populated(cgrp, true);
> list_move(&link->cset_link, &cgrp->cset_links);
> +
> /*
> * Always add links to the tail of the list so that the list
> * is sorted by order of hierarchy creation
> @@ -2633,6 +2677,12 @@ err_undo_css:
> goto out_unlock;
> }
>
> +static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
> +{
> + seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
> + return 0;
> +}
> +
> static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> @@ -2775,6 +2825,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
> NULL, false, key);
> if (cft->seq_show == cgroup_subtree_control_show)
> cgrp->control_kn = kn;
> + else if (cft->seq_show == cgroup_subtree_populated_show)
> + cgrp->populated_kn = kn;
> return PTR_ERR_OR_ZERO(kn);
> }
>
> @@ -3883,6 +3935,11 @@ static struct cftype cgroup_base_files[] = {
> .seq_show = cgroup_subtree_control_show,
> .write_string = cgroup_subtree_control_write,
> },
> + {
> + .name = "cgroup.subtree_populated",
> + .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
> + .seq_show = cgroup_subtree_populated_show,
> + },
>
> /*
> * Historical crazy stuff. These don't have "cgroup." prefix and
> --
> 1.9.0
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Hey, Serge.
On Thu, Apr 10, 2014 at 05:08:55AM +0200, Serge E. Hallyn wrote:
> Quoting Tejun Heo ([email protected]):
> > * It delivers events by forking and execing a userland binary
> > specified as the release_agent. This is a long deprecated method of
> > notification delivery. It's extremely heavy, slow and cumbersome to
> > integrate with larger infrastructure.
>
> (Not seriously worried about this, but it's a point worth considering)
> It does have one advantage though: if the userspace agent goes bad,
> cgroups can still be removed on empty.
>
> Do you plan on keeping release-on-empty around? I assume only for a
> while?
The new mechanism is only for the unified hierarchy. The old one will
be kept around for other hierarchies.
> Do you think there is any value in having a simpler "remove-when-empty"
> file? Doesn't call out to userspace, just drops the cgroup when there
> are no more tasks or sub-cgroups?
I don't think so. Implementing such a simplistic mechanism in
userland is trivial, and even independent failover mechanisms can
easily be implemented from userland as multiple entities can set up
watches. I don't think there's much value in providing another
mechanism from the kernel side. The only reason the release_agent
mechanism got as complex as it is is that it's fundamentally flawed -
clumsy delivery, no multiple watches, a single watch point - so people
tried to work around it by adding event filtering on the kernel side,
which is quite backwards IMHO. With a proper event mechanism,
everything should be easily achievable from the userland side.
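As an illustration of how little userland code the "remove on empty"
policy needs on top of this event mechanism, here is a sketch of the
per-event handler. The "0\n" file content and the separate paths are
assumptions based on the description in this thread; a real agent would
drive this from an inotify or poll loop on cgroup.subtree_populated.

```c
/*
 * Sketch of the userland half of "remove on empty": read the populated
 * file and rmdir the cgroup directory when the subtree is empty.
 * Illustrative only; error handling is minimal.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if the cgroup directory was removed, 0 otherwise. */
int reap_if_empty(const char *populated_path, const char *cgroup_dir)
{
	char buf[8] = "";
	FILE *f = fopen(populated_path, "r");

	if (!f)
		return 0;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);

	if (strcmp(buf, "0\n") != 0)	/* still populated or unreadable */
		return 0;

	return rmdir(cgroup_dir) == 0;
}
```

Because multiple watchers can coexist, an independent backup agent
running the same handler gives exactly the failover discussed above
with no kernel-side knob.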
> > * Events are filtered from the kernel side. "notify_on_release" file
> > is used to subscribe to or suppres release event and events are not
> > generated if a cgroup becomes empty by moving the last task out of
> > it; however, event is generated if it becomes empty because the last
> > child cgroup is removed. This is inconsistent, awkward and
>
> Hm, maybe I'm misreading but this doesn't seem right. If I move
> a task into x1 and kill the task, x1 goes away. Likewise if I
> create x1/y1, and rmdir y1, x1 goes away. I suspect I'm misunderstanding
> the case in which you say it doesn't happen?
The case where you move a task out of x1/y1 to another cgroup doesn't
generate an event. One could say that that's unnecessary because the
mover knows that the cgroup is becoming empty; however, that excludes
any case where there is more than one actor, and the same can be said
for cases where the actor is removing a child.
> > This patch implements interface file "cgroup.subtree_populated" which
> > can be used to monitor whether the cgroup's subhierarchy has tasks in
> > it or not. Its value is 1 if there is no task in the cgroup and its
>
> I think you meant this backward? It's 1 if there is *any task in
> the cgroup and its descendants, else 0?
Oops, yeap. Will update.
Thanks!
--
tejun
Quoting Tejun Heo ([email protected]):
> Hey, Serge.
>
> On Thu, Apr 10, 2014 at 05:08:55AM +0200, Serge E. Hallyn wrote:
> > Quoting Tejun Heo ([email protected]):
> > > * It delivers events by forking and execing a userland binary
> > > specified as the release_agent. This is a long deprecated method of
> > > notification delivery. It's extremely heavy, slow and cumbersome to
> > > integrate with larger infrastructure.
> >
> > (Not seriously worried about this, but it's a point worth considering)
> > It does have one advantage though: if the userspace agent goes bad,
> > cgroups can still be removed on empty.
> >
> > Do you plan on keeping release-on-empty around? I assume only for a
> > while?
>
> The new mechanism is only for the unified hierarchy. The old one will
> be kept around for other hierarchies.
>
> > Do you think there is any value in having a simpler "remove-when-empty"
> > file? Doesn't call out to userspace, just drops the cgroup when there
> > are no more tasks or sub-cgroups?
>
> I don't think so. Implementing such simplistic mechanism in userland
> is trivial and even independent failover mechanisms can be easily
> implemented from userland as multiple entities can set up watches. I
> don't think there's much value in providing another mechanism from
> kernel side. The only reason why release_agent thing got as complex
> as it is is because the mechanism is fundamentally flawed - clumsy
> delivery, no multiple watches, single watch point - so people tried to
> work around it by adding event filtering from kernel side, which is
> quite backwards IMHO. With proper event mechanism, everything should
> be easily achievable from userland side.
Except for keeping state. If the userspace agent crashes when it
was meant to drop 100 cgroups when they become empty, then when it
restarts those 100 cgroups may never be freed. Of course userspace
can do things about this, but it just seems like it would be so
trivial to handle this case in the kernel. Like you say, there is
no need for all the fancy agent spawning - just notice that the
cgroup became empty and that release-on-empty was true, and unlink the
thing. It'll be freed when anyone who has it held open closes it.
> > > * Events are filtered from the kernel side. "notify_on_release" file
> > > is used to subscribe to or suppres release event and events are not
> > > generated if a cgroup becomes empty by moving the last task out of
> > > it; however, event is generated if it becomes empty because the last
> > > child cgroup is removed. This is inconsistent, awkward and
> >
> > Hm, maybe I'm misreading but this doesn't seem right. If I move
> > a task into x1 and kill the task, x1 goes away. Likewise if I
> > create x1/y1, and rmdir y1, x1 goes away. I suspect I'm misunderstanding
> > the case in which you say it doesn't happen?
>
> The case where you move a task out of x1/y1 to another cgroup doesn't
> generate an event. One could say that that's unnecessary because the
Still confused.
If I create x1/y1 and x3/y3, set notify_on_release on x1 and x1/y1,
move a task into x1/y1, then move it to x3/y3, then x1/y1 and x1 both
do get removed.
> mover knows that the cgroup is becoming empty; however, it excludes
> any cases where there are more than one actors and the same can be
> said for cases when the actor is removing a child.
>
> > > This patch implements interface file "cgroup.subtree_populated" which
> > > can be used to monitor whether the cgroup's subhierarchy has tasks in
> > > it or not. Its value is 1 if there is no task in the cgroup and its
> >
> > I think you meant this backward? It's 1 if there is *any task in
> > the cgroup and its descendants, else 0?
>
> Oops, yeap. Will update.
>
> Thanks!
>
> --
> tejun
Hello,
On Thu, Apr 10, 2014 at 09:04:24AM -0500, Serge Hallyn wrote:
> Except for the keeping state. If the userspace agent crashes when it
> was meant to drop 100 cgroups when they become empty, then when it
> restarts those 100 cgroups may never be freed. Of course userspace
> can do things about this, but it just seems like it would be so
> trivial to handle this case in the kernel. Like you say there is
But it's also trivial from userland. You can write up a separate
backup agent or just make the main agent perform cleanup on restart,
which it should probably be doing anyway.
> no need for all the fancy agent spawning - just notice that the
> cgroup became empty, that releaseonempty was true, and so unlink the
> thing. It'll be freed when anyone who has it held open closes it.
It'd be a superfluous feature which requires a separate control knob
on each cgroup and can lead to confusion too, as there would then be
two entities acting on it by default. The manager has to worry about
keeping those knobs in sync and deal with cases where cgroups
disappear unexpectedly. Think about it. What should the default
value of the knob be? What if the manager crashes before setting the
knob to its desired state? It needs to perform a cleanup pass after a
crash *anyway*. How is it gonna synchronize with the current state of
the cgroup hierarchy otherwise?
It's adding complexity without any actual benefit. This is kernel
API. It should be concise and orthogonal where it can be. There of
course are cases where convenience should be considered, but what
you're suggesting doesn't seem beneficial in any substantial way.
> > The case where you move a task out of x1/y1 to another cgroup doesn't
> > generate an event. One could say that that's unnecessary because the
>
> Still confused.
>
> If I create x1/y1 and x3/y3, set notify_on_release on x1 and x1/y1,
> move a task into x1/y1, then move it to x3/y3, then x1/y1 and x1 both
> do get removed.
Ah, you're right, cgroup_task_migrate() sets CGRP_RELEASABLE
explicitly. I was confused because put_css_set_locked() sets
CGRP_RELEASABLE only if @taskexit is set. Will drop that part from
the description.
Thanks.
--
tejun
> Ah, you're right, cgroup_task_migrate() sets CGRP_RELEASABLE
> explicitly. I was confused because put_css_set_locked() sets
> CGRP_RELEASABLE only if @taskexit is set. Will drop that part from
> the description.
>
"If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file"
says cgroups.txt. :)
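[Editorial illustration] For reference, the mechanism the quote describes is configured like this; the mount point, hierarchy, and agent path are examples, not anything mandated by cgroups.txt:

```shell
# Classic release_agent setup per the cgroups.txt passage quoted above.
mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
mkdir /sys/fs/cgroup/cpuset/x1

# The agent is per-hierarchy: it is set in the root cgroup only ...
echo /usr/local/sbin/cgroup-release-agent \
    > /sys/fs/cgroup/cpuset/release_agent
# ... while notify_on_release is a per-cgroup flag.
echo 1 > /sys/fs/cgroup/cpuset/x1/notify_on_release
# When x1 loses its last task and its last child cgroup, the kernel
# forks the agent with the released cgroup's path ("/x1") as argument.
```

The per-hierarchy agent plus per-cgroup flag split is part of what makes the mechanism awkward, and is what cgroup.subtree_populated's poll/inotify events replace.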
On Wed, Apr 09, 2014 at 11:07:29AM -0400, Tejun Heo wrote:
> This patchset is on top of "cgroup: implement unified hierarchy"
> patchset[1] and available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated
Branch renamed to review-populated-v1.
Thanks.
--
tejun
On Mon, Apr 14, 2014 at 05:31:00PM -0400, Tejun Heo wrote:
> On Wed, Apr 09, 2014 at 11:07:29AM -0400, Tejun Heo wrote:
> > This patchset is on top of "cgroup: implement unified hierarchy"
> > patchset[1] and available in the following git branch.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated
>
> Branch renamed to review-populated-v1.
Do you want me to pull this branch into my tree for 3.16 now? I have no
objection to you taking it, if that's easier for you, whichever is fine
for me.
thanks,
greg k-h
Hello, Greg.
On Mon, Apr 14, 2014 at 03:26:58PM -0700, Greg KH wrote:
> Do you want me to pull this branch into my tree for 3.16 now? I have no
> objection to you taking it, if that's easier for you, whichever is fine
> for me.
I think the best way to proceed would be applying 1-2 to
driver-core-next and then base cgroup/for-3.16 on top of it, but I'd
like to get some feedback from inotify people before proceeding.
Thanks.
--
tejun
On Tue, Apr 15, 2014 at 12:18:28PM -0400, Tejun Heo wrote:
> Hello, Greg.
>
> On Mon, Apr 14, 2014 at 03:26:58PM -0700, Greg KH wrote:
> > Do you want me to pull this branch into my tree for 3.16 now? I have no
> > objection to you taking it, if that's easier for you, whichever is fine
> > for me.
>
> I think the best way to proceed would be applying 1-2 to
> driver-core-next and then base cgroup/for-3.16 on top of it, but I'd
> like to get some feedback from inotify people before proceeding.
Greg, can you please apply the first two patches to driver-core-next
(or a separate branch which is pulled into driver-core-next, either
way works for me as long as the branch is stable)? I'll pull in the
branch and route the third patch through cgroup/for-3.16.
Thanks!
--
tejun
On Wed, Apr 23, 2014 at 11:16:38AM -0400, Tejun Heo wrote:
> On Tue, Apr 15, 2014 at 12:18:28PM -0400, Tejun Heo wrote:
> > Hello, Greg.
> >
> > On Mon, Apr 14, 2014 at 03:26:58PM -0700, Greg KH wrote:
> > > Do you want me to pull this branch into my tree for 3.16 now? I have no
> > > objection to you taking it, if that's easier for you, whichever is fine
> > > for me.
> >
> > I think the best way to proceed would be applying 1-2 to
> > driver-core-next and then base cgroup/for-3.16 on top of it, but I'd
> > like to get some feedback from inotify people before proceeding.
>
> Greg, can you please apply the first two patches to driver-core-next
> (or a separate branch which is pulled into driver-core-next, either
> way works for me as long as the branch is stable)? I'll pull in the
> branch and route the third patch through cgroup/for-3.16.
Now done.
greg k-h
On Wed, Apr 09, 2014 at 11:07:29AM -0400, Tejun Heo wrote:
> Hello,
>
> (inotify folks, can you please review the second patch which hooks up
> kernfs_notify() with fsnotify? It works but directly invoking
> fsnotify_parent() and fsnotify() feels a bit dirty. Maybe something
> like fsnotify_notify_modify_by_dentry() would make more sense?)
>
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up. cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues. This patchset implements a replacement
> mechanism "cgroup.subtree_populated" which can be used to monitor
> whether the cgroup's subhierarchy has tasks in it or not and triggers
> poll and [di]notify events when its content changes.
>
> This patchset contains the following three patches.
>
> 0001-kernfs-implement-kernfs_root-supers-list.patch
> 0002-kernfs-make-kernfs_notify-trigger-inotify-events-too.patch
> 0003-cgroup-implement-cgroup.subtree_populated-for-the-de.patch
0001-0002 are applied to driver-core-next, which was pulled into
cgroup/for-3.16. 0003 is applied to cgroup/for-3.16.
Thanks.
--
tejun
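[Editorial illustration] With the three patches applied, user space can watch the populated state directly instead of relying on release_agent. A sketch of such a watcher follows; the file path is hypothetical, and the POLLPRI wakeup is the usual kernfs convention (kernfs_notify pokes pollers with POLLERR|POLLPRI), not something this thread spells out:

```python
import os
import select

def read_populated(path):
    """Return True if a cgroup.subtree_populated-style file reads "1"."""
    with open(path) as f:
        return f.read().strip() == "1"

def wait_for_empty(path, timeout_ms=None):
    """Block until the file at path reads "0", i.e. the subhierarchy
    is empty.  Sketch only: path is assumed to be a kernfs file such
    as .../cgroup.subtree_populated whose poll is woken by
    kernfs_notify() on state changes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        p = select.poll()
        p.register(fd, select.POLLPRI)
        while read_populated(path):
            if not p.poll(timeout_ms):
                return False        # timed out while still populated
            os.lseek(fd, 0, os.SEEK_SET)
        return True
    finally:
        os.close(fd)
```

With 0002 in place the same state change also raises an inotify modify event on the file, so tools built on inotify (e.g. inotifywait) work without a poll loop at all.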