2013-04-12 23:11:10

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET] cgroup, memcg: introduce sane_behavior mount option

It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
staying compatible with what already has been exposed to userland.

To make progress, those behaviors need to go but we can't simply drop
or change the crazies as those are directly visible to userland. This
patchset implements a mount option - sane_behavior - which turns on
new saner behaviors, so that we can keep providing the old behaviors
while and after transitioning to saner ones.

As the behaviors which should be changed are still being determined
and then implemented, __DEVEL__ prefix is added to the mount option
and it triggers a warning message when used.

The mount option changes the following behaviors after this patchset.

* Mount options "noprefix" and "clone_children" are disallowed. Also,
cgroupfs file cgroup.clone_children is not created.

* When mounting an existing superblock, mount options should match.
This is currently pretty crazy. If one mounts a cgroup, creates a
subdirectory, unmounts it and then mount it again with different
option, it looks like the new options are applied but they aren't.

* Remount is disallowed.

* memcg: .use_hierarchy is forced on and the cgroupfs file is not
created.

and there are a lot more to come. Basically, when turned on, all
controllers should be ready to be mounted in the same hierarchy and
not get in the way unless specifically configured - making blk-throtl
hierarchical would need this to flip the meaning of limits, cpuset to
allow tasks to run by default in new cgroups and handle empty cpusets
in a way friendly to being co-mounted, and so on.

This patchset contains the following four patches.

0001-cgroup-convert-cgroupfs_root-flag-bits-to-masks-and-.patch
0002-move-cgroupfs_root-to-include-linux-cgroup.h.patch
0003-cgroup-introduce-sane_behavior-mount-option.patch
0004-memcg-force-use_hierarchy-if-sane_behavior.patch

0001-0002 are prep patches. It exposes cgroupfs_root in cgroup.h so
that flags can be tested with inline helpers.

0003 introduces sane_behavior mount option and implements behavior
changes in cgroup core proper.

0004 makes memcg .use_hierarchy changes.

The memcg patch doesn't conflict with memcg changes in -next, so it
can be routed through the cgroup tree. Michal, how do you wanna route
it?

This patchset is based on top of cgroup/for-3.10 and also available in
the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-sane_behavior

Thanks.

include/linux/cgroup.h | 103 +++++++++++++++++++++++++++++++++++++++++
kernel/cgroup.c | 121 ++++++++++++++++++++++---------------------------
mm/memcontrol.c | 13 +++++
3 files changed, 172 insertions(+), 65 deletions(-)

--
tejun


2013-04-12 23:11:20

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

Turn on use_hierarchy by default if sane_behavior is specified and
don't create .use_hierarchy file.

It is debatable whether to remove .use_hierarchy file or make it ro as
the former could make transition easier in certain cases; however, the
behavior changes which will be gated by sane_behavior are intensive
including changing basic meaning of certain control knobs in a few
controllers and I don't really think keeping this piece would make
things easier in any noticeable way, so let's remove it.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/cgroup.h | 3 +++
mm/memcontrol.c | 13 +++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9c300ad..c562e33 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -269,6 +269,9 @@ enum {
*
* - Remount is disallowed.
*
+ * - memcg: use_hierarchy is on by default and the cgroup file for
+ * the flag is not created.
+ *
* The followings are planned changes.
*
* - release_agent will be disallowed once replacement notification
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9715c0c..a651131 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5814,6 +5814,7 @@ static struct cftype mem_cgroup_files[] = {
},
{
.name = "use_hierarchy",
+ .flags = CFTYPE_INSANE,
.write_u64 = mem_cgroup_hierarchy_write,
.read_u64 = mem_cgroup_hierarchy_read,
},
@@ -6784,6 +6785,17 @@ static void mem_cgroup_move_task(struct cgroup *cont,
}
#endif

+/*
+ * Cgroup retains root cgroups across [un]mount cycles making it necessary
+ * to verify sane_behavior flag on each mount attempt.
+ */
+static void mem_cgroup_bind(struct cgroup *root)
+{
+ /* use_hierarchy is forced with sane_behavior */
+ if (cgroup_sane_behavior(root))
+ mem_cgroup_from_cont(root)->use_hierarchy = true;
+}
+
struct cgroup_subsys mem_cgroup_subsys = {
.name = "memory",
.subsys_id = mem_cgroup_subsys_id,
@@ -6794,6 +6806,7 @@ struct cgroup_subsys mem_cgroup_subsys = {
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+ .bind = mem_cgroup_bind,
.base_cftypes = mem_cgroup_files,
.early_init = 0,
.use_id = 1,
--
1.8.1.4

2013-04-12 23:11:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/4] move cgroupfs_root to include/linux/cgroup.h

While controllers shouldn't be accessing cgroupfs_root directly, it
being hidden inside kern/cgroup.c makes somethings pretty silly. This
makes routing hierarchy-wide settings which need to be visible to
controllers cumbersome.

We're gonna add another hierarchy-wide setting which needs to be
accessed from controllers. Move cgroupfs_root and its flags to the
header file so that we can access root settings with inline helpers.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/cgroup.c | 57 --------------------------------------------------
2 files changed, 57 insertions(+), 57 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 45aee0f..b21881e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -19,6 +19,7 @@
#include <linux/idr.h>
#include <linux/workqueue.h>
#include <linux/xattr.h>
+#include <linux/fs.h>

#ifdef CONFIG_CGROUPS

@@ -238,6 +239,62 @@ struct cgroup {
struct simple_xattrs xattrs;
};

+#define MAX_CGROUP_ROOT_NAMELEN 64
+
+/* cgroupfs_root->flags */
+enum {
+ CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
+ CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
+};
+
+/*
+ * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
+ * associated with a superblock to form an active hierarchy. This is
+ * internal to cgroup core. Don't access directly from controllers.
+ */
+struct cgroupfs_root {
+ struct super_block *sb;
+
+ /*
+ * The bitmask of subsystems intended to be attached to this
+ * hierarchy
+ */
+ unsigned long subsys_mask;
+
+ /* Unique id for this hierarchy. */
+ int hierarchy_id;
+
+ /* The bitmask of subsystems currently attached to this hierarchy */
+ unsigned long actual_subsys_mask;
+
+ /* A list running through the attached subsystems */
+ struct list_head subsys_list;
+
+ /* The root cgroup for this hierarchy */
+ struct cgroup top_cgroup;
+
+ /* Tracks how many cgroups are currently defined in hierarchy.*/
+ int number_of_cgroups;
+
+ /* A list running through the active hierarchies */
+ struct list_head root_list;
+
+ /* All cgroups on this root, cgroup_mutex protected */
+ struct list_head allcg_list;
+
+ /* Hierarchy-specific flags */
+ unsigned long flags;
+
+ /* IDs for cgroups in this hierarchy */
+ struct ida cgroup_ida;
+
+ /* The path to use for release notifications. */
+ char release_agent_path[PATH_MAX];
+
+ /* The name for this hierarchy - may be empty */
+ char name[MAX_CGROUP_ROOT_NAMELEN];
+};
+
/*
* A css_set is a structure holding pointers to a set of
* cgroup_subsys_state objects. This saves space in the task struct
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a372eaa..8848070 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -30,7 +30,6 @@
#include <linux/cred.h>
#include <linux/ctype.h>
#include <linux/errno.h>
-#include <linux/fs.h>
#include <linux/init_task.h>
#include <linux/kernel.h>
#include <linux/list.h>
@@ -104,56 +103,6 @@ static struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] = {
#include <linux/cgroup_subsys.h>
};

-#define MAX_CGROUP_ROOT_NAMELEN 64
-
-/*
- * A cgroupfs_root represents the root of a cgroup hierarchy,
- * and may be associated with a superblock to form an active
- * hierarchy
- */
-struct cgroupfs_root {
- struct super_block *sb;
-
- /*
- * The bitmask of subsystems intended to be attached to this
- * hierarchy
- */
- unsigned long subsys_mask;
-
- /* Unique id for this hierarchy. */
- int hierarchy_id;
-
- /* The bitmask of subsystems currently attached to this hierarchy */
- unsigned long actual_subsys_mask;
-
- /* A list running through the attached subsystems */
- struct list_head subsys_list;
-
- /* The root cgroup for this hierarchy */
- struct cgroup top_cgroup;
-
- /* Tracks how many cgroups are currently defined in hierarchy.*/
- int number_of_cgroups;
-
- /* A list running through the active hierarchies */
- struct list_head root_list;
-
- /* All cgroups on this root, cgroup_mutex protected */
- struct list_head allcg_list;
-
- /* Hierarchy-specific flags */
- unsigned long flags;
-
- /* IDs for cgroups in this hierarchy */
- struct ida cgroup_ida;
-
- /* The path to use for release notifications. */
- char release_agent_path[PATH_MAX];
-
- /* The name for this hierarchy - may be empty */
- char name[MAX_CGROUP_ROOT_NAMELEN];
-};
-
/*
* The "rootnode" hierarchy is the "dummy hierarchy", reserved for the
* subsystems that are otherwise unattached - it never has more than a
@@ -296,12 +245,6 @@ bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
}
EXPORT_SYMBOL_GPL(cgroup_is_descendant);

-/* cgroupfs_root->flags */
-enum {
- CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
- CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
-};
-
static int cgroup_is_releasable(const struct cgroup *cgrp)
{
const int bits =
--
1.8.1.4

2013-04-12 23:11:47

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/4] cgroup: introduce sane_behavior mount option

It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.

As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility. This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors. As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.

Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default. Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.

This patch introduces the mount option and changes the following
behaviors in cgroup core.

* Mount options "noprefix" and "clone_children" are disallowed. Also,
cgroupfs file cgroup.clone_children is not created.

* When mounting an existing superblock, mount options should match.
This is currently pretty crazy. If one mounts a cgroup, creates a
subdirectory, unmounts it and then mount it again with different
option, it looks like the new options are applied but they aren't.

* Remount is disallowed.

The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
include/linux/cgroup.h | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/cgroup.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b21881e..9c300ad 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -156,6 +156,8 @@ enum {
* specified at mount time and thus is implemented here.
*/
CGRP_CPUSET_CLONE_CHILDREN,
+ /* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
+ CGRP_SANE_BEHAVIOR,
};

struct cgroup_name {
@@ -243,6 +245,37 @@ struct cgroup {

/* cgroupfs_root->flags */
enum {
+ /*
+ * Unfortunately, cgroup core and various controllers are riddled
+ * with idiosyncrasies and pointless options. The following flag,
+ * when set, will force sane behavior - some options are forced on,
+ * others are disallowed, and some controllers will change their
+ * hierarchical or other behaviors.
+ *
+ * The set of behaviors affected by this flag are still being
+ * determined and developed and the mount option for this flag is
+ * prefixed with __DEVEL__. The prefix will be dropped once we
+ * reach the point where all behaviors are compatible with the
+ * planned unified hierarchy, which will automatically turn on this
+ * flag.
+ *
+ * The followings are the behaviors currently affected this flag.
+ *
+ * - Mount options "noprefix" and "clone_children" are disallowed.
+ * Also, cgroupfs file cgroup.clone_children is not created.
+ *
+ * - When mounting an existing superblock, mount options should
+ * match.
+ *
+ * - Remount is disallowed.
+ *
+ * The followings are planned changes.
+ *
+ * - release_agent will be disallowed once replacement notification
+ * mechanism is implemented.
+ */
+ CGRP_ROOT_SANE_BEHAVIOR = (1 << 0),
+
CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
};
@@ -360,6 +393,7 @@ struct cgroup_map_cb {
/* cftype->flags */
#define CFTYPE_ONLY_ON_ROOT (1U << 0) /* only create on root cg */
#define CFTYPE_NOT_ON_ROOT (1U << 1) /* don't create on root cg */
+#define CFTYPE_INSANE (1U << 2) /* don't create if sane_behavior */

#define MAX_CFTYPE_NAME 64

@@ -486,6 +520,15 @@ struct cgroup_scanner {
void *data;
};

+/*
+ * See the comment above CGRP_ROOT_SANE_BEHAVIOR for details. This
+ * function can be called as long as @cgrp is accessible.
+ */
+static inline bool cgroup_sane_behavior(const struct cgroup *cgrp)
+{
+ return cgrp->root->flags & CGRP_ROOT_SANE_BEHAVIOR;
+}
+
/* Caller should hold rcu_read_lock() */
static inline const char *cgroup_name(const struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8848070..e229800 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1080,6 +1080,8 @@ static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry)
mutex_lock(&cgroup_root_mutex);
for_each_subsys(root, ss)
seq_printf(seq, ",%s", ss->name);
+ if (root->flags & CGRP_ROOT_SANE_BEHAVIOR)
+ seq_puts(seq, ",sane_behavior");
if (root->flags & CGRP_ROOT_NOPREFIX)
seq_puts(seq, ",noprefix");
if (root->flags & CGRP_ROOT_XATTR)
@@ -1144,6 +1146,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
all_ss = true;
continue;
}
+ if (!strcmp(token, "__DEVEL__sane_behavior")) {
+ opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+ continue;
+ }
if (!strcmp(token, "noprefix")) {
opts->flags |= CGRP_ROOT_NOPREFIX;
continue;
@@ -1231,6 +1237,20 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)

/* Consistency checks */

+ if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
+ pr_warning("cgroup: sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
+
+ if (opts->flags & CGRP_ROOT_NOPREFIX) {
+ pr_err("cgroup: sane_behavior: noprefix is not allowed\n");
+ return -EINVAL;
+ }
+
+ if (opts->cpuset_clone_children) {
+ pr_err("cgroup: sane_behavior: clone_children is not allowed\n");
+ return -EINVAL;
+ }
+ }
+
/*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
@@ -1307,6 +1327,11 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
struct cgroup_sb_opts opts;
unsigned long added_mask, removed_mask;

+ if (root->flags & CGRP_ROOT_SANE_BEHAVIOR) {
+ pr_err("cgroup: sane_behavior: remount is not allowed\n");
+ return -EINVAL;
+ }
+
mutex_lock(&cgrp->dentry->d_inode->i_mutex);
mutex_lock(&cgroup_mutex);
mutex_lock(&cgroup_root_mutex);
@@ -1657,6 +1682,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
* any) is not needed
*/
cgroup_drop_root(opts.new_root);
+
+ if (((root->flags | opts.flags) & CGRP_ROOT_SANE_BEHAVIOR) &&
+ root->flags != opts.flags) {
+ pr_err("cgroup: sane_behavior: new mount options should match the existing superblock\n");
+ ret = -EINVAL;
+ goto drop_new_super;
+ }
+
/* no subsys rebinding, so refcounts don't change */
drop_parsed_module_refcounts(opts.subsys_mask);
}
@@ -2197,6 +2230,13 @@ static int cgroup_release_agent_show(struct cgroup *cgrp, struct cftype *cft,
return 0;
}

+static int cgroup_sane_behavior_show(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *seq)
+{
+ seq_printf(seq, "%d\n", cgroup_sane_behavior(cgrp));
+ return 0;
+}
+
/* A buffer size big enough for numbers or short strings */
#define CGROUP_LOCAL_BUFFER_SIZE 64

@@ -2678,6 +2718,8 @@ static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,

for (cft = cfts; cft->name[0] != '\0'; cft++) {
/* does cft->flags tell us to skip this file on @cgrp? */
+ if ((cft->flags & CFTYPE_INSANE) && cgroup_sane_behavior(cgrp))
+ continue;
if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgrp->parent)
continue;
if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent)
@@ -3915,10 +3957,17 @@ static struct cftype files[] = {
},
{
.name = "cgroup.clone_children",
+ .flags = CFTYPE_INSANE,
.read_u64 = cgroup_clone_children_read,
.write_u64 = cgroup_clone_children_write,
},
{
+ .name = "cgroup.sane_behavior",
+ .flags = CFTYPE_ONLY_ON_ROOT,
+ .read_seq_string = cgroup_sane_behavior_show,
+ .mode = S_IRUGO,
+ },
+ {
.name = "release_agent",
.flags = CFTYPE_ONLY_ON_ROOT,
.read_seq_string = cgroup_release_agent_show,
--
1.8.1.4

2013-04-12 23:12:11

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/4] cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix

There's no reason to be using bitops, which tends to be more
cumbersome, to handle root flags. Convert them to masks. Also, as
they'll be moved to include/linux/cgroup.h and it's generally a good
idea, add CGRP_ prefix.

Note that flags are assigned from (1 << 1). The first bit will be
used by a flag which will be added soon.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 678a22c..a372eaa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -296,10 +296,10 @@ bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
}
EXPORT_SYMBOL_GPL(cgroup_is_descendant);

-/* bits in struct cgroupfs_root flags field */
+/* cgroupfs_root->flags */
enum {
- ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
- ROOT_XATTR, /* supports extended attributes */
+ CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
+ CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
};

static int cgroup_is_releasable(const struct cgroup *cgrp)
@@ -1137,9 +1137,9 @@ static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry)
mutex_lock(&cgroup_root_mutex);
for_each_subsys(root, ss)
seq_printf(seq, ",%s", ss->name);
- if (test_bit(ROOT_NOPREFIX, &root->flags))
+ if (root->flags & CGRP_ROOT_NOPREFIX)
seq_puts(seq, ",noprefix");
- if (test_bit(ROOT_XATTR, &root->flags))
+ if (root->flags & CGRP_ROOT_XATTR)
seq_puts(seq, ",xattr");
if (strlen(root->release_agent_path))
seq_printf(seq, ",release_agent=%s", root->release_agent_path);
@@ -1202,7 +1202,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
continue;
}
if (!strcmp(token, "noprefix")) {
- set_bit(ROOT_NOPREFIX, &opts->flags);
+ opts->flags |= CGRP_ROOT_NOPREFIX;
continue;
}
if (!strcmp(token, "clone_children")) {
@@ -1210,7 +1210,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
continue;
}
if (!strcmp(token, "xattr")) {
- set_bit(ROOT_XATTR, &opts->flags);
+ opts->flags |= CGRP_ROOT_XATTR;
continue;
}
if (!strncmp(token, "release_agent=", 14)) {
@@ -1293,8 +1293,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
*/
- if (test_bit(ROOT_NOPREFIX, &opts->flags) &&
- (opts->subsys_mask & mask))
+ if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
return -EINVAL;


@@ -2523,7 +2522,7 @@ static struct simple_xattrs *__d_xattrs(struct dentry *dentry)
static inline int xattr_enabled(struct dentry *dentry)
{
struct cgroupfs_root *root = dentry->d_sb->s_fs_info;
- return test_bit(ROOT_XATTR, &root->flags);
+ return root->flags & CGRP_ROOT_XATTR;
}

static bool is_valid_xattr(const char *name)
@@ -2695,7 +2694,7 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,

simple_xattrs_init(&cft->xattrs);

- if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
+ if (subsys && !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
strcpy(name, subsys->name);
strcat(name, ".");
}
--
1.8.1.4

2013-04-15 00:56:47

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 1/4] cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix

Quoting Tejun Heo ([email protected]):
> There's no reason to be using bitops, which tends to be more
> cumbersome, to handle root flags. Convert them to masks. Also, as
> they'll be moved to include/linux/cgroup.h and it's generally a good
> idea, add CGRP_ prefix.
>
> Note that flags are assigned from (1 << 1). The first bit will be
> used by a flag which will be added soon.
>
> Signed-off-by: Tejun Heo <[email protected]>

This *is* much nicer to read, thanks.

Acked-by: Serge E. Hallyn <[email protected]>

> ---
> kernel/cgroup.c | 21 ++++++++++-----------
> 1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 678a22c..a372eaa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -296,10 +296,10 @@ bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
> }
> EXPORT_SYMBOL_GPL(cgroup_is_descendant);
>
> -/* bits in struct cgroupfs_root flags field */
> +/* cgroupfs_root->flags */
> enum {
> - ROOT_NOPREFIX, /* mounted subsystems have no named prefix */
> - ROOT_XATTR, /* supports extended attributes */
> + CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
> + CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
> };
>
> static int cgroup_is_releasable(const struct cgroup *cgrp)
> @@ -1137,9 +1137,9 @@ static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry)
> mutex_lock(&cgroup_root_mutex);
> for_each_subsys(root, ss)
> seq_printf(seq, ",%s", ss->name);
> - if (test_bit(ROOT_NOPREFIX, &root->flags))
> + if (root->flags & CGRP_ROOT_NOPREFIX)
> seq_puts(seq, ",noprefix");
> - if (test_bit(ROOT_XATTR, &root->flags))
> + if (root->flags & CGRP_ROOT_XATTR)
> seq_puts(seq, ",xattr");
> if (strlen(root->release_agent_path))
> seq_printf(seq, ",release_agent=%s", root->release_agent_path);
> @@ -1202,7 +1202,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> continue;
> }
> if (!strcmp(token, "noprefix")) {
> - set_bit(ROOT_NOPREFIX, &opts->flags);
> + opts->flags |= CGRP_ROOT_NOPREFIX;
> continue;
> }
> if (!strcmp(token, "clone_children")) {
> @@ -1210,7 +1210,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> continue;
> }
> if (!strcmp(token, "xattr")) {
> - set_bit(ROOT_XATTR, &opts->flags);
> + opts->flags |= CGRP_ROOT_XATTR;
> continue;
> }
> if (!strncmp(token, "release_agent=", 14)) {
> @@ -1293,8 +1293,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> * with the old cpuset, so we allow noprefix only if mounting just
> * the cpuset subsystem.
> */
> - if (test_bit(ROOT_NOPREFIX, &opts->flags) &&
> - (opts->subsys_mask & mask))
> + if ((opts->flags & CGRP_ROOT_NOPREFIX) && (opts->subsys_mask & mask))
> return -EINVAL;
>
>
> @@ -2523,7 +2522,7 @@ static struct simple_xattrs *__d_xattrs(struct dentry *dentry)
> static inline int xattr_enabled(struct dentry *dentry)
> {
> struct cgroupfs_root *root = dentry->d_sb->s_fs_info;
> - return test_bit(ROOT_XATTR, &root->flags);
> + return root->flags & CGRP_ROOT_XATTR;
> }
>
> static bool is_valid_xattr(const char *name)
> @@ -2695,7 +2694,7 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
>
> simple_xattrs_init(&cft->xattrs);
>
> - if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
> + if (subsys && !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
> strcpy(name, subsys->name);
> strcat(name, ".");
> }
> --
> 1.8.1.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2013-04-15 01:02:53

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 2/4] move cgroupfs_root to include/linux/cgroup.h

Quoting Tejun Heo ([email protected]):
> While controllers shouldn't be accessing cgroupfs_root directly, it
> being hidden inside kern/cgroup.c makes somethings pretty silly. This
> makes routing hierarchy-wide settings which need to be visible to
> controllers cumbersome.
>
> We're gonna add another hierarchy-wide setting which needs to be
> accessed from controllers. Move cgroupfs_root and its flags to the
> header file so that we can access root settings with inline helpers.
>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: Serge E. Hallyn <[email protected]>

2013-04-15 01:06:06

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/4] cgroup: introduce sane_behavior mount option

Quoting Tejun Heo ([email protected]):
> It's a sad fact that at this point various cgroup controllers are
> carrying so many idiosyncrasies and pure insanities that it simply
> isn't possible to reach any sort of sane consistent behavior while
> maintaining staying fully compatible with what already has been
> exposed to userland.
>
> As we can't break exposed userland interface, transitioning to sane
> behaviors can only be done in steps while maintaining backwards
> compatibility. This patch introduces a new mount option -
> __DEVEL__sane_behavior - which disables crazy features and enforces
> consistent behaviors in cgroup core proper and various controllers.
> As exactly which behaviors it changes are still being determined, the
> mount option, at this point, is useful only for development of the new
> behaviors. As such, the mount option is prefixed with __DEVEL__ and
> generates a warning message when used.
>
> Eventually, once we get to the point where all controller's behaviors
> are consistent enough to implement unified hierarchy, the __DEVEL__
> prefix will be dropped, and more importantly, unified-hierarchy will
> enforce sane_behavior by default. Maybe we'll able to completely drop
> the crazy stuff after a while, maybe not, but we at least have a
> strategy to move on to saner behaviors.
>
> This patch introduces the mount option and changes the following
> behaviors in cgroup core.
>
> * Mount options "noprefix" and "clone_children" are disallowed. Also,
> cgroupfs file cgroup.clone_children is not created.
>
> * When mounting an existing superblock, mount options should match.
> This is currently pretty crazy. If one mounts a cgroup, creates a
> subdirectory, unmounts it and then mount it again with different
> option, it looks like the new options are applied but they aren't.
>
> * Remount is disallowed.
>
> The behaviors changes are documented in the comment above
> CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
> controllers are converted and planned improvements progress.
>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: Serge E. Hallyn <[email protected]>

> Cc: Li Zefan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> ---
> include/linux/cgroup.h | 43 +++++++++++++++++++++++++++++++++++++++++++
> kernel/cgroup.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 92 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b21881e..9c300ad 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -156,6 +156,8 @@ enum {
> * specified at mount time and thus is implemented here.
> */
> CGRP_CPUSET_CLONE_CHILDREN,
> + /* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
> + CGRP_SANE_BEHAVIOR,
> };
>
> struct cgroup_name {
> @@ -243,6 +245,37 @@ struct cgroup {
>
> /* cgroupfs_root->flags */
> enum {
> + /*
> + * Unfortunately, cgroup core and various controllers are riddled
> + * with idiosyncrasies and pointless options. The following flag,
> + * when set, will force sane behavior - some options are forced on,
> + * others are disallowed, and some controllers will change their
> + * hierarchical or other behaviors.
> + *
> + * The set of behaviors affected by this flag are still being
> + * determined and developed and the mount option for this flag is
> + * prefixed with __DEVEL__. The prefix will be dropped once we
> + * reach the point where all behaviors are compatible with the
> + * planned unified hierarchy, which will automatically turn on this
> + * flag.
> + *
> + * The followings are the behaviors currently affected this flag.
> + *
> + * - Mount options "noprefix" and "clone_children" are disallowed.
> + * Also, cgroupfs file cgroup.clone_children is not created.
> + *
> + * - When mounting an existing superblock, mount options should
> + * match.
> + *
> + * - Remount is disallowed.
> + *
> + * The followings are planned changes.
> + *
> + * - release_agent will be disallowed once replacement notification
> + * mechanism is implemented.
> + */
> + CGRP_ROOT_SANE_BEHAVIOR = (1 << 0),
> +
> CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
> CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
> };
> @@ -360,6 +393,7 @@ struct cgroup_map_cb {
> /* cftype->flags */
> #define CFTYPE_ONLY_ON_ROOT (1U << 0) /* only create on root cg */
> #define CFTYPE_NOT_ON_ROOT (1U << 1) /* don't create on root cg */
> +#define CFTYPE_INSANE (1U << 2) /* don't create if sane_behavior */
>
> #define MAX_CFTYPE_NAME 64
>
> @@ -486,6 +520,15 @@ struct cgroup_scanner {
> void *data;
> };
>
> +/*
> + * See the comment above CGRP_ROOT_SANE_BEHAVIOR for details. This
> + * function can be called as long as @cgrp is accessible.
> + */
> +static inline bool cgroup_sane_behavior(const struct cgroup *cgrp)
> +{
> + return cgrp->root->flags & CGRP_ROOT_SANE_BEHAVIOR;
> +}
> +
> /* Caller should hold rcu_read_lock() */
> static inline const char *cgroup_name(const struct cgroup *cgrp)
> {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 8848070..e229800 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1080,6 +1080,8 @@ static int cgroup_show_options(struct seq_file *seq, struct dentry *dentry)
> mutex_lock(&cgroup_root_mutex);
> for_each_subsys(root, ss)
> seq_printf(seq, ",%s", ss->name);
> + if (root->flags & CGRP_ROOT_SANE_BEHAVIOR)
> + seq_puts(seq, ",sane_behavior");
> if (root->flags & CGRP_ROOT_NOPREFIX)
> seq_puts(seq, ",noprefix");
> if (root->flags & CGRP_ROOT_XATTR)
> @@ -1144,6 +1146,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> all_ss = true;
> continue;
> }
> + if (!strcmp(token, "__DEVEL__sane_behavior")) {
> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> + continue;
> + }
> if (!strcmp(token, "noprefix")) {
> opts->flags |= CGRP_ROOT_NOPREFIX;
> continue;
> @@ -1231,6 +1237,20 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> /* Consistency checks */
>
> + if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> + pr_warning("cgroup: sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> +
> + if (opts->flags & CGRP_ROOT_NOPREFIX) {
> + pr_err("cgroup: sane_behavior: noprefix is not allowed\n");
> + return -EINVAL;
> + }
> +
> + if (opts->cpuset_clone_children) {
> + pr_err("cgroup: sane_behavior: clone_children is not allowed\n");
> + return -EINVAL;
> + }
> + }
> +
> /*
> * Option noprefix was introduced just for backward compatibility
> * with the old cpuset, so we allow noprefix only if mounting just
> @@ -1307,6 +1327,11 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
> struct cgroup_sb_opts opts;
> unsigned long added_mask, removed_mask;
>
> + if (root->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> + pr_err("cgroup: sane_behavior: remount is not allowed\n");
> + return -EINVAL;
> + }
> +
> mutex_lock(&cgrp->dentry->d_inode->i_mutex);
> mutex_lock(&cgroup_mutex);
> mutex_lock(&cgroup_root_mutex);
> @@ -1657,6 +1682,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> * any) is not needed
> */
> cgroup_drop_root(opts.new_root);
> +
> + if (((root->flags | opts.flags) & CGRP_ROOT_SANE_BEHAVIOR) &&
> + root->flags != opts.flags) {
> + pr_err("cgroup: sane_behavior: new mount options should match the existing superblock\n");
> + ret = -EINVAL;
> + goto drop_new_super;
> + }
> +
> /* no subsys rebinding, so refcounts don't change */
> drop_parsed_module_refcounts(opts.subsys_mask);
> }
> @@ -2197,6 +2230,13 @@ static int cgroup_release_agent_show(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +static int cgroup_sane_behavior_show(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *seq)
> +{
> + seq_printf(seq, "%d\n", cgroup_sane_behavior(cgrp));
> + return 0;
> +}
> +
> /* A buffer size big enough for numbers or short strings */
> #define CGROUP_LOCAL_BUFFER_SIZE 64
>
> @@ -2678,6 +2718,8 @@ static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
>
> for (cft = cfts; cft->name[0] != '\0'; cft++) {
> /* does cft->flags tell us to skip this file on @cgrp? */
> + if ((cft->flags & CFTYPE_INSANE) && cgroup_sane_behavior(cgrp))
> + continue;
> if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgrp->parent)
> continue;
> if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent)
> @@ -3915,10 +3957,17 @@ static struct cftype files[] = {
> },
> {
> .name = "cgroup.clone_children",
> + .flags = CFTYPE_INSANE,
> .read_u64 = cgroup_clone_children_read,
> .write_u64 = cgroup_clone_children_write,
> },
> {
> + .name = "cgroup.sane_behavior",
> + .flags = CFTYPE_ONLY_ON_ROOT,
> + .read_seq_string = cgroup_sane_behavior_show,
> + .mode = S_IRUGO,
> + },
> + {
> .name = "release_agent",
> .flags = CFTYPE_ONLY_ON_ROOT,
> .read_seq_string = cgroup_release_agent_show,
> --
> 1.8.1.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2013-04-15 01:06:57

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

Quoting Tejun Heo ([email protected]):
> Turn on use_hierarchy by default if sane_behavior is specified and
> don't create .use_hierarchy file.
>
> It is debatable whether to remove .use_hierarchy file or make it ro as
> the former could make transition easier in certain cases; however, the
> behavior changes which will be gated by sane_behavior are intensive
> including changing basic meaning of certain control knobs in a few
> controllers and I don't really think keeping this piece would make
> things easier in any noticeable way, so let's remove it.
>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: Serge E. Hallyn <[email protected]>

> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/cgroup.h | 3 +++
> mm/memcontrol.c | 13 +++++++++++++
> 2 files changed, 16 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 9c300ad..c562e33 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -269,6 +269,9 @@ enum {
> *
> * - Remount is disallowed.
> *
> + * - memcg: use_hierarchy is on by default and the cgroup file for
> + * the flag is not created.
> + *
> * The followings are planned changes.
> *
> * - release_agent will be disallowed once replacement notification
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9715c0c..a651131 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5814,6 +5814,7 @@ static struct cftype mem_cgroup_files[] = {
> },
> {
> .name = "use_hierarchy",
> + .flags = CFTYPE_INSANE,
> .write_u64 = mem_cgroup_hierarchy_write,
> .read_u64 = mem_cgroup_hierarchy_read,
> },
> @@ -6784,6 +6785,17 @@ static void mem_cgroup_move_task(struct cgroup *cont,
> }
> #endif
>
> +/*
> + * Cgroup retains root cgroups across [un]mount cycles making it necessary
> + * to verify sane_behavior flag on each mount attempt.
> + */
> +static void mem_cgroup_bind(struct cgroup *root)
> +{
> + /* use_hierarchy is forced with sane_behavior */
> + if (cgroup_sane_behavior(root))
> + mem_cgroup_from_cont(root)->use_hierarchy = true;
> +}
> +
> struct cgroup_subsys mem_cgroup_subsys = {
> .name = "memory",
> .subsys_id = mem_cgroup_subsys_id,
> @@ -6794,6 +6806,7 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .can_attach = mem_cgroup_can_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> .attach = mem_cgroup_move_task,
> + .bind = mem_cgroup_bind,
> .base_cftypes = mem_cgroup_files,
> .early_init = 0,
> .use_id = 1,
> --
> 1.8.1.4
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2013-04-15 01:13:44

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

Quoting Tejun Heo ([email protected]):
> Turn on use_hierarchy by default if sane_behavior is specified and
> don't create .use_hierarchy file.
>
> It is debatable whether to remove .use_hierarchy file or make it ro as
> the former could make transition easier in certain cases; however, the
> behavior changes which will be gated by sane_behavior are intensive
> including changing basic meaning of certain control knobs in a few
> controllers and I don't really think keeping this piece would make
> things easier in any noticeable way, so let's remove it.

Hi Tejun,

this actually reminds me of something that's been on my todo list to
report for some time, but I haven't had time to find the source of the
bug... And maybe it's already been reported... but

If I do

cd /sys/fs/cgroup/memory
mkdir b
cd b
echo 1 > memory.use_hierarchy
echo 5000 > memory.limit_in_bytes
cat memory.limit_in_bytes
8192
mkdir c
cd c
cat memory.use_hierarchy
1
cat memory.limit_in_bytes
9223372036854775807
echo $$ > tasks
bash
<killed>

So it seems the hierarchy is being enforced, but not reported in
child limit_in_bytes files.

(Last tested tonight on 3.8.0-17-generic #27-Ubuntu fwiw)

-serge

2013-04-15 02:35:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

On Sun 14-04-13 20:13:36, Serge Hallyn wrote:
> Quoting Tejun Heo ([email protected]):
> > Turn on use_hierarchy by default if sane_behavior is specified and
> > don't create .use_hierarchy file.
> >
> > It is debatable whether to remove .use_hierarchy file or make it ro as
> > the former could make transition easier in certain cases; however, the
> > behavior changes which will be gated by sane_behavior are intensive
> > including changing basic meaning of certain control knobs in a few
> > controllers and I don't really think keeping this piece would make
> > things easier in any noticeable way, so let's remove it.
>
> Hi Tejun,
>
> this actually reminds me of something that's been on my todo list to
> report for some time, but I haven't had time to find the source of the
> bug... And maybe it's already been reported... but
>
> If I do
>
> cd /sys/fs/cgroup/memory
> mkdir b
> cd b
> echo 1 > memory.use_hierarchy
> echo 5000 > memory.limit_in_bytes
> cat memory.limit_in_bytes
> 8192
> mkdir c
> cd c
> cat memory.use_hierarchy
> 1
> cat memory.limit_in_bytes
> 9223372036854775807
> echo $$ > tasks
> bash
> <killed>
>
> So it seems the hierarchy is being enforced, but not reported in
> child limit_in_bytes files.

This is an intended behavior. Limits are not propagated to the children
because they are enforced anyway (by reclaim). The behavior would
be quite inconsistent when the parent limit would be changed later
otherwise. We only inherit properties which are enforced hierarchically:
use_hierarchy, oom_disable and swappiness.

> (Last tested tonight on 3.8.0-17-generic #27-Ubuntu fwiw)
>
> -serge

--
Michal Hocko
SUSE Labs

2013-04-15 02:39:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

Hello, Serge.

On Sun, Apr 14, 2013 at 08:13:36PM -0500, Serge Hallyn wrote:
> If I do
>
> cd /sys/fs/cgroup/memory
> mkdir b
> cd b
> echo 1 > memory.use_hierarchy
> echo 5000 > memory.limit_in_bytes
> cat memory.limit_in_bytes
> 8192
> mkdir c
> cd c
> cat memory.use_hierarchy
> 1
> cat memory.limit_in_bytes
> 9223372036854775807
> echo $$ > tasks
> bash
> <killed>
>
> So it seems the hierarchy is being enforced, but not reported in
> child limit_in_bytes files.

Hmm.... if I understand you correctly, it ain't bug. It's supposed to
work that way. The parent has certain limits and the child doesn't.
The child will operate within the paren't limits but in those limits
it isn't restricted. We actually have a controller which does
propagate configuration, the device security one, which I don't think
is really optimal but it seems to be the easier way to implement
hierarchical behavior for that controller.

Anyways, if you think about the use cases, the current memcg way makes
a lot more sense and is more flexible. e.g. You can express things
like A + B shouldn't go above 1000 (whatever the unit is) but A and B
in each can go upto 700 when there's room.

Thanks.

--
tejun

2013-04-15 02:50:15

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 3/4] cgroup: introduce sane_behavior mount option

> {
> + .name = "cgroup.sane_behavior",
> + .flags = CFTYPE_ONLY_ON_ROOT,
> + .read_seq_string = cgroup_sane_behavior_show,
> + .mode = S_IRUGO,
> + },

We don't have to set .mode explicitly.

2013-04-15 02:50:22

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup, memcg: introduce sane_behavior mount option

On 2013/4/13 7:10, Tejun Heo wrote:
> It's a sad fact that at this point various cgroup controllers are
> carrying so many idiosyncrasies and pure insanities that it simply
> isn't possible to reach any sort of sane consistent behavior while
> staying compatible with what already has been exposed to userland.
>
> To make progress, those behaviors need to go but we can't simply drop
> or change the crazies as those are directly visible to userland. This
> patchset implements a mount option - sane_behavior - which turns on
> new saner behaviors, so that we can keep providing the old behaviors
> while and after transitioning to saner ones.
>
> As the behaviors which should be changed are still being determined
> and then implemented, __DEVEL__ prefix is added to the mount option
> and it triggers a warning message when used.
>
> The mount option changes the following behaviors after this patchset.
>
> * Mount options "noprefix" and "clone_children" are disallowed. Also,
> cgroupfs file cgroup.clone_children is not created.
>
> * When mounting an existing superblock, mount options should match.
> This is currently pretty crazy. If one mounts a cgroup, creates a
> subdirectory, unmounts it and then mount it again with different
> option, it looks like the new options are applied but they aren't.
>
> * Remount is disallowed.
>
> * memcg: .use_hierarchy is forced on and the cgroupfs file is not
> created.
>
> and there are a lot more to come. Basically, when turned on, all
> controllers should be ready to be mounted in the same hierarchy and
> not get in the way unless specifically configured - making blk-throtl
> hierarchical would need this to flip the meaning of limits, cpuset to
> allow tasks to run by default in new cgroups and handle empty cpusets
> in a way friendly to being co-mounted, and so on.
>
> This patchset contains the following four patches.
>
> 0001-cgroup-convert-cgroupfs_root-flag-bits-to-masks-and-.patch
> 0002-move-cgroupfs_root-to-include-linux-cgroup.h.patch
> 0003-cgroup-introduce-sane_behavior-mount-option.patch
> 0004-memcg-force-use_hierarchy-if-sane_behavior.patch
>
> 0001-0002 are prep patches. It exposes cgroupfs_root in cgroup.h so
> that flags can be tested with inline helpers.
>
> 0003 introduces sane_behavior mount option and implements behavior
> changes in cgroup core proper.
>
> 0004 makes memcg .use_hierarchy changes.
>

Looks good to me!

Acked-by: Li Zefan <[email protected]>

2013-04-15 02:54:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/4] cgroup: introduce sane_behavior mount option

On Mon, Apr 15, 2013 at 10:49:15AM +0800, Li Zefan wrote:
> > {
> > + .name = "cgroup.sane_behavior",
> > + .flags = CFTYPE_ONLY_ON_ROOT,
> > + .read_seq_string = cgroup_sane_behavior_show,
> > + .mode = S_IRUGO,
> > + },
>
> We don't have to set .mode explicitly.

Right, dropping.

Thanks.

--
tejun

2013-04-15 03:18:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup, memcg: introduce sane_behavior mount option

On Fri, Apr 12, 2013 at 04:10:55PM -0700, Tejun Heo wrote:
> To make progress, those behaviors need to go but we can't simply drop
> or change the crazies as those are directly visible to userland. This
> patchset implements a mount option - sane_behavior - which turns on
> new saner behaviors, so that we can keep providing the old behaviors
> while and after transitioning to saner ones.

Applied 1-3 to cgroup/for-3.10. Michal, I left the last memcg one
out. Please let me know how you wanna route it.

Thanks.

--
tejun

2013-04-15 05:29:36

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

Quoting Tejun Heo ([email protected]):
> Hello, Serge.
>
> On Sun, Apr 14, 2013 at 08:13:36PM -0500, Serge Hallyn wrote:
> > If I do
> >
> > cd /sys/fs/cgroup/memory
> > mkdir b
> > cd b
> > echo 1 > memory.use_hierarchy
> > echo 5000 > memory.limit_in_bytes
> > cat memory.limit_in_bytes
> > 8192
> > mkdir c
> > cd c
> > cat memory.use_hierarchy
> > 1
> > cat memory.limit_in_bytes
> > 9223372036854775807
> > echo $$ > tasks
> > bash
> > <killed>
> >
> > So it seems the hierarchy is being enforced, but not reported in
> > child limit_in_bytes files.
>
> Hmm.... if I understand you correctly, it ain't bug. It's supposed to
> work that way. The parent has certain limits and the child doesn't.
> The child will operate within the paren't limits but in those limits
> it isn't restricted. We actually have a controller which does
> propagate configuration, the device security one, which I don't think
> is really optimal but it seems to be the easier way to implement
> hierarchical behavior for that controller.
>
> Anyways, if you think about the use cases, the current memcg way makes
> a lot more sense and is more flexible. e.g. You can express things
> like A + B shouldn't go above 1000 (whatever the unit is) but A and B
> in each can go upto 700 when there's room.

True, that makes sense, thanks.

This example would be great to have in Documentation/cgroups/memory.txt.
Perhaps as a new subsection 6.2?

-serge

2013-04-15 14:42:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

On Fri 12-04-13 16:10:59, Tejun Heo wrote:
> Turn on use_hierarchy by default if sane_behavior is specified and
> don't create .use_hierarchy file.
>
> It is debatable whether to remove .use_hierarchy file or make it ro as
> the former could make transition easier in certain cases; however, the
> behavior changes which will be gated by sane_behavior are intensive
> including changing basic meaning of certain control knobs in a few
> controllers and I don't really think keeping this piece would make
> things easier in any noticeable way, so let's remove it.

OK, as the feature has to be enabled explicitly I do not mind not
exporting the file.
I have one nit though. Can we extend the comment for mem_cgroup_bind
that it doesn't have to care about children because there are none
because rebind_subsystems wouldn't allow remounting otherwise?
It wasn't obvious, at least from me.

Other than that.
Acked-by: Michal Hocko <[email protected]>

Thanks!

> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/cgroup.h | 3 +++
> mm/memcontrol.c | 13 +++++++++++++
> 2 files changed, 16 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 9c300ad..c562e33 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -269,6 +269,9 @@ enum {
> *
> * - Remount is disallowed.
> *
> + * - memcg: use_hierarchy is on by default and the cgroup file for
> + * the flag is not created.
> + *
> * The followings are planned changes.
> *
> * - release_agent will be disallowed once replacement notification
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9715c0c..a651131 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5814,6 +5814,7 @@ static struct cftype mem_cgroup_files[] = {
> },
> {
> .name = "use_hierarchy",
> + .flags = CFTYPE_INSANE,
> .write_u64 = mem_cgroup_hierarchy_write,
> .read_u64 = mem_cgroup_hierarchy_read,
> },
> @@ -6784,6 +6785,17 @@ static void mem_cgroup_move_task(struct cgroup *cont,
> }
> #endif
>
> +/*
> + * Cgroup retains root cgroups across [un]mount cycles making it necessary
> + * to verify sane_behavior flag on each mount attempt.
> + */
> +static void mem_cgroup_bind(struct cgroup *root)
> +{
> + /* use_hierarchy is forced with sane_behavior */
> + if (cgroup_sane_behavior(root))
> + mem_cgroup_from_cont(root)->use_hierarchy = true;
> +}
> +
> struct cgroup_subsys mem_cgroup_subsys = {
> .name = "memory",
> .subsys_id = mem_cgroup_subsys_id,
> @@ -6794,6 +6806,7 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .can_attach = mem_cgroup_can_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> .attach = mem_cgroup_move_task,
> + .bind = mem_cgroup_bind,
> .base_cftypes = mem_cgroup_files,
> .early_init = 0,
> .use_id = 1,
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Michal Hocko
SUSE Labs

2013-04-15 14:46:39

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup, memcg: introduce sane_behavior mount option

On Sun 14-04-13 20:17:54, Tejun Heo wrote:
> On Fri, Apr 12, 2013 at 04:10:55PM -0700, Tejun Heo wrote:
> > To make progress, those behaviors need to go but we can't simply drop
> > or change the crazies as those are directly visible to userland. This
> > patchset implements a mount option - sane_behavior - which turns on
> > new saner behaviors, so that we can keep providing the old behaviors
> > while and after transitioning to saner ones.
>
> Applied 1-3 to cgroup/for-3.10. Michal, I left the last memcg one
> out. Please let me know how you wanna route it.

You can route it via your tree as it seems like the simplest way.

Thanks!
--
Michal Hocko
SUSE Labs

2013-04-15 15:31:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/4] memcg: force use_hierarchy if sane_behavior

(2013/04/13 8:10), Tejun Heo wrote:
> Turn on use_hierarchy by default if sane_behavior is specified and
> don't create .use_hierarchy file.
>
> It is debatable whether to remove .use_hierarchy file or make it ro as
> the former could make transition easier in certain cases; however, the
> behavior changes which will be gated by sane_behavior are intensive
> including changing basic meaning of certain control knobs in a few
> controllers and I don't really think keeping this piece would make
> things easier in any noticeable way, so let's remove it.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>

I'd like to say this behaivor as v2 rather than sane_behavior....

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Hmm.. we need to check performance.


2013-04-15 20:40:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCH v2 4/4] memcg: force use_hierarchy if sane_behavior

Turn on use_hierarchy by default if sane_behavior is specified and
don't create .use_hierarchy file.

It is debatable whether to remove .use_hierarchy file or make it ro as
the former could make transition easier in certain cases; however, the
behavior changes which will be gated by sane_behavior are intensive
including changing basic meaning of certain control knobs in a few
controllers and I don't really think keeping this piece would make
things easier in any noticeable way, so let's remove it.

v2: Explain that mem_cgroup_bind() doesn't have to worry about
children as suggested by Michal Hocko.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Acked-by: Li Zefan <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Applying to cgroup/for-3.10.

Thanks!

include/linux/cgroup.h | 3 +++
mm/memcontrol.c | 17 +++++++++++++++++
2 files changed, 20 insertions(+)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -269,6 +269,9 @@ enum {
*
* - Remount is disallowed.
*
+ * - memcg: use_hierarchy is on by default and the cgroup file for
+ * the flag is not created.
+ *
* The followings are planned changes.
*
* - release_agent will be disallowed once replacement notification
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5814,6 +5814,7 @@ static struct cftype mem_cgroup_files[]
},
{
.name = "use_hierarchy",
+ .flags = CFTYPE_INSANE,
.write_u64 = mem_cgroup_hierarchy_write,
.read_u64 = mem_cgroup_hierarchy_read,
},
@@ -6784,6 +6785,21 @@ static void mem_cgroup_move_task(struct
}
#endif

+/*
+ * Cgroup retains root cgroups across [un]mount cycles making it necessary
+ * to verify sane_behavior flag on each mount attempt.
+ */
+static void mem_cgroup_bind(struct cgroup *root)
+{
+ /*
+ * use_hierarchy is forced with sane_behavior. cgroup core
+ * guarantees that @root doesn't have any children, so turning it
+ * on for the root memcg is enough.
+ */
+ if (cgroup_sane_behavior(root))
+ mem_cgroup_from_cont(root)->use_hierarchy = true;
+}
+
struct cgroup_subsys mem_cgroup_subsys = {
.name = "memory",
.subsys_id = mem_cgroup_subsys_id,
@@ -6794,6 +6810,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+ .bind = mem_cgroup_bind,
.base_cftypes = mem_cgroup_files,
.early_init = 0,
.use_id = 1,

2013-04-15 20:57:53

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] memcg: force use_hierarchy if sane_behavior

On Mon 15-04-13 13:40:16, Tejun Heo wrote:
> Turn on use_hierarchy by default if sane_behavior is specified and
> don't create .use_hierarchy file.
>
> It is debatable whether to remove .use_hierarchy file or make it ro as
> the former could make transition easier in certain cases; however, the
> behavior changes which will be gated by sane_behavior are intensive
> including changing basic meaning of certain control knobs in a few
> controllers and I don't really think keeping this piece would make
> things easier in any noticeable way, so let's remove it.
>
> v2: Explain that mem_cgroup_bind() doesn't have to worry about
> children as suggested by Michal Hocko.

Thanks a lot Tejun!

>
> Signed-off-by: Tejun Heo <[email protected]>
> Acked-by: Serge E. Hallyn <[email protected]>
> Acked-by: Li Zefan <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Applying to cgroup/for-3.10.
>
> Thanks!
>
> include/linux/cgroup.h | 3 +++
> mm/memcontrol.c | 17 +++++++++++++++++
> 2 files changed, 20 insertions(+)
>
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -269,6 +269,9 @@ enum {
> *
> * - Remount is disallowed.
> *
> + * - memcg: use_hierarchy is on by default and the cgroup file for
> + * the flag is not created.
> + *
> * The followings are planned changes.
> *
> * - release_agent will be disallowed once replacement notification
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5814,6 +5814,7 @@ static struct cftype mem_cgroup_files[]
> },
> {
> .name = "use_hierarchy",
> + .flags = CFTYPE_INSANE,
> .write_u64 = mem_cgroup_hierarchy_write,
> .read_u64 = mem_cgroup_hierarchy_read,
> },
> @@ -6784,6 +6785,21 @@ static void mem_cgroup_move_task(struct
> }
> #endif
>
> +/*
> + * Cgroup retains root cgroups across [un]mount cycles making it necessary
> + * to verify sane_behavior flag on each mount attempt.
> + */
> +static void mem_cgroup_bind(struct cgroup *root)
> +{
> + /*
> + * use_hierarchy is forced with sane_behavior. cgroup core
> + * guarantees that @root doesn't have any children, so turning it
> + * on for the root memcg is enough.
> + */
> + if (cgroup_sane_behavior(root))
> + mem_cgroup_from_cont(root)->use_hierarchy = true;
> +}
> +
> struct cgroup_subsys mem_cgroup_subsys = {
> .name = "memory",
> .subsys_id = mem_cgroup_subsys_id,
> @@ -6794,6 +6810,7 @@ struct cgroup_subsys mem_cgroup_subsys =
> .can_attach = mem_cgroup_can_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> .attach = mem_cgroup_move_task,
> + .bind = mem_cgroup_bind,
> .base_cftypes = mem_cgroup_files,
> .early_init = 0,
> .use_id = 1,

--
Michal Hocko
SUSE Labs