2012-11-16 19:20:32

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET cgroup/for-3.8] netprio_cgroup: implement hierarchy support

Hello, guys.

This patchset implements hierarchy support for netprio_cgroup.
netprio_cgroup along with netcls_cgroup is a rather weird in that it
really isn't about resource control. It just hitches on cgroup as a
convenient mechanism to do stuff to groups of tasks. The hierarchy
support reflects such nature. There's no limit being imposed from
ancestors. It simply propagates configuration downwards until there's
a node with its own config. IOW, any given cgroup inherits priorities
from its parent for all netdevs which it doesn't have its own config
for.

As a parent isn't affected by child inheriting its config, the
hierarchy implementation is pretty simple. It's enough to inherit
config from ->css_online() and propagate new config downwards from
write_priomap(). As each node needs to know which config is its local
one and which is inherited, an extra config array is added -
netprio_map->aux[]. It's a separate array to avoid disturbing spatial
locality of ->priomap[]. While it currently contains single one-bit
flag, I still made it a struct so that adding more configuration
(e.g. max_prio) is easy.

Note that this does change userland-visible behavior. Now, nesting is
allowed and cgroups at the first level inherit priorities from the
root cgroup. I can't think of any better way than just biting the
bullet here. :(

0001-cgroup-add-cgroup-id.patch
0002-netprio-simplify-write_priomap.patch
0003-netprio_cgroup-shorten-variable-names-in-extend_netd.patch
0004-netprio_cgroup-reimplement-priomap-expansion.patch
0005-netprio_cgroup-use-cgroup-id-instead-of-cgroup_netpr.patch
0006-netprio_cgroup-implement-netprio-_set-_prio-helpers.patch
0007-netprio_cgroup-keep-track-of-whether-prio-is-set-or-.patch
0008-netprio_cgroup-implement-hierarchy-support.patch

0001 adds cgroup->id. This will eventually replace css_id.

0002-0006 are prep patches.

0007 implements is_local flag which tracks whether a cgroup has its
own config or should inherit from its parent.

0008 implements hierarchy support.

This patchset is on top of

cgroup/for-3.8 ef9fe980c6 ("cgroup_freezer: implement proper hierarchy support")
+ [1] "[PATCHSET cgroup/for-3.8] cgroup: allow ->post_create() to fail"
+ [2] "[PATCH 1/2] cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/"
"[PATCH 2/2] cgroup, cpuset: remove cgroup_subsys->post_clone()"

and available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-netprio_cgroup-hierarchy

diffstat follows.

Documentation/cgroups/net_prio.txt | 21 +-
include/linux/cgroup.h | 2
include/net/netprio_cgroup.h | 21 +-
kernel/cgroup.c | 15 +
net/core/netprio_cgroup.c | 376 +++++++++++++++++++++++--------------
5 files changed, 284 insertions(+), 151 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel.cgroups/5047
[2] http://thread.gmane.org/gmane.linux.kernel/1393151


2012-11-16 19:20:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/8] cgroup: add cgroup->id

With the introduction of generic cgroup hierarchy iterators, css_id is
being phased out. It was unnecessarily complex, id'ing the wrong
thing (cgroups need IDs, not CSSes) and has other oddities like not
being available at ->css_alloc().

This patch adds cgroup->id, which is a simple per-hierarchy
ida-allocated ID which is assigned before ->css_alloc() and released
after ->css_free().

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 2 ++
kernel/cgroup.c | 15 ++++++++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d5fc8a7..b512469 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -159,6 +159,8 @@ struct cgroup {
*/
atomic_t count;

+ int id; /* ida allocated in-hierarchy ID */
+
/*
* We link our 'sibling' struct into our parent's 'children'.
* Our children link their 'sibling' into our 'children'.
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 99d5e69..0b25dcb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -138,6 +138,9 @@ struct cgroupfs_root {
/* Hierarchy-specific flags */
unsigned long flags;

+ /* IDs for cgroups in this hierarchy */
+ struct ida cgroup_ida;
+
/* The path to use for release notifications. */
char release_agent_path[PATH_MAX];

@@ -890,6 +893,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)

simple_xattrs_free(&cgrp->xattrs);

+ ida_simple_remove(&cgrp->root->cgroup_ida, cgrp->id);
kfree(cgrp);
} else {
struct cfent *cfe = __d_cfe(dentry);
@@ -1465,6 +1469,7 @@ static struct cgroupfs_root *cgroup_root_from_opts(struct cgroup_sb_opts *opts)

root->subsys_mask = opts->subsys_mask;
root->flags = opts->flags;
+ ida_init(&root->cgroup_ida);
if (opts->release_agent)
strcpy(root->release_agent_path, opts->release_agent);
if (opts->name)
@@ -1483,6 +1488,7 @@ static void cgroup_drop_root(struct cgroupfs_root *root)
spin_lock(&hierarchy_id_lock);
ida_remove(&hierarchy_ida, root->hierarchy_id);
spin_unlock(&hierarchy_id_lock);
+ ida_destroy(&root->cgroup_ida);
kfree(root);
}

@@ -4093,10 +4099,15 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
struct cgroup_subsys *ss;
struct super_block *sb = root->sb;

+ /* allocate the cgroup and its ID, 0 is reserved for the root */
cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;

+ cgrp->id = ida_simple_get(&root->cgroup_ida, 1, 0, GFP_KERNEL);
+ if (cgrp->id < 0)
+ goto err_free_cgrp;
+
/*
* Only live parents can have children. Note that the liveliness
* check isn't strictly necessary because cgroup_mkdir() and
@@ -4106,7 +4117,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
*/
if (!cgroup_lock_live_group(parent)) {
err = -ENODEV;
- goto err_free_cgrp;
+ goto err_free_id;
}

/* Grab a reference on the superblock so the hierarchy doesn't
@@ -4198,6 +4209,8 @@ err_free_all:
mutex_unlock(&cgroup_mutex);
/* Release the reference count that we took on the superblock */
deactivate_super(sb);
+err_free_id:
+ ida_simple_remove(&root->cgroup_ida, cgrp->id);
err_free_cgrp:
kfree(cgrp);
return err;
--
1.7.11.7

2012-11-16 19:20:44

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 5/8] netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx

With priomap expansion no longer depending on knowing max id
allocated, netprio_cgroup can use cgroup->id insted of cs->prioidx.
Drop prioidx alloc/free logic and convert all uses to cgroup->id.

* In cgrp_css_alloc(), parent->id test is moved above @cs allocation
to simplify error path.

* In cgrp_css_free(), @cs assignment is made initialization.

Signed-off-by: Tejun Heo <[email protected]>
---
include/net/netprio_cgroup.h | 11 +++-----
net/core/netprio_cgroup.c | 65 ++++++++------------------------------------
2 files changed, 15 insertions(+), 61 deletions(-)

diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index 2760f4f..1d04b6f 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -27,7 +27,6 @@ struct netprio_map {

struct cgroup_netprio_state {
struct cgroup_subsys_state css;
- u32 prioidx;
};

extern void sock_update_netprioidx(struct sock *sk, struct task_struct *task);
@@ -36,13 +35,12 @@ extern void sock_update_netprioidx(struct sock *sk, struct task_struct *task);

static inline u32 task_netprioidx(struct task_struct *p)
{
- struct cgroup_netprio_state *state;
+ struct cgroup_subsys_state *css;
u32 idx;

rcu_read_lock();
- state = container_of(task_subsys_state(p, net_prio_subsys_id),
- struct cgroup_netprio_state, css);
- idx = state->prioidx;
+ css = task_subsys_state(p, net_prio_subsys_id);
+ idx = css->cgroup->id;
rcu_read_unlock();
return idx;
}
@@ -57,8 +55,7 @@ static inline u32 task_netprioidx(struct task_struct *p)
rcu_read_lock();
css = task_subsys_state(p, net_prio_subsys_id);
if (css)
- idx = container_of(css,
- struct cgroup_netprio_state, css)->prioidx;
+ idx = css->cgroup->id;
rcu_read_unlock();
return idx;
}
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 569d83d..9409cdf 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -28,10 +28,6 @@
#include <linux/fdtable.h>

#define PRIOMAP_MIN_SZ 128
-#define PRIOIDX_SZ 128
-
-static unsigned long prioidx_map[PRIOIDX_SZ];
-static DEFINE_SPINLOCK(prioidx_map_lock);

static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
{
@@ -39,32 +35,6 @@ static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgr
struct cgroup_netprio_state, css);
}

-static int get_prioidx(u32 *prio)
-{
- unsigned long flags;
- u32 prioidx;
-
- spin_lock_irqsave(&prioidx_map_lock, flags);
- prioidx = find_first_zero_bit(prioidx_map, sizeof(unsigned long) * PRIOIDX_SZ);
- if (prioidx == sizeof(unsigned long) * PRIOIDX_SZ) {
- spin_unlock_irqrestore(&prioidx_map_lock, flags);
- return -ENOSPC;
- }
- set_bit(prioidx, prioidx_map);
- spin_unlock_irqrestore(&prioidx_map_lock, flags);
- *prio = prioidx;
- return 0;
-}
-
-static void put_prioidx(u32 idx)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&prioidx_map_lock, flags);
- clear_bit(idx, prioidx_map);
- spin_unlock_irqrestore(&prioidx_map_lock, flags);
-}
-
/*
* Extend @dev->priomap so that it's large enough to accomodate
* @target_idx. @dev->priomap.priomap_len > @target_idx after successful
@@ -120,62 +90,50 @@ static int extend_netdev_table(struct net_device *dev, u32 target_idx)
static struct cgroup_subsys_state *cgrp_css_alloc(struct cgroup *cgrp)
{
struct cgroup_netprio_state *cs;
- int ret = -EINVAL;
+
+ if (cgrp->parent && cgrp->parent->id)
+ return ERR_PTR(-EINVAL);

cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);

- if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
- goto out;
-
- ret = get_prioidx(&cs->prioidx);
- if (ret < 0) {
- pr_warn("No space in priority index array\n");
- goto out;
- }
-
return &cs->css;
-out:
- kfree(cs);
- return ERR_PTR(ret);
}

static void cgrp_css_free(struct cgroup *cgrp)
{
- struct cgroup_netprio_state *cs;
+ struct cgroup_netprio_state *cs = cgrp_netprio_state(cgrp);
struct net_device *dev;
struct netprio_map *map;

- cs = cgrp_netprio_state(cgrp);
rtnl_lock();
for_each_netdev(&init_net, dev) {
map = rtnl_dereference(dev->priomap);
- if (map && cs->prioidx < map->priomap_len)
- map->priomap[cs->prioidx] = 0;
+ if (map && cgrp->id < map->priomap_len)
+ map->priomap[cgrp->id] = 0;
}
rtnl_unlock();
- put_prioidx(cs->prioidx);
kfree(cs);
}

static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
{
- return (u64)cgrp_netprio_state(cgrp)->prioidx;
+ return cgrp->id;
}

static int read_priomap(struct cgroup *cont, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct net_device *dev;
- u32 prioidx = cgrp_netprio_state(cont)->prioidx;
+ u32 id = cont->id;
u32 priority;
struct netprio_map *map;

rcu_read_lock();
for_each_netdev_rcu(&init_net, dev) {
map = rcu_dereference(dev->priomap);
- priority = (map && prioidx < map->priomap_len) ? map->priomap[prioidx] : 0;
+ priority = (map && id < map->priomap_len) ? map->priomap[id] : 0;
cb->fill(cb, dev->name, priority);
}
rcu_read_unlock();
@@ -185,7 +143,6 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,
static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
const char *buffer)
{
- u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
char devname[IFNAMSIZ + 1];
struct net_device *dev;
struct netprio_map *map;
@@ -201,13 +158,13 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,

rtnl_lock();

- ret = extend_netdev_table(dev, prioidx);
+ ret = extend_netdev_table(dev, cgrp->id);
if (ret)
goto out_unlock;

map = rtnl_dereference(dev->priomap);
if (map)
- map->priomap[prioidx] = prio;
+ map->priomap[cgrp->id] = prio;
out_unlock:
rtnl_unlock();
dev_put(dev);
--
1.7.11.7

2012-11-16 19:20:47

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 7/8] netprio_cgroup: keep track of whether prio is set or not

netprio_cgroup keeps prio config per cgroup-netdev pair and doesn't
distinguish between unconfigured and explicit 0 priority. To
implement hierarchy support, it's necessary to know whether a given
pair has explicit configuration or not.

This patch adds netprio_map->aux[] which is indexed by cgroup->id and
currently only contains one bool - is_local. netprio[_set]_prio() is
updated to handle @is_local. is_local is set iff the pair has
explicit config. write_priomap() now clears is_local if a negative
value is written and sets on positive. With cgrp_css_free() also
updated to clear it, is_local is set for a cgroup-netdev pair iff the
pair is online and a positive prio has been configured via
"net_prio.ifpriomap".

is_local is visible to userland via cgroup file "net_prio.is_local".

This currently doesn't change any behavior. It will be used to
implement hierarchy support.

Signed-off-by: Tejun Heo <[email protected]>
---
include/net/netprio_cgroup.h | 10 +++++
net/core/netprio_cgroup.c | 94 ++++++++++++++++++++++++++++++++++----------
2 files changed, 83 insertions(+), 21 deletions(-)

diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index 1d04b6f..bd3daf8 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -19,8 +19,18 @@


#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+
+/*
+ * Auxliary per-cgroup-netdev configuration. Kept separate from priomap[]
+ * so that priomap[] has better spatial locality.
+ */
+struct netprio_aux {
+ bool is_local:1; /* cgroup has priority configured */
+};
+
struct netprio_map {
struct rcu_head rcu;
+ struct netprio_aux *aux; /* auxiliary config array */
u32 priomap_len;
u32 priomap[];
};
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b2af0d0..e7a5b03 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -35,6 +35,14 @@ static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgr
struct cgroup_netprio_state, css);
}

+static void free_netprio_map(struct netprio_map *map)
+{
+ if (map) {
+ kfree(map->aux);
+ kfree_rcu(map, rcu);
+ }
+}
+
/*
* Extend @dev->priomap so that it's large enough to accomodate
* @target_idx. @dev->priomap.priomap_len > @target_idx after successful
@@ -69,38 +77,55 @@ static int extend_netdev_table(struct net_device *dev, u32 target_idx)

/* allocate & copy */
new = kzalloc(new_sz, GFP_KERNEL);
- if (!new) {
- pr_warn("Unable to alloc new priomap!\n");
- return -ENOMEM;
- }
+ if (!new)
+ goto enomem;
+
+ new->aux = kzalloc(new_len * sizeof(new->aux[0]), GFP_KERNEL);
+ if (!new->aux)
+ goto enomem;

- if (old)
+ if (old) {
memcpy(new->priomap, old->priomap,
old->priomap_len * sizeof(old->priomap[0]));
+ memcpy(new->aux, old->aux,
+ old->priomap_len * sizeof(old->aux[0]));
+ }

new->priomap_len = new_len;

/* install the new priomap */
rcu_assign_pointer(dev->priomap, new);
- if (old)
- kfree_rcu(old, rcu);
+ free_netprio_map(old);
return 0;
+
+enomem:
+ free_netprio_map(new);
+ pr_warn("Unable to alloc new priomap!\n");
+ return -ENOMEM;
}

/**
* netprio_prio - return the effective netprio of a cgroup-net_device pair
* @cgrp: cgroup part of the target pair
* @dev: net_device part of the target pair
+ * @is_local_p: optional out param, %true if @cgrp-@dev has local config
*
* Should be called under RCU read or rtnl lock.
*/
-static u32 netprio_prio(struct cgroup *cgrp, struct net_device *dev)
+static u32 netprio_prio(struct cgroup *cgrp, struct net_device *dev,
+ bool *is_local_p)
{
struct netprio_map *map = rcu_dereference_rtnl(dev->priomap);
+ bool is_local = false;
+ u32 prio = 0;

- if (map && cgrp->id < map->priomap_len)
- return map->priomap[cgrp->id];
- return 0;
+ if (map && cgrp->id < map->priomap_len) {
+ is_local = map->aux[cgrp->id].is_local;
+ prio = map->priomap[cgrp->id];
+ }
+ if (is_local_p)
+ *is_local_p = is_local;
+ return prio;
}

/**
@@ -108,19 +133,20 @@ static u32 netprio_prio(struct cgroup *cgrp, struct net_device *dev)
* @cgrp: cgroup part of the target pair
* @dev: net_device part of the target pair
* @prio: prio to set
+ * @is_local: indicates whether @prio is @cgrp's local prio configuration
*
* Set netprio to @prio on @cgrp-@dev pair. Should be called under rtnl
- * lock and may fail under memory pressure for non-zero @prio.
+ * lock and may fail under memory pressure if (@prio || @is_local).
*/
static int netprio_set_prio(struct cgroup *cgrp, struct net_device *dev,
- u32 prio)
+ u32 prio, bool is_local)
{
struct netprio_map *map;
int ret;

- /* avoid extending priomap for zero writes */
+ /* avoid extending priomap for clearing zero writes */
map = rtnl_dereference(dev->priomap);
- if (!prio && (!map || map->priomap_len <= cgrp->id))
+ if (!is_local && !prio && (!map || map->priomap_len <= cgrp->id))
return 0;

ret = extend_netdev_table(dev, cgrp->id);
@@ -128,6 +154,7 @@ static int netprio_set_prio(struct cgroup *cgrp, struct net_device *dev,
return ret;

map = rtnl_dereference(dev->priomap);
+ map->aux[cgrp->id].is_local = is_local;
map->priomap[cgrp->id] = prio;
return 0;
}
@@ -153,7 +180,7 @@ static void cgrp_css_free(struct cgroup *cgrp)

rtnl_lock();
for_each_netdev(&init_net, dev)
- WARN_ON_ONCE(netprio_set_prio(cgrp, dev, 0));
+ WARN_ON_ONCE(netprio_set_prio(cgrp, dev, 0, false));
rtnl_unlock();
kfree(cs);
}
@@ -170,7 +197,7 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,

rcu_read_lock();
for_each_netdev_rcu(&init_net, dev)
- cb->fill(cb, dev->name, netprio_prio(cont, dev));
+ cb->fill(cb, dev->name, netprio_prio(cont, dev, NULL));
rcu_read_unlock();
return 0;
}
@@ -180,25 +207,46 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
{
char devname[IFNAMSIZ + 1];
struct net_device *dev;
+ s64 v;
u32 prio;
+ bool is_local;
int ret;

- if (sscanf(buffer, "%"__stringify(IFNAMSIZ)"s %u", devname, &prio) != 2)
+ if (sscanf(buffer, "%"__stringify(IFNAMSIZ)"s %lld", devname, &v) != 2)
return -EINVAL;

+ prio = clamp_val(v, 0, UINT_MAX);
+ is_local = v >= 0;
+
dev = dev_get_by_name(&init_net, devname);
if (!dev)
return -ENODEV;

rtnl_lock();

- ret = netprio_set_prio(cgrp, dev, prio);
+ ret = netprio_set_prio(cgrp, dev, prio, is_local);

rtnl_unlock();
dev_put(dev);
return ret;
}

+static int netprio_read_is_local(struct cgroup *cont, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct net_device *dev;
+
+ rtnl_lock();
+ for_each_netdev(&init_net, dev) {
+ bool is_local;
+
+ netprio_prio(cont, dev, &is_local);
+ seq_printf(m, "%s %d\n", dev->name, is_local);
+ }
+ rtnl_unlock();
+ return 0;
+}
+
static int update_netprio(const void *v, struct file *file, unsigned n)
{
int err;
@@ -231,6 +279,10 @@ static struct cftype ss_files[] = {
.read_map = read_priomap,
.write_string = write_priomap,
},
+ {
+ .name = "is_local",
+ .read_seq_string = netprio_read_is_local,
+ },
{ } /* terminate */
};

@@ -270,7 +322,7 @@ static int netprio_device_event(struct notifier_block *unused,
old = rtnl_dereference(dev->priomap);
RCU_INIT_POINTER(dev->priomap, NULL);
if (old)
- kfree_rcu(old, rcu);
+ free_netprio_map(old);
break;
}
return NOTIFY_DONE;
@@ -308,7 +360,7 @@ static void __exit exit_cgroup_netprio(void)
old = rtnl_dereference(dev->priomap);
RCU_INIT_POINTER(dev->priomap, NULL);
if (old)
- kfree_rcu(old, rcu);
+ free_netprio_map(old);
}
rtnl_unlock();
}
--
1.7.11.7

2012-11-16 19:20:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 8/8] netprio_cgroup: implement hierarchy support

Implement hierarchy support. Each netprio_cgroup inherits its
parent's prio config for any net_device which it doesn't have local
config on.

As each netprio_cgroup is fully ready after ->css_alloc() and config
inheritance doesn't affect the parent, netprio_cgroup doesn't need to
strictly distinguish on and offline cgroups and can get by simply
inheriting the parent's configuration from ->css_online() and
propagating config updates downwards in write_priomap().

* As ->css_online() inherits prios on all netdevs from the parent,
clearing priomap on ->css_free() is no longer necessary. Removed.

* Error out on nesting in ->css_alloc() removed along with
ss->broken_hierarchy marking.

Note that this patch changes userland-visible behavior. Nesting is
now allowed and priority configuration is inherited through hierarchy.
This especially changes how the first level cgroups below the root
cgroup behave - any unconfigured pairs now inherit priorities from the
root cgroup instead of assuming 0.

Signed-off-by: Tejun Heo <[email protected]>
---
Documentation/cgroups/net_prio.txt | 21 +++++-
net/core/netprio_cgroup.c | 130 ++++++++++++++++++++++++++++++-------
2 files changed, 125 insertions(+), 26 deletions(-)

diff --git a/Documentation/cgroups/net_prio.txt b/Documentation/cgroups/net_prio.txt
index 01b3226..4dcca61 100644
--- a/Documentation/cgroups/net_prio.txt
+++ b/Documentation/cgroups/net_prio.txt
@@ -22,13 +22,15 @@ With the above step, the initial group acting as the parent accounting group
becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.

-Each net_prio cgroup contains two files that are subsystem specific
+Each net_prio cgroup contains three files that are subsystem specific
+
+* net_prio.prioidx

-net_prio.prioidx
This file is read-only, and is simply informative. It contains a unique integer
value that the kernel uses as an internal representation of this cgroup.

-net_prio.ifpriomap
+* net_prio.ifpriomap
+
This file contains a map of the priorities assigned to traffic originating from
processes in this group and egressing the system on various interfaces. It
contains a list of tuples in the form <ifname priority>. Contents of this file
@@ -51,3 +53,16 @@ One usage for the net_prio cgroup is with mqprio qdisc allowing application
traffic to be steered to hardware/driver based traffic classes. These mappings
can then be managed by administrators or other networking protocols such as
DCBX.
+
+If priority is not set for an interface, the parent's priority is inherited.
+For the root cgroup, there's no parent and all unset priorities are zero.
+Priority can be unset by echoing negative value to ifpriomap. For example,
+the following would undo the configuration done above and make iscsi cgroup
+to inherit prio for eth0 from the root cgroup.
+
+echo "eth0 -1" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap
+
+* net_prio.is_local
+
+This file is read-only and shows whether the net_prio cgroup has its own
+priority configured or inherited priority from its parent for each interface.
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index e7a5b03..bf9aac7 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -163,9 +163,6 @@ static struct cgroup_subsys_state *cgrp_css_alloc(struct cgroup *cgrp)
{
struct cgroup_netprio_state *cs;

- if (cgrp->parent && cgrp->parent->id)
- return ERR_PTR(-EINVAL);
-
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
@@ -173,16 +170,37 @@ static struct cgroup_subsys_state *cgrp_css_alloc(struct cgroup *cgrp)
return &cs->css;
}

-static void cgrp_css_free(struct cgroup *cgrp)
+static int cgrp_css_online(struct cgroup *cgrp)
{
- struct cgroup_netprio_state *cs = cgrp_netprio_state(cgrp);
+ struct cgroup *parent = cgrp->parent;
struct net_device *dev;
+ int ret = 0;
+
+ if (!parent)
+ return 0;

rtnl_lock();
- for_each_netdev(&init_net, dev)
- WARN_ON_ONCE(netprio_set_prio(cgrp, dev, 0, false));
+ /*
+ * Inherit prios from the parent. In netprio, a child node has no
+ * affect on the parent making prio propagation happening before
+ * this perfectly fine. No need to mark on/offline. Also, as all
+ * prios are set during onlining, there is no need to clear them on
+ * offline.
+ */
+ for_each_netdev(&init_net, dev) {
+ u32 prio = netprio_prio(parent, dev, NULL);
+
+ ret = netprio_set_prio(cgrp, dev, prio, false);
+ if (ret)
+ break;
+ }
rtnl_unlock();
- kfree(cs);
+ return ret;
+}
+
+static void cgrp_css_free(struct cgroup *cgrp)
+{
+ kfree(cgrp_netprio_state(cgrp));
}

static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
@@ -202,29 +220,104 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,
return 0;
}

+/**
+ * netprio_propagate_prio - propagate prio configuration downwards
+ * @root: cgroup to propagate prio config down from
+ * @dev: net_device whose prio will be propagated
+ *
+ * Propagate @dev's prio configuration to descendants of @root. Each
+ * descendant of @root re-inherits from its parent in pre-order tree walk.
+ * This should be called after the prio of @root-@dev pair is changed to
+ * keep the descendants up-to-date.
+ *
+ * This may race with a new cgroup coming online and propagation may happen
+ * before finishing ->css_online() or while being taken offline. As a
+ * netprio css is ready after ->css_alloc() and propagation doesn't affect
+ * the parent, this is safe.
+ *
+ * Should be called with rtnl lock held.
+ */
+static int netprio_propagate_prio(struct cgroup *root, struct net_device *dev)
+{
+ struct cgroup *pos;
+ int ret = 0;
+
+ ASSERT_RTNL();
+ rcu_read_lock();
+
+ cgroup_for_each_descendant_pre(pos, root) {
+ bool is_local;
+ u32 prio;
+ int tmp;
+
+ /*
+ * Don't propagate if @pos has local configuration. We can
+ * skip @pos's subtree but don't have to. Just propagate
+ * through for simplicity.
+ */
+ netprio_prio(pos, dev, &is_local);
+ if (is_local)
+ continue;
+
+ /*
+ * Set priority. On failure, record the error value but
+ * continue propagating. This is depended upon by
+ * write_priomap() when reverting failed propagation.
+ */
+ prio = netprio_prio(pos->parent, dev, NULL);
+ tmp = netprio_set_prio(pos, dev, prio, false);
+ ret = ret ?: tmp;
+ }
+
+ rcu_read_unlock();
+ return ret;
+}
+
static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
const char *buffer)
{
char devname[IFNAMSIZ + 1];
struct net_device *dev;
s64 v;
- u32 prio;
- bool is_local;
+ u32 old_prio, prio;
+ bool old_is_local, is_local;
int ret;

if (sscanf(buffer, "%"__stringify(IFNAMSIZ)"s %lld", devname, &v) != 2)
return -EINVAL;

- prio = clamp_val(v, 0, UINT_MAX);
- is_local = v >= 0;
-
dev = dev_get_by_name(&init_net, devname);
if (!dev)
return -ENODEV;

rtnl_lock();

+ /*
+ * Positive @v is local config which takes precedence. Negative @v
+ * deletes local config and inherits prio from the parent.
+ */
+ is_local = v >= 0;
+ if (is_local || !cgrp->parent)
+ prio = clamp_val(v, 0, UINT_MAX);
+ else
+ prio = netprio_prio(cgrp->parent, dev, NULL);
+
+ /*
+ * Record the current config and try to update prio and propagate,
+ * which may fail under memory pressure. On failure, we revert.
+ * Note that reverting itself may fail but it's guaranteed that at
+ * least all the existing priomaps are reverted, which is enough.
+ * Some packets may go out while reverting. We don't care.
+ */
+ old_prio = netprio_prio(cgrp, dev, &old_is_local);
ret = netprio_set_prio(cgrp, dev, prio, is_local);
+ if (!ret)
+ ret = netprio_propagate_prio(cgrp, dev);
+
+ if (ret) {
+ netprio_set_prio(cgrp, dev, old_prio, old_is_local);
+ netprio_propagate_prio(cgrp, dev);
+ }

rtnl_unlock();
dev_put(dev);
@@ -289,21 +382,12 @@ static struct cftype ss_files[] = {
struct cgroup_subsys net_prio_subsys = {
.name = "net_prio",
.css_alloc = cgrp_css_alloc,
+ .css_online = cgrp_css_online,
.css_free = cgrp_css_free,
.attach = net_prio_attach,
.subsys_id = net_prio_subsys_id,
.base_cftypes = ss_files,
.module = THIS_MODULE,
-
- /*
- * net_prio has artificial limit on the number of cgroups and
- * disallows nesting making it impossible to co-mount it with other
- * hierarchical subsystems. Remove the artificially low PRIOIDX_SZ
- * limit and properly nest configuration such that children follow
- * their parents' configurations by default and are allowed to
- * override and remove the following.
- */
- .broken_hierarchy = true,
};

static int netprio_device_event(struct notifier_block *unused,
--
1.7.11.7

2012-11-16 19:21:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 6/8] netprio_cgroup: implement netprio[_set]_prio() helpers

Introduce two helpers - netprio_prio() and netprio_set_prio() - which
hide the details of priomap access and expansion. This will help
implementing hierarchy support.

Signed-off-by: Tejun Heo <[email protected]>
---
net/core/netprio_cgroup.c | 72 ++++++++++++++++++++++++++++++++---------------
1 file changed, 50 insertions(+), 22 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 9409cdf..b2af0d0 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -87,6 +87,51 @@ static int extend_netdev_table(struct net_device *dev, u32 target_idx)
return 0;
}

+/**
+ * netprio_prio - return the effective netprio of a cgroup-net_device pair
+ * @cgrp: cgroup part of the target pair
+ * @dev: net_device part of the target pair
+ *
+ * Should be called under RCU read or rtnl lock.
+ */
+static u32 netprio_prio(struct cgroup *cgrp, struct net_device *dev)
+{
+ struct netprio_map *map = rcu_dereference_rtnl(dev->priomap);
+
+ if (map && cgrp->id < map->priomap_len)
+ return map->priomap[cgrp->id];
+ return 0;
+}
+
+/**
+ * netprio_set_prio - set netprio on a cgroup-net_device pair
+ * @cgrp: cgroup part of the target pair
+ * @dev: net_device part of the target pair
+ * @prio: prio to set
+ *
+ * Set netprio to @prio on @cgrp-@dev pair. Should be called under rtnl
+ * lock and may fail under memory pressure for non-zero @prio.
+ */
+static int netprio_set_prio(struct cgroup *cgrp, struct net_device *dev,
+ u32 prio)
+{
+ struct netprio_map *map;
+ int ret;
+
+ /* avoid extending priomap for zero writes */
+ map = rtnl_dereference(dev->priomap);
+ if (!prio && (!map || map->priomap_len <= cgrp->id))
+ return 0;
+
+ ret = extend_netdev_table(dev, cgrp->id);
+ if (ret)
+ return ret;
+
+ map = rtnl_dereference(dev->priomap);
+ map->priomap[cgrp->id] = prio;
+ return 0;
+}
+
static struct cgroup_subsys_state *cgrp_css_alloc(struct cgroup *cgrp)
{
struct cgroup_netprio_state *cs;
@@ -105,14 +150,10 @@ static void cgrp_css_free(struct cgroup *cgrp)
{
struct cgroup_netprio_state *cs = cgrp_netprio_state(cgrp);
struct net_device *dev;
- struct netprio_map *map;

rtnl_lock();
- for_each_netdev(&init_net, dev) {
- map = rtnl_dereference(dev->priomap);
- if (map && cgrp->id < map->priomap_len)
- map->priomap[cgrp->id] = 0;
- }
+ for_each_netdev(&init_net, dev)
+ WARN_ON_ONCE(netprio_set_prio(cgrp, dev, 0));
rtnl_unlock();
kfree(cs);
}
@@ -126,16 +167,10 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct net_device *dev;
- u32 id = cont->id;
- u32 priority;
- struct netprio_map *map;

rcu_read_lock();
- for_each_netdev_rcu(&init_net, dev) {
- map = rcu_dereference(dev->priomap);
- priority = (map && id < map->priomap_len) ? map->priomap[id] : 0;
- cb->fill(cb, dev->name, priority);
- }
+ for_each_netdev_rcu(&init_net, dev)
+ cb->fill(cb, dev->name, netprio_prio(cont, dev));
rcu_read_unlock();
return 0;
}
@@ -145,7 +180,6 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
{
char devname[IFNAMSIZ + 1];
struct net_device *dev;
- struct netprio_map *map;
u32 prio;
int ret;

@@ -158,14 +192,8 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,

rtnl_lock();

- ret = extend_netdev_table(dev, cgrp->id);
- if (ret)
- goto out_unlock;
+ ret = netprio_set_prio(cgrp, dev, prio);

- map = rtnl_dereference(dev->priomap);
- if (map)
- map->priomap[cgrp->id] = prio;
-out_unlock:
rtnl_unlock();
dev_put(dev);
return ret;
--
1.7.11.7

2012-11-16 19:20:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/8] netprio_cgroup: shorten variable names in extend_netdev_table()

The function is about to go through a rewrite. In preparation,
shorten the variable names so that we don't repeat "priomap" so often.

This patch is cosmetic.

Signed-off-by: Tejun Heo <[email protected]>
---
net/core/netprio_cgroup.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 66d98da..92cc54c 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -71,26 +71,25 @@ static int extend_netdev_table(struct net_device *dev, u32 new_len)
{
size_t new_size = sizeof(struct netprio_map) +
((sizeof(u32) * new_len));
- struct netprio_map *new_priomap = kzalloc(new_size, GFP_KERNEL);
- struct netprio_map *old_priomap;
+ struct netprio_map *new = kzalloc(new_size, GFP_KERNEL);
+ struct netprio_map *old;

- old_priomap = rtnl_dereference(dev->priomap);
+ old = rtnl_dereference(dev->priomap);

- if (!new_priomap) {
+ if (!new) {
pr_warn("Unable to alloc new priomap!\n");
return -ENOMEM;
}

- if (old_priomap)
- memcpy(new_priomap->priomap, old_priomap->priomap,
- old_priomap->priomap_len *
- sizeof(old_priomap->priomap[0]));
+ if (old)
+ memcpy(new->priomap, old->priomap,
+ old->priomap_len * sizeof(old->priomap[0]));

- new_priomap->priomap_len = new_len;
+ new->priomap_len = new_len;

- rcu_assign_pointer(dev->priomap, new_priomap);
- if (old_priomap)
- kfree_rcu(old_priomap, rcu);
+ rcu_assign_pointer(dev->priomap, new);
+ if (old)
+ kfree_rcu(old, rcu);
return 0;
}

--
1.7.11.7

2012-11-16 19:21:48

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/8] netprio_cgroup: reimplement priomap expansion

netprio kept track of the highest prioidx allocated and resized
priomaps accordingly when necessary. This makes it necessary to keep
track of prioidx allocation and may end up resizing on every new
prioidx.

Update extend_netdev_table() such that it takes @target_idx which the
priomap should be able to accomodate. If the priomap is large enough,
nothing happens; otherwise, the size is doubled until @target_idx can
be accomodated.

This makes max_prioidx and write_update_netdev_table() unnecessary.
write_priomap() now calls extend_netdev_table() directly.

Signed-off-by: Tejun Heo <[email protected]>
---
net/core/netprio_cgroup.c | 56 ++++++++++++++++++++++++++++-------------------
1 file changed, 33 insertions(+), 23 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 92cc54c..569d83d 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,11 +27,11 @@

#include <linux/fdtable.h>

+#define PRIOMAP_MIN_SZ 128
#define PRIOIDX_SZ 128

static unsigned long prioidx_map[PRIOIDX_SZ];
static DEFINE_SPINLOCK(prioidx_map_lock);
-static atomic_t max_prioidx = ATOMIC_INIT(0);

static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
{
@@ -51,8 +51,6 @@ static int get_prioidx(u32 *prio)
return -ENOSPC;
}
set_bit(prioidx, prioidx_map);
- if (atomic_read(&max_prioidx) < prioidx)
- atomic_set(&max_prioidx, prioidx);
spin_unlock_irqrestore(&prioidx_map_lock, flags);
*prio = prioidx;
return 0;
@@ -67,15 +65,40 @@ static void put_prioidx(u32 idx)
spin_unlock_irqrestore(&prioidx_map_lock, flags);
}

-static int extend_netdev_table(struct net_device *dev, u32 new_len)
+/*
+ * Extend @dev->priomap so that it's large enough to accomodate
+ * @target_idx. @dev->priomap.priomap_len > @target_idx after successful
+ * return. Must be called under rtnl lock.
+ */
+static int extend_netdev_table(struct net_device *dev, u32 target_idx)
{
- size_t new_size = sizeof(struct netprio_map) +
- ((sizeof(u32) * new_len));
- struct netprio_map *new = kzalloc(new_size, GFP_KERNEL);
- struct netprio_map *old;
+ struct netprio_map *old, *new;
+ size_t new_sz, new_len;

+ /* is the existing priomap large enough? */
old = rtnl_dereference(dev->priomap);
+ if (old && old->priomap_len > target_idx)
+ return 0;
+
+ /*
+ * Determine the new size. Let's keep it power-of-two. We start
+ * from PRIOMAP_MIN_SZ and double it until it's large enough to
+ * accommodate @target_idx.
+ */
+ new_sz = PRIOMAP_MIN_SZ;
+ while (true) {
+ new_len = (new_sz - offsetof(struct netprio_map, priomap)) /
+ sizeof(new->priomap[0]);
+ if (new_len > target_idx)
+ break;
+ new_sz *= 2;
+ /* overflowed? */
+ if (WARN_ON(new_sz < PRIOMAP_MIN_SZ))
+ return -ENOSPC;
+ }

+ /* allocate & copy */
+ new = kzalloc(new_sz, GFP_KERNEL);
if (!new) {
pr_warn("Unable to alloc new priomap!\n");
return -ENOMEM;
@@ -87,26 +110,13 @@ static int extend_netdev_table(struct net_device *dev, u32 new_len)

new->priomap_len = new_len;

+ /* install the new priomap */
rcu_assign_pointer(dev->priomap, new);
if (old)
kfree_rcu(old, rcu);
return 0;
}

-static int write_update_netdev_table(struct net_device *dev)
-{
- int ret = 0;
- u32 max_len;
- struct netprio_map *map;
-
- max_len = atomic_read(&max_prioidx) + 1;
- map = rtnl_dereference(dev->priomap);
- if (!map || map->priomap_len < max_len)
- ret = extend_netdev_table(dev, max_len);
-
- return ret;
-}
-
static struct cgroup_subsys_state *cgrp_css_alloc(struct cgroup *cgrp)
{
struct cgroup_netprio_state *cs;
@@ -191,7 +201,7 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,

rtnl_lock();

- ret = write_update_netdev_table(dev);
+ ret = extend_netdev_table(dev, prioidx);
if (ret)
goto out_unlock;

--
1.7.11.7

2012-11-16 19:22:29

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/8] netprio: simplify write_priomap()

sscanf() doesn't bite.

Signed-off-by: Tejun Heo <[email protected]>
---
net/core/netprio_cgroup.c | 56 ++++++++++-------------------------------------
1 file changed, 11 insertions(+), 45 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index f0b6b0d..66d98da 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -176,66 +176,32 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,
static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
const char *buffer)
{
- char *devname = kstrdup(buffer, GFP_KERNEL);
- int ret = -EINVAL;
u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
- unsigned long priority;
- char *priostr;
+ char devname[IFNAMSIZ + 1];
struct net_device *dev;
struct netprio_map *map;
+ u32 prio;
+ int ret;

- if (!devname)
- return -ENOMEM;
-
- /*
- * Minimally sized valid priomap string
- */
- if (strlen(devname) < 3)
- goto out_free_devname;
-
- priostr = strstr(devname, " ");
- if (!priostr)
- goto out_free_devname;
-
- /*
- *Separate the devname from the associated priority
- *and advance the priostr pointer to the priority value
- */
- *priostr = '\0';
- priostr++;
-
- /*
- * If the priostr points to NULL, we're at the end of the passed
- * in string, and its not a valid write
- */
- if (*priostr == '\0')
- goto out_free_devname;
-
- ret = kstrtoul(priostr, 10, &priority);
- if (ret < 0)
- goto out_free_devname;
-
- ret = -ENODEV;
+ if (sscanf(buffer, "%"__stringify(IFNAMSIZ)"s %u", devname, &prio) != 2)
+ return -EINVAL;

dev = dev_get_by_name(&init_net, devname);
if (!dev)
- goto out_free_devname;
+ return -ENODEV;

rtnl_lock();
+
ret = write_update_netdev_table(dev);
- if (ret < 0)
- goto out_put_dev;
+ if (ret)
+ goto out_unlock;

map = rtnl_dereference(dev->priomap);
if (map)
- map->priomap[prioidx] = priority;
-
-out_put_dev:
+ map->priomap[prioidx] = prio;
+out_unlock:
rtnl_unlock();
dev_put(dev);
-
-out_free_devname:
- kfree(devname);
return ret;
}

--
1.7.11.7

2012-11-19 09:05:41

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

On 2012/11/17 3:20, Tejun Heo wrote:
> With the introduction of generic cgroup hierarchy iterators, css_id is
> being phased out. It was unnecessarily complex, id'ing the wrong
> thing (cgroups need IDs, not CSSes) and has other oddities like not
> being available at ->css_alloc().
>
> This patch adds cgroup->id, which is a simple per-hierarchy
> ida-allocated ID which is assigned before ->css_alloc() and released
> after ->css_free().
>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: Li Zefan <[email protected]>

2012-11-19 13:26:23

by Neil Horman

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] netprio_cgroup: implement hierarchy support

On Fri, Nov 16, 2012 at 11:20:16AM -0800, Tejun Heo wrote:
> Hello, guys.
>
> This patchset implements hierarchy support for netprio_cgroup.
> netprio_cgroup along with netcls_cgroup is a rather weird in that it
> really isn't about resource control. It just hitches on cgroup as a
> convenient mechanism to do stuff to groups of tasks. The hierarchy
> support reflects such nature. There's no limit being imposed from
> ancestors. It simply propagates configuration downwards until there's
> a node with its own config. IOW, any given cgroup inherits priorities
> from its parent for all netdevs which it doesn't have its own config
> for.
>
> As a parent isn't affected by child inheriting its config, the
> hierarchy implementation is pretty simple. It's enough to inherit
> config from ->css_online() and propagate new config downwards from
> write_priomap(). As each node needs to know which config is its local
> one and which is inherited, an extra config array is added -
> netprio_map->aux[]. It's a separate array to avoid disturbing spatial
> locality of ->priomap[]. While it currently contains single one-bit
> flag, I still made it a struct so that adding more configuration
> (e.g. max_prio) is easy.
>
> Note that this does change userland-visible behavior. Now, nesting is
> allowed and cgroups at the first level inherit priorities from the
> root cgroup. I can't think of any better way than just biting the
> bullet here. :(
>
> 0001-cgroup-add-cgroup-id.patch
> 0002-netprio-simplify-write_priomap.patch
> 0003-netprio_cgroup-shorten-variable-names-in-extend_netd.patch
> 0004-netprio_cgroup-reimplement-priomap-expansion.patch
> 0005-netprio_cgroup-use-cgroup-id-instead-of-cgroup_netpr.patch
> 0006-netprio_cgroup-implement-netprio-_set-_prio-helpers.patch
> 0007-netprio_cgroup-keep-track-of-whether-prio-is-set-or-.patch
> 0008-netprio_cgroup-implement-hierarchy-support.patch
>
> 0001 adds cgroup->id. This will eventually replace css_id.
>
> 0002-0006 are prep patches.
>
> 0007 implements is_local flag which tracks whether a cgroup has its
> own config or should inherit from its parent.
>
> 0008 implements hierarchy support.
>
> This patchset is on top of
>
> cgroup/for-3.8 ef9fe980c6 ("cgroup_freezer: implement proper hierarchy support")
> + [1] "[PATCHSET cgroup/for-3.8] cgroup: allow ->post_create() to fail"
> + [2] "[PATCH 1/2] cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/"
> "[PATCH 2/2] cgroup, cpuset: remove cgroup_subsys->post_clone()"
>
> and available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-netprio_cgroup-hierarchy
>
> diffstat follows.
>
> Documentation/cgroups/net_prio.txt | 21 +-
> include/linux/cgroup.h | 2
> include/net/netprio_cgroup.h | 21 +-
> kernel/cgroup.c | 15 +
> net/core/netprio_cgroup.c | 376 +++++++++++++++++++++++--------------
> 5 files changed, 284 insertions(+), 151 deletions(-)
>
> Thanks.
>
> --
> tejun
>
> [1] http://thread.gmane.org/gmane.linux.kernel.cgroups/5047
> [2] http://thread.gmane.org/gmane.linux.kernel/1393151
>

Thanks Tejun

For the series:
Acked-by: Neil Horman <[email protected]>

2012-11-19 17:05:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

On Mon, Nov 19, 2012 at 05:03:47PM +0800, Li Zefan wrote:
> On 2012/11/17 3:20, Tejun Heo wrote:
> > With the introduction of generic cgroup hierarchy iterators, css_id is
> > being phased out. It was unnecessarily complex, id'ing the wrong
> > thing (cgroups need IDs, not CSSes) and has other oddities like not
> > being available at ->css_alloc().
> >
> > This patch adds cgroup->id, which is a simple per-hierarchy
> > ida-allocated ID which is assigned before ->css_alloc() and released
> > after ->css_free().
> >
> > Signed-off-by: Tejun Heo <[email protected]>
>
> Acked-by: Li Zefan <[email protected]>

Applied to cgroup/for-3.8.

Thanks.

--
tejun

2012-11-19 19:25:52

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] netprio_cgroup: implement hierarchy support

Hi Tejun,

I have tested this series and it works nicely. I didn't mentioned yet to
review it completely. I will do it tomorrow.

cheers,
daniel

2012-11-19 19:54:21

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCHSET cgroup/for-3.8] netprio_cgroup: implement hierarchy support

On 19.11.2012 20:25, Daniel Wagner wrote:
> Hi Tejun,
>
> I have tested this series and it works nicely. I didn't mentioned yet to
> review it completely. I will do it tomorrow.

s/mentioned/managed/

2012-11-20 04:35:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

(2012/11/17 4:20), Tejun Heo wrote:
> With the introduction of generic cgroup hierarchy iterators, css_id is
> being phased out. It was unnecessarily complex, id'ing the wrong
> thing (cgroups need IDs, not CSSes) and has other oddities like not
> being available at ->css_alloc().
>
> This patch adds cgroup->id, which is a simple per-hierarchy
> ida-allocated ID which is assigned before ->css_alloc() and released
> after ->css_free().
>
> Signed-off-by: Tejun Heo <[email protected]>


I'm sorry if I misunderstand ... current usage of css-id in memory/swap cgroup
is for recording information of memory cgroup which may be destroyed. In some case,
a memcg's cgroup is freed but a struct memcgroup and its css are available, swap_cgroup
may contain id ot if.
This patch puts cgroup's id at diput(), so, the id used in swap_cgroup can be
reused while it's in use. Right ?

Thanks,
-Kame

> ---
> include/linux/cgroup.h | 2 ++
> kernel/cgroup.c | 15 ++++++++++++++-
> 2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index d5fc8a7..b512469 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -159,6 +159,8 @@ struct cgroup {
> */
> atomic_t count;
>
> + int id; /* ida allocated in-hierarchy ID */
> +
> /*
> * We link our 'sibling' struct into our parent's 'children'.
> * Our children link their 'sibling' into our 'children'.
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 99d5e69..0b25dcb 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -138,6 +138,9 @@ struct cgroupfs_root {
> /* Hierarchy-specific flags */
> unsigned long flags;
>
> + /* IDs for cgroups in this hierarchy */
> + struct ida cgroup_ida;
> +
> /* The path to use for release notifications. */
> char release_agent_path[PATH_MAX];
>
> @@ -890,6 +893,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
>
> simple_xattrs_free(&cgrp->xattrs);
>
> + ida_simple_remove(&cgrp->root->cgroup_ida, cgrp->id);
> kfree(cgrp);
> } else {
> struct cfent *cfe = __d_cfe(dentry);
> @@ -1465,6 +1469,7 @@ static struct cgroupfs_root *cgroup_root_from_opts(struct cgroup_sb_opts *opts)
>
> root->subsys_mask = opts->subsys_mask;
> root->flags = opts->flags;
> + ida_init(&root->cgroup_ida);
> if (opts->release_agent)
> strcpy(root->release_agent_path, opts->release_agent);
> if (opts->name)
> @@ -1483,6 +1488,7 @@ static void cgroup_drop_root(struct cgroupfs_root *root)
> spin_lock(&hierarchy_id_lock);
> ida_remove(&hierarchy_ida, root->hierarchy_id);
> spin_unlock(&hierarchy_id_lock);
> + ida_destroy(&root->cgroup_ida);
> kfree(root);
> }
>
> @@ -4093,10 +4099,15 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> struct cgroup_subsys *ss;
> struct super_block *sb = root->sb;
>
> + /* allocate the cgroup and its ID, 0 is reserved for the root */
> cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
> if (!cgrp)
> return -ENOMEM;
>
> + cgrp->id = ida_simple_get(&root->cgroup_ida, 1, 0, GFP_KERNEL);
> + if (cgrp->id < 0)
> + goto err_free_cgrp;
> +
> /*
> * Only live parents can have children. Note that the liveliness
> * check isn't strictly necessary because cgroup_mkdir() and
> @@ -4106,7 +4117,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> */
> if (!cgroup_lock_live_group(parent)) {
> err = -ENODEV;
> - goto err_free_cgrp;
> + goto err_free_id;
> }
>
> /* Grab a reference on the superblock so the hierarchy doesn't
> @@ -4198,6 +4209,8 @@ err_free_all:
> mutex_unlock(&cgroup_mutex);
> /* Release the reference count that we took on the superblock */
> deactivate_super(sb);
> +err_free_id:
> + ida_simple_remove(&root->cgroup_ida, cgrp->id);
> err_free_cgrp:
> kfree(cgrp);
> return err;
>

2012-11-20 05:31:19

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

Hello, Kamezawa.

On Tue, Nov 20, 2012 at 01:34:54PM +0900, Kamezawa Hiroyuki wrote:
> I'm sorry if I misunderstand ... current usage of css-id in memory/swap cgroup
> is for recording information of memory cgroup which may be destroyed. In some case,
> a memcg's cgroup is freed but a struct memcgroup and its css are available, swap_cgroup
> may contain id ot if.
> This patch puts cgroup's id at diput(), so, the id used in swap_cgroup can be
> reused while it's in use. Right ?

CSSes hold onto cgroups, so if memcg is around, its cgroup doesn't go
away, so the right thing to do would be holding onto CSS whlie there
are remaining references, which IMHO is the way it should have been
implemented from the beginning. The only reason memcg currently has
its own refcnt nested inside css refcnt is because cgroup used to
require css refs to be completely drained for cgroup_rmdir() to
proceed. Now that that weirdity is gone, we should go back to sane
css based reference counting, right?

Thanks.

--
tejun

2012-11-20 07:06:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

(2012/11/20 14:31), Tejun Heo wrote:
> Hello, Kamezawa.
>
> On Tue, Nov 20, 2012 at 01:34:54PM +0900, Kamezawa Hiroyuki wrote:
>> I'm sorry if I misunderstand ... current usage of css-id in memory/swap cgroup
>> is for recording information of memory cgroup which may be destroyed. In some case,
>> a memcg's cgroup is freed but a struct memcgroup and its css are available, swap_cgroup
>> may contain id ot if.
>> This patch puts cgroup's id at diput(), so, the id used in swap_cgroup can be
>> reused while it's in use. Right ?
>
> CSSes hold onto cgroups, so if memcg is around, its cgroup doesn't go
> away, so the right thing to do would be holding onto CSS whlie there
> are remaining references, which IMHO is the way it should have been
> implemented from the beginning. The only reason memcg currently has
> its own refcnt nested inside css refcnt is because cgroup used to
> require css refs to be completely drained for cgroup_rmdir() to
> proceed. Now that that weirdity is gone, we should go back to sane
> css based reference counting, right?
>

Ah, hm, Maybe I missed new __css_put() implementation...

> void __css_put(struct cgroup_subsys_state *css)
> {
> struct cgroup *cgrp = css->cgroup;
> int v;
>
> rcu_read_lock();
> v = css_unbias_refcnt(atomic_dec_return(&css->refcnt));
>
> switch (v) {
> case 1:
> if (notify_on_release(cgrp)) {
> set_bit(CGRP_RELEASABLE, &cgrp->flags);
> check_for_release(cgrp);
> }
> break;
> case 0:
> schedule_work(&css->dput_work);
> break;
> }
> rcu_read_unlock();
> }

If swap_cgroup holds css's refcnt instead of memcg's....
final dput will be invoked when the last swap_cgroup release a reference.

It seems to work and we can drop memcg's refcnt (maybe).

BTW, css's ID was limited to 65535 to be encoded in 2bytes.
If we use INT, this will increase size of swap_cgroup.
(2bytes per page => 4bytes per page) It's preallocated at swapon()
because allocating memory dynamically when we swap a memory is not good.

Do we really need 4bytes for ID ? If so, swap_cgroup should be totally re-designed.

Thanks,
-Kame















2012-11-20 07:08:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

Hello, Kamezawa.

On Tue, Nov 20, 2012 at 04:05:51PM +0900, Kamezawa Hiroyuki wrote:
> BTW, css's ID was limited to 65535 to be encoded in 2bytes.
> If we use INT, this will increase size of swap_cgroup.
> (2bytes per page => 4bytes per page) It's preallocated at swapon()
> because allocating memory dynamically when we swap a memory is not good.
>
> Do we really need 4bytes for ID ? If so, swap_cgroup should be totally re-designed.

That's a memcg restriction which shouldn't have been imposed on cgroup
core from the beginning. What memcg should do is rejecting to become
online from ->css_onilne() if cgrp->id is out of the range it can
handle.

Thanks.

--
tejun

2012-11-20 07:12:08

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

(2012/11/20 16:08), Tejun Heo wrote:
> Hello, Kamezawa.
>
> On Tue, Nov 20, 2012 at 04:05:51PM +0900, Kamezawa Hiroyuki wrote:
>> BTW, css's ID was limited to 65535 to be encoded in 2bytes.
>> If we use INT, this will increase size of swap_cgroup.
>> (2bytes per page => 4bytes per page) It's preallocated at swapon()
>> because allocating memory dynamically when we swap a memory is not good.
>>
>> Do we really need 4bytes for ID ? If so, swap_cgroup should be totally re-designed.
>
> That's a memcg restriction which shouldn't have been imposed on cgroup
> core from the beginning. What memcg should do is rejecting to become
> online from ->css_onilne() if cgrp->id is out of the range it can
> handle.
>

Hmm. I'll think a little and consider a patch to remove memcg's refcnt.


Thanks,
-Kame

2012-11-20 08:20:32

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: add cgroup->id

On 11/20/2012 11:05 AM, Kamezawa Hiroyuki wrote:
> (2012/11/20 14:31), Tejun Heo wrote:
>> Hello, Kamezawa.
>>
>> On Tue, Nov 20, 2012 at 01:34:54PM +0900, Kamezawa Hiroyuki wrote:
>>> I'm sorry if I misunderstand ... current usage of css-id in
>>> memory/swap cgroup
>>> is for recording information of memory cgroup which may be destroyed.
>>> In some case,
>>> a memcg's cgroup is freed but a struct memcgroup and its css are
>>> available, swap_cgroup
>>> may contain id ot if.
>>> This patch puts cgroup's id at diput(), so, the id used in
>>> swap_cgroup can be
>>> reused while it's in use. Right ?
>>
>> CSSes hold onto cgroups, so if memcg is around, its cgroup doesn't go
>> away, so the right thing to do would be holding onto CSS whlie there
>> are remaining references, which IMHO is the way it should have been
>> implemented from the beginning. The only reason memcg currently has
>> its own refcnt nested inside css refcnt is because cgroup used to
>> require css refs to be completely drained for cgroup_rmdir() to
>> proceed. Now that that weirdity is gone, we should go back to sane
>> css based reference counting, right?
>>
>
> Ah, hm, Maybe I missed new __css_put() implementation...
>
>> void __css_put(struct cgroup_subsys_state *css)
>> {
>> struct cgroup *cgrp = css->cgroup;
>> int v;
>>
>> rcu_read_lock();
>> v = css_unbias_refcnt(atomic_dec_return(&css->refcnt));
>>
>> switch (v) {
>> case 1:
>> if (notify_on_release(cgrp)) {
>> set_bit(CGRP_RELEASABLE, &cgrp->flags);
>> check_for_release(cgrp);
>> }
>> break;
>> case 0:
>> schedule_work(&css->dput_work);
>> break;
>> }
>> rcu_read_unlock();
>> }
>
> If swap_cgroup holds css's refcnt instead of memcg's....
> final dput will be invoked when the last swap_cgroup release a reference.
>
> It seems to work and we can drop memcg's refcnt (maybe).
>
> BTW, css's ID was limited to 65535 to be encoded in 2bytes.
> If we use INT, this will increase size of swap_cgroup.
> (2bytes per page => 4bytes per page) It's preallocated at swapon()
> because allocating memory dynamically when we swap a memory is not good.
>
> Do we really need 4bytes for ID ? If so, swap_cgroup should be totally
> re-designed.
>

For the record, I've already came to the conclusion myself that
swap_cgroup should be redesigned for this very same reason. (I was
testing it a while ago). I haven't had much time to think about it,
though. But I was considering using the memcg address itself, in a
sparsely populated structure.