Hello,
This is v4 of the xt_cgroup2 patchset. Changes from v3 are
* Refreshed on top of net-next/master e72c932d3f8a ("cxgb3: Convert
simple_strtoul to kstrtox"). Dropped "cgroups: Allow dynamically
changing net_classid" from series as it's already applied.
The changes from v2 to v3 are
* Folded cgroup2 path matching into xt_cgroup as a new revision rather
than a separate xt_cgroup2 match as suggested by Pablo.
* Refreshed on top of Nina's net_cls dynamic config update fix patch.
I included the fix patch as part of this series to ease reviewing.
The changes from v1 to v2 are
* Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
carries either (prioidx, classid) pair or cgroup2 pointer. This
avoids inflating struct sock with yet another cgroup related field.
Unfortunately, this does add some complexity but that's the
trade-off and the complexity is contained in cgroup proper.
* Various small updats as per David and Jan's reviews.
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
This patchset includes the following eight patches.
0001-cgroup-record-ancestor-IDs-and-reimplement-cgroup_is.patch
0002-kernfs-implement-kernfs_walk_and_get.patch
0003-cgroup-implement-cgroup_get_from_path-and-expose-cgr.patch
0004-netprio_cgroup-limit-the-maximum-css-id-to-USHRT_MAX.patch
0005-net-wrap-sock-sk_cgrp_prioidx-and-sk_classid-inside-.patch
0006-sock-cgroup-add-sock-sk_cgroup.patch
0007-netfilter-prepare-xt_cgroup-for-multi-revisions.patch
0008-netfilter-implement-xt_cgroup-cgroup2-path-match.patch
0001-0003 are prepatory patches in kernfs and cgroup. These patches
are available in the following branch which will stay stable.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test
0004-0006 consolidate two cgroup related fields in struct sock into
cgroup_sock_data and update it so that it can alternatively carry a
cgroup pointer.
0007-0008 implement cgroup2 patch matching in xt_cgroup.
This patchset is on top of net-next/master and also available in the
following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2
I'll post iptables extension as a reply. diffstat follows. Thanks.
fs/kernfs/dir.c | 46 +++++++++++
include/linux/cgroup-defs.h | 126 +++++++++++++++++++++++++++++++
include/linux/cgroup.h | 66 +++++++++++++++-
include/linux/kernfs.h | 12 ++
include/net/cls_cgroup.h | 11 +-
include/net/netprio_cgroup.h | 16 +++
include/net/sock.h | 13 ---
include/uapi/linux/netfilter/xt_cgroup.h | 15 +++
kernel/cgroup.c | 126 ++++++++++++++++++++++++-------
net/Kconfig | 6 +
net/core/dev.c | 3
net/core/netclassid_cgroup.c | 11 +-
net/core/netprio_cgroup.c | 19 ++++
net/core/scm.c | 4
net/core/sock.c | 17 ----
net/netfilter/nft_meta.c | 2
net/netfilter/xt_cgroup.c | 108 ++++++++++++++++++++++----
17 files changed, 513 insertions(+), 88 deletions(-)
--
tejun
cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question. While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.
This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level. cgroup_is_descendant() is accordingly reimplmented
and made inline.
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup-defs.h | 14 ++++++++++++++
include/linux/cgroup.h | 18 +++++++++++++++++-
kernel/cgroup.c | 32 ++++++++++----------------------
3 files changed, 41 insertions(+), 23 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 60d44b2..504d859 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -235,6 +235,14 @@ struct cgroup {
int id;
/*
+ * The depth this cgroup is at. The root is at depth zero and each
+ * step down the hierarchy increments the level. This along with
+ * ancestor_ids[] can determine whether a given cgroup is a
+ * descendant of another without traversing the hierarchy.
+ */
+ int level;
+
+ /*
* Each non-empty css_set associated with this cgroup contributes
* one to populated_cnt. All children with non-zero popuplated_cnt
* of their own contribute one. The count is zero iff there's no
@@ -289,6 +297,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
+
+ /* ids of the ancestors at each level including self */
+ int ancestor_ids[];
};
/*
@@ -308,6 +319,9 @@ struct cgroup_root {
/* The root cgroup. Root is destroyed on its release. */
struct cgroup cgrp;
+ /* for cgrp->ancestor_ids[0] */
+ int cgrp_ancestor_id_storage;
+
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 22e3754..b5ee2c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,7 +81,6 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor);
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
@@ -459,6 +458,23 @@ static inline struct cgroup *task_cgroup(struct task_struct *task,
return task_css(task, subsys_id)->cgroup;
}
+/**
+ * cgroup_is_descendant - test ancestry
+ * @cgrp: the cgroup to be tested
+ * @ancestor: possible ancestor of @cgrp
+ *
+ * Test whether @cgrp is a descendant of @ancestor. It also returns %true
+ * if @cgrp == @ancestor. This function is safe to call as long as @cgrp
+ * and @ancestor are accessible.
+ */
+static inline bool cgroup_is_descendant(struct cgroup *cgrp,
+ struct cgroup *ancestor)
+{
+ if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
+ return false;
+ return cgrp->ancestor_ids[ancestor->level] == ancestor->id;
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_is_populated(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f1603c1..3190040 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -459,25 +459,6 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
}
EXPORT_SYMBOL_GPL(of_css);
-/**
- * cgroup_is_descendant - test ancestry
- * @cgrp: the cgroup to be tested
- * @ancestor: possible ancestor of @cgrp
- *
- * Test whether @cgrp is a descendant of @ancestor. It also returns %true
- * if @cgrp == @ancestor. This function is safe to call as long as @cgrp
- * and @ancestor are accessible.
- */
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
-{
- while (cgrp) {
- if (cgrp == ancestor)
- return true;
- cgrp = cgroup_parent(cgrp);
- }
- return false;
-}
-
static int notify_on_release(const struct cgroup *cgrp)
{
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
@@ -1903,6 +1884,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
if (ret < 0)
goto out;
root_cgrp->id = ret;
+ root_cgrp->ancestor_ids[0] = ret;
ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, 0,
GFP_KERNEL);
@@ -4846,11 +4828,11 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
umode_t mode)
{
- struct cgroup *parent, *cgrp;
+ struct cgroup *parent, *cgrp, *tcgrp;
struct cgroup_root *root;
struct cgroup_subsys *ss;
struct kernfs_node *kn;
- int ssid, ret;
+ int level, ssid, ret;
/* Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
*/
@@ -4861,9 +4843,11 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
if (!parent)
return -ENODEV;
root = parent->root;
+ level = parent->level + 1;
/* allocate the cgroup and its ID, 0 is reserved for the root */
- cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
+ cgrp = kzalloc(sizeof(*cgrp) +
+ sizeof(cgrp->ancestor_ids[0]) * (level + 1), GFP_KERNEL);
if (!cgrp) {
ret = -ENOMEM;
goto out_unlock;
@@ -4887,6 +4871,10 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
cgrp->self.parent = &parent->self;
cgrp->root = root;
+ cgrp->level = level;
+
+ for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
+ cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
if (notify_on_release(parent))
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
--
2.5.0
Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.
v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
David.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Cc: David Miller <[email protected]>
---
fs/kernfs/dir.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 12 ++++++++++++
2 files changed, 58 insertions(+)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..742bf4a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -694,6 +694,29 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
return NULL;
}
+static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
+ const unsigned char *path,
+ const void *ns)
+{
+ static char path_buf[PATH_MAX]; /* protected by kernfs_mutex */
+ size_t len = strlcpy(path_buf, path, PATH_MAX);
+ char *p = path_buf;
+ char *name;
+
+ lockdep_assert_held(&kernfs_mutex);
+
+ if (len >= PATH_MAX)
+ return NULL;
+
+ while ((name = strsep(&p, "/")) && parent) {
+ if (*name == '\0')
+ continue;
+ parent = kernfs_find_ns(parent, name, ns);
+ }
+
+ return parent;
+}
+
/**
* kernfs_find_and_get_ns - find and get kernfs_node with the given name
* @parent: kernfs_node to search under
@@ -719,6 +742,29 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
EXPORT_SYMBOL_GPL(kernfs_find_and_get_ns);
/**
+ * kernfs_walk_and_get_ns - find and get kernfs_node with the given path
+ * @parent: kernfs_node to search under
+ * @path: path to look for
+ * @ns: the namespace tag to use
+ *
+ * Look for kernfs_node with path @path under @parent and get a reference
+ * if found. This function may sleep and returns pointer to the found
+ * kernfs_node on success, %NULL on failure.
+ */
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+ const char *path, const void *ns)
+{
+ struct kernfs_node *kn;
+
+ mutex_lock(&kernfs_mutex);
+ kn = kernfs_walk_ns(parent, path, ns);
+ kernfs_get(kn);
+ mutex_unlock(&kernfs_mutex);
+
+ return kn;
+}
+
+/**
* kernfs_create_root - create a new kernfs hierarchy
* @scops: optional syscall operations for the hierarchy
* @flags: KERNFS_ROOT_* flags
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5d4e9c4..af51df3 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn);
struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn);
struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns);
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+ const char *path, const void *ns);
void kernfs_get(struct kernfs_node *kn);
void kernfs_put(struct kernfs_node *kn);
@@ -350,6 +352,10 @@ static inline struct kernfs_node *
kernfs_find_and_get_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{ return NULL; }
+static inline struct kernfs_node *
+kernfs_walk_and_get_ns(struct kernfs_node *parent, const char *path,
+ const void *ns)
+{ return NULL; }
static inline void kernfs_get(struct kernfs_node *kn) { }
static inline void kernfs_put(struct kernfs_node *kn) { }
@@ -431,6 +437,12 @@ kernfs_find_and_get(struct kernfs_node *kn, const char *name)
}
static inline struct kernfs_node *
+kernfs_walk_and_get(struct kernfs_node *kn, const char *path)
+{
+ return kernfs_walk_and_get_ns(kn, path, NULL);
+}
+
+static inline struct kernfs_node *
kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
void *priv)
{
--
2.5.0
Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path. This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.
v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 7 +++++++
kernel/cgroup.c | 39 ++++++++++++++++++++++++++++++++++-----
2 files changed, 41 insertions(+), 5 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5ee2c4..4c3ffab 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);
+struct cgroup *cgroup_get_from_path(const char *path);
+
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
@@ -351,6 +353,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
percpu_ref_put_many(&css->refcnt, n);
}
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/**
* task_css_set_check - obtain a task's css_set with extra access conditions
* @task: the task to obtain css_set for
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3190040..3db5e8f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp)
return css_tryget(&cgrp->self);
}
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
}
+/**
+ * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path
+ * @path: path on the default hierarchy
+ *
+ * Find the cgroup at @path on the default hierarchy, increment its
+ * reference count and return it. Returns pointer to the found cgroup on
+ * success, ERR_PTR(-ENOENT) if @path doens't exist and ERR_PTR(-ENOTDIR)
+ * if @path points to a non-directory.
+ */
+struct cgroup *cgroup_get_from_path(const char *path)
+{
+ struct kernfs_node *kn;
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+
+ kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
+ if (kn) {
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ cgrp = kn->priv;
+ cgroup_get(cgrp);
+ } else {
+ cgrp = ERR_PTR(-ENOTDIR);
+ }
+ kernfs_put(kn);
+ } else {
+ cgrp = ERR_PTR(-ENOENT);
+ }
+
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_get_from_path);
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
--
2.5.0
netprio builds per-netdev contiguous priomap array which is indexed by
css->id. The array is allocated using kzalloc() effectively limiting
the maximum ID supported to some thousand range. This patch caps the
maximum supported css->id to USHRT_MAX which should be way above what
is actually useable.
This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Daniel Wagner <[email protected]>
Cc: Daniel Borkmann <[email protected]>
CC: Neil Horman <[email protected]>
---
include/net/sock.h | 10 +++++-----
net/core/netprio_cgroup.c | 9 +++++++++
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 6f58b84..a95bcf7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,7 +288,6 @@ struct cg_proto;
* @sk_ack_backlog: current listen backlog
* @sk_max_ack_backlog: listen backlog set in listen()
* @sk_priority: %SO_PRIORITY setting
- * @sk_cgrp_prioidx: socket group's priority map index
* @sk_type: socket type (%SOCK_STREAM, etc)
* @sk_protocol: which protocol this socket belongs in this network family
* @sk_peer_pid: &struct pid for this socket's peer
@@ -309,6 +308,7 @@ struct cg_proto;
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
+ * @sk_cgrp_prioidx: socket group's priority map index
* @sk_classid: this socket's cgroup classid
* @sk_cgrp: this socket's cgroup-specific proto data
* @sk_write_pending: a write to stream socket waits to start
@@ -425,9 +425,7 @@ struct sock {
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
__u32 sk_priority;
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
- __u32 sk_cgrp_prioidx;
-#endif
+ __u32 sk_mark;
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
long sk_rcvtimeo;
@@ -445,7 +443,9 @@ struct sock {
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
- __u32 sk_mark;
+#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
+ u16 sk_cgrp_prioidx;
+#endif
#ifdef CONFIG_CGROUP_NET_CLASSID
u32 sk_classid;
#endif
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index cbd0a19..2b9159b 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,6 +27,12 @@
#include <linux/fdtable.h>
+/*
+ * netprio allocates per-net_device priomap array which is indexed by
+ * css->id. Limiting css ID to 16bits doesn't lose anything.
+ */
+#define NETPRIO_ID_MAX USHRT_MAX
+
#define PRIOMAP_MIN_SZ 128
/*
@@ -144,6 +150,9 @@ static int cgrp_css_online(struct cgroup_subsys_state *css)
struct net_device *dev;
int ret = 0;
+ if (css->id > NETPRIO_ID_MAX)
+ return -ENOSPC;
+
if (!parent_css)
return 0;
--
2.5.0
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
and its accessors are defined in cgroup-defs.h. This is to prepare
for overloading the fields with a cgroup pointer.
This patch mostly performs equivalent conversions but the followings
are noteworthy.
* Equality test before updating classid is removed from
sock_update_classid(). This shouldn't make any noticeable
difference and a similar test will be implemented on the helper side
later.
* sock_update_netprioidx() now takes struct sock_cgroup_data and can
be moved to netprio_cgroup.h without causing include dependency
loop. Moved.
* The dummy version of sock_update_netprioidx() converted to a static
inline function while at it.
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup-defs.h | 36 ++++++++++++++++++++++++++++++++++++
include/net/cls_cgroup.h | 11 +++++------
include/net/netprio_cgroup.h | 16 +++++++++++++---
include/net/sock.h | 11 +++--------
net/Kconfig | 6 ++++++
net/core/dev.c | 3 ++-
net/core/netclassid_cgroup.c | 4 ++--
net/core/netprio_cgroup.c | 3 ++-
net/core/scm.c | 4 ++--
net/core/sock.c | 15 ++-------------
net/netfilter/nft_meta.c | 2 +-
net/netfilter/xt_cgroup.c | 3 ++-
12 files changed, 76 insertions(+), 38 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 504d859..ed128fed 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -542,4 +542,40 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
#endif /* CONFIG_CGROUPS */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+struct sock_cgroup_data {
+ u16 prioidx;
+ u32 classid;
+};
+
+static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
+{
+ return skcd->prioidx;
+}
+
+static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
+{
+ return skcd->classid;
+}
+
+static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
+ u16 prioidx)
+{
+ skcd->prioidx = prioidx;
+}
+
+static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
+ u32 classid)
+{
+ skcd->classid = classid;
+}
+
+#else /* CONFIG_SOCK_CGROUP_DATA */
+
+struct sock_cgroup_data {
+};
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
#endif /* _LINUX_CGROUP_DEFS_H */
diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index ccd6d8b..c0a92e2 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -41,13 +41,12 @@ static inline u32 task_cls_classid(struct task_struct *p)
return classid;
}
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
{
u32 classid;
classid = task_cls_classid(current);
- if (classid != sk->sk_classid)
- sk->sk_classid = classid;
+ sock_cgroup_set_classid(skcd, classid);
}
static inline u32 task_get_classid(const struct sk_buff *skb)
@@ -64,17 +63,17 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
* softirqs always disables bh.
*/
if (in_serving_softirq()) {
- /* If there is an sk_classid we'll use that. */
+ /* If there is an sock_cgroup_classid we'll use that. */
if (!skb->sk)
return 0;
- classid = skb->sk->sk_classid;
+ classid = sock_cgroup_classid(&skb->sk->sk_cgrp_data);
}
return classid;
}
#else /* !CONFIG_CGROUP_NET_CLASSID */
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
{
}
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index f2a9597..6041905 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -25,8 +25,6 @@ struct netprio_map {
u32 priomap[];
};
-void sock_update_netprioidx(struct sock *sk);
-
static inline u32 task_netprioidx(struct task_struct *p)
{
struct cgroup_subsys_state *css;
@@ -38,13 +36,25 @@ static inline u32 task_netprioidx(struct task_struct *p)
rcu_read_unlock();
return idx;
}
+
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+ if (in_interrupt())
+ return;
+
+ sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
+}
+
#else /* !CONFIG_CGROUP_NET_PRIO */
+
static inline u32 task_netprioidx(struct task_struct *p)
{
return 0;
}
-#define sock_update_netprioidx(sk)
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+}
#endif /* CONFIG_CGROUP_NET_PRIO */
#endif /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index a95bcf7..0ca22b0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -59,6 +59,7 @@
#include <linux/static_key.h>
#include <linux/sched.h>
#include <linux/wait.h>
+#include <linux/cgroup-defs.h>
#include <linux/filter.h>
#include <linux/rculist_nulls.h>
@@ -308,8 +309,7 @@ struct cg_proto;
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
- * @sk_cgrp_prioidx: socket group's priority map index
- * @sk_classid: this socket's cgroup classid
+ * @sk_cgrp_data: cgroup data for this cgroup
* @sk_cgrp: this socket's cgroup-specific proto data
* @sk_write_pending: a write to stream socket waits to start
* @sk_state_change: callback to indicate change in the state of the sock
@@ -443,12 +443,7 @@ struct sock {
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
- u16 sk_cgrp_prioidx;
-#endif
-#ifdef CONFIG_CGROUP_NET_CLASSID
- u32 sk_classid;
-#endif
+ struct sock_cgroup_data sk_cgrp_data;
struct cg_proto *sk_cgrp;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk);
diff --git a/net/Kconfig b/net/Kconfig
index 127da94..11f8c22 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -250,9 +250,14 @@ config XPS
depends on SMP
default y
+config SOCK_CGROUP_DATA
+ bool
+ default n
+
config CGROUP_NET_PRIO
bool "Network priority cgroup"
depends on CGROUPS
+ select SOCK_CGROUP_DATA
---help---
Cgroup subsystem for use in assigning processes to network priorities on
a per-interface basis.
@@ -260,6 +265,7 @@ config CGROUP_NET_PRIO
config CGROUP_NET_CLASSID
bool "Network classid cgroup"
depends on CGROUPS
+ select SOCK_CGROUP_DATA
---help---
Cgroup subsystem for use as general purpose socket classid marker that is
being used in cls_cgroup and for netfilter matching.
diff --git a/net/core/dev.c b/net/core/dev.c
index e5c3954..8f705fc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2929,7 +2929,8 @@ static void skb_update_prio(struct sk_buff *skb)
struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);
if (!skb->priority && skb->sk && map) {
- unsigned int prioidx = skb->sk->sk_cgrp_prioidx;
+ unsigned int prioidx =
+ sock_cgroup_prioidx(&skb->sk->sk_cgrp_data);
if (prioidx < map->priomap_len)
skb->priority = map->priomap[prioidx];
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 2e4df84..e60ded4 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -62,8 +62,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
struct socket *sock = sock_from_file(file, &err);
if (sock)
- sock->sk->sk_classid = (u32)(unsigned long)v;
-
+ sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
+ (unsigned long)v);
return 0;
}
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 2b9159b..de42aa7 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -223,7 +223,8 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
int err;
struct socket *sock = sock_from_file(file, &err);
if (sock)
- sock->sk->sk_cgrp_prioidx = (u32)(unsigned long)v;
+ sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
+ (unsigned long)v);
return 0;
}
diff --git a/net/core/scm.c b/net/core/scm.c
index 8a1741b..14596fb 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -289,8 +289,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
/* Bump the usage count and install the file. */
sock = sock_from_file(fp[i], &err);
if (sock) {
- sock_update_netprioidx(sock->sk);
- sock_update_classid(sock->sk);
+ sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+ sock_update_classid(&sock->sk->sk_cgrp_data);
}
fd_install(new_fd, get_file(fp[i]));
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 7965ef4..947741d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1393,17 +1393,6 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
module_put(owner);
}
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
-void sock_update_netprioidx(struct sock *sk)
-{
- if (in_interrupt())
- return;
-
- sk->sk_cgrp_prioidx = task_netprioidx(current);
-}
-EXPORT_SYMBOL_GPL(sock_update_netprioidx);
-#endif
-
/**
* sk_alloc - All socket objects are allocated here
* @net: the applicable net namespace
@@ -1432,8 +1421,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
sock_net_set(sk, net);
atomic_set(&sk->sk_wmem_alloc, 1);
- sock_update_classid(sk);
- sock_update_netprioidx(sk);
+ sock_update_classid(&sk->sk_cgrp_data);
+ sock_update_netprioidx(&sk->sk_cgrp_data);
}
return sk;
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 9dfaf4d..1915cab 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -174,7 +174,7 @@ void nft_meta_get_eval(const struct nft_expr *expr,
sk = skb_to_full_sk(skb);
if (!sk || !sk_fullsock(sk))
goto err;
- *dest = sk->sk_classid;
+ *dest = sock_cgroup_classid(&sk->sk_cgrp_data);
break;
#endif
default:
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index a1d126f..54eaeb4 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -42,7 +42,8 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
if (skb->sk == NULL || !sk_fullsock(skb->sk))
return false;
- return (info->id == skb->sk->sk_classid) ^ info->invert;
+ return (info->id == sock_cgroup_classid(&skb->sk->sk_cgrp_data)) ^
+ info->invert;
}
static struct xt_match cgroup_mt_reg __read_mostly = {
--
2.5.0
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
---
include/linux/cgroup-defs.h | 88 +++++++++++++++++++++++++++++++++++++++++---
include/linux/cgroup.h | 41 +++++++++++++++++++++
kernel/cgroup.c | 55 ++++++++++++++++++++++++++-
net/core/netclassid_cgroup.c | 7 +++-
net/core/netprio_cgroup.c | 7 +++-
net/core/sock.c | 2 +
6 files changed, 191 insertions(+), 9 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ed128fed..9dc2263 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -544,31 +544,107 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
#ifdef CONFIG_SOCK_CGROUP_DATA
+/*
+ * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
+ * per-socket cgroup information except for memcg association.
+ *
+ * On legacy hierarchies, net_prio and net_cls controllers directly set
+ * attributes on each sock which can then be tested by the network layer.
+ * On the default hierarchy, each sock is associated with the cgroup it was
+ * created in and the networking layer can match the cgroup directly.
+ *
+ * To avoid carrying all three cgroup related fields separately in sock,
+ * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
+ * On boot, sock_cgroup_data records the cgroup that the sock was created
+ * in so that cgroup2 matches can be made; however, once either net_prio or
+ * net_cls starts being used, the area is overriden to carry prioidx and/or
+ * classid. The two modes are distinguished by whether the lowest bit is
+ * set. Clear bit indicates cgroup pointer while set bit prioidx and
+ * classid.
+ *
+ * While userland may start using net_prio or net_cls at any time, once
+ * either is used, cgroup2 matching no longer works. There is no reason to
+ * mix the two and this is in line with how legacy and v2 compatibility is
+ * handled. On mode switch, cgroup references which are already being
+ * pointed to by socks may be leaked. While this can be remedied by adding
+ * synchronization around sock_cgroup_data, given that the number of leaked
+ * cgroups is bound and highly unlikely to be high, this seems to be the
+ * better trade-off.
+ */
struct sock_cgroup_data {
- u16 prioidx;
- u32 classid;
+ union {
+#ifdef __LITTLE_ENDIAN
+ struct {
+ u8 is_data;
+ u8 padding;
+ u16 prioidx;
+ u32 classid;
+ } __packed;
+#else
+ struct {
+ u32 classid;
+ u16 prioidx;
+ u8 padding;
+ u8 is_data;
+ } __packed;
+#endif
+ u64 val;
+ };
};
+/*
+ * There's a theoretical window where the following accessors race with
+ * updaters and return part of the previous pointer as the prioidx or
+ * classid. Such races are short-lived and the result isn't critical.
+ */
static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
{
- return skcd->prioidx;
+ /* fallback to 1 which is always the ID of the root cgroup */
+ return (skcd->is_data & 1) ? skcd->prioidx : 1;
}
static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
{
- return skcd->classid;
+ /* fallback to 0 which is the unconfigured default classid */
+ return (skcd->is_data & 1) ? skcd->classid : 0;
}
+/*
+ * If invoked concurrently, the updaters may clobber each other. The
+ * caller is responsible for synchronization.
+ */
static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
u16 prioidx)
{
- skcd->prioidx = prioidx;
+ struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+ if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
+ return;
+
+ if (!(skcd_buf.is_data & 1)) {
+ skcd_buf.val = 0;
+ skcd_buf.is_data = 1;
+ }
+
+ skcd_buf.prioidx = prioidx;
+ WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
}
static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
u32 classid)
{
- skcd->classid = classid;
+ struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+ if (sock_cgroup_classid(&skcd_buf) == classid)
+ return;
+
+ if (!(skcd_buf.is_data & 1)) {
+ skcd_buf.val = 0;
+ skcd_buf.is_data = 1;
+ }
+
+ skcd_buf.classid = classid;
+ WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
}
#else /* CONFIG_SOCK_CGROUP_DATA */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4c3ffab..a8ba1ea 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -578,4 +578,45 @@ static inline int cgroup_init(void) { return 0; }
#endif /* !CONFIG_CGROUPS */
+/*
+ * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+extern spinlock_t cgroup_sk_update_lock;
+#endif
+
+void cgroup_sk_alloc_disable(void);
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_free(struct sock_cgroup_data *skcd);
+
+static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
+{
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+ unsigned long v;
+
+ /*
+ * @skcd->val is 64bit but the following is safe on 32bit too as we
+ * just need the lower ulong to be written and read atomically.
+ */
+ v = READ_ONCE(skcd->val);
+
+ if (v & 1)
+ return &cgrp_dfl_root.cgrp;
+
+ return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
+#else
+ return (struct cgroup *)(unsigned long)skcd->val;
+#endif
+}
+
+#else /* CONFIG_CGROUP_DATA */
+
+static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
+
+#endif /* CONFIG_CGROUP_DATA */
+
#endif /* _LINUX_CGROUP_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3db5e8f..4f8f792 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,8 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
-
#include <linux/atomic.h>
+#include <net/sock.h>
/*
* pidlists linger the following amount before being destroyed. The goal
@@ -5782,6 +5782,59 @@ struct cgroup *cgroup_get_from_path(const char *path)
}
EXPORT_SYMBOL_GPL(cgroup_get_from_path);
+/*
+ * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+
+spinlock_t cgroup_sk_update_lock;
+static bool cgroup_sk_alloc_disabled __read_mostly;
+
+void cgroup_sk_alloc_disable(void)
+{
+ if (cgroup_sk_alloc_disabled)
+ return;
+ pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
+ cgroup_sk_alloc_disabled = true;
+}
+
+#else
+
+#define cgroup_sk_alloc_disabled false
+
+#endif
+
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
+{
+ if (cgroup_sk_alloc_disabled)
+ return;
+
+ rcu_read_lock();
+
+ while (true) {
+ struct css_set *cset;
+
+ cset = task_css_set(current);
+ if (likely(cgroup_tryget(cset->dfl_cgrp))) {
+ skcd->val = (unsigned long)cset->dfl_cgrp;
+ break;
+ }
+ cpu_relax();
+ }
+
+ rcu_read_unlock();
+}
+
+void cgroup_sk_free(struct sock_cgroup_data *skcd)
+{
+ cgroup_put(sock_cgroup_ptr(skcd));
+}
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index e60ded4..04257a0 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -61,9 +61,12 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
int err;
struct socket *sock = sock_from_file(file, &err);
- if (sock)
+ if (sock) {
+ spin_lock(&cgroup_sk_update_lock);
sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
(unsigned long)v);
+ spin_unlock(&cgroup_sk_update_lock);
+ }
return 0;
}
@@ -98,6 +101,8 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
{
struct cgroup_cls_state *cs = css_cls_state(css);
+ cgroup_sk_alloc_disable();
+
cs->classid = (u32)value;
update_classid(css, (void *)(unsigned long)cs->classid);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index de42aa7..053d60c 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -209,6 +209,8 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
if (!dev)
return -ENODEV;
+ cgroup_sk_alloc_disable();
+
rtnl_lock();
ret = netprio_set_prio(of_css(of), dev, prio);
@@ -222,9 +224,12 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
{
int err;
struct socket *sock = sock_from_file(file, &err);
- if (sock)
+ if (sock) {
+ spin_lock(&cgroup_sk_update_lock);
sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
(unsigned long)v);
+ spin_unlock(&cgroup_sk_update_lock);
+ }
return 0;
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 947741d..1278d7b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1363,6 +1363,7 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
if (!try_module_get(prot->owner))
goto out_free_sec;
sk_tx_queue_clear(sk);
+ cgroup_sk_alloc(&sk->sk_cgrp_data);
}
return sk;
@@ -1385,6 +1386,7 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
owner = prot->owner;
slab = prot->slab;
+ cgroup_sk_free(&sk->sk_cgrp_data);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
--
2.5.0
xt_cgroup will grow cgroup2 path based match. Postfix existing
symbols with _v0 and prepare for multi revision registration.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
---
include/uapi/linux/netfilter/xt_cgroup.h | 2 +-
net/netfilter/xt_cgroup.c | 36 +++++++++++++++++---------------
2 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
index 43acb7e..577c9e0 100644
--- a/include/uapi/linux/netfilter/xt_cgroup.h
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -3,7 +3,7 @@
#include <linux/types.h>
-struct xt_cgroup_info {
+struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
};
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 54eaeb4..1730025 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -24,9 +24,9 @@ MODULE_DESCRIPTION("Xtables: process control group matching");
MODULE_ALIAS("ipt_cgroup");
MODULE_ALIAS("ip6t_cgroup");
-static int cgroup_mt_check(const struct xt_mtchk_param *par)
+static int cgroup_mt_check_v0(const struct xt_mtchk_param *par)
{
- struct xt_cgroup_info *info = par->matchinfo;
+ struct xt_cgroup_info_v0 *info = par->matchinfo;
if (info->invert & ~1)
return -EINVAL;
@@ -35,9 +35,9 @@ static int cgroup_mt_check(const struct xt_mtchk_param *par)
}
static bool
-cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
{
- const struct xt_cgroup_info *info = par->matchinfo;
+ const struct xt_cgroup_info_v0 *info = par->matchinfo;
if (skb->sk == NULL || !sk_fullsock(skb->sk))
return false;
@@ -46,27 +46,29 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
info->invert;
}
-static struct xt_match cgroup_mt_reg __read_mostly = {
- .name = "cgroup",
- .revision = 0,
- .family = NFPROTO_UNSPEC,
- .checkentry = cgroup_mt_check,
- .match = cgroup_mt,
- .matchsize = sizeof(struct xt_cgroup_info),
- .me = THIS_MODULE,
- .hooks = (1 << NF_INET_LOCAL_OUT) |
- (1 << NF_INET_POST_ROUTING) |
- (1 << NF_INET_LOCAL_IN),
+static struct xt_match cgroup_mt_reg[] __read_mostly = {
+ {
+ .name = "cgroup",
+ .revision = 0,
+ .family = NFPROTO_UNSPEC,
+ .checkentry = cgroup_mt_check_v0,
+ .match = cgroup_mt_v0,
+ .matchsize = sizeof(struct xt_cgroup_info_v0),
+ .me = THIS_MODULE,
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+ },
};
static int __init cgroup_mt_init(void)
{
- return xt_register_match(&cgroup_mt_reg);
+ return xt_register_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg));
}
static void __exit cgroup_mt_exit(void)
{
- xt_unregister_match(&cgroup_mt_reg);
+ xt_unregister_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg));
}
module_init(cgroup_mt_init);
--
2.5.0
This patch implements xt_cgroup path match which matches cgroup2
membership of the associated socket. The match is recursive and
invertible.
For rationales on introducing another cgroup based match, please refer
to a preceding commit "sock, cgroup: add sock->sk_cgroup".
v3: Folded into xt_cgroup as a new revision interface as suggested by
Pablo.
v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX. Added
explicit alignment to the priv field. Both suggested by Jan.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
---
include/uapi/linux/netfilter/xt_cgroup.h | 13 ++++++
net/netfilter/xt_cgroup.c | 69 ++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+)
diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
index 577c9e0..1e4b37b 100644
--- a/include/uapi/linux/netfilter/xt_cgroup.h
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -2,10 +2,23 @@
#define _UAPI_XT_CGROUP_H
#include <linux/types.h>
+#include <linux/limits.h>
struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
};
+struct xt_cgroup_info_v1 {
+ __u8 has_path;
+ __u8 has_classid;
+ __u8 invert_path;
+ __u8 invert_classid;
+ char path[PATH_MAX];
+ __u32 classid;
+
+ /* kernel internal data */
+ void *priv __attribute__((aligned(8)));
+};
+
#endif /* _UAPI_XT_CGROUP_H */
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 1730025..a086a91 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -34,6 +34,37 @@ static int cgroup_mt_check_v0(const struct xt_mtchk_param *par)
return 0;
}
+static int cgroup_mt_check_v1(const struct xt_mtchk_param *par)
+{
+ struct xt_cgroup_info_v1 *info = par->matchinfo;
+ struct cgroup *cgrp;
+
+ if ((info->invert_path & ~1) || (info->invert_classid & ~1))
+ return -EINVAL;
+
+ if (!info->has_path && !info->has_classid) {
+ pr_info("xt_cgroup: no path or classid specified\n");
+ return -EINVAL;
+ }
+
+ if (info->has_path && info->has_classid) {
+ pr_info("xt_cgroup: both path and classid specified\n");
+ return -EINVAL;
+ }
+
+ if (info->has_path) {
+ cgrp = cgroup_get_from_path(info->path);
+ if (IS_ERR(cgrp)) {
+ pr_info("xt_cgroup: invalid path, errno=%ld\n",
+ PTR_ERR(cgrp));
+ return -EINVAL;
+ }
+ info->priv = cgrp;
+ }
+
+ return 0;
+}
+
static bool
cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
{
@@ -46,6 +77,31 @@ cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
info->invert;
}
+static bool cgroup_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
+{
+ const struct xt_cgroup_info_v1 *info = par->matchinfo;
+ struct sock_cgroup_data *skcd = &skb->sk->sk_cgrp_data;
+ struct cgroup *ancestor = info->priv;
+
+ if (!skb->sk || !sk_fullsock(skb->sk))
+ return false;
+
+ if (ancestor)
+ return cgroup_is_descendant(sock_cgroup_ptr(skcd), ancestor) ^
+ info->invert_path;
+ else
+ return (info->classid == sock_cgroup_classid(skcd)) ^
+ info->invert_classid;
+}
+
+static void cgroup_mt_destroy_v1(const struct xt_mtdtor_param *par)
+{
+ struct xt_cgroup_info_v1 *info = par->matchinfo;
+
+ if (info->priv)
+ cgroup_put(info->priv);
+}
+
static struct xt_match cgroup_mt_reg[] __read_mostly = {
{
.name = "cgroup",
@@ -59,6 +115,19 @@ static struct xt_match cgroup_mt_reg[] __read_mostly = {
(1 << NF_INET_POST_ROUTING) |
(1 << NF_INET_LOCAL_IN),
},
+ {
+ .name = "cgroup",
+ .revision = 1,
+ .family = NFPROTO_UNSPEC,
+ .checkentry = cgroup_mt_check_v1,
+ .match = cgroup_mt_v1,
+ .matchsize = sizeof(struct xt_cgroup_info_v1),
+ .destroy = cgroup_mt_destroy_v1,
+ .me = THIS_MODULE,
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+ },
};
static int __init cgroup_mt_init(void)
--
2.5.0
From: Tejun Heo <[email protected]>
Date: Mon, 7 Dec 2015 17:38:47 -0500
> 0001-0003 are prepatory patches in kernfs and cgroup. These patches
> are available in the following branch which will stay stable.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test
Pulled into net-next.
> 0004-0006 consolidate two cgroup related fields in struct sock into
> cgroup_sock_data and update it so that it can alternatively carry a
> cgroup pointer.
Applied to net-next.
> 0007-0008 implement cgroup2 patch matching in xt_cgroup.
I'll leave these for Pablo to integrate.
Thanks.
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Tejun Heo
> Sent: Tuesday, December 8, 2015 6:39
> To: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: [PATCHSET v4] netfilter, cgroup: implement cgroup2 path match in
> xt_cgroup
>
> Hello,
>
> This is v4 of the xt_cgroup2 patchset. Changes from v3 are
>
> * Refreshed on top of net-next/master e72c932d3f8a ("cxgb3: Convert
> simple_strtoul to kstrtox"). Dropped "cgroups: Allow dynamically
> changing net_classid" from series as it's already applied.
>
> The changes from v2 to v3 are
>
> * Folded cgroup2 path matching into xt_cgroup as a new revision rather
> than a separate xt_cgroup2 match as suggested by Pablo.
>
> * Refreshed on top of Nina's net_cls dynamic config update fix patch.
> I included the fix patch as part of this series to ease reviewing.
>
> The changes from v1 to v2 are
>
> * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
> carries either (prioidx, classid) pair or cgroup2 pointer. This
> avoids inflating struct sock with yet another cgroup related field.
> Unfortunately, this does add some complexity but that's the
> trade-off and the complexity is contained in cgroup proper.
>
> * Various small updats as per David and Jan's reviews.
>
>
> In cgroup v1, dealing with cgroup membership was difficult because the
> number of membership associations was unbound. As a result, cgroup v1
> grew several controllers whose primary purpose is either tagging
> membership or pull in configuration knobs from other subsystems so
> that cgroup membership test can be avoided.
>
> net_cls and net_prio controllers are examples of the latter. They
> allow configuring network-specific attributes from cgroup side so that
> network subsystem can avoid testing cgroup membership; unfortunately,
> these are not only cumbersome but also problematic.
>
> Both net_cls and net_prio aren't properly hierarchical. Both inherit
> configuration from the parent on creation but there's no interaction
> afterwards. An ancestor doesn't restrict the behavior in its subtree
> in anyway and configuration changes aren't propagated downwards.
> Especially when combined with cgroup delegation, this is problematic
> because delegatees can mess up whatever network configuration
> implemented at the system level. net_prio would allow the delegatees
> to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
> the same for classid.
>
> While it is possible to solve these issues from controller side by
> implementing hierarchical allowable ranges in both controllers, it
> would involve quite a bit of complexity in the controllers and further
> obfuscate network configuration as it becomes even more difficult to
> tell what's actually being configured looking from the network side.
> While not much can be done for v1 at this point, as membership
> handling is sane on cgroup v2, it'd be better to make cgroup matching
> behave like other network matches and classifiers than introducing
> further complications.
>
> This patchset includes the following eight patches.
>
> 0001-cgroup-record-ancestor-IDs-and-reimplement-cgroup_is.patch
> 0002-kernfs-implement-kernfs_walk_and_get.patch
> 0003-cgroup-implement-cgroup_get_from_path-and-expose-cgr.patch
> 0004-netprio_cgroup-limit-the-maximum-css-id-to-USHRT_MAX.patch
> 0005-net-wrap-sock-sk_cgrp_prioidx-and-sk_classid-inside-.patch
> 0006-sock-cgroup-add-sock-sk_cgroup.patch
> 0007-netfilter-prepare-xt_cgroup-for-multi-revisions.patch
> 0008-netfilter-implement-xt_cgroup-cgroup2-path-match.patch
>
> 0001-0003 are prepatory patches in kernfs and cgroup. These patches
> are available in the following branch which will stay stable.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test
>
> 0004-0006 consolidate two cgroup related fields in struct sock into
> cgroup_sock_data and update it so that it can alternatively carry a
> cgroup pointer.
>
> 0007-0008 implement cgroup2 patch matching in xt_cgroup.
>
> This patchset is on top of net-next/master and also available in the
> following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2
>
> I'll post iptables extension as a reply. diffstat follows. Thanks.
>
> fs/kernfs/dir.c | 46 +++++++++++
> include/linux/cgroup-defs.h | 126 +++++++++++++++++++++++++++++++
> include/linux/cgroup.h | 66 +++++++++++++++-
> include/linux/kernfs.h | 12 ++
> include/net/cls_cgroup.h | 11 +-
> include/net/netprio_cgroup.h | 16 +++
> include/net/sock.h | 13 ---
> include/uapi/linux/netfilter/xt_cgroup.h | 15 +++
> kernel/cgroup.c | 126 ++++++++++++++++++++++++-------
> net/Kconfig | 6 +
> net/core/dev.c | 3
> net/core/netclassid_cgroup.c | 11 +-
> net/core/netprio_cgroup.c | 19 ++++
> net/core/scm.c | 4
> net/core/sock.c | 17 ----
> net/netfilter/nft_meta.c | 2
> net/netfilter/xt_cgroup.c | 108 ++++++++++++++++++++++----
> 17 files changed, 513 insertions(+), 88 deletions(-)
>
> --
> tejun
Hi Tejun,
With today's linux-next (next-20151214), I still got the same back trace, which
was previously reported at http://lists.openwall.net/netdev/2015/11/23/80:
[ 15.129701] BUG: spinlock bad magic on CPU#6, (systemd)/1012
[ 15.129701] lock: cgroup_sk_update_lock+0x0/0x40, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 15.129701] CPU: 6 PID: 1012 Comm: (systemd) Not tainted 4.4.0-rc4-next-20151214+ #3
[ 15.129701] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 15.129701] ffffffffae6cddc0 ffff8800e158bab0 ffffffffad317212 0000000000000000
[ 15.129701] ffff8800e158bad0 ffffffffad0a1b8c ffffffffae6cddc0 ffffffffad800ee6
[ 15.129701] ffff8800e158baf0 ffffffffad0a1c06 ffffffffae6cddc0 ffff8800ead9f080
[ 15.129701] Call Trace:
[ 15.129701] [<ffffffffad317212>] dump_stack+0x44/0x62
[ 15.129701] [<ffffffffad0a1b8c>] spin_dump+0x7c/0xd0
[ 15.129701] [<ffffffffad0a1c06>] spin_bug+0x26/0x30
[ 15.129701] [<ffffffffad0a1d55>] do_raw_spin_lock+0xe5/0x120
[ 15.129701] [<ffffffffad5968b9>] _raw_spin_lock+0x39/0x40
[ 15.129701] [<ffffffffad4c27b3>] ? update_classid_sock+0x33/0x80
[ 15.129701] [<ffffffffad4c27b3>] update_classid_sock+0x33/0x80
[ 15.129701] [<ffffffffad4c2780>] ? write_classid+0x30/0x30
[ 15.129701] [<ffffffffad1e2c2a>] iterate_fd+0x5a/0x90
[ 15.129701] [<ffffffffad4c2717>] update_classid+0x47/0x80
[ 15.129701] [<ffffffffad4c2825>] cgrp_attach+0x25/0x30
[ 15.129701] [<ffffffffad0de43b>] cgroup_taskset_migrate+0x14b/0x280
[ 15.129701] [<ffffffffad0de62f>] cgroup_migrate+0xbf/0x100
[ 15.129701] [<ffffffffad0de575>] ? cgroup_migrate+0x5/0x100
[ 15.129701] [<ffffffffad0de725>] cgroup_attach_task+0xb5/0x100
[ 15.129701] [<ffffffffad0de675>] ? cgroup_attach_task+0x5/0x100
[ 15.129701] [<ffffffffad0dea0a>] __cgroup_procs_write+0x1da/0x310
[ 15.129701] [<ffffffffad0de88e>] ? __cgroup_procs_write+0x5e/0x310
[ 15.129701] [<ffffffffad0deb74>] cgroup_procs_write+0x14/0x20
[ 15.129701] [<ffffffffad0db3d0>] cgroup_file_write+0x40/0x130
[ 15.129701] [<ffffffffad242fa0>] kernfs_fop_write+0x130/0x180
[ 15.129701] [<ffffffffad1c4618>] __vfs_write+0x28/0xe0
[ 15.129701] [<ffffffffad09bb3c>] ? percpu_down_read+0x3c/0x90
[ 15.129701] [<ffffffffad1c7ddc>] ? __sb_start_write+0xdc/0xf0
[ 15.129701] [<ffffffffad1c7ddc>] ? __sb_start_write+0xdc/0xf0
[ 15.129701] [<ffffffffad1c4eb9>] vfs_write+0xa9/0x190
[ 15.129701] [<ffffffffad1c5e19>] SyS_write+0x49/0xa0
[ 15.129701] [<ffffffffad5972f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
My kernel config is attached FYI.
Thanks,
-- Dexuan
bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") added global
spinlock cgroup_sk_update_lock but erroneously skipped initializer
leading to uninitialized spinlock warning. Fix it by using
DEFINE_SPINLOCK().
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Dexuan Cui <[email protected]>
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
---
Hello, Dexuan.
Oops, sorry about that. Somehow thought it was a different problem
which is already fixed. This should do it.
Thanks.
kernel/cgroup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4f8f792..4466273f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5790,7 +5790,7 @@ EXPORT_SYMBOL_GPL(cgroup_get_from_path);
#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-spinlock_t cgroup_sk_update_lock;
+DEFINE_SPINLOCK(cgroup_sk_update_lock);
static bool cgroup_sk_alloc_disabled __read_mostly;
void cgroup_sk_alloc_disable(void)
From: Tejun Heo <[email protected]>
Date: Mon, 14 Dec 2015 11:24:06 -0500
> bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") added global
> spinlock cgroup_sk_update_lock but erroneously skipped initializer
> leading to uninitialized spinlock warning. Fix it by using
> DEFINE_SPINLOCK().
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Dexuan Cui <[email protected]>
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Applied, thanks.
On Mon, Dec 07, 2015 at 05:38:54PM -0500, Tejun Heo wrote:
> xt_cgroup will grow cgroup2 path based match. Postfix existing
> symbols with _v0 and prepare for multi revision registration.
Applied, thanks.
On Mon, Dec 07, 2015 at 05:38:55PM -0500, Tejun Heo wrote:
> This patch implements xt_cgroup path match which matches cgroup2
> membership of the associated socket. The match is recursive and
> invertible.
Applied, thanks.
I shared the same concerns as Florian regarding the large size of the
path field in iptables, but given that we expose the layout of our
internal representation there (which is bad in terms of
extensibility), the only solution that I can see is to artificially
limitate the size of that field, but that may break users depending on
the scenario.
Hopefully, we should be able to provide something better in nf_tables
to address this.
> -----Original Message-----
> From: David Miller [mailto:[email protected]]
> Sent: Tuesday, December 15, 2015 3:21
> To: [email protected]
> Cc: Dexuan Cui <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH net-next] net, cgroup: cgroup_sk_updat_lock was missing
> initializer
>
> From: Tejun Heo <[email protected]>
> Date: Mon, 14 Dec 2015 11:24:06 -0500
>
> > bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") added global
> > spinlock cgroup_sk_update_lock but erroneously skipped initializer
> > leading to uninitialized spinlock warning. Fix it by using
> > DEFINE_SPINLOCK().
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Reported-by: Dexuan Cui <[email protected]>
> > Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
>
> Applied, thanks.
Thanks! I can confirm it fixed the issue.
Thanks,
-- Dexuan
On Mon, Dec 07, 2015 at 05:38:50PM -0500, Tejun Heo wrote:
> Implement cgroup_get_from_path() using kernfs_walk_and_get() which
> obtains a default hierarchy cgroup from its path. This will be used
> to allow cgroup path based matching from outside cgroup proper -
> e.g. networking and perf.
Hi Tejun,
I'm trying to figure out how to handle this in the cgroup ns patchset.
Is this going to be purely used internally? From the user i see in
this patchset it looks like I should leave it be (and have @path always
be absolute) Is that right?
thanks,
-serge
> v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> include/linux/cgroup.h | 7 +++++++
> kernel/cgroup.c | 39 ++++++++++++++++++++++++++++++++++-----
> 2 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b5ee2c4..4c3ffab 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
> struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
> struct cgroup_subsys *ss);
>
> +struct cgroup *cgroup_get_from_path(const char *path);
> +
> int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
> int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
>
> @@ -351,6 +353,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
> percpu_ref_put_many(&css->refcnt, n);
> }
>
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> + css_put(&cgrp->self);
> +}
> +
> /**
> * task_css_set_check - obtain a task's css_set with extra access conditions
> * @task: the task to obtain css_set for
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3190040..3db5e8f 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp)
> return css_tryget(&cgrp->self);
> }
>
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> - css_put(&cgrp->self);
> -}
> -
> struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
> {
> struct cgroup *cgrp = of->kn->parent->priv;
> @@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
> return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
> }
>
> +/**
> + * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path
> + * @path: path on the default hierarchy
> + *
> + * Find the cgroup at @path on the default hierarchy, increment its
> + * reference count and return it. Returns pointer to the found cgroup on
> + * success, ERR_PTR(-ENOENT) if @path doens't exist and ERR_PTR(-ENOTDIR)
> + * if @path points to a non-directory.
> + */
> +struct cgroup *cgroup_get_from_path(const char *path)
> +{
> + struct kernfs_node *kn;
> + struct cgroup *cgrp;
> +
> + mutex_lock(&cgroup_mutex);
> +
> + kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
> + if (kn) {
> + if (kernfs_type(kn) == KERNFS_DIR) {
> + cgrp = kn->priv;
> + cgroup_get(cgrp);
> + } else {
> + cgrp = ERR_PTR(-ENOTDIR);
> + }
> + kernfs_put(kn);
> + } else {
> + cgrp = ERR_PTR(-ENOENT);
> + }
> +
> + mutex_unlock(&cgroup_mutex);
> + return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_get_from_path);
> +
> #ifdef CONFIG_CGROUP_DEBUG
> static struct cgroup_subsys_state *
> debug_css_alloc(struct cgroup_subsys_state *parent_css)
> --
> 2.5.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
Hello, Serge.
On Mon, Dec 21, 2015 at 06:22:41PM -0600, Serge E. Hallyn wrote:
> I'm trying to figure out how to handle this in the cgroup ns patchset.
> Is this going to be purely used internally? From the user i see in
> this patchset it looks like I should leave it be (and have @path always
> be absolute) Is that right?
Yeap, I don't think this function needs to worry about namespaces.
Thanks.
--
tejun