Hello,
This is the second take of the xt_cgroup2 patchset. Changes from the
last take are
* Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
carries either (prioidx, classid) pair or cgroup2 pointer. This
avoids inflating struct sock with yet another cgroup related field.
Unfortunately, this does add some complexity but that's the
trade-off and the complexity is contained in cgroup proper.
* Various small updats as per David and Jan's reviews.
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
Unfortunately, this ends up adding another cgroup related field to
struct sock - sock->sk_cgroup. I tried to think of a way to overload
the fields for net_cls and net_prio but couldn't come up with a sane
way to do that. In the long term, it should be possible to disable
net_cls and net_prio.
This patchset includes the following five patches.
0001-cgroup-record-ancestor-IDs-and-reimplement-cgroup_is.patch
0002-kernfs-implement-kernfs_walk_and_get.patch
0003-cgroup-implement-cgroup_get_from_path-and-expose-cgr.patch
0004-netprio_cgroup-limit-the-maximum-css-id-to-USHRT_MAX.patch
0005-net-wrap-sock-sk_cgrp_prioidx-and-sk_classid-inside-.patch
0006-sock-cgroup-add-sock-sk_cgroup.patch
0007-netfilter-implement-xt_cgroup2-match.patch
0001-0004 are prepatory patches in kernfs and cgroup. 0005-0006
consolidate two cgroup related fields in struct sock into
cgroup_sock_data and update it so that it can alternatively carry a
cgroup pointer. 0007 implements the new xt_cgroup2 match.
This patchset is on top of v4.4-rc1 and also available in the
following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2
I'll post iptables extension as a reply. diffstat follows. Thanks.
fs/kernfs/dir.c | 46 ++++++++++
include/linux/cgroup-defs.h | 124 +++++++++++++++++++++++++++++
include/linux/cgroup.h | 66 +++++++++++++++
include/linux/kernfs.h | 12 ++
include/net/cls_cgroup.h | 11 +-
include/net/netprio_cgroup.h | 16 +++
include/net/sock.h | 13 ---
include/uapi/linux/netfilter/xt_cgroup2.h | 15 +++
kernel/cgroup.c | 126 +++++++++++++++++++++++-------
net/Kconfig | 6 +
net/core/dev.c | 3
net/core/netclassid_cgroup.c | 10 +-
net/core/netprio_cgroup.c | 19 ++++
net/core/scm.c | 4
net/core/sock.c | 17 ----
net/netfilter/Kconfig | 10 ++
net/netfilter/Makefile | 1
net/netfilter/nft_meta.c | 2
net/netfilter/xt_cgroup.c | 3
net/netfilter/xt_cgroup2.c | 78 ++++++++++++++++++
20 files changed, 512 insertions(+), 70 deletions(-)
--
tejun
cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question. While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.
This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level. cgroup_is_descendant() is accordingly reimplmented
and made inline.
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup-defs.h | 14 ++++++++++++++
include/linux/cgroup.h | 18 +++++++++++++++++-
kernel/cgroup.c | 32 ++++++++++----------------------
3 files changed, 41 insertions(+), 23 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 60d44b2..504d859 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -235,6 +235,14 @@ struct cgroup {
int id;
/*
+ * The depth this cgroup is at. The root is at depth zero and each
+ * step down the hierarchy increments the level. This along with
+ * ancestor_ids[] can determine whether a given cgroup is a
+ * descendant of another without traversing the hierarchy.
+ */
+ int level;
+
+ /*
* Each non-empty css_set associated with this cgroup contributes
* one to populated_cnt. All children with non-zero popuplated_cnt
* of their own contribute one. The count is zero iff there's no
@@ -289,6 +297,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
+
+ /* ids of the ancestors at each level including self */
+ int ancestor_ids[];
};
/*
@@ -308,6 +319,9 @@ struct cgroup_root {
/* The root cgroup. Root is destroyed on its release. */
struct cgroup cgrp;
+ /* for cgrp->ancestor_ids[0] */
+ int cgrp_ancestor_id_storage;
+
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 22e3754..b5ee2c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,7 +81,6 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor);
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
@@ -459,6 +458,23 @@ static inline struct cgroup *task_cgroup(struct task_struct *task,
return task_css(task, subsys_id)->cgroup;
}
+/**
+ * cgroup_is_descendant - test ancestry
+ * @cgrp: the cgroup to be tested
+ * @ancestor: possible ancestor of @cgrp
+ *
+ * Test whether @cgrp is a descendant of @ancestor. It also returns %true
+ * if @cgrp == @ancestor. This function is safe to call as long as @cgrp
+ * and @ancestor are accessible.
+ */
+static inline bool cgroup_is_descendant(struct cgroup *cgrp,
+ struct cgroup *ancestor)
+{
+ if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
+ return false;
+ return cgrp->ancestor_ids[ancestor->level] == ancestor->id;
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_is_populated(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f1603c1..3190040 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -459,25 +459,6 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
}
EXPORT_SYMBOL_GPL(of_css);
-/**
- * cgroup_is_descendant - test ancestry
- * @cgrp: the cgroup to be tested
- * @ancestor: possible ancestor of @cgrp
- *
- * Test whether @cgrp is a descendant of @ancestor. It also returns %true
- * if @cgrp == @ancestor. This function is safe to call as long as @cgrp
- * and @ancestor are accessible.
- */
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
-{
- while (cgrp) {
- if (cgrp == ancestor)
- return true;
- cgrp = cgroup_parent(cgrp);
- }
- return false;
-}
-
static int notify_on_release(const struct cgroup *cgrp)
{
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
@@ -1903,6 +1884,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
if (ret < 0)
goto out;
root_cgrp->id = ret;
+ root_cgrp->ancestor_ids[0] = ret;
ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, 0,
GFP_KERNEL);
@@ -4846,11 +4828,11 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
umode_t mode)
{
- struct cgroup *parent, *cgrp;
+ struct cgroup *parent, *cgrp, *tcgrp;
struct cgroup_root *root;
struct cgroup_subsys *ss;
struct kernfs_node *kn;
- int ssid, ret;
+ int level, ssid, ret;
/* Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
*/
@@ -4861,9 +4843,11 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
if (!parent)
return -ENODEV;
root = parent->root;
+ level = parent->level + 1;
/* allocate the cgroup and its ID, 0 is reserved for the root */
- cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
+ cgrp = kzalloc(sizeof(*cgrp) +
+ sizeof(cgrp->ancestor_ids[0]) * (level + 1), GFP_KERNEL);
if (!cgrp) {
ret = -ENOMEM;
goto out_unlock;
@@ -4887,6 +4871,10 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
cgrp->self.parent = &parent->self;
cgrp->root = root;
+ cgrp->level = level;
+
+ for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
+ cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
if (notify_on_release(parent))
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
--
2.5.0
Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.
v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
David.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: David Miller <[email protected]>
---
fs/kernfs/dir.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 12 ++++++++++++
2 files changed, 58 insertions(+)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..742bf4a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -694,6 +694,29 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
return NULL;
}
+static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
+ const unsigned char *path,
+ const void *ns)
+{
+ static char path_buf[PATH_MAX]; /* protected by kernfs_mutex */
+ size_t len = strlcpy(path_buf, path, PATH_MAX);
+ char *p = path_buf;
+ char *name;
+
+ lockdep_assert_held(&kernfs_mutex);
+
+ if (len >= PATH_MAX)
+ return NULL;
+
+ while ((name = strsep(&p, "/")) && parent) {
+ if (*name == '\0')
+ continue;
+ parent = kernfs_find_ns(parent, name, ns);
+ }
+
+ return parent;
+}
+
/**
* kernfs_find_and_get_ns - find and get kernfs_node with the given name
* @parent: kernfs_node to search under
@@ -719,6 +742,29 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
EXPORT_SYMBOL_GPL(kernfs_find_and_get_ns);
/**
+ * kernfs_walk_and_get_ns - find and get kernfs_node with the given path
+ * @parent: kernfs_node to search under
+ * @path: path to look for
+ * @ns: the namespace tag to use
+ *
+ * Look for kernfs_node with path @path under @parent and get a reference
+ * if found. This function may sleep and returns pointer to the found
+ * kernfs_node on success, %NULL on failure.
+ */
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+ const char *path, const void *ns)
+{
+ struct kernfs_node *kn;
+
+ mutex_lock(&kernfs_mutex);
+ kn = kernfs_walk_ns(parent, path, ns);
+ kernfs_get(kn);
+ mutex_unlock(&kernfs_mutex);
+
+ return kn;
+}
+
+/**
* kernfs_create_root - create a new kernfs hierarchy
* @scops: optional syscall operations for the hierarchy
* @flags: KERNFS_ROOT_* flags
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5d4e9c4..af51df3 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn);
struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn);
struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns);
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+ const char *path, const void *ns);
void kernfs_get(struct kernfs_node *kn);
void kernfs_put(struct kernfs_node *kn);
@@ -350,6 +352,10 @@ static inline struct kernfs_node *
kernfs_find_and_get_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{ return NULL; }
+static inline struct kernfs_node *
+kernfs_walk_and_get_ns(struct kernfs_node *parent, const char *path,
+ const void *ns)
+{ return NULL; }
static inline void kernfs_get(struct kernfs_node *kn) { }
static inline void kernfs_put(struct kernfs_node *kn) { }
@@ -431,6 +437,12 @@ kernfs_find_and_get(struct kernfs_node *kn, const char *name)
}
static inline struct kernfs_node *
+kernfs_walk_and_get(struct kernfs_node *kn, const char *path)
+{
+ return kernfs_walk_and_get_ns(kn, path, NULL);
+}
+
+static inline struct kernfs_node *
kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
void *priv)
{
--
2.5.0
Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path. This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.
v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 7 +++++++
kernel/cgroup.c | 39 ++++++++++++++++++++++++++++++++++-----
2 files changed, 41 insertions(+), 5 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5ee2c4..4c3ffab 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);
+struct cgroup *cgroup_get_from_path(const char *path);
+
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
@@ -351,6 +353,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
percpu_ref_put_many(&css->refcnt, n);
}
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/**
* task_css_set_check - obtain a task's css_set with extra access conditions
* @task: the task to obtain css_set for
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3190040..3db5e8f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp)
return css_tryget(&cgrp->self);
}
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
}
+/**
+ * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path
+ * @path: path on the default hierarchy
+ *
+ * Find the cgroup at @path on the default hierarchy, increment its
+ * reference count and return it. Returns pointer to the found cgroup on
+ * success, ERR_PTR(-ENOENT) if @path doens't exist and ERR_PTR(-ENOTDIR)
+ * if @path points to a non-directory.
+ */
+struct cgroup *cgroup_get_from_path(const char *path)
+{
+ struct kernfs_node *kn;
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+
+ kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
+ if (kn) {
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ cgrp = kn->priv;
+ cgroup_get(cgrp);
+ } else {
+ cgrp = ERR_PTR(-ENOTDIR);
+ }
+ kernfs_put(kn);
+ } else {
+ cgrp = ERR_PTR(-ENOENT);
+ }
+
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_get_from_path);
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
--
2.5.0
netprio builds per-netdev contiguous priomap array which is indexed by
css->id. The array is allocated using kzalloc() effectively limiting
the maximum ID supported to some thousand range. This patch caps the
maximum supported css->id to USHRT_MAX which should be way above what
is actually useable.
This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
---
include/net/sock.h | 10 +++++-----
net/core/netprio_cgroup.c | 9 +++++++++
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index bbf7c2c..b517351 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,7 +288,6 @@ struct cg_proto;
* @sk_ack_backlog: current listen backlog
* @sk_max_ack_backlog: listen backlog set in listen()
* @sk_priority: %SO_PRIORITY setting
- * @sk_cgrp_prioidx: socket group's priority map index
* @sk_type: socket type (%SOCK_STREAM, etc)
* @sk_protocol: which protocol this socket belongs in this network family
* @sk_peer_pid: &struct pid for this socket's peer
@@ -309,6 +308,7 @@ struct cg_proto;
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
+ * @sk_cgrp_prioidx: socket group's priority map index
* @sk_classid: this socket's cgroup classid
* @sk_cgrp: this socket's cgroup-specific proto data
* @sk_write_pending: a write to stream socket waits to start
@@ -423,9 +423,7 @@ struct sock {
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
__u32 sk_priority;
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
- __u32 sk_cgrp_prioidx;
-#endif
+ __u32 sk_mark;
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
long sk_rcvtimeo;
@@ -443,7 +441,9 @@ struct sock {
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
- __u32 sk_mark;
+#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
+ u16 sk_cgrp_prioidx;
+#endif
#ifdef CONFIG_CGROUP_NET_CLASSID
u32 sk_classid;
#endif
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index cbd0a19..2b9159b 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,6 +27,12 @@
#include <linux/fdtable.h>
+/*
+ * netprio allocates per-net_device priomap array which is indexed by
+ * css->id. Limiting css ID to 16bits doesn't lose anything.
+ */
+#define NETPRIO_ID_MAX USHRT_MAX
+
#define PRIOMAP_MIN_SZ 128
/*
@@ -144,6 +150,9 @@ static int cgrp_css_online(struct cgroup_subsys_state *css)
struct net_device *dev;
int ret = 0;
+ if (css->id > NETPRIO_ID_MAX)
+ return -ENOSPC;
+
if (!parent_css)
return 0;
--
2.5.0
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
and its accessors are defined in cgroup-defs.h. This is to prepare
for overloading the fields with a cgroup pointer.
This patch mostly performs equivalent conversions but the followings
are noteworthy.
* Equality test before updating classid is removed from
sock_update_classid(). This shouldn't make any noticeable
difference and a similar test will be implemented on the helper side
later.
* sock_update_netprioidx() now takes struct sock_cgroup_data and can
be moved to netprio_cgroup.h without causing include dependency
loop. Moved.
* The dummy version of sock_update_netprioidx() converted to a static
inline function while at it.
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup-defs.h | 36 ++++++++++++++++++++++++++++++++++++
include/net/cls_cgroup.h | 11 +++++------
include/net/netprio_cgroup.h | 16 +++++++++++++---
include/net/sock.h | 11 +++--------
net/Kconfig | 6 ++++++
net/core/dev.c | 3 ++-
net/core/netclassid_cgroup.c | 4 ++--
net/core/netprio_cgroup.c | 3 ++-
net/core/scm.c | 4 ++--
net/core/sock.c | 15 ++-------------
net/netfilter/nft_meta.c | 2 +-
net/netfilter/xt_cgroup.c | 3 ++-
12 files changed, 76 insertions(+), 38 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 504d859..ed128fed 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -542,4 +542,40 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
#endif /* CONFIG_CGROUPS */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+struct sock_cgroup_data {
+ u16 prioidx;
+ u32 classid;
+};
+
+static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
+{
+ return skcd->prioidx;
+}
+
+static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
+{
+ return skcd->classid;
+}
+
+static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
+ u16 prioidx)
+{
+ skcd->prioidx = prioidx;
+}
+
+static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
+ u32 classid)
+{
+ skcd->classid = classid;
+}
+
+#else /* CONFIG_SOCK_CGROUP_DATA */
+
+struct sock_cgroup_data {
+};
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
#endif /* _LINUX_CGROUP_DEFS_H */
diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index ccd6d8b..c0a92e2 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -41,13 +41,12 @@ static inline u32 task_cls_classid(struct task_struct *p)
return classid;
}
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
{
u32 classid;
classid = task_cls_classid(current);
- if (classid != sk->sk_classid)
- sk->sk_classid = classid;
+ sock_cgroup_set_classid(skcd, classid);
}
static inline u32 task_get_classid(const struct sk_buff *skb)
@@ -64,17 +63,17 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
* softirqs always disables bh.
*/
if (in_serving_softirq()) {
- /* If there is an sk_classid we'll use that. */
+ /* If there is an sock_cgroup_classid we'll use that. */
if (!skb->sk)
return 0;
- classid = skb->sk->sk_classid;
+ classid = sock_cgroup_classid(&skb->sk->sk_cgrp_data);
}
return classid;
}
#else /* !CONFIG_CGROUP_NET_CLASSID */
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
{
}
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index f2a9597..6041905 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -25,8 +25,6 @@ struct netprio_map {
u32 priomap[];
};
-void sock_update_netprioidx(struct sock *sk);
-
static inline u32 task_netprioidx(struct task_struct *p)
{
struct cgroup_subsys_state *css;
@@ -38,13 +36,25 @@ static inline u32 task_netprioidx(struct task_struct *p)
rcu_read_unlock();
return idx;
}
+
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+ if (in_interrupt())
+ return;
+
+ sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
+}
+
#else /* !CONFIG_CGROUP_NET_PRIO */
+
static inline u32 task_netprioidx(struct task_struct *p)
{
return 0;
}
-#define sock_update_netprioidx(sk)
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+}
#endif /* CONFIG_CGROUP_NET_PRIO */
#endif /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index b517351..c4e3a30 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -58,6 +58,7 @@
#include <linux/memcontrol.h>
#include <linux/static_key.h>
#include <linux/sched.h>
+#include <linux/cgroup-defs.h>
#include <linux/filter.h>
#include <linux/rculist_nulls.h>
@@ -308,8 +309,7 @@ struct cg_proto;
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
- * @sk_cgrp_prioidx: socket group's priority map index
- * @sk_classid: this socket's cgroup classid
+ * @sk_cgrp_data: cgroup data for this cgroup
* @sk_cgrp: this socket's cgroup-specific proto data
* @sk_write_pending: a write to stream socket waits to start
* @sk_state_change: callback to indicate change in the state of the sock
@@ -441,12 +441,7 @@ struct sock {
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
- u16 sk_cgrp_prioidx;
-#endif
-#ifdef CONFIG_CGROUP_NET_CLASSID
- u32 sk_classid;
-#endif
+ struct sock_cgroup_data sk_cgrp_data;
struct cg_proto *sk_cgrp;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk);
diff --git a/net/Kconfig b/net/Kconfig
index 127da94..11f8c22 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -250,9 +250,14 @@ config XPS
depends on SMP
default y
+config SOCK_CGROUP_DATA
+ bool
+ default n
+
config CGROUP_NET_PRIO
bool "Network priority cgroup"
depends on CGROUPS
+ select SOCK_CGROUP_DATA
---help---
Cgroup subsystem for use in assigning processes to network priorities on
a per-interface basis.
@@ -260,6 +265,7 @@ config CGROUP_NET_PRIO
config CGROUP_NET_CLASSID
bool "Network classid cgroup"
depends on CGROUPS
+ select SOCK_CGROUP_DATA
---help---
Cgroup subsystem for use as general purpose socket classid marker that is
being used in cls_cgroup and for netfilter matching.
diff --git a/net/core/dev.c b/net/core/dev.c
index ab9b8d0..2346cfd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2925,7 +2925,8 @@ static void skb_update_prio(struct sk_buff *skb)
struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);
if (!skb->priority && skb->sk && map) {
- unsigned int prioidx = skb->sk->sk_cgrp_prioidx;
+ unsigned int prioidx =
+ sock_cgroup_prioidx(&skb->sk->sk_cgrp_data);
if (prioidx < map->priomap_len)
skb->priority = map->priomap[prioidx];
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 6441f47..991c513 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -62,8 +62,8 @@ static int update_classid(const void *v, struct file *file, unsigned n)
struct socket *sock = sock_from_file(file, &err);
if (sock)
- sock->sk->sk_classid = (u32)(unsigned long)v;
-
+ sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
+ (unsigned long)v);
return 0;
}
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 2b9159b..de42aa7 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -223,7 +223,8 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
int err;
struct socket *sock = sock_from_file(file, &err);
if (sock)
- sock->sk->sk_cgrp_prioidx = (u32)(unsigned long)v;
+ sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
+ (unsigned long)v);
return 0;
}
diff --git a/net/core/scm.c b/net/core/scm.c
index 3b6899b..2a3c7e2 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -289,8 +289,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
/* Bump the usage count and install the file. */
sock = sock_from_file(fp[i], &err);
if (sock) {
- sock_update_netprioidx(sock->sk);
- sock_update_classid(sock->sk);
+ sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+ sock_update_classid(&sock->sk->sk_cgrp_data);
}
fd_install(new_fd, get_file(fp[i]));
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 1e4dd54..35af060 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1393,17 +1393,6 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
module_put(owner);
}
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
-void sock_update_netprioidx(struct sock *sk)
-{
- if (in_interrupt())
- return;
-
- sk->sk_cgrp_prioidx = task_netprioidx(current);
-}
-EXPORT_SYMBOL_GPL(sock_update_netprioidx);
-#endif
-
/**
* sk_alloc - All socket objects are allocated here
* @net: the applicable net namespace
@@ -1432,8 +1421,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
sock_net_set(sk, net);
atomic_set(&sk->sk_wmem_alloc, 1);
- sock_update_classid(sk);
- sock_update_netprioidx(sk);
+ sock_update_classid(&sk->sk_cgrp_data);
+ sock_update_netprioidx(&sk->sk_cgrp_data);
}
return sk;
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 9dfaf4d..1915cab 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -174,7 +174,7 @@ void nft_meta_get_eval(const struct nft_expr *expr,
sk = skb_to_full_sk(skb);
if (!sk || !sk_fullsock(sk))
goto err;
- *dest = sk->sk_classid;
+ *dest = sock_cgroup_classid(&sk->sk_cgrp_data);
break;
#endif
default:
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index a1d126f..54eaeb4 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -42,7 +42,8 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
if (skb->sk == NULL || !sk_fullsock(skb->sk))
return false;
- return (info->id == skb->sk->sk_classid) ^ info->invert;
+ return (info->id == sock_cgroup_classid(&skb->sk->sk_cgrp_data)) ^
+ info->invert;
}
static struct xt_match cgroup_mt_reg __read_mostly = {
--
2.5.0
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
---
include/linux/cgroup-defs.h | 86 ++++++++++++++++++++++++++++++++++++++++----
include/linux/cgroup.h | 41 +++++++++++++++++++++
kernel/cgroup.c | 55 +++++++++++++++++++++++++++-
net/core/netclassid_cgroup.c | 6 +++-
net/core/netprio_cgroup.c | 7 +++-
net/core/sock.c | 2 ++
6 files changed, 188 insertions(+), 9 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ed128fed..76fe7c7 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -544,31 +544,105 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
#ifdef CONFIG_SOCK_CGROUP_DATA
+/*
+ * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
+ * per-socket cgroup information except for memcg association.
+ *
+ * On legacy hierarchies, net_prio and net_cls controllers directly set
+ * attributes on each sock which can then be tested by the network layer.
+ * On the default hierarchy, each sock is associated with the cgroup it was
+ * created in and the networking layer can match the cgroup directly.
+ *
+ * To avoid carrying all three cgroup related fields separately in sock,
+ * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
+ * On boot, sock_cgroup_data records the cgroup that the sock was created
+ * in so that cgroup2 matches can be made; however, once either net_prio or
+ * net_cls starts being used, the area is overriden to carry prioidx and/or
+ * classid. The two modes are distinguished by whether the lowest bit is
+ * set. Clear bit indicates cgroup pointer while set bit prioidx and
+ * classid.
+ *
+ * While userland may start using net_prio or net_cls at any time, once
+ * either is used, cgroup2 matching no longer works. There is no reason to
+ * mix the two and this is in line with how legacy and v2 compatibility is
+ * handled. On mode switch, cgroup references which are already being
+ * pointed to by socks may be leaked. While this can be remedied by adding
+ * synchronization around sock_cgroup_data, given that the number of leaked
+ * cgroups is bound and highly unlikely to be high, this seems to be the
+ * better trade-off.
+ */
struct sock_cgroup_data {
- u16 prioidx;
- u32 classid;
+ union {
+#ifdef __LITTLE_ENDIAN
+ struct {
+ u8 is_data;
+ u8 padding;
+ u16 prioidx;
+ u32 classid;
+ } __packed;
+#else
+ struct {
+ u32 classid;
+ u16 prioidx;
+ u8 padding;
+ u8 is_data;
+ } __packed;
+#endif
+ u64 val;
+ };
};
+/*
+ * There's a theoretical window where the following accessors race with
+ * updaters and return part of the previous pointer as the prioidx or
+ * classid. Such races are short-lived and the result isn't critical.
+ */
static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
{
- return skcd->prioidx;
+ return (skcd->is_data & 1) ? skcd->prioidx : 1;
}
static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
{
- return skcd->classid;
+ return (skcd->is_data & 1) ? skcd->classid : 0;
}
+/*
+ * If invoked concurrently, the updaters may clobber each other. The
+ * caller is responsible for synchronization.
+ */
static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
u16 prioidx)
{
- skcd->prioidx = prioidx;
+ struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+ if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
+ return;
+
+ if (!(skcd_buf.is_data & 1)) {
+ skcd_buf.val = 0;
+ skcd_buf.is_data = 1;
+ }
+
+ skcd_buf.prioidx = prioidx;
+ WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
}
static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
u32 classid)
{
- skcd->classid = classid;
+ struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+ if (sock_cgroup_classid(&skcd_buf) == classid)
+ return;
+
+ if (!(skcd_buf.is_data & 1)) {
+ skcd_buf.val = 0;
+ skcd_buf.is_data = 1;
+ }
+
+ skcd_buf.classid = classid;
+ WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
}
#else /* CONFIG_SOCK_CGROUP_DATA */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4c3ffab..a8ba1ea 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -578,4 +578,45 @@ static inline int cgroup_init(void) { return 0; }
#endif /* !CONFIG_CGROUPS */
+/*
+ * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+extern spinlock_t cgroup_sk_update_lock;
+#endif
+
+void cgroup_sk_alloc_disable(void);
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_free(struct sock_cgroup_data *skcd);
+
+static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
+{
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+ unsigned long v;
+
+ /*
+ * @skcd->val is 64bit but the following is safe on 32bit too as we
+ * just need the lower ulong to be written and read atomically.
+ */
+ v = READ_ONCE(skcd->val);
+
+ if (v & 1)
+ return &cgrp_dfl_root.cgrp;
+
+ return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
+#else
+ return (struct cgroup *)(unsigned long)skcd->val;
+#endif
+}
+
+#else /* CONFIG_CGROUP_DATA */
+
+static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
+
+#endif /* CONFIG_CGROUP_DATA */
+
#endif /* _LINUX_CGROUP_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3db5e8f..4f8f792 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,8 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
-
#include <linux/atomic.h>
+#include <net/sock.h>
/*
* pidlists linger the following amount before being destroyed. The goal
@@ -5782,6 +5782,59 @@ struct cgroup *cgroup_get_from_path(const char *path)
}
EXPORT_SYMBOL_GPL(cgroup_get_from_path);
+/*
+ * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+
+spinlock_t cgroup_sk_update_lock;
+static bool cgroup_sk_alloc_disabled __read_mostly;
+
+void cgroup_sk_alloc_disable(void)
+{
+ if (cgroup_sk_alloc_disabled)
+ return;
+ pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
+ cgroup_sk_alloc_disabled = true;
+}
+
+#else
+
+#define cgroup_sk_alloc_disabled false
+
+#endif
+
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
+{
+ if (cgroup_sk_alloc_disabled)
+ return;
+
+ rcu_read_lock();
+
+ while (true) {
+ struct css_set *cset;
+
+ cset = task_css_set(current);
+ if (likely(cgroup_tryget(cset->dfl_cgrp))) {
+ skcd->val = (unsigned long)cset->dfl_cgrp;
+ break;
+ }
+ cpu_relax();
+ }
+
+ rcu_read_unlock();
+}
+
+void cgroup_sk_free(struct sock_cgroup_data *skcd)
+{
+ cgroup_put(sock_cgroup_ptr(skcd));
+}
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 991c513..a575538 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -61,9 +61,12 @@ static int update_classid(const void *v, struct file *file, unsigned n)
int err;
struct socket *sock = sock_from_file(file, &err);
- if (sock)
+ if (sock) {
+ spin_lock(&cgroup_sk_update_lock);
sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
(unsigned long)v);
+ spin_unlock(&cgroup_sk_update_lock);
+ }
return 0;
}
@@ -89,6 +92,7 @@ static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
u64 value)
{
+ cgroup_sk_alloc_disable();
css_cls_state(css)->classid = (u32) value;
return 0;
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index de42aa7..053d60c 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -209,6 +209,8 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
if (!dev)
return -ENODEV;
+ cgroup_sk_alloc_disable();
+
rtnl_lock();
ret = netprio_set_prio(of_css(of), dev, prio);
@@ -222,9 +224,12 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
{
int err;
struct socket *sock = sock_from_file(file, &err);
- if (sock)
+ if (sock) {
+ spin_lock(&cgroup_sk_update_lock);
sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
(unsigned long)v);
+ spin_unlock(&cgroup_sk_update_lock);
+ }
return 0;
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 35af060..ecce5df 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1363,6 +1363,7 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
if (!try_module_get(prot->owner))
goto out_free_sec;
sk_tx_queue_clear(sk);
+ cgroup_sk_alloc(&sk->sk_cgrp_data);
}
return sk;
@@ -1385,6 +1386,7 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
owner = prot->owner;
slab = prot->slab;
+ cgroup_sk_free(&sk->sk_cgrp_data);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
--
2.5.0
This patch implements xt_cgroup2 which matches cgroup2 membership of
the associated socket. The match is recursive and invertible.
For rationales on introducing another cgroup based match, please refer
to a preceding commit "sock, cgroup: add sock->sk_cgroup".
v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX. Added
explicit alignment to the priv field. Both suggested by Jan.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
Cc: Jan Engelhardt <[email protected]>
---
include/uapi/linux/netfilter/xt_cgroup2.h | 15 ++++++
net/netfilter/Kconfig | 10 ++++
net/netfilter/Makefile | 1 +
net/netfilter/xt_cgroup2.c | 78 +++++++++++++++++++++++++++++++
4 files changed, 104 insertions(+)
create mode 100644 include/uapi/linux/netfilter/xt_cgroup2.h
create mode 100644 net/netfilter/xt_cgroup2.c
diff --git a/include/uapi/linux/netfilter/xt_cgroup2.h b/include/uapi/linux/netfilter/xt_cgroup2.h
new file mode 100644
index 0000000..ea51335
--- /dev/null
+++ b/include/uapi/linux/netfilter/xt_cgroup2.h
@@ -0,0 +1,15 @@
+#ifndef _XT_CGROUP2_H
+#define _XT_CGROUP2_H
+
+#include <linux/types.h>
+#include <linux/limits.h>
+
+struct xt_cgroup2_info {
+ char path[PATH_MAX];
+ __u8 invert;
+
+ /* kernel internal data */
+ void *priv __attribute__((aligned(8)));
+};
+
+#endif /* _XT_CGROUP2_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e22349e..5fd12dd 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -974,6 +974,16 @@ config NETFILTER_XT_MATCH_BPF
To compile it as a module, choose M here. If unsure, say N.
+config NETFILTER_XT_MATCH_CGROUP2
+ tristate '"cgroup2" match support'
+ depends on NETFILTER_ADVANCED
+ depends on CGROUPS
+ select SOCK_CGROUP_DATA
+ help
+ cgroup2 matching allows you to match locally generated and
+ early demuxed packets based on the v2 cgroup the socket is
+ associated with on creation.
+
config NETFILTER_XT_MATCH_CGROUP
tristate '"control group" match support'
depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 7638c36..86cee05 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -152,6 +152,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_MULTIPORT) += xt_multiport.o
obj-$(CONFIG_NETFILTER_XT_MATCH_NFACCT) += xt_nfacct.o
obj-$(CONFIG_NETFILTER_XT_MATCH_OSF) += xt_osf.o
obj-$(CONFIG_NETFILTER_XT_MATCH_OWNER) += xt_owner.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_CGROUP2) += xt_cgroup2.o
obj-$(CONFIG_NETFILTER_XT_MATCH_CGROUP) += xt_cgroup.o
obj-$(CONFIG_NETFILTER_XT_MATCH_PHYSDEV) += xt_physdev.o
obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
diff --git a/net/netfilter/xt_cgroup2.c b/net/netfilter/xt_cgroup2.c
new file mode 100644
index 0000000..39884b3
--- /dev/null
+++ b/net/netfilter/xt_cgroup2.c
@@ -0,0 +1,78 @@
+#include <linux/skbuff.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_cgroup2.h>
+#include <net/sock.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tejun Heo <[email protected]>");
+MODULE_DESCRIPTION("Xtables: cgroup2 socket ownership matching");
+MODULE_ALIAS("ipt_cgroup2");
+MODULE_ALIAS("ip6t_cgroup2");
+
+static int cgroup2_mt_check(const struct xt_mtchk_param *par)
+{
+ struct xt_cgroup2_info *info = par->matchinfo;
+ struct cgroup *cgrp;
+
+ if (info->invert & ~1)
+ return -EINVAL;
+
+ cgrp = cgroup_get_from_path(info->path);
+ if (IS_ERR(cgrp)) {
+ pr_info("xt_cgroup2: invalid path, errno=%ld\n", PTR_ERR(cgrp));
+ return -EINVAL;
+ }
+ info->priv = cgrp;
+
+ return 0;
+}
+
+static bool cgroup2_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+ const struct xt_cgroup2_info *info = par->matchinfo;
+ struct cgroup *ancestor = info->priv;
+ struct cgroup *cgrp;
+
+ if (!skb->sk || !sk_fullsock(skb->sk))
+ return false;
+
+ cgrp = sock_cgroup_ptr(&skb->sk->sk_cgrp_data);
+
+ return cgroup_is_descendant(cgrp, ancestor) ^ info->invert;
+}
+
+static void cgroup2_mt_destroy(const struct xt_mtdtor_param *par)
+{
+ struct xt_cgroup2_info *info = par->matchinfo;
+
+ cgroup_put(info->priv);
+}
+
+static struct xt_match cgroup2_mt_reg __read_mostly = {
+ .name = "cgroup2",
+ .revision = 0,
+ .family = NFPROTO_UNSPEC,
+ .checkentry = cgroup2_mt_check,
+ .match = cgroup2_mt,
+ .matchsize = sizeof(struct xt_cgroup2_info),
+ .destroy = cgroup2_mt_destroy,
+ .me = THIS_MODULE,
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+};
+
+static int __init cgroup2_mt_init(void)
+{
+ return xt_register_match(&cgroup2_mt_reg);
+}
+
+static void __exit cgroup2_mt_exit(void)
+{
+ xt_unregister_match(&cgroup2_mt_reg);
+}
+
+module_init(cgroup2_mt_init);
+module_exit(cgroup2_mt_exit);
--
2.5.0
This patch adds the extension for the xt_cgroup2 which matches packets
based on the v2 cgroup path of the associated socket.
v2: cgroup2_match->userspacesize and ->save and man page updated as
per Jan.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Jan Engelhardt <[email protected]>
---
extensions/libxt_cgroup2.c | 71 +++++++++++++++++++++++++++++++++++
extensions/libxt_cgroup2.man | 24 +++++++++++
include/linux/netfilter/xt_cgroup2.h | 15 +++++++
3 files changed, 110 insertions(+)
--- /dev/null
+++ b/extensions/libxt_cgroup2.c
@@ -0,0 +1,71 @@
+#include <stdio.h>
+#include <mntent.h>
+#include <xtables.h>
+#include <linux/netfilter/xt_cgroup2.h>
+#include <linux/limits.h>
+
+enum {
+ O_CGROUP2_PATH,
+};
+
+static void cgroup2_help(void)
+{
+ printf(
+"cgroup2 match options:\n"
+"[!] --path path path relative to cgroup2 root, recursive match\n");
+}
+
+static const struct xt_option_entry cgroup2_opts[] = {
+ {
+ .name = "path",
+ .id = O_CGROUP2_PATH,
+ .type = XTTYPE_STRING,
+ .flags = XTOPT_INVERT | XTOPT_MAND | XTOPT_PUT,
+ XTOPT_POINTER(struct xt_cgroup2_info, path)
+ },
+ XTOPT_TABLEEND,
+};
+
+static void cgroup2_parse(struct xt_option_call *cb)
+{
+ struct xt_cgroup2_info *info = cb->data;
+
+ xtables_option_parse(cb);
+
+ if (cb->invert)
+ info->invert = true;
+}
+
+static void cgroup2_print(const void *ip, const struct xt_entry_match *match,
+ int numeric)
+{
+ const struct xt_cgroup2_info *info = (void *)match->data;
+
+ printf(" cgroup2 %s%s", info->invert ? "! ":"", info->path);
+}
+
+static void cgroup2_save(const void *ip, const struct xt_entry_match *match)
+{
+ const struct xt_cgroup2_info *info = (void *)match->data;
+
+ printf("%s --path", info->invert ? " !" : "");
+ xtables_save_string(info->path);
+}
+
+static struct xtables_match cgroup2_match = {
+ .family = NFPROTO_UNSPEC,
+ .name = "cgroup2",
+ .version = XTABLES_VERSION,
+ .size = XT_ALIGN(sizeof(struct xt_cgroup2_info)),
+ .userspacesize = offsetof(struct xt_cgroup2_info, priv),
+ .help = cgroup2_help,
+ .print = cgroup2_print,
+ .save = cgroup2_save,
+ .x6_parse = cgroup2_parse,
+ .x6_options = cgroup2_opts,
+};
+
+void _init(void)
+{
+ xtables_register_match(&cgroup2_match);
+}
--- /dev/null
+++ b/extensions/libxt_cgroup2.man
@@ -0,0 +1,24 @@
+.TP
+[\fB!\fP] \fB\-\-path\fP \fIpath\fP
+Match cgroup2 membership.
+
+Each socket is associated with the v2 cgroup of the creating process.
+This matches packets coming from or going to all sockets in the
+sub-hierarchy of the specified path. The path should be relative to
+the root of the cgroup2 hierarchy. Can be used in the OUTPUT and
+INPUT chains to assign particular firewall policies for aggregated
+processes on the system. This allows for more fine-grained firewall
+policies that only match for a subset of the system's processes.
+
+\fBIMPORTANT\fP: when being used in the INPUT chain, the cgroup2
+matcher is currently only of limited functionality, meaning it
+will only match on packets that are processed for local sockets
+through early socket demuxing. Therefore, general usage on the
+INPUT chain is not advised unless the implications are well
+understood.
+.PP
+Example:
+.IP
+iptables \-A OUTPUT \-p tcp \-\-sport 80 \-m cgroup2 ! \-\-path service/http-server \-j DROP
+.PP
+Available since Linux 4.5.
--- /dev/null
+++ b/include/linux/netfilter/xt_cgroup2.h
@@ -0,0 +1,15 @@
+#ifndef _XT_CGROUP2_H
+#define _XT_CGROUP2_H
+
+#include <linux/types.h>
+#include <linux/limits.h>
+
+struct xt_cgroup2_info {
+ char path[PATH_MAX];
+ __u8 invert;
+
+ /* kernel internal data */
+ void *priv __attribute__((aligned(8)));
+};
+
+#endif /* _XT_CGROUP2_H */
Replace "disadviced" with "not advised" as suggested by Jan.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Jan Engelhardt <[email protected]>
---
extensions/libxt_cgroup.man | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/extensions/libxt_cgroup.man
+++ b/extensions/libxt_cgroup.man
@@ -12,7 +12,7 @@ the net_cls cgroup's id.
matcher is currently only of limited functionality, meaning it
will only match on packets that are processed for local sockets
through early socket demuxing. Therefore, general usage on the
-INPUT chain is disadviced unless the implications are well
+INPUT chain is not advised unless the implications are well
understood.
.PP
Example:
On Thu, Nov 19, 2015 at 01:52:46PM -0500, Tejun Heo wrote:
> Implement kernfs_walk_and_get() which is similar to
> kernfs_find_and_get() but can walk a path instead of just a name.
>
> v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
> David.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: David Miller <[email protected]>
> ---
> fs/kernfs/dir.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kernfs.h | 12 ++++++++++++
> 2 files changed, 58 insertions(+)
Acked-by: Greg Kroah-Hartman <[email protected]>
On 11/19/2015 07:52 PM, Tejun Heo wrote:
> netprio builds per-netdev contiguous priomap array which is indexed by
> css->id. The array is allocated using kzalloc() effectively limiting
> the maximum ID supported to some thousand range. This patch caps the
> maximum supported css->id to USHRT_MAX which should be way above what
> is actually useable.
>
> This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
> up part will be used to overload the cgroup related fields.
> sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
> two cgroup related fields are adjacent.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Daniel Borkmann <[email protected]>
> Cc: Daniel Wagner <[email protected]>
Acked-by: Daniel Wagner <[email protected]>
Hi Tejun,
On 11/19/2015 07:52 PM, Tejun Heo wrote:
> +/*
> + * There's a theoretical window where the following accessors race with
> + * updaters and return part of the previous pointer as the prioidx or
> + * classid. Such races are short-lived and the result isn't critical.
> + */
> static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
> {
> - return skcd->prioidx;
> + return (skcd->is_data & 1) ? skcd->prioidx : 1;
> }
>
> static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
> {
> - return skcd->classid;
> + return (skcd->is_data & 1) ? skcd->classid : 0;
> }
I still try to understand what the code does, hence this stupid question:
Why is sock_cgroup_prioidx() returning 1 if is not data and
sock_cgroup_classid() a 0?
thanks,
daniel
From: Tejun Heo <[email protected]>
Date: Thu, 19 Nov 2015 13:52:44 -0500
> This is the second take of the xt_cgroup2 patchset. Changes from the
> last take are
>
> * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
> carries either (prioidx, classid) pair or cgroup2 pointer. This
> avoids inflating struct sock with yet another cgroup related field.
> Unfortunately, this does add some complexity but that's the
> trade-off and the complexity is contained in cgroup proper.
>
> * Various small updats as per David and Jan's reviews.
I like this a lot better, thanks.
Please address Daniel's feedback on patch #6 and then I'm personally
fine with this series.
Pablo, are you ok with me merging this into net-next directly or
would you rather I take patches 1-6 into net-next and then you can
merge and then add patch #7 on top?
Thanks.
Hello, Daniel.
On Fri, Nov 20, 2015 at 12:04:05PM +0100, Daniel Wagner wrote:
> > static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
> > {
> > - return skcd->prioidx;
> > + return (skcd->is_data & 1) ? skcd->prioidx : 1;
> > }
> >
> > static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
> > {
> > - return skcd->classid;
> > + return (skcd->is_data & 1) ? skcd->classid : 0;
> > }
>
>
> I still try to understand what the code does, hence this stupid question:
>
> Why is sock_cgroup_prioidx() returning 1 if is not data and
> sock_cgroup_classid() a 0?
I prolly should have added comments there. prioidx carries the cgroup
ID on the hierarchy net_prio is attached to, so if nothing is
configured, the default value would be the ID of the root cgroup which
is always 1. For net_cls, the unconfigured default value is zero.
Will refresh the patch with comments.
Thanks.
--
tejun
On Fri, Nov 20, 2015 at 01:59:12PM -0500, David Miller wrote:
> From: Tejun Heo <[email protected]>
> Date: Thu, 19 Nov 2015 13:52:44 -0500
>
> > This is the second take of the xt_cgroup2 patchset. Changes from the
> > last take are
> >
> > * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
> > carries either (prioidx, classid) pair or cgroup2 pointer. This
> > avoids inflating struct sock with yet another cgroup related field.
> > Unfortunately, this does add some complexity but that's the
> > trade-off and the complexity is contained in cgroup proper.
> >
> > * Various small updats as per David and Jan's reviews.
>
> I like this a lot better, thanks.
>
> Please address Daniel's feedback on patch #6 and then I'm personally
> fine with this series.
>
> Pablo, are you ok with me merging this into net-next directly or
> would you rather I take patches 1-6 into net-next and then you can
> merge and then add patch #7 on top?
I'd suggest you get 1-6, then I'll pull this info my tree. Thanks David!
Regarding #7, I have a couple two concerns:
1) cgroup currently doesn't work the way users expect, ie. to perform any
reasonable firewalling. Since this relies on early demux, only a
limited number of sockets get access to the cgroup info.
2) We have traditionally rejected match2 and target2 extensions. I
guess you can accomodate the new cgroup code through the revision
iptables infrastructure, so we still use the cgroup match.
Let me know, thanks.
On Fri, Nov 20, 2015 at 08:56:25PM +0100, Pablo Neira Ayuso wrote:
> Regarding #7, I have a couple two concerns:
>
> 1) cgroup currently doesn't work the way users expect, ie. to perform any
> reasonable firewalling. Since this relies on early demux, only a
> limited number of sockets get access to the cgroup info.
Ops sorry, I forgot to indicate that I'm refering to the INPUT chain.
> 2) We have traditionally rejected match2 and target2 extensions. I
> guess you can accomodate the new cgroup code through the revision
> iptables infrastructure, so we still use the cgroup match.
>
> Let me know, thanks.
Hello, David, Pablo.
On Fri, Nov 20, 2015 at 08:56:25PM +0100, Pablo Neira Ayuso wrote:
> > Pablo, are you ok with me merging this into net-next directly or
> > would you rather I take patches 1-6 into net-next and then you can
> > merge and then add patch #7 on top?
>
> I'd suggest you get 1-6, then I'll pull this info my tree. Thanks David!
Hmm.... 1-3 will be needed to address similar issues in a different
controller, so putting them in a separate branch would work best. I
created a branch which contains the 1-3 on top of v4.4-rc1.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test
If creating a different branch from net side is better, please let me
know.
> Regarding #7, I have a couple two concerns:
>
> 1) cgroup currently doesn't work the way users expect, ie. to perform any
> reasonable firewalling. Since this relies on early demux, only a
> limited number of sockets get access to the cgroup info.
Right, it doesn't work well on INPUT side, so the big warning in the
man page.
> 2) We have traditionally rejected match2 and target2 extensions. I
> guess you can accomodate the new cgroup code through the revision
> iptables infrastructure, so we still use the cgroup match.
I thought it would be confusing because the two are completely
separate. Hmmm... okay, I'll merge it into xt_cgroup.
Thanks.
--
tejun
On Thu, Nov 19, 2015 at 08:41:04PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 19, 2015 at 01:52:46PM -0500, Tejun Heo wrote:
> > Implement kernfs_walk_and_get() which is similar to
> > kernfs_find_and_get() but can walk a path instead of just a name.
> >
> > v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
> > David.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Cc: Greg Kroah-Hartman <[email protected]>
> > Cc: David Miller <[email protected]>
> > ---
> > fs/kernfs/dir.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/kernfs.h | 12 ++++++++++++
> > 2 files changed, 58 insertions(+)
>
> Acked-by: Greg Kroah-Hartman <[email protected]>
Greg, would it be okay to route this one through either cgroup or net
tree?
Thanks.
--
tejun
On Fri, Nov 20, 2015 at 04:12:54PM -0500, Tejun Heo wrote:
> On Thu, Nov 19, 2015 at 08:41:04PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 19, 2015 at 01:52:46PM -0500, Tejun Heo wrote:
> > > Implement kernfs_walk_and_get() which is similar to
> > > kernfs_find_and_get() but can walk a path instead of just a name.
> > >
> > > v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
> > > David.
> > >
> > > Signed-off-by: Tejun Heo <[email protected]>
> > > Cc: Greg Kroah-Hartman <[email protected]>
> > > Cc: David Miller <[email protected]>
> > > ---
> > > fs/kernfs/dir.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > include/linux/kernfs.h | 12 ++++++++++++
> > > 2 files changed, 58 insertions(+)
> >
> > Acked-by: Greg Kroah-Hartman <[email protected]>
>
> Greg, would it be okay to route this one through either cgroup or net
> tree?
Either is fine with me, whatever works best for you.
greg k-h