2015-11-21 16:14:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

Hello,

This is v3 of the xt_cgroup2 patchset. Changes from the last take are

* Folded cgroup2 path matching into xt_cgroup as a new revision rather
than a separate xt_cgroup2 match as suggested by Pablo.

* Refreshed on top of Nina's net_cls dynamic config update fix patch.
I included the fix patch as part of this series to ease reviewing.

The changes from v1 to v2 are

* Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
carries either (prioidx, classid) pair or cgroup2 pointer. This
avoids inflating struct sock with yet another cgroup related field.
Unfortunately, this does add some complexity but that's the
trade-off and the complexity is contained in cgroup proper.

* Various small updats as per David and Jan's reviews.


In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.

net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

This patchset includes the following nine patches.

0001-cgroup-record-ancestor-IDs-and-reimplement-cgroup_is.patch
0002-kernfs-implement-kernfs_walk_and_get.patch
0003-cgroup-implement-cgroup_get_from_path-and-expose-cgr.patch
0004-cgroups-Allow-dynamically-changing-net_classid.patch
0005-netprio_cgroup-limit-the-maximum-css-id-to-USHRT_MAX.patch
0006-net-wrap-sock-sk_cgrp_prioidx-and-sk_classid-inside-.patch
0007-sock-cgroup-add-sock-sk_cgroup.patch
0008-netfilter-prepare-xt_cgroup-for-multi-revisions.patch
0009-netfilter-implement-xt_cgroup-cgroup2-path-match.patch

0001-0003 are prepatory patches in kernfs and cgroup. These patches
are available in the following branch which will stay stable.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test

0004 is the following net_cls config update fix patch included in this
series to ease reviewing as it causes a conflict with a later patch in
this series.

http://lkml.kernel.org/g/[email protected]

0005-0007 consolidate two cgroup related fields in struct sock into
cgroup_sock_data and update it so that it can alternatively carry a
cgroup pointer.

0008-0009 implement cgroup2 patch matching in xt_cgroup.

This patchset is on top of v4.4-rc1 and also available in the
following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2

I'll post iptables extension as a reply. diffstat follows. Thanks.

fs/kernfs/dir.c | 46 +++++++++++
include/linux/cgroup-defs.h | 126 +++++++++++++++++++++++++++++++
include/linux/cgroup.h | 66 +++++++++++++++-
include/linux/kernfs.h | 12 ++
include/net/cls_cgroup.h | 11 +-
include/net/netprio_cgroup.h | 16 +++
include/net/sock.h | 13 ---
include/uapi/linux/netfilter/xt_cgroup.h | 15 +++
kernel/cgroup.c | 126 ++++++++++++++++++++++++-------
net/Kconfig | 6 +
net/core/dev.c | 3
net/core/netclassid_cgroup.c | 37 ++++++---
net/core/netprio_cgroup.c | 19 ++++
net/core/scm.c | 4
net/core/sock.c | 17 ----
net/netfilter/nft_meta.c | 2
net/netfilter/xt_cgroup.c | 108 ++++++++++++++++++++++----
17 files changed, 531 insertions(+), 96 deletions(-)

--
tejun


2015-11-21 16:15:28

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/9] cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it

cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question. While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.

This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level. cgroup_is_descendant() is accordingly reimplmented
and made inline.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup-defs.h | 14 ++++++++++++++
include/linux/cgroup.h | 18 +++++++++++++++++-
kernel/cgroup.c | 32 ++++++++++----------------------
3 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 60d44b2..504d859 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -235,6 +235,14 @@ struct cgroup {
int id;

/*
+ * The depth this cgroup is at. The root is at depth zero and each
+ * step down the hierarchy increments the level. This along with
+ * ancestor_ids[] can determine whether a given cgroup is a
+ * descendant of another without traversing the hierarchy.
+ */
+ int level;
+
+ /*
* Each non-empty css_set associated with this cgroup contributes
* one to populated_cnt. All children with non-zero popuplated_cnt
* of their own contribute one. The count is zero iff there's no
@@ -289,6 +297,9 @@ struct cgroup {

/* used to schedule release agent */
struct work_struct release_agent_work;
+
+ /* ids of the ancestors at each level including self */
+ int ancestor_ids[];
};

/*
@@ -308,6 +319,9 @@ struct cgroup_root {
/* The root cgroup. Root is destroyed on its release. */
struct cgroup cgrp;

+ /* for cgrp->ancestor_ids[0] */
+ int cgrp_ancestor_id_storage;
+
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps;

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 22e3754..b5ee2c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,7 +81,6 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);

-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor);
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);

@@ -459,6 +458,23 @@ static inline struct cgroup *task_cgroup(struct task_struct *task,
return task_css(task, subsys_id)->cgroup;
}

+/**
+ * cgroup_is_descendant - test ancestry
+ * @cgrp: the cgroup to be tested
+ * @ancestor: possible ancestor of @cgrp
+ *
+ * Test whether @cgrp is a descendant of @ancestor. It also returns %true
+ * if @cgrp == @ancestor. This function is safe to call as long as @cgrp
+ * and @ancestor are accessible.
+ */
+static inline bool cgroup_is_descendant(struct cgroup *cgrp,
+ struct cgroup *ancestor)
+{
+ if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
+ return false;
+ return cgrp->ancestor_ids[ancestor->level] == ancestor->id;
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_is_populated(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f1603c1..3190040 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -459,25 +459,6 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
}
EXPORT_SYMBOL_GPL(of_css);

-/**
- * cgroup_is_descendant - test ancestry
- * @cgrp: the cgroup to be tested
- * @ancestor: possible ancestor of @cgrp
- *
- * Test whether @cgrp is a descendant of @ancestor. It also returns %true
- * if @cgrp == @ancestor. This function is safe to call as long as @cgrp
- * and @ancestor are accessible.
- */
-bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor)
-{
- while (cgrp) {
- if (cgrp == ancestor)
- return true;
- cgrp = cgroup_parent(cgrp);
- }
- return false;
-}
-
static int notify_on_release(const struct cgroup *cgrp)
{
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
@@ -1903,6 +1884,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
if (ret < 0)
goto out;
root_cgrp->id = ret;
+ root_cgrp->ancestor_ids[0] = ret;

ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, 0,
GFP_KERNEL);
@@ -4846,11 +4828,11 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
umode_t mode)
{
- struct cgroup *parent, *cgrp;
+ struct cgroup *parent, *cgrp, *tcgrp;
struct cgroup_root *root;
struct cgroup_subsys *ss;
struct kernfs_node *kn;
- int ssid, ret;
+ int level, ssid, ret;

/* Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
*/
@@ -4861,9 +4843,11 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
if (!parent)
return -ENODEV;
root = parent->root;
+ level = parent->level + 1;

/* allocate the cgroup and its ID, 0 is reserved for the root */
- cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
+ cgrp = kzalloc(sizeof(*cgrp) +
+ sizeof(cgrp->ancestor_ids[0]) * (level + 1), GFP_KERNEL);
if (!cgrp) {
ret = -ENOMEM;
goto out_unlock;
@@ -4887,6 +4871,10 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,

cgrp->self.parent = &parent->self;
cgrp->root = root;
+ cgrp->level = level;
+
+ for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
+ cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;

if (notify_on_release(parent))
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
--
2.5.0

2015-11-21 16:14:32

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/9] kernfs: implement kernfs_walk_and_get()

Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.

v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
David.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Cc: David Miller <[email protected]>
---
fs/kernfs/dir.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 12 ++++++++++++
2 files changed, 58 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..742bf4a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -694,6 +694,29 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
return NULL;
}

+static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
+ const unsigned char *path,
+ const void *ns)
+{
+ static char path_buf[PATH_MAX]; /* protected by kernfs_mutex */
+ size_t len = strlcpy(path_buf, path, PATH_MAX);
+ char *p = path_buf;
+ char *name;
+
+ lockdep_assert_held(&kernfs_mutex);
+
+ if (len >= PATH_MAX)
+ return NULL;
+
+ while ((name = strsep(&p, "/")) && parent) {
+ if (*name == '\0')
+ continue;
+ parent = kernfs_find_ns(parent, name, ns);
+ }
+
+ return parent;
+}
+
/**
* kernfs_find_and_get_ns - find and get kernfs_node with the given name
* @parent: kernfs_node to search under
@@ -719,6 +742,29 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
EXPORT_SYMBOL_GPL(kernfs_find_and_get_ns);

/**
+ * kernfs_walk_and_get_ns - find and get kernfs_node with the given path
+ * @parent: kernfs_node to search under
+ * @path: path to look for
+ * @ns: the namespace tag to use
+ *
+ * Look for kernfs_node with path @path under @parent and get a reference
+ * if found. This function may sleep and returns pointer to the found
+ * kernfs_node on success, %NULL on failure.
+ */
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+ const char *path, const void *ns)
+{
+ struct kernfs_node *kn;
+
+ mutex_lock(&kernfs_mutex);
+ kn = kernfs_walk_ns(parent, path, ns);
+ kernfs_get(kn);
+ mutex_unlock(&kernfs_mutex);
+
+ return kn;
+}
+
+/**
* kernfs_create_root - create a new kernfs hierarchy
* @scops: optional syscall operations for the hierarchy
* @flags: KERNFS_ROOT_* flags
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5d4e9c4..af51df3 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn);
struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn);
struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns);
+struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
+ const char *path, const void *ns);
void kernfs_get(struct kernfs_node *kn);
void kernfs_put(struct kernfs_node *kn);

@@ -350,6 +352,10 @@ static inline struct kernfs_node *
kernfs_find_and_get_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{ return NULL; }
+static inline struct kernfs_node *
+kernfs_walk_and_get_ns(struct kernfs_node *parent, const char *path,
+ const void *ns)
+{ return NULL; }

static inline void kernfs_get(struct kernfs_node *kn) { }
static inline void kernfs_put(struct kernfs_node *kn) { }
@@ -431,6 +437,12 @@ kernfs_find_and_get(struct kernfs_node *kn, const char *name)
}

static inline struct kernfs_node *
+kernfs_walk_and_get(struct kernfs_node *kn, const char *path)
+{
+ return kernfs_walk_and_get_ns(kn, path, NULL);
+}
+
+static inline struct kernfs_node *
kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
void *priv)
{
--
2.5.0

2015-11-21 16:15:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/9] cgroup: implement cgroup_get_from_path() and expose cgroup_put()

Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path. This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.

v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 7 +++++++
kernel/cgroup.c | 39 ++++++++++++++++++++++++++++++++++-----
2 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5ee2c4..4c3ffab 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
struct cgroup_subsys *ss);

+struct cgroup *cgroup_get_from_path(const char *path);
+
int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);

@@ -351,6 +353,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
percpu_ref_put_many(&css->refcnt, n);
}

+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/**
* task_css_set_check - obtain a task's css_set with extra access conditions
* @task: the task to obtain css_set for
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3190040..3db5e8f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp)
return css_tryget(&cgrp->self);
}

-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
}

+/**
+ * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path
+ * @path: path on the default hierarchy
+ *
+ * Find the cgroup at @path on the default hierarchy, increment its
+ * reference count and return it. Returns pointer to the found cgroup on
+ * success, ERR_PTR(-ENOENT) if @path doens't exist and ERR_PTR(-ENOTDIR)
+ * if @path points to a non-directory.
+ */
+struct cgroup *cgroup_get_from_path(const char *path)
+{
+ struct kernfs_node *kn;
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+
+ kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
+ if (kn) {
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ cgrp = kn->priv;
+ cgroup_get(cgrp);
+ } else {
+ cgrp = ERR_PTR(-ENOTDIR);
+ }
+ kernfs_put(kn);
+ } else {
+ cgrp = ERR_PTR(-ENOENT);
+ }
+
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_get_from_path);
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
--
2.5.0

2015-11-21 16:15:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/9] cgroups: Allow dynamically changing net_classid

From: Nina Schiff <[email protected]>

The classid of a process is changed either when a process is moved to
or from a cgroup or when the net_cls.classid file is updated.
Previously net_cls only supported propogating these changes to the
cgroup's related sockets when a process was added or removed from the
cgroup. This means it was neccessary to remove and re-add all processes
to a cgroup in order to update its classid. This change introduces
support for doing this dynamically - i.e. when the value is changed in
the net_cls_classid file, this will also trigger an update to the
classid associated with all sockets controlled by the cgroup.
This mimics the behaviour of other cgroup subsystems.
net_prio circumvents this issue by storing an index into a table with
each socket (and so any updates to the table, don't require updating
the value associated with the socket). net_cls, however, passes the
socket the classid directly, and so this additional step is needed.

Signed-off-by: Nina Schiff <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
net/core/netclassid_cgroup.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 6441f47..2e4df84 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -56,7 +56,7 @@ static void cgrp_css_free(struct cgroup_subsys_state *css)
kfree(css_cls_state(css));
}

-static int update_classid(const void *v, struct file *file, unsigned n)
+static int update_classid_sock(const void *v, struct file *file, unsigned n)
{
int err;
struct socket *sock = sock_from_file(file, &err);
@@ -67,18 +67,25 @@ static int update_classid(const void *v, struct file *file, unsigned n)
return 0;
}

-static void cgrp_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void update_classid(struct cgroup_subsys_state *css, void *v)
{
- struct cgroup_cls_state *cs = css_cls_state(css);
- void *v = (void *)(unsigned long)cs->classid;
+ struct css_task_iter it;
struct task_struct *p;

- cgroup_taskset_for_each(p, tset) {
+ css_task_iter_start(css, &it);
+ while ((p = css_task_iter_next(&it))) {
task_lock(p);
- iterate_fd(p->files, 0, update_classid, v);
+ iterate_fd(p->files, 0, update_classid_sock, v);
task_unlock(p);
}
+ css_task_iter_end(&it);
+}
+
+static void cgrp_attach(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset)
+{
+ update_classid(css,
+ (void *)(unsigned long)css_cls_state(css)->classid);
}

static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
@@ -89,8 +96,11 @@ static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
u64 value)
{
- css_cls_state(css)->classid = (u32) value;
+ struct cgroup_cls_state *cs = css_cls_state(css);
+
+ cs->classid = (u32)value;

+ update_classid(css, (void *)(unsigned long)cs->classid);
return 0;
}

--
2.5.0

2015-11-21 16:15:31

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 5/9] netprio_cgroup: limit the maximum css->id to USHRT_MAX

netprio builds per-netdev contiguous priomap array which is indexed by
css->id. The array is allocated using kzalloc() effectively limiting
the maximum ID supported to some thousand range. This patch caps the
maximum supported css->id to USHRT_MAX which should be way above what
is actually useable.

This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Daniel Wagner <[email protected]>
Cc: Daniel Borkmann <[email protected]>
CC: Neil Horman <[email protected]>
---
include/net/sock.h | 10 +++++-----
net/core/netprio_cgroup.c | 9 +++++++++
2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index bbf7c2c..b517351 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,7 +288,6 @@ struct cg_proto;
* @sk_ack_backlog: current listen backlog
* @sk_max_ack_backlog: listen backlog set in listen()
* @sk_priority: %SO_PRIORITY setting
- * @sk_cgrp_prioidx: socket group's priority map index
* @sk_type: socket type (%SOCK_STREAM, etc)
* @sk_protocol: which protocol this socket belongs in this network family
* @sk_peer_pid: &struct pid for this socket's peer
@@ -309,6 +308,7 @@ struct cg_proto;
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
+ * @sk_cgrp_prioidx: socket group's priority map index
* @sk_classid: this socket's cgroup classid
* @sk_cgrp: this socket's cgroup-specific proto data
* @sk_write_pending: a write to stream socket waits to start
@@ -423,9 +423,7 @@ struct sock {
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
__u32 sk_priority;
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
- __u32 sk_cgrp_prioidx;
-#endif
+ __u32 sk_mark;
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
long sk_rcvtimeo;
@@ -443,7 +441,9 @@ struct sock {
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
- __u32 sk_mark;
+#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
+ u16 sk_cgrp_prioidx;
+#endif
#ifdef CONFIG_CGROUP_NET_CLASSID
u32 sk_classid;
#endif
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index cbd0a19..2b9159b 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,6 +27,12 @@

#include <linux/fdtable.h>

+/*
+ * netprio allocates per-net_device priomap array which is indexed by
+ * css->id. Limiting css ID to 16bits doesn't lose anything.
+ */
+#define NETPRIO_ID_MAX USHRT_MAX
+
#define PRIOMAP_MIN_SZ 128

/*
@@ -144,6 +150,9 @@ static int cgrp_css_online(struct cgroup_subsys_state *css)
struct net_device *dev;
int ret = 0;

+ if (css->id > NETPRIO_ID_MAX)
+ return -ENOSPC;
+
if (!parent_css)
return 0;

--
2.5.0

2015-11-21 16:15:43

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 6/9] net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct

Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
and its accessors are defined in cgroup-defs.h. This is to prepare
for overloading the fields with a cgroup pointer.

This patch mostly performs equivalent conversions but the followings
are noteworthy.

* Equality test before updating classid is removed from
sock_update_classid(). This shouldn't make any noticeable
difference and a similar test will be implemented on the helper side
later.

* sock_update_netprioidx() now takes struct sock_cgroup_data and can
be moved to netprio_cgroup.h without causing include dependency
loop. Moved.

* The dummy version of sock_update_netprioidx() converted to a static
inline function while at it.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup-defs.h | 36 ++++++++++++++++++++++++++++++++++++
include/net/cls_cgroup.h | 11 +++++------
include/net/netprio_cgroup.h | 16 +++++++++++++---
include/net/sock.h | 11 +++--------
net/Kconfig | 6 ++++++
net/core/dev.c | 3 ++-
net/core/netclassid_cgroup.c | 4 ++--
net/core/netprio_cgroup.c | 3 ++-
net/core/scm.c | 4 ++--
net/core/sock.c | 15 ++-------------
net/netfilter/nft_meta.c | 2 +-
net/netfilter/xt_cgroup.c | 3 ++-
12 files changed, 76 insertions(+), 38 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 504d859..ed128fed 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -542,4 +542,40 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}

#endif /* CONFIG_CGROUPS */

+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+struct sock_cgroup_data {
+ u16 prioidx;
+ u32 classid;
+};
+
+static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
+{
+ return skcd->prioidx;
+}
+
+static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
+{
+ return skcd->classid;
+}
+
+static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
+ u16 prioidx)
+{
+ skcd->prioidx = prioidx;
+}
+
+static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
+ u32 classid)
+{
+ skcd->classid = classid;
+}
+
+#else /* CONFIG_SOCK_CGROUP_DATA */
+
+struct sock_cgroup_data {
+};
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
#endif /* _LINUX_CGROUP_DEFS_H */
diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index ccd6d8b..c0a92e2 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -41,13 +41,12 @@ static inline u32 task_cls_classid(struct task_struct *p)
return classid;
}

-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
{
u32 classid;

classid = task_cls_classid(current);
- if (classid != sk->sk_classid)
- sk->sk_classid = classid;
+ sock_cgroup_set_classid(skcd, classid);
}

static inline u32 task_get_classid(const struct sk_buff *skb)
@@ -64,17 +63,17 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
* softirqs always disables bh.
*/
if (in_serving_softirq()) {
- /* If there is an sk_classid we'll use that. */
+ /* If there is an sock_cgroup_classid we'll use that. */
if (!skb->sk)
return 0;

- classid = skb->sk->sk_classid;
+ classid = sock_cgroup_classid(&skb->sk->sk_cgrp_data);
}

return classid;
}
#else /* !CONFIG_CGROUP_NET_CLASSID */
-static inline void sock_update_classid(struct sock *sk)
+static inline void sock_update_classid(struct sock_cgroup_data *skcd)
{
}

diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index f2a9597..6041905 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -25,8 +25,6 @@ struct netprio_map {
u32 priomap[];
};

-void sock_update_netprioidx(struct sock *sk);
-
static inline u32 task_netprioidx(struct task_struct *p)
{
struct cgroup_subsys_state *css;
@@ -38,13 +36,25 @@ static inline u32 task_netprioidx(struct task_struct *p)
rcu_read_unlock();
return idx;
}
+
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+ if (in_interrupt())
+ return;
+
+ sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
+}
+
#else /* !CONFIG_CGROUP_NET_PRIO */
+
static inline u32 task_netprioidx(struct task_struct *p)
{
return 0;
}

-#define sock_update_netprioidx(sk)
+static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
+{
+}

#endif /* CONFIG_CGROUP_NET_PRIO */
#endif /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index b517351..c4e3a30 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -58,6 +58,7 @@
#include <linux/memcontrol.h>
#include <linux/static_key.h>
#include <linux/sched.h>
+#include <linux/cgroup-defs.h>

#include <linux/filter.h>
#include <linux/rculist_nulls.h>
@@ -308,8 +309,7 @@ struct cg_proto;
* @sk_send_head: front of stuff to transmit
* @sk_security: used by security modules
* @sk_mark: generic packet mark
- * @sk_cgrp_prioidx: socket group's priority map index
- * @sk_classid: this socket's cgroup classid
+ * @sk_cgrp_data: cgroup data for this cgroup
* @sk_cgrp: this socket's cgroup-specific proto data
* @sk_write_pending: a write to stream socket waits to start
* @sk_state_change: callback to indicate change in the state of the sock
@@ -441,12 +441,7 @@ struct sock {
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
- u16 sk_cgrp_prioidx;
-#endif
-#ifdef CONFIG_CGROUP_NET_CLASSID
- u32 sk_classid;
-#endif
+ struct sock_cgroup_data sk_cgrp_data;
struct cg_proto *sk_cgrp;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk);
diff --git a/net/Kconfig b/net/Kconfig
index 127da94..11f8c22 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -250,9 +250,14 @@ config XPS
depends on SMP
default y

+config SOCK_CGROUP_DATA
+ bool
+ default n
+
config CGROUP_NET_PRIO
bool "Network priority cgroup"
depends on CGROUPS
+ select SOCK_CGROUP_DATA
---help---
Cgroup subsystem for use in assigning processes to network priorities on
a per-interface basis.
@@ -260,6 +265,7 @@ config CGROUP_NET_PRIO
config CGROUP_NET_CLASSID
bool "Network classid cgroup"
depends on CGROUPS
+ select SOCK_CGROUP_DATA
---help---
Cgroup subsystem for use as general purpose socket classid marker that is
being used in cls_cgroup and for netfilter matching.
diff --git a/net/core/dev.c b/net/core/dev.c
index ab9b8d0..2346cfd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2925,7 +2925,8 @@ static void skb_update_prio(struct sk_buff *skb)
struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);

if (!skb->priority && skb->sk && map) {
- unsigned int prioidx = skb->sk->sk_cgrp_prioidx;
+ unsigned int prioidx =
+ sock_cgroup_prioidx(&skb->sk->sk_cgrp_data);

if (prioidx < map->priomap_len)
skb->priority = map->priomap[prioidx];
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 2e4df84..e60ded4 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -62,8 +62,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
struct socket *sock = sock_from_file(file, &err);

if (sock)
- sock->sk->sk_classid = (u32)(unsigned long)v;
-
+ sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
+ (unsigned long)v);
return 0;
}

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 2b9159b..de42aa7 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -223,7 +223,8 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
int err;
struct socket *sock = sock_from_file(file, &err);
if (sock)
- sock->sk->sk_cgrp_prioidx = (u32)(unsigned long)v;
+ sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
+ (unsigned long)v);
return 0;
}

diff --git a/net/core/scm.c b/net/core/scm.c
index 3b6899b..2a3c7e2 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -289,8 +289,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
/* Bump the usage count and install the file. */
sock = sock_from_file(fp[i], &err);
if (sock) {
- sock_update_netprioidx(sock->sk);
- sock_update_classid(sock->sk);
+ sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+ sock_update_classid(&sock->sk->sk_cgrp_data);
}
fd_install(new_fd, get_file(fp[i]));
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 1e4dd54..35af060 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1393,17 +1393,6 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
module_put(owner);
}

-#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
-void sock_update_netprioidx(struct sock *sk)
-{
- if (in_interrupt())
- return;
-
- sk->sk_cgrp_prioidx = task_netprioidx(current);
-}
-EXPORT_SYMBOL_GPL(sock_update_netprioidx);
-#endif
-
/**
* sk_alloc - All socket objects are allocated here
* @net: the applicable net namespace
@@ -1432,8 +1421,8 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
sock_net_set(sk, net);
atomic_set(&sk->sk_wmem_alloc, 1);

- sock_update_classid(sk);
- sock_update_netprioidx(sk);
+ sock_update_classid(&sk->sk_cgrp_data);
+ sock_update_netprioidx(&sk->sk_cgrp_data);
}

return sk;
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 9dfaf4d..1915cab 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -174,7 +174,7 @@ void nft_meta_get_eval(const struct nft_expr *expr,
sk = skb_to_full_sk(skb);
if (!sk || !sk_fullsock(sk))
goto err;
- *dest = sk->sk_classid;
+ *dest = sock_cgroup_classid(&sk->sk_cgrp_data);
break;
#endif
default:
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index a1d126f..54eaeb4 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -42,7 +42,8 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
if (skb->sk == NULL || !sk_fullsock(skb->sk))
return false;

- return (info->id == skb->sk->sk_classid) ^ info->invert;
+ return (info->id == sock_cgroup_classid(&skb->sk->sk_cgrp_data)) ^
+ info->invert;
}

static struct xt_match cgroup_mt_reg __read_mostly = {
--
2.5.0

2015-11-21 16:15:34

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 7/9] sock, cgroup: add sock->sk_cgroup

In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.

net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.

As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.

This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.

v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.

v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
---
include/linux/cgroup-defs.h | 88 +++++++++++++++++++++++++++++++++++++++++---
include/linux/cgroup.h | 41 +++++++++++++++++++++
kernel/cgroup.c | 55 ++++++++++++++++++++++++++-
net/core/netclassid_cgroup.c | 7 +++-
net/core/netprio_cgroup.c | 7 +++-
net/core/sock.c | 2 +
6 files changed, 191 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ed128fed..9dc2263 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -544,31 +544,107 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}

#ifdef CONFIG_SOCK_CGROUP_DATA

+/*
+ * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
+ * per-socket cgroup information except for memcg association.
+ *
+ * On legacy hierarchies, net_prio and net_cls controllers directly set
+ * attributes on each sock which can then be tested by the network layer.
+ * On the default hierarchy, each sock is associated with the cgroup it was
+ * created in and the networking layer can match the cgroup directly.
+ *
+ * To avoid carrying all three cgroup related fields separately in sock,
+ * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
+ * On boot, sock_cgroup_data records the cgroup that the sock was created
+ * in so that cgroup2 matches can be made; however, once either net_prio or
+ * net_cls starts being used, the area is overriden to carry prioidx and/or
+ * classid. The two modes are distinguished by whether the lowest bit is
+ * set. Clear bit indicates cgroup pointer while set bit prioidx and
+ * classid.
+ *
+ * While userland may start using net_prio or net_cls at any time, once
+ * either is used, cgroup2 matching no longer works. There is no reason to
+ * mix the two and this is in line with how legacy and v2 compatibility is
+ * handled. On mode switch, cgroup references which are already being
+ * pointed to by socks may be leaked. While this can be remedied by adding
+ * synchronization around sock_cgroup_data, given that the number of leaked
+ * cgroups is bound and highly unlikely to be high, this seems to be the
+ * better trade-off.
+ */
struct sock_cgroup_data {
- u16 prioidx;
- u32 classid;
+ union {
+#ifdef __LITTLE_ENDIAN
+ struct {
+ u8 is_data;
+ u8 padding;
+ u16 prioidx;
+ u32 classid;
+ } __packed;
+#else
+ struct {
+ u32 classid;
+ u16 prioidx;
+ u8 padding;
+ u8 is_data;
+ } __packed;
+#endif
+ u64 val;
+ };
};

+/*
+ * There's a theoretical window where the following accessors race with
+ * updaters and return part of the previous pointer as the prioidx or
+ * classid. Such races are short-lived and the result isn't critical.
+ */
static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
{
- return skcd->prioidx;
+ /* fallback to 1 which is always the ID of the root cgroup */
+ return (skcd->is_data & 1) ? skcd->prioidx : 1;
}

static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
{
- return skcd->classid;
+ /* fallback to 0 which is the unconfigured default classid */
+ return (skcd->is_data & 1) ? skcd->classid : 0;
}

+/*
+ * If invoked concurrently, the updaters may clobber each other. The
+ * caller is responsible for synchronization.
+ */
static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
u16 prioidx)
{
- skcd->prioidx = prioidx;
+ struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+ if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
+ return;
+
+ if (!(skcd_buf.is_data & 1)) {
+ skcd_buf.val = 0;
+ skcd_buf.is_data = 1;
+ }
+
+ skcd_buf.prioidx = prioidx;
+ WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
}

static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
u32 classid)
{
- skcd->classid = classid;
+ struct sock_cgroup_data skcd_buf = { .val = READ_ONCE(skcd->val) };
+
+ if (sock_cgroup_classid(&skcd_buf) == classid)
+ return;
+
+ if (!(skcd_buf.is_data & 1)) {
+ skcd_buf.val = 0;
+ skcd_buf.is_data = 1;
+ }
+
+ skcd_buf.classid = classid;
+ WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
}

#else /* CONFIG_SOCK_CGROUP_DATA */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4c3ffab..a8ba1ea 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -578,4 +578,45 @@ static inline int cgroup_init(void) { return 0; }

#endif /* !CONFIG_CGROUPS */

+/*
+ * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+extern spinlock_t cgroup_sk_update_lock;
+#endif
+
+void cgroup_sk_alloc_disable(void);
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_free(struct sock_cgroup_data *skcd);
+
+static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
+{
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+ unsigned long v;
+
+ /*
+ * @skcd->val is 64bit but the following is safe on 32bit too as we
+ * just need the lower ulong to be written and read atomically.
+ */
+ v = READ_ONCE(skcd->val);
+
+ if (v & 1)
+ return &cgrp_dfl_root.cgrp;
+
+ return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
+#else
+ return (struct cgroup *)(unsigned long)skcd->val;
+#endif
+}
+
+#else /* CONFIG_CGROUP_DATA */
+
+static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
+
+#endif /* CONFIG_CGROUP_DATA */
+
#endif /* _LINUX_CGROUP_H */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3db5e8f..4f8f792 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,8 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
-
#include <linux/atomic.h>
+#include <net/sock.h>

/*
* pidlists linger the following amount before being destroyed. The goal
@@ -5782,6 +5782,59 @@ struct cgroup *cgroup_get_from_path(const char *path)
}
EXPORT_SYMBOL_GPL(cgroup_get_from_path);

+/*
+ * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
+ * definition in cgroup-defs.h.
+ */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+
+#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
+
+spinlock_t cgroup_sk_update_lock;
+static bool cgroup_sk_alloc_disabled __read_mostly;
+
+void cgroup_sk_alloc_disable(void)
+{
+ if (cgroup_sk_alloc_disabled)
+ return;
+ pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
+ cgroup_sk_alloc_disabled = true;
+}
+
+#else
+
+#define cgroup_sk_alloc_disabled false
+
+#endif
+
+void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
+{
+ if (cgroup_sk_alloc_disabled)
+ return;
+
+ rcu_read_lock();
+
+ while (true) {
+ struct css_set *cset;
+
+ cset = task_css_set(current);
+ if (likely(cgroup_tryget(cset->dfl_cgrp))) {
+ skcd->val = (unsigned long)cset->dfl_cgrp;
+ break;
+ }
+ cpu_relax();
+ }
+
+ rcu_read_unlock();
+}
+
+void cgroup_sk_free(struct sock_cgroup_data *skcd)
+{
+ cgroup_put(sock_cgroup_ptr(skcd));
+}
+
+#endif /* CONFIG_SOCK_CGROUP_DATA */
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index e60ded4..04257a0 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -61,9 +61,12 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
int err;
struct socket *sock = sock_from_file(file, &err);

- if (sock)
+ if (sock) {
+ spin_lock(&cgroup_sk_update_lock);
sock_cgroup_set_classid(&sock->sk->sk_cgrp_data,
(unsigned long)v);
+ spin_unlock(&cgroup_sk_update_lock);
+ }
return 0;
}

@@ -98,6 +101,8 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
{
struct cgroup_cls_state *cs = css_cls_state(css);

+ cgroup_sk_alloc_disable();
+
cs->classid = (u32)value;

update_classid(css, (void *)(unsigned long)cs->classid);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index de42aa7..053d60c 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -209,6 +209,8 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
if (!dev)
return -ENODEV;

+ cgroup_sk_alloc_disable();
+
rtnl_lock();

ret = netprio_set_prio(of_css(of), dev, prio);
@@ -222,9 +224,12 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
{
int err;
struct socket *sock = sock_from_file(file, &err);
- if (sock)
+ if (sock) {
+ spin_lock(&cgroup_sk_update_lock);
sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
(unsigned long)v);
+ spin_unlock(&cgroup_sk_update_lock);
+ }
return 0;
}

diff --git a/net/core/sock.c b/net/core/sock.c
index 35af060..ecce5df 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1363,6 +1363,7 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
if (!try_module_get(prot->owner))
goto out_free_sec;
sk_tx_queue_clear(sk);
+ cgroup_sk_alloc(&sk->sk_cgrp_data);
}

return sk;
@@ -1385,6 +1386,7 @@ static void sk_prot_free(struct proto *prot, struct sock *sk)
owner = prot->owner;
slab = prot->slab;

+ cgroup_sk_free(&sk->sk_cgrp_data);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
--
2.5.0

2015-11-21 16:15:37

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 8/9] netfilter: prepare xt_cgroup for multi revisions

xt_cgroup will grow cgroup2 path based match. Postfix existing
symbols with _v0 and prepare for multi revision registration.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
---
include/uapi/linux/netfilter/xt_cgroup.h | 2 +-
net/netfilter/xt_cgroup.c | 36 +++++++++++++++++---------------
2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
index 43acb7e..577c9e0 100644
--- a/include/uapi/linux/netfilter/xt_cgroup.h
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -3,7 +3,7 @@

#include <linux/types.h>

-struct xt_cgroup_info {
+struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
};
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 54eaeb4..1730025 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -24,9 +24,9 @@ MODULE_DESCRIPTION("Xtables: process control group matching");
MODULE_ALIAS("ipt_cgroup");
MODULE_ALIAS("ip6t_cgroup");

-static int cgroup_mt_check(const struct xt_mtchk_param *par)
+static int cgroup_mt_check_v0(const struct xt_mtchk_param *par)
{
- struct xt_cgroup_info *info = par->matchinfo;
+ struct xt_cgroup_info_v0 *info = par->matchinfo;

if (info->invert & ~1)
return -EINVAL;
@@ -35,9 +35,9 @@ static int cgroup_mt_check(const struct xt_mtchk_param *par)
}

static bool
-cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
+cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
{
- const struct xt_cgroup_info *info = par->matchinfo;
+ const struct xt_cgroup_info_v0 *info = par->matchinfo;

if (skb->sk == NULL || !sk_fullsock(skb->sk))
return false;
@@ -46,27 +46,29 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par)
info->invert;
}

-static struct xt_match cgroup_mt_reg __read_mostly = {
- .name = "cgroup",
- .revision = 0,
- .family = NFPROTO_UNSPEC,
- .checkentry = cgroup_mt_check,
- .match = cgroup_mt,
- .matchsize = sizeof(struct xt_cgroup_info),
- .me = THIS_MODULE,
- .hooks = (1 << NF_INET_LOCAL_OUT) |
- (1 << NF_INET_POST_ROUTING) |
- (1 << NF_INET_LOCAL_IN),
+static struct xt_match cgroup_mt_reg[] __read_mostly = {
+ {
+ .name = "cgroup",
+ .revision = 0,
+ .family = NFPROTO_UNSPEC,
+ .checkentry = cgroup_mt_check_v0,
+ .match = cgroup_mt_v0,
+ .matchsize = sizeof(struct xt_cgroup_info_v0),
+ .me = THIS_MODULE,
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+ },
};

static int __init cgroup_mt_init(void)
{
- return xt_register_match(&cgroup_mt_reg);
+ return xt_register_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg));
}

static void __exit cgroup_mt_exit(void)
{
- xt_unregister_match(&cgroup_mt_reg);
+ xt_unregister_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg));
}

module_init(cgroup_mt_init);
--
2.5.0

2015-11-21 16:15:40

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

This patch implements xt_cgroup path match which matches cgroup2
membership of the associated socket. The match is recursive and
invertible.

For rationales on introducing another cgroup based match, please refer
to a preceding commit "sock, cgroup: add sock->sk_cgroup".

v3: Folded into xt_cgroup as a new revision interface as suggested by
Pablo.

v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX. Added
explicit alignment to the priv field. Both suggested by Jan.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Daniel Wagner <[email protected]>
CC: Neil Horman <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
---
include/uapi/linux/netfilter/xt_cgroup.h | 13 ++++++
net/netfilter/xt_cgroup.c | 69 ++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+)

diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
index 577c9e0..1e4b37b 100644
--- a/include/uapi/linux/netfilter/xt_cgroup.h
+++ b/include/uapi/linux/netfilter/xt_cgroup.h
@@ -2,10 +2,23 @@
#define _UAPI_XT_CGROUP_H

#include <linux/types.h>
+#include <linux/limits.h>

struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
};

+struct xt_cgroup_info_v1 {
+ __u8 has_path;
+ __u8 has_classid;
+ __u8 invert_path;
+ __u8 invert_classid;
+ char path[PATH_MAX];
+ __u32 classid;
+
+ /* kernel internal data */
+ void *priv __attribute__((aligned(8)));
+};
+
#endif /* _UAPI_XT_CGROUP_H */
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 1730025..a086a91 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -34,6 +34,37 @@ static int cgroup_mt_check_v0(const struct xt_mtchk_param *par)
return 0;
}

+static int cgroup_mt_check_v1(const struct xt_mtchk_param *par)
+{
+ struct xt_cgroup_info_v1 *info = par->matchinfo;
+ struct cgroup *cgrp;
+
+ if ((info->invert_path & ~1) || (info->invert_classid & ~1))
+ return -EINVAL;
+
+ if (!info->has_path && !info->has_classid) {
+ pr_info("xt_cgroup: no path or classid specified\n");
+ return -EINVAL;
+ }
+
+ if (info->has_path && info->has_classid) {
+ pr_info("xt_cgroup: both path and classid specified\n");
+ return -EINVAL;
+ }
+
+ if (info->has_path) {
+ cgrp = cgroup_get_from_path(info->path);
+ if (IS_ERR(cgrp)) {
+ pr_info("xt_cgroup: invalid path, errno=%ld\n",
+ PTR_ERR(cgrp));
+ return -EINVAL;
+ }
+ info->priv = cgrp;
+ }
+
+ return 0;
+}
+
static bool
cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
{
@@ -46,6 +77,31 @@ cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
info->invert;
}

+static bool cgroup_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
+{
+ const struct xt_cgroup_info_v1 *info = par->matchinfo;
+ struct sock_cgroup_data *skcd = &skb->sk->sk_cgrp_data;
+ struct cgroup *ancestor = info->priv;
+
+ if (!skb->sk || !sk_fullsock(skb->sk))
+ return false;
+
+ if (ancestor)
+ return cgroup_is_descendant(sock_cgroup_ptr(skcd), ancestor) ^
+ info->invert_path;
+ else
+ return (info->classid == sock_cgroup_classid(skcd)) ^
+ info->invert_classid;
+}
+
+static void cgroup_mt_destroy_v1(const struct xt_mtdtor_param *par)
+{
+ struct xt_cgroup_info_v1 *info = par->matchinfo;
+
+ if (info->priv)
+ cgroup_put(info->priv);
+}
+
static struct xt_match cgroup_mt_reg[] __read_mostly = {
{
.name = "cgroup",
@@ -59,6 +115,19 @@ static struct xt_match cgroup_mt_reg[] __read_mostly = {
(1 << NF_INET_POST_ROUTING) |
(1 << NF_INET_LOCAL_IN),
},
+ {
+ .name = "cgroup",
+ .revision = 1,
+ .family = NFPROTO_UNSPEC,
+ .checkentry = cgroup_mt_check_v1,
+ .match = cgroup_mt_v1,
+ .matchsize = sizeof(struct xt_cgroup_info_v1),
+ .destroy = cgroup_mt_destroy_v1,
+ .me = THIS_MODULE,
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING) |
+ (1 << NF_INET_LOCAL_IN),
+ },
};

static int __init cgroup_mt_init(void)
--
2.5.0

2015-11-21 16:17:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

Oops, made a copy & paste error on Neil Horman's address. Sorry,
Neil. The thread can be found at

http://lkml.kernel.org/g/[email protected]

Thanks.

--
tejun

2015-11-21 16:18:57

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/2 iptables] libxt_cgroup: prepare for multi revisions

libxt_cgroup will grow cgroup2 path based match. Postfix existing
symbols with _v0 and prepare for multi revision registration. While
at it, rename O_CGROUP to O_CLASSID and fwid to classid.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
---
extensions/libxt_cgroup.c | 51 +++++++++++++++++++-----------------
include/linux/netfilter/xt_cgroup.h | 2 -
2 files changed, 28 insertions(+), 25 deletions(-)

--- a/extensions/libxt_cgroup.c
+++ b/extensions/libxt_cgroup.c
@@ -3,30 +3,30 @@
#include <linux/netfilter/xt_cgroup.h>

enum {
- O_CGROUP = 0,
+ O_CLASSID = 0,
};

-static void cgroup_help(void)
+static void cgroup_help_v0(void)
{
printf(
"cgroup match options:\n"
-"[!] --cgroup fwid Match cgroup fwid\n");
+"[!] --cgroup classid Match cgroup classid\n");
}

-static const struct xt_option_entry cgroup_opts[] = {
+static const struct xt_option_entry cgroup_opts_v0[] = {
{
.name = "cgroup",
- .id = O_CGROUP,
+ .id = O_CLASSID,
.type = XTTYPE_UINT32,
.flags = XTOPT_INVERT | XTOPT_MAND | XTOPT_PUT,
- XTOPT_POINTER(struct xt_cgroup_info, id)
+ XTOPT_POINTER(struct xt_cgroup_info_v0, id)
},
XTOPT_TABLEEND,
};

-static void cgroup_parse(struct xt_option_call *cb)
+static void cgroup_parse_v0(struct xt_option_call *cb)
{
- struct xt_cgroup_info *cgroupinfo = cb->data;
+ struct xt_cgroup_info_v0 *cgroupinfo = cb->data;

xtables_option_parse(cb);
if (cb->invert)
@@ -34,34 +34,37 @@ static void cgroup_parse(struct xt_optio
}

static void
-cgroup_print(const void *ip, const struct xt_entry_match *match, int numeric)
+cgroup_print_v0(const void *ip, const struct xt_entry_match *match, int numeric)
{
- const struct xt_cgroup_info *info = (void *) match->data;
+ const struct xt_cgroup_info_v0 *info = (void *) match->data;

printf(" cgroup %s%u", info->invert ? "! ":"", info->id);
}

-static void cgroup_save(const void *ip, const struct xt_entry_match *match)
+static void cgroup_save_v0(const void *ip, const struct xt_entry_match *match)
{
- const struct xt_cgroup_info *info = (void *) match->data;
+ const struct xt_cgroup_info_v0 *info = (void *) match->data;

printf("%s --cgroup %u", info->invert ? " !" : "", info->id);
}

-static struct xtables_match cgroup_match = {
- .family = NFPROTO_UNSPEC,
- .name = "cgroup",
- .version = XTABLES_VERSION,
- .size = XT_ALIGN(sizeof(struct xt_cgroup_info)),
- .userspacesize = XT_ALIGN(sizeof(struct xt_cgroup_info)),
- .help = cgroup_help,
- .print = cgroup_print,
- .save = cgroup_save,
- .x6_parse = cgroup_parse,
- .x6_options = cgroup_opts,
+static struct xtables_match cgroup_match[] = {
+ {
+ .family = NFPROTO_UNSPEC,
+ .revision = 0,
+ .name = "cgroup",
+ .version = XTABLES_VERSION,
+ .size = XT_ALIGN(sizeof(struct xt_cgroup_info_v0)),
+ .userspacesize = XT_ALIGN(sizeof(struct xt_cgroup_info_v0)),
+ .help = cgroup_help_v0,
+ .print = cgroup_print_v0,
+ .save = cgroup_save_v0,
+ .x6_parse = cgroup_parse_v0,
+ .x6_options = cgroup_opts_v0,
+ },
};

void _init(void)
{
- xtables_register_match(&cgroup_match);
+ xtables_register_matches(cgroup_match, ARRAY_SIZE(cgroup_match));
}
--- a/include/linux/netfilter/xt_cgroup.h
+++ b/include/linux/netfilter/xt_cgroup.h
@@ -3,7 +3,7 @@

#include <linux/types.h>

-struct xt_cgroup_info {
+struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
};

2015-11-21 16:19:49

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/2 iptables] libxt_cgroup: add support for cgroup2 path matching

This patch updates xt_cgroup so that it supports revision 1 interface
which includes cgroup2 path based matching.

v3: Folded into xt_cgroup as a new revision interface as suggested by
Pablo.

v2: cgroup2_match->userspacesize and ->save and man page updated as
per Jan.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Jan Engelhardt <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
---
extensions/libxt_cgroup.c | 86 ++++++++++++++++++++++++++++++++++++
extensions/libxt_cgroup.man | 33 ++++++++-----
include/linux/netfilter/xt_cgroup.h | 13 +++++
3 files changed, 119 insertions(+), 13 deletions(-)

--- a/extensions/libxt_cgroup.c
+++ b/extensions/libxt_cgroup.c
@@ -4,6 +4,7 @@

enum {
O_CLASSID = 0,
+ O_PATH = 1,
};

static void cgroup_help_v0(void)
@@ -13,6 +14,14 @@ static void cgroup_help_v0(void)
"[!] --cgroup classid Match cgroup classid\n");
}

+static void cgroup_help_v1(void)
+{
+ printf(
+"cgroup match options:\n"
+"[!] --path path Recursively match path relative to cgroup2 root\n"
+"[!] --cgroup claasid Match cgroup classid, can't be used with --path\n");
+}
+
static const struct xt_option_entry cgroup_opts_v0[] = {
{
.name = "cgroup",
@@ -24,6 +33,24 @@ static const struct xt_option_entry cgro
XTOPT_TABLEEND,
};

+static const struct xt_option_entry cgroup_opts_v1[] = {
+ {
+ .name = "path",
+ .id = O_PATH,
+ .type = XTTYPE_STRING,
+ .flags = XTOPT_INVERT | XTOPT_PUT,
+ XTOPT_POINTER(struct xt_cgroup_info_v1, path)
+ },
+ {
+ .name = "cgroup",
+ .id = O_CLASSID,
+ .type = XTTYPE_UINT32,
+ .flags = XTOPT_INVERT | XTOPT_PUT,
+ XTOPT_POINTER(struct xt_cgroup_info_v1, classid)
+ },
+ XTOPT_TABLEEND,
+};
+
static void cgroup_parse_v0(struct xt_option_call *cb)
{
struct xt_cgroup_info_v0 *cgroupinfo = cb->data;
@@ -33,6 +60,26 @@ static void cgroup_parse_v0(struct xt_op
cgroupinfo->invert = true;
}

+static void cgroup_parse_v1(struct xt_option_call *cb)
+{
+ struct xt_cgroup_info_v1 *info = cb->data;
+
+ xtables_option_parse(cb);
+
+ switch (cb->entry->id) {
+ case O_PATH:
+ info->has_path = true;
+ if (cb->invert)
+ info->invert_path = true;
+ break;
+ case O_CLASSID:
+ info->has_classid = true;
+ if (cb->invert)
+ info->invert_classid = true;
+ break;
+ }
+}
+
static void
cgroup_print_v0(const void *ip, const struct xt_entry_match *match, int numeric)
{
@@ -48,6 +95,32 @@ static void cgroup_save_v0(const void *i
printf("%s --cgroup %u", info->invert ? " !" : "", info->id);
}

+static void
+cgroup_print_v1(const void *ip, const struct xt_entry_match *match, int numeric)
+{
+ const struct xt_cgroup_info_v1 *info = (void *)match->data;
+
+ printf(" cgroup");
+ if (info->has_path)
+ printf(" %s%s", info->invert_path ? "! ":"", info->path);
+ if (info->has_classid)
+ printf(" %s%u", info->invert_classid ? "! ":"", info->classid);
+}
+
+static void cgroup_save_v1(const void *ip, const struct xt_entry_match *match)
+{
+ const struct xt_cgroup_info_v1 *info = (void *)match->data;
+
+ if (info->has_path) {
+ printf("%s --path", info->invert_path ? " !" : "");
+ xtables_save_string(info->path);
+ }
+
+ if (info->has_classid)
+ printf("%s --cgroup %u", info->invert_classid ? " !" : "",
+ info->classid);
+}
+
static struct xtables_match cgroup_match[] = {
{
.family = NFPROTO_UNSPEC,
@@ -62,6 +135,19 @@ static struct xtables_match cgroup_match
.x6_parse = cgroup_parse_v0,
.x6_options = cgroup_opts_v0,
},
+ {
+ .family = NFPROTO_UNSPEC,
+ .revision = 1,
+ .name = "cgroup",
+ .version = XTABLES_VERSION,
+ .size = XT_ALIGN(sizeof(struct xt_cgroup_info_v1)),
+ .userspacesize = offsetof(struct xt_cgroup_info_v1, priv),
+ .help = cgroup_help_v1,
+ .print = cgroup_print_v1,
+ .save = cgroup_save_v1,
+ .x6_parse = cgroup_parse_v1,
+ .x6_options = cgroup_opts_v1,
+ },
};

void _init(void)
--- a/extensions/libxt_cgroup.man
+++ b/extensions/libxt_cgroup.man
@@ -1,23 +1,30 @@
.TP
-[\fB!\fP] \fB\-\-cgroup\fP \fIfwid\fP
-Match corresponding cgroup for this packet.
+[\fB!\fP] \fB\-\-path\fP \fIpath\fP
+Match cgroup2 membership.

-Can be used in the OUTPUT chain to assign particular firewall
-policies for aggregated task/jobs on the system. This allows
-for more fine-grained firewall policies that only match for a
-subset of the system's processes. fwid is the maker set through
-the net_cls cgroup's id.
+Each socket is associated with the v2 cgroup of the creating process.
+This matches packets coming from or going to all sockets in the
+sub-hierarchy of the specified path. The path should be relative to
+the root of the cgroup2 hierarchy.
+.TP
+[\fB!\fP] \fB\-\-cgroup\fP \fIclassid\fP
+Match cgroup net_cls classid.

-\fBIMPORTANT\fP: when being used in the INPUT chain, the cgroup
-matcher is currently only of limited functionality, meaning it
-will only match on packets that are processed for local sockets
-through early socket demuxing. Therefore, general usage on the
-INPUT chain is disadviced unless the implications are well
-understood.
+classid is the marker set through the cgroup net_cls controller. This
+option and \-\-path can't be used together.
.PP
Example:
.IP
+iptables \-A OUTPUT \-p tcp \-\-sport 80 \-m cgroup ! \-\-path service/http-server \-j DROP
+.IP
iptables \-A OUTPUT \-p tcp \-\-sport 80 \-m cgroup ! \-\-cgroup 1
\-j DROP
.PP
+\fBIMPORTANT\fP: when being used in the INPUT chain, the cgroup
+matcher is currently only of limited functionality, meaning it
+will only match on packets that are processed for local sockets
+through early socket demuxing. Therefore, general usage on the
+INPUT chain is not advised unless the implications are well
+understood.
+.PP
Available since Linux 3.14.
--- a/include/linux/netfilter/xt_cgroup.h
+++ b/include/linux/netfilter/xt_cgroup.h
@@ -2,10 +2,23 @@
#define _XT_CGROUP_H

#include <linux/types.h>
+#include <linux/limits.h>

struct xt_cgroup_info_v0 {
__u32 id;
__u32 invert;
};

+struct xt_cgroup_info_v1 {
+ __u8 has_path;
+ __u8 has_classid;
+ __u8 invert_path;
+ __u8 invert_classid;
+ char path[PATH_MAX];
+ __u32 classid;
+
+ /* kernel internal data */
+ void *priv __attribute__((aligned(8)));
+};
+
#endif /* _XT_CGROUP_H */

2015-11-21 16:59:25

by Florian Westphal

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

Tejun Heo <[email protected]> wrote:
> This patch implements xt_cgroup path match which matches cgroup2
> membership of the associated socket. The match is recursive and
> invertible.
>
> For rationales on introducing another cgroup based match, please refer
> to a preceding commit "sock, cgroup: add sock->sk_cgroup".
>
> v3: Folded into xt_cgroup as a new revision interface as suggested by
> Pablo.
>
> v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX. Added
> explicit alignment to the priv field. Both suggested by Jan.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Daniel Borkmann <[email protected]>
> Cc: Daniel Wagner <[email protected]>
> CC: Neil Horman <[email protected]>
> Cc: Jan Engelhardt <[email protected]>
> Cc: Pablo Neira Ayuso <[email protected]>
> ---
> include/uapi/linux/netfilter/xt_cgroup.h | 13 ++++++
> net/netfilter/xt_cgroup.c | 69 ++++++++++++++++++++++++++++++++
> 2 files changed, 82 insertions(+)
>
> diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h
> index 577c9e0..1e4b37b 100644
> --- a/include/uapi/linux/netfilter/xt_cgroup.h
> +++ b/include/uapi/linux/netfilter/xt_cgroup.h
> @@ -2,10 +2,23 @@
> #define _UAPI_XT_CGROUP_H
>
> #include <linux/types.h>
> +#include <linux/limits.h>
>
> struct xt_cgroup_info_v0 {
> __u32 id;
> __u32 invert;
> };
>
> +struct xt_cgroup_info_v1 {
> + __u8 has_path;
> + __u8 has_classid;
> + __u8 invert_path;
> + __u8 invert_classid;
> + char path[PATH_MAX];
> + __u32 classid;
> +
> + /* kernel internal data */
> + void *priv __attribute__((aligned(8)));
> +};

Ahem. Am I reading this right? This struct is > 4k in size?
If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?

Thanks!

2015-11-21 17:04:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

Hello,

On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote:
> > +struct xt_cgroup_info_v1 {
> > + __u8 has_path;
> > + __u8 has_classid;
> > + __u8 invert_path;
> > + __u8 invert_classid;
> > + char path[PATH_MAX];
> > + __u32 classid;
> > +
> > + /* kernel internal data */
> > + void *priv __attribute__((aligned(8)));
> > +};
>
> Ahem. Am I reading this right? This struct is > 4k in size?
> If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?

Hmmm... yeap but would this be an acutual problem? We can try to make
it shorter but idk it ultimately is a path. Another solution would be
trying to pass inode around but that is problematic with showing and
printing rules as the only way to reverse-map inode to path is walking
the tree and the cgroup may already be gone at that point. While >4k
struct isn't pretty, this looks like the path of least resistance.

Thanks.

--
tejun

2015-11-21 18:54:47

by Florian Westphal

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

Tejun Heo <[email protected]> wrote:
> On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote:
> > > +struct xt_cgroup_info_v1 {
> > > + __u8 has_path;
> > > + __u8 has_classid;
> > > + __u8 invert_path;
> > > + __u8 invert_classid;
> > > + char path[PATH_MAX];
> > > + __u32 classid;
> > > +
> > > + /* kernel internal data */
> > > + void *priv __attribute__((aligned(8)));
> > > +};
> >
> > Ahem. Am I reading this right? This struct is > 4k in size?
> > If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?
>
> Hmmm... yeap but would this be an acutual problem?

Since rule blob can be allocated via vmalloc i guess "no", its not
really a problem unless someone needs realy insane amount of such rules.

I don't have any better suggestion, so I guess its necessary evil.

The only other question I have is wheter PATH_MAX might be a possible
ABI breaker in future. It would have to be guaranteed that this is the
same size forever, else you'd get strange errors on rule insertion if
the sizes of the kernel and userspace version differs.

2015-11-21 20:26:31

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match


On Saturday 2015-11-21 19:54, Florian Westphal wrote:
>
>The only other question I have is wheter PATH_MAX might be a possible
>ABI breaker in future. It would have to be guaranteed that this is the
>same size forever, else you'd get strange errors on rule insertion if
>the sizes of the kernel and userspace version differs.
>
The same goes for IFNAMSIZ. But, so far, nobody changed it in the kernel,
even though there are voices that 15 characters + '\0' were a tight choice.

2015-11-22 20:31:39

by Pablo Neira Ayuso

[permalink] [raw]
Subject: Re: [PATCH 1/2 iptables] libxt_cgroup: prepare for multi revisions

On Sat, Nov 21, 2015 at 11:18:46AM -0500, Tejun Heo wrote:
> libxt_cgroup will grow cgroup2 path based match. Postfix existing
> symbols with _v0 and prepare for multi revision registration. While
> at it, rename O_CGROUP to O_CLASSID and fwid to classid.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Daniel Borkmann <[email protected]>
> Cc: Jan Engelhardt <[email protected]>
> Cc: Pablo Neira Ayuso <[email protected]>
> ---
> extensions/libxt_cgroup.c | 51 +++++++++++++++++++-----------------
> include/linux/netfilter/xt_cgroup.h | 2 -
> 2 files changed, 28 insertions(+), 25 deletions(-)
>
> --- a/extensions/libxt_cgroup.c
> +++ b/extensions/libxt_cgroup.c
> @@ -3,30 +3,30 @@
> #include <linux/netfilter/xt_cgroup.h>
>
> enum {
> - O_CGROUP = 0,
> + O_CLASSID = 0,
> };
>
> -static void cgroup_help(void)
> +static void cgroup_help_v0(void)
> {
> printf(
> "cgroup match options:\n"
> -"[!] --cgroup fwid Match cgroup fwid\n");
> +"[!] --cgroup classid Match cgroup classid\n");

We have to keep the old cgroup integer ID around for a while,
otherwise we'll break users with old kernels and new iptables
utilities.

2015-11-22 20:34:54

by Pablo Neira Ayuso

[permalink] [raw]
Subject: Re: [PATCH 1/2 iptables] libxt_cgroup: prepare for multi revisions

On Sun, Nov 22, 2015 at 09:31:28PM +0100, Pablo Neira Ayuso wrote:
> On Sat, Nov 21, 2015 at 11:18:46AM -0500, Tejun Heo wrote:
> > --- a/extensions/libxt_cgroup.c
> > +++ b/extensions/libxt_cgroup.c
> > @@ -3,30 +3,30 @@
> > #include <linux/netfilter/xt_cgroup.h>
> >
> > enum {
> > - O_CGROUP = 0,
> > + O_CLASSID = 0,
> > };
> >
> > -static void cgroup_help(void)
> > +static void cgroup_help_v0(void)
> > {
> > printf(
> > "cgroup match options:\n"
> > -"[!] --cgroup fwid Match cgroup fwid\n");
> > +"[!] --cgroup classid Match cgroup classid\n");
>
> We have to keep the old cgroup integer ID around for a while,
> otherwise we'll break users with old kernels and new iptables
> utilities.

Oh, I see.

This is just a rename to prepare the string based identifier.

Sorry for the noise.

2015-11-23 07:12:07

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

Hi Tejun,

On 11/21/2015 05:13 PM, Tejun Heo wrote:
> This is v3 of the xt_cgroup2 patchset. Changes from the last take are
>
> * Folded cgroup2 path matching into xt_cgroup as a new revision rather
> than a separate xt_cgroup2 match as suggested by Pablo.
>
> * Refreshed on top of Nina's net_cls dynamic config update fix patch.
> I included the fix patch as part of this series to ease reviewing.

I started to play with your patches and was greeted by this:

[ 3.217648] systemd[1]: tmp.mount: Directory /tmp to mount over is not empty, mounting anyway.
[ 3.224665] BUG: spinlock bad magic on CPU#1, systemd/1
[ 3.225653] lock: cgroup_sk_update_lock+0x0/0x60, .magic: 00000000, .owner: systemd/1, .owner_cpu: 1
[ 3.227034] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #195
[ 3.227862] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 3.228906] ffffffff834a2160 ffff88007c043ad0 ffffffff81551edc ffff88007c028000
[ 3.229512] ffff88007c043af0 ffffffff81136868 ffffffff834a2160 ffff88007aff5940
[ 3.230105] ffff88007c043b08 ffffffff81136b05 ffffffff834a2160 ffff88007c043b20
[ 3.230716] Call Trace:
[ 3.230906] [<ffffffff81551edc>] dump_stack+0x4e/0x82
[ 3.231289] [<ffffffff81136868>] spin_dump+0x78/0xc0
[ 3.231642] [<ffffffff81136b05>] do_raw_spin_unlock+0x75/0xd0
[ 3.232039] [<ffffffff81bced77>] _raw_spin_unlock+0x27/0x50
[ 3.232431] [<ffffffff819b1848>] update_classid_sock+0x68/0x80
[ 3.232836] [<ffffffff812855c1>] iterate_fd+0x71/0x150
[ 3.233197] [<ffffffff819b1757>] update_classid+0x47/0x80
[ 3.233571] [<ffffffff819b17d4>] cgrp_attach+0x14/0x20
[ 3.233929] [<ffffffff81188951>] cgroup_taskset_migrate+0x1e1/0x330
[ 3.234366] [<ffffffff81188b95>] cgroup_migrate+0xf5/0x190
[ 3.234747] [<ffffffff81188aa5>] ? cgroup_migrate+0x5/0x190
[ 3.235130] [<ffffffff81188da6>] cgroup_attach_task+0x176/0x200
[ 3.235543] [<ffffffff81188c35>] ? cgroup_attach_task+0x5/0x200
[ 3.235953] [<ffffffff8118922d>] __cgroup_procs_write+0x2ad/0x460
[ 3.236377] [<ffffffff81188fde>] ? __cgroup_procs_write+0x5e/0x460
[ 3.236805] [<ffffffff81189414>] cgroup_procs_write+0x14/0x20
[ 3.237205] [<ffffffff81185ae5>] cgroup_file_write+0x35/0x1c0
[ 3.237600] [<ffffffff812e25e1>] kernfs_fop_write+0x141/0x190
[ 3.237998] [<ffffffff81265e78>] __vfs_write+0x28/0xe0
[ 3.238361] [<ffffffff811292c7>] ? percpu_down_read+0x57/0xa0
[ 3.238761] [<ffffffff81268b04>] ? __sb_start_write+0xb4/0xf0
[ 3.239154] [<ffffffff81268b04>] ? __sb_start_write+0xb4/0xf0
[ 3.239554] [<ffffffff812665ec>] vfs_write+0xac/0x1a0
[ 3.239930] [<ffffffff81285fa6>] ? __fget_light+0x66/0x90
[ 3.240308] [<ffffffff81266f09>] SyS_write+0x49/0xb0
[ 3.240656] [<ffffffff81bcfb32>] entry_SYSCALL_64_fastpath+0x12/0x76

I am using a Fedora 23 host with systemd.unified_cgroup_hierarchy=1. The config is
available here:

http://monom.org/cgroup/config-review-xt_cgroup2

Probably completely rubbish, because it's my random test config.

cheers,
daniel

2015-11-23 08:54:38

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

On 11/23/2015 08:11 AM, Daniel Wagner wrote:
> [ 3.217648] systemd[1]: tmp.mount: Directory /tmp to mount over is not empty, mounting anyway.
> [ 3.224665] BUG: spinlock bad magic on CPU#1, systemd/1
> [ 3.225653] lock: cgroup_sk_update_lock+0x0/0x60, .magic: 00000000, .owner: systemd/1, .owner_cpu: 1
> [ 3.227034] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #195
> [ 3.227862] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [ 3.228906] ffffffff834a2160 ffff88007c043ad0 ffffffff81551edc ffff88007c028000
> [ 3.229512] ffff88007c043af0 ffffffff81136868 ffffffff834a2160 ffff88007aff5940
> [ 3.230105] ffff88007c043b08 ffffffff81136b05 ffffffff834a2160 ffff88007c043b20
> [ 3.230716] Call Trace:
> [ 3.230906] [<ffffffff81551edc>] dump_stack+0x4e/0x82
> [ 3.231289] [<ffffffff81136868>] spin_dump+0x78/0xc0
> [ 3.231642] [<ffffffff81136b05>] do_raw_spin_unlock+0x75/0xd0
> [ 3.232039] [<ffffffff81bced77>] _raw_spin_unlock+0x27/0x50
> [ 3.232431] [<ffffffff819b1848>] update_classid_sock+0x68/0x80
> [ 3.232836] [<ffffffff812855c1>] iterate_fd+0x71/0x150
> [ 3.233197] [<ffffffff819b1757>] update_classid+0x47/0x80
> [ 3.233571] [<ffffffff819b17d4>] cgrp_attach+0x14/0x20
> [ 3.233929] [<ffffffff81188951>] cgroup_taskset_migrate+0x1e1/0x330
> [ 3.234366] [<ffffffff81188b95>] cgroup_migrate+0xf5/0x190
> [ 3.234747] [<ffffffff81188aa5>] ? cgroup_migrate+0x5/0x190
> [ 3.235130] [<ffffffff81188da6>] cgroup_attach_task+0x176/0x200
> [ 3.235543] [<ffffffff81188c35>] ? cgroup_attach_task+0x5/0x200
> [ 3.235953] [<ffffffff8118922d>] __cgroup_procs_write+0x2ad/0x460
> [ 3.236377] [<ffffffff81188fde>] ? __cgroup_procs_write+0x5e/0x460
> [ 3.236805] [<ffffffff81189414>] cgroup_procs_write+0x14/0x20
> [ 3.237205] [<ffffffff81185ae5>] cgroup_file_write+0x35/0x1c0
> [ 3.237600] [<ffffffff812e25e1>] kernfs_fop_write+0x141/0x190
> [ 3.237998] [<ffffffff81265e78>] __vfs_write+0x28/0xe0
> [ 3.238361] [<ffffffff811292c7>] ? percpu_down_read+0x57/0xa0
> [ 3.238761] [<ffffffff81268b04>] ? __sb_start_write+0xb4/0xf0
> [ 3.239154] [<ffffffff81268b04>] ? __sb_start_write+0xb4/0xf0
> [ 3.239554] [<ffffffff812665ec>] vfs_write+0xac/0x1a0
> [ 3.239930] [<ffffffff81285fa6>] ? __fget_light+0x66/0x90
> [ 3.240308] [<ffffffff81266f09>] SyS_write+0x49/0xb0
> [ 3.240656] [<ffffffff81bcfb32>] entry_SYSCALL_64_fastpath+0x12/0x76

I have enabled a few additional cgroup controllers as well, because I was
trying to figure out why I only see the 'memory' cgroup controller in
cgroup.controllers. pid and io show up but not net_prio or net_cls.
Not sure why systemd (v227) is not mounting them.

Though, after a while a similar call trace is produced. I guess this
has nothing to do with the current changes.

[ 11.594536] ------------[ cut here ]------------
[ 11.595274] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
[ 11.595958] Modules linked in:
[ 11.596199] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #196
[ 11.596689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 11.597632] ffffffff81f66d8b ffff88007c04bb90 ffffffff8155ccdc 0000000000000000
[ 11.598234] ffff88007c04bbc8 ffffffff810de202 ffff8800793dda00 ffff88007a096800
[ 11.598877] ffff88007c04bc80 ffff88007a6b6200 0000000000000001 ffff88007c04bbd8
[ 11.599547] Call Trace:
[ 11.599784] [<ffffffff8155ccdc>] dump_stack+0x4e/0x82
[ 11.600197] [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[ 11.600705] [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[ 11.601208] [<ffffffff8118e261>] pids_cancel.constprop.6+0x31/0x40
[ 11.601764] [<ffffffff8118e32d>] pids_can_attach+0x6d/0xf0
[ 11.602245] [<ffffffff811887fa>] cgroup_taskset_migrate+0x6a/0x330
[ 11.602795] [<ffffffff81188bb5>] cgroup_migrate+0xf5/0x190
[ 11.603276] [<ffffffff81188ac5>] ? cgroup_migrate+0x5/0x190
[ 11.603788] [<ffffffff81188dc6>] cgroup_attach_task+0x176/0x200
[ 11.604308] [<ffffffff81188c55>] ? cgroup_attach_task+0x5/0x200
[ 11.604831] [<ffffffff8118924d>] __cgroup_procs_write+0x2ad/0x460
[ 11.605367] [<ffffffff81188ffe>] ? __cgroup_procs_write+0x5e/0x460
[ 11.605929] [<ffffffff81189434>] cgroup_procs_write+0x14/0x20
[ 11.606448] [<ffffffff81185af5>] cgroup_file_write+0x35/0x1c0
[ 11.606931] [<ffffffff812e94f1>] kernfs_fop_write+0x141/0x190
[ 11.607401] [<ffffffff81269b18>] __vfs_write+0x28/0xe0
[ 11.607834] [<ffffffff811292d7>] ? percpu_down_read+0x57/0xa0
[ 11.608366] [<ffffffff8126c7a4>] ? __sb_start_write+0xb4/0xf0
[ 11.608874] [<ffffffff8126c7a4>] ? __sb_start_write+0xb4/0xf0
[ 11.609343] [<ffffffff8126a28c>] vfs_write+0xac/0x1a0
[ 11.609843] [<ffffffff81289db6>] ? __fget_light+0x66/0x90
[ 11.610315] [<ffffffff8126aba9>] SyS_write+0x49/0xb0
[ 11.610756] [<ffffffff81bdb4f2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 11.611305] ---[ end trace 7f953d0ce5af99ea ]---

2015-11-23 12:43:10

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

Hi Tejun,

On 11/21/2015 05:14 PM, Tejun Heo wrote:> +static int
> cgroup_mt_check_v1(const struct xt_mtchk_param *par)
> +{
> + struct xt_cgroup_info_v1 *info = par->matchinfo;
> + struct cgroup *cgrp;
> +
> + if ((info->invert_path & ~1) || (info->invert_classid & ~1))
> + return -EINVAL;

The checks below use pr_info() in case the configuration is not valid.
Is this missing here on purpose?

I have tested it slightly and it seems to work (also on an older
kernel). I don't know if that qualifies it for a Tested-by but at least
Acked-by should do the trick:

Tested-by: Daniel Wagner <[email protected]>
Acked-by: Daniel Wagner <[email protected]>

cheers,
daniel

2015-11-23 12:45:06

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH 8/9] netfilter: prepare xt_cgroup for multi revisions

Hi Tejun,

On 11/21/2015 05:14 PM, Tejun Heo wrote:
> xt_cgroup will grow cgroup2 path based match. Postfix existing
> symbols with _v0 and prepare for multi revision registration.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Daniel Borkmann <[email protected]>
> Cc: Daniel Wagner <[email protected]>

Same as in my reply to patch #9 (yes, I know do it wrong order...
thought can't stop now... :))

Tested-by: Daniel Wagner <[email protected]>
Acked-by: Daniel Wagner <[email protected]>

cheers,
daniel

2015-11-23 13:02:16

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH 7/9] sock, cgroup: add sock->sk_cgroup

Hi Tejun,

On 11/21/2015 05:13 PM, Tejun Heo wrote:
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Daniel Borkmann <[email protected]>
> Cc: Daniel Wagner <[email protected]>

I did a quick test and for new connection the cgroup2 match worked as
expected. For an existing connection I wasn't able to trigger the match.

It is quite likely I do something wrong:

ssh into the box
# mkdir /sys/fs/cgroup/test
# echo $$ > /sys/fs/cgroup/test/cgroup.procs
# echo $PPID > /sys/fs/cgroup/test/cgroup.procs
# iptables -A OUTPUT -m cgroup --path test

Should I see matches with the existing ssh session?

cheers,
daniel

2015-11-23 13:43:37

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

On 11/21/2015 07:54 PM, Florian Westphal wrote:
> Tejun Heo <[email protected]> wrote:
>> On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote:
>>>> +struct xt_cgroup_info_v1 {
>>>> + __u8 has_path;
>>>> + __u8 has_classid;
>>>> + __u8 invert_path;
>>>> + __u8 invert_classid;
>>>> + char path[PATH_MAX];
>>>> + __u32 classid;
>>>> +
>>>> + /* kernel internal data */
>>>> + void *priv __attribute__((aligned(8)));
>>>> +};
>>>
>>> Ahem. Am I reading this right? This struct is > 4k in size?
>>> If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?
>>
>> Hmmm... yeap but would this be an acutual problem?
>
> Since rule blob can be allocated via vmalloc i guess "no", its not
> really a problem unless someone needs realy insane amount of such rules.
>
> I don't have any better suggestion, so I guess its necessary evil.
>
> The only other question I have is wheter PATH_MAX might be a possible
> ABI breaker in future. It would have to be guaranteed that this is the
> same size forever, else you'd get strange errors on rule insertion if
> the sizes of the kernel and userspace version differs.

Haven't looked deeply into kernfs, but if it's possible to get the object
from the struct file eventually, you could let iptables frontend open that
path and just pass the fd down. Would be sizeof(int) vs PATH_MAX then, i.e.
when you have a large number of rules to load.

2015-11-23 13:51:50

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

On 11/23/2015 02:43 PM, Daniel Borkmann wrote:
> On 11/21/2015 07:54 PM, Florian Westphal wrote:
>> Tejun Heo <[email protected]> wrote:
>>> On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote:
>>>>> +struct xt_cgroup_info_v1 {
>>>>> + __u8 has_path;
>>>>> + __u8 has_classid;
>>>>> + __u8 invert_path;
>>>>> + __u8 invert_classid;
>>>>> + char path[PATH_MAX];
>>>>> + __u32 classid;
>>>>> +
>>>>> + /* kernel internal data */
>>>>> + void *priv __attribute__((aligned(8)));
>>>>> +};
>>>>
>>>> Ahem. Am I reading this right? This struct is > 4k in size?
>>>> If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?
>>>
>>> Hmmm... yeap but would this be an acutual problem?
>>
>> Since rule blob can be allocated via vmalloc i guess "no", its not
>> really a problem unless someone needs realy insane amount of such rules.
>>
>> I don't have any better suggestion, so I guess its necessary evil.
>>
>> The only other question I have is wheter PATH_MAX might be a possible
>> ABI breaker in future. It would have to be guaranteed that this is the
>> same size forever, else you'd get strange errors on rule insertion if
>> the sizes of the kernel and userspace version differs.
>
> Haven't looked deeply into kernfs, but if it's possible to get the object
> from the struct file eventually, you could let iptables frontend open that
> path and just pass the fd down. Would be sizeof(int) vs PATH_MAX then, i.e.
> when you have a large number of rules to load.

( ... but with the downside that things like save/restore wouldn't work. )

2015-11-23 15:40:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

Hello, Daniel.

On Mon, Nov 23, 2015 at 02:43:12PM +0100, Daniel Borkmann wrote:
...
> Haven't looked deeply into kernfs, but if it's possible to get the object
> from the struct file eventually, you could let iptables frontend open that
> path and just pass the fd down. Would be sizeof(int) vs PATH_MAX then, i.e.
> when you have a large number of rules to load.

That works in one direction but not in the other. ie. You can tell
the kernel the path that way but can't retrieve. If using path string
is unacceptable, the best alternative would be an inode number rather
than a fd; however, using an inode number would be quite cumbersome
and painful too.

Thanks.

--
tejun

2015-11-23 15:41:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

Hello,

On Mon, Nov 23, 2015 at 01:43:01PM +0100, Daniel Wagner wrote:
> Hi Tejun,
>
> On 11/21/2015 05:14 PM, Tejun Heo wrote:> +static int
> > cgroup_mt_check_v1(const struct xt_mtchk_param *par)
> > +{
> > + struct xt_cgroup_info_v1 *info = par->matchinfo;
> > + struct cgroup *cgrp;
> > +
> > + if ((info->invert_path & ~1) || (info->invert_classid & ~1))
> > + return -EINVAL;
>
> The checks below use pr_info() in case the configuration is not valid.
> Is this missing here on purpose?

It's mostly copied from v0 function but I think it makes sense. The
other errors can be caused by incorrect user input but the above one
can't happen unless iptables extension itself is broken.

> I have tested it slightly and it seems to work (also on an older
> kernel). I don't know if that qualifies it for a Tested-by but at least
> Acked-by should do the trick:

Will answer that there.

> Tested-by: Daniel Wagner <[email protected]>
> Acked-by: Daniel Wagner <[email protected]>

Thanks.

--
tejun

2015-11-23 15:48:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 7/9] sock, cgroup: add sock->sk_cgroup

Hello,

On Mon, Nov 23, 2015 at 02:02:03PM +0100, Daniel Wagner wrote:
> On 11/21/2015 05:13 PM, Tejun Heo wrote:
> > Signed-off-by: Tejun Heo <[email protected]>
> > Cc: Daniel Borkmann <[email protected]>
> > Cc: Daniel Wagner <[email protected]>
>
> I did a quick test and for new connection the cgroup2 match worked as
> expected. For an existing connection I wasn't able to trigger the match.
>
> It is quite likely I do something wrong:
>
> ssh into the box
> # mkdir /sys/fs/cgroup/test
> # echo $$ > /sys/fs/cgroup/test/cgroup.procs
> # echo $PPID > /sys/fs/cgroup/test/cgroup.procs
> # iptables -A OUTPUT -m cgroup --path test
>
> Should I see matches with the existing ssh session?

Socket is associated with the creating cgroup and stays associated
with that cgroup until it's released. Migrating the process doesn't
change the ownership of the sockets it has created. This is in line
with how other stateful resources such as memory are handled in
cgroup2 hierarchy.

Thanks.

--
tejun

2015-11-23 15:53:33

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH 7/9] sock, cgroup: add sock->sk_cgroup

On 11/23/2015 04:48 PM, Tejun Heo wrote:
> On Mon, Nov 23, 2015 at 02:02:03PM +0100, Daniel Wagner wrote:
>> On 11/21/2015 05:13 PM, Tejun Heo wrote:
>>> Signed-off-by: Tejun Heo <[email protected]>
>>> Cc: Daniel Borkmann <[email protected]>
>>> Cc: Daniel Wagner <[email protected]>
>>
>> I did a quick test and for new connection the cgroup2 match worked as
>> expected. For an existing connection I wasn't able to trigger the match.
>>
>> It is quite likely I do something wrong:
>>
>> ssh into the box
>> # mkdir /sys/fs/cgroup/test
>> # echo $$ > /sys/fs/cgroup/test/cgroup.procs
>> # echo $PPID > /sys/fs/cgroup/test/cgroup.procs
>> # iptables -A OUTPUT -m cgroup --path test
>>
>> Should I see matches with the existing ssh session?
>
> Socket is associated with the creating cgroup and stays associated
> with that cgroup until it's released. Migrating the process doesn't
> change the ownership of the sockets it has created. This is in line
> with how other stateful resources such as memory are handled in
> cgroup2 hierarchy.

Thanks for the explanation. Looks good to me:

Tested-by: Daniel Wagner <[email protected]>
Acked-by: Daniel Wagner <[email protected]>

Thanks,
Daniel

2015-11-23 15:53:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

On Mon, Nov 23, 2015 at 09:54:32AM +0100, Daniel Wagner wrote:
...
> > [ 3.224665] BUG: spinlock bad magic on CPU#1, systemd/1
> > [ 3.225653] lock: cgroup_sk_update_lock+0x0/0x60, .magic: 00000000, .owner: systemd/1, .owner_cpu: 1
> > [ 3.227034] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #195
> > [ 3.227862] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> > [ 3.228906] ffffffff834a2160 ffff88007c043ad0 ffffffff81551edc ffff88007c028000
> > [ 3.229512] ffff88007c043af0 ffffffff81136868 ffffffff834a2160 ffff88007aff5940
> > [ 3.230105] ffff88007c043b08 ffffffff81136b05 ffffffff834a2160 ffff88007c043b20
> > [ 3.230716] Call Trace:
> > [ 3.230906] [<ffffffff81551edc>] dump_stack+0x4e/0x82
> > [ 3.231289] [<ffffffff81136868>] spin_dump+0x78/0xc0
> > [ 3.231642] [<ffffffff81136b05>] do_raw_spin_unlock+0x75/0xd0
> > [ 3.232039] [<ffffffff81bced77>] _raw_spin_unlock+0x27/0x50
> > [ 3.232431] [<ffffffff819b1848>] update_classid_sock+0x68/0x80
> > [ 3.232836] [<ffffffff812855c1>] iterate_fd+0x71/0x150
> > [ 3.233197] [<ffffffff819b1757>] update_classid+0x47/0x80
> > [ 3.233571] [<ffffffff819b17d4>] cgrp_attach+0x14/0x20
> > [ 3.233929] [<ffffffff81188951>] cgroup_taskset_migrate+0x1e1/0x330
> > [ 3.234366] [<ffffffff81188b95>] cgroup_migrate+0xf5/0x190
> > [ 3.235130] [<ffffffff81188da6>] cgroup_attach_task+0x176/0x200
> > [ 3.235953] [<ffffffff8118922d>] __cgroup_procs_write+0x2ad/0x460
> > [ 3.236805] [<ffffffff81189414>] cgroup_procs_write+0x14/0x20
> > [ 3.237205] [<ffffffff81185ae5>] cgroup_file_write+0x35/0x1c0
> > [ 3.237600] [<ffffffff812e25e1>] kernfs_fop_write+0x141/0x190
> > [ 3.237998] [<ffffffff81265e78>] __vfs_write+0x28/0xe0
> > [ 3.239554] [<ffffffff812665ec>] vfs_write+0xac/0x1a0
> > [ 3.240308] [<ffffffff81266f09>] SyS_write+0x49/0xb0
> > [ 3.240656] [<ffffffff81bcfb32>] entry_SYSCALL_64_fastpath+0x12/0x76
>
> I have enabled a few additional cgroup controllers as well, because I was
> trying to figure out why I only see the 'memory' cgroup controller in
> cgroup.controllers. pid and io show up but not net_prio or net_cls.
> Not sure why systemd (v227) is not mounting them.

net_prio and net_cls aren't gonna be on the v2 hierarchy. The match
in this patchset is being introduced to replace them; however, you can
mount them separately on a v1 hierarchy and use the same as before.

> Though, after a while a similar call trace is produced. I guess this
> has nothing to do with the current changes.
>
> [ 11.594536] ------------[ cut here ]------------
> [ 11.595274] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
> [ 11.595958] Modules linked in:
> [ 11.596199] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #196
> [ 11.596689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [ 11.597632] ffffffff81f66d8b ffff88007c04bb90 ffffffff8155ccdc 0000000000000000
> [ 11.598234] ffff88007c04bbc8 ffffffff810de202 ffff8800793dda00 ffff88007a096800
> [ 11.598877] ffff88007c04bc80 ffff88007a6b6200 0000000000000001 ffff88007c04bbd8
> [ 11.599547] Call Trace:
> [ 11.599784] [<ffffffff8155ccdc>] dump_stack+0x4e/0x82
> [ 11.600197] [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
> [ 11.600705] [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
> [ 11.601208] [<ffffffff8118e261>] pids_cancel.constprop.6+0x31/0x40
> [ 11.601764] [<ffffffff8118e32d>] pids_can_attach+0x6d/0xf0

Yeah, this is a known problem regarding css's lifetime. Working on
it. The earlier dump, I think, is likely to have been caused by the
same issue.

Thanks.

--
tejun

2015-11-23 15:57:23

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

On 11/23/2015 04:53 PM, Tejun Heo wrote:
> On Mon, Nov 23, 2015 at 09:54:32AM +0100, Daniel Wagner wrote:
> ...
>>> [ 3.224665] BUG: spinlock bad magic on CPU#1, systemd/1
>>> [ 3.225653] lock: cgroup_sk_update_lock+0x0/0x60, .magic: 00000000, .owner: systemd/1, .owner_cpu: 1
>>> [ 3.227034] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #195
>>> [ 3.227862] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>> [ 3.228906] ffffffff834a2160 ffff88007c043ad0 ffffffff81551edc ffff88007c028000
>>> [ 3.229512] ffff88007c043af0 ffffffff81136868 ffffffff834a2160 ffff88007aff5940
>>> [ 3.230105] ffff88007c043b08 ffffffff81136b05 ffffffff834a2160 ffff88007c043b20
>>> [ 3.230716] Call Trace:
>>> [ 3.230906] [<ffffffff81551edc>] dump_stack+0x4e/0x82
>>> [ 3.231289] [<ffffffff81136868>] spin_dump+0x78/0xc0
>>> [ 3.231642] [<ffffffff81136b05>] do_raw_spin_unlock+0x75/0xd0
>>> [ 3.232039] [<ffffffff81bced77>] _raw_spin_unlock+0x27/0x50
>>> [ 3.232431] [<ffffffff819b1848>] update_classid_sock+0x68/0x80
>>> [ 3.232836] [<ffffffff812855c1>] iterate_fd+0x71/0x150
>>> [ 3.233197] [<ffffffff819b1757>] update_classid+0x47/0x80
>>> [ 3.233571] [<ffffffff819b17d4>] cgrp_attach+0x14/0x20
>>> [ 3.233929] [<ffffffff81188951>] cgroup_taskset_migrate+0x1e1/0x330
>>> [ 3.234366] [<ffffffff81188b95>] cgroup_migrate+0xf5/0x190
>>> [ 3.235130] [<ffffffff81188da6>] cgroup_attach_task+0x176/0x200
>>> [ 3.235953] [<ffffffff8118922d>] __cgroup_procs_write+0x2ad/0x460
>>> [ 3.236805] [<ffffffff81189414>] cgroup_procs_write+0x14/0x20
>>> [ 3.237205] [<ffffffff81185ae5>] cgroup_file_write+0x35/0x1c0
>>> [ 3.237600] [<ffffffff812e25e1>] kernfs_fop_write+0x141/0x190
>>> [ 3.237998] [<ffffffff81265e78>] __vfs_write+0x28/0xe0
>>> [ 3.239554] [<ffffffff812665ec>] vfs_write+0xac/0x1a0
>>> [ 3.240308] [<ffffffff81266f09>] SyS_write+0x49/0xb0
>>> [ 3.240656] [<ffffffff81bcfb32>] entry_SYSCALL_64_fastpath+0x12/0x76
>>
>> I have enabled a few additional cgroup controllers as well, because I was
>> trying to figure out why I only see the 'memory' cgroup controller in
>> cgroup.controllers. pid and io show up but not net_prio or net_cls.
>> Not sure why systemd (v227) is not mounting them.
>
> net_prio and net_cls aren't gonna be on the v2 hierarchy. The match
> in this patchset is being introduced to replace them; however, you can
> mount them separately on a v1 hierarchy and use the same as before.

Okay, I could have figured that myself I guess. I mounted the v1
hierarchy and it works as you have described it.

2015-11-23 17:38:03

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match

From: Florian Westphal
> Sent: 21 November 2015 16:56
> > +struct xt_cgroup_info_v1 {
> > + __u8 has_path;
> > + __u8 has_classid;
> > + __u8 invert_path;
> > + __u8 invert_classid;
> > + char path[PATH_MAX];
> > + __u32 classid;
> > +
> > + /* kernel internal data */
> > + void *priv __attribute__((aligned(8)));
> > +};
>
> Ahem. Am I reading this right? This struct is > 4k in size?
> If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?

I've not looked at the use, but could you put 'char path[];'
as the last member an require any allocations to be long enough
to contain the actual path?

David

2015-11-23 17:55:59

by Jan Engelhardt

[permalink] [raw]
Subject: RE: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match


On Monday 2015-11-23 18:35, David Laight wrote:
>From: Florian Westphal
>> Sent: 21 November 2015 16:56
>> > +struct xt_cgroup_info_v1 {
>> > + char path[PATH_MAX];
>> > + __u32 classid;
>> > +
>> > + /* kernel internal data */
>> > + void *priv __attribute__((aligned(8)));
>> > +};
>>
>> Ahem. Am I reading this right? This struct is > 4k in size?
>> If so -- Ugh. Does sizeof(path) really have to be PATH_MAX?
>
>I've not looked at the use, but could you put 'char path[];'
>as the last member an require any allocations to be long enough
>to contain the actual path?

Oh, smart :) Yeah, ebt_among does something like that.
(.matchsize = -1, hint)

Except that the "priv" pointer seems to be ruining the fun here -
kernel vars have to be last, which collides with the requirements
for []-type members.

2015-11-23 19:58:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

Hello,

On Mon, Nov 23, 2015 at 10:53:46AM -0500, Tejun Heo wrote:
> > [ 11.594536] ------------[ cut here ]------------
> > [ 11.595274] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
> > [ 11.595958] Modules linked in:
> > [ 11.596199] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #196
> > [ 11.596689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> > [ 11.597632] ffffffff81f66d8b ffff88007c04bb90 ffffffff8155ccdc 0000000000000000
> > [ 11.598234] ffff88007c04bbc8 ffffffff810de202 ffff8800793dda00 ffff88007a096800
> > [ 11.598877] ffff88007c04bc80 ffff88007a6b6200 0000000000000001 ffff88007c04bbd8
> > [ 11.599547] Call Trace:
> > [ 11.599784] [<ffffffff8155ccdc>] dump_stack+0x4e/0x82
> > [ 11.600197] [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
> > [ 11.600705] [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
> > [ 11.601208] [<ffffffff8118e261>] pids_cancel.constprop.6+0x31/0x40
> > [ 11.601764] [<ffffffff8118e32d>] pids_can_attach+0x6d/0xf0
>
> Yeah, this is a known problem regarding css's lifetime. Working on
> it. The earlier dump, I think, is likely to have been caused by the
> same issue.

Just posted the fix for this issue. Can you please verify the fix?

http://lkml.kernel.org/g/[email protected]

Thanks a lot!

--
tejun

2015-11-23 20:45:30

by David Miller

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

From: Tejun Heo <[email protected]>
Date: Sat, 21 Nov 2015 11:13:52 -0500

> * Refreshed on top of Nina's net_cls dynamic config update fix patch.
> I included the fix patch as part of this series to ease reviewing.

I put this into the 'net' tree as it's a bug fix, so can you respin
this after I next merge 'net' into 'net-next'? I'll let you know.

There'll probably be at least some minor feedback meanwhile anyways.

Thanks.

2015-11-23 20:54:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup

On Mon, Nov 23, 2015 at 03:45:23PM -0500, David Miller wrote:
> > * Refreshed on top of Nina's net_cls dynamic config update fix patch.
> > I included the fix patch as part of this series to ease reviewing.
>
> I put this into the 'net' tree as it's a bug fix, so can you respin
> this after I next merge 'net' into 'net-next'? I'll let you know.

Sure thing.

Thanks.

--
tejun