LinuxLists.cc - [PATCH] cgroup: Reorganize css_set

2022-09-05 17:17:48

Subject: [PATCH] cgroup: Reorganize css_set_lock and kernfs path processing

The commit 74e4b956eb1c incorrectly wrapped kernfs_walk_and_get
(might_sleep) under css_set_lock (spinlock). css_set_lock is needed by
__cset_cgroup_from_root to ensure stable cset->cgrp_links. The returned
cgroup object is pinned by the css_set (*).

Because current cannot switch namespace asynchronously, the css_set is
also pinned by ns_proxy->cgroup_ns (regardless of current's cgroup
migration).

Kernfs code that traverses paths with relative root_cgroup not need
css_set_lock.

(*) Except for root cgroups. The default hierarchy root (under which
cgroup id and path resolution only happens) is eternal so it's moot.
cgroup_show_path (VFS callback) is expected to be synchronized (**) wrt
kill_sb (VFS callback) (mnt_namespace.list with namespace_sem).
(**) If not, it's still an independent issue from this and the fixed one.

Fixes: 74e4b956eb1c: ("cgroup: Honor caller's cgroup NS when resolving path")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
---
kernel/cgroup/cgroup.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

I considered adding get_cgroup() into current_cgns_cgroup_from_root to
avoid reliance on the transitive pinning via css_set.
After reasoning about no asynchronous NS switch and v1 hiearchies kill_sb it
didn't seem to bring that much benefit (it didn't compose well with
BUG_ON(!cgrp) neither).

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e0b72eb5d283..8c9497f01332 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1391,11 +1391,16 @@ static void cgroup_destroy_root(struct cgroup_root *root)
cgroup_free_root(root);
}

+/*
+ * Returned cgroup is without refcount but it's valid as long as cset pins it.
+ */
static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
struct cgroup_root *root)
{
struct cgroup *res_cgroup = NULL;

+ lockdep_assert_held(&css_set_lock);
+
if (cset == &init_css_set) {
res_cgroup = &root->cgrp;
} else if (root == &cgrp_dfl_root) {
@@ -1426,8 +1431,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
struct cgroup *res = NULL;
struct css_set *cset;

- lockdep_assert_held(&css_set_lock);
-
rcu_read_lock();

cset = current->nsproxy->cgroup_ns->root_cset;
@@ -1446,7 +1449,6 @@ static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
struct cgroup *res = NULL;

lockdep_assert_held(&cgroup_mutex);
- lockdep_assert_held(&css_set_lock);

res = __cset_cgroup_from_root(cset, root);

@@ -1861,8 +1863,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,

spin_lock_irq(&css_set_lock);
ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
- len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
spin_unlock_irq(&css_set_lock);
+ len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);

if (len >= PATH_MAX)
len = -ERANGE;
@@ -6649,8 +6651,8 @@ struct cgroup *cgroup_get_from_path(const char *path)

spin_lock_irq(&css_set_lock);
root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
- kn = kernfs_walk_and_get(root_cgrp->kn, path);
spin_unlock_irq(&css_set_lock);
+ kn = kernfs_walk_and_get(root_cgrp->kn, path);
if (!kn)
goto out;

base-commit: a8c52eba880a6e8c07fc2130604f8e386b90b763
--
2.37.0

2022-09-06 18:23:07

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH] cgroup: Reorganize css_set_lock and kernfs path processing

Hello, Michal.

On Mon, Sep 05, 2022 at 07:09:44PM +0200, Michal Koutn? wrote:
> I considered adding get_cgroup() into current_cgns_cgroup_from_root to
> avoid reliance on the transitive pinning via css_set.
> After reasoning about no asynchronous NS switch and v1 hiearchies kill_sb it
> didn't seem to bring that much benefit (it didn't compose well with
> BUG_ON(!cgrp) neither).

I still think this is too subtle and incidental. If we go this way, we'd
need to add comments explaining why this obviously confusing pattern (lock,
find obj, unlock, use obj) is being used which goes into how the object is
directly pinned through the css_set which happens to be pinned also
because... and so on. Even if the code looks a bit uglier, I'd much prefer
straight-forward pinning.

Thanks.

--
tejun

2022-09-28 11:35:18

by Michal Koutný

[permalink] [raw]

Subject: [PATCH v2] cgroup: Reorganize css_set_lock and kernfs path processing

2022-10-05 17:09:38

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH v2] cgroup: Reorganize css_set_lock and kernfs path processing

Hello,

On Wed, Sep 28, 2022 at 01:33:16PM +0200, Michal Koutn? wrote:
...
> I realized the pinning with reference taking won't really work
> generally. The code would get the reference within RCU read section, so
> it'd have to be cgroup_get_live() and if that fails there's not much to
> do.

Hmm... isn't current's root cgrp guaranteed to be alive? How would
cgroup_get_live() fail? Also, shouldn't cgroup_get() enough for path
walking?

> @@ -6673,8 +6678,8 @@ struct cgroup *cgroup_get_from_path(const char *path)
>
> spin_lock_irq(&css_set_lock);
> root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
> - kn = kernfs_walk_and_get(root_cgrp->kn, path);
> spin_unlock_irq(&css_set_lock);
> + kn = kernfs_walk_and_get(root_cgrp->kn, path);

If you really wanna do it this way, can you please add a detailed comment
here why this is safe? But I'd prefer just doing a strightforward ref
inc/dec around it.

Thanks.

--
tejun

2022-10-10 08:42:34

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH v2] cgroup: Reorganize css_set_lock and kernfs path processing

On Wed, Oct 05, 2022 at 06:47:31AM -1000, Tejun Heo <[email protected]> wrote:
> Hmm... isn't current's root cgrp guaranteed to be alive?

True on the default hierarchy. v1 hierarchies (singular ones with root
cgroup only) can be unmounted.

> How would cgroup_get_live() fail?

kill_sb is not synchronized via css_set_lock.

> Also, shouldn't cgroup_get() enough for path walking?

If ref count dropped to zero, release callback (css_release_work_fn)
would be queued, cgroup_get would increase the refcount but it won't
cancel this.

Note these were concerns with the first version of the patch that also touched
cgroup_show_path() (that processes v1 hierarchies too). With the
reduction I avoided this.

Strictly speaking, even css_set_lock is unnecessary around
current_cgns_cgroup_from_root() when called with cgrp_dfl_root as the
cset->cgrp_links is not traversed at all.

> If you really wanna do it this way, can you please add a detailed comment
> here why this is safe? But I'd prefer just doing a strightforward ref
> inc/dec around it.

I see the the extraction under css_set_lock without inc/dec turns out
confusing. Let me expand the idea above and avoid css_set_lock
completely (another message).

Michal

2022-10-10 09:20:58

by Michal Koutný

[permalink] [raw]

Subject: [PATCH v3] cgroup: Reorganize css_set_lock and kernfs path processing

The commit 74e4b956eb1c incorrectly wrapped kernfs_walk_and_get
(might_sleep) under css_set_lock (spinlock). css_set_lock is needed by
__cset_cgroup_from_root to ensure stable cset->cgrp_links but not for
kernfs_walk_and_get.

We only need to make sure that the returned root_cgrp won't be freed
under us. This is given in the case of global root because it is static
(cgrp_dfl_root.cgrp). When the root_cgrp is lower in the hierarchy, it
is pinned by cgroup_ns->root_cset (and `current` task cannot switch
namespace asynchronously so ns_proxy pins cgroup_ns).

Note this reasoning won't hold for root cgroups in v1 hierarchies,
therefore create a special-cased helper function just for the default
hierarchy.

Fixes: 74e4b956eb1c ("cgroup: Honor caller's cgroup NS when resolving path")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
---
kernel/cgroup/cgroup.c | 41 ++++++++++++++++++++++++++++-------------
1 file changed, 28 insertions(+), 13 deletions(-)

Tested only with test_core selftests (i.e. the path/id resolution not
checed, only the migration code).

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c37b8265c0a3..a7ec96f26997 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1392,6 +1392,9 @@ static void cgroup_destroy_root(struct cgroup_root *root)
cgroup_free_root(root);
}

+/*
+ * Returned cgroup is without refcount but it's valid as long as cset pins it.
+ */
static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
struct cgroup_root *root)
{
@@ -1403,6 +1406,7 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
res_cgroup = cset->dfl_cgrp;
} else {
struct cgrp_cset_link *link;
+ lockdep_assert_held(&css_set_lock);

list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
struct cgroup *c = link->cgrp;
@@ -1414,6 +1418,7 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
}
}

+ BUG_ON(!res_cgroup);
return res_cgroup;
}

@@ -1436,23 +1441,37 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)

rcu_read_unlock();

- BUG_ON(!res);
return res;
}

+/*
+ * look up cgroup associated with current task's cgroup namespace on the
+ * default hierarchy
+ *
+ * Note this doesn't need locks unlike generic colleagues. Why?
+ * - Internal rcu_read_lock is unnecessary because we don't dereference any rcu
+ * pointers.
+ * - css_set_lock is not needed because we just read cset->dfl_cgrp.
+ * - As a bonus returned cgrp is pinned with the current because it cannot
+ * switch cgroup_ns asynchronously.
+ */
+static struct cgroup *
+current_cgns_cgroup_dfl(void)
+{
+ struct css_set *cset;
+
+ cset = current->nsproxy->cgroup_ns->root_cset;
+ return __cset_cgroup_from_root(cset, &cgrp_dfl_root);
+}
+
/* look up cgroup associated with given css_set on the specified hierarchy */
static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
struct cgroup_root *root)
{
- struct cgroup *res = NULL;
-
lockdep_assert_held(&cgroup_mutex);
lockdep_assert_held(&css_set_lock);

- res = __cset_cgroup_from_root(cset, root);
-
- BUG_ON(!res);
- return res;
+ return __cset_cgroup_from_root(cset, root);
}

/*
@@ -6085,9 +6104,7 @@ struct cgroup *cgroup_get_from_id(u64 id)
if (!cgrp)
return ERR_PTR(-ENOENT);

- spin_lock_irq(&css_set_lock);
- root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
- spin_unlock_irq(&css_set_lock);
+ root_cgrp = current_cgns_cgroup_dfl();
if (!cgroup_is_descendant(cgrp, root_cgrp)) {
cgroup_put(cgrp);
return ERR_PTR(-ENOENT);
@@ -6671,10 +6688,8 @@ struct cgroup *cgroup_get_from_path(const char *path)
struct cgroup *cgrp = ERR_PTR(-ENOENT);
struct cgroup *root_cgrp;

- spin_lock_irq(&css_set_lock);
- root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
+ root_cgrp = current_cgns_cgroup_dfl();
kn = kernfs_walk_and_get(root_cgrp->kn, path);
- spin_unlock_irq(&css_set_lock);
if (!kn)
goto out;

--
2.37.3

2022-10-10 20:56:10

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH v3] cgroup: Reorganize css_set_lock and kernfs path processing

On Mon, Oct 10, 2022 at 10:29:18AM +0200, Michal Koutn? wrote:
> The commit 74e4b956eb1c incorrectly wrapped kernfs_walk_and_get
> (might_sleep) under css_set_lock (spinlock). css_set_lock is needed by
> __cset_cgroup_from_root to ensure stable cset->cgrp_links but not for
> kernfs_walk_and_get.
>
> We only need to make sure that the returned root_cgrp won't be freed
> under us. This is given in the case of global root because it is static
> (cgrp_dfl_root.cgrp). When the root_cgrp is lower in the hierarchy, it
> is pinned by cgroup_ns->root_cset (and `current` task cannot switch
> namespace asynchronously so ns_proxy pins cgroup_ns).
>
> Note this reasoning won't hold for root cgroups in v1 hierarchies,
> therefore create a special-cased helper function just for the default
> hierarchy.
>
> Fixes: 74e4b956eb1c ("cgroup: Honor caller's cgroup NS when resolving path")
> Reported-by: Dan Carpenter <[email protected]>
> Signed-off-by: Michal Koutn? <[email protected]>

Applied to cgroup/for-6.1-fixes w/ trivial comment / line break adjustments.

Thanks.

--
tejun