LinuxLists.cc - [PATCHSET v2 cgroup/for-5.16-fixes] cgroup: Use open-time creds and namespace for migration perm checks

2021-12-13 19:18:44

Subject: [PATCHSET v2 cgroup/for-5.16-fixes] cgroup: Use open-time creds and namespace for migration perm checks

Hello,

v2: 0002-cgroup-Allocate-cgroup_file_ctx-for-kernfs_open_file updated to
drop the union and embed procs.iter as suggested by Linus.

cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's credentials and cgroup
namespace which is a potential security weakness as it may allow scenarios
where a less privileged process tricks a more privileged one into writing
into a fd that it created.

This patchset make the perm checks use credentials and cgroup namespace
stored at the time of open and contains the following patches.

0001-cgroup-Use-open-time-credentials-for-process-migrato.patch
0002-cgroup-Allocate-cgroup_file_ctx-for-kernfs_open_file.patch
0003-cgroup-Use-open-time-cgroup-namespace-for-process-mi.patch
0004-selftests-cgroup-Make-cg_create-use-0755-for-permiss.patch
0005-selftests-cgroup-Test-open-time-credential-usage-for.patch
0006-selftests-cgroup-Test-open-time-cgroup-namespace-usa.patch

The patchset is also available in the following git branch. If there's no
objetion, I will apply to cgroup/for-5.16-fixes in a few days.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-migration-perms-1

diffstat follows. Thank you.

kernel/cgroup/cgroup-internal.h | 13 ++
kernel/cgroup/cgroup-v1.c | 7 -
kernel/cgroup/cgroup.c | 88 +++++++++-----
tools/testing/selftests/cgroup/cgroup_util.c | 2
tools/testing/selftests/cgroup/test_core.c | 165 +++++++++++++++++++++++++++
5 files changed, 243 insertions(+), 32 deletions(-)

--
tejun

2021-12-13 19:18:46

by Tejun Heo

[permalink] [raw]

Subject: [PATCH 1/6] cgroup: Use open-time credentials for process migraton perm checks

2021-12-13 19:18:51

by Tejun Heo

[permalink] [raw]

Subject: [PATCH 3/6] cgroup: Use open-time cgroup namespace for process migration perm checks

cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's cgroup namespace which is
a potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.

This patch makes cgroup remember the cgroup namespace at the time of open
and uses it for migration permission checks instad of current's. Note that
this only applies to cgroup2 as cgroup1 doesn't have namespace support.

This also fixes a use-after-free bug on cgroupns reported in

https://lore.kernel.org/r/[email protected]

Note that backporting this fix also requires the preceding patch.

Reported-by: "Eric W. Biederman" <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Reported-by: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup/cgroup-internal.h | 2 ++
kernel/cgroup/cgroup.c | 28 +++++++++++++++++++---------
2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 0ced1e04177f..9ead575f9678 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -66,6 +66,8 @@ static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
}

struct cgroup_file_ctx {
+ struct cgroup_namespace *ns;
+
struct {
bool started;
struct css_task_iter iter;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a84631d08d98..cafb8c114a21 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3822,14 +3822,19 @@ static int cgroup_file_open(struct kernfs_open_file *of)
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
+
+ ctx->ns = current->nsproxy->cgroup_ns;
+ get_cgroup_ns(ctx->ns);
of->priv = ctx;

if (!cft->open)
return 0;

ret = cft->open(of);
- if (ret)
+ if (ret) {
+ put_cgroup_ns(ctx->ns);
kfree(ctx);
+ }
return ret;
}

@@ -3840,13 +3845,14 @@ static void cgroup_file_release(struct kernfs_open_file *of)

if (cft->release)
cft->release(of);
+ put_cgroup_ns(ctx->ns);
kfree(ctx);
}

static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+ struct cgroup_file_ctx *ctx = of->priv;
struct cgroup *cgrp = of->kn->parent->priv;
struct cftype *cft = of_cft(of);
struct cgroup_subsys_state *css;
@@ -3863,7 +3869,7 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
*/
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
- ns != &init_cgroup_ns && ns->root_cset->dfl_cgrp == cgrp)
+ ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
return -EPERM;

if (cft->write)
@@ -4853,9 +4859,9 @@ static int cgroup_may_write(const struct cgroup *cgrp, struct super_block *sb)

static int cgroup_procs_write_permission(struct cgroup *src_cgrp,
struct cgroup *dst_cgrp,
- struct super_block *sb)
+ struct super_block *sb,
+ struct cgroup_namespace *ns)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct cgroup *com_cgrp = src_cgrp;
int ret;

@@ -4884,11 +4890,12 @@ static int cgroup_procs_write_permission(struct cgroup *src_cgrp,

static int cgroup_attach_permissions(struct cgroup *src_cgrp,
struct cgroup *dst_cgrp,
- struct super_block *sb, bool threadgroup)
+ struct super_block *sb, bool threadgroup,
+ struct cgroup_namespace *ns)
{
int ret = 0;

- ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb);
+ ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb, ns);
if (ret)
return ret;

@@ -4905,6 +4912,7 @@ static int cgroup_attach_permissions(struct cgroup *src_cgrp,
static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
bool threadgroup)
{
+ struct cgroup_file_ctx *ctx = of->priv;
struct cgroup *src_cgrp, *dst_cgrp;
struct task_struct *task;
const struct cred *saved_cred;
@@ -4932,7 +4940,8 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
*/
saved_cred = override_creds(of->file->f_cred);
ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
- of->file->f_path.dentry->d_sb, threadgroup);
+ of->file->f_path.dentry->d_sb,
+ threadgroup, ctx->ns);
revert_creds(saved_cred);
if (ret)
goto out_finish;
@@ -6152,7 +6161,8 @@ static int cgroup_css_set_fork(struct kernel_clone_args *kargs)
goto err;

ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb,
- !(kargs->flags & CLONE_THREAD));
+ !(kargs->flags & CLONE_THREAD),
+ current->nsproxy->cgroup_ns);
if (ret)
goto err;

--
2.34.1

2021-12-13 19:18:55

by Tejun Heo

[permalink] [raw]

Subject: [PATCH 4/6] selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644

0644 is an odd perm to create a cgroup which is a directory. Use the regular
0755 instead. This is necessary for euid switching test case.

Signed-off-by: Tejun Heo <[email protected]>
---
tools/testing/selftests/cgroup/cgroup_util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 623cec04ad42..0cf7e90c0052 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -221,7 +221,7 @@ int cg_find_unified_root(char *root, size_t len)

int cg_create(const char *cgroup)
{
- return mkdir(cgroup, 0644);
+ return mkdir(cgroup, 0755);
}

int cg_wait_for_proc_count(const char *cgroup, int count)
--
2.34.1

2021-12-13 19:18:59

by Tejun Heo

[permalink] [raw]

Subject: [PATCH 6/6] selftests: cgroup: Test open-time cgroup namespace usage for migration checks

When a task is writing to an fd opened by a different task, the perm check
should use the cgroup namespace of the latter task. Add a test for it.

Signed-off-by: Tejun Heo <[email protected]>
---
tools/testing/selftests/cgroup/test_core.c | 97 ++++++++++++++++++++++
1 file changed, 97 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index 01b766506973..600123503063 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -1,11 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0 */

+#define _GNU_SOURCE
#include <linux/limits.h>
+#include <linux/sched.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
+#include <sched.h>
#include <stdio.h>
#include <errno.h>
#include <signal.h>
@@ -741,6 +744,99 @@ static int test_cgcore_lesser_euid_open(const char *root)
return ret;
}

+struct lesser_ns_open_thread_arg {
+ const char *path;
+ int fd;
+ int err;
+};
+
+static int lesser_ns_open_thread_fn(void *arg)
+{
+ struct lesser_ns_open_thread_arg *targ = arg;
+
+ targ->fd = open(targ->path, O_RDWR);
+ targ->err = errno;
+ return 0;
+}
+
+/*
+ * cgroup migration permission check should be performed based on the cgroup
+ * namespace at the time of open instead of write.
+ */
+static int test_cgcore_lesser_ns_open(const char *root)
+{
+ static char stack[65536];
+ const uid_t test_euid = 65534; /* usually nobody, any !root is fine */
+ int ret = KSFT_FAIL;
+ char *cg_test_a = NULL, *cg_test_b = NULL;
+ char *cg_test_a_procs = NULL, *cg_test_b_procs = NULL;
+ int cg_test_b_procs_fd = -1;
+ struct lesser_ns_open_thread_arg targ = { .fd = -1 };
+ pid_t pid;
+ int status;
+
+ cg_test_a = cg_name(root, "cg_test_a");
+ cg_test_b = cg_name(root, "cg_test_b");
+
+ if (!cg_test_a || !cg_test_b)
+ goto cleanup;
+
+ cg_test_a_procs = cg_name(cg_test_a, "cgroup.procs");
+ cg_test_b_procs = cg_name(cg_test_b, "cgroup.procs");
+
+ if (!cg_test_a_procs || !cg_test_b_procs)
+ goto cleanup;
+
+ if (cg_create(cg_test_a) || cg_create(cg_test_b))
+ goto cleanup;
+
+ if (cg_enter_current(cg_test_b))
+ goto cleanup;
+
+ if (chown(cg_test_a_procs, test_euid, -1) ||
+ chown(cg_test_b_procs, test_euid, -1))
+ goto cleanup;
+
+ targ.path = cg_test_b_procs;
+ pid = clone(lesser_ns_open_thread_fn, stack + sizeof(stack),
+ CLONE_NEWCGROUP | CLONE_FILES | CLONE_VM | SIGCHLD,
+ &targ);
+ if (pid < 0)
+ goto cleanup;
+
+ if (waitpid(pid, &status, 0) < 0)
+ goto cleanup;
+
+ if (!WIFEXITED(status))
+ goto cleanup;
+
+ cg_test_b_procs_fd = targ.fd;
+ if (cg_test_b_procs_fd < 0)
+ goto cleanup;
+
+ if (cg_enter_current(cg_test_a))
+ goto cleanup;
+
+ if ((status = write(cg_test_b_procs_fd, "0", 1)) >= 0 || errno != ENOENT)
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ cg_enter_current(root);
+ if (cg_test_b_procs_fd >= 0)
+ close(cg_test_b_procs_fd);
+ if (cg_test_b)
+ cg_destroy(cg_test_b);
+ if (cg_test_a)
+ cg_destroy(cg_test_a);
+ free(cg_test_b_procs);
+ free(cg_test_a_procs);
+ free(cg_test_b);
+ free(cg_test_a);
+ return ret;
+}
+
#define T(x) { x, #x }
struct corecg_test {
int (*fn)(const char *root);
@@ -757,6 +853,7 @@ struct corecg_test {
T(test_cgcore_thread_migration),
T(test_cgcore_destroy),
T(test_cgcore_lesser_euid_open),
+ T(test_cgcore_lesser_ns_open),
};
#undef T

--
2.34.1

2021-12-13 19:19:02

by Tejun Heo

[permalink] [raw]

Subject: [PATCH 5/6] selftests: cgroup: Test open-time credential usage for migration checks

When a task is writing to an fd opened by a different task, the perm check
should use the credentials of the latter task. Add a test for it.

Signed-off-by: Tejun Heo <[email protected]>
---
tools/testing/selftests/cgroup/test_core.c | 68 ++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_core.c b/tools/testing/selftests/cgroup/test_core.c
index 3df648c37876..01b766506973 100644
--- a/tools/testing/selftests/cgroup/test_core.c
+++ b/tools/testing/selftests/cgroup/test_core.c
@@ -674,6 +674,73 @@ static int test_cgcore_thread_migration(const char *root)
return ret;
}

+/*
+ * cgroup migration permission check should be performed based on the
+ * credentials at the time of open instead of write.
+ */
+static int test_cgcore_lesser_euid_open(const char *root)
+{
+ const uid_t test_euid = 65534; /* usually nobody, any !root is fine */
+ int ret = KSFT_FAIL;
+ char *cg_test_a = NULL, *cg_test_b = NULL;
+ char *cg_test_a_procs = NULL, *cg_test_b_procs = NULL;
+ int cg_test_b_procs_fd = -1;
+ uid_t saved_uid;
+
+ cg_test_a = cg_name(root, "cg_test_a");
+ cg_test_b = cg_name(root, "cg_test_b");
+
+ if (!cg_test_a || !cg_test_b)
+ goto cleanup;
+
+ cg_test_a_procs = cg_name(cg_test_a, "cgroup.procs");
+ cg_test_b_procs = cg_name(cg_test_b, "cgroup.procs");
+
+ if (!cg_test_a_procs || !cg_test_b_procs)
+ goto cleanup;
+
+ if (cg_create(cg_test_a) || cg_create(cg_test_b))
+ goto cleanup;
+
+ if (cg_enter_current(cg_test_a))
+ goto cleanup;
+
+ if (chown(cg_test_a_procs, test_euid, -1) ||
+ chown(cg_test_b_procs, test_euid, -1))
+ goto cleanup;
+
+ saved_uid = geteuid();
+ if (seteuid(test_euid))
+ goto cleanup;
+
+ cg_test_b_procs_fd = open(cg_test_b_procs, O_RDWR);
+
+ if (seteuid(saved_uid))
+ goto cleanup;
+
+ if (cg_test_b_procs_fd < 0)
+ goto cleanup;
+
+ if (write(cg_test_b_procs_fd, "0", 1) >= 0 || errno != EACCES)
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ cg_enter_current(root);
+ if (cg_test_b_procs_fd >= 0)
+ close(cg_test_b_procs_fd);
+ if (cg_test_b)
+ cg_destroy(cg_test_b);
+ if (cg_test_a)
+ cg_destroy(cg_test_a);
+ free(cg_test_b_procs);
+ free(cg_test_a_procs);
+ free(cg_test_b);
+ free(cg_test_a);
+ return ret;
+}
+
#define T(x) { x, #x }
struct corecg_test {
int (*fn)(const char *root);
@@ -689,6 +756,7 @@ struct corecg_test {
T(test_cgcore_proc_migration),
T(test_cgcore_thread_migration),
T(test_cgcore_destroy),
+ T(test_cgcore_lesser_euid_open),
};
#undef T

--
2.34.1

2021-12-13 19:19:03

by Tejun Heo

[permalink] [raw]

Subject: [PATCH 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv

of->priv is currently used by each interface file implementation to store
private information. This patch collects the current two private data usages
into struct cgroup_file_ctx which is allocated and freed by the common path.
This allows generic private data which applies to multiple files, which will
be used to in the following patch.

Note that cgroup_procs iterator is now embedded as procs.iter in the new
cgroup_file_ctx so that it doesn't need to be allocated and freed
separately.

v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
cgroup_file_ctx as suggested by Linus.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
kernel/cgroup/cgroup-internal.h | 11 +++++++
kernel/cgroup/cgroup.c | 53 +++++++++++++++++++++------------
2 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc17a9d..0ced1e04177f 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -65,6 +65,17 @@ static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
return container_of(kfc, struct cgroup_fs_context, kfc);
}

+struct cgroup_file_ctx {
+ struct {
+ bool started;
+ struct css_task_iter iter;
+ } procs;
+
+ struct {
+ void *trigger;
+ } psi;
+};
+
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2632e46da1d4..a84631d08d98 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3630,6 +3630,7 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, enum psi_res res)
{
+ struct cgroup_file_ctx *ctx = of->priv;
struct psi_trigger *new;
struct cgroup *cgrp;
struct psi_group *psi;
@@ -3648,7 +3649,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return PTR_ERR(new);
}

- psi_trigger_replace(&of->priv, new);
+ psi_trigger_replace(&ctx->psi.trigger, new);

cgroup_put(cgrp);

@@ -3679,12 +3680,16 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
- return psi_trigger_poll(&of->priv, of->file, pt);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
}

static void cgroup_pressure_release(struct kernfs_open_file *of)
{
- psi_trigger_replace(&of->priv, NULL);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ psi_trigger_replace(&ctx->psi.trigger, NULL);
}

bool cgroup_psi_enabled(void)
@@ -3811,18 +3816,31 @@ static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
static int cgroup_file_open(struct kernfs_open_file *of)
{
struct cftype *cft = of_cft(of);
+ struct cgroup_file_ctx *ctx;
+ int ret;

- if (cft->open)
- return cft->open(of);
- return 0;
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ of->priv = ctx;
+
+ if (!cft->open)
+ return 0;
+
+ ret = cft->open(of);
+ if (ret)
+ kfree(ctx);
+ return ret;
}

static void cgroup_file_release(struct kernfs_open_file *of)
{
struct cftype *cft = of_cft(of);
+ struct cgroup_file_ctx *ctx = of->priv;

if (cft->release)
cft->release(of);
+ kfree(ctx);
}

static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
@@ -4751,21 +4769,21 @@ void css_task_iter_end(struct css_task_iter *it)

static void cgroup_procs_release(struct kernfs_open_file *of)
{
- if (of->priv) {
- css_task_iter_end(of->priv);
- kfree(of->priv);
- }
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ if (ctx->procs.started)
+ css_task_iter_end(&ctx->procs.iter);
}

static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
{
struct kernfs_open_file *of = s->private;
- struct css_task_iter *it = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;

if (pos)
(*pos)++;

- return css_task_iter_next(it);
+ return css_task_iter_next(&ctx->procs.iter);
}

static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
@@ -4773,21 +4791,18 @@ static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
{
struct kernfs_open_file *of = s->private;
struct cgroup *cgrp = seq_css(s)->cgroup;
- struct css_task_iter *it = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;
+ struct css_task_iter *it = &ctx->procs.iter;

/*
* When a seq_file is seeked, it's always traversed sequentially
* from position 0, so we can simply keep iterating on !0 *pos.
*/
- if (!it) {
+ if (!ctx->procs.started) {
if (WARN_ON_ONCE((*pos)))
return ERR_PTR(-EINVAL);
-
- it = kzalloc(sizeof(*it), GFP_KERNEL);
- if (!it)
- return ERR_PTR(-ENOMEM);
- of->priv = it;
css_task_iter_start(&cgrp->self, iter_flags, it);
+ ctx->procs.started = true;
} else if (!(*pos)) {
css_task_iter_end(it);
css_task_iter_start(&cgrp->self, iter_flags, it);
--
2.34.1

2021-12-13 19:30:02

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv

On Mon, Dec 13, 2021 at 11:18 AM Tejun Heo <[email protected]> wrote:
>
> v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
> cgroup_file_ctx as suggested by Linus.

Thanks.

Looking at the patch, I mostly like the end result, although it
highlights how odd that __cgroup_procs_start() thing is.

I wonder if that 'started' flag could be removed, and the
css_task_iter_start() thing could be done at allocation time?

But I may just be confused. This is not a NAK, just a "that pattern
looks slightly odd to me".

Linus

2021-12-13 19:56:31

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv

Hello,

On Mon, Dec 13, 2021 at 11:29:36AM -0800, Linus Torvalds wrote:
> Looking at the patch, I mostly like the end result, although it
> highlights how odd that __cgroup_procs_start() thing is.
>
> I wonder if that 'started' flag could be removed, and the
> css_task_iter_start() thing could be done at allocation time?
>
> But I may just be confused. This is not a NAK, just a "that pattern
> looks slightly odd to me".

Yeah, it's a bit odd. So...

css_task_iters are stateful and a bit expensive to initialize, so we don't
want to do it from generic open path. However, we can add an open method for
cgroup_files and then do it from there.

However, while these files don't allow random seeking, they do allow seeking
to 0 to re-read the whole content, so the iteration start path needs to
handle both the initial read and subsequent repeating reads. While this can
be handled by starting iter on open and then always restarting it on
seq_start, that isn't ideal given the prior point. So, if we want to handle
it in the seq_start path, we need to distinguish the initial read and
re-reads anyway.

This may be neater if we start the iter from open and restart from seek so
that the seq_start method doesn't have to distinguish the two. However,
kernfs doesn't forward the seek event downwards (yet anyway).

So, there's a bit of awkwardness but it at least isn't gratuitous given the
surroundings.

Thanks.

--
tejun

2021-12-14 17:03:46

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 1/6] cgroup: Use open-time credentials for process migraton perm checks

On Mon, Dec 13, 2021 at 09:18:28AM -1000, Tejun Heo <[email protected]> wrote:
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: "Eric W. Biederman" <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Cc: Michal Koutn? <[email protected]>

Just metadata suggestion:
Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy")

Besides that
Reviewed-by: Michal Koutn? <[email protected]>

2021-12-14 17:03:58

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv

On Mon, Dec 13, 2021 at 09:18:29AM -1000, Tejun Heo <[email protected]> wrote:
> static int cgroup_file_open(struct kernfs_open_file *of)

IIUC, this is common to v1 files too, i.e. cgroup_pidlist_start too.

> static void cgroup_file_release(struct kernfs_open_file *of)
> {
> struct cftype *cft = of_cft(of);
> + struct cgroup_file_ctx *ctx = of->priv;
>
> if (cft->release)
> cft->release(of);
> + kfree(ctx);

Here it could free a pointer to pidlist that has different lifecycle.

Perhaps add one more slot into cgroup_file_ctx for the pidlist pointer?

2021-12-14 17:04:07

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 3/6] cgroup: Use open-time cgroup namespace for process migration perm checks

On Mon, Dec 13, 2021 at 09:18:30AM -1000, Tejun Heo <[email protected]> wrote:
> ---
> kernel/cgroup/cgroup-internal.h | 2 ++
> kernel/cgroup/cgroup.c | 28 +++++++++++++++++++---------
> 2 files changed, 21 insertions(+), 9 deletions(-)

Reviewed-by: Michal Koutn? <[email protected]>

2021-12-14 17:04:15

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 4/6] selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644

On Mon, Dec 13, 2021 at 09:18:31AM -1000, Tejun Heo <[email protected]> wrote:
> ---
> tools/testing/selftests/cgroup/cgroup_util.c | 2 +-

Reviewed-by: Michal Koutn? <[email protected]>

2021-12-14 17:04:31

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 5/6] selftests: cgroup: Test open-time credential usage for migration checks

On Mon, Dec 13, 2021 at 09:18:32AM -1000, Tejun Heo <[email protected]> wrote:
> ---
> tools/testing/selftests/cgroup/test_core.c | 68 ++++++++++++++++++++++
> 1 file changed, 68 insertions(+)

Tested-by: Michal Koutn? <[email protected]>

(in the sense it fails on unpatched kernel :-)

2021-12-14 17:04:35

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 6/6] selftests: cgroup: Test open-time cgroup namespace usage for migration checks

On Mon, Dec 13, 2021 at 09:18:33AM -1000, Tejun Heo <[email protected]> wrote:
> ---
> tools/testing/selftests/cgroup/test_core.c | 97 ++++++++++++++++++++++
> 1 file changed, 97 insertions(+)

Tested-by: Michal Koutn? <[email protected]>

(in the sense it fails on unpatched kernel)

2021-12-14 19:44:33

by Tejun Heo

[permalink] [raw]

Subject: [PATCH v3 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv

From 9a572a58dc5b8046e22550301581d95e2b62fbcd Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Tue, 14 Dec 2021 09:42:03 -1000
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

of->priv is currently used by each interface file implementation to store
private information. This patch collects the current two private data usages
into struct cgroup_file_ctx which is allocated and freed by the common path.
This allows generic private data which applies to multiple files, which will
be used to in the following patch.

Note that cgroup_procs iterator is now embedded as procs.iter in the new
cgroup_file_ctx so that it doesn't need to be allocated and freed
separately.

v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in
cgroup_file_ctx as suggested by Linus.

v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too.
Converted. Didn't change to embedded allocation as cgroup1 pidlists get
stored for caching.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Koutn? <[email protected]>
---
The rest of the patchset applies as-is. The following git branch contains
the latest.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-migration-perms-2

Thanks.

kernel/cgroup/cgroup-internal.h | 17 +++++++++++
kernel/cgroup/cgroup-v1.c | 26 ++++++++--------
kernel/cgroup/cgroup.c | 53 +++++++++++++++++++++------------
3 files changed, 65 insertions(+), 31 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc17a9d..cf637bc4ab45 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -65,6 +65,23 @@ static inline struct cgroup_fs_context *cgroup_fc2context(struct fs_context *fc)
return container_of(kfc, struct cgroup_fs_context, kfc);
}

+struct cgroup_pidlist;
+
+struct cgroup_file_ctx {
+ struct {
+ void *trigger;
+ } psi;
+
+ struct {
+ bool started;
+ struct css_task_iter iter;
+ } procs;
+
+ struct {
+ struct cgroup_pidlist *pidlist;
+ } procs1;
+};
+
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 0e7369103ba6..41e0837a5a0b 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -394,6 +394,7 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
* next pid to display, if any
*/
struct kernfs_open_file *of = s->private;
+ struct cgroup_file_ctx *ctx = of->priv;
struct cgroup *cgrp = seq_css(s)->cgroup;
struct cgroup_pidlist *l;
enum cgroup_filetype type = seq_cft(s)->private;
@@ -403,25 +404,24 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
mutex_lock(&cgrp->pidlist_mutex);

/*
- * !NULL @of->priv indicates that this isn't the first start()
- * after open. If the matching pidlist is around, we can use that.
- * Look for it. Note that @of->priv can't be used directly. It
- * could already have been destroyed.
+ * !NULL @ctx->procs1.pidlist indicates that this isn't the first
+ * start() after open. If the matching pidlist is around, we can use
+ * that. Look for it. Note that @ctx->procs1.pidlist can't be used
+ * directly. It could already have been destroyed.
*/
- if (of->priv)
- of->priv = cgroup_pidlist_find(cgrp, type);
+ if (ctx->procs1.pidlist)
+ ctx->procs1.pidlist = cgroup_pidlist_find(cgrp, type);

/*
* Either this is the first start() after open or the matching
* pidlist has been destroyed inbetween. Create a new one.
*/
- if (!of->priv) {
- ret = pidlist_array_load(cgrp, type,
- (struct cgroup_pidlist **)&of->priv);
+ if (!ctx->procs1.pidlist) {
+ ret = pidlist_array_load(cgrp, type, &ctx->procs1.pidlist);
if (ret)
return ERR_PTR(ret);
}
- l = of->priv;
+ l = ctx->procs1.pidlist;

if (pid) {
int end = l->length;
@@ -449,7 +449,8 @@ static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
static void cgroup_pidlist_stop(struct seq_file *s, void *v)
{
struct kernfs_open_file *of = s->private;
- struct cgroup_pidlist *l = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;
+ struct cgroup_pidlist *l = ctx->procs1.pidlist;

if (l)
mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
@@ -460,7 +461,8 @@ static void cgroup_pidlist_stop(struct seq_file *s, void *v)
static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
{
struct kernfs_open_file *of = s->private;
- struct cgroup_pidlist *l = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;
+ struct cgroup_pidlist *l = ctx->procs1.pidlist;
pid_t *p = v;
pid_t *end = l->list + l->length;
/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2632e46da1d4..a84631d08d98 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3630,6 +3630,7 @@ static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, enum psi_res res)
{
+ struct cgroup_file_ctx *ctx = of->priv;
struct psi_trigger *new;
struct cgroup *cgrp;
struct psi_group *psi;
@@ -3648,7 +3649,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
return PTR_ERR(new);
}

- psi_trigger_replace(&of->priv, new);
+ psi_trigger_replace(&ctx->psi.trigger, new);

cgroup_put(cgrp);

@@ -3679,12 +3680,16 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
- return psi_trigger_poll(&of->priv, of->file, pt);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
}

static void cgroup_pressure_release(struct kernfs_open_file *of)
{
- psi_trigger_replace(&of->priv, NULL);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ psi_trigger_replace(&ctx->psi.trigger, NULL);
}

bool cgroup_psi_enabled(void)
@@ -3811,18 +3816,31 @@ static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
static int cgroup_file_open(struct kernfs_open_file *of)
{
struct cftype *cft = of_cft(of);
+ struct cgroup_file_ctx *ctx;
+ int ret;

- if (cft->open)
- return cft->open(of);
- return 0;
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ of->priv = ctx;
+
+ if (!cft->open)
+ return 0;
+
+ ret = cft->open(of);
+ if (ret)
+ kfree(ctx);
+ return ret;
}

static void cgroup_file_release(struct kernfs_open_file *of)
{
struct cftype *cft = of_cft(of);
+ struct cgroup_file_ctx *ctx = of->priv;

if (cft->release)
cft->release(of);
+ kfree(ctx);
}

static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
@@ -4751,21 +4769,21 @@ void css_task_iter_end(struct css_task_iter *it)

static void cgroup_procs_release(struct kernfs_open_file *of)
{
- if (of->priv) {
- css_task_iter_end(of->priv);
- kfree(of->priv);
- }
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ if (ctx->procs.started)
+ css_task_iter_end(&ctx->procs.iter);
}

static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
{
struct kernfs_open_file *of = s->private;
- struct css_task_iter *it = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;

if (pos)
(*pos)++;

- return css_task_iter_next(it);
+ return css_task_iter_next(&ctx->procs.iter);
}

static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
@@ -4773,21 +4791,18 @@ static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
{
struct kernfs_open_file *of = s->private;
struct cgroup *cgrp = seq_css(s)->cgroup;
- struct css_task_iter *it = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;
+ struct css_task_iter *it = &ctx->procs.iter;

/*
* When a seq_file is seeked, it's always traversed sequentially
* from position 0, so we can simply keep iterating on !0 *pos.
*/
- if (!it) {
+ if (!ctx->procs.started) {
if (WARN_ON_ONCE((*pos)))
return ERR_PTR(-EINVAL);
-
- it = kzalloc(sizeof(*it), GFP_KERNEL);
- if (!it)
- return ERR_PTR(-ENOMEM);
- of->priv = it;
css_task_iter_start(&cgrp->self, iter_flags, it);
+ ctx->procs.started = true;
} else if (!(*pos)) {
css_task_iter_end(it);
css_task_iter_start(&cgrp->self, iter_flags, it);
--
2.34.1

2021-12-15 07:37:16

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH v3 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv

On Tue, Dec 14, 2021 at 09:44:26AM -1000, Tejun Heo <[email protected]> wrote:
> kernel/cgroup/cgroup-internal.h | 17 +++++++++++
> kernel/cgroup/cgroup-v1.c | 26 ++++++++--------
> kernel/cgroup/cgroup.c | 53 +++++++++++++++++++++------------
> 3 files changed, 65 insertions(+), 31 deletions(-)

Thanks. This last one member of the series can have therefore

Reviewed-by: Michal Koutn? <[email protected]>

2021-12-16 09:23:09

by kernel test robot

[permalink] [raw]

Subject: [cgroup] 27183b4e07: WARNING:at_mm/slab.c:#___cache_free

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 27183b4e0735229f7ab300f000f78c9badf2a110 ("[PATCH 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv")
url: https://github.com/0day-ci/linux/commits/Tejun-Heo/cgroup-Use-open-time-credentials-for-process-migraton-perm-checks/20211214-041859
base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

[ 52.345653][ T1] WARNING: CPU: 0 PID: 1 at mm/slab.c:597 ___cache_free (mm/slab.c:597 mm/slab.c:3492)
[ 52.346695][ T1] Modules linked in:
[ 52.347196][ T1] CPU: 0 PID: 1 Comm: systemd Not tainted 5.16.0-rc1-00009-g27183b4e0735 #1
[ 52.348386][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 52.349576][ T1] RIP: 0010:___cache_free (mm/slab.c:597 mm/slab.c:3492)
[ 52.350274][ T1] Code: 50 ff 48 83 05 8b c8 4b 06 01 4c 39 7c d3 10 75 cd 48 83 05 84 c8 4b 06 01 48 83 05 8c c8 4b 06 01 90 48 83 05 8b c8 4b 06 01 <0f> 0b 48 83 05 89 c8 4b 06 01 48 83 05 89 c8 4b 06 01 90 48 83 05
All code
========
0: 50 push %rax
1: ff 48 83 decl -0x7d(%rax)
4: 05 8b c8 4b 06 add $0x64bc88b,%eax
9: 01 4c 39 7c add %ecx,0x7c(%rcx,%rdi,1)
d: d3 10 rcll %cl,(%rax)
f: 75 cd jne 0xffffffffffffffde
11: 48 83 05 84 c8 4b 06 addq $0x1,0x64bc884(%rip) # 0x64bc89d
18: 01
19: 48 83 05 8c c8 4b 06 addq $0x1,0x64bc88c(%rip) # 0x64bc8ad
20: 01
21: 90 nop
22: 48 83 05 8b c8 4b 06 addq $0x1,0x64bc88b(%rip) # 0x64bc8b5
29: 01
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 83 05 89 c8 4b 06 addq $0x1,0x64bc889(%rip) # 0x64bc8bd
33: 01
34: 48 83 05 89 c8 4b 06 addq $0x1,0x64bc889(%rip) # 0x64bc8c5
3b: 01
3c: 90 nop
3d: 48 rex.W
3e: 83 .byte 0x83
3f: 05 .byte 0x5

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 83 05 89 c8 4b 06 addq $0x1,0x64bc889(%rip) # 0x64bc893
9: 01
a: 48 83 05 89 c8 4b 06 addq $0x1,0x64bc889(%rip) # 0x64bc89b
11: 01
12: 90 nop
13: 48 rex.W
14: 83 .byte 0x83
15: 05 .byte 0x5
[ 52.352800][ T1] RSP: 0018:ffff888100363d40 EFLAGS: 00010002
[ 52.353567][ T1] RAX: 0000000000000004 RBX: ffff888100258000 RCX: 0000000000000000
[ 52.354553][ T1] RDX: 0000000000000003 RSI: ffff8881298a9f00 RDI: ffff8881000403c0
[ 52.355577][ T1] RBP: ffff888100363d98 R08: ffff88810004f3e8 R09: ffff888100958d20
[ 52.356627][ T1] R10: 0000000000000000 R11: ffffffff87bab640 R12: ffff8881000403c0
[ 52.357643][ T1] R13: ffff8881000403c0 R14: ffffffff812c055d R15: ffff8881298a9f00
[ 52.358670][ T1] FS: 0000000000000000(0000) GS:ffffffff854fa000(0063) knlGS:00000000f784b6c0
[ 52.359828][ T1] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 52.360680][ T1] CR2: 00000000f7ddce32 CR3: 0000000129f42000 CR4: 00000000000406b0
[ 52.361731][ T1] Call Trace:
[ 52.362185][ T1] <TASK>
[ 52.362580][ T1] ? debug_check_no_obj_freed (lib/debugobjects.c:1002 lib/debugobjects.c:1023)
[ 52.363317][ T1] ? cgroup_file_release (kernel/cgroup/cgroup.c:3843)
[ 52.363974][ T1] kfree (mm/slab.c:3453 mm/slab.c:3803)
[ 52.364459][ T1] cgroup_file_release (kernel/cgroup/cgroup.c:3843)
[ 52.365119][ T1] kernfs_release_file+0x40/0xc0
[ 52.365866][ T1] kernfs_fop_release (fs/kernfs/file.c:757)
[ 52.366506][ T1] __fput (fs/file_table.c:281)
[ 52.367019][ T1] ____fput (fs/file_table.c:313)
[ 52.367541][ T1] task_work_run (kernel/task_work.c:166 (discriminator 1))
[ 52.368166][ T1] exit_to_user_mode_prepare (include/linux/tracehook.h:189 kernel/entry/common.c:175 kernel/entry/common.c:207)
[ 52.368910][ T1] syscall_exit_to_user_mode (kernel/entry/common.c:301)
[ 52.369622][ T1] __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 52.370272][ T1] do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 52.370886][ T1] do_SYSENTER_32 (arch/x86/entry/common.c:247)
[ 52.371481][ T1] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:141)
[ 52.372337][ T1] RIP: 0023:0xf7fb7549
[ 52.372878][ T1] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
0: 03 74 c0 01 add 0x1(%rax,%rax,8),%esi
4: 10 05 03 74 b8 01 adc %al,0x1b87403(%rip) # 0x1b8740d
a: 10 06 adc %al,(%rsi)
c: 03 74 b4 01 add 0x1(%rsp,%rsi,4),%esi
10: 10 07 adc %al,(%rdi)
12: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi
16: 10 08 adc %cl,(%rax)
18: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
1c: 00 00 add %al,(%rax)
1e: 00 00 add %al,(%rax)
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24: 89 e5 mov %esp,%ebp
26: 0f 34 sysenter
28: cd 80 int $0x80
2a:* 5d pop %rbp <-- trapping instruction
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 retq
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 retq
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
[ 52.375395][ T1] RSP: 002b:00000000fff9a478 EFLAGS: 00000206 ORIG_RAX: 0000000000000006
[ 52.376465][ T1] RAX: 0000000000000000 RBX: 000000000000001c RCX: 0000000000000660
[ 52.377503][ T1] RDX: 00000000f7c48300 RSI: 00000000f7c48960 RDI: 0000000000000000
[ 52.378542][ T1] RBP: 00000000f7c4a000 R08: 0000000000000000 R09: 0000000000000000
[ 52.379652][ T1] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 52.380688][ T1] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 52.381710][ T1] </TASK>
[ 52.382097][ T1] irq event stamp: 47735052
[ 52.382649][ T1] hardirqs last enabled at (47735051): _raw_spin_unlock_irq (include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202)
[ 52.383953][ T1] hardirqs last disabled at (47735052): kfree (mm/slab.c:3793 (discriminator 1))
[ 52.385104][ T1] softirqs last enabled at (47734810): cgroup_idr_replace (kernel/cgroup/cgroup.c:339)
[ 52.386365][ T1] softirqs last disabled at (47734808): cgroup_idr_replace (kernel/cgroup/cgroup.c:336)
[ 52.387667][ T1] ---[ end trace 08fad742e8d71fba ]---
Mounting RPC Pipe File System...
[ OK ] Reached target Swap.
[ OK ] Listening on udev Kernel Socket.
[ OK ] Listening on Journal Socket.
Starting Remount Root and Kernel File Systems...
Starting Journal Service...
Starting Load Kernel Modules...
Starting Create Static Device Nodes in /dev...
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Listening on RPCbind Server Activation Socket.
[ OK ] Mounted RPC Pipe File System.
[ OK ] Started Remount Root and Kernel File Systems.
[ OK ] Started Load Kernel Modules.
[ OK ] Started Create Static Device Nodes in /dev.

To reproduce:

# build kernel
cd linux
cp config-5.16.0-rc1-00009-g27183b4e0735 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang

Attachments:

(No filename) (9.14 kB)
config-5.16.0-rc1-00009-g27183b4e0735 (146.93 kB)
job-script (4.78 kB)
dmesg.xz (20.46 kB)
Download all attachments

2021-12-16 09:43:49

by Michal Koutný

[permalink] [raw]

Subject: Re: [cgroup] 27183b4e07: WARNING:at_mm/slab.c:#___cache_free

On Thu, Dec 16, 2021 at 05:22:55PM +0800, kernel test robot <[email protected]> wrote:
> commit: 27183b4e0735229f7ab300f000f78c9badf2a110 ("[PATCH 2/6] cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv")
> url: https://github.com/0day-ci/linux/commits/Tejun-Heo/cgroup-Use-open-time-credentials-for-process-migraton-perm-checks/20211214-041859
> base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next
> patch link: https://lore.kernel.org/lkml/[email protected]

TL;DR This is the v2 patch and this situation is fixed in v3 [1].

FWIW, the full log reports a BUG later:

> [ 52.570729][ T1] BUG: unable to handle page fault for address: ffffffffffffffe0
> [ 52.571736][ T1] #PF: supervisor read access in kernel mode
> [ 52.572490][ T1] #PF: error_code(0x0000) - not-present page
> [ 52.573271][ T1] PGD 542b067 P4D 542b067 PUD 542d067 PMD 0
> [ 52.574056][ T1] Oops: 0000 [#1] PTI
> [ 52.574580][ T1] CPU: 0 PID: 1 Comm: systemd Tainted: G W 5.16.0-rc1-00009-g27183b4e0735 #1
> [ 52.575935][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 52.577101][ T1] RIP: 0010:cgroup_pidlist_find+0x67/0x100
> [ 52.577863][ T1] Code: 03 00 00 48 8d bb c0 03 00 00 48 8d 42 e0 48 39 d7 75 17 eb 37 48 8b 50 20 48 83 05 c2 d5 5a 06 01 48 8d 42 e0 48 39 d7 74 22 <44> 39 20 75 e6 48 83 05 9c d5 5a 06 01 4c 39 68 08 75 d8 5b 48 83
> [ 52.580455][ T1] RSP: 0018:ffff888100363ce0 EFLAGS: 00010286
> [ 52.581260][ T1] RAX: ffffffffffffffe0 RBX: ffff888123128800 RCX: 0000000000000003
> [ 52.582341][ T1] RDX: 0000000000000000 RSI: ffff888123128c38 RDI: ffff888123128bc0
> [ 52.583386][ T1] RBP: ffff888100363cf8 R08: 0000000000000000 R09: 0000000000000003
> [ 52.584416][ T1] R10: ffff888100363cf8 R11: ffff888123128c38 R12: 0000000000000000
> [ 52.585452][ T1] R13: ffffffff8554a980 R14: ffff888123128800 R15: ffff888123090800
> [ 52.586479][ T1] FS: 0000000000000000(0000) GS:ffffffff854fa000(0063) knlGS:00000000f784b6c0
> [ 52.587696][ T1] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> [ 52.588521][ T1] CR2: ffffffffffffffe0 CR3: 0000000129f42000 CR4: 00000000000406b0
> [ 52.589543][ T1] Call Trace:
> [ 52.589972][ T1] <TASK>
> [ 52.590357][ T1] cgroup_pidlist_start+0x85/0x180
> [ 52.591035][ T1] cgroup_seqfile_start+0x29/0x40
> [ 52.591706][ T1] kernfs_seq_start+0x6e/0x100
> [ 52.592355][ T1] ? kvmalloc_node+0xd6/0x140
> [ 52.593068][ T1] seq_read_iter+0x13b/0x680
> [ 52.593627][ T1] ? up_read+0x36/0x50
> [ 52.594124][ T1] kernfs_fop_read_iter+0x4f/0x60
> [ 52.594783][ T1] new_sync_read+0x14e/0x240
> [ 52.595373][ T1] vfs_read+0x190/0x2c0
> [ 52.595925][ T1] ksys_read+0x70/0x150
> [ 52.596463][ T1] __ia32_sys_read+0x1b/0x30
> [ 52.597057][ T1] __do_fast_syscall_32+0x77/0x100
> [ 52.597711][ T1] do_fast_syscall_32+0x33/0x80
> [ 52.598310][ T1] do_SYSENTER_32+0x1f/0x30
> [ 52.598890][ T1] entry_SYSENTER_compat_after_hwframe+0x4d/0x5f
> [ 52.599777][ T1] RIP: 0023:0xf7fb7549
> [ 52.600308][ T1] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
> [ 52.602731][ T1] RSP: 002b:00000000fff99188 EFLAGS: 00000206 ORIG_RAX: 0000000000000003
> [ 52.603796][ T1] RAX: ffffffffffffffda RBX: 0000000000000024 RCX: 00000000583ac988
> [ 52.604817][ T1] RDX: 0000000000001000 RSI: 00000000583f6e10 RDI: 00000000f7c48960
> [ 52.605833][ T1] RBP: 00000000fff991d8 R08: 0000000000000000 R09: 0000000000000000
> [ 52.606825][ T1] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 52.607838][ T1] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 52.608839][ T1] </TASK>
> [ 52.609226][ T1] Modules linked in:
> [ 52.609731][ T1] CR2: ffffffffffffffe0
> [ 52.610242][ T1] ---[ end trace 08fad742e8d71fbb ]---

This looks very much like UAF via cgrp->pidlists which was the fixed.

Michal

[1] https://lore.kernel.org/r/[email protected]/

2022-01-06 21:06:00

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCHSET v2 cgroup/for-5.16-fixes] cgroup: Use open-time creds and namespace for migration perm checks

Applied to cgroup/for-5.16-fixes w/ Michal's reviewed-by added.

Thanks.

--
tejun