2023-06-30 18:46:03

by Michal Koutný

Subject: [PATCH v2 0/3] cpuset: Allow setscheduler regardless of manipulated task

Changes in v2:
- rebased on mainline
- drop is_in_v2_mode()

Changes in v1 (https://lore.kernel.org/r/[email protected]):
- added selftests
- comments rewording

RFC in https://lore.kernel.org/r/[email protected]

Michal Koutný (3):
cpuset: Allow setscheduler regardless of manipulated task
selftests: cgroup: Minor code reorganizations
selftests: cgroup: Add cpuset migrations testcase

MAINTAINERS | 2 +
kernel/cgroup/cpuset.c | 13 +-
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
tools/testing/selftests/cgroup/cgroup_util.c | 2 +
tools/testing/selftests/cgroup/cgroup_util.h | 2 +
tools/testing/selftests/cgroup/test_core.c | 2 +-
tools/testing/selftests/cgroup/test_cpuset.c | 272 ++++++++++++++++++
.../selftests/cgroup/test_cpuset_prs.sh | 2 +-
9 files changed, 293 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/cgroup/test_cpuset.c


base-commit: e55e5df193d247a38a5e1ac65a5316a0adcc22fa
--
2.41.0



2023-06-30 18:51:43

by Michal Koutný

Subject: [PATCH v2 3/3] selftests: cgroup: Add cpuset migrations testcase

Add a separate test file to verify how permissions are handled when
tasks are migrated between cpuset cgroups on the cgroup v2 hierarchy.

In accordance with v2 design, migration should be allowed based on
delegation boundaries (i.e. cgroup.procs permissions) and does not
depend on the migrated object (i.e. an unprivileged process can migrate
another process, even a privileged one, as long as it remains in the
original dedicated scope).

Signed-off-by: Michal Koutný <[email protected]>
---
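For readers of the thread, the delegation rule described above can be
illustrated with a small userspace sketch. It is not part of the patch:
common_ancestor() and migration_permitted() are hypothetical helpers, and
the model assumes the documented cgroup v2 rule that the writer needs write
access to cgroup.procs in both the destination and the common ancestor of
source and destination. The credentials of the migrated task are
deliberately not an input.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): store the deepest common
 * ancestor of two absolute cgroup paths in out.
 * Returns 0 on success, -1 on a relative path or a too-small buffer.
 */
int common_ancestor(const char *a, const char *b, char *out, size_t len)
{
	size_t i, last = 0;

	if (a[0] != '/' || b[0] != '/')
		return -1;

	for (i = 0; ; i++) {
		bool a_end = a[i] == '/' || a[i] == '\0';
		bool b_end = b[i] == '/' || b[i] == '\0';

		/* Both paths sit on a component boundary after an equal prefix. */
		if (a_end && b_end)
			last = i;
		if (a[i] == '\0' || a[i] != b[i])
			break;
	}

	if (last == 0) {
		/* Only the root is shared. */
		if (len < 2)
			return -1;
		strcpy(out, "/");
		return 0;
	}

	if (last + 1 > len)
		return -1;
	memcpy(out, a, last);
	out[last] = '\0';
	return 0;
}

/*
 * Simplified model of the v2 rule: only write access to cgroup.procs in
 * the destination and in the common ancestor matters.
 */
bool migration_permitted(bool can_write_dst_procs,
			 bool can_write_ancestor_procs)
{
	return can_write_dst_procs && can_write_ancestor_procs;
}
```

In terms of the test below, cpuset_test_1 and cpuset_test_2 share
cpuset_test_0 as their common ancestor, which is why the "allow" variant
additionally chowns the parent's cgroup.procs while the "deny" variant
does not.
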
MAINTAINERS | 1 +
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
tools/testing/selftests/cgroup/test_cpuset.c | 272 +++++++++++++++++++
4 files changed, 276 insertions(+)
create mode 100644 tools/testing/selftests/cgroup/test_cpuset.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 03bec83944c4..5c55de000ee3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5260,6 +5260,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
F: Documentation/admin-guide/cgroup-v1/cpusets.rst
F: include/linux/cpuset.h
F: kernel/cgroup/cpuset.c
+F: tools/testing/selftests/cgroup/test_cpuset.c
F: tools/testing/selftests/cgroup/test_cpuset_prs.sh

CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index c4a57e69f749..8443a8d46a1c 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -5,4 +5,5 @@ test_freezer
test_kmem
test_kill
test_cpu
+test_cpuset
wait_inotify
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 3d263747d2ad..dee0f013c7f4 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -12,6 +12,7 @@ TEST_GEN_PROGS += test_core
TEST_GEN_PROGS += test_freezer
TEST_GEN_PROGS += test_kill
TEST_GEN_PROGS += test_cpu
+TEST_GEN_PROGS += test_cpuset

LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h

@@ -23,3 +24,4 @@ $(OUTPUT)/test_core: cgroup_util.c
$(OUTPUT)/test_freezer: cgroup_util.c
$(OUTPUT)/test_kill: cgroup_util.c
$(OUTPUT)/test_cpu: cgroup_util.c
+$(OUTPUT)/test_cpuset: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/test_cpuset.c b/tools/testing/selftests/cgroup/test_cpuset.c
new file mode 100644
index 000000000000..976ec6f014d8
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/limits.h>
+#include <signal.h>
+
+#include "../kselftest.h"
+#include "cgroup_util.h"
+
+static int idle_process_fn(const char *cgroup, void *arg)
+{
+ (void)pause();
+ return 0;
+}
+
+static int do_migration_fn(const char *cgroup, void *arg)
+{
+ int object_pid = (int)(size_t)arg;
+
+ if (setuid(TEST_UID))
+ return EXIT_FAILURE;
+
+ // XXX checking /proc/$pid/cgroup would be quicker than wait
+ if (cg_enter(cgroup, object_pid) ||
+ cg_wait_for_proc_count(cgroup, 1))
+ return EXIT_FAILURE;
+
+ return EXIT_SUCCESS;
+}
+
+static int do_controller_fn(const char *cgroup, void *arg)
+{
+ const char *child = cgroup;
+ const char *parent = arg;
+
+ if (setuid(TEST_UID))
+ return EXIT_FAILURE;
+
+ if (!cg_read_strstr(child, "cgroup.controllers", "cpuset"))
+ return EXIT_FAILURE;
+
+ if (cg_write(parent, "cgroup.subtree_control", "+cpuset"))
+ return EXIT_FAILURE;
+
+ if (cg_read_strstr(child, "cgroup.controllers", "cpuset"))
+ return EXIT_FAILURE;
+
+ if (cg_write(parent, "cgroup.subtree_control", "-cpuset"))
+ return EXIT_FAILURE;
+
+ if (!cg_read_strstr(child, "cgroup.controllers", "cpuset"))
+ return EXIT_FAILURE;
+
+ return EXIT_SUCCESS;
+}
+
+/*
+ * Migrate a process between two sibling cgroups.
+ * The success should only depend on the parent cgroup permissions and not the
+ * migrated process itself (cpuset controller is in place because it uses
+ * security_task_setscheduler() in cgroup v1).
+ */
+static int test_cpuset_perms_object(const char *root, bool allow)
+{
+ char *parent = NULL, *child_src = NULL, *child_dst = NULL;
+ char *parent_procs = NULL, *child_src_procs = NULL, *child_dst_procs = NULL;
+ const uid_t test_euid = TEST_UID;
+ int object_pid = 0;
+ int ret = KSFT_FAIL;
+
+ parent = cg_name(root, "cpuset_test_0");
+ if (!parent)
+ goto cleanup;
+ parent_procs = cg_name(parent, "cgroup.procs");
+ if (!parent_procs)
+ goto cleanup;
+ if (cg_create(parent))
+ goto cleanup;
+
+ child_src = cg_name(parent, "cpuset_test_1");
+ if (!child_src)
+ goto cleanup;
+ child_src_procs = cg_name(child_src, "cgroup.procs");
+ if (!child_src_procs)
+ goto cleanup;
+ if (cg_create(child_src))
+ goto cleanup;
+
+ child_dst = cg_name(parent, "cpuset_test_2");
+ if (!child_dst)
+ goto cleanup;
+ child_dst_procs = cg_name(child_dst, "cgroup.procs");
+ if (!child_dst_procs)
+ goto cleanup;
+ if (cg_create(child_dst))
+ goto cleanup;
+
+ if (cg_write(parent, "cgroup.subtree_control", "+cpuset"))
+ goto cleanup;
+
+ if (cg_read_strstr(child_src, "cgroup.controllers", "cpuset") ||
+ cg_read_strstr(child_dst, "cgroup.controllers", "cpuset"))
+ goto cleanup;
+
+ /* Enable permissions along src->dst tree path */
+ if (chown(child_src_procs, test_euid, -1) ||
+ chown(child_dst_procs, test_euid, -1))
+ goto cleanup;
+
+ if (allow && chown(parent_procs, test_euid, -1))
+ goto cleanup;
+
+ /* Fork a privileged child as a test object */
+ object_pid = cg_run_nowait(child_src, idle_process_fn, NULL);
+ if (object_pid < 0)
+ goto cleanup;
+
+ /* Carry out migration in a child process that can drop all privileges
+ * (including capabilities), the main process must remain privileged for
+ * cleanup.
+ * The child process's cgroup is irrelevant, but we place it into child_dst
+ * as a hacky way to pass the migration target to the child.
+ */
+ if (allow ^ (cg_run(child_dst, do_migration_fn, (void *)(size_t)object_pid) == EXIT_SUCCESS))
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ if (object_pid > 0) {
+ (void)kill(object_pid, SIGTERM);
+ (void)clone_reap(object_pid, WEXITED);
+ }
+
+ cg_destroy(child_dst);
+ free(child_dst_procs);
+ free(child_dst);
+
+ cg_destroy(child_src);
+ free(child_src_procs);
+ free(child_src);
+
+ cg_destroy(parent);
+ free(parent_procs);
+ free(parent);
+
+ return ret;
+}
+
+static int test_cpuset_perms_object_allow(const char *root)
+{
+ return test_cpuset_perms_object(root, true);
+}
+
+static int test_cpuset_perms_object_deny(const char *root)
+{
+ return test_cpuset_perms_object(root, false);
+}
+
+/*
+ * Migrate a process between parent and child implicitly.
+ * Implicit migration happens when a controller is enabled/disabled.
+ *
+ */
+static int test_cpuset_perms_subtree(const char *root)
+{
+ char *parent = NULL, *child = NULL;
+ char *parent_procs = NULL, *parent_subctl = NULL, *child_procs = NULL;
+ const uid_t test_euid = TEST_UID;
+ int object_pid = 0;
+ int ret = KSFT_FAIL;
+
+ parent = cg_name(root, "cpuset_test_0");
+ if (!parent)
+ goto cleanup;
+ parent_procs = cg_name(parent, "cgroup.procs");
+ if (!parent_procs)
+ goto cleanup;
+ parent_subctl = cg_name(parent, "cgroup.subtree_control");
+ if (!parent_subctl)
+ goto cleanup;
+ if (cg_create(parent))
+ goto cleanup;
+
+ child = cg_name(parent, "cpuset_test_1");
+ if (!child)
+ goto cleanup;
+ child_procs = cg_name(child, "cgroup.procs");
+ if (!child_procs)
+ goto cleanup;
+ if (cg_create(child))
+ goto cleanup;
+
+ /* Enable permissions as in a delegated subtree */
+ if (chown(parent_procs, test_euid, -1) ||
+ chown(parent_subctl, test_euid, -1) ||
+ chown(child_procs, test_euid, -1))
+ goto cleanup;
+
+ /* Put a privileged child in the subtree and modify controller state
+ * from an unprivileged process, the main process remains privileged
+ * for cleanup.
+ * The unprivileged child runs in subtree too to avoid parent and
+ * internal-node constraint violation.
+ */
+ object_pid = cg_run_nowait(child, idle_process_fn, NULL);
+ if (object_pid < 0)
+ goto cleanup;
+
+ if (cg_run(child, do_controller_fn, parent) != EXIT_SUCCESS)
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ if (object_pid > 0) {
+ (void)kill(object_pid, SIGTERM);
+ (void)clone_reap(object_pid, WEXITED);
+ }
+
+ cg_destroy(child);
+ free(child_procs);
+ free(child);
+
+ cg_destroy(parent);
+ free(parent_subctl);
+ free(parent_procs);
+ free(parent);
+
+ return ret;
+}
+
+
+#define T(x) { x, #x }
+struct cpuset_test {
+ int (*fn)(const char *root);
+ const char *name;
+} tests[] = {
+ T(test_cpuset_perms_object_allow),
+ T(test_cpuset_perms_object_deny),
+ T(test_cpuset_perms_subtree),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+ char root[PATH_MAX];
+ int i, ret = EXIT_SUCCESS;
+
+ if (cg_find_unified_root(root, sizeof(root)))
+ ksft_exit_skip("cgroup v2 isn't mounted\n");
+
+ if (cg_read_strstr(root, "cgroup.subtree_control", "cpuset"))
+ if (cg_write(root, "cgroup.subtree_control", "+cpuset"))
+ ksft_exit_skip("Failed to set cpuset controller\n");
+
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ switch (tests[i].fn(root)) {
+ case KSFT_PASS:
+ ksft_test_result_pass("%s\n", tests[i].name);
+ break;
+ case KSFT_SKIP:
+ ksft_test_result_skip("%s\n", tests[i].name);
+ break;
+ default:
+ ret = EXIT_FAILURE;
+ ksft_test_result_fail("%s\n", tests[i].name);
+ break;
+ }
+ }
+
+ return ret;
+}
--
2.41.0


2023-06-30 19:08:26

by Michal Koutný

Subject: [PATCH v2 1/3] cpuset: Allow setscheduler regardless of manipulated task

When we migrate a task between two cgroups, one of the checks is a
verification whether we can modify task's scheduler settings
(cap_task_setscheduler()).

An implicit migration occurs also when enabling a controller on the
unified hierarchy (think of parent to child migration). The
aforementioned check may be problematic if the caller of the migration
(enabling a controller) has no permissions over migrated tasks.
For instance, a user's cgroup may end up running a process of a
different user. Although cgroup permissions are configured favorably,
the enablement fails due to the foreign process [1].

Change the behavior by relaxing the permissions check on the unified
hierarchy (or in v2 mode). This is in accordance with unified hierarchy
attachment behavior when permissions of the source to target cgroups are
decisive whereas the migrated task is opaque (as opposed to more
restrictive check in __cgroup1_procs_write()).

[1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649

Signed-off-by: Michal Koutný <[email protected]>
---
kernel/cgroup/cpuset.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58e6f18f01c1..41d3ed14b0f4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2505,9 +2505,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
ret = task_can_attach(task);
if (ret)
goto out_unlock;
- ret = security_task_setscheduler(task);
- if (ret)
- goto out_unlock;
+
+ /*
+ * Skip rights over task check in v2, migration permission derives
+ * from hierarchy ownership in cgroup_procs_write_permission().
+ */
+ if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+ ret = security_task_setscheduler(task);
+ if (ret)
+ goto out_unlock;
+ }

if (dl_task(task)) {
cs->nr_migrate_dl_tasks++;
--
2.41.0


2023-06-30 19:31:38

by Waiman Long

Subject: Re: [PATCH v2 1/3] cpuset: Allow setscheduler regardless of manipulated task


On 6/30/23 14:39, Michal Koutný wrote:
> When we migrate a task between two cgroups, one of the checks is a
> verification whether we can modify task's scheduler settings
> (cap_task_setscheduler()).
>
> An implicit migration occurs also when enabling a controller on the
> unified hierarchy (think of parent to child migration). The
> aforementioned check may be problematic if the caller of the migration
> (enabling a controller) has no permissions over migrated tasks.
> For instance, a user's cgroup that ends up running a process of a
> different user. Although cgroup permissions are configured favorably,
> the enablement fails due to the foreign process [1].
>
> Change the behavior by relaxing the permissions check on the unified
> hierarchy (or in v2 mode). This is in accordance with unified hierarchy
> attachment behavior when permissions of the source to target cgroups are
> decisive whereas the migrated task is opaque (as opposed to more
> restrictive check in __cgroup1_procs_write()).
>
> [1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649
>
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 58e6f18f01c1..41d3ed14b0f4 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2505,9 +2505,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> ret = task_can_attach(task);
> if (ret)
> goto out_unlock;
> - ret = security_task_setscheduler(task);
> - if (ret)
> - goto out_unlock;
> +
> + /*
> + * Skip rights over task check in v2, migration permission derives
> + * from hierarchy ownership in cgroup_procs_write_permission().
> + */
> + if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
> + ret = security_task_setscheduler(task);
> + if (ret)
> + goto out_unlock;
> + }

I am somewhat hesitant to skip the security_task_setscheduler() check
for all cgroup v2 task migrations. The check is controlled by SELinux,
which is a different subsystem. I believe the scheduler property here
refers to the task's CPU affinity and node mask. If you look at
cpuset_attach(), we already skip the task iteration that would change
them when the CPU affinity and node mask don't change at all.

I don't want to introduce a possible security vulnerability because of
this relaxation. I would suggest you skip it under the same condition of
no change to cpu affinity and node mask for cgroup v2.
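
To make the difference concrete, here is a minimal userspace sketch of the
two policies. It is only a model, not kernel code: struct attach_ctx and
both functions are hypothetical names, and the cpus_changed/mems_changed
flags stand in for whatever comparison cpuset_attach() uses to decide that
affinity and node mask are unchanged.

```c
#include <stdbool.h>

/* Hypothetical model of the inputs to the decision (illustration only). */
struct attach_ctx {
	bool on_dfl;		/* cpuset is on the default (v2) hierarchy */
	bool cpus_changed;	/* migration would change the CPU affinity */
	bool mems_changed;	/* migration would change the node mask */
};

/* Patch as posted: the setscheduler check runs only outside of v2. */
bool check_needed_as_posted(const struct attach_ctx *c)
{
	return !c->on_dfl;
}

/*
 * Suggested variant: in v2 the check is skipped only when neither the
 * CPU affinity nor the node mask would change.
 */
bool check_needed_suggested(const struct attach_ctx *c)
{
	if (!c->on_dfl)
		return true;
	return c->cpus_changed || c->mems_changed;
}
```

The two variants agree on v1 and on v2 migrations that leave affinity and
node mask untouched; they differ only for v2 migrations that change one of
them.
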

Thanks,
Longman