LinuxLists.cc - [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

[permalink] [raw]

Subject: [RFC PATCH bpf-next 1/3] cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf

This patch adds a new cgroup helper cgroup_freeze_no_kn() to freeze a
cgroup hierarchy that is on a default cgroup v2 without going through
kernfs interface.

For some cases we want to freeze the cgroup of a task based on some
signals, doing so from bpf is better than user space which could be
too late.

The cgroup_freeze_no_kn() will acquire the cgroup_mutex and release it
at the end.

It also checks if the cgroup is on the default hierarchy and it is not
a root cgroup.

Signed-off-by: Djalal Harouni <[email protected]>
---
include/linux/cgroup.h | 2 ++
kernel/cgroup/cgroup.c | 69 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..5019b32ea933 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -137,6 +137,8 @@ int cgroup_init(void);

int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v);

+int cgroup_freeze_no_kn(struct cgroup *cgrp, int freeze);
+
/*
* Iteration helpers and macros.
*/
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a66c088c851c..0aafcd9e39b5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1595,6 +1595,26 @@ static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask)
return cur_ss_mask;
}

+/**
+ * cgroup_dfl_write_no_kn - check if direct writes to cgroup without going
+ * through kernfs is allowed.
+ * @cgrp: the target cgroup
+ *
+ * This helper ensures that the cgroup is on the default hierarchy and it
+ * is not a root cgroup.
+ *
+ * Return: %0 on success or a negative errno code on failure.
+ */
+static int cgroup_dfl_write_no_kn(struct cgroup *cgrp)
+{
+ lockdep_assert_held(&cgroup_mutex);
+
+ if (!cgroup_on_dfl(cgrp) || !cgroup_parent(cgrp))
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
/**
* cgroup_kn_unlock - unlocking helper for cgroup kernfs methods
* @kn: the kernfs_node being serviced
@@ -1668,6 +1688,25 @@ struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline)
return NULL;
}

+/**
+ * cgroup_lock_live_no_kn - locking helper for direct writes to cgroup without
+ * going through kernfs interface.
+ * @cgrp: the target cgroup
+ *
+ * This helper performs cgroup locking and verifies that the associated cgroup
+ * is alive. Returns the cgroup if alive; otherwise, %NULL.
+ * A successful return should be undone by a matching cgroup_unlock()
+ * invocation.
+ */
+static struct cgroup *cgroup_lock_live_no_kn(struct cgroup *cgrp)
+{
+ cgroup_lock();
+ if (!cgroup_is_dead(cgrp))
+ return cgrp;
+ cgroup_unlock();
+ return NULL;
+}
+
static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft)
{
char name[CGROUP_FILE_NAME_MAX];
@@ -3930,6 +3969,36 @@ static int cgroup_freeze_show(struct seq_file *seq, void *v)
return 0;
}

+/**
+ * cgroup_freeze_no_kn - Freeze a cgroup that is on the default hierarchy
+ * without going through kernfs interface.
+ *
+ * @cgrp: the target cgroup
+ * @freeze: freeze state, passing value 1 causes the freezing of the cgroup
+ * and all descendant cgroups. Processes under this cgroup hierarchy will
+ * be stopped and will not run until the cgroup is explicitly unfrozen.
+ * Passing value 0 unthaws the cgroup hierarchy.
+ *
+ * Return: %0 on success or a negative errno code on failure.
+ */
+int cgroup_freeze_no_kn(struct cgroup *cgrp, int freeze)
+{
+ int ret = 0;
+
+ if (freeze < 0 || freeze > 1)
+ return -ERANGE;
+
+ if (!cgroup_lock_live_no_kn(cgrp))
+ return -ENOENT;
+
+ ret = cgroup_dfl_write_no_kn(cgrp);
+ if (!ret)
+ cgroup_freeze(cgrp, freeze);
+
+ cgroup_unlock();
+ return ret;
+}
+
static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
--
2.34.1

2024-03-27 22:56:01

[permalink] [raw]

Subject: [RFC PATCH bpf-next 2/3] bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task

This patch adds a new bpf helper bpf_task_freeze_cgroup() to freeze a
cgroup of a task and all its descendant cgroups. It requires the task
to be on the default cgroup v2 hierarchy.

For some cases we want to freeze the cgroup of a task based on some
signals, doing so from bpf is better than user space which could be
too late.

Planned users of this feature are: tetragon and systemd when freezing
a cgroup hierarchy that could be a K8s pod, container, system service
or a user session.

This helper will acquire the cgroup_mutex during its operation and
release it before it returns.

Signed-off-by: Djalal Harouni <[email protected]>
---
kernel/bpf/helpers.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9234174ccb21..8d510a1b265c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2270,6 +2270,36 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
return NULL;
return cgrp;
}
+
+/**
+ * bpf_task_freeze_cgroup - Freeze the task cgroup and all its descendant cgroups.
+ *
+ * @task: The target task
+ * @freeze: freeze state, passing value 1 causes the freezing of the cgroup
+ * and all descendant cgroups. Processes under this cgroup hierarchy will
+ * be stopped and will not run until the cgroup is explicitly unfrozen.
+ * Passing value 0 unthaws the task cgroup and its descendant cgroups.
+ *
+ * Return: %0 on success or a negative errno code on failure.
+ */
+__bpf_kfunc int bpf_task_freeze_cgroup(struct task_struct *task, int freeze)
+{
+ int ret;
+ struct cgroup *cgrp;
+
+ rcu_read_lock();
+ cgrp = task_dfl_cgroup(task);
+ if (!cgrp || !cgroup_tryget(cgrp)) {
+ rcu_read_unlock();
+ return -ENOENT;
+ }
+ rcu_read_unlock();
+
+ ret = cgroup_freeze_no_kn(cgrp, freeze);
+ cgroup_put(cgrp);
+
+ return ret;
+}
#endif /* CONFIG_CGROUPS */

/**
@@ -2577,6 +2607,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_task_freeze_cgroup, KF_TRUSTED_ARGS | KF_SLEEPABLE)
#endif
BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_throw)
--
2.34.1

2024-03-27 22:56:18

[permalink] [raw]

Subject: [RFC PATCH bpf-next 3/3] selftests/bpf: add selftest for bpf_task_freeze_cgroup

This adds a selftest for `bpf_task_freeze_cgroup` kfunc. The test works
by forking a child then:

1. Child:
- Migrate to a new cgroup
- Loads bpf programs
- Trigger the 'lsm_freeze_cgroup' bpf program so it freeze itself.
by calling "bpf_task_freeze_cgroup(child, 1)"

<- wait for parent to unthaw

- On unthaw it continues, forks another process and triggers the
'tp_newchild' bpf program to set some monitored pids of the new
process, that are asserted at user space, to ensure that we
resumed correctly.

2. Parent:
- Keeps reading the 'cgroup.freeze' file of the child cgroup until
it prints 1 which means the child cgroup is frozen
- Attaches the sample 'lsm_task_free' so it triggers the bpf program
and then calls "bpf_task_freeze_cgroup(task, 0);" on the child task
to unthaw its cgroup.
- Then waits for a clean exit of the child process.

The scenario allows to test both: freeze and unthaw a task cgroup.

Signed-off-by: Djalal Harouni <[email protected]>
---
.../bpf/prog_tests/task_freeze_cgroup.c | 165 ++++++++++++++++++
.../bpf/progs/test_task_freeze_cgroup.c | 110 ++++++++++++
2 files changed, 275 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
create mode 100644 tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c

diff --git a/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c b/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
new file mode 100644
index 000000000000..afb7d46194c5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/task_freeze_cgroup.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Isovalent */
+
+#include <sys/syscall.h>
+#include <test_progs.h>
+#include <cgroup_helpers.h>
+#include <unistd.h>
+#include "test_task_freeze_cgroup.skel.h"
+
+#define FOO "/test-task-freeze-cgroup"
+
+static int bpf_sleepable(struct test_task_freeze_cgroup *skel)
+{
+ int err, foo;
+ pid_t pid;
+
+ foo = test__join_cgroup(FOO);
+ if (!ASSERT_OK(foo < 0, "cgroup_join_foo"))
+ return -errno;
+
+ skel = test_task_freeze_cgroup__open();
+ if (!ASSERT_OK_PTR(skel, "test_task_freeze_cgroup__open"))
+ return -errno;
+
+ skel->rodata->parent_pid = getppid();
+ skel->rodata->monitor_pid = getpid();
+ skel->rodata->cgid = get_cgroup_id(FOO);
+ skel->bss->new_pid = getpid();
+ skel->bss->freeze = 1;
+
+ err = test_task_freeze_cgroup__load(skel);
+ if (!ASSERT_OK(err, "test_task_freeze_cgroup__load"))
+ goto cleanup;
+
+ /* First, attach the LSM program, and then it will be triggered when the
+ * TP_BTF program is attached.
+ */
+ skel->links.lsm_freeze_cgroup =
+ bpf_program__attach_lsm(skel->progs.lsm_freeze_cgroup);
+ if (!ASSERT_OK_PTR(skel->links.lsm_freeze_cgroup, "attach_lsm")) {
+ err = -errno;
+ goto cleanup;
+ }
+
+ /* This will fail */
+ skel->links.tp_newchild =
+ bpf_program__attach_trace(skel->progs.tp_newchild);
+ if (!ASSERT_EQ(errno, EPERM, "attach_trace")) {
+ err = -EINVAL;
+ goto cleanup;
+ }
+
+ /* Try again now */
+ skel->links.tp_newchild =
+ bpf_program__attach_trace(skel->progs.tp_newchild);
+ if (!ASSERT_OK_PTR(skel->links.tp_newchild, "attach_trace")) {
+ err = -EINVAL;
+ goto cleanup;
+ }
+
+ /* Trigger a new child and assert unfrozen state */
+ pid = fork();
+ if (pid == 0)
+ exit(0);
+
+ err = (pid == -1);
+ if (ASSERT_OK(err, "fork process"))
+ wait(NULL);
+
+ /* Now we should continue, assert that new_pid reflects child */
+ ASSERT_NEQ(skel->rodata->monitor_pid, skel->bss->new_pid,
+ "test task_freeze_cgroup failed at monitor_pid != new_pid");
+ ASSERT_NEQ(0, skel->bss->new_pid,
+ "test task_freeze_cgroup failed at remote_pid != 0");
+
+ /* Assert that bpf set new_pid to new forked child pid */
+ ASSERT_EQ(pid, skel->bss->new_pid,
+ "test task_freeze_cgroup failed at pid == new_pid");
+
+ test_task_freeze_cgroup__detach(skel);
+
+cleanup:
+ test_task_freeze_cgroup__destroy(skel);
+ close(foo);
+ return err;
+}
+
+void test_task_freeze_cgroup(void)
+{
+ pid_t pid, result;
+ char buf[512] = {0};
+ char path[PATH_MAX] = {0};
+ int ret, status, attempts, frozen = 0;
+ struct test_task_freeze_cgroup *skel = NULL;
+
+ pid = fork();
+ ret = (pid == -1);
+ if (!ASSERT_OK(ret, "fork process"))
+ return;
+
+ if (pid == 0) {
+ ret = bpf_sleepable(skel);
+ ASSERT_EQ(0, ret, "bpf_sleepable failed");
+ exit(ret);
+ }
+
+ skel = test_task_freeze_cgroup__open();
+ if (!ASSERT_OK_PTR(skel, "test_task_freeze_cgroup__open"))
+ goto out;
+
+ snprintf(path, sizeof(path),
+ "/sys/fs/cgroup/cgroup-test-work-dir%d%s/cgroup.freeze",
+ pid, FOO);
+
+ for (attempts = 5; attempts >= 0; attempts--) {
+ ret = 0;
+ int fd = open(path, O_RDONLY);
+ if (fd > 0)
+ ret = read(fd, buf, sizeof(buf) - 1);
+ if (ret > 0) {
+ errno = 0;
+ frozen = strtol(buf, NULL, 10);
+ if (errno)
+ frozen = 0;
+ }
+
+ close(fd);
+ if (frozen)
+ break;
+ sleep(1);
+ }
+
+ /* Assert that child cgroup is frozen */
+ if (!ASSERT_EQ(1, frozen, "child cgroup not frozen"))
+ goto out;
+
+ ret = test_task_freeze_cgroup__load(skel);
+ if (!ASSERT_OK(ret, "test_task_freeze_cgroup__load"))
+ goto out;
+
+ /* Unthaw child cgroup from parent */
+ skel->links.lsm_task_free =
+ bpf_program__attach_lsm(skel->progs.lsm_task_free);
+ if (!ASSERT_OK_PTR(skel->links.lsm_task_free, "attach_lsm"))
+ goto out;
+
+ result = waitpid(pid, &status, WUNTRACED);
+ if (!ASSERT_NEQ(result, -1, "waitpid"))
+ goto detach;
+
+ result = WIFEXITED(status);
+ if (!ASSERT_EQ(result, 1, "forked process did not terminate normally"))
+ goto detach;
+
+ result = WEXITSTATUS(status);
+ if (!ASSERT_EQ(result, 0, "forked process did not exit successfully"))
+ goto detach;
+
+detach:
+ test_task_freeze_cgroup__detach(skel);
+
+out:
+ if (skel)
+ test_task_freeze_cgroup__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c b/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
new file mode 100644
index 000000000000..dbd2d60f464e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_task_freeze_cgroup.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Isovalent */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <errno.h>
+#include "bpf_misc.h"
+
+struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
+long bpf_task_under_cgroup(struct task_struct *task, struct cgroup *ancestor) __ksym;
+void bpf_cgroup_release(struct cgroup *p) __ksym;
+struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+int bpf_task_freeze_cgroup(struct task_struct *task, int freeze) __ksym;
+
+const volatile int parent_pid;
+const volatile int monitor_pid;
+const volatile __u64 cgid;
+int new_pid;
+int freeze;
+
+SEC("tp_btf/task_newtask")
+int BPF_PROG(tp_newchild, struct task_struct *task, u64 clone_flags)
+{
+ struct cgroup *cgrp = NULL;
+ struct task_struct *acquired;
+
+ if (monitor_pid != (bpf_get_current_pid_tgid() >> 32))
+ return 0;
+
+ acquired = bpf_task_acquire(task);
+ if (!acquired)
+ return 0;
+
+ cgrp = bpf_cgroup_from_id(cgid);
+ if (!cgrp)
+ goto out;
+
+ if (bpf_task_under_cgroup(acquired, cgrp))
+ new_pid = acquired->tgid;
+
+out:
+ if (cgrp)
+ bpf_cgroup_release(cgrp);
+ bpf_task_release(acquired);
+
+ return 0;
+}
+
+/* This is attached from parent to trigger the bpf lsm hook, so parent
+ * can unthaw the child.
+ */
+SEC("lsm/task_free")
+int BPF_PROG(lsm_task_free, struct task_struct *task)
+{
+ return 0;
+}
+
+SEC("lsm.s/bpf")
+int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size)
+{
+ int ret = 0;
+ struct cgroup *cgrp = NULL;
+ struct task_struct *task;
+
+ if (cmd != BPF_LINK_CREATE)
+ return ret;
+
+ task = bpf_get_current_task_btf();
+ if (parent_pid == task->pid) {
+ /* Unthaw child from parent */
+ task = bpf_task_from_pid(monitor_pid);
+ if (!task)
+ return -ENOENT;
+
+ ret = bpf_task_freeze_cgroup(task, 0);
+ bpf_task_release(task);
+ return ret;
+ }
+
+ if (monitor_pid != task->pid)
+ return 0;
+
+ /* Freeze the child cgroup from its context */
+ cgrp = bpf_cgroup_from_id(cgid);
+ if (!cgrp)
+ goto out;
+
+ if (!bpf_task_under_cgroup(task, cgrp))
+ goto out;
+
+ if (freeze) {
+ /* Schedule freeze task and return -EPERM */
+ ret = bpf_task_freeze_cgroup(task, 1);
+ if (!ret) {
+ ret = -EPERM;
+ /* reset for next call */
+ freeze = 0;
+ }
+ }
+out:
+ if (cgrp)
+ bpf_cgroup_release(cgrp);
+ return ret;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.34.1

2024-03-28 17:22:32

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello, Djalal.

On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni wrote:
> This patch series adds support to freeze the task cgroup hierarchy
> that is on a default cgroup v2 without going through kernfs interface.
>
> For some cases we want to freeze the cgroup of a task based on some
> signals, doing so from bpf is better than user space which could be
> too late.
>
> Planned users of this feature are: tetragon and systemd when freezing
> a cgroup hierarchy that could be a K8s pod, container, system service
> or a user session.
>
> Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
> Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
> Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup

It bothers me a bit that it's adding a dedicated interface for something
which already has a defined userspace interface. Would it be better to have
kfunc wrappers for kernel_read() and kernel_write()?

Thanks.

--
tejun

2024-03-28 17:32:49

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On Thu, Mar 28, 2024 at 10:22 AM Tejun Heo <[email protected]> wrote:
>
> Hello, Djalal.
>
> On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni wrote:
> > This patch series adds support to freeze the task cgroup hierarchy
> > that is on a default cgroup v2 without going through kernfs interface.
> >
> > For some cases we want to freeze the cgroup of a task based on some
> > signals, doing so from bpf is better than user space which could be
> > too late.
> >
> > Planned users of this feature are: tetragon and systemd when freezing
> > a cgroup hierarchy that could be a K8s pod, container, system service
> > or a user session.
> >
> > Patch 1: cgroup: add cgroup_freeze_no_kn() to freeze a cgroup from bpf
> > Patch 2: bpf: add bpf_task_freeze_cgroup() to freeze the cgroup of a task
> > Patch 3: selftests/bpf: add selftest for bpf_task_freeze_cgroup
>
> It bothers me a bit that it's adding a dedicated interface for something
> which already has a defined userspace interface. Would it be better to have
> kfunc wrappers for kernel_read() and kernel_write()?

How would that look ?
prog cannot and shouldn't open a file.
The seq_file would be passed/pinned by user space?

2024-03-28 17:59:03

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello, Alexei.

On Thu, Mar 28, 2024 at 10:32:24AM -0700, Alexei Starovoitov wrote:
> > It bothers me a bit that it's adding a dedicated interface for something
> > which already has a defined userspace interface. Would it be better to have
> > kfunc wrappers for kernel_read() and kernel_write()?
>
> How would that look ?
> prog cannot and shouldn't open a file.

Oh, I didn't know. Why is that?

> The seq_file would be passed/pinned by user space?

Would it work if it's just "open this file, write this and then close it"?

Thanks.

--
tejun

2024-03-28 19:46:29

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On Thu, Mar 28, 2024 at 10:58 AM Tejun Heo <[email protected]> wrote:
>
> Hello, Alexei.
>
> On Thu, Mar 28, 2024 at 10:32:24AM -0700, Alexei Starovoitov wrote:
> > > It bothers me a bit that it's adding a dedicated interface for something
> > > which already has a defined userspace interface. Would it be better to have
> > > kfunc wrappers for kernel_read() and kernel_write()?
> >
> > How would that look ?
> > prog cannot and shouldn't open a file.
>
> Oh, I didn't know. Why is that?
>
> > The seq_file would be passed/pinned by user space?
>
> Would it work if it's just "open this file, write this and then close it"?

Continuing discussion...
To use kernel_file_open() it would need path, inode, cred.
None of that is available now.
Allocating all these structures just to wrap a cgroup pointer
feels like overkill.
Of course, it would solve the need to introduce other
cgroup apis that are already available via text based cgroupfs
read/write. So there are pros and cons in both approaches.
Maybe the 3rd option would be to expose:
cgroup_lock() as a special blend of acquire plus lock.
Then there will be no need for bpf_task_freeze_cgroup() with task
argument. Instead cgroup_freeze() will be such kfunc that
takes cgroup argument and the verifier will check that
cgroup was acquired/locked.
Sort-of what we check to access bpf_rb_root.

2024-03-28 20:03:39

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello,

On Thu, Mar 28, 2024 at 12:46:03PM -0700, Alexei Starovoitov wrote:
> To use kernel_file_open() it would need path, inode, cred.

Yeah, it's more involved and potentially more controversial. That said, just
to push the argument a bit further, at least for path, it'd be nice to have
a helper anyway which can return cgroup path. I don't know whether we'd need
direct inode access. For cred, just share some root cred?

> None of that is available now.
> Allocating all these structures just to wrap a cgroup pointer
> feels like overkill.
> Of course, it would solve the need to introduce other
> cgroup apis that are already available via text based cgroupfs
> read/write. So there are pros and cons in both approaches.
> Maybe the 3rd option would be to expose:
> cgroup_lock() as a special blend of acquire plus lock.

Oh, let's not expose that. That has been a source of problem for some use
cases and we may have to change how cgroups are locked.

> Then there will be no need for bpf_task_freeze_cgroup() with task
> argument. Instead cgroup_freeze() will be such kfunc that
> takes cgroup argument and the verifier will check that
> cgroup was acquired/locked.
> Sort-of what we check to access bpf_rb_root.

There's also cgroup.kill which would be useful for similar use cases. We can
add interface for both but idk. Let's say we have something like the
following (pardon the bad naming):

bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)

Would that work? I'm not necessarily in love with the idea or against adding
separate helpers but the duplication still bothers me a bit.

Thanks.

--
tejun

2024-03-28 20:47:38

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <[email protected]> wrote:
>
> There's also cgroup.kill which would be useful for similar use cases. We can
> add interface for both but idk. Let's say we have something like the
> following (pardon the bad naming):
>
> bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
>
> Would that work? I'm not necessarily in love with the idea or against adding
> separate helpers but the duplication still bothers me a bit.

I liked it.
So filename will be one of cgroup_base_files[].name ?
We probably don't want psi or cgroup1_base_files in there.

From the verifier pov 2nd arg can be "char *knob__str" and
the verifier will make sure it's a constant NULL terminated string,
so at runtime it will be easier to search cgroup_base_files array.
And 'buf' can be: void *mem, int mem__sz with kfunc doing
run-time validation that there a null there.

2024-03-28 21:01:57

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello,

On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
> On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <[email protected]> wrote:
> >
> > There's also cgroup.kill which would be useful for similar use cases. We can
> > add interface for both but idk. Let's say we have something like the
> > following (pardon the bad naming):
> >
> > bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
> >
> > Would that work? I'm not necessarily in love with the idea or against adding
> > separate helpers but the duplication still bothers me a bit.
>
> I liked it.
> So filename will be one of cgroup_base_files[].name ?
> We probably don't want psi or cgroup1_base_files in there.

Would it matter? If the user has root perm, they can do whatever with the
files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
wanna make sure @filename doesn't include '/'? Or is it that you don't want
to go through the usual file name look up?

> From the verifier pov 2nd arg can be "char *knob__str" and
> the verifier will make sure it's a constant NULL terminated string,
> so at runtime it will be easier to search cgroup_base_files array.
> And 'buf' can be: void *mem, int mem__sz with kfunc doing
> run-time validation that there a null there.

That all sound good.

Thanks.

--
tejun

2024-03-28 21:29:20

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On Thu, Mar 28, 2024 at 2:01 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
> > On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <[email protected]> wrote:
> > >
> > > There's also cgroup.kill which would be useful for similar use cases. We can
> > > add interface for both but idk. Let's say we have something like the
> > > following (pardon the bad naming):
> > >
> > > bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
> > >
> > > Would that work? I'm not necessarily in love with the idea or against adding
> > > separate helpers but the duplication still bothers me a bit.
> >
> > I liked it.
> > So filename will be one of cgroup_base_files[].name ?
> > We probably don't want psi or cgroup1_base_files in there.
>
> Would it matter?

Few weak reasons:
cgroup_psi_files have show/write/poll/release which
doesn't map to this bpf_cgroup_knob_write/read ?
cgroup1_base_files probably needs to a separate kfunc
bpf_cgroup1_...

> If the user has root perm, they can do whatever with the
> files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
> wanna make sure @filename doesn't include '/'? Or is it that you don't want
> to go through the usual file name look up?

yeah. why do a file lookup? The names are there in the array.
cgroup pointer gives that "relative path" and knob name is the last
part of such "path". Easy to search in that array(s).

> > From the verifier pov 2nd arg can be "char *knob__str" and
> > the verifier will make sure it's a constant NULL terminated string,
> > so at runtime it will be easier to search cgroup_base_files array.
> > And 'buf' can be: void *mem, int mem__sz with kfunc doing
> > run-time validation that there a null there.
>
> That all sound good.
>
> Thanks.
>
> --
> tejun

2024-03-28 23:23:38

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello,

On Thu, Mar 28, 2024 at 02:28:51PM -0700, Alexei Starovoitov wrote:
> > > So filename will be one of cgroup_base_files[].name ?
> > > We probably don't want psi or cgroup1_base_files in there.
> >
> > Would it matter?
>
> Few weak reasons:
> . cgroup_psi_files have show/write/poll/release which
> doesn't map to this bpf_cgroup_knob_write/read ?
> . cgroup1_base_files probably needs to a separate kfunc
> bpf_cgroup1_...
>
> > If the user has root perm, they can do whatever with the
> > files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
> > wanna make sure @filename doesn't include '/'? Or is it that you don't want
> > to go through the usual file name look up?
>
> yeah. why do a file lookup? The names are there in the array.
> cgroup pointer gives that "relative path" and knob name is the last
> part of such "path". Easy to search in that array(s).

Difficult to tell without looking at the implementation but I don't have
strong opinions. The interface makes sense to me and as long as we can hook
it up in a reasonably way, it should be okay. We can always change internal
implementation later if necessary.

Thanks.

--
tejun

2024-03-29 15:07:57

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello Tejun, Alexei,

On 3/28/24 22:01, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2024 at 01:45:56PM -0700, Alexei Starovoitov wrote:
>> On Thu, Mar 28, 2024 at 1:02 PM Tejun Heo <[email protected]> wrote:
>>>
>>> There's also cgroup.kill which would be useful for similar use cases. We can
>>> add interface for both but idk. Let's say we have something like the
>>> following (pardon the bad naming):
>>>

Yes having the cgroup.kill from bpf would be useful!

>>> bpf_cgroup_knob_write(struct cgroup *cgrp, char *filename, char *buf)
>>>
>>> Would that work? I'm not necessarily in love with the idea or against adding
>>> separate helpers but the duplication still bothers me a bit.
>>
>> I liked it.
>> So filename will be one of cgroup_base_files[].name ?
>> We probably don't want psi or cgroup1_base_files in there.
>
> Would it matter? If the user has root perm, they can do whatever with the
> files anyway, so I'm not sure why we'd restrict any specific knob. Maybe we
> wanna make sure @filename doesn't include '/'? Or is it that you don't want
> to go through the usual file name look up?

It would be easy at least for me if I just start with cgroupv2 and
ensure that it has same available filenames as if we go through kernfs.
Not a root cgroup node and maybe only freeze and kill for now that are
part of cgroup_base_files.

So if I get it right, somehow like what I did but we endup with:

In bpf, cgroup was already acquired.

bpf_cgroup_knob_write(cgroup, "freeze", buf)
|_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock

cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
|_ parse params -> cgroup_ref++ -> krnfs_active_ref-- ->
-> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...

Please let me know if I missed something.

Thanks!

2024-03-29 21:39:35

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello,

On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
> It would be easy at least for me if I just start with cgroupv2 and
> ensure that it has same available filenames as if we go through kernfs.
> Not a root cgroup node and maybe only freeze and kill for now that are
> part of cgroup_base_files.
>
> So if I get it right, somehow like what I did but we endup with:
>
> In bpf, cgroup was already acquired.
>
> bpf_cgroup_knob_write(cgroup, "freeze", buf)
> |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
>
>
> cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
> |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- ->
> -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
>
> Please let me know if I missed something.

I've thought about it a bit and I wonder whether a better way to do this is
implementing this at the kernfs layer. Something like (hopefully with a
better name):

s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);

So, about the same, but takes kernfs_node directory instead of cgroup. This
would make the interface useful for accessing sysfs knobs too which use
similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be
kobj->sd. This way we can avoid the internal object -> path -> internal
object ping-poinging while keeping the interface a lot more generic. What do
you think?

Thanks.

--
tejun

2024-03-29 23:05:28

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On Fri, Mar 29, 2024 at 2:39 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
> > It would be easy at least for me if I just start with cgroupv2 and
> > ensure that it has same available filenames as if we go through kernfs.
> > Not a root cgroup node and maybe only freeze and kill for now that are
> > part of cgroup_base_files.
> >
> > So if I get it right, somehow like what I did but we endup with:
> >
> > In bpf, cgroup was already acquired.
> >
> > bpf_cgroup_knob_write(cgroup, "freeze", buf)
> > |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
> >
> >
> > cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
> > |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- ->
> > -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
> >
> > Please let me know if I missed something.
>
> I've thought about it a bit and I wonder whether a better way to do this is
> implementing this at the kernfs layer. Something like (hopefully with a
> better name):
>
> s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
>
> So, about the same, but takes kernfs_node directory instead of cgroup. This
> would make the interface useful for accessing sysfs knobs too which use
> similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be
> kobj->sd. This way we can avoid the internal object -> path -> internal
> object ping-poinging while keeping the interface a lot more generic. What do
> you think?

And helpers like cgroup_freeze_write() will be refactored
to take kernfs_node directly instead of kernfs_open_file?
Makes sense to me.
Sounds like a minimal amount of changes and flexible enough.

2024-04-02 17:46:50

by Michal Koutný

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello.

On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni <[email protected]> wrote:
> ...
> For some cases we want to freeze the cgroup of a task based on some
> signals, doing so from bpf is better than user space which could be
> too late.

Notice that freezer itself is not immediate -- tasks are frozen as if a
signal (kill(2)) was delivered to them (i.e. returning to userspace).

What kind of signals (also kill?) are you talking about for
illustration?

> Planned users of this feature are: tetragon and systemd when freezing
> a cgroup hierarchy that could be a K8s pod, container, system service
> or a user session.

It sounds like the signals are related to a particular process. If so
what is it good for to freeze unrelated processes in the same cgroup?

I think those answers better clarify why this is needed.

As for the generalization to any cgroup attribute (or kernfs). Can this
be compared with sysctls -- I see there are helpers to intercept user
writes but no helpers to affect sysctl values without an outer writer.
What would justify different approaches between kernfs attributes and
sysctls (direct writes vs modified writes)?

Thanks,
Michal

Attachments:

(No filename) (1.18 kB)
signature.asc (235.00 B)
Download all attachments

2024-04-02 18:13:38

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello,

On 3/30/24 00:04, Alexei Starovoitov wrote:
> On Fri, Mar 29, 2024 at 2:39 PM Tejun Heo <[email protected]> wrote:
>>
>> Hello,
>>
>> On Fri, Mar 29, 2024 at 02:22:28PM +0100, Djalal Harouni wrote:
>>> It would be easy at least for me if I just start with cgroupv2 and
>>> ensure that it has same available filenames as if we go through kernfs.
>>> Not a root cgroup node and maybe only freeze and kill for now that are
>>> part of cgroup_base_files.
>>>
>>> So if I get it right, somehow like what I did but we endup with:
>>>
>>> In bpf, cgroup was already acquired.
>>>
>>> bpf_cgroup_knob_write(cgroup, "freeze", buf)
>>> |_ parse params -> lock cgroup_mutex -> cgroup_freeze() -> unlock
>>>
>>>
>>> cgroup_freeze_write(struct kernfs_open_file *of, char *buf,...)
>>> |_ parse params -> cgroup_ref++ -> krnfs_active_ref-- ->
>>> -> lock cgroup_mutex -> cgroup_freeze() -> unlock + krnfs++ ...
>>>
>>> Please let me know if I missed something.
>>
>> I've thought about it a bit and I wonder whether a better way to do this is
>> implementing this at the kernfs layer. Something like (hopefully with a
>> better name):
>>
>> s32 bpf_kernfs_knob_write(struct kernfs_node *dir, const char *knob, char *buf);
>>
>> So, about the same, but takes kernfs_node directory instead of cgroup. This
>> would make the interface useful for accessing sysfs knobs too which use
>> similar conventions. For cgroup, @dir is just cgrp->kn and for sysfs it'd be
>> kobj->sd. This way we can avoid the internal object -> path -> internal
>> object ping-poinging while keeping the interface a lot more generic. What do
>> you think?
>
> And helpers like cgroup_freeze_write() will be refactored
> to take kernfs_node directly instead of kernfs_open_file?
> Makes sense to me.
> Sounds like a minimal amount of changes and flexible enough.

Thank you Alexei, Tejun for the feedback. Will try to get back with a v2.

One particular thing is the kernfs_open_file->mutex nests outside of the
refcounting of kernfs_node, let's see.

Thanks!

2024-04-02 18:21:22

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hello Michal,

On 4/2/24 18:16, Michal Koutný wrote:
> Hello.
>
> On Wed, Mar 27, 2024 at 11:53:22PM +0100, Djalal Harouni <[email protected]> wrote:
>> ...
>> For some cases we want to freeze the cgroup of a task based on some
>> signals, doing so from bpf is better than user space which could be
>> too late.
>
> Notice that freezer itself is not immediate -- tasks are frozen as if a
> signal (kill(2)) was delivered to them (i.e. returning to userspace).

Thanks yes, I would expect freeze to behave like signal, and if one
wants to block immediately there is the LSM override return. The
selftest attached tries to do exactly that.

> What kind of signals (also kill?) are you talking about for
> illustration?

Could be security signals, reading sensitive files or related to any
operation management, for X reasons this user session should be freezed
or killed.

The kill is an effective defense against fork-bombs as an example.

>> Planned users of this feature are: tetragon and systemd when freezing
>> a cgroup hierarchy that could be a K8s pod, container, system service
>> or a user session.
>
> It sounds like the signals are related to a particular process. If so
> what is it good for to freeze unrelated processes in the same cgroup?

Today some container/pod operations are performed at bpf level, having
the freeze and kill available is straightforward to perform this.

> I think those answers better clarify why this is needed.

Alright will add those in v2.

>
> As for the generalization to any cgroup attribute (or kernfs). Can this
> be compared with sysctls -- I see there are helpers to intercept user
> writes but no helpers to affect sysctl values without an outer writer.
> What would justify different approaches between kernfs attributes and
> sysctls (direct writes vs modified writes)?

For generalizing this, haven't thought about it that much. First use
case is to try to get freeze and possibly kill support, and use a common
interface as requested.

Thank you!

>
> Thanks,
> Michal

2024-04-09 16:21:24

by Michal Koutný

[permalink] [raw]

Subject: Re: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

Hi.

On Tue, Apr 02, 2024 at 07:20:45PM +0100, Djalal Harouni <[email protected]> wrote:
> Thanks yes, I would expect freeze to behave like signal, and if one
> wants to block immediately there is the LSM override return. The
> selftest attached tries to do exactly that.

Are you refering to this part:

int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size)
...
ret = bpf_task_freeze_cgroup(task, 1);
if (!ret) {
ret = -EPERM;
/* reset for next call */
?

> Could be security signals, reading sensitive files or related to any
> operation management, for X reasons this user session should be freezed
> or killed.

What can be done with a frozen cgroup after anything of that happens?
Anything besides killing anyway?

Killing of an offending process could be caught by its supervisor (like
container runtime or systemd) and propagated accordingly to the whole
cgroup.

> The kill is an effective defense against fork-bombs as an example.

There are several ways how to prevent fork-bombs in kernel already, it
looks like a contrived example.

> Today some container/pod operations are performed at bpf level, having
> the freeze and kill available is straightforward to perform this.

It seems to me like an extra step when the same operation can be done from
(the managing) userspace already.

> For generalizing this, haven't thought about it that much. First use
> case is to try to get freeze and possibly kill support, and use a common
> interface as requested.

BTW, I notice that there is bpf_sys_bpf() helper that allows calling an
arbitrary syscall. Wouldn't that be sufficient for everything?

(Based on how I still understand the problem: either you must respond
immediately and then the direct return from LSM is appropriate or timing
is not sensitive but you want act on whole cgroup.)

Thanks,
Michal

Attachments:

(No filename) (1.87 kB)
signature.asc (235.00 B)
Download all attachments

2024-04-11 00:26:42

by Yonghong Song

[permalink] [raw]

Subject: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On 4/9/24 8:32 AM, Michal Koutný wrote:
> Hi.
>
> On Tue, Apr 02, 2024 at 07:20:45PM +0100, Djalal Harouni <[email protected]> wrote:
>> Thanks yes, I would expect freeze to behave like signal, and if one
>> wants to block immediately there is the LSM override return. The
>> selftest attached tries to do exactly that.
> Are you refering to this part:
>
> int BPF_PROG(lsm_freeze_cgroup, int cmd, union bpf_attr *attr, unsigned int size)
> ...
> ret = bpf_task_freeze_cgroup(task, 1);
> if (!ret) {
> ret = -EPERM;
> /* reset for next call */
> ?
>
>
>> Could be security signals, reading sensitive files or related to any
>> operation management, for X reasons this user session should be freezed
>> or killed.
> What can be done with a frozen cgroup after anything of that happens?
> Anything besides killing anyway?
>
> Killing of an offending process could be caught by its supervisor (like
> container runtime or systemd) and propagated accordingly to the whole
> cgroup.
>
>> The kill is an effective defense against fork-bombs as an example.
> There are several ways how to prevent fork-bombs in kernel already, it
> looks like a contrived example.
>
>> Today some container/pod operations are performed at bpf level, having
>> the freeze and kill available is straightforward to perform this.
> It seems to me like an extra step when the same operation can be done from
> (the managing) userspace already.
>
>> For generalizing this, haven't thought about it that much. First use
>> case is to try to get freeze and possibly kill support, and use a common
>> interface as requested.
> BTW, I notice that there is bpf_sys_bpf() helper that allows calling an
> arbitrary syscall. Wouldn't that be sufficient for everything?

This is not true. Currently, only 'bpf' and 'close' syscalls are supported.

static const struct bpf_func_proto *
syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
switch (func_id) {
case BPF_FUNC_sys_bpf:
return !bpf_token_capable(prog->aux->token, CAP_PERFMON)
? NULL : &bpf_sys_bpf_proto;
case BPF_FUNC_btf_find_by_name_kind:
return &bpf_btf_find_by_name_kind_proto;
case BPF_FUNC_sys_close:
return &bpf_sys_close_proto;
case BPF_FUNC_kallsyms_lookup_name:
return &bpf_kallsyms_lookup_name_proto;
default:
return tracing_prog_func_proto(func_id, prog);
}
}

More syscalls can be added (through kfunc) if there is a use case for that.

>
> (Based on how I still understand the problem: either you must respond
> immediately and then the direct return from LSM is appropriate or timing
> is not sensitive but you want act on whole cgroup.)
>
> Thanks,
> Michal

2024-04-11 08:26:17

by Michal Koutný

[permalink] [raw]

Subject: Re: Re: [RFC PATCH bpf-next 0/3] bpf: freeze a task cgroup from bpf

On Wed, Apr 10, 2024 at 05:26:18PM -0700, Yonghong Song <[email protected]> wrote:
> This is not true.

Oh, I misunderstood a manpage, I can see now it's not for any syscall.

> More syscalls can be added (through kfunc) if there is a use case for that.

Thus, I don't want to open this up.

Michal

Attachments:

(No filename) (315.00 B)
signature.asc (235.00 B)
Download all attachments

2024-04-11 08:37:31