This patchset provides a bpf solution for monitoring cgroup activities
and exporting cgroup states in an organized way in bpffs. It introduces
the following features:
1. sleepable tracepoints and sleepable tracing programs.
2. a set of bpf helpers for creating and deleting files and
directories in bpffs.
3. a new iter prog, parameterizable by cgroup ids, to print cgroup
state.
Sleepable tracepoints and tracing progs allow us to run bpf progs when
a new cgroup is created or an existing cgroup is removed. The set of
filesystem helpers allows sleepable tracing progs to set up directories
in bpffs for each cgroup. The progs can also pin and unlink bpf objects
from these bpffs directories. The new iter prog can be used to export
cgroup states. Using this set of additions, we are creating an extension
to the current cgroup interface to export per-cgroup stats.
See the selftest added in patch 09/09, test_cgroup_stats, for a full
example of how this can be done. The test builds a custom metric that
measures per-cgroup scheduling latencies and exports it via cgroup
iters, which are pinned by sleepable tracing progs attached to the
cgroup tracepoints.
The same approach can be used not only for per-cgroup stats but also
for other kinds of state, such as task_vma iters and per-bpf-prog
state. For example, we can write sleepable tracing progs that monitor
task fork and exit, set up directories, parameterize task_vma iters
and pin them.
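To give a flavor of the exporting side, here is a minimal sketch of a
cgroup iter program. The prog and field names are placeholders; it
assumes direct BTF pointer access to the cgroup and the BPF_SEQ_PRINTF
macro from libbpf:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("iter/cgroup")
  int dump_cgroup_state(struct bpf_iter__cgroup *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct cgroup *cgrp = ctx->cgroup;

          /* The output format is entirely decided by the program. */
          BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);
          return 0;
  }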
Hao Luo (9):
bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
bpf: Add BPF_OBJ_PIN and BPF_OBJ_GET in the bpf_sys_bpf helper
selftests/bpf: tests mkdir, rmdir, unlink and pin in syscall
bpf: Introduce sleepable tracepoints
cgroup: Sleepable cgroup tracepoints.
libbpf: Add sleepable tp_btf
bpf: Lift permission check in __sys_bpf when called from kernel.
bpf: Introduce cgroup iter
selftests/bpf: Tests using sleepable tracepoints to monitor cgroup
events
include/linux/bpf.h | 16 +-
include/linux/tracepoint-defs.h | 1 +
include/trace/bpf_probe.h | 22 +-
include/trace/events/cgroup.h | 45 ++++
include/uapi/linux/bpf.h | 32 +++
kernel/bpf/Makefile | 2 +-
kernel/bpf/cgroup_iter.c | 141 +++++++++++
kernel/bpf/inode.c | 33 ++-
kernel/bpf/syscall.c | 237 ++++++++++++++++--
kernel/cgroup/cgroup.c | 5 +
kernel/trace/bpf_trace.c | 5 +
tools/include/uapi/linux/bpf.h | 32 +++
tools/lib/bpf/libbpf.c | 1 +
.../selftests/bpf/prog_tests/syscall.c | 67 ++++-
.../bpf/prog_tests/test_cgroup_stats.c | 187 ++++++++++++++
tools/testing/selftests/bpf/progs/bpf_iter.h | 7 +
.../selftests/bpf/progs/cgroup_monitor.c | 78 ++++++
.../selftests/bpf/progs/cgroup_sched_lat.c | 232 +++++++++++++++++
.../testing/selftests/bpf/progs/syscall_fs.c | 69 +++++
19 files changed, 1175 insertions(+), 37 deletions(-)
create mode 100644 kernel/bpf/cgroup_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_stats.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_monitor.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_sched_lat.c
create mode 100644 tools/testing/selftests/bpf/progs/syscall_fs.c
--
2.35.1.574.g5d30c73bfb-goog
Add two new sleepable tracepoints in cgroup: cgroup_mkdir_s and
cgroup_rmdir_s. The suffix _s means they are in a sleepable context.
Because these two tracepoints don't need full cgroup paths, they don't
have to run in atomic context. They are also called without holding
cgroup_mutex.
They can be used by bpf to monitor cgroup creation and deletion.
Sleepable bpf programs can attach to these two tracepoints and create
corresponding directories in bpffs. The created directories don't need
to encode cgroup paths; a cgroup id is sufficient to identify the
cgroup. Once the bpffs directories have been created, the bpf prog can
further pin bpf objects inside the directories and allow users to read
the pinned objects.
This serves as a way to extend the fixed cgroup interface.
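For example, a sleepable bpf prog attached to cgroup_mkdir_s might look
roughly like the sketch below. The "tp_btf.s/" section name is an
assumption here; the actual libbpf spelling for sleepable tp_btf progs
is introduced later in this series.

  SEC("tp_btf.s/cgroup_mkdir_s")
  int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp)
  {
          u64 cgid = cgrp->kn->id;

          /* Sleepable context: the prog can now set up a bpffs
           * directory named after cgid and pin objects in it, using
           * the mkdir and pin operations added earlier in the series.
           */
          bpf_printk("cgroup %llu created", cgid);
          return 0;
  }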
Cc: Tejun Heo <[email protected]>
Signed-off-by: Hao Luo <[email protected]>
---
include/trace/events/cgroup.h | 45 +++++++++++++++++++++++++++++++++++
kernel/cgroup/cgroup.c | 5 ++++
2 files changed, 50 insertions(+)
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index dd7d7c9efecd..4483a7d6c43a 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -204,6 +204,51 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
TP_ARGS(cgrp, path, val)
);
+/*
+ * The following tracepoints are supposed to be called in a sleepable context.
+ */
+DECLARE_EVENT_CLASS(cgroup_sleepable_tp,
+
+ TP_PROTO(struct cgroup *cgrp),
+
+ TP_ARGS(cgrp),
+
+ TP_STRUCT__entry(
+ __field( int, root )
+ __field( int, level )
+ __field( u64, id )
+ ),
+
+ TP_fast_assign(
+ __entry->root = cgrp->root->hierarchy_id;
+ __entry->id = cgroup_id(cgrp);
+ __entry->level = cgrp->level;
+ ),
+
+ TP_printk("root=%d id=%llu level=%d",
+ __entry->root, __entry->id, __entry->level)
+);
+
+#ifdef DEFINE_EVENT_SLEEPABLE
+#undef DEFINE_EVENT
+#define DEFINE_EVENT(template, call, proto, args) \
+ DEFINE_EVENT_SLEEPABLE(template, call, PARAMS(proto), PARAMS(args))
+#endif
+
+DEFINE_EVENT(cgroup_sleepable_tp, cgroup_mkdir_s,
+
+ TP_PROTO(struct cgroup *cgrp),
+
+ TP_ARGS(cgrp)
+);
+
+DEFINE_EVENT(cgroup_sleepable_tp, cgroup_rmdir_s,
+
+ TP_PROTO(struct cgroup *cgrp),
+
+ TP_ARGS(cgrp)
+);
+
#endif /* _TRACE_CGROUP_H */
/* This part must be outside protection */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9d05c3ca2d5e..f14ab00d9ef5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5535,6 +5535,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
cgroup_destroy_locked(cgrp);
out_unlock:
cgroup_kn_unlock(parent_kn);
+ if (!ret)
+ trace_cgroup_mkdir_s(cgrp);
return ret;
}
@@ -5725,6 +5727,9 @@ int cgroup_rmdir(struct kernfs_node *kn)
TRACE_CGROUP_PATH(rmdir, cgrp);
cgroup_kn_unlock(kn);
+
+ if (!ret)
+ trace_cgroup_rmdir_s(cgrp);
return ret;
}
--
2.35.1.574.g5d30c73bfb-goog
Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
iter doesn't iterate a set of kernel objects. Instead, it is supposed to
be parameterized by a cgroup id and prints only that cgroup. So one
needs to specify a target cgroup id when attaching this iter.
The target cgroup's state can be read out via a link of this iter.
Typically, we can monitor cgroup creation and deletion using sleepable
tracing and use it to create corresponding directories in bpffs and pin
a cgroup id parameterized link in the directory. Then we can read the
auto-pinned iter link to get cgroup's state. The output of the iter link
is determined by the program. See the selftest test_cgroup_stats.c for
an example.
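For illustration, attaching this iter from user space with libbpf could
look roughly like the following sketch (the skeleton and prog names are
placeholders):

  union bpf_iter_link_info linfo = {};
  DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  struct bpf_link *link;

  linfo.cgroup.cgroup_id = cgroup_id;  /* the target cgroup */
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);

  link = bpf_program__attach_iter(skel->progs.dump_cgroup_state, &opts);

The returned link can be pinned in bpffs; reading the pinned file then
runs the program against the target cgroup.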
Signed-off-by: Hao Luo <[email protected]>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 6 ++
kernel/bpf/Makefile | 2 +-
kernel/bpf/cgroup_iter.c | 141 +++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 6 ++
5 files changed, 155 insertions(+), 1 deletion(-)
create mode 100644 kernel/bpf/cgroup_iter.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 759ade7b24b3..3ce9b0b7ed89 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1595,6 +1595,7 @@ int bpf_obj_get_path(bpfptr_t pathname, int flags);
struct bpf_iter_aux_info {
struct bpf_map *map;
+ u64 cgroup_id;
};
typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a5dbc794403d..855ad80d9983 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
struct {
__u32 map_fd;
} map;
+ struct {
+ __u64 cgroup_id;
+ } cgroup;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5887,6 +5890,9 @@ struct bpf_link_info {
struct {
__u32 map_id;
} map;
+ struct {
+ __u64 cgroup_id;
+ } cgroup;
};
} iter;
struct {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..52a0e4c6e96e 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..011d9dcd1d51
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+ __bpf_md_ptr(struct bpf_iter_meta *, meta);
+ __bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct cgroup *cgroup;
+ u64 cgroup_id;
+
+ /* Only one session is supported. */
+ if (*pos > 0)
+ return NULL;
+
+ cgroup_id = *(u64 *)seq->private;
+ cgroup = cgroup_get_from_id(cgroup_id);
+ if (!cgroup)
+ return NULL;
+
+ if (*pos == 0)
+ ++*pos;
+
+ return cgroup;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ ++*pos;
+ return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+ struct bpf_iter__cgroup ctx;
+ struct bpf_iter_meta meta;
+ struct bpf_prog *prog;
+ int ret = 0;
+
+ ctx.meta = &meta;
+ ctx.cgroup = v;
+ meta.seq = seq;
+ prog = bpf_iter_get_info(&meta, false);
+ if (prog)
+ ret = bpf_iter_run_prog(prog, &ctx);
+
+ return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+ if (v)
+ cgroup_put(v);
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+ .start = cgroup_iter_seq_start,
+ .next = cgroup_iter_seq_next,
+ .stop = cgroup_iter_seq_stop,
+ .show = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+ *(u64 *)priv_data = aux->cgroup_id;
+ return 0;
+}
+
+static void cgroup_iter_seq_fini(void *priv_data)
+{
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+ .seq_ops = &cgroup_iter_seq_ops,
+ .init_seq_private = cgroup_iter_seq_init,
+ .fini_seq_private = cgroup_iter_seq_fini,
+ .seq_priv_size = sizeof(u64),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+ union bpf_iter_link_info *linfo,
+ struct bpf_iter_aux_info *aux)
+{
+ aux->cgroup_id = linfo->cgroup.cgroup_id;
+ return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+}
+
+void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+ struct seq_file *seq)
+{
+ char buf[64] = {0};
+
+ cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
+ seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
+ seq_printf(seq, "cgroup_path:\t%s\n", buf);
+}
+
+int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+ struct bpf_link_info *info)
+{
+ info->iter.cgroup.cgroup_id = aux->cgroup_id;
+ return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+ struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+ .target = "cgroup",
+ .attach_target = bpf_iter_attach_cgroup,
+ .detach_target = bpf_iter_detach_cgroup,
+ .show_fdinfo = bpf_iter_cgroup_show_fdinfo,
+ .fill_link_info = bpf_iter_cgroup_fill_link_info,
+ .ctx_arg_info_size = 1,
+ .ctx_arg_info = {
+ { offsetof(struct bpf_iter__cgroup, cgroup),
+ PTR_TO_BTF_ID },
+ },
+ .seq_info = &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+ bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+ return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a5dbc794403d..855ad80d9983 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
struct {
__u32 map_fd;
} map;
+ struct {
+ __u64 cgroup_id;
+ } cgroup;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5887,6 +5890,9 @@ struct bpf_link_info {
struct {
__u32 map_id;
} map;
+ struct {
+ __u64 cgroup_id;
+ } cgroup;
};
} iter;
struct {
--
2.35.1.574.g5d30c73bfb-goog
Hi Hao,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: hexagon-randconfig-r023-20220226 (https://download.01.org/0day-ci/archive/20220226/[email protected]/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d271fc04d5b97b12e6b797c6067d3c96a8d7470e)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/ee74423719e2efb4efa7a4491920c78b60024ec7
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
git checkout ee74423719e2efb4efa7a4491920c78b60024ec7
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/bpf/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> kernel/bpf/cgroup_iter.c:107:39: warning: format specifies type 'unsigned long' but the argument has type 'u64' (aka 'unsigned long long') [-Wformat]
seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
~~~ ^~~~~~~~~~~~~~
%llu
>> kernel/bpf/cgroup_iter.c:101:6: warning: no previous prototype for function 'bpf_iter_cgroup_show_fdinfo' [-Wmissing-prototypes]
void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
^
kernel/bpf/cgroup_iter.c:101:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
^
static
>> kernel/bpf/cgroup_iter.c:111:5: warning: no previous prototype for function 'bpf_iter_cgroup_fill_link_info' [-Wmissing-prototypes]
int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
^
kernel/bpf/cgroup_iter.c:111:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
^
static
3 warnings generated.
vim +107 kernel/bpf/cgroup_iter.c
100
> 101 void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
102 struct seq_file *seq)
103 {
104 char buf[64] = {0};
105
106 cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
> 107 seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
108 seq_printf(seq, "cgroup_path:\t%s\n", buf);
109 }
110
> 111 int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
112 struct bpf_link_info *info)
113 {
114 info->iter.cgroup.cgroup_id = aux->cgroup_id;
115 return 0;
116 }
117
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Hao,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-randconfig-c002-20220226 (https://download.01.org/0day-ci/archive/20220226/[email protected]/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/ee74423719e2efb4efa7a4491920c78b60024ec7
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
git checkout ee74423719e2efb4efa7a4491920c78b60024ec7
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm SHELL=/bin/bash kernel/bpf/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> kernel/bpf/cgroup_iter.c:101:6: warning: no previous prototype for 'bpf_iter_cgroup_show_fdinfo' [-Wmissing-prototypes]
101 | void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
>> kernel/bpf/cgroup_iter.c:107:40: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'u64' {aka 'long long unsigned int'} [-Wformat=]
107 | seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
| ~~^ ~~~~~~~~~~~~~~
| | |
| | u64 {aka long long unsigned int}
| long unsigned int
| %llu
kernel/bpf/cgroup_iter.c: At top level:
>> kernel/bpf/cgroup_iter.c:111:5: warning: no previous prototype for 'bpf_iter_cgroup_fill_link_info' [-Wmissing-prototypes]
111 | int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vim +/bpf_iter_cgroup_show_fdinfo +101 kernel/bpf/cgroup_iter.c
100
> 101 void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
102 struct seq_file *seq)
103 {
104 char buf[64] = {0};
105
106 cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
> 107 seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
108 seq_printf(seq, "cgroup_path:\t%s\n", buf);
109 }
110
> 111 int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
112 struct bpf_link_info *info)
113 {
114 info->iter.cgroup.cgroup_id = aux->cgroup_id;
115 return 0;
116 }
117
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Hao,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-defconfig (https://download.01.org/0day-ci/archive/20220226/[email protected]/config)
compiler: m68k-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/ee74423719e2efb4efa7a4491920c78b60024ec7
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Hao-Luo/Extend-cgroup-interface-with-bpf/20220226-074615
git checkout ee74423719e2efb4efa7a4491920c78b60024ec7
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=m68k SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_stop':
>> kernel/bpf/cgroup_iter.c:60:17: error: implicit declaration of function 'cgroup_put'; did you mean 'cgroup_psi'? [-Werror=implicit-function-declaration]
60 | cgroup_put(v);
| ^~~~~~~~~~
| cgroup_psi
kernel/bpf/cgroup_iter.c: At top level:
kernel/bpf/cgroup_iter.c:101:6: warning: no previous prototype for 'bpf_iter_cgroup_show_fdinfo' [-Wmissing-prototypes]
101 | void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
kernel/bpf/cgroup_iter.c:107:40: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'u64' {aka 'long long unsigned int'} [-Wformat=]
107 | seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
| ~~^ ~~~~~~~~~~~~~~
| | |
| | u64 {aka long long unsigned int}
| long unsigned int
| %llu
kernel/bpf/cgroup_iter.c: At top level:
kernel/bpf/cgroup_iter.c:111:5: warning: no previous prototype for 'bpf_iter_cgroup_fill_link_info' [-Wmissing-prototypes]
111 | int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +60 kernel/bpf/cgroup_iter.c
56
57 static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
58 {
59 if (v)
> 60 cgroup_put(v);
61 }
62
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On 2/25/22 3:43 PM, Hao Luo wrote:
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter.
>
> The target cgroup's state can be read out via a link of this iter.
> Typically, we can monitor cgroup creation and deletion using sleepable
> tracing and use it to create corresponding directories in bpffs and pin
> a cgroup id parameterized link in the directory. Then we can read the
> auto-pinned iter link to get cgroup's state. The output of the iter link
> is determined by the program. See the selftest test_cgroup_stats.c for
> an example.
>
> Signed-off-by: Hao Luo <[email protected]>
> ---
> include/linux/bpf.h | 1 +
> include/uapi/linux/bpf.h | 6 ++
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/cgroup_iter.c | 141 +++++++++++++++++++++++++++++++++
> tools/include/uapi/linux/bpf.h | 6 ++
> 5 files changed, 155 insertions(+), 1 deletion(-)
> create mode 100644 kernel/bpf/cgroup_iter.c
>
[...]
> +
> +static const struct seq_operations cgroup_iter_seq_ops = {
> + .start = cgroup_iter_seq_start,
> + .next = cgroup_iter_seq_next,
> + .stop = cgroup_iter_seq_stop,
> + .show = cgroup_iter_seq_show,
> +};
> +
> +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
> +
> +static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
> +{
> + *(u64 *)priv_data = aux->cgroup_id;
> + return 0;
> +}
> +
> +static void cgroup_iter_seq_fini(void *priv_data)
> +{
> +}
> +
> +static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
> + .seq_ops = &cgroup_iter_seq_ops,
> + .init_seq_private = cgroup_iter_seq_init,
> + .fini_seq_private = cgroup_iter_seq_fini,
Since cgroup_iter_seq_fini() is a nop, you can just have
.fini_seq_private = NULL,
> + .seq_priv_size = sizeof(u64),
> +};
> +
> +static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
> + union bpf_iter_link_info *linfo,
> + struct bpf_iter_aux_info *aux)
> +{
> + aux->cgroup_id = linfo->cgroup.cgroup_id;
> + return 0;
> +}
> +
> +static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
> +{
> +}
> +
> +void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> + struct seq_file *seq)
> +{
> + char buf[64] = {0};
Is this 64 the maximum possible cgroup path length?
If there is a macro for that, I think it would be good to use it.
> +
> + cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
cgroup_path_from_kernfs_id() might fail, in which case buf will stay
empty and cgroup_path will show nothing. I guess this might be the
expected result. It might be good to add a comment in the code to
clarify this.
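Something like the following, perhaps:

	/* If cgroup_path_from_kernfs_id() fails, e.g. because the
	 * cgroup has already been deleted, buf stays empty and
	 * cgroup_path is printed as an empty string, which seems to
	 * be the intended fallback.
	 */
	cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));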
> + seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
> + seq_printf(seq, "cgroup_path:\t%s\n", buf);
> +}
> +
> +int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
> + struct bpf_link_info *info)
> +{
> + info->iter.cgroup.cgroup_id = aux->cgroup_id;
> + return 0;
> +}
> +
> +DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
> + struct cgroup *cgroup)
> +
> +static struct bpf_iter_reg bpf_cgroup_reg_info = {
> + .target = "cgroup",
> + .attach_target = bpf_iter_attach_cgroup,
> + .detach_target = bpf_iter_detach_cgroup,
The same here: since bpf_iter_detach_cgroup() is a nop,
you can replace it with NULL in the above.
> + .show_fdinfo = bpf_iter_cgroup_show_fdinfo,
> + .fill_link_info = bpf_iter_cgroup_fill_link_info,
> + .ctx_arg_info_size = 1,
> + .ctx_arg_info = {
> + { offsetof(struct bpf_iter__cgroup, cgroup),
> + PTR_TO_BTF_ID },
> + },
> + .seq_info = &cgroup_iter_seq_info,
> +};
> +
> +static int __init bpf_cgroup_iter_init(void)
> +{
> + bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
> + return bpf_iter_reg_target(&bpf_cgroup_reg_info);
> +}
> +
[...]
On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter.
>
> The target cgroup's state can be read out via a link of this iter.
> Typically, we can monitor cgroup creation and deletion using sleepable
> tracing and use it to create corresponding directories in bpffs and pin
> a cgroup id parameterized link in the directory. Then we can read the
> auto-pinned iter link to get cgroup's state. The output of the iter link
> is determined by the program. See the selftest test_cgroup_stats.c for
> an example.
>
> Signed-off-by: Hao Luo <[email protected]>
> ---
>
> [...]
>
> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> +{
> + if (v)
> + cgroup_put(v);
> +}
I think in existing iterators, we make a final call to seq_show, with v as NULL,
is there a specific reason to do it differently for this? There is logic in
bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
NULL, to execute BPF program with NULL p, see the comment above stop label.
If you do add the seq_show call with NULL, you'd also need to change the
ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
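I.e., something like:

	.ctx_arg_info		= {
		{ offsetof(struct bpf_iter__cgroup, cgroup),
		  PTR_TO_BTF_ID_OR_NULL },
	},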
> [...]
--
Kartikeya
On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
>> be parameterized by a cgroup id and prints only that cgroup. So one
>> needs to specify a target cgroup id when attaching this iter.
>>
>> The target cgroup's state can be read out via a link of this iter.
>> Typically, we can monitor cgroup creation and deletion using sleepable
>> tracing and use it to create corresponding directories in bpffs and pin
>> a cgroup id parameterized link in the directory. Then we can read the
>> auto-pinned iter link to get cgroup's state. The output of the iter link
>> is determined by the program. See the selftest test_cgroup_stats.c for
>> an example.
>>
>> Signed-off-by: Hao Luo <[email protected]>
>> [...]
>>
>> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
>> +{
>> + if (v)
>> + cgroup_put(v);
>> +}
>
> I think in existing iterators, we make a final call to seq_show, with v as NULL,
> is there a specific reason to do it differently for this? There is logic in
> bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
> NULL, to execute BPF program with NULL p, see the comment above stop label.
>
> If you do add the seq_show call with NULL, you'd also need to change the
> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
Kumar, PTR_TO_BTF_ID should be okay since the show() never takes a
NULL cgroup. But we do have issues for cgroup_iter_seq_stop(), which
I missed earlier.
For cgroup_iter, the following is the current workflow:
start -> not NULL -> show -> next -> NULL -> stop
or
start -> NULL -> stop
So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
the cgroup_put() is not actually called, i.e., corresponding cgroup is
not freed.
There are two ways to fix the issue:
. call cgroup_put() in next() before return NULL. This way,
stop() will be a noop.
. put cgroup_get_from_id() and cgroup_put() in
bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
I prefer the second approach as it is cleaner.
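A rough sketch of the second approach (bpf_iter_aux_info would then
hold a cgroup pointer instead of an id):

  static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
                                    union bpf_iter_link_info *linfo,
                                    struct bpf_iter_aux_info *aux)
  {
          struct cgroup *cgrp;

          cgrp = cgroup_get_from_id(linfo->cgroup.cgroup_id);
          if (!cgrp)
                  return -ENOENT;

          /* hold the reference for the lifetime of the link */
          aux->cgroup = cgrp;
          return 0;
  }

  static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
  {
          cgroup_put(aux->cgroup);
  }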
>
>> [...]
>
> --
> Kartikeya
On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
>
>
> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> > On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > > be parameterized by a cgroup id and prints only that cgroup. So one
> > > needs to specify a target cgroup id when attaching this iter.
> > >
> > > The target cgroup's state can be read out via a link of this iter.
> > > Typically, we can monitor cgroup creation and deletion using sleepable
> > > tracing and use it to create corresponding directories in bpffs and pin
> > > a cgroup id parameterized link in the directory. Then we can read the
> > > auto-pinned iter link to get cgroup's state. The output of the iter link
> > > is determined by the program. See the selftest test_cgroup_stats.c for
> > > an example.
> > >
> > > Signed-off-by: Hao Luo <[email protected]>
> > > [...]
> > >
> > > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> > > +{
> > > + if (v)
> > > + cgroup_put(v);
> > > +}
> >
> > I think in existing iterators, we make a final call to seq_show, with v as NULL,
> > is there a specific reason to do it differently for this? There is logic in
> > bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
> > NULL, to execute BPF program with NULL p, see the comment above stop label.
> >
> > If you do add the seq_show call with NULL, you'd also need to change the
> > ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
>
> Kumar, PTR_TO_BTF_ID should be okay since the show() never takes a NULL
> cgroup. But we do have issues for cgroup_iter_seq_stop(), which I missed
> earlier.
>
Right, I was thinking whether it should call seq_show for v == NULL case. All
other iterators seem to do so, it's a bit different here since it is only
iterating over a single cgroup, I guess, but it would be nice to have some
consistency.
> For cgroup_iter, the following is the current workflow:
> start -> not NULL -> show -> next -> NULL -> stop
> or
> start -> NULL -> stop
>
> So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
> the cgroup_put() is not actually called, i.e., corresponding cgroup is
> not freed.
>
> There are two ways to fix the issue:
> . call cgroup_put() in next() before return NULL. This way,
> stop() will be a noop.
> . put cgroup_get_from_id() and cgroup_put() in
> bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
>
> I prefer the second approach as it is cleaner.
>
I think current approach is also not safe if cgroup_id gets reused, right? I.e.
it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
is less likely to occur, but since it wraps around to find a free ID it might
not be theoretical.
> > > [...]
--
Kartikeya
On Wed, Mar 2, 2022 at 7:03 PM Kumar Kartikeya Dwivedi <[email protected]> wrote:
> >
>
> I think current approach is also not safe if cgroup_id gets reused, right? I.e.
> it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> is less likely to occur, but since it wraps around to find a free ID it might
> not be theoretical.
cgroupid is 64-bit.
On 3/2/22 7:03 PM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
>>
>>
>> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
>>> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
>>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
>>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
>>>> be parameterized by a cgroup id and prints only that cgroup. So one
>>>> needs to specify a target cgroup id when attaching this iter.
>>>>
>>>> The target cgroup's state can be read out via a link of this iter.
>>>> Typically, we can monitor cgroup creation and deletion using sleepable
>>>> tracing and use it to create corresponding directories in bpffs and pin
>>>> a cgroup id parameterized link in the directory. Then we can read the
>>>> auto-pinned iter link to get cgroup's state. The output of the iter link
>>>> is determined by the program. See the selftest test_cgroup_stats.c for
>>>> an example.
>>>>
>>>> Signed-off-by: Hao Luo <[email protected]>
>>>> [...]
>>>>
>>>> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
>>>> +{
>>>> + if (v)
>>>> + cgroup_put(v);
>>>> +}
>>>
>>> I think in the existing iterators we make a final call to seq_show with v as
>>> NULL; is there a specific reason to do it differently for this one? There is
>>> logic in bpf_iter.c to trigger the ->stop() callback again when ->start() or
>>> ->next() returns NULL, to execute the BPF program with a NULL p; see the
>>> comment above the stop label.
>>>
>>> If you do add the seq_show call with NULL, you'd also need to change the
>>> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
>>
>> Kumar, PTR_TO_BTF_ID should be okay since show() is never invoked with a NULL
>> cgroup. But we do have an issue with cgroup_iter_seq_stop(), which I missed
>> earlier.
>>
>
> Right, I was wondering whether it should call seq_show for the v == NULL case.
> All other iterators seem to do so; it's a bit different here since it is only
> iterating over a single cgroup, I guess, but it would be nice to have some
> consistency.
You are correct; I think it is okay since it only iterates over one
cgroup. This is different from the other cases so far, where more
than one object may be traversed. We may have other future use cases
along the same lines, e.g., a single task. I think we can abstract out
the start()/next()/stop() callbacks for such use cases. So it is okay
that this differs from the existing iterators, since they are indeed
different.
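Something along these lines, maybe (completely untested sketch; the
bpf_iter_single_* names are made up for illustration and are not
existing kernel symbols). A single-object target would then only
supply a get_obj()/put_obj() pair:

struct bpf_iter_single_ops {
	/* acquire a reference to the one object, or return NULL */
	void *(*get_obj)(struct seq_file *seq);
	/* release the reference taken by get_obj() */
	void (*put_obj)(void *obj);
};

static void *bpf_iter_single_start(struct seq_file *seq, loff_t *pos)
{
	struct bpf_iter_single_ops *ops = seq->private;

	/* Only one session is supported. */
	if (*pos > 0)
		return NULL;
	++*pos;
	return ops->get_obj(seq);
}

static void *bpf_iter_single_next(struct seq_file *seq, void *v, loff_t *pos)
{
	++*pos;
	return NULL;	/* there is never a second object */
}

static void bpf_iter_single_stop(struct seq_file *seq, void *v)
{
	struct bpf_iter_single_ops *ops = seq->private;

	if (v)
		ops->put_obj(v);
}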
>
>> For cgroup_iter, the following is the current workflow:
>> start -> not NULL -> show -> next -> NULL -> stop
>> or
>> start -> NULL -> stop
>>
>> So for cgroup_iter_seq_stop(), the input parameter 'v' will be NULL, so
>> cgroup_put() is not actually called, i.e., the corresponding cgroup is
>> not freed.
>>
>> There are two ways to fix the issue:
>> . call cgroup_put() in next() before returning NULL. This way,
>> stop() will be a noop.
>> . put cgroup_get_from_id() and cgroup_put() in
>> bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
>>
>> I prefer the second approach as it is cleaner.
>>
>
> I think the current approach is also not safe if the cgroup_id gets reused, right? I.e.
> it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> is less likely to occur, but since it wraps around to find a free ID it might
> not be theoretical.
As Alexei mentioned, the cgroup id is 64-bit, so a collision should
be nearly impossible. Another option is to get an fd from
the cgroup path and send the fd to the kernel. That would probably
work.
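The userspace side of the fd variant would be straightforward
(untested sketch; the cgroup_fd field is hypothetical, this patch
currently defines cgroup_id in union bpf_iter_link_info):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int attach_cgroup_iter(int prog_fd, const char *cg_path)
{
	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts);
	union bpf_iter_link_info linfo = {};
	int cg_fd, ret;

	cg_fd = open(cg_path, O_RDONLY | O_CLOEXEC);
	if (cg_fd < 0)
		return -errno;

	linfo.cgroup.cgroup_fd = cg_fd;	/* hypothetical field */
	opts.iter_info = &linfo;
	opts.iter_info_len = sizeof(linfo);

	ret = bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, &opts);
	close(cg_fd);	/* the link would hold its own reference */
	return ret;
}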
>
>>>
>>>> +
>>>> +static const struct seq_operations cgroup_iter_seq_ops = {
>>>> + .start = cgroup_iter_seq_start,
>>>> + .next = cgroup_iter_seq_next,
>>>> + .stop = cgroup_iter_seq_stop,
>>>> + .show = cgroup_iter_seq_show,
>>>> +};
>>>> +
>>>> +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
>>>> +
>>>> +static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
>>>> +{
>>>> + *(u64 *)priv_data = aux->cgroup_id;
>>>> + return 0;
>>>> +}
>>>> +
[...]
On Thu, Mar 03, 2022 at 01:03:57PM IST, Yonghong Song wrote:
> [...]
> >
> > Right, I was wondering whether it should call seq_show for the v == NULL case.
> > All other iterators seem to do so; it's a bit different here since it is only
> > iterating over a single cgroup, I guess, but it would be nice to have some
> > consistency.
>
> You are correct; I think it is okay since it only iterates over one
> cgroup. This is different from the other cases so far, where more
> than one object may be traversed. We may have other future use cases
> along the same lines, e.g., a single task. I think we can abstract out
> the start()/next()/stop() callbacks for such use cases. So it is okay
> that this differs from the existing iterators, since they are indeed
> different.
>
> >
> > > For cgroup_iter, the following is the current workflow:
> > > start -> not NULL -> show -> next -> NULL -> stop
> > > or
> > > start -> NULL -> stop
> > >
> > > So for cgroup_iter_seq_stop(), the input parameter 'v' will be NULL, so
> > > cgroup_put() is not actually called, i.e., the corresponding cgroup is
> > > not freed.
> > >
> > > There are two ways to fix the issue:
> > > . call cgroup_put() in next() before returning NULL. This way,
> > > stop() will be a noop.
> > > . put cgroup_get_from_id() and cgroup_put() in
> > > bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
> > >
> > > I prefer the second approach as it is cleaner.
> > >
> >
> > I think the current approach is also not safe if the cgroup_id gets reused, right? I.e.
> > it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> > be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> > is less likely to occur, but since it wraps around to find a free ID it might
> > not be theoretical.
>
> As Alexei mentioned, the cgroup id is 64-bit, so a collision should
> be nearly impossible. Another option is to get an fd from
> the cgroup path and send the fd to the kernel. That would probably
> work.
>
I see, even on 32-bit systems the actual id is 64-bit.
As for cgroup fd vs. id, existing cgroup BPF programs seem to take an fd, and
the map iter also takes a map fd, so it might make sense to use a cgroup fd
here as well.
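The kernel side could then pin the exact cgroup at attach time, which
also sidesteps the id-reuse concern entirely. Roughly (untested; both
the cgroup_fd field and the aux->cgroup pointer are hypothetical
changes on top of this patch):

static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
				  union bpf_iter_link_info *linfo,
				  struct bpf_iter_aux_info *aux)
{
	struct cgroup *cgrp;

	cgrp = cgroup_get_from_fd(linfo->cgroup.cgroup_fd);
	if (IS_ERR(cgrp))
		return PTR_ERR(cgrp);

	/* The reference is held until detach, so the iter can never
	 * observe a recycled id.
	 */
	aux->cgroup = cgrp;
	return 0;
}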
> [...]
--
Kartikeya
Thanks Yonghong,
On Wed, Mar 2, 2022 at 2:00 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 2/25/22 3:43 PM, Hao Luo wrote:
> > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > be parameterized by a cgroup id and prints only that cgroup. So one
> > needs to specify a target cgroup id when attaching this iter.
> >
> > The target cgroup's state can be read out via a link of this iter.
> > Typically, we can monitor cgroup creation and deletion using sleepable
> > tracing and use it to create corresponding directories in bpffs and pin
> > a cgroup id parameterized link in the directory. Then we can read the
> > auto-pinned iter link to get cgroup's state. The output of the iter link
> > is determined by the program. See the selftest test_cgroup_stats.c for
> > an example.
> >
> > Signed-off-by: Hao Luo <[email protected]>
> > ---
> > include/linux/bpf.h | 1 +
> > include/uapi/linux/bpf.h | 6 ++
> > kernel/bpf/Makefile | 2 +-
> > kernel/bpf/cgroup_iter.c | 141 +++++++++++++++++++++++++++++++++
> > tools/include/uapi/linux/bpf.h | 6 ++
> > 5 files changed, 155 insertions(+), 1 deletion(-)
> > create mode 100644 kernel/bpf/cgroup_iter.c
> >
[...]
> > +static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
> > + .seq_ops = &cgroup_iter_seq_ops,
> > + .init_seq_private = cgroup_iter_seq_init,
> > + .fini_seq_private = cgroup_iter_seq_fini,
>
> Since cgroup_iter_seq_fini() is a nop, you can just have
> .fini_seq_private = NULL,
>
Sounds good. It looks weird to have .init without .fini, which may
indicate a bug somewhere; the same goes for .attach and .detach. I see
that you pointed out a bug in a follow-up reply and that the fix pairs
attach with detach. That explains it. :)
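For v2, I'll drop the noop callbacks, i.e., something like (sketch;
assuming the seq private data stays a bare u64 cgroup id):

static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
	.seq_ops		= &cgroup_iter_seq_ops,
	.init_seq_private	= cgroup_iter_seq_init,
	.fini_seq_private	= NULL,	/* nothing to tear down */
	.seq_priv_size		= sizeof(u64),
};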
> > +void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> > + struct seq_file *seq)
> > +{
> > + char buf[64] = {0};
>
> Is this 64 the maximum possible cgroup path length?
> If there is a macro for that, I think it would be good to use it.
>
64 is something I made up. There is a macro for the maximum path
length; let me use that in v2.
> > +
> > + cgroup_path_from_kernfs_id(aux->cgroup_id, buf, sizeof(buf));
>
> cgroup_path_from_kernfs_id() might fail in which case, buf will be 0.
> and cgroup_path will be nothing. I guess this might be the expected
> result. I might be good to add a comment to clarify in the code.
>
No problem.
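Something like this for v2, then (untested sketch in cgroup_iter.c;
assumes PATH_MAX is the right bound, moves the buffer off the stack
via linux/slab.h, and also switches the format to %llu since the id
is a u64):

void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
				 struct seq_file *seq)
{
	char *buf;

	buf = kzalloc(PATH_MAX, GFP_KERNEL);
	if (!buf)
		return;

	/* If cgroup_path_from_kernfs_id() fails (e.g. the cgroup is
	 * already gone), buf stays an empty string and cgroup_path
	 * prints nothing, which is the intended behavior.
	 */
	cgroup_path_from_kernfs_id(aux->cgroup_id, buf, PATH_MAX);

	seq_printf(seq, "cgroup_id:\t%llu\n", aux->cgroup_id);
	seq_printf(seq, "cgroup_path:\t%s\n", buf);
	kfree(buf);
}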
>
> > + seq_printf(seq, "cgroup_id:\t%lu\n", aux->cgroup_id);
> > + seq_printf(seq, "cgroup_path:\t%s\n", buf);
> > +}
> > +
> > +int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
> > + struct bpf_link_info *info)
> > +{
> > + info->iter.cgroup.cgroup_id = aux->cgroup_id;
> > + return 0;
> > +}
> > +
> > +DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
> > + struct cgroup *cgroup)
> > +
> > +static struct bpf_iter_reg bpf_cgroup_reg_info = {
> > + .target = "cgroup",
> > + .attach_target = bpf_iter_attach_cgroup,
> > + .detach_target = bpf_iter_detach_cgroup,
>
> The same here, since bpf_iter_detach_cgroup() is a nop,
> you can replace it with NULL in the above.
>
> > + .show_fdinfo = bpf_iter_cgroup_show_fdinfo,
> > + .fill_link_info = bpf_iter_cgroup_fill_link_info,
> > + .ctx_arg_info_size = 1,
> > + .ctx_arg_info = {
> > + { offsetof(struct bpf_iter__cgroup, cgroup),
> > + PTR_TO_BTF_ID },
> > + },
> > + .seq_info = &cgroup_iter_seq_info,
> > +};
> > +
> > +static int __init bpf_cgroup_iter_init(void)
> > +{
> > + bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
> > + return bpf_iter_reg_target(&bpf_cgroup_reg_info);
> > +}
> > +
> [...]
On Wed, Mar 2, 2022 at 11:34 PM Yonghong Song <[email protected]> wrote:
>
>
>
> On 3/2/22 7:03 PM, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
> >>
> >>
> >> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> >>> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> >>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> >>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> >>>> be parameterized by a cgroup id and prints only that cgroup. So one
> >>>> needs to specify a target cgroup id when attaching this iter.
> >>>>
> >>>> The target cgroup's state can be read out via a link of this iter.
> >>>> Typically, we can monitor cgroup creation and deletion using sleepable
> >>>> tracing and use it to create corresponding directories in bpffs and pin
> >>>> a cgroup id parameterized link in the directory. Then we can read the
> >>>> auto-pinned iter link to get cgroup's state. The output of the iter link
> >>>> is determined by the program. See the selftest test_cgroup_stats.c for
> >>>> an example.
> >>>>
> >>>> Signed-off-by: Hao Luo <[email protected]>
> >>>> ---
> >>>> include/linux/bpf.h | 1 +
> >>>> include/uapi/linux/bpf.h | 6 ++
> >>>> kernel/bpf/Makefile | 2 +-
> >>>> kernel/bpf/cgroup_iter.c | 141 +++++++++++++++++++++++++++++++++
> >>>> tools/include/uapi/linux/bpf.h | 6 ++
> >>>> 5 files changed, 155 insertions(+), 1 deletion(-)
> >>>> create mode 100644 kernel/bpf/cgroup_iter.c
[...]
> >>>
> >>> I think in the existing iterators we make a final call to seq_show with v as
> >>> NULL; is there a specific reason to do it differently for this one? There is
> >>> logic in bpf_iter.c to trigger the ->stop() callback again when ->start() or
> >>> ->next() returns NULL, to execute the BPF program with a NULL p; see the
> >>> comment above the stop label.
> >>>
> >>> If you do add the seq_show call with NULL, you'd also need to change the
> >>> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
> >>
> >> Kumar, PTR_TO_BTF_ID should be okay since show() is never invoked with a NULL
> >> cgroup. But we do have an issue with cgroup_iter_seq_stop(), which I missed
> >> earlier.
> >>
> >
> > Right, I was wondering whether it should call seq_show for the v == NULL case.
> > All other iterators seem to do so; it's a bit different here since it is only
> > iterating over a single cgroup, I guess, but it would be nice to have some
> > consistency.
>
> You are correct; I think it is okay since it only iterates over one
> cgroup. This is different from the other cases so far, where more
> than one object may be traversed. We may have other future use cases
> along the same lines, e.g., a single task. I think we can abstract out
> the start()/next()/stop() callbacks for such use cases. So it is okay
> that this differs from the existing iterators, since they are indeed
> different.
>
Right. This iter is special: it has a single element, so we don't
really need a preamble and epilogue; those can be coded directly in
the iter program. We can also guarantee that the cgroup passed in is
always valid, otherwise we wouldn't invoke show(). So passing
PTR_TO_BTF_ID is fine. I did it that way mainly to save a NULL check
inside the prog.
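Concretely, it lets the iter prog dereference the cgroup directly
(sketch; assumes vmlinux.h generated from a kernel with this patch
applied, so that struct bpf_iter__cgroup exists):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/cgroup")
int dump_cgroup_id(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;

	/* No NULL check needed: with PTR_TO_BTF_ID (rather than
	 * PTR_TO_BTF_ID_OR_NULL), the verifier guarantees cgrp is a
	 * valid pointer whenever the prog runs.
	 */
	BPF_SEQ_PRINTF(ctx->meta->seq, "cgroup_id: %llu\n", cgrp->kn->id);
	return 0;
}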
> >
> >> For cgroup_iter, the following is the current workflow:
> >> start -> not NULL -> show -> next -> NULL -> stop
> >> or
> >> start -> NULL -> stop
> >>
> >> So for cgroup_iter_seq_stop(), the input parameter 'v' will be NULL, so
> >> cgroup_put() is not actually called, i.e., the corresponding cgroup is
> >> not freed.
> >>
> >> There are two ways to fix the issue:
> >> . call cgroup_put() in next() before returning NULL. This way,
> >> stop() will be a noop.
> >> . put cgroup_get_from_id() and cgroup_put() in
> >> bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
> >>
> >> I prefer the second approach as it is cleaner.
> >>
Yeah, the second approach should be fine. I was thinking of holding
the cgroup's reference only while we are actually reading, so that a
cgroup can go away at any time and this iter grabs a reference only on
a best-effort basis. With the second approach, a reference is held
from attach to detach, but I think that should be fine. Let me test.
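For reference, this is roughly what the second approach looks like
(untested sketch; assumes bpf_iter_aux_info grows a struct cgroup
*cgroup field in place of cgroup_id):

static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
				  union bpf_iter_link_info *linfo,
				  struct bpf_iter_aux_info *aux)
{
	struct cgroup *cgrp;

	cgrp = cgroup_get_from_id(linfo->cgroup.cgroup_id);
	if (!cgrp)
		return -ENOENT;

	aux->cgroup = cgrp;	/* reference held for the link's lifetime */
	return 0;
}

static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
{
	cgroup_put(aux->cgroup);
}

start() can then hand out the already-pinned cgroup, and stop() no
longer needs to call cgroup_put() at all.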
> >
> > I think the current approach is also not safe if the cgroup_id gets reused, right? I.e.
> > it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> > be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> > is less likely to occur, but since it wraps around to find a free ID it might
> > not be theoretical.
>
> As Alexei mentioned, the cgroup id is 64-bit, so a collision should
> be nearly impossible. Another option is to get an fd from
> the cgroup path and send the fd to the kernel. That would probably
> work.
>
A 64-bit cgroup id should be fine. Going through a cgroup path and fd
is unnecessarily complicated, IMHO.
> [...]