Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
to FDs, that's intended for BPF's security model[1]. Not only does it
prevent non-privilidged users from getting other users' bpf program, but
also it prevents the user from iterating his own bpf objects.
In container environment, some users want to run bpf programs in their
containers. These users can run their bpf programs under CAP_BPF and
some other specific CAPs, but they can't inspect their bpf programs in a
generic way. For example, the bpftool can't be used as it requires
CAP_SYS_ADMIN. That is very inconvenient.
Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
which is not created by the process itself is with SCM_RIGHTS, that
requires each processes which created bpf object has to implement a unix
domain socket to share the fd of a bpf object between different
processes, that is really trivial and troublesome.
Hence we need a better mechanism to get bpf object info without
CAP_SYS_ADMIN.
BPF namespace is introduced in this patchset with an attempt to remove
the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
link in a specific bpf namespace, then these bpf objects will not be
visible to the users in a different bpf namespace. But these bpf
objects are visible to its parent bpf namespace, so the sys admin can
still iterate and inspect them.
BPF namespace is similar to PID namespace, and the bpf objects are
similar to tasks, so BPF namespace is very easy to understand. These
patchset only implements BPF namespace for bpf map, prog and link. In the
future we may extend it to other bpf objects like btf, bpffs and etc.
For example, we can allow some of the BTF objects to be used in
non-init bpf namespace, then the container user can only trace the
processes running in his container, but can't get the information of
tasks running in other containers.
A simple example is introduced into selftests/bpf on how to use the bpf
namespace.
Putting bpf map, prog and link into bpf namespace is the first step.
Let's start with it.
[1]. https://lore.kernel.org/bpf/[email protected]/
Yafang Shao (13):
fork: New clone3 flag for BPF namespace
proc_ns: Extend the field type in struct proc_ns_operations to long
bpf: Implement bpf namespace
bpf: No need to check if id is 0
bpf: Make bpf objects id have the same alloc and free pattern
bpf: Helpers to alloc and free object id in bpf namespace
bpf: Add bpf helper to get bpf object id
bpf: Alloc and free bpf_map id in bpf namespace
bpf: Alloc and free bpf_prog id in bpf namespace
bpf: Alloc and free bpf_link id in bpf namespace
bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace
bpf: Use bpf_idr_lock array instead
selftests/bpf: Add selftest for bpf namespace
fs/proc/namespaces.c | 4 +
include/linux/bpf.h | 9 +-
include/linux/bpf_namespace.h | 88 ++++++++++
include/linux/nsproxy.h | 4 +
include/linux/proc_ns.h | 3 +-
include/linux/user_namespace.h | 1 +
include/uapi/linux/bpf.h | 7 +
include/uapi/linux/sched.h | 1 +
kernel/bpf/Makefile | 1 +
kernel/bpf/bpf_namespace.c | 283 ++++++++++++++++++++++++++++++
kernel/bpf/offload.c | 16 +-
kernel/bpf/syscall.c | 262 ++++++++++-----------------
kernel/bpf/task_iter.c | 12 ++
kernel/fork.c | 5 +-
kernel/nsproxy.c | 19 +-
kernel/trace/bpf_trace.c | 2 +
kernel/ucount.c | 1 +
tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 13 +-
tools/include/uapi/linux/bpf.h | 7 +
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/test_bpfns.c | 76 ++++++++
21 files changed, 637 insertions(+), 180 deletions(-)
create mode 100644 include/linux/bpf_namespace.h
create mode 100644 kernel/bpf/bpf_namespace.c
create mode 100644 tools/testing/selftests/bpf/test_bpfns.c
--
1.8.3.1
In struct proc_ns_operations, the field 'type' is the new namespace
clone flag. As the newly introduced CLONE_NEWBPF is more than 32bit,
we need also extend this field from int to long to adapt to this change.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/proc_ns.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 75807ec..555c257 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -16,7 +16,7 @@
struct proc_ns_operations {
const char *name;
const char *real_ns_name;
- int type;
+ long type;
struct ns_common *(*get)(struct task_struct *task);
void (*put)(struct ns_common *ns);
int (*install)(struct nsset *nsset, struct ns_common *ns);
--
1.8.3.1
It is similar with pid namespace. When we create a new bpf object in a
child BPF namespace, it will alloc the id in current BPF namespace and
its parent BPF namespace. The hierarchy as follows,
init_bpf_ns : level = 0
/ \
child_a child_b : level = 1
/ \
child_b_a child_b_b : level = 2
When we create a bpf object in child_bb, it will allocate IDs for this
object in child_bb, child_b and the init_bpf_ns.
We will allocate the id for bpf_map, bpf_prog and bpf_link in bpf
namespace.
Signed-off-by: Yafang Shao <[email protected]>
---
fs/proc/namespaces.c | 4 +
include/linux/bpf_namespace.h | 46 +++++++++
include/linux/nsproxy.h | 4 +
include/linux/proc_ns.h | 1 +
include/linux/user_namespace.h | 1 +
kernel/bpf/Makefile | 1 +
kernel/bpf/bpf_namespace.c | 219 +++++++++++++++++++++++++++++++++++++++++
kernel/nsproxy.c | 19 +++-
kernel/ucount.c | 1 +
9 files changed, 294 insertions(+), 2 deletions(-)
create mode 100644 include/linux/bpf_namespace.h
create mode 100644 kernel/bpf/bpf_namespace.c
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8e159fc..1a36757 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -9,6 +9,7 @@
#include <linux/ipc_namespace.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/bpf_namespace.h>
#include "internal.h"
@@ -37,6 +38,9 @@
&timens_operations,
&timens_for_children_operations,
#endif
+#ifdef CONFIG_BPF
+ &bpfns_operations,
+#endif
};
static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/bpf_namespace.h b/include/linux/bpf_namespace.h
new file mode 100644
index 0000000..06aa51f
--- /dev/null
+++ b/include/linux/bpf_namespace.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPF_ID_NS_H
+#define _LINUX_BPF_ID_NS_H
+#include <linux/types.h>
+#include <linux/idr.h>
+#include <linux/ns_common.h>
+#include <linux/user_namespace.h>
+
+struct ubpf_obj_id {
+ int nr;
+ struct bpf_namespace *ns;
+};
+
+struct bpf_obj_id {
+ refcount_t count;
+ unsigned int level;
+ struct rcu_head rcu;
+ struct ubpf_obj_id numbers[1];
+};
+
+enum {
+ MAP_OBJ_ID = 0,
+ PROG_OBJ_ID,
+ LINK_OBJ_ID,
+ OBJ_ID_NUM,
+};
+
+struct bpf_namespace {
+ struct idr idr[OBJ_ID_NUM];
+ struct rcu_head rcu;
+ int level;
+ struct ns_common ns;
+ struct user_namespace *user_ns;
+ struct kmem_cache *obj_id_cachep;
+ struct bpf_namespace *parent;
+ struct ucounts *ucounts;
+};
+
+extern struct bpf_namespace init_bpf_ns;
+extern struct proc_ns_operations bpfns_operations;
+
+struct bpf_namespace *copy_bpfns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct bpf_namespace *old_ns);
+void put_bpfns(struct bpf_namespace *ns);
+#endif /* _LINUX_BPF_ID_NS_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index fee881c..d24ab6b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -10,6 +10,9 @@
struct ipc_namespace;
struct pid_namespace;
struct cgroup_namespace;
+#ifdef CONFIG_BPF
+struct bpf_namespace;
+#endif
struct fs_struct;
/*
@@ -38,6 +41,7 @@ struct nsproxy {
struct time_namespace *time_ns;
struct time_namespace *time_ns_for_children;
struct cgroup_namespace *cgroup_ns;
+ struct bpf_namespace *bpf_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 555c257..c10ce2c 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -46,6 +46,7 @@ enum {
PROC_PID_INIT_INO = 0xEFFFFFFCU,
PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
PROC_TIME_INIT_INO = 0xEFFFFFFAU,
+ PROC_BPF_INIT_INO = 0xEFFFFFF9U,
};
#ifdef CONFIG_PROC_FS
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 45f09be..93eb618 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -54,6 +54,7 @@ enum ucount_type {
UCOUNT_FANOTIFY_GROUPS,
UCOUNT_FANOTIFY_MARKS,
#endif
+ UCOUNT_BPF_NAMESPACES,
UCOUNT_COUNTS,
};
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 0224261..828aef0 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -44,3 +44,4 @@ obj-$(CONFIG_BPF_PRELOAD) += preload/
obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
$(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE
$(call if_changed_rule,cc_o_c)
+obj-$(CONFIG_BPF_SYSCALL) += bpf_namespace.o
diff --git a/kernel/bpf/bpf_namespace.c b/kernel/bpf/bpf_namespace.c
new file mode 100644
index 0000000..88a86cd
--- /dev/null
+++ b/kernel/bpf/bpf_namespace.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/ns_common.h>
+#include <linux/syscalls.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/proc_ns.h>
+#include <linux/sched/task.h>
+#include <linux/sched/signal.h>
+#include <linux/idr.h>
+#include <linux/user_namespace.h>
+#include <linux/bpf_namespace.h>
+
+#define MAX_BPF_NS_LEVEL 32
+static struct kmem_cache *bpfns_cachep;
+static struct kmem_cache *obj_id_cache[MAX_PID_NS_LEVEL];
+static struct ns_common *bpfns_get(struct task_struct *task);
+static void bpfns_put(struct ns_common *ns);
+static struct kmem_cache *create_bpf_cachep(unsigned int level);
+static DEFINE_MUTEX(obj_id_caches_mutex);
+
+static int bpfns_install(struct nsset *nsset, struct ns_common *ns)
+{
+ pr_info("setns not supported for bpf namespace");
+ return -EOPNOTSUPP;
+}
+
+struct proc_ns_operations bpfns_operations = {
+ .name = "bpf",
+ .type = CLONE_NEWBPF,
+ .get = bpfns_get,
+ .put = bpfns_put,
+ .install = bpfns_install,
+};
+
+struct bpf_namespace init_bpf_ns = {
+ .level = 0,
+ .user_ns = &init_user_ns,
+ .ns.ops = &bpfns_operations,
+ .ns.inum = PROC_BPF_INIT_INO,
+};
+
+static struct bpf_namespace *get_bpfns(struct bpf_namespace *ns)
+{
+ if (ns != &init_bpf_ns)
+ refcount_inc(&ns->ns.count);
+ return ns;
+}
+
+static struct ns_common *bpfns_get(struct task_struct *task)
+{
+ struct ns_common *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ rcu_read_lock();
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = &nsproxy->bpf_ns->ns;
+ get_bpfns(container_of(ns, struct bpf_namespace, ns));
+ }
+ rcu_read_unlock();
+ return ns;
+}
+
+static struct ucounts *inc_bpf_namespaces(struct user_namespace *ns)
+{
+ return inc_ucount(ns, current_euid(), UCOUNT_BPF_NAMESPACES);
+}
+
+static void dec_bpf_namespaces(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_BPF_NAMESPACES);
+}
+
+static void delayed_free_bpfns(struct rcu_head *p)
+{
+ struct bpf_namespace *ns = container_of(p, struct bpf_namespace, rcu);
+
+ dec_bpf_namespaces(ns->ucounts);
+ put_user_ns(ns->user_ns);
+ kmem_cache_free(bpfns_cachep, ns);
+}
+
+static void destroy_bpf_namespace(struct bpf_namespace *ns)
+{
+ int i;
+
+ ns_free_inum(&ns->ns);
+ for (i = 0; i < OBJ_ID_NUM; i++)
+ idr_destroy(&ns->idr[i]);
+ call_rcu(&ns->rcu, delayed_free_bpfns);
+}
+
+void put_bpfns(struct bpf_namespace *ns)
+{
+ struct bpf_namespace *parent;
+
+ while (ns != &init_bpf_ns) {
+ parent = ns->parent;
+ if (!refcount_dec_and_test(&ns->ns.count))
+ break;
+ destroy_bpf_namespace(ns);
+ ns = parent;
+ }
+}
+
+static void bpfns_put(struct ns_common *ns)
+{
+ struct bpf_namespace *bpf_ns;
+
+ bpf_ns = container_of(ns, struct bpf_namespace, ns);
+ put_bpfns(bpf_ns);
+}
+
+static struct bpf_namespace *
+create_bpf_namespace(struct user_namespace *user_ns,
+ struct bpf_namespace *parent_bpfns)
+{
+ struct bpf_namespace *ns;
+ unsigned int level = parent_bpfns->level + 1;
+ struct ucounts *ucounts;
+ int err;
+ int i;
+
+ err = -EINVAL;
+ if (!in_userns(parent_bpfns->user_ns, user_ns))
+ goto out;
+
+ err = -ENOSPC;
+ if (level > MAX_BPF_NS_LEVEL)
+ goto out;
+ ucounts = inc_bpf_namespaces(user_ns);
+ if (!ucounts)
+ goto out;
+
+ err = -ENOMEM;
+ ns = kmem_cache_zalloc(bpfns_cachep, GFP_KERNEL);
+ if (!ns)
+ goto out_dec;
+
+ for (i = 0; i < OBJ_ID_NUM; i++)
+ idr_init(&ns->idr[i]);
+
+ ns->obj_id_cachep = create_bpf_cachep(level);
+ if (!ns->obj_id_cachep)
+ goto out_free_idr;
+
+ err = ns_alloc_inum(&ns->ns);
+ if (err)
+ goto out_free_idr;
+ ns->ns.ops = &bpfns_operations;
+
+ refcount_set(&ns->ns.count, 1);
+ ns->level = level;
+ ns->parent = get_bpfns(parent_bpfns);
+ ns->user_ns = get_user_ns(user_ns);
+ ns->ucounts = ucounts;
+ return ns;
+
+out_free_idr:
+ for (i = 0; i < OBJ_ID_NUM; i++)
+ idr_destroy(&ns->idr[i]);
+ kmem_cache_free(bpfns_cachep, ns);
+out_dec:
+ dec_bpf_namespaces(ucounts);
+out:
+ return ERR_PTR(err);
+}
+
+struct bpf_namespace *copy_bpfns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct bpf_namespace *old_ns)
+{
+ if (!(flags & CLONE_NEWBPF))
+ return get_bpfns(old_ns);
+ return create_bpf_namespace(user_ns, old_ns);
+}
+
+static struct kmem_cache *create_bpf_cachep(unsigned int level)
+{
+ /* Level 0 is init_bpf_ns.obj_id_cachep */
+ struct kmem_cache **pkc = &obj_id_cache[level - 1];
+ struct kmem_cache *kc;
+ char name[4 + 10 + 1];
+ unsigned int len;
+
+ kc = READ_ONCE(*pkc);
+ if (kc)
+ return kc;
+
+ snprintf(name, sizeof(name), "bpf_%u", level + 1);
+ len = sizeof(struct bpf_obj_id) + level * sizeof(struct ubpf_obj_id);
+ mutex_lock(&obj_id_caches_mutex);
+ /* Name collision forces to do allocation under mutex. */
+ if (!*pkc)
+ *pkc = kmem_cache_create(name, len, 0,
+ SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
+ mutex_unlock(&obj_id_caches_mutex);
+ /* current can fail, but someone else can succeed. */
+ return READ_ONCE(*pkc);
+}
+
+static void __init bpfns_idr_init(void)
+{
+ int i;
+
+ init_bpf_ns.obj_id_cachep =
+ KMEM_CACHE(pid, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
+ for (i = 0; i < OBJ_ID_NUM; i++)
+ idr_init(&init_bpf_ns.idr[i]);
+}
+
+static __init int bpf_namespaces_init(void)
+{
+ bpfns_cachep = KMEM_CACHE(bpf_namespace, SLAB_PANIC | SLAB_ACCOUNT);
+ bpfns_idr_init();
+ return 0;
+}
+
+late_initcall(bpf_namespaces_init);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index a487ff2..6a6fa70 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -19,6 +19,7 @@
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
#include <linux/time_namespace.h>
+#include <linux/bpf_namespace.h>
#include <linux/fs_struct.h>
#include <linux/proc_fs.h>
#include <linux/proc_ns.h>
@@ -26,6 +27,7 @@
#include <linux/syscalls.h>
#include <linux/cgroup.h>
#include <linux/perf_event.h>
+#include <linux/bpf_namespace.h>
static struct kmem_cache *nsproxy_cachep;
@@ -47,6 +49,9 @@ struct nsproxy init_nsproxy = {
.time_ns = &init_time_ns,
.time_ns_for_children = &init_time_ns,
#endif
+#ifdef CONFIG_BPF
+ .bpf_ns = &init_bpf_ns,
+#endif
};
static inline struct nsproxy *create_nsproxy(void)
@@ -121,8 +126,16 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
}
new_nsp->time_ns = get_time_ns(tsk->nsproxy->time_ns);
+ new_nsp->bpf_ns = copy_bpfns(flags, user_ns, tsk->nsproxy->bpf_ns);
+ if (IS_ERR(new_nsp->bpf_ns)) {
+ err = PTR_ERR(new_nsp->bpf_ns);
+ goto out_bpf;
+ }
return new_nsp;
+out_bpf:
+ put_time_ns(new_nsp->time_ns);
+ put_time_ns(new_nsp->time_ns_for_children);
out_time:
put_net(new_nsp->net_ns);
out_net:
@@ -156,7 +169,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
- CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
+ CLONE_NEWCGROUP | CLONE_NEWTIME | CLONE_NEWBPF)))) {
if ((flags & CLONE_VM) ||
likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
get_nsproxy(old_ns);
@@ -203,6 +216,8 @@ void free_nsproxy(struct nsproxy *ns)
put_time_ns(ns->time_ns_for_children);
put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
+ if (ns->bpf_ns)
+ put_bpfns(ns->bpf_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -218,7 +233,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
- CLONE_NEWTIME)))
+ CLONE_NEWTIME | CLONE_NEWBPF)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
diff --git a/kernel/ucount.c b/kernel/ucount.c
index ee8e57f..97e0ae3 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -87,6 +87,7 @@ static int set_permissions(struct ctl_table_header *head,
UCOUNT_ENTRY("max_fanotify_groups"),
UCOUNT_ENTRY("max_fanotify_marks"),
#endif
+ UCOUNT_ENTRY("max_bpf_namespaces"),
{ }
};
#endif /* CONFIG_SYSCTL */
--
1.8.3.1
Make them have the same patthern, then we can use the generic helpers
instead.
Signed-off-by: Yafang Shao <[email protected]>
---
kernel/bpf/offload.c | 15 +++++++++++--
kernel/bpf/syscall.c | 62 ++++++++++++++++++++++++----------------------------
2 files changed, 41 insertions(+), 36 deletions(-)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index d9c9f45..aec70e0 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -134,9 +134,20 @@ static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap,
static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
{
+ struct bpf_map *map = &offmap->map;
+
WARN_ON(bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_FREE));
- /* Make sure BPF_MAP_GET_NEXT_ID can't find this dead map */
- bpf_map_free_id(&offmap->map);
+ /* Make sure BPF_MAP_GET_NEXT_ID can't find this dead map.
+ *
+ * Offloaded maps are removed from the IDR store when their device
+ * disappears - even if someone holds an fd to them they are unusable,
+ * the memory is gone, all ops will fail; they are simply waiting for
+ * refcnt to drop to be freed.
+ */
+ if (map->id) {
+ bpf_map_free_id(map);
+ map->id = 0;
+ }
list_del_init(&offmap->offloads);
offmap->netdev = NULL;
}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f3664f2..ee1297d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -382,30 +382,19 @@ static int bpf_map_alloc_id(struct bpf_map *map)
idr_preload(GFP_KERNEL);
spin_lock_bh(&map_idr_lock);
id = idr_alloc_cyclic(&map_idr, map, 1, INT_MAX, GFP_ATOMIC);
- if (id > 0)
- map->id = id;
spin_unlock_bh(&map_idr_lock);
idr_preload_end();
- return id > 0 ? 0 : id;
+ return id;
}
void bpf_map_free_id(struct bpf_map *map)
{
unsigned long flags;
- /* Offloaded maps are removed from the IDR store when their device
- * disappears - even if someone holds an fd to them they are unusable,
- * the memory is gone, all ops will fail; they are simply waiting for
- * refcnt to drop to be freed.
- */
- if (!map->id)
- return;
-
spin_lock_irqsave(&map_idr_lock, flags);
idr_remove(&map_idr, map->id);
- map->id = 0;
spin_unlock_irqrestore(&map_idr_lock, flags);
}
@@ -748,8 +737,11 @@ static void bpf_map_put_uref(struct bpf_map *map)
void bpf_map_put(struct bpf_map *map)
{
if (atomic64_dec_and_test(&map->refcnt)) {
- /* bpf_map_free_id() must be called first */
- bpf_map_free_id(map);
+ /* bpf_map_free_id() must be called first. */
+ if (map->id) {
+ bpf_map_free_id(map);
+ map->id = 0;
+ }
btf_put(map->btf);
INIT_WORK(&map->work, bpf_map_free_deferred);
/* Avoid spawning kworkers, since they all might contend
@@ -1215,8 +1207,9 @@ static int map_create(union bpf_attr *attr)
goto free_map_field_offs;
err = bpf_map_alloc_id(map);
- if (err)
+ if (err < 0)
goto free_map_sec;
+ map->id = err;
bpf_map_save_memcg(map);
@@ -2024,29 +2017,18 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog)
idr_preload(GFP_KERNEL);
spin_lock_bh(&prog_idr_lock);
id = idr_alloc_cyclic(&prog_idr, prog, 1, INT_MAX, GFP_ATOMIC);
- if (id > 0)
- prog->aux->id = id;
spin_unlock_bh(&prog_idr_lock);
idr_preload_end();
- return id > 0 ? 0 : id;
+ return id;
}
void bpf_prog_free_id(struct bpf_prog *prog)
{
unsigned long flags;
- /* cBPF to eBPF migrations are currently not in the idr store.
- * Offloaded programs are removed from the store when their device
- * disappears - even if someone grabs an fd to them they are unusable,
- * simply waiting for refcnt to drop to be freed.
- */
- if (!prog->aux->id)
- return;
-
spin_lock_irqsave(&prog_idr_lock, flags);
idr_remove(&prog_idr, prog->aux->id);
- prog->aux->id = 0;
spin_unlock_irqrestore(&prog_idr_lock, flags);
}
@@ -2091,7 +2073,15 @@ static void bpf_prog_put_deferred(struct work_struct *work)
prog = aux->prog;
perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0);
bpf_audit_prog(prog, BPF_AUDIT_UNLOAD);
- bpf_prog_free_id(prog);
+ /* cBPF to eBPF migrations are currently not in the idr store.
+ * Offloaded programs are removed from the store when their device
+ * disappears - even if someone grabs an fd to them they are unusable,
+ * simply waiting for refcnt to drop to be freed.
+ */
+ if (prog->aux->id) {
+ bpf_prog_free_id(prog);
+ prog->aux->id = 0;
+ }
__bpf_prog_put_noref(prog, true);
}
@@ -2655,8 +2645,9 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
goto free_used_maps;
err = bpf_prog_alloc_id(prog);
- if (err)
+ if (err < 0)
goto free_used_maps;
+ prog->aux->id = err;
/* Upon success of bpf_prog_alloc_id(), the BPF prog is
* effectively publicly exposed. However, retrieving via
@@ -2730,9 +2721,6 @@ void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
static void bpf_link_free_id(int id)
{
- if (!id)
- return;
-
spin_lock_bh(&link_idr_lock);
idr_remove(&link_idr, id);
spin_unlock_bh(&link_idr_lock);
@@ -2748,7 +2736,10 @@ static void bpf_link_free_id(int id)
void bpf_link_cleanup(struct bpf_link_primer *primer)
{
primer->link->prog = NULL;
- bpf_link_free_id(primer->id);
+ if (primer->id) {
+ bpf_link_free_id(primer->id);
+ primer->id = 0;
+ }
fput(primer->file);
put_unused_fd(primer->fd);
}
@@ -2761,7 +2752,10 @@ void bpf_link_inc(struct bpf_link *link)
/* bpf_link_free is guaranteed to be called from process context */
static void bpf_link_free(struct bpf_link *link)
{
- bpf_link_free_id(link->id);
+ if (link->id) {
+ bpf_link_free_id(link->id);
+ link->id = 0;
+ }
if (link->prog) {
/* detach BPF program, clean up used resources */
link->ops->release(link);
--
1.8.3.1
Introduce generic helpers to alloc bpf_{map,prog,link} in bpf namespace.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf_namespace.h | 36 ++++++++++++++++++
kernel/bpf/bpf_namespace.c | 86 +++++++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 6 +--
3 files changed, 125 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf_namespace.h b/include/linux/bpf_namespace.h
index 06aa51f..50bd68c 100644
--- a/include/linux/bpf_namespace.h
+++ b/include/linux/bpf_namespace.h
@@ -38,9 +38,45 @@ struct bpf_namespace {
extern struct bpf_namespace init_bpf_ns;
extern struct proc_ns_operations bpfns_operations;
+extern spinlock_t map_idr_lock;
+extern spinlock_t prog_idr_lock;
+extern spinlock_t link_idr_lock;
struct bpf_namespace *copy_bpfns(unsigned long flags,
struct user_namespace *user_ns,
struct bpf_namespace *old_ns);
void put_bpfns(struct bpf_namespace *ns);
+struct bpf_obj_id *bpf_alloc_obj_id(struct bpf_namespace *ns,
+ void *obj, int type);
+void bpf_free_obj_id(struct bpf_obj_id *obj_id, int type);
+
+/*
+ * The helpers to get the bpf_id's id seen from different namespaces
+ *
+ * bpf_id_nr() : global id, i.e. the id seen from the init namespace;
+ * bpf_id_vnr() : virtual id, i.e. the id seen from the pid namespace of
+ * current.
+ * bpf_id_nr_ns() : id seen from the ns specified.
+ *
+ * see also task_xid_nr() etc in include/linux/sched.h
+ */
+static inline int bpf_obj_id_nr(struct bpf_obj_id *obj_id)
+{
+ if (obj_id)
+ return obj_id->numbers[0].nr;
+ return 0;
+}
+
+static inline int bpf_obj_id_nr_ns(struct bpf_obj_id *obj_id,
+ struct bpf_namespace *ns)
+{
+ if (obj_id && ns->level <= obj_id->level)
+ return obj_id->numbers[ns->level].nr;
+ return 0;
+}
+
+static inline int bpf_obj_id_vnr(struct bpf_obj_id *obj_id)
+{
+ return bpf_obj_id_nr_ns(obj_id, current->nsproxy->bpf_ns);
+}
#endif /* _LINUX_BPF_ID_NS_H */
diff --git a/kernel/bpf/bpf_namespace.c b/kernel/bpf/bpf_namespace.c
index 88a86cd..1e98d1d 100644
--- a/kernel/bpf/bpf_namespace.c
+++ b/kernel/bpf/bpf_namespace.c
@@ -217,3 +217,89 @@ static __init int bpf_namespaces_init(void)
}
late_initcall(bpf_namespaces_init);
+
+struct bpf_obj_id *bpf_alloc_obj_id(struct bpf_namespace *ns,
+ void *obj, int type)
+{
+ struct bpf_namespace *tmp = ns;
+ struct bpf_obj_id *obj_id;
+ spinlock_t *idr_lock;
+ unsigned long flags;
+ int id;
+ int i;
+
+ switch (type) {
+ case MAP_OBJ_ID:
+ idr_lock = &map_idr_lock;
+ break;
+ case PROG_OBJ_ID:
+ idr_lock = &prog_idr_lock;
+ break;
+ case LINK_OBJ_ID:
+ idr_lock = &link_idr_lock;
+ break;
+ default:
+ return ERR_PTR(-EINVAL);
+ }
+
+ obj_id = kmem_cache_alloc(ns->obj_id_cachep, GFP_KERNEL);
+ if (!obj_id)
+ return ERR_PTR(-ENOMEM);
+
+ obj_id->level = ns->level;
+ for (i = ns->level; i >= 0; i--) {
+ idr_preload(GFP_KERNEL);
+ spin_lock_bh(idr_lock);
+ id = idr_alloc_cyclic(&tmp->idr[type], obj, 1, INT_MAX, GFP_ATOMIC);
+ spin_unlock_bh(idr_lock);
+ idr_preload_end();
+ if (id < 0)
+ goto out_free;
+ obj_id->numbers[i].nr = id;
+ obj_id->numbers[i].ns = tmp;
+ tmp = tmp->parent;
+ }
+
+ return obj_id;
+
+out_free:
+ for (; i <= ns->level; i++) {
+ tmp = obj_id->numbers[i].ns;
+ spin_lock_irqsave(idr_lock, flags);
+ idr_remove(&tmp->idr[type], obj_id->numbers[i].nr);
+ spin_unlock_irqrestore(idr_lock, flags);
+ }
+ kmem_cache_free(ns->obj_id_cachep, obj_id);
+ return ERR_PTR(id);
+}
+
+void bpf_free_obj_id(struct bpf_obj_id *obj_id, int type)
+{
+ struct bpf_namespace *ns;
+ spinlock_t *idr_lock;
+ unsigned long flags;
+ int i;
+
+ switch (type) {
+ case MAP_OBJ_ID:
+ idr_lock = &map_idr_lock;
+ break;
+ case PROG_OBJ_ID:
+ idr_lock = &prog_idr_lock;
+ break;
+ case LINK_OBJ_ID:
+ idr_lock = &link_idr_lock;
+ break;
+ default:
+ return;
+ }
+ /* Note that the level-0 should be freed at last */
+ for (i = obj_id->level; i >= 0; i--) {
+ spin_lock_irqsave(idr_lock, flags);
+ ns = obj_id->numbers[i].ns;
+ idr_remove(&ns->idr[type], obj_id->numbers[i].nr);
+ spin_unlock_irqrestore(idr_lock, flags);
+ }
+ ns = obj_id->numbers[obj_id->level].ns;
+ kmem_cache_free(ns->obj_id_cachep, obj_id);
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ee1297d..f24e550 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -48,11 +48,11 @@
DEFINE_PER_CPU(int, bpf_prog_active);
static DEFINE_IDR(prog_idr);
-static DEFINE_SPINLOCK(prog_idr_lock);
+DEFINE_SPINLOCK(prog_idr_lock);
static DEFINE_IDR(map_idr);
-static DEFINE_SPINLOCK(map_idr_lock);
+DEFINE_SPINLOCK(map_idr_lock);
static DEFINE_IDR(link_idr);
-static DEFINE_SPINLOCK(link_idr_lock);
+DEFINE_SPINLOCK(link_idr_lock);
int sysctl_unprivileged_bpf_disabled __read_mostly =
IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
--
1.8.3.1
A new bpf helper is introduced to get bpf object id in a tracing bpf
prog.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 7 +++++++
kernel/bpf/task_iter.c | 12 ++++++++++++
kernel/trace/bpf_trace.c | 2 ++
tools/include/uapi/linux/bpf.h | 7 +++++++
5 files changed, 29 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2d8f3f6..c94034a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2867,6 +2867,7 @@ static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_find_obj_id_proto;
const struct bpf_func_proto *tracing_prog_func_proto(
enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e3d3b51..3009877 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5540,6 +5540,12 @@ struct bpf_stack_build_id {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * int bpf_find_obj_id(void *obj_id)
+ * Description
+ * Get bpf object id in current bpf namespace.
+ * Return
+ * bpf object id is returned on success.
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5754,6 +5760,7 @@ struct bpf_stack_build_id {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(find_obj_id, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6..a551743 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -8,6 +8,7 @@
#include <linux/fdtable.h>
#include <linux/filter.h>
#include <linux/btf_ids.h>
+#include <linux/bpf_namespace.h>
#include "mmap_unlock_work.h"
static const char * const iter_task_type_names[] = {
@@ -823,6 +824,17 @@ static void bpf_iter_task_show_fdinfo(const struct bpf_iter_aux_info *aux, struc
.arg5_type = ARG_ANYTHING,
};
+BPF_CALL_1(bpf_find_obj_id, void *, obj_id)
+{
+ return bpf_obj_id_vnr(obj_id);
+}
+
+const struct bpf_func_proto bpf_find_obj_id_proto = {
+ .func = bpf_find_obj_id,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+};
+
DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work);
static void do_mmap_read_unlock(struct irq_work *entry)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index bcf91bc..977bb61 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1510,6 +1510,8 @@ static int __init bpf_key_sig_kfuncs_init(void)
return &bpf_find_vma_proto;
case BPF_FUNC_trace_vprintk:
return bpf_get_trace_vprintk_proto();
+ case BPF_FUNC_find_obj_id:
+ return &bpf_find_obj_id_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d6c5a02..8beacad 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5540,6 +5540,12 @@ struct bpf_stack_build_id {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * int bpf_find_obj_id(void *obj_id)
+ * Description
+ * Get bpf object id in current bpf namespace.
+ * Return
+ * bpf object id is returned on success.
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5754,6 +5760,7 @@ struct bpf_stack_build_id {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(find_obj_id, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
--
1.8.3.1
We only expose the bpf map id under current bpf namespace to user. The
map->id is still the id in the init bpf namespace.
The result as follows,
Run bpftool in a new bpf namespace
$ bpftool map show
4: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 159 frozen
pids kprobe(20490)
5: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 159
pids kprobe(20490)
$ bpftool prog show
91: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-20T22:19:16+0800 uid 0
xlated 56B jited 39B memlock 4096B map_ids 4
btf_id 159
pids kprobe(20490)
93: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-20T22:19:16+0800 uid 0
xlated 48B jited 35B memlock 4096B map_ids 4
btf_id 159
pids kprobe(20490)
At the same time, run bpftool in init bpf namespace,
$ bpftool map show
48: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 159 frozen
pids kprobe(20490)
49: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 159
pids kprobe(20490)
$ bpftool prog show
91: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-20T22:19:16+0800 uid 0
xlated 56B jited 39B memlock 4096B map_ids 48
btf_id 159
pids kprobe(20490)
93: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-20T22:19:16+0800 uid 0
xlated 48B jited 35B memlock 4096B map_ids 48
btf_id 159
pids kprobe(20490)
In init bpf namespace, bpftool can also show other bpf maps, but the
bpftool running in the new bpf namespace can't.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf.h | 3 +-
kernel/bpf/bpf_namespace.c | 1 +
kernel/bpf/offload.c | 3 +-
kernel/bpf/syscall.c | 58 ++++++++++---------------------
tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 7 +++-
5 files changed, 30 insertions(+), 42 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c94034a..2a1f19c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -29,6 +29,7 @@
#include <linux/rcupdate_trace.h>
#include <linux/static_call.h>
#include <linux/memcontrol.h>
+#include <linux/bpf_namespace.h>
struct bpf_verifier_env;
struct bpf_verifier_log;
@@ -279,6 +280,7 @@ struct bpf_map {
} owner;
bool bypass_spec_v1;
bool frozen; /* write-once; write-protected by freeze_mutex */
+ struct bpf_obj_id *obj_id;
};
static inline const char *btf_field_type_name(enum btf_field_type type)
@@ -1939,7 +1941,6 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
void bpf_prog_put(struct bpf_prog *prog);
void bpf_prog_free_id(struct bpf_prog *prog);
-void bpf_map_free_id(struct bpf_map *map);
struct btf_field *btf_record_find(const struct btf_record *rec,
u32 offset, u32 field_mask);
diff --git a/kernel/bpf/bpf_namespace.c b/kernel/bpf/bpf_namespace.c
index 1e98d1d..6a6ef70 100644
--- a/kernel/bpf/bpf_namespace.c
+++ b/kernel/bpf/bpf_namespace.c
@@ -11,6 +11,7 @@
#include <linux/bpf_namespace.h>
#define MAX_BPF_NS_LEVEL 32
+DEFINE_SPINLOCK(map_idr_lock);
static struct kmem_cache *bpfns_cachep;
static struct kmem_cache *obj_id_cache[MAX_PID_NS_LEVEL];
static struct ns_common *bpfns_get(struct task_struct *task);
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index aec70e0..7a90ebe 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -25,6 +25,7 @@
#include <linux/rhashtable.h>
#include <linux/rtnetlink.h>
#include <linux/rwsem.h>
+#include <linux/bpf_namespace.h>
/* Protects offdevs, members of bpf_offload_netdev and offload members
* of all progs.
@@ -145,7 +146,7 @@ static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
* refcnt to drop to be freed.
*/
if (map->id) {
- bpf_map_free_id(map);
+ bpf_free_obj_id(map->obj_id, MAP_OBJ_ID);
map->id = 0;
}
list_del_init(&offmap->offloads);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f24e550..1335200 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -35,6 +35,7 @@
#include <linux/rcupdate_trace.h>
#include <linux/memcontrol.h>
#include <linux/trace_events.h>
+#include <linux/bpf_namespace.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
(map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -49,8 +50,6 @@
DEFINE_PER_CPU(int, bpf_prog_active);
static DEFINE_IDR(prog_idr);
DEFINE_SPINLOCK(prog_idr_lock);
-static DEFINE_IDR(map_idr);
-DEFINE_SPINLOCK(map_idr_lock);
static DEFINE_IDR(link_idr);
DEFINE_SPINLOCK(link_idr_lock);
@@ -375,30 +374,6 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
map->map_extra = attr->map_extra;
}
-static int bpf_map_alloc_id(struct bpf_map *map)
-{
- int id;
-
- idr_preload(GFP_KERNEL);
- spin_lock_bh(&map_idr_lock);
- id = idr_alloc_cyclic(&map_idr, map, 1, INT_MAX, GFP_ATOMIC);
- spin_unlock_bh(&map_idr_lock);
- idr_preload_end();
-
- return id;
-}
-
-void bpf_map_free_id(struct bpf_map *map)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&map_idr_lock, flags);
-
- idr_remove(&map_idr, map->id);
-
- spin_unlock_irqrestore(&map_idr_lock, flags);
-}
-
#ifdef CONFIG_MEMCG_KMEM
static void bpf_map_save_memcg(struct bpf_map *map)
{
@@ -737,9 +712,9 @@ static void bpf_map_put_uref(struct bpf_map *map)
void bpf_map_put(struct bpf_map *map)
{
if (atomic64_dec_and_test(&map->refcnt)) {
- /* bpf_map_free_id() must be called first. */
+ /* bpf_free_obj_id() must be called first. */
if (map->id) {
- bpf_map_free_id(map);
+ bpf_free_obj_id(map->obj_id, MAP_OBJ_ID);
map->id = 0;
}
btf_put(map->btf);
@@ -817,7 +792,7 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
map->map_flags,
(unsigned long long)map->map_extra,
bpf_map_memory_usage(map),
- map->id,
+ bpf_obj_id_vnr(map->obj_id),
READ_ONCE(map->frozen));
if (type) {
seq_printf(m, "owner_prog_type:\t%u\n", type);
@@ -1115,6 +1090,7 @@ static int map_create(union bpf_attr *attr)
{
int numa_node = bpf_map_attr_numa_node(attr);
struct btf_field_offs *foffs;
+ struct bpf_obj_id *obj_id;
struct bpf_map *map;
int f_flags;
int err;
@@ -1206,10 +1182,11 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map_field_offs;
- err = bpf_map_alloc_id(map);
- if (err < 0)
+ obj_id = bpf_alloc_obj_id(current->nsproxy->bpf_ns, map, MAP_OBJ_ID);
+ if (IS_ERR(obj_id))
goto free_map_sec;
- map->id = err;
+ map->obj_id = obj_id;
+ map->id = bpf_obj_id_nr(obj_id);
bpf_map_save_memcg(map);
@@ -1217,7 +1194,7 @@ static int map_create(union bpf_attr *attr)
if (err < 0) {
/* failed to allocate fd.
* bpf_map_put_with_uref() is needed because the above
- * bpf_map_alloc_id() has published the map
+ * bpf_alloc_obj_id() has published the map
* to the userspace and the userspace may
* have refcnt-ed it through BPF_MAP_GET_FD_BY_ID.
*/
@@ -3709,11 +3686,12 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_map *map;
spin_lock_bh(&map_idr_lock);
again:
- map = idr_get_next(&map_idr, id);
+ map = idr_get_next(&ns->idr[MAP_OBJ_ID], id);
if (map) {
map = __bpf_map_inc_not_zero(map, false);
if (IS_ERR(map)) {
@@ -3791,6 +3769,7 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_map *map;
u32 id = attr->map_id;
int f_flags;
@@ -3808,7 +3787,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
return f_flags;
spin_lock_bh(&map_idr_lock);
- map = idr_find(&map_idr, id);
+ map = idr_find(&ns->idr[MAP_OBJ_ID], id);
if (map)
map = __bpf_map_inc_not_zero(map, true);
else
@@ -3896,7 +3875,7 @@ static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog,
map = bpf_map_from_imm(prog, imm, &off, &type);
if (map) {
insns[i].src_reg = type;
- insns[i].imm = map->id;
+ insns[i].imm = bpf_obj_id_vnr(map->obj_id);
insns[i + 1].imm = off;
continue;
}
@@ -3978,7 +3957,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
u32 i;
for (i = 0; i < ulen; i++)
- if (put_user(prog->aux->used_maps[i]->id,
+ if (put_user(bpf_obj_id_vnr(prog->aux->used_maps[i]->obj_id),
&user_map_ids[i])) {
mutex_unlock(&prog->aux->used_maps_mutex);
return -EFAULT;
@@ -4242,7 +4221,7 @@ static int bpf_map_get_info_by_fd(struct file *file,
memset(&info, 0, sizeof(info));
info.type = map->map_type;
- info.id = map->id;
+ info.id = bpf_obj_id_vnr(map->obj_id);
info.key_size = map->key_size;
info.value_size = map->value_size;
info.max_entries = map->max_entries;
@@ -4994,6 +4973,7 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
union bpf_attr attr;
bool capable;
int err;
@@ -5072,7 +5052,7 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
break;
case BPF_MAP_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
- &map_idr, &map_idr_lock);
+ &ns->idr[MAP_OBJ_ID], &map_idr_lock);
break;
case BPF_BTF_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
diff --git a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
index eb05ea5..a71aef7 100644
--- a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
+++ b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
@@ -24,11 +24,14 @@ enum bpf_obj_type {
static __always_inline __u32 get_obj_id(void *ent, enum bpf_obj_type type)
{
+ void *obj_id;
+
switch (type) {
case BPF_OBJ_PROG:
return BPF_CORE_READ((struct bpf_prog *)ent, aux, id);
case BPF_OBJ_MAP:
- return BPF_CORE_READ((struct bpf_map *)ent, id);
+ obj_id = BPF_CORE_READ((struct bpf_map *)ent, obj_id);
+ break;
case BPF_OBJ_BTF:
return BPF_CORE_READ((struct btf *)ent, id);
case BPF_OBJ_LINK:
@@ -36,6 +39,8 @@ static __always_inline __u32 get_obj_id(void *ent, enum bpf_obj_type type)
default:
return 0;
}
+
+ return bpf_find_obj_id(obj_id);
}
/* could be used only with BPF_LINK_TYPE_PERF_EVENT links */
--
1.8.3.1
idr_alloc_cyclic() will return -ENOSPC if there's no available IDs, so
don't need to check if the id is less than the start number.
Signed-off-by: Yafang Shao <[email protected]>
---
kernel/bpf/syscall.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e18ac7f..f3664f2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -387,9 +387,6 @@ static int bpf_map_alloc_id(struct bpf_map *map)
spin_unlock_bh(&map_idr_lock);
idr_preload_end();
- if (WARN_ON_ONCE(!id))
- return -ENOSPC;
-
return id > 0 ? 0 : id;
}
@@ -2032,10 +2029,6 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog)
spin_unlock_bh(&prog_idr_lock);
idr_preload_end();
- /* id is in [1, INT_MAX) */
- if (WARN_ON_ONCE(!id))
- return -ENOSPC;
-
return id > 0 ? 0 : id;
}
--
1.8.3.1
Similar to bpf map, We only expose the bpf map id under current bpf
namespace to user. The prog->aux->id is still the id in the init bpf
namespace.
The id of used_maps is also the id in current bpf namespace.
The result as follows,
Run bpftool in current namespace,
$ bpftool map show
4: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 96 frozen
pids kprobe(8790)
5: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 96
pids kprobe(8790)
$ bpftool prog show
7: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-21T10:20:58+0800 uid 0
xlated 56B jited 39B memlock 4096B map_ids 4
btf_id 96
9: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-21T10:20:58+0800 uid 0
xlated 48B jited 35B memlock 4096B map_ids 4
btf_id 96
At the same time, run bpftool in init bpf namespace.
$ bpftool map show
18: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 96 frozen
pids kprobe(8790)
19: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 96
pids kprobe(8790)
$ bpftool prog show
29: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-21T10:20:58+0800 uid 0
xlated 56B jited 39B memlock 4096B map_ids 18
btf_id 96
pids kprobe(8790)
31: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-21T10:20:58+0800 uid 0
xlated 48B jited 35B memlock 4096B map_ids 18
btf_id 96
pids kprobe(8790)
In init bpf namespace, bpftool can also show other bpf progs, but the
bpftool running in the new bpf namespace can't.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf.h | 3 +-
kernel/bpf/bpf_namespace.c | 1 +
kernel/bpf/syscall.c | 56 ++++++++++---------------------
tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 3 +-
4 files changed, 22 insertions(+), 41 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2a1f19c..16f2a01 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1416,6 +1416,7 @@ struct bpf_prog_aux {
struct work_struct work;
struct rcu_head rcu;
};
+ struct bpf_obj_id *obj_id;
};
struct bpf_prog {
@@ -1940,8 +1941,6 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
struct bpf_prog * __must_check bpf_prog_inc_not_zero(struct bpf_prog *prog);
void bpf_prog_put(struct bpf_prog *prog);
-void bpf_prog_free_id(struct bpf_prog *prog);
-
struct btf_field *btf_record_find(const struct btf_record *rec,
u32 offset, u32 field_mask);
void btf_record_free(struct btf_record *rec);
diff --git a/kernel/bpf/bpf_namespace.c b/kernel/bpf/bpf_namespace.c
index 6a6ef70..8c70945 100644
--- a/kernel/bpf/bpf_namespace.c
+++ b/kernel/bpf/bpf_namespace.c
@@ -12,6 +12,7 @@
#define MAX_BPF_NS_LEVEL 32
DEFINE_SPINLOCK(map_idr_lock);
+DEFINE_SPINLOCK(prog_idr_lock);
static struct kmem_cache *bpfns_cachep;
static struct kmem_cache *obj_id_cache[MAX_PID_NS_LEVEL];
static struct ns_common *bpfns_get(struct task_struct *task);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1335200..4725924 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -48,8 +48,6 @@
#define BPF_OBJ_FLAG_MASK (BPF_F_RDONLY | BPF_F_WRONLY)
DEFINE_PER_CPU(int, bpf_prog_active);
-static DEFINE_IDR(prog_idr);
-DEFINE_SPINLOCK(prog_idr_lock);
static DEFINE_IDR(link_idr);
DEFINE_SPINLOCK(link_idr_lock);
@@ -1983,32 +1981,10 @@ static void bpf_audit_prog(const struct bpf_prog *prog, unsigned int op)
if (unlikely(!ab))
return;
audit_log_format(ab, "prog-id=%u op=%s",
- prog->aux->id, bpf_audit_str[op]);
+ bpf_obj_id_vnr(prog->aux->obj_id), bpf_audit_str[op]);
audit_log_end(ab);
}
-static int bpf_prog_alloc_id(struct bpf_prog *prog)
-{
- int id;
-
- idr_preload(GFP_KERNEL);
- spin_lock_bh(&prog_idr_lock);
- id = idr_alloc_cyclic(&prog_idr, prog, 1, INT_MAX, GFP_ATOMIC);
- spin_unlock_bh(&prog_idr_lock);
- idr_preload_end();
-
- return id;
-}
-
-void bpf_prog_free_id(struct bpf_prog *prog)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&prog_idr_lock, flags);
- idr_remove(&prog_idr, prog->aux->id);
- spin_unlock_irqrestore(&prog_idr_lock, flags);
-}
-
static void __bpf_prog_put_rcu(struct rcu_head *rcu)
{
struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);
@@ -2056,7 +2032,7 @@ static void bpf_prog_put_deferred(struct work_struct *work)
* simply waiting for refcnt to drop to be freed.
*/
if (prog->aux->id) {
- bpf_prog_free_id(prog);
+ bpf_free_obj_id(prog->aux->obj_id, PROG_OBJ_ID);
prog->aux->id = 0;
}
__bpf_prog_put_noref(prog, true);
@@ -2157,7 +2133,7 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp)
prog->jited,
prog_tag,
prog->pages * 1ULL << PAGE_SHIFT,
- prog->aux->id,
+ bpf_obj_id_vnr(prog->aux->obj_id),
stats.nsecs,
stats.cnt,
stats.misses,
@@ -2468,6 +2444,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
enum bpf_prog_type type = attr->prog_type;
struct bpf_prog *prog, *dst_prog = NULL;
struct btf *attach_btf = NULL;
+ struct bpf_obj_id *obj_id;
int err;
char license[128];
bool is_gpl;
@@ -2621,12 +2598,13 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
if (err < 0)
goto free_used_maps;
- err = bpf_prog_alloc_id(prog);
- if (err < 0)
+ obj_id = bpf_alloc_obj_id(current->nsproxy->bpf_ns, prog, PROG_OBJ_ID);
+ if (IS_ERR(obj_id))
goto free_used_maps;
- prog->aux->id = err;
+ prog->aux->obj_id = obj_id;
+ prog->aux->id = bpf_obj_id_nr(obj_id);
- /* Upon success of bpf_prog_alloc_id(), the BPF prog is
+ /* Upon success of bpf_alloc_obj_id(), the BPF prog is
* effectively publicly exposed. However, retrieving via
* bpf_prog_get_fd_by_id() will take another reference,
* therefore it cannot be gone underneath us.
@@ -2803,7 +2781,7 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
"prog_tag:\t%s\n"
"prog_id:\t%u\n",
prog_tag,
- prog->aux->id);
+ bpf_obj_id_vnr(prog->aux->obj_id));
}
if (link->ops->show_fdinfo)
link->ops->show_fdinfo(link, m);
@@ -3706,11 +3684,12 @@ struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_prog *prog;
spin_lock_bh(&prog_idr_lock);
again:
- prog = idr_get_next(&prog_idr, id);
+ prog = idr_get_next(&ns->idr[PROG_OBJ_ID], id);
if (prog) {
prog = bpf_prog_inc_not_zero(prog);
if (IS_ERR(prog)) {
@@ -3727,13 +3706,14 @@ struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id)
struct bpf_prog *bpf_prog_by_id(u32 id)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_prog *prog;
if (!id)
return ERR_PTR(-ENOENT);
spin_lock_bh(&prog_idr_lock);
- prog = idr_find(&prog_idr, id);
+ prog = idr_find(&ns->idr[PROG_OBJ_ID], id);
if (prog)
prog = bpf_prog_inc_not_zero(prog);
else
@@ -3939,7 +3919,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
return -EFAULT;
info.type = prog->type;
- info.id = prog->aux->id;
+ info.id = bpf_obj_id_vnr(prog->aux->obj_id);
info.load_time = prog->aux->load_time;
info.created_by_uid = from_kuid_munged(current_user_ns(),
prog->aux->user->uid);
@@ -4287,7 +4267,7 @@ static int bpf_link_get_info_by_fd(struct file *file,
info.type = link->type;
info.id = link->id;
if (link->prog)
- info.prog_id = link->prog->aux->id;
+ info.prog_id = bpf_obj_id_vnr(link->prog->aux->obj_id);
if (link->ops->fill_link_info) {
err = link->ops->fill_link_info(link, &info);
@@ -4452,7 +4432,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
struct bpf_raw_event_map *btp = raw_tp->btp;
err = bpf_task_fd_query_copy(attr, uattr,
- raw_tp->link.prog->aux->id,
+ bpf_obj_id_vnr(raw_tp->link.prog->aux->obj_id),
BPF_FD_TYPE_RAW_TRACEPOINT,
btp->tp->name, 0, 0);
goto put_file;
@@ -5048,7 +5028,7 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
break;
case BPF_PROG_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
- &prog_idr, &prog_idr_lock);
+ &ns->idr[PROG_OBJ_ID], &prog_idr_lock);
break;
case BPF_MAP_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
diff --git a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
index a71aef7..1fd8ceb 100644
--- a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
+++ b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
@@ -28,7 +28,8 @@ static __always_inline __u32 get_obj_id(void *ent, enum bpf_obj_type type)
switch (type) {
case BPF_OBJ_PROG:
- return BPF_CORE_READ((struct bpf_prog *)ent, aux, id);
+ obj_id = BPF_CORE_READ((struct bpf_prog *)ent, aux, obj_id);
+ break;
case BPF_OBJ_MAP:
obj_id = BPF_CORE_READ((struct bpf_map *)ent, obj_id);
break;
--
1.8.3.1
CAP_SYS_ADMIN is not required to iterate bpf objects if a user is in a
non-init bpf namespace. The user can iterate bpf maps, progs, and links
in his bpf namespace but can't iterate the bpf objects in different bpf
namespace.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf_namespace.h | 8 ++++++++
kernel/bpf/syscall.c | 10 +++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_namespace.h b/include/linux/bpf_namespace.h
index 50bd68c..f484791 100644
--- a/include/linux/bpf_namespace.h
+++ b/include/linux/bpf_namespace.h
@@ -5,6 +5,7 @@
#include <linux/idr.h>
#include <linux/ns_common.h>
#include <linux/user_namespace.h>
+#include <linux/capability.h>
struct ubpf_obj_id {
int nr;
@@ -79,4 +80,11 @@ static inline int bpf_obj_id_vnr(struct bpf_obj_id *obj_id)
{
return bpf_obj_id_nr_ns(obj_id, current->nsproxy->bpf_ns);
}
+
+static inline bool bpfns_capable(void)
+{
+ if (current->nsproxy->bpf_ns != &init_bpf_ns && capable(CAP_BPF))
+ return true;
+ return false;
+}
#endif /* _LINUX_BPF_ID_NS_H */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 855d5f7..8a72694 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3628,7 +3628,7 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN) && !bpfns_capable())
return -EPERM;
next_id++;
@@ -3712,7 +3712,7 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
if (CHECK_ATTR(BPF_PROG_GET_FD_BY_ID))
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN) && !bpfns_capable())
return -EPERM;
prog = bpf_prog_by_id(id);
@@ -3740,7 +3740,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
attr->open_flags & ~BPF_OBJ_FLAG_MASK)
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN) && !bpfns_capable())
return -EPERM;
f_flags = bpf_get_file_flag(attr->open_flags);
@@ -4386,7 +4386,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
if (CHECK_ATTR(BPF_TASK_FD_QUERY))
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN) && !bpfns_capable())
return -EPERM;
if (attr->task_fd_query.flags != 0)
@@ -4781,7 +4781,7 @@ static int bpf_link_get_fd_by_id(const union bpf_attr *attr)
if (CHECK_ATTR(BPF_LINK_GET_FD_BY_ID))
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN) && !bpfns_capable())
return -EPERM;
link = bpf_link_by_id(id);
--
1.8.3.1
Use an array instead, that will make the code more clear.
It is a cleanup.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf_namespace.h | 4 +---
kernel/bpf/bpf_namespace.c | 37 ++++++-------------------------------
kernel/bpf/syscall.c | 42 +++++++++++++++++++++---------------------
3 files changed, 28 insertions(+), 55 deletions(-)
diff --git a/include/linux/bpf_namespace.h b/include/linux/bpf_namespace.h
index f484791..4d58986 100644
--- a/include/linux/bpf_namespace.h
+++ b/include/linux/bpf_namespace.h
@@ -39,9 +39,7 @@ struct bpf_namespace {
extern struct bpf_namespace init_bpf_ns;
extern struct proc_ns_operations bpfns_operations;
-extern spinlock_t map_idr_lock;
-extern spinlock_t prog_idr_lock;
-extern spinlock_t link_idr_lock;
+extern spinlock_t bpf_idr_lock[OBJ_ID_NUM];
struct bpf_namespace *copy_bpfns(unsigned long flags,
struct user_namespace *user_ns,
diff --git a/kernel/bpf/bpf_namespace.c b/kernel/bpf/bpf_namespace.c
index c7d62ef..51c240f 100644
--- a/kernel/bpf/bpf_namespace.c
+++ b/kernel/bpf/bpf_namespace.c
@@ -11,9 +11,7 @@
#include <linux/bpf_namespace.h>
#define MAX_BPF_NS_LEVEL 32
-DEFINE_SPINLOCK(map_idr_lock);
-DEFINE_SPINLOCK(prog_idr_lock);
-DEFINE_SPINLOCK(link_idr_lock);
+spinlock_t bpf_idr_lock[OBJ_ID_NUM];
static struct kmem_cache *bpfns_cachep;
static struct kmem_cache *obj_id_cache[MAX_PID_NS_LEVEL];
static struct ns_common *bpfns_get(struct task_struct *task);
@@ -208,8 +206,10 @@ static void __init bpfns_idr_init(void)
init_bpf_ns.obj_id_cachep =
KMEM_CACHE(pid, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
- for (i = 0; i < OBJ_ID_NUM; i++)
+ for (i = 0; i < OBJ_ID_NUM; i++) {
idr_init(&init_bpf_ns.idr[i]);
+ spin_lock_init(&bpf_idr_lock[i]);
+ }
}
static __init int bpf_namespaces_init(void)
@@ -231,24 +231,11 @@ struct bpf_obj_id *bpf_alloc_obj_id(struct bpf_namespace *ns,
int id;
int i;
- switch (type) {
- case MAP_OBJ_ID:
- idr_lock = &map_idr_lock;
- break;
- case PROG_OBJ_ID:
- idr_lock = &prog_idr_lock;
- break;
- case LINK_OBJ_ID:
- idr_lock = &link_idr_lock;
- break;
- default:
- return ERR_PTR(-EINVAL);
- }
-
obj_id = kmem_cache_alloc(ns->obj_id_cachep, GFP_KERNEL);
if (!obj_id)
return ERR_PTR(-ENOMEM);
+ idr_lock = &bpf_idr_lock[type];
obj_id->level = ns->level;
for (i = ns->level; i >= 0; i--) {
idr_preload(GFP_KERNEL);
@@ -283,19 +270,7 @@ void bpf_free_obj_id(struct bpf_obj_id *obj_id, int type)
unsigned long flags;
int i;
- switch (type) {
- case MAP_OBJ_ID:
- idr_lock = &map_idr_lock;
- break;
- case PROG_OBJ_ID:
- idr_lock = &prog_idr_lock;
- break;
- case LINK_OBJ_ID:
- idr_lock = &link_idr_lock;
- break;
- default:
- return;
- }
+ idr_lock = &bpf_idr_lock[type];
/* Note that the level-0 should be freed at last */
for (i = obj_id->level; i >= 0; i--) {
spin_lock_irqsave(idr_lock, flags);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8a72694..7cbaaa9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1269,7 +1269,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
return map;
}
-/* map_idr_lock should have been held or the map should have been
+/* map idr_lock should have been held or the map should have been
* protected by rcu read lock.
*/
struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
@@ -1287,9 +1287,9 @@ struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map)
{
- spin_lock_bh(&map_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[MAP_OBJ_ID]);
map = __bpf_map_inc_not_zero(map, false);
- spin_unlock_bh(&map_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[MAP_OBJ_ID]);
return map;
}
@@ -2195,7 +2195,7 @@ void bpf_prog_inc(struct bpf_prog *prog)
}
EXPORT_SYMBOL_GPL(bpf_prog_inc);
-/* prog_idr_lock should have been held */
+/* prog idr_lock should have been held */
struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog *prog)
{
int refold;
@@ -2836,10 +2836,10 @@ int bpf_link_prime(struct bpf_link *link, struct bpf_link_primer *primer)
int bpf_link_settle(struct bpf_link_primer *primer)
{
/* make bpf_link fetchable by ID */
- spin_lock_bh(&link_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[LINK_OBJ_ID]);
primer->link->id = primer->id;
primer->link->obj_id = primer->obj_id;
- spin_unlock_bh(&link_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[LINK_OBJ_ID]);
/* make bpf_link fetchable by FD */
fd_install(primer->fd, primer->file);
/* pass through installed FD */
@@ -3648,7 +3648,7 @@ struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_map *map;
- spin_lock_bh(&map_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[MAP_OBJ_ID]);
again:
map = idr_get_next(&ns->idr[MAP_OBJ_ID], id);
if (map) {
@@ -3658,7 +3658,7 @@ struct bpf_map *bpf_map_get_curr_or_next(u32 *id)
goto again;
}
}
- spin_unlock_bh(&map_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[MAP_OBJ_ID]);
return map;
}
@@ -3668,7 +3668,7 @@ struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id)
struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_prog *prog;
- spin_lock_bh(&prog_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[PROG_OBJ_ID]);
again:
prog = idr_get_next(&ns->idr[PROG_OBJ_ID], id);
if (prog) {
@@ -3678,7 +3678,7 @@ struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id)
goto again;
}
}
- spin_unlock_bh(&prog_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[PROG_OBJ_ID]);
return prog;
}
@@ -3693,13 +3693,13 @@ struct bpf_prog *bpf_prog_by_id(u32 id)
if (!id)
return ERR_PTR(-ENOENT);
- spin_lock_bh(&prog_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[PROG_OBJ_ID]);
prog = idr_find(&ns->idr[PROG_OBJ_ID], id);
if (prog)
prog = bpf_prog_inc_not_zero(prog);
else
prog = ERR_PTR(-ENOENT);
- spin_unlock_bh(&prog_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[PROG_OBJ_ID]);
return prog;
}
@@ -3747,13 +3747,13 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr)
if (f_flags < 0)
return f_flags;
- spin_lock_bh(&map_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[MAP_OBJ_ID]);
map = idr_find(&ns->idr[MAP_OBJ_ID], id);
if (map)
map = __bpf_map_inc_not_zero(map, true);
else
map = ERR_PTR(-ENOENT);
- spin_unlock_bh(&map_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[MAP_OBJ_ID]);
if (IS_ERR(map))
return PTR_ERR(map);
@@ -4735,7 +4735,7 @@ struct bpf_link *bpf_link_by_id(u32 id)
if (!id)
return ERR_PTR(-ENOENT);
- spin_lock_bh(&link_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[LINK_OBJ_ID]);
/* before link is "settled", ID is 0, pretend it doesn't exist yet */
link = idr_find(&ns->idr[LINK_OBJ_ID], id);
if (link) {
@@ -4746,7 +4746,7 @@ struct bpf_link *bpf_link_by_id(u32 id)
} else {
link = ERR_PTR(-ENOENT);
}
- spin_unlock_bh(&link_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[LINK_OBJ_ID]);
return link;
}
@@ -4755,7 +4755,7 @@ struct bpf_link *bpf_link_get_curr_or_next(u32 *id)
struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_link *link;
- spin_lock_bh(&link_idr_lock);
+ spin_lock_bh(&bpf_idr_lock[LINK_OBJ_ID]);
again:
link = idr_get_next(&ns->idr[LINK_OBJ_ID], id);
if (link) {
@@ -4765,7 +4765,7 @@ struct bpf_link *bpf_link_get_curr_or_next(u32 *id)
goto again;
}
}
- spin_unlock_bh(&link_idr_lock);
+ spin_unlock_bh(&bpf_idr_lock[LINK_OBJ_ID]);
return link;
}
@@ -5011,11 +5011,11 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
break;
case BPF_PROG_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
- &ns->idr[PROG_OBJ_ID], &prog_idr_lock);
+ &ns->idr[PROG_OBJ_ID], &bpf_idr_lock[PROG_OBJ_ID]);
break;
case BPF_MAP_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
- &ns->idr[MAP_OBJ_ID], &map_idr_lock);
+ &ns->idr[MAP_OBJ_ID], &bpf_idr_lock[MAP_OBJ_ID]);
break;
case BPF_BTF_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
@@ -5069,7 +5069,7 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
break;
case BPF_LINK_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
- &ns->idr[LINK_OBJ_ID], &link_idr_lock);
+ &ns->idr[LINK_OBJ_ID], &bpf_idr_lock[LINK_OBJ_ID]);
break;
case BPF_ENABLE_STATS:
err = bpf_enable_stats(&attr);
--
1.8.3.1
Similar to bpf map, We only expose the bpf link id under current bpf
namespace to user. The link->id is still the id in the init bpf
namespace.
The result as follows,
Run bpftool in a new bpf namespace,
$ bpftool map show
4: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 79 frozen
pids kprobe(8322)
5: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 79
pids kprobe(8322)
$ bpftool prog show
7: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-21T13:54:34+0800 uid 0
xlated 56B jited 39B memlock 4096B map_ids 4
btf_id 79
pids kprobe(8322)
9: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-21T13:54:34+0800 uid 0
xlated 48B jited 35B memlock 4096B map_ids 4
btf_id 79
pids kprobe(8322)
$ bpftool link show
1: perf_event prog 9
bpf_cookie 0
pids kprobe(8322)
2: perf_event prog 7
bpf_cookie 0
pids kprobe(8322)
At the same time, run bpftool in the init bpf namespace,
$ bpftool map show
8: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 79 frozen
pids kprobe(8322)
9: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 79
pids kprobe(8322)
$ bpftool prog show
15: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-21T13:54:34+0800 uid 0
xlated 56B jited 39B memlock 4096B map_ids 8
btf_id 79
pids kprobe(8322)
17: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-21T13:54:34+0800 uid 0
xlated 48B jited 35B memlock 4096B map_ids 8
btf_id 79
pids kprobe(8322)
$ bpftool link show
2: perf_event prog 17
bpf_cookie 0
pids kprobe(8322)
3: perf_event prog 15
bpf_cookie 0
pids kprobe(8322)
The bpftool running in the init bpf namespace can also show other bpf
links, but the bpftool in the new bpf namespace can only show the links
in its current bpf namespace.
Signed-off-by: Yafang Shao <[email protected]>
---
include/linux/bpf.h | 2 ++
kernel/bpf/bpf_namespace.c | 1 +
kernel/bpf/syscall.c | 55 +++++++++++--------------------
tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 3 +-
4 files changed, 24 insertions(+), 37 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 16f2a01..efa14ac 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1468,6 +1468,7 @@ struct bpf_link {
const struct bpf_link_ops *ops;
struct bpf_prog *prog;
struct work_struct work;
+ struct bpf_obj_id *obj_id;
};
struct bpf_link_ops {
@@ -1506,6 +1507,7 @@ struct bpf_link_primer {
struct file *file;
int fd;
u32 id;
+ struct bpf_obj_id *obj_id;
};
struct bpf_struct_ops_value;
diff --git a/kernel/bpf/bpf_namespace.c b/kernel/bpf/bpf_namespace.c
index 8c70945..c7d62ef 100644
--- a/kernel/bpf/bpf_namespace.c
+++ b/kernel/bpf/bpf_namespace.c
@@ -13,6 +13,7 @@
#define MAX_BPF_NS_LEVEL 32
DEFINE_SPINLOCK(map_idr_lock);
DEFINE_SPINLOCK(prog_idr_lock);
+DEFINE_SPINLOCK(link_idr_lock);
static struct kmem_cache *bpfns_cachep;
static struct kmem_cache *obj_id_cache[MAX_PID_NS_LEVEL];
static struct ns_common *bpfns_get(struct task_struct *task);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4725924..855d5f7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -48,8 +48,6 @@
#define BPF_OBJ_FLAG_MASK (BPF_F_RDONLY | BPF_F_WRONLY)
DEFINE_PER_CPU(int, bpf_prog_active);
-static DEFINE_IDR(link_idr);
-DEFINE_SPINLOCK(link_idr_lock);
int sysctl_unprivileged_bpf_disabled __read_mostly =
IS_BUILTIN(CONFIG_BPF_UNPRIV_DEFAULT_OFF) ? 2 : 0;
@@ -2670,17 +2668,11 @@ void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
atomic64_set(&link->refcnt, 1);
link->type = type;
link->id = 0;
+ link->obj_id = NULL;
link->ops = ops;
link->prog = prog;
}
-static void bpf_link_free_id(int id)
-{
- spin_lock_bh(&link_idr_lock);
- idr_remove(&link_idr, id);
- spin_unlock_bh(&link_idr_lock);
-}
-
/* Clean up bpf_link and corresponding anon_inode file and FD. After
* anon_inode is created, bpf_link can't be just kfree()'d due to deferred
* anon_inode's release() call. This helper marksbpf_link as
@@ -2692,7 +2684,7 @@ void bpf_link_cleanup(struct bpf_link_primer *primer)
{
primer->link->prog = NULL;
if (primer->id) {
- bpf_link_free_id(primer->id);
+ bpf_free_obj_id(primer->obj_id, LINK_OBJ_ID);
primer->id = 0;
}
fput(primer->file);
@@ -2708,7 +2700,7 @@ void bpf_link_inc(struct bpf_link *link)
static void bpf_link_free(struct bpf_link *link)
{
if (link->id) {
- bpf_link_free_id(link->id);
+ bpf_free_obj_id(link->obj_id, LINK_OBJ_ID);
link->id = 0;
}
if (link->prog) {
@@ -2774,7 +2766,7 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
"link_type:\t%s\n"
"link_id:\t%u\n",
bpf_link_type_strs[link->type],
- link->id);
+ bpf_obj_id_vnr(link->obj_id));
if (prog) {
bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
seq_printf(m,
@@ -2797,19 +2789,6 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
.write = bpf_dummy_write,
};
-static int bpf_link_alloc_id(struct bpf_link *link)
-{
- int id;
-
- idr_preload(GFP_KERNEL);
- spin_lock_bh(&link_idr_lock);
- id = idr_alloc_cyclic(&link_idr, link, 1, INT_MAX, GFP_ATOMIC);
- spin_unlock_bh(&link_idr_lock);
- idr_preload_end();
-
- return id;
-}
-
/* Prepare bpf_link to be exposed to user-space by allocating anon_inode file,
* reserving unused FD and allocating ID from link_idr. This is to be paired
* with bpf_link_settle() to install FD and ID and expose bpf_link to
@@ -2825,23 +2804,23 @@ static int bpf_link_alloc_id(struct bpf_link *link)
*/
int bpf_link_prime(struct bpf_link *link, struct bpf_link_primer *primer)
{
+ struct bpf_obj_id *obj_id;
struct file *file;
- int fd, id;
+ int fd;
fd = get_unused_fd_flags(O_CLOEXEC);
if (fd < 0)
return fd;
-
- id = bpf_link_alloc_id(link);
- if (id < 0) {
+ obj_id = bpf_alloc_obj_id(current->nsproxy->bpf_ns, link, LINK_OBJ_ID);
+ if (IS_ERR(obj_id)) {
put_unused_fd(fd);
- return id;
+ return PTR_ERR(obj_id);
}
file = anon_inode_getfile("bpf_link", &bpf_link_fops, link, O_CLOEXEC);
if (IS_ERR(file)) {
- bpf_link_free_id(id);
+ bpf_free_obj_id(obj_id, LINK_OBJ_ID);
put_unused_fd(fd);
return PTR_ERR(file);
}
@@ -2849,7 +2828,8 @@ int bpf_link_prime(struct bpf_link *link, struct bpf_link_primer *primer)
primer->link = link;
primer->file = file;
primer->fd = fd;
- primer->id = id;
+ primer->id = bpf_obj_id_nr(obj_id);
+ primer->obj_id = obj_id;
return 0;
}
@@ -2858,6 +2838,7 @@ int bpf_link_settle(struct bpf_link_primer *primer)
/* make bpf_link fetchable by ID */
spin_lock_bh(&link_idr_lock);
primer->link->id = primer->id;
+ primer->link->obj_id = primer->obj_id;
spin_unlock_bh(&link_idr_lock);
/* make bpf_link fetchable by FD */
fd_install(primer->fd, primer->file);
@@ -4265,7 +4246,7 @@ static int bpf_link_get_info_by_fd(struct file *file,
return -EFAULT;
info.type = link->type;
- info.id = link->id;
+ info.id = bpf_obj_id_vnr(link->obj_id);
if (link->prog)
info.prog_id = bpf_obj_id_vnr(link->prog->aux->obj_id);
@@ -4748,6 +4729,7 @@ static struct bpf_link *bpf_link_inc_not_zero(struct bpf_link *link)
struct bpf_link *bpf_link_by_id(u32 id)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_link *link;
if (!id)
@@ -4755,7 +4737,7 @@ struct bpf_link *bpf_link_by_id(u32 id)
spin_lock_bh(&link_idr_lock);
/* before link is "settled", ID is 0, pretend it doesn't exist yet */
- link = idr_find(&link_idr, id);
+ link = idr_find(&ns->idr[LINK_OBJ_ID], id);
if (link) {
if (link->id)
link = bpf_link_inc_not_zero(link);
@@ -4770,11 +4752,12 @@ struct bpf_link *bpf_link_by_id(u32 id)
struct bpf_link *bpf_link_get_curr_or_next(u32 *id)
{
+ struct bpf_namespace *ns = current->nsproxy->bpf_ns;
struct bpf_link *link;
spin_lock_bh(&link_idr_lock);
again:
- link = idr_get_next(&link_idr, id);
+ link = idr_get_next(&ns->idr[LINK_OBJ_ID], id);
if (link) {
link = bpf_link_inc_not_zero(link);
if (IS_ERR(link)) {
@@ -5086,7 +5069,7 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
break;
case BPF_LINK_GET_NEXT_ID:
err = bpf_obj_get_next_id(&attr, uattr.user,
- &link_idr, &link_idr_lock);
+ &ns->idr[LINK_OBJ_ID], &link_idr_lock);
break;
case BPF_ENABLE_STATS:
err = bpf_enable_stats(&attr);
diff --git a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
index 1fd8ceb..e2237ad 100644
--- a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
+++ b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
@@ -36,7 +36,8 @@ static __always_inline __u32 get_obj_id(void *ent, enum bpf_obj_type type)
case BPF_OBJ_BTF:
return BPF_CORE_READ((struct btf *)ent, id);
case BPF_OBJ_LINK:
- return BPF_CORE_READ((struct bpf_link *)ent, id);
+ obj_id = BPF_CORE_READ((struct bpf_link *)ent, obj_id);
+ break;
default:
return 0;
}
--
1.8.3.1
A simple test case is added for the newly introduced bpf namespcae.
Signed-off-by: Yafang Shao <[email protected]>
---
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/test_bpfns.c | 76 ++++++++++++++++++++++++++++++++
2 files changed, 78 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/test_bpfns.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4a8ef11..55f0aeb 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -40,7 +40,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
test_sock test_sockmap get_cgroup_id_user \
test_cgroup_storage \
test_tcpnotify_user test_sysctl \
- test_progs-no_alu32
+ test_progs-no_alu32 test_bpfns
# Also test bpf-gcc, if present
ifneq ($(BPF_GCC),)
@@ -255,6 +255,7 @@ $(OUTPUT)/flow_dissector_load: $(TESTING_HELPERS)
$(OUTPUT)/test_maps: $(TESTING_HELPERS)
$(OUTPUT)/test_verifier: $(TESTING_HELPERS) $(CAP_HELPERS) $(UNPRIV_HELPERS)
$(OUTPUT)/xsk.o: $(BPFOBJ)
+$(OUTPUT)/test_bpfns: $(TESTING_HELPERS)
BPFTOOL ?= $(DEFAULT_BPFTOOL)
$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
diff --git a/tools/testing/selftests/bpf/test_bpfns.c b/tools/testing/selftests/bpf/test_bpfns.c
new file mode 100644
index 0000000..7baebe2
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_bpfns.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE 1
+#endif
+#include <unistd.h>
+#include <errno.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <assert.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <linux/sched.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+static int create_bpf_map(const char *name)
+{
+ static struct bpf_map_create_opts map_opts = {
+ .sz = sizeof(map_opts),
+ };
+ unsigned int value;
+ unsigned int key;
+ int map_fd;
+
+ map_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, name, sizeof(key),
+ sizeof(value), 1, &map_opts);
+ if (map_fd < 0)
+ fprintf(stderr, "%s - Failed to create map\n", strerror(errno));
+ return map_fd;
+}
+
+
+int main(int argc, char *argv[])
+{
+ struct bpf_map_info info = {};
+ __u32 info_len = sizeof(info);
+ struct clone_args args = {
+ .flags = 0x400000000ULL, /* CLONE_NEWBPF */
+ .exit_signal = SIGCHLD,
+ };
+ int map_fd, child_map_fd;
+ pid_t pid;
+
+ /* Create a map in init bpf namespace. */
+ map_fd = create_bpf_map("map_in_init");
+ if (map_fd < 0)
+ exit(EXIT_FAILURE);
+ pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
+ if (pid < 0) {
+ fprintf(stderr, "%s - Failed to create new process\n", strerror(errno));
+ exit(EXIT_FAILURE);
+ }
+
+ if (pid == 0) {
+ struct bpf_map_info info = {};
+
+ /* In a new bpf namespace, it is the first map. */
+ child_map_fd = create_bpf_map("map_in_bpfns");
+ if (child_map_fd < 0)
+ exit(EXIT_FAILURE);
+ bpf_obj_get_info_by_fd(child_map_fd, &info, &info_len);
+ assert(info.id == 1);
+ exit(EXIT_SUCCESS);
+ }
+
+ if (waitpid(pid, NULL, 0) != pid) {
+ fprintf(stderr, "Failed to wait on child process\n");
+ exit(EXIT_FAILURE);
+ }
+
+ return 0;
+}
--
1.8.3.1
Yafang Shao <[email protected]> writes:
> We only expose the bpf map id under current bpf namespace to user. The
> map->id is still the id in the init bpf namespace.
>
> The result as follows,
>
> Run bpftool in a new bpf namespace
> $ bpftool map show
> 4: array name kprobe_b.rodata flags 0x80
> key 4B value 37B max_entries 1 memlock 360B
> btf_id 159 frozen
The btf_id is identical for all the different objects in this example
output; surely that can't be right? Copy-paste error? Same thing in the
other patches...
-Toke
Yafang Shao <[email protected]> writes:
> Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> to FDs, that's intended for BPF's security model[1]. Not only does it
> prevent non-privilidged users from getting other users' bpf program, but
> also it prevents the user from iterating his own bpf objects.
>
> In container environment, some users want to run bpf programs in their
> containers. These users can run their bpf programs under CAP_BPF and
> some other specific CAPs, but they can't inspect their bpf programs in a
> generic way. For example, the bpftool can't be used as it requires
> CAP_SYS_ADMIN. That is very inconvenient.
>
> Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> which is not created by the process itself is with SCM_RIGHTS, that
> requires each processes which created bpf object has to implement a unix
> domain socket to share the fd of a bpf object between different
> processes, that is really trivial and troublesome.
>
> Hence we need a better mechanism to get bpf object info without
> CAP_SYS_ADMIN.
>
> BPF namespace is introduced in this patchset with an attempt to remove
> the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> link in a specific bpf namespace, then these bpf objects will not be
> visible to the users in a different bpf namespace. But these bpf
> objects are visible to its parent bpf namespace, so the sys admin can
> still iterate and inspect them.
>
> BPF namespace is similar to PID namespace, and the bpf objects are
> similar to tasks, so BPF namespace is very easy to understand. These
> patchset only implements BPF namespace for bpf map, prog and link. In the
> future we may extend it to other bpf objects like btf, bpffs and etc.
May? I think we should cover all of the existing BPF objects from the
beginning here, or we may miss important interactions that will
invalidate the whole idea. In particular, I'm a little worried about the
interaction between namespaces and bpffs; what happens if you're in a
bpf namespace and you try to read a BPF object from a bpffs that belongs
to a different namespace? Does the operation fail? Is the object hidden
entirely? Something else?
-Toke
On Sun, Mar 26, 2023 at 6:51 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Yafang Shao <[email protected]> writes:
>
> > We only expose the bpf map id under current bpf namespace to user. The
> > map->id is still the id in the init bpf namespace.
> >
> > The result as follows,
> >
> > Run bpftool in a new bpf namespace
> > $ bpftool map show
> > 4: array name kprobe_b.rodata flags 0x80
> > key 4B value 37B max_entries 1 memlock 360B
> > btf_id 159 frozen
>
> The btf_id is identical for all the different objects in this example
> output; surely that can't be right? Copy-paste error? Same thing in the
> other patches...
>
The bpf progs {"kretprobe_run","kprobe_run"} and the bpf maps
{"kprobe_b.rodata","kprobe_b.data"} belong to the same bpf program. So
the btf_id of them are always the same. For example, below is the
result when I rerun it on my test server,
$ bpftool btf show
...
943: name <anon> size 1086B prog_ids 48824,48822 map_ids 43712,43711
pids kprobe(3599801)
...
$ bpftool map show
43711: array name kprobe_b.rodata flags 0x80
key 4B value 37B max_entries 1 memlock 360B
btf_id 943 frozen
pids kprobe(3599801)
43712: array name kprobe_b.data flags 0x400
key 4B value 4B max_entries 1 memlock 8192B
btf_id 943
pids kprobe(3599801)
$ bpftool prog show
48822: kprobe name kretprobe_run tag 0de47cc241a2b1b3 gpl
loaded_at 2023-03-27T10:35:01+0800 uid 0
xlated 112B jited 78B memlock 4096B map_ids 43711
btf_id 943
pids kprobe(3599801)
48824: kprobe name kprobe_run tag bf163b23cd3b174d gpl
loaded_at 2023-03-27T10:35:01+0800 uid 0
xlated 104B jited 75B memlock 4096B map_ids 43711
btf_id 943
pids kprobe(3599801)
The btf_id hasn't been added into the bpf namespace, so the btf id in
init bpf namespace and child bpf namespace are the same value.
--
Regards
Yafang
On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Yafang Shao <[email protected]> writes:
>
> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > to FDs, that's intended for BPF's security model[1]. Not only does it
> > prevent non-privilidged users from getting other users' bpf program, but
> > also it prevents the user from iterating his own bpf objects.
> >
> > In container environment, some users want to run bpf programs in their
> > containers. These users can run their bpf programs under CAP_BPF and
> > some other specific CAPs, but they can't inspect their bpf programs in a
> > generic way. For example, the bpftool can't be used as it requires
> > CAP_SYS_ADMIN. That is very inconvenient.
> >
> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> > which is not created by the process itself is with SCM_RIGHTS, that
> > requires each processes which created bpf object has to implement a unix
> > domain socket to share the fd of a bpf object between different
> > processes, that is really trivial and troublesome.
> >
> > Hence we need a better mechanism to get bpf object info without
> > CAP_SYS_ADMIN.
> >
> > BPF namespace is introduced in this patchset with an attempt to remove
> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > link in a specific bpf namespace, then these bpf objects will not be
> > visible to the users in a different bpf namespace. But these bpf
> > objects are visible to its parent bpf namespace, so the sys admin can
> > still iterate and inspect them.
> >
> > BPF namespace is similar to PID namespace, and the bpf objects are
> > similar to tasks, so BPF namespace is very easy to understand. These
> > patchset only implements BPF namespace for bpf map, prog and link. In the
> > future we may extend it to other bpf objects like btf, bpffs and etc.
>
> May? I think we should cover all of the existing BPF objects from the
> beginning here, or we may miss important interactions that will
> invalidate the whole idea.
This patchset is intended to address iterating bpf IDs and converting
IDs to FDs. To be more specific, it covers
BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
but I don't implement it because I find we can do more wrt BTF, for
example, if we can expose a small amount of BTFs in the vmlinux to
non-root bpf namespace.
But, yes, I should implement BTF ID in this patchset.
> In particular, I'm a little worried about the
> interaction between namespaces and bpffs; what happens if you're in a
> bpf namespace and you try to read a BPF object from a bpffs that belongs
> to a different namespace? Does the operation fail? Is the object hidden
> entirely? Something else?
>
bpffs is a different topic and it can be implemented in later patchsets.
bpffs has its own specific problem even without the bpf namespace.
1. The user can always get the information of a bpf object through its
corresponding pinned file.
In our practice, different container users have different bpffs, and
we allow the container user to bind-mount its bpffs only, so others'
bpffs are invisible.
To make it better with the bpf namespace, I think we can fail the
operation if the pinned file doesn't belong to its bpf namespace. That
said, we will add pinned bpf files into the bpf namespace in the next
step.
2. The user can always iterate bpf objects through progs.debug and maps.debug
progs.debug and maps.debug are debugging purposes only. So I think we
can handle it later.
--
Regards
Yafang
On 03/26, Yafang Shao wrote:
> Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> to FDs, that's intended for BPF's security model[1]. Not only does it
> prevent non-privilidged users from getting other users' bpf program, but
> also it prevents the user from iterating his own bpf objects.
> In container environment, some users want to run bpf programs in their
> containers. These users can run their bpf programs under CAP_BPF and
> some other specific CAPs, but they can't inspect their bpf programs in a
> generic way. For example, the bpftool can't be used as it requires
> CAP_SYS_ADMIN. That is very inconvenient.
> Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> which is not created by the process itself is with SCM_RIGHTS, that
> requires each processes which created bpf object has to implement a unix
> domain socket to share the fd of a bpf object between different
> processes, that is really trivial and troublesome.
> Hence we need a better mechanism to get bpf object info without
> CAP_SYS_ADMIN.
[..]
> BPF namespace is introduced in this patchset with an attempt to remove
> the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> link in a specific bpf namespace, then these bpf objects will not be
> visible to the users in a different bpf namespace. But these bpf
> objects are visible to its parent bpf namespace, so the sys admin can
> still iterate and inspect them.
Does it essentially mean unpriv bpf? Can I, as a non-root, create
a new bpf namespace and start loading/attaching progs?
Maybe add a paragraph about now vs whatever you're proposing.
Otherwise it's not very clear what's the security story.
(haven't looked at the whole series, so maybe it's answered somewhere else?)
> BPF namespace is similar to PID namespace, and the bpf objects are
> similar to tasks, so BPF namespace is very easy to understand. These
> patchset only implements BPF namespace for bpf map, prog and link. In the
> future we may extend it to other bpf objects like btf, bpffs and etc.
> For example, we can allow some of the BTF objects to be used in
> non-init bpf namespace, then the container user can only trace the
> processes running in his container, but can't get the information of
> tasks running in other containers.
> A simple example is introduced into selftests/bpf on how to use the bpf
> namespace.
> Putting bpf map, prog and link into bpf namespace is the first step.
> Let's start with it.
> [1].
> https://lore.kernel.org/bpf/[email protected]/
> Yafang Shao (13):
> fork: New clone3 flag for BPF namespace
> proc_ns: Extend the field type in struct proc_ns_operations to long
> bpf: Implement bpf namespace
> bpf: No need to check if id is 0
> bpf: Make bpf objects id have the same alloc and free pattern
> bpf: Helpers to alloc and free object id in bpf namespace
> bpf: Add bpf helper to get bpf object id
> bpf: Alloc and free bpf_map id in bpf namespace
> bpf: Alloc and free bpf_prog id in bpf namespace
> bpf: Alloc and free bpf_link id in bpf namespace
> bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace
> bpf: Use bpf_idr_lock array instead
> selftests/bpf: Add selftest for bpf namespace
> fs/proc/namespaces.c | 4 +
> include/linux/bpf.h | 9 +-
> include/linux/bpf_namespace.h | 88 ++++++++++
> include/linux/nsproxy.h | 4 +
> include/linux/proc_ns.h | 3 +-
> include/linux/user_namespace.h | 1 +
> include/uapi/linux/bpf.h | 7 +
> include/uapi/linux/sched.h | 1 +
> kernel/bpf/Makefile | 1 +
> kernel/bpf/bpf_namespace.c | 283
> ++++++++++++++++++++++++++++++
> kernel/bpf/offload.c | 16 +-
> kernel/bpf/syscall.c | 262
> ++++++++++-----------------
> kernel/bpf/task_iter.c | 12 ++
> kernel/fork.c | 5 +-
> kernel/nsproxy.c | 19 +-
> kernel/trace/bpf_trace.c | 2 +
> kernel/ucount.c | 1 +
> tools/bpf/bpftool/skeleton/pid_iter.bpf.c | 13 +-
> tools/include/uapi/linux/bpf.h | 7 +
> tools/testing/selftests/bpf/Makefile | 3 +-
> tools/testing/selftests/bpf/test_bpfns.c | 76 ++++++++
> 21 files changed, 637 insertions(+), 180 deletions(-)
> create mode 100644 include/linux/bpf_namespace.h
> create mode 100644 kernel/bpf/bpf_namespace.c
> create mode 100644 tools/testing/selftests/bpf/test_bpfns.c
> --
> 1.8.3.1
On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
>
> Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> to FDs, that's intended for BPF's security model[1]. Not only does it
> prevent non-privilidged users from getting other users' bpf program, but
> also it prevents the user from iterating his own bpf objects.
>
> In container environment, some users want to run bpf programs in their
> containers. These users can run their bpf programs under CAP_BPF and
> some other specific CAPs, but they can't inspect their bpf programs in a
> generic way. For example, the bpftool can't be used as it requires
> CAP_SYS_ADMIN. That is very inconvenient.
Agreed that it is important to enable tools like bpftool without
CAP_SYS_ADMIN. However, I am not sure whether we need a new
namespace for this. Can we reuse some existing namespace for this?
If we do need a new namespace, maybe we should share some effort
with tracer namespace proposal [1]?
Thanks,
Song
[1] https://lpc.events/event/16/contributions/1237/
Yafang Shao <[email protected]> writes:
> On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Yafang Shao <[email protected]> writes:
>>
>> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
>> > to FDs, that's intended for BPF's security model[1]. Not only does it
>> > prevent non-privilidged users from getting other users' bpf program, but
>> > also it prevents the user from iterating his own bpf objects.
>> >
>> > In container environment, some users want to run bpf programs in their
>> > containers. These users can run their bpf programs under CAP_BPF and
>> > some other specific CAPs, but they can't inspect their bpf programs in a
>> > generic way. For example, the bpftool can't be used as it requires
>> > CAP_SYS_ADMIN. That is very inconvenient.
>> >
>> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
>> > which is not created by the process itself is with SCM_RIGHTS, that
>> > requires each processes which created bpf object has to implement a unix
>> > domain socket to share the fd of a bpf object between different
>> > processes, that is really trivial and troublesome.
>> >
>> > Hence we need a better mechanism to get bpf object info without
>> > CAP_SYS_ADMIN.
>> >
>> > BPF namespace is introduced in this patchset with an attempt to remove
>> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
>> > link in a specific bpf namespace, then these bpf objects will not be
>> > visible to the users in a different bpf namespace. But these bpf
>> > objects are visible to its parent bpf namespace, so the sys admin can
>> > still iterate and inspect them.
>> >
>> > BPF namespace is similar to PID namespace, and the bpf objects are
>> > similar to tasks, so BPF namespace is very easy to understand. These
>> > patchset only implements BPF namespace for bpf map, prog and link. In the
>> > future we may extend it to other bpf objects like btf, bpffs and etc.
>>
>> May? I think we should cover all of the existing BPF objects from the
>> beginning here, or we may miss important interactions that will
>> invalidate the whole idea.
>
> This patchset is intended to address iterating bpf IDs and converting
> IDs to FDs. To be more specific, it covers
> BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
> It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
> but I don't implement it because I find we can do more wrt BTF, for
> example, if we can expose a small amount of BTFs in the vmlinux to
> non-root bpf namespace.
> But, yes, I should implement BTF ID in this patchset.
Right, as you can see by my comment on that patch, not including the btf
id is a tad confusing, so yeah, better include that.
>> In particular, I'm a little worried about the
>> interaction between namespaces and bpffs; what happens if you're in a
>> bpf namespace and you try to read a BPF object from a bpffs that belongs
>> to a different namespace? Does the operation fail? Is the object hidden
>> entirely? Something else?
>>
>
> bpffs is a different topic and it can be implemented in later patchsets.
> bpffs has its own specific problem even without the bpf namespace.
> 1. The user can always get the information of a bpf object through its
> corresponding pinned file.
> In our practice, different container users have different bpffs, and
> we allow the container user to bind-mount its bpffs only, so others'
> bpffs are invisible.
> To make it better with the bpf namespace, I think we can fail the
> operation if the pinned file doesn't belong to its bpf namespace. That
> said, we will add pinned bpf files into the bpf namespace in the next
> step.
>
> 2. The user can always iterate bpf objects through progs.debug and maps.debug
> progs.debug and maps.debug are debugging purposes only. So I think we
> can handle it later.
Well, I disagree. Working out these issues with bpffs is an important
aspect to get a consistent API, and handwaving it away risks merging
something that will turn out to not be workable further down the line at
which point we can't change it.
-Toke
On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <[email protected]> wrote:
>
> On 03/26, Yafang Shao wrote:
> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > to FDs, that's intended for BPF's security model[1]. Not only does it
> > prevent non-privilidged users from getting other users' bpf program, but
> > also it prevents the user from iterating his own bpf objects.
>
> > In container environment, some users want to run bpf programs in their
> > containers. These users can run their bpf programs under CAP_BPF and
> > some other specific CAPs, but they can't inspect their bpf programs in a
> > generic way. For example, the bpftool can't be used as it requires
> > CAP_SYS_ADMIN. That is very inconvenient.
>
> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> > which is not created by the process itself is with SCM_RIGHTS, that
> > requires each processes which created bpf object has to implement a unix
> > domain socket to share the fd of a bpf object between different
> > processes, that is really trivial and troublesome.
>
> > Hence we need a better mechanism to get bpf object info without
> > CAP_SYS_ADMIN.
>
> [..]
>
> > BPF namespace is introduced in this patchset with an attempt to remove
> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > link in a specific bpf namespace, then these bpf objects will not be
> > visible to the users in a different bpf namespace. But these bpf
> > objects are visible to its parent bpf namespace, so the sys admin can
> > still iterate and inspect them.
>
> Does it essentially mean unpriv bpf?
Right. With CAP_BPF and some other CAPs enabled.
> Can I, as a non-root, create
> a new bpf namespace and start loading/attaching progs?
No, you can't create a new bpf namespace as a non-root, see also
copy_namespaces().
In the container environment, new namespaces are always created by
containered, which is started by root.
> Maybe add a paragraph about now vs whatever you're proposing.
What I'm proposing in this patchset is to put bpf objects (map, prog,
link, and btf) into the bpf namespace. Next step I will put bpffs into
the bpf namespace as well.
That said, I'm trying to put all the objects created in bpf into the
bpf namespace. Below is a simple paragraph to illustrate it.
Regarding the unpriv user with CAP_BPF enabled,
Now | Future
------------------------------------------------------------------------
Iterate his BPF IDs | N | Y |
Iterate others' BPF IDs | N | N |
Convert his BPF IDs to FDs | N | Y |
Convert others' BPF IDs to FDs | N | N |
Get others' object info from pinned file | Y(*) | N |
------------------------------------------------------------------------
(*) It can be improved by,
1). Different containers has different bpffs
2). Setting file permission
That's not perfect, for example, if one single user has two bpf
instances, but we don't want them to inspect each other.
> Otherwise it's not very clear what's the security story.
> (haven't looked at the whole series, so maybe it's answered somewhere else?)
>
--
Regards
Yafang
On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
>
> On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> >
> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > to FDs, that's intended for BPF's security model[1]. Not only does it
> > prevent non-privilidged users from getting other users' bpf program, but
> > also it prevents the user from iterating his own bpf objects.
> >
> > In container environment, some users want to run bpf programs in their
> > containers. These users can run their bpf programs under CAP_BPF and
> > some other specific CAPs, but they can't inspect their bpf programs in a
> > generic way. For example, the bpftool can't be used as it requires
> > CAP_SYS_ADMIN. That is very inconvenient.
>
> Agreed that it is important to enable tools like bpftool without
> CAP_SYS_ADMIN. However, I am not sure whether we need a new
> namespace for this. Can we reuse some existing namespace for this?
>
It seems we can't.
> If we do need a new namespace, maybe we should share some effort
> with tracer namespace proposal [1]?
>
Thanks for your information. I will learn the tracer namespace first
and try to analyze how to cooperate with it.
> Thanks,
> Song
>
> [1] https://lpc.events/event/16/contributions/1237/
--
Regards
Yafang
On Tue, Mar 28, 2023 at 4:51 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Yafang Shao <[email protected]> writes:
>
> > On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <[email protected]> wrote:
> >>
> >> Yafang Shao <[email protected]> writes:
> >>
> >> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> >> > to FDs, that's intended for BPF's security model[1]. Not only does it
> >> > prevent non-privilidged users from getting other users' bpf program, but
> >> > also it prevents the user from iterating his own bpf objects.
> >> >
> >> > In container environment, some users want to run bpf programs in their
> >> > containers. These users can run their bpf programs under CAP_BPF and
> >> > some other specific CAPs, but they can't inspect their bpf programs in a
> >> > generic way. For example, the bpftool can't be used as it requires
> >> > CAP_SYS_ADMIN. That is very inconvenient.
> >> >
> >> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> >> > which is not created by the process itself is with SCM_RIGHTS, that
> >> > requires each processes which created bpf object has to implement a unix
> >> > domain socket to share the fd of a bpf object between different
> >> > processes, that is really trivial and troublesome.
> >> >
> >> > Hence we need a better mechanism to get bpf object info without
> >> > CAP_SYS_ADMIN.
> >> >
> >> > BPF namespace is introduced in this patchset with an attempt to remove
> >> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> >> > link in a specific bpf namespace, then these bpf objects will not be
> >> > visible to the users in a different bpf namespace. But these bpf
> >> > objects are visible to its parent bpf namespace, so the sys admin can
> >> > still iterate and inspect them.
> >> >
> >> > BPF namespace is similar to PID namespace, and the bpf objects are
> >> > similar to tasks, so BPF namespace is very easy to understand. These
> >> > patchset only implements BPF namespace for bpf map, prog and link. In the
> >> > future we may extend it to other bpf objects like btf, bpffs and etc.
> >>
> >> May? I think we should cover all of the existing BPF objects from the
> >> beginning here, or we may miss important interactions that will
> >> invalidate the whole idea.
> >
> > This patchset is intended to address iterating bpf IDs and converting
> > IDs to FDs. To be more specific, it covers
> > BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
> > It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
> > but I don't implement it because I find we can do more wrt BTF, for
> > example, if we can expose a small amount of BTFs in the vmlinux to
> > non-root bpf namespace.
> > But, yes, I should implement BTF ID in this patchset.
>
> Right, as you can see by my comment on that patch, not including the btf
> id is a tad confusing, so yeah, better include that.
>
> >> In particular, I'm a little worried about the
> >> interaction between namespaces and bpffs; what happens if you're in a
> >> bpf namespace and you try to read a BPF object from a bpffs that belongs
> >> to a different namespace? Does the operation fail? Is the object hidden
> >> entirely? Something else?
> >>
> >
> > bpffs is a different topic and it can be implemented in later patchsets.
> > bpffs has its own specific problem even without the bpf namespace.
> > 1. The user can always get the information of a bpf object through its
> > corresponding pinned file.
> > In our practice, different container users have different bpffs, and
> > we allow the container user to bind-mount its bpffs only, so others'
> > bpffs are invisible.
> > To make it better with the bpf namespace, I think we can fail the
> > operation if the pinned file doesn't belong to its bpf namespace. That
> > said, we will add pinned bpf files into the bpf namespace in the next
> > step.
> >
> > 2. The user can always iterate bpf objects through progs.debug and maps.debug
> > progs.debug and maps.debug are debugging purposes only. So I think we
> > can handle it later.
>
> Well, I disagree. Working out these issues with bpffs is an important
> aspect to get a consistent API, and handwaving it away risks merging
> something that will turn out to not be workable further down the line at
> which point we can't change it.
>
Sure, I will include bpffs in the next version.
--
Regards
Yafang
On 03/28, Yafang Shao wrote:
> On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <[email protected]> wrote:
> >
> > On 03/26, Yafang Shao wrote:
> > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> IDs
> > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > prevent non-privilidged users from getting other users' bpf program,
> but
> > > also it prevents the user from iterating his own bpf objects.
> >
> > > In container environment, some users want to run bpf programs in their
> > > containers. These users can run their bpf programs under CAP_BPF and
> > > some other specific CAPs, but they can't inspect their bpf programs
> in a
> > > generic way. For example, the bpftool can't be used as it requires
> > > CAP_SYS_ADMIN. That is very inconvenient.
> >
> > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> object
> > > which is not created by the process itself is with SCM_RIGHTS, that
> > > requires each processes which created bpf object has to implement a
> unix
> > > domain socket to share the fd of a bpf object between different
> > > processes, that is really trivial and troublesome.
> >
> > > Hence we need a better mechanism to get bpf object info without
> > > CAP_SYS_ADMIN.
> >
> > [..]
> >
> > > BPF namespace is introduced in this patchset with an attempt to remove
> > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > link in a specific bpf namespace, then these bpf objects will not be
> > > visible to the users in a different bpf namespace. But these bpf
> > > objects are visible to its parent bpf namespace, so the sys admin can
> > > still iterate and inspect them.
> >
> > Does it essentially mean unpriv bpf?
> Right. With CAP_BPF and some other CAPs enabled.
> > Can I, as a non-root, create
> > a new bpf namespace and start loading/attaching progs?
> No, you can't create a new bpf namespace as a non-root, see also
> copy_namespaces().
> In the container environment, new namespaces are always created by
> containered, which is started by root.
Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
from copy_namespaces? Isn't it trivially bypassed with a new user
namespace?
IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
in this particular user-ns. Then I can go on and create a new bpf
namespace (with CAP_BPF) and go wild? I won't see anything from the
other namespaces, but I'll be able to load/attach bpf programs?
> > Maybe add a paragraph about now vs whatever you're proposing.
> What I'm proposing in this patchset is to put bpf objects (map, prog,
> link, and btf) into the bpf namespace. Next step I will put bpffs into
> the bpf namespace as well.
> That said, I'm trying to put all the objects created in bpf into the
> bpf namespace. Below is a simple paragraph to illustrate it.
> Regarding the unpriv user with CAP_BPF enabled,
> Now | Future
> ------------------------------------------------------------------------
> Iterate his BPF IDs | N | Y |
> Iterate others' BPF IDs | N | N |
> Convert his BPF IDs to FDs | N | Y |
> Convert others' BPF IDs to FDs | N | N |
> Get others' object info from pinned file | Y(*) | N |
> ------------------------------------------------------------------------
> (*) It can be improved by,
> 1). Different containers has different bpffs
> 2). Setting file permission
> That's not perfect, for example, if one single user has two bpf
> instances, but we don't want them to inspect each other.
I think the question here is what happens to the existing
capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
And if not, I don't think it integrates well with the user namespaces?
> > Otherwise it's not very clear what's the security story.
> > (haven't looked at the whole series, so maybe it's answered somewhere
> else?)
> >
> --
> Regards
> Yafang
On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <[email protected]> wrote:
>
> On 03/28, Yafang Shao wrote:
> > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <[email protected]> wrote:
> > >
> > > On 03/26, Yafang Shao wrote:
> > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > IDs
> > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > prevent non-privilidged users from getting other users' bpf program,
> > but
> > > > also it prevents the user from iterating his own bpf objects.
> > >
> > > > In container environment, some users want to run bpf programs in their
> > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > some other specific CAPs, but they can't inspect their bpf programs
> > in a
> > > > generic way. For example, the bpftool can't be used as it requires
> > > > CAP_SYS_ADMIN. That is very inconvenient.
> > >
> > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > object
> > > > which is not created by the process itself is with SCM_RIGHTS, that
> > > > requires each processes which created bpf object has to implement a
> > unix
> > > > domain socket to share the fd of a bpf object between different
> > > > processes, that is really trivial and troublesome.
> > >
> > > > Hence we need a better mechanism to get bpf object info without
> > > > CAP_SYS_ADMIN.
> > >
> > > [..]
> > >
> > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > visible to the users in a different bpf namespace. But these bpf
> > > > objects are visible to its parent bpf namespace, so the sys admin can
> > > > still iterate and inspect them.
> > >
> > > Does it essentially mean unpriv bpf?
>
> > Right. With CAP_BPF and some other CAPs enabled.
>
> > > Can I, as a non-root, create
> > > a new bpf namespace and start loading/attaching progs?
>
> > No, you can't create a new bpf namespace as a non-root, see also
> > copy_namespaces().
> > In the container environment, new namespaces are always created by
> > containered, which is started by root.
>
> Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> from copy_namespaces? Isn't it trivially bypassed with a new user
> namespace?
>
> IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> in this particular user-ns. Then I can go on and create a new bpf
> namespace (with CAP_BPF) and go wild? I won't see anything from the
> other namespaces, but I'll be able to load/attach bpf programs?
>
I don't think so. If you create a new userspace, and give the process
the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the
initial namespace, you can't do that. Because currently only CAP_BPF
or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
programs.
> > > Maybe add a paragraph about now vs whatever you're proposing.
>
> > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > the bpf namespace as well.
> > That said, I'm trying to put all the objects created in bpf into the
> > bpf namespace. Below is a simple paragraph to illustrate it.
>
> > Regarding the unpriv user with CAP_BPF enabled,
> > Now | Future
> > ------------------------------------------------------------------------
> > Iterate his BPF IDs | N | Y |
> > Iterate others' BPF IDs | N | N |
> > Convert his BPF IDs to FDs | N | Y |
> > Convert others' BPF IDs to FDs | N | N |
> > Get others' object info from pinned file | Y(*) | N |
> > ------------------------------------------------------------------------
>
> > (*) It can be improved by,
> > 1). Different containers has different bpffs
> > 2). Setting file permission
> > That's not perfect, for example, if one single user has two bpf
> > instances, but we don't want them to inspect each other.
>
> I think the question here is what happens to the existing
> capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
>
They won't become ns_capable(CAP_BPF). If it becomes
ns_capable(CAP_BPF), it will really go wild then.
> And if not, I don't think it integrates well with the user namespaces?
>
IIUC, it is the CAP_BPF which doesn't integrate with the user
namespaces, right?
--
Regards
Yafang
On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <[email protected]> wrote:
>
> On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <[email protected]> wrote:
> >
> > On 03/28, Yafang Shao wrote:
> > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <[email protected]> wrote:
> > > >
> > > > On 03/26, Yafang Shao wrote:
> > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > > IDs
> > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > prevent non-privilidged users from getting other users' bpf program,
> > > but
> > > > > also it prevents the user from iterating his own bpf objects.
> > > >
> > > > > In container environment, some users want to run bpf programs in their
> > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > some other specific CAPs, but they can't inspect their bpf programs
> > > in a
> > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > >
> > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > > object
> > > > > which is not created by the process itself is with SCM_RIGHTS, that
> > > > > requires each processes which created bpf object has to implement a
> > > unix
> > > > > domain socket to share the fd of a bpf object between different
> > > > > processes, that is really trivial and troublesome.
> > > >
> > > > > Hence we need a better mechanism to get bpf object info without
> > > > > CAP_SYS_ADMIN.
> > > >
> > > > [..]
> > > >
> > > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > > visible to the users in a different bpf namespace. But these bpf
> > > > > objects are visible to its parent bpf namespace, so the sys admin can
> > > > > still iterate and inspect them.
> > > >
> > > > Does it essentially mean unpriv bpf?
> >
> > > Right. With CAP_BPF and some other CAPs enabled.
> >
> > > > Can I, as a non-root, create
> > > > a new bpf namespace and start loading/attaching progs?
> >
> > > No, you can't create a new bpf namespace as a non-root, see also
> > > copy_namespaces().
> > > In the container environment, new namespaces are always created by
> > > containered, which is started by root.
> >
> > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> > from copy_namespaces? Isn't it trivially bypassed with a new user
> > namespace?
> >
> > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> > in this particular user-ns. Then I can go on and create a new bpf
> > namespace (with CAP_BPF) and go wild? I won't see anything from the
> > other namespaces, but I'll be able to load/attach bpf programs?
> >
>
> I don't think so. If you create a new userspace, and give the process
> the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the
> initial namespace, you can't do that. Because currently only CAP_BPF
> or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
> programs.
>
> > > > Maybe add a paragraph about now vs whatever you're proposing.
> >
> > > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > > the bpf namespace as well.
> > > That said, I'm trying to put all the objects created in bpf into the
> > > bpf namespace. Below is a simple paragraph to illustrate it.
> >
> > > Regarding the unpriv user with CAP_BPF enabled,
> > > Now | Future
> > > ------------------------------------------------------------------------
> > > Iterate his BPF IDs | N | Y |
> > > Iterate others' BPF IDs | N | N |
> > > Convert his BPF IDs to FDs | N | Y |
> > > Convert others' BPF IDs to FDs | N | N |
> > > Get others' object info from pinned file | Y(*) | N |
> > > ------------------------------------------------------------------------
> >
> > > (*) It can be improved by,
> > > 1). Different containers has different bpffs
> > > 2). Setting file permission
> > > That's not perfect, for example, if one single user has two bpf
> > > instances, but we don't want them to inspect each other.
> >
> > I think the question here is what happens to the existing
> > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
> >
>
> They won't become ns_capable(CAP_BPF). If it becomes
> ns_capable(CAP_BPF), it will really go wild then.
>
> > And if not, I don't think it integrates well with the user namespaces?
> >
>
> IIUC, it is the CAP_BPF which doesn't integrate with the user
> namespaces, right?
Yeah. And it's probably fine if we don't, I just wanted to see some
explanation on the long-term plan.
If the purpose is to have a bpf namespace and use it for pure
isolation purposes, let's state it clearly in the cover letter.
Otherwise it's not clear whether it's only about isolation or
potentially allowing CAP_BPF in user namespaces.
Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit
error? (or maybe it already does, haven't looked at the patches)
One other question I have is: should init bpf namespace see
everything? Otherwise it might be hard to chase down the namespaces
that loaded some BPF program...
> --
> Regards
> Yafang
On Thu, Mar 30, 2023 at 4:50 AM Stanislav Fomichev <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <[email protected]> wrote:
> >
> > On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <[email protected]> wrote:
> > >
> > > On 03/28, Yafang Shao wrote:
> > > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <[email protected]> wrote:
> > > > >
> > > > > On 03/26, Yafang Shao wrote:
> > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > > > IDs
> > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > prevent non-privilidged users from getting other users' bpf program,
> > > > but
> > > > > > also it prevents the user from iterating his own bpf objects.
> > > > >
> > > > > > In container environment, some users want to run bpf programs in their
> > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > some other specific CAPs, but they can't inspect their bpf programs
> > > > in a
> > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > >
> > > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > > > object
> > > > > > which is not created by the process itself is with SCM_RIGHTS, that
> > > > > > requires each processes which created bpf object has to implement a
> > > > unix
> > > > > > domain socket to share the fd of a bpf object between different
> > > > > > processes, that is really trivial and troublesome.
> > > > >
> > > > > > Hence we need a better mechanism to get bpf object info without
> > > > > > CAP_SYS_ADMIN.
> > > > >
> > > > > [..]
> > > > >
> > > > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > > > visible to the users in a different bpf namespace. But these bpf
> > > > > > objects are visible to its parent bpf namespace, so the sys admin can
> > > > > > still iterate and inspect them.
> > > > >
> > > > > Does it essentially mean unpriv bpf?
> > >
> > > > Right. With CAP_BPF and some other CAPs enabled.
> > >
> > > > > Can I, as a non-root, create
> > > > > a new bpf namespace and start loading/attaching progs?
> > >
> > > > No, you can't create a new bpf namespace as a non-root, see also
> > > > copy_namespaces().
> > > > In the container environment, new namespaces are always created by
> > > > containered, which is started by root.
> > >
> > > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> > > from copy_namespaces? Isn't it trivially bypassed with a new user
> > > namespace?
> > >
> > > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> > > in this particular user-ns. Then I can go on and create a new bpf
> > > namespace (with CAP_BPF) and go wild? I won't see anything from the
> > > other namespaces, but I'll be able to load/attach bpf programs?
> > >
> >
> > I don't think so. If you create a new userspace, and give the process
> > the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the
> > initial namespace, you can't do that. Because currently only CAP_BPF
> > or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
> > programs.
> >
> > > > > Maybe add a paragraph about now vs whatever you're proposing.
> > >
> > > > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > > > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > > > the bpf namespace as well.
> > > > That said, I'm trying to put all the objects created in bpf into the
> > > > bpf namespace. Below is a simple paragraph to illustrate it.
> > >
> > > > Regarding the unpriv user with CAP_BPF enabled,
> > > > Now | Future
> > > > ------------------------------------------------------------------------
> > > > Iterate his BPF IDs | N | Y |
> > > > Iterate others' BPF IDs | N | N |
> > > > Convert his BPF IDs to FDs | N | Y |
> > > > Convert others' BPF IDs to FDs | N | N |
> > > > Get others' object info from pinned file | Y(*) | N |
> > > > ------------------------------------------------------------------------
> > >
> > > > (*) It can be improved by,
> > > > 1). Different containers has different bpffs
> > > > 2). Setting file permission
> > > > That's not perfect, for example, if one single user has two bpf
> > > > instances, but we don't want them to inspect each other.
> > >
> > > I think the question here is what happens to the existing
> > > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
> > >
> >
> > They won't become ns_capable(CAP_BPF). If it becomes
> > ns_capable(CAP_BPF), it will really go wild then.
> >
> > > And if not, I don't think it integrates well with the user namespaces?
> > >
> >
> > IIUC, it is the CAP_BPF which doesn't integrate with the user
> > namespaces, right?
>
> Yeah. And it's probably fine if we don't, I just wanted to see some
> explanation on the long-term plan.
> If the purpose is to have a bpf namespace and use it for pure
> isolation purposes, let's state it clearly in the cover letter.
Will add it.
> Otherwise it's not clear whether it's only about isolation or
> potentially allowing CAP_BPF in user namespaces.
> Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit
> error? (or maybe it already does, haven't looked at the patches)
>
Good suggestion. It should return an error. I will change it in the
next version.
> One other question I have is: should init bpf namespace see
> everything? Otherwise it might be hard to chase down the namespaces
> that loaded some BPF program...
Right, the init bpf namespace will see everything.
--
Regards
Yafang
On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
>
<...>
>
> BPF namespace is introduced in this patchset with an attempt to remove
> the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> link in a specific bpf namespace, then these bpf objects will not be
> visible to the users in a different bpf namespace. But these bpf
> objects are visible to its parent bpf namespace, so the sys admin can
> still iterate and inspect them.
>
> BPF namespace is similar to PID namespace, and the bpf objects are
> similar to tasks, so BPF namespace is very easy to understand. These
> patchset only implements BPF namespace for bpf map, prog and link. In the
> future we may extend it to other bpf objects like btf, bpffs and etc.
> For example, we can allow some of the BTF objects to be used in
> non-init bpf namespace, then the container user can only trace the
> processes running in his container, but can't get the information of
> tasks running in other containers.
>
Hi Yafang,
Thanks for putting effort toward enabling BPF for container users!
However, I think the cover letter can be improved. It's unclear to me
what exactly is BPF namespace, what exactly it tries to achieve and
what is its behavior. If you look at the manpage of pid namespace [1],
cgroup namespace[2], and namespace[3], they all have a very precise
definition, their goals and explain the intended behaviors well.
I felt you intended the BPF namespace to provide isolation of object
ids. That is, different views of the bpf object ids for different
processes. This is like the PID namespace. But somehow, you also
attach CAPs on top of that. That, I think, is not a namespace's job.
Well, I could be wrong, but would appreciate you adding more details
as follow-up.
Hao
[1] https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
[2] https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html
[3] https://man7.org/linux/man-pages/man7/namespaces.7.html
On Fri, Mar 31, 2023 at 1:52 PM Hao Luo <[email protected]> wrote:
>
> On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> >
> <...>
> >
> > BPF namespace is introduced in this patchset with an attempt to remove
> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > link in a specific bpf namespace, then these bpf objects will not be
> > visible to the users in a different bpf namespace. But these bpf
> > objects are visible to its parent bpf namespace, so the sys admin can
> > still iterate and inspect them.
> >
> > BPF namespace is similar to PID namespace, and the bpf objects are
> > similar to tasks, so BPF namespace is very easy to understand. These
> > patchset only implements BPF namespace for bpf map, prog and link. In the
> > future we may extend it to other bpf objects like btf, bpffs and etc.
> > For example, we can allow some of the BTF objects to be used in
> > non-init bpf namespace, then the container user can only trace the
> > processes running in his container, but can't get the information of
> > tasks running in other containers.
> >
>
> Hi Yafang,
>
> Thanks for putting effort toward enabling BPF for container users!
>
> However, I think the cover letter can be improved. It's unclear to me
> what exactly is BPF namespace, what exactly it tries to achieve and
> what is its behavior. If you look at the manpage of pid namespace [1],
> cgroup namespace[2], and namespace[3], they all have a very precise
> definition, their goals and explain the intended behaviors well.
>
Thanks for your suggestion. The covetter should be improved. I will
read the man pages of these namespaces and improve it as you
suggested.
> I felt you intended the BPF namespace to provide isolation of object
> ids. That is, different views of the bpf object ids for different
> processes. This is like the PID namespace. But somehow, you also
> attach CAPs on top of that. That, I think, is not a namespace's job.
>
Agree with you that it should be independent of CAPs.
After the bpf namespace is introduced, actually we don't need the CAPs
when the user iterates IDs or converts IDs to FDs in his bpf namespace
(except in the init bpf namespace), because these are all readonly
operations and the user can only read the bpf objects created by
himself. While the CAPs should be required when the user wants to
write something, e.g. creating a map, loading a prog. They are really
different things. I will change it in the next version.
> Well, I could be wrong, but would appreciate you adding more details
> as follow-up.
>
> Hao
>
> [1] https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
> [2] https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html
> [3] https://man7.org/linux/man-pages/man7/namespaces.7.html
--
Regards
Yafang
On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
> >
> > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> > >
> > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > prevent non-privilidged users from getting other users' bpf program, but
> > > also it prevents the user from iterating his own bpf objects.
> > >
> > > In container environment, some users want to run bpf programs in their
> > > containers. These users can run their bpf programs under CAP_BPF and
> > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > generic way. For example, the bpftool can't be used as it requires
> > > CAP_SYS_ADMIN. That is very inconvenient.
> >
> > Agreed that it is important to enable tools like bpftool without
> > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > namespace for this. Can we reuse some existing namespace for this?
> >
>
> It seems we can't.
Yafang,
It's a Nack.
The only thing you've been trying to "solve" with bpf namespace is to
allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
The concept of bpf namespace is not even close to be thought through.
Others pointed out the gaps in the design. Like bpffs. There are plenty.
Please do not send patches like this in the future.
You need to start with describing the problem you want to solve,
then propose _several_ solutions, describe their pros and cons,
solicit feedback, present at the conferences (like LSFMMBPF or LPC),
and when the community agrees that 1. problem is worth solving,
2. the solution makes sense, only then work on patches.
"In container environment, some users want to run bpf programs in their containers."
is something Song brought up at LSFMMBPF a year ago.
At that meeting most of the folks agreed that there is a need to run bpf
in containers and make sure that the effect of bpf prog is limited to a container.
This new namespace that creates virtual IDs for progs and maps doesn't come
close in solving this task.
On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
> > >
> > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> > > >
> > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > also it prevents the user from iterating his own bpf objects.
> > > >
> > > > In container environment, some users want to run bpf programs in their
> > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > generic way. For example, the bpftool can't be used as it requires
> > > > CAP_SYS_ADMIN. That is very inconvenient.
> > >
> > > Agreed that it is important to enable tools like bpftool without
> > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > namespace for this. Can we reuse some existing namespace for this?
> > >
> >
> > It seems we can't.
>
> Yafang,
>
> It's a Nack.
>
> The only thing you've been trying to "solve" with bpf namespace is to
> allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> The concept of bpf namespace is not even close to be thought through.
Right, it is more likely a PoC in its current state.
> Others pointed out the gaps in the design. Like bpffs. There are plenty.
> Please do not send patches like this in the future.
The reason I sent it with an early state is that I want to get some
early feedback from the community ahead of the LSF/MM/BPF workshop,
then I can improve it based on these feedbacks and present it more
specifically at the workshop. Then the discussion will be more
effective.
> You need to start with describing the problem you want to solve,
> then propose _several_ solutions, describe their pros and cons,
> solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> and when the community agrees that 1. problem is worth solving,
> 2. the solution makes sense, only then work on patches.
>
I would like to give a short discussion on the BPF namespace if
everything goes fine.
> "In container environment, some users want to run bpf programs in their containers."
> is something Song brought up at LSFMMBPF a year ago.
> At that meeting most of the folks agreed that there is a need to run bpf
> in containers and make sure that the effect of bpf prog is limited to a container.
> This new namespace that creates virtual IDs for progs and maps doesn't come
> close in solving this task.
Currently in our production environment, all the containers running
bpf programs are privileged, that is risky. So actually the goal of
the BPF namespace is to make them (or part of them) non-privileged.
But some of the abilities of these bpf programs will be lost in this
procedure, like the debug-bility with bpftool, so we need to fix it.
Agree with you that this goal is far from making bpf programs safely
running in a container environment.
--
Regards
Yafang
On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
> > > >
> > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> > > > >
> > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > also it prevents the user from iterating his own bpf objects.
> > > > >
> > > > > In container environment, some users want to run bpf programs in their
> > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > >
> > > > Agreed that it is important to enable tools like bpftool without
> > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > namespace for this. Can we reuse some existing namespace for this?
> > > >
> > >
> > > It seems we can't.
> >
> > Yafang,
> >
> > It's a Nack.
> >
> > The only thing you've been trying to "solve" with bpf namespace is to
> > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > The concept of bpf namespace is not even close to be thought through.
>
> Right, it is more likely a PoC in its current state.
>
> > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > Please do not send patches like this in the future.
>
> The reason I sent it with an early state is that I want to get some
> early feedback from the community ahead of the LSF/MM/BPF workshop,
> then I can improve it based on these feedbacks and present it more
> specifically at the workshop. Then the discussion will be more
> effective.
>
> > You need to start with describing the problem you want to solve,
> > then propose _several_ solutions, describe their pros and cons,
> > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > and when the community agrees that 1. problem is worth solving,
> > 2. the solution makes sense, only then work on patches.
> >
>
> I would like to give a short discussion on the BPF namespace if
> everything goes fine.
Not in this shape of BPF namespace as done in this patch set.
We've talked about BPF namespace in the past. This is not it.
> > "In container environment, some users want to run bpf programs in their containers."
> > is something Song brought up at LSFMMBPF a year ago.
> > At that meeting most of the folks agreed that there is a need to run bpf
> > in containers and make sure that the effect of bpf prog is limited to a container.
> > This new namespace that creates virtual IDs for progs and maps doesn't come
> > close in solving this task.
>
> Currently in our production environment, all the containers running
> bpf programs are privileged, that is risky. So actually the goal of
> the BPF namespace is to make them (or part of them) non-privileged.
> But some of the abilities of these bpf programs will be lost in this
> procedure, like the debug-bility with bpftool, so we need to fix it.
> Agree with you that this goal is far from making bpf programs safely
> running in a container environment.
I disagree that allowing admin to run bpftool without sudo is a task
worth solving. The visibility of bpf progs in a container is a different task.
Without doing any kernel changes we can add a flag to bpftool to let
'bpftool prog show' list progs that were loaded by processes in the same cgroup.
bpftool already does prog->pid mapping with bpf iterators.
It can filter by cgroup just as well.
On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
> > > > >
> > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> > > > > >
> > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > >
> > > > > > In container environment, some users want to run bpf programs in their
> > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > >
> > > > > Agreed that it is important to enable tools like bpftool without
> > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > >
> > > >
> > > > It seems we can't.
> > >
> > > Yafang,
> > >
> > > It's a Nack.
> > >
> > > The only thing you've been trying to "solve" with bpf namespace is to
> > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > The concept of bpf namespace is not even close to be thought through.
> >
> > Right, it is more likely a PoC in its current state.
> >
> > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > Please do not send patches like this in the future.
> >
> > The reason I sent it with an early state is that I want to get some
> > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > then I can improve it based on these feedbacks and present it more
> > specifically at the workshop. Then the discussion will be more
> > effective.
> >
> > > You need to start with describing the problem you want to solve,
> > > then propose _several_ solutions, describe their pros and cons,
> > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > and when the community agrees that 1. problem is worth solving,
> > > 2. the solution makes sense, only then work on patches.
> > >
> >
> > I would like to give a short discussion on the BPF namespace if
> > everything goes fine.
>
> Not in this shape of BPF namespace as done in this patch set.
> We've talked about BPF namespace in the past. This is not it.
>
> > > "In container environment, some users want to run bpf programs in their containers."
> > > is something Song brought up at LSFMMBPF a year ago.
> > > At that meeting most of the folks agreed that there is a need to run bpf
> > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > close in solving this task.
> >
> > Currently in our production environment, all the containers running
> > bpf programs are privileged, that is risky. So actually the goal of
> > the BPF namespace is to make them (or part of them) non-privileged.
> > But some of the abilities of these bpf programs will be lost in this
> > procedure, like the debug-bility with bpftool, so we need to fix it.
> > Agree with you that this goal is far from making bpf programs safely
> > running in a container environment.
>
> I disagree that allowing admin to run bpftool without sudo is a task
> worth solving. The visibility of bpf progs in a container is a different task.
> Without doing any kernel changes we can add a flag to bpftool to let
> 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> bpftool already does prog->pid mapping with bpf iterators.
> It can filter by cgroup just as well.
IIUC, at least we need bellow change in the kernel,
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
next_id++;
spin_lock_bh(lock);
if (!idr_get_next(idr, &next_id))
Because the container doesn't have CAP_SYS_ADMIN enabled, while they
only have CAP_BPF and other required CAPs.
Another possible solution is that we run an agent in the host, and the
user in the container who wants to get the bpf objects info in his
container should send a request to this agent via unix domain socket.
That is what we are doing now in our production environment. That
said, each container has to run a client to get the bpf object fd.
There are some downsides,
- It can't handle pinned bpf programs
For pinned programs, the user can get them from the pinned files
directly, so he can use bpftool in his case, only with some
complaints.
- If the user attached the bpf prog, and then removed the pinned
file, but didn't detach it.
That happened. But this error case can't be handled.
- There may be other corner cases that it can't fit.
There's a solution to improve it, but we also need to change the
kernel. That is, we can use the wasted space btf->name.
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b7e5a55..59d73a3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
u32 btf_data_size,
err = -ENOMEM;
goto errout;
}
+ snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
+ current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
env->btf = btf;
data = kvmalloc(btf_data_size, GFP_KERNEL | __GFP_NOWARN);
--
Regards
Yafang
On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote:
> On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
> > > > > >
> > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> > > > > > >
> > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > > >
> > > > > > > In container environment, some users want to run bpf programs in their
> > > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > > >
> > > > > > Agreed that it is important to enable tools like bpftool without
> > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > > >
> > > > >
> > > > > It seems we can't.
> > > >
> > > > Yafang,
> > > >
> > > > It's a Nack.
> > > >
> > > > The only thing you've been trying to "solve" with bpf namespace is to
> > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > > The concept of bpf namespace is not even close to be thought through.
> > >
> > > Right, it is more likely a PoC in its current state.
> > >
> > > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > > Please do not send patches like this in the future.
> > >
> > > The reason I sent it with an early state is that I want to get some
> > > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > > then I can improve it based on these feedbacks and present it more
> > > specifically at the workshop. Then the discussion will be more
> > > effective.
> > >
> > > > You need to start with describing the problem you want to solve,
> > > > then propose _several_ solutions, describe their pros and cons,
> > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > > and when the community agrees that 1. problem is worth solving,
> > > > 2. the solution makes sense, only then work on patches.
> > > >
> > >
> > > I would like to give a short discussion on the BPF namespace if
> > > everything goes fine.
> >
> > Not in this shape of BPF namespace as done in this patch set.
> > We've talked about BPF namespace in the past. This is not it.
> >
> > > > "In container environment, some users want to run bpf programs in their containers."
> > > > is something Song brought up at LSFMMBPF a year ago.
> > > > At that meeting most of the folks agreed that there is a need to run bpf
> > > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > > close in solving this task.
> > >
> > > Currently in our production environment, all the containers running
> > > bpf programs are privileged, that is risky. So actually the goal of
> > > the BPF namespace is to make them (or part of them) non-privileged.
> > > But some of the abilities of these bpf programs will be lost in this
> > > procedure, like the debug-bility with bpftool, so we need to fix it.
> > > Agree with you that this goal is far from making bpf programs safely
> > > running in a container environment.
> >
> > I disagree that allowing admin to run bpftool without sudo is a task
> > worth solving. The visibility of bpf progs in a container is a different task.
> > Without doing any kernel changes we can add a flag to bpftool to let
> > 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> > bpftool already does prog->pid mapping with bpf iterators.
> > It can filter by cgroup just as well.
>
> IIUC, at least we need bellow change in the kernel,
No. The user should just 'sudo bpftool ...' instead.
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> return -EINVAL;
>
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;
> -
> next_id++;
> spin_lock_bh(lock);
> if (!idr_get_next(idr, &next_id))
>
> Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> only have CAP_BPF and other required CAPs.
>
> Another possible solution is that we run an agent in the host, and the
> user in the container who wants to get the bpf objects info in his
> container should send a request to this agent via unix domain socket.
> That is what we are doing now in our production environment. That
> said, each container has to run a client to get the bpf object fd.
None of such hacks are necessary. People that debug bpf setups with bpftool
can always sudo.
> There are some downsides,
> - It can't handle pinned bpf programs
> For pinned programs, the user can get them from the pinned files
> directly, so he can use bpftool in his case, only with some
> complaints.
> - If the user attached the bpf prog, and then removed the pinned
> file, but didn't detach it.
> That happened. But this error case can't be handled.
> - There may be other corner cases that it can't fit.
>
> There's a solution to improve it, but we also need to change the
> kernel. That is, we can use the wasted space btf->name.
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index b7e5a55..59d73a3 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> u32 btf_data_size,
> err = -ENOMEM;
> goto errout;
> }
> + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
Unnecessary.
comm, pid, cgroup can be printed by bpftool without changing the kernel.
On Thu, Apr 6, 2023 at 10:07 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote:
> > On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <[email protected]> wrote:
> > > > > > >
> > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > > > >
> > > > > > > > In container environment, some users want to run bpf programs in their
> > > > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > > > >
> > > > > > > Agreed that it is important to enable tools like bpftool without
> > > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > > > >
> > > > > >
> > > > > > It seems we can't.
> > > > >
> > > > > Yafang,
> > > > >
> > > > > It's a Nack.
> > > > >
> > > > > The only thing you've been trying to "solve" with bpf namespace is to
> > > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > > > The concept of bpf namespace is not even close to be thought through.
> > > >
> > > > Right, it is more likely a PoC in its current state.
> > > >
> > > > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > > > Please do not send patches like this in the future.
> > > >
> > > > The reason I sent it with an early state is that I want to get some
> > > > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > > > then I can improve it based on these feedbacks and present it more
> > > > specifically at the workshop. Then the discussion will be more
> > > > effective.
> > > >
> > > > > You need to start with describing the problem you want to solve,
> > > > > then propose _several_ solutions, describe their pros and cons,
> > > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > > > and when the community agrees that 1. problem is worth solving,
> > > > > 2. the solution makes sense, only then work on patches.
> > > > >
> > > >
> > > > I would like to give a short discussion on the BPF namespace if
> > > > everything goes fine.
> > >
> > > Not in this shape of BPF namespace as done in this patch set.
> > > We've talked about BPF namespace in the past. This is not it.
> > >
> > > > > "In container environment, some users want to run bpf programs in their containers."
> > > > > is something Song brought up at LSFMMBPF a year ago.
> > > > > At that meeting most of the folks agreed that there is a need to run bpf
> > > > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > > > close in solving this task.
> > > >
> > > > Currently in our production environment, all the containers running
> > > > bpf programs are privileged, that is risky. So actually the goal of
> > > > the BPF namespace is to make them (or part of them) non-privileged.
> > > > But some of the abilities of these bpf programs will be lost in this
> > > > procedure, like the debug-bility with bpftool, so we need to fix it.
> > > > Agree with you that this goal is far from making bpf programs safely
> > > > running in a container environment.
> > >
> > > I disagree that allowing admin to run bpftool without sudo is a task
> > > worth solving. The visibility of bpf progs in a container is a different task.
> > > Without doing any kernel changes we can add a flag to bpftool to let
> > > 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> > > bpftool already does prog->pid mapping with bpf iterators.
> > > It can filter by cgroup just as well.
> >
> > IIUC, at least we need bellow change in the kernel,
>
> No. The user should just 'sudo bpftool ...' instead.
>
It seems that I didn't describe the issue clearly.
The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
required to run bpftool, so the bpftool running in the container
can't get the ID of bpf objects or convert IDs to FDs.
Is there something that I missed ?
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> > return -EINVAL;
> >
> > - if (!capable(CAP_SYS_ADMIN))
> > - return -EPERM;
> > -
> > next_id++;
> > spin_lock_bh(lock);
> > if (!idr_get_next(idr, &next_id))
> >
> > Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> > only have CAP_BPF and other required CAPs.
> >
> > Another possible solution is that we run an agent in the host, and the
> > user in the container who wants to get the bpf objects info in his
> > container should send a request to this agent via unix domain socket.
> > That is what we are doing now in our production environment. That
> > said, each container has to run a client to get the bpf object fd.
>
> None of such hacks are necessary. People that debug bpf setups with bpftool
> can always sudo.
>
> > There are some downsides,
> > - It can't handle pinned bpf programs
> > For pinned programs, the user can get them from the pinned files
> > directly, so he can use bpftool in his case, only with some
> > complaints.
> > - If the user attached the bpf prog, and then removed the pinned
> > file, but didn't detach it.
> > That happened. But this error case can't be handled.
> > - There may be other corner cases that it can't fit.
> >
> > There's a solution to improve it, but we also need to change the
> > kernel. That is, we can use the wasted space btf->name.
> >
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index b7e5a55..59d73a3 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> > u32 btf_data_size,
> > err = -ENOMEM;
> > goto errout;
> > }
> > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
>
> Unnecessary.
> comm, pid, cgroup can be printed by bpftool without changing the kernel.
Some questions,
- What if the process exits after attaching the bpf prog and the prog
is not auto-detachable?
For example, the reuserport bpf prog is not auto-detachable. After
pins the reuserport bpf prog, a task can attach it through the pinned
bpf file, but if the task forgets to detach it and the pinned file is
removed, then it seems there's no way to figure out which task or
cgroup this prog belongs to...
- Could you pls. explain in detail how to get comm, pid, or cgroup
from a pinned bpffs file?
--
Regards
Yafang
On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
>
> It seems that I didn't describe the issue clearly.
> The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> required to run bpftool, so the bpftool running in the container
> can't get the ID of bpf objects or convert IDs to FDs.
> Is there something that I missed ?
Nothing. This is by design. bpftool needs sudo. That's all.
>
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> > > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> > > return -EINVAL;
> > >
> > > - if (!capable(CAP_SYS_ADMIN))
> > > - return -EPERM;
> > > -
> > > next_id++;
> > > spin_lock_bh(lock);
> > > if (!idr_get_next(idr, &next_id))
> > >
> > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> > > only have CAP_BPF and other required CAPs.
> > >
> > > Another possible solution is that we run an agent in the host, and the
> > > user in the container who wants to get the bpf objects info in his
> > > container should send a request to this agent via unix domain socket.
> > > That is what we are doing now in our production environment. That
> > > said, each container has to run a client to get the bpf object fd.
> >
> > None of such hacks are necessary. People that debug bpf setups with bpftool
> > can always sudo.
> >
> > > There are some downsides,
> > > - It can't handle pinned bpf programs
> > > For pinned programs, the user can get them from the pinned files
> > > directly, so he can use bpftool in his case, only with some
> > > complaints.
> > > - If the user attached the bpf prog, and then removed the pinned
> > > file, but didn't detach it.
> > > That happened. But this error case can't be handled.
> > > - There may be other corner cases that it can't fit.
> > >
> > > There's a solution to improve it, but we also need to change the
> > > kernel. That is, we can use the wasted space btf->name.
> > >
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index b7e5a55..59d73a3 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> > > u32 btf_data_size,
> > > err = -ENOMEM;
> > > goto errout;
> > > }
> > > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> > > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
> >
> > Unnecessary.
> > comm, pid, cgroup can be printed by bpftool without changing the kernel.
>
> Some questions,
> - What if the process exits after attaching the bpf prog and the prog
> is not auto-detachable?
> For example, the reuserport bpf prog is not auto-detachable. After
> pins the reuserport bpf prog, a task can attach it through the pinned
> bpf file, but if the task forgets to detach it and the pinned file is
> removed, then it seems there's no way to figure out which task or
> cgroup this prog belongs to...
you're saying that there is a bpf prog in the kernel without
corresponding user space ? Meaning no user space process has an FD
that points to this prog or FD to a map that this prog is using?
In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> - Could you pls. explain in detail how to get comm, pid, or cgroup
> from a pinned bpffs file?
pinned bpf prog and no user space holds FD to it?
It's not part of any cgroup. Nothing to print.
On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> >
> > It seems that I didn't describe the issue clearly.
> > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > required to run bpftool, so the bpftool running in the container
> > can't get the ID of bpf objects or convert IDs to FDs.
> > Is there something that I missed ?
>
> Nothing. This is by design. bpftool needs sudo. That's all.
>
Hmm, what I'm trying to do is make bpftool run without sudo.
>
> >
> > > > --- a/kernel/bpf/syscall.c
> > > > +++ b/kernel/bpf/syscall.c
> > > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> > > > if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> > > > return -EINVAL;
> > > >
> > > > - if (!capable(CAP_SYS_ADMIN))
> > > > - return -EPERM;
> > > > -
> > > > next_id++;
> > > > spin_lock_bh(lock);
> > > > if (!idr_get_next(idr, &next_id))
> > > >
> > > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> > > > only have CAP_BPF and other required CAPs.
> > > >
> > > > Another possible solution is that we run an agent in the host, and the
> > > > user in the container who wants to get the bpf objects info in his
> > > > container should send a request to this agent via unix domain socket.
> > > > That is what we are doing now in our production environment. That
> > > > said, each container has to run a client to get the bpf object fd.
> > >
> > > None of such hacks are necessary. People that debug bpf setups with bpftool
> > > can always sudo.
> > >
> > > > There are some downsides,
> > > > - It can't handle pinned bpf programs
> > > > For pinned programs, the user can get them from the pinned files
> > > > directly, so he can use bpftool in his case, only with some
> > > > complaints.
> > > > - If the user attached the bpf prog, and then removed the pinned
> > > > file, but didn't detach it.
> > > > That happened. But this error case can't be handled.
> > > > - There may be other corner cases that it can't fit.
> > > >
> > > > There's a solution to improve it, but we also need to change the
> > > > kernel. That is, we can use the wasted space btf->name.
> > > >
> > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > index b7e5a55..59d73a3 100644
> > > > --- a/kernel/bpf/btf.c
> > > > +++ b/kernel/bpf/btf.c
> > > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> > > > u32 btf_data_size,
> > > > err = -ENOMEM;
> > > > goto errout;
> > > > }
> > > > + snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> > > > + current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
> > >
> > > Unnecessary.
> > > comm, pid, cgroup can be printed by bpftool without changing the kernel.
> >
> > Some questions,
> > - What if the process exits after attaching the bpf prog and the prog
> > is not auto-detachable?
> > For example, the reuserport bpf prog is not auto-detachable. After
> > pins the reuserport bpf prog, a task can attach it through the pinned
> > bpf file, but if the task forgets to detach it and the pinned file is
> > removed, then it seems there's no way to figure out which task or
> > cgroup this prog belongs to...
>
> you're saying that there is a bpf prog in the kernel without
> corresponding user space ?
No, it is corresponding to user space. For example, it may be
corresponding to a socket fd, or a cgroup fd.
> Meaning no user space process has an FD
> that points to this prog or FD to a map that this prog is using?
> In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
>
Even if it is kernel bpf prog, it is created by a process. The user
needs to know which one created it.
> > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > from a pinned bpffs file?
>
> pinned bpf prog and no user space holds FD to it?
> It's not part of any cgroup. Nothing to print.
As I explained above, even if it holds nothing, the user needs to know
the information from it. For example, if it is expected, which one
created it?
--
Regards
Yafang
On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
>
> On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > >
> > > It seems that I didn't describe the issue clearly.
> > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > required to run bpftool, so the bpftool running in the container
> > > can't get the ID of bpf objects or convert IDs to FDs.
> > > Is there something that I missed ?
> >
> > Nothing. This is by design. bpftool needs sudo. That's all.
> >
>
> Hmm, what I'm trying to do is make bpftool run without sudo.
This is not a task that is worth solving.
> > > Some questions,
> > > - What if the process exits after attaching the bpf prog and the prog
> > > is not auto-detachable?
> > > For example, the reuserport bpf prog is not auto-detachable. After
> > > pins the reuserport bpf prog, a task can attach it through the pinned
> > > bpf file, but if the task forgets to detach it and the pinned file is
> > > removed, then it seems there's no way to figure out which task or
> > > cgroup this prog belongs to...
> >
> > you're saying that there is a bpf prog in the kernel without
> > corresponding user space ?
>
> No, it is corresponding to user space. For example, it may be
> corresponding to a socket fd, or a cgroup fd.
>
> > Meaning no user space process has an FD
> > that points to this prog or FD to a map that this prog is using?
> > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> >
>
> Even if it is kernel bpf prog, it is created by a process. The user
> needs to know which one created it.
In some situations it's certainly interesting to know which process
loaded a particular program.
In many other situations it's irrelevant.
For example, the process that loaded a prog could have been moved to a
different cgroup.
If you want to track the loading you need to install bpf_lsm
that monitors prog_load hook and collect that info.
It's not the job of the kernel to do it.
> > > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > > from a pinned bpffs file?
> >
> > pinned bpf prog and no user space holds FD to it?
> > It's not part of any cgroup. Nothing to print.
>
> As I explained above, even if it holds nothing, the user needs to know
> the information from it. For example, if it is expected, which one
> created it?
See the answer above. The kernel has enough hooks already to provide
this information to user space. No kernel changes necessary.
On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> >
> > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > >
> > > > It seems that I didn't describe the issue clearly.
> > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > required to run bpftool, so the bpftool running in the container
> > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > Is there something that I missed ?
> > >
> > > Nothing. This is by design. bpftool needs sudo. That's all.
> > >
> >
> > Hmm, what I'm trying to do is make bpftool run without sudo.
>
> This is not a task that is worth solving.
>
Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > > Some questions,
> > > > - What if the process exits after attaching the bpf prog and the prog
> > > > is not auto-detachable?
> > > > For example, the reuserport bpf prog is not auto-detachable. After
> > > > pins the reuserport bpf prog, a task can attach it through the pinned
> > > > bpf file, but if the task forgets to detach it and the pinned file is
> > > > removed, then it seems there's no way to figure out which task or
> > > > cgroup this prog belongs to...
> > >
> > > you're saying that there is a bpf prog in the kernel without
> > > corresponding user space ?
> >
> > No, it is corresponding to user space. For example, it may be
> > corresponding to a socket fd, or a cgroup fd.
> >
> > > Meaning no user space process has an FD
> > > that points to this prog or FD to a map that this prog is using?
> > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> > >
> >
> > Even if it is kernel bpf prog, it is created by a process. The user
> > needs to know which one created it.
>
> In some situations it's certainly interesting to know which process
> loaded a particular program.
> In many other situations it's irrelevant.
> For example, the process that loaded a prog could have been moved to a
> different cgroup.
> If you want to track the loading you need to install bpf_lsm
> that monitors prog_load hook and collect that info.
> It's not the job of the kernel to do it.
>
Agreed with you that we can add lots of hooks to track every detail of
the operations.
But it is not free. More hooks, more overhead.
If we can change the kernel to make it lightweight, why not...
> > > > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > > > from a pinned bpffs file?
> > >
> > > pinned bpf prog and no user space holds FD to it?
> > > It's not part of any cgroup. Nothing to print.
> >
> > As I explained above, even if it holds nothing, the user needs to know
> > the information from it. For example, if it is expected, which one
> > created it?
>
> See the answer above. The kernel has enough hooks already to provide
> this information to user space. No kernel changes necessary.
--
Regards
Yafang
On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
>
> On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > >
> > > > > It seems that I didn't describe the issue clearly.
> > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > required to run bpftool, so the bpftool running in the container
> > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > Is there something that I missed ?
> > > >
> > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > >
> > >
> > > Hmm, what I'm trying to do is make bpftool run without sudo.
> >
> > This is not a task that is worth solving.
> >
>
> Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
I'll leave the BPF namespace discussion aside (I agree that it needs
way more thought).
I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
you to take over someone else's link and stuff like this. But just
iterating IDs seems like a pretty innocent functionality, so maybe we
should remove CAP_SYS_ADMIN for GET_NEXT_ID?
By itself GET_NEXT_ID is relatively useless without capabilities, but
we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
for a while now, and that seems useful in itself, as it would indeed
help tools like bpftool to get *some* information even without
privileges. Whether those GET_INFO_BY_ID operations should return same
full bpf_{prog,map,link,btf}_info or some trimmed down version of them
would be up to discussion, but I think getting some info without
creating an FD seems useful in itself.
Would it be worth discussing and solving this separately from
namespacing issues?
>
> > > > > Some questions,
> > > > > - What if the process exits after attaching the bpf prog and the prog
> > > > > is not auto-detachable?
> > > > > For example, the reuserport bpf prog is not auto-detachable. After
> > > > > pins the reuserport bpf prog, a task can attach it through the pinned
> > > > > bpf file, but if the task forgets to detach it and the pinned file is
> > > > > removed, then it seems there's no way to figure out which task or
> > > > > cgroup this prog belongs to...
> > > >
> > > > you're saying that there is a bpf prog in the kernel without
> > > > corresponding user space ?
> > >
> > > No, it is corresponding to user space. For example, it may be
> > > corresponding to a socket fd, or a cgroup fd.
> > >
> > > > Meaning no user space process has an FD
> > > > that points to this prog or FD to a map that this prog is using?
> > > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> > > >
> > >
> > > Even if it is kernel bpf prog, it is created by a process. The user
> > > needs to know which one created it.
> >
> > In some situations it's certainly interesting to know which process
> > loaded a particular program.
> > In many other situations it's irrelevant.
> > For example, the process that loaded a prog could have been moved to a
> > different cgroup.
> > If you want to track the loading you need to install bpf_lsm
> > that monitors prog_load hook and collect that info.
> > It's not the job of the kernel to do it.
> >
>
> Agreed with you that we can add lots of hooks to track every detail of
> the operations.
> But it is not free. More hooks, more overhead.
> If we can change the kernel to make it lightweight, why not...
>
> > > > > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > > > > from a pinned bpffs file?
> > > >
> > > > pinned bpf prog and no user space holds FD to it?
> > > > It's not part of any cgroup. Nothing to print.
> > >
> > > As I explained above, even if it holds nothing, the user needs to know
> > > the information from it. For example, if it is expected, which one
> > > created it?
> >
> > See the answer above. The kernel has enough hooks already to provide
> > this information to user space. No kernel changes necessary.
>
>
>
> --
> Regards
> Yafang
On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> >
> > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > >
> > > > > > It seems that I didn't describe the issue clearly.
> > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > Is there something that I missed ?
> > > > >
> > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > >
> > > >
> > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > >
> > > This is not a task that is worth solving.
> > >
> >
> > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
>
> I'll leave the BPF namespace discussion aside (I agree that it needs
> way more thought).
>
> I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> you to take over someone else's link and stuff like this. But just
> iterating IDs seems like a pretty innocent functionality, so maybe we
> should remove CAP_SYS_ADMIN for GET_NEXT_ID?
>
> By itself GET_NEXT_ID is relatively useless without capabilities, but
> we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> for a while now, and that seems useful in itself, as it would indeed
> help tools like bpftool to get *some* information even without
> privileges. Whether those GET_INFO_BY_ID operations should return same
> full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> would be up to discussion, but I think getting some info without
> creating an FD seems useful in itself.
>
> Would it be worth discussing and solving this separately from
> namespacing issues?
Iteration of IDs itself is fine. The set of IDs is not security sensitive,
but GET_NEXT_BY_ID has to be carefully restricted.
It returns xlated, jited, BTF, line info, etc
and with all the restrictions it would need something like
CAP_SYS_PTRACE and CAP_PERFMON to be useful.
And with that we're not far from CAP_SYS_ADMIN.
Why bother then?
On Fri, Apr 7, 2023 at 9:44 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > >
> > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > Is there something that I missed ?
> > > > > >
> > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > >
> > > > >
> > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > >
> > > > This is not a task that is worth solving.
> > > >
> > >
> > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> >
> > I'll leave the BPF namespace discussion aside (I agree that it needs
> > way more thought).
> >
> > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > you to take over someone else's link and stuff like this. But just
> > iterating IDs seems like a pretty innocent functionality, so maybe we
> > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> >
> > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > for a while now, and that seems useful in itself, as it would indeed
> > help tools like bpftool to get *some* information even without
> > privileges. Whether those GET_INFO_BY_ID operations should return same
> > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > would be up to discussion, but I think getting some info without
> > creating an FD seems useful in itself.
> >
> > Would it be worth discussing and solving this separately from
> > namespacing issues?
>
> Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> but GET_NEXT_BY_ID has to be carefully restricted.
> It returns xlated, jited, BTF, line info, etc
> and with all the restrictions it would need something like
> CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> And with that we're not far from CAP_SYS_ADMIN.
> Why bother then?
Not sure if I get your point clearly. I think the reason we introduce
CAP_BPF is that we don't want the user to enable CAP_SYS_ADMIN.
Enabling specific CAPs instead of CAP_SYS_ADMIN should be our
alignment goal, right?
If so, then it is worth doing. As Andrii suggested, a trimmed down
version is also helpful and should be acceptable. Agreed with you that
we have lots of things to do, but if we don't try to move it forward,
it will always be as-is.
--
Regards
Yafang
On Thu, Apr 6, 2023 at 9:34 PM Yafang Shao <[email protected]> wrote:
>
> On Fri, Apr 7, 2023 at 9:44 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > > >
> > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > Is there something that I missed ?
> > > > > > >
> > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > >
> > > > > >
> > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > >
> > > > > This is not a task that is worth solving.
> > > > >
> > > >
> > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > >
> > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > way more thought).
> > >
> > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > you to take over someone else's link and stuff like this. But just
> > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > >
> > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > for a while now, and that seems useful in itself, as it would indeed
> > > help tools like bpftool to get *some* information even without
> > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > would be up to discussion, but I think getting some info without
> > > creating an FD seems useful in itself.
> > >
> > > Would it be worth discussing and solving this separately from
> > > namespacing issues?
> >
> > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > but GET_NEXT_BY_ID has to be carefully restricted.
> > It returns xlated, jited, BTF, line info, etc
> > and with all the restrictions it would need something like
> > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > And with that we're not far from CAP_SYS_ADMIN.
> > Why bother then?
>
> Not sure if I get your point clearly. I think the reason we introduce
> CAP_BPF is that we don't want the user to enable CAP_SYS_ADMIN.
> Enabling specific CAPs instead of CAP_SYS_ADMIN should be our
> alignment goal, right?
We want users to switch to CAP_BPF (potentially with CAP_PERFMON)
from full CAP_SYS_ADMIN to reduce attack surface of production workloads.
bpftool is a tool for humans to do introspection and debugging.
It will stay root only.
> If so, then it is worth doing. As Andrii suggested, a trimmed down
> version is also helpful and should be acceptable.
It's not helpful. It's actively misleading.
On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > >
> > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > Is there something that I missed ?
> > > > > >
> > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > >
> > > > >
> > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > >
> > > > This is not a task that is worth solving.
> > > >
> > >
> > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> >
> > I'll leave the BPF namespace discussion aside (I agree that it needs
> > way more thought).
> >
> > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > you to take over someone else's link and stuff like this. But just
> > iterating IDs seems like a pretty innocent functionality, so maybe we
> > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> >
> > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > for a while now, and that seems useful in itself, as it would indeed
> > help tools like bpftool to get *some* information even without
> > privileges. Whether those GET_INFO_BY_ID operations should return same
> > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > would be up to discussion, but I think getting some info without
> > creating an FD seems useful in itself.
> >
> > Would it be worth discussing and solving this separately from
> > namespacing issues?
>
> Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> but GET_NEXT_BY_ID has to be carefully restricted.
> It returns xlated, jited, BTF, line info, etc
> and with all the restrictions it would need something like
> CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> And with that we're not far from CAP_SYS_ADMIN.
> Why bother then?
You probably meant that GET_INFO_BY_ID should be carefully restricted?
So yeah, that's what I said that this would have to be discussed
further. I agree that returning func/line info, program dump, etc is
probably a privileged part. But there is plenty of useful info besides
that (e.g., prog name, insns cnt, run stats, etc) that would be useful
for unpriv applications to monitor their own apps that they opened
from BPF FS, or just some observability daemons.
There is a lot of useful information in bpf_map_info and bpf_link_info
that's way less privileged. I think bpf_link_info is good as is. Same
for bpf_map_info.
Either way, I'm not insisting, just something that seems pretty simple
to add and useful in some scenarios. We can reuse existing code and
types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
those privileged fields you mentioned. Anyway, something to put on the
backburner, perhaps.
On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > > >
> > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > Is there something that I missed ?
> > > > > > >
> > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > >
> > > > > >
> > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > >
> > > > > This is not a task that is worth solving.
> > > > >
> > > >
> > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > >
> > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > way more thought).
> > >
> > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > you to take over someone else's link and stuff like this. But just
> > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > >
> > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > for a while now, and that seems useful in itself, as it would indeed
> > > help tools like bpftool to get *some* information even without
> > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > would be up to discussion, but I think getting some info without
> > > creating an FD seems useful in itself.
> > >
> > > Would it be worth discussing and solving this separately from
> > > namespacing issues?
> >
> > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > but GET_NEXT_BY_ID has to be carefully restricted.
> > It returns xlated, jited, BTF, line info, etc
> > and with all the restrictions it would need something like
> > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > And with that we're not far from CAP_SYS_ADMIN.
> > Why bother then?
>
> You probably meant that GET_INFO_BY_ID should be carefully restricted?
yes.
> So yeah, that's what I said that this would have to be discussed
> further. I agree that returning func/line info, program dump, etc is
> probably a privileged part. But there is plenty of useful info besides
> that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> for unpriv applications to monitor their own apps that they opened
> from BPF FS, or just some observability daemons.
>
> There is a lot of useful information in bpf_map_info and bpf_link_info
> that's way less privileged. I think bpf_link_info is good as is. Same
> for bpf_map_info.
>
> Either way, I'm not insisting, just something that seems pretty simple
> to add and useful in some scenarios. We can reuse existing code and
> types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> those privileged fields you mentioned. Anyway, something to put on the
> backburner, perhaps.
Sorry, but I only see negatives. It's an extra code in the kernel
that has to be carefully reviewed when initially submitted and
then every patch that touches get_info_by_id would have to go
through a microscope every time to avoid introducing a security issue.
And for what? So that CAP_BPF application can read prog name and run stats?
On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > > Is there something that I missed ?
> > > > > > > >
> > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > > >
> > > > > > >
> > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > > >
> > > > > > This is not a task that is worth solving.
> > > > > >
> > > > >
> > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > >
> > > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > > way more thought).
> > > >
> > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > > you to take over someone else's link and stuff like this. But just
> > > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > > >
> > > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > > for a while now, and that seems useful in itself, as it would indeed
> > > > help tools like bpftool to get *some* information even without
> > > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > > would be up to discussion, but I think getting some info without
> > > > creating an FD seems useful in itself.
> > > >
> > > > Would it be worth discussing and solving this separately from
> > > > namespacing issues?
> > >
> > > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > > but GET_NEXT_BY_ID has to be carefully restricted.
> > > It returns xlated, jited, BTF, line info, etc
> > > and with all the restrictions it would need something like
> > > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > > And with that we're not far from CAP_SYS_ADMIN.
> > > Why bother then?
> >
> > You probably meant that GET_INFO_BY_ID should be carefully restricted?
>
> yes.
>
> > So yeah, that's what I said that this would have to be discussed
> > further. I agree that returning func/line info, program dump, etc is
> > probably a privileged part. But there is plenty of useful info besides
> > that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> > for unpriv applications to monitor their own apps that they opened
> > from BPF FS, or just some observability daemons.
> >
> > There is a lot of useful information in bpf_map_info and bpf_link_info
> > that's way less privileged. I think bpf_link_info is good as is. Same
> > for bpf_map_info.
> >
> > Either way, I'm not insisting, just something that seems pretty simple
> > to add and useful in some scenarios. We can reuse existing code and
> > types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> > those privileged fields you mentioned. Anyway, something to put on the
> > backburner, perhaps.
>
> Sorry, but I only see negatives. It's an extra code in the kernel
> that has to be carefully reviewed when initially submitted and
> then every patch that touches get_info_by_id would have to go
> through a microscope every time to avoid introducing a security issue.
> And for what? So that CAP_BPF application can read prog name and run stats?
Per my experience, observability is a very important part for a
project. If the user can't observe the object directly created by it,
he will worry about or even mistrust it.
However I don't insist on it either if you think we shouldn't do it.
--
Regards
Yafang
On Fri, Apr 7, 2023 at 9:22 AM Yafang Shao <[email protected]> wrote:
>
> On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov
> <[email protected]> wrote:
> >
> > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > > > Is there something that I missed ?
> > > > > > > > >
> > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > > > >
> > > > > > > This is not a task that is worth solving.
> > > > > > >
> > > > > >
> > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > > >
> > > > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > > > way more thought).
> > > > >
> > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > > > you to take over someone else's link and stuff like this. But just
> > > > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > > > >
> > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > > > for a while now, and that seems useful in itself, as it would indeed
> > > > > help tools like bpftool to get *some* information even without
> > > > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > > > would be up to discussion, but I think getting some info without
> > > > > creating an FD seems useful in itself.
> > > > >
> > > > > Would it be worth discussing and solving this separately from
> > > > > namespacing issues?
> > > >
> > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > > > but GET_NEXT_BY_ID has to be carefully restricted.
> > > > It returns xlated, jited, BTF, line info, etc
> > > > and with all the restrictions it would need something like
> > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > > > And with that we're not far from CAP_SYS_ADMIN.
> > > > Why bother then?
> > >
> > > You probably meant that GET_INFO_BY_ID should be carefully restricted?
> >
> > yes.
> >
> > > So yeah, that's what I said that this would have to be discussed
> > > further. I agree that returning func/line info, program dump, etc is
> > > probably a privileged part. But there is plenty of useful info besides
> > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> > > for unpriv applications to monitor their own apps that they opened
> > > from BPF FS, or just some observability daemons.
> > >
> > > There is a lot of useful information in bpf_map_info and bpf_link_info
> > > that's way less privileged. I think bpf_link_info is good as is. Same
> > > for bpf_map_info.
> > >
> > > Either way, I'm not insisting, just something that seems pretty simple
> > > to add and useful in some scenarios. We can reuse existing code and
> > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> > > those privileged fields you mentioned. Anyway, something to put on the
> > > backburner, perhaps.
> >
> > Sorry, but I only see negatives. It's an extra code in the kernel
> > that has to be carefully reviewed when initially submitted and
> > then every patch that touches get_info_by_id would have to go
> > through a microscope every time to avoid introducing a security issue.
> > And for what? So that CAP_BPF application can read prog name and run stats?
>
> Per my experience, observability is a very important part for a
> project. If the user can't observe the object directly created by it,
> he will worry about or even mistrust it.
The user can observe the objects just fine. That's what get_info_by_fd is for.
But the kernel will not report JITed instructions to unpriv user who
just loaded a prog and a sole owner of it.
By your definition such a user should not trust the kernel. So be it.
On Sat, Apr 8, 2023 at 12:32 AM Alexei Starovoitov
<[email protected]> wrote:
>
> On Fri, Apr 7, 2023 at 9:22 AM Yafang Shao <[email protected]> wrote:
> >
> > On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov
> > <[email protected]> wrote:
> > >
> > > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
> > > <[email protected]> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > > > > required to run bpftool, so the bpftool running in the container
> > > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > > > > Is there something that I missed ?
> > > > > > > > > >
> > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > > > > >
> > > > > > > > This is not a task that is worth solving.
> > > > > > > >
> > > > > > >
> > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > > > >
> > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > > > > way more thought).
> > > > > >
> > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > > > > you to take over someone else's link and stuff like this. But just
> > > > > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > > > > >
> > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > > > > for a while now, and that seems useful in itself, as it would indeed
> > > > > > help tools like bpftool to get *some* information even without
> > > > > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > > > > would be up to discussion, but I think getting some info without
> > > > > > creating an FD seems useful in itself.
> > > > > >
> > > > > > Would it be worth discussing and solving this separately from
> > > > > > namespacing issues?
> > > > >
> > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > > > > but GET_NEXT_BY_ID has to be carefully restricted.
> > > > > It returns xlated, jited, BTF, line info, etc
> > > > > and with all the restrictions it would need something like
> > > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > > > > And with that we're not far from CAP_SYS_ADMIN.
> > > > > Why bother then?
> > > >
> > > > You probably meant that GET_INFO_BY_ID should be carefully restricted?
> > >
> > > yes.
> > >
> > > > So yeah, that's what I said that this would have to be discussed
> > > > further. I agree that returning func/line info, program dump, etc is
> > > > probably a privileged part. But there is plenty of useful info besides
> > > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> > > > for unpriv applications to monitor their own apps that they opened
> > > > from BPF FS, or just some observability daemons.
> > > >
> > > > There is a lot of useful information in bpf_map_info and bpf_link_info
> > > > that's way less privileged. I think bpf_link_info is good as is. Same
> > > > for bpf_map_info.
> > > >
> > > > Either way, I'm not insisting, just something that seems pretty simple
> > > > to add and useful in some scenarios. We can reuse existing code and
> > > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> > > > those privileged fields you mentioned. Anyway, something to put on the
> > > > backburner, perhaps.
> > >
> > > Sorry, but I only see negatives. It's an extra code in the kernel
> > > that has to be carefully reviewed when initially submitted and
> > > then every patch that touches get_info_by_id would have to go
> > > through a microscope every time to avoid introducing a security issue.
> > > And for what? So that CAP_BPF application can read prog name and run stats?
> >
> > Per my experience, observability is a very important part for a
> > project. If the user can't observe the object directly created by it,
> > he will worry about or even mistrust it.
>
> The user can observe the objects just fine. That's what get_info_by_fd is for.
> But the kernel will not report JITed instructions to unpriv user who
> just loaded a prog and a sole owner of it.
There's no UAPI to create the JITed instructions directly per my
understanding. The JITed instructions are created by the kernel.
While they're really UAPI to create a map, prog, and link.
> By your definition such a user should not trust the kernel. So be it.
--
Regards
Yafang