Per an earlier discussion last year [1], I have been looking for a mechanism to a) collect resource usage for devices (GPUs for now, but there could be other device types in the future) and b) possibly enforce limits on some of that usage. The obvious mechanism is cgroups, but there is too much diversity in GPU hardware architectures to define a common cgroup interface at this point. An alternative is to leverage tracepoints with a bpf program inside a cgroup hierarchy for usage collection and enforcement (via writable tracepoints.)
This is a prototype of that idea. It is incomplete, but I would like to solicit some feedback before continuing, to make sure I am going down the right path. This prototype is built on my understanding of the following:
- tracepoints (and kprobes, uprobes) are associated with perf events
- perf events/tracepoints can serve as hooks for bpf progs, but those bpf progs are not part of the cgroup hierarchy
- bpf progs can be attached to the cgroup hierarchy, which gives them cgroup local storage and other benefits
- separately, the perf subsystem has a cgroup controller (perf cgroup) that allows perf events to be triggered with a cgroup filter
So the key idea of this RFC is to leverage the hierarchical organization of bpf-cgroup for perf events/tracepoints.
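For context, this is roughly the existing per-event path the first two points refer to (a userspace sketch using only current interfaces; the helper name is mine and error handling is minimal): a tracepoint is opened as a perf event and a bpf prog is attached to it, with no cgroup hierarchy involved.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch only: open tracepoint tp_id (from tracefs events/<...>/id) as a
 * perf event on one cpu and attach an already-loaded bpf prog to it.
 */
static int attach_bpf_to_tracepoint(int tp_id, int prog_fd, int cpu)
{
	struct perf_event_attr attr;
	int efd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.config = tp_id;
	attr.sample_type = PERF_SAMPLE_RAW;
	attr.sample_period = 1;
	attr.wakeup_events = 1;

	efd = syscall(__NR_perf_event_open, &attr, -1 /* pid */, cpu,
		      -1 /* group_fd */, 0 /* flags */);
	if (efd < 0)
		return -1;

	if (ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd) ||
	    ioctl(efd, PERF_EVENT_IOC_ENABLE, 0)) {
		close(efd);
		return -1;
	}

	return efd;
}

The RFC keeps this perf-event plumbing, but creates and manages the events per cgroup inside the kernel instead.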
==Known unresolved topics (feedback very much welcome)==
Storage:
I came across the idea of "preallocated" memory for bpf hash maps/storage to avoid deadlocks [2], but I don't have a good understanding of it yet. If the existing bpf_cgroup_storage_type maps are not considered pre-allocated, I am thinking we could introduce a new type, but I am not sure whether that is needed yet.
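For reference, my (possibly wrong) reading of [2] is that regular bpf hash maps are preallocated by default unless BPF_F_NO_PREALLOC is set; e.g. a strawman map keyed by cgroup id for usage accounting:

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Strawman only: without BPF_F_NO_PREALLOC in map_flags, the elements of
 * a hash map are preallocated at map creation time.
 */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);	/* cgroup id */
	__type(value, __u64);	/* accumulated device usage */
} dev_usage SEC(".maps");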
Scalability:
Scalability concerns have been raised about perf cgroup [3], and there seems to be a recent solution to them with bperf [4]. This RFC does not change the status quo on the scalability question, but if I understand the bperf idea correctly, this RFC may have some similarity to it.
[1] https://lore.kernel.org/netdev/[email protected]/T/#m52bc26bbbf16131c48e6b34d875c87660943c452
[2] https://lwn.net/Articles/679074/
[3] https://www.linuxplumbersconf.org/event/4/contributions/291/attachments/313/528/Linux_Plumbers_Conference_2019.pdf
[4] https://linuxplumbersconf.org/event/11/contributions/899/
Kenny Ho (4):
cgroup, perf: Add ability to connect to perf cgroup from other cgroup
controller
bpf, perf: add ability to attach complete array of bpf prog to perf
event
bpf,cgroup,tracing: add new BPF_PROG_TYPE_CGROUP_TRACEPOINT
bpf,cgroup,perf: extend bpf-cgroup to support tracepoint attachment
include/linux/bpf-cgroup.h | 17 +++++--
include/linux/bpf_types.h | 4 ++
include/linux/cgroup.h | 2 +
include/linux/perf_event.h | 6 +++
include/linux/trace_events.h | 9 ++++
include/uapi/linux/bpf.h | 2 +
kernel/bpf/cgroup.c | 96 +++++++++++++++++++++++++++++-------
kernel/bpf/syscall.c | 4 ++
kernel/cgroup/cgroup.c | 13 ++---
kernel/events/core.c | 62 +++++++++++++++++++++++
kernel/trace/bpf_trace.c | 36 ++++++++++++++
11 files changed, 222 insertions(+), 29 deletions(-)
--
2.25.1
Change-Id: Ic9727186bb8c76c757e48635143b16e607f2299f
Signed-off-by: Kenny Ho <[email protected]>
---
include/linux/bpf-cgroup.h | 2 ++
include/linux/bpf_types.h | 4 ++++
include/uapi/linux/bpf.h | 2 ++
kernel/bpf/syscall.c | 4 ++++
kernel/trace/bpf_trace.c | 8 ++++++++
5 files changed, 20 insertions(+)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 2746fd804216..a5e4d9b19470 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -48,6 +48,7 @@ enum cgroup_bpf_attach_type {
CGROUP_INET4_GETSOCKNAME,
CGROUP_INET6_GETSOCKNAME,
CGROUP_INET_SOCK_RELEASE,
+ CGROUP_TRACEPOINT,
MAX_CGROUP_BPF_ATTACH_TYPE
};
@@ -81,6 +82,7 @@ to_cgroup_bpf_attach_type(enum bpf_attach_type attach_type)
CGROUP_ATYPE(CGROUP_INET4_GETSOCKNAME);
CGROUP_ATYPE(CGROUP_INET6_GETSOCKNAME);
CGROUP_ATYPE(CGROUP_INET_SOCK_RELEASE);
+ CGROUP_ATYPE(CGROUP_TRACEPOINT);
default:
return CGROUP_BPF_ATTACH_TYPE_INVALID;
}
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 9c81724e4b98..c108f498a35e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -57,6 +57,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl,
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt,
struct bpf_sockopt, struct bpf_sockopt_kern)
#endif
+#if defined (CONFIG_BPF_EVENTS) && defined (CONFIG_CGROUP_BPF)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_TRACEPOINT, cg_tracepoint,
+ __u64, u64)
+#endif
#ifdef CONFIG_BPF_LIRC_MODE2
BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
__u32, u32)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6fc59d61937a..014ffaa3fc2a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -949,6 +949,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+ BPF_PROG_TYPE_CGROUP_TRACEPOINT,
};
enum bpf_attach_type {
@@ -994,6 +995,7 @@ enum bpf_attach_type {
BPF_SK_REUSEPORT_SELECT,
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
BPF_PERF_EVENT,
+ BPF_CGROUP_TRACEPOINT,
__MAX_BPF_ATTACH_TYPE
};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e50c0bfdb7d..d77598fa4eb2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2149,6 +2149,7 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
case BPF_PROG_TYPE_LSM:
case BPF_PROG_TYPE_STRUCT_OPS: /* has access to struct sock */
case BPF_PROG_TYPE_EXT: /* extends any prog */
+ case BPF_PROG_TYPE_CGROUP_TRACEPOINT:
return true;
default:
return false;
@@ -3137,6 +3138,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
return BPF_PROG_TYPE_SK_LOOKUP;
case BPF_XDP:
return BPF_PROG_TYPE_XDP;
+ case BPF_CGROUP_TRACEPOINT:
+ return BPF_PROG_TYPE_CGROUP_TRACEPOINT;
default:
return BPF_PROG_TYPE_UNSPEC;
}
@@ -3189,6 +3192,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
+ case BPF_PROG_TYPE_CGROUP_TRACEPOINT:
case BPF_PROG_TYPE_SOCK_OPS:
ret = cgroup_bpf_prog_attach(attr, ptype, prog);
break;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 8addd10202c2..4ad864a4852a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1798,6 +1798,14 @@ const struct bpf_verifier_ops perf_event_verifier_ops = {
const struct bpf_prog_ops perf_event_prog_ops = {
};
+const struct bpf_verifier_ops cg_tracepoint_verifier_ops = {
+ .get_func_proto = tp_prog_func_proto,
+ .is_valid_access = tp_prog_is_valid_access,
+};
+
+const struct bpf_prog_ops cg_tracepoint_prog_ops = {
+};
+
static DEFINE_MUTEX(bpf_event_mutex);
#define BPF_TRACE_MAX_PROGS 64
--
2.25.1
Change-Id: Ie2580c3a71e2a5116551879358cb5304b04d3838
Signed-off-by: Kenny Ho <[email protected]>
---
include/linux/trace_events.h | 9 +++++++++
kernel/trace/bpf_trace.c | 28 ++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 3e475eeb5a99..5cfe3d08966c 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -725,6 +725,8 @@ trace_trigger_soft_disabled(struct trace_event_file *file)
#ifdef CONFIG_BPF_EVENTS
unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
+int perf_event_attach_bpf_prog_array(struct perf_event *event,
+ struct bpf_prog_array *new_array);
int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie);
void perf_event_detach_bpf_prog(struct perf_event *event);
int perf_event_query_prog_array(struct perf_event *event, void __user *info);
@@ -741,6 +743,13 @@ static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *c
return 1;
}
+static inline int
+perf_event_attach_bpf_prog_array(struct perf_event *event,
+ struct bpf_prog_array *new_array)
+{
+ return -EOPNOTSUPP;
+}
+
static inline int
perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie)
{
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6b3153841a33..8addd10202c2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1802,6 +1802,34 @@ static DEFINE_MUTEX(bpf_event_mutex);
#define BPF_TRACE_MAX_PROGS 64
+int perf_event_attach_bpf_prog_array(struct perf_event *event,
+ struct bpf_prog_array *new_array)
+{
+ struct bpf_prog_array_item *item;
+ struct bpf_prog_array *old_array;
+
+ if (!new_array)
+ return -EINVAL;
+
+ if (bpf_prog_array_length(new_array) >= BPF_TRACE_MAX_PROGS)
+ return -E2BIG;
+
+ if (!trace_kprobe_on_func_entry(event->tp_event) ||
+ !trace_kprobe_error_injectable(event->tp_event))
+ for (item = new_array->items; item->prog; item++)
+ if (item->prog->kprobe_override)
+ return -EINVAL;
+
+ mutex_lock(&bpf_event_mutex);
+
+ old_array = bpf_event_rcu_dereference(event->tp_event->prog_array);
+ rcu_assign_pointer(event->tp_event->prog_array, new_array);
+ bpf_prog_array_free(old_array);
+
+ mutex_unlock(&bpf_event_mutex);
+ return 0;
+}
+
int perf_event_attach_bpf_prog(struct perf_event *event,
struct bpf_prog *prog,
u64 bpf_cookie)
--
2.25.1
bpf progs are attached to cgroups as usual, with the idea that the
effective progs remain the same. The perf event / tracepoint id is
defined as the attachment 'subtype'. The 'subtype' is passed along
during attachment via bpf_attr, reusing the replace_bpf_fd field.
After the effective progs are calculated, a perf_event is allocated
using the 'subtype' value for each cpu, filtering on the perf cgroup
that corresponds to the bpf-cgroup (assuming a unified hierarchy.) The
effective bpf prog array is then attached to each newly allocated
perf_event and subsequently enabled by activate_effective_progs.
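For illustration, attaching from userspace would look roughly like the sketch below (raw bpf(2) call; the helper name is made up, and BPF_CGROUP_TRACEPOINT only exists with this series applied):

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch only: attach a loaded BPF_PROG_TYPE_CGROUP_TRACEPOINT prog to a
 * cgroup, passing the tracepoint id as the 'subtype' through the reused
 * replace_bpf_fd field.  BPF_F_ALLOW_MULTI is required and BPF_F_REPLACE
 * is rejected for this attach type.
 */
static int attach_cg_tracepoint(int cgroup_fd, int prog_fd, int tp_id)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;		/* fd of the cgroup directory */
	attr.attach_bpf_fd = prog_fd;
	attr.attach_type = BPF_CGROUP_TRACEPOINT;
	attr.attach_flags = BPF_F_ALLOW_MULTI;
	attr.replace_bpf_fd = tp_id;		/* 'subtype': tracepoint id */

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}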
Change-Id: I07a4dcaa0a682bafa496f05411365100d6c84fff
Signed-off-by: Kenny Ho <[email protected]>
---
include/linux/bpf-cgroup.h | 15 ++++--
include/linux/perf_event.h | 4 ++
kernel/bpf/cgroup.c | 96 +++++++++++++++++++++++++++++++-------
kernel/cgroup/cgroup.c | 9 ++--
kernel/events/core.c | 45 ++++++++++++++++++
5 files changed, 142 insertions(+), 27 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index a5e4d9b19470..b6e22fd2aa6e 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -154,6 +154,11 @@ struct cgroup_bpf {
/* cgroup_bpf is released using a work queue */
struct work_struct release_work;
+
+ /* list of perf events (per child cgroups) for tracepoint/kprobe/uprobe bpf attachment to cgroup */
+ /* TODO: array of tp type with array of events for each cgroup
+ * currently only one tp type supported at a time */
+ struct list_head per_cg_events;
};
int cgroup_bpf_inherit(struct cgroup *cgrp);
@@ -161,21 +166,21 @@ void cgroup_bpf_offline(struct cgroup *cgrp);
int __cgroup_bpf_attach(struct cgroup *cgrp,
struct bpf_prog *prog, struct bpf_prog *replace_prog,
- struct bpf_cgroup_link *link,
+ struct bpf_cgroup_link *link, int bpf_attach_subtype,
enum bpf_attach_type type, u32 flags);
int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
struct bpf_cgroup_link *link,
- enum bpf_attach_type type);
+ enum bpf_attach_type type, int bpf_attach_subtype);
int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
union bpf_attr __user *uattr);
/* Wrapper for __cgroup_bpf_*() protected by cgroup_mutex */
int cgroup_bpf_attach(struct cgroup *cgrp,
struct bpf_prog *prog, struct bpf_prog *replace_prog,
- struct bpf_cgroup_link *link, enum bpf_attach_type type,
- u32 flags);
+ struct bpf_cgroup_link *link, int bpf_attach_subtype,
+ enum bpf_attach_type type, u32 flags);
int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
- enum bpf_attach_type type);
+ enum bpf_attach_type type, int bpf_attach_subtype);
int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
union bpf_attr __user *uattr);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9c440db65c18..5a149d8865a1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -776,6 +776,7 @@ struct perf_event {
#ifdef CONFIG_CGROUP_PERF
struct perf_cgroup *cgrp; /* cgroup event is attach to */
+ struct list_head bpf_cg_list;
#endif
#ifdef CONFIG_SECURITY
@@ -982,6 +983,9 @@ extern void perf_pmu_resched(struct pmu *pmu);
extern int perf_event_refresh(struct perf_event *event, int refresh);
extern void perf_event_update_userpage(struct perf_event *event);
extern int perf_event_release_kernel(struct perf_event *event);
+extern int perf_event_create_for_all_cpus(struct perf_event_attr *attr,
+ struct cgroup *cgroup,
+ struct list_head *entries);
extern struct perf_event *
perf_event_create_kernel_counter(struct perf_event_attr *attr,
int cpu,
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 03145d45e3d5..0ecf465ddfb2 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -14,6 +14,8 @@
#include <linux/string.h>
#include <linux/bpf.h>
#include <linux/bpf-cgroup.h>
+#include <linux/perf_event.h>
+#include <linux/trace_events.h>
#include <net/sock.h>
#include <net/bpf_sk_storage.h>
@@ -112,6 +114,8 @@ static void cgroup_bpf_release(struct work_struct *work)
struct bpf_prog_array *old_array;
struct list_head *storages = &cgrp->bpf.storages;
struct bpf_cgroup_storage *storage, *stmp;
+ struct list_head *events = &cgrp->bpf.per_cg_events;
+ struct perf_event *event, *etmp;
unsigned int atype;
@@ -141,6 +145,10 @@ static void cgroup_bpf_release(struct work_struct *work)
bpf_cgroup_storage_free(storage);
}
+ list_for_each_entry_safe(event, etmp, events, bpf_cg_list) {
+ perf_event_release_kernel(event);
+ }
+
mutex_unlock(&cgroup_mutex);
for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
@@ -226,13 +234,16 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp,
*/
static int compute_effective_progs(struct cgroup *cgrp,
enum cgroup_bpf_attach_type atype,
+ int bpf_attach_subtype,
struct bpf_prog_array **array)
{
struct bpf_prog_array_item *item;
struct bpf_prog_array *progs;
struct bpf_prog_list *pl;
struct cgroup *p = cgrp;
- int cnt = 0;
+ struct perf_event *event, *etmp;
+ struct perf_event_attr attr = {};
+ int rc, cnt = 0;
/* count number of effective programs by walking parents */
do {
@@ -245,6 +256,21 @@ static int compute_effective_progs(struct cgroup *cgrp,
if (!progs)
return -ENOMEM;
+ if (atype == CGROUP_TRACEPOINT) {
+ /* TODO: only create event for cgroup that can have process */
+
+ attr.config = bpf_attach_subtype;
+ attr.type = PERF_TYPE_TRACEPOINT;
+ attr.sample_type = PERF_SAMPLE_RAW;
+ attr.sample_period = 1;
+ attr.wakeup_events = 1;
+
+ rc = perf_event_create_for_all_cpus(&attr, cgrp,
+ &cgrp->bpf.per_cg_events);
+ if (rc)
+ goto err;
+ }
+
/* populate the array with effective progs */
cnt = 0;
p = cgrp;
@@ -264,20 +290,41 @@ static int compute_effective_progs(struct cgroup *cgrp,
}
} while ((p = cgroup_parent(p)));
+ if (atype == CGROUP_TRACEPOINT) {
+ list_for_each_entry_safe(event, etmp, &cgrp->bpf.per_cg_events, bpf_cg_list) {
+ rc = perf_event_attach_bpf_prog_array(event, progs);
+ if (rc)
+ goto err_attach;
+ }
+ }
+
*array = progs;
return 0;
+err_attach:
+ list_for_each_entry_safe(event, etmp, &cgrp->bpf.per_cg_events, bpf_cg_list)
+ perf_event_release_kernel(event);
+err:
+ bpf_prog_array_free(progs);
+ return rc;
}
static void activate_effective_progs(struct cgroup *cgrp,
enum cgroup_bpf_attach_type atype,
struct bpf_prog_array *old_array)
{
- old_array = rcu_replace_pointer(cgrp->bpf.effective[atype], old_array,
- lockdep_is_held(&cgroup_mutex));
- /* free prog array after grace period, since __cgroup_bpf_run_*()
- * might be still walking the array
- */
- bpf_prog_array_free(old_array);
+ struct perf_event *event, *etmp;
+
+ if (atype == CGROUP_TRACEPOINT)
+ list_for_each_entry_safe(event, etmp, &cgrp->bpf.per_cg_events, bpf_cg_list)
+ perf_event_enable(event);
+ else {
+ old_array = rcu_replace_pointer(cgrp->bpf.effective[atype], old_array,
+ lockdep_is_held(&cgroup_mutex));
+ /* free prog array after grace period, since __cgroup_bpf_run_*()
+ * might be still walking the array
+ */
+ bpf_prog_array_free(old_array);
+ }
}
/**
@@ -306,9 +353,10 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->bpf.progs[i]);
INIT_LIST_HEAD(&cgrp->bpf.storages);
+ INIT_LIST_HEAD(&cgrp->bpf.per_cg_events);
for (i = 0; i < NR; i++)
- if (compute_effective_progs(cgrp, i, &arrays[i]))
+ if (compute_effective_progs(cgrp, i, -1, &arrays[i]))
goto cleanup;
for (i = 0; i < NR; i++)
@@ -328,7 +376,8 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
}
static int update_effective_progs(struct cgroup *cgrp,
- enum cgroup_bpf_attach_type atype)
+ enum cgroup_bpf_attach_type atype,
+ int bpf_attach_subtype)
{
struct cgroup_subsys_state *css;
int err;
@@ -340,7 +389,8 @@ static int update_effective_progs(struct cgroup *cgrp,
if (percpu_ref_is_zero(&desc->bpf.refcnt))
continue;
- err = compute_effective_progs(desc, atype, &desc->bpf.inactive);
+ err = compute_effective_progs(desc, atype, bpf_attach_subtype,
+ &desc->bpf.inactive);
if (err)
goto cleanup;
}
@@ -424,6 +474,7 @@ static struct bpf_prog_list *find_attach_entry(struct list_head *progs,
* @prog: A program to attach
* @link: A link to attach
* @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
+ * @bpf_attach_subtype: Type ID of perf tracing event for tracepoint/kprobe/uprobe
* @type: Type of attach operation
* @flags: Option flags
*
@@ -432,7 +483,7 @@ static struct bpf_prog_list *find_attach_entry(struct list_head *progs,
*/
int __cgroup_bpf_attach(struct cgroup *cgrp,
struct bpf_prog *prog, struct bpf_prog *replace_prog,
- struct bpf_cgroup_link *link,
+ struct bpf_cgroup_link *link, int bpf_attach_subtype,
enum bpf_attach_type type, u32 flags)
{
u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI));
@@ -454,6 +505,14 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
if (!!replace_prog != !!(flags & BPF_F_REPLACE))
/* replace_prog implies BPF_F_REPLACE, and vice versa */
return -EINVAL;
+ if ((type == BPF_CGROUP_TRACEPOINT) &&
+ ((flags & BPF_F_REPLACE) || (bpf_attach_subtype < 0) || !(flags & BPF_F_ALLOW_MULTI)))
+ /* replace fd is used to pass the subtype */
+ /* subtype is required for BPF_CGROUP_TRACEPOINT */
+ /* not allow multi BPF progs for the attach type for now */
+ return -EINVAL;
+
+ /* TODO check bpf_attach_subtype is valid */
atype = to_cgroup_bpf_attach_type(type);
if (atype < 0)
@@ -499,7 +558,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
bpf_cgroup_storages_assign(pl->storage, storage);
cgrp->bpf.flags[atype] = saved_flags;
- err = update_effective_progs(cgrp, atype);
+ err = update_effective_progs(cgrp, atype, bpf_attach_subtype);
if (err)
goto cleanup;
@@ -679,7 +738,8 @@ static struct bpf_prog_list *find_detach_entry(struct list_head *progs,
* Must be called with cgroup_mutex held.
*/
int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
- struct bpf_cgroup_link *link, enum bpf_attach_type type)
+ struct bpf_cgroup_link *link, enum bpf_attach_type type,
+ int bpf_attach_subtype)
{
enum cgroup_bpf_attach_type atype;
struct bpf_prog *old_prog;
@@ -708,7 +768,7 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
pl->prog = NULL;
pl->link = NULL;
- err = update_effective_progs(cgrp, atype);
+ err = update_effective_progs(cgrp, atype, bpf_attach_subtype);
if (err)
goto cleanup;
@@ -809,7 +869,7 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr,
}
}
- ret = cgroup_bpf_attach(cgrp, prog, replace_prog, NULL,
+ ret = cgroup_bpf_attach(cgrp, prog, replace_prog, NULL, attr->replace_bpf_fd,
attr->attach_type, attr->attach_flags);
if (replace_prog)
@@ -832,7 +892,7 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
if (IS_ERR(prog))
prog = NULL;
- ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type);
+ ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, attr->replace_bpf_fd);
if (prog)
bpf_prog_put(prog);
@@ -861,7 +921,7 @@ static void bpf_cgroup_link_release(struct bpf_link *link)
}
WARN_ON(__cgroup_bpf_detach(cg_link->cgroup, NULL, cg_link,
- cg_link->type));
+ cg_link->type, -1));
cg = cg_link->cgroup;
cg_link->cgroup = NULL;
@@ -961,7 +1021,7 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
goto out_put_cgroup;
}
- err = cgroup_bpf_attach(cgrp, NULL, NULL, link,
+ err = cgroup_bpf_attach(cgrp, NULL, NULL, link, -1,
link->type, BPF_F_ALLOW_MULTI);
if (err) {
bpf_link_cleanup(&link_primer);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a645b212b69b..17a1269dc2f9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6626,25 +6626,26 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
#ifdef CONFIG_CGROUP_BPF
int cgroup_bpf_attach(struct cgroup *cgrp,
struct bpf_prog *prog, struct bpf_prog *replace_prog,
- struct bpf_cgroup_link *link,
+ struct bpf_cgroup_link *link, int bpf_attach_subtype,
enum bpf_attach_type type,
u32 flags)
{
int ret;
mutex_lock(&cgroup_mutex);
- ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, link, type, flags);
+ ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, link,
+ bpf_attach_subtype, type, flags);
mutex_unlock(&cgroup_mutex);
return ret;
}
int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
- enum bpf_attach_type type)
+ enum bpf_attach_type type, int bpf_attach_subtype)
{
int ret;
mutex_lock(&cgroup_mutex);
- ret = __cgroup_bpf_detach(cgrp, prog, NULL, type);
+ ret = __cgroup_bpf_detach(cgrp, prog, NULL, type, bpf_attach_subtype);
mutex_unlock(&cgroup_mutex);
return ret;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d34e00749c9b..71056af4322b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12511,6 +12511,51 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
}
EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
+int perf_event_create_for_all_cpus(struct perf_event_attr *attr,
+ struct cgroup *cgroup,
+ struct list_head *entries)
+{
+ struct perf_event **events;
+ struct perf_cgroup *perf_cgrp;
+ int cpu, i = 0;
+
+ events = kzalloc(sizeof(struct perf_event *) * num_possible_cpus(),
+ GFP_KERNEL);
+
+ if (!events)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ /* allocate first, connect the cgroup later */
+ events[i] = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
+
+ if (IS_ERR(events[i]))
+ goto err;
+
+ i++;
+ }
+
+ perf_cgrp = cgroup_tryget_perf_cgroup(cgroup);
+ if (!perf_cgrp)
+ goto err;
+
+ for (i--; i >= 0; i--) {
+ events[i]->cgrp = perf_cgrp;
+
+ list_add(&events[i]->bpf_cg_list, entries);
+ }
+
+ kfree(events);
+ return 0;
+
+err:
+ for (i--; i >= 0; i--)
+ free_event(events[i]);
+
+ kfree(events);
+ return -ENOMEM;
+}
+
void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
{
struct perf_event_context *src_ctx;
--
2.25.1
On Thu, Nov 18, 2021 at 03:28:40PM -0500, Kenny Ho wrote:
> @@ -245,6 +256,21 @@ static int compute_effective_progs(struct cgroup *cgrp,
> if (!progs)
> return -ENOMEM;
>
> + if (atype == CGROUP_TRACEPOINT) {
> + /* TODO: only create event for cgroup that can have process */
> +
> + attr.config = bpf_attach_subtype;
> + attr.type = PERF_TYPE_TRACEPOINT;
> + attr.sample_type = PERF_SAMPLE_RAW;
> + attr.sample_period = 1;
> + attr.wakeup_events = 1;
> +
> + rc = perf_event_create_for_all_cpus(&attr, cgrp,
> + &cgrp->bpf.per_cg_events);
> + if (rc)
> + goto err;
> + }
...
> +int perf_event_create_for_all_cpus(struct perf_event_attr *attr,
> + struct cgroup *cgroup,
> + struct list_head *entries)
> +{
> + struct perf_event **events;
> + struct perf_cgroup *perf_cgrp;
> + int cpu, i = 0;
> +
> + events = kzalloc(sizeof(struct perf_event *) * num_possible_cpus(),
> + GFP_KERNEL);
> +
> + if (!events)
> + return -ENOMEM;
> +
> + for_each_possible_cpu(cpu) {
> + /* allocate first, connect the cgroup later */
> + events[i] = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
This is a very heavy hammer for this task.
There is really no need for perf_event to be created.
Did you consider using raw_tp approach instead?
It doesn't need this heavy stuff.
Also I suspect in follow up you'd be adding tracepoints to GPU code?
Did you consider just leaving few __weak global functions in GPU code
and let bpf progs attach to them as fentry?
I suspect the true hierarchical nature of bpf-cgroup framework isn't necessary.
The bpf program itself can filter for given cgroup.
We have bpf_current_task_under_cgroup() and friends.
I suggest to sprinkle __weak empty funcs in GPU and see what
you can do with it with fentry and bpf_current_task_under_cgroup.
There is also bpf_get_current_ancestor_cgroup_id().
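For reference, a rough sketch of that fentry approach; the hook name (gpu_sched_job_begin) and the map layout are made up, they are not existing driver symbols:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Per-cgroup usage counters, keyed by cgroup id. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, __u64);
} gpu_usage SEC(".maps");

/* Sketch only: count events per cgroup from an fentry prog attached to a
 * hypothetical __weak hook in the GPU code, aggregating by cgroup id
 * inside the prog instead of relying on bpf-cgroup attachment.
 * bpf_get_current_ancestor_cgroup_id() could be used instead to key by a
 * fixed ancestor level.
 */
SEC("fentry/gpu_sched_job_begin")
int BPF_PROG(count_gpu_job)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u64 one = 1, *cnt;

	cnt = bpf_map_lookup_elem(&gpu_usage, &cgid);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&gpu_usage, &cgid, &one, BPF_ANY);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";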
On Thu, Nov 18, 2021 at 11:33 PM Alexei Starovoitov
<[email protected]> wrote:
>
> On Thu, Nov 18, 2021 at 03:28:40PM -0500, Kenny Ho wrote:
> > + for_each_possible_cpu(cpu) {
> > + /* allocate first, connect the cgroup later */
> > + events[i] = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);
>
> This is a very heavy hammer for this task.
> There is really no need for perf_event to be created.
> Did you consider using raw_tp approach instead?
I came across raw_tp but I don't have a good understanding of it yet.
Initially I was hoping perf event/tracepoint would be a stepping stone
to raw_tp, but that doesn't seem to be the case (and unfortunately I
picked the perf event/tracepoint route to dive into first because I saw
cgroup usage there.) Can you confirm whether the following statements
are true?
- raw_tp is related to writable tracepoints
- perf_event/tracepoint/kprobe/uprobe and fentry/fexit/raw_tp are
considered two separate 'things' (even though both are for tracing)
> It doesn't need this heavy stuff.
> Also I suspect in follow up you'd be adding tracepoints to GPU code?
> Did you consider just leaving few __weak global functions in GPU code
> and let bpf progs attach to them as fentry?
There are already tracepoints in the GPU code. And I do like the
fentry way of doing things more, but my head was very much focused on
cgroups, and the tracepoint/kprobe path seemed to have something for
it. I suspected this would be a bit too heavy after seeing the
scalability discussion, but I wasn't sure, so I whipped this up quickly
to get some feedback (while learning more about perf/bpf/cgroup.)
> I suspect the true hierarchical nature of bpf-cgroup framework isn't necessary.
> The bpf program itself can filter for given cgroup.
> We have bpf_current_task_under_cgroup() and friends.
Is there a way to access cgroup local storage from a prog that is not
attached to a bpf-cgroup? Although I guess I can just store/read
things using a map with the cgroup id as the key. And with the
bpf_get_current_ancestor_cgroup_id below, I can simulate the values
being propagated if the hierarchy ends up being relevant. Then again,
is there a way to atomically update multiple elements of a map? I am
trying to figure out how to support a multi-user, multi-app sharing
use case (like user A given quota X and user B given quota Y, with
apps 1 and 2 each having a quota assigned by A and apps 8 and 9 each
having a quota assigned by B.) Is there some kind of 'lock' mechanism
for me to keep quotas 1, 2 and X in sync? (Same for 8, 9 and Y.)
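To make that use case concrete, this is roughly the strawman layout I have in mind (names made up); a single bpf_map_update_elem() can replace one cgroup's element atomically, but I don't see how to update a child's and a parent's elements together:

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Strawman only: one element per cgroup, keyed by cgroup id, so that a
 * cgroup's quota and usage are at least replaced together.
 */
struct gpu_budget {
	__u64 quota;	/* assigned by the parent (user A or B) */
	__u64 used;	/* accumulated usage */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);		/* cgroup id */
	__type(value, struct gpu_budget);
} budgets SEC(".maps");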
> I suggest to sprinkle __weak empty funcs in GPU and see what
> you can do with it with fentry and bpf_current_task_under_cgroup.
> There is also bpf_get_current_ancestor_cgroup_id().