Hello,
There have been requests for more sophisticated perf event sample
filtering based on the sample data. Recently the kernel gained support
for BPF programs accessing perf sample data, and this is the userspace
part to enable such filtering.
This still has some rough edges and needs more improvements, but
I'd like to share the current work and get feedback on the direction
and ideas for further improvements.
The kernel changes are in the tip.git tree (perf/core branch) for now.
perf record has a --filter option to set filters on the last specified
event in the command line. So far it has worked only for tracepoints
and Intel PT events. This patchset extends it to accept a 'bpf:' prefix
in order to enable general sample filters using BPF for any event.
A new filter expression parser was added (using flex/bison) to process
the filter string. Right now it only accepts very simple expressions
separated by commas. I'd like to keep the filter expressions as simple
as possible.
A sample must satisfy all of the filter expressions, otherwise it is
dropped. In other words, the filter expressions are implicitly connected
by logical AND operations.
Essentially the BPF filter expression is:
"bpf:" <term> <operator> <value> ("," <term> <operator> <value>)*
The <term> can be one of:
ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
mem_dtlb, mem_blk, mem_hops
The <operator> can be one of:
==, !=, >, >=, <, <=, &
The <value> can be one of:
<number> (for any term)
na, load, store, pfetch, exec (for mem_op)
l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
remote (for mem_remote)
na, locked (for mem_lock)
na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
na, by_data, by_addr (for mem_blk)
hops0, hops1, hops2, hops3 (for mem_hops)
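The implicit-AND semantics described above can be sketched in plain C
(illustration only, not the actual perf/BPF code from the patches; it
simplifies each expression to compare against a single sample value
instead of looking up per-term sample fields):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* one parsed "<term> <operator> <value>" expression */
enum op { OP_EQ, OP_NEQ, OP_GT, OP_GE, OP_LT, OP_LE, OP_AND };

struct expr {
	enum op op;
	uint64_t value;		/* RHS constant from the filter string */
};

/* A sample survives only if every expression matches (implicit AND). */
static bool sample_passes(uint64_t sample, const struct expr *exprs, int n)
{
	for (int i = 0; i < n; i++) {
		bool ok;

		switch (exprs[i].op) {
		case OP_EQ:  ok = sample == exprs[i].value; break;
		case OP_NEQ: ok = sample != exprs[i].value; break;
		case OP_GT:  ok = sample >  exprs[i].value; break;
		case OP_GE:  ok = sample >= exprs[i].value; break;
		case OP_LT:  ok = sample <  exprs[i].value; break;
		case OP_LE:  ok = sample <= exprs[i].value; break;
		case OP_AND: ok = (sample & exprs[i].value) != 0; break;
		default:     ok = false; break;
		}
		if (!ok)
			return false;	/* one failing expression drops the sample */
	}
	return true;
}
```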
I plan to improve it with range expressions (e.g. for ip or addr), and
it should support symbols like the existing addr-filters do. Also, a
cgroup term should accept cgroup names and convert them to IDs.
Let's take a look at some examples. The following profiles a user
program on the command line. When the frequency mode is used, perf
starts with a very small period (i.e. 1) and adjusts it on every
interrupt (NMI) to catch up with the given frequency.
$ ./perf record -- ./perf test -w noploop
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.263 MB perf.data (4006 samples) ]
$ ./perf script -F pid,period,event,ip,sym | head
36695 1 cycles: ffffffffbab12ddd perf_event_exec
36695 1 cycles: ffffffffbab12ddd perf_event_exec
36695 5 cycles: ffffffffbab12ddd perf_event_exec
36695 46 cycles: ffffffffbab12de5 perf_event_exec
36695 1163 cycles: ffffffffba80a0eb x86_pmu_disable_all
36695 1304 cycles: ffffffffbaa19507 __hrtimer_get_next_event
36695 8143 cycles: ffffffffbaa186f9 __run_timers
36695 69040 cycles: ffffffffbaa0c393 rcu_segcblist_ready_cbs
36695 355117 cycles: 4b0da4 noploop
36695 321861 cycles: 4b0da4 noploop
If you want to skip the first few samples that have small periods, you
can do it like this (note it requires root due to BPF):
$ sudo ./perf record -e cycles --filter 'bpf: period > 10000' -- ./perf test -w noploop
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.262 MB perf.data (3990 samples) ]
$ sudo ./perf script -F pid,period,event,ip,sym | head
39524 58253 cycles: ffffffffba97dac0 update_rq_clock
39524 232657 cycles: 4b0da2 noploop
39524 210981 cycles: 4b0da2 noploop
39524 282882 cycles: 4b0da4 noploop
39524 392180 cycles: 4b0da4 noploop
39524 456058 cycles: 4b0da4 noploop
39524 415196 cycles: 4b0da2 noploop
39524 462721 cycles: 4b0da4 noploop
39524 526272 cycles: 4b0da2 noploop
39524 565569 cycles: 4b0da4 noploop
A perhaps more useful example involves precise memory events. On AMD
processors with IBS, you can keep only the memory loads that missed the
L1 dTLB, like below.
$ sudo ./perf record -ad -e ibs_op//p \
> --filter 'bpf: mem_op == load, mem_dtlb > l1_hit' sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.338 MB perf.data (15 samples) ]
$ sudo ./perf script -F data_src | head
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
49080142 |OP LOAD|LVL L1 hit|SNP N/A|TLB L2 hit|LCK N/A|BLK N/A
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
51088842 |OP LOAD|LVL L3 or Remote Cache (1 hop) hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
49080442 |OP LOAD|LVL L2 hit|SNP N/A|TLB L2 hit|LCK N/A|BLK N/A
51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
You can also check the number of dropped samples in LOST_SAMPLES events
using the perf report --stat command.
$ sudo ./perf report --stat
Aggregated stats:
TOTAL events: 16066
MMAP events: 22 ( 0.1%)
COMM events: 4166 (25.9%)
EXIT events: 1 ( 0.0%)
THROTTLE events: 816 ( 5.1%)
UNTHROTTLE events: 613 ( 3.8%)
FORK events: 4165 (25.9%)
SAMPLE events: 15 ( 0.1%)
MMAP2 events: 6133 (38.2%)
LOST_SAMPLES events: 1 ( 0.0%)
KSYMBOL events: 69 ( 0.4%)
BPF_EVENT events: 57 ( 0.4%)
FINISHED_ROUND events: 3 ( 0.0%)
ID_INDEX events: 1 ( 0.0%)
THREAD_MAP events: 1 ( 0.0%)
CPU_MAP events: 1 ( 0.0%)
TIME_CONV events: 1 ( 0.0%)
FINISHED_INIT events: 1 ( 0.0%)
ibs_op//p stats:
SAMPLE events: 15
LOST_SAMPLES events: 3991
Note that the total aggregated stats show 1 LOST_SAMPLES event while
the per-event stats show 3991, because the latter is the actual number
of dropped samples and the former counts the number of records. Maybe
the per-event stats should be changed to 'LOST_SAMPLES count' to avoid
the confusion.
The code is available at 'perf/bpf-filter-v1' branch in my tree.
git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
Again, you need a tip/perf/core kernel for this to work.
Any feedback is welcome.
Thanks,
Namhyung
Namhyung Kim (7):
perf bpf filter: Introduce basic BPF filter expression
perf bpf filter: Implement event sample filtering
perf record: Add BPF event filter support
perf record: Record dropped sample count
perf bpf filter: Add 'pid' sample data support
perf bpf filter: Add more weight sample data support
perf bpf filter: Add data_src sample data support
tools/perf/Documentation/perf-record.txt | 10 +-
tools/perf/Makefile.perf | 2 +-
tools/perf/builtin-record.c | 46 ++++--
tools/perf/util/Build | 16 ++
tools/perf/util/bpf-filter.c | 117 ++++++++++++++
tools/perf/util/bpf-filter.h | 48 ++++++
tools/perf/util/bpf-filter.l | 146 ++++++++++++++++++
tools/perf/util/bpf-filter.y | 55 +++++++
tools/perf/util/bpf_counter.c | 3 +-
tools/perf/util/bpf_skel/sample-filter.h | 25 +++
tools/perf/util/bpf_skel/sample_filter.bpf.c | 152 +++++++++++++++++++
tools/perf/util/evsel.c | 2 +
tools/perf/util/evsel.h | 7 +-
tools/perf/util/parse-events.c | 4 +
tools/perf/util/session.c | 3 +-
15 files changed, 615 insertions(+), 21 deletions(-)
create mode 100644 tools/perf/util/bpf-filter.c
create mode 100644 tools/perf/util/bpf-filter.h
create mode 100644 tools/perf/util/bpf-filter.l
create mode 100644 tools/perf/util/bpf-filter.y
create mode 100644 tools/perf/util/bpf_skel/sample-filter.h
create mode 100644 tools/perf/util/bpf_skel/sample_filter.bpf.c
base-commit: 37f322cd58d81a9d46456531281c908de9ef6e42
--
2.39.1.581.gbfd45094c4-goog
Use the --filter option to set a BPF filter for any event. The filter
string must start with the 'bpf:' prefix. Then the BPF program will
check the sample data and filter according to the expression.
For example, the below shows a typical perf record in frequency mode.
The sample period starts from 1 and increases gradually.
$ sudo ./perf record -e cycles true
$ sudo ./perf script
perf-exec 2272336 546683.916875: 1 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916892: 1 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916899: 3 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916905: 17 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916911: 100 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916917: 589 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916924: 3470 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
perf-exec 2272336 546683.916930: 20465 cycles: ffffffff828499b8 perf_event_exec+0x298 ([kernel.kallsyms])
true 2272336 546683.916940: 119873 cycles: ffffffff8283afdd perf_iterate_ctx+0x2d ([kernel.kallsyms])
true 2272336 546683.917003: 461349 cycles: ffffffff82892517 vma_interval_tree_insert+0x37 ([kernel.kallsyms])
true 2272336 546683.917237: 635778 cycles: ffffffff82a11400 security_mmap_file+0x20 ([kernel.kallsyms])
When you add a BPF filter to keep only samples with periods greater
than 1000, the output looks like below:
$ sudo ./perf record -e cycles --filter 'bpf: period > 1000' true
$ sudo ./perf script
perf-exec 2273949 546850.708501: 5029 cycles: ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
perf-exec 2273949 546850.708508: 32409 cycles: ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
perf-exec 2273949 546850.708526: 143369 cycles: ffffffff82b4cdbf xas_start+0x5f ([kernel.kallsyms])
perf-exec 2273949 546850.708600: 372650 cycles: ffffffff8286b8f7 __pagevec_lru_add+0x117 ([kernel.kallsyms])
perf-exec 2273949 546850.708791: 482953 cycles: ffffffff829190de __mod_memcg_lruvec_state+0x4e ([kernel.kallsyms])
true 2273949 546850.709036: 501985 cycles: ffffffff828add7c tlb_gather_mmu+0x4c ([kernel.kallsyms])
true 2273949 546850.709292: 503065 cycles: 7f2446d97c03 _dl_map_object_deps+0x973 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 10 +++++++++-
tools/perf/builtin-record.c | 9 +++++++++
tools/perf/util/bpf_counter.c | 3 +--
tools/perf/util/evsel.c | 2 ++
tools/perf/util/parse-events.c | 4 ++++
5 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index ff815c2f67e8..7c6bb3be842a 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -121,7 +121,9 @@ OPTIONS
--filter=<filter>::
Event filter. This option should follow an event selector (-e) which
selects either tracepoint event(s) or a hardware trace PMU
- (e.g. Intel PT or CoreSight).
+ (e.g. Intel PT or CoreSight). If the filter string starts with 'bpf:'
+ it means a general filter using BPF which can be applied for any kind
+ of events.
- tracepoint filters
@@ -174,6 +176,12 @@ OPTIONS
within a single mapping. MMAP events (or /proc/<pid>/maps) can be
examined to determine if that is a possibility.
+ - bpf filters
+
+ BPF filter can access the sample data and make a decision based on the
+ data. Users need to set the appropriate sample type to use the BPF
+ filter.
+
Multiple filters can be separated with space or comma.
--exclude-perf::
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 29dcd454b8e2..c81047a78f3e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -52,6 +52,7 @@
#include "util/pmu-hybrid.h"
#include "util/evlist-hybrid.h"
#include "util/off_cpu.h"
+#include "util/bpf-filter.h"
#include "asm/bug.h"
#include "perf.h"
#include "cputopo.h"
@@ -1368,6 +1369,14 @@ static int record__open(struct record *rec)
session->evlist = evlist;
perf_session__set_id_hdr_size(session);
+
+ evlist__for_each_entry(evlist, pos) {
+ if (list_empty(&pos->bpf_filters))
+ continue;
+ rc = perf_bpf_filter__prepare(pos);
+ if (rc)
+ break;
+ }
out:
return rc;
}
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index eeee899fcf34..0414385794ee 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -781,8 +781,7 @@ extern struct bpf_counter_ops bperf_cgrp_ops;
static inline bool bpf_counter_skip(struct evsel *evsel)
{
- return list_empty(&evsel->bpf_counter_list) &&
- evsel->follower_skel == NULL;
+ return evsel->bpf_counter_ops == NULL;
}
int bpf_counter__install_pe(struct evsel *evsel, int cpu_map_idx, int fd)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 51e8ce6edddc..cae624fde026 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -50,6 +50,7 @@
#include "off_cpu.h"
#include "../perf-sys.h"
#include "util/parse-branch-options.h"
+#include "util/bpf-filter.h"
#include <internal/xyarray.h>
#include <internal/lib.h>
#include <internal/threadmap.h>
@@ -1494,6 +1495,7 @@ void evsel__exit(struct evsel *evsel)
assert(list_empty(&evsel->core.node));
assert(evsel->evlist == NULL);
bpf_counter__destroy(evsel);
+ perf_bpf_filter__destroy(evsel);
evsel__free_counts(evsel);
perf_evsel__free_fd(&evsel->core);
perf_evsel__free_id(&evsel->core);
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 0336ff27c15f..33f654be6fcc 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -27,6 +27,7 @@
#include "perf.h"
#include "util/parse-events-hybrid.h"
#include "util/pmu-hybrid.h"
+#include "util/bpf-filter.h"
#include "tracepoint.h"
#include "thread_map.h"
@@ -2517,6 +2518,9 @@ static int set_filter(struct evsel *evsel, const void *arg)
return -1;
}
+ if (!strncmp(str, "bpf:", 4))
+ return perf_bpf_filter__parse(&evsel->bpf_filters, str+4);
+
if (evsel->core.attr.type == PERF_TYPE_TRACEPOINT) {
if (evsel__append_tp_filter(evsel, str) < 0) {
fprintf(stderr,
--
2.39.1.581.gbfd45094c4-goog
The BPF program is attached to a perf_event and triggered when it
overflows. It iterates the filters map and compares the sample value
according to each expression. If any of them fails, the sample is
dropped.
Also, each expression needs the corresponding sample data to be
present, so the program checks data->sample_flags against the given
flag. To access the sample data, it uses the bpf_cast_to_kern_ctx()
kfunc which was added in the v6.2 kernel.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Makefile.perf | 2 +-
tools/perf/util/bpf-filter.c | 71 +++++++++++
tools/perf/util/bpf-filter.h | 24 ++--
tools/perf/util/bpf_skel/sample-filter.h | 24 ++++
tools/perf/util/bpf_skel/sample_filter.bpf.c | 118 +++++++++++++++++++
tools/perf/util/evsel.h | 7 +-
6 files changed, 234 insertions(+), 12 deletions(-)
create mode 100644 tools/perf/util/bpf_skel/sample-filter.h
create mode 100644 tools/perf/util/bpf_skel/sample_filter.bpf.c
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index bac9272682b7..474af4adea95 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1047,7 +1047,7 @@ SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
SKELETONS += $(SKEL_OUT)/off_cpu.skel.h $(SKEL_OUT)/lock_contention.skel.h
-SKELETONS += $(SKEL_OUT)/kwork_trace.skel.h
+SKELETONS += $(SKEL_OUT)/kwork_trace.skel.h $(SKEL_OUT)/sample_filter.skel.h
$(SKEL_TMP_OUT) $(LIBAPI_OUTPUT) $(LIBBPF_OUTPUT) $(LIBPERF_OUTPUT) $(LIBSUBCMD_OUTPUT) $(LIBSYMBOL_OUTPUT):
$(Q)$(MKDIR) -p $@
diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
index 6b1148fcfb0e..f47420cf81c9 100644
--- a/tools/perf/util/bpf-filter.c
+++ b/tools/perf/util/bpf-filter.c
@@ -1,10 +1,81 @@
// SPDX-License-Identifier: GPL-2.0
#include <stdlib.h>
+#include <bpf/bpf.h>
+#include <linux/err.h>
+#include <internal/xyarray.h>
+
+#include "util/debug.h"
+#include "util/evsel.h"
+
#include "util/bpf-filter.h"
#include "util/bpf-filter-flex.h"
#include "util/bpf-filter-bison.h"
+#include "bpf_skel/sample-filter.h"
+#include "bpf_skel/sample_filter.skel.h"
+
+#define FD(e, x, y) (*(int *)xyarray__entry(e->core.fd, x, y))
+
+int perf_bpf_filter__prepare(struct evsel *evsel)
+{
+ int i, x, y, fd;
+ struct sample_filter_bpf *skel;
+ struct bpf_program *prog;
+ struct bpf_link *link;
+ struct perf_bpf_filter_expr *expr;
+
+ skel = sample_filter_bpf__open();
+ if (!skel) {
+ pr_err("Failed to open perf sample-filter BPF skeleton\n");
+ return -1;
+ }
+
+ bpf_map__set_max_entries(skel->maps.filters, MAX_FILTERS);
+
+ if (sample_filter_bpf__load(skel) < 0) {
+ pr_err("Failed to load perf sample-filter BPF skeleton\n");
+ return -1;
+ }
+
+ i = 0;
+ fd = bpf_map__fd(skel->maps.filters);
+ list_for_each_entry(expr, &evsel->bpf_filters, list) {
+ struct perf_bpf_filter_entry entry = {
+ .op = expr->op,
+ .flags = expr->sample_flags,
+ .value = expr->val,
+ };
+ bpf_map_update_elem(fd, &i, &entry, BPF_ANY);
+ i++;
+ }
+
+ prog = skel->progs.perf_sample_filter;
+ for (x = 0; x < xyarray__max_x(evsel->core.fd); x++) {
+ for (y = 0; y < xyarray__max_y(evsel->core.fd); y++) {
+ link = bpf_program__attach_perf_event(prog, FD(evsel, x, y));
+ if (IS_ERR(link)) {
+ pr_err("Failed to attach perf sample-filter program\n");
+ return PTR_ERR(link);
+ }
+ }
+ }
+ evsel->bpf_skel = skel;
+ return 0;
+}
+
+int perf_bpf_filter__destroy(struct evsel *evsel)
+{
+ struct perf_bpf_filter_expr *expr, *tmp;
+
+ list_for_each_entry_safe(expr, tmp, &evsel->bpf_filters, list) {
+ list_del(&expr->list);
+ free(expr);
+ }
+ sample_filter_bpf__destroy(evsel->bpf_skel);
+ return 0;
+}
+
struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
enum perf_bpf_filter_op op,
unsigned long val)
diff --git a/tools/perf/util/bpf-filter.h b/tools/perf/util/bpf-filter.h
index fd5b1164a322..6077930073f9 100644
--- a/tools/perf/util/bpf-filter.h
+++ b/tools/perf/util/bpf-filter.h
@@ -4,15 +4,7 @@
#include <linux/list.h>
-enum perf_bpf_filter_op {
- PBF_OP_EQ,
- PBF_OP_NEQ,
- PBF_OP_GT,
- PBF_OP_GE,
- PBF_OP_LT,
- PBF_OP_LE,
- PBF_OP_AND,
-};
+#include "bpf_skel/sample-filter.h"
struct perf_bpf_filter_expr {
struct list_head list;
@@ -21,16 +13,30 @@ struct perf_bpf_filter_expr {
unsigned long val;
};
+struct evsel;
+
#ifdef HAVE_BPF_SKEL
struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
enum perf_bpf_filter_op op,
unsigned long val);
int perf_bpf_filter__parse(struct list_head *expr_head, const char *str);
+int perf_bpf_filter__prepare(struct evsel *evsel);
+int perf_bpf_filter__destroy(struct evsel *evsel);
+
#else /* !HAVE_BPF_SKEL */
+
static inline int perf_bpf_filter__parse(struct list_head *expr_head __maybe_unused,
const char *str __maybe_unused)
{
return -ENOSYS;
}
+static inline int perf_bpf_filter__prepare(struct evsel *evsel)
+{
+ return -ENOSYS;
+}
+static inline int perf_bpf_filter__destroy(struct evsel *evsel)
+{
+ return -ENOSYS;
+}
#endif /* HAVE_BPF_SKEL*/
#endif /* PERF_UTIL_BPF_FILTER_H */
\ No newline at end of file
diff --git a/tools/perf/util/bpf_skel/sample-filter.h b/tools/perf/util/bpf_skel/sample-filter.h
new file mode 100644
index 000000000000..862060bfda14
--- /dev/null
+++ b/tools/perf/util/bpf_skel/sample-filter.h
@@ -0,0 +1,24 @@
+#ifndef PERF_UTIL_BPF_SKEL_SAMPLE_FILTER_H
+#define PERF_UTIL_BPF_SKEL_SAMPLE_FILTER_H
+
+#define MAX_FILTERS 32
+
+/* supported filter operations */
+enum perf_bpf_filter_op {
+ PBF_OP_EQ,
+ PBF_OP_NEQ,
+ PBF_OP_GT,
+ PBF_OP_GE,
+ PBF_OP_LT,
+ PBF_OP_LE,
+ PBF_OP_AND
+};
+
+/* BPF map entry for filtering */
+struct perf_bpf_filter_entry {
+ enum perf_bpf_filter_op op;
+ __u64 flags;
+ __u64 value;
+};
+
+#endif /* PERF_UTIL_BPF_SKEL_SAMPLE_FILTER_H */
\ No newline at end of file
diff --git a/tools/perf/util/bpf_skel/sample_filter.bpf.c b/tools/perf/util/bpf_skel/sample_filter.bpf.c
new file mode 100644
index 000000000000..1aa6a4cacd51
--- /dev/null
+++ b/tools/perf/util/bpf_skel/sample_filter.bpf.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2023 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#include "sample-filter.h"
+
+/* BPF map that will be filled by user space */
+struct filters {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct perf_bpf_filter_entry);
+ __uint(max_entries, MAX_FILTERS);
+} filters SEC(".maps");
+
+int dropped;
+
+void *bpf_cast_to_kern_ctx(void *) __ksym;
+
+/* helper function to return the given perf sample data */
+static inline __u64 perf_get_sample(struct bpf_perf_event_data_kern *kctx,
+ struct perf_bpf_filter_entry *entry)
+{
+ if ((kctx->data->sample_flags & entry->flags) == 0)
+ return 0;
+
+ switch (entry->flags) {
+ case PERF_SAMPLE_IP:
+ return kctx->data->ip;
+ case PERF_SAMPLE_ID:
+ return kctx->data->id;
+ case PERF_SAMPLE_TID:
+ return kctx->data->tid_entry.tid;
+ case PERF_SAMPLE_CPU:
+ return kctx->data->cpu_entry.cpu;
+ case PERF_SAMPLE_TIME:
+ return kctx->data->time;
+ case PERF_SAMPLE_ADDR:
+ return kctx->data->addr;
+ case PERF_SAMPLE_PERIOD:
+ return kctx->data->period;
+ case PERF_SAMPLE_TRANSACTION:
+ return kctx->data->txn;
+ case PERF_SAMPLE_WEIGHT:
+ return kctx->data->weight.full;
+ case PERF_SAMPLE_PHYS_ADDR:
+ return kctx->data->phys_addr;
+ case PERF_SAMPLE_CODE_PAGE_SIZE:
+ return kctx->data->code_page_size;
+ case PERF_SAMPLE_DATA_PAGE_SIZE:
+ return kctx->data->data_page_size;
+ default:
+ break;
+ }
+ return 0;
+}
+
+/* BPF program to be called from perf event overflow handler */
+SEC("perf_event")
+int perf_sample_filter(void *ctx)
+{
+ struct bpf_perf_event_data_kern *kctx;
+ struct perf_bpf_filter_entry *entry;
+ __u64 sample_data;
+ int i;
+
+ kctx = bpf_cast_to_kern_ctx(ctx);
+
+ for (i = 0; i < MAX_FILTERS; i++) {
+ int key = i; /* needed for verifier :( */
+
+ entry = bpf_map_lookup_elem(&filters, &key);
+ if (entry == NULL)
+ break;
+ sample_data = perf_get_sample(kctx, entry);
+
+ switch (entry->op) {
+ case PBF_OP_EQ:
+ if (!(sample_data == entry->value))
+ goto drop;
+ break;
+ case PBF_OP_NEQ:
+ if (!(sample_data != entry->value))
+ goto drop;
+ break;
+ case PBF_OP_GT:
+ if (!(sample_data > entry->value))
+ goto drop;
+ break;
+ case PBF_OP_GE:
+ if (!(sample_data >= entry->value))
+ goto drop;
+ break;
+ case PBF_OP_LT:
+ if (!(sample_data < entry->value))
+ goto drop;
+ break;
+ case PBF_OP_LE:
+ if (!(sample_data <= entry->value))
+ goto drop;
+ break;
+ case PBF_OP_AND:
+ if (!(sample_data & entry->value))
+ goto drop;
+ break;
+ }
+ }
+ /* generate sample data */
+ return 1;
+
+drop:
+ __sync_fetch_and_add(&dropped, 1);
+ return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 24cb807ef6ce..6845642485ec 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -151,8 +151,10 @@ struct evsel {
*/
struct bpf_counter_ops *bpf_counter_ops;
- /* for perf-stat -b */
- struct list_head bpf_counter_list;
+ union {
+ struct list_head bpf_counter_list; /* for perf-stat -b */
+ struct list_head bpf_filters; /* for perf-record --filter */
+ };
/* for perf-stat --use-bpf */
int bperf_leader_prog_fd;
@@ -160,6 +162,7 @@ struct evsel {
union {
struct bperf_leader_bpf *leader_skel;
struct bperf_follower_bpf *follower_skel;
+ void *bpf_skel;
};
unsigned long open_flags;
int precise_ip_original;
--
2.39.1.581.gbfd45094c4-goog
This implements a tiny parser for the filter expressions used for BPF.
Each expression is converted to a struct perf_bpf_filter_expr and
passed to a BPF map.
For now, I'd like to start with the very basic comparisons like EQ and
GT. The LHS should be a term for sample data and the RHS a number.
The expressions are connected by commas. For example:
period > 10000
ip < 0x1000000000000, cpu == 3
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/Build | 16 +++++++++
tools/perf/util/bpf-filter.c | 37 +++++++++++++++++++
tools/perf/util/bpf-filter.h | 36 +++++++++++++++++++
tools/perf/util/bpf-filter.l | 70 ++++++++++++++++++++++++++++++++++++
tools/perf/util/bpf-filter.y | 52 +++++++++++++++++++++++++++
5 files changed, 211 insertions(+)
create mode 100644 tools/perf/util/bpf-filter.c
create mode 100644 tools/perf/util/bpf-filter.h
create mode 100644 tools/perf/util/bpf-filter.l
create mode 100644 tools/perf/util/bpf-filter.y
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 918b501f9bd8..6af73fb5c797 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -154,6 +154,9 @@ perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf-filter.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf-filter-flex.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf-filter-bison.o
ifeq ($(CONFIG_LIBTRACEEVENT),y)
perf-$(CONFIG_PERF_BPF_SKEL) += bpf_lock_contention.o
@@ -266,6 +269,16 @@ $(OUTPUT)util/pmu-bison.c $(OUTPUT)util/pmu-bison.h: util/pmu.y
$(Q)$(call echo-cmd,bison)$(BISON) -v $< -d $(PARSER_DEBUG_BISON) $(BISON_FILE_PREFIX_MAP) \
-o $(OUTPUT)util/pmu-bison.c -p perf_pmu_
+$(OUTPUT)util/bpf-filter-flex.c $(OUTPUT)util/bpf-filter-flex.h: util/bpf-filter.l $(OUTPUT)util/bpf-filter-bison.c
+ $(call rule_mkdir)
+ $(Q)$(call echo-cmd,flex)$(FLEX) -o $(OUTPUT)util/bpf-filter-flex.c \
+ --header-file=$(OUTPUT)util/bpf-filter-flex.h $(PARSER_DEBUG_FLEX) $<
+
+$(OUTPUT)util/bpf-filter-bison.c $(OUTPUT)util/bpf-filter-bison.h: util/bpf-filter.y
+ $(call rule_mkdir)
+ $(Q)$(call echo-cmd,bison)$(BISON) -v $< -d $(PARSER_DEBUG_BISON) $(BISON_FILE_PREFIX_MAP) \
+ -o $(OUTPUT)util/bpf-filter-bison.c -p perf_bpf_filter_
+
FLEX_GE_26 := $(shell expr $(shell $(FLEX) --version | sed -e 's/flex \([0-9]\+\).\([0-9]\+\)/\1\2/g') \>\= 26)
ifeq ($(FLEX_GE_26),1)
flex_flags := -Wno-switch-enum -Wno-switch-default -Wno-unused-function -Wno-redundant-decls -Wno-sign-compare -Wno-unused-parameter -Wno-missing-prototypes -Wno-missing-declarations
@@ -279,6 +292,7 @@ endif
CFLAGS_parse-events-flex.o += $(flex_flags)
CFLAGS_pmu-flex.o += $(flex_flags)
CFLAGS_expr-flex.o += $(flex_flags)
+CFLAGS_bpf-filter-flex.o += $(flex_flags)
bison_flags := -DYYENABLE_NLS=0
BISON_GE_35 := $(shell expr $(shell $(BISON) --version | grep bison | sed -e 's/.\+ \([0-9]\+\).\([0-9]\+\)/\1\2/g') \>\= 35)
@@ -290,10 +304,12 @@ endif
CFLAGS_parse-events-bison.o += $(bison_flags)
CFLAGS_pmu-bison.o += -DYYLTYPE_IS_TRIVIAL=0 $(bison_flags)
CFLAGS_expr-bison.o += -DYYLTYPE_IS_TRIVIAL=0 $(bison_flags)
+CFLAGS_bpf-filter-bison.o += -DYYLTYPE_IS_TRIVIAL=0 $(bison_flags)
$(OUTPUT)util/parse-events.o: $(OUTPUT)util/parse-events-flex.c $(OUTPUT)util/parse-events-bison.c
$(OUTPUT)util/pmu.o: $(OUTPUT)util/pmu-flex.c $(OUTPUT)util/pmu-bison.c
$(OUTPUT)util/expr.o: $(OUTPUT)util/expr-flex.c $(OUTPUT)util/expr-bison.c
+$(OUTPUT)util/bpf-filter.o: $(OUTPUT)util/bpf-filter-flex.c $(OUTPUT)util/bpf-filter-bison.c
CFLAGS_bitmap.o += -Wno-unused-parameter -DETC_PERFCONFIG="BUILD_STR($(ETC_PERFCONFIG_SQ))"
CFLAGS_find_bit.o += -Wno-unused-parameter -DETC_PERFCONFIG="BUILD_STR($(ETC_PERFCONFIG_SQ))"
diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
new file mode 100644
index 000000000000..6b1148fcfb0e
--- /dev/null
+++ b/tools/perf/util/bpf-filter.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+
+#include "util/bpf-filter.h"
+#include "util/bpf-filter-flex.h"
+#include "util/bpf-filter-bison.h"
+
+struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
+ enum perf_bpf_filter_op op,
+ unsigned long val)
+{
+ struct perf_bpf_filter_expr *expr;
+
+ expr = malloc(sizeof(*expr));
+ if (expr != NULL) {
+ expr->sample_flags = sample_flags;
+ expr->op = op;
+ expr->val = val;
+ }
+ return expr;
+}
+
+int perf_bpf_filter__parse(struct list_head *expr_head, const char *str)
+{
+ YY_BUFFER_STATE buffer;
+ int ret;
+
+ buffer = perf_bpf_filter__scan_string(str);
+
+ ret = perf_bpf_filter_parse(expr_head);
+
+ perf_bpf_filter__flush_buffer(buffer);
+ perf_bpf_filter__delete_buffer(buffer);
+ perf_bpf_filter_lex_destroy();
+
+ return ret;
+}
\ No newline at end of file
diff --git a/tools/perf/util/bpf-filter.h b/tools/perf/util/bpf-filter.h
new file mode 100644
index 000000000000..fd5b1164a322
--- /dev/null
+++ b/tools/perf/util/bpf-filter.h
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef PERF_UTIL_BPF_FILTER_H
+#define PERF_UTIL_BPF_FILTER_H
+
+#include <linux/list.h>
+
+enum perf_bpf_filter_op {
+ PBF_OP_EQ,
+ PBF_OP_NEQ,
+ PBF_OP_GT,
+ PBF_OP_GE,
+ PBF_OP_LT,
+ PBF_OP_LE,
+ PBF_OP_AND,
+};
+
+struct perf_bpf_filter_expr {
+ struct list_head list;
+ enum perf_bpf_filter_op op;
+ unsigned long sample_flags;
+ unsigned long val;
+};
+
+#ifdef HAVE_BPF_SKEL
+struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
+ enum perf_bpf_filter_op op,
+ unsigned long val);
+int perf_bpf_filter__parse(struct list_head *expr_head, const char *str);
+#else /* !HAVE_BPF_SKEL */
+static inline int perf_bpf_filter__parse(struct list_head *expr_head __maybe_unused,
+ const char *str __maybe_unused)
+{
+ return -ENOSYS;
+}
+#endif /* HAVE_BPF_SKEL*/
+#endif /* PERF_UTIL_BPF_FILTER_H */
\ No newline at end of file
diff --git a/tools/perf/util/bpf-filter.l b/tools/perf/util/bpf-filter.l
new file mode 100644
index 000000000000..34c6a9fd4fa4
--- /dev/null
+++ b/tools/perf/util/bpf-filter.l
@@ -0,0 +1,70 @@
+%option prefix="perf_bpf_filter_"
+%option noyywrap
+
+%{
+#include <stdlib.h>
+#include <linux/perf_event.h>
+
+#include "bpf-filter.h"
+#include "bpf-filter-bison.h"
+
+static int sample(unsigned long sample_flag)
+{
+ perf_bpf_filter_lval.sample = sample_flag;
+ return BFT_SAMPLE;
+}
+
+static int operator(enum perf_bpf_filter_op op)
+{
+ perf_bpf_filter_lval.op = op;
+ return BFT_OP;
+}
+
+static int value(int base)
+{
+ long num;
+
+ errno = 0;
+ num = strtoul(perf_bpf_filter_text, NULL, base);
+ if (errno)
+ return BFT_ERROR;
+
+ perf_bpf_filter_lval.num = num;
+ return BFT_NUM;
+}
+
+%}
+
+num_dec [0-9]+
+num_hex 0[Xx][0-9a-fA-F]+
+
+%%
+
+{num_dec} { return value(10); }
+{num_hex} { return value(16); }
+
+ip { return sample(PERF_SAMPLE_IP); }
+id { return sample(PERF_SAMPLE_ID); }
+tid { return sample(PERF_SAMPLE_TID); }
+cpu { return sample(PERF_SAMPLE_CPU); }
+time { return sample(PERF_SAMPLE_TIME); }
+addr { return sample(PERF_SAMPLE_ADDR); }
+period { return sample(PERF_SAMPLE_PERIOD); }
+txn { return sample(PERF_SAMPLE_TRANSACTION); }
+weight { return sample(PERF_SAMPLE_WEIGHT); }
+phys_addr { return sample(PERF_SAMPLE_PHYS_ADDR); }
+code_pgsz { return sample(PERF_SAMPLE_CODE_PAGE_SIZE); }
+data_pgsz { return sample(PERF_SAMPLE_DATA_PAGE_SIZE); }
+
+"==" { return operator(PBF_OP_EQ); }
+"!=" { return operator(PBF_OP_NEQ); }
+">" { return operator(PBF_OP_GT); }
+"<" { return operator(PBF_OP_LT); }
+">=" { return operator(PBF_OP_GE); }
+"<=" { return operator(PBF_OP_LE); }
+"&" { return operator(PBF_OP_AND); }
+
+"," { return ','; }
+. { }
+
+%%
diff --git a/tools/perf/util/bpf-filter.y b/tools/perf/util/bpf-filter.y
new file mode 100644
index 000000000000..0bf36ec30abf
--- /dev/null
+++ b/tools/perf/util/bpf-filter.y
@@ -0,0 +1,52 @@
+%parse-param {struct list_head *expr_head}
+
+%{
+
+#include <stdio.h>
+#include <string.h>
+#include <linux/compiler.h>
+#include <linux/list.h>
+#include "bpf-filter.h"
+
+static void perf_bpf_filter_error(struct list_head *expr __maybe_unused,
+ char const *msg)
+{
+ printf("perf_bpf_filter: %s\n", msg);
+}
+
+%}
+
+%union
+{
+ unsigned long num;
+ unsigned long sample;
+ enum perf_bpf_filter_op op;
+ struct perf_bpf_filter_expr *expr;
+}
+
+%token BFT_SAMPLE BFT_OP BFT_ERROR BFT_NUM
+%type <expr> filter_term
+%type <sample> BFT_SAMPLE
+%type <op> BFT_OP
+%type <num> BFT_NUM
+
+%%
+
+filter:
+filter ',' filter_term
+{
+ list_add(&$3->list, expr_head);
+}
+|
+filter_term
+{
+ list_add(&$1->list, expr_head);
+}
+
+filter_term:
+BFT_SAMPLE BFT_OP BFT_NUM
+{
+ $$ = perf_bpf_filter_expr__new($1, $2, $3);
+}
+
+%%
--
2.39.1.581.gbfd45094c4-goog
When it uses BPF filters, an event might drop some samples. It'd be nice
if it could report how many samples were dropped. As the LOST_SAMPLES
event can carry similar information, let's use it for BPF filters.
To indicate the loss comes from BPF filters, add a new misc flag for
that and do not display the CPU load warnings in that case.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-record.c | 37 ++++++++++++++++++++++--------------
tools/perf/util/bpf-filter.c | 7 +++++++
tools/perf/util/bpf-filter.h | 5 +++++
tools/perf/util/session.c | 3 ++-
4 files changed, 37 insertions(+), 15 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index c81047a78f3e..3201d1a1ea1f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1869,24 +1869,16 @@ record__switch_output(struct record *rec, bool at_exit)
return fd;
}
-static void __record__read_lost_samples(struct record *rec, struct evsel *evsel,
+static void __record__save_lost_samples(struct record *rec, struct evsel *evsel,
struct perf_record_lost_samples *lost,
- int cpu_idx, int thread_idx)
+ int cpu_idx, int thread_idx, u64 lost_count,
+ u16 misc_flag)
{
- struct perf_counts_values count;
struct perf_sample_id *sid;
struct perf_sample sample = {};
int id_hdr_size;
- if (perf_evsel__read(&evsel->core, cpu_idx, thread_idx, &count) < 0) {
- pr_err("read LOST count failed\n");
- return;
- }
-
- if (count.lost == 0)
- return;
-
- lost->lost = count.lost;
+ lost->lost = lost_count;
if (evsel->core.ids) {
sid = xyarray__entry(evsel->core.sample_id, cpu_idx, thread_idx);
sample.id = sid->id;
@@ -1895,6 +1887,7 @@ static void __record__read_lost_samples(struct record *rec, struct evsel *evsel,
id_hdr_size = perf_event__synthesize_id_sample((void *)(lost + 1),
evsel->core.attr.sample_type, &sample);
lost->header.size = sizeof(*lost) + id_hdr_size;
+ lost->header.misc = misc_flag;
record__write(rec, NULL, lost, lost->header.size);
}
@@ -1918,6 +1911,7 @@ static void record__read_lost_samples(struct record *rec)
evlist__for_each_entry(session->evlist, evsel) {
struct xyarray *xy = evsel->core.sample_id;
+ u64 lost_count;
if (xy == NULL || evsel->core.fd == NULL)
continue;
@@ -1929,12 +1923,27 @@ static void record__read_lost_samples(struct record *rec)
for (int x = 0; x < xyarray__max_x(xy); x++) {
for (int y = 0; y < xyarray__max_y(xy); y++) {
- __record__read_lost_samples(rec, evsel, lost, x, y);
+ struct perf_counts_values count;
+
+ if (perf_evsel__read(&evsel->core, x, y, &count) < 0) {
+ pr_err("read LOST count failed\n");
+ goto out;
+ }
+
+ if (count.lost) {
+ __record__save_lost_samples(rec, evsel, lost,
+ x, y, count.lost, 0);
+ }
}
}
+
+ lost_count = perf_bpf_filter__lost_count(evsel);
+ if (lost_count)
+ __record__save_lost_samples(rec, evsel, lost, 0, 0, lost_count,
+ PERF_RECORD_MISC_LOST_SAMPLES_BPF);
}
+out:
free(lost);
-
}
static volatile sig_atomic_t workload_exec_errno;
diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
index f47420cf81c9..11fb391c92e9 100644
--- a/tools/perf/util/bpf-filter.c
+++ b/tools/perf/util/bpf-filter.c
@@ -76,6 +76,13 @@ int perf_bpf_filter__destroy(struct evsel *evsel)
return 0;
}
+u64 perf_bpf_filter__lost_count(struct evsel *evsel)
+{
+ struct sample_filter_bpf *skel = evsel->bpf_skel;
+
+ return skel ? skel->bss->dropped : 0;
+}
+
struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
enum perf_bpf_filter_op op,
unsigned long val)
diff --git a/tools/perf/util/bpf-filter.h b/tools/perf/util/bpf-filter.h
index 6077930073f9..36b44c8188ab 100644
--- a/tools/perf/util/bpf-filter.h
+++ b/tools/perf/util/bpf-filter.h
@@ -22,6 +22,7 @@ struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flag
int perf_bpf_filter__parse(struct list_head *expr_head, const char *str);
int perf_bpf_filter__prepare(struct evsel *evsel);
int perf_bpf_filter__destroy(struct evsel *evsel);
+u64 perf_bpf_filter__lost_count(struct evsel *evsel);
#else /* !HAVE_BPF_SKEL */
@@ -38,5 +39,9 @@ static inline int perf_bpf_filter__destroy(struct evsel *evsel)
{
return -ENOSYS;
}
+static inline u64 perf_bpf_filter__lost_count(struct evsel *evsel)
+{
+ return 0;
+}
#endif /* HAVE_BPF_SKEL*/
#endif /* PERF_UTIL_BPF_FILTER_H */
\ No newline at end of file
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 749d5b5c135b..7d8d057d1772 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1582,7 +1582,8 @@ static int machines__deliver_event(struct machines *machines,
evlist->stats.total_lost += event->lost.lost;
return tool->lost(tool, event, sample, machine);
case PERF_RECORD_LOST_SAMPLES:
- if (tool->lost_samples == perf_event__process_lost_samples)
+ if (tool->lost_samples == perf_event__process_lost_samples &&
+ !(event->header.misc & PERF_RECORD_MISC_LOST_SAMPLES_BPF))
evlist->stats.total_lost_samples += event->lost_samples.lost;
return tool->lost_samples(tool, event, sample, machine);
case PERF_RECORD_READ:
--
2.39.1.581.gbfd45094c4-goog
The pid is special because it's saved together with the tid in the
PERF_SAMPLE_TID entry. So it needs to differentiate tid and pid using
the 'part' field in the perf BPF filter entry struct.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/bpf-filter.c | 4 +++-
tools/perf/util/bpf-filter.h | 3 ++-
tools/perf/util/bpf-filter.l | 11 ++++++++++-
tools/perf/util/bpf-filter.y | 7 +++++--
tools/perf/util/bpf_skel/sample-filter.h | 3 ++-
tools/perf/util/bpf_skel/sample_filter.bpf.c | 5 ++++-
6 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
index 11fb391c92e9..2e02dc965dd9 100644
--- a/tools/perf/util/bpf-filter.c
+++ b/tools/perf/util/bpf-filter.c
@@ -43,6 +43,7 @@ int perf_bpf_filter__prepare(struct evsel *evsel)
list_for_each_entry(expr, &evsel->bpf_filters, list) {
struct perf_bpf_filter_entry entry = {
.op = expr->op,
+ .part = expr->part,
.flags = expr->sample_flags,
.value = expr->val,
};
@@ -83,7 +84,7 @@ u64 perf_bpf_filter__lost_count(struct evsel *evsel)
return skel ? skel->bss->dropped : 0;
}
-struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
+struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags, int part,
enum perf_bpf_filter_op op,
unsigned long val)
{
@@ -92,6 +93,7 @@ struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flag
expr = malloc(sizeof(*expr));
if (expr != NULL) {
expr->sample_flags = sample_flags;
+ expr->part = part;
expr->op = op;
expr->val = val;
}
diff --git a/tools/perf/util/bpf-filter.h b/tools/perf/util/bpf-filter.h
index 36b44c8188ab..4fb33d296d9c 100644
--- a/tools/perf/util/bpf-filter.h
+++ b/tools/perf/util/bpf-filter.h
@@ -9,6 +9,7 @@
struct perf_bpf_filter_expr {
struct list_head list;
enum perf_bpf_filter_op op;
+ int part;
unsigned long sample_flags;
unsigned long val;
};
@@ -16,7 +17,7 @@ struct perf_bpf_filter_expr {
struct evsel;
#ifdef HAVE_BPF_SKEL
-struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
+struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags, int part,
enum perf_bpf_filter_op op,
unsigned long val);
int perf_bpf_filter__parse(struct list_head *expr_head, const char *str);
diff --git a/tools/perf/util/bpf-filter.l b/tools/perf/util/bpf-filter.l
index 34c6a9fd4fa4..5117c76c7c7a 100644
--- a/tools/perf/util/bpf-filter.l
+++ b/tools/perf/util/bpf-filter.l
@@ -10,7 +10,15 @@
static int sample(unsigned long sample_flag)
{
- perf_bpf_filter_lval.sample = sample_flag;
+ perf_bpf_filter_lval.sample.type = sample_flag;
+ perf_bpf_filter_lval.sample.part = 0;
+ return BFT_SAMPLE;
+}
+
+static int sample_part(unsigned long sample_flag, int part)
+{
+ perf_bpf_filter_lval.sample.type = sample_flag;
+ perf_bpf_filter_lval.sample.part = part;
return BFT_SAMPLE;
}
@@ -46,6 +54,7 @@ num_hex 0[Xx][0-9a-fA-F]+
ip { return sample(PERF_SAMPLE_IP); }
id { return sample(PERF_SAMPLE_ID); }
tid { return sample(PERF_SAMPLE_TID); }
+pid { return sample_part(PERF_SAMPLE_TID, 1); }
cpu { return sample(PERF_SAMPLE_CPU); }
time { return sample(PERF_SAMPLE_TIME); }
addr { return sample(PERF_SAMPLE_ADDR); }
diff --git a/tools/perf/util/bpf-filter.y b/tools/perf/util/bpf-filter.y
index 0bf36ec30abf..8f307d5ffc54 100644
--- a/tools/perf/util/bpf-filter.y
+++ b/tools/perf/util/bpf-filter.y
@@ -19,7 +19,10 @@ static void perf_bpf_filter_error(struct list_head *expr __maybe_unused,
%union
{
unsigned long num;
- unsigned long sample;
+ struct {
+ unsigned long type;
+ int part;
+ } sample;
enum perf_bpf_filter_op op;
struct perf_bpf_filter_expr *expr;
}
@@ -46,7 +49,7 @@ filter_term
filter_term:
BFT_SAMPLE BFT_OP BFT_NUM
{
- $$ = perf_bpf_filter_expr__new($1, $2, $3);
+ $$ = perf_bpf_filter_expr__new($1.type, $1.part, $2, $3);
}
%%
diff --git a/tools/perf/util/bpf_skel/sample-filter.h b/tools/perf/util/bpf_skel/sample-filter.h
index 862060bfda14..6b9fd554ad7b 100644
--- a/tools/perf/util/bpf_skel/sample-filter.h
+++ b/tools/perf/util/bpf_skel/sample-filter.h
@@ -17,7 +17,8 @@ enum perf_bpf_filter_op {
/* BPF map entry for filtering */
struct perf_bpf_filter_entry {
enum perf_bpf_filter_op op;
- __u64 flags;
+ __u32 part; /* sub-sample type info when it has multiple values */
+ __u64 flags; /* perf sample type flags */
__u64 value;
};
diff --git a/tools/perf/util/bpf_skel/sample_filter.bpf.c b/tools/perf/util/bpf_skel/sample_filter.bpf.c
index 1aa6a4cacd51..e9a0633d638d 100644
--- a/tools/perf/util/bpf_skel/sample_filter.bpf.c
+++ b/tools/perf/util/bpf_skel/sample_filter.bpf.c
@@ -32,7 +32,10 @@ static inline __u64 perf_get_sample(struct bpf_perf_event_data_kern *kctx,
case PERF_SAMPLE_ID:
return kctx->data->id;
case PERF_SAMPLE_TID:
- return kctx->data->tid_entry.tid;
+ if (entry->part)
+ return kctx->data->tid_entry.pid;
+ else
+ return kctx->data->tid_entry.tid;
case PERF_SAMPLE_CPU:
return kctx->data->cpu_entry.cpu;
case PERF_SAMPLE_TIME:
--
2.39.1.581.gbfd45094c4-goog
The weight data consists of a couple of fields when
PERF_SAMPLE_WEIGHT_STRUCT is used. Add weight{1,2,3} terms to select
them separately. Also add their aliases: 'ins_lat', 'p_stage_cyc'
and 'retire_lat'.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/bpf-filter.l | 6 ++++++
tools/perf/util/bpf_skel/sample_filter.bpf.c | 8 ++++++++
2 files changed, 14 insertions(+)
diff --git a/tools/perf/util/bpf-filter.l b/tools/perf/util/bpf-filter.l
index 5117c76c7c7a..bdc06babb028 100644
--- a/tools/perf/util/bpf-filter.l
+++ b/tools/perf/util/bpf-filter.l
@@ -61,6 +61,12 @@ addr { return sample(PERF_SAMPLE_ADDR); }
period { return sample(PERF_SAMPLE_PERIOD); }
txn { return sample(PERF_SAMPLE_TRANSACTION); }
weight { return sample(PERF_SAMPLE_WEIGHT); }
+weight1 { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 1); }
+weight2 { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 2); }
+weight3 { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 3); }
+ins_lat { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 2); } /* alias for weight2 */
+p_stage_cyc { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 3); } /* alias for weight3 */
+retire_lat { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 3); } /* alias for weight3 */
phys_addr { return sample(PERF_SAMPLE_PHYS_ADDR); }
code_pgsz { return sample(PERF_SAMPLE_CODE_PAGE_SIZE); }
data_pgsz { return sample(PERF_SAMPLE_DATA_PAGE_SIZE); }
diff --git a/tools/perf/util/bpf_skel/sample_filter.bpf.c b/tools/perf/util/bpf_skel/sample_filter.bpf.c
index e9a0633d638d..5b239f194fa9 100644
--- a/tools/perf/util/bpf_skel/sample_filter.bpf.c
+++ b/tools/perf/util/bpf_skel/sample_filter.bpf.c
@@ -46,6 +46,14 @@ static inline __u64 perf_get_sample(struct bpf_perf_event_data_kern *kctx,
return kctx->data->period;
case PERF_SAMPLE_TRANSACTION:
return kctx->data->txn;
+ case PERF_SAMPLE_WEIGHT_STRUCT:
+ if (entry->part == 1)
+ return kctx->data->weight.var1_dw;
+ if (entry->part == 2)
+ return kctx->data->weight.var2_w;
+ if (entry->part == 3)
+ return kctx->data->weight.var3_w;
+ /* fall through */
case PERF_SAMPLE_WEIGHT:
return kctx->data->weight.full;
case PERF_SAMPLE_PHYS_ADDR:
--
2.39.1.581.gbfd45094c4-goog
The data_src has many fields to express memory behaviors. Add each
term separately so that users can combine them for their purpose.
I didn't add prefixes to the constants for simplicity as they are mostly
distinguishable, but I had to use names like l1_miss and l2_hit for
mem_dtlb since mem_lvl has different values for the same names. Note
that I made mem_lvl an alias of mem_lvlnum since the old mem_lvl
encoding is deprecated now. According to the comment in the UAPI header,
users should use a mix of mem_lvlnum, mem_remote and mem_snoop instead.
Also the SNOOPX bits are concatenated onto mem_snoop for simplicity.
The following terms are used for data_src and the corresponding perf
sample data fields:
* mem_op: { load, store, pfetch, exec }
* mem_lvl: { l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem }
* mem_snoop: { none, hit, miss, hitm, fwd, peer }
* mem_remote: { remote }
* mem_lock: { locked }
* mem_dtlb: { l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault }
* mem_blk: { by_data, by_addr }
* mem_hops: { hops0, hops1, hops2, hops3 }
We can now use a filter expression like below:
'bpf: mem_op == load, mem_lvl <= l2, mem_dtlb == l1_hit'
'bpf: mem_dtlb == l2_miss, mem_hops > hops1'
'bpf: mem_lvl == ram, mem_remote == 1'
Note that 'na' is shared among the terms since it has the same value
for all of them except mem_lvl. I don't have a good way to handle that
for now.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/util/bpf-filter.l | 61 ++++++++++++++++++++
tools/perf/util/bpf_skel/sample_filter.bpf.c | 23 ++++++++
2 files changed, 84 insertions(+)
diff --git a/tools/perf/util/bpf-filter.l b/tools/perf/util/bpf-filter.l
index bdc06babb028..3af9331302cf 100644
--- a/tools/perf/util/bpf-filter.l
+++ b/tools/perf/util/bpf-filter.l
@@ -41,6 +41,12 @@ static int value(int base)
return BFT_NUM;
}
+static int constant(int val)
+{
+ perf_bpf_filter_lval.num = val;
+ return BFT_NUM;
+}
+
%}
num_dec [0-9]+
@@ -70,6 +76,15 @@ retire_lat { return sample_part(PERF_SAMPLE_WEIGHT_STRUCT, 3); } /* alias for we
phys_addr { return sample(PERF_SAMPLE_PHYS_ADDR); }
code_pgsz { return sample(PERF_SAMPLE_CODE_PAGE_SIZE); }
data_pgsz { return sample(PERF_SAMPLE_DATA_PAGE_SIZE); }
+mem_op { return sample_part(PERF_SAMPLE_DATA_SRC, 1); }
+mem_lvlnum { return sample_part(PERF_SAMPLE_DATA_SRC, 2); }
+mem_lvl { return sample_part(PERF_SAMPLE_DATA_SRC, 2); } /* alias for mem_lvlnum */
+mem_snoop { return sample_part(PERF_SAMPLE_DATA_SRC, 3); } /* include snoopx */
+mem_remote { return sample_part(PERF_SAMPLE_DATA_SRC, 4); }
+mem_lock { return sample_part(PERF_SAMPLE_DATA_SRC, 5); }
+mem_dtlb { return sample_part(PERF_SAMPLE_DATA_SRC, 6); }
+mem_blk { return sample_part(PERF_SAMPLE_DATA_SRC, 7); }
+mem_hops { return sample_part(PERF_SAMPLE_DATA_SRC, 8); }
"==" { return operator(PBF_OP_EQ); }
"!=" { return operator(PBF_OP_NEQ); }
@@ -79,6 +94,52 @@ data_pgsz { return sample(PERF_SAMPLE_DATA_PAGE_SIZE); }
"<=" { return operator(PBF_OP_LE); }
"&" { return operator(PBF_OP_AND); }
+na { return constant(PERF_MEM_OP_NA); }
+load { return constant(PERF_MEM_OP_LOAD); }
+store { return constant(PERF_MEM_OP_STORE); }
+pfetch { return constant(PERF_MEM_OP_PFETCH); }
+exec { return constant(PERF_MEM_OP_EXEC); }
+
+l1 { return constant(PERF_MEM_LVLNUM_L1); }
+l2 { return constant(PERF_MEM_LVLNUM_L2); }
+l3 { return constant(PERF_MEM_LVLNUM_L3); }
+l4 { return constant(PERF_MEM_LVLNUM_L4); }
+cxl { return constant(PERF_MEM_LVLNUM_CXL); }
+io { return constant(PERF_MEM_LVLNUM_IO); }
+any_cache { return constant(PERF_MEM_LVLNUM_ANY_CACHE); }
+lfb { return constant(PERF_MEM_LVLNUM_LFB); }
+ram { return constant(PERF_MEM_LVLNUM_RAM); }
+pmem { return constant(PERF_MEM_LVLNUM_PMEM); }
+
+none { return constant(PERF_MEM_SNOOP_NONE); }
+hit { return constant(PERF_MEM_SNOOP_HIT); }
+miss { return constant(PERF_MEM_SNOOP_MISS); }
+hitm { return constant(PERF_MEM_SNOOP_HITM); }
+fwd { return constant(PERF_MEM_SNOOPX_FWD); }
+peer { return constant(PERF_MEM_SNOOPX_PEER); }
+
+remote { return constant(PERF_MEM_REMOTE_REMOTE); }
+
+locked { return constant(PERF_MEM_LOCK_LOCKED); }
+
+l1_hit { return constant(PERF_MEM_TLB_L1 | PERF_MEM_TLB_HIT); }
+l1_miss { return constant(PERF_MEM_TLB_L1 | PERF_MEM_TLB_MISS); }
+l2_hit { return constant(PERF_MEM_TLB_L2 | PERF_MEM_TLB_HIT); }
+l2_miss { return constant(PERF_MEM_TLB_L2 | PERF_MEM_TLB_MISS); }
+any_hit { return constant(PERF_MEM_TLB_HIT); }
+any_miss { return constant(PERF_MEM_TLB_MISS); }
+walk { return constant(PERF_MEM_TLB_WK); }
+os { return constant(PERF_MEM_TLB_OS); }
+fault { return constant(PERF_MEM_TLB_OS); } /* alias for os */
+
+by_data { return constant(PERF_MEM_BLK_DATA); }
+by_addr { return constant(PERF_MEM_BLK_ADDR); }
+
+hops0 { return constant(PERF_MEM_HOPS_0); }
+hops1 { return constant(PERF_MEM_HOPS_1); }
+hops2 { return constant(PERF_MEM_HOPS_2); }
+hops3 { return constant(PERF_MEM_HOPS_3); }
+
"," { return ','; }
. { }
diff --git a/tools/perf/util/bpf_skel/sample_filter.bpf.c b/tools/perf/util/bpf_skel/sample_filter.bpf.c
index 5b239f194fa9..0148b47de7b9 100644
--- a/tools/perf/util/bpf_skel/sample_filter.bpf.c
+++ b/tools/perf/util/bpf_skel/sample_filter.bpf.c
@@ -62,6 +62,29 @@ static inline __u64 perf_get_sample(struct bpf_perf_event_data_kern *kctx,
return kctx->data->code_page_size;
case PERF_SAMPLE_DATA_PAGE_SIZE:
return kctx->data->data_page_size;
+ case PERF_SAMPLE_DATA_SRC:
+ if (entry->part == 1)
+ return kctx->data->data_src.mem_op;
+ if (entry->part == 2)
+ return kctx->data->data_src.mem_lvl_num;
+ if (entry->part == 3) {
+ __u32 snoop = kctx->data->data_src.mem_snoop;
+ __u32 snoopx = kctx->data->data_src.mem_snoopx;
+
+ return (snoopx << 5) | snoop;
+ }
+ if (entry->part == 4)
+ return kctx->data->data_src.mem_remote;
+ if (entry->part == 5)
+ return kctx->data->data_src.mem_lock;
+ if (entry->part == 6)
+ return kctx->data->data_src.mem_dtlb;
+ if (entry->part == 7)
+ return kctx->data->data_src.mem_blk;
+ if (entry->part == 8)
+ return kctx->data->data_src.mem_hops;
+ /* return the whole word */
+ return kctx->data->data_src.val;
default:
break;
}
--
2.39.1.581.gbfd45094c4-goog
On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
>
> This implements a tiny parser for the filter expressions used for BPF.
> Each expression will be converted to struct perf_bpf_filter_expr and
> be passed to a BPF map.
>
> For now, I'd like to start with the very basic comparisons like EQ or
> GT. The LHS should be a term for sample data and the RHS is a number.
> The expressions are connected by a comma. For example,
>
> period > 10000
> ip < 0x1000000000000, cpu == 3
>
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/util/Build | 16 +++++++++
> tools/perf/util/bpf-filter.c | 37 +++++++++++++++++++
> tools/perf/util/bpf-filter.h | 36 +++++++++++++++++++
> tools/perf/util/bpf-filter.l | 70 ++++++++++++++++++++++++++++++++++++
> tools/perf/util/bpf-filter.y | 52 +++++++++++++++++++++++++++
> 5 files changed, 211 insertions(+)
> create mode 100644 tools/perf/util/bpf-filter.c
> create mode 100644 tools/perf/util/bpf-filter.h
> create mode 100644 tools/perf/util/bpf-filter.l
> create mode 100644 tools/perf/util/bpf-filter.y
>
> diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> index 918b501f9bd8..6af73fb5c797 100644
> --- a/tools/perf/util/Build
> +++ b/tools/perf/util/Build
> @@ -154,6 +154,9 @@ perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
> +perf-$(CONFIG_PERF_BPF_SKEL) += bpf-filter.o
> +perf-$(CONFIG_PERF_BPF_SKEL) += bpf-filter-flex.o
> +perf-$(CONFIG_PERF_BPF_SKEL) += bpf-filter-bison.o
>
> ifeq ($(CONFIG_LIBTRACEEVENT),y)
> perf-$(CONFIG_PERF_BPF_SKEL) += bpf_lock_contention.o
> @@ -266,6 +269,16 @@ $(OUTPUT)util/pmu-bison.c $(OUTPUT)util/pmu-bison.h: util/pmu.y
> $(Q)$(call echo-cmd,bison)$(BISON) -v $< -d $(PARSER_DEBUG_BISON) $(BISON_FILE_PREFIX_MAP) \
> -o $(OUTPUT)util/pmu-bison.c -p perf_pmu_
>
> +$(OUTPUT)util/bpf-filter-flex.c $(OUTPUT)util/bpf-filter-flex.h: util/bpf-filter.l $(OUTPUT)util/bpf-filter-bison.c
> + $(call rule_mkdir)
> + $(Q)$(call echo-cmd,flex)$(FLEX) -o $(OUTPUT)util/bpf-filter-flex.c \
> + --header-file=$(OUTPUT)util/bpf-filter-flex.h $(PARSER_DEBUG_FLEX) $<
> +
> +$(OUTPUT)util/bpf-filter-bison.c $(OUTPUT)util/bpf-filter-bison.h: util/bpf-filter.y
> + $(call rule_mkdir)
> + $(Q)$(call echo-cmd,bison)$(BISON) -v $< -d $(PARSER_DEBUG_BISON) $(BISON_FILE_PREFIX_MAP) \
> + -o $(OUTPUT)util/bpf-filter-bison.c -p perf_bpf_filter_
> +
> FLEX_GE_26 := $(shell expr $(shell $(FLEX) --version | sed -e 's/flex \([0-9]\+\).\([0-9]\+\)/\1\2/g') \>\= 26)
> ifeq ($(FLEX_GE_26),1)
> flex_flags := -Wno-switch-enum -Wno-switch-default -Wno-unused-function -Wno-redundant-decls -Wno-sign-compare -Wno-unused-parameter -Wno-missing-prototypes -Wno-missing-declarations
> @@ -279,6 +292,7 @@ endif
> CFLAGS_parse-events-flex.o += $(flex_flags)
> CFLAGS_pmu-flex.o += $(flex_flags)
> CFLAGS_expr-flex.o += $(flex_flags)
> +CFLAGS_bpf-filter-flex.o += $(flex_flags)
>
> bison_flags := -DYYENABLE_NLS=0
> BISON_GE_35 := $(shell expr $(shell $(BISON) --version | grep bison | sed -e 's/.\+ \([0-9]\+\).\([0-9]\+\)/\1\2/g') \>\= 35)
> @@ -290,10 +304,12 @@ endif
> CFLAGS_parse-events-bison.o += $(bison_flags)
> CFLAGS_pmu-bison.o += -DYYLTYPE_IS_TRIVIAL=0 $(bison_flags)
> CFLAGS_expr-bison.o += -DYYLTYPE_IS_TRIVIAL=0 $(bison_flags)
> +CFLAGS_bpf-filter-bison.o += -DYYLTYPE_IS_TRIVIAL=0 $(bison_flags)
>
> $(OUTPUT)util/parse-events.o: $(OUTPUT)util/parse-events-flex.c $(OUTPUT)util/parse-events-bison.c
> $(OUTPUT)util/pmu.o: $(OUTPUT)util/pmu-flex.c $(OUTPUT)util/pmu-bison.c
> $(OUTPUT)util/expr.o: $(OUTPUT)util/expr-flex.c $(OUTPUT)util/expr-bison.c
> +$(OUTPUT)util/bpf-filter.o: $(OUTPUT)util/bpf-filter-flex.c $(OUTPUT)util/bpf-filter-bison.c
>
> CFLAGS_bitmap.o += -Wno-unused-parameter -DETC_PERFCONFIG="BUILD_STR($(ETC_PERFCONFIG_SQ))"
> CFLAGS_find_bit.o += -Wno-unused-parameter -DETC_PERFCONFIG="BUILD_STR($(ETC_PERFCONFIG_SQ))"
> diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
> new file mode 100644
> index 000000000000..6b1148fcfb0e
> --- /dev/null
> +++ b/tools/perf/util/bpf-filter.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdlib.h>
> +
> +#include "util/bpf-filter.h"
> +#include "util/bpf-filter-flex.h"
> +#include "util/bpf-filter-bison.h"
> +
> +struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
> + enum perf_bpf_filter_op op,
> + unsigned long val)
> +{
> + struct perf_bpf_filter_expr *expr;
> +
> + expr = malloc(sizeof(*expr));
> + if (expr != NULL) {
> + expr->sample_flags = sample_flags;
> + expr->op = op;
> + expr->val = val;
> + }
> + return expr;
> +}
> +
> +int perf_bpf_filter__parse(struct list_head *expr_head, const char *str)
> +{
> + YY_BUFFER_STATE buffer;
> + int ret;
> +
> + buffer = perf_bpf_filter__scan_string(str);
> +
> + ret = perf_bpf_filter_parse(expr_head);
> +
> + perf_bpf_filter__flush_buffer(buffer);
> + perf_bpf_filter__delete_buffer(buffer);
> + perf_bpf_filter_lex_destroy();
> +
> + return ret;
> +}
> \ No newline at end of file
> diff --git a/tools/perf/util/bpf-filter.h b/tools/perf/util/bpf-filter.h
> new file mode 100644
> index 000000000000..fd5b1164a322
> --- /dev/null
> +++ b/tools/perf/util/bpf-filter.h
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#ifndef PERF_UTIL_BPF_FILTER_H
> +#define PERF_UTIL_BPF_FILTER_H
> +
> +#include <linux/list.h>
> +
> +enum perf_bpf_filter_op {
> + PBF_OP_EQ,
> + PBF_OP_NEQ,
> + PBF_OP_GT,
> + PBF_OP_GE,
> + PBF_OP_LT,
> + PBF_OP_LE,
> + PBF_OP_AND,
> +};
> +
> +struct perf_bpf_filter_expr {
> + struct list_head list;
> + enum perf_bpf_filter_op op;
> + unsigned long sample_flags;
> + unsigned long val;
> +};
> +
> +#ifdef HAVE_BPF_SKEL
> +struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
> + enum perf_bpf_filter_op op,
> + unsigned long val);
> +int perf_bpf_filter__parse(struct list_head *expr_head, const char *str);
> +#else /* !HAVE_BPF_SKEL */
> +static inline int perf_bpf_filter__parse(struct list_head *expr_head __maybe_unused,
> + const char *str __maybe_unused)
> +{
> + return -ENOSYS;
> +}
> +#endif /* HAVE_BPF_SKEL*/
> +#endif /* PERF_UTIL_BPF_FILTER_H */
> \ No newline at end of file
> diff --git a/tools/perf/util/bpf-filter.l b/tools/perf/util/bpf-filter.l
> new file mode 100644
> index 000000000000..34c6a9fd4fa4
> --- /dev/null
> +++ b/tools/perf/util/bpf-filter.l
> @@ -0,0 +1,70 @@
> +%option prefix="perf_bpf_filter_"
> +%option noyywrap
> +
> +%{
> +#include <stdlib.h>
> +#include <linux/perf_event.h>
> +
> +#include "bpf-filter.h"
> +#include "bpf-filter-bison.h"
> +
> +static int sample(unsigned long sample_flag)
> +{
> + perf_bpf_filter_lval.sample = sample_flag;
> + return BFT_SAMPLE;
> +}
> +
> +static int operator(enum perf_bpf_filter_op op)
> +{
> + perf_bpf_filter_lval.op = op;
> + return BFT_OP;
> +}
> +
> +static int value(int base)
> +{
> + long num;
> +
> + errno = 0;
> + num = strtoul(perf_bpf_filter_text, NULL, base);
> + if (errno)
> + return BFT_ERROR;
> +
> + perf_bpf_filter_lval.num = num;
> + return BFT_NUM;
> +}
> +
> +%}
> +
> +num_dec [0-9]+
> +num_hex 0[Xx][0-9a-fA-F]+
> +
> +%%
> +
> +{num_dec} { return value(10); }
> +{num_hex} { return value(16); }
> +
> +ip { return sample(PERF_SAMPLE_IP); }
> +id { return sample(PERF_SAMPLE_ID); }
> +tid { return sample(PERF_SAMPLE_TID); }
> +cpu { return sample(PERF_SAMPLE_CPU); }
> +time { return sample(PERF_SAMPLE_TIME); }
> +addr { return sample(PERF_SAMPLE_ADDR); }
> +period { return sample(PERF_SAMPLE_PERIOD); }
> +txn { return sample(PERF_SAMPLE_TRANSACTION); }
> +weight { return sample(PERF_SAMPLE_WEIGHT); }
> +phys_addr { return sample(PERF_SAMPLE_PHYS_ADDR); }
> +code_pgsz { return sample(PERF_SAMPLE_CODE_PAGE_SIZE); }
> +data_pgsz { return sample(PERF_SAMPLE_DATA_PAGE_SIZE); }
> +
> +"==" { return operator(PBF_OP_EQ); }
> +"!=" { return operator(PBF_OP_NEQ); }
> +">" { return operator(PBF_OP_GT); }
> +"<" { return operator(PBF_OP_LT); }
> +">=" { return operator(PBF_OP_GE); }
> +"<=" { return operator(PBF_OP_LE); }
> +"&" { return operator(PBF_OP_AND); }
> +
> +"," { return ','; }
> +. { }
> +
> +%%
> diff --git a/tools/perf/util/bpf-filter.y b/tools/perf/util/bpf-filter.y
> new file mode 100644
> index 000000000000..0bf36ec30abf
> --- /dev/null
> +++ b/tools/perf/util/bpf-filter.y
> @@ -0,0 +1,52 @@
> +%parse-param {struct list_head *expr_head}
> +
> +%{
> +
> +#include <stdio.h>
> +#include <string.h>
> +#include <linux/compiler.h>
> +#include <linux/list.h>
> +#include "bpf-filter.h"
> +
> +static void perf_bpf_filter_error(struct list_head *expr __maybe_unused,
> + char const *msg)
> +{
> + printf("perf_bpf_filter: %s\n", msg);
> +}
> +
> +%}
> +
> +%union
> +{
> + unsigned long num;
> + unsigned long sample;
> + enum perf_bpf_filter_op op;
> + struct perf_bpf_filter_expr *expr;
> +}
> +
> +%token BFT_SAMPLE BFT_OP BFT_ERROR BFT_NUM
> +%type <expr> filter_term
To avoid memory leaks for parse errors, I think you want here:
%destructor { free($$); } <expr>
Thanks,
Ian
> +%type <sample> BFT_SAMPLE
> +%type <op> BFT_OP
> +%type <num> BFT_NUM
> +
> +%%
> +
> +filter:
> +filter ',' filter_term
> +{
> + list_add(&$3->list, expr_head);
> +}
> +|
> +filter_term
> +{
> + list_add(&$1->list, expr_head);
> +}
> +
> +filter_term:
> +BFT_SAMPLE BFT_OP BFT_NUM
> +{
> + $$ = perf_bpf_filter_expr__new($1, $2, $3);
> +}
> +
> +%%
> --
> 2.39.1.581.gbfd45094c4-goog
>
On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
>
> When it uses bpf filters, event might drop some samples. It'd be nice
> if it can report how many samples it lost. As LOST_SAMPLES event can
> carry the similar information, let's use it for bpf filters.
>
> To indicate it's from BPF filters, add a new misc flag for that and
> do not display cpu load warnings.
Can you potentially have lost samples from being too slow to drain the
ring buffer and dropped samples because of BPF? Is it possible to
distinguish lost and dropped with this approach?
Thanks,
Ian
> Signed-off-by: Namhyung Kim <[email protected]>
> ---
> tools/perf/builtin-record.c | 37 ++++++++++++++++++++++--------------
> tools/perf/util/bpf-filter.c | 7 +++++++
> tools/perf/util/bpf-filter.h | 5 +++++
> tools/perf/util/session.c | 3 ++-
> 4 files changed, 37 insertions(+), 15 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index c81047a78f3e..3201d1a1ea1f 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -1869,24 +1869,16 @@ record__switch_output(struct record *rec, bool at_exit)
> return fd;
> }
>
> -static void __record__read_lost_samples(struct record *rec, struct evsel *evsel,
> +static void __record__save_lost_samples(struct record *rec, struct evsel *evsel,
> struct perf_record_lost_samples *lost,
> - int cpu_idx, int thread_idx)
> + int cpu_idx, int thread_idx, u64 lost_count,
> + u16 misc_flag)
> {
> - struct perf_counts_values count;
> struct perf_sample_id *sid;
> struct perf_sample sample = {};
> int id_hdr_size;
>
> - if (perf_evsel__read(&evsel->core, cpu_idx, thread_idx, &count) < 0) {
> - pr_err("read LOST count failed\n");
> - return;
> - }
> -
> - if (count.lost == 0)
> - return;
> -
> - lost->lost = count.lost;
> + lost->lost = lost_count;
> if (evsel->core.ids) {
> sid = xyarray__entry(evsel->core.sample_id, cpu_idx, thread_idx);
> sample.id = sid->id;
> @@ -1895,6 +1887,7 @@ static void __record__read_lost_samples(struct record *rec, struct evsel *evsel,
> id_hdr_size = perf_event__synthesize_id_sample((void *)(lost + 1),
> evsel->core.attr.sample_type, &sample);
> lost->header.size = sizeof(*lost) + id_hdr_size;
> + lost->header.misc = misc_flag;
> record__write(rec, NULL, lost, lost->header.size);
> }
>
> @@ -1918,6 +1911,7 @@ static void record__read_lost_samples(struct record *rec)
>
> evlist__for_each_entry(session->evlist, evsel) {
> struct xyarray *xy = evsel->core.sample_id;
> + u64 lost_count;
>
> if (xy == NULL || evsel->core.fd == NULL)
> continue;
> @@ -1929,12 +1923,27 @@ static void record__read_lost_samples(struct record *rec)
>
> for (int x = 0; x < xyarray__max_x(xy); x++) {
> for (int y = 0; y < xyarray__max_y(xy); y++) {
> - __record__read_lost_samples(rec, evsel, lost, x, y);
> + struct perf_counts_values count;
> +
> + if (perf_evsel__read(&evsel->core, x, y, &count) < 0) {
> + pr_err("read LOST count failed\n");
> + goto out;
> + }
> +
> + if (count.lost) {
> + __record__save_lost_samples(rec, evsel, lost,
> + x, y, count.lost, 0);
> + }
> }
> }
> +
> + lost_count = perf_bpf_filter__lost_count(evsel);
> + if (lost_count)
> + __record__save_lost_samples(rec, evsel, lost, 0, 0, lost_count,
> + PERF_RECORD_MISC_LOST_SAMPLES_BPF);
> }
> +out:
> free(lost);
> -
> }
>
> static volatile sig_atomic_t workload_exec_errno;
> diff --git a/tools/perf/util/bpf-filter.c b/tools/perf/util/bpf-filter.c
> index f47420cf81c9..11fb391c92e9 100644
> --- a/tools/perf/util/bpf-filter.c
> +++ b/tools/perf/util/bpf-filter.c
> @@ -76,6 +76,13 @@ int perf_bpf_filter__destroy(struct evsel *evsel)
> return 0;
> }
>
> +u64 perf_bpf_filter__lost_count(struct evsel *evsel)
> +{
> + struct sample_filter_bpf *skel = evsel->bpf_skel;
> +
> + return skel ? skel->bss->dropped : 0;
> +}
> +
> struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flags,
> enum perf_bpf_filter_op op,
> unsigned long val)
> diff --git a/tools/perf/util/bpf-filter.h b/tools/perf/util/bpf-filter.h
> index 6077930073f9..36b44c8188ab 100644
> --- a/tools/perf/util/bpf-filter.h
> +++ b/tools/perf/util/bpf-filter.h
> @@ -22,6 +22,7 @@ struct perf_bpf_filter_expr *perf_bpf_filter_expr__new(unsigned long sample_flag
> int perf_bpf_filter__parse(struct list_head *expr_head, const char *str);
> int perf_bpf_filter__prepare(struct evsel *evsel);
> int perf_bpf_filter__destroy(struct evsel *evsel);
> +u64 perf_bpf_filter__lost_count(struct evsel *evsel);
>
> #else /* !HAVE_BPF_SKEL */
>
> @@ -38,5 +39,9 @@ static inline int perf_bpf_filter__destroy(struct evsel *evsel)
> {
> return -ENOSYS;
> }
> +static inline u64 perf_bpf_filter__lost_count(struct evsel *evsel)
> +{
> + return 0;
> +}
> #endif /* HAVE_BPF_SKEL*/
> #endif /* PERF_UTIL_BPF_FILTER_H */
> \ No newline at end of file
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 749d5b5c135b..7d8d057d1772 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1582,7 +1582,8 @@ static int machines__deliver_event(struct machines *machines,
> evlist->stats.total_lost += event->lost.lost;
> return tool->lost(tool, event, sample, machine);
> case PERF_RECORD_LOST_SAMPLES:
> - if (tool->lost_samples == perf_event__process_lost_samples)
> + if (tool->lost_samples == perf_event__process_lost_samples &&
> + !(event->header.misc & PERF_RECORD_MISC_LOST_SAMPLES_BPF))
> evlist->stats.total_lost_samples += event->lost_samples.lost;
> return tool->lost_samples(tool, event, sample, machine);
> case PERF_RECORD_READ:
> --
> 2.39.1.581.gbfd45094c4-goog
>
On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> There have been requests for more sophisticated perf event sample
> filtering based on the sample data. Recently the kernel added BPF
> programs can access perf sample data and this is the userspace part
> to enable such a filtering.
>
> This still has some rough edges and needs more improvements. But
> I'd like to share the current work and get some feedback for the
> directions and idea for further improvements.
>
> The kernel changes are in the tip.git tree (perf/core branch) for now.
> perf record has --filter option to set filters on the last specified
> event in the command line. It worked only for tracepoints and Intel
> PT events so far. This patchset extends it to have 'bpf:' prefix in
> order to enable the general sample filters using BPF for any events.
>
> A new filter expression parser was added (using flex/bison) to process
> the filter string. Right now, it only accepts very simple expressions
> separated by comma. I'd like to keep the filter expression as simple
> as possible.
>
> It requires samples satisfy all the filter expressions otherwise it'd
> drop the sample. IOW filter expressions are connected with logical AND
> operations implicitly.
>
> Essentially the BPF filter expression is:
>
> "bpf:" <term> <operator> <value> ("," <term> <operator> <value>)*
>
> The <term> can be one of:
> ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
> code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
> p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
> mem_dtlb, mem_blk, mem_hops
>
> The <operator> can be one of:
> ==, !=, >, >=, <, <=, &
>
> The <value> can be one of:
> <number> (for any term)
> na, load, store, pfetch, exec (for mem_op)
> l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
> na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
> remote (for mem_remote)
> na, locked (for mem_locked)
> na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
> na, by_data, by_addr (for mem_blk)
> hops0, hops1, hops2, hops3 (for mem_hops)
>
> I plan to improve it with range expressions like for ip or addr and it
> should support symbols like the existing addr-filters. Also cgroup
> should understand and convert cgroup names to IDs.
>
> Let's take a look at some examples. The following is to profile a user
> program on the command line. When the frequency mode is used, it starts
> with a very small period (i.e. 1) and adjust it on every interrupt (NMI)
> to catch up the given frequency.
>
> $ ./perf record -- ./perf test -w noploop
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.263 MB perf.data (4006 samples) ]
>
> $ ./perf script -F pid,period,event,ip,sym | head
> 36695 1 cycles: ffffffffbab12ddd perf_event_exec
> 36695 1 cycles: ffffffffbab12ddd perf_event_exec
> 36695 5 cycles: ffffffffbab12ddd perf_event_exec
> 36695 46 cycles: ffffffffbab12de5 perf_event_exec
> 36695 1163 cycles: ffffffffba80a0eb x86_pmu_disable_all
> 36695 1304 cycles: ffffffffbaa19507 __hrtimer_get_next_event
> 36695 8143 cycles: ffffffffbaa186f9 __run_timers
> 36695 69040 cycles: ffffffffbaa0c393 rcu_segcblist_ready_cbs
> 36695 355117 cycles: 4b0da4 noploop
> 36695 321861 cycles: 4b0da4 noploop
>
> If you want to skip the first few samples that have small periods, you
> can do like this (note it requires root due to BPF).
>
> $ sudo ./perf record -e cycles --filter 'bpf: period > 10000' -- ./perf test -w noploop
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.262 MB perf.data (3990 samples) ]
>
> $ sudo ./perf script -F pid,period,event,ip,sym | head
> 39524 58253 cycles: ffffffffba97dac0 update_rq_clock
> 39524 232657 cycles: 4b0da2 noploop
> 39524 210981 cycles: 4b0da2 noploop
> 39524 282882 cycles: 4b0da4 noploop
> 39524 392180 cycles: 4b0da4 noploop
> 39524 456058 cycles: 4b0da4 noploop
> 39524 415196 cycles: 4b0da2 noploop
> 39524 462721 cycles: 4b0da4 noploop
> 39524 526272 cycles: 4b0da2 noploop
> 39524 565569 cycles: 4b0da4 noploop
>
> Maybe more useful example is when it deals with precise memory events.
> On AMD processors with IBS, you can filter only memory load with L1
> dTLB is missed like below.
>
> $ sudo ./perf record -ad -e ibs_op//p \
> > --filter 'bpf: mem_op == load, mem_dtlb > l1_hit' sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.338 MB perf.data (15 samples) ]
>
> $ sudo ./perf script -F data_src | head
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 49080142 |OP LOAD|LVL L1 hit|SNP N/A|TLB L2 hit|LCK N/A|BLK N/A
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 51088842 |OP LOAD|LVL L3 or Remote Cache (1 hop) hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
> 49080442 |OP LOAD|LVL L2 hit|SNP N/A|TLB L2 hit|LCK N/A|BLK N/A
> 51080242 |OP LOAD|LVL LFB/MAB hit|SNP N/A|TLB L2 miss|LCK N/A|BLK N/A
>
> You can also check the number of dropped samples in LOST_SAMPLES events
> using perf report --stat command.
>
> $ sudo ./perf report --stat
>
> Aggregated stats:
> TOTAL events: 16066
> MMAP events: 22 ( 0.1%)
> COMM events: 4166 (25.9%)
> EXIT events: 1 ( 0.0%)
> THROTTLE events: 816 ( 5.1%)
> UNTHROTTLE events: 613 ( 3.8%)
> FORK events: 4165 (25.9%)
> SAMPLE events: 15 ( 0.1%)
> MMAP2 events: 6133 (38.2%)
> LOST_SAMPLES events: 1 ( 0.0%)
> KSYMBOL events: 69 ( 0.4%)
> BPF_EVENT events: 57 ( 0.4%)
> FINISHED_ROUND events: 3 ( 0.0%)
> ID_INDEX events: 1 ( 0.0%)
> THREAD_MAP events: 1 ( 0.0%)
> CPU_MAP events: 1 ( 0.0%)
> TIME_CONV events: 1 ( 0.0%)
> FINISHED_INIT events: 1 ( 0.0%)
> ibs_op//p stats:
> SAMPLE events: 15
> LOST_SAMPLES events: 3991
>
> Note that the total aggregated stats show 1 LOST_SAMPLES event but
> per event stats show 3991 events because it's the actual number of
> dropped samples while the aggregated stats has the number of record.
> Maybe we need to change the per-event stats to 'LOST_SAMPLES count'
> to avoid the confusion.
>
> The code is available at 'perf/bpf-filter-v1' branch in my tree.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Again, you need tip/perf/core kernel for this to work.
> Any feedback is welcome.
This is great! I wonder about some related cleanup:
- can we remove BPF events as this is a better feature?
- I believe BPF events are flaky, seldom used (with the exception
of the augmented syscalls for perf trace, which really should move to
a BPF skeleton as most people don't know how to use it) and they add a
bunch of complexity. A particular complexity I care about is that the
path separator forward slash ('/') is also the modifier separator for
events.
- what will happen with multiple events/metrics? Perhaps there should
be a way of listing filters so that each filter applies to the
appropriate event in the event list, like cgroups and -G. For metrics
we shuffle the list of events and so maybe the filters need some way
to specify which event they apply to.
- It feels like there should be some BPF way of overcoming the fixed
length number of filters so it is still bounded but not a hardcoded
number.
Thanks,
Ian
> Thanks,
> Namhyung
>
> Namhyung Kim (7):
> perf bpf filter: Introduce basic BPF filter expression
> perf bpf filter: Implement event sample filtering
> perf record: Add BPF event filter support
> perf record: Record dropped sample count
> perf bpf filter: Add 'pid' sample data support
> perf bpf filter: Add more weight sample data support
> perf bpf filter: Add data_src sample data support
>
> tools/perf/Documentation/perf-record.txt | 10 +-
> tools/perf/Makefile.perf | 2 +-
> tools/perf/builtin-record.c | 46 ++++--
> tools/perf/util/Build | 16 ++
> tools/perf/util/bpf-filter.c | 117 ++++++++++++++
> tools/perf/util/bpf-filter.h | 48 ++++++
> tools/perf/util/bpf-filter.l | 146 ++++++++++++++++++
> tools/perf/util/bpf-filter.y | 55 +++++++
> tools/perf/util/bpf_counter.c | 3 +-
> tools/perf/util/bpf_skel/sample-filter.h | 25 +++
> tools/perf/util/bpf_skel/sample_filter.bpf.c | 152 +++++++++++++++++++
> tools/perf/util/evsel.c | 2 +
> tools/perf/util/evsel.h | 7 +-
> tools/perf/util/parse-events.c | 4 +
> tools/perf/util/session.c | 3 +-
> 15 files changed, 615 insertions(+), 21 deletions(-)
> create mode 100644 tools/perf/util/bpf-filter.c
> create mode 100644 tools/perf/util/bpf-filter.h
> create mode 100644 tools/perf/util/bpf-filter.l
> create mode 100644 tools/perf/util/bpf-filter.y
> create mode 100644 tools/perf/util/bpf_skel/sample-filter.h
> create mode 100644 tools/perf/util/bpf_skel/sample_filter.bpf.c
>
>
> base-commit: 37f322cd58d81a9d46456531281c908de9ef6e42
> --
> 2.39.1.581.gbfd45094c4-goog
>
Hi Ian,
On Tue, Feb 14, 2023 at 8:58 AM Ian Rogers <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
> >
> > [SNIP]
>
> This is great! I wonder about related clean up:
>
> - can we remove BPF events as this is a better feature?
> - I believe BPF events are flaky, seldom used (with the exception
> of the augmented syscalls for perf trace, which really should move to
> a BPF skeleton as most people don't know how to use it) and they add a
> bunch of complexity. A particular complexity I care about is that the
> path separator forward slash ('/') is also the modifier separator for
> events.
Well.. I actually never tried the BPF events myself :)
I think we can deprecate them and remove them once the perf trace
conversion is done.
>
> - what will happen with multiple events/metrics? Perhaps there should
> be a way of listing filters so that each filter applies to the
> appropriate event in the event list, like cgroups and -G. For metrics
> we shuffle the list of events and so maybe the filters need some way
> to specify which event they apply to.
For now, a filter is applied to the last event specified by '-e' before
the --filter option. As it's local to the event, you should be able to
use an appropriate one for each event. I didn't consider metrics since
this is for perf record only.
>
> - It feels like there should be some BPF way of overcoming the fixed
> length number of filters so it is still bounded but not a hardcoded
> number.
Maybe.. but note that the hardcoded max exists only to satisfy the
BPF verifier. At runtime, it stops after processing the actual number
of filter items.
Thanks,
Namhyung
On Tue, Feb 14, 2023 at 8:10 AM Ian Rogers <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
[SNIP]
> > diff --git a/tools/perf/util/bpf-filter.y b/tools/perf/util/bpf-filter.y
> > new file mode 100644
> > index 000000000000..0bf36ec30abf
> > --- /dev/null
> > +++ b/tools/perf/util/bpf-filter.y
> > @@ -0,0 +1,52 @@
> > +%parse-param {struct list_head *expr_head}
> > +
> > +%{
> > +
> > +#include <stdio.h>
> > +#include <string.h>
> > +#include <linux/compiler.h>
> > +#include <linux/list.h>
> > +#include "bpf-filter.h"
> > +
> > +static void perf_bpf_filter_error(struct list_head *expr __maybe_unused,
> > + char const *msg)
> > +{
> > + printf("perf_bpf_filter: %s\n", msg);
> > +}
> > +
> > +%}
> > +
> > +%union
> > +{
> > + unsigned long num;
> > + unsigned long sample;
> > + enum perf_bpf_filter_op op;
> > + struct perf_bpf_filter_expr *expr;
> > +}
> > +
> > +%token BFT_SAMPLE BFT_OP BFT_ERROR BFT_NUM
> > +%type <expr> filter_term
>
> To avoid memory leaks for parse errors, I think you want here:
> %destructor { free($$); } <expr>
Sure, thanks for the suggestion.
Namhyung
On Tue, Feb 14, 2023 at 8:41 AM Ian Rogers <[email protected]> wrote:
>
> On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
> >
> > When it uses bpf filters, event might drop some samples. It'd be nice
> > if it can report how many samples it lost. As LOST_SAMPLES event can
> > carry the similar information, let's use it for bpf filters.
> >
> > To indicate it's from BPF filters, add a new misc flag for that and
> > do not display cpu load warnings.
>
> Can you potentially have lost samples from being too slow to drain the
> ring buffer and dropped samples because of BPF? Is it possible to
> distinguish lost and dropped with this approach?
Yeah, the former is exactly what the LOST_SAMPLES event gives you.
That one comes from the kernel, while the BPF filter keeps a separate
counter for dropped samples and injects LOST_SAMPLES events
with the new misc flag. So we can differentiate them using the misc
flag, and that's how I suppress the warning for the BPF-dropped ones.
Thanks,
Namhyung
Em Tue, Feb 14, 2023 at 08:57:58AM -0800, Ian Rogers escreveu:
> On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
> > [SNIP]
> > It requires samples satisfy all the filter expressions otherwise it'd
> > drop the sample. IOW filter expressions are connected with logical AND
> > operations implicitly.
> > Essentially the BPF filter expression is:
> > "bpf:" <term> <operator> <value> ("," <term> <operator> <value>)*
bpf is the technology used for that, but this really is about filtering
by fields in the sample type, right? So perhaps we could remove that
"bpf:" part and simply do:
sudo ./perf record -e cycles --filter 'period > 10000' -- ./perf test -w noploop
And could perf notice that such a filter requires the new mechanism and
just use it? It gets more compact and should be unambiguous for
non-tracepoint events?
And for tracepoint events if we can use both mechanisms, then use the
tracepoint one since it requires less setup?
Perhaps use "sample_type.field" to disambiguate if we would like to get a
field from the sample_type and another in the tracepoint if both have
the same name?
And how difficult it would be to just accept the same syntax (or a
superset) of what is available for tracepoint filters? I.e. allow || as
well as &&.
Great stuff!
- Arnaldo
> > [SNIP]
--
- Arnaldo
Hi Arnaldo,
On Tue, Feb 14, 2023 at 11:16 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Tue, Feb 14, 2023 at 08:57:58AM -0800, Ian Rogers escreveu:
> > On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
> > > There have been requests for more sophisticated perf event sample
> > > filtering based on the sample data. Recently the kernel added BPF
> > > programs can access perf sample data and this is the userspace part
> > > to enable such a filtering.
>
> > > This still has some rough edges and needs more improvements. But
> > > I'd like to share the current work and get some feedback for the
> > > directions and idea for further improvements.
>
> > > The kernel changes are in the tip.git tree (perf/core branch) for now.
> > > perf record has --filter option to set filters on the last specified
> > > event in the command line. It worked only for tracepoints and Intel
> > > PT events so far. This patchset extends it to have 'bpf:' prefix in
> > > order to enable the general sample filters using BPF for any events.
>
> > > A new filter expression parser was added (using flex/bison) to process
> > > the filter string. Right now, it only accepts very simple expressions
> > > separated by comma. I'd like to keep the filter expression as simple
> > > as possible.
>
> > > It requires samples satisfy all the filter expressions otherwise it'd
> > > drop the sample. IOW filter expressions are connected with logical AND
> > > operations implicitly.
>
> > > Essentially the BPF filter expression is:
>
> > > "bpf:" <term> <operator> <value> ("," <term> <operator> <value>)*
>
> bpf is the technology used for that, but this really is about filtering
> by fields in the sample type, right? So perhaps we could remove that
> "bpf:" part and simply do:
>
> sudo ./perf record -e cycles --filter 'period > 10000' -- ./perf test -w noploop
>
> And notice that this requires this new mechanism and just use it? It
> gets more compact and should be unambiguous for non-tracepoint events?
>
> And for tracepoint events if we can use both mechanisms, then use the
> tracepoint one since it requires less setup?
Sure, that would work if we could select the filter mechanism based on
the event type.  One thing to note is that the BPF filter requires root
permission even when the event itself does not.  Users might be
surprised if their userspace profiling suddenly requires root.
>
> Perhaps use "sample_type.field" to disambiguate if we would like to get a
> field from the sample_type and another in the tracepoint if both have
> the same name?
I think the tracepoint filters are different as they work on the event-
specific data fields.  From the sample data's perspective it's just
RAW data, and the current BPF filter does nothing with it.
So I'd rather simply delegate that to the tracepoint mechanism.
>
> And how difficult it would be to just accept the same syntax (or a
> superset) of what is available for tracepoint filters? I.e. allow || as
> well as &&.
Making the parser accept that syntax would not be that difficult.
But I'm not sure how the BPF program would do the job and how we
could build the map to achieve that.
>
> Great stuff!
Thanks!
Namhyung
>
> SNIP
>
> --
>
> - Arnaldo
On Mon, Feb 13, 2023 at 09:04:49PM -0800, Namhyung Kim wrote:
SNIP
> @@ -1929,12 +1923,27 @@ static void record__read_lost_samples(struct record *rec)
>
> for (int x = 0; x < xyarray__max_x(xy); x++) {
> for (int y = 0; y < xyarray__max_y(xy); y++) {
> - __record__read_lost_samples(rec, evsel, lost, x, y);
> + struct perf_counts_values count;
> +
> + if (perf_evsel__read(&evsel->core, x, y, &count) < 0) {
> + pr_err("read LOST count failed\n");
> + goto out;
> + }
> +
> + if (count.lost) {
> + __record__save_lost_samples(rec, evsel, lost,
> + x, y, count.lost, 0);
> + }
> }
> }
> +
> + lost_count = perf_bpf_filter__lost_count(evsel);
> + if (lost_count)
> + __record__save_lost_samples(rec, evsel, lost, 0, 0, lost_count,
> + PERF_RECORD_MISC_LOST_SAMPLES_BPF);
hi,
I can't see PERF_RECORD_MISC_LOST_SAMPLES_BPF in tip/perf/core, so I
can't compile.  What am I missing?
thanks,
jirka
Em Thu, Feb 16, 2023 at 05:23:05PM +0100, Jiri Olsa escreveu:
> On Mon, Feb 13, 2023 at 09:04:49PM -0800, Namhyung Kim wrote:
>
> SNIP
>
> > SNIP
>
> hi,
> I can't see PERF_RECORD_MISC_LOST_SAMPLES_BPF in the tip/perf/core so can't compile,
> what do I miss?
Humm, but you shouldn't need kernel headers to build tools/perf/, right?
- Arnaldo
On Thu, Feb 16, 2023 at 01:32:14PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Feb 16, 2023 at 05:23:05PM +0100, Jiri Olsa escreveu:
> > On Mon, Feb 13, 2023 at 09:04:49PM -0800, Namhyung Kim wrote:
> >
> > SNIP
> >
> > > SNIP
> >
> > hi,
> > I can't see PERF_RECORD_MISC_LOST_SAMPLES_BPF in the tip/perf/core so can't compile,
> > what do I miss?
>
> Humm, but you shouldn't need kernel headers to build tools/perf/, right?
right, should be also in tools/include headers
jirka
Hi Jiri and Arnaldo,
On Thu, Feb 16, 2023 at 05:34:33PM +0100, Jiri Olsa wrote:
> On Thu, Feb 16, 2023 at 01:32:14PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Thu, Feb 16, 2023 at 05:23:05PM +0100, Jiri Olsa escreveu:
> > > On Mon, Feb 13, 2023 at 09:04:49PM -0800, Namhyung Kim wrote:
> > >
> > > SNIP
> > >
> > > > SNIP
> > >
> > > hi,
> > > I can't see PERF_RECORD_MISC_LOST_SAMPLES_BPF in the tip/perf/core so can't compile,
> > > what do I miss?
> >
> > Humm, but you shouldn't need kernel headers to build tools/perf/, right?
>
> right, should be also in tools/include headers
Yeah, sorry about that.  I'm not sure how I missed that part.
I put it only in tools/lib/perf/include/perf/event.h since it's not
used by the kernel.  Will fix in v2.
Thanks,
Namhyung
---8<---
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index ad47d7b31046..51b9338f4c11 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -70,6 +70,8 @@ struct perf_record_lost {
 	__u64 lost;
 };
 
+#define PERF_RECORD_MISC_LOST_SAMPLES_BPF (1 << 15)
+
 struct perf_record_lost_samples {
 	struct perf_event_header header;
 	__u64 lost;
On Tue, Feb 14, 2023 at 10:01:41AM -0800, Namhyung Kim wrote:
> Hi Ian,
>
> On Tue, Feb 14, 2023 at 8:58 AM Ian Rogers <[email protected]> wrote:
> >
> > On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
> > >
> > > SNIP
> > >
> > > I plan to improve it with range expressions like for ip or addr and it
> > > should support symbols like the existing addr-filters. Also cgroup
> > > should understand and convert cgroup names to IDs.
this seems similar to what ftrace is doing in filter_match_preds,
I checked the code briefly and I wonder if we should be able to write
that function's logic in bpf, assuming that the filter is prepared in
user space
it might solve the 'part' data problem in a generic way.. but I might be
missing some blocker of course.. just an idea ;-)
could replace the tracepoint filters.. if we actually care
SNIP
> >
> > This is great! I wonder about related clean up:
+1
> >
> > - can we remove BPF events as this is a better feature?
> > - I believe BPF events are flaky, seldom used (with the exception
> > of the augmented syscalls for perf trace, which really should move to
> > a BPF skeleton as most people don't know how to use it) and they add a
> > bunch of complexity. A particular complexity I care about is that the
> > path separator forward slash ('/') is also the modifier separator for
> > events.
>
> Well.. I actually never tried the BPF events myself :)
> I think we can deprecate it and get rid of it once the perf trace
> conversion is done.
+1 ;-) would be awesome
jirka
Hi Jiri,
On Tue, Feb 21, 2023 at 3:54 AM Jiri Olsa <[email protected]> wrote:
>
> On Tue, Feb 14, 2023 at 10:01:41AM -0800, Namhyung Kim wrote:
> > Hi Ian,
> >
> > On Tue, Feb 14, 2023 at 8:58 AM Ian Rogers <[email protected]> wrote:
> > >
> > > On Mon, Feb 13, 2023 at 9:05 PM Namhyung Kim <[email protected]> wrote:
> > > >
> > > > SNIP
> > > >
> > > > I plan to improve it with range expressions like for ip or addr and it
> > > > should support symbols like the existing addr-filters. Also cgroup
> > > > should understand and convert cgroup names to IDs.
>
> this seems similar to what ftrace is doing in filter_match_preds,
> I checked the code briefly and I wonder if we shoud be able to write
> that function logic in bpf, assuming that the filter is prepared in
> user space
>
> it might solve the 'part' data problem in generic way.. but I might be
> missing some blocker of course.. just an idea ;-)
>
> could replace the tracepoint filters.. if we actually care
I'm not sure about replacing the tracepoint filters.  IIRC BPF is
optional, so tracepoints should keep working without it.  From BPF's
perspective, it has its own way of handling tracepoints, so there's no
need to go through perf or event tracing (ftrace) for that.
From perf's perspective, I think it can use either the existing ftrace
filters or build a new BPF filter for each event.  But it cannot use BTF
for perf tracepoint events, at least for now.  Certainly it could use the
RAW sample data and parse the event format to access the fields, but I'm
not sure it's worth doing that. :)
Thanks,
Namhyung