2018-03-01 04:21:23

by Alexei Starovoitov

Subject: [PATCH bpf-next 0/5] bpf, tracing: introduce bpf raw tracepoints

This patch set takes a different approach to the pressing need to access
task_struct pointers in sched tracepoints from bpf programs.

The first approach simply added these pointers to the sched tracepoints:
https://lkml.org/lkml/2017/12/14/753
which Peter nacked.
A few options were discussed, and the discussion eventually converged on
bpf-specific probe functions registered via tracepoint_probe_register().
Details here:
https://lkml.org/lkml/2017/12/20/929

Patch 1 is a kernel-wide cleanup that converts pass-struct-by-value into
pass-struct-by-reference for tracepoint arguments.

Patch 2 is minor prep work to expose the number of arguments passed
into tracepoints.

Patch 3 introduces the BPF_RAW_TRACEPOINT api.
Auto-cleanup and multiple concurrent users are must-have
features of a tracing api. For bpf raw tracepoints it looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);

// receive anon_inode fd for given bpf_raw_tracepoint
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");

// attach bpf program to given tracepoint
bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT);

Ctrl-C of the tracing daemon or cmdline tool will automatically
detach the bpf program, unload it and unregister the tracepoint probe.
More details in patch 3.

Patches 4 and 5 add the user space lib support and tests.

samples/bpf/test_overhead performance on 1 cpu:

tracepoint      base    kprobe+bpf   tracepoint+bpf   raw_tracepoint+bpf
task_rename     1.1M    769K         947K             1.0M
urandom_read    789K    697K         750K             755K

Alexei Starovoitov (5):
  treewide: remove struct-pass-by-value from tracepoints arguments
  tracepoint: compute num_args at build time
  bpf: introduce BPF_RAW_TRACEPOINT
  libbpf: add bpf_raw_tracepoint_open helper
  samples/bpf: raw tracepoint test

arch/x86/xen/mmu_pv.c | 16 +--
drivers/gpu/drm/i915/i915_trace.h | 13 +-
drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
drivers/infiniband/hw/hfi1/trace_ctxts.h | 12 +-
drivers/s390/cio/ioasm.c | 18 +--
drivers/s390/cio/trace.h | 50 ++++----
fs/dax.c | 2 +-
include/linux/bpf_types.h | 1 +
include/linux/trace_events.h | 57 +++++++++
include/linux/tracepoint-defs.h | 1 +
include/linux/tracepoint.h | 32 +++--
include/trace/bpf_probe.h | 87 +++++++++++++
include/trace/define_trace.h | 15 ++-
include/trace/events/f2fs.h | 2 +-
include/trace/events/fs_dax.h | 6 +-
include/trace/events/rcu.h | 4 +-
include/trace/events/xen.h | 32 ++---
include/uapi/linux/bpf.h | 11 ++
kernel/bpf/syscall.c | 108 ++++++++++++++++
kernel/rcu/tree.c | 10 +-
kernel/trace/bpf_trace.c | 211 +++++++++++++++++++++++++++++++
kernel/tracepoint.c | 27 ++--
net/wireless/trace.h | 2 +-
samples/bpf/Makefile | 1 +
samples/bpf/bpf_load.c | 13 ++
samples/bpf/test_overhead_raw_tp_kern.c | 17 +++
samples/bpf/test_overhead_user.c | 12 ++
sound/firewire/amdtp-stream-trace.h | 2 +-
tools/include/uapi/linux/bpf.h | 11 ++
tools/lib/bpf/bpf.c | 10 ++
tools/lib/bpf/bpf.h | 1 +
31 files changed, 677 insertions(+), 109 deletions(-)
create mode 100644 include/trace/bpf_probe.h
create mode 100644 samples/bpf/test_overhead_raw_tp_kern.c

--
2.9.5



2018-03-01 04:21:46

by Alexei Starovoitov

Subject: [PATCH bpf-next 1/5] treewide: remove struct-pass-by-value from tracepoints arguments

Fix all tracepoint arguments to pass structures (large and small) by reference
instead of by value.
Avoiding passing large structs by value is good coding style.
Passing small structs by value is sometimes beneficial, but in all the cases
touched here it makes no difference to the readability of the code.
The subsequent patch enforces that all tracepoint args are either integers
or pointers and fit into 64 bits.
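
To illustrate why the "integer or pointer, fits into 64 bits" rule is
convenient, here is a rough user-space sketch. It is not part of this patch.
All demo_* names are hypothetical, and the actual probe plumbing in patch 3
may look different. The point is only that once a struct is passed by
reference, every argument can travel in one 64-bit slot:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical small struct, standing in for types such as pte_t. */
struct demo_id {
	uint8_t cssid;
	uint8_t ssid;
	uint16_t sch_no;
};

/* Hypothetical probe signature: every argument occupies one 64-bit slot. */
typedef void (*demo_probe_t)(const uint64_t *args, unsigned int num_args);

static uint16_t last_sch_no;
static int last_cc;

/* The probe recovers the struct through the pointer stored in slot 0. */
void demo_probe(const uint64_t *args, unsigned int num_args)
{
	const struct demo_id *id = (const struct demo_id *)(uintptr_t)args[0];

	assert(num_args == 2);
	last_sch_no = id->sch_no;
	last_cc = (int)args[1];
}

/* Call site: pass &id rather than id, as the converted call sites do. */
void demo_trace(const struct demo_id *id, int cc, demo_probe_t probe)
{
	uint64_t args[2];

	args[0] = (uint64_t)(uintptr_t)id;	/* a pointer fits in 64 bits */
	args[1] = (uint64_t)cc;			/* an integer fits in 64 bits */
	probe(args, 2);
}

/* Returns 1 when the probe saw the values handed to demo_trace(). */
int demo_run(void)
{
	struct demo_id id = { .cssid = 0, .ssid = 1, .sch_no = 0x1234 };

	demo_trace(&id, 3, demo_probe);
	return last_sch_no == 0x1234 && last_cc == 3;
}
```

A struct passed by value would not map onto a single slot in general, which
is what the by-reference conversion avoids.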

Signed-off-by: Alexei Starovoitov <[email protected]>
---
arch/x86/xen/mmu_pv.c | 16 +++++-----
drivers/gpu/drm/i915/i915_trace.h | 13 +++++++--
drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
drivers/infiniband/hw/hfi1/trace_ctxts.h | 12 ++++----
drivers/s390/cio/ioasm.c | 18 ++++++------
drivers/s390/cio/trace.h | 50 ++++++++++++++++----------------
fs/dax.c | 2 +-
include/trace/events/f2fs.h | 2 +-
include/trace/events/fs_dax.h | 6 ++--
include/trace/events/rcu.h | 4 +--
include/trace/events/xen.h | 32 ++++++++++----------
kernel/rcu/tree.c | 10 +++----
net/wireless/trace.h | 2 +-
sound/firewire/amdtp-stream-trace.h | 2 +-
14 files changed, 89 insertions(+), 82 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index aae88fec9941..b1a8061c3b28 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -218,7 +218,7 @@ static void xen_set_pmd_hyper(pmd_t *ptr, pmd_t val)

static void xen_set_pmd(pmd_t *ptr, pmd_t val)
{
- trace_xen_mmu_set_pmd(ptr, val);
+ trace_xen_mmu_set_pmd(ptr, &val);

/* If page is not pinned, we can just update the entry
directly */
@@ -277,14 +277,14 @@ static inline void __xen_set_pte(pte_t *ptep, pte_t pteval)

static void xen_set_pte(pte_t *ptep, pte_t pteval)
{
- trace_xen_mmu_set_pte(ptep, pteval);
+ trace_xen_mmu_set_pte(ptep, &pteval);
__xen_set_pte(ptep, pteval);
}

static void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval)
{
- trace_xen_mmu_set_pte_at(mm, addr, ptep, pteval);
+ trace_xen_mmu_set_pte_at(mm, addr, ptep, &pteval);
__xen_set_pte(ptep, pteval);
}

@@ -292,7 +292,7 @@ pte_t xen_ptep_modify_prot_start(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
/* Just return the pte as-is. We preserve the bits on commit */
- trace_xen_mmu_ptep_modify_prot_start(mm, addr, ptep, *ptep);
+ trace_xen_mmu_ptep_modify_prot_start(mm, addr, ptep, ptep);
return *ptep;
}

@@ -301,7 +301,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
{
struct mmu_update u;

- trace_xen_mmu_ptep_modify_prot_commit(mm, addr, ptep, pte);
+ trace_xen_mmu_ptep_modify_prot_commit(mm, addr, ptep, &pte);
xen_mc_batch();

u.ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD;
@@ -409,7 +409,7 @@ static void xen_set_pud_hyper(pud_t *ptr, pud_t val)

static void xen_set_pud(pud_t *ptr, pud_t val)
{
- trace_xen_mmu_set_pud(ptr, val);
+ trace_xen_mmu_set_pud(ptr, &val);

/* If page is not pinned, we can just update the entry
directly */
@@ -424,7 +424,7 @@ static void xen_set_pud(pud_t *ptr, pud_t val)
#ifdef CONFIG_X86_PAE
static void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
{
- trace_xen_mmu_set_pte_atomic(ptep, pte);
+ trace_xen_mmu_set_pte_atomic(ptep, &pte);
set_64bit((u64 *)ptep, native_pte_val(pte));
}

@@ -514,7 +514,7 @@ static void xen_set_p4d(p4d_t *ptr, p4d_t val)
pgd_t *user_ptr = xen_get_user_pgd((pgd_t *)ptr);
pgd_t pgd_val;

- trace_xen_mmu_set_p4d(ptr, (p4d_t *)user_ptr, val);
+ trace_xen_mmu_set_p4d(ptr, (p4d_t *)user_ptr, &val);

/* If page is not pinned, we can just update the entry
directly */
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index e1169c02eb2b..681da1f51911 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -849,8 +849,8 @@ TRACE_EVENT(i915_flip_complete,
TP_printk("plane=%d, obj=%p", __entry->plane, __entry->obj)
);

-TRACE_EVENT_CONDITION(i915_reg_rw,
- TP_PROTO(bool write, i915_reg_t reg, u64 val, int len, bool trace),
+TRACE_EVENT_CONDITION(i915_reg_rw__,
+ TP_PROTO(bool write, u32 reg, u64 val, int len, bool trace),

TP_ARGS(write, reg, val, len, trace),

@@ -865,7 +865,7 @@ TRACE_EVENT_CONDITION(i915_reg_rw,

TP_fast_assign(
__entry->val = (u64)val;
- __entry->reg = i915_mmio_reg_offset(reg);
+ __entry->reg = reg;
__entry->write = write;
__entry->len = len;
),
@@ -876,6 +876,13 @@ TRACE_EVENT_CONDITION(i915_reg_rw,
(u32)(__entry->val & 0xffffffff),
(u32)(__entry->val >> 32))
);
+#if !defined(CREATE_TRACE_POINTS) && !defined(TRACE_HEADER_MULTI_READ)
+static inline void trace_i915_reg_rw(bool write, i915_reg_t reg, u64 val,
+ int len, bool trace)
+{
+ trace_i915_reg_rw__(write, i915_mmio_reg_offset(reg), val, len, trace);
+}
+#endif

TRACE_EVENT(intel_gpu_freq_change,
TP_PROTO(u32 freq),
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 41fafebe3b0d..da4aa1a95b11 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1153,7 +1153,7 @@ static int get_ctxt_info(struct hfi1_filedata *fd, unsigned long arg, u32 len)
cinfo.sdma_ring_size = fd->cq->nentries;
cinfo.rcvegr_size = uctxt->egrbufs.rcvtid_size;

- trace_hfi1_ctxt_info(uctxt->dd, uctxt->ctxt, fd->subctxt, cinfo);
+ trace_hfi1_ctxt_info(uctxt->dd, uctxt->ctxt, fd->subctxt, &cinfo);
if (copy_to_user((void __user *)arg, &cinfo, len))
return -EFAULT;

diff --git a/drivers/infiniband/hw/hfi1/trace_ctxts.h b/drivers/infiniband/hw/hfi1/trace_ctxts.h
index 4eb4cc798035..e00c8a7d559c 100644
--- a/drivers/infiniband/hw/hfi1/trace_ctxts.h
+++ b/drivers/infiniband/hw/hfi1/trace_ctxts.h
@@ -106,7 +106,7 @@ TRACE_EVENT(hfi1_uctxtdata,
TRACE_EVENT(hfi1_ctxt_info,
TP_PROTO(struct hfi1_devdata *dd, unsigned int ctxt,
unsigned int subctxt,
- struct hfi1_ctxt_info cinfo),
+ struct hfi1_ctxt_info *cinfo),
TP_ARGS(dd, ctxt, subctxt, cinfo),
TP_STRUCT__entry(DD_DEV_ENTRY(dd)
__field(unsigned int, ctxt)
@@ -120,11 +120,11 @@ TRACE_EVENT(hfi1_ctxt_info,
TP_fast_assign(DD_DEV_ASSIGN(dd);
__entry->ctxt = ctxt;
__entry->subctxt = subctxt;
- __entry->egrtids = cinfo.egrtids;
- __entry->rcvhdrq_cnt = cinfo.rcvhdrq_cnt;
- __entry->rcvhdrq_size = cinfo.rcvhdrq_entsize;
- __entry->sdma_ring_size = cinfo.sdma_ring_size;
- __entry->rcvegr_size = cinfo.rcvegr_size;
+ __entry->egrtids = cinfo->egrtids;
+ __entry->rcvhdrq_cnt = cinfo->rcvhdrq_cnt;
+ __entry->rcvhdrq_size = cinfo->rcvhdrq_entsize;
+ __entry->sdma_ring_size = cinfo->sdma_ring_size;
+ __entry->rcvegr_size = cinfo->rcvegr_size;
),
TP_printk("[%s] ctxt %u:%u " CINFO_FMT,
__get_str(dev),
diff --git a/drivers/s390/cio/ioasm.c b/drivers/s390/cio/ioasm.c
index 4fa9ee1d09fa..0aecb6314e6f 100644
--- a/drivers/s390/cio/ioasm.c
+++ b/drivers/s390/cio/ioasm.c
@@ -35,7 +35,7 @@ int stsch(struct subchannel_id schid, struct schib *addr)
int ccode;

ccode = __stsch(schid, addr);
- trace_s390_cio_stsch(schid, addr, ccode);
+ trace_s390_cio_stsch(&schid, addr, ccode);

return ccode;
}
@@ -63,7 +63,7 @@ int msch(struct subchannel_id schid, struct schib *addr)
int ccode;

ccode = __msch(schid, addr);
- trace_s390_cio_msch(schid, addr, ccode);
+ trace_s390_cio_msch(&schid, addr, ccode);

return ccode;
}
@@ -88,7 +88,7 @@ int tsch(struct subchannel_id schid, struct irb *addr)
int ccode;

ccode = __tsch(schid, addr);
- trace_s390_cio_tsch(schid, addr, ccode);
+ trace_s390_cio_tsch(&schid, addr, ccode);

return ccode;
}
@@ -115,7 +115,7 @@ int ssch(struct subchannel_id schid, union orb *addr)
int ccode;

ccode = __ssch(schid, addr);
- trace_s390_cio_ssch(schid, addr, ccode);
+ trace_s390_cio_ssch(&schid, addr, ccode);

return ccode;
}
@@ -141,7 +141,7 @@ int csch(struct subchannel_id schid)
int ccode;

ccode = __csch(schid);
- trace_s390_cio_csch(schid, ccode);
+ trace_s390_cio_csch(&schid, ccode);

return ccode;
}
@@ -202,7 +202,7 @@ int rchp(struct chp_id chpid)
int ccode;

ccode = __rchp(chpid);
- trace_s390_cio_rchp(chpid, ccode);
+ trace_s390_cio_rchp(&chpid, ccode);

return ccode;
}
@@ -228,7 +228,7 @@ int rsch(struct subchannel_id schid)
int ccode;

ccode = __rsch(schid);
- trace_s390_cio_rsch(schid, ccode);
+ trace_s390_cio_rsch(&schid, ccode);

return ccode;
}
@@ -253,7 +253,7 @@ int hsch(struct subchannel_id schid)
int ccode;

ccode = __hsch(schid);
- trace_s390_cio_hsch(schid, ccode);
+ trace_s390_cio_hsch(&schid, ccode);

return ccode;
}
@@ -278,7 +278,7 @@ int xsch(struct subchannel_id schid)
int ccode;

ccode = __xsch(schid);
- trace_s390_cio_xsch(schid, ccode);
+ trace_s390_cio_xsch(&schid, ccode);

return ccode;
}
diff --git a/drivers/s390/cio/trace.h b/drivers/s390/cio/trace.h
index 1f8d1c1e566d..4aa6d1426106 100644
--- a/drivers/s390/cio/trace.h
+++ b/drivers/s390/cio/trace.h
@@ -22,7 +22,7 @@
#include <linux/tracepoint.h>

DECLARE_EVENT_CLASS(s390_class_schib,
- TP_PROTO(struct subchannel_id schid, struct schib *schib, int cc),
+ TP_PROTO(struct subchannel_id *schid, struct schib *schib, int cc),
TP_ARGS(schid, schib, cc),
TP_STRUCT__entry(
__field(u8, cssid)
@@ -33,9 +33,9 @@ DECLARE_EVENT_CLASS(s390_class_schib,
__field(int, cc)
),
TP_fast_assign(
- __entry->cssid = schid.cssid;
- __entry->ssid = schid.ssid;
- __entry->schno = schid.sch_no;
+ __entry->cssid = schid->cssid;
+ __entry->ssid = schid->ssid;
+ __entry->schno = schid->sch_no;
__entry->devno = schib->pmcw.dev;
__entry->schib = *schib;
__entry->cc = cc;
@@ -60,7 +60,7 @@ DECLARE_EVENT_CLASS(s390_class_schib,
* @cc: Condition code
*/
DEFINE_EVENT(s390_class_schib, s390_cio_stsch,
- TP_PROTO(struct subchannel_id schid, struct schib *schib, int cc),
+ TP_PROTO(struct subchannel_id *schid, struct schib *schib, int cc),
TP_ARGS(schid, schib, cc)
);

@@ -71,7 +71,7 @@ DEFINE_EVENT(s390_class_schib, s390_cio_stsch,
* @cc: Condition code
*/
DEFINE_EVENT(s390_class_schib, s390_cio_msch,
- TP_PROTO(struct subchannel_id schid, struct schib *schib, int cc),
+ TP_PROTO(struct subchannel_id *schid, struct schib *schib, int cc),
TP_ARGS(schid, schib, cc)
);

@@ -82,7 +82,7 @@ DEFINE_EVENT(s390_class_schib, s390_cio_msch,
* @cc: Condition code
*/
TRACE_EVENT(s390_cio_tsch,
- TP_PROTO(struct subchannel_id schid, struct irb *irb, int cc),
+ TP_PROTO(struct subchannel_id *schid, struct irb *irb, int cc),
TP_ARGS(schid, irb, cc),
TP_STRUCT__entry(
__field(u8, cssid)
@@ -92,9 +92,9 @@ TRACE_EVENT(s390_cio_tsch,
__field(int, cc)
),
TP_fast_assign(
- __entry->cssid = schid.cssid;
- __entry->ssid = schid.ssid;
- __entry->schno = schid.sch_no;
+ __entry->cssid = schid->cssid;
+ __entry->ssid = schid->ssid;
+ __entry->schno = schid->sch_no;
__entry->irb = *irb;
__entry->cc = cc;
),
@@ -151,7 +151,7 @@ TRACE_EVENT(s390_cio_tpi,
* @cc: Condition code
*/
TRACE_EVENT(s390_cio_ssch,
- TP_PROTO(struct subchannel_id schid, union orb *orb, int cc),
+ TP_PROTO(struct subchannel_id *schid, union orb *orb, int cc),
TP_ARGS(schid, orb, cc),
TP_STRUCT__entry(
__field(u8, cssid)
@@ -161,9 +161,9 @@ TRACE_EVENT(s390_cio_ssch,
__field(int, cc)
),
TP_fast_assign(
- __entry->cssid = schid.cssid;
- __entry->ssid = schid.ssid;
- __entry->schno = schid.sch_no;
+ __entry->cssid = schid->cssid;
+ __entry->ssid = schid->ssid;
+ __entry->schno = schid->sch_no;
__entry->orb = *orb;
__entry->cc = cc;
),
@@ -173,7 +173,7 @@ TRACE_EVENT(s390_cio_ssch,
);

DECLARE_EVENT_CLASS(s390_class_schid,
- TP_PROTO(struct subchannel_id schid, int cc),
+ TP_PROTO(struct subchannel_id *schid, int cc),
TP_ARGS(schid, cc),
TP_STRUCT__entry(
__field(u8, cssid)
@@ -182,9 +182,9 @@ DECLARE_EVENT_CLASS(s390_class_schid,
__field(int, cc)
),
TP_fast_assign(
- __entry->cssid = schid.cssid;
- __entry->ssid = schid.ssid;
- __entry->schno = schid.sch_no;
+ __entry->cssid = schid->cssid;
+ __entry->ssid = schid->ssid;
+ __entry->schno = schid->sch_no;
__entry->cc = cc;
),
TP_printk("schid=%x.%x.%04x cc=%d", __entry->cssid, __entry->ssid,
@@ -198,7 +198,7 @@ DECLARE_EVENT_CLASS(s390_class_schid,
* @cc: Condition code
*/
DEFINE_EVENT(s390_class_schid, s390_cio_csch,
- TP_PROTO(struct subchannel_id schid, int cc),
+ TP_PROTO(struct subchannel_id *schid, int cc),
TP_ARGS(schid, cc)
);

@@ -208,7 +208,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_csch,
* @cc: Condition code
*/
DEFINE_EVENT(s390_class_schid, s390_cio_hsch,
- TP_PROTO(struct subchannel_id schid, int cc),
+ TP_PROTO(struct subchannel_id *schid, int cc),
TP_ARGS(schid, cc)
);

@@ -218,7 +218,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_hsch,
* @cc: Condition code
*/
DEFINE_EVENT(s390_class_schid, s390_cio_xsch,
- TP_PROTO(struct subchannel_id schid, int cc),
+ TP_PROTO(struct subchannel_id *schid, int cc),
TP_ARGS(schid, cc)
);

@@ -228,7 +228,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_xsch,
* @cc: Condition code
*/
DEFINE_EVENT(s390_class_schid, s390_cio_rsch,
- TP_PROTO(struct subchannel_id schid, int cc),
+ TP_PROTO(struct subchannel_id *schid, int cc),
TP_ARGS(schid, cc)
);

@@ -238,7 +238,7 @@ DEFINE_EVENT(s390_class_schid, s390_cio_rsch,
* @cc: Condition code
*/
TRACE_EVENT(s390_cio_rchp,
- TP_PROTO(struct chp_id chpid, int cc),
+ TP_PROTO(struct chp_id *chpid, int cc),
TP_ARGS(chpid, cc),
TP_STRUCT__entry(
__field(u8, cssid)
@@ -246,8 +246,8 @@ TRACE_EVENT(s390_cio_rchp,
__field(int, cc)
),
TP_fast_assign(
- __entry->cssid = chpid.cssid;
- __entry->id = chpid.id;
+ __entry->cssid = chpid->cssid;
+ __entry->id = chpid->id;
__entry->cc = cc;
),
TP_printk("chpid=%x.%02x cc=%d", __entry->cssid, __entry->id,
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..6d03ead8e788 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1429,7 +1429,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
goto finish_iomap;
}

- trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
+ trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, &pfn, entry);
result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
write);
break;
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 06c87f9f720c..795698925d20 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -491,7 +491,7 @@ DEFINE_EVENT(f2fs__truncate_node, f2fs_truncate_node,

TRACE_EVENT(f2fs_truncate_partial_nodes,

- TP_PROTO(struct inode *inode, nid_t nid[], int depth, int err),
+ TP_PROTO(struct inode *inode, nid_t *nid, int depth, int err),

TP_ARGS(inode, nid, depth, err),

diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 97b09fcf7e52..5a6a8285750f 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -104,7 +104,7 @@ DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole_fallback);

DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
TP_PROTO(struct inode *inode, struct vm_fault *vmf,
- long length, pfn_t pfn, void *radix_entry),
+ long length, pfn_t *pfn, void *radix_entry),
TP_ARGS(inode, vmf, length, pfn, radix_entry),
TP_STRUCT__entry(
__field(unsigned long, ino)
@@ -123,7 +123,7 @@ DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
__entry->address = vmf->address;
__entry->write = vmf->flags & FAULT_FLAG_WRITE;
__entry->length = length;
- __entry->pfn_val = pfn.val;
+ __entry->pfn_val = pfn->val;
__entry->radix_entry = radix_entry;
),
TP_printk("dev %d:%d ino %#lx %s %s address %#lx length %#lx "
@@ -145,7 +145,7 @@ DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
#define DEFINE_PMD_INSERT_MAPPING_EVENT(name) \
DEFINE_EVENT(dax_pmd_insert_mapping_class, name, \
TP_PROTO(struct inode *inode, struct vm_fault *vmf, \
- long length, pfn_t pfn, void *radix_entry), \
+ long length, pfn_t *pfn, void *radix_entry), \
TP_ARGS(inode, vmf, length, pfn, radix_entry))

DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping);
diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 0b50fda80db0..4b463294306f 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -436,7 +436,7 @@ TRACE_EVENT(rcu_fqs,
*/
TRACE_EVENT(rcu_dyntick,

- TP_PROTO(const char *polarity, long oldnesting, long newnesting, atomic_t dynticks),
+ TP_PROTO(const char *polarity, long oldnesting, long newnesting, atomic_t *dynticks),

TP_ARGS(polarity, oldnesting, newnesting, dynticks),

@@ -451,7 +451,7 @@ TRACE_EVENT(rcu_dyntick,
__entry->polarity = polarity;
__entry->oldnesting = oldnesting;
__entry->newnesting = newnesting;
- __entry->dynticks = atomic_read(&dynticks);
+ __entry->dynticks = atomic_read(dynticks);
),

TP_printk("%s %lx %lx %#3x", __entry->polarity,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 7dd8f34c37df..ea9e9014f0c5 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -128,14 +128,14 @@ TRACE_EVENT(xen_mc_extend_args,
TRACE_DEFINE_SIZEOF(pteval_t);
/* mmu */
DECLARE_EVENT_CLASS(xen_mmu__set_pte,
- TP_PROTO(pte_t *ptep, pte_t pteval),
+ TP_PROTO(pte_t *ptep, pte_t *pteval),
TP_ARGS(ptep, pteval),
TP_STRUCT__entry(
__field(pte_t *, ptep)
__field(pteval_t, pteval)
),
TP_fast_assign(__entry->ptep = ptep;
- __entry->pteval = pteval.pte),
+ __entry->pteval = pteval->pte),
TP_printk("ptep %p pteval %0*llx (raw %0*llx)",
__entry->ptep,
(int)sizeof(pteval_t) * 2, (unsigned long long)pte_val(native_make_pte(__entry->pteval)),
@@ -144,14 +144,14 @@ DECLARE_EVENT_CLASS(xen_mmu__set_pte,

#define DEFINE_XEN_MMU_SET_PTE(name) \
DEFINE_EVENT(xen_mmu__set_pte, name, \
- TP_PROTO(pte_t *ptep, pte_t pteval), \
+ TP_PROTO(pte_t *ptep, pte_t *pteval), \
TP_ARGS(ptep, pteval))

DEFINE_XEN_MMU_SET_PTE(xen_mmu_set_pte);

TRACE_EVENT(xen_mmu_set_pte_at,
TP_PROTO(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pteval),
+ pte_t *ptep, pte_t *pteval),
TP_ARGS(mm, addr, ptep, pteval),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
@@ -162,7 +162,7 @@ TRACE_EVENT(xen_mmu_set_pte_at,
TP_fast_assign(__entry->mm = mm;
__entry->addr = addr;
__entry->ptep = ptep;
- __entry->pteval = pteval.pte),
+ __entry->pteval = pteval->pte),
TP_printk("mm %p addr %lx ptep %p pteval %0*llx (raw %0*llx)",
__entry->mm, __entry->addr, __entry->ptep,
(int)sizeof(pteval_t) * 2, (unsigned long long)pte_val(native_make_pte(__entry->pteval)),
@@ -172,14 +172,14 @@ TRACE_EVENT(xen_mmu_set_pte_at,
TRACE_DEFINE_SIZEOF(pmdval_t);

TRACE_EVENT(xen_mmu_set_pmd,
- TP_PROTO(pmd_t *pmdp, pmd_t pmdval),
+ TP_PROTO(pmd_t *pmdp, pmd_t *pmdval),
TP_ARGS(pmdp, pmdval),
TP_STRUCT__entry(
__field(pmd_t *, pmdp)
__field(pmdval_t, pmdval)
),
TP_fast_assign(__entry->pmdp = pmdp;
- __entry->pmdval = pmdval.pmd),
+ __entry->pmdval = pmdval->pmd),
TP_printk("pmdp %p pmdval %0*llx (raw %0*llx)",
__entry->pmdp,
(int)sizeof(pmdval_t) * 2, (unsigned long long)pmd_val(native_make_pmd(__entry->pmdval)),
@@ -220,14 +220,14 @@ TRACE_EVENT(xen_mmu_pmd_clear,
TRACE_DEFINE_SIZEOF(pudval_t);

TRACE_EVENT(xen_mmu_set_pud,
- TP_PROTO(pud_t *pudp, pud_t pudval),
+ TP_PROTO(pud_t *pudp, pud_t *pudval),
TP_ARGS(pudp, pudval),
TP_STRUCT__entry(
__field(pud_t *, pudp)
__field(pudval_t, pudval)
),
TP_fast_assign(__entry->pudp = pudp;
- __entry->pudval = native_pud_val(pudval)),
+ __entry->pudval = native_pud_val(*pudval)),
TP_printk("pudp %p pudval %0*llx (raw %0*llx)",
__entry->pudp,
(int)sizeof(pudval_t) * 2, (unsigned long long)pud_val(native_make_pud(__entry->pudval)),
@@ -237,7 +237,7 @@ TRACE_EVENT(xen_mmu_set_pud,
TRACE_DEFINE_SIZEOF(p4dval_t);

TRACE_EVENT(xen_mmu_set_p4d,
- TP_PROTO(p4d_t *p4dp, p4d_t *user_p4dp, p4d_t p4dval),
+ TP_PROTO(p4d_t *p4dp, p4d_t *user_p4dp, p4d_t *p4dval),
TP_ARGS(p4dp, user_p4dp, p4dval),
TP_STRUCT__entry(
__field(p4d_t *, p4dp)
@@ -246,7 +246,7 @@ TRACE_EVENT(xen_mmu_set_p4d,
),
TP_fast_assign(__entry->p4dp = p4dp;
__entry->user_p4dp = user_p4dp;
- __entry->p4dval = p4d_val(p4dval)),
+ __entry->p4dval = p4d_val(*p4dval)),
TP_printk("p4dp %p user_p4dp %p p4dval %0*llx (raw %0*llx)",
__entry->p4dp, __entry->user_p4dp,
(int)sizeof(p4dval_t) * 2, (unsigned long long)pgd_val(native_make_pgd(__entry->p4dval)),
@@ -255,14 +255,14 @@ TRACE_EVENT(xen_mmu_set_p4d,
#else

TRACE_EVENT(xen_mmu_set_pud,
- TP_PROTO(pud_t *pudp, pud_t pudval),
+ TP_PROTO(pud_t *pudp, pud_t *pudval),
TP_ARGS(pudp, pudval),
TP_STRUCT__entry(
__field(pud_t *, pudp)
__field(pudval_t, pudval)
),
TP_fast_assign(__entry->pudp = pudp;
- __entry->pudval = native_pud_val(pudval)),
+ __entry->pudval = native_pud_val(*pudval)),
TP_printk("pudp %p pudval %0*llx (raw %0*llx)",
__entry->pudp,
(int)sizeof(pudval_t) * 2, (unsigned long long)pgd_val(native_make_pgd(__entry->pudval)),
@@ -273,7 +273,7 @@ TRACE_EVENT(xen_mmu_set_pud,

DECLARE_EVENT_CLASS(xen_mmu_ptep_modify_prot,
TP_PROTO(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pteval),
+ pte_t *ptep, pte_t *pteval),
TP_ARGS(mm, addr, ptep, pteval),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
@@ -284,7 +284,7 @@ DECLARE_EVENT_CLASS(xen_mmu_ptep_modify_prot,
TP_fast_assign(__entry->mm = mm;
__entry->addr = addr;
__entry->ptep = ptep;
- __entry->pteval = pteval.pte),
+ __entry->pteval = pteval->pte),
TP_printk("mm %p addr %lx ptep %p pteval %0*llx (raw %0*llx)",
__entry->mm, __entry->addr, __entry->ptep,
(int)sizeof(pteval_t) * 2, (unsigned long long)pte_val(native_make_pte(__entry->pteval)),
@@ -293,7 +293,7 @@ DECLARE_EVENT_CLASS(xen_mmu_ptep_modify_prot,
#define DEFINE_XEN_MMU_PTEP_MODIFY_PROT(name) \
DEFINE_EVENT(xen_mmu_ptep_modify_prot, name, \
TP_PROTO(struct mm_struct *mm, unsigned long addr, \
- pte_t *ptep, pte_t pteval), \
+ pte_t *ptep, pte_t *pteval), \
TP_ARGS(mm, addr, ptep, pteval))

DEFINE_XEN_MMU_PTEP_MODIFY_PROT(xen_mmu_ptep_modify_prot_start);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 491bdf39f276..43c0f899f78c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -772,7 +772,7 @@ static void rcu_eqs_enter(bool user)
}

lockdep_assert_irqs_disabled();
- trace_rcu_dyntick(TPS("Start"), rdtp->dynticks_nesting, 0, rdtp->dynticks);
+ trace_rcu_dyntick(TPS("Start"), rdtp->dynticks_nesting, 0, &rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
for_each_rcu_flavor(rsp) {
rdp = this_cpu_ptr(rsp->rda);
@@ -848,14 +848,14 @@ void rcu_nmi_exit(void)
* leave it in non-RCU-idle state.
*/
if (rdtp->dynticks_nmi_nesting != 1) {
- trace_rcu_dyntick(TPS("--="), rdtp->dynticks_nmi_nesting, rdtp->dynticks_nmi_nesting - 2, rdtp->dynticks);
+ trace_rcu_dyntick(TPS("--="), rdtp->dynticks_nmi_nesting, rdtp->dynticks_nmi_nesting - 2, &rdtp->dynticks);
WRITE_ONCE(rdtp->dynticks_nmi_nesting, /* No store tearing. */
rdtp->dynticks_nmi_nesting - 2);
return;
}

/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
- trace_rcu_dyntick(TPS("Startirq"), rdtp->dynticks_nmi_nesting, 0, rdtp->dynticks);
+ trace_rcu_dyntick(TPS("Startirq"), rdtp->dynticks_nmi_nesting, 0, &rdtp->dynticks);
WRITE_ONCE(rdtp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
rcu_dynticks_eqs_enter();
}
@@ -930,7 +930,7 @@ static void rcu_eqs_exit(bool user)
rcu_dynticks_task_exit();
rcu_dynticks_eqs_exit();
rcu_cleanup_after_idle();
- trace_rcu_dyntick(TPS("End"), rdtp->dynticks_nesting, 1, rdtp->dynticks);
+ trace_rcu_dyntick(TPS("End"), rdtp->dynticks_nesting, 1, &rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
WRITE_ONCE(rdtp->dynticks_nesting, 1);
WRITE_ONCE(rdtp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
@@ -1004,7 +1004,7 @@ void rcu_nmi_enter(void)
}
trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
rdtp->dynticks_nmi_nesting,
- rdtp->dynticks_nmi_nesting + incby, rdtp->dynticks);
+ rdtp->dynticks_nmi_nesting + incby, &rdtp->dynticks);
WRITE_ONCE(rdtp->dynticks_nmi_nesting, /* Prevent store tearing. */
rdtp->dynticks_nmi_nesting + incby);
barrier();
diff --git a/net/wireless/trace.h b/net/wireless/trace.h
index 5152938b358d..018c81fa72fb 100644
--- a/net/wireless/trace.h
+++ b/net/wireless/trace.h
@@ -3137,7 +3137,7 @@ TRACE_EVENT(rdev_start_radar_detection,

TRACE_EVENT(rdev_set_mcast_rate,
TP_PROTO(struct wiphy *wiphy, struct net_device *netdev,
- int mcast_rate[NUM_NL80211_BANDS]),
+ int *mcast_rate),
TP_ARGS(wiphy, netdev, mcast_rate),
TP_STRUCT__entry(
WIPHY_ENTRY
diff --git a/sound/firewire/amdtp-stream-trace.h b/sound/firewire/amdtp-stream-trace.h
index ea0d486652c8..54cdd4ffa9ce 100644
--- a/sound/firewire/amdtp-stream-trace.h
+++ b/sound/firewire/amdtp-stream-trace.h
@@ -14,7 +14,7 @@
#include <linux/tracepoint.h>

TRACE_EVENT(in_packet,
- TP_PROTO(const struct amdtp_stream *s, u32 cycles, u32 cip_header[2], unsigned int payload_length, unsigned int index),
+ TP_PROTO(const struct amdtp_stream *s, u32 cycles, u32 *cip_header, unsigned int payload_length, unsigned int index),
TP_ARGS(s, cycles, cip_header, payload_length, index),
TP_STRUCT__entry(
__field(unsigned int, second)
--
2.9.5


2018-03-01 04:22:17

by Alexei Starovoitov

Subject: [PATCH bpf-next 4/5] libbpf: add bpf_raw_tracepoint_open helper

Signed-off-by: Alexei Starovoitov <[email protected]>
---
tools/include/uapi/linux/bpf.h | 11 +++++++++++
tools/lib/bpf/bpf.c | 10 ++++++++++
tools/lib/bpf/bpf.h | 1 +
3 files changed, 22 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db6bdc375126..50bf5f9054da 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_cmd {
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
+ BPF_RAW_TRACEPOINT_OPEN,
};

enum bpf_map_type {
@@ -133,6 +134,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
+ BPF_PROG_TYPE_RAW_TRACEPOINT,
};

enum bpf_attach_type {
@@ -143,6 +145,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
+ BPF_RAW_TRACEPOINT,
__MAX_BPF_ATTACH_TYPE
};

@@ -320,6 +323,10 @@ union bpf_attr {
__aligned_u64 prog_ids;
__u32 prog_cnt;
} query;
+
+ struct {
+ __u64 name;
+ } raw_tracepoint;
} __attribute__((aligned(8)));

/* BPF helper function descriptions:
@@ -1106,4 +1113,8 @@ struct bpf_cgroup_dev_ctx {
__u32 minor;
};

+struct bpf_raw_tracepoint_args {
+ __u64 args[0];
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 592a58a2b681..4cbe7b6afcc0 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -428,6 +428,16 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
return err;
}

+int bpf_raw_tracepoint_open(const char *name)
+{
+ union bpf_attr attr;
+
+ bzero(&attr, sizeof(attr));
+ attr.raw_tracepoint.name = ptr_to_u64(name);
+
+ return sys_bpf(BPF_RAW_TRACEPOINT_OPEN, &attr, sizeof(attr));
+}
+
int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
{
struct sockaddr_nl sa;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 8d18fb73d7fb..f672d39bd9fa 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -79,4 +79,5 @@ int bpf_map_get_fd_by_id(__u32 id);
int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
__u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);
+int bpf_raw_tracepoint_open(const char *name);
#endif
--
2.9.5


2018-03-01 04:23:24

by Alexei Starovoitov

Subject: [PATCH bpf-next 2/5] tracepoint: compute num_args at build time

Add a macro to compute the number of arguments passed into a tracepoint
at compile time and store it as part of 'struct tracepoint'.
The number is necessary to check the safety of bpf program accesses, which
is done in a subsequent patch.
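
For readers unfamiliar with the argument-counting trick, a simplified
sketch follows. It is not the macro from this patch: the DEMO_* names and
struct demo_tracepoint are made up for illustration, the tracepoint
argument names are placeholders, and this variant only handles 1 to 8
arguments. The idea is to append a descending list of numbers so that the
right count lands in a fixed parameter position:

```c
#include <assert.h>

/* Select the 9th argument; the names here are illustrative only. */
#define DEMO_PICK_9TH(a1, a2, a3, a4, a5, a6, a7, a8, n, ...) n
/* Appending a descending list shifts the right count into slot 9. */
#define DEMO_COUNT(...) DEMO_PICK_9TH(__VA_ARGS__, 8, 7, 6, 5, 4, 3, 2, 1, 0)

/* The result is an integer constant, usable at compile time. */
static_assert(DEMO_COUNT(x, y, z) == 3, "count is a compile-time constant");

/* Hypothetical stand-in for 'struct tracepoint' with its new field. */
struct demo_tracepoint {
	const char *name;
	unsigned int num_args;
};

/* Store the count next to the name when the tracepoint is defined. */
#define DEMO_DEFINE_TRACE(name, ...) \
	const struct demo_tracepoint demo_tp_##name = \
		{ #name, DEMO_COUNT(__VA_ARGS__) }

DEMO_DEFINE_TRACE(task_rename, task, comm);
DEMO_DEFINE_TRACE(urandom_read, got_bits, pool_left, input_left);

/* A later check can reject an argument index that does not exist. */
int demo_arg_index_ok(const struct demo_tracepoint *tp, unsigned int idx)
{
	return idx < tp->num_args;
}
```

The arguments themselves are never evaluated. They only shift the number
list, so the count costs nothing at run time.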

The for_each_tracepoint_range() api has no users inside the kernel.
Make it more useful by adding the ability to stop the for_each() loop
based on the callback return value.
It is used in this form in a subsequent patch.
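
A small user-space sketch of the early-stop pattern, assuming that a
non-NULL return from the callback ends the walk and is handed back to the
caller. The demo_* names and the table contents are hypothetical, and the
kernel iterates over a section of tracepoint pointers rather than a plain
array:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for 'struct tracepoint'. */
struct demo_tracepoint {
	const char *name;
};

struct demo_tracepoint demo_tps[] = {
	{ "sched_switch" }, { "task_rename" }, { "xdp_exception" },
};

/* Counts how many entries the last walk touched. */
unsigned int demo_visited;

/* Walk the table; a non-NULL return from the callback ends the walk. */
void *demo_for_each_tracepoint(void *(*fct)(struct demo_tracepoint *tp,
					    void *priv),
			       void *priv)
{
	size_t i;
	void *ret;

	for (i = 0; i < sizeof(demo_tps) / sizeof(demo_tps[0]); i++) {
		demo_visited++;
		ret = fct(&demo_tps[i], priv);
		if (ret)
			return ret;
	}
	return NULL;
}

/* Callback: return the matching entry to stop, NULL to keep going. */
void *demo_match_name(struct demo_tracepoint *tp, void *priv)
{
	return strcmp(tp->name, (const char *)priv) == 0 ? tp : NULL;
}

/* Lookup by name built on top of the iterator. */
struct demo_tracepoint *demo_find(const char *name)
{
	demo_visited = 0;
	return demo_for_each_tracepoint(demo_match_name, (void *)name);
}
```

With a void return type the callback would have to stash its result in
priv and the walk would always cover the whole table. Returning the match
directly keeps a name lookup short.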

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/tracepoint-defs.h | 1 +
include/linux/tracepoint.h | 32 +++++++++++++++++++++++---------
include/trace/define_trace.h | 14 +++++++-------
kernel/tracepoint.c | 27 ++++++++++++++++-----------
4 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 64ed7064f1fa..39a283c61c51 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -33,6 +33,7 @@ struct tracepoint {
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
+ u32 num_args;
};

#endif
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index c94f466d57ef..b1676e53bb23 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -40,9 +40,19 @@ tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
int prio);
extern int
tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
-extern void
-for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
- void *priv);
+
+#ifdef CONFIG_TRACEPOINTS
+void *
+for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
+ void *priv);
+#else
+static inline void *
+for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
+ void *priv)
+{
+ return NULL;
+}
+#endif

#ifdef CONFIG_MODULES
struct tp_module {
@@ -225,23 +235,27 @@ extern void syscall_unregfunc(void);
return static_key_false(&__tracepoint_##name.key); \
}

+#define ___FN_COUNT(fn,n0,n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12,n13,n14,n15,n16,n,...) fn##n
+#define __FN_COUNT(fn,...) ___FN_COUNT(fn,##__VA_ARGS__,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)
+#define __COUNT(...) __FN_COUNT(/**/,##__VA_ARGS__)
+
/*
* We have no guarantee that gcc and the linker won't up-align the tracepoint
* structures, so we create an array of pointers that will be used for iteration
* on the tracepoints.
*/
-#define DEFINE_TRACE_FN(name, reg, unreg) \
+#define DEFINE_TRACE_FN(name, reg, unreg, num_args) \
static const char __tpstrtab_##name[] \
__attribute__((section("__tracepoints_strings"))) = #name; \
struct tracepoint __tracepoint_##name \
__attribute__((section("__tracepoints"))) = \
- { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
+ { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL, num_args };\
static struct tracepoint * const __tracepoint_ptr_##name __used \
__attribute__((section("__tracepoints_ptrs"))) = \
&__tracepoint_##name;

-#define DEFINE_TRACE(name) \
- DEFINE_TRACE_FN(name, NULL, NULL);
+#define DEFINE_TRACE(name, num_args) \
+ DEFINE_TRACE_FN(name, NULL, NULL, num_args);

#define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \
EXPORT_SYMBOL_GPL(__tracepoint_##name)
@@ -275,8 +289,8 @@ extern void syscall_unregfunc(void);
return false; \
}

-#define DEFINE_TRACE_FN(name, reg, unreg)
-#define DEFINE_TRACE(name)
+#define DEFINE_TRACE_FN(name, reg, unreg, num_args)
+#define DEFINE_TRACE(name, num_args)
#define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
#define EXPORT_TRACEPOINT_SYMBOL(name)

diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index d9e3d4aa3f6e..c040eda95d41 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -25,7 +25,7 @@

#undef TRACE_EVENT
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, __COUNT(args))

#undef TRACE_EVENT_CONDITION
#define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
@@ -39,24 +39,24 @@
#undef TRACE_EVENT_FN
#define TRACE_EVENT_FN(name, proto, args, tstruct, \
assign, print, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg)
+ DEFINE_TRACE_FN(name, reg, unreg, __COUNT(args))

#undef TRACE_EVENT_FN_COND
#define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
assign, print, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg)
+ DEFINE_TRACE_FN(name, reg, unreg, __COUNT(args))

#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, __COUNT(args))

#undef DEFINE_EVENT_FN
#define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg)
+ DEFINE_TRACE_FN(name, reg, unreg, __COUNT(args))

#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, __COUNT(args))

#undef DEFINE_EVENT_CONDITION
#define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
@@ -64,7 +64,7 @@

#undef DECLARE_TRACE
#define DECLARE_TRACE(name, proto, args) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, __COUNT(args))

#undef TRACE_INCLUDE
#undef __TRACE_INCLUDE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 671b13457387..3f2dc5738c2b 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -502,17 +502,22 @@ static __init int init_tracepoints(void)
__initcall(init_tracepoints);
#endif /* CONFIG_MODULES */

-static void for_each_tracepoint_range(struct tracepoint * const *begin,
- struct tracepoint * const *end,
- void (*fct)(struct tracepoint *tp, void *priv),
- void *priv)
+static void *for_each_tracepoint_range(struct tracepoint * const *begin,
+ struct tracepoint * const *end,
+ void *(*fct)(struct tracepoint *tp, void *priv),
+ void *priv)
{
struct tracepoint * const *iter;
+ void *ret;

if (!begin)
- return;
- for (iter = begin; iter < end; iter++)
- fct(*iter, priv);
+ return NULL;
+ for (iter = begin; iter < end; iter++) {
+ ret = fct(*iter, priv);
+ if (ret)
+ return ret;
+ }
+ return NULL;
}

/**
@@ -520,11 +525,11 @@ static void for_each_tracepoint_range(struct tracepoint * const *begin,
* @fct: callback
* @priv: private data
*/
-void for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
- void *priv)
+void *for_each_kernel_tracepoint(void *(*fct)(struct tracepoint *tp, void *priv),
+ void *priv)
{
- for_each_tracepoint_range(__start___tracepoints_ptrs,
- __stop___tracepoints_ptrs, fct, priv);
+ return for_each_tracepoint_range(__start___tracepoints_ptrs,
+ __stop___tracepoints_ptrs, fct, priv);
}
EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint);

--
2.9.5


2018-03-01 04:23:31

by Alexei Starovoitov

Subject: [PATCH bpf-next 3/5] bpf: introduce BPF_RAW_TRACEPOINT

Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

From the bpf program's point of view, access to the arguments looks like:
struct bpf_raw_tracepoint_args {
__u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
// program can read args[N] where N depends on tracepoint
// and statically verified at program load+attach time
}
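
The attach-time part of that check can be sketched in userspace C as
follows. It mirrors the comparison made in __bpf_probe_register() later
in this patch; the function name demo_check_ctx_access and its
parameters are hypothetical, and the sketch assumes max_ctx_offset
records the end (offset plus size) of the furthest context access the
program makes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the attach-time check.
 * max_ctx_offset: end of the furthest context access, in bytes
 *                 (an assumption of this sketch).
 * num_args:       argument count recorded for the tracepoint.
 */
int demo_check_ctx_access(size_t max_ctx_offset, unsigned int num_args)
{
	/* Each raw argument occupies one 64-bit slot in args[]. */
	if (max_ctx_offset > num_args * sizeof(uint64_t))
		return -EINVAL;
	return 0;
}

/* Usage sketch: a 3-argument tracepoint exposes 24 bytes of context. */
void demo_ctx_selftest(void)
{
	assert(demo_check_ctx_access(24, 3) == 0);        /* reads args[2] */
	assert(demo_check_ctx_access(32, 3) == -EINVAL);  /* reads args[3] */
}
```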

The kprobe+bpf infrastructure allows programs to access function arguments.
This feature allows programs to access raw tracepoint arguments.

Similar to the proposed 'dynamic ftrace events', there are no abi guarantees
about what the tracepoint arguments are or what they mean.
The program needs to type cast args properly and use the bpf_probe_read()
helper to access struct fields when an argument is a pointer.

For every tracepoint a __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
0xffffffff81132080 <+0>: mov %ecx,%ecx
0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
TP_PROTO(const struct net_device *dev,
const struct bpf_prog *xdp, u32 act),

The above assembler snippet casts the 32-bit 'act' field into 'u64'
to pass it into bpf_trace_run3(), while the 'dev' and 'xdp' args are passed as-is.
Each of the ~500 __bpf_trace_*() functions is only 5-10 bytes long,
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since the probe funcs need to appear in kallsyms.
The alternative to making __bpf_trace_##call global in kallsyms
would have been to keep them static and add a pointer to these
static functions to 'struct trace_event_class' and 'struct trace_event_call',
but keeping them global simplifies the implementation and keeps it independent
from the tracing side.
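
The widening performed by such a probe function can be sketched in
userspace C like this. All demo_* names are hypothetical stand-ins: a
plain callback plays the role of the bpf program, and the probe is
written by hand rather than generated by macros as in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a bpf program: a callback over raw args. */
struct demo_prog {
	void (*run)(const uint64_t *args);
};

/* Last arguments seen by demo_record(), kept for inspection. */
uint64_t demo_seen[3];

void demo_record(const uint64_t *args)
{
	demo_seen[0] = args[0];
	demo_seen[1] = args[1];
	demo_seen[2] = args[2];
}

/* Collect three already widened arguments and run the program. */
void demo_trace_run3(struct demo_prog *prog, uint64_t arg1, uint64_t arg2,
		     uint64_t arg3)
{
	uint64_t args[3] = { arg1, arg2, arg3 };

	prog->run(args);
}

/*
 * Typed probe for a tracepoint with the prototype (dev, xdp, act).
 * Pointers are converted through uintptr_t and the 32-bit 'act'
 * value is zero-extended to 64 bits.
 */
void demo_probe_xdp_exception(void *data, const void *dev,
			      const void *xdp, uint32_t act)
{
	struct demo_prog *prog = data;

	demo_trace_run3(prog, (uint64_t)(uintptr_t)dev,
			(uint64_t)(uintptr_t)xdp, (uint64_t)act);
}

/* Usage sketch: fire the probe once and inspect what the program saw. */
void demo_widen_selftest(void)
{
	struct demo_prog prog = { demo_record };
	int dev = 0, xdp = 0;

	demo_probe_xdp_exception(&prog, &dev, &xdp, 0xfffffffeu);
	assert(demo_seen[0] == (uint64_t)(uintptr_t)&dev);
	assert(demo_seen[1] == (uint64_t)(uintptr_t)&xdp);
	assert(demo_seen[2] == 0xfffffffeull);
}
```

The program side then only ever sees an array of 64-bit slots, which is
why it has to cast each slot back to the type it expects.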

This approach also gives the lowest possible overhead
when calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since tracepoint+bpf are used at rates of 1M+ events per second,
this is a very valuable optimization.

Since the ftrace and perf sides are not involved, a new
BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns an anon_inode FD of a 'bpf-raw-tracepoint' object.

The user space side looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");
// attach bpf program to given tracepoint
bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT, 0);

Ctrl-C of a tracing daemon or cmdline tool that uses this feature
will automatically detach the bpf program, unload it and
unregister the tracepoint probe.

On the kernel side for_each_kernel_tracepoint() is used
to find the tracepoint named "xdp_exception"
(that would be the __tracepoint_xdp_exception record).

Then kallsyms_lookup_name() is used to find the address
of the __bpf_trace_xdp_exception() probe function.

And finally tracepoint_probe_register() is used to connect the probe
with the tracepoint.
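
The first of these steps, the lookup by name, depends on an iterator
that stops as soon as the callback returns non-NULL. A minimal userspace
sketch of that pattern follows; the demo_* names and the flat table are
assumptions made for illustration, not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a tracepoint record. */
struct demo_tp {
	const char *name;
	unsigned int num_args;
};

typedef void *(*demo_tp_cb)(struct demo_tp *tp, void *priv);

/* Walk [begin, end) and stop at the first non-NULL callback result. */
void *demo_for_each_tp(struct demo_tp *begin, struct demo_tp *end,
		       demo_tp_cb fct, void *priv)
{
	struct demo_tp *iter;
	void *ret;

	if (!begin)
		return NULL;
	for (iter = begin; iter < end; iter++) {
		ret = fct(iter, priv);
		if (ret)
			return ret;
	}
	return NULL;
}

/* Callback: return the record itself when its name matches. */
void *demo_match_name(struct demo_tp *tp, void *priv)
{
	const char *name = priv;

	return strcmp(tp->name, name) == 0 ? tp : NULL;
}

/* Find a record by name in a table of n entries (table must be valid). */
struct demo_tp *demo_find_tp(struct demo_tp *table, size_t n,
			     const char *name)
{
	return demo_for_each_tp(table, table + n, demo_match_name,
				(void *)name);
}

/* Usage sketch: look up one present and one missing name. */
void demo_find_selftest(void)
{
	struct demo_tp table[] = {
		{ "task_rename", 2 },
		{ "xdp_exception", 3 },
	};

	assert(demo_find_tp(table, 2, "xdp_exception") == &table[1]);
	assert(demo_find_tp(table, 2, "no_such_tp") == NULL);
}
```

Returning the matching record through the callback's return value keeps
the iterator generic: it needs no knowledge of what is being searched for.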

The addition of bpf_raw_tracepoint doesn't interfere with the ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Multiple bpf_raw_tracepoint_open("foo") calls are also permitted.
Each raw_tp_fd allows one bpf program to be attached, so multiple
user space processes can each open their own raw_tp_fd with their own
bpf program. The kernel will execute all tracepoint probes
and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf_types.h | 1 +
include/linux/trace_events.h | 57 ++++++++++++
include/trace/bpf_probe.h | 87 ++++++++++++++++++
include/trace/define_trace.h | 1 +
include/uapi/linux/bpf.h | 11 +++
kernel/bpf/syscall.c | 108 ++++++++++++++++++++++
kernel/trace/bpf_trace.c | 211 +++++++++++++++++++++++++++++++++++++++++++
7 files changed, 476 insertions(+)
create mode 100644 include/trace/bpf_probe.h

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349a3809..b83ec377046a 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -18,6 +18,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint)
BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
+BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
#endif
#ifdef CONFIG_CGROUP_BPF
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 8a1442c4e513..46d76bbd5668 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -468,6 +468,8 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
void perf_event_detach_bpf_prog(struct perf_event *event);
int perf_event_query_prog_array(struct perf_event *event, void __user *info);
+int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog);
#else
static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
@@ -487,6 +489,14 @@ perf_event_query_prog_array(struct perf_event *event, void __user *info)
{
return -EOPNOTSUPP;
}
+static inline int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *p)
+{
+ return -EOPNOTSUPP;
+}
+static inline int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *p)
+{
+ return -EOPNOTSUPP;
+}
#endif

enum {
@@ -546,6 +556,53 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
void perf_trace_buf_update(void *record, u16 type);
void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);

+void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
+void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
+void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3);
+void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4);
+void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5);
+void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6);
+void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
+void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8);
+void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9);
+void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10);
+void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11);
+void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
+void bpf_trace_run13(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+ u64 arg13);
+void bpf_trace_run14(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+ u64 arg13, u64 arg14);
+void bpf_trace_run15(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+ u64 arg13, u64 arg14, u64 arg15);
+void bpf_trace_run16(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+ u64 arg13, u64 arg14, u64 arg15, u64 arg16);
+void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+ u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17);
void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
struct trace_event_call *call, u64 count,
struct pt_regs *regs, struct hlist_head *head,
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
new file mode 100644
index 000000000000..cfbdf6082a95
--- /dev/null
+++ b/include/trace/bpf_probe.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#undef TRACE_SYSTEM_VAR
+
+#ifdef CONFIG_BPF_EVENTS
+
+#undef __entry
+#define __entry entry
+
+#undef __get_dynamic_array
+#define __get_dynamic_array(field) \
+ ((void *)__entry + (__entry->__data_loc_##field & 0xffff))
+
+#undef __get_dynamic_array_len
+#define __get_dynamic_array_len(field) \
+ ((__entry->__data_loc_##field >> 16) & 0xffff)
+
+#undef __get_str
+#define __get_str(field) ((char *)__get_dynamic_array(field))
+
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
+#undef __perf_count
+#define __perf_count(c) (c)
+
+#undef __perf_task
+#define __perf_task(t) (t)
+
+/*
+ * cast any integer or pointer type to u64 without warnings
+ * on 32 and 64 bit archs
+ */
+#define __CAST_TO_U64(expr) \
+ (u64) __builtin_choose_expr(sizeof(long) < sizeof(expr), \
+ (expr), \
+ (long) expr)
+#define __CAST1(a,...) __CAST_TO_U64(a)
+#define __CAST2(a,...) __CAST_TO_U64(a), __CAST1(__VA_ARGS__)
+#define __CAST3(a,...) __CAST_TO_U64(a), __CAST2(__VA_ARGS__)
+#define __CAST4(a,...) __CAST_TO_U64(a), __CAST3(__VA_ARGS__)
+#define __CAST5(a,...) __CAST_TO_U64(a), __CAST4(__VA_ARGS__)
+#define __CAST6(a,...) __CAST_TO_U64(a), __CAST5(__VA_ARGS__)
+#define __CAST7(a,...) __CAST_TO_U64(a), __CAST6(__VA_ARGS__)
+#define __CAST8(a,...) __CAST_TO_U64(a), __CAST7(__VA_ARGS__)
+#define __CAST9(a,...) __CAST_TO_U64(a), __CAST8(__VA_ARGS__)
+#define __CAST10(a,...) __CAST_TO_U64(a), __CAST9(__VA_ARGS__)
+#define __CAST11(a,...) __CAST_TO_U64(a), __CAST10(__VA_ARGS__)
+#define __CAST12(a,...) __CAST_TO_U64(a), __CAST11(__VA_ARGS__)
+#define __CAST13(a,...) __CAST_TO_U64(a), __CAST12(__VA_ARGS__)
+#define __CAST14(a,...) __CAST_TO_U64(a), __CAST13(__VA_ARGS__)
+#define __CAST15(a,...) __CAST_TO_U64(a), __CAST14(__VA_ARGS__)
+#define __CAST16(a,...) __CAST_TO_U64(a), __CAST15(__VA_ARGS__)
+#define __CAST17(a,...) __CAST_TO_U64(a), __CAST16(__VA_ARGS__)
+
+#define CAST_TO_U64(...) __FN_COUNT(__CAST,##__VA_ARGS__)(__VA_ARGS__)
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+/* no 'static' here. The bpf probe functions are global */ \
+notrace void \
+__bpf_trace_##call(void *__data, proto) \
+{ \
+ struct bpf_prog *prog = __data; \
+ \
+ __FN_COUNT(bpf_trace_run, args)(prog, CAST_TO_U64(args)); \
+}
+
+/*
+ * This part is compiled out, it is only here as a build time check
+ * to make sure that if the tracepoint handling changes, the
+ * bpf probe will fail to compile unless it too is updated.
+ */
+#undef DEFINE_EVENT
+#define DEFINE_EVENT(template, call, proto, args) \
+static inline void bpf_test_probe_##call(void) \
+{ \
+ check_trace_callback_type_##call(__bpf_trace_##template); \
+}
+
+
+#undef DEFINE_EVENT_PRINT
+#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
+ DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
+
+#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+#endif /* CONFIG_BPF_EVENTS */
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index c040eda95d41..3bbd3b88177f 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -95,6 +95,7 @@
#ifdef TRACEPOINTS_ENABLED
#include <trace/trace_events.h>
#include <trace/perf.h>
+#include <trace/bpf_probe.h>
#endif

#undef TRACE_EVENT
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc375126..50bf5f9054da 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_cmd {
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
+ BPF_RAW_TRACEPOINT_OPEN,
};

enum bpf_map_type {
@@ -133,6 +134,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
+ BPF_PROG_TYPE_RAW_TRACEPOINT,
};

enum bpf_attach_type {
@@ -143,6 +145,7 @@ enum bpf_attach_type {
BPF_SK_SKB_STREAM_PARSER,
BPF_SK_SKB_STREAM_VERDICT,
BPF_CGROUP_DEVICE,
+ BPF_RAW_TRACEPOINT,
__MAX_BPF_ATTACH_TYPE
};

@@ -320,6 +323,10 @@ union bpf_attr {
__aligned_u64 prog_ids;
__u32 prog_cnt;
} query;
+
+ struct {
+ __u64 name;
+ } raw_tracepoint;
} __attribute__((aligned(8)));

/* BPF helper function descriptions:
@@ -1106,4 +1113,8 @@ struct bpf_cgroup_dev_ctx {
__u32 minor;
};

+struct bpf_raw_tracepoint_args {
+ __u64 args[0];
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e24aa3241387..b5c33dda1a1c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1311,6 +1311,109 @@ static int bpf_obj_get(const union bpf_attr *attr)
attr->file_flags);
}

+struct bpf_raw_tracepoint {
+ struct tracepoint *tp;
+ struct bpf_prog *prog;
+};
+
+static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
+{
+ struct bpf_raw_tracepoint *raw_tp = filp->private_data;
+
+ if (raw_tp->prog) {
+ bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
+ bpf_prog_put(raw_tp->prog);
+ }
+ kfree(raw_tp);
+ return 0;
+}
+
+static const struct file_operations bpf_raw_tp_fops = {
+ .release = bpf_raw_tracepoint_release,
+ .read = bpf_dummy_read,
+ .write = bpf_dummy_write,
+};
+
+static struct bpf_raw_tracepoint *__bpf_raw_tracepoint_get(struct fd f)
+{
+ if (!f.file)
+ return ERR_PTR(-EBADF);
+ if (f.file->f_op != &bpf_raw_tp_fops) {
+ fdput(f);
+ return ERR_PTR(-EINVAL);
+ }
+ return f.file->private_data;
+}
+
+static void *__find_tp(struct tracepoint *tp, void *priv)
+{
+ char *name = priv;
+
+ if (!strcmp(tp->name, name))
+ return tp;
+ return NULL;
+}
+
+#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.name
+
+static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
+{
+ struct bpf_raw_tracepoint *raw_tp;
+ struct tracepoint *tp;
+ char tp_name[128];
+
+ if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
+ sizeof(tp_name) - 1) < 0)
+ return -EFAULT;
+ tp_name[sizeof(tp_name) - 1] = 0;
+
+ tp = for_each_kernel_tracepoint(__find_tp, tp_name);
+ if (!tp)
+ return -ENOENT;
+
+ raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
+ if (!raw_tp)
+ return -ENOMEM;
+ raw_tp->tp = tp;
+
+ return anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
+ O_CLOEXEC);
+}
+
+static int attach_raw_tp(const union bpf_attr *attr)
+{
+ struct bpf_raw_tracepoint *raw_tp;
+ struct bpf_prog *prog;
+ struct fd f;
+ int err = -EEXIST;
+
+ if (attr->attach_flags)
+ return -EINVAL;
+
+ f = fdget(attr->target_fd);
+ raw_tp = __bpf_raw_tracepoint_get(f);
+ if (IS_ERR(raw_tp))
+ return PTR_ERR(raw_tp);
+
+ if (raw_tp->prog)
+ goto out;
+
+ prog = bpf_prog_get_type(attr->attach_bpf_fd,
+ BPF_PROG_TYPE_RAW_TRACEPOINT);
+ if (IS_ERR(prog)) {
+ err = PTR_ERR(prog);
+ goto out;
+ }
+ err = bpf_probe_register(raw_tp->tp, prog);
+ if (err)
+ bpf_prog_put(prog);
+ else
+ raw_tp->prog = prog;
+out:
+ fdput(f);
+ return err;
+}
+
#ifdef CONFIG_CGROUP_BPF

#define BPF_PROG_ATTACH_LAST_FIELD attach_flags
@@ -1385,6 +1488,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_SK_SKB_STREAM_PARSER:
case BPF_SK_SKB_STREAM_VERDICT:
return sockmap_get_from_fd(attr, true);
+ case BPF_RAW_TRACEPOINT:
+ return attach_raw_tp(attr);
default:
return -EINVAL;
}
@@ -1917,6 +2022,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_OBJ_GET_INFO_BY_FD:
err = bpf_obj_get_info_by_fd(&attr, uattr);
break;
+ case BPF_RAW_TRACEPOINT_OPEN:
+ err = bpf_raw_tracepoint_open(&attr);
+ break;
default:
err = -EINVAL;
break;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index c0a9e310d715..e59b62875d1e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -723,6 +723,14 @@ const struct bpf_verifier_ops tracepoint_verifier_ops = {
const struct bpf_prog_ops tracepoint_prog_ops = {
};

+const struct bpf_verifier_ops raw_tracepoint_verifier_ops = {
+ .get_func_proto = tp_prog_func_proto,
+ .is_valid_access = tp_prog_is_valid_access,
+};
+
+const struct bpf_prog_ops raw_tracepoint_prog_ops = {
+};
+
static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type,
struct bpf_insn_access_aux *info)
{
@@ -884,3 +892,206 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)

return ret;
}
+
+static __always_inline
+void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
+{
+ rcu_read_lock();
+ preempt_disable();
+ (void) BPF_PROG_RUN(prog, args);
+ preempt_enable();
+ rcu_read_unlock();
+}
+
+#define EVAL1(FN, X) FN(X)
+#define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
+#define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
+#define EVAL4(FN, X, Y...) FN(X) EVAL3(FN, Y)
+#define EVAL5(FN, X, Y...) FN(X) EVAL4(FN, Y)
+#define EVAL6(FN, X, Y...) FN(X) EVAL5(FN, Y)
+
+#define COPY(X) args[X - 1] = arg##X;
+
+void bpf_trace_run1(struct bpf_prog *prog, u64 arg1)
+{
+ u64 args[1];
+
+ EVAL1(COPY, 1);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run1);
+void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2)
+{
+ u64 args[2];
+
+ EVAL2(COPY, 1, 2);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run2);
+void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3)
+{
+ u64 args[3];
+
+ EVAL3(COPY, 1, 2, 3);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run3);
+void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4)
+{
+ u64 args[4];
+
+ EVAL4(COPY, 1, 2, 3, 4);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run4);
+void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5)
+{
+ u64 args[5];
+
+ EVAL5(COPY, 1, 2, 3, 4, 5);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run5);
+void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6)
+{
+ u64 args[6];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run6);
+void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7)
+{
+ u64 args[7];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL1(COPY, 7);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run7);
+void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8)
+{
+ u64 args[8];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL2(COPY, 7, 8);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run8);
+void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9)
+{
+ u64 args[9];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL3(COPY, 7, 8, 9);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run9);
+void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10)
+{
+ u64 args[10];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL4(COPY, 7, 8, 9, 10);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run10);
+void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11)
+{
+ u64 args[11];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL5(COPY, 7, 8, 9, 10, 11);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run11);
+void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12)
+{
+ u64 args[12];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL6(COPY, 7, 8, 9, 10, 11, 12);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run12);
+void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
+ u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
+ u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
+ u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17)
+{
+ u64 args[17];
+
+ EVAL6(COPY, 1, 2, 3, 4, 5, 6);
+ EVAL6(COPY, 7, 8, 9, 10, 11, 12);
+ EVAL5(COPY, 13, 14, 15, 16, 17);
+ __bpf_trace_run(prog, args);
+}
+EXPORT_SYMBOL_GPL(bpf_trace_run17);
+
+static int __bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+{
+ unsigned long addr;
+ char buf[128];
+
+ /*
+ * check that program doesn't access arguments beyond what's
+ * available in this tracepoint
+ */
+ if (prog->aux->max_ctx_offset > tp->num_args * sizeof(u64))
+ return -EINVAL;
+
+ snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
+ addr = kallsyms_lookup_name(buf);
+ if (!addr)
+ return -ENOENT;
+
+ return tracepoint_probe_register(tp, (void *)addr, prog);
+}
+
+int bpf_probe_register(struct tracepoint *tp, struct bpf_prog *prog)
+{
+ int err;
+
+ mutex_lock(&bpf_event_mutex);
+ err = __bpf_probe_register(tp, prog);
+ mutex_unlock(&bpf_event_mutex);
+ return err;
+}
+
+static int __bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+{
+ unsigned long addr;
+ char buf[128];
+
+ snprintf(buf, sizeof(buf), "__bpf_trace_%s", tp->name);
+ addr = kallsyms_lookup_name(buf);
+ if (!addr)
+ return -ENOENT;
+
+ return tracepoint_probe_unregister(tp, (void *)addr, prog);
+}
+
+int bpf_probe_unregister(struct tracepoint *tp, struct bpf_prog *prog)
+{
+ int err;
+
+ mutex_lock(&bpf_event_mutex);
+ err = __bpf_probe_unregister(tp, prog);
+ mutex_unlock(&bpf_event_mutex);
+ return err;
+}
--
2.9.5


2018-03-01 04:23:57

by Alexei Starovoitov

Subject: [PATCH bpf-next 5/5] samples/bpf: raw tracepoint test

Add an empty raw_tracepoint bpf program to measure overhead.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 1 +
samples/bpf/bpf_load.c | 13 +++++++++++++
samples/bpf/test_overhead_raw_tp_kern.c | 17 +++++++++++++++++
samples/bpf/test_overhead_user.c | 12 ++++++++++++
4 files changed, 43 insertions(+)
create mode 100644 samples/bpf/test_overhead_raw_tp_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 2c2a587e0942..4d6a6edd4bf6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -119,6 +119,7 @@ always += offwaketime_kern.o
always += spintest_kern.o
always += map_perf_test_kern.o
always += test_overhead_tp_kern.o
+always += test_overhead_raw_tp_kern.o
always += test_overhead_kprobe_kern.o
always += parse_varlen.o parse_simple.o parse_ldabs.o
always += test_cgrp2_tc_kern.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 69806d74fa53..46e9195dc665 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -61,6 +61,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+ bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
@@ -84,6 +85,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_KPROBE;
} else if (is_tracepoint) {
prog_type = BPF_PROG_TYPE_TRACEPOINT;
+ } else if (is_raw_tracepoint) {
+ prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
} else if (is_xdp) {
prog_type = BPF_PROG_TYPE_XDP;
} else if (is_perf_event) {
@@ -128,6 +131,15 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
return populate_prog_array(event, fd);
}

+ if (is_raw_tracepoint) {
+ efd = bpf_raw_tracepoint_open(event + 15);
+ if (efd < 0) {
+ printf("tracepoint %s %s\n", event + 15, strerror(errno));
+ return -1;
+ }
+ return bpf_prog_attach(fd, efd, BPF_RAW_TRACEPOINT, 0);
+ }
+
if (is_kprobe || is_kretprobe) {
if (is_kprobe)
event += 7;
@@ -584,6 +596,7 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
if (memcmp(shname, "kprobe/", 7) == 0 ||
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "tracepoint/", 11) == 0 ||
+ memcmp(shname, "raw_tracepoint/", 15) == 0 ||
memcmp(shname, "xdp", 3) == 0 ||
memcmp(shname, "perf_event", 10) == 0 ||
memcmp(shname, "socket", 6) == 0 ||
diff --git a/samples/bpf/test_overhead_raw_tp_kern.c b/samples/bpf/test_overhead_raw_tp_kern.c
new file mode 100644
index 000000000000..d2af8bc1c805
--- /dev/null
+++ b/samples/bpf/test_overhead_raw_tp_kern.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Facebook */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+SEC("raw_tracepoint/task_rename")
+int prog(struct bpf_raw_tracepoint_args *ctx)
+{
+ return 0;
+}
+
+SEC("raw_tracepoint/urandom_read")
+int prog2(struct bpf_raw_tracepoint_args *ctx)
+{
+ return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_overhead_user.c b/samples/bpf/test_overhead_user.c
index d291167fd3c7..e1d35e07a10e 100644
--- a/samples/bpf/test_overhead_user.c
+++ b/samples/bpf/test_overhead_user.c
@@ -158,5 +158,17 @@ int main(int argc, char **argv)
unload_progs();
}

+ if (test_flags & 0xC0) {
+ snprintf(filename, sizeof(filename),
+ "%s_raw_tp_kern.o", argv[0]);
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+ printf("w/RAW_TRACEPOINT\n");
+ run_perf_test(num_cpu, test_flags >> 6);
+ unload_progs();
+ }
+
return 0;
}
--
2.9.5


2018-03-05 14:19:46

by Daniel Borkmann

Subject: Re: [PATCH bpf-next 0/5] bpf, tracing: introduce bpf raw tracepoints

On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:
> This patch set is a different way to address the pressing need to access
> task_struct pointers in sched tracepoints from bpf programs.
>
> The first approach simply added these pointers to sched tracepoints:
> https://lkml.org/lkml/2017/12/14/753
> which Peter nacked.
> Few options were discussed and eventually the discussion converged on
> doing bpf specific tracepoint_probe_register() probe functions.
> Details here:
> https://lkml.org/lkml/2017/12/20/929

Ping, Peter/Steven. If you have a chance, please review the series.

> Patch 1 is kernel wide cleanup of pass-struct-by-value into
> pass-struct-by-reference into tracepoints.
>
> Patch 2 minor prep work to expose number of arguments passed
> into tracepoints.
>
> Patch 3 introduces BPF_RAW_TRACEPOINT api.
> the auto-cleanup and multiple concurrent users are must have
> features of tracing api. For bpf raw tracepoints it looks like:
> // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
> prog_fd = bpf_prog_load(...);
>
> // receive anon_inode fd for given bpf_raw_tracepoint
> raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");
>
> // attach bpf program to given tracepoint
> bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT);
>
> Ctrl-C of tracing daemon or cmdline tool will automatically
> detach bpf program, unload it and unregister tracepoint probe.
> More details in patch 3.
>
> Patch 4, 5 - user space lib and tests
>
> samples/bpf/test_overhead performance on 1 cpu:
>
> tracepoint base kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
> task_rename 1.1M 769K 947K 1.0M
> urandom_read 789K 697K 750K 755K
>
> Alexei Starovoitov (5):
> treewide: remove struct-pass-by-value from tracepoints arguments
> tracepoint: compute num_args at build time
> bpf: introduce BPF_RAW_TRACEPOINT
> libbpf: add bpf_raw_tracepoint_open helper
> samples/bpf: raw tracepoint test
>
> arch/x86/xen/mmu_pv.c | 16 +--
> drivers/gpu/drm/i915/i915_trace.h | 13 +-
> drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
> drivers/infiniband/hw/hfi1/trace_ctxts.h | 12 +-
> drivers/s390/cio/ioasm.c | 18 +--
> drivers/s390/cio/trace.h | 50 ++++----
> fs/dax.c | 2 +-
> include/linux/bpf_types.h | 1 +
> include/linux/trace_events.h | 57 +++++++++
> include/linux/tracepoint-defs.h | 1 +
> include/linux/tracepoint.h | 32 +++--
> include/trace/bpf_probe.h | 87 +++++++++++++
> include/trace/define_trace.h | 15 ++-
> include/trace/events/f2fs.h | 2 +-
> include/trace/events/fs_dax.h | 6 +-
> include/trace/events/rcu.h | 4 +-
> include/trace/events/xen.h | 32 ++---
> include/uapi/linux/bpf.h | 11 ++
> kernel/bpf/syscall.c | 108 ++++++++++++++++
> kernel/rcu/tree.c | 10 +-
> kernel/trace/bpf_trace.c | 211 +++++++++++++++++++++++++++++++
> kernel/tracepoint.c | 27 ++--
> net/wireless/trace.h | 2 +-
> samples/bpf/Makefile | 1 +
> samples/bpf/bpf_load.c | 13 ++
> samples/bpf/test_overhead_raw_tp_kern.c | 17 +++
> samples/bpf/test_overhead_user.c | 12 ++
> sound/firewire/amdtp-stream-trace.h | 2 +-
> tools/include/uapi/linux/bpf.h | 11 ++
> tools/lib/bpf/bpf.c | 10 ++
> tools/lib/bpf/bpf.h | 1 +
> 31 files changed, 677 insertions(+), 109 deletions(-)
> create mode 100644 include/trace/bpf_probe.h
> create mode 100644 samples/bpf/test_overhead_raw_tp_kern.c
>


2018-03-05 21:13:42

by Steven Rostedt

Subject: Re: [PATCH bpf-next 0/5] bpf, tracing: introduce bpf raw tracepoints

On Mon, 5 Mar 2018 14:36:07 +0100
Daniel Borkmann wrote:

> Ping, Peter/Steven. If you have a chance, please review the series.

You're not off my radar, but I'm doing a lot of traveling for the next
two weeks (started last week). I'll see if I can find some time to look
at them. I scanned them over once, and they look interesting ;-)

-- Steve

2018-03-05 23:58:13

by Daniel Borkmann

Subject: Re: [PATCH bpf-next 3/5] bpf: introduce BPF_RAW_TRACEPOINT

On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:
> Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
> kernel internal arguments of the tracepoints in their raw form.
>
> From bpf program point of view the access to the arguments look like:
> struct bpf_raw_tracepoint_args {
> __u64 args[0];
> };
>
> int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
> {
> // program can read args[N] where N depends on tracepoint
> // and statically verified at program load+attach time
> }
>
> kprobe+bpf infrastructure allows programs access function arguments.
> This feature allows programs access raw tracepoint arguments.
>
> Similar to proposed 'dynamic ftrace events' there are no abi guarantees
> to what the tracepoints arguments are and what their meaning is.
> The program needs to type cast args properly and use bpf_probe_read()
> helper to access struct fields when argument is a pointer.
>
> For every tracepoint __bpf_trace_##call function is prepared.
> In assembler it looks like:
> (gdb) disassemble __bpf_trace_xdp_exception
> Dump of assembler code for function __bpf_trace_xdp_exception:
> 0xffffffff81132080 <+0>: mov %ecx,%ecx
> 0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3>
>
> where
>
> TRACE_EVENT(xdp_exception,
> TP_PROTO(const struct net_device *dev,
> const struct bpf_prog *xdp, u32 act),
>
> The above assembler snippet is casting 32-bit 'act' field into 'u64'
> to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
> All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
> and in total this approach adds 7k bytes to .text and 8k bytes
> to .rodata since the probe funcs need to appear in kallsyms.
> The alternative of having __bpf_trace_##call being global in kallsyms
> could have been to keep them static and add another pointer to these
> static functions to 'struct trace_event_class' and 'struct trace_event_call',
> but keeping them global simplifies implementation and keeps it indepedent
> from the tracing side.
>
> Also such approach gives the lowest possible overhead
> while calling trace_xdp_exception() from kernel C code and
> transitioning into bpf land.

Awesome work! Just a few comments below.

> Since tracepoint+bpf are used at speeds of 1M+ events per second
> this is very valuable optimization.
>
> Since ftrace and perf side are not involved the new
> BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
> that returns anon_inode FD of 'bpf-raw-tracepoint' object.
>
> The user space looks like:
> // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
> prog_fd = bpf_prog_load(...);
> // receive anon_inode fd for given bpf_raw_tracepoint
> raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");
> // attach bpf program to given tracepoint
> bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT);
>
> Ctrl-C of tracing daemon or cmdline tool that uses this feature
> will automatically detach bpf program, unload it and
> unregister tracepoint probe.
>
> On the kernel side for_each_kernel_tracepoint() is used
> to find a tracepoint with "xdp_exception" name
> (that would be __tracepoint_xdp_exception record)
>
> Then kallsyms_lookup_name() is used to find the addr
> of __bpf_trace_xdp_exception() probe function.
>
> And finally tracepoint_probe_register() is used to connect probe
> with tracepoint.
>
> Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
> tracepoint mechanisms. perf_event_open() can be used in parallel
> on the same tracepoint.
> Also multiple bpf_raw_tracepoint_open("foo") are permitted.
> Each raw_tp_fd allows to attach one bpf program, so multiple
> user space processes can open their own raw_tp_fd with their own
> bpf program. The kernel will execute all tracepoint probes
> and all attached bpf programs.
>
> In the future bpf_raw_tracepoints can be extended with
> query/introspection logic.
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> include/linux/bpf_types.h | 1 +
> include/linux/trace_events.h | 57 ++++++++++++
> include/trace/bpf_probe.h | 87 ++++++++++++++++++
> include/trace/define_trace.h | 1 +
> include/uapi/linux/bpf.h | 11 +++
> kernel/bpf/syscall.c | 108 ++++++++++++++++++++++
> kernel/trace/bpf_trace.c | 211 +++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 476 insertions(+)
> create mode 100644 include/trace/bpf_probe.h
>
[...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e24aa3241387..b5c33dda1a1c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1311,6 +1311,109 @@ static int bpf_obj_get(const union bpf_attr *attr)
> attr->file_flags);
> }
>
> +struct bpf_raw_tracepoint {
> + struct tracepoint *tp;
> + struct bpf_prog *prog;
> +};
> +
> +static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
> +{
> + struct bpf_raw_tracepoint *raw_tp = filp->private_data;
> +
> + if (raw_tp->prog) {
> + bpf_probe_unregister(raw_tp->tp, raw_tp->prog);
> + bpf_prog_put(raw_tp->prog);
> + }
> + kfree(raw_tp);
> + return 0;
> +}
> +
> +static const struct file_operations bpf_raw_tp_fops = {
> + .release = bpf_raw_tracepoint_release,
> + .read = bpf_dummy_read,
> + .write = bpf_dummy_write,
> +};
> +
> +static struct bpf_raw_tracepoint *__bpf_raw_tracepoint_get(struct fd f)
> +{
> + if (!f.file)
> + return ERR_PTR(-EBADF);
> + if (f.file->f_op != &bpf_raw_tp_fops) {
> + fdput(f);
> + return ERR_PTR(-EINVAL);
> + }
> + return f.file->private_data;
> +}
> +
> +static void *__find_tp(struct tracepoint *tp, void *priv)
> +{
> + char *name = priv;
> +
> + if (!strcmp(tp->name, name))
> + return tp;
> + return NULL;
> +}
> +
> +#define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.name
> +
> +static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
> +{
> + struct bpf_raw_tracepoint *raw_tp;
> + struct tracepoint *tp;
> + char tp_name[128];
> +
> + if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
> + sizeof(tp_name) - 1) < 0)
> + return -EFAULT;
> + tp_name[sizeof(tp_name) - 1] = 0;
> +
> + tp = for_each_kernel_tracepoint(__find_tp, tp_name);
> + if (!tp)
> + return -ENOENT;
> +
> + raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
> + if (!raw_tp)
> + return -ENOMEM;
> + raw_tp->tp = tp;
> +
> + return anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
> + O_CLOEXEC);

If anon_inode_getfd() fails to return an fd, raw_tp is leaked here.
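
For illustration, the cleanup being asked for can be modeled in user space. This is only a minimal sketch, not the kernel fix: raw_tp_model, fake_getfd(), fake_close() and live_objects are hypothetical stand-ins for struct bpf_raw_tracepoint, anon_inode_getfd(), the file's release hook and the allocator. The only point is the free-on-error step.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace stand-in for struct bpf_raw_tracepoint from the patch. */
struct raw_tp_model {
	const void *tp;
	void *prog;
};

/* Number of allocated objects, so that a leak becomes observable. */
static int live_objects;

/* Object owned by the single fake descriptor (hypothetical fd 3). */
static struct raw_tp_model *installed;

/*
 * Hypothetical stand-in for anon_inode_getfd(): returns a negative
 * errno on failure, otherwise takes ownership of priv and returns 3.
 */
static int fake_getfd(struct raw_tp_model *priv, int should_fail)
{
	if (should_fail || installed)
		return -ENFILE;
	installed = priv;
	return 3;
}

/* Stand-in for the release hook: frees the object owned by the fd. */
static void fake_close(void)
{
	assert(installed != NULL);
	free(installed);
	installed = NULL;
	live_objects--;
}

/* Models the open path: allocate, try to get an fd, free on failure. */
static int raw_tp_open_model(const void *tp, int should_fail)
{
	struct raw_tp_model *raw_tp;
	int fd;

	raw_tp = calloc(1, sizeof(*raw_tp));
	if (!raw_tp)
		return -ENOMEM;
	live_objects++;
	raw_tp->tp = tp;

	fd = fake_getfd(raw_tp, should_fail);
	if (fd < 0) {
		/* No file owns raw_tp, so release never runs: free it here. */
		free(raw_tp);
		live_objects--;
	}
	/* On success the object is freed later by fake_close(). */
	return fd;
}
```

Calling raw_tp_open_model() with should_fail set leaves live_objects at zero, which is the property the original open path lacks.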

> +}
> +
> +static int attach_raw_tp(const union bpf_attr *attr)
> +{
> + struct bpf_raw_tracepoint *raw_tp;
> + struct bpf_prog *prog;
> + struct fd f;
> + int err = -EEXIST;
> +
> + if (attr->attach_flags)
> + return -EINVAL;
> +
> + f = fdget(attr->target_fd);
> + raw_tp = __bpf_raw_tracepoint_get(f);
> + if (IS_ERR(raw_tp))
> + return PTR_ERR(raw_tp);
> +
> + if (raw_tp->prog)
> + goto out;
> +
> + prog = bpf_prog_get_type(attr->attach_bpf_fd,
> + BPF_PROG_TYPE_RAW_TRACEPOINT);
> + if (IS_ERR(prog)) {
> + err = PTR_ERR(prog);
> + goto out;
> + }
> + err = bpf_probe_register(raw_tp->tp, prog);
> + if (err)
> + bpf_prog_put(prog);
> + else
> + raw_tp->prog = prog;

I think this races with the raw_tp->prog test above on concurrent attach
attempts: two callers can both see raw_tp->prog == NULL and both go on
to call bpf_probe_register(), so the earlier check does not prevent
multiple registrations.
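
One generic way to close such a check-then-set window is to claim the slot atomically before registering. The following is a user-space sketch with C11 atomics, given purely as an illustration: raw_tp_slot, attach_once() and the register_err parameter are hypothetical, and this is not necessarily how the patch should be changed (the thread ends up folding attach into the open command instead).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for the prog pointer in struct bpf_raw_tracepoint. */
struct raw_tp_slot {
	_Atomic(void *) prog;
};

/*
 * Claim the slot first, then "register". register_err models the
 * return value of a registration step such as bpf_probe_register():
 * 0 for success, a negative errno for failure.
 */
static int attach_once(struct raw_tp_slot *slot, void *prog, int register_err)
{
	void *expected = NULL;

	assert(prog != NULL);

	/* Only the caller that swaps NULL -> prog may go on to register. */
	if (!atomic_compare_exchange_strong(&slot->prog, &expected, prog))
		return -EEXIST;

	if (register_err) {
		/* Registration failed: give the slot back. */
		atomic_store(&slot->prog, NULL);
		return register_err;
	}
	return 0;
}
```

With this ordering a second concurrent caller gets -EEXIST before it can reach the registration step, at the cost of having to undo the claim when registration fails.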

> +out:
> + fdput(f);
> + return err;
> +}
> +
> #ifdef CONFIG_CGROUP_BPF
>
> #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
> @@ -1385,6 +1488,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> case BPF_SK_SKB_STREAM_PARSER:
> case BPF_SK_SKB_STREAM_VERDICT:
> return sockmap_get_from_fd(attr, true);
> + case BPF_RAW_TRACEPOINT:
> + return attach_raw_tp(attr);
> default:
> return -EINVAL;
> }
> @@ -1917,6 +2022,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> case BPF_OBJ_GET_INFO_BY_FD:
> err = bpf_obj_get_info_by_fd(&attr, uattr);
> break;
> + case BPF_RAW_TRACEPOINT_OPEN:
> + err = bpf_raw_tracepoint_open(&attr);

With regard to the attach_raw_tp() comment above, why not have a single
BPF_RAW_TRACEPOINT_OPEN command that already takes the BPF fd along with
the tp name? Is there a concrete reason/use case for splitting it that way?

> + break;
> default:
> err = -EINVAL;
> break;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index c0a9e310d715..e59b62875d1e 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -723,6 +723,14 @@ const struct bpf_verifier_ops tracepoint_verifier_ops = {
> const struct bpf_prog_ops tracepoint_prog_ops = {
> };
>

2018-03-06 01:30:53

by Alexei Starovoitov

Subject: Re: [PATCH bpf-next 3/5] bpf: introduce BPF_RAW_TRACEPOINT

On 3/5/18 3:56 PM, Daniel Borkmann wrote:
> On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:
>> Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
>> kernel internal arguments of the tracepoints in their raw form.
>>
>> From bpf program point of view the access to the arguments look like:
>> struct bpf_raw_tracepoint_args {
>> __u64 args[0];
>> };
>>
>> int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
>> {
>> // program can read args[N] where N depends on tracepoint
>> // and statically verified at program load+attach time
>> }
>>
>> kprobe+bpf infrastructure allows programs access function arguments.
>> This feature allows programs access raw tracepoint arguments.
>>
>> Similar to proposed 'dynamic ftrace events' there are no abi guarantees
>> to what the tracepoints arguments are and what their meaning is.
>> The program needs to type cast args properly and use bpf_probe_read()
>> helper to access struct fields when argument is a pointer.
>>
>> For every tracepoint __bpf_trace_##call function is prepared.
>> In assembler it looks like:
>> (gdb) disassemble __bpf_trace_xdp_exception
>> Dump of assembler code for function __bpf_trace_xdp_exception:
>> 0xffffffff81132080 <+0>: mov %ecx,%ecx
>> 0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3>
>>
>> where
>>
>> TRACE_EVENT(xdp_exception,
>> TP_PROTO(const struct net_device *dev,
>> const struct bpf_prog *xdp, u32 act),
>>
>> The above assembler snippet is casting 32-bit 'act' field into 'u64'
>> to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
>> All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
>> and in total this approach adds 7k bytes to .text and 8k bytes
>> to .rodata since the probe funcs need to appear in kallsyms.
>> The alternative of having __bpf_trace_##call being global in kallsyms
>> could have been to keep them static and add another pointer to these
>> static functions to 'struct trace_event_class' and 'struct trace_event_call',
>> but keeping them global simplifies implementation and keeps it indepedent
>> from the tracing side.
>>
>> Also such approach gives the lowest possible overhead
>> while calling trace_xdp_exception() from kernel C code and
>> transitioning into bpf land.
>
> Awesome work! Just a few comments below.
>
>> Since tracepoint+bpf are used at speeds of 1M+ events per second
>> this is very valuable optimization.
>>
>> Since ftrace and perf side are not involved the new
>> BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
>> that returns anon_inode FD of 'bpf-raw-tracepoint' object.
>>
>> The user space looks like:
>> // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
>> prog_fd = bpf_prog_load(...);
>> // receive anon_inode fd for given bpf_raw_tracepoint
>> raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception");
>> // attach bpf program to given tracepoint
>> bpf_prog_attach(prog_fd, raw_tp_fd, BPF_RAW_TRACEPOINT);
>>
>> Ctrl-C of tracing daemon or cmdline tool that uses this feature
>> will automatically detach bpf program, unload it and
>> unregister tracepoint probe.
>>
>> On the kernel side for_each_kernel_tracepoint() is used
>> to find a tracepoint with "xdp_exception" name
>> (that would be __tracepoint_xdp_exception record)
>>
>> Then kallsyms_lookup_name() is used to find the addr
>> of __bpf_trace_xdp_exception() probe function.
>>
>> And finally tracepoint_probe_register() is used to connect probe
>> with tracepoint.
>>
>> Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
>> tracepoint mechanisms. perf_event_open() can be used in parallel
>> on the same tracepoint.
>> Also multiple bpf_raw_tracepoint_open("foo") are permitted.
>> Each raw_tp_fd allows to attach one bpf program, so multiple
>> user space processes can open their own raw_tp_fd with their own
>> bpf program. The kernel will execute all tracepoint probes
>> and all attached bpf programs.
>>
>> In the future bpf_raw_tracepoints can be extended with
>> query/introspection logic.
>>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
...
>> +static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
>> +{
>> + struct bpf_raw_tracepoint *raw_tp;
>> + struct tracepoint *tp;
>> + char tp_name[128];
>> +
>> + if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name),
>> + sizeof(tp_name) - 1) < 0)
>> + return -EFAULT;
>> + tp_name[sizeof(tp_name) - 1] = 0;
>> +
>> + tp = for_each_kernel_tracepoint(__find_tp, tp_name);
>> + if (!tp)
>> + return -ENOENT;
>> +
>> + raw_tp = kmalloc(sizeof(*raw_tp), GFP_USER | __GFP_ZERO);
>> + if (!raw_tp)
>> + return -ENOMEM;
>> + raw_tp->tp = tp;
>> +
>> + return anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
>> + O_CLOEXEC);
>
> If anon_inode_getfd() fails to return an fd, raw_tp is leaked here.

Good catch. Will fix.

>> break;
>> + case BPF_RAW_TRACEPOINT_OPEN:
>> + err = bpf_raw_tracepoint_open(&attr);
>
> With regard to the attach_raw_tp() comment above, why not have a single
> BPF_RAW_TRACEPOINT_OPEN command that already takes the BPF fd along with
> the tp name? Is there a concrete reason/use case for splitting it that way?

The use case is attaching the same bpf prog to multiple raw_tps,
but that works with your suggestion as well, so yeah, I will change
to that, since it simplifies the uapi and avoids the race in attach.

Thank you for the review.


2018-03-06 10:36:59

by Peter Zijlstra

Subject: Re: [PATCH bpf-next 0/5] bpf, tracing: introduce bpf raw tracepoints

On Mon, Mar 05, 2018 at 02:36:07PM +0100, Daniel Borkmann wrote:
> On 03/01/2018 05:19 AM, Alexei Starovoitov wrote:
> > This patch set is a different way to address the pressing need to access
> > task_struct pointers in sched tracepoints from bpf programs.
> >
> > The first approach simply added these pointers to sched tracepoints:
> > https://lkml.org/lkml/2017/12/14/753
> > which Peter nacked.
> > Few options were discussed and eventually the discussion converged on
> > doing bpf specific tracepoint_probe_register() probe functions.
> > Details here:
> > https://lkml.org/lkml/2017/12/20/929
>
> Ping, Peter/Steven. If you have a chance, please review the series.

This series doesn't really touch anything I maintain, but the general
approach seems sane to me. I like the first patch that ensures
structures are passed by reference.

The rest is all tracepoint/bpf glue and I never really got into the bpf
internals, so I don't think I've got anything useful to say there.