2015-02-06 06:59:14

by He Kuang

Subject: [RFC 0/2] Experience on full stack profiling using eBPF.

Hi Alexei,

We are working on full-stack system profiling tools and found your eBPF
work to be an excellent starting point for us. However, while designing
and implementing our prototype we also ran into some of its limitations.
I'd like to share our experience with you on this mailing list, and I
hope you and others can provide comments and suggestions.

Our goal is to profile production systems by analyzing end-to-end traces
of particular procedures: a full IO request from the sys_write() system
call down to the underlying disk driver, a full network transmission
from sys_sendto() down to the NIC driver, and even a full database query
from client initiation through server processing to client completion.
To avoid heavy tracing overhead and overwhelming data, we randomly
sample a few procedures out of the huge total, trace only the events
they cause at the different layers, and leave the other procedures
running normally. As a result, we record as few trace events as possible
while profiling, with minimal effect on system performance.

In this RFC series we provide a very rough implementation that traces IO
processing at different layers in a QEMU guest. Patch 1/2 adds new
tracepoints at each layer of the IO stack; patch 2/2 attaches eBPF
programs to them. Each eBPF program selects events according to its
trigger conditions and propagates those conditions to the eBPF programs
at the lower layers. We use hash tables to record the trigger
conditions.
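
At each intermediate node the eBPF program follows the same pattern:
look up the object handed down by the upper layer in the table the
upstream node filled, and, if it is present, mark the object the lower
layer will see in the next table. A minimal sketch (upstream_hash and
downstream_hash are placeholder names; the real tables are defined in
patch 2/2):

	SEC("events/.../some_tracepoint")
	int node_n(struct bpf_context *ctx)
	{
		u64 in_key = ctx->arg1;  /* object marked by the upstream node */
		u64 out_key = ctx->arg2; /* object the downstream node will see */
		int fill = 1;

		/* not caused by one of our sampled procedures: stay silent */
		if (!bpf_map_lookup_elem(&upstream_hash, &in_key))
			return 0;

		/* propagate the trigger condition to the next layer */
		bpf_map_update_elem(&downstream_hash, &out_key, &fill, BPF_ANY);
		return 1;	/* emit this trace event */
	}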

Our prototype is able to sample one sys_write() call among all the
writes issued by a dd command. The result is as follows:

dd-985 [000] .... 54878.066046: iov_iter_copy_from_user_atomic: Node1 iov=00000000011c0010 page=ffffea0001eebb40
...
kworker/ [000] .... 54886.746406: ext4_bio_write_page: Node2 page=ffffea0001eebb40 bio=ffff88007c249540
kworker/ [000] .... 54886.750054: bio_attempt_back_merge: Node3 bio=ffff88007c249540 rq=ffff88007c3d4158
...
kworker/ [000] d... 54886.828480: virtscsi_add_cmd: Node4 req=ffff88007c3d4158 index=2
...
tracex5-9 [000] .Ns. 54886.840415: bio_endio: Node5 bio=ffff88007c249540

The resulting trace log is fewer than 20 lines; without sampling and
trigger conditions it would be more than 10000 lines. From the trace
above we can easily derive a graph like this:

S1/2/3 are samples.

                  S1 S2                S3
                  |  |                 |
(vfs)pagecache  -----------------------------------------> timeline
                  S1 S2                S3
                  |  |                 |
(block)request  -----------------------------------------> timeline
                  S1 S2                S3
                  |  |                 |
bio_end         -----------------------------------------> timeline

During our work we found several areas where eBPF could be improved to
make our life easier:

1. A BPF program cannot use the div/mod operators, which are useful for
sampling (a workaround is sketched after this list).

2. BPF acts as a filter, and the output trace goes to the trace pipe;
sometimes we need structured trace output, like perf.data, so that we
can record more inside the filter function. A BPF array map is a good
place to record data, but its index is restricted to 4 bytes.

3. Though most events are filtered out, the eBPF programs attached to
the tracepoints still run on every hit. We should find a way to minimize
the performance overhead for the events that get filtered out.

4. Provide a more user-friendly interface.
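
For point 1, we currently work around the missing operators by
restricting the sampling rate to a power of two, so that a shift and a
multiply can stand in for the modulo. A sketch of the test used by
node1() in patch 2/2 (take_sample() is just a placeholder for the
record-and-mark logic there):

	/* true only when num_samples is a multiple of 1024, without '%' */
	if ((num_samples >> 10) * 1024 == num_samples)
		take_sample();

An equivalent bitmask form is ((num_samples & 1023) == 0).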

Thanks for your work.

Signed-off-by: He Kuang <[email protected]>
He Kuang (2):
tracing: add additional IO tracepoints
samples: bpf: IO profiling example

 block/bio.c                    |   1 +
 block/blk-core.c               |   4 +
 drivers/scsi/virtio_scsi.c     |   3 +
 fs/ext4/page-io.c              |   4 +
 include/trace/events/block.h   |  81 +++++++++++++++++
 include/trace/events/ext4.h    |  21 +++++
 include/trace/events/filemap.h |  25 ++++++
 include/trace/events/scsi.h    |  21 +++++
 mm/filemap.c                   |   2 +
 samples/bpf/Makefile           |   4 +
 samples/bpf/tracex5_kern.c     | 195 +++++++++++++++++++++++++++++++++++++++++
 samples/bpf/tracex5_user.c     |  56 ++++++++++++
 12 files changed, 417 insertions(+)
create mode 100644 samples/bpf/tracex5_kern.c
create mode 100644 samples/bpf/tracex5_user.c

--
2.2.0.33.gc18b867


2015-02-06 06:59:27

by He Kuang

Subject: [RFC 1/2] tracing: add additional IO tracepoints

Add tracepoints along the IO data stream to trace how data propagates
through the vfs, fs and block layers.

Signed-off-by: He Kuang <[email protected]>
---
 block/bio.c                    |  1 +
 block/blk-core.c               |  4 +++
 drivers/scsi/virtio_scsi.c     |  3 ++
 fs/ext4/page-io.c              |  4 +++
 include/trace/events/block.h   | 81 ++++++++++++++++++++++++++++++++++++++++++
 include/trace/events/ext4.h    | 21 +++++++++++
 include/trace/events/filemap.h | 25 +++++++++++++
 include/trace/events/scsi.h    | 21 +++++++++++
 mm/filemap.c                   |  2 ++
 9 files changed, 162 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 471d738..208fb63 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1827,6 +1827,7 @@ void bio_endio(struct bio *bio, int error)
} else {
if (bio->bi_end_io)
bio->bi_end_io(bio, error);
+ trace_bio_endio(bio);
bio = NULL;
}
}
diff --git a/block/blk-core.c b/block/blk-core.c
index 30f6153..7a8ceff 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1442,6 +1442,7 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));

blk_account_io_start(req, false);
+ trace_bio_attempt_back_merge(req, bio);
return true;
}

@@ -1466,6 +1467,7 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));

blk_account_io_start(req, false);
+ trace_bio_attempt_front_merge(req, bio);
return true;
}

@@ -1612,6 +1614,8 @@ get_rq:
* Returns with the queue unlocked.
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
+ trace_blk_queue_bio(req, bio);
+
if (IS_ERR(req)) {
bio_endio(bio, PTR_ERR(req)); /* @q is dead */
goto out_unlock;
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index c52bb5d..9fde3e0 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -30,6 +30,8 @@
#include <scsi/scsi_tcq.h>
#include <linux/seqlock.h>

+#include <trace/events/scsi.h>
+
#define VIRTIO_SCSI_MEMPOOL_SZ 64
#define VIRTIO_SCSI_EVENT_LEN 8
#define VIRTIO_SCSI_VQ_BASE 2
@@ -464,6 +466,7 @@ static int virtscsi_add_cmd(struct virtqueue *vq,
sgs[out_num + in_num++] = in->sgl;
}

+ trace_virtscsi_add_cmd(vq, sc->request);
return virtqueue_add_sgs(vq, sgs, out_num, in_num, cmd, GFP_ATOMIC);
}

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index b24a254..d5b618d 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -31,6 +31,8 @@
#include "xattr.h"
#include "acl.h"

+#include <trace/events/ext4.h>
+
static struct kmem_cache *io_end_cachep;

int __init ext4_init_pageio(void)
@@ -478,6 +480,8 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
redirty_page_for_writepage(wbc, page);
break;
}
+ trace_ext4_bio_write_page(io, page);
+
nr_submitted++;
clear_buffer_dirty(bh);
} while ((bh = bh->b_this_page) != head);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index e8a5eca..67b572c 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -667,6 +667,87 @@ TRACE_EVENT(block_rq_remap,
(unsigned long long)__entry->old_sector, __entry->nr_bios)
);

+TRACE_EVENT(blk_queue_bio,
+
+ TP_PROTO(struct request *rq, struct bio* bio),
+
+ TP_ARGS(rq, bio),
+
+ TP_STRUCT__entry(
+ __field( void* , rq_base )
+ __field( void*, bio_base)
+ ),
+
+ TP_fast_assign(
+ __entry->rq_base = rq;
+ __entry->bio_base = bio;
+ ),
+
+ TP_printk("Node3 bio=%p rq=%p",
+ __entry->bio_base,
+ __entry->rq_base)
+ );
+
+TRACE_EVENT(bio_attempt_back_merge,
+
+ TP_PROTO(struct request *rq, struct bio* bio),
+
+ TP_ARGS(rq, bio),
+
+ TP_STRUCT__entry(
+ __field( void*, rq_base )
+ __field( void*, bio_base)
+ ),
+
+ TP_fast_assign(
+ __entry->rq_base = rq;
+ __entry->bio_base = bio;
+ ),
+
+ TP_printk("Node3 bio=%p rq=%p",
+ __entry->bio_base,
+ __entry->rq_base)
+ );
+
+TRACE_EVENT(bio_attempt_front_merge,
+
+ TP_PROTO(struct request *rq, struct bio* bio),
+
+ TP_ARGS(rq, bio),
+
+ TP_STRUCT__entry(
+ __field( void*, rq_base )
+ __field( void*, bio_base )
+ ),
+
+ TP_fast_assign(
+ __entry->rq_base = rq;
+ __entry->bio_base = bio;
+ ),
+
+ TP_printk("Node3 bio=%p rq=%p",
+ __entry->bio_base,
+ __entry->rq_base)
+ );
+
+TRACE_EVENT(bio_endio,
+
+ TP_PROTO(struct bio* bio),
+
+ TP_ARGS(bio),
+
+ TP_STRUCT__entry(
+ __field( void*, bio_base )
+ ),
+
+ TP_fast_assign(
+ __entry->bio_base = bio;
+ ),
+
+ TP_printk("Node5 bio=%p",
+ __entry->bio_base)
+ );
+
#endif /* _TRACE_BLOCK_H */

/* This part must be outside protection */
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 6cfb841..507ac24 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2476,6 +2476,27 @@ TRACE_EVENT(ext4_es_shrink,
__entry->scan_time, __entry->nr_skipped, __entry->retried)
);

+TRACE_EVENT(ext4_bio_write_page,
+ TP_PROTO(struct ext4_io_submit *io, struct page *page),
+
+ TP_ARGS(io, page),
+
+ TP_STRUCT__entry(
+ __field( struct page *, page )
+ __field( struct bio*, bio )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->bio = io->io_bio;
+ ),
+
+ TP_printk("Node2 page=%p bio=%p",
+ __entry->page,
+ __entry->bio)
+ );
+
#endif /* _TRACE_EXT4_H */

/* This part must be outside protection */
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..3173e60 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -11,6 +11,8 @@
#include <linux/device.h>
#include <linux/kdev_t.h>

+#include <linux/uio.h>
+
DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,

TP_PROTO(struct page *page),
@@ -52,6 +54,29 @@ DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_add_to_page_cache,
TP_ARGS(page)
);

+
+TRACE_EVENT(iov_iter_copy_from_user_atomic,
+
+ TP_PROTO(struct page *page, struct iov_iter *i),
+
+ TP_ARGS(page, i),
+
+ TP_STRUCT__entry(
+ __field(struct page *, page)
+ __field(void*, iov_base)
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->iov_base = i->iov->iov_base;
+ ),
+
+ TP_printk("Node1 iov=%p page=%p",
+ __entry->iov_base,
+ __entry->page)
+ );
+
#endif /* _TRACE_FILEMAP_H */

/* This part must be outside protection */
diff --git a/include/trace/events/scsi.h b/include/trace/events/scsi.h
index 079bd10..e6c0c3a 100644
--- a/include/trace/events/scsi.h
+++ b/include/trace/events/scsi.h
@@ -8,6 +8,7 @@
#include <scsi/scsi_host.h>
#include <linux/tracepoint.h>
#include <linux/trace_seq.h>
+#include <linux/virtio.h>

#define scsi_opcode_name(opcode) { opcode, #opcode }
#define show_opcode_name(val) \
@@ -359,6 +360,26 @@ TRACE_EVENT(scsi_eh_wakeup,
TP_printk("host_no=%u", __entry->host_no)
);

+TRACE_EVENT(virtscsi_add_cmd,
+
+ TP_PROTO(struct virtqueue *vq, struct request *req),
+
+ TP_ARGS(vq, req),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, vq_index )
+ __field( void *, req_base )
+ ),
+
+ TP_fast_assign(
+ __entry->vq_index = vq->index;
+ __entry->req_base = req;
+ ),
+
+ TP_printk("Node4 req=%p index=%u", __entry->req_base,
+ __entry->vq_index)
+ );
+
#endif /* _TRACE_SCSI_H */

/* This part must be outside protection */
diff --git a/mm/filemap.c b/mm/filemap.c
index 673e458..d6dd161 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2497,6 +2497,8 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ trace_iov_iter_copy_from_user_atomic(page, i);
+
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
flush_dcache_page(page);

--
2.2.0.33.gc18b867

2015-02-06 06:59:26

by He Kuang

Subject: [RFC 2/2] samples: bpf: IO profiling example

At the first node, we sample and record a characteristic of the request
(e.g. an address) in a global hash table. The other nodes are then
traced only when their context matches the characteristics recorded in
the previous tables.

Signed-off-by: He Kuang <[email protected]>
---
 samples/bpf/Makefile       |   4 +
 samples/bpf/tracex5_kern.c | 195 +++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/tracex5_user.c |  56 +++++++++++++
 3 files changed, 255 insertions(+)
create mode 100644 samples/bpf/tracex5_kern.c
create mode 100644 samples/bpf/tracex5_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aec1f52..529deb7 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -11,6 +11,7 @@ hostprogs-y += tracex1
hostprogs-y += tracex2
hostprogs-y += tracex3
hostprogs-y += tracex4
+hostprogs-y += tracex5

dropmon-objs := dropmon.o libbpf.o
test_verifier-objs := test_verifier.o libbpf.o
@@ -22,6 +23,7 @@ tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
+tracex5-objs := bpf_load.o libbpf.o tracex5_user.o

# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -31,6 +33,7 @@ always += tracex1_kern.o
always += tracex2_kern.o
always += tracex3_kern.o
always += tracex4_kern.o
+always += tracex5_kern.o

HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(objtree)/include/uapi/
@@ -43,6 +46,7 @@ HOSTLOADLIBES_tracex1 += -lelf
HOSTLOADLIBES_tracex2 += -lelf
HOSTLOADLIBES_tracex3 += -lelf
HOSTLOADLIBES_tracex4 += -lelf
+HOSTLOADLIBES_tracex5 += -lelf

# point this to your LLVM backend with bpf support
#LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex5_kern.c b/samples/bpf/tracex5_kern.c
new file mode 100644
index 0000000..67c03b0
--- /dev/null
+++ b/samples/bpf/tracex5_kern.c
@@ -0,0 +1,195 @@
+/*
+ * In this example we place several nodes in the kernel source; all of
+ * them sit on the IO data path and represent the data transfer.
+ *
+ * We sample at the first node and record its characteristic in a
+ * global hash table; each subsequent node is then traced only when its
+ * context matches a characteristic recorded in the previous tables.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+
+struct globals {
+ int num_samples;
+ int samples;
+};
+
+struct bpf_map_def SEC("maps") global_map = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(struct globals),
+ .max_entries = 1,
+};
+
+/* trigger table: keyed by page address */
+struct bpf_map_def SEC("maps") trigger_page_hash = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(u64),
+ .value_size = sizeof(int),
+ .max_entries = 512,
+};
+
+/* trigger table: keyed by bio address */
+struct bpf_map_def SEC("maps") trigger_bio_hash = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(u64),
+ .value_size = sizeof(int),
+ .max_entries = 512,
+};
+
+/* trigger table: keyed by request address */
+struct bpf_map_def SEC("maps") trigger_req_hash = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(u64),
+ .value_size = sizeof(int),
+ .max_entries = 512,
+};
+
+//iov -> page
+SEC("events/filemap/iov_iter_copy_from_user_atomic")
+int node1(struct bpf_context *ctx)
+{
+ int ind = 0;
+ struct globals *g = bpf_map_lookup_elem(&global_map, &ind);
+
+ if (!g)
+ return 0;
+
+ g->num_samples++;
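+ /* sample 1 in 1024: true only when num_samples is a multiple of 1024, written without '%' */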
+ if ((g->num_samples >> 10) * 1024 == g->num_samples) {
+ struct page* page = (struct page*)ctx->arg1;
+ int fill = 1;
+
+ __sync_fetch_and_add(&g->samples, 1);
+
+ bpf_map_update_elem(&trigger_page_hash, &page, &fill, BPF_ANY);
+
+ return 1;
+ } else {
+ return 0;
+ }
+}
+
+/*
+ * ext4's headers are not in the include paths, so it is not convenient
+ * to fetch struct fields from them directly.
+ *
+ * Ideally this example would avoid redeclaring the struct, but to reach
+ * fields such as bio->bi_iter we need the struct definition.
+ */
+struct ext4_io_submit {
+ int io_op;
+ struct bio *io_bio;
+};
+
+//page -> bio
+SEC("events/ext4/ext4_bio_write_page")
+int node_2(struct bpf_context *ctx)
+{
+ u64 page_addr = ctx->arg2;
+ struct ext4_io_submit* io = (struct ext4_io_submit*)ctx->arg1;
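+ /* read io->io_bio from kernel memory via the bpf_fetch_ptr() helper */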
+ struct bio* bio = bpf_fetch_ptr(&io->io_bio);
+ void *value;
+ int fill = 1;
+
+ value = bpf_map_lookup_elem(&trigger_page_hash, &page_addr);
+ if (!value)
+ return 0;
+
+ bpf_map_update_elem(&trigger_bio_hash, &bio, &fill, BPF_ANY);
+ return 1;
+}
+
+//bio -> rq case1
+SEC("events/block/blk_queue_bio")
+int node_3_1(struct bpf_context *ctx)
+{
+ struct request *request = (struct request*)ctx->arg1;
+ struct bio* bio = (struct bio *)ctx->arg2;
+ void *value;
+ int fill = 1;
+
+ value = bpf_map_lookup_elem(&trigger_bio_hash, &bio);
+ if (!value) {
+ return 0;
+ }
+
+ bpf_map_update_elem(&trigger_req_hash, &request, &fill, BPF_ANY);
+ return 1;
+}
+
+//bio -> rq case2
+SEC("events/block/bio_attempt_front_merge")
+int node_3_2(struct bpf_context *ctx)
+{
+ struct request *request = (struct request*)ctx->arg1;
+ struct bio* bio = (struct bio *)ctx->arg2;
+ void *value;
+ int fill = 1;
+
+ value = bpf_map_lookup_elem(&trigger_bio_hash, &bio);
+ if (!value) {
+ return 0;
+ }
+
+ bpf_map_update_elem(&trigger_req_hash, &request, &fill, BPF_ANY);
+ return 1;
+}
+
+//bio -> rq case3
+SEC("events/block/bio_attempt_back_merge")
+int node_3_3(struct bpf_context *ctx)
+{
+ struct request *request = (struct request*)ctx->arg1;
+ struct bio* bio = (struct bio *)ctx->arg2;
+ void *value;
+ int fill = 1;
+
+ value = bpf_map_lookup_elem(&trigger_bio_hash, &bio);
+ if (!value) {
+ return 0;
+ }
+
+ bpf_map_update_elem(&trigger_req_hash, &request, &fill, BPF_ANY);
+ return 1;
+}
+
+//req -> vq
+SEC("events/scsi/virtscsi_add_cmd")
+int node_4(struct bpf_context *ctx)
+{
+ struct request *request = (struct request*)ctx->arg2;
+ void *value;
+ int fill = 1;
+
+ value = bpf_map_lookup_elem(&trigger_req_hash, &request);
+ if (!value) {
+ return 0;
+ }
+ return 1;
+}
+
+//end: bio
+SEC("events/block/bio_endio")
+int node_5(struct bpf_context *ctx)
+{
+ struct bio* bio = (struct bio*)ctx->arg1;
+ void *value;
+
+ value = bpf_map_lookup_elem(&trigger_bio_hash, &bio);
+ if (!value) {
+ return 0;
+ }
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex5_user.c b/samples/bpf/tracex5_user.c
new file mode 100644
index 0000000..814f926
--- /dev/null
+++ b/samples/bpf/tracex5_user.c
@@ -0,0 +1,56 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+struct globals {
+ int num_samples;
+ int samples;
+};
+
+static void print_hist(int fd)
+{
+ struct globals g = {};
+ int key = 0;
+
+ bpf_lookup_elem(fd, &key, &g);
+
+ printf("Total samples = %d, num=%d\n", g.samples, g.num_samples);
+}
+
+static void int_exit(int sig)
+{
+ print_hist(map_fd[0]);
+ exit(0);
+}
+
+int main(int ac, char **argv)
+{
+ FILE *f;
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ /* start 'dd' in the background to have plenty of 'write' syscalls */
+ f = popen("dd if=/dev/zero of=/mnt/data/testfile bs=4k count=5000", "r");
+ (void) f;
+
+ signal(SIGINT, int_exit);
+
+ if (fork() == 0) {
+ read_trace_pipe();
+ } else {
+ sleep(120);
+ print_hist(map_fd[0]);
+ }
+ return 0;
+}
--
2.2.0.33.gc18b867

2015-02-06 20:23:37

by Alexei Starovoitov

Subject: Re: [RFC 0/2] Experience on full stack profiling using eBPF.

On Thu, Feb 5, 2015 at 10:56 PM, He Kuang <[email protected]> wrote:
>
> Our goal is to profile production systems by analyzing end-to-end traces
> of particular procedures: a full IO request from the sys_write() system
> call down to the underlying disk driver, a full network transmission
> from sys_sendto() down to the NIC driver, and even a full database query
> from client initiation through server processing to client completion.
> To avoid heavy tracing overhead and overwhelming data, we randomly
> sample a few procedures out of the huge total, trace only the events
> they cause at the different layers, and leave the other procedures
> running normally. As a result, we record as few trace events as possible
> while profiling, with minimal effect on system performance.

all makes sense. Ability to do what you're describing above
is exactly the goal for bpf+tracing infra.

> The resulting trace log is fewer than 20 lines; without sampling and
> trigger conditions it would be more than 10000 lines. From the trace
> above we can easily derive a graph like this:
>
> S1/2/3 are samples.
>
>                   S1 S2                S3
>                   |  |                 |
> (vfs)pagecache  -----------------------------------------> timeline
>                   S1 S2                S3
>                   |  |                 |
> (block)request  -----------------------------------------> timeline
>                   S1 S2                S3
>                   |  |                 |
> bio_end         -----------------------------------------> timeline

nice.
Would be great if your tool is developed in some public repo
that everyone can see and contribute.

> During our work we found several areas where eBPF could be improved to
> make our life easier:
>
> 1. A BPF program cannot use the div/mod operators, which are useful for
> sampling (a workaround is sketched after this list).

hmm. there are several tests in test_bpf.ko that check div/mod ops.
Just tried it with a simple C example and it worked fine as well.
Please share the example that caused issues.

> 2. BPF acts as a filter, and the output trace goes to the trace pipe;
> sometimes we need structured trace output, like perf.data, so that we
> can record more inside the filter function. A BPF array map is a good
> place to record data, but its index is restricted to 4 bytes.

I think 4 billion elements in an array is more than practically usable.
array map is used in cases where one needs very fast access
to a few elements.
hash map is for large keys/values (though the number
of elements is also limited to 4 billion).
Neither key nor value has to be a scalar type.
For a hash map one can use a struct of multiple integers as a key
and as a value. For an array map the key has to be u32 and the
value is a struct of any size.
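
for example, a sketch in the bpf_map_def convention used by samples/bpf
(the struct layout here is purely illustrative):

	struct flow_key {
		u32 pid;
		u32 cpu;
		u64 addr;
	};

	struct bpf_map_def SEC("maps") flow_hash = {
		.type = BPF_MAP_TYPE_HASH,
		.key_size = sizeof(struct flow_key), /* struct as key */
		.value_size = sizeof(u64),
		.max_entries = 4096,
	};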

btw in the next version of my tracing patches I've changed the
attachment point to use perf_event instead of the tracefs interface.
So all non-dropped events will go into perf.data if user space
cares to accept them.

Independently of that, I'm planning to add a 'streaming'
interface in addition to maps:
more like a generalized trace/perf buffer, so that programs
can stream interesting events to user space in an
arbitrary format.

> 3. Though most events are filtered out, the eBPF programs attached to
> the tracepoints still run on every hit. We should find a way to minimize
> the performance overhead for the events that get filtered out.

well, yeah, the program does run for every event,
because the program itself does the filtering :)
when a bpf program returns 0, the tracepoint handler
returns immediately. The cost of running a 'return 0' bpf
program is 2% according to my measurements with 'dd'.
(I will post all the data and a new version of the patches soon)

> 4. Provide a more user-friendly interface.

sure. please suggest :)

> Thanks for your work.

thanks a lot for trying it out and providing feedback!

Some tracepoints from patch 1 seem redundant.
For example, there is already trace_block_bio_backmerge(), so
why add trace_bio_attempt_back_merge()?
that's for the block layer maintainers to decide anyway.
imo adding tracepoints in 'strategic' places makes a lot of sense;
we just need to think through their placement.