2014-07-22 03:45:38

by Alexei Starovoitov

Subject: [PATCH RFC v3 net-next 0/3] eBPF examples in C

Hi,

I've been asked to repost the LLVM eBPF backend and the examples in C, so
here they are. The LLVM backend is 99% the same as it was in February. I'm
not resending it to the list, since I only fixed minor things there; see it
in my tree.

ex1 - the same example I showed in February, but now it works through the
BPF syscall and Ctrl-C does automatic cleanup. This is patch 2/3.

ex2 - a new example that demonstrates the key feature of eBPF programs
for kernel debugging/tracing. This is patch 3/3.

The dtrace/systemtap/ktap approach is to use one script file that
provides all desired functionality. That architectural decision
overcomplicated their implementations.
eBPF follows a split model: everything that needs to process millions of
events per second runs in the kernel and must be short and deterministic;
everything else, like aggregation and nice graphs, runs in user space.

In patch 3/3, kfree_skb events are counted by a program written in C,
compiled into eBPF and attached to the event. That's the ex2_kern.c file.
The corresponding user space part is ex2_user.c, which walks the in-kernel
map every second and prints its contents. So user space and kernel access
BPF maps in parallel: the kernel counts events, and user space prints them.

Patch 1/3 is a parser for the .o file generated by LLVM. It looks for
pre-defined ELF sections like 'license', 'maps' and 'events' and loads BPF
maps/programs via the BPF syscall that I posted earlier.

Alexei Starovoitov (3):
samples: bpf: elf file loader
samples: bpf: eBPF example in C
samples: bpf: eBPF dropmon example in C

samples/bpf/Makefile | 17 +++-
samples/bpf/bpf_helpers.h | 21 +++++
samples/bpf/bpf_load.c | 228 +++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/bpf_load.h | 18 ++++
samples/bpf/ex1_kern.c | 27 ++++++
samples/bpf/ex1_user.c | 11 +++
samples/bpf/ex2_kern.c | 29 ++++++
samples/bpf/ex2_user.c | 28 ++++++
8 files changed, 377 insertions(+), 2 deletions(-)
create mode 100644 samples/bpf/bpf_helpers.h
create mode 100644 samples/bpf/bpf_load.c
create mode 100644 samples/bpf/bpf_load.h
create mode 100644 samples/bpf/ex1_kern.c
create mode 100644 samples/bpf/ex1_user.c
create mode 100644 samples/bpf/ex2_kern.c
create mode 100644 samples/bpf/ex2_user.c

----

The following changes since commit 240524089d7a5c0396656574e299beb3a55461e3:

net: bcmgenet: only update UMAC_CMD if something changed (2014-07-21 19:49:11 -0700)

are available in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

for you to fetch changes up to 27ae0cec12d0aac6f0705b0269ee705a0c599571:

samples: bpf: eBPF dropmon example in C (2014-07-21 20:01:29 -0700)

----------------------------------------------------------------
Alexei Starovoitov (20):
net: filter: split filter.c into two files
bpf: update MAINTAINERS entry
net: filter: rename struct sock_filter_int into bpf_insn
net: filter: split filter.h and expose eBPF to user space
bpf: introduce syscall(BPF, ...) and BPF maps
bpf: enable bpf syscall on x64
bpf: add lookup/update/delete/iterate methods to BPF maps
bpf: add hashtable type of BPF maps
bpf: expand BPF syscall with program load/unload
bpf: add eBPF verifier
bpf: allow eBPF programs to use maps
net: sock: allow eBPF programs to be attached to sockets
tracing: allow eBPF programs to be attached to events
samples: bpf: add mini eBPF library to manipulate maps and programs
samples: bpf: example of stateful socket filtering
samples: bpf: example of tracing filters with eBPF
bpf: llvm backend
samples: bpf: elf file loader
samples: bpf: eBPF example in C
samples: bpf: eBPF dropmon example in C


2014-07-22 03:45:43

by Alexei Starovoitov

Subject: [PATCH RFC v3 net-next 1/3] samples: bpf: elf file loader

A simple .o parser and loader using the BPF syscall.
The .o is a standard ELF file generated by the LLVM backend.

/* parses elf file compiled by llvm .c->.o
* creates maps and stores FD into map_fd array
* loads eBPF programs
* returns zero on success
*/
int load_bpf_file(char *path);

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/bpf_helpers.h | 21 +++++
samples/bpf/bpf_load.c | 228 +++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/bpf_load.h | 18 ++++
3 files changed, 267 insertions(+)
create mode 100644 samples/bpf/bpf_helpers.h
create mode 100644 samples/bpf/bpf_load.c
create mode 100644 samples/bpf/bpf_load.h

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
new file mode 100644
index 000000000000..356e7d34d174
--- /dev/null
+++ b/samples/bpf/bpf_helpers.h
@@ -0,0 +1,21 @@
+#ifndef __BPF_HELPERS_H
+#define __BPF_HELPERS_H
+
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+static void *(*bpf_load_pointer)(void *unsafe_ptr) = (void *) BPF_FUNC_load_pointer;
+static int (*bpf_printk)(const char *fmt, int fmt_size, ...) = (void *) BPF_FUNC_printk;
+static int (*bpf_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) = (void *) BPF_FUNC_memcmp;
+static void *(*bpf_map_lookup_elem)(void *map, void *key) = (void*) BPF_FUNC_map_lookup_elem;
+static int (*bpf_map_update_elem)(void *map, void *key, void *value) = (void*) BPF_FUNC_map_update_elem;
+static int (*bpf_map_delete_elem)(void *map, void *key) = (void *) BPF_FUNC_map_delete_elem;
+static void (*bpf_dump_stack)(void) = (void *) BPF_FUNC_dump_stack;
+
+struct bpf_map_def {
+ int type;
+ int key_size;
+ int value_size;
+ int max_entries;
+};
+
+#endif
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
new file mode 100644
index 000000000000..7614f19d5282
--- /dev/null
+++ b/samples/bpf/bpf_load.c
@@ -0,0 +1,228 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_helpers.h"
+#include "bpf_load.h"
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+static char license[128];
+static bool processed_sec[128];
+int map_fd[MAX_MAPS];
+
+static int load_and_attach(const char *event, char *prog, int size,
+ const struct bpf_map_fixup *fixups, int fixup_len)
+{
+ int fd, event_fd, err;
+ char fmt[32];
+ char path[256] = DEBUGFS;
+
+ fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER,
+ (struct bpf_insn *)prog, size, license,
+ fixups, fixup_len);
+
+ if (fd < 0) {
+ printf("err %d errno %d\n", fd, errno);
+ return fd;
+ }
+
+
+ snprintf(fmt, sizeof(fmt), "bpf-%d", fd);
+
+ strcat(path, event);
+ strcat(path, "/filter");
+
+ printf("writing %s -> %s\n", fmt, path);
+
+ event_fd = open(path, O_WRONLY, 0);
+ if (event_fd < 0) {
+ printf("failed to open event %s\n", event);
+ return event_fd;
+ }
+
+ err = write(event_fd, fmt, strlen(fmt));
+ (void) err;
+
+ return 0;
+}
+
+static int load_maps(struct bpf_map_def *maps, int len)
+{
+ int i;
+
+ for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
+
+ map_fd[i] = bpf_create_map(maps[i].type,
+ maps[i].key_size,
+ maps[i].value_size,
+ maps[i].max_entries);
+ if (map_fd[i] < 0)
+ return 1;
+ }
+ return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+ GElf_Shdr *shdr, Elf_Data **data)
+{
+ Elf_Scn *scn;
+
+ scn = elf_getscn(elf, i);
+ if (!scn)
+ return 1;
+
+ if (gelf_getshdr(scn, shdr) != shdr)
+ return 2;
+
+ *shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+ if (!*shname || !shdr->sh_size)
+ return 3;
+
+ *data = elf_getdata(scn, 0);
+ if (!*data || elf_getdata(scn, *data) != NULL)
+ return 4;
+
+ return 0;
+}
+
+static struct bpf_map_fixup fixup[128];
+
+static int parse_relo_into_map_fixup(Elf_Data *data, Elf_Data *symbols,
+ GElf_Shdr *shdr)
+{
+ GElf_Sym sym;
+ GElf_Rel rel;
+ int i, nrels;
+
+ nrels = shdr->sh_size / shdr->sh_entsize;
+
+ for (i = 0; i < nrels; i++) {
+ gelf_getrel(data, i, &rel);
+
+ fixup[i].insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+ gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+ fixup[i].fd = map_fd[sym.st_value / sizeof(struct bpf_map_def)];
+ }
+
+ return nrels;
+}
+
+int load_bpf_file(char *path)
+{
+ int fd, i, nrels;
+ Elf *elf;
+ GElf_Ehdr ehdr;
+ GElf_Shdr shdr, shdr_prog;
+ Elf_Data *data, *data_prog, *symbols = NULL;
+ char *shname, *shname_prog;
+
+ if (elf_version(EV_CURRENT) == EV_NONE)
+ return 1;
+
+ fd = open(path, O_RDONLY, 0);
+ if (fd < 0)
+ return 1;
+
+ elf = elf_begin(fd, ELF_C_READ, NULL);
+
+ if (!elf)
+ return 1;
+
+ if (gelf_getehdr(elf, &ehdr) != &ehdr)
+ return 1;
+
+ /* scan over all elf sections to get license and map info */
+ for (i = 1; i < ehdr.e_shnum; i++) {
+
+ if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+ continue;
+
+ if (0)
+ printf("section %d:%s data %p size %zd link %d flags %d\n",
+ i, shname, data->d_buf, data->d_size,
+ shdr.sh_link, (int) shdr.sh_flags);
+
+ if (strcmp(shname, "license") == 0) {
+ processed_sec[i] = true;
+ memcpy(license, data->d_buf, data->d_size);
+ } else if (strcmp(shname, "maps") == 0) {
+ processed_sec[i] = true;
+ if (load_maps(data->d_buf, data->d_size))
+ return 1;
+ } else if (shdr.sh_type == SHT_SYMTAB) {
+ symbols = data;
+ }
+ }
+
+ /* load programs that need map fixup (relocations) */
+ for (i = 1; i < ehdr.e_shnum; i++) {
+
+ if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+ continue;
+ if (shdr.sh_type == SHT_REL) {
+
+ if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+ &shdr_prog, &data_prog))
+ continue;
+
+ if (0)
+ printf("relo %s into %s\n", shname, shname_prog);
+
+ processed_sec[shdr.sh_info] = true;
+ processed_sec[i] = true;
+
+ nrels = parse_relo_into_map_fixup(data, symbols, &shdr);
+
+ if (memcmp(shname_prog, "events/", sizeof("events/") - 1) == 0)
+ load_and_attach(shname_prog,
+ data_prog->d_buf, data_prog->d_size,
+ fixup, nrels * sizeof(fixup[0]));
+ }
+ }
+
+ /* load programs that don't use maps */
+ for (i = 1; i < ehdr.e_shnum; i++) {
+
+ if (processed_sec[i])
+ continue;
+
+ if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+ continue;
+
+ if (memcmp(shname, "events/", sizeof("events/") - 1) == 0)
+ load_and_attach(shname, data->d_buf, data->d_size,
+ NULL, 0);
+ }
+
+ close(fd);
+ return 0;
+}
+
+void read_trace_pipe(void)
+{
+ int trace_fd;
+
+ trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+ if (trace_fd < 0)
+ return;
+
+ while (1) {
+ static char buf[4096];
+ ssize_t sz;
+
+ sz = read(trace_fd, buf, sizeof(buf) - 1);
+ if (sz > 0)
+ buf[sz] = 0, puts(buf);
+ }
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
new file mode 100644
index 000000000000..d34e61a101d2
--- /dev/null
+++ b/samples/bpf/bpf_load.h
@@ -0,0 +1,18 @@
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#define MAX_MAPS 64
+
+extern int map_fd[MAX_MAPS];
+
+/* parses elf file compiled by llvm .c->.o
+ * creates maps and stores FD into map_fd array
+ * loads eBPF programs
+ * returns zero on success
+ */
+int load_bpf_file(char *path);
+
+/* forever reads /sys/.../trace_pipe */
+void read_trace_pipe(void);
+
+#endif
--
1.7.9.5

2014-07-22 03:45:47

by Alexei Starovoitov

Subject: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C

Semantically this example is the same as dropmon.c, which has the eBPF
program in assembler.

In this example both eBPF and user space are in C:

ex2_kern.c - C code that will compile into eBPF and will
be attached to kfree_skb event
ex2_user.c - corresponding user space part that iterates the map
populated by in-kernel eBPF code

Usage:
$ sudo ex2 ex2_kern.o

Should see:
writing bpf-5 -> /sys/kernel/debug/tracing/events/skb/kfree_skb/filter
location 0xffffffff816efc67 count 1

location 0xffffffff815d8030 count 1
location 0xffffffff816efc67 count 3

location 0xffffffff815d8030 count 4
location 0xffffffff816efc67 count 9

location 0xffffffff815d8030 count 7
location 0xffffffff816efc67 count 15

location 0xffffffff815d8030 count 10
location 0xffffffff816efc67 count 23

location 0xffffffff815d8030 count 13
location 0xffffffff816efc67 count 29

Ctrl-C at any time; the kernel will automatically clean up the maps and
programs.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 6 ++++--
samples/bpf/ex2_kern.c | 29 +++++++++++++++++++++++++++++
samples/bpf/ex2_user.c | 28 ++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 2 deletions(-)
create mode 100644 samples/bpf/ex2_kern.c
create mode 100644 samples/bpf/ex2_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index c97f408fcd6d..b865a5df5c60 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,19 +2,21 @@
obj- := dummy.o

# List of programs to build
-hostprogs-y := sock_example dropmon ex1
+hostprogs-y := sock_example dropmon ex1 ex2

sock_example-objs := sock_example.o libbpf.o
dropmon-objs := dropmon.o libbpf.o
ex1-objs := bpf_load.o libbpf.o ex1_user.o
+ex2-objs := bpf_load.o libbpf.o ex2_user.o

# Tell kbuild to always build the programs
-always := $(hostprogs-y) ex1_kern.o
+always := $(hostprogs-y) ex1_kern.o ex2_kern.o

HOSTCFLAGS += -I$(objtree)/usr/include

HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_ex1 += -lelf
+HOSTLOADLIBES_ex2 += -lelf

LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc

diff --git a/samples/bpf/ex2_kern.c b/samples/bpf/ex2_kern.c
new file mode 100644
index 000000000000..b4a8e73a3f9d
--- /dev/null
+++ b/samples/bpf/ex2_kern.c
@@ -0,0 +1,29 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(long),
+ .value_size = sizeof(long),
+ .max_entries = 1024,
+};
+
+SEC("events/skb/kfree_skb")
+int bpf_prog2(struct bpf_context *ctx)
+{
+ long loc = ctx->arg2;
+ long init_val = 1;
+ void *value;
+
+ value = bpf_map_lookup_elem(&my_map, &loc);
+ if (value)
+ (*(long *) value) += 1;
+ else
+ bpf_map_update_elem(&my_map, &loc, &init_val);
+ return 0;
+}
+
+char license[] SEC("license") = "GPL";
diff --git a/samples/bpf/ex2_user.c b/samples/bpf/ex2_user.c
new file mode 100644
index 000000000000..8b0fc91f83ca
--- /dev/null
+++ b/samples/bpf/ex2_user.c
@@ -0,0 +1,28 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+ long key, next_key, value;
+ int i;
+
+ if (load_bpf_file(argv[1]))
+ return 1;
+
+ for (i = 0; i < 10; i++) {
+ key = 0;
+ while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+ bpf_lookup_elem(map_fd[0], &next_key, &value);
+ printf("location 0x%lx count %ld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+
+ return 0;
+}
--
1.7.9.5

2014-07-22 03:46:30

by Alexei Starovoitov

Subject: [PATCH RFC v3 net-next 2/3] samples: bpf: eBPF example in C

ex1_kern.c - C program which will be compiled into eBPF
to filter netif_receive_skb events on skb->dev->name == "lo"

ex1_user.c - corresponding user space component that
forever reads /sys/.../trace_pipe

Usage:
$ sudo ex1 ex1_kern.o

Should see:
writing bpf-4 -> /sys/kernel/debug/tracing/events/net/netif_receive_skb/filter
ping-25476 [001] ..s3 5639.718218: __netif_receive_skb_core: skb 4d51700 dev b9e6000
ping-25476 [001] ..s3 5639.718262: __netif_receive_skb_core: skb 4d51400 dev b9e6000

ping-25476 [002] ..s3 5640.716233: __netif_receive_skb_core: skb 5d06500 dev b9e6000
ping-25476 [002] ..s3 5640.716272: __netif_receive_skb_core: skb 5d06300 dev b9e6000

Ctrl-C at any time; the kernel will automatically clean up.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 15 +++++++++++++--
samples/bpf/ex1_kern.c | 27 +++++++++++++++++++++++++++
samples/bpf/ex1_user.c | 11 +++++++++++
3 files changed, 51 insertions(+), 2 deletions(-)
create mode 100644 samples/bpf/ex1_kern.c
create mode 100644 samples/bpf/ex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index caf1ab93b37c..c97f408fcd6d 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,23 @@
obj- := dummy.o

# List of programs to build
-hostprogs-y := sock_example dropmon
+hostprogs-y := sock_example dropmon ex1

sock_example-objs := sock_example.o libbpf.o
dropmon-objs := dropmon.o libbpf.o
+ex1-objs := bpf_load.o libbpf.o ex1_user.o

# Tell kbuild to always build the programs
-always := $(hostprogs-y)
+always := $(hostprogs-y) ex1_kern.o

HOSTCFLAGS += -I$(objtree)/usr/include
+
+HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
+HOSTLOADLIBES_ex1 += -lelf
+
+LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
+
+%.o: %.c
+ clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+ -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+ -O2 -emit-llvm -c $< -o -| $(LLC) -o $@
diff --git a/samples/bpf/ex1_kern.c b/samples/bpf/ex1_kern.c
new file mode 100644
index 000000000000..3fa7e4db0be6
--- /dev/null
+++ b/samples/bpf/ex1_kern.c
@@ -0,0 +1,27 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+SEC("events/net/netif_receive_skb")
+int bpf_prog1(struct bpf_context *ctx)
+{
+ /*
+ * attaches to /sys/kernel/debug/tracing/events/net/netif_receive_skb
+ * prints events for the loopback device only
+ */
+ char devname[] = "lo";
+ struct net_device *dev;
+ struct sk_buff *skb = 0;
+
+ skb = (struct sk_buff *) ctx->arg1;
+ dev = bpf_load_pointer(&skb->dev);
+ if (bpf_memcmp(dev->name, devname, 2) == 0) {
+ char fmt[] = "skb %x dev %x\n";
+ bpf_printk(fmt, sizeof(fmt), skb, dev);
+ }
+ return 0;
+}
+
+char license[] SEC("license") = "GPL";
diff --git a/samples/bpf/ex1_user.c b/samples/bpf/ex1_user.c
new file mode 100644
index 000000000000..d90c543ba30c
--- /dev/null
+++ b/samples/bpf/ex1_user.c
@@ -0,0 +1,11 @@
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+ if (load_bpf_file(argv[1]))
+ return 1;
+
+ read_trace_pipe();
+
+ return 0;
+}
--
1.7.9.5

2014-07-30 15:46:20

by Frank Ch. Eigler

Subject: Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C


ast wrote earlier:

> [...]
> The dtrace/systemtap/ktap approach is to use one script file that
> provides all desired functionality. That architectural decision
> overcomplicated their implementations.
>
> eBPF follows a split model: everything that needs to process millions of
> events per second runs in the kernel and must be short and deterministic;
> everything else, like aggregation and nice graphs, runs in user space.
> [...]

For the record, this is not entirely accurate as to dtrace. dtrace
delegates aggregation and most reporting to userspace. Also,
systemtap is "short and deterministic" even for aggregations & nice
graphs, but since it limits its storage & cpu consumption, its
arrays/reports cannot get super large.


> [...]
> +SEC("events/skb/kfree_skb")
> +int bpf_prog2(struct bpf_context *ctx)
> +{
> +[...]
> + value = bpf_map_lookup_elem(&my_map, &loc);
> + if (value)
> + (*(long *) value) += 1;
> + else
> + bpf_map_update_elem(&my_map, &loc, &init_val);
> + return 0;
> +}

What kind of locking/serialization is provided by the ebpf runtime
over shared variables such as my_map?


- FChE

2014-07-30 17:17:22

by Alexei Starovoitov

Subject: Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C

On Wed, Jul 30, 2014 at 8:45 AM, Frank Ch. Eigler <[email protected]> wrote:
> For the record, this is not entirely accurate as to dtrace. dtrace
> delegates aggregation and most reporting to userspace. Also,
> systemtap is "short and deterministic" even for aggregations & nice
> graphs, but since it limits its storage & cpu consumption, its
> arrays/reports cannot get super large.

My understanding of systemtap is that the whole .stp script is converted
to C, compiled as a .ko and loaded, so all map walking and printing
happens in the kernel. Similarly for ktap, which has special functions
in the kernel to print histograms.
I thought dtrace printf also happens in the kernel. What is the
trick they use to know which pieces of a dtrace script should run in
user space?
In the eBPF examples there are two C files: one for the kernel with the
eBPF ISA and one for user space as native code. I thought about combining
them, but couldn't figure out a clean way of doing it.

>> [...]
>> +SEC("events/skb/kfree_skb")
>> +int bpf_prog2(struct bpf_context *ctx)
>> +{
>> +[...]
>> + value = bpf_map_lookup_elem(&my_map, &loc);
>> + if (value)
>> + (*(long *) value) += 1;
>> + else
>> + bpf_map_update_elem(&my_map, &loc, &init_val);
>> + return 0;
>> +}
>
> What kind of locking/serialization is provided by the ebpf runtime
> over shared variables such as my_map?

It's the traditional RCU scheme.
Programs run under rcu_read_lock(), so that bpf_map_lookup_elem()
can return a pointer to a map value which won't disappear while the
program is running.
The in-kernel map implementation needs to use RCU style to match the
eBPF program's assumptions, and the map implementation enforces the
limit on the number of elements.
I didn't post the 'array' type of map yet. bpf_map_lookup in that
implementation will just return a 'base + index' pointer.
Regardless of the type of map, the same eBPF program running on
different cpus may look up the same 'key' and receive the same
map value pointer. In such a case concurrent write access to the
map value can be done with the bpf_xadd instruction, though
using normal read/write is also allowed. In some cases
the speed of a racy var++ is preferred over 'lock xadd'.
There are no lock/unlock helper functions available to eBPF
programs, since a program may terminate early, with a div by zero
for example, so an in-kernel lock helper implementation would
be complicated and slow. It's possible to do, but for the use
cases so far there is no need.

2014-07-30 17:39:08

by Frank Ch. Eigler

Subject: Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C

Hi, Alexei -

> My understanding of systemtap is that the whole .stp script is converted
> to C, compiled as .ko and loaded, so all map walking and prints are
> happening in the kernel. Similarly for ktap which has special functions
> in kernel to print histograms.

That is correct.

> I thought dtrace printf are also happening from the kernel. What is
> the trick they use to know which pieces of dtrace script should be
> run in user space?

It appears as though the bytecode language running in the kernel sends
some action commands back out to userspace, not just plain data.


> In ebpf examples there are two C files: one for kernel with ebpf isa
> and one for userspace as native. I thought about combining them,
> but couldn't figure out a clean way of doing it.

(#if ?)


> > What kind of locking/serialization is provided by the ebpf runtime
> > over shared variables such as my_map?
>
> it's traditional rcu scheme.

OK, that protects the table structure, but:

> [...] In such case concurrent write access to map value can be done
> with bpf_xadd instruction, though using normal read/write is also
> allowed. In some cases the speed of racy var++ is preferred over
> 'lock xadd'.

... so concurrency control over shared values is left up to the
programmer.

> There are no lock/unlock function helpers available to ebpf
> programs, since program may terminate early with div by zero
> for example, so in-kernel lock helper implementation would
> be complicated and slow. It's possible to do, but for the use
> cases so far there is no need.

OK, I hope that works out. I've been told that dtrace does something
similar (!) by eschewing protection on global variables such as
strings. In their case it's less bad than it sounds because they are
used to offloading computation to userspace or to store only
thread-local state, and accept the corollary limitations on control.

(Systemtap does fully & automatically protect shared variables, even
in the face of run-time script errors.)


- FChE

2014-07-30 18:53:58

by Alexei Starovoitov

Subject: Re: [PATCH RFC v3 net-next 3/3] samples: bpf: eBPF dropmon example in C

On Wed, Jul 30, 2014 at 10:36 AM, Frank Ch. Eigler <[email protected]> wrote:
>> > What kind of locking/serialization is provided by the ebpf runtime
>> > over shared variables such as my_map?
>>
>> it's traditional rcu scheme.
>
> OK, that protects the table structure, but:
>
>> [...] In such case concurrent write access to map value can be done
>> with bpf_xadd instruction, though using normal read/write is also
>> allowed. In some cases the speed of racy var++ is preferred over
>> 'lock xadd'.
>
> ... so concurrency control over shared values is left up to the
> programmer.

Yes. It has to be flexible and fast.
One of our main use cases is network analytics, where a lot of
packets go through eBPF programs, so every cycle counts.
Mandatory locks in the critical path are not acceptable. If we add
locks, they will be optional.

>> There are no lock/unlock function helpers available to ebpf
>> programs, since program may terminate early with div by zero
>> for example, so in-kernel lock helper implementation would
>> be complicated and slow. It's possible to do, but for the use
>> cases so far there is no need.
>
> OK, I hope that works out. I've been told that dtrace does something
> similar (!) by eschewing protection on global variables such as
> strings. In their case it's less bad than it sounds because they are
> used to offloading computation to userspace or to store only
> thread-local state, and accept the corollary limitations on control.

Interesting.
BTW, things like global variables and per-cpu storage are potential eBPF
features. So far they're 'nice to have' rather than 'mandatory'.
The maps are powerful enough to do the same:
global storage is a map of one element;
per-cpu storage is a map of num_cpu elements.