Hi All,
This patch set demonstrates the potential of eBPF.

The first patch, "net: filter: split filter.c into two files", splits the eBPF
interpreter out of networking into kernel/bpf/. The goal is for the BPF
subsystem to be usable in NET-less configurations. Though the whole set is
marked as RFC, the 1st patch is good to go. A similar version of that patch was
posted a few weeks ago, but was deferred, I'm assuming due to lack of forward
visibility. I hope this patch set shows what eBPF is capable of and where it's
heading.
The other patches expose the eBPF instruction set to user space and introduce
the concepts of maps and programs accessible via syscall.

'maps' is a generic storage of different types for sharing data between kernel
and user space. Maps are referenced by a global id. Root can create multiple
maps of different types, where key/value are opaque bytes of data. It's up to
user space and the eBPF programs to decide what they store in the maps.
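
To make this concrete, here is a minimal user-space sketch of the map usage
described above. The wrapper names and signatures (bpf_create_map(),
bpf_update_elem(), bpf_lookup_elem()) are illustrative stand-ins for the
helpers added in samples/bpf/libbpf.h, not the exact interface, so the snippet
only links against such a library:

/* sketch: create a hash map and share one element with eBPF programs */
#include <stdio.h>

/* assumed wrappers around the new BPF syscall; see samples/bpf/libbpf.h */
int bpf_create_map(int map_type, int key_size, int value_size, int max_entries);
int bpf_update_elem(int map_id, void *key, void *value);
int bpf_lookup_elem(int map_id, void *key, void *value);

int main(void)
{
	int key = 6;			/* e.g. IPPROTO_TCP */
	long long value = 0;
	int map_id;

	/* root creates a map with opaque 4-byte keys and 8-byte values */
	map_id = bpf_create_map(1 /* BPF_MAP_TYPE_HASH */, sizeof(key),
				sizeof(value), 1024);
	if (map_id < 0)
		return 1;

	bpf_update_elem(map_id, &key, &value);	/* seed the element */
	/* ... an attached eBPF program bumps the value on every event ... */
	bpf_lookup_elem(map_id, &key, &value);	/* read back what it stored */
	printf("key %d -> %lld\n", key, value);
	return 0;
}

Since maps are identified by a global id, any sufficiently privileged process
and any eBPF program that took a hold of the map can operate on the same
elements.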
eBPF programs are similar to kernel modules. They live in a global space and
have a unique prog_id. Each program is a safe, run-to-completion set of
instructions. The eBPF verifier statically determines that the program
terminates and is safe to execute. During verification the program takes a
hold of the maps that it intends to use, so selected maps cannot be removed
until the program is unloaded. A program can be attached to different events.
These events can be packets, tracepoint events and other types in the future.
A new event triggers execution of the program, which may store information
about the event in the maps. Beyond storing data, programs may call into
in-kernel helper functions which may, for example, dump the stack, do
trace_printk or other forms of live kernel debugging. The same program can be
attached to multiple events, and different programs can access the same map:
   tracepoint    tracepoint    tracepoint     sk_buff      sk_buff
    event A       event B       event C       on eth0      on eth1
       |             |             |             |            |
       |             |             |             |            |
       --> tracing <--          tracing        socket       socket
            prog_1               prog_2        prog_3       prog_4
            |   |                   |             |            |
         |---   -----|          |-------|         |---map_3----|
       map_1        map_2
User space (via syscall) and eBPF programs access maps concurrently.
The last two patches are sample code. The 1st demonstrates stateful packet
inspection: it counts tcp and udp packets on eth0. It should be easy to see how
this eBPF framework can be used for network analytics.

The 2nd sample is a simple 'drop monitor'. It attaches to the kfree_skb
tracepoint event and counts the number of packet drops at each particular $pc
location. User space periodically summarizes what the eBPF programs recorded.

In these two samples the eBPF programs are tiny and written in 'assembler' with
macros. More complex programs can be written in C (the llvm backend is not part
of this diff, to reduce the 'huge' perception).
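
For a flavor of that 'assembler with macros' style, here is a minimal sketch
(not one of the included samples): it looks up a fixed key in map_id 1, which
is assumed to already exist, and atomically bumps its value; the real samples
first derive the key from the packet or tracepoint data. The macro names
follow the examples in Documentation/networking/filter.txt; spelling the xadd
via BPF_RAW_INSN() is an assumption:

static struct sock_filter_int prog[] = {
	BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),		/* key = 0 on the stack */
	BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),	/* r2 = fp */
	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),		/* r2 = fp - 8 = &key */
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),		/* r1 = map_id 1 */
	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),		/* no element -> exit */
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
	BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0),		/* socket filter: drop */
	BPF_EXIT_INSN(),
};

User space then reads the counter back through the map lookup syscall.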
Since eBPF is fully JITed on x86_64, the cost of running an eBPF program is
very small even for high-frequency events. Here are numbers comparing the
flow_dissector in C vs eBPF:
x86_64 skb_flow_dissect() same skb (all cached) - 42 nsec per call
x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per call
eBPF+jit skb_flow_dissect() same skb (all cached) - 51 nsec per call
eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call
A detailed explanation of the eBPF verifier and safety is in patch 08/14.
Thanks
Alexei
Minor todo: rename 'struct sock_filter_int' to 'struct bpf_insn'. It's not
part of this diff, to reduce its size.
------
The following changes since commit c1c27fb9b3040a2559d4d3e1183afa8c106bc94a:
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next (2014-06-27 12:59:38 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master
for you to fetch changes up to 4c8da0f21220087e38894c69339cddc64c1220f9:
samples: bpf: example of tracing filters with eBPF (2014-06-27 15:22:07 -0700)
----------------------------------------------------------------
Alexei Starovoitov (14):
net: filter: split filter.c into two files
net: filter: split filter.h and expose eBPF to user space
bpf: introduce syscall(BPF, ...) and BPF maps
bpf: update MAINTAINERS entry
bpf: add lookup/update/delete/iterate methods to BPF maps
bpf: add hashtable type of BPF maps
bpf: expand BPF syscall with program load/unload
bpf: add eBPF verifier
bpf: allow eBPF programs to use maps
net: sock: allow eBPF programs to be attached to sockets
tracing: allow eBPF programs to be attached to events
samples: bpf: add mini eBPF library to manipulate maps and programs
samples: bpf: example of stateful socket filtering
samples: bpf: example of tracing filters with eBPF
Documentation/networking/filter.txt | 302 +++++++
MAINTAINERS | 9 +
arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/cris/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/x86/syscalls/syscall_64.tbl | 1 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
include/linux/bpf.h | 135 +++
include/linux/filter.h | 304 +------
include/linux/ftrace_event.h | 5 +
include/linux/syscalls.h | 2 +
include/trace/bpf_trace.h | 29 +
include/trace/ftrace.h | 10 +
include/uapi/asm-generic/socket.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/bpf.h | 403 +++++++++
kernel/Makefile | 1 +
kernel/bpf/Makefile | 1 +
kernel/bpf/core.c | 548 ++++++++++++
kernel/bpf/hashtab.c | 371 +++++++++
kernel/bpf/syscall.c | 778 +++++++++++++++++
kernel/bpf/verifier.c | 1431 ++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 3 +
kernel/trace/Kconfig | 1 +
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 217 +++++
kernel/trace/trace.h | 3 +
kernel/trace/trace_events.c | 7 +
kernel/trace/trace_events_filter.c | 72 +-
net/core/filter.c | 646 +++-----------
net/core/sock.c | 13 +
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 15 +
samples/bpf/dropmon.c | 127 +++
samples/bpf/libbpf.c | 114 +++
samples/bpf/libbpf.h | 18 +
samples/bpf/sock_example.c | 160 ++++
47 files changed, 4941 insertions(+), 820 deletions(-)
create mode 100644 include/linux/bpf.h
create mode 100644 include/trace/bpf_trace.h
create mode 100644 include/uapi/linux/bpf.h
create mode 100644 kernel/bpf/Makefile
create mode 100644 kernel/bpf/core.c
create mode 100644 kernel/bpf/hashtab.c
create mode 100644 kernel/bpf/syscall.c
create mode 100644 kernel/bpf/verifier.c
create mode 100644 kernel/trace/bpf_trace.c
create mode 100644 samples/bpf/.gitignore
create mode 100644 samples/bpf/Makefile
create mode 100644 samples/bpf/dropmon.c
create mode 100644 samples/bpf/libbpf.c
create mode 100644 samples/bpf/libbpf.h
create mode 100644 samples/bpf/sock_example.c
--
1.7.9.5
BPF is used in several kernel components. This split creates a logical
boundary between the generic eBPF core and the rest:

kernel/bpf/core.c:  eBPF interpreter
net/core/filter.c:  classic->eBPF converter, classic verifiers, socket filters

This patch only moves functions.
Signed-off-by: Alexei Starovoitov <[email protected]>
---
kernel/Makefile | 1 +
kernel/bpf/Makefile | 1 +
kernel/bpf/core.c | 545 +++++++++++++++++++++++++++++++++++++++++++++++++++
net/core/filter.c | 520 ------------------------------------------------
4 files changed, 547 insertions(+), 520 deletions(-)
create mode 100644 kernel/bpf/Makefile
create mode 100644 kernel/bpf/core.c
diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b6246ce9..e7360b7c2c0e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
+obj-$(CONFIG_NET) += bpf/
obj-$(CONFIG_PERF_EVENTS) += events/
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
new file mode 100644
index 000000000000..6a71145e2769
--- /dev/null
+++ b/kernel/bpf/Makefile
@@ -0,0 +1 @@
+obj-y := core.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
new file mode 100644
index 000000000000..dd9c29ff720e
--- /dev/null
+++ b/kernel/bpf/core.c
@@ -0,0 +1,545 @@
+/*
+ * Linux Socket Filter - Kernel level socket filtering
+ *
+ * Based on the design of the Berkeley Packet Filter. The new
+ * internal format has been designed by PLUMgrid:
+ *
+ * Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
+ *
+ * Authors:
+ *
+ * Jay Schulist <[email protected]>
+ * Alexei Starovoitov <[email protected]>
+ * Daniel Borkmann <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Andi Kleen - Fix a few bad bugs and races.
+ * Kris Katterjohn - Added many additional checks in sk_chk_filter()
+ */
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+/* Registers */
+#define BPF_R0 regs[BPF_REG_0]
+#define BPF_R1 regs[BPF_REG_1]
+#define BPF_R2 regs[BPF_REG_2]
+#define BPF_R3 regs[BPF_REG_3]
+#define BPF_R4 regs[BPF_REG_4]
+#define BPF_R5 regs[BPF_REG_5]
+#define BPF_R6 regs[BPF_REG_6]
+#define BPF_R7 regs[BPF_REG_7]
+#define BPF_R8 regs[BPF_REG_8]
+#define BPF_R9 regs[BPF_REG_9]
+#define BPF_R10 regs[BPF_REG_10]
+
+/* Named registers */
+#define DST regs[insn->dst_reg]
+#define SRC regs[insn->src_reg]
+#define FP regs[BPF_REG_FP]
+#define ARG1 regs[BPF_REG_ARG1]
+#define CTX regs[BPF_REG_CTX]
+#define IMM insn->imm
+
+/* No hurry in this branch
+ *
+ * Exported for the bpf jit load helper.
+ */
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
+{
+ u8 *ptr = NULL;
+
+ if (k >= SKF_NET_OFF)
+ ptr = skb_network_header(skb) + k - SKF_NET_OFF;
+ else if (k >= SKF_LL_OFF)
+ ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
+ if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
+ return ptr;
+
+ return NULL;
+}
+
+static inline void *load_pointer(const struct sk_buff *skb, int k,
+ unsigned int size, void *buffer)
+{
+ if (k >= 0)
+ return skb_header_pointer(skb, k, size, buffer);
+
+ return bpf_internal_load_pointer_neg_helper(skb, k, size);
+}
+
+/* Base function for offset calculation. Needs to go into .text section,
+ * therefore keeping it non-static as well; will also be used by JITs
+ * anyway later on, so do not let the compiler omit it.
+ */
+noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ return 0;
+}
+
+/**
+ * __sk_run_filter - run a filter on a given context
+ * @ctx: buffer to run the filter on
+ * @insn: filter to apply
+ *
+ * Decode and apply filter instructions to the skb->data. Return length to
+ * keep, 0 for none. @ctx is the data we are operating on, @insn is the
+ * array of filter instructions.
+ */
+static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
+{
+ u64 stack[MAX_BPF_STACK / sizeof(u64)];
+ u64 regs[MAX_BPF_REG], tmp;
+ static const void *jumptable[256] = {
+ [0 ... 255] = &&default_label,
+ /* Now overwrite non-defaults ... */
+ /* 32 bit ALU operations */
+ [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
+ [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
+ [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
+ [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
+ [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
+ [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
+ [BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X,
+ [BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K,
+ [BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
+ [BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
+ [BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
+ [BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
+ [BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
+ [BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
+ [BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
+ [BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
+ [BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
+ [BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
+ [BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
+ [BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
+ [BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
+ [BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
+ [BPF_ALU | BPF_NEG] = &&ALU_NEG,
+ [BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
+ [BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
+ /* 64 bit ALU operations */
+ [BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
+ [BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
+ [BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
+ [BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
+ [BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
+ [BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
+ [BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
+ [BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
+ [BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
+ [BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
+ [BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
+ [BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
+ [BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
+ [BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
+ [BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
+ [BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
+ [BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
+ [BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
+ [BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
+ [BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
+ [BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
+ [BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
+ [BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
+ [BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
+ [BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
+ /* Call instruction */
+ [BPF_JMP | BPF_CALL] = &&JMP_CALL,
+ /* Jumps */
+ [BPF_JMP | BPF_JA] = &&JMP_JA,
+ [BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
+ [BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
+ [BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
+ [BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
+ [BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
+ [BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
+ [BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
+ [BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
+ [BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
+ [BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
+ [BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
+ [BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
+ [BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
+ [BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
+ /* Program return */
+ [BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
+ /* Store instructions */
+ [BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
+ [BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
+ [BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
+ [BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
+ [BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
+ [BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
+ [BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
+ [BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
+ [BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
+ [BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
+ /* Load instructions */
+ [BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
+ [BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
+ [BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
+ [BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
+ [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
+ [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
+ [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
+ [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
+ [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
+ [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
+ };
+ void *ptr;
+ int off;
+
+#define CONT ({ insn++; goto select_insn; })
+#define CONT_JMP ({ insn++; goto select_insn; })
+
+ FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
+ ARG1 = (u64) (unsigned long) ctx;
+
+ /* Registers used in classic BPF programs need to be reset first. */
+ regs[BPF_REG_A] = 0;
+ regs[BPF_REG_X] = 0;
+
+select_insn:
+ goto *jumptable[insn->code];
+
+ /* ALU */
+#define ALU(OPCODE, OP) \
+ ALU64_##OPCODE##_X: \
+ DST = DST OP SRC; \
+ CONT; \
+ ALU_##OPCODE##_X: \
+ DST = (u32) DST OP (u32) SRC; \
+ CONT; \
+ ALU64_##OPCODE##_K: \
+ DST = DST OP IMM; \
+ CONT; \
+ ALU_##OPCODE##_K: \
+ DST = (u32) DST OP (u32) IMM; \
+ CONT;
+
+ ALU(ADD, +)
+ ALU(SUB, -)
+ ALU(AND, &)
+ ALU(OR, |)
+ ALU(LSH, <<)
+ ALU(RSH, >>)
+ ALU(XOR, ^)
+ ALU(MUL, *)
+#undef ALU
+ ALU_NEG:
+ DST = (u32) -DST;
+ CONT;
+ ALU64_NEG:
+ DST = -DST;
+ CONT;
+ ALU_MOV_X:
+ DST = (u32) SRC;
+ CONT;
+ ALU_MOV_K:
+ DST = (u32) IMM;
+ CONT;
+ ALU64_MOV_X:
+ DST = SRC;
+ CONT;
+ ALU64_MOV_K:
+ DST = IMM;
+ CONT;
+ ALU64_ARSH_X:
+ (*(s64 *) &DST) >>= SRC;
+ CONT;
+ ALU64_ARSH_K:
+ (*(s64 *) &DST) >>= IMM;
+ CONT;
+ ALU64_MOD_X:
+ if (unlikely(SRC == 0))
+ return 0;
+ tmp = DST;
+ DST = do_div(tmp, SRC);
+ CONT;
+ ALU_MOD_X:
+ if (unlikely(SRC == 0))
+ return 0;
+ tmp = (u32) DST;
+ DST = do_div(tmp, (u32) SRC);
+ CONT;
+ ALU64_MOD_K:
+ tmp = DST;
+ DST = do_div(tmp, IMM);
+ CONT;
+ ALU_MOD_K:
+ tmp = (u32) DST;
+ DST = do_div(tmp, (u32) IMM);
+ CONT;
+ ALU64_DIV_X:
+ if (unlikely(SRC == 0))
+ return 0;
+ do_div(DST, SRC);
+ CONT;
+ ALU_DIV_X:
+ if (unlikely(SRC == 0))
+ return 0;
+ tmp = (u32) DST;
+ do_div(tmp, (u32) SRC);
+ DST = (u32) tmp;
+ CONT;
+ ALU64_DIV_K:
+ do_div(DST, IMM);
+ CONT;
+ ALU_DIV_K:
+ tmp = (u32) DST;
+ do_div(tmp, (u32) IMM);
+ DST = (u32) tmp;
+ CONT;
+ ALU_END_TO_BE:
+ switch (IMM) {
+ case 16:
+ DST = (__force u16) cpu_to_be16(DST);
+ break;
+ case 32:
+ DST = (__force u32) cpu_to_be32(DST);
+ break;
+ case 64:
+ DST = (__force u64) cpu_to_be64(DST);
+ break;
+ }
+ CONT;
+ ALU_END_TO_LE:
+ switch (IMM) {
+ case 16:
+ DST = (__force u16) cpu_to_le16(DST);
+ break;
+ case 32:
+ DST = (__force u32) cpu_to_le32(DST);
+ break;
+ case 64:
+ DST = (__force u64) cpu_to_le64(DST);
+ break;
+ }
+ CONT;
+
+ /* CALL */
+ JMP_CALL:
+ /* Function call scratches BPF_R1-BPF_R5 registers,
+ * preserves BPF_R6-BPF_R9, and stores return value
+ * into BPF_R0.
+ */
+ BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
+ BPF_R4, BPF_R5);
+ CONT;
+
+ /* JMP */
+ JMP_JA:
+ insn += insn->off;
+ CONT;
+ JMP_JEQ_X:
+ if (DST == SRC) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JEQ_K:
+ if (DST == IMM) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JNE_X:
+ if (DST != SRC) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JNE_K:
+ if (DST != IMM) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGT_X:
+ if (DST > SRC) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGT_K:
+ if (DST > IMM) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGE_X:
+ if (DST >= SRC) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGE_K:
+ if (DST >= IMM) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGT_X:
+ if (((s64) DST) > ((s64) SRC)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGT_K:
+ if (((s64) DST) > ((s64) IMM)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGE_X:
+ if (((s64) DST) >= ((s64) SRC)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGE_K:
+ if (((s64) DST) >= ((s64) IMM)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSET_X:
+ if (DST & SRC) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSET_K:
+ if (DST & IMM) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_EXIT:
+ return BPF_R0;
+
+ /* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+ STX_MEM_##SIZEOP: \
+ *(SIZE *)(unsigned long) (DST + insn->off) = SRC; \
+ CONT; \
+ ST_MEM_##SIZEOP: \
+ *(SIZE *)(unsigned long) (DST + insn->off) = IMM; \
+ CONT; \
+ LDX_MEM_##SIZEOP: \
+ DST = *(SIZE *)(unsigned long) (SRC + insn->off); \
+ CONT;
+
+ LDST(B, u8)
+ LDST(H, u16)
+ LDST(W, u32)
+ LDST(DW, u64)
+#undef LDST
+ STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
+ atomic_add((u32) SRC, (atomic_t *)(unsigned long)
+ (DST + insn->off));
+ CONT;
+ STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
+ atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
+ (DST + insn->off));
+ CONT;
+ LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
+ off = IMM;
+load_word:
+ /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
+ * only appearing in the programs where ctx ==
+ * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
+ * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
+ * internal BPF verifier will check that BPF_R6 ==
+ * ctx.
+ *
+ * BPF_ABS and BPF_IND are wrappers of function calls,
+ * so they scratch BPF_R1-BPF_R5 registers, preserve
+ * BPF_R6-BPF_R9, and store return value into BPF_R0.
+ *
+ * Implicit input:
+ * ctx == skb == BPF_R6 == CTX
+ *
+ * Explicit input:
+ * SRC == any register
+ * IMM == 32-bit immediate
+ *
+ * Output:
+ * BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
+ */
+
+ ptr = load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
+ if (likely(ptr != NULL)) {
+ BPF_R0 = get_unaligned_be32(ptr);
+ CONT;
+ }
+
+ return 0;
+ LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
+ off = IMM;
+load_half:
+ ptr = load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
+ if (likely(ptr != NULL)) {
+ BPF_R0 = get_unaligned_be16(ptr);
+ CONT;
+ }
+
+ return 0;
+ LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
+ off = IMM;
+load_byte:
+ ptr = load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
+ if (likely(ptr != NULL)) {
+ BPF_R0 = *(u8 *)ptr;
+ CONT;
+ }
+
+ return 0;
+ LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
+ off = IMM + SRC;
+ goto load_word;
+ LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
+ off = IMM + SRC;
+ goto load_half;
+ LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
+ off = IMM + SRC;
+ goto load_byte;
+
+ default_label:
+ /* If we ever reach this, we have a bug somewhere. */
+ WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+ return 0;
+}
+
+void __weak bpf_int_jit_compile(struct sk_filter *prog)
+{
+}
+
+/**
+ * sk_filter_select_runtime - select execution runtime for BPF program
+ * @fp: sk_filter populated with internal BPF program
+ *
+ * try to JIT internal BPF program, if JIT is not available select interpreter
+ * BPF program will be executed via SK_RUN_FILTER() macro
+ */
+void sk_filter_select_runtime(struct sk_filter *fp)
+{
+ fp->bpf_func = (void *) __sk_run_filter;
+
+ /* Probe if internal BPF can be JITed */
+ bpf_int_jit_compile(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
+
+/* free internal BPF program */
+void sk_filter_free(struct sk_filter *fp)
+{
+ bpf_jit_free(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_free);
diff --git a/net/core/filter.c b/net/core/filter.c
index 1dbf6462f766..79d8a1b1ad75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -45,54 +45,6 @@
#include <linux/seccomp.h>
#include <linux/if_vlan.h>
-/* Registers */
-#define BPF_R0 regs[BPF_REG_0]
-#define BPF_R1 regs[BPF_REG_1]
-#define BPF_R2 regs[BPF_REG_2]
-#define BPF_R3 regs[BPF_REG_3]
-#define BPF_R4 regs[BPF_REG_4]
-#define BPF_R5 regs[BPF_REG_5]
-#define BPF_R6 regs[BPF_REG_6]
-#define BPF_R7 regs[BPF_REG_7]
-#define BPF_R8 regs[BPF_REG_8]
-#define BPF_R9 regs[BPF_REG_9]
-#define BPF_R10 regs[BPF_REG_10]
-
-/* Named registers */
-#define DST regs[insn->dst_reg]
-#define SRC regs[insn->src_reg]
-#define FP regs[BPF_REG_FP]
-#define ARG1 regs[BPF_REG_ARG1]
-#define CTX regs[BPF_REG_CTX]
-#define IMM insn->imm
-
-/* No hurry in this branch
- *
- * Exported for the bpf jit load helper.
- */
-void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
-{
- u8 *ptr = NULL;
-
- if (k >= SKF_NET_OFF)
- ptr = skb_network_header(skb) + k - SKF_NET_OFF;
- else if (k >= SKF_LL_OFF)
- ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
- if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
- return ptr;
-
- return NULL;
-}
-
-static inline void *load_pointer(const struct sk_buff *skb, int k,
- unsigned int size, void *buffer)
-{
- if (k >= 0)
- return skb_header_pointer(skb, k, size, buffer);
-
- return bpf_internal_load_pointer_neg_helper(skb, k, size);
-}
-
/**
* sk_filter - run a packet through a socket filter
* @sk: sock associated with &sk_buff
@@ -135,451 +87,6 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
}
EXPORT_SYMBOL(sk_filter);
-/* Base function for offset calculation. Needs to go into .text section,
- * therefore keeping it non-static as well; will also be used by JITs
- * anyway later on, so do not let the compiler omit it.
- */
-noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
-{
- return 0;
-}
-
-/**
- * __sk_run_filter - run a filter on a given context
- * @ctx: buffer to run the filter on
- * @insn: filter to apply
- *
- * Decode and apply filter instructions to the skb->data. Return length to
- * keep, 0 for none. @ctx is the data we are operating on, @insn is the
- * array of filter instructions.
- */
-static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
-{
- u64 stack[MAX_BPF_STACK / sizeof(u64)];
- u64 regs[MAX_BPF_REG], tmp;
- static const void *jumptable[256] = {
- [0 ... 255] = &&default_label,
- /* Now overwrite non-defaults ... */
- /* 32 bit ALU operations */
- [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
- [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
- [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
- [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
- [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
- [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
- [BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X,
- [BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K,
- [BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
- [BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
- [BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
- [BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
- [BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
- [BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
- [BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
- [BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
- [BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
- [BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
- [BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
- [BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
- [BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
- [BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
- [BPF_ALU | BPF_NEG] = &&ALU_NEG,
- [BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
- [BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
- /* 64 bit ALU operations */
- [BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
- [BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
- [BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
- [BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
- [BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
- [BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
- [BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
- [BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
- [BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
- [BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
- [BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
- [BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
- [BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
- [BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
- [BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
- [BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
- [BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
- [BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
- [BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
- [BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
- [BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
- [BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
- [BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
- [BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
- [BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
- /* Call instruction */
- [BPF_JMP | BPF_CALL] = &&JMP_CALL,
- /* Jumps */
- [BPF_JMP | BPF_JA] = &&JMP_JA,
- [BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
- [BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
- [BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
- [BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
- [BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
- [BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
- [BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
- [BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
- [BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
- [BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
- [BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
- [BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
- [BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
- [BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
- /* Program return */
- [BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
- /* Store instructions */
- [BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
- [BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
- [BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
- [BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
- [BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
- [BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
- [BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
- [BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
- [BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
- [BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
- /* Load instructions */
- [BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
- [BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
- [BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
- [BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
- [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
- [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
- [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
- [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
- [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
- [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
- };
- void *ptr;
- int off;
-
-#define CONT ({ insn++; goto select_insn; })
-#define CONT_JMP ({ insn++; goto select_insn; })
-
- FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
- ARG1 = (u64) (unsigned long) ctx;
-
- /* Registers used in classic BPF programs need to be reset first. */
- regs[BPF_REG_A] = 0;
- regs[BPF_REG_X] = 0;
-
-select_insn:
- goto *jumptable[insn->code];
-
- /* ALU */
-#define ALU(OPCODE, OP) \
- ALU64_##OPCODE##_X: \
- DST = DST OP SRC; \
- CONT; \
- ALU_##OPCODE##_X: \
- DST = (u32) DST OP (u32) SRC; \
- CONT; \
- ALU64_##OPCODE##_K: \
- DST = DST OP IMM; \
- CONT; \
- ALU_##OPCODE##_K: \
- DST = (u32) DST OP (u32) IMM; \
- CONT;
-
- ALU(ADD, +)
- ALU(SUB, -)
- ALU(AND, &)
- ALU(OR, |)
- ALU(LSH, <<)
- ALU(RSH, >>)
- ALU(XOR, ^)
- ALU(MUL, *)
-#undef ALU
- ALU_NEG:
- DST = (u32) -DST;
- CONT;
- ALU64_NEG:
- DST = -DST;
- CONT;
- ALU_MOV_X:
- DST = (u32) SRC;
- CONT;
- ALU_MOV_K:
- DST = (u32) IMM;
- CONT;
- ALU64_MOV_X:
- DST = SRC;
- CONT;
- ALU64_MOV_K:
- DST = IMM;
- CONT;
- ALU64_ARSH_X:
- (*(s64 *) &DST) >>= SRC;
- CONT;
- ALU64_ARSH_K:
- (*(s64 *) &DST) >>= IMM;
- CONT;
- ALU64_MOD_X:
- if (unlikely(SRC == 0))
- return 0;
- tmp = DST;
- DST = do_div(tmp, SRC);
- CONT;
- ALU_MOD_X:
- if (unlikely(SRC == 0))
- return 0;
- tmp = (u32) DST;
- DST = do_div(tmp, (u32) SRC);
- CONT;
- ALU64_MOD_K:
- tmp = DST;
- DST = do_div(tmp, IMM);
- CONT;
- ALU_MOD_K:
- tmp = (u32) DST;
- DST = do_div(tmp, (u32) IMM);
- CONT;
- ALU64_DIV_X:
- if (unlikely(SRC == 0))
- return 0;
- do_div(DST, SRC);
- CONT;
- ALU_DIV_X:
- if (unlikely(SRC == 0))
- return 0;
- tmp = (u32) DST;
- do_div(tmp, (u32) SRC);
- DST = (u32) tmp;
- CONT;
- ALU64_DIV_K:
- do_div(DST, IMM);
- CONT;
- ALU_DIV_K:
- tmp = (u32) DST;
- do_div(tmp, (u32) IMM);
- DST = (u32) tmp;
- CONT;
- ALU_END_TO_BE:
- switch (IMM) {
- case 16:
- DST = (__force u16) cpu_to_be16(DST);
- break;
- case 32:
- DST = (__force u32) cpu_to_be32(DST);
- break;
- case 64:
- DST = (__force u64) cpu_to_be64(DST);
- break;
- }
- CONT;
- ALU_END_TO_LE:
- switch (IMM) {
- case 16:
- DST = (__force u16) cpu_to_le16(DST);
- break;
- case 32:
- DST = (__force u32) cpu_to_le32(DST);
- break;
- case 64:
- DST = (__force u64) cpu_to_le64(DST);
- break;
- }
- CONT;
-
- /* CALL */
- JMP_CALL:
- /* Function call scratches BPF_R1-BPF_R5 registers,
- * preserves BPF_R6-BPF_R9, and stores return value
- * into BPF_R0.
- */
- BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
- BPF_R4, BPF_R5);
- CONT;
-
- /* JMP */
- JMP_JA:
- insn += insn->off;
- CONT;
- JMP_JEQ_X:
- if (DST == SRC) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JEQ_K:
- if (DST == IMM) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JNE_X:
- if (DST != SRC) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JNE_K:
- if (DST != IMM) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGT_X:
- if (DST > SRC) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGT_K:
- if (DST > IMM) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGE_X:
- if (DST >= SRC) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGE_K:
- if (DST >= IMM) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGT_X:
- if (((s64) DST) > ((s64) SRC)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGT_K:
- if (((s64) DST) > ((s64) IMM)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGE_X:
- if (((s64) DST) >= ((s64) SRC)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGE_K:
- if (((s64) DST) >= ((s64) IMM)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSET_X:
- if (DST & SRC) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSET_K:
- if (DST & IMM) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_EXIT:
- return BPF_R0;
-
- /* STX and ST and LDX*/
-#define LDST(SIZEOP, SIZE) \
- STX_MEM_##SIZEOP: \
- *(SIZE *)(unsigned long) (DST + insn->off) = SRC; \
- CONT; \
- ST_MEM_##SIZEOP: \
- *(SIZE *)(unsigned long) (DST + insn->off) = IMM; \
- CONT; \
- LDX_MEM_##SIZEOP: \
- DST = *(SIZE *)(unsigned long) (SRC + insn->off); \
- CONT;
-
- LDST(B, u8)
- LDST(H, u16)
- LDST(W, u32)
- LDST(DW, u64)
-#undef LDST
- STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
- atomic_add((u32) SRC, (atomic_t *)(unsigned long)
- (DST + insn->off));
- CONT;
- STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
- atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
- (DST + insn->off));
- CONT;
- LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
- off = IMM;
-load_word:
- /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
- * only appearing in the programs where ctx ==
- * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
- * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
- * internal BPF verifier will check that BPF_R6 ==
- * ctx.
- *
- * BPF_ABS and BPF_IND are wrappers of function calls,
- * so they scratch BPF_R1-BPF_R5 registers, preserve
- * BPF_R6-BPF_R9, and store return value into BPF_R0.
- *
- * Implicit input:
- * ctx == skb == BPF_R6 == CTX
- *
- * Explicit input:
- * SRC == any register
- * IMM == 32-bit immediate
- *
- * Output:
- * BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
- */
-
- ptr = load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
- if (likely(ptr != NULL)) {
- BPF_R0 = get_unaligned_be32(ptr);
- CONT;
- }
-
- return 0;
- LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
- off = IMM;
-load_half:
- ptr = load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
- if (likely(ptr != NULL)) {
- BPF_R0 = get_unaligned_be16(ptr);
- CONT;
- }
-
- return 0;
- LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
- off = IMM;
-load_byte:
- ptr = load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
- if (likely(ptr != NULL)) {
- BPF_R0 = *(u8 *)ptr;
- CONT;
- }
-
- return 0;
- LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
- off = IMM + SRC;
- goto load_word;
- LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
- off = IMM + SRC;
- goto load_half;
- LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
- off = IMM + SRC;
- goto load_byte;
-
- default_label:
- /* If we ever reach this, we have a bug somewhere. */
- WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
- return 0;
-}
-
/* Helper to find the offset of pkt_type in sk_buff structure. We want
* to make sure its still a 3bit field starting at a byte boundary;
* taken from arch/x86/net/bpf_jit_comp.c.
@@ -1464,33 +971,6 @@ out_err:
return ERR_PTR(err);
}
-void __weak bpf_int_jit_compile(struct sk_filter *prog)
-{
-}
-
-/**
- * sk_filter_select_runtime - select execution runtime for BPF program
- * @fp: sk_filter populated with internal BPF program
- *
- * try to JIT internal BPF program, if JIT is not available select interpreter
- * BPF program will be executed via SK_RUN_FILTER() macro
- */
-void sk_filter_select_runtime(struct sk_filter *fp)
-{
- fp->bpf_func = (void *) __sk_run_filter;
-
- /* Probe if internal BPF can be JITed */
- bpf_int_jit_compile(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
-
-/* free internal BPF program */
-void sk_filter_free(struct sk_filter *fp)
-{
- bpf_jit_free(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_free);
-
static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
struct sock *sk)
{
--
1.7.9.5
add a new map type, BPF_MAP_TYPE_HASH, and its simple (not auto-resizable)
hash table implementation
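
For orientation, each element is one allocation from a per-map kmem_cache laid
out as struct htab_elem, then the key padded to 8 bytes, then the value. A
small sketch of the resulting value offset (the helper name is made up; the
arithmetic mirrors htab_map_alloc() and htab_map_lookup_elem() below):

/* the value of an element lives right after the 8-byte-aligned key */
static inline void *htab_elem_value(struct htab_elem *l, u32 key_size)
{
	return l->key + round_up(key_size, 8);
}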
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/uapi/linux/bpf.h | 1 +
kernel/bpf/Makefile | 2 +-
kernel/bpf/hashtab.c | 371 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 373 insertions(+), 1 deletion(-)
create mode 100644 kernel/bpf/hashtab.c
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index faed2ce2d25a..1399ed1d5dad 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -354,6 +354,7 @@ enum bpf_map_attributes {
enum bpf_map_type {
BPF_MAP_TYPE_UNSPEC,
+ BPF_MAP_TYPE_HASH,
};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e9f7334ed07a..558e12712ebc 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o
+obj-y := core.o syscall.o hashtab.o
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
new file mode 100644
index 000000000000..6e481cacbba3
--- /dev/null
+++ b/kernel/bpf/hashtab.c
@@ -0,0 +1,371 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <linux/jhash.h>
+
+struct bpf_htab {
+ struct bpf_map map;
+ struct hlist_head *buckets;
+ struct kmem_cache *elem_cache;
+ char *slab_name;
+ spinlock_t lock;
+ u32 count; /* number of elements in this hashtable */
+ u32 n_buckets; /* number of hash buckets */
+ u32 elem_size; /* size of each element in bytes */
+};
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+ struct hlist_node hash_node;
+ struct rcu_head rcu;
+ struct bpf_htab *htab;
+ u32 hash;
+ u32 pad;
+ char key[0];
+};
+
+#define HASH_MAX_BUCKETS 1024
+#define BPF_MAP_MAX_KEY_SIZE 256
+static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
+{
+ struct bpf_htab *htab;
+ int err, i;
+
+ htab = kmalloc(sizeof(*htab), GFP_USER);
+ if (!htab)
+ return ERR_PTR(-ENOMEM);
+
+ /* look for mandatory map attributes */
+ err = -EINVAL;
+ if (!attr[BPF_MAP_KEY_SIZE])
+ goto free_htab;
+ htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
+
+ if (!attr[BPF_MAP_VALUE_SIZE])
+ goto free_htab;
+ htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
+
+ if (!attr[BPF_MAP_MAX_ENTRIES])
+ goto free_htab;
+ htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
+
+ htab->n_buckets = (htab->map.max_entries <= HASH_MAX_BUCKETS) ?
+ htab->map.max_entries : HASH_MAX_BUCKETS;
+
+ /* hash table size must be power of 2 */
+ if ((htab->n_buckets & (htab->n_buckets - 1)) != 0)
+ goto free_htab;
+
+ err = -E2BIG;
+ if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
+ goto free_htab;
+
+ err = -ENOMEM;
+ htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
+ GFP_USER);
+
+ if (!htab->buckets)
+ goto free_htab;
+
+ for (i = 0; i < htab->n_buckets; i++)
+ INIT_HLIST_HEAD(&htab->buckets[i]);
+
+ spin_lock_init(&htab->lock);
+ htab->count = 0;
+
+ htab->elem_size = sizeof(struct htab_elem) +
+ round_up(htab->map.key_size, 8) +
+ htab->map.value_size;
+
+ htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);
+ if (!htab->slab_name)
+ goto free_buckets;
+
+ htab->elem_cache = kmem_cache_create(htab->slab_name,
+ htab->elem_size, 0, 0, NULL);
+ if (!htab->elem_cache)
+ goto free_slab_name;
+
+ return &htab->map;
+
+free_slab_name:
+ kfree(htab->slab_name);
+free_buckets:
+ kfree(htab->buckets);
+free_htab:
+ kfree(htab);
+ return ERR_PTR(err);
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+ return jhash(key, key_len, 0);
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+ return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
+ void *key, u32 key_size)
+{
+ struct htab_elem *l;
+
+ hlist_for_each_entry_rcu(l, head, hash_node) {
+ if (l->hash == hash && !memcmp(&l->key, key, key_size))
+ return l;
+ }
+ return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct hlist_head *head;
+ struct htab_elem *l;
+ u32 hash, key_size;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ key_size = map->key_size;
+
+ hash = htab_map_hash(key, key_size);
+
+ head = select_bucket(htab, hash);
+
+ l = lookup_elem_raw(head, hash, key, key_size);
+
+ if (l)
+ return l->key + round_up(map->key_size, 8);
+ else
+ return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct hlist_head *head;
+ struct htab_elem *l, *next_l;
+ u32 hash, key_size;
+ int i;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ key_size = map->key_size;
+
+ hash = htab_map_hash(key, key_size);
+
+ head = select_bucket(htab, hash);
+
+ /* lookup the key */
+ l = lookup_elem_raw(head, hash, key, key_size);
+
+ if (!l) {
+ i = 0;
+ goto find_first_elem;
+ }
+
+ /* key was found, get next key in the same bucket */
+ next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+ struct htab_elem, hash_node);
+
+ if (next_l) {
+ /* if next elem in this hash list is non-zero, just return it */
+ memcpy(next_key, next_l->key, key_size);
+ return 0;
+ } else {
+ /* no more elements in this hash list, go to the next bucket */
+ i = hash & (htab->n_buckets - 1);
+ i++;
+ }
+
+find_first_elem:
+ /* iterate over buckets */
+ for (; i < htab->n_buckets; i++) {
+ head = select_bucket(htab, i);
+
+ /* pick first element in the bucket */
+ next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
+ struct htab_elem, hash_node);
+ if (next_l) {
+ /* if it's not empty, just return it */
+ memcpy(next_key, next_l->key, key_size);
+ return 0;
+ }
+ }
+
+	/* iterated over all buckets and all elements */
+ return -ENOENT;
+}
+
+static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
+{
+ void *l;
+
+ l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC);
+ if (!l)
+ return ERR_PTR(-ENOMEM);
+ return l;
+}
+
+static void free_htab_elem_rcu(struct rcu_head *rcu)
+{
+ struct htab_elem *l = container_of(rcu, struct htab_elem, rcu);
+
+ kmem_cache_free(l->htab->elem_cache, l);
+}
+
+static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+ l->htab = htab;
+ call_rcu(&l->rcu, free_htab_elem_rcu);
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_update_elem(struct bpf_map *map, void *key, void *value)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct htab_elem *l_new, *l_old;
+ struct hlist_head *head;
+ u32 key_size;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ l_new = htab_alloc_elem(htab);
+ if (IS_ERR(l_new))
+ return -ENOMEM;
+
+ key_size = map->key_size;
+
+ memcpy(l_new->key, key, key_size);
+ memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
+
+ l_new->hash = htab_map_hash(l_new->key, key_size);
+
+ head = select_bucket(htab, l_new->hash);
+
+ l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+
+ spin_lock_bh(&htab->lock);
+ if (!l_old && unlikely(htab->count >= map->max_entries)) {
+ /* if elem with this 'key' doesn't exist and we've reached
+ * max_entries limit, fail insertion of new elem
+ */
+ spin_unlock_bh(&htab->lock);
+ kmem_cache_free(htab->elem_cache, l_new);
+ return -EFBIG;
+ }
+
+ /* add new element to the head of the list, so that concurrent
+ * search will find it before old elem
+ */
+ hlist_add_head_rcu(&l_new->hash_node, head);
+ if (l_old) {
+ hlist_del_rcu(&l_old->hash_node);
+ release_htab_elem(htab, l_old);
+ } else {
+ htab->count++;
+ }
+ spin_unlock_bh(&htab->lock);
+
+ return 0;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_delete_elem(struct bpf_map *map, void *key)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+ struct htab_elem *l;
+ struct hlist_head *head;
+ u32 hash, key_size;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ key_size = map->key_size;
+
+ hash = htab_map_hash(key, key_size);
+
+ head = select_bucket(htab, hash);
+
+ l = lookup_elem_raw(head, hash, key, key_size);
+
+ if (l) {
+ spin_lock_bh(&htab->lock);
+ hlist_del_rcu(&l->hash_node);
+ htab->count--;
+ release_htab_elem(htab, l);
+ spin_unlock_bh(&htab->lock);
+ return 0;
+ }
+ return -ESRCH;
+}
+
+static void delete_all_elements(struct bpf_htab *htab)
+{
+ int i;
+
+ for (i = 0; i < htab->n_buckets; i++) {
+ struct hlist_head *head = select_bucket(htab, i);
+ struct hlist_node *n;
+ struct htab_elem *l;
+
+ hlist_for_each_entry_safe(l, n, head, hash_node) {
+ hlist_del_rcu(&l->hash_node);
+ htab->count--;
+ kmem_cache_free(htab->elem_cache, l);
+ }
+ }
+}
+
+/* called when map->refcnt goes to zero */
+static void htab_map_free(struct bpf_map *map)
+{
+ struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+ /* wait for all outstanding updates to complete */
+ synchronize_rcu();
+
+ /* kmem_cache_free all htab elements */
+ delete_all_elements(htab);
+
+ /* and destroy cache, which might sleep */
+ kmem_cache_destroy(htab->elem_cache);
+
+ kfree(htab->buckets);
+ kfree(htab->slab_name);
+ kfree(htab);
+}
+
+static struct bpf_map_ops htab_ops = {
+ .map_alloc = htab_map_alloc,
+ .map_free = htab_map_free,
+ .map_get_next_key = htab_map_get_next_key,
+ .map_lookup_elem = htab_map_lookup_elem,
+ .map_update_elem = htab_map_update_elem,
+ .map_delete_elem = htab_map_delete_elem,
+};
+
+static struct bpf_map_type_list tl = {
+ .ops = &htab_ops,
+ .type = BPF_MAP_TYPE_HASH,
+};
+
+static int __init register_htab_map(void)
+{
+ bpf_register_map_type(&tl);
+ return 0;
+}
+late_initcall(register_htab_map);
--
1.7.9.5
Safety of eBPF programs is statically determined by the verifier, which detects:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
It checks that:
- R1-R5 registers satisfy the function prototype
- the program terminates
- BPF_LD_ABS|IND instructions are only used in socket filters

It is configured with:
- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
  which tells the verifier which fields of 'ctx' are accessible
  (remember, 'ctx' is the first argument to the eBPF program)
- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
  which reports the argument types of the kernel helper functions that the
  eBPF program may call, so that the verifier can check that R1-R5 types match
  the prototype

More details are in Documentation/networking/filter.txt.
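
To illustrate the two callbacks, here is a sketch of how a program type might
wire them up. The structure layouts match include/linux/bpf.h in this series,
but the particular helper prototype and the ctx access policy below are
made-up examples, not code from the patches:

/* example prototype: a helper that takes a map_id and a pointer to a key
 * prepared on the stack, and returns either a map value pointer or NULL
 */
static const struct bpf_func_proto demo_lookup_proto = {
	.ret_type  = PTR_TO_MAP_CONDITIONAL,
	.arg1_type = CONST_ARG_MAP_ID,
	.arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
};

static const struct bpf_func_proto *demo_get_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_map_lookup_elem:
		return &demo_lookup_proto;
	default:
		return NULL;	/* all other helpers are rejected */
	}
}

/* allow reads of any aligned 4-byte word in the first 64 bytes of ctx */
static bool demo_is_valid_access(int off, int size, enum bpf_access_type type)
{
	return type == BPF_READ && size == 4 && (off & 3) == 0 &&
	       off >= 0 && off < 64;
}

static struct bpf_verifier_ops demo_ops = {
	.get_func_proto		= demo_get_func_proto,
	.is_valid_access	= demo_is_valid_access,
};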
Signed-off-by: Alexei Starovoitov <[email protected]>
---
Documentation/networking/filter.txt | 233 ++++++
include/linux/bpf.h | 48 ++
include/uapi/linux/bpf.h | 1 +
kernel/bpf/Makefile | 2 +-
kernel/bpf/syscall.c | 2 +-
kernel/bpf/verifier.c | 1431 +++++++++++++++++++++++++++++++++++
6 files changed, 1715 insertions(+), 2 deletions(-)
create mode 100644 kernel/bpf/verifier.c
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index e14e486f69cd..05fee8fcedf1 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -995,6 +995,108 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.
+eBPF verifier
+-------------
+The safety of the eBPF program is determined in two steps.
+
+First step does DAG check to disallow loops and other CFG validation.
+In particular it will detect programs that have unreachable instructions.
+(though classic BPF checker allows them)
+
+Second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX.
+If verifier sees an insn that does R2=R1, then R2 has now type
+PTR_TO_CTX as well and can be used on the right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=INVALID_PTR,
+since addition of two valid pointers makes invalid pointer.
+
+If register was never written to, it's not readable:
+ bpf_mov R0 = R2
+ bpf_exit
+will be rejected, since R2 is unreadable at the start of the program.
+
+After kernel function call, R1-R5 are reset to unreadable and
+R0 has a return type of the function.
+
+Since R6-R9 are callee saved, their state is preserved across the call.
+ bpf_mov R6 = 1
+ bpf_call foo
+ bpf_mov R0 = R6
+ bpf_exit
+is a correct program. If there was R1 instead of R6, it would have
+been rejected.
+
+Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(),
+so that its state is preserved across calls.
+
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
+For example:
+ bpf_mov R1 = 1
+ bpf_mov R2 = 2
+ bpf_xadd *(u32 *)(R1 + 3) += R2
+ bpf_exit
+will be rejected, since R1 doesn't have a valid pointer type at the time of
+execution of instruction bpf_xadd.
+
+At the start R1 contains pointer to ctx and R1 type is PTR_TO_CTX.
+ctx is generic. The verifier is configured to know what the context is for a
+particular class of bpf programs. For example, context == skb for socket
+filters and ctx == seccomp_data for seccomp filters.
+A callback is used to customize verifier to restrict eBPF program access to only
+certain fields within ctx structure with specified size and alignment.
+
+For example, the following insn:
+ bpf_ld R0 = *(u32 *)(R6 + 8)
+intends to load a word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise
+the verifier will reject the program.
+If R6=PTR_TO_STACK, then access should be aligned and be within
+stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
+so it will fail verification, since it's out of bounds.
+
+The verifier will allow eBPF program to read data from stack only after
+it wrote into it.
+Classic BPF verifier does similar check with M[0-15] memory slots.
+For example:
+ bpf_ld R0 = *(u32 *)(R10 - 4)
+ bpf_exit
+is invalid program.
+Though R10 is correct read-only register and has type PTR_TO_STACK
+and R10 - 4 is within stack bounds, there were no stores into that location.
+
+Pointer register spill/fill is tracked as well, since four (R6-R9)
+callee saved registers may not be enough for some programs.
+
+Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
+For example, skb_get_nlattr() function has the following definition:
+ struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
+and eBPF verifier will check that this function is always called with first
+argument being 'ctx'. In other words R1 must have type PTR_TO_CTX
+at the time of bpf_call insn.
+After the call register R0 will be set to readable state, so that
+program can access it.
+
+Function calls are the main mechanism to extend the functionality of eBPF
+programs. Socket filters may let programs call one set of functions, whereas
+tracing filters may allow a completely different set.
+
+If a function is made accessible to eBPF programs, it needs to be thought
+through from a security point of view. The verifier will guarantee that the
+function is called with valid arguments.
+
+seccomp and socket filters have different security restrictions for classic
+BPF. Seccomp solves this with a two-stage verifier: the classic BPF verifier
+is followed by the seccomp verifier. In case of eBPF, one configurable
+verifier is shared for all use cases.
+
+See details of eBPF verifier in kernel/bpf/verifier.c
+
eBPF maps
---------
'maps' is a generic storage of different types for sharing data between kernel
@@ -1064,6 +1166,137 @@ size. It will not let programs pass junk values as 'key' and 'value' to
bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
can safely access the pointers in all cases.
+Understanding eBPF verifier messages
+------------------------------------
+
+The following are few examples of invalid eBPF programs and verifier error
+messages as seen in the log:
+
+Program with unreachable instructions:
+static struct sock_filter_int prog[] = {
+ BPF_EXIT_INSN(),
+ BPF_EXIT_INSN(),
+};
+Error:
+ unreachable insn 1
+
+Program that reads uninitialized register:
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r0 = r2
+ R2 !read_ok
+
+Program that doesn't initialize R0 before exiting:
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_1),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r2 = r1
+ 1: (95) exit
+ R0 !read_ok
+
+Program that accesses stack out of bounds:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 +8) = 0
+ invalid stack off=8 size=8
+
+Program that doesn't initialize stack before passing its address into function:
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r2 = r10
+ 1: (07) r2 += -8
+ 2: (b7) r1 = 1
+ 3: (85) call 1
+ invalid indirect read from stack off -8+0 size 8
+
+Program that uses invalid map_id=2 while calling to map_lookup_elem() function:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 2),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 2
+ 4: (85) call 1
+ invalid access to map_id=2
+
+Program that doesn't check return value of map_lookup_elem() before accessing
+map element:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (7a) *(u64 *)(r0 +0) = 0
+ R0 invalid mem access 'map_value_or_null'
+
+Program that correctly checks map_lookup_elem() returned value for NULL, but
+accesses the memory with incorrect alignment:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+1
+ R0=map_value1 R10=fp
+ 6: (7a) *(u64 *)(r0 +4) = 0
+ misaligned access off 4 size 8
+
+Program that correctly checks map_lookup_elem() returned value for NULL and
+accesses memory with correct alignment in one side of 'if' branch, but fails
+to do so in the other side of 'if' branch:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+2
+ R0=map_value1 R10=fp
+ 6: (7a) *(u64 *)(r0 +0) = 0
+ 7: (95) exit
+
+ from 5 to 8: R0=imm0 R10=fp
+ 8: (7a) *(u64 *)(r0 +0) = 1
+ R0 invalid mem access 'imm'
+
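+For contrast, a variant of the previous program that should satisfy the checks
+described above (a sketch based on these rules, not an excerpt from the sample
+code in this series; it assumes map_id 1 refers to an existing map whose
+value_size is at least 8, as in the examples above). The NULL check guards the
+store, the store is 8-byte aligned, and R0 is readable on both paths to exit:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+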
Testing
-------
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7bfcad87018e..67fd49eac904 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -47,17 +47,63 @@ struct bpf_map_type_list {
void bpf_register_map_type(struct bpf_map_type_list *tl);
struct bpf_map *bpf_map_get(u32 map_id);
+/* types of values:
+ * - stored in an eBPF register
+ * - passed into helper functions as an argument
+ * - returned from helper functions
+ */
+enum bpf_reg_type {
+ INVALID_PTR, /* reg doesn't contain a valid pointer */
+ PTR_TO_CTX, /* reg points to bpf_context */
+ PTR_TO_MAP, /* reg points to map element value */
+ PTR_TO_MAP_CONDITIONAL, /* points to map element value or NULL */
+ PTR_TO_STACK, /* reg == frame_pointer */
+ PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
+ PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
+ PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
+ RET_INTEGER, /* function returns integer */
+ RET_VOID, /* function returns void */
+ CONST_ARG, /* function expects integer constant argument */
+ CONST_ARG_MAP_ID, /* int const argument that is used as map_id */
+ /* int const argument indicating the number of bytes accessed from stack;
+ * the previous function argument must be ptr_to_stack_imm
+ */
+ CONST_ARG_STACK_IMM_SIZE,
+};
+
/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
* to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
* instructions after verifying
*/
struct bpf_func_proto {
s32 func_off;
+ enum bpf_reg_type ret_type;
+ enum bpf_reg_type arg1_type;
+ enum bpf_reg_type arg2_type;
+ enum bpf_reg_type arg3_type;
+ enum bpf_reg_type arg4_type;
+ enum bpf_reg_type arg5_type;
+};
+
+/* bpf_context is an intentionally undefined structure. A pointer to bpf_context
+ * is the first argument to eBPF programs.
+ * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
+ */
+struct bpf_context;
+
+enum bpf_access_type {
+ BPF_READ = 1,
+ BPF_WRITE = 2
};
struct bpf_verifier_ops {
/* return eBPF function prototype for verification */
const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+
+ /* return true if 'size' wide access at offset 'off' within bpf_context
+ * with 'type' (read or write) is allowed
+ */
+ bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
};
struct bpf_prog_type_list {
@@ -78,5 +124,7 @@ struct bpf_prog_info {
void free_bpf_prog_info(struct bpf_prog_info *info);
struct sk_filter *bpf_prog_get(u32 prog_id);
+/* verify correctness of eBPF program */
+int bpf_check(struct sk_filter *fp);
#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ed067e245099..597a35cc101d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -381,6 +381,7 @@ enum bpf_prog_attributes {
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
+ BPF_PROG_TYPE_SOCKET_FILTER,
};
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 558e12712ebc..95a9035e0f29 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o hashtab.o
+obj-y := core.o syscall.o hashtab.o verifier.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 836809b1bc4e..48d8f43da151 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -554,7 +554,7 @@ static int bpf_prog_load(int prog_id, enum bpf_prog_type type,
mutex_lock(&bpf_map_lock);
/* run eBPF verifier */
- /* err = bpf_check(prog); */
+ err = bpf_check(prog);
if (err == 0 && prog->info->used_maps) {
/* program passed verifier and it's using some maps,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
new file mode 100644
index 000000000000..470fce48b3b0
--- /dev/null
+++ b/kernel/bpf/verifier.c
@@ -0,0 +1,1431 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/capability.h>
+
+/* bpf_check() is a static code analyzer that walks the BPF program
+ * instruction by instruction and updates register/stack state.
+ * All paths of conditional branches are analyzed until 'ret' insn.
+ *
+ * In the first pass a depth-first-search verifies that the BPF program is a DAG.
+ * It rejects the following programs:
+ * - larger than BPF_MAXINSNS insns
+ * - a loop is present (detected via back-edge)
+ * - unreachable insns exist (shouldn't be a forest. program = one function)
+ * - the last insn is not a 'ret' insn
+ * - out of bounds or malformed jumps
+ * The second pass descends through all possible paths from the 1st insn.
+ * Conditional branch target insns keep a linked list of verifier states.
+ * If the state was already visited, this path can be pruned.
+ * If it wasn't a DAG, such state pruning would be incorrect, since it would
+ * skip cycles. Since it's analyzing all paths through the program,
+ * the length of the analysis is limited to 32k insns, which may be hit even
+ * when insn_cnt < 4K if there are too many branches that change stack/regs.
+ * The number of 'branches to be analyzed' is limited to 1k.
+ *
+ * All registers are 64-bit (even on 32-bit arch)
+ * R0 - return register
+ * R1-R5 argument passing registers
+ * R6-R9 callee saved registers
+ * R10 - frame pointer read-only
+ *
+ * At the start of BPF program the register R1 contains a pointer to bpf_context
+ * and has type PTR_TO_CTX.
+ *
+ * R10 has type PTR_TO_STACK. The sequence 'mov Rd, R10; add Rd, imm' changes
+ * Rd state to PTR_TO_STACK_IMM and immediate constant is saved for further
+ * stack bounds checking
+ *
+ * registers used to pass pointers to function calls are verified against
+ * function prototypes
+ *
+ * Example: before the call to bpf_map_lookup_elem(),
+ * R1 must contain an integer constant and R2 must be PTR_TO_STACK_IMM_MAP_KEY.
+ * The integer constant in R1 is a map_id. The verifier checks that the map_id
+ * is valid and fetches the corresponding map->key_size to check that
+ * [R2, R2 + map->key_size) is within stack limits and that all of that stack
+ * memory was initialized earlier by the BPF program.
+ * After the bpf_map_lookup_elem() call insn, R0 is set to PTR_TO_MAP_CONDITIONAL
+ * and R1-R5 are cleared and no longer readable (but still writeable).
+ *
+ * bpf_map_lookup_elem() returns either a pointer to the map value or NULL,
+ * which is type PTR_TO_MAP_CONDITIONAL. Once it passes through a != 0 insn,
+ * the register holding that pointer in the true branch changes state to
+ * PTR_TO_MAP and the same register changes state to INVALID_PTR in the false
+ * branch. See check_cond_jmp_op()
+ *
+ * load/store alignment is checked
+ * Ex: BPF_STX|BPF_W [Rd + 3] = Rs is rejected, because it's misaligned
+ *
+ * load/store to stack bounds checked and register spill is tracked
+ * Ex: BPF_STX|BPF_B [R10 + 0] = Rs is rejected, because it's out of bounds
+ *
+ * load/store to map bounds checked and map_id provides map size
+ * Ex: BPF_STX|BPF_H [Rd + 8] = Rs is ok, if Rd is PTR_TO_MAP and
+ * 8 + sizeof(u16) <= map_info->value_size
+ *
+ * load/store to bpf_context checked against known fields
+ */
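+/* convenience macro: run a checking function and propagate its error code to
+ * the caller of the enclosing function, so that the checks below read linearly
+ */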
+#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
+
+struct reg_state {
+ enum bpf_reg_type ptr;
+ int imm;
+ bool read_ok;
+};
+
+enum bpf_stack_slot_type {
+ STACK_INVALID, /* nothing was stored in this stack slot */
+ STACK_SPILL, /* 1st byte of register spilled into stack */
+ STACK_SPILL_PART, /* other 7 bytes of register spill */
+ STACK_MISC /* BPF program wrote some data into this slot */
+};
+
+struct bpf_stack_slot {
+ enum bpf_stack_slot_type type;
+ enum bpf_reg_type ptr;
+ int imm;
+};
+
+/* state of the program:
+ * type of all registers and stack info
+ */
+struct verifier_state {
+ struct reg_state regs[MAX_BPF_REG];
+ struct bpf_stack_slot stack[MAX_BPF_STACK];
+};
+
+/* linked list of verifier states used to prune search */
+struct verifier_state_list {
+ struct verifier_state state;
+ struct verifier_state_list *next;
+};
+
+/* verifier_state + insn_idx are pushed to stack when branch is encountered */
+struct verifier_stack_elem {
+ /* verifier state is 'st'
+ * before processing instruction 'insn_idx'
+ * and after processing instruction 'prev_insn_idx'
+ */
+ struct verifier_state st;
+ int insn_idx;
+ int prev_insn_idx;
+ struct verifier_stack_elem *next;
+};
+
+#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
+
+/* single container for all structs
+ * one verifier_env per bpf_check() call
+ */
+struct verifier_env {
+ struct sk_filter *prog; /* eBPF program being verified */
+ struct verifier_stack_elem *head; /* stack of verifier states to be processed */
+ int stack_size; /* number of states to be processed */
+ struct verifier_state cur_state; /* current verifier state */
+ struct verifier_state_list **branch_landing; /* search pruning optimization */
+ u32 used_maps[MAX_USED_MAPS]; /* array of map_id's used by eBPF program */
+ u32 used_map_cnt; /* number of used maps */
+};
+
+/* verbose verifier prints what it's seeing
+ * bpf_check() is called under map lock, so no race to access this global var
+ */
+static bool verbose_on;
+
+/* when the verifier rejects an eBPF program, it does a second pass with verbose on
+ * to dump the verification trace to the log, so the user can figure out what's
+ * wrong with the program
+ */
+static int verbose(const char *fmt, ...)
+{
+ va_list args;
+ int ret;
+
+ if (!verbose_on)
+ return 0;
+
+ va_start(args, fmt);
+ ret = vprintk(fmt, args);
+ va_end(args);
+ return ret;
+}
+
+/* string representation of 'enum bpf_reg_type' */
+static const char * const reg_type_str[] = {
+ [INVALID_PTR] = "inv",
+ [PTR_TO_CTX] = "ctx",
+ [PTR_TO_MAP] = "map_value",
+ [PTR_TO_MAP_CONDITIONAL] = "map_value_or_null",
+ [PTR_TO_STACK] = "fp",
+ [PTR_TO_STACK_IMM] = "fp",
+ [PTR_TO_STACK_IMM_MAP_KEY] = "fp_key",
+ [PTR_TO_STACK_IMM_MAP_VALUE] = "fp_value",
+ [RET_INTEGER] = "ret_int",
+ [RET_VOID] = "ret_void",
+ [CONST_ARG] = "imm",
+ [CONST_ARG_MAP_ID] = "map_id",
+ [CONST_ARG_STACK_IMM_SIZE] = "imm_size",
+};
+
+static void pr_cont_verifier_state(struct verifier_env *env)
+{
+ enum bpf_reg_type ptr;
+ int i;
+
+ for (i = 0; i < MAX_BPF_REG; i++) {
+ if (!env->cur_state.regs[i].read_ok)
+ continue;
+ ptr = env->cur_state.regs[i].ptr;
+ pr_cont(" R%d=%s", i, reg_type_str[ptr]);
+ if (ptr == CONST_ARG ||
+ ptr == PTR_TO_STACK_IMM ||
+ ptr == PTR_TO_MAP_CONDITIONAL ||
+ ptr == PTR_TO_MAP)
+ pr_cont("%d", env->cur_state.regs[i].imm);
+ }
+ for (i = 0; i < MAX_BPF_STACK; i++) {
+ if (env->cur_state.stack[i].type == STACK_SPILL)
+ pr_cont(" fp%d=%s", -MAX_BPF_STACK + i,
+ reg_type_str[env->cur_state.stack[i].ptr]);
+ }
+ pr_cont("\n");
+}
+
+static const char *const bpf_class_string[] = {
+ "ld", "ldx", "st", "stx", "alu", "jmp", "BUG", "alu64"
+};
+
+static const char *const bpf_alu_string[] = {
+ "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+ "%=", "^=", "=", "s>>=", "endian", "BUG", "BUG"
+};
+
+static const char *const bpf_ldst_string[] = {
+ "u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+ "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call", "exit"
+};
+
+static void pr_cont_bpf_insn(struct sock_filter_int *insn)
+{
+ u8 class = BPF_CLASS(insn->code);
+
+ if (class == BPF_ALU || class == BPF_ALU64) {
+ if (BPF_SRC(insn->code) == BPF_X)
+ pr_cont("(%02x) %sr%d %s %sr%d\n",
+ insn->code, class == BPF_ALU ? "(u32) " : "",
+ insn->dst_reg,
+ bpf_alu_string[BPF_OP(insn->code) >> 4],
+ class == BPF_ALU ? "(u32) " : "",
+ insn->src_reg);
+ else
+ pr_cont("(%02x) %sr%d %s %s%d\n",
+ insn->code, class == BPF_ALU ? "(u32) " : "",
+ insn->dst_reg,
+ bpf_alu_string[BPF_OP(insn->code) >> 4],
+ class == BPF_ALU ? "(u32) " : "",
+ insn->imm);
+ } else if (class == BPF_STX) {
+ if (BPF_MODE(insn->code) == BPF_MEM)
+ pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->dst_reg,
+ insn->off, insn->src_reg);
+ else if (BPF_MODE(insn->code) == BPF_XADD)
+ pr_cont("(%02x) lock *(%s *)(r%d %+d) += r%d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->dst_reg, insn->off,
+ insn->src_reg);
+ else
+ pr_cont("BUG_%02x\n", insn->code);
+ } else if (class == BPF_ST) {
+ if (BPF_MODE(insn->code) != BPF_MEM) {
+ pr_cont("BUG_st_%02x\n", insn->code);
+ return;
+ }
+ pr_cont("(%02x) *(%s *)(r%d %+d) = %d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->dst_reg,
+ insn->off, insn->imm);
+ } else if (class == BPF_LDX) {
+ if (BPF_MODE(insn->code) != BPF_MEM) {
+ pr_cont("BUG_ldx_%02x\n", insn->code);
+ return;
+ }
+ pr_cont("(%02x) r%d = *(%s *)(r%d %+d)\n",
+ insn->code, insn->dst_reg,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->src_reg, insn->off);
+ } else if (class == BPF_JMP) {
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode == BPF_CALL) {
+ pr_cont("(%02x) call %d\n", insn->code, insn->imm);
+ } else if (insn->code == (BPF_JMP | BPF_JA)) {
+ pr_cont("(%02x) goto pc%+d\n",
+ insn->code, insn->off);
+ } else if (insn->code == (BPF_JMP | BPF_EXIT)) {
+ pr_cont("(%02x) exit\n", insn->code);
+ } else if (BPF_SRC(insn->code) == BPF_X) {
+ pr_cont("(%02x) if r%d %s r%d goto pc%+d\n",
+ insn->code, insn->dst_reg,
+ bpf_jmp_string[BPF_OP(insn->code) >> 4],
+ insn->src_reg, insn->off);
+ } else {
+ pr_cont("(%02x) if r%d %s 0x%x goto pc%+d\n",
+ insn->code, insn->dst_reg,
+ bpf_jmp_string[BPF_OP(insn->code) >> 4],
+ insn->imm, insn->off);
+ }
+ } else {
+ pr_cont("(%02x) %s\n", insn->code, bpf_class_string[class]);
+ }
+}
+
+static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
+{
+ struct verifier_stack_elem *elem;
+ int insn_idx;
+
+ if (env->head == NULL)
+ return -1;
+
+ memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
+ insn_idx = env->head->insn_idx;
+ if (prev_insn_idx)
+ *prev_insn_idx = env->head->prev_insn_idx;
+ elem = env->head->next;
+ kfree(env->head);
+ env->head = elem;
+ env->stack_size--;
+ return insn_idx;
+}
+
+static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx,
+ int prev_insn_idx)
+{
+ struct verifier_stack_elem *elem;
+
+ elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
+ if (!elem)
+ goto err;
+
+ memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
+ elem->insn_idx = insn_idx;
+ elem->prev_insn_idx = prev_insn_idx;
+ elem->next = env->head;
+ env->head = elem;
+ env->stack_size++;
+ if (env->stack_size > 1024) {
+ verbose("BPF program is too complex\n");
+ goto err;
+ }
+ return &elem->st;
+err:
+ /* pop all elements and return */
+ while (pop_stack(env, NULL) >= 0);
+ return NULL;
+}
+
+#define CALLER_SAVED_REGS 6
+static const int caller_saved[CALLER_SAVED_REGS] = {
+ BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5
+};
+
+static void init_reg_state(struct reg_state *regs)
+{
+ struct reg_state *reg;
+ int i;
+
+ for (i = 0; i < MAX_BPF_REG; i++) {
+ regs[i].ptr = INVALID_PTR;
+ regs[i].read_ok = false;
+ regs[i].imm = 0xbadbad;
+ }
+ reg = regs + BPF_REG_FP;
+ reg->ptr = PTR_TO_STACK;
+ reg->read_ok = true;
+
+ reg = regs + BPF_REG_1; /* 1st arg to a function */
+ reg->ptr = PTR_TO_CTX;
+ reg->read_ok = true;
+}
+
+static void mark_reg_no_ptr(struct reg_state *regs, int regno)
+{
+ regs[regno].ptr = INVALID_PTR;
+ regs[regno].imm = 0xbadbad;
+ regs[regno].read_ok = true;
+}
+
+static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
+{
+ if (is_src) {
+ if (!regs[regno].read_ok) {
+ verbose("R%d !read_ok\n", regno);
+ return -EACCES;
+ }
+ } else {
+ if (regno == BPF_REG_FP)
+ /* frame pointer is read only */
+ return -EACCES;
+ mark_reg_no_ptr(regs, regno);
+ }
+ return 0;
+}
+
+static int bpf_size_to_bytes(int bpf_size)
+{
+ if (bpf_size == BPF_W)
+ return 4;
+ else if (bpf_size == BPF_H)
+ return 2;
+ else if (bpf_size == BPF_B)
+ return 1;
+ else if (bpf_size == BPF_DW)
+ return 8;
+ else
+ return -EACCES;
+}
+
+static int check_stack_write(struct verifier_state *state, int off, int size,
+ int value_regno)
+{
+ struct bpf_stack_slot *slot;
+ int i;
+
+ if (value_regno >= 0 &&
+ (state->regs[value_regno].ptr == PTR_TO_MAP ||
+ state->regs[value_regno].ptr == PTR_TO_STACK_IMM ||
+ state->regs[value_regno].ptr == PTR_TO_CTX)) {
+
+ /* register containing pointer is being spilled into stack */
+ if (size != 8) {
+ verbose("invalid size of register spill\n");
+ return -EACCES;
+ }
+
+ slot = &state->stack[MAX_BPF_STACK + off];
+ slot->type = STACK_SPILL;
+ /* save register state */
+ slot->ptr = state->regs[value_regno].ptr;
+ slot->imm = state->regs[value_regno].imm;
+ for (i = 1; i < 8; i++) {
+ slot = &state->stack[MAX_BPF_STACK + off + i];
+ slot->type = STACK_SPILL_PART;
+ slot->ptr = 0;
+ slot->imm = 0;
+ }
+ } else {
+
+ /* regular write of data into stack */
+ for (i = 0; i < size; i++) {
+ slot = &state->stack[MAX_BPF_STACK + off + i];
+ slot->type = STACK_MISC;
+ slot->ptr = 0;
+ slot->imm = 0;
+ }
+ }
+ return 0;
+}
+
+static int check_stack_read(struct verifier_state *state, int off, int size,
+ int value_regno)
+{
+ int i;
+ struct bpf_stack_slot *slot;
+
+ slot = &state->stack[MAX_BPF_STACK + off];
+
+ if (slot->type == STACK_SPILL) {
+ if (size != 8) {
+ verbose("invalid size of register spill\n");
+ return -EACCES;
+ }
+ for (i = 1; i < 8; i++) {
+ if (state->stack[MAX_BPF_STACK + off + i].type !=
+ STACK_SPILL_PART) {
+ verbose("corrupted spill memory\n");
+ return -EACCES;
+ }
+ }
+
+ /* restore register state from stack */
+ state->regs[value_regno].ptr = slot->ptr;
+ state->regs[value_regno].imm = slot->imm;
+ state->regs[value_regno].read_ok = true;
+ return 0;
+ } else {
+ for (i = 0; i < size; i++) {
+ if (state->stack[MAX_BPF_STACK + off + i].type !=
+ STACK_MISC) {
+ verbose("invalid read from stack off %d+%d size %d\n",
+ off, i, size);
+ return -EACCES;
+ }
+ }
+ /* have read misc data from the stack */
+ mark_reg_no_ptr(state->regs, value_regno);
+ return 0;
+ }
+}
+
+static int remember_map_id(struct verifier_env *env, u32 map_id)
+{
+ int i;
+
+ /* check whether we recorded this map_id already */
+ for (i = 0; i < env->used_map_cnt; i++)
+ if (env->used_maps[i] == map_id)
+ return 0;
+
+ if (env->used_map_cnt >= MAX_USED_MAPS)
+ return -E2BIG;
+
+ /* remember this map_id */
+ env->used_maps[env->used_map_cnt++] = map_id;
+ return 0;
+}
+
+static int get_map_info(struct verifier_env *env, u32 map_id,
+ struct bpf_map **map)
+{
+ /* if the BPF program contains bpf_table_lookup(map_id, key),
+ * an incorrect map_id will be caught here
+ */
+ *map = bpf_map_get(map_id);
+ if (!*map) {
+ verbose("invalid access to map_id=%d\n", map_id);
+ return -EACCES;
+ }
+
+ _(remember_map_id(env, map_id));
+
+ return 0;
+}
+
+/* check read/write into map element returned by bpf_table_lookup() */
+static int check_table_access(struct verifier_env *env, int regno, int off,
+ int size)
+{
+ struct bpf_map *map;
+ int map_id = env->cur_state.regs[regno].imm;
+
+ _(get_map_info(env, map_id, &map));
+
+ if (off < 0 || off + size > map->value_size) {
+ verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
+ map_id, map->value_size, off, size);
+ return -EACCES;
+ }
+ return 0;
+}
+
+/* check access to 'struct bpf_context' fields */
+static int check_ctx_access(struct verifier_env *env, int off, int size,
+ enum bpf_access_type t)
+{
+ if (env->prog->info->ops->is_valid_access &&
+ env->prog->info->ops->is_valid_access(off, size, t))
+ return 0;
+
+ verbose("invalid bpf_context access off=%d size=%d\n", off, size);
+ return -EACCES;
+}
+
+static int check_mem_access(struct verifier_env *env, int regno, int off,
+ int bpf_size, enum bpf_access_type t,
+ int value_regno)
+{
+ struct verifier_state *state = &env->cur_state;
+ int size;
+
+ _(size = bpf_size_to_bytes(bpf_size));
+
+ if (off % size != 0) {
+ verbose("misaligned access off %d size %d\n", off, size);
+ return -EACCES;
+ }
+
+ if (state->regs[regno].ptr == PTR_TO_MAP) {
+ _(check_table_access(env, regno, off, size));
+ if (t == BPF_READ)
+ mark_reg_no_ptr(state->regs, value_regno);
+ } else if (state->regs[regno].ptr == PTR_TO_CTX) {
+ _(check_ctx_access(env, off, size, t));
+ if (t == BPF_READ)
+ mark_reg_no_ptr(state->regs, value_regno);
+ } else if (state->regs[regno].ptr == PTR_TO_STACK) {
+ if (off >= 0 || off < -MAX_BPF_STACK) {
+ verbose("invalid stack off=%d size=%d\n", off, size);
+ return -EACCES;
+ }
+ if (t == BPF_WRITE)
+ _(check_stack_write(state, off, size, value_regno));
+ else
+ _(check_stack_read(state, off, size, value_regno));
+ } else {
+ verbose("R%d invalid mem access '%s'\n",
+ regno, reg_type_str[state->regs[regno].ptr]);
+ return -EACCES;
+ }
+ return 0;
+}
+
+/* when register 'regno' is passed into function that will read 'access_size'
+ * bytes from that pointer, make sure that it's within stack boundary
+ * and all elements of stack are initialized
+ */
+static int check_stack_boundary(struct verifier_env *env,
+ int regno, int access_size)
+{
+ struct verifier_state *state = &env->cur_state;
+ struct reg_state *regs = state->regs;
+ int off, i;
+
+ if (regs[regno].ptr != PTR_TO_STACK_IMM)
+ return -EACCES;
+
+ off = regs[regno].imm;
+ if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
+ access_size <= 0) {
+ verbose("invalid stack ptr R%d off=%d access_size=%d\n",
+ regno, off, access_size);
+ return -EACCES;
+ }
+
+ for (i = 0; i < access_size; i++) {
+ if (state->stack[MAX_BPF_STACK + off + i].type != STACK_MISC) {
+ verbose("invalid indirect read from stack off %d+%d size %d\n",
+ off, i, access_size);
+ return -EACCES;
+ }
+ }
+ return 0;
+}
+
+static int check_func_arg(struct verifier_env *env, int regno,
+ enum bpf_reg_type arg_type, int *map_id,
+ struct bpf_map **mapp)
+{
+ struct reg_state *reg = env->cur_state.regs + regno;
+ enum bpf_reg_type expected_type;
+
+ if (arg_type == INVALID_PTR)
+ return 0;
+
+ if (!reg->read_ok) {
+ verbose("R%d !read_ok\n", regno);
+ return -EACCES;
+ }
+
+ if (arg_type == PTR_TO_STACK_IMM_MAP_KEY ||
+ arg_type == PTR_TO_STACK_IMM_MAP_VALUE)
+ expected_type = PTR_TO_STACK_IMM;
+ else if (arg_type == CONST_ARG_MAP_ID ||
+ arg_type == CONST_ARG_STACK_IMM_SIZE)
+ expected_type = CONST_ARG;
+ else
+ expected_type = arg_type;
+
+ if (reg->ptr != expected_type) {
+ verbose("R%d type=%s expected=%s\n", regno,
+ reg_type_str[reg->ptr], reg_type_str[expected_type]);
+ return -EACCES;
+ }
+
+ if (arg_type == CONST_ARG_MAP_ID) {
+ /* bpf_map_xxx(map_id) call: check that map_id is valid */
+ *map_id = reg->imm;
+ _(get_map_info(env, reg->imm, mapp));
+ } else if (arg_type == PTR_TO_STACK_IMM_MAP_KEY) {
+ /*
+ * bpf_map_xxx(..., map_id, ..., key) call:
+ * check that [key, key + map->key_size) are within
+ * stack limits and initialized
+ */
+ if (!*mapp) {
+ /*
+ * in function declaration map_id must come before
+ * table_key or table_elem, so that it's verified
+ * and known before we have to check table_key here
+ */
+ verbose("invalid map_id to access map->key\n");
+ return -EACCES;
+ }
+ _(check_stack_boundary(env, regno, (*mapp)->key_size));
+ } else if (arg_type == PTR_TO_STACK_IMM_MAP_VALUE) {
+ /*
+ * bpf_map_xxx(..., map_id, ..., value) call:
+ * check [value, value + map->value_size) validity
+ */
+ if (!*mapp) {
+ verbose("invalid map_id to access map->elem\n");
+ return -EACCES;
+ }
+ _(check_stack_boundary(env, regno, (*mapp)->value_size));
+ } else if (arg_type == CONST_ARG_STACK_IMM_SIZE) {
+ /*
+ * bpf_xxx(..., buf, len) call will access 'len' bytes
+ * from stack pointer 'buf'. Check it
+ * note: regno == len, regno - 1 == buf
+ */
+ _(check_stack_boundary(env, regno - 1, reg->imm));
+ }
+
+ return 0;
+}
+
+static int check_call(struct verifier_env *env, int func_id)
+{
+ struct verifier_state *state = &env->cur_state;
+ const struct bpf_func_proto *fn = NULL;
+ struct reg_state *regs = state->regs;
+ struct bpf_map *map = NULL;
+ struct reg_state *reg;
+ int map_id = -1;
+ int i;
+
+ /* find function prototype */
+ if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
+ verbose("invalid func %d\n", func_id);
+ return -EINVAL;
+ }
+
+ if (env->prog->info->ops->get_func_proto)
+ fn = env->prog->info->ops->get_func_proto(func_id);
+
+ if (!fn || (fn->ret_type != RET_INTEGER &&
+ fn->ret_type != PTR_TO_MAP_CONDITIONAL &&
+ fn->ret_type != RET_VOID)) {
+ verbose("unknown func %d\n", func_id);
+ return -EINVAL;
+ }
+
+ /* check args */
+ _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
+ _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
+ _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
+ _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
+
+ /* reset caller saved regs */
+ for (i = 0; i < CALLER_SAVED_REGS; i++) {
+ reg = regs + caller_saved[i];
+ reg->read_ok = false;
+ reg->ptr = INVALID_PTR;
+ reg->imm = 0xbadbad;
+ }
+
+ /* update return register */
+ reg = regs + BPF_REG_0;
+ if (fn->ret_type == RET_INTEGER) {
+ reg->read_ok = true;
+ reg->ptr = INVALID_PTR;
+ } else if (fn->ret_type != RET_VOID) {
+ reg->read_ok = true;
+ reg->ptr = fn->ret_type;
+ if (fn->ret_type == PTR_TO_MAP_CONDITIONAL)
+ /*
+ * remember map_id, so that check_table_access()
+ * can check 'value_size' boundary of memory access
+ * to map element returned from bpf_table_lookup()
+ */
+ reg->imm = map_id;
+ }
+ return 0;
+}
+
+/* check validity of 32-bit and 64-bit arithmetic operations */
+static int check_alu_op(struct reg_state *regs, struct sock_filter_int *insn)
+{
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode == BPF_END || opcode == BPF_NEG) {
+ if (BPF_SRC(insn->code) != BPF_X)
+ return -EINVAL;
+ /* check src operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+
+ /* check dest operand */
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ } else if (opcode == BPF_MOV) {
+
+ if (BPF_SRC(insn->code) == BPF_X)
+ /* check src operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ /* check dest operand */
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ if (BPF_SRC(insn->code) == BPF_X) {
+ if (BPF_CLASS(insn->code) == BPF_ALU64) {
+ /* case: R1 = R2
+ * copy register state to dest reg
+ */
+ regs[insn->dst_reg].ptr = regs[insn->src_reg].ptr;
+ regs[insn->dst_reg].imm = regs[insn->src_reg].imm;
+ } else {
+ regs[insn->dst_reg].ptr = INVALID_PTR;
+ regs[insn->dst_reg].imm = 0;
+ }
+ } else {
+ /* case: R = imm
+ * remember the value we stored into this reg
+ */
+ regs[insn->dst_reg].ptr = CONST_ARG;
+ regs[insn->dst_reg].imm = insn->imm;
+ }
+
+ } else { /* all other ALU ops: and, sub, xor, add, ... */
+
+ int stack_relative = 0;
+
+ if (BPF_SRC(insn->code) == BPF_X)
+ /* check src1 operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ /* check src2 operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+
+ if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
+ regs[insn->dst_reg].ptr == PTR_TO_STACK &&
+ BPF_SRC(insn->code) == BPF_K)
+ stack_relative = 1;
+
+ /* check dest operand */
+ _(check_reg_arg(regs, insn->dst_reg, 0));
+
+ if (stack_relative) {
+ regs[insn->dst_reg].ptr = PTR_TO_STACK_IMM;
+ regs[insn->dst_reg].imm = insn->imm;
+ }
+ }
+
+ return 0;
+}
+
+static int check_cond_jmp_op(struct verifier_env *env,
+ struct sock_filter_int *insn, int *insn_idx)
+{
+ struct reg_state *regs = env->cur_state.regs;
+ struct verifier_state *other_branch;
+ u8 opcode = BPF_OP(insn->code);
+
+ if (BPF_SRC(insn->code) == BPF_X)
+ /* check src1 operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ /* check src2 operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+
+ /* detect if R == imm (or R != imm) where R already holds the constant imm */
+ if (BPF_SRC(insn->code) == BPF_K &&
+ (opcode == BPF_JEQ || opcode == BPF_JNE) &&
+ regs[insn->dst_reg].ptr == CONST_ARG &&
+ regs[insn->dst_reg].imm == insn->imm) {
+ if (opcode == BPF_JEQ) {
+ /* if (imm == imm) goto pc+off;
+ * only follow the goto, ignore fall-through
+ */
+ *insn_idx += insn->off;
+ return 0;
+ } else {
+ /* if (imm != imm) goto pc+off;
+ * only follow fall-through branch, since
+ * that's where the program will go
+ */
+ return 0;
+ }
+ }
+
+ other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
+ if (!other_branch)
+ return -EFAULT;
+
+ /* detect if R == 0 where R is the value returned from table_lookup() */
+ if (BPF_SRC(insn->code) == BPF_K &&
+ insn->imm == 0 && (opcode == BPF_JEQ ||
+ opcode == BPF_JNE) &&
+ regs[insn->dst_reg].ptr == PTR_TO_MAP_CONDITIONAL) {
+ if (opcode == BPF_JEQ) {
+ /* next fallthrough insn can access memory via
+ * this register
+ */
+ regs[insn->dst_reg].ptr = PTR_TO_MAP;
+ /* branch target cannot access it, since reg == 0 */
+ other_branch->regs[insn->dst_reg].ptr = CONST_ARG;
+ other_branch->regs[insn->dst_reg].imm = 0;
+ } else {
+ other_branch->regs[insn->dst_reg].ptr = PTR_TO_MAP;
+ regs[insn->dst_reg].ptr = CONST_ARG;
+ regs[insn->dst_reg].imm = 0;
+ }
+ } else if (BPF_SRC(insn->code) == BPF_K &&
+ (opcode == BPF_JEQ || opcode == BPF_JNE)) {
+
+ if (opcode == BPF_JEQ) {
+ /* detect if (R == imm) goto
+ * and in the target state recognize that R = imm
+ */
+ other_branch->regs[insn->dst_reg].ptr = CONST_ARG;
+ other_branch->regs[insn->dst_reg].imm = insn->imm;
+ } else {
+ /* detect if (R != imm) goto
+ * and in the fall-through state recognize that R = imm
+ */
+ regs[insn->dst_reg].ptr = CONST_ARG;
+ regs[insn->dst_reg].imm = insn->imm;
+ }
+ }
+ if (verbose_on)
+ pr_cont_verifier_state(env);
+ return 0;
+}
+
+/* verify safety of LD_ABS|LD_IND instructions:
+ * - they can only appear in the programs where ctx == skb
+ * - since they are wrappers of function calls, they scratch R1-R5 registers,
+ * preserve R6-R9, and store return value into R0
+ *
+ * Implicit input:
+ * ctx == skb == R6 == CTX
+ *
+ * Explicit input:
+ * SRC == any register
+ * IMM == 32-bit immediate
+ *
+ * Output:
+ * R0 - 8/16/32-bit skb data converted to cpu endianness
+ */
+
+static int check_ld_abs(struct verifier_env *env, struct sock_filter_int *insn)
+{
+ struct reg_state *regs = env->cur_state.regs;
+ u8 mode = BPF_MODE(insn->code);
+ struct reg_state *reg;
+ int i;
+
+ if (mode != BPF_ABS && mode != BPF_IND)
+ return -EINVAL;
+
+ if (env->prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+ verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
+ return -EINVAL;
+ }
+
+ /* check whether implicit source operand (register R6) is readable */
+ _(check_reg_arg(regs, BPF_REG_6, 1));
+
+ if (regs[BPF_REG_6].ptr != PTR_TO_CTX) {
+ verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
+ return -EINVAL;
+ }
+
+ if (mode == BPF_IND)
+ /* check explicit source operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ /* reset caller saved regs to unreadable */
+ for (i = 0; i < CALLER_SAVED_REGS; i++) {
+ reg = regs + caller_saved[i];
+ reg->read_ok = false;
+ reg->ptr = INVALID_PTR;
+ reg->imm = 0xbadbad;
+ }
+
+ /* mark destination R0 register as readable, since it contains
+ * the value fetched from the packet
+ */
+ regs[BPF_REG_0].read_ok = true;
+ return 0;
+}
+
+/* non-recursive DFS pseudo code
+ * 1 procedure DFS-iterative(G,v):
+ * 2 label v as discovered
+ * 3 let S be a stack
+ * 4 S.push(v)
+ * 5 while S is not empty
+ * 6 t <- S.pop()
+ * 7 if t is what we're looking for:
+ * 8 return t
+ * 9 for all edges e in G.adjacentEdges(t) do
+ * 10 if edge e is already labelled
+ * 11 continue with the next edge
+ * 12 w <- G.adjacentVertex(t,e)
+ * 13 if vertex w is not discovered and not explored
+ * 14 label e as tree-edge
+ * 15 label w as discovered
+ * 16 S.push(w)
+ * 17 continue at 5
+ * 18 else if vertex w is discovered
+ * 19 label e as back-edge
+ * 20 else
+ * 21 // vertex w is explored
+ * 22 label e as forward- or cross-edge
+ * 23 label t as explored
+ * 24 S.pop()
+ *
+ * convention:
+ * 1 - discovered
+ * 2 - discovered and 1st branch labelled
+ * 3 - discovered and 1st and 2nd branch labelled
+ * 4 - explored
+ */
+
+#define STATE_END ((struct verifier_state_list *)-1)
+
+#define PUSH_INT(I) \
+ do { \
+ if (cur_stack >= insn_cnt) { \
+ ret = -E2BIG; \
+ goto free_st; \
+ } \
+ stack[cur_stack++] = I; \
+ } while (0)
+
+#define PEEK_INT() \
+ ({ \
+ int _ret; \
+ if (cur_stack == 0) \
+ _ret = -1; \
+ else \
+ _ret = stack[cur_stack - 1]; \
+ _ret; \
+ })
+
+#define POP_INT() \
+ ({ \
+ int _ret; \
+ if (cur_stack == 0) \
+ _ret = -1; \
+ else \
+ _ret = stack[--cur_stack]; \
+ _ret; \
+ })
+
+#define PUSH_INSN(T, W, E) \
+ do { \
+ int w = W; \
+ if (E == 1 && st[T] >= 2) \
+ break; \
+ if (E == 2 && st[T] >= 3) \
+ break; \
+ if (w >= insn_cnt) { \
+ ret = -EACCES; \
+ goto free_st; \
+ } \
+ if (E == 2) \
+ /* mark branch target for state pruning */ \
+ env->branch_landing[w] = STATE_END; \
+ if (st[w] == 0) { \
+ /* tree-edge */ \
+ st[T] = 1 + E; \
+ st[w] = 1; /* discovered */ \
+ PUSH_INT(w); \
+ goto peek_stack; \
+ } else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
+ verbose("back-edge from insn %d to %d\n", t, w); \
+ ret = -EINVAL; \
+ goto free_st; \
+ } else if (st[w] == 4) { \
+ /* forward- or cross-edge */ \
+ st[T] = 1 + E; \
+ } else { \
+ verbose("insn state internal bug\n"); \
+ ret = -EFAULT; \
+ goto free_st; \
+ } \
+ } while (0)
+
+/* non-recursive depth-first-search to detect loops in BPF program
+ * loop == back-edge in directed graph
+ */
+static int check_cfg(struct verifier_env *env)
+{
+ struct sock_filter_int *insns = env->prog->insnsi;
+ int insn_cnt = env->prog->len;
+ int cur_stack = 0;
+ int *stack;
+ int ret = 0;
+ int *st;
+ int i, t;
+
+ if (insns[insn_cnt - 1].code != (BPF_JMP | BPF_EXIT)) {
+ verbose("last insn is not a 'ret'\n");
+ return -EINVAL;
+ }
+
+ st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+ if (!st)
+ return -ENOMEM;
+
+ stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+ if (!stack) {
+ kfree(st);
+ return -ENOMEM;
+ }
+
+ st[0] = 1; /* mark 1st insn as discovered */
+ PUSH_INT(0);
+
+peek_stack:
+ while ((t = PEEK_INT()) != -1) {
+ if (insns[t].code == (BPF_JMP | BPF_EXIT))
+ goto mark_explored;
+
+ if (BPF_CLASS(insns[t].code) == BPF_JMP) {
+ u8 opcode = BPF_OP(insns[t].code);
+
+ if (opcode == BPF_CALL) {
+ PUSH_INSN(t, t + 1, 1);
+ } else if (opcode == BPF_JA) {
+ if (BPF_SRC(insns[t].code) != BPF_X) {
+ ret = -EINVAL;
+ goto free_st;
+ }
+ PUSH_INSN(t, t + insns[t].off + 1, 1);
+ } else {
+ PUSH_INSN(t, t + 1, 1);
+ PUSH_INSN(t, t + insns[t].off + 1, 2);
+ }
+ /* tell verifier to check for equivalent verifier states
+ * after every call and jump
+ */
+ env->branch_landing[t + 1] = STATE_END;
+ } else {
+ PUSH_INSN(t, t + 1, 1);
+ }
+
+mark_explored:
+ st[t] = 4; /* explored */
+ if (POP_INT() == -1) {
+ verbose("pop_int internal bug\n");
+ ret = -EFAULT;
+ goto free_st;
+ }
+ }
+
+
+ for (i = 0; i < insn_cnt; i++) {
+ if (st[i] != 4) {
+ verbose("unreachable insn %d\n", i);
+ ret = -EINVAL;
+ goto free_st;
+ }
+ }
+
+free_st:
+ kfree(st);
+ kfree(stack);
+ return ret;
+}
+
+/* compare two verifier states
+ *
+ * all states stored in state_list are known to be valid, since
+ * verifier reached 'bpf_exit' instruction through them
+ *
+ * this function is called when the verifier explores different branches of
+ * execution popped from the state stack. If it sees an old state that has
+ * a more strict register state and a more strict stack state, then this
+ * execution branch doesn't need to be explored further, since the verifier
+ * already concluded that the more strict state leads to a valid finish.
+ *
+ * Therefore two states are equivalent if register state is more conservative
+ * and explored stack state is more conservative than the current one.
+ * Example:
+ * explored current
+ * (slot1=INV slot2=MISC) == (slot1=MISC slot2=MISC)
+ * (slot1=MISC slot2=MISC) != (slot1=INV slot2=MISC)
+ *
+ * In other words, if the current stack state (the one being explored) has more
+ * valid slots than an old one that already passed validation, the verifier
+ * can stop exploring and conclude that the current state is valid too.
+ *
+ * Similarly with registers. If the explored state has a register marked as
+ * invalid whereas the same register in the current state holds a meaningful
+ * type, the current state will reach the 'bpf_exit' instruction safely.
+ */
+static bool states_equal(struct verifier_state *old, struct verifier_state *cur)
+{
+ int i;
+
+ for (i = 0; i < MAX_BPF_REG; i++) {
+ if (memcmp(&old->regs[i], &cur->regs[i],
+ sizeof(old->regs[0])) != 0) {
+ if (!old->regs[i].read_ok)
+ continue;
+ if (old->regs[i].ptr == INVALID_PTR)
+ continue;
+ return false;
+ }
+ }
+
+ for (i = 0; i < MAX_BPF_STACK; i++) {
+ if (memcmp(&old->stack[i], &cur->stack[i],
+ sizeof(old->stack[0])) != 0) {
+ if (old->stack[i].type == STACK_INVALID)
+ continue;
+ return false;
+ }
+ }
+ return true;
+}
+
+static int is_state_visited(struct verifier_env *env, int insn_idx)
+{
+ struct verifier_state_list *new_sl;
+ struct verifier_state_list *sl;
+
+ sl = env->branch_landing[insn_idx];
+ if (!sl)
+ /* no branch jump to this insn, ignore it */
+ return 0;
+
+ while (sl != STATE_END) {
+ if (states_equal(&sl->state, &env->cur_state))
+ /* reached equivalent register/stack state,
+ * prune the search
+ */
+ return 1;
+ sl = sl->next;
+ }
+ new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
+
+ if (!new_sl)
+ /* ignore ENOMEM, it doesn't affect correctness */
+ return 0;
+
+ /* add new state to the head of linked list */
+ memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
+ new_sl->next = env->branch_landing[insn_idx];
+ env->branch_landing[insn_idx] = new_sl;
+ return 0;
+}
+
+static int do_check(struct verifier_env *env)
+{
+ struct verifier_state *state = &env->cur_state;
+ struct sock_filter_int *insns = env->prog->insnsi;
+ struct reg_state *regs = state->regs;
+ int insn_cnt = env->prog->len;
+ int insn_idx, prev_insn_idx = 0;
+ int insn_processed = 0;
+ bool do_print_state = false;
+
+ init_reg_state(regs);
+ insn_idx = 0;
+ for (;;) {
+ struct sock_filter_int *insn;
+ u8 class;
+
+ if (insn_idx >= insn_cnt) {
+ verbose("invalid insn idx %d insn_cnt %d\n",
+ insn_idx, insn_cnt);
+ return -EFAULT;
+ }
+
+ insn = &insns[insn_idx];
+ class = BPF_CLASS(insn->code);
+
+ if (++insn_processed > 32768) {
+ verbose("BPF program is too large. Proccessed %d insn\n",
+ insn_processed);
+ return -E2BIG;
+ }
+
+ if (is_state_visited(env, insn_idx)) {
+ if (verbose_on) {
+ if (do_print_state)
+ pr_cont("\nfrom %d to %d: safe\n",
+ prev_insn_idx, insn_idx);
+ else
+ pr_cont("%d: safe\n", insn_idx);
+ }
+ goto process_bpf_exit;
+ }
+
+ if (verbose_on && do_print_state) {
+ pr_cont("\nfrom %d to %d:", prev_insn_idx, insn_idx);
+ pr_cont_verifier_state(env);
+ do_print_state = false;
+ }
+
+ if (verbose_on) {
+ pr_cont("%d: ", insn_idx);
+ pr_cont_bpf_insn(insn);
+ }
+
+ if (class == BPF_ALU || class == BPF_ALU64) {
+ _(check_alu_op(regs, insn));
+
+ } else if (class == BPF_LDX) {
+ if (BPF_MODE(insn->code) != BPF_MEM)
+ return -EINVAL;
+
+ /* check src operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+
+ _(check_mem_access(env, insn->src_reg, insn->off,
+ BPF_SIZE(insn->code), BPF_READ,
+ insn->dst_reg));
+
+ /* dest reg state will be updated by mem_access */
+
+ } else if (class == BPF_STX) {
+ /* check src1 operand */
+ _(check_reg_arg(regs, insn->src_reg, 1));
+ /* check src2 operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+ _(check_mem_access(env, insn->dst_reg, insn->off,
+ BPF_SIZE(insn->code), BPF_WRITE,
+ insn->src_reg));
+
+ } else if (class == BPF_ST) {
+ if (BPF_MODE(insn->code) != BPF_MEM)
+ return -EINVAL;
+ /* check src operand */
+ _(check_reg_arg(regs, insn->dst_reg, 1));
+ _(check_mem_access(env, insn->dst_reg, insn->off,
+ BPF_SIZE(insn->code), BPF_WRITE,
+ -1));
+
+ } else if (class == BPF_JMP) {
+ u8 opcode = BPF_OP(insn->code);
+
+ if (opcode == BPF_CALL) {
+ _(check_call(env, insn->imm));
+ } else if (opcode == BPF_JA) {
+ if (BPF_SRC(insn->code) != BPF_X)
+ return -EINVAL;
+ insn_idx += insn->off + 1;
+ continue;
+ } else if (opcode == BPF_EXIT) {
+ /* eBPF calling convention is such that R0 is used
+ * to return the value from eBPF program.
+ * Make sure that it's readable at this time
+ * of bpf_exit, which means that program wrote
+ * something into it earlier
+ */
+ _(check_reg_arg(regs, BPF_REG_0, 1));
+process_bpf_exit:
+ insn_idx = pop_stack(env, &prev_insn_idx);
+ if (insn_idx < 0) {
+ break;
+ } else {
+ do_print_state = true;
+ continue;
+ }
+ } else {
+ _(check_cond_jmp_op(env, insn, &insn_idx));
+ }
+ } else if (class == BPF_LD) {
+ _(check_ld_abs(env, insn));
+ } else {
+ verbose("unknown insn class %d\n", class);
+ return -EINVAL;
+ }
+
+ insn_idx++;
+ }
+
+ return 0;
+}
+
+static void free_states(struct verifier_env *env, int insn_cnt)
+{
+ struct verifier_state_list *sl, *sln;
+ int i;
+
+ for (i = 0; i < insn_cnt; i++) {
+ sl = env->branch_landing[i];
+
+ if (sl)
+ while (sl != STATE_END) {
+ sln = sl->next;
+ kfree(sl);
+ sl = sln;
+ }
+ }
+
+ kfree(env->branch_landing);
+}
+
+int bpf_check(struct sk_filter *prog)
+{
+ struct verifier_env *env;
+ int ret;
+
+ if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
+ return -E2BIG;
+
+ env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
+ if (!env)
+ return -ENOMEM;
+
+ verbose_on = false;
+retry:
+ env->prog = prog;
+ env->branch_landing = kcalloc(prog->len,
+ sizeof(struct verifier_state_list *),
+ GFP_KERNEL);
+
+ if (!env->branch_landing) {
+ kfree(env);
+ return -ENOMEM;
+ }
+
+ ret = check_cfg(env);
+ if (ret < 0)
+ goto free_env;
+
+ ret = do_check(env);
+
+free_env:
+ while (pop_stack(env, NULL) >= 0);
+ free_states(env, prog->len);
+
+ if (ret < 0 && !verbose_on && capable(CAP_SYS_ADMIN)) {
+ /* verification failed, redo it with verbose on */
+ memset(env, 0, sizeof(struct verifier_env));
+ verbose_on = true;
+ goto retry;
+ }
+
+ if (ret == 0 && env->used_map_cnt) {
+ /* if program passed verifier, update used_maps in bpf_prog_info */
+ prog->info->used_maps = kmalloc_array(env->used_map_cnt,
+ sizeof(u32), GFP_KERNEL);
+ if (!prog->info->used_maps) {
+ kfree(env);
+ return -ENOMEM;
+ }
+ memcpy(prog->info->used_maps, env->used_maps,
+ sizeof(u32) * env->used_map_cnt);
+ prog->info->used_map_cnt = env->used_map_cnt;
+ }
+
+ kfree(env);
+ return ret;
+}
--
1.7.9.5
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
The maps are accessed from user space via the BPF syscall, which has the
following commands (a user space usage sketch follows the list):
- create a map with given id, type and attributes
map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
returns positive map id or negative error
- delete map with given map id
err = bpf_map_delete(int map_id)
returns zero or negative error
- lookup key in a given map referenced by map_id
err = bpf_map_lookup_elem(int map_id, void *key, void *value)
returns zero and stores found elem into value or negative error
- create or update key/value pair in a given map
err = bpf_map_update_elem(int map_id, void *key, void *value)
returns zero or negative error
- find and delete element by key in a given map
err = bpf_map_delete_elem(int map_id, void *key)
returns zero or negative error
- iterate map elements (based on input key return next_key)
err = bpf_map_get_next_key(int map_id, void *key, void *next_key)
returns zero and stores next key into next_key or negative error
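
A minimal user space sketch of how these commands compose (an illustration
only: it assumes the uapi header from this series is installed, __NR_bpf is
wired up for the target architecture, and map_id refers to a hash map with
4-byte key and value created earlier via BPF_MAP_CREATE):

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	static long bpf(int cmd, unsigned long a2, unsigned long a3,
			unsigned long a4)
	{
		/* arg5 is unused by the map commands */
		return syscall(__NR_bpf, cmd, a2, a3, a4, 0);
	}

	int main(void)
	{
		int map_id = 1;	/* assumed id of an existing map */
		int key = 0, next_key, value = 42;

		/* create or update the pair (0, 42) */
		bpf(BPF_MAP_UPDATE_ELEM, map_id, (unsigned long)&key,
		    (unsigned long)&value);

		/* read it back */
		value = 0;
		if (bpf(BPF_MAP_LOOKUP_ELEM, map_id, (unsigned long)&key,
			(unsigned long)&value) == 0)
			printf("key %d -> value %d\n", key, value);

		/* walk the keys until get_next_key reports an error */
		while (bpf(BPF_MAP_GET_NEXT_KEY, map_id, (unsigned long)&key,
			   (unsigned long)&next_key) == 0)
			key = next_key;

		/* remove the element */
		bpf(BPF_MAP_DELETE_ELEM, map_id, (unsigned long)&key, 0);
		return 0;
	}
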
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf.h | 6 ++
include/uapi/linux/bpf.h | 25 +++++++
kernel/bpf/syscall.c | 180 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 211 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6448b9beea89..19cd394bdbcc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -18,6 +18,12 @@ struct bpf_map_ops {
/* funcs callable from userspace (via syscall) */
struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
void (*map_free)(struct bpf_map *);
+ int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+
+ /* funcs callable from userspace and from eBPF programs */
+ void *(*map_lookup_elem)(struct bpf_map *map, void *key);
+ int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
+ int (*map_delete_elem)(struct bpf_map *map, void *key);
};
struct bpf_map {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 04374e57c290..faed2ce2d25a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -315,6 +315,31 @@ enum bpf_cmd {
* returns zero or negative error
*/
BPF_MAP_DELETE,
+
+ /* lookup key in a given map referenced by map_id
+ * err = bpf_map_lookup_elem(int map_id, void *key, void *value)
+ * returns zero and stores found elem into value
+ * or negative error
+ */
+ BPF_MAP_LOOKUP_ELEM,
+
+ /* create or update key/value pair in a given map
+ * err = bpf_map_update_elem(int map_id, void *key, void *value)
+ * returns zero or negative error
+ */
+ BPF_MAP_UPDATE_ELEM,
+
+ /* find and delete elem by key in a given map
+ * err = bpf_map_delete_elem(int map_id, void *key)
+ * returns zero or negative error
+ */
+ BPF_MAP_DELETE_ELEM,
+
+ /* lookup key in a given map and return next key
+ * err = bpf_map_get_elem(int map_id, void *key, void *next_key)
+ * returns zero and stores next key or negative error
+ */
+ BPF_MAP_GET_NEXT_KEY,
};
enum bpf_map_attributes {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b9509923b16f..1a48da23a939 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -219,6 +219,174 @@ static int map_delete(int map_id)
return 0;
}
+static int map_lookup_elem(int map_id, void __user *ukey, void __user *uvalue)
+{
+ struct bpf_map *map;
+ void *key, *value;
+ int err;
+
+ if (map_id < 0)
+ return -EINVAL;
+
+ rcu_read_lock();
+ map = idr_find(&bpf_map_id_idr, map_id);
+ err = -EINVAL;
+ if (!map)
+ goto err_unlock;
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_ATOMIC);
+ if (!key)
+ goto err_unlock;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = -ESRCH;
+ value = map->ops->map_lookup_elem(map, key);
+ if (!value)
+ goto free_key;
+
+ err = -EFAULT;
+ if (copy_to_user(uvalue, value, map->value_size) != 0)
+ goto free_key;
+
+ err = 0;
+
+free_key:
+ kfree(key);
+err_unlock:
+ rcu_read_unlock();
+ return err;
+}
+
+static int map_update_elem(int map_id, void __user *ukey, void __user *uvalue)
+{
+ struct bpf_map *map;
+ void *key, *value;
+ int err;
+
+ if (map_id < 0)
+ return -EINVAL;
+
+ rcu_read_lock();
+ map = idr_find(&bpf_map_id_idr, map_id);
+ err = -EINVAL;
+ if (!map)
+ goto err_unlock;
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_ATOMIC);
+ if (!key)
+ goto err_unlock;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = -ENOMEM;
+ value = kmalloc(map->value_size, GFP_ATOMIC);
+ if (!value)
+ goto free_key;
+
+ err = -EFAULT;
+ if (copy_from_user(value, uvalue, map->value_size) != 0)
+ goto free_value;
+
+ err = map->ops->map_update_elem(map, key, value);
+
+free_value:
+ kfree(value);
+free_key:
+ kfree(key);
+err_unlock:
+ rcu_read_unlock();
+ return err;
+}
+
+static int map_delete_elem(int map_id, void __user *ukey)
+{
+ struct bpf_map *map;
+ void *key;
+ int err;
+
+ if (map_id < 0)
+ return -EINVAL;
+
+ rcu_read_lock();
+ map = idr_find(&bpf_map_id_idr, map_id);
+ err = -EINVAL;
+ if (!map)
+ goto err_unlock;
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_ATOMIC);
+ if (!key)
+ goto err_unlock;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = map->ops->map_delete_elem(map, key);
+
+free_key:
+ kfree(key);
+err_unlock:
+ rcu_read_unlock();
+ return err;
+}
+
+static int map_get_next_key(int map_id, void __user *ukey,
+ void __user *unext_key)
+{
+ struct bpf_map *map;
+ void *key, *next_key;
+ int err;
+
+ if (map_id < 0)
+ return -EINVAL;
+
+ rcu_read_lock();
+ map = idr_find(&bpf_map_id_idr, map_id);
+ err = -EINVAL;
+ if (!map)
+ goto err_unlock;
+
+ err = -ENOMEM;
+ key = kmalloc(map->key_size, GFP_ATOMIC);
+ if (!key)
+ goto err_unlock;
+
+ err = -EFAULT;
+ if (copy_from_user(key, ukey, map->key_size) != 0)
+ goto free_key;
+
+ err = -ENOMEM;
+ next_key = kmalloc(map->key_size, GFP_ATOMIC);
+ if (!next_key)
+ goto free_key;
+
+ err = map->ops->map_get_next_key(map, key, next_key);
+ if (err)
+ goto free_next_key;
+
+ err = -EFAULT;
+ if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+ goto free_next_key;
+
+ err = 0;
+
+free_next_key:
+ kfree(next_key);
+free_key:
+ kfree(key);
+err_unlock:
+ rcu_read_unlock();
+ return err;
+}
+
SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -232,6 +400,18 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
case BPF_MAP_DELETE:
return map_delete((int) arg2);
+ case BPF_MAP_LOOKUP_ELEM:
+ return map_lookup_elem((int) arg2, (void __user *) arg3,
+ (void __user *) arg4);
+ case BPF_MAP_UPDATE_ELEM:
+ return map_update_elem((int) arg2, (void __user *) arg3,
+ (void __user *) arg4);
+ case BPF_MAP_DELETE_ELEM:
+ return map_delete_elem((int) arg2, (void __user *) arg3);
+
+ case BPF_MAP_GET_NEXT_KEY:
+ return map_get_next_key((int) arg2, (void __user *) arg3,
+ (void __user *) arg4);
default:
return -EINVAL;
}
--
1.7.9.5
Introduce a new setsockopt() command:
int prog_id;
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &prog_id, sizeof(prog_id))
prog_id is the id of an eBPF program previously loaded via:
prog_id = syscall(__NR_bpf, BPF_PROG_LOAD, 0, BPF_PROG_TYPE_SOCKET_FILTER,
&prog, sizeof(prog));
setsockopt() calls bpf_prog_get(), which increments the refcnt of the program,
so it doesn't get unloaded while a socket is using it.
The same eBPF program can be attached to different sockets.
Process exit automatically closes the socket, which calls sk_filter_uncharge()
and decrements the refcnt of the eBPF program.
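
A minimal sketch of the intended flow (an illustration only: it assumes the
insn macros and 'struct sock_filter_int' from this series are visible to user
space, __NR_bpf is wired up, and error handling is trimmed):

	#include <unistd.h>
	#include <sys/socket.h>
	#include <sys/syscall.h>
	#include <linux/filter.h>
	#include <linux/bpf.h>

	/* trivial eBPF program: set r0 = 0 and exit */
	static struct sock_filter_int prog[] = {
		BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0),
		BPF_EXIT_INSN(),
	};

	int attach_example(int sock)
	{
		int prog_id;

		prog_id = syscall(__NR_bpf, BPF_PROG_LOAD, 0,
				  BPF_PROG_TYPE_SOCKET_FILTER, &prog, sizeof(prog));
		if (prog_id < 0)
			return prog_id;

		return setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF,
				  &prog_id, sizeof(prog_id));
	}
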
Signed-off-by: Alexei Starovoitov <[email protected]>
---
arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/cris/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
include/linux/filter.h | 1 +
include/uapi/asm-generic/socket.h | 2 +
net/core/filter.c | 117 ++++++++++++++++++++++++++++++++
net/core/sock.c | 13 ++++
17 files changed, 159 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 3de1394bcab8..8c83c376b5ba 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 6e6cd159924b..498ef7220466 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index ed94e5ed0a23..0d5120724780 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -82,6 +82,8 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index ca2c6e6f31c6..81fba267c285 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -80,5 +80,7 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index a1b49bac7951..9cbb2e82fa7c 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -89,4 +89,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 6c9a24b3aefa..587ac2fb4106 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index a14baa218c76..ab1aed2306db 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -98,4 +98,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 6aa3ce1854aa..1c4f916d0ef1 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index fe35ceacf0e7..d189bb79ca07 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -79,4 +79,6 @@
#define SO_BPF_EXTENSIONS 0x4029
+#define SO_ATTACH_FILTER_EBPF 0x402a
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index a9c3e2e18c05..88488f24ae7f 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index e031332096d7..c5f26af90366 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -86,4 +86,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 54d9608681b6..667ed3fa63f2 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -76,6 +76,8 @@
#define SO_BPF_EXTENSIONS 0x0032
+#define SO_ATTACH_FILTER_EBPF 0x0033
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 39acec0cf0b1..24f3e4434979 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -91,4 +91,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9873cc8fd31b..7412cfce84f9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -72,6 +72,7 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
void sk_unattached_filter_destroy(struct sk_filter *fp);
int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int sk_attach_filter_ebpf(u32 prog_id, struct sock *sk);
int sk_detach_filter(struct sock *sk);
int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index ea0796bdcf88..f41844e9ac07 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -82,4 +82,6 @@
#define SO_BPF_EXTENSIONS 48
+#define SO_ATTACH_FILTER_EBPF 49
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index 7f7c61b4aa39..11a54295f693 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -44,6 +44,7 @@
#include <linux/ratelimit.h>
#include <linux/seccomp.h>
#include <linux/if_vlan.h>
+#include <linux/bpf.h>
/**
* sk_filter - run a packet through a socket filter
@@ -1117,6 +1118,122 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_attach_filter);
+int sk_attach_filter_ebpf(u32 prog_id, struct sock *sk)
+{
+ struct sk_filter *fp, *old_fp;
+
+ if (sock_flag(sk, SOCK_FILTER_LOCKED))
+ return -EPERM;
+
+ fp = bpf_prog_get(prog_id);
+ if (!fp)
+ return -EINVAL;
+
+ if (fp->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+ /* valid prog_id, but invalid filter type */
+ sk_filter_release(fp);
+ return -EINVAL;
+ }
+
+ old_fp = rcu_dereference_protected(sk->sk_filter,
+ sock_owned_by_user(sk));
+ rcu_assign_pointer(sk->sk_filter, fp);
+
+ if (old_fp)
+ sk_filter_uncharge(sk, old_fp);
+
+ return 0;
+}
+
+static struct bpf_func_proto sock_filter_funcs[] = {
+ [BPF_FUNC_map_lookup_elem] = {
+ .ret_type = PTR_TO_MAP_CONDITIONAL,
+ .arg1_type = CONST_ARG_MAP_ID,
+ .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ },
+ [BPF_FUNC_map_update_elem] = {
+ .ret_type = RET_INTEGER,
+ .arg1_type = CONST_ARG_MAP_ID,
+ .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ .arg3_type = PTR_TO_STACK_IMM_MAP_VALUE,
+ },
+ [BPF_FUNC_map_delete_elem] = {
+ .ret_type = RET_INTEGER,
+ .arg1_type = CONST_ARG_MAP_ID,
+ .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ .arg3_type = PTR_TO_STACK_IMM_MAP_VALUE,
+ },
+};
+
+/* allow socket filters to call
+ * bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
+ */
+static const struct bpf_func_proto *sock_filter_func_proto(enum bpf_func_id func_id)
+{
+ if (func_id < 0 || func_id >= ARRAY_SIZE(sock_filter_funcs))
+ return NULL;
+ return &sock_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+ int size;
+ enum bpf_access_type type;
+} sock_filter_ctx_access[] = {
+ [offsetof(struct sk_buff, mark)] = {
+ FIELD_SIZEOF(struct sk_buff, mark), BPF_READ
+ },
+ [offsetof(struct sk_buff, protocol)] = {
+ FIELD_SIZEOF(struct sk_buff, protocol), BPF_READ
+ },
+ [offsetof(struct sk_buff, queue_mapping)] = {
+ FIELD_SIZEOF(struct sk_buff, queue_mapping), BPF_READ
+ },
+};
+
+/* allow socket filters to access the 'mark', 'protocol' and 'queue_mapping'
+ * fields of 'struct sk_buff'
+ */
+static bool sock_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ const struct bpf_context_access *access;
+
+ if (off < 0 || off >= ARRAY_SIZE(sock_filter_ctx_access))
+ return false;
+
+ access = &sock_filter_ctx_access[off];
+ if (access->size == size && (access->type & type))
+ return true;
+
+ return false;
+}
+
+static struct bpf_verifier_ops sock_filter_ops = {
+ .get_func_proto = sock_filter_func_proto,
+ .is_valid_access = sock_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+ .ops = &sock_filter_ops,
+ .type = BPF_PROG_TYPE_SOCKET_FILTER,
+};
+
+static int __init register_sock_filter_ops(void)
+{
+ /* init function offsets used to convert BPF_FUNC_* constants in
+ * BPF_CALL instructions to offset of helper functions
+ */
+ sock_filter_funcs[BPF_FUNC_map_lookup_elem].func_off =
+ bpf_map_lookup_elem - __bpf_call_base;
+ sock_filter_funcs[BPF_FUNC_map_update_elem].func_off =
+ bpf_map_update_elem - __bpf_call_base;
+ sock_filter_funcs[BPF_FUNC_map_delete_elem].func_off =
+ bpf_map_delete_elem - __bpf_call_base;
+
+ bpf_register_prog_type(&tl);
+ return 0;
+}
+late_initcall(register_sock_filter_ops);
+
int sk_detach_filter(struct sock *sk)
{
int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 026e01f70274..2f9f7b74a551 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -895,6 +895,19 @@ set_rcvbuf:
}
break;
+ case SO_ATTACH_FILTER_EBPF:
+ ret = -EINVAL;
+ if (optlen == sizeof(u32)) {
+ u32 prog_id;
+
+ ret = -EFAULT;
+ if (copy_from_user(&prog_id, optval, sizeof(prog_id)))
+ break;
+
+ ret = sk_attach_filter_ebpf(prog_id, sk);
+ }
+ break;
+
case SO_DETACH_FILTER:
ret = sk_detach_filter(sk);
break;
--
1.7.9.5
This socket filter example:
- creates an in-kernel hashtable with a 4-byte key and an 8-byte value
- populates map[6] = 0; map[17] = 0; // 6 - tcp_proto, 17 - udp_proto
- loads eBPF program:
r0 = skb[14 + 9]; // load one byte of ip->proto
*(u32*)(fp - 4) = r0;
value = bpf_map_lookup_elem(map_id, fp - 4);
if (value)
(*(u64*)value) += 1;
- attaches this program to eth0 raw socket
- every second user space reads map[6] and map[17] to see how many
TCP and UDP packets were seen on eth0
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 13 ++++
samples/bpf/sock_example.c | 160 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 174 insertions(+)
create mode 100644 samples/bpf/.gitignore
create mode 100644 samples/bpf/Makefile
create mode 100644 samples/bpf/sock_example.c
diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index 000000000000..5465c6e92a00
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1 @@
+sock_example
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
new file mode 100644
index 000000000000..95c990151644
--- /dev/null
+++ b/samples/bpf/Makefile
@@ -0,0 +1,13 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := sock_example
+
+sock_example-objs := sock_example.o libbpf.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_libbpf.o += -I$(objtree)/usr/include
+HOSTCFLAGS_sock_example.o += -I$(objtree)/usr/include
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
new file mode 100644
index 000000000000..5cf091571d4f
--- /dev/null
+++ b/samples/bpf/sock_example.c
@@ -0,0 +1,160 @@
+/* eBPF example program:
+ * - creates a hashtable in kernel with key 4 bytes and value 8 bytes
+ *
+ * - populates map[6] = 0; map[17] = 0; // 6 - tcp_proto, 17 - udp_proto
+ *
+ * - loads eBPF program:
+ * r0 = skb[14 + 9]; // load one byte of ip->proto
+ * *(u32*)(fp - 4) = r0;
+ * value = bpf_map_lookup_elem(map_id, fp - 4);
+ * if (value)
+ * (*(u64*)value) += 1;
+ *
+ * - attaches this program to eth0 raw socket
+ *
+ * - every second user space reads map[6] and map[17] to see how many
+ * TCP and UDP packets were seen on eth0
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <asm-generic/socket.h>
+#include <linux/netlink.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_packet.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+#include "libbpf.h"
+
+static int open_raw_sock(const char *name)
+{
+ struct sockaddr_ll sll;
+ struct packet_mreq mr;
+ struct ifreq ifr;
+ int sock;
+
+ sock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC, htons(ETH_P_ALL));
+ if (sock < 0) {
+ printf("cannot open socket!\n");
+ return -1;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strncpy((char *)ifr.ifr_name, name, IFNAMSIZ);
+ if (ioctl(sock, SIOCGIFINDEX, &ifr) < 0) {
+ printf("ioctl: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ memset(&sll, 0, sizeof(sll));
+ sll.sll_family = AF_PACKET;
+ sll.sll_ifindex = ifr.ifr_ifindex;
+ sll.sll_protocol = htons(ETH_P_ALL);
+ if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+ printf("bind: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ memset(&mr, 0, sizeof(mr));
+ mr.mr_ifindex = ifr.ifr_ifindex;
+ mr.mr_type = PACKET_MR_PROMISC;
+ if (setsockopt(sock, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof(mr)) < 0) {
+ printf("set_promisc: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+ return sock;
+}
+
+#define MAP_ID 1
+
+static int test_sock(void)
+{
+ static struct sock_filter_int prog[] = {
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1),
+ BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
+ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0), /* r0 = 0 */
+ BPF_EXIT_INSN(),
+ };
+
+ int sock = -1, prog_id = 1, i, key;
+ long long value = 0, tcp_cnt, udp_cnt;
+
+ if (bpf_create_map(MAP_ID, sizeof(key), sizeof(value), 2) < 0) {
+ printf("failed to create map '%s'\n", strerror(errno));
+ /* must have been left from previous aborted run, delete it */
+ goto cleanup;
+ }
+
+ key = 6; /* tcp */
+ if (bpf_update_elem(MAP_ID, &key, &value) < 0) {
+ printf("update err key=%d\n", key);
+ goto cleanup;
+ }
+
+ key = 17; /* udp */
+ if (bpf_update_elem(MAP_ID, &key, &value) < 0) {
+ printf("update err key=%d\n", key);
+ goto cleanup;
+ }
+
+ prog_id = bpf_prog_load(prog_id, BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog), "GPL");
+ if (prog_id < 0) {
+ printf("failed to load prog '%s'\n", strerror(errno));
+ goto cleanup;
+ }
+
+ sock = open_raw_sock("eth0");
+
+ if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &prog_id, sizeof(prog_id)) < 0) {
+ printf("setsockopt %d\n", errno);
+ goto cleanup;
+ }
+
+ for (i = 0; i < 10; i++) {
+ key = 6;
+ if (bpf_lookup_elem(MAP_ID, &key, &tcp_cnt) < 0) {
+ printf("lookup err\n");
+ break;
+ }
+ key = 17;
+ if (bpf_lookup_elem(MAP_ID, &key, &udp_cnt) < 0) {
+ printf("lookup err\n");
+ break;
+ }
+ printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
+ sleep(1);
+ }
+
+cleanup:
+ close(sock);
+ bpf_prog_unload(prog_id);
+
+ bpf_delete_map(MAP_ID);
+
+ return 0;
+}
+
+int main(void)
+{
+ test_sock();
+ return 0;
+}
--
1.7.9.5
Simple packet drop monitor:
- an in-kernel eBPF program attaches to the kfree_skb() tracepoint and records the number
of packet drops at a given location
- user space iterates over the map every second and prints the stats
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/Makefile | 4 +-
samples/bpf/dropmon.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 130 insertions(+), 1 deletion(-)
create mode 100644 samples/bpf/dropmon.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95c990151644..8e3dfa0c25e4 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,14 @@
obj- := dummy.o
# List of programs to build
-hostprogs-y := sock_example
+hostprogs-y := sock_example dropmon
sock_example-objs := sock_example.o libbpf.o
+dropmon-objs := dropmon.o libbpf.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
HOSTCFLAGS_libbpf.o += -I$(objtree)/usr/include
HOSTCFLAGS_sock_example.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropmon.o += -I$(objtree)/usr/include
diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
new file mode 100644
index 000000000000..80d80066f518
--- /dev/null
+++ b/samples/bpf/dropmon.c
@@ -0,0 +1,127 @@
+/* simple packet drop monitor:
+ * - in-kernel eBPF program attaches to kfree_skb() event and records number
+ * of packet drops at given location
+ * - userspace iterates over the map every second and prints stats
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <asm-generic/socket.h>
+#include <linux/netlink.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_packet.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "libbpf.h"
+
+#define MAP_ID 1
+
+#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
+
+static void write_to_file(const char *file, const char *str)
+{
+ int fd, err;
+
+ fd = open(file, O_WRONLY);
+ err = write(fd, str, strlen(str));
+ (void) err;
+ close(fd);
+}
+
+static int dropmon(void)
+{
+ /* the following eBPF program is equivalent to C:
+ * void filter(struct bpf_context *ctx)
+ * {
+ * long loc = ctx->arg2;
+ * long init_val = 1;
+ * void *value;
+ *
+ * value = bpf_map_lookup_elem(MAP_ID, &loc);
+ * if (value) {
+ * (*(long *) value) += 1;
+ * } else {
+ * bpf_map_update_elem(MAP_ID, &loc, &init_val);
+ * }
+ * }
+ */
+ static struct sock_filter_int prog[] = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+ BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_3, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+ BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+ BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+ BPF_EXIT_INSN(),
+ };
+
+ int prog_id = 1, i;
+ long long key, next_key, value = 0;
+ char fmt[32];
+
+ if (bpf_create_map(MAP_ID, sizeof(key), sizeof(value), 1024) < 0) {
+ printf("failed to create map '%s'\n", strerror(errno));
+ goto cleanup;
+ }
+
+ prog_id = bpf_prog_load(prog_id, BPF_PROG_TYPE_TRACING_FILTER, prog, sizeof(prog), "GPL");
+ if (prog_id < 0) {
+ printf("failed to load prog '%s'\n", strerror(errno));
+ return -1;
+ }
+
+ sprintf(fmt, "bpf_%d", prog_id);
+
+ write_to_file(TRACEPOINT "filter", fmt);
+ write_to_file(TRACEPOINT "enable", "1");
+
+ for (i = 0; i < 10; i++) {
+ key = 0;
+ while (bpf_get_next_key(MAP_ID, &key, &next_key) == 0) {
+ bpf_lookup_elem(MAP_ID, &next_key, &value);
+ printf("location 0x%llx count %lld\n", next_key, value);
+ key = next_key;
+ }
+ if (key)
+ printf("\n");
+ sleep(1);
+ }
+
+cleanup:
+ bpf_prog_unload(prog_id);
+
+ bpf_delete_map(MAP_ID);
+
+ write_to_file(TRACEPOINT "enable", "0");
+ write_to_file(TRACEPOINT "filter", "0");
+
+ return 0;
+}
+
+int main(void)
+{
+ dropmon();
+ return 0;
+}
--
1.7.9.5
The library includes a trivial set of BPF syscall wrappers (a short usage sketch follows the prototype list):
int bpf_delete_map(int map_id);
int bpf_create_map(int map_id, int key_size, int value_size, int max_entries);
int bpf_update_elem(int map_id, void *key, void *value);
int bpf_lookup_elem(int map_id, void *key, void *value);
int bpf_delete_elem(int map_id, void *key);
int bpf_get_next_key(int map_id, void *key, void *next_key);
int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type,
struct sock_filter_int *insns, int insn_cnt,
const char *license);
int bpf_prog_unload(int prog_id);
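For illustration, a rough usage sketch combining these wrappers (the map id, sizes and program type below are made up; see sock_example.c in this set for a complete program):

    #include <linux/bpf.h>
    #include "libbpf.h"

    /* hypothetical values, for illustration only */
    #define EXAMPLE_MAP_ID 1

    static int example_setup(struct sock_filter_int *insns, int insn_bytes)
    {
            long long value = 0;
            int key = 6, prog_id;

            if (bpf_create_map(EXAMPLE_MAP_ID, sizeof(key), sizeof(value), 2) < 0)
                    return -1;

            if (bpf_update_elem(EXAMPLE_MAP_ID, &key, &value) < 0)
                    return -1;

            /* request prog_id 1; the kernel returns the allocated id */
            prog_id = bpf_prog_load(1, BPF_PROG_TYPE_SOCKET_FILTER,
                                    insns, insn_bytes, "GPL");
            return prog_id;
    }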
Signed-off-by: Alexei Starovoitov <[email protected]>
---
samples/bpf/libbpf.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++++
samples/bpf/libbpf.h | 18 ++++++++
2 files changed, 132 insertions(+)
create mode 100644 samples/bpf/libbpf.c
create mode 100644 samples/bpf/libbpf.h
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
new file mode 100644
index 000000000000..763eaf4b9814
--- /dev/null
+++ b/samples/bpf/libbpf.c
@@ -0,0 +1,114 @@
+/* eBPF mini library */
+#include <stdlib.h>
+#include <linux/unistd.h>
+#include <unistd.h>
+#include <string.h>
+#include <linux/netlink.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include "libbpf.h"
+
+struct nlattr_u32 {
+ __u16 nla_len;
+ __u16 nla_type;
+ __u32 val;
+};
+
+int bpf_delete_map(int map_id)
+{
+ return syscall(__NR_bpf, BPF_MAP_DELETE, map_id);
+}
+
+int bpf_create_map(int map_id, int key_size, int value_size, int max_entries)
+{
+ struct nlattr_u32 attr[] = {
+ {
+ .nla_len = sizeof(struct nlattr_u32),
+ .nla_type = BPF_MAP_KEY_SIZE,
+ .val = key_size,
+ },
+ {
+ .nla_len = sizeof(struct nlattr_u32),
+ .nla_type = BPF_MAP_VALUE_SIZE,
+ .val = value_size,
+ },
+ {
+ .nla_len = sizeof(struct nlattr_u32),
+ .nla_type = BPF_MAP_MAX_ENTRIES,
+ .val = max_entries,
+ },
+ };
+ int err;
+
+ err = syscall(__NR_bpf, BPF_MAP_CREATE, map_id, BPF_MAP_TYPE_HASH, attr, sizeof(attr));
+ if (err > 0 && err != map_id && map_id != 0) {
+ bpf_delete_map(err);
+ errno = EEXIST;
+ err = -1;
+ }
+ return err;
+}
+
+
+int bpf_update_elem(int map_id, void *key, void *value)
+{
+ return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, map_id, key, value);
+}
+
+int bpf_lookup_elem(int map_id, void *key, void *value)
+{
+ return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, map_id, key, value);
+}
+
+int bpf_delete_elem(int map_id, void *key)
+{
+ return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, map_id, key);
+}
+
+int bpf_get_next_key(int map_id, void *key, void *next_key)
+{
+ return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, map_id, key, next_key);
+}
+
+#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
+
+int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type,
+ struct sock_filter_int *insns, int prog_len,
+ const char *license)
+{
+ int nlattr_size, license_len, err;
+ void *nlattr, *ptr;
+
+ license_len = strlen(license) + 1;
+ nlattr_size = sizeof(struct nlattr) + prog_len + sizeof(struct nlattr) +
+ ROUND_UP(license_len, 4);
+
+ ptr = nlattr = malloc(nlattr_size);
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = prog_len + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_TEXT,
+ };
+ ptr += sizeof(struct nlattr);
+
+ memcpy(ptr, insns, prog_len);
+ ptr += prog_len;
+
+ *(struct nlattr *) ptr = (struct nlattr) {
+ .nla_len = ROUND_UP(license_len, 4) + sizeof(struct nlattr),
+ .nla_type = BPF_PROG_LICENSE,
+ };
+ ptr += sizeof(struct nlattr);
+
+ memcpy(ptr, license, license_len);
+
+ err = syscall(__NR_bpf, BPF_PROG_LOAD, prog_id, prog_type, nlattr,
+ nlattr_size);
+ free(nlattr);
+ return err;
+}
+
+int bpf_prog_unload(int prog_id)
+{
+ return syscall(__NR_bpf, BPF_PROG_UNLOAD, prog_id);
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
new file mode 100644
index 000000000000..408368e6d4d5
--- /dev/null
+++ b/samples/bpf/libbpf.h
@@ -0,0 +1,18 @@
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+struct sock_filter_int;
+
+int bpf_delete_map(int map_id);
+int bpf_create_map(int map_id, int key_size, int value_size, int max_entries);
+int bpf_update_elem(int map_id, void *key, void *value);
+int bpf_lookup_elem(int map_id, void *key, void *value);
+int bpf_delete_elem(int map_id, void *key);
+int bpf_get_next_key(int map_id, void *key, void *next_key);
+int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type,
+ struct sock_filter_int *insns, int insn_cnt,
+ const char *license);
+int bpf_prog_unload(int prog_id);
+
+#endif
--
1.7.9.5
User interface:
cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
where 123 is the id of a previously loaded eBPF program and
__event__ is a static tracepoint event.
(kprobe events will be supported in future patches; see the user space sketch after the helper list below)
eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- memcmp
- trace_printk
- load_pointer
- dump_stack
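For illustration only, a hypothetical user space helper for the attach step (it mirrors write_to_file() in the dropmon sample; the event path and prog id in the usage comment are examples):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* write "bpf_<prog_id>" into a tracepoint 'filter' file, then enable the event */
    static void attach_bpf_to_event(const char *event_dir, int prog_id)
    {
            char path[256], str[32];
            int fd, n;

            n = snprintf(str, sizeof(str), "bpf_%d", prog_id);
            snprintf(path, sizeof(path), "%s/filter", event_dir);
            fd = open(path, O_WRONLY);
            if (fd >= 0) {
                    (void) write(fd, str, n);
                    close(fd);
            }

            snprintf(path, sizeof(path), "%s/enable", event_dir);
            fd = open(path, O_WRONLY);
            if (fd >= 0) {
                    (void) write(fd, "1", 1);
                    close(fd);
            }
    }

    /* e.g. attach_bpf_to_event("/sys/kernel/debug/tracing/events/skb/kfree_skb", 123); */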
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/ftrace_event.h | 5 +
include/trace/bpf_trace.h | 29 +++++
include/trace/ftrace.h | 10 ++
include/uapi/linux/bpf.h | 5 +
kernel/trace/Kconfig | 1 +
kernel/trace/Makefile | 1 +
kernel/trace/bpf_trace.c | 217 ++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 3 +
kernel/trace/trace_events.c | 7 ++
kernel/trace/trace_events_filter.c | 72 +++++++++++-
10 files changed, 349 insertions(+), 1 deletion(-)
create mode 100644 include/trace/bpf_trace.h
create mode 100644 kernel/trace/bpf_trace.c
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cff3106ffe2c..de313bd9a434 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -237,6 +237,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED_BIT,
TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
TRACE_EVENT_FL_TRACEPOINT_BIT,
+ TRACE_EVENT_FL_BPF_BIT,
};
/*
@@ -259,6 +260,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED = (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
TRACE_EVENT_FL_USE_CALL_FILTER = (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
TRACE_EVENT_FL_TRACEPOINT = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+ TRACE_EVENT_FL_BPF = (1 << TRACE_EVENT_FL_BPF_BIT),
};
struct ftrace_event_call {
@@ -536,6 +538,9 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
event_triggers_post_call(file, tt);
}
+struct bpf_context;
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx);
+
enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 000000000000..2122437f1317
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,29 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be split across two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+ unsigned long arg1;
+ unsigned long arg2;
+ unsigned long arg3;
+ unsigned long arg4;
+ unsigned long arg5;
+ unsigned long arg6;
+};
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...);
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..ad4987ac68bb 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
*/
#include <linux/ftrace_event.h>
+#include <trace/bpf_trace.h>
/*
* DECLARE_EVENT_CLASS can be used to add a generic function
@@ -634,6 +635,15 @@ ftrace_raw_event_##call(void *__data, proto) \
if (ftrace_trigger_soft_disabled(ftrace_file)) \
return; \
\
+ if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) && \
+ unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
+ struct bpf_context __ctx; \
+ \
+ populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0); \
+ trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
+ return; \
+ } \
+ \
__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
\
entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file, \
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 03c65eedd3d5..d03b8b39e031 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -382,6 +382,7 @@ enum bpf_prog_attributes {
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC,
BPF_PROG_TYPE_SOCKET_FILTER,
+ BPF_PROG_TYPE_TRACING_FILTER,
};
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -392,6 +393,10 @@ enum bpf_func_id {
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(map_id, void *key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(map_id, void *key, void *value) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(map_id, void *key) */
+ BPF_FUNC_load_pointer, /* void *bpf_load_pointer(void *unsafe_ptr) */
+ BPF_FUNC_memcmp, /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
+ BPF_FUNC_dump_stack, /* void bpf_dump_stack(void) */
+ BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
__BPF_FUNC_MAX_ID,
};
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index d4409356f40d..e36d42876634 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -80,6 +80,7 @@ config FTRACE_NMI_ENTER
config EVENT_TRACING
select CONTEXT_SWITCH_TRACER
+ depends on NET
bool
config CONTEXT_SWITCH_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 2611613f14f1..a0fcfd97101d 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
endif
obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_EVENT_TRACING) += bpf_trace.o
obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_PM_RUNTIME),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..b7b394a0fd6e
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,217 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...)
+{
+ va_list args;
+
+ va_start(args, ctx);
+
+ ctx->arg1 = va_arg(args, unsigned long);
+ ctx->arg2 = va_arg(args, unsigned long);
+ ctx->arg3 = va_arg(args, unsigned long);
+ ctx->arg4 = va_arg(args, unsigned long);
+ ctx->arg5 = va_arg(args, unsigned long);
+ ctx->arg6 = va_arg(args, unsigned long);
+
+ va_end(args);
+}
+EXPORT_SYMBOL_GPL(populate_bpf_context);
+
+/* called from eBPF program with rcu lock held */
+static u64 bpf_load_pointer(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *unsafe_ptr = (void *) r1;
+ void *ptr = NULL;
+
+ probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
+ return (u64) (unsigned long) ptr;
+}
+
+static u64 bpf_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ void *unsafe_ptr = (void *) r1;
+ void *safe_ptr = (void *) r2;
+ u32 size = (u32) r3;
+ char buf[64];
+ int err;
+
+ if (size < 64) {
+ err = probe_kernel_read(buf, unsafe_ptr, size);
+ if (err)
+ return err;
+ return memcmp(buf, safe_ptr, size);
+ }
+ return -1;
+}
+
+static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ trace_dump_stack(0);
+ return 0;
+}
+
+/* limited trace_printk()
+ * only %d %u %x conversion specifiers allowed
+ */
+static u64 bpf_trace_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+ char *fmt = (char *) r1;
+ int fmt_cnt = 0;
+ int i;
+
+ /* bpf_check() guarantees that fmt points to bpf program stack and
+ * fmt_size bytes of it were initialized by bpf program
+ */
+ if (fmt[fmt_size - 1] != 0)
+ return -EINVAL;
+
+ /* check format string for allowed specifiers */
+ for (i = 0; i < fmt_size; i++)
+ if (fmt[i] == '%') {
+ if (i + 1 >= fmt_size)
+ return -EINVAL;
+ if (fmt[i + 1] != 'd' && fmt[i + 1] != 'u' &&
+ fmt[i + 1] != 'x')
+ return -EINVAL;
+ fmt_cnt++;
+ }
+
+ if (fmt_cnt > 3)
+ return -EINVAL;
+
+ return __trace_printk((unsigned long) __builtin_return_address(3), fmt,
+ (u32) r3, (u32) r4, (u32) r5);
+}
+
+static struct bpf_func_proto tracing_filter_funcs[] = {
+ [BPF_FUNC_load_pointer] = {
+ .ret_type = RET_INTEGER,
+ },
+ [BPF_FUNC_memcmp] = {
+ .ret_type = RET_INTEGER,
+ .arg1_type = INVALID_PTR,
+ .arg2_type = PTR_TO_STACK_IMM,
+ .arg3_type = CONST_ARG_STACK_IMM_SIZE,
+ },
+ [BPF_FUNC_dump_stack] = {
+ .ret_type = RET_VOID,
+ },
+ [BPF_FUNC_trace_printk] = {
+ .ret_type = RET_INTEGER,
+ .arg1_type = PTR_TO_STACK_IMM,
+ .arg2_type = CONST_ARG_STACK_IMM_SIZE,
+ },
+ [BPF_FUNC_map_lookup_elem] = {
+ .ret_type = PTR_TO_MAP_CONDITIONAL,
+ .arg1_type = CONST_ARG_MAP_ID,
+ .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ },
+ [BPF_FUNC_map_update_elem] = {
+ .ret_type = RET_INTEGER,
+ .arg1_type = CONST_ARG_MAP_ID,
+ .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ .arg3_type = PTR_TO_STACK_IMM_MAP_VALUE,
+ },
+ [BPF_FUNC_map_delete_elem] = {
+ .ret_type = RET_INTEGER,
+ .arg1_type = CONST_ARG_MAP_ID,
+ .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ .arg3_type = PTR_TO_STACK_IMM_MAP_VALUE,
+ },
+};
+
+static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
+{
+ if (func_id < 0 || func_id >= ARRAY_SIZE(tracing_filter_funcs))
+ return NULL;
+ return &tracing_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+ int size;
+ enum bpf_access_type type;
+} tracing_filter_ctx_access[] = {
+ [offsetof(struct bpf_context, arg1)] = {
+ FIELD_SIZEOF(struct bpf_context, arg1),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg2)] = {
+ FIELD_SIZEOF(struct bpf_context, arg2),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg3)] = {
+ FIELD_SIZEOF(struct bpf_context, arg3),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg4)] = {
+ FIELD_SIZEOF(struct bpf_context, arg4),
+ BPF_READ
+ },
+ [offsetof(struct bpf_context, arg5)] = {
+ FIELD_SIZEOF(struct bpf_context, arg5),
+ BPF_READ
+ },
+};
+
+static bool tracing_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ const struct bpf_context_access *access;
+
+ if (off < 0 || off >= ARRAY_SIZE(tracing_filter_ctx_access))
+ return false;
+
+ access = &tracing_filter_ctx_access[off];
+ if (access->size == size && (access->type & type))
+ return true;
+
+ return false;
+}
+
+static struct bpf_verifier_ops tracing_filter_ops = {
+ .get_func_proto = tracing_filter_func_proto,
+ .is_valid_access = tracing_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+ .ops = &tracing_filter_ops,
+ .type = BPF_PROG_TYPE_TRACING_FILTER,
+};
+
+static int __init register_tracing_filter_ops(void)
+{
+ /* init function offsets used to convert BPF_FUNC_* constants in
+ * BPF_CALL instructions to offset of helper functions
+ */
+ tracing_filter_funcs[BPF_FUNC_map_lookup_elem].func_off =
+ bpf_map_lookup_elem - __bpf_call_base;
+ tracing_filter_funcs[BPF_FUNC_map_update_elem].func_off =
+ bpf_map_update_elem - __bpf_call_base;
+ tracing_filter_funcs[BPF_FUNC_map_delete_elem].func_off =
+ bpf_map_delete_elem - __bpf_call_base;
+ tracing_filter_funcs[BPF_FUNC_trace_printk].func_off =
+ bpf_trace_printk - __bpf_call_base;
+ tracing_filter_funcs[BPF_FUNC_memcmp].func_off =
+ bpf_memcmp - __bpf_call_base;
+ tracing_filter_funcs[BPF_FUNC_dump_stack].func_off =
+ bpf_dump_stack - __bpf_call_base;
+ tracing_filter_funcs[BPF_FUNC_load_pointer].func_off =
+ bpf_load_pointer - __bpf_call_base;
+
+ bpf_register_prog_type(&tl);
+ return 0;
+}
+late_initcall(register_tracing_filter_ops);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9258f5a815db..bb7c6a19ead5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -984,12 +984,15 @@ struct ftrace_event_field {
int is_signed;
};
+struct sk_filter;
+
struct event_filter {
int n_preds; /* Number assigned */
int a_preds; /* allocated */
struct filter_pred *preds;
struct filter_pred *root;
char *filter_string;
+ struct sk_filter *prog;
};
struct event_subsystem {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f99e0b3bca8c..54298a0ad272 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1075,6 +1075,13 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
err = apply_event_filter(file, buf);
mutex_unlock(&event_mutex);
+ if (file->event_call->flags & TRACE_EVENT_FL_BPF)
+ /*
+ * allocate per-cpu printk buffers, since eBPF program
+ * might be calling bpf_trace_printk
+ */
+ trace_printk_init_buffers();
+
free_page((unsigned long) buf);
if (err < 0)
return err;
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 8a8631926a07..66e7b558ccae 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -23,6 +23,9 @@
#include <linux/mutex.h>
#include <linux/perf_event.h>
#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include <linux/filter.h>
#include "trace.h"
#include "trace_output.h"
@@ -535,6 +538,16 @@ static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
return WALK_PRED_DEFAULT;
}
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx)
+{
+ BUG_ON(!filter || !filter->prog);
+
+ rcu_read_lock();
+ SK_RUN_FILTER(filter->prog, (void *) ctx);
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(trace_filter_call_bpf);
+
/* return 1 if event matches, 0 otherwise (discard) */
int filter_match_preds(struct event_filter *filter, void *rec)
{
@@ -794,6 +807,8 @@ static void __free_filter(struct event_filter *filter)
if (!filter)
return;
+ if (filter->prog)
+ sk_unattached_filter_destroy(filter->prog);
__free_preds(filter);
kfree(filter->filter_string);
kfree(filter);
@@ -1898,6 +1913,48 @@ static int create_filter_start(char *filter_str, bool set_str,
return err;
}
+static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
+{
+ struct event_filter *filter;
+ struct sk_filter *prog;
+ long prog_id;
+ int err = 0;
+
+ *filterp = NULL;
+
+ filter = __alloc_filter();
+ if (!filter)
+ return -ENOMEM;
+
+ err = replace_filter_string(filter, filter_str);
+ if (err)
+ goto free_filter;
+
+ err = kstrtol(filter_str + 4, 0, &prog_id);
+ if (err)
+ goto free_filter;
+
+ err = -ESRCH;
+ prog = bpf_prog_get(prog_id);
+ if (!prog)
+ goto free_filter;
+
+ filter->prog = prog;
+
+ err = -EINVAL;
+ if (prog->info->prog_type != BPF_PROG_TYPE_TRACING_FILTER)
+ /* prog_id is valid, but it's not a tracing filter program */
+ goto free_filter;
+
+ *filterp = filter;
+
+ return 0;
+
+free_filter:
+ __free_filter(filter);
+ return err;
+}
+
static void create_filter_finish(struct filter_parse_state *ps)
{
if (ps) {
@@ -2007,7 +2064,20 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
return 0;
}
- err = create_filter(call, filter_string, true, &filter);
+ /*
+ * 'bpf_123' string is a request to attach eBPF program with id == 123
+ * also accept 'bpf 123', 'bpf.123', 'bpf-123' variants
+ */
+ if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
+ filter_string[4] != 0) {
+ err = create_filter_bpf(filter_string, &filter);
+ if (!err)
+ call->flags |= TRACE_EVENT_FL_BPF;
+ } else {
+ err = create_filter(call, filter_string, true, &filter);
+ if (!err)
+ call->flags &= ~TRACE_EVENT_FL_BPF;
+ }
/*
* Always swap the call filter with the new filter
--
1.7.9.5
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf.h | 5 +++
include/uapi/linux/bpf.h | 3 ++
kernel/bpf/syscall.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 93 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 67fd49eac904..bc505093683a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -127,4 +127,9 @@ struct sk_filter *bpf_prog_get(u32 prog_id);
/* verify correctness of eBPF program */
int bpf_check(struct sk_filter *fp);
+/* in-kernel helper functions called from eBPF programs */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597a35cc101d..03c65eedd3d5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -389,6 +389,9 @@ enum bpf_prog_type {
*/
enum bpf_func_id {
BPF_FUNC_unspec,
+ BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(map_id, void *key) */
+ BPF_FUNC_map_update_elem, /* int map_update_elem(map_id, void *key, void *value) */
+ BPF_FUNC_map_delete_elem, /* int map_delete_elem(map_id, void *key) */
__BPF_FUNC_MAX_ID,
};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 48d8f43da151..266136f0d333 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -691,3 +691,88 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
}
}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ * .ret_type = PTR_TO_MAP_CONDITIONAL,
+ * .arg1_type = CONST_ARG_MAP_ID,
+ * .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map;
+ int map_id = r1;
+ void *key = (void *) (unsigned long) r2;
+ void *value;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ map = idr_find(&bpf_map_id_idr, map_id);
+ /* eBPF verifier guarantees that map_id is valid for the life of
+ * the program
+ */
+ BUG_ON(!map);
+
+ value = map->ops->map_lookup_elem(map, key);
+
+ return (unsigned long) value;
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ * .ret_type = RET_INTEGER,
+ * .arg1_type = CONST_ARG_MAP_ID,
+ * .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ * .arg3_type = PTR_TO_STACK_IMM_MAP_VALUE,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map;
+ int map_id = r1;
+ void *key = (void *) (unsigned long) r2;
+ void *value = (void *) (unsigned long) r3;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ map = idr_find(&bpf_map_id_idr, map_id);
+ /* eBPF verifier guarantees that map_id is valid */
+ BUG_ON(!map);
+
+ return map->ops->map_update_elem(map, key, value);
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ * .ret_type = RET_INTEGER,
+ * .arg1_type = CONST_ARG_MAP_ID,
+ * .arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct bpf_map *map;
+ int map_id = r1;
+ void *key = (void *) (unsigned long) r2;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ map = idr_find(&bpf_map_id_idr, map_id);
+ /* eBPF verifier guarantees that map_id is valid */
+ BUG_ON(!map);
+
+ return map->ops->map_delete_elem(map, key);
+}
--
1.7.9.5
eBPF programs are safe run-to-completion functions with load/unload
methods from user space, similar to kernel modules.
User space API:
- load eBPF program
prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
where 'prog' is a sequence of sections (currently TEXT and LICENSE)
TEXT - array of eBPF instructions
LICENSE - GPL-compatible license string
- unload eBPF program
err = bpf_prog_unload(int prog_id)
User space example of syscall(__NR_bpf, BPF_PROG_LOAD, prog_id, prog_type, ...)
follows in later patches
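A condensed, hypothetical sketch of packing the TEXT and LICENSE sections for this call (it mirrors the libbpf.c wrapper in the samples; bounds/error checks are omitted and the instruction array is assumed to preserve 4-byte netlink alignment, which it does since each instruction is 8 bytes):

    #include <linux/bpf.h>
    #include <linux/netlink.h>
    #include <linux/unistd.h>
    #include <string.h>
    #include <unistd.h>

    static int load_prog(int prog_id, int prog_type, const void *insns, int insn_bytes)
    {
            const char *license = "GPL";
            int lic_len = strlen(license) + 1;   /* "GPL\0" is already 4-byte aligned */
            char buf[4096], *p = buf;
            struct nlattr *nla;

            /* TEXT section: array of eBPF instructions */
            nla = (struct nlattr *) p;
            nla->nla_type = BPF_PROG_TEXT;
            nla->nla_len = sizeof(*nla) + insn_bytes;
            memcpy(p + sizeof(*nla), insns, insn_bytes);
            p += nla->nla_len;

            /* LICENSE section: NUL-terminated, GPL-compatible string */
            nla = (struct nlattr *) p;
            nla->nla_type = BPF_PROG_LICENSE;
            nla->nla_len = sizeof(*nla) + lic_len;
            memcpy(p + sizeof(*nla), license, lic_len);
            p += nla->nla_len;

            return syscall(__NR_bpf, BPF_PROG_LOAD, prog_id, prog_type, buf, p - buf);
    }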
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpf.h | 32 ++++++
include/linux/filter.h | 9 +-
include/uapi/linux/bpf.h | 34 ++++++
kernel/bpf/core.c | 5 +-
kernel/bpf/syscall.c | 275 ++++++++++++++++++++++++++++++++++++++++++++++
net/core/filter.c | 9 +-
6 files changed, 358 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 19cd394bdbcc..7bfcad87018e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -47,4 +47,36 @@ struct bpf_map_type_list {
void bpf_register_map_type(struct bpf_map_type_list *tl);
struct bpf_map *bpf_map_get(u32 map_id);
+/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
+ * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
+ * instructions after verifying
+ */
+struct bpf_func_proto {
+ s32 func_off;
+};
+
+struct bpf_verifier_ops {
+ /* return eBPF function prototype for verification */
+ const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+};
+
+struct bpf_prog_type_list {
+ struct list_head list_node;
+ struct bpf_verifier_ops *ops;
+ enum bpf_prog_type type;
+};
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl);
+
+struct bpf_prog_info {
+ int prog_id;
+ enum bpf_prog_type prog_type;
+ struct bpf_verifier_ops *ops;
+ u32 *used_maps;
+ u32 used_map_cnt;
+};
+
+void free_bpf_prog_info(struct bpf_prog_info *info);
+struct sk_filter *bpf_prog_get(u32 prog_id);
+
#endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6766577635ff..9873cc8fd31b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -29,12 +29,17 @@ struct sock_fprog_kern {
struct sk_buff;
struct sock;
struct seccomp_data;
+struct bpf_prog_info;
struct sk_filter {
atomic_t refcnt;
u32 jited:1, /* Is our filter JIT'ed? */
- len:31; /* Number of filter blocks */
- struct sock_fprog_kern *orig_prog; /* Original BPF program */
+ ebpf:1, /* Is it eBPF program ? */
+ len:30; /* Number of filter blocks */
+ union {
+ struct sock_fprog_kern *orig_prog; /* Original BPF program */
+ struct bpf_prog_info *info;
+ };
struct rcu_head rcu;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter_int *filter);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1399ed1d5dad..ed067e245099 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -340,6 +340,19 @@ enum bpf_cmd {
* returns zero and stores next key or negative error
*/
BPF_MAP_GET_NEXT_KEY,
+
+ /* verify and load eBPF program
+ * prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
+ * prog is a sequence of sections
+ * returns positive program id or negative error
+ */
+ BPF_PROG_LOAD,
+
+ /* unload eBPF program
+ * err = bpf_prog_unload(int prog_id)
+ * returns zero or negative error
+ */
+ BPF_PROG_UNLOAD,
};
enum bpf_map_attributes {
@@ -357,4 +370,25 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH,
};
+enum bpf_prog_attributes {
+ BPF_PROG_UNSPEC,
+ BPF_PROG_TEXT, /* array of eBPF instructions */
+ BPF_PROG_LICENSE, /* license string */
+ __BPF_PROG_ATTR_MAX,
+};
+#define BPF_PROG_ATTR_MAX (__BPF_PROG_ATTR_MAX - 1)
+#define BPF_PROG_MAX_ATTR_SIZE 65535
+
+enum bpf_prog_type {
+ BPF_PROG_TYPE_UNSPEC,
+};
+
+/* integer value in 'imm' field of BPF_CALL instruction selects which helper
+ * function eBPF program intends to call
+ */
+enum bpf_func_id {
+ BPF_FUNC_unspec,
+ __BPF_FUNC_MAX_ID,
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index dd9c29ff720e..b9f743929d86 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -23,6 +23,7 @@
#include <linux/filter.h>
#include <linux/skbuff.h>
#include <asm/unaligned.h>
+#include <linux/bpf.h>
/* Registers */
#define BPF_R0 regs[BPF_REG_0]
@@ -537,9 +538,11 @@ void sk_filter_select_runtime(struct sk_filter *fp)
}
EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
-/* free internal BPF program */
+/* free internal BPF program, called after RCU grace period */
void sk_filter_free(struct sk_filter *fp)
{
+ if (fp->ebpf)
+ free_bpf_prog_info(fp->info);
bpf_jit_free(fp);
}
EXPORT_SYMBOL_GPL(sk_filter_free);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1a48da23a939..836809b1bc4e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -12,6 +12,8 @@
#include <linux/bpf.h>
#include <linux/syscalls.h>
#include <net/netlink.h>
+#include <linux/license.h>
+#include <linux/filter.h>
/* mutex to protect insertion/deletion of map_id in IDR */
static DEFINE_MUTEX(bpf_map_lock);
@@ -387,6 +389,273 @@ err_unlock:
return err;
}
+static LIST_HEAD(bpf_prog_types);
+
+static int find_prog_type(enum bpf_prog_type type, struct sk_filter *prog)
+{
+ struct bpf_prog_type_list *tl;
+
+ list_for_each_entry(tl, &bpf_prog_types, list_node) {
+ if (tl->type == type) {
+ prog->info->ops = tl->ops;
+ prog->info->prog_type = type;
+ return 0;
+ }
+ }
+ return -EINVAL;
+}
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl)
+{
+ list_add(&tl->list_node, &bpf_prog_types);
+}
+
+static DEFINE_MUTEX(bpf_prog_lock);
+static DEFINE_IDR(bpf_prog_id_idr);
+
+/* maximum number of loaded eBPF programs */
+#define MAX_BPF_PROG_CNT 1024
+static u32 bpf_prog_cnt;
+
+/* fixup insn->imm field of bpf_call instructions:
+ * if (insn->imm == BPF_FUNC_map_lookup_elem)
+ * insn->imm = bpf_map_lookup_elem - __bpf_call_base;
+ * else if (insn->imm == BPF_FUNC_map_update_elem)
+ * insn->imm = bpf_map_update_elem - __bpf_call_base;
+ * else ...
+ *
+ * this function is called after eBPF program passed verification
+ */
+static void fixup_bpf_calls(struct sk_filter *prog)
+{
+ const struct bpf_func_proto *fn;
+ int i;
+
+ for (i = 0; i < prog->len; i++) {
+ struct sock_filter_int *insn = &prog->insnsi[i];
+
+ if (insn->code == (BPF_JMP | BPF_CALL)) {
+ /* we reach here when program has bpf_call instructions
+ * and it passed bpf_check(), means that
+ * ops->get_func_proto must have been supplied, check it
+ */
+ BUG_ON(!prog->info->ops->get_func_proto);
+
+ fn = prog->info->ops->get_func_proto(insn->imm);
+ /* all functions that have prototype and verifier allowed
+ * programs to call them, must be real in-kernel functions
+ * and func_off = kernel_function - __bpf_call_base
+ */
+ BUG_ON(!fn->func_off);
+ insn->imm = fn->func_off;
+ }
+ }
+}
+
+/* free eBPF program auxiliary data, called after rcu grace period,
+ * so it's safe to drop refcnt on maps used by this program
+ *
+ * called from sk_filter_release()->sk_filter_release_rcu()->sk_filter_free()
+ */
+void free_bpf_prog_info(struct bpf_prog_info *info)
+{
+ bool found;
+ int i;
+
+ for (i = 0; i < info->used_map_cnt; i++) {
+ found = bpf_map_put(info->used_maps[i]);
+ /* all maps that this program was using should obviously still
+ * be there
+ */
+ BUG_ON(!found);
+ }
+ kfree(info);
+}
+
+static const struct nla_policy prog_policy[BPF_PROG_ATTR_MAX + 1] = {
+ [BPF_PROG_TEXT] = { .type = NLA_BINARY },
+ [BPF_PROG_LICENSE] = { .type = NLA_NUL_STRING },
+};
+
+static int bpf_prog_load(int prog_id, enum bpf_prog_type type,
+ struct nlattr __user *uattr, int len)
+{
+ struct nlattr *tb[BPF_PROG_ATTR_MAX + 1];
+ struct sk_filter *prog;
+ struct bpf_map *map;
+ struct nlattr *attr;
+ size_t insn_len;
+ int err, i;
+
+ if (len <= 0 || len > BPF_PROG_MAX_ATTR_SIZE)
+ return -EINVAL;
+
+ if (prog_id < 0)
+ return -EINVAL;
+
+ attr = kmalloc(len, GFP_USER);
+ if (!attr)
+ return -ENOMEM;
+
+ /* copy eBPF program from user space */
+ err = -EFAULT;
+ if (copy_from_user(attr, uattr, len) != 0)
+ goto free_attr;
+
+ /* perform basic validation */
+ err = nla_parse(tb, BPF_PROG_ATTR_MAX, attr, len, prog_policy);
+ if (err < 0)
+ goto free_attr;
+
+ err = -EINVAL;
+ /* look for mandatory license string */
+ if (!tb[BPF_PROG_LICENSE])
+ goto free_attr;
+
+ /* eBPF programs must be GPL compatible */
+ if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
+ goto free_attr;
+
+ /* look for mandatory array of eBPF instructions */
+ if (!tb[BPF_PROG_TEXT])
+ goto free_attr;
+
+ insn_len = nla_len(tb[BPF_PROG_TEXT]);
+ if (insn_len % sizeof(struct sock_filter_int) != 0 || insn_len <= 0)
+ goto free_attr;
+
+ /* plain sk_filter allocation */
+ err = -ENOMEM;
+ prog = kmalloc(sk_filter_size(insn_len), GFP_USER);
+ if (!prog)
+ goto free_attr;
+
+ prog->len = insn_len / sizeof(struct sock_filter_int);
+ memcpy(prog->insns, nla_data(tb[BPF_PROG_TEXT]), insn_len);
+ prog->orig_prog = NULL;
+ prog->jited = 0;
+ prog->ebpf = 0;
+ atomic_set(&prog->refcnt, 1);
+
+ /* allocate eBPF related auxiliary data */
+ prog->info = kzalloc(sizeof(struct bpf_prog_info), GFP_USER);
+ if (!prog->info)
+ goto free_prog;
+ prog->ebpf = 1;
+
+ /* find program type: socket_filter vs tracing_filter */
+ err = find_prog_type(type, prog);
+ if (err < 0)
+ goto free_prog;
+
+ /* lock maps to prevent any changes to maps, since eBPF program may
+ * use them. In such case bpf_check() will populate prog->used_maps
+ */
+ mutex_lock(&bpf_map_lock);
+
+ /* run eBPF verifier */
+ /* err = bpf_check(prog); */
+
+ if (err == 0 && prog->info->used_maps) {
+ /* program passed verifier and it's using some maps,
+ * hold them
+ */
+ for (i = 0; i < prog->info->used_map_cnt; i++) {
+ map = bpf_map_get(prog->info->used_maps[i]);
+ BUG_ON(!map);
+ atomic_inc(&map->refcnt);
+ }
+ }
+ mutex_unlock(&bpf_map_lock);
+
+ if (err < 0)
+ goto free_prog;
+
+ /* fixup BPF_CALL->imm field */
+ fixup_bpf_calls(prog);
+
+ /* eBPF program is ready to be JITed */
+ sk_filter_select_runtime(prog);
+
+ /* last step: grab bpf_prog_lock to allocate prog_id */
+ mutex_lock(&bpf_prog_lock);
+
+ if (bpf_prog_cnt >= MAX_BPF_PROG_CNT) {
+ mutex_unlock(&bpf_prog_lock);
+ err = -ENOSPC;
+ goto free_prog;
+ }
+ bpf_prog_cnt++;
+
+ /* allocate program id */
+ err = idr_alloc(&bpf_prog_id_idr, prog, prog_id, 0, GFP_USER);
+
+ prog->info->prog_id = err;
+
+ mutex_unlock(&bpf_prog_lock);
+
+ if (err < 0)
+ /* failed to allocate program id */
+ goto free_prog;
+
+ /* user supplied eBPF prog attributes are no longer needed */
+ kfree(attr);
+
+ return err;
+free_prog:
+ sk_filter_free(prog);
+free_attr:
+ kfree(attr);
+ return err;
+}
+
+/* called from sk_attach_filter_ebpf() or from tracing filter attach
+ * pairs with
+ * sk_detach_filter()->sk_filter_uncharge()->sk_filter_release()
+ * or with
+ * sk_unattached_filter_destroy()->sk_filter_release()
+ */
+struct sk_filter *bpf_prog_get(u32 prog_id)
+{
+ struct sk_filter *prog;
+
+ rcu_read_lock();
+ prog = idr_find(&bpf_prog_id_idr, prog_id);
+ if (prog) {
+ atomic_inc(&prog->refcnt);
+ rcu_read_unlock();
+ return prog;
+ } else {
+ rcu_read_unlock();
+ return NULL;
+ }
+}
+
+/* called from syscall */
+static int bpf_prog_unload(int prog_id)
+{
+ struct sk_filter *prog;
+
+ if (prog_id < 0)
+ return -EINVAL;
+
+ mutex_lock(&bpf_prog_lock);
+ prog = idr_find(&bpf_prog_id_idr, prog_id);
+ if (prog) {
+ WARN_ON(prog->info->prog_id != prog_id);
+ bpf_prog_cnt--;
+ idr_remove(&bpf_prog_id_idr, prog_id);
+ }
+ mutex_unlock(&bpf_prog_lock);
+
+ if (prog) {
+ sk_unattached_filter_destroy(prog);
+ return 0;
+ } else {
+ return -EINVAL;
+ }
+}
+
SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -412,6 +681,12 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
case BPF_MAP_GET_NEXT_KEY:
return map_get_next_key((int) arg2, (void __user *) arg3,
(void __user *) arg4);
+ case BPF_PROG_LOAD:
+ return bpf_prog_load((int) arg2, (enum bpf_prog_type) arg3,
+ (struct nlattr __user *) arg4, (int) arg5);
+ case BPF_PROG_UNLOAD:
+ return bpf_prog_unload((int) arg2);
+
default:
return -EINVAL;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 79d8a1b1ad75..7f7c61b4aa39 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -835,7 +835,7 @@ static void sk_release_orig_filter(struct sk_filter *fp)
{
struct sock_fprog_kern *fprog = fp->orig_prog;
- if (fprog) {
+ if (!fp->ebpf && fprog) {
kfree(fprog->filter);
kfree(fprog);
}
@@ -867,14 +867,16 @@ static void sk_filter_release(struct sk_filter *fp)
void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
{
- atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ if (!fp->ebpf)
+ atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
sk_filter_release(fp);
}
void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
{
atomic_inc(&fp->refcnt);
- atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+ if (!fp->ebpf)
+ atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
}
static struct sk_filter *__sk_migrate_realloc(struct sk_filter *fp,
@@ -978,6 +980,7 @@ static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
fp->bpf_func = NULL;
fp->jited = 0;
+ fp->ebpf = 0;
err = sk_chk_filter(fp->insns, fp->len);
if (err) {
--
1.7.9.5
The BPF syscall is a demux for different BPF-related commands.
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
The maps can be created/deleted from user space via BPF syscall:
- create a map with given id, type and attributes
map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
returns positive map id or negative error
- delete map with given map id
err = bpf_map_delete(int map_id)
returns zero or negative error
The next patch allows user space programs to populate/read maps that eBPF programs
are concurrently updating.
maps can have different types: hash, bloom filter, radix-tree, etc.
The map is defined by:
. id
. type
. max number of elements
. key size in bytes
. value size in bytes
Next patches allow eBPF programs to access maps via API:
void * bpf_map_lookup_elem(u32 map_id, void *key);
int bpf_map_update_elem(u32 map_id, void *key, void *value);
int bpf_map_delete_elem(u32 map_id, void *key);
This patch establishes core infrastructure for BPF maps.
Next patches implement lookup/update and hashtable type.
More map types can be added in the future.
The syscall uses a type-length-value style of passing arguments to stay backwards
compatible with future extensions to map attributes. Different map types may
use different attributes as well.
The concept of type-length-value is borrowed from netlink, but netlink itself
is not applicable here, since BPF programs and maps can be used in NET-less
configurations.
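For illustration (a hedged sketch mirroring bpf_create_map() in the samples; the attribute values are made up), creating a hash map through this TLV interface could look like:

    #include <linux/bpf.h>
    #include <linux/types.h>
    #include <linux/unistd.h>
    #include <unistd.h>

    struct nlattr_u32 {
            __u16 nla_len;
            __u16 nla_type;
            __u32 val;
    };

    static int create_hash_map(int map_id)
    {
            struct nlattr_u32 attr[] = {
                    { sizeof(struct nlattr_u32), BPF_MAP_KEY_SIZE,    4    },  /* 4-byte key */
                    { sizeof(struct nlattr_u32), BPF_MAP_VALUE_SIZE,  8    },  /* 8-byte value */
                    { sizeof(struct nlattr_u32), BPF_MAP_MAX_ENTRIES, 1024 },
            };

            /* returns the allocated map id or a negative error */
            return syscall(__NR_bpf, BPF_MAP_CREATE, map_id, BPF_MAP_TYPE_HASH,
                           attr, sizeof(attr));
    }

    /* ... and later: syscall(__NR_bpf, BPF_MAP_DELETE, map_id); */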
Signed-off-by: Alexei Starovoitov <[email protected]>
---
Documentation/networking/filter.txt | 69 ++++++++++
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/bpf.h | 44 +++++++
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/bpf.h | 29 +++++
kernel/bpf/Makefile | 2 +-
kernel/bpf/syscall.c | 238 +++++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 3 +
9 files changed, 390 insertions(+), 2 deletions(-)
create mode 100644 include/linux/bpf.h
create mode 100644 kernel/bpf/syscall.c
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index ee78eba78a9d..e14e486f69cd 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -995,6 +995,75 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.
+eBPF maps
+---------
+'maps' is a generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via BPF syscall, which has commands:
+- create a map with given id, type and attributes
+ map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
+ returns positive map id or negative error
+
+- delete map with given map id
+ err = bpf_map_delete(int map_id)
+ returns zero or negative error
+
+- lookup key in a given map referenced by map_id
+ err = bpf_map_lookup_elem(int map_id, void *key, void *value)
+ returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+ err = bpf_map_update_elem(int map_id, void *key, void *value)
+ returns zero or negative error
+
+- find and delete element by key in a given map
+ err = bpf_map_delete_elem(int map_id, void *key)
+
+userspace programs use this API to create/populate/read maps that eBPF programs
+are concurrently updating.
+
+maps can have different types: hash, bloom filter, radix-tree, etc.
+
+The map is defined by:
+ . id
+ . type
+ . max number of elements
+ . key size in bytes
+ . value size in bytes
+
+The maps are accessible from eBPF programs via this API:
+ void * bpf_map_lookup_elem(u32 map_id, void *key);
+ int bpf_map_update_elem(u32 map_id, void *key, void *value);
+ int bpf_map_delete_elem(u32 map_id, void *key);
+
+If eBPF verifier is configured to recognize extra calls in the program
+bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like:
+ ...
+ ptr_to_value = map_lookup_elem(const_int_map_id, key)
+ access memory [ptr_to_value, ptr_to_value + value_size_in_bytes]
+ ...
+ prepare key2 and value2 on stack of key_size and value_size
+ err = map_update_elem(const_int_map_id2, key2, value2)
+ ...
+
+eBPF program cannot create or delete maps
+(such calls will be unknown to verifier)
+
+During program loading the refcnt of used maps is incremented, so they don't get
+deleted while program is running
+
+bpf_map_update_elem() can fail if the maximum number of elements is reached.
+If key2 already exists, bpf_map_update_elem() replaces it with value2 atomically.
+
+bpf_map_lookup_elem() can return null or ptr_to_value
+ptr_to_value is read/write from the program point of view.
+
+The verifier will check that the program accesses map elements within specified
+size. It will not let programs pass junk values as 'key' and 'value' to
+bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
+can safely access the pointers in all cases.
+
Testing
-------
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..edbb8460e1b5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common bpf sys_bpf
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
new file mode 100644
index 000000000000..6448b9beea89
--- /dev/null
+++ b/include/linux/bpf.h
@@ -0,0 +1,44 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_BPF_H
+#define _LINUX_BPF_H 1
+
+#include <uapi/linux/bpf.h>
+#include <linux/workqueue.h>
+
+struct bpf_map;
+struct nlattr;
+
+/* map is a generic key/value storage optionally accessible by eBPF programs */
+struct bpf_map_ops {
+ /* funcs callable from userspace (via syscall) */
+ struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
+ void (*map_free)(struct bpf_map *);
+};
+
+struct bpf_map {
+ atomic_t refcnt;
+ bool deleted;
+ int map_id;
+ enum bpf_map_type map_type;
+ u32 key_size;
+ u32 value_size;
+ u32 max_entries;
+ struct bpf_map_ops *ops;
+ struct work_struct work;
+};
+
+struct bpf_map_type_list {
+ struct list_head list_node;
+ struct bpf_map_ops *ops;
+ enum bpf_map_type type;
+};
+
+void bpf_register_map_type(struct bpf_map_type_list *tl);
+struct bpf_map *bpf_map_get(u32 map_id);
+
+#endif /* _LINUX_BPF_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0ed322..2b524aeba262 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 333640608087..41e20f8fb87e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -699,9 +699,11 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
__SYSCALL(__NR_sched_getattr, sys_sched_getattr)
#define __NR_renameat2 276
__SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_bpf 277
+__SYSCALL(__NR_bpf, sys_bpf)
#undef __NR_syscalls
-#define __NR_syscalls 277
+#define __NR_syscalls 278
/*
* All syscalls below here should go away really,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 439d64a07eff..04374e57c290 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -302,4 +302,33 @@ struct sock_filter_int {
__s32 imm; /* signed immediate constant */
};
+/* BPF syscall commands */
+enum bpf_cmd {
+ /* create a map with given id, type and attributes
+ * map_id = bpf_map_create(int map_id, bpf_map_type, struct nlattr *attr, int len)
+ * returns positive map id or negative error
+ */
+ BPF_MAP_CREATE,
+
+ /* delete map with given map id
+ * err = bpf_map_delete(int map_id)
+ * returns zero or negative error
+ */
+ BPF_MAP_DELETE,
+};
+
+enum bpf_map_attributes {
+ BPF_MAP_UNSPEC,
+ BPF_MAP_KEY_SIZE, /* size of key in bytes */
+ BPF_MAP_VALUE_SIZE, /* size of value in bytes */
+ BPF_MAP_MAX_ENTRIES, /* maximum number of entries in a map */
+ __BPF_MAP_ATTR_MAX,
+};
+#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1)
+#define BPF_MAP_MAX_ATTR_SIZE 65535
+
+enum bpf_map_type {
+ BPF_MAP_TYPE_UNSPEC,
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6a71145e2769..e9f7334ed07a 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o
+obj-y := core.o syscall.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
new file mode 100644
index 000000000000..b9509923b16f
--- /dev/null
+++ b/kernel/bpf/syscall.c
@@ -0,0 +1,238 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/syscalls.h>
+#include <net/netlink.h>
+
+/* mutex to protect insertion/deletion of map_id in IDR */
+static DEFINE_MUTEX(bpf_map_lock);
+static DEFINE_IDR(bpf_map_id_idr);
+
+/* maximum number of outstanding maps */
+#define MAX_BPF_MAP_CNT 1024
+static u32 bpf_map_cnt;
+
+static LIST_HEAD(bpf_map_types);
+
+static struct bpf_map *find_and_alloc_map(enum bpf_map_type type,
+ struct nlattr *tb[BPF_MAP_ATTR_MAX + 1])
+{
+ struct bpf_map_type_list *tl;
+ struct bpf_map *map;
+
+ list_for_each_entry(tl, &bpf_map_types, list_node) {
+ if (tl->type == type) {
+ map = tl->ops->map_alloc(tb);
+ if (IS_ERR(map))
+ return map;
+ map->ops = tl->ops;
+ map->map_type = type;
+ return map;
+ }
+ }
+ return ERR_PTR(-EINVAL);
+}
+
+/* boot time registration of different map implementations */
+void bpf_register_map_type(struct bpf_map_type_list *tl)
+{
+ list_add(&tl->list_node, &bpf_map_types);
+}
+
+static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = {
+ [BPF_MAP_KEY_SIZE] = { .type = NLA_U32 },
+ [BPF_MAP_VALUE_SIZE] = { .type = NLA_U32 },
+ [BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 },
+};
+
+/* called via syscall */
+static int map_create(int map_id, enum bpf_map_type type,
+ struct nlattr __user *uattr, int len)
+{
+ struct nlattr *tb[BPF_MAP_ATTR_MAX + 1];
+ struct bpf_map *map;
+ struct nlattr *attr;
+ int err;
+
+ if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE)
+ return -EINVAL;
+
+ if (map_id < 0)
+ return -EINVAL;
+
+ attr = kmalloc(len, GFP_USER);
+ if (!attr)
+ return -ENOMEM;
+
+ /* copy map attributes from user space */
+ err = -EFAULT;
+ if (copy_from_user(attr, uattr, len) != 0)
+ goto free_attr;
+
+ /* perform basic validation */
+ err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy);
+ if (err < 0)
+ goto free_attr;
+
+ /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+ map = find_and_alloc_map(type, tb);
+ if (IS_ERR(map)) {
+ err = PTR_ERR(map);
+ goto free_attr;
+ }
+
+ atomic_set(&map->refcnt, 1);
+ map->deleted = false;
+
+ mutex_lock(&bpf_map_lock);
+
+ if (bpf_map_cnt >= MAX_BPF_MAP_CNT) {
+ mutex_unlock(&bpf_map_lock);
+ err = -ENOSPC;
+ goto free_map;
+ }
+ bpf_map_cnt++;
+
+ /* allocate map id */
+ err = idr_alloc(&bpf_map_id_idr, map, map_id, 0, GFP_USER);
+
+ map->map_id = err;
+
+ mutex_unlock(&bpf_map_lock);
+
+ if (err < 0)
+ /* failed to allocate map id */
+ goto free_map;
+
+ /* user supplied array of map attributes is no longer needed */
+ kfree(attr);
+
+ return err;
+
+free_map:
+ map->ops->map_free(map);
+free_attr:
+ kfree(attr);
+ return err;
+}
+
+/* called from workqueue */
+static void bpf_map_free_deferred(struct work_struct *work)
+{
+ struct bpf_map *map = container_of(work, struct bpf_map, work);
+
+ /* grab the mutex and free the map */
+ mutex_lock(&bpf_map_lock);
+
+ bpf_map_cnt--;
+ idr_remove(&bpf_map_id_idr, map->map_id);
+
+ mutex_unlock(&bpf_map_lock);
+
+ /* implementation dependent freeing */
+ map->ops->map_free(map);
+}
+
+/* decrement map refcnt and schedule it for freeing via workqueue
+ * (underlying map implementation ops->map_free() might sleep)
+ */
+static void __bpf_map_put(struct bpf_map *map)
+{
+ if (atomic_dec_and_test(&map->refcnt)) {
+ INIT_WORK(&map->work, bpf_map_free_deferred);
+ schedule_work(&map->work);
+ }
+}
+
+/* find map by id and decrement its refcnt
+ *
+ * can be called without any locks held
+ *
+ * returns true if map was found
+ */
+static bool bpf_map_put(u32 map_id)
+{
+ struct bpf_map *map;
+
+ rcu_read_lock();
+ map = idr_find(&bpf_map_id_idr, map_id);
+
+ if (!map) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ __bpf_map_put(map);
+ rcu_read_unlock();
+
+ return true;
+}
+
+/* called with bpf_map_lock held */
+struct bpf_map *bpf_map_get(u32 map_id)
+{
+ BUG_ON(!mutex_is_locked(&bpf_map_lock));
+
+ return idr_find(&bpf_map_id_idr, map_id);
+}
+
+/* called via syscall */
+static int map_delete(int map_id)
+{
+ struct bpf_map *map;
+
+ if (map_id < 0)
+ return -EINVAL;
+
+ mutex_lock(&bpf_map_lock);
+ map = bpf_map_get(map_id);
+
+ if (!map) {
+ /* user is trying to delete map_id that doesn't exist */
+ mutex_unlock(&bpf_map_lock);
+ return -ENODEV;
+ }
+
+ if (map->deleted) {
+ /* this map was already deleted */
+ mutex_unlock(&bpf_map_lock);
+ return 0;
+ }
+
+ /* first time deleting the map
+ * we cannot just remove this map_id from IDR, since program might
+ * still be using this map_id, so just mark it deleted,
+ * when refcnt goes to zero, it will be deleted from IDR
+ */
+ map->deleted = true;
+ __bpf_map_put(map);
+ mutex_unlock(&bpf_map_lock);
+ return 0;
+}
+
+SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
+ unsigned long, arg4, unsigned long, arg5)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ switch (cmd) {
+ case BPF_MAP_CREATE:
+ return map_create((int) arg2, (enum bpf_map_type) arg3,
+ (struct nlattr __user *) arg4, (int) arg5);
+ case BPF_MAP_DELETE:
+ return map_delete((int) arg2);
+
+ default:
+ return -EINVAL;
+ }
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b51b5df..877c9aafbfb4 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -213,3 +213,6 @@ cond_syscall(compat_sys_open_by_handle_at);
/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* access BPF programs and maps */
+cond_syscall(sys_bpf);
--
1.7.9.5
Signed-off-by: Alexei Starovoitov <[email protected]>
---
MAINTAINERS | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 48f4ef44b252..ebd831cd1a25 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1881,6 +1881,15 @@ S: Supported
F: drivers/net/bonding/
F: include/uapi/linux/if_bonding.h
+BPF
+M: Alexei Starovoitov <[email protected]>
+L: [email protected]
+L: [email protected]
+S: Supported
+F: kernel/bpf/
+F: include/uapi/linux/bpf.h
+F: include/linux/bpf.h
+
BROADCOM B44 10/100 ETHERNET DRIVER
M: Gary Zambrano <[email protected]>
L: [email protected]
--
1.7.9.5
eBPF can be used from user space.
uapi/linux/bpf.h: eBPF instruction set definition
linux/filter.h: the rest
This patch only moves macro definitions, but practically it freezes the existing
eBPF instruction set, though new instructions can still be added in the future.
These eBPF definitions cannot go into uapi/linux/filter.h, since the names
may conflict with existing applications.
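As a small illustration (not part of the patch), with these definitions exported a
user space tool can build eBPF instructions directly from the macros; a minimal
'return 0' program would look roughly like this, assuming <linux/filter.h> is
still included for the classic BPF_K/BPF_JMP style fields the macros reuse:

  #include <linux/filter.h>  /* classic BPF_K, BPF_JMP, ... */
  #include <linux/bpf.h>     /* struct sock_filter_int, BPF_MOV, BPF_REG_0, ... */

  void build_prog(void)
  {
          struct sock_filter_int prog[] = {
                  BPF_MOV64_IMM(BPF_REG_0, 0),  /* r0 = 0 (return value) */
                  BPF_EXIT_INSN(),              /* return r0 */
          };

          /* prog[] now holds 2 eBPF instructions ready to be loaded */
          (void) prog;
  }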
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/filter.h | 294 +------------------------------------------
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/bpf.h | 305 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 307 insertions(+), 293 deletions(-)
create mode 100644 include/uapi/linux/bpf.h
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a7e3c48d73a7..6766577635ff 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -8,303 +8,11 @@
#include <linux/compat.h>
#include <linux/workqueue.h>
#include <uapi/linux/filter.h>
-
-/* Internally used and optimized filter representation with extended
- * instruction set based on top of classic BPF.
- */
-
-/* instruction classes */
-#define BPF_ALU64 0x07 /* alu mode in double word width */
-
-/* ld/ldx fields */
-#define BPF_DW 0x18 /* double word */
-#define BPF_XADD 0xc0 /* exclusive add */
-
-/* alu/jmp fields */
-#define BPF_MOV 0xb0 /* mov reg to reg */
-#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
-
-/* change endianness of a register */
-#define BPF_END 0xd0 /* flags for endianness conversion: */
-#define BPF_TO_LE 0x00 /* convert to little-endian */
-#define BPF_TO_BE 0x08 /* convert to big-endian */
-#define BPF_FROM_LE BPF_TO_LE
-#define BPF_FROM_BE BPF_TO_BE
-
-#define BPF_JNE 0x50 /* jump != */
-#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
-#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
-#define BPF_CALL 0x80 /* function call */
-#define BPF_EXIT 0x90 /* function return */
-
-/* Register numbers */
-enum {
- BPF_REG_0 = 0,
- BPF_REG_1,
- BPF_REG_2,
- BPF_REG_3,
- BPF_REG_4,
- BPF_REG_5,
- BPF_REG_6,
- BPF_REG_7,
- BPF_REG_8,
- BPF_REG_9,
- BPF_REG_10,
- __MAX_BPF_REG,
-};
-
-/* BPF has 10 general purpose 64-bit registers and stack frame. */
-#define MAX_BPF_REG __MAX_BPF_REG
-
-/* ArgX, context and stack frame pointer register positions. Note,
- * Arg1, Arg2, Arg3, etc are used as argument mappings of function
- * calls in BPF_CALL instruction.
- */
-#define BPF_REG_ARG1 BPF_REG_1
-#define BPF_REG_ARG2 BPF_REG_2
-#define BPF_REG_ARG3 BPF_REG_3
-#define BPF_REG_ARG4 BPF_REG_4
-#define BPF_REG_ARG5 BPF_REG_5
-#define BPF_REG_CTX BPF_REG_6
-#define BPF_REG_FP BPF_REG_10
-
-/* Additional register mappings for converted user programs. */
-#define BPF_REG_A BPF_REG_0
-#define BPF_REG_X BPF_REG_7
-#define BPF_REG_TMP BPF_REG_8
-
-/* BPF program can access up to 512 bytes of stack space. */
-#define MAX_BPF_STACK 512
-
-/* Helper macros for filter block array initializers. */
-
-/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
-
-#define BPF_ALU64_REG(OP, DST, SRC) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-#define BPF_ALU32_REG(OP, DST, SRC) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU | BPF_OP(OP) | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
-
-#define BPF_ALU64_IMM(OP, DST, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-#define BPF_ALU32_IMM(OP, DST, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU | BPF_OP(OP) | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
-
-#define BPF_ENDIAN(TYPE, DST, LEN) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = LEN })
-
-/* Short form of mov, dst_reg = src_reg */
-
-#define BPF_MOV64_REG(DST, SRC) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU64 | BPF_MOV | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-#define BPF_MOV32_REG(DST, SRC) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU | BPF_MOV | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = 0 })
-
-/* Short form of mov, dst_reg = imm32 */
-
-#define BPF_MOV64_IMM(DST, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU64 | BPF_MOV | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-#define BPF_MOV32_IMM(DST, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU | BPF_MOV | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
-
-#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = IMM })
-
-#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = IMM })
-
-/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
-
-#define BPF_LD_ABS(SIZE, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
- .dst_reg = 0, \
- .src_reg = 0, \
- .off = 0, \
- .imm = IMM })
-
-/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
-
-#define BPF_LD_IND(SIZE, SRC, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \
- .dst_reg = 0, \
- .src_reg = SRC, \
- .off = 0, \
- .imm = IMM })
-
-/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
-
-#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \
- ((struct sock_filter_int) { \
- .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
-
-#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
- ((struct sock_filter_int) { \
- .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
-
-#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \
- ((struct sock_filter_int) { \
- .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = OFF, \
- .imm = IMM })
-
-/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
-
-#define BPF_JMP_REG(OP, DST, SRC, OFF) \
- ((struct sock_filter_int) { \
- .code = BPF_JMP | BPF_OP(OP) | BPF_X, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = 0 })
-
-/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
-
-#define BPF_JMP_IMM(OP, DST, IMM, OFF) \
- ((struct sock_filter_int) { \
- .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
- .dst_reg = DST, \
- .src_reg = 0, \
- .off = OFF, \
- .imm = IMM })
-
-/* Function call */
-
-#define BPF_EMIT_CALL(FUNC) \
- ((struct sock_filter_int) { \
- .code = BPF_JMP | BPF_CALL, \
- .dst_reg = 0, \
- .src_reg = 0, \
- .off = 0, \
- .imm = ((FUNC) - __bpf_call_base) })
-
-/* Raw code statement block */
-
-#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
- ((struct sock_filter_int) { \
- .code = CODE, \
- .dst_reg = DST, \
- .src_reg = SRC, \
- .off = OFF, \
- .imm = IMM })
-
-/* Program exit */
-
-#define BPF_EXIT_INSN() \
- ((struct sock_filter_int) { \
- .code = BPF_JMP | BPF_EXIT, \
- .dst_reg = 0, \
- .src_reg = 0, \
- .off = 0, \
- .imm = 0 })
-
-#define bytes_to_bpf_size(bytes) \
-({ \
- int bpf_size = -EINVAL; \
- \
- if (bytes == sizeof(u8)) \
- bpf_size = BPF_B; \
- else if (bytes == sizeof(u16)) \
- bpf_size = BPF_H; \
- else if (bytes == sizeof(u32)) \
- bpf_size = BPF_W; \
- else if (bytes == sizeof(u64)) \
- bpf_size = BPF_DW; \
- \
- bpf_size; \
-})
+#include <uapi/linux/bpf.h>
/* Macro to invoke filter function. */
#define SK_RUN_FILTER(filter, ctx) (*filter->bpf_func)(ctx, filter->insnsi)
-struct sock_filter_int {
- __u8 code; /* opcode */
- __u8 dst_reg:4; /* dest register */
- __u8 src_reg:4; /* source register */
- __s16 off; /* signed offset */
- __s32 imm; /* signed immediate constant */
-};
-
#ifdef CONFIG_COMPAT
/* A struct sock_filter is architecture independent. */
struct compat_sock_fprog {
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 24e9033f8b3f..fb3f7b675229 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -67,6 +67,7 @@ header-y += bfs_fs.h
header-y += binfmts.h
header-y += blkpg.h
header-y += blktrace_api.h
+header-y += bpf.h
header-y += bpqether.h
header-y += bsg.h
header-y += btrfs.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
new file mode 100644
index 000000000000..439d64a07eff
--- /dev/null
+++ b/include/uapi/linux/bpf.h
@@ -0,0 +1,305 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_H__
+#define _UAPI__LINUX_BPF_H__
+
+#include <linux/types.h>
+
+/* Internally used and optimized filter representation with extended
+ * instruction set based on top of classic BPF.
+ */
+
+/* instruction classes */
+#define BPF_ALU64 0x07 /* alu mode in double word width */
+
+/* ld/ldx fields */
+#define BPF_DW 0x18 /* double word */
+#define BPF_XADD 0xc0 /* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_MOV 0xb0 /* mov reg to reg */
+#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
+
+/* change endianness of a register */
+#define BPF_END 0xd0 /* flags for endianness conversion: */
+#define BPF_TO_LE 0x00 /* convert to little-endian */
+#define BPF_TO_BE 0x08 /* convert to big-endian */
+#define BPF_FROM_LE BPF_TO_LE
+#define BPF_FROM_BE BPF_TO_BE
+
+#define BPF_JNE 0x50 /* jump != */
+#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
+#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
+#define BPF_CALL 0x80 /* function call */
+#define BPF_EXIT 0x90 /* function return */
+
+/* Register numbers */
+enum {
+ BPF_REG_0 = 0,
+ BPF_REG_1,
+ BPF_REG_2,
+ BPF_REG_3,
+ BPF_REG_4,
+ BPF_REG_5,
+ BPF_REG_6,
+ BPF_REG_7,
+ BPF_REG_8,
+ BPF_REG_9,
+ BPF_REG_10,
+ __MAX_BPF_REG,
+};
+
+/* BPF has 10 general purpose 64-bit registers and stack frame. */
+#define MAX_BPF_REG __MAX_BPF_REG
+
+/* ArgX, context and stack frame pointer register positions. Note,
+ * Arg1, Arg2, Arg3, etc are used as argument mappings of function
+ * calls in BPF_CALL instruction.
+ */
+#define BPF_REG_ARG1 BPF_REG_1
+#define BPF_REG_ARG2 BPF_REG_2
+#define BPF_REG_ARG3 BPF_REG_3
+#define BPF_REG_ARG4 BPF_REG_4
+#define BPF_REG_ARG5 BPF_REG_5
+#define BPF_REG_CTX BPF_REG_6
+#define BPF_REG_FP BPF_REG_10
+
+/* Additional register mappings for converted user programs. */
+#define BPF_REG_A BPF_REG_0
+#define BPF_REG_X BPF_REG_7
+#define BPF_REG_TMP BPF_REG_8
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK 512
+
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_MOV32_REG(DST, SRC) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \
+ .dst_reg = 0, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct sock_filter_int) { \
+ .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct sock_filter_int) { \
+ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \
+ ((struct sock_filter_int) { \
+ .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF) \
+ ((struct sock_filter_int) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF) \
+ ((struct sock_filter_int) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC) \
+ ((struct sock_filter_int) { \
+ .code = BPF_JMP | BPF_CALL, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
+ ((struct sock_filter_int) { \
+ .code = CODE, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN() \
+ ((struct sock_filter_int) { \
+ .code = BPF_JMP | BPF_EXIT, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = 0 })
+
+#define bytes_to_bpf_size(bytes) \
+({ \
+ int bpf_size = -EINVAL; \
+ \
+ if (bytes == sizeof(u8)) \
+ bpf_size = BPF_B; \
+ else if (bytes == sizeof(u16)) \
+ bpf_size = BPF_H; \
+ else if (bytes == sizeof(u32)) \
+ bpf_size = BPF_W; \
+ else if (bytes == sizeof(u64)) \
+ bpf_size = BPF_DW; \
+ \
+ bpf_size; \
+})
+
+struct sock_filter_int {
+ __u8 code; /* opcode */
+ __u8 dst_reg:4; /* dest register */
+ __u8 src_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
+
+#endif /* _UAPI__LINUX_BPF_H__ */
--
1.7.9.5
On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
> BPF syscall is a demux for different BPF related commands.
>
> 'maps' is a generic storage of different types for sharing data between kernel
> and userspace.
>
> The maps can be created/deleted from user space via BPF syscall:
> - create a map with given id, type and attributes
> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
> returns positive map id or negative error
>
> - delete map with given map id
> err = bpf_map_delete(int map_id)
> returns zero or negative error
What's the scope of "id"? How is it secured?
This question is brought to you by keyctl, which is terminally fucked.
At some point I'll generate some proof of concept exploits for severe
bugs caused by misdesign of a namespace.
--Andy
Add MAINTAINERS entry.
On Fri, 2014-06-27 at 17:05 -0700, Alexei Starovoitov wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
[]
> @@ -1881,6 +1881,15 @@ S: Supported
> F: drivers/net/bonding/
> F: include/uapi/linux/if_bonding.h
>
> +BPF
While a lot of people know what BPF is, I think it'd
be better to have something like
BPF - SOCKET FILTER (Berkeley Packet Filter like)
> +M: Alexei Starovoitov <[email protected]>
> +L: [email protected]
> +L: [email protected]
> +S: Supported
> +F: kernel/bpf/
> +F: include/uapi/linux/bpf.h
> +F: include/linux/bpf.h
On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
> eBPF programs are safe run-to-completion functions with load/unload
> methods from userspace similar to kernel modules.
>
> User space API:
>
> - load eBPF program
> prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
>
> where 'prog' is a sequence of sections (currently TEXT and LICENSE)
> TEXT - array of eBPF instructions
> LICENSE - GPL compatible
> +
> + err = -EINVAL;
> + /* look for mandatory license string */
> + if (!tb[BPF_PROG_LICENSE])
> + goto free_attr;
> +
> + /* eBPF programs must be GPL compatible */
> + if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
> + goto free_attr;
Seriously? My mind boggles.
--Andy
On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov <[email protected]> wrote:
> this socket filter example does:
>
> - creates a hashtable in kernel with key 4 bytes and value 8 bytes
>
> - populates map[6] = 0; map[17] = 0; // 6 - tcp_proto, 17 - udp_proto
>
> - loads eBPF program:
> r0 = skb[14 + 9]; // load one byte of ip->proto
> *(u32*)(fp - 4) = r0;
> value = bpf_map_lookup_elem(map_id, fp - 4);
> if (value)
> (*(u64*)value) += 1;
In the code below, this is XADD. Is there anything that validates
that shared things like this can only be poked at by atomic
operations?
--Andy
On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>> BPF syscall is a demux for different BPF related commands.
>>
>> 'maps' is a generic storage of different types for sharing data between kernel
>> and userspace.
>>
>> The maps can be created/deleted from user space via BPF syscall:
>> - create a map with given id, type and attributes
>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>> returns positive map id or negative error
>>
>> - delete map with given map id
>> err = bpf_map_delete(int map_id)
>> returns zero or negative error
>
> What's the scope of "id"? How is it secured?
the map and program id space is global and it's cap_sys_admin only.
There is no pressing need to do it with per-user limits.
So the whole thing is root only for now.
Since I got your attention please review the most interesting
verifier bits (patch 08/14) ;)
On Fri, Jun 27, 2014 at 5:18 PM, Joe Perches <[email protected]> wrote:
> Add MAINTAINERS entry.
>
> On Fri, 2014-06-27 at 17:05 -0700, Alexei Starovoitov wrote:
>> diff --git a/MAINTAINERS b/MAINTAINERS
> []
>> @@ -1881,6 +1881,15 @@ S: Supported
>> F: drivers/net/bonding/
>> F: include/uapi/linux/if_bonding.h
>>
>> +BPF
>
> While a lot of people know what BPF is, I think it'd
> be better to have something like
>
> BPF - SOCKET FILTER (Berkeley Packet Filter like)
BPF is indeed succinct, but a 'socket filter' suffix would be misleading,
since it's way more than just socket filtering.
Maybe: "BPF (Safe dynamic programs and tools)",
since 'perf' will become 'stap/dtrace'-like based on this infra.
On Fri, Jun 27, 2014 at 5:19 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>> eBPF programs are safe run-to-completion functions with load/unload
>> methods from userspace similar to kernel modules.
>>
>> User space API:
>>
>> - load eBPF program
>> prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
>>
>> where 'prog' is a sequence of sections (currently TEXT and LICENSE)
>> TEXT - array of eBPF instructions
>> LICENSE - GPL compatible
>> +
>> + err = -EINVAL;
>> + /* look for mandatory license string */
>> + if (!tb[BPF_PROG_LICENSE])
>> + goto free_attr;
>> +
>> + /* eBPF programs must be GPL compatible */
>> + if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
>> + goto free_attr;
>
> Seriously? My mind boggles.
Yes. Quite a bit of logic can fit into one eBPF program. I don't think it's wise
to leave this door open for abuse. This check makes it clear that if you
write a program in C, the source code must be available.
If a program is written in assembler then this check is a nop anyway.
btw this patch doesn't include debugfs access to all loaded eBPF programs.
Similarly to kernel modules I'm planning to have a way to list all loaded
programs with optional assembler dump of instructions.
On Fri, Jun 27, 2014 at 5:21 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov <[email protected]> wrote:
>> this socket filter example does:
>>
>> - creates a hashtable in kernel with key 4 bytes and value 8 bytes
>>
>> - populates map[6] = 0; map[17] = 0; // 6 - tcp_proto, 17 - udp_proto
>>
>> - loads eBPF program:
>> r0 = skb[14 + 9]; // load one byte of ip->proto
>> *(u32*)(fp - 4) = r0;
>> value = bpf_map_lookup_elem(map_id, fp - 4);
>> if (value)
>> (*(u64*)value) += 1;
>
> In the code below, this is XADD. Is there anything that validates
> that shared things like this can only be poked at by atomic
> operations?
Correct. The asm code uses xadd to increment packet stats.
It's up to the program itself to decide what it's doing.
Some programs may prefer speed over accuracy when counting
and they will use a regular "ld, add, st" sequence instead of xadd.
The verifier checks that programs can only access a valid memory
region. The program itself needs to do something sensible with it.
Theoretically I could add a check to the verifier that shared map elements
are read-only or xadd-only, but that limits usability and is unnecessary.
We actually do have a use case when we do a regular add, since
'lock add' is too costly at high event rates.
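For reference, with the instruction macros in this series the two variants would
look roughly like this (a sketch only; it assumes r0 already holds a non-NULL
value pointer returned by bpf_map_lookup_elem()):

  /* atomic variant: lock xadd *(u64 *)(r0 + 0) += r1 */
  BPF_MOV64_IMM(BPF_REG_1, 1),
  BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),

  /* non-atomic variant: faster, but concurrent updates may be lost
   * r1 = *(u64 *)(r0 + 0); r1 += 1; *(u64 *)(r0 + 0) = r1
   */
  BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1),
  BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),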
On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>> BPF syscall is a demux for different BPF related commands.
>>>
>>> 'maps' is a generic storage of different types for sharing data between kernel
>>> and userspace.
>>>
>>> The maps can be created/deleted from user space via BPF syscall:
>>> - create a map with given id, type and attributes
>>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>> returns positive map id or negative error
>>>
>>> - delete map with given map id
>>> err = bpf_map_delete(int map_id)
>>> returns zero or negative error
>>
>> What's the scope of "id"? How is it secured?
>
> the map and program id space is global and it's cap_sys_admin only.
> There is no pressing need to do it with per-user limits.
> So the whole thing is root only for now.
>
Hmm. This may be unpleasant if you ever want to support non-root or
namespaced operation.
How hard would it be to give these things fds?
> Since I got your attention please review the most interesting
> verifier bits (patch 08/14) ;)
Will do. Or at least I'll try :)
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
On Fri, Jun 27, 2014 at 11:12 PM, Alexei Starovoitov <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 5:19 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>> eBPF programs are safe run-to-completion functions with load/unload
>>> methods from userspace similar to kernel modules.
>>>
>>> User space API:
>>>
>>> - load eBPF program
>>> prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
>>>
>>> where 'prog' is a sequence of sections (currently TEXT and LICENSE)
>>> TEXT - array of eBPF instructions
>>> LICENSE - GPL compatible
>>> +
>>> + err = -EINVAL;
>>> + /* look for mandatory license string */
>>> + if (!tb[BPF_PROG_LICENSE])
>>> + goto free_attr;
>>> +
>>> + /* eBPF programs must be GPL compatible */
>>> + if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
>>> + goto free_attr;
>>
>> Seriously? My mind boggles.
>
> Yes. Quite a bit of logic can fit into one eBPF program. I don't think it's wise
> to leave this door open for abuse. This check makes it clear that if you
> write a program in C, the source code must be available.
> If a program is written in assembler then this check is a nop anyway.
>
I can see this seriously annoying lots of users. For example,
Chromium might object.
If you want to add GPL-only functions in the future, that would be one
thing. But if someone writes a nice eBPF compiler, and someone else
writes a little program that filters on network packets, I see no
reason to claim that the little program is a derivative work of the
kernel and therefore must be GPL.
> btw this patch doesn't include debugfs access to all loaded eBPF programs.
> Similarly to kernel modules I'm planning to have a way to list all loaded
> programs with optional assembler dump of instructions.
Users can also dump running programs with ptrace. That doesn't mean
that all loaded programs need to be GPL.
--Andy
On Fri, Jun 27, 2014 at 11:25 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>>> BPF syscall is a demux for different BPF related commands.
>>>>
>>>> 'maps' is a generic storage of different types for sharing data between kernel
>>>> and userspace.
>>>>
>>>> The maps can be created/deleted from user space via BPF syscall:
>>>> - create a map with given id, type and attributes
>>>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>>> returns positive map id or negative error
>>>>
>>>> - delete map with given map id
>>>> err = bpf_map_delete(int map_id)
>>>> returns zero or negative error
>>>
>>> What's the scope of "id"? How is it secured?
>>
>> the map and program id space is global and it's cap_sys_admin only.
>> There is no pressing need to do it with per-user limits.
>> So the whole thing is root only for now.
>>
>
> Hmm. This may be unpleasant if you ever want to support non-root or
> namespaced operation.
I think it will be easy to extend it per namespace when we lift
root-only restriction. It will be seamless without user api changes.
> How hard would it be to give these things fds?
you mean programs/maps auto-terminate when the creator process
exits? I thought about it and it's appealing at first glance, but it
doesn't fit the model of existing tracepoint events, which are global.
The programs attached to events need to live without a 'daemon'
hanging around. Therefore I picked a 'kernel module'-like method.
On Fri, Jun 27, 2014 at 11:28 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:12 PM, Alexei Starovoitov <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 5:19 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>>> eBPF programs are safe run-to-completion functions with load/unload
>>>> methods from userspace similar to kernel modules.
>>>>
>>>> User space API:
>>>>
>>>> - load eBPF program
>>>> prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
>>>>
>>>> where 'prog' is a sequence of sections (currently TEXT and LICENSE)
>>>> TEXT - array of eBPF instructions
>>>> LICENSE - GPL compatible
>>>> +
>>>> + err = -EINVAL;
>>>> + /* look for mandatory license string */
>>>> + if (!tb[BPF_PROG_LICENSE])
>>>> + goto free_attr;
>>>> +
>>>> + /* eBPF programs must be GPL compatible */
>>>> + if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
>>>> + goto free_attr;
>>>
>>> Seriously? My mind boggles.
>>
>> Yes. Quite a bit of logic can fit into one eBPF program. I don't think it's wise
>> to leave this door open for abuse. This check makes it clear that if you
>> write a program in C, the source code must be available.
>> If a program is written in assembler then this check is a nop anyway.
>>
>
> I can see this seriously annoying lots of users. For example,
> Chromium might object.
chrome/seccomp generated programs are an exception. They
really don't have source code. Quite large classic BPF programs
are generated out of a decision tree driven by the seccomp library. Here we
obviously cannot say that such a bpf-generating library must be GPLed,
just like LLVM that emits eBPF code is not under GPL.
So chrome should be fine generating eBPF as well.
> If you want to add GPL-only functions in the future, that would be one
> thing. But if someone writes a nice eBPF compiler, and someone else
> writes a little program that filters on network packets, I see no
> reason to claim that the little program is a derivative work of the
> kernel and therefore must be GPL.
I think we have to draw a line somewhere. Say, tomorrow I want
to modify libpcap to emit eBPF based on existing tcpdump syntax.
Would it mean that tcpdump filter strings are GPLed? Definitely not,
since they existed before and can function without new libpcap.
But if I write a new packet filtering program in C, compile it
using LLVM->eBPF and call into in-kernel helper functions
(like bpf_map_lookup_elem()), I think it's exactly the derivative work.
It's analogous to kernel modules. If module wants to call
export_symbol_gpl() functions, it needs to be GPLed. Here all helper
functions are GPL. So we just have a blank check for eBPF program.
Having said that, I can relax it a little by adding 'export_symbol(_gpl)?'
equivalent markings to all helper functions. Then an eBPF program that
doesn't call any functions at all can run under a non-free license.
But before I do that, I'd like hear others.
On Sat, Jun 28, 2014 at 12:26:14AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 27, 2014 at 11:28 PM, Andy Lutomirski <[email protected]> wrote:
> > On Fri, Jun 27, 2014 at 11:12 PM, Alexei Starovoitov <[email protected]> wrote:
> > If you want to add GPL-only functions in the future, that would be one
> > thing. But if someone writes a nice eBPF compiler, and someone else
> > writes a little program that filters on network packets, I see no
> > reason to claim that the little program is a derivative work of the
> > kernel and therefore must be GPL.
>
> I think we have to draw a line somewhere. Say, tomorrow I want
> to modify libpcap to emit eBPF based on existing tcpdump syntax.
> Would it mean that tcpdump filter strings are GPLed? Definitely not,
> since they existed before and can function without new libpcap.
> But if I write a new packet filtering program in C, compile it
> using LLVM->eBPF and call into in-kernel helper functions
> (like bpf_map_lookup_elem()), I think it's exactly the derivative work.
> It's analogous to kernel modules. If module wants to call
> export_symbol_gpl() functions, it needs to be GPLed. Here all helper
> functions are GPL. So we just have a blank check for eBPF program.
I agree, these eBPF programs should be GPL-compatible licensed as well.
greg k-h
On Fri, Jun 27, 2014 at 11:43 PM, Alexei Starovoitov <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:25 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>> BPF syscall is a demux for different BPF related commands.
>>>>>
>>>>> 'maps' is a generic storage of different types for sharing data between kernel
>>>>> and userspace.
>>>>>
>>>>> The maps can be created/deleted from user space via BPF syscall:
>>>>> - create a map with given id, type and attributes
>>>>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>>>> returns positive map id or negative error
>>>>>
>>>>> - delete map with given map id
>>>>> err = bpf_map_delete(int map_id)
>>>>> returns zero or negative error
>>>>
>>>> What's the scope of "id"? How is it secured?
>>>
>>> the map and program id space is global and it's cap_sys_admin only.
>>> There is no pressing need to do it with per-user limits.
>>> So the whole thing is root only for now.
>>>
>>
>> Hmm. This may be unpleasant if you ever want to support non-root or
>> namespaced operation.
>
> I think it will be easy to extend it per namespace when we lift
> root-only restriction. It will be seamless without user api changes.
>
It might be seamless, but I'm not sure it'll be very useful. See below.
>> How hard would it be to give these things fds?
>
> you mean programs/maps auto-terminate when creator process
> exits? I thought about it and it's appealing at first glance, but
> doesn't fit the model of existing tracepoint events which are global.
> The programs attached to events need to live without 'daemon'
> hanging around. Therefore I picked 'kernel module'- like method.
Here are some things I'd like to be able to do:
- Load an eBPF program and use it as a seccomp filter.
- Create a read-only map and reference it from a seccomp filter.
- Create a data structure that a seccomp filter can write but that
the filtered process can only read.
- Create a data structure that a seccomp filter can read but that
some other trusted process can write.
- Create a network filter of some sort and give permission to
manipulate a list of ports to an otherwise untrusted process.
The first four of these shouldn't require privilege.
All of this fits nicely into a model where all of the eBPF objects
(filters and data structures) are represented by fds. Read access to
the fd lets you read (or execute eBPF programs). Write access to the
fd lets you write. You can send them around naturally using
SCM_RIGHTS, and you can create deprivileged versions by reopening the
objects with less access.
All of this *could* fit in using global ids, but we'd need to answer
questions like "what namespace are they bound to" and "who has access
to a given fd". I'd want to see that these questions *have* good
answers before committing to this type of model. Keep in mind that,
for seccomp in particular, granting access to a specific uid will be
very limiting: part of the point of seccomp is to enable
user-controlled finer-grained permissions than allowed by uids and
gids.
--Andy
On Sat, Jun 28, 2014 at 8:21 AM, Greg KH <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 12:26:14AM -0700, Alexei Starovoitov wrote:
>> On Fri, Jun 27, 2014 at 11:28 PM, Andy Lutomirski <[email protected]> wrote:
>> > On Fri, Jun 27, 2014 at 11:12 PM, Alexei Starovoitov <[email protected]> wrote:
>> > If you want to add GPL-only functions in the future, that would be one
>> > thing. But if someone writes a nice eBPF compiler, and someone else
>> > writes a little program that filters on network packets, I see no
>> > reason to claim that the little program is a derivative work of the
>> > kernel and therefore must be GPL.
>>
>> I think we have to draw a line somewhere. Say, tomorrow I want
>> to modify libpcap to emit eBPF based on existing tcpdump syntax.
>> Would it mean that tcpdump filter strings are GPLed? Definitely not,
>> since they existed before and can function without new libpcap.
>> But if I write a new packet filtering program in C, compile it
>> using LLVM->eBPF and call into in-kernel helper functions
>> (like bpf_map_lookup_elem()), I think it's exactly the derivative work.
>> It's analogous to kernel modules. If module wants to call
>> export_symbol_gpl() functions, it needs to be GPLed. Here all helper
>> functions are GPL. So we just have a blank check for eBPF program.
>
> I agree, these eBPF programs should be GPL-compatible licensed as well.
I think I'd be happy with an export_symbol_gpl analogue. I might
argue that bpf_map_lookup_elem shouldn't be gpl-only, though.
Something like "look up the uid that opened a port," on the other
hand, maybe should be.
--Andy
On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov <[email protected]> wrote:
> Safety of eBPF programs is statically determined by the verifier, which detects:
This is a very high-level review. I haven't tried to read all the
code yet, and this is mostly questions rather than real comments.
> - loops
> - out of range jumps
> - unreachable instructions
> - invalid instructions
> - uninitialized register access
> - uninitialized stack access
> - misaligned stack access
> - out of range stack access
> - invalid calling convention
>
> It checks that
> - R1-R5 registers satisfy function prototype
> - program terminates
> - BPF_LD_ABS|IND instructions are only used in socket filters
Why are these used in socket filters? Can't ctx along with some
accessor do the trick?
It seems to me that this is more or less a type system. On entry to
each instruction, each register has a type, and the instruction might
change the types of the registers.
So: what are the rules? If I understand correctly, these are the types:
> + INVALID_PTR, /* reg doesn't contain a valid pointer */
> + PTR_TO_CTX, /* reg points to bpf_context */
> + PTR_TO_MAP, /* reg points to map element value */
> + PTR_TO_MAP_CONDITIONAL, /* points to map element value or NULL */
> + PTR_TO_STACK, /* reg == frame_pointer */
> + PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
> + RET_INTEGER, /* function returns integer */
> + RET_VOID, /* function returns void */
> + CONST_ARG, /* function expects integer constant argument */
> + CONST_ARG_MAP_ID, /* int const argument that is used as map_id */
> + /* int const argument indicating number of bytes accessed from stack
> + * previous function argument must be ptr_to_stack_imm
> + */
> + CONST_ARG_STACK_IMM_SIZE,
> +};
One confusing thing here is that some of these are types and some are
constraints. I'm not sure this is necessary. For example, RET_VOID
is an odd name for VOID, and RET_INTEGER is an odd name for INTEGER.
I think I'd have a much easier time understanding all of this if there
were an explicit table for the transition rules. There are a couple
kinds of transitions. The main one is kind of like a phi node: when
two different control paths reach the same instruction, the types of
each register presumably need to merge. I would imagine rules like:
VOID, anything -> VOID
PTR_TO_MAP, PTR_TO_MAP_CONDITIONAL -> PTR_TO_MAP_CONDITIONAL
Then there are arithmetic rules: if you try to add two values, it
might be legal or illegal, and the result type needs to be known.
There are also things that happen on function entry and exit. For
example, unused argument slots presumably all turn into VOID. All
calls into the same function will apply the merge rule to the used
argument types. Passing stack pointers into a function might be okay,
but passing stack pointers back out should presumably turn them into
VOID.
Am I understanding this right so far?
The next question is: what are the funny types for?
PTR_TO_MAP seems odd. Shouldn't that just be PTR_TO_MEM? And for these:
> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
I don't understand at all.
Next question: how does bounds checking work? Are you separately
tracking the offset and associated bounds of each pointer? How does
bounds checking on PTR_TO_MAP work? How about PTR_TO_STACK? For that
matter, why would pointer arithmetic on the stack be needed at all?
ISTM it would be easier to have instructions to load and store a given
local stack slot. I guess that stack slots from callers are useful,
too, but that seems *much* more complicated to track.
--Andy
On Sat, Jun 28, 2014 at 9:01 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov <[email protected]> wrote:
>> Safety of eBPF programs is statically determined by the verifier, which detects:
>
> This is a very high-level review. I haven't tried to read all the
> code yet, and this is mostly questions rather than real comments.
Great questions! :) Answers below:
>> - loops
>> - out of range jumps
>> - unreachable instructions
>> - invalid instructions
>> - uninitialized register access
>> - uninitialized stack access
>> - misaligned stack access
>> - out of range stack access
>> - invalid calling convention
>>
>> It checks that
>> - R1-R5 registers satisfy function prototype
>> - program terminates
>> - BPF_LD_ABS|IND instructions are only used in socket filters
>
> Why are these used in socket filters? Can't ctx along with some
> accessor do the trick.
ld_abs/ind instructions are legacy instructions that assume ctx == skb.
They're heavily used in classic bpf, so we cannot convert them to
anything else without hurting performance of libpcap. So here we
have two special instructions that are really wrappers of function calls
that can only be used when 'ctx == skb'.
bpf_prog_type_socket_filter means that 'ctx==skb', but it doesn't mean
that this is for attaching to sockets only. The same type can be used
in attaching eBPF programs to cls, xt, etc., where the input is an skb.
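For example, the 'load one byte of ip->proto' step from the socket filter sample
maps onto this legacy absolute-load form, which the verifier only allows for this
program type (a sketch using the macros from this series; the result lands in R0
per the ld_abs convention):

  /* R0 = *(u8 *)(skb->data + 14 + 9), i.e. ip->proto
   * 14 = Ethernet header length, 9 = offset of 'protocol' in struct iphdr
   */
  BPF_LD_ABS(BPF_B, 14 + 9),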
> It seems to me that this is more or less a type system. On entry to
> each instruction, each register has a type, and the instruction might
> change the types of the registers.
Exactly.
> So: what are the rules? If I understand correctly, these are the types:
The types of registers change depending on instruction semantics.
If the instruction is 'mov r1 = r5', then the type of r5 is transferred to r1,
and so on.
>> + INVALID_PTR, /* reg doesn't contain a valid pointer */
>> + PTR_TO_CTX, /* reg points to bpf_context */
>> + PTR_TO_MAP, /* reg points to map element value */
>> + PTR_TO_MAP_CONDITIONAL, /* points to map element value or NULL */
>> + PTR_TO_STACK, /* reg == frame_pointer */
>> + PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
>> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
>> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
>> + RET_INTEGER, /* function returns integer */
>> + RET_VOID, /* function returns void */
>> + CONST_ARG, /* function expects integer constant argument */
>> + CONST_ARG_MAP_ID, /* int const argument that is used as map_id */
>> + /* int const argument indicating number of bytes accessed from stack
>> + * previous function argument must be ptr_to_stack_imm
>> + */
>> + CONST_ARG_STACK_IMM_SIZE,
>> +};
As the comment on top of this enum says:
/* types of values:
* - stored in an eBPF register
* - passed into helper functions as an argument
* - returned from helper functions
*/
> One confusing thing here is that some of these are types and some are
> constraints. I'm not sure this is necessary. For example, RET_VOID
Exactly. Some are register types, some are argument constraints,
and some are definitions of return types (the ret_* types).
All three categories overlap, therefore they're in one enum.
I could split it into three enums, but there would be duplicates and it would not
be any easier to read.
> is an odd name for VOID, and RET_INTEGER is an odd name for INTEGER.
RET_VOID means 'returns void' or, as the comment says, 'function returns void'.
It never appears as a register type. It's the return type of a
function call.
> I think I'd have a much easier time understanding all of this if there
> were an explicit table for the transition rules. There are a couple
> kinds of transitions. The main one is kind of like a phi node: when
> two different control paths reach the same instruction, the types of
> each register presumably need to merge. I would imagine rules like:
>
> VOID, anything -> VOID
> PTR_TO_MAP, PTR_TO_MAP_CONDITIONAL -> PTR_TO_MAP_CONDITIONAL
>
> Then there are arithmetic rules: if you try to add two values, it
> might be legal or illegal, and the result type needs to be known.
It sounds like you're proposing a large table of
[insn_opcode, type_x, type_y] -> type_z
That won't work, since many instructions use registers both as source
and as destination. Type changes are very specific to the logic of the given
instruction and cannot be generalized into a 'type transition table'.
> There are also things that happen on function entry and exit. For
> example, unused argument slots presumably all turn into VOID. All
Almost. After a function call the argument registers R1-R5 turn into the
INVALID_PTR type. RET_VOID is the definition of a function's return type;
it doesn't appear as a register type.
In particular this piece of code does it:
        /* reset caller saved regs */
        for (i = 0; i < CALLER_SAVED_REGS; i++) {
                reg = regs + caller_saved[i];
                reg->read_ok = false;   /* mark R1-R5 as unreadable */
                reg->ptr = INVALID_PTR; /* here all R1-R5 regs are assigned a type */
                reg->imm = 0xbadbad;
        }
        /* update return register */
        reg = regs + BPF_REG_0;
        if (fn->ret_type == RET_INTEGER) {
                reg->read_ok = true;
                reg->ptr = INVALID_PTR;
                /* here the 'RET_INTEGER' return type is converted into the INVALID_PTR register type */
        } else if (fn->ret_type != RET_VOID) {
                reg->read_ok = true;
                reg->ptr = fn->ret_type;
                /* and here PTR_TO_MAP_CONDITIONAL is transferred from the function prototype into a register */
That's an overlap of different types that I was talking about.
Note that registers most of the time have the INVALID_PTR type, which
means the register has some value, but it's not a valid pointer.
For all arithmetic operations that's what we want to see.
We don't want to track arithmetic operations on pointers; it would
complicate the verifier a lot. The only special case is the sequence:
mov r1 = r10
add r1, -20
The 1st insn copies the PTR_TO_STACK type of r10 into r1,
and the 2nd arithmetic instruction is pattern matched to recognize
that it wants to construct a pointer to some element within the stack.
So after 'add r1, -20' the register r1 has type PTR_TO_STACK_IMM
(and the -20 constant is remembered as well), meaning that this reg is a
pointer to the stack plus a known immediate constant.
Relevant comments from the bpf.h header file:
PTR_TO_STACK, /* reg == frame_pointer */
PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
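In verifier pseudo-code (not the actual function, just to show the idea),
the 'add immediate to a stack pointer' case is roughly:
        if (dst_reg->ptr == PTR_TO_STACK) {
                /* the one recognized pattern: 'rX = r10; rX += imm' */
                dst_reg->ptr = PTR_TO_STACK_IMM;
                dst_reg->imm = insn->imm; /* e.g. -20, checked against stack bounds later */
        } else {
                /* any other arithmetic result is treated as a plain value, not a pointer */
                dst_reg->ptr = INVALID_PTR;
        }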
> PTR_TO_MAP seems odd. Shouldn't that just be PTR_TO_MEM? And for these:
When the program does load or store insns, the type of the base register can
only be PTR_TO_MAP, PTR_TO_CTX or PTR_TO_STACK.
These are exactly the three conditions in the check_mem_access() function.
PTR_TO_MAP means that this register points to a 'map element value'
and the range [ptr, ptr + map's value_size) is accessible.
>> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
>> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
>
> I don't understand at all.
PTR_TO_STACK_IMM_MAP_KEY is a function argument constraint.
It means that the register passed to this function must have type
PTR_TO_STACK_IMM and it will be used inside the function as a
'pointer to map element key'.
Here are the argument constraints for bpf_map_lookup_elem():
[BPF_FUNC_map_lookup_elem] = {
.ret_type = PTR_TO_MAP_CONDITIONAL,
.arg1_type = CONST_ARG_MAP_ID,
.arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
},
It says that this function returns a 'pointer to map elem value or null'.
The 1st argument must be a 'constant immediate' value which must
be one of the valid map_ids.
The 2nd argument is a pointer to the stack, which will be used inside
the function as a pointer to a map element key.
On the kernel side the function looks like:
u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
{
struct bpf_map *map;
int map_id = r1;
void *key = (void *) (unsigned long) r2;
void *value;
So here we can access the 'key' pointer safely, knowing that
[key, key + map->key_size) bytes are valid and were initialized on
the stack of the eBPF program.
The eBPF program looked like:
BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
// after this insn R2 type is PTR_TO_STACK
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
// after this insn R2 type is PTR_TO_STACK_IMM
BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
// after this insn R1 type is CONST_ARG
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
Here the verifier looks at the prototype of map_lookup_elem and sees:
the 1st arg should be CONST_ARG_MAP_ID; R1's type is CONST_ARG, which is ok so far.
Then it goes and finds a map with map_id equal to the R1->imm value,
so now the verifier knows that this map has a key of key_size bytes.
The 2nd arg should be PTR_TO_STACK_IMM_MAP_KEY,
and R2's type is PTR_TO_STACK_IMM, so far so good.
The verifier now checks that [R2, R2 + map's key_size) is within stack
limits and was initialized prior to this call.
Here is the relevant comment from verifier.c that describes this:
* Example: before the call to bpf_map_lookup_elem(),
* R1 must contain integer constant and R2 PTR_TO_STACK_IMM_MAP_KEY
* Integer constant in R1 is a map_id. The verifier checks that map_id is valid
* and corresponding map->key_size fetched to check that
* [R2, R2 + map_info->key_size) are within stack limits and all that stack
* memory was initiliazed earlier by BPF program.
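As pseudo-code (the real check_func_arg() calls are quoted further down in
this thread; check_stack_boundary() here is a made-up name for the last step):
        /* arg1 of map_lookup_elem: CONST_ARG_MAP_ID */
        if (regs[BPF_REG_1].ptr != CONST_ARG)
                return -EACCES;
        _(get_map_info(env, regs[BPF_REG_1].imm, &map)); /* is it a valid map_id? */

        /* arg2: PTR_TO_STACK_IMM_MAP_KEY */
        if (regs[BPF_REG_2].ptr != PTR_TO_STACK_IMM)
                return -EACCES;
        /* [fp + R2.imm, fp + R2.imm + map->key_size) must be inside the
         * stack and already written by the program
         */
        _(check_stack_boundary(env, regs[BPF_REG_2].imm, map->key_size));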
> Next question: how does bounds checking work? Are you separately
> tracking the offset and associated bounds of each pointer? How does
Correct. The verifier separately tracks bounds for every pointer type.
> bounds checking on PTR_TO_MAP work?
the way I described above.
> How about PTR_TO_STACK? For that
> matter, why would pointer arithmetic on the stack be needed at all?
stack_ptr + imm is the only one that is tracked.
If an eBPF program does 'r1 = r10; r1 -= 4;' this will not be recognized
by the verifier.
Why does it only track 'add'? Because I tuned the LLVM backend to emit
stack pointer arithmetic only in this way.
Obviously stack arithmetic can be done in a million different ways.
It's impractical to teach the verifier to recognize all possible ways of
doing pointer arithmetic, so here you have this trade-off.
In the future we can teach the verifier to be smarter and recognize
more patterns, but let's start with the simplest design.
> ISTM it would be easier to have instructions to load and store a given
> local stack slot.
Classic bpf has special ld/st instructions just to load/store 32-bit
stack slots. I considered a similar approach for eBPF, but it didn't
make the verifier any simpler, complicated LLVM a lot and
complicated the JITs. So I got rid of them, and here we have only
generic ld/st instructions like normal CPUs do.
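For example, spilling a value to the stack and reloading it is just a normal
store/load relative to the frame pointer r10, written here with the
BPF_RAW_INSN() macro used in the samples:
        BPF_RAW_INSN(BPF_STX | BPF_MEM | BPF_W, BPF_REG_10, BPF_REG_1, -4, 0), /* *(u32 *)(fp - 4) = r1 */
        BPF_RAW_INSN(BPF_LDX | BPF_MEM | BPF_W, BPF_REG_1, BPF_REG_10, -4, 0), /* r1 = *(u32 *)(fp - 4) */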
These were great questions! I hope I answered them. If not, please
continue asking.
Alexei
On Sat, Jun 28, 2014 at 8:34 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 27, 2014 at 11:43 PM, Alexei Starovoitov <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 11:25 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov <[email protected]> wrote:
>>>> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>>> BPF syscall is a demux for different BPF releated commands.
>>>>>>
>>>>>> 'maps' is a generic storage of different types for sharing data between kernel
>>>>>> and userspace.
>>>>>>
>>>>>> The maps can be created/deleted from user space via BPF syscall:
>>>>>> - create a map with given id, type and attributes
>>>>>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>>>>> returns positive map id or negative error
>>>>>>
>>>>>> - delete map with given map id
>>>>>> err = bpf_map_delete(int map_id)
>>>>>> returns zero or negative error
>>>>>
>>>>> What's the scope of "id"? How is it secured?
>>>>
>>>> the map and program id space is global and it's cap_sys_admin only.
>>>> There is no pressing need to do it with per-user limits.
>>>> So the whole thing is root only for now.
>>>>
>>>
>>> Hmm. This may be unpleasant if you ever want to support non-root or
>>> namespaced operation.
>>
>> I think it will be easy to extend it per namespace when we lift
>> root-only restriction. It will be seamless without user api changes.
>>
>
> It might be seamless, but I'm not sure it'll be very useful. See below.
>
>>> How hard would it be to give these things fds?
>>
>> you mean programs/maps auto-terminate when creator process
>> exits? I thought about it and it's appealing at first glance, but
>> doesn't fit the model of existing tracepoint events which are global.
>> The programs attached to events need to live without 'daemon'
>> hanging around. Therefore I picked 'kernel module'- like method.
>
> Here are some things I'd like to be able to do:
>
> - Load an eBPF program and use it as a seccomp filter.
>
> - Create a read-only map and reference it from a seccomp filter.
>
> - Create a data structure that a seccomp filter can write but that
> the filtered process can only read.
>
> - Create a data structure that a seccomp filter can read but that
> some other trusted process can write.
>
> - Create a network filter of some sort and give permission to
> manipulate a list of ports to an otherwise untrusted process.
>
> The first four of these shouldn't require privilege.
>
> All of this fits nicely into a model where all of the eBPF objects
> (filters and data structures) are represented by fds. Read access to
> the fd lets you read (or execute eBPF programs). Write access to the
> fd lets you write. You can send them around naturally using
> SCM_RIGHTS, and you can create deprivileged versions by reopening the
> objects with less access.
Sorry I don't like 'fd' direction at all.
1. it will make the whole thing very socket specific and 'net' dependent.
but the goal here is to be able to use eBPF for tracing in embedded
setups. So it's gotta be net independent.
2. sockets are already overloaded with all sorts of stuff. Adding more
types of sockets will complicate it a lot.
3. and most important. read/write operations on sockets are not
done every nanosecond, whereas lookup operations on bpf maps
are done every dozen instructions, so we cannot have any overhead
when accessing maps.
In other words the verifier is done as static analyzer. I moved all
the complexity to verify time, so at run-time the programs are as
fast as possible. I'm strongly against run-time checks in critical path,
since they kill performance and make the whole approach a lot less usable.
What you want to achieve:
> - Load an eBPF program and use it as a seccomp filter.
> - Create a read-only map and reference it from a seccomp filter.
is very doable in the existing framework.
Note I didn't do a seccomp+eBPF example, only because you and Kees
are messing with this part of the code a lot and I didn't want to conflict.
> All of this *could* fit in using global ids, but we'd need to answer
> questions like "what namespace are they bound to" and "who has access
> to a given fd". I'd want to see that these questions *have* good
> answers before committing to this type of model. Keep in mind that,
> for seccomp in particular, granting access to a specific uid will be
> very limiting: part of the point of seccomp is to enable
> user-controlled finer-grained permissions than allowed by uids and
> gids.
Filters (bpf programs) are a low-level tool that shouldn't be aware
of uids/gids at all. Just like classic bpf doesn't care, eBPF programs
shouldn't care. Mixing the concept of uids/fds into the program is wrong.
On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 8:34 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 11:43 PM, Alexei Starovoitov <[email protected]> wrote:
>>> On Fri, Jun 27, 2014 at 11:25 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>>>> BPF syscall is a demux for different BPF releated commands.
>>>>>>>
>>>>>>> 'maps' is a generic storage of different types for sharing data between kernel
>>>>>>> and userspace.
>>>>>>>
>>>>>>> The maps can be created/deleted from user space via BPF syscall:
>>>>>>> - create a map with given id, type and attributes
>>>>>>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>>>>>> returns positive map id or negative error
>>>>>>>
>>>>>>> - delete map with given map id
>>>>>>> err = bpf_map_delete(int map_id)
>>>>>>> returns zero or negative error
>>>>>>
>>>>>> What's the scope of "id"? How is it secured?
>>>>>
>>>>> the map and program id space is global and it's cap_sys_admin only.
>>>>> There is no pressing need to do it with per-user limits.
>>>>> So the whole thing is root only for now.
>>>>>
>>>>
>>>> Hmm. This may be unpleasant if you ever want to support non-root or
>>>> namespaced operation.
>>>
>>> I think it will be easy to extend it per namespace when we lift
>>> root-only restriction. It will be seamless without user api changes.
>>>
>>
>> It might be seamless, but I'm not sure it'll be very useful. See below.
>>
>>>> How hard would it be to give these things fds?
>>>
>>> you mean programs/maps auto-terminate when creator process
>>> exits? I thought about it and it's appealing at first glance, but
>>> doesn't fit the model of existing tracepoint events which are global.
>>> The programs attached to events need to live without 'daemon'
>>> hanging around. Therefore I picked 'kernel module'- like method.
>>
>> Here are some things I'd like to be able to do:
>>
>> - Load an eBPF program and use it as a seccomp filter.
>>
>> - Create a read-only map and reference it from a seccomp filter.
>>
>> - Create a data structure that a seccomp filter can write but that
>> the filtered process can only read.
>>
>> - Create a data structure that a seccomp filter can read but that
>> some other trusted process can write.
>>
>> - Create a network filter of some sort and give permission to
>> manipulate a list of ports to an otherwise untrusted process.
>>
>> The first four of these shouldn't require privilege.
>>
>> All of this fits nicely into a model where all of the eBPF objects
>> (filters and data structures) are represented by fds. Read access to
>> the fd lets you read (or execute eBPF programs). Write access to the
>> fd lets you write. You can send them around naturally using
>> SCM_RIGHTS, and you can create deprivileged versions by reopening the
>> objects with less access.
>
> Sorry I don't like 'fd' direction at all.
> 1. it will make the whole thing very socket specific and 'net' dependent.
> but the goal here is to be able to use eBPF for tracing in embedded
> setups. So it's gotta be net independent.
> 2. sockets are already overloaded with all sorts of stuff. Adding more
> types of sockets will complicate it a lot.
> 3. and most important. read/write operations on sockets are not
> done every nanosecond, whereas lookup operations on bpf maps
> are done every dozen instructions, so we cannot have any overhead
> when accessing maps.
> In other words the verifier is done as static analyzer. I moved all
> the complexity to verify time, so at run-time the programs are as
> fast as possible. I'm strongly against run-time checks in critical path,
> since they kill performance and make the whole approach a lot less usable.
I may have described my suggestion poorly. I'm suggesting that all of
these global ids be replaced *for userspace's benefit* with fds. That
is, a map would have an associated struct inode, and, when you load an
eBPF program, you'd pass fds into the kernel instead of global ids.
The kernel would still compile the eBPF program to use the global ids,
though.
This should have no effect at all on the execution of eBPF programs.
eBPF programs wouldn't be able to look up fds at runtime, and this
should work without CONFIG_NET.
--Andy
On Sat, Jun 28, 2014 at 1:25 PM, Alexei Starovoitov <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 9:01 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov <[email protected]> wrote:
>>> Safety of eBPF programs is statically determined by the verifier, which detects:
>>
>> This is a very high-level review. I haven't tried to read all the
>> code yet, and this is mostly questions rather than real comments.
>
> These were great questions! I hope I answered them. If not, please
> continue asking.
I have plenty more questions, but here's one right now: does anything
prevent programs from using pointers in comparisons, returning
pointers, or otherwise figuring out the value of a pointer? If so, I
think it would be worthwhile to prevent that so that eBPF programs
can't learn kernel addresses.
--Andy
On Sat, Jun 28, 2014 at 6:58 PM, Andy Lutomirski <[email protected]> wrote:
>> These were great questions! I hope I answered them. If not, please
>> continue asking.
>
> I have plenty more questions, but here's one right now: does anything
> prevent programs from using pointers in comparisons, returning
> pointers, or otherwise figuring out the value of a pointer? If so, I
> think it would be worthwhile to prevent that so that eBPF programs
> can't learn kernel addresses.
when we decide to let non-root users load such programs, yes.
Right now the goal is the opposite. Take a look at the 'drop monitor' example.
It stores the kernel addresses where packets were dropped into a map,
so that user space can read them. The goal of eBPF for tracing is to be able
to see all corners of the kernel without being able to crash it.
An eBPF program is like a kernel module that cannot crash the kernel or
adversely affect its execution. Though in the future I'd like to expand
applicability and let unprivileged users use them. In such cases exposing
kernel addresses will be prevented. It's easy to tweak the verifier to prevent
comparison of pointers, storing pointers into a map or passing them into
a function. The verifier is already tracking all pointers. There are ways to
leak them: store into a map, pass to a helper function, return from the program,
compare to a constant, obfuscate via arithmetic. Prevention by the verifier
is trivial, though not right now. User level security is a different topic.
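Roughly, such a tweak boils down to one extra check at every place a register
value leaves the program (store to a map, helper argument, return value,
comparison). Not part of this patch set; the names below are made up:
        static bool reg_leaks_pointer(struct reg_state *reg, bool allow_ptr_leaks)
        {
                if (allow_ptr_leaks) /* root: today's behaviour */
                        return false;
                /* anything tracked as a pointer could expose a kernel address */
                return reg->read_ok && reg->ptr != INVALID_PTR;
        }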
If I try to solve both 'root user safety' and 'non-root security' it will
take ages to go anywhere. So this patch is root only. I'm deliberately
not addressing non-root security for now.
First step: root only, get kernel pieces in place, llvm upstream, perf
plus all other user level tools. Just this step will take months.
Then let's talk about non-root. I believe it will need minimal changes
to verifier and no syscall uapi changes, but even if I'm wrong and
new syscall would be needed, it's not a big deal. Adding things
gradually is way better than trying to solve everything at once.
On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
>>
>> Sorry I don't like 'fd' direction at all.
>> 1. it will make the whole thing very socket specific and 'net' dependent.
>> but the goal here is to be able to use eBPF for tracing in embedded
>> setups. So it's gotta be net independent.
>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>> types of sockets will complicate it a lot.
>> 3. and most important. read/write operations on sockets are not
>> done every nanosecond, whereas lookup operations on bpf maps
>> are done every dozen instructions, so we cannot have any overhead
>> when accessing maps.
>> In other words the verifier is done as static analyzer. I moved all
>> the complexity to verify time, so at run-time the programs are as
>> fast as possible. I'm strongly against run-time checks in critical path,
>> since they kill performance and make the whole approach a lot less usable.
>
> I may have described my suggestion poorly. I'm suggesting that all of
> these global ids be replaced *for userspace's benefit* with fds. That
> is, a map would have an associated struct inode, and, when you load an
> eBPF program, you'd pass fds into the kernel instead of global ids.
> The kernel would still compile the eBPF program to use the global ids,
> though.
Hmm. If I understood you correctly, you're suggesting to do it similar
to ipc/mqueue, shmem, sockets do. By registering and mounting
a file system and providing all superblock and inode hooks… and
probably have its own namespace type… hmm… may be. That's
quite a bit of work to put lightly. As I said in the other email the first
step is root only and all these complexity just not worth doing
at this stage.
From: Alexei Starovoitov
> On Fri, Jun 27, 2014 at 5:19 PM, Andy Lutomirski <[email protected]> wrote:
> > On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
> >> eBPF programs are safe run-to-completion functions with load/unload
> >> methods from userspace similar to kernel modules.
> >>
> >> User space API:
> >>
> >> - load eBPF program
> >> prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int len)
> >>
> >> where 'prog' is a sequence of sections (currently TEXT and LICENSE)
> >> TEXT - array of eBPF instructions
> >> LICENSE - GPL compatible
> >> +
> >> + err = -EINVAL;
> >> + /* look for mandatory license string */
> >> + if (!tb[BPF_PROG_LICENSE])
> >> + goto free_attr;
> >> +
> >> + /* eBPF programs must be GPL compatible */
> >> + if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
> >> + goto free_attr;
> >
> > Seriously? My mind boggles.
>
> Yes. Quite a bit of logic can fit into one eBPF program. I don't think it's wise
> to leave this door open for abuse. This check makes it clear that if you
> write a program in C, the source code must be available.
That seems utterly extreme.
Loadable kernel modules don't have to be GPL.
I can imagine that some people might not want to load code for which
they don't have the source - but in that case they probably want to
compile it themselves anyway.
I don't want to have to put a gpl licence on random pieces of test
code I might happen to write for my own use.
David
On Sat, Jun 28, 2014 at 8:35 AM, Andy Lutomirski <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 8:21 AM, Greg KH <[email protected]> wrote:
>> On Sat, Jun 28, 2014 at 12:26:14AM -0700, Alexei Starovoitov wrote:
>>> On Fri, Jun 27, 2014 at 11:28 PM, Andy Lutomirski <[email protected]> wrote:
>>> > On Fri, Jun 27, 2014 at 11:12 PM, Alexei Starovoitov <[email protected]> wrote:
>>> > If you want to add GPL-only functions in the future, that would be one
>>> > thing. But if someone writes a nice eBPF compiler, and someone else
>>> > writes a little program that filters on network packets, I see no
>>> > reason to claim that the little program is a derivative work of the
>>> > kernel and therefore must be GPL.
>>>
>>> I think we have to draw a line somewhere. Say, tomorrow I want
>>> to modify libpcap to emit eBPF based on existing tcpdump syntax.
>>> Would it mean that tcpdump filter strings are GPLed? Definitely not,
>>> since they existed before and can function without new libpcap.
>>> But if I write a new packet filtering program in C, compile it
>>> using LLVM->eBPF and call into in-kernel helper functions
>>> (like bpf_map_lookup_elem()), I think it's exactly the derivative work.
>>> It's analogous to kernel modules. If module wants to call
>>> export_symbol_gpl() functions, it needs to be GPLed. Here all helper
>>> functions are GPL. So we just have a blank check for eBPF program.
>>
>> I agree, these eBFP programs should be GPL-compatible licensed as well.
>
> I think I'd be happy with an export_symbol_gpl analogue. I might
> argue that bpf_map_lookup_elem shouldn't be gpl-only, though.
Ok, sounds like a module-like approach will be more acceptable to the potential
user base. Will change it. The last thing I want to do is to scare users away.
On Sat, Jun 28, 2014 at 11:36 PM, Alexei Starovoitov <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <[email protected]> wrote:
>> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
>>>
>>> Sorry I don't like 'fd' direction at all.
>>> 1. it will make the whole thing very socket specific and 'net' dependent.
>>> but the goal here is to be able to use eBPF for tracing in embedded
>>> setups. So it's gotta be net independent.
>>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>>> types of sockets will complicate it a lot.
>>> 3. and most important. read/write operations on sockets are not
>>> done every nanosecond, whereas lookup operations on bpf maps
>>> are done every dozen instructions, so we cannot have any overhead
>>> when accessing maps.
>>> In other words the verifier is done as static analyzer. I moved all
>>> the complexity to verify time, so at run-time the programs are as
>>> fast as possible. I'm strongly against run-time checks in critical path,
>>> since they kill performance and make the whole approach a lot less usable.
>>
>> I may have described my suggestion poorly. I'm suggesting that all of
>> these global ids be replaced *for userspace's benefit* with fds. That
>> is, a map would have an associated struct inode, and, when you load an
>> eBPF program, you'd pass fds into the kernel instead of global ids.
>> The kernel would still compile the eBPF program to use the global ids,
>> though.
>
> Hmm. If I understood you correctly, you're suggesting to do it similar
> to ipc/mqueue, shmem, sockets do. By registering and mounting
> a file system and providing all superblock and inode hooks… and
> probably have its own namespace type… hmm… may be. That's
> quite a bit of work to put lightly. As I said in the other email the first
> step is root only and all these complexity just not worth doing
> at this stage.
The downside of not doing it right away is that it's harder to
retrofit in without breaking early users.
You might be able to get away with using anon_inodes. That will
prevent reopening via /proc/self/fd from working (I think), but that's
a good thing until someone fixes the /proc reopen hole. Sigh.
--Andy
On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
> Hi All,
>
> this patch set demonstrates the potential of eBPF.
>
> First patch "net: filter: split filter.c into two files" splits eBPF interpreter
> out of networking into kernel/bpf/. The goal for BPF subsystem is to be usable
> in NET-less configuration. Though the whole set is marked is RFC, the 1st patch
> is good to go. Similar version of the patch that was posted few weeks ago, but
> was deferred. I'm assuming due to lack of forward visibility. I hope that this
> patch set shows what eBPF is capable of and where it's heading.
>
> Other patches expose eBPF instruction set to user space and introduce concepts
> of maps and programs accessible via syscall.
>
> 'maps' is a generic storage of different types for sharing data between kernel
> and userspace. Maps are referrenced by global id. Root can create multiple
> maps of different types where key/value are opaque bytes of data. It's up to
> user space and eBPF program to decide what they store in the maps.
>
> eBPF programs are similar to kernel modules. They live in global space and
> have unique prog_id. Each program is a safe run-to-completion set of
> instructions. eBPF verifier statically determines that the program terminates
> and safe to execute. During verification the program takes a hold of maps
> that it intends to use, so selected maps cannot be removed until program is
> unloaded. The program can be attached to different events. These events can
> be packets, tracepoint events and other types in the future. New event triggers
> execution of the program which may store information about the event in the maps.
> Beyond storing data the programs may call into in-kernel helper functions
> which may, for example, dump stack, do trace_printk or other forms of live
> kernel debugging. Same program can be attached to multiple events. Different
> programs can access the same map:
>
> tracepoint tracepoint tracepoint sk_buff sk_buff
> event A event B event C on eth0 on eth1
> | | | | |
> | | | | |
> --> tracing <-- tracing socket socket
> prog_1 prog_2 prog_3 prog_4
> | | | |
> |--- -----| |-------| map_3
> map_1 map_2
>
> User space (via syscall) and eBPF programs access maps concurrently.
>
> Last two patches are sample code. 1st demonstrates stateful packet inspection.
> It counts tcp and udp packets on eth0. Should be easy to see how this eBPF
> framework can be used for network analytics.
> 2nd sample does simple 'drop monitor'. It attaches to kfree_skb tracepoint
> event and counts number of packet drops at particular $pc location.
> User space periodically summarizes what eBPF programs recorded.
> In these two samples the eBPF programs are tiny and written in 'assembler'
> with macroses. More complex programs can be written C (llvm backend is not
> part of this diff to reduce 'huge' perception).
> Since eBPF is fully JITed on x64, the cost of running eBPF program is very
> small even for high frequency events. Here are the numbers comparing
> flow_dissector in C vs eBPF:
> x86_64 skb_flow_dissect() same skb (all cached) - 42 nsec per call
> x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per call
> eBPF+jit skb_flow_dissect() same skb (all cached) - 51 nsec per call
> eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call
>
> Detailed explanation on eBPF verifier and safety is in patch 08/14
This is very exciting! Thanks for working on it. :)
Between the new eBPF syscall and the new seccomp syscall, I'm really
looking forward to using lookup tables for seccomp filters. Under
certain types of filters, we'll likely see some non-trivial
performance improvements.
-Kees
--
Kees Cook
Chrome OS Security
On Mon, Jun 30, 2014 at 3:09 PM, Andy Lutomirski <[email protected]> wrote:
> On Sat, Jun 28, 2014 at 11:36 PM, Alexei Starovoitov <[email protected]> wrote:
>> On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>
>>>> Sorry I don't like 'fd' direction at all.
>>>> 1. it will make the whole thing very socket specific and 'net' dependent.
>>>> but the goal here is to be able to use eBPF for tracing in embedded
>>>> setups. So it's gotta be net independent.
>>>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>>>> types of sockets will complicate it a lot.
>>>> 3. and most important. read/write operations on sockets are not
>>>> done every nanosecond, whereas lookup operations on bpf maps
>>>> are done every dozen instructions, so we cannot have any overhead
>>>> when accessing maps.
>>>> In other words the verifier is done as static analyzer. I moved all
>>>> the complexity to verify time, so at run-time the programs are as
>>>> fast as possible. I'm strongly against run-time checks in critical path,
>>>> since they kill performance and make the whole approach a lot less usable.
>>>
>>> I may have described my suggestion poorly. I'm suggesting that all of
>>> these global ids be replaced *for userspace's benefit* with fds. That
>>> is, a map would have an associated struct inode, and, when you load an
>>> eBPF program, you'd pass fds into the kernel instead of global ids.
>>> The kernel would still compile the eBPF program to use the global ids,
>>> though.
>>
>> Hmm. If I understood you correctly, you're suggesting to do it similar
>> to ipc/mqueue, shmem, sockets do. By registering and mounting
>> a file system and providing all superblock and inode hooks… and
>> probably have its own namespace type… hmm… may be. That's
>> quite a bit of work to put lightly. As I said in the other email the first
>> step is root only and all these complexity just not worth doing
>> at this stage.
>
> The downside of not doing it right away is that it's harder to
> retrofit in without breaking early users.
>
> You might be able to get away with using anon_inodes. That will
Spent quite a bit of time playing with anon_inode_getfd(). The model
works ok for seccomp, but doesn't seem to work for tracing,
since tracepoints are global. Say, syscall(bpf, load_prog) returns
a process-local fd. This 'fd' as a string can be written to
debugfs/tracing/events/.../filter which will increment a refcnt of a global
ebpf_program structure and will keep using it. When process exits it will
close all fds which in case of ebpf_prog_fd should be a nop, since
the program is still attached to a global event. Now we have a
program and maps that still alive and dangling, since tracepoint events
keep coming, but no new process can access it. Here we just lost all
benefits of making it 'fd' based. Theoretically we can extend tracing to
be fd-based too and tracepoints will auto-detach upon process exit,
but that's not going to work for all other global events. Like networking
components (bridge, ovs, …) are global and they won't be adding
fd-based interfaces.
I'm still thinking about it, but it looks like that any process-local
ebpf_prog_id scheme is not going to work for global events. Thoughts?
On 07/01/2014 01:09 AM, Kees Cook wrote:
> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]> wrote:
>> Hi All,
>>
>> this patch set demonstrates the potential of eBPF.
>>
>> First patch "net: filter: split filter.c into two files" splits eBPF interpreter
>> out of networking into kernel/bpf/. The goal for BPF subsystem is to be usable
>> in NET-less configuration. Though the whole set is marked is RFC, the 1st patch
>> is good to go. Similar version of the patch that was posted few weeks ago, but
>> was deferred. I'm assuming due to lack of forward visibility. I hope that this
>> patch set shows what eBPF is capable of and where it's heading.
>>
>> Other patches expose eBPF instruction set to user space and introduce concepts
>> of maps and programs accessible via syscall.
>>
>> 'maps' is a generic storage of different types for sharing data between kernel
>> and userspace. Maps are referrenced by global id. Root can create multiple
>> maps of different types where key/value are opaque bytes of data. It's up to
>> user space and eBPF program to decide what they store in the maps.
>>
>> eBPF programs are similar to kernel modules. They live in global space and
>> have unique prog_id. Each program is a safe run-to-completion set of
>> instructions. eBPF verifier statically determines that the program terminates
>> and safe to execute. During verification the program takes a hold of maps
>> that it intends to use, so selected maps cannot be removed until program is
>> unloaded. The program can be attached to different events. These events can
>> be packets, tracepoint events and other types in the future. New event triggers
>> execution of the program which may store information about the event in the maps.
>> Beyond storing data the programs may call into in-kernel helper functions
>> which may, for example, dump stack, do trace_printk or other forms of live
>> kernel debugging. Same program can be attached to multiple events. Different
>> programs can access the same map:
>>
>> tracepoint tracepoint tracepoint sk_buff sk_buff
>> event A event B event C on eth0 on eth1
>> | | | | |
>> | | | | |
>> --> tracing <-- tracing socket socket
>> prog_1 prog_2 prog_3 prog_4
>> | | | |
>> |--- -----| |-------| map_3
>> map_1 map_2
>>
>> User space (via syscall) and eBPF programs access maps concurrently.
>>
>> Last two patches are sample code. 1st demonstrates stateful packet inspection.
>> It counts tcp and udp packets on eth0. Should be easy to see how this eBPF
>> framework can be used for network analytics.
>> 2nd sample does simple 'drop monitor'. It attaches to kfree_skb tracepoint
>> event and counts number of packet drops at particular $pc location.
>> User space periodically summarizes what eBPF programs recorded.
>> In these two samples the eBPF programs are tiny and written in 'assembler'
>> with macroses. More complex programs can be written C (llvm backend is not
>> part of this diff to reduce 'huge' perception).
>> Since eBPF is fully JITed on x64, the cost of running eBPF program is very
>> small even for high frequency events. Here are the numbers comparing
>> flow_dissector in C vs eBPF:
>> x86_64 skb_flow_dissect() same skb (all cached) - 42 nsec per call
>> x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per call
>> eBPF+jit skb_flow_dissect() same skb (all cached) - 51 nsec per call
>> eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call
>>
>> Detailed explanation on eBPF verifier and safety is in patch 08/14
>
> This is very exciting! Thanks for working on it. :)
>
> Between the new eBPF syscall and the new seccomp syscall, I'm really
> looking forward to using lookup tables for seccomp filters. Under
> certain types of filters, we'll likely see some non-trivial
> performance improvements.
Well, if I read this correctly, the eBPF syscall lets you set up maps, etc,
but the only way to attach eBPF is via setsockopt for network filters right
now (and via tracing). Seccomp will still make use of classic BPF, so you
won't be able to use it there.
On 06/28/2014 02:06 AM, Alexei Starovoitov wrote:
> Safety of eBPF programs is statically determined by the verifier, which detects:
> - loops
> - out of range jumps
> - unreachable instructions
> - invalid instructions
> - uninitialized register access
> - uninitialized stack access
> - misaligned stack access
> - out of range stack access
> - invalid calling convention
...
> More details in Documentation/networking/filter.txt
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
...
> kernel/bpf/verifier.c | 1431 +++++++++++++++++++++++++++++++++++
Looking at classic BPF verifier which checks safety of BPF
user space programs, it's roughly 200 loc. :-/
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> new file mode 100644
...
> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
...
> + _(get_map_info(env, map_id, &map));
...
> + _(size = bpf_size_to_bytes(bpf_size));
Nit: such macros should be removed, please.
On 06/28/2014 02:06 AM, Alexei Starovoitov wrote:
> User interface:
> cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
>
> where 123 is an id of the eBPF program priorly loaded.
> __event__ is static tracepoint event.
> (kprobe events will be supported in the future patches)
>
> eBPF programs can call in-kernel helper functions to:
> - lookup/update/delete elements in maps
> - memcmp
> - trace_printk
> - load_pointer
> - dump_stack
Are there plans to let eBPF replace the generic event
filtering framework in tracing?
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> include/linux/ftrace_event.h | 5 +
> include/trace/bpf_trace.h | 29 +++++
> include/trace/ftrace.h | 10 ++
> include/uapi/linux/bpf.h | 5 +
> kernel/trace/Kconfig | 1 +
> kernel/trace/Makefile | 1 +
> kernel/trace/bpf_trace.c | 217 ++++++++++++++++++++++++++++++++++++
> kernel/trace/trace.h | 3 +
> kernel/trace/trace_events.c | 7 ++
> kernel/trace/trace_events_filter.c | 72 +++++++++++-
> 10 files changed, 349 insertions(+), 1 deletion(-)
> create mode 100644 include/trace/bpf_trace.h
> create mode 100644 kernel/trace/bpf_trace.c
On Mon, Jun 30, 2014 at 10:47 PM, Alexei Starovoitov <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 3:09 PM, Andy Lutomirski <[email protected]> wrote:
>> On Sat, Jun 28, 2014 at 11:36 PM, Alexei Starovoitov <[email protected]> wrote:
>>> On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>>
>>>>> Sorry I don't like 'fd' direction at all.
>>>>> 1. it will make the whole thing very socket specific and 'net' dependent.
>>>>> but the goal here is to be able to use eBPF for tracing in embedded
>>>>> setups. So it's gotta be net independent.
>>>>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>>>>> types of sockets will complicate it a lot.
>>>>> 3. and most important. read/write operations on sockets are not
>>>>> done every nanosecond, whereas lookup operations on bpf maps
>>>>> are done every dozen instructions, so we cannot have any overhead
>>>>> when accessing maps.
>>>>> In other words the verifier is done as static analyzer. I moved all
>>>>> the complexity to verify time, so at run-time the programs are as
>>>>> fast as possible. I'm strongly against run-time checks in critical path,
>>>>> since they kill performance and make the whole approach a lot less usable.
>>>>
>>>> I may have described my suggestion poorly. I'm suggesting that all of
>>>> these global ids be replaced *for userspace's benefit* with fds. That
>>>> is, a map would have an associated struct inode, and, when you load an
>>>> eBPF program, you'd pass fds into the kernel instead of global ids.
>>>> The kernel would still compile the eBPF program to use the global ids,
>>>> though.
>>>
>>> Hmm. If I understood you correctly, you're suggesting to do it similar
>>> to ipc/mqueue, shmem, sockets do. By registering and mounting
>>> a file system and providing all superblock and inode hooks… and
>>> probably have its own namespace type… hmm… may be. That's
>>> quite a bit of work to put lightly. As I said in the other email the first
>>> step is root only and all these complexity just not worth doing
>>> at this stage.
>>
>> The downside of not doing it right away is that it's harder to
>> retrofit in without breaking early users.
>>
>> You might be able to get away with using anon_inodes. That will
>
> Spent quite a bit of time playing with anon_inode_getfd(). The model
> works ok for seccomp, but doesn't seem to work for tracing,
> since tracepoints are global. Say, syscall(bpf, load_prog) returns
> a process-local fd. This 'fd' as a string can be written to
> debugfs/tracing/events/.../filter which will increment a refcnt of a global
> ebpf_program structure and will keep using it. When process exits it will
> close all fds which in case of ebpf_prog_fd should be a nop, since
> the program is still attached to a global event. Now we have a
> program and maps that still alive and dangling, since tracepoint events
> keep coming, but no new process can access it. Here we just lost all
> benefits of making it 'fd' based. Theoretically we can extend tracing to
> be fd-based too and tracepoints will auto-detach upon process exit,
> but that's not going to work for all other global events. Like networking
> components (bridge, ovs, …) are global and they won't be adding
> fd-based interfaces.
> I'm still thinking about it, but it looks like that any process-local
> ebpf_prog_id scheme is not going to work for global events. Thoughts?
Hmm. Maybe these things do need global ids for tracing, or at least
there need to be some way to stash them somewhere and find them again.
I suppose that debugfs could have symlinks to them, but I don't know
how hard that would be to implement or how awkward it would be to use.
I imagine there's some awkwardness regardless. For tracing, if I
create map 75 and eBPF program 492 that uses map 75, then I still need
to remember that map 75 is the map I want (or I need to parse the eBPF
program later on).
How do you imagine the userspace code working? Maybe it would make
sense to add some nlattrs for eBPF programs to map between referenced
objects and nicknames for them. Then user code could look at
/sys/kernel/debug/whatever/nickname_of_map to resolve the map id or
even just open it directly.
I admit that I'm much more familiar with seccomp and even socket
filters than I am with tracing.
--Andy
On Tue, Jul 1, 2014 at 1:05 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/28/2014 02:06 AM, Alexei Starovoitov wrote:
>>
>> Safety of eBPF programs is statically determined by the verifier, which
>> detects:
>> - loops
>> - out of range jumps
>> - unreachable instructions
>> - invalid instructions
>> - uninitialized register access
>> - uninitialized stack access
>> - misaligned stack access
>> - out of range stack access
>> - invalid calling convention
>
> ...
>
>> More details in Documentation/networking/filter.txt
>>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
>> ---
>
> ...
>>
>> kernel/bpf/verifier.c | 1431
>> +++++++++++++++++++++++++++++++++++
>
>
> Looking at classic BPF verifier which checks safety of BPF
> user space programs, it's roughly 200 loc. :-/
I'm not sure what your point is; that's comparing apples to oranges.
For the record, the 1431 lines include ~200 lines worth of comments
and 200 lines of verbose prints. Without them a rejected eBPF
program is a black box. Users need a way to understand why the
verifier rejected it.
>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> new file mode 100644
>
> ...
>
>> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
>
> ...
>>
>> + _(get_map_info(env, map_id, &map));
>
> ...
>>
>> + _(size = bpf_size_to_bytes(bpf_size));
>
>
> Nit: such macros should be removed, please.
It may surely look unconventional, but the alternative is to replace
every usage of the _ macro with:
err = …
if (err)
        return err;
and since this macro is used 38 times, it would add ~120 unnecessary
lines that would only make the code much harder to follow.
I tried not using the macro and the results were not pleasing.
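For reference, a use like _(get_map_info(env, map_id, &map)); expands (via the
statement-expression macro quoted above) to the equivalent of:
        {
                int ret = get_map_info(env, map_id, &map);
                if (ret < 0)
                        return ret;
        }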
On Tue, Jul 1, 2014 at 1:30 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/28/2014 02:06 AM, Alexei Starovoitov wrote:
>>
>> User interface:
>> cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
>>
>> where 123 is an id of the eBPF program priorly loaded.
>> __event__ is static tracepoint event.
>> (kprobe events will be supported in the future patches)
>>
>> eBPF programs can call in-kernel helper functions to:
>> - lookup/update/delete elements in maps
>> - memcmp
>> - trace_printk
>> - load_pointer
>> - dump_stack
>
>
> Are there plans to let eBPF replace the generic event
> filtering framework in tracing?
Yes. The other patch, which replaces predicate tree walking with
eBPF programs, is pending on the eBPF split out of networking.
Hi Alexei,
On Fri, 27 Jun 2014 17:05:53 -0700, Alexei Starovoitov wrote:
> BPF is used in several kernel components. This split creates logical boundary
> between generic eBPF core and the rest
>
> kernel/bpf/core.c: eBPF interpreter
>
> net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters
>
> This patch only moves functions.
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> kernel/Makefile | 1 +
> kernel/bpf/Makefile | 1 +
> kernel/bpf/core.c | 545 +++++++++++++++++++++++++++++++++++++++++++++++++++
> net/core/filter.c | 520 ------------------------------------------------
> 4 files changed, 547 insertions(+), 520 deletions(-)
> create mode 100644 kernel/bpf/Makefile
> create mode 100644 kernel/bpf/core.c
>
> diff --git a/kernel/Makefile b/kernel/Makefile
> index f2a8b6246ce9..e7360b7c2c0e 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
> obj-$(CONFIG_TRACEPOINTS) += trace/
> obj-$(CONFIG_IRQ_WORK) += irq_work.o
> obj-$(CONFIG_CPU_PM) += cpu_pm.o
> +obj-$(CONFIG_NET) += bpf/
But this still requires CONFIG_NET to use bpf. Why not add
CONFIG_BPF and make CONFIG_NET select it?
Thanks,
Namhyung
Mostly questions and few nitpicks.. :)
On Fri, 27 Jun 2014 17:06:00 -0700, Alexei Starovoitov wrote:
> +/* types of values:
> + * - stored in an eBPF register
> + * - passed into helper functions as an argument
> + * - returned from helper functions
> + */
> +enum bpf_reg_type {
> + INVALID_PTR, /* reg doesn't contain a valid pointer */
I don't think it's a good name. INVALID_PTR can be read as saying it
contains a "pointer" which is invalid. Maybe INTEGER, NUMBER or
something different could be used. And I think struct reg_state->ptr
should be renamed as well.
> + PTR_TO_CTX, /* reg points to bpf_context */
> + PTR_TO_MAP, /* reg points to map element value */
> + PTR_TO_MAP_CONDITIONAL, /* points to map element value or NULL */
> + PTR_TO_STACK, /* reg == frame_pointer */
> + PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
So, these PTR_TO_STACK_IMM[_*] types are only for function arguments,
right? I guessed they could be used to access memory in general too, but
then I thought it'd make verification complicated..
And I also agree that it'd be better to split reg types and function
argument constraints.
> + RET_INTEGER, /* function returns integer */
> + RET_VOID, /* function returns void */
> + CONST_ARG, /* function expects integer constant argument */
> + CONST_ARG_MAP_ID, /* int const argument that is used as map_id */
That means a map id should always be a constant (for verification), right?
> + /* int const argument indicating number of bytes accessed from stack
> + * previous function argument must be ptr_to_stack_imm
> + */
> + CONST_ARG_STACK_IMM_SIZE,
> +};
[SNIP]
> +
> +/* check read/write into map element returned by bpf_table_lookup() */
> +static int check_table_access(struct verifier_env *env, int regno, int off,
> + int size)
I guess the "table" is an old name of the "map"?
> +{
> + struct bpf_map *map;
> + int map_id = env->cur_state.regs[regno].imm;
> +
> + _(get_map_info(env, map_id, &map));
> +
> + if (off < 0 || off + size > map->value_size) {
> + verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
> + map_id, map->value_size, off, size);
> + return -EACCES;
> + }
> + return 0;
> +}
[SNIP]
> +static int check_mem_access(struct verifier_env *env, int regno, int off,
> + int bpf_size, enum bpf_access_type t,
> + int value_regno)
> +{
> + struct verifier_state *state = &env->cur_state;
> + int size;
> +
> + _(size = bpf_size_to_bytes(bpf_size));
> +
> + if (off % size != 0) {
> + verbose("misaligned access off %d size %d\n", off, size);
> + return -EACCES;
> + }
> +
> + if (state->regs[regno].ptr == PTR_TO_MAP) {
> + _(check_table_access(env, regno, off, size));
> + if (t == BPF_READ)
> + mark_reg_no_ptr(state->regs, value_regno);
> + } else if (state->regs[regno].ptr == PTR_TO_CTX) {
> + _(check_ctx_access(env, off, size, t));
> + if (t == BPF_READ)
> + mark_reg_no_ptr(state->regs, value_regno);
> + } else if (state->regs[regno].ptr == PTR_TO_STACK) {
> + if (off >= 0 || off < -MAX_BPF_STACK) {
> + verbose("invalid stack off=%d size=%d\n", off, size);
> + return -EACCES;
> + }
So memory (stack) access is only allowed with a stack base register and a
constant offset, right?
> + if (t == BPF_WRITE)
> + _(check_stack_write(state, off, size, value_regno));
> + else
> + _(check_stack_read(state, off, size, value_regno));
> + } else {
> + verbose("R%d invalid mem access '%s'\n",
> + regno, reg_type_str[state->regs[regno].ptr]);
> + return -EACCES;
> + }
> + return 0;
> +}
[SNIP]
> +static int check_call(struct verifier_env *env, int func_id)
> +{
> + struct verifier_state *state = &env->cur_state;
> + const struct bpf_func_proto *fn = NULL;
> + struct reg_state *regs = state->regs;
> + struct bpf_map *map = NULL;
> + struct reg_state *reg;
> + int map_id = -1;
> + int i;
> +
> + /* find function prototype */
> + if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
> + verbose("invalid func %d\n", func_id);
> + return -EINVAL;
> + }
> +
> + if (env->prog->info->ops->get_func_proto)
> + fn = env->prog->info->ops->get_func_proto(func_id);
> +
> + if (!fn || (fn->ret_type != RET_INTEGER &&
> + fn->ret_type != PTR_TO_MAP_CONDITIONAL &&
> + fn->ret_type != RET_VOID)) {
> + verbose("unknown func %d\n", func_id);
> + return -EINVAL;
> + }
> +
> + /* check args */
> + _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
> + _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
> + _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
> + _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
Missing BPF_REG_5?
> +
> + /* reset caller saved regs */
> + for (i = 0; i < CALLER_SAVED_REGS; i++) {
> + reg = regs + caller_saved[i];
> + reg->read_ok = false;
> + reg->ptr = INVALID_PTR;
> + reg->imm = 0xbadbad;
> + }
> +
> + /* update return register */
> + reg = regs + BPF_REG_0;
> + if (fn->ret_type == RET_INTEGER) {
> + reg->read_ok = true;
> + reg->ptr = INVALID_PTR;
> + } else if (fn->ret_type != RET_VOID) {
> + reg->read_ok = true;
> + reg->ptr = fn->ret_type;
> + if (fn->ret_type == PTR_TO_MAP_CONDITIONAL)
> + /*
> + * remember map_id, so that check_table_access()
> + * can check 'value_size' boundary of memory access
> + * to map element returned from bpf_table_lookup()
> + */
> + reg->imm = map_id;
> + }
> + return 0;
> +}
[SNIP]
> +#define PEAK_INT() \
s/PEAK/PEEK/ ?
Thanks,
Namhyung
> + ({ \
> + int _ret; \
> + if (cur_stack == 0) \
> + _ret = -1; \
> + else \
> + _ret = stack[cur_stack - 1]; \
> + _ret; \
> + })
> +
> +#define POP_INT() \
> + ({ \
> + int _ret; \
> + if (cur_stack == 0) \
> + _ret = -1; \
> + else \
> + _ret = stack[--cur_stack]; \
> + _ret; \
> + })
On Fri, 27 Jun 2014 17:06:03 -0700, Alexei Starovoitov wrote:
> User interface:
> cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
>
> where 123 is an id of the eBPF program priorly loaded.
> __event__ is static tracepoint event.
> (kprobe events will be supported in the future patches)
>
> eBPF programs can call in-kernel helper functions to:
> - lookup/update/delete elements in maps
> - memcmp
> - trace_printk
ISTR Steve doesn't like to use trace_printk() (at least for production
kernels) anymore. And I'm not sure it'd work if there's no existing
trace_printk() on a system.
> - load_pointer
> - dump_stack
[SNIP]
> @@ -634,6 +635,15 @@ ftrace_raw_event_##call(void *__data, proto) \
> if (ftrace_trigger_soft_disabled(ftrace_file)) \
> return; \
> \
> + if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) && \
> + unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
> + struct bpf_context __ctx; \
> + \
> + populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0); \
> + trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
> + return; \
> + } \
> + \
Hmm.. But it seems the eBPF prog is not a filter - it'd always drop the
event. And I think it's better to use a recorded entry rather than args
as a bpf_context so that tools like perf can manipulate it at compile
time based on the event format.
Thanks,
Namhyung
> __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
> \
> entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file, \
On Tue, Jul 1, 2014 at 8:11 AM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Jun 30, 2014 at 10:47 PM, Alexei Starovoitov <[email protected]> wrote:
>> On Mon, Jun 30, 2014 at 3:09 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Sat, Jun 28, 2014 at 11:36 PM, Alexei Starovoitov <[email protected]> wrote:
>>>> On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>>>
>>>>>> Sorry I don't like 'fd' direction at all.
>>>>>> 1. it will make the whole thing very socket specific and 'net' dependent.
>>>>>> but the goal here is to be able to use eBPF for tracing in embedded
>>>>>> setups. So it's gotta be net independent.
>>>>>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>>>>>> types of sockets will complicate it a lot.
>>>>>> 3. and most important. read/write operations on sockets are not
>>>>>> done every nanosecond, whereas lookup operations on bpf maps
>>>>>> are done every dozen instructions, so we cannot have any overhead
>>>>>> when accessing maps.
>>>>>> In other words the verifier is done as static analyzer. I moved all
>>>>>> the complexity to verify time, so at run-time the programs are as
>>>>>> fast as possible. I'm strongly against run-time checks in critical path,
>>>>>> since they kill performance and make the whole approach a lot less usable.
>>>>>
>>>>> I may have described my suggestion poorly. I'm suggesting that all of
>>>>> these global ids be replaced *for userspace's benefit* with fds. That
>>>>> is, a map would have an associated struct inode, and, when you load an
>>>>> eBPF program, you'd pass fds into the kernel instead of global ids.
>>>>> The kernel would still compile the eBPF program to use the global ids,
>>>>> though.
>>>>
>>>> Hmm. If I understood you correctly, you're suggesting to do it similar
>>>> to ipc/mqueue, shmem, sockets do. By registering and mounting
>>>> a file system and providing all superblock and inode hooks… and
>>>> probably have its own namespace type… hmm… may be. That's
>>>> quite a bit of work to put lightly. As I said in the other email the first
>>>> step is root only and all these complexity just not worth doing
>>>> at this stage.
>>>
>>> The downside of not doing it right away is that it's harder to
>>> retrofit in without breaking early users.
>>>
>>> You might be able to get away with using anon_inodes. That will
>>
>> Spent quite a bit of time playing with anon_inode_getfd(). The model
>> works ok for seccomp, but doesn't seem to work for tracing,
>> since tracepoints are global. Say, syscall(bpf, load_prog) returns
>> a process-local fd. This 'fd' as a string can be written to
>> debugfs/tracing/events/.../filter which will increment a refcnt of a global
>> ebpf_program structure and will keep using it. When process exits it will
>> close all fds which in case of ebpf_prog_fd should be a nop, since
>> the program is still attached to a global event. Now we have a
>> program and maps that still alive and dangling, since tracepoint events
>> keep coming, but no new process can access it. Here we just lost all
>> benefits of making it 'fd' based. Theoretically we can extend tracing to
>> be fd-based too and tracepoints will auto-detach upon process exit,
>> but that's not going to work for all other global events. Like networking
>> components (bridge, ovs, …) are global and they won't be adding
>> fd-based interfaces.
>> I'm still thinking about it, but it looks like that any process-local
>> ebpf_prog_id scheme is not going to work for global events. Thoughts?
>
> Hmm. Maybe these things do need global ids for tracing, or at least
> there need to be some way to stash them somewhere and find them again.
> I suppose that debugfs could have symlinks to them, but I don't know
> how hard that would be to implement or how awkward it would be to use.
>
> I imagine there's some awkwardness regardless. For tracing, if I
> create map 75 and eBPF program 492 that uses map 75, then I still need
> to remember that map 75 is the map I want (or I need to parse the eBPF
> program later on).
>
> How do you imagine the userspace code working? Maybe it would make
> sense to add some nlattrs for eBPF programs to map between referenced
> objects and nicknames for them. Then user code could look at
> /sys/kernel/debug/whatever/nickname_of_map to resolve the map id or
> even just open it directly.
I want to avoid string names, since they will force new 'strtab', 'symtab'
sections in the programs/maps and will uglify the user interface quite a bit.
Back in september one loadable unit was: one eBPF program + set of maps,
but tracing requirements forced a change, since multiple programs need
to access the same map and maps may need to be pre-populated before
the programs start executing, so I've split maps and programs into mostly
independent entities, but programs still need to think of maps as local:
For example I want to do a skb leak check 'tracing filter':
- attach this program to kretprobe of __alloc_skb():
u64 key = (u64) skb;
u64 value = bpf_get_time();
bpf_update_map_elem(1/*const_map_id*/, &key, &value);
- attach this program to consume_skb and kfree_skb tracepoints:
u64 key = (u64) skb;
bpf_delete_map_elem(1/*const_map_id*/, &key);
- and have user space do:
prior to loading:
bpf_create_map(1/*map_id*/, 8/*key_size*/, 8/*value*/, 1M /*max_entries*/)
and then periodically iterate the map to see whether any skb stayed
in the map for too long.
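For illustration only, the user space side could look roughly like this
(wrapper names are illustrative; bpf_map_get_next_key() and nsecs_now() are
hypothetical helpers, not part of this patch set, and how exactly a map is
iterated is still an open question):

	__u64 key = 0, next_key, alloc_time;

	/* walk all entries; anything older than 10 sec is a suspected leak */
	while (bpf_map_get_next_key(1 /*map_id*/, &key, &next_key) == 0) {
		if (bpf_map_lookup_elem(1 /*map_id*/, &next_key, &alloc_time) == 0 &&
		    nsecs_now() - alloc_time > 10ULL * 1000 * 1000 * 1000)
			printf("skb 0x%llx alive for more than 10 sec\n", next_key);
		key = next_key;
	}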
Programs need to be written with hard-coded map_ids, otherwise usability
suffers, so I went with a global 32-bit id in this RFC, but that indeed doesn't
work for an unprivileged chrome browser unless the programs were previously
loaded by root and chrome only attaches them to seccomp.
So here is the non-root bpf syscall interface I'm thinking about:
ufd = bpf_create_map(map_id, key_size, value_size, max_entries);
it will create a global map in the system which will be accessible
in this process via 'ufd'. Internally this 'ufd' will be assigned a global map_id
and the process-local map_id that was passed as the 1st argument.
To do update/lookup the process will use bpf_map_xxx_elem(ufd,…)
Then to load eBPF program the process will do:
ufd = bpf_prog_load(prog_type, ebpf_insn_array, license)
and instructions will be referring to maps via local map_id that
was hard coded as part of the program.
Beyond the normal create_map, update/lookup/delete, load_prog
operations (that are accessible to both root and non-root), the root user
gains one more operation: bpf_get_global_id(ufd), which returns the
global map_id or prog_id. This id can be attached to global events
like tracing. Non-root users lose the ability to do delete_map and
unload_prog (they do close(ufd) instead), so these ops are root-only
and operate on global ids.
This is the cleanest way I could think of to combine non-root
security, per-process id and global id all in one API. Thoughts?
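To make it concrete, a rough user space sketch of the flow above (the wrapper
names follow the text above and are illustrative, not a final API):

	/* non-root: everything goes through process-local fds */
	int map_ufd = bpf_create_map(1 /*local map_id*/, 8 /*key*/, 8 /*value*/, 1024);
	__u64 key = 1, value = 42;

	bpf_map_update_elem(map_ufd, &key, &value);

	/* insns refer to the map via the local map_id 1 hard coded in the program */
	int prog_ufd = bpf_prog_load(prog_type, insns, "GPL");

	/* root only: get the global id to attach the program to a global event */
	int prog_id = bpf_get_global_id(prog_ufd);
	/* echo prog_id > /sys/kernel/debug/tracing/events/.../filter */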
On Tue, Jul 1, 2014 at 9:23 PM, Namhyung Kim <[email protected]> wrote:
> Hi Alexei,
>
> On Fri, 27 Jun 2014 17:05:53 -0700, Alexei Starovoitov wrote:
>> BPF is used in several kernel components. This split creates logical boundary
>> between generic eBPF core and the rest
>>
>> kernel/bpf/core.c: eBPF interpreter
>>
>> net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters
>>
>> This patch only moves functions.
>>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
>> ---
>> kernel/Makefile | 1 +
>> kernel/bpf/Makefile | 1 +
>> kernel/bpf/core.c | 545 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> net/core/filter.c | 520 ------------------------------------------------
>> 4 files changed, 547 insertions(+), 520 deletions(-)
>> create mode 100644 kernel/bpf/Makefile
>> create mode 100644 kernel/bpf/core.c
>>
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index f2a8b6246ce9..e7360b7c2c0e 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
>> obj-$(CONFIG_TRACEPOINTS) += trace/
>> obj-$(CONFIG_IRQ_WORK) += irq_work.o
>> obj-$(CONFIG_CPU_PM) += cpu_pm.o
>> +obj-$(CONFIG_NET) += bpf/
>
> But this still requires CONFIG_NET to use bpf. Why not add
> CONFIG_BPF and make CONFIG_NET select it?
This first patch does the split only. A later patch replaces this line
with CONFIG_BPF.
On Tue, Jul 1, 2014 at 10:05 PM, Namhyung Kim <[email protected]> wrote:
> Mostly questions and few nitpicks.. :)
great questions. Thank you for review! Answers below:
> On Fri, 27 Jun 2014 17:06:00 -0700, Alexei Starovoitov wrote:
>> +/* types of values:
>> + * - stored in an eBPF register
>> + * - passed into helper functions as an argument
>> + * - returned from helper functions
>> + */
>> +enum bpf_reg_type {
>> + INVALID_PTR, /* reg doesn't contain a valid pointer */
>
> I don't think it's a good name. The INVALID_PTR can be read as it
> contains a "pointer" which is invalid. Maybe INTEGER, NUMBER or
> something different can be used. And I think the struct reg_state->ptr
> should be renamed also.
Ok. I agree that the 'invalid' part of the name is too negative.
Maybe 'unknown_value'?
>> + PTR_TO_CTX, /* reg points to bpf_context */
>> + PTR_TO_MAP, /* reg points to map element value */
>> + PTR_TO_MAP_CONDITIONAL, /* points to map element value or NULL */
>> + PTR_TO_STACK, /* reg == frame_pointer */
>> + PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
>> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
>> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
>
> So, these PTR_TO_STACK_IMM[_*] types are only for function argument,
> right? I guessed it could be used to access memory in general too, but
> then I thought it'd make verification complicated..
>
> And I also agree that it'd better splitting reg types and function
> argument constraints.
Ok. Will split this enum into three.
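Something along these lines (the names below are only illustrative):

	enum bpf_reg_type {
		UNKNOWN_VALUE,		/* formerly INVALID_PTR */
		PTR_TO_CTX,
		PTR_TO_MAP,
		PTR_TO_MAP_CONDITIONAL,
		PTR_TO_STACK,
		PTR_TO_STACK_IMM,
	};

	enum bpf_ret_type {
		RET_INTEGER,
		RET_VOID,
	};

	enum bpf_arg_type {
		ARG_DONTCARE,
		ARG_CONST_MAP_ID,
		ARG_PTR_TO_STACK_MAP_KEY,
		ARG_PTR_TO_STACK_MAP_VALUE,
		ARG_CONST_STACK_SIZE,
	};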
>> +
>> +/* check read/write into map element returned by bpf_table_lookup() */
>> +static int check_table_access(struct verifier_env *env, int regno, int off,
>> + int size)
>
> I guess the "table" is an old name of the "map"?
Oops :) Yes. I was calling them 'bpf tables' initially, but it created too
strong a correlation with 'hash table', so I changed the name to 'map'
to stress that this is a generic key/value store and not just a hash table.
>> + } else if (state->regs[regno].ptr == PTR_TO_STACK) {
>> + if (off >= 0 || off < -MAX_BPF_STACK) {
>> + verbose("invalid stack off=%d size=%d\n", off, size);
>> + return -EACCES;
>> + }
>
> So memory (stack) access is only allowed for a stack base register and a
> constant offset, right?
Correct.
In other words it allows instructions:
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_xx, -stack_offset);
The verifier makes no attempt to track pointer arithmetic and just marks
the result as 'invalid_ptr'.
For non-root programs it will reject any program that tries to do
arithmetic on pointers (that check is not part of this patch set yet).
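For example, the following sequence is accepted up to the point where the
result of the pointer arithmetic is dereferenced (same insn macro style as
in the documentation examples):

	BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),	/* R2 = frame pointer */
	BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_1),	/* R2 += R1, R2 becomes 'invalid_ptr' */
	BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),		/* rejected: R2 is not a valid pointer */
	BPF_EXIT_INSN(),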
>> + /* check args */
>> + _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
>> + _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
>> + _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
>> + _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
>
> Missing BPF_REG_5?
yes. good catch.
I guess this shows that we didn't have a use case for a function with 5 args :)
Will fix this.
>> +#define PEAK_INT() \
>
> s/PEAK/PEEK/ ?
aren't these the same? ;))
Will fix. Thanks!
On Tue, Jul 1, 2014 at 10:32 PM, Namhyung Kim <[email protected]> wrote:
> On Fri, 27 Jun 2014 17:06:03 -0700, Alexei Starovoitov wrote:
>> User interface:
>> cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
>>
>> where 123 is an id of the eBPF program priorly loaded.
>> __event__ is static tracepoint event.
>> (kprobe events will be supported in the future patches)
>>
>> eBPF programs can call in-kernel helper functions to:
>> - lookup/update/delete elements in maps
>> - memcmp
>> - trace_printk
>
> ISTR Steve doesn't like to use trace_printk() (at least for production
> kernels) anymore. And I'm not sure it'd work if there's no existing
> trace_printk() on a system.
Yes. I saw the big warning that trace_printk_init_buffers() emits.
The idea here is to use eBPF programs for live kernel debugging.
Instead of adding a printk() and recompiling, just write a program,
attach it to some event, and printk whatever is interesting.
My only concern about printk() was that it dumps things into trace
buffers (which is still better than dumping stuff to syslog), but now
(since Andy almost convinced me to switch to an 'fd' based interface)
we can have a seq_printk-like helper that prints into a special buffer,
so that user space does 'read(ufd)' and receives whatever the program
has printed. I think that would be much cleaner.
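Roughly like this on the user space side (the interface doesn't exist yet,
this is just to illustrate the idea):

	char buf[4096];
	int n;

	/* ufd is the fd that was returned when the program was loaded */
	while ((n = read(ufd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);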
>> + if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) && \
>> + unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
>> + struct bpf_context __ctx; \
>> + \
>> + populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0); \
>> + trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
>> + return; \
>> + } \
>> + \
>
> Hmm.. But it seems the eBPF prog is not a filter - it'd always drop the
> event. And I think it's better to use a recorded entry rather then args
> as a bpf_context so that tools like perf can manipulate it at compile
> time based on the event format.
Can manipulate what at compile time? Entry records of tracepoints are
hard coded based on the event. For the verifier it's easier to treat all
tracepoint events as if they received the same 'struct bpf_context'
of N arguments, so that the same program can be attached to multiple
tracepoint events at the same time.
I thought about making the verifier specific to _every_ tracepoint event,
but it complicates the user interface, since 'bpf_context' would then be
different for every program. I think args are much easier to deal with
from a C programming point of view, since the program can go and fetch the
same fields that the tracepoint 'fast_assign' macro does.
Also, skipping buffer allocation and fast_assign gives a very sizable
performance boost, since the program will access only what it needs to.
The return value of an eBPF program is ignored, since I couldn't think
of a use case for it. We can change it to be more 'filter'-like and interpret
the return value as true/false, whether to record this event or not. Thoughts?
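If we go that route, the hook above would change to something like
(sketch only):

	populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0);
	if (!trace_filter_call_bpf(ftrace_file->filter, &__ctx))
		return;		/* program returned 0 -> don't record this event */
	/* otherwise fall through and record the event as usual */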
On Wed, Jul 2, 2014 at 3:14 PM, Alexei Starovoitov <[email protected]> wrote:
> On Tue, Jul 1, 2014 at 10:32 PM, Namhyung Kim <[email protected]> wrote:
>> On Fri, 27 Jun 2014 17:06:03 -0700, Alexei Starovoitov wrote:
>>> User interface:
>>> cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter
>>>
>>> where 123 is an id of the eBPF program priorly loaded.
>>> __event__ is static tracepoint event.
>>> (kprobe events will be supported in the future patches)
>>>
>>> eBPF programs can call in-kernel helper functions to:
>>> - lookup/update/delete elements in maps
>>> - memcmp
>>> - trace_printk
>>
>> ISTR Steve doesn't like to use trace_printk() (at least for production
>> kernels) anymore. And I'm not sure it'd work if there's no existing
>> trace_printk() on a system.
>
> yes. I saw big warning that trace_printk_init_buffers() emits.
> The idea here is to use eBPF programs for live kernel debugging.
> Instead of adding printk() and recompiling, just write a program,
> attach it to some event, and printk whatever is interesting.
> My only concern about printk() was that it dumps things into trace
> buffers (which is still better than dumping stuff to syslog), but now
> (since Andy almost convinced me to switch to 'fd' based interface)
> we can have seq_printk-like that prints into special buffer. So that
> user space does 'read(ufd)' and receives whatever program has
> printed. I think that would be much cleaner.
>
>>> + if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) && \
>>> + unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
>>> + struct bpf_context __ctx; \
>>> + \
>>> + populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0); \
>>> + trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
>>> + return; \
>>> + } \
>>> + \
>>
>> Hmm.. But it seems the eBPF prog is not a filter - it'd always drop the
>> event. And I think it's better to use a recorded entry rather then args
>> as a bpf_context so that tools like perf can manipulate it at compile
>> time based on the event format.
>
> Can manipulate what at compile time? Entry records of tracepoints are
> hard coded based on the event. For verifier it's easier to treat all
> tracepoint events as they received the same 'struct bpf_context'
> of N arguments then the same program can be attached to multiple
> tracepoint events at the same time.
I was thinking about perf creating a bpf program for filtering some
events, like recording kfree_skb if protocol == xx. So perf can
calculate the offset and size of the protocol field and generate
appropriate insns for the filter.
Maybe it needs to pass the event format to the verifier somehow then.
> I thought about making verifier specific for _every_ tracepoint event,
> but it complicates the user interface, since 'bpf_context' is now different
> for every program. I think args are much easier to deal with from C
> programming point of view, since program can go a fetch the same
> fields that tracepoint 'fast_assign' macro does.
> Also skipping buffer allocation and fast_assign gives very sizable
> performance boost, since the program will access only what it needs to.
>
> The return value of eBPF program is ignored, since I couldn't think
> of use case for it. We can change it to be more 'filter' like and interpret
> return value as true/false, whether to record this event or not. Thoughts?
Your scenario looks like just calling a bpf program when it hits an
event. It could use event triggering for that purpose IMHO.
But for filtering, it would need to check the return value of the program.
Thanks,
Namhyung
On Tue, Jul 1, 2014 at 11:39 PM, Namhyung Kim <[email protected]> wrote:
> On Wed, Jul 2, 2014 at 3:14 PM, Alexei Starovoitov <[email protected]> wrote:
>>
>> Can manipulate what at compile time? Entry records of tracepoints are
>> hard coded based on the event. For verifier it's easier to treat all
>> tracepoint events as they received the same 'struct bpf_context'
>> of N arguments then the same program can be attached to multiple
>> tracepoint events at the same time.
>
> I was thinking about perf creates a bpf program for filtering some
> events like recording kfree_skb if protocol == xx. So perf can
> calculate the offset and size of the protocol field and make
> appropriate insns for the filter.
When I'm saying 'tracing filter' in patch 11/14, I really mean
stap/dtrace-like facility for live debugging, where tracing infra plays
a key role. At the end the programs are written in C with annotations
and perf orchestrates compilation, insertion, attaching, printing results.
Your meaning of 'tracing filter' is canonical: a filter that says whether
event should be recorded or not. And it makes sense.
When perf sees 'protocol==xx' on the command line it can generate an
ebpf program for it. In that case my earlier proposal for replacing the
predicate tree walker with ebpf programs in the kernel becomes obsolete?
If I understood correctly, you're proposing to teach perf to generate
ebpf programs for the existing command line interface and use them instead
of the predicate tree. This way the predicate tree can be removed, right?
In that case programs would need to access event records.
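For the 'protocol == xx' case the generated program could be as small as
something like this (the ctx offset and the BPF_LDX_MEM macro here are
assumptions; the real offset/size would come from the event format, and
anything behind a pointer argument would have to go through a helper):

	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8),	/* r2 = 2nd field of ctx */
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0),		/* r0 = 0 -> drop */
	BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0x0800, 1),	/* if (r2 != xx) skip next insn */
	BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1),		/* r0 = 1 -> record */
	BPF_EXIT_INSN(),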
> Maybe it needs to pass the event format to the verifier somehow then.
The integer fields are easy to verify. The dynamic_array part is tricky, since
the 16-bit offset + 16-bit length accessors are very tracing specific.
I need to think it through.
> Your scenario looks like just calling a bpf program when it hits a
> event. It could use event triggering for that purpose IMHO.
Sure. Calling an ebpf program can be one of the event trigger types.
On the other hand, ebpf programs themselves can replace the whole
triggering, filtering and recording code. We can have events that
do nothing or call ebpf programs. Then the programs walk all necessary
data structures, store stuff into maps, etc. Just look at the amount of
events that perf processes. Some of that can be done in the kernel by
a dynamic program.
From: Alexei Starovoitov
...
> >> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
...
> >> + _(get_map_info(env, map_id, &map));
> >
> > Nit: such macros should be removed, please.
>
> It may surely look unconventional, but alternative is to replace
> every usage of _ macro with:
> err =
> if (err)
> return err;
>
> and since this macro is used 38 times, it will add ~120 unnecessary
> lines that will only make code much harder to follow.
> I tried not using macro and results were not pleasing.
The problem is that they are hidden control flow.
As such they make flow analysis harder for the casual reader.
The extra lines really shouldn't matter.
David
On Tue, Jul 1, 2014 at 12:18 AM, Daniel Borkmann <[email protected]> wrote:
> On 07/01/2014 01:09 AM, Kees Cook wrote:
>>
>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <[email protected]>
>> wrote:
>>>
>>> Hi All,
>>>
>>> this patch set demonstrates the potential of eBPF.
>>>
>>> First patch "net: filter: split filter.c into two files" splits eBPF
>>> interpreter
>>> out of networking into kernel/bpf/. The goal for BPF subsystem is to be
>>> usable
>>> in NET-less configuration. Though the whole set is marked is RFC, the 1st
>>> patch
>>> is good to go. Similar version of the patch that was posted few weeks
>>> ago, but
>>> was deferred. I'm assuming due to lack of forward visibility. I hope that
>>> this
>>> patch set shows what eBPF is capable of and where it's heading.
>>>
>>> Other patches expose eBPF instruction set to user space and introduce
>>> concepts
>>> of maps and programs accessible via syscall.
>>>
>>> 'maps' is a generic storage of different types for sharing data between
>>> kernel
>>> and userspace. Maps are referrenced by global id. Root can create
>>> multiple
>>> maps of different types where key/value are opaque bytes of data. It's up
>>> to
>>> user space and eBPF program to decide what they store in the maps.
>>>
>>> eBPF programs are similar to kernel modules. They live in global space
>>> and
>>> have unique prog_id. Each program is a safe run-to-completion set of
>>> instructions. eBPF verifier statically determines that the program
>>> terminates
>>> and safe to execute. During verification the program takes a hold of maps
>>> that it intends to use, so selected maps cannot be removed until program
>>> is
>>> unloaded. The program can be attached to different events. These events
>>> can
>>> be packets, tracepoint events and other types in the future. New event
>>> triggers
>>> execution of the program which may store information about the event in
>>> the maps.
>>> Beyond storing data the programs may call into in-kernel helper functions
>>> which may, for example, dump stack, do trace_printk or other forms of
>>> live
>>> kernel debugging. Same program can be attached to multiple events.
>>> Different
>>> programs can access the same map:
>>>
>>> tracepoint tracepoint tracepoint sk_buff sk_buff
>>> event A event B event C on eth0 on eth1
>>> | | | | |
>>> | | | | |
>>> --> tracing <-- tracing socket socket
>>> prog_1 prog_2 prog_3 prog_4
>>> | | | |
>>> |--- -----| |-------| map_3
>>> map_1 map_2
>>>
>>> User space (via syscall) and eBPF programs access maps concurrently.
>>>
>>> Last two patches are sample code. 1st demonstrates stateful packet
>>> inspection.
>>> It counts tcp and udp packets on eth0. Should be easy to see how this
>>> eBPF
>>> framework can be used for network analytics.
>>> 2nd sample does simple 'drop monitor'. It attaches to kfree_skb
>>> tracepoint
>>> event and counts number of packet drops at particular $pc location.
>>> User space periodically summarizes what eBPF programs recorded.
>>> In these two samples the eBPF programs are tiny and written in
>>> 'assembler'
>>> with macroses. More complex programs can be written C (llvm backend is
>>> not
>>> part of this diff to reduce 'huge' perception).
>>> Since eBPF is fully JITed on x64, the cost of running eBPF program is
>>> very
>>> small even for high frequency events. Here are the numbers comparing
>>> flow_dissector in C vs eBPF:
>>> x86_64 skb_flow_dissect() same skb (all cached) - 42 nsec per
>>> call
>>> x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per
>>> call
>>> eBPF+jit skb_flow_dissect() same skb (all cached) - 51 nsec per
>>> call
>>> eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per
>>> call
>>>
>>> Detailed explanation on eBPF verifier and safety is in patch 08/14
>>
>>
>> This is very exciting! Thanks for working on it. :)
>>
>> Between the new eBPF syscall and the new seccomp syscall, I'm really
>> looking forward to using lookup tables for seccomp filters. Under
>> certain types of filters, we'll likely see some non-trivial
>> performance improvements.
>
> Well, if I read this correctly, the eBPF syscall lets you set up maps, etc,
> but the only way to attach eBPF is via setsockopt for network filters right
> now (and via tracing). Seccomp will still make use of classic BPF, so you
> won't be able to use it there.
Currently, yes. But once this is in, and the new seccomp syscall is
in, we can add a SECCOMP_FILTER_EBPF flag to the "flags" field to
instruct seccomp to load an eBPF instead of a classic BPF. I'm excited
for the future. :)
-Kees
--
Kees Cook
Chrome OS Security
I'm in the process of reading the code, and got some questions/comments.
-Chema
On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov <[email protected]> wrote:
> Safety of eBPF programs is statically determined by the verifier, which detects:
> - loops
> - out of range jumps
> - unreachable instructions
> - invalid instructions
> - uninitialized register access
> - uninitialized stack access
> - misaligned stack access
> - out of range stack access
> - invalid calling convention
>
> It checks that
> - R1-R5 registers satisfy function prototype
> - program terminates
> - BPF_LD_ABS|IND instructions are only used in socket filters
>
> It is configured with:
>
> - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
> that provides information to the verifier which fields of 'ctx'
> are accessible (remember 'ctx' is the first argument to eBPF program)
>
> - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
> reports argument types of kernel helper functions that eBPF program
> may call, so that the verifier can check that R1-R5 types match the prototype
>
> More details in Documentation/networking/filter.txt
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> Documentation/networking/filter.txt | 233 ++++++
> include/linux/bpf.h | 48 ++
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/syscall.c | 2 +-
> kernel/bpf/verifier.c | 1431 +++++++++++++++++++++++++++++++++++
> 6 files changed, 1715 insertions(+), 2 deletions(-)
> create mode 100644 kernel/bpf/verifier.c
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index e14e486f69cd..05fee8fcedf1 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -995,6 +995,108 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
> Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
> 2 byte atomic increments are not supported.
>
> +eBPF verifier
> +-------------
> +The safety of the eBPF program is determined in two steps.
> +
> +First step does DAG check to disallow loops and other CFG validation.
> +In particular it will detect programs that have unreachable instructions.
> +(though classic BPF checker allows them)
> +
> +Second step starts from the first insn and descends all possible paths.
> +It simulates execution of every insn and observes the state change of
> +registers and stack.
> +
> +At the start of the program the register R1 contains a pointer to context
> +and has type PTR_TO_CTX.
> +If verifier sees an insn that does R2=R1, then R2 has now type
> +PTR_TO_CTX as well and can be used on the right hand side of expression.
> +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=INVALID_PTR,
> +since addition of two valid pointers makes invalid pointer.
> +
> +If register was never written to, it's not readable:
> + bpf_mov R0 = R2
> + bpf_exit
> +will be rejected, since R2 is unreadable at the start of the program.
> +
> +After kernel function call, R1-R5 are reset to unreadable and
> +R0 has a return type of the function.
> +
> +Since R6-R9 are callee saved, their state is preserved across the call.
> + bpf_mov R6 = 1
> + bpf_call foo
> + bpf_mov R0 = R6
> + bpf_exit
> +is a correct program. If there was R1 instead of R6, it would have
> +been rejected.
> +
> +Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(),
> +so that its state is preserved across calls.
> +
> +load/store instructions are allowed only with registers of valid types, which
> +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
> +For example:
> + bpf_mov R1 = 1
> + bpf_mov R2 = 2
> + bpf_xadd *(u32 *)(R1 + 3) += R2
> + bpf_exit
> +will be rejected, since R1 doesn't have a valid pointer type at the time of
> +execution of instruction bpf_xadd.
> +
> +At the start R1 contains pointer to ctx and R1 type is PTR_TO_CTX.
> +ctx is generic. The verifier is configured to know what context is for a particular
> +class of bpf programs. For example, context == skb (for socket filters) and
> +ctx == seccomp_data for seccomp filters.
> +A callback is used to customize verifier to restrict eBPF program access to only
> +certain fields within ctx structure with specified size and alignment.
> +
> +For example, the following insn:
> + bpf_ld R0 = *(u32 *)(R6 + 8)
> +intends to load a word from address R6 + 8 and store it into R0
> +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
> +that offset 8 of size 4 bytes can be accessed for reading, otherwise
> +the verifier will reject the program.
> +If R6=PTR_TO_STACK, then access should be aligned and be within
> +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
> +so it will fail verification, since it's out of bounds.
> +
> +The verifier will allow eBPF program to read data from stack only after
> +it wrote into it.
> +Classic BPF verifier does similar check with M[0-15] memory slots.
> +For example:
> + bpf_ld R0 = *(u32 *)(R10 - 4)
> + bpf_exit
> +is invalid program.
> +Though R10 is correct read-only register and has type PTR_TO_STACK
> +and R10 - 4 is within stack bounds, there were no stores into that location.
> +
> +Pointer register spill/fill is tracked as well, since four (R6-R9)
> +callee saved registers may not be enough for some programs.
> +
> +Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
> +For example, skb_get_nlattr() function has the following definition:
> + struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
> +and eBPF verifier will check that this function is always called with first
> +argument being 'ctx'. In other words R1 must have type PTR_TO_CTX
> +at the time of bpf_call insn.
> +After the call register R0 will be set to readable state, so that
> +program can access it.
> +
> +Function calls are the main mechanism to extend functionality of eBPF programs.
> +Socket filters may let programs to call one set of functions, whereas tracing
> +filters may allow completely different set.
> +
> +If a function is made accessible to an eBPF program, it needs to be thought through
> +from security point of view. The verifier will guarantee that the function is
> +called with valid arguments.
> +
> +seccomp vs socket filters have different security restrictions for classic BPF.
> +Seccomp solves this by two stage verifier: classic BPF verifier is followed
> +by seccomp verifier. In case of eBPF one configurable verifier is shared for
> +all use cases.
> +
> +See details of eBPF verifier in kernel/bpf/verifier.c
> +
> eBPF maps
> ---------
> 'maps' is a generic storage of different types for sharing data between kernel
> @@ -1064,6 +1166,137 @@ size. It will not let programs pass junk values as 'key' and 'value' to
> bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
> can safely access the pointers in all cases.
>
> +Understanding eBPF verifier messages
> +------------------------------------
> +
> +The following are few examples of invalid eBPF programs and verifier error
> +messages as seen in the log:
> +
> +Program with unreachable instructions:
> +static struct sock_filter_int prog[] = {
> + BPF_EXIT_INSN(),
> + BPF_EXIT_INSN(),
> +};
> +Error:
> + unreachable insn 1
> +
> +Program that reads uninitialized register:
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (bf) r0 = r2
> + R2 !read_ok
> +
> +Program that doesn't initialize R0 before exiting:
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_1),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (bf) r2 = r1
> + 1: (95) exit
> + R0 !read_ok
> +
> +Program that accesses stack out of bounds:
> + BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (7a) *(u64 *)(r10 +8) = 0
> + invalid stack off=8 size=8
> +
> +Program that doesn't initialize stack before passing its address into function:
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (bf) r2 = r10
> + 1: (07) r2 += -8
> + 2: (b7) r1 = 1
> + 3: (85) call 1
> + invalid indirect read from stack off -8+0 size 8
> +
> +Program that uses invalid map_id=2 while calling to map_lookup_elem() function:
> + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 2),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (7a) *(u64 *)(r10 -8) = 0
> + 1: (bf) r2 = r10
> + 2: (07) r2 += -8
> + 3: (b7) r1 = 2
> + 4: (85) call 1
> + invalid access to map_id=2
> +
> +Program that doesn't check return value of map_lookup_elem() before accessing
> +map element:
> + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (7a) *(u64 *)(r10 -8) = 0
> + 1: (bf) r2 = r10
> + 2: (07) r2 += -8
> + 3: (b7) r1 = 1
> + 4: (85) call 1
> + 5: (7a) *(u64 *)(r0 +0) = 0
> + R0 invalid mem access 'map_value_or_null'
> +
> +Program that correctly checks map_lookup_elem() returned value for NULL, but
> +accesses the memory with incorrect alignment:
> + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
> + BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (7a) *(u64 *)(r10 -8) = 0
> + 1: (bf) r2 = r10
> + 2: (07) r2 += -8
> + 3: (b7) r1 = 1
> + 4: (85) call 1
> + 5: (15) if r0 == 0x0 goto pc+1
> + R0=map_value1 R10=fp
> + 6: (7a) *(u64 *)(r0 +4) = 0
> + misaligned access off 4 size 8
> +
> +Program that correctly checks map_lookup_elem() returned value for NULL and
> +accesses memory with correct alignment in one side of 'if' branch, but fails
> +to do so in the other side of 'if' branch:
> + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> + BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
> + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
> + BPF_EXIT_INSN(),
> + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (7a) *(u64 *)(r10 -8) = 0
> + 1: (bf) r2 = r10
> + 2: (07) r2 += -8
> + 3: (b7) r1 = 1
> + 4: (85) call 1
> + 5: (15) if r0 == 0x0 goto pc+2
> + R0=map_value1 R10=fp
> + 6: (7a) *(u64 *)(r0 +0) = 0
> + 7: (95) exit
> +
> + from 5 to 8: R0=imm0 R10=fp
> + 8: (7a) *(u64 *)(r0 +0) = 1
> + R0 invalid mem access 'imm'
> +
> Testing
> -------
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 7bfcad87018e..67fd49eac904 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -47,17 +47,63 @@ struct bpf_map_type_list {
> void bpf_register_map_type(struct bpf_map_type_list *tl);
> struct bpf_map *bpf_map_get(u32 map_id);
>
> +/* types of values:
> + * - stored in an eBPF register
> + * - passed into helper functions as an argument
> + * - returned from helper functions
> + */
> +enum bpf_reg_type {
> + INVALID_PTR, /* reg doesn't contain a valid pointer */
> + PTR_TO_CTX, /* reg points to bpf_context */
> + PTR_TO_MAP, /* reg points to map element value */
> + PTR_TO_MAP_CONDITIONAL, /* points to map element value or NULL */
> + PTR_TO_STACK, /* reg == frame_pointer */
> + PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
> + PTR_TO_STACK_IMM_MAP_KEY, /* pointer to stack used as map key */
> + PTR_TO_STACK_IMM_MAP_VALUE, /* pointer to stack used as map elem */
> + RET_INTEGER, /* function returns integer */
> + RET_VOID, /* function returns void */
> + CONST_ARG, /* function expects integer constant argument */
> + CONST_ARG_MAP_ID, /* int const argument that is used as map_id */
> + /* int const argument indicating number of bytes accessed from stack
> + * previous function argument must be ptr_to_stack_imm
> + */
> + CONST_ARG_STACK_IMM_SIZE,
> +};
> +
> /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
> * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
> * instructions after verifying
> */
> struct bpf_func_proto {
> s32 func_off;
> + enum bpf_reg_type ret_type;
> + enum bpf_reg_type arg1_type;
> + enum bpf_reg_type arg2_type;
> + enum bpf_reg_type arg3_type;
> + enum bpf_reg_type arg4_type;
> + enum bpf_reg_type arg5_type;
> +};
> +
> +/* bpf_context is intentionally undefined structure. Pointer to bpf_context is
> + * the first argument to eBPF programs.
> + * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
> + */
> +struct bpf_context;
> +
> +enum bpf_access_type {
> + BPF_READ = 1,
> + BPF_WRITE = 2
> };
>
> struct bpf_verifier_ops {
> /* return eBPF function prototype for verification */
> const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
> +
> + /* return true if 'size' wide access at offset 'off' within bpf_context
> + * with 'type' (read or write) is allowed
> + */
> + bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
> };
>
> struct bpf_prog_type_list {
> @@ -78,5 +124,7 @@ struct bpf_prog_info {
>
> void free_bpf_prog_info(struct bpf_prog_info *info);
> struct sk_filter *bpf_prog_get(u32 prog_id);
> +/* verify correctness of eBPF program */
> +int bpf_check(struct sk_filter *fp);
>
> #endif /* _LINUX_BPF_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ed067e245099..597a35cc101d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -381,6 +381,7 @@ enum bpf_prog_attributes {
>
> enum bpf_prog_type {
> BPF_PROG_TYPE_UNSPEC,
> + BPF_PROG_TYPE_SOCKET_FILTER,
> };
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 558e12712ebc..95a9035e0f29 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o syscall.o hashtab.o
> +obj-y := core.o syscall.o hashtab.o verifier.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 836809b1bc4e..48d8f43da151 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -554,7 +554,7 @@ static int bpf_prog_load(int prog_id, enum bpf_prog_type type,
> mutex_lock(&bpf_map_lock);
>
> /* run eBPF verifier */
> - /* err = bpf_check(prog); */
> + err = bpf_check(prog);
>
> if (err == 0 && prog->info->used_maps) {
> /* program passed verifier and it's using some maps,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> new file mode 100644
> index 000000000000..470fce48b3b0
> --- /dev/null
> +++ b/kernel/bpf/verifier.c
> @@ -0,0 +1,1431 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/capability.h>
> +
> +/* bpf_check() is a static code analyzer that walks the BPF program
> + * instruction by instruction and updates register/stack state.
> + * All paths of conditional branches are analyzed until 'ret' insn.
> + *
> + * At the first pass depth-first-search verifies that the BPF program is a DAG.
> + * It rejects the following programs:
> + * - larger than BPF_MAXINSNS insns
> + * - if loop is present (detected via back-edge)
> + * - unreachable insns exist (shouldn't be a forest. program = one function)
This seems to me an unnecessary style restriction on user code.
> + * - ret insn is not a last insn
> + * - out of bounds or malformed jumps
> + * The second pass is all possible path descent from the 1st insn.
> + * Conditional branch target insns keep a link list of verifier states.
> + * If the state already visited, this path can be pruned.
> + * If it wasn't a DAG, such state pruning would be incorrect, since it would
> + * skip cycles. Since it's analyzing all paths through the program,
> + * the length of the analysis is limited to 32k insn, which may be hit even
> + * if insn_cnt < 4K, but there are too many branches that change stack/regs.
> + * Number of 'branches to be analyzed' is limited to 1k
> + *
> + * All registers are 64-bit (even on 32-bit arch)
> + * R0 - return register
> + * R1-R5 argument passing registers
> + * R6-R9 callee saved registers
> + * R10 - frame pointer read-only
> + *
> + * At the start of BPF program the register R1 contains a pointer to bpf_context
> + * and has type PTR_TO_CTX.
> + *
> + * R10 has type PTR_TO_STACK. The sequence 'mov Rd, R10; add Rd, imm' changes
> + * Rd state to PTR_TO_STACK_IMM and immediate constant is saved for further
> + * stack bounds checking
> + *
> + * registers used to pass pointers to function calls are verified against
> + * function prototypes
> + *
> + * Example: before the call to bpf_map_lookup_elem(),
> + * R1 must contain integer constant and R2 PTR_TO_STACK_IMM_MAP_KEY
> + * Integer constant in R1 is a map_id. The verifier checks that map_id is valid
> + * and corresponding map->key_size fetched to check that
> + * [R3, R3 + map_info->key_size) are within stack limits and all that stack
> + * memory was initialized earlier by the BPF program.
> + * After bpf_table_lookup() call insn, R0 is set to PTR_TO_MAP_CONDITIONAL
> + * R1-R5 are cleared and no longer readable (but still writeable).
> + *
> + * bpf_table_lookup() function returns either a pointer to map value or NULL
> + * which is type PTR_TO_MAP_CONDITIONAL. Once it passes through !=0 insn
> + * the register holding that pointer in the true branch changes state to
> + * PTR_TO_MAP and the same register changes state to INVALID_PTR in the false
> + * branch. See check_cond_jmp_op()
> + *
> + * load/store alignment is checked
> + * Ex: BPF_STX|BPF_W [Rd + 3] = Rs is rejected, because it's misaligned
> + *
> + * load/store to stack bounds checked and register spill is tracked
> + * Ex: BPF_STX|BPF_B [R10 + 0] = Rs is rejected, because it's out of bounds
> + *
> + * load/store to map bounds checked and map_id provides map size
> + * Ex: BPF_STX|BPF_H [Rd + 8] = Rs is ok, if Rd is PTR_TO_MAP and
> + * 8 + sizeof(u16) <= map_info->value_size
> + *
> + * load/store to bpf_context checked against known fields
> + */
> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
+1 to removing the _ macro. If you want to avoid the 3 lines (is there
anything in the style guide against "if ((err=OP) < 0) ..." ?), at
least use some meaningful macro name (DO_AND_CHECK, or something like
that).
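e.g. something like (just a sketch):

	/* open-coded */
	err = get_map_info(env, map_id, &map);
	if (err)
		return err;

	/* or the same macro under a more descriptive name */
	#define DO_AND_CHECK(OP) ({ int __ret = (OP); if (__ret < 0) return __ret; })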
> +
> +struct reg_state {
> + enum bpf_reg_type ptr;
> + int imm;
> + bool read_ok;
> +};
> +
> +enum bpf_stack_slot_type {
> + STACK_INVALID, /* nothing was stored in this stack slot */
> + STACK_SPILL, /* 1st byte of register spilled into stack */
> + STACK_SPILL_PART, /* other 7 bytes of register spill */
> + STACK_MISC /* BPF program wrote some data into this slot */
> +};
> +
> +struct bpf_stack_slot {
> + enum bpf_stack_slot_type type;
> + enum bpf_reg_type ptr;
> + int imm;
> +};
> +
> +/* state of the program:
> + * type of all registers and stack info
> + */
> +struct verifier_state {
> + struct reg_state regs[MAX_BPF_REG];
> + struct bpf_stack_slot stack[MAX_BPF_STACK];
> +};
> +
> +/* linked list of verifier states used to prune search */
> +struct verifier_state_list {
> + struct verifier_state state;
> + struct verifier_state_list *next;
> +};
> +
> +/* verifier_state + insn_idx are pushed to stack when branch is encountered */
> +struct verifier_stack_elem {
> + /* verifer state is 'st'
> + * before processing instruction 'insn_idx'
> + * and after processing instruction 'prev_insn_idx'
> + */
> + struct verifier_state st;
> + int insn_idx;
> + int prev_insn_idx;
> + struct verifier_stack_elem *next;
> +};
> +
> +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
> +
> +/* single container for all structs
> + * one verifier_env per bpf_check() call
> + */
> +struct verifier_env {
> + struct sk_filter *prog; /* eBPF program being verified */
> + struct verifier_stack_elem *head; /* stack of verifier states to be processed */
> + int stack_size; /* number of states to be processed */
> + struct verifier_state cur_state; /* current verifier state */
> + struct verifier_state_list **branch_landing; /* search prunning optimization */
> + u32 used_maps[MAX_USED_MAPS]; /* array of map_id's used by eBPF program */
> + u32 used_map_cnt; /* number of used maps */
> +};
> +
> +/* verbose verifier prints what it's seeing
> + * bpf_check() is called under map lock, so no race to access this global var
> + */
> +static bool verbose_on;
> +
> +/* when verifier rejects eBPF program, it does a second path with verbose on
> + * to dump the verification trace to the log, so the user can figure out what's
> + * wrong with the program
> + */
> +static int verbose(const char *fmt, ...)
> +{
> + va_list args;
> + int ret;
> +
> + if (!verbose_on)
> + return 0;
> +
> + va_start(args, fmt);
> + ret = vprintk(fmt, args);
> + va_end(args);
> + return ret;
> +}
> +
> +/* string representation of 'enum bpf_reg_type' */
> +static const char * const reg_type_str[] = {
> + [INVALID_PTR] = "inv",
> + [PTR_TO_CTX] = "ctx",
> + [PTR_TO_MAP] = "map_value",
> + [PTR_TO_MAP_CONDITIONAL] = "map_value_or_null",
> + [PTR_TO_STACK] = "fp",
> + [PTR_TO_STACK_IMM] = "fp",
> + [PTR_TO_STACK_IMM_MAP_KEY] = "fp_key",
> + [PTR_TO_STACK_IMM_MAP_VALUE] = "fp_value",
> + [RET_INTEGER] = "ret_int",
> + [RET_VOID] = "ret_void",
> + [CONST_ARG] = "imm",
> + [CONST_ARG_MAP_ID] = "map_id",
> + [CONST_ARG_STACK_IMM_SIZE] = "imm_size",
> +};
> +
> +static void pr_cont_verifier_state(struct verifier_env *env)
> +{
> + enum bpf_reg_type ptr;
> + int i;
> +
> + for (i = 0; i < MAX_BPF_REG; i++) {
> + if (!env->cur_state.regs[i].read_ok)
> + continue;
> + ptr = env->cur_state.regs[i].ptr;
> + pr_cont(" R%d=%s", i, reg_type_str[ptr]);
> + if (ptr == CONST_ARG ||
> + ptr == PTR_TO_STACK_IMM ||
> + ptr == PTR_TO_MAP_CONDITIONAL ||
> + ptr == PTR_TO_MAP)
> + pr_cont("%d", env->cur_state.regs[i].imm);
> + }
> + for (i = 0; i < MAX_BPF_STACK; i++) {
> + if (env->cur_state.stack[i].type == STACK_SPILL)
> + pr_cont(" fp%d=%s", -MAX_BPF_STACK + i,
> + reg_type_str[env->cur_state.stack[i].ptr]);
> + }
> + pr_cont("\n");
> +}
> +
> +static const char *const bpf_class_string[] = {
> + "ld", "ldx", "st", "stx", "alu", "jmp", "BUG", "alu64"
> +};
> +
> +static const char *const bpf_alu_string[] = {
> + "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
> + "%=", "^=", "=", "s>>=", "endian", "BUG", "BUG"
> +};
> +
> +static const char *const bpf_ldst_string[] = {
> + "u32", "u16", "u8", "u64"
> +};
> +
> +static const char *const bpf_jmp_string[] = {
> + "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call", "exit"
> +};
> +
> +static void pr_cont_bpf_insn(struct sock_filter_int *insn)
> +{
> + u8 class = BPF_CLASS(insn->code);
> +
> + if (class == BPF_ALU || class == BPF_ALU64) {
> + if (BPF_SRC(insn->code) == BPF_X)
> + pr_cont("(%02x) %sr%d %s %sr%d\n",
> + insn->code, class == BPF_ALU ? "(u32) " : "",
> + insn->dst_reg,
> + bpf_alu_string[BPF_OP(insn->code) >> 4],
> + class == BPF_ALU ? "(u32) " : "",
> + insn->src_reg);
> + else
> + pr_cont("(%02x) %sr%d %s %s%d\n",
> + insn->code, class == BPF_ALU ? "(u32) " : "",
> + insn->dst_reg,
> + bpf_alu_string[BPF_OP(insn->code) >> 4],
> + class == BPF_ALU ? "(u32) " : "",
> + insn->imm);
> + } else if (class == BPF_STX) {
> + if (BPF_MODE(insn->code) == BPF_MEM)
> + pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->dst_reg,
> + insn->off, insn->src_reg);
> + else if (BPF_MODE(insn->code) == BPF_XADD)
> + pr_cont("(%02x) lock *(%s *)(r%d %+d) += r%d\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->dst_reg, insn->off,
> + insn->src_reg);
> + else
> + pr_cont("BUG_%02x\n", insn->code);
> + } else if (class == BPF_ST) {
> + if (BPF_MODE(insn->code) != BPF_MEM) {
> + pr_cont("BUG_st_%02x\n", insn->code);
> + return;
> + }
> + pr_cont("(%02x) *(%s *)(r%d %+d) = %d\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->dst_reg,
> + insn->off, insn->imm);
> + } else if (class == BPF_LDX) {
> + if (BPF_MODE(insn->code) != BPF_MEM) {
> + pr_cont("BUG_ldx_%02x\n", insn->code);
> + return;
> + }
> + pr_cont("(%02x) r%d = *(%s *)(r%d %+d)\n",
> + insn->code, insn->dst_reg,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->src_reg, insn->off);
Can you please add:
+ } else if (class == BPF_LD) {
+ if (BPF_MODE(insn->code) == BPF_ABS) {
+ pr_cont("(%02x) r0 = *(%s *)skb[%d]\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->imm);
+ } else if (BPF_MODE(insn->code) == BPF_IND) {
+ pr_cont("(%02x) r0 = *(%s *)skb[r%d + %d]\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->src_reg, insn->imm);
+ } else {
+ pr_cont("BUG_ld_%02x\n", insn->code);
+ return;
+ }
Note that I'm hardcoding r0 (instead of using %d for insn->dst_reg)
because that's how ebpf writes the instructions.
> + } else if (class == BPF_JMP) {
> + u8 opcode = BPF_OP(insn->code);
> +
> + if (opcode == BPF_CALL) {
> + pr_cont("(%02x) call %d\n", insn->code, insn->imm);
> + } else if (insn->code == (BPF_JMP | BPF_JA)) {
> + pr_cont("(%02x) goto pc%+d\n",
> + insn->code, insn->off);
> + } else if (insn->code == (BPF_JMP | BPF_EXIT)) {
> + pr_cont("(%02x) exit\n", insn->code);
> + } else if (BPF_SRC(insn->code) == BPF_X) {
> + pr_cont("(%02x) if r%d %s r%d goto pc%+d\n",
> + insn->code, insn->dst_reg,
> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
> + insn->src_reg, insn->off);
> + } else {
> + pr_cont("(%02x) if r%d %s 0x%x goto pc%+d\n",
> + insn->code, insn->dst_reg,
> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
> + insn->imm, insn->off);
> + }
> + } else {
> + pr_cont("(%02x) %s\n", insn->code, bpf_class_string[class]);
> + }
> +}
> +
> +static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
> +{
> + struct verifier_stack_elem *elem;
> + int insn_idx;
> +
> + if (env->head == NULL)
> + return -1;
> +
> + memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
> + insn_idx = env->head->insn_idx;
> + if (prev_insn_idx)
> + *prev_insn_idx = env->head->prev_insn_idx;
> + elem = env->head->next;
> + kfree(env->head);
> + env->head = elem;
> + env->stack_size--;
> + return insn_idx;
> +}
> +
> +static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx,
> + int prev_insn_idx)
> +{
> + struct verifier_stack_elem *elem;
> +
> + elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
> + if (!elem)
> + goto err;
> +
> + memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
> + elem->insn_idx = insn_idx;
> + elem->prev_insn_idx = prev_insn_idx;
> + elem->next = env->head;
> + env->head = elem;
> + env->stack_size++;
> + if (env->stack_size > 1024) {
> + verbose("BPF program is too complex\n");
> + goto err;
> + }
> + return &elem->st;
> +err:
> + /* pop all elements and return */
> + while (pop_stack(env, NULL) >= 0);
> + return NULL;
> +}
> +
> +#define CALLER_SAVED_REGS 6
> +static const int caller_saved[CALLER_SAVED_REGS] = {
> + BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5
> +};
> +
> +static void init_reg_state(struct reg_state *regs)
> +{
> + struct reg_state *reg;
> + int i;
> +
> + for (i = 0; i < MAX_BPF_REG; i++) {
> + regs[i].ptr = INVALID_PTR;
> + regs[i].read_ok = false;
> + regs[i].imm = 0xbadbad;
> + }
> + reg = regs + BPF_REG_FP;
Any reason you're switching from the array syntax to the pointer one? I
find "reg = &regs[BPF_REG_FP];" more readable (and it matches the form
you used in the loop).
> + reg->ptr = PTR_TO_STACK;
> + reg->read_ok = true;
> +
> + reg = regs + BPF_REG_1; /* 1st arg to a function */
> + reg->ptr = PTR_TO_CTX;
Wait, doesn't this depend on doing "BPF_MOV64_REG(BPF_REG_CTX,
BPF_REG_ARG1)" (the bpf-to-ebpf prologue), which is only enforced on
filters converted from bpf? In fact, shouldn't this set
regs[BPF_REG_CTX] instead of regs[BPF_REG_1] ?
> + reg->read_ok = true;
> +}
> +
> +static void mark_reg_no_ptr(struct reg_state *regs, int regno)
> +{
> + regs[regno].ptr = INVALID_PTR;
> + regs[regno].imm = 0xbadbad;
> + regs[regno].read_ok = true;
> +}
> +
> +static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
> +{
> + if (is_src) {
> + if (!regs[regno].read_ok) {
> + verbose("R%d !read_ok\n", regno);
> + return -EACCES;
> + }
> + } else {
> + if (regno == BPF_REG_FP)
> + /* frame pointer is read only */
> + return -EACCES;
> + mark_reg_no_ptr(regs, regno);
> + }
> + return 0;
> +}
> +
> +static int bpf_size_to_bytes(int bpf_size)
> +{
> + if (bpf_size == BPF_W)
> + return 4;
> + else if (bpf_size == BPF_H)
> + return 2;
> + else if (bpf_size == BPF_B)
> + return 1;
> + else if (bpf_size == BPF_DW)
> + return 8;
> + else
> + return -EACCES;
> +}
> +
> +static int check_stack_write(struct verifier_state *state, int off, int size,
> + int value_regno)
> +{
> + struct bpf_stack_slot *slot;
> + int i;
> +
> + if (value_regno >= 0 &&
> + (state->regs[value_regno].ptr == PTR_TO_MAP ||
> + state->regs[value_regno].ptr == PTR_TO_STACK_IMM ||
> + state->regs[value_regno].ptr == PTR_TO_CTX)) {
> +
> + /* register containing pointer is being spilled into stack */
> + if (size != 8) {
> + verbose("invalid size of register spill\n");
> + return -EACCES;
> + }
> +
> + slot = &state->stack[MAX_BPF_STACK + off];
> + slot->type = STACK_SPILL;
> + /* save register state */
> + slot->ptr = state->regs[value_regno].ptr;
> + slot->imm = state->regs[value_regno].imm;
> + for (i = 1; i < 8; i++) {
> + slot = &state->stack[MAX_BPF_STACK + off + i];
> + slot->type = STACK_SPILL_PART;
> + slot->ptr = 0;
> + slot->imm = 0;
> + }
> + } else {
> +
> + /* regular write of data into stack */
> + for (i = 0; i < size; i++) {
> + slot = &state->stack[MAX_BPF_STACK + off + i];
> + slot->type = STACK_MISC;
> + slot->ptr = 0;
> + slot->imm = 0;
> + }
> + }
> + return 0;
> +}
> +
> +static int check_stack_read(struct verifier_state *state, int off, int size,
> + int value_regno)
> +{
> + int i;
> + struct bpf_stack_slot *slot;
> +
> + slot = &state->stack[MAX_BPF_STACK + off];
> +
> + if (slot->type == STACK_SPILL) {
> + if (size != 8) {
> + verbose("invalid size of register spill\n");
> + return -EACCES;
> + }
> + for (i = 1; i < 8; i++) {
> + if (state->stack[MAX_BPF_STACK + off + i].type !=
> + STACK_SPILL_PART) {
> + verbose("corrupted spill memory\n");
> + return -EACCES;
> + }
> + }
> +
> + /* restore register state from stack */
> + state->regs[value_regno].ptr = slot->ptr;
> + state->regs[value_regno].imm = slot->imm;
> + state->regs[value_regno].read_ok = true;
> + return 0;
> + } else {
> + for (i = 0; i < size; i++) {
> + if (state->stack[MAX_BPF_STACK + off + i].type !=
> + STACK_MISC) {
> + verbose("invalid read from stack off %d+%d size %d\n",
> + off, i, size);
> + return -EACCES;
> + }
> + }
> + /* have read misc data from the stack */
> + mark_reg_no_ptr(state->regs, value_regno);
> + return 0;
> + }
> +}
> +
> +static int remember_map_id(struct verifier_env *env, u32 map_id)
> +{
> + int i;
> +
> + /* check whether we recorded this map_id already */
> + for (i = 0; i < env->used_map_cnt; i++)
> + if (env->used_maps[i] == map_id)
> + return 0;
> +
> + if (env->used_map_cnt >= MAX_USED_MAPS)
> + return -E2BIG;
> +
> + /* remember this map_id */
> + env->used_maps[env->used_map_cnt++] = map_id;
> + return 0;
> +}
> +
> +static int get_map_info(struct verifier_env *env, u32 map_id,
> + struct bpf_map **map)
> +{
> + /* if BPF program contains bpf_table_lookup(map_id, key)
> + * the incorrect map_id will be caught here
> + */
> + *map = bpf_map_get(map_id);
> + if (!*map) {
> + verbose("invalid access to map_id=%d\n", map_id);
> + return -EACCES;
> + }
> +
> + _(remember_map_id(env, map_id));
> +
> + return 0;
> +}
> +
> +/* check read/write into map element returned by bpf_table_lookup() */
> +static int check_table_access(struct verifier_env *env, int regno, int off,
> + int size)
> +{
> + struct bpf_map *map;
> + int map_id = env->cur_state.regs[regno].imm;
> +
> + _(get_map_info(env, map_id, &map));
> +
> + if (off < 0 || off + size > map->value_size) {
> + verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
> + map_id, map->value_size, off, size);
> + return -EACCES;
> + }
> + return 0;
> +}
> +
> +/* check access to 'struct bpf_context' fields */
> +static int check_ctx_access(struct verifier_env *env, int off, int size,
> + enum bpf_access_type t)
> +{
> + if (env->prog->info->ops->is_valid_access &&
> + env->prog->info->ops->is_valid_access(off, size, t))
> + return 0;
> +
> + verbose("invalid bpf_context access off=%d size=%d\n", off, size);
> + return -EACCES;
> +}
> +
> +static int check_mem_access(struct verifier_env *env, int regno, int off,
> + int bpf_size, enum bpf_access_type t,
> + int value_regno)
> +{
> + struct verifier_state *state = &env->cur_state;
> + int size;
> +
> + _(size = bpf_size_to_bytes(bpf_size));
> +
> + if (off % size != 0) {
> + verbose("misaligned access off %d size %d\n", off, size);
> + return -EACCES;
> + }
> +
> + if (state->regs[regno].ptr == PTR_TO_MAP) {
> + _(check_table_access(env, regno, off, size));
> + if (t == BPF_READ)
> + mark_reg_no_ptr(state->regs, value_regno);
> + } else if (state->regs[regno].ptr == PTR_TO_CTX) {
> + _(check_ctx_access(env, off, size, t));
> + if (t == BPF_READ)
> + mark_reg_no_ptr(state->regs, value_regno);
> + } else if (state->regs[regno].ptr == PTR_TO_STACK) {
> + if (off >= 0 || off < -MAX_BPF_STACK) {
> + verbose("invalid stack off=%d size=%d\n", off, size);
> + return -EACCES;
> + }
> + if (t == BPF_WRITE)
> + _(check_stack_write(state, off, size, value_regno));
> + else
> + _(check_stack_read(state, off, size, value_regno));
> + } else {
> + verbose("R%d invalid mem access '%s'\n",
> + regno, reg_type_str[state->regs[regno].ptr]);
> + return -EACCES;
> + }
> + return 0;
> +}
> +
> +/* when register 'regno' is passed into function that will read 'access_size'
> + * bytes from that pointer, make sure that it's within stack boundary
> + * and all elements of stack are initialized
> + */
> +static int check_stack_boundary(struct verifier_env *env,
> + int regno, int access_size)
> +{
> + struct verifier_state *state = &env->cur_state;
> + struct reg_state *regs = state->regs;
> + int off, i;
> +
> + if (regs[regno].ptr != PTR_TO_STACK_IMM)
> + return -EACCES;
> +
> + off = regs[regno].imm;
> + if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
> + access_size <= 0) {
> + verbose("invalid stack ptr R%d off=%d access_size=%d\n",
> + regno, off, access_size);
> + return -EACCES;
> + }
> +
> + for (i = 0; i < access_size; i++) {
> + if (state->stack[MAX_BPF_STACK + off + i].type != STACK_MISC) {
> + verbose("invalid indirect read from stack off %d+%d size %d\n",
> + off, i, access_size);
> + return -EACCES;
> + }
> + }
> + return 0;
> +}
> +
> +static int check_func_arg(struct verifier_env *env, int regno,
> + enum bpf_reg_type arg_type, int *map_id,
> + struct bpf_map **mapp)
> +{
> + struct reg_state *reg = env->cur_state.regs + regno;
> + enum bpf_reg_type expected_type;
> +
> + if (arg_type == INVALID_PTR)
> + return 0;
> +
> + if (!reg->read_ok) {
> + verbose("R%d !read_ok\n", regno);
> + return -EACCES;
> + }
> +
> + if (arg_type == PTR_TO_STACK_IMM_MAP_KEY ||
> + arg_type == PTR_TO_STACK_IMM_MAP_VALUE)
> + expected_type = PTR_TO_STACK_IMM;
> + else if (arg_type == CONST_ARG_MAP_ID ||
> + arg_type == CONST_ARG_STACK_IMM_SIZE)
> + expected_type = CONST_ARG;
> + else
> + expected_type = arg_type;
> +
> + if (reg->ptr != expected_type) {
> + verbose("R%d type=%s expected=%s\n", regno,
> + reg_type_str[reg->ptr], reg_type_str[expected_type]);
> + return -EACCES;
> + }
> +
> + if (arg_type == CONST_ARG_MAP_ID) {
> + /* bpf_map_xxx(map_id) call: check that map_id is valid */
> + *map_id = reg->imm;
> + _(get_map_info(env, reg->imm, mapp));
> + } else if (arg_type == PTR_TO_STACK_IMM_MAP_KEY) {
> + /*
> + * bpf_map_xxx(..., map_id, ..., key) call:
> + * check that [key, key + map->key_size) are within
> + * stack limits and initialized
> + */
> + if (!*mapp) {
> + /*
> + * in function declaration map_id must come before
> + * table_key or table_elem, so that it's verified
> + * and known before we have to check table_key here
> + */
> + verbose("invalid map_id to access map->key\n");
> + return -EACCES;
> + }
> + _(check_stack_boundary(env, regno, (*mapp)->key_size));
> + } else if (arg_type == PTR_TO_STACK_IMM_MAP_VALUE) {
> + /*
> + * bpf_map_xxx(..., map_id, ..., value) call:
> + * check [value, value + map->value_size) validity
> + */
> + if (!*mapp) {
> + verbose("invalid map_id to access map->elem\n");
> + return -EACCES;
> + }
> + _(check_stack_boundary(env, regno, (*mapp)->value_size));
> + } else if (arg_type == CONST_ARG_STACK_IMM_SIZE) {
> + /*
> + * bpf_xxx(..., buf, len) call will access 'len' bytes
> + * from stack pointer 'buf'. Check it
> + * note: regno == len, regno - 1 == buf
> + */
> + _(check_stack_boundary(env, regno - 1, reg->imm));
> + }
> +
> + return 0;
> +}
> +
> +static int check_call(struct verifier_env *env, int func_id)
> +{
> + struct verifier_state *state = &env->cur_state;
> + const struct bpf_func_proto *fn = NULL;
> + struct reg_state *regs = state->regs;
> + struct bpf_map *map = NULL;
> + struct reg_state *reg;
> + int map_id = -1;
> + int i;
> +
> + /* find function prototype */
> + if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
> + verbose("invalid func %d\n", func_id);
> + return -EINVAL;
> + }
> +
> + if (env->prog->info->ops->get_func_proto)
> + fn = env->prog->info->ops->get_func_proto(func_id);
> +
> + if (!fn || (fn->ret_type != RET_INTEGER &&
> + fn->ret_type != PTR_TO_MAP_CONDITIONAL &&
> + fn->ret_type != RET_VOID)) {
> + verbose("unknown func %d\n", func_id);
> + return -EINVAL;
> + }
> +
> + /* check args */
> + _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
> + _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
> + _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
> + _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
> +
> + /* reset caller saved regs */
> + for (i = 0; i < CALLER_SAVED_REGS; i++) {
> + reg = regs + caller_saved[i];
> + reg->read_ok = false;
> + reg->ptr = INVALID_PTR;
> + reg->imm = 0xbadbad;
> + }
> +
> + /* update return register */
> + reg = regs + BPF_REG_0;
> + if (fn->ret_type == RET_INTEGER) {
> + reg->read_ok = true;
> + reg->ptr = INVALID_PTR;
> + } else if (fn->ret_type != RET_VOID) {
> + reg->read_ok = true;
> + reg->ptr = fn->ret_type;
> + if (fn->ret_type == PTR_TO_MAP_CONDITIONAL)
> + /*
> + * remember map_id, so that check_table_access()
> + * can check 'value_size' boundary of memory access
> + * to map element returned from bpf_table_lookup()
> + */
> + reg->imm = map_id;
> + }
> + return 0;
> +}
> +
> +/* check validity of 32-bit and 64-bit arithmetic operations */
> +static int check_alu_op(struct reg_state *regs, struct sock_filter_int *insn)
> +{
> + u8 opcode = BPF_OP(insn->code);
> +
> + if (opcode == BPF_END || opcode == BPF_NEG) {
> + if (BPF_SRC(insn->code) != BPF_X)
> + return -EINVAL;
> + /* check src operand */
> + _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> + /* check dest operand */
> + _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> + } else if (opcode == BPF_MOV) {
> +
> + if (BPF_SRC(insn->code) == BPF_X)
> + /* check src operand */
> + _(check_reg_arg(regs, insn->src_reg, 1));
> +
> + /* check dest operand */
> + _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> + if (BPF_SRC(insn->code) == BPF_X) {
> + if (BPF_CLASS(insn->code) == BPF_ALU64) {
> + /* case: R1 = R2
> + * copy register state to dest reg
> + */
> + regs[insn->dst_reg].ptr = regs[insn->src_reg].ptr;
> + regs[insn->dst_reg].imm = regs[insn->src_reg].imm;
> + } else {
> + regs[insn->dst_reg].ptr = INVALID_PTR;
> + regs[insn->dst_reg].imm = 0;
> + }
> + } else {
> + /* case: R = imm
> + * remember the value we stored into this reg
> + */
> + regs[insn->dst_reg].ptr = CONST_ARG;
> + regs[insn->dst_reg].imm = insn->imm;
> + }
> +
> + } else { /* all other ALU ops: and, sub, xor, add, ... */
> +
> + int stack_relative = 0;
> +
> + if (BPF_SRC(insn->code) == BPF_X)
> + /* check src1 operand */
> + _(check_reg_arg(regs, insn->src_reg, 1));
> +
> + /* check src2 operand */
> + _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> + if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
> + regs[insn->dst_reg].ptr == PTR_TO_STACK &&
> + BPF_SRC(insn->code) == BPF_K)
> + stack_relative = 1;
> +
> + /* check dest operand */
> + _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> + if (stack_relative) {
> + regs[insn->dst_reg].ptr = PTR_TO_STACK_IMM;
> + regs[insn->dst_reg].imm = insn->imm;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int check_cond_jmp_op(struct verifier_env *env,
> + struct sock_filter_int *insn, int *insn_idx)
> +{
> + struct reg_state *regs = env->cur_state.regs;
> + struct verifier_state *other_branch;
> + u8 opcode = BPF_OP(insn->code);
> +
> + if (BPF_SRC(insn->code) == BPF_X)
> + /* check src1 operand */
> + _(check_reg_arg(regs, insn->src_reg, 1));
> +
> + /* check src2 operand */
> + _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> + /* detect if R == 0 where R was initialized to zero earlier */
> + if (BPF_SRC(insn->code) == BPF_K &&
> + (opcode == BPF_JEQ || opcode == BPF_JNE) &&
> + regs[insn->dst_reg].ptr == CONST_ARG &&
> + regs[insn->dst_reg].imm == insn->imm) {
> + if (opcode == BPF_JEQ) {
> + /* if (imm == imm) goto pc+off;
> + * only follow the goto, ignore fall-through
> + */
> + *insn_idx += insn->off;
> + return 0;
> + } else {
> + /* if (imm != imm) goto pc+off;
> + * only follow fall-through branch, since
> + * that's where the program will go
> + */
> + return 0;
> + }
> + }
> +
> + other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
> + if (!other_branch)
> + return -EFAULT;
> +
> + /* detect if R == 0 where R is returned value from table_lookup() */
> + if (BPF_SRC(insn->code) == BPF_K &&
> + insn->imm == 0 && (opcode == BPF_JEQ ||
> + opcode == BPF_JNE) &&
> + regs[insn->dst_reg].ptr == PTR_TO_MAP_CONDITIONAL) {
> + if (opcode == BPF_JEQ) {
> + /* next fallthrough insn can access memory via
> + * this register
> + */
> + regs[insn->dst_reg].ptr = PTR_TO_MAP;
> + /* branch target cannot access it, since reg == 0 */
> + other_branch->regs[insn->dst_reg].ptr = CONST_ARG;
> + other_branch->regs[insn->dst_reg].imm = 0;
> + } else {
> + other_branch->regs[insn->dst_reg].ptr = PTR_TO_MAP;
> + regs[insn->dst_reg].ptr = CONST_ARG;
> + regs[insn->dst_reg].imm = 0;
> + }
> + } else if (BPF_SRC(insn->code) == BPF_K &&
> + (opcode == BPF_JEQ || opcode == BPF_JNE)) {
> +
> + if (opcode == BPF_JEQ) {
> + /* detect if (R == imm) goto
> + * and in the target state recognize that R = imm
> + */
> + other_branch->regs[insn->dst_reg].ptr = CONST_ARG;
> + other_branch->regs[insn->dst_reg].imm = insn->imm;
> + } else {
> + /* detect if (R != imm) goto
> + * and in the fall-through state recognize that R = imm
> + */
> + regs[insn->dst_reg].ptr = CONST_ARG;
> + regs[insn->dst_reg].imm = insn->imm;
> + }
> + }
> + if (verbose_on)
> + pr_cont_verifier_state(env);
> + return 0;
> +}
> +
> +/* verify safety of LD_ABS|LD_IND instructions:
> + * - they can only appear in the programs where ctx == skb
> + * - since they are wrappers of function calls, they scratch R1-R5 registers,
> + * preserve R6-R9, and store return value into R0
> + *
> + * Implicit input:
> + * ctx == skb == R6 == CTX
> + *
> + * Explicit input:
> + * SRC == any register
> + * IMM == 32-bit immediate
> + *
> + * Output:
> + * R0 - 8/16/32-bit skb data converted to cpu endianness
> + */
> +
> +static int check_ld_abs(struct verifier_env *env, struct sock_filter_int *insn)
> +{
> + struct reg_state *regs = env->cur_state.regs;
> + u8 mode = BPF_MODE(insn->code);
> + struct reg_state *reg;
> + int i;
> +
> + if (mode != BPF_ABS && mode != BPF_IND)
> + return -EINVAL;
> +
> + if (env->prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
> + verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
> + return -EINVAL;
> + }
> +
> + /* check whether implicit source operand (register R6) is readable */
> + _(check_reg_arg(regs, BPF_REG_6, 1));
> +
> + if (regs[BPF_REG_6].ptr != PTR_TO_CTX) {
> + verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
> + return -EINVAL;
> + }
> +
> + if (mode == BPF_IND)
> + /* check explicit source operand */
> + _(check_reg_arg(regs, insn->src_reg, 1));
> +
> + /* reset caller saved regs to unreadable */
> + for (i = 0; i < CALLER_SAVED_REGS; i++) {
> + reg = regs + caller_saved[i];
> + reg->read_ok = false;
> + reg->ptr = INVALID_PTR;
> + reg->imm = 0xbadbad;
> + }
> +
> + /* mark destination R0 register as readable, since it contains
> + * the value fetched from the packet
> + */
> + regs[BPF_REG_0].read_ok = true;
> + return 0;
> +}
> +
> +/* non-recursive DFS pseudo code
> + * 1 procedure DFS-iterative(G,v):
> + * 2 label v as discovered
> + * 3 let S be a stack
> + * 4 S.push(v)
> + * 5 while S is not empty
> + * 6 t <- S.pop()
> + * 7 if t is what we're looking for:
> + * 8 return t
> + * 9 for all edges e in G.adjacentEdges(t) do
> + * 10 if edge e is already labelled
> + * 11 continue with the next edge
> + * 12 w <- G.adjacentVertex(t,e)
> + * 13 if vertex w is not discovered and not explored
> + * 14 label e as tree-edge
> + * 15 label w as discovered
> + * 16 S.push(w)
> + * 17 continue at 5
> + * 18 else if vertex w is discovered
> + * 19 label e as back-edge
> + * 20 else
> + * 21 // vertex w is explored
> + * 22 label e as forward- or cross-edge
> + * 23 label t as explored
> + * 24 S.pop()
> + *
> + * convention:
> + * 1 - discovered
> + * 2 - discovered and 1st branch labelled
> + * 3 - discovered and 1st and 2nd branch labelled
> + * 4 - explored
> + */
> +
> +#define STATE_END ((struct verifier_state_list *)-1)
> +
> +#define PUSH_INT(I) \
> + do { \
> + if (cur_stack >= insn_cnt) { \
> + ret = -E2BIG; \
> + goto free_st; \
> + } \
> + stack[cur_stack++] = I; \
> + } while (0)
> +
> +#define PEAK_INT() \
> + ({ \
> + int _ret; \
> + if (cur_stack == 0) \
> + _ret = -1; \
> + else \
> + _ret = stack[cur_stack - 1]; \
> + _ret; \
> + })
> +
> +#define POP_INT() \
> + ({ \
> + int _ret; \
> + if (cur_stack == 0) \
> + _ret = -1; \
> + else \
> + _ret = stack[--cur_stack]; \
> + _ret; \
> + })
> +
> +#define PUSH_INSN(T, W, E) \
> + do { \
> + int w = W; \
> + if (E == 1 && st[T] >= 2) \
> + break; \
> + if (E == 2 && st[T] >= 3) \
> + break; \
> + if (w >= insn_cnt) { \
> + ret = -EACCES; \
> + goto free_st; \
> + } \
> + if (E == 2) \
> + /* mark branch target for state pruning */ \
> + env->branch_landing[w] = STATE_END; \
> + if (st[w] == 0) { \
> + /* tree-edge */ \
> + st[T] = 1 + E; \
> + st[w] = 1; /* discovered */ \
> + PUSH_INT(w); \
> + goto peak_stack; \
> + } else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
> + verbose("back-edge from insn %d to %d\n", t, w); \
> + ret = -EINVAL; \
> + goto free_st; \
> + } else if (st[w] == 4) { \
> + /* forward- or cross-edge */ \
> + st[T] = 1 + E; \
> + } else { \
> + verbose("insn state internal bug\n"); \
> + ret = -EFAULT; \
> + goto free_st; \
> + } \
> + } while (0)
> +
> +/* non-recursive depth-first-search to detect loops in BPF program
> + * loop == back-edge in directed graph
> + */
> +static int check_cfg(struct verifier_env *env)
> +{
> + struct sock_filter_int *insns = env->prog->insnsi;
> + int insn_cnt = env->prog->len;
> + int cur_stack = 0;
> + int *stack;
> + int ret = 0;
> + int *st;
> + int i, t;
> +
> + if (insns[insn_cnt - 1].code != (BPF_JMP | BPF_EXIT)) {
> + verbose("last insn is not a 'ret'\n");
> + return -EINVAL;
> + }
> +
> + st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
> + if (!st)
> + return -ENOMEM;
> +
> + stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
> + if (!stack) {
> + kfree(st);
> + return -ENOMEM;
> + }
> +
> + st[0] = 1; /* mark 1st insn as discovered */
> + PUSH_INT(0);
> +
> +peak_stack:
> + while ((t = PEAK_INT()) != -1) {
> + if (insns[t].code == (BPF_JMP | BPF_EXIT))
> + goto mark_explored;
> +
> + if (BPF_CLASS(insns[t].code) == BPF_JMP) {
> + u8 opcode = BPF_OP(insns[t].code);
> +
> + if (opcode == BPF_CALL) {
> + PUSH_INSN(t, t + 1, 1);
> + } else if (opcode == BPF_JA) {
> + if (BPF_SRC(insns[t].code) != BPF_X) {
> + ret = -EINVAL;
> + goto free_st;
> + }
> + PUSH_INSN(t, t + insns[t].off + 1, 1);
> + } else {
> + PUSH_INSN(t, t + 1, 1);
> + PUSH_INSN(t, t + insns[t].off + 1, 2);
> + }
> + /* tell verifier to check for equivalent verifier states
> + * after every call and jump
> + */
> + env->branch_landing[t + 1] = STATE_END;
> + } else {
> + PUSH_INSN(t, t + 1, 1);
> + }
> +
> +mark_explored:
> + st[t] = 4; /* explored */
> + if (POP_INT() == -1) {
> + verbose("pop_int internal bug\n");
> + ret = -EFAULT;
> + goto free_st;
> + }
> + }
> +
> +
> + for (i = 0; i < insn_cnt; i++) {
> + if (st[i] != 4) {
> + verbose("unreachable insn %d\n", i);
> + ret = -EINVAL;
> + goto free_st;
> + }
> + }
> +
> +free_st:
> + kfree(st);
> + kfree(stack);
> + return ret;
> +}
> +
> +/* compare two verifier states
> + *
> + * all states stored in state_list are known to be valid, since
> + * verifier reached 'bpf_exit' instruction through them
> + *
> + * this function is called when the verifier is exploring different branches
> + * of execution popped from the state stack. If it sees an old state that has
> + * a more strict register state and a more strict stack state, then this
> + * execution branch doesn't need to be explored further, since the verifier
> + * already concluded that the more strict state leads to a valid finish.
> + *
> + * Therefore two states are equivalent if register state is more conservative
> + * and explored stack state is more conservative than the current one.
> + * Example:
> + * explored current
> + * (slot1=INV slot2=MISC) == (slot1=MISC slot2=MISC)
> + * (slot1=MISC slot2=MISC) != (slot1=INV slot2=MISC)
> + *
> + * In other words if current stack state (one being explored) has more
> + * valid slots than old one that already passed validation, it means
> + * the verifier can stop exploring and conclude that current state is valid too
> + *
> + * Similarly with registers. If explored state has register type as invalid
> + * whereas register type in current state is meaningful, it means that
> + * the current state will reach 'bpf_exit' instruction safely
> + */
> +static bool states_equal(struct verifier_state *old, struct verifier_state *cur)
> +{
> + int i;
> +
> + for (i = 0; i < MAX_BPF_REG; i++) {
> + if (memcmp(&old->regs[i], &cur->regs[i],
> + sizeof(old->regs[0])) != 0) {
> + if (!old->regs[i].read_ok)
> + continue;
> + if (old->regs[i].ptr == INVALID_PTR)
> + continue;
> + return false;
> + }
> + }
> +
> + for (i = 0; i < MAX_BPF_STACK; i++) {
> + if (memcmp(&old->stack[i], &cur->stack[i],
> + sizeof(old->stack[0])) != 0) {
> + if (old->stack[i].type == STACK_INVALID)
> + continue;
> + return false;
> + }
> + }
> + return true;
> +}
> +
> +static int is_state_visited(struct verifier_env *env, int insn_idx)
> +{
> + struct verifier_state_list *new_sl;
> + struct verifier_state_list *sl;
> +
> + sl = env->branch_landing[insn_idx];
> + if (!sl)
> + /* no branch jump to this insn, ignore it */
> + return 0;
> +
> + while (sl != STATE_END) {
> + if (states_equal(&sl->state, &env->cur_state))
> + /* reached equivalent register/stack state,
> + * prune the search
> + */
> + return 1;
> + sl = sl->next;
> + }
> + new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
> +
> + if (!new_sl)
> + /* ignore ENOMEM, it doesn't affect correctness */
> + return 0;
> +
> + /* add new state to the head of linked list */
> + memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
> + new_sl->next = env->branch_landing[insn_idx];
> + env->branch_landing[insn_idx] = new_sl;
> + return 0;
> +}
> +
> +static int do_check(struct verifier_env *env)
> +{
> + struct verifier_state *state = &env->cur_state;
> + struct sock_filter_int *insns = env->prog->insnsi;
> + struct reg_state *regs = state->regs;
> + int insn_cnt = env->prog->len;
> + int insn_idx, prev_insn_idx = 0;
> + int insn_processed = 0;
> + bool do_print_state = false;
> +
> + init_reg_state(regs);
> + insn_idx = 0;
> + for (;;) {
> + struct sock_filter_int *insn;
> + u8 class;
> +
> + if (insn_idx >= insn_cnt) {
> + verbose("invalid insn idx %d insn_cnt %d\n",
> + insn_idx, insn_cnt);
> + return -EFAULT;
> + }
> +
> + insn = &insns[insn_idx];
> + class = BPF_CLASS(insn->code);
> +
> + if (++insn_processed > 32768) {
> + verbose("BPF program is too large. Proccessed %d insn\n",
> + insn_processed);
> + return -E2BIG;
> + }
> +
> + if (is_state_visited(env, insn_idx)) {
> + if (verbose_on) {
> + if (do_print_state)
> + pr_cont("\nfrom %d to %d: safe\n",
> + prev_insn_idx, insn_idx);
> + else
> + pr_cont("%d: safe\n", insn_idx);
> + }
> + goto process_bpf_exit;
> + }
> +
> + if (verbose_on && do_print_state) {
> + pr_cont("\nfrom %d to %d:", prev_insn_idx, insn_idx);
> + pr_cont_verifier_state(env);
> + do_print_state = false;
> + }
> +
> + if (verbose_on) {
> + pr_cont("%d: ", insn_idx);
> + pr_cont_bpf_insn(insn);
> + }
> +
> + if (class == BPF_ALU || class == BPF_ALU64) {
> + _(check_alu_op(regs, insn));
> +
> + } else if (class == BPF_LDX) {
> + if (BPF_MODE(insn->code) != BPF_MEM)
> + return -EINVAL;
> +
> + /* check src operand */
> + _(check_reg_arg(regs, insn->src_reg, 1));
> +
> + _(check_mem_access(env, insn->src_reg, insn->off,
> + BPF_SIZE(insn->code), BPF_READ,
> + insn->dst_reg));
> +
> + /* dest reg state will be updated by mem_access */
> +
> + } else if (class == BPF_STX) {
> + /* check src1 operand */
> + _(check_reg_arg(regs, insn->src_reg, 1));
> + /* check src2 operand */
> + _(check_reg_arg(regs, insn->dst_reg, 1));
> + _(check_mem_access(env, insn->dst_reg, insn->off,
> + BPF_SIZE(insn->code), BPF_WRITE,
> + insn->src_reg));
> +
> + } else if (class == BPF_ST) {
> + if (BPF_MODE(insn->code) != BPF_MEM)
> + return -EINVAL;
> + /* check src operand */
> + _(check_reg_arg(regs, insn->dst_reg, 1));
> + _(check_mem_access(env, insn->dst_reg, insn->off,
> + BPF_SIZE(insn->code), BPF_WRITE,
> + -1));
> +
> + } else if (class == BPF_JMP) {
> + u8 opcode = BPF_OP(insn->code);
> +
> + if (opcode == BPF_CALL) {
> + _(check_call(env, insn->imm));
> + } else if (opcode == BPF_JA) {
> + if (BPF_SRC(insn->code) != BPF_X)
> + return -EINVAL;
> + insn_idx += insn->off + 1;
> + continue;
> + } else if (opcode == BPF_EXIT) {
> + /* eBPF calling convention is such that R0 is used
> + * to return the value from eBPF program.
> + * Make sure that it's readable at this time
> + * of bpf_exit, which means that program wrote
> + * something into it earlier
> + */
> + _(check_reg_arg(regs, BPF_REG_0, 1));
> +process_bpf_exit:
> + insn_idx = pop_stack(env, &prev_insn_idx);
> + if (insn_idx < 0) {
> + break;
> + } else {
> + do_print_state = true;
> + continue;
> + }
> + } else {
> + _(check_cond_jmp_op(env, insn, &insn_idx));
> + }
> + } else if (class == BPF_LD) {
> + _(check_ld_abs(env, insn));
> + } else {
> + verbose("unknown insn class %d\n", class);
> + return -EINVAL;
> + }
> +
> + insn_idx++;
> + }
> +
> + return 0;
> +}
> +
> +static void free_states(struct verifier_env *env, int insn_cnt)
> +{
> + struct verifier_state_list *sl, *sln;
> + int i;
> +
> + for (i = 0; i < insn_cnt; i++) {
> + sl = env->branch_landing[i];
> +
> + if (sl)
> + while (sl != STATE_END) {
> + sln = sl->next;
> + kfree(sl);
> + sl = sln;
> + }
> + }
> +
> + kfree(env->branch_landing);
> +}
> +
> +int bpf_check(struct sk_filter *prog)
> +{
> + struct verifier_env *env;
> + int ret;
> +
> + if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
> + return -E2BIG;
> +
> + env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
> + if (!env)
> + return -ENOMEM;
> +
> + verbose_on = false;
> +retry:
> + env->prog = prog;
> + env->branch_landing = kcalloc(prog->len,
> + sizeof(struct verifier_state_list *),
> + GFP_KERNEL);
> +
> + if (!env->branch_landing) {
> + kfree(env);
> + return -ENOMEM;
> + }
> +
> + ret = check_cfg(env);
> + if (ret < 0)
> + goto free_env;
> +
> + ret = do_check(env);
> +
> +free_env:
> + while (pop_stack(env, NULL) >= 0);
> + free_states(env, prog->len);
> +
> + if (ret < 0 && !verbose_on && capable(CAP_SYS_ADMIN)) {
> + /* verification failed, redo it with verbose on */
> + memset(env, 0, sizeof(struct verifier_env));
> + verbose_on = true;
> + goto retry;
> + }
> +
> + if (ret == 0 && env->used_map_cnt) {
> + /* if program passed verifier, update used_maps in bpf_prog_info */
> + prog->info->used_maps = kmalloc_array(env->used_map_cnt,
> + sizeof(u32), GFP_KERNEL);
> + if (!prog->info->used_maps) {
> + kfree(env);
> + return -ENOMEM;
> + }
> + memcpy(prog->info->used_maps, env->used_maps,
> + sizeof(u32) * env->used_map_cnt);
> + prog->info->used_map_cnt = env->used_map_cnt;
> + }
> +
> + kfree(env);
> + return ret;
> +}
> --
> 1.7.9.5
>
On Wed, Jul 2, 2014 at 1:11 AM, David Laight <[email protected]> wrote:
> From: Alexei Starovoitov
> ...
>> >> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
> ...
>> >> + _(get_map_info(env, map_id, &map));
>> >
>> > Nit: such macros should be removed, please.
>>
>> It may surely look unconventional, but alternative is to replace
>> every usage of _ macro with:
>> err =
>> if (err)
>> return err;
>>
>> and since this macro is used 38 times, it will add ~120 unnecessary
>> lines that will only make code much harder to follow.
>> I tried not using macro and results were not pleasing.
>
> The problem is that they are hidden control flow.
> As such they make flow analysis harder for the casual reader.
In the abstract context macros with gotos and returns are bad,
but in this case extra verbosity is the bigger evil.
Consider this piece of code:
#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
if (opcode == BPF_END || opcode == BPF_NEG) {
if (BPF_SRC(insn->code) != BPF_X)
return -EINVAL;
/* check src operand */
_(check_reg_arg(regs, insn->dst_reg, 1));
/* check dest operand */
_(check_reg_arg(regs, insn->dst_reg, 0));
} else if (opcode == BPF_MOV) {
if (BPF_SRC(insn->code) == BPF_X)
/* check src operand */
_(check_reg_arg(regs, insn->src_reg, 1));
/* check dest operand */
_(check_reg_arg(regs, insn->dst_reg, 0));
where a casual reader can easily see what the purpose of the code is
and what it's doing.
Now rewrite it without '_' macro:
if (opcode == BPF_END || opcode == BPF_NEG) {
if (BPF_SRC(insn->code) != BPF_X)
return -EINVAL;
/* check src operand */
err = check_reg_arg(regs, insn->dst_reg, 1);
if (err)
return err;
/* check dest operand */
err = check_reg_arg(regs, insn->dst_reg, 0);
if (err)
return err;
} else if (opcode == BPF_MOV) {
if (BPF_SRC(insn->code) == BPF_X) {
/* check src operand */
err = check_reg_arg(regs, insn->src_reg, 1);
if (err)
return err;
}
/* check dest operand */
err = check_reg_arg(regs, insn->dst_reg, 0);
if (err)
return err;
See how your eyes are now picking up the endless control flow of
if conditions and returns, instead of focusing on the code itself.
It's much easier to understand the semantics when the if (err) checks are out
of the way. Note that replacing _ with a real name would ruin
the reading experience, since the CAPITAL letters of the macro
would be screaming "look at me" instead of letting the reviewer
focus on the code.
I believe that this usage of _ as a macro, specifically as defined here,
would be a great addition to kernel coding style in general.
I don't want to see _ redefined differently.
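For comparison, here is roughly what the BPF_MOV fragment above looks like
with a named macro instead of '_' (purely illustrative; the macro name is
made up, not something proposed in the patch):

#define CHECK_OR_RETURN(OP) ({ int ret = OP; if (ret < 0) return ret; })

} else if (opcode == BPF_MOV) {
	if (BPF_SRC(insn->code) == BPF_X)
		/* check src operand */
		CHECK_OR_RETURN(check_reg_arg(regs, insn->src_reg, 1));
	/* check dest operand */
	CHECK_OR_RETURN(check_reg_arg(regs, insn->dst_reg, 0));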
On Wed, Jul 2, 2014 at 3:22 PM, Chema Gonzalez <[email protected]> wrote:
>> + * - unreachable insns exist (shouldn't be a forest. program = one function)
> This seems to me an unnecessary style restriction on user code.
Unreachable instructions are, to me, a ticking time bomb of potential exploits.
They should definitely be rejected.
>> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
> +1 to removing the _ macro. If you want to avoid the 3 lines (is there
> anything in the style guide against "if ((err=OP) < 0) ..." ?), at
assignment and function call inside 'if' ? I don't like such style.
> least use some meaningful macro name (DO_AND_CHECK, or something like
> that).
Try replacing _ with any other name and see how bad it will look.
I tried with MACRO_NAME and with 'if (err) goto' and with 'if (err) return',
before I converged on _ macro.
I think it's a hidden gem of this patch.
> Can you please add:
>
> + } else if (class == BPF_LD) {
> + if (BPF_MODE(insn->code) == BPF_ABS) {
> + pr_cont("(%02x) r0 = *(%s *)skb[%d]\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->imm);
> + } else if (BPF_MODE(insn->code) == BPF_IND) {
> + pr_cont("(%02x) r0 = *(%s *)skb[r%d + %d]\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->src_reg, insn->imm);
> + } else {
> + pr_cont("BUG_ld_%02x\n", insn->code);
> + return;
> + }
>
> Note that I'm hardcoding r0 (instead of using %d for insn->dst_reg)
> because that's how ebpf writes the instructions.
ohh yes. it's a copy paste error, since it was in a different file before.
Will definitely add. Thanks!
>> +static void init_reg_state(struct reg_state *regs)
>> +{
>> + struct reg_state *reg;
>> + int i;
>> +
>> + for (i = 0; i < MAX_BPF_REG; i++) {
>> + regs[i].ptr = INVALID_PTR;
>> + regs[i].read_ok = false;
>> + regs[i].imm = 0xbadbad;
>> + }
>> + reg = regs + BPF_REG_FP;
> Any reason you switching from the array syntax to the pointer one? I
> find "reg = regs[BPF_REG_FP];" more readable (and the one you chose in
> the loop).
In this function, no particular reason. It felt a bit less verbose,
but I can make the change.
>> + reg = regs + BPF_REG_1; /* 1st arg to a function */
>> + reg->ptr = PTR_TO_CTX;
> Wait, doesn't this depend on doing "BPF_MOV64_REG(BPF_REG_CTX,
> BPF_REG_ARG1)" (the bpf-to-ebpf prologue), which is only enforced on
> filters converted from bpf? In fact, shouldn't this set
> regs[BPF_REG_CTX] instead of regs[BPF_REG_1] ?
nope. it's REG_1.
as you said r6=r1 is only emitted by converted classic filters.
Verifier will see this 'r6=r1' assignment and will copy the r1 type into r6.
Thank you for review!
On Wed, Jul 2, 2014 at 4:04 PM, Alexei Starovoitov <[email protected]> wrote:
>>> + reg = regs + BPF_REG_1; /* 1st arg to a function */
>>> + reg->ptr = PTR_TO_CTX;
>> Wait, doesn't this depend on doing "BPF_MOV64_REG(BPF_REG_CTX,
>> BPF_REG_ARG1)" (the bpf-to-ebpf prologue), which is only enforced on
>> filters converted from bpf? In fact, shouldn't this set
>> regs[BPF_REG_CTX] instead of regs[BPF_REG_1] ?
>
> nope. it's REG_1.
> as you said r6=r1 is only emitted by converted classic filters.
> Verifier will see this 'r6=r1' assignment and will copy the r1 type into r6.
You're right. I read BPF_MOV64_REG() AT&T-syntax-style.
BTW, check_stack_write() in kernel/bpf/verifier.c has a couple of
assignments of a slot->ptr to 0 (instead of INVALID_PTR). I assume
this is unintended.
-Chema
On Wed, Jul 2, 2014 at 4:35 PM, Chema Gonzalez <[email protected]> wrote:
> On Wed, Jul 2, 2014 at 4:04 PM, Alexei Starovoitov <[email protected]> wrote:
>>>> + reg = regs + BPF_REG_1; /* 1st arg to a function */
>>>> + reg->ptr = PTR_TO_CTX;
>>> Wait, doesn't this depend on doing "BPF_MOV64_REG(BPF_REG_CTX,
>>> BPF_REG_ARG1)" (the bpf-to-ebpf prologue), which is only enforced on
>>> filters converted from bpf? In fact, shouldn't this set
>>> regs[BPF_REG_CTX] instead of regs[BPF_REG_1] ?
>>
>> nope. it's REG_1.
>> as you said r6=r1 is only emitted by converted classic filters.
>> Verifier will see this 'r6=r1' assignment and will copy the r1 type into r6.
> You're right. I read BPF_MOV64_REG() AT&T-syntax-style.
>
> BTW, check_stack_write() in kernel/bpf/verifier.c has a couple of
> assignments of a slot->ptr to 0 (instead of INVALID_PTR). I assume
> this is unintended.
Yes, good catch. Will fix it.
Too bad the C compiler silently casts integers to enums.
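For reference, a minimal sketch of the fix being discussed for the non-spill
branch of check_stack_write() (assuming INVALID_PTR is indeed the intended
'not a pointer' marker here):

	/* regular write of data into stack */
	for (i = 0; i < size; i++) {
		slot = &state->stack[MAX_BPF_STACK + off + i];
		slot->type = STACK_MISC;
		slot->ptr = INVALID_PTR;	/* was: slot->ptr = 0; */
		slot->imm = 0;
	}

The STACK_SPILL_PART loop would presumably get the same change.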
On Tue, Jul 1, 2014 at 10:33 PM, Alexei Starovoitov <[email protected]> wrote:
> On Tue, Jul 1, 2014 at 8:11 AM, Andy Lutomirski <[email protected]> wrote:
>> On Mon, Jun 30, 2014 at 10:47 PM, Alexei Starovoitov <[email protected]> wrote:
>>> On Mon, Jun 30, 2014 at 3:09 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Sat, Jun 28, 2014 at 11:36 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>> On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <[email protected]> wrote:
>>>>>>>
>>>>>>> Sorry I don't like 'fd' direction at all.
>>>>>>> 1. it will make the whole thing very socket specific and 'net' dependent.
>>>>>>> but the goal here is to be able to use eBPF for tracing in embedded
>>>>>>> setups. So it's gotta be net independent.
>>>>>>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>>>>>>> types of sockets will complicate it a lot.
>>>>>>> 3. and most important. read/write operations on sockets are not
>>>>>>> done every nanosecond, whereas lookup operations on bpf maps
>>>>>>> are done every dozen instructions, so we cannot have any overhead
>>>>>>> when accessing maps.
>>>>>>> In other words the verifier is done as static analyzer. I moved all
>>>>>>> the complexity to verify time, so at run-time the programs are as
>>>>>>> fast as possible. I'm strongly against run-time checks in critical path,
>>>>>>> since they kill performance and make the whole approach a lot less usable.
>>>>>>
>>>>>> I may have described my suggestion poorly. I'm suggesting that all of
>>>>>> these global ids be replaced *for userspace's benefit* with fds. That
>>>>>> is, a map would have an associated struct inode, and, when you load an
>>>>>> eBPF program, you'd pass fds into the kernel instead of global ids.
>>>>>> The kernel would still compile the eBPF program to use the global ids,
>>>>>> though.
>>>>>
>>>>> Hmm. If I understood you correctly, you're suggesting to do it similar
>>>>> to ipc/mqueue, shmem, sockets do. By registering and mounting
>>>>> a file system and providing all superblock and inode hooks… and
>>>>> probably have its own namespace type… hmm… may be. That's
>>>>> quite a bit of work to put lightly. As I said in the other email the first
>>>>> step is root only and all these complexity just not worth doing
>>>>> at this stage.
>>>>
>>>> The downside of not doing it right away is that it's harder to
>>>> retrofit in without breaking early users.
>>>>
>>>> You might be able to get away with using anon_inodes. That will
>>>
>>> Spent quite a bit of time playing with anon_inode_getfd(). The model
>>> works ok for seccomp, but doesn't seem to work for tracing,
>>> since tracepoints are global. Say, syscall(bpf, load_prog) returns
>>> a process-local fd. This 'fd' as a string can be written to
>>> debugfs/tracing/events/.../filter which will increment a refcnt of a global
>>> ebpf_program structure and will keep using it. When process exits it will
>>> close all fds which in case of ebpf_prog_fd should be a nop, since
>>> the program is still attached to a global event. Now we have a
>>> program and maps that still alive and dangling, since tracepoint events
>>> keep coming, but no new process can access it. Here we just lost all
>>> benefits of making it 'fd' based. Theoretically we can extend tracing to
>>> be fd-based too and tracepoints will auto-detach upon process exit,
>>> but that's not going to work for all other global events. Like networking
>>> components (bridge, ovs, …) are global and they won't be adding
>>> fd-based interfaces.
>>> I'm still thinking about it, but it looks like that any process-local
>>> ebpf_prog_id scheme is not going to work for global events. Thoughts?
>>
>> Hmm. Maybe these things do need global ids for tracing, or at least
>> there need to be some way to stash them somewhere and find them again.
>> I suppose that debugfs could have symlinks to them, but I don't know
>> how hard that would be to implement or how awkward it would be to use.
>>
>> I imagine there's some awkwardness regardless. For tracing, if I
>> create map 75 and eBPF program 492 that uses map 75, then I still need
>> to remember that map 75 is the map I want (or I need to parse the eBPF
>> program later on).
>>
>> How do you imagine the userspace code working? Maybe it would make
>> sense to add some nlattrs for eBPF programs to map between referenced
>> objects and nicknames for them. Then user code could look at
>> /sys/kernel/debug/whatever/nickname_of_map to resolve the map id or
>> even just open it directly.
>
> I want to avoid string names, since they will force new 'strtab', 'symtab'
> sections in the programs/maps and will uglify the user interface quite a bit.
To be fair, you really need to imitate ELF here. A very simple
relocation-like table should do the trick.
>
> Back in september one loadable unit was: one eBPF program + set of maps,
> but tracing requirements forced a change, since multiple programs need
> to access the same map and maps may need to be pre-populated before
> the programs start executing, so I've split maps and programs into mostly
> independent entities, but programs still need to think of maps as local:
> For example I want to do a skb leak check 'tracing filter':
> - attach this program to kretprobe of __alloc_skb():
> u64 key = (u64) skb;
> u64 value = bpf_get_time();
> bpf_update_map_elem(1/*const_map_id*/, &key, &value);
> - attach this program to consume_skb and kfree_skb tracepoints:
> u64 key = (u64) skb;
> bpf_delete_map_elem(1/*const_map_id*/, &key);
> - and have user space do:
> prior to loading:
> bpf_create_map(1/*map_id*/, 8/*key_size*/, 8/*value*/, 1M /*max_entries*/)
> and then periodically iterate the map to see whether any skb stayed
> in the map for too long.
>
> Programs need to be written with hard coded map_ids otherwise usability
> suffers, so I did global 32-bit id in this RFC
>, but this indeed doesn't work
Really? That will mean that you have to edit the source of your
filter program if the small integer map number you chose conflicts
with another program. That sounds unpleasant.
> for unprivileged chrome browser unless programs are previously loaded
> by root and chrome only does attach to seccomp.
>
> So here is the non-root bpf syscall interface I'm thinking about:
>
> ufd = bpf_create_map(map_id, key_size, value_size, max_entries);
>
> it will create a global map in the system which will be accessible
> in this process via 'ufd'. Internally this 'ufd' will be assigned global map_id
> and process-local map_id that was passed as a 1st argument.
> To do update/lookup the process will use bpf_map_xxx_elem(ufd,…)
>
Erk. Unprivileged programs shouldn't be able to allocate global ids
of their choosing, especially if privileged programs can also do it.
Otherwise unprivileged programs can force a collision and possibly
steal information.
> Then to load eBPF program the process will do:
> ufd = bpf_prog_load(prog_type, ebpf_insn_array, license)
> and instructions will be referring to maps via local map_id that
> was hard coded as part of the program.
I think relocations would be much prettier than per-process map id tables.
>
> Beyond the normal create_map, update/lookup/delete, load_prog
> operations (that are accessible to both root and non-root), the root user
> gains one more operations: bpf_get_global_id(ufd) that returns
> global map_id or prog_id. This id can be attached to global events
> like tracing. Non-root users lose ability to do delete_map and
> unload_prog (they do close(ufd) instead), so this ops are for root
> only and operate on global ids.
> This is the cleanest way I could think of to combine non-root
> security, per-process id and global id all in one API. Thoughts?
I think I'm okay with this part, although an interface to get a map fd
given some reference to the program (in sysfs) that uses it would also
work and maybe be more straightforward.
--Andy
On Wed, Jul 2, 2014 at 6:43 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Jul 1, 2014 at 10:33 PM, Alexei Starovoitov <[email protected]> wrote:
>> I want to avoid string names, since they will force new 'strtab', 'symtab'
>> sections in the programs/maps and will uglify the user interface quite a bit.
>
> To be fair, you really need to imitate ELF here. A very simple
> relocation-like table should do the trick.
simple.. right :) I do see the amount of struggle you have with
binutils and vdso.
I really don't want to add relocation unless this is last resort.
Especially since it can be solved without it.
I don't think I explained it enough in my last email… trying again:
>> Back in september one loadable unit was: one eBPF program + set of maps,
>> but tracing requirements forced a change, since multiple programs need
>> to access the same map and maps may need to be pre-populated before
>> the programs start executing, so I've split maps and programs into mostly
>> independent entities, but programs still need to think of maps as local:
>> For example I want to do a skb leak check 'tracing filter':
>> - attach this program to kretprobe of __alloc_skb():
>> u64 key = (u64) skb;
>> u64 value = bpf_get_time();
>> bpf_update_map_elem(1/*const_map_id*/, &key, &value);
>> - attach this program to consume_skb and kfree_skb tracepoints:
>> u64 key = (u64) skb;
>> bpf_delete_map_elem(1/*const_map_id*/, &key);
>> - and have user space do:
>> prior to loading:
>> bpf_create_map(1/*map_id*/, 8/*key_size*/, 8/*value*/, 1M /*max_entries*/)
>> and then periodically iterate the map to see whether any skb stayed
>> in the map for too long.
>>
>> Programs need to be written with hard coded map_ids otherwise usability
>> suffers, so I did global 32-bit id in this RFC
>>, but this indeed doesn't work
>
> Really? That will mean that you have to edit the source of your
> filter program if the small integer map number you chose conflicts
> with another program. That sounds unpleasant.
unpleasant. exactly. that's why I'm proposing per-process local map-id,
so that programs don't need to be edited.
>> for unprivileged chrome browser unless programs are previously loaded
>> by root and chrome only does attach to seccomp.
>>
>> So here is the non-root bpf syscall interface I'm thinking about:
>>
>> ufd = bpf_create_map(map_id, key_size, value_size, max_entries);
>>
>> it will create a global map in the system which will be accessible
>> in this process via 'ufd'. Internally this 'ufd' will be assigned global map_id
>> and process-local map_id that was passed as a 1st argument.
>> To do update/lookup the process will use bpf_map_xxx_elem(ufd,…)
>>
>
> Erk. Unprivileged programs shouldn't be able to allocate global ids
> of their choosing, especially if privileged programs can also do it.
> Otherwise unprivileged programs can force a collision and possibly
> steal information.
Of course. That's not what I said.
>> Then to load eBPF program the process will do:
>> ufd = bpf_prog_load(prog_type, ebpf_insn_array, license)
>> and instructions will be referring to maps via local map_id that
>> was hard coded as part of the program.
>
> I think relocations would be much prettier than per-process map id tables.
I think per process map_id are much cleaner.
non-root API:
ufd = bpf_create_map(local_map_id,… )
bpf_map_update/delete/lookup_elem(ufd,…)
ufd = bpf_prog_load(insns)
close(ufd)
root only API:
global_id = bpf_get_id(ufd) // returns either map or prog global id
bpf_map_delete(global_map_id)
bpf_prog_unload(global_prog_id)
Details:
ufd = bpf_create_map(local_map_id, ...);
local_map_id - process local map_id
(this id is used to access maps from eBPF program loaded by this process)
ufd - process local file descriptor
(used to update/lookup maps from this process)
global_map_id = bpf_get_id(ufd)
this is root only call to get global_ids and pass them to global events
like tracing.
global ids will only be seen by root. There is no way for root or non-root
to influence id ranges.
>> Beyond the normal create_map, update/lookup/delete, load_prog
>> operations (that are accessible to both root and non-root), the root user
>> gains one more operations: bpf_get_global_id(ufd) that returns
>> global map_id or prog_id. This id can be attached to global events
>> like tracing. Non-root users lose ability to do delete_map and
>> unload_prog (they do close(ufd) instead), so this ops are for root
>> only and operate on global ids.
>> This is the cleanest way I could think of to combine non-root
>> security, per-process id and global id all in one API. Thoughts?
>
> I think I'm okay with this part, although an interface to get a map fd
> given some reference to the program (in sysfs) that uses it would also
> work and maybe be more straightforward.
If you meant debugfs, then yes. I'm planning to add a way for root
to see all loaded programs and maps (similar to /proc as lsmod does),
and then do bpf_map_delete/bpf_prog_unload (similar to rmmod)
setsockopt and seccomp will be non-root and programs will go
through additional dont_leak_pointers check in verifier.
tracing/dtrace will be for root only, since they would need to attach
to global events.
I think it will be cleaner once I finish fd conversion as a patch.
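To make the proposed flow concrete, here is a rough userspace sketch (the
wrapper names and argument lists follow the proposal in this thread and are
not a finished interface; prog_type, insns and license are placeholders the
caller would supply, and error handling is abridged):

	__u64 key = 0, value = 0;
	int map_ufd, prog_ufd;

	/* '1' is the process-local map_id hard-coded in the program's
	 * instructions; 'map_ufd' is what this process uses to access it
	 */
	map_ufd = bpf_create_map(1, sizeof(key), sizeof(value), 1024);

	/* pre-populate the map before any program runs */
	bpf_map_update_elem(map_ufd, &key, &value);

	/* load the program; its instructions refer to local map_id 1 */
	prog_ufd = bpf_prog_load(prog_type, insns, license);

	/* root only: resolve a global id to attach to global events
	 * global_prog_id = bpf_get_id(prog_ufd);
	 */

	/* non-root cleanup is just closing the descriptors */
	close(prog_ufd);
	close(map_ufd);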
From: Alexei Starovoitov
> >> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
> > +1 to removing the _ macro. If you want to avoid the 3 lines (is there
> > anything in the style guide against "if ((err=OP) < 0) ..." ?), at
>
> assignment and function call inside 'if' ? I don't like such style.
>
> > least use some meaningful macro name (DO_AND_CHECK, or something like
> > that).
It would have to be RETURN_IF_NEGATIVE().
But even then it is skipped by searches for 'return'.
> Try replacing _ with any other name and see how bad it will look.
> I tried with MACRO_NAME and with 'if (err) goto' and with 'if (err) return',
> before I converged on _ macro.
> I think it's a hidden gem of this patch.
No, it is one of those things that 'seems like a good idea at the time',
but causes grief much later on.
Have you considered saving the error code into 'env' and making most of
the functions return if an error is set?
Then the calling code need not check the result of every function call.
David
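For reference, a rough sketch of the 'latch the error in env' pattern David is
suggesting (illustrative only; the env->err field and wrapper name are made
up, not part of the patch):

	/* helpers become no-ops once env->err is set */
	static void env_check_reg_arg(struct verifier_env *env, int regno,
				      bool is_src)
	{
		if (env->err)
			return;
		env->err = check_reg_arg(env->cur_state.regs, regno, is_src);
	}

	/* callers then test the result once per batch of checks */
	env_check_reg_arg(env, insn->src_reg, true);
	env_check_reg_arg(env, insn->dst_reg, false);
	if (env->err)
		return env->err;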
On Thu, Jul 3, 2014 at 2:13 AM, David Laight <[email protected]> wrote:
> From: Alexei Starovoitov
>> >> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
>> > +1 to removing the _ macro. If you want to avoid the 3 lines (is there
>> > anything in the style guide against "if ((err=OP) < 0) ..." ?), at
>>
>> assignment and function call inside 'if' ? I don't like such style.
>>
>> > least use some meaningful macro name (DO_AND_CHECK, or something like
>> > that).
>
> It would have to be RETURN_IF_NEGATIVE().
> But even then it is skipped by searches for 'return'.
try s/\<_\>/RETURN_IF_NEGATIVE/ and see how ugly it looks…
>> Try replacing _ with any other name and see how bad it will look.
>> I tried with MACRO_NAME and with 'if (err) goto' and with 'if (err) return',
>> before I converged on _ macro.
>> I think it's a hidden gem of this patch.
>
> No, it is one of those things that 'seems like a good idea at the time',
> but causes grief much later on.
Disagree. The _ macro in this code has been around for
almost 2 years and survived all sorts of changes all over the verifier.
The macro proved to be very effective in reducing code noise.
> Have you considered saving the error code into 'env' and making most of
> the functions return if an error is set?
> Then the calling code need not check the result of every function call.
That won't work, since err = check1(); err = check2(); if (err) is just wrong,
and err |= check1(); err |= check2(); is even worse.
Even if it were possible, continuing verification and printing multiple
errors is too confusing for users. While writing programs and
dealing with verifier rejects, we found that the first error is more than
enough to go back and analyze what's wrong with the C source.
Notice that the verifier prints the full verification trace. Without it, it was
very hard to understand why a particular register at some point had
an invalid type.
On Wed, Jul 2, 2014 at 7:29 PM, Alexei Starovoitov <[email protected]> wrote:
> On Wed, Jul 2, 2014 at 6:43 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Jul 1, 2014 at 10:33 PM, Alexei Starovoitov <[email protected]> wrote:
>>> I want to avoid string names, since they will force new 'strtab', 'symtab'
>>> sections in the programs/maps and will uglify the user interface quite a bit.
>>
>> To be fair, you really need to imitate ELF here. A very simple
>> relocation-like table should do the trick.
>
> simple.. right :) I do see the amount of struggle you have with
> binutils and vdso.
> I really don't want to add relocation unless this is last resort.
> Especially since it can be solved without it.
> I don't think I explained it enough in my last email… trying again:
>
>>> Back in september one loadable unit was: one eBPF program + set of maps,
>>> but tracing requirements forced a change, since multiple programs need
>>> to access the same map and maps may need to be pre-populated before
>>> the programs start executing, so I've split maps and programs into mostly
>>> independent entities, but programs still need to think of maps as local:
>>> For example I want to do a skb leak check 'tracing filter':
>>> - attach this program to kretprobe of __alloc_skb():
>>> u64 key = (u64) skb;
>>> u64 value = bpf_get_time();
>>> bpf_update_map_elem(1/*const_map_id*/, &key, &value);
>>> - attach this program to consume_skb and kfree_skb tracepoints:
>>> u64 key = (u64) skb;
>>> bpf_delete_map_elem(1/*const_map_id*/, &key);
>>> - and have user space do:
>>> prior to loading:
>>> bpf_create_map(1/*map_id*/, 8/*key_size*/, 8/*value*/, 1M /*max_entries*/)
>>> and then periodically iterate the map to see whether any skb stayed
>>> in the map for too long.
>>>
>>> Programs need to be written with hard coded map_ids otherwise usability
>>> suffers, so I did global 32-bit id in this RFC
>>>, but this indeed doesn't work
>>
>> Really? That will mean that you have to edit the source of your
>> filter program if the small integer map number you chose conflicts
>> with another program. That sounds unpleasant.
>
> unpleasant. exactly. that's why I'm proposing per-process local map-id,
> so that programs don't need to be edited.
>
>>> for unprivileged chrome browser unless programs are previously loaded
>>> by root and chrome only does attach to seccomp.
>>>
>>> So here is the non-root bpf syscall interface I'm thinking about:
>>>
>>> ufd = bpf_create_map(map_id, key_size, value_size, max_entries);
>>>
>>> it will create a global map in the system which will be accessible
>>> in this process via 'ufd'. Internally this 'ufd' will be assigned global map_id
>>> and process-local map_id that was passed as a 1st argument.
>>> To do update/lookup the process will use bpf_map_xxx_elem(ufd,…)
>>>
>>
>> Erk. Unprivileged programs shouldn't be able to allocate global ids
>> of their choosing, especially if privileged programs can also do it.
>> Otherwise unprivileged programs can force a collision and possibly
>> steal information.
>
> Of course. That's not what I said.
>
>>> Then to load eBPF program the process will do:
>>> ufd = bpf_prog_load(prog_type, ebpf_insn_array, license)
>>> and instructions will be referring to maps via local map_id that
>>> was hard coded as part of the program.
>>
>> I think relocations would be much prettier than per-process map id tables.
>
> I think per process map_id are much cleaner.
>
> non-root API:
>
> ufd = bpf_create_map(local_map_id,… )
> bpf_map_update/delete/lookup_elem(ufd,…)
> ufd = bpf_prog_load(insns)
> close(ufd)
>
> root only API:
>
> global_id = bpf_get_id(ufd) // returns either map or prog global id
> bpf_map_delete(global_map_id)
> bpf_prog_unload(global_prog_id)
>
> Details:
>
> ufd = bpf_create_map(local_map_id, ...);
>
> local_map_id - process local map_id
> (this id is used to access maps from eBPF program loaded by this process)
>
> ufd - process local file descriptor
> (used to update/lookup maps from this process)
>
> global_map_id = bpf_get_id(ufd)
> this is root only call to get global_ids and pass them to global events
> like tracing.
>
> global ids will only be seen by root. There is no way for root or non-root
> to influence id ranges.
>
>>> Beyond the normal create_map, update/lookup/delete, load_prog
>>> operations (that are accessible to both root and non-root), the root user
>>> gains one more operations: bpf_get_global_id(ufd) that returns
>>> global map_id or prog_id. This id can be attached to global events
>>> like tracing. Non-root users lose ability to do delete_map and
>>> unload_prog (they do close(ufd) instead), so this ops are for root
>>> only and operate on global ids.
>>> This is the cleanest way I could think of to combine non-root
>>> security, per-process id and global id all in one API. Thoughts?
>>
>> I think I'm okay with this part, although an interface to get a map fd
>> given some reference to the program (in sysfs) that uses it would also
>> work and maybe be more straightforward.
>
> If you meant debugfs, then yes. I'm planning to add a way for root
> to see all loaded programs and maps (similar to /proc as lsmod does),
> and then do bpf_map_delete/bpf_prog_unload (similar to rmmod)
>
> setsockopt and seccomp will be non-root and programs will go
> through additional dont_leak_pointers check in verifier.
> tracing/dtrace will be for root only, since they would need to attach
> to global events.
>
> I think it will be cleaner once I finish fd conversion as a patch.
OK
FWIW, per-process local id maps sound almost equivalent to relocations
-- the latter could be as simple as an extra nlattr giving a list of
pairs of (per-eBPF-program id, fd).
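A rough sketch of what such a pair table could look like (the struct and
field names are illustrative, not a proposed UAPI):

	/* one entry per map referenced by the program:
	 * 'insn_map_id' is the id hard-coded in the instructions,
	 * 'map_fd' is the process-local descriptor it resolves to
	 */
	struct bpf_map_fixup {
		__u32 insn_map_id;
		__u32 map_fd;
	};

	/* passed next to the instruction array at load time, e.g.:
	 *	struct bpf_map_fixup fixups[] = {
	 *		{ .insn_map_id = 1, .map_fd = map_ufd },
	 *	};
	 */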
My current binutils mess is mainly just because I'm trying to do
something weird with an old, old file format that needs to support
lots of legacy tools. You won't have that problem.
--Andy
On Fri, Jul 4, 2014 at 8:17 AM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jul 2, 2014 at 7:29 PM, Alexei Starovoitov <[email protected]> wrote:
>>
>> non-root API:
>>
>> ufd = bpf_create_map(local_map_id,… )
>> bpf_map_update/delete/lookup_elem(ufd,…)
>> ufd = bpf_prog_load(insns)
>> close(ufd)
>>
>> root only API:
>>
>> global_id = bpf_get_id(ufd) // returns either map or prog global id
>> bpf_map_delete(global_map_id)
>> bpf_prog_unload(global_prog_id)
>>
>> Details:
>>
>> ufd = bpf_create_map(local_map_id, ...);
>>
>> local_map_id - process local map_id
>> (this id is used to access maps from eBPF program loaded by this process)
>>
>> ufd - process local file descriptor
>> (used to update/lookup maps from this process)
>>
>> global_map_id = bpf_get_id(ufd)
>> this is root only call to get global_ids and pass them to global events
>> like tracing.
>>
>> global ids will only be seen by root. There is no way for root or non-root
>> to influence id ranges.
>>
>> I think it will be cleaner once I finish fd conversion as a patch.
>
> OK
>
> FWIW, per-process local id maps sound almost equivalent to relocations
> -- the latter could be as simple as an extra nlattr giving a list of
> pairs of (per-eBPF-program id, fd).
I thought about an array of such pairs as well, but it felt cleaner to remember
the (per-eBPF-program id, fd) pairs in the kernel, instead of asking user
space to keep track of them. Either way I think it's minor. I'll implement
both and see which way is cleaner.
So far I did the 3-way enum split in the verifier and addressed Namhyung's
and Chema's comments. Updated in the same place:
git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master
I'll be traveling next week.
Once I'm done with fd-style interface, I'll post a v2.
Thanks