From: Alexei Starovoitov <ast@plumgrid.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>,
        "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>,
        Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
        Tom Zanussi <tom.zanussi@linux.intel.com>,
        Jovi Zhangwei <jovi.zhangwei@gmail.com>,
        Eric Dumazet <edumazet@google.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Arnaldo Carvalho de Melo <acme@infradead.org>,
        Pekka Enberg <penberg@iki.fi>, "David S. Miller" <davem@davemloft.net>,
        Arjan van de Ven <arjan@infradead.org>,
        Christoph Hellwig <hch@infradead.org>, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 tip 3/7] Extended BPF (64-bit BPF) design document
Date: Wed,  5 Feb 2014 16:10:03 -0800
Message-Id: <1391645407-4092-4-git-send-email-ast@plumgrid.com>
In-Reply-To: <1391645407-4092-1-git-send-email-ast@plumgrid.com>
References: <1391645407-4092-1-git-send-email-ast@plumgrid.com>
Sender: linux-kernel-owner@vger.kernel.org

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/bpf_jit.txt |  204 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 Documentation/bpf_jit.txt

diff --git a/Documentation/bpf_jit.txt b/Documentation/bpf_jit.txt
new file mode 100644
index 0000000..9c70f42
--- /dev/null
+++ b/Documentation/bpf_jit.txt
@@ -0,0 +1,204 @@
+Subject: extended BPF or 64-bit BPF
+
+Q: What is BPF?
+A: Safe dynamically loadable 32-bit program that can access skb->data via
+sk_load_byte/half/word calls or seccomp_data. Can be attached to sockets,
+to netfilter xtables, seccomp. In case of sockets/xtables input is skb.
+In case of seccomp input is struct seccomp_data.
+
+Q: What is extended BPF?
+A: Safe dynamically loadable 64-bit program that can call fixed set
+of kernel functions and takes generic bpf_context as an input.
+BPF program is a glue between kernel functions and bpf_context.
+Different kernel subsystems can define their own set of available functions
+and alter BPF machinery for specific use case.
+
+Example 1:
+when function set is {bpf_load_byte/half/word} and bpf_context=skb
+the extended BPF is equivalent to original BPF (w/o negative offset extensions),
+since any such extended BPF program will only be able to load data from skb
+and interpret it.
+
+Example 2:
+when function set is {empty} and bpf_context=seccomp_data,
+the extended BPF is equivalent to original seccomp BPF with simpler programs
+and can immediately take advantage of extended BPF-JIT.
+(original BPF-JIT doesn't work for seccomp)
+
+Example 3:
+when function set is {bpf_load_xxx + bpf_table_lookup} and bpf_context=skb
+the extended BPF can be used to implement network analytics in tcpdump.
+Like counting all tcp flows through the dev or filtering for specific
+set of IP addresses.
+
+Example 4:
+when function set is {load_xxx + table_lookup + trace_printk} and
+bpf_context=pt_regs, the extended BPF is used to implement systemtap-like
+tracing filters
+
+Extended Instruction Set was designed with these goals:
+- write programs in restricted C and compile into BPF with GCC/LLVM
+- just-in-time map to modern 64-bit CPU with minimal performance overhead
+  over two steps: C -> BPF -> native code
+- guarantee termination and safety of BPF program in kernel
+  with simple algorithm
+
+Writing filters in tcpdump syntax or in systemtap language is difficult.
+Same filter done in C is easier to understand.
+GCC/LLVM-bpf backend is optional.
+Extended BPF can be coded with macroses from bpf.h just like original BPF.
+
+Minimal performance overhead is achieved by having one to one mapping
+between BPF insns and native insns, and one to one mapping between BPF
+registers and native registers on 64-bit CPUs
+
+Extended BPF allows jump forward and backward for two reasons:
+to reduce branch mispredict penalty compiler moves cold basic blocks out of
+fall-through path and to reduce code duplication that would be unavoidable
+if only jump forward was available.
+To guarantee termination simple non-recursive depth-first-search verifies
+that there are no back-edges (no loops in the program), program is a DAG
+with root at the first insn, all branches end at the last RET insn and
+all instructions are reachable.
+(Original BPF actually allows unreachable insns, but that's a bug)
+
+Original BPF has two registers (A and X) and hidden frame pointer.
+Extended BPF has ten registers and read-only frame pointer.
+Since 64-bit CPUs are passing arguments to the functions via registers
+the number of args from BPF program to in-kernel function is restricted to 5
+and one register is used to accept return value from in-kernel function.
+x86_64 passes first 6 arguments in registers.
+aarch64/sparcv9/mips64 have 7-8 registers for arguments.
+x86_64 has 6 callee saved registers.
+aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+Therefore extended BPF calling convention is defined as:
+R0 - return value from in-kernel function
+R1-R5 - arguments from BPF program to in-kernel function
+R6-R9 - callee saved registers that in-kernel function will preserve
+R10 - read-only frame pointer to access stack
+
+so that all BPF registers map one to one to HW registers on x86_64,aarch64,etc
+and BPF calling convention maps directly to ABIs used by kernel on 64-bit
+architectures.
+
+R0-R5 are scratch registers and BPF program needs spill/fill them if necessary
+across calls.
+Note that there is only one BPF program == one BPF function and it cannot call
+other BPF functions. It can only call predefined in-kernel functions.
+
+All BPF registers are 64-bit without subregs, which makes JITed x86 code
+less optimal, but matches sparc/mips architectures.
+Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64
+nicely, but read-modify-write overhead for sparc/mips is not worth the gains.
+
+Original BPF and extended BPF are two operand instructions, which helps
+to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+Extended BPF doesn't have pre-defined endianness not to favor one
+architecture vs another. Therefore bswap insn was introduced.
+Original BPF doesn't have such insn and does bswap as part of sk_load_word call
+which is often unnecessary if we want to compare the value with the constant.
+Restricted C code might be written differently depending on endianness
+and GCC/LLVM-bpf will take an endianness flag.
+
+32-bit architectures run 64-bit extended BPF programs via interpreter
+
+Q: Why extended BPF is 64-bit? Cannot we live with 32-bit?
+A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit
+values in/out kernel functions, so 32-bit BPF registers would require to define
+register-pair ABI, there won't be a direct BPF register to HW register
+mapping and JIT would need to do combine/split/move operations for every
+register in and out of the function, which is complex, bug prone and slow.
+Another reason is counters. To use 64-bit counter BPF program would need to do
+a complex math. Again bug prone and not atomic.
+
+Q: Original BPF is safe, deterministic and kernel can easily prove that.
+   Does extended BPF keep these properties?
+A: Yes. The safety of the program is determined in two steps.
+First step does depth-first-search to disallow loops and other CFG validation.
+Second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+At the start of the program the register R1 contains a pointer to bpf_context
+and has type PTR_TO_CTX. If checker sees an insn that does R2=R1, then R2 has
+now type PTR_TO_CTX as well and can be used on right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+1, then R2=INVALID_PTR and it is readable.
+If register was never written to, it's not readable.
+After kernel function call, R1-R5 are reset to unreadable and R0 has a return
+type of the function. Since R6-R9 are callee saved, their state is preserved
+across the call.
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_TABLE, PTR_TO_STACK. They are bounds and alginment
+checked.
+
+bpf_context structure is generic. Its contents are defined by specific use case.
+For seccomp it can be seccomp_data and through get_context_access callback
+BPF checker is customized, so that BPF program can only access certain fields
+of bpf_context with specified size and alignment.
+For example, the following insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+intends to load word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, then get_context_access callback should let the checker know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise the checker
+will reject the program.
+If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds,
+which are hard coded to [-480, 0]. In this example offset is 8, so it will fail
+verification.
+The checker will allow BPF program to read data from stack only after it wrote
+into it.
+Pointer register spill/fill is tracked as well, since four (R6-R9) callee saved
+registers may not be enough for some programs.
+
+Allowed function calls are customized via get_func_proto callback.
+For example:
+  u64 bpf_load_byte(struct bpf_context *ctx, u32 offset);
+function will have the following definition:
+  struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
+and BPF checker will verify that bpf_load_byte is always called with first
+argument being a valid pointer to bpf_context. After the call BPF register R0
+will be set to readable state, so that BPF program can access it.
+
+One of the useful functions that can be made available to BPF program
+are bpf_table_lookup/bpf_table_update.
+Using them a tracing filter can collect any type of statistics.
+
+Therefore extended BPF program consists of instructions and tables.
+From BPF program the table is identified by constant table_id
+and access to a table in C looks like:
+elem = bpf_table_lookup(ctx, table_id, key);
+
+BPF checker matches 'table_id' against known tables, verifies that 'key' points
+to stack and table->key_size bytes are initialized.
+From there on bpf_table_lookup() is a normal kernel function. It needs to do
+a lookup by whatever means and return either valid pointer to the element
+or NULL. BPF checker will verify that the program accesses the pointer only
+after comparing it to NULL. That's the meaning of PTR_TO_TABLE_CONDITIONAL and
+PTR_TO_TABLE register types in bpf_check.c
+
+If a kernel subsystem wants to use this BPF framework and decides to implement
+bpf_table_lookup, the checker will guarantee that argument 'ctx' is a valid
+pointer to bpf_context, 'table_id' is valid table_id and table->key_size bytes
+can be read from the pointer 'key'. It's up to implementation to decide how it
+wants to do the lookup and what is the key.
+
+Going back to the example BPF insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+if R6=PTR_TO_TABLE, then offset and size of access must be within
+[0, table->elem_size] which is determined by constant table_id that was passed
+into bpf_table_lookup call prior to this insn.
+
+Just like original, extended BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will call fixed number of kernel functions.
+Earlier implementation of the checker had a precise calculation of worst case
+number of insns, but it was removed to simplify the code, since the worst number
+is always less then number of insns in a program anyway (because it's a DAG).
+
+Since register/stack state tracking simulates execution of all insns in all
+possible branches, it will explode if not bounded. There are two bounds.
+verifier_state stack is limited to 1k, therefore BPF program cannot have
+more than 1k jump insns.
+Total number of insns to be analyzed is limited to 32k, which means that
+checker will either prove correctness or reject the program in few
+milliseconds on average x86 cpu. Valid programs take microseconds to verify.
+
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/