Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753966AbaBFAK2 (ORCPT ); Wed, 5 Feb 2014 19:10:28 -0500 Received: from mail-pa0-f41.google.com ([209.85.220.41]:62080 "EHLO mail-pa0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751192AbaBFAKT (ORCPT ); Wed, 5 Feb 2014 19:10:19 -0500 From: Alexei Starovoitov To: Ingo Molnar Cc: Steven Rostedt , Peter Zijlstra , "H. Peter Anvin" , Thomas Gleixner , Masami Hiramatsu , Tom Zanussi , Jovi Zhangwei , Eric Dumazet , Linus Torvalds , Andrew Morton , Frederic Weisbecker , Arnaldo Carvalho de Melo , Pekka Enberg , "David S. Miller" , Arjan van de Ven , Christoph Hellwig , linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 tip 3/7] Extended BPF (64-bit BPF) design document Date: Wed, 5 Feb 2014 16:10:03 -0800 Message-Id: <1391645407-4092-4-git-send-email-ast@plumgrid.com> X-Mailer: git-send-email 1.7.9.5 In-Reply-To: <1391645407-4092-1-git-send-email-ast@plumgrid.com> References: <1391645407-4092-1-git-send-email-ast@plumgrid.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Signed-off-by: Alexei Starovoitov --- Documentation/bpf_jit.txt | 204 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 204 insertions(+) create mode 100644 Documentation/bpf_jit.txt diff --git a/Documentation/bpf_jit.txt b/Documentation/bpf_jit.txt new file mode 100644 index 0000000..9c70f42 --- /dev/null +++ b/Documentation/bpf_jit.txt @@ -0,0 +1,204 @@ +Subject: extended BPF or 64-bit BPF + +Q: What is BPF? +A: Safe dynamically loadable 32-bit program that can access skb->data via +sk_load_byte/half/word calls or seccomp_data. Can be attached to sockets, +to netfilter xtables, seccomp. In case of sockets/xtables input is skb. +In case of seccomp input is struct seccomp_data. + +Q: What is extended BPF? +A: Safe dynamically loadable 64-bit program that can call fixed set +of kernel functions and takes generic bpf_context as an input. +BPF program is a glue between kernel functions and bpf_context. +Different kernel subsystems can define their own set of available functions +and alter BPF machinery for specific use case. + +Example 1: +when function set is {bpf_load_byte/half/word} and bpf_context=skb +the extended BPF is equivalent to original BPF (w/o negative offset extensions), +since any such extended BPF program will only be able to load data from skb +and interpret it. + +Example 2: +when function set is {empty} and bpf_context=seccomp_data, +the extended BPF is equivalent to original seccomp BPF with simpler programs +and can immediately take advantage of extended BPF-JIT. +(original BPF-JIT doesn't work for seccomp) + +Example 3: +when function set is {bpf_load_xxx + bpf_table_lookup} and bpf_context=skb +the extended BPF can be used to implement network analytics in tcpdump. +Like counting all tcp flows through the dev or filtering for specific +set of IP addresses. + +Example 4: +when function set is {load_xxx + table_lookup + trace_printk} and +bpf_context=pt_regs, the extended BPF is used to implement systemtap-like +tracing filters + +Extended Instruction Set was designed with these goals: +- write programs in restricted C and compile into BPF with GCC/LLVM +- just-in-time map to modern 64-bit CPU with minimal performance overhead + over two steps: C -> BPF -> native code +- guarantee termination and safety of BPF program in kernel + with simple algorithm + +Writing filters in tcpdump syntax or in systemtap language is difficult. +Same filter done in C is easier to understand. +GCC/LLVM-bpf backend is optional. +Extended BPF can be coded with macroses from bpf.h just like original BPF. + +Minimal performance overhead is achieved by having one to one mapping +between BPF insns and native insns, and one to one mapping between BPF +registers and native registers on 64-bit CPUs + +Extended BPF allows jump forward and backward for two reasons: +to reduce branch mispredict penalty compiler moves cold basic blocks out of +fall-through path and to reduce code duplication that would be unavoidable +if only jump forward was available. +To guarantee termination simple non-recursive depth-first-search verifies +that there are no back-edges (no loops in the program), program is a DAG +with root at the first insn, all branches end at the last RET insn and +all instructions are reachable. +(Original BPF actually allows unreachable insns, but that's a bug) + +Original BPF has two registers (A and X) and hidden frame pointer. +Extended BPF has ten registers and read-only frame pointer. +Since 64-bit CPUs are passing arguments to the functions via registers +the number of args from BPF program to in-kernel function is restricted to 5 +and one register is used to accept return value from in-kernel function. +x86_64 passes first 6 arguments in registers. +aarch64/sparcv9/mips64 have 7-8 registers for arguments. +x86_64 has 6 callee saved registers. +aarch64/sparcv9/mips64 have 11 or more callee saved registers. + +Therefore extended BPF calling convention is defined as: +R0 - return value from in-kernel function +R1-R5 - arguments from BPF program to in-kernel function +R6-R9 - callee saved registers that in-kernel function will preserve +R10 - read-only frame pointer to access stack + +so that all BPF registers map one to one to HW registers on x86_64,aarch64,etc +and BPF calling convention maps directly to ABIs used by kernel on 64-bit +architectures. + +R0-R5 are scratch registers and BPF program needs spill/fill them if necessary +across calls. +Note that there is only one BPF program == one BPF function and it cannot call +other BPF functions. It can only call predefined in-kernel functions. + +All BPF registers are 64-bit without subregs, which makes JITed x86 code +less optimal, but matches sparc/mips architectures. +Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64 +nicely, but read-modify-write overhead for sparc/mips is not worth the gains. + +Original BPF and extended BPF are two operand instructions, which helps +to do one-to-one mapping between BPF insn and x86 insn during JIT. + +Extended BPF doesn't have pre-defined endianness not to favor one +architecture vs another. Therefore bswap insn was introduced. +Original BPF doesn't have such insn and does bswap as part of sk_load_word call +which is often unnecessary if we want to compare the value with the constant. +Restricted C code might be written differently depending on endianness +and GCC/LLVM-bpf will take an endianness flag. + +32-bit architectures run 64-bit extended BPF programs via interpreter + +Q: Why extended BPF is 64-bit? Cannot we live with 32-bit? +A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit +values in/out kernel functions, so 32-bit BPF registers would require to define +register-pair ABI, there won't be a direct BPF register to HW register +mapping and JIT would need to do combine/split/move operations for every +register in and out of the function, which is complex, bug prone and slow. +Another reason is counters. To use 64-bit counter BPF program would need to do +a complex math. Again bug prone and not atomic. + +Q: Original BPF is safe, deterministic and kernel can easily prove that. + Does extended BPF keep these properties? +A: Yes. The safety of the program is determined in two steps. +First step does depth-first-search to disallow loops and other CFG validation. +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. +At the start of the program the register R1 contains a pointer to bpf_context +and has type PTR_TO_CTX. If checker sees an insn that does R2=R1, then R2 has +now type PTR_TO_CTX as well and can be used on right hand side of expression. +If R1=PTR_TO_CTX and insn is R2=R1+1, then R2=INVALID_PTR and it is readable. +If register was never written to, it's not readable. +After kernel function call, R1-R5 are reset to unreadable and R0 has a return +type of the function. Since R6-R9 are callee saved, their state is preserved +across the call. +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_TABLE, PTR_TO_STACK. They are bounds and alginment +checked. + +bpf_context structure is generic. Its contents are defined by specific use case. +For seccomp it can be seccomp_data and through get_context_access callback +BPF checker is customized, so that BPF program can only access certain fields +of bpf_context with specified size and alignment. +For example, the following insn: + BPF_INSN_LD(BPF_W, R0, R6, 8) +intends to load word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, then get_context_access callback should let the checker know +that offset 8 of size 4 bytes can be accessed for reading, otherwise the checker +will reject the program. +If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds, +which are hard coded to [-480, 0]. In this example offset is 8, so it will fail +verification. +The checker will allow BPF program to read data from stack only after it wrote +into it. +Pointer register spill/fill is tracked as well, since four (R6-R9) callee saved +registers may not be enough for some programs. + +Allowed function calls are customized via get_func_proto callback. +For example: + u64 bpf_load_byte(struct bpf_context *ctx, u32 offset); +function will have the following definition: + struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX}; +and BPF checker will verify that bpf_load_byte is always called with first +argument being a valid pointer to bpf_context. After the call BPF register R0 +will be set to readable state, so that BPF program can access it. + +One of the useful functions that can be made available to BPF program +are bpf_table_lookup/bpf_table_update. +Using them a tracing filter can collect any type of statistics. + +Therefore extended BPF program consists of instructions and tables. +From BPF program the table is identified by constant table_id +and access to a table in C looks like: +elem = bpf_table_lookup(ctx, table_id, key); + +BPF checker matches 'table_id' against known tables, verifies that 'key' points +to stack and table->key_size bytes are initialized. +From there on bpf_table_lookup() is a normal kernel function. It needs to do +a lookup by whatever means and return either valid pointer to the element +or NULL. BPF checker will verify that the program accesses the pointer only +after comparing it to NULL. That's the meaning of PTR_TO_TABLE_CONDITIONAL and +PTR_TO_TABLE register types in bpf_check.c + +If a kernel subsystem wants to use this BPF framework and decides to implement +bpf_table_lookup, the checker will guarantee that argument 'ctx' is a valid +pointer to bpf_context, 'table_id' is valid table_id and table->key_size bytes +can be read from the pointer 'key'. It's up to implementation to decide how it +wants to do the lookup and what is the key. + +Going back to the example BPF insn: + BPF_INSN_LD(BPF_W, R0, R6, 8) +if R6=PTR_TO_TABLE, then offset and size of access must be within +[0, table->elem_size] which is determined by constant table_id that was passed +into bpf_table_lookup call prior to this insn. + +Just like original, extended BPF is limited to 4096 insns, which means that any +program will terminate quickly and will call fixed number of kernel functions. +Earlier implementation of the checker had a precise calculation of worst case +number of insns, but it was removed to simplify the code, since the worst number +is always less then number of insns in a program anyway (because it's a DAG). + +Since register/stack state tracking simulates execution of all insns in all +possible branches, it will explode if not bounded. There are two bounds. +verifier_state stack is limited to 1k, therefore BPF program cannot have +more than 1k jump insns. +Total number of insns to be analyzed is limited to 32k, which means that +checker will either prove correctness or reject the program in few +milliseconds on average x86 cpu. Valid programs take microseconds to verify. + -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/