2014-06-02 07:01:55

by Alexei Starovoitov

Subject: [PATCH v2 net-next 0/2] split BPF out of core networking

This patch set splits BPF out of core networking into a generic component.

patch #1 splits filter.c into two logical pieces: the generic BPF core and socket
filters. It only moves functions around; there are no functional changes.

patch #2 adds a hidden CONFIG_BPF option that seccomp/tracing can select.

The main value of the patch set is not the separation from NET, but rather the logical
boundary between the generic BPF core and socket filtering. All socket-specific code
stays in net/core/filter.c, while kernel/bpf/core.c holds the generic BPF
infrastructure (both classic and internal).
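
To make that boundary concrete, here is a rough sketch (illustration only, not part of
this set) of how a NET-independent user would drive the generic core: build an internal
BPF program with the insn macros, pick a runtime with sk_filter_select_runtime(), and
run it via SK_RUN_FILTER(). The allocation details (sk_filter_size() and the insnsi[]
member of struct sk_filter) are assumed from the current filter.h layout.

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/filter.h>

static unsigned int run_const_prog(void)
{
	struct sock_filter_int prog[] = {
		BPF_MOV32_IMM(BPF_REG_0, 42),	/* R0 = 42 */
		BPF_EXIT_INSN(),		/* return R0 */
	};
	struct sk_filter *fp;
	unsigned int ret;

	fp = kzalloc(sk_filter_size(ARRAY_SIZE(prog)), GFP_KERNEL);
	if (!fp)
		return 0;

	fp->len = ARRAY_SIZE(prog);
	memcpy(fp->insnsi, prog, sizeof(prog));

	/* JIT the internal program if available, else use __sk_run_filter() */
	sk_filter_select_runtime(fp);

	ret = SK_RUN_FILTER(fp, NULL);	/* ret == 42 */

	sk_filter_free(fp);		/* frees the program (and JIT image, if any) */
	return ret;
}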

Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use
BPF JITs yet. This can be cleaned up in the future. It also seems to make sense
to split filter.h into generic and socket-specific parts to clean up the
boundary further.

Tested with several NET and NET-less configs on arm and x86.

V1->V2:
rebase on top of net-next
split filter.c into kernel/bpf/core.c instead of net/bpf/core.c

Alexei Starovoitov (2):
net: filter: split filter.c into two files
net: filter: split BPF out of core networking

arch/Kconfig | 6 +-
include/linux/filter.h | 2 +
kernel/Makefile | 1 +
kernel/bpf/Makefile | 5 +
kernel/bpf/core.c | 1063 ++++++++++++++++++++++++++++++++++++++++++++++++
net/Kconfig | 1 +
net/core/filter.c | 1023 +---------------------------------------------
7 files changed, 1079 insertions(+), 1022 deletions(-)
create mode 100644 kernel/bpf/Makefile
create mode 100644 kernel/bpf/core.c

--
1.7.9.5


2014-06-02 07:02:03

by Alexei Starovoitov

Subject: [PATCH v2 net-next 1/2] net: filter: split filter.c into two files

BPF is used in several kernel components. This split creates a logical boundary
between the generic BPF core and specific BPF use cases.

kernel/bpf/core.c:
internal BPF interpreter, classic to internal converter, classic verifier

net/core/filter.c:
classic BPF extensions related to socket filters, socket attach/detach

This patch only moves functions. No other changes.
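
For reference, a consumer drives the conversion that now lives in kernel/bpf/core.c
with the two-pass workflow documented in the kerneldoc above sk_convert_filter().
A minimal sketch (function name is illustrative, error unwinding collapsed):

#include <linux/filter.h>
#include <linux/slab.h>

static struct sock_filter_int *convert_classic(struct sock_filter *old,
					       int old_len, int *new_len)
{
	struct sock_filter_int *new_prog;

	/* classic verifier first */
	if (sk_chk_filter(old, old_len))
		return NULL;

	/* pass 1: compute the length of the converted program */
	if (sk_convert_filter(old, old_len, NULL, new_len))
		return NULL;

	new_prog = kmalloc(sizeof(*new_prog) * *new_len, GFP_KERNEL);
	if (!new_prog)
		return NULL;

	/* pass 2: emit the internal BPF instructions */
	if (sk_convert_filter(old, old_len, new_prog, new_len)) {
		kfree(new_prog);
		return NULL;
	}

	return new_prog;
}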

The next patch introduces a hidden Kconfig flag, so seccomp and tracing filters
can select the BPF core alone instead of depending on the whole of NET.

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/filter.h | 2 +
kernel/Makefile | 1 +
kernel/bpf/Makefile | 5 +
kernel/bpf/core.c | 1042 ++++++++++++++++++++++++++++++++++++++++++++++++
net/core/filter.c | 1023 +----------------------------------------------
5 files changed, 1052 insertions(+), 1021 deletions(-)
create mode 100644 kernel/bpf/Makefile
create mode 100644 kernel/bpf/core.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index f0c2ad43b4af..0e463ee77bb2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -341,6 +341,8 @@ void sk_filter_free(struct sk_filter *fp);

int sk_convert_filter(struct sock_filter *prog, int len,
struct sock_filter_int *new_prog, int *new_len);
+bool sk_convert_bpf_extensions(struct sock_filter *fp,
+ struct sock_filter_int **insnp);

int sk_unattached_filter_create(struct sk_filter **pfp,
struct sock_fprog_kern *fprog);
diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b6246ce9..e7360b7c2c0e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
+obj-$(CONFIG_NET) += bpf/

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
new file mode 100644
index 000000000000..2634b2fe5202
--- /dev/null
+++ b/kernel/bpf/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the BPF core infrastructure
+#
+
+obj-y := core.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
new file mode 100644
index 000000000000..22c2d99414c0
--- /dev/null
+++ b/kernel/bpf/core.c
@@ -0,0 +1,1042 @@
+/*
+ * Linux Socket Filter - Kernel level socket filtering
+ *
+ * Based on the design of the Berkeley Packet Filter. The new
+ * internal format has been designed by PLUMgrid:
+ *
+ * Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
+ *
+ * Authors:
+ *
+ * Jay Schulist <[email protected]>
+ * Alexei Starovoitov <[email protected]>
+ * Daniel Borkmann <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Andi Kleen - Fix a few bad bugs and races.
+ * Kris Katterjohn - Added many additional checks in sk_chk_filter()
+ */
+
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+/* Registers */
+#define BPF_R0 regs[BPF_REG_0]
+#define BPF_R1 regs[BPF_REG_1]
+#define BPF_R2 regs[BPF_REG_2]
+#define BPF_R3 regs[BPF_REG_3]
+#define BPF_R4 regs[BPF_REG_4]
+#define BPF_R5 regs[BPF_REG_5]
+#define BPF_R6 regs[BPF_REG_6]
+#define BPF_R7 regs[BPF_REG_7]
+#define BPF_R8 regs[BPF_REG_8]
+#define BPF_R9 regs[BPF_REG_9]
+#define BPF_R10 regs[BPF_REG_10]
+
+/* Named registers */
+#define A regs[insn->a_reg]
+#define X regs[insn->x_reg]
+#define FP regs[BPF_REG_FP]
+#define ARG1 regs[BPF_REG_ARG1]
+#define CTX regs[BPF_REG_CTX]
+#define K insn->imm
+
+/* Exported for the bpf jit load helper */
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
+{
+ u8 *ptr = NULL;
+
+ if (k >= SKF_NET_OFF)
+ ptr = skb_network_header(skb) + k - SKF_NET_OFF;
+ else if (k >= SKF_LL_OFF)
+ ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
+ if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
+ return ptr;
+
+ return NULL;
+}
+
+static inline void *load_pointer(const struct sk_buff *skb, int k,
+ unsigned int size, void *buffer)
+{
+ if (k >= 0)
+ return skb_header_pointer(skb, k, size, buffer);
+
+ return bpf_internal_load_pointer_neg_helper(skb, k, size);
+}
+
+/* Base function for offset calculation. Needs to go into .text section,
+ * therefore keeping it non-static as well; will also be used by JITs
+ * anyway later on, so do not let the compiler omit it.
+ */
+noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ return 0;
+}
+
+/**
+ * __sk_run_filter - run a filter on a given context
+ * @ctx: buffer to run the filter on
+ * @insn: filter to apply
+ *
+ * Decode and apply filter instructions to the skb->data. Return length to
+ * keep, 0 for none. @ctx is the data we are operating on, @insn is the
+ * array of filter instructions.
+ */
+static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
+{
+ u64 stack[MAX_BPF_STACK / sizeof(u64)];
+ u64 regs[MAX_BPF_REG], tmp;
+ static const void *jumptable[256] = {
+ [0 ... 255] = &&default_label,
+ /* Now overwrite non-defaults ... */
+ /* 32 bit ALU operations */
+ [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
+ [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
+ [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
+ [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
+ [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
+ [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
+ [BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X,
+ [BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K,
+ [BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
+ [BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
+ [BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
+ [BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
+ [BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
+ [BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
+ [BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
+ [BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
+ [BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
+ [BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
+ [BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
+ [BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
+ [BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
+ [BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
+ [BPF_ALU | BPF_NEG] = &&ALU_NEG,
+ [BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
+ [BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
+ /* 64 bit ALU operations */
+ [BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
+ [BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
+ [BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
+ [BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
+ [BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
+ [BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
+ [BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
+ [BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
+ [BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
+ [BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
+ [BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
+ [BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
+ [BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
+ [BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
+ [BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
+ [BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
+ [BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
+ [BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
+ [BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
+ [BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
+ [BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
+ [BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
+ [BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
+ [BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
+ [BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
+ /* Call instruction */
+ [BPF_JMP | BPF_CALL] = &&JMP_CALL,
+ /* Jumps */
+ [BPF_JMP | BPF_JA] = &&JMP_JA,
+ [BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
+ [BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
+ [BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
+ [BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
+ [BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
+ [BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
+ [BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
+ [BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
+ [BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
+ [BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
+ [BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
+ [BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
+ [BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
+ [BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
+ /* Program return */
+ [BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
+ /* Store instructions */
+ [BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
+ [BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
+ [BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
+ [BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
+ [BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
+ [BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
+ [BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
+ [BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
+ [BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
+ [BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
+ /* Load instructions */
+ [BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
+ [BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
+ [BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
+ [BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
+ [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
+ [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
+ [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
+ [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
+ [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
+ [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
+ };
+ void *ptr;
+ int off;
+
+#define CONT ({ insn++; goto select_insn; })
+#define CONT_JMP ({ insn++; goto select_insn; })
+
+ FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
+ ARG1 = (u64) (unsigned long) ctx;
+
+ /* Register for user BPF programs need to be reset first. */
+ regs[BPF_REG_A] = 0;
+ regs[BPF_REG_X] = 0;
+
+select_insn:
+ goto *jumptable[insn->code];
+
+ /* ALU */
+#define ALU(OPCODE, OP) \
+ ALU64_##OPCODE##_X: \
+ A = A OP X; \
+ CONT; \
+ ALU_##OPCODE##_X: \
+ A = (u32) A OP (u32) X; \
+ CONT; \
+ ALU64_##OPCODE##_K: \
+ A = A OP K; \
+ CONT; \
+ ALU_##OPCODE##_K: \
+ A = (u32) A OP (u32) K; \
+ CONT;
+
+ ALU(ADD, +)
+ ALU(SUB, -)
+ ALU(AND, &)
+ ALU(OR, |)
+ ALU(LSH, <<)
+ ALU(RSH, >>)
+ ALU(XOR, ^)
+ ALU(MUL, *)
+#undef ALU
+ ALU_NEG:
+ A = (u32) -A;
+ CONT;
+ ALU64_NEG:
+ A = -A;
+ CONT;
+ ALU_MOV_X:
+ A = (u32) X;
+ CONT;
+ ALU_MOV_K:
+ A = (u32) K;
+ CONT;
+ ALU64_MOV_X:
+ A = X;
+ CONT;
+ ALU64_MOV_K:
+ A = K;
+ CONT;
+ ALU64_ARSH_X:
+ (*(s64 *) &A) >>= X;
+ CONT;
+ ALU64_ARSH_K:
+ (*(s64 *) &A) >>= K;
+ CONT;
+ ALU64_MOD_X:
+ if (unlikely(X == 0))
+ return 0;
+ tmp = A;
+ A = do_div(tmp, X);
+ CONT;
+ ALU_MOD_X:
+ if (unlikely(X == 0))
+ return 0;
+ tmp = (u32) A;
+ A = do_div(tmp, (u32) X);
+ CONT;
+ ALU64_MOD_K:
+ tmp = A;
+ A = do_div(tmp, K);
+ CONT;
+ ALU_MOD_K:
+ tmp = (u32) A;
+ A = do_div(tmp, (u32) K);
+ CONT;
+ ALU64_DIV_X:
+ if (unlikely(X == 0))
+ return 0;
+ do_div(A, X);
+ CONT;
+ ALU_DIV_X:
+ if (unlikely(X == 0))
+ return 0;
+ tmp = (u32) A;
+ do_div(tmp, (u32) X);
+ A = (u32) tmp;
+ CONT;
+ ALU64_DIV_K:
+ do_div(A, K);
+ CONT;
+ ALU_DIV_K:
+ tmp = (u32) A;
+ do_div(tmp, (u32) K);
+ A = (u32) tmp;
+ CONT;
+ ALU_END_TO_BE:
+ switch (K) {
+ case 16:
+ A = (__force u16) cpu_to_be16(A);
+ break;
+ case 32:
+ A = (__force u32) cpu_to_be32(A);
+ break;
+ case 64:
+ A = (__force u64) cpu_to_be64(A);
+ break;
+ }
+ CONT;
+ ALU_END_TO_LE:
+ switch (K) {
+ case 16:
+ A = (__force u16) cpu_to_le16(A);
+ break;
+ case 32:
+ A = (__force u32) cpu_to_le32(A);
+ break;
+ case 64:
+ A = (__force u64) cpu_to_le64(A);
+ break;
+ }
+ CONT;
+
+ /* CALL */
+ JMP_CALL:
+ /* Function call scratches BPF_R1-BPF_R5 registers,
+ * preserves BPF_R6-BPF_R9, and stores return value
+ * into BPF_R0.
+ */
+ BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
+ BPF_R4, BPF_R5);
+ CONT;
+
+ /* JMP */
+ JMP_JA:
+ insn += insn->off;
+ CONT;
+ JMP_JEQ_X:
+ if (A == X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JEQ_K:
+ if (A == K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JNE_X:
+ if (A != X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JNE_K:
+ if (A != K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGT_X:
+ if (A > X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGT_K:
+ if (A > K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGE_X:
+ if (A >= X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JGE_K:
+ if (A >= K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGT_X:
+ if (((s64) A) > ((s64) X)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGT_K:
+ if (((s64) A) > ((s64) K)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGE_X:
+ if (((s64) A) >= ((s64) X)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSGE_K:
+ if (((s64) A) >= ((s64) K)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSET_X:
+ if (A & X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_JSET_K:
+ if (A & K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ JMP_EXIT:
+ return BPF_R0;
+
+ /* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+ STX_MEM_##SIZEOP: \
+ *(SIZE *)(unsigned long) (A + insn->off) = X; \
+ CONT; \
+ ST_MEM_##SIZEOP: \
+ *(SIZE *)(unsigned long) (A + insn->off) = K; \
+ CONT; \
+ LDX_MEM_##SIZEOP: \
+ A = *(SIZE *)(unsigned long) (X + insn->off); \
+ CONT;
+
+ LDST(B, u8)
+ LDST(H, u16)
+ LDST(W, u32)
+ LDST(DW, u64)
+#undef LDST
+ STX_XADD_W: /* lock xadd *(u32 *)(A + insn->off) += X */
+ atomic_add((u32) X, (atomic_t *)(unsigned long)
+ (A + insn->off));
+ CONT;
+ STX_XADD_DW: /* lock xadd *(u64 *)(A + insn->off) += X */
+ atomic64_add((u64) X, (atomic64_t *)(unsigned long)
+ (A + insn->off));
+ CONT;
+ LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + K)) */
+ off = K;
+load_word:
+ /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
+ * only appearing in the programs where ctx ==
+ * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
+ * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
+ * internal BPF verifier will check that BPF_R6 ==
+ * ctx.
+ *
+ * BPF_ABS and BPF_IND are wrappers of function calls,
+ * so they scratch BPF_R1-BPF_R5 registers, preserve
+ * BPF_R6-BPF_R9, and store return value into BPF_R0.
+ *
+ * Implicit input:
+ * ctx
+ *
+ * Explicit input:
+ * X == any register
+ * K == 32-bit immediate
+ *
+ * Output:
+ * BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
+ */
+
+ ptr = load_pointer((struct sk_buff *) ctx, off, 4, &tmp);
+ if (likely(ptr != NULL)) {
+ BPF_R0 = get_unaligned_be32(ptr);
+ CONT;
+ }
+
+ return 0;
+ LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + K)) */
+ off = K;
+load_half:
+ ptr = load_pointer((struct sk_buff *) ctx, off, 2, &tmp);
+ if (likely(ptr != NULL)) {
+ BPF_R0 = get_unaligned_be16(ptr);
+ CONT;
+ }
+
+ return 0;
+ LD_ABS_B: /* BPF_R0 = *(u8 *) (ctx + K) */
+ off = K;
+load_byte:
+ ptr = load_pointer((struct sk_buff *) ctx, off, 1, &tmp);
+ if (likely(ptr != NULL)) {
+ BPF_R0 = *(u8 *)ptr;
+ CONT;
+ }
+
+ return 0;
+ LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + X + K)) */
+ off = K + X;
+ goto load_word;
+ LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + X + K)) */
+ off = K + X;
+ goto load_half;
+ LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + X + K) */
+ off = K + X;
+ goto load_byte;
+
+ default_label:
+ /* If we ever reach this, we have a bug somewhere. */
+ WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+ return 0;
+}
+
+/**
+ * sk_convert_filter - convert filter program
+ * @prog: the user passed filter program
+ * @len: the length of the user passed filter program
+ * @new_prog: buffer where converted program will be stored
+ * @new_len: pointer to store length of converted program
+ *
+ * Remap 'sock_filter' style BPF instruction set to 'sock_filter_ext' style.
+ * Conversion workflow:
+ *
+ * 1) First pass for calculating the new program length:
+ * sk_convert_filter(old_prog, old_len, NULL, &new_len)
+ *
+ * 2) 2nd pass to remap in two passes: 1st pass finds new
+ * jump offsets, 2nd pass remapping:
+ * new_prog = kmalloc(sizeof(struct sock_filter_int) * new_len);
+ * sk_convert_filter(old_prog, old_len, new_prog, &new_len);
+ *
+ * User BPF's register A is mapped to our BPF register 6, user BPF
+ * register X is mapped to BPF register 7; frame pointer is always
+ * register 10; Context 'void *ctx' is stored in register 1, that is,
+ * for socket filters: ctx == 'struct sk_buff *', for seccomp:
+ * ctx == 'struct seccomp_data *'.
+ */
+int sk_convert_filter(struct sock_filter *prog, int len,
+ struct sock_filter_int *new_prog, int *new_len)
+{
+ int new_flen = 0, pass = 0, target, i;
+ struct sock_filter_int *new_insn;
+ struct sock_filter *fp;
+ int *addrs = NULL;
+ u8 bpf_src;
+
+ BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK);
+ BUILD_BUG_ON(BPF_REG_FP + 1 != MAX_BPF_REG);
+
+ if (len <= 0 || len >= BPF_MAXINSNS)
+ return -EINVAL;
+
+ if (new_prog) {
+ addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
+ if (!addrs)
+ return -ENOMEM;
+ }
+
+do_pass:
+ new_insn = new_prog;
+ fp = prog;
+
+ if (new_insn)
+ *new_insn = BPF_MOV64_REG(BPF_REG_CTX, BPF_REG_ARG1);
+ new_insn++;
+
+ for (i = 0; i < len; fp++, i++) {
+ struct sock_filter_int tmp_insns[6] = { };
+ struct sock_filter_int *insn = tmp_insns;
+
+ if (addrs)
+ addrs[i] = new_insn - new_prog;
+
+ switch (fp->code) {
+ /* All arithmetic insns and skb loads map as-is. */
+ case BPF_ALU | BPF_ADD | BPF_X:
+ case BPF_ALU | BPF_ADD | BPF_K:
+ case BPF_ALU | BPF_SUB | BPF_X:
+ case BPF_ALU | BPF_SUB | BPF_K:
+ case BPF_ALU | BPF_AND | BPF_X:
+ case BPF_ALU | BPF_AND | BPF_K:
+ case BPF_ALU | BPF_OR | BPF_X:
+ case BPF_ALU | BPF_OR | BPF_K:
+ case BPF_ALU | BPF_LSH | BPF_X:
+ case BPF_ALU | BPF_LSH | BPF_K:
+ case BPF_ALU | BPF_RSH | BPF_X:
+ case BPF_ALU | BPF_RSH | BPF_K:
+ case BPF_ALU | BPF_XOR | BPF_X:
+ case BPF_ALU | BPF_XOR | BPF_K:
+ case BPF_ALU | BPF_MUL | BPF_X:
+ case BPF_ALU | BPF_MUL | BPF_K:
+ case BPF_ALU | BPF_DIV | BPF_X:
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_MOD | BPF_X:
+ case BPF_ALU | BPF_MOD | BPF_K:
+ case BPF_ALU | BPF_NEG:
+ case BPF_LD | BPF_ABS | BPF_W:
+ case BPF_LD | BPF_ABS | BPF_H:
+ case BPF_LD | BPF_ABS | BPF_B:
+ case BPF_LD | BPF_IND | BPF_W:
+ case BPF_LD | BPF_IND | BPF_H:
+ case BPF_LD | BPF_IND | BPF_B:
+ /* Check for overloaded BPF extension and
+ * directly convert it if found, otherwise
+ * just move on with mapping.
+ */
+ if (BPF_CLASS(fp->code) == BPF_LD &&
+ BPF_MODE(fp->code) == BPF_ABS &&
+ sk_convert_bpf_extensions(fp, &insn))
+ break;
+
+ *insn = BPF_RAW_INSN(fp->code, BPF_REG_A, BPF_REG_X, 0, fp->k);
+ break;
+
+ /* Jump transformation cannot use BPF block macros
+ * everywhere as offset calculation and target updates
+ * require a bit more work than the rest, i.e. jump
+ * opcodes map as-is, but offsets need adjustment.
+ */
+
+#define BPF_EMIT_JMP \
+ do { \
+ if (target >= len || target < 0) \
+ goto err; \
+ insn->off = addrs ? addrs[target] - addrs[i] - 1 : 0; \
+ /* Adjust pc relative offset for 2nd or 3rd insn. */ \
+ insn->off -= insn - tmp_insns; \
+ } while (0)
+
+ case BPF_JMP | BPF_JA:
+ target = i + fp->k + 1;
+ insn->code = fp->code;
+ BPF_EMIT_JMP;
+ break;
+
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ if (BPF_SRC(fp->code) == BPF_K && (int) fp->k < 0) {
+ /* BPF immediates are signed, zero extend
+ * immediate into tmp register and use it
+ * in compare insn.
+ */
+ *insn++ = BPF_MOV32_IMM(BPF_REG_TMP, fp->k);
+
+ insn->a_reg = BPF_REG_A;
+ insn->x_reg = BPF_REG_TMP;
+ bpf_src = BPF_X;
+ } else {
+ insn->a_reg = BPF_REG_A;
+ insn->x_reg = BPF_REG_X;
+ insn->imm = fp->k;
+ bpf_src = BPF_SRC(fp->code);
+ }
+
+ /* Common case where 'jump_false' is next insn. */
+ if (fp->jf == 0) {
+ insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
+ target = i + fp->jt + 1;
+ BPF_EMIT_JMP;
+ break;
+ }
+
+ /* Convert JEQ into JNE when 'jump_true' is next insn. */
+ if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
+ insn->code = BPF_JMP | BPF_JNE | bpf_src;
+ target = i + fp->jf + 1;
+ BPF_EMIT_JMP;
+ break;
+ }
+
+ /* Other jumps are mapped into two insns: Jxx and JA. */
+ target = i + fp->jt + 1;
+ insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
+ BPF_EMIT_JMP;
+ insn++;
+
+ insn->code = BPF_JMP | BPF_JA;
+ target = i + fp->jf + 1;
+ BPF_EMIT_JMP;
+ break;
+
+ /* ldxb 4 * ([14] & 0xf) is remaped into 6 insns. */
+ case BPF_LDX | BPF_MSH | BPF_B:
+ /* tmp = A */
+ *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_A);
+ /* A = BPF_R0 = *(u8 *) (skb->data + K) */
+ *insn++ = BPF_LD_ABS(BPF_B, fp->k);
+ /* A &= 0xf */
+ *insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_A, 0xf);
+ /* A <<= 2 */
+ *insn++ = BPF_ALU32_IMM(BPF_LSH, BPF_REG_A, 2);
+ /* X = A */
+ *insn++ = BPF_MOV64_REG(BPF_REG_X, BPF_REG_A);
+ /* A = tmp */
+ *insn = BPF_MOV64_REG(BPF_REG_A, BPF_REG_TMP);
+ break;
+
+ /* RET_K, RET_A are remaped into 2 insns. */
+ case BPF_RET | BPF_A:
+ case BPF_RET | BPF_K:
+ *insn++ = BPF_MOV32_RAW(BPF_RVAL(fp->code) == BPF_K ?
+ BPF_K : BPF_X, BPF_REG_0,
+ BPF_REG_A, fp->k);
+ *insn = BPF_EXIT_INSN();
+ break;
+
+ /* Store to stack. */
+ case BPF_ST:
+ case BPF_STX:
+ *insn = BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_CLASS(fp->code) ==
+ BPF_ST ? BPF_REG_A : BPF_REG_X,
+ -(BPF_MEMWORDS - fp->k) * 4);
+ break;
+
+ /* Load from stack. */
+ case BPF_LD | BPF_MEM:
+ case BPF_LDX | BPF_MEM:
+ *insn = BPF_LDX_MEM(BPF_W, BPF_CLASS(fp->code) == BPF_LD ?
+ BPF_REG_A : BPF_REG_X, BPF_REG_FP,
+ -(BPF_MEMWORDS - fp->k) * 4);
+ break;
+
+ /* A = K or X = K */
+ case BPF_LD | BPF_IMM:
+ case BPF_LDX | BPF_IMM:
+ *insn = BPF_MOV32_IMM(BPF_CLASS(fp->code) == BPF_LD ?
+ BPF_REG_A : BPF_REG_X, fp->k);
+ break;
+
+ /* X = A */
+ case BPF_MISC | BPF_TAX:
+ *insn = BPF_MOV64_REG(BPF_REG_X, BPF_REG_A);
+ break;
+
+ /* A = X */
+ case BPF_MISC | BPF_TXA:
+ *insn = BPF_MOV64_REG(BPF_REG_A, BPF_REG_X);
+ break;
+
+ /* A = skb->len or X = skb->len */
+ case BPF_LD | BPF_W | BPF_LEN:
+ case BPF_LDX | BPF_W | BPF_LEN:
+ *insn = BPF_LDX_MEM(BPF_W, BPF_CLASS(fp->code) == BPF_LD ?
+ BPF_REG_A : BPF_REG_X, BPF_REG_CTX,
+ offsetof(struct sk_buff, len));
+ break;
+
+ /* Access seccomp_data fields. */
+ case BPF_LDX | BPF_ABS | BPF_W:
+ /* A = *(u32 *) (ctx + K) */
+ *insn = BPF_LDX_MEM(BPF_W, BPF_REG_A, BPF_REG_CTX, fp->k);
+ break;
+
+ /* Unkown instruction. */
+ default:
+ goto err;
+ }
+
+ insn++;
+ if (new_prog)
+ memcpy(new_insn, tmp_insns,
+ sizeof(*insn) * (insn - tmp_insns));
+ new_insn += insn - tmp_insns;
+ }
+
+ if (!new_prog) {
+ /* Only calculating new length. */
+ *new_len = new_insn - new_prog;
+ return 0;
+ }
+
+ pass++;
+ if (new_flen != new_insn - new_prog) {
+ new_flen = new_insn - new_prog;
+ if (pass > 2)
+ goto err;
+ goto do_pass;
+ }
+
+ kfree(addrs);
+ BUG_ON(*new_len != new_flen);
+ return 0;
+err:
+ kfree(addrs);
+ return -EINVAL;
+}
+
+/* Security:
+ *
+ * A BPF program is able to use 16 cells of memory to store intermediate
+ * values (check u32 mem[BPF_MEMWORDS] in sk_run_filter()).
+ *
+ * As we dont want to clear mem[] array for each packet going through
+ * sk_run_filter(), we check that filter loaded by user never try to read
+ * a cell if not previously written, and we check all branches to be sure
+ * a malicious user doesn't try to abuse us.
+ */
+static int check_load_and_stores(struct sock_filter *filter, int flen)
+{
+ u16 *masks, memvalid = 0; /* One bit per cell, 16 cells */
+ int pc, ret = 0;
+
+ BUILD_BUG_ON(BPF_MEMWORDS > 16);
+
+ masks = kmalloc(flen * sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ return -ENOMEM;
+
+ memset(masks, 0xff, flen * sizeof(*masks));
+
+ for (pc = 0; pc < flen; pc++) {
+ memvalid &= masks[pc];
+
+ switch (filter[pc].code) {
+ case BPF_ST:
+ case BPF_STX:
+ memvalid |= (1 << filter[pc].k);
+ break;
+ case BPF_LD | BPF_MEM:
+ case BPF_LDX | BPF_MEM:
+ if (!(memvalid & (1 << filter[pc].k))) {
+ ret = -EINVAL;
+ goto error;
+ }
+ break;
+ case BPF_JMP | BPF_JA:
+ /* A jump must set masks on target */
+ masks[pc + 1 + filter[pc].k] &= memvalid;
+ memvalid = ~0;
+ break;
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_X:
+ /* A jump must set masks on targets */
+ masks[pc + 1 + filter[pc].jt] &= memvalid;
+ masks[pc + 1 + filter[pc].jf] &= memvalid;
+ memvalid = ~0;
+ break;
+ }
+ }
+error:
+ kfree(masks);
+ return ret;
+}
+
+static bool chk_code_allowed(u16 code_to_probe)
+{
+ static const bool codes[] = {
+ /* 32 bit ALU operations */
+ [BPF_ALU | BPF_ADD | BPF_K] = true,
+ [BPF_ALU | BPF_ADD | BPF_X] = true,
+ [BPF_ALU | BPF_SUB | BPF_K] = true,
+ [BPF_ALU | BPF_SUB | BPF_X] = true,
+ [BPF_ALU | BPF_MUL | BPF_K] = true,
+ [BPF_ALU | BPF_MUL | BPF_X] = true,
+ [BPF_ALU | BPF_DIV | BPF_K] = true,
+ [BPF_ALU | BPF_DIV | BPF_X] = true,
+ [BPF_ALU | BPF_MOD | BPF_K] = true,
+ [BPF_ALU | BPF_MOD | BPF_X] = true,
+ [BPF_ALU | BPF_AND | BPF_K] = true,
+ [BPF_ALU | BPF_AND | BPF_X] = true,
+ [BPF_ALU | BPF_OR | BPF_K] = true,
+ [BPF_ALU | BPF_OR | BPF_X] = true,
+ [BPF_ALU | BPF_XOR | BPF_K] = true,
+ [BPF_ALU | BPF_XOR | BPF_X] = true,
+ [BPF_ALU | BPF_LSH | BPF_K] = true,
+ [BPF_ALU | BPF_LSH | BPF_X] = true,
+ [BPF_ALU | BPF_RSH | BPF_K] = true,
+ [BPF_ALU | BPF_RSH | BPF_X] = true,
+ [BPF_ALU | BPF_NEG] = true,
+ /* Load instructions */
+ [BPF_LD | BPF_W | BPF_ABS] = true,
+ [BPF_LD | BPF_H | BPF_ABS] = true,
+ [BPF_LD | BPF_B | BPF_ABS] = true,
+ [BPF_LD | BPF_W | BPF_LEN] = true,
+ [BPF_LD | BPF_W | BPF_IND] = true,
+ [BPF_LD | BPF_H | BPF_IND] = true,
+ [BPF_LD | BPF_B | BPF_IND] = true,
+ [BPF_LD | BPF_IMM] = true,
+ [BPF_LD | BPF_MEM] = true,
+ [BPF_LDX | BPF_W | BPF_LEN] = true,
+ [BPF_LDX | BPF_B | BPF_MSH] = true,
+ [BPF_LDX | BPF_IMM] = true,
+ [BPF_LDX | BPF_MEM] = true,
+ /* Store instructions */
+ [BPF_ST] = true,
+ [BPF_STX] = true,
+ /* Misc instructions */
+ [BPF_MISC | BPF_TAX] = true,
+ [BPF_MISC | BPF_TXA] = true,
+ /* Return instructions */
+ [BPF_RET | BPF_K] = true,
+ [BPF_RET | BPF_A] = true,
+ /* Jump instructions */
+ [BPF_JMP | BPF_JA] = true,
+ [BPF_JMP | BPF_JEQ | BPF_K] = true,
+ [BPF_JMP | BPF_JEQ | BPF_X] = true,
+ [BPF_JMP | BPF_JGE | BPF_K] = true,
+ [BPF_JMP | BPF_JGE | BPF_X] = true,
+ [BPF_JMP | BPF_JGT | BPF_K] = true,
+ [BPF_JMP | BPF_JGT | BPF_X] = true,
+ [BPF_JMP | BPF_JSET | BPF_K] = true,
+ [BPF_JMP | BPF_JSET | BPF_X] = true,
+ };
+
+ if (code_to_probe >= ARRAY_SIZE(codes))
+ return false;
+
+ return codes[code_to_probe];
+}
+
+/**
+ * sk_chk_filter - verify socket filter code
+ * @filter: filter to verify
+ * @flen: length of filter
+ *
+ * Check the user's filter code. If we let some ugly
+ * filter code slip through kaboom! The filter must contain
+ * no references or jumps that are out of range, no illegal
+ * instructions, and must end with a RET instruction.
+ *
+ * All jumps are forward as they are not signed.
+ *
+ * Returns 0 if the rule set is legal or -EINVAL if not.
+ */
+int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+{
+ bool anc_found;
+ int pc;
+
+ if (flen == 0 || flen > BPF_MAXINSNS)
+ return -EINVAL;
+
+ /* Check the filter code now */
+ for (pc = 0; pc < flen; pc++) {
+ struct sock_filter *ftest = &filter[pc];
+
+ /* May we actually operate on this code? */
+ if (!chk_code_allowed(ftest->code))
+ return -EINVAL;
+
+ /* Some instructions need special checks */
+ switch (ftest->code) {
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_MOD | BPF_K:
+ /* Check for division by zero */
+ if (ftest->k == 0)
+ return -EINVAL;
+ break;
+ case BPF_LD | BPF_MEM:
+ case BPF_LDX | BPF_MEM:
+ case BPF_ST:
+ case BPF_STX:
+ /* Check for invalid memory addresses */
+ if (ftest->k >= BPF_MEMWORDS)
+ return -EINVAL;
+ break;
+ case BPF_JMP | BPF_JA:
+ /* Note, the large ftest->k might cause loops.
+ * Compare this with conditional jumps below,
+ * where offsets are limited. --ANK (981016)
+ */
+ if (ftest->k >= (unsigned int)(flen - pc - 1))
+ return -EINVAL;
+ break;
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_X:
+ /* Both conditionals must be safe */
+ if (pc + ftest->jt + 1 >= flen ||
+ pc + ftest->jf + 1 >= flen)
+ return -EINVAL;
+ break;
+ case BPF_LD | BPF_W | BPF_ABS:
+ case BPF_LD | BPF_H | BPF_ABS:
+ case BPF_LD | BPF_B | BPF_ABS:
+ anc_found = false;
+ if (bpf_anc_helper(ftest) & BPF_ANC)
+ anc_found = true;
+ /* Ancillary operation unknown or unsupported */
+ if (anc_found == false && ftest->k >= SKF_AD_OFF)
+ return -EINVAL;
+ }
+ }
+
+ /* Last instruction must be a RET code */
+ switch (filter[flen - 1].code) {
+ case BPF_RET | BPF_K:
+ case BPF_RET | BPF_A:
+ return check_load_and_stores(filter, flen);
+ }
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL(sk_chk_filter);
+
+void __weak bpf_int_jit_compile(struct sk_filter *prog)
+{
+}
+
+/**
+ * sk_filter_select_runtime - select execution runtime for BPF program
+ * @fp: sk_filter populated with internal BPF program
+ *
+ * try to JIT internal BPF program, if JIT is not available select interpreter
+ * BPF program will be executed via SK_RUN_FILTER() macro
+ */
+void sk_filter_select_runtime(struct sk_filter *fp)
+{
+ fp->bpf_func = (void *) __sk_run_filter;
+
+ /* Probe if internal BPF can be JITed */
+ bpf_int_jit_compile(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
+
+/* free internal BPF program */
+void sk_filter_free(struct sk_filter *fp)
+{
+ bpf_jit_free(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_free);
diff --git a/net/core/filter.c b/net/core/filter.c
index 842f8393121d..9523677f735b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -45,54 +45,6 @@
#include <linux/seccomp.h>
#include <linux/if_vlan.h>

-/* Registers */
-#define BPF_R0 regs[BPF_REG_0]
-#define BPF_R1 regs[BPF_REG_1]
-#define BPF_R2 regs[BPF_REG_2]
-#define BPF_R3 regs[BPF_REG_3]
-#define BPF_R4 regs[BPF_REG_4]
-#define BPF_R5 regs[BPF_REG_5]
-#define BPF_R6 regs[BPF_REG_6]
-#define BPF_R7 regs[BPF_REG_7]
-#define BPF_R8 regs[BPF_REG_8]
-#define BPF_R9 regs[BPF_REG_9]
-#define BPF_R10 regs[BPF_REG_10]
-
-/* Named registers */
-#define A regs[insn->a_reg]
-#define X regs[insn->x_reg]
-#define FP regs[BPF_REG_FP]
-#define ARG1 regs[BPF_REG_ARG1]
-#define CTX regs[BPF_REG_CTX]
-#define K insn->imm
-
-/* No hurry in this branch
- *
- * Exported for the bpf jit load helper.
- */
-void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
-{
- u8 *ptr = NULL;
-
- if (k >= SKF_NET_OFF)
- ptr = skb_network_header(skb) + k - SKF_NET_OFF;
- else if (k >= SKF_LL_OFF)
- ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
- if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
- return ptr;
-
- return NULL;
-}
-
-static inline void *load_pointer(const struct sk_buff *skb, int k,
- unsigned int size, void *buffer)
-{
- if (k >= 0)
- return skb_header_pointer(skb, k, size, buffer);
-
- return bpf_internal_load_pointer_neg_helper(skb, k, size);
-}
-
/**
* sk_filter - run a packet through a socket filter
* @sk: sock associated with &sk_buff
@@ -135,451 +87,6 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
}
EXPORT_SYMBOL(sk_filter);

-/* Base function for offset calculation. Needs to go into .text section,
- * therefore keeping it non-static as well; will also be used by JITs
- * anyway later on, so do not let the compiler omit it.
- */
-noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
-{
- return 0;
-}
-
-/**
- * __sk_run_filter - run a filter on a given context
- * @ctx: buffer to run the filter on
- * @insn: filter to apply
- *
- * Decode and apply filter instructions to the skb->data. Return length to
- * keep, 0 for none. @ctx is the data we are operating on, @insn is the
- * array of filter instructions.
- */
-static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
-{
- u64 stack[MAX_BPF_STACK / sizeof(u64)];
- u64 regs[MAX_BPF_REG], tmp;
- static const void *jumptable[256] = {
- [0 ... 255] = &&default_label,
- /* Now overwrite non-defaults ... */
- /* 32 bit ALU operations */
- [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
- [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
- [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
- [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
- [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
- [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
- [BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X,
- [BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K,
- [BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
- [BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
- [BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
- [BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
- [BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
- [BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
- [BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
- [BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
- [BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
- [BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
- [BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
- [BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
- [BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
- [BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
- [BPF_ALU | BPF_NEG] = &&ALU_NEG,
- [BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
- [BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
- /* 64 bit ALU operations */
- [BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
- [BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
- [BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
- [BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
- [BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
- [BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
- [BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
- [BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
- [BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
- [BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
- [BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
- [BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
- [BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
- [BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
- [BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
- [BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
- [BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
- [BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
- [BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
- [BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
- [BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
- [BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
- [BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
- [BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
- [BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
- /* Call instruction */
- [BPF_JMP | BPF_CALL] = &&JMP_CALL,
- /* Jumps */
- [BPF_JMP | BPF_JA] = &&JMP_JA,
- [BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
- [BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
- [BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
- [BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
- [BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
- [BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
- [BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
- [BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
- [BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
- [BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
- [BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
- [BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
- [BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
- [BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
- /* Program return */
- [BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
- /* Store instructions */
- [BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
- [BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
- [BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
- [BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
- [BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
- [BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
- [BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
- [BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
- [BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
- [BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
- /* Load instructions */
- [BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
- [BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
- [BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
- [BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
- [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
- [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
- [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
- [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
- [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
- [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
- };
- void *ptr;
- int off;
-
-#define CONT ({ insn++; goto select_insn; })
-#define CONT_JMP ({ insn++; goto select_insn; })
-
- FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
- ARG1 = (u64) (unsigned long) ctx;
-
- /* Register for user BPF programs need to be reset first. */
- regs[BPF_REG_A] = 0;
- regs[BPF_REG_X] = 0;
-
-select_insn:
- goto *jumptable[insn->code];
-
- /* ALU */
-#define ALU(OPCODE, OP) \
- ALU64_##OPCODE##_X: \
- A = A OP X; \
- CONT; \
- ALU_##OPCODE##_X: \
- A = (u32) A OP (u32) X; \
- CONT; \
- ALU64_##OPCODE##_K: \
- A = A OP K; \
- CONT; \
- ALU_##OPCODE##_K: \
- A = (u32) A OP (u32) K; \
- CONT;
-
- ALU(ADD, +)
- ALU(SUB, -)
- ALU(AND, &)
- ALU(OR, |)
- ALU(LSH, <<)
- ALU(RSH, >>)
- ALU(XOR, ^)
- ALU(MUL, *)
-#undef ALU
- ALU_NEG:
- A = (u32) -A;
- CONT;
- ALU64_NEG:
- A = -A;
- CONT;
- ALU_MOV_X:
- A = (u32) X;
- CONT;
- ALU_MOV_K:
- A = (u32) K;
- CONT;
- ALU64_MOV_X:
- A = X;
- CONT;
- ALU64_MOV_K:
- A = K;
- CONT;
- ALU64_ARSH_X:
- (*(s64 *) &A) >>= X;
- CONT;
- ALU64_ARSH_K:
- (*(s64 *) &A) >>= K;
- CONT;
- ALU64_MOD_X:
- if (unlikely(X == 0))
- return 0;
- tmp = A;
- A = do_div(tmp, X);
- CONT;
- ALU_MOD_X:
- if (unlikely(X == 0))
- return 0;
- tmp = (u32) A;
- A = do_div(tmp, (u32) X);
- CONT;
- ALU64_MOD_K:
- tmp = A;
- A = do_div(tmp, K);
- CONT;
- ALU_MOD_K:
- tmp = (u32) A;
- A = do_div(tmp, (u32) K);
- CONT;
- ALU64_DIV_X:
- if (unlikely(X == 0))
- return 0;
- do_div(A, X);
- CONT;
- ALU_DIV_X:
- if (unlikely(X == 0))
- return 0;
- tmp = (u32) A;
- do_div(tmp, (u32) X);
- A = (u32) tmp;
- CONT;
- ALU64_DIV_K:
- do_div(A, K);
- CONT;
- ALU_DIV_K:
- tmp = (u32) A;
- do_div(tmp, (u32) K);
- A = (u32) tmp;
- CONT;
- ALU_END_TO_BE:
- switch (K) {
- case 16:
- A = (__force u16) cpu_to_be16(A);
- break;
- case 32:
- A = (__force u32) cpu_to_be32(A);
- break;
- case 64:
- A = (__force u64) cpu_to_be64(A);
- break;
- }
- CONT;
- ALU_END_TO_LE:
- switch (K) {
- case 16:
- A = (__force u16) cpu_to_le16(A);
- break;
- case 32:
- A = (__force u32) cpu_to_le32(A);
- break;
- case 64:
- A = (__force u64) cpu_to_le64(A);
- break;
- }
- CONT;
-
- /* CALL */
- JMP_CALL:
- /* Function call scratches BPF_R1-BPF_R5 registers,
- * preserves BPF_R6-BPF_R9, and stores return value
- * into BPF_R0.
- */
- BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
- BPF_R4, BPF_R5);
- CONT;
-
- /* JMP */
- JMP_JA:
- insn += insn->off;
- CONT;
- JMP_JEQ_X:
- if (A == X) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JEQ_K:
- if (A == K) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JNE_X:
- if (A != X) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JNE_K:
- if (A != K) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGT_X:
- if (A > X) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGT_K:
- if (A > K) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGE_X:
- if (A >= X) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JGE_K:
- if (A >= K) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGT_X:
- if (((s64) A) > ((s64) X)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGT_K:
- if (((s64) A) > ((s64) K)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGE_X:
- if (((s64) A) >= ((s64) X)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSGE_K:
- if (((s64) A) >= ((s64) K)) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSET_X:
- if (A & X) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_JSET_K:
- if (A & K) {
- insn += insn->off;
- CONT_JMP;
- }
- CONT;
- JMP_EXIT:
- return BPF_R0;
-
- /* STX and ST and LDX*/
-#define LDST(SIZEOP, SIZE) \
- STX_MEM_##SIZEOP: \
- *(SIZE *)(unsigned long) (A + insn->off) = X; \
- CONT; \
- ST_MEM_##SIZEOP: \
- *(SIZE *)(unsigned long) (A + insn->off) = K; \
- CONT; \
- LDX_MEM_##SIZEOP: \
- A = *(SIZE *)(unsigned long) (X + insn->off); \
- CONT;
-
- LDST(B, u8)
- LDST(H, u16)
- LDST(W, u32)
- LDST(DW, u64)
-#undef LDST
- STX_XADD_W: /* lock xadd *(u32 *)(A + insn->off) += X */
- atomic_add((u32) X, (atomic_t *)(unsigned long)
- (A + insn->off));
- CONT;
- STX_XADD_DW: /* lock xadd *(u64 *)(A + insn->off) += X */
- atomic64_add((u64) X, (atomic64_t *)(unsigned long)
- (A + insn->off));
- CONT;
- LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + K)) */
- off = K;
-load_word:
- /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
- * only appearing in the programs where ctx ==
- * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
- * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
- * internal BPF verifier will check that BPF_R6 ==
- * ctx.
- *
- * BPF_ABS and BPF_IND are wrappers of function calls,
- * so they scratch BPF_R1-BPF_R5 registers, preserve
- * BPF_R6-BPF_R9, and store return value into BPF_R0.
- *
- * Implicit input:
- * ctx
- *
- * Explicit input:
- * X == any register
- * K == 32-bit immediate
- *
- * Output:
- * BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
- */
-
- ptr = load_pointer((struct sk_buff *) ctx, off, 4, &tmp);
- if (likely(ptr != NULL)) {
- BPF_R0 = get_unaligned_be32(ptr);
- CONT;
- }
-
- return 0;
- LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + K)) */
- off = K;
-load_half:
- ptr = load_pointer((struct sk_buff *) ctx, off, 2, &tmp);
- if (likely(ptr != NULL)) {
- BPF_R0 = get_unaligned_be16(ptr);
- CONT;
- }
-
- return 0;
- LD_ABS_B: /* BPF_R0 = *(u8 *) (ctx + K) */
- off = K;
-load_byte:
- ptr = load_pointer((struct sk_buff *) ctx, off, 1, &tmp);
- if (likely(ptr != NULL)) {
- BPF_R0 = *(u8 *)ptr;
- CONT;
- }
-
- return 0;
- LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + X + K)) */
- off = K + X;
- goto load_word;
- LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + X + K)) */
- off = K + X;
- goto load_half;
- LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + X + K) */
- off = K + X;
- goto load_byte;
-
- default_label:
- /* If we ever reach this, we have a bug somewhere. */
- WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
- return 0;
-}
-
/* Helper to find the offset of pkt_type in sk_buff structure. We want
* to make sure its still a 3bit field starting at a byte boundary;
* taken from arch/x86/net/bpf_jit_comp.c.
@@ -662,8 +169,8 @@ static u64 __get_random_u32(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
return prandom_u32();
}

-static bool convert_bpf_extensions(struct sock_filter *fp,
- struct sock_filter_int **insnp)
+bool sk_convert_bpf_extensions(struct sock_filter *fp,
+ struct sock_filter_int **insnp)
{
struct sock_filter_int *insn = *insnp;

@@ -796,505 +303,6 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
return true;
}

-/**
- * sk_convert_filter - convert filter program
- * @prog: the user passed filter program
- * @len: the length of the user passed filter program
- * @new_prog: buffer where converted program will be stored
- * @new_len: pointer to store length of converted program
- *
- * Remap 'sock_filter' style BPF instruction set to 'sock_filter_ext' style.
- * Conversion workflow:
- *
- * 1) First pass for calculating the new program length:
- * sk_convert_filter(old_prog, old_len, NULL, &new_len)
- *
- * 2) 2nd pass to remap in two passes: 1st pass finds new
- * jump offsets, 2nd pass remapping:
- * new_prog = kmalloc(sizeof(struct sock_filter_int) * new_len);
- * sk_convert_filter(old_prog, old_len, new_prog, &new_len);
- *
- * User BPF's register A is mapped to our BPF register 6, user BPF
- * register X is mapped to BPF register 7; frame pointer is always
- * register 10; Context 'void *ctx' is stored in register 1, that is,
- * for socket filters: ctx == 'struct sk_buff *', for seccomp:
- * ctx == 'struct seccomp_data *'.
- */
-int sk_convert_filter(struct sock_filter *prog, int len,
- struct sock_filter_int *new_prog, int *new_len)
-{
- int new_flen = 0, pass = 0, target, i;
- struct sock_filter_int *new_insn;
- struct sock_filter *fp;
- int *addrs = NULL;
- u8 bpf_src;
-
- BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK);
- BUILD_BUG_ON(BPF_REG_FP + 1 != MAX_BPF_REG);
-
- if (len <= 0 || len >= BPF_MAXINSNS)
- return -EINVAL;
-
- if (new_prog) {
- addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
- if (!addrs)
- return -ENOMEM;
- }
-
-do_pass:
- new_insn = new_prog;
- fp = prog;
-
- if (new_insn)
- *new_insn = BPF_MOV64_REG(BPF_REG_CTX, BPF_REG_ARG1);
- new_insn++;
-
- for (i = 0; i < len; fp++, i++) {
- struct sock_filter_int tmp_insns[6] = { };
- struct sock_filter_int *insn = tmp_insns;
-
- if (addrs)
- addrs[i] = new_insn - new_prog;
-
- switch (fp->code) {
- /* All arithmetic insns and skb loads map as-is. */
- case BPF_ALU | BPF_ADD | BPF_X:
- case BPF_ALU | BPF_ADD | BPF_K:
- case BPF_ALU | BPF_SUB | BPF_X:
- case BPF_ALU | BPF_SUB | BPF_K:
- case BPF_ALU | BPF_AND | BPF_X:
- case BPF_ALU | BPF_AND | BPF_K:
- case BPF_ALU | BPF_OR | BPF_X:
- case BPF_ALU | BPF_OR | BPF_K:
- case BPF_ALU | BPF_LSH | BPF_X:
- case BPF_ALU | BPF_LSH | BPF_K:
- case BPF_ALU | BPF_RSH | BPF_X:
- case BPF_ALU | BPF_RSH | BPF_K:
- case BPF_ALU | BPF_XOR | BPF_X:
- case BPF_ALU | BPF_XOR | BPF_K:
- case BPF_ALU | BPF_MUL | BPF_X:
- case BPF_ALU | BPF_MUL | BPF_K:
- case BPF_ALU | BPF_DIV | BPF_X:
- case BPF_ALU | BPF_DIV | BPF_K:
- case BPF_ALU | BPF_MOD | BPF_X:
- case BPF_ALU | BPF_MOD | BPF_K:
- case BPF_ALU | BPF_NEG:
- case BPF_LD | BPF_ABS | BPF_W:
- case BPF_LD | BPF_ABS | BPF_H:
- case BPF_LD | BPF_ABS | BPF_B:
- case BPF_LD | BPF_IND | BPF_W:
- case BPF_LD | BPF_IND | BPF_H:
- case BPF_LD | BPF_IND | BPF_B:
- /* Check for overloaded BPF extension and
- * directly convert it if found, otherwise
- * just move on with mapping.
- */
- if (BPF_CLASS(fp->code) == BPF_LD &&
- BPF_MODE(fp->code) == BPF_ABS &&
- convert_bpf_extensions(fp, &insn))
- break;
-
- *insn = BPF_RAW_INSN(fp->code, BPF_REG_A, BPF_REG_X, 0, fp->k);
- break;
-
- /* Jump transformation cannot use BPF block macros
- * everywhere as offset calculation and target updates
- * require a bit more work than the rest, i.e. jump
- * opcodes map as-is, but offsets need adjustment.
- */
-
-#define BPF_EMIT_JMP \
- do { \
- if (target >= len || target < 0) \
- goto err; \
- insn->off = addrs ? addrs[target] - addrs[i] - 1 : 0; \
- /* Adjust pc relative offset for 2nd or 3rd insn. */ \
- insn->off -= insn - tmp_insns; \
- } while (0)
-
- case BPF_JMP | BPF_JA:
- target = i + fp->k + 1;
- insn->code = fp->code;
- BPF_EMIT_JMP;
- break;
-
- case BPF_JMP | BPF_JEQ | BPF_K:
- case BPF_JMP | BPF_JEQ | BPF_X:
- case BPF_JMP | BPF_JSET | BPF_K:
- case BPF_JMP | BPF_JSET | BPF_X:
- case BPF_JMP | BPF_JGT | BPF_K:
- case BPF_JMP | BPF_JGT | BPF_X:
- case BPF_JMP | BPF_JGE | BPF_K:
- case BPF_JMP | BPF_JGE | BPF_X:
- if (BPF_SRC(fp->code) == BPF_K && (int) fp->k < 0) {
- /* BPF immediates are signed, zero extend
- * immediate into tmp register and use it
- * in compare insn.
- */
- *insn++ = BPF_MOV32_IMM(BPF_REG_TMP, fp->k);
-
- insn->a_reg = BPF_REG_A;
- insn->x_reg = BPF_REG_TMP;
- bpf_src = BPF_X;
- } else {
- insn->a_reg = BPF_REG_A;
- insn->x_reg = BPF_REG_X;
- insn->imm = fp->k;
- bpf_src = BPF_SRC(fp->code);
- }
-
- /* Common case where 'jump_false' is next insn. */
- if (fp->jf == 0) {
- insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
- target = i + fp->jt + 1;
- BPF_EMIT_JMP;
- break;
- }
-
- /* Convert JEQ into JNE when 'jump_true' is next insn. */
- if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
- insn->code = BPF_JMP | BPF_JNE | bpf_src;
- target = i + fp->jf + 1;
- BPF_EMIT_JMP;
- break;
- }
-
- /* Other jumps are mapped into two insns: Jxx and JA. */
- target = i + fp->jt + 1;
- insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
- BPF_EMIT_JMP;
- insn++;
-
- insn->code = BPF_JMP | BPF_JA;
- target = i + fp->jf + 1;
- BPF_EMIT_JMP;
- break;
-
- /* ldxb 4 * ([14] & 0xf) is remaped into 6 insns. */
- case BPF_LDX | BPF_MSH | BPF_B:
- /* tmp = A */
- *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_A);
- /* A = BPF_R0 = *(u8 *) (skb->data + K) */
- *insn++ = BPF_LD_ABS(BPF_B, fp->k);
- /* A &= 0xf */
- *insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_A, 0xf);
- /* A <<= 2 */
- *insn++ = BPF_ALU32_IMM(BPF_LSH, BPF_REG_A, 2);
- /* X = A */
- *insn++ = BPF_MOV64_REG(BPF_REG_X, BPF_REG_A);
- /* A = tmp */
- *insn = BPF_MOV64_REG(BPF_REG_A, BPF_REG_TMP);
- break;
-
- /* RET_K, RET_A are remaped into 2 insns. */
- case BPF_RET | BPF_A:
- case BPF_RET | BPF_K:
- *insn++ = BPF_MOV32_RAW(BPF_RVAL(fp->code) == BPF_K ?
- BPF_K : BPF_X, BPF_REG_0,
- BPF_REG_A, fp->k);
- *insn = BPF_EXIT_INSN();
- break;
-
- /* Store to stack. */
- case BPF_ST:
- case BPF_STX:
- *insn = BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_CLASS(fp->code) ==
- BPF_ST ? BPF_REG_A : BPF_REG_X,
- -(BPF_MEMWORDS - fp->k) * 4);
- break;
-
- /* Load from stack. */
- case BPF_LD | BPF_MEM:
- case BPF_LDX | BPF_MEM:
- *insn = BPF_LDX_MEM(BPF_W, BPF_CLASS(fp->code) == BPF_LD ?
- BPF_REG_A : BPF_REG_X, BPF_REG_FP,
- -(BPF_MEMWORDS - fp->k) * 4);
- break;
-
- /* A = K or X = K */
- case BPF_LD | BPF_IMM:
- case BPF_LDX | BPF_IMM:
- *insn = BPF_MOV32_IMM(BPF_CLASS(fp->code) == BPF_LD ?
- BPF_REG_A : BPF_REG_X, fp->k);
- break;
-
- /* X = A */
- case BPF_MISC | BPF_TAX:
- *insn = BPF_MOV64_REG(BPF_REG_X, BPF_REG_A);
- break;
-
- /* A = X */
- case BPF_MISC | BPF_TXA:
- *insn = BPF_MOV64_REG(BPF_REG_A, BPF_REG_X);
- break;
-
- /* A = skb->len or X = skb->len */
- case BPF_LD | BPF_W | BPF_LEN:
- case BPF_LDX | BPF_W | BPF_LEN:
- *insn = BPF_LDX_MEM(BPF_W, BPF_CLASS(fp->code) == BPF_LD ?
- BPF_REG_A : BPF_REG_X, BPF_REG_CTX,
- offsetof(struct sk_buff, len));
- break;
-
- /* Access seccomp_data fields. */
- case BPF_LDX | BPF_ABS | BPF_W:
- /* A = *(u32 *) (ctx + K) */
- *insn = BPF_LDX_MEM(BPF_W, BPF_REG_A, BPF_REG_CTX, fp->k);
- break;
-
- /* Unkown instruction. */
- default:
- goto err;
- }
-
- insn++;
- if (new_prog)
- memcpy(new_insn, tmp_insns,
- sizeof(*insn) * (insn - tmp_insns));
- new_insn += insn - tmp_insns;
- }
-
- if (!new_prog) {
- /* Only calculating new length. */
- *new_len = new_insn - new_prog;
- return 0;
- }
-
- pass++;
- if (new_flen != new_insn - new_prog) {
- new_flen = new_insn - new_prog;
- if (pass > 2)
- goto err;
- goto do_pass;
- }
-
- kfree(addrs);
- BUG_ON(*new_len != new_flen);
- return 0;
-err:
- kfree(addrs);
- return -EINVAL;
-}
-
-/* Security:
- *
- * A BPF program is able to use 16 cells of memory to store intermediate
- * values (check u32 mem[BPF_MEMWORDS] in sk_run_filter()).
- *
- * As we dont want to clear mem[] array for each packet going through
- * sk_run_filter(), we check that filter loaded by user never try to read
- * a cell if not previously written, and we check all branches to be sure
- * a malicious user doesn't try to abuse us.
- */
-static int check_load_and_stores(struct sock_filter *filter, int flen)
-{
- u16 *masks, memvalid = 0; /* One bit per cell, 16 cells */
- int pc, ret = 0;
-
- BUILD_BUG_ON(BPF_MEMWORDS > 16);
-
- masks = kmalloc(flen * sizeof(*masks), GFP_KERNEL);
- if (!masks)
- return -ENOMEM;
-
- memset(masks, 0xff, flen * sizeof(*masks));
-
- for (pc = 0; pc < flen; pc++) {
- memvalid &= masks[pc];
-
- switch (filter[pc].code) {
- case BPF_ST:
- case BPF_STX:
- memvalid |= (1 << filter[pc].k);
- break;
- case BPF_LD | BPF_MEM:
- case BPF_LDX | BPF_MEM:
- if (!(memvalid & (1 << filter[pc].k))) {
- ret = -EINVAL;
- goto error;
- }
- break;
- case BPF_JMP | BPF_JA:
- /* A jump must set masks on target */
- masks[pc + 1 + filter[pc].k] &= memvalid;
- memvalid = ~0;
- break;
- case BPF_JMP | BPF_JEQ | BPF_K:
- case BPF_JMP | BPF_JEQ | BPF_X:
- case BPF_JMP | BPF_JGE | BPF_K:
- case BPF_JMP | BPF_JGE | BPF_X:
- case BPF_JMP | BPF_JGT | BPF_K:
- case BPF_JMP | BPF_JGT | BPF_X:
- case BPF_JMP | BPF_JSET | BPF_K:
- case BPF_JMP | BPF_JSET | BPF_X:
- /* A jump must set masks on targets */
- masks[pc + 1 + filter[pc].jt] &= memvalid;
- masks[pc + 1 + filter[pc].jf] &= memvalid;
- memvalid = ~0;
- break;
- }
- }
-error:
- kfree(masks);
- return ret;
-}
-
-static bool chk_code_allowed(u16 code_to_probe)
-{
- static const bool codes[] = {
- /* 32 bit ALU operations */
- [BPF_ALU | BPF_ADD | BPF_K] = true,
- [BPF_ALU | BPF_ADD | BPF_X] = true,
- [BPF_ALU | BPF_SUB | BPF_K] = true,
- [BPF_ALU | BPF_SUB | BPF_X] = true,
- [BPF_ALU | BPF_MUL | BPF_K] = true,
- [BPF_ALU | BPF_MUL | BPF_X] = true,
- [BPF_ALU | BPF_DIV | BPF_K] = true,
- [BPF_ALU | BPF_DIV | BPF_X] = true,
- [BPF_ALU | BPF_MOD | BPF_K] = true,
- [BPF_ALU | BPF_MOD | BPF_X] = true,
- [BPF_ALU | BPF_AND | BPF_K] = true,
- [BPF_ALU | BPF_AND | BPF_X] = true,
- [BPF_ALU | BPF_OR | BPF_K] = true,
- [BPF_ALU | BPF_OR | BPF_X] = true,
- [BPF_ALU | BPF_XOR | BPF_K] = true,
- [BPF_ALU | BPF_XOR | BPF_X] = true,
- [BPF_ALU | BPF_LSH | BPF_K] = true,
- [BPF_ALU | BPF_LSH | BPF_X] = true,
- [BPF_ALU | BPF_RSH | BPF_K] = true,
- [BPF_ALU | BPF_RSH | BPF_X] = true,
- [BPF_ALU | BPF_NEG] = true,
- /* Load instructions */
- [BPF_LD | BPF_W | BPF_ABS] = true,
- [BPF_LD | BPF_H | BPF_ABS] = true,
- [BPF_LD | BPF_B | BPF_ABS] = true,
- [BPF_LD | BPF_W | BPF_LEN] = true,
- [BPF_LD | BPF_W | BPF_IND] = true,
- [BPF_LD | BPF_H | BPF_IND] = true,
- [BPF_LD | BPF_B | BPF_IND] = true,
- [BPF_LD | BPF_IMM] = true,
- [BPF_LD | BPF_MEM] = true,
- [BPF_LDX | BPF_W | BPF_LEN] = true,
- [BPF_LDX | BPF_B | BPF_MSH] = true,
- [BPF_LDX | BPF_IMM] = true,
- [BPF_LDX | BPF_MEM] = true,
- /* Store instructions */
- [BPF_ST] = true,
- [BPF_STX] = true,
- /* Misc instructions */
- [BPF_MISC | BPF_TAX] = true,
- [BPF_MISC | BPF_TXA] = true,
- /* Return instructions */
- [BPF_RET | BPF_K] = true,
- [BPF_RET | BPF_A] = true,
- /* Jump instructions */
- [BPF_JMP | BPF_JA] = true,
- [BPF_JMP | BPF_JEQ | BPF_K] = true,
- [BPF_JMP | BPF_JEQ | BPF_X] = true,
- [BPF_JMP | BPF_JGE | BPF_K] = true,
- [BPF_JMP | BPF_JGE | BPF_X] = true,
- [BPF_JMP | BPF_JGT | BPF_K] = true,
- [BPF_JMP | BPF_JGT | BPF_X] = true,
- [BPF_JMP | BPF_JSET | BPF_K] = true,
- [BPF_JMP | BPF_JSET | BPF_X] = true,
- };
-
- if (code_to_probe >= ARRAY_SIZE(codes))
- return false;
-
- return codes[code_to_probe];
-}
-
-/**
- * sk_chk_filter - verify socket filter code
- * @filter: filter to verify
- * @flen: length of filter
- *
- * Check the user's filter code. If we let some ugly
- * filter code slip through kaboom! The filter must contain
- * no references or jumps that are out of range, no illegal
- * instructions, and must end with a RET instruction.
- *
- * All jumps are forward as they are not signed.
- *
- * Returns 0 if the rule set is legal or -EINVAL if not.
- */
-int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
-{
- bool anc_found;
- int pc;
-
- if (flen == 0 || flen > BPF_MAXINSNS)
- return -EINVAL;
-
- /* Check the filter code now */
- for (pc = 0; pc < flen; pc++) {
- struct sock_filter *ftest = &filter[pc];
-
- /* May we actually operate on this code? */
- if (!chk_code_allowed(ftest->code))
- return -EINVAL;
-
- /* Some instructions need special checks */
- switch (ftest->code) {
- case BPF_ALU | BPF_DIV | BPF_K:
- case BPF_ALU | BPF_MOD | BPF_K:
- /* Check for division by zero */
- if (ftest->k == 0)
- return -EINVAL;
- break;
- case BPF_LD | BPF_MEM:
- case BPF_LDX | BPF_MEM:
- case BPF_ST:
- case BPF_STX:
- /* Check for invalid memory addresses */
- if (ftest->k >= BPF_MEMWORDS)
- return -EINVAL;
- break;
- case BPF_JMP | BPF_JA:
- /* Note, the large ftest->k might cause loops.
- * Compare this with conditional jumps below,
- * where offsets are limited. --ANK (981016)
- */
- if (ftest->k >= (unsigned int)(flen - pc - 1))
- return -EINVAL;
- break;
- case BPF_JMP | BPF_JEQ | BPF_K:
- case BPF_JMP | BPF_JEQ | BPF_X:
- case BPF_JMP | BPF_JGE | BPF_K:
- case BPF_JMP | BPF_JGE | BPF_X:
- case BPF_JMP | BPF_JGT | BPF_K:
- case BPF_JMP | BPF_JGT | BPF_X:
- case BPF_JMP | BPF_JSET | BPF_K:
- case BPF_JMP | BPF_JSET | BPF_X:
- /* Both conditionals must be safe */
- if (pc + ftest->jt + 1 >= flen ||
- pc + ftest->jf + 1 >= flen)
- return -EINVAL;
- break;
- case BPF_LD | BPF_W | BPF_ABS:
- case BPF_LD | BPF_H | BPF_ABS:
- case BPF_LD | BPF_B | BPF_ABS:
- anc_found = false;
- if (bpf_anc_helper(ftest) & BPF_ANC)
- anc_found = true;
- /* Ancillary operation unknown or unsupported */
- if (anc_found == false && ftest->k >= SKF_AD_OFF)
- return -EINVAL;
- }
- }
-
- /* Last instruction must be a RET code */
- switch (filter[flen - 1].code) {
- case BPF_RET | BPF_K:
- case BPF_RET | BPF_A:
- return check_load_and_stores(filter, flen);
- }
-
- return -EINVAL;
-}
-EXPORT_SYMBOL(sk_chk_filter);
-
static int sk_store_orig_filter(struct sk_filter *fp,
const struct sock_fprog *fprog)
{
@@ -1456,33 +464,6 @@ out_err:
return ERR_PTR(err);
}

-void __weak bpf_int_jit_compile(struct sk_filter *prog)
-{
-}
-
-/**
- * sk_filter_select_runtime - select execution runtime for BPF program
- * @fp: sk_filter populated with internal BPF program
- *
- * try to JIT internal BPF program, if JIT is not available select interpreter
- * BPF program will be executed via SK_RUN_FILTER() macro
- */
-void sk_filter_select_runtime(struct sk_filter *fp)
-{
- fp->bpf_func = (void *) __sk_run_filter;
-
- /* Probe if internal BPF can be JITed */
- bpf_int_jit_compile(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
-
-/* free internal BPF program */
-void sk_filter_free(struct sk_filter *fp)
-{
- bpf_jit_free(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_free);
-
static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
struct sock *sk)
{
--
1.7.9.5

2014-06-02 07:02:27

by Alexei Starovoitov

[permalink] [raw]
Subject: [PATCH v2 net-next 2/2] net: filter: split BPF out of core networking

seccomp selects BPF only instead of whole NET
Other BPF users (like tracing filters) will select BPF only too

Signed-off-by: Alexei Starovoitov <[email protected]>
---
arch/Kconfig | 6 +++++-
kernel/Makefile | 2 +-
kernel/bpf/core.c | 21 +++++++++++++++++++++
net/Kconfig | 1 +
4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 97ff872c7acc..d60637a29ea0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -324,7 +324,8 @@ config HAVE_ARCH_SECCOMP_FILTER

config SECCOMP_FILTER
def_bool y
- depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
+ depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP
+ select BPF
help
Enable tasks to build secure computing environments defined
in terms of Berkeley Packet Filter programs which implement
@@ -332,6 +333,9 @@ config SECCOMP_FILTER

See Documentation/prctl/seccomp_filter.txt for details.

+config BPF
+ boolean
+
config HAVE_CC_STACKPROTECTOR
bool
help
diff --git a/kernel/Makefile b/kernel/Makefile
index e7360b7c2c0e..d5d7d0c18f36 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,7 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
-obj-$(CONFIG_NET) += bpf/
+obj-$(CONFIG_BPF) += bpf/

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 22c2d99414c0..8ca1b37ddc28 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1040,3 +1040,24 @@ void sk_filter_free(struct sk_filter *fp)
bpf_jit_free(fp);
}
EXPORT_SYMBOL_GPL(sk_filter_free);
+
+/* kernel configuration that do not enable NET are not using
+ * classic BPF extensions
+ */
+bool __weak sk_convert_bpf_extensions(struct sock_filter *fp,
+ struct sock_filter_int **insnp)
+{
+ return false;
+}
+
+/* To emulate LD_ABS/LD_IND instructions __sk_run_filter() may call
+ * skb_copy_bits(), so provide a weak definition for it in NET-less config.
+ * seccomp_check_filter() verifies that seccomp filters are not using
+ * LD_ABS/LD_IND instructions. Other BPF users (like tracing filters)
+ * must not use these instructions unless ctx==skb
+ */
+int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to,
+ int len)
+{
+ return -EFAULT;
+}
diff --git a/net/Kconfig b/net/Kconfig
index d92afe4204d9..a9582656856b 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -6,6 +6,7 @@ menuconfig NET
bool "Networking support"
select NLATTR
select GENERIC_NET_UTILS
+ select BPF
---help---
Unless you really know what you are doing, you should say Y here.
The reason is that some programs need kernel networking support even
--
1.7.9.5

2014-06-02 08:58:47

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On 06/02/2014 09:01 AM, Alexei Starovoitov wrote:
> This patch set splits BPF out of core networking into generic component
>
> patch #1 splits filter.c into two logical pieces: generic BPF core and socket
> filters. It only moves functions around. No real changes.
>
> patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select
>
> The main value of the patch is not a NET separation, but rather logical boundary
> between generic BPF core and socket filtering. All socket specific code stays in
> net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure (both
> classic and internal).
>
> Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot use
> BPF JITs yet. This can be cleaned up in the future. Also it seems to makes sense
> to split up filter.h into generic and socket specific as well to cleanup the
> boundary further.

Hm, I really don't like this 'ripping code and headers apart' only to then believe
it's a generic abstraction. So far seccomp-BPF has been able to live with the current
state since it was introduced, and the rest of the users (the vast majority) are in the
networking domain (and invoked through tcpdump et al) ...

There are still parts in seccomp that show some BPF weaknesses in terms of being
'generic': for example, in seccomp we need to go over the filter instructions once
again after doing the usual filter sanity checks, just to whitelist what seccomp
may do in BPF.

I have not yet thought about it deeply enough, but I think we should avoid
something similar in other non-networking areas and instead first abstract that
cleanly, without such hacks.

> Tested with several NET and NET-less configs on arm and x86
>
> V1->V2:
> rebase on top of net-next
> split filter.c into kernel/bpf/core.c instead of net/bpf/core.c
>
> Alexei Starovoitov (2):
> net: filter: split filter.c into two files
> net: filter: split BPF out of core networking
>
> arch/Kconfig | 6 +-
> include/linux/filter.h | 2 +
> kernel/Makefile | 1 +
> kernel/bpf/Makefile | 5 +
> kernel/bpf/core.c | 1063 ++++++++++++++++++++++++++++++++++++++++++++++++
> net/Kconfig | 1 +
> net/core/filter.c | 1023 +---------------------------------------------
> 7 files changed, 1079 insertions(+), 1022 deletions(-)
> create mode 100644 kernel/bpf/Makefile
> create mode 100644 kernel/bpf/core.c
>

2014-06-02 13:15:50

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Mon, 2 Jun 2014 00:01:44 -0700
Alexei Starovoitov <[email protected]> wrote:

> This patch set splits BPF out of core networking into generic component

Quick, probably dumb question: if you're going to split it out, why not
split it out entirely, into kernel/ or (perhaps better) lib/? The
whole point seems to be that BPF is outgrowing its networking home, so
it seems like it might be better to make it truly generic.

jon

2014-06-02 13:25:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Mon, 2 Jun 2014 08:15:45 -0500
Jonathan Corbet <[email protected]> wrote:

> On Mon, 2 Jun 2014 00:01:44 -0700
> Alexei Starovoitov <[email protected]> wrote:
>
> > This patch set splits BPF out of core networking into generic component
>
> Quick, probably dumb question: if you're going to split it out, why not
> split it out entirely, into kernel/ or (perhaps better) lib/? The
> whole point seems to be that BPF is outgrowing its networking home, so
> it seems like it might be better to make it truly generic.

I believe this is what Ingo suggested as well. If it is to become generic,
it belongs in lib/

-- Steve

2014-06-02 14:16:13

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
> On Mon, 2 Jun 2014 08:15:45 -0500
> Jonathan Corbet <[email protected]> wrote:

> > On Mon, 2 Jun 2014 00:01:44 -0700
> > Alexei Starovoitov <[email protected]> wrote:

> > > This patch set splits BPF out of core networking into generic component

> > Quick, probably dumb question: if you're going to split it out, why not
> > split it out entirely, into kernel/ or (perhaps better) lib/? The
> > whole point seems to be that BPF is outgrowing its networking home, so
> > it seems like it might be better to make it truly generic.

> I believe this is what Ingo suggested as well. If it is become generic,
> it belongs in lib/

Yes, that was his suggestion, which I agree with, FWIW.

- Arnaldo

2014-06-02 14:57:39

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
>> On Mon, 2 Jun 2014 08:15:45 -0500
>> Jonathan Corbet <[email protected]> wrote:
>
>> > On Mon, 2 Jun 2014 00:01:44 -0700
>> > Alexei Starovoitov <[email protected]> wrote:
>
>> > > This patch set splits BPF out of core networking into generic component
>
>> > Quick, probably dumb question: if you're going to split it out, why not
>> > split it out entirely, into kernel/ or (perhaps better) lib/? The
>> > whole point seems to be that BPF is outgrowing its networking home, so
>> > it seems like it might be better to make it truly generic.
>
>> I believe this is what Ingo suggested as well. If it is become generic,
>> it belongs in lib/
>
> Yes, that was his suggestion, which I agree with, FWIW.

I guess I posted v2 too quickly :) v2 splits filter.c into kernel/bpf/.
I think it's a better location than lib/bpf, since lib/ feels too constrained
by the definition of 'library'. BPF is more than a set of library calls.

2014-06-02 15:41:27

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Mon, Jun 2, 2014 at 1:57 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/02/2014 09:01 AM, Alexei Starovoitov wrote:
>>
>> This patch set splits BPF out of core networking into generic component
>>
>> patch #1 splits filter.c into two logical pieces: generic BPF core and
>> socket
>> filters. It only moves functions around. No real changes.
>>
>> patch #2 adds hidden CONFIG_BPF that seccomp/tracing can select
>>
>> The main value of the patch is not a NET separation, but rather logical
>> boundary
>> between generic BPF core and socket filtering. All socket specific code
>> stays in
>> net/core/filter.c and kernel/bpf/core.c is for generic BPF infrastructure
>> (both
>> classic and internal).
>>
>> Note that CONFIG_BPF_JIT is still under NET, so NET-less configs cannot
>> use
>> BPF JITs yet. This can be cleaned up in the future. Also it seems to makes
>> sense
>> to split up filter.h into generic and socket specific as well to cleanup
>> the
>> boundary further.
>
>
> Hm, I really don't like that 'ripping code and headers apart' and then we
> believe
> it's a generic abstraction. So far seccomp-BPF could live with the current
> state
> since it was introduced, the rest of users (vast majority) is in the
> networking
> domain (and invoked through tcpdump et al) ...
>
> There are still parts in seccomp that show some BPF weaknesses in terms of
> being
> 'generic', for example shown in seccomp, we need to go once again over the
> filter
> instructions after doing the usual filter sanity checks, just to whitelist
> what
> seccomp may do in BPF.
>
> I have not yet thought about it deeply enough, but I think we should avoid
> something similar in other non-networking areas but abstract that cleanly
> w/o
> such hacks first, for example.

Glad you brought up this point :)
100% agree that the current double verification done by seccomp is far from
generic and quite hard to maintain, since any change to the classic BPF verifier
needs to be thought through from the seccomp_check_filter() perspective as well.
IMO the lack of generality in classic BPF is the main reason why we should stop
adding extensions to classic and switch to eBPF for any new features.
The eBPF verifier I posted a while ago tries to be generic through customization:
the verifier core needs to stay independent of the use case, while BPF's input
context and its set of allowed calls need to be expressed in a generic way.
Obviously this split by itself won't suddenly make classic BPF generic.
It rather defines the boundary of the eBPF core.
In eBPF only two instructions are not generic: LD_ABS/LD_IND, legacy
instructions that we had to carry over from classic. They require input
context == sk_buff. That's why core.c had to #include <skbuff.h> and do
'__weak skb_copy_bits()'.
The alternative was to #ifdef these two instructions out of the interpreter
and make the #include <skbuff.h> and the ld_abs helper functions in core.c
conditional on NET. IMO that would have been ugly for code style, maintenance
and testing, though core.c would then have had only one #include <filter.h>
and we could say: 'look, eBPF core.c is really generic'.

In the next set of patches I'll repost the verifier and explain how a single
eBPF verifier core can be used for sockets, seccomp, tracing and other
things. Note I'm not saying that we should use eBPF everywhere right now.
Classic BPF has its niche, and that niche we have to maintain forever.
So let's make sure that the eBPF interpreter, its instruction set and its
verifier stay generic.
This split is only the first step in that direction: it creates a file boundary
between the eBPF core and sockets.

>
>> Tested with several NET and NET-less configs on arm and x86
>>
>> V1->V2:
>> rebase on top of net-next
>> split filter.c into kernel/bpf/core.c instead of net/bpf/core.c
>>
>> Alexei Starovoitov (2):
>> net: filter: split filter.c into two files
>> net: filter: split BPF out of core networking
>>
>> arch/Kconfig | 6 +-
>> include/linux/filter.h | 2 +
>> kernel/Makefile | 1 +
>> kernel/bpf/Makefile | 5 +
>> kernel/bpf/core.c | 1063
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>> net/Kconfig | 1 +
>> net/core/filter.c | 1023
>> +---------------------------------------------
>> 7 files changed, 1079 insertions(+), 1022 deletions(-)
>> create mode 100644 kernel/bpf/Makefile
>> create mode 100644 kernel/bpf/core.c
>>
>

2014-06-02 17:05:28

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On 06/02/2014 05:41 PM, Alexei Starovoitov wrote:
...
> Glad you brought up this point :)
> 100% agree that current double verification done by seccomp is far from
> being generic and quite hard to maintain, since any change done to
> classic BPF verifier needs to be thought through from seccomp_check_filter()
> perspective as well.

Glad we're on the same page.

> BPF's input context, set of allowed calls need to be expressed in a generic way.
> Obviously this split by itself won't make classic BPF all of a sudden generic.
> It rather defines a boundary of eBPF core.

Note, I'm not at all against using it in tracing, I think it's probably
a good idea, but shouldn't we _first_ think about how to overcome deficits
like the above by improving its in-kernel API design, so that it is better
prepared to be generic? I feel this step is otherwise just skipped and
quickly 'hacked' around ... ;)

2014-06-02 19:02:11

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Mon, Jun 2, 2014 at 10:04 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/02/2014 05:41 PM, Alexei Starovoitov wrote:
> ...
>
>> Glad you brought up this point :)
>> 100% agree that current double verification done by seccomp is far from
>> being generic and quite hard to maintain, since any change done to
>> classic BPF verifier needs to be thought through from
>> seccomp_check_filter()
>> perspective as well.
>
>
> Glad we're on the same page.
>
>
>> BPF's input context, set of allowed calls need to be expressed in a
>> generic way.
>> Obviously this split by itself won't make classic BPF all of a sudden
>> generic.
>> It rather defines a boundary of eBPF core.
>
>
> Note, I'm not at all against using it in tracing, I think it's probably
> a good idea, but shouldn't we _first_ think about how to overcome such
> deficits as above by improving upon its in-kernel API design, thus to
> better prepare it to be generic? I feel this step is otherwise just
> skipped and quickly 'hacked' around ... ;)

Are you talking about a classic 'deficit' or an eBPF 'deficit'?
Classic has all sorts of hard coded assumptions. The whole
concept of 'load from magic constant' to mean different things
is flawed. We all got used to it and now think that it's normal
for "ld_abs -4056" to mean "a ^= x"
This split is not trying to make classic easier to hack.
With eBPF underneath classic, it got a lot easier to add extensions
to classic, but we shouldn't be doing it.
Classic BPF is not generic and cannot become one. It's eBPF's job.

The split is mainly helping to clearly see the boundary of eBPF core
vs its socket use case. It doesn't change or add any API.
We need to carefully design eBPF APIs when we expose it
to user space. I have a proposal for that too, but that's separate
discussion.
In terms of an in-kernel eBPF API there is nothing to be done.
An eBPF program 'prog' is generated by whatever means and then:
struct sk_filter *fp;

fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
fp->len = prog_len;

sk_filter_select_runtime(fp); // select interpreter or JIT
SK_RUN_FILTER(fp, ctx); // run the program
sk_filter_free(fp); // free program

that's how sockets, the test suite, seccomp and tracing do it.
All of them have different ways of producing 'prog' and 'prog_len'.
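For completeness, the same sequence as a self-contained sketch with error handling
added (this is only an illustration of the flow above, assuming <linux/filter.h> and
<linux/slab.h>; it is not a quote of any existing caller):

static int run_ebpf_prog(struct sock_filter_int *prog, unsigned int prog_len,
			 void *ctx)
{
	struct sk_filter *fp;
	int ret;

	fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
	if (!fp)
		return -ENOMEM;

	memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
	fp->len = prog_len;

	sk_filter_select_runtime(fp);	/* pick interpreter or JIT */
	ret = SK_RUN_FILTER(fp, ctx);	/* run the program against ctx */
	sk_filter_free(fp);		/* release interpreter/JIT image */

	return ret;
}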
This in-kernel API cleanup was done in commit 5fe821a9dee2
You even acked it back then :)

If you're referring to an in-kernel API for the eBPF verifier, then yeah, it's
missing, just like the whole eBPF verifier :)
Ideally any kernel component that generates eBPF on the fly
would send the eBPF program to the verifier first, just to double-check
that the generated program is valid.

2014-06-03 08:56:54

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On 06/02/2014 09:02 PM, Alexei Starovoitov wrote:
...
> Classic has all sorts of hard coded assumptions. The whole
> concept of 'load from magic constant' to mean different things
> is flawed. We all got used to it and now think that it's normal
> for "ld_abs -4056" to mean "a ^= x"

I think everyone knows that, no? Sure it doesn't fit into the
concept, but I think at the time BPF extensions were introduced,
it was probably seen as the best trade-off available to access
useful skb fields while still trying to minimize exposure to uapi
as much as possible.

> This split is not trying to make classic easier to hack.
> With eBPF underneath classic, it got a lot easier to add extensions
> to classic, but we shouldn't be doing it.
> Classic BPF is not generic and cannot become one. It's eBPF's job.
>
> The split is mainly helping to clearly see the boundary of eBPF core
> vs its socket use case. It doesn't change or add any API.

So what's the plan with everything in arch/*/net/, tools/net/ and
in Documentation/networking/filter.txt, plus MAINTAINERS file, that
the current patch doesn't address?

We want changes to go via [email protected] as they always
did, since [ although other use cases pop up ] the main user, as
I said, is simply still packet filtering in various networking
subsystems, no?

> This in-kernel API cleanup was done in commit 5fe821a9dee2
> You even acked it back then :)

I agreed with that change, otherwise I wouldn't have acked it,
of course.

2014-06-03 15:44:18

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Tue, Jun 3, 2014 at 1:56 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/02/2014 09:02 PM, Alexei Starovoitov wrote:
> ...
>>
>> Classic has all sorts of hard coded assumptions. The whole
>>
>> concept of 'load from magic constant' to mean different things
>> is flawed. We all got used to it and now think that it's normal
>> for "ld_abs -4056" to mean "a ^= x"
>
>
> I think everyone knows that, no? Sure it doesn't fit into the
> concept, but I think at the time BPF extensions were introduced,
> it was probably seen as the best trade-off available to access
> useful skb fields while still trying to minimize exposure to uapi
> as much as possible.

Exactly. It _was_ seen as the right trade-off in the past.
Now we have a lot more BPF users, so the considerations are different.

>> This split is not trying to make classic easier to hack.
>> With eBPF underneath classic, it got a lot easier to add extensions
>> to classic, but we shouldn't be doing it.
>> Classic BPF is not generic and cannot become one. It's eBPF's job.
>>
>> The split is mainly helping to clearly see the boundary of eBPF core
>> vs its socket use case. It doesn't change or add any API.
>
>
> So what's the plan with everything in arch/*/net/, tools/net/ and
> in Documentation/networking/filter.txt, plus MAINTAINERS file, that
> the current patch doesn't address?

I have a multi-year plan of action in the eBPF area, and as the past several
months have shown, I will have to adjust it many times based on community feedback.
The plan includes taking care of arch/*/net, but I'm not bringing it up
right now, since the filter.c split itself doesn't depend on what
we're going to do with the JITs in arch/*/net/.
As you saw, I mentioned JITs in the cover letter, so I obviously
thought about this before proposing the filter.c split.
I even have rough patches to take care of it, but let's not get ahead
of ourselves.
My plan also includes upstreaming the LLVM eBPF backend, but Linux
needs to expose eBPF to userspace first.
It includes an eBPF assembler to write programs like:
r1 = r5
*(u32 *) (fp - 10) = 0
call foo
if (r0 == 0) goto Label
The above is assembler. I don't like the current bpf_asm syntax,
since it's too assemblish; a C-looking assembler is easier to understand.
It also includes bpf maps, 'perf run filter.c' and all sorts of other things.
I cannot put the year-long plan in one email, since tl;dr kicks in.

The filter.c split is a tiny first step; the next step is a filter.h split.
Renaming arch/*/net/bpf_jit_comp.c to arch/*/bpf/jit_comp.c is
the least of my concerns. If the JITs stay with a strong dependency
on NET, that's also fine.
As I said in the cover letter, the filter.c split is not about the NET dependency.
Even tiny embedded systems rely on networking, so all real-world
.configs will include 'NET'. The split is about the logical separation
of eBPF vs sockets. Having them in one file just isn't doing any good,
since people jump into hacking things quickly without seeing
that eBPF is not only about sockets.

MAINTAINERS file is a good question too.
I would be happy to maintain bpf/ebpf, since it's my full time job anyway,
but again let's not jump the gun.

> We want changes to go via [email protected] as they always
> did, since [ although other use cases pop up ] the main user, as
> I said, is simply still packet filtering in various networking
> subsystems, no?

Obviously sockets are the main user, but not the only one, so I think
both lkml and netdev would need to be cc-ed in the future.
Or we can create a 'bpf' alias for anyone interested.

All of your points are valid. They are right questions to ask. I just
don't see why you're still arguing about first step of filter.c split,
whereas your concerns are about steps 2, 3, 4.

2014-06-03 18:16:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking


* Alexei Starovoitov <[email protected]> wrote:

> On Mon, Jun 2, 2014 at 7:16 AM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Em Mon, Jun 02, 2014 at 09:24:56AM -0400, Steven Rostedt escreveu:
> >> On Mon, 2 Jun 2014 08:15:45 -0500
> >> Jonathan Corbet <[email protected]> wrote:
> >
> >> > On Mon, 2 Jun 2014 00:01:44 -0700
> >> > Alexei Starovoitov <[email protected]> wrote:
> >
> >> > > This patch set splits BPF out of core networking into generic component
> >
> >> > Quick, probably dumb question: if you're going to split it out, why not
> >> > split it out entirely, into kernel/ or (perhaps better) lib/? The
> >> > whole point seems to be that BPF is outgrowing its networking home, so
> >> > it seems like it might be better to make it truly generic.
> >
> >> I believe this is what Ingo suggested as well. If it is become generic,
> >> it belongs in lib/
> >
> > Yes, that was his suggestion, which I agree with, FWIW.
>
> I guess I posted v2 too quickly :) v2 splits filter.c into
> kernel/bpf/. I think it's a better location than lib/bpf, since lib
> feels too constrained by definition of 'library'. bpf is more than a
> set of library calls.

Yeah, the upgrade to kernel/bpf/ is a better place for BPF IMO: BPF is
really an 'active', stateful subsystem, with non-trivial per arch
implementations, while lib/ is generally for standalone, generic,
platform-decoupled library functions (with a few exceptions).

Thanks,

Ingo

2014-06-03 20:36:31

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
...
> All of your points are valid. They are right questions to ask. I just
> don't see why you're still arguing about first step of filter.c split,
> whereas your concerns are about steps 2, 3, 4.

Fair enough, let's keep them in mind though for future work. Btw,
are other files planned for kernel/bpf/, or should it instead just
simply be kernel/bpf.c?

2014-06-03 20:58:25

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann <[email protected]> wrote:
> On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
> ...
>>
>> All of your points are valid. They are right questions to ask. I just
>>
>> don't see why you're still arguing about first step of filter.c split,
>> whereas your concerns are about steps 2, 3, 4.
>
>
> Fair enough, lets keep them in mind though for future work. Btw,

Ok :)

> are other files planned for kernel/bpf/ or should it instead just
> simply be kernel/bpf.c?

The most obvious one is the eBPF verifier in a separate file (kernel/bpf/verifier.c).
bpf maps are yet another thing, but that's a different topic.
Probably also a set of bpf-callable functions in another file: right now
for sockets these helpers are __skb_get_pay_offset() and __skb_get_nlattr();
for tracing there will be a different set of helper functions, and eventually
some will be common. For example, __get_raw_cpu_id() from filter.c could
eventually move to kernel/bpf/helpers.c.
I'm not a fan of squeezing different logic into one file.

2014-06-03 21:40:44

by Chema Gonzalez

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

First of all, and just to join the crowd, kernel/bpf/ FTW.

Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
oriented to filtering packets (meaning it returns a single integer that states
how many bytes of the packet must be captured; e.g. consider the
6 load modes, where 3 provide access to the packet -- abs, ind, msh --,
one to an skb field -- len --, the 5th one to the memory itself -- mem
--, and the 6th is an immediate set mode -- imm --) that has been used
in other environments (seccomp, tracing, etc.) by (a) extending the
idea of a "packet" into a "buffer", and (b) adding ancillary loads.

eBPF should be a generic ISA that can be used by many environments,
including those served today by classic BPF. IMO, we should get a
nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
and what should not.

- 1. we should consider separating the eBPF ISA further from classic BPF
- eBPF still uses a_reg and x_reg as the names of the 2 op
registers. This is very confusing, especially when dealing with
translated filters that do move data between A and X. I've had a_reg
being X, and x_reg being A. We should rename them d_reg and s_reg.
- BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was
only one register, and d_reg was implicit in the name of the insn
code. Now, why are we keeping both in eBPF, when the register we're
writing to is made explicit in d_reg (I already forgot if d_reg was
a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the
insns.
- BPF_ST vs. BPF_STX: same here. Note that the current
sk_convert_filter() just converts all stores to BPF_STX.

- 2. there are other insns that we should consider adding:
- lui: AFAICT, there is no clean way to build a 64-bit number (you
can LD_IMM the upper part, lsh 32, and then add the lower part).
- nop: I'd like to have a nop. Do I know why? Nope.


On Tue, Jun 3, 2014 at 1:58 PM, Alexei Starovoitov <[email protected]> wrote:
> On Tue, Jun 3, 2014 at 1:35 PM, Daniel Borkmann <[email protected]> wrote:
>> On 06/03/2014 05:44 PM, Alexei Starovoitov wrote:
>> ...
>>>
>>> All of your points are valid. They are right questions to ask. I just
>>>
>>> don't see why you're still arguing about first step of filter.c split,
>>> whereas your concerns are about steps 2, 3, 4.
>>
>>
>> Fair enough, lets keep them in mind though for future work. Btw,
>
> Ok :)
>
>> are other files planned for kernel/bpf/ or should it instead just
>> simply be kernel/bpf.c?
>
> The most obvious one is eBPF verifier in separate file (kernel/bpf/verifier.c)
> bpf maps is yet another thing, but that's different topic.
> Probably a set of bpf-callable functions in another file. Like right now
> for sockets these helpers are __skb_get_pay_offset(), __skb_get_nlattr()
> For tracing there will be a different set of helper functions and eventually
> some will be common. Like __get_raw_cpu_id() from filter.c could
> eventually move to kernel/bpf/helpers.c
LGTM.

I like the idea of every user (packet filter, seccomp, etc.) providing
a map of the bpf calls that are ok, as in the packet filter stating
that {1->__skb_get_pay_offset(), 2->__skb_get_nlattr(), ...}, but
seccomp providing a completely different (or even empty) map.
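To make that concrete, a minimal sketch of what such a per-user call map could look
like (the struct, the ids and the stub helpers below are made up purely for
illustration and are not an existing kernel API):

typedef unsigned long long u64;
typedef u64 (*bpf_helper_fn)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);

struct bpf_call_map {
	int id;			/* call number used by BPF_CALL insns */
	bpf_helper_fn fn;	/* helper the verifier resolves that id to */
};

/* stand-ins for the real skb helpers mentioned above */
static u64 get_pay_offset(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) { return 0; }
static u64 get_nlattr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) { return 0; }

/* socket filters would allow a handful of skb helpers ... */
static const struct bpf_call_map socket_calls[] = {
	{ 1, get_pay_offset },
	{ 2, get_nlattr },
};

/* ... while seccomp would allow none */
static const struct bpf_call_map seccomp_calls[] = { { 0, 0 } };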

-Chema


> I'm not a fan of squeezing different logic into one file.

2014-06-04 00:38:20

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez <[email protected]> wrote:
> First of all, and just to join the crowd, kernel/bpf/ FTW.
>
> Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
> oriented to filter (meaning returning a single integer that states how
> many bytes of the packet must be captured) packets (e.g. consider the
> 6 load modes, where 3 provide access the packet -- abs, ind, msh --,
> one to an skb field -- len--, the 5th one to the memory itself -- mem
> --, and the 6th is an immediate set mode --imm-- ) that has been used
> in other environments (seccomp, tracing, etc.) by (a) extending the
> idea of a "packet" into a "buffer", and (b) adding ancillary loads.
>
> eBPF should be a generic ISA that can be used by many environments,
> including those served today by classic BPF. IMO, we should get a
> nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
> and what should not.

Model eBPF based on MIPS ISA? Ouch.
That would be one ugly ISA that is not JITable on x64.

The eBPF ISA wasn't invented overnight. It was a gigantic effort that
took a lot of time to narrow x64 down into a _verifiable_ ISA.
I had to take the arm64, mips64 and sparcv9 architectures into account too.
Of course, minor things can be improved here or there.
The ugliness of an ISA hits compiler writers first. I've seen many
times how cpu designers add new instructions only to be told
by compiler guys that they just wasted silicon.
The fact that llvm/gcc compile C into eBPF is the strongest
statement that the eBPF ISA is 99.9% complete.
New instructions may or may not make sense.
Let's examine your proposals:

> - 1. we should considering separating the eBPF ISA farther from classic BPF
> - eBPF still uses a_reg and x_reg as the names of the 2 op
> registers. This is very confusing, especially when dealing with
> translated filters that do move data between A and X. I've had a_reg
> being X, and x_reg being A. We should rename them d_reg and s_reg.

That is a renaming of two fields in the sock_filter_int structure.
No change to the actual ISA.

You're proposing the following:
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 0e463ee77bb2..bf50fa440ef8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -287,8 +287,8 @@ enum {

struct sock_filter_int {
__u8 code; /* opcode */
- __u8 a_reg:4; /* dest register */
- __u8 x_reg:4; /* source register */
+ __u8 dst_reg:4; /* dest register */
+ __u8 src_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};
Sure. I thought the comment was explicit enough, but I agree
the fields could have been named better.
Will do a patch to rename them.

> - BPF_LD vs. BPF_LDX: this made sense in classic BPF, as there was
> only one register, and d_reg was implicit in the name of the insn
> code. Now, why are we keeping both in eBPF, when the register we're
> writing to is made explicit in d_reg (I already forgot if d_reg was
> a_reg or x_reg ;) ? Removing one of them will save us 1/8th of the
> insns.
> - BPF_ST vs. BPF_STX: same here. Note that the current
> sk_convert_filter() just converts all stores to BPF_STX.

Nope. No extra bits can be saved here.
STX means:
*(dest_reg + off) = src_reg;
ST means:
*(dest_reg + off) = imm;
LDX means:
dest_reg = *(src_reg + off);
LD we had to carry over from classic only for two non-generic
instructions: LD_ABS and LD_IND.
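As a concrete illustration, the raw sock_filter_int encoding of one STX and one ST
(register numbers written directly; treat this as a sketch, not a quote of kernel
code):

/* *(u64 *)(R10 - 8) = R1   -- store register R1 through the frame pointer R10 */
struct sock_filter_int stx = {
	.code  = BPF_STX | BPF_MEM | BPF_DW,
	.a_reg = 10,	/* destination base register (R10) */
	.x_reg = 1,	/* source register (R1) */
	.off   = -8,
	.imm   = 0,
};

/* *(u32 *)(R10 - 16) = 42  -- ST carries the constant in .imm instead of a register */
struct sock_filter_int st = {
	.code  = BPF_ST | BPF_MEM | BPF_W,
	.a_reg = 10,
	.x_reg = 0,
	.off   = -16,
	.imm   = 42,
};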

> - 2. there are other insn that we should consider adding:
> - lui: AFAICT, there is no clean way to build a 64-bit number (you
> can LD_IMM the upper part, lsh 32, and then add the lower part).

Correct. In tracing filters I do this:
+ /* construct 64-bit address */
+ emit(BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, addr >> 32), ctx);
+ emit(BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32), ctx);
+ emit(BPF_ALU32_IMM(BPF_MOV, BPF_REG_1, (u32) addr), ctx);
+ emit(BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_2), ctx);

So there is a way to construct a 64-bit immediate.
The question is how often we need to do it, and whether it is in the critical path.
The naive answer "one instruction is better than 4" doesn't count;
see my point above about 'cpu designer vs compiler writer'.
None of the RISC ISAs have a 64-bit imm, and eBPF has to consider
the simplicity of JITs, otherwise those architectures will have a hard time
mapping eBPF to native code. If a JIT is too difficult to do, then
there will be no JIT. I don't want eBPF to become an instruction
set that can be JITed only on one architecture...

'mov dest_reg, imm64' may still be ok to add, since x64 can
JIT it with one instruction and arm64 with 4 instructions, but JITs
for other archs will be ugly. They can JIT it as a load from memory,
but I need to think it through. Let me explore it more carefully.

Two must-have requirements for eBPF:
1. verifiable instructions, meaning the verifier doesn't need to jump
through hoops to prove the safety of the program
2. JITable at a minimum on x64 and arm64, otherwise programs will
run in the interpreter and performance will be lost.

A third requirement, that the compiler actually be able to generate the
new instructions, is also important, but it's not a must-have.

> - nop: I'd like to have a nop. Do I know why? Nope.

nope. Let's not add unnecessary instructions.

> I like the idea of every user (packet filter, seccomp, etc.) providing
> a map of the bpf calls that are ok, as in the packet filter stating
> that {1->__skb_get_pay_offset(), 2->__skb_get_nlattr(), ...}, but
> seccomp providing a completely different (or even empty) map.

Yes, exactly.
What you're describing is a configuration for a generic eBPF verifier.
The implementation details we'll debate when I rebase the verifier
and post it for review :)

Thanks!

2014-06-20 16:44:46

by Chema Gonzalez

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

[Sorry for the delay in the answer. Been mired somewhere else.]

On Tue, Jun 3, 2014 at 5:38 PM, Alexei Starovoitov <[email protected]> wrote:
> On Tue, Jun 3, 2014 at 2:40 PM, Chema Gonzalez <[email protected]> wrote:
>> First of all, and just to join the crowd, kernel/bpf/ FTW.
>>
>> Now, I have some suggestions about eBPF. IMO classic BPF is an ISA
>> oriented to filter (meaning returning a single integer that states how
>> many bytes of the packet must be captured) packets (e.g. consider the
>> 6 load modes, where 3 provide access the packet -- abs, ind, msh --,
>> one to an skb field -- len--, the 5th one to the memory itself -- mem
>> --, and the 6th is an immediate set mode --imm-- ) that has been used
>> in other environments (seccomp, tracing, etc.) by (a) extending the
>> idea of a "packet" into a "buffer", and (b) adding ancillary loads.
>>
>> eBPF should be a generic ISA that can be used by many environments,
>> including those served today by classic BPF. IMO, we should get a
>> nicely-defined ISA (MIPS anyone?) and check what should go into eBPF
>> and what should not.
>
> Model eBPF based on MIPS ISA? Ouch.
> That would be one ugly ISA that is not JITable on x64.

Definitely I wasn't making my point clear: IMO if we're redesigning
the BPF ISA, we should get a clean one (clean=something that is simple
enough to be readable by a newcomer). I mentioned MIPS because it's
the one I know the most, and it's kind of clean. I'm definitely not
proposing to use MIPS as the BPF ISA.

In particular, I have 4 cleanliness concerns related to load/stores:

1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn
space to load/store (4 opcodes/instruction class values out of 8
possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That
gives me pause. I'm definitely not suggesting adding more insns just
because physical ISAs have them, but I think it makes sense to avoid
using the whole insn space just because it's there, or because classic
BPF was using it all.

2. instructions (sock_filter_int) hardcode 2 immediate values. This is
very unusual in ISAs. We're effectively doubling the insn size (from
32 to 64 bits), and still we are cramming the whole insn space (only 1
reserved instruction class of 8 possible ones). The rationale is to
support a new addressing mode (BPF_ST|BPF_MEM), where both the offset
to src_reg and the immediate value are hardcoded in the insn. (See
more below.)

3. name reuse: we're reusing names between classic BPF and eBPF. For
example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[]
buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find
this very confusing, especially because we'll have to live with
classic BPF in userland filters for a long time. In fact, if you ask
me, I'll come up with some generic name for the generic linux
filtering mechanism (eBPF and internal BPF sound too much like BPF),
to make it clear that this is not just BPF.

4. simplicity: both BPF and eBPF have 4 ld/st operation types (LD,
LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for
eBPF), where only a subset of the {operation types, modifier} tuples
are valid. I think we can make it simpler. For the eBPF case, we
currently have 6 valid combinations:

4.1. BPF_LDX|BPF_MEM
Operation: dst_reg = *(size *) (src_reg + off16)

This is the basic load insn. It's used to convert most of the classic
BPF addressing modes by setting the right src_reg (FP in the classic
BPF M[] access, CTX for the classic BPF BPF_LD*|BPF_LEN access and
seccomp_data access, etc.)

4.2. BPF_LD|BPF_ABS
Operation: BPF_R0 = ntoh<size>(*(size *) (skb->data + imm32))

4.3. BPF_LD|BPF_IND
Operation: BPF_R0 = ntoh<size>(*(size *) (skb->data + src_reg + imm32))

The two eBPF BPF_LD insns are essentially BPF_LDX|BPF_MEM insns that access an skbuff.
For example, BPF_LD|BPF_ABS does "dst_reg = packet[imm32]", which is a
BPF_LDX|BPF_MEM with the following differences:
a. packet is skb->data == CTX == BPF_R6 (this is equivalent to
src_reg==R6 in BPF_LDX|BPF_MEM)
b. output is left in R0, not in dst_reg
c. result is ntohs()|ntohl()
d. every packet access is checked using
load_pointer()/skb_header_pointer()/bpf_internal_load_pointer_neg_helper()

Now, (a, b, c) seem like details that should be handled with multiple
instructions (in fact the x86 JIT does that). (d) is IMO the only part
important enough to justify a different insn. I'd call this mode
BPF_SKBUFF_PROTECTED (or something like that), because that is the
main idea of this instruction: that any memory access (ld or st) is
checked at runtime assuming it's an skbuff.

The BPF_LD|BPF_IND insn could be replaced in 2 steps, one to get
src_reg+imm32 into a tmp register, and the other to perform the final
load based on the tmp register.

4.4. BPF_STX|BPF_MEM
Operation: *(size *) (dst_reg + off16) = src_reg

This is the basic store insn. LGTM.

4.5. BPF_ST|BPF_MEM
Operation: *(size *) (dst_reg + off16) = imm32

This insn encodes 2 immediate values (the offset and the imm32 value)
in the insn, and actually forces the sock_filter_int 64-bit struct to
have both a 16-bit offset field and a 32-bit immediate field. In
fact, it's the only instruction that uses .off and .imm at the same
time (for all other instructions, at least one of the fields is always
0).

This did not exist in classic BPF (where BPF_ST|BPF_MEM actually did
"mem[pc->k] = A;"). In fact, it's rare to find an ISA that allows
encoding 2 immediate values in a single insn. My impression (after
checking the x86 JIT implementation, which works on the eBPF code) is
that this was added as an x86 optimization, because x86 allows
encoding 2 values (offset and immediate) by using the displacement and
immediate suffixes. I wonder whether the ISA would be more readable if
we did this in 2 insn, one to put dst_reg+off16 in a temporary
register, and the second a simpler BPF_STX|BPF_MEM. Then we could use
the same space for the immediate and offset fields.

4.6. BPF_STX|BPF_XADD
Operation: mem[dst_reg + off16] += src_reg

I assume there's some use case for this, apart from the fact that x86
has an easy construction for this.

You guys have done an excellent job simplifying the 4 opcodes x 6
addressing modes in classic BPF, but I think we should go a step
further, and make it even simpler. I can see how we only need basic
load (4.1), protected load (4.2, 4.3), basic store (4.4, 4.5), and
maybe XADD store (4.6).

> eBPF ISA wasn't invented overnight. It was a gigantic effort that
> took a lot of time to narrow down x64 into _verifiable_ ISA.
> I had to take into account arm64, mips64, sparcv9 architectures too.
> Of course, minor things can be improved here or there.
> Ugliness of ISA hits compiler writers first. I've seen many
> times how cpu designers add new instructions only to be told
> by compiler guys that they just wasted silicon.
> Fact that llvm/gcc compile C into eBPF is the strongest
> statement that eBPF ISA is 99.9% complete.
> New instructions may or may not make sense.
> Let's examine your proposal:
>
>> - 1. we should considering separating the eBPF ISA farther from classic BPF
>> - eBPF still uses a_reg and x_reg as the names of the 2 op
>> registers. This is very confusing, especially when dealing with
>> translated filters that do move data between A and X. I've had a_reg
>> being X, and x_reg being A. We should rename them d_reg and s_reg.
>
> that is renaming of two fields in sock_filter_int structure.
> No change to actual ISA.
I saw you already wrote this one. Thanks!

>> - 2. there are other insn that we should consider adding:
>> - lui: AFAICT, there is no clean way to build a 64-bit number (you
>> can LD_IMM the upper part, lsh 32, and then add the lower part).
>
> correct. in tracing filters I do this:
> + /* construct 64-bit address */
> + emit(BPF_ALU64_IMM(BPF_MOV, BPF_REG_2, addr >>
> 32), ctx);
> + emit(BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32), ctx);
> + emit(BPF_ALU32_IMM(BPF_MOV, BPF_REG_1, (u32)
> addr), ctx);
> + emit(BPF_ALU64_REG(BPF_OR, BPF_REG_1, BPF_REG_2), ctx);
>
> so there is a way to construct 64-bit immediate.
> The question is how often do we need to do it? Is it in critical path?
> Naive answer "one instruction is better than 4" doesn't count.
> See my point above 'cpu designer vs compiler writer'.
> None of the risc ISAs have 64-bit imm and eBPF has to consider
> simplicity of JITs otherwise those architectures will have a hard time
> mapping eBPF to native. If JIT would be too difficult to do, then
> there will be no JIT. I don't want eBPF to become an instruction
> set that can be JITed only on one architecture...
That's a good point, and I don't know enough about the other archs to
tell whether lui is feasible or not.

>> - nop: I'd like to have a nop. Do I know why? Nope.
> nope. Let's not add unnecessary instructions.
A valid nop is a useful instruction: padding, filling up arrays of
sock_filter_int correctly (as in lib/test_bpf.c, where we're currently
using a "ld #0", which loads zero to register A), and other use cases
(see http://en.wikipedia.org/wiki/NOP ).

Thanks,
-Chema


>> I like the idea of every user (packet filter, seccomp, etc.) providing
>> a map of the bpf calls that are ok, as in the packet filter stating
>> that {1->__skb_get_pay_offset(), 2->__skb_get_nlattr(), ...}, but
>> seccomp providing a completely different (or even empty) map.
>
> yes. exactly.
> What you're describing is a configuration for generic eBPF verifier.
> The implementation details we'll debate when I rebase the verifier
> and post it for review :)
>
> Thanks!

2014-06-23 09:19:19

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 net-next 0/2] split BPF out of core networking

From: Chema Gonzalez
...
> 4.5. BPF_ST|BPF_MEM
> Operation: *(size *) (dst_reg + off16) = imm32
>
> This insn encodes 2 immediate values (the offset and the imm32 value)
> in the insn, and actually forces the sock_filter_int 64-bit struct to
> have both a 16-bit offset field and a 32-bit immediate field). In
> fact, it's the only instructions that uses .off and .imm at the same
> time (for all other instructions, at least one of the fields is always
> 0).
>
> This did not exist in classic BPF (where BPF_ST|BPF_MEM actually did
> "mem[pc->k] = A;"). In fact, it's rare to find an ISA that allows
> encoding 2 immediate values in a single insn. My impression (after
> checking the x86 JIT implementation, which works on the eBPF code) is
> that this was added as an x86 optimization, because x86 allows
> encoding 2 values (offset and immediate) by using the displacement and
> immediate suffixes. I wonder whether the ISA would be more readable if
> we did this in 2 insn, one to put dst_reg+off16 in a temporary
> register, and the second a simpler BPF_STX|BPF_MEM. Then we could use
> the same space for the immediate and offset fields.

One option is to add code to the x86 JIT to detect the two-instruction
sequence and generate a single instruction.

Thinking further, the JIT might be easier to write if there were a temporary
register that is defined to be valid only for the next instruction (or two).
Then the JIT could completely optimise away any assignments to it without
requiring a full analysis of the entire program.
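A hypothetical sketch of such a peephole (the struct and opcode names below are
made up just to show the shape of the check, they are not the actual x86 JIT):

struct insn {
	int op;			/* made-up opcodes, see enum below */
	int dst, src;
	int off, imm;
};

enum { OP_LEA_IMM, OP_STX_MEM };	/* dst = src + imm;  *(dst + off) = src */

/* return 1 when insn i and i + 1 form "tmp = base + off; *(tmp + 0) = src"
 * and can therefore be emitted as a single native store with displacement,
 * assuming tmp is the JIT-internal temporary and is dead afterwards
 */
static int can_fuse_store(const struct insn *p, int i, int len)
{
	return i + 1 < len &&
	       p[i].op == OP_LEA_IMM &&
	       p[i + 1].op == OP_STX_MEM &&
	       p[i + 1].dst == p[i].dst &&
	       p[i + 1].off == 0;
}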

David


2014-06-23 21:57:57

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez <[email protected]> wrote:
>>
>> Model eBPF based on MIPS ISA? Ouch.
>> That would be one ugly ISA that is not JITable on x64.
>
> Definitely I wasn't making my point clear: IMO if we're redesigning
> the BPF ISA, we should get a clean one (clean=something that is simple
> enough to be readable by a newcomer). I mentioned MIPS because it's
> the one I know the most, and it's kind of clean. I'm definitely not
> proposing to use MIPS as the BPF ISA.

MIPS is a clean ISA? When it was designed 30 years ago, it was good,
but today it really shows its age: delay slots, integer arithmetic that
traps on overflow, lack of real comparison operations, hi/lo, etc.
I strongly believe the eBPF ISA is way cleaner and easier to understand
than the MIPS ISA.

It seems your proposals to make eBPF 'cleaner' are based on
a HW mindset, which is not applicable here. Details below:

> 1. how we codify the ISA. Both BPF and eBPF devote 50% of the insn
> space to load/store (4 opcodes/instruction class values of 8
> possible). In comparison, MIPS uses just 14% (9 of 64 opcodes). That

That is a misleading comparison and leads to the wrong conclusions.
Your proposal to remove useful instructions just to save a bit in the bpf
class encoding doesn't make sense. We have infinite room to
add new instructions and opcodes. Unlike a HW ISA, eBPF is not
limited to 8-bit opcodes and 8-byte instructions. eBPF is not designed
to be run directly by HW, so we're not trying to save RTL gates here.
Among other things we're optimizing for interpreter performance,
so removing instructions can only hurt.

> gives me pause. I'm definitely not suggesting adding more insn just
> because physical ISAs have it, but I think it makes sense to avoid
> using the whole insn space just because it's there, or because classic
> BPF was using it all.

Wrong. Classic BPF is a legacy that we have to live with, and we
should not sacrifice performance of classic BPF filters just to reduce the
number of eBPF instructions.
Dropping an instruction made sense only for one classic instruction: BPF_LD + MSH.
It is really a single-purpose instruction to do X = ip->length * 4.
We didn't carry this ugliness into eBPF, since it's not generic,
can easily be represented by generic instructions, and complicates JITs.
Since MSH is used at most once per tcpdump filter, a few extra
insns add a tiny penalty to the overall filter execution time in the interpreter
and give no performance penalty at all when the filter is JITed.

> 2. instructions (sock_filter_int) hardcode 2 immediate values. This is
> very unusual in ISAs. We're effectively doubling the insn size (from
> 32 to 64 bits), and still we are cramming the whole insn space (only 1
> reserved instruction class of 8 possible ones). The rationale is to

Comparisons with HW encodings are not applicable.
MIPS picked a 4-byte-wide instruction and had to live with it. Just look
at the hi/lo insns and what compilers have to do with them.
All eBPF instructions today are 8 bytes wide. There is no reason
to redesign them into <8 bytes or squeeze bits. It would hurt
performance without giving us anything back.
At the same time we can add a 16-byte instruction to represent a
load of a 64-bit immediate, but as I was saying before many factors
need to be considered before we proceed.

> 3. name reuse: we're reusing names b etween classic BPF and eBPF. For
> example, BPF_LD*|BPF_MEM in classic BPF refers to access to the M[]
> buffer. BPF_LDX|BPF_MEM in eBPF is a generic memory access. I find
> this very confusing, especially because we'll have to live with

That's one opinion. I think names are fine and
Documentation/networking/filter.txt explains both classic and eBPF
encoding well enough.

> classic BPF in userland filters for a long time. In fact, if you ask
> me, I'll come up with some generic name for the generic linux
> filtering mechanism (eBPF and internal BPF sound too much like BPF),
> to make it clear that this is not just BPF.

I don't think it's a good idea. I like BPF abbreviation, since the name
implies the use case. Renaming eBPF to ISA_X will be confusing.
Now everyone understands that BPF is a safe instruction set that
can be dynamically loaded into the kernel. eBPF is the same
plus more.

> 4. simplicity: both BPF and eBPF have 4 ld/st operations types (LD,
> LDX, ST, STX) and many addressing modes/modifiers (6 for BPF, 4 for
> eBPF), where only a subset of the {operation types, modifier} tuples
> are valid. I think we can make it simpler. For the eBPF case, we
> currently have 6 valid combinations:

That's a good summary. I think the documentation already explains it,
but if you feel it's still missing pieces, please send a patch to
improve the doc.

> The two eBPF BPF_LD insn are BPF_LDX|BPF_MEM insn to access an skbuff.
> For example, BPF_LD|BPF_ABS does "dst_reg = packet[imm32]", which is a
> BPF_LDX|BPF_MEM with the following differences:
> a. packet is skb->data == CTX == BPF_R6 (this is equivalent to
> src_reg==R6 in BPF_LDX|BPF_MEM)
> b. output is left in R0, not in dst_reg
> c. result is ntohs()|ntohl()
> d. every packet access is checked using
> load_pointer()/skb_header_pointer()/bpf_internal_load_pointer_neg_helper()
>
> Now, (a,b,c) seem like details that should be worked with multiple
> instructions (in fact the x86 JIT does that).

And penalize performance of converted classic filters?? No.
LD+ABS and LD+IND must stay as-is, since these two are the most
commonly used instructions in tcpdump filters and we cannot
split them even into two instructions without degrading performance.

> The BPF_LD|BPF_IND insn could be replaced in 2 steps, one to get
> src_reg+imm32 into a tmp register, and the other to perform the final
> load based on the tmp register.

Nope. It's a critical-path instruction; we cannot split it into two or
performance will suffer.

> 4.5. BPF_ST|BPF_MEM
> Operation: *(size *) (dst_reg + off16) = imm32
>
> This insn encodes 2 immediate values (the offset and the imm32 value)
> in the insn, and actually forces the sock_filter_int 64-bit struct to
> have both a 16-bit offset field and a 32-bit immediate field). In
> fact, it's the only instructions that uses .off and .imm at the same
> time (for all other instructions, at least one of the fields is always
> 0).

I considered not introducing BPF_ST_MEM in the first place,
but then decided to add it to improve code for stack initialization.
The majority of cpus have a *(u32 *)(stack - offset) = 0 instruction (even mips
has it, since it has a zero register), and this instruction is used a lot
to initialize variables on the stack. Obviously two instructions can be
used instead (BPF_MOV_IMM + BPF_STX_MEM), but having one
instruction improves interpreter performance, so here we have
BPF_ST_MEM.
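For illustration, the two alternatives in raw sock_filter_int form (size and class
constants as in filter.h; the stack slot offset is arbitrary and this is a sketch,
not a quote of kernel code):

/* one insn:  *(u32 *)(R10 - 4) = 0 */
struct sock_filter_int st_zero = {
	.code = BPF_ST | BPF_MEM | BPF_W,
	.a_reg = 10, .x_reg = 0, .off = -4, .imm = 0,
};

/* same effect in two insns:  R2 = 0;  *(u32 *)(R10 - 4) = R2 */
struct sock_filter_int mov_zero = {
	.code = BPF_ALU64 | BPF_MOV | BPF_K,
	.a_reg = 2, .x_reg = 0, .off = 0, .imm = 0,
};
struct sock_filter_int stx_zero = {
	.code = BPF_STX | BPF_MEM | BPF_W,
	.a_reg = 10, .x_reg = 2, .off = -4, .imm = 0,
};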

> immediate suffixes. I wonder whether the ISA would be more readable if
> we did this in 2 insn, one to put dst_reg+off16 in a temporary
> register, and the second a simpler BPF_STX|BPF_MEM. Then we could use
> the same space for the immediate and offset fields.

There is no reason to squeeze bits or reduce the instruction size. It would
only make things more complex and slower. If you disagree, please
rewrite the interpreter, the converter, the compiler backends and the JITs, and
measure performance on a variety of programs. Then we'll have
facts to talk about. So far I don't like any of these proposals.

>>> - nop: I'd like to have a nop. Do I know why? Nope.
>> nope. Let's not add unnecessary instructions.
> A valid nop is a useful instruction: padding, filling up arrays of
> sock_filter_int correctly (as in lib/test_bpf.c, where we're currently
> using a "ld #0", which loads zero to register A), and other use cases
> (see http://en.wikipedia.org/wiki/NOP ).

I especially don't want to add a 'nop' instruction.
code==0 meaning 'ld #0' is one piece of classic BPF ugliness.
We're not filling up arrays with nops in lib/test_bpf.c.
Zero is an invalid opcode in eBPF and should stay so, since it's
an easy check for humans like me who are looking at eBPF in hex.

Thanks
Alexei

2014-06-24 08:34:09

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/2] split BPF out of core networking

On 06/23/2014 11:57 PM, Alexei Starovoitov wrote:
> On Fri, Jun 20, 2014 at 9:44 AM, Chema Gonzalez <[email protected]> wrote:
...
>>>> - nop: I'd like to have a nop. Do I know why? Nope.
>>> nope. Let's not add unnecessary instructions.
>> A valid nop is a useful instruction: padding, filling up arrays of
>> sock_filter_int correctly (as in lib/test_bpf.c, where we're currently
>> using a "ld #0", which loads zero to register A), and other use cases
>> (see http://en.wikipedia.org/wiki/NOP ).
>
> I especially don't want to add a 'nop' instruction.
> code==0 meaning 'ld #0' is one piece of classic BPF ugliness.

I think it was probably unintended that one can have unreachable
code, e.g. filled with 'nops' where both jt and jf just jump over it,
but that quirk we cannot change anymore in the classic checker
and have to carry onwards.

> We're not filling up arrays with nops in lib/test_bpf.c.
> Zero is an invalid opcode in eBPF and should stay so, since it's
> an easy check for humans like me who are looking at eBPF in hex.