2014-02-27 02:38:59

by Alexei Starovoitov

[permalink] [raw]
Subject: [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter

Hi All,

V1 patches:
http://thread.gmane.org/gmane.linux.kernel/1605783
V2 patches:
http://thread.gmane.org/gmane.linux.kernel/1642325

V3 summary:
- as suggested by Daniel added on the fly converter from
old BPF (aka BPF32) into extended BPF (aka BPF64)
- as suggested by Peter Anvin added 32-bit subregisters
they don't add much to interpreter speed, but simplify bpf32->bpf64 mapping
- added sysctl net.core.bpf64_enable flag
if enabled, old BPF filters will be converted to BPF64
and will be used by tcpdump/cls/xtables.
safety of the filters is verified by old BPF sk_chk_filter()
BPF64's bpf_check() is dropped from this patch to simplify review

Addition of 32-bit subregs require some work on BPF64 x86_64 JIT, so
it's not included in this patch set. LLVM BPF64 backend also needs to be
taught to take advantage of 32-bit subregs.

Initially BPF64 instruction set was designed for max performance after JIT,
Now it was tweaked for good interpreter speeds as well.
Eventually BPF64 can completely replace existing BPF on all architectures.

Two key reasons why BPF64 interpreter is noticeably faster
than existing BPF32 interpreter:

1.fall-through jumps
In BPF32 jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty.
BPF64 jump instructions have one branch and fall-through, which fit
CPU branch predictor logic better.
'perf stat' shows drastic difference for branch-misses.

2.jump-threaded implementation of interpreter vs switch statement
Instead of single tablejump at the top of 'switch' statement, GCC will
generate multiple tablejump instructions, which helps CPU branch predictor

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from BPF micro-benchmark:
SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
time in nsec per call, smaller is better
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
BPF32 90 98 207 220
BPF64 28 85 60 108
BPF32_JIT 12 33 17 44
BPF64_JIT TBD

--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
BPF32 107 136 227 252
BPF64 40 119 69 172

--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
BPF32 202 300 475 540
BPF64 139 270 296 470
BPF32_JIT 26 182 37 202
BPF64_JIT TBD

on Intel cpus BPF64 interpreter is significantly faster than
old BPF interpreter. Existing BPF32_JIT is obviously even faster.
BPF64_JIT has similar performance.

Tested with Daniel's 'trinify BPF fuzzer'

TODO:
- bpf32->bpf64 converter doesn't recognize seccomp and negative
offsets yet, fix that

- add 32-bit subregs to BPF64 x86_64 JIT and LLVM backend

- add bpf64 verifier, so that tcpdump/cls/xt and others can
insert both bpf32 and bpf64 programs through the same interface

- add bpf tables, complete 'dropmonitor' and get back to
systemtap-like probes with bpf64

Please review.
Thanks!

Alexei Starovoitov (1):
bpf32->bpf64 mapper and bpf64 interpreter

include/linux/filter.h | 9 +-
include/linux/netdevice.h | 1 +
include/uapi/linux/filter.h | 37 ++-
net/core/Makefile | 2 +-
net/core/bpf_run.c | 766 +++++++++++++++++++++++++++++++++++++++++++
net/core/filter.c | 114 ++++++-
net/core/sysctl_net_core.c | 7 +
7 files changed, 913 insertions(+), 23 deletions(-)
create mode 100644 net/core/bpf_run.c

--
1.7.9.5


2014-02-27 02:39:10

by Alexei Starovoitov

[permalink] [raw]
Subject: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter

Extended BPF (or 64-bit BPF) is an instruction set to
create safe dynamically loadable filters that can call fixed set
of kernel functions and take generic bpf_context as an input.
BPF filter is a glue between kernel functions and bpf_context.
Different kernel subsystems can define their own set of available functions
and alter BPF machinery for specific use case.
BPF64 instruction set is designed for efficient mapping to native
instructions on 64-bit CPUs

Old BPF instructions are remapped on the fly to BPF64
when sysctl net.core.bpf64_enable=1

Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/filter.h | 9 +-
include/linux/netdevice.h | 1 +
include/uapi/linux/filter.h | 37 ++-
net/core/Makefile | 2 +-
net/core/bpf_run.c | 766 +++++++++++++++++++++++++++++++++++++++++++
net/core/filter.c | 114 ++++++-
net/core/sysctl_net_core.c | 7 +
7 files changed, 913 insertions(+), 23 deletions(-)
create mode 100644 net/core/bpf_run.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..bf3085258f4c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);

+/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
+int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn *new_prog,
+ int *p_new_len);
+/* execute bpf64 program */
+u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
+
+#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
#ifdef CONFIG_BPF_JIT
#include <stdarg.h>
#include <linux/linkage.h>
@@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
16, 1, image, proglen, false);
}
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
#else
#include <linux/slab.h>
static inline void bpf_jit_compile(struct sk_filter *fp)
@@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
{
kfree(fp);
}
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
#endif

static inline int bpf_tell_extensions(void)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5e84483c0650..7b1acefc244e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2971,6 +2971,7 @@ extern int netdev_max_backlog;
extern int netdev_tstamp_prequeue;
extern int weight_p;
extern int bpf_jit_enable;
+extern int bpf64_enable;

bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..70ff29ee6825 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,3 +1,4 @@
+/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com */
/*
* Linux Socket Filter Data Structures
*/
@@ -19,7 +20,7 @@
* Try and keep these values and structures similar to BSD, especially
* the BPF code definitions which need to match so you can share filters
*/
-
+
struct sock_filter { /* Filter block */
__u16 code; /* Actual filter code */
__u8 jt; /* Jump true */
@@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define BPF_JMP 0x05
#define BPF_RET 0x06
#define BPF_MISC 0x07
+#define BPF_ALU64 0x07
+
+struct bpf_insn {
+ __u8 code; /* opcode */
+ __u8 a_reg:4; /* dest register*/
+ __u8 x_reg:4; /* source register */
+ __s16 off; /* signed offset */
+ __s32 imm; /* signed immediate constant */
+};
+
+/* pointer to bpf_context is the first and only argument to BPF program
+ * its definition is use-case specific */
+struct bpf_context;

/* ld/ldx fields */
#define BPF_SIZE(code) ((code) & 0x18)
#define BPF_W 0x00
#define BPF_H 0x08
#define BPF_B 0x10
+#define BPF_DW 0x18
#define BPF_MODE(code) ((code) & 0xe0)
#define BPF_IMM 0x00
#define BPF_ABS 0x20
@@ -58,6 +73,7 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define BPF_MEM 0x60
#define BPF_LEN 0x80
#define BPF_MSH 0xa0
+#define BPF_XADD 0xc0 /* exclusive add */

/* alu/jmp fields */
#define BPF_OP(code) ((code) & 0xf0)
@@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define BPF_OR 0x40
#define BPF_AND 0x50
#define BPF_LSH 0x60
-#define BPF_RSH 0x70
+#define BPF_RSH 0x70 /* logical shift right */
#define BPF_NEG 0x80
#define BPF_MOD 0x90
#define BPF_XOR 0xa0
+#define BPF_MOV 0xb0 /* mov reg to reg */
+#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
+#define BPF_BSWAP32 0xd0 /* swap lower 4 bytes of 64-bit register */
+#define BPF_BSWAP64 0xe0 /* swap all 8 bytes of 64-bit register */

#define BPF_JA 0x00
-#define BPF_JEQ 0x10
-#define BPF_JGT 0x20
-#define BPF_JGE 0x30
-#define BPF_JSET 0x40
+#define BPF_JEQ 0x10 /* jump == */
+#define BPF_JGT 0x20 /* GT is unsigned '>', JA in x86 */
+#define BPF_JGE 0x30 /* GE is unsigned '>=', JAE in x86 */
+#define BPF_JSET 0x40 /* if (A & X) */
+#define BPF_JNE 0x50 /* jump != */
+#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
+#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
+#define BPF_CALL 0x80 /* function call */
#define BPF_SRC(code) ((code) & 0x08)
#define BPF_K 0x00
#define BPF_X 0x08
@@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#define SKF_NET_OFF (-0x100000)
#define SKF_LL_OFF (-0x200000)

-
#endif /* _UAPI__LINUX_FILTER_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 9628c20acff6..e622b97f58dc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
obj-$(CONFIG_SYSCTL) += sysctl_net_core.o

obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
- neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
+ neighbour.o rtnetlink.o utils.o link_watch.o filter.o bpf_run.o \
sock_diag.o dev_ioctl.o

obj-$(CONFIG_XFRM) += flow.o
diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
new file mode 100644
index 000000000000..fa1862fcbc74
--- /dev/null
+++ b/net/core/bpf_run.c
@@ -0,0 +1,766 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+int bpf64_enable __read_mostly;
+
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k,
+ unsigned int size);
+
+static inline void *load_pointer(const struct sk_buff *skb, int k,
+ unsigned int size, void *buffer)
+{
+ if (k >= 0)
+ return skb_header_pointer(skb, k, size, buffer);
+ return bpf_internal_load_pointer_neg_helper(skb, k, size);
+}
+
+static const char *const bpf_class_string[] = {
+ "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
+};
+
+static const char *const bpf_alu_string[] = {
+ "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+ "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
+};
+
+static const char *const bpf_ldst_string[] = {
+ "u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+ "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
+};
+
+static const char *reg_to_str(int regno, u64 *regs)
+{
+ static char reg_value[16][32];
+ if (!regs)
+ return "";
+ snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
+ regs[regno]);
+ return reg_value[regno];
+}
+
+#define R(regno) reg_to_str(regno, regs)
+
+void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
+{
+ u16 class = BPF_CLASS(insn->code);
+ if (class == BPF_ALU || class == BPF_ALU64) {
+ if (BPF_SRC(insn->code) == BPF_X)
+ pr_info("code_%02x %sr%d%s %s r%d%s\n",
+ insn->code, class == BPF_ALU ? "(u32)" : "",
+ insn->a_reg, R(insn->a_reg),
+ bpf_alu_string[BPF_OP(insn->code) >> 4],
+ insn->x_reg, R(insn->x_reg));
+ else
+ pr_info("code_%02x %sr%d%s %s %d\n",
+ insn->code, class == BPF_ALU ? "(u32)" : "",
+ insn->a_reg, R(insn->a_reg),
+ bpf_alu_string[BPF_OP(insn->code) >> 4],
+ insn->imm);
+ } else if (class == BPF_STX) {
+ if (BPF_MODE(insn->code) == BPF_MEM)
+ pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->a_reg, R(insn->a_reg),
+ insn->off, insn->x_reg, R(insn->x_reg));
+ else if (BPF_MODE(insn->code) == BPF_XADD)
+ pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->a_reg, R(insn->a_reg), insn->off,
+ insn->x_reg, R(insn->x_reg));
+ else
+ pr_info("BUG_%02x\n", insn->code);
+ } else if (class == BPF_ST) {
+ if (BPF_MODE(insn->code) != BPF_MEM) {
+ pr_info("BUG_st_%02x\n", insn->code);
+ return;
+ }
+ pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
+ insn->code,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->a_reg, R(insn->a_reg),
+ insn->off, insn->imm);
+ } else if (class == BPF_LDX) {
+ if (BPF_MODE(insn->code) == BPF_MEM) {
+ pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
+ insn->code, insn->a_reg,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->x_reg, R(insn->x_reg), insn->off);
+ } else {
+ pr_info("BUG_ldx_%02x\n", insn->code);
+ }
+ } else if (class == BPF_LD) {
+ if (BPF_MODE(insn->code) == BPF_ABS) {
+ pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
+ insn->code, insn->a_reg,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->imm);
+ } else if (BPF_MODE(insn->code) == BPF_IND) {
+ pr_info("code_%02x r%d = *(%s *)(skb + r%d%s %+d)\n",
+ insn->code, insn->a_reg,
+ bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+ insn->x_reg, R(insn->x_reg), insn->imm);
+ } else {
+ pr_info("BUG_ld_%02x\n", insn->code);
+ }
+ } else if (class == BPF_JMP) {
+ u16 opcode = BPF_OP(insn->code);
+ if (opcode == BPF_CALL) {
+ pr_info("code_%02x call %d\n", insn->code, insn->imm);
+ } else if (insn->code == (BPF_JMP | BPF_JA)) {
+ pr_info("code_%02x goto pc%+d\n",
+ insn->code, insn->off);
+ } else if (BPF_SRC(insn->code) == BPF_X) {
+ pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
+ insn->code, insn->a_reg, R(insn->a_reg),
+ bpf_jmp_string[BPF_OP(insn->code) >> 4],
+ insn->x_reg, R(insn->x_reg), insn->off);
+ } else {
+ pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
+ insn->code, insn->a_reg, R(insn->a_reg),
+ bpf_jmp_string[BPF_OP(insn->code) >> 4],
+ insn->imm, insn->off);
+ }
+ } else {
+ pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
+ }
+}
+EXPORT_SYMBOL(pr_info_bpf_insn);
+
+/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style (BPF64)
+ *
+ * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate new
+ * program length in one pass
+ *
+ * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
+ *
+ * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
+ * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
+ */
+int bpf_convert(struct sock_filter *old_prog, int len,
+ struct bpf_insn *new_prog, int *p_new_len)
+{
+ struct bpf_insn *new_insn;
+ struct sock_filter *fp;
+ int *addrs = NULL;
+ int new_len = 0;
+ int pass = 0;
+ int tgt, i;
+
+ if (len <= 0 || len >= BPF_MAXINSNS)
+ return -EINVAL;
+
+ if (new_prog) {
+ addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
+ if (!addrs)
+ return -ENOMEM;
+ }
+
+do_pass:
+ new_insn = new_prog;
+ fp = old_prog;
+ for (i = 0; i < len; fp++, i++) {
+ struct bpf_insn tmp_insns[3] = {};
+ struct bpf_insn *insn = tmp_insns;
+
+ if (addrs)
+ addrs[i] = new_insn - new_prog;
+
+ switch (fp->code) {
+ /* all arithmetic insns and skb loads map as-is */
+ case BPF_ALU | BPF_ADD | BPF_X:
+ case BPF_ALU | BPF_ADD | BPF_K:
+ case BPF_ALU | BPF_SUB | BPF_X:
+ case BPF_ALU | BPF_SUB | BPF_K:
+ case BPF_ALU | BPF_AND | BPF_X:
+ case BPF_ALU | BPF_AND | BPF_K:
+ case BPF_ALU | BPF_OR | BPF_X:
+ case BPF_ALU | BPF_OR | BPF_K:
+ case BPF_ALU | BPF_LSH | BPF_X:
+ case BPF_ALU | BPF_LSH | BPF_K:
+ case BPF_ALU | BPF_RSH | BPF_X:
+ case BPF_ALU | BPF_RSH | BPF_K:
+ case BPF_ALU | BPF_XOR | BPF_X:
+ case BPF_ALU | BPF_XOR | BPF_K:
+ case BPF_ALU | BPF_MUL | BPF_X:
+ case BPF_ALU | BPF_MUL | BPF_K:
+ case BPF_ALU | BPF_DIV | BPF_X:
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_MOD | BPF_X:
+ case BPF_ALU | BPF_MOD | BPF_K:
+ case BPF_ALU | BPF_NEG:
+ case BPF_LD | BPF_ABS | BPF_W:
+ case BPF_LD | BPF_ABS | BPF_H:
+ case BPF_LD | BPF_ABS | BPF_B:
+ case BPF_LD | BPF_IND | BPF_W:
+ case BPF_LD | BPF_IND | BPF_H:
+ case BPF_LD | BPF_IND | BPF_B:
+ insn->code = fp->code;
+ insn->a_reg = 6;
+ insn->x_reg = 7;
+ insn->imm = fp->k;
+ break;
+
+ /* jump opcodes map as-is, but offsets need adjustment */
+ case BPF_JMP | BPF_JA:
+ tgt = i + fp->k + 1;
+ insn->code = fp->code;
+#define EMIT_JMP \
+ do { \
+ if (tgt >= len || tgt < 0) \
+ goto err; \
+ insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
+ } while (0)
+
+ EMIT_JMP;
+ break;
+
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ insn->a_reg = 6;
+ insn->x_reg = 7;
+ insn->imm = fp->k;
+ /* common case where 'jump_false' is next insn */
+ if (fp->jf == 0) {
+ insn->code = fp->code;
+ tgt = i + fp->jt + 1;
+ EMIT_JMP;
+ break;
+ }
+ /* convert JEQ into JNE when 'jump_true' is next insn */
+ if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
+ insn->code = BPF_JMP | BPF_JNE |
+ BPF_SRC(fp->code);
+ tgt = i + fp->jf + 1;
+ EMIT_JMP;
+ break;
+ }
+ /* other jumps are mapped into two insns: Jxx and JA */
+ tgt = i + fp->jt + 1;
+ insn->code = fp->code;
+ EMIT_JMP;
+
+ insn++;
+ insn->code = BPF_JMP | BPF_JA;
+ tgt = i + fp->jf + 1;
+ EMIT_JMP;
+ /* adjust pc relative offset, since it's a 2nd insn */
+ insn->off--;
+ break;
+
+ /* ldxb 4*([14]&0xf) is remaped into 3 insns */
+ case BPF_LDX | BPF_MSH | BPF_B:
+ insn->code = BPF_LD | BPF_ABS | BPF_B;
+ insn->a_reg = 7;
+ insn->imm = fp->k;
+
+ insn++;
+ insn->code = BPF_ALU | BPF_AND | BPF_K;
+ insn->a_reg = 7;
+ insn->imm = 0xf;
+
+ insn++;
+ insn->code = BPF_ALU | BPF_LSH | BPF_K;
+ insn->a_reg = 7;
+ insn->imm = 2;
+ break;
+
+ /* RET_K, RET_A are remaped into 2 insns */
+ case BPF_RET | BPF_A:
+ case BPF_RET | BPF_K:
+ insn->code = BPF_ALU | BPF_MOV |
+ (BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
+ insn->a_reg = 0;
+ insn->x_reg = 6;
+ insn->imm = fp->k;
+
+ insn++;
+ insn->code = BPF_RET | BPF_K;
+ break;
+
+ /* store to stack */
+ case BPF_ST:
+ case BPF_STX:
+ insn->code = BPF_STX | BPF_MEM | BPF_W;
+ insn->a_reg = 10;
+ insn->x_reg = fp->code == BPF_ST ? 6 : 7;
+ insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+ break;
+
+ /* load from stack */
+ case BPF_LD | BPF_MEM:
+ case BPF_LDX | BPF_MEM:
+ insn->code = BPF_LDX | BPF_MEM | BPF_W;
+ insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+ insn->x_reg = 10;
+ insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+ break;
+
+ /* A = K or X = K */
+ case BPF_LD | BPF_IMM:
+ case BPF_LDX | BPF_IMM:
+ insn->code = BPF_ALU | BPF_MOV | BPF_K;
+ insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+ insn->imm = fp->k;
+ break;
+
+ /* X = A */
+ case BPF_MISC | BPF_TAX:
+ insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+ insn->a_reg = 7;
+ insn->x_reg = 6;
+ break;
+
+ /* A = X */
+ case BPF_MISC | BPF_TXA:
+ insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+ insn->a_reg = 6;
+ insn->x_reg = 7;
+ break;
+
+ /* A = skb->len or X = skb->len */
+ case BPF_LD|BPF_W|BPF_LEN:
+ case BPF_LDX|BPF_W|BPF_LEN:
+ insn->code = BPF_LDX | BPF_MEM | BPF_W;
+ insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+ insn->x_reg = 1;
+ insn->off = offsetof(struct sk_buff, len);
+ break;
+
+ default:
+ /* pr_err("unknown opcode %02x\n", fp->code); */
+ goto err;
+ }
+
+ insn++;
+ if (new_prog) {
+ memcpy(new_insn, tmp_insns,
+ sizeof(*insn) * (insn - tmp_insns));
+ }
+ new_insn += insn - tmp_insns;
+ }
+
+ if (!new_prog) {
+ /* only calculating new length */
+ *p_new_len = new_insn - new_prog;
+ return 0;
+ }
+
+ pass++;
+ if (new_len != new_insn - new_prog) {
+ new_len = new_insn - new_prog;
+ if (pass > 2)
+ goto err;
+ goto do_pass;
+ }
+ kfree(addrs);
+ if (*p_new_len != new_len)
+ /* inconsistent new program length */
+ pr_err("bpf_convert() usage error\n");
+ return 0;
+err:
+ kfree(addrs);
+ return -EINVAL;
+}
+
+notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
+{
+ u64 stack[64];
+ u64 regs[16];
+ void *ptr;
+ u64 tmp;
+ int off;
+
+#ifdef __x86_64
+#define LOAD_IMM /**/
+#define K insn->imm
+#else
+#define LOAD_IMM (K = insn->imm)
+ s32 K = insn->imm;
+#endif
+
+#ifdef DEBUG
+#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
+#else
+#define DEBUG_INSN
+#endif
+
+#define A regs[insn->a_reg]
+#define X regs[insn->x_reg]
+
+#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
+#define L(A, B, C) L_##A##B##C:
+
+#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
+#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
+/* some compilers may need help:
+ * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code]; })
+ */
+
+ static const void *jt[256] = {
+ [0 ... 255] = &&L_default,
+ DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
+ DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
+ DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
+ DL(BPF_ALU, BPF_OR, BPF_X) DL(BPF_ALU, BPF_OR, BPF_K)
+ DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
+ DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
+ DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
+ DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
+ DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
+ DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
+ DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
+ DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD, BPF_K)
+ DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB, BPF_K)
+ DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND, BPF_K)
+ DL(BPF_ALU64, BPF_OR, BPF_X) DL(BPF_ALU64, BPF_OR, BPF_K)
+ DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH, BPF_K)
+ DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH, BPF_K)
+ DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR, BPF_K)
+ DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL, BPF_K)
+ DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV, BPF_K)
+ DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH, BPF_K)
+ DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV, BPF_K)
+ DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD, BPF_K)
+ DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
+ DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
+ DL(BPF_ALU, BPF_NEG, 0)
+ DL(BPF_JMP, BPF_CALL, 0)
+ DL(BPF_JMP, BPF_JA, 0)
+ DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
+ DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
+ DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
+ DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
+ DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
+ DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
+ DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
+ DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
+ DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
+ DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
+ DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
+ DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
+ DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
+ DL(BPF_STX, BPF_XADD, BPF_W)
+#ifdef CONFIG_64BIT
+ DL(BPF_STX, BPF_XADD, BPF_DW)
+#endif
+ DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
+ DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
+ DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
+ DL(BPF_RET, BPF_K, 0)
+ };
+
+ regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
+ regs[1/* BPF R1 */] = (u64)(ulong)ctx;
+
+ DEBUG_INSN;
+ /* execute 1st insn */
+select_insn:
+ goto *jt[insn->code];
+
+ /* ALU */
+#define ALU(OPCODE, OP) \
+ L_BPF_ALU64##OPCODE##BPF_X: \
+ A = A OP X; \
+ CONT; \
+ L_BPF_ALU##OPCODE##BPF_X: \
+ A = (u32)A OP (u32)X; \
+ CONT; \
+ L_BPF_ALU64##OPCODE##BPF_K: \
+ A = A OP K; \
+ CONT; \
+ L_BPF_ALU##OPCODE##BPF_K: \
+ A = (u32)A OP (u32)K; \
+ CONT;
+
+ ALU(BPF_ADD, +)
+ ALU(BPF_SUB, -)
+ ALU(BPF_AND, &)
+ ALU(BPF_OR, |)
+ ALU(BPF_LSH, <<)
+ ALU(BPF_RSH, >>)
+ ALU(BPF_XOR, ^)
+ ALU(BPF_MUL, *)
+
+ L(BPF_ALU, BPF_NEG, 0)
+ A = (u32)-A;
+ CONT;
+ L(BPF_ALU, BPF_MOV, BPF_X)
+ A = (u32)X;
+ CONT;
+ L(BPF_ALU, BPF_MOV, BPF_K)
+ A = (u32)K;
+ CONT;
+ L(BPF_ALU64, BPF_MOV, BPF_X)
+ A = X;
+ CONT;
+ L(BPF_ALU64, BPF_MOV, BPF_K)
+ A = K;
+ CONT;
+ L(BPF_ALU64, BPF_ARSH, BPF_X)
+ (*(s64 *) &A) >>= X;
+ CONT;
+ L(BPF_ALU64, BPF_ARSH, BPF_K)
+ (*(s64 *) &A) >>= K;
+ CONT;
+ L(BPF_ALU64, BPF_MOD, BPF_X)
+ tmp = A;
+ if (X)
+ A = do_div(tmp, X);
+ CONT;
+ L(BPF_ALU, BPF_MOD, BPF_X)
+ tmp = (u32)A;
+ if (X)
+ A = do_div(tmp, (u32)X);
+ CONT;
+ L(BPF_ALU64, BPF_MOD, BPF_K)
+ tmp = A;
+ if (K)
+ A = do_div(tmp, K);
+ CONT;
+ L(BPF_ALU, BPF_MOD, BPF_K)
+ tmp = (u32)A;
+ if (K)
+ A = do_div(tmp, (u32)K);
+ CONT;
+ L(BPF_ALU64, BPF_DIV, BPF_X)
+ if (X)
+ do_div(A, X);
+ CONT;
+ L(BPF_ALU, BPF_DIV, BPF_X)
+ tmp = (u32)A;
+ if (X)
+ do_div(tmp, (u32)X);
+ A = (u32)tmp;
+ CONT;
+ L(BPF_ALU64, BPF_DIV, BPF_K)
+ if (K)
+ do_div(A, K);
+ CONT;
+ L(BPF_ALU, BPF_DIV, BPF_K)
+ tmp = (u32)A;
+ if (K)
+ do_div(tmp, (u32)K);
+ A = (u32)tmp;
+ CONT;
+ L(BPF_ALU64, BPF_BSWAP32, BPF_X)
+ A = swab32(A);
+ CONT;
+ L(BPF_ALU64, BPF_BSWAP64, BPF_X)
+ A = swab64(A);
+ CONT;
+
+ /* CALL */
+ L(BPF_JMP, BPF_CALL, 0)
+ /* TODO execute_func(K, regs); */
+ CONT;
+
+ /* JMP */
+ L(BPF_JMP, BPF_JA, 0)
+ insn += insn->off;
+ CONT;
+ L(BPF_JMP, BPF_JEQ, BPF_X)
+ if (A == X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JEQ, BPF_K)
+ if (A == K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JNE, BPF_X)
+ if (A != X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JNE, BPF_K)
+ if (A != K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JGT, BPF_X)
+ if (A > X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JGT, BPF_K)
+ if (A > K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JGE, BPF_X)
+ if (A >= X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JGE, BPF_K)
+ if (A >= K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JSGT, BPF_X)
+ if (((s64)A) > ((s64)X)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JSGT, BPF_K)
+ if (((s64)A) > ((s64)K)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JSGE, BPF_X)
+ if (((s64)A) >= ((s64)X)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JSGE, BPF_K)
+ if (((s64)A) >= ((s64)K)) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JSET, BPF_X)
+ if (A & X) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+ L(BPF_JMP, BPF_JSET, BPF_K)
+ if (A & (u32)K) {
+ insn += insn->off;
+ CONT_JMP;
+ }
+ CONT;
+
+ /* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+ L_BPF_STXBPF_MEM##SIZEOP: \
+ *(SIZE *)(ulong)(A + insn->off) = X; \
+ CONT; \
+ L_BPF_STBPF_MEM##SIZEOP: \
+ *(SIZE *)(ulong)(A + insn->off) = K; \
+ CONT; \
+ L_BPF_LDXBPF_MEM##SIZEOP: \
+ A = *(SIZE *)(ulong)(X + insn->off); \
+ CONT;
+
+ LDST(BPF_B, u8)
+ LDST(BPF_H, u16)
+ LDST(BPF_W, u32)
+ LDST(BPF_DW, u64)
+
+ /* STX XADD */
+ L(BPF_STX, BPF_XADD, BPF_W)
+ atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
+ CONT;
+#ifdef CONFIG_64BIT
+ L(BPF_STX, BPF_XADD, BPF_DW)
+ atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
+ CONT;
+#endif
+
+ /* LD from SKB + K */
+ L(BPF_LD, BPF_ABS, BPF_W)
+ off = K;
+load_word:
+ ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
+ if (likely(ptr != NULL)) {
+ A = get_unaligned_be32(ptr);
+ CONT;
+ }
+ return 0;
+
+ L(BPF_LD, BPF_ABS, BPF_H)
+ off = K;
+load_half:
+ ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
+ if (likely(ptr != NULL)) {
+ A = get_unaligned_be16(ptr);
+ CONT;
+ }
+ return 0;
+
+ L(BPF_LD, BPF_ABS, BPF_B)
+ off = K;
+load_byte:
+ ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
+ if (likely(ptr != NULL)) {
+ A = *(u8 *)ptr;
+ CONT;
+ }
+ return 0;
+
+ /* LD from SKB + X + K */
+ L(BPF_LD, BPF_IND, BPF_W)
+ off = K + X;
+ goto load_word;
+
+ L(BPF_LD, BPF_IND, BPF_H)
+ off = K + X;
+ goto load_half;
+
+ L(BPF_LD, BPF_IND, BPF_B)
+ off = K + X;
+ goto load_byte;
+
+ /* RET */
+ L(BPF_RET, BPF_K, 0)
+ return regs[0/* R0 */];
+
+ L_default:
+ /* bpf_check() or bpf_convert() will guarantee that
+ * we never reach here
+ */
+ WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+ return 0;
+#undef DL
+#undef L
+#undef CONT
+#undef A
+#undef X
+#undef K
+#undef LOAD_IMM
+#undef DEBUG_INSN
+#undef ALU
+#undef LDST
+}
+EXPORT_SYMBOL(bpf_run);
+
diff --git a/net/core/filter.c b/net/core/filter.c
index ad30d626a5bd..575bf5fd4335 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
{
struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);

+ if ((void *)fp->bpf_func == (void *)bpf_run)
+ /* arch specific jit_free are expecting this value */
+ fp->bpf_func = sk_run_filter;
+
bpf_jit_free(fp);
}
EXPORT_SYMBOL(sk_filter_release_rcu);
@@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
return 0;
}

+static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
+ struct sock *sk)
+{
+ unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+ struct sock_filter *old_prog;
+ unsigned int sk_fsize;
+ struct sk_filter *fp;
+ int new_len;
+ int err;
+
+ /* store old program into buffer, since chk_filter will remap opcodes */
+ old_prog = kmalloc(fsize, GFP_KERNEL);
+ if (!old_prog)
+ return -ENOMEM;
+
+ if (sk) {
+ if (copy_from_user(old_prog, fprog->filter, fsize)) {
+ err = -EFAULT;
+ goto free_prog;
+ }
+ } else {
+ memcpy(old_prog, fprog->filter, fsize);
+ }
+
+ /* calculate bpf64 program length */
+ err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
+ if (err)
+ goto free_prog;
+
+ sk_fsize = sk_filter_size(new_len);
+ /* allocate sk_filter to store bpf64 program */
+ if (sk)
+ fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+ else
+ fp = kmalloc(sk_fsize, GFP_KERNEL);
+ if (!fp) {
+ err = -ENOMEM;
+ goto free_prog;
+ }
+
+ /* remap old insns into bpf64 */
+ err = bpf_convert(old_prog, fprog->len,
+ (struct bpf_insn *)fp->insns, &new_len);
+ if (err)
+ /* 2nd bpf_convert() can fail only if it fails
+ * to allocate memory, remapping must succeed
+ */
+ goto free_fp;
+
+ /* now chk_filter can overwrite old_prog while checking */
+ err = sk_chk_filter(old_prog, fprog->len);
+ if (err)
+ goto free_fp;
+
+ /* discard old prog */
+ kfree(old_prog);
+
+ atomic_set(&fp->refcnt, 1);
+ fp->len = new_len;
+
+ /* bpf64 insns must be executed by bpf_run */
+ fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
+
+ *pfp = fp;
+ return 0;
+free_fp:
+ if (sk)
+ sock_kfree_s(sk, fp, sk_fsize);
+ else
+ kfree(fp);
+free_prog:
+ kfree(old_prog);
+ return err;
+}
+
/**
* sk_unattached_filter_create - create an unattached filter
* @fprog: the filter program
@@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
if (fprog->filter == NULL)
return -EINVAL;

+ if (bpf64_enable)
+ return bpf64_prepare(pfp, fprog, NULL);
+
fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
if (!fp)
return -ENOMEM;
@@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
if (fprog->filter == NULL)
return -EINVAL;

- fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
- if (!fp)
- return -ENOMEM;
- if (copy_from_user(fp->insns, fprog->filter, fsize)) {
- sock_kfree_s(sk, fp, sk_fsize);
- return -EFAULT;
- }
+ if (bpf64_enable) {
+ err = bpf64_prepare(&fp, fprog, sk);
+ if (err)
+ return err;
+ } else {
+ fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+ if (!fp)
+ return -ENOMEM;
+ if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+ sock_kfree_s(sk, fp, sk_fsize);
+ return -EFAULT;
+ }

- atomic_set(&fp->refcnt, 1);
- fp->len = fprog->len;
+ atomic_set(&fp->refcnt, 1);
+ fp->len = fprog->len;

- err = __sk_prepare_filter(fp);
- if (err) {
- sk_filter_uncharge(sk, fp);
- return err;
+ err = __sk_prepare_filter(fp);
+ if (err) {
+ sk_filter_uncharge(sk, fp);
+ return err;
+ }
}

old_fp = rcu_dereference_protected(sk->sk_filter,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..f03acc0e8950 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
},
#endif
{
+ .procname = "bpf64_enable",
+ .data = &bpf64_enable,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "netdev_tstamp_prequeue",
.data = &netdev_tstamp_prequeue,
.maxlen = sizeof(int),
--
1.7.9.5

2014-02-28 12:46:48

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter

Hi Alexei,

[also cc'ing Hagen and Jesse]

Just some minor comments below ... let me know what you think.

On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
> Extended BPF (or 64-bit BPF) is an instruction set to
> create safe dynamically loadable filters that can call fixed set
> of kernel functions and take generic bpf_context as an input.
> BPF filter is a glue between kernel functions and bpf_context.
> Different kernel subsystems can define their own set of available functions
> and alter BPF machinery for specific use case.
> BPF64 instruction set is designed for efficient mapping to native
> instructions on 64-bit CPUs
>
> Old BPF instructions are remapped on the fly to BPF64
> when sysctl net.core.bpf64_enable=1

Would be nice if the commit message could be extended with what you
have posted in your [PATCH v3 net-next 0/1] email (but without the
changelog, changelog should go after "---" line), so that also this
information will appear here and in the Git log later on. Also please
elaborate more on this commit message. I think, at least, as it's a
bigger change it also needs to include future planned usage scenarios
for user space and for inside the kernel e.g. for OVS and ftrace.

You could make this patch a 2 patch patch-series and put into patch
2/2 all documentation you had in your previous version of the set.
Would be nice to extend the file Documentation/networking/filter.txt
with a description of your extended BPF so that users can read about
them in the same file.

Did you also test that seccomp-BPF still works out?

> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> include/linux/filter.h | 9 +-
> include/linux/netdevice.h | 1 +
> include/uapi/linux/filter.h | 37 ++-
> net/core/Makefile | 2 +-
> net/core/bpf_run.c | 766 +++++++++++++++++++++++++++++++++++++++++++
> net/core/filter.c | 114 ++++++-

I would have liked to see the content from net/core/bpf_run.c to go
directly into net/core/filter.c, not as a separate file, if that's
possible.

> net/core/sysctl_net_core.c | 7 +
> 7 files changed, 913 insertions(+), 23 deletions(-)
> create mode 100644 net/core/bpf_run.c
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index e568c8ef896b..bf3085258f4c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
> extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
> extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
>
> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn *new_prog,
> + int *p_new_len);
> +/* execute bpf64 program */
> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
> +
> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
> #ifdef CONFIG_BPF_JIT
> #include <stdarg.h>
> #include <linux/linkage.h>
> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
> print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
> 16, 1, image, proglen, false);
> }
> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
> #else
> #include <linux/slab.h>
> static inline void bpf_jit_compile(struct sk_filter *fp)
> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
> {
> kfree(fp);
> }
> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
> #endif
>
> static inline int bpf_tell_extensions(void)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5e84483c0650..7b1acefc244e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2971,6 +2971,7 @@ extern int netdev_max_backlog;
> extern int netdev_tstamp_prequeue;
> extern int weight_p;
> extern int bpf_jit_enable;
> +extern int bpf64_enable;

We should keep naming consistent (so either extended BPF or BPF64),
so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
comparisons, you'd still need to load to immediate values, right?

After all your changes, we will still have the bpf_jit_enable knob
in tact, right?

> bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
> struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
> index 8eb9ccaa5b48..70ff29ee6825 100644
> --- a/include/uapi/linux/filter.h
> +++ b/include/uapi/linux/filter.h
> @@ -1,3 +1,4 @@
> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com */
> /*
> * Linux Socket Filter Data Structures
> */

You can merge both comments into one.

> @@ -19,7 +20,7 @@
> * Try and keep these values and structures similar to BSD, especially
> * the BPF code definitions which need to match so you can share filters
> */
> -
> +
> struct sock_filter { /* Filter block */
> __u16 code; /* Actual filter code */
> __u8 jt; /* Jump true */
> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
> #define BPF_JMP 0x05
> #define BPF_RET 0x06
> #define BPF_MISC 0x07
> +#define BPF_ALU64 0x07
> +
> +struct bpf_insn {
> + __u8 code; /* opcode */
> + __u8 a_reg:4; /* dest register*/
> + __u8 x_reg:4; /* source register */
> + __s16 off; /* signed offset */
> + __s32 imm; /* signed immediate constant */
> +};

As we have struct sock_filter and it's immutable due to uapi, I
would have liked to see the new data structure with a consistent
naming scheme, e.g. struct sock_filter_ext {...} for extended
BPF.

> +/* pointer to bpf_context is the first and only argument to BPF program
> + * its definition is use-case specific */
> +struct bpf_context;
>
> /* ld/ldx fields */
> #define BPF_SIZE(code) ((code) & 0x18)
> #define BPF_W 0x00
> #define BPF_H 0x08
> #define BPF_B 0x10
> +#define BPF_DW 0x18
> #define BPF_MODE(code) ((code) & 0xe0)
> #define BPF_IMM 0x00
> #define BPF_ABS 0x20
> @@ -58,6 +73,7 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
> #define BPF_MEM 0x60
> #define BPF_LEN 0x80
> #define BPF_MSH 0xa0
> +#define BPF_XADD 0xc0 /* exclusive add */
>
> /* alu/jmp fields */
> #define BPF_OP(code) ((code) & 0xf0)
> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
> #define BPF_OR 0x40
> #define BPF_AND 0x50
> #define BPF_LSH 0x60
> -#define BPF_RSH 0x70
> +#define BPF_RSH 0x70 /* logical shift right */
> #define BPF_NEG 0x80
> #define BPF_MOD 0x90
> #define BPF_XOR 0xa0
> +#define BPF_MOV 0xb0 /* mov reg to reg */
> +#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
> +#define BPF_BSWAP32 0xd0 /* swap lower 4 bytes of 64-bit register */
> +#define BPF_BSWAP64 0xe0 /* swap all 8 bytes of 64-bit register */
>
> #define BPF_JA 0x00
> -#define BPF_JEQ 0x10
> -#define BPF_JGT 0x20
> -#define BPF_JGE 0x30
> -#define BPF_JSET 0x40
> +#define BPF_JEQ 0x10 /* jump == */
> +#define BPF_JGT 0x20 /* GT is unsigned '>', JA in x86 */
> +#define BPF_JGE 0x30 /* GE is unsigned '>=', JAE in x86 */
> +#define BPF_JSET 0x40 /* if (A & X) */
> +#define BPF_JNE 0x50 /* jump != */
> +#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
> +#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
> +#define BPF_CALL 0x80 /* function call */
> #define BPF_SRC(code) ((code) & 0x08)
> #define BPF_K 0x00
> #define BPF_X 0x08
> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
> #define SKF_NET_OFF (-0x100000)
> #define SKF_LL_OFF (-0x200000)
>
> -
> #endif /* _UAPI__LINUX_FILTER_H__ */
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 9628c20acff6..e622b97f58dc 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
> obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>
> obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
> - neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
> + neighbour.o rtnetlink.o utils.o link_watch.o filter.o bpf_run.o \
> sock_diag.o dev_ioctl.o
>
> obj-$(CONFIG_XFRM) += flow.o
> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
> new file mode 100644
> index 000000000000..fa1862fcbc74
> --- /dev/null
> +++ b/net/core/bpf_run.c
> @@ -0,0 +1,766 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/filter.h>
> +#include <linux/skbuff.h>
> +#include <asm/unaligned.h>
> +
> +int bpf64_enable __read_mostly;
> +
> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k,
> + unsigned int size);
> +
> +static inline void *load_pointer(const struct sk_buff *skb, int k,
> + unsigned int size, void *buffer)
> +{
> + if (k >= 0)
> + return skb_header_pointer(skb, k, size, buffer);
> + return bpf_internal_load_pointer_neg_helper(skb, k, size);
> +}
> +
> +static const char *const bpf_class_string[] = {
> + "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
> +};
> +
> +static const char *const bpf_alu_string[] = {
> + "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
> + "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
> +};
> +
> +static const char *const bpf_ldst_string[] = {
> + "u32", "u16", "u8", "u64"
> +};
> +
> +static const char *const bpf_jmp_string[] = {
> + "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
> +};
> +
> +static const char *reg_to_str(int regno, u64 *regs)
> +{
> + static char reg_value[16][32];
> + if (!regs)
> + return "";
> + snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
> + regs[regno]);
> + return reg_value[regno];
> +}
> +
> +#define R(regno) reg_to_str(regno, regs)
> +
> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
> +{
> + u16 class = BPF_CLASS(insn->code);
> + if (class == BPF_ALU || class == BPF_ALU64) {
> + if (BPF_SRC(insn->code) == BPF_X)
> + pr_info("code_%02x %sr%d%s %s r%d%s\n",
> + insn->code, class == BPF_ALU ? "(u32)" : "",
> + insn->a_reg, R(insn->a_reg),
> + bpf_alu_string[BPF_OP(insn->code) >> 4],
> + insn->x_reg, R(insn->x_reg));
> + else
> + pr_info("code_%02x %sr%d%s %s %d\n",
> + insn->code, class == BPF_ALU ? "(u32)" : "",
> + insn->a_reg, R(insn->a_reg),
> + bpf_alu_string[BPF_OP(insn->code) >> 4],
> + insn->imm);
> + } else if (class == BPF_STX) {
> + if (BPF_MODE(insn->code) == BPF_MEM)
> + pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->a_reg, R(insn->a_reg),
> + insn->off, insn->x_reg, R(insn->x_reg));
> + else if (BPF_MODE(insn->code) == BPF_XADD)
> + pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->a_reg, R(insn->a_reg), insn->off,
> + insn->x_reg, R(insn->x_reg));
> + else
> + pr_info("BUG_%02x\n", insn->code);
> + } else if (class == BPF_ST) {
> + if (BPF_MODE(insn->code) != BPF_MEM) {
> + pr_info("BUG_st_%02x\n", insn->code);
> + return;
> + }
> + pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
> + insn->code,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->a_reg, R(insn->a_reg),
> + insn->off, insn->imm);
> + } else if (class == BPF_LDX) {
> + if (BPF_MODE(insn->code) == BPF_MEM) {
> + pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
> + insn->code, insn->a_reg,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->x_reg, R(insn->x_reg), insn->off);
> + } else {
> + pr_info("BUG_ldx_%02x\n", insn->code);
> + }
> + } else if (class == BPF_LD) {
> + if (BPF_MODE(insn->code) == BPF_ABS) {
> + pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
> + insn->code, insn->a_reg,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->imm);
> + } else if (BPF_MODE(insn->code) == BPF_IND) {
> + pr_info("code_%02x r%d = *(%s *)(skb + r%d%s %+d)\n",
> + insn->code, insn->a_reg,
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->x_reg, R(insn->x_reg), insn->imm);
> + } else {
> + pr_info("BUG_ld_%02x\n", insn->code);
> + }
> + } else if (class == BPF_JMP) {
> + u16 opcode = BPF_OP(insn->code);
> + if (opcode == BPF_CALL) {
> + pr_info("code_%02x call %d\n", insn->code, insn->imm);
> + } else if (insn->code == (BPF_JMP | BPF_JA)) {
> + pr_info("code_%02x goto pc%+d\n",
> + insn->code, insn->off);
> + } else if (BPF_SRC(insn->code) == BPF_X) {
> + pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
> + insn->code, insn->a_reg, R(insn->a_reg),
> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
> + insn->x_reg, R(insn->x_reg), insn->off);
> + } else {
> + pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
> + insn->code, insn->a_reg, R(insn->a_reg),
> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
> + insn->imm, insn->off);
> + }
> + } else {
> + pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
> + }
> +}
> +EXPORT_SYMBOL(pr_info_bpf_insn);

Why would that need to be exported as a symbol?

I would actually like to avoid having this pr_info* inside the kernel.
Couldn't this be done e.g. through systemtap script that could e.g. be
placed under tools/net/ or inside the documentation file?

> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style (BPF64)
> + *
> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate new
> + * program length in one pass
> + *
> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
> + *
> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
> + */

Would be nice to have the API comment it in kdoc format.

> +int bpf_convert(struct sock_filter *old_prog, int len,
> + struct bpf_insn *new_prog, int *p_new_len)
> +{

If we place it into net/core/filter.c, it would be nice to keep naming
conventions consistent, e.g. sk_convert_filter() or so.

> + struct bpf_insn *new_insn;
> + struct sock_filter *fp;
> + int *addrs = NULL;
> + int new_len = 0;
> + int pass = 0;
> + int tgt, i;
> +
> + if (len <= 0 || len >= BPF_MAXINSNS)
> + return -EINVAL;
> +
> + if (new_prog) {
> + addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
> + if (!addrs)
> + return -ENOMEM;
> + }
> +
> +do_pass:
> + new_insn = new_prog;
> + fp = old_prog;
> + for (i = 0; i < len; fp++, i++) {
> + struct bpf_insn tmp_insns[3] = {};
> + struct bpf_insn *insn = tmp_insns;
> +
> + if (addrs)
> + addrs[i] = new_insn - new_prog;
> +
> + switch (fp->code) {
> + /* all arithmetic insns and skb loads map as-is */
> + case BPF_ALU | BPF_ADD | BPF_X:
> + case BPF_ALU | BPF_ADD | BPF_K:
> + case BPF_ALU | BPF_SUB | BPF_X:
> + case BPF_ALU | BPF_SUB | BPF_K:
> + case BPF_ALU | BPF_AND | BPF_X:
> + case BPF_ALU | BPF_AND | BPF_K:
> + case BPF_ALU | BPF_OR | BPF_X:
> + case BPF_ALU | BPF_OR | BPF_K:
> + case BPF_ALU | BPF_LSH | BPF_X:
> + case BPF_ALU | BPF_LSH | BPF_K:
> + case BPF_ALU | BPF_RSH | BPF_X:
> + case BPF_ALU | BPF_RSH | BPF_K:
> + case BPF_ALU | BPF_XOR | BPF_X:
> + case BPF_ALU | BPF_XOR | BPF_K:
> + case BPF_ALU | BPF_MUL | BPF_X:
> + case BPF_ALU | BPF_MUL | BPF_K:
> + case BPF_ALU | BPF_DIV | BPF_X:
> + case BPF_ALU | BPF_DIV | BPF_K:
> + case BPF_ALU | BPF_MOD | BPF_X:
> + case BPF_ALU | BPF_MOD | BPF_K:
> + case BPF_ALU | BPF_NEG:
> + case BPF_LD | BPF_ABS | BPF_W:
> + case BPF_LD | BPF_ABS | BPF_H:
> + case BPF_LD | BPF_ABS | BPF_B:
> + case BPF_LD | BPF_IND | BPF_W:
> + case BPF_LD | BPF_IND | BPF_H:
> + case BPF_LD | BPF_IND | BPF_B:

I think here and elsewhere for similar constructions, we could use
BPF_S_* helpers that was introduced by Hagen in commit 01f2f3f6ef4d076
("net: optimize Berkeley Packet Filter (BPF) processing").

> + insn->code = fp->code;
> + insn->a_reg = 6;
> + insn->x_reg = 7;
> + insn->imm = fp->k;
> + break;
> +
> + /* jump opcodes map as-is, but offsets need adjustment */
> + case BPF_JMP | BPF_JA:
> + tgt = i + fp->k + 1;
> + insn->code = fp->code;
> +#define EMIT_JMP \
> + do { \
> + if (tgt >= len || tgt < 0) \
> + goto err; \
> + insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
> + } while (0)
> +
> + EMIT_JMP;
> + break;
> +
> + case BPF_JMP | BPF_JEQ | BPF_K:
> + case BPF_JMP | BPF_JEQ | BPF_X:
> + case BPF_JMP | BPF_JSET | BPF_K:
> + case BPF_JMP | BPF_JSET | BPF_X:
> + case BPF_JMP | BPF_JGT | BPF_K:
> + case BPF_JMP | BPF_JGT | BPF_X:
> + case BPF_JMP | BPF_JGE | BPF_K:
> + case BPF_JMP | BPF_JGE | BPF_X:
> + insn->a_reg = 6;
> + insn->x_reg = 7;
> + insn->imm = fp->k;
> + /* common case where 'jump_false' is next insn */
> + if (fp->jf == 0) {
> + insn->code = fp->code;
> + tgt = i + fp->jt + 1;
> + EMIT_JMP;
> + break;
> + }
> + /* convert JEQ into JNE when 'jump_true' is next insn */
> + if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
> + insn->code = BPF_JMP | BPF_JNE |
> + BPF_SRC(fp->code);
> + tgt = i + fp->jf + 1;
> + EMIT_JMP;
> + break;
> + }
> + /* other jumps are mapped into two insns: Jxx and JA */
> + tgt = i + fp->jt + 1;
> + insn->code = fp->code;
> + EMIT_JMP;
> +
> + insn++;
> + insn->code = BPF_JMP | BPF_JA;
> + tgt = i + fp->jf + 1;
> + EMIT_JMP;
> + /* adjust pc relative offset, since it's a 2nd insn */
> + insn->off--;
> + break;
> +
> + /* ldxb 4*([14]&0xf) is remaped into 3 insns */
> + case BPF_LDX | BPF_MSH | BPF_B:
> + insn->code = BPF_LD | BPF_ABS | BPF_B;
> + insn->a_reg = 7;
> + insn->imm = fp->k;
> +
> + insn++;
> + insn->code = BPF_ALU | BPF_AND | BPF_K;
> + insn->a_reg = 7;
> + insn->imm = 0xf;
> +
> + insn++;
> + insn->code = BPF_ALU | BPF_LSH | BPF_K;
> + insn->a_reg = 7;
> + insn->imm = 2;
> + break;
> +
> + /* RET_K, RET_A are remaped into 2 insns */
> + case BPF_RET | BPF_A:
> + case BPF_RET | BPF_K:
> + insn->code = BPF_ALU | BPF_MOV |
> + (BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
> + insn->a_reg = 0;
> + insn->x_reg = 6;
> + insn->imm = fp->k;
> +
> + insn++;
> + insn->code = BPF_RET | BPF_K;
> + break;
> +
> + /* store to stack */
> + case BPF_ST:
> + case BPF_STX:
> + insn->code = BPF_STX | BPF_MEM | BPF_W;
> + insn->a_reg = 10;
> + insn->x_reg = fp->code == BPF_ST ? 6 : 7;
> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> + break;
> +
> + /* load from stack */
> + case BPF_LD | BPF_MEM:
> + case BPF_LDX | BPF_MEM:
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> + insn->x_reg = 10;
> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> + break;
> +
> + /* A = K or X = K */
> + case BPF_LD | BPF_IMM:
> + case BPF_LDX | BPF_IMM:
> + insn->code = BPF_ALU | BPF_MOV | BPF_K;
> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> + insn->imm = fp->k;
> + break;
> +
> + /* X = A */
> + case BPF_MISC | BPF_TAX:
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = 7;
> + insn->x_reg = 6;
> + break;
> +
> + /* A = X */
> + case BPF_MISC | BPF_TXA:
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = 6;
> + insn->x_reg = 7;
> + break;
> +
> + /* A = skb->len or X = skb->len */
> + case BPF_LD|BPF_W|BPF_LEN:
> + case BPF_LDX|BPF_W|BPF_LEN:
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> + insn->x_reg = 1;
> + insn->off = offsetof(struct sk_buff, len);
> + break;
> +
> + default:
> + /* pr_err("unknown opcode %02x\n", fp->code); */
> + goto err;
> + }
> +
> + insn++;
> + if (new_prog) {
> + memcpy(new_insn, tmp_insns,
> + sizeof(*insn) * (insn - tmp_insns));
> + }
> + new_insn += insn - tmp_insns;
> + }
> +
> + if (!new_prog) {
> + /* only calculating new length */
> + *p_new_len = new_insn - new_prog;
> + return 0;
> + }
> +
> + pass++;
> + if (new_len != new_insn - new_prog) {
> + new_len = new_insn - new_prog;
> + if (pass > 2)
> + goto err;
> + goto do_pass;
> + }
> + kfree(addrs);
> + if (*p_new_len != new_len)
> + /* inconsistent new program length */
> + pr_err("bpf_convert() usage error\n");
> + return 0;
> +err:
> + kfree(addrs);
> + return -EINVAL;
> +}
> +
> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
> +{

Similarly, sk_run_filter_ext()?

> + u64 stack[64];
> + u64 regs[16];
> + void *ptr;
> + u64 tmp;
> + int off;
> +
> +#ifdef __x86_64
> +#define LOAD_IMM /**/
> +#define K insn->imm
> +#else
> +#define LOAD_IMM (K = insn->imm)
> + s32 K = insn->imm;
> +#endif
> +
> +#ifdef DEBUG
> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
> +#else
> +#define DEBUG_INSN
> +#endif

This DEBUG hunk could then be removed when we use a stap script
instead, for example.

> +#define A regs[insn->a_reg]
> +#define X regs[insn->x_reg]
> +
> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
> +#define L(A, B, C) L_##A##B##C:

Could we get rid of these two defines? I know you're defining
and using that as labels, but it's not obvious at first what
'jt' does. Maybe 'jt_labels' ?

> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })

Not a big fan of macros, but here it seems fine though.

> +/* some compilers may need help:
> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code]; })
> + */
> +
> + static const void *jt[256] = {
> + [0 ... 255] = &&L_default,

Nit: please just one define per line below:

> + DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
> + DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
> + DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
> + DL(BPF_ALU, BPF_OR, BPF_X) DL(BPF_ALU, BPF_OR, BPF_K)
> + DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
> + DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
> + DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
> + DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
> + DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
> + DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
> + DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
> + DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD, BPF_K)
> + DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB, BPF_K)
> + DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND, BPF_K)
> + DL(BPF_ALU64, BPF_OR, BPF_X) DL(BPF_ALU64, BPF_OR, BPF_K)
> + DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH, BPF_K)
> + DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH, BPF_K)
> + DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR, BPF_K)
> + DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL, BPF_K)
> + DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV, BPF_K)
> + DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH, BPF_K)
> + DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV, BPF_K)
> + DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD, BPF_K)
> + DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
> + DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
> + DL(BPF_ALU, BPF_NEG, 0)
> + DL(BPF_JMP, BPF_CALL, 0)
> + DL(BPF_JMP, BPF_JA, 0)
> + DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
> + DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
> + DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
> + DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
> + DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
> + DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
> + DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
> + DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
> + DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
> + DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
> + DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
> + DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
> + DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
> + DL(BPF_STX, BPF_XADD, BPF_W)
> +#ifdef CONFIG_64BIT
> + DL(BPF_STX, BPF_XADD, BPF_DW)
> +#endif
> + DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
> + DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
> + DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
> + DL(BPF_RET, BPF_K, 0)
> + };
> +
> + regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
> + regs[1/* BPF R1 */] = (u64)(ulong)ctx;
> +
> + DEBUG_INSN;
> + /* execute 1st insn */
> +select_insn:
> + goto *jt[insn->code];
> +
> + /* ALU */
> +#define ALU(OPCODE, OP) \
> + L_BPF_ALU64##OPCODE##BPF_X: \
> + A = A OP X; \
> + CONT; \
> + L_BPF_ALU##OPCODE##BPF_X: \
> + A = (u32)A OP (u32)X; \
> + CONT; \
> + L_BPF_ALU64##OPCODE##BPF_K: \
> + A = A OP K; \
> + CONT; \
> + L_BPF_ALU##OPCODE##BPF_K: \
> + A = (u32)A OP (u32)K; \
> + CONT;
> +
> + ALU(BPF_ADD, +)
> + ALU(BPF_SUB, -)
> + ALU(BPF_AND, &)
> + ALU(BPF_OR, |)
> + ALU(BPF_LSH, <<)
> + ALU(BPF_RSH, >>)
> + ALU(BPF_XOR, ^)
> + ALU(BPF_MUL, *)
> +
> + L(BPF_ALU, BPF_NEG, 0)
> + A = (u32)-A;
> + CONT;
> + L(BPF_ALU, BPF_MOV, BPF_X)
> + A = (u32)X;
> + CONT;
> + L(BPF_ALU, BPF_MOV, BPF_K)
> + A = (u32)K;
> + CONT;
> + L(BPF_ALU64, BPF_MOV, BPF_X)
> + A = X;
> + CONT;
> + L(BPF_ALU64, BPF_MOV, BPF_K)
> + A = K;
> + CONT;
> + L(BPF_ALU64, BPF_ARSH, BPF_X)
> + (*(s64 *) &A) >>= X;
> + CONT;
> + L(BPF_ALU64, BPF_ARSH, BPF_K)
> + (*(s64 *) &A) >>= K;
> + CONT;
> + L(BPF_ALU64, BPF_MOD, BPF_X)
> + tmp = A;
> + if (X)
> + A = do_div(tmp, X);
> + CONT;
> + L(BPF_ALU, BPF_MOD, BPF_X)
> + tmp = (u32)A;
> + if (X)
> + A = do_div(tmp, (u32)X);
> + CONT;
> + L(BPF_ALU64, BPF_MOD, BPF_K)
> + tmp = A;
> + if (K)
> + A = do_div(tmp, K);
> + CONT;
> + L(BPF_ALU, BPF_MOD, BPF_K)
> + tmp = (u32)A;
> + if (K)
> + A = do_div(tmp, (u32)K);
> + CONT;
> + L(BPF_ALU64, BPF_DIV, BPF_X)
> + if (X)
> + do_div(A, X);
> + CONT;
> + L(BPF_ALU, BPF_DIV, BPF_X)
> + tmp = (u32)A;
> + if (X)
> + do_div(tmp, (u32)X);
> + A = (u32)tmp;
> + CONT;
> + L(BPF_ALU64, BPF_DIV, BPF_K)
> + if (K)
> + do_div(A, K);
> + CONT;
> + L(BPF_ALU, BPF_DIV, BPF_K)
> + tmp = (u32)A;
> + if (K)
> + do_div(tmp, (u32)K);
> + A = (u32)tmp;
> + CONT;
> + L(BPF_ALU64, BPF_BSWAP32, BPF_X)
> + A = swab32(A);
> + CONT;
> + L(BPF_ALU64, BPF_BSWAP64, BPF_X)
> + A = swab64(A);
> + CONT;
> +
> + /* CALL */
> + L(BPF_JMP, BPF_CALL, 0)
> + /* TODO execute_func(K, regs); */
> + CONT;

Would the filter checker for now just return -EINVAL when this is used?

> + /* JMP */
> + L(BPF_JMP, BPF_JA, 0)
> + insn += insn->off;
> + CONT;
> + L(BPF_JMP, BPF_JEQ, BPF_X)
> + if (A == X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JEQ, BPF_K)
> + if (A == K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JNE, BPF_X)
> + if (A != X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JNE, BPF_K)
> + if (A != K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JGT, BPF_X)
> + if (A > X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JGT, BPF_K)
> + if (A > K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JGE, BPF_X)
> + if (A >= X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JGE, BPF_K)
> + if (A >= K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JSGT, BPF_X)
> + if (((s64)A) > ((s64)X)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JSGT, BPF_K)
> + if (((s64)A) > ((s64)K)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JSGE, BPF_X)
> + if (((s64)A) >= ((s64)X)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JSGE, BPF_K)
> + if (((s64)A) >= ((s64)K)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JSET, BPF_X)
> + if (A & X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + L(BPF_JMP, BPF_JSET, BPF_K)
> + if (A & (u32)K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;

Okay, so right now we only support forward jumps, right? And these
are checked by the old checker code I'd assume?

> + /* STX and ST and LDX*/
> +#define LDST(SIZEOP, SIZE) \
> + L_BPF_STXBPF_MEM##SIZEOP: \
> + *(SIZE *)(ulong)(A + insn->off) = X; \
> + CONT; \
> + L_BPF_STBPF_MEM##SIZEOP: \
> + *(SIZE *)(ulong)(A + insn->off) = K; \
> + CONT; \
> + L_BPF_LDXBPF_MEM##SIZEOP: \
> + A = *(SIZE *)(ulong)(X + insn->off); \
> + CONT;
> +
> + LDST(BPF_B, u8)
> + LDST(BPF_H, u16)
> + LDST(BPF_W, u32)
> + LDST(BPF_DW, u64)
> +
> + /* STX XADD */
> + L(BPF_STX, BPF_XADD, BPF_W)
> + atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
> + CONT;
> +#ifdef CONFIG_64BIT
> + L(BPF_STX, BPF_XADD, BPF_DW)
> + atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
> + CONT;
> +#endif
> +
> + /* LD from SKB + K */

Nit: indent one tab too much (here and elsewhere)

> + L(BPF_LD, BPF_ABS, BPF_W)
> + off = K;
> +load_word:
> + ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
> + if (likely(ptr != NULL)) {
> + A = get_unaligned_be32(ptr);
> + CONT;
> + }
> + return 0;
> +
> + L(BPF_LD, BPF_ABS, BPF_H)
> + off = K;
> +load_half:
> + ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
> + if (likely(ptr != NULL)) {
> + A = get_unaligned_be16(ptr);
> + CONT;
> + }
> + return 0;
> +
> + L(BPF_LD, BPF_ABS, BPF_B)
> + off = K;
> +load_byte:
> + ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
> + if (likely(ptr != NULL)) {
> + A = *(u8 *)ptr;
> + CONT;
> + }
> + return 0;
> +
> + /* LD from SKB + X + K */

Nit: ditto

> + L(BPF_LD, BPF_IND, BPF_W)
> + off = K + X;
> + goto load_word;
> +
> + L(BPF_LD, BPF_IND, BPF_H)
> + off = K + X;
> + goto load_half;
> +
> + L(BPF_LD, BPF_IND, BPF_B)
> + off = K + X;
> + goto load_byte;
> +
> + /* RET */
> + L(BPF_RET, BPF_K, 0)
> + return regs[0/* R0 */];
> +
> + L_default:
> + /* bpf_check() or bpf_convert() will guarantee that
> + * we never reach here
> + */
> + WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
> + return 0;
> +#undef DL
> +#undef L
> +#undef CONT
> +#undef A
> +#undef X
> +#undef K
> +#undef LOAD_IMM
> +#undef DEBUG_INSN
> +#undef ALU
> +#undef LDST
> +}
> +EXPORT_SYMBOL(bpf_run);
> +
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ad30d626a5bd..575bf5fd4335 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
> {
> struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>
> + if ((void *)fp->bpf_func == (void *)bpf_run)
> + /* arch specific jit_free are expecting this value */
> + fp->bpf_func = sk_run_filter;
> +
> bpf_jit_free(fp);
> }
> EXPORT_SYMBOL(sk_filter_release_rcu);
> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
> return 0;
> }
>
> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
> + struct sock *sk)
> +{

sk_prepare_filter_ext()?

> + unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
> + struct sock_filter *old_prog;
> + unsigned int sk_fsize;
> + struct sk_filter *fp;
> + int new_len;
> + int err;
> +
> + /* store old program into buffer, since chk_filter will remap opcodes */
> + old_prog = kmalloc(fsize, GFP_KERNEL);
> + if (!old_prog)
> + return -ENOMEM;
> +
> + if (sk) {
> + if (copy_from_user(old_prog, fprog->filter, fsize)) {
> + err = -EFAULT;
> + goto free_prog;
> + }
> + } else {
> + memcpy(old_prog, fprog->filter, fsize);
> + }
> +
> + /* calculate bpf64 program length */
> + err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
> + if (err)
> + goto free_prog;
> +
> + sk_fsize = sk_filter_size(new_len);
> + /* allocate sk_filter to store bpf64 program */
> + if (sk)
> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> + else
> + fp = kmalloc(sk_fsize, GFP_KERNEL);
> + if (!fp) {
> + err = -ENOMEM;
> + goto free_prog;
> + }
> +
> + /* remap old insns into bpf64 */
> + err = bpf_convert(old_prog, fprog->len,
> + (struct bpf_insn *)fp->insns, &new_len);
> + if (err)
> + /* 2nd bpf_convert() can fail only if it fails
> + * to allocate memory, remapping must succeed
> + */
> + goto free_fp;
> +
> + /* now chk_filter can overwrite old_prog while checking */
> + err = sk_chk_filter(old_prog, fprog->len);
> + if (err)
> + goto free_fp;
> +
> + /* discard old prog */
> + kfree(old_prog);
> +
> + atomic_set(&fp->refcnt, 1);
> + fp->len = new_len;
> +
> + /* bpf64 insns must be executed by bpf_run */
> + fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
> +
> + *pfp = fp;
> + return 0;
> +free_fp:
> + if (sk)
> + sock_kfree_s(sk, fp, sk_fsize);
> + else
> + kfree(fp);
> +free_prog:
> + kfree(old_prog);
> + return err;
> +}
> +
> /**
> * sk_unattached_filter_create - create an unattached filter
> * @fprog: the filter program
> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
> if (fprog->filter == NULL)
> return -EINVAL;
>
> + if (bpf64_enable)
> + return bpf64_prepare(pfp, fprog, NULL);
> +
> fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
> if (!fp)
> return -ENOMEM;
> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
> if (fprog->filter == NULL)
> return -EINVAL;
>
> - fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> - if (!fp)
> - return -ENOMEM;
> - if (copy_from_user(fp->insns, fprog->filter, fsize)) {
> - sock_kfree_s(sk, fp, sk_fsize);
> - return -EFAULT;
> - }
> + if (bpf64_enable) {
> + err = bpf64_prepare(&fp, fprog, sk);
> + if (err)
> + return err;
> + } else {
> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> + if (!fp)
> + return -ENOMEM;
> + if (copy_from_user(fp->insns, fprog->filter, fsize)) {
> + sock_kfree_s(sk, fp, sk_fsize);
> + return -EFAULT;
> + }
>
> - atomic_set(&fp->refcnt, 1);
> - fp->len = fprog->len;
> + atomic_set(&fp->refcnt, 1);
> + fp->len = fprog->len;
>
> - err = __sk_prepare_filter(fp);
> - if (err) {
> - sk_filter_uncharge(sk, fp);
> - return err;
> + err = __sk_prepare_filter(fp);
> + if (err) {
> + sk_filter_uncharge(sk, fp);
> + return err;
> + }
> }
>
> old_fp = rcu_dereference_protected(sk->sk_filter,
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index cf9cd13509a7..f03acc0e8950 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
> },
> #endif
> {
> + .procname = "bpf64_enable",
> + .data = &bpf64_enable,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec
> + },
> + {
> .procname = "netdev_tstamp_prequeue",
> .data = &netdev_tstamp_prequeue,
> .maxlen = sizeof(int),
>

Hope some of the comments made sense. ;-)

Thanks,

Daniel

2014-02-28 20:53:36

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter

On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <[email protected]> wrote:
> Hi Alexei,
>
> [also cc'ing Hagen and Jesse]
>
> Just some minor comments below ... let me know what you think.

Thank you for review! Comments below.

> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>
>> Extended BPF (or 64-bit BPF) is an instruction set to
>> create safe dynamically loadable filters that can call fixed set
>> of kernel functions and take generic bpf_context as an input.
>> BPF filter is a glue between kernel functions and bpf_context.
>> Different kernel subsystems can define their own set of available
>> functions
>> and alter BPF machinery for specific use case.
>> BPF64 instruction set is designed for efficient mapping to native
>> instructions on 64-bit CPUs
>>
>> Old BPF instructions are remapped on the fly to BPF64
>> when sysctl net.core.bpf64_enable=1
>
>
> Would be nice if the commit message could be extended with what you
> have posted in your [PATCH v3 net-next 0/1] email (but without the
> changelog, changelog should go after "---" line), so that also this
> information will appear here and in the Git log later on. Also please
> elaborate more on this commit message. I think, at least, as it's a
> bigger change it also needs to include future planned usage scenarios
> for user space and for inside the kernel e.g. for OVS and ftrace.

Ok will do

> You could make this patch a 2 patch patch-series and put into patch
> 2/2 all documentation you had in your previous version of the set.
> Would be nice to extend the file Documentation/networking/filter.txt
> with a description of your extended BPF so that users can read about
> them in the same file.

Sure.

> Did you also test that seccomp-BPF still works out?

yes. Have a prototype, but it needs a bit more cleanup.

>
>> Signed-off-by: Alexei Starovoitov <[email protected]>
>> ---
>> include/linux/filter.h | 9 +-
>> include/linux/netdevice.h | 1 +
>> include/uapi/linux/filter.h | 37 ++-
>> net/core/Makefile | 2 +-
>> net/core/bpf_run.c | 766
>> +++++++++++++++++++++++++++++++++++++++++++
>> net/core/filter.c | 114 ++++++-
>
>
> I would have liked to see the content from net/core/bpf_run.c to go
> directly into net/core/filter.c, not as a separate file, if that's
> possible.

sure. that's fine.

>
>> net/core/sysctl_net_core.c | 7 +
>> 7 files changed, 913 insertions(+), 23 deletions(-)
>> create mode 100644 net/core/bpf_run.c
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index e568c8ef896b..bf3085258f4c 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>> unsigned int flen);
>> extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>> *filter, unsigned len);
>> extern void sk_decode_filter(struct sock_filter *filt, struct
>> sock_filter *to);
>>
>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>> *new_prog,
>> + int *p_new_len);
>> +/* execute bpf64 program */
>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>> +
>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>> FILTER->insns)
>> #ifdef CONFIG_BPF_JIT
>> #include <stdarg.h>
>> #include <linux/linkage.h>
>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>> unsigned int proglen,
>> print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>> 16, 1, image, proglen, false);
>> }
>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>> FILTER->insns)
>> #else
>> #include <linux/slab.h>
>> static inline void bpf_jit_compile(struct sk_filter *fp)
>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>> {
>> kfree(fp);
>> }
>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>> #endif
>>
>> static inline int bpf_tell_extensions(void)
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 5e84483c0650..7b1acefc244e 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -2971,6 +2971,7 @@ extern int netdev_max_backlog;
>> extern int netdev_tstamp_prequeue;
>> extern int weight_p;
>> extern int bpf_jit_enable;
>> +extern int bpf64_enable;
>
>
> We should keep naming consistent (so either extended BPF or BPF64),
> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext

agree. we need consistent naming for both (old and new).
I'll try all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
to see which one looks better.
I'm kinda leaning to sk_filter64, since it's easier to quickly spot
the difference
and more descriptive.

> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
> comparisons, you'd still need to load to immediate values, right?

there is no insn that use 64-bit immediate, since 64-bit immediates
are extremely rare. grep x86-64 asm code for movabsq will return very few.
llvm or gcc can easily construct any constant by combination of mov,
shifts and ors.
bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
Jxx is painless.
Notice I added signed comparison, since real life programs cannot do
without them.
I also kept the spirit of old bpf having > and >= only. (no < and <=)
that made llvm/gcc backends a bit harder to do, since no real cpu has
such limitations.
I'm still contemplating do add < and <= (both signed and unsigned), which is
interesting trade-off: number of instructions vs complexity of compiler

> After all your changes, we will still have the bpf_jit_enable knob
> in tact, right?

Yes.
In this diff the workflow is the following:

old filter comes through sk_attach_filter() or sk_unattached_filter_create()
if (bpf64) {
convert to new
sk_chk_filter() - check old bpf
use new interpreter
} else {
sk_chk_filter() - check old bpf
if (bpf_jit_enable)
use old jit
else
use old interpreter
}
soon I'll add bpf64 jit and will reuse the same bpf_jit_enable knob for it.
then add new/old inband demux into sk_attach_filter(),
so that workflow will become:

a filter comes through sk_attach_filter() or sk_unattached_filter_create()
if (new filter prog) {
sk_chk_filter64() - check new bpf
if (bpf_jit_enable)
use new jit
else
use new interpreter
} else {
if (bpf64_enable) {
convert to new
sk_chk_filter() - check old bpf
if (bpf_jit_enable)
use new jit
else
use new interpreter
} else {
sk_chk_filter()
if (bpf_jit_enable)
use old jit
else
use old interpreter
}
}
eventually bpf64_enable can be made default,
the last 'else' can be retired and 'bpf64_enable' removed along with
old interpreter.

bpf_jit_enable knob will stay for foreseeable future.

>> bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>> *upper_dev);
>> struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>> *dev,
>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>> index 8eb9ccaa5b48..70ff29ee6825 100644
>> --- a/include/uapi/linux/filter.h
>> +++ b/include/uapi/linux/filter.h
>> @@ -1,3 +1,4 @@
>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>> */
>> /*
>> * Linux Socket Filter Data Structures
>> */
>
>
> You can merge both comments into one.

ok.

>
>> @@ -19,7 +20,7 @@
>> * Try and keep these values and structures similar to BSD,
>> especially
>> * the BPF code definitions which need to match so you can share
>> filters
>> */
>> -
>> +
>> struct sock_filter { /* Filter block */
>> __u16 code; /* Actual filter code */
>> __u8 jt; /* Jump true */
>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>> #define BPF_JMP 0x05
>> #define BPF_RET 0x06
>> #define BPF_MISC 0x07
>> +#define BPF_ALU64 0x07
>> +
>> +struct bpf_insn {
>> + __u8 code; /* opcode */
>> + __u8 a_reg:4; /* dest register*/
>> + __u8 x_reg:4; /* source register */
>> + __s16 off; /* signed offset */
>> + __s32 imm; /* signed immediate constant */
>> +};
>
>
> As we have struct sock_filter and it's immutable due to uapi, I
> would have liked to see the new data structure with a consistent
> naming scheme, e.g. struct sock_filter_ext {...} for extended
> BPF.

ok. will rename bpf_insn to sock_filter64 and sock_filter_ext to see
which one looks better through out the code.

>> +/* pointer to bpf_context is the first and only argument to BPF program
>> + * its definition is use-case specific */
>> +struct bpf_context;
>>
>> /* ld/ldx fields */
>> #define BPF_SIZE(code) ((code) & 0x18)
>> #define BPF_W 0x00
>> #define BPF_H 0x08
>> #define BPF_B 0x10
>> +#define BPF_DW 0x18
>> #define BPF_MODE(code) ((code) & 0xe0)
>> #define BPF_IMM 0x00
>> #define BPF_ABS 0x20
>> @@ -58,6 +73,7 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>> #define BPF_MEM 0x60
>> #define BPF_LEN 0x80
>> #define BPF_MSH 0xa0
>> +#define BPF_XADD 0xc0 /* exclusive add */
>>
>> /* alu/jmp fields */
>> #define BPF_OP(code) ((code) & 0xf0)
>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>> #define BPF_OR 0x40
>> #define BPF_AND 0x50
>> #define BPF_LSH 0x60
>> -#define BPF_RSH 0x70
>> +#define BPF_RSH 0x70 /* logical shift right */
>> #define BPF_NEG 0x80
>> #define BPF_MOD 0x90
>> #define BPF_XOR 0xa0
>> +#define BPF_MOV 0xb0 /* mov reg to reg */
>> +#define BPF_ARSH 0xc0 /* sign extending arithmetic
>> shift right */
>> +#define BPF_BSWAP32 0xd0 /* swap lower 4 bytes of
>> 64-bit register */
>> +#define BPF_BSWAP64 0xe0 /* swap all 8 bytes of 64-bit
>> register */
>>
>> #define BPF_JA 0x00
>> -#define BPF_JEQ 0x10
>> -#define BPF_JGT 0x20
>> -#define BPF_JGE 0x30
>> -#define BPF_JSET 0x40
>> +#define BPF_JEQ 0x10 /* jump == */
>> +#define BPF_JGT 0x20 /* GT is unsigned '>', JA in x86 */
>> +#define BPF_JGE 0x30 /* GE is unsigned '>=', JAE in x86
>> */
>> +#define BPF_JSET 0x40 /* if (A & X) */
>> +#define BPF_JNE 0x50 /* jump != */
>> +#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
>> +#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
>> +#define BPF_CALL 0x80 /* function call */
>> #define BPF_SRC(code) ((code) & 0x08)
>> #define BPF_K 0x00
>> #define BPF_X 0x08
>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>> #define SKF_NET_OFF (-0x100000)
>> #define SKF_LL_OFF (-0x200000)
>>
>> -
>> #endif /* _UAPI__LINUX_FILTER_H__ */
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 9628c20acff6..e622b97f58dc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>> stream.o scm.o \
>> obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>
>> obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>> \
>> - neighbour.o rtnetlink.o utils.o link_watch.o
>> filter.o \
>> + neighbour.o rtnetlink.o utils.o link_watch.o
>> filter.o bpf_run.o \
>> sock_diag.o dev_ioctl.o
>>
>> obj-$(CONFIG_XFRM) += flow.o
>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>> new file mode 100644
>> index 000000000000..fa1862fcbc74
>> --- /dev/null
>> +++ b/net/core/bpf_run.c
>> @@ -0,0 +1,766 @@
>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of version 2 of the GNU General Public
>> + * License as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License for more details.
>> + */
>> +#include <linux/kernel.h>
>> +#include <linux/types.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/filter.h>
>> +#include <linux/skbuff.h>
>> +#include <asm/unaligned.h>
>> +
>> +int bpf64_enable __read_mostly;
>> +
>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>> k,
>> + unsigned int size);
>> +
>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>> + unsigned int size, void *buffer)
>> +{
>> + if (k >= 0)
>> + return skb_header_pointer(skb, k, size, buffer);
>> + return bpf_internal_load_pointer_neg_helper(skb, k, size);
>> +}
>> +
>> +static const char *const bpf_class_string[] = {
>> + "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>> +};
>> +
>> +static const char *const bpf_alu_string[] = {
>> + "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>> + "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>> +};
>> +
>> +static const char *const bpf_ldst_string[] = {
>> + "u32", "u16", "u8", "u64"
>> +};
>> +
>> +static const char *const bpf_jmp_string[] = {
>> + "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>> +};
>> +
>> +static const char *reg_to_str(int regno, u64 *regs)
>> +{
>> + static char reg_value[16][32];
>> + if (!regs)
>> + return "";
>> + snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>> + regs[regno]);
>> + return reg_value[regno];
>> +}
>> +
>> +#define R(regno) reg_to_str(regno, regs)
>> +
>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>> +{
>> + u16 class = BPF_CLASS(insn->code);
>> + if (class == BPF_ALU || class == BPF_ALU64) {
>> + if (BPF_SRC(insn->code) == BPF_X)
>> + pr_info("code_%02x %sr%d%s %s r%d%s\n",
>> + insn->code, class == BPF_ALU ? "(u32)" :
>> "",
>> + insn->a_reg, R(insn->a_reg),
>> + bpf_alu_string[BPF_OP(insn->code) >> 4],
>> + insn->x_reg, R(insn->x_reg));
>> + else
>> + pr_info("code_%02x %sr%d%s %s %d\n",
>> + insn->code, class == BPF_ALU ? "(u32)" :
>> "",
>> + insn->a_reg, R(insn->a_reg),
>> + bpf_alu_string[BPF_OP(insn->code) >> 4],
>> + insn->imm);
>> + } else if (class == BPF_STX) {
>> + if (BPF_MODE(insn->code) == BPF_MEM)
>> + pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>> + insn->code,
>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> + insn->a_reg, R(insn->a_reg),
>> + insn->off, insn->x_reg, R(insn->x_reg));
>> + else if (BPF_MODE(insn->code) == BPF_XADD)
>> + pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>> r%d%s\n",
>> + insn->code,
>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> + insn->a_reg, R(insn->a_reg), insn->off,
>> + insn->x_reg, R(insn->x_reg));
>> + else
>> + pr_info("BUG_%02x\n", insn->code);
>> + } else if (class == BPF_ST) {
>> + if (BPF_MODE(insn->code) != BPF_MEM) {
>> + pr_info("BUG_st_%02x\n", insn->code);
>> + return;
>> + }
>> + pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>> + insn->code,
>> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>> + insn->a_reg, R(insn->a_reg),
>> + insn->off, insn->imm);
>> + } else if (class == BPF_LDX) {
>> + if (BPF_MODE(insn->code) == BPF_MEM) {
>> + pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>> + insn->code, insn->a_reg,
>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> + insn->x_reg, R(insn->x_reg), insn->off);
>> + } else {
>> + pr_info("BUG_ldx_%02x\n", insn->code);
>> + }
>> + } else if (class == BPF_LD) {
>> + if (BPF_MODE(insn->code) == BPF_ABS) {
>> + pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>> + insn->code, insn->a_reg,
>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> + insn->imm);
>> + } else if (BPF_MODE(insn->code) == BPF_IND) {
>> + pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>> %+d)\n",
>> + insn->code, insn->a_reg,
>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> + insn->x_reg, R(insn->x_reg), insn->imm);
>> + } else {
>> + pr_info("BUG_ld_%02x\n", insn->code);
>> + }
>> + } else if (class == BPF_JMP) {
>> + u16 opcode = BPF_OP(insn->code);
>> + if (opcode == BPF_CALL) {
>> + pr_info("code_%02x call %d\n", insn->code,
>> insn->imm);
>> + } else if (insn->code == (BPF_JMP | BPF_JA)) {
>> + pr_info("code_%02x goto pc%+d\n",
>> + insn->code, insn->off);
>> + } else if (BPF_SRC(insn->code) == BPF_X) {
>> + pr_info("code_%02x if r%d%s %s r%d%s goto
>> pc%+d\n",
>> + insn->code, insn->a_reg, R(insn->a_reg),
>> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
>> + insn->x_reg, R(insn->x_reg), insn->off);
>> + } else {
>> + pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>> + insn->code, insn->a_reg, R(insn->a_reg),
>> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
>> + insn->imm, insn->off);
>> + }
>> + } else {
>> + pr_info("code_%02x %s\n", insn->code,
>> bpf_class_string[class]);
>> + }
>> +}
>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>
>
> Why would that need to be exported as a symbol?

the performance numbers I mentioned are from bpf_bench that is done
as kernel module, so I used this for debugging from it.
also to see what execution traces I get while running trinity bpf fuzzer :)

> I would actually like to avoid having this pr_info* inside the kernel.
> Couldn't this be done e.g. through systemtap script that could e.g. be
> placed under tools/net/ or inside the documentation file?

like the idea!
Will drop it from the diff and eventually will move it to tools/net.

>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>> (BPF64)
>> + *
>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>> new
>> + * program length in one pass
>> + *
>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>> + *
>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>> remaps
>> + */
>
>
> Would be nice to have the API comment it in kdoc format.

good point. will do.

>> +int bpf_convert(struct sock_filter *old_prog, int len,
>> + struct bpf_insn *new_prog, int *p_new_len)
>> +{
>
>
> If we place it into net/core/filter.c, it would be nice to keep naming
> conventions consistent, e.g. sk_convert_filter() or so.

ok.

>> + struct bpf_insn *new_insn;
>> + struct sock_filter *fp;
>> + int *addrs = NULL;
>> + int new_len = 0;
>> + int pass = 0;
>> + int tgt, i;
>> +
>> + if (len <= 0 || len >= BPF_MAXINSNS)
>> + return -EINVAL;
>> +
>> + if (new_prog) {
>> + addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>> + if (!addrs)
>> + return -ENOMEM;
>> + }
>> +
>> +do_pass:
>> + new_insn = new_prog;
>> + fp = old_prog;
>> + for (i = 0; i < len; fp++, i++) {
>> + struct bpf_insn tmp_insns[3] = {};
>> + struct bpf_insn *insn = tmp_insns;
>> +
>> + if (addrs)
>> + addrs[i] = new_insn - new_prog;
>> +
>> + switch (fp->code) {
>> + /* all arithmetic insns and skb loads map as-is */
>> + case BPF_ALU | BPF_ADD | BPF_X:
>> + case BPF_ALU | BPF_ADD | BPF_K:
>> + case BPF_ALU | BPF_SUB | BPF_X:
>> + case BPF_ALU | BPF_SUB | BPF_K:
>> + case BPF_ALU | BPF_AND | BPF_X:
>> + case BPF_ALU | BPF_AND | BPF_K:
>> + case BPF_ALU | BPF_OR | BPF_X:
>> + case BPF_ALU | BPF_OR | BPF_K:
>> + case BPF_ALU | BPF_LSH | BPF_X:
>> + case BPF_ALU | BPF_LSH | BPF_K:
>> + case BPF_ALU | BPF_RSH | BPF_X:
>> + case BPF_ALU | BPF_RSH | BPF_K:
>> + case BPF_ALU | BPF_XOR | BPF_X:
>> + case BPF_ALU | BPF_XOR | BPF_K:
>> + case BPF_ALU | BPF_MUL | BPF_X:
>> + case BPF_ALU | BPF_MUL | BPF_K:
>> + case BPF_ALU | BPF_DIV | BPF_X:
>> + case BPF_ALU | BPF_DIV | BPF_K:
>> + case BPF_ALU | BPF_MOD | BPF_X:
>> + case BPF_ALU | BPF_MOD | BPF_K:
>> + case BPF_ALU | BPF_NEG:
>> + case BPF_LD | BPF_ABS | BPF_W:
>> + case BPF_LD | BPF_ABS | BPF_H:
>> + case BPF_LD | BPF_ABS | BPF_B:
>> + case BPF_LD | BPF_IND | BPF_W:
>> + case BPF_LD | BPF_IND | BPF_H:
>> + case BPF_LD | BPF_IND | BPF_B:
>
>
> I think here and elsewhere for similar constructions, we could use
> BPF_S_* helpers that was introduced by Hagen in commit 01f2f3f6ef4d076
> ("net: optimize Berkeley Packet Filter (BPF) processing").

well, old bpf opcodes were just unlucky to be on the wrong side of
compiler heuristics at that time.
Instead of doing remapping while loading and opposite remapping in
sk_get_filter()
the problem could have been solved with few dummy 'case' statements.
That would have only added few lines of code instead of hundred lines of mapping
back and forth.
Also I suspect commit 01f2f3f6ef4d076 was valid only when whole kernel
is compiled
with -Os. With -O2 GCC should have done it right.
Heuristic is: number_of_case_stmts * ratio < range_of_case_values
and ratio is 3 for -Os and 10 for -O2.
old bpf has ~70 stmts and ~200 range.
See expand_switch_as_decision_tree_p() in gcc/stmt.c
Eventually when current interpreter is retired, I would like to remove
remapping as well.
Especially for sk_convert_filter() I would like to keep normal BPF_
values from uapi/filter.h,
since for majority of them I reuse as-is and remapping would have
added unnecessary code.

>> + insn->code = fp->code;
>> + insn->a_reg = 6;
>> + insn->x_reg = 7;
>> + insn->imm = fp->k;
>> + break;
>> +
>> + /* jump opcodes map as-is, but offsets need adjustment */
>> + case BPF_JMP | BPF_JA:
>> + tgt = i + fp->k + 1;
>> + insn->code = fp->code;
>> +#define EMIT_JMP \
>> + do { \
>> + if (tgt >= len || tgt < 0) \
>> + goto err; \
>> + insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>> + } while (0)
>> +
>> + EMIT_JMP;
>> + break;
>> +
>> + case BPF_JMP | BPF_JEQ | BPF_K:
>> + case BPF_JMP | BPF_JEQ | BPF_X:
>> + case BPF_JMP | BPF_JSET | BPF_K:
>> + case BPF_JMP | BPF_JSET | BPF_X:
>> + case BPF_JMP | BPF_JGT | BPF_K:
>> + case BPF_JMP | BPF_JGT | BPF_X:
>> + case BPF_JMP | BPF_JGE | BPF_K:
>> + case BPF_JMP | BPF_JGE | BPF_X:
>> + insn->a_reg = 6;
>> + insn->x_reg = 7;
>> + insn->imm = fp->k;
>> + /* common case where 'jump_false' is next insn */
>> + if (fp->jf == 0) {
>> + insn->code = fp->code;
>> + tgt = i + fp->jt + 1;
>> + EMIT_JMP;
>> + break;
>> + }
>> + /* convert JEQ into JNE when 'jump_true' is next
>> insn */
>> + if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>> + insn->code = BPF_JMP | BPF_JNE |
>> + BPF_SRC(fp->code);
>> + tgt = i + fp->jf + 1;
>> + EMIT_JMP;
>> + break;
>> + }
>> + /* other jumps are mapped into two insns: Jxx and
>> JA */
>> + tgt = i + fp->jt + 1;
>> + insn->code = fp->code;
>> + EMIT_JMP;
>> +
>> + insn++;
>> + insn->code = BPF_JMP | BPF_JA;
>> + tgt = i + fp->jf + 1;
>> + EMIT_JMP;
>> + /* adjust pc relative offset, since it's a 2nd
>> insn */
>> + insn->off--;
>> + break;
>> +
>> + /* ldxb 4*([14]&0xf) is remaped into 3 insns */
>> + case BPF_LDX | BPF_MSH | BPF_B:
>> + insn->code = BPF_LD | BPF_ABS | BPF_B;
>> + insn->a_reg = 7;
>> + insn->imm = fp->k;
>> +
>> + insn++;
>> + insn->code = BPF_ALU | BPF_AND | BPF_K;
>> + insn->a_reg = 7;
>> + insn->imm = 0xf;
>> +
>> + insn++;
>> + insn->code = BPF_ALU | BPF_LSH | BPF_K;
>> + insn->a_reg = 7;
>> + insn->imm = 2;
>> + break;
>> +
>> + /* RET_K, RET_A are remaped into 2 insns */
>> + case BPF_RET | BPF_A:
>> + case BPF_RET | BPF_K:
>> + insn->code = BPF_ALU | BPF_MOV |
>> + (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>> BPF_X);
>> + insn->a_reg = 0;
>> + insn->x_reg = 6;
>> + insn->imm = fp->k;
>> +
>> + insn++;
>> + insn->code = BPF_RET | BPF_K;
>> + break;
>> +
>> + /* store to stack */
>> + case BPF_ST:
>> + case BPF_STX:
>> + insn->code = BPF_STX | BPF_MEM | BPF_W;
>> + insn->a_reg = 10;
>> + insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>> + break;
>> +
>> + /* load from stack */
>> + case BPF_LD | BPF_MEM:
>> + case BPF_LDX | BPF_MEM:
>> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> + insn->x_reg = 10;
>> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>> + break;
>> +
>> + /* A = K or X = K */
>> + case BPF_LD | BPF_IMM:
>> + case BPF_LDX | BPF_IMM:
>> + insn->code = BPF_ALU | BPF_MOV | BPF_K;
>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> + insn->imm = fp->k;
>> + break;
>> +
>> + /* X = A */
>> + case BPF_MISC | BPF_TAX:
>> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>> + insn->a_reg = 7;
>> + insn->x_reg = 6;
>> + break;
>> +
>> + /* A = X */
>> + case BPF_MISC | BPF_TXA:
>> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>> + insn->a_reg = 6;
>> + insn->x_reg = 7;
>> + break;
>> +
>> + /* A = skb->len or X = skb->len */
>> + case BPF_LD|BPF_W|BPF_LEN:
>> + case BPF_LDX|BPF_W|BPF_LEN:
>> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> + insn->x_reg = 1;
>> + insn->off = offsetof(struct sk_buff, len);
>> + break;
>> +
>> + default:
>> + /* pr_err("unknown opcode %02x\n", fp->code); */
>> + goto err;
>> + }
>> +
>> + insn++;
>> + if (new_prog) {
>> + memcpy(new_insn, tmp_insns,
>> + sizeof(*insn) * (insn - tmp_insns));
>> + }
>> + new_insn += insn - tmp_insns;
>> + }
>> +
>> + if (!new_prog) {
>> + /* only calculating new length */
>> + *p_new_len = new_insn - new_prog;
>> + return 0;
>> + }
>> +
>> + pass++;
>> + if (new_len != new_insn - new_prog) {
>> + new_len = new_insn - new_prog;
>> + if (pass > 2)
>> + goto err;
>> + goto do_pass;
>> + }
>> + kfree(addrs);
>> + if (*p_new_len != new_len)
>> + /* inconsistent new program length */
>> + pr_err("bpf_convert() usage error\n");
>> + return 0;
>> +err:
>> + kfree(addrs);
>> + return -EINVAL;
>> +}
>> +
>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>> +{
>
>
> Similarly, sk_run_filter_ext()?

ok.

>> + u64 stack[64];
>> + u64 regs[16];
>> + void *ptr;
>> + u64 tmp;
>> + int off;
>> +
>> +#ifdef __x86_64
>> +#define LOAD_IMM /**/
>> +#define K insn->imm
>> +#else
>> +#define LOAD_IMM (K = insn->imm)
>> + s32 K = insn->imm;
>> +#endif
>> +
>> +#ifdef DEBUG
>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>> +#else
>> +#define DEBUG_INSN
>> +#endif
>
>
> This DEBUG hunk could then be removed when we use a stap script
> instead, for example.

ok

>> +#define A regs[insn->a_reg]
>> +#define X regs[insn->x_reg]
>> +
>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>> +#define L(A, B, C) L_##A##B##C:
>
>
> Could we get rid of these two defines? I know you're defining

I think it's actually more readable this way and lesser chance of typos.
DL - define label
L - label
will try to remove 2nd marco to see how it looks...

> and using that as labels, but it's not obvious at first what
> 'jt' does. Maybe 'jt_labels' ?

ok will rename.

>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>
>
> Not a big fan of macros, but here it seems fine though.
>
>
>> +/* some compilers may need help:
>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>> })
>> + */
>> +
>> + static const void *jt[256] = {
>> + [0 ... 255] = &&L_default,
>
>
> Nit: please just one define per line below:

ok.

>> + DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>> + DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>> + DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>> + DL(BPF_ALU, BPF_OR, BPF_X) DL(BPF_ALU, BPF_OR, BPF_K)
>> + DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>> + DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>> + DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>> + DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>> + DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>> + DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>> + DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>> + DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_OR, BPF_X) DL(BPF_ALU64, BPF_OR, BPF_K)
>> + DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>> BPF_K)
>> + DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>> + DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>> + DL(BPF_ALU, BPF_NEG, 0)
>> + DL(BPF_JMP, BPF_CALL, 0)
>> + DL(BPF_JMP, BPF_JA, 0)
>> + DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>> + DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>> + DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>> + DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>> + DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>> + DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>> + DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>> + DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>> + DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>> + DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>> + DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>> + DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>> + DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>> + DL(BPF_STX, BPF_XADD, BPF_W)
>> +#ifdef CONFIG_64BIT
>> + DL(BPF_STX, BPF_XADD, BPF_DW)
>> +#endif
>> + DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>> + DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>> + DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>> + DL(BPF_RET, BPF_K, 0)
>> + };
>> +
>> + regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>> + regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>> +
>> + DEBUG_INSN;
>> + /* execute 1st insn */
>> +select_insn:
>> + goto *jt[insn->code];
>> +
>> + /* ALU */
>> +#define ALU(OPCODE, OP) \
>> + L_BPF_ALU64##OPCODE##BPF_X: \
>> + A = A OP X; \
>> + CONT; \
>> + L_BPF_ALU##OPCODE##BPF_X: \
>> + A = (u32)A OP (u32)X; \
>> + CONT; \
>> + L_BPF_ALU64##OPCODE##BPF_K: \
>> + A = A OP K; \
>> + CONT; \
>> + L_BPF_ALU##OPCODE##BPF_K: \
>> + A = (u32)A OP (u32)K; \
>> + CONT;
>> +
>> + ALU(BPF_ADD, +)
>> + ALU(BPF_SUB, -)
>> + ALU(BPF_AND, &)
>> + ALU(BPF_OR, |)
>> + ALU(BPF_LSH, <<)
>> + ALU(BPF_RSH, >>)
>> + ALU(BPF_XOR, ^)
>> + ALU(BPF_MUL, *)
>> +
>> + L(BPF_ALU, BPF_NEG, 0)
>> + A = (u32)-A;
>> + CONT;
>> + L(BPF_ALU, BPF_MOV, BPF_X)
>> + A = (u32)X;
>> + CONT;
>> + L(BPF_ALU, BPF_MOV, BPF_K)
>> + A = (u32)K;
>> + CONT;
>> + L(BPF_ALU64, BPF_MOV, BPF_X)
>> + A = X;
>> + CONT;
>> + L(BPF_ALU64, BPF_MOV, BPF_K)
>> + A = K;
>> + CONT;
>> + L(BPF_ALU64, BPF_ARSH, BPF_X)
>> + (*(s64 *) &A) >>= X;
>> + CONT;
>> + L(BPF_ALU64, BPF_ARSH, BPF_K)
>> + (*(s64 *) &A) >>= K;
>> + CONT;
>> + L(BPF_ALU64, BPF_MOD, BPF_X)
>> + tmp = A;
>> + if (X)
>> + A = do_div(tmp, X);
>> + CONT;
>> + L(BPF_ALU, BPF_MOD, BPF_X)
>> + tmp = (u32)A;
>> + if (X)
>> + A = do_div(tmp, (u32)X);
>> + CONT;
>> + L(BPF_ALU64, BPF_MOD, BPF_K)
>> + tmp = A;
>> + if (K)
>> + A = do_div(tmp, K);
>> + CONT;
>> + L(BPF_ALU, BPF_MOD, BPF_K)
>> + tmp = (u32)A;
>> + if (K)
>> + A = do_div(tmp, (u32)K);
>> + CONT;
>> + L(BPF_ALU64, BPF_DIV, BPF_X)
>> + if (X)
>> + do_div(A, X);
>> + CONT;
>> + L(BPF_ALU, BPF_DIV, BPF_X)
>> + tmp = (u32)A;
>> + if (X)
>> + do_div(tmp, (u32)X);
>> + A = (u32)tmp;
>> + CONT;
>> + L(BPF_ALU64, BPF_DIV, BPF_K)
>> + if (K)
>> + do_div(A, K);
>> + CONT;
>> + L(BPF_ALU, BPF_DIV, BPF_K)
>> + tmp = (u32)A;
>> + if (K)
>> + do_div(tmp, (u32)K);
>> + A = (u32)tmp;
>> + CONT;
>> + L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>> + A = swab32(A);
>> + CONT;
>> + L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>> + A = swab64(A);
>> + CONT;
>> +
>> + /* CALL */
>> + L(BPF_JMP, BPF_CALL, 0)
>> + /* TODO execute_func(K, regs); */
>> + CONT;
>
>
> Would the filter checker for now just return -EINVAL when this is used?

sure.
I was planning to address that in a week or so, but ok.
Will remove 'call' insn in the first patch and will re-add in a clean
way in a next diff.

>> + /* JMP */
>> + L(BPF_JMP, BPF_JA, 0)
>> + insn += insn->off;
>> + CONT;
>> + L(BPF_JMP, BPF_JEQ, BPF_X)
>> + if (A == X) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JEQ, BPF_K)
>> + if (A == K) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JNE, BPF_X)
>> + if (A != X) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JNE, BPF_K)
>> + if (A != K) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JGT, BPF_X)
>> + if (A > X) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JGT, BPF_K)
>> + if (A > K) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JGE, BPF_X)
>> + if (A >= X) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JGE, BPF_K)
>> + if (A >= K) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JSGT, BPF_X)
>> + if (((s64)A) > ((s64)X)) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JSGT, BPF_K)
>> + if (((s64)A) > ((s64)K)) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JSGE, BPF_X)
>> + if (((s64)A) >= ((s64)X)) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JSGE, BPF_K)
>> + if (((s64)A) >= ((s64)K)) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JSET, BPF_X)
>> + if (A & X) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>> + L(BPF_JMP, BPF_JSET, BPF_K)
>> + if (A & (u32)K) {
>> + insn += insn->off;
>> + CONT_JMP;
>> + }
>> + CONT;
>
>
> Okay, so right now we only support forward jumps, right? And these
> are checked by the old checker code I'd assume?

yes. just like old interpreter new interpreter doesn't care. It just jumps.
and you're correct. old checker code will allow only forward jumps.
That restriction is preserved by sk_convert_filter() as well.

>> + /* STX and ST and LDX*/
>> +#define LDST(SIZEOP, SIZE) \
>> + L_BPF_STXBPF_MEM##SIZEOP: \
>> + *(SIZE *)(ulong)(A + insn->off) = X; \
>> + CONT; \
>> + L_BPF_STBPF_MEM##SIZEOP: \
>> + *(SIZE *)(ulong)(A + insn->off) = K; \
>> + CONT; \
>> + L_BPF_LDXBPF_MEM##SIZEOP: \
>> + A = *(SIZE *)(ulong)(X + insn->off); \
>> + CONT;
>> +
>> + LDST(BPF_B, u8)
>> + LDST(BPF_H, u16)
>> + LDST(BPF_W, u32)
>> + LDST(BPF_DW, u64)
>> +
>> + /* STX XADD */
>> + L(BPF_STX, BPF_XADD, BPF_W)
>> + atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>> + CONT;
>> +#ifdef CONFIG_64BIT
>> + L(BPF_STX, BPF_XADD, BPF_DW)
>> + atomic64_add((u64)X, (atomic64_t *)(ulong)(A +
>> insn->off));
>> + CONT;
>> +#endif
>> +
>> + /* LD from SKB + K */
>
>
> Nit: indent one tab too much (here and elsewhere)

ahh. ok.

>> + L(BPF_LD, BPF_ABS, BPF_W)
>> + off = K;
>> +load_word:
>> + ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>> + if (likely(ptr != NULL)) {
>> + A = get_unaligned_be32(ptr);
>> + CONT;
>> + }
>> + return 0;
>> +
>> + L(BPF_LD, BPF_ABS, BPF_H)
>> + off = K;
>> +load_half:
>> + ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>> + if (likely(ptr != NULL)) {
>> + A = get_unaligned_be16(ptr);
>> + CONT;
>> + }
>> + return 0;
>> +
>> + L(BPF_LD, BPF_ABS, BPF_B)
>> + off = K;
>> +load_byte:
>> + ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>> + if (likely(ptr != NULL)) {
>> + A = *(u8 *)ptr;
>> + CONT;
>> + }
>> + return 0;
>> +
>> + /* LD from SKB + X + K */
>
>
> Nit: ditto

ok

>> + L(BPF_LD, BPF_IND, BPF_W)
>> + off = K + X;
>> + goto load_word;
>> +
>> + L(BPF_LD, BPF_IND, BPF_H)
>> + off = K + X;
>> + goto load_half;
>> +
>> + L(BPF_LD, BPF_IND, BPF_B)
>> + off = K + X;
>> + goto load_byte;
>> +
>> + /* RET */
>> + L(BPF_RET, BPF_K, 0)
>> + return regs[0/* R0 */];
>> +
>> + L_default:
>> + /* bpf_check() or bpf_convert() will guarantee that
>> + * we never reach here
>> + */
>> + WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>> + return 0;
>> +#undef DL
>> +#undef L
>> +#undef CONT
>> +#undef A
>> +#undef X
>> +#undef K
>> +#undef LOAD_IMM
>> +#undef DEBUG_INSN
>> +#undef ALU
>> +#undef LDST
>> +}
>> +EXPORT_SYMBOL(bpf_run);
>> +
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index ad30d626a5bd..575bf5fd4335 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>> {
>> struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>
>> + if ((void *)fp->bpf_func == (void *)bpf_run)
>> + /* arch specific jit_free are expecting this value */
>> + fp->bpf_func = sk_run_filter;
>> +
>> bpf_jit_free(fp);
>> }
>> EXPORT_SYMBOL(sk_filter_release_rcu);
>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>> return 0;
>> }
>>
>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog
>> *fprog,
>> + struct sock *sk)
>> +{
>
>
> sk_prepare_filter_ext()?

ok.

>> + unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>> + struct sock_filter *old_prog;
>> + unsigned int sk_fsize;
>> + struct sk_filter *fp;
>> + int new_len;
>> + int err;
>> +
>> + /* store old program into buffer, since chk_filter will remap
>> opcodes */
>> + old_prog = kmalloc(fsize, GFP_KERNEL);
>> + if (!old_prog)
>> + return -ENOMEM;
>> +
>> + if (sk) {
>> + if (copy_from_user(old_prog, fprog->filter, fsize)) {
>> + err = -EFAULT;
>> + goto free_prog;
>> + }
>> + } else {
>> + memcpy(old_prog, fprog->filter, fsize);
>> + }
>> +
>> + /* calculate bpf64 program length */
>> + err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>> + if (err)
>> + goto free_prog;
>> +
>> + sk_fsize = sk_filter_size(new_len);
>> + /* allocate sk_filter to store bpf64 program */
>> + if (sk)
>> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> + else
>> + fp = kmalloc(sk_fsize, GFP_KERNEL);
>> + if (!fp) {
>> + err = -ENOMEM;
>> + goto free_prog;
>> + }
>> +
>> + /* remap old insns into bpf64 */
>> + err = bpf_convert(old_prog, fprog->len,
>> + (struct bpf_insn *)fp->insns, &new_len);
>> + if (err)
>> + /* 2nd bpf_convert() can fail only if it fails
>> + * to allocate memory, remapping must succeed
>> + */
>> + goto free_fp;
>> +
>> + /* now chk_filter can overwrite old_prog while checking */
>> + err = sk_chk_filter(old_prog, fprog->len);
>> + if (err)
>> + goto free_fp;
>> +
>> + /* discard old prog */
>> + kfree(old_prog);
>> +
>> + atomic_set(&fp->refcnt, 1);
>> + fp->len = new_len;
>> +
>> + /* bpf64 insns must be executed by bpf_run */
>> + fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>> +
>> + *pfp = fp;
>> + return 0;
>> +free_fp:
>> + if (sk)
>> + sock_kfree_s(sk, fp, sk_fsize);
>> + else
>> + kfree(fp);
>> +free_prog:
>> + kfree(old_prog);
>> + return err;
>> +}
>> +
>> /**
>> * sk_unattached_filter_create - create an unattached filter
>> * @fprog: the filter program
>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter
>> **pfp,
>> if (fprog->filter == NULL)
>> return -EINVAL;
>>
>> + if (bpf64_enable)
>> + return bpf64_prepare(pfp, fprog, NULL);
>> +
>> fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>> if (!fp)
>> return -ENOMEM;
>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog,
>> struct sock *sk)
>> if (fprog->filter == NULL)
>> return -EINVAL;
>>
>> - fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> - if (!fp)
>> - return -ENOMEM;
>> - if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>> - sock_kfree_s(sk, fp, sk_fsize);
>> - return -EFAULT;
>> - }
>> + if (bpf64_enable) {
>> + err = bpf64_prepare(&fp, fprog, sk);
>> + if (err)
>> + return err;
>> + } else {
>> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> + if (!fp)
>> + return -ENOMEM;
>> + if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>> + sock_kfree_s(sk, fp, sk_fsize);
>> + return -EFAULT;
>> + }
>>
>> - atomic_set(&fp->refcnt, 1);
>> - fp->len = fprog->len;
>> + atomic_set(&fp->refcnt, 1);
>> + fp->len = fprog->len;
>>
>> - err = __sk_prepare_filter(fp);
>> - if (err) {
>> - sk_filter_uncharge(sk, fp);
>> - return err;
>> + err = __sk_prepare_filter(fp);
>> + if (err) {
>> + sk_filter_uncharge(sk, fp);
>> + return err;
>> + }
>> }
>>
>> old_fp = rcu_dereference_protected(sk->sk_filter,
>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>> index cf9cd13509a7..f03acc0e8950 100644
>> --- a/net/core/sysctl_net_core.c
>> +++ b/net/core/sysctl_net_core.c
>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>> },
>> #endif
>> {
>> + .procname = "bpf64_enable",
>> + .data = &bpf64_enable,
>> + .maxlen = sizeof(int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec
>> + },
>> + {
>> .procname = "netdev_tstamp_prequeue",
>> .data = &netdev_tstamp_prequeue,
>> .maxlen = sizeof(int),
>>
>
> Hope some of the comments made sense. ;-)

Yes. Indeed. Thanks a lot for the thorough review!
Will address things and resend V4.

Alexei

> Thanks,
>
> Daniel

2014-03-01 00:10:15

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter

On Fri, Feb 28, 2014 at 12:53 PM, Alexei Starovoitov <[email protected]> wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <[email protected]> wrote:
>> Hi Alexei,
>>
>> [also cc'ing Hagen and Jesse]
>>
>> Just some minor comments below ... let me know what you think.
>
> Thank you for review! Comments below.
>
>> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>>
>>> Extended BPF (or 64-bit BPF) is an instruction set to
>>> create safe dynamically loadable filters that can call fixed set
>>> of kernel functions and take generic bpf_context as an input.
>>> BPF filter is a glue between kernel functions and bpf_context.
>>> Different kernel subsystems can define their own set of available
>>> functions
>>> and alter BPF machinery for specific use case.
>>> BPF64 instruction set is designed for efficient mapping to native
>>> instructions on 64-bit CPUs
>>>
>>> Old BPF instructions are remapped on the fly to BPF64
>>> when sysctl net.core.bpf64_enable=1
>>
>>
>> Would be nice if the commit message could be extended with what you
>> have posted in your [PATCH v3 net-next 0/1] email (but without the
>> changelog, changelog should go after "---" line), so that also this
>> information will appear here and in the Git log later on. Also please
>> elaborate more on this commit message. I think, at least, as it's a
>> bigger change it also needs to include future planned usage scenarios
>> for user space and for inside the kernel e.g. for OVS and ftrace.
>
> Ok will do
>
>> You could make this patch a 2 patch patch-series and put into patch
>> 2/2 all documentation you had in your previous version of the set.
>> Would be nice to extend the file Documentation/networking/filter.txt
>> with a description of your extended BPF so that users can read about
>> them in the same file.
>
> Sure.
>
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.
>
>>
>>> Signed-off-by: Alexei Starovoitov <[email protected]>
>>> ---
>>> include/linux/filter.h | 9 +-
>>> include/linux/netdevice.h | 1 +
>>> include/uapi/linux/filter.h | 37 ++-
>>> net/core/Makefile | 2 +-
>>> net/core/bpf_run.c | 766
>>> +++++++++++++++++++++++++++++++++++++++++++
>>> net/core/filter.c | 114 ++++++-
>>
>>
>> I would have liked to see the content from net/core/bpf_run.c to go
>> directly into net/core/filter.c, not as a separate file, if that's
>> possible.
>
> sure. that's fine.
>
>>
>>> net/core/sysctl_net_core.c | 7 +
>>> 7 files changed, 913 insertions(+), 23 deletions(-)
>>> create mode 100644 net/core/bpf_run.c
>>>
>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>> index e568c8ef896b..bf3085258f4c 100644
>>> --- a/include/linux/filter.h
>>> +++ b/include/linux/filter.h
>>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>>> unsigned int flen);
>>> extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>>> *filter, unsigned len);
>>> extern void sk_decode_filter(struct sock_filter *filt, struct
>>> sock_filter *to);
>>>
>>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>>> *new_prog,
>>> + int *p_new_len);
>>> +/* execute bpf64 program */
>>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>>> +
>>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>> #ifdef CONFIG_BPF_JIT
>>> #include <stdarg.h>
>>> #include <linux/linkage.h>
>>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>>> unsigned int proglen,
>>> print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>>> 16, 1, image, proglen, false);
>>> }
>>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>> #else
>>> #include <linux/slab.h>
>>> static inline void bpf_jit_compile(struct sk_filter *fp)
>>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>>> {
>>> kfree(fp);
>>> }
>>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>>> #endif
>>>
>>> static inline int bpf_tell_extensions(void)
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5e84483c0650..7b1acefc244e 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -2971,6 +2971,7 @@ extern int netdev_max_backlog;
>>> extern int netdev_tstamp_prequeue;
>>> extern int weight_p;
>>> extern int bpf_jit_enable;
>>> +extern int bpf64_enable;
>>
>>
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> agree. we need consistent naming for both (old and new).
> I'll try all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm kinda leaning to sk_filter64, since it's easier to quickly spot
> the difference
> and more descriptive.

After going back and forth between sk_filter64 vs sk_filter_ext,
decided to follow your suggestion and stick with sk_filter_ext,
bpf_ext_enable, etc. since calling it bpf64 may raise unnecessary
concerns on 32-bit architectures... but performance numbers
show that extended bpf on 32-bit cpus is faster than old bpf,
so 'extended bpf' from now on.

Thanks

>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load to immediate values, right?
>
> there is no insn that use 64-bit immediate, since 64-bit immediates
> are extremely rare. grep x86-64 asm code for movabsq will return very few.
> llvm or gcc can easily construct any constant by combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
> 32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
> Jxx is painless.
> Notice I added signed comparison, since real life programs cannot do
> without them.
> I also kept the spirit of old bpf having > and >= only. (no < and <=)
> that made llvm/gcc backends a bit harder to do, since no real cpu has
> such limitations.
> I'm still contemplating do add < and <= (both signed and unsigned), which is
> interesting trade-off: number of instructions vs complexity of compiler
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> in tact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
> convert to new
> sk_chk_filter() - check old bpf
> use new interpreter
> } else {
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use old jit
> else
> use old interpreter
> }
> soon I'll add bpf64 jit and will reuse the same bpf_jit_enable knob for it.
> then add new/old inband demux into sk_attach_filter(),
> so that workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
> sk_chk_filter64() - check new bpf
> if (bpf_jit_enable)
> use new jit
> else
> use new interpreter
> } else {
> if (bpf64_enable) {
> convert to new
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use new jit
> else
> use new interpreter
> } else {
> sk_chk_filter()
> if (bpf_jit_enable)
> use old jit
> else
> use old interpreter
> }
> }
> eventually bpf64_enable can be made default,
> the last 'else' can be retired and 'bpf64_enable' removed along with
> old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.
>
>>> bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>>> *upper_dev);
>>> struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>>> *dev,
>>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>>> index 8eb9ccaa5b48..70ff29ee6825 100644
>>> --- a/include/uapi/linux/filter.h
>>> +++ b/include/uapi/linux/filter.h
>>> @@ -1,3 +1,4 @@
>>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>>> */
>>> /*
>>> * Linux Socket Filter Data Structures
>>> */
>>
>>
>> You can merge both comments into one.
>
> ok.
>
>>
>>> @@ -19,7 +20,7 @@
>>> * Try and keep these values and structures similar to BSD,
>>> especially
>>> * the BPF code definitions which need to match so you can share
>>> filters
>>> */
>>> -
>>> +
>>> struct sock_filter { /* Filter block */
>>> __u16 code; /* Actual filter code */
>>> __u8 jt; /* Jump true */
>>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define BPF_JMP 0x05
>>> #define BPF_RET 0x06
>>> #define BPF_MISC 0x07
>>> +#define BPF_ALU64 0x07
>>> +
>>> +struct bpf_insn {
>>> + __u8 code; /* opcode */
>>> + __u8 a_reg:4; /* dest register*/
>>> + __u8 x_reg:4; /* source register */
>>> + __s16 off; /* signed offset */
>>> + __s32 imm; /* signed immediate constant */
>>> +};
>>
>>
>> As we have struct sock_filter and it's immutable due to uapi, I
>> would have liked to see the new data structure with a consistent
>> naming scheme, e.g. struct sock_filter_ext {...} for extended
>> BPF.
>
> ok. will rename bpf_insn to sock_filter64 and sock_filter_ext to see
> which one looks better through out the code.
>
>>> +/* pointer to bpf_context is the first and only argument to BPF program
>>> + * its definition is use-case specific */
>>> +struct bpf_context;
>>>
>>> /* ld/ldx fields */
>>> #define BPF_SIZE(code) ((code) & 0x18)
>>> #define BPF_W 0x00
>>> #define BPF_H 0x08
>>> #define BPF_B 0x10
>>> +#define BPF_DW 0x18
>>> #define BPF_MODE(code) ((code) & 0xe0)
>>> #define BPF_IMM 0x00
>>> #define BPF_ABS 0x20
>>> @@ -58,6 +73,7 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define BPF_MEM 0x60
>>> #define BPF_LEN 0x80
>>> #define BPF_MSH 0xa0
>>> +#define BPF_XADD 0xc0 /* exclusive add */
>>>
>>> /* alu/jmp fields */
>>> #define BPF_OP(code) ((code) & 0xf0)
>>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define BPF_OR 0x40
>>> #define BPF_AND 0x50
>>> #define BPF_LSH 0x60
>>> -#define BPF_RSH 0x70
>>> +#define BPF_RSH 0x70 /* logical shift right */
>>> #define BPF_NEG 0x80
>>> #define BPF_MOD 0x90
>>> #define BPF_XOR 0xa0
>>> +#define BPF_MOV 0xb0 /* mov reg to reg */
>>> +#define BPF_ARSH 0xc0 /* sign extending arithmetic
>>> shift right */
>>> +#define BPF_BSWAP32 0xd0 /* swap lower 4 bytes of
>>> 64-bit register */
>>> +#define BPF_BSWAP64 0xe0 /* swap all 8 bytes of 64-bit
>>> register */
>>>
>>> #define BPF_JA 0x00
>>> -#define BPF_JEQ 0x10
>>> -#define BPF_JGT 0x20
>>> -#define BPF_JGE 0x30
>>> -#define BPF_JSET 0x40
>>> +#define BPF_JEQ 0x10 /* jump == */
>>> +#define BPF_JGT 0x20 /* GT is unsigned '>', JA in x86 */
>>> +#define BPF_JGE 0x30 /* GE is unsigned '>=', JAE in x86
>>> */
>>> +#define BPF_JSET 0x40 /* if (A & X) */
>>> +#define BPF_JNE 0x50 /* jump != */
>>> +#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
>>> +#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
>>> +#define BPF_CALL 0x80 /* function call */
>>> #define BPF_SRC(code) ((code) & 0x08)
>>> #define BPF_K 0x00
>>> #define BPF_X 0x08
>>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define SKF_NET_OFF (-0x100000)
>>> #define SKF_LL_OFF (-0x200000)
>>>
>>> -
>>> #endif /* _UAPI__LINUX_FILTER_H__ */
>>> diff --git a/net/core/Makefile b/net/core/Makefile
>>> index 9628c20acff6..e622b97f58dc 100644
>>> --- a/net/core/Makefile
>>> +++ b/net/core/Makefile
>>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>>> stream.o scm.o \
>>> obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>>
>>> obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>>> \
>>> - neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o \
>>> + neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o bpf_run.o \
>>> sock_diag.o dev_ioctl.o
>>>
>>> obj-$(CONFIG_XFRM) += flow.o
>>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>>> new file mode 100644
>>> index 000000000000..fa1862fcbc74
>>> --- /dev/null
>>> +++ b/net/core/bpf_run.c
>>> @@ -0,0 +1,766 @@
>>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of version 2 of the GNU General Public
>>> + * License as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful, but
>>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>>> + * General Public License for more details.
>>> + */
>>> +#include <linux/kernel.h>
>>> +#include <linux/types.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uaccess.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/skbuff.h>
>>> +#include <asm/unaligned.h>
>>> +
>>> +int bpf64_enable __read_mostly;
>>> +
>>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>>> k,
>>> + unsigned int size);
>>> +
>>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>>> + unsigned int size, void *buffer)
>>> +{
>>> + if (k >= 0)
>>> + return skb_header_pointer(skb, k, size, buffer);
>>> + return bpf_internal_load_pointer_neg_helper(skb, k, size);
>>> +}
>>> +
>>> +static const char *const bpf_class_string[] = {
>>> + "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>>> +};
>>> +
>>> +static const char *const bpf_alu_string[] = {
>>> + "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>>> + "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>>> +};
>>> +
>>> +static const char *const bpf_ldst_string[] = {
>>> + "u32", "u16", "u8", "u64"
>>> +};
>>> +
>>> +static const char *const bpf_jmp_string[] = {
>>> + "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>>> +};
>>> +
>>> +static const char *reg_to_str(int regno, u64 *regs)
>>> +{
>>> + static char reg_value[16][32];
>>> + if (!regs)
>>> + return "";
>>> + snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>>> + regs[regno]);
>>> + return reg_value[regno];
>>> +}
>>> +
>>> +#define R(regno) reg_to_str(regno, regs)
>>> +
>>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>>> +{
>>> + u16 class = BPF_CLASS(insn->code);
>>> + if (class == BPF_ALU || class == BPF_ALU64) {
>>> + if (BPF_SRC(insn->code) == BPF_X)
>>> + pr_info("code_%02x %sr%d%s %s r%d%s\n",
>>> + insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> + insn->a_reg, R(insn->a_reg),
>>> + bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> + insn->x_reg, R(insn->x_reg));
>>> + else
>>> + pr_info("code_%02x %sr%d%s %s %d\n",
>>> + insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> + insn->a_reg, R(insn->a_reg),
>>> + bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> + insn->imm);
>>> + } else if (class == BPF_STX) {
>>> + if (BPF_MODE(insn->code) == BPF_MEM)
>>> + pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>>> + insn->code,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->a_reg, R(insn->a_reg),
>>> + insn->off, insn->x_reg, R(insn->x_reg));
>>> + else if (BPF_MODE(insn->code) == BPF_XADD)
>>> + pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>>> r%d%s\n",
>>> + insn->code,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->a_reg, R(insn->a_reg), insn->off,
>>> + insn->x_reg, R(insn->x_reg));
>>> + else
>>> + pr_info("BUG_%02x\n", insn->code);
>>> + } else if (class == BPF_ST) {
>>> + if (BPF_MODE(insn->code) != BPF_MEM) {
>>> + pr_info("BUG_st_%02x\n", insn->code);
>>> + return;
>>> + }
>>> + pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>>> + insn->code,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>>> + insn->a_reg, R(insn->a_reg),
>>> + insn->off, insn->imm);
>>> + } else if (class == BPF_LDX) {
>>> + if (BPF_MODE(insn->code) == BPF_MEM) {
>>> + pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>>> + insn->code, insn->a_reg,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->x_reg, R(insn->x_reg), insn->off);
>>> + } else {
>>> + pr_info("BUG_ldx_%02x\n", insn->code);
>>> + }
>>> + } else if (class == BPF_LD) {
>>> + if (BPF_MODE(insn->code) == BPF_ABS) {
>>> + pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>>> + insn->code, insn->a_reg,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->imm);
>>> + } else if (BPF_MODE(insn->code) == BPF_IND) {
>>> + pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>>> %+d)\n",
>>> + insn->code, insn->a_reg,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->x_reg, R(insn->x_reg), insn->imm);
>>> + } else {
>>> + pr_info("BUG_ld_%02x\n", insn->code);
>>> + }
>>> + } else if (class == BPF_JMP) {
>>> + u16 opcode = BPF_OP(insn->code);
>>> + if (opcode == BPF_CALL) {
>>> + pr_info("code_%02x call %d\n", insn->code,
>>> insn->imm);
>>> + } else if (insn->code == (BPF_JMP | BPF_JA)) {
>>> + pr_info("code_%02x goto pc%+d\n",
>>> + insn->code, insn->off);
>>> + } else if (BPF_SRC(insn->code) == BPF_X) {
>>> + pr_info("code_%02x if r%d%s %s r%d%s goto
>>> pc%+d\n",
>>> + insn->code, insn->a_reg, R(insn->a_reg),
>>> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> + insn->x_reg, R(insn->x_reg), insn->off);
>>> + } else {
>>> + pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>>> + insn->code, insn->a_reg, R(insn->a_reg),
>>> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> + insn->imm, insn->off);
>>> + }
>>> + } else {
>>> + pr_info("code_%02x %s\n", insn->code,
>>> bpf_class_string[class]);
>>> + }
>>> +}
>>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>>
>>
>> Why would that need to be exported as a symbol?
>
> the performance numbers I mentioned are from bpf_bench that is done
> as kernel module, so I used this for debugging from it.
> also to see what execution traces I get while running trinity bpf fuzzer :)
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through systemtap script that could e.g. be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.
>
>>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>>> (BPF64)
>>> + *
>>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>>> new
>>> + * program length in one pass
>>> + *
>>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>>> + *
>>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>>> remaps
>>> + */
>>
>>
>> Would be nice to have the API comment it in kdoc format.
>
> good point. will do.
>
>>> +int bpf_convert(struct sock_filter *old_prog, int len,
>>> + struct bpf_insn *new_prog, int *p_new_len)
>>> +{
>>
>>
>> If we place it into net/core/filter.c, it would be nice to keep naming
>> conventions consistent, e.g. sk_convert_filter() or so.
>
> ok.
>
>>> + struct bpf_insn *new_insn;
>>> + struct sock_filter *fp;
>>> + int *addrs = NULL;
>>> + int new_len = 0;
>>> + int pass = 0;
>>> + int tgt, i;
>>> +
>>> + if (len <= 0 || len >= BPF_MAXINSNS)
>>> + return -EINVAL;
>>> +
>>> + if (new_prog) {
>>> + addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>>> + if (!addrs)
>>> + return -ENOMEM;
>>> + }
>>> +
>>> +do_pass:
>>> + new_insn = new_prog;
>>> + fp = old_prog;
>>> + for (i = 0; i < len; fp++, i++) {
>>> + struct bpf_insn tmp_insns[3] = {};
>>> + struct bpf_insn *insn = tmp_insns;
>>> +
>>> + if (addrs)
>>> + addrs[i] = new_insn - new_prog;
>>> +
>>> + switch (fp->code) {
>>> + /* all arithmetic insns and skb loads map as-is */
>>> + case BPF_ALU | BPF_ADD | BPF_X:
>>> + case BPF_ALU | BPF_ADD | BPF_K:
>>> + case BPF_ALU | BPF_SUB | BPF_X:
>>> + case BPF_ALU | BPF_SUB | BPF_K:
>>> + case BPF_ALU | BPF_AND | BPF_X:
>>> + case BPF_ALU | BPF_AND | BPF_K:
>>> + case BPF_ALU | BPF_OR | BPF_X:
>>> + case BPF_ALU | BPF_OR | BPF_K:
>>> + case BPF_ALU | BPF_LSH | BPF_X:
>>> + case BPF_ALU | BPF_LSH | BPF_K:
>>> + case BPF_ALU | BPF_RSH | BPF_X:
>>> + case BPF_ALU | BPF_RSH | BPF_K:
>>> + case BPF_ALU | BPF_XOR | BPF_X:
>>> + case BPF_ALU | BPF_XOR | BPF_K:
>>> + case BPF_ALU | BPF_MUL | BPF_X:
>>> + case BPF_ALU | BPF_MUL | BPF_K:
>>> + case BPF_ALU | BPF_DIV | BPF_X:
>>> + case BPF_ALU | BPF_DIV | BPF_K:
>>> + case BPF_ALU | BPF_MOD | BPF_X:
>>> + case BPF_ALU | BPF_MOD | BPF_K:
>>> + case BPF_ALU | BPF_NEG:
>>> + case BPF_LD | BPF_ABS | BPF_W:
>>> + case BPF_LD | BPF_ABS | BPF_H:
>>> + case BPF_LD | BPF_ABS | BPF_B:
>>> + case BPF_LD | BPF_IND | BPF_W:
>>> + case BPF_LD | BPF_IND | BPF_H:
>>> + case BPF_LD | BPF_IND | BPF_B:
>>
>>
>> I think here and elsewhere for similar constructions, we could use
>> BPF_S_* helpers that was introduced by Hagen in commit 01f2f3f6ef4d076
>> ("net: optimize Berkeley Packet Filter (BPF) processing").
>
> well, old bpf opcodes were just unlucky to be on the wrong side of
> compiler heuristics at that time.
> Instead of doing remapping while loading and opposite remapping in
> sk_get_filter()
> the problem could have been solved with few dummy 'case' statements.
> That would have only added few lines of code instead of hundred lines of mapping
> back and forth.
> Also I suspect commit 01f2f3f6ef4d076 was valid only when whole kernel
> is compiled
> with -Os. With -O2 GCC should have done it right.
> Heuristic is: number_of_case_stmts * ratio < range_of_case_values
> and ratio is 3 for -Os and 10 for -O2.
> old bpf has ~70 stmts and ~200 range.
> See expand_switch_as_decision_tree_p() in gcc/stmt.c
> Eventually when current interpreter is retired, I would like to remove
> remapping as well.
> Especially for sk_convert_filter() I would like to keep normal BPF_
> values from uapi/filter.h,
> since for majority of them I reuse as-is and remapping would have
> added unnecessary code.
>
>>> + insn->code = fp->code;
>>> + insn->a_reg = 6;
>>> + insn->x_reg = 7;
>>> + insn->imm = fp->k;
>>> + break;
>>> +
>>> + /* jump opcodes map as-is, but offsets need adjustment */
>>> + case BPF_JMP | BPF_JA:
>>> + tgt = i + fp->k + 1;
>>> + insn->code = fp->code;
>>> +#define EMIT_JMP \
>>> + do { \
>>> + if (tgt >= len || tgt < 0) \
>>> + goto err; \
>>> + insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>>> + } while (0)
>>> +
>>> + EMIT_JMP;
>>> + break;
>>> +
>>> + case BPF_JMP | BPF_JEQ | BPF_K:
>>> + case BPF_JMP | BPF_JEQ | BPF_X:
>>> + case BPF_JMP | BPF_JSET | BPF_K:
>>> + case BPF_JMP | BPF_JSET | BPF_X:
>>> + case BPF_JMP | BPF_JGT | BPF_K:
>>> + case BPF_JMP | BPF_JGT | BPF_X:
>>> + case BPF_JMP | BPF_JGE | BPF_K:
>>> + case BPF_JMP | BPF_JGE | BPF_X:
>>> + insn->a_reg = 6;
>>> + insn->x_reg = 7;
>>> + insn->imm = fp->k;
>>> + /* common case where 'jump_false' is next insn */
>>> + if (fp->jf == 0) {
>>> + insn->code = fp->code;
>>> + tgt = i + fp->jt + 1;
>>> + EMIT_JMP;
>>> + break;
>>> + }
>>> + /* convert JEQ into JNE when 'jump_true' is next
>>> insn */
>>> + if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>>> + insn->code = BPF_JMP | BPF_JNE |
>>> + BPF_SRC(fp->code);
>>> + tgt = i + fp->jf + 1;
>>> + EMIT_JMP;
>>> + break;
>>> + }
>>> + /* other jumps are mapped into two insns: Jxx and
>>> JA */
>>> + tgt = i + fp->jt + 1;
>>> + insn->code = fp->code;
>>> + EMIT_JMP;
>>> +
>>> + insn++;
>>> + insn->code = BPF_JMP | BPF_JA;
>>> + tgt = i + fp->jf + 1;
>>> + EMIT_JMP;
>>> + /* adjust pc relative offset, since it's a 2nd
>>> insn */
>>> + insn->off--;
>>> + break;
>>> +
>>> + /* ldxb 4*([14]&0xf) is remaped into 3 insns */
>>> + case BPF_LDX | BPF_MSH | BPF_B:
>>> + insn->code = BPF_LD | BPF_ABS | BPF_B;
>>> + insn->a_reg = 7;
>>> + insn->imm = fp->k;
>>> +
>>> + insn++;
>>> + insn->code = BPF_ALU | BPF_AND | BPF_K;
>>> + insn->a_reg = 7;
>>> + insn->imm = 0xf;
>>> +
>>> + insn++;
>>> + insn->code = BPF_ALU | BPF_LSH | BPF_K;
>>> + insn->a_reg = 7;
>>> + insn->imm = 2;
>>> + break;
>>> +
>>> + /* RET_K, RET_A are remaped into 2 insns */
>>> + case BPF_RET | BPF_A:
>>> + case BPF_RET | BPF_K:
>>> + insn->code = BPF_ALU | BPF_MOV |
>>> + (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>>> BPF_X);
>>> + insn->a_reg = 0;
>>> + insn->x_reg = 6;
>>> + insn->imm = fp->k;
>>> +
>>> + insn++;
>>> + insn->code = BPF_RET | BPF_K;
>>> + break;
>>> +
>>> + /* store to stack */
>>> + case BPF_ST:
>>> + case BPF_STX:
>>> + insn->code = BPF_STX | BPF_MEM | BPF_W;
>>> + insn->a_reg = 10;
>>> + insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>>> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> + break;
>>> +
>>> + /* load from stack */
>>> + case BPF_LD | BPF_MEM:
>>> + case BPF_LDX | BPF_MEM:
>>> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> + insn->x_reg = 10;
>>> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> + break;
>>> +
>>> + /* A = K or X = K */
>>> + case BPF_LD | BPF_IMM:
>>> + case BPF_LDX | BPF_IMM:
>>> + insn->code = BPF_ALU | BPF_MOV | BPF_K;
>>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> + insn->imm = fp->k;
>>> + break;
>>> +
>>> + /* X = A */
>>> + case BPF_MISC | BPF_TAX:
>>> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> + insn->a_reg = 7;
>>> + insn->x_reg = 6;
>>> + break;
>>> +
>>> + /* A = X */
>>> + case BPF_MISC | BPF_TXA:
>>> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> + insn->a_reg = 6;
>>> + insn->x_reg = 7;
>>> + break;
>>> +
>>> + /* A = skb->len or X = skb->len */
>>> + case BPF_LD|BPF_W|BPF_LEN:
>>> + case BPF_LDX|BPF_W|BPF_LEN:
>>> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> + insn->x_reg = 1;
>>> + insn->off = offsetof(struct sk_buff, len);
>>> + break;
>>> +
>>> + default:
>>> + /* pr_err("unknown opcode %02x\n", fp->code); */
>>> + goto err;
>>> + }
>>> +
>>> + insn++;
>>> + if (new_prog) {
>>> + memcpy(new_insn, tmp_insns,
>>> + sizeof(*insn) * (insn - tmp_insns));
>>> + }
>>> + new_insn += insn - tmp_insns;
>>> + }
>>> +
>>> + if (!new_prog) {
>>> + /* only calculating new length */
>>> + *p_new_len = new_insn - new_prog;
>>> + return 0;
>>> + }
>>> +
>>> + pass++;
>>> + if (new_len != new_insn - new_prog) {
>>> + new_len = new_insn - new_prog;
>>> + if (pass > 2)
>>> + goto err;
>>> + goto do_pass;
>>> + }
>>> + kfree(addrs);
>>> + if (*p_new_len != new_len)
>>> + /* inconsistent new program length */
>>> + pr_err("bpf_convert() usage error\n");
>>> + return 0;
>>> +err:
>>> + kfree(addrs);
>>> + return -EINVAL;
>>> +}
>>> +
>>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>>> +{
>>
>>
>> Similarly, sk_run_filter_ext()?
>
> ok.
>
>>> + u64 stack[64];
>>> + u64 regs[16];
>>> + void *ptr;
>>> + u64 tmp;
>>> + int off;
>>> +
>>> +#ifdef __x86_64
>>> +#define LOAD_IMM /**/
>>> +#define K insn->imm
>>> +#else
>>> +#define LOAD_IMM (K = insn->imm)
>>> + s32 K = insn->imm;
>>> +#endif
>>> +
>>> +#ifdef DEBUG
>>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>>> +#else
>>> +#define DEBUG_INSN
>>> +#endif
>>
>>
>> This DEBUG hunk could then be removed when we use a stap script
>> instead, for example.
>
> ok
>
>>> +#define A regs[insn->a_reg]
>>> +#define X regs[insn->x_reg]
>>> +
>>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>>> +#define L(A, B, C) L_##A##B##C:
>>
>>
>> Could we get rid of these two defines? I know you're defining
>
> I think it's actually more readable this way and lesser chance of typos.
> DL - define label
> L - label
> will try to remove 2nd marco to see how it looks...
>
>> and using that as labels, but it's not obvious at first what
>> 'jt' does. Maybe 'jt_labels' ?
>
> ok will rename.
>
>>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>
>>
>> Not a big fan of macros, but here it seems fine though.
>>
>>
>>> +/* some compilers may need help:
>>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>>> })
>>> + */
>>> +
>>> + static const void *jt[256] = {
>>> + [0 ... 255] = &&L_default,
>>
>>
>> Nit: please just one define per line below:
>
> ok.
>
>>> + DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>>> + DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>>> + DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>>> + DL(BPF_ALU, BPF_OR, BPF_X) DL(BPF_ALU, BPF_OR, BPF_K)
>>> + DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>>> + DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>>> + DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>>> + DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>>> + DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>>> + DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>>> + DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>>> + DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_OR, BPF_X) DL(BPF_ALU64, BPF_OR, BPF_K)
>>> + DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> + DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> + DL(BPF_ALU, BPF_NEG, 0)
>>> + DL(BPF_JMP, BPF_CALL, 0)
>>> + DL(BPF_JMP, BPF_JA, 0)
>>> + DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>>> + DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>>> + DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>>> + DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>>> + DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>>> + DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>>> + DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>>> + DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>>> + DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>>> + DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>>> + DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>>> + DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>>> + DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>>> + DL(BPF_STX, BPF_XADD, BPF_W)
>>> +#ifdef CONFIG_64BIT
>>> + DL(BPF_STX, BPF_XADD, BPF_DW)
>>> +#endif
>>> + DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>>> + DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>>> + DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>>> + DL(BPF_RET, BPF_K, 0)
>>> + };
>>> +
>>> + regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>>> + regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>>> +
>>> + DEBUG_INSN;
>>> + /* execute 1st insn */
>>> +select_insn:
>>> + goto *jt[insn->code];
>>> +
>>> + /* ALU */
>>> +#define ALU(OPCODE, OP) \
>>> + L_BPF_ALU64##OPCODE##BPF_X: \
>>> + A = A OP X; \
>>> + CONT; \
>>> + L_BPF_ALU##OPCODE##BPF_X: \
>>> + A = (u32)A OP (u32)X; \
>>> + CONT; \
>>> + L_BPF_ALU64##OPCODE##BPF_K: \
>>> + A = A OP K; \
>>> + CONT; \
>>> + L_BPF_ALU##OPCODE##BPF_K: \
>>> + A = (u32)A OP (u32)K; \
>>> + CONT;
>>> +
>>> + ALU(BPF_ADD, +)
>>> + ALU(BPF_SUB, -)
>>> + ALU(BPF_AND, &)
>>> + ALU(BPF_OR, |)
>>> + ALU(BPF_LSH, <<)
>>> + ALU(BPF_RSH, >>)
>>> + ALU(BPF_XOR, ^)
>>> + ALU(BPF_MUL, *)
>>> +
>>> + L(BPF_ALU, BPF_NEG, 0)
>>> + A = (u32)-A;
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOV, BPF_X)
>>> + A = (u32)X;
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOV, BPF_K)
>>> + A = (u32)K;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOV, BPF_X)
>>> + A = X;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOV, BPF_K)
>>> + A = K;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_ARSH, BPF_X)
>>> + (*(s64 *) &A) >>= X;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_ARSH, BPF_K)
>>> + (*(s64 *) &A) >>= K;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOD, BPF_X)
>>> + tmp = A;
>>> + if (X)
>>> + A = do_div(tmp, X);
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOD, BPF_X)
>>> + tmp = (u32)A;
>>> + if (X)
>>> + A = do_div(tmp, (u32)X);
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOD, BPF_K)
>>> + tmp = A;
>>> + if (K)
>>> + A = do_div(tmp, K);
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOD, BPF_K)
>>> + tmp = (u32)A;
>>> + if (K)
>>> + A = do_div(tmp, (u32)K);
>>> + CONT;
>>> + L(BPF_ALU64, BPF_DIV, BPF_X)
>>> + if (X)
>>> + do_div(A, X);
>>> + CONT;
>>> + L(BPF_ALU, BPF_DIV, BPF_X)
>>> + tmp = (u32)A;
>>> + if (X)
>>> + do_div(tmp, (u32)X);
>>> + A = (u32)tmp;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_DIV, BPF_K)
>>> + if (K)
>>> + do_div(A, K);
>>> + CONT;
>>> + L(BPF_ALU, BPF_DIV, BPF_K)
>>> + tmp = (u32)A;
>>> + if (K)
>>> + do_div(tmp, (u32)K);
>>> + A = (u32)tmp;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> + A = swab32(A);
>>> + CONT;
>>> + L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> + A = swab64(A);
>>> + CONT;
>>> +
>>> + /* CALL */
>>> + L(BPF_JMP, BPF_CALL, 0)
>>> + /* TODO execute_func(K, regs); */
>>> + CONT;
>>
>>
>> Would the filter checker for now just return -EINVAL when this is used?
>
> sure.
> I was planning to address that in a week or so, but ok.
> Will remove 'call' insn in the first patch and will re-add in a clean
> way in a next diff.
>
>>> + /* JMP */
>>> + L(BPF_JMP, BPF_JA, 0)
>>> + insn += insn->off;
>>> + CONT;
>>> + L(BPF_JMP, BPF_JEQ, BPF_X)
>>> + if (A == X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JEQ, BPF_K)
>>> + if (A == K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JNE, BPF_X)
>>> + if (A != X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JNE, BPF_K)
>>> + if (A != K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGT, BPF_X)
>>> + if (A > X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGT, BPF_K)
>>> + if (A > K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGE, BPF_X)
>>> + if (A >= X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGE, BPF_K)
>>> + if (A >= K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGT, BPF_X)
>>> + if (((s64)A) > ((s64)X)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGT, BPF_K)
>>> + if (((s64)A) > ((s64)K)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGE, BPF_X)
>>> + if (((s64)A) >= ((s64)X)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGE, BPF_K)
>>> + if (((s64)A) >= ((s64)K)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSET, BPF_X)
>>> + if (A & X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSET, BPF_K)
>>> + if (A & (u32)K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>
>>
>> Okay, so right now we only support forward jumps, right? And these
>> are checked by the old checker code I'd assume?
>
> yes. just like old interpreter new interpreter doesn't care. It just jumps.
> and you're correct. old checker code will allow only forward jumps.
> That restriction is preserved by sk_convert_filter() as well.
>
>>> + /* STX and ST and LDX*/
>>> +#define LDST(SIZEOP, SIZE) \
>>> + L_BPF_STXBPF_MEM##SIZEOP: \
>>> + *(SIZE *)(ulong)(A + insn->off) = X; \
>>> + CONT; \
>>> + L_BPF_STBPF_MEM##SIZEOP: \
>>> + *(SIZE *)(ulong)(A + insn->off) = K; \
>>> + CONT; \
>>> + L_BPF_LDXBPF_MEM##SIZEOP: \
>>> + A = *(SIZE *)(ulong)(X + insn->off); \
>>> + CONT;
>>> +
>>> + LDST(BPF_B, u8)
>>> + LDST(BPF_H, u16)
>>> + LDST(BPF_W, u32)
>>> + LDST(BPF_DW, u64)
>>> +
>>> + /* STX XADD */
>>> + L(BPF_STX, BPF_XADD, BPF_W)
>>> + atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>>> + CONT;
>>> +#ifdef CONFIG_64BIT
>>> + L(BPF_STX, BPF_XADD, BPF_DW)
>>> + atomic64_add((u64)X, (atomic64_t *)(ulong)(A +
>>> insn->off));
>>> + CONT;
>>> +#endif
>>> +
>>> + /* LD from SKB + K */
>>
>>
>> Nit: indent one tab too much (here and elsewhere)
>
> ahh. ok.
>
>>> + L(BPF_LD, BPF_ABS, BPF_W)
>>> + off = K;
>>> +load_word:
>>> + ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>>> + if (likely(ptr != NULL)) {
>>> + A = get_unaligned_be32(ptr);
>>> + CONT;
>>> + }
>>> + return 0;
>>> +
>>> + L(BPF_LD, BPF_ABS, BPF_H)
>>> + off = K;
>>> +load_half:
>>> + ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>>> + if (likely(ptr != NULL)) {
>>> + A = get_unaligned_be16(ptr);
>>> + CONT;
>>> + }
>>> + return 0;
>>> +
>>> + L(BPF_LD, BPF_ABS, BPF_B)
>>> + off = K;
>>> +load_byte:
>>> + ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>>> + if (likely(ptr != NULL)) {
>>> + A = *(u8 *)ptr;
>>> + CONT;
>>> + }
>>> + return 0;
>>> +
>>> + /* LD from SKB + X + K */
>>
>>
>> Nit: ditto
>
> ok
>
>>> + L(BPF_LD, BPF_IND, BPF_W)
>>> + off = K + X;
>>> + goto load_word;
>>> +
>>> + L(BPF_LD, BPF_IND, BPF_H)
>>> + off = K + X;
>>> + goto load_half;
>>> +
>>> + L(BPF_LD, BPF_IND, BPF_B)
>>> + off = K + X;
>>> + goto load_byte;
>>> +
>>> + /* RET */
>>> + L(BPF_RET, BPF_K, 0)
>>> + return regs[0/* R0 */];
>>> +
>>> + L_default:
>>> + /* bpf_check() or bpf_convert() will guarantee that
>>> + * we never reach here
>>> + */
>>> + WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>>> + return 0;
>>> +#undef DL
>>> +#undef L
>>> +#undef CONT
>>> +#undef A
>>> +#undef X
>>> +#undef K
>>> +#undef LOAD_IMM
>>> +#undef DEBUG_INSN
>>> +#undef ALU
>>> +#undef LDST
>>> +}
>>> +EXPORT_SYMBOL(bpf_run);
>>> +
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index ad30d626a5bd..575bf5fd4335 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>> {
>>> struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>>
>>> + if ((void *)fp->bpf_func == (void *)bpf_run)
>>> + /* arch specific jit_free are expecting this value */
>>> + fp->bpf_func = sk_run_filter;
>>> +
>>> bpf_jit_free(fp);
>>> }
>>> EXPORT_SYMBOL(sk_filter_release_rcu);
>>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>>> return 0;
>>> }
>>>
>>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog
>>> *fprog,
>>> + struct sock *sk)
>>> +{
>>
>>
>> sk_prepare_filter_ext()?
>
> ok.
>
>>> + unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>>> + struct sock_filter *old_prog;
>>> + unsigned int sk_fsize;
>>> + struct sk_filter *fp;
>>> + int new_len;
>>> + int err;
>>> +
>>> + /* store old program into buffer, since chk_filter will remap
>>> opcodes */
>>> + old_prog = kmalloc(fsize, GFP_KERNEL);
>>> + if (!old_prog)
>>> + return -ENOMEM;
>>> +
>>> + if (sk) {
>>> + if (copy_from_user(old_prog, fprog->filter, fsize)) {
>>> + err = -EFAULT;
>>> + goto free_prog;
>>> + }
>>> + } else {
>>> + memcpy(old_prog, fprog->filter, fsize);
>>> + }
>>> +
>>> + /* calculate bpf64 program length */
>>> + err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>>> + if (err)
>>> + goto free_prog;
>>> +
>>> + sk_fsize = sk_filter_size(new_len);
>>> + /* allocate sk_filter to store bpf64 program */
>>> + if (sk)
>>> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> + else
>>> + fp = kmalloc(sk_fsize, GFP_KERNEL);
>>> + if (!fp) {
>>> + err = -ENOMEM;
>>> + goto free_prog;
>>> + }
>>> +
>>> + /* remap old insns into bpf64 */
>>> + err = bpf_convert(old_prog, fprog->len,
>>> + (struct bpf_insn *)fp->insns, &new_len);
>>> + if (err)
>>> + /* 2nd bpf_convert() can fail only if it fails
>>> + * to allocate memory, remapping must succeed
>>> + */
>>> + goto free_fp;
>>> +
>>> + /* now chk_filter can overwrite old_prog while checking */
>>> + err = sk_chk_filter(old_prog, fprog->len);
>>> + if (err)
>>> + goto free_fp;
>>> +
>>> + /* discard old prog */
>>> + kfree(old_prog);
>>> +
>>> + atomic_set(&fp->refcnt, 1);
>>> + fp->len = new_len;
>>> +
>>> + /* bpf64 insns must be executed by bpf_run */
>>> + fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>>> +
>>> + *pfp = fp;
>>> + return 0;
>>> +free_fp:
>>> + if (sk)
>>> + sock_kfree_s(sk, fp, sk_fsize);
>>> + else
>>> + kfree(fp);
>>> +free_prog:
>>> + kfree(old_prog);
>>> + return err;
>>> +}
>>> +
>>> /**
>>> * sk_unattached_filter_create - create an unattached filter
>>> * @fprog: the filter program
>>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter
>>> **pfp,
>>> if (fprog->filter == NULL)
>>> return -EINVAL;
>>>
>>> + if (bpf64_enable)
>>> + return bpf64_prepare(pfp, fprog, NULL);
>>> +
>>> fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>>> if (!fp)
>>> return -ENOMEM;
>>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog,
>>> struct sock *sk)
>>> if (fprog->filter == NULL)
>>> return -EINVAL;
>>>
>>> - fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> - if (!fp)
>>> - return -ENOMEM;
>>> - if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> - sock_kfree_s(sk, fp, sk_fsize);
>>> - return -EFAULT;
>>> - }
>>> + if (bpf64_enable) {
>>> + err = bpf64_prepare(&fp, fprog, sk);
>>> + if (err)
>>> + return err;
>>> + } else {
>>> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> + if (!fp)
>>> + return -ENOMEM;
>>> + if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> + sock_kfree_s(sk, fp, sk_fsize);
>>> + return -EFAULT;
>>> + }
>>>
>>> - atomic_set(&fp->refcnt, 1);
>>> - fp->len = fprog->len;
>>> + atomic_set(&fp->refcnt, 1);
>>> + fp->len = fprog->len;
>>>
>>> - err = __sk_prepare_filter(fp);
>>> - if (err) {
>>> - sk_filter_uncharge(sk, fp);
>>> - return err;
>>> + err = __sk_prepare_filter(fp);
>>> + if (err) {
>>> + sk_filter_uncharge(sk, fp);
>>> + return err;
>>> + }
>>> }
>>>
>>> old_fp = rcu_dereference_protected(sk->sk_filter,
>>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>>> index cf9cd13509a7..f03acc0e8950 100644
>>> --- a/net/core/sysctl_net_core.c
>>> +++ b/net/core/sysctl_net_core.c
>>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>>> },
>>> #endif
>>> {
>>> + .procname = "bpf64_enable",
>>> + .data = &bpf64_enable,
>>> + .maxlen = sizeof(int),
>>> + .mode = 0644,
>>> + .proc_handler = proc_dointvec
>>> + },
>>> + {
>>> .procname = "netdev_tstamp_prequeue",
>>> .data = &netdev_tstamp_prequeue,
>>> .maxlen = sizeof(int),
>>>
>>
>> Hope some of the comments made sense. ;-)
>
> Yes. Indeed. Thanks a lot for the thorough review!
> Will address things and resend V4.
>
> Alexei
>
>> Thanks,
>>
>> Daniel

2014-03-01 03:22:58

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter

On 02/28/2014 09:53 PM, Alexei Starovoitov wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <[email protected]> wrote:
...
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.

Here's [1] actually some code snippet for user space for prctl(). The
libseccomp [2] actually wraps around that and makes usage easier.

[1] http://outflux.net/teach-seccomp/
[2] http://lwn.net/Articles/491308/

...
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> agree. we need consistent naming for both (old and new).
> I'll try all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm kinda leaning to sk_filter64, since it's easier to quickly spot
> the difference
> and more descriptive.

Just saw your second email regarding sk_filter_ext() et al, yep, I agree.

>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load to immediate values, right?
>
> there is no insn that use 64-bit immediate, since 64-bit immediates
> are extremely rare. grep x86-64 asm code for movabsq will return very few.
> llvm or gcc can easily construct any constant by combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
> 32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
> Jxx is painless.

Hm, fair enough, I was just thinking for comparisons of IPv6 addresses
when we do socket filtering. On the other hand, old and new insns are
both 64 bit wide and can be used though the same api then.

> Notice I added signed comparison, since real life programs cannot do
> without them.
> I also kept the spirit of old bpf having > and >= only. (no < and <=)
> that made llvm/gcc backends a bit harder to do, since no real cpu has
> such limitations.

Hehe, I proposed them once, but for low level BPF it was just easier to
adjust jt/jf offsets differently to achieve the same.

> I'm still contemplating do add < and <= (both signed and unsigned), which is
> interesting trade-off: number of instructions vs complexity of compiler
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> in tact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
> convert to new
> sk_chk_filter() - check old bpf
> use new interpreter
> } else {
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use old jit
> else
> use old interpreter
> }
> soon I'll add bpf64 jit and will reuse the same bpf_jit_enable knob for it.
> then add new/old inband demux into sk_attach_filter(),
> so that workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
> sk_chk_filter64() - check new bpf
> if (bpf_jit_enable)
> use new jit
> else
> use new interpreter
> } else {
> if (bpf64_enable) {
> convert to new
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use new jit
> else
> use new interpreter
> } else {
> sk_chk_filter()
> if (bpf_jit_enable)
> use old jit
> else
> use old interpreter
> }
> }
> eventually bpf64_enable can be made default,
> the last 'else' can be retired and 'bpf64_enable' removed along with
> old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.

Okay, cool. I think it seems reasonable to keep this knob around anyway.
E.g. for seccomp some people might argue speed is important, maybe other
more security related distros might not want to rely on JIT and therefore
trade performance.

...
>> Why would that need to be exported as a symbol?
>
> the performance numbers I mentioned are from bpf_bench that is done
> as kernel module, so I used this for debugging from it.
> also to see what execution traces I get while running trinity bpf fuzzer :)
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through systemtap script that could e.g. be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.

Sounds great!

Thanks,

Daniel