Hi All,
v1->v2:
This patch set is almost a full rewrite of the earlier umh modules approach.
The v1 patches and the follow-up discussion were covered by LWN:
https://lwn.net/Articles/749108/
I believe v2 addresses all issues brought up by Andy and others.
Most importantly, there are zero changes to kernel/module.c.
Instead of teaching the module loading logic to recognize a special
umh module, v2 lets a normal kernel module execute part of its own
.init.rodata as a new user space process (Andy's idea).
Patch 1 introduces this new helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
Input:
data + len == executable file
Output:
struct umh_info {
struct file *pipe_to_umh;
struct file *pipe_from_umh;
pid_t pid;
};
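To make the shape of that output easier to picture, here is a rough user-space analogue, not the kernel helper itself: a parent forks a child wired up with two pipes and remembers its pid. The names demo_helper, demo_spawn() and demo_shutdown() are made up for this sketch, and the child simply echoes bytes back instead of running an embedded ELF blob.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for struct umh_info. */
struct demo_helper {
	int to_helper;   /* parent writes requests here */
	int from_helper; /* parent reads replies here */
	pid_t pid;       /* helper process id */
};

/* Child body: echo every byte from stdin back to stdout until EOF. */
static void demo_helper_main(void)
{
	char c;

	while (read(0, &c, 1) == 1)
		if (write(1, &c, 1) != 1)
			break;
	_exit(0);
}

/*
 * Fork a helper connected through two pipes. Returns 0 on success, -1 on
 * failure. Assumes fds 0 and 1 are already open in the parent.
 */
int demo_spawn(struct demo_helper *h)
{
	int to[2], from[2];

	if (pipe(to))
		return -1;
	if (pipe(from)) {
		close(to[0]);
		close(to[1]);
		return -1;
	}
	h->pid = fork();
	if (h->pid < 0) {
		close(to[0]);
		close(to[1]);
		close(from[0]);
		close(from[1]);
		return -1;
	}
	if (h->pid == 0) {
		/* Child: the pipes become stdin and stdout. */
		dup2(to[0], 0);
		dup2(from[1], 1);
		close(to[0]);
		close(to[1]);
		close(from[0]);
		close(from[1]);
		demo_helper_main();
	}
	/* Parent: keep only its own ends of the pipes. */
	close(to[0]);
	close(from[1]);
	h->to_helper = to[1];
	h->from_helper = from[0];
	return 0;
}

/*
 * Kill the helper by pid and drop both pipe ends. Returns the terminating
 * signal number, 0 if the helper exited on its own, or -1 on error.
 */
int demo_shutdown(struct demo_helper *h)
{
	int status;

	kill(h->pid, SIGKILL);
	close(h->to_helper);
	close(h->from_helper);
	if (waitpid(h->pid, &status, 0) != h->pid)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A caller would run demo_spawn(), exchange data over to_helper and from_helper, and finish with demo_shutdown(), which roughly mirrors what 'rmmod bpfilter' does to the umh.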
Advantages vs v1:
- the embedded user mode executable is stored as .init.rodata inside
a normal kernel module. These pages are freed when the .ko finishes loading
- the elf file is copied into a tmpfs file, so the user mode process is swappable.
- the communication between the user mode process and its 'parent' kernel module
is done via two unix pipes, hence the protocol is not exposed to
user space
- it is impossible to launch the umh on its own (that was the main issue of v1)
and impossible to be a man-in-the-middle due to the pipes
- bpfilter.ko consists of a tiny kernel part that passes the data
between kernel and umh via pipes and a much bigger umh part that
does all the work
- 'lsmod' shows bpfilter.ko as usual.
'rmmod bpfilter' removes kernel module and kills corresponding umh
- signed bpfilter.ko covers the whole image including umh code
A few issues:
- architecturally bpfilter.ko can be builtin, but that doesn't work yet.
Still debugging. Kinda cool to have user mode executables
be part of vmlinux
- the user can still attach to the process and debug it with
'gdb /proc/pid/exe pid', but 'gdb -p pid' doesn't work.
(a bit worse compared to v1)
- tinyconfig will notice a small increase in .text
+766 | TEXT | 7c8b94806bec umh: introduce fork_usermode_blob() helper
More details are in patches 1 and 2, which are ready to land.
Patches 3 and 4 are still rough. They were mainly used for
testing and to demonstrate how bpfilter builds on top.
The patch 4 approach of converting one iptable rule to a few bpf
instructions will certainly change in the future, since it doesn't
scale to thousands of rules.
Alexei Starovoitov (2):
umh: introduce fork_usermode_blob() helper
net: add skeleton of bpfilter kernel module
Daniel Borkmann (1):
bpfilter: rough bpfilter codegen example hack
David S. Miller (1):
bpfilter: add iptable get/set parsing
fs/exec.c | 38 ++++-
include/linux/binfmts.h | 1 +
include/linux/bpfilter.h | 15 ++
include/linux/umh.h | 12 ++
include/uapi/linux/bpfilter.h | 200 ++++++++++++++++++++++
kernel/umh.c | 176 +++++++++++++++++++-
net/Kconfig | 2 +
net/Makefile | 1 +
net/bpfilter/Kconfig | 17 ++
net/bpfilter/Makefile | 24 +++
net/bpfilter/bpfilter_kern.c | 93 +++++++++++
net/bpfilter/bpfilter_mod.h | 373 ++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/ctor.c | 91 +++++++++++
net/bpfilter/gen.c | 290 ++++++++++++++++++++++++++++++++
net/bpfilter/init.c | 36 ++++
net/bpfilter/main.c | 117 +++++++++++++
net/bpfilter/msgfmt.h | 17 ++
net/bpfilter/sockopt.c | 236 ++++++++++++++++++++++++++
net/bpfilter/tables.c | 73 +++++++++
net/bpfilter/targets.c | 51 ++++++
net/bpfilter/tgts.c | 26 +++
net/ipv4/Makefile | 2 +
net/ipv4/bpfilter/Makefile | 2 +
net/ipv4/bpfilter/sockopt.c | 42 +++++
net/ipv4/ip_sockglue.c | 17 ++
25 files changed, 1940 insertions(+), 12 deletions(-)
create mode 100644 include/linux/bpfilter.h
create mode 100644 include/uapi/linux/bpfilter.h
create mode 100644 net/bpfilter/Kconfig
create mode 100644 net/bpfilter/Makefile
create mode 100644 net/bpfilter/bpfilter_kern.c
create mode 100644 net/bpfilter/bpfilter_mod.h
create mode 100644 net/bpfilter/ctor.c
create mode 100644 net/bpfilter/gen.c
create mode 100644 net/bpfilter/init.c
create mode 100644 net/bpfilter/main.c
create mode 100644 net/bpfilter/msgfmt.h
create mode 100644 net/bpfilter/sockopt.c
create mode 100644 net/bpfilter/tables.c
create mode 100644 net/bpfilter/targets.c
create mode 100644 net/bpfilter/tgts.c
create mode 100644 net/ipv4/bpfilter/Makefile
create mode 100644 net/ipv4/bpfilter/sockopt.c
--
2.9.5
bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
and user mode helper code that is embedded into bpfilter.ko.
The steps to build bpfilter.ko are the following:
- main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
- with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
is converted into the bpfilter_umh.o object file
with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
Example:
$ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
- bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
bpfilter_kern.c is normal kernel module code that calls
the fork_usermode_blob() helper to execute part of its own data
as a user mode process.
Notice that the _binary_net_bpfilter_bpfilter_umh_start .. _end range
is placed into the .init.rodata section, so it's freed as soon as the __init
function of bpfilter.ko has finished.
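For readers unfamiliar with the start/end symbol trick, the sketch below mimics it in a single user-space file. Instead of objcopy it uses a top-level asm statement to drop a few bytes into .rodata between two labels; the symbol names (demo_blob_start, demo_blob_end) and the blob contents are invented for illustration, and it assumes an ELF target with a GNU-compatible assembler.

```c
#include <stddef.h>

/*
 * Emit a few bytes into .rodata bracketed by two global labels, similar in
 * spirit to the _binary_..._start/_end symbols that objcopy generates.
 */
__asm__(".pushsection .rodata\n"
	".globl demo_blob_start\n"
	"demo_blob_start:\n"
	".ascii \"\\177ELF-demo\"\n"
	".globl demo_blob_end\n"
	"demo_blob_end:\n"
	".popsection\n");

/* The labels are only addresses; declare them as plain chars. */
extern const char demo_blob_start;
extern const char demo_blob_end;

/* Pointer to the first byte of the embedded blob. */
const char *demo_blob_data(void)
{
	return &demo_blob_start;
}

/* Blob length, computed the same way load_umh() computes it. */
size_t demo_blob_len(void)
{
	return (size_t)(&demo_blob_end - &demo_blob_start);
}
```

The length comes purely from the distance between the two labels, which is why bpfilter_kern.c can pass &UMH_end - &UMH_start to fork_usermode_blob() without a separate size symbol.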
As part of __init, bpfilter.ko performs a first request/reply exchange
via the two unix pipes provided by the fork_usermode_blob() helper to
make sure that the umh is healthy. If it is not, bpfilter.ko kills it via its pid.
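The request/reply exchange can be sketched in plain user space as follows. The two structs copy the field layout of msgfmt.h from this patch (with stdint types); the functions demo_send(), demo_serve_one() and demo_recv() are hypothetical names for this sketch and only approximate what bpfilter_kern.c and main.c do. Both sides treat anything other than a full-sized read or write as a failure.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Same field layout as msgfmt.h in this patch, with stdint types. */
struct mbox_request {
	uint64_t addr;
	uint32_t len;
	uint32_t is_set;
	uint32_t cmd;
	uint32_t pid;
};

struct mbox_reply {
	uint32_t status;
};

/* Module side: send one request; 0 on success, -1 on a short write. */
int demo_send(int fd, uint32_t cmd, uint32_t is_set)
{
	struct mbox_request req = {
		.cmd = cmd,
		.is_set = is_set,
		.pid = (uint32_t)getpid(),
	};

	return write(fd, &req, sizeof(req)) == (ssize_t)sizeof(req) ? 0 : -1;
}

/* Helper side: read exactly one request and answer it. */
int demo_serve_one(int in_fd, int out_fd)
{
	struct mbox_request req;
	struct mbox_reply reply;

	if (read(in_fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;
	/* cmd 0 on the "get" path is the health check; reject the rest. */
	if (!req.is_set && req.cmd == 0)
		reply.status = 0;
	else
		reply.status = (uint32_t)-ENOPROTOOPT;
	if (write(out_fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
		return -1;
	return 0;
}

/* Module side: read the reply and hand back the status as an int. */
int demo_recv(int fd, int *status)
{
	struct mbox_reply reply;

	if (read(fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
		return -1;
	*status = (int)reply.status;
	return 0;
}
```

With two pipes, a health check is demo_send(to, 0, 0), one demo_serve_one() call on the helper side, then demo_recv(); a zero status means the helper is alive and speaks the protocol.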
Later bpfilter_process_sockopt() will be called from the bpfilter hooks
in get/setsockopt() to pass iptable commands into the umh via bpfilter.ko.
If the admin does 'rmmod bpfilter', the __exit code of bpfilter.ko will
kill the umh as well.
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/linux/bpfilter.h | 15 +++++++
include/uapi/linux/bpfilter.h | 21 ++++++++++
net/Kconfig | 2 +
net/Makefile | 1 +
net/bpfilter/Kconfig | 17 ++++++++
net/bpfilter/Makefile | 24 +++++++++++
net/bpfilter/bpfilter_kern.c | 93 +++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/main.c | 63 +++++++++++++++++++++++++++++
net/bpfilter/msgfmt.h | 17 ++++++++
net/ipv4/Makefile | 2 +
net/ipv4/bpfilter/Makefile | 2 +
net/ipv4/bpfilter/sockopt.c | 42 +++++++++++++++++++
net/ipv4/ip_sockglue.c | 17 ++++++++
13 files changed, 316 insertions(+)
create mode 100644 include/linux/bpfilter.h
create mode 100644 include/uapi/linux/bpfilter.h
create mode 100644 net/bpfilter/Kconfig
create mode 100644 net/bpfilter/Makefile
create mode 100644 net/bpfilter/bpfilter_kern.c
create mode 100644 net/bpfilter/main.c
create mode 100644 net/bpfilter/msgfmt.h
create mode 100644 net/ipv4/bpfilter/Makefile
create mode 100644 net/ipv4/bpfilter/sockopt.c
diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
new file mode 100644
index 000000000000..687b1760bb9f
--- /dev/null
+++ b/include/linux/bpfilter.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_H
+#define _LINUX_BPFILTER_H
+
+#include <uapi/linux/bpfilter.h>
+
+struct sock;
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
+ unsigned int optlen);
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
+ int __user *optlen);
+extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+ char __user *optval,
+ unsigned int optlen, bool is_set);
+#endif
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
new file mode 100644
index 000000000000..2ec3cc99ea4c
--- /dev/null
+++ b/include/uapi/linux/bpfilter.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BPFILTER_H
+#define _UAPI_LINUX_BPFILTER_H
+
+#include <linux/if.h>
+
+enum {
+ BPFILTER_IPT_SO_SET_REPLACE = 64,
+ BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
+ BPFILTER_IPT_SET_MAX,
+};
+
+enum {
+ BPFILTER_IPT_SO_GET_INFO = 64,
+ BPFILTER_IPT_SO_GET_ENTRIES = 65,
+ BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
+ BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
+ BPFILTER_IPT_GET_MAX,
+};
+
+#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/Kconfig b/net/Kconfig
index b62089fb1332..ed6368b306fa 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
endif
+source "net/bpfilter/Kconfig"
+
source "net/dccp/Kconfig"
source "net/sctp/Kconfig"
source "net/rds/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..7f982b7682bd 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_TLS) += tls/
obj-$(CONFIG_XFRM) += xfrm/
obj-$(CONFIG_UNIX) += unix/
obj-$(CONFIG_NET) += ipv6/
+obj-$(CONFIG_BPFILTER) += bpfilter/
obj-$(CONFIG_PACKET) += packet/
obj-$(CONFIG_NET_KEY) += key/
obj-$(CONFIG_BRIDGE) += bridge/
diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
new file mode 100644
index 000000000000..782a732b9a5c
--- /dev/null
+++ b/net/bpfilter/Kconfig
@@ -0,0 +1,17 @@
+menuconfig BPFILTER
+ bool "BPF based packet filtering framework (BPFILTER)"
+ default n
+ depends on NET && BPF
+ help
+ This builds the experimental bpfilter framework, which aims to
+ provide netfilter-compatible functionality via BPF
+
+if BPFILTER
+config BPFILTER_UMH
+ tristate "bpfilter kernel module with user mode helper"
+ default m
+ depends on m
+ help
+ This builds the bpfilter kernel module with an embedded user mode helper
+endif
+
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
new file mode 100644
index 000000000000..897eedae523e
--- /dev/null
+++ b/net/bpfilter/Makefile
@@ -0,0 +1,24 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the Linux BPFILTER layer.
+#
+
+hostprogs-y := bpfilter_umh
+bpfilter_umh-objs := main.o
+HOSTCFLAGS += -I. -Itools/include/
+
+# a bit of elf magic to convert bpfilter_umh binary into a binary blob
+# inside bpfilter_umh.o elf file referenced by
+# _binary_net_bpfilter_bpfilter_umh_start symbol
+# which bpfilter_kern.c passes further into umh blob loader at run-time
+quiet_cmd_copy_umh = GEN $@
+ cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
+ $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+ -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
+ --rename-section .data=.init.rodata $< $@
+
+$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
+ $(call cmd,copy_umh)
+
+obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
+bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
new file mode 100644
index 000000000000..e0a6fdd5842b
--- /dev/null
+++ b/net/bpfilter/bpfilter_kern.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/umh.h>
+#include <linux/bpfilter.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include "msgfmt.h"
+
+#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
+#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
+
+extern char UMH_start;
+extern char UMH_end;
+
+static struct umh_info info;
+
+static void shutdown_umh(struct umh_info *info)
+{
+ struct task_struct *tsk;
+
+ tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
+ if (tsk)
+ force_sig(SIGKILL, tsk);
+ fput(info->pipe_to_umh);
+ fput(info->pipe_from_umh);
+}
+
+static void stop_umh(void)
+{
+ if (bpfilter_process_sockopt) {
+ bpfilter_process_sockopt = NULL;
+ shutdown_umh(&info);
+ }
+}
+
+static int __bpfilter_process_sockopt(struct sock *sk, int optname,
+ char __user *optval,
+ unsigned int optlen, bool is_set)
+{
+ struct mbox_request req;
+ struct mbox_reply reply;
+ loff_t pos = 0;
+ ssize_t n;
+
+ req.is_set = is_set;
+ req.pid = current->pid;
+ req.cmd = optname;
+ req.addr = (long)optval;
+ req.len = optlen;
+ n = __kernel_write(info.pipe_to_umh, &req, sizeof(req), &pos);
+ if (n != sizeof(req)) {
+ pr_err("write fail %zd\n", n);
+ stop_umh();
+ return -EFAULT;
+ }
+ pos = 0;
+ n = kernel_read(info.pipe_from_umh, &reply, sizeof(reply), &pos);
+ if (n != sizeof(reply)) {
+ pr_err("read fail %zd\n", n);
+ stop_umh();
+ return -EFAULT;
+ }
+ return reply.status;
+}
+
+static int __init load_umh(void)
+{
+ int err;
+
+ err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+ if (err)
+ return err;
+ pr_info("Loaded umh pid %d\n", info.pid);
+ bpfilter_process_sockopt = &__bpfilter_process_sockopt;
+
+ if (__bpfilter_process_sockopt(NULL, 0, 0, 0, 0) != 0) {
+ stop_umh();
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static void __exit fini_umh(void)
+{
+ stop_umh();
+}
+module_init(load_umh);
+module_exit(fini_umh);
+MODULE_LICENSE("GPL");
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
new file mode 100644
index 000000000000..81bbc1684896
--- /dev/null
+++ b/net/bpfilter/main.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <errno.h>
+#include <stdio.h>
+#include <sys/socket.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include "include/uapi/linux/bpf.h"
+#include <asm/unistd.h>
+#include "msgfmt.h"
+
+int debug_fd;
+
+static int handle_get_cmd(struct mbox_request *cmd)
+{
+ switch (cmd->cmd) {
+ case 0:
+ return 0;
+ default:
+ break;
+ }
+ return -ENOPROTOOPT;
+}
+
+static int handle_set_cmd(struct mbox_request *cmd)
+{
+ return -ENOPROTOOPT;
+}
+
+static void loop(void)
+{
+ while (1) {
+ struct mbox_request req;
+ struct mbox_reply reply;
+ int n;
+
+ n = read(0, &req, sizeof(req));
+ if (n != sizeof(req)) {
+ dprintf(debug_fd, "invalid request %d\n", n);
+ return;
+ }
+
+ reply.status = req.is_set ?
+ handle_set_cmd(&req) :
+ handle_get_cmd(&req);
+
+ n = write(1, &reply, sizeof(reply));
+ if (n != sizeof(reply)) {
+ dprintf(debug_fd, "reply failed %d\n", n);
+ return;
+ }
+ }
+}
+
+int main(void)
+{
+ debug_fd = open("/dev/console", O_RDWR | O_CREAT, 0600);
+ dprintf(debug_fd, "Started bpfilter\n");
+ loop();
+ close(debug_fd);
+ return 0;
+}
diff --git a/net/bpfilter/msgfmt.h b/net/bpfilter/msgfmt.h
new file mode 100644
index 000000000000..94b9ac9e5114
--- /dev/null
+++ b/net/bpfilter/msgfmt.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_BPFILTER_MSGFMT_H
+#define _NET_BPFILTER_MSGFMT_H
+
+struct mbox_request {
+ __u64 addr;
+ __u32 len;
+ __u32 is_set;
+ __u32 cmd;
+ __u32 pid;
+};
+
+struct mbox_reply {
+ __u32 status;
+};
+
+#endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b379520f9133..7018f91c5a39 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -16,6 +16,8 @@ obj-y := route.o inetpeer.o protocol.o \
inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
metrics.o
+obj-$(CONFIG_BPFILTER) += bpfilter/
+
obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
obj-$(CONFIG_PROC_FS) += proc.o
diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile
new file mode 100644
index 000000000000..ce262d76cc48
--- /dev/null
+++ b/net/ipv4/bpfilter/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_BPFILTER) += sockopt.o
+
diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
new file mode 100644
index 000000000000..42a96d2d8d05
--- /dev/null
+++ b/net/ipv4/bpfilter/sockopt.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/uaccess.h>
+#include <linux/bpfilter.h>
+#include <uapi/linux/bpf.h>
+#include <linux/wait.h>
+#include <linux/kmod.h>
+
+int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+ char __user *optval,
+ unsigned int optlen, bool is_set);
+EXPORT_SYMBOL_GPL(bpfilter_process_sockopt);
+
+int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval,
+ unsigned int optlen, bool is_set)
+{
+ if (!bpfilter_process_sockopt) {
+ int err = request_module("bpfilter");
+
+ if (err)
+ return err;
+ if (!bpfilter_process_sockopt)
+ return -ECHILD;
+ }
+ return bpfilter_process_sockopt(sk, optname, optval, optlen, is_set);
+}
+
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
+ unsigned int optlen)
+{
+ return bpfilter_mbox_request(sk, optname, optval, optlen, true);
+}
+
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
+ int __user *optlen)
+{
+ int len;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+
+ return bpfilter_mbox_request(sk, optname, optval, len, false);
+}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 5ad2d8ed3a3f..e0791faacb24 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -47,6 +47,8 @@
#include <linux/errqueue.h>
#include <linux/uaccess.h>
+#include <linux/bpfilter.h>
+
/*
* SOL_IP control messages.
*/
@@ -1244,6 +1246,11 @@ int ip_setsockopt(struct sock *sk, int level,
return -ENOPROTOOPT;
err = do_ip_setsockopt(sk, level, optname, optval, optlen);
+#ifdef CONFIG_BPFILTER
+ if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
+ optname < BPFILTER_IPT_SET_MAX)
+ err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
+#endif
#ifdef CONFIG_NETFILTER
/* we need to exclude all possible ENOPROTOOPTs except default case */
if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
@@ -1552,6 +1559,11 @@ int ip_getsockopt(struct sock *sk, int level,
int err;
err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0);
+#ifdef CONFIG_BPFILTER
+ if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+ optname < BPFILTER_IPT_GET_MAX)
+ err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
#ifdef CONFIG_NETFILTER
/* we need to exclude all possible ENOPROTOOPTs except default case */
if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
@@ -1584,6 +1596,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname,
err = do_ip_getsockopt(sk, level, optname, optval, optlen,
MSG_CMSG_COMPAT);
+#ifdef CONFIG_BPFILTER
+ if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+ optname < BPFILTER_IPT_GET_MAX)
+ err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
#ifdef CONFIG_NETFILTER
/* we need to exclude all possible ENOPROTOOPTs except default case */
if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
--
2.9.5
From: "David S. Miller" <[email protected]>
Parse iptable binary blobs into bpfilter internal data structures.
bpfilter.ko only passes the [gs]etsockopt commands from kernel to umh.
All parsing is done inside umh.
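As a rough illustration of what parsing such a blob involves, the sketch below walks a buffer of variable-length entries using the next_offset field, the way struct bpfilter_ipt_entry chains its entries. It is not the code from this patch: struct demo_entry is a cut-down, hypothetical header that keeps only the two offset fields, and demo_count_entries() is an invented helper that just validates the offsets and counts entries.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down entry header: only the two offsets matter here. */
struct demo_entry {
	uint16_t target_offset; /* where the target starts inside the entry */
	uint16_t next_offset;   /* total size of this entry */
};

/*
 * Count the entries in blob[0..size). Returns -1 if an entry is truncated,
 * has inconsistent offsets, or runs past the end of the blob.
 */
int demo_count_entries(const unsigned char *blob, size_t size)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		struct demo_entry e;

		if (size - off < sizeof(e))
			return -1;
		/* Copy the header out to avoid unaligned access. */
		memcpy(&e, blob + off, sizeof(e));
		if (e.next_offset < sizeof(e) ||
		    e.target_offset < sizeof(e) ||
		    e.target_offset > e.next_offset ||
		    e.next_offset > size - off)
			return -1;
		off += e.next_offset;
		n++;
	}
	return n;
}
```

Since the blob comes from an unprivileged-looking setsockopt() buffer, every offset is checked against the remaining size before it is trusted; a real parser would additionally validate the match and target payloads inside each entry.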
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
---
include/uapi/linux/bpfilter.h | 179 ++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/Makefile | 2 +-
net/bpfilter/bpfilter_mod.h | 96 ++++++++++++++++++++++
net/bpfilter/ctor.c | 80 +++++++++++++++++++
net/bpfilter/init.c | 33 ++++++++
net/bpfilter/main.c | 51 ++++++++++++
net/bpfilter/sockopt.c | 153 ++++++++++++++++++++++++++++++++++++
net/bpfilter/tables.c | 70 +++++++++++++++++
net/bpfilter/targets.c | 51 ++++++++++++
net/bpfilter/tgts.c | 25 ++++++
10 files changed, 739 insertions(+), 1 deletion(-)
create mode 100644 net/bpfilter/bpfilter_mod.h
create mode 100644 net/bpfilter/ctor.c
create mode 100644 net/bpfilter/init.c
create mode 100644 net/bpfilter/sockopt.c
create mode 100644 net/bpfilter/tables.c
create mode 100644 net/bpfilter/targets.c
create mode 100644 net/bpfilter/tgts.c
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
index 2ec3cc99ea4c..38d54e9947a1 100644
--- a/include/uapi/linux/bpfilter.h
+++ b/include/uapi/linux/bpfilter.h
@@ -18,4 +18,183 @@ enum {
BPFILTER_IPT_GET_MAX,
};
+enum {
+ BPFILTER_XT_TABLE_MAXNAMELEN = 32,
+};
+
+enum {
+ BPFILTER_NF_DROP = 0,
+ BPFILTER_NF_ACCEPT = 1,
+ BPFILTER_NF_STOLEN = 2,
+ BPFILTER_NF_QUEUE = 3,
+ BPFILTER_NF_REPEAT = 4,
+ BPFILTER_NF_STOP = 5,
+ BPFILTER_NF_MAX_VERDICT = BPFILTER_NF_STOP,
+};
+
+enum {
+ BPFILTER_INET_HOOK_PRE_ROUTING = 0,
+ BPFILTER_INET_HOOK_LOCAL_IN = 1,
+ BPFILTER_INET_HOOK_FORWARD = 2,
+ BPFILTER_INET_HOOK_LOCAL_OUT = 3,
+ BPFILTER_INET_HOOK_POST_ROUTING = 4,
+ BPFILTER_INET_HOOK_MAX,
+};
+
+enum {
+ BPFILTER_PROTO_UNSPEC = 0,
+ BPFILTER_PROTO_INET = 1,
+ BPFILTER_PROTO_IPV4 = 2,
+ BPFILTER_PROTO_ARP = 3,
+ BPFILTER_PROTO_NETDEV = 5,
+ BPFILTER_PROTO_BRIDGE = 7,
+ BPFILTER_PROTO_IPV6 = 10,
+ BPFILTER_PROTO_DECNET = 12,
+ BPFILTER_PROTO_NUMPROTO,
+};
+
+#ifndef INT_MAX
+#define INT_MAX ((int)(~0U>>1))
+#endif
+#ifndef INT_MIN
+#define INT_MIN (-INT_MAX - 1)
+#endif
+
+enum {
+ BPFILTER_IP_PRI_FIRST = INT_MIN,
+ BPFILTER_IP_PRI_CONNTRACK_DEFRAG = -400,
+ BPFILTER_IP_PRI_RAW = -300,
+ BPFILTER_IP_PRI_SELINUX_FIRST = -225,
+ BPFILTER_IP_PRI_CONNTRACK = -200,
+ BPFILTER_IP_PRI_MANGLE = -150,
+ BPFILTER_IP_PRI_NAT_DST = -100,
+ BPFILTER_IP_PRI_FILTER = 0,
+ BPFILTER_IP_PRI_SECURITY = 50,
+ BPFILTER_IP_PRI_NAT_SRC = 100,
+ BPFILTER_IP_PRI_SELINUX_LAST = 225,
+ BPFILTER_IP_PRI_CONNTRACK_HELPER = 300,
+ BPFILTER_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
+ BPFILTER_IP_PRI_LAST = INT_MAX,
+};
+
+#define BPFILTER_FUNCTION_MAXNAMELEN 30
+#define BPFILTER_EXTENSION_MAXNAMELEN 29
+#define BPFILTER_TABLE_MAXNAMELEN 32
+
+struct bpfilter_match;
+struct bpfilter_entry_match {
+ union {
+ struct {
+ __u16 match_size;
+ char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ __u8 revision;
+ } user;
+ struct {
+ __u16 match_size;
+ struct bpfilter_match *match;
+ } kernel;
+ __u16 match_size;
+ } u;
+ unsigned char data[0];
+};
+
+struct bpfilter_target;
+struct bpfilter_entry_target {
+ union {
+ struct {
+ __u16 target_size;
+ char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ __u8 revision;
+ } user;
+ struct {
+ __u16 target_size;
+ struct bpfilter_target *target;
+ } kernel;
+ __u16 target_size;
+ } u;
+ unsigned char data[0];
+};
+
+struct bpfilter_standard_target {
+ struct bpfilter_entry_target target;
+ int verdict;
+};
+
+struct bpfilter_error_target {
+ struct bpfilter_entry_target target;
+ char error_name[BPFILTER_FUNCTION_MAXNAMELEN];
+};
+
+#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
+#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
+
+#define BPFILTER_ALIGN(__X) \
+ __ALIGN_KERNEL(__X, __alignof__(__u64))
+
+#define BPFILTER_TARGET_INIT(__name, __size) \
+{ \
+ .target.u.user = { \
+ .target_size = BPFILTER_ALIGN(__size), \
+ .name = (__name), \
+ }, \
+}
+#define BPFILTER_STANDARD_TARGET ""
+#define BPFILTER_ERROR_TARGET "ERROR"
+
+struct bpfilter_xt_counters {
+ __u64 packet_cnt;
+ __u64 byte_cnt;
+};
+
+struct bpfilter_ipt_ip {
+ __u32 src;
+ __u32 dst;
+ __u32 src_mask;
+ __u32 dst_mask;
+ char in_iface[IFNAMSIZ];
+ char out_iface[IFNAMSIZ];
+ __u8 in_iface_mask[IFNAMSIZ];
+ __u8 out_iface_mask[IFNAMSIZ];
+ __u16 protocol;
+ __u8 flags;
+ __u8 inv_flags;
+};
+
+struct bpfilter_ipt_entry {
+ struct bpfilter_ipt_ip ip;
+ __u32 bfcache;
+ __u16 target_offset;
+ __u16 next_offset;
+ __u32 camefrom;
+ struct bpfilter_xt_counters cntrs;
+ __u8 elems[0];
+};
+
+struct bpfilter_ipt_get_info {
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ __u32 valid_hooks;
+ __u32 hook_entry[BPFILTER_INET_HOOK_MAX];
+ __u32 underflow[BPFILTER_INET_HOOK_MAX];
+ __u32 num_entries;
+ __u32 size;
+};
+
+struct bpfilter_ipt_get_entries {
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ __u32 size;
+ struct bpfilter_ipt_entry entries[0];
+};
+
+struct bpfilter_ipt_replace {
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ __u32 valid_hooks;
+ __u32 num_entries;
+ __u32 size;
+ __u32 hook_entry[BPFILTER_INET_HOOK_MAX];
+ __u32 underflow[BPFILTER_INET_HOOK_MAX];
+ __u32 num_counters;
+ struct bpfilter_xt_counters *cntrs;
+ struct bpfilter_ipt_entry entries[0];
+};
+
#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 897eedae523e..bec6181de995 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
#
hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
HOSTCFLAGS += -I. -Itools/include/
# a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
new file mode 100644
index 000000000000..f0de41b20793
--- /dev/null
+++ b/net/bpfilter/bpfilter_mod.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_INTERNAL_H
+#define _LINUX_BPFILTER_INTERNAL_H
+
+#include "include/uapi/linux/bpfilter.h"
+#include <linux/list.h>
+
+struct bpfilter_table {
+ struct hlist_node hash;
+ u32 valid_hooks;
+ struct bpfilter_table_info *info;
+ int hold;
+ u8 family;
+ int priority;
+ const char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+};
+
+struct bpfilter_table_info {
+ unsigned int size;
+ u32 num_entries;
+ unsigned int initial_entries;
+ unsigned int hook_entry[BPFILTER_INET_HOOK_MAX];
+ unsigned int underflow[BPFILTER_INET_HOOK_MAX];
+ unsigned int stacksize;
+ void ***jumpstack;
+ unsigned char entries[0] __aligned(8);
+};
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len);
+void bpfilter_table_put(struct bpfilter_table *tbl);
+int bpfilter_table_add(struct bpfilter_table *tbl);
+
+struct bpfilter_ipt_standard {
+ struct bpfilter_ipt_entry entry;
+ struct bpfilter_standard_target target;
+};
+
+struct bpfilter_ipt_error {
+ struct bpfilter_ipt_entry entry;
+ struct bpfilter_error_target target;
+};
+
+#define BPFILTER_IPT_ENTRY_INIT(__sz) \
+{ \
+ .target_offset = sizeof(struct bpfilter_ipt_entry), \
+ .next_offset = (__sz), \
+}
+
+#define BPFILTER_IPT_STANDARD_INIT(__verdict) \
+{ \
+ .entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_standard)), \
+ .target = BPFILTER_TARGET_INIT(BPFILTER_STANDARD_TARGET, \
+ sizeof(struct bpfilter_standard_target)),\
+ .target.verdict = -(__verdict) - 1, \
+}
+
+#define BPFILTER_IPT_ERROR_INIT \
+{ \
+ .entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_error)), \
+ .target = BPFILTER_TARGET_INIT(BPFILTER_ERROR_TARGET, \
+ sizeof(struct bpfilter_error_target)), \
+ .target.error_name = "ERROR", \
+}
+
+struct bpfilter_target {
+ struct list_head all_target_list;
+ const char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ unsigned int size;
+ int hold;
+ u16 family;
+ u8 rev;
+};
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
+void bpfilter_target_put(struct bpfilter_target *tgt);
+int bpfilter_target_add(struct bpfilter_target *tgt);
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+int bpfilter_ipv4_register_targets(void);
+void bpfilter_tables_init(void);
+int bpfilter_get_info(void *addr, int len);
+int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_ipv4_init(void);
+
+int copy_from_user(void *dst, void *addr, int len);
+int copy_to_user(void *addr, const void *src, int len);
+#define put_user(x, ptr) \
+({ \
+ __typeof__(*(ptr)) __x = (x); \
+ copy_to_user(ptr, &__x, sizeof(*(ptr))); \
+})
+extern int pid;
+extern int debug_fd;
+#define ENOTSUPP 524
+
+#endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
new file mode 100644
index 000000000000..efb7feef3c42
--- /dev/null
+++ b/net/bpfilter/ctor.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <linux/bitops.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+unsigned int __sw_hweight32(unsigned int w)
+{
+ w -= (w >> 1) & 0x55555555;
+ w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
+ w = (w + (w >> 4)) & 0x0f0f0f0f;
+ return (w * 0x01010101) >> 24;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+{
+ unsigned int num_hooks = hweight32(tbl->valid_hooks);
+ struct bpfilter_ipt_standard *tgts;
+ struct bpfilter_table_info *info;
+ struct bpfilter_ipt_error *term;
+ unsigned int mask, offset, h, i;
+ unsigned int size, alloc_size;
+
+ size = sizeof(struct bpfilter_ipt_standard) * num_hooks;
+ size += sizeof(struct bpfilter_ipt_error);
+
+ alloc_size = size + sizeof(struct bpfilter_table_info);
+
+ info = malloc(alloc_size);
+ if (!info)
+ return NULL;
+
+ info->num_entries = num_hooks + 1;
+ info->size = size;
+
+ tgts = (struct bpfilter_ipt_standard *) (info + 1);
+ term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+
+ mask = tbl->valid_hooks;
+ offset = 0;
+ h = 0;
+ i = 0;
+ dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
+ while (mask) {
+ struct bpfilter_ipt_standard *t;
+
+ if (!(mask & 1))
+ goto next;
+
+ info->hook_entry[h] = offset;
+ info->underflow[h] = offset;
+ t = &tgts[i++];
+ *t = (struct bpfilter_ipt_standard)
+ BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
+ t->target.target.u.kernel.target =
+ bpfilter_target_get_by_name(t->target.target.u.user.name);
+ dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
+ if (!t->target.target.u.kernel.target)
+ goto out_fail;
+
+ offset += sizeof(struct bpfilter_ipt_standard);
+ next:
+ mask >>= 1;
+ h++;
+ }
+ *term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
+ term->target.target.u.kernel.target =
+ bpfilter_target_get_by_name(term->target.target.u.user.name);
+ dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
+ if (!term->target.target.u.kernel.target)
+ goto out_fail;
+
+ dprintf(debug_fd, "info %p\n", info);
+ return info;
+
+out_fail:
+ free(info);
+ return NULL;
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
new file mode 100644
index 000000000000..699f3f623189
--- /dev/null
+++ b/net/bpfilter/init.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include "bpfilter_mod.h"
+
+static struct bpfilter_table filter_table_ipv4 = {
+ .name = "filter",
+ .valid_hooks = ((1<<BPFILTER_INET_HOOK_LOCAL_IN) |
+ (1<<BPFILTER_INET_HOOK_FORWARD) |
+ (1<<BPFILTER_INET_HOOK_LOCAL_OUT)),
+ .family = BPFILTER_PROTO_IPV4,
+ .priority = BPFILTER_IP_PRI_FILTER,
+};
+
+int bpfilter_ipv4_init(void)
+{
+ struct bpfilter_table *t = &filter_table_ipv4;
+ struct bpfilter_table_info *info;
+ int err;
+
+ err = bpfilter_ipv4_register_targets();
+ if (err)
+ return err;
+
+ info = bpfilter_ipv4_table_ctor(t);
+ if (!info)
+ return -ENOMEM;
+
+ t->info = info;
+
+ return bpfilter_table_add(&filter_table_ipv4);
+}
+
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index 81bbc1684896..e0273ca201ad 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -8,13 +8,52 @@
#include <unistd.h>
#include "include/uapi/linux/bpf.h"
#include <asm/unistd.h>
+#include "bpfilter_mod.h"
#include "msgfmt.h"
+extern long int syscall (long int __sysno, ...);
+
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+ unsigned int size)
+{
+ return syscall(__NR_bpf, cmd, attr, size);
+}
+
+int pid;
int debug_fd;
+int copy_from_user(void *dst, void *addr, int len)
+{
+ struct iovec local;
+ struct iovec remote;
+
+ local.iov_base = dst;
+ local.iov_len = len;
+ remote.iov_base = addr;
+ remote.iov_len = len;
+ return process_vm_readv(pid, &local, 1, &remote, 1, 0) != len;
+}
+
+int copy_to_user(void *addr, const void *src, int len)
+{
+ struct iovec local;
+ struct iovec remote;
+
+ local.iov_base = (void *) src;
+ local.iov_len = len;
+ remote.iov_base = addr;
+ remote.iov_len = len;
+ return process_vm_writev(pid, &local, 1, &remote, 1, 0) != len;
+}
+
static int handle_get_cmd(struct mbox_request *cmd)
{
+ pid = cmd->pid;
switch (cmd->cmd) {
+ case BPFILTER_IPT_SO_GET_INFO:
+ return bpfilter_get_info((void *)(long)cmd->addr, cmd->len);
+ case BPFILTER_IPT_SO_GET_ENTRIES:
+ return bpfilter_get_entries((void *)(long)cmd->addr, cmd->len);
case 0:
return 0;
default:
@@ -25,11 +64,23 @@ static int handle_get_cmd(struct mbox_request *cmd)
static int handle_set_cmd(struct mbox_request *cmd)
{
+ pid = cmd->pid;
+ switch (cmd->cmd) {
+ case BPFILTER_IPT_SO_SET_REPLACE:
+ return bpfilter_set_replace((void *)(long)cmd->addr, cmd->len);
+ case BPFILTER_IPT_SO_SET_ADD_COUNTERS:
+ return bpfilter_set_add_counters((void *)(long)cmd->addr, cmd->len);
+ default:
+ break;
+ }
return -ENOPROTOOPT;
}
static void loop(void)
{
+ bpfilter_tables_init();
+ bpfilter_ipv4_init();
+
while (1) {
struct mbox_request req;
struct mbox_reply reply;
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
new file mode 100644
index 000000000000..43687daf51a3
--- /dev/null
+++ b/net/bpfilter/sockopt.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+static int fetch_name(void *addr, int len, char *name, int name_len)
+{
+ if (copy_from_user(name, addr, name_len))
+ return -EFAULT;
+
+ name[name_len - 1] = '\0';
+ return 0;
+}
+
+int bpfilter_get_info(void *addr, int len)
+{
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ struct bpfilter_ipt_get_info resp;
+ struct bpfilter_table_info *info;
+ struct bpfilter_table *tbl;
+ int err;
+
+ if (len != sizeof(struct bpfilter_ipt_get_info))
+ return -EINVAL;
+
+ err = fetch_name(addr, len, name, sizeof(name));
+ if (err)
+ return err;
+
+ tbl = bpfilter_table_get_by_name(name, strlen(name));
+ if (!tbl)
+ return -ENOENT;
+
+ info = tbl->info;
+ if (!info) {
+ err = -ENOENT;
+ goto out_put;
+ }
+
+ memset(&resp, 0, sizeof(resp));
+ memcpy(resp.name, name, sizeof(resp.name));
+ resp.valid_hooks = tbl->valid_hooks;
+ memcpy(&resp.hook_entry, info->hook_entry, sizeof(resp.hook_entry));
+ memcpy(&resp.underflow, info->underflow, sizeof(resp.underflow));
+ resp.num_entries = info->num_entries;
+ resp.size = info->size;
+
+ err = 0;
+ if (copy_to_user(addr, &resp, len))
+ err = -EFAULT;
+out_put:
+ bpfilter_table_put(tbl);
+ return err;
+}
+
+static int copy_target(struct bpfilter_standard_target *ut,
+ struct bpfilter_standard_target *kt)
+{
+ struct bpfilter_target *tgt;
+ int sz;
+
+
+ if (put_user(kt->target.u.target_size,
+ &ut->target.u.target_size))
+ return -EFAULT;
+
+ tgt = kt->target.u.kernel.target;
+ if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
+ return -EFAULT;
+
+ if (put_user(tgt->rev, &ut->target.u.user.revision))
+ return -EFAULT;
+
+ sz = tgt->size;
+ if (copy_to_user(ut->target.data, kt->target.data, sz))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int do_get_entries(void *up,
+ struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info)
+{
+ unsigned int total_size = info->size;
+ const struct bpfilter_ipt_entry *ent;
+ unsigned int off;
+ void *base;
+
+ base = info->entries;
+
+ for (off = 0; off < total_size; off += ent->next_offset) {
+ struct bpfilter_xt_counters *cntrs;
+ struct bpfilter_standard_target *tgt;
+
+ ent = base + off;
+ if (copy_to_user(up + off, ent, sizeof(*ent)))
+ return -EFAULT;
+
+ /* XXX Just clear counters for now. XXX */
+ cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
+ if (put_user(0, &cntrs->packet_cnt) ||
+ put_user(0, &cntrs->byte_cnt))
+ return -EFAULT;
+
+ tgt = (void *) ent + ent->target_offset;
+ dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
+ if (copy_target(up + off + ent->target_offset, tgt))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+int bpfilter_get_entries(void *cmd, int len)
+{
+ struct bpfilter_ipt_get_entries *uptr = cmd;
+ struct bpfilter_ipt_get_entries req;
+ struct bpfilter_table_info *info;
+ struct bpfilter_table *tbl;
+ int err;
+
+ if (len < sizeof(struct bpfilter_ipt_get_entries))
+ return -EINVAL;
+
+ if (copy_from_user(&req, cmd, sizeof(req)))
+ return -EFAULT;
+
+ tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+ if (!tbl)
+ return -ENOENT;
+
+ info = tbl->info;
+ if (!info) {
+ err = -ENOENT;
+ goto out_put;
+ }
+
+ if (info->size != req.size) {
+ err = -EINVAL;
+ goto out_put;
+ }
+
+ err = do_get_entries(uptr->entries, tbl, info);
+ dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
+
+out_put:
+ bpfilter_table_put(tbl);
+
+ return err;
+}
+
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
new file mode 100644
index 000000000000..9a96599be634
--- /dev/null
+++ b/net/bpfilter/tables.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <linux/hashtable.h>
+#include "bpfilter_mod.h"
+
+static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
+{
+ unsigned int hash = 0;
+ int i;
+
+ for (i = 0; i < len; i++)
+ hash ^= *(name + i);
+ return hash;
+}
+
+DEFINE_HASHTABLE(bpfilter_tables, 4);
+//DEFINE_MUTEX(bpfilter_table_mutex);
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len)
+{
+ unsigned int hval = full_name_hash(NULL, name, name_len);
+ struct bpfilter_table *tbl;
+
+// mutex_lock(&bpfilter_table_mutex);
+ hash_for_each_possible(bpfilter_tables, tbl, hash, hval) {
+ if (!strcmp(name, tbl->name)) {
+ tbl->hold++;
+ goto out;
+ }
+ }
+ tbl = NULL;
+out:
+// mutex_unlock(&bpfilter_table_mutex);
+ return tbl;
+}
+
+void bpfilter_table_put(struct bpfilter_table *tbl)
+{
+// mutex_lock(&bpfilter_table_mutex);
+ tbl->hold--;
+// mutex_unlock(&bpfilter_table_mutex);
+}
+
+int bpfilter_table_add(struct bpfilter_table *tbl)
+{
+ unsigned int hval = full_name_hash(NULL, tbl->name, strlen(tbl->name));
+ struct bpfilter_table *srch;
+
+// mutex_lock(&bpfilter_table_mutex);
+ hash_for_each_possible(bpfilter_tables, srch, hash, hval) {
+ if (!strcmp(srch->name, tbl->name))
+ goto exists;
+ }
+ hash_add(bpfilter_tables, &tbl->hash, hval);
+// mutex_unlock(&bpfilter_table_mutex);
+
+ return 0;
+
+exists:
+// mutex_unlock(&bpfilter_table_mutex);
+ return -EEXIST;
+}
+
+void bpfilter_tables_init(void)
+{
+ hash_init(bpfilter_tables);
+}
+
diff --git a/net/bpfilter/targets.c b/net/bpfilter/targets.c
new file mode 100644
index 000000000000..4086ac82eaf5
--- /dev/null
+++ b/net/bpfilter/targets.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include "bpfilter_mod.h"
+
+//DEFINE_MUTEX(bpfilter_target_mutex);
+static LIST_HEAD(bpfilter_targets);
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name)
+{
+ struct bpfilter_target *tgt;
+
+// mutex_lock(&bpfilter_target_mutex);
+ list_for_each_entry(tgt, &bpfilter_targets, all_target_list) {
+ if (!strcmp(tgt->name, name)) {
+ tgt->hold++;
+ goto out;
+ }
+ }
+ tgt = NULL;
+out:
+// mutex_unlock(&bpfilter_target_mutex);
+ return tgt;
+}
+
+void bpfilter_target_put(struct bpfilter_target *tgt)
+{
+// mutex_lock(&bpfilter_target_mutex);
+ tgt->hold--;
+// mutex_unlock(&bpfilter_target_mutex);
+}
+
+int bpfilter_target_add(struct bpfilter_target *tgt)
+{
+ struct bpfilter_target *srch;
+
+// mutex_lock(&bpfilter_target_mutex);
+ list_for_each_entry(srch, &bpfilter_targets, all_target_list) {
+ if (!strcmp(srch->name, tgt->name))
+ goto exists;
+ }
+ list_add_tail(&tgt->all_target_list, &bpfilter_targets);
+// mutex_unlock(&bpfilter_target_mutex);
+ return 0;
+
+exists:
+// mutex_unlock(&bpfilter_target_mutex);
+ return -EEXIST;
+}
+
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
new file mode 100644
index 000000000000..eac5e8ac0b4b
--- /dev/null
+++ b/net/bpfilter/tgts.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include "bpfilter_mod.h"
+
+struct bpfilter_target std_tgt = {
+ .name = BPFILTER_STANDARD_TARGET,
+ .family = BPFILTER_PROTO_IPV4,
+ .size = sizeof(int),
+};
+
+struct bpfilter_target err_tgt = {
+ .name = BPFILTER_ERROR_TARGET,
+ .family = BPFILTER_PROTO_IPV4,
+ .size = BPFILTER_FUNCTION_MAXNAMELEN,
+};
+
+int bpfilter_ipv4_register_targets(void)
+{
+ int err = bpfilter_target_add(&std_tgt);
+
+ if (err)
+ return err;
+ return bpfilter_target_add(&err_tgt);
+}
+
--
2.9.5
Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
struct file *pipe_to_umh;
struct file *pipe_from_umh;
pid_t pid;
};
that GPLed kernel modules (signed or unsigned) can use to execute part
of their own data as a swappable user mode process.
The kernel will do:
- mount "tmpfs"
- allocate a unique file in tmpfs
- populate that file with the len bytes starting at data
- user-mode-helper code will do_execve that file and, before the process
starts, the kernel will create two unix pipes for bidirectional
communication between kernel module and umh
- close tmpfs file, effectively deleting it
- fork_usermode_blob() will return zero on success and populate
'struct umh_info' with two unix pipes and the pid of the user process
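The pipe wiring described above can be approximated in plain user space.
The sketch below is not the kernel implementation: fork() stands in for the
usermode helper machinery, no blob is executed, and struct helper_info,
helper_main(), fork_helper() and helper_roundtrip() are hypothetical names
made up for illustration. It only shows the fd layout: the helper reads
requests on fd 0 and writes replies on fd 1, while the parent keeps the
opposite pipe ends plus the pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical user space stand-in for struct umh_info:
 * plain file descriptors instead of struct file pointers.
 */
struct helper_info {
	int pipe_to_helper;	/* parent writes requests here */
	int pipe_from_helper;	/* parent reads replies here */
	pid_t pid;
};

/* Runs in the child: echo bytes from fd 0 to fd 1 until EOF. */
static void helper_main(void)
{
	char buf[256];
	ssize_t n;

	while ((n = read(0, buf, sizeof(buf))) > 0)
		if (write(1, buf, n) != n)
			_exit(1);
	_exit(0);
}

/* Returns 0 and fills *info on success, -1 on failure.
 * Assumes fds 0 and 1 are already open in the caller, so that
 * pipe() never hands them out.
 */
static int fork_helper(struct helper_info *info)
{
	int to_helper[2], from_helper[2];
	pid_t pid;

	assert(info != NULL);
	if (pipe(to_helper) < 0)
		return -1;
	if (pipe(from_helper) < 0)
		goto err_to;
	pid = fork();
	if (pid < 0)
		goto err_from;
	if (pid == 0) {
		/* child: request pipe becomes fd 0, reply pipe fd 1 */
		dup2(to_helper[0], 0);
		dup2(from_helper[1], 1);
		close(to_helper[0]);
		close(to_helper[1]);
		close(from_helper[0]);
		close(from_helper[1]);
		helper_main();
	}
	/* parent: keep only the ends it talks through */
	close(to_helper[0]);
	close(from_helper[1]);
	info->pipe_to_helper = to_helper[1];
	info->pipe_from_helper = from_helper[0];
	info->pid = pid;
	return 0;
err_from:
	close(from_helper[0]);
	close(from_helper[1]);
err_to:
	close(to_helper[0]);
	close(to_helper[1]);
	return -1;
}

/* Send one short request and read the reply; returns bytes read or -1. */
static ssize_t helper_roundtrip(struct helper_info *info, const char *msg,
				char *reply, size_t len)
{
	size_t n = strlen(msg);

	if (write(info->pipe_to_helper, msg, n) != (ssize_t)n)
		return -1;
	return read(info->pipe_from_helper, reply, len);
}
```

A caller would run fork_helper(), exchange messages with
helper_roundtrip(), then close pipe_to_helper so the helper sees EOF and
exits, and finally waitpid() on the stored pid.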
As the first step in the development of the bpfilter project,
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as a traditional kernel module into a distro-specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmoded, etc. The use of this new helper
by a kernel module doesn't make it special in any way from the point of view
of kernel and user space tooling.
Such an approach enables the kernel to delegate functionality traditionally done
by kernel modules to user space processes (either root or !root) and
reduces the security attack surface of the new code. Buggy umh code would crash
the user process, but not the kernel. Another advantage is that the umh code
of the kernel module can be debugged and tested in user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In the case of the bpfilter project, such an architecture allows the complex
control plane to be done in user space while the bpf based data plane stays
in the kernel.
Since the umh can crash, be oom-ed by the kernel, or be killed by the admin,
the kernel module that uses it (like bpfilter) needs to manage the lifetime
of the umh on its own via the two unix pipes and the pid of the umh.
The exit code of such a kernel module should kill the umh it started,
so that rmmod of the kernel module will clean up the corresponding umh,
just as a kernel module that does kmalloc() should kfree() the memory in its
exit code.
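The start/stop pairing can be illustrated with a small user space analogy.
This is only a sketch under stated assumptions: helper_start() and
helper_stop() are hypothetical names, fork() replaces fork_usermode_blob(),
and kill() plus waitpid() replace whatever a kernel module would do with
the stored pid in its exit path. The point is the symmetry: whoever starts
the helper also kills and reaps it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical "init" side: start a helper that idles until killed.
 * Returns the child pid, or -1 if fork() failed.
 */
static pid_t helper_start(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;
}

/* Hypothetical "exit" side: kill the helper and reap it.
 * Returns the signal that terminated the helper, 0 if it had already
 * exited on its own, or -1 on error.
 */
static int helper_stop(pid_t pid)
{
	int status;

	assert(pid > 0);
	/* ESRCH just means the helper is already gone */
	if (kill(pid, SIGKILL) < 0 && errno != ESRCH)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling helper_stop() on the pid returned by helper_start() plays the role
of the module exit path: after it returns, no helper process is left
behind.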
Signed-off-by: Alexei Starovoitov <[email protected]>
---
fs/exec.c | 38 ++++++++---
include/linux/binfmts.h | 1 +
include/linux/umh.h | 12 ++++
kernel/umh.c | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 215 insertions(+), 12 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..30a36c2a39bf 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags, struct file *file)
{
char *pathbuf = NULL;
struct linux_binprm *bprm;
- struct file *file;
struct files_struct *displaced;
int retval;
@@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
- file = do_open_execat(fd, filename, flags);
+ if (!file)
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
sched_exec();
bprm->file = file;
- if (fd == AT_FDCWD || filename->name[0] == '/') {
+ if (!filename) {
+ bprm->filename = "none";
+ } else if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
if (filename->name[0] == '\0')
@@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
task_numa_free(current);
free_bprm(bprm);
kfree(pathbuf);
- putname(filename);
+ if (filename)
+ putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
if (displaced)
reset_files_struct(displaced);
out_ret:
- putname(filename);
+ if (filename)
+ putname(filename);
return retval;
}
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
+{
+ return __do_execve_file(fd, filename, argv, envp, flags, NULL);
+}
+
+int do_execve_file(struct file *file, void *__argv, void *__envp)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+
+ return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
+}
+
int do_execve(struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 4955e0863b83..c05f24fac4f6 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
const char __user * const __user *,
const char __user * const __user *,
int);
+int do_execve_file(struct file *file, void *__argv, void *__envp);
#endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/umh.h b/include/linux/umh.h
index 244aff638220..5c812acbb80a 100644
--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -22,8 +22,10 @@ struct subprocess_info {
const char *path;
char **argv;
char **envp;
+ struct file *file;
int wait;
int retval;
+ pid_t pid;
int (*init)(struct subprocess_info *info, struct cred *new);
void (*cleanup)(struct subprocess_info *info);
void *data;
@@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
int (*init)(struct subprocess_info *info, struct cred *new),
void (*cleanup)(struct subprocess_info *), void *data);
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
+ int (*init)(struct subprocess_info *info, struct cred *new),
+ void (*cleanup)(struct subprocess_info *), void *data);
+struct umh_info {
+ struct file *pipe_to_umh;
+ struct file *pipe_from_umh;
+ pid_t pid;
+};
+int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
+
extern int
call_usermodehelper_exec(struct subprocess_info *info, int wait);
diff --git a/kernel/umh.c b/kernel/umh.c
index f76b3ff876cf..c3f418d7d51a 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -25,6 +25,8 @@
#include <linux/ptrace.h>
#include <linux/async.h>
#include <linux/uaccess.h>
+#include <linux/shmem_fs.h>
+#include <linux/pipe_fs_i.h>
#include <trace/events/module.h>
@@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
commit_creds(new);
- retval = do_execve(getname_kernel(sub_info->path),
- (const char __user *const __user *)sub_info->argv,
- (const char __user *const __user *)sub_info->envp);
+ if (sub_info->file)
+ retval = do_execve_file(sub_info->file,
+ sub_info->argv, sub_info->envp);
+ else
+ retval = do_execve(getname_kernel(sub_info->path),
+ (const char __user *const __user *)sub_info->argv,
+ (const char __user *const __user *)sub_info->envp);
out:
sub_info->retval = retval;
/*
@@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
if (pid < 0) {
sub_info->retval = pid;
umh_complete(sub_info);
+ } else {
+ sub_info->pid = pid;
}
}
}
@@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
}
EXPORT_SYMBOL(call_usermodehelper_setup);
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
+ int (*init)(struct subprocess_info *info, struct cred *new),
+ void (*cleanup)(struct subprocess_info *info), void *data)
+{
+ struct subprocess_info *sub_info;
+
+ sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
+ if (!sub_info)
+ return NULL;
+
+ INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
+ sub_info->path = "none";
+ sub_info->file = file;
+ sub_info->init = init;
+ sub_info->cleanup = cleanup;
+ sub_info->data = data;
+ return sub_info;
+}
+
+static struct vfsmount *umh_fs;
+
+static int init_tmpfs(void)
+{
+ struct file_system_type *type;
+
+ if (umh_fs)
+ return 0;
+ type = get_fs_type("tmpfs");
+ if (!type)
+ return -ENODEV;
+ umh_fs = kern_mount(type);
+ if (IS_ERR(umh_fs)) {
+ int err = PTR_ERR(umh_fs);
+
+ put_filesystem(type);
+ umh_fs = NULL;
+ return err;
+ }
+ return 0;
+}
+
+static int alloc_tmpfs_file(size_t size, struct file **filp)
+{
+ struct file *file;
+ int err;
+
+ err = init_tmpfs();
+ if (err)
+ return err;
+ file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ *filp = file;
+ return 0;
+}
+
+static int populate_file(struct file *file, const void *data, size_t size)
+{
+ size_t offset = 0;
+ int err;
+
+ do {
+ unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
+ struct page *page;
+ void *pgdata, *vaddr;
+
+ err = pagecache_write_begin(file, file->f_mapping, offset, len,
+ 0, &page, &pgdata);
+ if (err < 0)
+ goto fail;
+
+ vaddr = kmap(page);
+ memcpy(vaddr, data, len);
+ kunmap(page);
+
+ err = pagecache_write_end(file, file->f_mapping, offset, len,
+ len, page, pgdata);
+ if (err < 0)
+ goto fail;
+
+ size -= len;
+ data += len;
+ offset += len;
+ } while (size);
+ return 0;
+fail:
+ return err;
+}
+
+static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
+{
+ struct umh_info *umh_info = info->data;
+ struct file *from_umh[2];
+ struct file *to_umh[2];
+ int err;
+
+ /* create pipe to send data to umh */
+ err = create_pipe_files(to_umh, 0);
+ if (err)
+ return err;
+ err = replace_fd(0, to_umh[0], 0);
+ fput(to_umh[0]);
+ if (err < 0) {
+ fput(to_umh[1]);
+ return err;
+ }
+
+ /* create pipe to receive data from umh */
+ err = create_pipe_files(from_umh, 0);
+ if (err) {
+ fput(to_umh[1]);
+ replace_fd(0, NULL, 0);
+ return err;
+ }
+ err = replace_fd(1, from_umh[1], 0);
+ fput(from_umh[1]);
+ if (err < 0) {
+ fput(to_umh[1]);
+ replace_fd(0, NULL, 0);
+ fput(from_umh[0]);
+ return err;
+ }
+
+ umh_info->pipe_to_umh = to_umh[1];
+ umh_info->pipe_from_umh = from_umh[0];
+ return 0;
+}
+
+static void umh_save_pid(struct subprocess_info *info)
+{
+ struct umh_info *umh_info = info->data;
+
+ umh_info->pid = info->pid;
+}
+
+int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
+{
+ struct subprocess_info *sub_info;
+ struct file *file = NULL;
+ int err;
+
+ err = alloc_tmpfs_file(len, &file);
+ if (err)
+ return err;
+
+ err = populate_file(file, data, len);
+ if (err)
+ goto out;
+
+ err = -ENOMEM;
+ sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
+ umh_save_pid, info);
+ if (!sub_info)
+ goto out;
+
+ err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+out:
+ fput(file);
+ return err;
+}
+EXPORT_SYMBOL_GPL(fork_usermode_blob);
+
/**
* call_usermodehelper_exec - start a usermode application
* @sub_info: information about the subprocessa
--
2.9.5
From: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
---
net/bpfilter/Makefile | 2 +-
net/bpfilter/bpfilter_mod.h | 285 ++++++++++++++++++++++++++++++++++++++++++-
net/bpfilter/ctor.c | 57 +++++----
net/bpfilter/gen.c | 290 ++++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/init.c | 11 +-
net/bpfilter/main.c | 15 ++-
net/bpfilter/sockopt.c | 137 ++++++++++++++++-----
net/bpfilter/tables.c | 5 +-
net/bpfilter/tgts.c | 1 +
9 files changed, 737 insertions(+), 66 deletions(-)
create mode 100644 net/bpfilter/gen.c
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index bec6181de995..3796651c76cb 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
#
hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
HOSTCFLAGS += -I. -Itools/include/
# a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index f0de41b20793..b4209985efff 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -21,8 +21,8 @@ struct bpfilter_table_info {
unsigned int initial_entries;
unsigned int hook_entry[BPFILTER_INET_HOOK_MAX];
unsigned int underflow[BPFILTER_INET_HOOK_MAX];
- unsigned int stacksize;
- void ***jumpstack;
+// unsigned int stacksize;
+// void ***jumpstack;
unsigned char entries[0] __aligned(8);
};
@@ -64,22 +64,55 @@ struct bpfilter_ipt_error {
struct bpfilter_target {
struct list_head all_target_list;
- const char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ char name[BPFILTER_EXTENSION_MAXNAMELEN];
unsigned int size;
int hold;
u16 family;
u8 rev;
};
+struct bpfilter_gen_ctx {
+ struct bpf_insn *img;
+ u32 len_cur;
+ u32 len_max;
+ u32 default_verdict;
+ int fd;
+ int ifindex;
+ bool offloaded;
+};
+
+union bpf_attr;
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+ struct bpfilter_ipt_ip *ent, int verdict);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
+
struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
void bpfilter_target_put(struct bpfilter_target *tgt);
int bpfilter_target_add(struct bpfilter_target *tgt);
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl, __u32 size_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize2(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents);
+
int bpfilter_ipv4_register_targets(void);
void bpfilter_tables_init(void);
int bpfilter_get_info(void *addr, int len);
int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_set_replace(void *cmd, int len);
+int bpfilter_set_add_counters(void *cmd, int len);
int bpfilter_ipv4_init(void);
int copy_from_user(void *dst, void *addr, int len);
@@ -93,4 +126,248 @@ extern int pid;
extern int debug_fd;
#define ENOTSUPP 524
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Endianness conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_MOV32_REG(DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM) \
+ BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_DW | BPF_IMM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = (__u32) (IMM) }), \
+ ((struct bpf_insn) { \
+ .code = 0, /* zero is reserved opcode */ \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = ((__u64) (IMM)) >> 32 })
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD) \
+ BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \
+ .dst_reg = 0, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Unconditional jumps, goto pc + off16 */
+
+#define BPF_JMP_A(OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_JA, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_CALL, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
+ ((struct bpf_insn) { \
+ .code = CODE, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN() \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_EXIT, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = 0 })
+
#endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
index efb7feef3c42..ba44c21cacfa 100644
--- a/net/bpfilter/ctor.c
+++ b/net/bpfilter/ctor.c
@@ -1,8 +1,12 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
-#include <linux/bitops.h>
#include <stdlib.h>
#include <stdio.h>
+#include <string.h>
+
+#include <sys/socket.h>
+
+#include <linux/bitops.h>
+
#include "bpfilter_mod.h"
unsigned int __sw_hweight32(unsigned int w)
@@ -13,35 +17,47 @@ unsigned int __sw_hweight32(unsigned int w)
return (w * 0x01010101) >> 24;
}
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+struct bpfilter_table_info *bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl,
+ __u32 size_ents)
{
unsigned int num_hooks = hweight32(tbl->valid_hooks);
- struct bpfilter_ipt_standard *tgts;
struct bpfilter_table_info *info;
- struct bpfilter_ipt_error *term;
- unsigned int mask, offset, h, i;
unsigned int size, alloc_size;
size = sizeof(struct bpfilter_ipt_standard) * num_hooks;
size += sizeof(struct bpfilter_ipt_error);
+ size += size_ents;
alloc_size = size + sizeof(struct bpfilter_table_info);
info = malloc(alloc_size);
- if (!info)
- return NULL;
+ if (info) {
+ memset(info, 0, alloc_size);
+ info->size = size;
+ }
+ return info;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents)
+{
+ unsigned int num_hooks = hweight32(tbl->valid_hooks);
+ struct bpfilter_ipt_standard *tgts;
+ struct bpfilter_ipt_error *term;
+ struct bpfilter_ipt_entry *ent;
+ unsigned int mask, offset, h, i;
- info->num_entries = num_hooks + 1;
- info->size = size;
+ info->num_entries = num_ents + num_hooks + 1;
- tgts = (struct bpfilter_ipt_standard *) (info + 1);
- term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+ ent = (struct bpfilter_ipt_entry *)(info + 1);
+ tgts = (struct bpfilter_ipt_standard *)((u8 *)ent + size_ents);
+ term = (struct bpfilter_ipt_error *)(tgts + num_hooks);
mask = tbl->valid_hooks;
offset = 0;
h = 0;
i = 0;
- dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
while (mask) {
struct bpfilter_ipt_standard *t;
@@ -55,7 +71,6 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
t->target.target.u.kernel.target =
bpfilter_target_get_by_name(t->target.target.u.user.name);
- dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
if (!t->target.target.u.kernel.target)
goto out_fail;
@@ -67,14 +82,10 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
*term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
term->target.target.u.kernel.target =
bpfilter_target_get_by_name(term->target.target.u.user.name);
- dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
- if (!term->target.target.u.kernel.target)
- goto out_fail;
-
- dprintf(debug_fd, "info %p\n", info);
- return info;
-
+ if (!term->target.target.u.kernel.target) {
out_fail:
- free(info);
- return NULL;
+ free(info);
+ return NULL;
+ }
+ return info;
}
diff --git a/net/bpfilter/gen.c b/net/bpfilter/gen.c
new file mode 100644
index 000000000000..8e08561b78f1
--- /dev/null
+++ b/net/bpfilter/gen.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_link.h>
+#include <linux/rtnetlink.h>
+#include <linux/bpf.h>
+typedef __u16 __bitwise __sum16; /* hack */
+#include <linux/ip.h>
+
+#include <arpa/inet.h>
+
+#include "bpfilter_mod.h"
+
+unsigned int if_nametoindex(const char *ifname);
+
+static inline __u64 bpf_ptr_to_u64(const void *ptr)
+{
+ return (__u64)(unsigned long)ptr;
+}
+
+static int bpf_prog_load(enum bpf_prog_type type,
+ const struct bpf_insn *insns,
+ unsigned int insn_num,
+ __u32 offload_ifindex)
+{
+ union bpf_attr attr = {};
+
+ attr.prog_type = type;
+ attr.insns = bpf_ptr_to_u64(insns);
+ attr.insn_cnt = insn_num;
+ attr.license = bpf_ptr_to_u64("GPL");
+ attr.prog_ifindex = offload_ifindex;
+
+ return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
+static int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+ struct sockaddr_nl sa;
+ int sock, seq = 0, len, ret = -1;
+ char buf[4096];
+ struct nlattr *nla, *nla_xdp;
+ struct {
+ struct nlmsghdr nh;
+ struct ifinfomsg ifinfo;
+ char attrbuf[64];
+ } req;
+ struct nlmsghdr *nh;
+ struct nlmsgerr *err;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.nl_family = AF_NETLINK;
+
+ sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+ if (sock < 0) {
+ printf("open netlink socket: %s\n", strerror(errno));
+ return -1;
+ }
+
+ if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+ printf("bind to netlink: %s\n", strerror(errno));
+ goto cleanup;
+ }
+
+ memset(&req, 0, sizeof(req));
+ req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+ req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ req.nh.nlmsg_type = RTM_SETLINK;
+ req.nh.nlmsg_pid = 0;
+ req.nh.nlmsg_seq = ++seq;
+ req.ifinfo.ifi_family = AF_UNSPEC;
+ req.ifinfo.ifi_index = ifindex;
+
+ /* start nested attribute for XDP */
+ nla = (struct nlattr *)(((char *)&req)
+ + NLMSG_ALIGN(req.nh.nlmsg_len));
+ nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+ nla->nla_len = NLA_HDRLEN;
+
+ /* add XDP fd */
+ nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+ nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+ nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+ memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+ nla->nla_len += nla_xdp->nla_len;
+
+ /* if user passed in any flags, add those too */
+ if (flags) {
+ nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+ nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
+ nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+ memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+ nla->nla_len += nla_xdp->nla_len;
+ }
+
+ req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+ if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+ printf("send to netlink: %s\n", strerror(errno));
+ goto cleanup;
+ }
+
+ len = recv(sock, buf, sizeof(buf), 0);
+ if (len < 0) {
+ printf("recv from netlink: %s\n", strerror(errno));
+ goto cleanup;
+ }
+
+ for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+ nh = NLMSG_NEXT(nh, len)) {
+ if (nh->nlmsg_pid != getpid()) {
+ printf("Wrong pid %d, expected %d\n",
+ nh->nlmsg_pid, getpid());
+ goto cleanup;
+ }
+ if (nh->nlmsg_seq != seq) {
+ printf("Wrong seq %d, expected %d\n",
+ nh->nlmsg_seq, seq);
+ goto cleanup;
+ }
+ switch (nh->nlmsg_type) {
+ case NLMSG_ERROR:
+ err = (struct nlmsgerr *)NLMSG_DATA(nh);
+ if (!err->error)
+ continue;
+ printf("nlmsg error %s\n", strerror(-err->error));
+ goto cleanup;
+ case NLMSG_DONE:
+ break;
+ }
+ }
+
+ ret = 0;
+
+cleanup:
+ close(sock);
+ return ret;
+}
+
+static int bpfilter_load_dev(struct bpfilter_gen_ctx *ctx)
+{
+ u32 xdp_flags = 0;
+
+ if (ctx->offloaded)
+ xdp_flags |= XDP_FLAGS_HW_MODE;
+ return bpf_set_link_xdp_fd(ctx->ifindex, ctx->fd, xdp_flags);
+}
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx)
+{
+ unsigned int len_max = BPF_MAXINSNS;
+
+ memset(ctx, 0, sizeof(*ctx));
+ ctx->img = calloc(len_max, sizeof(struct bpf_insn));
+ if (!ctx->img)
+ return -ENOMEM;
+ ctx->len_max = len_max;
+ ctx->fd = -1;
+ ctx->default_verdict = XDP_PASS;
+
+ return 0;
+}
+
+#define EMIT(x) \
+ do { \
+ if (ctx->len_cur + 1 > ctx->len_max) \
+ return -ENOMEM; \
+ ctx->img[ctx->len_cur++] = x; \
+ } while (0)
+
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx)
+{
+ EMIT(BPF_MOV64_REG(BPF_REG_9, BPF_REG_1));
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_9,
+ offsetof(struct xdp_md, data)));
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_9,
+ offsetof(struct xdp_md, data_end)));
+ EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+ EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, ETH_HLEN));
+ EMIT(BPF_JMP_REG(BPF_JLE, BPF_REG_1, BPF_REG_3, 2));
+ EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+ EMIT(BPF_EXIT_INSN());
+ return 0;
+}
+
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx)
+{
+ EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+ EMIT(BPF_EXIT_INSN());
+ return 0;
+}
+
+static int bpfilter_gen_check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF "\xff\xff\xff\xff"
+ static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+ static const __u8 mask0[IFNAMSIZ] = { };
+ int ones = strlen(ent->in_iface); ones += ones > 0;
+#undef M_FF
+ if (strlen(ent->out_iface) > 0)
+ return -ENOTSUPP;
+ if (memcmp(ent->in_iface_mask, mask1, ones) ||
+ memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+ return -ENOTSUPP;
+ if ((ent->src_mask != 0 && ent->src_mask != 0xffffffff) ||
+ (ent->dst_mask != 0 && ent->dst_mask != 0xffffffff))
+ return -ENOTSUPP;
+
+ return 0;
+}
+
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+ struct bpfilter_ipt_ip *ent, int verdict)
+{
+ u32 match_xdp = verdict == -1 ? XDP_DROP : XDP_PASS;
+ int ret, ifindex, match_state = 0;
+
+ /* convention R1: tmp, R2: data, R3: data_end, R9: xdp_buff */
+ ret = bpfilter_gen_check_entry(ent);
+ if (ret < 0)
+ return ret;
+ if (ent->src_mask == 0 && ent->dst_mask == 0)
+ return 0;
+
+ ifindex = if_nametoindex(ent->in_iface);
+ if (!ifindex)
+ return 0;
+ if (ctx->ifindex && ctx->ifindex != ifindex)
+ return -ENOTSUPP;
+
+ ctx->ifindex = ifindex;
+ match_state = !!ent->src_mask + !!ent->dst_mask;
+
+ EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+ EMIT(BPF_MOV32_IMM(BPF_REG_5, 0));
+ EMIT(BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1,
+ offsetof(struct ethhdr, h_proto)));
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, htons(ETH_P_IP),
+ 3 + match_state * 3));
+ EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+ sizeof(struct ethhdr) + sizeof(struct iphdr)));
+ EMIT(BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1 + match_state * 3));
+ EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -(int)sizeof(struct iphdr)));
+ if (ent->src_mask) {
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct iphdr, saddr)));
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->src, 1));
+ EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+ }
+ if (ent->dst_mask) {
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct iphdr, daddr)));
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->dst, 1));
+ EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+ }
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_5, match_state, 2));
+ EMIT(BPF_MOV32_IMM(BPF_REG_0, match_xdp));
+ EMIT(BPF_EXIT_INSN());
+ return 0;
+}
+
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx)
+{
+ int ret;
+
+ ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+ ctx->len_cur, ctx->ifindex);
+ if (ret > 0)
+ ctx->offloaded = true;
+ if (ret < 0)
+ ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+ ctx->len_cur, 0);
+ if (ret > 0) {
+ ctx->fd = ret;
+ ret = bpfilter_load_dev(ctx);
+ }
+
+ return ret < 0 ? ret : 0;
+}
+
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx)
+{
+ free(ctx->img);
+ close(ctx->fd);
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
index 699f3f623189..14e621a03217 100644
--- a/net/bpfilter/init.c
+++ b/net/bpfilter/init.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
#include <errno.h>
+
+#include <sys/socket.h>
+
#include "bpfilter_mod.h"
static struct bpfilter_table filter_table_ipv4 = {
@@ -22,12 +24,13 @@ int bpfilter_ipv4_init(void)
if (err)
return err;
- info = bpfilter_ipv4_table_ctor(t);
+ info = bpfilter_ipv4_table_alloc(t, 0);
+ if (!info)
+ return -ENOMEM;
+ info = bpfilter_ipv4_table_finalize(t, info, 0, 0);
if (!info)
return -ENOMEM;
-
t->info = info;
-
return bpfilter_table_add(&filter_table_ipv4);
}
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index e0273ca201ad..ebd8a4fb1e95 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -1,20 +1,23 @@
// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
-#include <sys/uio.h>
#include <errno.h>
#include <stdio.h>
-#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>
-#include "include/uapi/linux/bpf.h"
+
+#include <sys/uio.h>
+#include <sys/socket.h>
+
#include <asm/unistd.h>
+
+#include "include/uapi/linux/bpf.h"
+
#include "bpfilter_mod.h"
#include "msgfmt.h"
extern long int syscall (long int __sysno, ...);
-static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
- unsigned int size)
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
return syscall(321, cmd, attr, size);
}
@@ -39,7 +42,7 @@ int copy_to_user(void *addr, const void *src, int len)
struct iovec local;
struct iovec remote;
- local.iov_base = (void *) src;
+ local.iov_base = (void *)src;
local.iov_len = len;
remote.iov_base = addr;
remote.iov_len = len;
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
index 43687daf51a3..26ad12a11736 100644
--- a/net/bpfilter/sockopt.c
+++ b/net/bpfilter/sockopt.c
@@ -1,10 +1,14 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/socket.h>
+
#include "bpfilter_mod.h"
+/* TODO: Move all of this into a proper encoding/decoding layer. */
static int fetch_name(void *addr, int len, char *name, int name_len)
{
if (copy_from_user(name, addr, name_len))
@@ -55,12 +59,17 @@ int bpfilter_get_info(void *addr, int len)
return err;
}
-static int copy_target(struct bpfilter_standard_target *ut,
- struct bpfilter_standard_target *kt)
+static int target_u2k(struct bpfilter_standard_target *kt)
{
- struct bpfilter_target *tgt;
- int sz;
+ kt->target.u.kernel.target =
+ bpfilter_target_get_by_name(kt->target.u.user.name);
+ return kt->target.u.kernel.target ? 0 : -EINVAL;
+}
+static int target_k2u(struct bpfilter_standard_target *ut,
+ struct bpfilter_standard_target *kt)
+{
+ struct bpfilter_target *tgt;
if (put_user(kt->target.u.target_size,
&ut->target.u.target_size))
@@ -69,12 +78,9 @@ static int copy_target(struct bpfilter_standard_target *ut,
tgt = kt->target.u.kernel.target;
if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
return -EFAULT;
-
if (put_user(tgt->rev, &ut->target.u.user.revision))
return -EFAULT;
-
- sz = tgt->size;
- if (copy_to_user(ut->target.data, kt->target.data, sz))
+ if (copy_to_user(ut->target.data, kt->target.data, tgt->size))
return -EFAULT;
return 0;
@@ -84,30 +90,25 @@ static int do_get_entries(void *up,
struct bpfilter_table *tbl,
struct bpfilter_table_info *info)
{
- unsigned int total_size = info->size;
const struct bpfilter_ipt_entry *ent;
+ unsigned int total_size = info->size;
+ void *base = info->entries;
unsigned int off;
- void *base;
-
- base = info->entries;
for (off = 0; off < total_size; off += ent->next_offset) {
- struct bpfilter_xt_counters *cntrs;
struct bpfilter_standard_target *tgt;
+ struct bpfilter_xt_counters *cntrs;
ent = base + off;
if (copy_to_user(up + off, ent, sizeof(*ent)))
return -EFAULT;
-
- /* XXX Just clear counters for now. XXX */
+ /* XXX: Just clear counters for now. */
cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
if (put_user(0, &cntrs->packet_cnt) ||
put_user(0, &cntrs->byte_cnt))
return -EINVAL;
-
- tgt = (void *) ent + ent->target_offset;
- dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
- if (copy_target(up + off + ent->target_offset, tgt))
+ tgt = (void *)ent + ent->target_offset;
+ if (target_k2u(up + off + ent->target_offset, tgt))
return -EFAULT;
}
return 0;
@@ -123,31 +124,113 @@ int bpfilter_get_entries(void *cmd, int len)
if (len < sizeof(struct bpfilter_ipt_get_entries))
return -EINVAL;
-
if (copy_from_user(&req, cmd, sizeof(req)))
return -EFAULT;
-
tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
if (!tbl)
return -ENOENT;
-
info = tbl->info;
if (!info) {
err = -ENOENT;
goto out_put;
}
-
if (info->size != req.size) {
err = -EINVAL;
goto out_put;
}
-
err = do_get_entries(uptr->entries, tbl, info);
- dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
-
out_put:
bpfilter_table_put(tbl);
+ return err;
+}
+static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
+ struct bpfilter_table *tbl)
+{
+ unsigned int total_size = req->size;
+ struct bpfilter_table_info *info;
+ struct bpfilter_ipt_entry *ent;
+ struct bpfilter_gen_ctx ctx;
+ unsigned int off, sents = 0, ents = 0;
+ int ret;
+
+ ret = bpfilter_gen_init(&ctx);
+ if (ret < 0)
+ return ret;
+ ret = bpfilter_gen_prologue(&ctx);
+ if (ret < 0)
+ return ret;
+ info = bpfilter_ipv4_table_alloc(tbl, total_size);
+ if (!info)
+ return -ENOMEM;
+ if (copy_from_user(&info->entries[0], base, req->size)) {
+ free(info);
+ return -EFAULT;
+ }
+ base = &info->entries[0];
+ for (off = 0; off < total_size; off += ent->next_offset) {
+ struct bpfilter_standard_target *tgt;
+ ent = base + off;
+ ents++;
+ sents += ent->next_offset;
+ tgt = (void *) ent + ent->target_offset;
+ target_u2k(tgt);
+ ret = bpfilter_gen_append(&ctx, &ent->ip, tgt->verdict);
+ if (ret < 0)
+ goto err;
+ }
+ info->num_entries = ents;
+ info->size = sents;
+ memcpy(info->hook_entry, req->hook_entry, sizeof(info->hook_entry));
+ memcpy(info->underflow, req->underflow, sizeof(info->underflow));
+ ret = bpfilter_gen_epilogue(&ctx);
+ if (ret < 0)
+ goto err;
+ ret = bpfilter_gen_commit(&ctx);
+ if (ret < 0)
+ goto err;
+ free(tbl->info);
+ tbl->info = info;
+ bpfilter_gen_destroy(&ctx);
+ dprintf(debug_fd, "offloaded %u\n", ctx.offloaded);
+ return ret;
+err:
+ free(info);
+ return ret;
+}
+
+int bpfilter_set_replace(void *cmd, int len)
+{
+ struct bpfilter_ipt_replace *uptr = cmd;
+ struct bpfilter_ipt_replace req;
+ struct bpfilter_table_info *info;
+ struct bpfilter_table *tbl;
+ int err;
+
+ if (len < sizeof(req))
+ return -EINVAL;
+ if (copy_from_user(&req, cmd, sizeof(req)))
+ return -EFAULT;
+ if (req.num_counters >= INT_MAX / sizeof(struct bpfilter_xt_counters))
+ return -ENOMEM;
+ if (req.num_counters == 0)
+ return -EINVAL;
+ req.name[sizeof(req.name) - 1] = 0;
+ tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+ if (!tbl)
+ return -ENOENT;
+ info = tbl->info;
+ if (!info) {
+ err = -ENOENT;
+ goto out_put;
+ }
+ err = do_set_replace(&req, uptr->entries, tbl);
+out_put:
+ bpfilter_table_put(tbl);
return err;
}
+int bpfilter_set_add_counters(void *cmd, int len)
+{
+ return 0;
+}
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
index 9a96599be634..e0dab283092d 100644
--- a/net/bpfilter/tables.c
+++ b/net/bpfilter/tables.c
@@ -1,8 +1,11 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
#include <errno.h>
#include <string.h>
+
+#include <sys/socket.h>
+
#include <linux/hashtable.h>
+
#include "bpfilter_mod.h"
static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
index eac5e8ac0b4b..0a00bc289d3d 100644
--- a/net/bpfilter/tgts.c
+++ b/net/bpfilter/tgts.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <sys/socket.h>
+
#include "bpfilter_mod.h"
struct bpfilter_target std_tgt = {
--
2.9.5
On 03/05/18 05:36, Alexei Starovoitov wrote:
> bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
> and user mode helper code that is embedded into bpfilter.ko
>
> The steps to build bpfilter.ko are the following:
> - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
> - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
> is converted into bpfilter_umh.o object file
> with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
> Example:
> $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
> 0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
> 0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
> 0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
> - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
>
> bpfilter_kern.c is a normal kernel module code that calls
> the fork_usermode_blob() helper to execute part of its own data
> as a user mode process.
>
> Notice that _binary_net_bpfilter_bpfilter_umh_start - end
> is placed into .init.rodata section, so it's freed as soon as __init
> function of bpfilter.ko is finished.
> As part of __init, bpfilter.ko does a first request/reply exchange
> via the two unix pipes provided by the fork_usermode_blob() helper to
> make sure that the umh is healthy. If not, it will kill it via its pid.
>
> Later bpfilter_process_sockopt() will be called from bpfilter hooks
> in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
>
> If admin does 'rmmod bpfilter' the __exit code of bpfilter.ko will
> kill umh as well.
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> include/linux/bpfilter.h | 15 +++++++
> include/uapi/linux/bpfilter.h | 21 ++++++++++
> net/Kconfig | 2 +
> net/Makefile | 1 +
> net/bpfilter/Kconfig | 17 ++++++++
> net/bpfilter/Makefile | 24 +++++++++++
> net/bpfilter/bpfilter_kern.c | 93 +++++++++++++++++++++++++++++++++++++++++++
> net/bpfilter/main.c | 63 +++++++++++++++++++++++++++++
> net/bpfilter/msgfmt.h | 17 ++++++++
> net/ipv4/Makefile | 2 +
> net/ipv4/bpfilter/Makefile | 2 +
> net/ipv4/bpfilter/sockopt.c | 42 +++++++++++++++++++
> net/ipv4/ip_sockglue.c | 17 ++++++++
> 13 files changed, 316 insertions(+)
> create mode 100644 include/linux/bpfilter.h
> create mode 100644 include/uapi/linux/bpfilter.h
> create mode 100644 net/bpfilter/Kconfig
> create mode 100644 net/bpfilter/Makefile
> create mode 100644 net/bpfilter/bpfilter_kern.c
> create mode 100644 net/bpfilter/main.c
> create mode 100644 net/bpfilter/msgfmt.h
> create mode 100644 net/ipv4/bpfilter/Makefile
> create mode 100644 net/ipv4/bpfilter/sockopt.c
>
> diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
> new file mode 100644
> index 000000000000..687b1760bb9f
> --- /dev/null
> +++ b/include/linux/bpfilter.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_BPFILTER_H
> +#define _LINUX_BPFILTER_H
> +
> +#include <uapi/linux/bpfilter.h>
> +
> +struct sock;
> +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
> + unsigned int optlen);
> +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
> + int *optlen);
> +extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
> + char __user *optval,
> + unsigned int optlen, bool is_set);
> +#endif
> diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
> new file mode 100644
> index 000000000000..2ec3cc99ea4c
> --- /dev/null
> +++ b/include/uapi/linux/bpfilter.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _UAPI_LINUX_BPFILTER_H
> +#define _UAPI_LINUX_BPFILTER_H
> +
> +#include <linux/if.h>
> +
> +enum {
> + BPFILTER_IPT_SO_SET_REPLACE = 64,
> + BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
> + BPFILTER_IPT_SET_MAX,
> +};
> +
> +enum {
> + BPFILTER_IPT_SO_GET_INFO = 64,
> + BPFILTER_IPT_SO_GET_ENTRIES = 65,
> + BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
> + BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
> + BPFILTER_IPT_GET_MAX,
> +};
> +
> +#endif /* _UAPI_LINUX_BPFILTER_H */
> diff --git a/net/Kconfig b/net/Kconfig
> index b62089fb1332..ed6368b306fa 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
>
> endif
>
> +source "net/bpfilter/Kconfig"
> +
> source "net/dccp/Kconfig"
> source "net/sctp/Kconfig"
> source "net/rds/Kconfig"
> diff --git a/net/Makefile b/net/Makefile
> index a6147c61b174..7f982b7682bd 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_TLS) += tls/
> obj-$(CONFIG_XFRM) += xfrm/
> obj-$(CONFIG_UNIX) += unix/
> obj-$(CONFIG_NET) += ipv6/
> +obj-$(CONFIG_BPFILTER) += bpfilter/
> obj-$(CONFIG_PACKET) += packet/
> obj-$(CONFIG_NET_KEY) += key/
> obj-$(CONFIG_BRIDGE) += bridge/
> diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
> new file mode 100644
> index 000000000000..782a732b9a5c
> --- /dev/null
> +++ b/net/bpfilter/Kconfig
> @@ -0,0 +1,17 @@
> +menuconfig BPFILTER
> + bool "BPF based packet filtering framework (BPFILTER)"
> + default n
> + depends on NET && BPF
> + help
> + This builds experimental bpfilter framework that is aiming to
> + provide netfilter compatible functionality via BPF
> +
> +if BPFILTER
> +config BPFILTER_UMH
> + tristate "bpftiler kernel module with user mode helper"
sp. "bpftiler" -> "bpfilter"
> + default m
> + depends on m
> + help
> + This builds bpfilter kernel module with embedded user mode helper
> +endif
> +
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> new file mode 100644
> index 000000000000..897eedae523e
> --- /dev/null
> +++ b/net/bpfilter/Makefile
> @@ -0,0 +1,24 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for the Linux BPFILTER layer.
> +#
> +
> +hostprogs-y := bpfilter_umh
> +bpfilter_umh-objs := main.o
> +HOSTCFLAGS += -I. -Itools/include/
> +
> +# a bit of elf magic to convert bpfilter_umh binary into a binary blob
> +# inside bpfilter_umh.o elf file referenced by
> +# _binary_net_bpfilter_bpfilter_umh_start symbol
> +# which bpfilter_kern.c passes further into umh blob loader at run-time
> +quiet_cmd_copy_umh = GEN $@
> + cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> + $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
> + -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> + --rename-section .data=.init.rodata $< $@
> +
> +$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
> + $(call cmd,copy_umh)
> +
> +obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
> +bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> new file mode 100644
> index 000000000000..e0a6fdd5842b
> --- /dev/null
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/umh.h>
> +#include <linux/bpfilter.h>
> +#include <linux/sched.h>
> +#include <linux/sched/signal.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include "msgfmt.h"
> +
> +#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
> +#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
> +
> +extern char UMH_start;
> +extern char UMH_end;
> +
> +static struct umh_info info;
> +
> +static void shutdown_umh(struct umh_info *info)
> +{
> + struct task_struct *tsk;
> +
> + tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
> + if (tsk)
> + force_sig(SIGKILL, tsk);
> + fput(info->pipe_to_umh);
> + fput(info->pipe_from_umh);
> +}
> +
> +static void stop_umh(void)
> +{
> + if (bpfilter_process_sockopt) {
I worry about locking here. Is it possible for two calls to
bpfilter_process_sockopt() to run in parallel, both fail, and thus both
call stop_umh()? If both end up calling shutdown_umh(), we double
fput().
> + bpfilter_process_sockopt = NULL;
> + shutdown_umh(&info);
> + }
> +}
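To illustrate the double-fput() concern above: one way to make the teardown
run at most once is to claim it with an atomic exchange, so that only the
caller that observes the old "active" state performs the shutdown. The
following is a user-space sketch using C11 atomics, not kernel code. All
names (once_active, once_start, once_stop, once_shutdown) are hypothetical;
in the kernel the same idea would presumably be expressed with xchg() on
the function pointer or with a mutex around the check and the teardown.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for "bpfilter_process_sockopt != NULL". */
static atomic_bool once_active;

/* Counts how many times the teardown body actually ran. */
static int once_shutdown_calls;

/* Stand-in for shutdown_umh(): must never run twice for one start. */
static void once_shutdown(void)
{
	once_shutdown_calls++;
}

/* Stand-in for the successful tail of load_umh(). */
static void once_start(void)
{
	atomic_store(&once_active, true);
}

/* Stand-in for stop_umh(): safe to call from several failure paths. */
static void once_stop(void)
{
	/*
	 * Clear the flag and look at the value it had before. Only the
	 * caller that flipped true -> false goes on to tear things down.
	 */
	if (atomic_exchange(&once_active, false))
		once_shutdown();
}
```

With this shape a second once_stop() is a no-op, whichever caller gets there
first. It does not by itself stop another thread from still using the pipes
while the teardown runs; that would need real locking.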
> +
> +static int __bpfilter_process_sockopt(struct sock *sk, int optname,
> + char __user *optval,
> + unsigned int optlen, bool is_set)
> +{
> + struct mbox_request req;
> + struct mbox_reply reply;
> + loff_t pos;
> + ssize_t n;
> +
> + req.is_set = is_set;
> + req.pid = current->pid;
> + req.cmd = optname;
> + req.addr = (long)optval;
> + req.len = optlen;
> + n = __kernel_write(info.pipe_to_umh, &req, sizeof(req), &pos);
> + if (n != sizeof(req)) {
> + pr_err("write fail %zd\n", n);
> + stop_umh();
> + return -EFAULT;
> + }
> + pos = 0;
> + n = kernel_read(info.pipe_from_umh, &reply, sizeof(reply), &pos);
> + if (n != sizeof(reply)) {
> + pr_err("read fail %zd\n", n);
> + stop_umh();
> + return -EFAULT;
> + }
> + return reply.status;
> +}
> +
> +static int __init load_umh(void)
> +{
> + int err;
> +
> + err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
> + if (err)
> + return err;
> + pr_info("Loaded umh pid %d\n", info.pid);
> + bpfilter_process_sockopt = &__bpfilter_process_sockopt;
> +
> + if (__bpfilter_process_sockopt(NULL, 0, 0, 0, 0) != 0) {
> + stop_umh();
> + return -EFAULT;
> + }
> + return 0;
> +}
> +
> +static void __exit fini_umh(void)
> +{
> + stop_umh();
> +}
> +module_init(load_umh);
> +module_exit(fini_umh);
> +MODULE_LICENSE("GPL");
> diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
> new file mode 100644
> index 000000000000..81bbc1684896
> --- /dev/null
> +++ b/net/bpfilter/main.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <sys/uio.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <sys/socket.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include "include/uapi/linux/bpf.h"
> +#include <asm/unistd.h>
> +#include "msgfmt.h"
> +
> +int debug_fd;
> +
> +static int handle_get_cmd(struct mbox_request *cmd)
> +{
> + switch (cmd->cmd) {
> + case 0:
> + return 0;
> + default:
> + break;
> + }
> + return -ENOPROTOOPT;
> +}
> +
> +static int handle_set_cmd(struct mbox_request *cmd)
> +{
> + return -ENOPROTOOPT;
> +}
> +
> +static void loop(void)
> +{
> + while (1) {
> + struct mbox_request req;
> + struct mbox_reply reply;
> + int n;
> +
> + n = read(0, &req, sizeof(req));
> + if (n != sizeof(req)) {
> + dprintf(debug_fd, "invalid request %d\n", n);
> + return;
> + }
> +
> + reply.status = req.is_set ?
> + handle_set_cmd(&req) :
> + handle_get_cmd(&req);
> +
> + n = write(1, &reply, sizeof(reply));
> + if (n != sizeof(reply)) {
> + dprintf(debug_fd, "reply failed %d\n", n);
> + return;
> + }
> + }
> +}
> +
> +int main(void)
> +{
> + debug_fd = open("/dev/console", 00000002 | 00000100);
Should probably handle failure of this open() call.
> + dprintf(debug_fd, "Started bpfilter\n");
> + loop();
> + close(debug_fd);
> + return 0;
> +}
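A minimal sketch of what handling the open() failure could look like, under
the assumption that losing debug output is acceptable and should not stop
the helper. The helper names (dbg_open, dbg_write, dbg_close) are made up for
illustration. Note that on x86 Linux the octal constants in the patch
correspond to O_RDWR | O_CREAT, and O_CREAT expects a mode argument that the
call does not pass; the sketch uses plain O_WRONLY instead, which is an
assumption about the intent.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns a writable fd for debug output, or -1 if path cannot be opened. */
static int dbg_open(const char *path)
{
	return open(path, O_WRONLY);
}

/*
 * Writes msg to fd. A negative fd means "logging disabled" and is treated
 * as success, so callers do not need to check before every message.
 */
static int dbg_write(int fd, const char *msg)
{
	size_t len = strlen(msg);

	if (fd < 0)
		return 0;
	return write(fd, msg, len) == (ssize_t)len ? 0 : -1;
}

/* Closes the debug fd only if it was opened successfully. */
static void dbg_close(int fd)
{
	if (fd >= 0)
		close(fd);
}
```

In main() this would replace the bare open()/dprintf()/close() sequence, so
that a missing /dev/console does not turn every debug print into a write to
fd -1.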
> diff --git a/net/bpfilter/msgfmt.h b/net/bpfilter/msgfmt.h
> new file mode 100644
> index 000000000000..94b9ac9e5114
> --- /dev/null
> +++ b/net/bpfilter/msgfmt.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _NET_BPFILTER_MSGFMT_H
> +#define _NET_BPFTILER_MSGFMT_H
Another bpftiler here, should be
+#define _NET_BPFILTER_MSGFMT_H
-Ed
> +
> +struct mbox_request {
> + __u64 addr;
> + __u32 len;
> + __u32 is_set;
> + __u32 cmd;
> + __u32 pid;
> +};
> +
> +struct mbox_reply {
> + __u32 status;
> +};
> +
> +#endif
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index b379520f9133..7018f91c5a39 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -16,6 +16,8 @@ obj-y := route.o inetpeer.o protocol.o \
> inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
> metrics.o
>
> +obj-$(CONFIG_BPFILTER) += bpfilter/
> +
> obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
> obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
> obj-$(CONFIG_PROC_FS) += proc.o
> diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile
> new file mode 100644
> index 000000000000..ce262d76cc48
> --- /dev/null
> +++ b/net/ipv4/bpfilter/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_BPFILTER) += sockopt.o
> +
> diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
> new file mode 100644
> index 000000000000..42a96d2d8d05
> --- /dev/null
> +++ b/net/ipv4/bpfilter/sockopt.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/uaccess.h>
> +#include <linux/bpfilter.h>
> +#include <uapi/linux/bpf.h>
> +#include <linux/wait.h>
> +#include <linux/kmod.h>
> +
> +int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
> + char __user *optval,
> + unsigned int optlen, bool is_set);
> +EXPORT_SYMBOL_GPL(bpfilter_process_sockopt);
> +
> +int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval,
> + unsigned int optlen, bool is_set)
> +{
> + if (!bpfilter_process_sockopt) {
> + int err = request_module("bpfilter");
> +
> + if (err)
> + return err;
> + if (!bpfilter_process_sockopt)
> + return -ECHILD;
> + }
> + return bpfilter_process_sockopt(sk, optname, optval, optlen, is_set);
> +}
> +
> +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
> + unsigned int optlen)
> +{
> + return bpfilter_mbox_request(sk, optname, optval, optlen, true);
> +}
> +
> +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
> + int __user *optlen)
> +{
> + int len;
> +
> + if (get_user(len, optlen))
> + return -EFAULT;
> +
> + return bpfilter_mbox_request(sk, optname, optval, len, false);
> +}
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index 5ad2d8ed3a3f..e0791faacb24 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -47,6 +47,8 @@
> #include <linux/errqueue.h>
> #include <linux/uaccess.h>
>
> +#include <linux/bpfilter.h>
> +
> /*
> * SOL_IP control messages.
> */
> @@ -1244,6 +1246,11 @@ int ip_setsockopt(struct sock *sk, int level,
> return -ENOPROTOOPT;
>
> err = do_ip_setsockopt(sk, level, optname, optval, optlen);
> +#ifdef CONFIG_BPFILTER
> + if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
> + optname < BPFILTER_IPT_SET_MAX)
> + err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
> +#endif
> #ifdef CONFIG_NETFILTER
> /* we need to exclude all possible ENOPROTOOPTs except default case */
> if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
> @@ -1552,6 +1559,11 @@ int ip_getsockopt(struct sock *sk, int level,
> int err;
>
> err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0);
> +#ifdef CONFIG_BPFILTER
> + if (optname >= BPFILTER_IPT_SO_GET_INFO &&
> + optname < BPFILTER_IPT_GET_MAX)
> + err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
> +#endif
> #ifdef CONFIG_NETFILTER
> /* we need to exclude all possible ENOPROTOOPTs except default case */
> if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
> @@ -1584,6 +1596,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname,
> err = do_ip_getsockopt(sk, level, optname, optval, optlen,
> MSG_CMSG_COMPAT);
>
> +#ifdef CONFIG_BPFILTER
> + if (optname >= BPFILTER_IPT_SO_GET_INFO &&
> + optname < BPFILTER_IPT_GET_MAX)
> + err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
> +#endif
> #ifdef CONFIG_NETFILTER
> /* we need to exclude all possible ENOPROTOOPTs except default case */
> if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
What a mighty short list of reviewers. Adding some more. My review below.
I'd appreciate a Cc on future versions of these patches.
On Wed, May 02, 2018 at 09:36:01PM -0700, Alexei Starovoitov wrote:
> Introduce helper:
> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> struct umh_info {
> struct file *pipe_to_umh;
> struct file *pipe_from_umh;
> pid_t pid;
> };
>
> that GPLed kernel modules (signed or unsigned) can use to execute part
> of their own data as a swappable user mode process.
>
> The kernel will do:
> - mount "tmpfs"
Actually it's a *shared* vfsmount tmpfs for all umh blobs.
> - allocate a unique file in tmpfs
> - populate that file with [data, data + len] bytes
> - user-mode-helper code will do_execve that file and, before the process
> starts, the kernel will create two unix pipes for bidirectional
> communication between kernel module and umh
> - close tmpfs file, effectively deleting it
> - the fork_usermode_blob will return zero on success and populate
> 'struct umh_info' with two unix pipes and the pid of the user process
But since it's using UMH_WAIT_EXEC, all we can currently guarantee is that
the exec was started as intended. The return value in no way reflects
whether the helper then ran successfully. More below.
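To make the distinction concrete, here is a user-space analogy in C, not the
kernel implementation: a spawn call that only reports that the child was
created, plus a separate request/reply handshake over two pipes that
actually proves the child is alive. Everything named pipe_demo_* is
hypothetical, and the struct layouts are simplified stand-ins for
mbox_request, mbox_reply and umh_info from the patches.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Simplified stand-ins for mbox_request, mbox_reply and umh_info. */
struct pipe_demo_request { uint32_t cmd; };
struct pipe_demo_reply { uint32_t status; };
struct pipe_demo_helper { int to_helper; int from_helper; pid_t pid; };

/* Child side: answer every request; cmd 0 is the "are you alive" probe. */
static void pipe_demo_loop(int in, int out)
{
	struct pipe_demo_request req;
	struct pipe_demo_reply reply;

	while (read(in, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
		reply.status = req.cmd == 0 ? 0 : 1;
		if (write(out, &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
			break;
	}
	_exit(0);
}

/* Returns 0 once the child exists; says nothing about whether it works. */
static int pipe_demo_spawn(struct pipe_demo_helper *h)
{
	int to[2], from[2];

	if (pipe(to) < 0)
		return -1;
	if (pipe(from) < 0) {
		close(to[0]);
		close(to[1]);
		return -1;
	}
	h->pid = fork();
	if (h->pid < 0) {
		close(to[0]);
		close(to[1]);
		close(from[0]);
		close(from[1]);
		return -1;
	}
	if (h->pid == 0) {
		/* Child keeps the read end of "to" and the write end of "from". */
		close(to[1]);
		close(from[0]);
		pipe_demo_loop(to[0], from[1]);
	}
	/* Parent keeps the opposite ends. */
	close(to[0]);
	close(from[1]);
	h->to_helper = to[1];
	h->from_helper = from[0];
	return 0;
}

/* The handshake: only a well-formed reply shows the helper is running. */
static int pipe_demo_ping(const struct pipe_demo_helper *h)
{
	struct pipe_demo_request req = { .cmd = 0 };
	struct pipe_demo_reply reply;

	if (write(h->to_helper, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;
	if (read(h->from_helper, &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
		return -1;
	return reply.status == 0 ? 0 : -1;
}

/* Teardown: kill by pid, reap the child, then drop both pipe ends. */
static void pipe_demo_shutdown(struct pipe_demo_helper *h)
{
	kill(h->pid, SIGKILL);
	waitpid(h->pid, NULL, 0);
	close(h->to_helper);
	close(h->from_helper);
}
```

A caller would treat pipe_demo_spawn() == 0 as "the helper was started" and
pipe_demo_ping() == 0 as "the helper is healthy", which is roughly the split
between the helper's return value and the first request/reply that
bpfilter.ko performs in its __init.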
> As the first step in the development of the bpfilter project
> the fork_usermode_blob() helper is introduced to allow user mode code
> to be invoked from a kernel module. The idea is that user mode code plus
> normal kernel module code are built as part of the kernel build
> and installed as traditional kernel module into distro specified location,
> such that from a distribution point of view, there is
> no difference between regular kernel modules and kernel modules + umh code.
> Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> by a kernel module doesn't make it any special from kernel and user space
> tooling point of view.
>
> Such approach enables kernel to delegate functionality traditionally done
> by the kernel modules into the user space processes (either root or !root) and
> reduces security attack surface of the new code. The buggy umh code would crash
> the user process, but not the kernel. Another advantage is that umh code
> of the kernel module can be debugged and tested out of user space
> (e.g. opening the possibility to run clang sanitizers, fuzzers or
> user space test suites on the umh code).
> In case of the bpfilter project such architecture allows complex control plane
> to be done in the user space while bpf based data plane stays in the kernel.
>
> Since umh can crash, can be oom-ed by the kernel, killed by the admin,
> the kernel module that uses them (like bpfilter) needs to manage life
> time of umh on its own via two unix pipes and the pid of umh.
>
> The exit code of such kernel module should kill the umh it started,
> so that rmmod of the kernel module will cleanup the corresponding umh.
> Just like if the kernel module does kmalloc() it should kfree() it in the exit code.
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> fs/exec.c | 38 ++++++++---
> include/linux/binfmts.h | 1 +
> include/linux/umh.h | 12 ++++
> kernel/umh.c | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 215 insertions(+), 12 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 183059c427b9..30a36c2a39bf 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execveat_common(int fd, struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp,
> - int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags, struct file *file)
> {
> char *pathbuf = NULL;
> struct linux_binprm *bprm;
> - struct file *file;
> struct files_struct *displaced;
> int retval;
Keeping in mind a fuzzer...
Note, right below this, and not shown here in the hunk, is:
if (IS_ERR(filename))
return PTR_ERR(filename);
>
> @@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_execat(fd, filename, flags);
> + if (!file)
> + file = do_open_execat(fd, filename, flags);
Here we now seem to allow !file and open the file with the passed fd as in
the old days. This is an expected change.
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - if (fd == AT_FDCWD || filename->name[0] == '/') {
> + if (!filename) {
If anything shouldn't this be:
if (IS_ERR(filename))
But, wouldn't the above first branch in the routine catch this?
> + bprm->filename = "none";
Given this seems like a desirable branch which was tested, wonder how this
ever got set if the above branch in the first hunk I noted hit true?
In any case, we seem to have two cases, can we rule out the exact requirements
at the top so we can bail out with an error code if one or the other way to
call this function does not align with expectations?
> + } else if (fd == AT_FDCWD || filename->name[0] == '/') {
> bprm->filename = filename->name;
> } else {
> if (filename->name[0] == '\0')
> @@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> task_numa_free(current);
> free_bprm(bprm);
> kfree(pathbuf);
> - putname(filename);
> + if (filename)
> + putname(filename);
> if (displaced)
> put_files_struct(displaced);
> return retval;
> @@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
> if (displaced)
> reset_files_struct(displaced);
> out_ret:
> - putname(filename);
> + if (filename)
> + putname(filename);
> return retval;
> }
>
> +static int do_execveat_common(int fd, struct filename *filename,
Further signs the filename is now optional. But I don't understand how these
branches could ever be true, but perhaps I'm missing something?
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> +{
> + return __do_execve_file(fd, filename, argv, envp, flags, NULL);
> +}
> +
> +int do_execve_file(struct file *file, void *__argv, void *__envp)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> +}
Or maybe do the semantics expectations checks here, so we don't clutter
do_execveat_common() with them?
> +
> int do_execve(struct filename *filename,
> const char __user *const __user *__argv,
> const char __user *const __user *__envp)
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 4955e0863b83..c05f24fac4f6 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
> const char __user * const __user *,
> const char __user * const __user *,
> int);
> +int do_execve_file(struct file *file, void *__argv, void *__envp);
>
> #endif /* _LINUX_BINFMTS_H */
> diff --git a/include/linux/umh.h b/include/linux/umh.h
> index 244aff638220..5c812acbb80a 100644
> --- a/include/linux/umh.h
> +++ b/include/linux/umh.h
> @@ -22,8 +22,10 @@ struct subprocess_info {
> const char *path;
> char **argv;
> char **envp;
> + struct file *file;
> int wait;
> int retval;
> + pid_t pid;
> int (*init)(struct subprocess_info *info, struct cred *new);
> void (*cleanup)(struct subprocess_info *info);
> void *data;
While at it, can you kdocify struct subprocess_info and add new docs for at
least these two entries you are adding?
> @@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
> int (*init)(struct subprocess_info *info, struct cred *new),
> void (*cleanup)(struct subprocess_info *), void *data);
>
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> + int (*init)(struct subprocess_info *info, struct cred *new),
> + void (*cleanup)(struct subprocess_info *), void *data);
Likewise but in the umh.c file.
> +struct umh_info {
> + struct file *pipe_to_umh;
> + struct file *pipe_from_umh;
> + pid_t pid;
> +};
Likewise.
> +int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
Likewise but in the umh.c file.
> +
> extern int
> call_usermodehelper_exec(struct subprocess_info *info, int wait);
>
> diff --git a/kernel/umh.c b/kernel/umh.c
> index f76b3ff876cf..c3f418d7d51a 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -25,6 +25,8 @@
> #include <linux/ptrace.h>
> #include <linux/async.h>
> #include <linux/uaccess.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/pipe_fs_i.h>
>
> #include <trace/events/module.h>
>
> @@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
>
> commit_creds(new);
>
> - retval = do_execve(getname_kernel(sub_info->path),
> - (const char __user *const __user *)sub_info->argv,
> - (const char __user *const __user *)sub_info->envp);
> + if (sub_info->file)
> + retval = do_execve_file(sub_info->file,
> + sub_info->argv, sub_info->envp);
> + else
> + retval = do_execve(getname_kernel(sub_info->path),
> + (const char __user *const __user *)sub_info->argv,
> + (const char __user *const __user *)sub_info->envp);
> out:
> sub_info->retval = retval;
> /*
> @@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
> if (pid < 0) {
> sub_info->retval = pid;
> umh_complete(sub_info);
> + } else {
> + sub_info->pid = pid;
> }
> }
> }
> @@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
> }
> EXPORT_SYMBOL(call_usermodehelper_setup);
>
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> + int (*init)(struct subprocess_info *info, struct cred *new),
> + void (*cleanup)(struct subprocess_info *info), void *data)
Should be static, no other users outside of this file.
Please use umh_setup_file().
> +{
> + struct subprocess_info *sub_info;
Considering a possible fuzzer triggering random data we should probably
return NULL early and avoid the kzalloc if:
if (!file || !init || !cleanup)
return NULL;
Is data optional? The kdoc could clarify this.
> +
> + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> + if (!sub_info)
> + return NULL;
> +
> + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> + sub_info->path = "none";
> + sub_info->file = file;
> + sub_info->init = init;
> + sub_info->cleanup = cleanup;
> + sub_info->data = data;
> + return sub_info;
> +}
> +
> +static struct vfsmount *umh_fs;
> +
> +static int init_tmpfs(void)
Please use umh_init_tmpfs(). Also see init/main.c do_basic_setup() which calls
usermodehelper_enable() prior to do_initcalls(). Now, fortunately TMPFS is only
bool, saving us from some races and we do call tmpfs's init first shmem_init():
static void __init do_basic_setup(void)
{
cpuset_init_smp();
shmem_init();
driver_init();
init_irq_proc();
do_ctors();
usermodehelper_enable();
do_initcalls();
}
But it also means we're enabling your new call fork_usermode_blob() in
early init code even if we're not set up. Since this umh tmpfs vfsmount is
shared I'd say just call this init right before usermodehelper_enable()
in do_basic_setup().
> +{
> + struct file_system_type *type;
> +
> + if (umh_fs)
> + return 0;
> + type = get_fs_type("tmpfs");
> + if (!type)
> + return -ENODEV;
> + umh_fs = kern_mount(type);
> + if (IS_ERR(umh_fs)) {
> + int err = PTR_ERR(umh_fs);
> +
> + put_filesystem(type);
> + umh_fs = NULL;
> + return err;
> + }
> + return 0;
> +}
> +
> +static int alloc_tmpfs_file(size_t size, struct file **filp)
Please use umh_alloc_tmpfs_file()
> +{
> + struct file *file;
> + int err;
> +
> + err = init_tmpfs();
> + if (err)
> + return err;
> + file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> + *filp = file;
> + return 0;
> +}
> +
> +static int populate_file(struct file *file, const void *data, size_t size)
Please use umh_populate_file()
> +{
> + size_t offset = 0;
> + int err;
> +
> + do {
> + unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
> + struct page *page;
> + void *pgdata, *vaddr;
> +
> + err = pagecache_write_begin(file, file->f_mapping, offset, len,
> + 0, &page, &pgdata);
> + if (err < 0)
> + goto fail;
> +
> + vaddr = kmap(page);
> + memcpy(vaddr, data, len);
> + kunmap(page);
> +
> + err = pagecache_write_end(file, file->f_mapping, offset, len,
> + len, page, pgdata);
> + if (err < 0)
> + goto fail;
> +
> + size -= len;
> + data += len;
> + offset += len;
> + } while (size);
Character for character, this looks like a wonderful copy and paste from
i915_gem_object_create_from_data()'s own loop which does the same exact
thing. Perhaps it's time for a helper in mm/filemap.c with an export so
if a bug is fixed in one place it's fixed in both places.
> + return 0;
> +fail:
> + return err;
> +}
> +
> +static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
The function name umh_pipe_setup() is also used on fs/coredump.c, with the same
prototype, perhaps rename that before we take this on, even if both are static.
> +{
> + struct umh_info *umh_info = info->data;
> + struct file *from_umh[2];
> + struct file *to_umh[2];
> + int err;
> +
> + /* create pipe to send data to umh */
> + err = create_pipe_files(to_umh, 0);
> + if (err)
> + return err;
> + err = replace_fd(0, to_umh[0], 0);
> + fput(to_umh[0]);
> + if (err < 0) {
> + fput(to_umh[1]);
> + return err;
> + }
> +
> + /* create pipe to receive data from umh */
> + err = create_pipe_files(from_umh, 0);
> + if (err) {
> + fput(to_umh[1]);
> + replace_fd(0, NULL, 0);
> + return err;
> + }
> + err = replace_fd(1, from_umh[1], 0);
> + fput(from_umh[1]);
> + if (err < 0) {
> + fput(to_umh[1]);
> + replace_fd(0, NULL, 0);
> + fput(from_umh[0]);
> + return err;
> + }
> +
> + umh_info->pipe_to_umh = to_umh[1];
> + umh_info->pipe_from_umh = from_umh[0];
> + return 0;
> +}
> +
> +static void umh_save_pid(struct subprocess_info *info)
> +{
> + struct umh_info *umh_info = info->data;
> +
> + umh_info->pid = info->pid;
> +}
> +
> +int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
Please use umh_fork_blob()
> +{
> + struct subprocess_info *sub_info;
> + struct file *file = NULL;
> + int err;
> +
> + err = alloc_tmpfs_file(len, &file);
> + if (err)
> + return err;
> +
> + err = populate_file(file, data, len);
> + if (err)
> + goto out;
> +
> + err = -ENOMEM;
> + sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
> + umh_save_pid, info);
> + if (!sub_info)
> + goto out;
> +
> + err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
Alright, neat, so to be clear, we're just glad to try inception, we have no
clue or idea what the real return value would be, it's up to the caller to track
the progress somehow?
Can you add a kdoc entry for this and clarify requirements?
Also, can you extend lib/test_kmod.c with a test case for this with its own
demo and try to blow it up?
I hadn't tried suspend/resume during a kmod test, but since we're using a
kernel_thread() I wouldn't be surprised if we barf while stress testing the
module loader. It's surely a corner case, but better mention that now than cry
later if we get heavy umh modules and all of a sudden we start using this for
whatever reason close to suspend.
Luis
> +out:
> + fput(file);
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(fork_usermode_blob);
> +
> /**
> * call_usermodehelper_exec - start a usermode application
> * @sub_info: information about the subprocessa
> --
> 2.9.5
--
Do not panic
On Thu, May 03, 2018 at 03:23:55PM +0100, Edward Cree wrote:
> On 03/05/18 05:36, Alexei Starovoitov wrote:
> > bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
> > and user mode helper code that is embedded into bpfilter.ko
> >
> > The steps to build bpfilter.ko are the following:
> > - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
> > - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
> > is converted into bpfilter_umh.o object file
> > with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
> > Example:
> > $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
> > 0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
> > 0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
> > 0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
> > - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
> >
> > bpfilter_kern.c is a normal kernel module code that calls
> > the fork_usermode_blob() helper to execute part of its own data
> > as a user mode process.
> >
> > Notice that _binary_net_bpfilter_bpfilter_umh_start - end
> > is placed into .init.rodata section, so it's freed as soon as __init
> > function of bpfilter.ko is finished.
> > As part of __init the bpfilter.ko does first request/reply action
> > via two unix pipe provided by fork_usermode_blob() helper to
> > make sure that umh is healthy. If not it will kill it via pid.
> >
> > Later bpfilter_process_sockopt() will be called from bpfilter hooks
> > in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
> >
> > If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
> > kill umh as well.
> >
> > Signed-off-by: Alexei Starovoitov <[email protected]>
...
> > +static void stop_umh(void)
> > +{
> > + if (bpfilter_process_sockopt) {
> I worry about locking here. Is it possible for two calls to
> bpfilter_process_sockopt() to run in parallel, both fail, and thus both
> call stop_umh()? And if both end up calling shutdown_umh(), we double
> fput().
I thought iptables sockopt is serialized earlier. Nope.
We need to grab the mutex to access these pipes.
Will fix.
Thanks for spelling nits. Will fix as well.
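
To illustrate the double-fput() concern raised above, here is a minimal
userspace sketch. It assumes a pthread mutex stands in for the kernel mutex
and a counter stands in for the fput() calls on the pipes. The names
umh_state, umh_state_init() and stop_umh_once() are hypothetical and do not
come from the patch; the eventual fix may well look different.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state for illustration only; not taken from the patch. */
struct umh_state {
	pthread_mutex_t lock;
	bool running;		/* true while the helper and its pipes exist */
	int release_count;	/* stands in for fput() on the two pipes */
};

static void umh_state_init(struct umh_state *s)
{
	assert(s != NULL);
	pthread_mutex_init(&s->lock, NULL);
	s->running = true;
	s->release_count = 0;
}

/*
 * Safe to call from several failing callers at once: the check and the
 * release happen under the same lock, so only the first caller releases.
 * Returns true if this call performed the release.
 */
static bool stop_umh_once(struct umh_state *s)
{
	bool did_stop = false;

	assert(s != NULL);
	pthread_mutex_lock(&s->lock);
	if (s->running) {
		s->running = false;
		s->release_count++;	/* the single "fput()" */
		did_stop = true;
	}
	pthread_mutex_unlock(&s->lock);
	return did_stop;
}
```

The point of the sketch is only that the "is it still running" test and the
release must sit inside one critical section; testing outside the lock and
releasing inside it would leave the same race.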
On Fri, May 04, 2018 at 07:56:43PM +0000, Luis R. Rodriguez wrote:
> What a mighty short list of reviewers. Adding some more. My review below.
> I'd appreciate a Cc on future versions of these patches.
sure.
> On Wed, May 02, 2018 at 09:36:01PM -0700, Alexei Starovoitov wrote:
> > Introduce helper:
> > int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> > struct umh_info {
> > struct file *pipe_to_umh;
> > struct file *pipe_from_umh;
> > pid_t pid;
> > };
> >
> > that GPLed kernel modules (signed or unsigned) can use it to execute part
> > of its own data as swappable user mode process.
> >
> > The kernel will do:
> > - mount "tmpfs"
>
> Actually it's a *shared* vfsmount tmpfs for all umh blobs.
yep
> > - allocate a unique file in tmpfs
> > - populate that file with [data, data + len] bytes
> > - user-mode-helper code will do_execve that file and, before the process
> > starts, the kernel will create two unix pipes for bidirectional
> > communication between kernel module and umh
> > - close tmpfs file, effectively deleting it
> > - the fork_usermode_blob will return zero on success and populate
> > 'struct umh_info' with two unix pipes and the pid of the user process
>
> But since it's using UMH_WAIT_EXEC, all we can guarantee currently is the
> inception point was intended, well thought out, and will run, but the return
> value in no way reflects whether the execution succeeded. More below.
yep
> > As the first step in the development of the bpfilter project
> > the fork_usermode_blob() helper is introduced to allow user mode code
> > to be invoked from a kernel module. The idea is that user mode code plus
> > normal kernel module code are built as part of the kernel build
> > and installed as traditional kernel module into distro specified location,
> > such that from a distribution point of view, there is
> > no difference between regular kernel modules and kernel modules + umh code.
> > Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> > by a kernel module doesn't make it any special from kernel and user space
> > tooling point of view.
> >
> > Such approach enables kernel to delegate functionality traditionally done
> > by the kernel modules into the user space processes (either root or !root) and
> > reduces security attack surface of the new code. The buggy umh code would crash
> > the user process, but not the kernel. Another advantage is that umh code
> > of the kernel module can be debugged and tested out of user space
> > (e.g. opening the possibility to run clang sanitizers, fuzzers or
> > user space test suites on the umh code).
> > In case of the bpfilter project such architecture allows complex control plane
> > to be done in the user space while bpf based data plane stays in the kernel.
> >
> > Since umh can crash, can be oom-ed by the kernel, killed by the admin,
> > the kernel module that uses them (like bpfilter) needs to manage life
> > time of umh on its own via two unix pipes and the pid of umh.
> >
> > The exit code of such kernel module should kill the umh it started,
> > so that rmmod of the kernel module will cleanup the corresponding umh.
> > Just like if the kernel module does kmalloc() it should kfree() it in the exit code.
> >
> > Signed-off-by: Alexei Starovoitov <[email protected]>
> > ---
> > fs/exec.c | 38 ++++++++---
> > include/linux/binfmts.h | 1 +
> > include/linux/umh.h | 12 ++++
> > kernel/umh.c | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
> > 4 files changed, 215 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 183059c427b9..30a36c2a39bf 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
> > /*
> > * sys_execve() executes a new program.
> > */
> > -static int do_execveat_common(int fd, struct filename *filename,
> > - struct user_arg_ptr argv,
> > - struct user_arg_ptr envp,
> > - int flags)
> > +static int __do_execve_file(int fd, struct filename *filename,
> > + struct user_arg_ptr argv,
> > + struct user_arg_ptr envp,
> > + int flags, struct file *file)
> > {
> > char *pathbuf = NULL;
> > struct linux_binprm *bprm;
> > - struct file *file;
> > struct files_struct *displaced;
> > int retval;
>
> Keeping in mind a fuzzer...
>
> Note, right below this, and not shown here in the hunk, is:
>
> if (IS_ERR(filename))
> return PTR_ERR(filename);
> >
> > @@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> > check_unsafe_exec(bprm);
> > current->in_execve = 1;
> >
> > - file = do_open_execat(fd, filename, flags);
> > + if (!file)
> > + file = do_open_execat(fd, filename, flags);
>
>
> Here we now seem to allow !file and open the file with the passed fd as in
> the old days. This is an expected change.
>
> > retval = PTR_ERR(file);
> > if (IS_ERR(file))
> > goto out_unmark;
> > @@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> > sched_exec();
> >
> > bprm->file = file;
> > - if (fd == AT_FDCWD || filename->name[0] == '/') {
> > + if (!filename) {
>
> If anything shouldn't this be:
>
> if (IS_ERR(filename))
nope. it should be !filename as do_execve_file() passes NULL.
IS_ERR != IS_ERR_OR_NULL
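
To make the distinction concrete, here is a small userspace sketch of the
error-pointer convention. The macros are re-created here in simplified form,
modelled on include/linux/err.h, so that the example compiles outside the
kernel. pick_bprm_filename() and check_err_ptr_semantics() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace re-creation of the kernel's error pointers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool IS_ERR(const void *ptr)
{
	/* True only for the top MAX_ERRNO addresses, i.e. an encoded -errno. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical stand-in for the filename branch in __do_execve_file(). */
static const char *pick_bprm_filename(const char *filename)
{
	if (!filename)		/* do_execve_file() path: no name was passed */
		return "none";
	return filename;	/* regular execve path */
}

static void check_err_ptr_semantics(void)
{
	assert(!IS_ERR(NULL));		/* NULL is not an error pointer */
	assert(IS_ERR(ERR_PTR(-2)));	/* an encoded -ENOENT is */
	assert(IS_ERR_OR_NULL(NULL));	/* only this variant catches NULL */
}
```

So an early IS_ERR(filename) check lets a NULL filename through, which is why
the later !filename branch can be reached from do_execve_file().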
> But, wouldn't the above first branch in the routine catch this?
>
> > + bprm->filename = "none";
>
> Given this seems like a desirable branch which was tested, wonder how this
> ever got set if the above branch in the first hunk I noted hit true?
>
> In any case, we seem to have two cases, can we rule out the exact requirements
> at the top so we can bail out with an error code if one or the other way to
> call this function does not align with expectations?
I think you're misreading the code or I don't understand the concern at all.
> > + } else if (fd == AT_FDCWD || filename->name[0] == '/') {
> > bprm->filename = filename->name;
> > } else {
> > if (filename->name[0] == '\0')
> > @@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> > task_numa_free(current);
> > free_bprm(bprm);
> > kfree(pathbuf);
> > - putname(filename);
> > + if (filename)
> > + putname(filename);
> > if (displaced)
> > put_files_struct(displaced);
> > return retval;
> > @@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
> > if (displaced)
> > reset_files_struct(displaced);
> > out_ret:
> > - putname(filename);
> > + if (filename)
> > + putname(filename);
> > return retval;
> > }
> >
> > +static int do_execveat_common(int fd, struct filename *filename,
>
> Further signs the filename is now optional. But I don't understand how these
> branches could ever be true, but perhaps I'm missing something?
>
> > + struct user_arg_ptr argv,
> > + struct user_arg_ptr envp,
> > + int flags)
> > +{
> > + return __do_execve_file(fd, filename, argv, envp, flags, NULL);
> > +}
> > +
> > +int do_execve_file(struct file *file, void *__argv, void *__envp)
> > +{
> > + struct user_arg_ptr argv = { .ptr.native = __argv };
> > + struct user_arg_ptr envp = { .ptr.native = __envp };
> > +
> > + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> > +}
>
> Or maybe do the semantics expectations checks here, so we don't clutter
> do_execveat_common() with them?
specifically ?
> > +
> > int do_execve(struct filename *filename,
> > const char __user *const __user *__argv,
> > const char __user *const __user *__envp)
> > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > index 4955e0863b83..c05f24fac4f6 100644
> > --- a/include/linux/binfmts.h
> > +++ b/include/linux/binfmts.h
> > @@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
> > const char __user * const __user *,
> > const char __user * const __user *,
> > int);
> > +int do_execve_file(struct file *file, void *__argv, void *__envp);
> >
> > #endif /* _LINUX_BINFMTS_H */
> > diff --git a/include/linux/umh.h b/include/linux/umh.h
> > index 244aff638220..5c812acbb80a 100644
> > --- a/include/linux/umh.h
> > +++ b/include/linux/umh.h
> > @@ -22,8 +22,10 @@ struct subprocess_info {
> > const char *path;
> > char **argv;
> > char **envp;
> > + struct file *file;
> > int wait;
> > int retval;
> > + pid_t pid;
> > int (*init)(struct subprocess_info *info, struct cred *new);
> > void (*cleanup)(struct subprocess_info *info);
> > void *data;
>
> While at it, can you kdocify struct subprocess_info and add new docs for at
> least these two entries you are adding?
Sorry, 'while at it' doesn't sound like a good reason to
add kdoc now instead of later.
> > @@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
> > int (*init)(struct subprocess_info *info, struct cred *new),
> > void (*cleanup)(struct subprocess_info *), void *data);
> >
> > +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> > + int (*init)(struct subprocess_info *info, struct cred *new),
> > + void (*cleanup)(struct subprocess_info *), void *data);
>
> Likewise but in the umh.c file.
>
> > +struct umh_info {
> > + struct file *pipe_to_umh;
> > + struct file *pipe_from_umh;
> > + pid_t pid;
> > +};
>
> Likewise.
what 'likewise' ? The kdoc ?
>
> > +int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
>
> Likewise but in the umh.c file.
>
> > +
> > extern int
> > call_usermodehelper_exec(struct subprocess_info *info, int wait);
> >
> > diff --git a/kernel/umh.c b/kernel/umh.c
> > index f76b3ff876cf..c3f418d7d51a 100644
> > --- a/kernel/umh.c
> > +++ b/kernel/umh.c
> > @@ -25,6 +25,8 @@
> > #include <linux/ptrace.h>
> > #include <linux/async.h>
> > #include <linux/uaccess.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/pipe_fs_i.h>
> >
> > #include <trace/events/module.h>
> >
> > @@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
> >
> > commit_creds(new);
> >
> > - retval = do_execve(getname_kernel(sub_info->path),
> > - (const char __user *const __user *)sub_info->argv,
> > - (const char __user *const __user *)sub_info->envp);
> > + if (sub_info->file)
> > + retval = do_execve_file(sub_info->file,
> > + sub_info->argv, sub_info->envp);
> > + else
> > + retval = do_execve(getname_kernel(sub_info->path),
> > + (const char __user *const __user *)sub_info->argv,
> > + (const char __user *const __user *)sub_info->envp);
> > out:
> > sub_info->retval = retval;
> > /*
> > @@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
> > if (pid < 0) {
> > sub_info->retval = pid;
> > umh_complete(sub_info);
> > + } else {
> > + sub_info->pid = pid;
> > }
> > }
> > }
> > @@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
> > }
> > EXPORT_SYMBOL(call_usermodehelper_setup);
> >
> > +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> > + int (*init)(struct subprocess_info *info, struct cred *new),
> > + void (*cleanup)(struct subprocess_info *info), void *data)
>
> Should be static, no other users outside of this file.
good catch. will change to static.
> Please use umh_setup_file().
sorry. makes no sense.
There is call_usermodehelper_setup() right above it.
call_usermodehelper_setup_file() just follows the naming convention.
If you prefer shorter names, both have to be renamed in the separate patch series.
> > +{
> > + struct subprocess_info *sub_info;
>
> Considering a possible fuzzer triggering random data we should probably
> return NULL early and avoid the kzalloc if:
I'm missing the 'fuzzer' point here and earlier.
'fuzzer' cannot reach here. It's all internal api.
> if (!file || !init || !cleanup)
> return NULL;
sorry, nope. in kernel we don't do defensive programming like this.
> Is data optional? The kdoc could clarify this.
No. Should be obvious from this patch.
The only caller of call_usermodehelper_setup_file() is fork_usermode_blob()
and it passes 'struct umh_info *info'.
>
> > +
> > + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> > + if (!sub_info)
> > + return NULL;
> > +
> > + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> > + sub_info->path = "none";
> > + sub_info->file = file;
> > + sub_info->init = init;
> > + sub_info->cleanup = cleanup;
> > + sub_info->data = data;
> > + return sub_info;
> > +}
> > +
> > +static struct vfsmount *umh_fs;
> > +
> > +static int init_tmpfs(void)
>
> Please use umh_init_tmpfs().
ok
> Also see init/main.c do_basic_setup() which calls
> usermodehelper_enable() prior to do_initcalls(). Now, fortunately TMPFS is only
> bool, saving us from some races and we do call tmpfs's init first shmem_init():
>
> static void __init do_basic_setup(void)
> {
> cpuset_init_smp();
> shmem_init();
> driver_init();
> init_irq_proc();
> do_ctors();
> usermodehelper_enable();
> do_initcalls();
> }
>
> But it also means we're enabling your new call fork_usermode_blob() in
> early init code even if we're not set up. Since this umh tmpfs vfsmount is
> shared I'd say just call this init right before usermodehelper_enable()
> in do_basic_setup().
Not following.
Why should init_tmpfs() be called by an __init function?
Are you saying make 'static struct vfsmount *shm_mnt;'
global and use it here? so no init_tmpfs() necessary?
I think that can work, but feels that having two
tmpfs mounts (one for shmem and one for umh) is cleaner.
>
> > +{
> > + struct file_system_type *type;
> > +
> > + if (umh_fs)
> > + return 0;
> > + type = get_fs_type("tmpfs");
> > + if (!type)
> > + return -ENODEV;
> > + umh_fs = kern_mount(type);
> > + if (IS_ERR(umh_fs)) {
> > + int err = PTR_ERR(umh_fs);
> > +
> > + put_filesystem(type);
> > + umh_fs = NULL;
> > + return err;
> > + }
> > + return 0;
> > +}
> > +
> > +static int alloc_tmpfs_file(size_t size, struct file **filp)
>
> Please use umh_alloc_tmpfs_file()
ok
> > +{
> > + struct file *file;
> > + int err;
> > +
> > + err = init_tmpfs();
> > + if (err)
> > + return err;
> > + file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> > + if (IS_ERR(file))
> > + return PTR_ERR(file);
> > + *filp = file;
> > + return 0;
> > +}
> > +
> > +static int populate_file(struct file *file, const void *data, size_t size)
>
> Please use umh_populate_file()
ok
> > +{
> > + size_t offset = 0;
> > + int err;
> > +
> > + do {
> > + unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
> > + struct page *page;
> > + void *pgdata, *vaddr;
> > +
> > + err = pagecache_write_begin(file, file->f_mapping, offset, len,
> > + 0, &page, &pgdata);
> > + if (err < 0)
> > + goto fail;
> > +
> > + vaddr = kmap(page);
> > + memcpy(vaddr, data, len);
> > + kunmap(page);
> > +
> > + err = pagecache_write_end(file, file->f_mapping, offset, len,
> > + len, page, pgdata);
> > + if (err < 0)
> > + goto fail;
> > +
> > + size -= len;
> > + data += len;
> > + offset += len;
> > + } while (size);
>
> Character for character, this looks like a wonderful copy and paste from
> i915_gem_object_create_from_data()'s own loop which does the same exact
> thing. Perhaps it's time for a helper in mm/filemap.c with an export so
> if a bug is fixed in one place it's fixed in both places.
yes, of course, but not right now.
Once it all lands that will be the time to create common helper.
It's not a good idea to mess with i915 in one patch set.
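
For reference, a shared helper could have roughly the following shape. This
is a userspace sketch only. It assumes a callback stands in for the
pagecache_write_begin() / memcpy() / pagecache_write_end() step, and
CHUNK_SIZE stands in for PAGE_SIZE. write_in_chunks(), chunk_write_fn and
mem_sink are hypothetical names, not an existing mm/filemap.c API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4096	/* stands in for PAGE_SIZE */

/* Hypothetical callback: store `len` bytes from `buf` at `offset`. */
typedef int (*chunk_write_fn)(void *ctx, size_t offset,
			      const void *buf, size_t len);

/* Copy `size` bytes in chunks of at most CHUNK_SIZE, stopping on error. */
static int write_in_chunks(void *ctx, chunk_write_fn write_chunk,
			   const void *data, size_t size)
{
	const unsigned char *p = data;
	size_t offset = 0;

	assert(write_chunk != NULL);
	while (size) {
		size_t len = size < CHUNK_SIZE ? size : CHUNK_SIZE;
		int err = write_chunk(ctx, offset, p, len);

		if (err < 0)
			return err;
		size -= len;
		p += len;
		offset += len;
	}
	return 0;
}

/* Example sink: a plain memory buffer that counts how often it is called. */
struct mem_sink {
	unsigned char *buf;
	size_t cap;
	unsigned int calls;
};

static int mem_sink_write(void *ctx, size_t offset,
			  const void *buf, size_t len)
{
	struct mem_sink *s = ctx;

	if (offset + len > s->cap)
		return -1;
	memcpy(s->buf + offset, buf, len);
	s->calls++;
	return 0;
}
```

Note one deliberate difference from the quoted loop: the sketch uses
while (size), so a zero-length blob performs no write at all, whereas the
do/while in the patch always runs the body once.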
> > + return 0;
> > +fail:
> > + return err;
> > +}
> > +
> > +static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>
> The function name umh_pipe_setup() is also used on fs/coredump.c, with the same
> prototype, perhaps rename that before we take this on, even if both are static.
hmm. why?
These are two static functions with the same name, so?
tags get confusing?
> > +{
> > + struct umh_info *umh_info = info->data;
> > + struct file *from_umh[2];
> > + struct file *to_umh[2];
> > + int err;
> > +
> > + /* create pipe to send data to umh */
> > + err = create_pipe_files(to_umh, 0);
> > + if (err)
> > + return err;
> > + err = replace_fd(0, to_umh[0], 0);
> > + fput(to_umh[0]);
> > + if (err < 0) {
> > + fput(to_umh[1]);
> > + return err;
> > + }
> > +
> > + /* create pipe to receive data from umh */
> > + err = create_pipe_files(from_umh, 0);
> > + if (err) {
> > + fput(to_umh[1]);
> > + replace_fd(0, NULL, 0);
> > + return err;
> > + }
> > + err = replace_fd(1, from_umh[1], 0);
> > + fput(from_umh[1]);
> > + if (err < 0) {
> > + fput(to_umh[1]);
> > + replace_fd(0, NULL, 0);
> > + fput(from_umh[0]);
> > + return err;
> > + }
> > +
> > + umh_info->pipe_to_umh = to_umh[1];
> > + umh_info->pipe_from_umh = from_umh[0];
> > + return 0;
> > +}
> > +
> > +static void umh_save_pid(struct subprocess_info *info)
> > +{
> > + struct umh_info *umh_info = info->data;
> > +
> > + umh_info->pid = info->pid;
> > +}
> > +
> > +int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
>
> Please use umh_fork_blob()
sorry, no. fork_usermode_blob() is much more descriptive name.
> > +{
> > + struct subprocess_info *sub_info;
> > + struct file *file = NULL;
> > + int err;
> > +
> > + err = alloc_tmpfs_file(len, &file);
> > + if (err)
> > + return err;
> > +
> > + err = populate_file(file, data, len);
> > + if (err)
> > + goto out;
> > +
> > + err = -ENOMEM;
> > + sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
> > + umh_save_pid, info);
> > + if (!sub_info)
> > + goto out;
> > +
> > + err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
>
> Alright, neat, so to be clear, we're just glad to try inception, we have no
> clue or idea what the real return value would be, its up to the caller to track
> the progress somehow?
yep.
> Can you add a kdoc entry for this and clarify requirements?
ok. I'll add a comment to this helper.
> Also, can you extend lib/test_kmod.c with a test case for this with its own
> demo and try to blow it up?
in what sense? bpfilter is the test and the driving component for it.
I'm expecting that folks who want to use this helper to do usb drivers
in user space may want to extend this helper further, but that's their job.
> I hadn't tried suspend/resume during a kmod test, but since we're using a
> kernel_thread() I wouldn't be surprised if we barf while stress testing the
> module loader. Its surely a corner case, but better mention that now than cry
> later if we get heavy umh modules and all of a sudden we start using this for
> whatever reason close to suspend.
folks that care about suspend/resume should do that.
I'm happy to gate this helper for !CONFIG_SUSPEND, since I have
no idea what issues can be uncovered, how to fix them and no desire to do so.
Thanks
On Thu, May 3, 2018 at 12:36 AM, Alexei Starovoitov <[email protected]> wrote:
> Introduce helper:
> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> struct umh_info {
> struct file *pipe_to_umh;
> struct file *pipe_from_umh;
> pid_t pid;
> };
>
> that GPLed kernel modules (signed or unsigned) can use it to execute part
> of its own data as swappable user mode process.
>
> The kernel will do:
> - mount "tmpfs"
> - allocate a unique file in tmpfs
> - populate that file with [data, data + len] bytes
> - user-mode-helper code will do_execve that file and, before the process
> starts, the kernel will create two unix pipes for bidirectional
> communication between kernel module and umh
> - close tmpfs file, effectively deleting it
> - the fork_usermode_blob will return zero on success and populate
> 'struct umh_info' with two unix pipes and the pid of the user process
>
> As the first step in the development of the bpfilter project
> the fork_usermode_blob() helper is introduced to allow user mode code
> to be invoked from a kernel module. The idea is that user mode code plus
> normal kernel module code are built as part of the kernel build
> and installed as traditional kernel module into distro specified location,
> such that from a distribution point of view, there is
> no difference between regular kernel modules and kernel modules + umh code.
> Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> by a kernel module doesn't make it any special from kernel and user space
> tooling point of view.
[...]
> +static struct vfsmount *umh_fs;
> +
> +static int init_tmpfs(void)
> +{
> + struct file_system_type *type;
> +
> + if (umh_fs)
> + return 0;
> + type = get_fs_type("tmpfs");
> + if (!type)
> + return -ENODEV;
> + umh_fs = kern_mount(type);
> + if (IS_ERR(umh_fs)) {
> + int err = PTR_ERR(umh_fs);
> +
> + put_filesystem(type);
> + umh_fs = NULL;
> + return err;
> + }
> + return 0;
> +}
Should init_tmpfs() be holding some sort of mutex if it's fiddling
with `umh_fs`? The current code only calls it in initcall context, but
if that ever changes and two processes try to initialize the tmpfs at
the same time, a few things could go wrong.
I guess Luis' suggestion (putting a call to init_tmpfs() in
do_basic_setup()) might be the easiest way to get rid of that problem.
> +static int alloc_tmpfs_file(size_t size, struct file **filp)
> +{
> + struct file *file;
> + int err;
> +
> + err = init_tmpfs();
> + if (err)
> + return err;
> + file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> + *filp = file;
> + return 0;
> +}
On Sat, May 05, 2018 at 12:48:24AM -0400, Jann Horn wrote:
> On Thu, May 3, 2018 at 12:36 AM, Alexei Starovoitov <[email protected]> wrote:
> > Introduce helper:
> > int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> > struct umh_info {
> > struct file *pipe_to_umh;
> > struct file *pipe_from_umh;
> > pid_t pid;
> > };
> >
> > that GPLed kernel modules (signed or unsigned) can use it to execute part
> > of its own data as swappable user mode process.
> >
> > The kernel will do:
> > - mount "tmpfs"
> > - allocate a unique file in tmpfs
> > - populate that file with [data, data + len] bytes
> > - user-mode-helper code will do_execve that file and, before the process
> > starts, the kernel will create two unix pipes for bidirectional
> > communication between kernel module and umh
> > - close tmpfs file, effectively deleting it
> > - the fork_usermode_blob will return zero on success and populate
> > 'struct umh_info' with two unix pipes and the pid of the user process
> >
> > As the first step in the development of the bpfilter project
> > the fork_usermode_blob() helper is introduced to allow user mode code
> > to be invoked from a kernel module. The idea is that user mode code plus
> > normal kernel module code are built as part of the kernel build
> > and installed as traditional kernel module into distro specified location,
> > such that from a distribution point of view, there is
> > no difference between regular kernel modules and kernel modules + umh code.
> > Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> > by a kernel module doesn't make it any special from kernel and user space
> > tooling point of view.
> [...]
> > +static struct vfsmount *umh_fs;
> > +
> > +static int init_tmpfs(void)
> > +{
> > + struct file_system_type *type;
> > +
> > + if (umh_fs)
> > + return 0;
> > + type = get_fs_type("tmpfs");
> > + if (!type)
> > + return -ENODEV;
> > + umh_fs = kern_mount(type);
> > + if (IS_ERR(umh_fs)) {
> > + int err = PTR_ERR(umh_fs);
> > +
> > + put_filesystem(type);
> > + umh_fs = NULL;
> > + return err;
> > + }
> > + return 0;
> > +}
>
> Should init_tmpfs() be holding some sort of mutex if it's fiddling
> with `umh_fs`? The current code only calls it in initcall context, but
> if that ever changes and two processes try to initialize the tmpfs at
> the same time, a few things could go wrong.
I thought that module loading is serialized, so calls to
fork_usermode_blob() will be serialized as well, but looking at the code
again that doesn't seem to be the case, so need to revisit not only
this function, but the rest of it too.
> I guess Luis' suggestion (putting a call to init_tmpfs() in
> do_basic_setup()) might be the easiest way to get rid of that problem.
I still think that two mounts where umh mount is dynamic is cleaner.
Why waste the mount if no module uses this helper?
I'm thinking to wrap init_tmpfs into DO_ONCE instead or use a mutex.
Looks like shmem_file_setup_with_mnt() can be called in parallel
on the same mount, so that should be fine.
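To illustrate the mutex variant, here is a rough userspace sketch, not the actual patch. The names fake_mount, fake_kern_mount() and init_tmpfs_demo() are invented for the example, and a pthread mutex stands in for what would be DEFINE_MUTEX()/mutex_lock() in kernel/umh.c. The point is only that the NULL check and the assignment sit under the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfsmount; not a kernel type. */
struct fake_mount {
	int id;
};

static struct fake_mount the_mount;
static struct fake_mount *umh_fs_demo;	/* NULL until first use */
static int mount_calls;			/* how often the "mount" really ran */
static pthread_mutex_t umh_fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pretend to mount tmpfs; counts invocations so callers can check them. */
static struct fake_mount *fake_kern_mount(void)
{
	mount_calls++;
	the_mount.id = 1;
	return &the_mount;
}

/*
 * Lazy init: the check and the assignment happen under one lock, so
 * two racing callers cannot both perform the mount.
 */
static int init_tmpfs_demo(void)
{
	int err = 0;

	pthread_mutex_lock(&umh_fs_lock);
	if (!umh_fs_demo) {
		umh_fs_demo = fake_kern_mount();
		if (!umh_fs_demo)
			err = -1;
		/* invariant of this sketch: the mount runs at most once */
		assert(mount_calls == 1);
	}
	pthread_mutex_unlock(&umh_fs_lock);
	return err;
}
```

The mount stays dynamic, so nothing is mounted if no module ever calls the helper, and later callers only pay for one uncontended lock/unlock pair.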
Hi Alexei + netdev list,
On Wed, May 02, 2018 at 09:36:02PM -0700, Alexei Starovoitov wrote:
> Later bpfilter_process_sockopt() will be called from bpfilter hooks
> in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
This is a part I'm quite heavily opposed to - at least at this point.
Unless bpfilter offered something that is semantically compatible to
what netfilter/iptables is currently implementing, I don't think
bpfilter should be [allowed to] overriding the iptables
{get,set}sockopt() calls.
I appreciate that people are working on a packet filter with a different
architecture than what we are used to. I also understand that there is a need
for backwards compatibility. I still think it's wrong to offer that
compatibility on the {set,get}sockopt level, rather than on the
"iptables command line utility replacement" level. But nevermind, you
guys have a different opinion on that, on which we can agree to
disagree.
However, no matter what you do, the most important part from the user
point of view is to make sure you don't break semantics.
netfilter/iptables semantics have an intricate notion about when which
chain of which table is executed, in which order, at what particular
point of the packet traversal during the network stack. The packet
filtering rulesets that people have created over more than 18 years
are based on those semantics. If you offer the same interface, but not
that very same semantics, the packet filtering policies can and will
break - and they will break so in a hidden way. To the user, it appears
as if the ruleset is loaded with the assumed semantics, but in reality
it isn't.
Unless you can replicate those semantics 1:1, I think it is not only
wrong to override the iptables sockopt interface, but it's outright
dangerous.
Having less matches/targets implemented than original iptables is
something that I believe is acceptable (and inevitable, at least in the
beginning). If somebody tries to load a related ruleset with bpfilter
active, it will fail gracefully and the user can choose to not use that
match/target in his ruleset, or to not use bpfilter.
But if the ruleset loads but behaves different than before (because e.g.
it's executed from a completely different place in the stack), that's
IMHO an absolute no-go that must be avoided at all cost. If that's the
case, you are actively breaking network security, rather than creating
it.
So I think there's only two ways to go:
a) replicate the exact semantics/order of the filter/mangle/raw/...
tables and their chains, both among themselves as well as in terms of
ordering with other parts of the network stack, or
b) not use the existing tables/chains with their pre-defined semantics
but rather start new 'tables' which can then have different semantics
as defined at the time of their implementation.
My apologies if I misunderstood something about bpfilter. Feel free to
correct me where I'm wrong. Thanks.
Regards,
Harald
--
- Harald Welte <[email protected]> http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
(ETSI EN 300 175-7 Ch. A6)
From: Harald Welte <[email protected]>
Date: Mon, 7 May 2018 17:24:35 +0200
> But if the ruleset loads but behaves different than before (because e.g.
> it's executed from a completely different place in the stack), that's
> IMHO an absolute no-go that must be avoided at all cost.
That's not what we are doing nor proposing. I'm sorry if you are
confused on this matter.
The base implementation we strive for will execute the BPF programs
from the existing netfilter hook points.
However, if semantically the effect is equal if we execute the BPF
program from XDP, we will allow that to happen as an optimization.
The BPF execution is where it is in these patches for the purposes of
bootstrapping the bpfilter project and easy testing/benchmarking/hacking.
I hope this clears up your confusion.
If you would like to become involved in hacking on bpfilter to help us
ensure more accurate compatibility between existing iptables and what
bpfilter will execute for the same rule sets, we very much look
forward to your contributions and expertise.
Thank you.
On Fri, May 04, 2018 at 06:37:11PM -0700, Alexei Starovoitov wrote:
> On Fri, May 04, 2018 at 07:56:43PM +0000, Luis R. Rodriguez wrote:
> > What a mighty short list of reviewers. Adding some more. My review below.
> > I'd appreciate a Cc on future versions of these patches.
>
> sure.
>
> > On Wed, May 02, 2018 at 09:36:01PM -0700, Alexei Starovoitov wrote:
> > > Introduce helper:
> > > int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> > > struct umh_info {
> > > struct file *pipe_to_umh;
> > > struct file *pipe_from_umh;
> > > pid_t pid;
> > > };
> > >
> > > that GPLed kernel modules (signed or unsigned) can use it to execute part
> > > of its own data as swappable user mode process.
> > >
> > > The kernel will do:
> > > - mount "tmpfs"
> >
> > Actually its a *shared* vfsmount tmpfs for all umh blobs.
>
> yep
OK just note CONFIG_TMPFS can be disabled, and likewise for CONFIG_SHMEM,
in which case tmpfs and shmem are replaced by a simple ramfs code, more
appropriate for systems without swap.
> > > +static struct vfsmount *umh_fs;
> > > +
> > > +static int init_tmpfs(void)
> >
> > Please use umh_init_tmpfs().
>
> ok
>
> > Also see init/main.c do_basic_setup() which calls
> > usermodehelper_enable() prior to do_initcalls(). Now, fortunately TMPFS is only
> > bool, saving us from some races and we do call tmpfs's init first shmem_init():
> >
> > static void __init do_basic_setup(void)
> > {
> > cpuset_init_smp();
> > shmem_init();
> > driver_init();
> > init_irq_proc();
> > do_ctors();
> > usermodehelper_enable();
> > do_initcalls();
> > }
> >
> > But it also means we're enabling your new call call fork_usermode_blob() on
> > early init code even if we're not setup. Since this umh tmpfs vfsmount is
> > shared I'd say just call this init right before usermodehelper_enable()
> > on do_basic_setup().
>
> Not following.
> Why init_tmpfs() should be called by __init function?
Nope, not at all, I was suggesting:
diff --git a/init/main.c b/init/main.c
index 0697284a28ee..67a48fbd96ca 100644
--- a/init/main.c
+++ b/init/main.c
@@ -973,6 +973,7 @@ static void __init do_basic_setup(void)
driver_init();
init_irq_proc();
do_ctors();
+ umh_init_tmpfs();
usermodehelper_enable();
do_initcalls();
}
Mainly to avoid the locking situation Jann Horn noted, and also provide
proper kernel ordering expectations.
> Are you saying make 'static struct vfsmount *shm_mnt;'
> global and use it here? so no init_tmpfs() necessary?
> I think that can work, but feels that having two
> tmpfs mounts (one for shmem and one for umh) is cleaner.
No, but now that you mention it... if a shared vfsmount is not used the
/sys/kernel/mm/transparent_hugepage/shmem_enabled knob for using huge pages
would not be followed for umh modules. For the i915 driver this was *why*
they ended up adding shmem_file_setup_with_mnt(), they wanted huge pages to
support huge-gtt-pages. What is the rationale behind umh.c using it for
umh modules?
Users of shmem_kernel_file_setup() spawned later out of the desire to
*avoid* LSMs since it didn't make sense in their case as their inodes
are never exposed to userspace. Such is the case for ipc/shm.c and
security/keys/big_key.c. Refer to commit c7277090927a5 ("security: shmem:
implement kernel private shmem inodes") and then commit e1832f2923ec9
("ipc: use private shmem or hugetlbfs inodes for shm segments").
In this new umh usermode modules case we are:
a) actually mapping data already extracted by the kernel somehow from
a file, presumably from a /lib/modules/ path somewhere, but
again this is not visible to umh.c, as it just gets called with:
fork_usermode_blob(void *data, size_t len, struct umh_info *info)
b) Creating the respective tmpfs file with shmem_file_setup_with_mnt()
with our own shared struct vfsmount *umh_fs.
c) Populating the file created and stuffing it with our data passed
d) Calling do_execve_file() on it.
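Steps b) and c) have a rough userspace analogue, which may help when reasoning about who can see the file. This is only a sketch under assumptions: blob_to_anon_file() is a made-up name, memfd_create(2) (shmem-backed, Linux-specific) stands in for shmem_file_setup_with_mnt(), and nothing here goes through the kernel-internal paths or LSM hooks discussed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an anonymous, memory-backed file and fill it with the blob.
 * Returns an open fd positioned at offset 0, or -1 on error. The file
 * has no name in any mounted filesystem and goes away once the last
 * reference to it is dropped.
 */
static int blob_to_anon_file(const void *data, size_t len)
{
	const unsigned char *p = data;
	int fd;

	assert(data != NULL || len == 0);

	fd = memfd_create("umh-demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	while (len) {
		ssize_t n = write(fd, p, len);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	if (lseek(fd, 0, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

In userspace the counterpart of step d) would be fexecve(3) on the returned fd. The sketch stops before that so it stays easy to check.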
It's not clear to me why you used shmem_file_setup_with_mnt() in this case. What
are the gains? It would make sense to use shmem_kernel_file_setup() to avoid an
LSM check on step b) *but* only if we already had a proper LSM check on step
a).
I checked how you use fork_usermode_blob() in a) and in this case the kernel
module bpfilter would be loaded first, and for that we already have an LSM
check / hook for that module. From my review then, the magic done on your
second patch to stuff the userspace application into the module should be
irrelevant to us from an LSM perspective.
So, I can see a reason to use shmem_kernel_file_setup() but I cannot
see a reason to be using shmem_file_setup_with_mnt() at the moment.
I Cc'd tmpfs and LSM folks for further feedback.
> > > +{
> > > + size_t offset = 0;
> > > + int err;
> > > +
> > > + do {
> > > + unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
> > > + struct page *page;
> > > + void *pgdata, *vaddr;
> > > +
> > > + err = pagecache_write_begin(file, file->f_mapping, offset, len,
> > > + 0, &page, &pgdata);
> > > + if (err < 0)
> > > + goto fail;
> > > +
> > > + vaddr = kmap(page);
> > > + memcpy(vaddr, data, len);
> > > + kunmap(page);
> > > +
> > > + err = pagecache_write_end(file, file->f_mapping, offset, len,
> > > + len, page, pgdata);
> > > + if (err < 0)
> > > + goto fail;
> > > +
> > > + size -= len;
> > > + data += len;
> > > + offset += len;
> > > + } while (size);
> >
> > Character for character, this looks like a wonderful copy and paste from
> > i915_gem_object_create_from_data()'s own loop which does the same exact
> > thing. Perhaps its time for a helper on mm/filemap.c with an export so
> > if a bug is fixed in one place its fixed in both places.
>
> yes, of course, but not right now.
> Once it all lands that will be the time to create common helper.
> It's not a good idea to mess with i915 in one patch set.
Either way works with me, so long as it's done.
> > > +int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
> >
> > Please use umh_fork_blob()
>
> sorry, no. fork_usermode_blob() is much more descriptive name.
Prefixing new umh.c symbols with umh_*() makes it very clear this came from
kernel/umh.c functionality. I've been dealing with other places in the kernel
that have conflated their own use of kernel/umh.c functionality with what they
expose in both code and documentation, and correcting this has taken a long
time. Best avoid future confusion and be consistent with new exported symbols
for umh.c functionality.
Also, descriptive as fork_usermode_blob() may seem there is a possible future
clash with a more generic call.
> > Also, can you extend lib/test_kmod.c with a test case for this with its own
> > demo and try to blow it up?
>
> in what sense? bpfilter is the test and the driving component for it.
That's the thing, it shouldn't be.
We are adding *new* functionality here, I don't want to require enabling
bpfilter or its dependencies to test generic umh user module loading
functionality. For instance, we have lib/test_module.c to help test generic
module loading, regardless of the functionality or requirements for other
modules.
> I'm expecting that folks who want to use this helper to do usb drivers
> in user space may want to extend this helper further, but that's their job.
I don't even want to test USB, I am just interested in the *very* *very*
basic aspects of it. A simple lib/test_umh_module.c would do with a respective
check that it's loaded, given this is a requirement from the API. It also helps
folks understand the core basic knobs without having to look at bpfilter code.
If we're going to get this merged I'd be interested in doing ongoing testing
with 0day with simple UMH module with and without CONFIG_SHMEM for instance
and check that it works in both cases.
Testing this may seem irrelevant to you but keep in mind we're also already
testing just general kernel module loading. As silly as it may seem, adding new
functionality and a respective test case lets us try to avoid regressions, and
provide small unit tests to help reproduce issues and corner case situations
you may not be considering.
Luis
On Wed, May 02, 2018 at 09:36:02PM -0700, Alexei Starovoitov wrote:
> bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
> and user mode helper code that is embedded into bpfilter.ko
>
> The steps to build bpfilter.ko are the following:
> - main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
> - with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
> is converted into bpfilter_umh.o object file
> with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
> Example:
> $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
> 0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
> 0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
> 0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
> - bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
>
> bpfilter_kern.c is a normal kernel module code that calls
> the fork_usermode_blob() helper to execute part of its own data
> as a user mode process.
>
> Notice that _binary_net_bpfilter_bpfilter_umh_start - end
> is placed into .init.rodata section, so it's freed as soon as __init
> function of bpfilter.ko is finished.
> As part of __init the bpfilter.ko does first request/reply action
> via two unix pipe provided by fork_usermode_blob() helper to
> make sure that umh is healthy. If not it will kill it via pid.
It does this very fast, right away. On a really slow system how are you sure
that this won't race and the execution of the check happens early on prior to
letting the actual setup trigger? After all, we're calling the userspace
process in async mode. We could preempt it now.
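One way to reason about the ordering, assuming both pipe ends are blocking, is a plain userspace model of the handshake. The sketch below is not the kernel path: ping_slow_helper() and the demo_* structs are invented here (the structs copy the field layout of msgfmt.h), fork() stands in for the umh launch, and the helper is delayed on purpose to mimic being preempted before it reads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Same field layout as the mbox structs in msgfmt.h, using stdint types. */
struct demo_request {
	uint64_t addr;
	uint32_t len;
	uint32_t is_set;
	uint32_t cmd;
	uint32_t pid;
};

struct demo_reply {
	uint32_t status;
};

static void close_pair(int fds[2])
{
	close(fds[0]);
	close(fds[1]);
}

/*
 * Fork a helper that sleeps delay_ms before it starts reading, send it
 * one request and block until the reply arrives. Returns the reply
 * status, or -1 on any error.
 */
static int ping_slow_helper(long delay_ms)
{
	struct demo_request req = { .cmd = 0 };
	struct demo_reply reply = { .status = 1 };	/* helper overwrites this */
	int to_helper[2], from_helper[2];
	pid_t pid;
	int ok;

	assert(delay_ms >= 0);

	if (pipe(to_helper) < 0)
		return -1;
	if (pipe(from_helper) < 0) {
		close_pair(to_helper);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close_pair(to_helper);
		close_pair(from_helper);
		return -1;
	}
	if (pid == 0) {
		struct timespec ts = {
			.tv_sec = delay_ms / 1000,
			.tv_nsec = (delay_ms % 1000) * 1000000L,
		};

		close(to_helper[1]);
		close(from_helper[0]);
		nanosleep(&ts, NULL);	/* simulate a helper that is slow to start */
		if (read(to_helper[0], &req, sizeof(req)) != (ssize_t)sizeof(req))
			_exit(1);
		reply.status = 0;
		if (write(from_helper[1], &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
			_exit(1);
		_exit(0);
	}

	close(to_helper[0]);
	close(from_helper[1]);

	/* The request waits in the pipe buffer until the helper gets to it. */
	ok = write(to_helper[1], &req, sizeof(req)) == (ssize_t)sizeof(req) &&
	     read(from_helper[0], &reply, sizeof(reply)) == (ssize_t)sizeof(reply);

	close(to_helper[1]);
	close(from_helper[0]);
	waitpid(pid, NULL, 0);
	return ok ? (int)reply.status : -1;
}
```

In this model the delay only makes the caller wait longer. The request is buffered in the pipe and the blocking read returns once the helper answers, or returns short if the helper exits. Whether the in-kernel __kernel_write()/kernel_read() pair behaves the same way is exactly what the question above is asking.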
> Later bpfilter_process_sockopt() will be called from bpfilter hooks
> in get/setsockopt() to pass iptable commands into umh via bpfilter.ko
>
> If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
> kill umh as well.
>
> Signed-off-by: Alexei Starovoitov <[email protected]>
> ---
> include/linux/bpfilter.h | 15 +++++++
> include/uapi/linux/bpfilter.h | 21 ++++++++++
> net/Kconfig | 2 +
> net/Makefile | 1 +
> net/bpfilter/Kconfig | 17 ++++++++
> net/bpfilter/Makefile | 24 +++++++++++
> net/bpfilter/bpfilter_kern.c | 93 +++++++++++++++++++++++++++++++++++++++++++
> net/bpfilter/main.c | 63 +++++++++++++++++++++++++++++
> net/bpfilter/msgfmt.h | 17 ++++++++
> net/ipv4/Makefile | 2 +
> net/ipv4/bpfilter/Makefile | 2 +
> net/ipv4/bpfilter/sockopt.c | 42 +++++++++++++++++++
> net/ipv4/ip_sockglue.c | 17 ++++++++
> 13 files changed, 316 insertions(+)
> create mode 100644 include/linux/bpfilter.h
> create mode 100644 include/uapi/linux/bpfilter.h
> create mode 100644 net/bpfilter/Kconfig
> create mode 100644 net/bpfilter/Makefile
> create mode 100644 net/bpfilter/bpfilter_kern.c
> create mode 100644 net/bpfilter/main.c
> create mode 100644 net/bpfilter/msgfmt.h
> create mode 100644 net/ipv4/bpfilter/Makefile
> create mode 100644 net/ipv4/bpfilter/sockopt.c
>
> diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
> new file mode 100644
> index 000000000000..687b1760bb9f
> --- /dev/null
> +++ b/include/linux/bpfilter.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_BPFILTER_H
> +#define _LINUX_BPFILTER_H
> +
> +#include <uapi/linux/bpfilter.h>
> +
> +struct sock;
> +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
> + unsigned int optlen);
> +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
> + int *optlen);
> +extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
> + char __user *optval,
> + unsigned int optlen, bool is_set);
> +#endif
> diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
> new file mode 100644
> index 000000000000..2ec3cc99ea4c
> --- /dev/null
> +++ b/include/uapi/linux/bpfilter.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _UAPI_LINUX_BPFILTER_H
> +#define _UAPI_LINUX_BPFILTER_H
> +
> +#include <linux/if.h>
> +
> +enum {
> + BPFILTER_IPT_SO_SET_REPLACE = 64,
> + BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
> + BPFILTER_IPT_SET_MAX,
> +};
> +
> +enum {
> + BPFILTER_IPT_SO_GET_INFO = 64,
> + BPFILTER_IPT_SO_GET_ENTRIES = 65,
> + BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
> + BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
> + BPFILTER_IPT_GET_MAX,
> +};
> +
> +#endif /* _UAPI_LINUX_BPFILTER_H */
> diff --git a/net/Kconfig b/net/Kconfig
> index b62089fb1332..ed6368b306fa 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
>
> endif
>
> +source "net/bpfilter/Kconfig"
> +
> source "net/dccp/Kconfig"
> source "net/sctp/Kconfig"
> source "net/rds/Kconfig"
> diff --git a/net/Makefile b/net/Makefile
> index a6147c61b174..7f982b7682bd 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_TLS) += tls/
> obj-$(CONFIG_XFRM) += xfrm/
> obj-$(CONFIG_UNIX) += unix/
> obj-$(CONFIG_NET) += ipv6/
> +obj-$(CONFIG_BPFILTER) += bpfilter/
> obj-$(CONFIG_PACKET) += packet/
> obj-$(CONFIG_NET_KEY) += key/
> obj-$(CONFIG_BRIDGE) += bridge/
> diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
> new file mode 100644
> index 000000000000..782a732b9a5c
> --- /dev/null
> +++ b/net/bpfilter/Kconfig
> @@ -0,0 +1,17 @@
> +menuconfig BPFILTER
> + bool "BPF based packet filtering framework (BPFILTER)"
> + default n
> + depends on NET && BPF
> + help
> + This builds experimental bpfilter framework that is aiming to
> + provide netfilter compatible functionality via BPF
> +
> +if BPFILTER
> +config BPFILTER_UMH
> + tristate "bpfilter kernel module with user mode helper"
> + default m
> + depends on m
> + help
> + This builds bpfilter kernel module with embedded user mode helper
> +endif
> +
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> new file mode 100644
> index 000000000000..897eedae523e
> --- /dev/null
> +++ b/net/bpfilter/Makefile
> @@ -0,0 +1,24 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for the Linux BPFILTER layer.
> +#
> +
> +hostprogs-y := bpfilter_umh
> +bpfilter_umh-objs := main.o
> +HOSTCFLAGS += -I. -Itools/include/
> +
> +# a bit of elf magic to convert bpfilter_umh binary into a binary blob
> +# inside bpfilter_umh.o elf file referenced by
> +# _binary_net_bpfilter_bpfilter_umh_start symbol
> +# which bpfilter_kern.c passes further into umh blob loader at run-time
> +quiet_cmd_copy_umh = GEN $@
> + cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> + $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
> + -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> + --rename-section .data=.init.rodata $< $@
Cool, but so our expectation is that the compiler sets this symbol, how
are we sure it will always be set?
> +
> +$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
> + $(call cmd,copy_umh)
> +
> +obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
> +bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> new file mode 100644
> index 000000000000..e0a6fdd5842b
> --- /dev/null
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/umh.h>
> +#include <linux/bpfilter.h>
> +#include <linux/sched.h>
> +#include <linux/sched/signal.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include "msgfmt.h"
> +
> +#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
> +#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
> +
> +extern char UMH_start;
> +extern char UMH_end;
> +
> +static struct umh_info info;
> +
> +static void shutdown_umh(struct umh_info *info)
> +{
> + struct task_struct *tsk;
> +
> + tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
> + if (tsk)
> + force_sig(SIGKILL, tsk);
> + fput(info->pipe_to_umh);
> + fput(info->pipe_from_umh);
> +}
> +
> +static void stop_umh(void)
> +{
> + if (bpfilter_process_sockopt) {
> + bpfilter_process_sockopt = NULL;
> + shutdown_umh(&info);
> + }
> +}
> +
> +static int __bpfilter_process_sockopt(struct sock *sk, int optname,
> + char __user *optval,
> + unsigned int optlen, bool is_set)
> +{
> + struct mbox_request req;
> + struct mbox_reply reply;
> + loff_t pos;
> + ssize_t n;
> +
> + req.is_set = is_set;
> + req.pid = current->pid;
> + req.cmd = optname;
> + req.addr = (long)optval;
> + req.len = optlen;
> + n = __kernel_write(info.pipe_to_umh, &req, sizeof(req), &pos);
> + if (n != sizeof(req)) {
> + pr_err("write fail %zd\n", n);
> + stop_umh();
> + return -EFAULT;
> + }
> + pos = 0;
> + n = kernel_read(info.pipe_from_umh, &reply, sizeof(reply), &pos);
> + if (n != sizeof(reply)) {
> + pr_err("read fail %zd\n", n);
> + stop_umh();
> + return -EFAULT;
> + }
> + return reply.status;
> +}
> +
> +static int __init load_umh(void)
> +{
> + int err;
> +
> + err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
> + if (err)
> + return err;
> + pr_info("Loaded umh pid %d\n", info.pid);
> + bpfilter_process_sockopt = &__bpfilter_process_sockopt;
> +
> + if (__bpfilter_process_sockopt(NULL, 0, 0, 0, 0) != 0) {
See, here, what if the userspace process gets preempted and we run this
check afterwards? Is that possible?
Luis
> + stop_umh();
> + return -EFAULT;
> + }
> + return 0;
> +}
> +
> +static void __exit fini_umh(void)
> +{
> + stop_umh();
> +}
> +module_init(load_umh);
> +module_exit(fini_umh);
> +MODULE_LICENSE("GPL");
> diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
> new file mode 100644
> index 000000000000..81bbc1684896
> --- /dev/null
> +++ b/net/bpfilter/main.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <sys/uio.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <sys/socket.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include "include/uapi/linux/bpf.h"
> +#include <asm/unistd.h>
> +#include "msgfmt.h"
> +
> +int debug_fd;
> +
> +static int handle_get_cmd(struct mbox_request *cmd)
> +{
> + switch (cmd->cmd) {
> + case 0:
> + return 0;
> + default:
> + break;
> + }
> + return -ENOPROTOOPT;
> +}
> +
> +static int handle_set_cmd(struct mbox_request *cmd)
> +{
> + return -ENOPROTOOPT;
> +}
> +
> +static void loop(void)
> +{
> + while (1) {
> + struct mbox_request req;
> + struct mbox_reply reply;
> + int n;
> +
> + n = read(0, &req, sizeof(req));
> + if (n != sizeof(req)) {
> + dprintf(debug_fd, "invalid request %d\n", n);
> + return;
> + }
> +
> + reply.status = req.is_set ?
> + handle_set_cmd(&req) :
> + handle_get_cmd(&req);
> +
> + n = write(1, &reply, sizeof(reply));
> + if (n != sizeof(reply)) {
> + dprintf(debug_fd, "reply failed %d\n", n);
> + return;
> + }
> + }
> +}
> +
> +int main(void)
> +{
> + debug_fd = open("/dev/console", 00000002 | 00000100);
> + dprintf(debug_fd, "Started bpfilter\n");
> + loop();
> + close(debug_fd);
> + return 0;
> +}
> diff --git a/net/bpfilter/msgfmt.h b/net/bpfilter/msgfmt.h
> new file mode 100644
> index 000000000000..94b9ac9e5114
> --- /dev/null
> +++ b/net/bpfilter/msgfmt.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _NET_BPFILTER_MSGFMT_H
> +#define _NET_BPFILTER_MSGFMT_H
> +
> +struct mbox_request {
> + __u64 addr;
> + __u32 len;
> + __u32 is_set;
> + __u32 cmd;
> + __u32 pid;
> +};
> +
> +struct mbox_reply {
> + __u32 status;
> +};
> +
> +#endif
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index b379520f9133..7018f91c5a39 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -16,6 +16,8 @@ obj-y := route.o inetpeer.o protocol.o \
> inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
> metrics.o
>
> +obj-$(CONFIG_BPFILTER) += bpfilter/
> +
> obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
> obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
> obj-$(CONFIG_PROC_FS) += proc.o
> diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile
> new file mode 100644
> index 000000000000..ce262d76cc48
> --- /dev/null
> +++ b/net/ipv4/bpfilter/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_BPFILTER) += sockopt.o
> +
> diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
> new file mode 100644
> index 000000000000..42a96d2d8d05
> --- /dev/null
> +++ b/net/ipv4/bpfilter/sockopt.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/uaccess.h>
> +#include <linux/bpfilter.h>
> +#include <uapi/linux/bpf.h>
> +#include <linux/wait.h>
> +#include <linux/kmod.h>
> +
> +int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
> + char __user *optval,
> + unsigned int optlen, bool is_set);
> +EXPORT_SYMBOL_GPL(bpfilter_process_sockopt);
> +
> +int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval,
> + unsigned int optlen, bool is_set)
> +{
> + if (!bpfilter_process_sockopt) {
> + int err = request_module("bpfilter");
> +
> + if (err)
> + return err;
> + if (!bpfilter_process_sockopt)
> + return -ECHILD;
> + }
> + return bpfilter_process_sockopt(sk, optname, optval, optlen, is_set);
> +}
> +
> +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
> + unsigned int optlen)
> +{
> + return bpfilter_mbox_request(sk, optname, optval, optlen, true);
> +}
> +
> +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
> + int __user *optlen)
> +{
> + int len;
> +
> + if (get_user(len, optlen))
> + return -EFAULT;
> +
> + return bpfilter_mbox_request(sk, optname, optval, len, false);
> +}
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index 5ad2d8ed3a3f..e0791faacb24 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -47,6 +47,8 @@
> #include <linux/errqueue.h>
> #include <linux/uaccess.h>
>
> +#include <linux/bpfilter.h>
> +
> /*
> * SOL_IP control messages.
> */
> @@ -1244,6 +1246,11 @@ int ip_setsockopt(struct sock *sk, int level,
> return -ENOPROTOOPT;
>
> err = do_ip_setsockopt(sk, level, optname, optval, optlen);
> +#ifdef CONFIG_BPFILTER
> + if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
> + optname < BPFILTER_IPT_SET_MAX)
> + err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
> +#endif
> #ifdef CONFIG_NETFILTER
> /* we need to exclude all possible ENOPROTOOPTs except default case */
> if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
> @@ -1552,6 +1559,11 @@ int ip_getsockopt(struct sock *sk, int level,
> int err;
>
> err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0);
> +#ifdef CONFIG_BPFILTER
> + if (optname >= BPFILTER_IPT_SO_GET_INFO &&
> + optname < BPFILTER_IPT_GET_MAX)
> + err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
> +#endif
> #ifdef CONFIG_NETFILTER
> /* we need to exclude all possible ENOPROTOOPTs except default case */
> if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
> @@ -1584,6 +1596,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname,
> err = do_ip_getsockopt(sk, level, optname, optval, optlen,
> MSG_CMSG_COMPAT);
>
> +#ifdef CONFIG_BPFILTER
> + if (optname >= BPFILTER_IPT_SO_GET_INFO &&
> + optname < BPFILTER_IPT_GET_MAX)
> + err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
> +#endif
> #ifdef CONFIG_NETFILTER
> /* we need to exclude all possible ENOPROTOOPTs except default case */
> if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
> --
> 2.9.5
--
Do not panic
On Mon, May 07, 2018 at 06:39:31PM +0000, Luis R. Rodriguez wrote:
>
> > Are you saying make 'static struct vfsmount *shm_mnt;'
> > global and use it here? so no init_tmpfs() necessary?
> > I think that can work, but feels that having two
> > tmpfs mounts (one for shmem and one for umh) is cleaner.
>
> No, but now that you mention it... if a shared vfsmount is not used the
> /sys/kernel/mm/transparent_hugepage/shmem_enabled knob for using huge pages
> would not be followed for umh modules. For the i915 driver this was *why*
> they ended up adding shmem_file_setup_with_mnt(), they wanted huge pages to
> support huge-gtt-pages. What is the rationale behind umh.c using it for
> umh modules?
>
> Users of shmem_kernel_file_setup() spawned later out of the desire to
> *avoid* LSMs since it didn't make sense in their case as their inodes
> are never exposed to userspace. Such is the case for ipc/shm.c and
> security/keys/big_key.c. Refer to commit c7277090927a5 ("security: shmem:
> implement kernel private shmem inodes") and then commit e1832f2923ec9
> ("ipc: use private shmem or hugetlbfs inodes for shm segments").
>
> In this new umh usermode modules case we are:
>
> a) actually mapping data already extracted by the kernel somehow from
> a file somehow, presumably from /lib/modules/ path somewhere, but
> again this is not visible to umh.c, as it just gets called with:
>
> fork_usermode_blob(void *data, size_t len, struct umh_info *info)
>
> b) Creating the respective tmpfs file with shmem_file_setup_with_mnt()
> with our own shared struct vfsmount *umh_fs.
>
> c) Populating the file created and stuffing it with our data passed
>
> d) Calling do_execve_file() on it.
>
> It's not clear to me why you used shmem_file_setup_with_mnt() in this case. What
> are the gains? It would make sense to use shmem_kernel_file_setup() to avoid an
> LSM check on step b) *but* only if we already had a proper LSM check on step
> a).
>
> I checked how you use fork_usermode_blob() in a) and in this case the kernel
> module bpfilter would be loaded first, and for that we already have an LSM
> check / hook for that module. From my review then, the magic done on your
> second patch to stuff the userspace application into the module should be
> irrelevant to us from an LSM perspective.
>
> So, I can see a reason to use shmem_kernel_file_setup() but I cannot
> see a reason to be using shmem_file_setup_with_mnt() at the moment.
That's a good idea. I will switch to using shmem_kernel_file_setup().
I guess I can use kernel_write() as well instead of populate_file().
I wonder why i915 had to use pagecache_write_begin() and the loop
instead of kernel_write()...
The only reason to copy into tmpfs file is to make that memory swappable.
All standard shmem knobs should apply.
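For readers who want to try the "copy a blob into a tmpfs file and exec it" pattern outside the kernel, here is a rough user-space analogue. It is only a sketch and not the kernel code from the patch: memfd_create() stands in for the shmem file, write() for kernel_write(), and fexecve() for do_execve_file(). The helper name run_blob() is made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical user-space analogue of fork_usermode_blob(): copy a blob
 * into an anonymous, swappable, tmpfs-backed file and execute it in a
 * child process. Returns the child's exit status, or -1 on failure.
 */
static int run_blob(const void *data, size_t len)
{
	char *const argv[] = { "blob", NULL };
	char *const envp[] = { NULL };
	int fd, status;
	pid_t pid;

	/* No MFD_CLOEXEC, so that an interpreter can still reopen the fd. */
	fd = memfd_create("blob", 0);
	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0) {
		fexecve(fd, argv, envp);
		_exit(127);	/* only reached if the exec failed */
	}

	/* Parent drops its reference; the file lives on while the child runs. */
	close(fd);
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Passing the bytes of any executable (an ELF image, or a short "#!/bin/sh" script on a system with /dev/fd available) to run_blob() runs it without the file ever having a name in the filesystem, which is roughly the property the tmpfs copy gives the kernel side.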
> > > > +{
> > > > + size_t offset = 0;
> > > > + int err;
> > > > +
> > > > + do {
> > > > + unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
> > > > + struct page *page;
> > > > + void *pgdata, *vaddr;
> > > > +
> > > > + err = pagecache_write_begin(file, file->f_mapping, offset, len,
> > > > + 0, &page, &pgdata);
> > > > + if (err < 0)
> > > > + goto fail;
> > > > +
> > > > + vaddr = kmap(page);
> > > > + memcpy(vaddr, data, len);
> > > > + kunmap(page);
> > > > +
> > > > + err = pagecache_write_end(file, file->f_mapping, offset, len,
> > > > + len, page, pgdata);
> > > > + if (err < 0)
> > > > + goto fail;
> > > > +
> > > > + size -= len;
> > > > + data += len;
> > > > + offset += len;
> > > > + } while (size);
> > >
> > > Character for character, this looks like a wonderful copy and paste from
> > > i915_gem_object_create_from_data()'s own loop which does the same exact
> > > thing. Perhaps it's time for a helper in mm/filemap.c with an export so
> > > if a bug is fixed in one place it's fixed in both places.
> >
> > yes, of course, but not right now.
> > Once it all lands that will be the time to create common helper.
> > It's not a good idea to mess with i915 in one patch set.
>
> Either way works with me, so long as it's done.
Will be gone due to switch to kernel_write().
>
> > > > +int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
> > >
> > > Please use umh_fork_blob()
> >
> > sorry, no. fork_usermode_blob() is much more descriptive name.
>
> Prefixing new umh.c symbols with umh_*() makes it very clear this came from
> kernel/umh.c functionality. I've been dealing with other places in the kernel
> that have conflated their own use of kernel/umh.c functionality with what they
> expose in both code and documentation, and correcting this has taken a long
> time. Best avoid future confusion and be consistent with new exported symbols
> for umh.c functionality.
There is no confusion today. The best-known umh API is the
call_usermodehelper*() family.
In this case it's not a 'call', it's a 'fork', since part of a kernel module
is being forked as a user mode process.
I considered naming this function fork_usermodehelper(),
but that's also not quite correct, since 'user mode helper' has the predefined
meaning of something that has a path, whereas here it's a blob of bytes.
Hence fork_usermode_blob() is the more accurate and semantically correct name,
whereas umh_fork_blob() is not.
Notice I no longer call these new kernel modules 'umh modules',
since that's the wrong name for the same reasons.
They are good ol' kernel modules.
The new functionality allowed by this patch is:
forking part of kernel module data as a user mode process.
A lot of umh logic is reused, but 'user mode helpers' and
'user mode blobs' are distinct kernel features.
> I don't even want to test USB, I am just interested in the *very* *very*
> basic aspects of it. A simple lib/test_umh_module.c would do with a respective
> check that it's loaded, given this is a requirement from the API. It also helps
> folks understand the core basic knobs without having to look at bpfilter code.
I agree that lib/test_usermode_blob.c must be available eventually.
Right now we cannot add it to the tree, since we still need to figure out how
the interface between the kernel and the usermode blob will work, based on the
real-world use case of bpfilter. Once it gets further along, that will be the
time to say:
"look, here is the test for fork_usermode_blob() and here is how others (usb drivers)
can and should use it".
Today is not the right time to freeze the API. Such a lib/test_usermode_blob.c
would have to be constantly adjusted as we tweak the bpfilter side, becoming
an unnecessary burden on us bpfilter developers.
Everyone really needs to think of these patches as work in progress;
the internal details and API of fork_usermode_blob() will change.
On Mon, May 07, 2018 at 06:51:24PM +0000, Luis R. Rodriguez wrote:
> > Notice that _binary_net_bpfilter_bpfilter_umh_start - end
> > is placed into .init.rodata section, so it's freed as soon as __init
> > function of bpfilter.ko is finished.
> > As part of __init the bpfilter.ko does first request/reply action
> > via two unix pipes provided by fork_usermode_blob() helper to
> > make sure that umh is healthy. If not it will kill it via pid.
>
> It does this very fast, right away. On a really slow system how are you sure
> that this won't race and the execution of the check happens early on prior to
> letting the actual setup trigger? After all, we're calling the userspace
> process in async mode. We could preempt it now.
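I don't see an issue.
The kernel synchronously writes into a pipe and the user space process reads
from it. If the process has not been scheduled yet, the request simply waits
in the pipe until it runs.
Exactly the same as coredump logic with pipes.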
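To make the ordering argument concrete, here is a small user-space sketch of the same request/reply handshake over two pipes. It is an illustration under assumptions, not the bpfilter code: the structs mirror msgfmt.h with stdint types, the helper name ping_child() and the "status = cmd + 1" reply are invented for the example, and the 100 ms sleep only simulates a child that gets scheduled late.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same layout as msgfmt.h in the patch, using stdint types for user space. */
struct mbox_request {
	uint64_t addr;
	uint32_t len;
	uint32_t is_set;
	uint32_t cmd;
	uint32_t pid;
};

struct mbox_reply {
	uint32_t status;
};

/*
 * Hypothetical helper: fork a child that answers one request, then do a
 * synchronous request/reply from the parent. Returns the reply status,
 * or -1 on failure.
 */
static int ping_child(uint32_t cmd)
{
	struct mbox_request req = { .cmd = cmd };
	struct mbox_reply reply = { 0 };
	int to_child[2], from_child[2];
	pid_t pid;
	int ok;

	if (pipe(to_child))
		return -1;
	if (pipe(from_child)) {
		close(to_child[0]);
		close(to_child[1]);
		return -1;
	}

	pid = fork();
	if (pid == 0) {
		struct mbox_request r;
		struct mbox_reply rep;

		close(to_child[1]);
		close(from_child[0]);
		usleep(100 * 1000);	/* pretend we were scheduled late */
		if (read(to_child[0], &r, sizeof(r)) != sizeof(r))
			_exit(1);
		rep.status = r.cmd + 1;	/* made-up reply for the example */
		if (write(from_child[1], &rep, sizeof(rep)) != sizeof(rep))
			_exit(1);
		_exit(0);
	}

	close(to_child[0]);
	close(from_child[1]);
	/*
	 * The write completes into the pipe buffer even though the child is
	 * not reading yet; the read then blocks until the child has replied.
	 */
	ok = pid > 0 &&
	     write(to_child[1], &req, sizeof(req)) == sizeof(req) &&
	     read(from_child[0], &reply, sizeof(reply)) == sizeof(reply);
	close(to_child[1]);
	close(from_child[0]);
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return ok ? (int)reply.status : -1;
}
```

However late the child starts, the parent either gets a full reply or sees a short read when the child dies and its end of the pipe closes, so the health check does not depend on scheduling order.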
> > +# a bit of elf magic to convert bpfilter_umh binary into a binary blob
> > +# inside bpfilter_umh.o elf file referenced by
> > +# _binary_net_bpfilter_bpfilter_umh_start symbol
> > +# which bpfilter_kern.c passes further into umh blob loader at run-time
> > +quiet_cmd_copy_umh = GEN $@
> > + cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> > + $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
> > + -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> > + --rename-section .data=.init.rodata $< $@
>
> Cool, but so our expectation is that the compiler sets this symbol, how
> are we sure it will always be set?
Compiler doesn't set it. objcopy does.
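As background for the symbol name: with "-I binary", objcopy derives the _binary_*_start, _binary_*_end and _binary_*_size symbols from the input file name as given on the command line, replacing every non-alphanumeric character with an underscore. The sketch below only reproduces that naming rule so the result can be compared with the symbol quoted above; binary_symbol() is a hypothetical helper written for this example, not something objcopy or the kernel provides.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the symbol name that "objcopy -I binary"
 * derives from an input file name. Every character that is not
 * alphanumeric becomes '_'. suffix is "start", "end" or "size".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int binary_symbol(const char *path, const char *suffix,
			 char *out, size_t outlen)
{
	int n = snprintf(out, outlen, "_binary_%s_%s", path, suffix);
	size_t i;

	if (n < 0 || (size_t)n >= outlen)
		return -1;
	/* Underscores map to themselves, so mangling the whole string is safe. */
	for (i = 0; out[i]; i++)
		if (!isalnum((unsigned char)out[i]))
			out[i] = '_';
	return 0;
}
```

For the input path net/bpfilter/bpfilter_umh and the suffix "start" this yields _binary_net_bpfilter_bpfilter_umh_start, which is why the symbol exists whenever the objcopy rule has run, independent of the compiler.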
> > +
> > + if (__bpfilter_process_sockopt(NULL, 0, 0, 0, 0) != 0) {
>
> See, here, what if the userspace process gets preempted and we run this
> check afterwards? Is that possible?
User space is a normal task. It can sleep and can be single stepped with GDB.
On Fri, May 4, 2018 at 12:56 PM, Luis R. Rodriguez <[email protected]> wrote:
> What a mighty short list of reviewers. Adding some more. My review below.
> I'd appreciate a Cc on future versions of these patches.
Me too, please. And likely linux-security-module@ and Jessica too.
> On Wed, May 02, 2018 at 09:36:01PM -0700, Alexei Starovoitov wrote:
>> Introduce helper:
>> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
>> struct umh_info {
>> struct file *pipe_to_umh;
>> struct file *pipe_from_umh;
>> pid_t pid;
>> };
>>
>> that GPLed kernel modules (signed or unsigned) can use it to execute part
>> of its own data as swappable user mode process.
>>
>> The kernel will do:
>> - mount "tmpfs"
>> - allocate a unique file in tmpfs
>> - populate that file with [data, data + len] bytes
>> - user-mode-helper code will do_execve that file and, before the process
>> starts, the kernel will create two unix pipes for bidirectional
>> communication between kernel module and umh
>> - close tmpfs file, effectively deleting it
>> - the fork_usermode_blob will return zero on success and populate
>> 'struct umh_info' with two unix pipes and the pid of the user process
I'm trying to think how LSMs can successfully reason about the
resulting exec(). In the past, we've replaced "blob" style interfaces
with file-based interfaces (e.g. init_module() -> finit_module(),
kexec_load() -> kexec_file_load()) to better let the kernel understand
the origin of executable content. Here the intent is fine: we're
getting the exec from an already-loaded module, etc, etc. I'm trying
to think specifically about the interface.
How can the ultimate exec get tied back to the kernel module in a way
that the LSM can query? Right now the hooks hit during exec are:
kernel_read_file() and kernel_post_read_file() of tmpfs file,
bprm_set_creds(), bprm_check(), bprm_committing_creds(),
bprm_committed_creds(). It seems silly to me for an LSM to perform
these checks at all since I would expect the _meaningful_ check to be
finit_module() of the module itself. Having a way for an LSM to know
the exec is tied to a kernel module would let them skip the nonsense
checks.
Since the process for doing the usermode_blob is defined by the kernel
module build/link/objcopy process, could we tighten the
fork_usermode_blob() interface to point to the kernel module itself,
rather than leaving it an open-ended "blob" interface? Given our
history of needing to replace blob interfaces with file interfaces,
I'm cautious to add a new blob interface. Maybe just pull all the
blob-finding/loading into the interface, and just make it something
like fork_usermode_kmod(struct module *mod, struct umh_info *info) ?
-Kees
--
Kees Cook
Pixel Security
On Thu, May 10, 2018 at 03:27:24PM -0700, Kees Cook wrote:
> Since the process for doing the usermode_blob is defined by the kernel
> module build/link/objcopy process, could we tighten the
> fork_usermode_blob() interface to point to the kernel module itself,
> rather than leaving it an open-ended "blob" interface? Given our
> history of needing to replace blob interfaces with file interfaces,
> I'm cautious to add a new blob interface. Maybe just pull all the
> blob-finding/loading into the interface, and just make it something
> like fork_usermode_kmod(struct module *mod, struct umh_info *info) ?
I don't think it will work, since Andy and others pointed out that
bpfilter needs to work as builtin as well. There is no 'struct module'
in that case, but forking the user process still needs to happen.