I'm taking over Alban's work on this.
This series allows sockops programs to access the network namespace
inode and device via (struct bpf_sock_ops)->netns_ino and ->netns_dev.
This can be useful to apply different policies on different network
namespaces.
In the unlikely case where network namespaces are not compiled in
(CONFIG_NET_NS=n), the verifier will generate code to return netns_dev
as usual and will return 0 for netns_ino.
The generated BPF bytecode for netns_ino is loading the correct
inode number at the time of execution.
However, the generated BPF bytecode for netns_dev is loading an
immediate value determined at BPF-load-time by looking at the
initial network namespace. In practice, this works because all netns
currently use the same virtual device. If this was to change, this
code would need to be updated too.
It also adds sockmap and verifier selftests to cover the new fields.
Partial reads work thanks to commit e2f7fc0ac69 ("bpf: fix undefined
behavior in narrow load handling").
v1 patchset can be found at:
https://lkml.org/lkml/2019/4/12/238
Changes since v1:
- add netns_dev (review from Alexei)
- tools/include/uapi/linux/bpf.h: update with netns_dev
- tools/testing/selftests/bpf/test_sockmap_kern.h: print debugs with
- This is a new selftest (review from Song)
v2 patchest can be found at:
https://lkml.org/lkml/2019/4/18/685
Changes since v2:
- replace __u64 by u64 in kernel code (review from Y Song)
- remove unneeded #else branch: program would be rejected in
is_valid_access (review from Y Song)
- allow partial reads (<u64) (review from Y Song)
- standalone patch for the sync (requested by Y Song)
- update commitmsg to refer to netns_ino
- test partial reads on netns_dev (review from Y Song)
- split in two tests
v3 patchset can be found at:
https://lkml.org/lkml/2019/4/26/740
Changes since v3:
- return netns_dev unconditionally and set netns_ino to 0 if
CONFIG_NET_NS is not enabled (review from Jakub Kicinski)
- use bpf_ctx_record_field_size and bpf_ctx_narrow_access_ok instead of
manually deal with partial reads (review from Y Song)
- update commit message to reflect new code and remove note about
partial reads since it was discussed in the review
- use bpf_ctx_range() and offsetofend()
Alban Crequy (4):
bpf: sock ops: add netns ino and dev in bpf context
bpf: sync bpf.h to tools/ for bpf_sock_ops->netns*
selftests: bpf: read netns_ino from struct bpf_sock_ops
selftests: bpf: verifier: read netns_dev and netns_ino from struct
bpf_sock_ops
include/uapi/linux/bpf.h | 2 +
net/core/filter.c | 70 +++++++++++++++++++
tools/include/uapi/linux/bpf.h | 2 +
tools/testing/selftests/bpf/test_sockmap.c | 38 +++++++++-
.../testing/selftests/bpf/test_sockmap_kern.h | 22 ++++++
.../testing/selftests/bpf/verifier/var_off.c | 53 ++++++++++++++
6 files changed, 184 insertions(+), 3 deletions(-)
--
2.21.0
From: Alban Crequy <[email protected]>
This shows how a sockops program could be restricted to a specific
network namespace. The sockops program looks at the current netns via
(struct bpf_sock_ops)->netns_ino and checks if the value matches the
configuration in the new BPF map "sock_netns".
The test program ./test_sockmap accepts a new parameter "--netns"; the
default value is the current netns found by stat() on /proc/self/ns/net,
so the previous tests still pass:
sudo ./test_sockmap
...
Summary: 412 PASSED 0 FAILED
...
Summary: 824 PASSED 0 FAILED
I run my additional test in the following way:
NETNS=$(readlink /proc/self/ns/net | sed 's/^net:\[\(.*\)\]$/\1/')
CGR=/sys/fs/cgroup/unified/user.slice/user-1000.slice/session-5.scope/
sudo ./test_sockmap --cgroup $CGR --netns $NETNS &
cat /sys/kernel/debug/tracing/trace_pipe
echo foo | nc -l 127.0.0.1 8080 &
echo bar | nc 127.0.0.1 8080
=> the connection goes through the sockmap
When testing with a wrong $NETNS, I get the trace_pipe log:
> not binding connection on netns 4026531992
Signed-off-by: Alban Crequy <[email protected]>
---
Changes since v1:
- tools/include/uapi/linux/bpf.h: update with netns_dev
- tools/testing/selftests/bpf/test_sockmap_kern.h: print debugs with
both netns_dev and netns_ino
Changes since v2:
- update commitmsg to refer to netns_ino
---
tools/testing/selftests/bpf/test_sockmap.c | 38 +++++++++++++++++--
.../testing/selftests/bpf/test_sockmap_kern.h | 22 +++++++++++
2 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 3845144e2c91..5a1b9c96fca1 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -2,6 +2,7 @@
// Copyright (c) 2017-2018 Covalent IO, Inc. http://covalent.io
#include <stdio.h>
#include <stdlib.h>
+#include <stdint.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <sys/select.h>
@@ -21,6 +22,7 @@
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/sendfile.h>
+#include <sys/stat.h>
#include <linux/netlink.h>
#include <linux/socket.h>
@@ -63,8 +65,8 @@ int s1, s2, c1, c2, p1, p2;
int test_cnt;
int passed;
int failed;
-int map_fd[8];
-struct bpf_map *maps[8];
+int map_fd[9];
+struct bpf_map *maps[9];
int prog_fd[11];
int txmsg_pass;
@@ -84,6 +86,7 @@ int txmsg_ingress;
int txmsg_skb;
int ktls;
int peek_flag;
+uint64_t netns_opt;
static const struct option long_options[] = {
{"help", no_argument, NULL, 'h' },
@@ -111,6 +114,7 @@ static const struct option long_options[] = {
{"txmsg_skb", no_argument, &txmsg_skb, 1 },
{"ktls", no_argument, &ktls, 1 },
{"peek", no_argument, &peek_flag, 1 },
+ {"netns", required_argument, NULL, 'n'},
{0, 0, NULL, 0 }
};
@@ -1585,6 +1589,7 @@ char *map_names[] = {
"sock_bytes",
"sock_redir_flags",
"sock_skb_opts",
+ "sock_netns",
};
int prog_attach_type[] = {
@@ -1619,6 +1624,8 @@ static int populate_progs(char *bpf_file)
struct bpf_object *obj;
int i = 0;
long err;
+ struct stat netns_sb;
+ uint64_t netns_ino;
obj = bpf_object__open(bpf_file);
err = libbpf_get_error(obj);
@@ -1655,6 +1662,28 @@ static int populate_progs(char *bpf_file)
}
}
+ if (netns_opt == 0) {
+ err = stat("/proc/self/ns/net", &netns_sb);
+ if (err) {
+ fprintf(stderr,
+ "ERROR: cannot stat network namespace: %ld (%s)\n",
+ err, strerror(errno));
+ return -1;
+ }
+ netns_ino = netns_sb.st_ino;
+ } else {
+ netns_ino = netns_opt;
+ }
+ i = 1;
+ err = bpf_map_update_elem(map_fd[8], &netns_ino, &i, BPF_ANY);
+ if (err) {
+ fprintf(stderr,
+ "ERROR: bpf_map_update_elem (netns): %ld (%s)\n",
+ err, strerror(errno));
+ return -1;
+ }
+
+
return 0;
}
@@ -1738,7 +1767,7 @@ int main(int argc, char **argv)
if (argc < 2)
return test_suite(-1);
- while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:p:q:",
+ while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:p:q:n:",
long_options, &longindex)) != -1) {
switch (opt) {
case 's':
@@ -1805,6 +1834,9 @@ int main(int argc, char **argv)
return -1;
}
break;
+ case 'n':
+ netns_opt = strtoull(optarg, NULL, 10);
+ break;
case 0:
break;
case 'h':
diff --git a/tools/testing/selftests/bpf/test_sockmap_kern.h b/tools/testing/selftests/bpf/test_sockmap_kern.h
index e7639f66a941..317406dad6cf 100644
--- a/tools/testing/selftests/bpf/test_sockmap_kern.h
+++ b/tools/testing/selftests/bpf/test_sockmap_kern.h
@@ -91,6 +91,13 @@ struct bpf_map_def SEC("maps") sock_skb_opts = {
.max_entries = 1
};
+struct bpf_map_def SEC("maps") sock_netns = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(__u64),
+ .value_size = sizeof(int),
+ .max_entries = 16
+};
+
SEC("sk_skb1")
int bpf_prog1(struct __sk_buff *skb)
{
@@ -132,9 +139,24 @@ int bpf_sockmap(struct bpf_sock_ops *skops)
{
__u32 lport, rport;
int op, err = 0, index, key, ret;
+ int i = 0;
+ __u64 netns_dev, netns_ino;
+ int *allowed;
op = (int) skops->op;
+ netns_dev = skops->netns_dev;
+ netns_ino = skops->netns_ino;
+ bpf_printk("bpf_sockmap: netns_dev = %lu netns_ino = %lu\n",
+ netns_dev, netns_ino);
+
+ // Only allow sockmap connection on the configured network namespace
+ allowed = bpf_map_lookup_elem(&sock_netns, &netns_ino);
+ if (allowed == NULL || *allowed == 0) {
+ bpf_printk("not binding connection on netns_ino %lu\n",
+ netns_ino);
+ return 0;
+ }
switch (op) {
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
--
2.21.0
From: Alban Crequy <[email protected]>
sockops programs can now access the network namespace inode and device
via (struct bpf_sock_ops)->netns_ino and ->netns_dev. This can be useful
to apply different policies on different network namespaces.
In the unlikely case where network namespaces are not compiled in
(CONFIG_NET_NS=n), the verifier will return netns_dev as usual and will
return 0 for netns_ino.
The generated BPF bytecode for netns_ino is loading the correct inode
number at the time of execution.
However, the generated BPF bytecode for netns_dev is loading an
immediate value determined at BPF-load-time by looking at the initial
network namespace. In practice, this works because all netns currently
use the same virtual device. If this was to change, this code would need
to be updated too.
Co-authored-by: Iago López Galeiras <[email protected]>
Signed-off-by: Alban Crequy <[email protected]>
Signed-off-by: Iago López Galeiras <[email protected]>
---
Changes since v1:
- add netns_dev (review from Alexei)
Changes since v2:
- replace __u64 by u64 in kernel code (review from Y Song)
- remove unneeded #else branch: program would be rejected in
is_valid_access (review from Y Song)
- allow partial reads (<u64) (review from Y Song)
Changes since v3:
- return netns_dev unconditionally and set netns_ino to 0 if
CONFIG_NET_NS is not enabled (review from Jakub Kicinski)
- use bpf_ctx_record_field_size and bpf_ctx_narrow_access_ok instead of
manually deal with partial reads (review from Y Song)
- update commit message to reflect new code and remove note about
partial reads since it was discussed in the review
- use bpf_ctx_range() and offsetofend()
---
include/uapi/linux/bpf.h | 2 ++
net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 63e0cf66f01a..e64066a09a5f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3261,6 +3261,8 @@ struct bpf_sock_ops {
__u32 sk_txhash;
__u64 bytes_received;
__u64 bytes_acked;
+ __u64 netns_dev;
+ __u64 netns_ino;
};
/* Definitions for bpf_sock_ops_cb_flags */
diff --git a/net/core/filter.c b/net/core/filter.c
index 55bfc941d17a..2b1552a8dd74 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -76,6 +76,8 @@
#include <net/lwtunnel.h>
#include <net/ipv6_stubs.h>
#include <net/bpf_sk_storage.h>
+#include <linux/kdev_t.h>
+#include <linux/proc_ns.h>
/**
* sk_filter_trim_cap - run a packet through a socket filter
@@ -6822,6 +6824,18 @@ static bool sock_ops_is_valid_access(int off, int size,
}
} else {
switch (off) {
+ case bpf_ctx_range(struct bpf_sock_ops, netns_dev):
+ if (off >= offsetofend(struct bpf_sock_ops, netns_dev))
+ return false;
+
+ bpf_ctx_record_field_size(info, sizeof(u64));
+ if (!bpf_ctx_narrow_access_ok(off, size, sizeof(u64)))
+ return false;
+ break;
+ case offsetof(struct bpf_sock_ops, netns_ino):
+ if (size != sizeof(u64))
+ return false;
+ break;
case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
bytes_acked):
if (size != sizeof(__u64))
@@ -7739,6 +7753,11 @@ static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
+static struct ns_common *sockops_netns_cb(void *private_data)
+{
+ return &init_net.ns;
+}
+
static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
const struct bpf_insn *si,
struct bpf_insn *insn_buf,
@@ -7747,6 +7766,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
{
struct bpf_insn *insn = insn_buf;
int off;
+ struct inode *ns_inode;
+ struct path ns_path;
+ u64 netns_dev;
+ void *res;
/* Helper macro for adding read access to tcp_sock or sock fields. */
#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
@@ -7993,6 +8016,53 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
struct sock, type);
break;
+
+ case bpf_ctx_range(struct bpf_sock_ops, netns_dev):
+ /* We get the netns_dev at BPF-load-time and not at
+ * BPF-exec-time. We assume that netns_dev is a constant.
+ */
+ res = ns_get_path_cb(&ns_path, sockops_netns_cb, NULL);
+ if (IS_ERR(res)) {
+ netns_dev = 0;
+ } else {
+ ns_inode = ns_path.dentry->d_inode;
+ netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
+ }
+ *target_size = 8;
+ *insn++ = BPF_MOV64_IMM(si->dst_reg, netns_dev);
+ break;
+
+ case offsetof(struct bpf_sock_ops, netns_ino):
+#ifdef CONFIG_NET_NS
+ /* Loading: sk_ops->sk->__sk_common.skc_net.net->ns.inum
+ * Type: (struct bpf_sock_ops_kern *)
+ * ->(struct sock *)
+ * ->(struct sock_common)
+ * .possible_net_t
+ * .(struct net *)
+ * ->(struct ns_common)
+ * .(unsigned int)
+ */
+ BUILD_BUG_ON(offsetof(struct sock, __sk_common) != 0);
+ BUILD_BUG_ON(offsetof(possible_net_t, net) != 0);
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ possible_net_t, net),
+ si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_net));
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+ struct ns_common, inum),
+ si->dst_reg, si->dst_reg,
+ offsetof(struct net, ns) +
+ offsetof(struct ns_common, inum));
+#else
+ *insn++ = BPF_MOV64_IMM(si->dst_reg, 0);
+#endif
+ break;
+
}
return insn - insn_buf;
}
--
2.21.0
From: Alban Crequy <[email protected]>
The change in struct bpf_sock_ops is synchronised
from: include/uapi/linux/bpf.h
to: tools/include/uapi/linux/bpf.h
Signed-off-by: Alban Crequy <[email protected]>
---
Changes since v2:
- standalone patch for the sync (requested by Y Song)
---
tools/include/uapi/linux/bpf.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 63e0cf66f01a..e64066a09a5f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3261,6 +3261,8 @@ struct bpf_sock_ops {
__u32 sk_txhash;
__u64 bytes_received;
__u64 bytes_acked;
+ __u64 netns_dev;
+ __u64 netns_ino;
};
/* Definitions for bpf_sock_ops_cb_flags */
--
2.21.0
From: Alban Crequy <[email protected]>
Tested with:
> $ sudo ./test_verifier
> ...
> #905/p sockops accessing bpf_sock_ops->netns_dev, ok OK
> #906/p sockops accessing bpf_sock_ops->netns_ino, ok OK
> ...
> Summary: 1421 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Alban Crequy <[email protected]>
---
Changes since v1:
- This is a new selftest (review from Song)
Changes since v2:
- test partial reads on netns_dev (review from Y Song)
- split in two tests
---
.../testing/selftests/bpf/verifier/var_off.c | 53 +++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/tools/testing/selftests/bpf/verifier/var_off.c b/tools/testing/selftests/bpf/verifier/var_off.c
index 8504ac937809..9e4c6c78eb9d 100644
--- a/tools/testing/selftests/bpf/verifier/var_off.c
+++ b/tools/testing/selftests/bpf/verifier/var_off.c
@@ -246,3 +246,56 @@
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_LWT_IN,
},
+{
+ "sockops accessing bpf_sock_ops->netns_dev, ok",
+ .insns = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev)),
+
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev)),
+ BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 4),
+
+ BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev)),
+ BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 2),
+ BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 4),
+ BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 6),
+
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev)),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 1),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 2),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 3),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 4),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 5),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 6),
+ BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_dev) + 7),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_SOCK_OPS,
+},
+{
+ "sockops accessing bpf_sock_ops->netns_ino, ok",
+ .insns = {
+ BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_1, offsetof(struct bpf_sock_ops,
+ netns_ino)),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_SOCK_OPS,
+},
--
2.21.0
On Fri, May 24, 2019 at 9:01 AM Iago López Galeiras <[email protected]> wrote:
>
> From: Alban Crequy <[email protected]>
>
> sockops programs can now access the network namespace inode and device
> via (struct bpf_sock_ops)->netns_ino and ->netns_dev. This can be useful
> to apply different policies on different network namespaces.
>
> In the unlikely case where network namespaces are not compiled in
> (CONFIG_NET_NS=n), the verifier will return netns_dev as usual and will
> return 0 for netns_ino.
>
> The generated BPF bytecode for netns_ino is loading the correct inode
> number at the time of execution.
>
> However, the generated BPF bytecode for netns_dev is loading an
> immediate value determined at BPF-load-time by looking at the initial
> network namespace. In practice, this works because all netns currently
> use the same virtual device. If this was to change, this code would need
> to be updated too.
>
> Co-authored-by: Iago López Galeiras <[email protected]>
> Signed-off-by: Alban Crequy <[email protected]>
> Signed-off-by: Iago López Galeiras <[email protected]>
>
> ---
>
> Changes since v1:
> - add netns_dev (review from Alexei)
>
> Changes since v2:
> - replace __u64 by u64 in kernel code (review from Y Song)
> - remove unneeded #else branch: program would be rejected in
> is_valid_access (review from Y Song)
> - allow partial reads (<u64) (review from Y Song)
>
> Changes since v3:
> - return netns_dev unconditionally and set netns_ino to 0 if
> CONFIG_NET_NS is not enabled (review from Jakub Kicinski)
> - use bpf_ctx_record_field_size and bpf_ctx_narrow_access_ok instead of
> manually deal with partial reads (review from Y Song)
> - update commit message to reflect new code and remove note about
> partial reads since it was discussed in the review
> - use bpf_ctx_range() and offsetofend()
> ---
> include/uapi/linux/bpf.h | 2 ++
> net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 72 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 63e0cf66f01a..e64066a09a5f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3261,6 +3261,8 @@ struct bpf_sock_ops {
> __u32 sk_txhash;
> __u64 bytes_received;
> __u64 bytes_acked;
> + __u64 netns_dev;
> + __u64 netns_ino;
> };
>
> /* Definitions for bpf_sock_ops_cb_flags */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 55bfc941d17a..2b1552a8dd74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -76,6 +76,8 @@
> #include <net/lwtunnel.h>
> #include <net/ipv6_stubs.h>
> #include <net/bpf_sk_storage.h>
> +#include <linux/kdev_t.h>
> +#include <linux/proc_ns.h>
>
> /**
> * sk_filter_trim_cap - run a packet through a socket filter
> @@ -6822,6 +6824,18 @@ static bool sock_ops_is_valid_access(int off, int size,
> }
> } else {
> switch (off) {
> + case bpf_ctx_range(struct bpf_sock_ops, netns_dev):
> + if (off >= offsetofend(struct bpf_sock_ops, netns_dev))
> + return false;
This is not needed.
#define bpf_ctx_range(TYPE, MEMBER)
\
offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1
off should never be >= offsetofend(struct bpf_sock_ops, netns_dev).
> +
> + bpf_ctx_record_field_size(info, sizeof(u64));
> + if (!bpf_ctx_narrow_access_ok(off, size, sizeof(u64)))
> + return false;
> + break;
> + case offsetof(struct bpf_sock_ops, netns_ino):
> + if (size != sizeof(u64))
> + return false;
> + break;
> case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
> bytes_acked):
> if (size != sizeof(__u64))
> @@ -7739,6 +7753,11 @@ static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
> return insn - insn_buf;
> }
>
> +static struct ns_common *sockops_netns_cb(void *private_data)
> +{
> + return &init_net.ns;
> +}
> +
> static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> const struct bpf_insn *si,
> struct bpf_insn *insn_buf,
> @@ -7747,6 +7766,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> {
> struct bpf_insn *insn = insn_buf;
> int off;
> + struct inode *ns_inode;
> + struct path ns_path;
> + u64 netns_dev;
> + void *res;
>
> /* Helper macro for adding read access to tcp_sock or sock fields. */
> #define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
> @@ -7993,6 +8016,53 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
> struct sock, type);
> break;
> +
> + case bpf_ctx_range(struct bpf_sock_ops, netns_dev):
> + /* We get the netns_dev at BPF-load-time and not at
> + * BPF-exec-time. We assume that netns_dev is a constant.
> + */
> + res = ns_get_path_cb(&ns_path, sockops_netns_cb, NULL);
> + if (IS_ERR(res)) {
> + netns_dev = 0;
What is the proper way to handle error here?
If we indeed get an error and netns_dev = 0, do this mean bpf program
should check netns_dev == 0 and special case it? Or since this is really
a lower probability thing we can set to 0 and bpf program's logic does not
need to specialize this one.
At least, maybe we need a little documentation for the field in uapi header
to point out this potential caveat?
> + } else {
> + ns_inode = ns_path.dentry->d_inode;
> + netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
> + }
> + *target_size = 8;
> + *insn++ = BPF_MOV64_IMM(si->dst_reg, netns_dev);
> + break;
> +
> + case offsetof(struct bpf_sock_ops, netns_ino):
> +#ifdef CONFIG_NET_NS
> + /* Loading: sk_ops->sk->__sk_common.skc_net.net->ns.inum
> + * Type: (struct bpf_sock_ops_kern *)
> + * ->(struct sock *)
> + * ->(struct sock_common)
> + * .possible_net_t
> + * .(struct net *)
> + * ->(struct ns_common)
> + * .(unsigned int)
> + */
> + BUILD_BUG_ON(offsetof(struct sock, __sk_common) != 0);
> + BUILD_BUG_ON(offsetof(possible_net_t, net) != 0);
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct bpf_sock_ops_kern, sk),
> + si->dst_reg, si->src_reg,
> + offsetof(struct bpf_sock_ops_kern, sk));
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + possible_net_t, net),
> + si->dst_reg, si->dst_reg,
> + offsetof(struct sock_common, skc_net));
> + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> + struct ns_common, inum),
> + si->dst_reg, si->dst_reg,
> + offsetof(struct net, ns) +
> + offsetof(struct ns_common, inum));
> +#else
> + *insn++ = BPF_MOV64_IMM(si->dst_reg, 0);
> +#endif
> + break;
> +
> }
> return insn - insn_buf;
> }
> --
> 2.21.0
>
On Fri, May 24, 2019 at 7:48 PM Y Song <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 9:01 AM Iago López Galeiras <[email protected]> wrote:
> >
> > From: Alban Crequy <[email protected]>
> >
> > sockops programs can now access the network namespace inode and device
> > via (struct bpf_sock_ops)->netns_ino and ->netns_dev. This can be useful
> > to apply different policies on different network namespaces.
> >
> > In the unlikely case where network namespaces are not compiled in
> > (CONFIG_NET_NS=n), the verifier will return netns_dev as usual and will
> > return 0 for netns_ino.
> >
> > The generated BPF bytecode for netns_ino is loading the correct inode
> > number at the time of execution.
> >
> > However, the generated BPF bytecode for netns_dev is loading an
> > immediate value determined at BPF-load-time by looking at the initial
> > network namespace. In practice, this works because all netns currently
> > use the same virtual device. If this was to change, this code would need
> > to be updated too.
> >
> > Co-authored-by: Iago López Galeiras <[email protected]>
> > Signed-off-by: Alban Crequy <[email protected]>
> > Signed-off-by: Iago López Galeiras <[email protected]>
> >
> > ---
> >
> > Changes since v1:
> > - add netns_dev (review from Alexei)
> >
> > Changes since v2:
> > - replace __u64 by u64 in kernel code (review from Y Song)
> > - remove unneeded #else branch: program would be rejected in
> > is_valid_access (review from Y Song)
> > - allow partial reads (<u64) (review from Y Song)
> >
> > Changes since v3:
> > - return netns_dev unconditionally and set netns_ino to 0 if
> > CONFIG_NET_NS is not enabled (review from Jakub Kicinski)
> > - use bpf_ctx_record_field_size and bpf_ctx_narrow_access_ok instead of
> > manually deal with partial reads (review from Y Song)
> > - update commit message to reflect new code and remove note about
> > partial reads since it was discussed in the review
> > - use bpf_ctx_range() and offsetofend()
> > ---
> > include/uapi/linux/bpf.h | 2 ++
> > net/core/filter.c | 70 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 72 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 63e0cf66f01a..e64066a09a5f 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3261,6 +3261,8 @@ struct bpf_sock_ops {
> > __u32 sk_txhash;
> > __u64 bytes_received;
> > __u64 bytes_acked;
> > + __u64 netns_dev;
> > + __u64 netns_ino;
> > };
> >
> > /* Definitions for bpf_sock_ops_cb_flags */
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 55bfc941d17a..2b1552a8dd74 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -76,6 +76,8 @@
> > #include <net/lwtunnel.h>
> > #include <net/ipv6_stubs.h>
> > #include <net/bpf_sk_storage.h>
> > +#include <linux/kdev_t.h>
> > +#include <linux/proc_ns.h>
> >
> > /**
> > * sk_filter_trim_cap - run a packet through a socket filter
> > @@ -6822,6 +6824,18 @@ static bool sock_ops_is_valid_access(int off, int size,
> > }
> > } else {
> > switch (off) {
> > + case bpf_ctx_range(struct bpf_sock_ops, netns_dev):
> > + if (off >= offsetofend(struct bpf_sock_ops, netns_dev))
> > + return false;
>
> This is not needed.
> #define bpf_ctx_range(TYPE, MEMBER)
> \
> offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1
> off should never be >= offsetofend(struct bpf_sock_ops, netns_dev).
>
Right, I'll remove it.
> > +
> > + bpf_ctx_record_field_size(info, sizeof(u64));
> > + if (!bpf_ctx_narrow_access_ok(off, size, sizeof(u64)))
> > + return false;
> > + break;
> > + case offsetof(struct bpf_sock_ops, netns_ino):
> > + if (size != sizeof(u64))
> > + return false;
> > + break;
> > case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
> > bytes_acked):
> > if (size != sizeof(__u64))
> > @@ -7739,6 +7753,11 @@ static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
> > return insn - insn_buf;
> > }
> >
> > +static struct ns_common *sockops_netns_cb(void *private_data)
> > +{
> > + return &init_net.ns;
> > +}
> > +
> > static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> > const struct bpf_insn *si,
> > struct bpf_insn *insn_buf,
> > @@ -7747,6 +7766,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> > {
> > struct bpf_insn *insn = insn_buf;
> > int off;
> > + struct inode *ns_inode;
> > + struct path ns_path;
> > + u64 netns_dev;
> > + void *res;
> >
> > /* Helper macro for adding read access to tcp_sock or sock fields. */
> > #define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
> > @@ -7993,6 +8016,53 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> > SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
> > struct sock, type);
> > break;
> > +
> > + case bpf_ctx_range(struct bpf_sock_ops, netns_dev):
> > + /* We get the netns_dev at BPF-load-time and not at
> > + * BPF-exec-time. We assume that netns_dev is a constant.
> > + */
> > + res = ns_get_path_cb(&ns_path, sockops_netns_cb, NULL);
> > + if (IS_ERR(res)) {
> > + netns_dev = 0;
>
> What is the proper way to handle error here?
> If we indeed get an error and netns_dev = 0, do this mean bpf program
> should check netns_dev == 0 and special case it? Or since this is really
> a lower probability thing we can set to 0 and bpf program's logic does not
> need to specialize this one.
>
> At least, maybe we need a little documentation for the field in uapi header
> to point out this potential caveat?
>
As far as I can tell, this function can only error with ENOMEM when allocating
a new inode or dentry, which is very unlikely. I don't think bpf programs
should check for this. Also, for sockops programs worst case is usually that
the connection goes through the usual networking stack so it shouldn't be
dangerous.
I can add a comment like this to the field in the uapi header file:
...
/*
* netns_dev might be zero if there's an error getting it
* when loading the BPF program. This is very unlikely.
*/
__u64 netns_dev;
...
What do you think?
> > + } else {
> > + ns_inode = ns_path.dentry->d_inode;
> > + netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
> > + }
> > + *target_size = 8;
> > + *insn++ = BPF_MOV64_IMM(si->dst_reg, netns_dev);
> > + break;
> > +
> > + case offsetof(struct bpf_sock_ops, netns_ino):
> > +#ifdef CONFIG_NET_NS
> > + /* Loading: sk_ops->sk->__sk_common.skc_net.net->ns.inum
> > + * Type: (struct bpf_sock_ops_kern *)
> > + * ->(struct sock *)
> > + * ->(struct sock_common)
> > + * .possible_net_t
> > + * .(struct net *)
> > + * ->(struct ns_common)
> > + * .(unsigned int)
> > + */
> > + BUILD_BUG_ON(offsetof(struct sock, __sk_common) != 0);
> > + BUILD_BUG_ON(offsetof(possible_net_t, net) != 0);
> > + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> > + struct bpf_sock_ops_kern, sk),
> > + si->dst_reg, si->src_reg,
> > + offsetof(struct bpf_sock_ops_kern, sk));
> > + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> > + possible_net_t, net),
> > + si->dst_reg, si->dst_reg,
> > + offsetof(struct sock_common, skc_net));
> > + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> > + struct ns_common, inum),
> > + si->dst_reg, si->dst_reg,
> > + offsetof(struct net, ns) +
> > + offsetof(struct ns_common, inum));
> > +#else
> > + *insn++ = BPF_MOV64_IMM(si->dst_reg, 0);
> > +#endif
> > + break;
> > +
> > }
> > return insn - insn_buf;
> > }
> > --
> > 2.21.0
> >
--
Iago López Galeiras
Kinvolk GmbH | Adalbertstr. 6a, 10999 Berlin
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000