This is an alternate approach to exposing connection tracking data
to the XDP + eBPF world. Rather than having to rework a number of
helper functions to ignore or rebuild metadata from an skbuff data
segment, we reuse the existing flow offload hooks that expose conntrack
tuples directly based on a flow tuple. As this is an early-version RFC,
the API behavior is definitely going to change. I'll be working on this
unless the flames grow so high that there's no choice but to bail and
let it burn down.
The goal of this work is to integrate the flow offload infrastructure
from netfilter, in a similar way to the approach that flow hw offload
has taken (ie: the 'slowpath' of netfilter does the heavy lifting for
lots of the required functions, like port allocations, helper parsing,
etc). The advatange of building a series like this is two-fold:
1. We can get the advantages of the netfilter infrastructure today,
and pull in functionality via various map types or operations (TBD).
I think the next thing to add to this would be NAT support (so that
we could actually forward end-to-end and watch things go).
2. For the hw offload folks, this gives a way to test out some of the
proposed conntrack API changes without need hardware available
today. In fact, this might let the hardware vendors prototype their
conntrack offload, see where the proposed APIs are lacking (or where
they need reworking), and turn around changes quickly.
It's not all sunshine and roses, though. The first patch in the series is
definitely controversial. It would allow kernel subsystems to register
their own map types at module load time, rather than being compiled in to
the kernel at run-time. I think there is a worry about this kind of
functionality enabling the eBPF ecosystem to fracture. I don't know if
I understand the concern enough. If that's dead in the water, there might
be an alternate approach with out patch 1 (I have a rough sketch in my
head, but haven't coded it up).
I have only done some rudimentary testing with this. Just enough to prove
that I wasn't breaking anything existing. I'm sending this out just as
it matched the first packet (and I'm re-running the build and retesting so
that I didn't forget to save something). So I don't have any benchmark
data, and I don't even have support yet to do anything useful (NAT would
be needed for my IPv4 testing to to proceed, so that's my next task).
I have a small (and hacky) test program at:
https://github.com/orgcandman/conntrack_bpf
It is only used to exercise the lookup call - it doesn't actually prevent
connections from eventually succeeding. I eventually hope to flesh that
out into a bpf implementation of hardware offload (with various features,
like window tracking, flag validation, etc).
Aaron Conole (3):
bpf: modular maps
netfilter: nf_flow_table: support a new 'snoop' mode
netfilter: nf_flow_table_bpf_map: introduce new loadable bpf map
include/linux/bpf.h | 6 +
include/linux/bpf_types.h | 2 +
include/net/netfilter/nf_flow_table.h | 5 +
include/uapi/linux/bpf.h | 7 +
include/uapi/linux/netfilter/nf_tables.h | 2 +
init/Kconfig | 8 +
kernel/bpf/syscall.c | 57 +++++-
net/netfilter/Kconfig | 9 +
net/netfilter/Makefile | 1 +
net/netfilter/nf_flow_table_bpf_flowmap.c | 202 ++++++++++++++++++++++
net/netfilter/nf_flow_table_core.c | 44 ++++-
net/netfilter/nf_tables_api.c | 13 +-
12 files changed, 351 insertions(+), 5 deletions(-)
create mode 100644 net/netfilter/nf_flow_table_bpf_flowmap.c
--
2.19.1
This commit introduces a new loadable map that allows an eBPF program to
query the flow offload tables for specific flow information. For now,
that information is limited to input and output index information. Future
enhancements would be to include connection tracking details, such as
state, metadata, and allow for window validation.
Signed-off-by: Aaron Conole <[email protected]>
---
include/linux/bpf_types.h | 2 +
include/uapi/linux/bpf.h | 7 +
net/netfilter/Kconfig | 9 +
net/netfilter/Makefile | 1 +
net/netfilter/nf_flow_table_bpf_flowmap.c | 202 ++++++++++++++++++++++
5 files changed, 221 insertions(+)
create mode 100644 net/netfilter/nf_flow_table_bpf_flowmap.c
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 44d9ab4809bd..82d3038cf6c3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -71,3 +71,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+
+BPF_MAP_TYPE(BPF_MAP_TYPE_FLOWMAP, loadable_map)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..fb77c8c5c209 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -131,6 +131,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
BPF_MAP_TYPE_QUEUE,
BPF_MAP_TYPE_STACK,
+ BPF_MAP_TYPE_FLOWMAP,
};
enum bpf_prog_type {
@@ -2942,4 +2943,10 @@ struct bpf_flow_keys {
};
};
+struct bpf_flow_map {
+ struct bpf_flow_keys flow;
+ __u32 iifindex;
+ __u32 oifindex;
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 2ab870ef233a..30f1bc9084be 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -709,6 +709,15 @@ config NF_FLOW_TABLE
To compile it as a module, choose M here.
+config NF_FLOW_TABLE_BPF
+ tristate "Netfilter flowtable BPF map"
+ depends on NF_FLOW_TABLE
+ depends on BPF_LOADABLE_MAPS
+ help
+ This option adds support for retrieving flow table entries
+ via a loadable BPF map.
+ To compile it as a module, choose M here.
+
config NETFILTER_XTABLES
tristate "Netfilter Xtables support (required for ip_tables)"
default m if NETFILTER_ADVANCED=n
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 4ddf3ef51ece..8dba928a03fd 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -121,6 +121,7 @@ obj-$(CONFIG_NFT_FWD_NETDEV) += nft_fwd_netdev.o
# flow table infrastructure
obj-$(CONFIG_NF_FLOW_TABLE) += nf_flow_table.o
+obj-$(CONFIG_NF_FLOW_TABLE_BPF) += nf_flow_table_bpf_flowmap.o
nf_flow_table-objs := nf_flow_table_core.o nf_flow_table_ip.o
obj-$(CONFIG_NF_FLOW_TABLE_INET) += nf_flow_table_inet.o
diff --git a/net/netfilter/nf_flow_table_bpf_flowmap.c b/net/netfilter/nf_flow_table_bpf_flowmap.c
new file mode 100644
index 000000000000..577985560883
--- /dev/null
+++ b/net/netfilter/nf_flow_table_bpf_flowmap.c
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2018, Aaron Conole <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/bpf.h>
+#include <net/xdp.h>
+#include <linux/filter.h>
+#include <trace/events/xdp.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_flow_table.h>
+
+struct flow_map_internal {
+ struct bpf_map map;
+ struct nf_flowtable net_flow_table;
+};
+
+static void flow_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
+{
+ map->map_type = attr->map_type;
+ map->key_size = attr->key_size;
+ map->value_size = attr->value_size;
+ map->max_entries = attr->max_entries;
+ map->map_flags = attr->map_flags;
+ map->numa_node = bpf_map_attr_numa_node(attr);
+}
+
+static struct bpf_map *flow_map_alloc(union bpf_attr *attr)
+{
+ struct flow_map_internal *fmap_ret;
+ u64 cost;
+ int err;
+
+ if (!capable(CAP_NET_ADMIN))
+ return ERR_PTR(-EPERM);
+
+ if (attr->max_entries == 0 ||
+ attr->key_size != sizeof(struct bpf_flow_map) ||
+ attr->value_size != sizeof(struct bpf_flow_map))
+ return ERR_PTR(-EINVAL);
+
+ fmap_ret = kzalloc(sizeof(*fmap_ret), GFP_USER);
+ if (!fmap_ret)
+ return ERR_PTR(-ENOMEM);
+
+ flow_map_init_from_attr(&fmap_ret->map, attr);
+ cost = (u64)fmap_ret->map.max_entries * sizeof(struct flow_offload);
+ if (cost >= U32_MAX - PAGE_SIZE) {
+ kfree(&fmap_ret);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ fmap_ret->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+ /* if map size is larger than memlock limit, reject it early */
+ if ((err = bpf_map_precharge_memlock(fmap_ret->map.pages))) {
+ kfree(&fmap_ret);
+ return ERR_PTR(err);
+ }
+
+ memset(&fmap_ret->net_flow_table, 0, sizeof(fmap_ret->net_flow_table));
+ fmap_ret->net_flow_table.flags |= NF_FLOWTABLE_F_SNOOP;
+ nf_flow_table_init(&fmap_ret->net_flow_table);
+
+ return &fmap_ret->map;
+}
+
+static void flow_map_free(struct bpf_map *map)
+{
+ struct flow_map_internal *fmap = container_of(map,
+ struct flow_map_internal,
+ map);
+
+ nf_flow_table_free(&fmap->net_flow_table);
+ synchronize_rcu();
+ kfree(fmap);
+}
+
+static void flow_walk(struct flow_offload *flow, void *data)
+{
+ printk("Flow offload dir0: %x:%d -> %x:%d, %u, %u, %d, %u\n",
+ flow->tuplehash[0].tuple.src_v4.s_addr,
+ flow->tuplehash[0].tuple.src_port,
+ flow->tuplehash[0].tuple.dst_v4.s_addr,
+ flow->tuplehash[0].tuple.dst_port,
+ flow->tuplehash[0].tuple.l3proto,
+ flow->tuplehash[0].tuple.l4proto,
+ flow->tuplehash[0].tuple.iifidx,
+ flow->tuplehash[0].tuple.dir
+ );
+
+ printk("Flow offload dir1: %x:%d -> %x:%d, %u, %u, %d, %u\n",
+ flow->tuplehash[1].tuple.src_v4.s_addr,
+ flow->tuplehash[1].tuple.src_port,
+ flow->tuplehash[1].tuple.dst_v4.s_addr,
+ flow->tuplehash[1].tuple.dst_port,
+ flow->tuplehash[1].tuple.l3proto,
+ flow->tuplehash[1].tuple.l4proto,
+ flow->tuplehash[1].tuple.iifidx,
+ flow->tuplehash[1].tuple.dir
+ );
+}
+
+static void *flow_map_lookup_elem(struct bpf_map *map, void *key)
+{
+ struct flow_map_internal *fmap = container_of(map,
+ struct flow_map_internal, map);
+ struct bpf_flow_map *internal_key = (struct bpf_flow_map *)key;
+ struct flow_offload_tuple_rhash *hash_ret;
+ struct flow_offload_tuple lookup_key;
+
+ memset(&lookup_key, 0, sizeof(lookup_key));
+ lookup_key.src_port = ntohs(internal_key->flow.sport);
+ lookup_key.dst_port = ntohs(internal_key->flow.dport);
+ lookup_key.dir = 0;
+
+ if (internal_key->flow.addr_proto == htons(ETH_P_IP)) {
+ lookup_key.l3proto = AF_INET;
+ lookup_key.src_v4.s_addr = ntohl(internal_key->flow.ipv4_src);
+ lookup_key.dst_v4.s_addr = ntohl(internal_key->flow.ipv4_dst);
+ } else if (internal_key->flow.addr_proto == htons(ETH_P_IPV6)) {
+ lookup_key.l3proto = AF_INET6;
+ memcpy(&lookup_key.src_v6,
+ internal_key->flow.ipv6_src,
+ sizeof(lookup_key.src_v6));
+ memcpy(&lookup_key.dst_v6,
+ internal_key->flow.ipv6_dst,
+ sizeof(lookup_key.dst_v6));
+ } else
+ return NULL;
+
+ lookup_key.l4proto = (u8)internal_key->flow.ip_proto;
+ lookup_key.iifidx = internal_key->iifindex;
+
+ printk("Flow offload lookup: %x:%d -> %x:%d, %u, %u, %d, %u\n",
+ lookup_key.src_v4.s_addr, lookup_key.src_port,
+ lookup_key.dst_v4.s_addr, lookup_key.dst_port,
+ lookup_key.l3proto, lookup_key.l4proto,
+ lookup_key.iifidx, lookup_key.dir);
+ hash_ret = flow_offload_lookup(&fmap->net_flow_table, &lookup_key);
+ if (!hash_ret) {
+ memcpy(&lookup_key.src_v6, internal_key->flow.ipv6_src,
+ sizeof(lookup_key.src_v6));
+ memcpy(&lookup_key.dst_v6, internal_key->flow.ipv6_dst,
+ sizeof(lookup_key.dst_v6));
+ lookup_key.src_port = internal_key->flow.dport;
+ lookup_key.dst_port = internal_key->flow.sport;
+ lookup_key.dir = 1;
+ hash_ret = flow_offload_lookup(&fmap->net_flow_table,
+ &lookup_key);
+ }
+
+ if (!hash_ret) {
+ printk("No flow found, but table is: %d\n",
+ atomic_read(&fmap->net_flow_table.rhashtable.nelems));
+ nf_flow_table_iterate(&fmap->net_flow_table, flow_walk, NULL);
+ return NULL;
+ }
+
+ printk("Flow matched!\n");
+ return key;
+}
+
+static int flow_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+ return 0;
+}
+
+static int flow_map_check_no_btf(const struct bpf_map *map,
+ const struct btf_type *key_type,
+ const struct btf_type *value_type)
+{
+ return -ENOTSUPP;
+}
+
+const struct bpf_map_ops flow_map_ops = {
+ .map_alloc = flow_map_alloc,
+ .map_free = flow_map_free,
+ .map_get_next_key = flow_map_get_next_key,
+ .map_lookup_elem = flow_map_lookup_elem,
+ .map_check_btf = flow_map_check_no_btf,
+};
+
+static int __init flow_map_init(void)
+{
+ bpf_map_insert_ops(BPF_MAP_TYPE_FLOWMAP, &flow_map_ops);
+ return 0;
+}
+
+module_init(flow_map_init);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Aaron Conole <[email protected]>");
--
2.19.1
This commit allows for map operations to be loaded by an lkm, rather than
needing to be baked into the kernel at compile time.
Signed-off-by: Aaron Conole <[email protected]>
---
include/linux/bpf.h | 6 +++++
init/Kconfig | 8 +++++++
kernel/bpf/syscall.c | 57 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 70 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 33014ae73103..bf4531f076ca 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -553,6 +553,7 @@ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type);
int array_map_alloc_check(union bpf_attr *attr);
+void bpf_map_insert_ops(size_t id, const struct bpf_map_ops *ops);
#else /* !CONFIG_BPF_SYSCALL */
static inline struct bpf_prog *bpf_prog_get(u32 ufd)
@@ -665,6 +666,11 @@ static inline struct bpf_prog *bpf_prog_get_type_path(const char *name,
{
return ERR_PTR(-EOPNOTSUPP);
}
+
+static inline void bpf_map_insert_ops(size_t id,
+ const struct bpf_map_ops *ops)
+{
+}
#endif /* CONFIG_BPF_SYSCALL */
static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
diff --git a/init/Kconfig b/init/Kconfig
index a4112e95724a..aa4eb98af656 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1489,6 +1489,14 @@ config BPF_JIT_ALWAYS_ON
Enables BPF JIT and removes BPF interpreter to avoid
speculative execution of BPF instructions by the interpreter
+config BPF_LOADABLE_MAPS
+ bool "Allow map types to be loaded with modules"
+ depends on BPF_SYSCALL && MODULES
+ help
+ Enables BPF map types to be provided by loadable modules
+ instead of always compiled in. Maps provided dynamically
+ may only be used by super users.
+
config USERFAULTFD
bool "Enable userfaultfd() system call"
select ANON_INODES
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf5040fd5434..fa1db9ab81e1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -49,6 +49,8 @@ static DEFINE_SPINLOCK(map_idr_lock);
int sysctl_unprivileged_bpf_disabled __read_mostly;
+const struct bpf_map_ops loadable_map = {};
+
static const struct bpf_map_ops * const bpf_map_types[] = {
#define BPF_PROG_TYPE(_id, _ops)
#define BPF_MAP_TYPE(_id, _ops) \
@@ -58,6 +60,15 @@ static const struct bpf_map_ops * const bpf_map_types[] = {
#undef BPF_MAP_TYPE
};
+static const struct bpf_map_ops * bpf_loadable_map_types[] = {
+#define BPF_PROG_TYPE(_id, _ops)
+#define BPF_MAP_TYPE(_id, _ops) \
+ [_id] = NULL,
+#include <linux/bpf_types.h>
+#undef BPF_PROG_TYPE
+#undef BPF_MAP_TYPE
+};
+
/*
* If we're handed a bigger struct than we know of, ensure all the unknown bits
* are 0 - i.e. new user-space does not rely on any kernel feature extensions
@@ -105,6 +116,48 @@ const struct bpf_map_ops bpf_map_offload_ops = {
.map_check_btf = map_check_no_btf,
};
+/*
+ * Fills in the modular ops map, provided that the entry is not already
+ * filled, and that the caller has CAP_SYS_ADMIN. */
+void bpf_map_insert_ops(size_t id, const struct bpf_map_ops *ops)
+{
+#ifdef CONFIG_BPF_LOADABLE_MAPS
+ if (!capable(CAP_SYS_ADMIN))
+ return;
+
+ if (id >= ARRAY_SIZE(bpf_loadable_map_types))
+ return;
+
+ id = array_index_nospec(id, ARRAY_SIZE(bpf_loadable_map_types));
+ if (bpf_loadable_map_types[id] == NULL)
+ bpf_loadable_map_types[id] = ops;
+#endif
+}
+EXPORT_SYMBOL_GPL(bpf_map_insert_ops);
+
+static const struct bpf_map_ops *find_loadable_ops(u32 type)
+{
+ struct user_struct *user = get_current_user();
+ const struct bpf_map_ops *ops = NULL;
+
+ if (user->uid.val)
+ goto done;
+
+#ifdef CONFIG_BPF_LOADABLE_MAPS
+ if (!capable(CAP_SYS_ADMIN))
+ goto done;
+
+ if (type >= ARRAY_SIZE(bpf_loadable_map_types))
+ goto done;
+ type = array_index_nospec(type, ARRAY_SIZE(bpf_loadable_map_types));
+ ops = bpf_loadable_map_types[type];
+#endif
+
+done:
+ free_uid(user);
+ return ops;
+}
+
static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
{
const struct bpf_map_ops *ops;
@@ -115,7 +168,8 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
if (type >= ARRAY_SIZE(bpf_map_types))
return ERR_PTR(-EINVAL);
type = array_index_nospec(type, ARRAY_SIZE(bpf_map_types));
- ops = bpf_map_types[type];
+ ops = (bpf_map_types[type] != &loadable_map) ? bpf_map_types[type] :
+ find_loadable_ops(type);
if (!ops)
return ERR_PTR(-EINVAL);
@@ -180,6 +234,7 @@ int bpf_map_precharge_memlock(u32 pages)
return -EPERM;
return 0;
}
+EXPORT_SYMBOL_GPL(bpf_map_precharge_memlock);
static int bpf_charge_memlock(struct user_struct *user, u32 pages)
{
--
2.19.1
This patch adds the ability for a flow table to receive updates on
all flows added/removed to any flow table in the system. This will
allow other subsystems in the kernel to register a lookup mechanism
into the nftables connection tracker for those connections which
should be sent to a flow offload table.
Each flow table can now be set with some kinds of flags, and if
one of those flags is the new 'snoop' flag, it will be updated
whenever a flow entry is added or removed to any flow table.
Signed-off-by: Aaron Conole <[email protected]>
---
include/net/netfilter/nf_flow_table.h | 5 +++
include/uapi/linux/netfilter/nf_tables.h | 2 ++
net/netfilter/nf_flow_table_core.c | 44 ++++++++++++++++++++++--
net/netfilter/nf_tables_api.c | 13 ++++++-
4 files changed, 60 insertions(+), 4 deletions(-)
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 77e2761d4f2f..3fdfeb17f500 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -20,9 +20,14 @@ struct nf_flowtable_type {
struct module *owner;
};
+enum nf_flowtable_flags {
+ NF_FLOWTABLE_F_SNOOP = 0x1,
+};
+
struct nf_flowtable {
struct list_head list;
struct rhashtable rhashtable;
+ u32 flags;
const struct nf_flowtable_type *type;
struct delayed_work gc_work;
};
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 7de4f1bdaf06..f1cfe30aecde 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1482,6 +1482,7 @@ enum nft_object_attributes {
* @NFTA_FLOWTABLE_HOOK: netfilter hook configuration(NLA_U32)
* @NFTA_FLOWTABLE_USE: number of references to this flow table (NLA_U32)
* @NFTA_FLOWTABLE_HANDLE: object handle (NLA_U64)
+ * @NFTA_FLOWTABLE_FLAGS: flags (NLA_U32)
*/
enum nft_flowtable_attributes {
NFTA_FLOWTABLE_UNSPEC,
@@ -1491,6 +1492,7 @@ enum nft_flowtable_attributes {
NFTA_FLOWTABLE_USE,
NFTA_FLOWTABLE_HANDLE,
NFTA_FLOWTABLE_PAD,
+ NFTA_FLOWTABLE_FLAGS,
__NFTA_FLOWTABLE_MAX
};
#define NFTA_FLOWTABLE_MAX (__NFTA_FLOWTABLE_MAX - 1)
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index b7a4816add76..289a2299eea2 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -15,6 +15,7 @@
struct flow_offload_entry {
struct flow_offload flow;
struct nf_conn *ct;
+ struct nf_flow_route route;
struct rcu_head rcu_head;
};
@@ -78,6 +79,7 @@ flow_offload_alloc(struct nf_conn *ct, struct nf_flow_route *route)
goto err_dst_cache_reply;
entry->ct = ct;
+ entry->route = *route;
flow_offload_fill_dir(flow, ct, route, FLOW_OFFLOAD_DIR_ORIGINAL);
flow_offload_fill_dir(flow, ct, route, FLOW_OFFLOAD_DIR_REPLY);
@@ -100,6 +102,18 @@ flow_offload_alloc(struct nf_conn *ct, struct nf_flow_route *route)
}
EXPORT_SYMBOL_GPL(flow_offload_alloc);
+static struct flow_offload *flow_offload_clone(struct flow_offload *flow)
+{
+ struct flow_offload *clone_flow_val;
+ struct flow_offload_entry *e;
+
+ e = container_of(flow, struct flow_offload_entry, flow);
+
+ clone_flow_val = flow_offload_alloc(e->ct, &e->route);
+
+ return clone_flow_val;
+}
+
static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
{
tcp->state = TCP_CONNTRACK_ESTABLISHED;
@@ -182,7 +196,7 @@ static const struct rhashtable_params nf_flow_offload_rhash_params = {
.automatic_shrinking = true,
};
-int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
+static void __flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
{
flow->timeout = (u32)jiffies;
@@ -192,12 +206,30 @@ int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
rhashtable_insert_fast(&flow_table->rhashtable,
&flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].node,
nf_flow_offload_rhash_params);
+}
+
+int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow)
+{
+ struct nf_flowtable *flowtable;
+
+ __flow_offload_add(flow_table, flow);
+
+ mutex_lock(&flowtable_lock);
+ list_for_each_entry(flowtable, &flowtables, list) {
+ if (flowtable != flow_table &&
+ flowtable->flags & NF_FLOWTABLE_F_SNOOP) {
+ struct flow_offload *flow_clone =
+ flow_offload_clone(flow);
+ __flow_offload_add(flowtable, flow_clone);
+ }
+ }
+ mutex_unlock(&flowtable_lock);
return 0;
}
EXPORT_SYMBOL_GPL(flow_offload_add);
-static void flow_offload_del(struct nf_flowtable *flow_table,
- struct flow_offload *flow)
+static void __flow_offload_del(struct nf_flowtable *flow_table,
+ struct flow_offload *flow)
{
struct flow_offload_entry *e;
@@ -210,6 +242,12 @@ static void flow_offload_del(struct nf_flowtable *flow_table,
e = container_of(flow, struct flow_offload_entry, flow);
clear_bit(IPS_OFFLOAD_BIT, &e->ct->status);
+}
+
+static void flow_offload_del(struct nf_flowtable *flow_table,
+ struct flow_offload *flow)
+{
+ __flow_offload_del(flow_table, flow);
flow_offload_free(flow);
}
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 42487d01a3ed..8148de9f9a54 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -5569,6 +5569,15 @@ static int nf_tables_newflowtable(struct net *net, struct sock *nlsk,
if (err < 0)
goto err3;
+ if (nla[NFTA_FLOWTABLE_FLAGS]) {
+ flowtable->data.flags =
+ ntohl(nla_get_be32(nla[NFTA_FLOWTABLE_FLAGS]));
+ if (flowtable->data.flags & ~NF_FLOWTABLE_F_SNOOP) {
+ err = -EINVAL;
+ goto err4;
+ }
+ }
+
err = nf_tables_flowtable_parse_hook(&ctx, nla[NFTA_FLOWTABLE_HOOK],
flowtable);
if (err < 0)
@@ -5694,7 +5703,9 @@ static int nf_tables_fill_flowtable_info(struct sk_buff *skb, struct net *net,
nla_put_string(skb, NFTA_FLOWTABLE_NAME, flowtable->name) ||
nla_put_be32(skb, NFTA_FLOWTABLE_USE, htonl(flowtable->use)) ||
nla_put_be64(skb, NFTA_FLOWTABLE_HANDLE, cpu_to_be64(flowtable->handle),
- NFTA_FLOWTABLE_PAD))
+ NFTA_FLOWTABLE_PAD) ||
+ nla_put_be32(skb, NFTA_FLOWTABLE_FLAGS,
+ htonl(flowtable->data.flags)))
goto nla_put_failure;
nest = nla_nest_start(skb, NFTA_FLOWTABLE_HOOK);
--
2.19.1
On Sun, Nov 25, 2018 at 01:09:17PM -0500, Aaron Conole wrote:
> This commit allows for map operations to be loaded by an lkm, rather than
> needing to be baked into the kernel at compile time.
Nack.
Please see Documentation/bpf/bpf_design_QA.rst
Alexei Starovoitov <[email protected]> writes:
> On Sun, Nov 25, 2018 at 01:09:17PM -0500, Aaron Conole wrote:
>> This commit allows for map operations to be loaded by an lkm, rather than
>> needing to be baked into the kernel at compile time.
>
> Nack.
> Please see Documentation/bpf/bpf_design_QA.rst
Thanks for the review, Alexei! I hadn't been aware of this doc, so it's
good to read through it.
I see that the following is there:
Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
A: NO.
So, I think that means that even changing the helpers would be of no
value (since it would only be useful in the event that the kernel were
compiled with netfilter linked in, rather than as a module - and I doubt
there's any place where that would be true).
BUT, as I wrote in my cover I have some alternative approaches that I've
thought about, and I'm listing here. Can you let me know if any would
be acceptable? If none are, is there an alternative approach you would
support?
1. Introduce flowmap again, this time, basically having it close to a
copy of the hashmap. Introduce a few function calls that allow an
external module to easily manipulate all maps of that type to insert
/ remove / update entries. This makes it similar to, for example,
devmap.
2. Introduce a way to share maps between modules. IE:
something like bpf_helper_expose_map(map_id, module_name). Then have
netfilter and eBPF share access to the hashmap. Netfilter and the
ebpf program will need to agree on the key and value size, and will
push data in/out.
3. Introduce an xdp/ebpf hook in the flowmap.
IE: nft add flowmap xx { .program=foo.o }
Then that will be called with a context, and can update a shared map
with userspace. I haven't though out all the logistics on how to do
this, but it would put the onus of sharing the information on the
bpf program writer.
What do you think?
On Tue, Nov 27, 2018 at 09:24:05AM -0500, Aaron Conole wrote:
>
> 1. Introduce flowmap again, this time, basically having it close to a
> copy of the hashmap. Introduce a few function calls that allow an
> external module to easily manipulate all maps of that type to insert
> / remove / update entries. This makes it similar to, for example,
> devmap.
what is a flowmap?
How is this flowmap different from existing hash, lpm and lru maps?
'close to a copy of hashmap'... why hashmap is not enough for your purpose?
Alexei Starovoitov <[email protected]> writes:
> On Tue, Nov 27, 2018 at 09:24:05AM -0500, Aaron Conole wrote:
>>
>> 1. Introduce flowmap again, this time, basically having it close to a
>> copy of the hashmap. Introduce a few function calls that allow an
>> external module to easily manipulate all maps of that type to insert
>> / remove / update entries. This makes it similar to, for example,
>> devmap.
>
> what is a flowmap?
> How is this flowmap different from existing hash, lpm and lru maps?
The biggest difference is how relationship works. Normal map would
have single key and single value. Flow map needs to have two keys
"single-value," because there are two sets of flow tuples to track
(forward and reverse direction). That means that when updating the k-v
pairs, we need to ensure that the data is always consistent and up to
date. Probably we could do that with the existing maps if we had some
kind of allocation mechanism, too (so, keep a pointer to data from two
keys - not sure if there's a way to do that in ebpf)?
Still I need a way to get the conntrack information from netfilter (or
really any other entity that will provide it) into the bpf map, whatever
map type it takes.
> 'close to a copy of hashmap'... why hashmap is not enough for your purpose?
It might be (see the item 2. in that list). I'm trying to allow
netfilter conntrack to update the bpf map so that the flow offload data
is available, and make sure that when I look up a 5-tuple from the
bpf program in the map, I get the appropriate flow-offload data (ie:
forward direction addresses could be different from reverse direction so
just swapping addresses / ports will not match). Like I wrote in the
cover letter (but probably poorly, sorry for that), I want to forward
packets into the stack until a connection is added to the table, and
then push the packets directly to the places they need to go, doing the
nat etc. That lets us use xdp as a fast forwarding path for
connections, getting all of the advantage of helper modules to do the
control / parsing, and all the advantage of xdp for packet movement.
Maybe I don't see a better solution, though - or possibly there's a more
generic approach that works better.
On Wed, Nov 28, 2018 at 01:51:42PM -0500, Aaron Conole wrote:
> Alexei Starovoitov <[email protected]> writes:
>
> > On Tue, Nov 27, 2018 at 09:24:05AM -0500, Aaron Conole wrote:
> >>
> >> 1. Introduce flowmap again, this time, basically having it close to a
> >> copy of the hashmap. Introduce a few function calls that allow an
> >> external module to easily manipulate all maps of that type to insert
> >> / remove / update entries. This makes it similar to, for example,
> >> devmap.
> >
> > what is a flowmap?
> > How is this flowmap different from existing hash, lpm and lru maps?
>
> The biggest difference is how relationship works. Normal map would
> have single key and single value. Flow map needs to have two keys
> "single-value," because there are two sets of flow tuples to track
> (forward and reverse direction). That means that when updating the k-v
> pairs, we need to ensure that the data is always consistent and up to
> date. Probably we could do that with the existing maps if we had some
> kind of allocation mechanism, too (so, keep a pointer to data from two
> keys - not sure if there's a way to do that in ebpf)?
just swap the src/dst ips inside bpf program depending on direction
and use the same hash map.
That's what xdp/bpf users already do pretty successfully.
bpf hash map is already offloaded into hw too.
> forward direction addresses could be different from reverse direction so
> just swapping addresses / ports will not match).
That makes no sense to me. What would be an example of such flow?
Certainly not a tcp flow.
> That lets us use xdp as a fast forwarding path for
> connections, getting all of the advantage of helper modules to do the
> control / parsing, and all the advantage of xdp for packet movement.
From 10k feet view it sounds correct, but details make no sense.
You're saying doing nat in the stack, but that _is_ the packet movement
where you wanted to use xdp.
Alexei Starovoitov <[email protected]> writes:
> On Wed, Nov 28, 2018 at 01:51:42PM -0500, Aaron Conole wrote:
>> Alexei Starovoitov <[email protected]> writes:
>>
>> > On Tue, Nov 27, 2018 at 09:24:05AM -0500, Aaron Conole wrote:
>> >>
>> >> 1. Introduce flowmap again, this time, basically having it close to a
>> >> copy of the hashmap. Introduce a few function calls that allow an
>> >> external module to easily manipulate all maps of that type to insert
>> >> / remove / update entries. This makes it similar to, for example,
>> >> devmap.
>> >
>> > what is a flowmap?
>> > How is this flowmap different from existing hash, lpm and lru maps?
>>
>> The biggest difference is how relationship works. Normal map would
>> have single key and single value. Flow map needs to have two keys
>> "single-value," because there are two sets of flow tuples to track
>> (forward and reverse direction). That means that when updating the k-v
>> pairs, we need to ensure that the data is always consistent and up to
>> date. Probably we could do that with the existing maps if we had some
>> kind of allocation mechanism, too (so, keep a pointer to data from two
>> keys - not sure if there's a way to do that in ebpf)?
>
> just swap the src/dst ips inside bpf program depending on direction
> and use the same hash map.
That won't work. I'll explain below.
> That's what xdp/bpf users already do pretty successfully.
> bpf hash map is already offloaded into hw too.
While this is one reason to use hash map, I don't think we should use
this as a reason to exclude development of a data type that may work
better. After all, if we can do better then we should.
>> forward direction addresses could be different from reverse direction so
>> just swapping addresses / ports will not match).
>
> That makes no sense to me. What would be an example of such flow?
> Certainly not a tcp flow.
Maybe it's poorly worded on my part. Think about this scenario (ipv4, tcp):

Interfaces A(internet), B(lan)

When XDP program receives a packet from B, it will have a tuple like:

  source=B-subnet:B-port dest=inet-addr:inet-port

When XDP program receives a packet from A, it will have a tuple like:

  source=inet-addr:inet-port dest=gw-addr:gw-port

The only data in common there is inet-addr:inet-port, and that will
likely be shared among too many connections to be a valid key.
I don't know how to determine, from the A-side tuple, the connection
that corresponds to B. A really simple static map works, *except* that
when something causes either side of the connection to become invalid,
I can't mark the other side. For instance, even with some static
mapping, I might not be able to infer the correct B-side tuple from the
A-side tuple to do the teardown.
I might be too naive to see the right approach, though - maybe I'm
over-complicating something?
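For what it's worth, the way netfilter's conntrack solves this is to
store both tuples in one entry (as nf_conntrack_tuple_hash does), so a
lookup on either side resolves to the same record. A rough userspace
sketch of that pairing (structures are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

struct tuple {
    uint32_t sip, dip;
    uint16_t sport, dport;
};

/* One tracked connection owns two tuples: the original direction and
 * the (possibly NATed) reply direction.  A lookup on either tuple
 * resolves to the same entry, so invalidating one side automatically
 * invalidates the other. */
struct ct_entry {
    struct tuple orig;   /* B-side: B-subnet:B-port -> inet-addr:inet-port */
    struct tuple reply;  /* A-side: inet-addr:inet-port -> gw-addr:gw-port */
    int valid;
};

static int tuple_eq(const struct tuple *a, const struct tuple *b)
{
    return a->sip == b->sip && a->dip == b->dip &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Linear scan stands in for a real hash lookup. */
static struct ct_entry *ct_lookup(struct ct_entry *tbl, int n,
                                  const struct tuple *t)
{
    for (int i = 0; i < n; i++)
        if (tbl[i].valid &&
            (tuple_eq(&tbl[i].orig, t) || tuple_eq(&tbl[i].reply, t)))
            return &tbl[i];
    return 0;
}
```

Marking the entry invalid through either tuple tears down both
directions at once, which is exactly what the static map can't do.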
>> That lets us use xdp as a fast forwarding path for
>> connections, getting all of the advantage of helper modules to do the
>> control / parsing, and all the advantage of xdp for packet movement.
>
> From 10k feet view it sounds correct, but details make no sense.
> You're saying doing nat in the stack, but that _is_ the packet movement
> where you wanted to use xdp.
The things I want to use the stack for are things that will always be
slow anyway, or that require massive system input to do correctly. Here
are some examples:
1. Port / address reservation. If I want to do NAT, I need to reserve
ports and addresses correctly. That requires knowing the interface
addresses and which addresses are currently allocated. The stack
already knows this, so let it do these allocations. Then, when
packets arrive for the connection that the stack set up, just forward
them via XDP.
2. Helpers. Parsing an in-flight stream is always going to be slow.
Let the stack do that. But when it sets up an expectation, use
that information to forward the expected flow via XDP.
So I would use the stack for the initial handshakes. Once the handshake
is complete, and we know where the packet is destined to go, all that
data is shoved into a map that the XDP program can access, and we
do the data forwarding.
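So the per-packet decision in the XDP program reduces to something like
this sketch (plain C standing in for the BPF program; the flow-state
lookup stands in for a map lookup, and the names are illustrative):

```c
#include <assert.h>

enum verdict { PASS_TO_STACK, FASTPATH_FORWARD };

struct flow_state { int established; int out_ifindex; };

/* Per-packet decision: if the slowpath (the stack) has already set up
 * this connection and published it to the map, forward directly from
 * XDP; otherwise hand the packet up so the stack can run the
 * handshake, helpers, and NAT setup. */
static enum verdict fastpath_verdict(const struct flow_state *st)
{
    if (st && st->established)
        return FASTPATH_FORWARD; /* would be XDP_TX / XDP_REDIRECT */
    return PASS_TO_STACK;        /* would be XDP_PASS */
}
```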
Hope it helps.
On Fri, Nov 30, 2018 at 08:49:17AM -0500, Aaron Conole wrote:
>
> While this is one reason to use hash map, I don't think we should use
> this as a reason to exclude development of a data type that may work
> better. After all, if we can do better then we should.
I'm all for improving the existing hash map or implementing new data types.
Like a classifier map == same as a wild-card match map == ACL map.
The one that OVS folks could use and that other folks have wanted for a long time.
But I don't want bpf to become a collection of single-purpose solutions.
Like a mega-flow style OVS map.
That one does a linear number of lookups, applying one mask at a time.
It sounds to me that you're proposing "NAT-as-bpf-helper"
or "NAT-as-bpf-map" type of solution.
That falls into single purpose solution category.
I'd rather see generic connection tracking building block.
The one that works out of skb and out of XDP layer.
The existing stack-queue-map can already be used to allocate integers
out of a specified range. It can be used to implement port allocation for NAT.
If the generic stack-queue-map is not enough, let's improve it.
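E.g. pre-fill the map with the usable port range and pop one entry per
new NAT binding. A userspace sketch of that allocation pattern (the
fixed ring below stands in for a BPF_MAP_TYPE_QUEUE driven by
bpf_map_pop_elem/bpf_map_push_elem; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* FIFO of free ports; popping allocates, pushing frees.  A
 * BPF_MAP_TYPE_QUEUE pre-filled with the NAT port range would be used
 * the same way from a program. */
struct port_pool {
    uint16_t slots[64];
    int head, tail, count;
};

static void pool_init(struct port_pool *p, uint16_t lo, uint16_t hi)
{
    p->head = p->tail = p->count = 0;
    for (uint16_t port = lo; port <= hi && p->count < 64; port++) {
        p->slots[p->tail++ % 64] = port;
        p->count++;
    }
}

static int pool_pop(struct port_pool *p, uint16_t *port)
{
    if (p->count == 0)
        return -1;              /* range exhausted */
    *port = p->slots[p->head++ % 64];
    p->count--;
    return 0;
}

static void pool_push(struct port_pool *p, uint16_t port)
{
    p->slots[p->tail++ % 64] = port;
    p->count++;
}
```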
> >> forward direction addresses could be different from reverse direction so
> >> just swapping addresses / ports will not match).
> >
> > That makes no sense to me. What would be an example of such flow?
> > Certainly not a tcp flow.
>
> Maybe it's poorly worded on my part. Think about this scenario (ipv4, tcp):
>
> Interfaces A(internet), B(lan)
>
> When XDP program receives a packet from B, it will have a tuple like:
>
> source=B-subnet:B-port dest=inet-addr:inet-port
>
> When XDP program receives a packet from A, it will have a tuple like:
>
> source=inet-addr:inet-port dest=gw-addr:gw-port
First of all, there are two netdevs.
One XDP program can attach to multiple netdevs, but in this
case we're dealing with two independent tcp flows.
> The only data in common there is inet-addr:inet-port, and that will
> likely be shared among too many connections to be a valid key.
Two independent tcp flows don't make a 'connection'.
That definition of connection is only meaningful in the context
of the particular problem you're trying to solve, and it
confuses me quite a bit.
> I don't know how to figure out from A the same connection that
> corresponds to B. A really simple static map works, *except*, when
> something causes either side of the connection to become invalid, I
> can't mark the other side. For instance, even if I have some static
> mapping, I might not be able to infer the correct B-side tuple from the
> A-side tuple to do the teardown.
I don't think I got enough information from the above description to
understand why two tcp flows (same as two tcp connections) will
form single 'connection' in your definition of connection.
> 1. Port / address reservation. If I want to do NAT, I need to reserve
> ports and addresses correctly. That requires knowing the interface
> addresses, and which addresses are currently allocated. The stack
> knows this already, let it do these allocations then. Then when
> packets arrive for the connection that the stack set up, just forward
> via XDP.
I beg to disagree. For NAT use case the stack has nothing to do with
port allocation for NATing. It's all within NAT framework
(whichever way it's implemented).
The stack cares about sockets and ports that are open on the host
to be consumed by the host.
NAT function is independent of that.
> 2. Helpers. Parsing an in-flight stream is always going to be slow.
> Let the stack do that. But when it sets up an expectation, then use
> that information to forward that via XDP.
XDP parses packets way faster than the stack, since XDP deals with linear
buffers whereas stack has to do pskb_may_pull at every step.
The stack can be optimized further, but assuming that packet parsing
by the stack is faster than XDP and making technical decisions based
on that just doesn't seem like the right approach to take.
Alexei Starovoitov <[email protected]> writes:
> On Fri, Nov 30, 2018 at 08:49:17AM -0500, Aaron Conole wrote:
>>
>> While this is one reason to use hash map, I don't think we should use
>> this as a reason to exclude development of a data type that may work
>> better. After all, if we can do better then we should.
>
> I'm all for improving existing hash map or implementing new data types.
> Like classifier map == same as wild-card match map == ACL map.
> The one that OVS folks could use and that other folks have wanted for a long time.
>
> But I don't want bpf to become a collection of single purpose solutions.
> Like mega-flow style OVS map.
> That one does a linear number of lookups, applying one mask at a time.
>
> It sounds to me that you're proposing "NAT-as-bpf-helper"
> or "NAT-as-bpf-map" type of solution.
Maybe that's what this particular iteration is. But I'm open to a
different implementation. My requirements aren't fixed to a specific
map type.
> That falls into single purpose solution category.
> I'd rather see generic connection tracking building block.
> The one that works out of skb and out of XDP layer.
> Existing stack-queue-map can already be used to allocate integers
> out of a specified range. It can be used to implement port allocation for NAT.
> If generic stack-queue-map is not enough, let's improve it.
I don't understand this. You say you want something that works out of
the skb and XDP layers, but then you advocate an eBPF approach (which
would only be useful from XDP). Plus, a specialized mechanism already
exists for FIB lookups. Not sure why this conntrack assist would be
rejected as too specialized?
I was thinking of re-using the existing conntrack framework and making
the metadata available from the eBPF context. That could be used even
outside the XDP layer (for instance, by some tracing program, or another
accounting / auditing tool like a HIDS).
Anyway, as I wrote, there are other approaches. But maybe instead of a
flowmap, an mkmap would make sense (this is a multi-key map, which allows
a single value to be reached via multiple keys). I also described some
other approaches I was considering in an earlier mail. Maybe one of those
is a better direction?
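To make the mkmap idea concrete, here's a rough userspace sketch of two
key spaces resolving to one shared value slot (all names hypothetical;
a real implementation would hash rather than scan):

```c
#include <assert.h>
#include <stdint.h>

#define MK_MAX 16

/* Two independent key spaces (e.g. forward and reverse tuples) map to
 * the same slot in a shared value array, so an update or delete
 * through either key is observed through both. */
struct mkmap {
    uint64_t key_a[MK_MAX];   /* forward-direction key */
    uint64_t key_b[MK_MAX];   /* reverse-direction key */
    int value[MK_MAX];
    int used[MK_MAX];
};

static int mkmap_insert(struct mkmap *m, uint64_t ka, uint64_t kb, int val)
{
    for (int i = 0; i < MK_MAX; i++) {
        if (!m->used[i]) {
            m->key_a[i] = ka;
            m->key_b[i] = kb;
            m->value[i] = val;
            m->used[i] = 1;
            return i;
        }
    }
    return -1;
}

static int *mkmap_lookup(struct mkmap *m, uint64_t key)
{
    for (int i = 0; i < MK_MAX; i++)
        if (m->used[i] && (m->key_a[i] == key || m->key_b[i] == key))
            return &m->value[i];
    return 0;
}

static void mkmap_delete(struct mkmap *m, uint64_t key)
{
    for (int i = 0; i < MK_MAX; i++)
        if (m->used[i] && (m->key_a[i] == key || m->key_b[i] == key))
            m->used[i] = 0;   /* removes both keys at once */
}
```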
>> >> forward direction addresses could be different from reverse direction so
>> >> just swapping addresses / ports will not match).
>> >
>> > That makes no sense to me. What would be an example of such flow?
>> > Certainly not a tcp flow.
>>
>> Maybe it's poorly worded on my part. Think about this scenario (ipv4, tcp):
>>
>> Interfaces A(internet), B(lan)
>>
>> When XDP program receives a packet from B, it will have a tuple like:
>>
>> source=B-subnet:B-port dest=inet-addr:inet-port
>>
>> When XDP program receives a packet from A, it will have a tuple like:
>>
>> source=inet-addr:inet-port dest=gw-addr:gw-port
>
> first of all there are two netdevs.
> one XDP program can attach to multiple netdevs, but in this
> case we're dealing with two independent tcp flows.
>
>> The only data in common there is inet-addr:inet-port, and that will
>> likely be shared among too many connections to be a valid key.
>
> two independent tcp flows don't make a 'connection'.
> That definition of connection is only meaningful in the context
> of the particular problem you're trying to solve and
> confuses me quite a bit.
I don't understand this.
They aren't independent. We need to properly account the packets, and
we need to apply policy decisions to either side. Even though the
tuples are asymmetric, the connection *is* the same. If you treat the
two directions separately, then you lose the ability to account for
them properly. Something needs to make the association.
>> I don't know how to figure out from A the same connection that
>> corresponds to B. A really simple static map works, *except*, when
>> something causes either side of the connection to become invalid, I
>> can't mark the other side. For instance, even if I have some static
>> mapping, I might not be able to infer the correct B-side tuple from the
>> A-side tuple to do the teardown.
>
> I don't think I got enough information from the above description to
> understand why two tcp flows (same as two tcp connections) will
> form single 'connection' in your definition of connection.
They aren't two connections. Maybe there's something I'm missing.
>> 1. Port / address reservation. If I want to do NAT, I need to reserve
>> ports and addresses correctly. That requires knowing the interface
>> addresses, and which addresses are currently allocated. The stack
>> knows this already, let it do these allocations then. Then when
>> packets arrive for the connection that the stack set up, just forward
>> via XDP.
>
> I beg to disagree. For NAT use case the stack has nothing to do with
> port allocation for NATing. It's all within NAT framework
> (whichever way it's implemented).
> The stack cares about sockets and ports that are open on the host
> to be consumed by the host.
> NAT function is independent of that.
It's related. If the host has a particular port open, NAT can't reuse
it when NATing from a host IP.
So the NAT port allocation *must* take host ports into account.
>> 2. Helpers. Parsing an in-flight stream is always going to be slow.
>> Let the stack do that. But when it sets up an expectation, then use
>> that information to forward that via XDP.
>
> XDP parses packets way faster than the stack, since XDP deals with linear
> buffers whereas stack has to do pskb_may_pull at every step.
Sure.
> The stack can be optimized further, but assuming that packet parsing
> by the stack is faster than XDP and making technical decisions based
> on that just doesn't seem like the right approach to take.
Agreed that packet parsing can be faster in XDP. But my point is that
packet parsing is *slow* no matter what, and the DPI required to
implement helpers is complex and slow. The instant you need to parse
H.323 or implement some kind of SIP logic for a conntrack helper, you
will run out of instructions and tail-call iterations in eBPF. Even
simple FTP parsing might not be 'good enough' from a throughput
standpoint. The idea here is for control connections to traverse the
stack (since throughput isn't the gating factor there), while the data
connections (which need maximum throughput) can be switched via the xdp
mechanism.
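For example, with FTP the slowpath helper would parse the PORT/PASV
exchange on the control connection and publish the expected
data-connection endpoint for the XDP program to match. Roughly (the
structures and names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Published by the stack-side helper after parsing the control
 * channel; consumed read-only by the XDP fastpath. */
struct expectation {
    uint32_t ip;        /* expected data-connection endpoint */
    uint16_t port;
    int active;
};

#define MAX_EXPECT 8
static struct expectation expect_tbl[MAX_EXPECT];

/* Helper side: e.g. after parsing "PORT 10,0,0,5,200,21" on the FTP
 * control connection (port = 200*256+21 = 51221), register the
 * announced endpoint. */
static int expect_add(uint32_t ip, uint16_t port)
{
    for (int i = 0; i < MAX_EXPECT; i++) {
        if (!expect_tbl[i].active) {
            expect_tbl[i].ip = ip;
            expect_tbl[i].port = port;
            expect_tbl[i].active = 1;
            return 0;
        }
    }
    return -1;
}

/* Fastpath side: does this packet belong to an expected data flow? */
static int expect_match(uint32_t ip, uint16_t port)
{
    for (int i = 0; i < MAX_EXPECT; i++)
        if (expect_tbl[i].active &&
            expect_tbl[i].ip == ip && expect_tbl[i].port == port)
            return 1;
    return 0;
}
```

The slow DPI happens once, on the control channel, in the stack; the
data channel never pays for it.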