In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
subsystem. The hinting optimization improves performance by reusing
previously found dsts instead of looking them up for each skb.

This patch series introduces a generalized version of the hinting
mechanism that can "remember" a larger number of dsts. This reduces the
number of dst lookups for frequently encountered daddrs.
Before diving into the code and the benchmarking results, it's important
to address the deletion of the old route cache [2] and to explain why
this solution is different. The original cache was complicated,
vulnerable to DoS attacks and had unstable performance.

The new input dst_cache is much simpler thanks to its lazy approach,
improving performance without the overhead of the removed cache
implementation. Instead of using timers and GC, invalid entries are
deleted lazily when a lookup finds them.
The dsts are stored in a simple, lightweight, static hash table. This
keeps lookup times fast yet stable, preventing DoS upon cache misses.
The new input dst_cache implementation is built on top of the existing
dst_cache code, which provides fast, lockless per-CPU behavior.
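
To make the lazy approach concrete, here is the per-CPU lookup path in
condensed form (the full implementation is in patch 2):

	/* Condensed from patch 2: fixed-size buckets of
	 * DST_CACHE_INPUT_BUCKET_SIZE (4) entries; an entry whose dst
	 * became obsolete is invalidated here, during lookup, rather
	 * than by a timer or GC.
	 */
	key = create_dst_cache_key_ip4(skb);	/* (daddr << 8) | tos */
	hash = hash_dst_cache_key(key);
	idst_for_each_in_bucket(idst, pcpu_cache, hash) {
		if (idst->key == key)
			return __dst_cache_input_get_noref(idst);
	}
	return NULL;	/* miss: fall back to the normal lookup */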

I tested this patch using UDP floods with different numbers of daddrs.
The benchmarking setup comprises 3 machines: a sender, a forwarder and
a receiver. I measured the PPS received by the receiver while the
forwarder was running either the mainline kernel or the patched kernel,
and compared the results. The dst_cache I tested in this benchmark used
a total of 512 hash table entries, split into buckets of 4 entries
each.

These are the results:
UDP conns  mainline  patched   delta
per cpu    Kpps      Kpps      %
    1      274.0255  269.2205  -1.75
    2      257.3748  268.0947   4.17
   15      241.3513  258.8016   7.23
  100      238.3419  258.4939   8.46
  500      238.5390  252.6425   5.91
 1000      238.7570  242.1820   1.43
 2000      238.7780  236.2640  -1.05
 4000      239.0440  233.5320  -2.31
 8000      239.3248  232.5680  -2.82

As the table shows, this patch improves performance up until ~1500
concurrent connections, after which performance starts to degrade due
to the growing number of cache misses.
It's important to note that even in the worst scenario, where every
packet causes a cache miss, the degradation is bounded: because the
cache and bucket sizes are fixed, a miss costs no more than a scan of
one DST_CACHE_INPUT_BUCKET_SIZE (4) entry bucket. This makes the cache
resistant to DoS attacks.
Based on the above measurements, it seems that the performance
degradation flattens out at around 3%. Note that the number of
concurrent connections at which performance starts to degrade depends
on the cache size and the number of CPUs.
[1] https://lore.kernel.org/netdev/[email protected]/
[2] https://lore.kernel.org/netdev/[email protected]/
v1:
- fix a typo while allocating the per-cpu cache
- set IPSKB_DOREDIRECT correctly when using a dst from the dst_cache
- always compile dst_cache
RFC-v2:
- remove unnecessary macro
- move inline to .h file
RFC-v1: https://lore.kernel.org/netdev/[email protected]/
RFC-v2: https://lore.kernel.org/netdev/[email protected]/
Leone Fernando (4):
net: route: expire rt if the dst it holds is expired
net: dst_cache: add input_dst_cache API
net: route: always compile dst_cache
net: route: replace route hints with input_dst_cache
drivers/net/Kconfig | 1 -
include/net/dst_cache.h | 68 +++++++++++++++++++
include/net/dst_metadata.h | 2 -
include/net/ip_tunnels.h | 2 -
include/net/route.h | 6 +-
net/Kconfig | 4 --
net/core/Makefile | 3 +-
net/core/dst.c | 4 --
net/core/dst_cache.c | 132 +++++++++++++++++++++++++++++++++++++
net/ipv4/Kconfig | 1 -
net/ipv4/ip_input.c | 58 ++++++++--------
net/ipv4/ip_tunnel_core.c | 4 --
net/ipv4/route.c | 75 +++++++++++++++------
net/ipv4/udp_tunnel_core.c | 4 --
net/ipv6/Kconfig | 4 --
net/ipv6/ip6_udp_tunnel.c | 4 --
net/netfilter/nft_tunnel.c | 2 -
net/openvswitch/Kconfig | 1 -
net/sched/act_tunnel_key.c | 2 -
19 files changed, 291 insertions(+), 86 deletions(-)
--
2.34.1
The function rt_is_expired() is used to verify that a cached dst is
still valid. Currently, this function ignores the rth->dst.expires
value.
Add a check to rt_is_expired() that validates that the dst is not
expired. Note that a zero dst.expires means the dst never expires, so
the new condition only triggers for dsts that carry an expiry.
Signed-off-by: Leone Fernando <[email protected]>
---
net/ipv4/route.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c8f76f56dc16..73e742424e97 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -392,7 +392,8 @@ static inline int ip_rt_proc_init(void)
static inline bool rt_is_expired(const struct rtable *rth)
{
- return rth->rt_genid != rt_genid_ipv4(dev_net(rth->dst.dev));
+ return rth->rt_genid != rt_genid_ipv4(dev_net(rth->dst.dev)) ||
+ (rth->dst.expires && time_after(jiffies, rth->dst.expires));
}
void rt_cache_flush(struct net *net)
--
2.34.1
The input_dst_cache allows fast lookup of frequently encountered dsts.
In order to provide stable results, I implemented a simple linear
hashtable in which each bucket contains a constant number of entries
(DST_CACHE_INPUT_BUCKET_SIZE).
Similarly to how the route hint is used, I defined the hashtable key to
contain the daddr and the tos of the IP header.
Lookup is performed in a straightforward manner: start at the bucket
head corresponding to the hashed key and search the following
DST_CACHE_INPUT_BUCKET_SIZE entries of the array for a matching key.
When inserting a new dst into the cache, if all the bucket entries are
full, the oldest one is evicted to make room for the new dst.
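
To make the key and bucket computation concrete (using the constants
introduced in this patch):

	/* The key packs the 32-bit daddr above the 8-bit tos, so flows
	 * to the same daddr with different tos values get distinct
	 * keys.
	 */
	u64 key = ((u64)iph->daddr << 8) | iph->tos;

	/* hash_64() yields a DST_CACHE_INPUT_SHIFT-bit (9-bit) value in
	 * 0..511; masking with DST_CACHE_INPUT_HASH_MASK rounds it down
	 * to a multiple of DST_CACHE_INPUT_BUCKET_SIZE, i.e. the index
	 * of the bucket head. The bucket spans entries hash..hash + 3.
	 */
	u32 hash = hash_64(key, DST_CACHE_INPUT_SHIFT) &
		   DST_CACHE_INPUT_HASH_MASK;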
Signed-off-by: Leone Fernando <[email protected]>
---
include/net/dst_cache.h | 68 +++++++++++++++++++++
net/core/dst_cache.c | 132 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 200 insertions(+)
diff --git a/include/net/dst_cache.h b/include/net/dst_cache.h
index df6622a5fe98..1c3b06f08d7d 100644
--- a/include/net/dst_cache.h
+++ b/include/net/dst_cache.h
@@ -7,12 +7,40 @@
#if IS_ENABLED(CONFIG_IPV6)
#include <net/ip6_fib.h>
#endif
+#include <net/ip.h>
+
+#define DST_CACHE_INPUT_SHIFT (9)
+#define DST_CACHE_INPUT_SIZE (1 << DST_CACHE_INPUT_SHIFT)
+#define DST_CACHE_INPUT_BUCKET_SIZE (4)
+#define DST_CACHE_INPUT_HASH_MASK (~(DST_CACHE_INPUT_BUCKET_SIZE - 1))
+#define INVALID_DST_CACHE_INPUT_KEY (~(u64)(0))
struct dst_cache {
struct dst_cache_pcpu __percpu *cache;
unsigned long reset_ts;
};
+extern unsigned int dst_cache_net_id __read_mostly;
+
+/**
+ * idst_for_each_in_bucket - iterate over a dst cache bucket
+ * @pos: the type * to use as a loop cursor
+ * @head: the head of the cpu dst cache.
+ * @hash: the hash of the bucket
+ */
+#define idst_for_each_in_bucket(pos, head, hash) \
+ for (pos = &head[hash]; \
+ pos < &head[hash + DST_CACHE_INPUT_BUCKET_SIZE]; \
+ pos++)
+
+/**
+ * idst_for_each_in_cache - iterate over the dst cache
+ * @pos: the type * to use as a loop cursor
+ * @head: the head of the cpu dst cache.
+ */
+#define idst_for_each_in_cache(pos, head) \
+ for (pos = head; pos < head + DST_CACHE_INPUT_SIZE; pos++)
+
/**
* dst_cache_get - perform cache lookup
* @dst_cache: the cache
@@ -106,4 +134,44 @@ int dst_cache_init(struct dst_cache *dst_cache, gfp_t gfp);
*/
void dst_cache_destroy(struct dst_cache *dst_cache);
+/**
+ * dst_cache_input_get_noref - perform lookup in the input cache,
+ * return a noref dst
+ * @dst_cache: the input cache
+ * @skb: the packet according to which the dst entry will be searched
+ * local BH must be disabled.
+ */
+struct dst_entry *dst_cache_input_get_noref(struct dst_cache *dst_cache,
+ struct sk_buff *skb);
+
+/**
+ * dst_cache_input_add - add the dst of the given skb to the input cache.
+ *
+ * in case the cache bucket is full, the oldest entry will be deleted
+ * and replaced with the new one.
+ * @dst_cache: the input cache
+ * @skb: The packet according to which the dst entry will be searched
+ *
+ * local BH must be disabled.
+ */
+void dst_cache_input_add(struct dst_cache *dst_cache,
+ const struct sk_buff *skb);
+
+/**
+ * dst_cache_input_init - initialize the input cache,
+ * allocating the required storage
+ */
+int __init dst_cache_input_init(void);
+
+static inline u64 create_dst_cache_key_ip4(const struct sk_buff *skb)
+{
+ struct iphdr *iphdr = ip_hdr(skb);
+
+ return (((u64)iphdr->daddr) << 8) | iphdr->tos;
+}
+
+static inline u32 hash_dst_cache_key(u64 key)
+{
+ return hash_64(key, DST_CACHE_INPUT_SHIFT) & DST_CACHE_INPUT_HASH_MASK;
+}
#endif
diff --git a/net/core/dst_cache.c b/net/core/dst_cache.c
index 0ccfd5fa5cb9..245f5546a7ff 100644
--- a/net/core/dst_cache.c
+++ b/net/core/dst_cache.c
@@ -13,6 +13,7 @@
#include <net/ip6_fib.h>
#endif
#include <uapi/linux/in.h>
+#include <net/netns/generic.h>
struct dst_cache_pcpu {
unsigned long refresh_ts;
@@ -21,9 +22,12 @@ struct dst_cache_pcpu {
union {
struct in_addr in_saddr;
struct in6_addr in6_saddr;
+ u64 key;
};
};
+unsigned int dst_cache_net_id __read_mostly;
+
static void dst_cache_per_cpu_dst_set(struct dst_cache_pcpu *dst_cache,
struct dst_entry *dst, u32 cookie)
{
@@ -181,3 +185,131 @@ void dst_cache_reset_now(struct dst_cache *dst_cache)
}
}
EXPORT_SYMBOL_GPL(dst_cache_reset_now);
+
+static void dst_cache_input_set(struct dst_cache_pcpu *idst,
+ struct dst_entry *dst, u64 key)
+{
+ dst_cache_per_cpu_dst_set(idst, dst, 0);
+ idst->key = key;
+ idst->refresh_ts = jiffies;
+}
+
+static struct dst_entry *__dst_cache_input_get_noref(struct dst_cache_pcpu *idst)
+{
+ struct dst_entry *dst = idst->dst;
+
+ if (unlikely(dst->obsolete && !dst->ops->check(dst, idst->cookie))) {
+ dst_cache_input_set(idst, NULL, INVALID_DST_CACHE_INPUT_KEY);
+ goto fail;
+ }
+
+ idst->refresh_ts = jiffies;
+ return dst;
+
+fail:
+ return NULL;
+}
+
+struct dst_entry *dst_cache_input_get_noref(struct dst_cache *dst_cache,
+ struct sk_buff *skb)
+{
+ struct dst_entry *out_dst = NULL;
+ struct dst_cache_pcpu *pcpu_cache;
+ struct dst_cache_pcpu *idst;
+ u32 hash;
+ u64 key;
+
+ pcpu_cache = this_cpu_ptr(dst_cache->cache);
+ key = create_dst_cache_key_ip4(skb);
+ hash = hash_dst_cache_key(key);
+ idst_for_each_in_bucket(idst, pcpu_cache, hash) {
+ if (key == idst->key) {
+ out_dst = __dst_cache_input_get_noref(idst);
+ goto out;
+ }
+ }
+out:
+ return out_dst;
+}
+
+static void dst_cache_input_reset_now(struct dst_cache *dst_cache)
+{
+ struct dst_cache_pcpu *caches;
+ struct dst_cache_pcpu *idst;
+ struct dst_entry *dst;
+ int i;
+
+ for_each_possible_cpu(i) {
+ caches = per_cpu_ptr(dst_cache->cache, i);
+ idst_for_each_in_cache(idst, caches) {
+ idst->key = INVALID_DST_CACHE_INPUT_KEY;
+ dst = idst->dst;
+ if (dst)
+ dst_release(dst);
+ }
+ }
+}
+
+static int __net_init dst_cache_input_net_init(struct net *net)
+{
+ struct dst_cache *dst_cache = net_generic(net, dst_cache_net_id);
+
+ dst_cache->cache = (struct dst_cache_pcpu __percpu *)alloc_percpu_gfp(struct dst_cache_pcpu[DST_CACHE_INPUT_SIZE],
+ GFP_KERNEL | __GFP_ZERO);
+ if (!dst_cache->cache)
+ return -ENOMEM;
+
+ dst_cache_input_reset_now(dst_cache);
+ return 0;
+}
+
+static void __net_exit dst_cache_input_net_exit(struct net *net)
+{
+ struct dst_cache *dst_cache = net_generic(net, dst_cache_net_id);
+
+ dst_cache_input_reset_now(dst_cache);
+ free_percpu(dst_cache->cache);
+ dst_cache->cache = NULL;
+}
+
+static bool idst_empty(struct dst_cache_pcpu *idst)
+{
+ return idst->key == INVALID_DST_CACHE_INPUT_KEY;
+}
+
+void dst_cache_input_add(struct dst_cache *dst_cache, const struct sk_buff *skb)
+{
+ struct dst_cache_pcpu *entry = NULL;
+ struct dst_cache_pcpu *pcpu_cache;
+ struct dst_cache_pcpu *idst;
+ u32 hash;
+ u64 key;
+
+ pcpu_cache = this_cpu_ptr(dst_cache->cache);
+ key = create_dst_cache_key_ip4(skb);
+ hash = hash_dst_cache_key(key);
+ idst_for_each_in_bucket(idst, pcpu_cache, hash) {
+ if (idst_empty(idst)) {
+ entry = idst;
+ goto add_to_cache;
+ }
+ if (!entry || time_before(idst->refresh_ts, entry->refresh_ts))
+ entry = idst;
+ }
+
+add_to_cache:
+ dst_cache_input_set(entry, skb_dst(skb), key);
+}
+
+static struct pernet_operations dst_cache_input_ops __net_initdata = {
+ .init = dst_cache_input_net_init,
+ .exit = dst_cache_input_net_exit,
+ .id = &dst_cache_net_id,
+ .size = sizeof(struct dst_cache),
+};
+
+int __init dst_cache_input_init(void)
+{
+ return register_pernet_subsys(&dst_cache_input_ops);
+}
+subsys_initcall(dst_cache_input_init);
--
2.34.1
Replace route hints with cached dsts: ip_rcv_finish_core() will first
try to use the cache and only then fall back to the early demux or to a
full lookup.
Newly found dsts are added to the cache only after all the checks have
passed, to avoid caching the dst of a dropped packet.
Multicast dsts are not added to the dst_cache since that would require
additional checks, and multicast packets are rarer and take a slower
path anyway.
A check was added to ip_route_use_dst_cache() that prevents forwarding
packets received by devices for which forwarding is disabled.
Relevant checks were added to ip_route_use_dst_cache() to make sure the
dst can be used and to ensure the IPCB(skb) flags are correct.
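
A condensed sketch of the resulting flow (the full change is in the
ip_input.c hunk below):

	do_cache = ip_can_use_dst_cache(net, skb);
	if (do_cache) {
		dst = dst_cache_input_get_noref(dst_cache, skb);
		if (dst) {
			/* cache hit: validate the source and reuse the dst */
			err = ip_route_use_dst_cache(skb, iph->daddr,
						     iph->saddr, iph->tos,
						     dev, dst);
			if (unlikely(err))
				goto drop_error;
			do_cache = false;	/* already in the cache */
		}
	}
	/* ... early demux / ip_route_input_noref() run as before ... */
	if (do_cache && ip_can_add_dst_cache(skb, rt->rt_type))
		dst_cache_input_add(dst_cache, skb);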
Signed-off-by: Leone Fernando <[email protected]>
---
include/net/route.h | 6 ++--
net/ipv4/ip_input.c | 58 +++++++++++++++++++-----------------
net/ipv4/route.c | 72 +++++++++++++++++++++++++++++++++------------
3 files changed, 88 insertions(+), 48 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index d4a0147942f1..0940383719a0 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -186,9 +186,9 @@ int ip_mc_validate_source(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct in_device *in_dev, u32 *itag);
int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
u8 tos, struct net_device *devin);
-int ip_route_use_hint(struct sk_buff *skb, __be32 dst, __be32 src,
- u8 tos, struct net_device *devin,
- const struct sk_buff *hint);
+int ip_route_use_dst_cache(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+ u8 tos, struct net_device *dev,
+ struct dst_entry *dst);
static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
u8 tos, struct net_device *devin)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 5e9c8156656a..35c8b122d62f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -305,30 +305,44 @@ static inline bool ip_rcv_options(struct sk_buff *skb, struct net_device *dev)
return true;
}
-static bool ip_can_use_hint(const struct sk_buff *skb, const struct iphdr *iph,
- const struct sk_buff *hint)
+static bool ip_can_add_dst_cache(struct sk_buff *skb, __u16 rt_type)
{
- return hint && !skb_dst(skb) && ip_hdr(hint)->daddr == iph->daddr &&
- ip_hdr(hint)->tos == iph->tos;
+ return skb_valid_dst(skb) &&
+ rt_type != RTN_BROADCAST &&
+ rt_type != RTN_MULTICAST &&
+ !(IPCB(skb)->flags & IPSKB_MULTIPATH);
+}
+
+static bool ip_can_use_dst_cache(const struct net *net, struct sk_buff *skb)
+{
+ return !skb_dst(skb) && !fib4_has_custom_rules(net);
}
int tcp_v4_early_demux(struct sk_buff *skb);
int udp_v4_early_demux(struct sk_buff *skb);
static int ip_rcv_finish_core(struct net *net, struct sock *sk,
- struct sk_buff *skb, struct net_device *dev,
- const struct sk_buff *hint)
+ struct sk_buff *skb, struct net_device *dev)
{
+ struct dst_cache *dst_cache = net_generic(net, dst_cache_net_id);
const struct iphdr *iph = ip_hdr(skb);
+ struct dst_entry *dst;
int err, drop_reason;
struct rtable *rt;
+ bool do_cache;
drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
- if (ip_can_use_hint(skb, iph, hint)) {
- err = ip_route_use_hint(skb, iph->daddr, iph->saddr, iph->tos,
- dev, hint);
- if (unlikely(err))
- goto drop_error;
+ do_cache = ip_can_use_dst_cache(net, skb);
+ if (do_cache) {
+ dst = dst_cache_input_get_noref(dst_cache, skb);
+ if (dst) {
+ err = ip_route_use_dst_cache(skb, iph->daddr,
+ iph->saddr, iph->tos,
+ dev, dst);
+ if (unlikely(err))
+ goto drop_error;
+ do_cache = false;
+ }
}
if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) &&
@@ -418,6 +432,9 @@ static int ip_rcv_finish_core(struct net *net, struct sock *sk,
}
}
+ if (do_cache && ip_can_add_dst_cache(skb, rt->rt_type))
+ dst_cache_input_add(dst_cache, skb);
+
return NET_RX_SUCCESS;
drop:
@@ -444,7 +461,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
if (!skb)
return NET_RX_SUCCESS;
- ret = ip_rcv_finish_core(net, sk, skb, dev, NULL);
+ ret = ip_rcv_finish_core(net, sk, skb, dev);
if (ret != NET_RX_DROP)
ret = dst_input(skb);
return ret;
@@ -581,21 +598,11 @@ static void ip_sublist_rcv_finish(struct list_head *head)
}
}
-static struct sk_buff *ip_extract_route_hint(const struct net *net,
- struct sk_buff *skb, int rt_type)
-{
- if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST ||
- IPCB(skb)->flags & IPSKB_MULTIPATH)
- return NULL;
-
- return skb;
-}
-
static void ip_list_rcv_finish(struct net *net, struct sock *sk,
struct list_head *head)
{
- struct sk_buff *skb, *next, *hint = NULL;
struct dst_entry *curr_dst = NULL;
+ struct sk_buff *skb, *next;
struct list_head sublist;
INIT_LIST_HEAD(&sublist);
@@ -610,14 +617,11 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
skb = l3mdev_ip_rcv(skb);
if (!skb)
continue;
- if (ip_rcv_finish_core(net, sk, skb, dev, hint) == NET_RX_DROP)
+ if (ip_rcv_finish_core(net, sk, skb, dev) == NET_RX_DROP)
continue;
dst = skb_dst(skb);
if (curr_dst != dst) {
- hint = ip_extract_route_hint(net, skb,
- ((struct rtable *)dst)->rt_type);
-
/* dispatch old sublist */
if (!list_empty(&sublist))
ip_sublist_rcv_finish(&sublist);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 73e742424e97..e7de683f7b49 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1787,6 +1787,24 @@ static void ip_handle_martian_source(struct net_device *dev,
#endif
}
+static void ip_route_set_doredirect(struct in_device *in_dev,
+ struct in_device *out_dev,
+ struct sk_buff *skb,
+ u8 gw_family,
+ __be32 gw4,
+ __be32 saddr)
+{
+ if (out_dev == in_dev && IN_DEV_TX_REDIRECTS(out_dev) &&
+ skb->protocol == htons(ETH_P_IP)) {
+ __be32 gw;
+
+ gw = gw_family == AF_INET ? gw4 : 0;
+ if (IN_DEV_SHARED_MEDIA(out_dev) ||
+ inet_addr_onlink(out_dev, saddr, gw))
+ IPCB(skb)->flags |= IPSKB_DOREDIRECT;
+ }
+}
+
/* called in rcu_read_lock() section */
static int __mkroute_input(struct sk_buff *skb,
const struct fib_result *res,
@@ -1819,15 +1837,10 @@ static int __mkroute_input(struct sk_buff *skb,
}
do_cache = res->fi && !itag;
- if (out_dev == in_dev && err && IN_DEV_TX_REDIRECTS(out_dev) &&
- skb->protocol == htons(ETH_P_IP)) {
- __be32 gw;
-
- gw = nhc->nhc_gw_family == AF_INET ? nhc->nhc_gw.ipv4 : 0;
- if (IN_DEV_SHARED_MEDIA(out_dev) ||
- inet_addr_onlink(out_dev, saddr, gw))
- IPCB(skb)->flags |= IPSKB_DOREDIRECT;
- }
+ if (err)
+ ip_route_set_doredirect(in_dev, out_dev, skb,
+ nhc->nhc_gw_family,
+ nhc->nhc_gw.ipv4, saddr);
if (skb->protocol != htons(ETH_P_IP)) {
/* Not IP (i.e. ARP). Do not create route, if it is
@@ -2157,14 +2170,15 @@ static int ip_mkroute_input(struct sk_buff *skb,
/* Implements all the saddr-related checks as ip_route_input_slow(),
* assuming daddr is valid and the destination is not a local broadcast one.
- * Uses the provided hint instead of performing a route lookup.
+ * Uses the provided dst from dst_cache instead of performing a route lookup.
*/
-int ip_route_use_hint(struct sk_buff *skb, __be32 daddr, __be32 saddr,
- u8 tos, struct net_device *dev,
- const struct sk_buff *hint)
+int ip_route_use_dst_cache(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+ u8 tos, struct net_device *dev,
+ struct dst_entry *dst)
{
+ struct in_device *out_dev = __in_dev_get_rcu(dst->dev);
struct in_device *in_dev = __in_dev_get_rcu(dev);
- struct rtable *rt = skb_rtable(hint);
+ struct rtable *rt = (struct rtable *)dst;
struct net *net = dev_net(dev);
int err = -EINVAL;
u32 tag = 0;
@@ -2178,21 +2192,43 @@ int ip_route_use_hint(struct sk_buff *skb, __be32 daddr, __be32 saddr,
if (ipv4_is_loopback(saddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
goto martian_source;
- if (rt->rt_type != RTN_LOCAL)
- goto skip_validate_source;
+ if (ipv4_is_loopback(daddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
+ goto martian_destination;
+ if (rt->rt_type != RTN_LOCAL) {
+ if (!IN_DEV_FORWARD(in_dev)) {
+ err = -EHOSTUNREACH;
+ goto out_err;
+ }
+ goto skip_validate_source;
+ }
tos &= IPTOS_RT_MASK;
err = fib_validate_source(skb, saddr, daddr, tos, 0, dev, in_dev, &tag);
if (err < 0)
goto martian_source;
+ if (err)
+ ip_route_set_doredirect(in_dev, out_dev, skb, rt->rt_gw_family,
+ rt->rt_gw4, saddr);
+
skip_validate_source:
- skb_dst_copy(skb, hint);
+ skb_dst_set_noref(skb, dst);
return 0;
martian_source:
ip_handle_martian_source(dev, in_dev, skb, daddr, saddr);
+out_err:
return err;
+
+martian_destination:
+ RT_CACHE_STAT_INC(in_martian_dst);
+#ifdef CONFIG_IP_ROUTE_VERBOSE
+ if (IN_DEV_LOG_MARTIANS(in_dev))
+ net_warn_ratelimited("martian destination %pI4 from %pI4, dev %s\n",
+ &daddr, &saddr, dev->name);
+#endif
+ err = -EINVAL;
+ goto out_err;
}
/* get device for dst_alloc with local routes */
@@ -2213,7 +2249,7 @@ static struct net_device *ip_rt_get_dev(struct net *net,
* addresses, because every properly looped back packet
* must have correct destination already attached by output routine.
* Changes in the enforced policies must be applied also to
- * ip_route_use_hint().
+ * ip_route_use_dst_cache().
*
* Such approach solves two big problems:
* 1. Not simplex devices are handled properly.
--
2.34.1
Make dst_cache always compile and delete all the relevant ifdefs.
Signed-off-by: Leone Fernando <[email protected]>
---
drivers/net/Kconfig | 1 -
include/net/dst_metadata.h | 2 --
include/net/ip_tunnels.h | 2 --
net/Kconfig | 4 ----
net/core/Makefile | 3 +--
net/core/dst.c | 4 ----
net/ipv4/Kconfig | 1 -
net/ipv4/ip_tunnel_core.c | 4 ----
net/ipv4/udp_tunnel_core.c | 4 ----
net/ipv6/Kconfig | 4 ----
net/ipv6/ip6_udp_tunnel.c | 4 ----
net/netfilter/nft_tunnel.c | 2 --
net/openvswitch/Kconfig | 1 -
net/sched/act_tunnel_key.c | 2 --
14 files changed, 1 insertion(+), 37 deletions(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 8ca0bc223b30..1be1ec8368b6 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -78,7 +78,6 @@ config WIREGUARD
depends on IPV6 || !IPV6
depends on !KMSAN # KMSAN doesn't support the crypto configs below
select NET_UDP_TUNNEL
- select DST_CACHE
select CRYPTO
select CRYPTO_LIB_CURVE25519
select CRYPTO_LIB_CHACHA20POLY1305
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 1b7fae4c6b24..c41a857bf0ed 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -165,7 +165,6 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
sizeof(struct ip_tunnel_info) + md_size);
-#ifdef CONFIG_DST_CACHE
/* Unclone the dst cache if there is one */
if (new_md->u.tun_info.dst_cache.cache) {
int ret;
@@ -176,7 +175,6 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
return ERR_PTR(ret);
}
}
-#endif
skb_dst_drop(skb);
skb_dst_set(skb, &new_md->dst);
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 5cd64bb2104d..2fe04edc23b9 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -84,9 +84,7 @@ struct ip_tunnel_encap {
struct ip_tunnel_info {
struct ip_tunnel_key key;
struct ip_tunnel_encap encap;
-#ifdef CONFIG_DST_CACHE
struct dst_cache dst_cache;
-#endif
u8 options_len;
u8 mode;
};
diff --git a/net/Kconfig b/net/Kconfig
index 3e57ccf0da27..21caffd1758d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -438,10 +438,6 @@ config LWTUNNEL_BPF
Allows to run BPF programs as a nexthop action following a route
lookup for incoming and outgoing packets.
-config DST_CACHE
- bool
- default n
-
config GRO_CELLS
bool
default n
diff --git a/net/core/Makefile b/net/core/Makefile
index 821aec06abf1..53582fef633b 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
fib_notifier.o xdp.o flow_offload.o gro.o \
- netdev-genl.o netdev-genl-gen.o gso.o
+ netdev-genl.o netdev-genl-gen.o gso.o dst_cache.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
@@ -32,7 +32,6 @@ obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
-obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
obj-$(CONFIG_FAILOVER) += failover.o
diff --git a/net/core/dst.c b/net/core/dst.c
index 95f533844f17..f035c39be104 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -291,10 +291,8 @@ EXPORT_SYMBOL_GPL(metadata_dst_alloc);
void metadata_dst_free(struct metadata_dst *md_dst)
{
-#ifdef CONFIG_DST_CACHE
if (md_dst->type == METADATA_IP_TUNNEL)
dst_cache_destroy(&md_dst->u.tun_info.dst_cache);
-#endif
if (md_dst->type == METADATA_XFRM)
dst_release(md_dst->u.xfrm_info.dst_orig);
kfree(md_dst);
@@ -326,10 +324,8 @@ void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
for_each_possible_cpu(cpu) {
struct metadata_dst *one_md_dst = per_cpu_ptr(md_dst, cpu);
-#ifdef CONFIG_DST_CACHE
if (one_md_dst->type == METADATA_IP_TUNNEL)
dst_cache_destroy(&one_md_dst->u.tun_info.dst_cache);
-#endif
if (one_md_dst->type == METADATA_XFRM)
dst_release(one_md_dst->u.xfrm_info.dst_orig);
}
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 8e94ed7c56a0..189f716b03e8 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -185,7 +185,6 @@ config NET_IPGRE_DEMUX
config NET_IP_TUNNEL
tristate
- select DST_CACHE
select GRO_CELLS
default n
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 80ccd6661aa3..f060d55dc249 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -682,13 +682,11 @@ static int ip_tun_build_state(struct net *net, struct nlattr *attr,
return err;
}
-#ifdef CONFIG_DST_CACHE
err = dst_cache_init(&tun_info->dst_cache, GFP_KERNEL);
if (err) {
lwtstate_free(new_state);
return err;
}
-#endif
if (tb[LWTUNNEL_IP_ID])
tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP_ID]);
@@ -720,11 +718,9 @@ static int ip_tun_build_state(struct net *net, struct nlattr *attr,
static void ip_tun_destroy_state(struct lwtunnel_state *lwtstate)
{
-#ifdef CONFIG_DST_CACHE
struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate);
dst_cache_destroy(&tun_info->dst_cache);
-#endif
}
static int ip_tun_fill_encap_opts_geneve(struct sk_buff *skb,
diff --git a/net/ipv4/udp_tunnel_core.c b/net/ipv4/udp_tunnel_core.c
index 860aff5f8599..ecc0990d8cab 100644
--- a/net/ipv4/udp_tunnel_core.c
+++ b/net/ipv4/udp_tunnel_core.c
@@ -215,13 +215,11 @@ struct rtable *udp_tunnel_dst_lookup(struct sk_buff *skb,
struct rtable *rt = NULL;
struct flowi4 fl4;
-#ifdef CONFIG_DST_CACHE
if (dst_cache) {
rt = dst_cache_get_ip4(dst_cache, saddr);
if (rt)
return rt;
}
-#endif
memset(&fl4, 0, sizeof(fl4));
fl4.flowi4_mark = skb->mark;
@@ -244,10 +242,8 @@ struct rtable *udp_tunnel_dst_lookup(struct sk_buff *skb,
ip_rt_put(rt);
return ERR_PTR(-ELOOP);
}
-#ifdef CONFIG_DST_CACHE
if (dst_cache)
dst_cache_set_ip4(dst_cache, &rt->dst, fl4.saddr);
-#endif
*saddr = fl4.saddr;
return rt;
}
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 08d4b7132d4c..093c768d41ab 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -124,7 +124,6 @@ config IPV6_MIP6
config IPV6_ILA
tristate "IPv6: Identifier Locator Addressing (ILA)"
depends on NETFILTER
- select DST_CACHE
select LWTUNNEL
help
Support for IPv6 Identifier Locator Addressing (ILA).
@@ -203,7 +202,6 @@ config IPV6_NDISC_NODETYPE
config IPV6_TUNNEL
tristate "IPv6: IP-in-IPv6 tunnel (RFC2473)"
select INET6_TUNNEL
- select DST_CACHE
select GRO_CELLS
help
Support for IPv6-in-IPv6 and IPv4-in-IPv6 tunnels described in
@@ -291,7 +289,6 @@ config IPV6_SEG6_LWTUNNEL
bool "IPv6: Segment Routing Header encapsulation support"
depends on IPV6
select LWTUNNEL
- select DST_CACHE
select IPV6_MULTIPLE_TABLES
help
Support for encapsulation of packets within an outer IPv6
@@ -333,7 +330,6 @@ config IPV6_IOAM6_LWTUNNEL
bool "IPv6: IOAM Pre-allocated Trace insertion support"
depends on IPV6
select LWTUNNEL
- select DST_CACHE
help
Support for the insertion of IOAM Pre-allocated Trace
Header using the lightweight tunnels mechanism.
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index c99053189ea8..de92aea01cfc 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -145,13 +145,11 @@ struct dst_entry *udp_tunnel6_dst_lookup(struct sk_buff *skb,
struct dst_entry *dst = NULL;
struct flowi6 fl6;
-#ifdef CONFIG_DST_CACHE
if (dst_cache) {
dst = dst_cache_get_ip6(dst_cache, saddr);
if (dst)
return dst;
}
-#endif
memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_mark = skb->mark;
fl6.flowi6_proto = IPPROTO_UDP;
@@ -173,10 +171,8 @@ struct dst_entry *udp_tunnel6_dst_lookup(struct sk_buff *skb,
dst_release(dst);
return ERR_PTR(-ELOOP);
}
-#ifdef CONFIG_DST_CACHE
if (dst_cache)
dst_cache_set_ip6(dst_cache, dst, &fl6.saddr);
-#endif
*saddr = fl6.saddr;
return dst;
}
diff --git a/net/netfilter/nft_tunnel.c b/net/netfilter/nft_tunnel.c
index f735d79d8be5..02cbd7f3f6dc 100644
--- a/net/netfilter/nft_tunnel.c
+++ b/net/netfilter/nft_tunnel.c
@@ -508,13 +508,11 @@ static int nft_tunnel_obj_init(const struct nft_ctx *ctx,
return -ENOMEM;
memcpy(&md->u.tun_info, &info, sizeof(info));
-#ifdef CONFIG_DST_CACHE
err = dst_cache_init(&md->u.tun_info.dst_cache, GFP_KERNEL);
if (err < 0) {
metadata_dst_free(md);
return err;
}
-#endif
ip_tunnel_info_opts_set(&md->u.tun_info, &priv->opts.u, priv->opts.len,
priv->opts.flags);
priv->md = md;
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 29a7081858cd..b7a5ab6374b8 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -13,7 +13,6 @@ config OPENVSWITCH
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
- select DST_CACHE
select NET_NSH
select NF_CONNTRACK_OVS if NF_CONNTRACK
select NF_NAT_OVS if NF_NAT
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 1536f8b16f1b..1da69fa82512 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -476,11 +476,9 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
goto err_out;
}
-#ifdef CONFIG_DST_CACHE
ret = dst_cache_init(&metadata->u.tun_info.dst_cache, GFP_KERNEL);
if (ret)
goto release_tun_meta;
-#endif
if (opts_len) {
ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
--
2.34.1
On Thu, 7 Mar 2024 18:12:00 +0100 Leone Fernando wrote:
> +static inline u64 create_dst_cache_key_ip4(const struct sk_buff *skb)
> +{
> + struct iphdr *iphdr = ip_hdr(skb);
> +
> + return (((u64)iphdr->daddr) << 8) | iphdr->tos;
Sparse complains that you're ignoring bitwise types:
include/net/dst_cache.h:170:19: warning: cast from restricted __be32
include/net/dst_cache.h:170:19: warning: cast from restricted __be32
--
pw-bot: cr
On 3/7/24 10:11 AM, Leone Fernando wrote:
> In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
> subsystem. The hinting optimization improves performance by reusing
> previously found dsts instead of looking them up for each skb.
>
> This patch series introduces a generalized version of the hinting
> mechanism that can "remember" a larger number of dsts. This reduces the
> number of dst lookups for frequently encountered daddrs.
>
> Before diving into the code and the benchmarking results, it's important
> to address the deletion of the old route cache [2] and to explain why
> this solution is different. The original cache was complicated,
> vulnerable to DoS attacks and had unstable performance.
>
> The new input dst_cache is much simpler thanks to its lazy approach,
> improving performance without the overhead of the removed cache
> implementation. Instead of using timers and GC, invalid entries are
> deleted lazily when a lookup finds them.
> The dsts are stored in a simple, lightweight, static hash table. This
> keeps lookup times fast yet stable, preventing DoS upon cache misses.
> The new input dst_cache implementation is built on top of the existing
> dst_cache code, which provides fast, lockless per-CPU behavior.
>
> I tested this patch using UDP floods with different numbers of daddrs.
> The benchmarking setup comprises 3 machines: a sender, a forwarder and
> a receiver. I measured the PPS received by the receiver while the
> forwarder was running either the mainline kernel or the patched kernel,
> and compared the results. The dst_cache I tested in this benchmark used
> a total of 512 hash table entries, split into buckets of 4 entries
> each.
>
> These are the results:
> UDP conns  mainline  patched   delta
> per cpu    Kpps      Kpps      %
>     1      274.0255  269.2205  -1.75
>     2      257.3748  268.0947   4.17
>    15      241.3513  258.8016   7.23
>   100      238.3419  258.4939   8.46
>   500      238.5390  252.6425   5.91
>  1000      238.7570  242.1820   1.43
>  2000      238.7780  236.2640  -1.05
>  4000      239.0440  233.5320  -2.31
>  8000      239.3248  232.5680  -2.82
>
I have looked at all of the sets sent. I cannot convince myself this is
a good idea, but at the same time I do not have constructive feedback on
why it is not acceptable. The gains are modest at best.
David Ahern wrote:
> I have looked at all of the sets sent. I cannot convince myself this is
> a good idea, but at the same time I do not have constructive feedback on
> why it is not acceptable. The gains are modest at best.
>
Thanks for the comment.
I believe an improvement of 5-8% in PPS is significant.
Note that the cache is per-CPU (e.g., on a machine with 10 CPUs, the
improvement applies at 10x the connection counts mentioned above).
Could you please provide more information about what you don't like
in the patch?
Some possible issues I can think of:
- Do you think the improvement does not affect the common case? If so,
  it can be addressed by tweaking the cache parameters.
- If the performance degradation at some connection counts is
  problematic, we can find ways to reduce it. For example, the
  degradation for a single connection can probably be solved by keeping
  the route hints.
Leone
Thanks Jakub. I'll fix it and submit a v2.
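The fix will probably be an explicit conversion along these lines
(untested sketch; byte order doesn't matter here since the value is
only used as a hash key):

	static inline u64 create_dst_cache_key_ip4(const struct sk_buff *skb)
	{
		const struct iphdr *iphdr = ip_hdr(skb);

		/* ntohl() makes the __be32 -> u32 conversion explicit,
		 * silencing the sparse warning.
		 */
		return ((u64)ntohl(iphdr->daddr) << 8) | iphdr->tos;
	}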
What do you think about the patch in general?
On Thu, 14 Mar 2024 15:04:02 +0100 Leone Fernando wrote:
> Thanks Jakub. I'll fix it and submit a v2.
> What do you think about the patch in general?
Dunno.. it's a bit hard to judge how much benefit we'd get
in real life scenarios. But I'm not a routing expert.
Hi David,
I plan to continue working on this patch, and it
would be helpful if you could share some more thoughts.
As I said before, I think this patch is significant, and my measurements
showed a consistent improvement in most cases.
Thanks,
Leone
On 4/2/24 4:08 AM, Leone Fernando wrote:
> Hi David,
>
> I plan to continue working on this patch, and it
> would be helpful if you could share some more thoughts.
> As I said before, I think this patch is significant, and my measurements
> showed a consistent improvement in most cases.
>
It seems to me patch 1 and a version of it for IPv6 should go in
independently of this set.
For the rest of it, Jakub's response was a good summary: it is hard to
know if there is a benefit to real workloads. Caches consume resources
(memory and CPU) and will be wrong some percentage of the time,
increasing overhead.
Also, it is targeted at a very narrow use case -- IPv4 only, no custom
FIB rules and no multipath.
Hi David,
>
> For the rest of it, Jakub's response was a good summary: it is hard to
> know if there is a benefit to real workloads. Cache's consume resources
> (memory and cpu) and will be wrong some percentage of the time
> increasing overhead.
>
I understand what you are saying here.
Do you have an idea for additional tests or measurements that would
show whether the patch also improves real workloads?
> Also, it is targeted at a very narrow use case -- IPv4 only, no custom
> FIB rules and no multipath.
The implementation for IPv6 is almost identical. I decided to start
with IPv4 and plan to submit IPv6 support in a future patch.
This patch basically improves the hinting mechanism that Paolo
introduced, and it handles the same cases.
Adding support for FIB rules and multipath is a bit more complicated
but also possible. I am willing to keep working on this patch to cover
the remaining cases, including IPv6.
Thanks,
Leone