2018-12-05 18:16:42

by Paolo Abeni

Subject: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

The Spectre v2 counter-measures, aka retpolines, are a source of measurable
overhead [1]. We can partially address that, when the function pointer refers
to a builtin symbol, by resorting to a list of tests against well-known
builtin functions and direct calls.

Experimental results show that replacing a single indirect call via
retpoline with several branches and a direct call gives performance gains
even when multiple branches are added - 5 or more, as reported in [2].
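
To make the idea concrete, here is a minimal, purely illustrative sketch of
the pattern (the names below are made up, they are not part of this series):
the function pointer is compared against a well-known builtin and, on a match,
called directly, so the retpoline cost is only paid on the fallback path.

/* illustrative only: 'builtin_handler' stands for whatever builtin
 * the pointer is expected to hold most of the time.
 */
static int dispatch(int (*handler)(struct sk_buff *skb), struct sk_buff *skb)
{
	if (likely(handler == builtin_handler))
		return builtin_handler(skb);	/* direct call, no retpoline */
	return handler(skb);			/* indirect, retpolined call */
}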

This may lead to some uglification around the indirect calls. At netconf 2018,
Eric Dumazet described a technique to hide the most relevant part of the needed
boilerplate with some macro help.

This series is a [re-]implementation of that idea, exposing the introduced
helpers in a new header file. They are later leveraged to avoid the indirect
call overhead in the GRO path, when possible.

Overall this gives a >10% performance improvement in a UDP GRO benchmark and a
smaller but measurable one under TCP syn flood.

The added infra can be used in follow-up patches to cope with retpoline overhead
in other points of the networking stack (e.g. at the qdisc layer) and possibly
even in other subsystems.

rfc -> v1:
- use branch prediction hints, as suggested by Eric

v1 -> v2:
- list explicitly the builtin function names in INDIRECT_CALL_*()
- expand the recipients list

[1] http://vger.kernel.org/netconf2018_files/PaoloAbeni_netconf2018.pdf
[2] https://linuxplumbersconf.org/event/2/contributions/99/attachments/98/117/lpc18_paper_af_xdp_perf-v2.pdf

Paolo Abeni (4):
indirect call wrappers: helpers to speed-up indirect calls of builtin
net: use indirect call wrappers at GRO network layer
net: use indirect call wrappers at GRO transport layer
udp: use indirect call wrappers for GRO socket lookup

include/linux/indirect_call_wrapper.h | 51 +++++++++++++++++++++++++++
include/net/inet_common.h | 9 +++++
net/core/dev.c | 15 ++++++--
net/ipv4/af_inet.c | 13 +++++--
net/ipv4/tcp_offload.c | 6 ++--
net/ipv4/udp_offload.c | 15 +++++---
net/ipv6/ip6_offload.c | 18 +++++++---
net/ipv6/tcpv6_offload.c | 7 ++--
net/ipv6/udp_offload.c | 7 ++--
9 files changed, 119 insertions(+), 22 deletions(-)
create mode 100644 include/linux/indirect_call_wrapper.h

--
2.19.2



2018-12-05 18:15:56

by Paolo Abeni

Subject: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

This header defines a bunch of helpers that allow avoiding the
retpoline overhead when calling builtin functions via function pointers.
It boils down to explicitly comparing the function pointer to
known builtin functions and, on a match, invoking the latter directly.

The macros defined here implement the boilerplate for the above scheme
and will be used by the next patches.
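
As a purely illustrative usage sketch (the callback owner and the builtin name
below are hypothetical, the real call sites are introduced by the following
patches), a call through ops->callbacks.gro_receive becomes:

/* hypothetical call site: if the pointer usually holds the builtin
 * foo_gro_receive(), the wrapper turns the common case into a direct call
 * guarded by a pointer compare, otherwise it falls back to the plain
 * indirect call.
 */
pp = INDIRECT_CALL_1(ops->callbacks.gro_receive, foo_gro_receive, head, skb);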

rfc -> v1:
- use branch prediction hint, as suggested by Eric
v1 -> v2:
- list explicitly the builtin function names in INDIRECT_CALL_*()

Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
---
include/linux/indirect_call_wrapper.h | 51 +++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
create mode 100644 include/linux/indirect_call_wrapper.h

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
new file mode 100644
index 000000000000..7c8b7f4948af
--- /dev/null
+++ b/include/linux/indirect_call_wrapper.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
+#define _LINUX_INDIRECT_CALL_WRAPPER_H
+
+#ifdef CONFIG_RETPOLINE
+
+/*
+ * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
+ * @f: function pointer
+ * @f$NR: builtin function names, up to $NR of them
+ * @__VA_ARGS__: arguments for @f
+ *
+ * Avoid retpoline overhead for known builtins, checking @f vs each of them and
+ * eventually invoking the builtin function directly. The functions are checked
+ * in the given order. Fall back to the indirect call if none matches.
+ */
+#define INDIRECT_CALL_1(f, f1, ...) \
+ ({ \
+ likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__); \
+ })
+#define INDIRECT_CALL_2(f, f2, f1, ...) \
+ ({ \
+ likely(f == f2) ? f2(__VA_ARGS__) : \
+ INDIRECT_CALL_1(f, f1, __VA_ARGS__); \
+ })
+
+#define INDIRECT_CALLABLE_DECLARE(f) f
+#define INDIRECT_CALLABLE_SCOPE
+
+#else
+#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE(f)
+#define INDIRECT_CALLABLE_SCOPE static
+#endif
+
+/*
+ * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
+ * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
+ * alternatives
+ */
+#if IS_BUILTIN(CONFIG_IPV6)
+#define INDIRECT_CALL_INET(f, f2, f1, ...) \
+ INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
+#else
+#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
+#endif
+
+#endif
--
2.19.2


2018-12-05 18:16:16

by Paolo Abeni

Subject: [PATCH net-next v2 2/4] net: use indirect call wrappers at GRO network layer

This avoids an indirect call in the L3 GRO receive path, both
for ipv4 and ipv6, provided the latter is not compiled as a module.

Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare for the more common path.

v1 -> v2:
- adapted to INDIRECT_CALL_ changes

Signed-off-by: Paolo Abeni <[email protected]>
---
include/net/inet_common.h | 2 ++
net/core/dev.c | 15 +++++++++++++--
net/ipv6/ip6_offload.c | 6 +++---
3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 3ca969cbd161..56e7592811ea 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -2,6 +2,8 @@
#ifndef _INET_COMMON_H
#define _INET_COMMON_H

+#include <linux/indirect_call_wrapper.h>
+
extern const struct proto_ops inet_stream_ops;
extern const struct proto_ops inet_dgram_ops;

diff --git a/net/core/dev.c b/net/core/dev.c
index 04a6b7100aac..97c5ae41fd65 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
#include <linux/sctp.h>
#include <net/udp_tunnel.h>
#include <linux/net_namespace.h>
+#include <linux/indirect_call_wrapper.h>

#include "net-sysfs.h"

@@ -5320,6 +5321,8 @@ static void flush_all_backlogs(void)
put_online_cpus();
}

+INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *, int));
+INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *, int));
static int napi_gro_complete(struct sk_buff *skb)
{
struct packet_offload *ptype;
@@ -5339,7 +5342,9 @@ static int napi_gro_complete(struct sk_buff *skb)
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;

- err = ptype->callbacks.gro_complete(skb, 0);
+ err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+ ipv6_gro_complete, inet_gro_complete,
+ skb, 0);
break;
}
rcu_read_unlock();
@@ -5486,6 +5491,10 @@ static void gro_flush_oldest(struct list_head *head)
napi_gro_complete(oldest);
}

+INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct list_head *,
+ struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *ipv6_gro_receive(struct list_head *,
+ struct sk_buff *));
static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -5535,7 +5544,9 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
NAPI_GRO_CB(skb)->csum_valid = 0;
}

- pp = ptype->callbacks.gro_receive(gro_head, skb);
+ pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
+ ipv6_gro_receive, inet_gro_receive,
+ gro_head, skb);
break;
}
rcu_read_unlock();
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 70f525c33cb6..ff8b484d2258 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -164,8 +164,8 @@ static int ipv6_exthdrs_len(struct ipv6hdr *iph,
return len;
}

-static struct sk_buff *ipv6_gro_receive(struct list_head *head,
- struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
+ struct sk_buff *skb)
{
const struct net_offload *ops;
struct sk_buff *pp = NULL;
@@ -301,7 +301,7 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
return inet_gro_receive(head, skb);
}

-static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
+INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
{
const struct net_offload *ops;
struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + nhoff);
--
2.19.2


2018-12-05 18:16:52

by Paolo Abeni

Subject: [PATCH net-next v2 4/4] udp: use indirect call wrappers for GRO socket lookup

This avoids another indirect call for UDP GRO. Again, the test
for the IPv6 variant is performed first.

v1 -> v2:
- adapted to INDIRECT_CALL_ changes

Signed-off-by: Paolo Abeni <[email protected]>
---
net/ipv4/udp_offload.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 9a141a6cf1a0..64f9715173ac 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -392,6 +392,8 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
return NULL;
}

+INDIRECT_CALLABLE_DECLARE(struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
+ __be16 sport, __be16 dport));
struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
{
@@ -403,7 +405,8 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct sock *sk;

rcu_read_lock();
- sk = (*lookup)(skb, uh->source, uh->dest);
+ sk = INDIRECT_CALL_INET(lookup, udp6_lib_lookup_skb,
+ udp4_lib_lookup_skb, skb, uh->source, uh->dest);
if (!sk)
goto out_unlock;

@@ -503,7 +506,8 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;

rcu_read_lock();
- sk = (*lookup)(skb, uh->source, uh->dest);
+ sk = INDIRECT_CALL_INET(lookup, udp6_lib_lookup_skb,
+ udp4_lib_lookup_skb, skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
--
2.19.2


2018-12-05 18:16:52

by Paolo Abeni

Subject: [PATCH net-next v2 3/4] net: use indirect call wrappers at GRO transport layer

This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence over UDP, so that we have a single
additional conditional in the common case.

v1 -> v2:
- adapted to INDIRECT_CALL_ changes

Signed-off-by: Paolo Abeni <[email protected]>
---
include/net/inet_common.h | 7 +++++++
net/ipv4/af_inet.c | 13 +++++++++++--
net/ipv4/tcp_offload.c | 6 ++++--
net/ipv4/udp_offload.c | 7 ++++---
net/ipv6/ip6_offload.c | 12 ++++++++++--
net/ipv6/tcpv6_offload.c | 7 ++++---
net/ipv6/udp_offload.c | 7 ++++---
7 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 56e7592811ea..975901a95c0f 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -56,4 +56,11 @@ static inline void inet_ctl_sock_destroy(struct sock *sk)
sock_release(sk->sk_socket);
}

+#define indirect_call_gro_receive(f2, f1, cb, head, skb) \
+({ \
+ unlikely(gro_recursion_inc_test(skb)) ? \
+ NAPI_GRO_CB(skb)->flush |= 1, NULL : \
+ INDIRECT_CALL_2(cb, f2, f1, head, skb); \
+})
+
#endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 326c422c22f8..0dfb72c46671 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1385,6 +1385,10 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
}
EXPORT_SYMBOL(inet_gso_segment);

+INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp4_gro_receive(struct list_head *,
+ struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *udp4_gro_receive(struct list_head *,
+ struct sk_buff *));
struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
{
const struct net_offload *ops;
@@ -1494,7 +1498,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));

- pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+ pp = indirect_call_gro_receive(tcp4_gro_receive, udp4_gro_receive,
+ ops->callbacks.gro_receive, head, skb);

out_unlock:
rcu_read_unlock();
@@ -1556,6 +1561,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
return -EINVAL;
}

+INDIRECT_CALLABLE_DECLARE(int tcp4_gro_complete(struct sk_buff *, int));
+INDIRECT_CALLABLE_DECLARE(int udp4_gro_complete(struct sk_buff *, int));
int inet_gro_complete(struct sk_buff *skb, int nhoff)
{
__be16 newlen = htons(skb->len - nhoff);
@@ -1581,7 +1588,9 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
* because any hdr with option will have been flushed in
* inet_gro_receive().
*/
- err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+ err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ tcp4_gro_complete, udp4_gro_complete,
+ skb, nhoff + sizeof(*iph));

out_unlock:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 870b0a335061..0fbf7d4df9da 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -10,6 +10,7 @@
* TCPv4 GSO/GRO support
*/

+#include <linux/indirect_call_wrapper.h>
#include <linux/skbuff.h>
#include <net/tcp.h>
#include <net/protocol.h>
@@ -305,7 +306,8 @@ int tcp_gro_complete(struct sk_buff *skb)
}
EXPORT_SYMBOL(tcp_gro_complete);

-static struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE
+struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
{
/* Don't bother verifying checksum if we're going to flush anyway. */
if (!NAPI_GRO_CB(skb)->flush &&
@@ -318,7 +320,7 @@ static struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *
return tcp_gro_receive(head, skb);
}

-static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
+INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
{
const struct iphdr *iph = ip_hdr(skb);
struct tcphdr *th = tcp_hdr(skb);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..9a141a6cf1a0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -13,6 +13,7 @@
#include <linux/skbuff.h>
#include <net/udp.h>
#include <net/protocol.h>
+#include <net/inet_common.h>

static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
@@ -451,8 +452,8 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
}
EXPORT_SYMBOL(udp_gro_receive);

-static struct sk_buff *udp4_gro_receive(struct list_head *head,
- struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE
+struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
{
struct udphdr *uh = udp_gro_udphdr(skb);

@@ -525,7 +526,7 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
}
EXPORT_SYMBOL(udp_gro_complete);

-static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
+INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
{
const struct iphdr *iph = ip_hdr(skb);
struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index ff8b484d2258..e92837bd873b 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -164,6 +164,10 @@ static int ipv6_exthdrs_len(struct ipv6hdr *iph,
return len;
}

+INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp6_gro_receive(struct list_head *,
+ struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *udp6_gro_receive(struct list_head *,
+ struct sk_buff *));
INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
struct sk_buff *skb)
{
@@ -260,7 +264,8 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,

skb_gro_postpull_rcsum(skb, iph, nlen);

- pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+ pp = indirect_call_gro_receive(tcp6_gro_receive, udp6_gro_receive,
+ ops->callbacks.gro_receive, head, skb);

out_unlock:
rcu_read_unlock();
@@ -301,6 +306,8 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
return inet_gro_receive(head, skb);
}

+INDIRECT_CALLABLE_DECLARE(int tcp6_gro_complete(struct sk_buff *, int));
+INDIRECT_CALLABLE_DECLARE(int udp6_gro_complete(struct sk_buff *, int));
INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
{
const struct net_offload *ops;
@@ -320,7 +327,8 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
if (WARN_ON(!ops || !ops->callbacks.gro_complete))
goto out_unlock;

- err = ops->callbacks.gro_complete(skb, nhoff);
+ err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ tcp6_gro_complete, udp6_gro_complete, skb, nhoff);

out_unlock:
rcu_read_unlock();
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c
index e72947c99454..3179c425d7ff 100644
--- a/net/ipv6/tcpv6_offload.c
+++ b/net/ipv6/tcpv6_offload.c
@@ -9,14 +9,15 @@
*
* TCPv6 GSO/GRO support
*/
+#include <linux/indirect_call_wrapper.h>
#include <linux/skbuff.h>
#include <net/protocol.h>
#include <net/tcp.h>
#include <net/ip6_checksum.h>
#include "ip6_offload.h"

-static struct sk_buff *tcp6_gro_receive(struct list_head *head,
- struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE
+struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
{
/* Don't bother verifying checksum if we're going to flush anyway. */
if (!NAPI_GRO_CB(skb)->flush &&
@@ -29,7 +30,7 @@ static struct sk_buff *tcp6_gro_receive(struct list_head *head,
return tcp_gro_receive(head, skb);
}

-static int tcp6_gro_complete(struct sk_buff *skb, int thoff)
+INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
{
const struct ipv6hdr *iph = ipv6_hdr(skb);
struct tcphdr *th = tcp_hdr(skb);
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 828b2457f97b..83b11d0ac091 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -11,6 +11,7 @@
*/
#include <linux/skbuff.h>
#include <linux/netdevice.h>
+#include <linux/indirect_call_wrapper.h>
#include <net/protocol.h>
#include <net/ipv6.h>
#include <net/udp.h>
@@ -114,8 +115,8 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
return segs;
}

-static struct sk_buff *udp6_gro_receive(struct list_head *head,
- struct sk_buff *skb)
+INDIRECT_CALLABLE_SCOPE
+struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
{
struct udphdr *uh = udp_gro_udphdr(skb);

@@ -142,7 +143,7 @@ static struct sk_buff *udp6_gro_receive(struct list_head *head,
return NULL;
}

-static int udp6_gro_complete(struct sk_buff *skb, int nhoff)
+INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
{
const struct ipv6hdr *ipv6h = ipv6_hdr(skb);
struct udphdr *uh = (struct udphdr *)(skb->data + nhoff);
--
2.19.2


2018-12-06 04:51:02

by David Miller

Subject: Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

From: Paolo Abeni <[email protected]>
Date: Wed, 5 Dec 2018 19:13:38 +0100

...
> This may lead to some uglification around the indirect calls. At netconf 2018,
> Eric Dumazet described a technique to hide the most relevant part of the needed
> boilerplate with some macro help.
>
> This series is a [re-]implementation of that idea, exposing the introduced
> helpers in a new header file. They are later leveraged to avoid the indirect
> call overhead in the GRO path, when possible.
>
> Overall this gives a >10% performance improvement in a UDP GRO benchmark and a
> smaller but measurable one under TCP syn flood.
>
> The added infra can be used in follow-up patches to cope with retpoline overhead
> in other points of the networking stack (e.g. at the qdisc layer) and possibly
> even in other subsystems.
...

I like this a lot and unless I hear some objections I'm going to apply this
series tomorrow.

Thanks for working on this Paolo.

2018-12-07 06:25:06

by David Miller

Subject: Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

From: Paolo Abeni <[email protected]>
Date: Wed, 5 Dec 2018 19:13:38 +0100

> The Spectre v2 counter-measures, aka retpolines, are a source of measurable
> overhead [1]. We can partially address that, when the function pointer refers
> to a builtin symbol, by resorting to a list of tests against well-known
> builtin functions and direct calls.
>
> Experimental results show that replacing a single indirect call via
> retpoline with several branches and a direct call gives performance gains
> even when multiple branches are added - 5 or more, as reported in [2].
>
> This may lead to some uglification around the indirect calls. At netconf 2018,
> Eric Dumazet described a technique to hide the most relevant part of the needed
> boilerplate with some macro help.
>
> This series is a [re-]implementation of that idea, exposing the introduced
> helpers in a new header file. They are later leveraged to avoid the indirect
> call overhead in the GRO path, when possible.
>
> Overall this gives a >10% performance improvement in a UDP GRO benchmark and a
> smaller but measurable one under TCP syn flood.
>
> The added infra can be used in follow-up patches to cope with retpoline overhead
> in other points of the networking stack (e.g. at the qdisc layer) and possibly
> even in other subsystems.
...

Series applied, thanks!

2018-12-07 06:29:48

by David Miller

Subject: Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

From: David Miller <[email protected]>
Date: Thu, 06 Dec 2018 22:24:09 -0800 (PST)

> Series applied, thanks!

Erm... actually reverted. Please fix these build failures:

ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_receive':
ip6_offload.c:(.text+0xda2): undefined reference to `udp6_gro_receive'
ld: ip6_offload.c:(.text+0xdb6): undefined reference to `udp6_gro_receive'
ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_complete':
ip6_offload.c:(.text+0x1953): undefined reference to `udp6_gro_complete'
ld: ip6_offload.c:(.text+0x1966): undefined reference to `udp6_gro_complete'
make: *** [Makefile:1036: vmlinux] Error 1

Thanks.

2018-12-07 09:47:35

by David Woodhouse

Subject: Re: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

On Wed, 2018-12-05 at 19:13 +0100, Paolo Abeni wrote:
> +/*
> + * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
> + * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
> + * alternatives
> + */
> +#if IS_BUILTIN(CONFIG_IPV6)
> +#define INDIRECT_CALL_INET(f, f2, f1, ...) \
> + INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
> +#elif IS_ENABLED(CONFIG_INET)
> +#define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
> +#else
> +#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
> +#endif
> +
> +#endif

Thanks for working on this.

I'm not stunningly keen on the part cited above. And it doesn't seem to
be working either, given Dave's later error and reversion.

I wonder if we can declare the common case functions as 'weak' so that
the link failures don't happen when they're absent.
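
Purely as a sketch of what I mean (names borrowed from the reported build
failure, not a tested patch): with a weak extern reference the link succeeds
even when the symbol is absent, the unresolved symbol has address zero, so the
pointer comparison simply never matches and we take the indirect call anyway.

/* weak reference sketch: harmless when ipv6 is modular, since the equality
 * test in INDIRECT_CALL_*() can never match a NULL address.
 */
extern struct sk_buff *udp6_gro_receive(struct list_head *head,
					struct sk_buff *skb) __weak;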

Once we extend this past the network code, especially to file systems'
f_ops, I suspect we're going to want to use something like static keys
to patch the common cases at runtime — perhaps changing the f_ops
default according to what the root file system is, etc.

I'd quite like to see the API for this taking that into account even if
it's left to be a future development.



2018-12-07 20:30:19

by Paolo Abeni

Subject: Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

Hi,

On Thu, 2018-12-06 at 22:28 -0800, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Thu, 06 Dec 2018 22:24:09 -0800 (PST)
>
> > Series applied, thanks!
>
> Erm... actually reverted. Please fix these build failures:

Oops...
I'm sorry for the late reply. I'm travelling and I will not be able to
re-post soon.

> ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_receive':
> ip6_offload.c:(.text+0xda2): undefined reference to `udp6_gro_receive'
> ld: ip6_offload.c:(.text+0xdb6): undefined reference to `udp6_gro_receive'
> ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_complete':
> ip6_offload.c:(.text+0x1953): undefined reference to `udp6_gro_complete'
> ld: ip6_offload.c:(.text+0x1966): undefined reference to `udp6_gro_complete'
> make: *** [Makefile:1036: vmlinux] Error 1

Are you building with CONFIG_IPV6=m? I tested against some common configs,
but I omitted that one in my last iteration (my bad). With such a config the
ip6 offloads are builtin while the udp6 offloads end up in the ipv6 module,
so I can't use them with that config.

I'll try to fix the above in v3.
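
Roughly, the kind of change I have in mind is something like the following -
only a sketch, the macro name and placement may well differ in v3: use the
tcp6/udp6 direct-call candidates from ip6_offload.c only when ipv6 itself is
builtin, and fall back to the plain indirect call otherwise, so that the
builtin object has no unresolved references to the udp6 offload symbols.

/* sketch only, hypothetical name: a local wrapper in ip6_offload.c */
#if IS_BUILTIN(CONFIG_IPV6)
#define indirect_call_gro_receive_l4(f2, f1, cb, head, skb)	\
	indirect_call_gro_receive(f2, f1, cb, head, skb)
#else
#define indirect_call_gro_receive_l4(f2, f1, cb, head, skb)	\
({								\
	unlikely(gro_recursion_inc_test(skb)) ?			\
		NAPI_GRO_CB(skb)->flush |= 1, NULL :		\
		cb(head, skb);					\
})
#endif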

I'm sorry for this mess,

Paolo


2018-12-07 20:47:23

by Paolo Abeni

Subject: Re: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

On Fri, 2018-12-07 at 09:46 +0000, David Woodhouse wrote:
> On Wed, 2018-12-05 at 19:13 +0100, Paolo Abeni wrote:
> > +/*
> > + * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
> > + * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
> > + * alternatives
> > + */
> > +#if IS_BUILTIN(CONFIG_IPV6)
> > +#define INDIRECT_CALL_INET(f, f2, f1, ...) \
> > + INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__)
> > +#elif IS_ENABLED(CONFIG_INET)
> > +#define INDIRECT_CALL_INET(f, f2, f1, ...) INDIRECT_CALL_1(f, f1, __VA_ARGS__)
> > +#else
> > +#define INDIRECT_CALL_INET(f, f2, f1, ...) f(__VA_ARGS__)
> > +#endif
> > +
> > +#endif
>
> Thanks for working on this.
>
> I'm not stunningly keen on the part cited above. And it doesn't seem to
> be working either, given Dave's later error and reversion.

My bad, I did not test against the relevant config. Hopefully that can be fixed.

> I wonder if we can declare the common case functions as 'weak' so that
> the link failures don't happen when they're absent.

I experimented with aliases in a previous version. I avoided weak alias
usage because I [mis?]understood that not all compilers have complete
support for them (e.g. clang).
Also, with weak refs, a coding error that is now caught at build time
would instead show up as worse performance at runtime, likely with some
uncommon configuration, and possibly not as easily detected. I'm unsure
that would be better?!

> Once we extend this past the network code, especially to file systems'
> f_ops, I suspect we're going to want to use something like static keys
> to patch the common cases at runtime — perhaps changing the f_ops
> default according to what the root file system is, etc.

I'm sorry, I don't follow here. I think static keys can't be used for
the reported network case: we have different list elements, each
containing a different function pointer, and we access/use a different
pointer on a per-packet basis.

> I'd quite like to see the API for this taking that into account even if
> it's left to be a future development.

Again, I'm lost here. Can you please give more hints?

Thanks,

Paolo


2018-12-07 21:07:32

by David Miller

Subject: Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

From: Paolo Abeni <[email protected]>
Date: Fri, 07 Dec 2018 21:29:20 +0100

> Are you building with CONFIG_IPV6=m?

I always build allmodconfig

2018-12-07 21:47:35

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

On Fri, 2018-12-07 at 21:46 +0100, Paolo Abeni wrote:
> > I wonder if we can declare the common case functions as 'weak' so that
> > the link failures don't happen when they're absent.
>
> I experimented with aliases in a previous version. I avoided weak alias
> usage because I [mis?]understood that not all compilers have complete
> support for them (e.g. clang).
> Also, with weak refs, a coding error that is now caught at build time
> would instead show up as worse performance at runtime, likely with some
> uncommon configuration, and possibly not as easily detected. I'm unsure
> that would be better?!

I think everything supports weak linkage; we've been using it for
years.

> > Once we extend this past the network code, especially to file systems'
> > f_ops, I suspect we're going to want to use something like static keys
> > to patch the common cases at runtime — perhaps changing the f_ops
> > default according to what the root file system is, etc.
>
> I'm sorry, I don't follow here. I think static keys can't be used for
> the reported network case: we have different list elements, each
> containing a different function pointer, and we access/use a different
> pointer on a per-packet basis.

Yes, the alternatives would be used to change the "likely" case.

We still do the "if (fn == default_fn) default_fn(); else (*fn)();"
part; or even the variant with two (or more) common cases.

It's just that the value of 'default_fn' can be changed at runtime
(with patching like alternatives/static keys, since of course it has to
be a direct call).




2018-12-11 22:29:42

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v2 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

Hi,

I'm sorry for the long delay, I was (and still am) diverted by some
other duties.

On Fri, 2018-12-07 at 21:46 +0000, David Woodhouse wrote:
> On Fri, 2018-12-07 at 21:46 +0100, Paolo Abeni wrote:
> > > I wonder if we can declare the common case functions as 'weak' so that
> > > the link failures don't happen when they're absent.
> >
> > I experimented with aliases in a previous version. I avoided weak alias
> > usage because I [mis?]understood that not all compilers have complete
> > support for them (e.g. clang).
> > Also, with weak refs, a coding error that is now caught at build time
> > would instead show up as worse performance at runtime, likely with some
> > uncommon configuration, and possibly not as easily detected. I'm unsure
> > that would be better?!
>
> I think everything supports weak linkage; we've been using it for
> years.

Ok, I was likely confused by some old, non-first-hand info.

Anyway, weak aliases would turn a compile-time issue into a possible (small)
run-time regression. I think the first option would be preferable.

> > I'm sorry, I don't follow here. I think static keys can't be used for
> > the reported network case: we have different list elements, each
> > containing a different function pointer, and we access/use a different
> > pointer on a per-packet basis.
>
> Yes, the alternatives would be used to change the "likely" case.
>
> We still do the "if (fn == default_fn) default_fn(); else (*fn)();"
> part; or even the variant with two (or more) common cases.
>
> It's just that the value of 'default_fn' can be changed at runtime
> (with patching like alternatives/static keys, since of course it has to
> be a direct call).

Thanks for clarifying. If I understood correctly, you would like some
helper for:

if (static_branch_likely(&use_default_fn_a))
	INDIRECT_CALL_1(f, default_fn_a, <args>)
else if (static_branch_likely(&use_default_fn_b))
	INDIRECT_CALL_1(f, default_fn_b, <args>)
// ...

If so, I think we can eventually add support for this kind of thing on
top of the proposed macros. WDYT?

Thanks,

Paolo



2018-12-21 13:44:34

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH net-next v2 3/4] net: use indirect call wrappers at GRO transport layer

On Wed, 5 Dec 2018 19:13:41 +0100
Paolo Abeni <[email protected]> wrote:

> This avoids an indirect call in the receive path for TCP and UDP
> packets. TCP takes precedence over UDP, so that we have a single
> additional conditional in the common case.
>
> v1 -> v2:
> - adapted to INDIRECT_CALL_ changes
>
> Signed-off-by: Paolo Abeni <[email protected]>

This introduces a bunch of new warnings when the kernel is built with W=1.
Please add the necessary prototypes in header files.

Putting prototypes in a C file for a global function is not the right way to
fix this.
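
Something along these lines would do - just to illustrate what I mean, the
exact header is up to you (e.g. include/net/tcp.h for the tcp4 pair):

/* sketch: a single declaration visible to both the definition and the call
 * sites silences -Wmissing-prototypes; INDIRECT_CALLABLE_DECLARE() expands to
 * nothing when retpolines are off, so it does not clash with the 'static'
 * definitions used in that configuration.
 */
INDIRECT_CALLABLE_DECLARE(struct sk_buff *tcp4_gro_receive(struct list_head *,
							    struct sk_buff *));
INDIRECT_CALLABLE_DECLARE(int tcp4_gro_complete(struct sk_buff *skb, int thoff));

For reference, the new warnings are: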

net/ipv4/tcp_offload.c:310:17: warning: no previous prototype for ‘tcp4_gro_receive’ [-Wmissing-prototypes]
struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
^~~~~~~~~~~~~~~~
net/ipv4/tcp_offload.c:323:29: warning: no previous prototype for ‘tcp4_gro_complete’ [-Wmissing-prototypes]
INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
^~~~~~~~~~~~~~~~~
CC net/ipv4/datagram.o
CC net/ipv4/udp_offload.o
net/ipv4/udp_offload.c:459:17: warning: no previous prototype for ‘udp4_gro_receive’ [-Wmissing-prototypes]
struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
^~~~~~~~~~~~~~~~
net/ipv4/udp_offload.c:533:29: warning: no previous prototype for ‘udp4_gro_complete’ [-Wmissing-prototypes]
INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
^~~~~~~~~~~~~~~~~
CC net/ipv4/arp.o
CC net/ipv6/ip6_offload.o
net/ipv6/ip6_offload.c:188:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:328:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
^~~~~~~~~~~~~~~~~
CC net/ipv6/tcpv6_offload.o
net/ipv6/tcpv6_offload.c:20:17: warning: no previous prototype for ‘tcp6_gro_receive’ [-Wmissing-prototypes]
struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
^~~~~~~~~~~~~~~~
net/ipv6/tcpv6_offload.c:33:29: warning: no previous prototype for ‘tcp6_gro_complete’ [-Wmissing-prototypes]
INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
^~~~~~~~~~~~~~~~~
CC net/ipv6/exthdrs_offload.o