2023-06-28 10:07:48

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v4 0/7] Add SO_REUSEPORT support for TC bpf_sk_assign

We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assing.

I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.

Joint work with Daniel Borkmann.

Signed-off-by: Lorenz Bauer <[email protected]>
---
Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)

---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper

Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: remove duplicate reuseport_lookup functions
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

include/net/inet6_hashtables.h | 81 ++++++++-
include/net/inet_hashtables.h | 74 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 67 ++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 70 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 656 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee

Best regards,
--
Lorenz Bauer <[email protected]>



2023-06-28 10:08:10

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v4 1/7] udp: re-score reuseport groups when connected sockets are present

Contrary to TCP, UDP reuseport groups can contain TCP_ESTABLISHED
sockets. To support these properly we remember whether a group has
a connected socket and skip the fast reuseport early-return. In
effect we continue scoring all reuseport sockets and then choose the
one with the highest score.

The current code fails to re-calculate the score for the result of
lookup_reuseport. According to Kuniyuki Iwashima:

1) SO_INCOMING_CPU is set
-> selected sk might have +1 score

2) BPF prog returns ESTABLISHED and/or SO_INCOMING_CPU sk
-> selected sk will have more than 8

Using the old score could trigger more lookups depending on the
order that sockets are created.

sk -> sk (SO_INCOMING_CPU) -> sk (ESTABLISHED)
| |
`-> select the next SO_INCOMING_CPU sk
|
`-> select itself (We should save this lookup)

Fixes: efc6b6f6c311 ("udp: Improve load balancing for SO_REUSEPORT.")
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
---
net/ipv4/udp.c | 20 +++++++++++++++-----
net/ipv6/udp.c | 19 ++++++++++++++-----
2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fd3dae081f3a..5ef478d2c408 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -450,14 +450,24 @@ static struct sock *udp4_lib_lookup2(struct net *net,
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
- result = lookup_reuseport(net, sk, skb,
- saddr, sport, daddr, hnum);
+ badness = score;
+ result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+ if (!result) {
+ result = sk;
+ continue;
+ }
+
/* Fall back to scoring if group has connections */
- if (result && !reuseport_has_conns(sk))
+ if (!reuseport_has_conns(sk))
return result;

- result = result ? : sk;
- badness = score;
+ /* Reuseport logic returned an error, keep original score. */
+ if (IS_ERR(result))
+ continue;
+
+ badness = compute_score(result, net, saddr, sport,
+ daddr, hnum, dif, sdif);
+
}
}
return result;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5a337e6b970..8b3cb1d7da7c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -193,14 +193,23 @@ static struct sock *udp6_lib_lookup2(struct net *net,
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
- result = lookup_reuseport(net, sk, skb,
- saddr, sport, daddr, hnum);
+ badness = score;
+ result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+ if (!result) {
+ result = sk;
+ continue;
+ }
+
/* Fall back to scoring if group has connections */
- if (result && !reuseport_has_conns(sk))
+ if (!reuseport_has_conns(sk))
return result;

- result = result ? : sk;
- badness = score;
+ /* Reuseport logic returned an error, keep original score. */
+ if (IS_ERR(result))
+ continue;
+
+ badness = compute_score(sk, net, saddr, sport,
+ daddr, hnum, dif, sdif);
}
}
return result;

--
2.40.1


2023-06-28 10:13:59

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v4 5/7] net: remove duplicate sk_lookup helpers

Now that inet[6]_lookup_reuseport are parameterised on the ehashfn
we can remove two sk_lookup helpers.

Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
---
include/net/inet6_hashtables.h | 9 +++++++++
include/net/inet_hashtables.h | 7 +++++++
net/ipv4/inet_hashtables.c | 26 +++++++++++++-------------
net/ipv4/udp.c | 32 +++++---------------------------
net/ipv6/inet6_hashtables.c | 31 ++++++++++++++++---------------
net/ipv6/udp.c | 34 +++++-----------------------------
6 files changed, 55 insertions(+), 84 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index f89320b6fee3..a6722d6ef80f 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -73,6 +73,15 @@ struct sock *inet6_lookup_listener(struct net *net,
const unsigned short hnum,
const int dif, const int sdif);

+struct sock *inet6_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ const struct in6_addr *saddr,
+ const __be16 sport,
+ const struct in6_addr *daddr,
+ const u16 hnum, const int dif,
+ inet6_ehashfn_t *ehashfn);
+
static inline struct sock *__inet6_lookup(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ddfa2e67fdb5..c0532cc7587f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -393,6 +393,13 @@ struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
__be32 daddr, unsigned short hnum,
inet_ehashfn_t *ehashfn);

+struct sock *inet_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ __be32 saddr, __be16 sport,
+ __be32 daddr, u16 hnum, const int dif,
+ inet_ehashfn_t *ehashfn);
+
static inline struct sock *
inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
const __be32 saddr, const __be16 sport,
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ac927a635a6f..a411ae560bdf 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -402,25 +402,23 @@ static struct sock *inet_lhash2_lookup(struct net *net,
return result;
}

-static inline struct sock *inet_lookup_run_bpf(struct net *net,
- struct inet_hashinfo *hashinfo,
- struct sk_buff *skb, int doff,
- __be32 saddr, __be16 sport,
- __be32 daddr, u16 hnum, const int dif)
+struct sock *inet_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ __be32 saddr, __be16 sport,
+ __be32 daddr, u16 hnum, const int dif,
+ inet_ehashfn_t *ehashfn)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;

- if (hashinfo != net->ipv4.tcp_death_row.hashinfo)
- return NULL; /* only TCP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP, saddr, sport,
+ no_reuseport = bpf_sk_lookup_run_v4(net, protocol, saddr, sport,
daddr, hnum, dif, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum,
- inet_ehashfn);
+ ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
@@ -438,9 +436,11 @@ struct sock *__inet_lookup_listener(struct net *net,
unsigned int hash2;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- result = inet_lookup_run_bpf(net, hashinfo, skb, doff,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ hashinfo == net->ipv4.tcp_death_row.hashinfo) {
+ result = inet_lookup_run_sk_lookup(net, IPPROTO_TCP, skb, doff,
+ saddr, sport, daddr, hnum, dif,
+ inet_ehashfn);
if (result)
goto done;
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7258edece691..eb79268f216d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -464,30 +464,6 @@ static struct sock *udp4_lib_lookup2(struct net *net,
return result;
}

-static struct sock *udp4_lookup_run_bpf(struct net *net,
- struct udp_table *udptable,
- struct sk_buff *skb,
- __be32 saddr, __be16 sport,
- __be32 daddr, u16 hnum, const int dif)
-{
- struct sock *sk, *reuse_sk;
- bool no_reuseport;
-
- if (udptable != net->ipv4.udp_table)
- return NULL; /* only UDP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_UDP, saddr, sport,
- daddr, hnum, dif, &sk);
- if (no_reuseport || IS_ERR_OR_NULL(sk))
- return sk;
-
- reuse_sk = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
- saddr, sport, daddr, hnum, udp_ehashfn);
- if (reuse_sk)
- sk = reuse_sk;
- return sk;
-}
-
/* UDP is nearly always wildcards out the wazoo, it makes no sense to try
* harder than this. -DaveM
*/
@@ -512,9 +488,11 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
goto done;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- sk = udp4_lookup_run_bpf(net, udptable, skb,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ udptable == net->ipv4.udp_table) {
+ sk = inet_lookup_run_sk_lookup(net, IPPROTO_UDP, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, dif,
+ udp_ehashfn);
if (sk) {
result = sk;
goto done;
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index d37602fabc00..f9653675f577 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -176,31 +176,30 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
return result;
}

-static inline struct sock *inet6_lookup_run_bpf(struct net *net,
- struct inet_hashinfo *hashinfo,
- struct sk_buff *skb, int doff,
- const struct in6_addr *saddr,
- const __be16 sport,
- const struct in6_addr *daddr,
- const u16 hnum, const int dif)
+struct sock *inet6_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ const struct in6_addr *saddr,
+ const __be16 sport,
+ const struct in6_addr *daddr,
+ const u16 hnum, const int dif,
+ inet6_ehashfn_t *ehashfn)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;

- if (hashinfo != net->ipv4.tcp_death_row.hashinfo)
- return NULL; /* only TCP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP, saddr, sport,
+ no_reuseport = bpf_sk_lookup_run_v6(net, protocol, saddr, sport,
daddr, hnum, dif, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
- saddr, sport, daddr, hnum, inet6_ehashfn);
+ saddr, sport, daddr, hnum, ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
}
+EXPORT_SYMBOL_GPL(inet6_lookup_run_sk_lookup);

struct sock *inet6_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
@@ -214,9 +213,11 @@ struct sock *inet6_lookup_listener(struct net *net,
unsigned int hash2;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- result = inet6_lookup_run_bpf(net, hashinfo, skb, doff,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ hashinfo == net->ipv4.tcp_death_row.hashinfo) {
+ result = inet6_lookup_run_sk_lookup(net, IPPROTO_TCP, skb, doff,
+ saddr, sport, daddr, hnum, dif,
+ inet6_ehashfn);
if (result)
goto done;
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ebac9200b15c..8a6d94cabee0 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -205,32 +205,6 @@ static struct sock *udp6_lib_lookup2(struct net *net,
return result;
}

-static inline struct sock *udp6_lookup_run_bpf(struct net *net,
- struct udp_table *udptable,
- struct sk_buff *skb,
- const struct in6_addr *saddr,
- __be16 sport,
- const struct in6_addr *daddr,
- u16 hnum, const int dif)
-{
- struct sock *sk, *reuse_sk;
- bool no_reuseport;
-
- if (udptable != net->ipv4.udp_table)
- return NULL; /* only UDP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_UDP, saddr, sport,
- daddr, hnum, dif, &sk);
- if (no_reuseport || IS_ERR_OR_NULL(sk))
- return sk;
-
- reuse_sk = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
- saddr, sport, daddr, hnum, udp6_ehashfn);
- if (reuse_sk)
- sk = reuse_sk;
- return sk;
-}
-
/* rcu_read_lock() must be held */
struct sock *__udp6_lib_lookup(struct net *net,
const struct in6_addr *saddr, __be16 sport,
@@ -255,9 +229,11 @@ struct sock *__udp6_lib_lookup(struct net *net,
goto done;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- sk = udp6_lookup_run_bpf(net, udptable, skb,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ udptable == net->ipv4.udp_table) {
+ sk = inet6_lookup_run_sk_lookup(net, IPPROTO_UDP, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, dif,
+ udp6_ehashfn);
if (sk) {
result = sk;
goto done;

--
2.40.1


2023-06-28 10:20:53

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v4 4/7] net: document inet[6]_lookup_reuseport sk_state requirements

The current implementation was extracted from inet[6]_lhash2_lookup
in commit 80b373f74f9e ("inet: Extract helper for selecting socket
from reuseport group") and commit 5df6531292b5 ("inet6: Extract helper
for selecting socket from reuseport group"). In the original context,
sk is always in TCP_LISTEN state and so did not have a separate check.

Add documentation that specifies which sk_state are valid to pass to
the function.

Signed-off-by: Lorenz Bauer <[email protected]>
---
net/ipv4/inet_hashtables.c | 14 ++++++++++++++
net/ipv6/inet6_hashtables.c | 14 ++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 352eb371c93b..ac927a635a6f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -335,6 +335,20 @@ static inline int compute_score(struct sock *sk, struct net *net,

INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);

+/**
+ * inet_lookup_reuseport() - execute reuseport logic on AF_INET socket if necessary.
+ * @net: network namespace.
+ * @sk: AF_INET socket, must be in TCP_LISTEN state for TCP or TCP_CLOSE for UDP.
+ * @skb: context for a potential SK_REUSEPORT program.
+ * @doff: header offset.
+ * @saddr: source address.
+ * @sport: source port.
+ * @daddr: destination address.
+ * @hnum: destination port in host byte order.
+ *
+ * Return: NULL if sk doesn't have SO_REUSEPORT set, otherwise a pointer to
+ * the selected sock or an error.
+ */
struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
__be32 saddr, __be16 sport,
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 3616225c89ef..d37602fabc00 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -114,6 +114,20 @@ static inline int compute_score(struct sock *sk, struct net *net,

INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);

+/**
+ * inet6_lookup_reuseport() - execute reuseport logic on AF_INET6 socket if necessary.
+ * @net: network namespace.
+ * @sk: AF_INET6 socket, must be in TCP_LISTEN state for TCP or TCP_CLOSE for UDP.
+ * @skb: context for a potential SK_REUSEPORT program.
+ * @doff: header offset.
+ * @saddr: source address.
+ * @sport: source port.
+ * @daddr: destination address.
+ * @hnum: destination port in host byte order.
+ *
+ * Return: NULL if sk doesn't have SO_REUSEPORT set, otherwise a pointer to
+ * the selected sock or an error.
+ */
struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
const struct in6_addr *saddr,

--
2.40.1


2023-06-28 10:33:12

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v4 3/7] net: remove duplicate reuseport_lookup functions

There are currently four copies of reuseport_lookup: one each for
(TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of
those functions as well. This is already the case for sk_lookup
helpers (inet,inet6,udp4,udp6)_lookup_run_bpf.

There are two differences between the reuseport_lookup helpers:

1. They call different hash functions depending on protocol
2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED

Move the check for sk_state into the caller and use the INDIRECT_CALL
infrastructure to cut down the helpers to one per IP version.

Signed-off-by: Lorenz Bauer <[email protected]>
---
include/net/inet6_hashtables.h | 11 ++++++++++-
include/net/inet_hashtables.h | 15 ++++++++++-----
net/ipv4/inet_hashtables.c | 20 +++++++++++++-------
net/ipv4/udp.c | 34 +++++++++++++---------------------
net/ipv6/inet6_hashtables.c | 14 ++++++++++----
net/ipv6/udp.c | 41 ++++++++++++++++-------------------------
6 files changed, 72 insertions(+), 63 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 032ddab48f8f..f89320b6fee3 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -48,12 +48,21 @@ struct sock *__inet6_lookup_established(struct net *net,
const u16 hnum, const int dif,
const int sdif);

+typedef u32 (inet6_ehashfn_t)(const struct net *net,
+ const struct in6_addr *laddr, const u16 lport,
+ const struct in6_addr *faddr, const __be16 fport);
+
+inet6_ehashfn_t inet6_ehashfn;
+
+INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);
+
struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
const struct in6_addr *saddr,
__be16 sport,
const struct in6_addr *daddr,
- unsigned short hnum);
+ unsigned short hnum,
+ inet6_ehashfn_t *ehashfn);

struct sock *inet6_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 8734f3488f5d..ddfa2e67fdb5 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -379,10 +379,19 @@ struct sock *__inet_lookup_established(struct net *net,
const __be32 daddr, const u16 hnum,
const int dif, const int sdif);

+typedef u32 (inet_ehashfn_t)(const struct net *net,
+ const __be32 laddr, const __u16 lport,
+ const __be32 faddr, const __be16 fport);
+
+inet_ehashfn_t inet_ehashfn;
+
+INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);
+
struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
__be32 saddr, __be16 sport,
- __be32 daddr, unsigned short hnum);
+ __be32 daddr, unsigned short hnum,
+ inet_ehashfn_t *ehashfn);

static inline struct sock *
inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
@@ -453,10 +462,6 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
refcounted);
}

-u32 inet6_ehashfn(const struct net *net,
- const struct in6_addr *laddr, const u16 lport,
- const struct in6_addr *faddr, const __be16 fport);
-
static inline void sk_daddr_set(struct sock *sk, __be32 addr)
{
sk->sk_daddr = addr; /* alias of inet_daddr */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 920131e4a65d..352eb371c93b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -28,9 +28,9 @@
#include <net/tcp.h>
#include <net/sock_reuseport.h>

-static u32 inet_ehashfn(const struct net *net, const __be32 laddr,
- const __u16 lport, const __be32 faddr,
- const __be16 fport)
+u32 inet_ehashfn(const struct net *net, const __be32 laddr,
+ const __u16 lport, const __be32 faddr,
+ const __be16 fport)
{
static u32 inet_ehash_secret __read_mostly;

@@ -39,6 +39,7 @@ static u32 inet_ehashfn(const struct net *net, const __be32 laddr,
return __inet_ehashfn(laddr, lport, faddr, fport,
inet_ehash_secret + net_hash_mix(net));
}
+EXPORT_SYMBOL_GPL(inet_ehashfn);

/* This function handles inet_sock, but also timewait and request sockets
* for IPv4/IPv6.
@@ -332,16 +333,20 @@ static inline int compute_score(struct sock *sk, struct net *net,
return score;
}

+INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);
+
struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
__be32 saddr, __be16 sport,
- __be32 daddr, unsigned short hnum)
+ __be32 daddr, unsigned short hnum,
+ inet_ehashfn_t *ehashfn)
{
struct sock *reuse_sk = NULL;
u32 phash;

if (sk->sk_reuseport) {
- phash = inet_ehashfn(net, daddr, hnum, saddr, sport);
+ phash = INDIRECT_CALL_2(ehashfn, udp_ehashfn, inet_ehashfn,
+ net, daddr, hnum, saddr, sport);
reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
}
return reuse_sk;
@@ -371,7 +376,7 @@ static struct sock *inet_lhash2_lookup(struct net *net,
score = compute_score(sk, net, hnum, daddr, dif, sdif);
if (score > hiscore) {
result = inet_lookup_reuseport(net, sk, skb, doff,
- saddr, sport, daddr, hnum);
+ saddr, sport, daddr, hnum, inet_ehashfn);
if (result)
return result;

@@ -400,7 +405,8 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

- reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
+ reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum,
+ inet_ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5ef478d2c408..7258edece691 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -405,9 +405,9 @@ static int compute_score(struct sock *sk, struct net *net,
return score;
}

-static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
- const __u16 lport, const __be32 faddr,
- const __be16 fport)
+INDIRECT_CALLABLE_SCOPE
+u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport,
+ const __be32 faddr, const __be16 fport)
{
static u32 udp_ehash_secret __read_mostly;

@@ -417,22 +417,6 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
udp_ehash_secret + net_hash_mix(net));
}

-static struct sock *lookup_reuseport(struct net *net, struct sock *sk,
- struct sk_buff *skb,
- __be32 saddr, __be16 sport,
- __be32 daddr, unsigned short hnum)
-{
- struct sock *reuse_sk = NULL;
- u32 hash;
-
- if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
- hash = udp_ehashfn(net, daddr, hnum, saddr, sport);
- reuse_sk = reuseport_select_sock(sk, hash, skb,
- sizeof(struct udphdr));
- }
- return reuse_sk;
-}
-
/* called with rcu_read_lock() */
static struct sock *udp4_lib_lookup2(struct net *net,
__be32 saddr, __be16 sport,
@@ -451,7 +435,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
daddr, hnum, dif, sdif);
if (score > badness) {
badness = score;
- result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+
+ if (sk->sk_state == TCP_ESTABLISHED) {
+ result = sk;
+ continue;
+ }
+
+ result = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, udp_ehashfn);
if (!result) {
result = sk;
continue;
@@ -490,7 +481,8 @@ static struct sock *udp4_lookup_run_bpf(struct net *net,
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

- reuse_sk = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+ reuse_sk = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, udp_ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index b7c56867314e..3616225c89ef 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -39,6 +39,7 @@ u32 inet6_ehashfn(const struct net *net,
return __inet6_ehashfn(lhash, lport, fhash, fport,
inet6_ehash_secret + net_hash_mix(net));
}
+EXPORT_SYMBOL_GPL(inet6_ehashfn);

/*
* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so
@@ -111,18 +112,22 @@ static inline int compute_score(struct sock *sk, struct net *net,
return score;
}

+INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);
+
struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
struct sk_buff *skb, int doff,
const struct in6_addr *saddr,
__be16 sport,
const struct in6_addr *daddr,
- unsigned short hnum)
+ unsigned short hnum,
+ inet6_ehashfn_t *ehashfn)
{
struct sock *reuse_sk = NULL;
u32 phash;

if (sk->sk_reuseport) {
- phash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
+ phash = INDIRECT_CALL_INET(ehashfn, udp6_ehashfn, inet6_ehashfn,
+ net, daddr, hnum, saddr, sport);
reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
}
return reuse_sk;
@@ -145,7 +150,7 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
score = compute_score(sk, net, hnum, daddr, dif, sdif);
if (score > hiscore) {
result = inet6_lookup_reuseport(net, sk, skb, doff,
- saddr, sport, daddr, hnum);
+ saddr, sport, daddr, hnum, inet6_ehashfn);
if (result)
return result;

@@ -176,7 +181,8 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

- reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
+ reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
+ saddr, sport, daddr, hnum, inet6_ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8b3cb1d7da7c..ebac9200b15c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -70,11 +70,12 @@ int udpv6_init_sock(struct sock *sk)
return 0;
}

-static u32 udp6_ehashfn(const struct net *net,
- const struct in6_addr *laddr,
- const u16 lport,
- const struct in6_addr *faddr,
- const __be16 fport)
+INDIRECT_CALLABLE_SCOPE
+u32 udp6_ehashfn(const struct net *net,
+ const struct in6_addr *laddr,
+ const u16 lport,
+ const struct in6_addr *faddr,
+ const __be16 fport)
{
static u32 udp6_ehash_secret __read_mostly;
static u32 udp_ipv6_hash_secret __read_mostly;
@@ -159,24 +160,6 @@ static int compute_score(struct sock *sk, struct net *net,
return score;
}

-static struct sock *lookup_reuseport(struct net *net, struct sock *sk,
- struct sk_buff *skb,
- const struct in6_addr *saddr,
- __be16 sport,
- const struct in6_addr *daddr,
- unsigned int hnum)
-{
- struct sock *reuse_sk = NULL;
- u32 hash;
-
- if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
- hash = udp6_ehashfn(net, daddr, hnum, saddr, sport);
- reuse_sk = reuseport_select_sock(sk, hash, skb,
- sizeof(struct udphdr));
- }
- return reuse_sk;
-}
-
/* called with rcu_read_lock() */
static struct sock *udp6_lib_lookup2(struct net *net,
const struct in6_addr *saddr, __be16 sport,
@@ -194,7 +177,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
daddr, hnum, dif, sdif);
if (score > badness) {
badness = score;
- result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+
+ if (sk->sk_state == TCP_ESTABLISHED) {
+ result = sk;
+ continue;
+ }
+
+ result = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, udp6_ehashfn);
if (!result) {
result = sk;
continue;
@@ -234,7 +224,8 @@ static inline struct sock *udp6_lookup_run_bpf(struct net *net,
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

- reuse_sk = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+ reuse_sk = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, udp6_ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;

--
2.40.1


2023-06-28 10:33:45

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.

The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:

sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed

Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.

Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.

Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
Cc: Joe Stringer <[email protected]>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
---
include/net/inet6_hashtables.h | 56 ++++++++++++++++++++++++++++++++++++++----
include/net/inet_hashtables.h | 49 ++++++++++++++++++++++++++++++++++--
include/net/sock.h | 7 ++++--
include/uapi/linux/bpf.h | 3 ---
net/core/filter.c | 2 --
net/ipv4/udp.c | 8 ++++--
net/ipv6/udp.c | 10 +++++---
tools/include/uapi/linux/bpf.h | 3 ---
8 files changed, 116 insertions(+), 22 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index a6722d6ef80f..7d677b89f269 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -103,6 +103,46 @@ static inline struct sock *__inet6_lookup(struct net *net,
daddr, hnum, dif, sdif);
}

+static inline
+struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
+ const struct in6_addr *saddr, const __be16 sport,
+ const struct in6_addr *daddr, const __be16 dport,
+ bool *refcounted, inet6_ehashfn_t *ehashfn)
+{
+ struct sock *sk, *reuse_sk;
+ bool prefetched;
+
+ sk = skb_steal_sock(skb, refcounted, &prefetched);
+ if (!sk)
+ return NULL;
+
+ if (!prefetched)
+ return sk;
+
+ if (sk->sk_protocol == IPPROTO_TCP) {
+ if (sk->sk_state != TCP_LISTEN)
+ return sk;
+ } else if (sk->sk_protocol == IPPROTO_UDP) {
+ if (sk->sk_state != TCP_CLOSE)
+ return sk;
+ } else {
+ return sk;
+ }
+
+ reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
+ saddr, sport, daddr, ntohs(dport),
+ ehashfn);
+ if (!reuse_sk || reuse_sk == sk)
+ return sk;
+
+ /* We've chosen a new reuseport sock which is never refcounted. This
+ * implies that sk also isn't refcounted.
+ */
+ WARN_ON_ONCE(*refcounted);
+
+ return reuse_sk;
+}
+
static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
const __be16 sport,
@@ -110,14 +150,20 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
int iif, int sdif,
bool *refcounted)
{
- struct sock *sk = skb_steal_sock(skb, refcounted);
-
+ struct net *net = dev_net(skb_dst(skb)->dev);
+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+ struct sock *sk;
+
+ sk = inet6_steal_sock(net, skb, doff, &ip6h->saddr, sport, &ip6h->daddr, dport,
+ refcounted, inet6_ehashfn);
+ if (IS_ERR(sk))
+ return NULL;
if (sk)
return sk;

- return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
- doff, &ipv6_hdr(skb)->saddr, sport,
- &ipv6_hdr(skb)->daddr, ntohs(dport),
+ return __inet6_lookup(net, hashinfo, skb,
+ doff, &ip6h->saddr, sport,
+ &ip6h->daddr, ntohs(dport),
iif, sdif, refcounted);
}

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index c0532cc7587f..c6ae0af12ce0 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -449,6 +449,46 @@ static inline struct sock *inet_lookup(struct net *net,
return sk;
}

+static inline
+struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
+ const __be32 saddr, const __be16 sport,
+ const __be32 daddr, const __be16 dport,
+ bool *refcounted, inet_ehashfn_t *ehashfn)
+{
+ struct sock *sk, *reuse_sk;
+ bool prefetched;
+
+ sk = skb_steal_sock(skb, refcounted, &prefetched);
+ if (!sk)
+ return NULL;
+
+ if (!prefetched)
+ return sk;
+
+ if (sk->sk_protocol == IPPROTO_TCP) {
+ if (sk->sk_state != TCP_LISTEN)
+ return sk;
+ } else if (sk->sk_protocol == IPPROTO_UDP) {
+ if (sk->sk_state != TCP_CLOSE)
+ return sk;
+ } else {
+ return sk;
+ }
+
+ reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
+ saddr, sport, daddr, ntohs(dport),
+ ehashfn);
+ if (!reuse_sk || reuse_sk == sk)
+ return sk;
+
+ /* We've chosen a new reuseport sock which is never refcounted. This
+ * implies that sk also isn't refcounted.
+ */
+ WARN_ON_ONCE(*refcounted);
+
+ return reuse_sk;
+}
+
static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
struct sk_buff *skb,
int doff,
@@ -457,13 +497,18 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
const int sdif,
bool *refcounted)
{
- struct sock *sk = skb_steal_sock(skb, refcounted);
+ struct net *net = dev_net(skb_dst(skb)->dev);
const struct iphdr *iph = ip_hdr(skb);
+ struct sock *sk;

+ sk = inet_steal_sock(net, skb, doff, iph->saddr, sport, iph->daddr, dport,
+ refcounted, inet_ehashfn);
+ if (IS_ERR(sk))
+ return NULL;
if (sk)
return sk;

- return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
+ return __inet_lookup(net, hashinfo, skb,
doff, iph->saddr, sport,
iph->daddr, dport, inet_iif(skb), sdif,
refcounted);
diff --git a/include/net/sock.h b/include/net/sock.h
index 656ea89f60ff..5645570c2a64 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2806,20 +2806,23 @@ sk_is_refcounted(struct sock *sk)
* skb_steal_sock - steal a socket from an sk_buff
* @skb: sk_buff to steal the socket from
* @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
*/
static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted)
+skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
{
if (skb->sk) {
struct sock *sk = skb->sk;

*refcounted = true;
- if (skb_sk_is_prefetched(skb))
+ *prefetched = skb_sk_is_prefetched(skb);
+ if (*prefetched)
*refcounted = sk_is_refcounted(sk);
skb->destructor = NULL;
skb->sk = NULL;
return sk;
}
+ *prefetched = false;
*refcounted = false;
return NULL;
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a7b5e91dd768..d6fb6f43b0f3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4158,9 +4158,6 @@ union bpf_attr {
* **-EOPNOTSUPP** if the operation is not supported, for example
* a call from outside of TC ingress.
*
- * **-ESOCKTNOSUPPORT** if the socket type is not supported
- * (reuseport).
- *
* long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
diff --git a/net/core/filter.c b/net/core/filter.c
index 428df050d021..d4be0a1d754c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7278,8 +7278,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
return -EOPNOTSUPP;
if (unlikely(dev_net(skb->dev) != sock_net(sk)))
return -ENETUNREACH;
- if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
- return -ESOCKTNOSUPPORT;
if (sk_is_refcounted(sk) &&
unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
return -ENOENT;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index eb79268f216d..b256f1f73b4d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2388,7 +2388,11 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
if (udp4_csum_init(skb, uh, proto))
goto csum_error;

- sk = skb_steal_sock(skb, &refcounted);
+ sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
+ &refcounted, udp_ehashfn);
+ if (IS_ERR(sk))
+ goto no_sk;
+
if (sk) {
struct dst_entry *dst = skb_dst(skb);
int ret;
@@ -2409,7 +2413,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
if (sk)
return udp_unicast_rcv_skb(sk, skb, uh);
-
+no_sk:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
nf_reset_ct(skb);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8a6d94cabee0..2d4c05bc322a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -923,9 +923,9 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;
const struct in6_addr *saddr, *daddr;
struct net *net = dev_net(skb->dev);
+ bool refcounted;
struct udphdr *uh;
struct sock *sk;
- bool refcounted;
u32 ulen = 0;

if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -962,7 +962,11 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
goto csum_error;

/* Check if the socket is already available, e.g. due to early demux */
- sk = skb_steal_sock(skb, &refcounted);
+ sk = inet6_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
+ &refcounted, udp6_ehashfn);
+ if (IS_ERR(sk))
+ goto no_sk;
+
if (sk) {
struct dst_entry *dst = skb_dst(skb);
int ret;
@@ -996,7 +1000,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
goto report_csum_error;
return udp6_unicast_rcv_skb(sk, skb, uh);
}
-
+no_sk:
reason = SKB_DROP_REASON_NO_SOCKET;

if (!uh->check)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a7b5e91dd768..d6fb6f43b0f3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4158,9 +4158,6 @@ union bpf_attr {
* **-EOPNOTSUPP** if the operation is not supported, for example
* a call from outside of TC ingress.
*
- * **-ESOCKTNOSUPPORT** if the socket type is not supported
- * (reuseport).
- *
* long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This

--
2.40.1


2023-06-28 18:49:29

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 3/7] net: remove duplicate reuseport_lookup functions

From: Lorenz Bauer <[email protected]>
Date: Wed, 28 Jun 2023 10:48:18 +0100
> There are currently four copies of reuseport_lookup: one each for
> (TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of
> those functions as well. This is already the case for sk_lookup
> helpers (inet,inet6,udp4,udp6)_lookup_run_bpf.
>
> There are two differences between the reuseport_lookup helpers:
>
> 1. They call different hash functions depending on protocol
> 2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED
>
> Move the check for sk_state into the caller and use the INDIRECT_CALL
> infrastructure to cut down the helpers to one per IP version.
>
> Signed-off-by: Lorenz Bauer <[email protected]>

Reviewed-by: Kuniyuki Iwashima <[email protected]>

2 nits below.


> ---
> include/net/inet6_hashtables.h | 11 ++++++++++-
> include/net/inet_hashtables.h | 15 ++++++++++-----
> net/ipv4/inet_hashtables.c | 20 +++++++++++++-------
> net/ipv4/udp.c | 34 +++++++++++++---------------------
> net/ipv6/inet6_hashtables.c | 14 ++++++++++----
> net/ipv6/udp.c | 41 ++++++++++++++++-------------------------
> 6 files changed, 72 insertions(+), 63 deletions(-)
>
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 032ddab48f8f..f89320b6fee3 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -48,12 +48,21 @@ struct sock *__inet6_lookup_established(struct net *net,
> const u16 hnum, const int dif,
> const int sdif);
>
> +typedef u32 (inet6_ehashfn_t)(const struct net *net,
> + const struct in6_addr *laddr, const u16 lport,
> + const struct in6_addr *faddr, const __be16 fport);
> +
> +inet6_ehashfn_t inet6_ehashfn;
> +
> +INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);

We need not define udp6_ehashfn() here as inet6_hashtables.c has
the definition.

Only inet6_ehashfn() is needed because sk_ehashfn() uses it.


> +
> struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> struct sk_buff *skb, int doff,
> const struct in6_addr *saddr,
> __be16 sport,
> const struct in6_addr *daddr,
> - unsigned short hnum);
> + unsigned short hnum,
> + inet6_ehashfn_t *ehashfn);
>
> struct sock *inet6_lookup_listener(struct net *net,
> struct inet_hashinfo *hashinfo,
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 8734f3488f5d..ddfa2e67fdb5 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -379,10 +379,19 @@ struct sock *__inet_lookup_established(struct net *net,
> const __be32 daddr, const u16 hnum,
> const int dif, const int sdif);
>
> +typedef u32 (inet_ehashfn_t)(const struct net *net,
> + const __be32 laddr, const __u16 lport,
> + const __be32 faddr, const __be16 fport);
> +
> +inet_ehashfn_t inet_ehashfn;
> +
> +INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);
> +

We don't need inet_ehashfn() and udp_ehashfn() declarations here.


> struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> struct sk_buff *skb, int doff,
> __be32 saddr, __be16 sport,
> - __be32 daddr, unsigned short hnum);
> + __be32 daddr, unsigned short hnum,
> + inet_ehashfn_t *ehashfn);
>
> static inline struct sock *
> inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
> @@ -453,10 +462,6 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
> refcounted);
> }
>
> -u32 inet6_ehashfn(const struct net *net,
> - const struct in6_addr *laddr, const u16 lport,
> - const struct in6_addr *faddr, const __be16 fport);
> -
> static inline void sk_daddr_set(struct sock *sk, __be32 addr)
> {
> sk->sk_daddr = addr; /* alias of inet_daddr */
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 920131e4a65d..352eb371c93b 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -28,9 +28,9 @@
> #include <net/tcp.h>
> #include <net/sock_reuseport.h>
>
> -static u32 inet_ehashfn(const struct net *net, const __be32 laddr,
> - const __u16 lport, const __be32 faddr,
> - const __be16 fport)
> +u32 inet_ehashfn(const struct net *net, const __be32 laddr,
> + const __u16 lport, const __be32 faddr,
> + const __be16 fport)
> {
> static u32 inet_ehash_secret __read_mostly;
>
> @@ -39,6 +39,7 @@ static u32 inet_ehashfn(const struct net *net, const __be32 laddr,
> return __inet_ehashfn(laddr, lport, faddr, fport,
> inet_ehash_secret + net_hash_mix(net));
> }
> +EXPORT_SYMBOL_GPL(inet_ehashfn);
>
> /* This function handles inet_sock, but also timewait and request sockets
> * for IPv4/IPv6.
> @@ -332,16 +333,20 @@ static inline int compute_score(struct sock *sk, struct net *net,
> return score;
> }
>
> +INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);
> +
> struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> struct sk_buff *skb, int doff,
> __be32 saddr, __be16 sport,
> - __be32 daddr, unsigned short hnum)
> + __be32 daddr, unsigned short hnum,
> + inet_ehashfn_t *ehashfn)
> {
> struct sock *reuse_sk = NULL;
> u32 phash;
>
> if (sk->sk_reuseport) {
> - phash = inet_ehashfn(net, daddr, hnum, saddr, sport);
> + phash = INDIRECT_CALL_2(ehashfn, udp_ehashfn, inet_ehashfn,
> + net, daddr, hnum, saddr, sport);
> reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
> }
> return reuse_sk;
> @@ -371,7 +376,7 @@ static struct sock *inet_lhash2_lookup(struct net *net,
> score = compute_score(sk, net, hnum, daddr, dif, sdif);
> if (score > hiscore) {
> result = inet_lookup_reuseport(net, sk, skb, doff,
> - saddr, sport, daddr, hnum);
> + saddr, sport, daddr, hnum, inet_ehashfn);
> if (result)
> return result;
>
> @@ -400,7 +405,8 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
> if (no_reuseport || IS_ERR_OR_NULL(sk))
> return sk;
>
> - reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
> + reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum,
> + inet_ehashfn);
> if (reuse_sk)
> sk = reuse_sk;
> return sk;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 5ef478d2c408..7258edece691 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -405,9 +405,9 @@ static int compute_score(struct sock *sk, struct net *net,
> return score;
> }
>
> -static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
> - const __u16 lport, const __be32 faddr,
> - const __be16 fport)
> +INDIRECT_CALLABLE_SCOPE
> +u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport,
> + const __be32 faddr, const __be16 fport)
> {
> static u32 udp_ehash_secret __read_mostly;
>
> @@ -417,22 +417,6 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
> udp_ehash_secret + net_hash_mix(net));
> }
>
> -static struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> - struct sk_buff *skb,
> - __be32 saddr, __be16 sport,
> - __be32 daddr, unsigned short hnum)
> -{
> - struct sock *reuse_sk = NULL;
> - u32 hash;
> -
> - if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
> - hash = udp_ehashfn(net, daddr, hnum, saddr, sport);
> - reuse_sk = reuseport_select_sock(sk, hash, skb,
> - sizeof(struct udphdr));
> - }
> - return reuse_sk;
> -}
> -
> /* called with rcu_read_lock() */
> static struct sock *udp4_lib_lookup2(struct net *net,
> __be32 saddr, __be16 sport,
> @@ -451,7 +435,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> daddr, hnum, dif, sdif);
> if (score > badness) {
> badness = score;
> - result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
> +
> + if (sk->sk_state == TCP_ESTABLISHED) {
> + result = sk;
> + continue;
> + }
> +
> + result = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
> + saddr, sport, daddr, hnum, udp_ehashfn);
> if (!result) {
> result = sk;
> continue;
> @@ -490,7 +481,8 @@ static struct sock *udp4_lookup_run_bpf(struct net *net,
> if (no_reuseport || IS_ERR_OR_NULL(sk))
> return sk;
>
> - reuse_sk = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
> + reuse_sk = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
> + saddr, sport, daddr, hnum, udp_ehashfn);
> if (reuse_sk)
> sk = reuse_sk;
> return sk;
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index b7c56867314e..3616225c89ef 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -39,6 +39,7 @@ u32 inet6_ehashfn(const struct net *net,
> return __inet6_ehashfn(lhash, lport, fhash, fport,
> inet6_ehash_secret + net_hash_mix(net));
> }
> +EXPORT_SYMBOL_GPL(inet6_ehashfn);
>
> /*
> * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so
> @@ -111,18 +112,22 @@ static inline int compute_score(struct sock *sk, struct net *net,
> return score;
> }
>
> +INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);
> +
> struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> struct sk_buff *skb, int doff,
> const struct in6_addr *saddr,
> __be16 sport,
> const struct in6_addr *daddr,
> - unsigned short hnum)
> + unsigned short hnum,
> + inet6_ehashfn_t *ehashfn)
> {
> struct sock *reuse_sk = NULL;
> u32 phash;
>
> if (sk->sk_reuseport) {
> - phash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
> + phash = INDIRECT_CALL_INET(ehashfn, udp6_ehashfn, inet6_ehashfn,
> + net, daddr, hnum, saddr, sport);
> reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
> }
> return reuse_sk;
> @@ -145,7 +150,7 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
> score = compute_score(sk, net, hnum, daddr, dif, sdif);
> if (score > hiscore) {
> result = inet6_lookup_reuseport(net, sk, skb, doff,
> - saddr, sport, daddr, hnum);
> + saddr, sport, daddr, hnum, inet6_ehashfn);
> if (result)
> return result;
>
> @@ -176,7 +181,8 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
> if (no_reuseport || IS_ERR_OR_NULL(sk))
> return sk;
>
> - reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum);
> + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> + saddr, sport, daddr, hnum, inet6_ehashfn);
> if (reuse_sk)
> sk = reuse_sk;
> return sk;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 8b3cb1d7da7c..ebac9200b15c 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -70,11 +70,12 @@ int udpv6_init_sock(struct sock *sk)
> return 0;
> }
>
> -static u32 udp6_ehashfn(const struct net *net,
> - const struct in6_addr *laddr,
> - const u16 lport,
> - const struct in6_addr *faddr,
> - const __be16 fport)
> +INDIRECT_CALLABLE_SCOPE
> +u32 udp6_ehashfn(const struct net *net,
> + const struct in6_addr *laddr,
> + const u16 lport,
> + const struct in6_addr *faddr,
> + const __be16 fport)
> {
> static u32 udp6_ehash_secret __read_mostly;
> static u32 udp_ipv6_hash_secret __read_mostly;
> @@ -159,24 +160,6 @@ static int compute_score(struct sock *sk, struct net *net,
> return score;
> }
>
> -static struct sock *lookup_reuseport(struct net *net, struct sock *sk,
> - struct sk_buff *skb,
> - const struct in6_addr *saddr,
> - __be16 sport,
> - const struct in6_addr *daddr,
> - unsigned int hnum)
> -{
> - struct sock *reuse_sk = NULL;
> - u32 hash;
> -
> - if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
> - hash = udp6_ehashfn(net, daddr, hnum, saddr, sport);
> - reuse_sk = reuseport_select_sock(sk, hash, skb,
> - sizeof(struct udphdr));
> - }
> - return reuse_sk;
> -}
> -
> /* called with rcu_read_lock() */
> static struct sock *udp6_lib_lookup2(struct net *net,
> const struct in6_addr *saddr, __be16 sport,
> @@ -194,7 +177,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> daddr, hnum, dif, sdif);
> if (score > badness) {
> badness = score;
> - result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
> +
> + if (sk->sk_state == TCP_ESTABLISHED) {
> + result = sk;
> + continue;
> + }
> +
> + result = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
> + saddr, sport, daddr, hnum, udp6_ehashfn);
> if (!result) {
> result = sk;
> continue;
> @@ -234,7 +224,8 @@ static inline struct sock *udp6_lookup_run_bpf(struct net *net,
> if (no_reuseport || IS_ERR_OR_NULL(sk))
> return sk;
>
> - reuse_sk = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
> + reuse_sk = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
> + saddr, sport, daddr, hnum, udp6_ehashfn);
> if (reuse_sk)
> sk = reuse_sk;
> return sk;
>
> --
> 2.40.1

2023-06-28 18:53:50

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

From: Lorenz Bauer <[email protected]>
Date: Wed, 28 Jun 2023 10:48:21 +0100
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy,
> which configures SO_REUSEPORT on its sockets. In turn, we're blocked
> from removing TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the
> bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
> one of the reuseport sockets is selected by hash. This could cause
> dispatch to the "wrong" socket:
>
> sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
> bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regards to refcounting as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straight forward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann <[email protected]>
> Signed-off-by: Daniel Borkmann <[email protected]>
> Signed-off-by: Lorenz Bauer <[email protected]>
> Cc: Joe Stringer <[email protected]>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/

Reviewed-by: Kuniyuki Iwashima <[email protected]>

I left minor comments below, but this series overall looks good.

Thanks!


> ---
> include/net/inet6_hashtables.h | 56 ++++++++++++++++++++++++++++++++++++++----
> include/net/inet_hashtables.h | 49 ++++++++++++++++++++++++++++++++++--
> include/net/sock.h | 7 ++++--
> include/uapi/linux/bpf.h | 3 ---
> net/core/filter.c | 2 --
> net/ipv4/udp.c | 8 ++++--
> net/ipv6/udp.c | 10 +++++---
> tools/include/uapi/linux/bpf.h | 3 ---
> 8 files changed, 116 insertions(+), 22 deletions(-)
>
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index a6722d6ef80f..7d677b89f269 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -103,6 +103,46 @@ static inline struct sock *__inet6_lookup(struct net *net,
> daddr, hnum, dif, sdif);
> }
>
> +static inline
> +struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> + const struct in6_addr *saddr, const __be16 sport,
> + const struct in6_addr *daddr, const __be16 dport,
> + bool *refcounted, inet6_ehashfn_t *ehashfn)
> +{
> + struct sock *sk, *reuse_sk;
> + bool prefetched;
> +
> + sk = skb_steal_sock(skb, refcounted, &prefetched);
> + if (!sk)
> + return NULL;
> +
> + if (!prefetched)
> + return sk;
> +
> + if (sk->sk_protocol == IPPROTO_TCP) {
> + if (sk->sk_state != TCP_LISTEN)
> + return sk;
> + } else if (sk->sk_protocol == IPPROTO_UDP) {
> + if (sk->sk_state != TCP_CLOSE)
> + return sk;
> + } else {
> + return sk;
> + }
> +
> + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> + saddr, sport, daddr, ntohs(dport),
> + ehashfn);
> + if (!reuse_sk || reuse_sk == sk)

nit: compiler might have optimised though, given here is the fast path,
we can save reuse_sk == sk check.


> + return sk;
> +
> + /* We've chosen a new reuseport sock which is never refcounted. This
> + * implies that sk also isn't refcounted.
> + */
> + WARN_ON_ONCE(*refcounted);
> +
> + return reuse_sk;
> +}
> +
> static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
> struct sk_buff *skb, int doff,
> const __be16 sport,
> @@ -110,14 +150,20 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
> int iif, int sdif,
> bool *refcounted)
> {
> - struct sock *sk = skb_steal_sock(skb, refcounted);
> -
> + struct net *net = dev_net(skb_dst(skb)->dev);
> + const struct ipv6hdr *ip6h = ipv6_hdr(skb);
> + struct sock *sk;
> +
> + sk = inet6_steal_sock(net, skb, doff, &ip6h->saddr, sport, &ip6h->daddr, dport,
> + refcounted, inet6_ehashfn);
> + if (IS_ERR(sk))
> + return NULL;
> if (sk)
> return sk;
>
> - return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
> - doff, &ipv6_hdr(skb)->saddr, sport,
> - &ipv6_hdr(skb)->daddr, ntohs(dport),
> + return __inet6_lookup(net, hashinfo, skb,
> + doff, &ip6h->saddr, sport,
> + &ip6h->daddr, ntohs(dport),
> iif, sdif, refcounted);
> }
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index c0532cc7587f..c6ae0af12ce0 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -449,6 +449,46 @@ static inline struct sock *inet_lookup(struct net *net,
> return sk;
> }
>
> +static inline
> +struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> + const __be32 saddr, const __be16 sport,
> + const __be32 daddr, const __be16 dport,
> + bool *refcounted, inet_ehashfn_t *ehashfn)
> +{
> + struct sock *sk, *reuse_sk;
> + bool prefetched;
> +
> + sk = skb_steal_sock(skb, refcounted, &prefetched);
> + if (!sk)
> + return NULL;
> +
> + if (!prefetched)
> + return sk;
> +
> + if (sk->sk_protocol == IPPROTO_TCP) {
> + if (sk->sk_state != TCP_LISTEN)
> + return sk;
> + } else if (sk->sk_protocol == IPPROTO_UDP) {
> + if (sk->sk_state != TCP_CLOSE)
> + return sk;
> + } else {
> + return sk;
> + }
> +
> + reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
> + saddr, sport, daddr, ntohs(dport),
> + ehashfn);
> + if (!reuse_sk || reuse_sk == sk)

Same here.


> + return sk;
> +
> + /* We've chosen a new reuseport sock which is never refcounted. This
> + * implies that sk also isn't refcounted.
> + */
> + WARN_ON_ONCE(*refcounted);
> +
> + return reuse_sk;
> +}
> +
> static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
> struct sk_buff *skb,
> int doff,
> @@ -457,13 +497,18 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
> const int sdif,
> bool *refcounted)
> {
> - struct sock *sk = skb_steal_sock(skb, refcounted);
> + struct net *net = dev_net(skb_dst(skb)->dev);
> const struct iphdr *iph = ip_hdr(skb);
> + struct sock *sk;
>
> + sk = inet_steal_sock(net, skb, doff, iph->saddr, sport, iph->daddr, dport,
> + refcounted, inet_ehashfn);
> + if (IS_ERR(sk))
> + return NULL;
> if (sk)
> return sk;
>
> - return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
> + return __inet_lookup(net, hashinfo, skb,
> doff, iph->saddr, sport,
> iph->daddr, dport, inet_iif(skb), sdif,
> refcounted);
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 656ea89f60ff..5645570c2a64 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2806,20 +2806,23 @@ sk_is_refcounted(struct sock *sk)
> * skb_steal_sock - steal a socket from an sk_buff
> * @skb: sk_buff to steal the socket from
> * @refcounted: is set to true if the socket is reference-counted
> + * @prefetched: is set to true if the socket was assigned from bpf
> */
> static inline struct sock *
> -skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> +skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
> {
> if (skb->sk) {
> struct sock *sk = skb->sk;
>
> *refcounted = true;
> - if (skb_sk_is_prefetched(skb))
> + *prefetched = skb_sk_is_prefetched(skb);
> + if (*prefetched)
> *refcounted = sk_is_refcounted(sk);
> skb->destructor = NULL;
> skb->sk = NULL;
> return sk;
> }
> + *prefetched = false;
> *refcounted = false;
> return NULL;
> }
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a7b5e91dd768..d6fb6f43b0f3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4158,9 +4158,6 @@ union bpf_attr {
> * **-EOPNOTSUPP** if the operation is not supported, for example
> * a call from outside of TC ingress.
> *
> - * **-ESOCKTNOSUPPORT** if the socket type is not supported
> - * (reuseport).
> - *
> * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
> * Description
> * Helper is overloaded depending on BPF program type. This
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 428df050d021..d4be0a1d754c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7278,8 +7278,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> return -EOPNOTSUPP;
> if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> return -ENETUNREACH;
> - if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
> - return -ESOCKTNOSUPPORT;
> if (sk_is_refcounted(sk) &&
> unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> return -ENOENT;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index eb79268f216d..b256f1f73b4d 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2388,7 +2388,11 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
> if (udp4_csum_init(skb, uh, proto))
> goto csum_error;
>
> - sk = skb_steal_sock(skb, &refcounted);
> + sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
> + &refcounted, udp_ehashfn);
> + if (IS_ERR(sk))
> + goto no_sk;
> +
> if (sk) {
> struct dst_entry *dst = skb_dst(skb);
> int ret;
> @@ -2409,7 +2413,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
> sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
> if (sk)
> return udp_unicast_rcv_skb(sk, skb, uh);
> -
> +no_sk:
> if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
> goto drop;
> nf_reset_ct(skb);
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 8a6d94cabee0..2d4c05bc322a 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -923,9 +923,9 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
> enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;
> const struct in6_addr *saddr, *daddr;
> struct net *net = dev_net(skb->dev);
> + bool refcounted;
> struct udphdr *uh;
> struct sock *sk;
> - bool refcounted;
> u32 ulen = 0;
>
> if (!pskb_may_pull(skb, sizeof(struct udphdr)))
> @@ -962,7 +962,11 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
> goto csum_error;
>
> /* Check if the socket is already available, e.g. due to early demux */
> - sk = skb_steal_sock(skb, &refcounted);
> + sk = inet6_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
> + &refcounted, udp6_ehashfn);
> + if (IS_ERR(sk))
> + goto no_sk;
> +
> if (sk) {
> struct dst_entry *dst = skb_dst(skb);
> int ret;
> @@ -996,7 +1000,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
> goto report_csum_error;
> return udp6_unicast_rcv_skb(sk, skb, uh);
> }
> -
> +no_sk:
> reason = SKB_DROP_REASON_NO_SOCKET;
>
> if (!uh->check)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index a7b5e91dd768..d6fb6f43b0f3 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4158,9 +4158,6 @@ union bpf_attr {
> * **-EOPNOTSUPP** if the operation is not supported, for example
> * a call from outside of TC ingress.
> *
> - * **-ESOCKTNOSUPPORT** if the socket type is not supported
> - * (reuseport).
> - *
> * long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
> * Description
> * Helper is overloaded depending on BPF program type. This
>
> --
> 2.40.1

2023-06-28 18:54:01

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 4/7] net: document inet[6]_lookup_reuseport sk_state requirements

From: Lorenz Bauer <[email protected]>
Date: Wed, 28 Jun 2023 10:48:19 +0100
> The current implementation was extracted from inet[6]_lhash2_lookup
> in commit 80b373f74f9e ("inet: Extract helper for selecting socket
> from reuseport group") and commit 5df6531292b5 ("inet6: Extract helper
> for selecting socket from reuseport group"). In the original context,
> sk is always in TCP_LISTEN state and so did not have a separate check.
>
> Add documentation that specifies which sk_state are valid to pass to
> the function.
>
> Signed-off-by: Lorenz Bauer <[email protected]>

Reviewed-by: Kuniyuki Iwashima <[email protected]>

2 nit below.


> ---
> net/ipv4/inet_hashtables.c | 14 ++++++++++++++
> net/ipv6/inet6_hashtables.c | 14 ++++++++++++++
> 2 files changed, 28 insertions(+)
>
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 352eb371c93b..ac927a635a6f 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -335,6 +335,20 @@ static inline int compute_score(struct sock *sk, struct net *net,
>
> INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);
>
> +/**
> + * inet_lookup_reuseport() - execute reuseport logic on AF_INET socket if necessary.
> + * @net: network namespace.
> + * @sk: AF_INET socket, must be in TCP_LISTEN state for TCP or TCP_CLOSE for UDP.
> + * @skb: context for a potential SK_REUSEPORT program.
> + * @doff: header offset.
> + * @saddr: source address.
> + * @sport: source port.
> + * @daddr: destination address.
> + * @hnum: destination port in host byte order.

It seems you forgot to copy-and-paste the ehashfn description.

---8<---
+ * @ehashfn: hash function used to generate the fallback hash.
---8<---
https://lore.kernel.org/all/[email protected]/


> + *
> + * Return: NULL if sk doesn't have SO_REUSEPORT set, otherwise a pointer to
> + * the selected sock or an error.
> + */
> struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
> struct sk_buff *skb, int doff,
> __be32 saddr, __be16 sport,
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index 3616225c89ef..d37602fabc00 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -114,6 +114,20 @@ static inline int compute_score(struct sock *sk, struct net *net,
>
> INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);
>
> +/**
> + * inet6_lookup_reuseport() - execute reuseport logic on AF_INET6 socket if necessary.
> + * @net: network namespace.
> + * @sk: AF_INET6 socket, must be in TCP_LISTEN state for TCP or TCP_CLOSE for UDP.
> + * @skb: context for a potential SK_REUSEPORT program.
> + * @doff: header offset.
> + * @saddr: source address.
> + * @sport: source port.
> + * @daddr: destination address.
> + * @hnum: destination port in host byte order.

Same here.


> + *
> + * Return: NULL if sk doesn't have SO_REUSEPORT set, otherwise a pointer to
> + * the selected sock or an error.
> + */
> struct sock *inet6_lookup_reuseport(struct net *net, struct sock *sk,
> struct sk_buff *skb, int doff,
> const struct in6_addr *saddr,
>
> --
> 2.40.1

2023-06-28 19:00:51

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

From: Lorenz Bauer <[email protected]>
Date: Wed, 28 Jun 2023 10:48:21 +0100
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index a6722d6ef80f..7d677b89f269 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -103,6 +103,46 @@ static inline struct sock *__inet6_lookup(struct net *net,
> daddr, hnum, dif, sdif);
> }
>
> +static inline
> +struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> + const struct in6_addr *saddr, const __be16 sport,
> + const struct in6_addr *daddr, const __be16 dport,
> + bool *refcounted, inet6_ehashfn_t *ehashfn)
> +{
> + struct sock *sk, *reuse_sk;
> + bool prefetched;
> +
> + sk = skb_steal_sock(skb, refcounted, &prefetched);
> + if (!sk)
> + return NULL;
> +
> + if (!prefetched)
> + return sk;
> +
> + if (sk->sk_protocol == IPPROTO_TCP) {
> + if (sk->sk_state != TCP_LISTEN)
> + return sk;
> + } else if (sk->sk_protocol == IPPROTO_UDP) {
> + if (sk->sk_state != TCP_CLOSE)
> + return sk;
> + } else {
> + return sk;
> + }
> +
> + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> + saddr, sport, daddr, ntohs(dport),
> + ehashfn);
> + if (!reuse_sk || reuse_sk == sk)
> + return sk;
> +
> + /* We've chosen a new reuseport sock which is never refcounted. This
> + * implies that sk also isn't refcounted.
> + */
> + WARN_ON_ONCE(*refcounted);

One more nit.

WARN_ON_ONCE() should be tested before inet6?_lookup_reuseport() not to
miss the !reuse_sk case.


> +
> + return reuse_sk;
> +}

2023-07-03 10:01:24

by Lorenz Bauer

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 3/7] net: remove duplicate reuseport_lookup functions

On Wed, Jun 28, 2023 at 7:33 PM Kuniyuki Iwashima <[email protected]> wrote:

> > +
> > +inet6_ehashfn_t inet6_ehashfn;
> > +
> > +INDIRECT_CALLABLE_DECLARE(inet6_ehashfn_t udp6_ehashfn);
>
> We need not define udp6_ehashfn() here as inet6_hashtables.c has
> the definition.
>
> Only inet6_ehashfn() is needed because sk_ehashfn() uses it.

Without udp6_ehashfn we get the following error, as reported by Simon
against v1:

net/ipv4/udp.c:410:5: error: no previous prototype for ‘udp_ehashfn’
[-Werror=missing-prototypes]
410 | u32 udp_ehashfn(const struct net *net, const __be32 laddr,
const __u16 lport,
| ^~~~~~~~~~~

> > +inet_ehashfn_t inet_ehashfn;
> > +
> > +INDIRECT_CALLABLE_DECLARE(inet_ehashfn_t udp_ehashfn);
> > +
>
> We don't need inet_ehashfn() and udp_ehashfn() declarations here.

Without inet_ehashfn I get:

./include/net/inet_hashtables.h: In function ‘__inet_lookup_skb’:
./include/net/inet_hashtables.h:501:42: error: ‘inet_ehashfn’
undeclared (first use in this function); did you mean ‘inet_bhashfn’?
501 | refcounted, inet_ehashfn);

Same problem with the warning as above.

I think this needs to stay the way it is.

2023-07-03 10:05:12

by Lorenz Bauer

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

On Wed, Jun 28, 2023 at 7:54 PM Kuniyuki Iwashima <[email protected]> wrote:

> > + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> > + saddr, sport, daddr, ntohs(dport),
> > + ehashfn);
> > + if (!reuse_sk || reuse_sk == sk)
> > + return sk;
> > +
> > + /* We've chosen a new reuseport sock which is never refcounted. This
> > + * implies that sk also isn't refcounted.
> > + */
> > + WARN_ON_ONCE(*refcounted);
>
> One more nit.
>
> WARN_ON_ONCE() should be tested before inet6?_lookup_reuseport() not to
> miss the !reuse_sk case.

I was just pondering that as well, but I came to the opposite
conclusion. In the !reuse_sk case we don't really know anything about
sk, except that it isn't part of a reuseport group. How can we be sure
that it's not refcounted?

2023-07-03 10:13:47

by Lorenz Bauer

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

On Wed, Jun 28, 2023 at 7:50 PM Kuniyuki Iwashima <[email protected]> wrote:

> > + } else {
> > + return sk;
> > + }
> > +
> > + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> > + saddr, sport, daddr, ntohs(dport),
> > + ehashfn);
> > + if (!reuse_sk || reuse_sk == sk)
>
> nit: compiler might have optimised though, given here is the fast path,
> we can save reuse_sk == sk check.

Ack.

2023-07-06 00:58:18

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

From: Lorenz Bauer <[email protected]>
Date: Mon, 3 Jul 2023 10:57:23 +0100
> On Wed, Jun 28, 2023 at 7:54 PM Kuniyuki Iwashima <[email protected]> wrote:
>
> > > + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> > > + saddr, sport, daddr, ntohs(dport),
> > > + ehashfn);
> > > + if (!reuse_sk || reuse_sk == sk)
> > > + return sk;
> > > +
> > > + /* We've chosen a new reuseport sock which is never refcounted. This
> > > + * implies that sk also isn't refcounted.
> > > + */
> > > + WARN_ON_ONCE(*refcounted);
> >
> > One more nit.
> >
> > WARN_ON_ONCE() should be tested before inet6?_lookup_reuseport() not to
> > miss the !reuse_sk case.
>
> I was just pondering that as well, but I came to the opposite
> conclusion. In the !reuse_sk case we don't really know anything about
> sk, except that it isn't part of a reuseport group. How can we be sure
> that it's not refcounted?

Sorry for late reply.

What we know about sk before inet6?_lookup_reuseport() are

(1) sk was full socket in bpf_sk_assign()
(2) sk had SOCK_RCU_FREE in bpf_sk_assign()
(3) sk was TCP_LISTEN here if TCP

After bpf_sk_assign(), reqsk is never converted to fullsock, and UDP
never clears SOCK_RCU_FREE. If sk is TCP, now we are in the RCU grace
period and confirmed sk->sk_state was TCP_LISTEN. Then, TCP_LISTEN sk
cannot be reused and SOCK_RCU_FREE is never cleared.

So, before/after inet6?_lookup_reuseport(), the fact that sk is not
refcounted here should not change in spite of that reuse_sk is NULL.

What do you think ?

2023-07-06 08:32:51

by Lorenz Bauer

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

On Thu, Jul 6, 2023 at 1:41 AM Kuniyuki Iwashima <[email protected]> wrote:
>
> Sorry for late reply.
>
> What we know about sk before inet6?_lookup_reuseport() are
>
> (1) sk was full socket in bpf_sk_assign()
> (2) sk had SOCK_RCU_FREE in bpf_sk_assign()
> (3) sk was TCP_LISTEN here if TCP

Are we looking at the same bpf_sk_assign? Confusingly there are two
very similarly named functions. The one we care about is:

BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
{
if (!sk || flags != 0)
return -EINVAL;
if (!skb_at_tc_ingress(skb))
return -EOPNOTSUPP;
if (unlikely(dev_net(skb->dev) != sock_net(sk)))
return -ENETUNREACH;
if (sk_is_refcounted(sk) &&
unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
return -ENOENT;

skb_orphan(skb);
skb->sk = sk;
skb->destructor = sock_pfree;

return 0;
}

From this we can't tell what state the socket is in or whether it is
RCU freed or not.

Thanks
Lorenz

2023-07-06 15:45:35

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

From: Lorenz Bauer <[email protected]>
Date: Thu, 6 Jul 2023 09:11:15 +0100
> On Thu, Jul 6, 2023 at 1:41 AM Kuniyuki Iwashima <[email protected]> wrote:
> >
> > Sorry for late reply.
> >
> > What we know about sk before inet6?_lookup_reuseport() are
> >
> > (1) sk was full socket in bpf_sk_assign()
> > (2) sk had SOCK_RCU_FREE in bpf_sk_assign()
> > (3) sk was TCP_LISTEN here if TCP
>
> Are we looking at the same bpf_sk_assign? Confusingly there are two
> very similarly named functions. The one we care about is:
>
> BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> {
> if (!sk || flags != 0)
> return -EINVAL;
> if (!skb_at_tc_ingress(skb))
> return -EOPNOTSUPP;
> if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> return -ENETUNREACH;
> if (sk_is_refcounted(sk) &&
> unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> return -ENOENT;
>
> skb_orphan(skb);
> skb->sk = sk;
> skb->destructor = sock_pfree;
>
> return 0;
> }
>
> From this we can't tell what state the socket is in or whether it is
> RCU freed or not.

But we can in inet6?_steal_sock() by calling sk_is_refcounted() again
via skb_steal_sock().

In inet6?_steal_sock(), we call inet6?_lookup_reuseport() only for
sk that was a TCP listener or UDP non-connected socket until just before
the sk_state checks. Then, we know *refcounted should be false for such
sockets even before inet6?_lookup_reuseport().

After the checks, sk might be poped out of the reuseport group before
inet6?_lookup_reuseport() and reuse_sk might be NULL, but it's not
related because *refcounted is a value for sk, not for reuse_sk.

2023-07-11 01:35:27

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH bpf-next v4 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

On Thu, Jul 6, 2023 at 8:33 AM Kuniyuki Iwashima <[email protected]> wrote:
>
> From: Lorenz Bauer <[email protected]>
> Date: Thu, 6 Jul 2023 09:11:15 +0100
> > On Thu, Jul 6, 2023 at 1:41 AM Kuniyuki Iwashima <[email protected]> wrote:
> > >
> > > Sorry for late reply.
> > >
> > > What we know about sk before inet6?_lookup_reuseport() are
> > >
> > > (1) sk was full socket in bpf_sk_assign()
> > > (2) sk had SOCK_RCU_FREE in bpf_sk_assign()
> > > (3) sk was TCP_LISTEN here if TCP
> >
> > Are we looking at the same bpf_sk_assign? Confusingly there are two
> > very similarly named functions. The one we care about is:
> >
> > BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
> > {
> > if (!sk || flags != 0)
> > return -EINVAL;
> > if (!skb_at_tc_ingress(skb))
> > return -EOPNOTSUPP;
> > if (unlikely(dev_net(skb->dev) != sock_net(sk)))
> > return -ENETUNREACH;
> > if (sk_is_refcounted(sk) &&
> > unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
> > return -ENOENT;
> >
> > skb_orphan(skb);
> > skb->sk = sk;
> > skb->destructor = sock_pfree;
> >
> > return 0;
> > }
> >
> > From this we can't tell what state the socket is in or whether it is
> > RCU freed or not.
>
> But we can in inet6?_steal_sock() by calling sk_is_refcounted() again
> via skb_steal_sock().
>
> In inet6?_steal_sock(), we call inet6?_lookup_reuseport() only for
> sk that was a TCP listener or UDP non-connected socket until just before
> the sk_state checks. Then, we know *refcounted should be false for such
> sockets even before inet6?_lookup_reuseport().
>
> After the checks, sk might be poped out of the reuseport group before
> inet6?_lookup_reuseport() and reuse_sk might be NULL, but it's not
> related because *refcounted is a value for sk, not for reuse_sk.

I was about to apply v5 before I noticed this discussion on v4.
Sounds like v6 will be needed.
Next time please continue discussion in the latest version.