2023-07-04 13:59:13

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v5 0/7] Add SO_REUSEPORT support for TC bpf_sk_assign

We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assing.

I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.

Joint work with Daniel Borkmann.

Signed-off-by: Lorenz Bauer <[email protected]>
---
Changes in v5:
- Drop reuse_sk == sk check in inet[6]_steal_stock (Kuniyuki)
- Link to v4: https://lore.kernel.org/r/[email protected]

Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)

---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper

Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: remove duplicate reuseport_lookup functions
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

include/net/inet6_hashtables.h | 81 ++++++++-
include/net/inet_hashtables.h | 74 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 68 ++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 71 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 658 insertions(+), 179 deletions(-)
---
base-commit: c20f9cef725bc6b19efe372696e8000fb5af0d46
change-id: 20230613-so-reuseport-e92c526173ee

Best regards,
--
Lorenz Bauer <[email protected]>



2023-07-04 14:00:14

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v5 1/7] udp: re-score reuseport groups when connected sockets are present

Contrary to TCP, UDP reuseport groups can contain TCP_ESTABLISHED
sockets. To support these properly we remember whether a group has
a connected socket and skip the fast reuseport early-return. In
effect we continue scoring all reuseport sockets and then choose the
one with the highest score.

The current code fails to re-calculate the score for the result of
lookup_reuseport. According to Kuniyuki Iwashima:

1) SO_INCOMING_CPU is set
-> selected sk might have +1 score

2) BPF prog returns ESTABLISHED and/or SO_INCOMING_CPU sk
-> selected sk will have more than 8

Using the old score could trigger more lookups depending on the
order that sockets are created.

sk -> sk (SO_INCOMING_CPU) -> sk (ESTABLISHED)
| |
`-> select the next SO_INCOMING_CPU sk
|
`-> select itself (We should save this lookup)

Fixes: efc6b6f6c311 ("udp: Improve load balancing for SO_REUSEPORT.")
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
---
net/ipv4/udp.c | 20 +++++++++++++++-----
net/ipv6/udp.c | 19 ++++++++++++++-----
2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 42a96b3547c9..c62d5e1c6675 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -451,14 +451,24 @@ static struct sock *udp4_lib_lookup2(struct net *net,
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
- result = lookup_reuseport(net, sk, skb,
- saddr, sport, daddr, hnum);
+ badness = score;
+ result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+ if (!result) {
+ result = sk;
+ continue;
+ }
+
/* Fall back to scoring if group has connections */
- if (result && !reuseport_has_conns(sk))
+ if (!reuseport_has_conns(sk))
return result;

- result = result ? : sk;
- badness = score;
+ /* Reuseport logic returned an error, keep original score. */
+ if (IS_ERR(result))
+ continue;
+
+ badness = compute_score(result, net, saddr, sport,
+ daddr, hnum, dif, sdif);
+
}
}
return result;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 317b01c9bc39..dca8bec2eeb1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -193,14 +193,23 @@ static struct sock *udp6_lib_lookup2(struct net *net,
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
- result = lookup_reuseport(net, sk, skb,
- saddr, sport, daddr, hnum);
+ badness = score;
+ result = lookup_reuseport(net, sk, skb, saddr, sport, daddr, hnum);
+ if (!result) {
+ result = sk;
+ continue;
+ }
+
/* Fall back to scoring if group has connections */
- if (result && !reuseport_has_conns(sk))
+ if (!reuseport_has_conns(sk))
return result;

- result = result ? : sk;
- badness = score;
+ /* Reuseport logic returned an error, keep original score. */
+ if (IS_ERR(result))
+ continue;
+
+ badness = compute_score(sk, net, saddr, sport,
+ daddr, hnum, dif, sdif);
}
}
return result;

--
2.40.1


2023-07-04 14:00:46

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v5 5/7] net: remove duplicate sk_lookup helpers

Now that inet[6]_lookup_reuseport are parameterised on the ehashfn
we can remove two sk_lookup helpers.

Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
---
include/net/inet6_hashtables.h | 9 +++++++++
include/net/inet_hashtables.h | 7 +++++++
net/ipv4/inet_hashtables.c | 26 +++++++++++++-------------
net/ipv4/udp.c | 32 +++++---------------------------
net/ipv6/inet6_hashtables.c | 31 ++++++++++++++++---------------
net/ipv6/udp.c | 34 +++++-----------------------------
6 files changed, 55 insertions(+), 84 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index f89320b6fee3..a6722d6ef80f 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -73,6 +73,15 @@ struct sock *inet6_lookup_listener(struct net *net,
const unsigned short hnum,
const int dif, const int sdif);

+struct sock *inet6_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ const struct in6_addr *saddr,
+ const __be16 sport,
+ const struct in6_addr *daddr,
+ const u16 hnum, const int dif,
+ inet6_ehashfn_t *ehashfn);
+
static inline struct sock *__inet6_lookup(struct net *net,
struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ddfa2e67fdb5..c0532cc7587f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -393,6 +393,13 @@ struct sock *inet_lookup_reuseport(struct net *net, struct sock *sk,
__be32 daddr, unsigned short hnum,
inet_ehashfn_t *ehashfn);

+struct sock *inet_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ __be32 saddr, __be16 sport,
+ __be32 daddr, u16 hnum, const int dif,
+ inet_ehashfn_t *ehashfn);
+
static inline struct sock *
inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
const __be32 saddr, const __be16 sport,
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 64fc1bd3fb63..75b1ed1b89f2 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -403,25 +403,23 @@ static struct sock *inet_lhash2_lookup(struct net *net,
return result;
}

-static inline struct sock *inet_lookup_run_bpf(struct net *net,
- struct inet_hashinfo *hashinfo,
- struct sk_buff *skb, int doff,
- __be32 saddr, __be16 sport,
- __be32 daddr, u16 hnum, const int dif)
+struct sock *inet_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ __be32 saddr, __be16 sport,
+ __be32 daddr, u16 hnum, const int dif,
+ inet_ehashfn_t *ehashfn)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;

- if (hashinfo != net->ipv4.tcp_death_row.hashinfo)
- return NULL; /* only TCP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP, saddr, sport,
+ no_reuseport = bpf_sk_lookup_run_v4(net, protocol, saddr, sport,
daddr, hnum, dif, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

reuse_sk = inet_lookup_reuseport(net, sk, skb, doff, saddr, sport, daddr, hnum,
- inet_ehashfn);
+ ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
@@ -439,9 +437,11 @@ struct sock *__inet_lookup_listener(struct net *net,
unsigned int hash2;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- result = inet_lookup_run_bpf(net, hashinfo, skb, doff,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ hashinfo == net->ipv4.tcp_death_row.hashinfo) {
+ result = inet_lookup_run_sk_lookup(net, IPPROTO_TCP, skb, doff,
+ saddr, sport, daddr, hnum, dif,
+ inet_ehashfn);
if (result)
goto done;
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 55f683b31c93..045eca6ed177 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -465,30 +465,6 @@ static struct sock *udp4_lib_lookup2(struct net *net,
return result;
}

-static struct sock *udp4_lookup_run_bpf(struct net *net,
- struct udp_table *udptable,
- struct sk_buff *skb,
- __be32 saddr, __be16 sport,
- __be32 daddr, u16 hnum, const int dif)
-{
- struct sock *sk, *reuse_sk;
- bool no_reuseport;
-
- if (udptable != net->ipv4.udp_table)
- return NULL; /* only UDP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_UDP, saddr, sport,
- daddr, hnum, dif, &sk);
- if (no_reuseport || IS_ERR_OR_NULL(sk))
- return sk;
-
- reuse_sk = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
- saddr, sport, daddr, hnum, udp_ehashfn);
- if (reuse_sk)
- sk = reuse_sk;
- return sk;
-}
-
/* UDP is nearly always wildcards out the wazoo, it makes no sense to try
* harder than this. -DaveM
*/
@@ -513,9 +489,11 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
goto done;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- sk = udp4_lookup_run_bpf(net, udptable, skb,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ udptable == net->ipv4.udp_table) {
+ sk = inet_lookup_run_sk_lookup(net, IPPROTO_UDP, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, dif,
+ udp_ehashfn);
if (sk) {
result = sk;
goto done;
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index f76dbbb29332..7c9700c7c9c8 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -177,31 +177,30 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
return result;
}

-static inline struct sock *inet6_lookup_run_bpf(struct net *net,
- struct inet_hashinfo *hashinfo,
- struct sk_buff *skb, int doff,
- const struct in6_addr *saddr,
- const __be16 sport,
- const struct in6_addr *daddr,
- const u16 hnum, const int dif)
+struct sock *inet6_lookup_run_sk_lookup(struct net *net,
+ int protocol,
+ struct sk_buff *skb, int doff,
+ const struct in6_addr *saddr,
+ const __be16 sport,
+ const struct in6_addr *daddr,
+ const u16 hnum, const int dif,
+ inet6_ehashfn_t *ehashfn)
{
struct sock *sk, *reuse_sk;
bool no_reuseport;

- if (hashinfo != net->ipv4.tcp_death_row.hashinfo)
- return NULL; /* only TCP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP, saddr, sport,
+ no_reuseport = bpf_sk_lookup_run_v6(net, protocol, saddr, sport,
daddr, hnum, dif, &sk);
if (no_reuseport || IS_ERR_OR_NULL(sk))
return sk;

reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
- saddr, sport, daddr, hnum, inet6_ehashfn);
+ saddr, sport, daddr, hnum, ehashfn);
if (reuse_sk)
sk = reuse_sk;
return sk;
}
+EXPORT_SYMBOL_GPL(inet6_lookup_run_sk_lookup);

struct sock *inet6_lookup_listener(struct net *net,
struct inet_hashinfo *hashinfo,
@@ -215,9 +214,11 @@ struct sock *inet6_lookup_listener(struct net *net,
unsigned int hash2;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- result = inet6_lookup_run_bpf(net, hashinfo, skb, doff,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ hashinfo == net->ipv4.tcp_death_row.hashinfo) {
+ result = inet6_lookup_run_sk_lookup(net, IPPROTO_TCP, skb, doff,
+ saddr, sport, daddr, hnum, dif,
+ inet6_ehashfn);
if (result)
goto done;
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5c1c61a5a401..ac3899e6112c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -205,32 +205,6 @@ static struct sock *udp6_lib_lookup2(struct net *net,
return result;
}

-static inline struct sock *udp6_lookup_run_bpf(struct net *net,
- struct udp_table *udptable,
- struct sk_buff *skb,
- const struct in6_addr *saddr,
- __be16 sport,
- const struct in6_addr *daddr,
- u16 hnum, const int dif)
-{
- struct sock *sk, *reuse_sk;
- bool no_reuseport;
-
- if (udptable != net->ipv4.udp_table)
- return NULL; /* only UDP is supported */
-
- no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_UDP, saddr, sport,
- daddr, hnum, dif, &sk);
- if (no_reuseport || IS_ERR_OR_NULL(sk))
- return sk;
-
- reuse_sk = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
- saddr, sport, daddr, hnum, udp6_ehashfn);
- if (reuse_sk)
- sk = reuse_sk;
- return sk;
-}
-
/* rcu_read_lock() must be held */
struct sock *__udp6_lib_lookup(struct net *net,
const struct in6_addr *saddr, __be16 sport,
@@ -255,9 +229,11 @@ struct sock *__udp6_lib_lookup(struct net *net,
goto done;

/* Lookup redirect from BPF */
- if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
- sk = udp6_lookup_run_bpf(net, udptable, skb,
- saddr, sport, daddr, hnum, dif);
+ if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
+ udptable == net->ipv4.udp_table) {
+ sk = inet6_lookup_run_sk_lookup(net, IPPROTO_UDP, skb, sizeof(struct udphdr),
+ saddr, sport, daddr, hnum, dif,
+ udp6_ehashfn);
if (sk) {
result = sk;
goto done;

--
2.40.1


2023-07-04 14:07:16

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v5 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.

The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:

sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed

Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.

Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.

Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Joe Stringer <[email protected]>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
---
include/net/inet6_hashtables.h | 56 ++++++++++++++++++++++++++++++++++++++----
include/net/inet_hashtables.h | 49 ++++++++++++++++++++++++++++++++++--
include/net/sock.h | 7 ++++--
include/uapi/linux/bpf.h | 3 ---
net/core/filter.c | 2 --
net/ipv4/udp.c | 8 ++++--
net/ipv6/udp.c | 10 +++++---
tools/include/uapi/linux/bpf.h | 3 ---
8 files changed, 116 insertions(+), 22 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index a6722d6ef80f..284b5ce7205d 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -103,6 +103,46 @@ static inline struct sock *__inet6_lookup(struct net *net,
daddr, hnum, dif, sdif);
}

+static inline
+struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
+ const struct in6_addr *saddr, const __be16 sport,
+ const struct in6_addr *daddr, const __be16 dport,
+ bool *refcounted, inet6_ehashfn_t *ehashfn)
+{
+ struct sock *sk, *reuse_sk;
+ bool prefetched;
+
+ sk = skb_steal_sock(skb, refcounted, &prefetched);
+ if (!sk)
+ return NULL;
+
+ if (!prefetched)
+ return sk;
+
+ if (sk->sk_protocol == IPPROTO_TCP) {
+ if (sk->sk_state != TCP_LISTEN)
+ return sk;
+ } else if (sk->sk_protocol == IPPROTO_UDP) {
+ if (sk->sk_state != TCP_CLOSE)
+ return sk;
+ } else {
+ return sk;
+ }
+
+ reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
+ saddr, sport, daddr, ntohs(dport),
+ ehashfn);
+ if (!reuse_sk)
+ return sk;
+
+ /* We've chosen a new reuseport sock which is never refcounted. This
+ * implies that sk also isn't refcounted.
+ */
+ WARN_ON_ONCE(*refcounted);
+
+ return reuse_sk;
+}
+
static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
struct sk_buff *skb, int doff,
const __be16 sport,
@@ -110,14 +150,20 @@ static inline struct sock *__inet6_lookup_skb(struct inet_hashinfo *hashinfo,
int iif, int sdif,
bool *refcounted)
{
- struct sock *sk = skb_steal_sock(skb, refcounted);
-
+ struct net *net = dev_net(skb_dst(skb)->dev);
+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+ struct sock *sk;
+
+ sk = inet6_steal_sock(net, skb, doff, &ip6h->saddr, sport, &ip6h->daddr, dport,
+ refcounted, inet6_ehashfn);
+ if (IS_ERR(sk))
+ return NULL;
if (sk)
return sk;

- return __inet6_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
- doff, &ipv6_hdr(skb)->saddr, sport,
- &ipv6_hdr(skb)->daddr, ntohs(dport),
+ return __inet6_lookup(net, hashinfo, skb,
+ doff, &ip6h->saddr, sport,
+ &ip6h->daddr, ntohs(dport),
iif, sdif, refcounted);
}

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index c0532cc7587f..1177effabed3 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -449,6 +449,46 @@ static inline struct sock *inet_lookup(struct net *net,
return sk;
}

+static inline
+struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
+ const __be32 saddr, const __be16 sport,
+ const __be32 daddr, const __be16 dport,
+ bool *refcounted, inet_ehashfn_t *ehashfn)
+{
+ struct sock *sk, *reuse_sk;
+ bool prefetched;
+
+ sk = skb_steal_sock(skb, refcounted, &prefetched);
+ if (!sk)
+ return NULL;
+
+ if (!prefetched)
+ return sk;
+
+ if (sk->sk_protocol == IPPROTO_TCP) {
+ if (sk->sk_state != TCP_LISTEN)
+ return sk;
+ } else if (sk->sk_protocol == IPPROTO_UDP) {
+ if (sk->sk_state != TCP_CLOSE)
+ return sk;
+ } else {
+ return sk;
+ }
+
+ reuse_sk = inet_lookup_reuseport(net, sk, skb, doff,
+ saddr, sport, daddr, ntohs(dport),
+ ehashfn);
+ if (!reuse_sk)
+ return sk;
+
+ /* We've chosen a new reuseport sock which is never refcounted. This
+ * implies that sk also isn't refcounted.
+ */
+ WARN_ON_ONCE(*refcounted);
+
+ return reuse_sk;
+}
+
static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
struct sk_buff *skb,
int doff,
@@ -457,13 +497,18 @@ static inline struct sock *__inet_lookup_skb(struct inet_hashinfo *hashinfo,
const int sdif,
bool *refcounted)
{
- struct sock *sk = skb_steal_sock(skb, refcounted);
+ struct net *net = dev_net(skb_dst(skb)->dev);
const struct iphdr *iph = ip_hdr(skb);
+ struct sock *sk;

+ sk = inet_steal_sock(net, skb, doff, iph->saddr, sport, iph->daddr, dport,
+ refcounted, inet_ehashfn);
+ if (IS_ERR(sk))
+ return NULL;
if (sk)
return sk;

- return __inet_lookup(dev_net(skb_dst(skb)->dev), hashinfo, skb,
+ return __inet_lookup(net, hashinfo, skb,
doff, iph->saddr, sport,
iph->daddr, dport, inet_iif(skb), sdif,
refcounted);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2eb916d1ff64..320f00e69ae9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2814,20 +2814,23 @@ sk_is_refcounted(struct sock *sk)
* skb_steal_sock - steal a socket from an sk_buff
* @skb: sk_buff to steal the socket from
* @refcounted: is set to true if the socket is reference-counted
+ * @prefetched: is set to true if the socket was assigned from bpf
*/
static inline struct sock *
-skb_steal_sock(struct sk_buff *skb, bool *refcounted)
+skb_steal_sock(struct sk_buff *skb, bool *refcounted, bool *prefetched)
{
if (skb->sk) {
struct sock *sk = skb->sk;

*refcounted = true;
- if (skb_sk_is_prefetched(skb))
+ *prefetched = skb_sk_is_prefetched(skb);
+ if (*prefetched)
*refcounted = sk_is_refcounted(sk);
skb->destructor = NULL;
skb->sk = NULL;
return sk;
}
+ *prefetched = false;
*refcounted = false;
return NULL;
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60a9d59beeab..d8cbae822025 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4159,9 +4159,6 @@ union bpf_attr {
* **-EOPNOTSUPP** if the operation is not supported, for example
* a call from outside of TC ingress.
*
- * **-ESOCKTNOSUPPORT** if the socket type is not supported
- * (reuseport).
- *
* long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
diff --git a/net/core/filter.c b/net/core/filter.c
index 06ba0e56e369..24a97daa9bed 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7356,8 +7356,6 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
return -EOPNOTSUPP;
if (unlikely(dev_net(skb->dev) != sock_net(sk)))
return -ENETUNREACH;
- if (unlikely(sk_fullsock(sk) && sk->sk_reuseport))
- return -ESOCKTNOSUPPORT;
if (sk_is_refcounted(sk) &&
unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
return -ENOENT;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 045eca6ed177..ec1a5f8a2eca 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2388,7 +2388,11 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
if (udp4_csum_init(skb, uh, proto))
goto csum_error;

- sk = skb_steal_sock(skb, &refcounted);
+ sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
+ &refcounted, udp_ehashfn);
+ if (IS_ERR(sk))
+ goto no_sk;
+
if (sk) {
struct dst_entry *dst = skb_dst(skb);
int ret;
@@ -2409,7 +2413,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
if (sk)
return udp_unicast_rcv_skb(sk, skb, uh);
-
+no_sk:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
nf_reset_ct(skb);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index ac3899e6112c..ec427f5531f6 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -923,9 +923,9 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;
const struct in6_addr *saddr, *daddr;
struct net *net = dev_net(skb->dev);
+ bool refcounted;
struct udphdr *uh;
struct sock *sk;
- bool refcounted;
u32 ulen = 0;

if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -962,7 +962,11 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
goto csum_error;

/* Check if the socket is already available, e.g. due to early demux */
- sk = skb_steal_sock(skb, &refcounted);
+ sk = inet6_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
+ &refcounted, udp6_ehashfn);
+ if (IS_ERR(sk))
+ goto no_sk;
+
if (sk) {
struct dst_entry *dst = skb_dst(skb);
int ret;
@@ -996,7 +1000,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
goto report_csum_error;
return udp6_unicast_rcv_skb(sk, skb, uh);
}
-
+no_sk:
reason = SKB_DROP_REASON_NO_SOCKET;

if (!uh->check)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60a9d59beeab..d8cbae822025 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4159,9 +4159,6 @@ union bpf_attr {
* **-EOPNOTSUPP** if the operation is not supported, for example
* a call from outside of TC ingress.
*
- * **-ESOCKTNOSUPPORT** if the socket type is not supported
- * (reuseport).
- *
* long bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This

--
2.40.1


2023-07-04 14:09:32

by Lorenz Bauer

[permalink] [raw]
Subject: [PATCH bpf-next v5 7/7] selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper

From: Daniel Borkmann <[email protected]>

We use two programs to check that the new reuseport logic is executed
appropriately.

The first is a TC clsact program which bpf_sk_assigns
the skb to a UDP or TCP socket created by user space. Since the test
communicates via lo we see both directions of packets in the eBPF.
Traffic ingressing to the reuseport socket is identified by looking
at the destination port. For TCP, we additionally need to make sure
that we only assign the initial SYN packets towards our listening
socket. The network stack then creates a request socket which
transitions to ESTABLISHED after the 3WHS.

The second is a reuseport program which shares the fact that
it has been executed with user space. This tells us that the delayed
lookup mechanism is working.

Signed-off-by: Daniel Borkmann <[email protected]>
Co-developed-by: Lorenz Bauer <[email protected]>
Signed-off-by: Lorenz Bauer <[email protected]>
Cc: Joe Stringer <[email protected]>
---
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
3 files changed, 342 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..8a33bcea97de 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -423,6 +423,9 @@ struct nstoken *open_netns(const char *name)

void close_netns(struct nstoken *token)
{
+ if (!token)
+ return;
+
ASSERT_OK(setns(token->orig_netns_fd, CLONE_NEWNET), "setns");
close(token->orig_netns_fd);
free(token);
diff --git a/tools/testing/selftests/bpf/prog_tests/assign_reuse.c b/tools/testing/selftests/bpf/prog_tests/assign_reuse.c
new file mode 100644
index 000000000000..622f123410f4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/assign_reuse.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <uapi/linux/if_link.h>
+#include <test_progs.h>
+
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+
+#include "network_helpers.h"
+#include "test_assign_reuse.skel.h"
+
+#define NS_TEST "assign_reuse"
+#define LOOPBACK 1
+#define PORT 4443
+
+static int attach_reuseport(int sock_fd, int prog_fd)
+{
+ return setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+ &prog_fd, sizeof(prog_fd));
+}
+
+static __u64 cookie(int fd)
+{
+ __u64 cookie = 0;
+ socklen_t cookie_len = sizeof(cookie);
+ int ret;
+
+ ret = getsockopt(fd, SOL_SOCKET, SO_COOKIE, &cookie, &cookie_len);
+ ASSERT_OK(ret, "cookie");
+ ASSERT_GT(cookie, 0, "cookie_invalid");
+
+ return cookie;
+}
+
+static int echo_test_udp(int fd_sv)
+{
+ struct sockaddr_storage addr = {};
+ socklen_t len = sizeof(addr);
+ char buff[1] = {};
+ int fd_cl = -1, ret;
+
+ fd_cl = connect_to_fd(fd_sv, 100);
+ ASSERT_GT(fd_cl, 0, "create_client");
+ ASSERT_EQ(getsockname(fd_cl, (void *)&addr, &len), 0, "getsockname");
+
+ ASSERT_EQ(send(fd_cl, buff, sizeof(buff), 0), 1, "send_client");
+
+ ret = recv(fd_sv, buff, sizeof(buff), 0);
+ if (ret < 0)
+ return errno;
+
+ ASSERT_EQ(ret, 1, "recv_server");
+ ASSERT_EQ(sendto(fd_sv, buff, sizeof(buff), 0, (void *)&addr, len), 1, "send_server");
+ ASSERT_EQ(recv(fd_cl, buff, sizeof(buff), 0), 1, "recv_client");
+ close(fd_cl);
+ return 0;
+}
+
+static int echo_test_tcp(int fd_sv)
+{
+ char buff[1] = {};
+ int fd_cl = -1, fd_sv_cl = -1;
+
+ fd_cl = connect_to_fd(fd_sv, 100);
+ if (fd_cl < 0)
+ return errno;
+
+ fd_sv_cl = accept(fd_sv, NULL, NULL);
+ ASSERT_GE(fd_sv_cl, 0, "accept_fd");
+
+ ASSERT_EQ(send(fd_cl, buff, sizeof(buff), 0), 1, "send_client");
+ ASSERT_EQ(recv(fd_sv_cl, buff, sizeof(buff), 0), 1, "recv_server");
+ ASSERT_EQ(send(fd_sv_cl, buff, sizeof(buff), 0), 1, "send_server");
+ ASSERT_EQ(recv(fd_cl, buff, sizeof(buff), 0), 1, "recv_client");
+ close(fd_sv_cl);
+ close(fd_cl);
+ return 0;
+}
+
+void run_assign_reuse(int family, int sotype, const char *ip, __u16 port)
+{
+ DECLARE_LIBBPF_OPTS(bpf_tc_hook, tc_hook,
+ .ifindex = LOOPBACK,
+ .attach_point = BPF_TC_INGRESS,
+ );
+ DECLARE_LIBBPF_OPTS(bpf_tc_opts, tc_opts,
+ .handle = 1,
+ .priority = 1,
+ );
+ bool hook_created = false, tc_attached = false;
+ int ret, fd_tc, fd_accept, fd_drop, fd_map;
+ int *fd_sv = NULL;
+ __u64 fd_val;
+ struct test_assign_reuse *skel;
+ const int zero = 0;
+
+ skel = test_assign_reuse__open();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ skel->rodata->dest_port = port;
+
+ ret = test_assign_reuse__load(skel);
+ if (!ASSERT_OK(ret, "skel_load"))
+ goto cleanup;
+
+ ASSERT_EQ(skel->bss->sk_cookie_seen, 0, "cookie_init");
+
+ fd_tc = bpf_program__fd(skel->progs.tc_main);
+ fd_accept = bpf_program__fd(skel->progs.reuse_accept);
+ fd_drop = bpf_program__fd(skel->progs.reuse_drop);
+ fd_map = bpf_map__fd(skel->maps.sk_map);
+
+ fd_sv = start_reuseport_server(family, sotype, ip, port, 100, 1);
+ if (!ASSERT_NEQ(fd_sv, NULL, "start_reuseport_server"))
+ goto cleanup;
+
+ ret = attach_reuseport(*fd_sv, fd_drop);
+ if (!ASSERT_OK(ret, "attach_reuseport"))
+ goto cleanup;
+
+ fd_val = *fd_sv;
+ ret = bpf_map_update_elem(fd_map, &zero, &fd_val, BPF_NOEXIST);
+ if (!ASSERT_OK(ret, "bpf_sk_map"))
+ goto cleanup;
+
+ ret = bpf_tc_hook_create(&tc_hook);
+ if (ret == 0)
+ hook_created = true;
+ ret = ret == -EEXIST ? 0 : ret;
+ if (!ASSERT_OK(ret, "bpf_tc_hook_create"))
+ goto cleanup;
+
+ tc_opts.prog_fd = fd_tc;
+ ret = bpf_tc_attach(&tc_hook, &tc_opts);
+ if (!ASSERT_OK(ret, "bpf_tc_attach"))
+ goto cleanup;
+ tc_attached = true;
+
+ if (sotype == SOCK_STREAM)
+ ASSERT_EQ(echo_test_tcp(*fd_sv), ECONNREFUSED, "drop_tcp");
+ else
+ ASSERT_EQ(echo_test_udp(*fd_sv), EAGAIN, "drop_udp");
+ ASSERT_EQ(skel->bss->reuseport_executed, 1, "program executed once");
+
+ skel->bss->sk_cookie_seen = 0;
+ skel->bss->reuseport_executed = 0;
+ ASSERT_OK(attach_reuseport(*fd_sv, fd_accept), "attach_reuseport(accept)");
+
+ if (sotype == SOCK_STREAM)
+ ASSERT_EQ(echo_test_tcp(*fd_sv), 0, "echo_tcp");
+ else
+ ASSERT_EQ(echo_test_udp(*fd_sv), 0, "echo_udp");
+
+ ASSERT_EQ(skel->bss->sk_cookie_seen, cookie(*fd_sv),
+ "cookie_mismatch");
+ ASSERT_EQ(skel->bss->reuseport_executed, 1, "program executed once");
+cleanup:
+ if (tc_attached) {
+ tc_opts.flags = tc_opts.prog_fd = tc_opts.prog_id = 0;
+ ret = bpf_tc_detach(&tc_hook, &tc_opts);
+ ASSERT_OK(ret, "bpf_tc_detach");
+ }
+ if (hook_created) {
+ tc_hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS;
+ bpf_tc_hook_destroy(&tc_hook);
+ }
+ test_assign_reuse__destroy(skel);
+ free_fds(fd_sv, 1);
+}
+
+void test_assign_reuse(void)
+{
+ struct nstoken *tok = NULL;
+
+ SYS(out, "ip netns add %s", NS_TEST);
+ SYS(cleanup, "ip -net %s link set dev lo up", NS_TEST);
+
+ tok = open_netns(NS_TEST);
+ if (!ASSERT_OK_PTR(tok, "netns token"))
+ return;
+
+ if (test__start_subtest("tcpv4"))
+ run_assign_reuse(AF_INET, SOCK_STREAM, "127.0.0.1", PORT);
+ if (test__start_subtest("tcpv6"))
+ run_assign_reuse(AF_INET6, SOCK_STREAM, "::1", PORT);
+ if (test__start_subtest("udpv4"))
+ run_assign_reuse(AF_INET, SOCK_DGRAM, "127.0.0.1", PORT);
+ if (test__start_subtest("udpv6"))
+ run_assign_reuse(AF_INET6, SOCK_DGRAM, "::1", PORT);
+
+cleanup:
+ close_netns(tok);
+ SYS_NOFAIL("ip netns delete %s", NS_TEST);
+out:
+ return;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_assign_reuse.c b/tools/testing/selftests/bpf/progs/test_assign_reuse.c
new file mode 100644
index 000000000000..4f2e2321ea06
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_assign_reuse.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/pkt_cls.h>
+
+char LICENSE[] SEC("license") = "GPL";
+
+__u64 sk_cookie_seen;
+__u64 reuseport_executed;
+union {
+ struct tcphdr tcp;
+ struct udphdr udp;
+} headers;
+
+const volatile __u16 dest_port;
+
+struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, __u64);
+} sk_map SEC(".maps");
+
+SEC("sk_reuseport")
+int reuse_accept(struct sk_reuseport_md *ctx)
+{
+ reuseport_executed++;
+
+ if (ctx->ip_protocol == IPPROTO_TCP) {
+ if (ctx->data + sizeof(headers.tcp) > ctx->data_end)
+ return SK_DROP;
+
+ if (__builtin_memcmp(&headers.tcp, ctx->data, sizeof(headers.tcp)) != 0)
+ return SK_DROP;
+ } else if (ctx->ip_protocol == IPPROTO_UDP) {
+ if (ctx->data + sizeof(headers.udp) > ctx->data_end)
+ return SK_DROP;
+
+ if (__builtin_memcmp(&headers.udp, ctx->data, sizeof(headers.udp)) != 0)
+ return SK_DROP;
+ } else {
+ return SK_DROP;
+ }
+
+ sk_cookie_seen = bpf_get_socket_cookie(ctx->sk);
+ return SK_PASS;
+}
+
+SEC("sk_reuseport")
+int reuse_drop(struct sk_reuseport_md *ctx)
+{
+ reuseport_executed++;
+ sk_cookie_seen = 0;
+ return SK_DROP;
+}
+
+static int
+assign_sk(struct __sk_buff *skb)
+{
+ int zero = 0, ret = 0;
+ struct bpf_sock *sk;
+
+ sk = bpf_map_lookup_elem(&sk_map, &zero);
+ if (!sk)
+ return TC_ACT_SHOT;
+ ret = bpf_sk_assign(skb, sk, 0);
+ bpf_sk_release(sk);
+ return ret ? TC_ACT_SHOT : TC_ACT_OK;
+}
+
+static bool
+maybe_assign_tcp(struct __sk_buff *skb, struct tcphdr *th)
+{
+ if (th + 1 > (void *)(long)(skb->data_end))
+ return TC_ACT_SHOT;
+
+ if (!th->syn || th->ack || th->dest != bpf_htons(dest_port))
+ return TC_ACT_OK;
+
+ __builtin_memcpy(&headers.tcp, th, sizeof(headers.tcp));
+ return assign_sk(skb);
+}
+
+static bool
+maybe_assign_udp(struct __sk_buff *skb, struct udphdr *uh)
+{
+ if (uh + 1 > (void *)(long)(skb->data_end))
+ return TC_ACT_SHOT;
+
+ if (uh->dest != bpf_htons(dest_port))
+ return TC_ACT_OK;
+
+ __builtin_memcpy(&headers.udp, uh, sizeof(headers.udp));
+ return assign_sk(skb);
+}
+
+SEC("tc")
+int tc_main(struct __sk_buff *skb)
+{
+ void *data_end = (void *)(long)skb->data_end;
+ void *data = (void *)(long)skb->data;
+ struct ethhdr *eth;
+
+ eth = (struct ethhdr *)(data);
+ if (eth + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+ struct iphdr *iph = (struct iphdr *)(data + sizeof(*eth));
+
+ if (iph + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ if (iph->protocol == IPPROTO_TCP)
+ return maybe_assign_tcp(skb, (struct tcphdr *)(iph + 1));
+ else if (iph->protocol == IPPROTO_UDP)
+ return maybe_assign_udp(skb, (struct udphdr *)(iph + 1));
+ else
+ return TC_ACT_SHOT;
+ } else {
+ struct ipv6hdr *ip6h = (struct ipv6hdr *)(data + sizeof(*eth));
+
+ if (ip6h + 1 > data_end)
+ return TC_ACT_SHOT;
+
+ if (ip6h->nexthdr == IPPROTO_TCP)
+ return maybe_assign_tcp(skb, (struct tcphdr *)(ip6h + 1));
+ else if (ip6h->nexthdr == IPPROTO_UDP)
+ return maybe_assign_udp(skb, (struct udphdr *)(ip6h + 1));
+ else
+ return TC_ACT_SHOT;
+ }
+}

--
2.40.1


2023-07-11 17:07:07

by Lorenz Bauer

[permalink] [raw]
Subject: Re: [PATCH bpf-next v5 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

On Tue, Jul 4, 2023 at 2:46 PM Lorenz Bauer <[email protected]> wrote:
>
> +static inline
> +struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> + const struct in6_addr *saddr, const __be16 sport,
> + const struct in6_addr *daddr, const __be16 dport,
> + bool *refcounted, inet6_ehashfn_t *ehashfn)
> +{
> + struct sock *sk, *reuse_sk;
> + bool prefetched;
> +
> + sk = skb_steal_sock(skb, refcounted, &prefetched);
> + if (!sk)
> + return NULL;
> +
> + if (!prefetched)
> + return sk;
> +
> + if (sk->sk_protocol == IPPROTO_TCP) {
> + if (sk->sk_state != TCP_LISTEN)
> + return sk;
> + } else if (sk->sk_protocol == IPPROTO_UDP) {
> + if (sk->sk_state != TCP_CLOSE)
> + return sk;
> + } else {
> + return sk;
> + }
> +
> + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> + saddr, sport, daddr, ntohs(dport),
> + ehashfn);
> + if (!reuse_sk)
> + return sk;
> +
> + /* We've chosen a new reuseport sock which is never refcounted. This
> + * implies that sk also isn't refcounted.
> + */
> + WARN_ON_ONCE(*refcounted);
> +
> + return reuse_sk;
> +}

Hi Kuniyuki,

Continuing the conversation from v5 of the patch set, you wrote:

In inet6?_steal_sock(), we call inet6?_lookup_reuseport() only for
sk that was a TCP listener or UDP non-connected socket until just before
the sk_state checks. Then, we know *refcounted should be false for such
sockets even before inet6?_lookup_reuseport().

This makes sense for me in the TCP listener case. I understand UDP
less, so I'll have to rely on your input. I tried to convince myself
that all UDP sockets in TCP_CLOSE have SOCK_RCU_FREE set. However, the
only place I see sock_set_flag(sk, SOCK_RCU_FREE) in the UDP case is
in udp_lib_get_port(). That in turn seems to be called during bind.
So, what if BPF does bpf_sk_assign() of an unbound and unconnected
socket? Wouldn't that trigger the warning?

To maybe sidestep this question: do you think the location of the
WARN_ON_ONCE has to prevent this patch set from going in? I've been
noodling at it for quite a while already and it would be good to see
it land.

Thanks

Lorenz

2023-07-11 18:37:27

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH bpf-next v5 6/7] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

From: Lorenz Bauer <[email protected]>
Date: Tue, 11 Jul 2023 17:15:06 +0100
> On Tue, Jul 4, 2023 at 2:46 PM Lorenz Bauer <[email protected]> wrote:
> >
> > +static inline
> > +struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
> > + const struct in6_addr *saddr, const __be16 sport,
> > + const struct in6_addr *daddr, const __be16 dport,
> > + bool *refcounted, inet6_ehashfn_t *ehashfn)
> > +{
> > + struct sock *sk, *reuse_sk;
> > + bool prefetched;
> > +
> > + sk = skb_steal_sock(skb, refcounted, &prefetched);
> > + if (!sk)
> > + return NULL;
> > +
> > + if (!prefetched)
> > + return sk;
> > +
> > + if (sk->sk_protocol == IPPROTO_TCP) {
> > + if (sk->sk_state != TCP_LISTEN)
> > + return sk;
> > + } else if (sk->sk_protocol == IPPROTO_UDP) {
> > + if (sk->sk_state != TCP_CLOSE)
> > + return sk;
> > + } else {
> > + return sk;
> > + }
> > +
> > + reuse_sk = inet6_lookup_reuseport(net, sk, skb, doff,
> > + saddr, sport, daddr, ntohs(dport),
> > + ehashfn);
> > + if (!reuse_sk)
> > + return sk;
> > +
> > + /* We've chosen a new reuseport sock which is never refcounted. This
> > + * implies that sk also isn't refcounted.
> > + */
> > + WARN_ON_ONCE(*refcounted);
> > +
> > + return reuse_sk;
> > +}
>
> Hi Kuniyuki,
>
> Continuing the conversation from v5 of the patch set, you wrote:
>
> In inet6?_steal_sock(), we call inet6?_lookup_reuseport() only for
> sk that was a TCP listener or UDP non-connected socket until just before
> the sk_state checks. Then, we know *refcounted should be false for such
> sockets even before inet6?_lookup_reuseport().
>
> This makes sense for me in the TCP listener case. I understand UDP
> less, so I'll have to rely on your input. I tried to convince myself
> that all UDP sockets in TCP_CLOSE have SOCK_RCU_FREE set. However, the
> only place I see sock_set_flag(sk, SOCK_RCU_FREE) in the UDP case is
> in udp_lib_get_port(). That in turn seems to be called during bind.
> So, what if BPF does bpf_sk_assign() of an unbound and unconnected
> socket? Wouldn't that trigger the warning?

Ah sorry, I assumed it would not happen, but if we can put unbound
TCP/UDP socket into a map and select it, then yes, it hits the warning.

Let's say we can select a non-RCU sk in bpf_sk_assign() and then the
socket is converted to RCU by bind(udp_sk) or listen(tcp_sk).

The sk_is_refcounted() in bpf_sk_assign() returns true and sk_refcnt
is incremented. Then, I think of two scenarios:

1) RCU conversion is done before sk_is_refcounted() in skb_steal_sock().
-> *refcounted is false

2) RCU conversion is done after skb_steal_sock().
-> *refcounted is true

In both cases, we need to decrement the refcnt that is bumped up
by bpf_sk_assign(). The sock_put() in the v1 series does not catch
the former case.

How should we track it ?


>
> To maybe sidestep this question: do you think the location of the
> WARN_ON_ONCE has to prevent this patch set from going in? I've been
> noodling at it for quite a while already and it would be good to see
> it land.

If the issue above happened, I think it could be a blocker.

Thanks!