2020-12-07 13:27:52

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 00/13] Socket migration for SO_REUSEPORT.

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation[1]. When a SYN packet is received, the connection is tied to
a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.

This situation can happen when various server-management tools restart
server processes (such as nginx). For instance, when we change the nginx
configuration and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, so the in-flight
ACKs of the 3WHS are answered with RST.

The SO_REUSEPORT option is excellent for improving scalability. On the
other hand, as a trade-off, users have to understand in detail how the
kernel handles SYN packets and implement connection draining by eBPF[2]
(see the minimal sketch after the two sequences below):

1. Stop routing SYN packets to the listener by eBPF.
2. Wait for all timers to expire so that in-flight requests complete.
3. Accept connections until EAGAIN, then close the listener.

or

1. Start counting SYN packets and accept syscalls using eBPF map.
2. Stop routing SYN packets.
3. Accept connections up to the count, then close the listener.
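
For reference, a minimal sketch of such a drain program is below. It is
only an illustration of the workaround, not part of this series: the map
names (reuseport_map, num_socks_map) and the hash-modulo policy are
assumptions, and userspace is expected to shrink the value stored in
num_socks_map before closing the listeners at the tail of the map.

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") reuseport_map = {
	.type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 256,
};

struct bpf_map_def SEC("maps") num_socks_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u32),
	.max_entries = 1,
};

SEC("sk_reuseport")
int select_by_hash(struct sk_reuseport_md *reuse_md)
{
	__u32 zero = 0, index, count, *pcount;

	pcount = bpf_map_lookup_elem(&num_socks_map, &zero);
	if (!pcount)
		return SK_DROP;

	count = *pcount;
	if (!count)
		return SK_DROP;

	/* Route SYNs only to the first 'count' listeners in reuseport_map,
	 * so listeners beyond that index stop receiving new requests.
	 */
	index = reuse_md->hash % count;
	bpf_sk_select_reuseport(reuse_md, &reuseport_map, &index, 0);

	return SK_PASS;
}

int _version SEC("version") = 1;
char _license[] SEC("license") = "GPL";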

Either way, we cannot close a listener immediately. Ideally, however, the
application should not have to drain the not-yet-accepted sockets because
the 3WHS and tying a connection to a listener are purely kernel behaviour.
The root cause lies within the kernel, so the issue should be addressed in
kernel space and should not be visible to user space. This patchset fixes
it so that users need not care about the kernel implementation or
connection draining. With this patchset, the kernel redistributes requests
and connections from a listener to the others in the same reuseport group
at/after close() or shutdown() syscalls.

Although some software does connection draining, there are still merits in
migration. For some security reasons such as replacing TLS certificates, we
may want to apply new settings as soon as possible and/or we may not be
able to wait for connection draining. The sockets in the accept queue have
not started application sessions yet. So, if we do not drain such sockets,
they can be handled by the newer listeners and could have a longer
lifetime. It is difficult to drain all connections in every case, but we
can decrease such aborted connections by migration. In that sense,
migration is always better than draining.

Moreover, auto-migration simplifies userspace logic and also works well in
a case where we cannot modify and build a server program to implement the
workaround.

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
errors. In such a case, we have to use an eBPF program to select a specific
listener or to cancel migration.
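
As an example, cancelling migration for mismatched groups can be as simple
as the following sketch, which relies on the section name and the migration
field added later in this series (patches 09 and 10); it is an
illustration, not a program shipped with the series.

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Drop migration requests so sockets are never moved to a listener with
 * different socket-level settings; SYN packets keep the default selection
 * because no socket is selected for them here.
 */
SEC("sk_reuseport/migrate")
int cancel_migration(struct sk_reuseport_md *reuse_md)
{
	return reuse_md->migration ? SK_DROP : SK_PASS;
}

char _license[] SEC("license") = "GPL";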


Link:
[1] The SO_REUSEPORT socket option
https://lwn.net/Articles/542629/

[2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
https://lore.kernel.org/netdev/[email protected]/


Changelog:
v2:
* Do not save closed sockets in socks[]
* Revert 607904c357c61adf20b8fd18af765e501d61a385
* Extract inet_csk_reqsk_queue_migrate() into a single patch
* Change the spin_lock order to avoid lockdep warning
* Add static to __reuseport_select_sock
* Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
* Set the default attach type in bpf_prog_load_check_attach()
* Define new proto of BPF_FUNC_get_socket_cookie
* Fix test to be compiled successfully
* Update commit messages

v1:
https://lore.kernel.org/netdev/[email protected]/
* Remove the sysctl option
* Enable migration if eBPF program is not attached
* Add expected_attach_type to check if eBPF program can migrate sockets
* Add a field to tell migration type to eBPF program
* Support BPF_FUNC_get_socket_cookie to get the cookie of sk
* Allocate an empty skb if skb is NULL
* Pass req_to_sk(req)->sk_hash because listener's hash is zero
* Update commit messages and coverletter

RFC:
https://lore.kernel.org/netdev/[email protected]/


Kuniyuki Iwashima (13):
tcp: Allow TCP_CLOSE sockets to hold the reuseport group.
bpf: Define migration types for SO_REUSEPORT.
Revert "locking/spinlocks: Remove the unused spin_lock_bh_nested()
API"
tcp: Introduce inet_csk_reqsk_queue_migrate().
tcp: Set the new listener to migrated TFO requests.
tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
tcp: Migrate TCP_NEW_SYN_RECV requests.
bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.
libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
bpf: Add migration to sk_reuseport_(kern|md).
bpf: Support BPF_FUNC_get_socket_cookie() for
BPF_PROG_TYPE_SK_REUSEPORT.
bpf: Call bpf_run_sk_reuseport() for socket migration.
bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

include/linux/bpf.h | 1 +
include/linux/filter.h | 4 +-
include/linux/spinlock.h | 8 +
include/linux/spinlock_api_smp.h | 2 +
include/linux/spinlock_api_up.h | 1 +
include/net/inet_connection_sock.h | 12 ++
include/net/request_sock.h | 13 ++
include/net/sock_reuseport.h | 15 +-
include/uapi/linux/bpf.h | 25 +++
kernel/bpf/syscall.c | 13 ++
kernel/locking/spinlock.c | 8 +
net/core/filter.c | 56 +++++-
net/core/sock_reuseport.c | 96 +++++++---
net/ipv4/inet_connection_sock.c | 99 +++++++++-
net/ipv4/inet_hashtables.c | 9 +-
net/ipv4/tcp_ipv4.c | 9 +-
net/ipv6/tcp_ipv6.c | 9 +-
tools/include/uapi/linux/bpf.h | 25 +++
tools/lib/bpf/libbpf.c | 5 +-
.../bpf/prog_tests/select_reuseport_migrate.c | 173 ++++++++++++++++++
.../bpf/progs/test_select_reuseport_migrate.c | 53 ++++++
21 files changed, 590 insertions(+), 46 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c
create mode 100644 tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c

--
2.17.2 (Apple Git-113)


2020-12-07 13:29:05

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 01/13] tcp: Allow TCP_CLOSE sockets to hold the reuseport group.

This patch is a preparation for migrating incoming connections in the
later commits; it adds a field (num_closed_socks) to struct sock_reuseport
to allow TCP_CLOSE sockets to access the reuseport group.

When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets: ones that the listening socket holds a reference to, and ones
that it does not.

The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets, which sit in the
accept queue of their listening socket, so we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV sockets, are still in the
three-way handshake and not in the accept queue; thus, we cannot access
them at close() or shutdown() syscalls. Accordingly, we have to migrate
such immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when the final ACK is received or SYN+ACKs are
retransmitted. At that time, if we could select a new listener from the
same reuseport group, no connection would be aborted. However, this is
impossible because reuseport_detach_sock() sets sk_reuseport_cb to NULL
and forbids closed sockets from accessing the reuseport group.

This patch allows TCP_CLOSE sockets to hold sk_reuseport_cb while any
child socket still references them. The point is that
reuseport_detach_sock() is called twice, from inet_unhash() and
sk_destruct(). The first call decrements num_socks and increments
num_closed_socks; later, when all migrated connections have been accepted,
the second call decrements num_closed_socks and sets sk_reuseport_cb to
NULL.

By this change, closed sockets can keep sk_reuseport_cb until all child
requests have been freed or accepted. Consequently, calling listen() after
shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or
inet_csk_bind_conflict(), which expect that such sockets should not have
the reuseport group. Therefore, this patch also loosens those validation
rules so that a socket can listen again if it has the same reuseport group
as other listening sockets.
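
As an illustration, the userspace sequence that this relaxation keeps
working is roughly the sketch below (error handling trimmed; the function
name is only for illustration):

#include <sys/socket.h>

/* Shut a reuseport listener down and listen again on the same fd.
 * The closed socket keeps its sk_reuseport_cb until its child requests
 * have been migrated and accepted, so listen() here must not fail with
 * EADDRINUSE or EBUSY as long as the fd rejoins the same reuseport group.
 */
static int reopen_listener(int fd)
{
	if (shutdown(fd, SHUT_RDWR) == -1)
		return -1;

	return listen(fd, 128);
}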

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/net/sock_reuseport.h | 5 +++--
net/core/sock_reuseport.c | 39 +++++++++++++++++++++++----------
net/ipv4/inet_connection_sock.c | 7 ++++--
3 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 505f1e18e9bf..0e558ca7afbf 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock;
struct sock_reuseport {
struct rcu_head rcu;

- u16 max_socks; /* length of socks */
- u16 num_socks; /* elements in socks */
+ u16 max_socks; /* length of socks */
+ u16 num_socks; /* elements in socks */
+ u16 num_closed_socks; /* closed elements in socks */
/* The last synq overflow event timestamp of this
* reuse->socks[] group.
*/
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index bbdd3c7b6cb5..c26f4256ff41 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -98,14 +98,15 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse)
return NULL;

more_reuse->num_socks = reuse->num_socks;
+ more_reuse->num_closed_socks = reuse->num_closed_socks;
more_reuse->prog = reuse->prog;
more_reuse->reuseport_id = reuse->reuseport_id;
more_reuse->bind_inany = reuse->bind_inany;
more_reuse->has_conns = reuse->has_conns;
+ more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);

memcpy(more_reuse->socks, reuse->socks,
reuse->num_socks * sizeof(struct sock *));
- more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts);

for (i = 0; i < reuse->num_socks; ++i)
rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
@@ -152,8 +153,10 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
reuse = rcu_dereference_protected(sk2->sk_reuseport_cb,
lockdep_is_held(&reuseport_lock));
old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
- lockdep_is_held(&reuseport_lock));
- if (old_reuse && old_reuse->num_socks != 1) {
+ lockdep_is_held(&reuseport_lock));
+ if (old_reuse == reuse) {
+ reuse->num_closed_socks--;
+ } else if (old_reuse && old_reuse->num_socks != 1) {
spin_unlock_bh(&reuseport_lock);
return -EBUSY;
}
@@ -174,8 +177,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)

spin_unlock_bh(&reuseport_lock);

- if (old_reuse)
+ if (old_reuse && old_reuse != reuse)
call_rcu(&old_reuse->rcu, reuseport_free_rcu);
+
return 0;
}
EXPORT_SYMBOL(reuseport_add_sock);
@@ -199,17 +203,28 @@ void reuseport_detach_sock(struct sock *sk)
*/
bpf_sk_reuseport_detach(sk);

- rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
-
- for (i = 0; i < reuse->num_socks; i++) {
- if (reuse->socks[i] == sk) {
- reuse->socks[i] = reuse->socks[reuse->num_socks - 1];
- reuse->num_socks--;
- if (reuse->num_socks == 0)
- call_rcu(&reuse->rcu, reuseport_free_rcu);
+ if (sk->sk_protocol == IPPROTO_TCP && sk->sk_state == TCP_CLOSE) {
+ reuse->num_closed_socks--;
+ rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
+ } else {
+ for (i = 0; i < reuse->num_socks; i++) {
+ if (reuse->socks[i] != sk)
+ continue;
break;
}
+
+ reuse->num_socks--;
+ reuse->socks[i] = reuse->socks[reuse->num_socks];
+
+ if (sk->sk_protocol == IPPROTO_TCP)
+ reuse->num_closed_socks++;
+ else
+ rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
}
+
+ if (reuse->num_socks + reuse->num_closed_socks == 0)
+ call_rcu(&reuse->rcu, reuseport_free_rcu);
+
spin_unlock_bh(&reuseport_lock);
}
EXPORT_SYMBOL(reuseport_detach_sock);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f60869acbef0..1451aa9712b0 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk,
bool reuse = sk->sk_reuse;
bool reuseport = !!sk->sk_reuseport;
kuid_t uid = sock_i_uid((struct sock *)sk);
+ struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);

/*
* Unlike other sk lookup places we do not check
@@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk,
if ((!relax ||
(!reuseport_ok &&
reuseport && sk2->sk_reuseport &&
- !rcu_access_pointer(sk->sk_reuseport_cb) &&
+ (!reuseport_cb ||
+ reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) &&
(sk2->sk_state == TCP_TIME_WAIT ||
uid_eq(uid, sock_i_uid(sk2))))) &&
inet_rcv_saddr_equal(sk, sk2, true))
break;
} else if (!reuseport_ok ||
!reuseport || !sk2->sk_reuseport ||
- rcu_access_pointer(sk->sk_reuseport_cb) ||
+ (reuseport_cb &&
+ reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) ||
(sk2->sk_state != TCP_TIME_WAIT &&
!uid_eq(uid, sock_i_uid(sk2)))) {
if (inet_rcv_saddr_equal(sk, sk2, true))
--
2.17.2 (Apple Git-113)

2020-12-07 13:29:12

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 02/13] bpf: Define migration types for SO_REUSEPORT.

As noted in the preceding commit, there are two migration types. In
addition to that, the kernel will run the same eBPF program to select a
listener for SYN packets.

This patch defines three types so the kernel can tell the eBPF program
whether it is receiving a new request, migrating ESTABLISHED/SYN_RECV
sockets in the accept queue, or migrating a NEW_SYN_RECV socket during the
3WHS.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/uapi/linux/bpf.h | 14 ++++++++++++++
tools/include/uapi/linux/bpf.h | 14 ++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1233f14f659f..7a48e0055500 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4423,6 +4423,20 @@ struct sk_msg_md {
__bpf_md_ptr(struct bpf_sock *, sk); /* current socket */
};

+/* Migration type for SO_REUSEPORT enabled TCP sockets.
+ *
+ * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets.
+ * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in
+ * the accept queue at close() or shutdown().
+ * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the
+ * final ACK of 3WHS or retransmitting SYN+ACKs.
+ */
+enum {
+ BPF_SK_REUSEPORT_MIGRATE_NO,
+ BPF_SK_REUSEPORT_MIGRATE_QUEUE,
+ BPF_SK_REUSEPORT_MIGRATE_REQUEST,
+};
+
struct sk_reuseport_md {
/*
* Start of directly accessible data. It begins from
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1233f14f659f..7a48e0055500 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4423,6 +4423,20 @@ struct sk_msg_md {
__bpf_md_ptr(struct bpf_sock *, sk); /* current socket */
};

+/* Migration type for SO_REUSEPORT enabled TCP sockets.
+ *
+ * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets.
+ * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in
+ * the accept queue at close() or shutdown().
+ * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the
+ * final ACK of 3WHS or retransmitting SYN+ACKs.
+ */
+enum {
+ BPF_SK_REUSEPORT_MIGRATE_NO,
+ BPF_SK_REUSEPORT_MIGRATE_QUEUE,
+ BPF_SK_REUSEPORT_MIGRATE_REQUEST,
+};
+
struct sk_reuseport_md {
/*
* Start of directly accessible data. It begins from
--
2.17.2 (Apple Git-113)

2020-12-07 13:30:30

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 03/13] Revert "locking/spinlocks: Remove the unused spin_lock_bh_nested() API"

This reverts commit 607904c357c61adf20b8fd18af765e501d61a385 to use
spin_lock_bh_nested() in the next commit.

Link: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
CC: Waiman Long <[email protected]>
---
include/linux/spinlock.h | 8 ++++++++
include/linux/spinlock_api_smp.h | 2 ++
include/linux/spinlock_api_up.h | 1 +
kernel/locking/spinlock.c | 8 ++++++++
4 files changed, 19 insertions(+)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 79897841a2cc..c020b375a071 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -227,6 +227,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define raw_spin_lock_nested(lock, subclass) \
_raw_spin_lock_nested(lock, subclass)
+# define raw_spin_lock_bh_nested(lock, subclass) \
+ _raw_spin_lock_bh_nested(lock, subclass)

# define raw_spin_lock_nest_lock(lock, nest_lock) \
do { \
@@ -242,6 +244,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
# define raw_spin_lock_nested(lock, subclass) \
_raw_spin_lock(((void)(subclass), (lock)))
# define raw_spin_lock_nest_lock(lock, nest_lock) _raw_spin_lock(lock)
+# define raw_spin_lock_bh_nested(lock, subclass) _raw_spin_lock_bh(lock)
#endif

#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
@@ -369,6 +372,11 @@ do { \
raw_spin_lock_nested(spinlock_check(lock), subclass); \
} while (0)

+#define spin_lock_bh_nested(lock, subclass) \
+do { \
+ raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\
+} while (0)
+
#define spin_lock_nest_lock(lock, nest_lock) \
do { \
raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock); \
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 19a9be9d97ee..d565fb6304f2 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr);
void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
__acquires(lock);
+void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
+ __acquires(lock);
void __lockfunc
_raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map)
__acquires(lock);
diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
index d0d188861ad6..d3afef9d8dbe 100644
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -57,6 +57,7 @@

#define _raw_spin_lock(lock) __LOCK(lock)
#define _raw_spin_lock_nested(lock, subclass) __LOCK(lock)
+#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock)
#define _raw_read_lock(lock) __LOCK(lock)
#define _raw_write_lock(lock) __LOCK(lock)
#define _raw_spin_lock_bh(lock) __LOCK_BH(lock)
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 0ff08380f531..48e99ed1bdd8 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
}
EXPORT_SYMBOL(_raw_spin_lock_nested);

+void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
+{
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
+ spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
+ LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+}
+EXPORT_SYMBOL(_raw_spin_lock_bh_nested);
+
unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
int subclass)
{
--
2.17.2 (Apple Git-113)

2020-12-07 13:30:50

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 06/13] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

This patch lets reuseport_detach_sock() return a pointer to struct sock,
which is used only by inet_unhash(). If it is not NULL,
inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
sockets from the closing listener to the selected one.

By default, the kernel selects a new listener randomly. In order to pick
out a different socket every time, we select the last element of socks[] as
the new listener. This behaviour is based on how the kernel moves sockets
in socks[]. (See also [1])

Basically, in order to redistribute sockets evenly, we have to use an eBPF
program, called in a later commit; but as a side effect of this default
selection, the kernel can redistribute old requests evenly to new listeners
in the specific case where the application replaces listeners generation by
generation.

For example, we call listen() for four sockets (A, B, C, D), and close()
the first two by turns. The sockets move in socks[] like below.

socks[0] : A <-.      socks[0] : D          socks[0] : D
socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
socks[2] : C   |      socks[2] : C --'
socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a
request (a, b, c, d) in their accept queue, we can redistribute old
requests evenly to new listeners.

socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
socks[2] : C (c)   |      socks[2] : C (c) --'
socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they
MUST have the same settings at the socket API level; otherwise, unexpected
errors may happen. For instance, if only the new listeners have
TCP_SAVE_SYN enabled, old requests do not hold SYN data, so the application
will face inconsistency and errors.
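
For instance, a sketch of how that inconsistency surfaces (assuming only
the new listeners enable TCP_SAVE_SYN): reading TCP_SAVED_SYN on a child
accepted from a migrated request yields an empty result.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Returns the saved SYN length, 0 if nothing was saved (e.g. the request
 * was created by an old listener without TCP_SAVE_SYN), or -1 on error.
 */
static int read_saved_syn(int child_fd, char *buf, socklen_t buf_len)
{
	socklen_t len = buf_len;

	if (getsockopt(child_fd, IPPROTO_TCP, TCP_SAVED_SYN, buf, &len) == -1)
		return -1;

	return len;
}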

Therefore, if there are different kinds of sockets, we must attach an eBPF
program described in later commits.

Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/net/sock_reuseport.h | 2 +-
net/core/sock_reuseport.c | 16 +++++++++++++---
net/ipv4/inet_hashtables.c | 9 +++++++--
3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 0e558ca7afbf..09a1b1539d4c 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -31,7 +31,7 @@ struct sock_reuseport {
extern int reuseport_alloc(struct sock *sk, bool bind_inany);
extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
bool bind_inany);
-extern void reuseport_detach_sock(struct sock *sk);
+extern struct sock *reuseport_detach_sock(struct sock *sk);
extern struct sock *reuseport_select_sock(struct sock *sk,
u32 hash,
struct sk_buff *skb,
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index c26f4256ff41..2de42f8103ea 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -184,9 +184,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
}
EXPORT_SYMBOL(reuseport_add_sock);

-void reuseport_detach_sock(struct sock *sk)
+struct sock *reuseport_detach_sock(struct sock *sk)
{
struct sock_reuseport *reuse;
+ struct bpf_prog *prog;
+ struct sock *nsk = NULL;
int i;

spin_lock_bh(&reuseport_lock);
@@ -215,17 +217,25 @@ void reuseport_detach_sock(struct sock *sk)

reuse->num_socks--;
reuse->socks[i] = reuse->socks[reuse->num_socks];
+ prog = rcu_dereference_protected(reuse->prog,
+ lockdep_is_held(&reuseport_lock));
+
+ if (sk->sk_protocol == IPPROTO_TCP) {
+ if (reuse->num_socks && !prog)
+ nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];

- if (sk->sk_protocol == IPPROTO_TCP)
reuse->num_closed_socks++;
- else
+ } else {
rcu_assign_pointer(sk->sk_reuseport_cb, NULL);
+ }
}

if (reuse->num_socks + reuse->num_closed_socks == 0)
call_rcu(&reuse->rcu, reuseport_free_rcu);

spin_unlock_bh(&reuseport_lock);
+
+ return nsk;
}
EXPORT_SYMBOL(reuseport_detach_sock);

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 45fb450b4522..545538a6bfac 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk)
{
struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
struct inet_listen_hashbucket *ilb = NULL;
+ struct sock *nsk;
spinlock_t *lock;

if (sk_unhashed(sk))
@@ -696,8 +697,12 @@ void inet_unhash(struct sock *sk)
if (sk_unhashed(sk))
goto unlock;

- if (rcu_access_pointer(sk->sk_reuseport_cb))
- reuseport_detach_sock(sk);
+ if (rcu_access_pointer(sk->sk_reuseport_cb)) {
+ nsk = reuseport_detach_sock(sk);
+ if (nsk)
+ inet_csk_reqsk_queue_migrate(sk, nsk);
+ }
+
if (ilb) {
inet_unhash2(hashinfo, sk);
ilb->count--;
--
2.17.2 (Apple Git-113)

2020-12-07 13:31:45

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 04/13] tcp: Introduce inet_csk_reqsk_queue_migrate().

This patch defines a new function to migrate ESTABLISHED/SYN_RECV sockets.

Listening sockets hold incoming connections as a linked list of struct
request_sock in the accept queue, and each request has a reference to its
full socket and listener. In inet_csk_reqsk_queue_migrate(), we only unlink
the requests from the closing listener's queue and relink them to the head
of the new listener's queue. We do not process each request or its
reference to the listener, so the migration completes in O(1) time
complexity.

Moreover, if TFO requests caused an RST before the 3WHS completed, they
are held in the listener's TFO queue to prevent a DDoS attack. Thus, we
also migrate the requests in the TFO queue in the same way.

After 3WHS has completed, there are three access patterns to incoming
sockets:

(1) access to the full socket instead of request_sock
(2) access to request_sock from the accept queue
(3) access to request_sock from the TFO queue

In the first case, the full socket does not have a reference to its
request socket and listener, so we do not need the correct listener set in
the request socket. In the second case, we always have the correct listener
and currently do not use req->rsk_listener. However, the third case, TFO
requests (TCP_SYN_RECV sockets), needs special care, which the next commit
takes.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/net/inet_connection_sock.h | 1 +
net/ipv4/inet_connection_sock.c | 68 ++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..2ea2d743f8fc 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
struct request_sock *req,
struct sock *child);
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
unsigned long timeout);
struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1451aa9712b0..5da38a756e4c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -992,6 +992,74 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
}
EXPORT_SYMBOL(inet_csk_reqsk_queue_add);

+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
+{
+ struct request_sock_queue *old_accept_queue, *new_accept_queue;
+ struct fastopen_queue *old_fastopenq, *new_fastopenq;
+ spinlock_t *l1, *l2, *l3, *l4;
+
+ old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
+ new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
+ old_fastopenq = &old_accept_queue->fastopenq;
+ new_fastopenq = &new_accept_queue->fastopenq;
+
+ l1 = &old_accept_queue->rskq_lock;
+ l2 = &new_accept_queue->rskq_lock;
+ l3 = &old_fastopenq->lock;
+ l4 = &new_fastopenq->lock;
+
+ /* sk is never selected as the new listener from reuse->socks[],
+ * so inversion deadlock does not happen here,
+ * but change the order to avoid the warning of lockdep.
+ */
+ if (sk < nsk) {
+ swap(l1, l2);
+ swap(l3, l4);
+ }
+
+ spin_lock(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+
+ if (old_accept_queue->rskq_accept_head) {
+ if (new_accept_queue->rskq_accept_head)
+ old_accept_queue->rskq_accept_tail->dl_next =
+ new_accept_queue->rskq_accept_head;
+ else
+ new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
+
+ new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
+ old_accept_queue->rskq_accept_head = NULL;
+ old_accept_queue->rskq_accept_tail = NULL;
+
+ WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
+ WRITE_ONCE(sk->sk_ack_backlog, 0);
+ }
+
+ spin_unlock(l2);
+ spin_unlock(l1);
+
+ spin_lock_bh(l3);
+ spin_lock_bh_nested(l4, SINGLE_DEPTH_NESTING);
+
+ new_fastopenq->qlen += old_fastopenq->qlen;
+ old_fastopenq->qlen = 0;
+
+ if (old_fastopenq->rskq_rst_head) {
+ if (new_fastopenq->rskq_rst_head)
+ old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head;
+ else
+ new_fastopenq->rskq_rst_tail = old_fastopenq->rskq_rst_tail;
+
+ new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head;
+ old_fastopenq->rskq_rst_head = NULL;
+ old_fastopenq->rskq_rst_tail = NULL;
+ }
+
+ spin_unlock_bh(l4);
+ spin_unlock_bh(l3);
+}
+EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
+
struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
struct request_sock *req, bool own_req)
{
--
2.17.2 (Apple Git-113)

2020-12-07 13:31:48

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 07/13] tcp: Migrate TCP_NEW_SYN_RECV requests.

This patch renames reuseport_select_sock() to __reuseport_select_sock()
and adds two wrappers around it that pass the migration type defined in
the previous commit:

reuseport_select_sock : BPF_SK_REUSEPORT_MIGRATE_NO
reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST

As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
requests when receiving the final ACK or retransmitting a SYN+ACK.
Therefore, this patch also changes the code to call
reuseport_select_migrated_sock() even if the listening socket is TCP_CLOSE.
If we can pick out a listening socket from the reuseport group, we rewrite
request_sock.rsk_listener and resume processing the request.

Link: https://lore.kernel.org/bpf/[email protected]/
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/net/inet_connection_sock.h | 11 ++++++++
include/net/request_sock.h | 13 ++++++++++
include/net/sock_reuseport.h | 8 +++---
net/core/sock_reuseport.c | 40 ++++++++++++++++++++++++------
net/ipv4/inet_connection_sock.c | 13 ++++++++--
net/ipv4/tcp_ipv4.c | 9 +++++--
net/ipv6/tcp_ipv6.c | 9 +++++--
7 files changed, 86 insertions(+), 17 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ea2d743f8fc..d8c3be31e987 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -272,6 +272,17 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
}

+static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
+ struct sock *nsk,
+ struct request_sock *req)
+{
+ reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
+ &inet_csk(nsk)->icsk_accept_queue,
+ req);
+ sock_put(sk);
+ req->rsk_listener = nsk;
+}
+
static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
{
return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..d18ba0b857cc 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue)
atomic_inc(&queue->qlen);
}

+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+ struct request_sock_queue *new_accept_queue,
+ const struct request_sock *req)
+{
+ atomic_dec(&old_accept_queue->qlen);
+ atomic_inc(&new_accept_queue->qlen);
+
+ if (req->num_timeout == 0) {
+ atomic_dec(&old_accept_queue->young);
+ atomic_inc(&new_accept_queue->young);
+ }
+}
+
static inline int reqsk_queue_len(const struct request_sock_queue *queue)
{
return atomic_read(&queue->qlen);
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 09a1b1539d4c..a48259a974be 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,10 +32,10 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany);
extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
bool bind_inany);
extern struct sock *reuseport_detach_sock(struct sock *sk);
-extern struct sock *reuseport_select_sock(struct sock *sk,
- u32 hash,
- struct sk_buff *skb,
- int hdr_len);
+extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+ struct sk_buff *skb, int hdr_len);
+extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+ struct sk_buff *skb);
extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
extern int reuseport_detach_prog(struct sock *sk);

diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 2de42f8103ea..1011c3756c92 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -170,7 +170,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
}

reuse->socks[reuse->num_socks] = sk;
- /* paired with smp_rmb() in reuseport_select_sock() */
+ /* paired with smp_rmb() in __reuseport_select_sock() */
smp_wmb();
reuse->num_socks++;
rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
@@ -277,12 +277,13 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
* @hdr_len: BPF filter expects skb data pointer at payload data. If
* the skb does not yet point at the payload, this parameter represents
* how far the pointer needs to advance to reach the payload.
+ * @migration: represents if it is selecting a listener for SYN or
+ * migrating ESTABLISHED/SYN_RECV sockets or NEW_SYN_RECV socket.
* Returns a socket that should receive the packet (or NULL on error).
*/
-struct sock *reuseport_select_sock(struct sock *sk,
- u32 hash,
- struct sk_buff *skb,
- int hdr_len)
+static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
+ struct sk_buff *skb, int hdr_len,
+ u8 migration)
{
struct sock_reuseport *reuse;
struct bpf_prog *prog;
@@ -296,13 +297,19 @@ struct sock *reuseport_select_sock(struct sock *sk,
if (!reuse)
goto out;

- prog = rcu_dereference(reuse->prog);
socks = READ_ONCE(reuse->num_socks);
if (likely(socks)) {
/* paired with smp_wmb() in reuseport_add_sock() */
smp_rmb();

- if (!prog || !skb)
+ prog = rcu_dereference(reuse->prog);
+ if (!prog)
+ goto select_by_hash;
+
+ if (migration)
+ goto out;
+
+ if (!skb)
goto select_by_hash;

if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
@@ -331,8 +338,27 @@ struct sock *reuseport_select_sock(struct sock *sk,
rcu_read_unlock();
return sk2;
}
+
+struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+ struct sk_buff *skb, int hdr_len)
+{
+ return __reuseport_select_sock(sk, hash, skb, hdr_len, BPF_SK_REUSEPORT_MIGRATE_NO);
+}
EXPORT_SYMBOL(reuseport_select_sock);

+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+ struct sk_buff *skb)
+{
+ struct sock *nsk;
+
+ nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+ if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
+ return nsk;
+
+ return NULL;
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
+
int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
{
struct sock_reuseport *reuse;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 143590858c2e..f042e9122074 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
int max_syn_ack_retries, qlen, expire = 0, resend = 0;

- if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
- goto drop;
+ if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
+ sk_listener = reuseport_select_migrated_sock(sk_listener,
+ req_to_sk(req)->sk_hash, NULL);
+ if (!sk_listener) {
+ sk_listener = req->rsk_listener;
+ goto drop;
+ }
+ inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
+ icsk = inet_csk(sk_listener);
+ queue = &icsk->icsk_accept_queue;
+ }

max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
/* Normally all the openreqs are young and become mature
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index af2338294598..a4eea6b36795 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1978,8 +1978,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
goto csum_error;
}
if (unlikely(sk->sk_state != TCP_LISTEN)) {
- inet_csk_reqsk_queue_drop_and_put(sk, req);
- goto lookup;
+ nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+ if (!nsk) {
+ inet_csk_reqsk_queue_drop_and_put(sk, req);
+ goto lookup;
+ }
+ inet_csk_reqsk_queue_migrated(sk, nsk, req);
+ sk = nsk;
}
/* We own a reference on the listener, increase it again
* as we might lose it too soon.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 1a1510513739..61b8c5855735 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1640,8 +1640,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
goto csum_error;
}
if (unlikely(sk->sk_state != TCP_LISTEN)) {
- inet_csk_reqsk_queue_drop_and_put(sk, req);
- goto lookup;
+ nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+ if (!nsk) {
+ inet_csk_reqsk_queue_drop_and_put(sk, req);
+ goto lookup;
+ }
+ inet_csk_reqsk_queue_migrated(sk, nsk, req);
+ sk = nsk;
}
sock_hold(sk);
refcounted = true;
--
2.17.2 (Apple Git-113)

2020-12-07 13:32:03

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 05/13] tcp: Set the new listener to migrated TFO requests.

A TFO request socket is only freed after BOTH 3WHS has completed (or
aborted) and the child socket has been accepted (or its listener has been
closed). Hence, depending on the order, there can be two kinds of request
sockets in the accept queue.

3WHS -> accept : TCP_ESTABLISHED
accept -> 3WHS : TCP_SYN_RECV

Unlike for a TCP_ESTABLISHED socket, accept() does not free the request
socket of a TCP_SYN_RECV socket; it is freed later in
reqsk_fastopen_remove(), which also accesses request_sock.rsk_listener.
So, in order to complete TFO socket migration, we have to set the current
listener in the request at accept(), before reqsk_fastopen_remove().

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
net/ipv4/inet_connection_sock.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 5da38a756e4c..143590858c2e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
tcp_rsk(req)->tfo_listener) {
spin_lock_bh(&queue->fastopenq.lock);
if (tcp_rsk(req)->tfo_listener) {
+ if (req->rsk_listener != sk) {
+ /* TFO request was migrated to another listener so
+ * the new listener must be used in reqsk_fastopen_remove()
+ * to hold requests which cause RST.
+ */
+ sock_put(req->rsk_listener);
+ sock_hold(sk);
+ req->rsk_listener = sk;
+ }
+
/* We are still waiting for the final ACK from 3WHS
* so can't free req now. Instead, we set req->sk to
* NULL to signify that the child socket is taken
@@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req,

if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) {
BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req);
- BUG_ON(sk != req->rsk_listener);

/* Paranoid, to prevent race condition if
* an inbound pkt destined for child is
--
2.17.2 (Apple Git-113)

2020-12-07 13:32:20

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 08/13] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT.

This commit adds two new bpf_attach_type values for
BPF_PROG_TYPE_SK_REUSEPORT to check whether the attached eBPF program is
capable of migrating sockets.

When the eBPF program is attached, the kernel runs it for socket migration
only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
The kernel will change the behaviour depending on the returned value (a
minimal sketch follows the list below):

- SK_PASS with selected_sk, select it as a new listener
- SK_PASS with selected_sk NULL, fall back to the random selection
- SK_DROP, cancel the migration
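
Put together, a minimal SELECT_OR_MIGRATE program could look like the
sketch below; reuseport_map is an illustrative
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY populated by userspace, and the selftest
added in patch 13 does essentially the same thing keyed by socket cookie.

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") reuseport_map = {
	.type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 256,
};

SEC("sk_reuseport/migrate")
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
	__u32 index = 0;

	/* SYN packet: select nothing and fall back to the default
	 * hash-based selection.
	 */
	if (!reuse_md->migration)
		return SK_PASS;

	/* Migration: move the socket to the listener in slot 0, or
	 * cancel the migration if the lookup fails.
	 */
	if (bpf_sk_select_reuseport(reuse_md, &reuseport_map, &index, 0))
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";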

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/uapi/linux/bpf.h | 2 ++
kernel/bpf/syscall.c | 13 +++++++++++++
tools/include/uapi/linux/bpf.h | 2 ++
3 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7a48e0055500..c7f6848c0226 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -241,6 +241,8 @@ enum bpf_attach_type {
BPF_XDP_CPUMAP,
BPF_SK_LOOKUP,
BPF_XDP,
+ BPF_SK_REUSEPORT_SELECT,
+ BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
__MAX_BPF_ATTACH_TYPE
};

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0cd3cc2af9c1..0737673c727c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1920,6 +1920,11 @@ static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr)
attr->expected_attach_type =
BPF_CGROUP_INET_SOCK_CREATE;
break;
+ case BPF_PROG_TYPE_SK_REUSEPORT:
+ if (!attr->expected_attach_type)
+ attr->expected_attach_type =
+ BPF_SK_REUSEPORT_SELECT;
+ break;
}
}

@@ -2003,6 +2008,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
if (expected_attach_type == BPF_SK_LOOKUP)
return 0;
return -EINVAL;
+ case BPF_PROG_TYPE_SK_REUSEPORT:
+ switch (expected_attach_type) {
+ case BPF_SK_REUSEPORT_SELECT:
+ case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE:
+ return 0;
+ default:
+ return -EINVAL;
+ }
case BPF_PROG_TYPE_EXT:
if (expected_attach_type)
return -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7a48e0055500..c7f6848c0226 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -241,6 +241,8 @@ enum bpf_attach_type {
BPF_XDP_CPUMAP,
BPF_SK_LOOKUP,
BPF_XDP,
+ BPF_SK_REUSEPORT_SELECT,
+ BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
__MAX_BPF_ATTACH_TYPE
};

--
2.17.2 (Apple Git-113)

2020-12-07 13:32:25

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 09/13] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.

This commit introduces a new section (sk_reuseport/migrate) and sets the
expected_attach_type for each of the two sections of
BPF_PROG_TYPE_SK_REUSEPORT programs.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
tools/lib/bpf/libbpf.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 9be88a90a4aa..ba64c891a5e7 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8471,7 +8471,10 @@ static struct bpf_link *attach_iter(const struct bpf_sec_def *sec,

static const struct bpf_sec_def section_defs[] = {
BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER),
- BPF_PROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT),
+ BPF_EAPROG_SEC("sk_reuseport/migrate", BPF_PROG_TYPE_SK_REUSEPORT,
+ BPF_SK_REUSEPORT_SELECT_OR_MIGRATE),
+ BPF_EAPROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT,
+ BPF_SK_REUSEPORT_SELECT),
SEC_DEF("kprobe/", KPROBE,
.attach_fn = attach_kprobe),
BPF_PROG_SEC("uprobe/", BPF_PROG_TYPE_KPROBE),
--
2.17.2 (Apple Git-113)

2020-12-07 13:32:42

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 13/13] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
.../bpf/prog_tests/select_reuseport_migrate.c | 173 ++++++++++++++++++
.../bpf/progs/test_select_reuseport_migrate.c | 53 ++++++
2 files changed, 226 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c
create mode 100644 tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c

diff --git a/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c b/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c
new file mode 100644
index 000000000000..814b1e3a4c56
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/select_reuseport_migrate.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ * 1. call listen() for 5 server sockets.
+ * 2. update a map to migrate all child socket
+ * to the last server socket (migrate_map[cookie] = 4)
+ * 3. call connect() for 25 client sockets.
+ * 4. call close() for first 4 server sockets.
+ * 5. call accept() for the last server socket.
+ *
+ * Author: Kuniyuki Iwashima <[email protected]>
+ */
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "test_progs.h"
+#include "test_select_reuseport_migrate.skel.h"
+
+#define ADDRESS "127.0.0.1"
+#define PORT 80
+#define NUM_SERVERS 5
+#define NUM_CLIENTS (NUM_SERVERS * 5)
+
+
+static int test_listen(struct test_select_reuseport_migrate *skel, int server_fds[])
+{
+ int i, err, optval = 1, migrated_to = NUM_SERVERS - 1;
+ int prog_fd, reuseport_map_fd, migrate_map_fd;
+ struct sockaddr_in addr;
+ socklen_t addr_len;
+ __u64 value;
+
+ prog_fd = bpf_program__fd(skel->progs.prog_select_reuseport_migrate);
+ reuseport_map_fd = bpf_map__fd(skel->maps.reuseport_map);
+ migrate_map_fd = bpf_map__fd(skel->maps.migrate_map);
+
+ addr_len = sizeof(addr);
+ addr.sin_family = AF_INET;
+ addr.sin_port = htons(PORT);
+ inet_pton(AF_INET, ADDRESS, &addr.sin_addr.s_addr);
+
+ for (i = 0; i < NUM_SERVERS; i++) {
+ server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+ if (CHECK_FAIL(server_fds[i] == -1))
+ return -1;
+
+ err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT,
+ &optval, sizeof(optval));
+ if (CHECK_FAIL(err == -1))
+ return -1;
+
+ if (i == 0) {
+ err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+ &prog_fd, sizeof(prog_fd));
+ if (CHECK_FAIL(err == -1))
+ return -1;
+ }
+
+ err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len);
+ if (CHECK_FAIL(err == -1))
+ return -1;
+
+ err = listen(server_fds[i], 32);
+ if (CHECK_FAIL(err == -1))
+ return -1;
+
+ err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST);
+ if (CHECK_FAIL(err == -1))
+ return -1;
+
+ err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value);
+ if (CHECK_FAIL(err == -1))
+ return -1;
+
+ err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST);
+ if (CHECK_FAIL(err == -1))
+ return -1;
+ }
+
+ return 0;
+}
+
+static int test_connect(int client_fds[])
+{
+ struct sockaddr_in addr;
+ socklen_t addr_len;
+ int i, err;
+
+ addr_len = sizeof(addr);
+ addr.sin_family = AF_INET;
+ addr.sin_port = htons(PORT);
+ inet_pton(AF_INET, ADDRESS, &addr.sin_addr.s_addr);
+
+ for (i = 0; i < NUM_CLIENTS; i++) {
+ client_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+ if (CHECK_FAIL(client_fds[i] == -1))
+ return -1;
+
+ err = connect(client_fds[i], (struct sockaddr *)&addr, addr_len);
+ if (CHECK_FAIL(err == -1))
+ return -1;
+ }
+
+ return 0;
+}
+
+static void test_close(int server_fds[], int num)
+{
+ int i;
+
+ for (i = 0; i < num; i++)
+ if (server_fds[i] > 0)
+ close(server_fds[i]);
+}
+
+static int test_accept(int server_fd)
+{
+ struct sockaddr_in addr;
+ socklen_t addr_len;
+ int cnt, client_fd;
+
+ fcntl(server_fd, F_SETFL, O_NONBLOCK);
+ addr_len = sizeof(addr);
+
+ for (cnt = 0; cnt < NUM_CLIENTS; cnt++) {
+ client_fd = accept(server_fd, (struct sockaddr *)&addr, &addr_len);
+ if (CHECK_FAIL(client_fd == -1))
+ return -1;
+ }
+
+ return cnt;
+}
+
+
+void test_select_reuseport_migrate(void)
+{
+ struct test_select_reuseport_migrate *skel;
+ int server_fds[NUM_SERVERS] = {0};
+ int client_fds[NUM_CLIENTS] = {0};
+ __u32 duration = 0;
+ int err;
+
+ skel = test_select_reuseport_migrate__open_and_load();
+ if (CHECK_FAIL(!skel))
+ goto destroy;
+
+ err = test_listen(skel, server_fds);
+ if (err)
+ goto close_server;
+
+ err = test_connect(client_fds);
+ if (err)
+ goto close_client;
+
+ test_close(server_fds, NUM_SERVERS - 1);
+
+ err = test_accept(server_fds[NUM_SERVERS - 1]);
+ CHECK(err != NUM_CLIENTS,
+ "accept",
+ "expected (%d) != actual (%d)\n",
+ NUM_CLIENTS, err);
+
+close_client:
+ test_close(client_fds, NUM_CLIENTS);
+
+close_server:
+ test_close(server_fds, NUM_SERVERS);
+
+destroy:
+ test_select_reuseport_migrate__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c b/tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c
new file mode 100644
index 000000000000..f1ac07bb2c03
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_select_reuseport_migrate.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Check if we can migrate child sockets.
+ *
+ * 1. If reuse_md->migration is 0 (SYN packet),
+ * return SK_PASS without selecting a listener.
+ * 2. If reuse_md->migration is not 0 (socket migration),
+ * select a listener (reuseport_map[migrate_map[cookie]])
+ *
+ * Author: Kuniyuki Iwashima <[email protected]>
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+#define NULL ((void *)0)
+
+struct bpf_map_def SEC("maps") reuseport_map = {
+ .type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(__u64),
+ .max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") migrate_map = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(__u64),
+ .value_size = sizeof(int),
+ .max_entries = 256,
+};
+
+SEC("sk_reuseport/migrate")
+int prog_select_reuseport_migrate(struct sk_reuseport_md *reuse_md)
+{
+ int *key, flags = 0;
+ __u64 cookie;
+
+ if (!reuse_md->migration)
+ return SK_PASS;
+
+ cookie = bpf_get_socket_cookie(reuse_md->sk);
+
+ key = bpf_map_lookup_elem(&migrate_map, &cookie);
+ if (key == NULL)
+ return SK_DROP;
+
+ bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags);
+
+ return SK_PASS;
+}
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
--
2.17.2 (Apple Git-113)

2020-12-07 13:33:32

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 10/13] bpf: Add migration to sk_reuseport_(kern|md).

This patch adds a u8 migration field to sk_reuseport_kern and
sk_reuseport_md to tell the eBPF program whether the kernel is calling it
to select a listener for a SYN or to migrate sockets in the accept queue
or an immature socket during the 3WHS.

Note that this field is accessible only if the attached type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/filter.h | 4 ++--
include/uapi/linux/bpf.h | 1 +
net/core/filter.c | 15 ++++++++++++---
net/core/sock_reuseport.c | 2 +-
tools/include/uapi/linux/bpf.h | 1 +
6 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d05e75ed8c1b..cdeb27f4ad63 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1914,6 +1914,7 @@ struct sk_reuseport_kern {
u32 hash;
u32 reuseport_id;
bool bind_inany;
+ u8 migration;
};
bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
struct bpf_insn_access_aux *info);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1b62397bd124..15d5bf13a905 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -967,12 +967,12 @@ void bpf_warn_invalid_xdp_action(u32 act);
#ifdef CONFIG_INET
struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
struct bpf_prog *prog, struct sk_buff *skb,
- u32 hash);
+ u32 hash, u8 migration);
#else
static inline struct sock *
bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
struct bpf_prog *prog, struct sk_buff *skb,
- u32 hash)
+ u32 hash, u8 migration)
{
return NULL;
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c7f6848c0226..cf518e83df5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4462,6 +4462,7 @@ struct sk_reuseport_md {
__u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
__u32 bind_inany; /* Is sock bound to an INANY address? */
__u32 hash; /* A hash of the packet 4 tuples */
+ __u8 migration; /* Migration type */
};

#define BPF_TAG_SIZE 8
diff --git a/net/core/filter.c b/net/core/filter.c
index 77001a35768f..7bdf62f24044 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9860,7 +9860,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
struct sock_reuseport *reuse,
struct sock *sk, struct sk_buff *skb,
- u32 hash)
+ u32 hash, u8 migration)
{
reuse_kern->skb = skb;
reuse_kern->sk = sk;
@@ -9869,16 +9869,17 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern,
reuse_kern->hash = hash;
reuse_kern->reuseport_id = reuse->reuseport_id;
reuse_kern->bind_inany = reuse->bind_inany;
+ reuse_kern->migration = migration;
}

struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
struct bpf_prog *prog, struct sk_buff *skb,
- u32 hash)
+ u32 hash, u8 migration)
{
struct sk_reuseport_kern reuse_kern;
enum sk_action action;

- bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash);
+ bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration);
action = BPF_PROG_RUN(prog, &reuse_kern);

if (action == SK_PASS)
@@ -10017,6 +10018,10 @@ sk_reuseport_is_valid_access(int off, int size,
case offsetof(struct sk_reuseport_md, hash):
return size == size_default;

+ case bpf_ctx_range(struct sk_reuseport_md, migration):
+ return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE &&
+ size == sizeof(__u8);
+
/* Fields that allow narrowing */
case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
if (size < sizeof_field(struct sk_buff, protocol))
@@ -10089,6 +10094,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
case offsetof(struct sk_reuseport_md, bind_inany):
SK_REUSEPORT_LOAD_FIELD(bind_inany);
break;
+
+ case offsetof(struct sk_reuseport_md, migration):
+ SK_REUSEPORT_LOAD_FIELD(migration);
+ break;
}

return insn - insn_buf;
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 1011c3756c92..b877c8e552d2 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -313,7 +313,7 @@ static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
goto select_by_hash;

if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
- sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
+ sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration);
else
sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len);

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c7f6848c0226..cf518e83df5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4462,6 +4462,7 @@ struct sk_reuseport_md {
__u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
__u32 bind_inany; /* Is sock bound to an INANY address? */
__u32 hash; /* A hash of the packet 4 tuples */
+ __u8 migration; /* Migration type */
};

#define BPF_TAG_SIZE 8
--
2.17.2 (Apple Git-113)

2020-12-07 13:33:45

by Iwashima, Kuniyuki

Subject: [PATCH v2 bpf-next 11/13] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT.

We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing in order to
select the new listener.

Currently, we can get a unique ID for each listener in userspace by
calling bpf_map_lookup_elem() on a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.

This patch makes the sk pointer available in sk_reuseport_md so that we can
get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program.

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
include/uapi/linux/bpf.h | 8 ++++++++
net/core/filter.c | 22 ++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 8 ++++++++
3 files changed, 38 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cf518e83df5c..a688a7a4fe85 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1655,6 +1655,13 @@ union bpf_attr {
* A 8-byte long non-decreasing number on success, or 0 if the
* socket field is missing inside *skb*.
*
+ * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
+ * Description
+ * Equivalent to bpf_get_socket_cookie() helper that accepts
+ * *skb*, but gets socket from **struct bpf_sock** context.
+ * Return
+ * A 8-byte long non-decreasing number.
+ *
* u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
* Description
* Equivalent to bpf_get_socket_cookie() helper that accepts
@@ -4463,6 +4470,7 @@ struct sk_reuseport_md {
__u32 bind_inany; /* Is sock bound to an INANY address? */
__u32 hash; /* A hash of the packet 4 tuples */
__u8 migration; /* Migration type */
+ __bpf_md_ptr(struct bpf_sock *, sk); /* Current listening socket */
};

#define BPF_TAG_SIZE 8
diff --git a/net/core/filter.c b/net/core/filter.c
index 7bdf62f24044..9f7018e3f545 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4631,6 +4631,18 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = {
.arg1_type = ARG_PTR_TO_CTX,
};

+BPF_CALL_1(bpf_get_socket_pointer_cookie, struct sock *, sk)
+{
+ return __sock_gen_cookie(sk);
+}
+
+static const struct bpf_func_proto bpf_get_socket_pointer_cookie_proto = {
+ .func = bpf_get_socket_pointer_cookie,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_SOCKET,
+};
+
BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx)
{
return __sock_gen_cookie(ctx->sk);
@@ -9989,6 +10001,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id,
return &sk_reuseport_load_bytes_proto;
case BPF_FUNC_skb_load_bytes_relative:
return &sk_reuseport_load_bytes_relative_proto;
+ case BPF_FUNC_get_socket_cookie:
+ return &bpf_get_socket_pointer_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -10022,6 +10036,10 @@ sk_reuseport_is_valid_access(int off, int size,
return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE &&
size == sizeof(__u8);

+ case offsetof(struct sk_reuseport_md, sk):
+ info->reg_type = PTR_TO_SOCKET;
+ return size == sizeof(__u64);
+
/* Fields that allow narrowing */
case bpf_ctx_range(struct sk_reuseport_md, eth_protocol):
if (size < sizeof_field(struct sk_buff, protocol))
@@ -10098,6 +10116,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type,
case offsetof(struct sk_reuseport_md, migration):
SK_REUSEPORT_LOAD_FIELD(migration);
break;
+
+ case offsetof(struct sk_reuseport_md, sk):
+ SK_REUSEPORT_LOAD_FIELD(sk);
+ break;
}

return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index cf518e83df5c..a688a7a4fe85 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1655,6 +1655,13 @@ union bpf_attr {
* A 8-byte long non-decreasing number on success, or 0 if the
* socket field is missing inside *skb*.
*
+ * u64 bpf_get_socket_cookie(struct bpf_sock *sk)
+ * Description
+ * Equivalent to bpf_get_socket_cookie() helper that accepts
+ * *skb*, but gets socket from **struct bpf_sock** context.
+ * Return
+ * A 8-byte long non-decreasing number.
+ *
* u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx)
* Description
* Equivalent to bpf_get_socket_cookie() helper that accepts
@@ -4463,6 +4470,7 @@ struct sk_reuseport_md {
__u32 bind_inany; /* Is sock bound to an INANY address? */
__u32 hash; /* A hash of the packet 4 tuples */
__u8 migration; /* Migration type */
+ __bpf_md_ptr(struct bpf_sock *, sk); /* Current listening socket */
};

#define BPF_TAG_SIZE 8
--
2.17.2 (Apple Git-113)

2020-12-07 13:34:02

by Iwashima, Kuniyuki

[permalink] [raw]
Subject: [PATCH v2 bpf-next 12/13] bpf: Call bpf_run_sk_reuseport() for socket migration.

This patch supports socket migration by eBPF. If the expected attach type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener with
BPF_FUNC_sk_select_reuseport() or cancel migration by returning SK_DROP.
This is useful when listeners have different settings at the socket API
level or when we want to free resources as soon as possible.

There are two noteworthy points. The first is that we select a listening
socket in reuseport_detach_sock() and __reuseport_select_sock(), but we do
not have a struct sk_buff when closing a listener or retransmitting a
SYN+ACK. However, some helper functions do not expect the skb to be NULL
(e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer()
in BPF_FUNC_skb_load_bytes_relative()), so we temporarily allocate an empty
skb before running the eBPF program. The second is that we do not have a
struct request_sock in the unhash path, and the sk_hash of the listener is
always zero, so we pass zero as the hash to bpf_run_sk_reuseport().
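
For illustration only, a rough sketch of a program for the new attach type;
the map, the fixed target index, and the assumption that the loader sets
expected_attach_type to BPF_SK_REUSEPORT_SELECT_OR_MIGRATE are illustrative
and not part of this patch:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 16);
	__type(key, __u32);
	__type(value, __u64);
} target_listeners SEC(".maps");

/* Assumed to be loaded as BPF_PROG_TYPE_SK_REUSEPORT with
 * expected_attach_type = BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
 */
SEC("sk_reuseport")
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
	__u32 idx = 0;	/* toy policy: always migrate to slot 0 */

	/* Zero means an ordinary lookup, not a migration: keep the
	 * kernel's hash-based selection.
	 */
	if (!reuse_md->migration)
		return SK_PASS;

	/* Queue or request migration: pick an explicit target, or
	 * return SK_DROP to cancel the migration.
	 */
	if (bpf_sk_select_reuseport(reuse_md, &target_listeners, &idx, 0))
		return SK_DROP;

	return SK_PASS;
}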

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
---
net/core/filter.c | 19 +++++++++++++++++++
net/core/sock_reuseport.c | 21 +++++++++++----------
net/ipv4/inet_hashtables.c | 2 +-
3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9f7018e3f545..53fa3bcbf00f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9890,10 +9890,29 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
{
struct sk_reuseport_kern reuse_kern;
enum sk_action action;
+ bool allocated = false;
+
+ if (migration) {
+ /* cancel migration for possibly incapable eBPF program */
+ if (prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)
+ return ERR_PTR(-ENOTSUPP);
+
+ if (!skb) {
+ allocated = true;
+ skb = alloc_skb(0, GFP_ATOMIC);
+ if (!skb)
+ return ERR_PTR(-ENOMEM);
+ }
+ } else if (!skb) {
+ return NULL; /* fall back to select by hash */
+ }

bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration);
action = BPF_PROG_RUN(prog, &reuse_kern);

+ if (allocated)
+ kfree_skb(skb);
+
if (action == SK_PASS)
return reuse_kern.selected_sk;
else
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index b877c8e552d2..2358e8896199 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -221,8 +221,15 @@ struct sock *reuseport_detach_sock(struct sock *sk)
lockdep_is_held(&reuseport_lock));

if (sk->sk_protocol == IPPROTO_TCP) {
- if (reuse->num_socks && !prog)
- nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+ if (reuse->num_socks) {
+ if (prog)
+ nsk = bpf_run_sk_reuseport(reuse, sk, prog, NULL, 0,
+ BPF_SK_REUSEPORT_MIGRATE_QUEUE);
+
+ if (!nsk)
+ nsk = i == reuse->num_socks ?
+ reuse->socks[i - 1] : reuse->socks[i];
+ }

reuse->num_closed_socks++;
} else {
@@ -306,15 +313,9 @@ static struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
if (!prog)
goto select_by_hash;

- if (migration)
- goto out;
-
- if (!skb)
- goto select_by_hash;
-
if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration);
- else
+ else if (skb)
sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len);

select_by_hash:
@@ -352,7 +353,7 @@ struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
struct sock *nsk;

nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
- if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
+ if (!IS_ERR_OR_NULL(nsk) && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
return nsk;

return NULL;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 545538a6bfac..59f58740c20d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -699,7 +699,7 @@ void inet_unhash(struct sock *sk)

if (rcu_access_pointer(sk->sk_reuseport_cb)) {
nsk = reuseport_detach_sock(sk);
- if (nsk)
+ if (!IS_ERR_OR_NULL(nsk))
inet_csk_reqsk_queue_migrate(sk, nsk);
}

--
2.17.2 (Apple Git-113)

2020-12-07 16:29:00

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 bpf-next 03/13] Revert "locking/spinlocks: Remove the unused spin_lock_bh_nested() API"

On 12/7/20 8:24 AM, Kuniyuki Iwashima wrote:
> This reverts commit 607904c357c61adf20b8fd18af765e501d61a385 to use
> spin_lock_bh_nested() in the next commit.
>
> Link: https://lore.kernel.org/netdev/[email protected]/
> Signed-off-by: Kuniyuki Iwashima <[email protected]>
> CC: Waiman Long <[email protected]>

If there is a use case for spin_lock_bh_nested(), it is perfectly fine
to add it back.

Acked-by: Waiman Long <[email protected]>
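
For reference, the pattern that needs it is taking two locks of the same
lock class in a fixed order with BH disabled; a minimal sketch with made-up
types (not the actual call site in this series):

#include <linux/spinlock.h>
#include <linux/list.h>

/* Illustrative type only, not from the series. */
struct demo_queue {
	spinlock_t lock;
	struct list_head head;
};

/* Move every entry from @src to @dst with both queues locked. The two
 * locks share a lock class, so the second acquisition is annotated with
 * a subclass to avoid a false-positive lockdep recursion report.
 */
static void demo_queue_migrate(struct demo_queue *src, struct demo_queue *dst)
{
	spin_lock_bh(&src->lock);
	spin_lock_bh_nested(&dst->lock, SINGLE_DEPTH_NESTING);

	list_splice_init(&src->head, &dst->head);

	spin_unlock_bh(&dst->lock);
	spin_unlock_bh(&src->lock);
}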

> ---
> include/linux/spinlock.h | 8 ++++++++
> include/linux/spinlock_api_smp.h | 2 ++
> include/linux/spinlock_api_up.h | 1 +
> kernel/locking/spinlock.c | 8 ++++++++
> 4 files changed, 19 insertions(+)
>
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 79897841a2cc..c020b375a071 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -227,6 +227,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> # define raw_spin_lock_nested(lock, subclass) \
> _raw_spin_lock_nested(lock, subclass)
> +# define raw_spin_lock_bh_nested(lock, subclass) \
> + _raw_spin_lock_bh_nested(lock, subclass)
>
> # define raw_spin_lock_nest_lock(lock, nest_lock) \
> do { \
> @@ -242,6 +244,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
> # define raw_spin_lock_nested(lock, subclass) \
> _raw_spin_lock(((void)(subclass), (lock)))
> # define raw_spin_lock_nest_lock(lock, nest_lock) _raw_spin_lock(lock)
> +# define raw_spin_lock_bh_nested(lock, subclass) _raw_spin_lock_bh(lock)
> #endif
>
> #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> @@ -369,6 +372,11 @@ do { \
> raw_spin_lock_nested(spinlock_check(lock), subclass); \
> } while (0)
>
> +#define spin_lock_bh_nested(lock, subclass) \
> +do { \
> + raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\
> +} while (0)
> +
> #define spin_lock_nest_lock(lock, nest_lock) \
> do { \
> raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock); \
> diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
> index 19a9be9d97ee..d565fb6304f2 100644
> --- a/include/linux/spinlock_api_smp.h
> +++ b/include/linux/spinlock_api_smp.h
> @@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr);
> void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
> void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
> __acquires(lock);
> +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
> + __acquires(lock);
> void __lockfunc
> _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map)
> __acquires(lock);
> diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
> index d0d188861ad6..d3afef9d8dbe 100644
> --- a/include/linux/spinlock_api_up.h
> +++ b/include/linux/spinlock_api_up.h
> @@ -57,6 +57,7 @@
>
> #define _raw_spin_lock(lock) __LOCK(lock)
> #define _raw_spin_lock_nested(lock, subclass) __LOCK(lock)
> +#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock)
> #define _raw_read_lock(lock) __LOCK(lock)
> #define _raw_write_lock(lock) __LOCK(lock)
> #define _raw_spin_lock_bh(lock) __LOCK_BH(lock)
> diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
> index 0ff08380f531..48e99ed1bdd8 100644
> --- a/kernel/locking/spinlock.c
> +++ b/kernel/locking/spinlock.c
> @@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
> }
> EXPORT_SYMBOL(_raw_spin_lock_nested);
>
> +void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
> +{
> + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
> + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
> + LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> +}
> +EXPORT_SYMBOL(_raw_spin_lock_bh_nested);
> +
> unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
> int subclass)
> {