2017-03-24 17:08:01

by Alexander Duyck

[permalink] [raw]
Subject: [net-next PATCH v3 0/8] Add busy poll support for epoll

This patch set adds support for using busy polling with epoll. The main
idea behind this is that we record the NAPI ID for the last event that is
moved onto the ready list for the epoll context and then when we no longer
have any events on the ready list we begin polling with that ID. If the
busy polling does not yield any events then we will reset the NAPI ID to 0
and wait until a new event is added to the ready list with a valid NAPI ID
before we will resume busy polling.

Most of the changes in this set that I authored are meant to be cleanups or
fixes for various things. For example, I am trying to make it so that we
don't perform hash look-ups for the NAPI instance when we are only working
with sender_cpu values and the like.

At the heart of this set is the last 3 patches which enable epoll support
and add support for obtaining the NAPI ID of a given socket. With these it
becomes possible for an application to make use of epoll and get optimal
busy poll utilization by stacking multiple sockets with the same NAPI ID on
the same epoll context.
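The userspace side of that stacking can be sketched roughly as follows. The helper names and the worker-selection scheme here are illustrative only and not part of the series; the SO_INCOMING_NAPI_ID option itself comes from patch 8/8, and the fallback define is for headers that predate it:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* asm-generic value from this series */
#endif

/* Return the socket's NAPI ID, or 0 if unset or unsupported. */
static unsigned int get_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) < 0)
		return 0;	/* older kernel: treat as "no NAPI ID" */
	return napi_id;
}

/* Pick one of nr_workers per-thread epoll fds so that sockets sharing
 * a NAPI ID land on the same epoll context. */
static int pick_epoll(const int *epfds, int nr_workers, int sockfd)
{
	return epfds[get_napi_id(sockfd) % (unsigned int)nr_workers];
}
```

A real application would likely key an explicit map of NAPI ID to worker rather than the modulo used here, but the point is the same: sockets with one NAPI ID end up on one epoll context, so that context busy polls a single queue.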

v1: The first version of this series only allowed epoll to busy poll if all
of the sockets with a NAPI ID shared the same NAPI ID. I feel we were
too strict with this requirement, so I changed the behavior for v2.
v2: The second version was pretty much a full rewrite of the first set. The
main changes consisted of pulling apart several patches to better
address the need to clean up a few items and to make the code easier to
review. In that set, however, I went a bit overboard trying to fix an
issue that would only occur after 500+ years of uptime, and in the
process unnecessarily limited the range for busy_poll/busy_read.
v3: Split off the code for limiting busy_poll and busy_read into a separate
patch for net.
Updated patch that changed busy loop time tracking so that it uses
"local_clock() >> 10" as we originally did.
Tweaked "Change return type.." patch by moving declaration of "work"
inside the loop where it was accessed and always reset to 0.
Added "Acked-by" for patches that received acks.

---

Alexander Duyck (5):
net: Busy polling should ignore sender CPUs
tcp: Record Rx hash and NAPI ID in tcp_child_process
net: Only define skb_mark_napi_id in one spot instead of two
net: Change return type of sk_busy_loop from bool to void
net: Track start of busy loop instead of when it should end

Sridhar Samudrala (3):
net: Commonize busy polling code to focus on napi_id instead of socket
epoll: Add busy poll support to epoll with socket fds.
net: Introduce SO_INCOMING_NAPI_ID


arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 1 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
fs/eventpoll.c | 93 +++++++++++++++++++++++++++++
fs/select.c | 16 +++--
include/net/busy_poll.h | 102 +++++++++++++++++++-------------
include/uapi/asm-generic/socket.h | 2 +
net/core/datagram.c | 8 ++-
net/core/dev.c | 41 ++++++-------
net/core/sock.c | 23 +++++++
net/ipv4/tcp_ipv4.c | 2 -
net/ipv4/tcp_minisocks.c | 4 +
net/ipv6/tcp_ipv6.c | 2 -
net/sctp/socket.c | 9 ++-
23 files changed, 244 insertions(+), 81 deletions(-)

--


2017-03-24 17:08:24

by Alexander Duyck

Subject: [net-next PATCH v3 2/8] tcp: Record Rx hash and NAPI ID in tcp_child_process

From: Alexander Duyck <[email protected]>

While working on some recent busy poll changes we found that child sockets
were being instantiated without the NAPI ID being set. In our first attempt
to fix it, it was suggested that we just pull the NAPI ID programming into
tcp_child_process() itself, since all callers will need to have it set.

In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.

Reported-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
net/ipv4/tcp_ipv4.c | 2 --
net/ipv4/tcp_minisocks.c | 4 ++++
net/ipv6/tcp_ipv6.c | 2 --
3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7482b5d11861..20cbd2f07f28 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1409,8 +1409,6 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
if (!nsk)
goto discard;
if (nsk != sk) {
- sock_rps_save_rxhash(nsk, skb);
- sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 692f974e5abe..40f125b19988 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -26,6 +26,7 @@
#include <net/tcp.h>
#include <net/inet_common.h>
#include <net/xfrm.h>
+#include <net/busy_poll.h>

int sysctl_tcp_abort_on_overflow __read_mostly;

@@ -798,6 +799,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
int ret = 0;
int state = child->sk_state;

+ /* record NAPI ID of child */
+ sk_mark_napi_id(child, skb);
+
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
ret = tcp_rcv_state_process(child, skb);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0f08d718a002..ee13e380c0dd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1293,8 +1293,6 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
goto discard;

if (nsk != sk) {
- sock_rps_save_rxhash(nsk, skb);
- sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb))
goto reset;
if (opt_skb)

2017-03-24 17:08:40

by Alexander Duyck

Subject: [net-next PATCH v3 1/8] net: Busy polling should ignore sender CPUs

From: Alexander Duyck <[email protected]>

This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.

One issue I found is that we weren't validating that the napi_id was valid
before we started trying to set up busy polling. This change corrects
that by using the MIN_NAPI_ID value that is now used both when allocating
NAPI IDs and when validating them.

Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
include/net/busy_poll.h | 9 +++++++--
net/core/dev.c | 13 +++++++++----
2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c0452de83086..3fcda9e70c3f 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -35,6 +35,12 @@
extern unsigned int sysctl_net_busy_read __read_mostly;
extern unsigned int sysctl_net_busy_poll __read_mostly;

+/* 0 - Reserved to indicate value not set
+ * 1..NR_CPUS - Reserved for sender_cpu
+ * NR_CPUS+1..~0 - Region available for NAPI IDs
+ */
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+
static inline bool net_busy_loop_on(void)
{
return sysctl_net_busy_poll;
@@ -58,10 +64,9 @@ static inline unsigned long busy_loop_end_time(void)

static inline bool sk_can_busy_loop(const struct sock *sk)
{
- return sk->sk_ll_usec && sk->sk_napi_id && !signal_pending(current);
+ return sk->sk_ll_usec && !signal_pending(current);
}

-
static inline bool busy_loop_timeout(unsigned long end_time)
{
unsigned long now = busy_loop_us_clock();
diff --git a/net/core/dev.c b/net/core/dev.c
index 7869ae3837ca..ab337bf5bbf4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5066,15 +5066,20 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
+ unsigned int napi_id;
int rc;

restart:
+ napi_id = READ_ONCE(sk->sk_napi_id);
+ if (napi_id < MIN_NAPI_ID)
+ return 0;
+
rc = false;
napi_poll = NULL;

rcu_read_lock();

- napi = napi_by_id(sk->sk_napi_id);
+ napi = napi_by_id(napi_id);
if (!napi)
goto out;

@@ -5143,10 +5148,10 @@ static void napi_hash_add(struct napi_struct *napi)

spin_lock(&napi_hash_lock);

- /* 0..NR_CPUS+1 range is reserved for sender_cpu use */
+ /* 0..NR_CPUS range is reserved for sender_cpu use */
do {
- if (unlikely(++napi_gen_id < NR_CPUS + 1))
- napi_gen_id = NR_CPUS + 1;
+ if (unlikely(++napi_gen_id < MIN_NAPI_ID))
+ napi_gen_id = MIN_NAPI_ID;
} while (napi_by_id(napi_gen_id));
napi->napi_id = napi_gen_id;


2017-03-24 17:08:58

by Alexander Duyck

Subject: [net-next PATCH v3 5/8] net: Track start of busy loop instead of when it should end

From: Alexander Duyck <[email protected]>

This patch flips the logic we were using to determine if the busy polling
has timed out. The main motivation is that we will need to support two
different possible timeout values in the future; by recording the start
time rather than the intended end time, we can make the end_time specific
to the task, be it epoll or socket based polling.

Signed-off-by: Alexander Duyck <[email protected]>
---
fs/select.c | 16 ++++++-----
include/net/busy_poll.h | 68 ++++++++++++++++++++++++++---------------------
net/core/dev.c | 6 ++--
3 files changed, 49 insertions(+), 41 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index e2112270d75a..9287d3a96e35 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -409,7 +409,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
int retval, i, timed_out = 0;
u64 slack = 0;
unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
- unsigned long busy_end = 0;
+ unsigned long busy_start = 0;

rcu_read_lock();
retval = max_select_fd(n, fds);
@@ -512,11 +512,11 @@ int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)

/* only if found POLL_BUSY_LOOP sockets && not out of time */
if (can_busy_loop && !need_resched()) {
- if (!busy_end) {
- busy_end = busy_loop_end_time();
+ if (!busy_start) {
+ busy_start = busy_loop_current_time();
continue;
}
- if (!busy_loop_timeout(busy_end))
+ if (!busy_loop_timeout(busy_start))
continue;
}
busy_flag = 0;
@@ -800,7 +800,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
int timed_out = 0, count = 0;
u64 slack = 0;
unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
- unsigned long busy_end = 0;
+ unsigned long busy_start = 0;

/* Optimise the no-wait case */
if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
@@ -853,11 +853,11 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,

/* only if found POLL_BUSY_LOOP sockets && not out of time */
if (can_busy_loop && !need_resched()) {
- if (!busy_end) {
- busy_end = busy_loop_end_time();
+ if (!busy_start) {
+ busy_start = busy_loop_current_time();
continue;
}
- if (!busy_loop_timeout(busy_end))
+ if (!busy_loop_timeout(busy_start))
continue;
}
busy_flag = 0;
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c55760f4820f..72c82f2ea536 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -46,62 +46,70 @@ static inline bool net_busy_loop_on(void)
return sysctl_net_busy_poll;
}

-static inline u64 busy_loop_us_clock(void)
+static inline bool sk_can_busy_loop(const struct sock *sk)
{
- return local_clock() >> 10;
+ return sk->sk_ll_usec && !signal_pending(current);
}

-static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
-{
- return busy_loop_us_clock() + ACCESS_ONCE(sk->sk_ll_usec);
-}
+void sk_busy_loop(struct sock *sk, int nonblock);

-/* in poll/select we use the global sysctl_net_ll_poll value */
-static inline unsigned long busy_loop_end_time(void)
+#else /* CONFIG_NET_RX_BUSY_POLL */
+static inline unsigned long net_busy_loop_on(void)
{
- return busy_loop_us_clock() + ACCESS_ONCE(sysctl_net_busy_poll);
+ return 0;
}

-static inline bool sk_can_busy_loop(const struct sock *sk)
+static inline bool sk_can_busy_loop(struct sock *sk)
{
- return sk->sk_ll_usec && !signal_pending(current);
+ return false;
}

-static inline bool busy_loop_timeout(unsigned long end_time)
+static inline void sk_busy_loop(struct sock *sk, int nonblock)
{
- unsigned long now = busy_loop_us_clock();
-
- return time_after(now, end_time);
}

-void sk_busy_loop(struct sock *sk, int nonblock);
+#endif /* CONFIG_NET_RX_BUSY_POLL */

-#else /* CONFIG_NET_RX_BUSY_POLL */
-static inline unsigned long net_busy_loop_on(void)
+static inline unsigned long busy_loop_current_time(void)
{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ return (unsigned long)(local_clock() >> 10);
+#else
return 0;
+#endif
}

-static inline unsigned long busy_loop_end_time(void)
+/* in poll/select we use the global sysctl_net_ll_poll value */
+static inline bool busy_loop_timeout(unsigned long start_time)
{
- return 0;
-}
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned long bp_usec = READ_ONCE(sysctl_net_busy_poll);

-static inline bool sk_can_busy_loop(struct sock *sk)
-{
- return false;
-}
+ if (bp_usec) {
+ unsigned long end_time = start_time + bp_usec;
+ unsigned long now = busy_loop_current_time();

-static inline bool busy_loop_timeout(unsigned long end_time)
-{
+ return time_after(now, end_time);
+ }
+#endif
return true;
}

-static inline void sk_busy_loop(struct sock *sk, int nonblock)
+static inline bool sk_busy_loop_timeout(struct sock *sk,
+ unsigned long start_time)
{
-}
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned long bp_usec = READ_ONCE(sk->sk_ll_usec);

-#endif /* CONFIG_NET_RX_BUSY_POLL */
+ if (bp_usec) {
+ unsigned long end_time = start_time + bp_usec;
+ unsigned long now = busy_loop_current_time();
+
+ return time_after(now, end_time);
+ }
+#endif
+ return true;
+}

/* used in the NIC receive handler to mark the skb */
static inline void skb_mark_napi_id(struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index af70eb6ba682..2d1b5613b7fd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5062,7 +5062,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)

void sk_busy_loop(struct sock *sk, int nonblock)
{
- unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
+ unsigned long start_time = nonblock ? 0 : busy_loop_current_time();
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
@@ -5111,7 +5111,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
local_bh_enable();

if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
- busy_loop_timeout(end_time))
+ sk_busy_loop_timeout(sk, start_time))
break;

if (unlikely(need_resched())) {
@@ -5121,7 +5121,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
rcu_read_unlock();
cond_resched();
if (!skb_queue_empty(&sk->sk_receive_queue) ||
- busy_loop_timeout(end_time))
+ sk_busy_loop_timeout(sk, start_time))
return;
goto restart;
}

2017-03-24 17:08:50

by Alexander Duyck

Subject: [net-next PATCH v3 3/8] net: Only define skb_mark_napi_id in one spot instead of two

From: Alexander Duyck <[email protected]>

Instead of defining two versions of skb_mark_napi_id I think it is more
readable to match the format of the sk_mark_napi_id functions and wrap
the contents of a single definition in an ifdef. This way we save a few
lines of code, since we only need the 2 lines for the ifdef/endif pair
but needed 5 for the extra function definition.

Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
include/net/busy_poll.h | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 3fcda9e70c3f..b82d6ba70a14 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -76,14 +76,6 @@ static inline bool busy_loop_timeout(unsigned long end_time)

bool sk_busy_loop(struct sock *sk, int nonblock);

-/* used in the NIC receive handler to mark the skb */
-static inline void skb_mark_napi_id(struct sk_buff *skb,
- struct napi_struct *napi)
-{
- skb->napi_id = napi->napi_id;
-}
-
-
#else /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long net_busy_loop_on(void)
{
@@ -100,11 +92,6 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
}

-static inline void skb_mark_napi_id(struct sk_buff *skb,
- struct napi_struct *napi)
-{
-}
-
static inline bool busy_loop_timeout(unsigned long end_time)
{
return true;
@@ -117,6 +104,15 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)

#endif /* CONFIG_NET_RX_BUSY_POLL */

+/* used in the NIC receive handler to mark the skb */
+static inline void skb_mark_napi_id(struct sk_buff *skb,
+ struct napi_struct *napi)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ skb->napi_id = napi->napi_id;
+#endif
+}
+
/* used in the protocol hanlder to propagate the napi_id to the socket */
static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
{

2017-03-24 17:09:24

by Alexander Duyck

Subject: [net-next PATCH v3 8/8] net: Introduce SO_INCOMING_NAPI_ID

From: Sridhar Samudrala <[email protected]>

This socket option returns the NAPI ID associated with the queue on which
the last frame is received. This information can be used by the apps to
split the incoming flows among the threads based on the Rx queue on which
they are received.

If the NAPI ID actually represents a sender_cpu then the value is ignored
and 0 is returned.
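That reporting rule can be mirrored in a few lines of plain C. NR_CPUS below is a placeholder for the kernel's configured value; this is a sketch of the logic, not kernel code:

```c
#include <assert.h>

#define NR_CPUS     64	/* placeholder for the kernel's configured NR_CPUS */
#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))

/* Raw sk_napi_id values in the reserved 0..NR_CPUS sender_cpu range
 * are collapsed to 0 before being reported to userspace. */
static unsigned int reported_napi_id(unsigned int raw)
{
	return raw < MIN_NAPI_ID ? 0 : raw;
}
```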

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
arch/alpha/include/uapi/asm/socket.h | 2 ++
arch/avr32/include/uapi/asm/socket.h | 2 ++
arch/frv/include/uapi/asm/socket.h | 2 ++
arch/ia64/include/uapi/asm/socket.h | 2 ++
arch/m32r/include/uapi/asm/socket.h | 2 ++
arch/mips/include/uapi/asm/socket.h | 1 +
arch/mn10300/include/uapi/asm/socket.h | 2 ++
arch/parisc/include/uapi/asm/socket.h | 2 ++
arch/powerpc/include/uapi/asm/socket.h | 2 ++
arch/s390/include/uapi/asm/socket.h | 2 ++
arch/sparc/include/uapi/asm/socket.h | 2 ++
arch/xtensa/include/uapi/asm/socket.h | 2 ++
include/uapi/asm-generic/socket.h | 2 ++
net/core/sock.c | 12 ++++++++++++
14 files changed, 37 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 089db42c1b40..1bb8cac61a28 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 6eabcbd2f82a..f824eeb0f2e4 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index bd497f8356b9..a8ad9bebfc47 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -94,5 +94,7 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_SOCKET_H */

diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index f1bb54686168..6af3253e4209 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -103,4 +103,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 459c46076f6f..e98b6bb897c0 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 688c18dd62ef..ae2b62e39d4d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -112,5 +112,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56

#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 312d2c457a04..e4ac1843ee01 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index b98ec38f2083..f754c793e82a 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -93,4 +93,6 @@

#define SO_MEMINFO 0x4030

+#define SO_INCOMING_NAPI_ID 0x4031
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 099a889240f6..5f84af7dcb2e 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 6199bb34f7fa..25ac4960e707 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -100,4 +100,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 12cd8c2ec422..b05513acd589 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -90,6 +90,8 @@

#define SO_MEMINFO 0x0039

+#define SO_INCOMING_NAPI_ID 0x003a
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index d0b85f6c1484..786606c81edd 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -105,4 +105,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8313702c1eae..c98a52fb572a 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -96,4 +96,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 0aa725cb3dd6..1de6c369ed86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1328,6 +1328,18 @@ int sock_getsockopt(struct socket *sock, int level, int optname,

goto lenout;
}
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ case SO_INCOMING_NAPI_ID:
+ v.val = READ_ONCE(sk->sk_napi_id);
+
+ /* aggregate non-NAPI IDs down to 0 */
+ if (v.val < MIN_NAPI_ID)
+ v.val = 0;
+
+ break;
+#endif
+
default:
/* We implement the SO_SNDLOWAT etc to not be settable
* (1003.1g 7).

2017-03-24 17:09:36

by Alexander Duyck

Subject: [net-next PATCH v3 6/8] net: Commonize busy polling code to focus on napi_id instead of socket

From: Sridhar Samudrala <[email protected]>

Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.

This enables re-using this function in the epoll busy poll implementation.
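The core design of the refactor, a generic loop parameterized by a loop_end callback, can be sketched independently of the networking details. The predicate used in the test is just a stand-in for "events arrived or time ran out":

```c
#include <assert.h>
#include <stdbool.h>

/* Generic loop in the style of napi_busy_loop(): poll, then ask the
 * caller-supplied predicate whether to stop.  A NULL loop_end matches
 * the nonblock case and results in a single pass. */
static void busy_loop(bool (*loop_end)(void *, unsigned long),
		      void *loop_end_arg, unsigned long start_time)
{
	do {
		/* ... drive the NAPI poll routine for a budget of packets ... */
	} while (loop_end && !loop_end(loop_end_arg, start_time));
}
```

The payoff is that sk_busy_loop_end() (checking the socket receive queue) and epoll's ep_busy_loop_end() (checking the ready list) become two small predicates over the same loop.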

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
include/net/busy_poll.h | 20 +++++++++++++++-----
net/core/dev.c | 21 ++++++++-------------
net/core/sock.c | 11 +++++++++++
3 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 72c82f2ea536..8ffd434676b7 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -51,7 +51,11 @@ static inline bool sk_can_busy_loop(const struct sock *sk)
return sk->sk_ll_usec && !signal_pending(current);
}

-void sk_busy_loop(struct sock *sk, int nonblock);
+bool sk_busy_loop_end(void *p, unsigned long start_time);
+
+void napi_busy_loop(unsigned int napi_id,
+ bool (*loop_end)(void *, unsigned long),
+ void *loop_end_arg);

#else /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long net_busy_loop_on(void)
@@ -64,10 +68,6 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
}

-static inline void sk_busy_loop(struct sock *sk, int nonblock)
-{
-}
-
#endif /* CONFIG_NET_RX_BUSY_POLL */

static inline unsigned long busy_loop_current_time(void)
@@ -111,6 +111,16 @@ static inline bool sk_busy_loop_timeout(struct sock *sk,
return true;
}

+static inline void sk_busy_loop(struct sock *sk, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned int napi_id = READ_ONCE(sk->sk_napi_id);
+
+ if (napi_id >= MIN_NAPI_ID)
+ napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk);
+#endif
+}
+
/* used in the NIC receive handler to mark the skb */
static inline void skb_mark_napi_id(struct sk_buff *skb,
struct napi_struct *napi)
diff --git a/net/core/dev.c b/net/core/dev.c
index 2d1b5613b7fd..ef9fe60ee294 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5060,19 +5060,16 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
do_softirq();
}

-void sk_busy_loop(struct sock *sk, int nonblock)
+void napi_busy_loop(unsigned int napi_id,
+ bool (*loop_end)(void *, unsigned long),
+ void *loop_end_arg)
{
- unsigned long start_time = nonblock ? 0 : busy_loop_current_time();
+ unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
- unsigned int napi_id;

restart:
- napi_id = READ_ONCE(sk->sk_napi_id);
- if (napi_id < MIN_NAPI_ID)
- return;
-
napi_poll = NULL;

rcu_read_lock();
@@ -5106,12 +5103,11 @@ void sk_busy_loop(struct sock *sk, int nonblock)
trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
count:
if (work > 0)
- __NET_ADD_STATS(sock_net(sk),
+ __NET_ADD_STATS(dev_net(napi->dev),
LINUX_MIB_BUSYPOLLRXPACKETS, work);
local_bh_enable();

- if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
- sk_busy_loop_timeout(sk, start_time))
+ if (!loop_end || loop_end(loop_end_arg, start_time))
break;

if (unlikely(need_resched())) {
@@ -5120,8 +5116,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
preempt_enable();
rcu_read_unlock();
cond_resched();
- if (!skb_queue_empty(&sk->sk_receive_queue) ||
- sk_busy_loop_timeout(sk, start_time))
+ if (loop_end(loop_end_arg, start_time))
return;
goto restart;
}
@@ -5133,7 +5128,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
out:
rcu_read_unlock();
}
-EXPORT_SYMBOL(sk_busy_loop);
+EXPORT_SYMBOL(napi_busy_loop);

#endif /* CONFIG_NET_RX_BUSY_POLL */

diff --git a/net/core/sock.c b/net/core/sock.c
index f8c0373a3a74..0aa725cb3dd6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3231,3 +3231,14 @@ static int __init proto_init(void)
subsys_initcall(proto_init);

#endif /* PROC_FS */
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+bool sk_busy_loop_end(void *p, unsigned long start_time)
+{
+ struct sock *sk = p;
+
+ return !skb_queue_empty(&sk->sk_receive_queue) ||
+ sk_busy_loop_timeout(sk, start_time);
+}
+EXPORT_SYMBOL(sk_busy_loop_end);
+#endif /* CONFIG_NET_RX_BUSY_POLL */

2017-03-24 17:09:49

by Alexander Duyck

Subject: [net-next PATCH v3 7/8] epoll: Add busy poll support to epoll with socket fds.

From: Sridhar Samudrala <[email protected]>

This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
added to the ready list that has a valid NAPI ID, and it will use that ID
for busy polling until the ready list goes empty. Once the ready list goes
empty the NAPI ID is reset and busy polling is disabled until a new socket
is added to the ready list.

In addition when we insert a new socket into the epoll we record the NAPI
ID and assume we are going to receive events on it. If that doesn't occur
it will be evicted as the active NAPI ID and we will resume normal
behavior.

An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
options to spread the incoming connections to specific worker threads
based on the incoming queue. This enables epoll for each worker thread
to have only sockets that receive packets from a single queue. So when an
application calls epoll_wait() and there are no events available to report,
busy polling is done on the associated queue to pull the packets.
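From the worker's point of view the loop is unchanged; the busy polling happens inside epoll_wait() once the sysctl is enabled. A minimal sketch of that loop follows; the helper names are ours, and the test registers a pipe purely to show the event flow where a real worker would register sockets:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register an fd for read readiness on a worker's epoll instance. */
static int watch_fd(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* One pass of a worker: with net.core.busy_poll set and socket fds
 * registered, epoll_wait() busy polls the epoll context's recorded
 * NAPI queue before it blocks. */
static int wait_events(int epfd, struct epoll_event *evs, int max,
		       int timeout_ms)
{
	return epoll_wait(epfd, evs, max, timeout_ms);
}
```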

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
fs/eventpoll.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 341251421ced..5420767c9b68 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -42,6 +42,7 @@
#include <linux/seq_file.h>
#include <linux/compat.h>
#include <linux/rculist.h>
+#include <net/busy_poll.h>

/*
* LOCKING:
@@ -224,6 +225,11 @@ struct eventpoll {
/* used to optimize loop detection check */
int visited;
struct list_head visited_list_link;
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ /* used to track busy poll napi_id */
+ unsigned int napi_id;
+#endif
};

/* Wait structure used by the poll hooks */
@@ -384,6 +390,77 @@ static inline int ep_events_available(struct eventpoll *ep)
return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
}

+#ifdef CONFIG_NET_RX_BUSY_POLL
+static bool ep_busy_loop_end(void *p, unsigned long start_time)
+{
+ struct eventpoll *ep = p;
+
+ return ep_events_available(ep) || busy_loop_timeout(start_time);
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
+/*
+ * Busy poll if globally on and supporting sockets found && no events,
+ * busy loop will return if need_resched or ep_events_available.
+ *
+ * we must do our busy polling with irqs enabled
+ */
+static void ep_busy_loop(struct eventpoll *ep, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned int napi_id = READ_ONCE(ep->napi_id);
+
+ if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on())
+ napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep);
+#endif
+}
+
+static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ if (ep->napi_id)
+ ep->napi_id = 0;
+#endif
+}
+
+/*
+ * Set epoll busy poll NAPI ID from sk.
+ */
+static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ struct eventpoll *ep;
+ unsigned int napi_id;
+ struct socket *sock;
+ struct sock *sk;
+ int err;
+
+ if (!net_busy_loop_on())
+ return;
+
+ sock = sock_from_file(epi->ffd.file, &err);
+ if (!sock)
+ return;
+
+ sk = sock->sk;
+ if (!sk)
+ return;
+
+ napi_id = READ_ONCE(sk->sk_napi_id);
+ ep = epi->ep;
+
+ /* Non-NAPI IDs can be rejected
+ * or
+ * Nothing to do if we already have this ID
+ */
+ if (napi_id < MIN_NAPI_ID || napi_id == ep->napi_id)
+ return;
+
+ /* record NAPI ID for use in next busy poll */
+ ep->napi_id = napi_id;
+#endif
+}
+
/**
* ep_call_nested - Perform a bound (possibly) nested call, by checking
* that the recursion limit is not exceeded, and that
@@ -1022,6 +1099,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k

spin_lock_irqsave(&ep->lock, flags);

+ ep_set_busy_poll_napi_id(epi);
+
/*
* If the event mask does not contain any poll(2) event, we consider the
* descriptor to be disabled. This condition is likely the effect of the
@@ -1363,6 +1442,9 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
/* We have to drop the new item inside our item list to keep track of it */
spin_lock_irqsave(&ep->lock, flags);

+ /* record NAPI ID of new item if present */
+ ep_set_busy_poll_napi_id(epi);
+
/* If the file is already "ready" we drop it inside the ready list */
if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
@@ -1637,10 +1719,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
}

fetch_events:
+
+ if (!ep_events_available(ep))
+ ep_busy_loop(ep, timed_out);
+
spin_lock_irqsave(&ep->lock, flags);

if (!ep_events_available(ep)) {
/*
+ * Busy poll timed out. Drop NAPI ID for now, we can add
+ * it back in when we have moved a socket with a valid NAPI
+ * ID onto the ready list.
+ */
+ ep_reset_busy_poll_napi_id(ep);
+
+ /*
* We don't have any available event to return to the caller.
* We need to sleep here, and we will be wake up by
* ep_poll_callback() when events will become available.

2017-03-24 17:10:33

by Alexander Duyck

[permalink] [raw]
Subject: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

From: Alexander Duyck <[email protected]>

From what I can tell there are only a couple of spots where we actually
check the return value of sk_busy_loop. As there are only a few
consumers of that data, and the data being checked for can be replaced
with a check for !skb_queue_empty() we might as well just pull the code
out of sk_busy_loop and place it in the spots that actually need it.

Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
include/net/busy_poll.h | 5 ++---
net/core/datagram.c | 8 ++++++--
net/core/dev.c | 25 +++++++++++--------------
net/sctp/socket.c | 9 ++++++---
4 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index b82d6ba70a14..c55760f4820f 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -74,7 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
return time_after(now, end_time);
}

-bool sk_busy_loop(struct sock *sk, int nonblock);
+void sk_busy_loop(struct sock *sk, int nonblock);

#else /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long net_busy_loop_on(void)
@@ -97,9 +97,8 @@ static inline bool busy_loop_timeout(unsigned long end_time)
return true;
}

-static inline bool sk_busy_loop(struct sock *sk, int nonblock)
+static inline void sk_busy_loop(struct sock *sk, int nonblock)
{
- return false;
}

#endif /* CONFIG_NET_RX_BUSY_POLL */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ea633342ab0d..4608aa245410 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
}

spin_unlock_irqrestore(&queue->lock, cpu_flags);
- } while (sk_can_busy_loop(sk) &&
- sk_busy_loop(sk, flags & MSG_DONTWAIT));
+
+ if (!sk_can_busy_loop(sk))
+ break;
+
+ sk_busy_loop(sk, flags & MSG_DONTWAIT);
+ } while (!skb_queue_empty(&sk->sk_receive_queue));

error = -EAGAIN;

diff --git a/net/core/dev.c b/net/core/dev.c
index ab337bf5bbf4..af70eb6ba682 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5060,21 +5060,19 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
do_softirq();
}

-bool sk_busy_loop(struct sock *sk, int nonblock)
+void sk_busy_loop(struct sock *sk, int nonblock)
{
unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
unsigned int napi_id;
- int rc;

restart:
napi_id = READ_ONCE(sk->sk_napi_id);
if (napi_id < MIN_NAPI_ID)
- return 0;
+ return;

- rc = false;
napi_poll = NULL;

rcu_read_lock();
@@ -5085,7 +5083,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)

preempt_disable();
for (;;) {
- rc = 0;
+ int work = 0;
+
local_bh_disable();
if (!napi_poll) {
unsigned long val = READ_ONCE(napi->state);
@@ -5103,12 +5102,12 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
have_poll_lock = netpoll_poll_lock(napi);
napi_poll = napi->poll;
}
- rc = napi_poll(napi, BUSY_POLL_BUDGET);
- trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
+ work = napi_poll(napi, BUSY_POLL_BUDGET);
+ trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
count:
- if (rc > 0)
+ if (work > 0)
__NET_ADD_STATS(sock_net(sk),
- LINUX_MIB_BUSYPOLLRXPACKETS, rc);
+ LINUX_MIB_BUSYPOLLRXPACKETS, work);
local_bh_enable();

if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
@@ -5121,9 +5120,9 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
preempt_enable();
rcu_read_unlock();
cond_resched();
- rc = !skb_queue_empty(&sk->sk_receive_queue);
- if (rc || busy_loop_timeout(end_time))
- return rc;
+ if (!skb_queue_empty(&sk->sk_receive_queue) ||
+ busy_loop_timeout(end_time))
+ return;
goto restart;
}
cpu_relax();
@@ -5131,10 +5130,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
if (napi_poll)
busy_poll_stop(napi, have_poll_lock);
preempt_enable();
- rc = !skb_queue_empty(&sk->sk_receive_queue);
out:
rcu_read_unlock();
- return rc;
}
EXPORT_SYMBOL(sk_busy_loop);

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 72cc3ecf6516..ccc08fc39722 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -7518,9 +7518,12 @@ struct sk_buff *sctp_skb_recv_datagram(struct sock *sk, int flags,
if (sk->sk_shutdown & RCV_SHUTDOWN)
break;

- if (sk_can_busy_loop(sk) &&
- sk_busy_loop(sk, noblock))
- continue;
+ if (sk_can_busy_loop(sk)) {
+ sk_busy_loop(sk, noblock);
+
+ if (!skb_queue_empty(&sk->sk_receive_queue))
+ continue;
+ }

/* User doesn't want to wait. */
error = -EAGAIN;

2017-03-25 02:23:45

by David Miller

[permalink] [raw]
Subject: Re: [net-next PATCH v3 0/8] Add busy poll support for epoll

From: Alexander Duyck <[email protected]>
Date: Fri, 24 Mar 2017 10:07:47 -0700

> v3: Split off the code for limiting busy_poll and busy_read into a separate
> patch for net.
> Updated patch that changed busy loop time tracking so that it uses
> "local_clock() >> 10" as we originally did.
> Tweaked "Change return type.." patch by moving declaration of "work"
> inside the loop where it was accessed and always reset to 0.
> Added "Acked-by" for patches that received acks.

Eric, I believe Alexander has addressed your concerns. Could you please
review patches 5 and 7 and ACK if they look ok to you?

Thanks.

2017-03-25 03:33:49

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next PATCH v3 7/8] epoll: Add busy poll support to epoll with socket fds.

On Fri, 2017-03-24 at 10:08 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala <[email protected]>
>
> This patch adds busy poll support to epoll. The implementation is meant to
> be opportunistic in that it will take the NAPI ID from the last socket
> that is added to the ready list that contains a valid NAPI ID and it will
> use that for busy polling until the ready list goes empty. Once the ready
> list goes empty the NAPI ID is reset and busy polling is disabled until a
> new socket is added to the ready list.
>
> In addition when we insert a new socket into the epoll we record the NAPI
> ID and assume we are going to receive events on it. If that doesn't occur
> it will be evicted as the active NAPI ID and we will resume normal
> behavior.
>
> An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
> options to spread the incoming connections to specific worker threads
> based on the incoming queue. This enables epoll for each worker thread
> to have only sockets that receive packets from a single queue. So when an
> application calls epoll_wait() and there are no events available to report,
> busy polling is done on the associated queue to pull the packets.
>
> Signed-off-by: Sridhar Samudrala <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> fs/eventpoll.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 93 insertions(+)

Acked-by: Eric Dumazet <[email protected]>


2017-03-25 03:34:31

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next PATCH v3 5/8] net: Track start of busy loop instead of when it should end

On Fri, 2017-03-24 at 10:08 -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> This patch flips the logic we were using to determine if the busy polling
> has timed out. The main motivation for this is that we will need to
> support two different possible timeout values in the future and by
> recording the start time rather than when we would want to end we can focus
> on making the end_time specific to the task be it epoll or socket based
> polling.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---


Acked-by: Eric Dumazet <[email protected]>

Thanks guys !


2017-03-25 03:50:02

by David Miller

[permalink] [raw]
Subject: Re: [net-next PATCH v3 0/8] Add busy poll support for epoll

From: Alexander Duyck <[email protected]>
Date: Fri, 24 Mar 2017 10:07:47 -0700

> This patch set adds support for using busy polling with epoll.

Series applied, thanks!

2019-03-20 18:38:34

by Christoph Paasch

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

Hello,

On Fri, Mar 24, 2017 at 3:23 PM Alexander Duyck
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
>
> From what I can tell there are only a couple of spots where we actually
> check the return value of sk_busy_loop. As there are only a few
> consumers of that data, and the data being checked for can be replaced
> with a check for !skb_queue_empty() we might as well just pull the code
> out of sk_busy_loop and place it in the spots that actually need it.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> Acked-by: Eric Dumazet <[email protected]>
> ---
> [...]
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index ea633342ab0d..4608aa245410 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> }
>
> spin_unlock_irqrestore(&queue->lock, cpu_flags);
> - } while (sk_can_busy_loop(sk) &&
> - sk_busy_loop(sk, flags & MSG_DONTWAIT));
> +
> + if (!sk_can_busy_loop(sk))
> + break;
> +
> + sk_busy_loop(sk, flags & MSG_DONTWAIT);
> + } while (!skb_queue_empty(&sk->sk_receive_queue));

since this change I am hitting stalls where it's looping in this
while-loop with syzkaller.

It worked prior to this change because sk->sk_napi_id was not set, so
sk_busy_loop would make us get out of the loop.

Now, it keeps on looping because there is an skb in the queue with
skb->len == 0 and we are peeking with an offset, thus
__skb_try_recv_from_queue will return NULL and thus we have no way of
getting out of the loop.

I'm not sure what would be the best way to fix it. I don't see why we
end up with an skb in the list with skb->len == 0. So, shooting a
quick e-mail, maybe somebody has an idea :-)

I have the syzkaller-reproducer if needed.

Thanks,
Christoph




2019-03-20 19:41:54

by David Miller

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

From: Christoph Paasch <[email protected]>
Date: Wed, 20 Mar 2019 11:35:31 -0700

> On Fri, Mar 24, 2017 at 3:23 PM Alexander Duyck
> <[email protected]> wrote:
>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>> index ea633342ab0d..4608aa245410 100644
>> --- a/net/core/datagram.c
>> +++ b/net/core/datagram.c
>> @@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
>> }
>>
>> spin_unlock_irqrestore(&queue->lock, cpu_flags);
>> - } while (sk_can_busy_loop(sk) &&
>> - sk_busy_loop(sk, flags & MSG_DONTWAIT));
>> +
>> + if (!sk_can_busy_loop(sk))
>> + break;
>> +
>> + sk_busy_loop(sk, flags & MSG_DONTWAIT);
>> + } while (!skb_queue_empty(&sk->sk_receive_queue));
>
> since this change I am hitting stalls where it's looping in this
> while-loop with syzkaller.
>
> It worked prior to this change because sk->sk_napi_id was not set thus
> sk_busy_loop would make us get out of the loop.
>
> Now, it keeps on looping because there is an skb in the queue with
> skb->len == 0 and we are peeking with an offset, thus
> __skb_try_recv_from_queue will return NULL and thus we have no way of
> getting out of the loop.
>
> I'm not sure what would be the best way to fix it. I don't see why we
> end up with an skb in the list with skb->len == 0. So, shooting a
> quick e-mail, maybe somebody has an idea :-)
>
> I have the syzkaller-reproducer if needed.

Just for the record, __skb_try_recv_datagram() and its friend
__skb_try_recv_from_queue() are my least favorite functions in the
entire tree for the past year or so.

Their current design, and how they assume things about the
implementation of SKB queues, together result in all the weird
problems we keep fixing in them.

There has to be a much better way to do this.


2019-03-21 09:47:04

by Paolo Abeni

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

Hi,

On Wed, 2019-03-20 at 11:35 -0700, Christoph Paasch wrote:
> Hello,
>
> On Fri, Mar 24, 2017 at 3:23 PM Alexander Duyck
> <[email protected]> wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > > From what I can tell there is only a couple spots where we are actually
> > checking the return value of sk_busy_loop. As there are only a few
> > consumers of that data, and the data being checked for can be replaced
> > with a check for !skb_queue_empty() we might as well just pull the code
> > out of sk_busy_loop and place it in the spots that actually need it.
> >
> > > [...]
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index ea633342ab0d..4608aa245410 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct
> > sock *sk, unsigned int flags,
> > }
> >
> > spin_unlock_irqrestore(&queue->lock, cpu_flags);
> > - } while (sk_can_busy_loop(sk) &&
> > - sk_busy_loop(sk, flags & MSG_DONTWAIT));
> > +
> > + if (!sk_can_busy_loop(sk))
> > + break;
> > +
> > + sk_busy_loop(sk, flags & MSG_DONTWAIT);
> > + } while (!skb_queue_empty(&sk->sk_receive_queue));
>
> since this change I am hitting stalls where it's looping in this
> while-loop with syzkaller.
>
> It worked prior to this change because sk->sk_napi_id was not set thus
> sk_busy_loop would make us get out of the loop.
>
> Now, it keeps on looping because there is an skb in the queue with
> skb->len == 0 and we are peeking with an offset, thus
> __skb_try_recv_from_queue will return NULL and thus we have no way of
> getting out of the loop.
>
> I'm not sure what would be the best way to fix it. I don't see why we
> end up with an skb in the list with skb->len == 0. So, shooting a
> quick e-mail, maybe somebody has an idea :-)
>
> I have the syzkaller-reproducer if needed.

IIRC we can have a 0-len UDP packet sitting on sk_receive_queue since:

commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac
Author: samanthakumar <[email protected]>
Date: Tue Apr 5 12:41:15 2016 -0400

udp: remove headers from UDP packets before queueing

Both __skb_try_recv_datagram() and napi_busy_loop() assume that we
received some packets if the queue is not empty. When peeking, that
assumption does not hold; we should instead check whether the last
packet has changed, as __skb_recv_datagram() already does. So I *think*
the root cause of this issue is older than Alex's patch.

The following - completely untested - should avoid the unbounded loop,
but it's not a complete fix, I *think* we should also change
sk_busy_loop_end() in a similar way, but that is a little more complex
due to the additional indirections.

Could you please test it?

Any feedback welcome!

Paolo
---
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b2651bb6d2a3..e657289db4ac 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -279,7 +279,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
break;

sk_busy_loop(sk, flags & MSG_DONTWAIT);
- } while (!skb_queue_empty(&sk->sk_receive_queue));
+ } while (sk->sk_receive_queue.prev != *last);

error = -EAGAIN;



2019-03-21 14:32:05

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

On Thu, Mar 21, 2019 at 5:46 AM Paolo Abeni <[email protected]> wrote:
>
> Hi,
>
> On Wed, 2019-03-20 at 11:35 -0700, Christoph Paasch wrote:
> > Hello,
> >
> > On Fri, Mar 24, 2017 at 3:23 PM Alexander Duyck
> > <[email protected]> wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > > From what I can tell there is only a couple spots where we are actually
> > > checking the return value of sk_busy_loop. As there are only a few
> > > consumers of that data, and the data being checked for can be replaced
> > > with a check for !skb_queue_empty() we might as well just pull the code
> > > out of sk_busy_loop and place it in the spots that actually need it.
> > >
> > > [...]
> > > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > > index ea633342ab0d..4608aa245410 100644
> > > --- a/net/core/datagram.c
> > > +++ b/net/core/datagram.c
> > > @@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> > > }
> > >
> > > spin_unlock_irqrestore(&queue->lock, cpu_flags);
> > > - } while (sk_can_busy_loop(sk) &&
> > > - sk_busy_loop(sk, flags & MSG_DONTWAIT));
> > > +
> > > + if (!sk_can_busy_loop(sk))
> > > + break;
> > > +
> > > + sk_busy_loop(sk, flags & MSG_DONTWAIT);
> > > + } while (!skb_queue_empty(&sk->sk_receive_queue));
> >
> > since this change I am hitting stalls where it's looping in this
> > while-loop with syzkaller.
> >
> > It worked prior to this change because sk->sk_napi_id was not set thus
> > sk_busy_loop would make us get out of the loop.
> >
> > Now, it keeps on looping because there is an skb in the queue with
> > skb->len == 0 and we are peeking with an offset, thus
> > __skb_try_recv_from_queue will return NULL and thus we have no way of
> > getting out of the loop.
> >
> > I'm not sure what would be the best way to fix it. I don't see why we
> > end up with an skb in the list with skb->len == 0. So, shooting a
> > quick e-mail, maybe somebody has an idea :-)
> > I have the syzkaller-reproducer if needed.
>
> IIRC we can have 0 len UDP packet sitting on sk_receive_queue since:

Yes, since headers are pulled before enqueue, zero-byte datagrams may be
queued. And these need to be delivered, among other reasons for their
cmsg metadata.

> commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac
> Author: samanthakumar <[email protected]>
> Date: Tue Apr 5 12:41:15 2016 -0400
>
> udp: remove headers from UDP packets before queueing
>
> Both __skb_try_recv_datagram() and napi_busy_loop() assume that we
> received some packets if the queue is not empty. When peeking such
> assumption is not true, we should check if the last packet is changed,
> as __skb_recv_datagram() already does.

Good catch. The condition in sk_busy_loop_end is not easy to address.
Since busy poll is an optimization and peek at offset is rare, one way
out may be to amend the __sk_can_busy_loop test in __skb_recv_udp to
disallow busy polling together with peek at offset.

The difference in behavior between __skb_try_recv_datagram and
__skb_recv_datagram also reminds of Alexei's earlier report (without
busy polling, seemingly with a list corruption introduced elsewhere)
in

[net-next,1/3] net/sock: factor out dequeue/peek with offset code
https://patchwork.ozlabs.org/patch/762327/



2019-03-21 16:45:25

by Alexander Duyck

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

On Thu, Mar 21, 2019 at 2:45 AM Paolo Abeni <[email protected]> wrote:
>
> Hi,
>
> On Wed, 2019-03-20 at 11:35 -0700, Christoph Paasch wrote:
> > Hello,
> >
> > On Fri, Mar 24, 2017 at 3:23 PM Alexander Duyck
> > <[email protected]> wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > [...]
> > > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > > index ea633342ab0d..4608aa245410 100644
> > > --- a/net/core/datagram.c
> > > +++ b/net/core/datagram.c
> > > @@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> > > }
> > >
> > > spin_unlock_irqrestore(&queue->lock, cpu_flags);
> > > - } while (sk_can_busy_loop(sk) &&
> > > - sk_busy_loop(sk, flags & MSG_DONTWAIT));
> > > +
> > > + if (!sk_can_busy_loop(sk))
> > > + break;
> > > +
> > > + sk_busy_loop(sk, flags & MSG_DONTWAIT);
> > > + } while (!skb_queue_empty(&sk->sk_receive_queue));
> >
> > since this change I am hitting stalls where it's looping in this
> > while-loop with syzkaller.
> >
> > It worked prior to this change because sk->sk_napi_id was not set thus
> > sk_busy_loop would make us get out of the loop.
> >
> > Now, it keeps on looping because there is an skb in the queue with
> > skb->len == 0 and we are peeking with an offset, thus
> > __skb_try_recv_from_queue will return NULL and thus we have no way of
> > getting out of the loop.
> >
> > I'm not sure what would be the best way to fix it. I don't see why we
> > end up with an skb in the list with skb->len == 0. So, shooting a
> > quick e-mail, maybe somebody has an idea :-)
> >
> > I have the syzkaller-reproducer if needed.
>
> IIRC we can have 0 len UDP packet sitting on sk_receive_queue since:
>
> commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac
> Author: samanthakumar <[email protected]>
> Date: Tue Apr 5 12:41:15 2016 -0400
>
> udp: remove headers from UDP packets before queueing
>
> Both __skb_try_recv_datagram() and napi_busy_loop() assume that we
> received some packets if the queue is not empty. When peeking such
> assumption is not true, we should check if the last packet is changed,
> as __skb_recv_datagram() already does. So I *think* the root cause of
> this issue is older than Alex's patch.

I agree.

> The following - completely untested - should avoid the unbounded loop,
> but it's not a complete fix, I *think* we should also change
> sk_busy_loop_end() in a similar way, but that is a little more complex
> due to the additional indirections.

As far as sk_busy_loop_end goes, we could look at just forking sk_busy_loop
and writing a separate implementation for datagram sockets that uses a
different loop_end function. It shouldn't take much to change since
all we would need to do is pass a structure containing the sk and last
pointers instead of just passing the sk directly as the loop_end
argument.

> Could you please test it?
>
> Any feedback welcome!

The change below looks good to me.

> Paolo
> ---
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index b2651bb6d2a3..e657289db4ac 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -279,7 +279,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> break;
>
> sk_busy_loop(sk, flags & MSG_DONTWAIT);
> - } while (!skb_queue_empty(&sk->sk_receive_queue));
> + } while (sk->sk_receive_queue.prev != *last);
>
> error = -EAGAIN;
>
>

2019-03-22 03:06:53

by Christoph Paasch

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

Hi,

On Thu, Mar 21, 2019 at 12:43 PM Alexander Duyck
<[email protected]> wrote:
>
> On Thu, Mar 21, 2019 at 2:45 AM Paolo Abeni <[email protected]> wrote:
> >
> > Hi,
> >
> > On Wed, 2019-03-20 at 11:35 -0700, Christoph Paasch wrote:
> > > Hello,
> > >
> > > On Fri, Mar 24, 2017 at 3:23 PM Alexander Duyck
> > > <[email protected]> wrote:
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > From what I can tell there are only a couple of spots where we are actually
> > > > checking the return value of sk_busy_loop. As there are only a few
> > > > consumers of that data, and the data being checked for can be replaced
> > > > with a check for !skb_queue_empty() we might as well just pull the code
> > > > out of sk_busy_loop and place it in the spots that actually need it.
> > > >
> > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > Acked-by: Eric Dumazet <[email protected]>
> > > > ---
> > > > include/net/busy_poll.h | 5 ++---
> > > > net/core/datagram.c | 8 ++++++--
> > > > net/core/dev.c | 25 +++++++++++--------------
> > > > net/sctp/socket.c | 9 ++++++---
> > > > 4 files changed, 25 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
> > > > index b82d6ba70a14..c55760f4820f 100644
> > > > --- a/include/net/busy_poll.h
> > > > +++ b/include/net/busy_poll.h
> > > > @@ -74,7 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
> > > > return time_after(now, end_time);
> > > > }
> > > >
> > > > -bool sk_busy_loop(struct sock *sk, int nonblock);
> > > > +void sk_busy_loop(struct sock *sk, int nonblock);
> > > >
> > > > #else /* CONFIG_NET_RX_BUSY_POLL */
> > > > static inline unsigned long net_busy_loop_on(void)
> > > > @@ -97,9 +97,8 @@ static inline bool busy_loop_timeout(unsigned long end_time)
> > > > return true;
> > > > }
> > > >
> > > > -static inline bool sk_busy_loop(struct sock *sk, int nonblock)
> > > > +static inline void sk_busy_loop(struct sock *sk, int nonblock)
> > > > {
> > > > - return false;
> > > > }
> > > >
> > > > #endif /* CONFIG_NET_RX_BUSY_POLL */
> > > > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > > > index ea633342ab0d..4608aa245410 100644
> > > > --- a/net/core/datagram.c
> > > > +++ b/net/core/datagram.c
> > > > @@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct
> > > > sock *sk, unsigned int flags,
> > > > }
> > > >
> > > > spin_unlock_irqrestore(&queue->lock, cpu_flags);
> > > > - } while (sk_can_busy_loop(sk) &&
> > > > - sk_busy_loop(sk, flags & MSG_DONTWAIT));
> > > > +
> > > > + if (!sk_can_busy_loop(sk))
> > > > + break;
> > > > +
> > > > + sk_busy_loop(sk, flags & MSG_DONTWAIT);
> > > > + } while (!skb_queue_empty(&sk->sk_receive_queue));
> > >
> > > since this change I am hitting stalls where it's looping in this
> > > while-loop with syzkaller.
> > >
> > > It worked prior to this change because sk->sk_napi_id was not set thus
> > > sk_busy_loop would make us get out of the loop.
> > >
> > > Now, it keeps on looping because there is an skb in the queue with
> > > skb->len == 0 and we are peeking with an offset, thus
> > > __skb_try_recv_from_queue will return NULL and thus we have no way of
> > > getting out of the loop.
> > >
> > > I'm not sure what would be the best way to fix it. I don't see why we
> > > end up with an skb in the list with skb->len == 0. So, shooting a
> > > quick e-mail, maybe somebody has an idea :-)
> > >
> > > I have the syzkaller-reproducer if needed.
> >
> > IIRC we can have 0 len UDP packet sitting on sk_receive_queue since:
> >
> > commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac
> > Author: samanthakumar <[email protected]>
> > Date: Tue Apr 5 12:41:15 2016 -0400
> >
> > udp: remove headers from UDP packets before queueing
> >
> > Both __skb_try_recv_datagram() and napi_busy_loop() assume that we
> > > received some packets if the queue is not empty. When peeking, such an
> > > assumption does not hold; we should check whether the last packet has changed,
> > as __skb_recv_datagram() already does. So I *think* the root cause of
> > this issue is older than Alex's patch.
>
> I agree.
>
> > The following - completely untested - should avoid the unbounded loop,
> > but it's not a complete fix, I *think* we should also change
> > sk_busy_loop_end() in a similar way, but that is a little more complex
> > due to the additional indirections.
>
> As far as sk_busy_loop_end we could look at just forking sk_busy_loop
> and writing a separate implementation for datagram sockets that uses a
> different loop_end function. It shouldn't take much to change since
> all we would need to do is pass a structure containing the sk and last
> pointers instead of just passing the sk directly as the loop_end
> argument.
>
> > Could you please test it?
> >
> > Any feedback welcome!
>
> The change below looks good to me.

I just tried it out. Worked for me!

You can add my Tested-by if you do a formal patch-submission:

Tested-by: Christoph Paasch <[email protected]>


Christoph

>
> > Could you please test it?
> >
> > Paolo
> > ---
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index b2651bb6d2a3..e657289db4ac 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -279,7 +279,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock
> > *sk, unsigned int flags,
> > break;
> >
> > sk_busy_loop(sk, flags & MSG_DONTWAIT);
> > - } while (!skb_queue_empty(&sk->sk_receive_queue));
> > + } while (sk->sk_receive_queue.prev != *last);
> >
> > error = -EAGAIN;
> >
> >

2019-03-22 10:35:37

by Paolo Abeni

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

Hi,

On Thu, 2019-03-21 at 23:05 -0400, Christoph Paasch wrote:
> On Thu, Mar 21, 2019 at 12:43 PM Alexander Duyck
> <[email protected]> wrote:
> > On Thu, Mar 21, 2019 at 2:45 AM Paolo Abeni <[email protected]> wrote:
> > > The following - completely untested - should avoid the unbounded loop,
> > > but it's not a complete fix, I *think* we should also change
> > > sk_busy_loop_end() in a similar way, but that is a little more complex
> > > due to the additional indirections.
> >
> > As far as sk_busy_loop_end we could look at just forking sk_busy_loop
> > and writing a separate implementation for datagram sockets that uses a
> > different loop_end function. It shouldn't take much to change since
> > all we would need to do is pass a structure containing the sk and last
> > pointers instead of just passing the sk directly as the loop_end
> > argument.
> >
> > > Could you please test it?
> > >
> > > Any feedback welcome!
> >
> > The change below looks good to me.
>
> I just tried it out. Worked for me!
>
> You can add my Tested-by if you do a formal patch-submission:
>
> Tested-by: Christoph Paasch <[email protected]>

Thanks for testing!

I'm trying to reproduce the issue locally, but so far I've been unable
to. I think that the current UDP implementation is not affected, as we
always ensure sk_receive_queue is empty before busy polling. Unix
sockets should not be affected either, as busy polling should not have
any effect there (sk_napi_id should never be >= MIN_NAPI_ID). Can you
reproduce the issue on an unpatched, recent, upstream kernel?

Can you please provide the syzkaller repro?

Thanks,

Paolo






2019-03-22 13:01:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

On Fri, Mar 22, 2019 at 3:33 AM Paolo Abeni <[email protected]> wrote:
>
> Hi,
>
> On Thu, 2019-03-21 at 23:05 -0400, Christoph Paasch wrote:
> > On Thu, Mar 21, 2019 at 12:43 PM Alexander Duyck
> > <[email protected]> wrote:
> > > On Thu, Mar 21, 2019 at 2:45 AM Paolo Abeni <[email protected]> wrote:
> > > > The following - completely untested - should avoid the unbounded loop,
> > > > but it's not a complete fix, I *think* we should also change
> > > > sk_busy_loop_end() in a similar way, but that is a little more complex
> > > > due to the additional indirections.
> > >
> > > As far as sk_busy_loop_end we could look at just forking sk_busy_loop
> > > and writing a separate implementation for datagram sockets that uses a
> > > different loop_end function. It shouldn't take much to change since
> > > all we would need to do is pass a structure containing the sk and last
> > > pointers instead of just passing the sk directly as the loop_end
> > > argument.
> > >
> > > > Could you please test it?
> > > >
> > > > Any feedback welcome!
> > >
> > > The change below looks good to me.
> >
> > I just tried it out. Worked for me!
> >
> > You can add my Tested-by if you do a formal patch-submission:
> >
> > Tested-by: Christoph Paasch <[email protected]>
>
> Thanks for testing!
>
> I'm trying to reproduce the issue locally, but I'm unable. I think that
> the current UDP implementation is not affected, as we always ensure
> sk_receive_queue is empty before busy polling.

But right after the check is done we release the queue lock, so a packet
might arrive just after the test.

> Unix sockets should not
> be affected, too, as busy polling should not have any effect there
> (sk_napi_id should be never >= MIN_NAPI_ID). Can you reproduce the
> issue on an unpatched, recent, upstream kernel?



>
> Can you please provide the syzkaller repro?
>
> Thanks,
>
> Paolo
>
>
>
>
>

2019-03-22 13:36:34

by Paolo Abeni

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

On Fri, 2019-03-22 at 05:59 -0700, Eric Dumazet wrote:
> On Fri, Mar 22, 2019 at 3:33 AM Paolo Abeni <[email protected]> wrote:
> > Hi,
> >
> > On Thu, 2019-03-21 at 23:05 -0400, Christoph Paasch wrote:
> > > On Thu, Mar 21, 2019 at 12:43 PM Alexander Duyck
> > > <[email protected]> wrote:
> > > > On Thu, Mar 21, 2019 at 2:45 AM Paolo Abeni <[email protected]> wrote:
> > > > > The following - completely untested - should avoid the unbounded loop,
> > > > > but it's not a complete fix, I *think* we should also change
> > > > > sk_busy_loop_end() in a similar way, but that is a little more complex
> > > > > due to the additional indirections.
> > > >
> > > > As far as sk_busy_loop_end we could look at just forking sk_busy_loop
> > > > and writing a separate implementation for datagram sockets that uses a
> > > > different loop_end function. It shouldn't take much to change since
> > > > all we would need to do is pass a structure containing the sk and last
> > > > pointers instead of just passing the sk directly as the loop_end
> > > > argument.
> > > >
> > > > > Could you please test it?
> > > > >
> > > > > Any feedback welcome!
> > > >
> > > > The change below looks good to me.
> > >
> > > I just tried it out. Worked for me!
> > >
> > > You can add my Tested-by if you do a formal patch-submission:
> > >
> > > Tested-by: Christoph Paasch <[email protected]>
> >
> > Thanks for testing!
> >
> > I'm trying to reproduce the issue locally, but I'm unable. I think that
> > the current UDP implementation is not affected, as we always ensure
> > sk_receive_queue is empty before busy polling.
>
> But right after check is done we release the queue lock, so a packet might
> come right after the test has been done.

Yep, I was unclear and incorrect. My point is: with the current UDP
implementation, if we have a non-empty sk_receive_queue after the busy
loop, it always means there are more packets to be processed and we
should loop again, as we currently do.

AFAICS, that is different from the reported issue, where the system
stalls looping on sk_receive_queue while no new packets are appended
there.

Cheers,

Paolo




2019-03-22 19:27:16

by Christoph Paasch

[permalink] [raw]
Subject: Re: [net-next PATCH v3 4/8] net: Change return type of sk_busy_loop from bool to void

Hi Paolo,

On Fri, Mar 22, 2019 at 6:33 AM Paolo Abeni <[email protected]> wrote:
> On Thu, 2019-03-21 at 23:05 -0400, Christoph Paasch wrote:
> > On Thu, Mar 21, 2019 at 12:43 PM Alexander Duyck
> > <[email protected]> wrote:
> > > On Thu, Mar 21, 2019 at 2:45 AM Paolo Abeni <[email protected]> wrote:
> > > > The following - completely untested - should avoid the unbounded loop,
> > > > but it's not a complete fix, I *think* we should also change
> > > > sk_busy_loop_end() in a similar way, but that is a little more complex
> > > > due to the additional indirections.
> > >
> > > As far as sk_busy_loop_end we could look at just forking sk_busy_loop
> > > and writing a separate implementation for datagram sockets that uses a
> > > different loop_end function. It shouldn't take much to change since
> > > all we would need to do is pass a structure containing the sk and last
> > > pointers instead of just passing the sk directly as the loop_end
> > > argument.
> > >
> > > > Could you please test it?
> > > >
> > > > Any feedback welcome!
> > >
> > > The change below looks good to me.
> >
> > I just tried it out. Worked for me!
> >
> > You can add my Tested-by if you do a formal patch-submission:
> >
> > Tested-by: Christoph Paasch <[email protected]>
>
> Thanks for testing!
>
> I'm trying to reproduce the issue locally, but I'm unable. I think that
> the current UDP implementation is not affected, as we always ensure
> sk_receive_queue is empty before busy polling. Unix sockets should not
> be affected, too, as busy polling should not have any effect there
> (sk_napi_id should be never >= MIN_NAPI_ID). Can you reproduce the
> issue on an unpatched, recent, upstream kernel?

Yes, I can reproduce it reliably with the C reproducer on the latest
net branch and on v4.14.105, and I also bisected it down to this commit.

> Can you please provide the syzkaller repro?

Sure, here is the full report:

Syzkaller hit 'INFO: rcu detected stall in unix_seqpacket_recvmsg' bug.

INFO: rcu_sched self-detected stall on CPU
1-...: (19373 ticks this GP) idle=ac6/140000000000001/0 softirq=4477/4477 fqs=4
(t=21000 jiffies g=1853 c=1852 q=16)
rcu_sched kthread starved for 20984 jiffies! g1853 c1852 f0x0
RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
rcu_sched R running task 21040 8 2 0x80000000
Call Trace:
schedule+0xee/0x3a0 kernel/sched/core.c:3427
schedule_timeout+0x158/0x260 kernel/time/timer.c:1744
rcu_gp_kthread+0x12fc/0x3580 kernel/rcu/tree.c:2247
kthread+0x355/0x430 kernel/kthread.c:232
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:402
NMI backtrace for cpu 1
CPU: 1 PID: 1956 Comm: syz-executor808 Not tainted 4.14.105 #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x10a/0x1d1 lib/dump_stack.c:53
nmi_cpu_backtrace+0xf2/0x110 lib/nmi_backtrace.c:101
nmi_trigger_cpumask_backtrace+0x116/0x170 lib/nmi_backtrace.c:62
trigger_single_cpu_backtrace include/linux/nmi.h:158 [inline]
rcu_dump_cpu_stacks+0x180/0x1dd kernel/rcu/tree.c:1396
print_cpu_stall kernel/rcu/tree.c:1542 [inline]
check_cpu_stall.isra.69+0x9f1/0x1080 kernel/rcu/tree.c:1610
__rcu_pending kernel/rcu/tree.c:3382 [inline]
rcu_pending kernel/rcu/tree.c:3444 [inline]
rcu_check_callbacks+0x380/0xc90 kernel/rcu/tree.c:2784
update_process_times+0x28/0x60 kernel/time/timer.c:1588
tick_sched_handle+0x7d/0x150 kernel/time/tick-sched.c:161
tick_sched_timer+0x3d/0x110 kernel/time/tick-sched.c:1219
__run_hrtimer kernel/time/hrtimer.c:1220 [inline]
__hrtimer_run_queues+0x3d8/0xb80 kernel/time/hrtimer.c:1284
hrtimer_interrupt+0x1b0/0x590 kernel/time/hrtimer.c:1318
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline]
smp_apic_timer_interrupt+0x1d5/0x5b0 arch/x86/kernel/apic/apic.c:1062
apic_timer_interrupt+0x87/0x90 arch/x86/entry/entry_64.S:787
</IRQ>
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:778 [inline]
RIP: 0010:__raw_spin_unlock_irqrestore
include/linux/spinlock_api_smp.h:160 [inline]
RIP: 0010:_raw_spin_unlock_irqrestore+0x32/0x60 kernel/locking/spinlock.c:192
RSP: 0018:ffff888065b273b0 EFLAGS: 00000297 ORIG_RAX: ffffffffffffff10
RAX: 0000000000000007 RBX: 0000000000000297 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000297
RBP: ffff88806a2447a0 R08: 1ffff1100cb64e57 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888065b27450
R13: ffffed100cb64e92 R14: 0000000000000000 R15: ffff88806a244788
spin_unlock_irqrestore include/linux/spinlock.h:372 [inline]
__skb_try_recv_datagram+0x299/0x4b0 net/core/datagram.c:274
unix_dgram_recvmsg+0x294/0x1840 net/unix/af_unix.c:2107
unix_seqpacket_recvmsg+0x82/0xb0 net/unix/af_unix.c:2073
sock_recvmsg_nosec net/socket.c:818 [inline]
sock_recvmsg+0xc4/0x110 net/socket.c:825
___sys_recvmsg+0x2a7/0x620 net/socket.c:2220
__sys_recvmsg+0xc6/0x200 net/socket.c:2265
SYSC_recvmsg net/socket.c:2277 [inline]
SyS_recvmsg+0x27/0x40 net/socket.c:2272
do_syscall_64+0x23f/0x6d0 arch/x86/entry/common.c:289
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f4a53385469
RSP: 002b:00007f4a53a76f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
RAX: ffffffffffffffda RBX: 0000000000603168 RCX: 00007f4a53385469
RDX: 0000000040000002 RSI: 0000000020001680 RDI: 0000000000000003
RBP: 0000000000603160 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000060316c
R13: 0800000000008923 R14: 00007f4a53a57000 R15: 0000000000000003


Syzkaller reproducer:
# {Threaded:true Collide:false Repeat:true RepeatTimes:0 Procs:8
Sandbox:none Fault:false FaultCall:-1 FaultNth:0 EnableTun:false
UseTmpDir:false EnableCgroups:false EnableNetdev:false ResetNet:false
HandleSegv:false Repro:false Trace:false}
socketpair$unix(0x1, 0x1000000000005, 0x0,
&(0x7f0000000340)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
setsockopt$sock_int(r0, 0x1, 0x2a, &(0x7f00000000c0)=0x9, 0x4)
recvmsg(r0, &(0x7f0000001680)={0x0, 0x0, 0x0}, 0x40000002)
setsockopt$sock_int(r0, 0x1, 0x2e, &(0x7f00000003c0)=0x9, 0x4)
sendmsg(r1, &(0x7f0000000680)={0x0, 0x0, 0x0}, 0x0)
connect$inet6(0xffffffffffffffff, 0x0, 0x0)
socketpair(0xa, 0x800, 0x4, 0x0)
fcntl$getflags(0xffffffffffffffff, 0x40b)
sendto$inet6(0xffffffffffffffff, 0x0, 0x0, 0x24000855, 0x0, 0x0)
getsockopt$inet_IP_XFRM_POLICY(0xffffffffffffff9c, 0x0, 0x11, 0x0, 0x0)
socket$nl_xfrm(0x10, 0x3, 0x6)
setsockopt$netlink_NETLINK_BROADCAST_ERROR(0xffffffffffffffff, 0x10e,
0x4, 0x0, 0x0)
socket$unix(0x1, 0x3, 0x0)
getsockopt$sock_int(0xffffffffffffffff, 0x1, 0x1e, 0x0, 0x0)
ioctl$sock_SIOCGSKNS(0xffffffffffffffff, 0x894c, 0x0)
ioctl$sock_inet_udp_SIOCOUTQ(0xffffffffffffffff, 0x5411, 0x0)
setsockopt$IP_VS_SO_SET_EDIT(0xffffffffffffffff, 0x0, 0x483, 0x0, 0x0)
setsockopt$inet6_IPV6_XFRM_POLICY(0xffffffffffffffff, 0x29, 0x23, 0x0, 0x0)
ioctl$sock_ifreq(0xffffffffffffffff, 0x800000000008923, 0x0)
bind(0xffffffffffffffff, 0x0, 0x0)
socket(0xa, 0x3, 0x3)


C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include <dirent.h>
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#include <linux/futex.h>

unsigned long long procid;

static void sleep_ms(uint64_t ms)
{
usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
struct timespec ts;
if (clock_gettime(CLOCK_MONOTONIC, &ts))
exit(1);
return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static void thread_start(void* (*fn)(void*), void* arg)
{
pthread_t th;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 128 << 10);
int i;
for (i = 0; i < 100; i++) {
if (pthread_create(&th, &attr, fn, arg) == 0) {
pthread_attr_destroy(&attr);
return;
}
if (errno == EAGAIN) {
usleep(50);
continue;
}
break;
}
exit(1);
}

typedef struct {
int state;
} event_t;

static void event_init(event_t* ev)
{
ev->state = 0;
}

static void event_reset(event_t* ev)
{
ev->state = 0;
}

static void event_set(event_t* ev)
{
if (ev->state)
exit(1);
__atomic_store_n(&ev->state, 1, __ATOMIC_RELEASE);
syscall(SYS_futex, &ev->state, FUTEX_WAKE | FUTEX_PRIVATE_FLAG);
}

static void event_wait(event_t* ev)
{
while (!__atomic_load_n(&ev->state, __ATOMIC_ACQUIRE))
syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, 0);
}

static int event_isset(event_t* ev)
{
return __atomic_load_n(&ev->state, __ATOMIC_ACQUIRE);
}

static int event_timedwait(event_t* ev, uint64_t timeout)
{
uint64_t start = current_time_ms();
uint64_t now = start;
for (;;) {
uint64_t remain = timeout - (now - start);
struct timespec ts;
ts.tv_sec = remain / 1000;
ts.tv_nsec = (remain % 1000) * 1000 * 1000;
syscall(SYS_futex, &ev->state, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 0, &ts);
if (__atomic_load_n(&ev->state, __ATOMIC_RELAXED))
return 1;
now = current_time_ms();
if (now - start > timeout)
return 0;
}
}

static bool write_file(const char* file, const char* what, ...)
{
char buf[1024];
va_list args;
va_start(args, what);
vsnprintf(buf, sizeof(buf), what, args);
va_end(args);
buf[sizeof(buf) - 1] = 0;
int len = strlen(buf);
int fd = open(file, O_WRONLY | O_CLOEXEC);
if (fd == -1)
return false;
if (write(fd, buf, len) != len) {
int err = errno;
close(fd);
errno = err;
return false;
}
close(fd);
return true;
}

static void setup_common()
{
if (mount(0, "/sys/fs/fuse/connections", "fusectl", 0, 0)) {
}
}

static void loop();

static void sandbox_common()
{
prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
setpgrp();
setsid();
struct rlimit rlim;
rlim.rlim_cur = rlim.rlim_max = (200 << 20);
setrlimit(RLIMIT_AS, &rlim);
rlim.rlim_cur = rlim.rlim_max = 32 << 20;
setrlimit(RLIMIT_MEMLOCK, &rlim);
rlim.rlim_cur = rlim.rlim_max = 136 << 20;
setrlimit(RLIMIT_FSIZE, &rlim);
rlim.rlim_cur = rlim.rlim_max = 1 << 20;
setrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur = rlim.rlim_max = 0;
setrlimit(RLIMIT_CORE, &rlim);
rlim.rlim_cur = rlim.rlim_max = 256;
setrlimit(RLIMIT_NOFILE, &rlim);
if (unshare(CLONE_NEWNS)) {
}
if (unshare(CLONE_NEWIPC)) {
}
if (unshare(0x02000000)) {
}
if (unshare(CLONE_NEWUTS)) {
}
if (unshare(CLONE_SYSVSEM)) {
}
typedef struct {
const char* name;
const char* value;
} sysctl_t;
static const sysctl_t sysctls[] = {
{"/proc/sys/kernel/shmmax", "16777216"},
{"/proc/sys/kernel/shmall", "536870912"},
{"/proc/sys/kernel/shmmni", "1024"},
{"/proc/sys/kernel/msgmax", "8192"},
{"/proc/sys/kernel/msgmni", "1024"},
{"/proc/sys/kernel/msgmnb", "1024"},
{"/proc/sys/kernel/sem", "1024 1048576 500 1024"},
};
unsigned i;
for (i = 0; i < sizeof(sysctls) / sizeof(sysctls[0]); i++)
write_file(sysctls[i].name, sysctls[i].value);
}

int wait_for_loop(int pid)
{
if (pid < 0)
exit(1);
int status = 0;
while (waitpid(-1, &status, __WALL) != pid) {
}
return WEXITSTATUS(status);
}

static int do_sandbox_none(void)
{
if (unshare(CLONE_NEWPID)) {
}
int pid = fork();
if (pid != 0)
return wait_for_loop(pid);
setup_common();
sandbox_common();
if (unshare(CLONE_NEWNET)) {
}
loop();
exit(1);
}

static void kill_and_wait(int pid, int* status)
{
kill(-pid, SIGKILL);
kill(pid, SIGKILL);
int i;
for (i = 0; i < 100; i++) {
if (waitpid(-1, status, WNOHANG | __WALL) == pid)
return;
usleep(1000);
}
DIR* dir = opendir("/sys/fs/fuse/connections");
if (dir) {
for (;;) {
struct dirent* ent = readdir(dir);
if (!ent)
break;
if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
continue;
char abort[300];
snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
ent->d_name);
int fd = open(abort, O_WRONLY);
if (fd == -1) {
continue;
}
if (write(fd, abort, 1) < 0) {
}
close(fd);
}
closedir(dir);
} else {
}
while (waitpid(-1, status, __WALL) != pid) {
}
}

#define SYZ_HAVE_SETUP_TEST 1
static void setup_test()
{
prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
setpgrp();
}

#define SYZ_HAVE_RESET_TEST 1
static void reset_test()
{
int fd;
for (fd = 3; fd < 30; fd++)
close(fd);
}

struct thread_t {
int created, call;
event_t ready, done;
};

static struct thread_t threads[16];
static void execute_call(int call);
static int running;

static void* thr(void* arg)
{
struct thread_t* th = (struct thread_t*)arg;
for (;;) {
event_wait(&th->ready);
event_reset(&th->ready);
execute_call(th->call);
__atomic_fetch_sub(&running, 1, __ATOMIC_RELAXED);
event_set(&th->done);
}
return 0;
}

static void execute_one(void)
{
int i, call, thread;
for (call = 0; call < 21; call++) {
for (thread = 0; thread < (int)(sizeof(threads) / sizeof(threads[0]));
thread++) {
struct thread_t* th = &threads[thread];
if (!th->created) {
th->created = 1;
event_init(&th->ready);
event_init(&th->done);
event_set(&th->done);
thread_start(thr, th);
}
if (!event_isset(&th->done))
continue;
event_reset(&th->done);
th->call = call;
__atomic_fetch_add(&running, 1, __ATOMIC_RELAXED);
event_set(&th->ready);
event_timedwait(&th->done, 45);
break;
}
}
for (i = 0; i < 100 && __atomic_load_n(&running, __ATOMIC_RELAXED); i++)
sleep_ms(1);
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
int iter;
for (iter = 0;; iter++) {
int pid = fork();
if (pid < 0)
exit(1);
if (pid == 0) {
setup_test();
execute_one();
reset_test();
exit(0);
}
int status = 0;
uint64_t start = current_time_ms();
for (;;) {
if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
break;
sleep_ms(1);
if (current_time_ms() - start < 5 * 1000)
continue;
kill_and_wait(pid, &status);
break;
}
}
}

uint64_t r[2] = {0xffffffffffffffff, 0xffffffffffffffff};

void execute_call(int call)
{
long res;
switch (call) {
case 0:
res = syscall(__NR_socketpair, 1, 0x1000000000005, 0, 0x20000340);
if (res != -1) {
r[0] = *(uint32_t*)0x20000340;
r[1] = *(uint32_t*)0x20000344;
}
break;
case 1:
*(uint32_t*)0x200000c0 = 9;
syscall(__NR_setsockopt, r[0], 1, 0x2a, 0x200000c0, 4);
break;
case 2:
*(uint64_t*)0x20001680 = 0;
*(uint32_t*)0x20001688 = 0;
*(uint64_t*)0x20001690 = 0;
*(uint64_t*)0x20001698 = 0;
*(uint64_t*)0x200016a0 = 0;
*(uint64_t*)0x200016a8 = 0;
*(uint32_t*)0x200016b0 = 0;
syscall(__NR_recvmsg, r[0], 0x20001680, 0x40000002);
break;
case 3:
*(uint32_t*)0x200003c0 = 9;
syscall(__NR_setsockopt, r[0], 1, 0x2e, 0x200003c0, 4);
break;
case 4:
*(uint64_t*)0x20000680 = 0;
*(uint32_t*)0x20000688 = 0;
*(uint64_t*)0x20000690 = 0;
*(uint64_t*)0x20000698 = 0;
*(uint64_t*)0x200006a0 = 0;
*(uint64_t*)0x200006a8 = 0;
*(uint32_t*)0x200006b0 = 0;
syscall(__NR_sendmsg, r[1], 0x20000680, 0);
break;
case 5:
syscall(__NR_connect, -1, 0, 0);
break;
case 6:
syscall(__NR_socketpair, 0xa, 0x800, 4, 0);
break;
case 7:
syscall(__NR_fcntl, -1, 0x40b, 0);
break;
case 8:
syscall(__NR_sendto, -1, 0, 0, 0x24000855, 0, 0);
break;
case 9:
syscall(__NR_getsockopt, 0xffffff9c, 0, 0x11, 0, 0);
break;
case 10:
syscall(__NR_socket, 0x10, 3, 6);
break;
case 11:
syscall(__NR_setsockopt, -1, 0x10e, 4, 0, 0);
break;
case 12:
syscall(__NR_socket, 1, 3, 0);
break;
case 13:
syscall(__NR_getsockopt, -1, 1, 0x1e, 0, 0);
break;
case 14:
syscall(__NR_ioctl, -1, 0x894c, 0);
break;
case 15:
syscall(__NR_ioctl, -1, 0x5411, 0);
break;
case 16:
syscall(__NR_setsockopt, -1, 0, 0x483, 0, 0);
break;
case 17:
syscall(__NR_setsockopt, -1, 0x29, 0x23, 0, 0);
break;
case 18:
syscall(__NR_ioctl, -1, 0x800000000008923, 0);
break;
case 19:
syscall(__NR_bind, -1, 0, 0);
break;
case 20:
syscall(__NR_socket, 0xa, 3, 3);
break;
}

}
int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
for (procid = 0; procid < 8; procid++) {
if (fork() == 0) {
do_sandbox_none();
}
}
sleep(1000000);
return 0;
}