2017-03-23 21:36:42

by Alexander Duyck

Subject: [net-next PATCH v2 0/8] Add busy poll support for epoll

This is my second pass at trying to add support for busy polling when using
epoll. It is pretty much a full rewrite as I have made serious changes to
most of the patches.

In the v1 series I submitted, we only allowed epoll to make use of busy
poll when all NAPI IDs were the same. I gave this some more thought and,
after making several other changes based on feedback from Eric Dumazet,
decided to change the main concept a bit: we now attempt to busy poll on
the NAPI ID of the last socket added to the ready list. By doing it this
way we can take advantage of the changes Eric has already made, so that we
get woken up by the softirq, pull packets via busy poll, and then return to
the softirq until we are woken up again because a new socket has been added
to the ready list.

Most of the changes in this set authored by me are cleanups or fixes for
various things. For example, I am trying to make it so that we don't
perform hash look-ups for the NAPI instance when we are only working with
sender_cpu and the like.

The most complicated change of the set is probably the clean-up of the
timeout handling. I realized that the timeout could get into a state where
it would never trigger: if local_clock() was approaching a rollover and the
added busy poll usecs interval was enough to roll it over, then, because we
were shifting the value, time_after() could never fire.

At the heart of this set are the last 3 patches, which enable epoll support
and add support for obtaining the NAPI ID of a given socket. With these it
becomes possible for an application to make use of epoll and get optimal
busy poll utilization by stacking multiple sockets with the same NAPI ID on
the same epoll context.

---

Alexander Duyck (5):
net: Busy polling should ignore sender CPUs
tcp: Record Rx hash and NAPI ID in tcp_child_process
net: Only define skb_mark_napi_id in one spot instead of two
net: Change return type of sk_busy_loop from bool to void
net: Track start of busy loop instead of when it should end

Sridhar Samudrala (3):
net: Commonize busy polling code to focus on napi_id instead of socket
epoll: Add busy poll support to epoll with socket fds.
net: Introduce SO_INCOMING_NAPI_ID


arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 1
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
fs/eventpoll.c | 93 +++++++++++++++++++++++++++
fs/select.c | 16 ++---
include/net/busy_poll.h | 112 ++++++++++++++++++++------------
include/uapi/asm-generic/socket.h | 2 +
net/core/datagram.c | 8 ++
net/core/dev.c | 42 ++++++------
net/core/sock.c | 23 +++++++
net/core/sysctl_net_core.c | 11 +++
net/ipv4/tcp_ipv4.c | 2 -
net/ipv4/tcp_minisocks.c | 4 +
net/ipv6/tcp_ipv6.c | 2 -
net/sctp/socket.c | 9 ++-
24 files changed, 264 insertions(+), 83 deletions(-)

--


2017-03-23 21:36:51

by Alexander Duyck

Subject: [net-next PATCH v2 1/8] net: Busy polling should ignore sender CPUs

From: Alexander Duyck <[email protected]>

This patch is a cleanup/fix for NAPI IDs following the changes that made it
so that sender_cpu and napi_id were doing a better job of sharing the same
location in the sk_buff.

One issue I found is that we weren't validating the napi_id before trying
to set up busy polling. This change corrects that by using the MIN_NAPI_ID
value, which is now used both when allocating NAPI IDs and when validating
them.

Signed-off-by: Alexander Duyck <[email protected]>
---
include/net/busy_poll.h | 9 +++++++--
net/core/dev.c | 13 +++++++++----
2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c0452de83086..3fcda9e70c3f 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -35,6 +35,12 @@
extern unsigned int sysctl_net_busy_read __read_mostly;
extern unsigned int sysctl_net_busy_poll __read_mostly;

+/* 0 - Reserved to indicate value not set
+ * 1..NR_CPUS - Reserved for sender_cpu
+ * NR_CPUS+1..~0 - Region available for NAPI IDs
+ */
+#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))
+
static inline bool net_busy_loop_on(void)
{
return sysctl_net_busy_poll;
@@ -58,10 +64,9 @@ static inline unsigned long busy_loop_end_time(void)

static inline bool sk_can_busy_loop(const struct sock *sk)
{
- return sk->sk_ll_usec && sk->sk_napi_id && !signal_pending(current);
+ return sk->sk_ll_usec && !signal_pending(current);
}

-
static inline bool busy_loop_timeout(unsigned long end_time)
{
unsigned long now = busy_loop_us_clock();
diff --git a/net/core/dev.c b/net/core/dev.c
index 7869ae3837ca..ab337bf5bbf4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5066,15 +5066,20 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
+ unsigned int napi_id;
int rc;

restart:
+ napi_id = READ_ONCE(sk->sk_napi_id);
+ if (napi_id < MIN_NAPI_ID)
+ return 0;
+
rc = false;
napi_poll = NULL;

rcu_read_lock();

- napi = napi_by_id(sk->sk_napi_id);
+ napi = napi_by_id(napi_id);
if (!napi)
goto out;

@@ -5143,10 +5148,10 @@ static void napi_hash_add(struct napi_struct *napi)

spin_lock(&napi_hash_lock);

- /* 0..NR_CPUS+1 range is reserved for sender_cpu use */
+ /* 0..NR_CPUS range is reserved for sender_cpu use */
do {
- if (unlikely(++napi_gen_id < NR_CPUS + 1))
- napi_gen_id = NR_CPUS + 1;
+ if (unlikely(++napi_gen_id < MIN_NAPI_ID))
+ napi_gen_id = MIN_NAPI_ID;
} while (napi_by_id(napi_gen_id));
napi->napi_id = napi_gen_id;


2017-03-23 21:36:56

by Alexander Duyck

Subject: [net-next PATCH v2 2/8] tcp: Record Rx hash and NAPI ID in tcp_child_process

From: Alexander Duyck <[email protected]>

While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set. In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.

In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.

Reported-by: Sridhar Samudrala <[email protected]>
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
net/ipv4/tcp_ipv4.c | 2 --
net/ipv4/tcp_minisocks.c | 4 ++++
net/ipv6/tcp_ipv6.c | 2 --
3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7482b5d11861..20cbd2f07f28 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1409,8 +1409,6 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
if (!nsk)
goto discard;
if (nsk != sk) {
- sock_rps_save_rxhash(nsk, skb);
- sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb)) {
rsk = nsk;
goto reset;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 692f974e5abe..40f125b19988 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -26,6 +26,7 @@
#include <net/tcp.h>
#include <net/inet_common.h>
#include <net/xfrm.h>
+#include <net/busy_poll.h>

int sysctl_tcp_abort_on_overflow __read_mostly;

@@ -798,6 +799,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
int ret = 0;
int state = child->sk_state;

+ /* record NAPI ID of child */
+ sk_mark_napi_id(child, skb);
+
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
ret = tcp_rcv_state_process(child, skb);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0f08d718a002..ee13e380c0dd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1293,8 +1293,6 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
goto discard;

if (nsk != sk) {
- sock_rps_save_rxhash(nsk, skb);
- sk_mark_napi_id(nsk, skb);
if (tcp_child_process(sk, nsk, skb))
goto reset;
if (opt_skb)

2017-03-23 21:37:34

by Alexander Duyck

Subject: [net-next PATCH v2 3/8] net: Only define skb_mark_napi_id in one spot instead of two

From: Alexander Duyck <[email protected]>

Instead of defining two versions of skb_mark_napi_id, I think it is more
readable to match the format of the sk_mark_napi_id functions and wrap the
contents of a single function in an ifdef. This way we save a few lines of
code, since we only need 2 lines for the ifdef/endif but needed 5 for the
extra function declaration.

Signed-off-by: Alexander Duyck <[email protected]>
---
include/net/busy_poll.h | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 3fcda9e70c3f..b82d6ba70a14 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -76,14 +76,6 @@ static inline bool busy_loop_timeout(unsigned long end_time)

bool sk_busy_loop(struct sock *sk, int nonblock);

-/* used in the NIC receive handler to mark the skb */
-static inline void skb_mark_napi_id(struct sk_buff *skb,
- struct napi_struct *napi)
-{
- skb->napi_id = napi->napi_id;
-}
-
-
#else /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long net_busy_loop_on(void)
{
@@ -100,11 +92,6 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
}

-static inline void skb_mark_napi_id(struct sk_buff *skb,
- struct napi_struct *napi)
-{
-}
-
static inline bool busy_loop_timeout(unsigned long end_time)
{
return true;
@@ -117,6 +104,15 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)

#endif /* CONFIG_NET_RX_BUSY_POLL */

+/* used in the NIC receive handler to mark the skb */
+static inline void skb_mark_napi_id(struct sk_buff *skb,
+ struct napi_struct *napi)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ skb->napi_id = napi->napi_id;
+#endif
+}
+
/* used in the protocol hanlder to propagate the napi_id to the socket */
static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
{

2017-03-23 21:37:49

by Alexander Duyck

Subject: [net-next PATCH v2 4/8] net: Change return type of sk_busy_loop from bool to void

From: Alexander Duyck <[email protected]>

From what I can tell there are only a couple of spots where we actually
check the return value of sk_busy_loop. As there are only a few consumers
of that data, and the data being checked for can be replaced with a check
for !skb_queue_empty(), we might as well just pull the code out of
sk_busy_loop and place it in the spots that actually need it.

Signed-off-by: Alexander Duyck <[email protected]>
---
include/net/busy_poll.h | 5 ++---
net/core/datagram.c | 8 ++++++--
net/core/dev.c | 26 ++++++++++++--------------
net/sctp/socket.c | 9 ++++++---
4 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index b82d6ba70a14..c55760f4820f 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -74,7 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
return time_after(now, end_time);
}

-bool sk_busy_loop(struct sock *sk, int nonblock);
+void sk_busy_loop(struct sock *sk, int nonblock);

#else /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long net_busy_loop_on(void)
@@ -97,9 +97,8 @@ static inline bool busy_loop_timeout(unsigned long end_time)
return true;
}

-static inline bool sk_busy_loop(struct sock *sk, int nonblock)
+static inline void sk_busy_loop(struct sock *sk, int nonblock)
{
- return false;
}

#endif /* CONFIG_NET_RX_BUSY_POLL */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ea633342ab0d..4608aa245410 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -256,8 +256,12 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
}

spin_unlock_irqrestore(&queue->lock, cpu_flags);
- } while (sk_can_busy_loop(sk) &&
- sk_busy_loop(sk, flags & MSG_DONTWAIT));
+
+ if (!sk_can_busy_loop(sk))
+ break;
+
+ sk_busy_loop(sk, flags & MSG_DONTWAIT);
+ } while (!skb_queue_empty(&sk->sk_receive_queue));

error = -EAGAIN;

diff --git a/net/core/dev.c b/net/core/dev.c
index ab337bf5bbf4..6b0458b5afe0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5060,21 +5060,21 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
do_softirq();
}

-bool sk_busy_loop(struct sock *sk, int nonblock)
+void sk_busy_loop(struct sock *sk, int nonblock)
{
unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
unsigned int napi_id;
- int rc;
+ int work;

restart:
napi_id = READ_ONCE(sk->sk_napi_id);
if (napi_id < MIN_NAPI_ID)
- return 0;
+ return;

- rc = false;
+ work = 0;
napi_poll = NULL;

rcu_read_lock();
@@ -5085,7 +5085,7 @@ bool sk_busy_loop(struct sock *sk, int nonblock)

preempt_disable();
for (;;) {
- rc = 0;
+ work = 0;
local_bh_disable();
if (!napi_poll) {
unsigned long val = READ_ONCE(napi->state);
@@ -5103,12 +5103,12 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
have_poll_lock = netpoll_poll_lock(napi);
napi_poll = napi->poll;
}
- rc = napi_poll(napi, BUSY_POLL_BUDGET);
- trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
+ work = napi_poll(napi, BUSY_POLL_BUDGET);
+ trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
count:
- if (rc > 0)
+ if (work > 0)
__NET_ADD_STATS(sock_net(sk),
- LINUX_MIB_BUSYPOLLRXPACKETS, rc);
+ LINUX_MIB_BUSYPOLLRXPACKETS, work);
local_bh_enable();

if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
@@ -5121,9 +5121,9 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
preempt_enable();
rcu_read_unlock();
cond_resched();
- rc = !skb_queue_empty(&sk->sk_receive_queue);
- if (rc || busy_loop_timeout(end_time))
- return rc;
+ if (!skb_queue_empty(&sk->sk_receive_queue) ||
+ busy_loop_timeout(end_time))
+ return;
goto restart;
}
cpu_relax();
@@ -5131,10 +5131,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
if (napi_poll)
busy_poll_stop(napi, have_poll_lock);
preempt_enable();
- rc = !skb_queue_empty(&sk->sk_receive_queue);
out:
rcu_read_unlock();
- return rc;
}
EXPORT_SYMBOL(sk_busy_loop);

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 72cc3ecf6516..ccc08fc39722 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -7518,9 +7518,12 @@ struct sk_buff *sctp_skb_recv_datagram(struct sock *sk, int flags,
if (sk->sk_shutdown & RCV_SHUTDOWN)
break;

- if (sk_can_busy_loop(sk) &&
- sk_busy_loop(sk, noblock))
- continue;
+ if (sk_can_busy_loop(sk)) {
+ sk_busy_loop(sk, noblock);
+
+ if (!skb_queue_empty(&sk->sk_receive_queue))
+ continue;
+ }

/* User doesn't want to wait. */
error = -EAGAIN;

2017-03-23 21:38:11

by Alexander Duyck

Subject: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

From: Alexander Duyck <[email protected]>

This patch flips the logic we were using to determine if the busy polling
has timed out. The main motivation for this is that we will need to
support two different possible timeout values in the future and by
recording the start time rather than when we would want to end we can focus
on making the end_time specific to the task be it epoll or socket based
polling.

I am also flipping the logic a bit. The previous code took local_clock()
and shifted it by 10 to get a time value in microseconds. That works for
most values, but it has a side effect: if the clock value is approaching a
roll-over, the current time can never exceed the recorded clock value plus
the timeout usecs, so we would never time out. To account for that I am
leaving start_time as a nanoseconds value instead of shifting it down to a
microseconds value. In addition I am limiting the busy_poll and busy_read
values so that they cannot exceed about 2 seconds. By doing this I can add
the timeout value to the nanosecond value and be guaranteed that we will
not prematurely trigger timeouts, even on 32-bit architectures.

The last bit I changed is to move from a shift by 10 to NSEC_PER_USEC,
using multiplication for run-time calculations and division for a few
compile-time ones. This should be more accurate and perform about the same
on most architectures, since modern CPUs typically handle multiplication
without much overhead.

Signed-off-by: Alexander Duyck <[email protected]>
---
fs/select.c | 16 +++++----
include/net/busy_poll.h | 78 +++++++++++++++++++++++++++-----------------
net/core/dev.c | 6 ++-
net/core/sysctl_net_core.c | 11 +++++-
4 files changed, 68 insertions(+), 43 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index e2112270d75a..9287d3a96e35 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -409,7 +409,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
int retval, i, timed_out = 0;
u64 slack = 0;
unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
- unsigned long busy_end = 0;
+ unsigned long busy_start = 0;

rcu_read_lock();
retval = max_select_fd(n, fds);
@@ -512,11 +512,11 @@ int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)

/* only if found POLL_BUSY_LOOP sockets && not out of time */
if (can_busy_loop && !need_resched()) {
- if (!busy_end) {
- busy_end = busy_loop_end_time();
+ if (!busy_start) {
+ busy_start = busy_loop_current_time();
continue;
}
- if (!busy_loop_timeout(busy_end))
+ if (!busy_loop_timeout(busy_start))
continue;
}
busy_flag = 0;
@@ -800,7 +800,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,
int timed_out = 0, count = 0;
u64 slack = 0;
unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
- unsigned long busy_end = 0;
+ unsigned long busy_start = 0;

/* Optimise the no-wait case */
if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
@@ -853,11 +853,11 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait,

/* only if found POLL_BUSY_LOOP sockets && not out of time */
if (can_busy_loop && !need_resched()) {
- if (!busy_end) {
- busy_end = busy_loop_end_time();
+ if (!busy_start) {
+ busy_start = busy_loop_current_time();
continue;
}
- if (!busy_loop_timeout(busy_end))
+ if (!busy_loop_timeout(busy_start))
continue;
}
busy_flag = 0;
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c55760f4820f..4626cb22f625 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -41,67 +41,85 @@
*/
#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))

+/* The timer checks are using time_after() to determine if we have
+ * exceeded out timeout period. In order to support that we have to limit
+ * ourselves to the smallest possible signed type and then in addition
+ * account for the fact that we are recording our value in microseconds
+ * instead of nanoseconds.
+ *
+ * This limit should be just a little over 2 seconds.
+ */
+#define MAX_BUSY_POLL (INT_MAX / NSEC_PER_USEC)
+
static inline bool net_busy_loop_on(void)
{
return sysctl_net_busy_poll;
}

-static inline u64 busy_loop_us_clock(void)
+static inline bool sk_can_busy_loop(const struct sock *sk)
{
- return local_clock() >> 10;
+ return sk->sk_ll_usec && !signal_pending(current);
}

-static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
-{
- return busy_loop_us_clock() + ACCESS_ONCE(sk->sk_ll_usec);
-}
+void sk_busy_loop(struct sock *sk, int nonblock);

-/* in poll/select we use the global sysctl_net_ll_poll value */
-static inline unsigned long busy_loop_end_time(void)
+#else /* CONFIG_NET_RX_BUSY_POLL */
+static inline unsigned long net_busy_loop_on(void)
{
- return busy_loop_us_clock() + ACCESS_ONCE(sysctl_net_busy_poll);
+ return 0;
}

-static inline bool sk_can_busy_loop(const struct sock *sk)
+static inline bool sk_can_busy_loop(struct sock *sk)
{
- return sk->sk_ll_usec && !signal_pending(current);
+ return false;
}

-static inline bool busy_loop_timeout(unsigned long end_time)
+static inline void sk_busy_loop(struct sock *sk, int nonblock)
{
- unsigned long now = busy_loop_us_clock();
-
- return time_after(now, end_time);
}

-void sk_busy_loop(struct sock *sk, int nonblock);
+#endif /* CONFIG_NET_RX_BUSY_POLL */

-#else /* CONFIG_NET_RX_BUSY_POLL */
-static inline unsigned long net_busy_loop_on(void)
+static inline unsigned long busy_loop_current_time(void)
{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ return (unsigned long)local_clock();
+#else
return 0;
+#endif
}

-static inline unsigned long busy_loop_end_time(void)
+/* in poll/select we use the global sysctl_net_ll_poll value */
+static inline bool busy_loop_timeout(unsigned long start_time)
{
- return 0;
-}
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned long bp_usec = READ_ONCE(sysctl_net_busy_poll);

-static inline bool sk_can_busy_loop(struct sock *sk)
-{
- return false;
-}
+ if (bp_usec) {
+ unsigned long end_time = start_time + (bp_usec * NSEC_PER_USEC);
+ unsigned long now = busy_loop_current_time();

-static inline bool busy_loop_timeout(unsigned long end_time)
-{
+ return time_after(now, end_time);
+ }
+#endif
return true;
}

-static inline void sk_busy_loop(struct sock *sk, int nonblock)
+static inline bool sk_busy_loop_timeout(struct sock *sk,
+ unsigned long start_time)
{
-}
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned long bp_usec = READ_ONCE(sk->sk_ll_usec);

-#endif /* CONFIG_NET_RX_BUSY_POLL */
+ if (bp_usec) {
+ unsigned long end_time = start_time + (bp_usec * NSEC_PER_USEC);
+ unsigned long now = busy_loop_current_time();
+
+ return time_after(now, end_time);
+ }
+#endif
+ return true;
+}

/* used in the NIC receive handler to mark the skb */
static inline void skb_mark_napi_id(struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 6b0458b5afe0..73ebf2f5600e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5062,7 +5062,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)

void sk_busy_loop(struct sock *sk, int nonblock)
{
- unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
+ unsigned long start_time = nonblock ? 0 : busy_loop_current_time();
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
@@ -5112,7 +5112,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
local_bh_enable();

if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
- busy_loop_timeout(end_time))
+ sk_busy_loop_timeout(sk, start_time))
break;

if (unlikely(need_resched())) {
@@ -5122,7 +5122,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
rcu_read_unlock();
cond_resched();
if (!skb_queue_empty(&sk->sk_receive_queue) ||
- busy_loop_timeout(end_time))
+ sk_busy_loop_timeout(sk, start_time))
return;
goto restart;
}
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 4ead336e14ea..88417a0a179a 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -27,6 +27,9 @@
static int min_sndbuf = SOCK_MIN_SNDBUF;
static int min_rcvbuf = SOCK_MIN_RCVBUF;
static int max_skb_frags = MAX_SKB_FRAGS;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static int max_busy_poll = MAX_BUSY_POLL;
+#endif

static int net_msg_warn; /* Unused, but still a sysctl */

@@ -408,14 +411,18 @@ static int proc_do_rss_key(struct ctl_table *table, int write,
.data = &sysctl_net_busy_poll,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &max_busy_poll,
},
{
.procname = "busy_read",
.data = &sysctl_net_busy_read,
.maxlen = sizeof(unsigned int),
.mode = 0644,
- .proc_handler = proc_dointvec
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &max_busy_poll,
},
#endif
#ifdef CONFIG_NET_SCHED

2017-03-23 21:38:08

by Alexander Duyck

Subject: [net-next PATCH v2 7/8] epoll: Add busy poll support to epoll with socket fds.

From: Sridhar Samudrala <[email protected]>

This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
that is added to the ready list that contains a valid NAPI ID and it will
use that for busy polling until the ready list goes empty. Once the ready
list goes empty the NAPI ID is reset and busy polling is disabled until a
new socket is added to the ready list.

In addition, when we insert a new socket into the epoll context we record
its NAPI ID and assume we are going to receive events on it. If that
doesn't occur, it will be evicted as the active NAPI ID and we will resume
normal behavior.

An application can use the SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF
socket options to spread incoming connections across specific worker
threads based on the incoming queue. This allows each worker thread's epoll
context to contain only sockets that receive packets from a single queue,
so when the application calls epoll_wait() and there are no events
available to report, busy polling is done on the associated queue to pull
in packets.

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
fs/eventpoll.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 341251421ced..5420767c9b68 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -42,6 +42,7 @@
#include <linux/seq_file.h>
#include <linux/compat.h>
#include <linux/rculist.h>
+#include <net/busy_poll.h>

/*
* LOCKING:
@@ -224,6 +225,11 @@ struct eventpoll {
/* used to optimize loop detection check */
int visited;
struct list_head visited_list_link;
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ /* used to track busy poll napi_id */
+ unsigned int napi_id;
+#endif
};

/* Wait structure used by the poll hooks */
@@ -384,6 +390,77 @@ static inline int ep_events_available(struct eventpoll *ep)
return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
}

+#ifdef CONFIG_NET_RX_BUSY_POLL
+static bool ep_busy_loop_end(void *p, unsigned long start_time)
+{
+ struct eventpoll *ep = p;
+
+ return ep_events_available(ep) || busy_loop_timeout(start_time);
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
+/*
+ * Busy poll if globally on and supporting sockets found && no events,
+ * busy loop will return if need_resched or ep_events_available.
+ *
+ * we must do our busy polling with irqs enabled
+ */
+static void ep_busy_loop(struct eventpoll *ep, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned int napi_id = READ_ONCE(ep->napi_id);
+
+ if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on())
+ napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep);
+#endif
+}
+
+static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ if (ep->napi_id)
+ ep->napi_id = 0;
+#endif
+}
+
+/*
+ * Set epoll busy poll NAPI ID from sk.
+ */
+static inline void ep_set_busy_poll_napi_id(struct epitem *epi)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ struct eventpoll *ep;
+ unsigned int napi_id;
+ struct socket *sock;
+ struct sock *sk;
+ int err;
+
+ if (!net_busy_loop_on())
+ return;
+
+ sock = sock_from_file(epi->ffd.file, &err);
+ if (!sock)
+ return;
+
+ sk = sock->sk;
+ if (!sk)
+ return;
+
+ napi_id = READ_ONCE(sk->sk_napi_id);
+ ep = epi->ep;
+
+ /* Non-NAPI IDs can be rejected
+ * or
+ * Nothing to do if we already have this ID
+ */
+ if (napi_id < MIN_NAPI_ID || napi_id == ep->napi_id)
+ return;
+
+ /* record NAPI ID for use in next busy poll */
+ ep->napi_id = napi_id;
+#endif
+}
+
/**
* ep_call_nested - Perform a bound (possibly) nested call, by checking
* that the recursion limit is not exceeded, and that
@@ -1022,6 +1099,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k

spin_lock_irqsave(&ep->lock, flags);

+ ep_set_busy_poll_napi_id(epi);
+
/*
* If the event mask does not contain any poll(2) event, we consider the
* descriptor to be disabled. This condition is likely the effect of the
@@ -1363,6 +1442,9 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
/* We have to drop the new item inside our item list to keep track of it */
spin_lock_irqsave(&ep->lock, flags);

+ /* record NAPI ID of new item if present */
+ ep_set_busy_poll_napi_id(epi);
+
/* If the file is already "ready" we drop it inside the ready list */
if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
@@ -1637,10 +1719,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
}

fetch_events:
+
+ if (!ep_events_available(ep))
+ ep_busy_loop(ep, timed_out);
+
spin_lock_irqsave(&ep->lock, flags);

if (!ep_events_available(ep)) {
/*
+ * Busy poll timed out. Drop NAPI ID for now, we can add
+ * it back in when we have moved a socket with a valid NAPI
+ * ID onto the ready list.
+ */
+ ep_reset_busy_poll_napi_id(ep);
+
+ /*
* We don't have any available event to return to the caller.
* We need to sleep here, and we will be wake up by
* ep_poll_callback() when events will become available.

2017-03-23 21:38:38

by Alexander Duyck

Subject: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

From: Sridhar Samudrala <[email protected]>

This socket option returns the NAPI ID associated with the queue on which
the last frame was received. Applications can use this information to split
incoming flows among their threads based on the Rx queue on which the flows
are received.

If the NAPI ID actually represents a sender_cpu, the value is ignored and 0
is returned.

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
arch/alpha/include/uapi/asm/socket.h | 2 ++
arch/avr32/include/uapi/asm/socket.h | 2 ++
arch/frv/include/uapi/asm/socket.h | 2 ++
arch/ia64/include/uapi/asm/socket.h | 2 ++
arch/m32r/include/uapi/asm/socket.h | 2 ++
arch/mips/include/uapi/asm/socket.h | 1 +
arch/mn10300/include/uapi/asm/socket.h | 2 ++
arch/parisc/include/uapi/asm/socket.h | 2 ++
arch/powerpc/include/uapi/asm/socket.h | 2 ++
arch/s390/include/uapi/asm/socket.h | 2 ++
arch/sparc/include/uapi/asm/socket.h | 2 ++
arch/xtensa/include/uapi/asm/socket.h | 2 ++
include/uapi/asm-generic/socket.h | 2 ++
net/core/sock.c | 12 ++++++++++++
14 files changed, 37 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 089db42c1b40..1bb8cac61a28 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 6eabcbd2f82a..f824eeb0f2e4 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index bd497f8356b9..a8ad9bebfc47 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -94,5 +94,7 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_SOCKET_H */

diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index f1bb54686168..6af3253e4209 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -103,4 +103,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 459c46076f6f..e98b6bb897c0 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 688c18dd62ef..ae2b62e39d4d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -112,5 +112,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56

#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 312d2c457a04..e4ac1843ee01 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index b98ec38f2083..f754c793e82a 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -93,4 +93,6 @@

#define SO_MEMINFO 0x4030

+#define SO_INCOMING_NAPI_ID 0x4031
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 099a889240f6..5f84af7dcb2e 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 6199bb34f7fa..25ac4960e707 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -100,4 +100,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 12cd8c2ec422..b05513acd589 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -90,6 +90,8 @@

#define SO_MEMINFO 0x0039

+#define SO_INCOMING_NAPI_ID 0x003a
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index d0b85f6c1484..786606c81edd 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -105,4 +105,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8313702c1eae..c98a52fb572a 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -96,4 +96,6 @@

#define SO_MEMINFO 55

+#define SO_INCOMING_NAPI_ID 56
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 0aa725cb3dd6..1de6c369ed86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1328,6 +1328,18 @@ int sock_getsockopt(struct socket *sock, int level, int optname,

goto lenout;
}
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ case SO_INCOMING_NAPI_ID:
+ v.val = READ_ONCE(sk->sk_napi_id);
+
+ /* aggregate non-NAPI IDs down to 0 */
+ if (v.val < MIN_NAPI_ID)
+ v.val = 0;
+
+ break;
+#endif
+
default:
/* We implement the SO_SNDLOWAT etc to not be settable
* (1003.1g 7).

2017-03-23 21:39:13

by Alexander Duyck

Subject: [net-next PATCH v2 6/8] net: Commonize busy polling code to focus on napi_id instead of socket

From: Sridhar Samudrala <[email protected]>

Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.

This enables re-using this function in epoll busy loop implementation.

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
include/net/busy_poll.h | 20 +++++++++++++++-----
net/core/dev.c | 21 ++++++++-------------
net/core/sock.c | 11 +++++++++++
3 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 4626cb22f625..336a874c0fed 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -61,7 +61,11 @@ static inline bool sk_can_busy_loop(const struct sock *sk)
return sk->sk_ll_usec && !signal_pending(current);
}

-void sk_busy_loop(struct sock *sk, int nonblock);
+bool sk_busy_loop_end(void *p, unsigned long start_time);
+
+void napi_busy_loop(unsigned int napi_id,
+ bool (*loop_end)(void *, unsigned long),
+ void *loop_end_arg);

#else /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long net_busy_loop_on(void)
@@ -74,10 +78,6 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
}

-static inline void sk_busy_loop(struct sock *sk, int nonblock)
-{
-}
-
#endif /* CONFIG_NET_RX_BUSY_POLL */

static inline unsigned long busy_loop_current_time(void)
@@ -121,6 +121,16 @@ static inline bool sk_busy_loop_timeout(struct sock *sk,
return true;
}

+static inline void sk_busy_loop(struct sock *sk, int nonblock)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned int napi_id = READ_ONCE(sk->sk_napi_id);
+
+ if (napi_id >= MIN_NAPI_ID)
+ napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk);
+#endif
+}
+
/* used in the NIC receive handler to mark the skb */
static inline void skb_mark_napi_id(struct sk_buff *skb,
struct napi_struct *napi)
diff --git a/net/core/dev.c b/net/core/dev.c
index 73ebf2f5600e..b082f70e50c5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5060,20 +5060,17 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
do_softirq();
}

-void sk_busy_loop(struct sock *sk, int nonblock)
+void napi_busy_loop(unsigned int napi_id,
+ bool (*loop_end)(void *, unsigned long),
+ void *loop_end_arg)
{
- unsigned long start_time = nonblock ? 0 : busy_loop_current_time();
+ unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
int (*napi_poll)(struct napi_struct *napi, int budget);
void *have_poll_lock = NULL;
struct napi_struct *napi;
- unsigned int napi_id;
int work;

restart:
- napi_id = READ_ONCE(sk->sk_napi_id);
- if (napi_id < MIN_NAPI_ID)
- return;
-
work = 0;
napi_poll = NULL;

@@ -5107,12 +5104,11 @@ void sk_busy_loop(struct sock *sk, int nonblock)
trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
count:
if (work > 0)
- __NET_ADD_STATS(sock_net(sk),
+ __NET_ADD_STATS(dev_net(napi->dev),
LINUX_MIB_BUSYPOLLRXPACKETS, work);
local_bh_enable();

- if (nonblock || !skb_queue_empty(&sk->sk_receive_queue) ||
- sk_busy_loop_timeout(sk, start_time))
+ if (!loop_end || loop_end(loop_end_arg, start_time))
break;

if (unlikely(need_resched())) {
@@ -5121,8 +5117,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
preempt_enable();
rcu_read_unlock();
cond_resched();
- if (!skb_queue_empty(&sk->sk_receive_queue) ||
- sk_busy_loop_timeout(sk, start_time))
+ if (loop_end(loop_end_arg, start_time))
return;
goto restart;
}
@@ -5134,7 +5129,7 @@ void sk_busy_loop(struct sock *sk, int nonblock)
out:
rcu_read_unlock();
}
-EXPORT_SYMBOL(sk_busy_loop);
+EXPORT_SYMBOL(napi_busy_loop);

#endif /* CONFIG_NET_RX_BUSY_POLL */

diff --git a/net/core/sock.c b/net/core/sock.c
index f8c0373a3a74..0aa725cb3dd6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3231,3 +3231,14 @@ static int __init proto_init(void)
subsys_initcall(proto_init);

#endif /* PROC_FS */
+
+#ifdef CONFIG_NET_RX_BUSY_POLL
+bool sk_busy_loop_end(void *p, unsigned long start_time)
+{
+ struct sock *sk = p;
+
+ return !skb_queue_empty(&sk->sk_receive_queue) ||
+ sk_busy_loop_timeout(sk, start_time);
+}
+EXPORT_SYMBOL(sk_busy_loop_end);
+#endif /* CONFIG_NET_RX_BUSY_POLL */

2017-03-23 22:06:00

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next PATCH v2 1/8] net: Busy polling should ignore sender CPUs

On Thu, 2017-03-23 at 14:36 -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> This patch is a cleanup/fix for NAPI IDs following the changes that made it
> so that sender_cpu and napi_id were doing a better job of sharing the same
> location in the sk_buff.
>
> One issue I found is that we weren't validating the napi_id as being valid
> before we started trying to setup the busy polling. This change corrects
> that by using the MIN_NAPI_ID value that is now used in both allocating the
> NAPI IDs, as well as validating them.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---

Acked-by: Eric Dumazet <[email protected]>


2017-03-23 22:07:32

by Alexei Starovoitov

Subject: Re: [net-next PATCH v2 0/8] Add busy poll support for epoll

On Thu, Mar 23, 2017 at 02:36:29PM -0700, Alexander Duyck wrote:
> This is my second pass at trying to add support for busy polling when using
> epoll. It is pretty much a full rewrite as I have made serious changes to
> most of the patches.
>
> In the v1 series I had submitted we only allowed epoll to make use of busy
> poll when all NAPI IDs were the same. I gave this some more thought and
> after making several other changes based on feedback from Eric Dumazet I
> decided to try changing the main concept a bit and instead we will now
> attempt to busy poll on the NAPI ID of the last socket added to the ready
> list. By doing it this way we are able to take advantage of the changes
> Eric has already made so that we get woken up by the softirq, we then pull
> packets via busy poll, and will return to the softirq until we are woken up
> and a new socket has been added to the ready list.
>
> Most of the changes in this set authored by me are meant to be cleanup or
> fixes for various things. For example, I am trying to make it so that we
> don't perform hash look-ups for the NAPI instance when we are only working
> with sender_cpu and the like.
>
> The most complicated change of the set is probably the clean-ups for the
> timeout. I realized that the timeout could potentially get into a state
> where it would never time out if the local_clock() was approaching a
> rollover and the added busy poll usecs interval would be enough to roll it
> over. Because we were shifting the value you would never be able to get
> time_after to actually trigger.
>
> At the heart of this set are the last 3 patches, which enable epoll support
> and add support for obtaining the NAPI ID of a given socket. With these
> it becomes possible for an application to make use of epoll and get optimal
> busy poll utilization by stacking multiple sockets with the same NAPI ID on
> the same epoll context.

it all sounds awesome, but i cannot quite visualize the impact.
Can you post some sample code/minibenchmark and numbers before/after?

Thanks!

2017-03-23 22:22:10

by Eric Dumazet

Subject: Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

On Thu, 2017-03-23 at 14:38 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala <[email protected]>
>
> This socket option returns the NAPI ID associated with the queue on which
> the last frame is received. This information can be used by the apps to
> split the incoming flows among the threads based on the Rx queue on which
> they are received.
>
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

Acked-by: Eric Dumazet <[email protected]>


2017-03-23 22:25:25

by Eric Dumazet

Subject: Re: [net-next PATCH v2 6/8] net: Commonize busy polling code to focus on napi_id instead of socket

On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> From: Sridhar Samudrala <[email protected]>
>
> Move the core functionality in sk_busy_loop() to napi_busy_loop() and
> make it independent of sk.
>
> This enables re-using this function in epoll busy loop implementation.
>
> Signed-off-by: Sridhar Samudrala <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---

Acked-by: Eric Dumazet <[email protected]>


2017-03-23 22:38:22

by Alexander Duyck

Subject: Re: [net-next PATCH v2 0/8] Add busy poll support for epoll

On Thu, Mar 23, 2017 at 3:07 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Thu, Mar 23, 2017 at 02:36:29PM -0700, Alexander Duyck wrote:
>> This is my second pass at trying to add support for busy polling when using
>> epoll. It is pretty much a full rewrite as I have made serious changes to
>> most of the patches.
>>
>> In the v1 series I had submitted we only allowed epoll to make use of busy
>> poll when all NAPI IDs were the same. I gave this some more thought and
>> after making several other changes based on feedback from Eric Dumazet I
>> decided to try changing the main concept a bit and instead we will now
>> attempt to busy poll on the NAPI ID of the last socket added to the ready
>> list. By doing it this way we are able to take advantage of the changes
>> Eric has already made so that we get woken up by the softirq, we then pull
>> packets via busy poll, and will return to the softirq until we are woken up
>> and a new socket has been added to the ready list.
>>
>> Most of the changes in this set authored by me are meant to be cleanup or
>> fixes for various things. For example, I am trying to make it so that we
>> don't perform hash look-ups for the NAPI instance when we are only working
>> with sender_cpu and the like.
>>
>> The most complicated change of the set is probably the clean-ups for the
>> timeout. I realized that the timeout could potentially get into a state
>> where it would never time out if the local_clock() was approaching a
>> rollover and the added busy poll usecs interval would be enough to roll it
>> over. Because we were shifting the value you would never be able to get
>> time_after to actually trigger.
>>
>> At the heart of this set are the last 3 patches, which enable epoll support
>> and add support for obtaining the NAPI ID of a given socket. With these
>> it becomes possible for an application to make use of epoll and get optimal
>> busy poll utilization by stacking multiple sockets with the same NAPI ID on
>> the same epoll context.
>
> it all sounds awesome, but i cannot quite visualize the impact.
> Can you post some sample code/minibenchmark and numbers before/after?
>
> Thanks!
>

Anything specific you are looking for? I can probably work with our
team internally to set up the benchmark in the next day or so.

I've been doing most of my benchmarking at my desk with sockperf with
just one thread and multiple sockets and seeing some modest savings
with my round trip times dropping from something like 18 microseconds
on average to 8 microseconds for ping-pong tests.

Thanks.

- Alex

2017-03-23 22:43:37

by Andy Lutomirski

Subject: Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
<[email protected]> wrote:
> From: Sridhar Samudrala <[email protected]>
>
> This socket option returns the NAPI ID associated with the queue on which
> the last frame is received. This information can be used by the apps to
> split the incoming flows among the threads based on the Rx queue on which
> they are received.
>
> If the NAPI ID actually represents a sender_cpu then the value is ignored
> and 0 is returned.

This may be more of a naming / documentation issue than a
functionality issue, but to me this reads as:

"This socket option returns an internal implementation detail that, if
you are sufficiently clueful about the current performance heuristics
used by the Linux networking stack, just might give you a hint as to
which epoll set to put the socket in." I've done some digging into
Linux networking stuff, but not nearly enough to have the slightest
clue what you're supposed to do with the NAPI ID.

It would be nice to make this a bit more concrete and a bit less tied
in Linux innards. Perhaps a socket option could instead return a hint
saying "for best results, put this socket in an epoll set that's on
cpu N"? After all, you're unlikely to do anything productive with
busy polling at all, even on a totally different kernel
implementation, if you have more than one epoll set per CPU. I can
see cases where you could plausibly poll with fewer than one set per
CPU, I suppose.

Again, though, from the description, it's totally unclear what a user
is supposed to do.

--Andy

2017-03-23 22:49:22

by Eric Dumazet

Subject: Re: [net-next PATCH v2 0/8] Add busy poll support for epoll

On Thu, Mar 23, 2017 at 3:38 PM, Alexander Duyck
<[email protected]> wrote:
> On Thu, Mar 23, 2017 at 3:07 PM, Alexei Starovoitov

>> it all sounds awesome, but i cannot quite visualize the impact.
>> Can you post some sample code/minibenchmark and numbers before/after?
>>
>> Thanks!
>>
>
> Anything specific you are looking for? I can probably work with our
> team internally to set up the benchmark in the next day or so.
>
> I've been doing most of my benchmarking at my desk with sockperf with
> just one thread and multiple sockets and seeing some modest savings
> with my round trip times dropping from something like 18 microseconds
> on average to 8 microseconds for ping-pong tests.

Same reduction for me, on 1000/1000 bytes RPC.

26 usec -> 16 usec per transaction

(If you use sockperf, beware that it displays half round trips)

Also note that the gains also depends on how the interrupt mitigation
parameters are set.

2017-03-24 00:58:52

by Alexander Duyck

Subject: Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
> <[email protected]> wrote:
>> From: Sridhar Samudrala <[email protected]>
>>
>> This socket option returns the NAPI ID associated with the queue on which
>> the last frame is received. This information can be used by the apps to
>> split the incoming flows among the threads based on the Rx queue on which
>> they are received.
>>
>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>> and 0 is returned.
>
> This may be more of a naming / documentation issue than a
> functionality issue, but to me this reads as:
>
> "This socket option returns an internal implementation detail that, if
> you are sufficiently clueful about the current performance heuristics
> used by the Linux networking stack, just might give you a hint as to
> which epoll set to put the socket in." I've done some digging into
> Linux networking stuff, but not nearly enough to have the slighest
> clue what you're supposed to do with the NAPI ID.

Really the NAPI ID is an arbitrary number that will be unique per
device queue, though multiple Rx queues can share a NAPI ID if they
are meant to be processed in the same call to poll.

If we wanted we could probably rename it to something like Device Poll
Identifier or Device Queue Identifier, DPID or DQID, if that would
work for you. Essentially it is just a unique u32 value that should
not identify any other queue in the system while this device queue is
active. Really the number itself is mostly arbitrary, the main thing
is that it doesn't change and uniquely identifies the queue in the
system.

> It would be nice to make this a bit more concrete and a bit less tied
> in Linux innards. Perhaps a socket option could instead return a hint
> saying "for best results, put this socket in an epoll set that's on
> cpu N"? After all, you're unlikely to do anything productive with
> busy polling at all, even on a totally different kernel
> implementation, if you have more than one epoll set per CPU. I can
> see cases where you could plausibly poll with fewer than one set per
> CPU, I suppose.

Really we kind of already have an option that does what you are
implying called SO_INCOMING_CPU. The problem is it requires pinning
the interrupts to the CPUs in order to keep the values consistent, and
even then busy polling can mess that up if the busy poll thread is
running on a different CPU. With the NAPI ID we have to do a bit of
work on the application end, but we can uniquely identify each
incoming queue and interrupt migration and busy polling don't have any
effect on it. So for example we could stack all the interrupts on CPU
0, and have our main thread located there doing the sorting of
incoming requests and handing them out to epoll listener threads on
other CPUs. When those epoll listener threads start doing busy
polling the NAPI ID won't change even though the packet is being
processed on a different CPU.

> Again, though, from the description, it's totally unclear what a user
> is supposed to do.

What you end up having to do is essentially create a hash of sorts so
that you can go from NAPI IDs to threads. In an ideal setup what you
end up with is multiple threads, each one running one epoll, and each
epoll polling on one specific queue.

Hope that helps to clarify it.

- Alex

2017-03-24 01:00:55

by Eric Dumazet

Subject: Re: [net-next PATCH v2 2/8] tcp: Record Rx hash and NAPI ID in tcp_child_process

On Thu, 2017-03-23 at 14:36 -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> While working on some recent busy poll changes we found that child sockets
> were being instantiated without NAPI ID being set. In our first attempt to
> fix it, it was suggested that we should just pull programming the NAPI ID
> into the function itself since all callers will need to have it set.
>
> In addition to the NAPI ID change I have dropped the code that was
> populating the Rx hash since it was actually being populated in
> tcp_get_cookie_sock.
>
> Reported-by: Sridhar Samudrala <[email protected]>
> Suggested-by: Eric Dumazet <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---

Acked-by: Eric Dumazet <[email protected]>


2017-03-24 01:02:26

by Eric Dumazet

Subject: Re: [net-next PATCH v2 3/8] net: Only define skb_mark_napi_id in one spot instead of two

On Thu, 2017-03-23 at 14:36 -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Instead of defining two versions of skb_mark_napi_id I think it is more
> readable to just match the format of the sk_mark_napi_id functions and just
> wrap the contents of the function instead of defining two versions of the
> function. This way we can save a few lines of code since we only need 2 of
> the ifdef/endif but needed 5 for the extra function declaration.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> include/net/busy_poll.h | 22 +++++++++-------------
> 1 file changed, 9 insertions(+), 13 deletions(-)

Acked-by: Eric Dumazet <[email protected]>


2017-03-24 01:03:02

by Eric Dumazet

Subject: Re: [net-next PATCH v2 4/8] net: Change return type of sk_busy_loop from bool to void

On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> From what I can tell there are only a couple of spots where we are actually
> checking the return value of sk_busy_loop. As there are only a few
> consumers of that data, and the data being checked for can be replaced
> with a check for !skb_queue_empty() we might as well just pull the code
> out of sk_busy_loop and place it in the spots that actually need it.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---

Acked-by: Eric Dumazet <[email protected]>


2017-03-24 01:24:48

by Eric Dumazet

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>

> The last bit I changed is to move from using a shift by 10 to just using
> NSEC_PER_USEC and using multiplication for any run time calculations and
> division for a few compile time ones. This should be more accurate and
> perform about the same on most architectures since modern CPUs typically
> handle multiplication without too much overhead.


busy polling thread can be preempted for more than 2 seconds.

Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.

We do not need nsec accuracy for busy polling users, if this restricts
range and usability under stress.




2017-03-24 03:42:42

by Alexander Duyck

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Thu, Mar 23, 2017 at 6:24 PM, Eric Dumazet <[email protected]> wrote:
> On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
>> From: Alexander Duyck <[email protected]>
>>
>
>> The last bit I changed is to move from using a shift by 10 to just using
>> NSEC_PER_USEC and using multiplication for any run time calculations and
>> division for a few compile time ones. This should be more accurate and
>> perform about the same on most architectures since modern CPUs typically
>> handle multiplication without too much overhead.
>
>
> busy polling thread can be preempted for more than 2 seconds.

If it is preempted is the timing value even valid anymore? I was
wondering about that. Also when preemption is enabled is there
anything to prevent us from being migrated to a different CPU? If so what do we
do about architectures that allow drift between the clocks?

> Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.

Yes, but the problem is we also opened up an issue where if the clock
was approaching a roll-over we could add a value to it that would put
us in a state where we would never time out.

> We do not need nsec accuracy for busy polling users, if this restricts
> range and usability under stress.

Yes and no. So the standard use cases suggest using values of 50 to
100 microseconds. I suspect that for most people that is probably
what they are using. The addition of preemption kind of threw a
wrench in the works because now instead of spending that time busy
polling you can get preempted and then are off doing something else
for the entire period of time.

What would you think of changing this so that instead of tracking the
total time this function is active, instead we tracked the total time
we spent with preemption disabled? What I would do is move the
start_time configuration to just after the preempt_disable() call.
Then if we end up yielding to another thread we would just reset the
start_time when we restarted instead of trying to deal with all the
extra clock nonsense that we would have to deal with otherwise since I
don't know if we actually want to count time where we aren't actually
doing anything. In addition this would bring us closer to how NAPI
already works since it essentially will either find an event, or if we
time out we hand it off to the softirq which in turn can handle it or
hand it off to softirqd. The only item that might be a bit more
difficult to deal with then would be the way the times are used in
fs/select.c but I don't know if that is really the right way to go
anyway. With the preemption changes and such it might just make sense
to drop those bits and rely on just the socket polling alone.

The other option is to switch over everything from using unsigned long
to using uint64_t and time_after64. Then we can guarantee the range
needed and then some, but then we are playing with a u64 time value on
32b architectures which might be a bit more expensive. Even with that
though I still need to clean up the sysctl since it doesn't make sense
to allow negative values for the busy_poll usec to be used which is
currently the case.

Anyway let me know what you think and I can probably spin out a new
set tomorrow.

- Alex

2017-03-24 04:27:47

by Eric Dumazet

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Thu, 2017-03-23 at 20:42 -0700, Alexander Duyck wrote:
> On Thu, Mar 23, 2017 at 6:24 PM, Eric Dumazet <[email protected]> wrote:
> > On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
> >> From: Alexander Duyck <[email protected]>
> >>
> >
> >> The last bit I changed is to move from using a shift by 10 to just using
> >> NSEC_PER_USEC and using multiplication for any run time calculations and
> >> division for a few compile time ones. This should be more accurate and
> >> perform about the same on most architectures since modern CPUs typically
> >> handle multiplication without too much overhead.
> >
> >
> > busy polling thread can be preempted for more than 2 seconds.
>
> If it is preempted is the timing value even valid anymore? I was
> wondering about that. Also when preemption is enabled is there
> anything to prevent us from being migrated to a different CPU? If so what do we
> do about architectures that allow drift between the clocks?
>
> > Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.
>
> Yes, but the problem is we also opened up an issue where if the clock
> was approaching a roll-over we could add a value to it that would put
> us in a state where we would never time out.

If you believe there is a bug, send a fix for net tree ?

I really do not see a bug, given we use time_after(now, end_time) which
handles roll-over just fine.



>
> > We do not need nsec accuracy for busy polling users, if this restricts
> > range and usability under stress.
>
> Yes and no. So the standard use cases suggest using values of 50 to
> 100 microseconds. I suspect that for most people that is probably
> what they are using. The addition of preemption kind of threw a
> wrench in the works because now instead of spending that time busy
> polling you can get preempted and then are off doing something else
> for the entire period of time.

That is fine. Busy polling heavy users are pinning threads to cpu, and
the re-schedule check is basically a way to make sure ksoftirqd will get
a chance to service BH.

>
> What would you think of changing this so that instead of tracking the
> total time this function is active, instead we tracked the total time
> we spent with preemption disabled? What I would do is move the
> start_time configuration to just after the preempt_disable() call.
> Then if we end up yielding to another thread we would just reset the
> start_time when we restarted instead of trying to deal with all the
> extra clock nonsense that we would have to deal with otherwise since I
> don't know if we actually want to count time where we aren't actually
> doing anything. In addition this would bring us closer to how NAPI
> already works since it essentially will either find an event, or if we
> time out we hand it off to the softirq which in turn can handle it or
> hand it off to softirqd. The only item that might be a bit more
> difficult to deal with then would be the way the times are used in
> fs/select.c but I don't know if that is really the right way to go
> anyway. With the preemption changes and such it might just make sense
> to drop those bits and rely on just the socket polling alone.
>
> The other option is to switch over everything from using unsigned long
> to using uint64_t and time_after64. Then we can guarantee the range
> needed and then some, but then we are playing with a u64 time value on
> 32b architectures which might be a bit more expensive. Even with that
> though I still need to clean up the sysctl since it doesn't make sense
> to allow negative values for the busy_poll usec to be used which is
> currently the case.
>
> Anyway let me know what you think and I can probably spin out a new
> set tomorrow.


Leave usec resolution, I see no point switching to nsec, as long we
support 32bit kernels.

If you believe min/max values should be added to the sysctls, because we
do not trust root anymore, please send patches only addressing that.

Thanks.

2017-03-24 04:38:40

by Eric Dumazet

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Thu, 2017-03-23 at 21:27 -0700, Eric Dumazet wrote:

> If you believe min/max values should be added to the sysctls, because we
> do not trust root anymore, please send patches only addressing that.

extern unsigned int sysctl_net_busy_read;
extern unsigned int sysctl_net_busy_poll;
...
unsigned int sk_ll_usec;


These are unsigned values, so no min/max would be required; at most we
should use proc_douintvec() instead of proc_dointvec(). Most sysctls use
proc_dointvec() because we were lazy and had to wait for linux-4.8 to
actually get proc_douintvec() into the kernel ;)

2017-03-24 04:48:31

by Andy Lutomirski

Subject: Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

On Thu, Mar 23, 2017 at 5:58 PM, Alexander Duyck
<[email protected]> wrote:
> On Thu, Mar 23, 2017 at 3:43 PM, Andy Lutomirski <[email protected]> wrote:
>> On Thu, Mar 23, 2017 at 2:38 PM, Alexander Duyck
>> <[email protected]> wrote:
>>> From: Sridhar Samudrala <[email protected]>
>>>
>>> This socket option returns the NAPI ID associated with the queue on which
>>> the last frame was received. This information can be used by the apps to
>>> split the incoming flows among the threads based on the Rx queue on which
>>> they are received.
>>>
>>> If the NAPI ID actually represents a sender_cpu then the value is ignored
>>> and 0 is returned.
>>
>> This may be more of a naming / documentation issue than a
>> functionality issue, but to me this reads as:
>>
>> "This socket option returns an internal implementation detail that, if
>> you are sufficiently clueful about the current performance heuristics
>> used by the Linux networking stack, just might give you a hint as to
>> which epoll set to put the socket in." I've done some digging into
>> Linux networking stuff, but not nearly enough to have the slighest
>> clue what you're supposed to do with the NAPI ID.
>
> Really the NAPI ID is an arbitrary number that will be unique per
> device queue, though multiple Rx queues can share a NAPI ID if they
> are meant to be processed in the same call to poll.
>
> If we wanted we could probably rename it to something like Device Poll
> Identifier or Device Queue Identifier, DPID or DQID, if that would
> work for you. Essentially it is just a unique u32 value that should
> not identify any other queue in the system while this device queue is
> active. Really the number itself is mostly arbitrary, the main thing
> is that it doesn't change and uniquely identifies the queue in the
> system.

That seems reasonably sane to me.

>
>> It would be nice to make this a bit more concrete and a bit less tied
>> in Linux innards. Perhaps a socket option could instead return a hint
>> saying "for best results, put this socket in an epoll set that's on
>> cpu N"? After all, you're unlikely to do anything productive with
>> busy polling at all, even on a totally different kernel
>> implementation, if you have more than one epoll set per CPU. I can
>> see cases where you could plausibly poll with fewer than one set per
>> CPU, I suppose.
>
> Really we kind of already have an option that does what you are
> implying called SO_INCOMING_CPU. The problem is it requires pinning
> the interrupts to the CPUs in order to keep the values consistent,

Some day the kernel should just solve this problem once and for all.
Have root give a basic policy for mapping queues to CPUs (one per
physical core / one per logical core / use this subset of cores) and
perhaps forcibly prevent irqbalanced from even seeing it. I'm sure
other solutions are possible.

> and
> even then busy polling can mess that up if the busy poll thread is
> running on a different CPU. With the NAPI ID we have to do a bit of
> work on the application end, but we can uniquely identify each
> incoming queue and interrupt migration and busy polling don't have any
> effect on it. So for example we could stack all the interrupts on CPU
> 0, and have our main thread located there doing the sorting of
> incoming requests and handing them out to epoll listener threads on
> other CPUs. When those epoll listener threads start doing busy
> polling the NAPI ID won't change even though the packet is being
> processed on a different CPU.
>
>> Again, though, from the description, it's totally unclear what a user
>> is supposed to do.
>
> What you end up having to do is essentially create a hash of sorts so
> that you can go from NAPI IDs to threads. In an ideal setup what you
> end up with is multiple threads, each one running one epoll, and each
> epoll polling on one specific queue.

So don't we want queue id, not NAPI id? Or am I still missing something?

But I'm also a bit confused as to the overall performance effect.
Suppose I have an rx queue that has its interrupt bound to cpu 0. For
whatever reason (random chance if I'm hashing, for example), I end up
with the epoll caller on cpu 1. Suppose further that cpus 0 and 1 are
on different NUMA nodes.

Now, let's suppose that I get lucky and *all* the packets are pulled
off the queue by epoll busy polling. Life is great [1]. But suppose
that, due to a tiny hiccup or simply user code spending some cycles
processing those packets, an rx interrupt fires. Now cpu 0 starts
pulling packets off the queue via NAPI, right? So both NUMA nodes are
fighting over all the cachelines involved in servicing the queue *and*
the packets just got dequeued on the wrong NUMA node.

ISTM this would work better if the epoll busy polling could handle the
case where one epoll set polls sockets on different queues as long as
those queues are all owned by the same CPU. Then user code could use
SO_INCOMING_CPU to sort out the sockets.

Am I missing something?

[1] Maybe. How smart is direct cache access? If it's smart enough,
it'll pre-populate node 0's LLC, which means that life isn't so great
after all.

2017-03-24 05:07:34

by Eric Dumazet

Subject: Re: [net-next PATCH v2 8/8] net: Introduce SO_INCOMING_NAPI_ID

On Thu, Mar 23, 2017 at 9:47 PM, Andy Lutomirski <[email protected]> wrote:

> So don't we want queue id, not NAPI id? Or am I still missing something?
>
> But I'm also a bit confused as to the overall performance effect.
> Suppose I have an rx queue that has its interrupt bound to cpu 0. For
> whatever reason (random chance if I'm hashing, for example), I end up
> with the epoll caller on cpu 1. Suppose further that cpus 0 and 1 are
> on different NUMA nodes.
>
> Now, let's suppose that I get lucky and *all* the packets are pulled
> off the queue by epoll busy polling. Life is great [1]. But suppose
> that, due to a tiny hiccup or simply user code spending some cycles
> processing those packets, an rx interrupt fires. Now cpu 0 starts
> pulling packets off the queue via NAPI, right? So both NUMA nodes are
> fighting over all the cachelines involved in servicing the queue *and*
> the packets just got dequeued on the wrong NUMA node.
>
> ISTM this would work better if the epoll busy polling could handle the
> case where one epoll set polls sockets on different queues as long as
> those queues are all owned by the same CPU. Then user code could use
> SO_INCOMING_CPU to sort out the sockets.
>

Of course you can do that already.

SO_REUSEPORT + an appropriate eBPF filter can select the best socket to
receive your packets, based on various smp/numa affinities
(BPF_FUNC_get_smp_processor_id or BPF_FUNC_get_numa_node_id).

This new option simply _allows_ other schemes, based on queue IDs, in
the case where each NIC queue can be managed by a group of cores
(presumably on the same NUMA node).


> Am I missing something?
>
> [1] Maybe. How smart is direct cache access? If it's smart enough,
> it'll pre-populate node 0's LLC, which means that life isn't so great
> after all.

2017-03-24 05:55:19

by Alexander Duyck

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Thu, Mar 23, 2017 at 9:27 PM, Eric Dumazet <[email protected]> wrote:
> On Thu, 2017-03-23 at 20:42 -0700, Alexander Duyck wrote:
>> On Thu, Mar 23, 2017 at 6:24 PM, Eric Dumazet <[email protected]> wrote:
>> > On Thu, 2017-03-23 at 14:37 -0700, Alexander Duyck wrote:
>> >> From: Alexander Duyck <[email protected]>
>> >>
>> >
>> >> The last bit I changed is to move from using a shift by 10 to just using
>> >> NSEC_PER_USEC and using multiplication for any run time calculations and
>> >> division for a few compile time ones. This should be more accurate and
>> >> perform about the same on most architectures since modern CPUs typically
>> >> handle multiplication without too much overhead.
>> >
>> >
> A busy polling thread can be preempted for more than 2 seconds.
>>
>> If it is preempted, is the timing value even valid anymore? I was
>> wondering about that. Also, when preemption is enabled, is there
>> anything to prevent us from being migrated to another CPU? If so, what
>> do we do about architectures that allow drift between the clocks?
>>
>> > Using usec instead of nanoseconds gave us 3 orders of magnitude cushion.
>>
>> Yes, but the problem is we also opened up an issue where if the clock
>> was approaching a roll-over we could add a value to it that would put
>> us in a state where we would never time out.
>
> If you believe there is a bug, send a fix for net tree ?
>
> I really do not see a bug, given we use time_after(now, end_time) which
> handles roll-over just fine.

Right, but time_after assumes roll-over. When you are using a time
value based off of local_clock() >> 10, the value you compute by
addition doesn't ever roll over; only the clock does. At least on
64-bit systems.

So if local time approaches something like all 1's, then once we have
shifted it by 10 the max it can ever reach is 0x003FFFFFFFFFFFFF. I can
add our loop time to that and it won't roll over. In the meantime the
busy_loop_us value can never exceed whatever I added to it, so we are
now locked into a loop. I realize I am probably being pedantic, and it
will have an exceedingly small rate of occurrence, but it is still an
issue.

>>
>> > We do not need nsec accuracy for busy polling users, if this restricts
>> > range and usability under stress.
>>
>> Yes and no. So the standard use cases suggest using values of 50 to
>> 100 microseconds. I suspect that for most people that is probably
>> what they are using. The addition of preemption kind of threw a
>> wrench in the works because now instead of spending that time busy
>> polling you can get preempted and then are off doing something else
>> for the entire period of time.
>
> That is fine. Busy polling heavy users are pinning threads to cpu, and
> the re-schedule check is basically a way to make sure ksoftirqd will get
> a chance to service BH.
>
>>
>> What would you think of changing this so that instead of tracking the
>> total time this function is active, instead we tracked the total time
>> we spent with preemption disabled? What I would do is move the
>> start_time configuration to just after the preempt_disable() call.
>> Then if we end up yielding to another thread we would just reset the
>> start_time when we restarted instead of trying to deal with all the
>> extra clock nonsense that we would have to deal with otherwise since I
>> don't know if we actually want to count time where we aren't actually
>> doing anything. In addition this would bring us closer to how NAPI
>> already works since it essentially will either find an event, or if we
>> time out we hand it off to the softirq which in turn can handle it or
>> hand it off to softirqd. The only item that might be a bit more
>> difficult to deal with then would be the way the times are used in
>> fs/select.c but I don't know if that is really the right way to go
>> anyway. With the preemption changes and such it might just make sense
>> to drop those bits and rely on just the socket polling alone.
>>
>> The other option is to switch over everything from using unsigned long
>> to using uint64_t and time_after64. Then we can guarantee the range
>> needed and then some, but then we are playing with a u64 time value on
>> 32b architectures which might be a bit more expensive. Even with that
>> though I still need to clean up the sysctl since it doesn't make sense
>> to allow negative values for the busy_poll usec to be used which is
>> currently the case.
>>
>> Anyway let me know what you think and I can probably spin out a new
>> set tomorrow.
>
>
> Leave usec resolution; I see no point switching to nsec as long as we
> support 32-bit kernels.

I wonder how much it really matters in the grand scheme of things.
The question I would have is which is more expensive. We are either
having to do a set of double precision shifts, or a multiplication,
double precision addition, and double precision compare. I need to
take a look at some instruction tables to see which is more expensive.

My preference would be to switch things over to a u64 at this point as
it fixes the time_after problem, and has no additional cost on 64b and
bumps up the maximum usable value on 32b systems by a power of 2, but
I will have to think about it and see if there isn't a better way to
deal with that issue.

> If you believe min/max values should be added to the sysctls, because we
> do not trust root anymore, please send patches only addressing that.

It isn't so much a min/max as just the wrong function used. I will
switch it over to using proc_douintvec().

Also it isn't that I don't trust root; I don't like us giving root the
option of configuring things wrongly without any sort of warning. So
switching things over to unsigned int will match what we are actually
storing, but anything over INT_MAX on 32-bit systems will give you busy
polling with an immediate timeout, since the sign bit is what
"time_after" uses to determine whether we crossed the boundary.

- Alex

2017-03-24 11:17:10

by Eric Dumazet

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Thu, 2017-03-23 at 22:55 -0700, Alexander Duyck wrote:

> Right, but time_after assumes roll over. When you are using a time
> value based off of local_clock() >> 10, you don't ever roll over when
> you do addition. Just the clock rolls over. At least on 64 bit
> systems.
>
> So if local time approaches something like all 1's, then once we have
> shifted it by 10 the max it can ever reach is
> 0x003FFFFFFFFFFFFF. I can add our loop time to that and it won't roll
> over. In the meantime the busy_loop_us value can never exceed whatever I
> added to that so we are now locked into a loop. I realize I am
> probably being pedantic, and it will have an exceedingly small rate of
> occurrence, but it is still an issue.

Do you realize that a 64-bit clock won't roll over before the host has
reached 584 years of uptime?

2017-03-24 15:48:47

by Alexander Duyck

Subject: Re: [net-next PATCH v2 5/8] net: Track start of busy loop instead of when it should end

On Fri, Mar 24, 2017 at 4:16 AM, Eric Dumazet <[email protected]> wrote:
> On Thu, 2017-03-23 at 22:55 -0700, Alexander Duyck wrote:
>
>> Right, but time_after assumes roll over. When you are using a time
>> value based off of local_clock() >> 10, you don't ever roll over when
>> you do addition. Just the clock rolls over. At least on 64 bit
>> systems.
>>
>> So if local time approaches something like all 1's, then once we have
>> shifted it by 10 the max it can ever reach is
>> 0x003FFFFFFFFFFFFF. I can add our loop time to that and it won't roll
>> over. In the meantime the busy_loop_us value can never exceed whatever I
>> added to that so we are now locked into a loop. I realize I am
>> probably being pedantic, and it will have an exceedingly small rate of
>> occurrence, but it is still an issue.
>
> Do you realize that a 64-bit clock won't roll over before the host has
> reached 584 years of uptime?

Yeah, that is what I meant by "probably being pedantic". I was being
too much of a perfectionist.

So I can work with the ">> 10" approach. The only thing I think I may
still want to change is that on 32-bit systems I will still use
proc_dointvec_minmax for busy_poll and busy_read to prevent us from
inputting values less than 0. For 64-bit systems we can use
proc_douintvec. It isn't so much that I don't trust root; it is just
that we never really documented the ranges anywhere, so I figure we
should at least lock them down to the usable values, since root may not
be aware of the implementation details.