[PATCH v10 net-next 0/6] net: low latency Ethernet device polling

Subject: [PATCH v10 net-next 1/6] net: add napi_id and hash

Adds a napi_id and a hashing mechanism to look up a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.

Signed-off-by: Eliezer Tamir <[email protected]>
---
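Note for readers: the ID allocation and lookup described above can be
modelled in plain userspace C. This is only a sketch under stated
assumptions, not the kernel code: struct ctx, ctx_by_id(), ctx_hash_add()
and ctx_hash_del() are hypothetical stand-ins, a non-zero id replaces the
NAPI_STATE_HASHED bit, and the spinlock and RCU protection are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_BITS 8
#define TABLE_SIZE (1u << TABLE_BITS)	/* same size as DEFINE_HASHTABLE(napi_hash, 8) */

/* Hypothetical stand-in for struct napi_struct. */
struct ctx {
	unsigned int id;	/* 0 means "not in the table" */
	struct ctx *next;	/* bucket chain */
};

static struct ctx *table[TABLE_SIZE];
static unsigned int gen_id;	/* last id handed out */

/* Walk one bucket and compare ids; NULL if the id is unknown. */
struct ctx *ctx_by_id(unsigned int id)
{
	struct ctx *c;

	for (c = table[id % TABLE_SIZE]; c; c = c->next)
		if (c->id == id)
			return c;
	return NULL;
}

/* Hand out the next free non-zero id and link the context in. */
void ctx_hash_add(struct ctx *c)
{
	if (c->id)		/* already hashed, nothing to do */
		return;

	do {
		c->id = ++gen_id;	/* may wrap around to 0 */
		if (c->id && ctx_by_id(c->id))
			c->id = 0;	/* id is taken, try the next one */
	} while (!c->id);

	c->next = table[c->id % TABLE_SIZE];
	table[c->id % TABLE_SIZE] = c;
}

/* Unlink the context; the real code must also wait for an RCU grace period. */
void ctx_hash_del(struct ctx *c)
{
	struct ctx **pp;

	if (!c->id)
		return;

	for (pp = &table[c->id % TABLE_SIZE]; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			break;
		}
	}
	c->id = 0;
	c->next = NULL;
}
```

Because 0 is never handed out, a socket or skb that carries napi_id 0
can be treated as "no NAPI context known".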

include/linux/netdevice.h | 29 ++++++++++++++++++++++
net/core/dev.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8f967e3..39bbd46 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,12 +324,15 @@ struct napi_struct {
struct sk_buff *gro_list;
struct sk_buff *skb;
struct list_head dev_list;
+ struct hlist_node napi_hash_node;
+ unsigned int napi_id;
};

enum {
NAPI_STATE_SCHED, /* Poll is scheduled */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
+ NAPI_STATE_HASHED, /* In NAPI hash */
};

enum gro_result {
@@ -446,6 +449,32 @@ extern void __napi_complete(struct napi_struct *n);
extern void napi_complete(struct napi_struct *n);

/**
+ * napi_by_id - lookup a NAPI by napi_id
+ * @napi_id: hashed napi_id
+ *
+ * lookup @napi_id in napi_hash table
+ * must be called under rcu_read_lock()
+ */
+extern struct napi_struct *napi_by_id(unsigned int napi_id);
+
+/**
+ * napi_hash_add - add a NAPI to global hashtable
+ * @napi: napi context
+ *
+ * generate a new napi_id and store a @napi under it in napi_hash
+ */
+extern void napi_hash_add(struct napi_struct *napi);
+
+/**
+ * napi_hash_del - remove a NAPI from global table
+ * @napi: napi context
+ *
+ * Warning: caller must observe rcu grace period
+ * before freeing memory containing @napi
+ */
+extern void napi_hash_del(struct napi_struct *napi);
+
+/**
* napi_disable - prevent NAPI from scheduling
* @n: napi context
*
diff --git a/net/core/dev.c b/net/core/dev.c
index 9c18557..fa007db 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
#include <linux/inetdevice.h>
#include <linux/cpu_rmap.h>
#include <linux/static_key.h>
+#include <linux/hashtable.h>

#include "net-sysfs.h"

@@ -166,6 +167,12 @@ static struct list_head offload_base __read_mostly;
DEFINE_RWLOCK(dev_base_lock);
EXPORT_SYMBOL(dev_base_lock);

+/* protects napi_hash addition/deletion and napi_gen_id */
+static DEFINE_SPINLOCK(napi_hash_lock);
+
+static unsigned int napi_gen_id;
+static DEFINE_HASHTABLE(napi_hash, 8);
+
seqcount_t devnet_rename_seq;

static inline void dev_base_seq_inc(struct net *net)
@@ -4136,6 +4143,58 @@ void napi_complete(struct napi_struct *n)
}
EXPORT_SYMBOL(napi_complete);

+/* must be called under rcu_read_lock(), as we don't take a reference */
+struct napi_struct *napi_by_id(unsigned int napi_id)
+{
+ unsigned int hash = napi_id % HASH_SIZE(napi_hash);
+ struct napi_struct *napi;
+
+ hlist_for_each_entry_rcu(napi, &napi_hash[hash], napi_hash_node)
+ if (napi->napi_id == napi_id)
+ return napi;
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(napi_by_id);
+
+void napi_hash_add(struct napi_struct *napi)
+{
+ if (!test_and_set_bit(NAPI_STATE_HASHED, &napi->state)) {
+
+ spin_lock(&napi_hash_lock);
+
+ /* 0 is not a valid id, we also skip an id that is taken
+ * we expect both events to be extremely rare
+ */
+ napi->napi_id = 0;
+ while (!napi->napi_id) {
+ napi->napi_id = ++napi_gen_id;
+ if (napi_by_id(napi->napi_id))
+ napi->napi_id = 0;
+ }
+
+ hlist_add_head_rcu(&napi->napi_hash_node,
+ &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+
+ spin_unlock(&napi_hash_lock);
+ }
+}
+EXPORT_SYMBOL_GPL(napi_hash_add);
+
+/* Warning : caller is responsible to make sure rcu grace period
+ * is respected before freeing memory containing @napi
+ */
+void napi_hash_del(struct napi_struct *napi)
+{
+ spin_lock(&napi_hash_lock);
+
+ if (test_and_clear_bit(NAPI_STATE_HASHED, &napi->state))
+ hlist_del_rcu(&napi->napi_hash_node);
+
+ spin_unlock(&napi_hash_lock);
+}
+EXPORT_SYMBOL_GPL(napi_hash_del);
+
void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)
{

2013-06-10 08:40:00

Subject: [PATCH v10 net-next 2/6] net: add low latency socket poll

Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---
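Note for readers: the timeout arithmetic and the polling loop in this
patch can be illustrated with a small userspace model. It is a sketch
under stated assumptions, not the kernel code: the cycle counter is
simulated, and struct fake_dev, fake_poll() and busy_poll() are
hypothetical names. Only the "khz >> 10" approximation, the wrap-safe
comparison and the loop condition are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long cycles_t;

/* Cycles per microsecond, dividing by 1024 instead of 1000 (a bit low). */
static unsigned long approx_mhz(unsigned long khz)
{
	return khz >> 10;
}

/* Deadline in cycles for a poll budget given in microseconds. */
cycles_t poll_deadline(cycles_t now, unsigned long khz, unsigned long usecs)
{
	return approx_mhz(khz) * usecs + now;
}

/* Wrap-safe "a is later than b", in the style of the kernel's time_after(). */
bool cycles_after(cycles_t a, cycles_t b)
{
	return (long)(b - a) < 0;
}

bool can_poll(cycles_t now, cycles_t deadline)
{
	return !cycles_after(now, deadline);
}

/* Hypothetical device: one packet shows up at cycle 'arrival'. */
struct fake_dev {
	cycles_t clock;		/* simulated cycle counter */
	cycles_t arrival;	/* cycle at which the packet arrives */
	int queued;		/* packets on the socket receive queue */
};

/* Hypothetical driver hook: each call costs 100 simulated cycles. */
static int fake_poll(struct fake_dev *d)
{
	d->clock += 100;
	if (!d->queued && d->clock >= d->arrival) {
		d->queued = 1;
		return 1;	/* one packet delivered */
	}
	return 0;
}

/* Spin until data is queued, the deadline passes, or nonblock is set. */
bool busy_poll(struct fake_dev *d, unsigned long khz,
	       unsigned long usecs, bool nonblock)
{
	cycles_t deadline = poll_deadline(d->clock, khz, usecs);

	do {
		if (fake_poll(d) < 0)
			break;	/* models a permanent failure */
	} while (!d->queued && can_poll(d->clock, deadline) && !nonblock);

	return d->queued != 0;
}
```

With a 2.4 GHz counter (khz = 2400000) the shift yields 2343 cycles per
microsecond instead of 2400, so a 50 us budget becomes 117150 cycles,
roughly 48.8 us. That is the imprecision the TSC_MHZ comment accepts.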

Documentation/sysctl/net.txt | 7 ++
include/linux/netdevice.h | 3 +
include/linux/skbuff.h | 8 ++
include/net/ll_poll.h | 148 ++++++++++++++++++++++++++++++++++++++++++
include/net/sock.h | 4 +
include/uapi/linux/snmp.h | 1
net/Kconfig | 12 +++
net/core/skbuff.c | 4 +
net/core/sock.c | 6 ++
net/core/sysctl_net_core.c | 10 +++
net/ipv4/proc.c | 1
net/socket.c | 6 ++
12 files changed, 208 insertions(+), 2 deletions(-)
create mode 100644 include/net/ll_poll.h

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index c1f8640..85ab72d 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -50,6 +50,13 @@ The maximum number of packets that kernel can handle on a NAPI interrupt,
it's a Per-CPU variable.
Default: 64

+low_latency_poll
+----------------
+Low latency busy poll timeout. (needs CONFIG_NET_LL_RX_POLL)
+Approximate time in us to spin waiting for packets on the device queue.
+Recommended value is 50. May increase power usage.
+Default: 0 (off)
+
rmem_default
------------

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 39bbd46..2ecb96d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -972,6 +972,9 @@ struct net_device_ops {
gfp_t gfp);
void (*ndo_netpoll_cleanup)(struct net_device *dev);
#endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ int (*ndo_ll_poll)(struct napi_struct *dev);
+#endif
int (*ndo_set_vf_mac)(struct net_device *dev,
int queue, u8 *mac);
int (*ndo_set_vf_vlan)(struct net_device *dev,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9995834..400d82a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -386,6 +386,7 @@ typedef unsigned char *sk_buff_data_t;
* @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
* @dma_cookie: a cookie to one of several possible DMA operations
* done by skb DMA functions
+ * @napi_id: id of the NAPI struct this skb came from
* @secmark: security marking
* @mark: Generic packet mark
* @dropcount: total number of sk_receive_queue overflows
@@ -500,8 +501,11 @@ struct sk_buff {
/* 7/9 bit hole (depending on ndisc_nodetype presence) */
kmemcheck_bitfield_end(flags2);

-#ifdef CONFIG_NET_DMA
- dma_cookie_t dma_cookie;
+#if defined CONFIG_NET_DMA || defined CONFIG_NET_LL_RX_POLL
+ union {
+ unsigned int napi_id;
+ dma_cookie_t dma_cookie;
+ };
#endif
#ifdef CONFIG_NETWORK_SECMARK
__u32 secmark;
diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h
new file mode 100644
index 0000000..bc262f8
--- /dev/null
+++ b/include/net/ll_poll.h
@@ -0,0 +1,148 @@
+/*
+ * Low Latency Sockets
+ * Copyright(c) 2013 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * Author: Eliezer Tamir
+ *
+ * Contact Information:
+ * e1000-devel Mailing List <[email protected]>
+ */
+
+/*
+ * For now this depends on CONFIG_X86_TSC
+ */
+
+#ifndef _LINUX_NET_LL_POLL_H
+#define _LINUX_NET_LL_POLL_H
+
+#include <linux/netdevice.h>
+#include <net/ip.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+
+struct napi_struct;
+extern unsigned long sysctl_net_ll_poll __read_mostly;
+
+/* return values from ndo_ll_poll */
+#define LL_FLUSH_FAILED -1
+#define LL_FLUSH_BUSY -2
+
+/* we don't mind a ~2.5% imprecision */
+#define TSC_MHZ (tsc_khz >> 10)
+
+static inline cycles_t ll_end_time(void)
+{
+ return TSC_MHZ * ACCESS_ONCE(sysctl_net_ll_poll) + get_cycles();
+}
+
+static inline bool sk_valid_ll(struct sock *sk)
+{
+ return sysctl_net_ll_poll && sk->sk_napi_id &&
+ !need_resched() && !signal_pending(current);
+}
+
+static inline bool can_poll_ll(cycles_t end_time)
+{
+ return !time_after((unsigned long)get_cycles(),
+ (unsigned long)end_time);
+}
+
+static inline bool sk_poll_ll(struct sock *sk, int nonblock)
+{
+ cycles_t end_time = ll_end_time();
+ const struct net_device_ops *ops;
+ struct napi_struct *napi;
+ int rc = false;
+
+ /*
+ * rcu read lock for napi hash
+ * bh so we don't race with net_rx_action
+ */
+ rcu_read_lock_bh();
+
+ napi = napi_by_id(sk->sk_napi_id);
+ if (!napi)
+ goto out;
+
+ ops = napi->dev->netdev_ops;
+ if (!ops->ndo_ll_poll)
+ goto out;
+
+ do {
+
+ rc = ops->ndo_ll_poll(napi);
+
+ if (rc == LL_FLUSH_FAILED)
+ break; /* permanent failure */
+
+ if (rc > 0)
+ /* local bh are disabled so it is ok to use _BH */
+ NET_ADD_STATS_BH(sock_net(sk),
+ LINUX_MIB_LOWLATENCYRXPACKETS, rc);
+
+ } while (skb_queue_empty(&sk->sk_receive_queue)
+ && can_poll_ll(end_time) && !nonblock);
+
+ rc = !skb_queue_empty(&sk->sk_receive_queue);
+out:
+ rcu_read_unlock_bh();
+ return rc;
+}
+
+/* used in the NIC receive handler to mark the skb */
+static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi)
+{
+ skb->napi_id = napi->napi_id;
+}
+
+/* used in the protocol handler to propagate the napi_id to the socket */
+static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
+{
+ sk->sk_napi_id = skb->napi_id;
+}
+
+#else /* CONFIG_NET_LL_RX_POLL */
+
+static inline cycles_t ll_end_time(void)
+{
+ return 0;
+}
+
+static inline bool sk_valid_ll(struct sock *sk)
+{
+ return false;
+}
+
+static inline bool sk_poll_ll(struct sock *sk, int nonblock)
+{
+ return false;
+}
+
+static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi)
+{
+}
+
+static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
+{
+}
+
+static inline bool can_poll_ll(cycles_t end_time)
+{
+ return false;
+}
+
+#endif /* CONFIG_NET_LL_RX_POLL */
+#endif /* _LINUX_NET_LL_POLL_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 66772cf..ac8e181 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -229,6 +229,7 @@ struct cg_proto;
* @sk_omem_alloc: "o" is "option" or "other"
* @sk_wmem_queued: persistent queue size
* @sk_forward_alloc: space allocated forward
+ * @sk_napi_id: id of the last napi context to receive data for sk
* @sk_allocation: allocation mode
* @sk_sndbuf: size of send buffer in bytes
* @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
@@ -325,6 +326,9 @@ struct sock {
#ifdef CONFIG_RPS
__u32 sk_rxhash;
#endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ unsigned int sk_napi_id;
+#endif
atomic_t sk_drops;
int sk_rcvbuf;

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index df2e8b4..26cbf76 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -253,6 +253,7 @@ enum
LINUX_MIB_TCPFASTOPENLISTENOVERFLOW, /* TCPFastOpenListenOverflow */
LINUX_MIB_TCPFASTOPENCOOKIEREQD, /* TCPFastOpenCookieReqd */
LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES, /* TCPSpuriousRtxHostQueues */
+ LINUX_MIB_LOWLATENCYRXPACKETS, /* LowLatencyRxPackets */
__LINUX_MIB_MAX
};

diff --git a/net/Kconfig b/net/Kconfig
index 523e43e..d6a9ce6 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -243,6 +243,18 @@ config NETPRIO_CGROUP
Cgroup subsystem for use in assigning processes to network priorities on
a per-interface basis

+config NET_LL_RX_POLL
+ bool "Low Latency Receive Poll"
+ depends on X86_TSC
+ default n
+ ---help---
+ Support Low Latency Receive Queue Poll.
+ (For network card drivers which support this option.)
+ When waiting for data in read or poll, call directly into the device driver
+ to flush packets which may be pending on the device queues into the stack.
+
+ If unsure, say N.
+
config BQL
boolean
depends on SYSFS
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 73f57a0..4a4181e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -733,6 +733,10 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
new->vlan_tci = old->vlan_tci;

skb_copy_secmark(new, old);
+
+#ifdef CONFIG_NET_LL_RX_POLL
+ new->napi_id = old->napi_id;
+#endif
}

/*
diff --git a/net/core/sock.c b/net/core/sock.c
index 88868a9..788c0da 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -139,6 +139,8 @@
#include <net/tcp.h>
#endif

+#include <net/ll_poll.h>
+
static DEFINE_MUTEX(proto_list_mutex);
static LIST_HEAD(proto_list);

@@ -2284,6 +2286,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)

sk->sk_stamp = ktime_set(-1L, 0);

+#ifdef CONFIG_NET_LL_RX_POLL
+ sk->sk_napi_id = 0;
+#endif
+
/*
* Before updating sk_refcnt, we must commit prior changes to memory
* (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 741db5fc..4b48f39 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -19,6 +19,7 @@
#include <net/ip.h>
#include <net/sock.h>
#include <net/net_ratelimit.h>
+#include <net/ll_poll.h>

static int one = 1;

@@ -284,6 +285,15 @@ static struct ctl_table net_core_table[] = {
.proc_handler = flow_limit_table_len_sysctl
},
#endif /* CONFIG_NET_FLOW_LIMIT */
+#ifdef CONFIG_NET_LL_RX_POLL
+ {
+ .procname = "low_latency_poll",
+ .data = &sysctl_net_ll_poll,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax
+ },
+#endif
#endif /* CONFIG_NET */
{
.procname = "netdev_budget",
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 2a5bf86..6577a11 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -273,6 +273,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPFastOpenListenOverflow", LINUX_MIB_TCPFASTOPENLISTENOVERFLOW),
SNMP_MIB_ITEM("TCPFastOpenCookieReqd", LINUX_MIB_TCPFASTOPENCOOKIEREQD),
SNMP_MIB_ITEM("TCPSpuriousRtxHostQueues", LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES),
+ SNMP_MIB_ITEM("LowLatencyRxPackets", LINUX_MIB_LOWLATENCYRXPACKETS),
SNMP_MIB_SENTINEL
};

diff --git a/net/socket.c b/net/socket.c
index 3ebdcb8..21fd29f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -104,6 +104,12 @@
#include <linux/route.h>
#include <linux/sockios.h>
#include <linux/atalk.h>
+#include <net/ll_poll.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+unsigned long sysctl_net_ll_poll __read_mostly;
+EXPORT_SYMBOL_GPL(sysctl_net_ll_poll);
+#endif

static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,

2013-06-10 08:40:09

Subject: [PATCH v10 net-next 3/6] udp: add low latency socket poll support

Add support for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram(), when there is no data and the user
tries to read, we busy-poll.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---
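Note for readers: the way the napi_id travels from the receive context
to the socket can be sketched in userspace C as below. This is an
illustration under stated assumptions only: struct rx_ctx, struct packet
and struct socket_state are hypothetical stand-ins for napi_struct,
sk_buff and sock, and the validity check omits the need_resched() and
signal_pending() tests that sk_valid_ll() performs.

```c
#include <assert.h>
#include <stdbool.h>

struct rx_ctx       { unsigned int napi_id; };	/* stands in for napi_struct */
struct packet       { unsigned int napi_id; };	/* stands in for sk_buff */
struct socket_state { unsigned int napi_id; };	/* stands in for sock */

/* Driver receive path: remember which context produced the packet. */
void packet_mark(struct packet *p, const struct rx_ctx *rx)
{
	p->napi_id = rx->napi_id;
}

/* Protocol handler: remember the last context that fed this socket. */
void socket_mark(struct socket_state *s, const struct packet *p)
{
	s->napi_id = p->napi_id;
}

/* Busy-polling only makes sense if it is enabled and a context is known. */
bool socket_can_busy_poll(const struct socket_state *s,
			  unsigned long poll_usecs)
{
	return poll_usecs && s->napi_id;
}
```

A socket that has not yet received anything keeps napi_id 0, so the
first read on it falls back to the normal blocking path.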

net/core/datagram.c | 4 ++++
net/ipv4/udp.c | 6 +++++-
net/ipv6/udp.c | 6 +++++-
3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index b71423d..9cbaba9 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -56,6 +56,7 @@
#include <net/sock.h>
#include <net/tcp_states.h>
#include <trace/events/skb.h>
+#include <net/ll_poll.h>

/*
* Is a socket 'connection oriented' ?
@@ -207,6 +208,9 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
}
spin_unlock_irqrestore(&queue->lock, cpu_flags);

+ if (sk_valid_ll(sk) && sk_poll_ll(sk, flags & MSG_DONTWAIT))
+ continue;
+
/* User doesn't want to wait */
error = -EAGAIN;
if (!timeo)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c7338ec..2955b25 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -109,6 +109,7 @@
#include <trace/events/udp.h>
#include <linux/static_key.h>
#include <trace/events/skb.h>
+#include <net/ll_poll.h>
#include "udp_impl.h"

struct udp_table udp_table __read_mostly;
@@ -1709,7 +1710,10 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ sk_mark_ll(sk, skb);
+ ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);

/* a return value > 0 means to resubmit the input, but
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b580853..f77e34c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -46,6 +46,7 @@
#include <net/ip6_checksum.h>
#include <net/xfrm.h>
#include <net/inet6_hashtables.h>
+#include <net/ll_poll.h>

#include <linux/proc_fs.h>
#include <linux/seq_file.h>
@@ -841,7 +842,10 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
*/
sk = __udp6_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
if (sk != NULL) {
- int ret = udpv6_queue_rcv_skb(sk, skb);
+ int ret;
+
+ sk_mark_ll(sk, skb);
+ ret = udpv6_queue_rcv_skb(sk, skb);
sock_put(sk);

/* a return value > 0 means to resubmit the input, but

2013-06-10 08:40:21

Subject: [PATCH v10 net-next 4/6] tcp: add low latency socket poll support.

Adds low latency socket poll support for TCP.
In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
from the skb to the sk.
In tcp_recvmsg(), when there is no data in the socket we busy-poll.
This is a good example of how to add busy-poll support to more protocols.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---

net/ipv4/tcp.c | 5 +++++
net/ipv4/tcp_ipv4.c | 2 ++
net/ipv6/tcp_ipv6.c | 2 ++
3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bc42469..46ed9af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@

#include <asm/uaccess.h>
#include <asm/ioctls.h>
+#include <net/ll_poll.h>

int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;

@@ -1553,6 +1554,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
struct sk_buff *skb;
u32 urg_hole = 0;

+ if (sk_valid_ll(sk) && skb_queue_empty(&sk->sk_receive_queue)
+ && (sk->sk_state == TCP_ESTABLISHED))
+ sk_poll_ll(sk, nonblock);
+
lock_sock(sk);

err = -ENOTCONN;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 289039b4..1063bb8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
#include <net/netdma.h>
#include <net/secure_seq.h>
#include <net/tcp_memcontrol.h>
+#include <net/ll_poll.h>

#include <linux/inet.h>
#include <linux/ipv6.h>
@@ -1993,6 +1994,7 @@ process:
if (sk_filter(sk, skb))
goto discard_and_relse;

+ sk_mark_ll(sk, skb);
skb->dev = NULL;

bh_lock_sock_nested(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0a17ed9..5cffa5c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -63,6 +63,7 @@
#include <net/inet_common.h>
#include <net/secure_seq.h>
#include <net/tcp_memcontrol.h>
+#include <net/ll_poll.h>

#include <asm/uaccess.h>

@@ -1498,6 +1499,7 @@ process:
if (sk_filter(sk, skb))
goto discard_and_relse;

+ sk_mark_ll(sk, skb);
skb->dev = NULL;

bh_lock_sock_nested(sk);

2013-06-10 08:40:33

Subject: [PATCH v10 net-next 5/6] ixgbe: add support for ndo_ll_poll

Add the ixgbe driver code implementing ndo_ll_poll.
Adds ndo_ll_poll method and locking between it and the napi poll.
When receiving a packet we use skb_mark_ll to record the napi it came from.
Add each napi to the napi_hash right after netif_napi_add().

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---
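Note for readers: the ownership hand-off between NAPI and the socket
poll can be modelled as a small state machine. The sketch below is a
single-threaded userspace model under stated assumptions, not the driver
code: struct qv and the qv_*() helpers are hypothetical names, and the
spinlock and the WARN_ON() checks of the patch are left out. The state
values and transitions follow the IXGBE_QV_* definitions in the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	QV_IDLE       = 0,
	QV_NAPI       = 1,	/* NAPI owns the vector */
	QV_POLL       = 2,	/* socket poll owns the vector */
	QV_LOCKED     = QV_NAPI | QV_POLL,
	QV_NAPI_YIELD = 4,	/* NAPI tried and backed off */
	QV_POLL_YIELD = 8,	/* socket poll tried and backed off */
};

/* Hypothetical stand-in for the state field of ixgbe_q_vector. */
struct qv {
	unsigned int state;
};

/* NAPI side: take ownership, or record that NAPI had to yield. */
bool qv_lock_napi(struct qv *q)
{
	if (q->state & QV_LOCKED) {
		q->state |= QV_NAPI_YIELD;
		return false;
	}
	q->state = QV_NAPI;	/* earlier yield marks are dropped */
	return true;
}

/* Returns true if a socket poll tried to get in meanwhile. */
bool qv_unlock_napi(struct qv *q)
{
	bool contended = (q->state & QV_POLL_YIELD) != 0;

	q->state = QV_IDLE;
	return contended;
}

/* Socket side: take ownership, or record that the poll had to yield. */
bool qv_lock_poll(struct qv *q)
{
	if (q->state & QV_LOCKED) {
		q->state |= QV_POLL_YIELD;
		return false;
	}
	q->state |= QV_POLL;	/* yield marks are preserved */
	return true;
}

/* Returns true if another poller tried to get in meanwhile. */
bool qv_unlock_poll(struct qv *q)
{
	bool contended = (q->state & QV_POLL_YIELD) != 0;

	q->state = QV_IDLE;
	return contended;
}
```

Neither side ever spins on the other: the loser sets its yield bit and
returns at once, which is why ixgbe_poll() returns the full budget and
ndo_ll_poll returns LL_FLUSH_BUSY when the lock attempt fails.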

drivers/net/ethernet/intel/ixgbe/ixgbe.h | 120 +++++++++++++++++++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 2
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 63 +++++++++++--
3 files changed, 177 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index ca93238..e9d9862 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -52,6 +52,8 @@
#include <linux/dca.h>
#endif

+#include <net/ll_poll.h>
+
/* common prefix used by pr_<> macros */
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -356,9 +358,127 @@ struct ixgbe_q_vector {
struct rcu_head rcu; /* to avoid race with update stats on free */
char name[IFNAMSIZ + 9];

+#ifdef CONFIG_NET_LL_RX_POLL
+ unsigned int state;
+#define IXGBE_QV_STATE_IDLE 0
+#define IXGBE_QV_STATE_NAPI 1 /* NAPI owns this QV */
+#define IXGBE_QV_STATE_POLL 2 /* poll owns this QV */
+#define IXGBE_QV_LOCKED (IXGBE_QV_STATE_NAPI | IXGBE_QV_STATE_POLL)
+#define IXGBE_QV_STATE_NAPI_YIELD 4 /* NAPI yielded this QV */
+#define IXGBE_QV_STATE_POLL_YIELD 8 /* poll yielded this QV */
+#define IXGBE_QV_YIELD (IXGBE_QV_STATE_NAPI_YIELD | IXGBE_QV_STATE_POLL_YIELD)
+#define IXGBE_QV_USER_PEND (IXGBE_QV_STATE_POLL | IXGBE_QV_STATE_POLL_YIELD)
+ spinlock_t lock;
+#endif /* CONFIG_NET_LL_RX_POLL */
+
/* for dynamic allocation of rings associated with this q_vector */
struct ixgbe_ring ring[0] ____cacheline_internodealigned_in_smp;
};
+#ifdef CONFIG_NET_LL_RX_POLL
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+
+ spin_lock_init(&q_vector->lock);
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+}
+
+/* called from the device poll routine to get ownership of a q_vector */
+static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+ int rc = true;
+ spin_lock(&q_vector->lock);
+ if (q_vector->state & IXGBE_QV_LOCKED) {
+ WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
+ q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
+ rc = false;
+ } else
+ /* we don't care if someone yielded */
+ q_vector->state = IXGBE_QV_STATE_NAPI;
+ spin_unlock(&q_vector->lock);
+ return rc;
+}
+
+/* returns true if someone tried to get the qv while napi had it */
+static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+ int rc = false;
+ spin_lock(&q_vector->lock);
+ WARN_ON(q_vector->state & (IXGBE_QV_STATE_POLL |
+ IXGBE_QV_STATE_NAPI_YIELD));
+
+ if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+ rc = true;
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+ spin_unlock(&q_vector->lock);
+ return rc;
+}
+
+/* called from ixgbe_low_latency_poll() */
+static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+ int rc = true;
+ spin_lock_bh(&q_vector->lock);
+ if ((q_vector->state & IXGBE_QV_LOCKED)) {
+ q_vector->state |= IXGBE_QV_STATE_POLL_YIELD;
+ rc = false;
+ } else
+ /* preserve yield marks */
+ q_vector->state |= IXGBE_QV_STATE_POLL;
+ spin_unlock_bh(&q_vector->lock);
+ return rc;
+}
+
+/* returns true if someone tried to get the qv while it was locked */
+static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+ int rc = false;
+ spin_lock_bh(&q_vector->lock);
+ WARN_ON(q_vector->state & (IXGBE_QV_STATE_NAPI));
+
+ if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+ rc = true;
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+ spin_unlock_bh(&q_vector->lock);
+ return rc;
+}
+
+/* true if a socket is polling, even if it did not get the lock */
+static inline bool ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+ WARN_ON(!(q_vector->state & IXGBE_QV_LOCKED));
+ return q_vector->state & IXGBE_QV_USER_PEND;
+}
+#else /* CONFIG_NET_LL_RX_POLL */
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+}
+
+static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+ return true;
+}
+
+static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline bool ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
#ifdef CONFIG_IXGBE_HWMON

#define IXGBE_HWMON_TYPE_LOC 0
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index ef5f7a6..90b4e10 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -811,6 +811,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
/* initialize NAPI */
netif_napi_add(adapter->netdev, &q_vector->napi,
ixgbe_poll, 64);
+ napi_hash_add(&q_vector->napi);

/* tie q_vector and adapter together */
adapter->q_vector[v_idx] = q_vector;
@@ -931,6 +932,7 @@ static void ixgbe_free_q_vector(struct ixgbe_adapter *adapter, int v_idx)
adapter->rx_ring[ring->queue_index] = NULL;

adapter->q_vector[v_idx] = NULL;
+ napi_hash_del(&q_vector->napi);
netif_napi_del(&q_vector->napi);

/*
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d30fbdd..9a7dc40 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1504,7 +1504,9 @@ static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
{
struct ixgbe_adapter *adapter = q_vector->adapter;

- if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
+ if (ixgbe_qv_ll_polling(q_vector))
+ netif_receive_skb(skb);
+ else if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
napi_gro_receive(&q_vector->napi, skb);
else
netif_rx(skb);
@@ -1892,9 +1894,9 @@ dma_sync:
* expensive overhead for IOMMU access this provides a means of avoiding
* it by maintaining the mapping of the page to the syste.
*
- * Returns true if all work is completed without reaching budget
+ * Returns amount of work completed
**/
-static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
+static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
struct ixgbe_ring *rx_ring,
const int budget)
{
@@ -1976,6 +1978,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
}

#endif /* IXGBE_FCOE */
+ skb_mark_ll(skb, &q_vector->napi);
ixgbe_rx_skb(q_vector, skb);

/* update budget accounting */
@@ -1992,9 +1995,37 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
if (cleaned_count)
ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);

- return (total_rx_packets < budget);
+ return total_rx_packets;
}

+#ifdef CONFIG_NET_LL_RX_POLL
+/* must be called with local_bh_disable()d */
+static int ixgbe_low_latency_recv(struct napi_struct *napi)
+{
+ struct ixgbe_q_vector *q_vector =
+ container_of(napi, struct ixgbe_q_vector, napi);
+ struct ixgbe_adapter *adapter = q_vector->adapter;
+ struct ixgbe_ring *ring;
+ int found = 0;
+
+ if (test_bit(__IXGBE_DOWN, &adapter->state))
+ return LL_FLUSH_FAILED;
+
+ if (!ixgbe_qv_lock_poll(q_vector))
+ return LL_FLUSH_BUSY;
+
+ ixgbe_for_each_ring(ring, q_vector->rx) {
+ found = ixgbe_clean_rx_irq(q_vector, ring, 4);
+ if (found)
+ break;
+ }
+
+ ixgbe_qv_unlock_poll(q_vector);
+
+ return found;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
/**
* ixgbe_configure_msix - Configure MSI-X hardware
* @adapter: board private structure
@@ -2550,6 +2581,9 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
ixgbe_for_each_ring(ring, q_vector->tx)
clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);

+ if (!ixgbe_qv_lock_napi(q_vector))
+ return budget;
+
/* attempt to distribute budget to each queue fairly, but don't allow
* the budget to go below 1 because we'll exit polling */
if (q_vector->rx.count > 1)
@@ -2558,9 +2592,10 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
per_ring_budget = budget;

ixgbe_for_each_ring(ring, q_vector->rx)
- clean_complete &= ixgbe_clean_rx_irq(q_vector, ring,
- per_ring_budget);
+ clean_complete &= (ixgbe_clean_rx_irq(q_vector, ring,
+ per_ring_budget) < per_ring_budget);

+ ixgbe_qv_unlock_napi(q_vector);
/* If all work not completed, return budget and keep polling */
if (!clean_complete)
return budget;
@@ -3747,16 +3782,25 @@ static void ixgbe_napi_enable_all(struct ixgbe_adapter *adapter)
{
int q_idx;

- for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++)
+ for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
+ ixgbe_qv_init_lock(adapter->q_vector[q_idx]);
napi_enable(&adapter->q_vector[q_idx]->napi);
+ }
}

static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
{
int q_idx;

- for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++)
+ local_bh_disable(); /* for ixgbe_qv_lock_napi() */
+ for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
napi_disable(&adapter->q_vector[q_idx]->napi);
+ while (!ixgbe_qv_lock_napi(adapter->q_vector[q_idx])) {
+ pr_info("QV %d locked\n", q_idx);
+ mdelay(1);
+ }
+ }
+ local_bh_enable();
}

#ifdef CONFIG_IXGBE_DCB
@@ -7177,6 +7221,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = ixgbe_netpoll,
#endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ .ndo_ll_poll = ixgbe_low_latency_recv,
+#endif
#ifdef IXGBE_FCOE
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,

2013-06-10 08:40:46

Subject: [PATCH v10 net-next 6/6] ixgbe: add extra stats for ndo_ll_poll

Add additional statistics to the ixgbe driver for ndo_ll_poll.
These are defined under LL_EXTENDED_STATS.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---

drivers/net/ethernet/intel/ixgbe/ixgbe.h | 14 ++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 40 ++++++++++++++++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +++
3 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index e9d9862..fb098b4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -54,6 +54,9 @@

#include <net/ll_poll.h>

+#ifdef CONFIG_NET_LL_RX_POLL
+#define LL_EXTENDED_STATS
+#endif
/* common prefix used by pr_<> macros */
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -184,6 +187,11 @@ struct ixgbe_rx_buffer {
struct ixgbe_queue_stats {
u64 packets;
u64 bytes;
+#ifdef LL_EXTENDED_STATS
+ u64 yields;
+ u64 misses;
+ u64 cleaned;
+#endif /* LL_EXTENDED_STATS */
};

struct ixgbe_tx_queue_stats {
@@ -391,6 +399,9 @@ static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
rc = false;
+#ifdef LL_EXTENDED_STATS
+ q_vector->tx.ring->stats.yields++;
+#endif
} else
/* we don't care if someone yielded */
q_vector->state = IXGBE_QV_STATE_NAPI;
@@ -421,6 +432,9 @@ static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
if ((q_vector->state & IXGBE_QV_LOCKED)) {
q_vector->state |= IXGBE_QV_STATE_POLL_YIELD;
rc = false;
+#ifdef LL_EXTENDED_STATS
+ q_vector->rx.ring->stats.yields++;
+#endif
} else
/* preserve yield marks */
q_vector->state |= IXGBE_QV_STATE_POLL;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index d375472..24e2e7a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1054,6 +1054,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i] = 0;
data[i+1] = 0;
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = 0;
+ data[i+1] = 0;
+ data[i+2] = 0;
+ i += 3;
+#endif
continue;
}

@@ -1063,6 +1069,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i+1] = ring->stats.bytes;
} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = ring->stats.yields;
+ data[i+1] = ring->stats.misses;
+ data[i+2] = ring->stats.cleaned;
+ i += 3;
+#endif
}
for (j = 0; j < IXGBE_NUM_RX_QUEUES; j++) {
ring = adapter->rx_ring[j];
@@ -1070,6 +1082,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i] = 0;
data[i+1] = 0;
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = 0;
+ data[i+1] = 0;
+ data[i+2] = 0;
+ i += 3;
+#endif
continue;
}

@@ -1079,6 +1097,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i+1] = ring->stats.bytes;
} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = ring->stats.yields;
+ data[i+1] = ring->stats.misses;
+ data[i+2] = ring->stats.cleaned;
+ i += 3;
+#endif
}

for (j = 0; j < IXGBE_MAX_PACKET_BUFFERS; j++) {
@@ -1115,12 +1139,28 @@ static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
p += ETH_GSTRING_LEN;
sprintf(p, "tx_queue_%u_bytes", i);
p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+ sprintf(p, "tx_q_%u_napi_yield", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_q_%u_misses", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_q_%u_cleaned", i);
+ p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
}
for (i = 0; i < IXGBE_NUM_RX_QUEUES; i++) {
sprintf(p, "rx_queue_%u_packets", i);
p += ETH_GSTRING_LEN;
sprintf(p, "rx_queue_%u_bytes", i);
p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+ sprintf(p, "rx_q_%u_ll_poll_yield", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "rx_q_%u_misses", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "rx_q_%u_cleaned", i);
+ p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
}
for (i = 0; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
sprintf(p, "tx_pb_%u_pxon", i);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9a7dc40..047ebaa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2016,6 +2016,12 @@ static int ixgbe_low_latency_recv(struct napi_struct *napi)

ixgbe_for_each_ring(ring, q_vector->rx) {
found = ixgbe_clean_rx_irq(q_vector, ring, 4);
+#ifdef LL_EXTENDED_STATS
+ if (found)
+ ring->stats.cleaned += found;
+ else
+ ring->stats.misses++;
+#endif
if (found)
break;
}
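
The accounting added to ixgbe_low_latency_recv() above can be modelled in user space. The following is only a sketch: struct ring_model, clean_rx() and low_latency_recv_model() are hypothetical stand-ins for the driver's ring, ixgbe_clean_rx_irq() and ixgbe_low_latency_recv(), and the budget of 4 is taken from the hunk above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an RX ring and its extended stats. */
struct ring_model {
	int pending;       /* packets waiting on the ring */
	uint64_t misses;   /* polls that found nothing */
	uint64_t cleaned;  /* packets delivered by busy-polling */
};

/* Stand-in for ixgbe_clean_rx_irq(): take at most 'budget' packets. */
int clean_rx(struct ring_model *ring, int budget)
{
	int n = ring->pending < budget ? ring->pending : budget;

	ring->pending -= n;
	return n;
}

/* Poll each ring in turn and stop at the first ring that had data. */
int low_latency_recv_model(struct ring_model *rings, size_t nrings)
{
	int found = 0;
	size_t i;

	for (i = 0; i < nrings; i++) {
		found = clean_rx(&rings[i], 4);
		if (found) {
			rings[i].cleaned += (uint64_t)found;
			break;
		}
		rings[i].misses++;
	}
	return found;
}
```

In this model a ring that is polled while empty counts one miss per poll, so a high misses-to-cleaned ratio suggests the busy-poll budget is mostly being spent on empty queues.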

2013-06-10 09:22:51

Subject: Re: [PATCH v10 net-next 1/6] net: add napi_id and hash

On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote:
> Adds a napi_id and a hashing mechanism to lookup a napi by id.
> This will be used by subsequent patches to implement low latency
> Ethernet device polling.
> Based on a code sample by Eric Dumazet.
>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---
>
> include/linux/netdevice.h | 29 ++++++++++++++++++++++
> net/core/dev.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 88 insertions(+), 0 deletions(-)

Signed-off-by: Eric Dumazet <[email protected]>

2013-06-10 14:29:44

Subject: Re: [PATCH v10 net-next 2/6] net: add low latency socket poll

On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote:
> Adds an ndo_ll_poll method and the code that supports it.
> This method can be used by low latency applications to busy-poll
> Ethernet device queues directly from the socket code.
> sysctl_net_ll_poll controls how many microseconds to poll.
> Default is zero (disabled).
> Individual protocol support will be added by subsequent patches.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> Signed-off-by: Jesse Brandeburg <[email protected]>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---
>
> Documentation/sysctl/net.txt | 7 ++
> include/linux/netdevice.h | 3 +
> include/linux/skbuff.h | 8 ++
> include/net/ll_poll.h | 148 ++++++++++++++++++++++++++++++++++++++++++
> include/net/sock.h | 4 +
> include/uapi/linux/snmp.h | 1
> net/Kconfig | 12 +++
> net/core/skbuff.c | 4 +
> net/core/sock.c | 6 ++
> net/core/sysctl_net_core.c | 10 +++
> net/ipv4/proc.c | 1
> net/socket.c | 6 ++
> 12 files changed, 208 insertions(+), 2 deletions(-)
> create mode 100644 include/net/ll_poll.h

Acked-by: Eric Dumazet <[email protected]>

2013-06-10 14:31:17

Subject: Re: [PATCH v10 net-next 3/6] udp: add low latency socket poll support

On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
> Add support for busy-polling on UDP sockets.
> In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
> from the skb into the sk.
> This is done at the earliest possible moment, right after we identify
> which socket this skb is for.
> In __skb_recv_datagram(), when there is no data and the user
> tries to read, we busy-poll.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> Signed-off-by: Jesse Brandeburg <[email protected]>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---
>
> net/core/datagram.c | 4 ++++
> net/ipv4/udp.c | 6 +++++-
> net/ipv6/udp.c | 6 +++++-
> 3 files changed, 14 insertions(+), 2 deletions(-)

Acked-by: Eric Dumazet <[email protected]>

2013-06-10 14:32:37

Subject: Re: [PATCH v10 net-next 4/6] tcp: add low latency socket poll support.

On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
> Adds low latency socket poll support for TCP.
> In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
> from the skb to the sk.
> In tcp_recvmsg(), when there is no data in the socket we busy-poll.
> This is a good example of how to add busy-poll support to more protocols.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> Signed-off-by: Jesse Brandeburg <[email protected]>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---

Acked-by: Eric Dumazet <[email protected]>

2013-06-10 14:59:34

Subject: Re: [PATCH v10 net-next 5/6] ixgbe: add support for ndo_ll_poll

On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
> Add the ixgbe driver code implementing ndo_ll_poll.
> Adds ndo_ll_poll method and locking between it and the napi poll.
> When receiving a packet we use skb_mark_ll to record the napi it came from.
> Add each napi to the napi_hash right after netif_napi_add().
>
> Signed-off-by: Alexander Duyck <[email protected]>
> Signed-off-by: Jesse Brandeburg <[email protected]>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---

Reviewed-by: Eric Dumazet <[email protected]>

2013-06-10 16:37:10

Subject: Re: [PATCH v10 net-next 2/6] net: add low latency socket poll

On Mon, Jun 10, 2013 at 10:29 AM, Eric Dumazet <[email protected]> wrote:
> On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote:
>> Adds an ndo_ll_poll method and the code that supports it.
>> This method can be used by low latency applications to busy-poll
>> Ethernet device queues directly from the socket code.
>> sysctl_net_ll_poll controls how many microseconds to poll.
>> Default is zero (disabled).
>> Individual protocol support will be added by subsequent patches.
>>
>> Signed-off-by: Alexander Duyck <[email protected]>
>> Signed-off-by: Jesse Brandeburg <[email protected]>
>> Signed-off-by: Eliezer Tamir <[email protected]>
>> ---
>>
>> Documentation/sysctl/net.txt | 7 ++
>> include/linux/netdevice.h | 3 +
>> include/linux/skbuff.h | 8 ++
>> include/net/ll_poll.h | 148 ++++++++++++++++++++++++++++++++++++++++++
>> include/net/sock.h | 4 +
>> include/uapi/linux/snmp.h | 1
>> net/Kconfig | 12 +++
>> net/core/skbuff.c | 4 +
>> net/core/sock.c | 6 ++
>> net/core/sysctl_net_core.c | 10 +++
>> net/ipv4/proc.c | 1
>> net/socket.c | 6 ++
>> 12 files changed, 208 insertions(+), 2 deletions(-)
>> create mode 100644 include/net/ll_poll.h
>
> Acked-by: Eric Dumazet <[email protected]>
>

Tested-by: Willem de Bruijn <[email protected]>

Per Eliezer's request, I applied v10 to a small set of workloads
(netperf tcp_rr, udp_rr, 100x tcp_rr, 100x udp_rr and a poll()-based
tcp rr) for some functional testing.

2013-06-10 16:38:32

Subject: Re: [PATCH v10 net-next 1/6] net: add napi_id and hash

On Mon, Jun 10, 2013 at 5:22 AM, Eric Dumazet <[email protected]> wrote:
> On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote:
>> Adds a napi_id and a hashing mechanism to lookup a napi by id.
>> This will be used by subsequent patches to implement low latency
>> Ethernet device polling.
>> Based on a code sample by Eric Dumazet.
>>
>> Signed-off-by: Eliezer Tamir <[email protected]>
>> ---
>>
>> include/linux/netdevice.h | 29 ++++++++++++++++++++++
>> net/core/dev.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 88 insertions(+), 0 deletions(-)
>
> Signed-off-by: Eric Dumazet <[email protected]>
>

Tested-by: Willem de Bruijn <[email protected]>

(per Eliezer's request)

2013-06-10 16:39:00

Subject: Re: [PATCH v10 net-next 3/6] udp: add low latency socket poll support

On Mon, Jun 10, 2013 at 10:31 AM, Eric Dumazet <[email protected]> wrote:
> On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
>> Add support for busy-polling on UDP sockets.
>> In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
>> from the skb into the sk.
>> This is done at the earliest possible moment, right after we identify
>> which socket this skb is for.
>> In __skb_recv_datagram(), when there is no data and the user
>> tries to read, we busy-poll.
>>
>> Signed-off-by: Alexander Duyck <[email protected]>
>> Signed-off-by: Jesse Brandeburg <[email protected]>
>> Signed-off-by: Eliezer Tamir <[email protected]>
>> ---
>>
>> net/core/datagram.c | 4 ++++
>> net/ipv4/udp.c | 6 +++++-
>> net/ipv6/udp.c | 6 +++++-
>> 3 files changed, 14 insertions(+), 2 deletions(-)
>
> Acked-by: Eric Dumazet <[email protected]>
>

Tested-by: Willem de Bruijn <[email protected]>

(per Eliezer's request)

2013-06-10 16:39:08

Subject: Re: [PATCH v10 net-next 4/6] tcp: add low latency socket poll support.

On Mon, Jun 10, 2013 at 10:32 AM, Eric Dumazet <[email protected]> wrote:
> On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
>> Adds low latency socket poll support for TCP.
>> In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
>> from the skb to the sk.
>> In tcp_recvmsg(), when there is no data in the socket we busy-poll.
>> This is a good example of how to add busy-poll support to more protocols.
>>
>> Signed-off-by: Alexander Duyck <[email protected]>
>> Signed-off-by: Jesse Brandeburg <[email protected]>
>> Signed-off-by: Eliezer Tamir <[email protected]>
>> ---
>
> Acked-by: Eric Dumazet <[email protected]>
>

Tested-by: Willem de Bruijn <[email protected]>

(per Eliezer's request)

2013-06-10 20:41:19

by David Miller

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

From: Eliezer Tamir <[email protected]>
Date: Mon, 10 Jun 2013 11:39:30 +0300

> I removed the select/poll patch (was 5/7 in v9) from the set.
> The rest are the same patches that were in v9.
>
> Please consider applying.
>
> Thanks to everyone for their input.

There used to be a really nice, detailed and verbose, description of
the goals and general idea of these changes, along with a lot of
benchmark data.

Now I don't see it, either here in this posting, or in any of the
patch commit messages.

Don't get rid of stuff like that. For a set of changes of this
magnitude you can basically consider such detailed descriptions
and information mandatory.

Reply to this email with some text to put in the merge commit,
including basic benchmark results, so that I can apply this series.

Thanks.

2013-06-11 02:25:50

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

On 10/06/2013 23:41, David Miller wrote:
> From: Eliezer Tamir <[email protected]>
> Date: Mon, 10 Jun 2013 11:39:30 +0300
>
>> I removed the select/poll patch (was 5/7 in v9) from the set.
>> The rest are the same patches that were in v9.
>
> Reply to this email with some text to put in the merge commit,
> including basic benchmark results, so that I can apply this series.
>
Sorry,

Here is the text from the RFC and v2 cover letters, updated and merged.
If this is too long, please tell me what you think should be removed.

Thanks,
Eliezer

---

This patch set adds the ability for the socket layer code to
poll directly on an Ethernet device's RX queue.
This eliminates the cost of the interrupt and context switch
and with proper tuning allows us to get very close to the HW latency.

This is a follow-up to Jesse Brandeburg's Kernel Plumbers talk from last year:
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:
             setup                    TCP_RR               UDP_RR
kernel   Config     C3/6  rx-usecs    tps   cpu%  S.dem    tps   cpu%  S.dem
patched  optimized  on    100         87k   3.13  11.4     94k   3.17  10.7
patched  optimized  on    0           71k   3.12  14.0     84k   3.19  12.0
patched  optimized  on    adaptive    80k   3.13  12.5     90k   3.46  12.2
patched  typical    on    100         72    3.13  14.0     79k   3.17  12.8
patched  typical    on    0           60k   2.13  16.5     71k   3.18  14.0
patched  typical    on    adaptive    67k   3.51  16.7     75k   3.36  14.5
3.9      optimized  on    adaptive    25k   1.0   12.7     28k   0.98  11.2
3.9      typical    off   0           48k   1.09  7.3      52k   1.11  4.18
3.9      typical    off   adaptive    35k   1.12  4.08     38k   0.65  5.49
3.9      optimized  off   adaptive    40k   0.82  4.83     43k   0.70  5.23
3.9      optimized  off   0           57k   1.17  4.08     62k   1.04  3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive, 100 us
When C3/6 states were turned on (via BIOS) the performance governor was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.

Design:
A global hash table that allows us to look up a struct napi by a unique
id was added.

A napi_id field was added both to struct sk_buff and struct sk.
This is used to track which NAPI we need to poll for a specific socket.

The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the
protocol handler.
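
The id flow described above can be sketched in user space as follows. This is a simplified model, not the code in net/core/dev.c: the *_model types and model_* functions are hypothetical, the table is a plain chained hash, and there is no RCU or locking, which the real napi_hash requires.

```c
#include <stddef.h>

#define NAPI_HASH_SIZE 256u   /* assumed table size for this model */

struct napi_model {
	unsigned int napi_id;     /* 0 means "not in the hash" */
	struct napi_model *next;  /* bucket chain */
};

struct skb_model  { unsigned int napi_id; };
struct sock_model { unsigned int napi_id; };

static struct napi_model *napi_table[NAPI_HASH_SIZE];
static unsigned int napi_gen_id;

/* Look up a NAPI context by id; returns NULL when the id is unknown. */
struct napi_model *model_napi_by_id(unsigned int id)
{
	struct napi_model *n;

	for (n = napi_table[id % NAPI_HASH_SIZE]; n; n = n->next)
		if (n->napi_id == id)
			return n;
	return NULL;
}

/* Assign a fresh non-zero id and insert the context into the table. */
void model_napi_hash_add(struct napi_model *napi)
{
	struct napi_model **bucket;

	if (napi->napi_id)
		return;                       /* already hashed */
	do {
		if (++napi_gen_id == 0)       /* skip 0 on wrap-around */
			napi_gen_id = 1;
	} while (model_napi_by_id(napi_gen_id));

	napi->napi_id = napi_gen_id;
	bucket = &napi_table[napi->napi_id % NAPI_HASH_SIZE];
	napi->next = *bucket;
	*bucket = napi;
}

/* Unlink the context; the real code also needs an RCU grace period. */
void model_napi_hash_del(struct napi_model *napi)
{
	struct napi_model **pp = &napi_table[napi->napi_id % NAPI_HASH_SIZE];

	if (!napi->napi_id)
		return;
	while (*pp && *pp != napi)
		pp = &(*pp)->next;
	if (*pp)
		*pp = napi->next;
	napi->napi_id = 0;
	napi->next = NULL;
}

/* Driver side: stamp the skb with the NAPI it arrived on. */
void model_skb_mark_ll(struct skb_model *skb, const struct napi_model *napi)
{
	skb->napi_id = napi->napi_id;
}

/* Protocol side: copy the id from the skb to the socket. */
void model_sk_mark_ll(struct sock_model *sk, const struct skb_model *skb)
{
	sk->napi_id = skb->napi_id;
}
```

With the id stored on the socket, a later read on an empty queue can call model_napi_by_id(sk.napi_id) to find which device queue to poll.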

When the socket code does not find any more data on the socket queue,
it now may call ndo_ll_poll which will crank the device's rx queue and
feed incoming packets to the stack directly from the context of the
socket.

A sysctl value (net.core.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (setting to 0 globally
disables busy-polling)
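
A bounded busy-wait of this kind can be sketched as below. The names busy_loop_model(), ll_poll_fn and fake_queue are hypothetical, and the standard C clock() function is used as a stand-in time source; this is not the kernel loop, only an illustration of "poll until data arrives or the budget in microseconds runs out, and never poll when the budget is 0".

```c
#include <stdbool.h>
#include <time.h>

/* Returns the number of packets found, in the spirit of ndo_ll_poll. */
typedef int (*ll_poll_fn)(void *queue);

/*
 * Poll until data shows up or budget_usecs has elapsed.
 * A budget of 0 disables busy-polling entirely.
 */
bool busy_loop_model(ll_poll_fn poll, void *queue, unsigned int budget_usecs)
{
	clock_t deadline;

	if (budget_usecs == 0)
		return false;

	/* clock() counts CPU time, which advances while we spin. */
	deadline = clock() +
		(clock_t)((double)budget_usecs * CLOCKS_PER_SEC / 1e6);
	do {
		if (poll(queue) > 0)
			return true;      /* data arrived: stop spinning */
	} while (clock() < deadline);

	return false;                 /* budget exhausted: caller sleeps */
}

/* Fake device queue: data appears after a number of empty polls. */
struct fake_queue {
	int empty_polls_left;
	int polls;
};

int fake_ll_poll(void *queue)
{
	struct fake_queue *q = queue;

	q->polls++;
	if (q->empty_polls_left > 0) {
		q->empty_polls_left--;
		return 0;
	}
	return 1;
}
```

When the function returns false the caller would fall back to the normal blocking path, which is why a small budget costs little when traffic is sparse.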

Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll is highly device / configuration dependent, we do this
inside the Ethernet driver.
For example, when packets for high priority connections are sent to
separate rx queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(earlier versions of this patchset did touch them,
but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU) napi poll
will not service it, but will wait until the queue can be locked
and cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it,
either from napi or from another socket polling on the queue,
the socket code can busy wait on the socket's skb queue.

ndo_ll_poll does not have preferential treatment for the data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.
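
The per-queue ownership rule described in this section can be modelled as a small state machine. The sketch below follows the state names visible in the ixgbe.h hunks of patch 6 (NAPI, POLL, NAPI_YIELD, POLL_YIELD, LOCKED), but everything else is an assumption: the struct qv_model type, the atomic_flag guard standing in for the driver's own lock, and the unlock helpers, which are not shown in this posting.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
	QV_STATE_IDLE       = 0,
	QV_STATE_NAPI       = 1,  /* NAPI poll owns the queue */
	QV_STATE_POLL       = 2,  /* a socket busy-poll owns the queue */
	QV_LOCKED           = QV_STATE_NAPI | QV_STATE_POLL,
	QV_STATE_NAPI_YIELD = 4,  /* NAPI tried to lock and gave up */
	QV_STATE_POLL_YIELD = 8,  /* a socket tried to lock and gave up */
};

struct qv_model {
	atomic_flag guard;          /* stand-in for the driver's lock */
	unsigned int state;
	unsigned long napi_yields;
	unsigned long poll_yields;
};

static void guard_lock(struct qv_model *qv)
{
	while (atomic_flag_test_and_set_explicit(&qv->guard,
						 memory_order_acquire))
		;                       /* spin until the guard is free */
}

static void guard_unlock(struct qv_model *qv)
{
	atomic_flag_clear_explicit(&qv->guard, memory_order_release);
}

void qv_init(struct qv_model *qv)
{
	atomic_flag_clear(&qv->guard);
	qv->state = QV_STATE_IDLE;
	qv->napi_yields = 0;
	qv->poll_yields = 0;
}

/* Called from the NAPI side; false means a socket holds the queue. */
bool qv_lock_napi(struct qv_model *qv)
{
	bool rc = true;

	guard_lock(qv);
	if (qv->state & QV_LOCKED) {
		qv->state |= QV_STATE_NAPI_YIELD;
		qv->napi_yields++;
		rc = false;
	} else {
		qv->state = QV_STATE_NAPI;  /* don't care if someone yielded */
	}
	guard_unlock(qv);
	return rc;
}

/* Called from the socket side; false means the queue is busy. */
bool qv_lock_poll(struct qv_model *qv)
{
	bool rc = true;

	guard_lock(qv);
	if (qv->state & QV_LOCKED) {
		qv->state |= QV_STATE_POLL_YIELD;
		qv->poll_yields++;
		rc = false;
	} else {
		qv->state |= QV_STATE_POLL; /* preserve yield marks */
	}
	guard_unlock(qv);
	return rc;
}

/* Assumed helper: release after NAPI; true if a socket yielded meanwhile. */
bool qv_unlock_napi(struct qv_model *qv)
{
	bool yielded;

	guard_lock(qv);
	yielded = (qv->state & QV_STATE_POLL_YIELD) != 0;
	qv->state = QV_STATE_IDLE;
	guard_unlock(qv);
	return yielded;
}

/* Assumed helper: release after busy-poll; true if NAPI yielded meanwhile. */
bool qv_unlock_poll(struct qv_model *qv)
{
	bool yielded;

	guard_lock(qv);
	yielded = (qv->state & QV_STATE_NAPI_YIELD) != 0;
	qv->state = QV_STATE_IDLE;
	guard_unlock(qv);
	return yielded;
}
```

Neither side ever blocks on the queue: the loser records a yield and returns, which matches the description above where NAPI retries later and a socket falls back to watching its own skb queue.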

2. napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), caller must take care to wait an rcu
grace period before freeing the memory containing the napi struct.
(Ixgbe already had this because the queue vector structure uses rcu to
protect the statistics counters in it.)

How to test:

1. The patchset should apply cleanly to net-next.
(don't forget to configure NET_LL_RX_POLL).

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO
(You are encouraged to try it both ways. If you find that your workload
does better with GRO on, do tell us.)

4. Sysctl value net.core.low_latency_poll controls how long
(in us) to busy-wait for more data. You are encouraged to play
with this and see what works for you. The default is now 0, so you need to
set it to turn the feature on. I recommend a value around 50.

5. The benchmark thread and the IRQ should be bound to separate cores.
Both cores should be on the same CPU NUMA node as the NIC.
When the app and the IRQ run on the same CPU you get a small penalty.
If interrupt coalescing is set to a low value this penalty can be very
large.

6. If you suspect that your machine is not configured properly,
use numademo to make sure that the CPU to memory BW is OK.
numademo 128m memcpy local copy numbers should be more than
8GB/s on a properly configured machine.
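
For the sysctl step above, a test harness could set and read back the busy-poll budget with a small helper like the one below. This is only a sketch: set_ll_poll_usecs() and get_ll_poll_usecs() are hypothetical names, and the path /proc/sys/net/core/low_latency_poll is an assumption based on the usual mapping of sysctl names to files under /proc/sys.

```c
#include <stdio.h>

/* Write a budget in microseconds to a sysctl file. 0 on success, -1 on error. */
int set_ll_poll_usecs(const char *path, unsigned int usecs)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", usecs) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read the current budget back. 0 on success, -1 on error. */
int get_ll_poll_usecs(const char *path, unsigned int *usecs)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%u", usecs) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

On a patched kernel, and with sufficient privileges, calling set_ll_poll_usecs("/proc/sys/net/core/low_latency_poll", 50) would apply the value recommended above; writing 0 turns busy-polling off again.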

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood
Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
---

2013-06-11 04:24:48

by David Miller

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

From: Eliezer Tamir <[email protected]>
Date: Tue, 11 Jun 2013 05:25:42 +0300

> Here is the text from the RFC and v2 cover letters, updated and
> merged. If this is too long, please tell me what you think should
> be removed.

It's perfect, and since this went through so many iterations I
included the changelog too.

Thanks!

2013-06-11 06:49:42

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

On 11/06/2013 07:24, David Miller wrote:
> From: Eliezer Tamir <[email protected]>
> Date: Tue, 11 Jun 2013 05:25:42 +0300
>
>> Here is the text from the RFC and v2 cover letters, updated and
>> merged. If this is too long, please tell me what you think should
>> be removed.
>
> It's perfect, and since this went through so many iterations I
> included the changelog too.
>

Thank you.

I would like to hear opinions on what needs to be added to make this
feature complete.

The list I have so far is:
1. add a socket option
2. support for poll/select
3. support for epoll

Also, would you accept a trailing whitespace cleanup patch for
fs/select.c?

-Eliezer

2013-06-11 07:32:52

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

On Tue, 2013-06-11 at 09:49 +0300, Eliezer Tamir wrote:

> I would like to hear opinions on what needs to be added to make this
> feature complete.
>
> The list I have so far is:
> 1. add a socket option

Yes, please. I do not believe all sockets on the machine are candidates
for low latency. In fact very few of them should be, depending on the
number of CPUs and/or RX queues.

> 2. support for poll/select

As long as the cost of llpoll is bounded per poll()/select() call it
will be ok.

> 3. support for epoll

For this one, I honestly do not know how to proceed.

The epoll edge-triggered model is driven by wakeup events.

The wakeups come from frames being delivered by the NIC (for UDP/TCP
sockets)

If epoll_wait() has to scan the list of epitems to be able to perform the
llpoll callback, it will be too slow: we are back to the poll() model,
with O(N) execution time.

Ideally we would call llpoll not from tcp_poll(), but
right before putting the current thread in wait mode.

>
> Also, would you accept a trailing whitespace cleanup patch for
> fs/select.c?

This has to be submitted to lkml

2013-06-11 08:14:33

by David Miller

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

From: Eliezer Tamir <[email protected]>
Date: Tue, 11 Jun 2013 09:49:31 +0300

> I would like to hear opinions on what needs to be added to make this
> feature complete.
>
> The list I have so far is:
> 1. add a socket option
> 2. support for poll/select
> 3. support for epoll

I actually would like to see the Kconfig option go away, that's
my only request.

> Also, would you accept a trailing whitespace cleanup patch for
> fs/select.c?

That's not really for my tree, sorry.

2013-06-11 09:29:20

Subject: Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

On 11/06/2013 10:32, Eric Dumazet wrote:
> On Tue, 2013-06-11 at 09:49 +0300, Eliezer Tamir wrote:
>
>> I would like to hear opinions on what needs to be added to make this
>> feature complete.
>>
>> The list I have so far is:
>> 1. add a socket option
>
> Yes, please. I do not believe all sockets on the machine are candidates
> for low latency. In fact very few of them should be, depending on the
> number of CPUs and/or RX queues.

I have a patch for that, along with a patch for sockperf that I will use
for testing.
Once I have tested it some more, I will send it in.

>> 3. support for epoll
>
> For this one, I honestly do not know how to proceed.
>
> The epoll edge-triggered model is driven by wakeup events.
>
> The wakeups come from frames being delivered by the NIC (for UDP/TCP
> sockets)
>
> If epoll_wait() has to scan the list of epitems to be able to perform the
> llpoll callback, it will be too slow: we are back to the poll() model,
> with O(N) execution time.
>
> Ideally we would call llpoll not from tcp_poll(), but
> right before putting the current thread in wait mode.

We have a few ideas. I will do a POC and see if any of them actually
work.

One thing that would really help is information about the use cases that
people care about:

The number and type of sockets, and how active they are.
How many active Ethernet ports there are.
Whether bulk and low latency traffic can be steered to separate cores or
separated in any other way.

Thanks,
Eliezer

2013-06-11 12:49:39