2013-05-27 07:43:58

by Eliezer Tamir

Subject: [PATCH v5 net-next 0/5] net: low latency Ethernet device polling

Hello Dave,

There are many small changes since the last version.
The two big changes are:
* Skb and sk now store a napi_id instead of a pointer.
* Very naive poll/select support. There is a dramatic improvement in both
latency and jitter, but clearly more work needs to be done here.

Please consider applying.

Some rough poll/select results:
Using the optimized kernel from v2,
testing with the sysctl value set to 50 vs. 0
sockperf using poll on 10 UDP sockets: 7.0us vs. 51.9us
sockperf using select on 10 UDP sockets: 7.2us vs. 51.8us
sockperf using poll on 10 TCP sockets: 7.1us vs. 53.2us
sockperf using select on 10 TCP sockets: 7.4us vs. 52.8us

Note to anyone doing testing: the sysctl value has moved to net.core
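For testers, a minimal userspace sketch of setting the knob programmatically. The path below is an assumption derived from "net.core" plus the procname "low_latency_poll" registered in patch 2/5; the helper names ll_poll_write() and ll_poll_read() are hypothetical and not part of this series. Rejecting negative values is a choice of the sketch, not something the patch enforces.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed location of the knob: net.core + procname "low_latency_poll". */
#define LL_POLL_SYSCTL "/proc/sys/net/core/low_latency_poll"

/*
 * Write a busy-poll timeout (in microseconds) to a sysctl-style file.
 * Returns 0 on success, -1 on failure. 0 turns busy polling off.
 */
static int ll_poll_write(const char *path, int usecs)
{
	FILE *f;
	int ok;

	assert(usecs >= 0);	/* sketch-only restriction */

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", usecs) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/* Read the current value back. Returns -1 if it cannot be read. */
static int ll_poll_read(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);

	return val;
}
```

On a kernel built with CONFIG_NET_LL_RX_POLL, calling ll_poll_write(LL_POLL_SYSCTL, 50) as root would select the value used for the measurements above, and ll_poll_write(LL_POLL_SYSCTL, 0) would restore the default.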

change log
v5
- corrections suggested by Ben Hutchings:
fixed typos, moved the config option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id.
based on code sample from Eric Dumazet
Note that ixgbe_free_q_vector() already waits an rcu grace period
before freeing the q_vector, so nothing additional needs to be done
when adding a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP as suggested by Eric Dumazet.
- added linux mib counter for packets received through the low latency path,
as suggested by Andi Kleen.
- re-allow module unloading, remove module param, use a global generation id
instead to prevent the use of a stale napi pointer, as suggested
by Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds. The default value is now 0 (off).
- for now the code depends at configure time on CONFIG_X86_TSC
- the napi reference in struct skb is now a union with the dma cookie
since the former is only used on RX and the latter on TX,
as suggested by Eric Dumazet.
- we do a better job at honoring non-blocking operations.
- removed busy-polling support for tcp_read_sock()
- remove dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood
Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen

Thanks,
Eliezer


2013-05-27 07:44:13

by Eliezer Tamir

Subject: [PATCH v5 net-next 1/5] net: add napi_id and hash

Adds a napi_id and a hashing mechanism to look up a NAPI by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.

Signed-off-by: Eliezer Tamir <[email protected]>
---

include/linux/netdevice.h | 29 +++++++++++++++++++++++++++++
net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ea7b6bc..d1ec8b1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -324,12 +324,15 @@ struct napi_struct {
struct sk_buff *gro_list;
struct sk_buff *skb;
struct list_head dev_list;
+ struct hlist_node napi_hash_node;
+ int napi_id;
};

enum {
NAPI_STATE_SCHED, /* Poll is scheduled */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
+ NAPI_STATE_HASHED, /* In NAPI hash */
};

enum gro_result {
@@ -446,6 +449,32 @@ extern void __napi_complete(struct napi_struct *n);
extern void napi_complete(struct napi_struct *n);

/**
+ * napi_hash_add - add a NAPI to global hashtable
+ * @napi: napi context
+ *
+ * generate a new napi_id and store a @napi under it in napi_hash
+ */
+extern void napi_hash_add(struct napi_struct *napi);
+
+/**
+ * napi_hash_del - remove a NAPI from global table
+ * @napi: napi context
+ *
+ * Warning: caller must observe rcu grace period
+ * before freeing memory containing @napi
+ */
+extern void napi_hash_del(struct napi_struct *napi);
+
+/**
+ * napi_by_id - lookup a NAPI by napi_id
+ * @napi_id: hashed napi_id
+ *
+ * lookup napi_id in napi_hash table
+ * must be called under rcu_read_lock()
+ */
+extern struct napi_struct *napi_by_id(int napi_id);
+
+/**
* napi_disable - prevent NAPI from scheduling
* @n: napi context
*
diff --git a/net/core/dev.c b/net/core/dev.c
index 50c02de..283ab14 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
#include <linux/inetdevice.h>
#include <linux/cpu_rmap.h>
#include <linux/static_key.h>
+#include <linux/hashtable.h>

#include "net-sysfs.h"

@@ -166,6 +167,10 @@ static struct list_head offload_base __read_mostly;
DEFINE_RWLOCK(dev_base_lock);
EXPORT_SYMBOL(dev_base_lock);

+atomic_t napi_gen_id;
+
+DEFINE_HASHTABLE(napi_hash, 8);
+
seqcount_t devnet_rename_seq;

static inline void dev_base_seq_inc(struct net *net)
@@ -4113,6 +4118,45 @@ void napi_complete(struct napi_struct *n)
}
EXPORT_SYMBOL(napi_complete);

+void napi_hash_add(struct napi_struct *napi)
+{
+ if (!test_and_set_bit(NAPI_STATE_HASHED, &napi->state)) {
+
+ /* 0 is not a valid id */
+ napi->napi_id = 0;
+ while (!napi->napi_id)
+ napi->napi_id = atomic_inc_return(&napi_gen_id);
+
+ hlist_add_head_rcu(&napi->napi_hash_node,
+ &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
+ }
+}
+EXPORT_SYMBOL_GPL(napi_hash_add);
+
+/* Warning : caller is responsible to make sure rcu grace period
+ * is respected before freeing memory containing @napi
+ */
+void napi_hash_del(struct napi_struct *napi)
+{
+ if (test_and_clear_bit(NAPI_STATE_HASHED, &napi->state))
+ hlist_del_rcu(&napi->napi_hash_node);
+}
+EXPORT_SYMBOL_GPL(napi_hash_del);
+
+/* must be called under rcu_read_lock(), as we don't take a reference */
+struct napi_struct *napi_by_id(int napi_id)
+{
+ unsigned int hash = napi_id % HASH_SIZE(napi_hash);
+ struct napi_struct *napi;
+
+ hlist_for_each_entry_rcu(napi, &napi_hash[hash], napi_hash_node)
+ if (napi->napi_id == napi_id)
+ return napi;
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(napi_by_id);
+
void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)
{
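The id allocation and lookup added above can be modelled in userspace as follows. This is only a sketch: it uses plain singly linked buckets instead of RCU hlists, a non-atomic counter instead of atomic_inc_return(), and an unsigned id; all names ending in _model are hypothetical. The 256 buckets follow from DEFINE_HASHTABLE(napi_hash, 8).

```c
#include <assert.h>
#include <stddef.h>

#define NAPI_HASH_BITS 8
#define NAPI_HASH_SIZE (1u << NAPI_HASH_BITS)	/* 256 buckets */

struct napi_model {
	unsigned int napi_id;
	struct napi_model *next;	/* stands in for napi_hash_node */
};

static struct napi_model *napi_hash_model[NAPI_HASH_SIZE];
static unsigned int napi_gen_id_model;	/* stands in for napi_gen_id */

/* Allocate a non-zero id and link the entry at the head of its bucket. */
static void napi_hash_add_model(struct napi_model *napi)
{
	unsigned int bucket;

	/* 0 is not a valid id, so skip it when the counter wraps */
	napi->napi_id = 0;
	while (!napi->napi_id)
		napi->napi_id = ++napi_gen_id_model;
	assert(napi->napi_id != 0);

	bucket = napi->napi_id % NAPI_HASH_SIZE;
	napi->next = napi_hash_model[bucket];
	napi_hash_model[bucket] = napi;
}

/* Unlink the entry; no grace period is needed in this model. */
static void napi_hash_del_model(struct napi_model *napi)
{
	struct napi_model **pp;

	pp = &napi_hash_model[napi->napi_id % NAPI_HASH_SIZE];
	while (*pp && *pp != napi)
		pp = &(*pp)->next;
	if (*pp)
		*pp = napi->next;
}

/* Walk one bucket and compare ids, as napi_by_id() does. */
static struct napi_model *napi_by_id_model(unsigned int id)
{
	struct napi_model *n = napi_hash_model[id % NAPI_HASH_SIZE];

	for (; n; n = n->next)
		if (n->napi_id == id)
			return n;

	return NULL;
}
```

Because sockets and skbs store only the id, a lookup for a NAPI that has already been removed simply returns NULL, which is what replaces the stale-pointer protection of the v4 generation id.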

2013-05-27 07:44:27

by Eliezer Tamir

Subject: [PATCH v5 net-next 2/5] net: implement support for low latency socket polling

Adds a new ndo_ll_poll method and the code that supports and uses it.
This method can be used by low latency applications to busy poll Ethernet
device queues directly from the socket code. The value of sysctl_net_ll_poll
controls how many microseconds to poll. Set to zero to disable.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Tested-by: Willem de Bruijn <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---

Documentation/sysctl/net.txt | 7 ++
fs/select.c | 7 ++
include/linux/netdevice.h | 3 +
include/linux/skbuff.h | 8 ++-
include/net/ll_poll.h | 126 ++++++++++++++++++++++++++++++++++++++++++
include/net/sock.h | 4 +
include/uapi/linux/snmp.h | 1
net/Kconfig | 12 ++++
net/core/datagram.c | 4 +
net/core/skbuff.c | 4 +
net/core/sock.c | 6 ++
net/core/sysctl_net_core.c | 10 +++
net/ipv4/proc.c | 1
net/ipv4/udp.c | 6 ++
net/socket.c | 16 +++++
15 files changed, 211 insertions(+), 4 deletions(-)
create mode 100644 include/net/ll_poll.h

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index c1f8640..85ab72d 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -50,6 +50,13 @@ The maximum number of packets that kernel can handle on a NAPI interrupt,
it's a Per-CPU variable.
Default: 64

+low_latency_poll
+----------------
+Low latency busy poll timeout. (needs CONFIG_NET_LL_RX_POLL)
+Approximate time in us to spin waiting for packets on the device queue.
+Recommended value is 50. May increase power usage.
+Default: 0 (off)
+
rmem_default
------------

diff --git a/fs/select.c b/fs/select.c
index 8c1c96c..0ef246d 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -27,6 +27,7 @@
#include <linux/rcupdate.h>
#include <linux/hrtimer.h>
#include <linux/sched/rt.h>
+#include <net/ll_poll.h>

#include <asm/uaccess.h>

@@ -400,6 +401,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
poll_table *wait;
int retval, i, timed_out = 0;
unsigned long slack = 0;
+ unsigned long ll_time = ll_end_time();

rcu_read_lock();
retval = max_select_fd(n, fds);
@@ -486,6 +488,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
break;
}

+ if (can_poll_ll(ll_time))
+ continue;
/*
* If this is the first loop and we have a timeout
* given, then we convert to ktime_t and set the to
@@ -750,6 +754,7 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
ktime_t expire, *to = NULL;
int timed_out = 0, count = 0;
unsigned long slack = 0;
+ unsigned long ll_time = ll_end_time();

/* Optimise the no-wait case */
if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
@@ -795,6 +800,8 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
if (count || timed_out)
break;

+ if (can_poll_ll(ll_time))
+ continue;
/*
* If this is the first loop and we have a timeout
* given, then we convert to ktime_t and set the to
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1ec8b1..a2941e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -972,6 +972,9 @@ struct net_device_ops {
gfp_t gfp);
void (*ndo_netpoll_cleanup)(struct net_device *dev);
#endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ int (*ndo_ll_poll)(struct napi_struct *dev);
+#endif
int (*ndo_set_vf_mac)(struct net_device *dev,
int queue, u8 *mac);
int (*ndo_set_vf_vlan)(struct net_device *dev,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2e0ced1..2f4f77c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -384,6 +384,7 @@ typedef unsigned char *sk_buff_data_t;
* @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
* @dma_cookie: a cookie to one of several possible DMA operations
* done by skb DMA functions
+ * @napi_id: id of the NAPI struct this skb came from
* @secmark: security marking
* @mark: Generic packet mark
* @dropcount: total number of sk_receive_queue overflows
@@ -497,8 +498,11 @@ struct sk_buff {
/* 7/9 bit hole (depending on ndisc_nodetype presence) */
kmemcheck_bitfield_end(flags2);

-#ifdef CONFIG_NET_DMA
- dma_cookie_t dma_cookie;
+#if defined CONFIG_NET_DMA || defined CONFIG_NET_LL_RX_POLL
+ union {
+ unsigned int napi_id;
+ dma_cookie_t dma_cookie;
+ };
#endif
#ifdef CONFIG_NETWORK_SECMARK
__u32 secmark;
diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h
new file mode 100644
index 0000000..9e1c972
--- /dev/null
+++ b/include/net/ll_poll.h
@@ -0,0 +1,126 @@
+/*
+ * low latency network device queue flush
+ * Copyright(c) 2013 Intel Corporation.
+ * Author: Eliezer Tamir
+ *
+ * For now this depends on CONFIG_X86_TSC
+ */
+
+#ifndef _LINUX_NET_LL_POLL_H
+#define _LINUX_NET_LL_POLL_H
+
+#include <linux/netdevice.h>
+#include <net/ip.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+
+struct napi_struct;
+extern int sysctl_net_ll_poll __read_mostly;
+
+/* return values from ndo_ll_poll */
+#define LL_FLUSH_FAILED -1
+#define LL_FLUSH_BUSY -2
+
+/* we don't mind a ~2.5% imprecision */
+#define TSC_MHZ (tsc_khz >> 10)
+
+static inline unsigned long ll_end_time(void)
+{
+ return TSC_MHZ * ACCESS_ONCE(sysctl_net_ll_poll) + get_cycles();
+}
+
+static inline bool sk_valid_ll(struct sock *sk)
+{
+ return sysctl_net_ll_poll && sk->sk_napi_id &&
+ !need_resched() && !signal_pending(current);
+}
+
+static inline bool can_poll_ll(unsigned long end_time)
+{
+ return !time_after((unsigned long)get_cycles(), end_time);
+}
+
+static inline bool sk_poll_ll(struct sock *sk, int nonblock)
+{
+ unsigned long end_time = ll_end_time();
+ const struct net_device_ops *ops;
+ struct napi_struct *napi;
+ int rc = false;
+
+ /*
+ * rcu read lock for napi hash
+ * bh so we don't race with net_rx_action
+ */
+ rcu_read_lock_bh();
+
+ napi = napi_by_id(sk->sk_napi_id);
+ if (!napi)
+ goto out;
+
+ ops = napi->dev->netdev_ops;
+ if (!ops->ndo_ll_poll)
+ goto out;
+
+ do {
+
+ rc = ops->ndo_ll_poll(napi);
+
+ if (rc == LL_FLUSH_FAILED)
+ break; /* permanent failure */
+
+ if (rc > 0)
+ /* local bh are disabled so it is ok to use _BH */
+ NET_ADD_STATS_BH(sock_net(sk),
+ LINUX_MIB_LOWLATENCYRXPACKETS, rc);
+
+ } while (skb_queue_empty(&sk->sk_receive_queue)
+ && can_poll_ll(end_time) && !nonblock);
+
+ rc = !skb_queue_empty(&sk->sk_receive_queue);
+out:
+ rcu_read_unlock_bh();
+ return rc;
+}
+
+static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi)
+{
+ skb->napi_id = napi->napi_id;
+}
+
+static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
+{
+ sk->sk_napi_id = skb->napi_id;
+}
+
+#else /* CONFIG_NET_LL_RX_POLL */
+
+static inline unsigned long ll_end_time(void)
+{
+ return 0;
+}
+
+static inline bool sk_valid_ll(struct sock *sk)
+{
+ return 0;
+}
+
+static inline bool sk_poll_ll(struct sock *sk, int nonblock)
+{
+ return 0;
+}
+
+static inline void skb_mark_ll(struct sk_buff *skb, struct napi_struct *napi)
+{
+}
+
+static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
+{
+}
+
+static inline bool can_poll_ll(unsigned long end_time)
+{
+ return false;
+}
+
+#endif /* CONFIG_NET_LL_RX_POLL */
+#endif /* _LINUX_NET_LL_POLL_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 66772cf..c7c3ea6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -281,6 +281,7 @@ struct cg_proto;
* @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
* @sk_backlog_rcv: callback to process the backlog
* @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
+ * @sk_napi_id: id of the last napi context to receive data for sk
*/
struct sock {
/*
@@ -399,6 +400,9 @@ struct sock {
int (*sk_backlog_rcv)(struct sock *sk,
struct sk_buff *skb);
void (*sk_destruct)(struct sock *sk);
+#ifdef CONFIG_NET_LL_RX_POLL
+ unsigned int sk_napi_id;
+#endif
};

/*
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index df2e8b4..26cbf76 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -253,6 +253,7 @@ enum
LINUX_MIB_TCPFASTOPENLISTENOVERFLOW, /* TCPFastOpenListenOverflow */
LINUX_MIB_TCPFASTOPENCOOKIEREQD, /* TCPFastOpenCookieReqd */
LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES, /* TCPSpuriousRtxHostQueues */
+ LINUX_MIB_LOWLATENCYRXPACKETS, /* LowLatencyRxPackets */
__LINUX_MIB_MAX
};

diff --git a/net/Kconfig b/net/Kconfig
index 08de901..4d90826 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -242,6 +242,18 @@ config NETPRIO_CGROUP
Cgroup subsystem for use in assigning processes to network priorities on
a per-interface basis

+config NET_LL_RX_POLL
+ bool "Low Latency Receive Poll"
+ depends on X86_TSC
+ default n
+ ---help---
+ Support Low Latency Receive Queue Poll.
+ (For network card drivers which support this option.)
+ When waiting for data in read or poll call directly into the device driver
+ to flush packets which may be pending on the device queues into the stack.
+
+ If unsure, say N.
+
config BQL
boolean
depends on SYSFS
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b71423d..9cbaba9 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -56,6 +56,7 @@
#include <net/sock.h>
#include <net/tcp_states.h>
#include <trace/events/skb.h>
+#include <net/ll_poll.h>

/*
* Is a socket 'connection oriented' ?
@@ -207,6 +208,9 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
}
spin_unlock_irqrestore(&queue->lock, cpu_flags);

+ if (sk_valid_ll(sk) && sk_poll_ll(sk, flags & MSG_DONTWAIT))
+ continue;
+
/* User doesn't want to wait */
error = -EAGAIN;
if (!timeo)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d629891..c74f78e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -739,6 +739,10 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
new->vlan_tci = old->vlan_tci;

skb_copy_secmark(new, old);
+
+#ifdef CONFIG_NET_LL_RX_POLL
+ new->napi_id = old->napi_id;
+#endif
}

/*
diff --git a/net/core/sock.c b/net/core/sock.c
index 6ba327d..3770c10 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -139,6 +139,8 @@
#include <net/tcp.h>
#endif

+#include <net/ll_poll.h>
+
static DEFINE_MUTEX(proto_list_mutex);
static LIST_HEAD(proto_list);

@@ -2284,6 +2286,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)

sk->sk_stamp = ktime_set(-1L, 0);

+#ifdef CONFIG_NET_LL_RX_POLL
+ sk->sk_napi_id = 0;
+#endif
+
/*
* Before updating sk_refcnt, we must commit prior changes to memory
* (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 741db5fc..4ca5702 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -19,6 +19,7 @@
#include <net/ip.h>
#include <net/sock.h>
#include <net/net_ratelimit.h>
+#include <net/ll_poll.h>

static int one = 1;

@@ -284,6 +285,15 @@ static struct ctl_table net_core_table[] = {
.proc_handler = flow_limit_table_len_sysctl
},
#endif /* CONFIG_NET_FLOW_LIMIT */
+#ifdef CONFIG_NET_LL_RX_POLL
+ {
+ .procname = "low_latency_poll",
+ .data = &sysctl_net_ll_poll,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+#endif
#endif /* CONFIG_NET */
{
.procname = "netdev_budget",
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 2a5bf86..6577a11 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -273,6 +273,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPFastOpenListenOverflow", LINUX_MIB_TCPFASTOPENLISTENOVERFLOW),
SNMP_MIB_ITEM("TCPFastOpenCookieReqd", LINUX_MIB_TCPFASTOPENCOOKIEREQD),
SNMP_MIB_ITEM("TCPSpuriousRtxHostQueues", LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES),
+ SNMP_MIB_ITEM("LowLatencyRxPackets", LINUX_MIB_LOWLATENCYRXPACKETS),
SNMP_MIB_SENTINEL
};

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0bf5d39..a70ad55 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -109,6 +109,7 @@
#include <trace/events/udp.h>
#include <linux/static_key.h>
#include <trace/events/skb.h>
+#include <net/ll_poll.h>
#include "udp_impl.h"

struct udp_table udp_table __read_mostly;
@@ -1709,7 +1710,10 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ sk_mark_ll(sk, skb);
+ ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);

/* a return value > 0 means to resubmit the input, but
diff --git a/net/socket.c b/net/socket.c
index 6b94633..4844155 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -104,6 +104,12 @@
#include <linux/route.h>
#include <linux/sockios.h>
#include <linux/atalk.h>
+#include <net/ll_poll.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+int sysctl_net_ll_poll __read_mostly;
+EXPORT_SYMBOL_GPL(sysctl_net_ll_poll);
+#endif

static int sock_no_open(struct inode *irrelevant, struct file *dontcare);
static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
@@ -1142,13 +1148,21 @@ EXPORT_SYMBOL(sock_create_lite);
/* No kernel lock held - perfect */
static unsigned int sock_poll(struct file *file, poll_table *wait)
{
+ unsigned int poll_result;
struct socket *sock;

/*
* We can't return errors to poll, so it's either yes or no.
*/
sock = file->private_data;
- return sock->ops->poll(file, sock, wait);
+
+ poll_result = sock->ops->poll(file, sock, wait);
+
+ if (!(poll_result & (POLLRDNORM | POLLERR | POLLRDHUP | POLLHUP)) &&
+ sk_valid_ll(sock->sk) && sk_poll_ll(sock->sk, 1))
+ poll_result = sock->ops->poll(file, sock, wait);
+
+ return poll_result;
}

static int sock_mmap(struct file *file, struct vm_area_struct *vma)
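The deadline arithmetic in ll_end_time() and can_poll_ll() can be checked in userspace with the sketch below. It assumes the cycle counter and tsc_khz are passed in as plain parameters (the _model functions are hypothetical), and it open-codes the wrap-safe comparison that time_after() performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors TSC_MHZ: shifting by 10 divides by 1024 instead of 1000. */
static unsigned long tsc_mhz_model(unsigned long tsc_khz)
{
	return tsc_khz >> 10;
}

/* Deadline in cycles: cycles per microsecond times the timeout, plus now. */
static unsigned long ll_end_time_model(unsigned long now,
				       unsigned long tsc_khz,
				       unsigned long usecs)
{
	return tsc_mhz_model(tsc_khz) * usecs + now;
}

/* True if a is later than b, even if the counter wrapped in between. */
static bool time_after_model(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Busy polling may continue until the deadline has passed. */
static bool can_poll_ll_model(unsigned long now, unsigned long end_time)
{
	return !time_after_model(now, end_time);
}
```

For a hypothetical 2 GHz TSC (tsc_khz = 2000000) the shift yields 1953 rather than 2000 cycles per microsecond, so a 50us setting spins for slightly less than 50us. That is the small imprecision the TSC_MHZ comment accepts.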

2013-05-27 07:44:45

by Eliezer Tamir

Subject: [PATCH v5 net-next 4/5] ixgbe: Add support for ndo_ll_poll

Add the ixgbe driver code implementing ndo_ll_poll.
It should be easy for other drivers to do something similar
in order to enable support for CONFIG_NET_LL_RX_POLL.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Tested-by: Willem de Bruijn <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---

drivers/net/ethernet/intel/ixgbe/ixgbe.h | 120 +++++++++++++++++++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 2
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 63 +++++++++++--
3 files changed, 177 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index ca93238..04fdbf6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -52,6 +52,8 @@
#include <linux/dca.h>
#endif

+#include <net/ll_poll.h>
+
/* common prefix used by pr_<> macros */
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -356,9 +358,127 @@ struct ixgbe_q_vector {
struct rcu_head rcu; /* to avoid race with update stats on free */
char name[IFNAMSIZ + 9];

+#ifdef CONFIG_NET_LL_RX_POLL
+ unsigned int state;
+#define IXGBE_QV_STATE_IDLE 0
+#define IXGBE_QV_STATE_NAPI 1 /* NAPI owns this QV */
+#define IXGBE_QV_STATE_POLL 2 /* poll owns this QV */
+#define IXGBE_QV_LOCKED (IXGBE_QV_STATE_NAPI | IXGBE_QV_STATE_POLL)
+#define IXGBE_QV_STATE_NAPI_YIELD 4 /* NAPI yielded this QV */
+#define IXGBE_QV_STATE_POLL_YIELD 8 /* poll yielded this QV */
+#define IXGBE_QV_YIELD (IXGBE_QV_STATE_NAPI_YIELD | IXGBE_QV_STATE_POLL_YIELD)
+#define IXGBE_QV_USER_PEND (IXGBE_QV_STATE_POLL | IXGBE_QV_STATE_POLL_YIELD)
+ spinlock_t lock;
+#endif /* CONFIG_NET_LL_RX_POLL */
+
/* for dynamic allocation of rings associated with this q_vector */
struct ixgbe_ring ring[0] ____cacheline_internodealigned_in_smp;
};
+#ifdef CONFIG_NET_LL_RX_POLL
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+
+ spin_lock_init(&q_vector->lock);
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+}
+
+/* called from the device poll routine to get ownership of a q_vector */
+static inline int ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+ int rc = true;
+ spin_lock(&q_vector->lock);
+ if (q_vector->state & IXGBE_QV_LOCKED) {
+ WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
+ q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
+ rc = false;
+ } else
+ /* we don't care if someone yielded */
+ q_vector->state = IXGBE_QV_STATE_NAPI;
+ spin_unlock(&q_vector->lock);
+ return rc;
+}
+
+/* returns true if someone tried to get the qv while napi had it */
+static inline int ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+ int rc = false;
+ spin_lock(&q_vector->lock);
+ WARN_ON(q_vector->state & (IXGBE_QV_STATE_POLL |
+ IXGBE_QV_STATE_NAPI_YIELD));
+
+ if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+ rc = true;
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+ spin_unlock(&q_vector->lock);
+ return rc;
+}
+
+/* called from ixgbe_low_latency_poll() */
+static inline int ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+ int rc = true;
+ spin_lock_bh(&q_vector->lock);
+ if ((q_vector->state & IXGBE_QV_LOCKED)) {
+ q_vector->state |= IXGBE_QV_STATE_POLL_YIELD;
+ rc = false;
+ } else
+ /* preserve yield marks */
+ q_vector->state |= IXGBE_QV_STATE_POLL;
+ spin_unlock_bh(&q_vector->lock);
+ return rc;
+}
+
+/* returns true if someone tried to get the qv while it was locked */
+static inline int ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+ int rc = false;
+ spin_lock_bh(&q_vector->lock);
+ WARN_ON(q_vector->state & (IXGBE_QV_STATE_NAPI));
+
+ if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+ rc = true;
+ q_vector->state = IXGBE_QV_STATE_IDLE;
+ spin_unlock_bh(&q_vector->lock);
+ return rc;
+}
+
+/* true if a socket is polling, even if it did not get the lock */
+static inline int ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+ WARN_ON(!(q_vector->state & IXGBE_QV_LOCKED));
+ return q_vector->state & IXGBE_QV_USER_PEND;
+}
+#else /* CONFIG_NET_LL_RX_POLL */
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+}
+
+static inline int ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+ return true;
+}
+
+static inline int ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline int ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline int ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+
+static inline int ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+ return false;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
#ifdef CONFIG_IXGBE_HWMON

#define IXGBE_HWMON_TYPE_LOC 0
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index ef5f7a6..90b4e10 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -811,6 +811,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
/* initialize NAPI */
netif_napi_add(adapter->netdev, &q_vector->napi,
ixgbe_poll, 64);
+ napi_hash_add(&q_vector->napi);

/* tie q_vector and adapter together */
adapter->q_vector[v_idx] = q_vector;
@@ -931,6 +932,7 @@ static void ixgbe_free_q_vector(struct ixgbe_adapter *adapter, int v_idx)
adapter->rx_ring[ring->queue_index] = NULL;

adapter->q_vector[v_idx] = NULL;
+ napi_hash_del(&q_vector->napi);
netif_napi_del(&q_vector->napi);

/*
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d30fbdd..9a7dc40 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1504,7 +1504,9 @@ static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
{
struct ixgbe_adapter *adapter = q_vector->adapter;

- if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
+ if (ixgbe_qv_ll_polling(q_vector))
+ netif_receive_skb(skb);
+ else if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
napi_gro_receive(&q_vector->napi, skb);
else
netif_rx(skb);
@@ -1892,9 +1894,9 @@ dma_sync:
* expensive overhead for IOMMU access this provides a means of avoiding
* it by maintaining the mapping of the page to the syste.
*
- * Returns true if all work is completed without reaching budget
+ * Returns amount of work completed
**/
-static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
+static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
struct ixgbe_ring *rx_ring,
const int budget)
{
@@ -1976,6 +1978,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
}

#endif /* IXGBE_FCOE */
+ skb_mark_ll(skb, &q_vector->napi);
ixgbe_rx_skb(q_vector, skb);

/* update budget accounting */
@@ -1992,9 +1995,37 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
if (cleaned_count)
ixgbe_alloc_rx_buffers(rx_ring, cleaned_count);

- return (total_rx_packets < budget);
+ return total_rx_packets;
}

+#ifdef CONFIG_NET_LL_RX_POLL
+/* must be called with local_bh_disable()d */
+static int ixgbe_low_latency_recv(struct napi_struct *napi)
+{
+ struct ixgbe_q_vector *q_vector =
+ container_of(napi, struct ixgbe_q_vector, napi);
+ struct ixgbe_adapter *adapter = q_vector->adapter;
+ struct ixgbe_ring *ring;
+ int found = 0;
+
+ if (test_bit(__IXGBE_DOWN, &adapter->state))
+ return LL_FLUSH_FAILED;
+
+ if (!ixgbe_qv_lock_poll(q_vector))
+ return LL_FLUSH_BUSY;
+
+ ixgbe_for_each_ring(ring, q_vector->rx) {
+ found = ixgbe_clean_rx_irq(q_vector, ring, 4);
+ if (found)
+ break;
+ }
+
+ ixgbe_qv_unlock_poll(q_vector);
+
+ return found;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
/**
* ixgbe_configure_msix - Configure MSI-X hardware
* @adapter: board private structure
@@ -2550,6 +2581,9 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
ixgbe_for_each_ring(ring, q_vector->tx)
clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);

+ if (!ixgbe_qv_lock_napi(q_vector))
+ return budget;
+
/* attempt to distribute budget to each queue fairly, but don't allow
* the budget to go below 1 because we'll exit polling */
if (q_vector->rx.count > 1)
@@ -2558,9 +2592,10 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
per_ring_budget = budget;

ixgbe_for_each_ring(ring, q_vector->rx)
- clean_complete &= ixgbe_clean_rx_irq(q_vector, ring,
- per_ring_budget);
+ clean_complete &= (ixgbe_clean_rx_irq(q_vector, ring,
+ per_ring_budget) < per_ring_budget);

+ ixgbe_qv_unlock_napi(q_vector);
/* If all work not completed, return budget and keep polling */
if (!clean_complete)
return budget;
@@ -3747,16 +3782,25 @@ static void ixgbe_napi_enable_all(struct ixgbe_adapter *adapter)
{
int q_idx;

- for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++)
+ for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
+ ixgbe_qv_init_lock(adapter->q_vector[q_idx]);
napi_enable(&adapter->q_vector[q_idx]->napi);
+ }
}

static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
{
int q_idx;

- for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++)
+ local_bh_disable(); /* for ixgbe_qv_lock_napi() */
+ for (q_idx = 0; q_idx < adapter->num_q_vectors; q_idx++) {
napi_disable(&adapter->q_vector[q_idx]->napi);
+ while (!ixgbe_qv_lock_napi(adapter->q_vector[q_idx])) {
+ pr_info("QV %d locked\n", q_idx);
+ mdelay(1);
+ }
+ }
+ local_bh_enable();
}

#ifdef CONFIG_IXGBE_DCB
@@ -7177,6 +7221,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = ixgbe_netpoll,
#endif
+#ifdef CONFIG_NET_LL_RX_POLL
+ .ndo_ll_poll = ixgbe_low_latency_recv,
+#endif
#ifdef IXGBE_FCOE
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
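The q_vector ownership protocol between NAPI and the socket poll path can be followed with the single-threaded model below. It is a sketch only: the spinlock, the bh handling and the WARN_ON() checks are left out, and the qv_ names are hypothetical shorthand for the ixgbe_qv_ helpers. The state bits and transitions are copied from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define QV_STATE_IDLE		0
#define QV_STATE_NAPI		1	/* NAPI owns this QV */
#define QV_STATE_POLL		2	/* poll owns this QV */
#define QV_LOCKED		(QV_STATE_NAPI | QV_STATE_POLL)
#define QV_STATE_NAPI_YIELD	4	/* NAPI yielded this QV */
#define QV_STATE_POLL_YIELD	8	/* poll yielded this QV */

struct qv_model {
	unsigned int state;
};

/* NAPI side: take ownership, or record that NAPI had to yield. */
static bool qv_lock_napi(struct qv_model *qv)
{
	if (qv->state & QV_LOCKED) {
		qv->state |= QV_STATE_NAPI_YIELD;
		return false;
	}
	/* we don't care if someone yielded */
	qv->state = QV_STATE_NAPI;
	return true;
}

/* Returns true if a poller tried to get the QV while NAPI had it. */
static bool qv_unlock_napi(struct qv_model *qv)
{
	bool contended = (qv->state & QV_STATE_POLL_YIELD) != 0;

	qv->state = QV_STATE_IDLE;
	return contended;
}

/* Socket poll side: take ownership, or record that poll had to yield. */
static bool qv_lock_poll(struct qv_model *qv)
{
	if (qv->state & QV_LOCKED) {
		qv->state |= QV_STATE_POLL_YIELD;
		return false;
	}
	/* preserve yield marks */
	qv->state |= QV_STATE_POLL;
	return true;
}

/* Mirrors ixgbe_qv_unlock_poll(): reports POLL_YIELD, then resets. */
static bool qv_unlock_poll(struct qv_model *qv)
{
	bool contended = (qv->state & QV_STATE_POLL_YIELD) != 0;

	qv->state = QV_STATE_IDLE;
	return contended;
}
```

In the driver, a failed qv_lock_napi() corresponds to ixgbe_poll() returning the full budget so NAPI is rescheduled, and a failed qv_lock_poll() corresponds to ixgbe_low_latency_recv() returning LL_FLUSH_BUSY.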

2013-05-27 07:44:51

by Eliezer Tamir

Subject: [PATCH v5 net-next 5/5] ixgbe: add extra stats for ndo_ll_poll

Add additional statistics to the ixgbe driver for ndo_ll_poll.
They are defined under LL_EXTENDED_STATS.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Tested-by: Willem de Bruijn <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---

drivers/net/ethernet/intel/ixgbe/ixgbe.h | 14 ++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 40 ++++++++++++++++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +++
3 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 04fdbf6..9765772 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -54,6 +54,9 @@

#include <net/ll_poll.h>

+#ifdef CONFIG_NET_LL_RX_POLL
+#define LL_EXTENDED_STATS
+#endif
/* common prefix used by pr_<> macros */
#undef pr_fmt
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -184,6 +187,11 @@ struct ixgbe_rx_buffer {
struct ixgbe_queue_stats {
u64 packets;
u64 bytes;
+#ifdef LL_EXTENDED_STATS
+ u64 yields;
+ u64 misses;
+ u64 cleaned;
+#endif /* LL_EXTENDED_STATS */
};

struct ixgbe_tx_queue_stats {
@@ -391,6 +399,9 @@ static inline int ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
rc = false;
+#ifdef LL_EXTENDED_STATS
+ q_vector->tx.ring->stats.yields++;
+#endif
} else
/* we don't care if someone yielded */
q_vector->state = IXGBE_QV_STATE_NAPI;
@@ -421,6 +432,9 @@ static inline int ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
if ((q_vector->state & IXGBE_QV_LOCKED)) {
q_vector->state |= IXGBE_QV_STATE_POLL_YIELD;
rc = false;
+#ifdef LL_EXTENDED_STATS
+ q_vector->rx.ring->stats.yields++;
+#endif
} else
/* preserve yield marks */
q_vector->state |= IXGBE_QV_STATE_POLL;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index d375472..24e2e7a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1054,6 +1054,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i] = 0;
data[i+1] = 0;
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = 0;
+ data[i+1] = 0;
+ data[i+2] = 0;
+ i += 3;
+#endif
continue;
}

@@ -1063,6 +1069,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i+1] = ring->stats.bytes;
} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = ring->stats.yields;
+ data[i+1] = ring->stats.misses;
+ data[i+2] = ring->stats.cleaned;
+ i += 3;
+#endif
}
for (j = 0; j < IXGBE_NUM_RX_QUEUES; j++) {
ring = adapter->rx_ring[j];
@@ -1070,6 +1082,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i] = 0;
data[i+1] = 0;
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = 0;
+ data[i+1] = 0;
+ data[i+2] = 0;
+ i += 3;
+#endif
continue;
}

@@ -1079,6 +1097,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
data[i+1] = ring->stats.bytes;
} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
i += 2;
+#ifdef LL_EXTENDED_STATS
+ data[i] = ring->stats.yields;
+ data[i+1] = ring->stats.misses;
+ data[i+2] = ring->stats.cleaned;
+ i += 3;
+#endif
}

for (j = 0; j < IXGBE_MAX_PACKET_BUFFERS; j++) {
@@ -1115,12 +1139,28 @@ static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
p += ETH_GSTRING_LEN;
sprintf(p, "tx_queue_%u_bytes", i);
p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+ sprintf(p, "tx_q_%u_napi_yield", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_q_%u_misses", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_q_%u_cleaned", i);
+ p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
}
for (i = 0; i < IXGBE_NUM_RX_QUEUES; i++) {
sprintf(p, "rx_queue_%u_packets", i);
p += ETH_GSTRING_LEN;
sprintf(p, "rx_queue_%u_bytes", i);
p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+ sprintf(p, "rx_q_%u_ll_poll_yield", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "rx_q_%u_misses", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "rx_q_%u_cleaned", i);
+ p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
}
for (i = 0; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
sprintf(p, "tx_pb_%u_pxon", i);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9a7dc40..047ebaa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2016,6 +2016,12 @@ static int ixgbe_low_latency_recv(struct napi_struct *napi)

ixgbe_for_each_ring(ring, q_vector->rx) {
found = ixgbe_clean_rx_irq(q_vector, ring, 4);
+#ifdef LL_EXTENDED_STATS
+ if (found)
+ ring->stats.cleaned += found;
+ else
+ ring->stats.misses++;
+#endif
if (found)
break;
}
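
The accounting added to ixgbe_low_latency_recv() above can be sketched in plain user-space C. This is only an illustration, not driver code: struct ll_ring_stats and ll_account_poll() are hypothetical names, and the sketch covers only the misses/cleaned bookkeeping (yields are counted separately in the lock helpers).

```c
#include <stdint.h>

/* Hypothetical stand-in for the three extended per-ring counters. */
struct ll_ring_stats {
	uint64_t yields;   /* lock contention between NAPI and busy poll */
	uint64_t misses;   /* busy-poll passes that found no packets */
	uint64_t cleaned;  /* packets picked up by busy-poll passes */
};

/* Record the outcome of one busy-poll pass over a ring.
 * 'found' is the number of packets the pass cleaned.
 */
static void ll_account_poll(struct ll_ring_stats *stats, int found)
{
	if (found > 0)
		stats->cleaned += (uint64_t)found;
	else
		stats->misses++;
}
```

A high misses-to-cleaned ratio would suggest that the poll loop mostly spins without finding data.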

2013-05-27 07:44:44

by Eliezer Tamir

Subject: [PATCH v5 net-next 3/5] tcp: add TCP support for low latency receive poll.

adds busy-poll support for TCP.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Jesse Brandeburg <[email protected]>
Tested-by: Willem de Bruijn <[email protected]>
Signed-off-by: Eliezer Tamir <[email protected]>
---

net/ipv4/tcp.c | 5 +++++
net/ipv4/tcp_input.c | 1 +
net/ipv4/tcp_ipv4.c | 2 ++
3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d87ce72..652c75a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@

#include <asm/uaccess.h>
#include <asm/ioctls.h>
+#include <net/ll_poll.h>

int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;

@@ -1551,6 +1552,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
struct sk_buff *skb;
u32 urg_hole = 0;

+ if (sk_valid_ll(sk) && skb_queue_empty(&sk->sk_receive_queue)
+ && (sk->sk_state == TCP_ESTABLISHED))
+ sk_poll_ll(sk, nonblock);
+
lock_sock(sk);

err = -ENOTCONN;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9579e1a..4d82939 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -74,6 +74,7 @@
#include <linux/ipsec.h>
#include <asm/unaligned.h>
#include <net/netdma.h>
+#include <net/ll_poll.h>

int sysctl_tcp_timestamps __read_mostly = 1;
int sysctl_tcp_window_scaling __read_mostly = 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d20ede0..35fd8bc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -75,6 +75,7 @@
#include <net/netdma.h>
#include <net/secure_seq.h>
#include <net/tcp_memcontrol.h>
+#include <net/ll_poll.h>

#include <linux/inet.h>
#include <linux/ipv6.h>
@@ -2011,6 +2012,7 @@ process:
if (sk_filter(sk, skb))
goto discard_and_relse;

+ sk_mark_ll(sk, skb);
skb->dev = NULL;

bh_lock_sock_nested(sk);
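
The condition guarding sk_poll_ll() in tcp_recvmsg() can be read as a simple predicate. The sketch below is user-space C with hypothetical names (struct demo_sk, demo_should_busy_poll). It assumes that sk_valid_ll() boils down to "the sysctl is non-zero and the socket has a recorded napi id"; the real helper lives in patch 2/5 and may check more than that.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the socket fields the check reads. */
struct demo_sk {
	unsigned int napi_id;      /* 0: no NAPI context recorded yet */
	unsigned int rx_queue_len; /* packets waiting in the receive queue */
	bool established;          /* stands in for sk_state == TCP_ESTABLISHED */
};

/* Busy-poll only when it can help: polling is enabled, we know which
 * NAPI context to poll, nothing is queued already, and the connection
 * is established.
 */
static bool demo_should_busy_poll(const struct demo_sk *sk,
				  unsigned int ll_poll_usecs)
{
	bool valid = ll_poll_usecs != 0 && sk->napi_id != 0; /* assumed sk_valid_ll() */

	return valid && sk->rx_queue_len == 0 && sk->established;
}
```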

2013-05-28 00:26:25

by Eric Dumazet

Subject: Re: [PATCH v5 net-next 2/5] net: implement support for low latency socket polling

On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:

> diff --git a/include/net/sock.h b/include/net/sock.h
> index 66772cf..c7c3ea6 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -281,6 +281,7 @@ struct cg_proto;
> * @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
> * @sk_backlog_rcv: callback to process the backlog
> * @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
> + * @sk_napi_id: id of the last napi context to receive data for sk
> */
> struct sock {
> /*
> @@ -399,6 +400,9 @@ struct sock {
> int (*sk_backlog_rcv)(struct sock *sk,
> struct sk_buff *skb);
> void (*sk_destruct)(struct sock *sk);
> +#ifdef CONFIG_NET_LL_RX_POLL
> + unsigned int sk_napi_id;
> +#endif
> };

I believe this is a bad choice for data locality.

I would rather move it into the same cache line as sk_rxhash.
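
The locality argument can be illustrated with a small user-space sketch. Everything here is hypothetical: the two demo structs are not struct sock, the 64-byte line size is an assumption, and the check assumes the struct itself starts on a cache-line boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* New field appended at the end, far from the hash it is used with. */
struct demo_sock_far {
	uint32_t rxhash;
	char other[120];
	unsigned int napi_id;
};

/* New field placed right after the hash. */
struct demo_sock_near {
	uint32_t rxhash;
	unsigned int napi_id;
	char other[120];
};

/* True when two member offsets fall into the same cache line. */
static int demo_same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / DEMO_CACHE_LINE == off_b / DEMO_CACHE_LINE;
}
```

With the field next to the hash, a receive path that touches both pays for one cache line instead of two.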


2013-05-28 00:28:31

by Eric Dumazet

Subject: Re: [PATCH v5 net-next 1/5] net: add napi_id and hash

On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:
> Adds a napi_id and a hashing mechanism to lookup a napi by id.
> This will be used by subsequent patches to implement low latency
> Ethernet device polling.
> Based on a code sample by Eric Dumazet.
>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---
>
> include/linux/netdevice.h | 29 +++++++++++++++++++++++++++++
> net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 73 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ea7b6bc..d1ec8b1 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -324,12 +324,15 @@ struct napi_struct {
> struct sk_buff *gro_list;
> struct sk_buff *skb;
> struct list_head dev_list;
> + struct hlist_node napi_hash_node;
> + int napi_id;
> };
>
> enum {
> NAPI_STATE_SCHED, /* Poll is scheduled */
> NAPI_STATE_DISABLE, /* Disable pending */
> NAPI_STATE_NPSVC, /* Netpoll - don't dequeue from poll_list */
> + NAPI_STATE_HASHED, /* In NAPI hash */
> };
>
> enum gro_result {
> @@ -446,6 +449,32 @@ extern void __napi_complete(struct napi_struct *n);
> extern void napi_complete(struct napi_struct *n);
>
> /**
> + * napi_hash_add - add a NAPI to global hashtable
> + * @napi: napi context
> + *
> + * generate a new napi_id and store a @napi under it in napi_hash
> + */
> +extern void napi_hash_add(struct napi_struct *napi);
> +
> +/**
> + * napi_hash_del - remove a NAPI from blobal table

global

> + * @napi: napi context
> + *
> + * Warning: caller must observe rcu grace period
> + * before freeing memory containing @NAPI

@napi

> + */
> +extern void napi_hash_del(struct napi_struct *napi);
> +
> +/**
> + * napi_by_is - lookup a NAPI by napi_id

napi_by_id

> + * @napi_id: hashed napi_id
> + *
> + * lookup napi_id in napi_hash table

@napi_id

> + * must be called under rcu_read_lock()
> + */
> +extern struct napi_struct *napi_by_id(int napi_id);
> +
> +/**
> * napi_disable - prevent NAPI from scheduling
> * @n: napi context
> *
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 50c02de..283ab14 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -129,6 +129,7 @@
> #include <linux/inetdevice.h>
> #include <linux/cpu_rmap.h>
> #include <linux/static_key.h>
> +#include <linux/hashtable.h>
>
> #include "net-sysfs.h"
>
> @@ -166,6 +167,10 @@ static struct list_head offload_base __read_mostly;
> DEFINE_RWLOCK(dev_base_lock);
> EXPORT_SYMBOL(dev_base_lock);
>
> +atomic_t napi_gen_id;

Not sure we need an atomic, we are protected by RTNL anyway.

> +
> +DEFINE_HASHTABLE(napi_hash, 8);
> +
> seqcount_t devnet_rename_seq;
>
> static inline void dev_base_seq_inc(struct net *net)
> @@ -4113,6 +4118,45 @@ void napi_complete(struct napi_struct *n)
> }
> EXPORT_SYMBOL(napi_complete);
>
> +void napi_hash_add(struct napi_struct *napi)
> +{
> + if (!test_and_set_bit(NAPI_STATE_HASHED, &napi->state)) {
> +
> + /* 0 is not a valid id */
> + napi->napi_id = 0;
> + while (!napi->napi_id)
> + napi->napi_id = atomic_inc_return(&napi_gen_id);
> +
> + hlist_add_head_rcu(&napi->napi_hash_node,
> + &napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
> + }
> +}
> +EXPORT_SYMBOL_GPL(napi_hash_add);
> +
> +/* Warning : caller is responsible to make sure rcu grace period
> + * is respected before freeing memory containing @napi
> + */
> +void napi_hash_del(struct napi_struct *napi)
> +{
> + if (test_and_clear_bit(NAPI_STATE_HASHED, &napi->state))
> + hlist_del_rcu(&napi->napi_hash_node);
> +}
> +EXPORT_SYMBOL_GPL(napi_hash_del);
> +
> +/* must be called under rcu_read_lock(), as we dont take a reference */
> +struct napi_struct *napi_by_id(int napi_id)
> +{
> + unsigned int hash = napi_id % HASH_SIZE(napi_hash);
> + struct napi_struct *napi;
> +
> + hlist_for_each_entry_rcu(napi, &napi_hash[hash], napi_hash_node)
> + if (napi->napi_id == napi_id)
> + return napi;
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(napi_by_id);
> +
> void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
> int (*poll)(struct napi_struct *, int), int weight)
> {
>

2013-05-28 00:35:44

by Eric Dumazet

Subject: Re: [PATCH v5 net-next 0/5] net: low latency Ethernet device polling

On Mon, 2013-05-27 at 10:43 +0300, Eliezer Tamir wrote:
> Hello Dave,
>
> There are many small changes from the last time.
> The two big changes are:
> * Skb and sk now store a napi_id instead of a pointer.
> * Very naive poll/select support. There is a dramatic improvement in both
> latency and jitter, but clearly more work needs to be done here.

Sorry I couldn't figure out how poll() was supported.

2013-05-28 00:36:57

by Eric Dumazet

Subject: Re: [PATCH v5 net-next 3/5] tcp: add TCP support for low latency receive poll.

On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:
> adds busy-poll support for TCP.
>

Really, this is a small changelog for such an addition :(

How are poll()/epoll() supported?

> Signed-off-by: Alexander Duyck <[email protected]>
> Signed-off-by: Jesse Brandeburg <[email protected]>
> Tested-by: Willem de Bruijn <[email protected]>
> Signed-off-by: Eliezer Tamir <[email protected]>
> ---
>
> net/ipv4/tcp.c | 5 +++++
> net/ipv4/tcp_input.c | 1 +
> net/ipv4/tcp_ipv4.c | 2 ++
> 3 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index d87ce72..652c75a 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -279,6 +279,7 @@
>
> #include <asm/uaccess.h>
> #include <asm/ioctls.h>
> +#include <net/ll_poll.h>
>
> int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>
> @@ -1551,6 +1552,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> struct sk_buff *skb;
> u32 urg_hole = 0;
>
> + if (sk_valid_ll(sk) && skb_queue_empty(&sk->sk_receive_queue)
> + && (sk->sk_state == TCP_ESTABLISHED))
> + sk_poll_ll(sk, nonblock);
> +
> lock_sock(sk);
>
> err = -ENOTCONN;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9579e1a..4d82939 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -74,6 +74,7 @@
> #include <linux/ipsec.h>
> #include <asm/unaligned.h>
> #include <net/netdma.h>
> +#include <net/ll_poll.h>
>


Not sure why this include is needed in this file?

This line is the only change you made to it.

> int sysctl_tcp_timestamps __read_mostly = 1;
> int sysctl_tcp_window_scaling __read_mostly = 1;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index d20ede0..35fd8bc 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -75,6 +75,7 @@
> #include <net/netdma.h>
> #include <net/secure_seq.h>
> #include <net/tcp_memcontrol.h>
> +#include <net/ll_poll.h>
>
> #include <linux/inet.h>
> #include <linux/ipv6.h>
> @@ -2011,6 +2012,7 @@ process:
> if (sk_filter(sk, skb))
> goto discard_and_relse;
>
> + sk_mark_ll(sk, skb);
> skb->dev = NULL;
>
> bh_lock_sock_nested(sk);

How is IPv6 handled?


2013-05-28 08:03:19

by Eliezer Tamir

Subject: Re: [PATCH v5 net-next 1/5] net: add napi_id and hash

On 28/05/2013 03:28, Eric Dumazet wrote:
> On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:
>> +extern void napi_hash_add(struct napi_struct *napi);
>> +
>> +/**
>> + * napi_hash_del - remove a NAPI from blobal table
>
> global

Thank you
(my typing is almost as bad as my spelling, please don't tell my mom)

>> @@ -166,6 +167,10 @@ static struct list_head offload_base __read_mostly;
>> DEFINE_RWLOCK(dev_base_lock);
>> EXPORT_SYMBOL(dev_base_lock);
>>
>> +atomic_t napi_gen_id;
>
> Not sure we need an atomic, we are protected by RTNL anyway.

With an atomic we don't need the RTNL in any of the napi_id functions.
One less thing to worry about when we try to remove the RTNL.

-Eliezer
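
The id allocation discussed above (an atomic counter where 0 is reserved as "no id") can be sketched in user-space C11. This is an assumption-laden illustration, not the kernel code: demo_gen_id and demo_next_id() are hypothetical names, and an unsigned counter is used here so that wraparound is well defined, whereas the patch uses atomic_t with an int napi_id.

```c
#include <stdatomic.h>

/* Hypothetical global generation counter; starts at 0. */
static atomic_uint demo_gen_id;

/* Return a fresh non-zero id without taking any lock.
 * atomic_fetch_add() returns the old value, so +1 gives the new one,
 * mirroring what atomic_inc_return() does in the patch.
 * If the counter wraps to 0, retry, because 0 is not a valid id.
 */
static unsigned int demo_next_id(void)
{
	unsigned int id;

	do {
		id = atomic_fetch_add(&demo_gen_id, 1) + 1;
	} while (id == 0);

	return id;
}
```

Note that this only makes id generation safe without RTNL; it does not by itself protect the hash lists against concurrent writers.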

2013-05-28 08:04:38

by Eliezer Tamir

Subject: Re: [PATCH v5 net-next 2/5] net: implement support for low latency socket polling

On 28/05/2013 03:26, Eric Dumazet wrote:
> On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:

>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 66772cf..c7c3ea6 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -281,6 +281,7 @@ struct cg_proto;
>> * @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
>> * @sk_backlog_rcv: callback to process the backlog
>> * @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
>> + * @sk_napi_id: id of the last napi context to receive data for sk
>> */
>> struct sock {
>> /*
>> @@ -399,6 +400,9 @@ struct sock {
>> int (*sk_backlog_rcv)(struct sock *sk,
>> struct sk_buff *skb);
>> void (*sk_destruct)(struct sock *sk);
>> +#ifdef CONFIG_NET_LL_RX_POLL
>> + unsigned int sk_napi_id;
>> +#endif
>> };
>
> I believe this is a bad choice for data locality.
>
> I would rather move it into the same cache line as sk_rxhash.

I will move it to right after sk_rxhash.

-Eliezer

2013-05-28 08:26:19

by Eliezer Tamir

Subject: Re: [PATCH v5 net-next 3/5] tcp: add TCP support for low latency receive poll.

On 28/05/2013 03:36, Eric Dumazet wrote:
> On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:
>> adds busy-poll support for TCP.
>>
>
> Really, this is a small changelog for such an addition :(

OK


> How are poll()/epoll() supported?

poll()/select() are done by the code added to fs/select.c in 2/5.
epoll() is not yet supported.

>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 9579e1a..4d82939 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -74,6 +74,7 @@
>> #include <linux/ipsec.h>
>> #include <asm/unaligned.h>
>> #include <net/netdma.h>
>> +#include <net/ll_poll.h>
>>
>
> Not sure why this include is needed in this file?
>
> You added nothing else but this line.

This is a mistake, a remnant from an earlier version in which
sk_mark_ll() was called at the point where we copy data to the socket.
I will remove it.

>> #include <net/netdma.h>
>> #include <net/secure_seq.h>
>> #include <net/tcp_memcontrol.h>
>> +#include <net/ll_poll.h>
>>
>> #include <linux/inet.h>
>> #include <linux/ipv6.h>
>> @@ -2011,6 +2012,7 @@ process:
>> if (sk_filter(sk, skb))
>> goto discard_and_relse;
>>
>> + sk_mark_ll(sk, skb);
>> skb->dev = NULL;
>>
>> bh_lock_sock_nested(sk);
>
> How is IPv6 handled?

IPv6 is currently not supported (it was not supported in any version of
this patch set; the POC code was in fact hard-coded for UDPv4/TCPv4).

If there is interest, I will add it; I do not think it will be complicated.
However, I would prefer to leave that for a second stage.

My main concern is that adding IPv6 will significantly increase my
testing effort, which is already 90% of what I'm spending time on.

IMHO epoll() and more robust support for select()/poll() should have
a higher priority, but I'm open to suggestions.

I would like to get what we have so far applied so more people can try
it, then work on all of the other things that we need.

Dave, I would like to hear your opinion on this, please.

-Eliezer

2013-05-28 08:29:24

by Eliezer Tamir

Subject: Re: [PATCH v5 net-next 0/5] net: low latency Ethernet device polling

On 28/05/2013 03:35, Eric Dumazet wrote:
> On Mon, 2013-05-27 at 10:43 +0300, Eliezer Tamir wrote:
>> Hello Dave,
>>
>> There are many small changes from the last time.
>> The two big changes are:
>> * Skb and sk now store a napi_id instead of a pointer.
>> * Very naive poll/select support. There is a dramatic improvement in both
>> latency and jitter, but clearly more work needs to be done here.
>
> Sorry I couldn't figure out how poll() was supported.

It's the fs/select changes in the second patch.
Should I split it into a separate patch? (It's just 7 lines.)

-Eliezer

2013-05-28 12:15:31

by Eliezer Tamir

Subject: Re: [PATCH v5 net-next 3/5] tcp: add TCP support for low latency receive poll.

On 28/05/2013 11:26, Eliezer Tamir wrote:
> On 28/05/2013 03:36, Eric Dumazet wrote:
>> On Mon, 2013-05-27 at 10:44 +0300, Eliezer Tamir wrote:
>>> #include <net/netdma.h>
>>> #include <net/secure_seq.h>
>>> #include <net/tcp_memcontrol.h>
>>> +#include <net/ll_poll.h>
>>>
>>> #include <linux/inet.h>
>>> #include <linux/ipv6.h>
>>> @@ -2011,6 +2012,7 @@ process:
>>> if (sk_filter(sk, skb))
>>> goto discard_and_relse;
>>>
>>> + sk_mark_ll(sk, skb);
>>> skb->dev = NULL;
>>>
>>> bh_lock_sock_nested(sk);
>>
>> How is IPv6 handled?

It turns out that adding TCPv6/UDPv6 is very simple.
I will add them, with a warning that I only did very limited testing.

-Eliezer

2013-05-28 13:38:08

by Eric Dumazet

Subject: Re: [PATCH v5 net-next 1/5] net: add napi_id and hash

On Tue, 2013-05-28 at 11:03 +0300, Eliezer Tamir wrote:

> With an atomic we don't need the RTNL in any of the napi_id functions.
> One less thing to worry about when we try to remove the RTNL.

OK but we'll need something to protect the lists against concurrent
insert/deletes.

A spinlock or a mutex.
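
What "something to protect the lists" could look like is sketched below in user-space C. It is a deliberately simplified model, not the kernel fix: all demo_* names are hypothetical, a pthread mutex stands in for the spinlock or mutex, and lookups also take the lock here, whereas the kernel code would keep lock-free readers under rcu_read_lock().

```c
#include <pthread.h>
#include <stddef.h>

#define DEMO_HASH_SIZE 256 /* same bucket count as DEFINE_HASHTABLE(napi_hash, 8) */

/* Hypothetical stand-in for the hashed part of struct napi_struct. */
struct demo_napi {
	unsigned int id;
	int hashed;             /* plays the role of NAPI_STATE_HASHED */
	struct demo_napi *next; /* singly linked bucket chain */
};

static struct demo_napi *demo_hash[DEMO_HASH_SIZE];
static pthread_mutex_t demo_hash_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert at the head of the bucket; writers are serialized by the lock. */
static void demo_hash_add(struct demo_napi *n)
{
	pthread_mutex_lock(&demo_hash_lock);
	if (!n->hashed) {
		struct demo_napi **head = &demo_hash[n->id % DEMO_HASH_SIZE];

		n->next = *head;
		*head = n;
		n->hashed = 1;
	}
	pthread_mutex_unlock(&demo_hash_lock);
}

/* Unlink from the bucket if present; safe to call more than once. */
static void demo_hash_del(struct demo_napi *n)
{
	pthread_mutex_lock(&demo_hash_lock);
	if (n->hashed) {
		struct demo_napi **pp = &demo_hash[n->id % DEMO_HASH_SIZE];

		while (*pp && *pp != n)
			pp = &(*pp)->next;
		if (*pp)
			*pp = n->next;
		n->hashed = 0;
	}
	pthread_mutex_unlock(&demo_hash_lock);
}

/* Look up by id; returns NULL when the id is not hashed. */
static struct demo_napi *demo_by_id(unsigned int id)
{
	struct demo_napi *n;

	pthread_mutex_lock(&demo_hash_lock);
	for (n = demo_hash[id % DEMO_HASH_SIZE]; n; n = n->next)
		if (n->id == id)
			break;
	pthread_mutex_unlock(&demo_hash_lock);
	return n;
}
```

The point of the sketch is only that two concurrent add/del calls on the same bucket never interleave their pointer updates.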


2013-05-28 13:41:18

by Eliezer Tamir

Subject: Re: [PATCH v5 net-next 1/5] net: add napi_id and hash

On 28/05/2013 16:38, Eric Dumazet wrote:
> On Tue, 2013-05-28 at 11:03 +0300, Eliezer Tamir wrote:
>
>> With an atomic we don't need the RTNL in any of the napi_id functions.
>> One less thing to worry about when we try to remove the RTNL.
>
> OK but we'll need something to protect the lists against concurrent
> insert/deletes.
>
> A spinlock or a mutex.

OK

2013-05-28 13:44:14

by Eric Dumazet

Subject: Re: [PATCH v5 net-next 3/5] tcp: add TCP support for low latency receive poll.

On Tue, 2013-05-28 at 15:15 +0300, Eliezer Tamir wrote:

> >> How is IPv6 handled?
>
> It turns out that adding TCPv6/UDPv6 is very simple.

Yep, I was about to send you the needed lines after my breakfast ;)