2012-10-29 05:43:52

by Jason Wang

Subject: [net-next v4 0/7] Multiqueue support in tuntap

Hello All:

This is an update of the multiqueue support in tuntap since V3. Please consider
merging it.

The main idea of this series is to let the tun/tap device benefit from
multiqueue network cards and multi-core hosts. Tuntap used to have a single
queue, which could become a bottleneck in a multiqueue/multi-core environment.
This series therefore allows the device to be attached to multiple sockets and
exposes them to userspace as multiple queues, one per fd. The series was
originally designed to serve as the backend for multiqueue virtio-net in KVM,
but the design is generic enough for other applications to use.

Some quick overview of the design:

- Move the socket from tun_struct to tun_file.
- Allow multiple sockets to be attached to a tun/tap device (see the usage
sketch after this list).
- Use RCU to synchronize between the data path and the system calls.
- Add two new ioctls for userspace to attach a socket to and detach it from
the device.
- API compatibility is maintained with no userspace-visible changes, so legacy
userspace that only uses one queue does not need any modification.
- A flow (rxhash) to queue table is maintained by tuntap, which chooses the txq
based on the rxq where the flow was last received.
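
As a usage sketch (not part of the series itself): one queue is created per
open of /dev/net/tun by issuing TUNSETIFF with the IFF_MULTI_QUEUE flag from
patch 4 each time. The helper name below is made up and error handling is
elided:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>

/* Open /dev/net/tun once per queue; each fd on which TUNSETIFF with
 * IFF_MULTI_QUEUE succeeds becomes one queue of the same device. */
static int tun_alloc_mq(const char *name, int queues, int *fds)
{
	struct ifreq ifr;
	int i;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;

	for (i = 0; i < queues; i++) {
		fds[i] = open("/dev/net/tun", O_RDWR);
		if (fds[i] < 0 || ioctl(fds[i], TUNSETIFF, &ifr) < 0)
			return -1;
	}
	return 0;
}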

Performance test:

Pktgen is used to generate the traffic, together with a simple userspace
program that only does the receiving (a sketch of it follows).
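
The receiver itself is not included in the post; per queue fd it amounts to no
more than a loop like the following sketch (one thread per queue):

#include <unistd.h>

/* Drain one queue fd as fast as possible; the benchmark only counts
 * packets, so the payload is discarded. */
static void *rx_loop(void *arg)
{
	int fd = *(int *)arg;
	char buf[2048];

	while (read(fd, buf, sizeof(buf)) > 0)
		;
	return NULL;
}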

#q   #threads    aggregate kpps   improvement
1q   1 thread    818 kpps         +0%
2q   2 threads   1926 kpps        +135%
3q   3 threads   2642 kpps        +223%
4q   4 threads   3536 kpps        +332%

Changes from V3:
- Rebase to net-next
- A separate RCUifying patch to simplify the review
- Add a simple "tx follows rx" policy when choosing txq
- Various bug fixes

Changes from V2:
- Rebase to the latest net-next
- Fix netdev leak when tun_attach fails
- Fix return value of TUNSETOWNER
- Purge the receive queue in socket destructor
- Enable multiqueue for tun (V1 and V2 only allowed mq to be enabled for tap)
- Add per-queue u64 statistics
- Fix wrong BUG_ON() check in tun_detach()
- Check numqueues instead of tfile[0] in tun_set_iff() to let tunctl -d work
correctly
- Set numqueues to MAX_TAP_QUEUES during tun_detach_all() to prevent further
attaching.

Changes from V1:
- Simplify the sockets array management by not leaving NULL in the slot.
- Optimize tx queue selection.
- Fix a bug in tun_detach_all()

Reference:
- V3 https://lkml.org/lkml/2012/6/25/191
- V2 http://lwn.net/Articles/459270/
- V1 http://www.mail-archive.com/[email protected]/msg59479.html

Jason Wang (7):
tuntap: log the unsigned information with %u
tuntap: move socket to tun_file
tuntap: RCUify dereferencing between tun_struct and tun_file
tuntap: introduce multiqueue flags
tuntap: multiqueue support
tuntap: add ioctl to attach or detach a file from tuntap device
tuntap: choose the txq based on rxq

drivers/net/tun.c | 845 ++++++++++++++++++++++++++++++++-----------
include/uapi/linux/if_tun.h | 5 +
2 files changed, 640 insertions(+), 210 deletions(-)


2012-10-29 05:44:07

by Jason Wang

Subject: [PATCH] tuntap: choose the txq based on rxq

This patch implements a simple multiqueue flow steering policy for tun/tap: tx
follows rx. The txq is chosen based on the rxq the flow arrived on. A flow is
identified through the rxhash of a skb, and the hash-to-queue mapping is
recorded in an hlist, with an ageing timer to retire stale mappings. A mapping
is created when tun receives a packet from userspace, and is queried in
.ndo_select_queue().
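
As an illustration only (this is not the patch's code, which uses per-bucket
hlists, RCU, a kmem_cache and a garbage-collection timer), the table behaves
like this direct-mapped user-space model:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ENTRIES	1024	/* mirrors TUN_NUM_FLOW_ENTRIES */
#define EXPIRE	3	/* seconds, mirrors TUN_FLOW_EXPIRE of 3*HZ */

struct flow { uint32_t rxhash; int queue; time_t updated; };
static struct flow table[ENTRIES];

/* called when tun receives a packet from userspace */
static void flow_update(uint32_t hash, int queue)
{
	struct flow *f = &table[hash & (ENTRIES - 1)];

	f->rxhash = hash;
	f->queue = queue;
	f->updated = time(NULL);
}

/* called from .ndo_select_queue(); -1 means fall back to hashing */
static int flow_lookup(uint32_t hash)
{
	struct flow *f = &table[hash & (ENTRIES - 1)];

	if (f->rxhash == hash && time(NULL) - f->updated < EXPIRE)
		return f->queue;
	return -1;
}

int main(void)
{
	flow_update(0xabcd1234u, 2);	/* flow last received on queue 2 */
	printf("txq = %d\n", flow_lookup(0xabcd1234u));	/* prints 2 */
	return 0;
}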

I ran concurrent TCP_CRR tests and didn't see any of the mapping manipulation
helpers in perf top, so the overhead should be negligible.

Signed-off-by: Jason Wang <[email protected]>
---

Changes from last version:

- Add the missing barrier to fix the module unloading race.
- Add a TODO in tun_flow_update() and a comment on holding the rcu read lock in
tun_select_queue()
- Use a macro to define the ageing time.
- Don't trigger the timer when there's no flow in use.

drivers/net/tun.c | 239 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 236 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 78dcda8..158ef1d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -115,6 +115,8 @@ struct tap_filter {
* to match a queue per guest CPU. */
#define MAX_TAP_QUEUES 1024

+#define TUN_FLOW_EXPIRE (3 * HZ)
+
/* A tun_file connects an open character device to a tuntap netdevice. It
* also contains all socket related structures (except sock_fprog and tap_filter)
* to serve as one transmit queue for tuntap device. The sock_fprog and
@@ -138,6 +140,18 @@ struct tun_file {
u16 queue_index;
};

+struct tun_flow_entry {
+ struct hlist_node hash_link;
+ struct rcu_head rcu;
+ struct tun_struct *tun;
+
+ u32 rxhash;
+ int queue_index;
+ unsigned long updated;
+};
+
+#define TUN_NUM_FLOW_ENTRIES 1024
+
/* Since the socket was moved to tun_file, to preserve the behavior of a persist
* device, the socket filter, sndbuf and vnet header size are restored when the
* file is attached to a persist device.
@@ -163,8 +177,176 @@ struct tun_struct {
#ifdef TUN_DEBUG
int debug;
#endif
+ spinlock_t lock;
+ struct kmem_cache *flow_cache;
+ struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
+ struct timer_list flow_gc_timer;
+ unsigned long ageing_time;
};

+static inline u32 tun_hashfn(u32 rxhash)
+{
+ return rxhash & 0x3ff;
+}
+
+static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 rxhash)
+{
+ struct tun_flow_entry *e;
+ struct hlist_node *n;
+
+ hlist_for_each_entry_rcu(e, n, head, hash_link) {
+ if (e->rxhash == rxhash)
+ return e;
+ }
+ return NULL;
+}
+
+static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
+ struct hlist_head *head,
+ u32 rxhash, u16 queue_index)
+{
+ struct tun_flow_entry *e = kmem_cache_alloc(tun->flow_cache,
+ GFP_ATOMIC);
+ if (e) {
+ tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n",
+ rxhash, queue_index);
+ e->updated = jiffies;
+ e->rxhash = rxhash;
+ e->queue_index = queue_index;
+ e->tun = tun;
+ hlist_add_head_rcu(&e->hash_link, head);
+ }
+ return e;
+}
+
+static void tun_flow_free(struct rcu_head *head)
+{
+ struct tun_flow_entry *e
+ = container_of(head, struct tun_flow_entry, rcu);
+ kmem_cache_free(e->tun->flow_cache, e);
+}
+
+static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e)
+{
+ tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n",
+ e->rxhash, e->queue_index);
+ hlist_del_rcu(&e->hash_link);
+ call_rcu(&e->rcu, tun_flow_free);
+}
+
+/* caller must hold tun->lock */
+static int tun_delete_by_rxhash(struct tun_struct *tun, u32 rxhash)
+{
+ struct hlist_head *head = &tun->flows[tun_hashfn(rxhash)];
+ struct tun_flow_entry *e = tun_flow_find(head, rxhash);
+
+ if (!e)
+ return -ENOENT;
+
+ tun_flow_delete(tun, e);
+ return 0;
+}
+
+static void tun_flow_flush(struct tun_struct *tun)
+{
+ int i;
+
+ spin_lock_bh(&tun->lock);
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
+ struct tun_flow_entry *e;
+ struct hlist_node *h, *n;
+
+ hlist_for_each_entry_safe(e, h, n, &tun->flows[i], hash_link)
+ tun_flow_delete(tun, e);
+ }
+ spin_unlock_bh(&tun->lock);
+}
+
+static void tun_flow_delete_by_queue(struct tun_struct *tun, u16 queue_index)
+{
+ int i;
+
+ spin_lock_bh(&tun->lock);
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
+ struct tun_flow_entry *e;
+ struct hlist_node *h, *n;
+
+ hlist_for_each_entry_safe(e, h, n, &tun->flows[i], hash_link) {
+ if (e->queue_index == queue_index)
+ tun_flow_delete(tun, e);
+ }
+ }
+ spin_unlock_bh(&tun->lock);
+}
+
+static void tun_flow_cleanup(unsigned long data)
+{
+ struct tun_struct *tun = (struct tun_struct *)data;
+ unsigned long delay = tun->ageing_time;
+ unsigned long next_timer = jiffies + delay;
+ unsigned long count = 0;
+ int i;
+
+ tun_debug(KERN_INFO, tun, "tun_flow_cleanup\n");
+
+ spin_lock_bh(&tun->lock);
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
+ struct tun_flow_entry *e;
+ struct hlist_node *h, *n;
+
+ hlist_for_each_entry_safe(e, h, n, &tun->flows[i], hash_link) {
+ unsigned long this_timer;
+ count++;
+ this_timer = e->updated + delay;
+ if (time_before_eq(this_timer, jiffies))
+ tun_flow_delete(tun, e);
+ else if (time_before(this_timer, next_timer))
+ next_timer = this_timer;
+ }
+ }
+
+ if (count)
+ mod_timer(&tun->flow_gc_timer, round_jiffies_up(next_timer));
+ spin_unlock_bh(&tun->lock);
+}
+
+static void tun_flow_update(struct tun_struct *tun, struct sk_buff *skb,
+ u16 queue_index)
+{
+ struct hlist_head *head;
+ struct tun_flow_entry *e;
+ unsigned long delay = tun->ageing_time;
+ u32 rxhash = skb_get_rxhash(skb);
+
+ if (!rxhash)
+ return;
+ else
+ head = &tun->flows[tun_hashfn(rxhash)];
+
+ rcu_read_lock();
+
+ if (tun->numqueues == 1)
+ goto unlock;
+
+ e = tun_flow_find(head, rxhash);
+ if (likely(e)) {
+ /* TODO: keep queueing to old queue until it's empty? */
+ e->queue_index = queue_index;
+ e->updated = jiffies;
+ } else {
+ spin_lock_bh(&tun->lock);
+ if (!tun_flow_find(head, rxhash))
+ tun_flow_create(tun, head, rxhash, queue_index);
+
+ if (!timer_pending(&tun->flow_gc_timer))
+ mod_timer(&tun->flow_gc_timer,
+ round_jiffies_up(jiffies + delay));
+ spin_unlock_bh(&tun->lock);
+ }
+
+unlock:
+ rcu_read_unlock();
+}
+
/* We try to identify a flow through its rxhash first. The reason that
* we do not check rxq no. is because some cards (e.g. the 82599) choose
* the rxq based on the txq where the last packet of the flow comes. As
@@ -175,6 +357,7 @@ struct tun_struct {
static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)
{
struct tun_struct *tun = netdev_priv(dev);
+ struct tun_flow_entry *e;
u32 txq = 0;
u32 numqueues = 0;

@@ -183,8 +366,12 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)

txq = skb_get_rxhash(skb);
if (txq) {
- /* use multiply and shift instead of expensive divide */
- txq = ((u64)txq * numqueues) >> 32;
+ e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
+ if (e)
+ txq = e->queue_index;
+ else
+ /* use multiply and shift instead of expensive divide */
+ txq = ((u64)txq * numqueues) >> 32;
} else if (likely(skb_rx_queue_recorded(skb))) {
txq = skb_get_rx_queue(skb);
while (unlikely(txq >= numqueues))
@@ -234,6 +421,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
sock_put(&tfile->sk);

synchronize_net();
+ tun_flow_delete_by_queue(tun, tun->numqueues + 1);
/* Drop read queue */
skb_queue_purge(&tfile->sk.sk_receive_queue);
tun_set_real_num_queues(tun);
@@ -629,6 +817,37 @@ static const struct net_device_ops tap_netdev_ops = {
#endif
};

+static int tun_flow_init(struct tun_struct *tun)
+{
+ int i;
+
+ tun->flow_cache = kmem_cache_create("tun_flow_cache",
+ sizeof(struct tun_flow_entry), 0, 0,
+ NULL);
+ if (!tun->flow_cache)
+ return -ENOMEM;
+
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++)
+ INIT_HLIST_HEAD(&tun->flows[i]);
+
+ tun->ageing_time = TUN_FLOW_EXPIRE;
+ setup_timer(&tun->flow_gc_timer, tun_flow_cleanup, (unsigned long)tun);
+ mod_timer(&tun->flow_gc_timer,
+ round_jiffies_up(jiffies + tun->ageing_time));
+
+ return 0;
+}
+
+static void tun_flow_uninit(struct tun_struct *tun)
+{
+ del_timer_sync(&tun->flow_gc_timer);
+ tun_flow_flush(tun);
+
+ /* Wait for completion of call_rcu()'s */
+ rcu_barrier();
+ kmem_cache_destroy(tun->flow_cache);
+}
+
/* Initialize net device. */
static void tun_net_init(struct net_device *dev)
{
@@ -973,6 +1192,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
tun->dev->stats.rx_packets++;
tun->dev->stats.rx_bytes += len;

+ tun_flow_update(tun, skb, tfile->queue_index);
return total_len;
}

@@ -1150,6 +1370,14 @@ out:
return ret;
}

+static void tun_free_netdev(struct net_device *dev)
+{
+ struct tun_struct *tun = netdev_priv(dev);
+
+ tun_flow_uninit(tun);
+ free_netdev(dev);
+}
+
static void tun_setup(struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
@@ -1158,7 +1386,7 @@ static void tun_setup(struct net_device *dev)
tun->group = INVALID_GID;

dev->ethtool_ops = &tun_ethtool_ops;
- dev->destructor = free_netdev;
+ dev->destructor = tun_free_netdev;
}

/* Trivial set of netlink ops to allow deleting tun or tap
@@ -1381,10 +1609,15 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
tun->filter_attached = false;
tun->sndbuf = tfile->socket.sk->sk_sndbuf;

+ spin_lock_init(&tun->lock);
+
security_tun_dev_post_create(&tfile->sk);

tun_net_init(dev);

+ if (tun_flow_init(tun))
+ goto err_free_dev;
+
dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
TUN_USER_FEATURES;
dev->features = dev->hw_features;
--
1.7.1

2012-10-29 05:44:21

by Jason Wang

Subject: [net-next v4 2/7] tuntap: move socket to tun_file

Current tuntap uses the socket receive queue as its tx queue. To implement
multiple tx queues for tuntap, and to enable adding and removing queues while a
workload is running, the first step is to move the socket related structures to
tun_file. Multiple fds/sockets can then be attached to the tuntap device.

This patch removes tun_sock and moves the socket related structures from
tun_sock or tun_struct to tun_file. Two exceptions are tap_filter and
sock_fprog: they are still kept in tun_struct since they are used to filter
packets for the net device rather than for a specific transmit queue (at least
I see no requirement for per-queue filtering). After these changes, the socket
is created and destroyed at file open and close (instead of at device creation
and destruction), and the socket structures are dereferenced from tun_file
instead of from tun_struct itself.
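
Recovering the tun_file from its embedded sock is the standard container_of()
trick. A stand-alone user-space illustration, with stand-in structs rather
than the kernel definitions:

#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sock { int refcnt; };			/* stand-in */
struct tun_file { struct sock sk; int queue_index; };

int main(void)
{
	struct tun_file tfile = { .sk = { 1 }, .queue_index = 3 };
	struct sock *sk = &tfile.sk;	/* what the socket callbacks get */
	struct tun_file *t = container_of(sk, struct tun_file, sk);

	printf("queue_index = %d\n", t->queue_index);	/* prints 3 */
	return 0;
}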

For a persistent device, since we purge the queue during detaching and don't
queue any packets while no interface is attached, there is no behavioral change
before and after this patch, so the change is transparent to userspace. To keep
attributes such as the sndbuf, socket filter and vnet header size, these are
re-initialized after a new file is attached to a persistent device.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/tun.c | 266 +++++++++++++++++++++++++++++------------------------
1 files changed, 145 insertions(+), 121 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ef13cf0..e8cedb0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -110,14 +110,29 @@ struct tap_filter {
unsigned char addr[FLT_EXACT_COUNT][ETH_ALEN];
};

+/* A tun_file connects an open character device to a tuntap netdevice. It
+ * also contains all socket related structures (except sock_fprog and tap_filter)
+ * to serve as one transmit queue for tuntap device. The sock_fprog and
+ * tap_filter were kept in tun_struct since they were used for filtering for the
+ * netdevice not for a specific queue (at least I didn't see the requirement for
+ * this).
+ */
struct tun_file {
+ struct sock sk;
+ struct socket socket;
+ struct socket_wq wq;
atomic_t count;
struct tun_struct *tun;
struct net *net;
+ struct fasync_struct *fasync;
+ /* only used for fasync */
+ unsigned int flags;
};

-struct tun_sock;
-
+/* Since the socket was moved to tun_file, to preserve the behavior of a persist
+ * device, the socket filter, sndbuf and vnet header size are restored when the
+ * file is attached to a persist device.
+ */
struct tun_struct {
struct tun_file *tfile;
unsigned int flags;
@@ -128,29 +143,18 @@ struct tun_struct {
netdev_features_t set_features;
#define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
NETIF_F_TSO6|NETIF_F_UFO)
- struct fasync_struct *fasync;
-
- struct tap_filter txflt;
- struct socket socket;
- struct socket_wq wq;

int vnet_hdr_sz;
-
+ int sndbuf;
+ struct tap_filter txflt;
+ struct sock_fprog fprog;
+ /* protected by rtnl lock */
+ bool filter_attached;
#ifdef TUN_DEBUG
int debug;
#endif
};

-struct tun_sock {
- struct sock sk;
- struct tun_struct *tun;
-};
-
-static inline struct tun_sock *tun_sk(struct sock *sk)
-{
- return container_of(sk, struct tun_sock, sk);
-}
-
static int tun_attach(struct tun_struct *tun, struct file *file)
{
struct tun_file *tfile = file->private_data;
@@ -169,12 +173,19 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
goto out;

err = 0;
+
+ /* Re-attach filter when attaching to a persist device */
+ if (tun->filter_attached == true) {
+ err = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+ if (!err)
+ goto out;
+ }
tfile->tun = tun;
+ tfile->socket.sk->sk_sndbuf = tun->sndbuf;
tun->tfile = tfile;
- tun->socket.file = file;
netif_carrier_on(tun->dev);
dev_hold(tun->dev);
- sock_hold(tun->socket.sk);
+ sock_hold(&tfile->sk);
atomic_inc(&tfile->count);

out:
@@ -184,14 +195,16 @@ out:

static void __tun_detach(struct tun_struct *tun)
{
+ struct tun_file *tfile = tun->tfile;
/* Detach from net device */
netif_tx_lock_bh(tun->dev);
netif_carrier_off(tun->dev);
tun->tfile = NULL;
+ tfile->tun = NULL;
netif_tx_unlock_bh(tun->dev);

/* Drop read queue */
- skb_queue_purge(&tun->socket.sk->sk_receive_queue);
+ skb_queue_purge(&tfile->socket.sk->sk_receive_queue);

/* Drop the extra count on the net device */
dev_put(tun->dev);
@@ -350,21 +363,12 @@ static void tun_net_uninit(struct net_device *dev)
/* Inform the methods they need to stop using the dev.
*/
if (tfile) {
- wake_up_all(&tun->wq.wait);
+ wake_up_all(&tfile->wq.wait);
if (atomic_dec_and_test(&tfile->count))
__tun_detach(tun);
}
}

-static void tun_free_netdev(struct net_device *dev)
-{
- struct tun_struct *tun = netdev_priv(dev);
-
- BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED, &tun->socket.flags));
-
- sk_release_kernel(tun->socket.sk);
-}
-
/* Net device open. */
static int tun_net_open(struct net_device *dev)
{
@@ -383,11 +387,12 @@ static int tun_net_close(struct net_device *dev)
static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
+ struct tun_file *tfile = tun->tfile;

tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);

/* Drop packet if interface is not attached */
- if (!tun->tfile)
+ if (!tfile)
goto drop;

/* Drop if the filter does not like it.
@@ -396,11 +401,12 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
if (!check_filter(&tun->txflt, skb))
goto drop;

- if (tun->socket.sk->sk_filter &&
- sk_filter(tun->socket.sk, skb))
+ if (tfile->socket.sk->sk_filter &&
+ sk_filter(tfile->socket.sk, skb))
goto drop;

- if (skb_queue_len(&tun->socket.sk->sk_receive_queue) >= dev->tx_queue_len) {
+ if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
+ >= dev->tx_queue_len) {
if (!(tun->flags & TUN_ONE_QUEUE)) {
/* Normal queueing mode. */
/* Packet scheduler handles dropping of further packets. */
@@ -423,12 +429,12 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
skb_orphan(skb);

/* Enqueue packet */
- skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
+ skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);

/* Notify and wake up reader process */
- if (tun->flags & TUN_FASYNC)
- kill_fasync(&tun->fasync, SIGIO, POLL_IN);
- wake_up_interruptible_poll(&tun->wq.wait, POLLIN |
+ if (tfile->flags & TUN_FASYNC)
+ kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
+ wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
POLLRDNORM | POLLRDBAND);
return NETDEV_TX_OK;

@@ -556,11 +562,11 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
if (!tun)
return POLLERR;

- sk = tun->socket.sk;
+ sk = tfile->socket.sk;

tun_debug(KERN_INFO, tun, "tun_chr_poll\n");

- poll_wait(file, &tun->wq.wait, wait);
+ poll_wait(file, &tfile->wq.wait, wait);

if (!skb_queue_empty(&sk->sk_receive_queue))
mask |= POLLIN | POLLRDNORM;
@@ -579,11 +585,11 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)

/* prepad is the amount to reserve at front. len is length after that.
* linear is a hint as to how much to copy (usually headers). */
-static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
+static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
size_t prepad, size_t len,
size_t linear, int noblock)
{
- struct sock *sk = tun->socket.sk;
+ struct sock *sk = tfile->socket.sk;
struct sk_buff *skb;
int err;

@@ -685,9 +691,9 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
}

/* Get packet from user space buffer */
-static ssize_t tun_get_user(struct tun_struct *tun, void *msg_control,
- const struct iovec *iv, size_t total_len,
- size_t count, int noblock)
+static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
+ void *msg_control, const struct iovec *iv,
+ size_t total_len, size_t count, int noblock)
{
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
struct sk_buff *skb;
@@ -757,7 +763,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, void *msg_control,
} else
copylen = len;

- skb = tun_alloc_skb(tun, align, copylen, gso.hdr_len, noblock);
+ skb = tun_alloc_skb(tfile, align, copylen, gso.hdr_len, noblock);
if (IS_ERR(skb)) {
if (PTR_ERR(skb) != -EAGAIN)
tun->dev->stats.rx_dropped++;
@@ -862,6 +868,7 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
{
struct file *file = iocb->ki_filp;
struct tun_struct *tun = tun_get(file);
+ struct tun_file *tfile = file->private_data;
ssize_t result;

if (!tun)
@@ -869,8 +876,8 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,

tun_debug(KERN_INFO, tun, "tun_chr_write %ld\n", count);

- result = tun_get_user(tun, NULL, iv, iov_length(iv, count), count,
- file->f_flags & O_NONBLOCK);
+ result = tun_get_user(tun, tfile, NULL, iv, iov_length(iv, count),
+ count, file->f_flags & O_NONBLOCK);

tun_put(tun);
return result;
@@ -878,6 +885,7 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,

/* Put packet to the user space buffer */
static ssize_t tun_put_user(struct tun_struct *tun,
+ struct tun_file *tfile,
struct sk_buff *skb,
const struct iovec *iv, int len)
{
@@ -957,7 +965,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
return total;
}

-static ssize_t tun_do_read(struct tun_struct *tun,
+static ssize_t tun_do_read(struct tun_struct *tun,struct tun_file *tfile,
struct kiocb *iocb, const struct iovec *iv,
ssize_t len, int noblock)
{
@@ -968,12 +976,12 @@ static ssize_t tun_do_read(struct tun_struct *tun,
tun_debug(KERN_INFO, tun, "tun_chr_read\n");

if (unlikely(!noblock))
- add_wait_queue(&tun->wq.wait, &wait);
+ add_wait_queue(&tfile->wq.wait, &wait);
while (len) {
current->state = TASK_INTERRUPTIBLE;

/* Read frames from the queue */
- if (!(skb=skb_dequeue(&tun->socket.sk->sk_receive_queue))) {
+ if (!(skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue))) {
if (noblock) {
ret = -EAGAIN;
break;
@@ -993,14 +1001,14 @@ static ssize_t tun_do_read(struct tun_struct *tun,
}
netif_wake_queue(tun->dev);

- ret = tun_put_user(tun, skb, iv, len);
+ ret = tun_put_user(tun, tfile, skb, iv, len);
kfree_skb(skb);
break;
}

current->state = TASK_RUNNING;
if (unlikely(!noblock))
- remove_wait_queue(&tun->wq.wait, &wait);
+ remove_wait_queue(&tfile->wq.wait, &wait);

return ret;
}
@@ -1021,7 +1029,8 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
goto out;
}

- ret = tun_do_read(tun, iocb, iv, len, file->f_flags & O_NONBLOCK);
+ ret = tun_do_read(tun, tfile, iocb, iv, len,
+ file->f_flags & O_NONBLOCK);
ret = min_t(ssize_t, ret, len);
out:
tun_put(tun);
@@ -1036,7 +1045,7 @@ static void tun_setup(struct net_device *dev)
tun->group = INVALID_GID;

dev->ethtool_ops = &tun_ethtool_ops;
- dev->destructor = tun_free_netdev;
+ dev->destructor = free_netdev;
}

/* Trivial set of netlink ops to allow deleting tun or tap
@@ -1056,7 +1065,7 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {

static void tun_sock_write_space(struct sock *sk)
{
- struct tun_struct *tun;
+ struct tun_file *tfile;
wait_queue_head_t *wqueue;

if (!sock_writeable(sk))
@@ -1070,37 +1079,47 @@ static void tun_sock_write_space(struct sock *sk)
wake_up_interruptible_sync_poll(wqueue, POLLOUT |
POLLWRNORM | POLLWRBAND);

- tun = tun_sk(sk)->tun;
- kill_fasync(&tun->fasync, SIGIO, POLL_OUT);
-}
-
-static void tun_sock_destruct(struct sock *sk)
-{
- free_netdev(tun_sk(sk)->tun->dev);
+ tfile = container_of(sk, struct tun_file, sk);
+ kill_fasync(&tfile->fasync, SIGIO, POLL_OUT);
}

static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
- struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
- return tun_get_user(tun, m->msg_control, m->msg_iov, total_len,
- m->msg_iovlen, m->msg_flags & MSG_DONTWAIT);
+ int ret;
+ struct tun_file *tfile = container_of(sock, struct tun_file, socket);
+ struct tun_struct *tun = __tun_get(tfile);
+
+ if (!tun)
+ return -EBADFD;
+
+ ret = tun_get_user(tun, tfile, m->msg_control, m->msg_iov, total_len,
+ m->msg_iovlen, m->msg_flags & MSG_DONTWAIT);
+ tun_put(tun);
+ return ret;
}

+
static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len,
int flags)
{
- struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
+ struct tun_file *tfile = container_of(sock, struct tun_file, socket);
+ struct tun_struct *tun = __tun_get(tfile);
int ret;
+
+ if (!tun)
+ return -EBADFD;
+
if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
return -EINVAL;
- ret = tun_do_read(tun, iocb, m->msg_iov, total_len,
+ ret = tun_do_read(tun, tfile, iocb, m->msg_iov, total_len,
flags & MSG_DONTWAIT);
if (ret > total_len) {
m->msg_flags |= MSG_TRUNC;
ret = flags & MSG_TRUNC ? ret : total_len;
}
+ tun_put(tun);
return ret;
}

@@ -1121,7 +1140,7 @@ static const struct proto_ops tun_socket_ops = {
static struct proto tun_proto = {
.name = "tun",
.owner = THIS_MODULE,
- .obj_size = sizeof(struct tun_sock),
+ .obj_size = sizeof(struct tun_file),
};

static int tun_flags(struct tun_struct *tun)
@@ -1178,8 +1197,8 @@ static DEVICE_ATTR(group, 0444, tun_show_group, NULL);

static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
{
- struct sock *sk;
struct tun_struct *tun;
+ struct tun_file *tfile = file->private_data;
struct net_device *dev;
int err;

@@ -1200,7 +1219,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
(gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
!capable(CAP_NET_ADMIN))
return -EPERM;
- err = security_tun_dev_attach(tun->socket.sk);
+ err = security_tun_dev_attach(tfile->socket.sk);
if (err < 0)
return err;

@@ -1246,25 +1265,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
tun->flags = flags;
tun->txflt.count = 0;
tun->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
- set_bit(SOCK_EXTERNALLY_ALLOCATED, &tun->socket.flags);
-
- err = -ENOMEM;
- sk = sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL, &tun_proto);
- if (!sk)
- goto err_free_dev;

- sk_change_net(sk, net);
- tun->socket.wq = &tun->wq;
- init_waitqueue_head(&tun->wq.wait);
- tun->socket.ops = &tun_socket_ops;
- sock_init_data(&tun->socket, sk);
- sk->sk_write_space = tun_sock_write_space;
- sk->sk_sndbuf = INT_MAX;
- sock_set_flag(sk, SOCK_ZEROCOPY);
+ tun->filter_attached = false;
+ tun->sndbuf = tfile->socket.sk->sk_sndbuf;

- tun_sk(sk)->tun = tun;
-
- security_tun_dev_post_create(sk);
+ security_tun_dev_post_create(&tfile->sk);

tun_net_init(dev);

@@ -1274,15 +1279,13 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)

err = register_netdevice(tun->dev);
if (err < 0)
- goto err_free_sk;
+ goto err_free_dev;

if (device_create_file(&tun->dev->dev, &dev_attr_tun_flags) ||
device_create_file(&tun->dev->dev, &dev_attr_owner) ||
device_create_file(&tun->dev->dev, &dev_attr_group))
pr_err("Failed to create tun sysfs files\n");

- sk->sk_destruct = tun_sock_destruct;
-
err = tun_attach(tun, file);
if (err < 0)
goto failed;
@@ -1314,8 +1317,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
strcpy(ifr->ifr_name, tun->dev->name);
return 0;

- err_free_sk:
- tun_free_netdev(dev);
err_free_dev:
free_netdev(dev);
failed:
@@ -1379,7 +1380,6 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
struct tun_file *tfile = file->private_data;
struct tun_struct *tun;
void __user* argp = (void __user*)arg;
- struct sock_fprog fprog;
struct ifreq ifr;
kuid_t owner;
kgid_t group;
@@ -1444,11 +1444,16 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;

case TUNSETPERSIST:
- /* Disable/Enable persist mode */
- if (arg)
+ /* Disable/Enable persist mode. Keep an extra reference to the
+ * module to prevent the module from being unloaded.
+ */
+ if (arg) {
tun->flags |= TUN_PERSIST;
- else
+ __module_get(THIS_MODULE);
+ } else {
tun->flags &= ~TUN_PERSIST;
+ module_put(THIS_MODULE);
+ }

tun_debug(KERN_INFO, tun, "persist %s\n",
arg ? "enabled" : "disabled");
@@ -1526,7 +1531,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;

case TUNGETSNDBUF:
- sndbuf = tun->socket.sk->sk_sndbuf;
+ sndbuf = tfile->socket.sk->sk_sndbuf;
if (copy_to_user(argp, &sndbuf, sizeof(sndbuf)))
ret = -EFAULT;
break;
@@ -1537,7 +1542,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;
}

- tun->socket.sk->sk_sndbuf = sndbuf;
+ tun->sndbuf = tfile->socket.sk->sk_sndbuf = sndbuf;
break;

case TUNGETVNETHDRSZ:
@@ -1565,10 +1570,12 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
if ((tun->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
break;
ret = -EFAULT;
- if (copy_from_user(&fprog, argp, sizeof(fprog)))
+ if (copy_from_user(&tun->fprog, argp, sizeof(tun->fprog)))
break;

- ret = sk_attach_filter(&fprog, tun->socket.sk);
+ ret = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+ if (!ret)
+ tun->filter_attached = true;
break;

case TUNDETACHFILTER:
@@ -1576,7 +1583,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
ret = -EINVAL;
if ((tun->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
break;
- ret = sk_detach_filter(tun->socket.sk);
+ ret = sk_detach_filter(tfile->socket.sk);
+ if (!ret)
+ tun->filter_attached = false;
break;

default:
@@ -1628,27 +1637,21 @@ static long tun_chr_compat_ioctl(struct file *file,

static int tun_chr_fasync(int fd, struct file *file, int on)
{
- struct tun_struct *tun = tun_get(file);
+ struct tun_file *tfile = file->private_data;
int ret;

- if (!tun)
- return -EBADFD;
-
- tun_debug(KERN_INFO, tun, "tun_chr_fasync %d\n", on);
-
- if ((ret = fasync_helper(fd, file, on, &tun->fasync)) < 0)
+ if ((ret = fasync_helper(fd, file, on, &tfile->fasync)) < 0)
goto out;

if (on) {
ret = __f_setown(file, task_pid(current), PIDTYPE_PID, 0);
if (ret)
goto out;
- tun->flags |= TUN_FASYNC;
+ tfile->flags |= TUN_FASYNC;
} else
- tun->flags &= ~TUN_FASYNC;
+ tfile->flags &= ~TUN_FASYNC;
ret = 0;
out:
- tun_put(tun);
return ret;
}

@@ -1658,13 +1661,30 @@ static int tun_chr_open(struct inode *inode, struct file * file)

DBG1(KERN_INFO, "tunX: tun_chr_open\n");

- tfile = kmalloc(sizeof(*tfile), GFP_KERNEL);
+ tfile = (struct tun_file *)sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL,
+ &tun_proto);
if (!tfile)
return -ENOMEM;
atomic_set(&tfile->count, 0);
tfile->tun = NULL;
tfile->net = get_net(current->nsproxy->net_ns);
+ tfile->flags = 0;
+
+ rcu_assign_pointer(tfile->socket.wq, &tfile->wq);
+ init_waitqueue_head(&tfile->wq.wait);
+
+ tfile->socket.file = file;
+ tfile->socket.ops = &tun_socket_ops;
+
+ sock_init_data(&tfile->socket, &tfile->sk);
+ sk_change_net(&tfile->sk, tfile->net);
+
+ tfile->sk.sk_write_space = tun_sock_write_space;
+ tfile->sk.sk_sndbuf = INT_MAX;
+
file->private_data = tfile;
+ set_bit(SOCK_EXTERNALLY_ALLOCATED, &tfile->socket.flags);
+
return 0;
}

@@ -1672,6 +1692,7 @@ static int tun_chr_close(struct inode *inode, struct file *file)
{
struct tun_file *tfile = file->private_data;
struct tun_struct *tun;
+ struct net *net = tfile->net;

tun = __tun_get(tfile);
if (tun) {
@@ -1688,14 +1709,16 @@ static int tun_chr_close(struct inode *inode, struct file *file)
unregister_netdevice(dev);
rtnl_unlock();
}
- }

- tun = tfile->tun;
- if (tun)
- sock_put(tun->socket.sk);
+ /* drop the reference that netdevice holds */
+ sock_put(&tfile->sk);
+ }

- put_net(tfile->net);
- kfree(tfile);
+ /* drop the reference that file holds */
+ BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED,
+ &tfile->socket.flags));
+ sk_release_kernel(&tfile->sk);
+ put_net(net);

return 0;
}
@@ -1823,13 +1846,14 @@ static void tun_cleanup(void)
struct socket *tun_get_socket(struct file *file)
{
struct tun_struct *tun;
+ struct tun_file *tfile = file->private_data;
if (file->f_op != &tun_fops)
return ERR_PTR(-EINVAL);
tun = tun_get(file);
if (!tun)
return ERR_PTR(-EBADFD);
tun_put(tun);
- return &tun->socket;
+ return &tfile->socket;
}
EXPORT_SYMBOL_GPL(tun_get_socket);

--
1.7.1

2012-10-29 05:44:27

by Jason Wang

Subject: [net-next v4 3/7] tuntap: RCUify dereferencing between tun_struct and tun_file

RCU is introduced in this patch to synchronize the dereferences between
tun_struct and tun_file. All tun_{get|put} calls are replaced with RCU, and
the dereference from one structure to the other must be done under the rtnl
lock or inside an RCU read-side critical section.

This is needed by the following patches, since one of the goals of multiqueue
tuntap is to allow adding or removing queues while a workload is running.
Without RCU, the control path would have to hold tx locks when adding or
removing queues (which may cause some delay), and it would be hard to change
the number of queues without stopping the net device. With the help of RCU,
there is also no need for tun_file to hold a refcount on tun_struct.
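
A minimal user-space model of the resulting detach protocol, written against
liburcu (assumed available; link with -lurcu). It is only an illustration of
the pattern, not kernel code:

#include <stdio.h>
#include <urcu.h>

struct tun_struct { int id; };
static struct tun_struct *tun_ptr;	/* models tfile->tun */

/* models the data path: no refcount, just an RCU read-side section */
static void xmit(void)
{
	struct tun_struct *tun;

	rcu_read_lock();
	tun = rcu_dereference(tun_ptr);
	if (tun)
		printf("xmit via tun %d\n", tun->id);
	else
		printf("dropped: no tun attached\n");
	rcu_read_unlock();
}

/* models __tun_detach(): unpublish, wait for readers, then clean up */
static void detach(void)
{
	rcu_assign_pointer(tun_ptr, NULL);
	synchronize_rcu();	/* plays the role of synchronize_net() */
	/* now it is safe to purge the receive queue and drop references */
}

int main(void)
{
	static struct tun_struct tun = { .id = 0 };

	rcu_register_thread();
	rcu_assign_pointer(tun_ptr, &tun);
	xmit();			/* sees the device */
	detach();
	xmit();			/* sees NULL, drops */
	rcu_unregister_thread();
	return 0;
}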

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/tun.c | 95 ++++++++++++++++++++++++++---------------------------
1 files changed, 47 insertions(+), 48 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e8cedb0..d332cb8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -116,13 +116,16 @@ struct tap_filter {
* tap_filter were kept in tun_struct since they were used for filtering for the
* netdevice not for a specific queue (at least I didn't see the requirement for
* this).
+ *
+ * RCU usage:
+ * The tun_file and tun_struct are loosely coupled, the pointer from one to the
+ * other can only be read while rcu_read_lock or rtnl_lock is held.
*/
struct tun_file {
struct sock sk;
struct socket socket;
struct socket_wq wq;
- atomic_t count;
- struct tun_struct *tun;
+ struct tun_struct __rcu *tun;
struct net *net;
struct fasync_struct *fasync;
/* only used for fasync */
@@ -134,7 +137,7 @@ struct tun_file {
* file is attached to a persist device.
*/
struct tun_struct {
- struct tun_file *tfile;
+ struct tun_file __rcu *tfile;
unsigned int flags;
kuid_t owner;
kgid_t group;
@@ -180,13 +183,11 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
if (!err)
goto out;
}
- tfile->tun = tun;
+ rcu_assign_pointer(tfile->tun, tun);
tfile->socket.sk->sk_sndbuf = tun->sndbuf;
- tun->tfile = tfile;
+ rcu_assign_pointer(tun->tfile, tfile);
netif_carrier_on(tun->dev);
- dev_hold(tun->dev);
sock_hold(&tfile->sk);
- atomic_inc(&tfile->count);

out:
netif_tx_unlock_bh(tun->dev);
@@ -195,34 +196,29 @@ out:

static void __tun_detach(struct tun_struct *tun)
{
- struct tun_file *tfile = tun->tfile;
+ struct tun_file *tfile = rcu_dereference_protected(tun->tfile,
+ lockdep_rtnl_is_held());
/* Detach from net device */
- netif_tx_lock_bh(tun->dev);
netif_carrier_off(tun->dev);
- tun->tfile = NULL;
- tfile->tun = NULL;
- netif_tx_unlock_bh(tun->dev);
-
- /* Drop read queue */
- skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
-
- /* Drop the extra count on the net device */
- dev_put(tun->dev);
-}
+ rcu_assign_pointer(tun->tfile, NULL);
+ if (tfile) {
+ rcu_assign_pointer(tfile->tun, NULL);

-static void tun_detach(struct tun_struct *tun)
-{
- rtnl_lock();
- __tun_detach(tun);
- rtnl_unlock();
+ synchronize_net();
+ /* Drop read queue */
+ skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
+ }
}

static struct tun_struct *__tun_get(struct tun_file *tfile)
{
- struct tun_struct *tun = NULL;
+ struct tun_struct *tun;

- if (atomic_inc_not_zero(&tfile->count))
- tun = tfile->tun;
+ rcu_read_lock();
+ tun = rcu_dereference(tfile->tun);
+ if (tun)
+ dev_hold(tun->dev);
+ rcu_read_unlock();

return tun;
}
@@ -234,10 +230,7 @@ static struct tun_struct *tun_get(struct file *file)

static void tun_put(struct tun_struct *tun)
{
- struct tun_file *tfile = tun->tfile;
-
- if (atomic_dec_and_test(&tfile->count))
- tun_detach(tfile->tun);
+ dev_put(tun->dev);
}

/* TAP filtering */
@@ -358,14 +351,15 @@ static const struct ethtool_ops tun_ethtool_ops;
static void tun_net_uninit(struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
- struct tun_file *tfile = tun->tfile;
+ struct tun_file *tfile = rcu_dereference_protected(tun->tfile,
+ lockdep_rtnl_is_held());

/* Inform the methods they need to stop using the dev.
*/
if (tfile) {
wake_up_all(&tfile->wq.wait);
- if (atomic_dec_and_test(&tfile->count))
- __tun_detach(tun);
+ __tun_detach(tun);
+ synchronize_net();
}
}

@@ -387,14 +381,16 @@ static int tun_net_close(struct net_device *dev)
static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
- struct tun_file *tfile = tun->tfile;
-
- tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
+ struct tun_file *tfile;

+ rcu_read_lock();
+ tfile = rcu_dereference(tun->tfile);
/* Drop packet if interface is not attached */
if (!tfile)
goto drop;

+ tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
+
/* Drop if the filter does not like it.
* This is a noop if the filter is disabled.
* Filter can be enabled only for the TAP devices. */
@@ -436,11 +432,14 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
POLLRDNORM | POLLRDBAND);
+
+ rcu_read_unlock();
return NETDEV_TX_OK;

drop:
dev->stats.tx_dropped++;
kfree_skb(skb);
+ rcu_read_unlock();
return NETDEV_TX_OK;
}

@@ -1092,7 +1091,6 @@ static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,

if (!tun)
return -EBADFD;
-
ret = tun_get_user(tun, tfile, m->msg_control, m->msg_iov, total_len,
m->msg_iovlen, m->msg_flags & MSG_DONTWAIT);
tun_put(tun);
@@ -1665,8 +1663,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
&tun_proto);
if (!tfile)
return -ENOMEM;
- atomic_set(&tfile->count, 0);
- tfile->tun = NULL;
+ rcu_assign_pointer(tfile->tun, NULL);
tfile->net = get_net(current->nsproxy->net_ns);
tfile->flags = 0;

@@ -1694,7 +1691,9 @@ static int tun_chr_close(struct inode *inode, struct file *file)
struct tun_struct *tun;
struct net *net = tfile->net;

- tun = __tun_get(tfile);
+ rtnl_lock();
+
+ tun = rcu_dereference_protected(tfile->tun, lockdep_rtnl_is_held());
if (tun) {
struct net_device *dev = tun->dev;

@@ -1702,18 +1701,20 @@ static int tun_chr_close(struct inode *inode, struct file *file)

__tun_detach(tun);

+ synchronize_net();
+
/* If desirable, unregister the netdevice. */
if (!(tun->flags & TUN_PERSIST)) {
- rtnl_lock();
if (dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(dev);
- rtnl_unlock();
}

/* drop the reference that netdevice holds */
sock_put(&tfile->sk);
}

+ rtnl_unlock();
+
/* drop the reference that file holds */
BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED,
&tfile->socket.flags));
@@ -1845,14 +1846,12 @@ static void tun_cleanup(void)
* holding a reference to the file for as long as the socket is in use. */
struct socket *tun_get_socket(struct file *file)
{
- struct tun_struct *tun;
- struct tun_file *tfile = file->private_data;
+ struct tun_file *tfile;
if (file->f_op != &tun_fops)
return ERR_PTR(-EINVAL);
- tun = tun_get(file);
- if (!tun)
+ tfile = file->private_data;
+ if (!tfile)
return ERR_PTR(-EBADFD);
- tun_put(tun);
return &tfile->socket;
}
EXPORT_SYMBOL_GPL(tun_get_socket);
--
1.7.1

2012-10-29 05:44:37

by Jason Wang

Subject: [net-next v4 4/7] tuntap: introduce multiqueue flags

Add the flags used to create a multiqueue tuntap device.

Signed-off-by: Jason Wang <[email protected]>
---
include/uapi/linux/if_tun.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 25a585c..8ef3a87 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -34,6 +34,7 @@
#define TUN_ONE_QUEUE 0x0080
#define TUN_PERSIST 0x0100
#define TUN_VNET_HDR 0x0200
+#define TUN_TAP_MQ 0x0400

/* Ioctl defines */
#define TUNSETNOCSUM _IOW('T', 200, int)
@@ -61,6 +62,7 @@
#define IFF_ONE_QUEUE 0x2000
#define IFF_VNET_HDR 0x4000
#define IFF_TUN_EXCL 0x8000
+#define IFF_MULTI_QUEUE 0x0100

/* Features for GSO (TUNSETOFFLOAD). */
#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
--
1.7.1

2012-10-29 05:44:43

by Jason Wang

Subject: [net-next v4 5/7] tuntap: multiqueue support

This patch converts tun/tap into a multiqueue device and exposes the queues as
multiple file descriptors to userspace. Internally, each tun_file is
abstracted as one queue, and an array of pointers to tun_file structures is
stored in tun_struct, so multiple tun_files can be attached to the device as
multiple queues.

When choosing the txq, we first try to identify a flow through its rxhash; if
it does not have one, we fall back to the recorded rxq and use that to choose
the transmit queue. This policy may be changed in the future.
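
The hash-based selection maps the 32-bit rxhash onto [0, numqueues) with a
multiply and shift instead of an expensive divide; a stand-alone demonstration
of just that arithmetic (illustration only):

#include <stdio.h>
#include <stdint.h>

/* maps a 32-bit hash uniformly onto [0, numqueues) without dividing */
static uint32_t hash_to_queue(uint32_t rxhash, uint32_t numqueues)
{
	return ((uint64_t)rxhash * numqueues) >> 32;
}

int main(void)
{
	uint32_t hashes[] = { 0x00000001u, 0x7fffffffu, 0xdeadbeefu };
	unsigned int i;

	for (i = 0; i < 3; i++)
		printf("hash %08x -> queue %u of 4\n",
		       hashes[i], hash_to_queue(hashes[i], 4));
	return 0;
}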

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/tun.c | 305 +++++++++++++++++++++++++++++++++++++---------------
1 files changed, 217 insertions(+), 88 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d332cb8..59235ba 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -110,6 +110,11 @@ struct tap_filter {
unsigned char addr[FLT_EXACT_COUNT][ETH_ALEN];
};

+/* 1024 is probably a high enough limit: modern hypervisors seem to support on
+ * the order of 100-200 CPUs so this leaves us some breathing space if we want
+ * to match a queue per guest CPU. */
+#define MAX_TAP_QUEUES 1024
+
/* A tun_file connects an open character device to a tuntap netdevice. It
* also contains all socket related structures (except sock_fprog and tap_filter)
* to serve as one transmit queue for tuntap device. The sock_fprog and
@@ -130,6 +135,7 @@ struct tun_file {
struct fasync_struct *fasync;
/* only used for fasync */
unsigned int flags;
+ u16 queue_index;
};

/* Since the socket was moved to tun_file, to preserve the behavior of a persist
@@ -137,7 +143,8 @@ struct tun_file {
* file is attached to a persist device.
*/
struct tun_struct {
- struct tun_file __rcu *tfile;
+ struct tun_file __rcu *tfiles[MAX_TAP_QUEUES];
+ unsigned int numqueues;
unsigned int flags;
kuid_t owner;
kgid_t group;
@@ -158,56 +165,156 @@ struct tun_struct {
#endif
};

+/* We try to identify a flow through its rxhash first. The reason that
+ * we do not check rxq no. is because some cards (e.g. the 82599) choose
+ * the rxq based on the txq where the last packet of the flow comes. As
+ * the userspace application moves between processors, we may get a
+ * different rxq no. here. If we could not get rxhash, then we would
+ * hope the rxq no. may help here.
+ */
+static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+ struct tun_struct *tun = netdev_priv(dev);
+ u32 txq = 0;
+ u32 numqueues = 0;
+
+ rcu_read_lock();
+ numqueues = tun->numqueues;
+
+ txq = skb_get_rxhash(skb);
+ if (txq) {
+ /* use multiply and shift instead of expensive divide */
+ txq = ((u64)txq * numqueues) >> 32;
+ } else if (likely(skb_rx_queue_recorded(skb))) {
+ txq = skb_get_rx_queue(skb);
+ while (unlikely(txq >= numqueues))
+ txq -= numqueues;
+ }
+
+ rcu_read_unlock();
+ return txq;
+}
+
+static void tun_set_real_num_queues(struct tun_struct *tun)
+{
+ netif_set_real_num_tx_queues(tun->dev, tun->numqueues);
+ netif_set_real_num_rx_queues(tun->dev, tun->numqueues);
+}
+
+static void __tun_detach(struct tun_file *tfile, bool clean)
+{
+ struct tun_file *ntfile;
+ struct tun_struct *tun;
+ struct net_device *dev;
+
+ tun = rcu_dereference_protected(tfile->tun,
+ lockdep_rtnl_is_held());
+ if (tun) {
+ u16 index = tfile->queue_index;
+ BUG_ON(index >= tun->numqueues);
+ dev = tun->dev;
+
+ rcu_assign_pointer(tun->tfiles[index],
+ tun->tfiles[tun->numqueues - 1]);
+ rcu_assign_pointer(tfile->tun, NULL);
+ ntfile = rcu_dereference_protected(tun->tfiles[index],
+ lockdep_rtnl_is_held());
+ ntfile->queue_index = index;
+
+ --tun->numqueues;
+ sock_put(&tfile->sk);
+
+ synchronize_net();
+ /* Drop read queue */
+ skb_queue_purge(&tfile->sk.sk_receive_queue);
+ tun_set_real_num_queues(tun);
+
+ if (tun->numqueues == 0 && !(tun->flags & TUN_PERSIST))
+ if (dev->reg_state == NETREG_REGISTERED)
+ unregister_netdevice(dev);
+ }
+
+ if (clean) {
+ BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED,
+ &tfile->socket.flags));
+ sk_release_kernel(&tfile->sk);
+ }
+}
+
+static void tun_detach(struct tun_file *tfile, bool clean)
+{
+ rtnl_lock();
+ __tun_detach(tfile, clean);
+ rtnl_unlock();
+}
+
+static void tun_detach_all(struct net_device *dev)
+{
+ struct tun_struct *tun = netdev_priv(dev);
+ struct tun_file *tfile;
+ int i, n = tun->numqueues;
+
+ for (i = 0; i < n; i++) {
+ tfile = rcu_dereference_protected(tun->tfiles[i],
+ lockdep_rtnl_is_held());
+ BUG_ON(!tfile);
+ wake_up_all(&tfile->wq.wait);
+ rcu_assign_pointer(tfile->tun, NULL);
+ --tun->numqueues;
+ }
+ BUG_ON(tun->numqueues != 0);
+
+ synchronize_net();
+ for (i = 0; i < n; i++) {
+ tfile = rcu_dereference_protected(tun->tfiles[i],
+ lockdep_rtnl_is_held());
+ /* Drop read queue */
+ skb_queue_purge(&tfile->sk.sk_receive_queue);
+ sock_put(&tfile->sk);
+ }
+}
+
static int tun_attach(struct tun_struct *tun, struct file *file)
{
struct tun_file *tfile = file->private_data;
int err;

- ASSERT_RTNL();
-
- netif_tx_lock_bh(tun->dev);
-
err = -EINVAL;
- if (tfile->tun)
+ if (rcu_dereference_protected(tfile->tun, lockdep_rtnl_is_held()))
goto out;

err = -EBUSY;
- if (tun->tfile)
+ if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1)
+ goto out;
+
+ err = -E2BIG;
+ if (tun->numqueues == MAX_TAP_QUEUES)
goto out;

err = 0;

- /* Re-attach filter when attaching to a persist device */
+ /* Re-attach the filter to a persist device */
if (tun->filter_attached == true) {
err = sk_attach_filter(&tun->fprog, tfile->socket.sk);
if (!err)
goto out;
}
+ tfile->queue_index = tun->numqueues;
rcu_assign_pointer(tfile->tun, tun);
- tfile->socket.sk->sk_sndbuf = tun->sndbuf;
- rcu_assign_pointer(tun->tfile, tfile);
- netif_carrier_on(tun->dev);
+ rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
sock_hold(&tfile->sk);
+ tun->numqueues++;

-out:
- netif_tx_unlock_bh(tun->dev);
- return err;
-}
+ tun_set_real_num_queues(tun);

-static void __tun_detach(struct tun_struct *tun)
-{
- struct tun_file *tfile = rcu_dereference_protected(tun->tfile,
- lockdep_rtnl_is_held());
- /* Detach from net device */
- netif_carrier_off(tun->dev);
- rcu_assign_pointer(tun->tfile, NULL);
- if (tfile) {
- rcu_assign_pointer(tfile->tun, NULL);
+ if (tun->numqueues == 1)
+ netif_carrier_on(tun->dev);

- synchronize_net();
- /* Drop read queue */
- skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
- }
+ /* device is allowed to go away first, so no need to hold extra
+ * refcnt. */
+
+out:
+ return err;
}

static struct tun_struct *__tun_get(struct tun_file *tfile)
@@ -350,30 +457,20 @@ static const struct ethtool_ops tun_ethtool_ops;
/* Net device detach from fd. */
static void tun_net_uninit(struct net_device *dev)
{
- struct tun_struct *tun = netdev_priv(dev);
- struct tun_file *tfile = rcu_dereference_protected(tun->tfile,
- lockdep_rtnl_is_held());
-
- /* Inform the methods they need to stop using the dev.
- */
- if (tfile) {
- wake_up_all(&tfile->wq.wait);
- __tun_detach(tun);
- synchronize_net();
- }
+ tun_detach_all(dev);
}

/* Net device open. */
static int tun_net_open(struct net_device *dev)
{
- netif_start_queue(dev);
+ netif_tx_start_all_queues(dev);
return 0;
}

/* Net device close. */
static int tun_net_close(struct net_device *dev)
{
- netif_stop_queue(dev);
+ netif_tx_stop_all_queues(dev);
return 0;
}

@@ -381,16 +478,20 @@ static int tun_net_close(struct net_device *dev)
static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
+ int txq = skb->queue_mapping;
struct tun_file *tfile;

rcu_read_lock();
- tfile = rcu_dereference(tun->tfile);
+ tfile = rcu_dereference(tun->tfiles[txq]);
+
/* Drop packet if interface is not attached */
- if (!tfile)
+ if (txq >= tun->numqueues)
goto drop;

tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);

+ BUG_ON(!tfile);
+
/* Drop if the filter does not like it.
* This is a noop if the filter is disabled.
* Filter can be enabled only for the TAP devices. */
@@ -401,12 +502,14 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
sk_filter(tfile->socket.sk, skb))
goto drop;

+ /* Limit the number of packets queued by dividing the txq length by the
+ * number of queues. */
if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
- >= dev->tx_queue_len) {
+ >= dev->tx_queue_len / tun->numqueues){
if (!(tun->flags & TUN_ONE_QUEUE)) {
/* Normal queueing mode. */
/* Packet scheduler handles dropping of further packets. */
- netif_stop_queue(dev);
+ netif_stop_subqueue(dev, txq);

/* We won't see all dropped packets individually, so overrun
* error is more appropriate. */
@@ -495,6 +598,7 @@ static const struct net_device_ops tun_netdev_ops = {
.ndo_start_xmit = tun_net_xmit,
.ndo_change_mtu = tun_net_change_mtu,
.ndo_fix_features = tun_net_fix_features,
+ .ndo_select_queue = tun_select_queue,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = tun_poll_controller,
#endif
@@ -510,6 +614,7 @@ static const struct net_device_ops tap_netdev_ops = {
.ndo_set_rx_mode = tun_net_mclist,
.ndo_set_mac_address = eth_mac_addr,
.ndo_validate_addr = eth_validate_addr,
+ .ndo_select_queue = tun_select_queue,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = tun_poll_controller,
#endif
@@ -551,7 +656,7 @@ static void tun_net_init(struct net_device *dev)
/* Character device part */

/* Poll */
-static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
+static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
{
struct tun_file *tfile = file->private_data;
struct tun_struct *tun = __tun_get(tfile);
@@ -998,7 +1103,7 @@ static ssize_t tun_do_read(struct tun_struct *tun,struct tun_file *tfile,
schedule();
continue;
}
- netif_wake_queue(tun->dev);
+ netif_wake_subqueue(tun->dev, tfile->queue_index);

ret = tun_put_user(tun, tfile, skb, iv, len);
kfree_skb(skb);
@@ -1159,6 +1264,9 @@ static int tun_flags(struct tun_struct *tun)
if (tun->flags & TUN_VNET_HDR)
flags |= IFF_VNET_HDR;

+ if (tun->flags & TUN_TAP_MQ)
+ flags |= IFF_MULTI_QUEUE;
+
return flags;
}

@@ -1250,8 +1358,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
if (*ifr->ifr_name)
name = ifr->ifr_name;

- dev = alloc_netdev(sizeof(struct tun_struct), name,
- tun_setup);
+ dev = alloc_netdev_mqs(sizeof(struct tun_struct), name,
+ tun_setup,
+ MAX_TAP_QUEUES, MAX_TAP_QUEUES);
if (!dev)
return -ENOMEM;

@@ -1286,7 +1395,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)

err = tun_attach(tun, file);
if (err < 0)
- goto failed;
+ goto err_free_dev;
}

tun_debug(KERN_INFO, tun, "tun_set_iff\n");
@@ -1306,18 +1415,22 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
else
tun->flags &= ~TUN_VNET_HDR;

+ if (ifr->ifr_flags & IFF_MULTI_QUEUE)
+ tun->flags |= TUN_TAP_MQ;
+ else
+ tun->flags &= ~TUN_TAP_MQ;
+
/* Make sure persistent devices do not get stuck in
* xoff state.
*/
if (netif_running(tun->dev))
- netif_wake_queue(tun->dev);
+ netif_tx_wake_all_queues(tun->dev);

strcpy(ifr->ifr_name, tun->dev->name);
return 0;

err_free_dev:
free_netdev(dev);
- failed:
return err;
}

@@ -1372,6 +1485,51 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
return 0;
}

+static void tun_detach_filter(struct tun_struct *tun, int n)
+{
+ int i;
+ struct tun_file *tfile;
+
+ for (i = 0; i < n; i++) {
+ tfile = rcu_dereference_protected(tun->tfiles[i],
+ lockdep_rtnl_is_held());
+ sk_detach_filter(tfile->socket.sk);
+ }
+
+ tun->filter_attached = false;
+}
+
+static int tun_attach_filter(struct tun_struct *tun)
+{
+ int i, ret = 0;
+ struct tun_file *tfile;
+
+ for (i = 0; i < tun->numqueues; i++) {
+ tfile = rcu_dereference_protected(tun->tfiles[i],
+ lockdep_rtnl_is_held());
+ ret = sk_attach_filter(&tun->fprog, tfile->socket.sk);
+ if (ret) {
+ tun_detach_filter(tun, i);
+ return ret;
+ }
+ }
+
+ tun->filter_attached = true;
+ return ret;
+}
+
+static void tun_set_sndbuf(struct tun_struct *tun)
+{
+ struct tun_file *tfile;
+ int i;
+
+ for (i = 0; i < tun->numqueues; i++) {
+ tfile = rcu_dereference_protected(tun->tfiles[i],
+ lockdep_rtnl_is_held());
+ tfile->socket.sk->sk_sndbuf = tun->sndbuf;
+ }
+}
+
static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg, int ifreq_len)
{
@@ -1400,6 +1558,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
(unsigned int __user*)argp);
}

+ ret = 0;
rtnl_lock();

tun = __tun_get(tfile);
@@ -1540,7 +1699,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;
}

- tun->sndbuf = tfile->socket.sk->sk_sndbuf = sndbuf;
+ tun->sndbuf = sndbuf;
+ tun_set_sndbuf(tun);
break;

case TUNGETVNETHDRSZ:
@@ -1571,9 +1731,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
if (copy_from_user(&tun->fprog, argp, sizeof(tun->fprog)))
break;

- ret = sk_attach_filter(&tun->fprog, tfile->socket.sk);
- if (!ret)
- tun->filter_attached = true;
+ ret = tun_attach_filter(tun);
break;

case TUNDETACHFILTER:
@@ -1581,9 +1739,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
ret = -EINVAL;
if ((tun->flags & TUN_TYPE_MASK) != TUN_TAP_DEV)
break;
- ret = sk_detach_filter(tfile->socket.sk);
- if (!ret)
- tun->filter_attached = false;
+ ret = 0;
+ tun_detach_filter(tun, tun->numqueues);
break;

default:
@@ -1688,37 +1845,9 @@ static int tun_chr_open(struct inode *inode, struct file * file)
static int tun_chr_close(struct inode *inode, struct file *file)
{
struct tun_file *tfile = file->private_data;
- struct tun_struct *tun;
struct net *net = tfile->net;

- rtnl_lock();
-
- tun = rcu_dereference_protected(tfile->tun, lockdep_rtnl_is_held());
- if (tun) {
- struct net_device *dev = tun->dev;
-
- tun_debug(KERN_INFO, tun, "tun_chr_close\n");
-
- __tun_detach(tun);
-
- synchronize_net();
-
- /* If desirable, unregister the netdevice. */
- if (!(tun->flags & TUN_PERSIST)) {
- if (dev->reg_state == NETREG_REGISTERED)
- unregister_netdevice(dev);
- }
-
- /* drop the reference that netdevice holds */
- sock_put(&tfile->sk);
- }
-
- rtnl_unlock();
-
- /* drop the reference that file holds */
- BUG_ON(!test_bit(SOCK_EXTERNALLY_ALLOCATED,
- &tfile->socket.flags));
- sk_release_kernel(&tfile->sk);
+ tun_detach(tfile, true);
put_net(net);

return 0;
--
1.7.1

2012-10-29 05:44:52

by Jason Wang

Subject: [net-next v4 6/7] tuntap: add ioctl to attach or detach a file from tuntap device

Sometimes userspace may need to activate or deactivate a queue; this can be
done by detaching a file from, or attaching it to, the tuntap device.

This patch introduces a new ioctl, TUNSETQUEUE, which can be used to do this.
The flag IFF_ATTACH_QUEUE is introduced for attaching, and IFF_DETACH_QUEUE
for detaching (a usage sketch follows).
Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/tun.c | 56 ++++++++++++++++++++++++++++++++++++------
include/uapi/linux/if_tun.h | 3 ++
2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 59235ba..78dcda8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -195,6 +195,15 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)
return txq;
}

+static inline bool tun_not_capable(struct tun_struct *tun)
+{
+ const struct cred *cred = current_cred();
+
+ return (((uid_valid(tun->owner) && !uid_eq(cred->euid, tun->owner)) ||
+ (gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
+ !capable(CAP_NET_ADMIN));
+}
+
static void tun_set_real_num_queues(struct tun_struct *tun)
{
netif_set_real_num_tx_queues(tun->dev, tun->numqueues);
@@ -1310,8 +1319,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)

dev = __dev_get_by_name(net, ifr->ifr_name);
if (dev) {
- const struct cred *cred = current_cred();
-
if (ifr->ifr_flags & IFF_TUN_EXCL)
return -EBUSY;
if ((ifr->ifr_flags & IFF_TUN) && dev->netdev_ops == &tun_netdev_ops)
@@ -1321,9 +1328,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
else
return -EINVAL;

- if (((uid_valid(tun->owner) && !uid_eq(cred->euid, tun->owner)) ||
- (gid_valid(tun->group) && !in_egroup_p(tun->group))) &&
- !capable(CAP_NET_ADMIN))
+ if (tun_not_capable(tun))
return -EPERM;
err = security_tun_dev_attach(tfile->socket.sk);
if (err < 0)
@@ -1530,6 +1535,40 @@ static void tun_set_sndbuf(struct tun_struct *tun)
}
}

+static int tun_set_queue(struct file *file, struct ifreq *ifr)
+{
+ struct tun_file *tfile = file->private_data;
+ struct tun_struct *tun;
+ struct net_device *dev;
+ int ret = 0;
+
+ rtnl_lock();
+
+ if (ifr->ifr_flags & IFF_ATTACH_QUEUE) {
+ dev = __dev_get_by_name(tfile->net, ifr->ifr_name);
+ if (!dev) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ tun = netdev_priv(dev);
+ if (dev->netdev_ops != &tap_netdev_ops &&
+ dev->netdev_ops != &tun_netdev_ops)
+ ret = -EINVAL;
+ else if (tun_not_capable(tun))
+ ret = -EPERM;
+ else
+ ret = tun_attach(tun, file);
+ } else if (ifr->ifr_flags & IFF_DETACH_QUEUE)
+ __tun_detach(tfile, false);
+ else
+ ret = -EINVAL;
+
+unlock:
+ rtnl_unlock();
+ return ret;
+}
+
static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg, int ifreq_len)
{
@@ -1543,7 +1582,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
int vnet_hdr_sz;
int ret;

- if (cmd == TUNSETIFF || _IOC_TYPE(cmd) == 0x89) {
+ if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || _IOC_TYPE(cmd) == 0x89) {
if (copy_from_user(&ifr, argp, ifreq_len))
return -EFAULT;
} else {
@@ -1554,9 +1593,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
* This is needed because we never checked for invalid flags on
* TUNSETIFF. */
return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
- IFF_VNET_HDR,
+ IFF_VNET_HDR | IFF_MULTI_QUEUE,
(unsigned int __user*)argp);
- }
+ } else if (cmd == TUNSETQUEUE)
+ return tun_set_queue(file, &ifr);

ret = 0;
rtnl_lock();
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 8ef3a87..958497a 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -54,6 +54,7 @@
#define TUNDETACHFILTER _IOW('T', 214, struct sock_fprog)
#define TUNGETVNETHDRSZ _IOR('T', 215, int)
#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE _IOW('T', 217, int)

/* TUNSETIFF ifr flags */
#define IFF_TUN 0x0001
@@ -63,6 +64,8 @@
#define IFF_VNET_HDR 0x4000
#define IFF_TUN_EXCL 0x8000
#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400

/* Features for GSO (TUNSETOFFLOAD). */
#define TUN_F_CSUM 0x01 /* You can hand me unchecksummed packets. */
--
1.7.1

2012-10-29 05:45:00

by Jason Wang

[permalink] [raw]
Subject: [net-next v4 7/7] tuntap: choose the txq based on rxq

This patch implements a simple multiqueue flow steering policy - tx follows rx
- for tun/tap. The idea is simple: choose the txq based on the rxq the flow's
packets arrived on. A flow is identified by the rxhash of a skb, and the
hash-to-queue mapping is recorded in a hlist, with an ageing timer to retire
stale mappings. The mapping is created when tun receives a packet from
userspace, and is queried in .ndo_select_queue().

I ran a concurrent TCP_CRR test and didn't see any of the mapping manipulation
helpers in perf top, so the overhead should be negligible.
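
The heart of the policy, read as a standalone sketch (an editorial
illustration only; the lookup() callback stands in for tun_flow_find()
against the hlist in the patch below):

        #include <stdint.h>

        /*
         * Pick the txq for a packet: use the remembered queue if the rxhash
         * is in the flow table, otherwise scale the 32-bit hash into
         * [0, numqueues) with a multiply and shift instead of a divide.
         * Example: hash 0x80000000 over 4 queues gives
         * (0x80000000ULL * 4) >> 32 == 2.
         */
        uint16_t select_txq(uint32_t rxhash, uint32_t numqueues,
                            int (*lookup)(uint32_t rxhash))
        {
                int queue = lookup(rxhash);     /* < 0 when no mapping is recorded */

                if (queue >= 0)
                        return (uint16_t)queue;
                return (uint16_t)(((uint64_t)rxhash * numqueues) >> 32);
        }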

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/tun.c | 239 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 236 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 78dcda8..158ef1d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -115,6 +115,8 @@ struct tap_filter {
* to match a queue per guest CPU. */
#define MAX_TAP_QUEUES 1024

+#define TUN_FLOW_EXPIRE (3 * HZ)
+
/* A tun_file connects an open character device to a tuntap netdevice. It
* also contains all socket related structures (except sock_fprog and tap_filter)
* to serve as one transmit queue for tuntap device. The sock_fprog and
@@ -138,6 +140,18 @@ struct tun_file {
u16 queue_index;
};

+struct tun_flow_entry {
+ struct hlist_node hash_link;
+ struct rcu_head rcu;
+ struct tun_struct *tun;
+
+ u32 rxhash;
+ int queue_index;
+ unsigned long updated;
+};
+
+#define TUN_NUM_FLOW_ENTRIES 1024
+
/* Since the socket was moved to tun_file, to preserve the behavior of a
* persistent device, the socket filter, sndbuf and vnet header size are
* restored when a file is attached to a persistent device.
@@ -163,8 +177,176 @@ struct tun_struct {
#ifdef TUN_DEBUG
int debug;
#endif
+ spinlock_t lock;
+ struct kmem_cache *flow_cache;
+ struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
+ struct timer_list flow_gc_timer;
+ unsigned long ageing_time;
};

+static inline u32 tun_hashfn(u32 rxhash)
+{
+ return rxhash & 0x3ff;
+}
+
+static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 rxhash)
+{
+ struct tun_flow_entry *e;
+ struct hlist_node *n;
+
+ hlist_for_each_entry_rcu(e, n, head, hash_link) {
+ if (e->rxhash == rxhash)
+ return e;
+ }
+ return NULL;
+}
+
+static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
+ struct hlist_head *head,
+ u32 rxhash, u16 queue_index)
+{
+ struct tun_flow_entry *e = kmem_cache_alloc(tun->flow_cache,
+ GFP_ATOMIC);
+ if (e) {
+ tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n",
+ rxhash, queue_index);
+ e->updated = jiffies;
+ e->rxhash = rxhash;
+ e->queue_index = queue_index;
+ e->tun = tun;
+ hlist_add_head_rcu(&e->hash_link, head);
+ }
+ return e;
+}
+
+static void tun_flow_free(struct rcu_head *head)
+{
+ struct tun_flow_entry *e
+ = container_of(head, struct tun_flow_entry, rcu);
+ kmem_cache_free(e->tun->flow_cache, e);
+}
+
+static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e)
+{
+ tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n",
+ e->rxhash, e->queue_index);
+ hlist_del_rcu(&e->hash_link);
+ call_rcu(&e->rcu, tun_flow_free);
+}
+
+/* caller must hold tun->lock */
+static void tun_delete_by_rxhash(struct tun_struct *tun, u32 rxhash)
+{
+ struct hlist_head *head = &tun->flows[tun_hashfn(rxhash)];
+ struct tun_flow_entry *e = tun_flow_find(head, rxhash);
+
+ if (!e)
+ return;
+
+ tun_flow_delete(tun, e);
+}
+
+static void tun_flow_flush(struct tun_struct *tun)
+{
+ int i;
+
+ spin_lock_bh(&tun->lock);
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
+ struct tun_flow_entry *e;
+ struct hlist_node *h, *n;
+
+ hlist_for_each_entry_safe(e, h, n, &tun->flows[i], hash_link)
+ tun_flow_delete(tun, e);
+ }
+ spin_unlock_bh(&tun->lock);
+}
+
+static void tun_flow_delete_by_queue(struct tun_struct *tun, u16 queue_index)
+{
+ int i;
+
+ spin_lock_bh(&tun->lock);
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
+ struct tun_flow_entry *e;
+ struct hlist_node *h, *n;
+
+ hlist_for_each_entry_safe(e, h, n, &tun->flows[i], hash_link) {
+ if (e->queue_index == queue_index)
+ tun_flow_delete(tun, e);
+ }
+ }
+ spin_unlock_bh(&tun->lock);
+}
+
+static void tun_flow_cleanup(unsigned long data)
+{
+ struct tun_struct *tun = (struct tun_struct *)data;
+ unsigned long delay = tun->ageing_time;
+ unsigned long next_timer = jiffies + delay;
+ unsigned long count = 0;
+ int i;
+
+ tun_debug(KERN_INFO, tun, "tun_flow_cleanup\n");
+
+ spin_lock_bh(&tun->lock);
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
+ struct tun_flow_entry *e;
+ struct hlist_node *h, *n;
+
+ hlist_for_each_entry_safe(e, h, n, &tun->flows[i], hash_link) {
+ unsigned long this_timer;
+ count++;
+ this_timer = e->updated + delay;
+ if (time_before_eq(this_timer, jiffies))
+ tun_flow_delete(tun, e);
+ else if (time_before(this_timer, next_timer))
+ next_timer = this_timer;
+ }
+ }
+
+ if (count)
+ mod_timer(&tun->flow_gc_timer, round_jiffies_up(next_timer));
+ spin_unlock_bh(&tun->lock);
+}
+
+static void tun_flow_update(struct tun_struct *tun, struct sk_buff *skb,
+ u16 queue_index)
+{
+ struct hlist_head *head;
+ struct tun_flow_entry *e;
+ unsigned long delay = tun->ageing_time;
+ u32 rxhash = skb_get_rxhash(skb);
+
+ if (!rxhash)
+ return;
+ else
+ head = &tun->flows[tun_hashfn(rxhash)];
+
+ rcu_read_lock();
+
+ if (tun->numqueues == 1)
+ goto unlock;
+
+ e = tun_flow_find(head, rxhash);
+ if (likely(e)) {
+ /* TODO: keep queueing to old queue until it's empty? */
+ e->queue_index = queue_index;
+ e->updated = jiffies;
+ } else {
+ spin_lock_bh(&tun->lock);
+ if (!tun_flow_find(head, rxhash))
+ tun_flow_create(tun, head, rxhash, queue_index);
+
+ if (!timer_pending(&tun->flow_gc_timer))
+ mod_timer(&tun->flow_gc_timer,
+ round_jiffies_up(jiffies + delay));
+ spin_unlock_bh(&tun->lock);
+ }
+
+unlock:
+ rcu_read_unlock();
+}
+
/* We try to identify a flow through its rxhash first. The reason that
* we do not check rxq no. is because some cards (e.g. 82599) choose
* the rxq based on the txq where the last packet of the flow comes. As
@@ -175,6 +357,7 @@ struct tun_struct {
static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)
{
struct tun_struct *tun = netdev_priv(dev);
+ struct tun_flow_entry *e;
u32 txq = 0;
u32 numqueues = 0;

@@ -183,8 +366,12 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)

txq = skb_get_rxhash(skb);
if (txq) {
- /* use multiply and shift instead of expensive divide */
- txq = ((u64)txq * numqueues) >> 32;
+ e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
+ if (e)
+ txq = e->queue_index;
+ else
+ /* use multiply and shift instead of expensive divide */
+ txq = ((u64)txq * numqueues) >> 32;
} else if (likely(skb_rx_queue_recorded(skb))) {
txq = skb_get_rx_queue(skb);
while (unlikely(txq >= numqueues))
@@ -234,6 +421,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
sock_put(&tfile->sk);

synchronize_net();
+ tun_flow_delete_by_queue(tun, tun->numqueues + 1);
/* Drop read queue */
skb_queue_purge(&tfile->sk.sk_receive_queue);
tun_set_real_num_queues(tun);
@@ -629,6 +817,37 @@ static const struct net_device_ops tap_netdev_ops = {
#endif
};

+static int tun_flow_init(struct tun_struct *tun)
+{
+ int i;
+
+ tun->flow_cache = kmem_cache_create("tun_flow_cache",
+ sizeof(struct tun_flow_entry), 0, 0,
+ NULL);
+ if (!tun->flow_cache)
+ return -ENOMEM;
+
+ for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++)
+ INIT_HLIST_HEAD(&tun->flows[i]);
+
+ tun->ageing_time = TUN_FLOW_EXPIRE;
+ setup_timer(&tun->flow_gc_timer, tun_flow_cleanup, (unsigned long)tun);
+ mod_timer(&tun->flow_gc_timer,
+ round_jiffies_up(jiffies + tun->ageing_time));
+
+ return 0;
+}
+
+static void tun_flow_uninit(struct tun_struct *tun)
+{
+ del_timer_sync(&tun->flow_gc_timer);
+ tun_flow_flush(tun);
+
+ /* Wait for completion of call_rcu()'s */
+ rcu_barrier();
+ kmem_cache_destroy(tun->flow_cache);
+}
+
/* Initialize net device. */
static void tun_net_init(struct net_device *dev)
{
@@ -973,6 +1192,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
tun->dev->stats.rx_packets++;
tun->dev->stats.rx_bytes += len;

+ tun_flow_update(tun, skb, tfile->queue_index);
return total_len;
}

@@ -1150,6 +1370,14 @@ out:
return ret;
}

+static void tun_free_netdev(struct net_device *dev)
+{
+ struct tun_struct *tun = netdev_priv(dev);
+
+ tun_flow_uninit(tun);
+ free_netdev(dev);
+}
+
static void tun_setup(struct net_device *dev)
{
struct tun_struct *tun = netdev_priv(dev);
@@ -1158,7 +1386,7 @@ static void tun_setup(struct net_device *dev)
tun->group = INVALID_GID;

dev->ethtool_ops = &tun_ethtool_ops;
- dev->destructor = free_netdev;
+ dev->destructor = tun_free_netdev;
}

/* Trivial set of netlink ops to allow deleting tun or tap
@@ -1381,10 +1609,15 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
tun->filter_attached = false;
tun->sndbuf = tfile->socket.sk->sk_sndbuf;

+ spin_lock_init(&tun->lock);
+
security_tun_dev_post_create(&tfile->sk);

tun_net_init(dev);

+ if (tun_flow_init(tun))
+ goto err_free_dev;
+
dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
TUN_USER_FEATURES;
dev->features = dev->hw_features;
--
1.7.1

2012-10-29 05:46:45

by Jason Wang

[permalink] [raw]
Subject: [net-next v4 1/7] tuntap: log the unsigned information with %u
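
For context: uids, gids and ioctl command numbers are unsigned, and a %d
conversion reinterprets values above INT_MAX as negative. A minimal userspace
analogue of the problem (an example added editorially, independent of the
driver):

        #include <stdio.h>

        int main(void)
        {
                unsigned int uid = 3000000000u; /* a uid above INT_MAX */

                printf("%d\n", uid);    /* prints -1294967296: misleading */
                printf("%u\n", uid);    /* prints 3000000000: correct */
                return 0;
        }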

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/tun.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 0873cdc..ef13cf0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1422,7 +1422,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
if (!tun)
goto unlock;

- tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
+ tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %u\n", cmd);

ret = 0;
switch (cmd) {
@@ -1462,7 +1462,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;
}
tun->owner = owner;
- tun_debug(KERN_INFO, tun, "owner set to %d\n",
+ tun_debug(KERN_INFO, tun, "owner set to %u\n",
from_kuid(&init_user_ns, tun->owner));
break;

@@ -1474,7 +1474,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;
}
tun->group = group;
- tun_debug(KERN_INFO, tun, "group set to %d\n",
+ tun_debug(KERN_INFO, tun, "group set to %u\n",
from_kgid(&init_user_ns, tun->group));
break;

--
1.7.1

2012-10-29 06:00:57

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH] tuntap: choose the txq based on rxq

On 10/29/2012 01:35 PM, Jason Wang wrote:
> This patch implements a simple multiqueue flow steering policy - tx follows rx
> - for tun/tap. The idea is simple: choose the txq based on the rxq the flow's
> packets arrived on. A flow is identified by the rxhash of a skb, and the
> hash-to-queue mapping is recorded in a hlist, with an ageing timer to retire
> stale mappings. The mapping is created when tun receives a packet from
> userspace, and is queried in .ndo_select_queue().
>
> I ran a concurrent TCP_CRR test and didn't see any of the mapping manipulation
> helpers in perf top, so the overhead should be negligible.
>
> Signed-off-by: Jason Wang <[email protected]>

Sorry, I sent the wrong patch; please ignore it.

2012-10-29 06:07:49

by David Miller

[permalink] [raw]
Subject: Re: [net-next v4 0/7] Multiqueue support in tuntap


Many sites rejected your emails because they had two To: headers.

Please repost this without that problem so people can actually
receive it.

Thanks.

2012-10-29 06:13:46

by Jason Wang

[permalink] [raw]
Subject: Re: [net-next v4 0/7] Multiqueue support in tuntap

On 10/29/2012 02:07 PM, David Miller wrote:
> Many sites rejected your emails because they had two To: headers.
>
> Please repost this without that problem so people can actually
> receive it.
>
> Thanks.

Sorry about that, will repost soon.

2012-10-30 23:53:43

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [net-next v4 0/7] Multiqueue support in tuntap

I am testing BQL for tuntap.
It wouldn't be hard to do BQL in the multi-queue version.
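
For reference, the stock BQL helpers would slot into tun's per-queue path
roughly as follows. This is only a sketch; the call sites named in the
comments are assumptions about where the hooks would go, not merged code:

        #include <linux/netdevice.h>
        #include <linux/skbuff.h>

        /* Report bytes queued at transmit time, e.g. from tun's .ndo_start_xmit. */
        static void tun_bql_sent(struct net_device *dev, struct sk_buff *skb,
                                 u16 queue_index)
        {
                struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_index);

                netdev_tx_sent_queue(txq, skb->len);
        }

        /* Report completion once userspace has consumed the skb, e.g. after
         * the copy to userspace in tun_do_read(). */
        static void tun_bql_done(struct net_device *dev, struct sk_buff *skb,
                                 u16 queue_index)
        {
                struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_index);

                netdev_tx_completed_queue(txq, 1, skb->len);
        }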

2012-11-01 05:04:51

by Jason Wang

[permalink] [raw]
Subject: Re: [net-next v4 0/7] Multiqueue support in tuntap

On 10/31/2012 07:52 AM, Stephen Hemminger wrote:
> I am testing BQL for tuntap.
> It wouldn't be hard to do BQL in the multi-queue version.

Yes, if BQL for tuntap goes in first, I will rebase and convert it to the
multiqueue version.

Thanks