Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755900Ab0DNOzo (ORCPT ); Wed, 14 Apr 2010 10:55:44 -0400 Received: from moutng.kundenserver.de ([212.227.17.9]:49514 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755832Ab0DNOzl (ORCPT ); Wed, 14 Apr 2010 10:55:41 -0400 From: Arnd Bergmann To: xiaohui.xin@intel.com Subject: Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net. Date: Wed, 14 Apr 2010 16:55:21 +0200 User-Agent: KMail/1.12.2 (Linux/2.6.31-19-generic; KDE/4.3.2; x86_64; ; ) Cc: netdev@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu, davem@davemloft.net, jdike@linux.intel.com References: <1270805865-16901-1-git-send-email-xiaohui.xin@intel.com> <1270805865-16901-2-git-send-email-xiaohui.xin@intel.com> In-Reply-To: <1270805865-16901-2-git-send-email-xiaohui.xin@intel.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201004141655.21885.arnd@arndb.de> X-Provags-ID: V01U2FsdGVkX1/jpBcR0zu/RgPNCTjPnfiFSqiE78KJXm2yYpl +VD75n00l0zvVwJEKgBGGBfSCuvoG+aLOzYi/nRhnsbYpk1HSo WZym7cIP+NXFEf9mDofOw== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7026 Lines: 228 On Friday 09 April 2010, xiaohui.xin@intel.com wrote: > From: Xin Xiaohui > > Add a device to utilize the vhost-net backend driver for > copy-less data transfer between guest FE and host NIC. > It pins the guest user space to the host memory and > provides proto_ops as sendmsg/recvmsg to vhost-net. Sorry for taking so long before finding the time to look at your code in more detail. It seems that you are duplicating a lot of functionality that is already in macvtap. I've asked about this before but then didn't look at your newer versions. Can you explain the value of introducing another interface to user land? I'm still planning to add zero-copy support to macvtap, hopefully reusing parts of your code, but do you think there is value in having both? > diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c > new file mode 100644 > index 0000000..86d2525 > --- /dev/null > +++ b/drivers/vhost/mpassthru.c > @@ -0,0 +1,1264 @@ > + > +#ifdef MPASSTHRU_DEBUG > +static int debug; > + > +#define DBG if (mp->debug) printk > +#define DBG1 if (debug == 2) printk > +#else > +#define DBG(a...) > +#define DBG1(a...) > +#endif This should probably just use the existing dev_dbg/pr_debug infrastructure. > [... skipping buffer management code for now] > +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock, > + struct msghdr *m, size_t total_len) > +{ > [...] This function looks like we should be able to easily include it into macvtap and get zero-copy transmits without introducing the new user-level interface. > +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock, > + struct msghdr *m, size_t total_len, > + int flags) > +{ > + struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp; > + struct page_ctor *ctor; > + struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private); It smells like a layering violation to look at the iocb->private field from a lower-level driver. I would have hoped that it's possible to implement this without having this driver know about the higher-level vhost driver internals. Can you explain why this is needed? > + spin_lock_irqsave(&ctor->read_lock, flag); > + list_add_tail(&info->list, &ctor->readq); > + spin_unlock_irqrestore(&ctor->read_lock, flag); > + > + if (!vq->receiver) { > + vq->receiver = mp_recvmsg_notify; > + set_memlock_rlimit(ctor, RLIMIT_MEMLOCK, > + vq->num * 4096, > + vq->num * 4096); > + } > + > + return 0; > +} Not sure what I'm missing, but who calls the vq->receiver? This seems to be neither in the upstream version of vhost nor introduced by your patch. > +static void __mp_detach(struct mp_struct *mp) > +{ > + mp->mfile = NULL; > + > + mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP); > + page_ctor_detach(mp); > + mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP); > + > + /* Drop the extra count on the net device */ > + dev_put(mp->dev); > +} > + > +static DEFINE_MUTEX(mp_mutex); > + > +static void mp_detach(struct mp_struct *mp) > +{ > + mutex_lock(&mp_mutex); > + __mp_detach(mp); > + mutex_unlock(&mp_mutex); > +} > + > +static void mp_put(struct mp_file *mfile) > +{ > + if (atomic_dec_and_test(&mfile->count)) > + mp_detach(mfile->mp); > +} > + > +static int mp_release(struct socket *sock) > +{ > + struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp; > + struct mp_file *mfile = mp->mfile; > + > + mp_put(mfile); > + sock_put(mp->socket.sk); > + put_net(mfile->net); > + > + return 0; > +} Doesn't this prevent the underlying interface from going away while the chardev is open? You also have logic to handle that case, so why do you keep the extra reference on the netdev? > +/* Ops structure to mimic raw sockets with mp device */ > +static const struct proto_ops mp_socket_ops = { > + .sendmsg = mp_sendmsg, > + .recvmsg = mp_recvmsg, > + .release = mp_release, > +}; > +static int mp_chr_open(struct inode *inode, struct file * file) > +{ > + struct mp_file *mfile; > + cycle_kernel_lock(); I don't think you really want to use the BKL here, just kill that line. > +static long mp_chr_ioctl(struct file *file, unsigned int cmd, > + unsigned long arg) > +{ > + struct mp_file *mfile = file->private_data; > + struct mp_struct *mp; > + struct net_device *dev; > + void __user* argp = (void __user *)arg; > + struct ifreq ifr; > + struct sock *sk; > + int ret; > + > + ret = -EINVAL; > + > + switch (cmd) { > + case MPASSTHRU_BINDDEV: > + ret = -EFAULT; > + if (copy_from_user(&ifr, argp, sizeof ifr)) > + break; This is broken for 32 bit compat mode ioctls, because struct ifreq is different between 32 and 64 bit systems. Since you are only using the device name anyway, a fixed length string or just the interface index would be simpler and work better. > + ifr.ifr_name[IFNAMSIZ-1] = '\0'; > + > + ret = -EBUSY; > + > + if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL) > + break; Your current use of the IFF_MPASSTHRU* flags does not seem to make any sense whatsoever. You check that this flag is never set, but set it later yourself and then ignore all flags. > + ret = -ENODEV; > + dev = dev_get_by_name(mfile->net, ifr.ifr_name); > + if (!dev) > + break; There is no permission checking on who can access what device, which seems a bit simplistic. Any user that has access to the mpassthru device seems to be able to bind to any network interface in the namespace. This is one point where the macvtap model seems more appropriate, it separates the permissions for creating logical interfaces and using them. > +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov, > + unsigned long count, loff_t pos) > +{ > + struct file *file = iocb->ki_filp; > + struct mp_struct *mp = mp_get(file->private_data); > + struct sock *sk = mp->socket.sk; > + struct sk_buff *skb; > + int len, err; > + ssize_t result; Can you explain what this function is even there for? AFAICT, vhost-net doesn't call it, the interface is incompatible with the existing tap interface, and you don't provide a read function. > diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h > new file mode 100644 > index 0000000..2be21c5 > --- /dev/null > +++ b/include/linux/mpassthru.h > @@ -0,0 +1,29 @@ > +#ifndef __MPASSTHRU_H > +#define __MPASSTHRU_H > + > +#include > +#include > + > +/* ioctl defines */ > +#define MPASSTHRU_BINDDEV _IOW('M', 213, int) > +#define MPASSTHRU_UNBINDDEV _IOW('M', 214, int) These definitions are slightly wrong, because you pass more than just an 'int'. > +/* MPASSTHRU ifc flags */ > +#define IFF_MPASSTHRU 0x0001 > +#define IFF_MPASSTHRU_EXCL 0x0002 As mentioned above, these flags don't make any sense with your current code. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/