From: "Xin, Xiaohui"
To: "Michael S. Tsirkin"
CC: "netdev@vger.kernel.org", "kvm@vger.kernel.org", "linux-kernel@vger.kernel.org", "mingo@elte.hu", "jdike@linux.intel.com", "davem@davemloft.net"
Date: Mon, 19 Apr 2010 18:05:17 +0800
Subject: RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
References: <1270193100-6769-1-git-send-email-xiaohui.xin@intel.com> <20100414152519.GA10792@redhat.com> <97F6D3BD476C464182C1B7BABF0B0AF5C18969CC@shzsmsx502.ccr.corp.intel.com> <20100415100546.GA17035@redhat.com>
In-Reply-To: <20100415100546.GA17035@redhat.com>

> Michael,
> >>> The idea is simple, just to pin the guest VM user space and then
> >>> let the host NIC driver have the chance to directly DMA to it.
> >>> The patches are based on the vhost-net backend driver. We add a device
> >>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> >>> send/recv directly to/from the NIC driver.
> >>> KVM guests that use the
> >>> vhost-net backend may bind any ethX interface on the host side to
> >>> get copyless data transfer thru the guest virtio-net frontend.
> >>>
> >>> The scenario is like this:
> >>>
> >>> The guest virtio-net driver submits multiple requests thru the vhost-net
> >>> backend driver to the kernel. The requests are queued and then
> >>> completed after the corresponding actions in h/w are done.
> >>>
> >>> For read, user space buffers are dispensed to the NIC driver for rx when
> >>> a page constructor API is invoked. This means NICs can allocate user buffers
> >>> from a page constructor. We add a hook in the netif_receive_skb() function
> >>> to intercept the incoming packets and notify the zero-copy device.
> >>>
> >>> For write, the zero-copy device may allocate a new host skb, put the
> >>> payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
> >>> The request remains pending until the skb is transmitted by h/w.
> >>>
> >>> Here, we have considered 2 ways to utilize the page constructor
> >>> API to dispense the user buffers.
> >>>
> >>> One: Modify the __alloc_skb() function a bit so that it only allocates
> >>>      the sk_buff structure, and the data pointer points to a
> >>>      user buffer which comes from a page constructor API.
> >>>      Then the shinfo of the skb is also from the guest.
> >>>      When a packet is received from hardware, skb->data is filled
> >>>      directly by h/w. This is the way we have implemented.
> >>>
> >>>      Pros: We can avoid any copy here.
> >>>      Cons: The guest virtio-net driver needs to allocate skbs in almost
> >>>            the same way as the host NIC drivers, e.g. the size
> >>>            used by netdev_alloc_skb() and the same reserved space in the
> >>>            head of the skb. Many NIC drivers match the guest and are
> >>>            ok with this. But some of the latest NIC drivers reserve special
> >>>            room in the skb head.
> >>>            To deal with it, we suggest providing
> >>>            a method in the guest virtio-net driver to ask the NIC driver
> >>>            for the parameters we are interested in once we know which
> >>>            device we have bound for zero-copy. Then we ask the guest to do so.
> >>>            Is that reasonable?
> >>Unfortunately, this would break compatibility with existing virtio.
> >>This also complicates migration.
>> You mean any modification to the guest virtio-net driver will break the
>> compatibility? We tried to enlarge virtio_net_config to contain the
>> 2 parameters and add one VIRTIO_NET_F_PASSTHRU flag; virtionet_probe()
>> will check the feature flag and get the parameters, then the virtio-net driver uses
>> them to allocate buffers. How about this?
>This means that we can't, for example, live-migrate between different systems
>without flushing outstanding buffers.

Ok. What we have thought about now is to do something with skb_reserve().
If the device is bound by mp, then skb_reserve() will do nothing for it.

> >>What is the room in skb head used for?
> >I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
>> NET_IP_ALIGN.
>Looking at code, this seems to do with alignment - could just be
>a performance optimization.
> >>> Two: Modify the driver to get user buffers allocated from a page constructor
> >>>      API (to substitute alloc_page()). The user buffers are used as payload
> >>>      buffers and filled by h/w directly when a packet is received. The driver
> >>>      should associate the pages with the skb (skb_shinfo(skb)->frags). For
> >>>      the head buffer side, let the host allocate the skb, and h/w fills it.
> >>>      After that, the data filled in the host skb header will be copied into
> >>>      the guest header buffer which is submitted together with the payload buffer.
> >>>
> >>>      Pros: We need to care less about how the guest or host allocates their
> >>>            buffers.
> >>>      Cons: We still need a small copy here for the skb header.
> >>>
> >>> We are not sure which way is better here.
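
A minimal userspace sketch of approach Two as described above, written only to make the data flow concrete. It is not kernel code and not taken from the patches: struct host_pkt, struct guest_req, complete_rx() and MAX_FRAGS are hypothetical names. The host packet carries a small hardware-filled header plus references to guest pages that already hold the payload; completing the request copies the header only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 8 /* hypothetical limit for this sketch */

/* Hypothetical stand-in for a host skb: a linear header area filled by
 * hardware, plus references to payload pages (in the spirit of
 * skb_shinfo(skb)->frags). */
struct host_pkt {
    unsigned char head[128];
    size_t head_len;
    void *frag_page[MAX_FRAGS]; /* guest pages handed out for DMA */
    size_t frag_len[MAX_FRAGS];
    size_t nr_frags;
};

/* Hypothetical guest receive request: one header buffer; the payload
 * pages were submitted together with it and are filled by DMA. */
struct guest_req {
    unsigned char *hdr_buf;
    size_t hdr_cap;     /* capacity of hdr_buf */
    size_t hdr_len;     /* header bytes copied in */
    size_t payload_len; /* payload bytes already placed by hardware */
};

/* Complete a receive: copy the header, account for the payload without
 * copying it. Returns 0 on success, -1 if the packet does not fit. */
static int complete_rx(const struct host_pkt *pkt, struct guest_req *req)
{
    size_t i;

    assert(pkt != NULL && req != NULL);
    if (pkt->head_len > sizeof(pkt->head) ||
        pkt->head_len > req->hdr_cap ||
        pkt->nr_frags > MAX_FRAGS)
        return -1;

    /* The one small copy that approach Two still needs. */
    memcpy(req->hdr_buf, pkt->head, pkt->head_len);
    req->hdr_len = pkt->head_len;

    /* Payload is already in guest pages: only sum the lengths. */
    req->payload_len = 0;
    for (i = 0; i < pkt->nr_frags; i++)
        req->payload_len += pkt->frag_len[i];
    return 0;
}
```

The cost of the copy is bounded by the header size rather than the packet size, which is the trade-off listed under Cons for approach Two.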
> >>The obvious question would be whether you see any speed difference
> >>with the two approaches. If no, then the second approach would be
> >>better.
>
>> I remember the second approach is a bit slower in 1500MTU.
>> But we did not test too much.
>Well, that's an important datapoint. By the way, you'll need
>header copy to activate LRO in host, so that's a good
>reason to go with option 2 as well.
> >>> This is the first thing we want
> >>> comments on from the community. We hope the modification to the network
> >>> part will be generic, not used by the vhost-net backend only; a user
> >>> application may use it as well once the zero-copy device provides async
> >>> read/write operations later.
> >>>
> >>> Please give comments especially for the network part modifications.
> >>>
> >>>
> >>> We provide multiple submits and asynchronous notification to
> >>>vhost-net too.
> >>>
> >>> Our goal is to improve the bandwidth and reduce the CPU usage.
> >>> Exact performance data will be provided later. But for a simple
> >>> test with netperf, we found bandwidth up and CPU % up too,
> >>> but the bandwidth up ratio is much more than the CPU % up ratio.
> >>>
> >>> What we have not done yet:
> >>> packet split support
> >
> >>What does this mean, exactly?
>> We can support 1500MTU, but for jumbo frame, since the vhost driver did not previously
> >support mergeable buffers, we cannot try it for multiple sg.
>I do not see why, vhost currently supports 64K buffers with indirect
>descriptors.

The receive_skb() in the guest virtio-net driver will merge the multiple sg into skb frags; how can indirect descriptors do that?

>>> A jumbo frame will split 5
>>> frags and hook them once a descriptor, so the user buffer allocation is greatly dependent
>>> on how the guest virtio-net driver submits buffers. We think mergeable buffer is suitable for
>>>it.
>
> >> To support GRO
>>> Actually, I think if mergeable buffers get good performance, then GRO is not
>>> so important.
> >>And TSO/GSO?
>>> Do we really need them?
>>My guess would be yes. Mergeable buffers is a memory saving
>>optimization, not a performance optimization, I don't see
>>that it can help. And I think you can't solely rely on jumbo frames
>>in hardware, not everyone can enable them.
>Having said that, number one priority is getting decent performance
>out of the driver, in whatever way you find fit. I was just
>suggesting obvious ways to do this.

Thanks.

> >> Performance tuning
> >>
> >> what we have done in v1:
> >> polish the RCU usage
> >> deal with write logging in asynchronous mode in vhost
> >> add notifier block for mp device
> >> rename page_ctor to mp_port in netdevice.h to make it look generic
> >> add mp_dev_change_flags() for mp device to change NIC state
> >> add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
> >> a small fix for missing dev_put when fail
> >> using dynamic minor instead of static minor number
> >> a __KERNEL__ protect to mp_get_sock()
> >>
> >> what we have done in v2:
> >>
> >> remove most of the RCU usage, since the ctor pointer is only
> >> changed by BIND/UNBIND ioctl, and during that time, NIC will be
> >> stopped to get good cleanup (all outstanding requests are finished),
> >> so the ctor pointer cannot be raced into a wrong situation.
> >>
> >> Replace the struct vhost_notifier with struct kiocb.
> >> Let the vhost-net backend alloc/free the kiocb and transfer them
> >> via sendmsg/recvmsg.
> >>
> >> use get_user_pages_fast() and set_page_dirty_lock() when read.
> >>
> >> Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> >>
> >>
> >> Comments not addressed yet this time:
> >> the async write logging is not satisfied by vhost-net
> >> Qemu needs a sync write
> >> a limit for locked pages from get_user_pages_fast()
> >>
> >>
> >> performance:
> >> using netperf with GSO/TSO disabled, 10G NIC,
> >> disabled packet split mode, with raw socket case compared to vhost.
> >>
> >> bandwidth will be from 1.1Gbps to 1.7Gbps
> >> CPU % from 120%-140% to 140%-160%
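
The claim that the bandwidth gain outweighs the CPU increase can be checked from the quoted figures. The helpers below are a small sketch for that arithmetic; taking the midpoint of each quoted CPU range (130% and 150%) is an assumption of this sketch, and the function names are illustrative only.

```c
#include <assert.h>

/* Relative change from `before` to `after` as a fraction (0.5 means +50%). */
static double relative_increase(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before;
}

/* Midpoint of a quoted range such as 120%-140%. Using the midpoint is an
 * assumption of this sketch, not something stated in the thread. */
static double range_midpoint(double low, double high)
{
    return (low + high) / 2.0;
}

/* Throughput per unit of CPU: Gbps delivered per 100% of CPU. */
static double gbps_per_cpu(double gbps, double cpu_percent)
{
    assert(cpu_percent > 0.0);
    return gbps / (cpu_percent / 100.0);
}
```

With the numbers above, relative_increase(1.1, 1.7) is roughly 0.55, while the CPU midpoints give relative_increase(130.0, 150.0) of roughly 0.15, so gbps_per_cpu() rises from about 0.85 to about 1.13 Gbps per 100% CPU under that midpoint assumption.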