From: Rusty Russell
To: virtualization@lists.linux-foundation.org
Cc: Stephen Hemminger, "Xin, Xiaohui", kvm@vger.kernel.org, mst@redhat.com,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, hpa@zytor.com, mingo@elte.hu,
    akpm@linux-foundation.org
Subject: Re: [RFC] Virtual Machine Device Queues (VMDq) support on KVM
Date: Mon, 21 Sep 2009 16:37:22 +0930
Message-Id: <200909211637.23299.rusty@rustcorp.com.au>
In-Reply-To: <20090901090518.1193e412@nehalam>

On Wed, 2 Sep 2009 01:35:18 am Stephen Hemminger wrote:
> On Tue, 1 Sep 2009 14:58:19 +0800 "Xin, Xiaohui" wrote:
>
> > [RFC] Virtual Machine Device Queues (VMDq) support on KVM
> >
> > A network adapter with VMDq technology presents multiple pairs of
> > tx/rx queues and provides an L2 sorting mechanism, based on MAC
> > addresses and VLAN tags, for each tx/rx queue pair.  Here we present
> > a generic framework in which network traffic to/from a tx/rx queue
> > pair can be directed from/to a KVM guest without any software copy.
> >
> > This framework also applies to traditional network adapters, which
> > have just one tx/rx queue pair, and applications using the same
> > user/kernel interface can use it to send and receive network traffic
> > directly through a tx/rx queue pair of a network adapter.
> >
> > We use the virtio-net architecture to illustrate the framework.
> >
> >
> > |--------------------|      pop     add_buf      |----------------|
> > |    Qemu process    | <--------- TX <---------- |  Guest Kernel  |
> > |                    | --------->    ----------> |                |
> > |     Virtio-net     |      push    get_buf      |                |
> > |  (Backend service) | ---------> RX ----------> |   Virtio-net   |
> > |                    | <---------    <---------- |     driver     |
> > |                    |      push    get_buf      |                |
> > |--------------------|                           |----------------|
> >           |
> >           |
> >           | AIO (read & write) combined with Direct I/O
> >           | (which substitutes the synchronous file operations)
> > |-----------------------------------------------------------------------|
> > | Host kernel   | read:  copy-less, user space directly mapped into     |
> > |               |        the kernel; payload directly DMAed into user   |
> > |               |        space                                          |
> > |               | write: copy-less, user space directly mapped into     |
> > |               |        the kernel; payload directly hooked into an    |
> > |               |        skb                                            |
> > |               |                                                       |
> > | (a likely     |                                                       |
> > |  queue pair   |                                                       |
> > |  instance)    |                                                       |
> > |               |                                                       |
> > |   NIC driver  <-->  TUN/TAP driver                                    |
> > |-----------------------------------------------------------------------|
> >           |
> >           |
> >    traditional adapter or a tx/rx queue pair
> >
> > The basic idea is to combine kernel Asynchronous I/O with Direct I/O
> > to implement a copy-less TUN/TAP device.  AIO and Direct I/O are not
> > new to the kernel; the SCSI tape driver already uses them.
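
(For illustration only: a minimal sketch, assuming the libaio wrappers, of
how a backend service might drive such an AIO-capable TUN/TAP device.  The
tap setup ioctls, the real guest-supplied buffer addresses and error paths
are left out, and stock TUN/TAP does not yet complete these requests
asynchronously; this only pictures the user/kernel interface the RFC
proposes.  Build with "gcc sketch.c -laio".)

/*
 * Illustrative sketch of the proposed backend interface, not real code.
 */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>

#define NR_IOCBS 16     /* e.g. one control block per receive-vqueue element */

int main(void)
{
        io_context_t ctx = 0;
        struct iocb iocbs[NR_IOCBS], *iocbp[NR_IOCBS];
        struct io_event events[NR_IOCBS];
        static char bufs[NR_IOCBS][4096]; /* stand-ins for guest payload areas */
        int fd, i, n;

        fd = open("/dev/net/tun", O_RDWR); /* bound to a NIC or a queue pair */
        if (fd < 0 || io_setup(NR_IOCBS, &ctx) < 0) {
                perror("setup");
                return 1;
        }

        /* One AIO read request composed of several control blocks, each
         * naming one receive buffer handed over by the guest. */
        for (i = 0; i < NR_IOCBS; i++) {
                io_prep_pread(&iocbs[i], fd, bufs[i], sizeof(bufs[i]), 0);
                iocbp[i] = &iocbs[i];
        }
        if (io_submit(ctx, NR_IOCBS, iocbp) < 0) {
                perror("io_submit");
                return 1;
        }

        /* Completion arrives once the driver has DMAed payloads into the
         * user buffers and called aio_complete() on the kernel side. */
        n = io_getevents(ctx, 1, NR_IOCBS, events, NULL);
        if (n < 0)
                fprintf(stderr, "io_getevents failed: %d\n", n);
        else
                printf("%d receive buffers completed\n", n);

        io_destroy(ctx);
        return 0;
}
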
> >
> > With traditional file operations, the payload contents must be copied
> > between the kernel DMA address and a user buffer.  That copy is what
> > we want to eliminate.
> >
> > The proposed framework is like this:
> > A TUN/TAP device is bound to a traditional NIC adapter, or to one
> > tx/rx queue pair, on the host side.  The KVM virtio-net backend
> > service, a user space program, submits asynchronous read/write I/O
> > requests to the host kernel through the TUN/TAP device.  The requests
> > correspond to vqueue elements, for both transmit and receive.  They
> > can be queued in one AIO request, and completion is later notified
> > through the underlying packet tx/rx processing of the queue pair.
> >
> > Detailed path:
> >
> > For the guest Virtio-net driver, packet receive corresponds to the
> > asynchronous read I/O requests of the backend service.
> >
> > 1) The guest Virtio-net driver provides header and payload addresses
> >    through the receive vqueue to the Virtio-net backend service.
> >
> > 2) The Virtio-net backend service encapsulates multiple vqueue
> >    elements into multiple AIO control blocks and composes them into
> >    one AIO read request.
> >
> > 3) The Virtio-net backend service uses the io_submit() syscall to
> >    pass the request to the TUN/TAP device.
> >
> > 4) The Virtio-net backend service uses the io_getevents() syscall to
> >    check for completion of the request.
> >
> > 5) The TUN/TAP driver receives packets from the NIC's queue pair and
> >    prepares for Direct I/O.
> >    A modified NIC driver may build an skb whose header is allocated
> >    in the host kernel, but whose payload buffer is mapped directly
> >    from the user space buffers supplied through the AIO request by
> >    the backend service; get_user_pages() may do this.  For one AIO
> >    read request, the TUN/TAP driver maintains a list of the directly
> >    mapped buffers, and the NIC driver tries to use them as payload
> >    buffers when composing new skbs.  If getting the buffers fails,
> >    kernel-allocated buffers are used instead.
> >
> > 6) Most modern NICs have a header-split feature, so the NIC queue
> >    pair can then DMA the payload directly into the user space mapped
> >    payload buffers.
> >    Thus zero-copy of the payload is achieved on packet receive.
> >
> > 7) The TUN/TAP driver copies the host-allocated header into the
> >    user-mapped space.
> >
> > 8) aio_complete() notifies the Virtio-net backend service, which
> >    picks up the completion via io_getevents().
> >
> >
> > For the guest Virtio-net driver, packet transmit corresponds to the
> > asynchronous write I/O requests of the backend.  The path is similar
> > to packet receive.
> >
> > 1) The guest Virtio-net driver provides header and payload addresses,
> >    already filled with contents, through the transmit vqueue to the
> >    Virtio-net backend service.
> >
> > 2) The Virtio-net backend service encapsulates the vqueue elements
> >    into multiple AIO control blocks and composes them into one AIO
> >    write request.
> >
> > 3) The Virtio-net backend service uses the io_submit() syscall to
> >    pass the requests to the TUN/TAP device.
> >
> > 4) The Virtio-net backend service uses the io_getevents() syscall to
> >    check for request completion.
> >
> > 5) The TUN/TAP driver gets the write requests and allocates skbs for
> >    them.  The header contents are copied into the skb header, and the
> >    directly mapped user space buffer is hooked into the skb.  Thus
> >    zero-copy of the payload is achieved on packet transmit.
> >
> > 6) aio_complete() notifies the Virtio-net backend service, which
> >    picks up the completion via io_getevents().
> >
> > The proposed framework is as described above.
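
(For illustration only: a minimal sketch of the page-pinning step in 5)
of the receive path above, assuming a modified TUN/TAP driver.  struct
vmdq_buf and vmdq_pin_payload() are hypothetical names; only
get_user_pages_fast(), put_page() and kcalloc() are existing kernel APIs,
and a real implementation would also unpin the pages on completion.)

/*
 * Illustrative sketch: pin the user payload buffer named by one AIO
 * control block so the NIC can DMA straight into it.
 */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/list.h>

struct vmdq_buf {                       /* one directly mapped payload buffer */
        struct list_head list;          /* linked per AIO read request */
        struct page **pages;
        int nr_pages;
        unsigned long uaddr;
        size_t len;
};

static int vmdq_pin_payload(struct vmdq_buf *buf, unsigned long uaddr,
                            size_t len)
{
        int nr = (offset_in_page(uaddr) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        int got;

        buf->pages = kcalloc(nr, sizeof(*buf->pages), GFP_KERNEL);
        if (!buf->pages)
                return -ENOMEM;

        /* Pin the user pages; write=1 because the device writes into
         * them on receive.  They stay pinned until aio_complete(). */
        got = get_user_pages_fast(uaddr & PAGE_MASK, nr, 1, buf->pages);
        if (got != nr) {
                while (got > 0)
                        put_page(buf->pages[--got]);
                kfree(buf->pages);
                return -EFAULT;         /* fall back to kernel-allocated buffers */
        }

        buf->uaddr = uaddr;
        buf->len = len;
        buf->nr_pages = nr;
        return 0;
}
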
> >
> > Consider the modifications to the kernel and qemu:
> >
> > To kernel:
> > 1) The TUN/TAP driver needs to be modified quite a lot to implement
> >    AIO device operations and the direct mapping of user space buffers
> >    into the kernel, including the code to maintain the directly
> >    mapped user buffers.  It is still only a driver modification.
> >    [A sketch of such AIO file operations follows at the end of this
> >    mail.]
> >
> > 2) The NIC driver may be modified to compose skbs differently, with a
> >    slight data structure change to add a pointer to the directly
> >    mapped user buffer.
> >    It may be better for a NIC driver to present an interface for an
> >    rx/tx queue pair instance, which would also apply to traditional
> >    hardware, so that the kernel interface need not change to keep the
> >    other components happy.  The abstraction is useful, though it is
> >    not needed immediately here.
> >
> > 3) The skb shared info structure may be modified a little to carry
> >    the directly mapped user buffer information.
> >
> > To Qemu:
> > 1) The Virtio-net backend service may be modified to handle AIO
> >    read/write requests from the vqueues.
> > 2) A separate pthread may be needed to handle triggering of the AIO
> >    requests.
> >
> > Any comments are appreciated here.
>
> * Code is easier to review than bullet points.
>
> * Direct I/O has to be safe when a page is shared by multiple threads,
>   and has to be non-blocking, since network I/O can take indeterminately
>   long (think big queues, tunneling, ...).
>
> * In the past, attempts at Direct I/O on the network path have always
>   had SMP TLB issues.  The page has to be flipped or marked as COW on
>   all CPUs, and the cost of the Inter-Processor Interrupt to steal the
>   page has been slower than copying.

The Guest shouldn't touch the packet until the virtio net protocol says
it's finished with it (just like a real NIC).  Even if the Guest is being
nasty, if we have hw csum it'll get the right csum on the wire, and if we
don't, we copy internally anyway.  So I think this isn't a problem...

Rusty.
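
(For illustration only, referring back to item 1) of the "To kernel" list
quoted above: a minimal sketch of the AIO hooks a modified TUN/TAP
character device could expose.  The aio_read/aio_write members of struct
file_operations and aio_complete() already exist in this era's kernel, and
the handler names mirror the existing tun driver, but the -EIOCBQUEUED /
deferred-completion flow shown here is the proposed extension, not current
tun.c behaviour; vmdq_queue_iocb() is a hypothetical helper, stubbed out so
the sketch stands alone.)

/*
 * Illustrative sketch of the AIO device operations the proposal calls for.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/uio.h>

/* Hypothetical: pin the buffers in iv[] (see the earlier pinning sketch)
 * and park the iocb on the bound queue pair for later completion. */
static int vmdq_queue_iocb(struct kiocb *iocb, const struct iovec *iv,
                           unsigned long count, int is_write)
{
        return -ENOSYS;                 /* stub */
}

static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
                                unsigned long count, loff_t pos)
{
        int err = vmdq_queue_iocb(iocb, iv, count, 0);

        if (err)
                return err;
        /* The packet rx path will later call aio_complete(iocb, len, 0). */
        return -EIOCBQUEUED;
}

static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
                                 unsigned long count, loff_t pos)
{
        int err = vmdq_queue_iocb(iocb, iv, count, 1);

        return err ? err : -EIOCBQUEUED;
}

static const struct file_operations tun_fops = {
        .owner     = THIS_MODULE,
        .aio_read  = tun_chr_aio_read,  /* io_submit() of read iocbs */
        .aio_write = tun_chr_aio_write, /* io_submit() of write iocbs */
        /* .open, .release, .poll, .unlocked_ioctl etc. unchanged */
};
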