[RFC] Virtual Machine Device Queues (VMDq) support on KVM
A network adapter with VMDq technology presents multiple tx/rx queue pairs and
provides an L2 sorting mechanism, based on MAC addresses and VLAN tags, that
steers traffic to each tx/rx queue pair. Here we present a generic framework in
which network traffic to/from a tx/rx queue pair can be directed from/to a KVM
guest without any software copy.
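As a rough illustration of the L2 sorting described above, the sketch below models the adapter's filter table in software: each entry binds a destination MAC address, optionally qualified by a VLAN ID, to one rx queue pair. The structure, the `VLAN_ANY` wildcard and the `l2_sort()` helper are hypothetical names chosen for illustration; on a VMDq adapter this classification is performed by the hardware, not by code like this.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wildcard: the entry matches a frame with any VLAN tag. */
#define VLAN_ANY 0xFFFFu

/* Hypothetical software model of one L2 filter entry. */
struct l2_filter {
    uint8_t  mac[6];   /* destination MAC address to match */
    uint16_t vlan;     /* VLAN ID to match, or VLAN_ANY */
    int      queue;    /* rx queue pair index that receives the frame */
};

/* Return the rx queue pair for a frame with the given destination MAC and
 * VLAN ID, or default_q when no filter entry matches. */
static int l2_sort(const struct l2_filter *tbl, size_t n,
                   const uint8_t dst_mac[6], uint16_t vlan, int default_q)
{
    for (size_t i = 0; i < n; i++) {
        if (memcmp(tbl[i].mac, dst_mac, 6) == 0 &&
            (tbl[i].vlan == VLAN_ANY || tbl[i].vlan == vlan))
            return tbl[i].queue;
    }
    return default_q;
}
```

With one queue pair per guest, each guest's MAC (and VLAN, if used) becomes one table entry, so its traffic lands in its own queue pair without software demultiplexing.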
This framework can also apply to traditional network adapters that have just
one tx/rx queue pair. Moreover, applications using the same user/kernel
interface can use this framework to send/receive network traffic directly
through a tx/rx queue pair of a network adapter.
We use the virtio-net architecture to illustrate the framework.
|--------------------|     pop                add_buf        |----------------|
|    Qemu process    |  <---------    TX    <----------      |  Guest Kernel  |
|                    |  --------->          ---------->      |                |
|     Virtio-net     |     push               get_buf        |                |
| (Backend service)  |  --------->    RX    ---------->      |   Virtio-net   |
|                    |  <---------          <----------      |     driver     |
|                    |     push               get_buf        |                |
|--------------------|                                       |----------------|
          |
          |
          |    AIO (read & write) combined with Direct I/O
          |    (which substitute synced file operations)
|-----------------------------------------------------------------------|
|   Host kernel     |  read: copy-less with directly mapped user        |
|                   |        space to kernel, payload directly DMAed    |
|                   |        into user space                            |
|                   |  write: copy-less with directly mapped user       |
|                   |         space to kernel, payload directly hooked  |
|                   |         to a skb                                  |
|                   |                                                   |
|   (a likely       |                                                   |
|    queue pair     |                                                   |
|    instance)      |                                                   |
|       |           |                                                   |
|   NIC driver   <-->  TUN/TAP driver                                   |
|-----------------------------------------------------------------------|
        |
        |
traditional adapter or a tx/rx queue pair
The basic idea is to use kernel Asynchronous I/O (AIO) combined with Direct
I/O to implement a copy-less TUN/TAP device. AIO and Direct I/O are not new to
the kernel; they can be seen, for example, in the SCSI tape driver.
With traditional file operations, the payload has to be copied between the
kernel DMA buffer and the user buffer. That copy is what we want to eliminate.
The proposed framework works as follows:
A TUN/TAP device is bound to a traditional NIC adapter or to a tx/rx queue pair
on the host side. The KVM virtio-net backend service, a user space program,
submits asynchronous read/write I/O requests to the host kernel through the
TUN/TAP device. The requests correspond to the vqueue elements for both
transmission and reception. They can be queued in one AIO request, and their
completion is later signaled by the underlying tx/rx packet processing of the
queue pair.
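A minimal sketch, assuming the Linux AIO ABI from `<linux/aio_abi.h>`, of how the backend might turn a batch of vqueue elements into AIO control blocks and then match completions back to those elements. `struct vq_elem`, `build_iocbs()` and `complete_events()` are hypothetical names. The `cbp` array is the shape that `io_submit()` takes, and the `struct io_event` array is what `io_getevents()` fills in. Whether a TUN/TAP fd actually accepts these requests depends on the AIO support this RFC proposes to add.

```c
#include <linux/aio_abi.h>   /* struct iocb, struct io_event, IOCB_CMD_* */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one guest vqueue element whose buffer has already
 * been translated to a host virtual address by the backend. */
struct vq_elem {
    void    *buf;
    size_t   len;
    unsigned head;      /* descriptor index to hand back to the guest */
    int64_t  used_len;  /* filled in when the request completes */
};

/* Encapsulate n vqueue elements into n AIO control blocks so that they can
 * all be passed to the TUN/TAP fd with one io_submit() call. aio_data keeps
 * a pointer back to the element so the completion can be matched. */
static void build_iocbs(int tap_fd, int is_write, struct vq_elem *elems,
                        size_t n, struct iocb *cbs, struct iocb **cbp)
{
    for (size_t i = 0; i < n; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_data       = (uint64_t)(uintptr_t)&elems[i];
        cbs[i].aio_lio_opcode = is_write ? IOCB_CMD_PWRITE : IOCB_CMD_PREAD;
        cbs[i].aio_fildes     = (uint32_t)tap_fd;
        cbs[i].aio_buf        = (uint64_t)(uintptr_t)elems[i].buf;
        cbs[i].aio_nbytes     = elems[i].len;
        cbs[i].aio_offset     = 0;  /* assumption: offset unused by a tap fd */
        cbp[i] = &cbs[i];
    }
}

/* Match events returned by io_getevents() to their vqueue elements.
 * res is the byte count on success or a negative errno on failure.
 * Returns the number of successful completions. */
static size_t complete_events(const struct io_event *ev, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        struct vq_elem *e = (struct vq_elem *)(uintptr_t)ev[i].data;
        e->used_len = ev[i].res;
        if (ev[i].res >= 0)
            ok++;
    }
    return ok;
}
```

Batching many elements into one submission amortizes the syscall cost; completed elements would then be pushed back to the guest's used ring.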
Detailed path:
For the guest Virtio-net driver, packet reception corresponds to asynchronous
read I/O requests issued by the backend service.
1) The guest Virtio-net driver provides the header and payload addresses to
the Virtio-net backend service through the receive vqueue.
2) The Virtio-net backend service encapsulates multiple vqueue elements into
multiple AIO control blocks and composes them into one AIO read request.
3) The Virtio-net backend service uses the io_submit() syscall to pass the
request to the TUN/TAP device.
4) The Virtio-net backend service uses the io_getevents() syscall to check for
completion of the request.
5) The TUN/TAP driver receives packets from the NIC queue pair and prepares
for Direct I/O.
A modified NIC driver may build an skb whose header is allocated in the host
kernel but whose payload buffer is directly mapped from the user space buffers
supplied by the backend service in the AIO request; get_user_pages() may do
this mapping. For each AIO read request, the TUN/TAP driver maintains a list of
the directly mapped buffers, and the NIC driver tries to take buffers from this
list as payload buffers when composing new skbs. If getting these buffers
fails, kernel-allocated buffers are used instead.
6) Most modern NICs support header split, so the NIC queue pair may DMA the
payload directly into the mapped user space payload buffers.
Thus zero-copy of the payload is achieved on the receive path.
7) The TUN/TAP driver copies the header from the host kernel into the mapped
user space header buffer.
8) The TUN/TAP driver calls aio_complete() to notify the Virtio-net backend
service, which collects the completion through io_getevents().
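The buffer-list logic in receive step 5 (use a directly mapped user buffer when one is available, otherwise fall back to a kernel buffer) can be sketched in user space as below. This is a hypothetical model: `struct mapped_list`, `struct rx_buf` and `get_rx_payload()` are illustrative names, plain arrays stand in for pages pinned with get_user_pages(), and malloc() stands in for a host kernel allocation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-AIO-read-request list of directly mapped user buffers. */
struct mapped_list {
    void  **bufs;    /* user payload buffers mapped for this request */
    size_t  count;   /* number of entries in bufs */
    size_t  next;    /* index of the next unused buffer */
};

/* A payload buffer handed to the NIC driver for composing an skb. */
struct rx_buf {
    void *data;
    bool  user_mapped;   /* true: zero-copy; false: must be copied later */
};

/* Take the next directly mapped buffer if one is left; otherwise fall back
 * to a freshly allocated buffer (stand-in for a kernel allocation). */
static struct rx_buf get_rx_payload(struct mapped_list *ml, size_t len)
{
    struct rx_buf rb;
    if (ml->next < ml->count) {
        rb.data = ml->bufs[ml->next++];
        rb.user_mapped = true;
    } else {
        rb.data = malloc(len);
        rb.user_mapped = false;
    }
    return rb;
}
```

Only buffers obtained from the mapped list give zero-copy on receive; the fallback keeps packets flowing when the backend has not posted enough buffers, at the cost of the copy this framework tries to avoid.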
For the guest Virtio-net driver, packet transmission corresponds to
asynchronous write I/O requests issued by the backend. The path is similar to
the receive path.
1) The guest Virtio-net driver provides the addresses of the filled header and
payload buffers to the Virtio-net backend service through the transmit vqueue.
2) The Virtio-net backend service encapsulates the vqueue elements into
multiple AIO control blocks and composes them into one AIO write request.
3) The Virtio-net backend service uses the io_submit() syscall to pass the
request to the TUN/TAP device.
4) The Virtio-net backend service uses the io_getevents() syscall to check for
completion of the request.
5) The TUN/TAP driver receives the write requests and allocates skbs for them.
The header contents are copied into the skb header, while the directly mapped
user space payload buffer is attached to the skb without copying. Thus
zero-copy of the payload is achieved on the transmit path.
6) The TUN/TAP driver calls aio_complete() to notify the Virtio-net backend
service, which collects the completion through io_getevents().
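Transmit step 5 (copy only the small header, reference the payload in place) is sketched below with a toy stand-in for an skb. `struct toy_skb`, `HDR_ROOM` and `build_tx_skb()` are hypothetical; in a real driver the payload pages would be attached as fragments in `skb_shinfo(skb)->frags`, and they would have to stay pinned until aio_complete() is called.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size of the kernel-allocated linear header area. */
#define HDR_ROOM 128

/* Toy stand-in for an sk_buff: a linear header area plus one fragment
 * pointer that references the payload without owning a copy of it. */
struct toy_skb {
    unsigned char head[HDR_ROOM];
    size_t        head_len;
    const void   *frag;       /* points at the directly mapped user buffer */
    size_t        frag_len;
};

/* Copy the header into the linear area and hook the payload in by
 * reference. Returns 0 on success, -1 if the header does not fit. */
static int build_tx_skb(struct toy_skb *skb, const void *hdr, size_t hdr_len,
                        const void *payload, size_t payload_len)
{
    if (hdr_len > HDR_ROOM)
        return -1;
    memcpy(skb->head, hdr, hdr_len);   /* only the header is copied */
    skb->head_len = hdr_len;
    skb->frag = payload;               /* zero copy: store a reference */
    skb->frag_len = payload_len;
    return 0;
}
```

Because the payload is referenced rather than copied, the guest must not reuse that buffer until the completion is reported, which is the ordering the virtio-net protocol already imposes.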
The proposed framework is described above.
Consider the modifications needed in the kernel and Qemu:
To the kernel:
1) The TUN/TAP driver may need substantial changes to implement AIO device
operations and the direct mapping of user space buffers into the kernel,
including code to maintain the directly mapped user buffers. These changes are
confined to the driver.
2) The NIC driver may be modified to compose skbs differently, with a slight
data structure change to add a pointer to the directly mapped user buffer.
Here, it may be better for a NIC driver to present an interface for an rx/tx
queue pair instance, which would also apply to traditional hardware, so that
the kernel interface seen by other components does not have to change.
The abstraction is useful, though it is not needed immediately here.
3) The skb shared info structure may be modified slightly to hold the
directly mapped user buffer information.
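The queue-pair instance abstraction suggested in kernel item 2 could look roughly like the ops table below. Everything here is hypothetical (`struct qp_ops`, `struct qp_instance`, the dummy implementation); it only shows how a caller such as the TUN/TAP driver could post buffers and transmit through one queue pair without knowing whether the hardware is a VMDq pool or a traditional single-queue adapter.

```c
#include <stddef.h>

struct qp_instance;

/* Hypothetical operations a NIC driver might export per rx/tx queue pair. */
struct qp_ops {
    /* post a payload buffer (e.g. a directly mapped user buffer) for rx */
    int (*post_rx_buf)(struct qp_instance *qp, void *buf, size_t len);
    /* transmit a frame made of a header and a payload */
    int (*xmit)(struct qp_instance *qp, const void *hdr, size_t hdr_len,
                const void *payload, size_t len);
};

struct qp_instance {
    const struct qp_ops *ops;
    int   index;   /* queue pair number; always 0 on a single-queue NIC */
    void *priv;    /* driver private data */
};

/* Trivial software implementation that only counts what it is given. */
struct dummy_priv { size_t rx_posted; size_t tx_bytes; };

static int dummy_post_rx(struct qp_instance *qp, void *buf, size_t len)
{
    (void)buf; (void)len;
    ((struct dummy_priv *)qp->priv)->rx_posted++;
    return 0;
}

static int dummy_xmit(struct qp_instance *qp, const void *hdr, size_t hdr_len,
                      const void *payload, size_t len)
{
    (void)hdr; (void)payload;
    ((struct dummy_priv *)qp->priv)->tx_bytes += hdr_len + len;
    return 0;
}

static const struct qp_ops dummy_ops = { dummy_post_rx, dummy_xmit };
```

A VMDq driver would expose one such instance per pool, while a traditional adapter would expose exactly one, which is what lets the same TUN/TAP code serve both.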
To Qemu:
1) The Virtio-net backend service may be modified to handle AIO read/write
requests from the vqueues.
2) A separate pthread may be needed to handle triggering of the AIO requests.
Any comments are appreciated here.
Thanks
Xiaohui
On Tue, 1 Sep 2009 14:58:19 +0800
"Xin, Xiaohui" <[email protected]> wrote:
> [RFC] Virtual Machine Device Queues (VMDq) support on KVM
> Any comments are appreciated here.
* Code is easier to review than bullet points.
* Direct I/O has to be safe when a page is shared by multiple threads,
and has to be non-blocking since network I/O can take indeterminately
long (think big queues, tunneling, ...)
* In the past, attempts at Direct I/O on network have always had SMP
TLB issues. The page has to be flipped or marked as COW on all CPUs,
and the cost of the Inter Processor Interrupt to steal the page has
been higher than copying.
--
>* Code is easier to review than bullet points.
Yes. We will send the code soon.
>* Direct I/O has to be safe when page is shared by multiple threads,
> and has to be non-blocking since network I/O can take indeterminately
> long (think big queues, tunneling, ...)
In this situation, where a NIC queue pair is assigned to only one guest, the
pages are locked and the KVM guest will not be swapped out.
>* In the past, attempts at Direct I/O on network have always had SMP
> TLB issues. The page has to be flipped or marked as COW on all CPUs
> and the cost of the Inter Processor Interrupt to steal the page has
> been slower than copying
That may be so; we have not thought about this in depth yet. Thanks.
Thanks
Xiaohui
-----Original Message-----
From: Stephen Hemminger [mailto:[email protected]]
Sent: Wednesday, September 02, 2009 12:05 AM
To: Xin, Xiaohui
Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Wed, 2 Sep 2009 01:35:18 am Stephen Hemminger wrote:
> On Tue, 1 Sep 2009 14:58:19 +0800
> "Xin, Xiaohui" <[email protected]> wrote:
>
> > [RFC] Virtual Machine Device Queues (VMDq) support on KVM
> > Any comments are appreciated here.
>
> * Code is easier to review than bullet points.
>
> * Direct I/O has to be safe when a page is shared by multiple threads,
> and has to be non-blocking since network I/O can take indeterminately
> long (think big queues, tunneling, ...)
>
> * In the past, attempts at Direct I/O on network have always had SMP
> TLB issues. The page has to be flipped or marked as COW on all CPUs,
> and the cost of the Inter Processor Interrupt to steal the page has
> been higher than copying.
The Guest shouldn't touch the packet until the virtio net protocol says
it's finished with it (just like a real NIC). Even if it's being nasty, if we
have hw csum it'll get the right csum on the wire, and if we don't, we copy
internally anyway.
So I think this isn't a problem...
Rusty.
On Mon, 21 Sep 2009 16:37:22 +0930
Rusty Russell <[email protected]> wrote:
> > > Actually this framework can apply to traditional network adapters which have
> > > just one tx/rx queue pair. And applications using the same user/kernel interface
> > > can utilize this framework to send/receive network traffic directly thru a tx/rx
> > > queue pair in a network adapter.
> > >
More importantly, when virtualization is used with multi-queue NICs, the virtio-net
NIC is a single-CPU bottleneck. The virtio-net NIC should preserve the parallelism (lock
free) using multiple receive/transmit queues. The number of queues should equal the
number of CPUs.
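A minimal sketch of what "the number of queues should equal the number of CPUs" implies for queue selection: size the queue set from the online CPU count and let each CPU use its own queue, so CPUs do not contend for a shared lock. `default_num_queues()` and `select_tx_queue()` are hypothetical helpers; `sysconf(_SC_NPROCESSORS_ONLN)` is the only library call used.

```c
#include <unistd.h>

/* One queue per online CPU; fall back to a single queue if the count
 * cannot be determined. */
static long default_num_queues(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Map the submitting CPU to a tx queue. When num_queues equals the CPU
 * count, every CPU gets its own queue; with fewer queues, CPUs share.
 * num_queues must be non-zero. */
static unsigned select_tx_queue(unsigned cpu, unsigned num_queues)
{
    return cpu % num_queues;
}
```

As noted later in the thread, such per-CPU queues in the guest only help if the host side (tap or its replacement) is multiqueue as well.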
* Stephen Hemminger ([email protected]) wrote:
> On Mon, 21 Sep 2009 16:37:22 +0930
> Rusty Russell <[email protected]> wrote:
>
> > > > Actually this framework can apply to traditional network adapters which have
> > > > just one tx/rx queue pair. And applications using the same user/kernel interface
> > > > can utilize this framework to send/receive network traffic directly thru a tx/rx
> > > > queue pair in a network adapter.
> > > >
>
> More importantly, when virtualization is used with multi-queue NICs, the virtio-net
> NIC is a single-CPU bottleneck. The virtio-net NIC should preserve the parallelism (lock
> free) using multiple receive/transmit queues. The number of queues should equal the
> number of CPUs.
Yup, multiqueue virtio is on todo list ;-)
thanks,
-chris
On Mon, Sep 21, 2009 at 09:27:18AM -0700, Chris Wright wrote:
> * Stephen Hemminger ([email protected]) wrote:
> > On Mon, 21 Sep 2009 16:37:22 +0930
> > Rusty Russell <[email protected]> wrote:
> >
> > > > > Actually this framework can apply to traditional network adapters which have
> > > > > just one tx/rx queue pair. And applications using the same user/kernel interface
> > > > > can utilize this framework to send/receive network traffic directly thru a tx/rx
> > > > > queue pair in a network adapter.
> > > > >
> >
> > More importantly, when virtualization is used with multi-queue
> > NICs, the virtio-net NIC is a single-CPU bottleneck. The virtio-net
> > NIC should preserve the parallelism (lock free) using multiple
> > receive/transmit queues. The number of queues should equal the
> > number of CPUs.
>
> Yup, multiqueue virtio is on todo list ;-)
>
> thanks,
> -chris
Note we'll need multiqueue tap for that to help.
--
MST
On Tuesday 22 September 2009, Michael S. Tsirkin wrote:
> > > More importantly, when virtualization is used with multi-queue
> > > NICs, the virtio-net NIC is a single-CPU bottleneck. The virtio-net
> > > NIC should preserve the parallelism (lock free) using multiple
> > > receive/transmit queues. The number of queues should equal the
> > > number of CPUs.
> >
> > Yup, multiqueue virtio is on todo list ;-)
> >
>
> Note we'll need multiqueue tap for that to help.
My idea for that was to open multiple file descriptors to the same
macvtap device and let the kernel figure out the right thing to
do with that. You can do the same with raw packet sockets in case
of vhost_net, but I wouldn't want to add more complexity to the
tun/tap driver for this.
Arnd <><
On Tue, 22 Sep 2009 13:50:54 +0200
Arnd Bergmann <[email protected]> wrote:
> On Tuesday 22 September 2009, Michael S. Tsirkin wrote:
> > > > More importantly, when virtualization is used with multi-queue
> > > > NICs, the virtio-net NIC is a single-CPU bottleneck. The virtio-net
> > > > NIC should preserve the parallelism (lock free) using multiple
> > > > receive/transmit queues. The number of queues should equal the
> > > > number of CPUs.
> > >
> > > Yup, multiqueue virtio is on todo list ;-)
> > >
> >
> > Note we'll need multiqueue tap for that to help.
>
> My idea for that was to open multiple file descriptors to the same
> macvtap device and let the kernel figure out the right thing to
> do with that. You can do the same with raw packet sockets in case
> of vhost_net, but I wouldn't want to add more complexity to the
> tun/tap driver for this.
>
> Arnd <><
Or get tap out of the way entirely. The packets should not have
to go out to user space at all (see veth)
On Tuesday 22 September 2009, Stephen Hemminger wrote:
> > My idea for that was to open multiple file descriptors to the same
> > macvtap device and let the kernel figure out the right thing to
> > do with that. You can do the same with raw packet sockets in case
> > of vhost_net, but I wouldn't want to add more complexity to the
> > tun/tap driver for this.
> >
> Or get tap out of the way entirely. The packets should not have
> to go out to user space at all (see veth)
How does veth relate to that? Do you mean vhost_net? With vhost_net,
you could still open multiple sockets, only the access is in the kernel.
Obviously, once it all is in the kernel, that could be done under the
covers, but I think it would be cleaner to treat vhost_net purely as
a way to bypass the syscalls for user space, with as little as possible
visible impact otherwise.
Arnd <><