2009-06-19 00:30:50

by Gregory Haskins

Subject: [KVM PATCH v8 0/3] iosignalfd

(Applies to kvm.git/master:c27b64a0)

This is v8 of the series. For more details, please see the header to
patch 3/3.

This series has been tested against the kvm-eventfd unit test, and
appears to be functioning properly. You can download this test here:

ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2

Please consider this series for inclusion in kvm.git

An updated userspace for qemu-kvm.git is forthcoming.

[
Changelog:

v8:
*) Addressed Avi's review comments:
*) Simplified the unregister_dev logic in patch 1/3
*) Implemented a per-vm io-device limit (patch 2/3)
*) Removed spurious whitespace hunk
*) changed the data-match from a void* to a u64
*) check group-length violations before wildcarding

v7:
*) Implemented a resource limit (CONFIG_KVM_MAX_IOSIGNALFD_ITEMS)
to limit malicious/broken userspace from consuming arbitrary
kernel memory.
*) Rebased to kvm.git/master:c27b64a0, which already includes
Marcelo's irq-lock rework.

v6:
*) Removed "FIXME?" comment on choice over RCU vs SRCU after
discussion/numbers from Paul. I think RCU is fine to use for
now based on the conversation. We can always convert it later
if need be.
*) Fixed the "group" free path to eliminate an RCU related race
*) Fixed a memory/eventfd leak on shutdown for any iosignalfd's
which were still active at the time the guest shuts down.
*) Beefed up comments
*) Rebased to kvm.git/master:0281e88f + irq locking rework and
verified that kvm-eventfd unit test still passes.

v5:
*) Removed "cookie" field, which was a misunderstanding on my
part of what Avi wanted for a data-match feature
*) Added a new "trigger" data-match feature which I think is
much closer to what we need.
*) We retain the dev_count field in the io_bus infrastructure
and instead back-fill the array on removal.
*) Various minor cleanups
*) Rebased to kvm.git/master:25deed73

v4:
*) Fixed a bug in the original 2/4 where the PIT failure case
would potentially leave the io_bus components registered.
*) Condensed the v3 2/4 and 3/4 into one patch (2/2) since
the patches became interdependent with the fix described above
*) Rebased to kvm.git/master:74dfca0a

v3:
*) fixed patch 2/4 to handle error cases instead of BUG_ON
*) implemented same HAVE_EVENTFD protection mechanism as
irqfd to prevent compilation errors on unsupported arches
*) completed testing
*) rebased to kvm.git/master:7391a6d5

v2:
*) added optional data-matching capability (via cookie field)
*) changed name from iofd to iosignalfd
*) added io_bus unregister function
*) implemented deassign feature

v1:
*) original release (integrated into irqfd v7 series as "iofd")
]

---

Gregory Haskins (3):
KVM: add iosignalfd support
KVM: add per-vm limit on the maximum number of io-devices supported
KVM: make io_bus interface more robust


arch/x86/kvm/Kconfig | 8 +
arch/x86/kvm/i8254.c | 23 ++
arch/x86/kvm/i8259.c | 9 +
arch/x86/kvm/x86.c | 1
include/linux/kvm.h | 15 ++
include/linux/kvm_host.h | 19 ++
virt/kvm/coalesced_mmio.c | 8 +
virt/kvm/eventfd.c | 426 +++++++++++++++++++++++++++++++++++++++++++++
virt/kvm/ioapic.c | 9 +
virt/kvm/kvm_main.c | 42 ++++
10 files changed, 545 insertions(+), 15 deletions(-)

--
Signature


2009-06-19 00:31:07

by Gregory Haskins

Subject: [KVM PATCH v8 1/3] KVM: make io_bus interface more robust

Today kvm_io_bus_register_dev() returns void and will internally BUG_ON if it
fails. We want to create dynamic MMIO/PIO entries driven from userspace later
in the series, so we need to enhance the code to be more robust with the
following changes:

1) Add a return value to the registration function
2) Fix up all the callsites to check the return code, handle any
failures, and percolate the error up to the caller.
3) Add an unregister function that collapses holes in the array
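The unregister function in item 3 keeps the device array dense by back-filling
the hole with the last entry rather than shifting. A minimal stand-alone sketch
of that pattern (simplified stand-ins for the real kvm_io_bus, which uses
NR_IOBUS_DEVS and struct kvm_io_device, and -ENOSPC instead of -1):

```c
#define NR_DEVS 6

struct bus {
	int dev_count;
	void *devs[NR_DEVS];
};

/* register: fail (in the patch, with -ENOSPC) once the array is full */
static int bus_register(struct bus *bus, void *dev)
{
	if (bus->dev_count > NR_DEVS - 1)
		return -1;
	bus->devs[bus->dev_count++] = dev;
	return 0;
}

/* unregister: back-fill the hole with the last entry so the array stays dense */
static void bus_unregister(struct bus *bus, void *dev)
{
	int i;

	for (i = 0; i < bus->dev_count; i++) {
		if (bus->devs[i] == dev) {
			bus->devs[i] = bus->devs[--bus->dev_count];
			return;
		}
	}
}
```

Note that back-filling does not preserve registration order, which is fine here
because the bus is only ever searched linearly.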

Signed-off-by: Gregory Haskins <[email protected]>
---

arch/x86/kvm/i8254.c | 22 ++++++++++++++++++++--
arch/x86/kvm/i8259.c | 9 ++++++++-
include/linux/kvm_host.h | 6 ++++--
virt/kvm/coalesced_mmio.c | 8 ++++++--
virt/kvm/ioapic.c | 9 +++++++--
virt/kvm/kvm_main.c | 23 +++++++++++++++++++++--
6 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 331705f..1c41715 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -586,6 +586,7 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
{
struct kvm_pit *pit;
struct kvm_kpit_state *pit_state;
+ int ret;

pit = kzalloc(sizeof(struct kvm_pit), GFP_KERNEL);
if (!pit)
@@ -620,14 +621,31 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);

kvm_iodevice_init(&pit->dev, &pit_dev_ops);
- kvm_io_bus_register_dev(&kvm->pio_bus, &pit->dev);
+ ret = kvm_io_bus_register_dev(&kvm->pio_bus, &pit->dev);
+ if (ret < 0)
+ goto fail;

if (flags & KVM_PIT_SPEAKER_DUMMY) {
kvm_iodevice_init(&pit->speaker_dev, &speaker_dev_ops);
- kvm_io_bus_register_dev(&kvm->pio_bus, &pit->speaker_dev);
+ ret = kvm_io_bus_register_dev(&kvm->pio_bus,
+ &pit->speaker_dev);
+ if (ret < 0)
+ goto fail;
}

return pit;
+
+fail:
+ if (flags & KVM_PIT_SPEAKER_DUMMY)
+ kvm_io_bus_unregister_dev(&kvm->pio_bus, &pit->speaker_dev);
+
+ kvm_io_bus_unregister_dev(&kvm->pio_bus, &pit->dev);
+
+ if (pit->irq_source_id >= 0)
+ kvm_free_irq_source_id(kvm, pit->irq_source_id);
+
+ kfree(pit);
+ return NULL;
}

void kvm_free_pit(struct kvm *kvm)
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 148c52a..66e37c2 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -532,6 +532,8 @@ static const struct kvm_io_device_ops picdev_ops = {
struct kvm_pic *kvm_create_pic(struct kvm *kvm)
{
struct kvm_pic *s;
+ int ret;
+
s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
if (!s)
return NULL;
@@ -548,6 +550,11 @@ struct kvm_pic *kvm_create_pic(struct kvm *kvm)
* Initialize PIO device
*/
kvm_iodevice_init(&s->dev, &picdev_ops);
- kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+ ret = kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+ if (ret < 0) {
+ kfree(s);
+ return NULL;
+ }
+
return s;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e4e78db..eafa2b3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -61,8 +61,10 @@ void kvm_io_bus_init(struct kvm_io_bus *bus);
void kvm_io_bus_destroy(struct kvm_io_bus *bus);
struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
gpa_t addr, int len, int is_write);
-void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
- struct kvm_io_device *dev);
+int kvm_io_bus_register_dev(struct kvm_io_bus *bus,
+ struct kvm_io_device *dev);
+void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,
+ struct kvm_io_device *dev);

struct kvm_vcpu {
struct kvm *kvm;
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 397f419..3e89db8 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -94,6 +94,7 @@ static const struct kvm_io_device_ops coalesced_mmio_ops = {
int kvm_coalesced_mmio_init(struct kvm *kvm)
{
struct kvm_coalesced_mmio_dev *dev;
+ int ret;

dev = kzalloc(sizeof(struct kvm_coalesced_mmio_dev), GFP_KERNEL);
if (!dev)
@@ -102,9 +103,12 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
kvm_iodevice_init(&dev->dev, &coalesced_mmio_ops);
dev->kvm = kvm;
kvm->coalesced_mmio_dev = dev;
- kvm_io_bus_register_dev(&kvm->mmio_bus, &dev->dev);

- return 0;
+ ret = kvm_io_bus_register_dev(&kvm->mmio_bus, &dev->dev);
+ if (ret < 0)
+ kfree(dev);
+
+ return ret;
}

int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index d8b2eca..28adb05 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -335,6 +335,7 @@ static const struct kvm_io_device_ops ioapic_mmio_ops = {
int kvm_ioapic_init(struct kvm *kvm)
{
struct kvm_ioapic *ioapic;
+ int ret;

ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
if (!ioapic)
@@ -343,7 +344,11 @@ int kvm_ioapic_init(struct kvm *kvm)
kvm_ioapic_reset(ioapic);
kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
ioapic->kvm = kvm;
- kvm_io_bus_register_dev(&kvm->mmio_bus, &ioapic->dev);
- return 0;
+
+ ret = kvm_io_bus_register_dev(&kvm->mmio_bus, &ioapic->dev);
+ if (ret < 0)
+ kfree(ioapic);
+
+ return ret;
}

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9fab08e..6bd71f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,11 +2481,30 @@ struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
return NULL;
}

-void kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
+/* assumes kvm->lock held */
+int kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
{
- BUG_ON(bus->dev_count > (NR_IOBUS_DEVS-1));
+ if (bus->dev_count > (NR_IOBUS_DEVS-1))
+ return -ENOSPC;

bus->devs[bus->dev_count++] = dev;
+
+ return 0;
+}
+
+/* assumes kvm->lock held */
+void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,
+ struct kvm_io_device *dev)
+{
+ int i;
+
+ for (i = 0; i < bus->dev_count; i++) {
+
+ if (bus->devs[i] == dev) {
+ bus->devs[i] = bus->devs[--bus->dev_count];
+ return;
+ }
+ }
}

static struct notifier_block kvm_cpu_notifier = {

2009-06-19 00:31:24

by Gregory Haskins

Subject: [KVM PATCH v8 2/3] KVM: add per-vm limit on the maximum number of io-devices supported

This patch adds a .config option for setting an upper resource limit on
the number of iodevice-like constructs that can be allocated on a per-guest
basis. It also adds a per-vm variable (io_device_count) that tracks
the aggregate number of io_devices registered. Today this is limited
purely to PIO/MMIO devices, though it is intended that this will grow in
the future (such as including iosignalfd aliases, etc).

This patch has no bearing on the number of devices that a given subsystem
may _actually_ support, so other limits may be in play at any given moment.
For instance, each PIO/MMIO bus is currently limited to 6 devices max. This
will likely need to change soon to accommodate upcoming features, but that
work is beyond the scope of this patch.

The primary point of this patch is to offer a single, easy-to-understand
upper limit on the number of kmalloc()s that may occur as the result of a
userspace request. In this way, we are free to design future io subsystems
with arbitrary constructs (such as lists and btrees), knowing that they
will not be allowed to grow without bound in a way that the userspace side
can understand.
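The accounting amounts to a VM-wide check layered on top of the existing
per-bus one. A minimal sketch of the idea (names like `struct vm` and
MAX_IO_DEVICES are illustrative stand-ins, not the kernel's):

```c
#define NR_BUS_DEVS    6    /* per-bus cap, as in the current io_bus */
#define MAX_IO_DEVICES 256  /* per-VM cap, like CONFIG_KVM_MAX_IO_DEVICES */

struct bus {
	int dev_count;
	void *devs[NR_BUS_DEVS];
};

struct vm {
	unsigned long io_device_count; /* aggregate across all buses */
};

static int register_dev(struct vm *vm, struct bus *bus, void *dev)
{
	if (bus->dev_count > NR_BUS_DEVS - 1)
		return -1; /* per-bus limit (-ENOSPC in the patch) */
	if (vm->io_device_count >= MAX_IO_DEVICES)
		return -1; /* per-VM limit (-ENOSPC in the patch) */
	bus->devs[bus->dev_count++] = dev;
	vm->io_device_count++;
	return 0;
}
```

Either check failing leaves both counters untouched, so a rejected request
never consumes part of the budget.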

Signed-off-by: Gregory Haskins <[email protected]>
---

arch/x86/kvm/Kconfig | 8 ++++++++
arch/x86/kvm/i8254.c | 9 +++++----
arch/x86/kvm/i8259.c | 2 +-
include/linux/kvm_host.h | 9 ++++++---
virt/kvm/coalesced_mmio.c | 2 +-
virt/kvm/ioapic.c | 2 +-
virt/kvm/kvm_main.c | 12 ++++++++++--
7 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 7fbedfd..0fcf660 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -74,6 +74,14 @@ config KVM_TRACE
relayfs. Note the ABI is not considered stable and will be
modified in future updates.

+config KVM_MAX_IO_DEVICES
+ int "Maximum IO devices support per VM"
+ depends on KVM
+ default "256"
+ ---help---
+ This option influences the maximum number of MMIO, PIO, and other
+ io-devices that can simultaneously register on a per-guest basis.
+
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source drivers/lguest/Kconfig
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 1c41715..f24a8ea 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -621,13 +621,13 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);

kvm_iodevice_init(&pit->dev, &pit_dev_ops);
- ret = kvm_io_bus_register_dev(&kvm->pio_bus, &pit->dev);
+ ret = kvm_io_bus_register_dev(kvm, &kvm->pio_bus, &pit->dev);
if (ret < 0)
goto fail;

if (flags & KVM_PIT_SPEAKER_DUMMY) {
kvm_iodevice_init(&pit->speaker_dev, &speaker_dev_ops);
- ret = kvm_io_bus_register_dev(&kvm->pio_bus,
+ ret = kvm_io_bus_register_dev(kvm, &kvm->pio_bus,
&pit->speaker_dev);
if (ret < 0)
goto fail;
@@ -637,9 +637,10 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)

fail:
if (flags & KVM_PIT_SPEAKER_DUMMY)
- kvm_io_bus_unregister_dev(&kvm->pio_bus, &pit->speaker_dev);
+ kvm_io_bus_unregister_dev(kvm, &kvm->pio_bus,
+ &pit->speaker_dev);

- kvm_io_bus_unregister_dev(&kvm->pio_bus, &pit->dev);
+ kvm_io_bus_unregister_dev(kvm, &kvm->pio_bus, &pit->dev);

if (pit->irq_source_id >= 0)
kvm_free_irq_source_id(kvm, pit->irq_source_id);
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 66e37c2..cfe3433 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -550,7 +550,7 @@ struct kvm_pic *kvm_create_pic(struct kvm *kvm)
* Initialize PIO device
*/
kvm_iodevice_init(&s->dev, &picdev_ops);
- ret = kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
+ ret = kvm_io_bus_register_dev(kvm, &kvm->pio_bus, &s->dev);
if (ret < 0) {
kfree(s);
return NULL;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eafa2b3..707c4d8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -61,10 +61,12 @@ void kvm_io_bus_init(struct kvm_io_bus *bus);
void kvm_io_bus_destroy(struct kvm_io_bus *bus);
struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
gpa_t addr, int len, int is_write);
-int kvm_io_bus_register_dev(struct kvm_io_bus *bus,
- struct kvm_io_device *dev);
-void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,
+int kvm_io_bus_register_dev(struct kvm *kvm,
+ struct kvm_io_bus *bus,
struct kvm_io_device *dev);
+void kvm_io_bus_unregister_dev(struct kvm *kvm,
+ struct kvm_io_bus *bus,
+ struct kvm_io_device *dev);

struct kvm_vcpu {
struct kvm *kvm;
@@ -139,6 +141,7 @@ struct kvm {
atomic_t online_vcpus;
struct list_head vm_list;
struct mutex lock;
+ unsigned long io_device_count;
struct kvm_io_bus mmio_bus;
struct kvm_io_bus pio_bus;
#ifdef CONFIG_HAVE_KVM_EVENTFD
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 3e89db8..09aae26 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -104,7 +104,7 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
dev->kvm = kvm;
kvm->coalesced_mmio_dev = dev;

- ret = kvm_io_bus_register_dev(&kvm->mmio_bus, &dev->dev);
+ ret = kvm_io_bus_register_dev(kvm, &kvm->mmio_bus, &dev->dev);
if (ret < 0)
kfree(dev);

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 28adb05..bc7ad68 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -345,7 +345,7 @@ int kvm_ioapic_init(struct kvm *kvm)
kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
ioapic->kvm = kvm;

- ret = kvm_io_bus_register_dev(&kvm->mmio_bus, &ioapic->dev);
+ ret = kvm_io_bus_register_dev(kvm, &kvm->mmio_bus, &ioapic->dev);
if (ret < 0)
kfree(ioapic);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6bd71f7..42cbea7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2482,18 +2482,25 @@ struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
}

/* assumes kvm->lock held */
-int kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
+int kvm_io_bus_register_dev(struct kvm *kvm,
+ struct kvm_io_bus *bus,
+ struct kvm_io_device *dev)
{
if (bus->dev_count > (NR_IOBUS_DEVS-1))
return -ENOSPC;

+ if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES)
+ return -ENOSPC;
+
bus->devs[bus->dev_count++] = dev;
+ kvm->io_device_count++;

return 0;
}

/* assumes kvm->lock held */
-void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,
+void kvm_io_bus_unregister_dev(struct kvm *kvm,
+ struct kvm_io_bus *bus,
struct kvm_io_device *dev)
{
int i;
@@ -2502,6 +2509,7 @@ void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,

if (bus->devs[i] == dev) {
bus->devs[i] = bus->devs[--bus->dev_count];
+ kvm->io_device_count--;
return;
}
}

2009-06-19 00:31:38

by Gregory Haskins

Subject: [KVM PATCH v8 3/3] KVM: add iosignalfd support

iosignalfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any arbitrary
IO address with a corresponding eventfd and then pass the eventfd to a
specific end-point of interest for handling.

Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.

However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to get our notification transmitted asynchronously and to
return as quickly as possible. All the synchronous infrastructure to ensure
proper data-dependencies are met in the normal IO case is just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.

Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
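For illustration, here is a userspace-side sketch of the ABI struct introduced
in patch 3/3, with <stdint.h> types standing in for the kernel's __u64/__u32/__u8.
The layout packs to exactly 64 bytes with no implicit padding, which is easy to
sanity-check; the address and fd below are hypothetical:

```c
#include <stdint.h>

#define KVM_IOSIGNALFD_FLAG_TRIGGER  (1 << 0)
#define KVM_IOSIGNALFD_FLAG_PIO      (1 << 1)
#define KVM_IOSIGNALFD_FLAG_DEASSIGN (1 << 2)

struct kvm_iosignalfd {
	uint64_t trigger;  /* data-match value, used if FLAG_TRIGGER is set */
	uint64_t addr;     /* guest PIO/MMIO address */
	uint32_t len;      /* width of the region (and of the match) */
	uint32_t fd;       /* eventfd to signal */
	uint32_t flags;
	uint8_t  pad[36];  /* reserved for future growth */
};

/* fill in a wildcard 4-byte MMIO registration (illustrative values) */
static struct kvm_iosignalfd example_args(int efd)
{
	struct kvm_iosignalfd args = {0};

	args.addr  = 0xc0000000;  /* hypothetical guest address */
	args.len   = 4;
	args.fd    = (uint32_t)efd;
	args.flags = 0;           /* no FLAG_TRIGGER -> wildcard match */
	return args;
}
```

Userspace would hand this to the VM fd via ioctl(vmfd, KVM_IOSIGNALFD, &args);
issuing the same addr/fd again with FLAG_DEASSIGN set undoes the registration.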

To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().

We then wired up two paths to the doorbell: one through QEMU via a
registered io region and the doorbell ioctl(), and the other directly
via iosignalfd.

You can download this test harness here:

ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2

The measured results are as follows:

qemu-mmio: 110000 iops, 9.09us rtt
iosignalfd-mmio: 200100 iops, 5.00us rtt
iosignalfd-pio: 367300 iops, 2.72us rtt

I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate from the NULLIO-run deltas (+2.56us for MMIO and -350ns
for HC), which gives:

qemu-pio: 153139 iops, 6.53us rtt
iosignalfd-hc: 412585 iops, 2.37us rtt

these are just for fun, for now, until I can gather more data.

Here is a graph for your convenience:

http://developer.novell.com/wiki/images/7/76/Iofd-chart.png

The conclusion to draw is that we save about 4us by skipping the userspace
hop.
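The extrapolated figures are just 1/rtt conversions of the measured numbers
minus the NULLIO deltas; a quick illustrative check of the qemu-pio arithmetic:

```c
/* iops is simply the inverse of the round-trip time (rtt in microseconds) */
static double iops(double rtt_us)
{
	return 1e6 / rtt_us;
}

/* qemu-pio estimate: the measured qemu-mmio rtt minus the MMIO NULLIO delta */
static double qemu_pio_rtt_us(void)
{
	return 9.09 - 2.56; /* = 6.53us, i.e. ~153k iops */
}
```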

--------------------

Signed-off-by: Gregory Haskins <[email protected]>
---

arch/x86/kvm/x86.c | 1
include/linux/kvm.h | 15 ++
include/linux/kvm_host.h | 10 +
virt/kvm/eventfd.c | 426 ++++++++++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 11 +
5 files changed, 459 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b91ea7..9b119e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1097,6 +1097,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_IRQ_INJECT_STATUS:
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_IRQFD:
+ case KVM_CAP_IOSIGNALFD:
case KVM_CAP_PIT2:
r = 1;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 38ff31e..9de6486 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -307,6 +307,19 @@ struct kvm_guest_debug {
struct kvm_guest_debug_arch arch;
};

+#define KVM_IOSIGNALFD_FLAG_TRIGGER (1 << 0)
+#define KVM_IOSIGNALFD_FLAG_PIO (1 << 1)
+#define KVM_IOSIGNALFD_FLAG_DEASSIGN (1 << 2)
+
+struct kvm_iosignalfd {
+ __u64 trigger;
+ __u64 addr;
+ __u32 len;
+ __u32 fd;
+ __u32 flags;
+ __u8 pad[36];
+};
+
#define KVM_TRC_SHIFT 16
/*
* kvm trace categories
@@ -438,6 +451,7 @@ struct kvm_trace_rec {
#define KVM_CAP_PIT2 33
#endif
#define KVM_CAP_SET_BOOT_CPU_ID 34
+#define KVM_CAP_IOSIGNALFD 35

#ifdef KVM_CAP_IRQ_ROUTING

@@ -544,6 +558,7 @@ struct kvm_irqfd {
#define KVM_IRQFD _IOW(KVMIO, 0x76, struct kvm_irqfd)
#define KVM_CREATE_PIT2 _IOW(KVMIO, 0x77, struct kvm_pit_config)
#define KVM_SET_BOOT_CPU_ID _IO(KVMIO, 0x78)
+#define KVM_IOSIGNALFD _IOW(KVMIO, 0x79, struct kvm_iosignalfd)

/*
* ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 707c4d8..6c0569a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -146,6 +146,7 @@ struct kvm {
struct kvm_io_bus pio_bus;
#ifdef CONFIG_HAVE_KVM_EVENTFD
struct list_head irqfds;
+ struct list_head iosignalfds;
#endif
struct kvm_vm_stat stat;
struct kvm_arch arch;
@@ -554,19 +555,24 @@ static inline void kvm_free_irq_routing(struct kvm *kvm) {}

#ifdef CONFIG_HAVE_KVM_EVENTFD

-void kvm_irqfd_init(struct kvm *kvm);
+void kvm_eventfd_init(struct kvm *kvm);
int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags);
void kvm_irqfd_release(struct kvm *kvm);
+int kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args);

#else

-static inline void kvm_irqfd_init(struct kvm *kvm) {}
+static inline void kvm_eventfd_init(struct kvm *kvm) {}
static inline int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
{
return -EINVAL;
}

static inline void kvm_irqfd_release(struct kvm *kvm) {}
+static inline int kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
+{
+ return -EINVAL;
+}

#endif /* CONFIG_HAVE_KVM_EVENTFD */

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2c8028c..52ac455 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -21,6 +21,7 @@
*/

#include <linux/kvm_host.h>
+#include <linux/kvm.h>
#include <linux/workqueue.h>
#include <linux/syscalls.h>
#include <linux/wait.h>
@@ -29,6 +30,9 @@
#include <linux/list.h>
#include <linux/eventfd.h>
#include <linux/srcu.h>
+#include <linux/kernel.h>
+
+#include "iodev.h"

/*
* --------------------------------------------------------------------
@@ -202,9 +206,10 @@ fail:
}

void
-kvm_irqfd_init(struct kvm *kvm)
+kvm_eventfd_init(struct kvm *kvm)
{
INIT_LIST_HEAD(&kvm->irqfds);
+ INIT_LIST_HEAD(&kvm->iosignalfds);
}

void
@@ -215,3 +220,422 @@ kvm_irqfd_release(struct kvm *kvm)
list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
irqfd_disconnect(irqfd);
}
+
+/*
+ * --------------------------------------------------------------------
+ * iosignalfd: translate a PIO/MMIO memory write to an eventfd signal.
+ *
+ * userspace can register a PIO/MMIO address with an eventfd for receiving
+ * notification when the memory has been touched.
+ * --------------------------------------------------------------------
+ */
+
+/*
+ * Design note: We create one PIO/MMIO device (iosignalfd_group) which
+ * aggregates one or more iosignalfd_items. Each item points to exactly one
+ * eventfd, and can be registered to trigger on any write to the group
+ * (wildcard), or to a write of a specific value. If more than one item is to
+ * be supported, the addr/len ranges must all be identical in the group. If a
+ * trigger value is to be supported on a particular item, the group range must
+ * be exactly the width of the trigger.
+ */
+
+struct _iosignalfd_item {
+ struct list_head list;
+ struct file *file;
+ u64 match;
+ struct rcu_head rcu;
+ int wildcard:1;
+};
+
+struct _iosignalfd_group {
+ struct list_head list;
+ u64 addr;
+ size_t length;
+ size_t count;
+ struct list_head items;
+ struct kvm_io_device dev;
+ struct rcu_head rcu;
+};
+
+static inline struct _iosignalfd_group *
+to_group(struct kvm_io_device *dev)
+{
+ return container_of(dev, struct _iosignalfd_group, dev);
+}
+
+static void
+iosignalfd_item_free(struct _iosignalfd_item *item)
+{
+ fput(item->file);
+ kfree(item);
+}
+
+static void
+iosignalfd_item_deferred_free(struct rcu_head *rhp)
+{
+ struct _iosignalfd_item *item;
+
+ item = container_of(rhp, struct _iosignalfd_item, rcu);
+
+ iosignalfd_item_free(item);
+}
+
+static void
+iosignalfd_group_deferred_free(struct rcu_head *rhp)
+{
+ struct _iosignalfd_group *group;
+
+ group = container_of(rhp, struct _iosignalfd_group, rcu);
+
+ kfree(group);
+}
+
+static int
+iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
+ int is_write)
+{
+ struct _iosignalfd_group *p = to_group(this);
+
+ return ((addr >= p->addr && (addr < p->addr + p->length)));
+}
+
+static int
+iosignalfd_is_match(struct _iosignalfd_group *group,
+ struct _iosignalfd_item *item,
+ const void *val,
+ int len)
+{
+ u64 _val;
+
+ if (len != group->length)
+ /* mis-matched length is always a miss */
+ return false;
+
+ if (item->wildcard)
+ /* wildcard is always a hit */
+ return true;
+
+ /* otherwise, we have to actually compare the data */
+
+ if (!IS_ALIGNED((unsigned long)val, len))
+ /* protect against this request causing a SIGBUS */
+ return false;
+
+ switch (len) {
+ case 1:
+ _val = *(u8 *)val;
+ break;
+ case 2:
+ _val = *(u16 *)val;
+ break;
+ case 4:
+ _val = *(u32 *)val;
+ break;
+ case 8:
+ _val = *(u64 *)val;
+ break;
+ default:
+ return false;
+ }
+
+ return _val == item->match;
+}
+
+/*
+ * MMIO/PIO writes trigger an event (if the data matches).
+ *
+ * This is invoked by the io_bus subsystem in response to an address match
+ * against the group. We must then walk the list of individual items to check
+ * for a match and, if applicable, to send the appropriate signal. If the item
+ * in question does not have a "match" pointer, it is considered a wildcard
+ * and will always generate a signal. There can be an arbitrary number
+ * of distinct matches or wildcards per group.
+ */
+static void
+iosignalfd_group_write(struct kvm_io_device *this, gpa_t addr, int len,
+ const void *val)
+{
+ struct _iosignalfd_group *group = to_group(this);
+ struct _iosignalfd_item *item;
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(item, &group->items, list) {
+ if (iosignalfd_is_match(group, item, val, len))
+ eventfd_signal(item->file, 1);
+ }
+
+ rcu_read_unlock();
+}
+
+/*
+ * MMIO/PIO reads against the group indiscriminately return all zeros
+ */
+static void
+iosignalfd_group_read(struct kvm_io_device *this, gpa_t addr, int len,
+ void *val)
+{
+ memset(val, 0, len);
+}
+
+/*
+ * This function is called as KVM is completely shutting down. We do not
+ * need to worry about locking or careful RCU dancing...just nuke anything
+ * we have as quickly as possible
+ */
+static void
+iosignalfd_group_destructor(struct kvm_io_device *this)
+{
+ struct _iosignalfd_group *group = to_group(this);
+ struct _iosignalfd_item *item, *tmp;
+
+ list_for_each_entry_safe(item, tmp, &group->items, list) {
+ list_del(&item->list);
+ iosignalfd_item_free(item);
+ }
+
+ list_del(&group->list);
+ kfree(group);
+}
+
+static const struct kvm_io_device_ops iosignalfd_ops = {
+ .read = iosignalfd_group_read,
+ .write = iosignalfd_group_write,
+ .in_range = iosignalfd_group_in_range,
+ .destructor = iosignalfd_group_destructor,
+};
+
+/* assumes kvm->lock held */
+static struct _iosignalfd_group *
+iosignalfd_group_find(struct kvm *kvm, u64 addr)
+{
+ struct _iosignalfd_group *group;
+
+ list_for_each_entry(group, &kvm->iosignalfds, list) {
+ if (group->addr == addr)
+ return group;
+ }
+
+ return NULL;
+}
+
+/* assumes kvm->lock is held */
+static struct _iosignalfd_group *
+iosignalfd_group_create(struct kvm *kvm, struct kvm_io_bus *bus,
+ u64 addr, size_t len)
+{
+ struct _iosignalfd_group *group;
+ int ret;
+
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&group->list);
+ INIT_LIST_HEAD(&group->items);
+ group->addr = addr;
+ group->length = len;
+ kvm_iodevice_init(&group->dev, &iosignalfd_ops);
+
+ ret = kvm_io_bus_register_dev(kvm, bus, &group->dev);
+ if (ret < 0) {
+ kfree(group);
+ return ERR_PTR(ret);
+ }
+
+ list_add_tail(&group->list, &kvm->iosignalfds);
+
+ return group;
+}
+
+static int
+kvm_assign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
+{
+ int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
+ struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
+ struct _iosignalfd_group *group = NULL;
+ struct _iosignalfd_item *item = NULL;
+ struct file *file;
+ int ret;
+
+ if (args->len > sizeof(u64))
+ return -EINVAL;
+
+ file = eventfd_fget(args->fd);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ item = kzalloc(sizeof(*item), GFP_KERNEL);
+ if (!item) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ INIT_LIST_HEAD(&item->list);
+ item->file = file;
+
+ /*
+ * A trigger address is optional, otherwise this is a wildcard
+ */
+ if (args->flags & KVM_IOSIGNALFD_FLAG_TRIGGER)
+ item->match = args->trigger;
+ else
+ item->wildcard = true;
+
+ mutex_lock(&kvm->lock);
+
+ /*
+ * Put an upper limit on the number of items we support
+ */
+ if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES) {
+ ret = -ENOSPC;
+ goto unlock_fail;
+ }
+
+ group = iosignalfd_group_find(kvm, args->addr);
+ if (!group) {
+
+ group = iosignalfd_group_create(kvm, bus,
+ args->addr, args->len);
+ if (IS_ERR(group)) {
+ ret = PTR_ERR(group);
+ goto unlock_fail;
+ }
+
+ /*
+ * Note: We do not increment io_device_count for the first item,
+ * as this is represented by the group device that we just
+ * registered. Make sure we handle this properly when we
+ * deassign the last item
+ */
+ } else {
+
+ if (group->length != args->len) {
+ /*
+ * Existing groups must have the same addr/len tuple
+ * or we reject the request
+ */
+ ret = -EINVAL;
+ goto unlock_fail;
+ }
+
+ kvm->io_device_count++;
+ }
+
+ /*
+ * Note: We are committed to succeed at this point since we have
+ * (potentially) published a new group-device. Any failure handling
+ * added in the future after this point will need to be carefully
+ * considered.
+ */
+
+ list_add_tail_rcu(&item->list, &group->items);
+ group->count++;
+
+ mutex_unlock(&kvm->lock);
+
+ return 0;
+
+unlock_fail:
+ mutex_unlock(&kvm->lock);
+fail:
+ if (item)
+ /*
+ * it would have never made it to the group->items list
+ * in the failure path, so we don't need to worry about removing
+ * it
+ */
+ kfree(item);
+
+ fput(file);
+
+ return ret;
+}
+
+
+static int
+kvm_deassign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
+{
+ int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
+ struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
+ struct _iosignalfd_group *group;
+ struct _iosignalfd_item *item, *tmp;
+ struct file *file;
+ int ret = 0;
+
+ file = eventfd_fget(args->fd);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ mutex_lock(&kvm->lock);
+
+ group = iosignalfd_group_find(kvm, args->addr);
+ if (!group) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Exhaustively search our group->items list for any items that might
+ * match the specified fd, and (carefully) remove each one found.
+ */
+ list_for_each_entry_safe(item, tmp, &group->items, list) {
+
+ if (item->file != file)
+ continue;
+
+ list_del_rcu(&item->list);
+ group->count--;
+ if (group->count)
+ /*
+ * We only decrement the global count if this is *not*
+ * the last item. The last item will be accounted for
+ * by the io_bus_unregister
+ */
+ kvm->io_device_count--;
+
+ /*
+ * The item may be still referenced inside our group->write()
+ * path's RCU read-side CS, so defer the actual free to the
+ * next grace
+ */
+ call_rcu(&item->rcu, iosignalfd_item_deferred_free);
+ }
+
+ /*
+ * Check if the group is now completely vacated as a result of
+ * removing the items. If so, unregister/delete it
+ */
+ if (!group->count) {
+
+ kvm_io_bus_unregister_dev(kvm, bus, &group->dev);
+
+ /*
+ * Like the item, the group may also still be referenced as
+ * per above. However, the kvm->iosignalfds list is not
+ * RCU protected (its protected by kvm->lock instead) so
+ * we can just plain-vanilla remove it. What needs to be
+ * done carefully is the actual freeing of the group pointer
+ * since we walk the group->items list within the RCU CS.
+ */
+ list_del(&group->list);
+ call_rcu(&group->rcu, iosignalfd_group_deferred_free);
+ }
+
+out:
+ mutex_unlock(&kvm->lock);
+
+ fput(file);
+
+ return ret;
+}
+
+int
+kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
+{
+ if (args->flags & KVM_IOSIGNALFD_FLAG_DEASSIGN)
+ return kvm_deassign_iosignalfd(kvm, args);
+
+ return kvm_assign_iosignalfd(kvm, args);
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 42cbea7..e6495d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -971,7 +971,7 @@ static struct kvm *kvm_create_vm(void)
atomic_inc(&kvm->mm->mm_count);
spin_lock_init(&kvm->mmu_lock);
kvm_io_bus_init(&kvm->pio_bus);
- kvm_irqfd_init(kvm);
+ kvm_eventfd_init(kvm);
mutex_init(&kvm->lock);
mutex_init(&kvm->irq_lock);
kvm_io_bus_init(&kvm->mmio_bus);
@@ -2227,6 +2227,15 @@ static long kvm_vm_ioctl(struct file *filp,
r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
break;
}
+ case KVM_IOSIGNALFD: {
+ struct kvm_iosignalfd data;
+
+ r = -EFAULT;
+ if (copy_from_user(&data, argp, sizeof data))
+ goto out;
+ r = kvm_iosignalfd(kvm, &data);
+ break;
+ }
#ifdef CONFIG_KVM_APIC_ARCHITECTURE
case KVM_SET_BOOT_CPU_ID:
r = 0;

2009-06-22 10:45:23

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Some comments below:

On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
> iosignalfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
> signal when written to by a guest. Host userspace can register any arbitrary
> IO address with a corresponding eventfd and then pass the eventfd to a
> specific end-point of interest for handling.
>
> Normal IO requires a blocking round-trip since the operation may cause
> side-effects in the emulated model or may return data to the caller.
> Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
> "heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
> device model synchronously before returning control back to the vcpu.
>
> However, there is a subclass of IO which acts purely as a trigger for
> other IO (such as to kick off an out-of-band DMA request, etc). For these
> patterns, the synchronous call is particularly expensive since we really
> only want to simply get our notification transmitted asychronously and
> return as quickly as possible. All the sychronous infrastructure to ensure
> proper data-dependencies are met in the normal IO case are just unecessary
> overhead for signalling. This adds additional computational load on the
> system, as well as latency to the signalling path.
>
> Therefore, we provide a mechanism for registration of an in-kernel trigger
> point that allows the VCPU to only require a very brief, lightweight
> exit just long enough to signal an eventfd. This also means that any
> clients compatible with the eventfd interface (which includes userspace
> and kernelspace equally well) can now register to be notified. The end
> result should be a more flexible and higher performance notification API
> for the backend KVM hypervisor and perhipheral components.
>
> To test this theory, we built a test-harness called "doorbell". This
> module has a function called "doorbell_ring()" which simply increments a
> counter for each time the doorbell is signaled. It supports signalling
> from either an eventfd, or an ioctl().
>
> We then wired up two paths to the doorbell: One via QEMU via a registered
> io region and through the doorbell ioctl(). The other is direct via
> iosignalfd.
>
> You can download this test harness here:
>
> ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
>
> The measured results are as follows:
>
> qemu-mmio: 110000 iops, 9.09us rtt
> iosignalfd-mmio: 200100 iops, 5.00us rtt
> iosignalfd-pio: 367300 iops, 2.72us rtt
>
> I didn't measure qemu-pio, because I have to figure out how to register a
> PIO region with qemu's device model, and I got lazy. However, for now we
> can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
> and -350ns for HC, we get:
>
> qemu-pio: 153139 iops, 6.53us rtt
> iosignalfd-hc: 412585 iops, 2.37us rtt
>
> these are just for fun, for now, until I can gather more data.
>
> Here is a graph for your convenience:
>
> http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
>
> The conclusion to draw is that we save about 4us by skipping the userspace
> hop.
>
> --------------------
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> arch/x86/kvm/x86.c | 1
> include/linux/kvm.h | 15 ++
> include/linux/kvm_host.h | 10 +
> virt/kvm/eventfd.c | 426 ++++++++++++++++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 11 +
> 5 files changed, 459 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b91ea7..9b119e4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1097,6 +1097,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_IRQ_INJECT_STATUS:
> case KVM_CAP_ASSIGN_DEV_IRQ:
> case KVM_CAP_IRQFD:
> + case KVM_CAP_IOSIGNALFD:
> case KVM_CAP_PIT2:
> r = 1;
> break;
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 38ff31e..9de6486 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -307,6 +307,19 @@ struct kvm_guest_debug {
> struct kvm_guest_debug_arch arch;
> };
>
> +#define KVM_IOSIGNALFD_FLAG_TRIGGER (1 << 0)
> +#define KVM_IOSIGNALFD_FLAG_PIO (1 << 1)
> +#define KVM_IOSIGNALFD_FLAG_DEASSIGN (1 << 2)

This is exposed to userspace authors.
So - some documentation for flags?

> +
> +struct kvm_iosignalfd {
> + __u64 trigger;

Is value, or trigger_value, a better name?
Which bits are used if len is e.g. 1
and what happens with unused bits?

> + __u64 addr;
> + __u32 len;

What are the legal values for len/addr?
Probably should document them.

> + __u32 fd;

fd should probably be signed? You cast it to int later.

> + __u32 flags;
> + __u8 pad[36];

4 byte padding not enough?

> +};
> +
> #define KVM_TRC_SHIFT 16
> /*
> * kvm trace categories
> @@ -438,6 +451,7 @@ struct kvm_trace_rec {
> #define KVM_CAP_PIT2 33
> #endif
> #define KVM_CAP_SET_BOOT_CPU_ID 34
> +#define KVM_CAP_IOSIGNALFD 35
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -544,6 +558,7 @@ struct kvm_irqfd {
> #define KVM_IRQFD _IOW(KVMIO, 0x76, struct kvm_irqfd)
> #define KVM_CREATE_PIT2 _IOW(KVMIO, 0x77, struct kvm_pit_config)
> #define KVM_SET_BOOT_CPU_ID _IO(KVMIO, 0x78)
> +#define KVM_IOSIGNALFD _IOW(KVMIO, 0x79, struct kvm_iosignalfd)
>
> /*
> * ioctls for vcpu fds
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 707c4d8..6c0569a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -146,6 +146,7 @@ struct kvm {
> struct kvm_io_bus pio_bus;
> #ifdef CONFIG_HAVE_KVM_EVENTFD

BTW, do we want this config option?

> struct list_head irqfds;
> + struct list_head iosignalfds;
> #endif
> struct kvm_vm_stat stat;
> struct kvm_arch arch;
> @@ -554,19 +555,24 @@ static inline void kvm_free_irq_routing(struct kvm *kvm) {}
>
> #ifdef CONFIG_HAVE_KVM_EVENTFD
>
> -void kvm_irqfd_init(struct kvm *kvm);
> +void kvm_eventfd_init(struct kvm *kvm);
> int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags);
> void kvm_irqfd_release(struct kvm *kvm);
> +int kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args);
>
> #else
>
> -static inline void kvm_irqfd_init(struct kvm *kvm) {}
> +static inline void kvm_eventfd_init(struct kvm *kvm) {}
> static inline int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> {
> return -EINVAL;
> }
>
> static inline void kvm_irqfd_release(struct kvm *kvm) {}
> +static inline int kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> +{
> + return -EINVAL;
> +}

I thought ENOTTY is the accepted error for an unsupported ioctl?

>
> #endif /* CONFIG_HAVE_KVM_EVENTFD */
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 2c8028c..52ac455 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -21,6 +21,7 @@
> */
>
> #include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> #include <linux/workqueue.h>
> #include <linux/syscalls.h>
> #include <linux/wait.h>
> @@ -29,6 +30,9 @@
> #include <linux/list.h>
> #include <linux/eventfd.h>
> #include <linux/srcu.h>
> +#include <linux/kernel.h>
> +
> +#include "iodev.h"
>
> /*
> * --------------------------------------------------------------------
> @@ -202,9 +206,10 @@ fail:
> }
>
> void
> -kvm_irqfd_init(struct kvm *kvm)
> +kvm_eventfd_init(struct kvm *kvm)
> {
> INIT_LIST_HEAD(&kvm->irqfds);
> + INIT_LIST_HEAD(&kvm->iosignalfds);
> }
>
> void
> @@ -215,3 +220,422 @@ kvm_irqfd_release(struct kvm *kvm)
> list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
> irqfd_disconnect(irqfd);
> }
> +
> +/*
> + * --------------------------------------------------------------------
> + * iosignalfd: translate a PIO/MMIO memory write to an eventfd signal.
> + *
> + * userspace can register a PIO/MMIO address with an eventfd for recieving

typo

> + * notification when the memory has been touched.
> + * --------------------------------------------------------------------
> + */
> +
> +/*
> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
> + * aggregates one or more iosignalfd_items. Each item points to exactly one
> + * eventfd, and can be registered to trigger on any write to the group
> + * (wildcard), or to a write of a specific value. If more than one item is to
> + * be supported, the addr/len ranges must all be identical in the group. If a
> + * trigger value is to be supported on a particular item, the group range must
> + * be exactly the width of the trigger.

Some duplicate spaces in the text above, apparently at random places.

> + */
> +
> +struct _iosignalfd_item {
> + struct list_head list;
> + struct file *file;
> + u64 match;
> + struct rcu_head rcu;
> + int wildcard:1;
> +};
> +
> +struct _iosignalfd_group {
> + struct list_head list;
> + u64 addr;
> + size_t length;
> + size_t count;
> + struct list_head items;
> + struct kvm_io_device dev;
> + struct rcu_head rcu;
> +};
> +
> +static inline struct _iosignalfd_group *
> +to_group(struct kvm_io_device *dev)
> +{
> + return container_of(dev, struct _iosignalfd_group, dev);
> +}
> +
> +static void
> +iosignalfd_item_free(struct _iosignalfd_item *item)
> +{
> + fput(item->file);
> + kfree(item);
> +}
> +
> +static void
> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
> +{
> + struct _iosignalfd_item *item;
> +
> + item = container_of(rhp, struct _iosignalfd_item, rcu);
> +
> + iosignalfd_item_free(item);
> +}
> +
> +static void
> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
> +{
> + struct _iosignalfd_group *group;
> +
> + group = container_of(rhp, struct _iosignalfd_group, rcu);
> +
> + kfree(group);
> +}
> +
> +static int
> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
> + int is_write)
> +{
> + struct _iosignalfd_group *p = to_group(this);
> +
> + return ((addr >= p->addr && (addr < p->addr + p->length)));
> +}

What does this test? len is ignored ...

> +
> +static int

This seems to be returning bool ...

> +iosignalfd_is_match(struct _iosignalfd_group *group,
> + struct _iosignalfd_item *item,
> + const void *val,
> + int len)
> +{
> + u64 _val;
> +
> + if (len != group->length)
> + /* mis-matched length is always a miss */
> + return false;

Why is that? what if there's 8 byte write which covers
a 4 byte group?

> +
> + if (item->wildcard)
> + /* wildcard is always a hit */
> + return true;
> +
> + /* otherwise, we have to actually compare the data */
> +
> + if (!IS_ALIGNED((unsigned long)val, len))
> + /* protect against this request causing a SIGBUS */
> + return false;

Could you explain what this does please?
I thought misaligned accesses are allowed.

> +
> + switch (len) {
> + case 1:
> + _val = *(u8 *)val;
> + break;
> + case 2:
> + _val = *(u16 *)val;
> + break;
> + case 4:
> + _val = *(u32 *)val;
> + break;
> + case 8:
> + _val = *(u64 *)val;
> + break;
> + default:
> + return false;
> + }

So legal values for len are 1,2,4 and 8?
Might be a good idea to document this.

> +
> + return _val == item->match;
> +}
> +
> +/*
> + * MMIO/PIO writes trigger an event (if the data matches).
> + *
> + * This is invoked by the io_bus subsystem in response to an address match
> + * against the group. We must then walk the list of individual items to check
> + * for a match and, if applicable, to send the appropriate signal. If the item
> + * in question does not have a "match" pointer, it is considered a wildcard
> + * and will always generate a signal. There can be an arbitrary number
> + * of distinct matches or wildcards per group.
> + */
> +static void
> +iosignalfd_group_write(struct kvm_io_device *this, gpa_t addr, int len,
> + const void *val)
> +{
> + struct _iosignalfd_group *group = to_group(this);
> + struct _iosignalfd_item *item;
> +
> + rcu_read_lock();
> +
> + list_for_each_entry_rcu(item, &group->items, list) {
> + if (iosignalfd_is_match(group, item, val, len))
> + eventfd_signal(item->file, 1);
> + }
> +
> + rcu_read_unlock();
> +}
> +
> +/*
> + * MMIO/PIO reads against the group indiscriminately return all zeros
> + */

Does it have to be so? It would be better to bounce reads to
userspace...

> +static void
> +iosignalfd_group_read(struct kvm_io_device *this, gpa_t addr, int len,
> + void *val)
> +{
> + memset(val, 0, len);
> +}
> +
> +/*
> + * This function is called as KVM is completely shutting down. We do not
> + * need to worry about locking or careful RCU dancing...just nuke anything
> + * we have as quickly as possible
> + */
> +static void
> +iosignalfd_group_destructor(struct kvm_io_device *this)
> +{
> + struct _iosignalfd_group *group = to_group(this);
> + struct _iosignalfd_item *item, *tmp;
> +
> + list_for_each_entry_safe(item, tmp, &group->items, list) {
> + list_del(&item->list);
> + iosignalfd_item_free(item);
> + }
> +
> + list_del(&group->list);
> + kfree(group);
> +}
> +
> +static const struct kvm_io_device_ops iosignalfd_ops = {
> + .read = iosignalfd_group_read,
> + .write = iosignalfd_group_write,
> + .in_range = iosignalfd_group_in_range,
> + .destructor = iosignalfd_group_destructor,
> +};
> +
> +/* assumes kvm->lock held */
> +static struct _iosignalfd_group *
> +iosignalfd_group_find(struct kvm *kvm, u64 addr)
> +{
> + struct _iosignalfd_group *group;
> +
> + list_for_each_entry(group, &kvm->iosignalfds, list) {

{} not needed here

> + if (group->addr == addr)
> + return group;
> + }
> +
> + return NULL;
> +}
> +
> +/* assumes kvm->lock is held */
> +static struct _iosignalfd_group *
> +iosignalfd_group_create(struct kvm *kvm, struct kvm_io_bus *bus,
> + u64 addr, size_t len)
> +{
> + struct _iosignalfd_group *group;
> + int ret;
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&group->list);
> + INIT_LIST_HEAD(&group->items);
> + group->addr = addr;
> + group->length = len;
> + kvm_iodevice_init(&group->dev, &iosignalfd_ops);
> +
> + ret = kvm_io_bus_register_dev(kvm, bus, &group->dev);
> + if (ret < 0) {
> + kfree(group);
> + return ERR_PTR(ret);
> + }
> +
> + list_add_tail(&group->list, &kvm->iosignalfds);
> +
> + return group;
> +}
> +
> +static int
> +kvm_assign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> +{
> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
> + struct _iosignalfd_group *group = NULL;

why does group need to be initialized?

> + struct _iosignalfd_item *item = NULL;

Why does item need to be initialized?

> + struct file *file;
> + int ret;
> +
> + if (args->len > sizeof(u64))

Is e.g. value 3 legal?

> + return -EINVAL;

> +
> + file = eventfd_fget(args->fd);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> +
> + item = kzalloc(sizeof(*item), GFP_KERNEL);
> + if (!item) {
> + ret = -ENOMEM;
> + goto fail;
> + }
> +
> + INIT_LIST_HEAD(&item->list);
> + item->file = file;
> +
> + /*
> + * A trigger address is optional, otherwise this is a wildcard
> + */
> + if (args->flags & KVM_IOSIGNALFD_FLAG_TRIGGER)
> + item->match = args->trigger;
> + else
> + item->wildcard = true;
> +
> + mutex_lock(&kvm->lock);
> +
> + /*
> + * Put an upper limit on the number of items we support
> + */

Groups and items, actually, right?

> + if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES) {
> + ret = -ENOSPC;
> + goto unlock_fail;
> + }
> +
> + group = iosignalfd_group_find(kvm, args->addr);
> + if (!group) {
> +
> + group = iosignalfd_group_create(kvm, bus,
> + args->addr, args->len);
> + if (IS_ERR(group)) {
> + ret = PTR_ERR(group);
> + goto unlock_fail;
> + }
> +
> + /*
> + * Note: We do not increment io_device_count for the first item,
> + * as this is represented by the group device that we just
> + * registered. Make sure we handle this properly when we
> + * deassign the last item
> + */
> + } else {
> +
> + if (group->length != args->len) {
> + /*
> + * Existing groups must have the same addr/len tuple
> + * or we reject the request
> + */
> + ret = -EINVAL;
> + goto unlock_fail;

Most errors seem to trigger EINVAL. Applications will be
easier to debug if different errors are returned on
different mistakes. E.g. here EBUSY might be good. And same
in other places.

> + }
> +
> + kvm->io_device_count++;
> + }
> +
> + /*
> + * Note: We are committed to succeed at this point since we have
> + * (potentially) published a new group-device. Any failure handling
> + * added in the future after this point will need to be carefully
> + * considered.
> + */
> +
> + list_add_tail_rcu(&item->list, &group->items);
> + group->count++;
> +
> + mutex_unlock(&kvm->lock);
> +
> + return 0;
> +
> +unlock_fail:
> + mutex_unlock(&kvm->lock);
> +fail:
> + if (item)
> + /*
> + * it would have never made it to the group->items list
> + * in the failure path, so we dont need to worry about removing
> + * it
> + */
> + kfree(item);
> +
> + fput(file);
> +
> + return ret;
> +}
> +
> +
> +static int
> +kvm_deassign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> +{
> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
> + struct _iosignalfd_group *group;
> + struct _iosignalfd_item *item, *tmp;
> + struct file *file;
> + int ret = 0;
> +
> + file = eventfd_fget(args->fd);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> +
> + mutex_lock(&kvm->lock);
> +
> + group = iosignalfd_group_find(kvm, args->addr);
> + if (!group) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * Exhaustively search our group->items list for any items that might
> + * match the specified fd, and (carefully) remove each one found.
> + */
> + list_for_each_entry_safe(item, tmp, &group->items, list) {
> +
> + if (item->file != file)
> + continue;
> +
> + list_del_rcu(&item->list);
> + group->count--;
> + if (group->count)
> + /*
> + * We only decrement the global count if this is *not*
> + * the last item. The last item will be accounted for
> + * by the io_bus_unregister
> + */
> + kvm->io_device_count--;
> +
> + /*
> + * The item may be still referenced inside our group->write()
> + * path's RCU read-side CS, so defer the actual free to the
> + * next grace
> + */
> + call_rcu(&item->rcu, iosignalfd_item_deferred_free);
> + }
> +
> + /*
> + * Check if the group is now completely vacated as a result of
> + * removing the items. If so, unregister/delete it
> + */
> + if (!group->count) {
> +
> + kvm_io_bus_unregister_dev(kvm, bus, &group->dev);
> +
> + /*
> + * Like the item, the group may also still be referenced as
> + * per above. However, the kvm->iosignalfds list is not
> + * RCU protected (its protected by kvm->lock instead) so
> + * we can just plain-vanilla remove it. What needs to be
> + * done carefully is the actual freeing of the group pointer
> + * since we walk the group->items list within the RCU CS.
> + */
> + list_del(&group->list);
> + call_rcu(&group->rcu, iosignalfd_group_deferred_free);

This is a deferred call, is it not, with no guarantee on when it will
run? If correct I think synchronize_rcu might be better here:
- can the module go away while iosignalfd_group_deferred_free is
running?
- can eventfd be signalled *after* ioctl exits? If yes
this might confuse applications if they use the eventfd
for something else.

> + }
> +
> +out:
> + mutex_unlock(&kvm->lock);
> +
> + fput(file);
> +
> + return ret;
> +}
> +
> +int
> +kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> +{
> + if (args->flags & KVM_IOSIGNALFD_FLAG_DEASSIGN)
> + return kvm_deassign_iosignalfd(kvm, args);
> +
> + return kvm_assign_iosignalfd(kvm, args);
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 42cbea7..e6495d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -971,7 +971,7 @@ static struct kvm *kvm_create_vm(void)
> atomic_inc(&kvm->mm->mm_count);
> spin_lock_init(&kvm->mmu_lock);
> kvm_io_bus_init(&kvm->pio_bus);
> - kvm_irqfd_init(kvm);
> + kvm_eventfd_init(kvm);
> mutex_init(&kvm->lock);
> mutex_init(&kvm->irq_lock);
> kvm_io_bus_init(&kvm->mmio_bus);
> @@ -2227,6 +2227,15 @@ static long kvm_vm_ioctl(struct file *filp,
> r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
> break;
> }
> + case KVM_IOSIGNALFD: {
> + struct kvm_iosignalfd data;
> +
> + r = -EFAULT;
> + if (copy_from_user(&data, argp, sizeof data))
> + goto out;
> + r = kvm_iosignalfd(kvm, &data);
> + break;
> + }
> #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> case KVM_SET_BOOT_CPU_ID:
> r = 0;
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-06-22 12:14:14

by Gregory Haskins

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> Some comments below:
>
> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>
>> iosignalfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
>> signal when written to by a guest. Host userspace can register any arbitrary
>> IO address with a corresponding eventfd and then pass the eventfd to a
>> specific end-point of interest for handling.
>>
>> Normal IO requires a blocking round-trip since the operation may cause
>> side-effects in the emulated model or may return data to the caller.
>> Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
>> "heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
>> device model synchronously before returning control back to the vcpu.
>>
>> However, there is a subclass of IO which acts purely as a trigger for
>> other IO (such as to kick off an out-of-band DMA request, etc). For these
>> patterns, the synchronous call is particularly expensive since we really
>> only want to simply get our notification transmitted asychronously and
>> return as quickly as possible. All the sychronous infrastructure to ensure
>> proper data-dependencies are met in the normal IO case are just unecessary
>> overhead for signalling. This adds additional computational load on the
>> system, as well as latency to the signalling path.
>>
>> Therefore, we provide a mechanism for registration of an in-kernel trigger
>> point that allows the VCPU to only require a very brief, lightweight
>> exit just long enough to signal an eventfd. This also means that any
>> clients compatible with the eventfd interface (which includes userspace
>> and kernelspace equally well) can now register to be notified. The end
>> result should be a more flexible and higher performance notification API
>> for the backend KVM hypervisor and perhipheral components.
>>
>> To test this theory, we built a test-harness called "doorbell". This
>> module has a function called "doorbell_ring()" which simply increments a
>> counter for each time the doorbell is signaled. It supports signalling
>> from either an eventfd, or an ioctl().
>>
>> We then wired up two paths to the doorbell: One via QEMU via a registered
>> io region and through the doorbell ioctl(). The other is direct via
>> iosignalfd.
>>
>> You can download this test harness here:
>>
>> ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
>>
>> The measured results are as follows:
>>
>> qemu-mmio: 110000 iops, 9.09us rtt
>> iosignalfd-mmio: 200100 iops, 5.00us rtt
>> iosignalfd-pio: 367300 iops, 2.72us rtt
>>
>> I didn't measure qemu-pio, because I have to figure out how to register a
>> PIO region with qemu's device model, and I got lazy. However, for now we
>> can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
>> and -350ns for HC, we get:
>>
>> qemu-pio: 153139 iops, 6.53us rtt
>> iosignalfd-hc: 412585 iops, 2.37us rtt
>>
>> these are just for fun, for now, until I can gather more data.
>>
>> Here is a graph for your convenience:
>>
>> http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
>>
>> The conclusion to draw is that we save about 4us by skipping the userspace
>> hop.
>>
>> --------------------
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> arch/x86/kvm/x86.c | 1
>> include/linux/kvm.h | 15 ++
>> include/linux/kvm_host.h | 10 +
>> virt/kvm/eventfd.c | 426 ++++++++++++++++++++++++++++++++++++++++++++++
>> virt/kvm/kvm_main.c | 11 +
>> 5 files changed, 459 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1b91ea7..9b119e4 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1097,6 +1097,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>> case KVM_CAP_IRQ_INJECT_STATUS:
>> case KVM_CAP_ASSIGN_DEV_IRQ:
>> case KVM_CAP_IRQFD:
>> + case KVM_CAP_IOSIGNALFD:
>> case KVM_CAP_PIT2:
>> r = 1;
>> break;
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 38ff31e..9de6486 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -307,6 +307,19 @@ struct kvm_guest_debug {
>> struct kvm_guest_debug_arch arch;
>> };
>>
>> +#define KVM_IOSIGNALFD_FLAG_TRIGGER (1 << 0)
>> +#define KVM_IOSIGNALFD_FLAG_PIO (1 << 1)
>> +#define KVM_IOSIGNALFD_FLAG_DEASSIGN (1 << 2)
>>
>
> This is exposed to userspace authors.
> So - some documentation for flags?
>

Yeah, it's probably not a bad idea. However, note that I did document
the interface in the qemu-kvm.git patch, and there isn't a lot of
precedent for documentation in kvm.h for other interfaces.

>
>> +
>> +struct kvm_iosignalfd {
>> + __u64 trigger;
>>
>
> Is value, or trigger_value, a better name?
>

I think trigger is fine personally, but I don't feel strongly either
way. Any comments by others?

> Which bits are used if len is e.g. 1
> and what happens with unused bits?
>

Will document.
>
>> + __u64 addr;
>> + __u32 len;
>>
>
> What are the legal values for len/addr?
> Probably should document them.
>
>
Ack

>> + __u32 fd;
>>
>
> fd should probably be signed? You cast it to int later.
>
>
Ack

>> + __u32 flags;
>> + __u8 pad[36];
>>
>
> 4 byte padding not enough?
>
>
It doesn't leave much room for expansion, so I like a round 64 better.

>> +};
>> +
>> #define KVM_TRC_SHIFT 16
>> /*
>> * kvm trace categories
>> @@ -438,6 +451,7 @@ struct kvm_trace_rec {
>> #define KVM_CAP_PIT2 33
>> #endif
>> #define KVM_CAP_SET_BOOT_CPU_ID 34
>> +#define KVM_CAP_IOSIGNALFD 35
>>
>> #ifdef KVM_CAP_IRQ_ROUTING
>>
>> @@ -544,6 +558,7 @@ struct kvm_irqfd {
>> #define KVM_IRQFD _IOW(KVMIO, 0x76, struct kvm_irqfd)
>> #define KVM_CREATE_PIT2 _IOW(KVMIO, 0x77, struct kvm_pit_config)
>> #define KVM_SET_BOOT_CPU_ID _IO(KVMIO, 0x78)
>> +#define KVM_IOSIGNALFD _IOW(KVMIO, 0x79, struct kvm_iosignalfd)
>>
>> /*
>> * ioctls for vcpu fds
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 707c4d8..6c0569a 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -146,6 +146,7 @@ struct kvm {
>> struct kvm_io_bus pio_bus;
>> #ifdef CONFIG_HAVE_KVM_EVENTFD
>>
>
> BTW, do we want this config option?
>

I think so, yes.

>
>> struct list_head irqfds;
>> + struct list_head iosignalfds;
>> #endif
>> struct kvm_vm_stat stat;
>> struct kvm_arch arch;
>> @@ -554,19 +555,24 @@ static inline void kvm_free_irq_routing(struct kvm *kvm) {}
>>
>> #ifdef CONFIG_HAVE_KVM_EVENTFD
>>
>> -void kvm_irqfd_init(struct kvm *kvm);
>> +void kvm_eventfd_init(struct kvm *kvm);
>> int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags);
>> void kvm_irqfd_release(struct kvm *kvm);
>> +int kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args);
>>
>> #else
>>
>> -static inline void kvm_irqfd_init(struct kvm *kvm) {}
>> +static inline void kvm_eventfd_init(struct kvm *kvm) {}
>> static inline int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> {
>> return -EINVAL;
>> }
>>
>> static inline void kvm_irqfd_release(struct kvm *kvm) {}
>> +static inline int kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>> +{
>> + return -EINVAL;
>> +}
>>
>
> I thought ENOTTY is the accepted error for an unsupported ioctl?
>
>

Yeah, I think Avi suggested ENOSYS in a similar scenario for other code
I have. Either is probably better than EINVAL. Does anyone have a
preference between ENOTTY and ENOSYS?

>>
>> #endif /* CONFIG_HAVE_KVM_EVENTFD */
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 2c8028c..52ac455 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -21,6 +21,7 @@
>> */
>>
>> #include <linux/kvm_host.h>
>> +#include <linux/kvm.h>
>> #include <linux/workqueue.h>
>> #include <linux/syscalls.h>
>> #include <linux/wait.h>
>> @@ -29,6 +30,9 @@
>> #include <linux/list.h>
>> #include <linux/eventfd.h>
>> #include <linux/srcu.h>
>> +#include <linux/kernel.h>
>> +
>> +#include "iodev.h"
>>
>> /*
>> * --------------------------------------------------------------------
>> @@ -202,9 +206,10 @@ fail:
>> }
>>
>> void
>> -kvm_irqfd_init(struct kvm *kvm)
>> +kvm_eventfd_init(struct kvm *kvm)
>> {
>> INIT_LIST_HEAD(&kvm->irqfds);
>> + INIT_LIST_HEAD(&kvm->iosignalfds);
>> }
>>
>> void
>> @@ -215,3 +220,422 @@ kvm_irqfd_release(struct kvm *kvm)
>> list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
>> irqfd_disconnect(irqfd);
>> }
>> +
>> +/*
>> + * --------------------------------------------------------------------
>> + * iosignalfd: translate a PIO/MMIO memory write to an eventfd signal.
>> + *
>> + * userspace can register a PIO/MMIO address with an eventfd for recieving
>>
>
> typo
>

Thx, will fix
>
>> + * notification when the memory has been touched.
>> + * --------------------------------------------------------------------
>> + */
>> +
>> +/*
>> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
>> + * aggregates one or more iosignalfd_items. Each item points to exactly one
>> + * eventfd, and can be registered to trigger on any write to the group
>> + * (wildcard), or to a write of a specific value. If more than one item is to
>> + * be supported, the addr/len ranges must all be identical in the group. If a
>> + * trigger value is to be supported on a particular item, the group range must
>> + * be exactly the width of the trigger.
>>
>
> Some duplicate spaces in the text above, apparently at random places.
>
>

-ENOPARSE ;)

Can you elaborate?

>> + */
>> +
>> +struct _iosignalfd_item {
>> + struct list_head list;
>> + struct file *file;
>> + u64 match;
>> + struct rcu_head rcu;
>> + int wildcard:1;
>> +};
>> +
>> +struct _iosignalfd_group {
>> + struct list_head list;
>> + u64 addr;
>> + size_t length;
>> + size_t count;
>> + struct list_head items;
>> + struct kvm_io_device dev;
>> + struct rcu_head rcu;
>> +};
>> +
>> +static inline struct _iosignalfd_group *
>> +to_group(struct kvm_io_device *dev)
>> +{
>> + return container_of(dev, struct _iosignalfd_group, dev);
>> +}
>> +
>> +static void
>> +iosignalfd_item_free(struct _iosignalfd_item *item)
>> +{
>> + fput(item->file);
>> + kfree(item);
>> +}
>> +
>> +static void
>> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
>> +{
>> + struct _iosignalfd_item *item;
>> +
>> + item = container_of(rhp, struct _iosignalfd_item, rcu);
>> +
>> + iosignalfd_item_free(item);
>> +}
>> +
>> +static void
>> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
>> +{
>> + struct _iosignalfd_group *group;
>> +
>> + group = container_of(rhp, struct _iosignalfd_group, rcu);
>> +
>> + kfree(group);
>> +}
>> +
>> +static int
>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>> + int is_write)
>> +{
>> + struct _iosignalfd_group *p = to_group(this);
>> +
>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>> +}
>>
>
> What does this test? len is ignored ...
>
>
Yeah, I was following precedent with other IO devices. However, this
*is* sloppy, I agree. Will fix.

>> +
>> +static int
>>
>
> This seems to be returning bool ...
>

Ack
>
>> +iosignalfd_is_match(struct _iosignalfd_group *group,
>> + struct _iosignalfd_item *item,
>> + const void *val,
>> + int len)
>> +{
>> + u64 _val;
>> +
>> + if (len != group->length)
>> + /* mis-matched length is always a miss */
>> + return false;
>>
>
> Why is that? what if there's 8 byte write which covers
> a 4 byte group?
>

v7 and earlier used to allow that for wildcards, actually. It of
course would never make sense to allow mis-matched writes for
non-wildcards, since the idea is to match the value exactly. However,
the feedback I got from Avi was that we should make the wildcard vs
non-wildcard access symmetrical and ensure they both conform to the size.
>
>> +
>> + if (item->wildcard)
>> + /* wildcard is always a hit */
>> + return true;
>> +
>> + /* otherwise, we have to actually compare the data */
>> +
>> + if (!IS_ALIGNED((unsigned long)val, len))
>> + /* protect against this request causing a SIGBUS */
>> + return false;
>>
>
> Could you explain what this does please?
>
Sure: item->match is a fixed u64 to represent all group->length
values.  So it might have a 1, 2, 4, or 8 byte value in it.  When a
write arrives, we need to cast the data-register (in this case
represented by (void*)val) into a u64 so the equality check (see [A],
below) can be done.  However, you can't dereference an unaligned
pointer, or it will SIGBUS on many (most?) architectures.

> I thought misaligned accesses are allowed.
>
If that's true, we are in trouble ;)

>
>> +
>> + switch (len) {
>> + case 1:
>> + _val = *(u8 *)val;
>> + break;
>> + case 2:
>> + _val = *(u16 *)val;
>> + break;
>> + case 4:
>> + _val = *(u32 *)val;
>> + break;
>> + case 8:
>> + _val = *(u64 *)val;
>> + break;
>> + default:
>> + return false;
>> + }
>>
>
> So legal values for len are 1,2,4 and 8?
> Might be a good idea to document this.
>
>
Ack

>> +
>> + return _val == item->match;
>>

[A]

>> +}
>> +
>> +/*
>> + * MMIO/PIO writes trigger an event (if the data matches).
>> + *
>> + * This is invoked by the io_bus subsystem in response to an address match
>> + * against the group. We must then walk the list of individual items to check
>> + * for a match and, if applicable, to send the appropriate signal. If the item
>> + * in question does not have a "match" pointer, it is considered a wildcard
>> + * and will always generate a signal. There can be an arbitrary number
>> + * of distinct matches or wildcards per group.
>> + */
>> +static void
>> +iosignalfd_group_write(struct kvm_io_device *this, gpa_t addr, int len,
>> + const void *val)
>> +{
>> + struct _iosignalfd_group *group = to_group(this);
>> + struct _iosignalfd_item *item;
>> +
>> + rcu_read_lock();
>> +
>> + list_for_each_entry_rcu(item, &group->items, list) {
>> + if (iosignalfd_is_match(group, item, val, len))
>> + eventfd_signal(item->file, 1);
>> + }
>> +
>> + rcu_read_unlock();
>> +}
>> +
>> +/*
>> + * MMIO/PIO reads against the group indiscriminately return all zeros
>> + */
>>
>
> Does it have to be so? It would be better to bounce reads to
> userspace...
>
>
Good idea. I can set is_write = false and I should never get this
function called.

>> +static void
>> +iosignalfd_group_read(struct kvm_io_device *this, gpa_t addr, int len,
>> + void *val)
>> +{
>> + memset(val, 0, len);
>> +}
>> +
>> +/*
>> + * This function is called as KVM is completely shutting down. We do not
>> + * need to worry about locking or careful RCU dancing...just nuke anything
>> + * we have as quickly as possible
>> + */
>> +static void
>> +iosignalfd_group_destructor(struct kvm_io_device *this)
>> +{
>> + struct _iosignalfd_group *group = to_group(this);
>> + struct _iosignalfd_item *item, *tmp;
>> +
>> + list_for_each_entry_safe(item, tmp, &group->items, list) {
>> + list_del(&item->list);
>> + iosignalfd_item_free(item);
>> + }
>> +
>> + list_del(&group->list);
>> + kfree(group);
>> +}
>> +
>> +static const struct kvm_io_device_ops iosignalfd_ops = {
>> + .read = iosignalfd_group_read,
>> + .write = iosignalfd_group_write,
>> + .in_range = iosignalfd_group_in_range,
>> + .destructor = iosignalfd_group_destructor,
>> +};
>> +
>> +/* assumes kvm->lock held */
>> +static struct _iosignalfd_group *
>> +iosignalfd_group_find(struct kvm *kvm, u64 addr)
>> +{
>> + struct _iosignalfd_group *group;
>> +
>> + list_for_each_entry(group, &kvm->iosignalfds, list) {
>>
>
> {} not needed here
>

Ack
>
>> + if (group->addr == addr)
>> + return group;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +/* assumes kvm->lock is held */
>> +static struct _iosignalfd_group *
>> +iosignalfd_group_create(struct kvm *kvm, struct kvm_io_bus *bus,
>> + u64 addr, size_t len)
>> +{
>> + struct _iosignalfd_group *group;
>> + int ret;
>> +
>> + group = kzalloc(sizeof(*group), GFP_KERNEL);
>> + if (!group)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + INIT_LIST_HEAD(&group->list);
>> + INIT_LIST_HEAD(&group->items);
>> + group->addr = addr;
>> + group->length = len;
>> + kvm_iodevice_init(&group->dev, &iosignalfd_ops);
>> +
>> + ret = kvm_io_bus_register_dev(kvm, bus, &group->dev);
>> + if (ret < 0) {
>> + kfree(group);
>> + return ERR_PTR(ret);
>> + }
>> +
>> + list_add_tail(&group->list, &kvm->iosignalfds);
>> +
>> + return group;
>> +}
>> +
>> +static int
>> +kvm_assign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>> +{
>> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
>> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
>> + struct _iosignalfd_group *group = NULL;
>>
>
> why does group need to be initialized?
>
>
>> + struct _iosignalfd_item *item = NULL;
>>
>
> Why does item need to be initialized?
>
>

Probably leftover from versions prior to v8. Will fix.

>> + struct file *file;
>> + int ret;
>> +
>> + if (args->len > sizeof(u64))
>>
>
> Is e.g. value 3 legal?
>

Ack. Will check against legal values.

>
>> + return -EINVAL;
>>
>
>
>> +
>> + file = eventfd_fget(args->fd);
>> + if (IS_ERR(file))
>> + return PTR_ERR(file);
>> +
>> + item = kzalloc(sizeof(*item), GFP_KERNEL);
>> + if (!item) {
>> + ret = -ENOMEM;
>> + goto fail;
>> + }
>> +
>> + INIT_LIST_HEAD(&item->list);
>> + item->file = file;
>> +
>> + /*
>> + * A trigger address is optional, otherwise this is a wildcard
>> + */
>> + if (args->flags & KVM_IOSIGNALFD_FLAG_TRIGGER)
>> + item->match = args->trigger;
>> + else
>> + item->wildcard = true;
>> +
>> + mutex_lock(&kvm->lock);
>> +
>> + /*
>> + * Put an upper limit on the number of items we support
>> + */
>>
>
> Groups and items, actually, right?
>
>

Yeah, though technically that is implicit when you say "items", since
each group always has at least one item. I will try to make this
clearer, though.

>> + if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES) {
>> + ret = -ENOSPC;
>> + goto unlock_fail;
>> + }
>> +
>> + group = iosignalfd_group_find(kvm, args->addr);
>> + if (!group) {
>> +
>> + group = iosignalfd_group_create(kvm, bus,
>> + args->addr, args->len);
>> + if (IS_ERR(group)) {
>> + ret = PTR_ERR(group);
>> + goto unlock_fail;
>> + }
>> +
>> + /*
>> + * Note: We do not increment io_device_count for the first item,
>> + * as this is represented by the group device that we just
>> + * registered. Make sure we handle this properly when we
>> + * deassign the last item
>> + */
>> + } else {
>> +
>> + if (group->length != args->len) {
>> + /*
>> + * Existing groups must have the same addr/len tuple
>> + * or we reject the request
>> + */
>> + ret = -EINVAL;
>> + goto unlock_fail;
>>
>
> Most errors seem to trigger EINVAL. Applications will be
> easier to debug if different errors are returned on
> different mistakes.

Yeah, agreed. Will try to differentiate some errors here.

> E.g. here EBUSY might be good. And same
> in other places.
>
>

Actually, I think EBUSY is supposed to be a transitory error, and would
not be appropriate to use here.  That said, your point is taken: I will
find more appropriate and descriptive errors.

>> + }
>> +
>> + kvm->io_device_count++;
>> + }
>> +
>> + /*
>> + * Note: We are committed to succeed at this point since we have
>> + * (potentially) published a new group-device. Any failure handling
>> + * added in the future after this point will need to be carefully
>> + * considered.
>> + */
>> +
>> + list_add_tail_rcu(&item->list, &group->items);
>> + group->count++;
>> +
>> + mutex_unlock(&kvm->lock);
>> +
>> + return 0;
>> +
>> +unlock_fail:
>> + mutex_unlock(&kvm->lock);
>> +fail:
>> + if (item)
>> + /*
>> + * it would have never made it to the group->items list
>> + * in the failure path, so we dont need to worry about removing
>> + * it
>> + */
>> + kfree(item);
>> +
>> + fput(file);
>> +
>> + return ret;
>> +}
>> +
>> +
>> +static int
>> +kvm_deassign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>> +{
>> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
>> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
>> + struct _iosignalfd_group *group;
>> + struct _iosignalfd_item *item, *tmp;
>> + struct file *file;
>> + int ret = 0;
>> +
>> + file = eventfd_fget(args->fd);
>> + if (IS_ERR(file))
>> + return PTR_ERR(file);
>> +
>> + mutex_lock(&kvm->lock);
>> +
>> + group = iosignalfd_group_find(kvm, args->addr);
>> + if (!group) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> +
>> + /*
>> + * Exhaustively search our group->items list for any items that might
>> + * match the specified fd, and (carefully) remove each one found.
>> + */
>> + list_for_each_entry_safe(item, tmp, &group->items, list) {
>> +
>> + if (item->file != file)
>> + continue;
>> +
>> + list_del_rcu(&item->list);
>> + group->count--;
>> + if (group->count)
>> + /*
>> + * We only decrement the global count if this is *not*
>> + * the last item. The last item will be accounted for
>> + * by the io_bus_unregister
>> + */
>> + kvm->io_device_count--;
>> +
>> + /*
>> + * The item may be still referenced inside our group->write()
>> + * path's RCU read-side CS, so defer the actual free to the
>> + * next grace
>> + */
>> + call_rcu(&item->rcu, iosignalfd_item_deferred_free);
>> + }
>> +
>> + /*
>> + * Check if the group is now completely vacated as a result of
>> + * removing the items. If so, unregister/delete it
>> + */
>> + if (!group->count) {
>> +
>> + kvm_io_bus_unregister_dev(kvm, bus, &group->dev);
>> +
>> + /*
>> + * Like the item, the group may also still be referenced as
>> + * per above. However, the kvm->iosignalfds list is not
>> + * RCU protected (its protected by kvm->lock instead) so
>> + * we can just plain-vanilla remove it. What needs to be
>> + * done carefully is the actual freeing of the group pointer
>> + * since we walk the group->items list within the RCU CS.
>> + */
>> + list_del(&group->list);
>> + call_rcu(&group->rcu, iosignalfd_group_deferred_free);
>>
>
> This is a deferred call, is it not, with no guarantee on when it will
> run? If correct I think synchronize_rcu might be better here:
> - can the module go away while iosignalfd_group_deferred_free is
> running?
>

Good catch. Once I go this route it will be easy to use SRCU instead of
RCU, too. So I will fix this up.


> - can eventfd be signalled *after* ioctl exits? If yes
> this might confuse applications if they use the eventfd
> for something else.
>

Not by iosignalfd. Once this function completes, we synchronously
guarantee that no more IO activity will generate an event on the
affected eventfds. Of course, this has no bearing on whether some other
producer won't signal, but that is beyond the scope of iosignalfd.
>
>> + }
>> +
>> +out:
>> + mutex_unlock(&kvm->lock);
>> +
>> + fput(file);
>> +
>> + return ret;
>> +}
>> +
>> +int
>> +kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>> +{
>> + if (args->flags & KVM_IOSIGNALFD_FLAG_DEASSIGN)
>> + return kvm_deassign_iosignalfd(kvm, args);
>> +
>> + return kvm_assign_iosignalfd(kvm, args);
>> +}
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 42cbea7..e6495d4 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -971,7 +971,7 @@ static struct kvm *kvm_create_vm(void)
>> atomic_inc(&kvm->mm->mm_count);
>> spin_lock_init(&kvm->mmu_lock);
>> kvm_io_bus_init(&kvm->pio_bus);
>> - kvm_irqfd_init(kvm);
>> + kvm_eventfd_init(kvm);
>> mutex_init(&kvm->lock);
>> mutex_init(&kvm->irq_lock);
>> kvm_io_bus_init(&kvm->mmio_bus);
>> @@ -2227,6 +2227,15 @@ static long kvm_vm_ioctl(struct file *filp,
>> r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
>> break;
>> }
>> + case KVM_IOSIGNALFD: {
>> + struct kvm_iosignalfd data;
>> +
>> + r = -EFAULT;
>> + if (copy_from_user(&data, argp, sizeof data))
>> + goto out;
>> + r = kvm_iosignalfd(kvm, &data);
>> + break;
>> + }
>> #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>> case KVM_SET_BOOT_CPU_ID:
>> r = 0;
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>

Thanks Michael,
-Greg




2009-06-22 12:27:19

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support


>>> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
>>> + * aggregates one or more iosignalfd_items. Each item points to exactly one
>>> + * eventfd, and can be registered to trigger on any write to the group
>>> + * (wildcard), or to a write of a specific value. If more than one item is to
>>> + * be supported, the addr/len ranges must all be identical in the group. If a
>>> + * trigger value is to be supported on a particular item, the group range must
>>> + * be exactly the width of the trigger.
>>>
>> Some duplicate spaces in the text above, apparently at random places.
>
> -ENOPARSE ;)
>
> Can you elaborate?

I see "aggregates one". The others are all at end of sentence, so I
think that Michael was not talking about those (git grep '\*.*\. ').

Paolo

2009-06-22 12:31:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On Mon, Jun 22, 2009 at 08:13:48AM -0400, Gregory Haskins wrote:
> >> + * notification when the memory has been touched.
> >> + * --------------------------------------------------------------------
> >> + */
> >> +
> >> +/*
> >> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
> >> + * aggregates one or more iosignalfd_items. Each item points to exactly one
^^ ^^
> >> + * eventfd, and can be registered to trigger on any write to the group
> >> + * (wildcard), or to a write of a specific value. If more than one item is to
^^
> >> + * be supported, the addr/len ranges must all be identical in the group. If a
^^
> >> + * trigger value is to be supported on a particular item, the group range must
> >> + * be exactly the width of the trigger.
> >>
> >
> > Some duplicate spaces in the text above, apparently at random places.
> >
> >
>
> -ENOPARSE ;)
>
> Can you elaborate?


Marked with ^^

>
> >> + */
> >> +
> >> +struct _iosignalfd_item {
> >> + struct list_head list;
> >> + struct file *file;
> >> + u64 match;
> >> + struct rcu_head rcu;
> >> + int wildcard:1;
> >> +};
> >> +
> >> +struct _iosignalfd_group {
> >> + struct list_head list;
> >> + u64 addr;
> >> + size_t length;
> >> + size_t count;
> >> + struct list_head items;
> >> + struct kvm_io_device dev;
> >> + struct rcu_head rcu;
> >> +};
> >> +
> >> +static inline struct _iosignalfd_group *
> >> +to_group(struct kvm_io_device *dev)
> >> +{
> >> + return container_of(dev, struct _iosignalfd_group, dev);
> >> +}
> >> +
> >> +static void
> >> +iosignalfd_item_free(struct _iosignalfd_item *item)
> >> +{
> >> + fput(item->file);
> >> + kfree(item);
> >> +}
> >> +
> >> +static void
> >> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
> >> +{
> >> + struct _iosignalfd_item *item;
> >> +
> >> + item = container_of(rhp, struct _iosignalfd_item, rcu);
> >> +
> >> + iosignalfd_item_free(item);
> >> +}
> >> +
> >> +static void
> >> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
> >> +{
> >> + struct _iosignalfd_group *group;
> >> +
> >> + group = container_of(rhp, struct _iosignalfd_group, rcu);
> >> +
> >> + kfree(group);
> >> +}
> >> +
> >> +static int
> >> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
> >> + int is_write)
> >> +{
> >> + struct _iosignalfd_group *p = to_group(this);
> >> +
> >> + return ((addr >= p->addr && (addr < p->addr + p->length)));
> >> +}
> >>
> >
> > What does this test? len is ignored ...
> >
> >
> Yeah, I was following precedent with other IO devices. However, this
> *is* sloppy, I agree. Will fix.
>
> >> +
> >> +static int
> >>
> >
> > This seems to be returning bool ...
> >
>
> Ack
> >
> >> +iosignalfd_is_match(struct _iosignalfd_group *group,
> >> + struct _iosignalfd_item *item,
> >> + const void *val,
> >> + int len)
> >> +{
> >> + u64 _val;
> >> +
> >> + if (len != group->length)
> >> + /* mis-matched length is always a miss */
> >> + return false;
> >>
> >
> > Why is that? what if there's 8 byte write which covers
> > a 4 byte group?
> >
>
> v7 and earlier used to allow that for wildcards, actually. It of
> course would never make sense to allow mis-matched writes for
> non-wildcards, since the idea is to match the value exactly. However,
> the feedback I got from Avi was that we should make the wildcard vs
> non-wildcard access symmetrical and ensure they both conform to the size.
> >
> >> +
> >> + if (item->wildcard)
> >> + /* wildcard is always a hit */
> >> + return true;
> >> +
> >> + /* otherwise, we have to actually compare the data */
> >> +
> >> + if (!IS_ALIGNED((unsigned long)val, len))
> >> + /* protect against this request causing a SIGBUS */
> >> + return false;
> >>
> >
> > Could you explain what this does please?
> >
> Sure: item->match is a fixed u64 to represent all group->length
> values.  So it might have a 1, 2, 4, or 8 byte value in it.  When a
> write arrives, we need to cast the data-register (in this case
> represented by (void*)val) into a u64 so the equality check (see [A],
> below) can be done.  However, you can't dereference an unaligned
> pointer, or it will SIGBUS on many (most?) architectures.

I mean guest access. Does it have to be aligned?
You could memcpy the value...

> > I thought misaligned accesses are allowed.
> >
> If that's true, we are in trouble ;)

I think it works at least on x86:
http://en.wikipedia.org/wiki/Packed#x86_and_x86-64

> >
> >> +
> >> + switch (len) {
> >> + case 1:
> >> + _val = *(u8 *)val;
> >> + break;
> >> + case 2:
> >> + _val = *(u16 *)val;
> >> + break;
> >> + case 4:
> >> + _val = *(u32 *)val;
> >> + break;
> >> + case 8:
> >> + _val = *(u64 *)val;
> >> + break;
> >> + default:
> >> + return false;
> >> + }
> >>
> >
> > So legal values for len are 1,2,4 and 8?
> > Might be a good idea to document this.
> >
> >
> Ack
>
> >> +
> >> + return _val == item->match;
> >>
>
> [A]
>
> >> +}
> >> +
> >> +/*
> >> + * MMIO/PIO writes trigger an event (if the data matches).
> >> + *
> >> + * This is invoked by the io_bus subsystem in response to an address match
> >> + * against the group. We must then walk the list of individual items to check
> >> + * for a match and, if applicable, to send the appropriate signal. If the item
> >> + * in question does not have a "match" pointer, it is considered a wildcard
> >> + * and will always generate a signal. There can be an arbitrary number
> >> + * of distinct matches or wildcards per group.
> >> + */
> >> +static void
> >> +iosignalfd_group_write(struct kvm_io_device *this, gpa_t addr, int len,
> >> + const void *val)
> >> +{
> >> + struct _iosignalfd_group *group = to_group(this);
> >> + struct _iosignalfd_item *item;
> >> +
> >> + rcu_read_lock();
> >> +
> >> + list_for_each_entry_rcu(item, &group->items, list) {
> >> + if (iosignalfd_is_match(group, item, val, len))
> >> + eventfd_signal(item->file, 1);
> >> + }
> >> +
> >> + rcu_read_unlock();
> >> +}
> >> +
> >> +/*
> >> + * MMIO/PIO reads against the group indiscriminately return all zeros
> >> + */
> >>
> >
> > Does it have to be so? It would be better to bounce reads to
> > userspace...
> >
> >
> Good idea. I can set is_write = false and I should never get this
> function called.
>
> >> +static void
> >> +iosignalfd_group_read(struct kvm_io_device *this, gpa_t addr, int len,
> >> + void *val)
> >> +{
> >> + memset(val, 0, len);
> >> +}
> >> +
> >> +/*
> >> + * This function is called as KVM is completely shutting down. We do not
> >> + * need to worry about locking or careful RCU dancing...just nuke anything
> >> + * we have as quickly as possible
> >> + */
> >> +static void
> >> +iosignalfd_group_destructor(struct kvm_io_device *this)
> >> +{
> >> + struct _iosignalfd_group *group = to_group(this);
> >> + struct _iosignalfd_item *item, *tmp;
> >> +
> >> + list_for_each_entry_safe(item, tmp, &group->items, list) {
> >> + list_del(&item->list);
> >> + iosignalfd_item_free(item);
> >> + }
> >> +
> >> + list_del(&group->list);
> >> + kfree(group);
> >> +}
> >> +
> >> +static const struct kvm_io_device_ops iosignalfd_ops = {
> >> + .read = iosignalfd_group_read,
> >> + .write = iosignalfd_group_write,
> >> + .in_range = iosignalfd_group_in_range,
> >> + .destructor = iosignalfd_group_destructor,
> >> +};
> >> +
> >> +/* assumes kvm->lock held */
> >> +static struct _iosignalfd_group *
> >> +iosignalfd_group_find(struct kvm *kvm, u64 addr)
> >> +{
> >> + struct _iosignalfd_group *group;
> >> +
> >> + list_for_each_entry(group, &kvm->iosignalfds, list) {
> >>
> >
> > {} not needed here
> >
>
> Ack
> >
> >> + if (group->addr == addr)
> >> + return group;
> >> + }
> >> +
> >> + return NULL;
> >> +}
> >> +
> >> +/* assumes kvm->lock is held */
> >> +static struct _iosignalfd_group *
> >> +iosignalfd_group_create(struct kvm *kvm, struct kvm_io_bus *bus,
> >> + u64 addr, size_t len)
> >> +{
> >> + struct _iosignalfd_group *group;
> >> + int ret;
> >> +
> >> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> >> + if (!group)
> >> + return ERR_PTR(-ENOMEM);
> >> +
> >> + INIT_LIST_HEAD(&group->list);
> >> + INIT_LIST_HEAD(&group->items);
> >> + group->addr = addr;
> >> + group->length = len;
> >> + kvm_iodevice_init(&group->dev, &iosignalfd_ops);
> >> +
> >> + ret = kvm_io_bus_register_dev(kvm, bus, &group->dev);
> >> + if (ret < 0) {
> >> + kfree(group);
> >> + return ERR_PTR(ret);
> >> + }
> >> +
> >> + list_add_tail(&group->list, &kvm->iosignalfds);
> >> +
> >> + return group;
> >> +}
> >> +
> >> +static int
> >> +kvm_assign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> >> +{
> >> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
> >> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
> >> + struct _iosignalfd_group *group = NULL;
> >>
> >
> > why does group need to be initialized?
> >
> >
> >> + struct _iosignalfd_item *item = NULL;
> >>
> >
> > Why does item need to be initialized?
> >
> >
>
> Probably leftover from versions prior to v8. Will fix.
>
> >> + struct file *file;
> >> + int ret;
> >> +
> >> + if (args->len > sizeof(u64))
> >>
> >
> > Is e.g. value 3 legal?
> >
>
> Ack. Will check against legal values.
>
> >
> >> + return -EINVAL;
> >>
> >
> >
> >> +
> >> + file = eventfd_fget(args->fd);
> >> + if (IS_ERR(file))
> >> + return PTR_ERR(file);
> >> +
> >> + item = kzalloc(sizeof(*item), GFP_KERNEL);
> >> + if (!item) {
> >> + ret = -ENOMEM;
> >> + goto fail;
> >> + }
> >> +
> >> + INIT_LIST_HEAD(&item->list);
> >> + item->file = file;
> >> +
> >> + /*
> >> + * A trigger address is optional, otherwise this is a wildcard
> >> + */
> >> + if (args->flags & KVM_IOSIGNALFD_FLAG_TRIGGER)
> >> + item->match = args->trigger;
> >> + else
> >> + item->wildcard = true;
> >> +
> >> + mutex_lock(&kvm->lock);
> >> +
> >> + /*
> >> + * Put an upper limit on the number of items we support
> >> + */
> >>
> >
> > Groups and items, actually, right?
> >
> >
>
> Yeah, though technically that is implicit when you say "items", since
> each group always has at least one item. I will try to make this
> clearer, though.
>
> >> + if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES) {
> >> + ret = -ENOSPC;
> >> + goto unlock_fail;
> >> + }
> >> +
> >> + group = iosignalfd_group_find(kvm, args->addr);
> >> + if (!group) {
> >> +
> >> + group = iosignalfd_group_create(kvm, bus,
> >> + args->addr, args->len);
> >> + if (IS_ERR(group)) {
> >> + ret = PTR_ERR(group);
> >> + goto unlock_fail;
> >> + }
> >> +
> >> + /*
> >> + * Note: We do not increment io_device_count for the first item,
> >> + * as this is represented by the group device that we just
> >> + * registered. Make sure we handle this properly when we
> >> + * deassign the last item
> >> + */
> >> + } else {
> >> +
> >> + if (group->length != args->len) {
> >> + /*
> >> + * Existing groups must have the same addr/len tuple
> >> + * or we reject the request
> >> + */
> >> + ret = -EINVAL;
> >> + goto unlock_fail;
> >>
> >
> > Most errors seem to trigger EINVAL. Applications will be
> > easier to debug if different errors are returned on
> > different mistakes.
>
> Yeah, agreed. Will try to differentiate some errors here.
>
> > E.g. here EBUSY might be good. And same
> > in other places.
> >
> >
>
> Actually, I think EBUSY is supposed to be a transitory error, and would
> not be appropriate to use here.  That said, your point is taken: I will
> find more appropriate and descriptive errors.
>
> >> + }
> >> +
> >> + kvm->io_device_count++;
> >> + }
> >> +
> >> + /*
> >> + * Note: We are committed to succeed at this point since we have
> >> + * (potentially) published a new group-device. Any failure handling
> >> + * added in the future after this point will need to be carefully
> >> + * considered.
> >> + */
> >> +
> >> + list_add_tail_rcu(&item->list, &group->items);
> >> + group->count++;
> >> +
> >> + mutex_unlock(&kvm->lock);
> >> +
> >> + return 0;
> >> +
> >> +unlock_fail:
> >> + mutex_unlock(&kvm->lock);
> >> +fail:
> >> + if (item)
> >> + /*
> >> + * it would have never made it to the group->items list
> >> + * in the failure path, so we dont need to worry about removing
> >> + * it
> >> + */
> >> + kfree(item);
> >> +
> >> + fput(file);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +
> >> +static int
> >> +kvm_deassign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> >> +{
> >> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
> >> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
> >> + struct _iosignalfd_group *group;
> >> + struct _iosignalfd_item *item, *tmp;
> >> + struct file *file;
> >> + int ret = 0;
> >> +
> >> + file = eventfd_fget(args->fd);
> >> + if (IS_ERR(file))
> >> + return PTR_ERR(file);
> >> +
> >> + mutex_lock(&kvm->lock);
> >> +
> >> + group = iosignalfd_group_find(kvm, args->addr);
> >> + if (!group) {
> >> + ret = -EINVAL;
> >> + goto out;
> >> + }
> >> +
> >> + /*
> >> + * Exhaustively search our group->items list for any items that might
> >> + * match the specified fd, and (carefully) remove each one found.
> >> + */
> >> + list_for_each_entry_safe(item, tmp, &group->items, list) {
> >> +
> >> + if (item->file != file)
> >> + continue;
> >> +
> >> + list_del_rcu(&item->list);
> >> + group->count--;
> >> + if (group->count)
> >> + /*
> >> + * We only decrement the global count if this is *not*
> >> + * the last item. The last item will be accounted for
> >> + * by the io_bus_unregister
> >> + */
> >> + kvm->io_device_count--;
> >> +
> >> + /*
> >> + * The item may be still referenced inside our group->write()
> >> + * path's RCU read-side CS, so defer the actual free to the
> >> + * next grace
> >> + */
> >> + call_rcu(&item->rcu, iosignalfd_item_deferred_free);
> >> + }
> >> +
> >> + /*
> >> + * Check if the group is now completely vacated as a result of
> >> + * removing the items. If so, unregister/delete it
> >> + */
> >> + if (!group->count) {
> >> +
> >> + kvm_io_bus_unregister_dev(kvm, bus, &group->dev);
> >> +
> >> + /*
> >> + * Like the item, the group may also still be referenced as
> >> + * per above. However, the kvm->iosignalfds list is not
> >> + * RCU protected (its protected by kvm->lock instead) so
> >> + * we can just plain-vanilla remove it. What needs to be
> >> + * done carefully is the actual freeing of the group pointer
> >> + * since we walk the group->items list within the RCU CS.
> >> + */
> >> + list_del(&group->list);
> >> + call_rcu(&group->rcu, iosignalfd_group_deferred_free);
> >>
> >
> > This is a deferred call, is it not, with no guarantee on when it will
> > run? If correct I think synchronize_rcu might be better here:
> > - can the module go away while iosignalfd_group_deferred_free is
> > running?
> >
>
> Good catch. Once I go this route it will be easy to use SRCU instead of
> RCU, too. So I will fix this up.
>
>
> > - can eventfd be signalled *after* ioctl exits? If yes
> > this might confuse applications if they use the eventfd
> > for something else.
> >
>
> Not by iosignalfd. Once this function completes, we synchronously
> guarantee that no more IO activity will generate an event on the
> affected eventfds. Of course, this has no bearing on whether some other
> producer won't signal, but that is beyond the scope of iosignalfd.
> >
> >> + }
> >> +
> >> +out:
> >> + mutex_unlock(&kvm->lock);
> >> +
> >> + fput(file);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +int
> >> +kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> >> +{
> >> + if (args->flags & KVM_IOSIGNALFD_FLAG_DEASSIGN)
> >> + return kvm_deassign_iosignalfd(kvm, args);
> >> +
> >> + return kvm_assign_iosignalfd(kvm, args);
> >> +}
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 42cbea7..e6495d4 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -971,7 +971,7 @@ static struct kvm *kvm_create_vm(void)
> >> atomic_inc(&kvm->mm->mm_count);
> >> spin_lock_init(&kvm->mmu_lock);
> >> kvm_io_bus_init(&kvm->pio_bus);
> >> - kvm_irqfd_init(kvm);
> >> + kvm_eventfd_init(kvm);
> >> mutex_init(&kvm->lock);
> >> mutex_init(&kvm->irq_lock);
> >> kvm_io_bus_init(&kvm->mmio_bus);
> >> @@ -2227,6 +2227,15 @@ static long kvm_vm_ioctl(struct file *filp,
> >> r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
> >> break;
> >> }
> >> + case KVM_IOSIGNALFD: {
> >> + struct kvm_iosignalfd data;
> >> +
> >> + r = -EFAULT;
> >> + if (copy_from_user(&data, argp, sizeof data))
> >> + goto out;
> >> + r = kvm_iosignalfd(kvm, &data);
> >> + break;
> >> + }
> >> #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> >> case KVM_SET_BOOT_CPU_ID:
> >> r = 0;
> >>
> >>
>
> Thanks Michael,
> -Greg
>
>

2009-06-22 12:56:48

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 08:13:48AM -0400, Gregory Haskins wrote:
>
>>>> + * notification when the memory has been touched.
>>>> + * --------------------------------------------------------------------
>>>> + */
>>>> +
>>>> +/*
>>>> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
>>>> + * aggregates one or more iosignalfd_items. Each item points to exactly one
>>>>
> ^^ ^^
>
>>>> + * eventfd, and can be registered to trigger on any write to the group
>>>> + * (wildcard), or to a write of a specific value. If more than one item is to
>>>>
> ^^
>
>>>> + * be supported, the addr/len ranges must all be identical in the group. If a
>>>>
> ^^
>
>>>> + * trigger value is to be supported on a particular item, the group range must
>>>> + * be exactly the width of the trigger.
>>>>
>>>>
>>> Some duplicate spaces in the text above, apparently at random places.
>>>
>>>
>>>
>> -ENOPARSE ;)
>>
>> Can you elaborate?
>>
>
>
> Marked with ^^
>
Heh...well, the first one ("aggregates one") is just a plain typo.
The others are just me showing my age, perhaps:

http://desktoppub.about.com/cs/typespacing/a/onetwospaces.htm

Whether right or wrong, I think I use two-spaces-after-a-period
everywhere. I can fix these if they bother you, but I suspect just
about every comment I've written has them too. ;)

-Greg


>
>>>> + */
>>>> +
>>>> +struct _iosignalfd_item {
>>>> + struct list_head list;
>>>> + struct file *file;
>>>> + u64 match;
>>>> + struct rcu_head rcu;
>>>> + int wildcard:1;
>>>> +};
>>>> +
>>>> +struct _iosignalfd_group {
>>>> + struct list_head list;
>>>> + u64 addr;
>>>> + size_t length;
>>>> + size_t count;
>>>> + struct list_head items;
>>>> + struct kvm_io_device dev;
>>>> + struct rcu_head rcu;
>>>> +};
>>>> +
>>>> +static inline struct _iosignalfd_group *
>>>> +to_group(struct kvm_io_device *dev)
>>>> +{
>>>> + return container_of(dev, struct _iosignalfd_group, dev);
>>>> +}
>>>> +
>>>> +static void
>>>> +iosignalfd_item_free(struct _iosignalfd_item *item)
>>>> +{
>>>> + fput(item->file);
>>>> + kfree(item);
>>>> +}
>>>> +
>>>> +static void
>>>> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
>>>> +{
>>>> + struct _iosignalfd_item *item;
>>>> +
>>>> + item = container_of(rhp, struct _iosignalfd_item, rcu);
>>>> +
>>>> + iosignalfd_item_free(item);
>>>> +}
>>>> +
>>>> +static void
>>>> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
>>>> +{
>>>> + struct _iosignalfd_group *group;
>>>> +
>>>> + group = container_of(rhp, struct _iosignalfd_group, rcu);
>>>> +
>>>> + kfree(group);
>>>> +}
>>>> +
>>>> +static int
>>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>>>> + int is_write)
>>>> +{
>>>> + struct _iosignalfd_group *p = to_group(this);
>>>> +
>>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>>>> +}
>>>>
>>>>
>>> What does this test? len is ignored ...
>>>
>>>
>>>
>> Yeah, I was following precedent with other IO devices. However, this
>> *is* sloppy, I agree. Will fix.
>>
>>
>>>> +
>>>> +static int
>>>>
>>>>
>>> This seems to be returning bool ...
>>>
>>>
>> Ack
>>
>>>
>>>
>>>> +iosignalfd_is_match(struct _iosignalfd_group *group,
>>>> + struct _iosignalfd_item *item,
>>>> + const void *val,
>>>> + int len)
>>>> +{
>>>> + u64 _val;
>>>> +
>>>> + if (len != group->length)
>>>> + /* mis-matched length is always a miss */
>>>> + return false;
>>>>
>>>>
>>> Why is that? what if there's 8 byte write which covers
>>> a 4 byte group?
>>>
>>>
>> v7 and earlier used to allow that for wildcards, actually. It of
>> course would never make sense to allow mis-matched writes for
>> non-wildcards, since the idea is to match the value exactly. However,
>> the feedback I got from Avi was that we should make the wildcard vs
>> non-wildcard access symmetrical and ensure they both conform to the size.
>>
>>>
>>>
>>>> +
>>>> + if (item->wildcard)
>>>> + /* wildcard is always a hit */
>>>> + return true;
>>>> +
>>>> + /* otherwise, we have to actually compare the data */
>>>> +
>>>> + if (!IS_ALIGNED((unsigned long)val, len))
>>>> + /* protect against this request causing a SIGBUS */
>>>> + return false;
>>>>
>>>>
>>> Could you explain what this does please?
>>>
>>>
>> Sure: item->match is a fixed u64 to represent all group->length
>> values. So it might have a 1, 2, 4, or 8 byte value in it. When a
>> write arrives, we need to cast the data-register (in this case
>> represented by (void*)val) into a u64 so the equality check (see [A],
>> below) can be done. However, you can't cast an unaligned pointer, or it
>> will SIGBUS on many (most?) architectures.
>>
>
> I mean guest access. Does it have to be aligned?
> You could memcpy the value...
>
>
>>> I thought misaligned accesses are allowed.
>>>
>>>
>> If that's true, we are in trouble ;)
>>
>
> I think it works at least on x86:
> http://en.wikipedia.org/wiki/Packed#x86_and_x86-64
>
>
>>>
>>>
>>>> +
>>>> + switch (len) {
>>>> + case 1:
>>>> + _val = *(u8 *)val;
>>>> + break;
>>>> + case 2:
>>>> + _val = *(u16 *)val;
>>>> + break;
>>>> + case 4:
>>>> + _val = *(u32 *)val;
>>>> + break;
>>>> + case 8:
>>>> + _val = *(u64 *)val;
>>>> + break;
>>>> + default:
>>>> + return false;
>>>> + }
>>>>
>>>>
>>> So legal values for len are 1,2,4 and 8?
>>> Might be a good idea to document this.
>>>
>>>
>>>
>> Ack
>>
>>
>>>> +
>>>> + return _val == item->match;
>>>>
>>>>
>> [A]
>>
>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * MMIO/PIO writes trigger an event (if the data matches).
>>>> + *
>>>> + * This is invoked by the io_bus subsystem in response to an address match
>>>> + * against the group. We must then walk the list of individual items to check
>>>> + * for a match and, if applicable, to send the appropriate signal. If the item
>>>> + * in question does not have a "match" pointer, it is considered a wildcard
>>>> + * and will always generate a signal. There can be an arbitrary number
>>>> + * of distinct matches or wildcards per group.
>>>> + */
>>>> +static void
>>>> +iosignalfd_group_write(struct kvm_io_device *this, gpa_t addr, int len,
>>>> + const void *val)
>>>> +{
>>>> + struct _iosignalfd_group *group = to_group(this);
>>>> + struct _iosignalfd_item *item;
>>>> +
>>>> + rcu_read_lock();
>>>> +
>>>> + list_for_each_entry_rcu(item, &group->items, list) {
>>>> + if (iosignalfd_is_match(group, item, val, len))
>>>> + eventfd_signal(item->file, 1);
>>>> + }
>>>> +
>>>> + rcu_read_unlock();
>>>> +}
>>>> +
>>>> +/*
>>>> + * MMIO/PIO reads against the group indiscriminately return all zeros
>>>> + */
>>>>
>>>>
>>> Does it have to be so? It would be better to bounce reads to
>>> userspace...
>>>
>>>
>>>
>> Good idea. I can set is_write = false and I should never get this
>> function called.
>>
>>
>>>> +static void
>>>> +iosignalfd_group_read(struct kvm_io_device *this, gpa_t addr, int len,
>>>> + void *val)
>>>> +{
>>>> + memset(val, 0, len);
>>>> +}
>>>> +
>>>> +/*
>>>> + * This function is called as KVM is completely shutting down. We do not
>>>> + * need to worry about locking or careful RCU dancing...just nuke anything
>>>> + * we have as quickly as possible
>>>> + */
>>>> +static void
>>>> +iosignalfd_group_destructor(struct kvm_io_device *this)
>>>> +{
>>>> + struct _iosignalfd_group *group = to_group(this);
>>>> + struct _iosignalfd_item *item, *tmp;
>>>> +
>>>> + list_for_each_entry_safe(item, tmp, &group->items, list) {
>>>> + list_del(&item->list);
>>>> + iosignalfd_item_free(item);
>>>> + }
>>>> +
>>>> + list_del(&group->list);
>>>> + kfree(group);
>>>> +}
>>>> +
>>>> +static const struct kvm_io_device_ops iosignalfd_ops = {
>>>> + .read = iosignalfd_group_read,
>>>> + .write = iosignalfd_group_write,
>>>> + .in_range = iosignalfd_group_in_range,
>>>> + .destructor = iosignalfd_group_destructor,
>>>> +};
>>>> +
>>>> +/* assumes kvm->lock held */
>>>> +static struct _iosignalfd_group *
>>>> +iosignalfd_group_find(struct kvm *kvm, u64 addr)
>>>> +{
>>>> + struct _iosignalfd_group *group;
>>>> +
>>>> + list_for_each_entry(group, &kvm->iosignalfds, list) {
>>>>
>>>>
>>> {} not needed here
>>>
>>>
>> Ack
>>
>>>
>>>
>>>> + if (group->addr == addr)
>>>> + return group;
>>>> + }
>>>> +
>>>> + return NULL;
>>>> +}
>>>> +
>>>> +/* assumes kvm->lock is held */
>>>> +static struct _iosignalfd_group *
>>>> +iosignalfd_group_create(struct kvm *kvm, struct kvm_io_bus *bus,
>>>> + u64 addr, size_t len)
>>>> +{
>>>> + struct _iosignalfd_group *group;
>>>> + int ret;
>>>> +
>>>> + group = kzalloc(sizeof(*group), GFP_KERNEL);
>>>> + if (!group)
>>>> + return ERR_PTR(-ENOMEM);
>>>> +
>>>> + INIT_LIST_HEAD(&group->list);
>>>> + INIT_LIST_HEAD(&group->items);
>>>> + group->addr = addr;
>>>> + group->length = len;
>>>> + kvm_iodevice_init(&group->dev, &iosignalfd_ops);
>>>> +
>>>> + ret = kvm_io_bus_register_dev(kvm, bus, &group->dev);
>>>> + if (ret < 0) {
>>>> + kfree(group);
>>>> + return ERR_PTR(ret);
>>>> + }
>>>> +
>>>> + list_add_tail(&group->list, &kvm->iosignalfds);
>>>> +
>>>> + return group;
>>>> +}
>>>> +
>>>> +static int
>>>> +kvm_assign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>>>> +{
>>>> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
>>>> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
>>>> + struct _iosignalfd_group *group = NULL;
>>>>
>>>>
>>> why does group need to be initialized?
>>>
>>>
>>>
>>>> + struct _iosignalfd_item *item = NULL;
>>>>
>>>>
>>> Why does item need to be initialized?
>>>
>>>
>>>
>> Probably leftover from versions prior to v8. Will fix.
>>
>>
>>>> + struct file *file;
>>>> + int ret;
>>>> +
>>>> + if (args->len > sizeof(u64))
>>>>
>>>>
>>> Is e.g. value 3 legal?
>>>
>>>
>> Ack. Will check against legal values.
>>
>>
>>>
>>>
>>>> + return -EINVAL;
>>>>
>>>>
>>>
>>>
>>>> +
>>>> + file = eventfd_fget(args->fd);
>>>> + if (IS_ERR(file))
>>>> + return PTR_ERR(file);
>>>> +
>>>> + item = kzalloc(sizeof(*item), GFP_KERNEL);
>>>> + if (!item) {
>>>> + ret = -ENOMEM;
>>>> + goto fail;
>>>> + }
>>>> +
>>>> + INIT_LIST_HEAD(&item->list);
>>>> + item->file = file;
>>>> +
>>>> + /*
>>>> + * A trigger address is optional, otherwise this is a wildcard
>>>> + */
>>>> + if (args->flags & KVM_IOSIGNALFD_FLAG_TRIGGER)
>>>> + item->match = args->trigger;
>>>> + else
>>>> + item->wildcard = true;
>>>> +
>>>> + mutex_lock(&kvm->lock);
>>>> +
>>>> + /*
>>>> + * Put an upper limit on the number of items we support
>>>> + */
>>>>
>>>>
>>> Groups and items, actually, right?
>>>
>>>
>>>
>> Yeah, though technically that is implicit when you say "items", since
>> each group always has at least one item. I will try to make this
>> clearer, though.
>>
>>
>>>> + if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES) {
>>>> + ret = -ENOSPC;
>>>> + goto unlock_fail;
>>>> + }
>>>> +
>>>> + group = iosignalfd_group_find(kvm, args->addr);
>>>> + if (!group) {
>>>> +
>>>> + group = iosignalfd_group_create(kvm, bus,
>>>> + args->addr, args->len);
>>>> + if (IS_ERR(group)) {
>>>> + ret = PTR_ERR(group);
>>>> + goto unlock_fail;
>>>> + }
>>>> +
>>>> + /*
>>>> + * Note: We do not increment io_device_count for the first item,
>>>> + * as this is represented by the group device that we just
>>>> + * registered. Make sure we handle this properly when we
>>>> + * deassign the last item
>>>> + */
>>>> + } else {
>>>> +
>>>> + if (group->length != args->len) {
>>>> + /*
>>>> + * Existing groups must have the same addr/len tuple
>>>> + * or we reject the request
>>>> + */
>>>> + ret = -EINVAL;
>>>> + goto unlock_fail;
>>>>
>>>>
>>> Most errors seem to trigger EINVAL. Applications will be
>>> easier to debug if different errors are returned on
>>> different mistakes.
>>>
>> Yeah, agreed. Will try to differentiate some errors here.
>>
>>
>>> E.g. here EBUSY might be good. And same
>>> in other places.
>>>
>>>
>>>
>> Actually, I think EBUSY is supposed to be a transitory error, and would
>> not be appropriate to use here. That said, your point is taken: Find
>> more appropriate and descriptive errors.
>>
>>
>>>> + }
>>>> +
>>>> + kvm->io_device_count++;
>>>> + }
>>>> +
>>>> + /*
>>>> + * Note: We are committed to succeed at this point since we have
>>>> + * (potentially) published a new group-device. Any failure handling
>>>> + * added in the future after this point will need to be carefully
>>>> + * considered.
>>>> + */
>>>> +
>>>> + list_add_tail_rcu(&item->list, &group->items);
>>>> + group->count++;
>>>> +
>>>> + mutex_unlock(&kvm->lock);
>>>> +
>>>> + return 0;
>>>> +
>>>> +unlock_fail:
>>>> + mutex_unlock(&kvm->lock);
>>>> +fail:
>>>> + if (item)
>>>> + /*
>>>> + * it would have never made it to the group->items list
>>>> + * in the failure path, so we dont need to worry about removing
>>>> + * it
>>>> + */
>>>> + kfree(item);
>>>> +
>>>> + fput(file);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +
>>>> +static int
>>>> +kvm_deassign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>>>> +{
>>>> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
>>>> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
>>>> + struct _iosignalfd_group *group;
>>>> + struct _iosignalfd_item *item, *tmp;
>>>> + struct file *file;
>>>> + int ret = 0;
>>>> +
>>>> + file = eventfd_fget(args->fd);
>>>> + if (IS_ERR(file))
>>>> + return PTR_ERR(file);
>>>> +
>>>> + mutex_lock(&kvm->lock);
>>>> +
>>>> + group = iosignalfd_group_find(kvm, args->addr);
>>>> + if (!group) {
>>>> + ret = -EINVAL;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + /*
>>>> + * Exhaustively search our group->items list for any items that might
>>>> + * match the specified fd, and (carefully) remove each one found.
>>>> + */
>>>> + list_for_each_entry_safe(item, tmp, &group->items, list) {
>>>> +
>>>> + if (item->file != file)
>>>> + continue;
>>>> +
>>>> + list_del_rcu(&item->list);
>>>> + group->count--;
>>>> + if (group->count)
>>>> + /*
>>>> + * We only decrement the global count if this is *not*
>>>> + * the last item. The last item will be accounted for
>>>> + * by the io_bus_unregister
>>>> + */
>>>> + kvm->io_device_count--;
>>>> +
>>>> + /*
>>>> + * The item may be still referenced inside our group->write()
>>>> + * path's RCU read-side CS, so defer the actual free to the
>>>> + * next grace
>>>> + */
>>>> + call_rcu(&item->rcu, iosignalfd_item_deferred_free);
>>>> + }
>>>> +
>>>> + /*
>>>> + * Check if the group is now completely vacated as a result of
>>>> + * removing the items. If so, unregister/delete it
>>>> + */
>>>> + if (!group->count) {
>>>> +
>>>> + kvm_io_bus_unregister_dev(kvm, bus, &group->dev);
>>>> +
>>>> + /*
>>>> + * Like the item, the group may also still be referenced as
>>>> + * per above. However, the kvm->iosignalfds list is not
>>>> + * RCU protected (its protected by kvm->lock instead) so
>>>> + * we can just plain-vanilla remove it. What needs to be
>>>> + * done carefully is the actual freeing of the group pointer
>>>> + * since we walk the group->items list within the RCU CS.
>>>> + */
>>>> + list_del(&group->list);
>>>> + call_rcu(&group->rcu, iosignalfd_group_deferred_free);
>>>>
>>>>
>>> This is a deferred call, is it not, with no guarantee on when it will
>>> run? If correct I think synchronize_rcu might be better here:
>>> - can the module go away while iosignalfd_group_deferred_free is
>>> running?
>>>
>>>
>> Good catch. Once I go this route it will be easy to use SRCU instead of
>> RCU, too. So I will fix this up.
>>
>>
>>
>>> - can eventfd be signalled *after* ioctl exits? If yes
>>> this might confuse applications if they use the eventfd
>>> for something else.
>>>
>>>
>> Not by iosignalfd. Once this function completes, we synchronously
>> guarantee that no more IO activity will generate an event on the
>> affected eventfds. Of course, this has no bearing on whether some other
>> producer won't signal, but that is beyond the scope of iosignalfd.
>>
>>>
>>>
>>>> + }
>>>> +
>>>> +out:
>>>> + mutex_unlock(&kvm->lock);
>>>> +
>>>> + fput(file);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +int
>>>> +kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
>>>> +{
>>>> + if (args->flags & KVM_IOSIGNALFD_FLAG_DEASSIGN)
>>>> + return kvm_deassign_iosignalfd(kvm, args);
>>>> +
>>>> + return kvm_assign_iosignalfd(kvm, args);
>>>> +}
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 42cbea7..e6495d4 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -971,7 +971,7 @@ static struct kvm *kvm_create_vm(void)
>>>> atomic_inc(&kvm->mm->mm_count);
>>>> spin_lock_init(&kvm->mmu_lock);
>>>> kvm_io_bus_init(&kvm->pio_bus);
>>>> - kvm_irqfd_init(kvm);
>>>> + kvm_eventfd_init(kvm);
>>>> mutex_init(&kvm->lock);
>>>> mutex_init(&kvm->irq_lock);
>>>> kvm_io_bus_init(&kvm->mmio_bus);
>>>> @@ -2227,6 +2227,15 @@ static long kvm_vm_ioctl(struct file *filp,
>>>> r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
>>>> break;
>>>> }
>>>> + case KVM_IOSIGNALFD: {
>>>> + struct kvm_iosignalfd data;
>>>> +
>>>> + r = -EFAULT;
>>>> + if (copy_from_user(&data, argp, sizeof data))
>>>> + goto out;
>>>> + r = kvm_iosignalfd(kvm, &data);
>>>> + break;
>>>> + }
>>>> #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>>>> case KVM_SET_BOOT_CPU_ID:
>>>> r = 0;
>>>>
>>>>
>>>>
>> Thanks Michael,
>> -Greg
>>
>>
>>
>
>
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-22 13:04:59

by Gregory Haskins

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Sorry, Michael. I missed that you had other comments after the
grammatical one. Will answer inline.

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 08:13:48AM -0400, Gregory Haskins wrote:
>
>>>> + * notification when the memory has been touched.
>>>> + * --------------------------------------------------------------------
>>>> + */
>>>> +
>>>> +/*
>>>> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
>>>> + * aggregates one or more iosignalfd_items. Each item points to exactly one
>>>>
> ^^ ^^
>
>>>> + * eventfd, and can be registered to trigger on any write to the group
>>>> + * (wildcard), or to a write of a specific value. If more than one item is to
>>>>
> ^^
>
>>>> + * be supported, the addr/len ranges must all be identical in the group. If a
>>>>
> ^^
>
>>>> + * trigger value is to be supported on a particular item, the group range must
>>>> + * be exactly the width of the trigger.
>>>>
>>>>
>>> Some duplicate spaces in the text above, apparently at random places.
>>>
>>>
>>>
>> -ENOPARSE ;)
>>
>> Can you elaborate?
>>
>
>
> Marked with ^^
>
>
>>>> + */
>>>> +
>>>> +struct _iosignalfd_item {
>>>> + struct list_head list;
>>>> + struct file *file;
>>>> + u64 match;
>>>> + struct rcu_head rcu;
>>>> + int wildcard:1;
>>>> +};
>>>> +
>>>> +struct _iosignalfd_group {
>>>> + struct list_head list;
>>>> + u64 addr;
>>>> + size_t length;
>>>> + size_t count;
>>>> + struct list_head items;
>>>> + struct kvm_io_device dev;
>>>> + struct rcu_head rcu;
>>>> +};
>>>> +
>>>> +static inline struct _iosignalfd_group *
>>>> +to_group(struct kvm_io_device *dev)
>>>> +{
>>>> + return container_of(dev, struct _iosignalfd_group, dev);
>>>> +}
>>>> +
>>>> +static void
>>>> +iosignalfd_item_free(struct _iosignalfd_item *item)
>>>> +{
>>>> + fput(item->file);
>>>> + kfree(item);
>>>> +}
>>>> +
>>>> +static void
>>>> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
>>>> +{
>>>> + struct _iosignalfd_item *item;
>>>> +
>>>> + item = container_of(rhp, struct _iosignalfd_item, rcu);
>>>> +
>>>> + iosignalfd_item_free(item);
>>>> +}
>>>> +
>>>> +static void
>>>> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
>>>> +{
>>>> + struct _iosignalfd_group *group;
>>>> +
>>>> + group = container_of(rhp, struct _iosignalfd_group, rcu);
>>>> +
>>>> + kfree(group);
>>>> +}
>>>> +
>>>> +static int
>>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>>>> + int is_write)
>>>> +{
>>>> + struct _iosignalfd_group *p = to_group(this);
>>>> +
>>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>>>> +}
>>>>
>>>>
>>> What does this test? len is ignored ...
>>>
>>>
>>>
>> Yeah, I was following precedent with other IO devices. However, this
>> *is* sloppy, I agree. Will fix.
>>
>>
>>>> +
>>>> +static int
>>>>
>>>>
>>> This seems to be returning bool ...
>>>
>>>
>> Ack
>>
>>>
>>>
>>>> +iosignalfd_is_match(struct _iosignalfd_group *group,
>>>> + struct _iosignalfd_item *item,
>>>> + const void *val,
>>>> + int len)
>>>> +{
>>>> + u64 _val;
>>>> +
>>>> + if (len != group->length)
>>>> + /* mis-matched length is always a miss */
>>>> + return false;
>>>>
>>>>
>>> Why is that? what if there's 8 byte write which covers
>>> a 4 byte group?
>>>
>>>
>> v7 and earlier used to allow that for wildcards, actually. It of
>> course would never make sense to allow mis-matched writes for
>> non-wildcards, since the idea is to match the value exactly. However,
>> the feedback I got from Avi was that we should make the wildcard vs
>> non-wildcard access symmetrical and ensure they both conform to the size.
>>
>>>
>>>
>>>> +
>>>> + if (item->wildcard)
>>>> + /* wildcard is always a hit */
>>>> + return true;
>>>> +
>>>> + /* otherwise, we have to actually compare the data */
>>>> +
>>>> + if (!IS_ALIGNED((unsigned long)val, len))
>>>> + /* protect against this request causing a SIGBUS */
>>>> + return false;
>>>>
>>>>
>>> Could you explain what this does please?
>>>
>>>
>> Sure: item->match is a fixed u64 to represent all group->length
>> values. So it might have a 1, 2, 4, or 8 byte value in it. When a
>> write arrives, we need to cast the data-register (in this case
>> represented by (void*)val) into a u64 so the equality check (see [A],
>> below) can be done. However, you can't cast an unaligned pointer, or it
>> will SIGBUS on many (most?) architectures.
>>
>
> I mean guest access. Does it have to be aligned?
>

In order to work on arches that require alignment, yes. Note that I
highly suspect that the pointer is already aligned anyway. My
IS_ALIGNED check is simply for conservative sanity.
> You could memcpy the value...
>

Then you get into the issue of endianness and what pointer to use. Or
am I missing something?
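
A minimal sketch of the two approaches under discussion, in plain
userspace C (the helper names here are hypothetical, not from the
patch): a memcpy sidesteps the alignment restriction, but a plain
memcpy into a u64 only reproduces the cast-per-width result on
little-endian hosts, which is the endianness point above.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t u64;

/* Cast-per-width, as in the patch: requires 'val' aligned to 'len',
 * since a misaligned dereference can SIGBUS on strict-alignment arches. */
static int widen_by_cast(const void *val, int len, u64 *out)
{
	if ((uintptr_t)val & (uintptr_t)(len - 1))
		return -1;
	switch (len) {
	case 1: *out = *(const uint8_t  *)val; return 0;
	case 2: *out = *(const uint16_t *)val; return 0;
	case 4: *out = *(const uint32_t *)val; return 0;
	case 8: *out = *(const uint64_t *)val; return 0;
	default: return -1;
	}
}

/* memcpy alternative: no alignment requirement, but copying the raw
 * bytes into a u64 matches the cast only on little-endian hosts;
 * a big-endian host would need an explicit byte-order fixup. */
static int widen_by_memcpy(const void *val, int len, u64 *out)
{
	u64 tmp = 0;

	if (len != 1 && len != 2 && len != 4 && len != 8)
		return -1;
	memcpy(&tmp, val, len);
	*out = tmp;
	return 0;
}
```

On a little-endian host the two agree; keeping the cast (guarded by the
alignment check) makes the comparison against the u64 item->match
well-defined regardless of byte order.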

>
>>> I thought misaligned accesses are allowed.
>>>
>>>
>> If that's true, we are in trouble ;)
>>
>
> I think it works at least on x86:
> http://en.wikipedia.org/wiki/Packed#x86_and_x86-64
>

Right, understood. What I meant specifically is that if the (void*)val
pointer is allowed to be misaligned, we are in trouble ;). I haven't
studied the implementation in front of the MMIO callback recently, but I
generally doubt that's the case. More than likely this is some buffer
that was kmalloc'ed and should already be aligned to the machine word.
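
For reference, the IS_ALIGNED test leaned on above is just the
power-of-two mask trick; below is a self-contained userspace mirror of
it (a local macro and a hypothetical helper, not the kernel's header):

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's IS_ALIGNED(x, a): true when x is
 * a multiple of a. Only valid for power-of-two a, which covers the
 * 1/2/4/8-byte widths the group accepts. */
#define IS_ALIGNED(x, a)  ((((uintptr_t)(x)) & ((uintptr_t)(a) - 1)) == 0)

/* A data pointer is safe to widen when the length is one of the legal
 * widths and the pointer meets that width's natural alignment. */
static int safe_to_widen(const void *val, int len)
{
	if (len != 1 && len != 2 && len != 4 && len != 8)
		return 0;
	return IS_ALIGNED(val, len);
}
```

With this guard, a misaligned val for a 2/4/8-byte access is treated as
a miss instead of being dereferenced, which is the "conservative sanity"
behavior described above.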

Kind Regards,
-Greg



2009-06-22 13:09:14

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On Mon, Jun 22, 2009 at 08:56:28AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 08:13:48AM -0400, Gregory Haskins wrote:
> >
> >>>> + * notification when the memory has been touched.
> >>>> + * --------------------------------------------------------------------
> >>>> + */
> >>>> +
> >>>> +/*
> >>>> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
> >>>> + * aggregates one or more iosignalfd_items. Each item points to exactly one
> >>>>
> > ^^ ^^
> >
> >>>> + * eventfd, and can be registered to trigger on any write to the group
> >>>> + * (wildcard), or to a write of a specific value. If more than one item is to
> >>>>
> > ^^
> >
> >>>> + * be supported, the addr/len ranges must all be identical in the group. If a
> >>>>
> > ^^
> >
> >>>> + * trigger value is to be supported on a particular item, the group range must
> >>>> + * be exactly the width of the trigger.
> >>>>
> >>>>
> >>> Some duplicate spaces in the text above, apparently at random places.
> >>>
> >>>
> >>>
> >> -ENOPARSE ;)
> >>
> >> Can you elaborate?
> >>
> >
> >
> > Marked with ^^
> >
> Heh...well, the first one ("aggregates one") is just a plain typo.
> The others are just me showing my age, perhaps:
>
> http://desktoppub.about.com/cs/typespacing/a/onetwospaces.htm
>
> Whether right or wrong, I think I use two-spaces-after-a-period
> everywhere.

Ah, I see now. Naturally it is really up to you.

> I can fix these if they bother you, but I suspect just
> about every comment I've written has them too. ;)
>
> -Greg

It doesn't bother me as such, but you seem to care about such things :).
If you do care, though: other comments in kvm don't seem to follow this
style, and people won't remember to add the extra spaces in their comments.

>
> >
> >>>> + */
> >>>> +
> >>>> +struct _iosignalfd_item {
> >>>> + struct list_head list;
> >>>> + struct file *file;
> >>>> + u64 match;
> >>>> + struct rcu_head rcu;
> >>>> + int wildcard:1;
> >>>> +};
> >>>> +
> >>>> +struct _iosignalfd_group {
> >>>> + struct list_head list;
> >>>> + u64 addr;
> >>>> + size_t length;
> >>>> + size_t count;
> >>>> + struct list_head items;
> >>>> + struct kvm_io_device dev;
> >>>> + struct rcu_head rcu;
> >>>> +};
> >>>> +
> >>>> +static inline struct _iosignalfd_group *
> >>>> +to_group(struct kvm_io_device *dev)
> >>>> +{
> >>>> + return container_of(dev, struct _iosignalfd_group, dev);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +iosignalfd_item_free(struct _iosignalfd_item *item)
> >>>> +{
> >>>> + fput(item->file);
> >>>> + kfree(item);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
> >>>> +{
> >>>> + struct _iosignalfd_item *item;
> >>>> +
> >>>> + item = container_of(rhp, struct _iosignalfd_item, rcu);
> >>>> +
> >>>> + iosignalfd_item_free(item);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
> >>>> +{
> >>>> + struct _iosignalfd_group *group;
> >>>> +
> >>>> + group = container_of(rhp, struct _iosignalfd_group, rcu);
> >>>> +
> >>>> + kfree(group);
> >>>> +}
> >>>> +
> >>>> +static int
> >>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
> >>>> + int is_write)
> >>>> +{
> >>>> + struct _iosignalfd_group *p = to_group(this);
> >>>> +
> >>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
> >>>> +}
> >>>>
> >>>>
> >>> What does this test? len is ignored ...
> >>>
> >>>
> >>>
> >> Yeah, I was following precedent with other IO devices. However, this
> >> *is* sloppy, I agree. Will fix.
> >>
> >>
> >>>> +
> >>>> +static int
> >>>>
> >>>>
> >>> This seems to be returning bool ...
> >>>
> >>>
> >> Ack
> >>
> >>>
> >>>
> >>>> +iosignalfd_is_match(struct _iosignalfd_group *group,
> >>>> + struct _iosignalfd_item *item,
> >>>> + const void *val,
> >>>> + int len)
> >>>> +{
> >>>> + u64 _val;
> >>>> +
> >>>> + if (len != group->length)
> >>>> + /* mis-matched length is always a miss */
> >>>> + return false;
> >>>>
> >>>>
> >>> Why is that? what if there's 8 byte write which covers
> >>> a 4 byte group?
> >>>
> >>>
> >> v7 and earlier used to allow that for wildcards, actually. It of
> >> course would never make sense to allow mis-matched writes for
> >> non-wildcards, since the idea is to match the value exactly. However,
> >> the feedback I got from Avi was that we should make the wildcard vs
> >> non-wildcard access symmetrical and ensure they both conform to the size.
> >>
> >>>
> >>>
> >>>> +
> >>>> + if (item->wildcard)
> >>>> + /* wildcard is always a hit */
> >>>> + return true;
> >>>> +
> >>>> + /* otherwise, we have to actually compare the data */
> >>>> +
> >>>> + if (!IS_ALIGNED((unsigned long)val, len))
> >>>> + /* protect against this request causing a SIGBUS */
> >>>> + return false;
> >>>>
> >>>>
> >>> Could you explain what this does please?
> >>>
> >>>
> >> Sure: item->match is a fixed u64 to represent all group->length
> >> values. So it might have a 1, 2, 4, or 8 byte value in it. When a
> >> write arrives, we need to cast the data-register (in this case
> >> represented by (void*)val) into a u64 so the equality check (see [A],
> >> below) can be done. However, you can't cast an unaligned pointer, or it
> >> will SIGBUS on many (most?) architectures.
> >>
> >
> > I mean guest access. Does it have to be aligned?
> > You could memcpy the value...
> >
> >
> >>> I thought misaligned accesses are allowed.
> >>>
> >>>
> >> If that's true, we are in trouble ;)
> >>
> >
> > I think it works at least on x86:
> > http://en.wikipedia.org/wiki/Packed#x86_and_x86-64
> >
> >
> >>>
> >>>
> >>>> +
> >>>> + switch (len) {
> >>>> + case 1:
> >>>> + _val = *(u8 *)val;
> >>>> + break;
> >>>> + case 2:
> >>>> + _val = *(u16 *)val;
> >>>> + break;
> >>>> + case 4:
> >>>> + _val = *(u32 *)val;
> >>>> + break;
> >>>> + case 8:
> >>>> + _val = *(u64 *)val;
> >>>> + break;
> >>>> + default:
> >>>> + return false;
> >>>> + }
> >>>>
> >>>>
> >>> So legal values for len are 1,2,4 and 8?
> >>> Might be a good idea to document this.
> >>>
> >>>
> >>>
> >> Ack
> >>
> >>
> >>>> +
> >>>> + return _val == item->match;
> >>>>
> >>>>
> >> [A]
> >>
> >>
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * MMIO/PIO writes trigger an event (if the data matches).
> >>>> + *
> >>>> + * This is invoked by the io_bus subsystem in response to an address match
> >>>> + * against the group. We must then walk the list of individual items to check
> >>>> + * for a match and, if applicable, to send the appropriate signal. If the item
> >>>> + * in question does not have a "match" pointer, it is considered a wildcard
> >>>> + * and will always generate a signal. There can be an arbitrary number
> >>>> + * of distinct matches or wildcards per group.
> >>>> + */
> >>>> +static void
> >>>> +iosignalfd_group_write(struct kvm_io_device *this, gpa_t addr, int len,
> >>>> + const void *val)
> >>>> +{
> >>>> + struct _iosignalfd_group *group = to_group(this);
> >>>> + struct _iosignalfd_item *item;
> >>>> +
> >>>> + rcu_read_lock();
> >>>> +
> >>>> + list_for_each_entry_rcu(item, &group->items, list) {
> >>>> + if (iosignalfd_is_match(group, item, val, len))
> >>>> + eventfd_signal(item->file, 1);
> >>>> + }
> >>>> +
> >>>> + rcu_read_unlock();
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * MMIO/PIO reads against the group indiscriminately return all zeros
> >>>> + */
> >>>>
> >>>>
> >>> Does it have to be so? It would be better to bounce reads to
> >>> userspace...
> >>>
> >>>
> >>>
> >> Good idea. I can set is_write = false and I should never get this
> >> function called.
> >>
> >>
> >>>> +static void
> >>>> +iosignalfd_group_read(struct kvm_io_device *this, gpa_t addr, int len,
> >>>> + void *val)
> >>>> +{
> >>>> + memset(val, 0, len);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * This function is called as KVM is completely shutting down. We do not
> >>>> + * need to worry about locking or careful RCU dancing...just nuke anything
> >>>> + * we have as quickly as possible
> >>>> + */
> >>>> +static void
> >>>> +iosignalfd_group_destructor(struct kvm_io_device *this)
> >>>> +{
> >>>> + struct _iosignalfd_group *group = to_group(this);
> >>>> + struct _iosignalfd_item *item, *tmp;
> >>>> +
> >>>> + list_for_each_entry_safe(item, tmp, &group->items, list) {
> >>>> + list_del(&item->list);
> >>>> + iosignalfd_item_free(item);
> >>>> + }
> >>>> +
> >>>> + list_del(&group->list);
> >>>> + kfree(group);
> >>>> +}
> >>>> +
> >>>> +static const struct kvm_io_device_ops iosignalfd_ops = {
> >>>> + .read = iosignalfd_group_read,
> >>>> + .write = iosignalfd_group_write,
> >>>> + .in_range = iosignalfd_group_in_range,
> >>>> + .destructor = iosignalfd_group_destructor,
> >>>> +};
> >>>> +
> >>>> +/* assumes kvm->lock held */
> >>>> +static struct _iosignalfd_group *
> >>>> +iosignalfd_group_find(struct kvm *kvm, u64 addr)
> >>>> +{
> >>>> + struct _iosignalfd_group *group;
> >>>> +
> >>>> + list_for_each_entry(group, &kvm->iosignalfds, list) {
> >>>>
> >>>>
> >>> {} not needed here
> >>>
> >>>
> >> Ack
> >>
> >>>
> >>>
> >>>> + if (group->addr == addr)
> >>>> + return group;
> >>>> + }
> >>>> +
> >>>> + return NULL;
> >>>> +}
> >>>> +
> >>>> +/* assumes kvm->lock is held */
> >>>> +static struct _iosignalfd_group *
> >>>> +iosignalfd_group_create(struct kvm *kvm, struct kvm_io_bus *bus,
> >>>> + u64 addr, size_t len)
> >>>> +{
> >>>> + struct _iosignalfd_group *group;
> >>>> + int ret;
> >>>> +
> >>>> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> >>>> + if (!group)
> >>>> + return ERR_PTR(-ENOMEM);
> >>>> +
> >>>> + INIT_LIST_HEAD(&group->list);
> >>>> + INIT_LIST_HEAD(&group->items);
> >>>> + group->addr = addr;
> >>>> + group->length = len;
> >>>> + kvm_iodevice_init(&group->dev, &iosignalfd_ops);
> >>>> +
> >>>> + ret = kvm_io_bus_register_dev(kvm, bus, &group->dev);
> >>>> + if (ret < 0) {
> >>>> + kfree(group);
> >>>> + return ERR_PTR(ret);
> >>>> + }
> >>>> +
> >>>> + list_add_tail(&group->list, &kvm->iosignalfds);
> >>>> +
> >>>> + return group;
> >>>> +}
> >>>> +
> >>>> +static int
> >>>> +kvm_assign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> >>>> +{
> >>>> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
> >>>> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
> >>>> + struct _iosignalfd_group *group = NULL;
> >>>>
> >>>>
> >>> why does group need to be initialized?
> >>>
> >>>
> >>>
> >>>> + struct _iosignalfd_item *item = NULL;
> >>>>
> >>>>
> >>> Why does item need to be initialized?
> >>>
> >>>
> >>>
> >> Probably leftover from versions prior to v8. Will fix.
> >>
> >>
> >>>> + struct file *file;
> >>>> + int ret;
> >>>> +
> >>>> + if (args->len > sizeof(u64))
> >>>>
> >>>>
> >>> Is e.g. value 3 legal?
> >>>
> >>>
> >> Ack. Will check against legal values.
> >>
> >>
> >>>
> >>>
> >>>> + return -EINVAL;
> >>>>
> >>>>
> >>>
> >>>
> >>>> +
> >>>> + file = eventfd_fget(args->fd);
> >>>> + if (IS_ERR(file))
> >>>> + return PTR_ERR(file);
> >>>> +
> >>>> + item = kzalloc(sizeof(*item), GFP_KERNEL);
> >>>> + if (!item) {
> >>>> + ret = -ENOMEM;
> >>>> + goto fail;
> >>>> + }
> >>>> +
> >>>> + INIT_LIST_HEAD(&item->list);
> >>>> + item->file = file;
> >>>> +
> >>>> + /*
> >>>> + * A trigger address is optional, otherwise this is a wildcard
> >>>> + */
> >>>> + if (args->flags & KVM_IOSIGNALFD_FLAG_TRIGGER)
> >>>> + item->match = args->trigger;
> >>>> + else
> >>>> + item->wildcard = true;
> >>>> +
> >>>> + mutex_lock(&kvm->lock);
> >>>> +
> >>>> + /*
> >>>> + * Put an upper limit on the number of items we support
> >>>> + */
> >>>>
> >>>>
> >>> Groups and items, actually, right?
> >>>
> >>>
> >>>
> >> Yeah, though technically that is implicit when you say "items", since
> >> each group always has at least one item. I will try to make this
> >> clearer, though.
> >>
> >>
> >>>> + if (kvm->io_device_count >= CONFIG_KVM_MAX_IO_DEVICES) {
> >>>> + ret = -ENOSPC;
> >>>> + goto unlock_fail;
> >>>> + }
> >>>> +
> >>>> + group = iosignalfd_group_find(kvm, args->addr);
> >>>> + if (!group) {
> >>>> +
> >>>> + group = iosignalfd_group_create(kvm, bus,
> >>>> + args->addr, args->len);
> >>>> + if (IS_ERR(group)) {
> >>>> + ret = PTR_ERR(group);
> >>>> + goto unlock_fail;
> >>>> + }
> >>>> +
> >>>> + /*
> >>>> + * Note: We do not increment io_device_count for the first item,
> >>>> + * as this is represented by the group device that we just
> >>>> + * registered. Make sure we handle this properly when we
> >>>> + * deassign the last item
> >>>> + */
> >>>> + } else {
> >>>> +
> >>>> + if (group->length != args->len) {
> >>>> + /*
> >>>> + * Existing groups must have the same addr/len tuple
> >>>> + * or we reject the request
> >>>> + */
> >>>> + ret = -EINVAL;
> >>>> + goto unlock_fail;
> >>>>
> >>>>
> >>> Most errors seem to trigger EINVAL. Applications will be
> >>> easier to debug if different errors are returned on
> >>> different mistakes.
> >>>
> >> Yeah, agreed. Will try to differentiate some errors here.
> >>
> >>
> >>> E.g. here EBUSY might be good. And same
> >>> in other places.
> >>>
> >>>
> >>>
> >> Actually, I think EBUSY is supposed to be a transitory error, and would
> >> not be appropriate to use here. That said, your point is taken: Find
> >> more appropriate and descriptive errors.
> >>
> >>
> >>>> + }
> >>>> +
> >>>> + kvm->io_device_count++;
> >>>> + }
> >>>> +
> >>>> + /*
> >>>> + * Note: We are committed to succeed at this point since we have
> >>>> + * (potentially) published a new group-device. Any failure handling
> >>>> + * added in the future after this point will need to be carefully
> >>>> + * considered.
> >>>> + */
> >>>> +
> >>>> + list_add_tail_rcu(&item->list, &group->items);
> >>>> + group->count++;
> >>>> +
> >>>> + mutex_unlock(&kvm->lock);
> >>>> +
> >>>> + return 0;
> >>>> +
> >>>> +unlock_fail:
> >>>> + mutex_unlock(&kvm->lock);
> >>>> +fail:
> >>>> + if (item)
> >>>> + /*
> >>>> + * it would have never made it to the group->items list
> >>>> + * in the failure path, so we dont need to worry about removing
> >>>> + * it
> >>>> + */
> >>>> + kfree(item);
> >>>> +
> >>>> + fput(file);
> >>>> +
> >>>> + return ret;
> >>>> +}
> >>>> +
> >>>> +
> >>>> +static int
> >>>> +kvm_deassign_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> >>>> +{
> >>>> + int pio = args->flags & KVM_IOSIGNALFD_FLAG_PIO;
> >>>> + struct kvm_io_bus *bus = pio ? &kvm->pio_bus : &kvm->mmio_bus;
> >>>> + struct _iosignalfd_group *group;
> >>>> + struct _iosignalfd_item *item, *tmp;
> >>>> + struct file *file;
> >>>> + int ret = 0;
> >>>> +
> >>>> + file = eventfd_fget(args->fd);
> >>>> + if (IS_ERR(file))
> >>>> + return PTR_ERR(file);
> >>>> +
> >>>> + mutex_lock(&kvm->lock);
> >>>> +
> >>>> + group = iosignalfd_group_find(kvm, args->addr);
> >>>> + if (!group) {
> >>>> + ret = -EINVAL;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + /*
> >>>> + * Exhaustively search our group->items list for any items that might
> >>>> + * match the specified fd, and (carefully) remove each one found.
> >>>> + */
> >>>> + list_for_each_entry_safe(item, tmp, &group->items, list) {
> >>>> +
> >>>> + if (item->file != file)
> >>>> + continue;
> >>>> +
> >>>> + list_del_rcu(&item->list);
> >>>> + group->count--;
> >>>> + if (group->count)
> >>>> + /*
> >>>> + * We only decrement the global count if this is *not*
> >>>> + * the last item. The last item will be accounted for
> >>>> + * by the io_bus_unregister
> >>>> + */
> >>>> + kvm->io_device_count--;
> >>>> +
> >>>> + /*
> >>>> + * The item may be still referenced inside our group->write()
> >>>> + * path's RCU read-side CS, so defer the actual free to the
> >>>> + * next grace
> >>>> + */
> >>>> + call_rcu(&item->rcu, iosignalfd_item_deferred_free);
> >>>> + }
> >>>> +
> >>>> + /*
> >>>> + * Check if the group is now completely vacated as a result of
> >>>> + * removing the items. If so, unregister/delete it
> >>>> + */
> >>>> + if (!group->count) {
> >>>> +
> >>>> + kvm_io_bus_unregister_dev(kvm, bus, &group->dev);
> >>>> +
> >>>> + /*
> >>>> + * Like the item, the group may also still be referenced as
> >>>> + * per above. However, the kvm->iosignalfds list is not
> >>>> + * RCU protected (its protected by kvm->lock instead) so
> >>>> + * we can just plain-vanilla remove it. What needs to be
> >>>> + * done carefully is the actual freeing of the group pointer
> >>>> + * since we walk the group->items list within the RCU CS.
> >>>> + */
> >>>> + list_del(&group->list);
> >>>> + call_rcu(&group->rcu, iosignalfd_group_deferred_free);
> >>>>
> >>>>
> >>> This is a deferred call, is it not, with no guarantee on when it will
> >>> run? If correct I think synchronize_rcu might be better here:
> >>> - can the module go away while iosignalfd_group_deferred_free is
> >>> running?
> >>>
> >>>
> >> Good catch. Once I go this route it will be easy to use SRCU instead of
> >> RCU, too. So I will fix this up.
> >>
> >>
> >>
> >>> - can eventfd be signalled *after* ioctl exits? If yes
> >>> this might confuse applications if they use the eventfd
> >>> for something else.
> >>>
> >>>
> >> Not by iosignalfd. Once this function completes, we synchronously
> >> guarantee that no more IO activity will generate an event on the
> >> affected eventfds. Of course, this has no bearing on whether some other
> >> producer wont signal, but that is beyond the scope of iosignalfd.
> >>
> >>>
> >>>
> >>>> + }
> >>>> +
> >>>> +out:
> >>>> + mutex_unlock(&kvm->lock);
> >>>> +
> >>>> + fput(file);
> >>>> +
> >>>> + return ret;
> >>>> +}
> >>>> +
> >>>> +int
> >>>> +kvm_iosignalfd(struct kvm *kvm, struct kvm_iosignalfd *args)
> >>>> +{
> >>>> + if (args->flags & KVM_IOSIGNALFD_FLAG_DEASSIGN)
> >>>> + return kvm_deassign_iosignalfd(kvm, args);
> >>>> +
> >>>> + return kvm_assign_iosignalfd(kvm, args);
> >>>> +}
> >>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>>> index 42cbea7..e6495d4 100644
> >>>> --- a/virt/kvm/kvm_main.c
> >>>> +++ b/virt/kvm/kvm_main.c
> >>>> @@ -971,7 +971,7 @@ static struct kvm *kvm_create_vm(void)
> >>>> atomic_inc(&kvm->mm->mm_count);
> >>>> spin_lock_init(&kvm->mmu_lock);
> >>>> kvm_io_bus_init(&kvm->pio_bus);
> >>>> - kvm_irqfd_init(kvm);
> >>>> + kvm_eventfd_init(kvm);
> >>>> mutex_init(&kvm->lock);
> >>>> mutex_init(&kvm->irq_lock);
> >>>> kvm_io_bus_init(&kvm->mmio_bus);
> >>>> @@ -2227,6 +2227,15 @@ static long kvm_vm_ioctl(struct file *filp,
> >>>> r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
> >>>> break;
> >>>> }
> >>>> + case KVM_IOSIGNALFD: {
> >>>> + struct kvm_iosignalfd data;
> >>>> +
> >>>> + r = -EFAULT;
> >>>> + if (copy_from_user(&data, argp, sizeof data))
> >>>> + goto out;
> >>>> + r = kvm_iosignalfd(kvm, &data);
> >>>> + break;
> >>>> + }
> >>>> #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> >>>> case KVM_SET_BOOT_CPU_ID:
> >>>> r = 0;
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>>> the body of a message to [email protected]
> >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>>
> >> Thanks Michael,
> >> -Greg
> >>
> >>
> >>
> >
> >
> >
>
>
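
To illustrate the call_rcu()/synchronize_rcu() trade-off discussed above: call_rcu() queues the free to run after a future grace period, so the caller can return before the callback fires (which is exactly why a module could be unloaded while its callback is still pending), whereas synchronize_rcu() blocks inline until the grace period elapses. The toy userspace model below is purely illustrative; the callback queue and `grace_period()` are hypothetical stand-ins for real RCU machinery, and `struct item` is a simplified version of the patch's `_iosignalfd_item`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model: the "grace period" is just draining a callback queue. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct rcu_head *pending;

/* Deferred: the callback runs at some later point, not before return. */
static void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Stand-in for a grace period elapsing.  synchronize_rcu() would wait
 * here inline, which keeps the calling code (and its module) pinned
 * until all readers are done. */
static void grace_period(void)
{
	while (pending) {
		struct rcu_head *h = pending;

		pending = h->next;
		h->func(h);
	}
}

struct item {
	int value;
	struct rcu_head rcu;
};

static int frees;

static void item_deferred_free(struct rcu_head *head)
{
	free(container_of(head, struct item, rcu));
	frees++;
}
```

The key property the thread worries about is visible in the model: between `call_rcu()` and `grace_period()`, the caller has already returned but the free has not yet run.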

2009-06-22 13:12:54

by Avi Kivity

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On 06/22/2009 04:08 PM, Michael S. Tsirkin wrote:
> It doesn't bother me as such. But you seem to care about such things :).
> If you do care, other comments in kvm don't seem to be like this and
> people won't remember to add spaces in comments, though.
>

Really, we don't need to standardize everything.

--
error compiling committee.c: too many arguments to function

2009-06-22 13:14:31

by Michael S. Tsirkin

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On Mon, Jun 22, 2009 at 09:04:48AM -0400, Gregory Haskins wrote:
> Sorry, Michael. I missed that you had other comments after the
> grammatical one. Will answer inline
>
> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 08:13:48AM -0400, Gregory Haskins wrote:
> >
> >>>> + * notification when the memory has been touched.
> >>>> + * --------------------------------------------------------------------
> >>>> + */
> >>>> +
> >>>> +/*
> >>>> + * Design note: We create one PIO/MMIO device (iosignalfd_group) which
> >>>> + * aggregates one or more iosignalfd_items. Each item points to exactly one
> >>>>
> > ^^ ^^
> >
> >>>> + * eventfd, and can be registered to trigger on any write to the group
> >>>> + * (wildcard), or to a write of a specific value. If more than one item is to
> >>>>
> > ^^
> >
> >>>> + * be supported, the addr/len ranges must all be identical in the group. If a
> >>>>
> > ^^
> >
> >>>> + * trigger value is to be supported on a particular item, the group range must
> >>>> + * be exactly the width of the trigger.
> >>>>
> >>>>
> >>> Some duplicate spaces in the text above, apparently at random places.
> >>>
> >>>
> >>>
> >> -ENOPARSE ;)
> >>
> >> Can you elaborate?
> >>
> >
> >
> > Marked with ^^
> >
> >
> >>>> + */
> >>>> +
> >>>> +struct _iosignalfd_item {
> >>>> + struct list_head list;
> >>>> + struct file *file;
> >>>> + u64 match;
> >>>> + struct rcu_head rcu;
> >>>> + int wildcard:1;
> >>>> +};
> >>>> +
> >>>> +struct _iosignalfd_group {
> >>>> + struct list_head list;
> >>>> + u64 addr;
> >>>> + size_t length;
> >>>> + size_t count;
> >>>> + struct list_head items;
> >>>> + struct kvm_io_device dev;
> >>>> + struct rcu_head rcu;
> >>>> +};
> >>>> +
> >>>> +static inline struct _iosignalfd_group *
> >>>> +to_group(struct kvm_io_device *dev)
> >>>> +{
> >>>> + return container_of(dev, struct _iosignalfd_group, dev);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +iosignalfd_item_free(struct _iosignalfd_item *item)
> >>>> +{
> >>>> + fput(item->file);
> >>>> + kfree(item);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +iosignalfd_item_deferred_free(struct rcu_head *rhp)
> >>>> +{
> >>>> + struct _iosignalfd_item *item;
> >>>> +
> >>>> + item = container_of(rhp, struct _iosignalfd_item, rcu);
> >>>> +
> >>>> + iosignalfd_item_free(item);
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +iosignalfd_group_deferred_free(struct rcu_head *rhp)
> >>>> +{
> >>>> + struct _iosignalfd_group *group;
> >>>> +
> >>>> + group = container_of(rhp, struct _iosignalfd_group, rcu);
> >>>> +
> >>>> + kfree(group);
> >>>> +}
> >>>> +
> >>>> +static int
> >>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
> >>>> + int is_write)
> >>>> +{
> >>>> + struct _iosignalfd_group *p = to_group(this);
> >>>> +
> >>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
> >>>> +}
> >>>>
> >>>>
> >>> What does this test? len is ignored ...
> >>>
> >>>
> >>>
> >> Yeah, I was following precedent with other IO devices. However, this
> >> *is* sloppy, I agree. Will fix.
> >>
> >>
> >>>> +
> >>>> +static int
> >>>>
> >>>>
> >>> This seems to be returning bool ...
> >>>
> >>>
> >> Ack
> >>
> >>>
> >>>
> >>>> +iosignalfd_is_match(struct _iosignalfd_group *group,
> >>>> + struct _iosignalfd_item *item,
> >>>> + const void *val,
> >>>> + int len)
> >>>> +{
> >>>> + u64 _val;
> >>>> +
> >>>> + if (len != group->length)
> >>>> + /* mis-matched length is always a miss */
> >>>> + return false;
> >>>>
> >>>>
> >>> Why is that? what if there's 8 byte write which covers
> >>> a 4 byte group?
> >>>
> >>>
> >> v7 and earlier used to allow that for wildcards, actually. It of
> >> course would never make sense to allow mis-matched writes for
> >> non-wildcards, since the idea is to match the value exactly. However,
> >> the feedback I got from Avi was that we should make the wildcard vs
> >> non-wildcard access symmetrical and ensure they both conform to the size.
> >>
> >>>
> >>>
> >>>> +
> >>>> + if (item->wildcard)
> >>>> + /* wildcard is always a hit */
> >>>> + return true;
> >>>> +
> >>>> + /* otherwise, we have to actually compare the data */
> >>>> +
> >>>> + if (!IS_ALIGNED((unsigned long)val, len))
> >>>> + /* protect against this request causing a SIGBUS */
> >>>> + return false;
> >>>>
> >>>>
> >>> Could you explain what this does please?
> >>>
> >>>
> >> Sure: item->match is a fixed u64 to represent all group->length
> >> values. So it might have a 1, 2, 4, or 8 byte value in it. When I
> >> write arrives, we need to cast the data-register (in this case
> >> represented by (void*)val) into a u64 so the equality check (see [A],
> >> below) can be done. However, you can't cast an unaligned pointer, or it
> >> will SIGBUS on many (most?) architectures.
> >>
> >
> > I mean guest access. Does it have to be aligned?
> >
>
> In order to work on arches that require alignment, yes. Note that I
> highly suspect that the pointer is already aligned anyway. My
> IS_ALIGNED check is simply for conservative sanity.
> > You could memcpy the value...
> >
>
> Then you get into the issue of endianness and what pointer to use.
> Or
> am I missing something?
>
> >
> >>> I thought misaligned accesses are allowed.
> >>>
> >>>
> >> If thats true, we are in trouble ;)
> >>
> >
> > I think it works at least on x86:
> > http://en.wikipedia.org/wiki/Packed#x86_and_x86-64
> >
>
> Right, understood. What I meant specifically is that if the (void*)val
> pointer is allowed to be misaligned we are in trouble ;). I haven't
> studied the implementation in front of the MMIO callback recently, but I
> generally doubt thats the case. More than likely this is some buffer
> that was kmalloced and that should already be aligned to the machine word.
>
> Kind Regards,
> -Greg
>

Yes, from what I saw of the code I think it can be BUG_ON.
Avi?
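
For context, the data-match step being debated reduces to something like the following. This is a hedged userspace sketch (hypothetical `match_value()` helper, stdint types standing in for the kernel's u8/u16/u32/u64), not the patch code itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's IS_ALIGNED() macro */
#define IS_ALIGNED(x, a) (((uintptr_t)(x) & ((uintptr_t)(a) - 1)) == 0)

/*
 * Widen the guest's write payload (val, len) to a u64 and compare it
 * against the item's match value.  The cast is only safe if val is
 * len-aligned; on strict-alignment architectures an unaligned cast
 * can fault (the SIGBUS concern above), hence the guard that the
 * thread suggests could become a BUG_ON().
 */
static bool match_value(uint64_t match, const void *val, int len)
{
	uint64_t v;

	if (!IS_ALIGNED((uintptr_t)val, len))
		return false;

	switch (len) {
	case 1: v = *(const uint8_t  *)val; break;
	case 2: v = *(const uint16_t *)val; break;
	case 4: v = *(const uint32_t *)val; break;
	case 8: v = *(const uint64_t *)val; break;
	default: return false;
	}

	return v == match;
}
```

Michael's memcpy() suggestion would sidestep the alignment requirement entirely, at the cost of having to pin down the byte-order semantics of the copied value.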

2009-06-22 13:19:21

by Gregory Haskins

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 08:56:28AM -0400, Gregory Haskins wrote:
>
>>
>> Heh...well, the first one ("aggregates one") is just a plain typo.
>> The others are just me showing my age, perhaps:
>>
>> http://desktoppub.about.com/cs/typespacing/a/onetwospaces.htm
>>
>> Whether right or wrong, I think I use two-spaces-after-a-period
>> everywhere.
>>
>
> Ah, I see now. Naturally it is really up to you.
>
>
>> I can fix these if they bother you, but I suspect just
>> about every comment I've written has them too. ;)
>>
>> -Greg
>>
>
> It doesn't bother me as such. But you seem to care about such things :).
>

It's not that I care per se.  It's that it will be a tough habit to break
as I've done it for years now ;) (See, I even did it in this paragraph ;)

-Greg



2009-06-22 13:20:17

by Gregory Haskins

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 09:04:48AM -0400, Gregory Haskins wrote:
>
>> Sorry, Michael. I missed that you had other comments after the
>> grammatical one. Will answer inline
>>
>> Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 22, 2009 at 08:13:48AM -0400, Gregory Haskins wrote:
>>>
>>>
>>>
>>>> If thats true, we are in trouble ;)
>>>>
>>>>
>>> I think it works at least on x86:
>>> http://en.wikipedia.org/wiki/Packed#x86_and_x86-64
>>>
>>>
>> Right, understood. What I meant specifically is that if the (void*)val
>> pointer is allowed to be misaligned we are in trouble ;). I haven't
>> studied the implementation in front of the MMIO callback recently, but I
>> generally doubt thats the case. More than likely this is some buffer
>> that was kmalloced and that should already be aligned to the machine word.
>>
>> Kind Regards,
>> -Greg
>>
>>
>
> Yes, from what I saw of the code I think it can be BUG_ON.
> Avi?
>
>

The question to ask is whether a guest can influence that condition. If
they can, it's an attack vector to crash the host. I suspect they can't,
however. Therefore, your recommendation is perhaps a good approach so
this condition cannot ever go unnoticed. Avi?

-Greg



2009-06-22 14:27:59

by Avi Kivity

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On 06/22/2009 04:13 PM, Michael S. Tsirkin wrote:
>> Right, understood. What I meant specifically is that if the (void*)val
>> pointer is allowed to be misaligned we are in trouble ;). I haven't
>> studied the implementation in front of the MMIO callback recently, but I
>> generally doubt thats the case. More than likely this is some buffer
>> that was kmalloced and that should already be aligned to the machine word.
>>
>> Kind Regards,
>> -Greg
>>
>>
>
> Yes, from what I saw of the code I think it can be BUG_ON.
> Avi?
>

Yes, BUG_ON is safe here.

--
error compiling committee.c: too many arguments to function

2009-06-22 14:29:38

by Avi Kivity

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On 06/22/2009 04:19 PM, Gregory Haskins wrote:
> The question to ask is whether a guest can influence that condition. If
> they can, its an attack vector to crash the host. I suspect they can't,
> however. Therefore, your recommendation is perhaps a good approach so
> this condition cannot ever go unnoticed. Avi?
>

No, this is host memory in the emulator context, allocated as unsigned
long. But this is on x86 which isn't sensitive to alignment anyway.
It's unlikely that other architectures will supply unaligned pointers.

We ought to convert the interface to pass a value anyway.

--
error compiling committee.c: too many arguments to function

2009-06-22 14:39:28

by Gregory Haskins

Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Avi Kivity wrote:
> On 06/22/2009 04:19 PM, Gregory Haskins wrote:
>> The question to ask is whether a guest can influence that condition. If
>> they can, its an attack vector to crash the host. I suspect they can't,
>> however. Therefore, your recommendation is perhaps a good approach so
>> this condition cannot ever go unnoticed. Avi?
>>
>
> No, this is host memory in the emulator context, allocated as unsigned
> long. But this is on x86 which isn't sensitive to alignment anyway.

Ok, will change to BUG_ON in v9

> It's unlikely that other achitectures will supply unaligned pointers.
>
Yeah, they shouldn't

> We ought to convert the interface to pass a value anyway.
>
Agreed. As you said earlier, let's defer for now.

Thanks Avi,
-Greg



2009-06-22 15:17:19

by Michael S. Tsirkin

Subject: [PATCH RFC] pass write value to in_range pointers

It seems that a lot of complexity and trickiness with iosignalfd is
handling the group/item relationship, which comes about because kvm does
not currently let a device on the bus claim a write transaction based on the
value written. This could be greatly simplified if the value written
was passed to the in_range check for write operation. We could then
simply make each kvm_iosignalfd a device on the bus.

What does everyone think of the following lightly tested patch?

-->

Subject: kvm: pass value to in_range callback

For write transactions, pass the value written to in_range checks so
that we can make each iosignalfd a separate device on kvm bus.

Signed-off-by: Michael S. Tsirkin <[email protected]>

---


diff --git a/virt/kvm/iodev.h b/virt/kvm/iodev.h
index 55e8846..98a25af 100644
--- a/virt/kvm/iodev.h
+++ b/virt/kvm/iodev.h
@@ -28,7 +28,7 @@ struct kvm_io_device {
int len,
const void *val);
int (*in_range)(struct kvm_io_device *this, gpa_t addr, int len,
- int is_write);
+ int is_write, void *write_val);
void (*destructor)(struct kvm_io_device *this);

void *private;
@@ -51,9 +51,10 @@ static inline void kvm_iodevice_write(struct kvm_io_device *dev,
}

static inline int kvm_iodevice_inrange(struct kvm_io_device *dev,
- gpa_t addr, int len, int is_write)
+ gpa_t addr, int len, int is_write,
+ void *write_val)
{
- return dev->in_range(dev, addr, len, is_write);
+ return dev->in_range(dev, addr, len, is_write, write_val);
}

static inline void kvm_iodevice_destructor(struct kvm_io_device *dev)
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 80c57b0..8cfdf9d 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -211,11 +211,13 @@ int kvm_dev_ioctl_check_extension(long ext)
}

static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
- gpa_t addr, int len, int is_write)
+ gpa_t addr, int len, int is_write,
+ void *write_val)
{
struct kvm_io_device *dev;

- dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr, len, is_write);
+ dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr, len, is_write,
+ write_val);

return dev;
}
@@ -247,7 +247,8 @@ static int handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_run->exit_reason = KVM_EXIT_MMIO;
return 0;
mmio:
- mmio_dev = vcpu_find_mmio_dev(vcpu, p->addr, p->size, !p->dir);
+ mmio_dev = vcpu_find_mmio_dev(vcpu, p->addr, p->size,
+ !p->dir, &p->data);
if (mmio_dev) {
if (!p->dir)
kvm_iodevice_write(mmio_dev, p->addr, p->size,
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 4d6f0d2..5ba21ff 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -485,7 +485,7 @@ static void pit_ioport_read(struct kvm_io_device *this,
}

static int pit_in_range(struct kvm_io_device *this, gpa_t addr,
- int len, int is_write)
+ int len, int is_write, void *write_val)
{
return ((addr >= KVM_PIT_BASE_ADDRESS) &&
(addr < KVM_PIT_BASE_ADDRESS + KVM_PIT_MEM_LENGTH));
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ae99d83..456fd53 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -721,7 +721,7 @@ static void apic_mmio_write(struct kvm_io_device *this,
}

static int apic_mmio_range(struct kvm_io_device *this, gpa_t addr,
- int len, int size)
+ int len, int is_write, void *write_val)
{
struct kvm_lapic *apic = (struct kvm_lapic *)this->private;
int ret = 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 249540f..9d29017 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2043,13 +2043,13 @@ static void kvm_init_msr_list(void)
*/
static struct kvm_io_device *vcpu_find_pervcpu_dev(struct kvm_vcpu *vcpu,
gpa_t addr, int len,
- int is_write)
+ int is_write, void *write_val)
{
struct kvm_io_device *dev;

if (vcpu->arch.apic) {
dev = &vcpu->arch.apic->dev;
- if (dev->in_range(dev, addr, len, is_write))
+ if (dev->in_range(dev, addr, len, is_write, write_val))
return dev;
}
return NULL;
@@ -2058,14 +2058,14 @@ static struct kvm_io_device *vcpu_find_pervcpu_dev(struct kvm_vcpu *vcpu,

static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
gpa_t addr, int len,
- int is_write)
+ int is_write, void *write_val)
{
struct kvm_io_device *dev;

- dev = vcpu_find_pervcpu_dev(vcpu, addr, len, is_write);
+ dev = vcpu_find_pervcpu_dev(vcpu, addr, len, is_write, write_val);
if (dev == NULL)
dev = kvm_io_bus_find_dev(&vcpu->kvm->mmio_bus, addr, len,
- is_write);
+ is_write, write_val);
return dev;
}

@@ -2161,7 +2161,7 @@ mmio:
* Is this MMIO handled locally?
*/
mutex_lock(&vcpu->kvm->lock);
- mmio_dev = vcpu_find_mmio_dev(vcpu, gpa, bytes, 0);
+ mmio_dev = vcpu_find_mmio_dev(vcpu, gpa, bytes, 0, NULL);
if (mmio_dev) {
kvm_iodevice_read(mmio_dev, gpa, bytes, val);
mutex_unlock(&vcpu->kvm->lock);
@@ -2216,7 +2216,7 @@ mmio:
* Is this MMIO handled locally?
*/
mutex_lock(&vcpu->kvm->lock);
- mmio_dev = vcpu_find_mmio_dev(vcpu, gpa, bytes, 1);
+ mmio_dev = vcpu_find_mmio_dev(vcpu, gpa, bytes, 1, val);
if (mmio_dev) {
kvm_iodevice_write(mmio_dev, gpa, bytes, val);
mutex_unlock(&vcpu->kvm->lock);
@@ -2576,9 +2576,10 @@ static void pio_string_write(struct kvm_io_device *pio_dev,

static struct kvm_io_device *vcpu_find_pio_dev(struct kvm_vcpu *vcpu,
gpa_t addr, int len,
- int is_write)
+ int is_write, void *write_val)
{
- return kvm_io_bus_find_dev(&vcpu->kvm->pio_bus, addr, len, is_write);
+ return kvm_io_bus_find_dev(&vcpu->kvm->pio_bus, addr, len, is_write,
+ write_val);
}

int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
@@ -2608,7 +2608,8 @@ int kvm_emulate_pio(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
val = kvm_register_read(vcpu, VCPU_REGS_RAX);
memcpy(vcpu->arch.pio_data, &val, 4);

- pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
+ pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in,
+ vcpu->arch.pio_data);
if (pio_dev) {
kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
complete_pio(vcpu);
@@ -2624,7 +2625,6 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
{
unsigned now, in_page;
int ret = 0;
- struct kvm_io_device *pio_dev;

vcpu->run->exit_reason = KVM_EXIT_IO;
vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
@@ -2672,9 +2672,6 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,

vcpu->arch.pio.guest_gva = address;

- pio_dev = vcpu_find_pio_dev(vcpu, port,
- vcpu->arch.pio.cur_count,
- !vcpu->arch.pio.in);
if (!vcpu->arch.pio.in) {
/* string PIO write */
ret = pio_copy_data(vcpu);
@@ -2682,13 +2679,20 @@ int kvm_emulate_pio_string(struct kvm_vcpu *vcpu, struct kvm_run *run, int in,
kvm_inject_gp(vcpu, 0);
return 1;
}
- if (ret == 0 && pio_dev) {
- pio_string_write(pio_dev, vcpu);
- complete_pio(vcpu);
- if (vcpu->arch.pio.count == 0)
- ret = 1;
+ if (ret == 0) {
+ struct kvm_io_device *pio_dev;
+ pio_dev = vcpu_find_pio_dev(vcpu, port,
+ vcpu->arch.pio.cur_count,
+ 1, vcpu->arch.pio_data);
+ if (pio_dev) {
+ pio_string_write(pio_dev, vcpu);
+ complete_pio(vcpu);
+ if (vcpu->arch.pio.count == 0)
+ ret = 1;
+ }
}
- } else if (pio_dev)
+ } else if (vcpu_find_pio_dev(vcpu, port, vcpu->arch.pio.cur_count, 0,
+ NULL))
pr_unimpl(vcpu, "no string pio read support yet, "
"port %x size %d count %ld\n",
port, size, count);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index aacc544..0fb7938 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -60,7 +60,8 @@ struct kvm_io_bus {
void kvm_io_bus_init(struct kvm_io_bus *bus);
void kvm_io_bus_destroy(struct kvm_io_bus *bus);
struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
- gpa_t addr, int len, int is_write);
+ gpa_t addr, int len, int is_write,
+ void *write_val);
void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
struct kvm_io_device *dev);

diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 5ae620d..3a1cbfd 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -15,7 +15,8 @@
#include "coalesced_mmio.h"

static int coalesced_mmio_in_range(struct kvm_io_device *this,
- gpa_t addr, int len, int is_write)
+ gpa_t addr, int len, int is_write,
+ void *write_val)
{
struct kvm_coalesced_mmio_dev *dev =
(struct kvm_coalesced_mmio_dev*)this->private;
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 1eddae9..2adbb1b 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -221,7 +221,8 @@ void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode)
}

static int ioapic_in_range(struct kvm_io_device *this, gpa_t addr,
- int len, int is_write)
+ int len, int is_write,
+ void *write_val)
{
struct kvm_ioapic *ioapic = (struct kvm_ioapic *)this->private;

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7645543..b97e390 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2445,14 +2445,15 @@ void kvm_io_bus_destroy(struct kvm_io_bus *bus)
}

struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
- gpa_t addr, int len, int is_write)
+ gpa_t addr, int len, int is_write,
+ void *write_val)
{
int i;

for (i = 0; i < bus->dev_count; i++) {
struct kvm_io_device *pos = bus->devs[i];

- if (pos->in_range(pos, addr, len, is_write))
+ if (pos->in_range(pos, addr, len, is_write, write_val))
return pos;
}
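
Abstracted away from the kernel types, the behavioural change this patch enables is that a device can claim a write transaction only when the payload matches its trigger value. A hedged sketch with hypothetical structures (little-endian widening assumed for brevity; the real callback takes `struct kvm_io_device *` and `gpa_t`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-eventfd device: with a value-aware in_range(), each
 * iosignalfd can sit on the bus directly instead of behind a group. */
struct iosig_dev {
	uint64_t addr;
	int      len;
	bool     wildcard;
	uint64_t match;
};

static bool iosig_in_range(const struct iosig_dev *d, uint64_t addr,
			   int len, bool is_write, const void *write_val)
{
	uint64_t v = 0;

	if (!is_write || addr != d->addr || len != d->len)
		return false;
	if (d->wildcard)
		return true;

	/* memcpy sidesteps the alignment cast; assumes little-endian
	 * widening for brevity */
	memcpy(&v, write_val, len);
	return v == d->match;
}
```

With this shape, two devices registered at the same address but different match values no longer conflict: the bus scan simply asks each one whether it wants this particular write.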

2009-06-22 15:45:34

by Gregory Haskins

Subject: Re: [PATCH RFC] pass write value to in_range pointers

Michael S. Tsirkin wrote:
> It seems that a lot of complexity and trickiness with iosignalfd is
> handling the group/item relationship, which comes about because kvm does
> not currently let a device on the bus claim a write transaction based on the
> value written. This could be greatly simplified if the value written
> was passed to the in_range check for write operation. We could then
> simply make each kvm_iosignalfd a device on the bus.
>
> What does everyone think of the following lightly tested patch?
>

Hi Michael,
It's interesting, but I am not convinced it's necessary.  We created the
group/item layout because iosignalfds are unique in that they are
probably the only IO device that wants to do some kind of address
aliasing. With what you are proposing here, you are adding aliasing
support to the general infrastructure which I am not (yet) convinced is
necessary. If there isn't a use case for other devices to have
aliasing, I would think the logic is best contained in iosignalfd. Do
you have something in mind?

Kind Regards,
-Greg

> [quoted patch trimmed]




2009-06-22 16:09:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > It seems that a lot of complexity and trickiness with iosignalfd is
> > handling the group/item relationship, which comes about because kvm does
> > not currently let a device on the bus claim a write transaction based on the
> > value written. This could be greatly simplified if the value written
> > was passed to the in_range check for write operation. We could then
> > simply make each kvm_iosignalfd a device on the bus.
> >
> > What does everyone think of the following lightly tested patch?
> >
>
> Hi Michael,
> Its interesting, but I am not convinced its necessary. We created the
> group/item layout because iosignalfds are unique in that they are
> probably the only IO device that wants to do some kind of address
> aliasing.

We actually already have aliasing: the is_write flag is used for this
purpose. Actually, it's possible to remove is_write by passing
a null pointer in write_val for reads. I like this a bit less as
the code generated is less compact ... Avi, what do you think?
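[A rough sketch of what folding is_write into write_val could look like -- illustrative only, with hypothetical names, not the actual KVM code:]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only -- not the actual KVM code. If the written
 * value is passed to in_range(), the separate is_write flag becomes
 * redundant: write_val == NULL can simply encode "this is a read". */

typedef unsigned long long gpa_t;

struct io_dev {
	gpa_t base;
	int   len;
};

/* Returns nonzero if the device claims the transaction. */
static int dev_in_range(const struct io_dev *dev, gpa_t addr, int len,
			const void *write_val)
{
	int is_write = (write_val != NULL);	/* recovered, no flag needed */

	(void)is_write;		/* a write-only device could reject reads here */
	return addr >= dev->base && addr + len <= dev->base + dev->len;
}

/* Small wrapper so the check is easy to exercise with literals. */
static int claims(gpa_t base, int dlen, gpa_t addr, int len,
		  const void *write_val)
{
	struct io_dev dev = { base, dlen };

	return dev_in_range(&dev, addr, len, write_val);
}
```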

> With what you are proposing here, you are adding aliasing
> support to the general infrastructure which I am not (yet) convinced is
> necessary.

Infrastructure is a big name for something that adds a total of 10 lines to kvm.
And it should at least halve the size of your 450-line patch.

> If there isn't a use case for other devices to have
> aliasing, I would think the logic is best contained in iosignalfd. Do
> you have something in mind?

One is enough :)
Seriously, do you see that this saves you all of RCU, linked lists and
counters? You don't need to keep track of iofds, you don't need to
implement your own lookup logic - you just use the kvm device
and that's it.
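[Under this proposal each iosignalfd would carry its own in_range() check; a hypothetical sketch of what that device could look like -- names and layout are illustrative, not the patch's:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: with write_val available, each iosignalfd can be
 * an ordinary bus device that claims only writes matching its address
 * (and, unless wildcarded, its datum). No group/item lists needed. */

typedef unsigned long long gpa_t;
typedef unsigned long long u64;

struct iosignalfd {
	gpa_t addr;
	int   len;
	int   wildcard;		/* nonzero: match any written value */
	u64   match;		/* value to match when !wildcard */
};

static int iosignalfd_in_range(const struct iosignalfd *p, gpa_t addr,
			       int len, const void *write_val)
{
	u64 val = 0;

	if (!write_val)			/* reads are never claimed */
		return 0;
	if (addr != p->addr || len != p->len)
		return 0;
	if (p->wildcard)
		return 1;
	memcpy(&val, write_val, len);	/* assumes len <= sizeof(u64) */
	return val == p->match;
}

/* Helper for exercising the check against a fixed 8-byte register. */
static int fd_claims(int wildcard, u64 match, u64 written)
{
	struct iosignalfd p = { 0x1000, sizeof(u64), wildcard, match };

	return iosignalfd_in_range(&p, 0x1000, sizeof(u64), &written);
}
```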

> Kind Regards,
> -Greg
>
> > [quoted patch trimmed]
>
>

2009-06-22 16:29:33

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> It seems that a lot of complexity and trickiness with iosignalfd is
>>> handling the group/item relationship, which comes about because kvm does
>>> not currently let a device on the bus claim a write transaction based on the
>>> value written. This could be greatly simplified if the value written
>>> was passed to the in_range check for write operation. We could then
>>> simply make each kvm_iosignalfd a device on the bus.
>>>
>>> What does everyone think of the following lightly tested patch?
>>>
>>>
>> Hi Michael,
>> Its interesting, but I am not convinced its necessary. We created the
>> group/item layout because iosignalfds are unique in that they are
>> probably the only IO device that wants to do some kind of address
>> aliasing.
>>
>
> We actually already have aliasing: is_write flag is used for this
> purpose.

Yes, but read/write address aliasing is not the same thing as
multi-match data aliasing. Besides, your proposal also breaks some of
the natural relationship models (e.g. all the aliased iosignal_items
always belong to the same underlying device. io_bus entries have an
arbitrary topology).


> Actually, it's possible to remove is_write by passing
> a null pointer in write_val for reads. I like this a bit less as
> the code generated is less compact ... Avi, what do you think?
>
>
>> With what you are proposing here, you are adding aliasing
>> support to the general infrastructure which I am not (yet) convinced is
>> necessary.
>>
>
> Infrastructure is a big name for something that adds a total of 10 lines to kvm.
> And it should at least halve the size of your 450-line patch.
>

Your patch isn't complete until some critical missing features are added
to io_bus, though, so it's not really just 10 lines. For one, it will
need to support much more than 6 devices. It will also need to support
multiple matches. Also you are proposing a general interface change
that makes sense for only one device type. So now every
io-device developer that comes along will scratch their head at what to
do with that field.

None of these are insurmountable hurdles, but my point is that today the
complexity is encapsulated in the proper place IMO. E.g. The one and
only device that cares to do this "weird" thing handles it behind an
interface that makes sense to all parties involved.
>
>> If there isn't a use case for other devices to have
>> aliasing, I would think the logic is best contained in iosignalfd. Do
>> you have something in mind?
>>
>
> One is enough :)
>

I am not convinced yet. ;) It appears to me that we are leaking
iosignalfd-isms into the general code. If there is another device that
wants to do something similar, ok. But I can't think of any.

> Seriously, do you see that this saves you all of RCU, linked lists and
> counters?

Well, also keep in mind we will probably be converting io_bus to RCU
very soon, so we are going the opposite direction ;)

Kind Regards,
-Greg



2009-06-22 17:27:56

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On Mon, Jun 22, 2009 at 12:29:10PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
> >
> >> Michael S. Tsirkin wrote:
> >>
> >>> It seems that a lot of complexity and trickiness with iosignalfd is
> >>> handling the group/item relationship, which comes about because kvm does
> >>> not currently let a device on the bus claim a write transaction based on the
> >>> value written. This could be greatly simplified if the value written
> >>> was passed to the in_range check for write operation. We could then
> >>> simply make each kvm_iosignalfd a device on the bus.
> >>>
> >>> What does everyone think of the following lightly tested patch?
> >>>
> >>>
> >> Hi Michael,
> >> Its interesting, but I am not convinced its necessary. We created the
> >> group/item layout because iosignalfds are unique in that they are
> >> probably the only IO device that wants to do some kind of address
> >> aliasing.
> >>
> >
> > We actually already have aliasing: is_write flag is used for this
> > purpose.
>
> Yes, but read/write address aliasing is not the same thing is
> multi-match data aliasing.

What's the big difference?

> Besides, your proposal also breaks

s/break/removes limitation/ :)

> some of
> the natural relationship models
> (e.g. all the aliased iosignal_items
> always belong to the same underlying device. io_bus entries have an
> arbitrary topology).

iosignal_items are an artifact; they are not seen by the user -
they are just a workaround for an API limitation.

And they are only grouped if the same PIO offset is used for all accesses,
which is not always the case. If a device uses several PIO offsets
(as virtio does), you create separate devices for a single guest device too.

>
> > Actually, it's possible to remove is_write by passing
> > a null pointer in write_val for reads. I like this a bit less as
> > the code generated is less compact ... Avi, what do you think?
> >
> >
> >> With what you are proposing here, you are adding aliasing
> >> support to the general infrastructure which I am not (yet) convinced is
> >> necessary.
> >>
> >
> > Infrastructure is a big name for something that adds a total of 10 lines to kvm.
> > And it should at least halve the size of your 450-line patch.
> >
>
> Your patch isn't complete until some critical missing features are added
> to io_bus, though, so its not really just 10 lines.
> For one, it will
> need to support much more than 6 devices.

Isn't this like a #define change? With the item patch we are still
limited in the number of groups we can create.

What we gain is a simple array/list instead of a tree of
linked lists that makes cheshire cheese out of CPU data cache.

> It will also need to support
> multiple matches.

What, signal many fds on the same address/value pair?
I see this as a bug. Why is this a good thing to support?
Just increases the chance of leaking this fd.

> Also you are proposing an general interface change
> that doesn't make sense to all but one device type. So now every
> io-device developer that comes along will scratch their head at what to
> do with that field.

What do they do with is_write now? Ignore it. It's used in a whole
of 1 place.

>
> None of these are insurmountable hurdles, but my point is that today the
> complexity is encapsulated in the proper place IMO.

It's better to get rid of complexity than encapsulate it.

> E.g. The one and
> only device that cares to do this "weird" thing handles it behind an
> interface that makes sense to all parties involved.
> >
> >> If there isn't a use case for other devices to have
> >> aliasing, I would think the logic is best contained in iosignalfd. Do
> >> you have something in mind?
> >>
> >
> > One is enough :)
> >
>
> I am not convinced yet. ;) It appears to me that we are leaking
> iosignalfd-isms into the general code. If there is another device that
> wants to do something similar, ok. But I can't think of any.

You never know. is_write was used by a whole of 1 user: coalesced_mmio,
then your patch comes along ...


> > Seriously, do you see that this saves you all of RCU, linked lists and
> > counters?
>
> Well, also keep in mind we will probably be converting io_bus to RCU
> very soon, so we are going the opposite direction ;)
>
> Kind Regards,
> -Greg
>

Same direction. Let's put RCU in iobus, we don't need another one on
top of it. That's encapsulating complexity.

2009-06-23 04:04:34

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 12:29:10PM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> Michael S. Tsirkin wrote:
>>>>
>>>>
>>>>> It seems that a lot of complexity and trickiness with iosignalfd is
>>>>> handling the group/item relationship, which comes about because kvm does
>>>>> not currently let a device on the bus claim a write transaction based on the
>>>>> value written. This could be greatly simplified if the value written
>>>>> was passed to the in_range check for write operation. We could then
>>>>> simply make each kvm_iosignalfd a device on the bus.
>>>>>
>>>>> What does everyone think of the following lightly tested patch?
>>>>>
>>>>>
>>>>>
>>>> Hi Michael,
>>>> Its interesting, but I am not convinced its necessary. We created the
>>>> group/item layout because iosignalfds are unique in that they are
>>>> probably the only IO device that wants to do some kind of address
>>>> aliasing.
>>>>
>>>>
>>> We actually already have aliasing: is_write flag is used for this
>>> purpose.
>>>
>> Yes, but read/write address aliasing is not the same thing is
>> multi-match data aliasing.
>>
>
> What's the big difference?
>

Well, for one it's not very clear what the benefit of the read/write
aliasing even is. ;) Apparently coalesced_mmio uses it, but even so I
doubt that is for the purposes of having one device do reads while
another does writes. I could be wrong, though.
>
>> Besides, your proposal also breaks
>>
>
> s/break/removes limitation/ :)
>
>
>> some of
>> the natural relationship models
>> (e.g. all the aliased iosignal_items
>> always belong to the same underlying device. io_bus entries have an
>> arbitrary topology).
>>
>
> iosignal_item is an artifact, they are not seen by user -
> they are just a work around an API limitation.
>

Well, not really. The "limitation" is a natural attribute of real
hardware too. I don't know of a whole bunch of real hardware that
aliases multiple address decoders to the same value, do you? How would
you handle reads?

The way I think of these items is not as unique devices in the io_bus
sense. We have one address decoder (the group) listening on the bus to
a unique address. That group provides a data-decoding service, where
you can program what event gets triggered for which value (or wildcard =
all values).

It is unnatural to turn it around and say that the io_bus now scans
exhaustively on the address/data tuple, and we support aliasing such
that multiple decoders may coexist. There isn't any other device
(hardware or software) that will actually do this aside from iosignalfd,
to my knowledge. So why distort the entire io_bus subsystem's
addressing scheme to match? I don't get the motivation, as it appears
to me to be already handled ideally as it is.
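[The group/item model described above could be sketched like this -- names are hypothetical and the eventfd signaling is elided, so this is only an illustration of the matching, not the actual patch:]

```c
#include <assert.h>

/* Sketch of the group/item model: one device (the group) decodes the
 * address; the written value is then matched against each item, any of
 * which may fire. Names are illustrative, not the patch's. */

typedef unsigned long long u64;

struct iosignalfd_item {
	int wildcard;		/* nonzero: fire on any value */
	u64 match;		/* value to fire on when !wildcard */
};

/* Returns how many items fire for the value written to the group. */
static int group_match(const struct iosignalfd_item *items, int nitems,
		       u64 val)
{
	int i, hits = 0;

	for (i = 0; i < nitems; i++)
		if (items[i].wildcard || items[i].match == val)
			hits++;		/* real code: eventfd_signal() */

	return hits;
}

/* A group with one wildcard item and one data-matched item. */
static int demo_match(u64 val)
{
	struct iosignalfd_item items[2] = {
		{ 1, 0 },		/* wildcard: fires for any value */
		{ 0, 42 },		/* fires only for 42 */
	};

	return group_match(items, 2, val);
}
```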

Perhaps I am contributing to this confusion with my decision to make the
group creation implicit with the first item creation. Would you feel
better about this if we made the distinction with group and item more
explicit? (E.g. IOSIGNALFD_GROUP_ASSIGN and IOSIGNALFD_ITEM_ASSIGN
verbs). I am fine with that change, I suppose.


> And they are only grouped if the same PIO offset is used for all accesses.
> Which is not always the case. If a device uses several PIO offsets
> (as virtio does), you create separate devices for a single guest device too.
>

That's ok. As I stated above, I view this as a data-match service
against a single address decoder, not as multiple decoders aliased to
the same address/data pair. It's perfectly fine for a logical device to
employ multiple decoders, but it's unnatural for there to be multiple
decoders on the same address. It's also unnatural to use the data as
part of the decode criteria for a general IO cycle.

>
>>> Actually, it's possible to remove is_write by passing
>>> a null pointer in write_val for reads. I like this a bit less as
>>> the code generated is less compact ... Avi, what do you think?
>>>
>>>
>>>
>>>> With what you are proposing here, you are adding aliasing
>>>> support to the general infrastructure which I am not (yet) convinced is
>>>> necessary.
>>>>
>>>>
>>> Infrastructure is a big name for something that adds a total of 10 lines to kvm.
>>> And it should at least halve the size of your 450-line patch.
>>>
>>>
>> Your patch isn't complete until some critical missing features are added
>> to io_bus, though, so it's not really just 10 lines.
>> For one, it will
>> need to support much more than 6 devices.
>>
>
> Isn't this like a #define change? With the item patch we are still
> limited in the number of groups we can create.
>

Well, no, because we want to support ~512 of these things, so the array
is not going to scale. Right now iosignalfd is implemented as a
tree+list, which is a little better for scalability, but even that needs
to change soon. We probably ultimately want to do either an rbtree or
a radix tree for all of these things. But that is an optimization patch
for a later day ;)

> What we gain is a simple array/list instead of a tree of
> linked lists that makes cheshire cheese out of CPU data cache.
>
>

Yeah, that is a downside, I admit. But the current io_bus array will
probably not scale going forward either, so it needs to be addressed on
that side as well. We can always implement a SW cache, if need be, to
account for this.

>> It will also need to support
>> multiple matches.
>>
>
> What, signal many fds on the same address/value pair?
> I see this as a bug. Why is this a good thing to support?
> Just increases the chance of leaking this fd.
>

I believe Avi asked for this feature specifically, so I will defer to him.


>
>> Also you are proposing an general interface change
>> that doesn't make sense to all but one device type. So now every
>> io-device developer that comes along will scratch their head at what to
>> do with that field.
>>
>
> What do they do with is_write now? Ignore it. It's used in a whole
> of 1 place.
>
>
>> None of these are insurmountable hurdles, but my point is that today the
>> complexity is encapsulated in the proper place IMO.
>>
>
> It's better to get rid of complexity than encapsulate it.
>

All you are doing is moving the complexity to a place where I don't see
it serving a benefit to any code other than the one you just took it
away from. Like it or not, io_bus will have to address scale and
data-matching to do what you want. And the address/data match doesn't
make sense to anyone else afaict. In addition, data-matching at that
level is even harder because we can make little simplifications with the
way we do things now (like enforcing that all addr/lens conform to the
group addr/len). If you go to a general address/data tuple decoder, you
will presumably need to have a much more flexible decoder design
(overlapping regions, etc).
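By contrast, a rough sketch of the flat model being proposed (again with invented names, not Michael's actual patch): each iosignalfd becomes its own bus entry carrying its own address range and optional data match, so the bus scan must tolerate several entries aliased to one address, and cannot stop at the first hit if multiple matches are to be supported:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flat model: every iosignalfd is its own bus device,
 * and the bus consults the written value during the scan. */
struct bus_dev {
	uint64_t addr;
	int      length;
	int      wildcard;
	uint64_t match;
	int      fired;      /* stands in for signaling an eventfd */
};

/* Scan the whole bus; unlike the group model, several decoders may
 * be aliased to one address, so we keep scanning past the first hit. */
static int bus_write(struct bus_dev *devs, size_t n,
		     uint64_t addr, uint64_t data)
{
	size_t i;
	int hits = 0;

	for (i = 0; i < n; i++) {
		struct bus_dev *d = &devs[i];

		if (addr < d->addr || addr >= d->addr + d->length)
			continue;
		if (d->wildcard || d->match == data) {
			d->fired++;
			hits++;
		}
	}
	return hits;
}
```

This is where the two designs diverge: the group model pays its complexity inside iosignalfd, while the flat model pushes the aliasing and data-match rules into the generic bus scan.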

Again, not insurmountable...but I am not seeing the advantage to make
this change worthwhile either.


>
>> E.g. The one and
>> only device that cares to do this "weird" thing handles it behind an
>> interface that makes sense to all parties involved.
>>
>>>
>>>
>>>> If there isn't a use case for other devices to have
>>>> aliasing, I would think the logic is best contained in iosignalfd. Do
>>>> you have something in mind?
>>>>
>>>>
>>> One is enough :)
>>>
>>>
>> I am not convinced yet. ;) It appears to me that we are leaking
>> iosignalfd-isms into the general code. If there is another device that
>> wants to do something similar, ok. But I can't think of any.
>>
>
> You never know. is_write was used by a whole of 1 user: coalesced_mmio,
> then your patch comes along ...
>

Can it wait until that happens, then? :) I'm already at v8 and a v9 is
brewing. It's not like we can't change later if something better comes
along.

>
>
>>> Seriously, do you see that this saves you all of RCU, linked lists and
>>> counters?
>>>
>> Well, also keep in mind we will probably be converting io_bus to RCU
>> very soon, so we are going the opposite direction ;)
>>
>> Kind Regards,
>> -Greg
>>
>>
>
> Same direction. Let's put RCU in iobus, we don't need another one on
> top of it. That's encapsulating complexity.
>

Is this really all that complicated? The iosignalfd code is fairly
trivial even with the group/item nesting IMO. I also think it makes
sense to live where it is.

Thanks Michael,
-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-23 08:57:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
> +static int
> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
> + int is_write)
> +{
> + struct _iosignalfd_group *p = to_group(this);
> +
> + return ((addr >= p->addr && (addr < p->addr + p->length)));
> +}

I think I see a problem here. For virtio, we do not necessarily want all
virtqueues for a device to live in kernel: there might be control
virtqueues that we want to leave in userspace. Since this claims all
writes to a specific address, the signal never makes it to userspace.

2009-06-23 09:51:36

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On 06/22/2009 07:08 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> It seems that a lot of complexity and trickiness with iosignalfd is
>>> handling the group/item relationship, which comes about because kvm does
>>> not currently let a device on the bus claim a write transaction based on the
>>> value written. This could be greatly simplified if the value written
>>> was passed to the in_range check for write operation. We could then
>>> simply make each kvm_iosignalfd a device on the bus.
>>>
>>> What does everyone think of the following lightly tested patch?
>>>
>>>
>> Hi Michael,
>> It's interesting, but I am not convinced it's necessary. We created the
>> group/item layout because iosignalfds are unique in that they are
>> probably the only IO device that wants to do some kind of address
>> aliasing.
>>
>
> We actually already have aliasing: is_write flag is used for this
> purpose. Actually, it's possible to remove is_write by passing
> a null pointer in write_val for reads. I like this a bit less as
> the code generated is less compact ... Avi, what do you think?
>

Greg, won't Michael's patch eliminate a big chunk from your iosignalfd
patches? Seems like a win to me.

> One is enough :)
> Seriously, do you see that this saves you all of RCU, linked lists and
> counters? You don't need to keep track of iofds, you don't need to
> implement your own lookup logic - you just use the kvm device
> and that's it.
>
>

Yup.

--
error compiling committee.c: too many arguments to function

2009-06-23 09:53:33

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On 06/22/2009 07:29 PM, Gregory Haskins wrote:
>> We actually already have aliasing: is_write flag is used for this
>> purpose.
>>
>
> Yes, but read/write address aliasing is not the same thing as
> multi-match data aliasing. Besides, your proposal also breaks some of
> the natural relationship models (e.g. all the aliased iosignal_items
> always belong to the same underlying device. io_bus entries have an
> arbitrary topology).
>

It's all one big hack; we want to get the correct function called with
as little fuss as possible.

--
error compiling committee.c: too many arguments to function

2009-06-23 09:57:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On 06/23/2009 11:56 AM, Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>
>> +static int
>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>> + int is_write)
>> +{
>> + struct _iosignalfd_group *p = to_group(this);
>> +
> >> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>> +}
>>
>
> I think I see a problem here. For virtio, we do not necessarily want all
> virtqueues for a device to live in kernel: there might be control
> virtqueues that we want to leave in userspace. Since this claims all
> writes to a specific address, the signal never makes it to userspace.
>

Userspace could create an eventfd for this control queue and wait for it
to fire.

--
error compiling committee.c: too many arguments to function

2009-06-23 10:48:46

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On Tue, Jun 23, 2009 at 12:57:53PM +0300, Avi Kivity wrote:
> On 06/23/2009 11:56 AM, Michael S. Tsirkin wrote:
>> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>>
>>> +static int
>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>>> + int is_write)
>>> +{
>>> + struct _iosignalfd_group *p = to_group(this);
>>> +
>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>>> +}
>>>
>>
>> I think I see a problem here. For virtio, we do not necessarily want all
>> virtqueues for a device to live in kernel: there might be control
>> virtqueues that we want to leave in userspace. Since this claims all
>> writes to a specific address, the signal never makes it to userspace.
>>
>
> Userspace could create an eventfd for this control queue and wait for it
> to fire.

What if guest writes an unexpected value there?
The value is simply lost ... that's not very elegant.

--
MST

2009-06-23 11:23:51

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On 06/23/2009 01:48 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 23, 2009 at 12:57:53PM +0300, Avi Kivity wrote:
>
>> On 06/23/2009 11:56 AM, Michael S. Tsirkin wrote:
>>
>>> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> +static int
>>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>>>> + int is_write)
>>>> +{
>>>> + struct _iosignalfd_group *p = to_group(this);
>>>> +
>>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>>>> +}
>>>>
>>>>
>>> I think I see a problem here. For virtio, we do not necessarily want all
>>> virtqueues for a device to live in kernel: there might be control
>>> virtqueues that we want to leave in userspace. Since this claims all
>>> writes to a specific address, the signal never makes it to userspace.
>>>
>>>
>> Userspace could create an eventfd for this control queue and wait for it
>> to fire.
>>
>
> What if guest writes an unexpected value there?
> The value is simply lost ... that's not very elegant.
>

True, it's better to have a lossless interface.

--
error compiling committee.c: too many arguments to function

2009-06-23 11:33:29

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>
>> +static int
>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>> + int is_write)
>> +{
>> + struct _iosignalfd_group *p = to_group(this);
>> +
>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>> +}
>>
>
> I think I see a problem here. For virtio, we do not necessarily want all
> virtqueues for a device to live in kernel: there might be control
> virtqueues that we want to leave in userspace. Since this claims all
> writes to a specific address, the signal never makes it to userspace.
>
>
You can use a wildcard. Would that work?



2009-06-23 11:37:13

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

On Tue, Jun 23, 2009 at 07:33:07AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
> >
> >> +static int
> >> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
> >> + int is_write)
> >> +{
> >> + struct _iosignalfd_group *p = to_group(this);
> >> +
> >> + return ((addr >= p->addr && (addr < p->addr + p->length)));
> >> +}
> >>
> >
> > I think I see a problem here. For virtio, we do not necessarily want all
> > virtqueues for a device to live in kernel: there might be control
> > virtqueues that we want to leave in userspace. Since this claims all
> > writes to a specific address, the signal never makes it to userspace.
> >
> >
> You can use a wildcard. Would that work?
>

Nope, you need to know the value written.

2009-06-23 11:40:50

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> On Tue, Jun 23, 2009 at 07:33:07AM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> +static int
>>>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>>>> + int is_write)
>>>> +{
>>>> + struct _iosignalfd_group *p = to_group(this);
>>>> +
>>>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>>>> +}
>>>>
>>>>
>>> I think I see a problem here. For virtio, we do not necessarily want all
>>> virtqueues for a device to live in kernel: there might be control
>>> virtqueues that we want to leave in userspace. Since this claims all
>>> writes to a specific address, the signal never makes it to userspace.
>>>
>>>
>>>
>> You can use a wildcard. Would that work?
>>
>>
>
> Nope, you need to know the value written.
>
>
Note: You can also terminate an eventfd in userspace...it doesn't have
to terminate in-kernel. Not sure if that helps.



2009-06-23 11:41:29

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

Avi Kivity wrote:
> On 06/22/2009 07:08 PM, Michael S. Tsirkin wrote:
>> On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
>>
>>> Michael S. Tsirkin wrote:
>>>
>>>> It seems that a lot of complexity and trickiness with iosignalfd is
>>>> handling the group/item relationship, which comes about because kvm
>>>> does
>>>> not currently let a device on the bus claim a write transaction
>>>> based on the
>>>> value written. This could be greatly simplified if the value written
>>>> was passed to the in_range check for write operation. We could then
>>>> simply make each kvm_iosignalfd a device on the bus.
>>>>
>>>> What does everyone think of the following lightly tested patch?
>>>>
>>>>
>>> Hi Michael,
>>> It's interesting, but I am not convinced it's necessary. We
>>> created the
>>> group/item layout because iosignalfds are unique in that they are
>>> probably the only IO device that wants to do some kind of address
>>> aliasing.
>>>
>>
>> We actually already have aliasing: is_write flag is used for this
>> purpose. Actually, it's possible to remove is_write by passing
>> a null pointer in write_val for reads. I like this a bit less as
>> the code generated is less compact ... Avi, what do you think?
>>
>
> Greg, won't Michael's patch eliminate a big chunk from your iosignalfd
> patches? Seems like a win to me.

Well, it really just moves that hunk from eventfd.c to kvm_main.c, where
I don't think anyone other than iosignalfd will use it. But if that is
what everyone wants, I guess I have no choice.

>
>> One is enough :)
>> Seriously, do you see that this saves you all of RCU, linked lists and
>> counters? You don't need to keep track of iofds, you don't need to
>> implement your own lookup logic - you just use the kvm device
>> and that's it.
>>
>>
>
> Yup.
>




2009-06-23 11:45:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On Tue, Jun 23, 2009 at 12:04:06AM -0400, Gregory Haskins wrote:
> >> It will also need to support
> >> multiple matches.
> >>
> >
> > What, signal many fds on the same address/value pair?
> > I see this as a bug. Why is this a good thing to support?
> > Just increases the chance of leaking this fd.
> >
>
> I believe Avi asked for this feature specifically, so I will defer to him.

Hmm. That's hard to implement in my model. Avi, can we give up
this feature? I don't think anyone needs this specifically ...

--
MST

2009-06-23 11:46:55

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On Tue, Jun 23, 2009 at 07:41:12AM -0400, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 06/22/2009 07:08 PM, Michael S. Tsirkin wrote:
> >> On Mon, Jun 22, 2009 at 11:45:00AM -0400, Gregory Haskins wrote:
> >>
> >>> Michael S. Tsirkin wrote:
> >>>
> >>>> It seems that a lot of complexity and trickiness with iosignalfd is
> >>>> handling the group/item relationship, which comes about because kvm
> >>>> does
> >>>> not currently let a device on the bus claim a write transaction
> >>>> based on the
> >>>> value written. This could be greatly simplified if the value written
> >>>> was passed to the in_range check for write operation. We could then
> >>>> simply make each kvm_iosignalfd a device on the bus.
> >>>>
> >>>> What does everyone think of the following lightly tested patch?
> >>>>
> >>>>
> >>> Hi Michael,
> >>> It's interesting, but I am not convinced it's necessary. We
> >>> created the
> >>> group/item layout because iosignalfds are unique in that they are
> >>> probably the only IO device that wants to do some kind of address
> >>> aliasing.
> >>>
> >>
> >> We actually already have aliasing: is_write flag is used for this
> >> purpose. Actually, it's possible to remove is_write by passing
> >> a null pointer in write_val for reads. I like this a bit less as
> >> the code generated is less compact ... Avi, what do you think?
> >>
> >
> > Greg, won't Michael's patch eliminate a big chunk from your iosignalfd
> > patches? Seems like a win to me.
>
> Well, it really just moves that hunk from eventfd.c to kvm_main.c, where
> I don't think anyone other than iosignalfd will use it. But if that is
> what everyone wants, I guess I have no choice.

Wait a bit before you start rebasing, though, please.
I just had a brainwave and am rewriting this patch.

> >
> >> One is enough :)
> >> Seriously, do you see that this saves you all of RCU, linked lists and
> >> counters? You don't need to keep track of iofds, you don't need to
> >> implement your own lookup logic - you just use the kvm device
> >> and that's it.
> >>
> >>
> >
> > Yup.
> >
>
>

2009-06-23 11:52:20

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On 06/23/2009 07:04 AM, Gregory Haskins wrote:
> Well, for one, it's not very clear what the benefit of the read/write
> aliasing even is. ;) Apparently coalesced_mmio uses it, but even so I
> doubt that is for the purposes of having one device do reads while
> another does writes. I could be wrong, though.
>

Coalesced mmio cannot handle reads.

--
error compiling committee.c: too many arguments to function

2009-06-23 11:52:51

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

Michael S. Tsirkin wrote:
> On Tue, Jun 23, 2009 at 12:04:06AM -0400, Gregory Haskins wrote:
>
>>>> It will also need to support
>>>> multiple matches.
>>>>
>>>>
>>> What, signal many fds on the same address/value pair?
>>> I see this as a bug. Why is this a good thing to support?
>>> Just increases the chance of leaking this fd.
>>>
>>>
>> I believe Avi asked for this feature specifically, so I will defer to him.
>>
>
> Hmm. That's hard to implement in my model. Avi, can we give up
> this feature? I don't think anyone needs this specifically ...
>
>

If we can drop this it will simplify the mods I will need to do to
io_bus to support Michael's new interface.

Out of curiosity, what is it you are trying to build Michael?

-Greg



2009-06-23 11:55:55

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On 06/23/2009 02:44 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 23, 2009 at 12:04:06AM -0400, Gregory Haskins wrote:
>
>>>> It will also need to support
>>>> multiple matches.
>>>>
>>>>
>>> What, signal many fds on the same address/value pair?
>>> I see this as a bug. Why is this a good thing to support?
>>> Just increases the chance of leaking this fd.
>>>
>>>
>> I believe Avi asked for this feature specifically, so I will defer to him.
>>
>
> Hmm. That's hard to implement in my model. Avi, can we give up
> this feature? I don't think anyone needs this specifically ...
>

I think we can make do with passing that single eventfd to multiple
consumers. It means their event count reads may return zero, but I
guess we can live with that.

I do want to retain flexibility in how we route events.

--
error compiling committee.c: too many arguments to function
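The caveat Avi mentions ("event count reads may return zero") follows directly from eventfd(2) semantics: a successful read transfers the counter and resets it to zero, so when one eventfd is shared between consumers, whoever reads first drains it. A small self-contained demonstration (real Linux eventfd API; the nonblocking flag just makes the empty read visible as EAGAIN instead of a block):

```c
#include <sys/eventfd.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <assert.h>

/* Two consumers sharing one eventfd: whichever reads first drains
 * the counter; the other sees an "empty" read (EAGAIN here because
 * the fd is nonblocking). */
int drained_reads(void)
{
	int efd = eventfd(0, EFD_NONBLOCK);
	uint64_t val = 1, out = 0;
	ssize_t n;

	if (efd < 0)
		return -1;

	if (write(efd, &val, sizeof(val)) != sizeof(val))
		return -1;                     /* the "guest" signals once */

	n = read(efd, &out, sizeof(out));      /* first consumer gets the count */
	assert(n == sizeof(out) && out == 1);

	n = read(efd, &out, sizeof(out));      /* second consumer finds it empty */
	assert(n < 0 && errno == EAGAIN);

	close(efd);
	return 0;
}
```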

2009-06-23 12:02:20

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

Avi Kivity wrote:
> On 06/23/2009 02:44 PM, Michael S. Tsirkin wrote:
>> On Tue, Jun 23, 2009 at 12:04:06AM -0400, Gregory Haskins wrote:
>>
>>>>> It will also need to support
>>>>> multiple matches.
>>>>>
>>>>>
>>>> What, signal many fds on the same address/value pair?
>>>> I see this as a bug. Why is this a good thing to support?
>>>> Just increases the chance of leaking this fd.
>>>>
>>>>
>>> I believe Avi asked for this feature specifically, so I will defer
>>> to him.
>>>
>>
>> Hmm. That's hard to implement in my model. Avi, can we give up
>> this feature? I don't think anyone needs this specifically ...
>>
>
> I think we can make do with passing that single eventfd to multiple
> consumers. It means their event count reads may return zero, but I
> guess we can live with that.
>
> I do want to retain flexibility in how we route events.
>

Ok, so for now I will just crank up the io_bus array, and we can address
scale another day. Can I just drop patch 2/3 and let the io_bus govern
the limit?

-Greg



2009-06-23 12:18:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH RFC] pass write value to in_range pointers

On 06/23/2009 03:01 PM, Gregory Haskins wrote:
> Ok, so for now I will just crank up the io_bus array, and we can address
> scale another day. Can I just drop patch 2/3 and let the io_bus govern
> the limit?
>

So long as we have a runtime-discoverable limit, yes.

--
error compiling committee.c: too many arguments to function

2009-06-23 13:22:20

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v8 3/3] KVM: add iosignalfd support

Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 08:30:46PM -0400, Gregory Haskins wrote:
>
>> +static int
>> +iosignalfd_group_in_range(struct kvm_io_device *this, gpa_t addr, int len,
>> + int is_write)
>> +{
>> + struct _iosignalfd_group *p = to_group(this);
>> +
>> + return ((addr >= p->addr && (addr < p->addr + p->length)));
>> +}
>>
>
> I think I see a problem here. For virtio, we do not necessarily want all
> virtqueues for a device to live in kernel: there might be control
> virtqueues that we want to leave in userspace. Since this claims all
> writes to a specific address, the signal never makes it to userspace.
>
>
So based on this, I think you are right about the io_bus changes. If we
accept your proposal, this problem above is solved cleanly. Sorry for
the resistance, but I just wanted to make sure we were doing the right
thing. I am in agreement now.

Kind Regards,
-Greg



2009-06-25 14:22:43

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v8 1/3] KVM: make io_bus interface more robust

Gregory Haskins wrote:
> Today kvm_io_bus_register_dev() returns void and will internally BUG_ON if it
> fails. We want to create dynamic MMIO/PIO entries driven from userspace later
> in the series, so we need to enhance the code to be more robust with the
> following changes:
>
> 1) Add a return value to the registration function
> 2) Fix up all the callsites to check the return code, handle any
> failures, and percolate the error up to the caller.
> 3) Add an unregister function that collapses holes in the array
>
> Signed-off-by: Gregory Haskins <[email protected]>
>

Hi Avi,
While the disposition on the iosignalfd patch w.r.t. current proposals
by Michael is up in the air, this patch has not had any comments in a
while. Please at least consider this one for merging ASAP so we can
base some of the other io_bus cleanups on it.

Thanks,
-Greg

> ---
>
> arch/x86/kvm/i8254.c | 22 ++++++++++++++++++++--
> arch/x86/kvm/i8259.c | 9 ++++++++-
> include/linux/kvm_host.h | 6 ++++--
> virt/kvm/coalesced_mmio.c | 8 ++++++--
> virt/kvm/ioapic.c | 9 +++++++--
> virt/kvm/kvm_main.c | 23 +++++++++++++++++++++--
> 6 files changed, 66 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 331705f..1c41715 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -586,6 +586,7 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
> {
> struct kvm_pit *pit;
> struct kvm_kpit_state *pit_state;
> + int ret;
>
> pit = kzalloc(sizeof(struct kvm_pit), GFP_KERNEL);
> if (!pit)
> @@ -620,14 +621,31 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
> kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
>
> kvm_iodevice_init(&pit->dev, &pit_dev_ops);
> - kvm_io_bus_register_dev(&kvm->pio_bus, &pit->dev);
> + ret = kvm_io_bus_register_dev(&kvm->pio_bus, &pit->dev);
> + if (ret < 0)
> + goto fail;
>
> if (flags & KVM_PIT_SPEAKER_DUMMY) {
> kvm_iodevice_init(&pit->speaker_dev, &speaker_dev_ops);
> - kvm_io_bus_register_dev(&kvm->pio_bus, &pit->speaker_dev);
> + ret = kvm_io_bus_register_dev(&kvm->pio_bus,
> + &pit->speaker_dev);
> + if (ret < 0)
> + goto fail;
> }
>
> return pit;
> +
> +fail:
> + if (flags & KVM_PIT_SPEAKER_DUMMY)
> + kvm_io_bus_unregister_dev(&kvm->pio_bus, &pit->speaker_dev);
> +
> + kvm_io_bus_unregister_dev(&kvm->pio_bus, &pit->dev);
> +
> + if (pit->irq_source_id >= 0)
> + kvm_free_irq_source_id(kvm, pit->irq_source_id);
> +
> + kfree(pit);
> + return NULL;
> }
>
> void kvm_free_pit(struct kvm *kvm)
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index 148c52a..66e37c2 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -532,6 +532,8 @@ static const struct kvm_io_device_ops picdev_ops = {
> struct kvm_pic *kvm_create_pic(struct kvm *kvm)
> {
> struct kvm_pic *s;
> + int ret;
> +
> s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
> if (!s)
> return NULL;
> @@ -548,6 +550,11 @@ struct kvm_pic *kvm_create_pic(struct kvm *kvm)
> * Initialize PIO device
> */
> kvm_iodevice_init(&s->dev, &picdev_ops);
> - kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
> + ret = kvm_io_bus_register_dev(&kvm->pio_bus, &s->dev);
> + if (ret < 0) {
> + kfree(s);
> + return NULL;
> + }
> +
> return s;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e4e78db..eafa2b3 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -61,8 +61,10 @@ void kvm_io_bus_init(struct kvm_io_bus *bus);
> void kvm_io_bus_destroy(struct kvm_io_bus *bus);
> struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
> gpa_t addr, int len, int is_write);
> -void kvm_io_bus_register_dev(struct kvm_io_bus *bus,
> - struct kvm_io_device *dev);
> +int kvm_io_bus_register_dev(struct kvm_io_bus *bus,
> + struct kvm_io_device *dev);
> +void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,
> + struct kvm_io_device *dev);
>
> struct kvm_vcpu {
> struct kvm *kvm;
> diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
> index 397f419..3e89db8 100644
> --- a/virt/kvm/coalesced_mmio.c
> +++ b/virt/kvm/coalesced_mmio.c
> @@ -94,6 +94,7 @@ static const struct kvm_io_device_ops coalesced_mmio_ops = {
> int kvm_coalesced_mmio_init(struct kvm *kvm)
> {
> struct kvm_coalesced_mmio_dev *dev;
> + int ret;
>
> dev = kzalloc(sizeof(struct kvm_coalesced_mmio_dev), GFP_KERNEL);
> if (!dev)
> @@ -102,9 +103,12 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
> kvm_iodevice_init(&dev->dev, &coalesced_mmio_ops);
> dev->kvm = kvm;
> kvm->coalesced_mmio_dev = dev;
> - kvm_io_bus_register_dev(&kvm->mmio_bus, &dev->dev);
>
> - return 0;
> + ret = kvm_io_bus_register_dev(&kvm->mmio_bus, &dev->dev);
> + if (ret < 0)
> + kfree(dev);
> +
> + return ret;
> }
>
> int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index d8b2eca..28adb05 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -335,6 +335,7 @@ static const struct kvm_io_device_ops ioapic_mmio_ops = {
> int kvm_ioapic_init(struct kvm *kvm)
> {
> struct kvm_ioapic *ioapic;
> + int ret;
>
> ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
> if (!ioapic)
> @@ -343,7 +344,11 @@ int kvm_ioapic_init(struct kvm *kvm)
> kvm_ioapic_reset(ioapic);
> kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
> ioapic->kvm = kvm;
> - kvm_io_bus_register_dev(&kvm->mmio_bus, &ioapic->dev);
> - return 0;
> +
> + ret = kvm_io_bus_register_dev(&kvm->mmio_bus, &ioapic->dev);
> + if (ret < 0)
> + kfree(ioapic);
> +
> + return ret;
> }
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9fab08e..6bd71f7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2481,11 +2481,30 @@ struct kvm_io_device *kvm_io_bus_find_dev(struct kvm_io_bus *bus,
> return NULL;
> }
>
> -void kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
> +/* assumes kvm->lock held */
> +int kvm_io_bus_register_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev)
> {
> - BUG_ON(bus->dev_count > (NR_IOBUS_DEVS-1));
> + if (bus->dev_count > (NR_IOBUS_DEVS-1))
> + return -ENOSPC;
>
> bus->devs[bus->dev_count++] = dev;
> +
> + return 0;
> +}
> +
> +/* assumes kvm->lock held */
> +void kvm_io_bus_unregister_dev(struct kvm_io_bus *bus,
> + struct kvm_io_device *dev)
> +{
> + int i;
> +
> + for (i = 0; i < bus->dev_count; i++) {
> +
> + if (bus->devs[i] == dev) {
> + bus->devs[i] = bus->devs[--bus->dev_count];
> + return;
> + }
> + }
> }
>
> static struct notifier_block kvm_cpu_notifier = {
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
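The unregister path above removes a device by back-filling the vacated slot with the last array element rather than shifting the tail down. The same pattern in isolation (a hypothetical standalone sketch, not the kernel code itself):

```c
#include <stddef.h>

/* Same back-fill pattern as kvm_io_bus_unregister_dev(): replace the
 * removed slot with the last element so the array stays dense.
 * Note this does not preserve registration order. */
static void array_unregister(void **devs, int *count, void *dev)
{
	int i;

	for (i = 0; i < *count; i++) {
		if (devs[i] == dev) {
			devs[i] = devs[--(*count)];
			return;
		}
	}
}
```

The trade-off is that registration order is not preserved, which is fine for io_bus since lookup scans the whole array anyway.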




2009-06-25 14:54:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v8 1/3] KVM: make io_bus interface more robust

On Thu, Jun 25, 2009 at 10:22:29AM -0400, Gregory Haskins wrote:
> Gregory Haskins wrote:
> > Today kvm_io_bus_register_dev() returns void and will internally BUG_ON if it
> > fails. We want to create dynamic MMIO/PIO entries driven from userspace later
> > in the series, so we need to enhance the code to be more robust with the
> > following changes:
> >
> > 1) Add a return value to the registration function
> > 2) Fix up all the callsites to check the return code, handle any
> > failures, and percolate the error up to the caller.
> > 3) Add an unregister function that collapses holes in the array

Doesn't unregister need the RCU bus fixes to work?

--
MST

2009-06-25 14:56:53

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v8 1/3] KVM: make io_bus interface more robust

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 10:22:29AM -0400, Gregory Haskins wrote:
>
>> Gregory Haskins wrote:
>>
>>> Today kvm_io_bus_register_dev() returns void and will internally BUG_ON if it
>>> fails. We want to create dynamic MMIO/PIO entries driven from userspace later
>>> in the series, so we need to enhance the code to be more robust with the
>>> following changes:
>>>
>>> 1) Add a return value to the registration function
>>> 2) Fix up all the callsites to check the return code, handle any
>>> failures, and percolate the error up to the caller.
>>> 3) Add an unregister function that collapses holes in the array
>>>
>
> Doesn't unregister need the RCU bus fixes to work?
>
>
Not without users ;)

But, hmm. Perhaps I should re-split out the return value stuff so it
can go independent of the unregister. Gah, this io_bus is a nightmare.

-Greg

