This is release v2. Changes since v1:
*) Incorporated review feedback from Stephen Hemminger on vbus-enet driver
*) Added support for connecting to vbus devices from userspace
*) Added support for a virtio-vbus transport to allow virtio drivers to
work with vbus (needs testing and backend models).
(Avi, I know I still owe you a reply re the PCI debate)
Todo:
*) Develop some kind of hypercall registration mechanism for KVM so that
we can use that as an integration point instead of directly hooking
kvm hypercalls
*) Beef up the userspace event channel ABI to support different event types
*) Add memory-registration support
*) Integrate with qemu PCI device model to render vbus objects as PCI
*) Develop some virtio backend devices.
*) Support ethtool_ops for venet.
---------------------------------------
RFC: Virtual-bus
applies to v2.6.29 (will port to git HEAD soon)
FIRST OFF: Let me state that this is _not_ a KVM- or networking-specific
technology. Virtual-Bus is a mechanism for defining and deploying
software "devices" directly in a Linux kernel. These devices are designed
to be directly accessed from a variety of environments in an arbitrarily
nested fashion. The goal is to provide the potential for maximum IO
performance by providing the shortest and most efficient path to the
"bare metal" kernel, and thus to the actual IO resources. For instance,
an application can be written to run the same on bare metal as it does in
guest userspace nested 10 levels deep, all the while retaining direct
access to the resource, thus reducing latency and boosting throughput. A
good way to think of this is as a software-based SR-IOV that supports
nesting of the pass-through.
Due to its design as an in-kernel resource, it also provides very strong
notions of protection and isolation, so as not to introduce a security
compromise compared to traditional/alternative models where such
guarantees are provided by, say, userspace or hardware.
The example use-case we have provided supports a "virtual-ethernet"
device being utilized in a KVM guest environment, so comparisons to
virtio-net will be natural. However, please note that this is but one
use-case of many we have planned for the future (such as userspace bypass
and RT guest support).
The goal for right now is to describe what a virtual-bus is and why we
believe it is useful.
We intend to get this core technology merged, even if the networking
components are not accepted as is. It should be noted that, in many ways,
virtio could be considered complementary to the technology. We could, in
fact, have implemented the virtual-ethernet using a virtio-ring, but it
would have required ABI changes that we didn't yet want to propose
without having the concept in general vetted and accepted by the
community.
[Update: this release includes a virtio-vbus transport, so virtio-net and
other such drivers can now run over vbus in addition to the venet system
provided]
To cut to the chase, we recently measured our virtual-ethernet on
v2.6.29 on two 8-core x86_64 boxes with Chelsio T3 10GE NICs connected
back to back via a crossover cable. We measured bare-metal performance,
as well as a kvm guest (running the same kernel) connected to the T3 via
a linux-bridge+tap configuration with a 1500 MTU. The results are as
follows:
Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
Venet: tput = 4050Mb/s, round-trip = 15255pps (65us rtt)
As you can see, all three technologies can achieve (MTU-limited) line
rate, but the virtio-net solution is severely limited on the latency
front (by a factor of 48:1).
Note that the 320pps figure is technically artificially low for
virtio-net, caused by a known design limitation: the use of a timer for
tx-mitigation. However, note that even with the timer removed from the
path, the best we could achieve was 350us-450us of latency, and doing so
caused the tput to drop to 1300Mb/s. So even in that case, I think the
in-kernel results present a compelling argument for the new model
presented.
[Update: Anthony Liguori is working on this userspace implementation
problem currently and has obtained significant performance gains by
utilizing some of the techniques we use in this patch set as well. More
details to come.]
When we jump to a 9000 byte MTU, the situation looks similar:
Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
Venet: tput = 5802Mb/s, round-trip = 15127pps (66us rtt)
Note that venet's throughput was also slightly better in this test,
though neither venet nor virtio-net could achieve line rate. I suspect
some tuning may allow these numbers to improve (TBD).
So with that said, let's jump into the description:
Virtual-Bus: What is it?
------------------------
Virtual-Bus is a kernel based IO resource container technology. It is modeled
on a concept similar to the Linux Device-Model (LDM), where we have buses,
devices, and drivers as the primary actors. However, VBUS has several
distinctions when contrasted with LDM:
1) "Buses" in LDM are relatively static and global to the kernel (e.g.
"PCI", "USB", etc). VBUS buses are arbitrarily created and destroyed
dynamically, and are not globally visible. Instead they are defined as
visible only to a specific subset of the system (the contained context).
2) "Devices" in LDM are typically tangible physical (or sometimes logical)
devices. VBUS devices are purely software abstractions (which may or
may not have one or more physical devices behind them). Devices may
also be arbitrarily created or destroyed by software/administrative action
as opposed to by a hardware discovery mechanism.
3) "Drivers" in LDM sit within the same kernel context as the buses and
devices they interact with. VBUS drivers live in a foreign
context (such as userspace, or a virtual-machine guest).
The idea is that a vbus is created to contain access to some IO services.
Virtual devices are then instantiated and linked to a bus to grant access to
drivers actively present on the bus. Drivers will only have visibility to
devices present on their respective bus, and nothing else.
Virtual devices are defined by modules which register a deviceclass with the
system. A deviceclass simply represents a type of device that _may_ be
instantiated into a device, should an administrator wish to do so. Once
this has happened, the device may be associated with one or more buses where
it will become visible to all clients of those respective buses.
Why do we need this?
----------------------
There are various reasons why such a construct may be useful. One of the
most interesting use-cases is virtualization, such as KVM. Hypervisors
today provide virtualized IO resources to a guest, but this is often at a cost
in both latency and throughput compared to bare metal performance. Utilizing
para-virtual resources instead of emulated devices helps to mitigate this
penalty, but even these techniques to date have not fully realized the
potential of the underlying bare-metal hardware.
Some of the performance differential is unavoidable just given the extra
processing that occurs due to the deeper stack (guest+host). However, some of
this overhead is a direct result of the rather indirect path most hypervisors
use to route IO. For instance, KVM uses PIO faults from the guest to trigger
a guest->host-kernel->host-userspace->host-kernel sequence of events.
Contrast this to a typical userspace application on the host which must only
traverse app->kernel for most IO.
The fact is that the Linux kernel is already great at managing access to
IO resources. Therefore, if you have a hypervisor that is based on the
Linux kernel, is there some way that we can allow the hypervisor to
manage IO directly instead of forcing this convoluted path?
The short answer is: "not yet" ;)
In order to use such a concept, we need some new facilities. For one, we
need to be able to define containers with their corresponding
access-control so that guests do not have unmitigated access to anything
they wish. Second, we also need to define forms of memory access that
are uniform in the face of various clients (e.g. "copy_to_user()" cannot
be assumed to work for, say, a KVM vcpu context). Lastly, we need to
provide access to these resources in a way that makes sense for the
application, such as asynchronous communication paths and minimizing
context switches.
For more details, please visit our wiki at:
http://developer.novell.com/wiki/index.php/Virtual-bus
Regards,
-Greg
---
Gregory Haskins (19):
virtio: add a vbus transport
vbus: add a userspace connector
kvm: Add guest-side support for VBUS
kvm: Add VBUS support to the host
kvm: add dynamic IRQ support
kvm: add a reset capability
x86: allow the irq->vector translation to be determined outside of ioapic
venettap: add scatter-gather support
venet: add scatter-gather support
venet-tap: Adds a "venet" compatible "tap" device to VBUS
net: Add vbus_enet driver
venet: add the ABI definitions for an 802.x packet interface
ioq: add vbus helpers
ioq: Add basic definitions for a shared-memory, lockless queue
vbus: add a "vbus-proxy" bus model for vbus_driver objects
vbus: add bus-registration notifiers
vbus: add connection-client helper infrastructure
vbus: add virtual-bus definitions
shm-signal: shared-memory signals
Documentation/vbus.txt | 386 +++++++++
arch/x86/Kconfig | 16
arch/x86/Makefile | 3
arch/x86/include/asm/irq.h | 6
arch/x86/include/asm/kvm_host.h | 9
arch/x86/include/asm/kvm_para.h | 12
arch/x86/kernel/io_apic.c | 25 +
arch/x86/kvm/Kconfig | 9
arch/x86/kvm/Makefile | 6
arch/x86/kvm/dynirq.c | 329 ++++++++
arch/x86/kvm/guest/Makefile | 2
arch/x86/kvm/guest/dynirq.c | 95 ++
arch/x86/kvm/x86.c | 13
arch/x86/kvm/x86.h | 12
drivers/Makefile | 2
drivers/net/Kconfig | 13
drivers/net/Makefile | 1
drivers/net/vbus-enet.c | 907 +++++++++++++++++++++
drivers/vbus/devices/Kconfig | 17
drivers/vbus/devices/Makefile | 1
drivers/vbus/devices/venet-tap.c | 1609 ++++++++++++++++++++++++++++++++++++++
drivers/vbus/proxy/Makefile | 2
drivers/vbus/proxy/kvm.c | 726 +++++++++++++++++
drivers/virtio/Kconfig | 15
drivers/virtio/Makefile | 1
drivers/virtio/virtio_vbus.c | 496 ++++++++++++
fs/proc/base.c | 96 ++
include/linux/ioq.h | 410 ++++++++++
include/linux/kvm.h | 4
include/linux/kvm_guest.h | 7
include/linux/kvm_host.h | 27 +
include/linux/kvm_para.h | 60 +
include/linux/sched.h | 4
include/linux/shm_signal.h | 188 ++++
include/linux/vbus.h | 166 ++++
include/linux/vbus_client.h | 115 +++
include/linux/vbus_device.h | 424 ++++++++++
include/linux/vbus_driver.h | 80 ++
include/linux/vbus_userspace.h | 48 +
include/linux/venet.h | 82 ++
include/linux/virtio_vbus.h | 163 ++++
kernel/Makefile | 1
kernel/exit.c | 2
kernel/fork.c | 2
kernel/vbus/Kconfig | 55 +
kernel/vbus/Makefile | 11
kernel/vbus/attribute.c | 52 +
kernel/vbus/client.c | 543 +++++++++++++
kernel/vbus/config.c | 275 ++++++
kernel/vbus/core.c | 626 +++++++++++++++
kernel/vbus/devclass.c | 124 +++
kernel/vbus/map.c | 72 ++
kernel/vbus/map.h | 41 +
kernel/vbus/proxy.c | 216 +++++
kernel/vbus/shm-ioq.c | 89 ++
kernel/vbus/userspace-client.c | 485 +++++++++++
kernel/vbus/vbus.h | 117 +++
kernel/vbus/virtio.c | 628 +++++++++++++++
lib/Kconfig | 22 +
lib/Makefile | 2
lib/ioq.c | 298 +++++++
lib/shm_signal.c | 186 ++++
virt/kvm/kvm_main.c | 37 +
virt/kvm/vbus.c | 1307 +++++++++++++++++++++++++++++++
64 files changed, 11777 insertions(+), 1 deletions(-)
create mode 100644 Documentation/vbus.txt
create mode 100644 arch/x86/kvm/dynirq.c
create mode 100644 arch/x86/kvm/guest/Makefile
create mode 100644 arch/x86/kvm/guest/dynirq.c
create mode 100644 drivers/net/vbus-enet.c
create mode 100644 drivers/vbus/devices/Kconfig
create mode 100644 drivers/vbus/devices/Makefile
create mode 100644 drivers/vbus/devices/venet-tap.c
create mode 100644 drivers/vbus/proxy/Makefile
create mode 100644 drivers/vbus/proxy/kvm.c
create mode 100644 drivers/virtio/virtio_vbus.c
create mode 100644 include/linux/ioq.h
create mode 100644 include/linux/kvm_guest.h
create mode 100644 include/linux/shm_signal.h
create mode 100644 include/linux/vbus.h
create mode 100644 include/linux/vbus_client.h
create mode 100644 include/linux/vbus_device.h
create mode 100644 include/linux/vbus_driver.h
create mode 100644 include/linux/vbus_userspace.h
create mode 100644 include/linux/venet.h
create mode 100644 include/linux/virtio_vbus.h
create mode 100644 kernel/vbus/Kconfig
create mode 100644 kernel/vbus/Makefile
create mode 100644 kernel/vbus/attribute.c
create mode 100644 kernel/vbus/client.c
create mode 100644 kernel/vbus/config.c
create mode 100644 kernel/vbus/core.c
create mode 100644 kernel/vbus/devclass.c
create mode 100644 kernel/vbus/map.c
create mode 100644 kernel/vbus/map.h
create mode 100644 kernel/vbus/proxy.c
create mode 100644 kernel/vbus/shm-ioq.c
create mode 100644 kernel/vbus/userspace-client.c
create mode 100644 kernel/vbus/vbus.h
create mode 100644 kernel/vbus/virtio.c
create mode 100644 lib/ioq.c
create mode 100644 lib/shm_signal.c
create mode 100644 virt/kvm/vbus.c
--
This interface provides a bidirectional shared-memory based signaling
mechanism. It can be used by any entities which desire efficient
communication via shared memory. The implementation details of the
signaling are abstracted so that they may transcend a wide variety
of locale boundaries (e.g. userspace/kernel, guest/host, etc).
The shm_signal mechanism supports event masking as well as spurious
event delivery mitigation.
Signed-off-by: Gregory Haskins <[email protected]>
---
include/linux/shm_signal.h | 188 ++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 10 ++
lib/Makefile | 1
lib/shm_signal.c | 186 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 385 insertions(+), 0 deletions(-)
create mode 100644 include/linux/shm_signal.h
create mode 100644 lib/shm_signal.c
diff --git a/include/linux/shm_signal.h b/include/linux/shm_signal.h
new file mode 100644
index 0000000..a65e54e
--- /dev/null
+++ b/include/linux/shm_signal.h
@@ -0,0 +1,188 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SHM_SIGNAL_H
+#define _LINUX_SHM_SIGNAL_H
+
+#include <asm/types.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc). Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+
+#define SHM_SIGNAL_MAGIC 0x58fa39df
+#define SHM_SIGNAL_VER 1
+
+struct shm_signal_irq {
+ __u8 enabled;
+ __u8 pending;
+ __u8 dirty;
+};
+
+enum shm_signal_locality {
+ shm_locality_north,
+ shm_locality_south,
+};
+
+struct shm_signal_desc {
+ __u32 magic;
+ __u32 ver;
+ struct shm_signal_irq irq[2];
+};
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/interrupt.h>
+
+struct shm_signal_notifier {
+ void (*signal)(struct shm_signal_notifier *);
+};
+
+struct shm_signal;
+
+struct shm_signal_ops {
+ int (*inject)(struct shm_signal *s);
+ void (*fault)(struct shm_signal *s, const char *fmt, ...);
+ void (*release)(struct shm_signal *s);
+};
+
+enum {
+ shm_signal_in_wakeup,
+};
+
+struct shm_signal {
+ atomic_t refs;
+ spinlock_t lock;
+ enum shm_signal_locality locale;
+ unsigned long flags;
+ struct shm_signal_ops *ops;
+ struct shm_signal_desc *desc;
+ struct shm_signal_notifier *notifier;
+ struct tasklet_struct deferred_notify;
+};
+
+#define SHM_SIGNAL_FAULT(s, fmt, args...) \
+ ((s)->ops->fault ? (s)->ops->fault((s), fmt, ## args) : panic(fmt, ## args))
+
+ /*
+ * These functions should only be used internally
+ */
+void _shm_signal_release(struct shm_signal *s);
+void _shm_signal_wakeup(struct shm_signal *s);
+
+/**
+ * shm_signal_init() - initialize an SHM_SIGNAL
+ * @s: SHM_SIGNAL context
+ *
+ * Initializes SHM_SIGNAL context before first use
+ *
+ **/
+void shm_signal_init(struct shm_signal *s);
+
+/**
+ * shm_signal_get() - acquire an SHM_SIGNAL context reference
+ * @s: SHM_SIGNAL context
+ *
+ **/
+static inline struct shm_signal *shm_signal_get(struct shm_signal *s)
+{
+ atomic_inc(&s->refs);
+
+ return s;
+}
+
+/**
+ * shm_signal_put() - release an SHM_SIGNAL context reference
+ * @s: SHM_SIGNAL context
+ *
+ **/
+static inline void shm_signal_put(struct shm_signal *s)
+{
+ if (atomic_dec_and_test(&s->refs))
+ _shm_signal_release(s);
+}
+
+/**
+ * shm_signal_enable() - enables local notifications on an SHM_SIGNAL
+ * @s: SHM_SIGNAL context
+ * @flags: Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered notifier (if applicable) to receive wakeups
+ * whenever the remote side performs an shm_signal() operation. A notification
+ * will be dispatched immediately if any pending signals have already been
+ * issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_enable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_disable() - disable local notifications on an SHM_SIGNAL
+ * @s: SHM_SIGNAL context
+ * @flags: Reserved for future use, must be 0
+ *
+ * Disables/masks the registered shm_signal_notifier (if applicable) from
+ * receiving any further notifications. Any subsequent calls to shm_signal()
+ * by the remote side will update the shm as dirty, but will not traverse the
+ * locale boundary and will not invoke the notifier callback. Signals
+ * delivered while masked will be deferred until shm_signal_enable() is
+ * invoked.
+ *
+ * This is synonymous with masking an interrupt
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_disable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_inject() - notify the remote side about shm changes
+ * @s: SHM_SIGNAL context
+ * @flags: Reserved for future use, must be 0
+ *
+ * Marks the shm state as "dirty" and, if enabled, will traverse
+ * a locale boundary to inject a remote notification. The remote
+ * side controls whether the notification should be delivered via
+ * the shm_signal_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the shm_signal_ops->signal() interface and provided by a particular
+ * implementation. However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_inject(struct shm_signal *s, int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_SHM_SIGNAL_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 03c2c24..32d82fe 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -174,4 +174,14 @@ config DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
bool "Disable obsolete cpumask functions" if DEBUG_PER_CPU_MAPS
depends on EXPERIMENTAL && BROKEN
+config SHM_SIGNAL
+ boolean "SHM Signal - Generic shared-memory signaling mechanism"
+ default n
+ help
+	  Provides a shared-memory based signaling mechanism to indicate
+ memory-dirty notifications between two end-points.
+
+ If unsure, say N
+
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 32b0e64..bc36327 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
obj-$(CONFIG_SMP) += percpu_counter.o
obj-$(CONFIG_AUDIT_GENERIC) += audit.o
+obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
obj-$(CONFIG_SWIOTLB) += swiotlb.o
obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/shm_signal.c b/lib/shm_signal.c
new file mode 100644
index 0000000..fa1770c
--- /dev/null
+++ b/lib/shm_signal.c
@@ -0,0 +1,186 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * See include/linux/shm_signal.h for documentation
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+
+int shm_signal_enable(struct shm_signal *s, int flags)
+{
+ struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+ unsigned long iflags;
+
+ spin_lock_irqsave(&s->lock, iflags);
+
+ irq->enabled = 1;
+ wmb();
+
+ if ((irq->dirty || irq->pending)
+ && !test_bit(shm_signal_in_wakeup, &s->flags)) {
+ rmb();
+ tasklet_schedule(&s->deferred_notify);
+ }
+
+ spin_unlock_irqrestore(&s->lock, iflags);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_enable);
+
+int shm_signal_disable(struct shm_signal *s, int flags)
+{
+ struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+
+ irq->enabled = 0;
+ wmb();
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_disable);
+
+/*
+ * signaling protocol:
+ *
+ * each side of the shm_signal has an "irq" structure with the following
+ * fields:
+ *
+ * - enabled: controlled by shm_signal_enable/disable() to mask/unmask
+ * the notification locally
+ * - dirty: indicates if the shared-memory is dirty or clean. This
+ * is updated regardless of the enabled/pending state so that
+ * the state is always accurately tracked.
+ * - pending: indicates if a signal is pending to the remote locale.
+ * This allows us to determine if a remote-notification is
+ * already in flight to optimize spurious notifications away.
+ */
+int shm_signal_inject(struct shm_signal *s, int flags)
+{
+ /* Load the irq structure from the other locale */
+ struct shm_signal_irq *irq = &s->desc->irq[!s->locale];
+
+ /*
+ * We always mark the remote side as dirty regardless of whether
+ * they need to be notified.
+ */
+ irq->dirty = 1;
+ wmb(); /* dirty must be visible before we test the pending state */
+
+ if (irq->enabled && !irq->pending) {
+ rmb();
+
+ /*
+ * If the remote side has enabled notifications, and we do
+ * not see a notification pending, we must inject a new one.
+ */
+ irq->pending = 1;
+ wmb(); /* make it visible before we do the injection */
+
+ s->ops->inject(s);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_inject);
+
+void _shm_signal_wakeup(struct shm_signal *s)
+{
+ struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+ int dirty;
+ unsigned long flags;
+
+ spin_lock_irqsave(&s->lock, flags);
+
+ __set_bit(shm_signal_in_wakeup, &s->flags);
+
+ /*
+ * The outer loop protects against race conditions between
+ * irq->dirty and irq->pending updates
+ */
+ while (irq->enabled && (irq->dirty || irq->pending)) {
+
+ /*
+ * Run until we completely exhaust irq->dirty (it may
+ * be re-dirtied by the remote side while we are in the
+ * callback). We let "pending" remain untouched until we have
+ * processed them all so that the remote side knows we do not
+ * need a new notification (yet).
+ */
+ do {
+ irq->dirty = 0;
+ /* the unlock is an implicit wmb() for dirty = 0 */
+ spin_unlock_irqrestore(&s->lock, flags);
+
+ if (s->notifier)
+ s->notifier->signal(s->notifier);
+
+ spin_lock_irqsave(&s->lock, flags);
+ dirty = irq->dirty;
+ rmb();
+
+ } while (irq->enabled && dirty);
+
+ barrier();
+
+ /*
+ * We can finally acknowledge the notification by clearing
+ * "pending" after all of the dirty memory has been processed
+ * Races against this clearing are handled by the outer loop.
+ * Subsequent iterations of this loop will execute with
+ * pending=0 potentially leading to future spurious
+ * notifications, but this is an acceptable tradeoff as this
+ * will be rare and harmless.
+ */
+ irq->pending = 0;
+ wmb();
+
+ }
+
+ __clear_bit(shm_signal_in_wakeup, &s->flags);
+ spin_unlock_irqrestore(&s->lock, flags);
+
+}
+EXPORT_SYMBOL_GPL(_shm_signal_wakeup);
+
+void _shm_signal_release(struct shm_signal *s)
+{
+ s->ops->release(s);
+}
+EXPORT_SYMBOL_GPL(_shm_signal_release);
+
+static void
+deferred_notify(unsigned long data)
+{
+ struct shm_signal *s = (struct shm_signal *)data;
+
+ _shm_signal_wakeup(s);
+}
+
+void shm_signal_init(struct shm_signal *s)
+{
+ memset(s, 0, sizeof(*s));
+ atomic_set(&s->refs, 1);
+ spin_lock_init(&s->lock);
+ tasklet_init(&s->deferred_notify,
+ deferred_notify,
+ (unsigned long)s);
+}
+EXPORT_SYMBOL_GPL(shm_signal_init);
This will generally be used by hypervisors to publish any host-side
virtual devices up to a guest. The guest will have the opportunity
to consume any devices present on the vbus-proxy as if they were
platform devices, similar to existing buses like PCI.
Signed-off-by: Gregory Haskins <[email protected]>
---
include/linux/vbus_driver.h | 73 +++++++++++++++++++++
kernel/vbus/Kconfig | 9 +++
kernel/vbus/Makefile | 4 +
kernel/vbus/proxy.c | 152 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 238 insertions(+), 0 deletions(-)
create mode 100644 include/linux/vbus_driver.h
create mode 100644 kernel/vbus/proxy.c
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
new file mode 100644
index 0000000..c53e13f
--- /dev/null
+++ b/include/linux/vbus_driver.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Mediates access to a host VBUS from a guest kernel by providing a
+ * global view of all VBUS devices
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DRIVER_H
+#define _LINUX_VBUS_DRIVER_H
+
+#include <linux/device.h>
+#include <linux/shm_signal.h>
+
+struct vbus_device_proxy;
+struct vbus_driver;
+
+struct vbus_device_proxy_ops {
+ int (*open)(struct vbus_device_proxy *dev, int version, int flags);
+ int (*close)(struct vbus_device_proxy *dev, int flags);
+ int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
+ void *ptr, size_t len,
+ struct shm_signal_desc *sigdesc, struct shm_signal **signal,
+ int flags);
+ int (*call)(struct vbus_device_proxy *dev, u32 func,
+ void *data, size_t len, int flags);
+ void (*release)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_device_proxy {
+ char *type;
+ u64 id;
+ void *priv; /* Used by drivers */
+ struct vbus_device_proxy_ops *ops;
+ struct device dev;
+};
+
+int vbus_device_proxy_register(struct vbus_device_proxy *dev);
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev);
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id);
+
+struct vbus_driver_ops {
+ int (*probe)(struct vbus_device_proxy *dev);
+ int (*remove)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_driver {
+ char *type;
+ struct module *owner;
+ struct vbus_driver_ops *ops;
+ struct device_driver drv;
+};
+
+int vbus_driver_register(struct vbus_driver *drv);
+void vbus_driver_unregister(struct vbus_driver *drv);
+
+#endif /* _LINUX_VBUS_DRIVER_H */
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index f2b92f5..3aaa085 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -12,3 +12,12 @@ config VBUS
various tasks and devices which reside on the bus.
If unsure, say N
+
+config VBUS_DRIVERS
+ tristate "VBUS Driver support"
+ default n
+ help
+ Adds support for a virtual bus model for proxying drivers.
+
+ If unsure, say N
+
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index 4d440e5..d028ece 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1 +1,5 @@
obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
+
+vbus-proxy-objs += proxy.o
+obj-$(CONFIG_VBUS_DRIVERS) += vbus-proxy.o
+
diff --git a/kernel/vbus/proxy.c b/kernel/vbus/proxy.c
new file mode 100644
index 0000000..ea48f00
--- /dev/null
+++ b/kernel/vbus/proxy.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#define VBUS_PROXY_NAME "vbus-proxy"
+
+static struct vbus_device_proxy *to_dev(struct device *_dev)
+{
+ return _dev ? container_of(_dev, struct vbus_device_proxy, dev) : NULL;
+}
+
+static struct vbus_driver *to_drv(struct device_driver *_drv)
+{
+ return container_of(_drv, struct vbus_driver, drv);
+}
+
+/*
+ * This function is invoked whenever a new driver and/or device is added
+ * to check if there is a match
+ */
+static int vbus_dev_proxy_match(struct device *_dev, struct device_driver *_drv)
+{
+ struct vbus_device_proxy *dev = to_dev(_dev);
+ struct vbus_driver *drv = to_drv(_drv);
+
+ return !strcmp(dev->type, drv->type);
+}
+
+/*
+ * This function is invoked after the bus infrastructure has already made a
+ * match. The device will contain a reference to the paired driver which
+ * we will extract.
+ */
+static int vbus_dev_proxy_probe(struct device *_dev)
+{
+ int ret = 0;
+ struct vbus_device_proxy *dev = to_dev(_dev);
+ struct vbus_driver *drv = to_drv(_dev->driver);
+
+ if (drv->ops->probe)
+ ret = drv->ops->probe(dev);
+
+ return ret;
+}
+
+static struct bus_type vbus_proxy = {
+ .name = VBUS_PROXY_NAME,
+ .match = vbus_dev_proxy_match,
+};
+
+static struct device vbus_proxy_rootdev = {
+ .parent = NULL,
+ .bus_id = VBUS_PROXY_NAME,
+};
+
+static int __init vbus_init(void)
+{
+ int ret;
+
+ ret = bus_register(&vbus_proxy);
+ BUG_ON(ret < 0);
+
+ ret = device_register(&vbus_proxy_rootdev);
+ BUG_ON(ret < 0);
+
+ return 0;
+}
+
+postcore_initcall(vbus_init);
+
+static void device_release(struct device *dev)
+{
+ struct vbus_device_proxy *_dev;
+
+ _dev = container_of(dev, struct vbus_device_proxy, dev);
+
+ _dev->ops->release(_dev);
+}
+
+int vbus_device_proxy_register(struct vbus_device_proxy *new)
+{
+ new->dev.parent = &vbus_proxy_rootdev;
+ new->dev.bus = &vbus_proxy;
+ new->dev.release = &device_release;
+
+ return device_register(&new->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_register);
+
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev)
+{
+ device_unregister(&dev->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_unregister);
+
+static int match_device_id(struct device *_dev, void *data)
+{
+ struct vbus_device_proxy *dev = to_dev(_dev);
+ u64 id = *(u64 *)data;
+
+ return dev->id == id;
+}
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id)
+{
+ struct device *dev;
+
+ dev = bus_find_device(&vbus_proxy, NULL, &id, &match_device_id);
+
+ return to_dev(dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_find);
+
+int vbus_driver_register(struct vbus_driver *new)
+{
+ new->drv.bus = &vbus_proxy;
+ new->drv.name = new->type;
+ new->drv.owner = new->owner;
+ new->drv.probe = vbus_dev_proxy_probe;
+
+ return driver_register(&new->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_register);
+
+void vbus_driver_unregister(struct vbus_driver *drv)
+{
+ driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_unregister);
+
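The to_dev()/to_drv() helpers in this file are instances of the kernel's container_of() idiom: the generic `struct device` is embedded inside the larger proxy structure, and the helper recovers the outer object from a pointer to the embedded member. A minimal userspace sketch of the same idiom (all type names here are illustrative stand-ins, not from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-implementation of the kernel's container_of() macro */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-ins for struct device / struct vbus_device_proxy */
struct fake_device {
	const char *name;
};

struct fake_proxy {
	int id;
	struct fake_device dev;	/* embedded, like vbus_device_proxy.dev */
};

/* Equivalent of to_dev(): recover the outer proxy from the inner device */
static struct fake_proxy *fake_to_dev(struct fake_device *_dev)
{
	return _dev ? container_of(_dev, struct fake_proxy, dev) : NULL;
}
```

The NULL check mirrors to_dev() above, which must tolerate a NULL `struct device` from the bus core.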
We need to get hotswap events in environments which cannot use existing
facilities (e.g. inotify). So we add a notifier-chain to allow client
callbacks whenever an interface is {un}registered.
Signed-off-by: Gregory Haskins <[email protected]>
---
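The registration semantics this patch implements (replay a DEVADD for every device already on the bus before linking the new client into the chain) can be sketched in plain C. This is a single-threaded toy chain for illustration only; the kernel code uses `struct notifier_block` and serializes under vbus->lock:

```c
#include <assert.h>
#include <stddef.h>

/* Event types mirroring the patch (values are illustrative) */
enum { EV_DEVADD, EV_DEVDROP };

struct notifier {
	int (*call)(struct notifier *nb, int event, void *data);
	struct notifier *next;
};

/* A bus with a fixed-size device table and a notifier chain head */
struct bus {
	unsigned long devs[8];
	int ndevs;
	struct notifier *chain;
};

/* Mirror of vbus_notifier_register(): replay existing devices, then link */
static int bus_notifier_register(struct bus *b, struct notifier *nb)
{
	int i;

	for (i = 0; i < b->ndevs; i++)
		if (nb->call(nb, EV_DEVADD, &b->devs[i]) < 0)
			return -1;	/* client rejected the resync */

	nb->next = b->chain;
	b->chain = nb;
	return 0;
}

/* Mirror of raw_notifier_call_chain() for later hotswap events */
static void bus_notify(struct bus *b, int event, void *data)
{
	struct notifier *nb;

	for (nb = b->chain; nb; nb = nb->next)
		nb->call(nb, event, data);
}

/* Example client: counts the DEVADD events it has seen */
static int seen;
static int count_call(struct notifier *nb, int event, void *data)
{
	(void)nb; (void)data;
	if (event == EV_DEVADD)
		seen++;
	return 0;
}

static int demo(void)
{
	struct bus b = { .devs = { 1, 2 }, .ndevs = 2, .chain = NULL };
	struct notifier nb = { .call = count_call, .next = NULL };
	unsigned long newdev = 3;

	bus_notifier_register(&b, &nb);	/* replays devices 1 and 2 */
	bus_notify(&b, EV_DEVADD, &newdev);	/* later hotswap event */
	return seen;
}
```

The replay step is why clients cannot miss devices that were registered before they attached.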
include/linux/vbus.h | 15 +++++++++++++
kernel/vbus/core.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/vbus/vbus.h | 1 +
3 files changed, 75 insertions(+), 0 deletions(-)
diff --git a/include/linux/vbus.h b/include/linux/vbus.h
index 5f0566c..04db4ff 100644
--- a/include/linux/vbus.h
+++ b/include/linux/vbus.h
@@ -29,6 +29,7 @@
#include <linux/sched.h>
#include <linux/rcupdate.h>
#include <linux/vbus_device.h>
+#include <linux/notifier.h>
struct vbus;
struct task_struct;
@@ -137,6 +138,20 @@ static inline void task_vbus_disassociate(struct task_struct *p)
}
}
+enum {
+ VBUS_EVENT_DEVADD,
+ VBUS_EVENT_DEVDROP,
+};
+
+struct vbus_event_devadd {
+ const char *type;
+ unsigned long id;
+};
+
+int vbus_notifier_register(struct vbus *vbus, struct notifier_block *nb);
+int vbus_notifier_unregister(struct vbus *vbus, struct notifier_block *nb);
+
+
#else /* CONFIG_VBUS */
#define fork_vbus(p) do { } while (0)
diff --git a/kernel/vbus/core.c b/kernel/vbus/core.c
index 033999f..b6df487 100644
--- a/kernel/vbus/core.c
+++ b/kernel/vbus/core.c
@@ -89,6 +89,7 @@ int vbus_device_interface_register(struct vbus_device *dev,
{
int ret;
struct vbus_devshell *ds = to_devshell(dev->kobj);
+ struct vbus_event_devadd ev;
mutex_lock(&vbus->lock);
@@ -124,6 +125,14 @@ int vbus_device_interface_register(struct vbus_device *dev,
if (ret)
goto error;
+ ev.type = intf->type;
+ ev.id = intf->id;
+
+ /* and let any clients know about the new device */
+ ret = raw_notifier_call_chain(&vbus->notifier, VBUS_EVENT_DEVADD, &ev);
+ if (ret < 0)
+ goto error;
+
mutex_unlock(&vbus->lock);
return 0;
@@ -144,6 +153,7 @@ int vbus_device_interface_unregister(struct vbus_device_interface *intf)
mutex_lock(&vbus->lock);
_interface_unregister(intf);
+ raw_notifier_call_chain(&vbus->notifier, VBUS_EVENT_DEVDROP, &intf->id);
mutex_unlock(&vbus->lock);
kobject_put(&intf->kobj);
@@ -346,6 +356,8 @@ int vbus_create(const char *name, struct vbus **bus)
_bus->next_id = 0;
+ RAW_INIT_NOTIFIER_HEAD(&_bus->notifier);
+
mutex_lock(&vbus_root.lock);
ret = map_add(&vbus_root.buses.map, &_bus->node);
@@ -358,6 +370,53 @@ int vbus_create(const char *name, struct vbus **bus)
return 0;
}
+#define for_each_rbnode(node, root) \
+ for (node = rb_first(root); node != NULL; node = rb_next(node))
+
+int vbus_notifier_register(struct vbus *vbus, struct notifier_block *nb)
+{
+ int ret;
+ struct rb_node *node;
+
+ mutex_lock(&vbus->lock);
+
+ /*
+ * resync the client for any devices we might already have
+ */
+ for_each_rbnode(node, &vbus->devices.map.root) {
+ struct vbus_device_interface *intf = node_to_intf(node);
+ struct vbus_event_devadd ev = {
+ .type = intf->type,
+ .id = intf->id,
+ };
+
+ ret = nb->notifier_call(nb, VBUS_EVENT_DEVADD, &ev);
+ if (ret & NOTIFY_STOP_MASK) {
+ mutex_unlock(&vbus->lock);
+ return -EPERM;
+ }
+ }
+
+ ret = raw_notifier_chain_register(&vbus->notifier, nb);
+
+ mutex_unlock(&vbus->lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_notifier_register);
+
+int vbus_notifier_unregister(struct vbus *vbus, struct notifier_block *nb)
+{
+ int ret;
+
+ mutex_lock(&vbus->lock);
+ ret = raw_notifier_chain_unregister(&vbus->notifier, nb);
+ mutex_unlock(&vbus->lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_notifier_unregister);
+
static void devshell_release(struct kobject *kobj)
{
struct vbus_devshell *ds = container_of(kobj,
diff --git a/kernel/vbus/vbus.h b/kernel/vbus/vbus.h
index 1266d69..cd2676b 100644
--- a/kernel/vbus/vbus.h
+++ b/kernel/vbus/vbus.h
@@ -51,6 +51,7 @@ struct vbus {
struct vbus_subdir members;
unsigned long next_id;
struct rb_node node;
+ struct raw_notifier_head notifier;
};
struct vbus_member {
We expect to have various types of connection-clients (e.g. userspace,
kvm, etc), each of which is likely to have common access patterns and
marshalling duties. Therefore we create a "client" API to simplify
client development by helping with mundane tasks such as handle-to-pointer
translation, etc.
Special thanks to Pat Mullaney for suggesting the optimization to pass
a cookie object down during DEVICESHM operations to save lookup overhead
on the event channel.
Signed-off-by: Gregory Haskins <[email protected]>
---
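The handle-to-pointer translation mentioned above works by handing the client an opaque 64-bit handle and then validating it against a map of live objects on every call, rather than trusting a cast of untrusted client data. A userspace sketch of that validation pattern (a linear table stands in for the patch's rb-tree map; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* A "connection" object; its address doubles as the opaque handle
 * returned to the client, as _deviceopen() does in this patch. */
struct conn {
	int devid;
};

/* Registry of live connections (the patch uses an rb-tree map) */
static struct conn *live[8];

static uint64_t conn_register(struct conn *c)
{
	size_t i;

	for (i = 0; i < sizeof(live) / sizeof(live[0]); i++)
		if (!live[i]) {
			live[i] = c;
			return (uint64_t)(uintptr_t)c;
		}
	return 0;
}

/* Mirror of connection_find(): never cast the client-supplied handle
 * straight to a pointer; only believe it if it is in the registry. */
static struct conn *conn_find(uint64_t handle)
{
	size_t i;

	for (i = 0; i < sizeof(live) / sizeof(live[0]); i++)
		if (live[i] && (uint64_t)(uintptr_t)live[i] == handle)
			return live[i];
	return NULL;
}
```

The kernel version additionally takes a reference on the found object under the client lock before returning it, so it cannot be torn down while in use.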
include/linux/vbus_client.h | 115 +++++++++
kernel/vbus/Makefile | 2
kernel/vbus/client.c | 543 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 659 insertions(+), 1 deletions(-)
create mode 100644 include/linux/vbus_client.h
create mode 100644 kernel/vbus/client.c
diff --git a/include/linux/vbus_client.h b/include/linux/vbus_client.h
new file mode 100644
index 0000000..62dab78
--- /dev/null
+++ b/include/linux/vbus_client.h
@@ -0,0 +1,115 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Virtual-Bus - Client interface
+ *
+ * We expect to have various types of connection-clients (e.g. userspace,
+ * kvm, etc). Each client will be connecting from some environment outside
+ * of the kernel, and therefore will not have direct access to the API as
+ * presented in ./linux/vbus.h. There will undoubtedly be some parameter
+ * marshalling that must occur, as well as common patterns for the handling
+ * of those marshalled parameters (e.g. translating a handle into a pointer,
+ * etc).
+ *
+ * Therefore this "client" API is provided to simplify the development
+ * of any clients. Of course, a client is free to bypass this API entirely
+ * and communicate with the direct VBUS API if desired.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_CLIENT_H
+#define _LINUX_VBUS_CLIENT_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+
+struct vbus_deviceopen {
+ __u32 devid;
+ __u32 version; /* device ABI version */
+ __u64 handle; /* return value for devh */
+};
+
+struct vbus_devicecall {
+ __u64 devh; /* device-handle (returned from DEVICEOPEN) */
+ __u32 func;
+ __u32 len;
+ __u32 flags;
+ __u64 datap;
+};
+
+struct vbus_deviceshm {
+ __u64 devh; /* device-handle (returned from DEVICEOPEN) */
+ __u32 id;
+ __u32 len;
+ __u32 flags;
+ struct {
+ __u32 offset;
+ __u32 prio;
+ __u64 cookie; /* token to pass back when signaling client */
+ } signal;
+ __u64 datap;
+ __u64 handle; /* return value for signaling from client to kernel */
+};
+
+#ifdef __KERNEL__
+
+#include <linux/ioq.h>
+#include <linux/module.h>
+#include <asm/atomic.h>
+
+struct vbus_client;
+
+struct vbus_client_ops {
+ int (*deviceopen)(struct vbus_client *client, struct vbus_memctx *ctx,
+ __u32 devid, __u32 version, __u64 *devh);
+ int (*deviceclose)(struct vbus_client *client, __u64 devh);
+ int (*devicecall)(struct vbus_client *client,
+ __u64 devh, __u32 func,
+ void *data, __u32 len, __u32 flags);
+ int (*deviceshm)(struct vbus_client *client,
+ __u64 devh, __u32 id,
+ struct vbus_shm *shm, struct shm_signal *signal,
+ __u32 flags, __u64 *handle);
+ int (*shmsignal)(struct vbus_client *client, __u64 handle);
+ void (*release)(struct vbus_client *client);
+};
+
+struct vbus_client {
+ atomic_t refs;
+ struct vbus_client_ops *ops;
+};
+
+static inline void vbus_client_get(struct vbus_client *client)
+{
+ atomic_inc(&client->refs);
+}
+
+static inline void vbus_client_put(struct vbus_client *client)
+{
+ if (atomic_dec_and_test(&client->refs))
+ client->ops->release(client);
+}
+
+struct vbus_client *vbus_client_attach(struct vbus *bus);
+
+extern struct vbus_memctx *current_memctx;
+struct vbus_memctx *task_memctx_alloc(struct task_struct *task);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_VBUS_CLIENT_H */
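The vbus_client_get()/vbus_client_put() inlines above implement release-on-last-put reference counting. A single-threaded userspace sketch of the same pattern (a plain int stands in for atomic_t, and the struct names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

struct obj {
	int refs;
	int *released;	/* test hook: set when release fires */
};

static void obj_release(struct obj *o)
{
	if (o->released)
		*o->released = 1;
	free(o);
}

static void obj_get(struct obj *o)
{
	o->refs++;
}

/* The last put runs the release callback, as vbus_client_put() does */
static void obj_put(struct obj *o)
{
	if (--o->refs == 0)
		obj_release(o);
}

static struct obj *obj_alloc(int *released)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return NULL;
	o->refs = 1;	/* caller owns the initial reference */
	o->released = released;
	return o;
}
```

In the kernel the count must be an atomic_t because gets and puts can race from multiple contexts; the structure of the logic is otherwise the same.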
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index 367f65b..4d440e5 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o
+obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
diff --git a/kernel/vbus/client.c b/kernel/vbus/client.c
new file mode 100644
index 0000000..f9c3dcf
--- /dev/null
+++ b/kernel/vbus/client.c
@@ -0,0 +1,543 @@
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/uaccess.h>
+#include <linux/vbus.h>
+#include <linux/vbus_client.h>
+#include "vbus.h"
+
+static int
+nodeptr_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+ unsigned long l = (unsigned long)lhs;
+ unsigned long r = (unsigned long)rhs;
+
+ return (l < r) ? -1 : (l > r);
+}
+
+static int
+nodeptr_key_compare(const void *key, struct rb_node *node)
+{
+ unsigned long item = (unsigned long)node;
+ unsigned long _key = *(unsigned long *)key;
+
+ return (_key < item) ? -1 : (_key > item);
+}
+
+static struct map_ops nodeptr_map_ops = {
+ .key_compare = &nodeptr_key_compare,
+ .item_compare = &nodeptr_item_compare,
+};
+
+struct _signal {
+ atomic_t refs;
+ struct rb_node node;
+ struct list_head list;
+ struct shm_signal *signal;
+};
+
+struct _connection {
+ atomic_t refs;
+ struct rb_node node;
+ struct list_head signals;
+ struct vbus_connection *conn;
+ unsigned int closed:1;
+};
+
+static inline void _signal_get(struct _signal *_signal)
+{
+ atomic_inc(&_signal->refs);
+}
+
+static inline void _signal_put(struct _signal *_signal)
+{
+ if (atomic_dec_and_test(&_signal->refs)) {
+ shm_signal_put(_signal->signal);
+ kfree(_signal);
+ }
+}
+
+static inline void conn_get(struct _connection *_conn)
+{
+ atomic_inc(&_conn->refs);
+}
+
+static inline void conn_close(struct _connection *_conn)
+{
+ struct vbus_connection *conn = _conn->conn;
+
+ if (conn->ops->close)
+ conn->ops->close(conn);
+
+ _conn->closed = true;
+}
+
+static inline void conn_put(struct _connection *_conn)
+{
+ if (atomic_dec_and_test(&_conn->refs)) {
+ struct _signal *_signal, *tmp;
+
+ if (!_conn->closed)
+ conn_close(_conn);
+
+ list_for_each_entry_safe(_signal, tmp, &_conn->signals,
+ list) {
+ list_del(&_signal->list);
+ _signal_put(_signal);
+ }
+
+ vbus_connection_put(_conn->conn);
+ kfree(_conn);
+ }
+}
+
+struct _client {
+ struct mutex lock;
+ struct map conn_map;
+ struct map signal_map;
+ struct vbus *vbus;
+ struct vbus_client client;
+};
+
+static struct _connection *to_conn(struct rb_node *node)
+{
+ return node ? container_of(node, struct _connection, node) : NULL;
+}
+
+static struct _signal *to_signal(struct rb_node *node)
+{
+ return node ? container_of(node, struct _signal, node) : NULL;
+}
+
+static struct _client *to_client(struct vbus_client *client)
+{
+ return container_of(client, struct _client, client);
+}
+
+static struct _connection *
+connection_find(struct _client *c, unsigned long devid)
+{
+ struct _connection *_conn;
+
+ /*
+ * We could, in theory, cast devid to _conn->node, but this would
+ * be pretty stupid to trust. Therefore, we must validate that
+ * the pointer is legit by seeing if it exists in our conn_map
+ */
+
+ mutex_lock(&c->lock);
+
+ _conn = to_conn(map_find(&c->conn_map, &devid));
+ if (likely(_conn))
+ conn_get(_conn);
+
+ mutex_unlock(&c->lock);
+
+ return _conn;
+}
+
+static int
+_deviceopen(struct vbus_client *client, struct vbus_memctx *ctx,
+ __u32 devid, __u32 version, __u64 *devh)
+{
+ struct _client *c = to_client(client);
+ struct vbus_connection *conn;
+ struct _connection *_conn;
+ struct vbus_device_interface *intf = NULL;
+ int ret;
+
+ /*
+ * We only get here if the device has never been opened before,
+ * so we need to create a new connection
+ */
+ ret = vbus_interface_find(c->vbus, devid, &intf);
+ if (ret < 0)
+ return ret;
+
+ ret = intf->ops->open(intf, ctx, version, &conn);
+ kobject_put(&intf->kobj);
+ if (ret < 0)
+ return ret;
+
+ _conn = kzalloc(sizeof(*_conn), GFP_KERNEL);
+ if (!_conn) {
+ vbus_connection_put(conn);
+ return -ENOMEM;
+ }
+
+ atomic_set(&_conn->refs, 1);
+ _conn->conn = conn;
+
+ INIT_LIST_HEAD(&_conn->signals);
+
+ mutex_lock(&c->lock);
+ ret = map_add(&c->conn_map, &_conn->node);
+ mutex_unlock(&c->lock);
+
+ if (ret < 0) {
+ conn_put(_conn);
+ return ret;
+ }
+
+ /* in theory, &_conn->node should be unique */
+ *devh = (__u64)(unsigned long)&_conn->node;
+
+ return 0;
+
+}
+
+/*
+ * Assumes client->lock is held (or we are releasing and don't need to lock)
+ */
+static void
+conn_del(struct _client *c, struct _connection *_conn)
+{
+ struct _signal *_signal, *tmp;
+
+ /* Delete and release each opened queue */
+ list_for_each_entry_safe(_signal, tmp, &_conn->signals, list) {
+ map_del(&c->signal_map, &_signal->node);
+ _signal_put(_signal);
+ }
+
+ map_del(&c->conn_map, &_conn->node);
+}
+
+static int
+_deviceclose(struct vbus_client *client, __u64 devh)
+{
+ struct _client *c = to_client(client);
+ struct _connection *_conn;
+
+ mutex_lock(&c->lock);
+
+ _conn = to_conn(map_find(&c->conn_map, &devh));
+ if (likely(_conn))
+ conn_del(c, _conn);
+
+ mutex_unlock(&c->lock);
+
+ if (unlikely(!_conn))
+ return -ENOENT;
+
+ conn_close(_conn);
+
+ /* this _put is the complement to the _get performed at _deviceopen */
+ conn_put(_conn);
+
+ return 0;
+}
+
+static int
+_devicecall(struct vbus_client *client,
+ __u64 devh, __u32 func, void *data, __u32 len, __u32 flags)
+{
+ struct _client *c = to_client(client);
+ struct _connection *_conn;
+ struct vbus_connection *conn;
+ int ret;
+
+ _conn = connection_find(c, devh);
+ if (!_conn)
+ return -ENOENT;
+
+ conn = _conn->conn;
+
+ ret = conn->ops->call(conn, func, data, len, flags);
+
+ conn_put(_conn);
+
+ return ret;
+}
+
+static int
+_deviceshm(struct vbus_client *client,
+ __u64 devh,
+ __u32 id,
+ struct vbus_shm *shm,
+ struct shm_signal *signal,
+ __u32 flags,
+ __u64 *handle)
+{
+ struct _client *c = to_client(client);
+ struct _signal *_signal = NULL;
+ struct _connection *_conn;
+ struct vbus_connection *conn;
+ int ret;
+
+ *handle = 0;
+
+ _conn = connection_find(c, devh);
+ if (!_conn)
+ return -ENOENT;
+
+ conn = _conn->conn;
+
+ ret = conn->ops->shm(conn, id, shm, signal, flags);
+ if (ret < 0) {
+ conn_put(_conn);
+ return ret;
+ }
+
+ if (signal) {
+ _signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+ if (!_signal) {
+ conn_put(_conn);
+ return -ENOMEM;
+ }
+
+ /* one for map-ref, one for list-ref */
+ atomic_set(&_signal->refs, 2);
+ _signal->signal = signal;
+ shm_signal_get(signal);
+
+ mutex_lock(&c->lock);
+ ret = map_add(&c->signal_map, &_signal->node);
+ list_add_tail(&_signal->list, &_conn->signals);
+ mutex_unlock(&c->lock);
+
+ if (!ret)
+ *handle = (__u64)(unsigned long)&_signal->node;
+ }
+
+ conn_put(_conn);
+
+ return 0;
+}
+
+static int
+_shmsignal(struct vbus_client *client, __u64 handle)
+{
+ struct _client *c = to_client(client);
+ struct _signal *_signal;
+
+ mutex_lock(&c->lock);
+
+ _signal = to_signal(map_find(&c->signal_map, &handle));
+ if (likely(_signal))
+ _signal_get(_signal);
+
+ mutex_unlock(&c->lock);
+
+ if (!_signal)
+ return -ENOENT;
+
+ _shm_signal_wakeup(_signal->signal);
+
+ _signal_put(_signal);
+
+ return 0;
+}
+
+static void
+_release(struct vbus_client *client)
+{
+ struct _client *c = to_client(client);
+ struct rb_node *node;
+
+ /* Drop all of our open connections */
+ while ((node = rb_first(&c->conn_map.root))) {
+ struct _connection *_conn = to_conn(node);
+
+ conn_del(c, _conn);
+ conn_put(_conn);
+ }
+
+ vbus_put(c->vbus);
+ kfree(c);
+}
+
+static struct vbus_client_ops _client_ops = {
+ .deviceopen = _deviceopen,
+ .deviceclose = _deviceclose,
+ .devicecall = _devicecall,
+ .deviceshm = _deviceshm,
+ .shmsignal = _shmsignal,
+ .release = _release,
+};
+
+struct vbus_client *vbus_client_attach(struct vbus *vbus)
+{
+ struct _client *c;
+
+ BUG_ON(!vbus);
+
+ c = kzalloc(sizeof(*c), GFP_KERNEL);
+ if (!c)
+ return NULL;
+
+ atomic_set(&c->client.refs, 1);
+ c->client.ops = &_client_ops;
+
+ mutex_init(&c->lock);
+ map_init(&c->conn_map, &nodeptr_map_ops);
+ map_init(&c->signal_map, &nodeptr_map_ops);
+ c->vbus = vbus_get(vbus);
+
+ return &c->client;
+}
+EXPORT_SYMBOL_GPL(vbus_client_attach);
+
+/*
+ * memory context helpers
+ */
+
+static unsigned long
+current_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
+ unsigned long len)
+{
+ return copy_to_user(dst, src, len);
+}
+
+static unsigned long
+current_memctx_copy_from(struct vbus_memctx *ctx, void *dst, const void *src,
+ unsigned long len)
+{
+ return copy_from_user(dst, src, len);
+}
+
+static void
+current_memctx_release(struct vbus_memctx *ctx)
+{
+ panic("dropped last reference to current_memctx");
+}
+
+static struct vbus_memctx_ops current_memctx_ops = {
+ .copy_to = &current_memctx_copy_to,
+ .copy_from = &current_memctx_copy_from,
+ .release = &current_memctx_release,
+};
+
+static struct vbus_memctx _current_memctx =
+ VBUS_MEMCTX_INIT((&current_memctx_ops));
+
+struct vbus_memctx *current_memctx = &_current_memctx;
+
+/*
+ * task_mem allows you to have a copy_from_user/copy_to_user like
+ * environment, except that it supports copying to tasks other
+ * than "current" as ctu/cfu() do
+ */
+struct task_memctx {
+ struct task_struct *task;
+ struct vbus_memctx ctx;
+};
+
+static struct task_memctx *to_task_memctx(struct vbus_memctx *ctx)
+{
+ return container_of(ctx, struct task_memctx, ctx);
+}
+
+static unsigned long
+task_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
+ unsigned long n)
+{
+ struct task_memctx *tm = to_task_memctx(ctx);
+ struct task_struct *p = tm->task;
+
+ while (n) {
+ unsigned long offset = ((unsigned long)dst)%PAGE_SIZE;
+ unsigned long len = PAGE_SIZE - offset;
+ int ret;
+ struct page *pg;
+ void *maddr;
+
+ if (len > n)
+ len = n;
+
+ down_read(&p->mm->mmap_sem);
+ ret = get_user_pages(p, p->mm,
+ (unsigned long)dst, 1, 1, 0, &pg, NULL);
+
+ if (ret != 1) {
+ up_read(&p->mm->mmap_sem);
+ break;
+ }
+
+ maddr = kmap_atomic(pg, KM_USER0);
+ memcpy(maddr + offset, src, len);
+ kunmap_atomic(maddr, KM_USER0);
+ set_page_dirty_lock(pg);
+ put_page(pg);
+ up_read(&p->mm->mmap_sem);
+
+ src += len;
+ dst += len;
+ n -= len;
+ }
+
+ return n;
+}
+
+static unsigned long
+task_memctx_copy_from(struct vbus_memctx *ctx, void *dst, const void *src,
+ unsigned long n)
+{
+ struct task_memctx *tm = to_task_memctx(ctx);
+ struct task_struct *p = tm->task;
+
+ while (n) {
+ unsigned long offset = ((unsigned long)src)%PAGE_SIZE;
+ unsigned long len = PAGE_SIZE - offset;
+ int ret;
+ struct page *pg;
+ void *maddr;
+
+ if (len > n)
+ len = n;
+
+ down_read(&p->mm->mmap_sem);
+ ret = get_user_pages(p, p->mm,
+ (unsigned long)src, 1, 0, 0, &pg, NULL);
+
+ if (ret != 1) {
+ up_read(&p->mm->mmap_sem);
+ break;
+ }
+
+ maddr = kmap_atomic(pg, KM_USER0);
+ memcpy(dst, maddr + offset, len);
+ kunmap_atomic(maddr, KM_USER0);
+ put_page(pg);
+ up_read(&p->mm->mmap_sem);
+
+ src += len;
+ dst += len;
+ n -= len;
+ }
+
+ return n;
+}
+
+static void
+task_memctx_release(struct vbus_memctx *ctx)
+{
+ struct task_memctx *tm = to_task_memctx(ctx);
+
+ put_task_struct(tm->task);
+ kfree(tm);
+}
+
+static struct vbus_memctx_ops task_memctx_ops = {
+ .copy_to = &task_memctx_copy_to,
+ .copy_from = &task_memctx_copy_from,
+ .release = &task_memctx_release,
+};
+
+struct vbus_memctx *task_memctx_alloc(struct task_struct *task)
+{
+ struct task_memctx *tm;
+
+ tm = kzalloc(sizeof(*tm), GFP_KERNEL);
+ if (!tm)
+ return NULL;
+
+ get_task_struct(task);
+
+ tm->task = task;
+ vbus_memctx_init(&tm->ctx, &task_memctx_ops);
+
+ return &tm->ctx;
+}
+EXPORT_SYMBOL_GPL(task_memctx_alloc);
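The task_memctx copy routines above walk the transfer one page at a time because get_user_pages()/kmap_atomic() can only map a single page, so each chunk is clipped to never cross a page boundary. The chunking arithmetic in isolation, as a userspace sketch with a tiny fake page size (memcpy stands in for the map-and-copy step):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define FAKE_PAGE_SIZE 16UL	/* tiny "page" so a test crosses boundaries */

/* Mirror of the task_memctx_copy_to() loop: split the copy into chunks
 * that never cross a page boundary in the destination, the way the
 * kernel code must pin and map one page at a time. */
static unsigned long chunked_copy(void *dst, const void *src, unsigned long n)
{
	char *d = dst;
	const char *s = src;

	while (n) {
		unsigned long offset = (unsigned long)(uintptr_t)d % FAKE_PAGE_SIZE;
		unsigned long len = FAKE_PAGE_SIZE - offset;

		if (len > n)
			len = n;

		/* stands in for get_user_pages() + kmap_atomic() + memcpy() */
		memcpy(d, s, len);

		s += len;
		d += len;
		n -= len;
	}

	return n;	/* bytes NOT copied, mirroring copy_to_user() */
}
```

Note the return convention: like copy_to_user(), the function returns the number of bytes left uncopied, so 0 means success.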
See Documentation/vbus.txt for details
Signed-off-by: Gregory Haskins <[email protected]>
---
Documentation/vbus.txt | 386 +++++++++++++++++++++++++++++
arch/x86/Kconfig | 2
fs/proc/base.c | 96 +++++++
include/linux/sched.h | 4
include/linux/vbus.h | 147 +++++++++++
include/linux/vbus_device.h | 417 ++++++++++++++++++++++++++++++++
kernel/Makefile | 1
kernel/exit.c | 2
kernel/fork.c | 2
kernel/vbus/Kconfig | 14 +
kernel/vbus/Makefile | 1
kernel/vbus/attribute.c | 52 ++++
kernel/vbus/config.c | 275 +++++++++++++++++++++
kernel/vbus/core.c | 567 +++++++++++++++++++++++++++++++++++++++++++
kernel/vbus/devclass.c | 124 +++++++++
kernel/vbus/map.c | 72 +++++
kernel/vbus/map.h | 41 +++
kernel/vbus/vbus.h | 116 +++++++++
18 files changed, 2319 insertions(+), 0 deletions(-)
create mode 100644 Documentation/vbus.txt
create mode 100644 include/linux/vbus.h
create mode 100644 include/linux/vbus_device.h
create mode 100644 kernel/vbus/Kconfig
create mode 100644 kernel/vbus/Makefile
create mode 100644 kernel/vbus/attribute.c
create mode 100644 kernel/vbus/config.c
create mode 100644 kernel/vbus/core.c
create mode 100644 kernel/vbus/devclass.c
create mode 100644 kernel/vbus/map.c
create mode 100644 kernel/vbus/map.h
create mode 100644 kernel/vbus/vbus.h
diff --git a/Documentation/vbus.txt b/Documentation/vbus.txt
new file mode 100644
index 0000000..e8a05da
--- /dev/null
+++ b/Documentation/vbus.txt
@@ -0,0 +1,386 @@
+
+Virtual-Bus:
+======================
+Author: Gregory Haskins <[email protected]>
+
+
+
+
+What is it?
+--------------------
+
+Virtual-Bus is a kernel-based IO resource container technology. It is modeled
+on a concept similar to the Linux Device-Model (LDM), where we have buses,
+devices, and drivers as the primary actors. However, VBUS has several
+distinctions when contrasted with LDM:
+
+ 1) "Busses" in LDM are relatively static and global to the kernel (e.g.
+ "PCI", "USB", etc). VBUS buses are arbitrarily created and destroyed
+ dynamically, and are not globally visible. Instead they are defined as
+ visible only to a specific subset of the system (the contained context).
+ 2) "Devices" in LDM are typically tangible physical (or sometimes logical)
+ devices. VBUS devices are purely software abstractions (which may or
+ may not have one or more physical devices behind them). Devices may
+ also be arbitrarily created or destroyed by software/administrative action
+ as opposed to by a hardware discovery mechanism.
+ 3) "Drivers" in LDM sit within the same kernel context as the busses and
+ devices they interact with. VBUS drivers live in a foreign
+ context (such as userspace, or a virtual-machine guest).
+
+The idea is that a vbus is created to contain access to some IO services.
+Virtual devices are then instantiated and linked to a bus to grant access to
+drivers actively present on the bus. Drivers will only have visibility to
+devices present on their respective bus, and nothing else.
+
+Virtual devices are defined by modules which register a deviceclass with the
+system. A deviceclass simply represents a type of device that _may_ be
+instantiated into a device, should an administrator wish to do so. Once
+this has happened, the device may be associated with one or more buses where
+it will become visible to all clients of those respective buses.
+
+Why do we need this?
+----------------------
+
+There are various reasons why such a construct may be useful. One of the
+most interesting use cases is for virtualization, such as KVM. Hypervisors
+today provide virtualized IO resources to a guest, but this is often at a cost
+in both latency and throughput compared to bare metal performance. Utilizing
+para-virtual resources instead of emulated devices helps to mitigate this
+penalty, but even these techniques to date have not fully realized the
+potential of the underlying bare-metal hardware.
+
+Some of the performance differential is unavoidable just given the extra
+processing that occurs due to the deeper stack (guest+host). However, some of
+this overhead is a direct result of the rather indirect path most hypervisors
+use to route IO. For instance, KVM uses PIO faults from the guest to trigger
+a guest->host-kernel->host-userspace->host-kernel sequence of events.
+Contrast this to a typical userspace application on the host which must only
+traverse app->kernel for most IO.
+
+The fact is that the Linux kernel is already great at managing access to IO
+resources. Therefore, if you have a hypervisor that is based on the Linux
+kernel, is there some way that we can allow the hypervisor to manage IO
+directly instead of forcing this convoluted path?
+
+The short answer is: "not yet" ;)
+
+In order to use such a concept, we need some new facilities. For one, we
+need to be able to define containers with their corresponding access-control so
+that guests do not have unmitigated access to anything they wish. Second,
+we also need to define some form of memory access that is uniform in the face
+of various clients (e.g. "copy_to_user()" cannot be assumed to work for, say,
+a KVM vcpu context). Lastly, we need to provide access to these resources in
+a way that makes sense for the application, such as asynchronous communication
+paths and minimizing context switches.
+
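The uniform memory access described here is what the vbus_memctx abstraction (introduced with the client API in this series) provides: device code copies through an ops table instead of calling copy_to_user() directly, so the same code serves userspace tasks, kvm vcpus, and so on. A userspace sketch of the shape of that abstraction (illustrative only, not the actual kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct memctx;

struct memctx_ops {
	unsigned long (*copy_to)(struct memctx *ctx, void *dst,
				 const void *src, unsigned long len);
};

struct memctx {
	struct memctx_ops *ops;
};

/* "current" context: the client's memory is directly addressable, so
 * this stands in for the copy_to_user() based implementation. */
static unsigned long direct_copy_to(struct memctx *ctx, void *dst,
				    const void *src, unsigned long len)
{
	(void)ctx;
	memcpy(dst, src, len);
	return 0;	/* bytes left uncopied, like copy_to_user() */
}

static struct memctx_ops direct_ops = { .copy_to = direct_copy_to };
static struct memctx direct_ctx = { .ops = &direct_ops };

/* Device-side code only ever sees the abstraction */
static unsigned long memctx_copy_to(struct memctx *ctx, void *dst,
				    const void *src, unsigned long len)
{
	return ctx->ops->copy_to(ctx, dst, src, len);
}
```

A kvm connector would supply a different ops table that resolves guest-physical addresses instead; the device model is unchanged either way.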
+So we introduce VBUS as a framework to provide such facilities. The net
+result is a *substantial* reduction in IO overhead, even when compared to
+state of the art para-virtualization techniques (such as virtio-net).
+
+How do I use it?
+------------------------
+
+There are two components to utilizing a virtual-bus. One is the
+administrative function (creating and configuring a bus and its devices). The
+other is the consumption of the resources on the bus by a client (e.g. a
+virtual machine, or a userspace application). The former occurs on the host
+kernel by means of interacting with various special filesystems (e.g. sysfs,
+configfs, etc). The latter occurs by means of a "vbus connector" which must
+be developed specifically to bridge a particular environment. To date, we
+have developed such connectors for host-userspace and kvm-guests. Conceivably
+we could develop other connectors as needs arise (e.g. lguest, xen,
+guest-userspace, etc). This document deals with the administrative interface.
+Details about developing a connector are out of scope for this document.
+
+Interacting with vbus
+------------------------
+
+The first step is to enable virtual-bus support (CONFIG_VBUS) as well as any
+desired vbus-device modules (e.g. CONFIG_VBUS_VENETTAP), and ensure that your
+environment mounts both sysfs and configfs somewhere in the filesystem. This
+document will assume they are mounted to /sys and /config, respectively.
+
+VBUS will create a top-level directory "vbus" in each of the two respective
+filesystems. At boot-up, they will look like the following:
+
+/sys/vbus/
+|-- deviceclass
+|-- devices
+|-- instances
+`-- version
+
+/config/vbus/
+|-- devices
+`-- instances
+
+Following their respective roles, /config/vbus is for userspace to manage the
+lifetime of some number of objects/attributes. This is in contrast to
+/sys/vbus which is a reflection of objects managed by the kernel. It is
+assumed the reader is already familiar with these two facilities, so we will
+not go into depth about their general operation. Suffice to say that vbus
+consists of objects that are managed both by userspace and the kernel.
+Modification of objects via /config/vbus will typically be reflected in the
+/sys/vbus area.
+
+It all starts with a deviceclass
+--------------------------------
+
+Before you can do anything useful with vbus, you need some registered
+deviceclasses. A deviceclass provides the implementation of a specific type
+of virtual device. A deviceclass will typically be registered by loading a
+kernel-module. Once loaded, the available device types are enumerated under
+/sys/vbus/deviceclass. For example, we will load our "venet-tap" module,
+which provides network services:
+
+# modprobe venet-tap
+# tree /sys/vbus
+/sys/vbus
+|-- deviceclass
+| `-- venet-tap
+|-- devices
+|-- instances
+`-- version
+
+An administrative agent should be able to enumerate /sys/vbus/deviceclass to
+determine what services are available on a given platform.
+
+Create the container
+-------------------
+
+The next step is to create a new container. In vbus, this comes in the form
+of a vbus-instance and it is created by a simple "mkdir" in the
+/config/vbus/instances area. The only requirement is that the instance is
+given a host-wide unique name. This may be some kind of association to the
+application (e.g. the unique VM GUID) or it can be arbitrary. For the
+purposes of example, we will let $(uuidgen) generate a random UUID for us.
+
+# mkdir /config/vbus/instances/$(uuidgen)
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+| `-- venet-tap
+|-- devices
+|-- instances
+| `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+| |-- devices
+| `-- members
+`-- version
+
+So we can see that we have now created a vbus called
+
+ "beb4df8f-7483-4028-b3f7-767512e2a18c"
+
+in the /config area, and it was immediately reflected in the
+/sys/vbus/instances area as well (with a few subobjects of its own: "devices"
+and "members"). The "devices" object denotes any devices that are present on
+the bus (in this case: none). Likewise, "members" denotes the pids of any
+tasks that are members of the bus (in this case: none). We will come back to
+this later. For now, we move on to the next step.
+
+Create a device instance
+------------------------
+
+Devices are instantiated by again utilizing the /config/vbus configfs area.
+At first you may suspect that devices are created as subordinate objects of a
+bus/container instance, but you would be mistaken. Devices are actually
+root-level objects in vbus specifically to allow greater flexibility in the
+association of a device. For instance, it may be desirable to have a single
+device that spans multiple VMs (consider an ethernet switch, or a shared disk
+for a cluster). Therefore, device lifecycles are managed by creating/deleting
+objects in /config/vbus/devices.
+
+Note: Creating a device instance is actually a two-step process: We need to
+give the device instance a unique name, and we also need to give it a specific
+device type. It is hard to express both parameters using standard filesystem
+operations like mkdir, so the design decision was made to require performing
+the operation in two steps.
+
+Our first step is to create a unique instance. We will again utilize
+$(uuidgen) to yield an arbitrary name. Any name will suffice as long as it is
+unique on this particular host.
+
+# mkdir /config/vbus/devices/$(uuidgen)
+# tree /sys/vbus
+/sys/vbus
+|-- deviceclass
+| `-- venet-tap
+|-- devices
+| `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+| `-- interfaces
+|-- instances
+| `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+| |-- devices
+| `-- members
+`-- version
+
+At this point we have created a partial instance, since we have not yet
+assigned a type to the device. Even so, we can see that some state has
+changed under /sys/vbus/devices. We now have an instance named
+
+ 6a1aff24-5dc0-4aea-9c35-435daef90e55
+
+and it has a single subordinate object: "interfaces". This object in
+particular is provided by the infrastructure, though do note that a
+deviceclass may also provide its own attributes/objects once it is created.
+
+We will go ahead and give this device a type to complete its construction. We
+do this by setting the /config/vbus/devices/$devname/type attribute with a
+valid deviceclass type:
+
+# echo foo > /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/type
+bash: echo: write error: No such file or directory
+
+Oops! What happened? "foo" is not a valid deviceclass. We need to consult
+the /sys/vbus/deviceclass area to find out what our options are:
+
+# tree /sys/vbus/deviceclass/
+/sys/vbus/deviceclass/
+`-- venet-tap
+
+Let's try again:
+
+# echo venet-tap > /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/type
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+| `-- venet-tap
+|-- devices
+| `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+| |-- class -> ../../deviceclass/venet-tap
+| |-- client_mac
+| |-- enabled
+| |-- host_mac
+| |-- ifname
+| `-- interfaces
+|-- instances
+| `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+| |-- devices
+| `-- members
+`-- version
+
+Ok, that looks better. And note that /sys/vbus/devices now has some more
+subordinate objects. Most of those were registered when the venet-tap
+deviceclass was given a chance to create an instance of itself. Those
+attributes are a property of the venet-tap and therefore are out of scope
+for this document. Please see the documentation that accompanies a particular
+module for more details.
+
+Put the device on the bus
+-------------------------
+
+The next administrative step is to associate our new device with our bus.
+This is accomplished using a symbolic link from the bus instance to our device
+instance.
+
+# ln -s /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/ /config/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+| `-- venet-tap
+|-- devices
+| `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+| |-- class -> ../../deviceclass/venet-tap
+| |-- client_mac
+| |-- enabled
+| |-- host_mac
+| |-- ifname
+| `-- interfaces
+| `-- 0 -> ../../../instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0
+|-- instances
+| `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+| |-- devices
+| | `-- 0
+| | |-- device -> ../../../../devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
+| | `-- type
+| `-- members
+`-- version
+
+We can now see that the device indicates that it has an interface registered
+to a bus:
+
+/sys/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/interfaces/
+`-- 0 -> ../../../instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0
+
+Likewise, we can see that the bus has a device listed (id = "0"):
+
+/sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/
+`-- 0
+ |-- device -> ../../../../devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
+ `-- type
+
+At this point, our container is ready for use. However, it currently has 0
+members, so let's fix that.
+
+Add some members
+----------------
+
+Membership is controlled by an attribute: /proc/$pid/vbus. A pid can only be
+a member of one (or zero) busses at a time. To establish membership, we set
+the name of the bus, like so:
+
+# echo beb4df8f-7483-4028-b3f7-767512e2a18c > /proc/self/vbus
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+| `-- venet-tap
+|-- devices
+| `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+| |-- class -> ../../deviceclass/venet-tap
+| |-- client_mac
+| |-- enabled
+| |-- host_mac
+| |-- ifname
+| `-- interfaces
+| `-- 0 -> ../../../instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0
+|-- instances
+| `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+| |-- devices
+| | `-- 0
+| | |-- device -> ../../../../devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
+| | `-- type
+| `-- members
+| |-- 4382
+| `-- 4588
+`-- version
+
+Whoa! Why are there two members? VBUS membership is inherited by forked
+tasks. Therefore, 4382 is the pid of our shell (which we set via /proc/self),
+and 4588 is the pid of the forked/exec'ed "tree" process. This property can
+be useful for having things like qemu set up the bus and then forking each
+vcpu which will inherit access.
+
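A task's current association can be read back through the same /proc attribute (the handler prints "<none>" for an unassociated task). Continuing the session above, a sketch of what you might see:

```shell
# cat /proc/self/vbus
beb4df8f-7483-4028-b3f7-767512e2a18c
```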
+At this point, we are ready to roll. Pid 4382 has access to a virtual-bus
+namespace with one device, id=0. Its type is:
+
+# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
+virtual-ethernet
+
+"virtual-ethernet"? Why is it not "venet-tap"? Device-classes are allowed to
+register their interfaces under an id that is not required to be the same as
+their deviceclass. This supports device polymorphism. For instance,
+consider that an interface "virtual-ethernet" may provide basic 802.x packet
+exchange. However, we could have various implementations of a device that
+support this 802.x interface, each with a completely different backend.
+
+For instance, "venet-tap" might act like a tuntap module, while
+"venet-loopback" would loop packets back and "venet-switch" would form a
+layer-2 domain among the participating guests. All three modules would
+presumably support the same basic 802.x interface, yet all three have
+completely different implementations.
+
+Drivers on this particular bus would see this instance id=0 as a type
+"virtual-ethernet" even though the underlying implementation happens to be a
+tap device. This means a single driver that supports the protocol advertised
+by the "virtual-ethernet" type would be able to support the plethora of
+available device types that we may wish to create.
+
+Teardown
+--------
+
+We can deconstruct a vbus container by doing pretty much the opposite of what
+we did to create it. Echo "0" into /proc/self/vbus, rm the symlink between the
+bus and device, and rmdir the bus and device objects. Once that is done, we
+can even rmmod the venet-tap module. Note that the infrastructure will
+maintain a module-ref while it is configured in a container, so be sure to
+completely tear down the vbus/device before trying this.
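Putting the prose above together, a hypothetical teardown session for the example bus/device (your uuidgen-generated names will differ) might look like:

```shell
# echo 0 > /proc/self/vbus
# rm /config/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/6a1aff24-5dc0-4aea-9c35-435daef90e55
# rmdir /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
# rmdir /config/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c
# rmmod venet-tap
```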
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bc2fbad..3fca247 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1939,6 +1939,8 @@ source "drivers/pcmcia/Kconfig"
source "drivers/pci/hotplug/Kconfig"
+source "kernel/vbus/Kconfig"
+
endmenu
diff --git a/fs/proc/base.c b/fs/proc/base.c
index beaa0ce..03993fb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -80,6 +80,7 @@
#include <linux/oom.h>
#include <linux/elf.h>
#include <linux/pid_namespace.h>
+#include <linux/vbus.h>
#include "internal.h"
/* NOTE:
@@ -1065,6 +1066,98 @@ static const struct file_operations proc_oom_adjust_operations = {
.write = oom_adjust_write,
};
+#ifdef CONFIG_VBUS
+
+static ssize_t vbus_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ struct vbus *vbus;
+ const char *name;
+ char buffer[256];
+ size_t len;
+
+ if (!task)
+ return -ESRCH;
+
+ vbus = task_vbus_get(task);
+
+ put_task_struct(task);
+
+ name = vbus_name(vbus);
+
+ len = snprintf(buffer, sizeof(buffer), "%s\n", name ? name : "<none>");
+
+ vbus_put(vbus);
+
+ return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t vbus_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task;
+ struct vbus *vbus = NULL;
+ char buffer[256];
+ int disable = 0;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+	if (count && buffer[count-1] == '\n')
+ buffer[count-1] = 0;
+
+ task = get_proc_task(file->f_path.dentry->d_inode);
+ if (!task)
+ return -ESRCH;
+
+ if (!capable(CAP_SYS_ADMIN)) {
+ put_task_struct(task);
+ return -EACCES;
+ }
+
+ if (strcmp(buffer, "0") == 0)
+ disable = 1;
+ else
+ vbus = vbus_find(buffer);
+
+ if (disable || vbus)
+ task_vbus_disassociate(task);
+
+ if (vbus) {
+ int ret = vbus_associate(vbus, task);
+
+ if (ret < 0)
+			printk(KERN_ERR
+			       "vbus: could not associate %s/%d with bus %s\n",
+			       task->comm, task->pid, vbus_name(vbus));
+ else
+ rcu_assign_pointer(task->vbus, vbus);
+
+ vbus_put(vbus); /* Counter the vbus_find() */
+ } else if (!disable) {
+ put_task_struct(task);
+ return -ENOENT;
+ }
+
+ put_task_struct(task);
+
+ if (count == sizeof(buffer)-1)
+ return -EIO;
+
+ return count;
+}
+
+static const struct file_operations proc_vbus_operations = {
+ .read = vbus_read,
+ .write = vbus_write,
+};
+
+#endif /* CONFIG_VBUS */
+
#ifdef CONFIG_AUDITSYSCALL
#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2556,6 +2649,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tgid_io_accounting),
#endif
+#ifdef CONFIG_VBUS
+ REG("vbus", S_IRUGO|S_IWUSR, proc_vbus_operations),
+#endif
};
static int proc_tgid_base_readdir(struct file * filp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 011db2f..cd2f9b1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -97,6 +97,7 @@ struct futex_pi_state;
struct robust_list_head;
struct bio;
struct bts_tracer;
+struct vbus;
/*
* List of flags we want to share for kernel threads,
@@ -1329,6 +1330,9 @@ struct task_struct {
unsigned int lockdep_recursion;
struct held_lock held_locks[MAX_LOCK_DEPTH];
#endif
+#ifdef CONFIG_VBUS
+ struct vbus *vbus;
+#endif
/* journalling filesystem info */
void *journal_info;
diff --git a/include/linux/vbus.h b/include/linux/vbus.h
new file mode 100644
index 0000000..5f0566c
--- /dev/null
+++ b/include/linux/vbus.h
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Virtual-Bus
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_H
+#define _LINUX_VBUS_H
+
+#ifdef CONFIG_VBUS
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+#include <linux/vbus_device.h>
+
+struct vbus;
+struct task_struct;
+
+/**
+ * vbus_associate() - associate a task with a vbus
+ * @vbus: The bus context to associate with
+ * @p: The task to associate
+ *
+ * This function adds a task as a member of a vbus. Tasks must be members
+ * of a bus before they are allowed to use its resources. Tasks may only
+ * associate with a single bus at a time.
+ *
+ * Note: children inherit any association present at fork().
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_associate(struct vbus *vbus, struct task_struct *p);
+
+/**
+ * vbus_disassociate() - disassociate a task from a vbus
+ * @vbus: The bus context to disassociate from
+ * @p: The task to disassociate
+ *
+ * This function removes a task as a member of a vbus.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_disassociate(struct vbus *vbus, struct task_struct *p);
+
+struct vbus *vbus_get(struct vbus *);
+void vbus_put(struct vbus *);
+
+/**
+ * vbus_name() - returns the name of a bus
+ * @vbus: The bus context
+ *
+ * Returns: (char *) name of bus
+ *
+ **/
+const char *vbus_name(struct vbus *vbus);
+
+/**
+ * vbus_find() - retrieves a vbus pointer from its name
+ * @name: The name of the bus to find
+ *
+ * Returns: NULL = failure, non-null = (vbus *)bus-pointer
+ *
+ **/
+struct vbus *vbus_find(const char *name);
+
+/**
+ * task_vbus_get() - retrieves an associated vbus pointer from a task
+ * @p: The task context
+ *
+ * Safely retrieves a pointer to an associated (if any) vbus from a task
+ *
+ * Returns: NULL = no association, non-null = (vbus *)bus-pointer
+ *
+ **/
+static inline struct vbus *task_vbus_get(struct task_struct *p)
+{
+ struct vbus *vbus;
+
+ rcu_read_lock();
+ vbus = rcu_dereference(p->vbus);
+ if (vbus)
+ vbus_get(vbus);
+ rcu_read_unlock();
+
+ return vbus;
+}
+
+/**
+ * fork_vbus() - Helper function to handle associated task forking
+ * @p: The task context
+ *
+ **/
+static inline void fork_vbus(struct task_struct *p)
+{
+ struct vbus *vbus = task_vbus_get(p);
+
+	if (vbus) {
+		int ret = vbus_associate(vbus, p);
+
+		BUG_ON(ret < 0);
+		vbus_put(vbus);
+	}
+}
+
+/**
+ * task_vbus_disassociate() - Helper function to handle disassociating tasks
+ * @p: The task context
+ *
+ **/
+static inline void task_vbus_disassociate(struct task_struct *p)
+{
+ struct vbus *vbus = task_vbus_get(p);
+
+ if (vbus) {
+ rcu_assign_pointer(p->vbus, NULL);
+ synchronize_rcu();
+
+ vbus_disassociate(vbus, p);
+ vbus_put(vbus);
+ }
+}
+
+#else /* CONFIG_VBUS */
+
+#define fork_vbus(p) do { } while (0)
+#define task_vbus_disassociate(p) do { } while (0)
+
+#endif /* CONFIG_VBUS */
+
+#endif /* _LINUX_VBUS_H */
diff --git a/include/linux/vbus_device.h b/include/linux/vbus_device.h
new file mode 100644
index 0000000..f73cd86
--- /dev/null
+++ b/include/linux/vbus_device.h
@@ -0,0 +1,417 @@
+/*
+ * VBUS device models
+ *
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file deals primarily with the definitions for interfacing a virtual
+ * device model to a virtual bus. In a nutshell, a devclass begets a device,
+ * which begets a device_interface, which begets a connection.
+ *
+ * devclass
+ * -------
+ *
+ * To develop a vbus device, it all starts with a devclass. You must register
+ * a devclass using vbus_devclass_register(). Each registered devclass is
+ * enumerated under /sys/vbus/deviceclass.
+ *
+ * In and of itself, a devclass doesn't do much. It is just an object factory
+ * a device whose lifetime is managed by userspace. When userspace decides
+ * it would like to create an instance of a particular devclass, the
+ * devclass::create() callback is invoked (registered as part of the ops
+ * structure during vbus_devclass_register()). How and when userspace decides
+ * to do this is beyond the scope of this document. Please see:
+ *
+ * Documentation/vbus.txt
+ *
+ * for more details.
+ *
+ * device
+ * -------
+ *
+ * A vbus device is created by a particular devclass during the invocation
+ * of its devclass::create() callback. A device is initially created without
+ * any association with a bus. One or more buses may attempt to connect to
+ * a device (controlled, again, by userspace). When this occurs, a
+ * device::bus_connect() callback is invoked.
+ *
+ * This bus_connect() callback gives the device a chance to decide if it will
+ * accept the connection, and if so, to register its interfaces. Most devices
+ * will likely only allow a connection to one bus. Therefore, they may return
+ * -EBUSY if another bus is already connected.
+ *
+ * If the device accepts the connection, it should register one or more
+ * interfaces with the bus using vbus_device_interface_register(). Most
+ * devices will only support one interface, and therefore will only invoke
+ * this method once. However, some more elaborate devices may have multiple
+ * functions, or abstracted topologies. Therefore they may opt at their own
+ * discretion to register more than one interface. The interfaces do not need
+ * to be uniform in type.
+ *
+ * device_interface
+ * -------------------
+ *
+ * The purpose of an interface is twofold: 1) advertise a particular ABI
+ * for communication to a driver, 2) handle the initial connection of a driver.
+ *
+ * As such, a device_interface has a string "type" (which is akin to the
+ * abi type that this interface supports, like a PCI-ID). It also sports
+ * an interface::open() method.
+ *
+ * The interface::open callback is invoked whenever a driver attempts to
+ * connect to this device. The device implements its own policy regarding
+ * whether it accepts multiple connections or not. Most devices will likely
+ * only accept one connection at a time, and therefore will return -EBUSY if
+ * subsequent attempts are made.
+ *
+ * However, if successful, interface::open() should return a
+ * vbus_connection object.
+ *
+ * connections
+ * -----------
+ *
+ * A connection represents an interface that is successfully opened. It will
+ * remain in an active state as long as the client retains the connection.
+ * The connection::release() method is invoked if the client should die,
+ * restart, or explicitly close the connection. The device-model should use
+ * this release() callback as the indication to clean up any resources
+ * associated with a particular connection such as allocated queues, etc.
+ *
+ * ---
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DEVICE_H
+#define _LINUX_VBUS_DEVICE_H
+
+#include <linux/module.h>
+#include <linux/configfs.h>
+#include <linux/rbtree.h>
+#include <linux/shm_signal.h>
+#include <linux/vbus.h>
+#include <asm/atomic.h>
+
+struct vbus_device_interface;
+struct vbus_connection;
+struct vbus_device;
+struct vbus_devclass;
+struct vbus_memctx;
+
+/*
+ * ----------------------
+ * devclass
+ * ----------------------
+ */
+struct vbus_devclass_ops {
+ int (*create)(struct vbus_devclass *dc,
+ struct vbus_device **dev);
+ void (*release)(struct vbus_devclass *dc);
+};
+
+struct vbus_devclass {
+ const char *name;
+ struct vbus_devclass_ops *ops;
+ struct rb_node node;
+ struct kobject kobj;
+ struct module *owner;
+};
+
+/**
+ * vbus_devclass_register() - register a devclass with the system
+ * @devclass: The devclass context to register
+ *
+ * Establishes a new device-class for consumption. Registered device-classes
+ * are enumerated under /sys/vbus/deviceclass. For more details, please see
+ * Documentation/vbus*
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_devclass_register(struct vbus_devclass *devclass);
+
+/**
+ * vbus_devclass_unregister() - unregister a devclass with the system
+ * @devclass: The devclass context to unregister
+ *
+ * Removes a devclass from the system
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_devclass_unregister(struct vbus_devclass *devclass);
+
+/**
+ * vbus_devclass_get() - acquire a devclass context reference
+ * @devclass: devclass context
+ *
+ **/
+static inline struct vbus_devclass *
+vbus_devclass_get(struct vbus_devclass *devclass)
+{
+ if (!try_module_get(devclass->owner))
+ return NULL;
+
+ kobject_get(&devclass->kobj);
+ return devclass;
+}
+
+/**
+ * vbus_devclass_put() - release a devclass context reference
+ * @devclass: devclass context
+ *
+ **/
+static inline void
+vbus_devclass_put(struct vbus_devclass *devclass)
+{
+ kobject_put(&devclass->kobj);
+ module_put(devclass->owner);
+}
+
+/*
+ * ----------------------
+ * device
+ * ----------------------
+ */
+struct vbus_device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct vbus_device *dev,
+ struct vbus_device_attribute *attr,
+ char *buf);
+ ssize_t (*store)(struct vbus_device *dev,
+ struct vbus_device_attribute *attr,
+ const char *buf, size_t count);
+};
+
+struct vbus_device_ops {
+ int (*bus_connect)(struct vbus_device *dev, struct vbus *vbus);
+ int (*bus_disconnect)(struct vbus_device *dev, struct vbus *vbus);
+ void (*release)(struct vbus_device *dev);
+};
+
+struct vbus_device {
+ const char *type;
+ struct vbus_device_ops *ops;
+ struct attribute_group *attrs;
+ struct kobject *kobj;
+};
+
+/*
+ * ----------------------
+ * device_interface
+ * ----------------------
+ */
+struct vbus_device_interface_ops {
+ int (*open)(struct vbus_device_interface *intf,
+ struct vbus_memctx *ctx,
+ int version,
+ struct vbus_connection **conn);
+ void (*release)(struct vbus_device_interface *intf);
+};
+
+struct vbus_device_interface {
+ const char *name;
+ const char *type;
+ struct vbus_device_interface_ops *ops;
+ unsigned long id;
+ struct vbus_device *dev;
+ struct vbus *vbus;
+ struct rb_node node;
+ struct kobject kobj;
+};
+
+/**
+ * vbus_device_interface_register() - register an interface with a bus
+ * @dev: The device context of the caller
+ * @vbus: The bus context to register with
+ * @intf: The interface context to register
+ *
+ * This function is invoked (usually in the context of a device::bus_connect()
+ * callback) to register a interface on a bus. We make this an explicit
+ * operation instead of implicit on the bus_connect() to facilitate devices
+ * that may present multiple interfaces to a bus. In those cases, a device
+ * may invoke this function multiple times (one per supported interface).
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_device_interface_register(struct vbus_device *dev,
+ struct vbus *vbus,
+ struct vbus_device_interface *intf);
+
+/**
+ * vbus_device_interface_unregister() - unregister an interface with a bus
+ * @intf: The interface context to unregister
+ *
+ * This function is the converse of interface_register. It is typically
+ * invoked in the context of a device::bus_disconnect().
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_device_interface_unregister(struct vbus_device_interface *intf);
+
+/*
+ * ----------------------
+ * memory context
+ * ----------------------
+ */
+struct vbus_memctx_ops {
+ unsigned long (*copy_to)(struct vbus_memctx *ctx,
+ void *dst,
+ const void *src,
+ unsigned long len);
+ unsigned long (*copy_from)(struct vbus_memctx *ctx,
+ void *dst,
+ const void *src,
+ unsigned long len);
+ void (*release)(struct vbus_memctx *ctx);
+};
+
+struct vbus_memctx {
+ atomic_t refs;
+ struct vbus_memctx_ops *ops;
+};
+
+static inline void
+vbus_memctx_init(struct vbus_memctx *ctx, struct vbus_memctx_ops *ops)
+{
+ memset(ctx, 0, sizeof(*ctx));
+ atomic_set(&ctx->refs, 1);
+ ctx->ops = ops;
+}
+
+#define VBUS_MEMCTX_INIT(_ops) { \
+ .refs = ATOMIC_INIT(1), \
+ .ops = _ops, \
+}
+
+static inline void
+vbus_memctx_get(struct vbus_memctx *ctx)
+{
+ atomic_inc(&ctx->refs);
+}
+
+static inline void
+vbus_memctx_put(struct vbus_memctx *ctx)
+{
+ if (atomic_dec_and_test(&ctx->refs))
+ ctx->ops->release(ctx);
+}
+
+/*
+ * ----------------------
+ * shared memory
+ * ----------------------
+ */
+struct vbus_shm;
+
+struct vbus_shm_ops {
+ void (*release)(struct vbus_shm *shm);
+};
+
+struct vbus_shm {
+ atomic_t refs;
+ struct vbus_shm_ops *ops;
+ void *ptr;
+ size_t len;
+};
+
+static inline void
+vbus_shm_init(struct vbus_shm *shm, struct vbus_shm_ops *ops,
+ void *ptr, size_t len)
+{
+ memset(shm, 0, sizeof(*shm));
+ atomic_set(&shm->refs, 1);
+ shm->ops = ops;
+ shm->ptr = ptr;
+ shm->len = len;
+}
+
+static inline void
+vbus_shm_get(struct vbus_shm *shm)
+{
+ atomic_inc(&shm->refs);
+}
+
+static inline void
+vbus_shm_put(struct vbus_shm *shm)
+{
+ if (atomic_dec_and_test(&shm->refs))
+ shm->ops->release(shm);
+}
+
+/*
+ * ----------------------
+ * connection
+ * ----------------------
+ */
+struct vbus_connection_ops {
+ int (*call)(struct vbus_connection *conn,
+ unsigned long func,
+ void *data,
+ unsigned long len,
+ unsigned long flags);
+ int (*shm)(struct vbus_connection *conn,
+ unsigned long id,
+ struct vbus_shm *shm,
+ struct shm_signal *signal,
+ unsigned long flags);
+ void (*close)(struct vbus_connection *conn);
+ void (*release)(struct vbus_connection *conn);
+};
+
+struct vbus_connection {
+ atomic_t refs;
+ struct vbus_connection_ops *ops;
+};
+
+/**
+ * vbus_connection_init() - initialize a vbus_connection
+ * @conn: connection context
+ * @ops: ops structure to assign to context
+ *
+ **/
+static inline void vbus_connection_init(struct vbus_connection *conn,
+ struct vbus_connection_ops *ops)
+{
+ memset(conn, 0, sizeof(*conn));
+ atomic_set(&conn->refs, 1);
+ conn->ops = ops;
+}
+
+/**
+ * vbus_connection_get() - acquire a connection context reference
+ * @conn: connection context
+ *
+ **/
+static inline void vbus_connection_get(struct vbus_connection *conn)
+{
+ atomic_inc(&conn->refs);
+}
+
+/**
+ * vbus_connection_put() - release a connection context reference
+ * @conn: connection context
+ *
+ **/
+static inline void vbus_connection_put(struct vbus_connection *conn)
+{
+ if (atomic_dec_and_test(&conn->refs))
+ conn->ops->release(conn);
+}
+
+#endif /* _LINUX_VBUS_DEVICE_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e4791b3..99a98a7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_VBUS) += vbus/
ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/exit.c b/kernel/exit.c
index efd30cc..8736de6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
#include <linux/tracehook.h>
#include <linux/init_task.h>
#include <trace/sched.h>
+#include <linux/vbus.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -1081,6 +1082,7 @@ NORET_TYPE void do_exit(long code)
check_stack_usage();
exit_thread();
cgroup_exit(tsk, 1);
+ task_vbus_disassociate(tsk);
if (group_dead && tsk->signal->leader)
disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..5536053 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -61,6 +61,7 @@
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
#include <trace/sched.h>
+#include <linux/vbus.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1274,6 +1275,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
write_unlock_irq(&tasklist_lock);
proc_fork_connector(p);
cgroup_post_fork(p);
+ fork_vbus(p);
return p;
bad_fork_free_graph:
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
new file mode 100644
index 0000000..f2b92f5
--- /dev/null
+++ b/kernel/vbus/Kconfig
@@ -0,0 +1,14 @@
+#
+# Virtual-Bus (VBus) configuration
+#
+
+config VBUS
+ bool "Virtual Bus"
+ select CONFIGFS_FS
+ select SHM_SIGNAL
+ default n
+ help
+	  Provides a mechanism for declaring virtual-bus objects and binding
+	  various tasks and devices which reside on the bus.
+
+	  If unsure, say N.
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
new file mode 100644
index 0000000..367f65b
--- /dev/null
+++ b/kernel/vbus/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o
diff --git a/kernel/vbus/attribute.c b/kernel/vbus/attribute.c
new file mode 100644
index 0000000..3928228
--- /dev/null
+++ b/kernel/vbus/attribute.c
@@ -0,0 +1,52 @@
+#include <linux/vbus.h>
+#include <linux/uaccess.h>
+#include <linux/kobject.h>
+#include <linux/kallsyms.h>
+
+#include "vbus.h"
+
+static struct vbus_device_attribute *to_vattr(struct attribute *attr)
+{
+ return container_of(attr, struct vbus_device_attribute, attr);
+}
+
+static struct vbus_devshell *to_devshell(struct kobject *kobj)
+{
+ return container_of(kobj, struct vbus_devshell, kobj);
+}
+
+static ssize_t _dev_attr_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct vbus_devshell *ds = to_devshell(kobj);
+ struct vbus_device_attribute *vattr = to_vattr(attr);
+ ssize_t ret = -EIO;
+
+ if (vattr->show)
+ ret = vattr->show(ds->dev, vattr, buf);
+
+ if (ret >= (ssize_t)PAGE_SIZE) {
+ print_symbol("vbus_attr_show: %s returned bad count\n",
+ (unsigned long)vattr->show);
+ }
+
+ return ret;
+}
+
+static ssize_t _dev_attr_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct vbus_devshell *ds = to_devshell(kobj);
+ struct vbus_device_attribute *vattr = to_vattr(attr);
+ ssize_t ret = -EIO;
+
+ if (vattr->store)
+ ret = vattr->store(ds->dev, vattr, buf, count);
+
+ return ret;
+}
+
+struct sysfs_ops vbus_dev_attr_ops = {
+ .show = _dev_attr_show,
+ .store = _dev_attr_store,
+};
diff --git a/kernel/vbus/config.c b/kernel/vbus/config.c
new file mode 100644
index 0000000..a40dbf1
--- /dev/null
+++ b/kernel/vbus/config.c
@@ -0,0 +1,275 @@
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/vbus.h>
+#include <linux/configfs.h>
+
+#include "vbus.h"
+
+static struct config_item_type perms_type = {
+ .ct_owner = THIS_MODULE,
+};
+
+static struct vbus *to_vbus(struct config_group *group)
+{
+ return group ? container_of(group, struct vbus, ci.group) : NULL;
+}
+
+static struct vbus *item_to_vbus(struct config_item *item)
+{
+ return to_vbus(to_config_group(item));
+}
+
+static struct vbus_devshell *to_devshell(struct config_group *group)
+{
+ return group ? container_of(group, struct vbus_devshell, ci_group)
+ : NULL;
+}
+
+static struct vbus_devshell *to_vbus_devshell(struct config_item *item)
+{
+ return to_devshell(to_config_group(item));
+}
+
+static int
+device_bus_connect(struct config_item *src, struct config_item *target)
+{
+ struct vbus *vbus = item_to_vbus(src);
+ struct vbus_devshell *ds;
+
+ /* We only allow connections to devices */
+ if (target->ci_parent != &vbus_root.devices.ci_group.cg_item)
+ return -EINVAL;
+
+ ds = to_vbus_devshell(target);
+ BUG_ON(!ds);
+
+ if (!ds->dev)
+ return -EINVAL;
+
+ return ds->dev->ops->bus_connect(ds->dev, vbus);
+}
+
+static int
+device_bus_disconnect(struct config_item *src, struct config_item *target)
+{
+ struct vbus *vbus = item_to_vbus(src);
+ struct vbus_devshell *ds;
+
+ ds = to_vbus_devshell(target);
+ BUG_ON(!ds);
+
+ if (!ds->dev)
+ return -EINVAL;
+
+ return ds->dev->ops->bus_disconnect(ds->dev, vbus);
+}
+
+struct configfs_item_operations bus_ops = {
+ .allow_link = device_bus_connect,
+ .drop_link = device_bus_disconnect,
+};
+
+static struct config_item_type bus_type = {
+ .ct_item_ops = &bus_ops,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct config_group *bus_create(struct config_group *group,
+ const char *name)
+{
+ struct vbus *bus = NULL;
+ int ret;
+
+ ret = vbus_create(name, &bus);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ config_group_init_type_name(&bus->ci.group, name, &bus_type);
+ bus->ci.group.default_groups = bus->ci.defgroups;
+ bus->ci.group.default_groups[0] = &bus->ci.perms;
+ bus->ci.group.default_groups[1] = NULL;
+
+ config_group_init_type_name(&bus->ci.perms, "perms", &perms_type);
+
+ return &bus->ci.group;
+}
+
+static void bus_destroy(struct config_group *group, struct config_item *item)
+{
+ struct vbus *vbus = item_to_vbus(item);
+
+ vbus_put(vbus);
+}
+
+static struct configfs_group_operations buses_ops = {
+ .make_group = bus_create,
+ .drop_item = bus_destroy,
+};
+
+static struct config_item_type buses_type = {
+ .ct_group_ops = &buses_ops,
+ .ct_owner = THIS_MODULE,
+};
+
+CONFIGFS_ATTR_STRUCT(vbus_devshell);
+#define DEVSHELL_ATTR(_name, _mode, _show, _store) \
+struct vbus_devshell_attribute vbus_devshell_attr_##_name = \
+ __CONFIGFS_ATTR(_name, _mode, _show, _store)
+
+static ssize_t devshell_type_read(struct vbus_devshell *ds, char *page)
+{
+ if (ds->dev)
+ return sprintf(page, "%s\n", ds->dev->type);
+ else
+ return sprintf(page, "\n");
+}
+
+static ssize_t devshell_type_write(struct vbus_devshell *ds, const char *page,
+ size_t count)
+{
+ struct vbus_devclass *dc;
+ struct vbus_device *dev;
+ char name[256];
+ int ret;
+
+ /*
+	 * The device-type can only be set once, and then it is permanent.
+ * The admin should delete the device-shell if they want to create
+ * a new type
+ */
+ if (ds->dev)
+ return -EINVAL;
+
+	if (!count || count >= sizeof(name))
+		return -EINVAL;
+
+	memcpy(name, page, count);
+	name[count] = 0;
+	if (name[count-1] == '\n')
+		name[count-1] = 0;
+
+ dc = vbus_devclass_find(name);
+ if (!dc)
+ return -ENOENT;
+
+ ret = dc->ops->create(dc, &dev);
+ if (ret < 0) {
+ vbus_devclass_put(dc);
+ return ret;
+ }
+
+ ds->dev = dev;
+ ds->dc = dc;
+ dev->kobj = &ds->kobj;
+
+ ret = vbus_devshell_type_set(ds);
+ if (ret < 0) {
+ vbus_devclass_put(dc);
+ return ret;
+ }
+
+ return count;
+}
+
+DEVSHELL_ATTR(type, S_IRUGO | S_IWUSR, devshell_type_read,
+ devshell_type_write);
+
+static struct configfs_attribute *devshell_attrs[] = {
+ &vbus_devshell_attr_type.attr,
+ NULL,
+};
+
+CONFIGFS_ATTR_OPS(vbus_devshell);
+static struct configfs_item_operations devshell_item_ops = {
+ .show_attribute = vbus_devshell_attr_show,
+ .store_attribute = vbus_devshell_attr_store,
+};
+
+static struct config_item_type devshell_type = {
+ .ct_item_ops = &devshell_item_ops,
+ .ct_attrs = devshell_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct config_group *devshell_create(struct config_group *group,
+ const char *name)
+{
+ struct vbus_devshell *ds = NULL;
+ int ret;
+
+ ret = vbus_devshell_create(name, &ds);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ config_group_init_type_name(&ds->ci_group, name, &devshell_type);
+
+ return &ds->ci_group;
+}
+
+static void devshell_release(struct config_group *group,
+ struct config_item *item)
+{
+ struct vbus_devshell *ds = to_vbus_devshell(item);
+
+ kobject_put(&ds->kobj);
+
+ if (ds->dc)
+ vbus_devclass_put(ds->dc);
+}
+
+static struct configfs_group_operations devices_ops = {
+ .make_group = devshell_create,
+ .drop_item = devshell_release,
+};
+
+static struct config_item_type devices_type = {
+ .ct_group_ops = &devices_ops,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct config_item_type root_type = {
+ .ct_owner = THIS_MODULE,
+};
+
+int __init vbus_config_init(void)
+{
+ int ret;
+ struct configfs_subsystem *subsys = &vbus_root.ci.subsys;
+
+ config_group_init_type_name(&subsys->su_group, "vbus", &root_type);
+ mutex_init(&subsys->su_mutex);
+
+ subsys->su_group.default_groups = vbus_root.ci.defgroups;
+ subsys->su_group.default_groups[0] = &vbus_root.buses.ci_group;
+ subsys->su_group.default_groups[1] = &vbus_root.devices.ci_group;
+ subsys->su_group.default_groups[2] = NULL;
+
+ config_group_init_type_name(&vbus_root.buses.ci_group,
+ "instances", &buses_type);
+
+ config_group_init_type_name(&vbus_root.devices.ci_group,
+ "devices", &devices_type);
+
+ ret = configfs_register_subsystem(subsys);
+	if (ret) {
+		printk(KERN_ERR "Error %d while registering subsystem %s\n",
+		       ret,
+		       subsys->su_group.cg_item.ci_namebuf);
+		return ret;
+	}
+
+	return 0;
+}
+
+void __exit vbus_config_exit(void)
+{
+ configfs_unregister_subsystem(&vbus_root.ci.subsys);
+}
diff --git a/kernel/vbus/core.c b/kernel/vbus/core.c
new file mode 100644
index 0000000..033999f
--- /dev/null
+++ b/kernel/vbus/core.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/vbus.h>
+#include <linux/uaccess.h>
+
+#include "vbus.h"
+
+static struct vbus_device_interface *kobj_to_intf(struct kobject *kobj)
+{
+ return container_of(kobj, struct vbus_device_interface, kobj);
+}
+
+static struct vbus_devshell *to_devshell(struct kobject *kobj)
+{
+ return container_of(kobj, struct vbus_devshell, kobj);
+}
+
+static void interface_release(struct kobject *kobj)
+{
+ struct vbus_device_interface *intf = kobj_to_intf(kobj);
+
+ if (intf->ops->release)
+ intf->ops->release(intf);
+}
+
+static struct kobj_type interface_ktype = {
+ .release = interface_release,
+ .sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t
+type_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct vbus_device_interface *intf = kobj_to_intf(kobj);
+
+ return snprintf(buf, PAGE_SIZE, "%s\n", intf->type);
+}
+
+static struct kobj_attribute devattr_type =
+ __ATTR_RO(type);
+
+static struct attribute *attrs[] = {
+ &devattr_type.attr,
+ NULL,
+};
+
+static struct attribute_group attr_group = {
+ .attrs = attrs,
+};
+
+/*
+ * Assumes dev->bus->lock is held
+ */
+static void _interface_unregister(struct vbus_device_interface *intf)
+{
+ struct vbus *vbus = intf->vbus;
+ struct vbus_devshell *ds = to_devshell(intf->dev->kobj);
+
+ map_del(&vbus->devices.map, &intf->node);
+ sysfs_remove_link(&ds->intfs, intf->name);
+ sysfs_remove_link(&intf->kobj, "device");
+ sysfs_remove_group(&intf->kobj, &attr_group);
+}
+
+int vbus_device_interface_register(struct vbus_device *dev,
+ struct vbus *vbus,
+ struct vbus_device_interface *intf)
+{
+ int ret;
+ struct vbus_devshell *ds = to_devshell(dev->kobj);
+
+ mutex_lock(&vbus->lock);
+
+ if (vbus->next_id == -1) {
+ mutex_unlock(&vbus->lock);
+ return -ENOSPC;
+ }
+
+ intf->id = vbus->next_id++;
+ intf->dev = dev;
+ intf->vbus = vbus;
+
+ ret = map_add(&vbus->devices.map, &intf->node);
+ if (ret < 0) {
+ mutex_unlock(&vbus->lock);
+ return ret;
+ }
+
+ kobject_init_and_add(&intf->kobj, &interface_ktype,
+			     &vbus->devices.kobj, "%lu", intf->id);
+
+ /* Create the basic attribute files associated with this kobject */
+ ret = sysfs_create_group(&intf->kobj, &attr_group);
+ if (ret)
+ goto error;
+
+ /* Create cross-referencing links between the device and bus */
+ ret = sysfs_create_link(&intf->kobj, dev->kobj, "device");
+ if (ret)
+ goto error;
+
+ ret = sysfs_create_link(&ds->intfs, &intf->kobj, intf->name);
+ if (ret)
+ goto error;
+
+ mutex_unlock(&vbus->lock);
+
+ return 0;
+
+error:
+ _interface_unregister(intf);
+ mutex_unlock(&vbus->lock);
+
+ kobject_put(&intf->kobj);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_device_interface_register);
+
+int vbus_device_interface_unregister(struct vbus_device_interface *intf)
+{
+ struct vbus *vbus = intf->vbus;
+
+ mutex_lock(&vbus->lock);
+ _interface_unregister(intf);
+ mutex_unlock(&vbus->lock);
+
+ kobject_put(&intf->kobj);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vbus_device_interface_unregister);
+
+static struct vbus_device_interface *node_to_intf(struct rb_node *node)
+{
+ return node ? container_of(node, struct vbus_device_interface, node)
+ : NULL;
+}
+
+static int interface_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+ struct vbus_device_interface *lintf = node_to_intf(lhs);
+ struct vbus_device_interface *rintf = node_to_intf(rhs);
+
+ return lintf->id - rintf->id;
+}
+
+static int interface_key_compare(const void *key, struct rb_node *node)
+{
+ struct vbus_device_interface *intf = node_to_intf(node);
+ unsigned long id = *(unsigned long *)key;
+
+ return id - intf->id;
+}
+
+static struct map_ops interface_map_ops = {
+ .key_compare = &interface_key_compare,
+ .item_compare = &interface_item_compare,
+};
+
+/*
+ *-----------------
+ * member
+ *-----------------
+ */
+
+static struct vbus_member *node_to_member(struct rb_node *node)
+{
+ return node ? container_of(node, struct vbus_member, node) : NULL;
+}
+
+static struct vbus_member *kobj_to_member(struct kobject *kobj)
+{
+ return kobj ? container_of(kobj, struct vbus_member, kobj) : NULL;
+}
+
+static int member_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+ struct vbus_member *lmember = node_to_member(lhs);
+ struct vbus_member *rmember = node_to_member(rhs);
+
+ return lmember->tsk->pid - rmember->tsk->pid;
+}
+
+static int member_key_compare(const void *key, struct rb_node *node)
+{
+ struct vbus_member *member = node_to_member(node);
+ pid_t pid = *(pid_t *)key;
+
+ return pid - member->tsk->pid;
+}
+
+static struct map_ops member_map_ops = {
+ .key_compare = &member_key_compare,
+ .item_compare = &member_item_compare,
+};
+
+static void member_release(struct kobject *kobj)
+{
+ struct vbus_member *member = kobj_to_member(kobj);
+
+ vbus_put(member->vbus);
+ put_task_struct(member->tsk);
+
+ kfree(member);
+}
+
+static struct kobj_type member_ktype = {
+ .release = member_release,
+};
+
+int vbus_associate(struct vbus *vbus, struct task_struct *tsk)
+{
+ struct vbus_member *member;
+ int ret;
+
+ member = kzalloc(sizeof(struct vbus_member), GFP_KERNEL);
+ if (!member)
+ return -ENOMEM;
+
+ mutex_lock(&vbus->lock);
+
+ get_task_struct(tsk);
+ vbus_get(vbus);
+
+ member->vbus = vbus;
+ member->tsk = tsk;
+
+ ret = kobject_init_and_add(&member->kobj, &member_ktype,
+ &vbus->members.kobj,
+ "%d", tsk->pid);
+ if (ret < 0)
+ goto error;
+
+ ret = map_add(&vbus->members.map, &member->node);
+ if (ret < 0)
+ goto error;
+
+out:
+ mutex_unlock(&vbus->lock);
+	return ret;
+
+error:
+ kobject_put(&member->kobj);
+ goto out;
+}
+
+int vbus_disassociate(struct vbus *vbus, struct task_struct *tsk)
+{
+ struct vbus_member *member;
+
+ mutex_lock(&vbus->lock);
+
+ member = node_to_member(map_find(&vbus->members.map, &tsk->pid));
+ BUG_ON(!member);
+
+ map_del(&vbus->members.map, &member->node);
+
+ mutex_unlock(&vbus->lock);
+
+ kobject_put(&member->kobj);
+
+ return 0;
+}
+
+/*
+ *-----------------
+ * vbus_subdir
+ *-----------------
+ */
+
+static void vbus_subdir_init(struct vbus_subdir *subdir,
+ const char *name,
+ struct kobject *parent,
+ struct kobj_type *type,
+ struct map_ops *map_ops)
+{
+ int ret;
+
+ map_init(&subdir->map, map_ops);
+
+ ret = kobject_init_and_add(&subdir->kobj, type, parent, name);
+ BUG_ON(ret < 0);
+}
+
+/*
+ *-----------------
+ * vbus
+ *-----------------
+ */
+
+static void vbus_destroy(struct kobject *kobj)
+{
+ struct vbus *vbus = container_of(kobj, struct vbus, kobj);
+
+ kfree(vbus);
+}
+
+static struct kobj_type vbus_ktype = {
+ .release = vbus_destroy,
+};
+
+static struct kobj_type null_ktype = {
+};
+
+int vbus_create(const char *name, struct vbus **bus)
+{
+ struct vbus *_bus = NULL;
+ int ret;
+
+ _bus = kzalloc(sizeof(struct vbus), GFP_KERNEL);
+ if (!_bus)
+ return -ENOMEM;
+
+ atomic_set(&_bus->refs, 1);
+ mutex_init(&_bus->lock);
+
+ kobject_init_and_add(&_bus->kobj, &vbus_ktype,
+ vbus_root.buses.kobj, name);
+
+ vbus_subdir_init(&_bus->devices, "devices", &_bus->kobj,
+ &null_ktype, &interface_map_ops);
+ vbus_subdir_init(&_bus->members, "members", &_bus->kobj,
+ &null_ktype, &member_map_ops);
+
+ _bus->next_id = 0;
+
+ mutex_lock(&vbus_root.lock);
+
+ ret = map_add(&vbus_root.buses.map, &_bus->node);
+ BUG_ON(ret < 0);
+
+ mutex_unlock(&vbus_root.lock);
+
+ *bus = _bus;
+
+ return 0;
+}
+
+static void devshell_release(struct kobject *kobj)
+{
+ struct vbus_devshell *ds = container_of(kobj,
+ struct vbus_devshell, kobj);
+
+ if (ds->dev) {
+ if (ds->dev->attrs)
+ sysfs_remove_group(&ds->kobj, ds->dev->attrs);
+
+ if (ds->dev->ops->release)
+ ds->dev->ops->release(ds->dev);
+ }
+
+ if (ds->dc)
+ sysfs_remove_link(&ds->kobj, "class");
+
+ kobject_put(&ds->intfs);
+ kfree(ds);
+}
+
+static struct kobj_type devshell_ktype = {
+ .release = devshell_release,
+ .sysfs_ops = &vbus_dev_attr_ops,
+};
+
+static void _interfaces_init(struct vbus_devshell *ds)
+{
+ kobject_init_and_add(&ds->intfs, &null_ktype, &ds->kobj, "interfaces");
+}
+
+int vbus_devshell_create(const char *name, struct vbus_devshell **ds)
+{
+ struct vbus_devshell *_ds = NULL;
+
+ _ds = kzalloc(sizeof(*_ds), GFP_KERNEL);
+ if (!_ds)
+ return -ENOMEM;
+
+ kobject_init_and_add(&_ds->kobj, &devshell_ktype,
+ vbus_root.devices.kobj, name);
+
+ _interfaces_init(_ds);
+
+ *ds = _ds;
+
+ return 0;
+}
+
+int vbus_devshell_type_set(struct vbus_devshell *ds)
+{
+ int ret;
+
+ if (!ds->dev)
+ return -EINVAL;
+
+ if (!ds->dev->attrs)
+ return 0;
+
+ ret = sysfs_create_link(&ds->kobj, &ds->dc->kobj, "class");
+ if (ret < 0)
+ return ret;
+
+ return sysfs_create_group(&ds->kobj, ds->dev->attrs);
+}
+
+struct vbus *vbus_get(struct vbus *vbus)
+{
+ if (vbus)
+ atomic_inc(&vbus->refs);
+
+ return vbus;
+}
+EXPORT_SYMBOL_GPL(vbus_get);
+
+void vbus_put(struct vbus *vbus)
+{
+ if (vbus && atomic_dec_and_test(&vbus->refs)) {
+ kobject_put(&vbus->devices.kobj);
+ kobject_put(&vbus->members.kobj);
+ kobject_put(&vbus->kobj);
+ }
+}
+EXPORT_SYMBOL_GPL(vbus_put);
+
+long vbus_interface_find(struct vbus *bus,
+ unsigned long id,
+ struct vbus_device_interface **intf)
+{
+ struct vbus_device_interface *_intf;
+
+ BUG_ON(!bus);
+
+ mutex_lock(&bus->lock);
+
+ _intf = node_to_intf(map_find(&bus->devices.map, &id));
+ if (likely(_intf))
+ kobject_get(&_intf->kobj);
+
+ mutex_unlock(&bus->lock);
+
+ if (!_intf)
+ return -ENOENT;
+
+ *intf = _intf;
+
+ return 0;
+}
+
+const char *vbus_name(struct vbus *vbus)
+{
+ return vbus ? vbus->kobj.name : NULL;
+}
+
+/*
+ *---------------------
+ * vbus_buses
+ *---------------------
+ */
+
+static struct vbus *node_to_bus(struct rb_node *node)
+{
+ return node ? container_of(node, struct vbus, node) : NULL;
+}
+
+static int bus_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+ struct vbus *lbus = node_to_bus(lhs);
+ struct vbus *rbus = node_to_bus(rhs);
+
+ return strcmp(lbus->kobj.name, rbus->kobj.name);
+}
+
+static int bus_key_compare(const void *key, struct rb_node *node)
+{
+ struct vbus *bus = node_to_bus(node);
+
+ return strcmp(key, bus->kobj.name);
+}
+
+static struct map_ops bus_map_ops = {
+ .key_compare = &bus_key_compare,
+ .item_compare = &bus_item_compare,
+};
+
+struct vbus *vbus_find(const char *name)
+{
+ struct vbus *bus;
+
+ mutex_lock(&vbus_root.lock);
+
+ bus = node_to_bus(map_find(&vbus_root.buses.map, name));
+ if (!bus)
+ goto out;
+
+ vbus_get(bus);
+
+out:
+ mutex_unlock(&vbus_root.lock);
+
+	return bus;
+}
+
+struct vbus_root vbus_root;
+
+static ssize_t version_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "%d\n", VBUS_VERSION);
+}
+
+static struct kobj_attribute version_attr =
+ __ATTR(version, S_IRUGO, version_show, NULL);
+
+static int __init vbus_init(void)
+{
+ int ret;
+
+ mutex_init(&vbus_root.lock);
+
+ ret = vbus_config_init();
+ BUG_ON(ret < 0);
+
+ vbus_root.kobj = kobject_create_and_add("vbus", NULL);
+ BUG_ON(!vbus_root.kobj);
+
+ ret = sysfs_create_file(vbus_root.kobj, &version_attr.attr);
+ BUG_ON(ret);
+
+ ret = vbus_devclass_init();
+ BUG_ON(ret < 0);
+
+ map_init(&vbus_root.buses.map, &bus_map_ops);
+ vbus_root.buses.kobj = kobject_create_and_add("instances",
+ vbus_root.kobj);
+ BUG_ON(!vbus_root.buses.kobj);
+
+ vbus_root.devices.kobj = kobject_create_and_add("devices",
+ vbus_root.kobj);
+ BUG_ON(!vbus_root.devices.kobj);
+
+ return 0;
+}
+
+late_initcall(vbus_init);
diff --git a/kernel/vbus/devclass.c b/kernel/vbus/devclass.c
new file mode 100644
index 0000000..3f5ef0d
--- /dev/null
+++ b/kernel/vbus/devclass.c
@@ -0,0 +1,124 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus.h>
+
+#include "vbus.h"
+
+static struct vbus_devclass *node_to_devclass(struct rb_node *node)
+{
+ return node ? container_of(node, struct vbus_devclass, node) : NULL;
+}
+
+static int devclass_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+ struct vbus_devclass *ldc = node_to_devclass(lhs);
+ struct vbus_devclass *rdc = node_to_devclass(rhs);
+
+ return strcmp(ldc->name, rdc->name);
+}
+
+static int devclass_key_compare(const void *key, struct rb_node *node)
+{
+ struct vbus_devclass *dc = node_to_devclass(node);
+
+ return strcmp((const char *)key, dc->name);
+}
+
+static struct map_ops devclass_map_ops = {
+ .key_compare = &devclass_key_compare,
+ .item_compare = &devclass_item_compare,
+};
+
+int __init vbus_devclass_init(void)
+{
+ struct vbus_devclasses *c = &vbus_root.devclasses;
+
+ map_init(&c->map, &devclass_map_ops);
+
+ c->kobj = kobject_create_and_add("deviceclass", vbus_root.kobj);
+ BUG_ON(!c->kobj);
+
+ return 0;
+}
+
+static void devclass_release(struct kobject *kobj)
+{
+ struct vbus_devclass *dc = container_of(kobj,
+ struct vbus_devclass,
+ kobj);
+
+ if (dc->ops->release)
+ dc->ops->release(dc);
+}
+
+static struct kobj_type devclass_ktype = {
+ .release = devclass_release,
+};
+
+int vbus_devclass_register(struct vbus_devclass *dc)
+{
+ int ret;
+
+ mutex_lock(&vbus_root.lock);
+
+ ret = map_add(&vbus_root.devclasses.map, &dc->node);
+ if (ret < 0)
+ goto out;
+
+ ret = kobject_init_and_add(&dc->kobj, &devclass_ktype,
+ vbus_root.devclasses.kobj, dc->name);
+ if (ret < 0) {
+ map_del(&vbus_root.devclasses.map, &dc->node);
+ goto out;
+ }
+
+out:
+ mutex_unlock(&vbus_root.lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_devclass_register);
+
+int vbus_devclass_unregister(struct vbus_devclass *dc)
+{
+ mutex_lock(&vbus_root.lock);
+ map_del(&vbus_root.devclasses.map, &dc->node);
+ mutex_unlock(&vbus_root.lock);
+
+ kobject_put(&dc->kobj);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vbus_devclass_unregister);
+
+struct vbus_devclass *vbus_devclass_find(const char *name)
+{
+ struct vbus_devclass *dev;
+
+ mutex_lock(&vbus_root.lock);
+ dev = node_to_devclass(map_find(&vbus_root.devclasses.map, name));
+ if (dev)
+ dev = vbus_devclass_get(dev);
+ mutex_unlock(&vbus_root.lock);
+
+ return dev;
+}
diff --git a/kernel/vbus/map.c b/kernel/vbus/map.c
new file mode 100644
index 0000000..a3bd841
--- /dev/null
+++ b/kernel/vbus/map.c
@@ -0,0 +1,72 @@
+
+#include <linux/errno.h>
+
+#include "map.h"
+
+void map_init(struct map *map, struct map_ops *ops)
+{
+ map->root = RB_ROOT;
+ map->ops = ops;
+}
+
+int map_add(struct map *map, struct rb_node *node)
+{
+ int ret = 0;
+ struct rb_root *root;
+ struct rb_node **new, *parent = NULL;
+
+ root = &map->root;
+ new = &(root->rb_node);
+
+ /* Figure out where to put new node */
+ while (*new) {
+ int val;
+
+ parent = *new;
+
+ val = map->ops->item_compare(node, *new);
+ if (val < 0)
+ new = &((*new)->rb_left);
+ else if (val > 0)
+ new = &((*new)->rb_right);
+ else {
+ ret = -EEXIST;
+ break;
+ }
+ }
+
+ if (!ret) {
+ /* Add new node and rebalance tree. */
+ rb_link_node(node, parent, new);
+ rb_insert_color(node, root);
+ }
+
+ return ret;
+}
+
+struct rb_node *map_find(struct map *map, const void *key)
+{
+ struct rb_node *node;
+
+ node = map->root.rb_node;
+
+ while (node) {
+ int val;
+
+ val = map->ops->key_compare(key, node);
+ if (val < 0)
+ node = node->rb_left;
+ else if (val > 0)
+ node = node->rb_right;
+ else
+ break;
+ }
+
+ return node;
+}
+
+void map_del(struct map *map, struct rb_node *node)
+{
+ rb_erase(node, &map->root);
+}
diff --git a/kernel/vbus/map.h b/kernel/vbus/map.h
new file mode 100644
index 0000000..7fb5164
--- /dev/null
+++ b/kernel/vbus/map.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __VBUS_MAP_H__
+#define __VBUS_MAP_H__
+
+#include <linux/rbtree.h>
+
+struct map_ops {
+ int (*item_compare)(struct rb_node *lhs, struct rb_node *rhs);
+ int (*key_compare)(const void *key, struct rb_node *item);
+};
+
+struct map {
+ struct rb_root root;
+ struct map_ops *ops;
+};
+
+void map_init(struct map *map, struct map_ops *ops);
+int map_add(struct map *map, struct rb_node *node);
+struct rb_node *map_find(struct map *map, const void *key);
+void map_del(struct map *map, struct rb_node *node);
+
+#endif /* __VBUS_MAP_H__ */
diff --git a/kernel/vbus/vbus.h b/kernel/vbus/vbus.h
new file mode 100644
index 0000000..1266d69
--- /dev/null
+++ b/kernel/vbus/vbus.h
@@ -0,0 +1,116 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __VBUS_H__
+#define __VBUS_H__
+
+#include <linux/configfs.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/kobject.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+
+#include "map.h"
+
+#define VBUS_VERSION 1
+
+struct vbus_subdir {
+ struct map map;
+ struct kobject kobj;
+};
+
+struct vbus {
+ struct {
+ struct config_group group;
+ struct config_group perms;
+ struct config_group *defgroups[2];
+ } ci;
+
+ atomic_t refs;
+ struct mutex lock;
+ struct kobject kobj;
+ struct vbus_subdir devices;
+ struct vbus_subdir members;
+ unsigned long next_id;
+ struct rb_node node;
+};
+
+struct vbus_member {
+ struct rb_node node;
+ struct task_struct *tsk;
+ struct vbus *vbus;
+ struct kobject kobj;
+};
+
+struct vbus_devclasses {
+ struct kobject *kobj;
+ struct map map;
+};
+
+struct vbus_buses {
+ struct config_group ci_group;
+ struct map map;
+ struct kobject *kobj;
+};
+
+struct vbus_devshell {
+ struct config_group ci_group;
+ struct vbus_device *dev;
+ struct vbus_devclass *dc;
+ struct kobject kobj;
+ struct kobject intfs;
+};
+
+struct vbus_devices {
+ struct config_group ci_group;
+ struct kobject *kobj;
+};
+
+struct vbus_root {
+ struct {
+ struct configfs_subsystem subsys;
+ struct config_group *defgroups[3];
+ } ci;
+
+ struct mutex lock;
+ struct kobject *kobj;
+ struct vbus_devclasses devclasses;
+ struct vbus_buses buses;
+ struct vbus_devices devices;
+};
+
+extern struct vbus_root vbus_root;
+extern struct sysfs_ops vbus_dev_attr_ops;
+
+int vbus_config_init(void);
+int vbus_devclass_init(void);
+
+int vbus_create(const char *name, struct vbus **bus);
+
+int vbus_devshell_create(const char *name, struct vbus_devshell **ds);
+struct vbus_devclass *vbus_devclass_find(const char *name);
+int vbus_devshell_type_set(struct vbus_devshell *ds);
+
+long vbus_interface_find(struct vbus *vbus,
+ unsigned long id,
+ struct vbus_device_interface **intf);
+
+#endif /* __VBUS_H__ */
Signed-off-by: Gregory Haskins <[email protected]>
---
drivers/net/Kconfig | 13 +
drivers/net/Makefile | 1
drivers/net/vbus-enet.c | 680 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 694 insertions(+), 0 deletions(-)
create mode 100644 drivers/net/vbus-enet.c
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 62d732a..ac9dabd 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3099,4 +3099,17 @@ config VIRTIO_NET
This is the virtual network driver for virtio. It can be used with
lguest or QEMU based VMMs (like KVM or Xen). Say Y or M.
+config VBUS_ENET
+ tristate "Virtual Ethernet Driver"
+ depends on VBUS_DRIVERS
+ help
+ A virtualized 802.x network device based on the VBUS interface.
+ It can be used with any hypervisor/kernel that supports the
+ vbus protocol.
+
+config VBUS_ENET_DEBUG
+ bool "Enable Debugging"
+ depends on VBUS_ENET
+ default n
+
endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 471baaf..61db928 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -264,6 +264,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
obj-$(CONFIG_NETXEN_NIC) += netxen/
obj-$(CONFIG_NIU) += niu.o
obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
obj-$(CONFIG_SFC) += sfc/
obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..3779f77
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,680 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <[email protected]>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+static int napi_weight = 128;
+module_param(napi_weight, int, 0444);
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+
+#undef PDEBUG /* undef it, just in case */
+#ifdef CONFIG_VBUS_ENET_DEBUG
+# define PDEBUG(fmt, args...) printk(KERN_DEBUG "vbus_enet: " fmt, ## args)
+#else
+# define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+struct vbus_enet_queue {
+ struct ioq *queue;
+ struct ioq_notifier notifier;
+};
+
+struct vbus_enet_priv {
+ spinlock_t lock;
+ struct net_device *dev;
+ struct vbus_device_proxy *vdev;
+ struct napi_struct napi;
+ struct vbus_enet_queue rxq;
+ struct vbus_enet_queue txq;
+ struct tasklet_struct txtask;
+};
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+ return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+ struct vbus_enet_queue *q,
+ int qid,
+ size_t ringsize,
+ void (*func)(struct ioq_notifier *))
+{
+ struct vbus_device_proxy *dev = priv->vdev;
+ int ret;
+
+ ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+ if (ret < 0)
+ panic("ioq_alloc failed: %d\n", ret);
+
+ if (func) {
+ q->notifier.signal = func;
+ q->queue->notifier = &q->notifier;
+ }
+
+ return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+ struct vbus_device_proxy *dev = priv->vdev;
+
+ return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct ioq_ring_desc *desc, size_t len)
+{
+ struct sk_buff *skb;
+
+ len += ETH_HLEN;
+
+ skb = dev_alloc_skb(len + 2);
+ BUG_ON(!skb);
+
+ skb_reserve(skb, 2); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)(unsigned long)skb;
+ desc->ptr = (u64)__pa(skb->data);
+ desc->len = len; /* total length */
+ desc->valid = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+ struct ioq *ioq = priv->rxq.queue;
+ struct ioq_iterator iter;
+ int ret;
+
+ /*
+ * We want to iterate on the "valid" index. By default the iterator
+ * will not "autoupdate" which means it will not hypercall the host
+ * with our changes. This is good, because we are really just
+ * initializing stuff here anyway. Note that you can always manually
+ * signal the host with ioq_signal() if the autoupdate feature is not
+ * used.
+ */
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Seek to the tail of the valid index (which should be our first
+ * item, since the queue is brand-new)
+ */
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Now populate each descriptor with an empty SKB and mark it valid
+ */
+ while (!iter.desc->valid) {
+ rxdesc_alloc(iter.desc, priv->dev->mtu);
+
+ /*
+ * This push operation will simultaneously advance the
+ * valid-head index and increment our position in the queue
+ * by one.
+ */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+}
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+ struct ioq *ioq = priv->rxq.queue;
+ struct ioq_iterator iter;
+ int ret;
+
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * free each valid descriptor
+ */
+ while (iter.desc->valid) {
+		struct sk_buff *skb =
+			(struct sk_buff *)(unsigned long)iter.desc->cookie;
+
+ iter.desc->valid = 0;
+ wmb();
+
+ iter.desc->ptr = 0;
+ iter.desc->cookie = 0;
+
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+
+ dev_kfree_skb(skb);
+ }
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+ struct vbus_enet_priv *priv = netdev_priv(dev);
+ int ret;
+
+ ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+ BUG_ON(ret < 0);
+
+ napi_enable(&priv->napi);
+
+ return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+ struct vbus_enet_priv *priv = netdev_priv(dev);
+ int ret;
+
+ napi_disable(&priv->napi);
+
+ ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+ BUG_ON(ret < 0);
+
+ return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+ if (dev->flags & IFF_UP) /* can't act on a running interface */
+ return -EBUSY;
+
+ /* Don't allow changing the I/O address */
+ if (map->base_addr != dev->base_addr) {
+ printk(KERN_WARNING "vbus_enet: Can't change I/O address\n");
+ return -EOPNOTSUPP;
+ }
+
+ /* ignore other fields */
+ return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (netif_rx_schedule_prep(&priv->napi)) {
+ /* Disable further interrupts */
+ ioq_notify_disable(priv->rxq.queue, 0);
+ __netif_rx_schedule(&priv->napi);
+ }
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+ struct vbus_enet_priv *priv = netdev_priv(dev);
+ int ret;
+
+ dev->mtu = new_mtu;
+
+ /*
+ * FLUSHRX will cause the device to flush any outstanding
+ * RX buffers. They will appear to come in as 0 length
+ * packets which we can simply discard, and then replace with
+ * buffers sized for the new MTU.
+ */
+ ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+ BUG_ON(ret < 0);
+
+ vbus_enet_schedule_rx(priv);
+
+ return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+ struct vbus_enet_priv *priv = napi_to_priv(napi);
+ int npackets = 0;
+ struct ioq_iterator iter;
+ int ret;
+
+ PDEBUG("%lld: polling...\n", priv->vdev->id);
+
+ /* We want to iterate on the head of the in-use index */
+ ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+ IOQ_ITER_AUTOUPDATE);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * We stop if we have met the quota or there are no more packets.
+ * The EOM is indicated by finding a packet that is still owned by
+ * the south side
+ */
+ while ((npackets < budget) && (!iter.desc->sown)) {
+ struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+ if (iter.desc->len) {
+ skb_put(skb, iter.desc->len);
+
+ /* Maintain stats */
+ npackets++;
+ priv->dev->stats.rx_packets++;
+ priv->dev->stats.rx_bytes += iter.desc->len;
+
+ /* Pass the buffer up to the stack */
+ skb->dev = priv->dev;
+ skb->protocol = eth_type_trans(skb, priv->dev);
+ netif_receive_skb(skb);
+
+ mb();
+ } else
+ /*
+ * the device may send a zero-length packet when it is
+ * flushing references on the ring. We can just drop
+ * these on the floor
+ */
+ dev_kfree_skb(skb);
+
+ /* Grab a new buffer to put in the ring */
+ rxdesc_alloc(iter.desc, priv->dev->mtu);
+
+ /* Advance the in-use tail */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ PDEBUG("%lld poll: %d packets received\n", priv->vdev->id, npackets);
+
+ /*
+ * If we processed all packets, we're done; tell the kernel and
+ * reenable ints
+ */
+ if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+ netif_rx_complete(napi);
+ ioq_notify_enable(priv->rxq.queue, 0);
+ ret = 0;
+ } else
+ /* We couldn't process everything. */
+ ret = 1;
+
+ return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+ struct vbus_enet_priv *priv = netdev_priv(dev);
+ struct ioq_iterator iter;
+ int ret;
+ unsigned long flags;
+
+ PDEBUG("%lld: sending %d bytes\n", priv->vdev->id, skb->len);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+ /*
+ * We must flow-control the kernel by disabling the
+ * queue
+ */
+ spin_unlock_irqrestore(&priv->lock, flags);
+ netif_stop_queue(dev);
+ printk(KERN_ERR "VBUS_ENET: tx on full queue bug on device %lld\n",
+ priv->vdev->id);
+ return NETDEV_TX_BUSY;
+ }
+
+ /*
+ * We want to iterate on the tail of both the "inuse" and "valid" index
+ * so we specify the "both" index
+ */
+ ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_both,
+ IOQ_ITER_AUTOUPDATE);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+ BUG_ON(iter.desc->sown);
+
+ /*
+ * We simply put the skb right onto the ring. We will get an interrupt
+ * later when the data has been consumed and we can reap the pointers
+ * at that time
+ */
+ iter.desc->cookie = (u64)skb;
+ iter.desc->len = (u64)skb->len;
+ iter.desc->ptr = (u64)__pa(skb->data);
+ iter.desc->valid = 1;
+
+ priv->dev->stats.tx_packets++;
+ priv->dev->stats.tx_bytes += skb->len;
+
+ /*
+ * This advances both indexes together implicitly, and then
+ * signals the south side to consume the packet
+ */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+
+ dev->trans_start = jiffies; /* save the timestamp */
+
+ if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+ /*
+ * If the queue is congested, we must flow-control the kernel
+ */
+ PDEBUG("%lld: backpressure tx queue\n", priv->vdev->id);
+ netif_stop_queue(dev);
+ }
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return NETDEV_TX_OK;
+}
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
+{
+ struct ioq_iterator iter;
+ int ret;
+
+ /*
+ * We want to iterate on the head of the valid index, but we
+ * do not want the iter_pop (below) to flip the ownership, so
+ * we set the NOFLIPOWNER option
+ */
+ ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid,
+ IOQ_ITER_NOFLIPOWNER);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * We are done once we find the first packet either invalid or still
+ * owned by the south-side
+ */
+ while (iter.desc->valid && (!iter.desc->sown || force)) {
+ struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+ PDEBUG("%lld: completed sending %d bytes\n",
+ priv->vdev->id, skb->len);
+
+ /* Reset the descriptor */
+ iter.desc->valid = 0;
+
+ dev_kfree_skb(skb);
+
+ /* Advance the valid-index head */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ /*
+ * If we were previously stopped due to flow control, restart the
+ * processing
+ */
+ if (netif_queue_stopped(priv->dev)
+ && !ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+ PDEBUG("%lld: re-enabling tx queue\n", priv->vdev->id);
+ netif_wake_queue(priv->dev);
+ }
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+ struct vbus_enet_priv *priv = netdev_priv(dev);
+ unsigned long flags;
+
+ printk(KERN_DEBUG "VBUS_ENET %lld: Transmit timeout\n", priv->vdev->id);
+
+ spin_lock_irqsave(&priv->lock, flags);
+ vbus_enet_tx_reap(priv, 0);
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+ struct vbus_enet_priv *priv;
+ struct net_device *dev;
+
+ priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+ dev = priv->dev;
+
+ if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+ vbus_enet_schedule_rx(priv);
+}
+
+static void
+deferred_tx_isr(unsigned long data)
+{
+ struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+ unsigned long flags;
+
+ PDEBUG("deferred_tx_isr for %lld\n", priv->vdev->id);
+
+ spin_lock_irqsave(&priv->lock, flags);
+ vbus_enet_tx_reap(priv, 0);
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ ioq_notify_enable(priv->txq.queue, 0);
+}
+
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+ struct vbus_enet_priv *priv;
+ unsigned long flags;
+
+ priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
+
+ PDEBUG("tx_isr for %lld\n", priv->vdev->id);
+
+ ioq_notify_disable(priv->txq.queue, 0);
+ tasklet_schedule(&priv->txtask);
+}
+
+static const struct net_device_ops vbus_enet_netdev_ops = {
+ .ndo_open = vbus_enet_open,
+ .ndo_stop = vbus_enet_stop,
+ .ndo_set_config = vbus_enet_config,
+ .ndo_start_xmit = vbus_enet_tx_start,
+ .ndo_change_mtu = vbus_enet_change_mtu,
+ .ndo_tx_timeout = vbus_enet_timeout,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+ struct net_device *dev;
+ struct vbus_enet_priv *priv;
+ int ret;
+
+ printk(KERN_INFO "VBUS_ENET: Found new device at %lld\n", vdev->id);
+
+ ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+ if (ret < 0)
+ return ret;
+
+ dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+ if (!dev)
+ return -ENOMEM;
+
+ priv = netdev_priv(dev);
+
+ spin_lock_init(&priv->lock);
+ priv->dev = dev;
+ priv->vdev = vdev;
+
+ tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
+
+ queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+ queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+ rx_setup(priv);
+
+ ioq_notify_enable(priv->rxq.queue, 0); /* enable interrupts */
+ ioq_notify_enable(priv->txq.queue, 0);
+
+ dev->netdev_ops = &vbus_enet_netdev_ops;
+ dev->watchdog_timeo = 5 * HZ;
+
+ netif_napi_add(dev, &priv->napi, vbus_enet_poll, napi_weight);
+
+ ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+ if (ret < 0) {
+ printk(KERN_INFO
+ "VENET: Error obtaining MAC address for %lld\n",
+ priv->vdev->id);
+ goto out_free;
+ }
+
+ dev->features |= NETIF_F_HIGHDMA;
+
+ ret = register_netdev(dev);
+ if (ret < 0) {
+ printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+ ret, dev->name);
+ goto out_free;
+ }
+
+ vdev->priv = priv;
+
+ return 0;
+
+ out_free:
+ free_netdev(dev);
+
+ return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+ struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+ struct vbus_device_proxy *dev = priv->vdev;
+
+ unregister_netdev(priv->dev);
+ napi_disable(&priv->napi);
+
+ rx_teardown(priv);
+ vbus_enet_tx_reap(priv, 1);
+
+ ioq_put(priv->rxq.queue);
+ ioq_put(priv->txq.queue);
+
+ dev->ops->close(dev, 0);
+
+ free_netdev(priv->dev);
+
+ return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+ .probe = vbus_enet_probe,
+ .remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+ .type = VENET_TYPE,
+ .owner = THIS_MODULE,
+ .ops = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+ printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+ printk(KERN_DEBUG "VBUSENET: Using %d/%d queue depth\n",
+ rx_ringlen, tx_ringlen);
+ return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+ vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);
It will be common to map an IOQ over the VBUS shared-memory interfaces,
so let's generalize their setup so that we can reuse the pattern.
Signed-off-by: Gregory Haskins <[email protected]>
---
include/linux/vbus_device.h | 7 +++
include/linux/vbus_driver.h | 7 +++
kernel/vbus/Kconfig | 2 +
kernel/vbus/Makefile | 1
kernel/vbus/proxy.c | 64 +++++++++++++++++++++++++++++++
kernel/vbus/shm-ioq.c | 89 +++++++++++++++++++++++++++++++++++++++++++
6 files changed, 170 insertions(+), 0 deletions(-)
create mode 100644 kernel/vbus/shm-ioq.c
diff --git a/include/linux/vbus_device.h b/include/linux/vbus_device.h
index f73cd86..c91bce4 100644
--- a/include/linux/vbus_device.h
+++ b/include/linux/vbus_device.h
@@ -102,6 +102,7 @@
#include <linux/configfs.h>
#include <linux/rbtree.h>
#include <linux/shm_signal.h>
+#include <linux/ioq.h>
#include <linux/vbus.h>
#include <asm/atomic.h>
@@ -414,4 +415,10 @@ static inline void vbus_connection_put(struct vbus_connection *conn)
conn->ops->release(conn);
}
+/*
+ * device-side IOQ helper - dereferences device-shm as an IOQ
+ */
+int vbus_shm_ioq_attach(struct vbus_shm *shm, struct shm_signal *signal,
+ int maxcount, struct ioq **ioq);
+
#endif /* _LINUX_VBUS_DEVICE_H */
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
index c53e13f..9cfbf60 100644
--- a/include/linux/vbus_driver.h
+++ b/include/linux/vbus_driver.h
@@ -26,6 +26,7 @@
#include <linux/device.h>
#include <linux/shm_signal.h>
+#include <linux/ioq.h>
struct vbus_device_proxy;
struct vbus_driver;
@@ -70,4 +71,10 @@ struct vbus_driver {
int vbus_driver_register(struct vbus_driver *drv);
void vbus_driver_unregister(struct vbus_driver *drv);
+/*
+ * driver-side IOQ helper - allocates device-shm and maps an IOQ on it
+ */
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+ size_t ringsize, struct ioq **ioq);
+
#endif /* _LINUX_VBUS_DRIVER_H */
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index 3aaa085..71acd6f 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -6,6 +6,7 @@ config VBUS
bool "Virtual Bus"
select CONFIGFS_FS
select SHM_SIGNAL
+ select IOQ
default n
help
Provides a mechansism for declaring virtual-bus objects and binding
@@ -15,6 +16,7 @@ config VBUS
config VBUS_DRIVERS
tristate "VBUS Driver support"
+ select IOQ
default n
help
Adds support for a virtual bus model for proxying drivers.
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index d028ece..45f6503 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
+obj-$(CONFIG_VBUS) += shm-ioq.o
vbus-proxy-objs += proxy.o
obj-$(CONFIG_VBUS_DRIVERS) += vbus-proxy.o
diff --git a/kernel/vbus/proxy.c b/kernel/vbus/proxy.c
index ea48f00..75b0cb1 100644
--- a/kernel/vbus/proxy.c
+++ b/kernel/vbus/proxy.c
@@ -150,3 +150,67 @@ void vbus_driver_unregister(struct vbus_driver *drv)
}
EXPORT_SYMBOL_GPL(vbus_driver_unregister);
+/*
+ *---------------------------------
+ * driver-side IOQ helper
+ *---------------------------------
+ */
+static void
+vbus_driver_ioq_release(struct ioq *ioq)
+{
+ kfree(ioq->head_desc);
+ kfree(ioq);
+}
+
+static struct ioq_ops vbus_driver_ioq_ops = {
+ .release = vbus_driver_ioq_release,
+};
+
+
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+ size_t count, struct ioq **ioq)
+{
+ struct ioq *_ioq;
+ struct ioq_ring_head *head = NULL;
+ struct shm_signal *signal = NULL;
+ size_t len = IOQ_HEAD_DESC_SIZE(count);
+ int ret = -ENOMEM;
+
+ _ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+ if (!_ioq)
+ goto error;
+
+ head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+ if (!head)
+ goto error;
+
+ head->magic = IOQ_RING_MAGIC;
+ head->ver = IOQ_RING_VER;
+ head->count = count;
+
+ ret = dev->ops->shm(dev, id, prio, head, len,
+ &head->signal, &signal, 0);
+ if (ret < 0)
+ goto error;
+
+ ioq_init(_ioq,
+ &vbus_driver_ioq_ops,
+ ioq_locality_north,
+ head,
+ signal,
+ count);
+
+ *ioq = _ioq;
+
+ return 0;
+
+ error:
+ kfree(_ioq);
+ kfree(head);
+
+ if (signal)
+ shm_signal_put(signal);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_driver_ioq_alloc);
diff --git a/kernel/vbus/shm-ioq.c b/kernel/vbus/shm-ioq.c
new file mode 100644
index 0000000..a627337
--- /dev/null
+++ b/kernel/vbus/shm-ioq.c
@@ -0,0 +1,89 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * IOQ helper for devices - This module implements an IOQ which has
+ * been shared with a device via a vbus_shm segment.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/ioq.h>
+#include <linux/vbus_device.h>
+
+struct _ioq {
+ struct vbus_shm *shm;
+ struct ioq ioq;
+};
+
+static void
+_shm_ioq_release(struct ioq *ioq)
+{
+ struct _ioq *_ioq = container_of(ioq, struct _ioq, ioq);
+
+ /* the signal is released by the IOQ infrastructure */
+ vbus_shm_put(_ioq->shm);
+ kfree(_ioq);
+}
+
+static struct ioq_ops _shm_ioq_ops = {
+ .release = _shm_ioq_release,
+};
+
+int vbus_shm_ioq_attach(struct vbus_shm *shm, struct shm_signal *signal,
+ int maxcount, struct ioq **ioq)
+{
+ struct _ioq *_ioq;
+ struct ioq_ring_head *head = NULL;
+ size_t ringcount;
+
+ if (!signal)
+ return -EINVAL;
+
+ head = (struct ioq_ring_head *)shm->ptr;
+
+ if (head->magic != IOQ_RING_MAGIC)
+ return -EINVAL;
+
+ if (head->ver != IOQ_RING_VER)
+ return -EINVAL;
+
+ ringcount = head->count;
+
+ if ((maxcount != -1) && (ringcount > maxcount))
+ return -EINVAL;
+
+ /*
+ * Sanity check the ringcount against the actual length of the segment
+ */
+ if (IOQ_HEAD_DESC_SIZE(ringcount) != shm->len)
+ return -EINVAL;
+
+ _ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+ if (!_ioq)
+ return -ENOMEM;
+
+ _ioq->shm = shm;
+
+ ioq_init(&_ioq->ioq, &_shm_ioq_ops, ioq_locality_south, head,
+ signal, ringcount);
+
+ *ioq = &_ioq->ioq;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vbus_shm_ioq_attach);
+
Signed-off-by: Gregory Haskins <[email protected]>
---
drivers/vbus/devices/venet-tap.c | 235 +++++++++++++++++++++++++++++++++++++-
1 files changed, 228 insertions(+), 7 deletions(-)
diff --git a/drivers/vbus/devices/venet-tap.c b/drivers/vbus/devices/venet-tap.c
index 148e2c8..a1f2dc6 100644
--- a/drivers/vbus/devices/venet-tap.c
+++ b/drivers/vbus/devices/venet-tap.c
@@ -81,6 +81,13 @@ enum {
TX_IOQ_CONGESTED,
};
+struct venettap;
+
+struct venettap_rx_ops {
+ int (*decode)(struct venettap *priv, void *ptr, int len);
+ int (*import)(struct venettap *, struct sk_buff *, void *, int);
+};
+
struct venettap {
spinlock_t lock;
unsigned char hmac[ETH_ALEN]; /* host-mac */
@@ -108,7 +115,13 @@ struct venettap {
struct vbus_memctx *ctx;
struct venettap_queue rxq;
struct venettap_queue txq;
+ struct venettap_rx_ops *rx_ops;
wait_queue_head_t rx_empty;
+ struct {
+ struct venet_sg *desc;
+ size_t len;
+ int enabled:1;
+ } sg;
int connected:1;
int opened:1;
int link:1;
@@ -290,6 +303,183 @@ venettap_change_mtu(struct net_device *dev, int new_mtu)
}
/*
+ * ---------------------------
+ * Scatter-Gather support
+ * ---------------------------
+ */
+
+/* assumes reference to priv->vbus.conn held */
+static int
+venettap_sg_decode(struct venettap *priv, void *ptr, int len)
+{
+ struct venet_sg *vsg;
+ struct vbus_memctx *ctx;
+ int ret;
+
+ /*
+ * SG is enabled, so we need to pull in the venet_sg
+ * header before we can interpret the rest of the
+ * packet
+ *
+ * FIXME: Make sure this is not too big
+ */
+ if (unlikely(len > priv->vbus.sg.len)) {
+ kfree(priv->vbus.sg.desc);
+ priv->vbus.sg.desc = kzalloc(len, GFP_KERNEL);
+ if (!priv->vbus.sg.desc)
+ return -ENOMEM;
+ priv->vbus.sg.len = len;
+ }
+
+ vsg = priv->vbus.sg.desc;
+ ctx = priv->vbus.ctx;
+
+ ret = ctx->ops->copy_from(ctx, vsg, ptr, len);
+ BUG_ON(ret);
+
+ /*
+ * Non GSO type packets should be constrained by the MTU setting
+ * on the host
+ */
+ if (!(vsg->flags & VENET_SG_FLAG_GSO)
+ && (vsg->len > (priv->netif.dev->mtu + ETH_HLEN)))
+ return -1;
+
+ return vsg->len;
+}
+
+/*
+ * venettap_sg_import - import an skb in scatter-gather mode
+ *
+ * assumes reference to priv->vbus.conn held
+ */
+static int
+venettap_sg_import(struct venettap *priv, struct sk_buff *skb,
+ void *ptr, int len)
+{
+ struct venet_sg *vsg = priv->vbus.sg.desc;
+ struct vbus_memctx *ctx = priv->vbus.ctx;
+ int remain = len;
+ int ret;
+ int i;
+
+ PDEBUG("Importing %d bytes in %d segments\n", len, vsg->count);
+
+ for (i = 0; i < vsg->count; i++) {
+ struct venet_iov *iov = &vsg->iov[i];
+
+ if (remain < iov->len)
+ return -EINVAL;
+
+ PDEBUG("Segment %d: %p/%d\n", i, iov->ptr, iov->len);
+
+ ret = ctx->ops->copy_from(ctx, skb_tail_pointer(skb),
+ (void *)iov->ptr,
+ iov->len);
+ if (ret)
+ return -EFAULT;
+
+ skb_put(skb, iov->len);
+ remain -= iov->len;
+ }
+
+ if (vsg->flags & VENET_SG_FLAG_NEEDS_CSUM
+ && !skb_partial_csum_set(skb, vsg->csum.start, vsg->csum.offset))
+ return -EINVAL;
+
+ if (vsg->flags & VENET_SG_FLAG_GSO) {
+ struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+ PDEBUG("GSO packet detected\n");
+
+ switch (vsg->gso.type) {
+ case VENET_GSO_TYPE_TCPV4:
+ sinfo->gso_type = SKB_GSO_TCPV4;
+ break;
+ case VENET_GSO_TYPE_TCPV6:
+ sinfo->gso_type = SKB_GSO_TCPV6;
+ break;
+ case VENET_GSO_TYPE_UDP:
+ sinfo->gso_type = SKB_GSO_UDP;
+ break;
+ default:
+ PDEBUG("Illegal GSO type: %d\n", vsg->gso.type);
+ priv->netif.stats.rx_frame_errors++;
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+
+ if (vsg->flags & VENET_SG_FLAG_ECN)
+ sinfo->gso_type |= SKB_GSO_TCP_ECN;
+
+ sinfo->gso_size = vsg->gso.size;
+ if (skb_shinfo(skb)->gso_size == 0) {
+ PDEBUG("Illegal GSO size: %d\n", vsg->gso.size);
+ priv->netif.stats.rx_frame_errors++;
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+
+ /* Header must be checked, and gso_segs computed. */
+ skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+ skb_shinfo(skb)->gso_segs = 0;
+ }
+
+ return 0;
+}
+
+static struct venettap_rx_ops venettap_sg_rx_ops = {
+ .decode = venettap_sg_decode,
+ .import = venettap_sg_import,
+};
+
+/*
+ * ---------------------------
+ * Flat (non Scatter-Gather) support
+ * ---------------------------
+ */
+
+/* assumes reference to priv->vbus.conn held */
+static int
+venettap_flat_decode(struct venettap *priv, void *ptr, int len)
+{
+ size_t maxlen = priv->netif.dev->mtu + ETH_HLEN;
+
+ if (len > maxlen)
+ return -1;
+
+ /*
+ * If SG is *not* enabled, the length is simply the
+ * descriptor length
+ */
+
+ return len;
+}
+
+/*
+ * venettap_rx_flat - import an skb in non scatter-gather mode
+ *
+ * assumes reference to priv->vbus.conn held
+ */
+static int
+venettap_flat_import(struct venettap *priv, struct sk_buff *skb,
+ void *ptr, int len)
+{
+ struct vbus_memctx *ctx = priv->vbus.ctx;
+ int ret;
+
+ ret = ctx->ops->copy_from(ctx, skb_tail_pointer(skb), ptr, len);
+ if (ret)
+ return -EFAULT;
+
+ skb_put(skb, len);
+
+ return 0;
+}
+
+static struct venettap_rx_ops venettap_flat_rx_ops = {
+ .decode = venettap_flat_decode,
+ .import = venettap_flat_import,
+};
+
+/*
* The poll implementation.
*/
static int
@@ -303,6 +493,7 @@ venettap_rx(struct venettap *priv)
int ret;
unsigned long flags;
struct vbus_connection *conn;
+ struct venettap_rx_ops *rx_ops;
PDEBUG("polling...\n");
@@ -326,6 +517,8 @@ venettap_rx(struct venettap *priv)
ioq = priv->vbus.rxq.queue;
ctx = priv->vbus.ctx;
+ rx_ops = priv->vbus.rx_ops;
+
spin_unlock_irqrestore(&priv->lock, flags);
/* We want to iterate on the head of the in-use index */
@@ -340,11 +533,14 @@ venettap_rx(struct venettap *priv)
* the north side
*/
while (iter.desc->sown) {
- size_t len = iter.desc->len;
- size_t maxlen = priv->netif.dev->mtu + ETH_HLEN;
struct sk_buff *skb = NULL;
+ ssize_t len;
+
+ len = rx_ops->decode(priv,
+ (void *)iter.desc->ptr,
+ iter.desc->len);
- if (unlikely(len > maxlen)) {
+ if (unlikely(len < 0)) {
priv->netif.stats.rx_errors++;
priv->netif.stats.rx_length_errors++;
goto next;
@@ -362,10 +558,8 @@ venettap_rx(struct venettap *priv)
/* align IP on 16B boundary */
skb_reserve(skb, 2);
- ret = ctx->ops->copy_from(ctx, skb->data,
- (void *)iter.desc->ptr,
- len);
- if (unlikely(ret)) {
+ ret = rx_ops->import(priv, skb, (void *)iter.desc->ptr, len);
+ if (unlikely(ret < 0)) {
priv->netif.stats.rx_errors++;
goto next;
}
@@ -846,6 +1040,23 @@ venettap_macquery(struct venettap *priv, void *data, unsigned long len)
return 0;
}
+static u32
+venettap_negcap_sg(struct venettap *priv, u32 requested)
+{
+ u32 available = VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+ |VENET_CAP_ECN;
+ u32 ret;
+
+ ret = available & requested;
+
+ if (ret & VENET_CAP_SG) {
+ priv->vbus.sg.enabled = true;
+ priv->vbus.rx_ops = &venettap_sg_rx_ops;
+ }
+
+ return ret;
+}
+
/*
* Negotiate Capabilities - This function is provided so that the
* interface may be extended without breaking ABI compatability
@@ -873,6 +1084,9 @@ venettap_negcap(struct venettap *priv, void *data, unsigned long len)
return -EFAULT;
switch (caps.gid) {
+ case VENET_CAP_GROUP_SG:
+ caps.bits = venettap_negcap_sg(priv, caps.bits);
+ break;
default:
caps.bits = 0;
break;
@@ -1057,6 +1271,12 @@ venettap_vlink_release(struct vbus_connection *conn)
vbus_memctx_put(priv->vbus.ctx);
kobject_put(priv->vbus.dev.kobj);
+
+ priv->vbus.sg.enabled = false;
+ priv->vbus.rx_ops = &venettap_flat_rx_ops;
+ kfree(priv->vbus.sg.desc);
+ priv->vbus.sg.desc = NULL;
+ priv->vbus.sg.len = 0;
}
static struct vbus_connection_ops venettap_vbus_link_ops = {
@@ -1336,6 +1556,7 @@ venettap_device_create(struct vbus_devclass *dc,
_vdev->ops = &venettap_device_ops;
_vdev->attrs = &venettap_attr_group;
+ priv->vbus.rx_ops = &venettap_flat_rx_ops;
init_waitqueue_head(&priv->vbus.rx_empty);
/*
Later in the series we will need a way to detect when a VM is reset, so
let's add a capability for userspace to signal a VM reset down to the kernel.
Signed-off-by: Gregory Haskins <[email protected]>
---
arch/x86/kvm/x86.c | 1 +
include/linux/kvm.h | 2 ++
include/linux/kvm_host.h | 6 ++++++
virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++++++++++
4 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 758b7a1..9b0a649 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -971,6 +971,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_NOP_IO_DELAY:
case KVM_CAP_MP_STATE:
case KVM_CAP_SYNC_MMU:
+ case KVM_CAP_RESET:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 0424326..7ffd8f5 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -396,6 +396,7 @@ struct kvm_trace_rec {
#ifdef __KVM_HAVE_USER_NMI
#define KVM_CAP_USER_NMI 22
#endif
+#define KVM_CAP_RESET 23
/*
* ioctls for VM fds
@@ -429,6 +430,7 @@ struct kvm_trace_rec {
struct kvm_assigned_pci_dev)
#define KVM_ASSIGN_IRQ _IOR(KVMIO, 0x70, \
struct kvm_assigned_irq)
+#define KVM_RESET _IO(KVMIO, 0x67)
/*
* ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bf6f703..506eca1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -17,6 +17,7 @@
#include <linux/preempt.h>
#include <linux/marker.h>
#include <linux/msi.h>
+#include <linux/notifier.h>
#include <asm/signal.h>
#include <linux/kvm.h>
@@ -132,6 +133,8 @@ struct kvm {
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
#endif
+
+ struct raw_notifier_head reset_notifier; /* triggers when VM reboots */
};
/* The guest did something we don't support. */
@@ -158,6 +161,9 @@ void kvm_exit(void);
void kvm_get_kvm(struct kvm *kvm);
void kvm_put_kvm(struct kvm *kvm);
+int kvm_reset_notifier_register(struct kvm *kvm, struct notifier_block *nb);
+int kvm_reset_notifier_unregister(struct kvm *kvm, struct notifier_block *nb);
+
#define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
#define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 29a667c..fca2d25 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -868,6 +868,8 @@ static struct kvm *kvm_create_vm(void)
#ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
kvm_coalesced_mmio_init(kvm);
#endif
+ RAW_INIT_NOTIFIER_HEAD(&kvm->reset_notifier);
+
out:
return kvm;
}
@@ -1485,6 +1487,35 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
}
}
+static void kvm_notify_reset(struct kvm *kvm)
+{
+ mutex_lock(&kvm->lock);
+ raw_notifier_call_chain(&kvm->reset_notifier, 0, kvm);
+ mutex_unlock(&kvm->lock);
+}
+
+int kvm_reset_notifier_register(struct kvm *kvm, struct notifier_block *nb)
+{
+ int ret;
+
+ mutex_lock(&kvm->lock);
+ ret = raw_notifier_chain_register(&kvm->reset_notifier, nb);
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
+int kvm_reset_notifier_unregister(struct kvm *kvm, struct notifier_block *nb)
+{
+ int ret;
+
+ mutex_lock(&kvm->lock);
+ ret = raw_notifier_chain_unregister(&kvm->reset_notifier, nb);
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
/*
* The vCPU has executed a HLT instruction with in-kernel mode enabled.
*/
@@ -1929,6 +1960,11 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
#endif
+ case KVM_RESET: {
+ kvm_notify_reset(kvm);
+ r = 0;
+ break;
+ }
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
}
This patch provides the ability to dynamically declare and map an
interrupt-request handle to an x86 8-bit vector.
Problem Statement: Emulated devices (such as PCI, ISA, etc) have
interrupt routing done via standard PC mechanisms (MP-table, ACPI,
etc). However, we also want to support a new class of devices
which exist in a new virtualized namespace and therefore should
not try to piggyback on these emulated mechanisms. Rather, we
create a way to dynamically register interrupt resources that
act independently of the emulated counterpart.
On x86, a simplistic view of the interrupt model is that each core
has a local-APIC which can receive messages from APIC-compliant
routing devices (such as IO-APIC and MSI) regarding details about
an interrupt (such as which vector to raise). These routing devices
are controlled by the OS so they may translate a physical event
(such as "e1000: raise an RX interrupt") to a logical destination
(such as "inject IDT vector 46 on core 3"). A dynirq is a virtual
implementation of such a router (think of it as a virtual-MSI, but
without the coupling to an existing standard, such as PCI).
The model is simple: A guest OS can allocate the mapping of "IRQ"
handle to "vector/core" in any way it sees fit, and provide this
information to the dynirq module running in the host. The assigned
IRQ then becomes the sole handle needed to inject an IDT vector
to the guest from a host. A host entity that wishes to raise an
interrupt simply needs to call kvm_inject_dynirq(irq), and the routing
is performed transparently.
Signed-off-by: Gregory Haskins <[email protected]>
---
arch/x86/Kconfig | 5 +
arch/x86/Makefile | 3
arch/x86/include/asm/kvm_host.h | 9 +
arch/x86/include/asm/kvm_para.h | 11 +
arch/x86/kvm/Makefile | 3
arch/x86/kvm/dynirq.c | 329 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/guest/Makefile | 2
arch/x86/kvm/guest/dynirq.c | 95 +++++++++++
arch/x86/kvm/x86.c | 6 +
include/linux/kvm.h | 1
include/linux/kvm_guest.h | 7 +
include/linux/kvm_host.h | 1
include/linux/kvm_para.h | 1
13 files changed, 472 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kvm/dynirq.c
create mode 100644 arch/x86/kvm/guest/Makefile
create mode 100644 arch/x86/kvm/guest/dynirq.c
create mode 100644 include/linux/kvm_guest.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3fca247..91fefd5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -446,6 +446,11 @@ config KVM_GUEST
This option enables various optimizations for running under the KVM
hypervisor.
+config KVM_GUEST_DYNIRQ
+ bool "KVM Dynamic IRQ support"
+ depends on KVM_GUEST
+ default y
+
source "arch/x86/lguest/Kconfig"
config PARAVIRT
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index d1a47ad..d788815 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -147,6 +147,9 @@ core-$(CONFIG_XEN) += arch/x86/xen/
# lguest paravirtualization support
core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/
+# kvm paravirtualization support
+core-$(CONFIG_KVM_GUEST) += arch/x86/kvm/guest/
+
core-y += arch/x86/kernel/
core-y += arch/x86/mm/
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 730843d..9ae398a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -346,6 +346,12 @@ struct kvm_mem_alias {
gfn_t target_gfn;
};
+struct kvm_dynirq {
+ spinlock_t lock;
+ struct rb_root map;
+ struct kvm *kvm;
+};
+
struct kvm_arch{
int naliases;
struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -363,6 +369,7 @@ struct kvm_arch{
struct iommu_domain *iommu_domain;
struct kvm_pic *vpic;
struct kvm_ioapic *vioapic;
+ struct kvm_dynirq *dynirq;
struct kvm_pit *vpit;
struct hlist_head irq_ack_notifier_list;
int vapics_in_nmi_mode;
@@ -519,6 +526,8 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
const void *val, int bytes);
int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
gpa_t addr, unsigned long *ret);
+int kvm_dynirq_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len);
+void kvm_free_dynirq(struct kvm *kvm);
extern bool tdp_enabled;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index b8a3305..fba210e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -13,6 +13,7 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+#define KVM_FEATURE_DYNIRQ 3
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
@@ -45,6 +46,16 @@ struct kvm_mmu_op_release_pt {
__u64 pt_phys;
};
+/* Operations for KVM_HC_DYNIRQ */
+#define KVM_DYNIRQ_OP_SET 1
+#define KVM_DYNIRQ_OP_CLEAR 2
+
+struct kvm_dynirq_set {
+ __u32 irq;
+ __u32 vec; /* x86 IDT vector */
+ __u32 dest; /* 0-based vcpu id */
+};
+
#ifdef __KERNEL__
#include <asm/processor.h>
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d3ec292..d5676f5 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -14,9 +14,10 @@ endif
EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
- i8254.o
+ i8254.o dynirq.o
obj-$(CONFIG_KVM) += kvm.o
kvm-intel-objs = vmx.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
kvm-amd-objs = svm.o
obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+
diff --git a/arch/x86/kvm/dynirq.c b/arch/x86/kvm/dynirq.c
new file mode 100644
index 0000000..54162dd
--- /dev/null
+++ b/arch/x86/kvm/dynirq.c
@@ -0,0 +1,329 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Dynamic-Interrupt-Request (dynirq): This module provides the ability
+ * to dynamically declare and map an interrupt-request handle to an
+ * x86 8-bit vector.
+ *
+ * Problem Statement: Emulated devices (such as PCI, ISA, etc) have
+ * interrupt routing done via standard PC mechanisms (MP-table, ACPI,
+ * etc). However, we also want to support a new class of devices
+ * which exist in a new virtualized namespace and therefore should
+ * not try to piggyback on these emulated mechanisms. Rather, we
+ * create a way to dynamically register interrupt resources that
+ * acts independently of the emulated counterpart.
+ *
+ * On x86, a simplistic view of the interrupt model is that each core
+ * has a local-APIC which can receive messages from APIC-compliant
+ * routing devices (such as IO-APIC and MSI) regarding details about
+ * an interrupt (such as which vector to raise). These routing devices
+ * are controlled by the OS so they may translate a physical event
+ * (such as "e1000: raise an RX interrupt") to a logical destination
+ * (such as "inject IDT vector 46 on core 3"). A dynirq is a virtual
+ * implementation of such a router (think of it as a virtual-MSI, but
+ * without the coupling to an existing standard, such as PCI).
+ *
+ * The model is simple: A guest OS can allocate the mapping of "IRQ"
+ * handle to "vector/core" in any way it sees fit, and provide this
+ * information to the dynirq module running in the host. The assigned
+ * IRQ then becomes the sole handle needed to inject an IDT vector
+ * to the guest from a host. A host entity that wishes to raise an
+ * interrupt simply needs to call kvm_inject_dynirq(irq) and the routing
+ * is performed transparently.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include <linux/module.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_para.h>
+#include <linux/workqueue.h>
+#include <linux/hardirq.h>
+
+#include "lapic.h"
+
+struct dynirq {
+ struct kvm_dynirq *parent;
+ unsigned int irq;
+ unsigned short vec;
+ unsigned int dest;
+ struct rb_node node;
+ struct work_struct work;
+};
+
+static inline struct dynirq *
+to_dynirq(struct rb_node *node)
+{
+ return node ? container_of(node, struct dynirq, node) : NULL;
+}
+
+static int
+map_add(struct rb_root *root, struct dynirq *entry)
+{
+ int ret = 0;
+ struct rb_node **new, *parent = NULL;
+ struct rb_node *node = &entry->node;
+
+ new = &(root->rb_node);
+
+ /* Figure out where to put new node */
+ while (*new) {
+ int val;
+
+ parent = *new;
+
+ val = to_dynirq(node)->irq - to_dynirq(*new)->irq;
+ if (val < 0)
+ new = &((*new)->rb_left);
+ else if (val > 0)
+ new = &((*new)->rb_right);
+ else {
+ ret = -EEXIST;
+ break;
+ }
+ }
+
+ if (!ret) {
+ /* Add new node and rebalance tree. */
+ rb_link_node(node, parent, new);
+ rb_insert_color(node, root);
+ }
+
+ return ret;
+}
+
+static struct dynirq *
+map_find(struct rb_root *root, unsigned int key)
+{
+ struct rb_node *node;
+
+ node = root->rb_node;
+
+ while (node) {
+ int val;
+
+ val = key - to_dynirq(node)->irq;
+ if (val < 0)
+ node = node->rb_left;
+ else if (val > 0)
+ node = node->rb_right;
+ else
+ break;
+ }
+
+ return to_dynirq(node);
+}
+
+static void
+dynirq_add(struct kvm_dynirq *dynirq, struct dynirq *entry)
+{
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&dynirq->lock, flags);
+ ret = map_add(&dynirq->map, entry);
+ spin_unlock_irqrestore(&dynirq->lock, flags);
+}
+
+static struct dynirq *
+dynirq_find(struct kvm_dynirq *dynirq, int irq)
+{
+ struct dynirq *entry;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dynirq->lock, flags);
+ entry = map_find(&dynirq->map, irq);
+ spin_unlock_irqrestore(&dynirq->lock, flags);
+
+ return entry;
+}
+
+static int
+_kvm_inject_dynirq(struct kvm *kvm, struct dynirq *entry)
+{
+ struct kvm_vcpu *vcpu;
+ int ret;
+
+ mutex_lock(&kvm->lock);
+
+ vcpu = kvm->vcpus[entry->dest];
+ if (!vcpu) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ ret = kvm_apic_set_irq(vcpu, entry->vec, 1);
+
+out:
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+}
+
+static void
+deferred_inject_dynirq(struct work_struct *work)
+{
+ struct dynirq *entry = container_of(work, struct dynirq, work);
+ struct kvm_dynirq *dynirq = entry->parent;
+ struct kvm *kvm = dynirq->kvm;
+
+ _kvm_inject_dynirq(kvm, entry);
+}
+
+int
+kvm_inject_dynirq(struct kvm *kvm, int irq)
+{
+ struct kvm_dynirq *dynirq = kvm->arch.dynirq;
+ struct dynirq *entry;
+
+ entry = dynirq_find(dynirq, irq);
+ if (!entry)
+ return -EINVAL;
+
+ if (preemptible())
+ return _kvm_inject_dynirq(kvm, entry);
+
+ schedule_work(&entry->work);
+ return 0;
+}
+
+static int
+hc_set(struct kvm_vcpu *vcpu, gpa_t gpa, size_t len)
+{
+ struct kvm_dynirq_set args;
+ struct kvm_dynirq *dynirq = vcpu->kvm->arch.dynirq;
+ struct dynirq *entry;
+ int ret;
+
+ if (len != sizeof(args))
+ return -EINVAL;
+
+ ret = kvm_read_guest(vcpu->kvm, gpa, &args, len);
+ if (ret < 0)
+ return ret;
+
+ if (args.dest >= KVM_MAX_VCPUS)
+ return -EINVAL;
+
+ entry = dynirq_find(dynirq, args.irq);
+ if (!entry) {
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry)
+ return -ENOMEM;
+ INIT_WORK(&entry->work, deferred_inject_dynirq);
+ } else
+ rb_erase(&entry->node, &dynirq->map);
+
+ entry->irq = args.irq;
+ entry->vec = args.vec;
+ entry->dest = args.dest;
+
+ dynirq_add(dynirq, entry);
+
+ return 0;
+}
+
+static int
+hc_clear(struct kvm_vcpu *vcpu, gpa_t gpa, size_t len)
+{
+ struct kvm_dynirq *dynirq = vcpu->kvm->arch.dynirq;
+ struct dynirq *entry;
+ unsigned long flags;
+ u32 irq;
+ int ret;
+
+ if (len != sizeof(irq))
+ return -EINVAL;
+
+ ret = kvm_read_guest(vcpu->kvm, gpa, &irq, len);
+ if (ret < 0)
+ return ret;
+
+ spin_lock_irqsave(&dynirq->lock, flags);
+
+ entry = map_find(&dynirq->map, irq);
+ if (entry)
+ rb_erase(&entry->node, &dynirq->map);
+
+ spin_unlock_irqrestore(&dynirq->lock, flags);
+
+ if (!entry)
+ return -ENOENT;
+
+ kfree(entry);
+ return 0;
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0],
+ * a pointer to the arguments in arg[1], and the argument length in arg[2]
+ */
+int
+kvm_dynirq_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+ int ret = -EINVAL;
+
+ mutex_lock(&vcpu->kvm->lock);
+
+ if (unlikely(!vcpu->kvm->arch.dynirq)) {
+ struct kvm_dynirq *dynirq;
+
+ dynirq = kzalloc(sizeof(*dynirq), GFP_KERNEL);
+ if (!dynirq) {
+ mutex_unlock(&vcpu->kvm->lock);
+ return -ENOMEM;
+ }
+
+ spin_lock_init(&dynirq->lock);
+ dynirq->map = RB_ROOT;
+ dynirq->kvm = vcpu->kvm;
+ vcpu->kvm->arch.dynirq = dynirq;
+ }
+
+ switch (nr) {
+ case KVM_DYNIRQ_OP_SET:
+ ret = hc_set(vcpu, gpa, len);
+ break;
+ case KVM_DYNIRQ_OP_CLEAR:
+ ret = hc_clear(vcpu, gpa, len);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ mutex_unlock(&vcpu->kvm->lock);
+
+ return ret;
+}
+
+void
+kvm_free_dynirq(struct kvm *kvm)
+{
+ struct kvm_dynirq *dynirq = kvm->arch.dynirq;
+ struct rb_node *node;
+
+ while ((node = rb_first(&dynirq->map))) {
+ struct dynirq *entry = to_dynirq(node);
+
+ rb_erase(node, &dynirq->map);
+ kfree(entry);
+ }
+
+ kfree(dynirq);
+}
diff --git a/arch/x86/kvm/guest/Makefile b/arch/x86/kvm/guest/Makefile
new file mode 100644
index 0000000..de8f824
--- /dev/null
+++ b/arch/x86/kvm/guest/Makefile
@@ -0,0 +1,2 @@
+
+obj-$(CONFIG_KVM_GUEST_DYNIRQ) += dynirq.o
\ No newline at end of file
diff --git a/arch/x86/kvm/guest/dynirq.c b/arch/x86/kvm/guest/dynirq.c
new file mode 100644
index 0000000..a5cf55e
--- /dev/null
+++ b/arch/x86/kvm/guest/dynirq.c
@@ -0,0 +1,95 @@
+#include <linux/module.h>
+#include <linux/irq.h>
+#include <linux/kvm.h>
+#include <linux/kvm_para.h>
+
+#include <asm/irq.h>
+#include <asm/apic.h>
+
+/*
+ * -----------------------
+ * Dynamic-IRQ support
+ * -----------------------
+ */
+
+static int dynirq_set(int irq, int dest)
+{
+ struct kvm_dynirq_set op = {
+ .irq = irq,
+ .vec = irq_to_vector(irq),
+ .dest = dest,
+ };
+
+ return kvm_hypercall3(KVM_HC_DYNIRQ, KVM_DYNIRQ_OP_SET,
+ __pa(&op), sizeof(op));
+}
+
+static void dynirq_chip_noop(unsigned int irq)
+{
+}
+
+static void dynirq_chip_eoi(unsigned int irq)
+{
+ ack_APIC_irq();
+}
+
+struct irq_chip kvm_irq_chip = {
+ .name = "KVM-DYNIRQ",
+ .mask = dynirq_chip_noop,
+ .unmask = dynirq_chip_noop,
+ .eoi = dynirq_chip_eoi,
+};
+
+int create_kvm_dynirq(int cpu)
+{
+ const cpumask_t *mask = get_cpu_mask(cpu);
+ int irq;
+ int ret;
+
+ ret = kvm_para_has_feature(KVM_FEATURE_DYNIRQ);
+ if (!ret)
+ return -ENOENT;
+
+ irq = create_irq();
+ if (irq < 0)
+ return -ENOSPC;
+
+#ifdef CONFIG_SMP
+ ret = set_irq_affinity(irq, *mask);
+ if (ret < 0)
+ goto error;
+#endif
+
+ set_irq_chip_and_handler_name(irq,
+ &kvm_irq_chip,
+ handle_percpu_irq,
+ "apiceoi");
+
+ ret = dynirq_set(irq, cpu);
+ if (ret < 0)
+ goto error;
+
+ return irq;
+
+error:
+ destroy_irq(irq);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(create_kvm_dynirq);
+
+int destroy_kvm_dynirq(int irq)
+{
+ __u32 _irq = irq;
+
+ if (kvm_para_has_feature(KVM_FEATURE_DYNIRQ))
+ kvm_hypercall3(KVM_HC_DYNIRQ,
+ KVM_DYNIRQ_OP_CLEAR,
+ __pa(&_irq),
+ sizeof(_irq));
+
+ destroy_irq(irq);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(destroy_kvm_dynirq);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b0a649..e24f0a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -972,6 +972,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_MP_STATE:
case KVM_CAP_SYNC_MMU:
case KVM_CAP_RESET:
+ case KVM_CAP_DYNIRQ:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -2684,6 +2685,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_MMU_OP:
r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
break;
+ case KVM_HC_DYNIRQ:
+ ret = kvm_dynirq_hc(vcpu, a0, a1, a2);
+ break;
default:
ret = -KVM_ENOSYS;
break;
@@ -4141,6 +4145,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_free_pit(kvm);
kfree(kvm->arch.vpic);
kfree(kvm->arch.vioapic);
+ if (kvm->arch.dynirq)
+ kvm_free_dynirq(kvm);
kvm_free_vcpus(kvm);
kvm_free_physmem(kvm);
if (kvm->arch.apic_access_page)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7ffd8f5..349d273 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -397,6 +397,7 @@ struct kvm_trace_rec {
#define KVM_CAP_USER_NMI 22
#endif
#define KVM_CAP_RESET 23
+#define KVM_CAP_DYNIRQ 24
/*
* ioctls for VM fds
diff --git a/include/linux/kvm_guest.h b/include/linux/kvm_guest.h
new file mode 100644
index 0000000..7dd7930
--- /dev/null
+++ b/include/linux/kvm_guest.h
@@ -0,0 +1,7 @@
+#ifndef __LINUX_KVM_GUEST_H
+#define __LINUX_KVM_GUEST_H
+
+extern int create_kvm_dynirq(int cpu);
+extern int destroy_kvm_dynirq(int irq);
+
+#endif /* __LINUX_KVM_GUEST_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 506eca1..bec9b35 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -297,6 +297,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
+int kvm_inject_dynirq(struct kvm *kvm, int irq);
int kvm_is_mmio_pfn(pfn_t pfn);
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 3ddce03..a2de904 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -16,6 +16,7 @@
#define KVM_HC_VAPIC_POLL_IRQ 1
#define KVM_HC_MMU_OP 2
+#define KVM_HC_DYNIRQ 3
/*
* hypercalls use architecture specific
This patch adds support for guest access to a VBUS assigned to the same
context as the VM. It utilizes an IOQ+IRQ to move events from host->guest,
and provides a hypercall interface to move events guest->host.
Signed-off-by: Gregory Haskins <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1
arch/x86/kvm/Kconfig | 9
arch/x86/kvm/Makefile | 3
arch/x86/kvm/x86.c | 6
arch/x86/kvm/x86.h | 12
include/linux/kvm.h | 1
include/linux/kvm_host.h | 20 +
include/linux/kvm_para.h | 59 ++
virt/kvm/kvm_main.c | 1
virt/kvm/vbus.c | 1307 +++++++++++++++++++++++++++++++++++++++
10 files changed, 1419 insertions(+), 0 deletions(-)
create mode 100644 virt/kvm/vbus.c
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index fba210e..19d81e0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -14,6 +14,7 @@
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
#define KVM_FEATURE_DYNIRQ 3
+#define KVM_FEATURE_VBUS 4
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b81125f..875e96e 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,15 @@ config KVM_TRACE
relayfs. Note the ABI is not considered stable and will be
modified in future updates.
+config KVM_HOST_VBUS
+ bool "KVM virtual-bus (VBUS) host-side support"
+ depends on KVM
+ select VBUS
+ default n
+ ---help---
+ This option enables host-side support for accessing virtual-bus
+ devices.
+
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source drivers/lguest/Kconfig
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d5676f5..f749ec9 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -15,6 +15,9 @@ EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
i8254.o dynirq.o
+ifeq ($(CONFIG_KVM_HOST_VBUS),y)
+kvm-objs += $(addprefix ../../../virt/kvm/, vbus.o)
+endif
obj-$(CONFIG_KVM) += kvm.o
kvm-intel-objs = vmx.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e24f0a5..2369d84 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -996,6 +996,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_CLOCKSOURCE:
r = boot_cpu_has(X86_FEATURE_CONSTANT_TSC);
break;
+ case KVM_CAP_VBUS:
+ r = kvm_vbus_support();
+ break;
default:
r = 0;
break;
@@ -2688,6 +2691,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_DYNIRQ:
ret = kvm_dynirq_hc(vcpu, a0, a1, a2);
break;
+ case KVM_HC_VBUS:
+ ret = kvm_vbus_hc(vcpu, a0, a1, a2);
+ break;
default:
ret = -KVM_ENOSYS;
break;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6a4be78..b6c682b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -3,6 +3,18 @@
#include <linux/kvm_host.h>
+#ifdef CONFIG_KVM_HOST_VBUS
+static inline int kvm_vbus_support(void)
+{
+ return 1;
+}
+#else
+static inline int kvm_vbus_support(void)
+{
+ return 0;
+}
+#endif
+
static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
{
vcpu->arch.exception.pending = false;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 349d273..077daac 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -398,6 +398,7 @@ struct kvm_trace_rec {
#endif
#define KVM_CAP_RESET 23
#define KVM_CAP_DYNIRQ 24
+#define KVM_CAP_VBUS 25
/*
* ioctls for VM fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bec9b35..757f998 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -120,6 +120,9 @@ struct kvm {
struct list_head vm_list;
struct kvm_io_bus mmio_bus;
struct kvm_io_bus pio_bus;
+#ifdef CONFIG_KVM_HOST_VBUS
+ struct kvm_vbus *kvbus;
+#endif
struct kvm_vm_stat stat;
struct kvm_arch arch;
atomic_t users_count;
@@ -471,4 +474,21 @@ static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_se
}
#endif
+#ifdef CONFIG_KVM_HOST_VBUS
+
+int kvm_vbus_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len);
+void kvm_vbus_release(struct kvm_vbus *kvbus);
+
+#else /* CONFIG_KVM_HOST_VBUS */
+
+static inline int
+kvm_vbus_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+ return -EINVAL;
+}
+
+#define kvm_vbus_release(kvbus) do {} while (0)
+
+#endif /* CONFIG_KVM_HOST_VBUS */
+
#endif
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index a2de904..ca5203c 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -17,6 +17,65 @@
#define KVM_HC_VAPIC_POLL_IRQ 1
#define KVM_HC_MMU_OP 2
#define KVM_HC_DYNIRQ 3
+#define KVM_HC_VBUS 4
+
+/* Payload of KVM_HC_VBUS */
+#define KVM_VBUS_MAGIC 0x27fdab45
+#define KVM_VBUS_VERSION 1
+
+enum kvm_vbus_op {
+ KVM_VBUS_OP_BUSOPEN,
+ KVM_VBUS_OP_BUSREG,
+ KVM_VBUS_OP_DEVOPEN,
+ KVM_VBUS_OP_DEVCLOSE,
+ KVM_VBUS_OP_DEVCALL,
+ KVM_VBUS_OP_DEVSHM,
+ KVM_VBUS_OP_SHMSIGNAL,
+};
+
+struct kvm_vbus_busopen {
+ __u32 magic;
+ __u32 version;
+ __u64 capabilities;
+};
+
+struct kvm_vbus_eventqreg {
+ __u32 irq;
+ __u32 count;
+ __u64 ring;
+ __u64 data;
+};
+
+struct kvm_vbus_busreg {
+ __u32 count; /* supporting multiple queues allows for prio, etc */
+ struct kvm_vbus_eventqreg eventq[1];
+};
+
+enum kvm_vbus_eventid {
+ KVM_VBUS_EVENT_DEVADD,
+ KVM_VBUS_EVENT_DEVDROP,
+ KVM_VBUS_EVENT_SHMSIGNAL,
+ KVM_VBUS_EVENT_SHMCLOSE,
+};
+
+#define VBUS_MAX_DEVTYPE_LEN 128
+
+struct kvm_vbus_add_event {
+ __u64 id;
+ char type[VBUS_MAX_DEVTYPE_LEN];
+};
+
+struct kvm_vbus_handle_event {
+ __u64 handle;
+};
+
+struct kvm_vbus_event {
+ __u32 eventid;
+ union {
+ struct kvm_vbus_add_event add;
+ struct kvm_vbus_handle_event handle;
+ } data;
+};
/*
* hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fca2d25..2e4ba8b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -942,6 +942,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
{
struct kvm *kvm = filp->private_data;
+ kvm_vbus_release(kvm->kvbus);
kvm_put_kvm(kvm);
return 0;
}
diff --git a/virt/kvm/vbus.c b/virt/kvm/vbus.c
new file mode 100644
index 0000000..17b3392
--- /dev/null
+++ b/virt/kvm/vbus.c
@@ -0,0 +1,1307 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/ioq.h>
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_para.h>
+#include <linux/vbus.h>
+#include <linux/vbus_client.h>
+
+#undef PDEBUG
+#ifdef KVMVBUS_DEBUG
+#include <linux/ftrace.h>
+# define PDEBUG(fmt, args...) ftrace_printk(fmt, ## args)
+#else
+# define PDEBUG(fmt, args...)
+#endif
+
+struct kvm_vbus_eventq {
+ spinlock_t lock;
+ struct ioq *ioq;
+ struct ioq_notifier notifier;
+ struct list_head backlog;
+ struct {
+ u64 gpa;
+ size_t len;
+ void *ptr;
+ } ringdata;
+ struct work_struct work;
+ unsigned int backpressure:1;
+};
+
+enum kvm_vbus_state {
+ kvm_vbus_state_init,
+ kvm_vbus_state_registration,
+ kvm_vbus_state_running,
+};
+
+struct kvm_vbus {
+ struct mutex lock;
+ enum kvm_vbus_state state;
+ struct kvm *kvm;
+ struct vbus *vbus;
+ struct vbus_client *client;
+ struct kvm_vbus_eventq eventq;
+ struct work_struct destruct;
+ struct vbus_memctx *ctx;
+ struct {
+ struct notifier_block vbus;
+ struct notifier_block reset;
+ } notify;
+};
+
+static struct vbus_client *to_client(struct kvm_vcpu *vcpu)
+{
+ return vcpu ? vcpu->kvm->kvbus->client : NULL;
+}
+
+static void*
+kvm_vmap(struct kvm *kvm, gpa_t gpa, size_t len)
+{
+ struct page **page_list;
+ void *ptr = NULL;
+ unsigned long addr;
+ off_t offset;
+ size_t npages;
+ int ret;
+
+ addr = gfn_to_hva(kvm, gpa >> PAGE_SHIFT);
+
+ offset = offset_in_page(gpa);
+ npages = PAGE_ALIGN(len + offset) >> PAGE_SHIFT;
+
+ if (npages > (PAGE_SIZE / sizeof(struct page *)))
+ return NULL;
+
+ page_list = (struct page **) __get_free_page(GFP_KERNEL);
+ if (!page_list)
+ return NULL;
+
+ ret = get_user_pages_fast(addr, npages, 1, page_list);
+ if (ret < 0)
+ goto out;
+
+ down_write(&current->mm->mmap_sem);
+
+ ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
+ if (ptr)
+ current->mm->locked_vm += npages;
+
+ up_write(&current->mm->mmap_sem);
+
+ if (ptr)
+ ptr += offset;
+
+out:
+ free_page((unsigned long)page_list);
+
+ return ptr;
+}
+
+static void
+kvm_vunmap(void *ptr)
+{
+ /* FIXME: do we need to adjust current->mm->locked_vm? */
+ vunmap((void *)((unsigned long)ptr & PAGE_MASK));
+}
+
+/*
+ * -----------------
+ * kvm_shm routines
+ * -----------------
+ */
+
+struct kvm_shm {
+ struct kvm_vbus *kvbus;
+ struct vbus_shm shm;
+};
+
+static void
+kvm_shm_release(struct vbus_shm *shm)
+{
+ struct kvm_shm *_shm = container_of(shm, struct kvm_shm, shm);
+
+ kvm_vunmap(_shm->shm.ptr);
+ kfree(_shm);
+}
+
+static struct vbus_shm_ops kvm_shm_ops = {
+ .release = kvm_shm_release,
+};
+
+static int
+kvm_shm_map(struct kvm_vbus *kvbus, __u64 ptr, __u32 len, struct kvm_shm **kshm)
+{
+ struct kvm_shm *_shm;
+ void *vmap;
+
+ if (!can_do_mlock())
+ return -EPERM;
+
+ _shm = kzalloc(sizeof(*_shm), GFP_KERNEL);
+ if (!_shm)
+ return -ENOMEM;
+
+ _shm->kvbus = kvbus;
+
+ vmap = kvm_vmap(kvbus->kvm, ptr, len);
+ if (!vmap) {
+ kfree(_shm);
+ return -EFAULT;
+ }
+
+ vbus_shm_init(&_shm->shm, &kvm_shm_ops, vmap, len);
+
+ *kshm = _shm;
+
+ return 0;
+}
+
+/*
+ * -----------------
+ * vbus_memctx routines
+ * -----------------
+ */
+
+struct kvm_memctx {
+ struct kvm *kvm;
+ struct vbus_memctx *taskmem;
+ struct vbus_memctx ctx;
+};
+
+static struct kvm_memctx *to_kvm_memctx(struct vbus_memctx *ctx)
+{
+ return container_of(ctx, struct kvm_memctx, ctx);
+}
+
+
+static unsigned long
+kvm_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
+ unsigned long n)
+{
+ struct kvm_memctx *kvm_memctx = to_kvm_memctx(ctx);
+ struct vbus_memctx *tm = kvm_memctx->taskmem;
+ gpa_t gpa = (gpa_t)dst;
+ unsigned long addr;
+ int offset;
+
+ addr = gfn_to_hva(kvm_memctx->kvm, gpa >> PAGE_SHIFT);
+ offset = offset_in_page(gpa);
+
+ return tm->ops->copy_to(tm, (void *)(addr + offset), src, n);
+}
+
+static unsigned long
+kvm_memctx_copy_from(struct vbus_memctx *ctx, void *dst, const void *src,
+ unsigned long n)
+{
+ struct kvm_memctx *kvm_memctx = to_kvm_memctx(ctx);
+ struct vbus_memctx *tm = kvm_memctx->taskmem;
+ gpa_t gpa = (gpa_t)src;
+ unsigned long addr;
+ int offset;
+
+ addr = gfn_to_hva(kvm_memctx->kvm, gpa >> PAGE_SHIFT);
+ offset = offset_in_page(gpa);
+
+ return tm->ops->copy_from(tm, dst, (void *)(addr + offset), n);
+}
+
+static void
+kvm_memctx_release(struct vbus_memctx *ctx)
+{
+ struct kvm_memctx *kvm_memctx = to_kvm_memctx(ctx);
+
+ vbus_memctx_put(kvm_memctx->taskmem);
+ kvm_put_kvm(kvm_memctx->kvm);
+
+ kfree(kvm_memctx);
+}
+
+static struct vbus_memctx_ops kvm_memctx_ops = {
+ .copy_to = &kvm_memctx_copy_to,
+ .copy_from = &kvm_memctx_copy_from,
+ .release = &kvm_memctx_release,
+};
+
+struct vbus_memctx *kvm_memctx_alloc(struct kvm *kvm)
+{
+ struct kvm_memctx *kvm_memctx;
+
+ kvm_memctx = kzalloc(sizeof(*kvm_memctx), GFP_KERNEL);
+ if (!kvm_memctx)
+ return NULL;
+
+ kvm_get_kvm(kvm);
+ kvm_memctx->kvm = kvm;
+
+ kvm_memctx->taskmem = task_memctx_alloc(current);
+ vbus_memctx_init(&kvm_memctx->ctx, &kvm_memctx_ops);
+
+ return &kvm_memctx->ctx;
+}
+
+/*
+ * -----------------
+ * general routines
+ * -----------------
+ */
+
+static int
+_signal_init(struct kvm *kvm, struct shm_signal_desc *desc,
+ struct shm_signal *signal, struct shm_signal_ops *ops)
+{
+ if (desc->magic != SHM_SIGNAL_MAGIC)
+ return -EINVAL;
+
+ if (desc->ver != SHM_SIGNAL_VER)
+ return -EINVAL;
+
+ shm_signal_init(signal);
+
+ signal->locale = shm_locality_south;
+ signal->ops = ops;
+ signal->desc = desc;
+
+ return 0;
+}
+
+static struct kvm_vbus_event *
+event_ptr_translate(struct kvm_vbus_eventq *eventq, u64 ptr)
+{
+ u64 off = ptr - eventq->ringdata.gpa;
+
+ if ((ptr < eventq->ringdata.gpa)
+ || (off > (eventq->ringdata.len - sizeof(struct kvm_vbus_event))))
+ return NULL;
+
+ return eventq->ringdata.ptr + off;
+}
+
+/*
+ * ------------------
+ * event-object code
+ * ------------------
+ */
+
+struct _event {
+ atomic_t refs;
+ struct list_head list;
+ struct kvm_vbus_event data;
+};
+
+static void
+_event_init(struct _event *event)
+{
+ memset(event, 0, sizeof(*event));
+ atomic_set(&event->refs, 1);
+ INIT_LIST_HEAD(&event->list);
+}
+
+static void
+_event_get(struct _event *event)
+{
+ atomic_inc(&event->refs);
+}
+
+static inline void
+_event_put(struct _event *event)
+{
+ if (atomic_dec_and_test(&event->refs))
+ kfree(event);
+}
+
+/*
+ * ------------------
+ * event-inject code
+ * ------------------
+ */
+
+static struct kvm_vbus_eventq *notify_to_eventq(struct ioq_notifier *notifier)
+{
+ return container_of(notifier, struct kvm_vbus_eventq, notifier);
+}
+
+static struct kvm_vbus_eventq *work_to_eventq(struct work_struct *work)
+{
+ return container_of(work, struct kvm_vbus_eventq, work);
+}
+
+/*
+ * This is invoked whenever the guest signals our eventq while
+ * we have notifications enabled
+ */
+static void
+eventq_notify(struct ioq_notifier *notifier)
+{
+ struct kvm_vbus_eventq *eventq = notify_to_eventq(notifier);
+ unsigned long flags;
+
+ spin_lock_irqsave(&eventq->lock, flags);
+
+ if (!ioq_full(eventq->ioq, ioq_idxtype_inuse)) {
+ eventq->backpressure = false;
+ ioq_notify_disable(eventq->ioq, 0);
+ schedule_work(&eventq->work);
+ }
+
+ spin_unlock_irqrestore(&eventq->lock, flags);
+}
+
+static void
+events_flush(struct kvm_vbus_eventq *eventq)
+{
+ struct ioq_iterator iter;
+ int ret;
+ unsigned long flags;
+ struct _event *_event, *tmp;
+ int dirty = 0;
+
+ spin_lock_irqsave(&eventq->lock, flags);
+
+ /* We want to iterate on the tail of the in-use index */
+ ret = ioq_iter_init(eventq->ioq, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ list_for_each_entry_safe(_event, tmp, &eventq->backlog, list) {
+ struct kvm_vbus_event *ev;
+
+ if (!iter.desc->sown) {
+ eventq->backpressure = true;
+ ioq_notify_enable(eventq->ioq, 0);
+ break;
+ }
+
+ if (iter.desc->len < sizeof(*ev)) {
+ SHM_SIGNAL_FAULT(eventq->ioq->signal,
+ "Desc too small on eventq: %p: %d<%d",
+ iter.desc->ptr,
+ iter.desc->len, sizeof(*ev));
+ break;
+ }
+
+ ev = event_ptr_translate(eventq, iter.desc->ptr);
+ if (!ev) {
+ SHM_SIGNAL_FAULT(eventq->ioq->signal,
+ "Invalid address on eventq: %p",
+ iter.desc->ptr);
+ break;
+ }
+
+ memcpy(ev, &_event->data, sizeof(*ev));
+
+ list_del_init(&_event->list);
+ _event_put(_event);
+
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+
+ dirty = 1;
+ }
+
+ spin_unlock_irqrestore(&eventq->lock, flags);
+
+ /*
+ * Signal the IOQ outside of the spinlock so that we can potentially
+ * directly inject this interrupt instead of deferring it
+ */
+ if (dirty)
+ ioq_signal(eventq->ioq, 0);
+}
+
+static int
+event_inject(struct kvm_vbus_eventq *eventq, struct _event *_event)
+{
+ unsigned long flags;
+
+ if (!list_empty(&_event->list))
+ return -EBUSY;
+
+ spin_lock_irqsave(&eventq->lock, flags);
+ list_add_tail(&_event->list, &eventq->backlog);
+ spin_unlock_irqrestore(&eventq->lock, flags);
+
+ events_flush(eventq);
+
+ return 0;
+}
+
+static void
+eventq_reinject(struct work_struct *work)
+{
+ struct kvm_vbus_eventq *eventq = work_to_eventq(work);
+
+ events_flush(eventq);
+}
+
+/*
+ * devadd/drop are in the slow path and are rare enough that we will
+ * simply allocate memory for the event from the heap
+ */
+static int
+devadd_inject(struct kvm_vbus_eventq *eventq, const char *type, u64 id)
+{
+ struct _event *_event;
+ struct kvm_vbus_add_event *ae;
+ int ret;
+
+ _event = kmalloc(sizeof(*_event), GFP_KERNEL);
+ if (!_event)
+ return -ENOMEM;
+
+ _event_init(_event);
+
+ _event->data.eventid = KVM_VBUS_EVENT_DEVADD;
+ ae = (struct kvm_vbus_add_event *)&_event->data.data;
+ ae->id = id;
+ strlcpy(ae->type, type, VBUS_MAX_DEVTYPE_LEN);
+
+ ret = event_inject(eventq, _event);
+ if (ret < 0)
+ _event_put(_event);
+
+ return ret;
+}
+
+/*
+ * "handle" events are used to send any kind of event that simply
+ * uses a handle as a parameter. This includes things like DEVDROP
+ * and SHMSIGNAL, etc.
+ */
+static struct _event *
+handle_event_alloc(u64 id, u64 handle)
+{
+ struct _event *_event;
+ struct kvm_vbus_handle_event *he;
+
+ _event = kmalloc(sizeof(*_event), GFP_KERNEL);
+ if (!_event)
+ return NULL;
+
+ _event_init(_event);
+ _event->data.eventid = id;
+
+ he = (struct kvm_vbus_handle_event *)&_event->data.data;
+ he->handle = handle;
+
+ return _event;
+}
+
+static int
+devdrop_inject(struct kvm_vbus_eventq *eventq, u64 id)
+{
+ struct _event *_event;
+ int ret;
+
+ _event = handle_event_alloc(KVM_VBUS_EVENT_DEVDROP, id);
+ if (!_event)
+ return -ENOMEM;
+
+ ret = event_inject(eventq, _event);
+ if (ret < 0)
+ _event_put(_event);
+
+ return ret;
+}
+
+static struct kvm_vbus_eventq *
+prio_to_eventq(struct kvm_vbus *kvbus, int prio)
+{
+ /*
+ * NOTE: priority is ignored for now...all events aggregate onto a
+ * single queue
+ */
+
+ return &kvbus->eventq;
+}
+
+/*
+ * -----------------
+ * event ioq
+ *
+ * This queue is used by the infrastructure to transmit events (such as
+ * "new device", or "signal an ioq") to the guest. We do this so that
+ * we minimize the number of hypercalls required to inject an event.
+ * In theory, the guest only needs to process a single interrupt vector
+ * and it doesn't require switching back to host context since the state
+ * is placed within the ring
+ * -----------------
+ */
+
+struct eventq_signal {
+ struct kvm_vbus *kvbus;
+ struct vbus_shm *shm;
+ struct shm_signal signal;
+ int irq;
+};
+
+static struct eventq_signal *signal_to_eventq(struct shm_signal *signal)
+{
+ return container_of(signal, struct eventq_signal, signal);
+}
+
+static int
+eventq_signal_inject(struct shm_signal *signal)
+{
+ struct eventq_signal *_signal = signal_to_eventq(signal);
+ struct kvm *kvm = _signal->kvbus->kvm;
+
+ /* Inject an interrupt to the guest */
+ kvm_inject_dynirq(kvm, _signal->irq);
+
+ return 0;
+}
+
+static void
+eventq_signal_release(struct shm_signal *signal)
+{
+ struct eventq_signal *_signal = signal_to_eventq(signal);
+
+ vbus_shm_put(_signal->shm);
+ kfree(_signal);
+}
+
+static struct shm_signal_ops eventq_signal_ops = {
+ .inject = eventq_signal_inject,
+ .release = eventq_signal_release,
+};
+
+static int
+_eventq_attach(struct kvm_vbus *kvbus, __u32 count, __u64 ptr, int irq,
+ struct ioq **ioq)
+{
+ struct ioq_ring_head *desc;
+ struct eventq_signal *_signal = NULL;
+ struct kvm_shm *_shm = NULL;
+ size_t len = IOQ_HEAD_DESC_SIZE(count);
+ int ret;
+
+ ret = kvm_shm_map(kvbus, ptr, len, &_shm);
+ if (ret < 0)
+ return ret;
+
+ _signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+ if (!_signal) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ desc = _shm->shm.ptr;
+
+ ret = _signal_init(kvbus->kvm,
+ &desc->signal,
+ &_signal->signal,
+ &eventq_signal_ops);
+ if (ret < 0) {
+ kfree(_signal);
+ _signal = NULL;
+ goto error;
+ }
+
+ _signal->kvbus = kvbus;
+ _signal->irq = irq;
+ _signal->shm = &_shm->shm;
+ vbus_shm_get(&_shm->shm); /* dropped when the signal releases */
+
+ /* FIXME: we should make maxcount configurable */
+ ret = vbus_shm_ioq_attach(&_shm->shm, &_signal->signal, 2048, ioq);
+ if (ret < 0)
+ goto error;
+
+ return 0;
+
+error:
+ if (_signal)
+ shm_signal_put(&_signal->signal);
+
+ if (_shm)
+ vbus_shm_put(&_shm->shm);
+
+ return ret;
+}
+
+/*
+ * -----------------
+ * device_signal routines
+ *
+ * This is the more standard signal that is allocated to communicate
+ * with a specific device's shm region
+ * -----------------
+ */
+
+struct device_signal {
+ struct kvm_vbus *kvbus;
+ struct vbus_shm *shm;
+ struct shm_signal signal;
+ struct _event *inject;
+ int prio;
+ u64 handle;
+};
+
+static struct device_signal *to_dsig(struct shm_signal *signal)
+{
+ return container_of(signal, struct device_signal, signal);
+}
+
+static void
+_device_signal_inject(struct device_signal *_signal)
+{
+ struct kvm_vbus_eventq *eventq;
+ int ret;
+
+ eventq = prio_to_eventq(_signal->kvbus, _signal->prio);
+
+ ret = event_inject(eventq, _signal->inject);
+ if (ret < 0)
+ _event_put(_signal->inject);
+}
+
+static int
+device_signal_inject(struct shm_signal *signal)
+{
+ struct device_signal *_signal = to_dsig(signal);
+
+ _event_get(_signal->inject); /* will be dropped by injection code */
+ _device_signal_inject(_signal);
+
+ return 0;
+}
+
+static void
+device_signal_release(struct shm_signal *signal)
+{
+ struct device_signal *_signal = to_dsig(signal);
+ struct kvm_vbus_eventq *eventq;
+ unsigned long flags;
+
+ eventq = prio_to_eventq(_signal->kvbus, _signal->prio);
+
+ /*
+ * Change the event-type while holding the lock so we do not race
+ * with any potential threads already processing the queue
+ */
+ spin_lock_irqsave(&eventq->lock, flags);
+ _signal->inject->data.eventid = KVM_VBUS_EVENT_SHMCLOSE;
+ spin_unlock_irqrestore(&eventq->lock, flags);
+
+ /*
+ * Do not take a reference to the event; the last reference will be
+ * dropped once it has been transmitted.
+ */
+ _device_signal_inject(_signal);
+
+ vbus_shm_put(_signal->shm);
+ kfree(_signal);
+}
+
+static struct shm_signal_ops device_signal_ops = {
+ .inject = device_signal_inject,
+ .release = device_signal_release,
+};
+
+static int
+device_signal_alloc(struct kvm_vbus *kvbus, struct vbus_shm *shm,
+ u32 offset, u32 prio, u64 cookie,
+ struct device_signal **dsignal)
+{
+ struct device_signal *_signal;
+ int ret;
+
+ _signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+ if (!_signal)
+ return -ENOMEM;
+
+ ret = _signal_init(kvbus->kvm, shm->ptr + offset,
+ &_signal->signal,
+ &device_signal_ops);
+ if (ret < 0) {
+ kfree(_signal);
+ return ret;
+ }
+
+ _signal->inject = handle_event_alloc(KVM_VBUS_EVENT_SHMSIGNAL, cookie);
+ if (!_signal->inject) {
+ shm_signal_put(&_signal->signal);
+ return -ENOMEM;
+ }
+
+ _signal->kvbus = kvbus;
+ _signal->shm = shm;
+ _signal->prio = prio;
+ vbus_shm_get(shm); /* dropped when the signal is released */
+
+ *dsignal = _signal;
+
+ return 0;
+}
+
+/*
+ * ------------------
+ * notifiers
+ * ------------------
+ */
+
+/*
+ * This is called whenever our associated vbus emits an event. We inject
+ * these events at the highest logical priority
+ */
+static int
+vbus_notifier(struct notifier_block *nb, unsigned long nr, void *data)
+{
+ struct kvm_vbus *kvbus = container_of(nb, struct kvm_vbus, notify.vbus);
+ struct kvm_vbus_eventq *eventq = prio_to_eventq(kvbus, 0);
+
+ switch (nr) {
+ case VBUS_EVENT_DEVADD: {
+ struct vbus_event_devadd *ev = data;
+
+ devadd_inject(eventq, ev->type, ev->id);
+ break;
+ }
+ case VBUS_EVENT_DEVDROP: {
+ unsigned long id = *(unsigned long *)data;
+
+ devdrop_inject(eventq, id);
+ break;
+ }
+ default:
+ break;
+ }
+
+ return 0;
+}
+
+static void
+deferred_destruct(struct work_struct *work)
+{
+ struct kvm_vbus *kvbus = container_of(work, struct kvm_vbus, destruct);
+
+ kvm_vbus_release(kvbus);
+}
+
+/*
+ * This is called if the guest reboots...we should release our association
+ * with the vbus (if any)
+ */
+static int
+reset_notifier(struct notifier_block *nb, unsigned long nr, void *data)
+{
+ struct kvm_vbus *kvbus = container_of(nb, struct kvm_vbus,
+ notify.reset);
+
+ /* Clear the association first so new hypercalls see no bus */
+ kvbus->kvm->kvbus = NULL;
+ schedule_work(&kvbus->destruct);
+
+ return NOTIFY_DONE;
+}
+
+static int
+kvm_vbus_eventq_attach(struct kvm_vbus *kvbus, struct kvm_vbus_eventq *eventq,
+ u32 count, u64 ring, u64 data, int irq)
+{
+ struct ioq *ioq;
+ size_t len;
+ void *ptr;
+ int ret;
+
+ if (eventq->ioq)
+ return -EINVAL;
+
+ ret = _eventq_attach(kvbus, count, ring, irq, &ioq);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * We are going to pre-vmap the eventq data for performance reasons
+ */
+ len = count * sizeof(struct kvm_vbus_event);
+ ptr = kvm_vmap(kvbus->kvm, data, len);
+ if (!ptr) {
+ ioq_put(ioq);
+ return -EFAULT;
+ }
+
+ spin_lock_init(&eventq->lock);
+ eventq->ioq = ioq;
+ INIT_WORK(&eventq->work, eventq_reinject);
+
+ eventq->notifier.signal = eventq_notify;
+ ioq->notifier = &eventq->notifier;
+
+ INIT_LIST_HEAD(&eventq->backlog);
+
+ eventq->ringdata.len = len;
+ eventq->ringdata.gpa = data;
+ eventq->ringdata.ptr = ptr;
+
+ return 0;
+}
+
+static void
+kvm_vbus_eventq_detach(struct kvm_vbus_eventq *eventq)
+{
+ if (eventq->ioq)
+ ioq_put(eventq->ioq);
+
+ if (eventq->ringdata.ptr)
+ kvm_vunmap(eventq->ringdata.ptr);
+}
+
+static int
+kvm_vbus_alloc(struct kvm_vcpu *vcpu)
+{
+ struct vbus *vbus = task_vbus_get(current);
+ struct vbus_client *client;
+ struct kvm_vbus *kvbus;
+ int ret;
+
+ if (!vbus)
+ return -EPERM;
+
+ client = vbus_client_attach(vbus);
+ if (!client) {
+ vbus_put(vbus);
+ return -ENOMEM;
+ }
+
+ kvbus = kzalloc(sizeof(*kvbus), GFP_KERNEL);
+ if (!kvbus) {
+ vbus_put(vbus);
+ vbus_client_put(client);
+ return -ENOMEM;
+ }
+
+ mutex_init(&kvbus->lock);
+ kvbus->state = kvm_vbus_state_registration;
+ kvbus->kvm = vcpu->kvm;
+ kvbus->vbus = vbus;
+ kvbus->client = client;
+
+ vcpu->kvm->kvbus = kvbus;
+
+ INIT_WORK(&kvbus->destruct, deferred_destruct);
+ kvbus->ctx = kvm_memctx_alloc(vcpu->kvm);
+
+ kvbus->notify.vbus.notifier_call = vbus_notifier;
+ kvbus->notify.vbus.priority = 0;
+
+ kvbus->notify.reset.notifier_call = reset_notifier;
+ kvbus->notify.reset.priority = 0;
+
+ ret = kvm_reset_notifier_register(vcpu->kvm, &kvbus->notify.reset);
+ if (ret < 0) {
+ kvm_vbus_release(kvbus);
+ return ret;
+ }
+
+ return 0;
+}
+
+void
+kvm_vbus_release(struct kvm_vbus *kvbus)
+{
+ if (!kvbus)
+ return;
+
+ if (kvbus->ctx)
+ vbus_memctx_put(kvbus->ctx);
+
+ kvm_vbus_eventq_detach(&kvbus->eventq);
+
+ if (kvbus->client)
+ vbus_client_put(kvbus->client);
+
+ if (kvbus->vbus) {
+ vbus_notifier_unregister(kvbus->vbus, &kvbus->notify.vbus);
+ vbus_put(kvbus->vbus);
+ }
+
+ kvm_reset_notifier_unregister(kvbus->kvm, &kvbus->notify.reset);
+
+ flush_scheduled_work();
+
+ kvbus->kvm->kvbus = NULL;
+
+ kfree(kvbus);
+}
+
+/*
+ * ------------------
+ * hypercall implementation
+ * ------------------
+ */
+
+static int
+hc_busopen(struct kvm_vcpu *vcpu, void *data)
+{
+ struct kvm_vbus_busopen *args = data;
+
+ if (vcpu->kvm->kvbus)
+ return -EEXIST;
+
+ if (args->magic != KVM_VBUS_MAGIC)
+ return -EINVAL;
+
+ if (args->version != KVM_VBUS_VERSION)
+ return -EINVAL;
+
+ args->capabilities = 0;
+
+ return kvm_vbus_alloc(vcpu);
+}
+
+static int
+hc_busreg(struct kvm_vcpu *vcpu, void *data)
+{
+ struct kvm_vbus_busreg *args = data;
+ struct kvm_vbus_eventqreg *qreg = &args->eventq[0];
+ struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+ int ret;
+
+ if (args->count != 1)
+ return -EINVAL;
+
+ ret = kvm_vbus_eventq_attach(kvbus,
+ &kvbus->eventq,
+ qreg->count,
+ qreg->ring,
+ qreg->data,
+ qreg->irq);
+ if (ret < 0)
+ return ret;
+
+ ret = vbus_notifier_register(kvbus->vbus, &kvbus->notify.vbus);
+ if (ret < 0)
+ return ret;
+
+ kvbus->state = kvm_vbus_state_running;
+
+ return 0;
+}
+
+static int
+hc_deviceopen(struct kvm_vcpu *vcpu, void *data)
+{
+ struct vbus_deviceopen *args = data;
+ struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+ struct vbus_client *c = kvbus->client;
+
+ return c->ops->deviceopen(c, kvbus->ctx,
+ args->devid, args->version, &args->handle);
+}
+
+static int
+hc_deviceclose(struct kvm_vcpu *vcpu, void *data)
+{
+ __u64 devh = *(__u64 *)data;
+ struct vbus_client *c = to_client(vcpu);
+
+ return c->ops->deviceclose(c, devh);
+}
+
+static int
+hc_devicecall(struct kvm_vcpu *vcpu, void *data)
+{
+ struct vbus_devicecall *args = data;
+ struct vbus_client *c = to_client(vcpu);
+
+ return c->ops->devicecall(c, args->devh, args->func,
+ (void *)args->datap, args->len, args->flags);
+}
+
+static int
+hc_deviceshm(struct kvm_vcpu *vcpu, void *data)
+{
+ struct vbus_deviceshm *args = data;
+ struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+ struct vbus_client *c = to_client(vcpu);
+ struct device_signal *_signal = NULL;
+ struct shm_signal *signal = NULL;
+ struct kvm_shm *_shm;
+ u64 handle;
+ int ret;
+
+ ret = kvm_shm_map(kvbus, args->datap, args->len, &_shm);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Establishing a signal is optional
+ */
+ if (args->signal.offset != -1) {
+ ret = device_signal_alloc(kvbus, &_shm->shm,
+ args->signal.offset,
+ args->signal.prio,
+ args->signal.cookie,
+ &_signal);
+ if (ret < 0)
+ goto out;
+
+ signal = &_signal->signal;
+ }
+
+ ret = c->ops->deviceshm(c, args->devh, args->id,
+ &_shm->shm, signal,
+ args->flags, &handle);
+ if (ret < 0)
+ goto out;
+
+ args->handle = handle;
+ if (_signal)
+ _signal->handle = handle;
+
+ return 0;
+
+out:
+ if (signal)
+ shm_signal_put(signal);
+
+ vbus_shm_put(&_shm->shm);
+ return ret;
+}
+
+static int
+hc_shmsignal(struct kvm_vcpu *vcpu, void *data)
+{
+ __u64 handle = *(__u64 *)data;
+ struct kvm_vbus *kvbus;
+ struct vbus_client *c = to_client(vcpu);
+
+ /* A non-zero handle is targeted at a device's shm */
+ if (handle)
+ return c->ops->shmsignal(c, handle);
+
+ kvbus = vcpu->kvm->kvbus;
+
+ /* A null handle is signaling our eventq */
+ _shm_signal_wakeup(kvbus->eventq.ioq->signal);
+
+ return 0;
+}
+
+struct hc_op {
+ int nr;
+ int len;
+ int dirty;
+ int (*func)(struct kvm_vcpu *vcpu, void *args);
+};
+
+static struct hc_op _hc_busopen = {
+ .nr = KVM_VBUS_OP_BUSOPEN,
+ .len = sizeof(struct kvm_vbus_busopen),
+ .dirty = 1,
+ .func = &hc_busopen,
+};
+
+static struct hc_op _hc_busreg = {
+ .nr = KVM_VBUS_OP_BUSREG,
+ .len = sizeof(struct kvm_vbus_busreg),
+ .func = &hc_busreg,
+};
+
+static struct hc_op _hc_devopen = {
+ .nr = KVM_VBUS_OP_DEVOPEN,
+ .len = sizeof(struct vbus_deviceopen),
+ .dirty = 1,
+ .func = &hc_deviceopen,
+};
+
+static struct hc_op _hc_devclose = {
+ .nr = KVM_VBUS_OP_DEVCLOSE,
+ .len = sizeof(u64),
+ .func = &hc_deviceclose,
+};
+
+static struct hc_op _hc_devcall = {
+ .nr = KVM_VBUS_OP_DEVCALL,
+ .len = sizeof(struct vbus_devicecall),
+ .func = &hc_devicecall,
+};
+
+static struct hc_op _hc_devshm = {
+ .nr = KVM_VBUS_OP_DEVSHM,
+ .len = sizeof(struct vbus_deviceshm),
+ .dirty = 1,
+ .func = &hc_deviceshm,
+};
+
+static struct hc_op _hc_shmsignal = {
+ .nr = KVM_VBUS_OP_SHMSIGNAL,
+ .len = sizeof(u64),
+ .func = &hc_shmsignal,
+};
+
+static struct hc_op *hc_ops[] = {
+ &_hc_busopen,
+ &_hc_busreg,
+ &_hc_devopen,
+ &_hc_devclose,
+ &_hc_devcall,
+ &_hc_devshm,
+ &_hc_shmsignal,
+ NULL,
+};
+
+static int
+hc_execute_indirect(struct kvm_vcpu *vcpu, struct hc_op *op, gpa_t gpa)
+{
+ struct kvm *kvm = vcpu->kvm;
+ char *args = NULL;
+ int ret;
+
+ BUG_ON(!op->len);
+
+ args = kmalloc(op->len, GFP_KERNEL);
+ if (!args)
+ return -ENOMEM;
+
+ ret = kvm_read_guest(kvm, gpa, args, op->len);
+ if (ret < 0)
+ goto out;
+
+ ret = op->func(vcpu, args);
+
+ if (ret >= 0 && op->dirty)
+ ret = kvm_write_guest(kvm, gpa, args, op->len);
+
+out:
+ kfree(args);
+
+ return ret;
+}
+
+static int
+hc_execute_direct(struct kvm_vcpu *vcpu, struct hc_op *op, gpa_t gpa)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct page *page;
+ char *kaddr = NULL;
+ void *args;
+ int ret;
+
+ page = gfn_to_page(kvm, gpa >> PAGE_SHIFT);
+ if (page == bad_page) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ kaddr = kmap(page);
+ if (!kaddr) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ args = kaddr + offset_in_page(gpa);
+
+ ret = op->func(vcpu, args);
+
+out:
+ if (kaddr)
+ kunmap(page);
+
+ if (ret >= 0 && op->dirty)
+ kvm_release_page_dirty(page);
+ else
+ kvm_release_page_clean(page);
+
+ return ret;
+}
+
+static int
+hc_execute(struct kvm_vcpu *vcpu, struct hc_op *op, gpa_t gpa, size_t len)
+{
+ if (len != op->len)
+ return -EINVAL;
+
+ /*
+ * Execute-immediate if there is no data
+ */
+ if (!len)
+ return op->func(vcpu, NULL);
+
+ /*
+ * We will need to copy the arguments in the unlikely case that the
+ * gpa pointer crosses a page boundary
+ *
+ * FIXME: Is it safe to assume PAGE_SIZE is relevant to gpa?
+ */
+ if (unlikely((offset_in_page(gpa) + len) > PAGE_SIZE))
+ return hc_execute_indirect(vcpu, op, gpa);
+
+ /*
+ * Otherwise just execute with zero-copy by mapping the arguments
+ */
+ return hc_execute_direct(vcpu, op, gpa);
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0],
+ * a pointer to the arguments in arg[1], and the argument length in arg[2]
+ */
+int
+kvm_vbus_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+ struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+ enum kvm_vbus_state state = kvbus ? kvbus->state : kvm_vbus_state_init;
+ int i;
+
+ PDEBUG("nr=%d, state=%d\n", nr, state);
+
+ switch (state) {
+ case kvm_vbus_state_init:
+ if (nr != KVM_VBUS_OP_BUSOPEN) {
+ PDEBUG("expected BUSOPEN\n");
+ return -EINVAL;
+ }
+ break;
+ case kvm_vbus_state_registration:
+ if (nr != KVM_VBUS_OP_BUSREG) {
+ PDEBUG("expected BUSREG\n");
+ return -EINVAL;
+ }
+ break;
+ default:
+ break;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(hc_ops); i++) {
+ struct hc_op *op = hc_ops[i];
+
+ if (!op)
+ break;
+
+ if (op->nr != nr)
+ continue;
+
+ return hc_execute(vcpu, op, gpa, len);
+ }
+
+ PDEBUG("error: no matching function for nr=%d\n", nr);
+
+ return -EINVAL;
+}
This allows userspace applications to access vbus devices.
Signed-off-by: Gregory Haskins <[email protected]>
---
include/linux/vbus.h | 4
include/linux/vbus_client.h | 2
include/linux/vbus_userspace.h | 48 ++++
kernel/vbus/Kconfig | 10 +
kernel/vbus/Makefile | 2
kernel/vbus/userspace-client.c | 485 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 550 insertions(+), 1 deletions(-)
create mode 100644 include/linux/vbus_userspace.h
create mode 100644 kernel/vbus/userspace-client.c
diff --git a/include/linux/vbus.h b/include/linux/vbus.h
index 04db4ff..f967e59 100644
--- a/include/linux/vbus.h
+++ b/include/linux/vbus.h
@@ -23,6 +23,8 @@
#ifndef _LINUX_VBUS_H
#define _LINUX_VBUS_H
+#ifdef __KERNEL__
+
#ifdef CONFIG_VBUS
#include <linux/module.h>
@@ -159,4 +161,6 @@ int vbus_notifier_unregister(struct vbus *vbus, struct notifier_block *nb);
#endif /* CONFIG_VBUS */
+#endif /* __KERNEL__ */
+
#endif /* _LINUX_VBUS_H */
diff --git a/include/linux/vbus_client.h b/include/linux/vbus_client.h
index 62dab78..4c82822 100644
--- a/include/linux/vbus_client.h
+++ b/include/linux/vbus_client.h
@@ -35,7 +35,7 @@
#ifndef _LINUX_VBUS_CLIENT_H
#define _LINUX_VBUS_CLIENT_H
-#include <linux/types.h>
+#include <asm/types.h>
#include <linux/compiler.h>
struct vbus_deviceopen {
diff --git a/include/linux/vbus_userspace.h b/include/linux/vbus_userspace.h
new file mode 100644
index 0000000..0b78686
--- /dev/null
+++ b/include/linux/vbus_userspace.h
@@ -0,0 +1,48 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Virtual-Bus
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_USERSPACE_H
+#define _LINUX_VBUS_USERSPACE_H
+
+#include <linux/ioctl.h>
+#include <linux/vbus.h>
+#include <linux/vbus_client.h>
+
+#define VBUS_USERSPACE_ABI_MAGIC 0x4fa23b58
+#define VBUS_USERSPACE_ABI_VERSION 1
+
+struct vbus_userspace_busopen {
+ __u32 magic;
+ __u32 version;
+ __u64 capabilities;
+};
+
+#define VBUS_IOCTL_MAGIC 'v'
+
+#define VBUS_BUSOPEN _IOWR(VBUS_IOCTL_MAGIC, 0x00, struct vbus_userspace_busopen)
+#define VBUS_DEVICEOPEN _IOWR(VBUS_IOCTL_MAGIC, 0x01, struct vbus_deviceopen)
+#define VBUS_DEVICECLOSE _IOWR(VBUS_IOCTL_MAGIC, 0x02, __u64)
+#define VBUS_DEVICECALL _IOWR(VBUS_IOCTL_MAGIC, 0x03, struct vbus_devicecall)
+#define VBUS_DEVICESHM _IOWR(VBUS_IOCTL_MAGIC, 0x04, struct vbus_deviceshm)
+#define VBUS_SHMSIGNAL _IOWR(VBUS_IOCTL_MAGIC, 0x05, __u64)
+
+#endif /* _LINUX_VBUS_USERSPACE_H */
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index 3ce0adc..b894dd1 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -25,6 +25,16 @@ config VBUS_DEVICES
source "drivers/vbus/devices/Kconfig"
+config VBUS_USERSPACE
+ tristate "Virtual-Bus userspace client support"
+ depends on VBUS
+ default y
+ help
+ Provides facilities for userspace applications to access
+ virtual-bus objects.
+
+ If unsure, say N.
+
config VBUS_DRIVERS
tristate "VBUS Driver support"
select IOQ
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index 45f6503..61d0371 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -4,3 +4,5 @@ obj-$(CONFIG_VBUS) += shm-ioq.o
vbus-proxy-objs += proxy.o
obj-$(CONFIG_VBUS_DRIVERS) += vbus-proxy.o
+vbus-userspace-objs += userspace-client.o
+obj-$(CONFIG_VBUS_USERSPACE) += vbus-userspace.o
diff --git a/kernel/vbus/userspace-client.c b/kernel/vbus/userspace-client.c
new file mode 100644
index 0000000..b2fe447
--- /dev/null
+++ b/kernel/vbus/userspace-client.c
@@ -0,0 +1,485 @@
+#include <linux/ioctl.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/uaccess.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+
+#include <linux/vbus.h>
+#include <linux/vbus_userspace.h>
+
+#include "vbus.h"
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+struct userspace_chardev;
+
+struct userspace_signal {
+ struct userspace_chardev *cd;
+ struct vbus_shm *shm;
+ struct shm_signal signal;
+ struct list_head list;
+ int signaled;
+ int prio;
+ u64 cookie;
+};
+
+struct userspace_shm {
+ struct vbus_shm shm;
+};
+
+struct userspace_chardev {
+ spinlock_t lock;
+ int opened;
+ struct vbus_memctx *ctx;
+ struct vbus_client *client;
+ struct list_head signal_list;
+ wait_queue_head_t wq;
+ struct vbus *vbus;
+};
+
+static long
+_busopen(struct userspace_chardev *cd, struct vbus_userspace_busopen *args)
+{
+ if (cd->opened)
+ return -EINVAL;
+
+ if (args->magic != VBUS_USERSPACE_ABI_MAGIC)
+ return -EINVAL;
+
+ if (args->version != VBUS_USERSPACE_ABI_VERSION)
+ return -EINVAL;
+
+ /*
+ * We have no extended capabilities yet, so we don't care if they set
+ * any option bits. Just clear them all.
+ */
+ args->capabilities = 0;
+
+ cd->opened = 1;
+
+ return 0;
+}
+
+static long
+_deviceopen(struct userspace_chardev *cd, struct vbus_deviceopen *args)
+{
+ struct vbus_client *c = cd->client;
+
+ return c->ops->deviceopen(c, cd->ctx, args->devid, args->version,
+ &args->handle);
+}
+
+static long
+_deviceclose(struct userspace_chardev *cd, unsigned long devh)
+{
+ struct vbus_client *c = cd->client;
+
+ return c->ops->deviceclose(c, devh);
+}
+
+static long
+_devicecall(struct userspace_chardev *cd, struct vbus_devicecall *args)
+{
+ struct vbus_client *c = cd->client;
+
+ return c->ops->devicecall(c, args->devh, args->func,
+ (void *)args->datap,
+ args->len, args->flags);
+}
+
+static void *
+userspace_vmap(__u64 addr, size_t len)
+{
+ struct page **page_list;
+ void *ptr = NULL;
+ unsigned long base;
+ off_t offset;
+ size_t npages;
+ int ret;
+
+ base = (unsigned long)addr & PAGE_MASK;
+ offset = (unsigned long)addr & ~PAGE_MASK;
+ npages = PAGE_ALIGN(len + offset) >> PAGE_SHIFT;
+
+ if (npages > (PAGE_SIZE / sizeof(struct page *)))
+ return NULL;
+
+ page_list = (struct page **) __get_free_page(GFP_KERNEL);
+ if (!page_list)
+ return NULL;
+
+ down_write(&current->mm->mmap_sem);
+
+ ret = get_user_pages(current, current->mm, base, npages,
+ 1, 0, page_list, NULL);
+ if (ret < 0)
+ goto out;
+
+ ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
+ if (ptr)
+ current->mm->locked_vm += npages;
+
+out:
+ up_write(&current->mm->mmap_sem);
+ free_page((unsigned long)page_list);
+
+ if (!ptr)
+ return NULL;
+
+ return ptr + offset;
+}
+
+static struct userspace_signal *to_userspace(struct shm_signal *signal)
+{
+ return container_of(signal, struct userspace_signal, signal);
+}
+
+static int
+userspace_signal_inject(struct shm_signal *signal)
+{
+ struct userspace_signal *_signal = to_userspace(signal);
+ struct userspace_chardev *cd = _signal->cd;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cd->lock, flags);
+
+ if (!_signal->signaled) {
+ _signal->signaled = 1;
+ list_add_tail(&_signal->list, &cd->signal_list);
+ wake_up_interruptible(&cd->wq);
+ }
+
+ spin_unlock_irqrestore(&cd->lock, flags);
+
+ return 0;
+}
+
+static void
+userspace_signal_release(struct shm_signal *signal)
+{
+ struct userspace_signal *_signal = to_userspace(signal);
+
+ vbus_shm_put(_signal->shm);
+ kfree(_signal);
+}
+
+static struct shm_signal_ops userspace_signal_ops = {
+ .inject = userspace_signal_inject,
+ .release = userspace_signal_release,
+};
+
+static long
+userspace_signal_alloc(struct vbus_shm *shm,
+ u32 offset, u32 prio, u64 cookie,
+ struct userspace_signal **usignal)
+{
+ struct userspace_signal *_signal;
+ struct shm_signal *signal;
+ struct shm_signal_desc *desc;
+ int ret = -EINVAL;
+
+ _signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+ if (!_signal)
+ return -ENOMEM;
+
+ desc = (struct shm_signal_desc *)(shm->ptr + offset);
+
+ if (desc->magic != SHM_SIGNAL_MAGIC)
+ goto out;
+
+ if (desc->ver != SHM_SIGNAL_VER)
+ goto out;
+
+ signal = &_signal->signal;
+
+ shm_signal_init(signal);
+
+ signal->locale = shm_locality_south;
+ signal->ops = &userspace_signal_ops;
+ signal->desc = desc;
+
+ _signal->shm = shm;
+ _signal->prio = prio;
+ _signal->cookie = cookie;
+ vbus_shm_get(shm); /* dropped when the signal is released */
+
+ *usignal = _signal;
+
+ return 0;
+
+out:
+ kfree(_signal);
+
+ return ret;
+}
+
+static void
+userspace_shm_release(struct vbus_shm *shm)
+{
+ struct userspace_shm *_shm = container_of(shm, struct userspace_shm,
+ shm);
+
+ /* FIXME: do we need to adjust current->mm->locked_vm? */
+ vunmap((void *)((unsigned long)shm->ptr & PAGE_MASK));
+ kfree(_shm);
+}
+
+static struct vbus_shm_ops userspace_shm_ops = {
+ .release = userspace_shm_release,
+};
+
+static int
+userspace_shm_map(struct userspace_chardev *cd,
+ __u64 ptr, __u32 len,
+ struct userspace_shm **ushm)
+{
+ struct userspace_shm *_shm;
+ struct vbus_shm *shm;
+ void *vmap;
+
+ _shm = kzalloc(sizeof(*_shm), GFP_KERNEL);
+ if (!_shm)
+ return -ENOMEM;
+
+ shm = &_shm->shm;
+
+ vmap = userspace_vmap(ptr, len);
+ if (!vmap) {
+ kfree(_shm);
+ return -EFAULT;
+ }
+
+ vbus_shm_init(shm, &userspace_shm_ops, vmap, len);
+
+ *ushm = _shm;
+
+ return 0;
+}
+
+static long
+_deviceshm(struct userspace_chardev *cd, struct vbus_deviceshm *args)
+{
+ struct vbus_client *c = cd->client;
+ struct userspace_signal *_signal = NULL;
+ struct shm_signal *signal = NULL;
+ struct userspace_shm *_shm;
+ u64 handle;
+ long ret;
+
+ ret = userspace_shm_map(cd, args->datap, args->len, &_shm);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Establishing a signal is optional
+ */
+ if (args->signal.offset != -1) {
+ ret = userspace_signal_alloc(&_shm->shm,
+ args->signal.offset,
+ args->signal.prio,
+ args->signal.cookie,
+ &_signal);
+ if (ret < 0)
+ goto out;
+
+ _signal->cd = cd;
+ signal = &_signal->signal;
+ }
+
+ ret = c->ops->deviceshm(c, args->devh, args->id,
+ &_shm->shm, signal,
+ args->flags, &handle);
+ if (ret < 0)
+ goto out;
+
+ args->handle = handle;
+
+ return 0;
+
+out:
+ if (signal)
+ shm_signal_put(signal);
+
+ vbus_shm_put(&_shm->shm);
+ return ret;
+}
+
+static long
+_shmsignal(struct userspace_chardev *cd, unsigned long handle)
+{
+ struct vbus_client *c = cd->client;
+
+ return c->ops->shmsignal(c, handle);
+}
+
+static int
+vbus_chardev_open(struct inode *inode, struct file *filp)
+{
+ struct vbus *vbus = task_vbus_get(current);
+ struct vbus_client *client;
+ struct vbus_memctx *ctx;
+ struct userspace_chardev *cd;
+
+ if (!vbus)
+ return -EPERM;
+
+ client = vbus_client_attach(vbus);
+ vbus_put(vbus);
+ if (!client)
+ return -ENOMEM;
+
+ ctx = task_memctx_alloc(current);
+ if (!ctx) {
+ vbus_client_put(client);
+ return -ENOMEM;
+ }
+
+ cd = kzalloc(sizeof(*cd), GFP_KERNEL);
+ if (!cd) {
+ vbus_memctx_put(ctx);
+ vbus_client_put(client);
+ return -ENOMEM;
+ }
+
+ spin_lock_init(&cd->lock);
+ cd->opened = 0;
+ cd->client = client;
+ cd->ctx = ctx;
+
+ INIT_LIST_HEAD(&cd->signal_list);
+ init_waitqueue_head(&cd->wq);
+
+ filp->private_data = cd;
+
+ return 0;
+}
+
+static long
+vbus_chardev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
+{
+ struct userspace_chardev *cd = filp->private_data;
+
+ if (!cd->opened && ioctl != VBUS_BUSOPEN)
+ return -EINVAL;
+
+ switch (ioctl) {
+ case VBUS_BUSOPEN:
+ return _busopen(cd, (struct vbus_userspace_busopen *)arg);
+ case VBUS_DEVICEOPEN:
+ return _deviceopen(cd, (struct vbus_deviceopen *)arg);
+ case VBUS_DEVICECLOSE:
+ return _deviceclose(cd, *(__u64 *)arg);
+ case VBUS_DEVICECALL:
+ return _devicecall(cd, (struct vbus_devicecall *)arg);
+ case VBUS_DEVICESHM:
+ return _deviceshm(cd, (struct vbus_deviceshm *)arg);
+ case VBUS_SHMSIGNAL:
+ return _shmsignal(cd, *(__u64 *)arg);
+ default:
+ return -EINVAL;
+ }
+}
+
+static ssize_t
+vbus_chardev_read(struct file *filp, char __user *buf, size_t len,
+ loff_t *ppos)
+{
+ DEFINE_WAIT(wait);
+ struct userspace_chardev *cd = filp->private_data;
+ ssize_t bytes = 0;
+ int count, i;
+ __u64 __user *p = (__u64 __user *)buf;
+ unsigned long flags;
+
+ count = len/sizeof(__u64);
+
+ if (!count)
+ return -EINVAL;
+
+ spin_lock_irqsave(&cd->lock, flags);
+
+ for (;;) {
+ prepare_to_wait(&cd->wq, &wait, TASK_INTERRUPTIBLE);
+
+ if (!list_empty(&cd->signal_list))
+ break;
+
+ if (signal_pending(current)) {
+ finish_wait(&cd->wq, &wait);
+ spin_unlock_irqrestore(&cd->lock, flags);
+ return -EINTR;
+ }
+
+ spin_unlock_irqrestore(&cd->lock, flags);
+ schedule();
+ spin_lock_irqsave(&cd->lock, flags);
+ }
+
+ finish_wait(&cd->wq, &wait);
+
+ for (i = 0; i < count; i++) {
+ struct userspace_signal *_signal;
+ __u64 cookie;
+
+ if (list_empty(&cd->signal_list))
+ break;
+
+ _signal = list_first_entry(&cd->signal_list,
+ struct userspace_signal, list);
+
+ _signal->signaled = 0;
+ list_del(&_signal->list);
+
+ cookie = _signal->cookie;
+
+ put_user(cookie, p++);
+
+ bytes += sizeof(cookie);
+ }
+
+ spin_unlock_irqrestore(&cd->lock, flags);
+
+ return bytes;
+}
+
+static int
+vbus_chardev_release(struct inode *inode, struct file *filp)
+{
+ struct userspace_chardev *cd = filp->private_data;
+
+ vbus_memctx_put(cd->ctx);
+ vbus_client_put(cd->client);
+ kfree(cd);
+
+ return 0;
+}
+
+static const struct file_operations vbus_chardev_ops = {
+ .open = vbus_chardev_open,
+ .read = vbus_chardev_read,
+ .unlocked_ioctl = vbus_chardev_ioctl,
+ .compat_ioctl = vbus_chardev_ioctl,
+ .release = vbus_chardev_release,
+};
+
+static struct miscdevice vbus_chardev = {
+ MISC_DYNAMIC_MINOR,
+ "vbus",
+ &vbus_chardev_ops,
+};
+
+static int __init
+vbus_userspace_init(void)
+{
+ return misc_register(&vbus_chardev);
+}
+
+static void __exit
+vbus_userspace_cleanup(void)
+{
+ misc_deregister(&vbus_chardev);
+}
+
+module_init(vbus_userspace_init);
+module_exit(vbus_userspace_cleanup);
We add a new virtio transport for accessing backends located on vbus. This
complements the existing transports: virtio-pci, virtio-s390, and
virtio-lguest.
Signed-off-by: Gregory Haskins <[email protected]>
---
drivers/virtio/Kconfig | 15 +
drivers/virtio/Makefile | 1
drivers/virtio/virtio_vbus.c | 496 +++++++++++++++++++++++++++++++++
include/linux/virtio_vbus.h | 163 +++++++++++
kernel/vbus/Kconfig | 7
kernel/vbus/Makefile | 3
kernel/vbus/virtio.c | 628 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 1313 insertions(+), 0 deletions(-)
create mode 100644 drivers/virtio/virtio_vbus.c
create mode 100644 include/linux/virtio_vbus.h
create mode 100644 kernel/vbus/virtio.c
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3dd6294..e8562ee 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -23,6 +23,21 @@ config VIRTIO_PCI
If unsure, say M.
+config VIRTIO_VBUS
+ tristate "VBUS driver for virtio devices (EXPERIMENTAL)"
+ depends on VBUS_DRIVERS && EXPERIMENTAL
+ select VIRTIO
+ select VIRTIO_RING
+ ---help---
+ This driver provides support for virtio-based paravirtual device
+ drivers over VBUS. This requires that your VMM has appropriate VBUS
+ virtio backends.
+
+ Currently, the ABI is not considered stable so there is no guarantee
+ that this version of the driver will work with your VMM.
+
+ If unsure, say M.
+
config VIRTIO_BALLOON
tristate "Virtio balloon driver (EXPERIMENTAL)"
select VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 6738c44..0342e42 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_VIRTIO) += virtio.o
obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
+obj-$(CONFIG_VIRTIO_VBUS) += virtio_vbus.o
obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
diff --git a/drivers/virtio/virtio_vbus.c b/drivers/virtio/virtio_vbus.c
new file mode 100644
index 0000000..ebefcf2
--- /dev/null
+++ b/drivers/virtio/virtio_vbus.c
@@ -0,0 +1,496 @@
+/*
+ * Virtio VBUS driver
+ *
+ * This module allows virtio devices to be used over a virtual-bus device.
+ *
+ * Copyright: Novell, 2009
+ *
+ * Authors:
+ * Gregory Haskins <[email protected]>
+ *
+ * Derived from virtio-pci, written by
+ * Anthony Liguori <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/interrupt.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ring.h>
+#include <linux/virtio_vbus.h>
+#include <linux/vbus_driver.h>
+#include <linux/spinlock.h>
+
+MODULE_AUTHOR("Gregory Haskins <[email protected]>");
+MODULE_DESCRIPTION("virtio-vbus");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
+
+struct virtio_vbus_priv {
+ struct virtio_device virtio_dev;
+ struct vbus_device_proxy *vbus_dev;
+ struct {
+ struct virtio_vbus_shm *shm;
+ struct shm_signal *signal;
+ struct shm_signal_notifier notifier;
+ } config;
+};
+
+struct vbus_virtqueue {
+ struct virtqueue *vq;
+ u64 index;
+ int num;
+ struct virtio_vbus_shm *shm;
+ size_t size;
+ struct shm_signal *signal;
+ struct shm_signal_notifier notifier;
+};
+
+static struct virtio_vbus_priv *
+virtio_to_priv(struct virtio_device *virtio_dev)
+{
+ return container_of(virtio_dev, struct virtio_vbus_priv, virtio_dev);
+}
+
+static int
+devcall(struct virtio_vbus_priv *priv, u32 func, void *data, size_t len)
+{
+ struct vbus_device_proxy *dev = priv->vbus_dev;
+
+ return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * This is called whenever the host signals our config-space shm
+ */
+static void
+config_isr(struct shm_signal_notifier *notifier)
+{
+ struct virtio_vbus_priv *priv = container_of(notifier,
+ struct virtio_vbus_priv,
+ config.notifier);
+ struct virtio_driver *drv = container_of(priv->virtio_dev.dev.driver,
+ struct virtio_driver, driver);
+
+ if (drv && drv->config_changed)
+ drv->config_changed(&priv->virtio_dev);
+}
+
+/*
+ * ------------------
+ * virtio config ops
+ * ------------------
+ */
+
+static u32
+_virtio_get_features(struct virtio_device *dev)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(dev);
+ u32 features;
+ int ret;
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_GET_FEATURES,
+ &features, sizeof(features));
+ BUG_ON(ret < 0);
+
+ /*
+ * When someone needs more than 32 feature bits, we'll need to
+ * steal a bit to indicate that the rest are somewhere else.
+ */
+ return features;
+}
+
+static void
+_virtio_finalize_features(struct virtio_device *dev)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(dev);
+ int ret;
+
+ /* Give virtio_ring a chance to accept features. */
+ vring_transport_features(dev);
+
+ /* We only support 32 feature bits. */
+ BUILD_BUG_ON(ARRAY_SIZE(dev->features) != 1);
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_FINALIZE_FEATURES,
+ &dev->features[0], sizeof(dev->features[0]));
+ BUG_ON(ret < 0);
+}
+
+static void
+_virtio_get(struct virtio_device *vdev, unsigned offset,
+ void *buf, unsigned len)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vdev);
+
+ BUG_ON((offset + len) > VIRTIO_VBUS_CONFIGSPACE_LEN);
+ memcpy(buf, &priv->config.shm->data[offset], len);
+}
+
+static void
+_virtio_set(struct virtio_device *vdev, unsigned offset,
+ const void *buf, unsigned len)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vdev);
+ int ret;
+
+ BUG_ON((offset + len) > VIRTIO_VBUS_CONFIGSPACE_LEN);
+ memcpy(&priv->config.shm->data[offset], buf, len);
+
+ ret = shm_signal_inject(priv->config.signal, 0);
+ BUG_ON(ret < 0);
+}
+
+static u8
+_virtio_get_status(struct virtio_device *vdev)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vdev);
+ u8 data;
+ int ret;
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_GET_STATUS, &data, sizeof(data));
+ BUG_ON(ret < 0);
+
+ return data;
+}
+
+static void
+_virtio_set_status(struct virtio_device *vdev, u8 status)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vdev);
+ int ret;
+
+ /* We should never be setting status to 0. */
+ BUG_ON(status == 0);
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_SET_STATUS, &status,
+ sizeof(status));
+ BUG_ON(ret < 0);
+}
+
+static void
+_virtio_reset(struct virtio_device *vdev)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vdev);
+ int ret;
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_RESET, NULL, 0);
+ BUG_ON(ret < 0);
+}
+
+/*
+ * ------------------
+ * virtqueue ops
+ * ------------------
+ */
+
+static int
+_vq_getlen(struct virtio_vbus_priv *priv, int index)
+{
+ struct virtio_vbus_queryqueue query = {
+ .index = index,
+ };
+ int ret;
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_QUERY_QUEUE,
+ &query, sizeof(query));
+ if (ret < 0)
+ return ret;
+
+ return query.num;
+}
+
+static void
+_vq_kick(struct virtqueue *vq)
+{
+ struct vbus_virtqueue *_vq = vq->priv;
+ int ret;
+
+ ret = shm_signal_inject(_vq->signal, 0);
+ BUG_ON(ret < 0);
+}
+
+/*
+ * This is called whenever the host signals our virtqueue
+ */
+static void
+_vq_isr(struct shm_signal_notifier *notifier)
+{
+ struct vbus_virtqueue *_vq = container_of(notifier,
+ struct vbus_virtqueue,
+ notifier);
+ vring_interrupt(0, _vq->vq);
+}
+
+static struct virtqueue *
+_virtio_find_vq(struct virtio_device *vdev, unsigned index,
+ void (*callback)(struct virtqueue *vq))
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vdev);
+ struct vbus_device_proxy *dev = priv->vbus_dev;
+ struct vbus_virtqueue *_vq;
+ struct virtqueue *vq;
+ unsigned long ringsize;
+ int num;
+ int ret;
+
+ num = _vq_getlen(priv, index);
+ if (num < 0)
+ return ERR_PTR(num);
+
+	_vq = kzalloc(sizeof(struct vbus_virtqueue), GFP_KERNEL);
+ if (!_vq)
+ return ERR_PTR(-ENOMEM);
+
+ ringsize = vring_size(num, PAGE_SIZE);
+
+ _vq->index = index;
+ _vq->num = num;
+ _vq->size = PAGE_ALIGN(sizeof(struct virtio_vbus_shm) + ringsize - 1);
+
+ _vq->shm = alloc_pages_exact(_vq->size, GFP_KERNEL|__GFP_ZERO);
+ if (!_vq->shm) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* initialize the shm with a ring */
+ vq = vring_new_virtqueue(_vq->num, PAGE_SIZE, vdev,
+ &_vq->shm->data[0],
+ _vq_kick,
+ callback);
+ if (!vq) {
+ ret = -ENOMEM;
+ goto out_free;
+ }
+
+ /* register the shm with an id of the vq index + RING_OFFSET */
+ ret = dev->ops->shm(dev, index + VIRTIO_VBUS_RING_OFFSET, 0,
+ _vq->shm, _vq->size,
+ &_vq->shm->signal, &_vq->signal, 0);
+ if (ret < 0)
+ goto out_free;
+
+ _vq->notifier.signal = &_vq_isr;
+ _vq->signal->notifier = &_vq->notifier;
+
+ shm_signal_enable(_vq->signal, 0);
+
+ vq->priv = _vq;
+ _vq->vq = vq;
+
+ return vq;
+
+out_free:
+ free_pages_exact(_vq->shm, _vq->size);
+out:
+ if (_vq && _vq->signal)
+ shm_signal_put(_vq->signal);
+ kfree(_vq);
+ return ERR_PTR(ret);
+}
+
+/* the config->del_vq() implementation */
+static void
+_virtio_del_vq(struct virtqueue *vq)
+{
+ struct virtio_vbus_priv *priv = virtio_to_priv(vq->vdev);
+ struct vbus_virtqueue *_vq = vq->priv;
+
+ devcall(priv, VIRTIO_VBUS_FUNC_DEL_QUEUE,
+ &_vq->index, sizeof(_vq->index));
+
+ vring_del_virtqueue(vq);
+
+ shm_signal_put(_vq->signal);
+ free_pages_exact(_vq->shm, _vq->size);
+ kfree(_vq);
+}
+
+/*
+ * ------------------
+ * general setup
+ * ------------------
+ */
+
+static struct virtio_config_ops virtio_vbus_config_ops = {
+ .get = _virtio_get,
+ .set = _virtio_set,
+ .get_status = _virtio_get_status,
+ .set_status = _virtio_set_status,
+ .reset = _virtio_reset,
+ .find_vq = _virtio_find_vq,
+ .del_vq = _virtio_del_vq,
+ .get_features = _virtio_get_features,
+ .finalize_features = _virtio_finalize_features,
+};
+
+/*
+ * Negotiate vbus transport features. This is not to be confused with the
+ * higher-level function FUNC_GET/FINALIZE_FEATURES, which is specifically
+ * for the virtio transport
+ */
+static void
+virtio_vbus_negcap(struct virtio_vbus_priv *priv)
+{
+ u64 features = 0; /* We do not have any advanced features to enable */
+ int ret;
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_NEG_CAP,
+ &features, sizeof(features));
+ BUG_ON(ret < 0);
+}
+
+static void
+virtio_vbus_getid(struct virtio_vbus_priv *priv)
+{
+ struct virtio_vbus_id id;
+ int ret;
+
+ ret = devcall(priv, VIRTIO_VBUS_FUNC_GET_ID, &id, sizeof(id));
+ BUG_ON(ret < 0);
+
+ priv->virtio_dev.id.vendor = id.vendor;
+ priv->virtio_dev.id.device = id.device;
+}
+
+static int
+virtio_vbus_initconfig(struct virtio_vbus_priv *priv)
+{
+ struct vbus_device_proxy *vdev = priv->vbus_dev;
+ size_t len;
+ int ret;
+
+ len = sizeof(struct virtio_vbus_shm) + VIRTIO_VBUS_CONFIGSPACE_LEN - 1;
+
+ priv->config.shm = kzalloc(len, GFP_KERNEL);
+ if (!priv->config.shm)
+ return -ENOMEM;
+
+ ret = vdev->ops->shm(vdev, 0, 0,
+			     priv->config.shm, len,
+ &priv->config.shm->signal, &priv->config.signal,
+ 0);
+ BUG_ON(ret < 0);
+
+ priv->config.notifier.signal = &config_isr;
+ priv->config.signal->notifier = &priv->config.notifier;
+
+ shm_signal_enable(priv->config.signal, 0);
+
+ return 0;
+}
+
+/* the VBUS probing function */
+static int
+virtio_vbus_probe(struct vbus_device_proxy *vdev)
+{
+ struct virtio_vbus_priv *priv;
+ int ret;
+
+ printk(KERN_INFO "VIRTIO-VBUS: Found new device at %lld\n", vdev->id);
+
+ ret = vdev->ops->open(vdev, VIRTIO_VBUS_ABI_VERSION, 0);
+ if (ret < 0) {
+		printk(KERN_ERR "virtio_vbus: failed to open ABI version %d: %d\n",
+ VIRTIO_VBUS_ABI_VERSION, ret);
+ return ret;
+ }
+
+ priv = kzalloc(sizeof(struct virtio_vbus_priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->virtio_dev.config = &virtio_vbus_config_ops;
+ priv->vbus_dev = vdev;
+
+ /*
+ * Negotiate for any vbus specific features
+ */
+ virtio_vbus_negcap(priv);
+
+ /*
+ * This probe occurs for any "virtio" device on the vbus, so we need
+	 * to call into the host to figure out which specific device this
+	 * is (i.e. its virtio vendor/device ID)
+ */
+ virtio_vbus_getid(priv);
+
+	/*
+	 * Map our config-space to the device, and establish a signal-path
+	 * for config-space updates
+	 */
+	ret = virtio_vbus_initconfig(priv);
+	if (ret < 0)
+		goto out;
+
+ /* finally register the virtio device */
+ ret = register_virtio_device(&priv->virtio_dev);
+ if (ret)
+ goto out;
+
+ vdev->priv = priv;
+
+ return 0;
+
+out:
+ kfree(priv);
+ return ret;
+}
+
+#ifdef NOTYET
+/* FIXME: wire this up */
+static void
+virtio_vbus_release(struct virtio_vbus_priv *priv)
+{
+ shm_signal_put(priv->config.signal);
+ kfree(priv->config.shm);
+ kfree(priv);
+}
+
+#endif
+
+static int
+virtio_vbus_remove(struct vbus_device_proxy *vdev)
+{
+ struct virtio_vbus_priv *priv = vdev->priv;
+
+ unregister_virtio_device(&priv->virtio_dev);
+
+ return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops virtio_vbus_driver_ops = {
+ .probe = virtio_vbus_probe,
+ .remove = virtio_vbus_remove,
+};
+
+static struct vbus_driver virtio_vbus_driver = {
+ .type = "virtio",
+ .owner = THIS_MODULE,
+ .ops = &virtio_vbus_driver_ops,
+};
+
+static __init int
+virtio_vbus_init_module(void)
+{
+ printk(KERN_INFO "Virtio-VBUS: Copyright (C) 2009 Novell, Gregory Haskins\n");
+ return vbus_driver_register(&virtio_vbus_driver);
+}
+
+static __exit void
+virtio_vbus_cleanup(void)
+{
+ vbus_driver_unregister(&virtio_vbus_driver);
+}
+
+module_init(virtio_vbus_init_module);
+module_exit(virtio_vbus_cleanup);
+
diff --git a/include/linux/virtio_vbus.h b/include/linux/virtio_vbus.h
new file mode 100644
index 0000000..05791bf
--- /dev/null
+++ b/include/linux/virtio_vbus.h
@@ -0,0 +1,163 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Virtio VBUS driver
+ *
+ * This module allows virtio devices to be used over a VBUS interface
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VIRTIO_VBUS_H
+#define _LINUX_VIRTIO_VBUS_H
+
+#include <linux/shm_signal.h>
+
+#define VIRTIO_VBUS_ABI_VERSION 1
+
+enum {
+ VIRTIO_VBUS_FUNC_NEG_CAP,
+ VIRTIO_VBUS_FUNC_GET_ID,
+ VIRTIO_VBUS_FUNC_GET_FEATURES,
+ VIRTIO_VBUS_FUNC_FINALIZE_FEATURES,
+ VIRTIO_VBUS_FUNC_GET_STATUS,
+ VIRTIO_VBUS_FUNC_SET_STATUS,
+ VIRTIO_VBUS_FUNC_RESET,
+ VIRTIO_VBUS_FUNC_QUERY_QUEUE,
+ VIRTIO_VBUS_FUNC_DEL_QUEUE,
+};
+
+struct virtio_vbus_id {
+ u16 vendor;
+ u16 device;
+};
+
+struct virtio_vbus_queryqueue {
+ u64 index; /* in: queue index */
+ u32 num; /* out: number of entries */
+ u32 pad[0];
+};
+
+#define VIRTIO_VBUS_CONFIGSPACE_LEN 1024
+#define VIRTIO_VBUS_RING_OFFSET 10000 /* shm-index where rings start */
+
+struct virtio_vbus_shm {
+ struct shm_signal_desc signal;
+ char data[1];
+};
+
+/*
+ * --------------------------------------------------
+ * Backend support - These components are only needed
+ * for interfacing a virtio-backend to the vbus-backend
+ * --------------------------------------------------
+ */
+
+#include <linux/vbus_device.h>
+
+struct virtio_device_interface;
+struct virtio_connection;
+
+struct virtio_queue_def {
+ int index;
+ int entries;
+};
+
+/*
+ * ----------------------
+ * interface
+ * ----------------------
+ */
+
+struct virtio_device_interface_ops {
+ int (*open)(struct virtio_device_interface *intf,
+ struct vbus_memctx *ctx,
+ struct virtio_connection **conn);
+ void (*release)(struct virtio_device_interface *intf);
+};
+
+struct virtio_device_interface {
+ struct virtio_vbus_id id;
+ struct virtio_device_interface_ops *ops;
+ struct virtio_queue_def *queues;
+ struct vbus_device_interface *parent;
+};
+
+/**
+ * virtio_device_interface_register() - register an interface with a bus
+ * @dev: The device context of the caller
+ * @vbus: The bus context to register with
+ * @intf: The interface context to register
+ *
+ * This function is invoked (usually in the context of a device::bus_connect()
+ * callback) to register an interface on a bus.  We make this an explicit
+ * operation instead of implicit on the bus_connect() to facilitate devices
+ * that may present multiple interfaces to a bus. In those cases, a device
+ * may invoke this function multiple times (one per supported interface).
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int virtio_device_interface_register(struct vbus_device *dev,
+ struct vbus *vbus,
+ struct virtio_device_interface *intf);
+
+/**
+ * virtio_device_interface_unregister() - unregister an interface with a bus
+ * @intf: The interface context to unregister
+ *
+ * This function is the converse of interface_register. It is typically
+ * invoked in the context of a device::bus_disconnect().
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int virtio_device_interface_unregister(struct virtio_device_interface *intf);
+
+/*
+ * ----------------------
+ * connection
+ * ----------------------
+ */
+struct virtqueue;
+
+struct virtio_connection_ops {
+ void (*config_changed)(struct virtio_connection *vconn);
+ u8 (*get_status)(struct virtio_connection *vconn);
+ void (*set_status)(struct virtio_connection *vconn, u8 status);
+ void (*reset)(struct virtio_connection *vconn);
+ u32 (*get_features)(struct virtio_connection *vconn);
+ void (*finalize_features)(struct virtio_connection *vconn);
+ void (*add_vq)(struct virtio_connection *vconn, int index,
+ struct virtqueue *vq);
+ void (*del_vq)(struct virtio_connection *vconn, int index);
+ void (*notify_vq)(struct virtio_connection *vconn, int index);
+ void (*release)(struct virtio_connection *conn);
+};
+
+struct virtio_connection {
+ struct virtio_connection_ops *ops;
+ struct vbus_connection *parent;
+};
+
+int virtio_connection_config_get(struct virtio_connection *vconn,
+ int offset, void *buf, size_t len);
+
+int virtio_connection_config_set(struct virtio_connection *vconn,
+ int offset, void *buf, size_t len);
+
+#endif /* _LINUX_VIRTIO_VBUS_H */
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index b894dd1..5eeced2 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -14,6 +14,13 @@ config VBUS
If unsure, say N
+config VBUS_VIRTIO_BACKEND
+ tristate "Virtio VBUS Backend"
+ depends on VBUS
+ default n
+ help
+ Provides backend support for virtio devices over vbus
+
config VBUS_DEVICES
bool "Virtual-Bus Devices"
depends on VBUS
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index 61d0371..c2bd140 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1,6 +1,9 @@
obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
obj-$(CONFIG_VBUS) += shm-ioq.o
+virtio-backend-objs += virtio.o
+obj-$(CONFIG_VBUS_VIRTIO_BACKEND) += virtio-backend.o
+
vbus-proxy-objs += proxy.o
obj-$(CONFIG_VBUS_DRIVERS) += vbus-proxy.o
diff --git a/kernel/vbus/virtio.c b/kernel/vbus/virtio.c
new file mode 100644
index 0000000..dac5cd4
--- /dev/null
+++ b/kernel/vbus/virtio.c
@@ -0,0 +1,628 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+#include <linux/virtio_vbus.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#undef PDEBUG
+#ifdef VIRTIO_VBUS_DEBUG
+# define PDEBUG(fmt, args...) printk(KERN_DEBUG "virtio-vbus: " fmt, ## args)
+#else
+# define PDEBUG(fmt, args...)
+#endif
+
+struct _virtio_device_interface {
+ struct virtio_device_interface *vintf;
+ struct vbus_device_interface intf;
+};
+
+struct _virtio_connection {
+ struct _virtio_device_interface *_vintf;
+ struct virtio_connection *vconn;
+ struct vbus_connection conn;
+ struct vbus_memctx *ctx;
+ struct list_head queues;
+
+ struct {
+ struct vbus_shm *shm;
+ struct shm_signal *signal;
+ struct shm_signal_notifier notifier;
+ } config;
+
+	unsigned int running:1;
+};
+
+struct _virtio_queue {
+ int index;
+ int num;
+ struct _virtio_connection *_vconn;
+ struct virtqueue *vq;
+
+ struct vbus_shm *shm;
+ struct shm_signal *signal;
+ struct shm_signal_notifier notifier;
+
+ struct list_head node;
+};
+
+static struct _virtio_device_interface *
+to_vintf(struct vbus_device_interface *intf)
+{
+ return container_of(intf, struct _virtio_device_interface, intf);
+}
+
+static struct _virtio_connection *
+to_vconn(struct vbus_connection *conn)
+{
+ return container_of(conn, struct _virtio_connection, conn);
+}
+
+int virtio_connection_config_get(struct virtio_connection *vconn,
+ int offset, void *buf, size_t len)
+{
+ struct _virtio_connection *_vconn = to_vconn(vconn->parent);
+ char *data;
+
+ if (!_vconn->config.shm)
+ return -EINVAL;
+
+ if (offset + len > _vconn->config.shm->len)
+ return -EINVAL;
+
+ data = _vconn->config.shm->ptr;
+
+ memcpy(buf, &data[offset], len);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_connection_config_get);
+
+int virtio_connection_config_set(struct virtio_connection *vconn,
+ int offset, void *buf, size_t len)
+{
+ struct _virtio_connection *_vconn = to_vconn(vconn->parent);
+ char *data;
+
+ if (!_vconn->config.shm)
+ return -EINVAL;
+
+ if (offset + len > _vconn->config.shm->len)
+ return -EINVAL;
+
+ data = _vconn->config.shm->ptr;
+
+ memcpy(&data[offset], buf, len);
+
+ if (_vconn->config.signal)
+ shm_signal_inject(_vconn->config.signal, 0);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_connection_config_set);
+
+/*
+ * Negotiate Capabilities - This function is provided so that the
+ * interface may be extended without breaking ABI compatibility
+ *
+ * The caller is expected to send down any capabilities they would like
+ * to enable, and the device will mask (AND) them against the capabilities
+ * that it supports.  The result is then returned so that both sides may
+ * ascertain the lowest common denominator of features to enable
+ */
+static int
+_virtio_connection_negcap(struct _virtio_connection *_vconn,
+ void *data, unsigned long len)
+{
+ struct vbus_memctx *ctx = _vconn->ctx;
+ u64 features;
+ int ret;
+
+ if (len != sizeof(features))
+ return -EINVAL;
+
+ if (_vconn->running)
+ return -EINVAL;
+
+#ifdef NOTYET
+ ret = ctx->ops->copy_from(ctx, &features, data, sizeof(features));
+ if (ret)
+ return -EFAULT;
+#endif
+
+	/*
+	 * right now we don't support any advanced features, so just clear
+	 * all bits
+	 */
+ features = 0;
+
+ ret = ctx->ops->copy_to(ctx, data, &features, sizeof(features));
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+static int
+_virtio_connection_getid(struct _virtio_connection *_vconn,
+ void *data, unsigned long len)
+{
+ struct vbus_memctx *ctx = _vconn->ctx;
+ struct virtio_vbus_id *id = &_vconn->_vintf->vintf->id;
+ int ret;
+
+ if (len != sizeof(*id))
+ return -EINVAL;
+
+ ret = ctx->ops->copy_to(ctx, data, id, sizeof(*id));
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+static int
+_virtio_connection_getstatus(struct _virtio_connection *_vconn,
+ void *data, unsigned long len)
+{
+ struct virtio_connection *vconn = _vconn->vconn;
+ struct vbus_memctx *ctx = _vconn->ctx;
+ u8 val = 0;
+ int ret;
+
+ if (len != sizeof(val))
+ return -EINVAL;
+
+ if (vconn->ops->get_status)
+ val = vconn->ops->get_status(vconn);
+
+ ret = ctx->ops->copy_to(ctx, data, &val, sizeof(val));
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+static int
+_virtio_connection_setstatus(struct _virtio_connection *_vconn,
+ void *data, unsigned long len)
+{
+ struct virtio_connection *vconn = _vconn->vconn;
+ struct vbus_memctx *ctx = _vconn->ctx;
+ u8 val;
+ int ret;
+
+ if (len != sizeof(val))
+ return -EINVAL;
+
+ if (!vconn->ops->set_status)
+ return 0;
+
+ ret = ctx->ops->copy_from(ctx, &val, data, sizeof(val));
+ if (ret)
+ return -EFAULT;
+
+ vconn->ops->set_status(vconn, val);
+
+ return 0;
+}
+
+static int
+_virtio_connection_getfeatures(struct _virtio_connection *_vconn,
+ void *data, unsigned long len)
+{
+ struct virtio_connection *vconn = _vconn->vconn;
+ struct vbus_memctx *ctx = _vconn->ctx;
+ u32 val = 0;
+ int ret;
+
+ if (len != sizeof(val))
+ return -EINVAL;
+
+ if (vconn->ops->get_features)
+ val = vconn->ops->get_features(vconn);
+
+ ret = ctx->ops->copy_to(ctx, data, &val, sizeof(val));
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+static int
+_virtio_connection_finalizefeatures(struct _virtio_connection *_vconn)
+{
+ struct virtio_connection *vconn = _vconn->vconn;
+
+ if (vconn->ops->finalize_features)
+ vconn->ops->finalize_features(vconn);
+
+ return 0;
+}
+
+static int
+_virtio_connection_reset(struct _virtio_connection *_vconn)
+{
+ struct virtio_connection *vconn = _vconn->vconn;
+
+ if (vconn->ops->reset)
+ vconn->ops->reset(vconn);
+
+ return 0;
+}
+
+static struct _virtio_queue *
+_virtio_find_queue(struct _virtio_connection *_vconn, int index)
+{
+ struct _virtio_queue *vq;
+
+ list_for_each_entry(vq, &_vconn->queues, node) {
+ if (vq->index == index)
+ return vq;
+ }
+
+ return NULL;
+}
+
+static int
+_virtio_connection_queryqueue(struct _virtio_connection *_vconn,
+ void *data, unsigned long len)
+{
+ struct vbus_memctx *ctx = _vconn->ctx;
+ struct virtio_vbus_queryqueue val;
+ struct _virtio_queue *vq;
+ int ret;
+
+ if (len != sizeof(val))
+ return -EINVAL;
+
+ ret = ctx->ops->copy_from(ctx, &val, data, sizeof(val));
+ if (ret)
+ return -EFAULT;
+
+ vq = _virtio_find_queue(_vconn, val.index);
+
+ if (!vq)
+ return -EINVAL;
+
+ if (vq->shm)
+ return -EEXIST;
+
+ val.num = vq->num;
+
+ ret = ctx->ops->copy_to(ctx, data, &val, sizeof(val));
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+static int
+_virtio_connection_call(struct vbus_connection *conn,
+ unsigned long func,
+ void *data,
+ unsigned long len,
+ unsigned long flags)
+{
+ struct _virtio_connection *_vconn = to_vconn(conn);
+ int ret = 0;
+
+	PDEBUG("call -> %lu with %p/%lu\n", func, data, len);
+
+ switch (func) {
+ case VIRTIO_VBUS_FUNC_NEG_CAP:
+ ret = _virtio_connection_negcap(_vconn, data, len);
+ break;
+ case VIRTIO_VBUS_FUNC_GET_ID:
+ ret = _virtio_connection_getid(_vconn, data, len);
+ break;
+ case VIRTIO_VBUS_FUNC_GET_FEATURES:
+ ret = _virtio_connection_getfeatures(_vconn, data, len);
+ break;
+ case VIRTIO_VBUS_FUNC_FINALIZE_FEATURES:
+ _virtio_connection_finalizefeatures(_vconn);
+ break;
+ case VIRTIO_VBUS_FUNC_GET_STATUS:
+ ret = _virtio_connection_getstatus(_vconn, data, len);
+ break;
+ case VIRTIO_VBUS_FUNC_SET_STATUS:
+ ret = _virtio_connection_setstatus(_vconn, data, len);
+ break;
+ case VIRTIO_VBUS_FUNC_RESET:
+ _virtio_connection_reset(_vconn);
+ break;
+ case VIRTIO_VBUS_FUNC_QUERY_QUEUE:
+ ret = _virtio_connection_queryqueue(_vconn, data, len);
+ break;
+ case VIRTIO_VBUS_FUNC_DEL_QUEUE:
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+static void _virtio_config_isr(struct shm_signal_notifier *notifier)
+{
+ struct _virtio_connection *_vconn;
+ struct virtio_connection *vconn;
+
+ _vconn = container_of(notifier, struct _virtio_connection,
+ config.notifier);
+
+ vconn = _vconn->vconn;
+
+ if (vconn->ops->config_changed)
+ vconn->ops->config_changed(vconn);
+}
+
+static int
+_virtio_connection_open(struct _virtio_connection *_vconn)
+
+{
+ struct virtio_device_interface *vintf = _vconn->_vintf->vintf;
+ struct virtio_connection *vconn;
+ struct virtio_queue_def *def = vintf->queues;
+ int ret;
+
+ ret = vintf->ops->open(vintf, _vconn->ctx, &vconn);
+ if (ret < 0)
+ return ret;
+
+ while (def && def->index != -1) {
+ struct _virtio_queue *vq;
+
+ vq = kzalloc(sizeof(*vq), GFP_KERNEL);
+ if (!vq)
+ return -ENOMEM;
+
+ vq->index = def->index;
+ vq->num = def->entries;
+ vq->_vconn = _vconn;
+
+ list_add_tail(&vq->node, &_vconn->queues);
+
+ def++;
+ }
+
+ _vconn->vconn = vconn;
+ vconn->parent = &_vconn->conn;
+
+ return 0;
+}
+
+static int
+_virtio_connection_initconfig(struct _virtio_connection *_vconn,
+ struct vbus_shm *shm,
+ struct shm_signal *signal)
+{
+ int ret;
+
+ if (_vconn->running)
+ return -EINVAL;
+
+ _vconn->config.signal = signal;
+ _vconn->config.shm = shm;
+ _vconn->config.notifier.signal = &_virtio_config_isr;
+ signal->notifier = &_vconn->config.notifier;
+
+ shm_signal_enable(signal, 0);
+
+ ret = _virtio_connection_open(_vconn);
+ if (ret < 0)
+ return ret;
+
+ _vconn->running = 1;
+
+ return 0;
+}
+
+static void _vq_isr(struct shm_signal_notifier *notifier)
+{
+ struct _virtio_queue *vq;
+
+ vq = container_of(notifier, struct _virtio_queue, notifier);
+
+ vring_interrupt(0, vq->vq);
+}
+
+static void _vq_notify(struct virtqueue *vq)
+{
+ struct _virtio_queue *_vq = vq->priv;
+
+ shm_signal_inject(_vq->signal, 0);
+}
+
+static void _vq_callback(struct virtqueue *vq)
+{
+ struct _virtio_queue *_vq = vq->priv;
+ struct virtio_connection *vconn = _vq->_vconn->vconn;
+
+ vconn->ops->notify_vq(vconn, _vq->index);
+}
+
+static int
+_virtio_connection_shm(struct vbus_connection *conn,
+ unsigned long id,
+ struct vbus_shm *shm,
+ struct shm_signal *signal,
+ unsigned long flags)
+{
+ struct _virtio_connection *_vconn = to_vconn(conn);
+ struct virtio_connection *vconn = _vconn->vconn;
+ struct _virtio_queue *vq;
+ struct virtio_vbus_shm *_shm = shm->ptr;
+
+ /* All shm connections that we support require a signal */
+ if (!signal)
+ return -EINVAL;
+
+ if (!id)
+ return _virtio_connection_initconfig(_vconn, shm, signal);
+
+ vq = _virtio_find_queue(_vconn, id - VIRTIO_VBUS_RING_OFFSET);
+ if (!vq)
+ return -EINVAL;
+
+ if (vq->shm)
+ return -EEXIST;
+
+ vq->shm = shm;
+ vq->signal = signal;
+
+ vq->notifier.signal = &_vq_isr;
+ signal->notifier = &vq->notifier;
+
+ shm_signal_enable(signal, 0);
+
+	vq->vq = vring_new_virtqueue(vq->num, PAGE_SIZE, NULL,
+				     &_shm->data[0],
+				     _vq_notify, _vq_callback);
+	if (!vq->vq)
+		return -ENOMEM;
+
+	vq->vq->priv = vq;
+
+ vconn->ops->add_vq(vconn, vq->index, vq->vq);
+
+ return 0;
+}
+
+static void
+_virtio_connection_release(struct vbus_connection *conn)
+{
+ struct _virtio_connection *_vconn = to_vconn(conn);
+ struct virtio_connection *vconn = _vconn->vconn;
+ struct _virtio_queue *vq, *tmp;
+
+ vconn->ops->release(vconn);
+
+ list_for_each_entry_safe(vq, tmp, &_vconn->queues, node) {
+ if (vq->vq)
+ vring_del_virtqueue(vq->vq);
+
+ if (vq->shm)
+ vbus_shm_put(vq->shm);
+
+ if (vq->signal)
+ shm_signal_put(vq->signal);
+
+ list_del(&vq->node);
+ kfree(vq);
+ }
+
+ if (_vconn->config.signal)
+ shm_signal_put(_vconn->config.signal);
+
+ if (_vconn->config.shm)
+ vbus_shm_put(_vconn->config.shm);
+
+ kobject_put(&_vconn->_vintf->intf.kobj);
+ vbus_memctx_put(_vconn->ctx);
+
+ kfree(_vconn);
+}
+
+static struct vbus_connection_ops _virtio_connection_ops = {
+ .call = _virtio_connection_call,
+ .shm = _virtio_connection_shm,
+ .release = _virtio_connection_release,
+};
+
+static int
+_virtio_intf_open(struct vbus_device_interface *intf,
+ struct vbus_memctx *ctx,
+ int version,
+ struct vbus_connection **conn)
+{
+ struct _virtio_device_interface *_vintf = to_vintf(intf);
+ struct _virtio_connection *_vconn;
+ int ret;
+
+ if (version != VIRTIO_VBUS_ABI_VERSION)
+ return -EINVAL;
+
+ _vconn = kzalloc(sizeof(*_vconn), GFP_KERNEL);
+ if (!_vconn)
+ return -ENOMEM;
+
+ vbus_connection_init(&_vconn->conn, &_virtio_connection_ops);
+ _vconn->_vintf = _vintf;
+ _vconn->ctx = ctx;
+ INIT_LIST_HEAD(&_vconn->queues);
+
+ vbus_memctx_get(ctx);
+ kobject_get(&intf->kobj);
+
+ *conn = &_vconn->conn;
+
+ return 0;
+}
+
+static void
+_virtio_intf_release(struct vbus_device_interface *intf)
+{
+ struct _virtio_device_interface *_vintf = to_vintf(intf);
+ struct virtio_device_interface *vintf = _vintf->vintf;
+
+ if (vintf && vintf->ops->release)
+ vintf->ops->release(vintf);
+ kfree(_vintf);
+}
+
+static struct vbus_device_interface_ops _virtio_device_interface_ops = {
+ .open = _virtio_intf_open,
+ .release = _virtio_intf_release,
+};
+
+int
+virtio_device_interface_register(struct vbus_device *dev,
+ struct vbus *vbus,
+ struct virtio_device_interface *vintf)
+{
+ struct _virtio_device_interface *_vintf;
+ struct vbus_device_interface *intf;
+
+ _vintf = kzalloc(sizeof(*_vintf), GFP_KERNEL);
+ if (!_vintf)
+ return -ENOMEM;
+
+ _vintf->vintf = vintf;
+
+ intf = &_vintf->intf;
+
+ intf->name = "0"; /* FIXME */
+ intf->type = "virtio";
+ intf->ops = &_virtio_device_interface_ops;
+
+ return vbus_device_interface_register(dev, vbus, intf);
+}
+EXPORT_SYMBOL_GPL(virtio_device_interface_register);
+
+int
+virtio_device_interface_unregister(struct virtio_device_interface *intf)
+{
+ return vbus_device_interface_unregister(intf->parent);
+}
+EXPORT_SYMBOL_GPL(virtio_device_interface_unregister);
This adds a driver to interface between the host VBUS support and the
guest-vbus bus model.
Signed-off-by: Gregory Haskins <[email protected]>
---
arch/x86/Kconfig | 9 +
drivers/Makefile | 1
drivers/vbus/proxy/Makefile | 2
drivers/vbus/proxy/kvm.c | 726 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 738 insertions(+), 0 deletions(-)
create mode 100644 drivers/vbus/proxy/Makefile
create mode 100644 drivers/vbus/proxy/kvm.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 91fefd5..8661495 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -451,6 +451,15 @@ config KVM_GUEST_DYNIRQ
depends on KVM_GUEST
default y
+config KVM_GUEST_VBUS
+ tristate "KVM virtual-bus (VBUS) guest-side support"
+ depends on KVM_GUEST
+ select VBUS_DRIVERS
+ default y
+ ---help---
+ This option enables guest-side support for accessing virtual-bus
+ devices.
+
source "arch/x86/lguest/Kconfig"
config PARAVIRT
diff --git a/drivers/Makefile b/drivers/Makefile
index 98fab51..4f2cb93 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_VIRTIO) += virtio/
obj-$(CONFIG_STAGING) += staging/
obj-y += platform/
obj-$(CONFIG_VBUS_DEVICES) += vbus/devices/
+obj-$(CONFIG_VBUS_DRIVERS) += vbus/proxy/
diff --git a/drivers/vbus/proxy/Makefile b/drivers/vbus/proxy/Makefile
new file mode 100644
index 0000000..c18d58d
--- /dev/null
+++ b/drivers/vbus/proxy/Makefile
@@ -0,0 +1,2 @@
+kvm-guest-vbus-objs += kvm.o
+obj-$(CONFIG_KVM_GUEST_VBUS) += kvm-guest-vbus.o
diff --git a/drivers/vbus/proxy/kvm.c b/drivers/vbus/proxy/kvm.c
new file mode 100644
index 0000000..82e28b4
--- /dev/null
+++ b/drivers/vbus/proxy/kvm.c
@@ -0,0 +1,726 @@
+/*
+ * Copyright (C) 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus.h>
+#include <linux/kvm_para.h>
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/ioq.h>
+#include <linux/interrupt.h>
+#include <linux/kvm_para.h>
+#include <linux/kvm_guest.h>
+#include <linux/vbus_client.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
+
+static int kvm_vbus_hypercall(unsigned long nr, void *data, unsigned long len)
+{
+ return kvm_hypercall3(KVM_HC_VBUS, nr, __pa(data), len);
+}
+
+struct kvm_vbus {
+ spinlock_t lock;
+ struct ioq eventq;
+ struct kvm_vbus_event *ring;
+ int irq;
+};
+
+static struct kvm_vbus kvm_vbus;
+
+struct kvm_vbus_device {
+ char type[VBUS_MAX_DEVTYPE_LEN];
+ u64 handle;
+ struct list_head shms;
+ struct vbus_device_proxy vdev;
+};
+
+/*
+ * -------------------
+ * common routines
+ * -------------------
+ */
+
+struct kvm_vbus_device *
+to_dev(struct vbus_device_proxy *vdev)
+{
+ return container_of(vdev, struct kvm_vbus_device, vdev);
+}
+
+static void
+_signal_init(struct shm_signal *signal, struct shm_signal_desc *desc,
+ struct shm_signal_ops *ops)
+{
+ desc->magic = SHM_SIGNAL_MAGIC;
+ desc->ver = SHM_SIGNAL_VER;
+
+ shm_signal_init(signal);
+
+ signal->locale = shm_locality_north;
+ signal->ops = ops;
+ signal->desc = desc;
+}
+
+/*
+ * -------------------
+ * _signal
+ * -------------------
+ */
+
+struct _signal {
+ struct kvm_vbus *kvbus;
+ struct shm_signal signal;
+ u64 handle;
+ struct rb_node node;
+ struct list_head list;
+};
+
+static struct _signal *
+to_signal(struct shm_signal *signal)
+{
+ return container_of(signal, struct _signal, signal);
+}
+
+static struct _signal *
+node_to_signal(struct rb_node *node)
+{
+ return container_of(node, struct _signal, node);
+}
+
+static int
+_signal_inject(struct shm_signal *signal)
+{
+ struct _signal *_signal = to_signal(signal);
+
+ kvm_vbus_hypercall(KVM_VBUS_OP_SHMSIGNAL,
+ &_signal->handle, sizeof(_signal->handle));
+
+ return 0;
+}
+
+static void
+_signal_release(struct shm_signal *signal)
+{
+ struct _signal *_signal = to_signal(signal);
+
+ kfree(_signal);
+}
+
+static struct shm_signal_ops _signal_ops = {
+ .inject = _signal_inject,
+ .release = _signal_release,
+};
+
+/*
+ * -------------------
+ * vbus_device_proxy routines
+ * -------------------
+ */
+
+static int
+kvm_vbus_device_open(struct vbus_device_proxy *vdev, int version, int flags)
+{
+ struct kvm_vbus_device *dev = to_dev(vdev);
+ struct vbus_deviceopen params;
+ int ret;
+
+ if (dev->handle)
+ return -EINVAL;
+
+ params.devid = vdev->id;
+ params.version = version;
+
+ ret = kvm_vbus_hypercall(KVM_VBUS_OP_DEVOPEN,
+ &params, sizeof(params));
+ if (ret < 0)
+ return ret;
+
+ dev->handle = params.handle;
+
+ return 0;
+}
+
+static int
+kvm_vbus_device_close(struct vbus_device_proxy *vdev, int flags)
+{
+ struct kvm_vbus_device *dev = to_dev(vdev);
+ unsigned long iflags;
+ int ret;
+
+ if (!dev->handle)
+ return -EINVAL;
+
+ spin_lock_irqsave(&kvm_vbus.lock, iflags);
+
+ while (!list_empty(&dev->shms)) {
+ struct _signal *_signal;
+
+ _signal = list_first_entry(&dev->shms, struct _signal, list);
+
+ list_del(&_signal->list);
+
+ spin_unlock_irqrestore(&kvm_vbus.lock, iflags);
+ shm_signal_put(&_signal->signal);
+ spin_lock_irqsave(&kvm_vbus.lock, iflags);
+ }
+
+ spin_unlock_irqrestore(&kvm_vbus.lock, iflags);
+
+ /*
+ * The DEVICECLOSE will implicitly close all of the shm on the
+ * host-side, so there is no need to do an explicit per-shm
+ * hypercall
+ */
+ ret = kvm_vbus_hypercall(KVM_VBUS_OP_DEVCLOSE,
+ &dev->handle, sizeof(dev->handle));
+
+ if (ret < 0)
+ printk(KERN_ERR "KVM-VBUS: Error closing device %s/%lld: %d\n",
+ vdev->type, vdev->id, ret);
+
+ dev->handle = 0;
+
+ return 0;
+}
+
+static int
+kvm_vbus_device_shm(struct vbus_device_proxy *vdev, int id, int prio,
+ void *ptr, size_t len,
+ struct shm_signal_desc *sdesc, struct shm_signal **signal,
+ int flags)
+{
+ struct kvm_vbus_device *dev = to_dev(vdev);
+ struct _signal *_signal = NULL;
+ struct vbus_deviceshm params;
+ unsigned long iflags;
+ int ret;
+
+ if (!dev->handle)
+ return -EINVAL;
+
+ params.devh = dev->handle;
+ params.id = id;
+ params.flags = flags;
+ params.datap = (u64)__pa(ptr);
+ params.len = len;
+
+ if (signal) {
+ /*
+ * The signal descriptor must be embedded within the
+ * provided ptr
+ */
+ if (!sdesc
+ || (len < sizeof(*sdesc))
+ || ((void *)sdesc < ptr)
+ || ((void *)sdesc > (ptr + len - sizeof(*sdesc))))
+ return -EINVAL;
+
+ _signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+ if (!_signal)
+ return -ENOMEM;
+
+ _signal_init(&_signal->signal, sdesc, &_signal_ops);
+
+ /*
+ * take another reference for the host. This is dropped
+ * by a SHMCLOSE event
+ */
+ shm_signal_get(&_signal->signal);
+
+ params.signal.offset = (u64)sdesc - (u64)ptr;
+ params.signal.prio = prio;
+ params.signal.cookie = (u64)_signal;
+
+ } else
+ params.signal.offset = -1; /* yes, this is a u32, but it's ok */
+
+ ret = kvm_vbus_hypercall(KVM_VBUS_OP_DEVSHM,
+ &params, sizeof(params));
+ if (ret < 0) {
+ if (_signal) {
+ /*
+ * We held two references above, so we need to drop
+ * both of them
+ */
+ shm_signal_put(&_signal->signal);
+ shm_signal_put(&_signal->signal);
+ }
+
+ return ret;
+ }
+
+ if (signal) {
+ _signal->handle = params.handle;
+
+ spin_lock_irqsave(&kvm_vbus.lock, iflags);
+
+ list_add_tail(&_signal->list, &dev->shms);
+
+ spin_unlock_irqrestore(&kvm_vbus.lock, iflags);
+
+ shm_signal_get(&_signal->signal);
+ *signal = &_signal->signal;
+ }
+
+ return 0;
+}
+
+static int
+kvm_vbus_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
+ size_t len, int flags)
+{
+ struct kvm_vbus_device *dev = to_dev(vdev);
+ struct vbus_devicecall params = {
+ .devh = dev->handle,
+ .func = func,
+ .datap = (u64)__pa(data),
+ .len = len,
+ .flags = flags,
+ };
+
+ if (!dev->handle)
+ return -EINVAL;
+
+ return kvm_vbus_hypercall(KVM_VBUS_OP_DEVCALL, &params, sizeof(params));
+}
+
+static void
+kvm_vbus_device_release(struct vbus_device_proxy *vdev)
+{
+ struct kvm_vbus_device *_dev = to_dev(vdev);
+
+ kvm_vbus_device_close(vdev, 0);
+
+ kfree(_dev);
+}
+
+struct vbus_device_proxy_ops kvm_vbus_device_ops = {
+ .open = kvm_vbus_device_open,
+ .close = kvm_vbus_device_close,
+ .shm = kvm_vbus_device_shm,
+ .call = kvm_vbus_device_call,
+ .release = kvm_vbus_device_release,
+};
+
+/*
+ * -------------------
+ * vbus events
+ * -------------------
+ */
+
+static void
+event_devadd(struct kvm_vbus_add_event *event)
+{
+ int ret;
+ struct kvm_vbus_device *new = kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new) {
+ printk(KERN_ERR "KVM_VBUS: Out of memory on add_event\n");
+ return;
+ }
+
+ INIT_LIST_HEAD(&new->shms);
+
+ memcpy(new->type, event->type, VBUS_MAX_DEVTYPE_LEN);
+ new->vdev.type = new->type;
+ new->vdev.id = event->id;
+ new->vdev.ops = &kvm_vbus_device_ops;
+
+ sprintf(new->vdev.dev.bus_id, "%lld", event->id);
+
+ ret = vbus_device_proxy_register(&new->vdev);
+ if (ret < 0)
+ panic("failed to register device %lld(%s): %d\n",
+ event->id, event->type, ret);
+}
+
+static void
+event_devdrop(struct kvm_vbus_handle_event *event)
+{
+ struct vbus_device_proxy *dev = vbus_device_proxy_find(event->handle);
+
+ if (!dev) {
+ printk(KERN_WARNING "KVM-VBUS: devdrop failed: %lld\n",
+ event->handle);
+ return;
+ }
+
+ vbus_device_proxy_unregister(dev);
+}
+
+static void
+event_shmsignal(struct kvm_vbus_handle_event *event)
+{
+ struct _signal *_signal = (struct _signal *)event->handle;
+
+ _shm_signal_wakeup(&_signal->signal);
+}
+
+static void
+event_shmclose(struct kvm_vbus_handle_event *event)
+{
+ struct _signal *_signal = (struct _signal *)event->handle;
+
+ /*
+ * This reference was taken during the DEVICESHM call
+ */
+ shm_signal_put(&_signal->signal);
+}
+
+/*
+ * -------------------
+ * eventq routines
+ * -------------------
+ */
+
+static struct ioq_notifier eventq_notifier;
+
+static int __init
+eventq_init(int qlen)
+{
+ struct ioq_iterator iter;
+ int ret;
+ int i;
+
+ kvm_vbus.ring = kzalloc(sizeof(struct kvm_vbus_event) * qlen,
+ GFP_KERNEL);
+ if (!kvm_vbus.ring)
+ return -ENOMEM;
+
+ /*
+ * We want to iterate on the "valid" index. By default the iterator
+ * will not "autoupdate" which means it will not hypercall the host
+ * with our changes. This is good, because we are really just
+ * initializing stuff here anyway. Note that you can always manually
+ * signal the host with ioq_signal() if the autoupdate feature is not
+ * used.
+ */
+ ret = ioq_iter_init(&kvm_vbus.eventq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Seek to the tail of the valid index (which should be our first
+ * item since the queue is brand-new)
+ */
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Now populate each descriptor with an empty vbus_event and mark it
+ * valid
+ */
+ for (i = 0; i < qlen; i++) {
+ struct kvm_vbus_event *event = &kvm_vbus.ring[i];
+ size_t len = sizeof(*event);
+ struct ioq_ring_desc *desc = iter.desc;
+
+ BUG_ON(iter.desc->valid);
+
+ desc->cookie = (u64)event;
+ desc->ptr = (u64)__pa(event);
+ desc->len = len; /* total length */
+ desc->valid = 1;
+
+ /*
+ * This push operation will simultaneously advance the
+ * valid-tail index and increment our position in the queue
+ * by one.
+ */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ kvm_vbus.eventq.notifier = &eventq_notifier;
+
+ /*
+ * And finally, ensure that we can receive notification
+ */
+ ioq_notify_enable(&kvm_vbus.eventq, 0);
+
+ return 0;
+}
+
+/* Invoked whenever the hypervisor ioq_signal()s our eventq */
+static void
+eventq_wakeup(struct ioq_notifier *notifier)
+{
+ struct ioq_iterator iter;
+ int ret;
+
+ /* We want to iterate on the head of the in-use index */
+ ret = ioq_iter_init(&kvm_vbus.eventq, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * The EOM is indicated by finding a packet that is still owned by
+ * the south side.
+ *
+ * FIXME: This in theory could run indefinitely if the host keeps
+ * feeding us events since there is nothing like a NAPI budget. We
+ * might need to address that
+ */
+ while (!iter.desc->sown) {
+ struct ioq_ring_desc *desc = iter.desc;
+ struct kvm_vbus_event *event;
+
+ event = (struct kvm_vbus_event *)desc->cookie;
+
+ switch (event->eventid) {
+ case KVM_VBUS_EVENT_DEVADD:
+ event_devadd(&event->data.add);
+ break;
+ case KVM_VBUS_EVENT_DEVDROP:
+ event_devdrop(&event->data.handle);
+ break;
+ case KVM_VBUS_EVENT_SHMSIGNAL:
+ event_shmsignal(&event->data.handle);
+ break;
+ case KVM_VBUS_EVENT_SHMCLOSE:
+ event_shmclose(&event->data.handle);
+ break;
+ default:
+ printk(KERN_WARNING "KVM_VBUS: Unexpected event %d\n",
+ event->eventid);
+ break;
+ }
+
+ memset(event, 0, sizeof(*event));
+
+ /* Advance the in-use head */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ /* And let the south side know that we changed the queue */
+ ioq_signal(&kvm_vbus.eventq, 0);
+}
+
+static struct ioq_notifier eventq_notifier = {
+ .signal = &eventq_wakeup,
+};
+
+/* Injected whenever the host issues an ioq_signal() on the eventq */
+irqreturn_t
+eventq_intr(int irq, void *dev)
+{
+ _shm_signal_wakeup(kvm_vbus.eventq.signal);
+
+ return IRQ_HANDLED;
+}
+
+/*
+ * -------------------
+ */
+
+static int
+eventq_signal_inject(struct shm_signal *signal)
+{
+ u64 handle = 0; /* The eventq uses the special-case handle=0 */
+
+ kvm_vbus_hypercall(KVM_VBUS_OP_SHMSIGNAL, &handle, sizeof(handle));
+
+ return 0;
+}
+
+static void
+eventq_signal_release(struct shm_signal *signal)
+{
+ kfree(signal);
+}
+
+static struct shm_signal_ops eventq_signal_ops = {
+ .inject = eventq_signal_inject,
+ .release = eventq_signal_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+eventq_ioq_release(struct ioq *ioq)
+{
+ /* released as part of the kvm_vbus object */
+}
+
+static struct ioq_ops eventq_ioq_ops = {
+ .release = eventq_ioq_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+kvm_vbus_release(void)
+{
+ if (kvm_vbus.irq > 0) {
+ free_irq(kvm_vbus.irq, NULL);
+ destroy_kvm_dynirq(kvm_vbus.irq);
+ }
+
+ kfree(kvm_vbus.eventq.head_desc);
+ kfree(kvm_vbus.ring);
+}
+
+static int __init
+kvm_vbus_open(void)
+{
+ struct kvm_vbus_busopen params = {
+ .magic = KVM_VBUS_MAGIC,
+ .version = KVM_VBUS_VERSION,
+ .capabilities = 0,
+ };
+
+ return kvm_vbus_hypercall(KVM_VBUS_OP_BUSOPEN, &params, sizeof(params));
+}
+
+#define QLEN 1024
+
+static int __init
+kvm_vbus_register(void)
+{
+ struct kvm_vbus_busreg params = {
+ .count = 1,
+ .eventq = {
+ {
+ .irq = kvm_vbus.irq,
+ .count = QLEN,
+ .ring = (u64)__pa(kvm_vbus.eventq.head_desc),
+ .data = (u64)__pa(kvm_vbus.ring),
+ },
+ },
+ };
+
+ return kvm_vbus_hypercall(KVM_VBUS_OP_BUSREG, &params, sizeof(params));
+}
+
+static int __init
+_ioq_init(size_t ringsize, struct ioq *ioq, struct ioq_ops *ops)
+{
+ struct shm_signal *signal = NULL;
+ struct ioq_ring_head *head = NULL;
+ size_t len = IOQ_HEAD_DESC_SIZE(ringsize);
+
+ head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+ if (!head)
+ return -ENOMEM;
+
+ signal = kzalloc(sizeof(*signal), GFP_KERNEL);
+ if (!signal) {
+ kfree(head);
+ return -ENOMEM;
+ }
+
+ head->magic = IOQ_RING_MAGIC;
+ head->ver = IOQ_RING_VER;
+ head->count = ringsize;
+
+ _signal_init(signal, &head->signal, &eventq_signal_ops);
+
+ ioq_init(ioq, ops, ioq_locality_north, head, signal, ringsize);
+
+ return 0;
+}
+
+int __init
+kvm_vbus_init(void)
+{
+ int ret;
+
+ memset(&kvm_vbus, 0, sizeof(kvm_vbus));
+
+ ret = kvm_para_has_feature(KVM_FEATURE_VBUS);
+ if (!ret)
+ return -ENOENT;
+
+ ret = kvm_vbus_open();
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_VBUS: Could not register with host: %d\n",
+ ret);
+ goto out_fail;
+ }
+
+ spin_lock_init(&kvm_vbus.lock);
+
+ /*
+ * Allocate an IOQ to use for host-2-guest event notification
+ */
+ ret = _ioq_init(QLEN, &kvm_vbus.eventq, &eventq_ioq_ops);
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_VBUS: Cound not init eventq\n");
+ goto out_fail;
+ }
+
+ ret = eventq_init(QLEN);
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_VBUS: Cound not setup ring\n");
+ goto out_fail;
+ }
+
+ /*
+ * Dynamically assign a free IRQ to this resource
+ */
+ kvm_vbus.irq = create_kvm_dynirq(0);
+ if (kvm_vbus.irq < 0) {
+ printk(KERN_ERR "KVM_VBUS: Failed to create IRQ\n");
+ goto out_fail;
+ }
+
+ ret = request_irq(kvm_vbus.irq, eventq_intr, 0, "vbus", NULL);
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_VBUS: Failed to register IRQ %d\n: %d",
+ kvm_vbus.irq, ret);
+ goto out_fail;
+ }
+
+ /*
+ * Finally register our queue on the host to start receiving events
+ */
+ ret = kvm_vbus_register();
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_VBUS: Could not register with host: %d\n",
+ ret);
+ goto out_fail;
+ }
+
+ return 0;
+
+ out_fail:
+ kvm_vbus_release();
+
+ return ret;
+}
+
+static void __exit
+kvm_vbus_exit(void)
+{
+ kvm_vbus_release();
+}
+
+module_init(kvm_vbus_init);
+module_exit(kvm_vbus_exit);
+
The ioapic code currently privately manages the mapping between irq
and vector. This results in some layering violations, as the support
for certain MSI operations needs this info. As a result, the MSI
code itself was moved to the ioapic module. This is not really
optimal.
We now have another need to gain access to the vector assignment on
x86. However, rather than put yet another inappropriately placed
function into io-apic, let's create a way to export this simple data
and thereby allow the logic to sit closer to where it belongs.
Ideally we should abstract the entire notion of irq->vector management
out of io-apic, but we leave that as an exercise for another day.
Signed-off-by: Gregory Haskins <[email protected]>
---
arch/x86/include/asm/irq.h | 6 ++++++
arch/x86/kernel/io_apic.c | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 592688e..b1726d8 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -40,6 +40,12 @@ extern unsigned int do_IRQ(struct pt_regs *regs);
extern void init_IRQ(void);
extern void native_init_IRQ(void);
+#ifdef CONFIG_SMP
+extern int set_irq_affinity(int irq, cpumask_t mask);
+#endif
+
+extern int irq_to_vector(int irq);
+
/* Interrupt vector management */
extern DECLARE_BITMAP(used_vectors, NR_VECTORS);
extern int vector_used_by_percpu_irq(unsigned int vector);
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
index bc7ac4d..86a2c36 100644
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -614,6 +614,14 @@ set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask)
set_ioapic_affinity_irq_desc(desc, mask);
}
+
+int set_irq_affinity(int irq, cpumask_t mask)
+{
+ set_ioapic_affinity_irq(irq, &mask);
+
+ return 0;
+}
+
#endif /* CONFIG_SMP */
/*
@@ -3249,6 +3257,23 @@ void destroy_irq(unsigned int irq)
spin_unlock_irqrestore(&vector_lock, flags);
}
+int irq_to_vector(int irq)
+{
+ struct irq_cfg *cfg;
+ unsigned long flags;
+ int ret = -ENOENT;
+
+ spin_lock_irqsave(&vector_lock, flags);
+
+ cfg = irq_cfg(irq);
+ if (cfg && cfg->vector != 0)
+ ret = cfg->vector;
+
+ spin_unlock_irqrestore(&vector_lock, flags);
+
+ return ret;
+}
+
/*
* MSI message composition
*/
This module is similar in concept to a "tuntap". A tuntap module provides
a netif() interface on one side and a char-dev interface on the other.
Packets that ingress on one interface egress on the other (and vice versa).
This module offers a similar concept, except that it substitutes the
char-dev for a VBUS/IOQ interface. This allows a VBUS-compatible entity
(e.g. userspace or a guest) to directly inject and receive packets
from the host/kernel stack.
Thanks to Pat Mullaney for contributing the maxcount modification
Signed-off-by: Gregory Haskins <[email protected]>
---
drivers/Makefile | 1
drivers/vbus/devices/Kconfig | 17
drivers/vbus/devices/Makefile | 1
drivers/vbus/devices/venet-tap.c | 1388 ++++++++++++++++++++++++++++++++++++++
kernel/vbus/Kconfig | 13
5 files changed, 1420 insertions(+), 0 deletions(-)
create mode 100644 drivers/vbus/devices/Kconfig
create mode 100644 drivers/vbus/devices/Makefile
create mode 100644 drivers/vbus/devices/venet-tap.c
diff --git a/drivers/Makefile b/drivers/Makefile
index c1bf417..98fab51 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -106,3 +106,4 @@ obj-$(CONFIG_SSB) += ssb/
obj-$(CONFIG_VIRTIO) += virtio/
obj-$(CONFIG_STAGING) += staging/
obj-y += platform/
+obj-$(CONFIG_VBUS_DEVICES) += vbus/devices/
diff --git a/drivers/vbus/devices/Kconfig b/drivers/vbus/devices/Kconfig
new file mode 100644
index 0000000..64e4731
--- /dev/null
+++ b/drivers/vbus/devices/Kconfig
@@ -0,0 +1,17 @@
+#
+# Virtual-Bus (VBus) configuration
+#
+
+config VBUS_VENETTAP
+ tristate "Virtual-Bus Ethernet Tap Device"
+ depends on VBUS_DEVICES
+ default n
+ help
+ Provides a virtual ethernet adapter to a vbus, which in turn
+ manifests itself as a standard netif based adapter to the
+ kernel. It can be used similarly to a "tuntap" device,
+ except that the char-dev transport is replaced with a vbus/ioq
+ interface.
+
+ If unsure, say N.
+
diff --git a/drivers/vbus/devices/Makefile b/drivers/vbus/devices/Makefile
new file mode 100644
index 0000000..2ea7d2a
--- /dev/null
+++ b/drivers/vbus/devices/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VBUS_VENETTAP) += venet-tap.o
diff --git a/drivers/vbus/devices/venet-tap.c b/drivers/vbus/devices/venet-tap.c
new file mode 100644
index 0000000..148e2c8
--- /dev/null
+++ b/drivers/vbus/devices/venet-tap.c
@@ -0,0 +1,1388 @@
+/*
+ * venettap - A 802.x virtual network device based on the VBUS/IOQ interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <[email protected]>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+#include <linux/wait.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus.h>
+#include <linux/freezer.h>
+#include <linux/kthread.h>
+
+#include <linux/venet.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#undef PDEBUG /* undef it, just in case */
+#ifdef VENETTAP_DEBUG
+# define PDEBUG(fmt, args...) printk(KERN_DEBUG "venet-tap: " fmt, ## args)
+#else
+# define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+static int maxcount = 2048;
+module_param(maxcount, int, 0600);
+MODULE_PARM_DESC(maxcount, "maximum size for rx/tx ioq ring");
+
+static void venettap_tx_isr(struct ioq_notifier *notifier);
+static int venettap_rx_thread(void *__priv);
+static int venettap_tx_thread(void *__priv);
+
+struct venettap_queue {
+ struct ioq *queue;
+ struct ioq_notifier notifier;
+};
+
+struct venettap;
+
+enum {
+ RX_SCHED,
+ TX_SCHED,
+ TX_NETIF_CONGESTED,
+ TX_IOQ_CONGESTED,
+};
+
+struct venettap {
+ spinlock_t lock;
+ unsigned char hmac[ETH_ALEN]; /* host-mac */
+ unsigned char cmac[ETH_ALEN]; /* client-mac */
+ struct task_struct *rxthread;
+ struct task_struct *txthread;
+ unsigned long flags;
+
+ struct {
+ struct net_device *dev;
+ struct net_device_stats stats;
+ struct {
+ struct sk_buff_head list;
+ size_t len;
+ int irqdepth;
+ } txq;
+ int enabled:1;
+ int link:1;
+ } netif;
+
+ struct {
+ struct vbus_device dev;
+ struct vbus_device_interface intf;
+ struct vbus_connection conn;
+ struct vbus_memctx *ctx;
+ struct venettap_queue rxq;
+ struct venettap_queue txq;
+ wait_queue_head_t rx_empty;
+ int connected:1;
+ int opened:1;
+ int link:1;
+ } vbus;
+};
+
+static int
+venettap_queue_init(struct venettap_queue *q,
+ struct vbus_shm *shm,
+ struct shm_signal *signal,
+ void (*func)(struct ioq_notifier *))
+{
+ struct ioq *ioq;
+ int ret;
+
+ if (q->queue)
+ return -EEXIST;
+
+ /* FIXME: make maxcount a tunable */
+ ret = vbus_shm_ioq_attach(shm, signal, maxcount, &ioq);
+ if (ret < 0)
+ return ret;
+
+ q->queue = ioq;
+ ioq_get(ioq);
+
+ if (func) {
+ q->notifier.signal = func;
+ q->queue->notifier = &q->notifier;
+ }
+
+ return 0;
+}
+
+static void
+venettap_queue_release(struct venettap_queue *q)
+{
+ if (!q->queue)
+ return;
+
+ ioq_put(q->queue);
+ q->queue = NULL;
+}
+
+/* Assumes priv->lock is held */
+static void
+venettap_txq_notify_inc(struct venettap *priv)
+{
+ priv->netif.txq.irqdepth++;
+ if (priv->netif.txq.irqdepth == 1 && priv->vbus.link)
+ ioq_notify_enable(priv->vbus.txq.queue, 0);
+}
+
+/* Assumes priv->lock is held */
+static void
+venettap_txq_notify_dec(struct venettap *priv)
+{
+ BUG_ON(!priv->netif.txq.irqdepth);
+ priv->netif.txq.irqdepth--;
+ if (!priv->netif.txq.irqdepth && priv->vbus.link)
+ ioq_notify_disable(priv->vbus.txq.queue, 0);
+}
+
+/*
+ *----------------------------------------------------------------------
+ * netif link
+ *----------------------------------------------------------------------
+ */
+
+static struct venettap *conn_to_priv(struct vbus_connection *conn)
+{
+ return container_of(conn, struct venettap, vbus.conn);
+}
+
+static struct venettap *intf_to_priv(struct vbus_device_interface *intf)
+{
+ return container_of(intf, struct venettap, vbus.intf);
+}
+
+static struct venettap *vdev_to_priv(struct vbus_device *vdev)
+{
+ return container_of(vdev, struct venettap, vbus.dev);
+}
+
+static int
+venettap_netdev_open(struct net_device *dev)
+{
+ struct venettap *priv = netdev_priv(dev);
+ unsigned long flags;
+
+ BUG_ON(priv->netif.link);
+
+ /*
+ * We need rx-polling to be done in process context, and we want
+ * ingress processing to occur independent of the producer thread
+ * to maximize multi-core distribution. Since the built-in NAPI uses a
+ * softirq, we cannot guarantee it won't call us back in interrupt
+ * context, so we can't use it. And both work-queue and softirq
+ * solutions would tend to process requests on the same CPU as the
+ * producer. Therefore, we create a special thread to handle ingress.
+ *
+ * The downside to this type of approach is that we may still need to
+ * ctx-switch to the NAPI polling thread (presumably running on the same
+ * core as the rx-thread) by virtue of the netif_rx() backlog mechanism.
+ * However, this can be mitigated by the use of netif_rx_ni().
+ */
+ priv->rxthread = kthread_create(venettap_rx_thread, priv,
+ "%s-rx", priv->netif.dev->name);
+
+ priv->txthread = kthread_create(venettap_tx_thread, priv,
+ "%s-tx", priv->netif.dev->name);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ priv->netif.link = true;
+
+ if (!priv->vbus.link)
+ netif_carrier_off(dev);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return 0;
+}
+
+static int
+venettap_netdev_stop(struct net_device *dev)
+{
+ struct venettap *priv = netdev_priv(dev);
+ unsigned long flags;
+ int needs_stop = false;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (priv->netif.link) {
+ needs_stop = true;
+ priv->netif.link = false;
+ }
+
+ /* FIXME: free priv->netif.txq */
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ if (needs_stop) {
+ kthread_stop(priv->rxthread);
+ priv->rxthread = NULL;
+
+ kthread_stop(priv->txthread);
+ priv->txthread = NULL;
+ }
+
+ return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+venettap_netdev_config(struct net_device *dev, struct ifmap *map)
+{
+ if (dev->flags & IFF_UP) /* can't act on a running interface */
+ return -EBUSY;
+
+ /* Don't allow changing the I/O address */
+ if (map->base_addr != dev->base_addr) {
+ printk(KERN_WARNING "venettap: Can't change I/O address\n");
+ return -EOPNOTSUPP;
+ }
+
+ /* ignore other fields */
+ return 0;
+}
+
+static int
+venettap_change_mtu(struct net_device *dev, int new_mtu)
+{
+ dev->mtu = new_mtu;
+
+ return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+venettap_rx(struct venettap *priv)
+{
+ struct ioq *ioq;
+ struct vbus_memctx *ctx;
+ int npackets = 0;
+ int dirty = 0;
+ struct ioq_iterator iter;
+ int ret;
+ unsigned long flags;
+ struct vbus_connection *conn;
+
+ PDEBUG("polling...\n");
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (!priv->vbus.link) {
+ spin_unlock_irqrestore(&priv->lock, flags);
+ return 0;
+ }
+
+ /*
+ * We take a reference to the connection object to ensure that the
+ * ioq/ctx references do not disappear out from under us. We could
+ * accomplish the same thing more directly by acquiring a reference
+ * to the ioq and ctx explicitly, but this would require an extra
+ * atomic_inc+dec pair, for no additional benefit
+ */
+ conn = &priv->vbus.conn;
+ vbus_connection_get(conn);
+
+ ioq = priv->vbus.rxq.queue;
+ ctx = priv->vbus.ctx;
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ /* We want to iterate on the head of the in-use index */
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * The EOM is indicated by finding a packet that is still owned by
+ * the north side
+ */
+ while (iter.desc->sown) {
+ size_t len = iter.desc->len;
+ size_t maxlen = priv->netif.dev->mtu + ETH_HLEN;
+ struct sk_buff *skb = NULL;
+
+ if (unlikely(len > maxlen)) {
+ priv->netif.stats.rx_errors++;
+ priv->netif.stats.rx_length_errors++;
+ goto next;
+ }
+
+ skb = dev_alloc_skb(len+2);
+ if (unlikely(!skb)) {
+ printk(KERN_INFO "VENETTAP: skb alloc failed:" \
+ " memory squeeze.\n");
+ priv->netif.stats.rx_errors++;
+ priv->netif.stats.rx_dropped++;
+ goto next;
+ }
+
+ /* align IP on 16B boundary */
+ skb_reserve(skb, 2);
+
+ ret = ctx->ops->copy_from(ctx, skb->data,
+ (void *)iter.desc->ptr,
+ len);
+ if (unlikely(ret)) {
+ priv->netif.stats.rx_errors++;
+ goto next;
+ }
+
+ /* Maintain stats */
+ npackets++;
+ priv->netif.stats.rx_packets++;
+ priv->netif.stats.rx_bytes += len;
+
+ /* Pass the buffer up to the stack */
+ skb->dev = priv->netif.dev;
+ skb->protocol = eth_type_trans(skb, priv->netif.dev);
+
+ netif_rx_ni(skb);
+next:
+ dirty = 1;
+
+ /* Advance the in-use head */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+
+ /* send up to N packets before sending tx-complete */
+ if (!(npackets % 10)) {
+ ioq_signal(ioq, 0);
+ dirty = 0;
+ }
+
+ }
+
+ PDEBUG("poll: %d packets received\n", npackets);
+
+ if (dirty)
+ ioq_signal(ioq, 0);
+
+ /*
+ * If we processed all packets we're done, so reenable ints
+ */
+ if (ioq_empty(ioq, ioq_idxtype_inuse)) {
+ clear_bit(RX_SCHED, &priv->flags);
+ ioq_notify_enable(ioq, 0);
+ wake_up(&priv->vbus.rx_empty);
+ }
+
+ vbus_connection_put(conn);
+
+ return 0;
+}
+
+static int venettap_rx_thread(void *__priv)
+{
+ struct venettap *priv = __priv;
+
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!freezing(current) &&
+ !kthread_should_stop() &&
+ !test_bit(RX_SCHED, &priv->flags))
+ schedule();
+ set_current_state(TASK_RUNNING);
+
+ try_to_freeze();
+
+ if (kthread_should_stop())
+ break;
+
+ venettap_rx(priv);
+ }
+
+ return 0;
+}
+
+/* assumes priv->lock is held */
+static void
+venettap_check_netif_congestion(struct venettap *priv)
+{
+ struct ioq *ioq = priv->vbus.txq.queue;
+
+ if (priv->vbus.link
+ && priv->netif.txq.len < ioq_remain(ioq, ioq_idxtype_inuse)
+ && test_and_clear_bit(TX_NETIF_CONGESTED, &priv->flags)) {
+ PDEBUG("NETIF congestion cleared\n");
+ venettap_txq_notify_dec(priv);
+
+ if (priv->netif.link)
+ netif_wake_queue(priv->netif.dev);
+ }
+}
+
+static int
+venettap_tx(struct venettap *priv)
+{
+ struct sk_buff *skb;
+ struct ioq_iterator iter;
+ struct ioq *ioq = NULL;
+ struct vbus_memctx *ctx;
+ int ret;
+ int npackets = 0;
+ unsigned long flags;
+ struct vbus_connection *conn;
+
+ PDEBUG("tx-thread\n");
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (unlikely(!priv->vbus.link)) {
+ spin_unlock_irqrestore(&priv->lock, flags);
+ return 0;
+ }
+
+ /*
+ * We take a reference to the connection object to ensure that the
+ * ioq/ctx references do not disappear out from under us. We could
+ * accomplish the same thing more directly by acquiring a reference
+ * to the ioq and ctx explicitly, but this would require an extra
+ * atomic_inc+dec pair, for no additional benefit
+ */
+ conn = &priv->vbus.conn;
+ vbus_connection_get(conn);
+
+ ioq = priv->vbus.txq.queue;
+ ctx = priv->vbus.ctx;
+
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, IOQ_ITER_AUTOUPDATE);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ while (priv->vbus.link && iter.desc->sown && priv->netif.txq.len) {
+
+ skb = __skb_dequeue(&priv->netif.txq.list);
+ if (!skb)
+ break;
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ PDEBUG("tx-thread: sending %d bytes\n", skb->len);
+
+ if (skb->len <= iter.desc->len) {
+ ret = ctx->ops->copy_to(ctx, (void *)iter.desc->ptr,
+ skb->data, skb->len);
+ BUG_ON(ret);
+
+ iter.desc->len = skb->len;
+
+ npackets++;
+ priv->netif.stats.tx_packets++;
+ priv->netif.stats.tx_bytes += skb->len;
+
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+ } else {
+ printk(KERN_WARNING
+ "VENETTAP: discarding packet: buf too small "
+ "(%u > %llu)\n", skb->len, (unsigned long long)iter.desc->len);
+ priv->netif.stats.tx_errors++;
+ }
+
+ dev_kfree_skb(skb);
+ priv->netif.dev->trans_start = jiffies; /* save the timestamp */
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ priv->netif.txq.len--;
+ }
+
+ PDEBUG("send complete\n");
+
+ if (!priv->vbus.link || !priv->netif.txq.len) {
+ PDEBUG("descheduling TX: link=%d, len=%d\n",
+ priv->vbus.link, priv->netif.txq.len);
+ clear_bit(TX_SCHED, &priv->flags);
+ } else if (!test_and_set_bit(TX_IOQ_CONGESTED, &priv->flags)) {
+ PDEBUG("congested with %d packets still queued\n",
+ priv->netif.txq.len);
+ venettap_txq_notify_inc(priv);
+ }
+
+ venettap_check_netif_congestion(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ vbus_connection_put(conn);
+
+ return npackets;
+}
+
+static int venettap_tx_thread(void *__priv)
+{
+ struct venettap *priv = __priv;
+
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!freezing(current) &&
+ !kthread_should_stop() &&
+ (test_bit(TX_IOQ_CONGESTED, &priv->flags) ||
+ !test_bit(TX_SCHED, &priv->flags)))
+ schedule();
+ set_current_state(TASK_RUNNING);
+
+ PDEBUG("tx wakeup: %s%s%s\n",
+ test_bit(TX_SCHED, &priv->flags) ? "s" : "-",
+ test_bit(TX_IOQ_CONGESTED, &priv->flags) ? "c" : "-",
+ test_bit(TX_NETIF_CONGESTED, &priv->flags) ? "b" : "-"
+ );
+
+ try_to_freeze();
+
+ if (kthread_should_stop())
+ break;
+
+ venettap_tx(priv);
+ }
+
+ return 0;
+}
+
+static void
+venettap_deferred_tx(struct venettap *priv)
+{
+ PDEBUG("wake up txthread\n");
+ wake_up_process(priv->txthread);
+}
+
+/* assumes priv->lock is held */
+static void
+venettap_apply_backpressure(struct venettap *priv)
+{
+ PDEBUG("backpressure\n");
+
+ if (!test_and_set_bit(TX_NETIF_CONGESTED, &priv->flags)) {
+ /*
+ * We must flow-control the kernel by disabling the queue
+ */
+ netif_stop_queue(priv->netif.dev);
+ venettap_txq_notify_inc(priv);
+ }
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ *
+ * We want to perform ctx->copy_to() operations from a sleepable process
+ * context, so we defer the actual tx operations to a thread.
+ * However, we want to be careful that we do not double-buffer the
+ * queue, so we create a buffer whose space dynamically grows and
+ * shrinks with the availability of the actual IOQ. This means that
+ * the netif flow control is still managed by the actual consumer,
+ * thereby avoiding the introduction of a second flow-control feedback loop.
+ */
+static int
+venettap_netdev_tx(struct sk_buff *skb, struct net_device *dev)
+{
+ struct venettap *priv = netdev_priv(dev);
+ struct ioq *ioq = NULL;
+ unsigned long flags;
+
+ PDEBUG("queuing %d bytes\n", skb->len);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ ioq = priv->vbus.txq.queue;
+
+ BUG_ON(test_bit(TX_NETIF_CONGESTED, &priv->flags));
+
+ if (!priv->vbus.link) {
+ /*
+ * We have a link-down condition
+ */
+ printk(KERN_ERR "VENETTAP: tx on link down\n");
+ goto flowcontrol;
+ }
+
+ __skb_queue_tail(&priv->netif.txq.list, skb);
+ priv->netif.txq.len++;
+ set_bit(TX_SCHED, &priv->flags);
+
+ if (priv->netif.txq.len >= ioq_remain(ioq, ioq_idxtype_inuse))
+ venettap_apply_backpressure(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ venettap_deferred_tx(priv);
+
+ return NETDEV_TX_OK;
+
+flowcontrol:
+ venettap_apply_backpressure(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return NETDEV_TX_BUSY;
+}
+
+/*
+ * Ioctl commands
+ */
+static int
+venettap_netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+ PDEBUG("ioctl\n");
+ return 0;
+}
+
+/*
+ * Return statistics to the caller
+ */
+static struct net_device_stats *
+venettap_netdev_stats(struct net_device *dev)
+{
+ struct venettap *priv = netdev_priv(dev);
+ return &priv->netif.stats;
+}
+
+static void
+venettap_netdev_unregister(struct venettap *priv)
+{
+ if (priv->netif.enabled) {
+ venettap_netdev_stop(priv->netif.dev);
+ unregister_netdev(priv->netif.dev);
+ }
+}
+
+/*
+ * Assumes priv->lock held
+ */
+static void
+venettap_rx_schedule(struct venettap *priv)
+{
+ if (!priv->vbus.link)
+ return;
+
+ if (priv->netif.link
+ && !ioq_empty(priv->vbus.rxq.queue, ioq_idxtype_inuse)) {
+ ioq_notify_disable(priv->vbus.rxq.queue, 0);
+
+ if (!test_and_set_bit(RX_SCHED, &priv->flags))
+ wake_up_process(priv->rxthread);
+ }
+}
+
+/*
+ * receive interrupt-service-routine - called whenever the vbus-driver signals
+ * our IOQ to indicate more inbound packets are ready.
+ */
+static void
+venettap_rx_isr(struct ioq_notifier *notifier)
+{
+ struct venettap *priv;
+ unsigned long flags;
+
+ priv = container_of(notifier, struct venettap, vbus.rxq.notifier);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ /* Disable future interrupts and schedule our napi-poll */
+ venettap_rx_schedule(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+/*
+ * transmit interrupt-service-routine - called whenever the vbus-driver signals
+ * our IOQ to indicate there is more room in the TX queue
+ */
+static void
+venettap_tx_isr(struct ioq_notifier *notifier)
+{
+ struct venettap *priv;
+ unsigned long flags;
+
+ priv = container_of(notifier, struct venettap, vbus.txq.notifier);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (priv->vbus.link
+ && !ioq_full(priv->vbus.txq.queue, ioq_idxtype_inuse)
+ && test_and_clear_bit(TX_IOQ_CONGESTED, &priv->flags)) {
+ PDEBUG("IOQ congestion cleared\n");
+ venettap_txq_notify_dec(priv);
+
+ if (priv->netif.link)
+ wake_up_process(priv->txthread);
+ }
+
+ venettap_check_netif_congestion(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+venettap_vlink_up(struct venettap *priv)
+{
+ int ret = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (priv->vbus.link) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ if (!priv->vbus.rxq.queue || !priv->vbus.txq.queue) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ priv->vbus.link = 1;
+
+ if (priv->netif.link)
+ netif_carrier_on(priv->netif.dev);
+
+ venettap_check_netif_congestion(priv);
+
+ ioq_notify_enable(priv->vbus.rxq.queue, 0);
+
+out:
+ spin_unlock_irqrestore(&priv->lock, flags);
+ return ret;
+}
+
+/* Assumes priv->lock held */
+static int
+_venettap_vlink_down(struct venettap *priv)
+{
+ struct sk_buff *skb;
+
+ if (!priv->vbus.link)
+ return -ENOENT;
+
+ priv->vbus.link = 0;
+
+ if (priv->netif.link)
+ netif_carrier_off(priv->netif.dev);
+
+ /* just trash whatever might have been pending */
+ while ((skb = __skb_dequeue(&priv->netif.txq.list)))
+ dev_kfree_skb(skb);
+
+ priv->netif.txq.len = 0;
+
+ /* And deschedule any pending processing */
+ clear_bit(RX_SCHED, &priv->flags);
+ clear_bit(TX_SCHED, &priv->flags);
+
+ ioq_notify_disable(priv->vbus.rxq.queue, 0);
+
+ return 0;
+}
+
+static int
+venettap_vlink_down(struct venettap *priv)
+{
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&priv->lock, flags);
+ ret = _venettap_vlink_down(priv);
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return ret;
+}
+
+static int
+venettap_macquery(struct venettap *priv, void *data, unsigned long len)
+{
+ struct vbus_memctx *ctx = priv->vbus.ctx;
+ int ret;
+
+ if (len != ETH_ALEN)
+ return -EINVAL;
+
+ ret = ctx->ops->copy_to(ctx, data, priv->cmac, ETH_ALEN);
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Negotiate Capabilities - This function is provided so that the
+ * interface may be extended without breaking ABI compatibility
+ *
+ * The caller is expected to send down any capabilities they would like
+ * to enable, and the device will mask them against the capabilities it
+ * supports. This value is then returned so that both sides may
+ * ascertain the lowest-common-denominator of features to enable
+ */
+static int
+venettap_negcap(struct venettap *priv, void *data, unsigned long len)
+{
+ struct vbus_memctx *ctx = priv->vbus.ctx;
+ struct venet_capabilities caps;
+ int ret;
+
+ if (len != sizeof(caps))
+ return -EINVAL;
+
+ if (priv->vbus.link)
+ return -EINVAL;
+
+ ret = ctx->ops->copy_from(ctx, &caps, data, sizeof(caps));
+ if (ret)
+ return -EFAULT;
+
+ switch (caps.gid) {
+ default:
+ caps.bits = 0;
+ break;
+ }
+
+ ret = ctx->ops->copy_to(ctx, data, &caps, sizeof(caps));
+ if (ret)
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Walk through and flush each remaining descriptor by returning
+ * a zero length packet.
+ *
+ * This is useful, for instance, when the driver is changing the MTU
+ * and wants to reclaim all the existing buffers outstanding which
+ * are a different size than the new MTU
+ */
+static int
+venettap_flushrx(struct venettap *priv)
+{
+ struct ioq_iterator iter;
+ struct ioq *ioq = NULL;
+ int ret;
+ unsigned long flags;
+
+ PDEBUG("flushrx\n");
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (unlikely(!priv->vbus.link)) {
+ spin_unlock_irqrestore(&priv->lock, flags);
+ return -EINVAL;
+ }
+
+ ioq = priv->vbus.txq.queue;
+
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ while (iter.desc->sown) {
+ iter.desc->len = 0;
+ ret = ioq_iter_push(&iter, 0);
+ if (ret < 0)
+ SHM_SIGNAL_FAULT(ioq->signal, "could not flushrx");
+ }
+
+ PDEBUG("flushrx complete\n");
+
+ if (!test_and_set_bit(TX_IOQ_CONGESTED, &priv->flags)) {
+ PDEBUG("congested with %d packets still queued\n",
+ priv->netif.txq.len);
+ venettap_txq_notify_inc(priv);
+ }
+
+ /*
+ * we purposely do not ioq_signal() the other side here. Since
+ * this function was invoked by the client, they can take care
+ * of explicitly calling any reclaim code if they like. This also
+ * avoids a potential deadlock in case turning around and injecting
+ * a signal while we are in a call() is problematic to the
+ * connector design
+ */
+
+ venettap_check_netif_congestion(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return 0;
+}
+
+/*
+ * This is called whenever a driver wants to perform a synchronous
+ * "function call" to our device. It is similar to the notion of
+ * an ioctl(). The parameters are part of the ABI between the device
+ * and driver.
+ */
+static int
+venettap_vlink_call(struct vbus_connection *conn,
+ unsigned long func,
+ void *data,
+ unsigned long len,
+ unsigned long flags)
+{
+ struct venettap *priv = conn_to_priv(conn);
+
+ PDEBUG("call -> %lu with %p/%lu\n", func, data, len);
+
+ switch (func) {
+ case VENET_FUNC_LINKUP:
+ return venettap_vlink_up(priv);
+ case VENET_FUNC_LINKDOWN:
+ return venettap_vlink_down(priv);
+ case VENET_FUNC_MACQUERY:
+ return venettap_macquery(priv, data, len);
+ case VENET_FUNC_NEGCAP:
+ return venettap_negcap(priv, data, len);
+ case VENET_FUNC_FLUSHRX:
+ return venettap_flushrx(priv);
+ default:
+ return -EINVAL;
+ }
+}
+
+/*
+ * This is called whenever a driver wants to open a new IOQ between itself
+ * and our device. The "id" field is meant to convey meaning to the device
+ * as to what the intended use of this IOQ is. For instance, for venet "id=0"
+ * means "rx" and "id=1" means "tx". That namespace is managed by the device
+ * and should be understood by the driver as part of its ABI agreement.
+ *
+ * The device should take a reference to the IOQ via ioq_get() and hold it
+ * until the connection is released.
+ */
+static int
+venettap_vlink_shm(struct vbus_connection *conn,
+ unsigned long id,
+ struct vbus_shm *shm,
+ struct shm_signal *signal,
+ unsigned long flags)
+{
+ struct venettap *priv = conn_to_priv(conn);
+
+ PDEBUG("queue -> %p/%lu attached\n", shm, id);
+
+ switch (id) {
+ case VENET_QUEUE_RX:
+ return venettap_queue_init(&priv->vbus.txq, shm, signal,
+ venettap_tx_isr);
+ case VENET_QUEUE_TX:
+ return venettap_queue_init(&priv->vbus.rxq, shm, signal,
+ venettap_rx_isr);
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void
+venettap_vlink_close(struct vbus_connection *conn)
+{
+ struct venettap *priv = conn_to_priv(conn);
+ DEFINE_WAIT(wait);
+ unsigned long flags;
+
+ PDEBUG("connection closed\n");
+
+ /* Block until all posted packets from the client have been processed */
+ prepare_to_wait(&priv->vbus.rx_empty, &wait, TASK_UNINTERRUPTIBLE);
+
+ while (test_bit(RX_SCHED, &priv->flags))
+ schedule();
+
+ finish_wait(&priv->vbus.rx_empty, &wait);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ priv->vbus.opened = false;
+ _venettap_vlink_down(priv);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+/*
+ * This is called whenever the driver closes all references to our device
+ */
+static void
+venettap_vlink_release(struct vbus_connection *conn)
+{
+ struct venettap *priv = conn_to_priv(conn);
+
+ PDEBUG("connection released\n");
+
+ venettap_queue_release(&priv->vbus.rxq);
+ venettap_queue_release(&priv->vbus.txq);
+ vbus_memctx_put(priv->vbus.ctx);
+
+ kobject_put(priv->vbus.dev.kobj);
+}
+
+static struct vbus_connection_ops venettap_vbus_link_ops = {
+ .call = venettap_vlink_call,
+ .shm = venettap_vlink_shm,
+ .close = venettap_vlink_close,
+ .release = venettap_vlink_release,
+};
+
+/*
+ * This is called whenever a driver wants to open our device_interface
+ * for communication. The connection is represented by a
+ * vbus_connection object. It is up to the implementation to decide
+ * if it allows more than one connection at a time. This simple example
+ * does not.
+ */
+static int
+venettap_intf_open(struct vbus_device_interface *intf,
+ struct vbus_memctx *ctx,
+ int version,
+ struct vbus_connection **conn)
+{
+ struct venettap *priv = intf_to_priv(intf);
+ unsigned long flags;
+
+ PDEBUG("open\n");
+
+ if (version != VENET_VERSION)
+ return -EINVAL;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ /*
+ * We only allow one connection to this device
+ */
+ if (priv->vbus.opened) {
+ spin_unlock_irqrestore(&priv->lock, flags);
+ return -EBUSY;
+ }
+
+ kobject_get(intf->dev->kobj);
+
+ vbus_connection_init(&priv->vbus.conn, &venettap_vbus_link_ops);
+
+ priv->vbus.opened = true;
+ priv->vbus.ctx = ctx;
+
+ vbus_memctx_get(ctx);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ *conn = &priv->vbus.conn;
+
+ return 0;
+}
+
+static void
+venettap_intf_release(struct vbus_device_interface *intf)
+{
+ kobject_put(intf->dev->kobj);
+}
+
+static struct vbus_device_interface_ops venettap_device_interface_ops = {
+ .open = venettap_intf_open,
+ .release = venettap_intf_release,
+};
+
+/*
+ * This is called whenever the admin creates a symbolic link between
+ * a bus in /config/vbus/buses and our device. It represents a bus
+ * connection. Your device can choose to allow more than one bus to
+ * connect, or it can restrict it to one bus. It can also choose to
+ * register one or more device_interfaces on each bus that it
+ * successfully connects to.
+ *
+ * This example device only registers a single interface
+ */
+static int
+venettap_device_bus_connect(struct vbus_device *dev, struct vbus *vbus)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+ struct vbus_device_interface *intf = &priv->vbus.intf;
+
+ /* We only allow one bus to connect */
+ if (priv->vbus.connected)
+ return -EBUSY;
+
+ kobject_get(dev->kobj);
+
+ intf->name = "0";
+ intf->type = VENET_TYPE;
+ intf->ops = &venettap_device_interface_ops;
+
+ priv->vbus.connected = true;
+
+ /*
+ * Our example only registers one interface. If you need
+ * more, simply call interface_register() multiple times
+ */
+ return vbus_device_interface_register(dev, vbus, intf);
+}
+
+/*
+ * This is called whenever the admin removes the symbolic link between
+ * a bus in /config/vbus/buses and our device.
+ */
+static int
+venettap_device_bus_disconnect(struct vbus_device *dev, struct vbus *vbus)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+ struct vbus_device_interface *intf = &priv->vbus.intf;
+
+ if (!priv->vbus.connected)
+ return -EINVAL;
+
+ vbus_device_interface_unregister(intf);
+
+ priv->vbus.connected = false;
+ kobject_put(dev->kobj);
+
+ return 0;
+}
+
+static void
+venettap_device_release(struct vbus_device *dev)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+
+ venettap_netdev_unregister(priv);
+ free_netdev(priv->netif.dev);
+ module_put(THIS_MODULE);
+}
+
+
+static struct vbus_device_ops venettap_device_ops = {
+ .bus_connect = venettap_device_bus_connect,
+ .bus_disconnect = venettap_device_bus_disconnect,
+ .release = venettap_device_release,
+};
+
+#define VENETTAP_TYPE "venet-tap"
+
+/*
+ * Interface attributes show up as files under
+ * /sys/vbus/devices/$devid
+ */
+static ssize_t
+host_mac_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+ char *buf)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+
+ return sysfs_format_mac(buf, priv->hmac, ETH_ALEN);
+}
+
+static struct vbus_device_attribute attr_hmac =
+ __ATTR_RO(host_mac);
+
+static ssize_t
+client_mac_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+ char *buf)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+
+ return sysfs_format_mac(buf, priv->cmac, ETH_ALEN);
+}
+
+static struct vbus_device_attribute attr_cmac =
+ __ATTR_RO(client_mac);
+
+static ssize_t
+enabled_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+ char *buf)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", priv->netif.enabled);
+}
+
+static ssize_t
+enabled_store(struct vbus_device *dev, struct vbus_device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+ int enabled = -1;
+ int ret = 0;
+
+ if (count > 0)
+ sscanf(buf, "%d", &enabled);
+
+ if (enabled != 0 && enabled != 1)
+ return -EINVAL;
+
+ if (enabled && !priv->netif.enabled)
+ ret = register_netdev(priv->netif.dev);
+
+ if (!enabled && priv->netif.enabled)
+ venettap_netdev_unregister(priv);
+
+ if (ret < 0)
+ return ret;
+
+ priv->netif.enabled = enabled;
+
+ return count;
+}
+
+static struct vbus_device_attribute attr_enabled =
+ __ATTR(enabled, S_IRUGO | S_IWUSR, enabled_show, enabled_store);
+
+static ssize_t
+ifname_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+ char *buf)
+{
+ struct venettap *priv = vdev_to_priv(dev);
+
+ if (!priv->netif.enabled)
+ return sprintf(buf, "<disabled>\n");
+
+ return snprintf(buf, PAGE_SIZE, "%s\n", priv->netif.dev->name);
+}
+
+static struct vbus_device_attribute attr_ifname =
+ __ATTR_RO(ifname);
+
+static struct attribute *attrs[] = {
+ &attr_hmac.attr,
+ &attr_cmac.attr,
+ &attr_enabled.attr,
+ &attr_ifname.attr,
+ NULL,
+};
+
+static struct attribute_group venettap_attr_group = {
+ .attrs = attrs,
+};
+
+static struct net_device_ops venettap_netdev_ops = {
+ .ndo_open = venettap_netdev_open,
+ .ndo_stop = venettap_netdev_stop,
+ .ndo_set_config = venettap_netdev_config,
+ .ndo_change_mtu = venettap_change_mtu,
+ .ndo_start_xmit = venettap_netdev_tx,
+ .ndo_do_ioctl = venettap_netdev_ioctl,
+ .ndo_get_stats = venettap_netdev_stats,
+};
+
+/*
+ * This is called whenever the admin instantiates our devclass via
+ * "mkdir /config/vbus/devices/$(inst)/venet-tap"
+ */
+static int
+venettap_device_create(struct vbus_devclass *dc,
+ struct vbus_device **vdev)
+{
+ struct net_device *dev;
+ struct venettap *priv;
+ struct vbus_device *_vdev;
+
+ dev = alloc_etherdev(sizeof(struct venettap));
+ if (!dev)
+ return -ENOMEM;
+
+ priv = netdev_priv(dev);
+ memset(priv, 0, sizeof(*priv));
+
+ spin_lock_init(&priv->lock);
+ random_ether_addr(priv->hmac);
+ random_ether_addr(priv->cmac);
+
+ /*
+ * vbus init
+ */
+ _vdev = &priv->vbus.dev;
+
+ _vdev->type = VENETTAP_TYPE;
+ _vdev->ops = &venettap_device_ops;
+ _vdev->attrs = &venettap_attr_group;
+
+ init_waitqueue_head(&priv->vbus.rx_empty);
+
+ /*
+ * netif init
+ */
+ skb_queue_head_init(&priv->netif.txq.list);
+ priv->netif.txq.len = 0;
+
+ priv->netif.dev = dev;
+
+ ether_setup(dev); /* assign some of the fields */
+
+ dev->netdev_ops = &venettap_netdev_ops;
+ memcpy(dev->dev_addr, priv->hmac, ETH_ALEN);
+
+ dev->features |= NETIF_F_HIGHDMA;
+
+ *vdev = _vdev;
+
+ /*
+ * We don't need a try_get because the reference is held by the
+ * infrastructure during a create() operation
+ */
+ __module_get(THIS_MODULE);
+
+ return 0;
+}
+
+static struct vbus_devclass_ops venettap_devclass_ops = {
+ .create = venettap_device_create,
+};
+
+static struct vbus_devclass venettap_devclass = {
+ .name = VENETTAP_TYPE,
+ .ops = &venettap_devclass_ops,
+ .owner = THIS_MODULE,
+};
+
+static int __init venettap_init(void)
+{
+ return vbus_devclass_register(&venettap_devclass);
+}
+
+static void __exit venettap_cleanup(void)
+{
+ vbus_devclass_unregister(&venettap_devclass);
+}
+
+module_init(venettap_init);
+module_exit(venettap_cleanup);
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index 71acd6f..3ce0adc 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -14,6 +14,17 @@ config VBUS
If unsure, say N
+config VBUS_DEVICES
+ bool "Virtual-Bus Devices"
+ depends on VBUS
+ default n
+ help
+ Provides device-class modules for instantiation on a virtual-bus
+
+ If unsure, say N
+
+source "drivers/vbus/devices/Kconfig"
+
config VBUS_DRIVERS
tristate "VBUS Driver support"
select IOQ
@@ -23,3 +34,5 @@ config VBUS_DRIVERS
If unsure, say N
+
+
Signed-off-by: Gregory Haskins <[email protected]>
---
drivers/net/vbus-enet.c | 249 +++++++++++++++++++++++++++++++++++++++++++++--
include/linux/venet.h | 39 +++++++
2 files changed, 275 insertions(+), 13 deletions(-)
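As background for the NEGCAP changes in the patch below: the handshake boils down to a per-group bitwise intersection, where the driver proposes the feature bits it wants and the device returns only the subset it supports. A minimal userspace sketch of that intended lowest-common-denominator semantics (the helper names and the `devcall` stand-in are hypothetical, not the real ABI):

```c
/* Hypothetical sketch of the VENET_FUNC_NEGCAP handshake: the driver
 * proposes feature bits for one capability group, the device masks them
 * down to what it supports, and both sides enable the common subset.
 * device_negcap()/driver_negcap() are illustrative names only. */
#include <stdint.h>

struct venet_capabilities {
	uint32_t gid;	/* capability group id */
	uint32_t bits;	/* requested bits in, granted bits out */
};

#define CAP_GROUP_SG	0
#define CAP_SG		(1u << 0)
#define CAP_TSO4	(1u << 1)

/* Device side: grant only the intersection of requested and supported */
static void device_negcap(struct venet_capabilities *caps, uint32_t supported)
{
	caps->bits &= supported;
}

/* Driver side: propose features, then configure from what was granted */
static uint32_t driver_negcap(uint32_t wanted, uint32_t device_supported)
{
	struct venet_capabilities caps = { .gid = CAP_GROUP_SG, .bits = wanted };

	device_negcap(&caps, device_supported);	/* stands in for devcall() */
	return caps.bits;			/* enable only these */
}
```

Because the returned word can only lose bits, an old driver talking to a new device (or vice versa) degrades gracefully to the features both understand.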
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
index 3779f77..2a190e0 100644
--- a/drivers/net/vbus-enet.c
+++ b/drivers/net/vbus-enet.c
@@ -42,6 +42,8 @@ static int rx_ringlen = 256;
module_param(rx_ringlen, int, 0444);
static int tx_ringlen = 256;
module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
#undef PDEBUG /* undef it, just in case */
#ifdef VBUS_ENET_DEBUG
@@ -63,8 +65,17 @@ struct vbus_enet_priv {
struct vbus_enet_queue rxq;
struct vbus_enet_queue txq;
struct tasklet_struct txtask;
+ struct {
+ int sg:1;
+ int tso:1;
+ int ufo:1;
+ int tso6:1;
+ int ecn:1;
+ } flags;
};
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force);
+
static struct vbus_enet_priv *
napi_to_priv(struct napi_struct *napi)
{
@@ -198,6 +209,93 @@ rx_teardown(struct vbus_enet_priv *priv)
}
}
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+ struct ioq *ioq = priv->txq.queue;
+ struct ioq_iterator iter;
+ int i;
+ int ret;
+
+ if (!priv->flags.sg)
+ /*
+ * There is nothing to do for a ring that is not using
+ * scatter-gather
+ */
+ return 0;
+
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Now populate each descriptor with an empty SG descriptor
+ */
+ for (i = 0; i < tx_ringlen; i++) {
+ struct venet_sg *vsg;
+ size_t iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+ size_t len = sizeof(*vsg) + iovlen;
+
+ vsg = kzalloc(len, GFP_KERNEL);
+ if (!vsg)
+ return -ENOMEM;
+
+ iter.desc->cookie = (u64)vsg;
+ iter.desc->len = len;
+ iter.desc->ptr = (u64)__pa(vsg);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+ BUG_ON(ret < 0);
+ }
+
+ return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+ struct ioq *ioq = priv->txq.queue;
+ struct ioq_iterator iter;
+ int ret;
+
+ /* forcefully free all outstanding transmissions */
+ vbus_enet_tx_reap(priv, 1);
+
+ if (!priv->flags.sg)
+ /*
+ * There is nothing else to do for a ring that is not using
+ * scatter-gather
+ */
+ return;
+
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ /* seek to position 0 */
+ ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * free each valid descriptor
+ */
+ while (iter.desc->cookie) {
+ struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+ iter.desc->valid = 0;
+ wmb();
+
+ iter.desc->ptr = 0;
+ iter.desc->cookie = 0;
+
+ ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+ BUG_ON(ret < 0);
+
+ kfree(vsg);
+ }
+}
+
/*
* Open and close
*/
@@ -402,14 +500,67 @@ vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
BUG_ON(ret < 0);
BUG_ON(iter.desc->sown);
- /*
- * We simply put the skb right onto the ring. We will get an interrupt
- * later when the data has been consumed and we can reap the pointers
- * at that time
- */
- iter.desc->cookie = (u64)skb;
- iter.desc->len = (u64)skb->len;
- iter.desc->ptr = (u64)__pa(skb->data);
+ if (priv->flags.sg) {
+ struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+ struct scatterlist sgl[MAX_SKB_FRAGS+1];
+ struct scatterlist *sg;
+ int count, maxcount = ARRAY_SIZE(sgl);
+
+ sg_init_table(sgl, maxcount);
+
+ memset(vsg, 0, sizeof(*vsg));
+
+ vsg->cookie = (u64)skb;
+ vsg->len = skb->len;
+
+ if (skb->ip_summed == CHECKSUM_PARTIAL) {
+ vsg->flags |= VENET_SG_FLAG_NEEDS_CSUM;
+ vsg->csum.start = skb->csum_start - skb_headroom(skb);
+ vsg->csum.offset = skb->csum_offset;
+ }
+
+ if (skb_is_gso(skb)) {
+ struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+ vsg->flags |= VENET_SG_FLAG_GSO;
+
+ vsg->gso.hdrlen = skb_transport_header(skb) - skb->data;
+ vsg->gso.size = sinfo->gso_size;
+ if (sinfo->gso_type & SKB_GSO_TCPV4)
+ vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+ else if (sinfo->gso_type & SKB_GSO_TCPV6)
+ vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+ else if (sinfo->gso_type & SKB_GSO_UDP)
+ vsg->gso.type = VENET_GSO_TYPE_UDP;
+ else
+ panic("Virtual-Ethernet: unknown GSO type "
+ "0x%x\n", sinfo->gso_type);
+
+ if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+ vsg->flags |= VENET_SG_FLAG_ECN;
+ }
+
+ count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+ BUG_ON(count > maxcount);
+
+ for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+ struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+ iov->len = sg->length;
+ iov->ptr = (u64)sg_phys(sg);
+ }
+
+ } else {
+ /*
+ * non scatter-gather mode: simply put the skb right onto the
+ * ring.
+ */
+ iter.desc->cookie = (u64)skb;
+ iter.desc->len = (u64)skb->len;
+ iter.desc->ptr = (u64)__pa(skb->data);
+ }
+
iter.desc->valid = 1;
priv->dev->stats.tx_packets++;
@@ -465,7 +616,17 @@ vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
* owned by the south-side
*/
while (iter.desc->valid && (!iter.desc->sown || force)) {
- struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+ struct sk_buff *skb;
+
+ if (priv->flags.sg) {
+ struct venet_sg *vsg;
+
+ vsg = (struct venet_sg *)iter.desc->cookie;
+ skb = (struct sk_buff *)vsg->cookie;
+
+ } else {
+ skb = (struct sk_buff *)iter.desc->cookie;
+ }
PDEBUG("%lld: completed sending %d bytes\n",
priv->vdev->id, skb->len);
@@ -546,6 +707,47 @@ tx_isr(struct ioq_notifier *notifier)
tasklet_schedule(&priv->txtask);
}
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+ int ret;
+ struct venet_capabilities caps;
+
+ memset(&caps, 0, sizeof(caps));
+
+ if (sg_enabled) {
+ caps.gid = VENET_CAP_GROUP_SG;
+ caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+ |VENET_CAP_ECN|VENET_CAP_UFO);
+ }
+
+ ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+ if (ret < 0)
+ return ret;
+
+ if (caps.bits & VENET_CAP_SG) {
+ priv->flags.sg = true;
+
+ if (caps.bits & VENET_CAP_TSO4)
+ priv->flags.tso = true;
+ if (caps.bits & VENET_CAP_TSO6)
+ priv->flags.tso6 = true;
+ if (caps.bits & VENET_CAP_UFO)
+ priv->flags.ufo = true;
+ if (caps.bits & VENET_CAP_ECN)
+ priv->flags.ecn = true;
+
+ printk(KERN_INFO "VBUSENET %lld: "
+ "Detected GSO features %s%s%s%s\n", priv->vdev->id,
+ priv->flags.tso ? "t" : "-",
+ priv->flags.tso6 ? "T" : "-",
+ priv->flags.ufo ? "u" : "-",
+ priv->flags.ecn ? "e" : "-");
+ }
+
+ return 0;
+}
+
static const struct net_device_ops vbus_enet_netdev_ops = {
.ndo_open = vbus_enet_open,
.ndo_stop = vbus_enet_stop,
@@ -582,12 +784,21 @@ vbus_enet_probe(struct vbus_device_proxy *vdev)
priv->dev = dev;
priv->vdev = vdev;
+ ret = vbus_enet_negcap(priv);
+ if (ret < 0) {
+ printk(KERN_INFO "VENET: Error negotiating capabilities for "
+ "%lld\n",
+ priv->vdev->id);
+ goto out_free;
+ }
+
tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
rx_setup(priv);
+ tx_setup(priv);
ioq_notify_enable(priv->rxq.queue, 0); /* enable interrupts */
ioq_notify_enable(priv->txq.queue, 0);
@@ -607,6 +818,22 @@ vbus_enet_probe(struct vbus_device_proxy *vdev)
dev->features |= NETIF_F_HIGHDMA;
+ if (priv->flags.sg) {
+ dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+ if (priv->flags.tso)
+ dev->features |= NETIF_F_TSO;
+
+ if (priv->flags.ufo)
+ dev->features |= NETIF_F_UFO;
+
+ if (priv->flags.tso6)
+ dev->features |= NETIF_F_TSO6;
+
+ if (priv->flags.ecn)
+ dev->features |= NETIF_F_TSO_ECN;
+ }
+
ret = register_netdev(dev);
if (ret < 0) {
printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
@@ -634,9 +861,9 @@ vbus_enet_remove(struct vbus_device_proxy *vdev)
napi_disable(&priv->napi);
rx_teardown(priv);
- vbus_enet_tx_reap(priv, 1);
-
ioq_put(priv->rxq.queue);
+
+ tx_teardown(priv);
ioq_put(priv->txq.queue);
dev->ops->close(dev, 0);
diff --git a/include/linux/venet.h b/include/linux/venet.h
index ef6b199..1c96b90 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -35,8 +35,43 @@ struct venet_capabilities {
__u32 bits;
};
-/* CAPABILITIES-GROUP 0 */
-/* #define VENET_CAP_FOO 0 (No capabilities defined yet, for now) */
+#define VENET_CAP_GROUP_SG 0
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG (1 << 0)
+#define VENET_CAP_TSO4 (1 << 1)
+#define VENET_CAP_TSO6 (1 << 2)
+#define VENET_CAP_ECN (1 << 3)
+#define VENET_CAP_UFO (1 << 4)
+
+struct venet_iov {
+ __u32 len;
+ __u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO (1 << 1)
+#define VENET_SG_FLAG_ECN (1 << 2)
+
+struct venet_sg {
+ __u64 cookie;
+ __u32 flags;
+ __u32 len; /* total length of all iovs */
+ struct {
+ __u16 start; /* csum starting position */
+ __u16 offset; /* offset to place csum */
+ } csum;
+ struct {
+#define VENET_GSO_TYPE_TCPV4 0 /* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP 1 /* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6 2 /* IPv6 TCP */
+ __u8 type;
+ __u16 hdrlen;
+ __u16 size;
+ } gso;
+ __u32 count; /* nr of iovs */
+ struct venet_iov iov[1];
+};
#define VENET_FUNC_LINKUP 0
#define VENET_FUNC_LINKDOWN 1
Signed-off-by: Gregory Haskins <[email protected]>
---
include/linux/venet.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 47 insertions(+), 0 deletions(-)
create mode 100644 include/linux/venet.h
diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..ef6b199
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright 2008 Novell. All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+ __u32 gid;
+ __u32 bits;
+};
+
+/* CAPABILITIES-GROUP 0 */
+/* #define VENET_CAP_FOO 0 (No capabilities defined yet, for now) */
+
+#define VENET_FUNC_LINKUP 0
+#define VENET_FUNC_LINKDOWN 1
+#define VENET_FUNC_MACQUERY 2
+#define VENET_FUNC_NEGCAP 3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX 4
+
+#endif /* _LINUX_VENET_H */
We can map these over VBUS shared memory (or really any shared-memory
architecture if it supports shm-signals) to allow asynchronous
communication between two end-points. Memory is synchronized using
pure barriers (i.e. lockless), so IOQs are friendly in many contexts,
even if the memory is remote.
Signed-off-by: Gregory Haskins <[email protected]>
---
include/linux/ioq.h | 410 +++++++++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 12 +
lib/Makefile | 1
lib/ioq.c | 298 +++++++++++++++++++++++++++++++++++++
4 files changed, 721 insertions(+), 0 deletions(-)
create mode 100644 include/linux/ioq.h
create mode 100644 lib/ioq.c
diff --git a/include/linux/ioq.h b/include/linux/ioq.h
new file mode 100644
index 0000000..d450d9a
--- /dev/null
+++ b/include/linux/ioq.h
@@ -0,0 +1,410 @@
+/*
+ * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * IOQ is a generic shared-memory, lockless queue mechanism. It can be used
+ * in a variety of ways, though its intended purpose is to become the
+ * asynchronous communication path for virtual-bus drivers.
+ *
+ * The following are a list of key design points:
+ *
+ * #) All shared-memory is always allocated on exactly one side of the
+ * link. This typically would be the guest side in a VM/VMM scenario.
+ * #) Each IOQ has the concept of "north" and "south" locales, where
+ * north denotes the memory-owner side (e.g. guest).
+ * #) An IOQ is manipulated using an iterator idiom.
+ * #) Provides a bi-directional signaling/notification infrastructure on
+ * a per-queue basis, which includes an event mitigation strategy
+ * to reduce boundary switching.
+ * #) The signaling path is abstracted so that various technologies and
+ * topologies can define their own specific implementation while sharing
+ * the basic structures and code.
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_IOQ_H
+#define _LINUX_IOQ_H
+
+#include <asm/types.h>
+#include <linux/shm_signal.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc). Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+struct ioq_ring_desc {
+ __u64 cookie; /* for arbitrary use by north-side */
+ __u64 ptr;
+ __u64 len;
+ __u8 valid;
+ __u8 sown; /* South owned = 1, North owned = 0 */
+};
+
+#define IOQ_RING_MAGIC 0x47fa2fe4
+#define IOQ_RING_VER 4
+
+struct ioq_ring_idx {
+ __u32 head; /* 0 based index to head of ptr array */
+ __u32 tail; /* 0 based index to tail of ptr array */
+ __u8 full;
+};
+
+enum ioq_locality {
+ ioq_locality_north,
+ ioq_locality_south,
+};
+
+struct ioq_ring_head {
+ __u32 magic;
+ __u32 ver;
+ struct shm_signal_desc signal;
+ struct ioq_ring_idx idx[2];
+ __u32 count;
+ struct ioq_ring_desc ring[1]; /* "count" elements will be allocated */
+};
+
+#define IOQ_HEAD_DESC_SIZE(count) \
+ (sizeof(struct ioq_ring_head) + sizeof(struct ioq_ring_desc) * ((count) - 1))
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+#include <asm/atomic.h>
+
+enum ioq_idx_type {
+ ioq_idxtype_valid,
+ ioq_idxtype_inuse,
+ ioq_idxtype_both,
+ ioq_idxtype_invalid,
+};
+
+enum ioq_seek_type {
+ ioq_seek_tail,
+ ioq_seek_next,
+ ioq_seek_head,
+ ioq_seek_set
+};
+
+struct ioq_iterator {
+ struct ioq *ioq;
+ struct ioq_ring_idx *idx;
+ u32 pos;
+ struct ioq_ring_desc *desc;
+ int update:1;
+ int dualidx:1;
+ int flipowner:1;
+};
+
+struct ioq_notifier {
+ void (*signal)(struct ioq_notifier *);
+};
+
+struct ioq_ops {
+ void (*release)(struct ioq *ioq);
+};
+
+struct ioq {
+ struct ioq_ops *ops;
+
+ atomic_t refs;
+ enum ioq_locality locale;
+ struct ioq_ring_head *head_desc;
+ struct ioq_ring_desc *ring;
+ struct shm_signal *signal;
+ wait_queue_head_t wq;
+ struct ioq_notifier *notifier;
+ size_t count;
+ struct shm_signal_notifier shm_notifier;
+};
+
+#define IOQ_ITER_AUTOUPDATE (1 << 0)
+#define IOQ_ITER_NOFLIPOWNER (1 << 1)
+
+/**
+ * ioq_init() - initialize an IOQ
+ * @ioq: IOQ context
+ * @ops: ops vector (provides the release() callback)
+ * @locale: locality of this end-point (north = memory owner, south = remote)
+ * @head: pointer to the shared ring-head descriptor
+ * @signal: shm_signal to use for cross-locale notifications
+ * @count: number of descriptors in the ring
+ *
+ * Initializes the IOQ context before first use.
+ *
+ **/
+void ioq_init(struct ioq *ioq,
+ struct ioq_ops *ops,
+ enum ioq_locality locale,
+ struct ioq_ring_head *head,
+ struct shm_signal *signal,
+ size_t count);
+
+/**
+ * ioq_get() - acquire an IOQ context reference
+ * @ioq: IOQ context
+ *
+ **/
+static inline struct ioq *ioq_get(struct ioq *ioq)
+{
+ atomic_inc(&ioq->refs);
+
+ return ioq;
+}
+
+/**
+ * ioq_put() - release an IOQ context reference
+ * @ioq: IOQ context
+ *
+ **/
+static inline void ioq_put(struct ioq *ioq)
+{
+ if (atomic_dec_and_test(&ioq->refs)) {
+ shm_signal_put(ioq->signal);
+ ioq->ops->release(ioq);
+ }
+}
+
+/**
+ * ioq_notify_enable() - enables local notifications on an IOQ
+ * @ioq: IOQ context
+ * @flags: Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered ioq_notifier (if applicable) and waitq to
+ * receive wakeups whenever the remote side performs an ioq_signal() operation.
+ * A notification will be dispatched immediately if any pending signals have
+ * already been issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_enable(struct ioq *ioq, int flags)
+{
+ return shm_signal_enable(ioq->signal, 0);
+}
+
+/**
+ * ioq_notify_disable() - disable local notifications on an IOQ
+ * @ioq: IOQ context
+ * @flags: Reserved for future use, must be 0
+ *
+ * Disables/masks the registered ioq_notifier (if applicable) and waitq
+ * from receiving any further notifications. Any subsequent calls to
+ * ioq_signal() by the remote side will update the ring as dirty, but
+ * will not traverse the locale boundary and will not invoke the notifier
+ * callback or wakeup the waitq. Signals delivered while masked will
+ * be deferred until ioq_notify_enable() is invoked.
+ *
+ * This is synonymous with masking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_disable(struct ioq *ioq, int flags)
+{
+ return shm_signal_disable(ioq->signal, 0);
+}
+
+/**
+ * ioq_signal() - notify the remote side about ring changes
+ * @ioq: IOQ context
+ * @flags: Reserved for future use, must be 0
+ *
+ * Marks the ring state as "dirty" and, if enabled, will traverse
+ * a locale boundary to invoke a remote notification. The remote
+ * side controls whether the notification should be delivered via
+ * the ioq_notify_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the ioq_ops->signal() interface and provided by a particular
+ * implementation. However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_signal(struct ioq *ioq, int flags)
+{
+ return shm_signal_inject(ioq->signal, 0);
+}
+
+/**
+ * ioq_count() - counts the number of outstanding descriptors in an index
+ * @ioq: IOQ context
+ * @type: Specifies the index type
+ * (*) valid: the descriptor is valid. This is usually
+ * used to keep track of descriptors that may not
+ * be carrying a useful payload, but still need to
+ * be tracked carefully.
+ * (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ * (*) >=0: # of descriptors outstanding in the index
+ * (*) <0 = ERRNO
+ *
+ **/
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_remain() - counts the number of remaining descriptors in an index
+ * @ioq: IOQ context
+ * @type: Specifies the index type
+ * (*) valid: the descriptor is valid. This is usually
+ * used to keep track of descriptors that may not
+ * be carrying a useful payload, but still need to
+ * be tracked carefully.
+ * (*) inuse: Descriptors that carry useful payload
+ *
+ * This is the converse of ioq_count(). This function returns the number
+ * of "free" descriptors left in a particular index
+ *
+ * Returns:
+ * (*) >=0: # of descriptors remaining in the index
+ * (*) <0 = ERRNO
+ *
+ **/
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_size() - counts the maximum number of descriptors in a ring
+ * @ioq: IOQ context
+ *
+ * This function returns the maximum number of descriptors supported in
+ * a ring, regardless of their current state (free or inuse).
+ *
+ * Returns:
+ * (*) >=0: total # of descriptors in the ring
+ * (*) <0 = ERRNO
+ *
+ **/
+int ioq_size(struct ioq *ioq);
+
+/**
+ * ioq_full() - determines if a specific index is "full"
+ * @ioq: IOQ context
+ * @type: Specifies the index type
+ * (*) valid: the descriptor is valid. This is usually
+ * used to keep track of descriptors that may not
+ * be carrying a useful payload, but still need to
+ * be tracked carefully.
+ * (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ * (*) 0: index is not full
+ * (*) 1: index is full
+ * (*) <0 = ERRNO
+ *
+ **/
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_empty() - determines if a specific index is "empty"
+ * @ioq: IOQ context
+ * @type: Specifies the index type
+ * (*) valid: the descriptor is valid. This is usually
+ * used to keep track of descriptors that may not
+ * be carrying a useful payload, but still need to
+ * be tracked carefully.
+ * (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ * (*) 0: index is not empty
+ * (*) 1: index is empty
+ * (*) <0 = ERRNO
+ *
+ **/
+static inline int ioq_empty(struct ioq *ioq, enum ioq_idx_type type)
+{
+ return !ioq_count(ioq, type);
+}
+
+/**
+ * ioq_iter_init() - initialize an iterator for IOQ descriptor traversal
+ * @ioq: IOQ context to iterate on
+ * @iter: Iterator context to init (usually from stack)
+ * @type: Specifies the index type to iterate against
+ * (*) valid: iterate against the "valid" index
+ * (*) inuse: iterate against the "inuse" index
+ * (*) both: iterate against both indexes simultaneously
+ * @flags: Bitfield with 0 or more bits set to alter behavior
+ * (*) autoupdate: automatically signal the remote side
+ * whenever the iterator pushes/pops to a new desc
+ * (*) noflipowner: do not flip the ownership bit during
+ * a push/pop operation
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+ enum ioq_idx_type type, int flags);
+
+/**
+ * ioq_iter_seek() - seek to a specific location in the IOQ ring
+ * @iter: Iterator context (must be initialized with ioq_iter_init)
+ * @type: Specifies the type of seek operation
+ * (*) tail: seek to the absolute tail, offset is ignored
+ * (*) next: seek to the relative next, offset is ignored
+ * (*) head: seek to the absolute head, offset is ignored
+ * (*) set: seek to the absolute offset
+ * @offset: Offset for ioq_seek_set operations
+ * @flags: Reserved for future use, must be 0
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+ long offset, int flags);
+
+/**
+ * ioq_iter_push() - push the tail pointer forward
+ * @iter: Iterator context (must be initialized with ioq_iter_init)
+ * @flags: Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the tail ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation. This effectively "pushes" a new pointer
+ * onto the tail of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_push(struct ioq_iterator *iter, int flags);
+
+/**
+ * ioq_iter_pop() - pop the head pointer from the ring
+ * @iter: Iterator context (must be initialized with ioq_iter_init)
+ * @flags: Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the head ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation. This effectively "pops" a pointer
+ * from the head of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_pop(struct ioq_iterator *iter, int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_IOQ_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 32d82fe..1e66f8e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -183,5 +183,17 @@ config SHM_SIGNAL
If unsure, say N
+config IOQ
+ boolean "IO-Queue library - Generic shared-memory queue"
+ select SHM_SIGNAL
+ default n
+ help
+ IOQ is a generic shared-memory-queue mechanism that happens to be
+ friendly to virtualization boundaries. It can be used in a variety
+ of ways, though its intended purpose is to become the low-level
+ communication path for paravirtualized drivers.
+
+ If unsure, say N
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index bc36327..98cd332 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
obj-$(CONFIG_SMP) += percpu_counter.o
obj-$(CONFIG_AUDIT_GENERIC) += audit.o
obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
+obj-$(CONFIG_IOQ) += ioq.o
obj-$(CONFIG_SWIOTLB) += swiotlb.o
obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/ioq.c b/lib/ioq.c
new file mode 100644
index 0000000..803b5d6
--- /dev/null
+++ b/lib/ioq.c
@@ -0,0 +1,298 @@
+/*
+ * Copyright 2008 Novell. All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ * Gregory Haskins <[email protected]>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/ioq.h>
+#include <asm/bitops.h>
+#include <linux/module.h>
+
+#ifndef NULL
+#define NULL 0
+#endif
+
+static int ioq_iter_setpos(struct ioq_iterator *iter, u32 pos)
+{
+ struct ioq *ioq = iter->ioq;
+
+ BUG_ON(pos >= ioq->count);
+
+ iter->pos = pos;
+ iter->desc = &ioq->ring[pos];
+
+ return 0;
+}
+
+static inline u32 modulo_inc(u32 val, u32 mod)
+{
+ BUG_ON(val >= mod);
+
+ if (val == (mod - 1))
+ return 0;
+
+ return val + 1;
+}
+
+static inline int idx_full(struct ioq_ring_idx *idx)
+{
+ return idx->full && (idx->head == idx->tail);
+}
+
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+ long offset, int flags)
+{
+ struct ioq_ring_idx *idx = iter->idx;
+ u32 pos;
+
+ switch (type) {
+ case ioq_seek_next:
+ pos = modulo_inc(iter->pos, iter->ioq->count);
+ break;
+ case ioq_seek_tail:
+ pos = idx->tail;
+ break;
+ case ioq_seek_head:
+ pos = idx->head;
+ break;
+ case ioq_seek_set:
+ if (offset >= iter->ioq->count)
+ return -EINVAL;
+ pos = offset;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return ioq_iter_setpos(iter, pos);
+}
+EXPORT_SYMBOL_GPL(ioq_iter_seek);
+
+static int ioq_ring_count(struct ioq_ring_idx *idx, int count)
+{
+ if (idx->full && (idx->head == idx->tail))
+ return count;
+ else if (idx->tail >= idx->head)
+ return idx->tail - idx->head;
+ else
+ return (idx->tail + count) - idx->head;
+}
+
+static void idx_tail_push(struct ioq_ring_idx *idx, int count)
+{
+ u32 tail = modulo_inc(idx->tail, count);
+
+ if (idx->head == tail) {
+ rmb();
+
+ /*
+ * Setting full here may look racy, but note that we haven't
+ * flipped the owner bit yet. So it is impossible for the
+ * remote locale to move head in such a way that this operation
+ * becomes invalid
+ */
+ idx->full = 1;
+ wmb();
+ }
+
+ idx->tail = tail;
+}
+
+int ioq_iter_push(struct ioq_iterator *iter, int flags)
+{
+ struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+ struct ioq_ring_idx *idx = iter->idx;
+ int ret;
+
+ /*
+ * It's only valid to push if we are currently pointed at the tail
+ */
+ if (iter->pos != idx->tail || iter->desc->sown != iter->ioq->locale)
+ return -EINVAL;
+
+ idx_tail_push(idx, iter->ioq->count);
+ if (iter->dualidx) {
+ idx_tail_push(&head_desc->idx[ioq_idxtype_inuse],
+ iter->ioq->count);
+ if (head_desc->idx[ioq_idxtype_inuse].tail !=
+ head_desc->idx[ioq_idxtype_valid].tail) {
+ SHM_SIGNAL_FAULT(iter->ioq->signal,
+ "Tails not synchronized");
+ return -EINVAL;
+ }
+ }
+
+ wmb(); /* the index must be visible before the sown, or signal */
+
+ if (iter->flipowner) {
+ iter->desc->sown = !iter->ioq->locale;
+ wmb(); /* sown must be visible before we signal */
+ }
+
+ ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+ if (iter->update)
+ ioq_signal(iter->ioq, 0);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_push);
+
+int ioq_iter_pop(struct ioq_iterator *iter, int flags)
+{
+ struct ioq_ring_idx *idx = iter->idx;
+ int full;
+ int ret;
+
+ /*
+ * It's only valid to pop if we are currently pointed at the head
+ */
+ if (iter->pos != idx->head || iter->desc->sown != iter->ioq->locale)
+ return -EINVAL;
+
+ full = idx_full(idx);
+ rmb();
+
+ idx->head = modulo_inc(idx->head, iter->ioq->count);
+ wmb(); /* head must be visible before full */
+
+ if (full) {
+ idx->full = 0;
+ wmb(); /* full must be visible before sown */
+ }
+
+ if (iter->flipowner) {
+ iter->desc->sown = !iter->ioq->locale;
+ wmb(); /* sown must be visible before we signal */
+ }
+
+ ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+ if (iter->update)
+ ioq_signal(iter->ioq, 0);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_pop);
+
+static struct ioq_ring_idx *idxtype_to_idx(struct ioq *ioq,
+ enum ioq_idx_type type)
+{
+ struct ioq_ring_idx *idx;
+
+ switch (type) {
+ case ioq_idxtype_valid:
+ case ioq_idxtype_inuse:
+ idx = &ioq->head_desc->idx[type];
+ break;
+ default:
+ panic("IOQ: illegal index type: %d", type);
+ break;
+ }
+
+ return idx;
+}
+
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+ enum ioq_idx_type type, int flags)
+{
+ iter->ioq = ioq;
+ iter->update = (flags & IOQ_ITER_AUTOUPDATE);
+ iter->flipowner = !(flags & IOQ_ITER_NOFLIPOWNER);
+ iter->pos = -1;
+ iter->desc = NULL;
+ iter->dualidx = 0;
+
+ if (type == ioq_idxtype_both) {
+ /*
+ * "both" is a special case, so we set the dualidx flag.
+ *
+ * However, we also just want to use the valid-index
+ * for normal processing, so override that here
+ */
+ type = ioq_idxtype_valid;
+ iter->dualidx = 1;
+ }
+
+ iter->idx = idxtype_to_idx(ioq, type);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_init);
+
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type)
+{
+ return ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+}
+EXPORT_SYMBOL_GPL(ioq_count);
+
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type)
+{
+ int count = ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+
+ return ioq->count - count;
+}
+EXPORT_SYMBOL_GPL(ioq_remain);
+
+int ioq_size(struct ioq *ioq)
+{
+ return ioq->count;
+}
+EXPORT_SYMBOL_GPL(ioq_size);
+
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type)
+{
+ struct ioq_ring_idx *idx = idxtype_to_idx(ioq, type);
+
+ return idx_full(idx);
+}
+EXPORT_SYMBOL_GPL(ioq_full);
+
+static void ioq_shm_signal(struct shm_signal_notifier *notifier)
+{
+ struct ioq *ioq = container_of(notifier, struct ioq, shm_notifier);
+
+ wake_up(&ioq->wq);
+ if (ioq->notifier)
+ ioq->notifier->signal(ioq->notifier);
+}
+
+void ioq_init(struct ioq *ioq,
+ struct ioq_ops *ops,
+ enum ioq_locality locale,
+ struct ioq_ring_head *head,
+ struct shm_signal *signal,
+ size_t count)
+{
+ memset(ioq, 0, sizeof(*ioq));
+ atomic_set(&ioq->refs, 1);
+ init_waitqueue_head(&ioq->wq);
+
+ ioq->ops = ops;
+ ioq->locale = locale;
+ ioq->head_desc = head;
+ ioq->ring = &head->ring[0];
+ ioq->count = count;
+ ioq->signal = signal;
+
+ ioq->shm_notifier.signal = &ioq_shm_signal;
+ signal->notifier = &ioq->shm_notifier;
+}
+EXPORT_SYMBOL_GPL(ioq_init);
On Thu, 09 Apr 2009 12:31:29 -0400
Gregory Haskins <[email protected]> wrote:
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> drivers/net/Kconfig | 13 +
> drivers/net/Makefile | 1
> drivers/net/vbus-enet.c | 680 +++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 694 insertions(+), 0 deletions(-)
> create mode 100644 drivers/net/vbus-enet.c
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 62d732a..ac9dabd 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -3099,4 +3099,17 @@ config VIRTIO_NET
> This is the virtual network driver for virtio. It can be used with
> lguest or QEMU based VMMs (like KVM or Xen). Say Y or M.
>
> +config VBUS_ENET
> + tristate "Virtual Ethernet Driver"
> + depends on VBUS_DRIVERS
> + help
> + A virtualized 802.x network device based on the VBUS interface.
> + It can be used with any hypervisor/kernel that supports the
> + vbus protocol.
> +
> +config VBUS_ENET_DEBUG
> + bool "Enable Debugging"
> + depends on VBUS_ENET
> + default n
> +
> endif # NETDEVICES
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 471baaf..61db928 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -264,6 +264,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
> obj-$(CONFIG_NETXEN_NIC) += netxen/
> obj-$(CONFIG_NIU) += niu.o
> obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> +obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
> obj-$(CONFIG_SFC) += sfc/
>
> obj-$(CONFIG_WIMAX) += wimax/
> diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
> new file mode 100644
> index 0000000..3779f77
> --- /dev/null
> +++ b/drivers/net/vbus-enet.c
> @@ -0,0 +1,680 @@
> +/*
> + * vbus_enet - A virtualized 802.x network device based on the VBUS interface
> + *
> + * Copyright (C) 2009 Novell, Gregory Haskins <[email protected]>
> + *
> + * Derived from the SNULL example from the book "Linux Device Drivers" by
> + * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
> + * by O'Reilly & Associates.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/interrupt.h>
> +
> +#include <linux/in.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/tcp.h>
> +#include <linux/skbuff.h>
> +#include <linux/ioq.h>
> +#include <linux/vbus_driver.h>
> +
> +#include <linux/in6.h>
> +#include <asm/checksum.h>
> +
> +#include <linux/venet.h>
> +
> +MODULE_AUTHOR("Gregory Haskins");
> +MODULE_LICENSE("GPL");
MODULE_DESCRIPTION ?
MODULE_VERSION ?
> +static int napi_weight = 128;
> +module_param(napi_weight, int, 0444);
Already accessible through sysfs
> +static int rx_ringlen = 256;
> +module_param(rx_ringlen, int, 0444);
An API for ring length already exists via ethtool. If you used that,
there would be no need for a device-specific module parameter.
> +static int tx_ringlen = 256;
> +module_param(tx_ringlen, int, 0444);
> +
> +#undef PDEBUG /* undef it, just in case */
> +#ifdef CONFIG_VBUS_ENET_DEBUG
> +# define PDEBUG(fmt, args...) printk(KERN_DEBUG "vbus_enet: " fmt, ## args)
> +#else
> +# define PDEBUG(fmt, args...) /* not debugging: nothing */
> +#endif
Why reinvent pr_debug()?
> +
> +struct vbus_enet_queue {
> + struct ioq *queue;
> + struct ioq_notifier notifier;
> +};
> +
> +struct vbus_enet_priv {
> + spinlock_t lock;
> + struct net_device *dev;
> + struct vbus_device_proxy *vdev;
> + struct napi_struct napi;
> + struct vbus_enet_queue rxq;
> + struct vbus_enet_queue txq;
> + struct tasklet_struct txtask;
> +};
> +
> +static struct vbus_enet_priv *
> +napi_to_priv(struct napi_struct *napi)
> +{
> + return container_of(napi, struct vbus_enet_priv, napi);
> +}
> +
> +static int
> +queue_init(struct vbus_enet_priv *priv,
> + struct vbus_enet_queue *q,
> + int qid,
> + size_t ringsize,
> + void (*func)(struct ioq_notifier *))
> +{
> + struct vbus_device_proxy *dev = priv->vdev;
> + int ret;
> +
> + ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
> + if (ret < 0)
> + panic("ioq_alloc failed: %d\n", ret);
> +
> + if (func) {
> + q->notifier.signal = func;
> + q->queue->notifier = &q->notifier;
> + }
> +
> + return 0;
> +}
> +
> +static int
> +devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
> +{
> + struct vbus_device_proxy *dev = priv->vdev;
> +
> + return dev->ops->call(dev, func, data, len, 0);
> +}
> +
> +/*
> + * ---------------
> + * rx descriptors
> + * ---------------
> + */
> +
> +static void
> +rxdesc_alloc(struct ioq_ring_desc *desc, size_t len)
> +{
> + struct sk_buff *skb;
> +
> + len += ETH_HLEN;
> +
> + skb = dev_alloc_skb(len + 2);
> + BUG_ON(!skb);
> +
> + skb_reserve(skb, 2); /* align IP on 16B boundary */
Use NET_IP_ALIGN rather than 2.
Use netdev_alloc_skb() because it is NUMA aware.
> +
> + desc->cookie = (u64)skb;
> + desc->ptr = (u64)__pa(skb->data);
> + desc->len = len; /* total length */
> + desc->valid = 1;
> +}
> +
> +static void
> +rx_setup(struct vbus_enet_priv *priv)
> +{
> + struct ioq *ioq = priv->rxq.queue;
> + struct ioq_iterator iter;
> + int ret;
> +
> + /*
> + * We want to iterate on the "valid" index. By default the iterator
> + * will not "autoupdate" which means it will not hypercall the host
> + * with our changes. This is good, because we are really just
> + * initializing stuff here anyway. Note that you can always manually
> + * signal the host with ioq_signal() if the autoupdate feature is not
> + * used.
> + */
> + ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
> + BUG_ON(ret < 0);
Why not do proper initialization error handling, i.e. fail the
attempt to bring the device up with an error code (-ENOMEM)...
> + /*
> + * Seek to the tail of the valid index (which should be our first
> + * item, since the queue is brand-new)
> + */
> + ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
> + BUG_ON(ret < 0);
> +
> + /*
> + * Now populate each descriptor with an empty SKB and mark it valid
> + */
> + while (!iter.desc->valid) {
> + rxdesc_alloc(iter.desc, priv->dev->mtu);
> +
> + /*
> + * This push operation will simultaneously advance the
> + * valid-head index and increment our position in the queue
> + * by one.
> + */
> + ret = ioq_iter_push(&iter, 0);
> + BUG_ON(ret < 0);
> + }
> +}
> +
> +static void
> +rx_teardown(struct vbus_enet_priv *priv)
> +{
> + struct ioq *ioq = priv->rxq.queue;
> + struct ioq_iterator iter;
> + int ret;
> +
> + ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
> + BUG_ON(ret < 0);
> +
> + ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
> + BUG_ON(ret < 0);
> +
> + /*
> + * free each valid descriptor
> + */
> + while (iter.desc->valid) {
> + struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
> +
> + iter.desc->valid = 0;
> + wmb();
> +
> + iter.desc->ptr = 0;
> + iter.desc->cookie = 0;
> +
> + ret = ioq_iter_pop(&iter, 0);
> + BUG_ON(ret < 0);
> +
> + dev_kfree_skb(skb);
> + }
> +}
> +
> +/*
> + * Open and close
> + */
> +
> +static int
> +vbus_enet_open(struct net_device *dev)
> +{
> + struct vbus_enet_priv *priv = netdev_priv(dev);
> + int ret;
> +
> + ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
> + BUG_ON(ret < 0);
> +
> + napi_enable(&priv->napi);
> +
> + return 0;
> +}
> +
> +static int
> +vbus_enet_stop(struct net_device *dev)
> +{
> + struct vbus_enet_priv *priv = netdev_priv(dev);
> + int ret;
> +
> + napi_disable(&priv->napi);
> +
> + ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
> + BUG_ON(ret < 0);
> +
> + return 0;
> +}
> +
> +/*
> + * Configuration changes (passed on by ifconfig)
> + */
> +static int
> +vbus_enet_config(struct net_device *dev, struct ifmap *map)
> +{
> + if (dev->flags & IFF_UP) /* can't act on a running interface */
> + return -EBUSY;
> +
> + /* Don't allow changing the I/O address */
> + if (map->base_addr != dev->base_addr) {
> + printk(KERN_WARNING "vbus_enet: Can't change I/O address\n");
> + return -EOPNOTSUPP;
> + }
> +
> + /* ignore other fields */
> + return 0;
> +}
> +
> +static void
> +vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&priv->lock, flags);
> +
> + if (netif_rx_schedule_prep(&priv->napi)) {
> + /* Disable further interrupts */
> + ioq_notify_disable(priv->rxq.queue, 0);
> + __netif_rx_schedule(&priv->napi);
> + }
> +
> + spin_unlock_irqrestore(&priv->lock, flags);
> +}
> +
> +static int
> +vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
> +{
> + struct vbus_enet_priv *priv = netdev_priv(dev);
> + int ret;
> +
> + dev->mtu = new_mtu;
> +
> + /*
> + * FLUSHRX will cause the device to flush any outstanding
> + * RX buffers. They will appear to come in as 0 length
> + * packets which we can simply discard and replace with new_mtu
> + * buffers for the future.
> + */
> + ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
> + BUG_ON(ret < 0);
> +
> + vbus_enet_schedule_rx(priv);
> +
> + return 0;
> +}
> +
> +/*
> + * The poll implementation.
> + */
> +static int
> +vbus_enet_poll(struct napi_struct *napi, int budget)
> +{
> + struct vbus_enet_priv *priv = napi_to_priv(napi);
> + int npackets = 0;
> + struct ioq_iterator iter;
> + int ret;
> +
> + PDEBUG("%lld: polling...\n", priv->vdev->id);
> +
> + /* We want to iterate on the head of the in-use index */
> + ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
> + IOQ_ITER_AUTOUPDATE);
> + BUG_ON(ret < 0);
> +
> + ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
> + BUG_ON(ret < 0);
> +
> + /*
> + * We stop if we have met the quota or there are no more packets.
> + * The EOM is indicated by finding a packet that is still owned by
> + * the south side
> + */
> + while ((npackets < budget) && (!iter.desc->sown)) {
> + struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
> +
> + if (iter.desc->len) {
> + skb_put(skb, iter.desc->len);
> +
> + /* Maintain stats */
> + npackets++;
> + priv->dev->stats.rx_packets++;
> + priv->dev->stats.rx_bytes += iter.desc->len;
> +
> + /* Pass the buffer up to the stack */
> + skb->dev = priv->dev;
> + skb->protocol = eth_type_trans(skb, priv->dev);
> + netif_receive_skb(skb);
> +
> + mb();
> + } else
> + /*
> + * the device may send a zero-length packet when it's
> + * flushing references on the ring. We can just drop
> + * these on the floor
> + */
> + dev_kfree_skb(skb);
> +
> + /* Grab a new buffer to put in the ring */
> + rxdesc_alloc(iter.desc, priv->dev->mtu);
> +
> + /* Advance the in-use tail */
> + ret = ioq_iter_pop(&iter, 0);
> + BUG_ON(ret < 0);
> + }
> +
> + PDEBUG("%lld poll: %d packets received\n", priv->vdev->id, npackets);
> +
> + /*
> + * If we processed all packets, we're done; tell the kernel and
> + * reenable ints
> + */
> + if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
> + netif_rx_complete(napi);
> + ioq_notify_enable(priv->rxq.queue, 0);
> + ret = 0;
> + } else
> + /* We couldn't process everything. */
> + ret = 1;
> +
> + return ret;
> +}
> +
> +/*
> + * Transmit a packet (called by the kernel)
> + */
> +static int
> +vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct vbus_enet_priv *priv = netdev_priv(dev);
> + struct ioq_iterator iter;
> + int ret;
> + unsigned long flags;
> +
> + PDEBUG("%lld: sending %d bytes\n", priv->vdev->id, skb->len);
> +
> + spin_lock_irqsave(&priv->lock, flags);
> +
> + if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
> + /*
> + * We must flow-control the kernel by disabling the
> + * queue
> + */
> + spin_unlock_irqrestore(&priv->lock, flags);
> + netif_stop_queue(dev);
> + printk(KERN_ERR "VBUS_ENET: tx on full queue bug " \
> + "on device %lld\n", priv->vdev->id);
> + return 1;
> + }
> +
> + /*
> + * We want to iterate on the tail of both the "inuse" and "valid" index
> + * so we specify the "both" index
> + */
> + ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_both,
> + IOQ_ITER_AUTOUPDATE);
> + BUG_ON(ret < 0);
> +
> + ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
> + BUG_ON(ret < 0);
> + BUG_ON(iter.desc->sown);
> +
> + /*
> + * We simply put the skb right onto the ring. We will get an interrupt
> + * later when the data has been consumed and we can reap the pointers
> + * at that time
> + */
> + iter.desc->cookie = (u64)skb;
> + iter.desc->len = (u64)skb->len;
> + iter.desc->ptr = (u64)__pa(skb->data);
> + iter.desc->valid = 1;
> +
> + priv->dev->stats.tx_packets++;
> + priv->dev->stats.tx_bytes += skb->len;
> +
> + /*
> + * This advances both indexes together implicitly, and then
> + * signals the south side to consume the packet
> + */
> + ret = ioq_iter_push(&iter, 0);
> + BUG_ON(ret < 0);
> +
> + dev->trans_start = jiffies; /* save the timestamp */
> +
> + if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
> + /*
> + * If the queue is congested, we must flow-control the kernel
> + */
> + PDEBUG("%lld: backpressure tx queue\n", priv->vdev->id);
> + netif_stop_queue(dev);
> + }
> +
> + spin_unlock_irqrestore(&priv->lock, flags);
> +
> + return 0;
> +}
> +
> +/*
> + * reclaim any outstanding completed tx packets
> + *
> + * assumes priv->lock held
> + */
> +static void
> +vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
> +{
> + struct ioq_iterator iter;
> + int ret;
> +
> + /*
> + * We want to iterate on the head of the valid index, but we
> + * do not want the iter_pop (below) to flip the ownership, so
> + * we set the NOFLIPOWNER option
> + */
> + ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid,
> + IOQ_ITER_NOFLIPOWNER);
> + BUG_ON(ret < 0);
> +
> + ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
> + BUG_ON(ret < 0);
> +
> + /*
> + * We are done once we find the first packet either invalid or still
> + * owned by the south-side
> + */
> + while (iter.desc->valid && (!iter.desc->sown || force)) {
> + struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
> +
> + PDEBUG("%lld: completed sending %d bytes\n",
> + priv->vdev->id, skb->len);
> +
> + /* Reset the descriptor */
> + iter.desc->valid = 0;
> +
> + dev_kfree_skb(skb);
> +
> + /* Advance the valid-index head */
> + ret = ioq_iter_pop(&iter, 0);
> + BUG_ON(ret < 0);
> + }
> +
> + /*
> + * If we were previously stopped due to flow control, restart the
> + * processing
> + */
> + if (netif_queue_stopped(priv->dev)
> + && !ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
> + PDEBUG("%lld: re-enabling tx queue\n", priv->vdev->id);
> + netif_wake_queue(priv->dev);
> + }
> +}
> +
> +static void
> +vbus_enet_timeout(struct net_device *dev)
> +{
> + struct vbus_enet_priv *priv = netdev_priv(dev);
> + unsigned long flags;
> +
> + printk(KERN_DEBUG "VBUS_ENET %lld: Transmit timeout\n", priv->vdev->id);
> +
> + spin_lock_irqsave(&priv->lock, flags);
> + vbus_enet_tx_reap(priv, 0);
> + spin_unlock_irqrestore(&priv->lock, flags);
> +}
> +
> +static void
> +rx_isr(struct ioq_notifier *notifier)
> +{
> + struct vbus_enet_priv *priv;
> + struct net_device *dev;
> +
> + priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
> + dev = priv->dev;
> +
> + if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
> + vbus_enet_schedule_rx(priv);
> +}
> +
> +static void
> +deferred_tx_isr(unsigned long data)
> +{
> + struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
> + unsigned long flags;
> +
> + PDEBUG("deferred_tx_isr for %lld\n", priv->vdev->id);
> +
> + spin_lock_irqsave(&priv->lock, flags);
> + vbus_enet_tx_reap(priv, 0);
> + spin_unlock_irqrestore(&priv->lock, flags);
> +
> + ioq_notify_enable(priv->txq.queue, 0);
> +}
> +
> +static void
> +tx_isr(struct ioq_notifier *notifier)
> +{
> + struct vbus_enet_priv *priv;
> + unsigned long flags;
> +
> + priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
> +
> + PDEBUG("tx_isr for %lld\n", priv->vdev->id);
> +
> + ioq_notify_disable(priv->txq.queue, 0);
> + tasklet_schedule(&priv->txtask);
> +}
> +
> +static const struct net_device_ops vbus_enet_netdev_ops = {
> + .ndo_open = vbus_enet_open,
> + .ndo_stop = vbus_enet_stop,
> + .ndo_set_config = vbus_enet_config,
> + .ndo_start_xmit = vbus_enet_tx_start,
> + .ndo_change_mtu = vbus_enet_change_mtu,
> + .ndo_tx_timeout = vbus_enet_timeout,
add .ndo_validate_addr = eth_validate_addr?
multicast list?
> +};
> +
> +/*
> + * This is called whenever a new vbus_device_proxy is added to the vbus
> + * with the matching VENET_ID
> + */
> +static int
> +vbus_enet_probe(struct vbus_device_proxy *vdev)
> +{
> + struct net_device *dev;
> + struct vbus_enet_priv *priv;
> + int ret;
> +
> + printk(KERN_INFO "VBUS_ENET: Found new device at %lld\n", vdev->id);
> +
> + ret = vdev->ops->open(vdev, VENET_VERSION, 0);
> + if (ret < 0)
> + return ret;
> +
> + dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
> + if (!dev)
> + return -ENOMEM;
> +
> + priv = netdev_priv(dev);
> +
> + spin_lock_init(&priv->lock);
> + priv->dev = dev;
> + priv->vdev = vdev;
> +
> + tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
> +
> + queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
> + queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
> +
> + rx_setup(priv);
> +
> + ioq_notify_enable(priv->rxq.queue, 0); /* enable interrupts */
> + ioq_notify_enable(priv->txq.queue, 0);
> +
> + dev->netdev_ops = &vbus_enet_netdev_ops;
> + dev->watchdog_timeo = 5 * HZ;
> +
> + netif_napi_add(dev, &priv->napi, vbus_enet_poll, napi_weight);
> +
> + ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
> + if (ret < 0) {
> + printk(KERN_INFO "VENET: Error obtaining MAC address for " \
> + "%lld\n",
> + priv->vdev->id);
> + goto out_free;
> + }
> +
> + dev->features |= NETIF_F_HIGHDMA;
> +
> + ret = register_netdev(dev);
> + if (ret < 0) {
> + printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
> + ret, dev->name);
> + goto out_free;
> + }
> +
> + vdev->priv = priv;
> +
> + return 0;
> +
> + out_free:
> + free_netdev(dev);
> +
> + return ret;
> +}
> +
> +static int
> +vbus_enet_remove(struct vbus_device_proxy *vdev)
> +{
> + struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
> + struct vbus_device_proxy *dev = priv->vdev;
> +
> + unregister_netdev(priv->dev);
> + napi_disable(&priv->napi);
> +
> + rx_teardown(priv);
> + vbus_enet_tx_reap(priv, 1);
> +
> + ioq_put(priv->rxq.queue);
> + ioq_put(priv->txq.queue);
> +
> + dev->ops->close(dev, 0);
> +
> + free_netdev(priv->dev);
> +
> + return 0;
> +}
> +
> +/*
> + * Finally, the module stuff
> + */
> +
> +static struct vbus_driver_ops vbus_enet_driver_ops = {
> + .probe = vbus_enet_probe,
> + .remove = vbus_enet_remove,
> +};
> +
> +static struct vbus_driver vbus_enet_driver = {
> + .type = VENET_TYPE,
> + .owner = THIS_MODULE,
> + .ops = &vbus_enet_driver_ops,
> +};
> +
> +static __init int
> +vbus_enet_init_module(void)
> +{
> + printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
> + printk(KERN_DEBUG "VBUSENET: Using %d/%d queue depth\n",
> + rx_ringlen, tx_ringlen);
> + return vbus_driver_register(&vbus_enet_driver);
> +}
> +
> +static __exit void
> +vbus_enet_cleanup(void)
> +{
> + vbus_driver_unregister(&vbus_enet_driver);
> +}
> +
> +module_init(vbus_enet_init_module);
> +module_exit(vbus_enet_cleanup);
>
Avi,
Gregory Haskins wrote:
>
> Todo:
> *) Develop some kind of hypercall registration mechanism for KVM so that
> we can use that as an integration point instead of directly hooking
> kvm hypercalls
>
What would you like to see here? I now remember why I removed the
original patch I had for registration...it requires some kind of
discovery mechanism on its own. Note that this is hard, but I figured
it would make the overall series simpler if I didn't go this route and
instead just integrated with a statically allocated vector. That being
said, I have no problem adding this back in but figure we should discuss
the approach so I don't go down a rat-hole ;)
So, one thing we could do is use a string-identifier to discover
hypercall resources. In this model, we would have one additional
hypercall registered with kvm (in addition to the mmu-ops, etc) called
KVM_HC_DYNHC or something like that. The support for DYNHC could be
indicated in the cpuid (much like I do with the RESET, DYNIRQ, and VBUS
support today). When hypercall providers register, they could provide a
string such as "vbus", and they would be allocated a hypercall id.
Likewise, the HC_DYNHC interface would allow a guest to query the cpuid
for the DYNHC feature, and then query the HC_DYNHC vector for a
string-to-hc# translation. If the provider is not present, we return -1 for
the hc#, otherwise we return the one that was allocated.
I know how you feel about string-ids in general, but I am not quite sure
how to design this otherwise without it looking eerily similar to what I
already have (which is registering a new HC vector in kvm_para.h)
Thoughts?
-Greg
On Thu, Apr 09, 2009 at 09:37:10AM -0700, Stephen Hemminger wrote:
> > +static int tx_ringlen = 256;
> > +module_param(tx_ringlen, int, 0444);
> > +
> > +#undef PDEBUG /* undef it, just in case */
> > +#ifdef VBUS_ENET_DEBUG
> > +# define PDEBUG(fmt, args...) printk(KERN_DEBUG "vbus_enet: " fmt, ## args)
> > +#else
> > +# define PDEBUG(fmt, args...) /* not debugging: nothing */
> > +#endif
>
> Why reinvent pr_debug()?
Even more important, use dev_dbg() instead please; that uniquely
describes your device and driver together, which is what you need/want,
and it ties into the dynamic debug work, so you don't need a special
kernel config option.
thanks,
greg k-h
Gregory Haskins wrote:
> Avi,
>
> Gregory Haskins wrote:
>
>> Todo:
>> *) Develop some kind of hypercall registration mechanism for KVM so that
>> we can use that as an integration point instead of directly hooking
>> kvm hypercalls
>>
>>
>
> What would you like to see here? I now remember why I removed the
> original patch I had for registration...it requires some kind of
> discovery mechanism on its own. Note that this is hard, but I figured
> it would make the overall series simpler if I didn't go this route and
> instead just integrated with a statically allocated vector. That being
> said, I have no problem adding this back in but figure we should discuss
> the approach so I don't go down a rat-hole ;)
>
>
One idea is similar to signalfd() or eventfd(). Provide a kvm ioctl
that takes a gsi and returns an fd. Writes to the fd change the state
of the line, possibly triggering an interrupt. Another ioctl takes a
hypercall number or pio port as well as an existing fd. Invocations of
the hypercall or writes to the port write to the fd (using the same
protocol as eventfd), so the other end can respond.
The nice thing is that this can be used by both kernel and userspace
components, and for kernel components, hypercalls can be either buffered
or unbuffered.
> So, one thing we could do is use a string-identifier to discover
> hypercall resources. In this model, we would have one additional
> hypercall registered with kvm (in addition to the mmu-ops, etc) called
> KVM_HC_DYNHC or something like that. The support for DYNHC could be
> indicated in the cpuid (much like I do with the RESET, DYNIRQ, and VBUS
> support today). When hypercall providers register, they could provide a
> string such as "vbus", and they would be allocated a hypercall id.
> Likewise, the HC_DYNHC interface would allow a guest to query the cpuid
> for the DYNHC feature, and then query the HC_DYNHC vector for a
> string-to-hc# translation. If the provider is not present, we return -1 for
> the hc#, otherwise we return the one that was allocated.
>
> I know how you feel about string-ids in general, but I am not quite sure
> how to design this otherwise without it looking eerily similar to what I
> already have (which is registering a new HC vector in kvm_para.h)
>
No need for a string ID. Reserve a range of hypercall numbers for
dynamic IDs. Userspace allocates one and gives it to the device using
its configuration space (as applies to whatever bus it is using).
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Gregory Haskins wrote:
> We need a way to detect if a VM is reset later in the series, so lets
> add a capability for userspace to signal a VM reset down to the kernel.
>
As I mentioned, this won't be reliable. It needs to be driven from
userspace and be per-device.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Gregory Haskins wrote:
> This patch provides the ability to dynamically declare and map an
> interrupt-request handle to an x86 8-bit vector.
>
> Problem Statement: Emulated devices (such as PCI, ISA, etc) have
> interrupt routing done via standard PC mechanisms (MP-table, ACPI,
> etc). However, we also want to support a new class of devices
> which exist in a new virtualized namespace and therefore should
> not try to piggyback on these emulated mechanisms. Rather, we
> create a way to dynamically register interrupt resources that
> acts independently of the emulated counterpart.
>
> On x86, a simplistic view of the interrupt model is that each core
> has a local-APIC which can receive messages from APIC-compliant
> routing devices (such as IO-APIC and MSI) regarding details about
> an interrupt (such as which vector to raise). These routing devices
> are controlled by the OS so they may translate a physical event
> (such as "e1000: raise an RX interrupt") to a logical destination
> (such as "inject IDT vector 46 on core 3"). A dynirq is a virtual
> implementation of such a router (think of it as a virtual-MSI, but
> without the coupling to an existing standard, such as PCI).
>
> The model is simple: A guest OS can allocate the mapping of "IRQ"
> handle to "vector/core" in any way it sees fit, and provide this
> information to the dynirq module running in the host. The assigned
> IRQ then becomes the sole handle needed to inject an IDT vector
> to the guest from a host. A host entity that wishes to raise an
> interrupt simply needs to call kvm_inject_dynirq(irq) and the routing
> is performed transparently.
>
> +static int
> +_kvm_inject_dynirq(struct kvm *kvm, struct dynirq *entry)
> +{
> + struct kvm_vcpu *vcpu;
> + int ret;
> +
> + mutex_lock(&kvm->lock);
> +
> + vcpu = kvm->vcpus[entry->dest];
> + if (!vcpu) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + ret = kvm_apic_set_irq(vcpu, entry->vec, 1);
> +
> +out:
> + mutex_unlock(&kvm->lock);
> +
> + return ret;
> +}
> +
>
Given that you're using the apic to inject the IRQ, you'll need an EOI.
So what's the difference between dynirq and MSI, performance wise?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Avi Kivity wrote:
>>
>
> Given that you're using the apic to inject the IRQ, you'll need an
> EOI. So what's the difference between dynirq and MSI, performance wise?
>
I would have loved to eliminate that EOI completely, but that is a much
broader problem to solve ;)
Actually, dynirq wasn't introduced as a performance alternative to MSI.
Rather, I was trying to eliminate the complexity of needing to sync
between the userspace PCI emulation and the in-kernel models.
However, it's moot. Since v2 introduced virtio-vbus, I now have a much
clearer picture of what is needed here, and v3 will simply integrate
with MSI interrupts and drop dynirq completely. So please ignore this
patch.
-Greg
On Thu, Apr 09, 2009 at 12:30:57PM -0400, Gregory Haskins wrote:
> +static unsigned long
> +task_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
> + unsigned long n)
> +{
> + struct task_memctx *tm = to_task_memctx(ctx);
> + struct task_struct *p = tm->task;
> +
> + while (n) {
> + unsigned long offset = ((unsigned long)dst)%PAGE_SIZE;
> + unsigned long len = PAGE_SIZE - offset;
> + int ret;
> + struct page *pg;
> + void *maddr;
> +
> + if (len > n)
> + len = n;
> +
> + down_read(&p->mm->mmap_sem);
> + ret = get_user_pages(p, p->mm,
> + (unsigned long)dst, 1, 1, 0, &pg, NULL);
> +
> + if (ret != 1) {
> + up_read(&p->mm->mmap_sem);
> + break;
> + }
> +
> + maddr = kmap_atomic(pg, KM_USER0);
> + memcpy(maddr + offset, src, len);
> + kunmap_atomic(maddr, KM_USER0);
> + set_page_dirty_lock(pg);
> + put_page(pg);
> + up_read(&p->mm->mmap_sem);
> +
> + src += len;
> + dst += len;
> + n -= len;
> + }
> +
> + return n;
> +}
BTW, why did you decide to use get_user_pages?
Would switch_mm + copy_to_user work as well
avoiding page walk if all pages are present?
Also - if we just had vmexit because a process executed
io (or hypercall), can't we just do copy_to_user there?
Avi, I think at some point you said that we can?
--
MST
Michael S. Tsirkin wrote:
> On Thu, Apr 09, 2009 at 12:30:57PM -0400, Gregory Haskins wrote:
>
>> +static unsigned long
>> +task_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
>> + unsigned long n)
>> +{
>> + struct task_memctx *tm = to_task_memctx(ctx);
>> + struct task_struct *p = tm->task;
>> +
>> + while (n) {
>> + unsigned long offset = ((unsigned long)dst)%PAGE_SIZE;
>> + unsigned long len = PAGE_SIZE - offset;
>> + int ret;
>> + struct page *pg;
>> + void *maddr;
>> +
>> + if (len > n)
>> + len = n;
>> +
>> + down_read(&p->mm->mmap_sem);
>> + ret = get_user_pages(p, p->mm,
>> + (unsigned long)dst, 1, 1, 0, &pg, NULL);
>> +
>> + if (ret != 1) {
>> + up_read(&p->mm->mmap_sem);
>> + break;
>> + }
>> +
>> + maddr = kmap_atomic(pg, KM_USER0);
>> + memcpy(maddr + offset, src, len);
>> + kunmap_atomic(maddr, KM_USER0);
>> + set_page_dirty_lock(pg);
>> + put_page(pg);
>> + up_read(&p->mm->mmap_sem);
>> +
>> + src += len;
>> + dst += len;
>> + n -= len;
>> + }
>> +
>> + return n;
>> +}
>>
>
> BTW, why did you decide to use get_user_pages?
> Would switch_mm + copy_to_user work as well
> avoiding page walk if all pages are present?
>
Well, basic c_t_u() won't work because it's likely not "current" if you
are updating the ring from some other task, but I think you have already
figured that out based on the switch_mm suggestion. The simple truth is
I was not familiar with switch_mm at the time I wrote this (nor am I
now). If this is a superior method that allows you to acquire
c_t_u(some_other_ctx) like behavior, I see no problem in changing. I
will look into this, and thanks for the suggestion!
> Also - if we just had vmexit because a process executed
> io (or hypercall), can't we just do copy_to_user there?
> Avi, I think at some point you said that we can?
>
Right, and yes, that will work I believe. We could always do an "if (p ==
current)" check to test for this. To date, I don't typically do
anything mem-ops related directly in vcpu context so this wasn't an
issue...but that doesn't mean someone wont try in the future.
Therefore, I agree we should strive to optimize it if we can.
>
>
Thanks Michael,
-Greg
Michael S. Tsirkin wrote:
> Also - if we just had vmexit because a process executed
> io (or hypercall), can't we just do copy_to_user there?
> Avi, I think at some point you said that we can?
>
You can do copy_to_user() whereever it is legal in Linux. Almost all of
kvm runs in process context, preemptible, and with interrupts enabled.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Gregory Haskins wrote:
>> BTW, why did you decide to use get_user_pages?
>> Would switch_mm + copy_to_user work as well
>> avoiding page walk if all pages are present?
>>
>>
>
> Well, basic c_t_u() won't work because it's likely not "current" if you
> are updating the ring from some other task, but I think you have already
> figured that out based on the switch_mm suggestion. The simple truth is
> I was not familiar with switch_mm at the time I wrote this (nor am I
> now). If this is a superior method that allows you to acquire
> c_t_u(some_other_ctx) like behavior, I see no problem in changing. I
> will look into this, and thanks for the suggestion!
>
copy_to_user() is significantly faster than get_user_pages() + kmap() +
memcpy() (or their variants).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity wrote:
> Gregory Haskins wrote:
>
>
>>> BTW, why did you decide to use get_user_pages?
>>> Would switch_mm + copy_to_user work as well
>>> avoiding page walk if all pages are present?
>>>
>>
>> Well, basic c_t_u() won't work because it's likely not "current" if you
>> are updating the ring from some other task, but I think you have already
>> figured that out based on the switch_mm suggestion. The simple truth is
>> I was not familiar with switch_mm at the time I wrote this (nor am I
>> now). If this is a superior method that allows you to acquire
>> c_t_u(some_other_ctx) like behavior, I see no problem in changing. I
>> will look into this, and thanks for the suggestion!
>>
>
> copy_to_user() is significantly faster than get_user_pages() + kmap()
> + memcpy() (or their variants).
>
Oh, I don't doubt that (in fact, I was pretty sure that was the case
based on some of the optimizations I could see in studying the c_t_u()
path). I just didn't realize there were other ways to do it if it's a
non "current" task. ;)
I guess the enigma for me right now is what cost does switch_mm have?
(That's not a slam against the suggested approach...I really do not know
and am curious).
As an aside, note that we seem to be reviewing v2, where v3 is really
the last set I pushed. I think this patch is more or less the same
across both iterations, but FYI that I would recommend looking at v3
instead.
-Greg
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi,
>>
>> Gregory Haskins wrote:
>>
>>> Todo:
>>> *) Develop some kind of hypercall registration mechanism for KVM so
>>> that
>>> we can use that as an integration point instead of directly hooking
>>> kvm hypercalls
>>>
>>
>> What would you like to see here? I now remember why I removed the
>> original patch I had for registration...it requires some kind of
>> discovery mechanism on its own. Note that this is hard, but I figured
>> it would make the overall series simpler if I didn't go this route and
>> instead just integrated with a statically allocated vector. That being
>> said, I have no problem adding this back in but figure we should discuss
>> the approach so I don't go down a rat-hole ;)
>>
>>
>
>
> One idea is similar to signalfd() or eventfd(). Provide a kvm ioctl
> that takes a gsi and returns an fd. Writes to the fd change the state
> of the line, possibly triggering an interrupt. Another ioctl takes a
> hypercall number or pio port as well as an existing fd. Invocations
> of the hypercall or writes to the port write to the fd (using the same
> protocol as eventfd), so the other end can respond.
>
> The nice thing is that this can be used by both kernel and userspace
> components, and for kernel components, hypercalls can be either
> buffered or unbuffered.
And thus the "kvm-eventfd" (irqfd/iosignalfd) interface project was born. ;)
(Michael FYI: so I will be pushing a vbus-v4 series at some point in the
near future that is expressed in terms of irqfd/iosignalfd, per the
conversation above. The patches in v3 and earlier are more intrusive to
the KVM core than they will be in final form)
-Greg
Gregory Haskins wrote:
> Oh, I don't doubt that (in fact, I was pretty sure that was the case
> based on some of the optimizations I could see in studying the c_t_u()
> path). I just didn't realize there were other ways to do it if it's a
> non "current" task. ;)
>
> I guess the enigma for me right now is what cost does switch_mm have?
> (That's not a slam against the suggested approach...I really do not know
> and am curious).
>
switch_mm() is probably very cheap (reloads cr3), but it does dirty the
current cpu's tlb. When the kernel needs to flush a process' tlb, it
will have to IPI that cpu in addition to all others. This takes place,
for example, after munmap() or after a page is swapped out (though
significant batching is done there).
It's still plenty cheaper in my estimation.
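For concreteness, the mm-borrowing approach being weighed might look roughly like the following kernel-side sketch. This is pseudocode in the style of use_mm()/unuse_mm() from fs/aio.c, not compilable as-is: the function name is invented, and locking and error handling are simplified assumptions, not taken from the actual series.

```
/*
 * Hypothetical fast path: temporarily adopt the target task's mm so
 * plain copy_to_user() works, instead of get_user_pages() + kmap().
 */
static unsigned long
task_memctx_copy_to_fast(struct task_struct *p, void __user *dst,
			 const void *src, unsigned long n)
{
	struct mm_struct *mm = get_task_mm(p);
	unsigned long ret = n;

	if (!mm)
		return n;

	if (p == current) {
		/* vmexit path: we are already in the right mm */
		ret = copy_to_user(dst, src, n);
	} else {
		use_mm(mm);		/* switch_mm() under the covers */
		ret = copy_to_user(dst, src, n);
		unuse_mm(mm);
	}

	mmput(mm);
	return ret;		/* bytes NOT copied, as with c_t_u() */
}
```

The p == current test covers the common exit-driven case with no mm switch at all, while the use_mm() branch pays the cr3 reload plus the TLB-dirtying cost Avi describes above.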
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On Fri, 5 Jun 2009 04:19:17 am Gregory Haskins wrote:
> Avi Kivity wrote:
> > Gregory Haskins wrote:
> > One idea is similar to signalfd() or eventfd()
>
> And thus the "kvm-eventfd" (irqfd/iosignalfd) interface project was born.
> ;)
The lguest patch queue already has such an interface :) And I have a
partially complete in-kernel virtio_pci patch with the same trick.
I switched from "kernel created eventfd" to "userspace passes in eventfd"
after a while though; it lets you connect multiple virtqueues to a single fd
if you want.
Combined with a minor change to allow any process with access to the lguest fd
to queue interrupts, this allowed lguest to move to a thread-per-virtqueue
model which was a significant speedup as well as nice code reduction.
Here's the relevant kernel patch for reading.
Thanks!
Rusty.
lguest: use eventfds for device notification
Currently, when a Guest wants to perform I/O it calls LHCALL_NOTIFY with
an address: the main Launcher process returns with this address, and figures
out what device to run.
A far nicer model is to let processes bind an eventfd to an address: if we
find one, we simply signal the eventfd.
Signed-off-by: Rusty Russell <[email protected]>
Cc: Davide Libenzi <[email protected]>
---
drivers/lguest/Kconfig | 2 -
drivers/lguest/core.c | 8 ++--
drivers/lguest/lg.h | 9 ++++
drivers/lguest/lguest_user.c | 73 ++++++++++++++++++++++++++++++++++++++++
include/linux/lguest_launcher.h | 1
5 files changed, 89 insertions(+), 4 deletions(-)
diff --git a/drivers/lguest/Kconfig b/drivers/lguest/Kconfig
--- a/drivers/lguest/Kconfig
+++ b/drivers/lguest/Kconfig
@@ -1,6 +1,6 @@
config LGUEST
tristate "Linux hypervisor example code"
- depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX
+ depends on X86_32 && EXPERIMENTAL && !X86_PAE && EVENTFD
select HVC_DRIVER
---help---
This is a very simple module which allows you to run
diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -198,9 +198,11 @@ int run_guest(struct lg_cpu *cpu, unsign
/* It's possible the Guest did a NOTIFY hypercall to the
* Launcher, in which case we return from the read() now. */
if (cpu->pending_notify) {
- if (put_user(cpu->pending_notify, user))
- return -EFAULT;
- return sizeof(cpu->pending_notify);
+ if (!send_notify_to_eventfd(cpu)) {
+ if (put_user(cpu->pending_notify, user))
+ return -EFAULT;
+ return sizeof(cpu->pending_notify);
+ }
}
/* Check for signals */
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -82,6 +82,11 @@ struct lg_cpu {
struct lg_cpu_arch arch;
};
+struct lg_eventfds {
+ unsigned long addr;
+ struct file *event;
+};
+
/* The private info the thread maintains about the guest. */
struct lguest
{
@@ -102,6 +107,9 @@ struct lguest
unsigned int stack_pages;
u32 tsc_khz;
+ unsigned int num_eventfds;
+ struct lg_eventfds *eventfds;
+
/* Dead? */
const char *dead;
};
@@ -152,6 +160,7 @@ void setup_default_idt_entries(struct lg
void copy_traps(const struct lg_cpu *cpu, struct desc_struct *idt,
const unsigned long *def);
void guest_set_clockevent(struct lg_cpu *cpu, unsigned long delta);
+bool send_notify_to_eventfd(struct lg_cpu *cpu);
void init_clockdev(struct lg_cpu *cpu);
bool check_syscall_vector(struct lguest *lg);
int init_interrupts(void);
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -7,6 +7,8 @@
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/sched.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
#include "lg.h"
/*L:055 When something happens, the Waker process needs a way to stop the
@@ -35,6 +37,70 @@ static int break_guest_out(struct lg_cpu
}
}
+bool send_notify_to_eventfd(struct lg_cpu *cpu)
+{
+ unsigned int i;
+
+ /* lg->eventfds is RCU-protected */
+ preempt_disable();
+ for (i = 0; i < cpu->lg->num_eventfds; i++) {
+ if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
+ eventfd_signal(cpu->lg->eventfds[i].event, 1);
+ cpu->pending_notify = 0;
+ break;
+ }
+ }
+ preempt_enable();
+ return cpu->pending_notify == 0;
+}
+
+static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
+{
+ struct lg_eventfds *new, *old;
+
+ if (!addr)
+ return -EINVAL;
+
+ /* Replace the old array with the new one, carefully: others can
+ * be accessing it at the same time */
+ new = kmalloc(sizeof(*new) * (lg->num_eventfds + 1), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ memcpy(new, lg->eventfds, sizeof(*new) * lg->num_eventfds);
+ old = lg->eventfds;
+ lg->eventfds = new;
+ synchronize_rcu();
+ kfree(old);
+
+ lg->eventfds[lg->num_eventfds].addr = addr;
+ lg->eventfds[lg->num_eventfds].event = eventfd_fget(fd);
+ if (IS_ERR(lg->eventfds[lg->num_eventfds].event))
+ return PTR_ERR(lg->eventfds[lg->num_eventfds].event);
+
+ wmb();
+ lg->num_eventfds++;
+ return 0;
+}
+
+static int attach_eventfd(struct lguest *lg, const unsigned long __user *input)
+{
+ unsigned long addr, fd;
+ int err;
+
+ if (get_user(addr, input) != 0)
+ return -EFAULT;
+ input++;
+ if (get_user(fd, input) != 0)
+ return -EFAULT;
+
+ mutex_lock(&lguest_lock);
+ err = add_eventfd(lg, addr, fd);
+ mutex_unlock(&lguest_lock);
+
+	return err;
+}
+
/*L:050 Sending an interrupt is done by writing LHREQ_IRQ and an interrupt
* number to /dev/lguest. */
static int user_send_irq(struct lg_cpu *cpu, const unsigned long __user *input)
@@ -260,6 +326,8 @@ static ssize_t write(struct file *file,
return user_send_irq(cpu, input);
case LHREQ_BREAK:
return break_guest_out(cpu, input);
+ case LHREQ_EVENTFD:
+ return attach_eventfd(lg, input);
default:
return -EINVAL;
}
@@ -297,6 +365,11 @@ static int close(struct inode *inode, st
* the Launcher's memory management structure. */
mmput(lg->cpus[i].mm);
}
+
+ /* Release any eventfds they registered. */
+ for (i = 0; i < lg->num_eventfds; i++)
+ fput(lg->eventfds[i].event);
+
/* If lg->dead doesn't contain an error code it will be NULL or a
* kmalloc()ed string, either of which is ok to hand to kfree(). */
if (!IS_ERR(lg->dead))
diff --git a/include/linux/lguest_launcher.h b/include/linux/lguest_launcher.h
--- a/include/linux/lguest_launcher.h
+++ b/include/linux/lguest_launcher.h
@@ -58,6 +58,7 @@ enum lguest_req
LHREQ_GETDMA, /* No longer used */
LHREQ_IRQ, /* + irq */
LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
+ LHREQ_EVENTFD, /* + address, fd. */
};
/* The alignment to use between consumer and producer parts of vring.
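For reference, the Launcher side of the new request is just a three-word write() to /dev/lguest. A hedged sketch of building that request follows; the helper name is hypothetical and the enum values are an assumption taken from the enum order above:

```c
#include <assert.h>
#include <stddef.h>

/* Request codes in enum order from lguest_launcher.h above; the exact
 * numeric values are an assumption based on that order. */
enum lguest_req {
	LHREQ_INITIALIZE,
	LHREQ_GETDMA,	/* No longer used */
	LHREQ_IRQ,
	LHREQ_BREAK,
	LHREQ_EVENTFD,
};

/* Hypothetical helper: lay out the LHREQ_EVENTFD request the Launcher
 * would write() to /dev/lguest: request code, guest address, eventfd. */
static size_t build_eventfd_req(unsigned long *buf, unsigned long addr, int efd)
{
	buf[0] = LHREQ_EVENTFD;
	buf[1] = addr;
	buf[2] = (unsigned long)efd;
	return 3 * sizeof(unsigned long);
}
```

The Launcher would create efd with eventfd(2) beforehand and then do write(lguest_fd, buf, len) with the buffer built above.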
On Fri, Jun 05, 2009 at 02:25:01PM +0930, Rusty Russell wrote:
> On Fri, 5 Jun 2009 04:19:17 am Gregory Haskins wrote:
> > Avi Kivity wrote:
> > > Gregory Haskins wrote:
> > > One idea is similar to signalfd() or eventfd()
> >
> > And thus the "kvm-eventfd" (irqfd/iosignalfd) interface project was born.
> > ;)
>
> The lguest patch queue already has such an interface :) And I have a
> partially complete in-kernel virtio_pci patch with the same trick.
>
> I switched from "kernel created eventfd" to "userspace passes in eventfd"
> after a while though; it lets you connect multiple virtqueues to a single fd
> if you want.
>
> Combined with a minor change to allow any process with access to the lguest fd
> to queue interrupts, this allowed lguest to move to a thread-per-virtqueue
> model which was a significant speedup as well as nice code reduction.
>
> Here's the relevant kernel patch for reading.
>
> Thanks!
> Rusty.
>
> lguest: use eventfds for device notification
>
> Currently, when a Guest wants to perform I/O it calls LHCALL_NOTIFY with
> an address: the main Launcher process returns with this address, and figures
> out what device to run.
>
> A far nicer model is to let processes bind an eventfd to an address: if we
> find one, we simply signal the eventfd.
A couple of (probably misguided) RCU questions/suggestions interspersed.
> Signed-off-by: Rusty Russell <[email protected]>
> Cc: Davide Libenzi <[email protected]>
> ---
> drivers/lguest/Kconfig | 2 -
> drivers/lguest/core.c | 8 ++--
> drivers/lguest/lg.h | 9 ++++
> drivers/lguest/lguest_user.c | 73 ++++++++++++++++++++++++++++++++++++++++
> include/linux/lguest_launcher.h | 1
> 5 files changed, 89 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/lguest/Kconfig b/drivers/lguest/Kconfig
> --- a/drivers/lguest/Kconfig
> +++ b/drivers/lguest/Kconfig
> @@ -1,6 +1,6 @@
> config LGUEST
> tristate "Linux hypervisor example code"
> - depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX
> + depends on X86_32 && EXPERIMENTAL && !X86_PAE && EVENTFD
> select HVC_DRIVER
> ---help---
> This is a very simple module which allows you to run
> diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
> --- a/drivers/lguest/core.c
> +++ b/drivers/lguest/core.c
> @@ -198,9 +198,11 @@ int run_guest(struct lg_cpu *cpu, unsign
> /* It's possible the Guest did a NOTIFY hypercall to the
> * Launcher, in which case we return from the read() now. */
> if (cpu->pending_notify) {
> - if (put_user(cpu->pending_notify, user))
> - return -EFAULT;
> - return sizeof(cpu->pending_notify);
> + if (!send_notify_to_eventfd(cpu)) {
> + if (put_user(cpu->pending_notify, user))
> + return -EFAULT;
> + return sizeof(cpu->pending_notify);
> + }
> }
>
> /* Check for signals */
> diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
> --- a/drivers/lguest/lg.h
> +++ b/drivers/lguest/lg.h
> @@ -82,6 +82,11 @@ struct lg_cpu {
> struct lg_cpu_arch arch;
> };
>
> +struct lg_eventfds {
> + unsigned long addr;
> + struct file *event;
> +};
> +
> /* The private info the thread maintains about the guest. */
> struct lguest
> {
> @@ -102,6 +107,9 @@ struct lguest
> unsigned int stack_pages;
> u32 tsc_khz;
>
> + unsigned int num_eventfds;
> + struct lg_eventfds *eventfds;
> +
> /* Dead? */
> const char *dead;
> };
> @@ -152,6 +160,7 @@ void setup_default_idt_entries(struct lg
> void copy_traps(const struct lg_cpu *cpu, struct desc_struct *idt,
> const unsigned long *def);
> void guest_set_clockevent(struct lg_cpu *cpu, unsigned long delta);
> +bool send_notify_to_eventfd(struct lg_cpu *cpu);
> void init_clockdev(struct lg_cpu *cpu);
> bool check_syscall_vector(struct lguest *lg);
> int init_interrupts(void);
> diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
> --- a/drivers/lguest/lguest_user.c
> +++ b/drivers/lguest/lguest_user.c
> @@ -7,6 +7,8 @@
> #include <linux/miscdevice.h>
> #include <linux/fs.h>
> #include <linux/sched.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> #include "lg.h"
>
> /*L:055 When something happens, the Waker process needs a way to stop the
> @@ -35,6 +37,70 @@ static int break_guest_out(struct lg_cpu
> }
> }
>
> +bool send_notify_to_eventfd(struct lg_cpu *cpu)
> +{
> + unsigned int i;
> +
> + /* lg->eventfds is RCU-protected */
> + preempt_disable();
Suggest changing to rcu_read_lock() to match the synchronize_rcu().
> + for (i = 0; i < cpu->lg->num_eventfds; i++) {
> + if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
> + eventfd_signal(cpu->lg->eventfds[i].event, 1);
Shouldn't this be something like the following?
p = rcu_dereference(cpu->lg->eventfds);
if (p[i].addr == cpu->pending_notify) {
eventfd_signal(p[i].event, 1);
> + cpu->pending_notify = 0;
> + break;
> + }
> + }
> + preempt_enable();
And of course, rcu_read_unlock() here.
> + return cpu->pending_notify == 0;
> +}
> +
> +static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
> +{
> + struct lg_eventfds *new, *old;
> +
> + if (!addr)
> + return -EINVAL;
> +
> + /* Replace the old array with the new one, carefully: others can
> + * be accessing it at the same time */
> + new = kmalloc(sizeof(*new) * (lg->num_eventfds + 1), GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + memcpy(new, lg->eventfds, sizeof(*new) * lg->num_eventfds);
> + old = lg->eventfds;
> + lg->eventfds = new;
> + synchronize_rcu();
> + kfree(old);
> +
> + lg->eventfds[lg->num_eventfds].addr = addr;
> + lg->eventfds[lg->num_eventfds].event = eventfd_fget(fd);
> + if (IS_ERR(lg->eventfds[lg->num_eventfds].event))
> + return PTR_ERR(lg->eventfds[lg->num_eventfds].event);
> +
> + wmb();
> + lg->num_eventfds++;
Doesn't the synchronize_rcu() need to be synchronize_sched() to match the
preempt_disable() in send_notify_to_eventfd()? Or, alternatively, use
rcu_read_lock() instead of preempt_disable() in send_notify_to_eventfd().
This last is preferred.
Although you have the wmb() above, there is no ordering in
send_notify_to_eventfd(). Would the following work?
old = lg->eventfds;
lg->eventfds = new;
lg->eventfds[lg->num_eventfds].addr = addr;
lg->eventfds[lg->num_eventfds].event = eventfd_fget(fd);
if (IS_ERR(lg->eventfds[lg->num_eventfds].event))
return PTR_ERR(lg->eventfds[lg->num_eventfds].event);
synchronize_rcu();
kfree(old);
lg->num_eventfds++;
Here, synchronize_rcu() is doing two things:
1. ensuring that old readers who might be referencing "old" are
done before the kfree(), and
2. wait for the completion of all old readers who might (a) be
referencing the short "old" array and (b) be unaware of the
initialization of the new element.
Or do we also need to wait for anyone who might still be using the
old value of lg->num_eventfds? If so, the usual trick is to put
this value behind the same pointer that references the array, so
that any given rcu_dereference() is guaranteed to see matching
array and size.
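The "size behind the same pointer" trick can be sketched in plain, single-threaded userspace C; names here are illustrative, the kernel version would publish with rcu_assign_pointer() and kfree() the old copy only after synchronize_rcu(), and the real entries pair an address with a struct file *:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Keep the array length behind the same pointer as the array itself,
 * so a reader that dereferences the pointer always sees a matching
 * (num, contents) pair and no rmb()/wmb() pairing is needed. */
struct eventfd_map {
	unsigned int num;
	unsigned long addr[];	/* flexible array member */
};

static struct eventfd_map *map_append(const struct eventfd_map *old,
				      unsigned long addr)
{
	struct eventfd_map *new;

	new = malloc(sizeof(*new) + (old->num + 1) * sizeof(new->addr[0]));
	if (!new)
		return NULL;

	/* Identical copy first, then append, then the caller publishes:
	 * in the kernel, rcu_assign_pointer(lg->eventfds, new) followed
	 * by synchronize_rcu(); kfree(old). */
	memcpy(new->addr, old->addr, old->num * sizeof(old->addr[0]));
	new->addr[old->num] = addr;
	new->num = old->num + 1;
	return new;
}
```

Because readers only ever see either the complete old array or the complete new one, the single grace period covers both the kfree() and the visibility of the appended entry.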
> + return 0;
> +}
> +
> +static int attach_eventfd(struct lguest *lg, const unsigned long __user *input)
> +{
> + unsigned long addr, fd;
> + int err;
> +
> + if (get_user(addr, input) != 0)
> + return -EFAULT;
> + input++;
> + if (get_user(fd, input) != 0)
> + return -EFAULT;
> +
> + mutex_lock(&lguest_lock);
> + err = add_eventfd(lg, addr, fd);
> + mutex_unlock(&lguest_lock);
> +
> + return err;
> +}
> +
> /*L:050 Sending an interrupt is done by writing LHREQ_IRQ and an interrupt
> * number to /dev/lguest. */
> static int user_send_irq(struct lg_cpu *cpu, const unsigned long __user *input)
> @@ -260,6 +326,8 @@ static ssize_t write(struct file *file,
> return user_send_irq(cpu, input);
> case LHREQ_BREAK:
> return break_guest_out(cpu, input);
> + case LHREQ_EVENTFD:
> + return attach_eventfd(lg, input);
> default:
> return -EINVAL;
> }
> @@ -297,6 +365,11 @@ static int close(struct inode *inode, st
> * the Launcher's memory management structure. */
> mmput(lg->cpus[i].mm);
> }
> +
> + /* Release any eventfds they registered. */
> + for (i = 0; i < lg->num_eventfds; i++)
> + fput(lg->eventfds[i].event);
> +
> /* If lg->dead doesn't contain an error code it will be NULL or a
> * kmalloc()ed string, either of which is ok to hand to kfree(). */
> if (!IS_ERR(lg->dead))
> diff --git a/include/linux/lguest_launcher.h b/include/linux/lguest_launcher.h
> --- a/include/linux/lguest_launcher.h
> +++ b/include/linux/lguest_launcher.h
> @@ -58,6 +58,7 @@ enum lguest_req
> LHREQ_GETDMA, /* No longer used */
> LHREQ_IRQ, /* + irq */
> LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
> + LHREQ_EVENTFD, /* + address, fd. */
> };
>
> /* The alignment to use between consumer and producer parts of vring.
>
>
>
Hi Rusty,
Rusty Russell wrote:
> On Fri, 5 Jun 2009 04:19:17 am Gregory Haskins wrote:
>
>> Avi Kivity wrote:
>>
>>> Gregory Haskins wrote:
>>> One idea is similar to signalfd() or eventfd()
>>>
>> And thus the "kvm-eventfd" (irqfd/iosignalfd) interface project was born.
>> ;)
>>
>
> The lguest patch queue already has such an interface :)
Cool! Ultimately I think it will be easier if both lguest+kvm support
the same eventfd notion so this is good you are already moving in the
same direction.
> And I have a partially complete in-kernel virtio_pci patch with the same trick.
>
I thought lguest didn't use pci? Or do you just mean that you have an
in-kernel virtio-net for lguest?
As a follow up question, I wonder if we can easily port that to vbus so
that it will work in both lguest and kvm? (note to self: push a skeleton
example today)
> I switched from "kernel created eventfd" to "userspace passes in eventfd"
> after a while though; it lets you connect multiple virtqueues to a single fd
> if you want.
>
Yeah, actually we switched to that model, too. Aside from the
limitation you point out, there were some problems that Al Viro had
raised trying to do it in kernel w.r.t. fd abuse.
> Combined with a minor change to allow any process with access to the lguest fd
> to queue interrupts, this allowed lguest to move to a thread-per-virtqueue
> model which was a significant speedup as well as nice code reduction.
>
Yep, that was one of my findings on venet as well so I was looking
forward to trying to get virtio-net to do the same.
> Here's the relevant kernel patch for reading.
>
Thanks Rusty! Will take a look.
> Thanks!
> Rusty.
>
> lguest: use eventfds for device notification
>
> Currently, when a Guest wants to perform I/O it calls LHCALL_NOTIFY with
> an address: the main Launcher process returns with this address, and figures
> out what device to run.
>
> A far nicer model is to let processes bind an eventfd to an address: if we
> find one, we simply signal the eventfd.
>
> Signed-off-by: Rusty Russell <[email protected]>
> Cc: Davide Libenzi <[email protected]>
> ---
> drivers/lguest/Kconfig | 2 -
> drivers/lguest/core.c | 8 ++--
> drivers/lguest/lg.h | 9 ++++
> drivers/lguest/lguest_user.c | 73 ++++++++++++++++++++++++++++++++++++++++
> include/linux/lguest_launcher.h | 1
> 5 files changed, 89 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/lguest/Kconfig b/drivers/lguest/Kconfig
> --- a/drivers/lguest/Kconfig
> +++ b/drivers/lguest/Kconfig
> @@ -1,6 +1,6 @@
> config LGUEST
> tristate "Linux hypervisor example code"
> - depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX
> + depends on X86_32 && EXPERIMENTAL && !X86_PAE && EVENTFD
>
Note to self: we probably need a similar line in KVM now.
> select HVC_DRIVER
> ---help---
> This is a very simple module which allows you to run
> diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
> --- a/drivers/lguest/core.c
> +++ b/drivers/lguest/core.c
> @@ -198,9 +198,11 @@ int run_guest(struct lg_cpu *cpu, unsign
> /* It's possible the Guest did a NOTIFY hypercall to the
> * Launcher, in which case we return from the read() now. */
> if (cpu->pending_notify) {
> - if (put_user(cpu->pending_notify, user))
> - return -EFAULT;
> - return sizeof(cpu->pending_notify);
> + if (!send_notify_to_eventfd(cpu)) {
> + if (put_user(cpu->pending_notify, user))
> + return -EFAULT;
> + return sizeof(cpu->pending_notify);
> + }
> }
>
> /* Check for signals */
> diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
> --- a/drivers/lguest/lg.h
> +++ b/drivers/lguest/lg.h
> @@ -82,6 +82,11 @@ struct lg_cpu {
> struct lg_cpu_arch arch;
> };
>
> +struct lg_eventfds {
> + unsigned long addr;
> + struct file *event;
> +};
> +
> /* The private info the thread maintains about the guest. */
> struct lguest
> {
> @@ -102,6 +107,9 @@ struct lguest
> unsigned int stack_pages;
> u32 tsc_khz;
>
> + unsigned int num_eventfds;
> + struct lg_eventfds *eventfds;
> +
> /* Dead? */
> const char *dead;
> };
> @@ -152,6 +160,7 @@ void setup_default_idt_entries(struct lg
> void copy_traps(const struct lg_cpu *cpu, struct desc_struct *idt,
> const unsigned long *def);
> void guest_set_clockevent(struct lg_cpu *cpu, unsigned long delta);
> +bool send_notify_to_eventfd(struct lg_cpu *cpu);
> void init_clockdev(struct lg_cpu *cpu);
> bool check_syscall_vector(struct lguest *lg);
> int init_interrupts(void);
> diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
> --- a/drivers/lguest/lguest_user.c
> +++ b/drivers/lguest/lguest_user.c
> @@ -7,6 +7,8 @@
> #include <linux/miscdevice.h>
> #include <linux/fs.h>
> #include <linux/sched.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> #include "lg.h"
>
> /*L:055 When something happens, the Waker process needs a way to stop the
> @@ -35,6 +37,70 @@ static int break_guest_out(struct lg_cpu
> }
> }
>
> +bool send_notify_to_eventfd(struct lg_cpu *cpu)
> +{
> + unsigned int i;
> +
> + /* lg->eventfds is RCU-protected */
> + preempt_disable();
> + for (i = 0; i < cpu->lg->num_eventfds; i++) {
> + if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
> + eventfd_signal(cpu->lg->eventfds[i].event, 1);
> + cpu->pending_notify = 0;
> + break;
> + }
> + }
> + preempt_enable();
> + return cpu->pending_notify == 0;
> +}
> +
> +static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
> +{
> + struct lg_eventfds *new, *old;
> +
> + if (!addr)
> + return -EINVAL;
> +
> + /* Replace the old array with the new one, carefully: others can
> + * be accessing it at the same time */
> + new = kmalloc(sizeof(*new) * (lg->num_eventfds + 1), GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + memcpy(new, lg->eventfds, sizeof(*new) * lg->num_eventfds);
> + old = lg->eventfds;
> + lg->eventfds = new;
> + synchronize_rcu();
> + kfree(old);
> +
> + lg->eventfds[lg->num_eventfds].addr = addr;
> + lg->eventfds[lg->num_eventfds].event = eventfd_fget(fd);
> + if (IS_ERR(lg->eventfds[lg->num_eventfds].event))
> + return PTR_ERR(lg->eventfds[lg->num_eventfds].event);
> +
> + wmb();
> + lg->num_eventfds++;
> + return 0;
> +}
> +
> +static int attach_eventfd(struct lguest *lg, const unsigned long __user *input)
> +{
> + unsigned long addr, fd;
> + int err;
> +
> + if (get_user(addr, input) != 0)
> + return -EFAULT;
> + input++;
> + if (get_user(fd, input) != 0)
> + return -EFAULT;
> +
> + mutex_lock(&lguest_lock);
> + err = add_eventfd(lg, addr, fd);
> + mutex_unlock(&lguest_lock);
> +
> + return err;
> +}
> +
> /*L:050 Sending an interrupt is done by writing LHREQ_IRQ and an interrupt
> * number to /dev/lguest. */
> static int user_send_irq(struct lg_cpu *cpu, const unsigned long __user *input)
> @@ -260,6 +326,8 @@ static ssize_t write(struct file *file,
> return user_send_irq(cpu, input);
> case LHREQ_BREAK:
> return break_guest_out(cpu, input);
> + case LHREQ_EVENTFD:
> + return attach_eventfd(lg, input);
> default:
> return -EINVAL;
> }
> @@ -297,6 +365,11 @@ static int close(struct inode *inode, st
> * the Launcher's memory management structure. */
> mmput(lg->cpus[i].mm);
> }
> +
> + /* Release any eventfds they registered. */
> + for (i = 0; i < lg->num_eventfds; i++)
> + fput(lg->eventfds[i].event);
> +
> /* If lg->dead doesn't contain an error code it will be NULL or a
> * kmalloc()ed string, either of which is ok to hand to kfree(). */
> if (!IS_ERR(lg->dead))
> diff --git a/include/linux/lguest_launcher.h b/include/linux/lguest_launcher.h
> --- a/include/linux/lguest_launcher.h
> +++ b/include/linux/lguest_launcher.h
> @@ -58,6 +58,7 @@ enum lguest_req
> LHREQ_GETDMA, /* No longer used */
> LHREQ_IRQ, /* + irq */
> LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
> + LHREQ_EVENTFD, /* + address, fd. */
> };
>
> /* The alignment to use between consumer and producer parts of vring.
>
>
>
>
Other than the potential rcu issues that Paul already addressed, looks
good. FWIW: this looks like what we are calling "iosignalfd" in KVM
land (unless I am misunderstanding). Do you have the equivalent of
"irqfd" going the other way?
Thanks Rusty,
-Greg
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> @@ -1,6 +1,6 @@
>>> config LGUEST
>>> tristate "Linux hypervisor example code"
>>> - depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX
>>> + depends on X86_32 && EXPERIMENTAL && !X86_PAE && EVENTFD
>>>
>>
>> Note to self: we probably need a similar line in KVM now.
>>
>>
>
> 'select EVENTFD' is more appropriate.
>
>
Yeah, I was thinking the same...
Gregory Haskins wrote:
>> @@ -1,6 +1,6 @@
>> config LGUEST
>> tristate "Linux hypervisor example code"
>> - depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX
>> + depends on X86_32 && EXPERIMENTAL && !X86_PAE && EVENTFD
>>
>>
>
> Note to self: we probably need a similar line in KVM now.
>
>
'select EVENTFD' is more appropriate.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On Fri, 5 Jun 2009 09:26:48 pm Gregory Haskins wrote:
> Hi Rusty,
>
> Rusty Russell wrote:
> > On Fri, 5 Jun 2009 04:19:17 am Gregory Haskins wrote:
> >> Avi Kivity wrote:
> >>> Gregory Haskins wrote:
> >>> One idea is similar to signalfd() or eventfd()
> >>
> >> And thus the "kvm-eventfd" (irqfd/iosignalfd) interface project was
> >> born. ;)
> >
> > The lguest patch queue already has such an interface :)
>
> Cool! Ultimately I think it will be easier if both lguest+kvm support
> the same eventfd notion so this is good you are already moving in the
> same direction.
Not really; lguest doesn't do PCI.
> > And I have a partially complete in-kernel virtio_pci patch with the same
> > trick.
>
> I thought lguest didn't use pci? Or do you just mean that you have an
> in-kernel virtio-net for lguest?
No, this was for kvm. Sorry for the confusion.
> Other than the potential rcu issues that Paul already addressed, looks
> good. FWIW: this looks like what we are calling "iosignalfd" in KVM
> land (unless I am misunderstanding). Do you have the equivalent of
> "irqfd" going the other way?
Yes; lguest uses write() (offset indicates cpu #) rather than ioctls, but
anyone can do the LHREQ_IRQ write to queue an interrupt for delivery.
So the threads just get the same /dev/lguest fd and it's simple.
Thanks!
Rusty.
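The eventfd primitive that both directions rely on can be exercised entirely from userspace. A minimal sketch (Linux-only, not lguest-specific) of the signal/consume round trip: the write() below stands in for what eventfd_signal() does on the kernel side, and the read() is what a blocked virtqueue thread would observe:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Demonstrate the notify/wake round trip on a single eventfd. */
int demo_eventfd_roundtrip(void)
{
	uint64_t val;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;

	/* "Kernel" side: bump the counter by 1 (eventfd_signal(ctx, 1)). */
	val = 1;
	if (write(efd, &val, sizeof(val)) != sizeof(val))
		return -1;

	/* Virtqueue-thread side: read() blocks until the counter is
	 * nonzero, then returns and resets it. */
	if (read(efd, &val, sizeof(val)) != sizeof(val))
		return -1;

	close(efd);
	return (int)val;
}
```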
Rusty Russell wrote:
> On Fri, 5 Jun 2009 09:26:48 pm Gregory Haskins wrote:
>
>> Hi Rusty,
>>
>> Rusty Russell wrote:
>>
>>> On Fri, 5 Jun 2009 04:19:17 am Gregory Haskins wrote:
>>>
>>>> Avi Kivity wrote:
>>>>
>>>>> Gregory Haskins wrote:
>>>>> One idea is similar to signalfd() or eventfd()
>>>>>
>>>> And thus the "kvm-eventfd" (irqfd/iosignalfd) interface project was
>>>> born. ;)
>>>>
>>> The lguest patch queue already has such an interface :)
>>>
>> Cool! Ultimately I think it will be easier if both lguest+kvm support
>> the same eventfd notion so this is good you are already moving in the
>> same direction.
>>
>
> Not really; lguest doesn't do PCI.
>
That's ok. I see these eventfd interfaces as somewhat orthogonal to
PCI. I.e. if both lguest and kvm have an eventfd mechanism for signaling
in both directions (e.g. interrupts and io), it would make it easier to
support the kind of thing I am striving for with a unified backend.
That is: one in-kernel virtio-net that works in both (or even many) HV
environments. I see that as a higher layer abstraction than PCI, per se.
>
>>> And I have a partially complete in-kernel virtio_pci patch with the same
>>> trick.
>>>
>> I thought lguest didn't use pci? Or do you just mean that you have an
>> in-kernel virtio-net for lguest?
>>
>
> No, this was for kvm. Sorry for the confusion.
>
Ah, sorry. Well, if it's in any kind of shape to see the light of day,
please forward it over. Perhaps Michael and I can craft it into a
working solution.
>
>> Other than the potential rcu issues that Paul already addressed, looks
>> good. FWIW: this looks like what we are calling "iosignalfd" in KVM
>> land (unless I am misunderstanding). Do you have the equivalent of
>> "irqfd" going the other way?
>>
>
> Yes; lguest uses write() (offset indicates cpu #) rather than ioctls, but
> anyone can do the LHREQ_IRQ write to queue an interrupt for delivery.
>
> So the threads just get the same /dev/lguest fd and it's simple.
>
Ah, ok. That's workable, too. (This kind of detail would be buried in
the "lguest connector" for vbus anyway, so it doesn't have to have a
uniform "eventfd_signal()" interface to work. The fd concept alone is
sufficiently flexible).
Thanks Rusty,
-Greg
On Fri, 5 Jun 2009 03:00:10 pm Paul E. McKenney wrote:
> On Fri, Jun 05, 2009 at 02:25:01PM +0930, Rusty Russell wrote:
> > + /* lg->eventfds is RCU-protected */
> > + preempt_disable();
>
> Suggest changing to rcu_read_lock() to match the synchronize_rcu().
Ah yes, much better. As I was implementing it I warred with myself since
lguest aims for simplicity above all else. But since we only ever add things
to the array, RCU probably is simpler.
> > + for (i = 0; i < cpu->lg->num_eventfds; i++) {
> > + if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
> > + eventfd_signal(cpu->lg->eventfds[i].event, 1);
>
> Shouldn't this be something like the following?
>
> p = rcu_dereference(cpu->lg->eventfds);
> if (p[i].addr == cpu->pending_notify) {
> eventfd_signal(p[i].event, 1);
Hmm, need to read num_eventfds first, too. It doesn't matter if we get the old
->num_eventfds and the new ->eventfds, but the other way around would be bad.
Here's the inter-diff:
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -39,18 +39,24 @@ static int break_guest_out(struct lg_cpu
bool send_notify_to_eventfd(struct lg_cpu *cpu)
{
- unsigned int i;
+ unsigned int i, num;
+ struct lg_eventfds *eventfds;
+
+ /* Make sure we grab the total number before accessing the array. */
+ num = cpu->lg->num_eventfds;
+ rmb();
/* lg->eventfds is RCU-protected */
rcu_read_lock();
- for (i = 0; i < cpu->lg->num_eventfds; i++) {
- if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
- eventfd_signal(cpu->lg->eventfds[i].event, 1);
+ eventfds = rcu_dereference(cpu->lg->eventfds);
+ for (i = 0; i < num; i++) {
+ if (eventfds[i].addr == cpu->pending_notify) {
+ eventfd_signal(eventfds[i].event, 1);
cpu->pending_notify = 0;
break;
}
}
- preempt_enable();
+ rcu_read_unlock();
return cpu->pending_notify == 0;
}
Thanks!
Rusty.
On Sat, Jun 06, 2009 at 12:25:57AM +0930, Rusty Russell wrote:
> On Fri, 5 Jun 2009 03:00:10 pm Paul E. McKenney wrote:
> > On Fri, Jun 05, 2009 at 02:25:01PM +0930, Rusty Russell wrote:
> > > + /* lg->eventfds is RCU-protected */
> > > + preempt_disable();
> >
> > Suggest changing to rcu_read_lock() to match the synchronize_rcu().
>
> Ah yes, much better. As I was implementing it I warred with myself since
> lguest aims for simplicity above all else. But since we only ever add things
> to the array, RCU probably is simpler.
;-)
> > > + for (i = 0; i < cpu->lg->num_eventfds; i++) {
> > > + if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
> > > + eventfd_signal(cpu->lg->eventfds[i].event, 1);
> >
> > Shouldn't this be something like the following?
> >
> > p = rcu_dereference(cpu->lg->eventfds);
> > if (p[i].addr == cpu->pending_notify) {
> > eventfd_signal(p[i].event, 1);
>
> Hmm, need to read num_eventfds first, too. It doesn't matter if we get the old
> ->num_eventfds and the new ->eventfds, but the other way around would be bad.
Yep!!! ;-)
> Here's the inter-diff:
>
> diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
> --- a/drivers/lguest/lguest_user.c
> +++ b/drivers/lguest/lguest_user.c
> @@ -39,18 +39,24 @@ static int break_guest_out(struct lg_cpu
>
> bool send_notify_to_eventfd(struct lg_cpu *cpu)
> {
> - unsigned int i;
> + unsigned int i, num;
> + struct lg_eventfds *eventfds;
> +
> + /* Make sure we grab the total number before accessing the array. */
> + num = cpu->lg->num_eventfds;
> + rmb();
>
> /* lg->eventfds is RCU-protected */
> rcu_read_lock();
> - for (i = 0; i < cpu->lg->num_eventfds; i++) {
> - if (cpu->lg->eventfds[i].addr == cpu->pending_notify) {
> - eventfd_signal(cpu->lg->eventfds[i].event, 1);
> + eventfds = rcu_dereference(cpu->lg->eventfds);
> + for (i = 0; i < num; i++) {
> + if (eventfds[i].addr == cpu->pending_notify) {
> + eventfd_signal(eventfds[i].event, 1);
> cpu->pending_notify = 0;
> break;
> }
> }
> - preempt_enable();
> + rcu_read_unlock();
> return cpu->pending_notify == 0;
> }
It is possible to get rid of the rmb() and wmb() as well, doing
something like the following:
struct lg_eventfds_num {
unsigned int n;
struct lg_eventfds a[0];
}
Then the rcu_dereference() gets you a pointer to a struct lg_eventfds_num,
which has the array and its length in guaranteed synchronization without
the need for barriers.
Does this work for you, or is there some complication that I am missing?
Thanx, Paul
On Sat, 6 Jun 2009 01:55:53 am Paul E. McKenney wrote:
> It is possible to get rid of the rmb() and wmb() as well, doing
> something like the following:
>
> struct lg_eventfds_num {
> unsigned int n;
> struct lg_eventfds a[0];
> }
>
> Then the rcu_dereference() gets you a pointer to a struct lg_eventfds_num,
> which has the array and its length in guaranteed synchronization without
> the need for barriers.
Yep, that's actually quite nice. The only wart is that it needs to be
allocated even when n == 0, but IMHO worth it for barrier avoidance.
This is what I ended up with:
lguest: use eventfds for device notification
Currently, when a Guest wants to perform I/O it calls LHCALL_NOTIFY with
an address: the main Launcher process returns with this address, and figures
out what device to run.
A far nicer model is to let processes bind an eventfd to an address: if we
find one, we simply signal the eventfd.
Signed-off-by: Rusty Russell <[email protected]>
Cc: Davide Libenzi <[email protected]>
---
drivers/lguest/Kconfig | 2
drivers/lguest/core.c | 8 ++-
drivers/lguest/lg.h | 13 +++++
drivers/lguest/lguest_user.c | 98 +++++++++++++++++++++++++++++++++++++++-
include/linux/lguest_launcher.h | 1
5 files changed, 116 insertions(+), 6 deletions(-)
diff --git a/drivers/lguest/Kconfig b/drivers/lguest/Kconfig
--- a/drivers/lguest/Kconfig
+++ b/drivers/lguest/Kconfig
@@ -1,6 +1,6 @@
config LGUEST
tristate "Linux hypervisor example code"
- depends on X86_32 && EXPERIMENTAL && FUTEX
+ depends on X86_32 && EXPERIMENTAL && EVENTFD
select HVC_DRIVER
---help---
This is a very simple module which allows you to run
diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -198,9 +198,11 @@ int run_guest(struct lg_cpu *cpu, unsign
/* It's possible the Guest did a NOTIFY hypercall to the
* Launcher, in which case we return from the read() now. */
if (cpu->pending_notify) {
- if (put_user(cpu->pending_notify, user))
- return -EFAULT;
- return sizeof(cpu->pending_notify);
+ if (!send_notify_to_eventfd(cpu)) {
+ if (put_user(cpu->pending_notify, user))
+ return -EFAULT;
+ return sizeof(cpu->pending_notify);
+ }
}
/* Check for signals */
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -82,6 +82,16 @@ struct lg_cpu {
struct lg_cpu_arch arch;
};
+struct lg_eventfd {
+ unsigned long addr;
+ struct file *event;
+};
+
+struct lg_eventfd_map {
+ unsigned int num;
+ struct lg_eventfd map[];
+};
+
/* The private info the thread maintains about the guest. */
struct lguest
{
@@ -102,6 +112,8 @@ struct lguest
unsigned int stack_pages;
u32 tsc_khz;
+ struct lg_eventfd_map *eventfds;
+
/* Dead? */
const char *dead;
};
@@ -154,6 +166,7 @@ void setup_default_idt_entries(struct lg
void copy_traps(const struct lg_cpu *cpu, struct desc_struct *idt,
const unsigned long *def);
void guest_set_clockevent(struct lg_cpu *cpu, unsigned long delta);
+bool send_notify_to_eventfd(struct lg_cpu *cpu);
void init_clockdev(struct lg_cpu *cpu);
bool check_syscall_vector(struct lguest *lg);
int init_interrupts(void);
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -7,6 +7,8 @@
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/sched.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
#include "lg.h"
/*L:055 When something happens, the Waker process needs a way to stop the
@@ -35,6 +37,81 @@ static int break_guest_out(struct lg_cpu
}
}
+bool send_notify_to_eventfd(struct lg_cpu *cpu)
+{
+ unsigned int i;
+ struct lg_eventfd_map *map;
+
+ /* lg->eventfds is RCU-protected */
+ rcu_read_lock();
+ map = rcu_dereference(cpu->lg->eventfds);
+ for (i = 0; i < map->num; i++) {
+ if (map->map[i].addr == cpu->pending_notify) {
+ eventfd_signal(map->map[i].event, 1);
+ cpu->pending_notify = 0;
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return cpu->pending_notify == 0;
+}
+
+static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
+{
+ struct lg_eventfd_map *new, *old = lg->eventfds;
+
+ if (!addr)
+ return -EINVAL;
+
+ /* Replace the old array with the new one, carefully: others can
+ * be accessing it at the same time */
+ new = kmalloc(sizeof(*new) + sizeof(new->map[0]) * (old->num + 1),
+ GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+
+ /* First make identical copy. */
+ memcpy(new->map, old->map, sizeof(old->map[0]) * old->num);
+ new->num = old->num;
+
+ /* Now append new entry. */
+ new->map[new->num].addr = addr;
+ new->map[new->num].event = eventfd_fget(fd);
+ if (IS_ERR(new->map[new->num].event)) {
+ kfree(new);
+ return PTR_ERR(new->map[new->num].event);
+ }
+ new->num++;
+
+ /* Now put new one in place. */
+ rcu_assign_pointer(lg->eventfds, new);
+
+ /* We're not in a big hurry. Wait until no one's looking at the old
+ * version, then delete it. */
+ synchronize_rcu();
+ kfree(old);
+
+ return 0;
+}
+
+static int attach_eventfd(struct lguest *lg, const unsigned long __user *input)
+{
+ unsigned long addr, fd;
+ int err;
+
+ if (get_user(addr, input) != 0)
+ return -EFAULT;
+ input++;
+ if (get_user(fd, input) != 0)
+ return -EFAULT;
+
+ mutex_lock(&lguest_lock);
+ err = add_eventfd(lg, addr, fd);
+ mutex_unlock(&lguest_lock);
+
+ return err;
+}
+
/*L:050 Sending an interrupt is done by writing LHREQ_IRQ and an interrupt
* number to /dev/lguest. */
static int user_send_irq(struct lg_cpu *cpu, const unsigned long __user *input)
@@ -184,6 +261,13 @@ static int initialize(struct file *file,
goto unlock;
}
+ lg->eventfds = kmalloc(sizeof(*lg->eventfds), GFP_KERNEL);
+ if (!lg->eventfds) {
+ err = -ENOMEM;
+ goto free_lg;
+ }
+ lg->eventfds->num = 0;
+
/* Populate the easy fields of our "struct lguest" */
lg->mem_base = (void __user *)args[0];
lg->pfn_limit = args[1];
@@ -191,7 +275,7 @@ static int initialize(struct file *file,
/* This is the first cpu (cpu 0) and it will start booting at args[2] */
err = lg_cpu_start(&lg->cpus[0], 0, args[2]);
if (err)
- goto release_guest;
+ goto free_eventfds;
/* Initialize the Guest's shadow page tables, using the toplevel
* address the Launcher gave us. This allocates memory, so can fail. */
@@ -210,7 +294,9 @@ static int initialize(struct file *file,
free_regs:
/* FIXME: This should be in free_vcpu */
free_page(lg->cpus[0].regs_page);
-release_guest:
+free_eventfds:
+ kfree(lg->eventfds);
+free_lg:
kfree(lg);
unlock:
mutex_unlock(&lguest_lock);
@@ -260,6 +346,8 @@ static ssize_t write(struct file *file,
return user_send_irq(cpu, input);
case LHREQ_BREAK:
return break_guest_out(cpu, input);
+ case LHREQ_EVENTFD:
+ return attach_eventfd(lg, input);
default:
return -EINVAL;
}
@@ -297,6 +385,12 @@ static int close(struct inode *inode, st
* the Launcher's memory management structure. */
mmput(lg->cpus[i].mm);
}
+
+ /* Release any eventfds they registered. */
+ for (i = 0; i < lg->eventfds->num; i++)
+ fput(lg->eventfds->map[i].event);
+ kfree(lg->eventfds);
+
/* If lg->dead doesn't contain an error code it will be NULL or a
* kmalloc()ed string, either of which is ok to hand to kfree(). */
if (!IS_ERR(lg->dead))
diff --git a/include/linux/lguest_launcher.h b/include/linux/lguest_launcher.h
--- a/include/linux/lguest_launcher.h
+++ b/include/linux/lguest_launcher.h
@@ -58,6 +58,7 @@ enum lguest_req
LHREQ_GETDMA, /* No longer used */
LHREQ_IRQ, /* + irq */
LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
+ LHREQ_EVENTFD, /* + address, fd. */
};
/* The alignment to use between consumer and producer parts of vring.
On Thu, Jun 11, 2009 at 10:51:20PM +0930, Rusty Russell wrote:
> On Sat, 6 Jun 2009 01:55:53 am Paul E. McKenney wrote:
> > It is possible to get rid of the rmb() and wmb() as well, doing
> > something like the following:
> >
> > struct lg_eventfds_num {
> > unsigned int n;
> > struct lg_eventfds a[0];
> > }
> >
> > Then the rcu_dereference() gets you a pointer to a struct lg_eventfds_num,
> > which has the array and its length in guaranteed synchronization without
> > the need for barriers.
>
> Yep, that's actually quite nice. The only wart is that it needs to be
> allocated even when n == 0, but IMHO worth it for barrier avoidance.
Well, I suppose that you -could- statically allocate one in struct
lguest, but it is not clear to me that this cure would be better than
the always-allocate disease in this case. But either way, you would
be allocating an instance, so your statement above is correct. ;-)
> This is what I ended up with:
>
> lguest: use eventfds for device notification
>
> Currently, when a Guest wants to perform I/O it calls LHCALL_NOTIFY with
> an address: the main Launcher process returns with this address, and figures
> out what device to run.
>
> A far nicer model is to let processes bind an eventfd to an address: if we
> find one, we simply signal the eventfd.
Looks very good to me from an RCU viewpoint!!!
Reviewed-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Rusty Russell <[email protected]>
> Cc: Davide Libenzi <[email protected]>
> ---
> drivers/lguest/Kconfig | 2
> drivers/lguest/core.c | 8 ++-
> drivers/lguest/lg.h | 13 +++++
> drivers/lguest/lguest_user.c | 98 +++++++++++++++++++++++++++++++++++++++-
> include/linux/lguest_launcher.h | 1
> 5 files changed, 116 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/lguest/Kconfig b/drivers/lguest/Kconfig
> --- a/drivers/lguest/Kconfig
> +++ b/drivers/lguest/Kconfig
> @@ -1,6 +1,6 @@
> config LGUEST
> tristate "Linux hypervisor example code"
> - depends on X86_32 && EXPERIMENTAL && FUTEX
> + depends on X86_32 && EXPERIMENTAL && EVENTFD
> select HVC_DRIVER
> ---help---
> This is a very simple module which allows you to run
> diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
> --- a/drivers/lguest/core.c
> +++ b/drivers/lguest/core.c
> @@ -198,9 +198,11 @@ int run_guest(struct lg_cpu *cpu, unsign
> /* It's possible the Guest did a NOTIFY hypercall to the
> * Launcher, in which case we return from the read() now. */
> if (cpu->pending_notify) {
> - if (put_user(cpu->pending_notify, user))
> - return -EFAULT;
> - return sizeof(cpu->pending_notify);
> + if (!send_notify_to_eventfd(cpu)) {
> + if (put_user(cpu->pending_notify, user))
> + return -EFAULT;
> + return sizeof(cpu->pending_notify);
> + }
> }
>
> /* Check for signals */
> diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
> --- a/drivers/lguest/lg.h
> +++ b/drivers/lguest/lg.h
> @@ -82,6 +82,16 @@ struct lg_cpu {
> struct lg_cpu_arch arch;
> };
>
> +struct lg_eventfd {
> + unsigned long addr;
> + struct file *event;
> +};
> +
> +struct lg_eventfd_map {
> + unsigned int num;
> + struct lg_eventfd map[];
> +};
> +
> /* The private info the thread maintains about the guest. */
> struct lguest
> {
> @@ -102,6 +112,8 @@ struct lguest
> unsigned int stack_pages;
> u32 tsc_khz;
>
> + struct lg_eventfd_map *eventfds;
> +
> /* Dead? */
> const char *dead;
> };
> @@ -154,6 +166,7 @@ void setup_default_idt_entries(struct lg
> void copy_traps(const struct lg_cpu *cpu, struct desc_struct *idt,
> const unsigned long *def);
> void guest_set_clockevent(struct lg_cpu *cpu, unsigned long delta);
> +bool send_notify_to_eventfd(struct lg_cpu *cpu);
> void init_clockdev(struct lg_cpu *cpu);
> bool check_syscall_vector(struct lguest *lg);
> int init_interrupts(void);
> diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
> --- a/drivers/lguest/lguest_user.c
> +++ b/drivers/lguest/lguest_user.c
> @@ -7,6 +7,8 @@
> #include <linux/miscdevice.h>
> #include <linux/fs.h>
> #include <linux/sched.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> #include "lg.h"
>
> /*L:055 When something happens, the Waker process needs a way to stop the
> @@ -35,6 +37,81 @@ static int break_guest_out(struct lg_cpu
> }
> }
>
> +bool send_notify_to_eventfd(struct lg_cpu *cpu)
> +{
> + unsigned int i;
> + struct lg_eventfd_map *map;
> +
> + /* lg->eventfds is RCU-protected */
> + rcu_read_lock();
> + map = rcu_dereference(cpu->lg->eventfds);
> + for (i = 0; i < map->num; i++) {
> + if (map->map[i].addr == cpu->pending_notify) {
> + eventfd_signal(map->map[i].event, 1);
> + cpu->pending_notify = 0;
> + break;
> + }
> + }
> + rcu_read_unlock();
> + return cpu->pending_notify == 0;
> +}
> +
> +static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
> +{
> + struct lg_eventfd_map *new, *old = lg->eventfds;
> +
> + if (!addr)
> + return -EINVAL;
> +
> + /* Replace the old array with the new one, carefully: others can
> + * be accessing it at the same time */
> + new = kmalloc(sizeof(*new) + sizeof(new->map[0]) * (old->num + 1),
> + GFP_KERNEL);
> + if (!new)
> + return -ENOMEM;
> +
> + /* First make identical copy. */
> + memcpy(new->map, old->map, sizeof(old->map[0]) * old->num);
> + new->num = old->num;
> +
> + /* Now append new entry. */
> + new->map[new->num].addr = addr;
> + new->map[new->num].event = eventfd_fget(fd);
> + if (IS_ERR(new->map[new->num].event)) {
> + kfree(new);
> + return PTR_ERR(new->map[new->num].event);
> + }
> + new->num++;
> +
> + /* Now put new one in place. */
> + rcu_assign_pointer(lg->eventfds, new);
> +
> + /* We're not in a big hurry. Wait until no one's looking at the old
> + * version, then delete it. */
> + synchronize_rcu();
> + kfree(old);
> +
> + return 0;
> +}
> +
> +static int attach_eventfd(struct lguest *lg, const unsigned long __user *input)
> +{
> + unsigned long addr, fd;
> + int err;
> +
> + if (get_user(addr, input) != 0)
> + return -EFAULT;
> + input++;
> + if (get_user(fd, input) != 0)
> + return -EFAULT;
> +
> + mutex_lock(&lguest_lock);
> + err = add_eventfd(lg, addr, fd);
> + mutex_unlock(&lguest_lock);
> +
> + return err;
> +}
> +
> /*L:050 Sending an interrupt is done by writing LHREQ_IRQ and an interrupt
> * number to /dev/lguest. */
> static int user_send_irq(struct lg_cpu *cpu, const unsigned long __user *input)
> @@ -184,6 +261,13 @@ static int initialize(struct file *file,
> goto unlock;
> }
>
> + lg->eventfds = kmalloc(sizeof(*lg->eventfds), GFP_KERNEL);
> + if (!lg->eventfds) {
> + err = -ENOMEM;
> + goto free_lg;
> + }
> + lg->eventfds->num = 0;
> +
> /* Populate the easy fields of our "struct lguest" */
> lg->mem_base = (void __user *)args[0];
> lg->pfn_limit = args[1];
> @@ -191,7 +275,7 @@ static int initialize(struct file *file,
> /* This is the first cpu (cpu 0) and it will start booting at args[2] */
> err = lg_cpu_start(&lg->cpus[0], 0, args[2]);
> if (err)
> - goto release_guest;
> + goto free_eventfds;
>
> /* Initialize the Guest's shadow page tables, using the toplevel
> * address the Launcher gave us. This allocates memory, so can fail. */
> @@ -210,7 +294,9 @@ static int initialize(struct file *file,
> free_regs:
> /* FIXME: This should be in free_vcpu */
> free_page(lg->cpus[0].regs_page);
> -release_guest:
> +free_eventfds:
> + kfree(lg->eventfds);
> +free_lg:
> kfree(lg);
> unlock:
> mutex_unlock(&lguest_lock);
> @@ -260,6 +346,8 @@ static ssize_t write(struct file *file,
> return user_send_irq(cpu, input);
> case LHREQ_BREAK:
> return break_guest_out(cpu, input);
> + case LHREQ_EVENTFD:
> + return attach_eventfd(lg, input);
> default:
> return -EINVAL;
> }
> @@ -297,6 +385,12 @@ static int close(struct inode *inode, st
> * the Launcher's memory management structure. */
> mmput(lg->cpus[i].mm);
> }
> +
> + /* Release any eventfds they registered. */
> + for (i = 0; i < lg->eventfds->num; i++)
> + fput(lg->eventfds->map[i].event);
> + kfree(lg->eventfds);
> +
> /* If lg->dead doesn't contain an error code it will be NULL or a
> * kmalloc()ed string, either of which is ok to hand to kfree(). */
> if (!IS_ERR(lg->dead))
> diff --git a/include/linux/lguest_launcher.h b/include/linux/lguest_launcher.h
> --- a/include/linux/lguest_launcher.h
> +++ b/include/linux/lguest_launcher.h
> @@ -58,6 +58,7 @@ enum lguest_req
> LHREQ_GETDMA, /* No longer used */
> LHREQ_IRQ, /* + irq */
> LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
> + LHREQ_EVENTFD, /* + address, fd. */
> };
>
> /* The alignment to use between consumer and producer parts of vring.
>