LinuxLists.cc - [PATCH v2 00/10] RFC: NVME MDEV

2019-05-02 11:50:48

by Maxim Levitsky

[permalink] [raw]

Subject: [PATCH v2 00/10] RFC: NVME MDEV

Hi everyone!

In this patch series, I would like to introduce my take on the problem of doing
as fast as possible virtualization of storage with emphasis on low latency.

For more information for the inner working you can look at V1 of the submission
at
https://lkml.org/lkml/2019/3/19/458

*** Changes from V1 ***

* Some correctness fixes that slipped in the last minute,
Nothing major though, and all my attempt to crash this driver lately
were unsucessfull,
and it pretty much copes with anything I throw at it.

* Experemental block layer mode: In this mode, the mdev driver sends
the IO though the block layer (using bio_submit), and then polls for
the completions using the blk_poll. The last commit in the series adds this.
The performance overhead of this is about 2x of direct dedicated queue submission
though.

For this patch series, I would like to hear your opinion on the generic
block layer code, and to hear your opinion on the code as whole.

Please ignore the fact that code doesn't use nvme target code
(for instance it already have the generic block device code in io-cmd-bdev.c)
I will later switch to it, in next version of these patches, although,
I think that I should keep the option of using the direct, reserved IO queue
version too.

*** Performance results ***

Two tests were run, this time focusing only on latency and especially
on measuring the overhead of the translation.

I used the nvme-5.2 branch of http://git.infradead.org/nvme.git
which includes the small IO performance improvments which show a modest
perf increase in the generic block layer code path.

So the tests were:

** No interrupts test**

For that test, I used spdk's fio plugin as the test client. It itself uses
vfio to bind to the device, and then read/write from it using polling.
For tests that run in the guest, I used the virtual IOMMU, and run that spdk
test in the guest.

IOW, the tests were:

1. spdk in the host with fio
2. nvme-mdev in the host, spdk in the guest with fio
3. spdk in the host, spdk in the guest with fio

The fio (with the spdk plugin)was running at queue depth 1.
In addtion to that I added rdtsc based instrumentation to my mdev driver
to read the number of cycles it takes it to translate a command, and number
of cycles it takes it to poll for the response, and also number of cycles,
it takes to send the interrupt to the guest.

You can find the script at

'https://gitlab.com/maximlevitsky/vm_scripts/blob/master/tests/stest0.sh'

** Interrupts test **

For this test, the client was the kernel nvme driver, and fio running on top of it,
also with queue depth 1.
The test is at

'https://gitlab.com/maximlevitsky/vm_scripts/blob/master/tests/test0.sh'

The tests were done on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz,
with Optane SSD 900P NVME drive.

The system is dual socket, but the system was booted with configuration
that allowed to fully use only NUMA node 0, where the device is attached.

** The results for non interrupt test **

host:
BW :574.93 MiB/s (stdev:0.73 Mib/s)
IOPS :147,182 (stdev:186 IOPS/s)
SLAT :0.113 us (stdev:0.055 us)
CLAT :6.402 us (stdev:1.146 us)
LAT :6.516 us (stdev:1.148 us)

mdev/direct:

BW :535.99 MiB/s (stdev:2.62 Mib/s)
IOPS :137,214 (stdev:670 IOPS/s)
SLAT :0.128 us (stdev:3.074 us)
CLAT :6.909 us (stdev:4.384 us)
LAT :7.038 us (stdev:6.892 us)

commands translated : avg cycles: 668.032 avg time(usec): 0.196 total: 8239732
commands completed : avg cycles: 411.791 avg time(usec): 0.121 total: 8239732

mdev/block generic:

BW :512.99 MiB/s (stdev:2.5 Mib/s)
IOPS :131,324 (stdev:641 IOPS/s)
SLAT :0.13 us (stdev:3.143 us)
CLAT :7.237 us (stdev:4.516 us)
LAT :7.367 us (stdev:7.069 us)

commands translated : avg cycles: 1509.207 avg time(usec): 0.444 total: 7879519
commands completed : avg cycles: 1005.299 avg time(usec): 0.296 total: 7879519

*Here you clearly see the overhead added by the block layer*

spdk:
BW :535.77 MiB/s (stdev:0.86 Mib/s)
IOPS :137,157 (stdev:220 IOPS/s)
SLAT :0.135 us (stdev:0.073 us)
CLAT :6.905 us (stdev:1.166 us)
LAT :7.04 us (stdev:1.168 us)

qemu userspace nvme driver:
BW :151.56 MiB/s (stdev:0.38 Mib/s)
IOPS :38,799 (stdev:97 IOPS/s)
SLAT :4.655 us (stdev:0.714 us)
CLAT :20.856 us (stdev:1.993 us)
LAT :25.512 us (stdev:2.086 us)

** The results from interrupt test **

host:
BW :428.37 MiB/s (stdev:0.44 Mib/s)
IOPS :109,662 (stdev:112 IOPS/s)
SLAT :0.913 us (stdev:0.077 us)
CLAT :7.894 us (stdev:1.152 us)
LAT :8.844 us (stdev:1.155 us)

mdev/direct:
BW :401.33 MiB/s (stdev:1.99 Mib/s)
IOPS :102,739 (stdev:509 IOPS/s)
SLAT :0.916 us (stdev:3.588 us)
CLAT :8.544 us (stdev:1.21 us)
LAT :9.494 us (stdev:10.569 us)

commands translated : avg cycles: 639.993 avg time(usec): 0.188 total: 6164505
commands completed : avg cycles: 398.824 avg time(usec): 0.117 total: 6164505
interrupts sent : avg cycles: 612.795 avg time(usec): 0.18 total: 6164505

mdev/generic:
BW :384.35 MiB/s (stdev:1.89 Mib/s)
IOPS :98,394 (stdev:483 IOPS/s)
SLAT :0.986 us (stdev:3.668 us)
CLAT :8.86 us (stdev:8.584 us)
LAT :9.897 us (stdev:9.336 us)

commands translated : avg cycles: 1537.659 avg time(usec): 0.452 total: 5903703
commands completed : avg cycles: 976.604 avg time(usec): 0.287 total: 5903703
interrupts sent : avg cycles: 596.916 avg time(usec): 0.176 total: 5903703

spdk:
BW :395.11 MiB/s (stdev:0.54 Mib/s)
IOPS :101,149 (stdev:137 IOPS/s)
SLAT :0.857 us (stdev:0.123 us)
CLAT :8.755 us (stdev:1.224 us)
LAT :9.644 us (stdev:1.229 us)

aio_kernel/virtio_blk (other combinations showed similiar, a bit worse latency):

BW :176.27 MiB/s (stdev:3.14 Mib/s)
IOPS :45,125 (stdev:804 IOPS/s)
SLAT :1.584 us (stdev:0.375 us)
CLAT :20.291 us (stdev:2.031 us)
LAT :21.91 us (stdev:2.057 us)

qemu userspace driver:

BW :119.51 MiB/s (stdev:0.8 Mib/s)
IOPS :30,595 (stdev:203 IOPS/s)
SLAT :4.976 us (stdev:0.71 us)
CLAT :27.304 us (stdev:2.668 us)
LAT :32.336 us (stdev:2.736 us)

*** FAQ ***

* Why to make this in the kernel? Why this is better that SPDK

-> Reuse the existing nvme kernel driver in the host. No new drivers in the guest.

-> Share the NVMe device between host and guest.
Even in fully virtualized configurations,
some partitions of nvme device could be used by guests as block devices
while others passed through with nvme-mdev to achieve balance between
all features of full IO stack emulation and performance.

-> This is a framework that later can be used to support NVMe devices
with more of the IO virtualization built-in
(IOMMU with PASID support coupled with device that supports it)

* Why to attach directly to nvme-pci driver and not use block layer IO
-> The direct attachment allows for better performance (2x),
but it is possible to use block layer IO, especially for fabrics drivers, and/or
non nvme block devices.

*** Testing ***

The device was tested with mostly stock QEMU 4 on the host,
with host was using 5.1-rc6 kernel with nvme-mdev added and the following hardware:
* QEMU nvme virtual device (with nested guest)
* Intel DC P3700 on Xeon E5-2620 v2 server
* Intel Corporation Optane SSD 900P Series
* Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
* Lenovo NVME device found in my laptop

The guest was tested with kernel 4.16, 4.18, 4.20 and
the same custom complied kernel 5.0
Windows 10 guest was tested too with both Microsoft's inbox driver and
open source community NVME driver
(https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html)

Testing was mostly done on x86_64, but 32 bit host/guest combination
was lightly tested too.

In addition to that, the virtual device was tested with nested guest,
by passing the virtual device to it,
using pci passthrough, qemu userspace nvme driver, and spdk

Maxim Levitsky (10):
vfio/mdev: add notifier for map events
vfio/mdev: add .request callback
nvme/core: add some more values from the spec
nvme/core: add NVME_CTRL_SUSPENDED controller state
nvme/pci: use the NVME_CTRL_SUSPENDED state
nvme/core: add mdev interfaces
nvme/core: add nvme-mdev core driver
nvme/pci: implement the mdev external queue allocation interface
nvme/mdev - Add inline performance measurments
nvme/mdev - generic block IO POC

MAINTAINERS | 5 +
drivers/nvme/Kconfig | 1 +
drivers/nvme/Makefile | 1 +
drivers/nvme/host/core.c | 144 +++++-
drivers/nvme/host/nvme.h | 55 +-
drivers/nvme/host/pci.c | 381 +++++++++++++-
drivers/nvme/mdev/Kconfig | 24 +
drivers/nvme/mdev/Makefile | 5 +
drivers/nvme/mdev/adm.c | 873 ++++++++++++++++++++++++++++++++
drivers/nvme/mdev/events.c | 142 ++++++
drivers/nvme/mdev/host.c | 741 +++++++++++++++++++++++++++
drivers/nvme/mdev/instance.c | 866 +++++++++++++++++++++++++++++++
drivers/nvme/mdev/io.c | 601 ++++++++++++++++++++++
drivers/nvme/mdev/irq.c | 270 ++++++++++
drivers/nvme/mdev/mdev.h | 56 ++
drivers/nvme/mdev/mmio.c | 588 +++++++++++++++++++++
drivers/nvme/mdev/pci.c | 247 +++++++++
drivers/nvme/mdev/priv.h | 774 ++++++++++++++++++++++++++++
drivers/nvme/mdev/udata.c | 390 ++++++++++++++
drivers/nvme/mdev/vcq.c | 209 ++++++++
drivers/nvme/mdev/vctrl.c | 518 +++++++++++++++++++
drivers/nvme/mdev/viommu.c | 322 ++++++++++++
drivers/nvme/mdev/vns.c | 356 +++++++++++++
drivers/nvme/mdev/vsq.c | 181 +++++++
drivers/vfio/mdev/vfio_mdev.c | 11 +
drivers/vfio/vfio_iommu_type1.c | 97 +++-
include/linux/mdev.h | 4 +
include/linux/nvme.h | 88 +++-
include/linux/vfio.h | 4 +
29 files changed, 7905 insertions(+), 49 deletions(-)
create mode 100644 drivers/nvme/mdev/Kconfig
create mode 100644 drivers/nvme/mdev/Makefile
create mode 100644 drivers/nvme/mdev/adm.c
create mode 100644 drivers/nvme/mdev/events.c
create mode 100644 drivers/nvme/mdev/host.c
create mode 100644 drivers/nvme/mdev/instance.c
create mode 100644 drivers/nvme/mdev/io.c
create mode 100644 drivers/nvme/mdev/irq.c
create mode 100644 drivers/nvme/mdev/mdev.h
create mode 100644 drivers/nvme/mdev/mmio.c
create mode 100644 drivers/nvme/mdev/pci.c
create mode 100644 drivers/nvme/mdev/priv.h
create mode 100644 drivers/nvme/mdev/udata.c
create mode 100644 drivers/nvme/mdev/vcq.c
create mode 100644 drivers/nvme/mdev/vctrl.c
create mode 100644 drivers/nvme/mdev/viommu.c
create mode 100644 drivers/nvme/mdev/vns.c
create mode 100644 drivers/nvme/mdev/vsq.c

--
2.17.2

2019-05-02 11:50:56

by Maxim Levitsky

[permalink] [raw]

Subject: [PATCH v2 06/10] nvme/core: add mdev interfaces

This adds infrastructure for a nvme-mdev
to attach to the core driver, to be able to
know which nvme controllers are present and which
namespaces they have.

It also adds an interface to nvme device drivers
which expose the its queues in a controlled manner
to the nvme mdev core driver. A driver must opt-in
for this using a new flag NVME_F_MDEV_SUPPORTED

If the mdev device driver also sets the
NVME_F_MDEV_DMA_SUPPORTED, the mdev core will
dma map all the guest memory into the nvme device,
so that nvme device driver can use dma addresses as passed
from the mdev core driver

Signed-off-by: Maxim Levitsky <[email protected]>
---
drivers/nvme/host/core.c | 125 ++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 54 +++++++++++++++--
2 files changed, 172 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 22db0c51a0bf..3c1b91089631 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -89,11 +89,111 @@ static dev_t nvme_chr_devt;
static struct class *nvme_class;
static struct class *nvme_subsys_class;

+static void nvme_ns_remove(struct nvme_ns *ns);
static int nvme_revalidate_disk(struct gendisk *disk);
static void nvme_put_subsystem(struct nvme_subsystem *subsys);
static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
unsigned nsid);

+#ifdef CONFIG_NVME_MDEV
+
+static struct nvme_mdev_driver *mdev_driver_interface;
+static DEFINE_MUTEX(mdev_ctrl_lock);
+static LIST_HEAD(mdev_ctrl_list);
+
+static bool nvme_ctrl_has_mdev(struct nvme_ctrl *ctrl)
+{
+ return (ctrl->ops->flags & NVME_F_MDEV_SUPPORTED) != 0;
+}
+
+static void nvme_mdev_add_ctrl(struct nvme_ctrl *ctrl)
+{
+ if (nvme_ctrl_has_mdev(ctrl)) {
+ mutex_lock(&mdev_ctrl_lock);
+ list_add_tail(&ctrl->link, &mdev_ctrl_list);
+ mutex_unlock(&mdev_ctrl_lock);
+ }
+}
+
+static void nvme_mdev_remove_ctrl(struct nvme_ctrl *ctrl)
+{
+ if (nvme_ctrl_has_mdev(ctrl)) {
+ mutex_lock(&mdev_ctrl_lock);
+ list_del_init(&ctrl->link);
+ mutex_unlock(&mdev_ctrl_lock);
+ }
+}
+
+int nvme_core_register_mdev_driver(struct nvme_mdev_driver *driver_ops)
+{
+ struct nvme_ctrl *ctrl;
+
+ if (mdev_driver_interface)
+ return -EEXIST;
+
+ mdev_driver_interface = driver_ops;
+
+ mutex_lock(&mdev_ctrl_lock);
+ list_for_each_entry(ctrl, &mdev_ctrl_list, link)
+ mdev_driver_interface->nvme_ctrl_state_changed(ctrl);
+
+ mutex_unlock(&mdev_ctrl_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_core_register_mdev_driver);
+
+void nvme_core_unregister_mdev_driver(struct nvme_mdev_driver *driver_ops)
+{
+ if (WARN_ON(driver_ops != mdev_driver_interface))
+ return;
+ mdev_driver_interface = NULL;
+}
+EXPORT_SYMBOL_GPL(nvme_core_unregister_mdev_driver);
+
+static void nvme_mdev_ctrl_state_changed(struct nvme_ctrl *ctrl)
+{
+ if (!mdev_driver_interface || !nvme_ctrl_has_mdev(ctrl))
+ return;
+ if (!try_module_get(mdev_driver_interface->owner))
+ return;
+
+ mdev_driver_interface->nvme_ctrl_state_changed(ctrl);
+ module_put(mdev_driver_interface->owner);
+}
+
+static void nvme_mdev_ns_state_changed(struct nvme_ctrl *ctrl,
+ struct nvme_ns *ns, bool removed)
+{
+ if (!mdev_driver_interface || !nvme_ctrl_has_mdev(ctrl))
+ return;
+ if (!try_module_get(mdev_driver_interface->owner))
+ return;
+
+ mdev_driver_interface->nvme_ns_state_changed(ctrl,
+ ns->head->ns_id, removed);
+ module_put(mdev_driver_interface->owner);
+}
+
+#else
+static void nvme_mdev_ctrl_state_changed(struct nvme_ctrl *ctrl)
+{
+}
+
+static void nvme_mdev_ns_state_changed(struct nvme_ctrl *ctrl,
+ struct nvme_ns *ns, bool removed)
+{
+}
+
+static void nvme_mdev_add_ctrl(struct nvme_ctrl *ctrl)
+{
+}
+
+static void nvme_mdev_remove_ctrl(struct nvme_ctrl *ctrl)
+{
+}
+
+#endif
+
static void nvme_set_queue_dying(struct nvme_ns *ns)
{
/*
@@ -387,10 +487,13 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,

if (changed)
ctrl->state = new_state;
-
spin_unlock_irqrestore(&ctrl->lock, flags);
+
+ nvme_mdev_ctrl_state_changed(ctrl);
+
if (changed && ctrl->state == NVME_CTRL_LIVE)
nvme_kick_requeue_lists(ctrl);
+
return changed;
}
EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -426,10 +529,11 @@ static void nvme_free_ns(struct kref *kref)
kfree(ns);
}

-static void nvme_put_ns(struct nvme_ns *ns)
+void nvme_put_ns(struct nvme_ns *ns)
{
kref_put(&ns->kref, nvme_free_ns);
}
+EXPORT_SYMBOL_GPL(nvme_put_ns);

static inline void nvme_clear_nvme_request(struct request *req)
{
@@ -1289,6 +1393,11 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
return effects;
}

+static void nvme_update_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
+{
+ nvme_mdev_ns_state_changed(ctrl, ns, false);
+}
+
static void nvme_update_formats(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
@@ -1297,6 +1406,8 @@ static void nvme_update_formats(struct nvme_ctrl *ctrl)
list_for_each_entry(ns, &ctrl->namespaces, list)
if (ns->disk && nvme_revalidate_disk(ns->disk))
nvme_set_queue_dying(ns);
+ else
+ nvme_update_ns(ctrl, ns);
up_read(&ctrl->namespaces_rwsem);

nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
@@ -3186,7 +3297,7 @@ static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
return nsa->head->ns_id - nsb->head->ns_id;
}

-static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned int nsid)
{
struct nvme_ns *ns, *ret = NULL;

@@ -3204,6 +3315,7 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
up_read(&ctrl->namespaces_rwsem);
return ret;
}
+EXPORT_SYMBOL_GPL(nvme_find_get_ns);

static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
{
@@ -3336,6 +3448,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
return;

+ nvme_mdev_ns_state_changed(ns->ctrl, ns, true);
+
nvme_fault_inject_fini(ns);
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
del_gendisk(ns->disk);
@@ -3366,6 +3480,8 @@ static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
if (ns) {
if (ns->disk && revalidate_disk(ns->disk))
nvme_ns_remove(ns);
+ else
+ nvme_update_ns(ctrl, ns);
nvme_put_ns(ns);
} else
nvme_alloc_ns(ctrl, nsid);
@@ -3717,6 +3833,7 @@ static void nvme_free_ctrl(struct device *dev)
sysfs_remove_link(&subsys->dev.kobj, dev_name(ctrl->device));
}

+ nvme_mdev_remove_ctrl(ctrl);
ctrl->ops->free_ctrl(ctrl);

if (subsys)
@@ -3789,6 +3906,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
dev_pm_qos_update_user_latency_tolerance(ctrl->device,
min(default_ps_max_latency_us, (unsigned long)S32_MAX));

+ nvme_mdev_add_ctrl(ctrl);
+
return 0;
out_free_name:
kfree_const(ctrl->device->kobj.name);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d040eb00e880..6ec14ce0d015 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -148,6 +148,7 @@ enum nvme_ctrl_state {
};

struct nvme_ctrl {
+ struct list_head link;
bool comp_seen;
enum nvme_ctrl_state state;
bool identified;
@@ -255,6 +256,23 @@ enum nvme_iopolicy {
NVME_IOPOLICY_RR,
};

+#ifdef CONFIG_NVME_MDEV
+/* Interface to the host driver */
+struct nvme_mdev_driver {
+ struct module *owner;
+
+ /* a controller state has changed*/
+ void (*nvme_ctrl_state_changed)(struct nvme_ctrl *ctrl);
+
+ /* NS is updated in some way (after format or so) */
+ void (*nvme_ns_state_changed)(struct nvme_ctrl *ctrl,
+ u32 nsid, bool removed);
+};
+
+int nvme_core_register_mdev_driver(struct nvme_mdev_driver *driver_ops);
+void nvme_core_unregister_mdev_driver(struct nvme_mdev_driver *driver_ops);
+#endif
+
struct nvme_subsystem {
int instance;
struct device dev;
@@ -280,7 +298,7 @@ struct nvme_subsystem {
};

/*
- * Container structure for uniqueue namespace identifiers.
+ * Container structure for unique namespace identifiers.
*/
struct nvme_ns_ids {
u8 eui64[8];
@@ -356,13 +374,22 @@ struct nvme_ns {

};

+struct nvme_ext_data_iter;
+struct nvme_ext_cmd_result {
+ u32 tag;
+ u16 status;
+};
+
struct nvme_ctrl_ops {
const char *name;
struct module *module;
unsigned int flags;
-#define NVME_F_FABRICS (1 << 0)
-#define NVME_F_METADATA_SUPPORTED (1 << 1)
-#define NVME_F_PCI_P2PDMA (1 << 2)
+#define NVME_F_FABRICS BIT(0)
+#define NVME_F_METADATA_SUPPORTED BIT(1)
+#define NVME_F_PCI_P2PDMA BIT(2)
+#define NVME_F_MDEV_SUPPORTED BIT(3)
+#define NVME_F_MDEV_DMA_SUPPORTED BIT(4)
+
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -370,6 +397,23 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+
+#ifdef CONFIG_NVME_MDEV
+ int (*ext_queues_available)(struct nvme_ctrl *ctrl);
+ int (*ext_queue_alloc)(struct nvme_ctrl *ctrl, u16 *qid);
+ void (*ext_queue_free)(struct nvme_ctrl *ctrl, u16 qid);
+
+ int (*ext_queue_submit)(struct nvme_ctrl *ctrl,
+ u16 qid, u32 tag,
+ struct nvme_command *command,
+ struct nvme_ext_data_iter *iter);
+
+ bool (*ext_queue_full)(struct nvme_ctrl *ctrl, u16 qid);
+
+ int (*ext_queue_poll)(struct nvme_ctrl *ctrl, u16 qid,
+ struct nvme_ext_cmd_result *results,
+ unsigned int max_len);
+#endif
};

#ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
@@ -431,6 +475,8 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl);
void nvme_put_ctrl(struct nvme_ctrl *ctrl);
int nvme_init_identify(struct nvme_ctrl *ctrl);

+struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned int nsid);
+void nvme_put_ns(struct nvme_ns *ns);
void nvme_remove_namespaces(struct nvme_ctrl *ctrl);

int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
--
2.17.2

2019-05-02 11:51:12

by Maxim Levitsky

[permalink] [raw]

Subject: [PATCH v2 08/10] nvme/pci: implement the mdev external queue allocation interface

Note that currently the number of hw queues reserved for mdev,
has to be pre determined on module load.

(I used to allocate the queues dynamicaly on demand, but
recent changes to allocate polled/read queues made
this somewhat difficult, so I dropped this for now)

Signed-off-by: Maxim Levitsky <[email protected]>
---
drivers/nvme/host/pci.c | 375 ++++++++++++++++++++++++++++++++++++++-
drivers/nvme/mdev/host.c | 46 ++---
drivers/nvme/mdev/io.c | 46 +++--
drivers/nvme/mdev/mmio.c | 3 -
4 files changed, 421 insertions(+), 49 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 282f28c851c1..87507e710374 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -23,6 +23,7 @@
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
#include <linux/pci-p2pdma.h>
+#include "../mdev/mdev.h"

#include "trace.h"
#include "nvme.h"
@@ -32,6 +33,7 @@

#define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))

+#define USE_SMALL_PRP_POOL(nprps) ((nprps) < (256 / 8))
/*
* These can be higher, but we need to ensure that any command doesn't
* require an sg allocation that needs more than a page of data.
@@ -83,12 +85,24 @@ static int poll_queues = 0;
module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644);
MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");

+static int mdev_queues;
+#ifdef CONFIG_NVME_MDEV
+module_param_cb(mdev_queues, &queue_count_ops, &mdev_queues, 0644);
+MODULE_PARM_DESC(mdev_queues, "Number of queues to use for mediated VFIO");
+#endif
+
struct nvme_dev;
struct nvme_queue;

static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);

+#ifdef CONFIG_NVME_MDEV
+static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid);
+#else
+static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid) {}
+#endif
+
/*
* Represents an NVM Express device. Each nvme_dev is a PCI function.
*/
@@ -103,6 +117,7 @@ struct nvme_dev {
unsigned online_queues;
unsigned max_qid;
unsigned io_queues[HCTX_MAX_TYPES];
+ unsigned int mdev_queues;
unsigned int num_vecs;
int q_depth;
u32 db_stride;
@@ -110,6 +125,7 @@ struct nvme_dev {
unsigned long bar_mapped_size;
struct work_struct remove_work;
struct mutex shutdown_lock;
+ struct mutex ext_dev_lock;
bool subsystem;
u64 cmb_size;
bool cmb_use_sqes;
@@ -172,6 +188,16 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
return container_of(ctrl, struct nvme_dev, ctrl);
}

+/* Simplified IO descriptor for MDEV use */
+struct nvme_ext_iod {
+ struct list_head link;
+ u32 user_tag;
+ int nprps;
+ struct nvme_ext_data_iter *saved_iter;
+ dma_addr_t first_prplist_dma;
+ __le64 *prpslists[NVME_MAX_SEGS];
+};
+
/*
* An NVM Express queue. Each device has at least two (one for admin
* commands and one for I/O commands).
@@ -196,15 +222,26 @@ struct nvme_queue {
u16 qid;
u8 cq_phase;
unsigned long flags;
+
#define NVMEQ_ENABLED 0
#define NVMEQ_SQ_CMB 1
#define NVMEQ_DELETE_ERROR 2
#define NVMEQ_POLLED 3
+#define NVMEQ_EXTERNAL 4
+
u32 *dbbuf_sq_db;
u32 *dbbuf_cq_db;
u32 *dbbuf_sq_ei;
u32 *dbbuf_cq_ei;
struct completion delete_done;
+
+ /* queue passthrough for external use */
+ struct {
+ int inflight;
+ struct nvme_ext_iod *iods;
+ struct list_head free_iods;
+ struct list_head used_iods;
+ } ext;
};

/*
@@ -255,7 +292,7 @@ static inline void _nvme_check_size(void)

static unsigned int max_io_queues(void)
{
- return num_possible_cpus() + write_queues + poll_queues;
+ return num_possible_cpus() + write_queues + poll_queues + mdev_queues;
}

static unsigned int max_queue_count(void)
@@ -1066,6 +1103,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
* the irq handler, even if that was on another CPU.
*/
rmb();
+
if (nvmeq->cq_head != nvmeq->last_cq_head)
ret = IRQ_HANDLED;
nvme_process_cq(nvmeq, &start, &end, -1);
@@ -1553,7 +1591,11 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
nvme_dbbuf_init(dev, nvmeq, qid);
dev->online_queues++;
+
wmb(); /* ensure the first interrupt sees the initialization */
+
+ if (test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
+ nvme_ext_queue_reset(nvmeq->dev, qid);
}

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
@@ -1759,7 +1801,7 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
}

max = min(dev->max_qid, dev->ctrl.queue_count - 1);
- if (max != 1 && dev->io_queues[HCTX_TYPE_POLL]) {
+ if (max != 1) {
rw_queues = dev->io_queues[HCTX_TYPE_DEFAULT] +
dev->io_queues[HCTX_TYPE_READ];
} else {
@@ -2095,14 +2137,23 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
* Poll queues don't need interrupts, but we need at least one IO
* queue left over for non-polled IO.
*/
- this_p_queues = poll_queues;
+ this_p_queues = poll_queues + mdev_queues;
if (this_p_queues >= nr_io_queues) {
this_p_queues = nr_io_queues - 1;
irq_queues = 1;
} else {
irq_queues = nr_io_queues - this_p_queues + 1;
}
+
+ if (mdev_queues > this_p_queues) {
+ mdev_queues = this_p_queues;
+ this_p_queues = 0;
+ } else {
+ this_p_queues -= mdev_queues;
+ }
+
dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
+ dev->mdev_queues = mdev_queues;

/* Initialize for the single interrupt case */
dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
@@ -2170,7 +2221,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)

dev->num_vecs = result;
result = max(result - 1, 1);
- dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
+ dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL] +
+ dev->mdev_queues;

/*
* Should investigate if there's a performance win from allocating
@@ -2193,10 +2245,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
nvme_suspend_io_queues(dev);
goto retry;
}
- dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
+ dev_info(dev->ctrl.device, "%d/%d/%d/%d default/read/poll/mdev queues\n",
dev->io_queues[HCTX_TYPE_DEFAULT],
dev->io_queues[HCTX_TYPE_READ],
- dev->io_queues[HCTX_TYPE_POLL]);
+ dev->io_queues[HCTX_TYPE_POLL],
+ dev->mdev_queues);
return 0;
}

@@ -2623,6 +2676,301 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
nvme_put_ctrl(&dev->ctrl);
}

+#ifdef CONFIG_NVME_MDEV
+static void nvme_ext_free_iod(struct nvme_dev *dev, struct nvme_ext_iod *iod)
+{
+ int i = 0, max_prp, nprps = iod->nprps;
+ dma_addr_t dma = iod->first_prplist_dma;
+
+ if (iod->saved_iter) {
+ iod->saved_iter->release(iod->saved_iter);
+ iod->saved_iter = NULL;
+ }
+
+ if (--nprps < 2) {
+ goto out;
+ } else if (USE_SMALL_PRP_POOL(nprps)) {
+ dma_pool_free(dev->prp_small_pool, iod->prpslists[0], dma);
+ goto out;
+ }
+
+ max_prp = (dev->ctrl.page_size >> 3) - 1;
+ while (nprps > 0) {
+ if (i > 0) {
+ dma = iod->prpslists[i - 1][max_prp];
+ if (nprps == 1)
+ break;
+ }
+ dma_pool_free(dev->prp_page_pool, iod->prpslists[i++], dma);
+ nprps -= max_prp;
+ }
+out:
+ iod->nprps = -1;
+ iod->first_prplist_dma = 0;
+ iod->user_tag = 0xDEADDEAD;
+}
+
+static int nvme_ext_setup_iod(struct nvme_dev *dev, struct nvme_ext_iod *iod,
+ struct nvme_common_command *cmd,
+ struct nvme_ext_data_iter *iter)
+{
+ int ret, i, j;
+ __le64 *prp_list;
+ dma_addr_t prp_dma;
+ struct dma_pool *pool;
+ int max_prp = (dev->ctrl.page_size >> 3) - 1;
+
+ iod->saved_iter = iter && iter->release ? iter : NULL;
+ iod->nprps = iter ? iter->count : 0;
+ cmd->dptr.prp1 = 0;
+ cmd->dptr.prp2 = 0;
+ cmd->metadata = 0;
+
+ if (!iter)
+ return 0;
+
+ /* put first pointer*/
+ cmd->dptr.prp1 = cpu_to_le64(iter->host_iova);
+ if (iter->count == 1)
+ return 0;
+
+ ret = iter->next(iter);
+ if (ret)
+ goto error;
+
+ /* if only have one more pointer, put it to second data pointer*/
+ if (iter->count == 1) {
+ cmd->dptr.prp2 = cpu_to_le64(iter->host_iova);
+ return 0;
+ }
+
+ pool = USE_SMALL_PRP_POOL(iter->count) ? dev->prp_small_pool :
+ dev->prp_page_pool;
+
+ /* Allocate prp lists as needed and fill them */
+ for (i = 0 ; i < NVME_MAX_SEGS && iter->count ; i++) {
+ prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
+ if (!prp_list) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ iod->prpslists[i++] = prp_list;
+
+ if (i == 1) {
+ iod->first_prplist_dma = prp_dma;
+ cmd->dptr.prp2 = cpu_to_le64(prp_dma);
+ j = 0;
+ } else {
+ prp_list[0] = iod->prpslists[i - 1][max_prp];
+ iod->prpslists[i - 1][max_prp] = prp_dma;
+ j = 1;
+ }
+
+ while (j <= max_prp && iter->count) {
+ prp_list[j++] = iter->host_iova;
+ ret = iter->next(iter);
+ if (ret)
+ goto error;
+ }
+ }
+
+ if (iter->count) {
+ ret = -ENOSPC;
+ goto error;
+ }
+ return 0;
+error:
+ iod->nprps -= iter->count;
+ nvme_ext_free_iod(dev, iod);
+ return ret;
+}
+
+static int nvme_ext_queues_available(struct nvme_ctrl *ctrl)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ unsigned int ret = 0, qid;
+ unsigned int first_mdev_q = dev->online_queues - dev->mdev_queues;
+
+ for (qid = first_mdev_q; qid < dev->online_queues; qid++) {
+ struct nvme_queue *nvmeq = &dev->queues[qid];
+
+ if (!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
+ ret++;
+ }
+ return ret;
+}
+
+static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid)
+{
+ struct nvme_queue *nvmeq = &dev->queues[qid];
+ struct nvme_ext_iod *iod, *tmp;
+
+ list_for_each_entry_safe(iod, tmp, &nvmeq->ext.used_iods, link) {
+ if (iod->saved_iter && iod->saved_iter->release) {
+ iod->saved_iter->release(iod->saved_iter);
+ iod->saved_iter = NULL;
+ list_move(&iod->link, &nvmeq->ext.free_iods);
+ }
+ }
+
+ nvmeq->ext.inflight = 0;
+}
+
+static int nvme_ext_queue_alloc(struct nvme_ctrl *ctrl, u16 *ret_qid)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ struct nvme_queue *nvmeq;
+ int ret = 0, qid, i;
+ unsigned int first_mdev_q = dev->online_queues - dev->mdev_queues;
+
+ mutex_lock(&dev->ext_dev_lock);
+
+ /* find a polled queue to allocate */
+ for (qid = dev->online_queues - 1 ; qid >= first_mdev_q ; qid--) {
+ nvmeq = &dev->queues[qid];
+ if (!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
+ break;
+ }
+
+ if (qid < first_mdev_q) {
+ ret = -ENOSPC;
+ goto out;
+ }
+
+ INIT_LIST_HEAD(&nvmeq->ext.free_iods);
+ INIT_LIST_HEAD(&nvmeq->ext.used_iods);
+
+ nvmeq->ext.iods =
+ vzalloc_node(sizeof(struct nvme_ext_iod) * nvmeq->q_depth,
+ dev_to_node(dev->dev));
+
+ if (!nvmeq->ext.iods) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ for (i = 0 ; i < nvmeq->q_depth ; i++)
+ list_add_tail(&nvmeq->ext.iods[i].link, &nvmeq->ext.free_iods);
+
+ set_bit(NVMEQ_EXTERNAL, &nvmeq->flags);
+ *ret_qid = qid;
+out:
+ mutex_unlock(&dev->ext_dev_lock);
+ return ret;
+}
+
+static void nvme_ext_queue_free(struct nvme_ctrl *ctrl, u16 qid)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ struct nvme_queue *nvmeq;
+
+ mutex_lock(&dev->ext_dev_lock);
+ nvmeq = &dev->queues[qid];
+
+ if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
+ return;
+
+ nvme_ext_queue_reset(dev, qid);
+
+ vfree(nvmeq->ext.iods);
+ nvmeq->ext.iods = NULL;
+ INIT_LIST_HEAD(&nvmeq->ext.free_iods);
+ INIT_LIST_HEAD(&nvmeq->ext.used_iods);
+
+ clear_bit(NVMEQ_EXTERNAL, &nvmeq->flags);
+ mutex_unlock(&dev->ext_dev_lock);
+}
+
+static int nvme_ext_queue_submit(struct nvme_ctrl *ctrl, u16 qid, u32 user_tag,
+ struct nvme_command *command,
+ struct nvme_ext_data_iter *iter)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ struct nvme_queue *nvmeq = &dev->queues[qid];
+ struct nvme_ext_iod *iod;
+ int ret;
+
+ if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
+ return -EINVAL;
+
+ if (list_empty(&nvmeq->ext.free_iods))
+ return -1;
+
+ iod = list_first_entry(&nvmeq->ext.free_iods,
+ struct nvme_ext_iod, link);
+
+ list_move(&iod->link, &nvmeq->ext.used_iods);
+
+ command->common.command_id = cpu_to_le16(iod - nvmeq->ext.iods);
+ iod->user_tag = user_tag;
+
+ ret = nvme_ext_setup_iod(dev, iod, &command->common, iter);
+ if (ret) {
+ list_move(&iod->link, &nvmeq->ext.free_iods);
+ return ret;
+ }
+
+ nvmeq->ext.inflight++;
+ nvme_submit_cmd(nvmeq, command, true);
+ return 0;
+}
+
+static int nvme_ext_queue_poll(struct nvme_ctrl *ctrl, u16 qid,
+ struct nvme_ext_cmd_result *results,
+ unsigned int max_len)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ struct nvme_queue *nvmeq = &dev->queues[qid];
+ u16 old_head;
+ int i, j;
+
+ if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
+ return -EINVAL;
+
+ if (nvmeq->ext.inflight == 0)
+ return -1;
+
+ old_head = nvmeq->cq_head;
+
+ for (i = 0 ; nvme_cqe_pending(nvmeq) && i < max_len ; i++) {
+ u16 status = le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status);
+ u16 tag = le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].command_id);
+
+ results[i].status = status >> 1;
+ results[i].tag = (u32)tag;
+ nvme_update_cq_head(nvmeq);
+ }
+
+ if (old_head != nvmeq->cq_head)
+ nvme_ring_cq_doorbell(nvmeq);
+
+ for (j = 0 ; j < i ; j++) {
+ u16 tag = results[j].tag & 0xFFFF;
+ struct nvme_ext_iod *iod = &nvmeq->ext.iods[tag];
+
+ if (WARN_ON(tag >= nvmeq->q_depth || iod->nprps == -1))
+ continue;
+
+ results[j].tag = iod->user_tag;
+ nvme_ext_free_iod(dev, iod);
+ list_move(&iod->link, &nvmeq->ext.free_iods);
+ nvmeq->ext.inflight--;
+ }
+
+ WARN_ON(nvmeq->ext.inflight < 0);
+ return i;
+}
+
+static bool nvme_ext_queue_full(struct nvme_ctrl *ctrl, u16 qid)
+{
+ struct nvme_dev *dev = to_nvme_dev(ctrl);
+ struct nvme_queue *nvmeq = &dev->queues[qid];
+
+ return nvmeq->ext.inflight < nvmeq->q_depth - 1;
+}
+#endif
+
static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
{
*val = readl(to_nvme_dev(ctrl)->bar + off);
@@ -2652,13 +3000,25 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
.flags = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+ NVME_F_PCI_P2PDMA |
+ NVME_F_MDEV_SUPPORTED |
+ NVME_F_MDEV_DMA_SUPPORTED,
+
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address = nvme_pci_get_address,
+
+#ifdef CONFIG_NVME_MDEV
+ .ext_queues_available = nvme_ext_queues_available,
+ .ext_queue_alloc = nvme_ext_queue_alloc,
+ .ext_queue_free = nvme_ext_queue_free,
+ .ext_queue_submit = nvme_ext_queue_submit,
+ .ext_queue_poll = nvme_ext_queue_poll,
+ .ext_queue_full = nvme_ext_queue_full,
+#endif
};

static int nvme_dev_map(struct nvme_dev *dev)
@@ -2747,6 +3107,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
mutex_init(&dev->shutdown_lock);
+ mutex_init(&dev->ext_dev_lock);

result = nvme_setup_prp_pools(dev);
if (result)
diff --git a/drivers/nvme/mdev/host.c b/drivers/nvme/mdev/host.c
index 5766bad7e909..6590946b86c2 100644
--- a/drivers/nvme/mdev/host.c
+++ b/drivers/nvme/mdev/host.c
@@ -48,19 +48,21 @@ static struct nvme_mdev_hctrl *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
return NULL;
}

+ hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
+ dev_to_node(ctrl->dev));
+ if (!hctrl)
+ return NULL;
+
nr_host_queues = ctrl->ops->ext_queues_available(ctrl);
max_lba_transfer = ctrl->max_hw_sectors >> (PAGE_SHIFT - 9);

if (nr_host_queues == 0) {
dev_info(ctrl->dev,
"no support for mdev - no mdev reserved queues available");
+ kfree(hctrl);
return NULL;
}

- hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
- dev_to_node(ctrl->dev));
- if (!hctrl)
- return NULL;

kref_init(&hctrl->ref);
mutex_init(&hctrl->lock);
@@ -180,6 +182,24 @@ void nvme_mdev_hctrl_hqs_unreserve(struct nvme_mdev_hctrl *hctrl,
mutex_unlock(&hctrl->lock);
}

+/* Check if IO passthrough is supported for given IO optcode */
+bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
+{
+ switch (optcode) {
+ case nvme_cmd_flush:
+ case nvme_cmd_read:
+ case nvme_cmd_write:
+ /* these are mandatory*/
+ return true;
+ case nvme_cmd_write_zeroes:
+ return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
+ case nvme_cmd_dsm:
+ return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
+ default:
+ return false;
+ }
+}
+
/* Allocate a host IO queue */
int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl)
{
@@ -204,23 +224,7 @@ bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid)
return hctrl->nvme_ctrl->ops->ext_queue_full(hctrl->nvme_ctrl, qid);
}

-/* Check if IO passthrough is supported for given IO optcode */
-bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
-{
- switch (optcode) {
- case nvme_cmd_flush:
- case nvme_cmd_read:
- case nvme_cmd_write:
- /* these are mandatory*/
- return true;
- case nvme_cmd_write_zeroes:
- return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
- case nvme_cmd_dsm:
- return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
- default:
- return false;
- }
-}
+

/* Submit a IO passthrough command */
int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
diff --git a/drivers/nvme/mdev/io.c b/drivers/nvme/mdev/io.c
index a731196d0365..59837540fec2 100644
--- a/drivers/nvme/mdev/io.c
+++ b/drivers/nvme/mdev/io.c
@@ -11,14 +11,16 @@
#include <linux/ktime.h>
#include "priv.h"

+
struct io_ctx {
struct nvme_mdev_hctrl *hctrl;
struct nvme_mdev_vctrl *vctrl;

const struct nvme_command *in;
- struct nvme_command out;
struct nvme_mdev_vns *ns;
struct nvme_ext_data_iter udatait;
+
+ struct nvme_command out;
struct nvme_ext_data_iter *kdatait;

ktime_t last_io_t;
@@ -28,6 +30,20 @@ struct io_ctx {
unsigned int arb_burst;
};

+/* Check if we need to read a command from the admin queue */
+static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
+{
+ if (!timeout(ctx->last_admin_poll_time,
+ ctx->vctrl->now, ctx->admin_poll_rate_ms))
+ return false;
+
+ if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
+ return true;
+
+ ctx->last_admin_poll_time = ctx->vctrl->now;
+ return false;
+}
+
/* Handle read/write command.*/
static int nvme_mdev_io_translate_rw(struct io_ctx *ctx)
{
@@ -229,6 +245,7 @@ static int nvme_mdev_io_translate_cmd(struct io_ctx *ctx)
}
}

+/* process a user submission queue */
static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
{
struct nvme_vsq *vsq = &ctx->vctrl->vsqs[sqid];
@@ -275,7 +292,13 @@ static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
return true;
}

-/* process host replies to the passed through commands */
+/* process a user completion queue */
+static void nvme_mdev_io_process_cq(struct io_ctx *ctx, u16 cqid)
+{
+ nvme_mdev_vcq_process(ctx->vctrl, cqid, true);
+}
+
+/* process hardware completion queue */
static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
{
int n, i;
@@ -301,22 +324,9 @@ static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
return n;
}

-/* Check if we need to read a command from the admin queue */
-static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
-{
- if (!timeout(ctx->last_admin_poll_time,
- ctx->vctrl->now, ctx->admin_poll_rate_ms))
- return false;
-
- if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
- return true;
-
- ctx->last_admin_poll_time = ctx->vctrl->now;
- return false;
-}

/* do polling till one of events stops it */
-static void nvme_mdev_io_maintask(struct io_ctx *ctx)
+static void nvme_mdev_io_polling_loop(struct io_ctx *ctx)
{
struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
u16 i, cqid, sqid, hsqcnt;
@@ -353,7 +363,7 @@ static void nvme_mdev_io_maintask(struct io_ctx *ctx)
/* process the completions from the guest*/
cqid = 1;
for_each_set_bit_from(cqid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
- nvme_mdev_vcq_process(vctrl, cqid, true);
+ nvme_mdev_io_process_cq(ctx, cqid);

/* process the completions from the hardware*/
for (i = 0 ; i < hsqcnt ; i++)
@@ -470,7 +480,7 @@ static int nvme_mdev_io_polling_thread(void *data)
if (kthread_should_stop())
break;

- nvme_mdev_io_maintask(&ctx);
+ nvme_mdev_io_polling_loop(&ctx);
}

_DBG(ctx.vctrl, "IO: iothread stopped\n");
diff --git a/drivers/nvme/mdev/mmio.c b/drivers/nvme/mdev/mmio.c
index cf03c1f22f4c..a80962bf4a3d 100644
--- a/drivers/nvme/mdev/mmio.c
+++ b/drivers/nvme/mdev/mmio.c
@@ -54,9 +54,6 @@ static const struct vm_operations_struct nvme_mdev_mmio_dbs_vm_ops = {
bool nvme_mdev_mmio_db_check(struct nvme_mdev_vctrl *vctrl,
u16 qid, u16 size, u16 db)
{
- if (get_current() != vctrl->iothread)
- lockdep_assert_held(&vctrl->lock);
-
if (db < size)
return true;
if (qid == 0) {
--
2.17.2

2019-05-02 11:51:10

by Maxim Levitsky

[permalink] [raw]

Subject: [PATCH v2 07/10] nvme/core: add nvme-mdev core driver

This is main commit in the series, adding the mediated nvme
driver.

The idea behind this driver is based on paper you can find at
https://www.usenix.org/conference/atc18/presentation/peng
But this is an independent implementation.

This mdev device exposes a NVME 1.3 virtual device to any VFIO user
which represents an partition (or whole namespace) of a host NVME device.

Unlike the paper, the driver/device uses
one polling thread per mediated device,
and only needs one hw queue per it to achieve near native performance
(but can use more that one hw queue, in which case it splits the queues
between virtual nvme queues thus supporting n:m mapping between
host hw queues and the guest virtual queues).

The nvme-mdev can't yet be used after this commit, as no nvme device
drivers support mediation (support is added in the next commit to
nvme-pci)

The driver can use the shadow doorbell nvme optional feature,
to stop polling after a timeout.

Currently the device has redhat pci vendor ID, and 0x1234 pci device id,
which is only a placeholder till a real device id is allocated.

Use example:

# load the nvme-mdev driver
$ modprobe nvme-mdev

# load the nvme pci driver with 4 polling queues reserved
# (will work with the next patch)
$ modprobe nvme mdev_queues=4

# generate random UUID for the mediated device
$ UUID=$(uuidgen)
$ MDEV_DEVICE=/sys/bus/mdev/devices/$UUID

# the location of the real nvme device (replace with yours)
$ PCI_DEVICE=/sys/bus/pci/devices/0000:44:00.0

# create the mediated device using 2 host polling queues
$ echo $UUID > $PCI_DEVICE/mdev_supported_types/nvme-2Q_V1/create

# attach partition 1 of namespace 1 to a free virtual namespace
# (use n1 to attach whole namespace)
# you can attach up to 16 virtual namespaces for now
$ echo n1p1 > $MDEV_DEVICE/namespaces/add_namespace

# move the polling thread to cpu 11
$ echo 11 > $MDEV_DEVICE/settings/iothread_cpu

# now you can boot qemu with
# -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID

Note that you can attach and detach virtual namespaces
even while the guest is running which will make the
device sent namespace changed AEN notifications to the guest
prior to attach/detach action.

Signed-off-by: Maxim Levitsky <[email protected]>
---
MAINTAINERS | 5 +
drivers/nvme/Kconfig | 1 +
drivers/nvme/Makefile | 1 +
drivers/nvme/host/core.c | 4 +-
drivers/nvme/mdev/Kconfig | 16 +
drivers/nvme/mdev/Makefile | 5 +
drivers/nvme/mdev/adm.c | 873 +++++++++++++++++++++++++++++++++++
drivers/nvme/mdev/events.c | 142 ++++++
drivers/nvme/mdev/host.c | 502 ++++++++++++++++++++
drivers/nvme/mdev/instance.c | 804 ++++++++++++++++++++++++++++++++
drivers/nvme/mdev/io.c | 563 ++++++++++++++++++++++
drivers/nvme/mdev/irq.c | 264 +++++++++++
drivers/nvme/mdev/mdev.h | 56 +++
drivers/nvme/mdev/mmio.c | 591 ++++++++++++++++++++++++
drivers/nvme/mdev/pci.c | 247 ++++++++++
drivers/nvme/mdev/priv.h | 700 ++++++++++++++++++++++++++++
drivers/nvme/mdev/udata.c | 390 ++++++++++++++++
drivers/nvme/mdev/vcq.c | 209 +++++++++
drivers/nvme/mdev/vctrl.c | 515 +++++++++++++++++++++
drivers/nvme/mdev/viommu.c | 322 +++++++++++++
drivers/nvme/mdev/vns.c | 356 ++++++++++++++
drivers/nvme/mdev/vsq.c | 181 ++++++++
22 files changed, 6745 insertions(+), 2 deletions(-)
create mode 100644 drivers/nvme/mdev/Kconfig
create mode 100644 drivers/nvme/mdev/Makefile
create mode 100644 drivers/nvme/mdev/adm.c
create mode 100644 drivers/nvme/mdev/events.c
create mode 100644 drivers/nvme/mdev/host.c
create mode 100644 drivers/nvme/mdev/instance.c
create mode 100644 drivers/nvme/mdev/io.c
create mode 100644 drivers/nvme/mdev/irq.c
create mode 100644 drivers/nvme/mdev/mdev.h
create mode 100644 drivers/nvme/mdev/mmio.c
create mode 100644 drivers/nvme/mdev/pci.c
create mode 100644 drivers/nvme/mdev/priv.h
create mode 100644 drivers/nvme/mdev/udata.c
create mode 100644 drivers/nvme/mdev/vcq.c
create mode 100644 drivers/nvme/mdev/vctrl.c
create mode 100644 drivers/nvme/mdev/viommu.c
create mode 100644 drivers/nvme/mdev/vns.c
create mode 100644 drivers/nvme/mdev/vsq.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 09f43f1bdd15..481f10412d50 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11107,6 +11107,11 @@ W: http://git.infradead.org/nvme.git
S: Supported
F: drivers/nvme/target/

+NVM EXPRESS MDEV DRIVER
+M: Maxim Levitsky <[email protected]>
+S: Supported
+F: drivers/nvme/mdev/
+
NVMEM FRAMEWORK
M: Srinivas Kandagatla <[email protected]>
S: Maintained
diff --git a/drivers/nvme/Kconfig b/drivers/nvme/Kconfig
index 04008e0bbe81..cbf867e6ac1e 100644
--- a/drivers/nvme/Kconfig
+++ b/drivers/nvme/Kconfig
@@ -2,5 +2,6 @@ menu "NVME Support"

source "drivers/nvme/host/Kconfig"
source "drivers/nvme/target/Kconfig"
+source "drivers/nvme/mdev/Kconfig"

endmenu
diff --git a/drivers/nvme/Makefile b/drivers/nvme/Makefile
index 0096a7fd1431..0458efc57aee 100644
--- a/drivers/nvme/Makefile
+++ b/drivers/nvme/Makefile
@@ -1,3 +1,4 @@

obj-y += host/
obj-y += target/
+obj-y += mdev/
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3c1b91089631..29b2d1cc181d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2349,7 +2349,7 @@ static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ct
size_t nqnlen;
int off;

- if(!(ctrl->quirks & NVME_QUIRK_IGNORE_DEV_SUBNQN)) {
+ if (!(ctrl->quirks & NVME_QUIRK_IGNORE_DEV_SUBNQN)) {
nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
strlcpy(subsys->subnqn, id->subnqn, NVMF_NQN_SIZE);
@@ -3424,8 +3424,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)

nvme_mpath_add_disk(ns, id);
nvme_fault_inject_init(ns);
- kfree(id);

+ kfree(id);
return 0;
out_put_disk:
put_disk(ns->disk);
diff --git a/drivers/nvme/mdev/Kconfig b/drivers/nvme/mdev/Kconfig
new file mode 100644
index 000000000000..7ebc66cdeac0
--- /dev/null
+++ b/drivers/nvme/mdev/Kconfig
@@ -0,0 +1,16 @@
+
+config NVME_MDEV
+ bool
+
+config NVME_MDEV_VFIO
+ tristate "NVME Mediated VFIO virtual device"
+ select NVME_MDEV
+ depends on BLOCK
+ depends on VFIO_MDEV
+ depends on NVME_CORE
+ help
+ This provides EXPEREMENTAL support for lightweight software
+ passthrough of an partition on a NVME storage device to
+ guest, also as a NVME namespace, attached to a virtual NVME
+ controller
+ If unsure, say N.
diff --git a/drivers/nvme/mdev/Makefile b/drivers/nvme/mdev/Makefile
new file mode 100644
index 000000000000..114016c48476
--- /dev/null
+++ b/drivers/nvme/mdev/Makefile
@@ -0,0 +1,5 @@
+
+obj-$(CONFIG_NVME_MDEV_VFIO) += nvme-mdev.o
+
+nvme-mdev-y += adm.o events.o instance.o host.o io.o irq.o \
+ udata.o viommu.o vns.o vsq.o vcq.o vctrl.o mmio.o pci.o
diff --git a/drivers/nvme/mdev/adm.c b/drivers/nvme/mdev/adm.c
new file mode 100644
index 000000000000..39a7ad252c69
--- /dev/null
+++ b/drivers/nvme/mdev/adm.c
@@ -0,0 +1,873 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe admin command implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include "priv.h"
+
+struct adm_ctx {
+ struct nvme_mdev_vctrl *vctrl;
+ struct nvme_mdev_hctrl *hctrl;
+ const struct nvme_command *in;
+ struct nvme_mdev_vns *ns;
+ struct nvme_ext_data_iter udatait;
+ unsigned int datalen;
+};
+
+/*Identify Controller */
+static int nvme_mdev_adm_handle_id_cntrl(struct adm_ctx *ctx)
+{
+ int ret;
+ const struct nvme_identify *in = &ctx->in->identify;
+ struct nvme_id_ctrl *id;
+
+ if (in->nsid != 0)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ id = kzalloc(sizeof(*id), GFP_KERNEL);
+ if (!id)
+ return NVME_SC_INTERNAL;
+
+ /** Controller Capabilities and Features ************************/
+ // PCI vendor ID
+ store_le16(&id->vid, NVME_MDEV_PCI_VENDOR_ID);
+ // PCI Subsystem Vendor ID
+ store_le16(&id->ssvid, NVME_MDEV_PCI_SUBVENDOR_ID);
+ // Serial Number
+ store_strsp(id->sn, ctx->vctrl->serial);
+ // Model Number
+ store_strsp(id->mn, "NVMe MDEV virtual device");
+ // Firmware Revision
+ store_strsp(id->fr, NVME_MDEV_FIRMWARE_VERSION);
+ // Recommended Arbitration Burst
+ id->rab = 6;
+ // IEEE OUI Identifier for the controller vendor
+ id->ieee[0] = 0;
+ // Controller Multi-Path I/O and Namespace Sharing Capabilities
+ id->cmic = 0;
+ // Maximum Data Transfer Size (power of two, in page size units)
+ id->mdts = ctx->hctrl->mdts;
+ // controller ID
+ id->cntlid = 0;
+ // NVME supported version
+ store_le32(&id->ver, NVME_MDEV_NVME_VER);
+ // RTD3 Resume Latency
+ id->rtd3r = 0;
+ //RTD3 Entry Latency
+ id->rtd3e = 0;
+ // Optional Asynchronous Events Supported
+ store_le32(&id->oaes, NVME_AEN_CFG_NS_ATTR);
+ // Controller Attributes (misc junk)
+ id->ctratt = 0;
+
+ /*Admin Command Set Attributes & Optional Controller Capabilities */
+ // Optional Admin Command Support
+ id->oacs = ctx->vctrl->mmio.shadow_db_supported ?
+ NVME_CTRL_OACS_DBBUF_SUPP : 0;
+ // Abort Command Limit (dummy, zero based)
+ id->acl = 3;
+ // Asynchronous Event Request Limit (zero based)
+ id->aerl = MAX_AER_COMMANDS - 1;
+ // Firmware Updates (dummy)
+ id->frmw = 3;
+ // Log Page Attributes
+ // (IMPLEMENT: bit for commands supported and effects)
+ id->lpa = 0;
+ // Error Log Page Entries
+ // (zero based, IMPLEMENT: dummy for now)
+ id->elpe = 0;
+ // Number of Power States Support
+ // (zero based, IMPLEMENT: dummy for now)
+ id->npss = 0;
+ // Admin Vendor Specific Command Configuration (junk)
+ id->avscc = 0;
+ // Autonomous Power State Transition Attributes
+ id->apsta = 0;
+ // Warning Composite Temperature Threshold (dummy)
+ id->wctemp = 0x157;
+ // Critical Composite Temperature Threshold (dummy)
+ id->cctemp = 0x175;
+ // Maximum Time for Firmware Activation (dummy)
+ id->mtfa = 0;
+ // Host Memory Buffer Preferred Size (dummy)
+ id->hmpre = 0;
+ // Host Memory Buffer Minimum Size (dummy)
+ id->hmmin = 0;
+ // Total NVM Capacity (not supported)
+ id->tnvmcap[0] = 0;
+ // Unallocated NVM Capacity (not supported for now)
+ id->unvmcap[0] = 0;
+ // Replay Protected Memory Block Support
+ id->rpmbs = 0;
+ // Extended Device Self-test Time (dummy)
+ id->edstt = 0;
+ // Device Self-test Options (dummy)
+ id->dsto = 0;
+ // Firmware Update Granularity (dummy)
+ id->fwug = 0;
+ // Keep Alive Support (not supported)
+ id->kas = 0;
+ // Host Controlled Thermal Management Attributes (not supported)
+ id->hctma = 0;
+ // Minimum Thermal Management Temperature (not supported)
+ id->mntmt = 0;
+ // Maximum Thermal Management Temperature (not supported)
+ id->mxtmt = 0;
+ // Sanitize capabilities (not supported)
+ id->sanicap = 0;
+
+ /****************** NVM Command Set Attributes ********************/
+ // Submission Queue Entry Size
+ id->sqes = (0x6 << 4) | 0x6;
+ // Completion Queue Entry Size
+ id->cqes = (0x4 << 4) | 0x4;
+ // Maximum Outstanding Commands
+ id->maxcmd = 0;
+ // Number of Namespaces
+ id->nn = MAX_VIRTUAL_NAMESPACES;
+ // Optional NVM Command Support
+ // (we add dsm and write zeros if host supports them)
+ id->oncs = ctx->hctrl->oncs;
+ // TODOLATER: IO: Fused Operation Support
+ id->fuses = 0;
+ // Format NVM Attributes (don't support)
+ id->fna = 0;
+ // Volatile Write Cache (tell that always exist)
+ id->vwc = 1;
+ // Atomic Write Unit Normal (zero based value in blocks)
+ id->awun = 0;
+ // Atomic Write Unit Power Fail (ditto)
+ id->awupf = 0;
+ // NVM Vendor Specific Command Configuration
+ id->nvscc = 0;
+ // Atomic Compare & Write Unit (zero based value in blocks)
+ id->acwu = 0;
+ // SGL Support
+ id->sgls = 0;
+ // NVM Subsystem NVMe Qualified Name
+ strncpy(id->subnqn, ctx->vctrl->subnqn, sizeof(id->subnqn));
+
+ /******************Power state descriptors ***********************/
+ store_le16(&id->psd[0].max_power, 0x9c4); // dummy
+ store_le32(&id->psd[0].entry_lat, 0x10);
+ store_le32(&id->psd[0].exit_lat, 0x4);
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait, id, sizeof(*id));
+ kfree(id);
+ return nvme_mdev_translate_error(ret);
+}
+
+/*Identify Namespace data structure for the specified NSID or common one */
+static int nvme_mdev_adm_handle_id_ns(struct adm_ctx *ctx)
+{
+ int ret;
+ struct nvme_id_ns *idns;
+ u32 nsid = le32_to_cpu(ctx->in->identify.nsid);
+
+ if (nsid == 0xffffffff || nsid == 0 || nsid > MAX_VIRTUAL_NAMESPACES)
+ return DNR(NVME_SC_INVALID_NS);
+
+ /* Allocate return structure*/
+ idns = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
+ if (!idns)
+ return NVME_SC_INTERNAL;
+
+ if (ctx->ns) {
+ //Namespace Size
+ store_le64(&idns->nsze, ctx->ns->ns_size);
+ // Namespace Capacity
+ store_le64(&idns->ncap, ctx->ns->ns_size);
+ // Namespace Utilization
+ store_le64(&idns->nuse, ctx->ns->ns_size);
+ // Namespace Features (nothing to set here yet)
+ idns->nsfeat = 0;
+ // Number of LBA Formats (dummy, zero based)
+ idns->nlbaf = 0;
+ // Formatted LBA Size (current LBA format in use)
+ // + external metadata bit
+ idns->flbas = 0;
+ // Metadata Capabilities
+ idns->mc = 0;
+ // End-to-end Data Protection Capabilities
+ idns->dpc = 0;
+ // End-to-end Data Protection Type Settings
+ idns->dps = 0;
+ // Namespace Multi-path I/O and Namespace Sharing Capabilities
+ idns->nmic = 0;
+ // Reservation Capabilities
+ idns->rescap = 0;
+ // Format Progress Indicator (dummy)
+ idns->fpi = 0;
+ // Namespace Atomic Write Unit Normal
+ idns->nawun = 0;
+ // Namespace Atomic Write Unit Power Fail
+ idns->nawupf = 0;
+ // Namespace Atomic Compare & Write Unit
+ idns->nacwu = 0;
+ // Namespace Atomic Boundary Size Normal
+ idns->nabsn = 0;
+ // Namespace Atomic Boundary Offset
+ idns->nabo = 0;
+ // Namespace Atomic Boundary Size Power Fail
+ idns->nabspf = 0;
+ // Namespace Optimal IO Boundary
+ idns->noiob = ctx->ns->noiob;
+ // NVM Capacity (another capacity but in bytes)
+ idns->nvmcap[0] = 0;
+
+ // TODOLATER: NS: support NGUID/EUI64
+ idns->nguid[0] = 0;
+ idns->eui64[0] = 0;
+ // format 0 metadata size
+ idns->lbaf[0].ms = 0;
+ // format 0 block size (in power of two)
+ idns->lbaf[0].ds = ctx->ns->blksize_shift;
+ // format 0 relative performance
+ idns->lbaf[0].rp = 0;
+ }
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait, idns,
+ NVME_IDENTIFY_DATA_SIZE);
+ kfree(idns);
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Namespace Identification Descriptor list for the specified NSID.*/
+static int nvme_mdev_adm_handle_id_ns_desc(struct adm_ctx *ctx)
+{
+ struct ns_desc {
+ struct nvme_ns_id_desc uuid_desc;
+ uuid_t uuid;
+ struct nvme_ns_id_desc null_desc;
+ };
+
+ int ret;
+ struct ns_desc *id;
+
+ if (!ctx->ns)
+ return DNR(NVME_SC_INVALID_NS);
+
+ /* Allocate return structure */
+ id = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
+ if (!id)
+ return NVME_SC_INTERNAL;
+
+ id->uuid_desc.nidt = NVME_NIDT_UUID;
+ id->uuid_desc.nidl = NVME_NIDT_UUID_LEN;
+ memcpy(&id->uuid, &ctx->ns->uuid, sizeof(id->uuid));
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait, id,
+ NVME_IDENTIFY_DATA_SIZE);
+ kfree(id);
+ return nvme_mdev_translate_error(ret);
+}
+
+/*Active Namespace ID list */
+static int nvme_mdev_adm_handle_id_active_ns_list(struct adm_ctx *ctx)
+{
+ u32 nsid, start_nsid = le32_to_cpu(ctx->in->identify.nsid);
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ int i = 0, ret;
+
+ __le32 *nslist = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
+
+ if (start_nsid >= 0xfffffffe)
+ return DNR(NVME_SC_INVALID_NS);
+
+ for (nsid = start_nsid + 1; nsid <= MAX_VIRTUAL_NAMESPACES; nsid++)
+ if (nvme_mdev_vns_from_vnsid(vctrl, nsid))
+ nslist[i++] = nsid;
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait, nslist,
+ NVME_IDENTIFY_DATA_SIZE);
+ kfree(nslist);
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Handle Identify command*/
+static int nvme_mdev_adm_handle_id(struct adm_ctx *ctx)
+{
+ const struct nvme_identify *in = &ctx->in->identify;
+
+ int ret = nvme_mdev_udata_iter_set_dptr(&ctx->udatait,
+ &ctx->in->common.dptr,
+ NVME_IDENTIFY_DATA_SIZE);
+
+ u32 nsid = le32_to_cpu(in->nsid);
+
+ if (ret)
+ return nvme_mdev_translate_error(ret);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_MPTR | RSRV_DW11_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (in->ctrlid)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ ctx->ns = nvme_mdev_vns_from_vnsid(ctx->vctrl, nsid);
+
+ switch (ctx->in->identify.cns) {
+ case NVME_ID_CNS_CTRL:
+ _DBG(ctx->vctrl, "ADMINQ: IDENTIFY CTRL\n");
+ return nvme_mdev_adm_handle_id_cntrl(ctx);
+ case NVME_ID_CNS_NS_ACTIVE_LIST:
+ _DBG(ctx->vctrl, "ADMINQ: IDENTIFY ACTIVE_NS_LIST\n");
+ return nvme_mdev_adm_handle_id_active_ns_list(ctx);
+ case NVME_ID_CNS_NS:
+ _DBG(ctx->vctrl, "ADMINQ: IDENTIFY NS=0x%08x\n", nsid);
+ return nvme_mdev_adm_handle_id_ns(ctx);
+ case NVME_ID_CNS_NS_DESC_LIST:
+ _DBG(ctx->vctrl, "ADMINQ: IDENTIFY NS_DESC NS=0x%08x\n", nsid);
+ return nvme_mdev_adm_handle_id_ns_desc(ctx);
+ default:
+ return DNR(NVME_SC_INVALID_FIELD);
+ }
+}
+
+/* Error log for AER */
+static int nvme_mdev_adm_handle_get_log_page_err(struct adm_ctx *ctx)
+{
+ struct nvme_err_log_entry dummy_entry;
+ int ret;
+
+ // write one dummy entry with 0 error count
+ memset(&dummy_entry, 0, sizeof(dummy_entry));
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait,
+ &dummy_entry,
+ min((unsigned int)sizeof(dummy_entry),
+ ctx->datalen));
+
+ return nvme_mdev_translate_error(ret);
+}
+
+/* This log page allows to tell user about connected/disconnected namespaces */
+static int nvme_mdev_adm_handle_get_log_page_changed_ns(struct adm_ctx *ctx)
+{
+ unsigned int datasize = min(ctx->vctrl->ns_log_size * 4, ctx->datalen);
+
+ int ret = nvme_mdev_write_to_udata(&ctx->udatait,
+ &ctx->vctrl->ns_log, datasize);
+
+ nvme_mdev_vns_log_reset(ctx->vctrl);
+ return nvme_mdev_translate_error(ret);
+}
+
+/* S.M.A.R.T. log*/
+static int nvme_mdev_adm_handle_get_log_page_smart(struct adm_ctx *ctx)
+{
+ unsigned int datasize = min_t(unsigned int,
+ sizeof(struct nvme_smart_log), ctx->datalen);
+ int ret;
+ struct nvme_smart_log *log = kzalloc(sizeof(*log), GFP_KERNEL);
+
+ if (!log)
+ return NVME_SC_INTERNAL;
+
+ /* Some dummy values */
+ log->avail_spare = 100;
+ log->spare_thresh = 10;
+ store_le16(&log->temperature, 0x140);
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait, log, datasize);
+ kfree(log);
+ return nvme_mdev_translate_error(ret);
+}
+
+/* FW slot log - useless */
+static int nvme_mdev_adm_handle_get_log_page_fw_slot(struct adm_ctx *ctx)
+{
+ unsigned int datasize = min_t(unsigned int,
+ sizeof(struct nvme_fw_slot_info_log),
+ ctx->datalen);
+ int ret;
+ struct nvme_fw_slot_info_log *log = kzalloc(sizeof(*log), GFP_KERNEL);
+
+ if (!log)
+ return NVME_SC_INTERNAL;
+
+ ret = nvme_mdev_write_to_udata(&ctx->udatait, log, datasize);
+ kfree(log);
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Response to GET LOG PAGE command */
+static int nvme_mdev_adm_handle_get_log_page(struct adm_ctx *ctx)
+{
+ const struct nvme_get_log_page_command *in = &ctx->in->get_log_page;
+ u8 log_page_id = ctx->in->get_log_page.lid;
+ int ret;
+
+ ctx->datalen = (le16_to_cpu(in->numdl) + 1) * 4;
+
+ /* We don't support extensions (NUMDU,LPOL,LPOU) */
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_MPTR | RSRV_DW11_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* Currently ignore the NSID in the command */
+
+ /* ACK the AER */
+ if ((in->lsp & 0x80) == 0)
+ nvme_mdev_event_process_ack(ctx->vctrl, log_page_id);
+
+ /* map data pointer */
+ ret = nvme_mdev_udata_iter_set_dptr(&ctx->udatait,
+ &in->dptr, ctx->datalen);
+ if (ret)
+ return nvme_mdev_translate_error(ret);
+
+ switch (log_page_id) {
+ case NVME_LOG_ERROR:
+ _DBG(ctx->vctrl, "ADMINQ: GET_LOG_PAGE : ERRLOG\n");
+ return nvme_mdev_adm_handle_get_log_page_err(ctx);
+ case NVME_LOG_CHANGED_NS:
+ _DBG(ctx->vctrl, "ADMINQ: GET_LOG_PAGE : CHANGED_NS\n");
+ return nvme_mdev_adm_handle_get_log_page_changed_ns(ctx);
+ case NVME_LOG_SMART:
+ _DBG(ctx->vctrl, "ADMINQ: GET_LOG_PAGE : SMART\n");
+ return nvme_mdev_adm_handle_get_log_page_smart(ctx);
+ case NVME_LOG_FW_SLOT:
+ _DBG(ctx->vctrl, "ADMINQ: GET_LOG_PAGE : FWSLOT\n");
+ return nvme_mdev_adm_handle_get_log_page_fw_slot(ctx);
+ default:
+ _DBG(ctx->vctrl, "ADMINQ: GET_LOG_PAGE : log page 0x%02x\n",
+ log_page_id);
+ return DNR(NVME_SC_INVALID_FIELD);
+ }
+}
+
+/* Response to CREATE CQ command */
+static int nvme_mdev_adm_handle_create_cq(struct adm_ctx *ctx)
+{
+ int irq = -1, ret;
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ const struct nvme_create_cq *in = &ctx->in->create_cq;
+ u16 cqid = le16_to_cpu(in->cqid);
+ u16 qsize = le16_to_cpu(in->qsize);
+ u16 cq_flags = le16_to_cpu(in->cq_flags);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 | RSRV_DPTR_PRP2 |
+ RSRV_MPTR | RSRV_DW12_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* QID checks*/
+ if (!cqid ||
+ cqid >= MAX_VIRTUAL_QUEUES || test_bit(cqid, vctrl->vcq_en))
+ return DNR(NVME_SC_QID_INVALID);
+
+ /* Queue size checks*/
+ if (qsize > (MAX_VIRTUAL_QUEUE_DEPTH - 1) || qsize < 1)
+ return DNR(NVME_SC_QUEUE_SIZE);
+
+ /* Queue flags checks */
+ if (cq_flags & ~(NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (cq_flags & NVME_CQ_IRQ_ENABLED) {
+ irq = le16_to_cpu(in->irq_vector);
+ if (irq >= MAX_VIRTUAL_IRQS)
+ return DNR(NVME_SC_INVALID_VECTOR);
+ }
+
+ ret = nvme_mdev_vcq_init(ctx->vctrl, cqid,
+ le64_to_cpu(in->prp1),
+ cq_flags & NVME_QUEUE_PHYS_CONTIG,
+ qsize + 1, irq);
+
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Response to DELETE CQ command */
+static int nvme_mdev_adm_handle_delete_cq(struct adm_ctx *ctx)
+{
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ const struct nvme_delete_queue *in = &ctx->in->delete_queue;
+ u16 qid = le16_to_cpu(in->qid), sqid;
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW11_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!qid || qid >= MAX_VIRTUAL_QUEUES || !test_bit(qid, vctrl->vcq_en))
+ return DNR(NVME_SC_QID_INVALID);
+
+ for_each_set_bit(sqid, vctrl->vsq_en, MAX_VIRTUAL_QUEUES)
+ if (vctrl->vsqs[sqid].vcq == &vctrl->vcqs[qid])
+ return DNR(NVME_SC_INVALID_QUEUE);
+
+ nvme_mdev_vcq_delete(vctrl, qid);
+ return NVME_SC_SUCCESS;
+}
+
+/* Response to CREATE SQ command */
+static int nvme_mdev_adm_handle_create_sq(struct adm_ctx *ctx)
+{
+ const struct nvme_create_sq *in = &ctx->in->create_sq;
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ int ret;
+
+ u16 sqid = le16_to_cpu(in->sqid);
+ u16 cqid = le16_to_cpu(in->cqid);
+ u16 qsize = le16_to_cpu(in->qsize);
+ u16 sq_flags = le16_to_cpu(in->sq_flags);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 | RSRV_DPTR_PRP2 |
+ RSRV_MPTR | RSRV_DW12_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!sqid ||
+ sqid >= MAX_VIRTUAL_QUEUES || test_bit(sqid, vctrl->vsq_en))
+ return DNR(NVME_SC_QID_INVALID);
+
+ if (!cqid || cqid >= MAX_VIRTUAL_QUEUES)
+ return DNR(NVME_SC_QID_INVALID);
+
+ if (!test_bit(cqid, vctrl->vcq_en))
+ return DNR(NVME_SC_CQ_INVALID);
+
+ /* Queue size checks */
+ if (qsize > (MAX_VIRTUAL_QUEUE_DEPTH - 1) || qsize < 1)
+ return DNR(NVME_SC_QUEUE_SIZE);
+
+ /* Queue flags checks */
+ if (sq_flags & ~(NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MASK))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ ret = nvme_mdev_vsq_init(ctx->vctrl, sqid,
+ le64_to_cpu(in->prp1),
+ sq_flags & NVME_QUEUE_PHYS_CONTIG,
+ qsize + 1, cqid);
+ if (ret)
+ goto error;
+
+ return NVME_SC_SUCCESS;
+error:
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Response to DELETE SQ command */
+static int nvme_mdev_adm_handle_delete_sq(struct adm_ctx *ctx)
+{
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ const struct nvme_delete_queue *in = &ctx->in->delete_queue;
+ u16 qid = le16_to_cpu(in->qid);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW11_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!qid || qid >= MAX_VIRTUAL_QUEUES || !test_bit(qid, vctrl->vsq_en))
+ return DNR(NVME_SC_QID_INVALID);
+
+ nvme_mdev_vsq_delete(ctx->vctrl, qid);
+ return NVME_SC_SUCCESS;
+}
+
+/* Set the shadow doorbell */
+static int nvme_mdev_adm_handle_dbbuf(struct adm_ctx *ctx)
+{
+ const struct nvme_dbbuf *in = &ctx->in->dbbuf;
+ int ret;
+
+ dma_addr_t sdb_iova = le64_to_cpu(in->prp1);
+ dma_addr_t eidx_iova = le64_to_cpu(in->prp2);
+
+ /* Check if we support the shadow doorbell */
+ if (!ctx->vctrl->mmio.shadow_db_supported)
+ return DNR(NVME_SC_INVALID_OPCODE);
+
+ /* Don't allow to enable the shadow doorbell more that once */
+ if (ctx->vctrl->mmio.shadow_db_en)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 |
+ RSRV_MPTR | RSRV_DW10_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* check input buffers */
+ if ((OFFSET_IN_PAGE(sdb_iova) != 0) || (OFFSET_IN_PAGE(eidx_iova) != 0))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* switch to the new doorbell buffer */
+ ret = nvme_mdev_mmio_enable_dbs_shadow(ctx->vctrl, sdb_iova, eidx_iova);
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Response to GET_FEATURES command */
+static int nvme_mdev_adm_handle_get_features(struct adm_ctx *ctx)
+{
+ u32 value = 0;
+ u32 irq;
+ const struct nvme_features *in = &ctx->in->features;
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ unsigned int tmp;
+
+ u32 fid = le32_to_cpu(in->fid);
+ u16 cid = le16_to_cpu(in->command_id);
+
+ _DBG(ctx->vctrl, "ADMINQ: GET_FEATURES FID=0x%x\n", fid);
+
+ /* common reserved bits*/
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW12_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* reserved bits in dword10*/
+ if (fid > 0xFF)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* reserved bits in dword11*/
+ if (fid != NVME_FEAT_IRQ_CONFIG && in->dword11 != 0)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ switch (fid) {
+ /* Number of queues */
+ case NVME_FEAT_NUM_QUEUES:
+ value = (MAX_VIRTUAL_QUEUES - 1) |
+ ((MAX_VIRTUAL_QUEUES - 1) << 16);
+ goto out;
+
+ /* Arbitration */
+ case NVME_FEAT_ARBITRATION:
+ value = vctrl->arb_burst_shift & 0x7;
+ goto out;
+
+ /* Interrupt coalescing settings*/
+ case NVME_FEAT_IRQ_COALESCE:
+ tmp = vctrl->irqs.irq_coalesc_time_us;
+ do_div(tmp, 100);
+ value = (vctrl->irqs.irq_coalesc_max - 1) | (tmp << 8);
+ goto out;
+
+ /* Interrupt coalescing disable for a specific interrupt */
+ case NVME_FEAT_IRQ_CONFIG:
+ irq = le32_to_cpu(in->dword11);
+ if (irq >= MAX_VIRTUAL_IRQS)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ value = irq;
+ if (vctrl->irqs.vecs[irq].irq_coalesc_en)
+ value |= (1 << 16);
+ goto out;
+
+ /* Volatile write cache */
+ case NVME_FEAT_VOLATILE_WC:
+ /*we always report write cache due to mediation*/
+ value = 0x1;
+ goto out;
+
+ /* Limited error recovery */
+ case NVME_FEAT_ERR_RECOVERY:
+ value = 0;
+ break;
+
+ /* Workload hint + power state */
+ case NVME_FEAT_POWER_MGMT:
+ value = vctrl->worload_hint << 4;
+ break;
+
+ /* Temperature threshold */
+ case NVME_FEAT_TEMP_THRESH:
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* AEN permanent masking*/
+ case NVME_FEAT_ASYNC_EVENT:
+ value = nvme_mdev_event_read_aen_config(vctrl);
+ goto out;
+ default:
+ return DNR(NVME_SC_INVALID_FIELD);
+ }
+out:
+ nvme_mdev_vsq_cmd_done_adm(ctx->vctrl, value, cid, NVME_SC_SUCCESS);
+ return -1;
+}
+
+/* Response to SET_FEATURES command */
+static int nvme_mdev_adm_handle_set_features(struct adm_ctx *ctx)
+{
+ const struct nvme_features *in = &ctx->in->features;
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+
+ u32 value = le32_to_cpu(in->dword11);
+ u8 fid = le32_to_cpu(in->fid) & 0xFF;
+ u16 cid = le16_to_cpu(in->command_id);
+ u32 nsid = le32_to_cpu(in->nsid);
+
+ _DBG(ctx->vctrl, "ADMINQ: SET_FEATURES cmd. FID=0x%x\n", fid);
+
+ if (nsid != 0xffffffff && nsid != 0)
+ return DNR(NVME_SC_FEATURE_NOT_PER_NS);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW12_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ switch (fid) {
+ case NVME_FEAT_NUM_QUEUES:
+ /* need to return the value here as well */
+ value = (MAX_VIRTUAL_QUEUES - 1) |
+ ((MAX_VIRTUAL_QUEUES - 1) << 16);
+
+ nvme_mdev_vsq_cmd_done_adm(ctx->vctrl, value,
+ cid, NVME_SC_SUCCESS);
+ return -1;
+
+ case NVME_FEAT_ARBITRATION:
+ vctrl->arb_burst_shift = value & 0x7;
+ return NVME_SC_SUCCESS;
+
+ case NVME_FEAT_IRQ_COALESCE:
+ vctrl->irqs.irq_coalesc_max = (value & 0xFF) + 1;
+ vctrl->irqs.irq_coalesc_time_us = ((value >> 8) & 0xFF) * 100;
+ return NVME_SC_SUCCESS;
+
+ case NVME_FEAT_IRQ_CONFIG: {
+ u16 irq = value & 0xFFFF;
+
+ if (irq >= MAX_VIRTUAL_IRQS)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ vctrl->irqs.vecs[irq].irq_coalesc_en = (value & 0x10000) != 0;
+ return NVME_SC_SUCCESS;
+ }
+ case NVME_FEAT_VOLATILE_WC:
+ return (value != 0x1) ? DNR(NVME_SC_FEATURE_NOT_CHANGEABLE) :
+ NVME_SC_SUCCESS;
+
+ case NVME_FEAT_ERR_RECOVERY:
+ return (value != 0) ? DNR(NVME_SC_FEATURE_NOT_CHANGEABLE) :
+ NVME_SC_SUCCESS;
+ case NVME_FEAT_POWER_MGMT:
+ if (value & 0xFFFFFF0F)
+ return DNR(NVME_SC_INVALID_FIELD);
+ vctrl->worload_hint = value >> 4;
+ return NVME_SC_SUCCESS;
+
+ case NVME_FEAT_TEMP_THRESH:
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ case NVME_FEAT_ASYNC_EVENT:
+ nvme_mdev_event_set_aen_config(vctrl, value);
+ return NVME_SC_SUCCESS;
+ default:
+ return DNR(NVME_SC_INVALID_FIELD);
+ }
+}
+
+/* Response to AER command */
+static int nvme_mdev_adm_handle_async_event(struct adm_ctx *ctx)
+{
+ u16 cid = le16_to_cpu(ctx->in->common.command_id);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW10_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ return nvme_mdev_event_request_receive(ctx->vctrl, cid);
+}
+
+/* (Dummy) response to ABORT command*/
+static int nvme_mdev_adm_handle_abort(struct adm_ctx *ctx)
+{
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_NSID | RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW10_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ return DNR(NVME_SC_ABORT_MISSING);
+}
+
+/* Process one new command in the admin queue*/
+static int nvme_mdev_adm_handle_cmd(struct adm_ctx *ctx)
+{
+ u8 optcode = ctx->in->common.opcode;
+
+ ctx->ns = NULL;
+ ctx->datalen = 0;
+
+ if (ctx->in->common.flags != 0)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ switch (optcode) {
+ case nvme_admin_identify:
+ return nvme_mdev_adm_handle_id(ctx);
+ case nvme_admin_create_cq:
+ _DBG(ctx->vctrl, "ADMINQ: CREATE_CQ\n");
+ return nvme_mdev_adm_handle_create_cq(ctx);
+ case nvme_admin_create_sq:
+ _DBG(ctx->vctrl, "ADMINQ: CREATE_SQ\n");
+ return nvme_mdev_adm_handle_create_sq(ctx);
+ case nvme_admin_delete_sq:
+ _DBG(ctx->vctrl, "ADMINQ: DELETE_SQ\n");
+ return nvme_mdev_adm_handle_delete_sq(ctx);
+ case nvme_admin_delete_cq:
+ _DBG(ctx->vctrl, "ADMINQ: DELETE_CQ\n");
+ return nvme_mdev_adm_handle_delete_cq(ctx);
+ case nvme_admin_dbbuf:
+ _DBG(ctx->vctrl, "ADMINQ: DBBUF_CONFIG\n");
+ return nvme_mdev_adm_handle_dbbuf(ctx);
+ case nvme_admin_get_log_page:
+ return nvme_mdev_adm_handle_get_log_page(ctx);
+ case nvme_admin_get_features:
+ return nvme_mdev_adm_handle_get_features(ctx);
+ case nvme_admin_set_features:
+ return nvme_mdev_adm_handle_set_features(ctx);
+ case nvme_admin_async_event:
+ _DBG(ctx->vctrl, "ADMINQ: ASYNC_EVENT_REQ\n");
+ return nvme_mdev_adm_handle_async_event(ctx);
+ case nvme_admin_abort_cmd:
+ _DBG(ctx->vctrl, "ADMINQ: ABORT\n");
+ return nvme_mdev_adm_handle_abort(ctx);
+ default:
+ _DBG(ctx->vctrl, "ADMINQ: optcode 0x%04x\n", optcode);
+ return DNR(NVME_SC_INVALID_OPCODE);
+ }
+}
+
+/* Process all pending admin commands */
+void nvme_mdev_adm_process_sq(struct nvme_mdev_vctrl *vctrl)
+{
+ struct adm_ctx ctx;
+
+ lockdep_assert_held(&vctrl->lock);
+ memset(&ctx, 0, sizeof(struct adm_ctx));
+ ctx.vctrl = vctrl;
+ ctx.hctrl = vctrl->hctrl;
+ nvme_mdev_udata_iter_setup(&vctrl->viommu, &ctx.udatait);
+
+ nvme_mdev_io_pause(ctx.vctrl);
+
+ while (!(nvme_mdev_vctrl_is_dead(vctrl))) {
+ int ret;
+ u16 cid;
+
+ ctx.in = nvme_mdev_vsq_get_cmd(vctrl, &vctrl->vsqs[0]);
+ if (!ctx.in)
+ break;
+
+ cid = le16_to_cpu(ctx.in->common.command_id);
+ ret = nvme_mdev_adm_handle_cmd(&ctx);
+
+ if (ret == -1)
+ continue;
+
+ if (ret != 0)
+ _DBG(vctrl, "ADMINQ: CID 0x%x FAILED: status 0x%x\n",
+ cid, ret);
+ nvme_mdev_vsq_cmd_done_adm(vctrl, 0, cid, ret);
+ }
+ nvme_mdev_io_resume(ctx.vctrl);
+}
diff --git a/drivers/nvme/mdev/events.c b/drivers/nvme/mdev/events.c
new file mode 100644
index 000000000000..9854c1cabdcb
--- /dev/null
+++ b/drivers/nvme/mdev/events.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe async events implementation (AER, changed namespace log)
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include "priv.h"
+
+/* complete an AER event on the admin queue if it is pending*/
+static void nvme_mdev_event_complete(struct nvme_mdev_vctrl *vctrl)
+{
+ u16 lid, cid;
+ u32 dw0;
+
+ for_each_set_bit(lid, vctrl->events.events_pending, MAX_LOG_PAGES) {
+ /* we have pending aer requests, but no requests*/
+ if (vctrl->events.aer_cid_count == 0)
+ break;
+
+ if (!test_bit(lid, vctrl->events.events_enabled))
+ continue;
+
+ cid = vctrl->events.aer_cids[--vctrl->events.aer_cid_count];
+ dw0 = vctrl->events.event_values[lid];
+ clear_bit(lid, vctrl->events.events_pending);
+
+ _DBG(vctrl,
+ "AEN: replying to AER (CID=%d) with status 0x%08x\n",
+ cid, dw0);
+
+ nvme_mdev_vsq_cmd_done_adm(vctrl, dw0, cid, NVME_SC_SUCCESS);
+ }
+}
+
+/* deal with received async event request from the user*/
+int nvme_mdev_event_request_receive(struct nvme_mdev_vctrl *vctrl,
+ u16 cid)
+{
+ int cnt = vctrl->events.aer_cid_count;
+
+ if (cnt >= MAX_AER_COMMANDS)
+ return DNR(NVME_SC_ASYNC_LIMIT);
+
+ /* don't allow AER to be pending if there is no space left in the
+ * completion queue permanently
+ */
+ if ((cnt + 1) >= vctrl->vcqs[0].size - 1)
+ return DNR(NVME_SC_ASYNC_LIMIT);
+
+ vctrl->events.aer_cids[cnt++] = cid;
+ vctrl->events.aer_cid_count = cnt;
+
+ _DBG(vctrl, "AEN: received new request (cid=%d)\n", cid);
+ nvme_mdev_event_complete(vctrl);
+ return -1;
+}
+
+/* Send an async event request */
+void nvme_mdev_event_send(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_async_event_type type,
+ enum nvme_async_event info)
+{
+ u8 log_page;
+ u32 event;
+
+ // determine the log page for event types that we support
+ switch (type) {
+ case NVME_AER_TYPE_ERROR:
+ log_page = NVME_LOG_ERROR;
+ break;
+ case NVME_AER_TYPE_SMART:
+ log_page = NVME_LOG_SMART;
+ break;
+ case NVME_AER_TYPE_NOTICE:
+ WARN_ON(info != NVME_AER_NOTICE_NS_CHANGED);
+ log_page = NVME_LOG_CHANGED_NS;
+ break;
+ default:
+ WARN_ON(1);
+ return;
+ }
+
+ if (test_and_set_bit(log_page, vctrl->events.events_masked))
+ return;
+
+ event = (u32)type | ((u32)info << 8) | ((u32)log_page << 16);
+ vctrl->events.event_values[log_page] = event;
+ set_bit(log_page, vctrl->events.events_masked);
+ set_bit(log_page, vctrl->events.events_pending);
+ nvme_mdev_event_complete(vctrl);
+}
+
+u32 nvme_mdev_event_read_aen_config(struct nvme_mdev_vctrl *vctrl)
+{
+ u32 value = 0;
+
+ if (test_bit(NVME_LOG_CHANGED_NS, vctrl->events.events_enabled))
+ value |= NVME_AEN_CFG_NS_ATTR;
+ return value;
+}
+
+void nvme_mdev_event_set_aen_config(struct nvme_mdev_vctrl *vctrl, u32 value)
+{
+ _DBG(vctrl, "AEN: set config: 0x%04x\n", value);
+
+ if (value & NVME_AEN_CFG_NS_ATTR)
+ set_bit(NVME_LOG_CHANGED_NS, vctrl->events.events_enabled);
+ else
+ clear_bit(NVME_LOG_CHANGED_NS, vctrl->events.events_enabled);
+
+ nvme_mdev_event_complete(vctrl);
+}
+
+/* called when user acks an log page which causes an AER event to be unmasked*/
+void nvme_mdev_event_process_ack(struct nvme_mdev_vctrl *vctrl, u8 log_page)
+{
+ lockdep_assert_held(&vctrl->lock);
+
+ _DBG(vctrl, "AEN: log page %d ACK\n", log_page);
+
+ if (log_page >= MAX_LOG_PAGES)
+ return;
+
+ clear_bit(log_page, vctrl->events.events_masked);
+ nvme_mdev_event_complete(vctrl);
+}
+
+/* reset event state*/
+void nvme_mdev_events_init(struct nvme_mdev_vctrl *vctrl)
+{
+ memset(&vctrl->events, 0, sizeof(vctrl->events));
+ set_bit(NVME_LOG_CHANGED_NS, vctrl->events.events_enabled);
+ set_bit(NVME_LOG_ERROR, vctrl->events.events_enabled);
+}
+
+/* reset event state*/
+void nvme_mdev_events_reset(struct nvme_mdev_vctrl *vctrl)
+{
+ memset(&vctrl->events, 0, sizeof(vctrl->events));
+}
+
diff --git a/drivers/nvme/mdev/host.c b/drivers/nvme/mdev/host.c
new file mode 100644
index 000000000000..5766bad7e909
--- /dev/null
+++ b/drivers/nvme/mdev/host.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe parent (host) device abstraction
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/nvme.h>
+#include <linux/mdev.h>
+#include <linux/module.h>
+#include "priv.h"
+
+static LIST_HEAD(nvme_mdev_hctrl_list);
+static DEFINE_MUTEX(nvme_mdev_hctrl_list_mutex);
+static struct nvme_mdev_inst_type **instance_types;
+
+unsigned int io_timeout_ms = 30000;
+module_param_named(io_timeout, io_timeout_ms, uint, 0644);
+MODULE_PARM_DESC(io_timeout,
+ "Maximum I/O command completion timeout (in msec)");
+
+unsigned int poll_timeout_ms = 500;
+module_param_named(poll_timeout, poll_timeout_ms, uint, 0644);
+MODULE_PARM_DESC(poll_timeout,
+ "Maximum idle time to keep polling (in msec) (0 - poll forever)");
+
+unsigned int admin_poll_rate_ms = 100;
+module_param_named(admin_poll_rate, poll_timeout_ms, uint, 0644);
+MODULE_PARM_DESC(admin_poll_rate,
+ "Admin queue polling rate (in msec) (used only when shadow doorbell is disabled)");
+
+bool use_shadow_doorbell = true;
+module_param(use_shadow_doorbell, bool, 0644);
+MODULE_PARM_DESC(use_shadow_doorbell,
+ "Enable the shadow doorbell NVMe extension");
+
+/* Create a new host controller */
+static struct nvme_mdev_hctrl *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
+{
+ struct nvme_mdev_hctrl *hctrl;
+ u32 max_lba_transfer;
+ unsigned int nr_host_queues;
+
+ /* TODOLATER: IO: support more page size configurations*/
+ if (ctrl->page_size != PAGE_SIZE) {
+ dev_info(ctrl->dev, "no support for mdev - page_size mismatch");
+ return NULL;
+ }
+
+ nr_host_queues = ctrl->ops->ext_queues_available(ctrl);
+ max_lba_transfer = ctrl->max_hw_sectors >> (PAGE_SHIFT - 9);
+
+ if (nr_host_queues == 0) {
+ dev_info(ctrl->dev,
+ "no support for mdev - no mdev reserved queues available");
+ return NULL;
+ }
+
+ hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
+ dev_to_node(ctrl->dev));
+ if (!hctrl)
+ return NULL;
+
+ kref_init(&hctrl->ref);
+ mutex_init(&hctrl->lock);
+
+ hctrl->nvme_ctrl = ctrl;
+ nvme_get_ctrl(ctrl);
+
+ hctrl->oncs = ctrl->oncs &
+ (NVME_CTRL_ONCS_DSM | NVME_CTRL_ONCS_WRITE_ZEROES);
+
+ hctrl->id = ctrl->instance;
+ hctrl->node = dev_to_node(ctrl->dev);
+ hctrl->mdts = ilog2(__rounddown_pow_of_two(max_lba_transfer));
+ hctrl->nr_host_queues = nr_host_queues;
+
+ mutex_lock(&nvme_mdev_hctrl_list_mutex);
+
+ dev_info(ctrl->dev,
+ "mediated nvme support enabled, using up to %d host queues\n",
+ hctrl->nr_host_queues);
+
+ list_add_tail(&hctrl->link, &nvme_mdev_hctrl_list);
+
+ mutex_unlock(&nvme_mdev_hctrl_list_mutex);
+
+ if (mdev_register_device(ctrl->dev, &mdev_fops) < 0) {
+ nvme_put_ctrl(ctrl);
+ kfree(hctrl);
+ return NULL;
+ }
+ return hctrl;
+}
+
+/* Release an unused host controller*/
+static void nvme_mdev_hctrl_free(struct kref *ref)
+{
+ struct nvme_mdev_hctrl *hctrl =
+ container_of(ref, struct nvme_mdev_hctrl, ref);
+
+ dev_info(hctrl->nvme_ctrl->dev, "mediated nvme support disabled");
+
+ nvme_put_ctrl(hctrl->nvme_ctrl);
+ hctrl->nvme_ctrl = NULL;
+ kfree(hctrl);
+}
+
+/* Lookup a host controller based on mdev parent device*/
+struct nvme_mdev_hctrl *nvme_mdev_hctrl_lookup_get(struct device *parent)
+{
+ struct nvme_mdev_hctrl *hctrl = NULL, *tmp;
+
+ mutex_lock(&nvme_mdev_hctrl_list_mutex);
+ list_for_each_entry(tmp, &nvme_mdev_hctrl_list, link) {
+ if (tmp->nvme_ctrl->dev == parent) {
+ hctrl = tmp;
+ kref_get(&hctrl->ref);
+ break;
+ }
+ }
+ mutex_unlock(&nvme_mdev_hctrl_list_mutex);
+ return hctrl;
+}
+
+/* Release a held reference to a host controller*/
+void nvme_mdev_hctrl_put(struct nvme_mdev_hctrl *hctrl)
+{
+ kref_put(&hctrl->ref, nvme_mdev_hctrl_free);
+}
+
+/* Destroy a host controller. It might still be kept in zombie state
+ * if someone uses a reference to it
+ */
+static void nvme_mdev_hctrl_destroy(struct nvme_mdev_hctrl *hctrl)
+{
+ mutex_lock(&nvme_mdev_hctrl_list_mutex);
+ list_del(&hctrl->link);
+ mutex_unlock(&nvme_mdev_hctrl_list_mutex);
+
+ hctrl->removing = true;
+ mdev_unregister_device(hctrl->nvme_ctrl->dev);
+ nvme_mdev_hctrl_put(hctrl);
+}
+
+/* Check how many host queues are still available */
+int nvme_mdev_hctrl_hqs_available(struct nvme_mdev_hctrl *hctrl)
+{
+ int ret;
+
+ mutex_lock(&hctrl->lock);
+ ret = hctrl->nr_host_queues;
+ mutex_unlock(&hctrl->lock);
+ return ret;
+}
+
+/* Reserve N host IO queues, for later allocation to a specific user*/
+bool nvme_mdev_hctrl_hqs_reserve(struct nvme_mdev_hctrl *hctrl,
+ unsigned int n)
+{
+ mutex_lock(&hctrl->lock);
+
+ if (n > hctrl->nr_host_queues) {
+ mutex_unlock(&hctrl->lock);
+ return false;
+ }
+
+ hctrl->nr_host_queues -= n;
+ mutex_unlock(&hctrl->lock);
+ return true;
+}
+
+/* Free N host IO queues, for allocation for other users*/
+void nvme_mdev_hctrl_hqs_unreserve(struct nvme_mdev_hctrl *hctrl,
+ unsigned int n)
+{
+ mutex_lock(&hctrl->lock);
+ hctrl->nr_host_queues += n;
+ mutex_unlock(&hctrl->lock);
+}
+
+/* Allocate a host IO queue */
+int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl)
+{
+ u16 qid = 0;
+ int ret = hctrl->nvme_ctrl->ops->ext_queue_alloc(hctrl->nvme_ctrl,
+ &qid);
+
+ if (ret)
+ return ret;
+ return qid;
+}
+
+/* Free an host IO queue */
+void nvme_mdev_hctrl_hq_free(struct nvme_mdev_hctrl *hctrl, u16 qid)
+{
+ hctrl->nvme_ctrl->ops->ext_queue_free(hctrl->nvme_ctrl, qid);
+}
+
+/* Check if we can submit another IO passthrough command */
+bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid)
+{
+ return hctrl->nvme_ctrl->ops->ext_queue_full(hctrl->nvme_ctrl, qid);
+}
+
+/* Check if IO passthrough is supported for given IO optcode */
+bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
+{
+ switch (optcode) {
+ case nvme_cmd_flush:
+ case nvme_cmd_read:
+ case nvme_cmd_write:
+ /* these are mandatory*/
+ return true;
+ case nvme_cmd_write_zeroes:
+ return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
+ case nvme_cmd_dsm:
+ return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
+ default:
+ return false;
+ }
+}
+
+/* Submit a IO passthrough command */
+int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
+ u16 qid, u32 tag,
+ struct nvme_command *cmd,
+ struct nvme_ext_data_iter *datait)
+{
+ struct nvme_ctrl *ctrl = hctrl->nvme_ctrl;
+
+ return ctrl->ops->ext_queue_submit(ctrl, qid, tag, cmd, datait);
+}
+
+/* Poll for completion of IO passthrough commands */
+int nvme_mdev_hctrl_hq_poll(struct nvme_mdev_hctrl *hctrl,
+ u32 qid,
+ struct nvme_ext_cmd_result *results,
+ unsigned int max_len)
+{
+ struct nvme_ctrl *ctrl = hctrl->nvme_ctrl;
+
+ return ctrl->ops->ext_queue_poll(ctrl, qid, results, max_len);
+}
+
+/* Destroy all host controllers */
+void nvme_mdev_hctrl_destroy_all(void)
+{
+ struct nvme_mdev_hctrl *hctrl = NULL, *tmp;
+
+ list_for_each_entry_safe(hctrl, tmp, &nvme_mdev_hctrl_list, link) {
+ list_del(&hctrl->link);
+ hctrl->removing = true;
+ mdev_unregister_device(hctrl->nvme_ctrl->dev);
+ nvme_mdev_hctrl_put(hctrl);
+ }
+}
+
+/* Get the mdev instance given it sysfs name */
+struct nvme_mdev_inst_type *nvme_mdev_inst_type_get(const char *name)
+{
+ int i;
+
+ for (i = 0; instance_types[i]; i++) {
+ const char *test =
+ name + strlen(name) - strlen(instance_types[i]->name);
+
+ if (strcmp(instance_types[i]->name, test) == 0)
+ return instance_types[i];
+ }
+ return NULL;
+}
+
+/* This shows name of the instance type */
+static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+ return sprintf(buf, "%s\n", kobj->name);
+}
+static MDEV_TYPE_ATTR_RO(name);
+
+/* This shows description of the instance type */
+static ssize_t description_show(struct kobject *kobj,
+ struct device *dev, char *buf)
+{
+ struct nvme_mdev_inst_type *type = nvme_mdev_inst_type_get(kobj->name);
+
+ return sprintf(buf,
+ "MDEV nvme device, using maximum %d hw submission queues\n",
+ type->max_hw_queues);
+}
+static MDEV_TYPE_ATTR_RO(description);
+
+/* This shows the device API of the instance type */
+static ssize_t device_api_show(struct kobject *kobj,
+ struct device *dev, char *buf)
+{
+ return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+static MDEV_TYPE_ATTR_RO(device_api);
+
+/* This shows how many instances of this instance type can be created */
+static ssize_t available_instances_show(struct kobject *kobj,
+ struct device *dev, char *buf)
+{
+ struct nvme_mdev_inst_type *type = nvme_mdev_inst_type_get(kobj->name);
+ struct nvme_mdev_hctrl *hctrl = nvme_mdev_hctrl_lookup_get(dev);
+ int count;
+
+ if (!hctrl)
+ return -ENODEV;
+
+ count = nvme_mdev_hctrl_hqs_available(hctrl);
+ do_div(count, type->max_hw_queues);
+
+ nvme_mdev_hctrl_put(hctrl);
+ return sprintf(buf, "%d\n", count);
+}
+static MDEV_TYPE_ATTR_RO(available_instances);
+
+static struct attribute *nvme_mdev_types_attrs[] = {
+ &mdev_type_attr_name.attr,
+ &mdev_type_attr_description.attr,
+ &mdev_type_attr_device_api.attr,
+ &mdev_type_attr_available_instances.attr,
+ NULL,
+};
+
+/* Undo the creation of mdev array of instance types */
+static void nvme_mdev_instance_types_fini(struct mdev_parent_ops *ops)
+{
+ int i;
+
+ for (i = 0; instance_types[i]; i++) {
+ struct nvme_mdev_inst_type *type = instance_types[i];
+
+ kfree(type->attrgroup);
+ kfree(type);
+ }
+
+ kfree(instance_types);
+ instance_types = NULL;
+
+ kfree(ops->supported_type_groups);
+ ops->supported_type_groups = NULL;
+}
+
+/* Create the array of mdev instance types from our array of them */
+static int nvme_mdev_instance_types_init(struct mdev_parent_ops *ops)
+{
+ unsigned int i;
+ struct nvme_mdev_inst_type *type;
+ struct attribute_group *attrgroup;
+
+ ops->supported_type_groups = kzalloc(sizeof(struct attribute_group *)
+ * (MAX_HOST_QUEUES + 1), GFP_KERNEL);
+
+ if (!ops->supported_type_groups)
+ return -ENOMEM;
+
+ instance_types = kzalloc(sizeof(struct nvme_mdev_inst_type *)
+ * MAX_HOST_QUEUES + 1, GFP_KERNEL);
+
+ if (!instance_types) {
+ kfree(ops->supported_type_groups);
+ ops->supported_type_groups = NULL;
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < MAX_HOST_QUEUES; i++) {
+ type = kzalloc(sizeof(*type), GFP_KERNEL);
+ if (!type) {
+ nvme_mdev_instance_types_fini(ops);
+ return -ENOMEM;
+ }
+ snprintf(type->name, sizeof(type->name), "%dQ_V1", i + 1);
+ type->max_hw_queues = i + 1;
+
+ attrgroup = kzalloc(sizeof(*attrgroup), GFP_KERNEL);
+ if (!attrgroup) {
+ kfree(type);
+ nvme_mdev_instance_types_fini(ops);
+ return -ENOMEM;
+ }
+
+ attrgroup->attrs = nvme_mdev_types_attrs;
+ attrgroup->name = type->name;
+ type->attrgroup = attrgroup;
+ instance_types[i] = type;
+ ops->supported_type_groups[i] = attrgroup;
+ }
+ return 0;
+}
+
+/* Updates in host controller state*/
+static void nvme_mdev_nvme_ctrl_state_changed(struct nvme_ctrl *ctrl)
+{
+ struct nvme_mdev_hctrl *hctrl = nvme_mdev_hctrl_lookup_get(ctrl->dev);
+ struct nvme_mdev_vctrl *vctrl;
+
+ switch (ctrl->state) {
+ case NVME_CTRL_NEW:
+ /* do nothing as new controller is not yet initialized*/
+ break;
+
+ case NVME_CTRL_LIVE:
+ /* new controller is live, create a mdev for it*/
+ if (!hctrl) {
+ hctrl = nvme_mdev_hctrl_create(ctrl);
+ return;
+ /* a controller is live again after reset/reconnect/suspend*/
+ } else {
+ mutex_lock(&nvme_mdev_vctrl_list_mutex);
+ list_for_each_entry(vctrl, &nvme_mdev_vctrl_list, link)
+ if (vctrl->hctrl == hctrl)
+ nvme_mdev_vctrl_resume(vctrl);
+ mutex_unlock(&nvme_mdev_vctrl_list_mutex);
+ }
+ break;
+
+ case NVME_CTRL_RESETTING:
+ case NVME_CTRL_CONNECTING:
+ case NVME_CTRL_SUSPENDED:
+ /* controller is temporarily not usable, stop using its queues*/
+ if (!hctrl)
+ return;
+
+ mutex_lock(&nvme_mdev_vctrl_list_mutex);
+ list_for_each_entry(vctrl, &nvme_mdev_vctrl_list, link)
+ if (vctrl->hctrl == hctrl)
+ nvme_mdev_vctrl_pause(vctrl);
+ mutex_unlock(&nvme_mdev_vctrl_list_mutex);
+ break;
+
+ case NVME_CTRL_DELETING:
+ case NVME_CTRL_DEAD:
+ case NVME_CTRL_ADMIN_ONLY:
+ /* host nvme controller is dead, remove it*/
+ if (!hctrl)
+ return;
+ nvme_mdev_hctrl_destroy(hctrl);
+ break;
+ }
+
+ if (hctrl)
+ nvme_mdev_hctrl_put(hctrl);
+}
+
+/* A host namespace might have its properties changed/removed.*/
+static void nvme_mdev_nvme_ctrl_ns_updated(struct nvme_ctrl *ctrl,
+ u32 nsid, bool removed)
+{
+ struct nvme_mdev_vctrl *vctrl;
+ struct nvme_mdev_hctrl *hctrl = nvme_mdev_hctrl_lookup_get(ctrl->dev);
+
+ if (!hctrl)
+ return;
+
+ mutex_lock(&nvme_mdev_vctrl_list_mutex);
+ list_for_each_entry(vctrl, &nvme_mdev_vctrl_list, link)
+ if (vctrl->hctrl == hctrl)
+ nvme_mdev_vns_host_ns_update(vctrl, nsid, removed);
+ mutex_unlock(&nvme_mdev_vctrl_list_mutex);
+ nvme_mdev_hctrl_put(hctrl);
+}
+
+static struct nvme_mdev_driver nvme_mdev_driver = {
+ .owner = THIS_MODULE,
+ .nvme_ctrl_state_changed = nvme_mdev_nvme_ctrl_state_changed,
+ .nvme_ns_state_changed = nvme_mdev_nvme_ctrl_ns_updated,
+};
+
+static int __init nvme_mdev_init(void)
+{
+ int ret;
+
+ nvme_mdev_instance_types_init(&mdev_fops);
+ ret = nvme_core_register_mdev_driver(&nvme_mdev_driver);
+ if (ret) {
+ nvme_mdev_instance_types_fini(&mdev_fops);
+ return ret;
+ }
+
+ pr_info("nvme_mdev " NVME_MDEV_FIRMWARE_VERSION " loaded\n");
+ return 0;
+}
+
+static void __exit nvme_mdev_exit(void)
+{
+ nvme_core_unregister_mdev_driver(&nvme_mdev_driver);
+ nvme_mdev_hctrl_destroy_all();
+ nvme_mdev_instance_types_fini(&mdev_fops);
+ pr_info("nvme_mdev unloaded\n");
+}
+
+MODULE_AUTHOR("Maxim Levitsky <[email protected]>");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(NVME_MDEV_FIRMWARE_VERSION);
+
+module_init(nvme_mdev_init)
+module_exit(nvme_mdev_exit)
+
diff --git a/drivers/nvme/mdev/instance.c b/drivers/nvme/mdev/instance.c
new file mode 100644
index 000000000000..0e5ba5a9269f
--- /dev/null
+++ b/drivers/nvme/mdev/instance.c
@@ -0,0 +1,804 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Mediated NVMe instance VFIO code
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/vfio.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+#include "priv.h"
+
+#define OFFSET_TO_REGION(offset) ((offset) >> 20)
+#define REGION_TO_OFFSET(nr) (((u64)nr) << 20)
+
+LIST_HEAD(nvme_mdev_vctrl_list);
+/*protects the list */
+DEFINE_MUTEX(nvme_mdev_vctrl_list_mutex);
+
+struct mdev_nvme_vfio_region_info {
+ struct vfio_region_info base;
+ struct vfio_region_info_cap_sparse_mmap mmap_cap;
+};
+
+/* User memory added*/
+static int nvme_mdev_map_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct vfio_iommu_type1_dma_map *map = data;
+ struct nvme_mdev_vctrl *vctrl =
+ container_of(nb, struct nvme_mdev_vctrl, vfio_map_notifier);
+
+ int ret = nvme_mdev_vctrl_viommu_map(vctrl, map->flags,
+ map->iova, map->size);
+ return ret ? NOTIFY_OK : notifier_from_errno(ret);
+}
+
+/* User memory removed*/
+static int nvme_mdev_unmap_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct nvme_mdev_vctrl *vctrl =
+ container_of(nb, struct nvme_mdev_vctrl, vfio_unmap_notifier);
+ struct vfio_iommu_type1_dma_unmap *unmap = data;
+
+ int ret = nvme_mdev_vctrl_viommu_unmap(vctrl, unmap->iova, unmap->size);
+
+ WARN_ON(ret <= 0);
+ return NOTIFY_OK;
+}
+
+/* Called when new mediated device is created */
+static int nvme_mdev_ops_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+ int ret = 0;
+ const struct nvme_mdev_inst_type *type = NULL;
+ struct nvme_mdev_vctrl *vctrl;
+ struct nvme_mdev_hctrl *hctrl = NULL;
+
+ hctrl = nvme_mdev_hctrl_lookup_get(mdev_parent_dev(mdev));
+ if (!hctrl)
+ return -ENODEV;
+
+ type = nvme_mdev_inst_type_get(kobj->name);
+ vctrl = nvme_mdev_vctrl_create(mdev, hctrl, type->max_hw_queues);
+
+ if (IS_ERR(vctrl)) {
+ ret = PTR_ERR(vctrl);
+ goto out;
+ }
+
+ mutex_lock(&nvme_mdev_vctrl_list_mutex);
+ list_add_tail(&vctrl->link, &nvme_mdev_vctrl_list);
+ mutex_unlock(&nvme_mdev_vctrl_list_mutex);
+out:
+ nvme_mdev_hctrl_put(hctrl);
+ return ret;
+}
+
+/* Called when a mediated device is removed */
+static int nvme_mdev_ops_remove(struct mdev_device *mdev)
+{
+ struct nvme_mdev_vctrl *vctrl = mdev_to_vctrl(mdev);
+
+ if (!vctrl)
+ return -ENODEV;
+ return nvme_mdev_vctrl_destroy(vctrl);
+}
+
+/* Called when new mediated device is opened by a user */
+static int nvme_mdev_ops_open(struct mdev_device *mdev)
+{
+ int ret;
+ unsigned long events;
+ struct nvme_mdev_vctrl *vctrl = mdev_to_vctrl(mdev);
+
+ if (!vctrl)
+ return -ENODEV;
+
+ ret = nvme_mdev_vctrl_open(vctrl);
+ if (ret)
+ return ret;
+
+ /* register unmap IOMMU notifier*/
+ vctrl->vfio_unmap_notifier.notifier_call = nvme_mdev_unmap_notifier;
+ events = VFIO_IOMMU_NOTIFY_DMA_UNMAP;
+
+ ret = vfio_register_notifier(mdev_dev(vctrl->mdev),
+ VFIO_IOMMU_NOTIFY, &events,
+ &vctrl->vfio_unmap_notifier);
+
+ if (ret != 0) {
+ nvme_mdev_vctrl_release(vctrl);
+ return ret;
+ }
+
+ /* register map IOMMU notifier*/
+ vctrl->vfio_map_notifier.notifier_call = nvme_mdev_map_notifier;
+ events = VFIO_IOMMU_NOTIFY_DMA_MAP;
+
+ ret = vfio_register_notifier(mdev_dev(vctrl->mdev),
+ VFIO_IOMMU_NOTIFY, &events,
+ &vctrl->vfio_map_notifier);
+
+ if (ret != 0) {
+ vfio_unregister_notifier(mdev_dev(vctrl->mdev),
+ VFIO_IOMMU_NOTIFY,
+ &vctrl->vfio_unmap_notifier);
+ nvme_mdev_vctrl_release(vctrl);
+ return ret;
+ }
+ return ret;
+}
+
+/* Called when new mediated device is closed (last close of the user) */
+static void nvme_mdev_ops_release(struct mdev_device *mdev)
+{
+ struct nvme_mdev_vctrl *vctrl = mdev_to_vctrl(mdev);
+ int ret;
+
+ ret = vfio_unregister_notifier(mdev_dev(vctrl->mdev),
+ VFIO_IOMMU_NOTIFY,
+ &vctrl->vfio_unmap_notifier);
+ WARN_ON(ret);
+
+ ret = vfio_unregister_notifier(mdev_dev(vctrl->mdev),
+ VFIO_IOMMU_NOTIFY,
+ &vctrl->vfio_map_notifier);
+ WARN_ON(ret);
+
+ nvme_mdev_vctrl_release(vctrl);
+}
+
+/* Helper function for bar/pci config read/write access */
+static ssize_t nvme_mdev_access(struct nvme_mdev_vctrl *vctrl,
+ char *buf, size_t count,
+ loff_t pos, bool is_write)
+{
+ int index = OFFSET_TO_REGION(pos);
+ int ret = -EINVAL;
+ unsigned int offset;
+
+ if (index >= VFIO_PCI_NUM_REGIONS || !vctrl->regions[index].rw)
+ goto out;
+
+ offset = pos - REGION_TO_OFFSET(index);
+ if (offset + count > vctrl->regions[index].size)
+ goto out;
+
+ ret = vctrl->regions[index].rw(vctrl, offset, buf, count, is_write);
+out:
+ return ret;
+}
+
+/* Called when read() is done on the device */
+static ssize_t nvme_mdev_ops_read(struct mdev_device *mdev, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned int done = 0;
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = mdev_to_vctrl(mdev);
+
+ if (!vctrl)
+ return -ENODEV;
+
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ ret = nvme_mdev_access(vctrl, (char *)&val,
+ sizeof(val), *ppos, false);
+ if (ret <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+ filled = sizeof(val);
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ ret = nvme_mdev_access(vctrl, (char *)&val,
+ sizeof(val), *ppos, false);
+ if (ret <= 0)
+ goto read_err;
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+ filled = sizeof(val);
+ } else {
+ u8 val;
+
+ ret = nvme_mdev_access(vctrl, (char *)&val,
+ sizeof(val), *ppos, false);
+ if (ret <= 0)
+ goto read_err;
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+ filled = sizeof(val);
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+ return done;
+read_err:
+ return -EFAULT;
+}
+
+/* Called when write() is done on the device */
+static ssize_t nvme_mdev_ops_write(struct mdev_device *mdev,
+ const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned int done = 0;
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = mdev_to_vctrl(mdev);
+
+ if (!vctrl)
+ return -ENODEV;
+
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+ ret = nvme_mdev_access(vctrl, (char *)&val,
+ sizeof(val), *ppos, true);
+ if (ret <= 0)
+ goto write_err;
+ filled = sizeof(val);
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ ret = nvme_mdev_access(vctrl, (char *)&val,
+ sizeof(val), *ppos, true);
+ if (ret <= 0)
+ goto write_err;
+ filled = sizeof(val);
+ } else {
+ u8 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+ ret = nvme_mdev_access(vctrl, (char *)&val,
+ sizeof(val), *ppos, true);
+ if (ret <= 0)
+ goto write_err;
+ filled = sizeof(val);
+ }
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+ return done;
+write_err:
+ return -EFAULT;
+}
+
+/*Helper for IRQ number VFIO query */
+static int nvme_mdev_irq_counts(struct nvme_mdev_vctrl *vctrl,
+ unsigned int irq_type)
+{
+ switch (irq_type) {
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ return 1;
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ return MAX_VIRTUAL_IRQS;
+ case VFIO_PCI_REQ_IRQ_INDEX:
+ return 1;
+ default:
+ return 0;
+ }
+}
+
+/* VFIO VFIO_IRQ_SET_ACTION_TRIGGER implementation */
+static int nvme_mdev_ioctl_set_irqs_trigger(struct nvme_mdev_vctrl *vctrl,
+ u32 flags,
+ unsigned int irq_type,
+ unsigned int start,
+ unsigned int count,
+ void *data)
+{
+ u32 data_type = flags & VFIO_IRQ_SET_DATA_TYPE_MASK;
+ u8 *bools = NULL;
+ unsigned int i;
+ int ret = -EINVAL;
+
+ /* Asked to disable the current interrupt mode*/
+ if (data_type == VFIO_IRQ_SET_DATA_NONE && count == 0) {
+ switch (irq_type) {
+ case VFIO_PCI_REQ_IRQ_INDEX:
+ nvme_mdev_irqs_set_unplug_trigger(vctrl, -1);
+ return 0;
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ nvme_mdev_irqs_disable(vctrl, NVME_MDEV_IMODE_INTX);
+ return 0;
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ nvme_mdev_irqs_disable(vctrl, NVME_MDEV_IMODE_MSIX);
+ return 0;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ if (start + count > nvme_mdev_irq_counts(vctrl, irq_type))
+ return -EINVAL;
+
+ switch (data_type) {
+ case VFIO_IRQ_SET_DATA_BOOL:
+ bools = (u8 *)data;
+ /*fallthrough*/
+ case VFIO_IRQ_SET_DATA_NONE:
+ if (irq_type == VFIO_PCI_REQ_IRQ_INDEX)
+ return -EINVAL;
+
+ for (i = 0 ; i < count ; i++) {
+ int index = start + i;
+
+ if (!bools || bools[i])
+ nvme_mdev_irq_trigger(vctrl, index);
+ }
+ return 0;
+
+ case VFIO_IRQ_SET_DATA_EVENTFD:
+ switch (irq_type) {
+ case VFIO_PCI_REQ_IRQ_INDEX:
+ return nvme_mdev_irqs_set_unplug_trigger(vctrl,
+ *(int32_t *)data);
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ ret = nvme_mdev_irqs_enable(vctrl,
+ NVME_MDEV_IMODE_INTX);
+ break;
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ ret = nvme_mdev_irqs_enable(vctrl,
+ NVME_MDEV_IMODE_MSIX);
+ break;
+ default:
+ return -EINVAL;
+ }
+ if (ret)
+ return ret;
+
+ return nvme_mdev_irqs_set_triggers(vctrl, start,
+ count, (int32_t *)data);
+ default:
+ return -EINVAL;
+ }
+}
+
+/* VFIO_DEVICE_GET_INFO ioctl implementation */
+static int nvme_mdev_ioctl_get_info(struct nvme_mdev_vctrl *vctrl,
+ void __user *arg)
+{
+ struct vfio_device_info info;
+ unsigned int minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ info.flags = VFIO_DEVICE_FLAGS_PCI | VFIO_DEVICE_FLAGS_RESET;
+ info.num_regions = VFIO_PCI_NUM_REGIONS;
+ info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+ if (copy_to_user(arg, &info, minsz))
+ return -EFAULT;
+ return 0;
+}
+
+/* VFIO_DEVICE_GET_REGION_INFO ioctl implementation*/
+static int nvme_mdev_ioctl_get_reg_info(struct nvme_mdev_vctrl *vctrl,
+ void __user *arg)
+{
+ struct nvme_mdev_io_region *region;
+ struct mdev_nvme_vfio_region_info *info;
+ unsigned long minsz, outsz, maxsz;
+ int ret = 0;
+
+ minsz = offsetofend(struct vfio_region_info, offset);
+ maxsz = sizeof(struct mdev_nvme_vfio_region_info) +
+ sizeof(struct vfio_region_sparse_mmap_area);
+
+ info = kzalloc(maxsz, GFP_KERNEL);
+ if (!info)
+ return -ENOMEM;
+
+ if (copy_from_user(info, arg, minsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ outsz = info->base.argsz;
+ if (outsz < minsz || outsz > maxsz) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (info->base.index >= VFIO_PCI_NUM_REGIONS) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ region = &vctrl->regions[info->base.index];
+ info->base.offset = REGION_TO_OFFSET(info->base.index);
+ info->base.argsz = maxsz;
+ info->base.size = region->size;
+
+ info->base.flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+
+ if (region->mmap_ops) {
+ info->base.flags |= (VFIO_REGION_INFO_FLAG_MMAP |
+ VFIO_REGION_INFO_FLAG_CAPS);
+
+ info->base.cap_offset =
+ offsetof(struct mdev_nvme_vfio_region_info, mmap_cap);
+
+ info->mmap_cap.header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ info->mmap_cap.header.version = 1;
+ info->mmap_cap.header.next = 0;
+ info->mmap_cap.nr_areas = 1;
+ info->mmap_cap.areas[0].offset = region->mmap_area_start;
+ info->mmap_cap.areas[0].size = region->mmap_area_size;
+ }
+
+ if (copy_to_user(arg, info, outsz))
+ ret = -EFAULT;
+out:
+ kfree(info);
+ return ret;
+}
+
+/* VFIO_DEVICE_GET_IRQ_INFO ioctl implementation */
+static int nvme_mdev_ioctl_get_irq_info(struct nvme_mdev_vctrl *vctrl,
+ void __user *arg)
+{
+ struct vfio_irq_info info;
+ unsigned int minsz = offsetofend(struct vfio_irq_info, count);
+
+ if (copy_from_user(&info, arg, minsz))
+ return -EFAULT;
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ info.count = nvme_mdev_irq_counts(vctrl, info.index);
+ info.flags = VFIO_IRQ_INFO_EVENTFD;
+
+ if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+ info.flags |= VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED;
+
+ if (copy_to_user(arg, &info, minsz))
+ return -EFAULT;
+ return 0;
+}
+
+/* VFIO VFIO_DEVICE_SET_IRQS ioctl implementation */
+static int nvme_mdev_ioctl_set_irqs(struct nvme_mdev_vctrl *vctrl,
+ void __user *arg)
+{
+ int ret, irqcount;
+ struct vfio_irq_set hdr;
+ u8 *data = NULL;
+ size_t data_size = 0;
+ unsigned long minsz = offsetofend(struct vfio_irq_set, count);
+
+ if (copy_from_user(&hdr, arg, minsz))
+ return -EFAULT;
+
+ irqcount = nvme_mdev_irq_counts(vctrl, hdr.index);
+ ret = vfio_set_irqs_validate_and_prepare(&hdr,
+ irqcount,
+ VFIO_PCI_NUM_IRQS,
+ &data_size);
+ if (ret)
+ return ret;
+
+ if (data_size) {
+ data = memdup_user((arg + minsz), data_size);
+ if (IS_ERR(data))
+ return PTR_ERR(data);
+ }
+
+ ret = -ENOTTY;
+ switch (hdr.index) {
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ case VFIO_PCI_REQ_IRQ_INDEX:
+ switch (hdr.flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ // pretend to support this (even with eventfd)
+ ret = hdr.index == VFIO_PCI_INTX_IRQ_INDEX ?
+ 0 : -EINVAL;
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ ret = nvme_mdev_ioctl_set_irqs_trigger(vctrl, hdr.flags,
+ hdr.index,
+ hdr.start,
+ hdr.count,
+ data);
+ break;
+ }
+ break;
+ }
+
+ kfree(data);
+ return ret;
+}
+
+/* ioctl() implementation */
+static long nvme_mdev_ops_ioctl(struct mdev_device *mdev, unsigned int cmd,
+ unsigned long arg)
+{
+ struct nvme_mdev_vctrl *vctrl = mdev_get_drvdata(mdev);
+
+ if (!vctrl)
+ return -ENODEV;
+
+ switch (cmd) {
+ case VFIO_DEVICE_GET_INFO:
+ return nvme_mdev_ioctl_get_info(vctrl, (void __user *)arg);
+ case VFIO_DEVICE_GET_REGION_INFO:
+ return nvme_mdev_ioctl_get_reg_info(vctrl, (void __user *)arg);
+ case VFIO_DEVICE_GET_IRQ_INFO:
+ return nvme_mdev_ioctl_get_irq_info(vctrl, (void __user *)arg);
+ case VFIO_DEVICE_SET_IRQS:
+ return nvme_mdev_ioctl_set_irqs(vctrl, (void __user *)arg);
+ case VFIO_DEVICE_RESET:
+ nvme_mdev_vctrl_reset(vctrl);
+ return 0;
+ default:
+ return -ENOTTY;
+ }
+}
+
+/* mmap() implementation (doorbell area) */
+static int nvme_mdev_ops_mmap(struct mdev_device *mdev,
+ struct vm_area_struct *vma)
+{
+ struct nvme_mdev_vctrl *vctrl = mdev_get_drvdata(mdev);
+ int index = OFFSET_TO_REGION((u64)vma->vm_pgoff << PAGE_SHIFT);
+ unsigned long size, start;
+
+ if (!vctrl)
+ return -EFAULT;
+
+ if (index >= VFIO_PCI_NUM_REGIONS || !vctrl->regions[index].mmap_ops)
+ return -EINVAL;
+
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+
+ size = vma->vm_end - vma->vm_start;
+ start = vma->vm_pgoff << PAGE_SHIFT;
+
+ if (start < vctrl->regions[index].mmap_area_start)
+ return -EINVAL;
+ if (size > vctrl->regions[index].mmap_area_size)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+ vma->vm_ops = vctrl->regions[index].mmap_ops;
+ vma->vm_private_data = vctrl;
+ return 0;
+}
+
+/* Request removal of the device*/
+static void nvme_mdev_ops_request(struct mdev_device *mdev, unsigned int count)
+{
+ struct nvme_mdev_vctrl *vctrl = mdev_get_drvdata(mdev);
+
+ if (vctrl)
+ nvme_mdev_irq_raise_unplug_event(vctrl, count);
+}
+
+/* Adding a new namespace given host NS id and partition ID (e/g. n1p2 or n1) */
+static ssize_t add_namespace_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+ int ret;
+ unsigned long partno = 0, nsid;
+ char *buf_copy, *token, *tmp;
+
+ if (!vctrl)
+ return -ENODEV;
+
+ buf_copy = kstrdup(buf, GFP_KERNEL);
+ if (!buf_copy)
+ return -ENOMEM;
+
+ tmp = buf_copy;
+ if (tmp[0] != 'n') {
+ ret = -EINVAL;
+ goto out;
+ }
+ tmp++;
+
+ // read namespace ID (mandatory)
+ token = strsep(&tmp, "p");
+ if (!token) {
+ ret = -EINVAL;
+ goto out;
+ }
+ ret = kstrtoul(token, 10, &nsid);
+ if (ret)
+ goto out;
+
+ // read partition ID (optional)
+ if (tmp) {
+ ret = kstrtoul(tmp, 10, &partno);
+ if (ret)
+ goto out;
+ }
+
+ // create the user namespace
+ ret = nvme_mdev_vns_open(vctrl, nsid, partno);
+ if (ret)
+ goto out;
+ ret = count;
+out:
+ kfree(buf_copy);
+ return ret;
+}
+static DEVICE_ATTR_WO(add_namespace);
+
+/* Remove a user namespace */
+static ssize_t remove_namespace_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long user_nsid;
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+
+ ret = kstrtoul(buf, 10, &user_nsid);
+ if (ret)
+ return ret;
+
+ ret = nvme_mdev_vns_destroy(vctrl, user_nsid);
+ if (ret)
+ return ret;
+ return count;
+}
+static DEVICE_ATTR_WO(remove_namespace);
+
+/* Show list of user namespaces */
+static ssize_t namespaces_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+ return nvme_mdev_vns_print_description(vctrl, buf, PAGE_SIZE - 1);
+}
+static DEVICE_ATTR_RO(namespaces);
+
+/* change the cpu binding of the IO threads*/
+static ssize_t iothread_cpu_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long val;
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+ ret = kstrtoul(buf, 10, &val);
+ if (ret)
+ return ret;
+ nvme_mdev_vctrl_bind_iothread(vctrl, val);
+ return count;
+}
+
+/* change the cpu binding of the IO threads*/
+static ssize_t
+iothread_cpu_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+ return sprintf(buf, "%d\n", vctrl->iothread_cpu);
+}
+static DEVICE_ATTR_RW(iothread_cpu);
+
+/* change the cpu binding of the IO threads*/
+static ssize_t shadow_doorbell_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ bool val;
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+ ret = kstrtobool(buf, &val);
+ if (ret)
+ return ret;
+ ret = nvme_mdev_vctrl_set_shadow_doorbell_supported(vctrl, val);
+ if (ret)
+ return ret;
+ return count;
+}
+
+/* change the cpu binding of the IO threads*/
+static ssize_t shadow_doorbell_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+
+ return sprintf(buf, "%d\n", vctrl->mmio.shadow_db_supported ? 1 : 0);
+}
+static DEVICE_ATTR_RW(shadow_doorbell);
+
+static struct attribute *nvme_mdev_dev_ns_atttributes[] = {
+ &dev_attr_add_namespace.attr,
+ &dev_attr_remove_namespace.attr,
+ &dev_attr_namespaces.attr,
+ NULL
+};
+
+static struct attribute *nvme_mdev_dev_settings_atttributes[] = {
+ &dev_attr_iothread_cpu.attr,
+ &dev_attr_shadow_doorbell.attr,
+ NULL
+};
+
+static const struct attribute_group nvme_mdev_ns_attr_group = {
+ .name = "namespaces",
+ .attrs = nvme_mdev_dev_ns_atttributes,
+};
+
+static const struct attribute_group nvme_mdev_setting_attr_group = {
+ .name = "settings",
+ .attrs = nvme_mdev_dev_settings_atttributes,
+};
+
+static const struct attribute_group *nvme_mdev_dev_attributte_groups[] = {
+ &nvme_mdev_ns_attr_group,
+ &nvme_mdev_setting_attr_group,
+ NULL,
+};
+
+struct mdev_parent_ops mdev_fops = {
+ .owner = THIS_MODULE,
+ .create = nvme_mdev_ops_create,
+ .remove = nvme_mdev_ops_remove,
+ .open = nvme_mdev_ops_open,
+ .release = nvme_mdev_ops_release,
+ .read = nvme_mdev_ops_read,
+ .write = nvme_mdev_ops_write,
+ .mmap = nvme_mdev_ops_mmap,
+ .ioctl = nvme_mdev_ops_ioctl,
+ .request = nvme_mdev_ops_request,
+ .mdev_attr_groups = nvme_mdev_dev_attributte_groups,
+ .dev_attr_groups = NULL,
+};
+
diff --git a/drivers/nvme/mdev/io.c b/drivers/nvme/mdev/io.c
new file mode 100644
index 000000000000..a731196d0365
--- /dev/null
+++ b/drivers/nvme/mdev/io.c
@@ -0,0 +1,563 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe IO command translation and polling IO thread
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/slab.h>
+#include <linux/nvme.h>
+#include <linux/timekeeping.h>
+#include <linux/ktime.h>
+#include "priv.h"
+
+struct io_ctx {
+ struct nvme_mdev_hctrl *hctrl;
+ struct nvme_mdev_vctrl *vctrl;
+
+ const struct nvme_command *in;
+ struct nvme_command out;
+ struct nvme_mdev_vns *ns;
+ struct nvme_ext_data_iter udatait;
+ struct nvme_ext_data_iter *kdatait;
+
+ ktime_t last_io_t;
+ ktime_t last_admin_poll_time;
+ unsigned int idle_timeout_ms;
+ unsigned int admin_poll_rate_ms;
+ unsigned int arb_burst;
+};
+
+/* Handle read/write command.*/
+static int nvme_mdev_io_translate_rw(struct io_ctx *ctx)
+{
+ int ret;
+ const struct nvme_rw_command *in = &ctx->in->rw;
+
+ u64 slba = le64_to_cpu(in->slba);
+ u64 length = le16_to_cpu(in->length) + 1;
+ u16 control = le16_to_cpu(in->control);
+
+ _DBG(ctx->vctrl, "IOQ: READ/WRITE\n");
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_MPTR | RSRV_DW14_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16, 0b1100000000111100))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (in->opcode == nvme_cmd_write && ctx->ns->readonly)
+ return DNR(NVME_SC_READ_ONLY);
+
+ if (!check_range(slba, length, ctx->ns->ns_size))
+ return DNR(NVME_SC_LBA_RANGE);
+
+ ctx->out.rw.slba = cpu_to_le64(slba + ctx->ns->host_lba_offset);
+ ctx->out.rw.length = in->length;
+
+ ret = nvme_mdev_udata_iter_set_dptr(&ctx->udatait, &in->dptr,
+ length << ctx->ns->blksize_shift);
+ if (ret)
+ return nvme_mdev_translate_error(ret);
+
+ ctx->kdatait = &ctx->udatait;
+ if (control & ~(NVME_RW_LR | NVME_RW_FUA))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ ctx->out.rw.control = in->control;
+ return -1;
+}
+
+/*Handle flush command */
+static int nvme_mdev_io_translate_flush(struct io_ctx *ctx)
+{
+ ctx->kdatait = NULL;
+
+ _DBG(ctx->vctrl, "IOQ: FLUSH\n");
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW10_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (ctx->ns->readonly)
+ return DNR(NVME_SC_READ_ONLY);
+
+ return -1;
+}
+
+/* Handle write zeros command */
+static int nvme_mdev_io_translate_write_zeros(struct io_ctx *ctx)
+{
+ const struct nvme_write_zeroes_cmd *in = &ctx->in->write_zeroes;
+ u64 slba = le64_to_cpu(in->slba);
+ u64 length = le16_to_cpu(in->length) + 1;
+ u16 control = le16_to_cpu(in->control);
+
+ _DBG(ctx->vctrl, "IOQ: WRITE_ZEROS\n");
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_DPTR |
+ RSRV_MPTR | RSRV_DW13_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!nvme_mdev_hctrl_hq_check_op(ctx->hctrl, in->opcode))
+ return DNR(NVME_SC_INVALID_OPCODE);
+
+ if (ctx->ns->readonly)
+ return DNR(NVME_SC_READ_ONLY);
+ ctx->kdatait = NULL;
+
+ if (!check_range(slba, length, ctx->ns->ns_size))
+ return DNR(NVME_SC_LBA_RANGE);
+
+ ctx->out.write_zeroes.slba =
+ cpu_to_le64(slba + ctx->ns->host_lba_offset);
+ ctx->out.write_zeroes.length = in->length;
+
+ if (control & ~(NVME_RW_LR | NVME_RW_FUA | NVME_WZ_DEAC))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ ctx->out.write_zeroes.control = in->control;
+ return -1;
+}
+
+/* Handle dataset management command */
+static int nvme_mdev_io_translate_dsm(struct io_ctx *ctx)
+{
+ unsigned int size, i, nr;
+ int ret;
+ const struct nvme_dsm_cmd *in = &ctx->in->dsm;
+ struct nvme_dsm_range *data_ptr;
+
+ _DBG(ctx->vctrl, "IOQ: DSM_MANAGEMENT\n");
+
+ if (!check_reserved_dwords(ctx->in->dwords, 16,
+ RSRV_DW23 | RSRV_MPTR | RSRV_DW12_15))
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (le32_to_cpu(in->nr) & 0xFFFFFF00)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ if (!nvme_mdev_hctrl_hq_check_op(ctx->hctrl, in->opcode))
+ return DNR(NVME_SC_INVALID_OPCODE);
+
+ if (ctx->ns->readonly)
+ return DNR(NVME_SC_READ_ONLY);
+
+ nr = le32_to_cpu(in->nr) + 1;
+ size = nr * sizeof(struct nvme_dsm_range);
+
+ ctx->out.dsm.nr = in->nr;
+ ret = nvme_mdev_udata_iter_set_dptr(&ctx->udatait, &in->dptr, size);
+ if (ret)
+ goto error;
+
+ ctx->kdatait = nvme_mdev_kdata_iter_alloc(&ctx->vctrl->viommu, size);
+ if (!ctx->kdatait)
+ return NVME_SC_INTERNAL;
+
+ _DBG(ctx->vctrl, "IOQ: DSM_MANAGEMENT: NR=%d\n", nr);
+
+ ret = nvme_mdev_read_from_udata(ctx->kdatait->kmem.data, &ctx->udatait,
+ size);
+ if (ret)
+ goto error2;
+
+ data_ptr = (struct nvme_dsm_range *)ctx->kdatait->kmem.data;
+
+ for (i = 0 ; i < nr; i++) {
+ u64 slba = le64_to_cpu(data_ptr[i].slba);
+ /* looks like not zero based value*/
+ u32 nlb = le32_to_cpu(data_ptr[i].nlb);
+
+ if (!check_range(slba, nlb, ctx->ns->ns_size))
+ goto error2;
+
+ _DBG(ctx->vctrl, "IOQ: DSM_MANAGEMENT: RANGE 0x%llx-0x%x\n",
+ slba, nlb);
+
+ data_ptr[i].slba = cpu_to_le64(slba + ctx->ns->host_lba_offset);
+ }
+
+ ctx->out.dsm.attributes = in->attributes;
+ return -1;
+error2:
+ ctx->kdatait->release(ctx->kdatait);
+error:
+ return nvme_mdev_translate_error(ret);
+}
+
+/* Process one new command in the io queue*/
+static int nvme_mdev_io_translate_cmd(struct io_ctx *ctx)
+{
+ memset(&ctx->out, 0, sizeof(ctx->out));
+ /* translate opcode */
+ ctx->out.common.opcode = ctx->in->common.opcode;
+
+ /* check flags */
+ if (ctx->in->common.flags != 0)
+ return DNR(NVME_SC_INVALID_FIELD);
+
+ /* namespace*/
+ ctx->ns = nvme_mdev_vns_from_vnsid(ctx->vctrl,
+ le32_to_cpu(ctx->in->rw.nsid));
+ if (!ctx->ns) {
+ _DBG(ctx->vctrl, "IOQ: invalid NSID\n");
+ return DNR(NVME_SC_INVALID_NS);
+ }
+
+ if (!ctx->ns->readonly && bdev_read_only(ctx->ns->host_part))
+ ctx->ns->readonly = true;
+
+ ctx->out.common.nsid = cpu_to_le32(ctx->ns->host_nsid);
+
+ switch (ctx->in->common.opcode) {
+ case nvme_cmd_flush:
+ return nvme_mdev_io_translate_flush(ctx);
+ case nvme_cmd_read:
+ return nvme_mdev_io_translate_rw(ctx);
+ case nvme_cmd_write:
+ return nvme_mdev_io_translate_rw(ctx);
+ case nvme_cmd_write_zeroes:
+ return nvme_mdev_io_translate_write_zeros(ctx);
+ case nvme_cmd_dsm:
+ return nvme_mdev_io_translate_dsm(ctx);
+ default:
+ return DNR(NVME_SC_INVALID_OPCODE);
+ }
+}
+
+static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
+{
+ struct nvme_vsq *vsq = &ctx->vctrl->vsqs[sqid];
+ u16 ucid;
+ int ret;
+
+ /* If host queue is full, we can't process a command
+ * as a command will likely result in passthrough
+ */
+ if (!nvme_mdev_hctrl_hq_can_submit(ctx->hctrl, vsq->hsq))
+ return false;
+
+ /* read the command */
+ ctx->in = nvme_mdev_vsq_get_cmd(ctx->vctrl, vsq);
+ if (!ctx->in)
+ return false;
+ ucid = le16_to_cpu(ctx->in->common.command_id);
+
+ /* translate the command */
+ ret = nvme_mdev_io_translate_cmd(ctx);
+ if (ret != -1) {
+ _DBG(ctx->vctrl,
+ "IOQ: QID %d CID %d FAILED: status 0x%x (translate)\n",
+ sqid, ucid, ret);
+ nvme_mdev_vsq_cmd_done_io(ctx->vctrl, sqid, ucid, ret);
+ return true;
+ }
+
+ /*passthrough*/
+ ret = nvme_mdev_hctrl_hq_submit(ctx->hctrl,
+ vsq->hsq,
+ (((u32)vsq->qid) << 16) | ((u32)ucid),
+ &ctx->out,
+ ctx->kdatait);
+ if (ret) {
+ ret = nvme_mdev_translate_error(ret);
+
+ _DBG(ctx->vctrl,
+ "IOQ: QID %d CID %d FAILED: status 0x%x (host submit)\n",
+ sqid, ucid, ret);
+
+ nvme_mdev_vsq_cmd_done_io(ctx->vctrl, sqid, ucid, ret);
+ }
+ return true;
+}
+
+/* process host replies to the passed through commands */
+static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
+{
+ int n, i;
+ struct nvme_ext_cmd_result res[16];
+
+ /* process the completions from the hardware */
+ n = nvme_mdev_hctrl_hq_poll(ctx->hctrl, hwq, res, 16);
+ if (n == -1)
+ return -1;
+
+ for (i = 0; i < n; i++) {
+ u16 qid = res[i].tag >> 16;
+ u16 cid = res[i].tag & 0xFFFF;
+ u16 status = res[i].status;
+
+ if (status != 0)
+ _DBG(ctx->vctrl,
+ "IOQ: QID %d CID %d FAILED: status 0x%x (host response)\n",
+ qid, cid, status);
+
+ nvme_mdev_vsq_cmd_done_io(ctx->vctrl, qid, cid, status);
+ }
+ return n;
+}
+
+/* Check if we need to read a command from the admin queue */
+static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
+{
+ if (!timeout(ctx->last_admin_poll_time,
+ ctx->vctrl->now, ctx->admin_poll_rate_ms))
+ return false;
+
+ if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
+ return true;
+
+ ctx->last_admin_poll_time = ctx->vctrl->now;
+ return false;
+}
+
+/* do polling till one of events stops it */
+static void nvme_mdev_io_maintask(struct io_ctx *ctx)
+{
+ struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
+ u16 i, cqid, sqid, hsqcnt;
+ u16 hsqs[MAX_HOST_QUEUES];
+ bool idle = false;
+
+ hsqcnt = nvme_mdev_vctrl_hqs_list(vctrl, hsqs);
+ ctx->arb_burst = 1 << ctx->vctrl->arb_burst_shift;
+
+ /* can't stop polling when shadow db not enabled */
+ ctx->idle_timeout_ms = vctrl->mmio.shadow_db_en ? poll_timeout_ms : 0;
+ ctx->admin_poll_rate_ms = admin_poll_rate_ms;
+
+ vctrl->now = ktime_get();
+ ctx->last_admin_poll_time = vctrl->now;
+ ctx->last_io_t = vctrl->now;
+
+ /* main loop */
+ while (!kthread_should_park()) {
+ vctrl->now = ktime_get();
+
+ /* check if we have to exit to support admin polling */
+ if (!vctrl->mmio.shadow_db_supported)
+ if (nvme_mdev_adm_needs_processing(ctx))
+ break;
+
+ /* process the submission queues*/
+ sqid = 1;
+ for_each_set_bit_from(sqid, vctrl->vsq_en, MAX_VIRTUAL_QUEUES)
+ for (i = 0 ; i < ctx->arb_burst ; i++)
+ if (!nvme_mdev_io_process_sq(ctx, sqid))
+ break;
+
+ /* process the completions from the guest*/
+ cqid = 1;
+ for_each_set_bit_from(cqid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
+ nvme_mdev_vcq_process(vctrl, cqid, true);
+
+ /* process the completions from the hardware*/
+ for (i = 0 ; i < hsqcnt ; i++)
+ if (nvme_mdev_io_process_hwq(ctx, hsqs[i]) > 0)
+ ctx->last_io_t = vctrl->now;
+
+ /* Check if we need to stop polling*/
+ if (ctx->idle_timeout_ms) {
+ if (timeout(ctx->last_io_t,
+ vctrl->now, ctx->idle_timeout_ms)) {
+ idle = true;
+ break;
+ }
+ }
+ cond_resched();
+ }
+
+ /* Drain the host IO */
+ for (;;) {
+ bool pending_io = false;
+
+ vctrl->now = ktime_get_coarse_boottime();
+
+ if (nvme_mdev_vctrl_is_dead(vctrl) || ctx->hctrl->removing) {
+ idle = false;
+ break;
+ }
+
+ for (i = 0; i < hsqcnt; i++) {
+ int n = nvme_mdev_io_process_hwq(ctx, hsqs[i]);
+
+ if (n != -1)
+ pending_io = true;
+ if (n > 0)
+ ctx->last_io_t = vctrl->now;
+ }
+
+ if (!pending_io)
+ break;
+
+ cond_resched();
+
+ if (!timeout(ctx->last_io_t, vctrl->now, io_timeout_ms))
+ continue;
+
+ _WARN(ctx->vctrl, "IO: skipping flush - host IO timeout\n");
+ idle = false;
+ break;
+ }
+
+ /* Drain all the pending completion interrupts to the guest*/
+ cqid = 1;
+ for_each_set_bit_from(cqid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
+ if (nvme_mdev_vcq_flush(vctrl, cqid))
+ idle = false;
+
+ /* Park IO thread if IO is truly idle*/
+ if (idle) {
+ /* don't bother going idle if someone holds the vctrl
+ * lock. It might try to park us, and thus
+ * cause a deadlock
+ */
+ if (!mutex_trylock(&vctrl->lock))
+ return;
+
+ sqid = 1;
+ for_each_set_bit_from(sqid, vctrl->vsq_en, MAX_VIRTUAL_QUEUES)
+ if (!nvme_mdev_vsq_suspend_io(vctrl, sqid)) {
+ idle = false;
+ break;
+ }
+
+ if (idle) {
+ _DBG(ctx->vctrl, "IO: self-parking\n");
+ vctrl->io_idle = true;
+ nvme_mdev_io_pause(vctrl);
+ }
+
+ mutex_unlock(&vctrl->lock);
+ }
+
+ /* Admin poll for cases when shadow doorbell is not supported */
+ if (!vctrl->mmio.shadow_db_supported) {
+ if (mutex_trylock(&vctrl->lock)) {
+ nvme_mdev_vcq_process(vctrl, 0, false);
+ nvme_mdev_adm_process_sq(ctx->vctrl);
+ ctx->last_admin_poll_time = vctrl->now;
+ mutex_unlock(&ctx->vctrl->lock);
+ }
+ }
+}
+
+/* the main IO thread */
+static int nvme_mdev_io_polling_thread(void *data)
+{
+ struct io_ctx ctx;
+
+ if (kthread_should_stop())
+ return 0;
+
+ memset(&ctx, 0, sizeof(struct io_ctx));
+ ctx.vctrl = (struct nvme_mdev_vctrl *)data;
+ ctx.hctrl = ctx.vctrl->hctrl;
+ nvme_mdev_udata_iter_setup(&ctx.vctrl->viommu, &ctx.udatait);
+
+ _DBG(ctx.vctrl, "IO: iothread started\n");
+
+ for (;;) {
+ if (kthread_should_park()) {
+ _DBG(ctx.vctrl, "IO: iothread parked\n");
+ kthread_parkme();
+ }
+
+ if (kthread_should_stop())
+ break;
+
+ nvme_mdev_io_maintask(&ctx);
+ }
+
+ _DBG(ctx.vctrl, "IO: iothread stopped\n");
+ return 0;
+}
+
+/* Kick the IO thread into running state*/
+void nvme_mdev_io_resume(struct nvme_mdev_vctrl *vctrl)
+{
+ lockdep_assert_held(&vctrl->lock);
+
+ if (!vctrl->iothread || !vctrl->iothread_parked)
+ return;
+ if (vctrl->io_idle || vctrl->vctrl_paused)
+ return;
+
+ vctrl->iothread_parked = false;
+ /* has memory barrier*/
+ kthread_unpark(vctrl->iothread);
+}
+
+/* Pause the IO thread */
+void nvme_mdev_io_pause(struct nvme_mdev_vctrl *vctrl)
+{
+ lockdep_assert_held(&vctrl->lock);
+
+ if (!vctrl->iothread || vctrl->iothread_parked)
+ return;
+
+ vctrl->iothread_parked = true;
+ kthread_park(vctrl->iothread);
+}
+
+/* setup the main IO thread */
+int nvme_mdev_io_create(struct nvme_mdev_vctrl *vctrl, unsigned int cpu)
+{
+ /*TODOLATER: IO: Better thread name*/
+ char name[TASK_COMM_LEN];
+
+ _DBG(vctrl, "IO: creating the polling iothread\n");
+
+ if (WARN_ON(vctrl->iothread))
+ return -EINVAL;
+
+ snprintf(name, sizeof(name), "nvme%d_poll_io", vctrl->hctrl->id);
+
+ vctrl->iothread_cpu = cpu;
+ vctrl->iothread_parked = false;
+ vctrl->io_idle = true;
+
+ vctrl->iothread = kthread_create_on_node(nvme_mdev_io_polling_thread,
+ vctrl,
+ vctrl->hctrl->node,
+ name);
+ if (IS_ERR(vctrl->iothread)) {
+ vctrl->iothread = NULL;
+ return PTR_ERR(vctrl->iothread);
+ }
+
+ kthread_bind(vctrl->iothread, cpu);
+
+ if (vctrl->io_idle) {
+ vctrl->iothread_parked = true;
+ kthread_park(vctrl->iothread);
+ return 0;
+ }
+
+ wake_up_process(vctrl->iothread);
+ return 0;
+}
+
+/* End the main IO thread */
+void nvme_mdev_io_free(struct nvme_mdev_vctrl *vctrl)
+{
+ int ret;
+
+ _DBG(vctrl, "IO: destroying the polling iothread\n");
+
+ lockdep_assert_held(&vctrl->lock);
+ nvme_mdev_io_pause(vctrl);
+ ret = kthread_stop(vctrl->iothread);
+ WARN_ON(ret);
+ vctrl->iothread = NULL;
+}
+
+void nvme_mdev_assert_io_not_running(struct nvme_mdev_vctrl *vctrl)
+{
+ if (WARN_ON(vctrl->iothread && !vctrl->iothread_parked))
+ nvme_mdev_io_pause(vctrl);
+}
diff --git a/drivers/nvme/mdev/irq.c b/drivers/nvme/mdev/irq.c
new file mode 100644
index 000000000000..5809cdb4d84c
--- /dev/null
+++ b/drivers/nvme/mdev/irq.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe virtual controller IRQ implementation (MSIx and INTx)
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include "priv.h"
+
+/* Setup the interrupt subsystem */
+void nvme_mdev_irqs_setup(struct nvme_mdev_vctrl *vctrl)
+{
+ vctrl->irqs.mode = NVME_MDEV_IMODE_NONE;
+ vctrl->irqs.irq_coalesc_max = 1;
+}
+
+/* Enable INTx or MSIx interrupts */
+static int __nvme_mdev_irqs_enable(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_mdev_irq_mode mode)
+{
+ if (vctrl->irqs.mode == mode)
+ return 0;
+ if (vctrl->irqs.mode != NVME_MDEV_IMODE_NONE)
+ return -EBUSY;
+
+ if (mode == NVME_MDEV_IMODE_INTX)
+ _DBG(vctrl, "IRQ: enable INTx interrupts\n");
+ else if (mode == NVME_MDEV_IMODE_MSIX)
+ _DBG(vctrl, "IRQ: enable MSIX interrupts\n");
+ else
+ WARN_ON(1);
+
+ nvme_mdev_io_pause(vctrl);
+ vctrl->irqs.mode = mode;
+ nvme_mdev_io_resume(vctrl);
+ return 0;
+}
+
+int nvme_mdev_irqs_enable(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_mdev_irq_mode mode)
+{
+ int retval = 0;
+
+ mutex_lock(&vctrl->lock);
+ retval = __nvme_mdev_irqs_enable(vctrl, mode);
+ mutex_unlock(&vctrl->lock);
+ return retval;
+}
+
+/* Disable INTx or MSIx interrupts */
+static void __nvme_mdev_irqs_disable(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_mdev_irq_mode mode)
+{
+ unsigned int i;
+
+ if (vctrl->irqs.mode == NVME_MDEV_IMODE_NONE)
+ return;
+ if (vctrl->irqs.mode != mode)
+ return;
+
+ if (vctrl->irqs.mode == NVME_MDEV_IMODE_INTX)
+ _DBG(vctrl, "IRQ: disable INTx interrupts\n");
+ else if (vctrl->irqs.mode == NVME_MDEV_IMODE_MSIX)
+ _DBG(vctrl, "IRQ: disable MSIX interrupts\n");
+ else
+ WARN_ON(1);
+
+ nvme_mdev_io_pause(vctrl);
+
+ for (i = 0; i < MAX_VIRTUAL_IRQS; i++) {
+ struct nvme_mdev_user_irq *vec = &vctrl->irqs.vecs[i];
+
+ if (vec->trigger) {
+ eventfd_ctx_put(vec->trigger);
+ vec->trigger = NULL;
+ }
+ vec->irq_pending_cnt = 0;
+ vec->irq_time = 0;
+ }
+ vctrl->irqs.mode = NVME_MDEV_IMODE_NONE;
+ nvme_mdev_io_resume(vctrl);
+}
+
+void nvme_mdev_irqs_disable(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_mdev_irq_mode mode)
+{
+ mutex_lock(&vctrl->lock);
+ __nvme_mdev_irqs_disable(vctrl, mode);
+ mutex_unlock(&vctrl->lock);
+}
+
+/* Set eventfd triggers for INTx or MSIx interrupts */
+int nvme_mdev_irqs_set_triggers(struct nvme_mdev_vctrl *vctrl,
+ int start, int count, int32_t *fds)
+{
+ unsigned int i;
+
+ mutex_lock(&vctrl->lock);
+ nvme_mdev_io_pause(vctrl);
+
+ for (i = 0; i < count; i++) {
+ int irqindex = start + i;
+ struct eventfd_ctx *trigger;
+ struct nvme_mdev_user_irq *irq = &vctrl->irqs.vecs[irqindex];
+
+ if (irq->trigger) {
+ eventfd_ctx_put(irq->trigger);
+ irq->trigger = NULL;
+ }
+
+ if (fds[i] < 0)
+ continue;
+
+ trigger = eventfd_ctx_fdget(fds[i]);
+ if (IS_ERR(trigger))
+ return PTR_ERR(trigger);
+
+ irq->trigger = trigger;
+ }
+ nvme_mdev_io_resume(vctrl);
+ mutex_unlock(&vctrl->lock);
+ return 0;
+}
+
+/* Set eventfd trigger for unplug interrupt */
+static int __nvme_mdev_irqs_set_unplug_trigger(struct nvme_mdev_vctrl *vctrl,
+ int32_t fd)
+{
+ struct eventfd_ctx *trigger;
+
+ if (vctrl->irqs.request_trigger) {
+ _DBG(vctrl, "IRQ: clear hotplug trigger\n");
+ eventfd_ctx_put(vctrl->irqs.request_trigger);
+ vctrl->irqs.request_trigger = NULL;
+ }
+
+ if (fd < 0)
+ return 0;
+
+ _DBG(vctrl, "IRQ: set hotplug trigger\n");
+
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger))
+ return PTR_ERR(trigger);
+
+ vctrl->irqs.request_trigger = trigger;
+ return 0;
+}
+
+int nvme_mdev_irqs_set_unplug_trigger(struct nvme_mdev_vctrl *vctrl,
+ int32_t fd)
+{
+ int retval;
+
+ mutex_lock(&vctrl->lock);
+ retval = __nvme_mdev_irqs_set_unplug_trigger(vctrl, fd);
+ mutex_unlock(&vctrl->lock);
+ return retval;
+}
+
+/* Reset the interrupts subsystem */
+void nvme_mdev_irqs_reset(struct nvme_mdev_vctrl *vctrl)
+{
+ int i;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ if (vctrl->irqs.mode != NVME_MDEV_IMODE_NONE)
+ __nvme_mdev_irqs_disable(vctrl, vctrl->irqs.mode);
+
+ __nvme_mdev_irqs_set_unplug_trigger(vctrl, -1);
+
+ for (i = 0; i < MAX_VIRTUAL_IRQS; i++) {
+ struct nvme_mdev_user_irq *vec = &vctrl->irqs.vecs[i];
+
+ vec->irq_coalesc_en = false;
+ vec->irq_pending_cnt = 0;
+ vec->irq_time = 0;
+ }
+
+ vctrl->irqs.irq_coalesc_time_us = 0;
+}
+
+/* Check if interrupt can be coalesced */
+static bool nvme_mdev_irq_coalesce(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_mdev_user_irq *irq)
+{
+ s64 delta;
+
+ if (!irq->irq_coalesc_en)
+ return false;
+
+ if (irq->irq_pending_cnt >= vctrl->irqs.irq_coalesc_max)
+ return false;
+
+ delta = ktime_us_delta(vctrl->now, irq->irq_time);
+ return (delta < vctrl->irqs.irq_coalesc_time_us);
+}
+
+void nvme_mdev_irq_raise_unplug_event(struct nvme_mdev_vctrl *vctrl,
+ unsigned int count)
+{
+ mutex_lock(&vctrl->lock);
+
+ if (vctrl->irqs.request_trigger) {
+ if (!(count % 10))
+ dev_notice_ratelimited(mdev_dev(vctrl->mdev),
+ "Relaying device request to user (#%u)\n",
+ count);
+
+ eventfd_signal(vctrl->irqs.request_trigger, 1);
+
+ } else if (count == 0) {
+ dev_notice(mdev_dev(vctrl->mdev),
+ "No device request channel registered, blocked until released by user\n");
+ }
+ mutex_unlock(&vctrl->lock);
+}
+
+/* Raise an interrupt */
+void nvme_mdev_irq_raise(struct nvme_mdev_vctrl *vctrl, unsigned int index)
+{
+ struct nvme_mdev_user_irq *irq = &vctrl->irqs.vecs[index];
+
+ irq->irq_pending_cnt++;
+}
+
+/* Unraise an interrupt */
+void nvme_mdev_irq_clear(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index)
+{
+ struct nvme_mdev_user_irq *irq = &vctrl->irqs.vecs[index];
+
+ irq->irq_time = vctrl->now;
+ irq->irq_pending_cnt = 0;
+}
+
+/* Directly trigger an interrupt without affecting irq coalescing settings */
+void nvme_mdev_irq_trigger(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index)
+{
+ struct nvme_mdev_user_irq *irq = &vctrl->irqs.vecs[index];
+
+ if (irq->trigger)
+ eventfd_signal(irq->trigger, 1);
+}
+
+/* Trigger previously raised interrupt */
+void nvme_mdev_irq_cond_trigger(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index)
+{
+ struct nvme_mdev_user_irq *irq = &vctrl->irqs.vecs[index];
+
+ if (irq->irq_pending_cnt == 0)
+ return;
+
+ if (!nvme_mdev_irq_coalesce(vctrl, irq)) {
+ nvme_mdev_irq_trigger(vctrl, index);
+ nvme_mdev_irq_clear(vctrl, index);
+ }
+}
diff --git a/drivers/nvme/mdev/mdev.h b/drivers/nvme/mdev/mdev.h
new file mode 100644
index 000000000000..d139e090520e
--- /dev/null
+++ b/drivers/nvme/mdev/mdev.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * NVME VFIO mediated driver
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+
+#ifndef _MDEV_NVME_MDEV_H
+#define _MDEV_NVME_MDEV_H
+
+#include <linux/kernel.h>
+#include <linux/byteorder/generic.h>
+#include <linux/nvme.h>
+
+struct page_map {
+ void *kmap;
+ struct page *page;
+ dma_addr_t iova;
+};
+
+struct user_prplist {
+ /* used by user data iterator*/
+ struct page_map page;
+ unsigned int index; /* index of current entry */
+};
+
+struct kernel_data {
+ /* used by kernel data iterator*/
+ void *data;
+ unsigned int size;
+ dma_addr_t dma_addr;
+};
+
+struct nvme_ext_data_iter {
+ /* private */
+ struct nvme_mdev_viommu *viommu;
+ union {
+ const union nvme_data_ptr *dptr;
+ struct user_prplist uprp;
+ struct kernel_data kmem;
+ };
+
+ /* user interface */
+ u64 count; /* number of data pages, yet to be covered */
+
+ phys_addr_t physical; /* iterator physical address value*/
+ dma_addr_t host_iova; /* iterator dma address value*/
+
+ /* moves iterator to the next item */
+ int (*next)(struct nvme_ext_data_iter *data_iter);
+
+ /* if != NULL, user should call this when it done with data
+ * pointed by the iterator
+ */
+ void (*release)(struct nvme_ext_data_iter *data_iter);
+};
+#endif
diff --git a/drivers/nvme/mdev/mmio.c b/drivers/nvme/mdev/mmio.c
new file mode 100644
index 000000000000..cf03c1f22f4c
--- /dev/null
+++ b/drivers/nvme/mdev/mmio.c
@@ -0,0 +1,591 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe virtual controller MMIO implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include "priv.h"
+
+#define DB_AREA_SIZE (MAX_VIRTUAL_QUEUES * 2 * (4 << DB_STRIDE_SHIFT))
+#define DB_MASK ((4 << DB_STRIDE_SHIFT) - 1)
+#define MMIO_BAR_SIZE __roundup_pow_of_two(NVME_REG_DBS + DB_AREA_SIZE)
+
+/* Put the controller into fatal error state. Only way out is reset */
+static void nvme_mdev_mmio_fatal_error(struct nvme_mdev_vctrl *vctrl)
+{
+ if (vctrl->mmio.csts & NVME_CSTS_CFS)
+ return;
+
+ vctrl->mmio.csts |= NVME_CSTS_CFS;
+ nvme_mdev_io_pause(vctrl);
+
+ if (vctrl->mmio.csts & NVME_CSTS_RDY)
+ nvme_mdev_vctrl_disable(vctrl);
+}
+
+/* This sends an generic error notification to the user */
+static void nvme_mdev_mmio_error(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_async_event info)
+{
+ nvme_mdev_event_send(vctrl, NVME_AER_TYPE_ERROR, info);
+}
+
+/* This is memory fault handler for the mmap area of the doorbells*/
+static vm_fault_t nvme_mdev_mmio_dbs_mmap_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct nvme_mdev_vctrl *vctrl = vma->vm_private_data;
+
+ /* DB area is just one page, starting at offset 4096 of the mmio*/
+ if (WARN_ON(vmf->pgoff != 1))
+ return VM_FAULT_SIGBUS;
+
+ get_page(vctrl->mmio.dbs_page);
+ vmf->page = vctrl->mmio.dbs_page;
+ return 0;
+}
+
+static const struct vm_operations_struct nvme_mdev_mmio_dbs_vm_ops = {
+ .fault = nvme_mdev_mmio_dbs_mmap_fault,
+};
+
+/* check that user db write is valid and send an error if not*/
+bool nvme_mdev_mmio_db_check(struct nvme_mdev_vctrl *vctrl,
+ u16 qid, u16 size, u16 db)
+{
+ if (get_current() != vctrl->iothread)
+ lockdep_assert_held(&vctrl->lock);
+
+ if (db < size)
+ return true;
+ if (qid == 0) {
+ _DBG(vctrl, "MMIO: invalid admin DB write - fatal error\n");
+ nvme_mdev_mmio_fatal_error(vctrl);
+ return false;
+ }
+
+ _DBG(vctrl, "MMIO: invalid DB value write qid=%d, size=%d, value=%d\n",
+ qid, size, db);
+
+ nvme_mdev_mmio_error(vctrl, NVME_AER_ERROR_INVALID_DB_VALUE);
+ return false;
+}
+
+/* handle submission queue doorbell write */
+static void nvme_mdev_mmio_db_write_sq(struct nvme_mdev_vctrl *vctrl,
+ u32 qid, u32 val)
+{
+ _DBG(vctrl, "MMIO: doorbell SQID %d, DB write %d\n", qid, val);
+
+ lockdep_assert_held(&vctrl->lock);
+ /* check if the db belongs to a valid queue */
+ if (qid >= MAX_VIRTUAL_QUEUES || !test_bit(qid, vctrl->vsq_en))
+ goto err_db;
+
+ /* emulate the shadow doorbell functionality */
+ if (!vctrl->mmio.shadow_db_en || qid == 0)
+ vctrl->mmio.dbs[qid].sqt = cpu_to_le32(val & 0x0000FFFF);
+
+ if (qid != 0)
+ vctrl->io_idle = false;
+
+ if (vctrl->vctrl_paused || !vctrl->mmio.shadow_db_supported)
+ return;
+
+ qid ? nvme_mdev_io_resume(vctrl) : nvme_mdev_adm_process_sq(vctrl);
+ return;
+err_db:
+
+ _DBG(vctrl, "MMIO: inactive/invalid SQ DB write qid=%d, value=%d\n",
+ qid, val);
+
+ nvme_mdev_mmio_error(vctrl, NVME_AER_ERROR_INVALID_DB_REG);
+}
+
+/* handle doorbell write */
+static void nvme_mdev_mmio_db_write_cq(struct nvme_mdev_vctrl *vctrl,
+ u32 qid, u32 val)
+{
+ _DBG(vctrl, "MMIO: doorbell CQID %d, DB write %d\n", qid, val);
+
+ lockdep_assert_held(&vctrl->lock);
+ /* check if the db belongs to a valid queue */
+ if (qid >= MAX_VIRTUAL_QUEUES || !test_bit(qid, vctrl->vcq_en))
+ goto err_db;
+
+ /* emulate the shadow doorbell functionality */
+ if (!vctrl->mmio.shadow_db_en || qid == 0)
+ vctrl->mmio.dbs[qid].cqh = cpu_to_le16(val & 0xFFFF);
+
+ if (vctrl->vctrl_paused || !vctrl->mmio.shadow_db_supported)
+ return;
+
+ if (qid == 0) {
+ nvme_mdev_vcq_process(vctrl, 0, false);
+ // if completion queue was full prior to that, we
+ // might have some admin commands pending,
+ // and this is the last chance to process them
+ nvme_mdev_adm_process_sq(vctrl);
+ }
+ return;
+err_db:
+ _DBG(vctrl,
+ "MMIO: inactive/invalid CQ DB write qid=%d, value=%d\n",
+ qid, val);
+
+ nvme_mdev_mmio_error(vctrl, NVME_AER_ERROR_INVALID_DB_REG);
+}
+
+/* This is called when user enables the controller */
+static void nvme_mdev_mmio_cntrl_enable(struct nvme_mdev_vctrl *vctrl)
+{
+ u64 acq, asq;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ // Controller must be reset from the dead state
+ if (nvme_mdev_vctrl_is_dead(vctrl))
+ goto error;
+
+ /* only NVME command set supported */
+ if (((vctrl->mmio.cc >> NVME_CC_CSS_SHIFT) & 0x7) != 0)
+ goto error;
+
+ /* Check the queue arbitration method*/
+ if ((vctrl->mmio.cc & NVME_CC_AMS_MASK) != NVME_CC_AMS_RR)
+ goto error;
+
+ /* Check the page size*/
+ if (((vctrl->mmio.cc >> NVME_CC_MPS_SHIFT) & 0xF) != (PAGE_SHIFT - 12))
+ goto error;
+
+ /* Start the admin completion queue*/
+ acq = vctrl->mmio.acql | ((u64)vctrl->mmio.acqh << 32);
+ asq = vctrl->mmio.asql | ((u64)vctrl->mmio.asqh << 32);
+
+ if (!nvme_mdev_vctrl_enable(vctrl, acq, asq, vctrl->mmio.aqa))
+ goto error;
+
+ /* Success! */
+ vctrl->mmio.csts |= NVME_CSTS_RDY;
+ return;
+error:
+ _DBG(vctrl, "MMIO: failure to enable the controller - fatal error\n");
+ nvme_mdev_mmio_fatal_error(vctrl);
+}
+
+/* This is called when user sends a notification that controller is
+ * about to be disabled
+ */
+static void nvme_mdev_mmio_cntrl_shutdown(struct nvme_mdev_vctrl *vctrl)
+{
+ lockdep_assert_held(&vctrl->lock);
+
+ /* clear shutdown notification bits */
+ vctrl->mmio.cc &= ~NVME_CC_SHN_MASK;
+
+ if (nvme_mdev_vctrl_is_dead(vctrl)) {
+ _DBG(vctrl, "MMIO: shutdown notification for dead ctrl\n");
+ return;
+ }
+
+ /* not enabled */
+ if (!(vctrl->mmio.csts & NVME_CSTS_RDY)) {
+ _DBG(vctrl, "MMIO: shutdown notification with CSTS.RDY==0\n");
+ nvme_mdev_assert_io_not_running(vctrl);
+ return;
+ }
+
+ nvme_mdev_io_pause(vctrl);
+ nvme_mdev_vctrl_disable(vctrl);
+ vctrl->mmio.csts |= NVME_CSTS_SHST_CMPLT;
+}
+
+/* MMIO BAR read/write */
+static int nvme_mdev_mmio_bar_access(struct nvme_mdev_vctrl *vctrl,
+ u16 offset, char *buf,
+ u32 count, bool is_write)
+{
+ u32 val, oldval;
+
+ mutex_lock(&vctrl->lock);
+
+ /* Drop non DWORD sized and aligned reads/writes
+ * (QWORD read/writes are split by the caller)
+ */
+ if (count != 4 || (offset & 0x3))
+ goto drop;
+
+ val = is_write ? le32_to_cpu(*(__le32 *)buf) : 0;
+
+ switch (offset) {
+ case NVME_REG_CAP:
+ /* controller capabilities (low 32 bit)*/
+ if (is_write)
+ goto drop;
+ store_le32(buf, vctrl->mmio.cap & 0xFFFFFFFF);
+ break;
+
+ case NVME_REG_CAP + 4:
+ /* controller capabilities (upper 32 bit)*/
+ if (is_write)
+ goto drop;
+ store_le32(buf, vctrl->mmio.cap >> 32);
+ break;
+
+ case NVME_REG_VS:
+ if (is_write)
+ goto drop;
+ store_le32(buf, NVME_MDEV_NVME_VER);
+ break;
+
+ case NVME_REG_INTMS:
+ case NVME_REG_INTMC:
+ /* Interrupt Mask Set & Clear */
+ goto drop;
+
+ case NVME_REG_CC:
+ /* Controller Configuration */
+ if (!is_write) {
+ store_le32(buf, vctrl->mmio.cc);
+ break;
+ }
+
+ oldval = vctrl->mmio.cc;
+ vctrl->mmio.cc = val;
+
+ /* drop if reserved bits set */
+ if (vctrl->mmio.cc & 0xFF00000E) {
+ _DBG(vctrl,
+ "MMIO: reserved bits of CC set - fatal error\n");
+ nvme_mdev_mmio_fatal_error(vctrl);
+ goto drop;
+ }
+
+ /* CSS(command set),MPS(memory page size),AMS(queue arbitration)
+ * must not be changed while controller is running
+ */
+ if (vctrl->mmio.csts & NVME_CSTS_RDY) {
+ if ((vctrl->mmio.cc & 0x3FF0) != (oldval & 0x3FF0)) {
+ _DBG(vctrl,
+ "MMIO: attempt to change setting bits of CC while CC.EN=1 - fatal error\n");
+
+ nvme_mdev_mmio_fatal_error(vctrl);
+ goto drop;
+ }
+ }
+
+ if ((vctrl->mmio.cc & NVME_CC_SHN_MASK) != NVME_CC_SHN_NONE) {
+ _DBG(vctrl, "MMIO: CC.SHN != 0 - shutdown\n");
+ nvme_mdev_mmio_cntrl_shutdown(vctrl);
+ }
+
+ /* change in controller enabled state */
+ if ((val & NVME_CC_ENABLE) == (oldval & NVME_CC_ENABLE))
+ break;
+
+ if (vctrl->mmio.cc & NVME_CC_ENABLE) {
+ _DBG(vctrl, "MMIO: CC.EN<=1 - enable the controller\n");
+ nvme_mdev_mmio_cntrl_enable(vctrl);
+ } else {
+ _DBG(vctrl, "MMIO: CC.EN<=0 - reset controller\n");
+ __nvme_mdev_vctrl_reset(vctrl, false);
+ }
+
+ break;
+
+ case NVME_REG_CSTS:
+ /* Controller Status */
+ if (is_write)
+ goto drop;
+ store_le32(buf, vctrl->mmio.csts);
+ break;
+
+ case NVME_REG_AQA:
+ /* admin queue submission and completion size*/
+ if (!is_write)
+ store_le32(buf, vctrl->mmio.aqa);
+ else if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ vctrl->mmio.aqa = val;
+ else
+ goto drop;
+ break;
+
+ case NVME_REG_ASQ:
+ /* admin submission queue address (low 32 bit)*/
+ if (!is_write)
+ store_le32(buf, vctrl->mmio.asql);
+ else if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ vctrl->mmio.asql = val;
+ else
+ goto drop;
+ break;
+
+ case NVME_REG_ASQ + 4:
+ /* admin submission queue address (high 32 bit)*/
+ if (!is_write)
+ store_le32(buf, vctrl->mmio.asqh);
+ else if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ vctrl->mmio.asqh = val;
+ else
+ goto drop;
+ break;
+
+ case NVME_REG_ACQ:
+ /* admin completion queue address (low 32 bit)*/
+ if (!is_write)
+ store_le32(buf, vctrl->mmio.acql);
+ else if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ vctrl->mmio.acql = val;
+ else
+ goto drop;
+ break;
+
+ case NVME_REG_ACQ + 4:
+ /* admin completion queue address (high 32 bit)*/
+ if (!is_write)
+ store_le32(buf, vctrl->mmio.acqh);
+ else if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ vctrl->mmio.acqh = val;
+ else
+ goto drop;
+ break;
+
+ case NVME_REG_CMBLOC:
+ case NVME_REG_CMBSZ:
+ /* not supported - hardwired to 0*/
+ if (is_write)
+ goto drop;
+ store_le32(buf, 0);
+ break;
+
+ case NVME_REG_DBS ... (NVME_REG_DBS + DB_AREA_SIZE - 1): {
+ /* completion and submission doorbells */
+ u16 db_offset = offset - NVME_REG_DBS;
+ u16 index = db_offset >> (DB_STRIDE_SHIFT + 2);
+ u16 qid = index >> 1;
+ bool sq = (index & 0x1) == 0;
+
+ if (!is_write || (db_offset & DB_MASK))
+ goto drop;
+
+ if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ goto drop;
+
+ if (nvme_mdev_vctrl_is_dead(vctrl))
+ goto drop;
+
+ sq ? nvme_mdev_mmio_db_write_sq(vctrl, qid, val) :
+ nvme_mdev_mmio_db_write_cq(vctrl, qid, val);
+ break;
+ }
+ default:
+ goto drop;
+ }
+
+ mutex_unlock(&vctrl->lock);
+ return count;
+drop:
+ _DBG(vctrl, "MMIO: dropping write at 0x%x\n", offset);
+ mutex_unlock(&vctrl->lock);
+ return 0;
+}
+
+/* Called when the virtual controller is created */
+int nvme_mdev_mmio_create(struct nvme_mdev_vctrl *vctrl)
+{
+ int ret;
+
+ /* BAR0 */
+ nvme_mdev_pci_setup_bar(vctrl, PCI_BASE_ADDRESS_0,
+ MMIO_BAR_SIZE, nvme_mdev_mmio_bar_access);
+
+ /* Spec allows for maximum depth of 0x10000, but we limit
+ * it to 1 less to avoid various overflows
+ */
+ BUILD_BUG_ON(MAX_VIRTUAL_QUEUE_DEPTH > 0xFFFF);
+
+ /* CAP has 4 bits for the doorbell stride shift*/
+ BUILD_BUG_ON(DB_STRIDE_SHIFT > 0xF);
+
+ /* Shadow doorbell limits doorbells to 1 page*/
+ BUILD_BUG_ON(DB_AREA_SIZE > PAGE_SIZE);
+
+ /* Just in case...*/
+ BUILD_BUG_ON((PAGE_SHIFT - 12) > 0xF);
+
+ vctrl->mmio.cap =
+ // MQES: maximum queue entries
+ ((u64)(MAX_VIRTUAL_QUEUE_DEPTH - 1) << 0) |
+ // CQR: physically contiguous queues - no
+ (0ULL << 16) |
+ // AMS: Queue arbitration.
+ // TODOLATER: IO: implement WRRU
+ (0ULL << 17) |
+ // TO: RDY timeout - 0 (done in sync)
+ (0ULL << 24) |
+ // DSTRD: doorbell stride
+ ((u64)DB_STRIDE_SHIFT << 32) |
+ // NSSRS: no support for nvme subsystem reset
+ (0ULL << 36) |
+ // CSS: NVM command set supported
+ (1ULL << 37) |
+ // BPS: no support for boot partition
+ (0ULL << 45) |
+ // MPSMIN: Minimum page size supported is PAGE_SIZE
+ ((u64)(PAGE_SHIFT - 12) << 48) |
+ // MPSMAX: Maximum page size is PAGE_SIZE as well
+ ((u64)(PAGE_SHIFT - 12) << 52);
+
+ /* Create the (regular) doorbell buffers */
+ vctrl->mmio.dbs_page = alloc_pages_node(vctrl->hctrl->node,
+ __GFP_ZERO, 0);
+
+ ret = -ENOMEM;
+
+ if (!vctrl->mmio.dbs_page)
+ goto error0;
+
+ vctrl->mmio.db_page_kmap = kmap(vctrl->mmio.dbs_page);
+ if (!vctrl->mmio.db_page_kmap)
+ goto error1;
+
+ vctrl->mmio.fake_eidx_page = alloc_pages_node(vctrl->hctrl->node,
+ __GFP_ZERO, 0);
+ if (!vctrl->mmio.fake_eidx_page)
+ goto error2;
+
+ vctrl->mmio.fake_eidx_kmap = kmap(vctrl->mmio.fake_eidx_page);
+ if (!vctrl->mmio.fake_eidx_kmap)
+ goto error3;
+ return 0;
+error3:
+ put_page(vctrl->mmio.fake_eidx_kmap);
+error2:
+ kunmap(vctrl->mmio.dbs_page);
+error1:
+ put_page(vctrl->mmio.dbs_page);
+error0:
+ return ret;
+}
+
+/* Called when the virtual controller is reset */
+void nvme_mdev_mmio_reset(struct nvme_mdev_vctrl *vctrl, bool pci_reset)
+{
+ vctrl->mmio.cc = 0;
+ vctrl->mmio.csts = 0;
+
+ if (pci_reset) {
+ vctrl->mmio.aqa = 0;
+ vctrl->mmio.asql = 0;
+ vctrl->mmio.asqh = 0;
+ vctrl->mmio.acql = 0;
+ vctrl->mmio.acqh = 0;
+ }
+}
+
+/* Called when the virtual controller is opened */
+void nvme_mdev_mmio_open(struct nvme_mdev_vctrl *vctrl)
+{
+ if (!vctrl->mmio.shadow_db_supported)
+ nvme_mdev_vctrl_region_set_mmap(vctrl,
+ VFIO_PCI_BAR0_REGION_INDEX,
+ NVME_REG_DBS, PAGE_SIZE,
+ &nvme_mdev_mmio_dbs_vm_ops);
+ else
+ nvme_mdev_vctrl_region_disable_mmap(vctrl,
+ VFIO_PCI_BAR0_REGION_INDEX);
+}
+
+/* Called when the virtual controller queues are enabled */
+int nvme_mdev_mmio_enable_dbs(struct nvme_mdev_vctrl *vctrl)
+{
+ if (WARN_ON(vctrl->mmio.shadow_db_en))
+ return -EINVAL;
+
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ /* setup normal doorbells and reset them*/
+ vctrl->mmio.dbs = vctrl->mmio.db_page_kmap;
+ vctrl->mmio.eidxs = vctrl->mmio.fake_eidx_kmap;
+ memset((void *)vctrl->mmio.dbs, 0, DB_AREA_SIZE);
+ memset((void *)vctrl->mmio.eidxs, 0, DB_AREA_SIZE);
+ return 0;
+}
+
+/* Called when the virtual controller shadow doorbell is enabled */
+int nvme_mdev_mmio_enable_dbs_shadow(struct nvme_mdev_vctrl *vctrl,
+ dma_addr_t sdb_iova,
+ dma_addr_t eidx_iova)
+{
+ int ret;
+
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ ret = nvme_mdev_viommu_create_kmap(&vctrl->viommu,
+ sdb_iova, &vctrl->mmio.sdb_map);
+ if (ret)
+ return ret;
+
+ ret = nvme_mdev_viommu_create_kmap(&vctrl->viommu,
+ eidx_iova, &vctrl->mmio.seidx_map);
+ if (ret) {
+ nvme_mdev_viommu_free_kmap(&vctrl->viommu,
+ &vctrl->mmio.sdb_map);
+ return ret;
+ }
+
+ vctrl->mmio.dbs = vctrl->mmio.sdb_map.kmap;
+ vctrl->mmio.eidxs = vctrl->mmio.seidx_map.kmap;
+
+ memcpy((void *)vctrl->mmio.dbs,
+ vctrl->mmio.db_page_kmap, DB_AREA_SIZE);
+
+ memcpy((void *)vctrl->mmio.eidxs,
+ vctrl->mmio.db_page_kmap, DB_AREA_SIZE);
+
+ vctrl->mmio.shadow_db_en = true;
+ return 0;
+}
+
+/* Called on guest mapping update to
+ * verify that our mappings are still intact
+ */
+void nvme_mdev_mmio_viommu_update(struct nvme_mdev_vctrl *vctrl)
+{
+ nvme_mdev_assert_io_not_running(vctrl);
+ if (!vctrl->mmio.shadow_db_en)
+ return;
+
+ nvme_mdev_viommu_update_kmap(&vctrl->viommu, &vctrl->mmio.sdb_map);
+ nvme_mdev_viommu_update_kmap(&vctrl->viommu, &vctrl->mmio.seidx_map);
+
+ vctrl->mmio.dbs = vctrl->mmio.sdb_map.kmap;
+ vctrl->mmio.eidxs = vctrl->mmio.seidx_map.kmap;
+}
+
+/* Disable the doorbells */
+void nvme_mdev_mmio_disable_dbs(struct nvme_mdev_vctrl *vctrl)
+{
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ /* Free the shadow doorbells */
+ nvme_mdev_viommu_free_kmap(&vctrl->viommu, &vctrl->mmio.sdb_map);
+ nvme_mdev_viommu_free_kmap(&vctrl->viommu, &vctrl->mmio.seidx_map);
+
+ /* Clear the doorbells */
+ vctrl->mmio.dbs = NULL;
+ vctrl->mmio.eidxs = NULL;
+ vctrl->mmio.shadow_db_en = false;
+}
+
+/* Called when the virtual controller is about to be freed */
+void nvme_mdev_mmio_free(struct nvme_mdev_vctrl *vctrl)
+{
+ nvme_mdev_assert_io_not_running(vctrl);
+ kunmap(vctrl->mmio.dbs_page);
+ put_page(vctrl->mmio.dbs_page);
+ kunmap(vctrl->mmio.fake_eidx_page);
+ put_page(vctrl->mmio.fake_eidx_page);
+}
diff --git a/drivers/nvme/mdev/pci.c b/drivers/nvme/mdev/pci.c
new file mode 100644
index 000000000000..b7cdeaaf9c2e
--- /dev/null
+++ b/drivers/nvme/mdev/pci.c
@@ -0,0 +1,247 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * NVMe virtual controller minimal PCI/PCIe config space implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include "priv.h"
+
+/* setup a 64 bit PCI bar */
+void nvme_mdev_pci_setup_bar(struct nvme_mdev_vctrl *vctrl,
+ u8 bar,
+ unsigned int size,
+ region_access_fn access_fn)
+{
+ nvme_mdev_vctrl_add_region(vctrl,
+ VFIO_PCI_BAR0_REGION_INDEX +
+ ((bar - PCI_BASE_ADDRESS_0) >> 2),
+ size, access_fn);
+
+ store_le32(vctrl->pcicfg.wmask + bar, ~((u64)size - 1));
+ store_le32(vctrl->pcicfg.values + bar,
+ PCI_BASE_ADDRESS_SPACE_MEMORY |
+ PCI_BASE_ADDRESS_MEM_TYPE_64);
+}
+
+/* Allocate a pci capability*/
+static u8 nvme_mdev_pci_allocate_cap(struct nvme_mdev_vctrl *vctrl,
+ u8 id, u8 size)
+{
+ u8 *cfg = vctrl->pcicfg.values;
+ u8 newcap = vctrl->pcicfg.end;
+ u8 cap = cfg[PCI_CAPABILITY_LIST];
+
+ size = round_up(size, 4);
+ // only standard cfg space caps for now
+ WARN_ON(newcap + size > 256);
+
+ if (!cfg[PCI_CAPABILITY_LIST]) {
+ /*special case for first capability*/
+ u16 status = load_le16(cfg + PCI_STATUS);
+
+ status |= PCI_STATUS_CAP_LIST;
+ store_le16(cfg + PCI_STATUS, status);
+
+ cfg[PCI_CAPABILITY_LIST] = newcap;
+ goto setupcap;
+ }
+
+ while (cfg[cap + PCI_CAP_LIST_NEXT] != 0)
+ cap = cfg[cap + PCI_CAP_LIST_NEXT];
+
+ cfg[cap + PCI_CAP_LIST_NEXT] = newcap;
+
+setupcap:
+ cfg[newcap + PCI_CAP_LIST_ID] = id;
+ cfg[newcap + PCI_CAP_LIST_NEXT] = 0;
+ vctrl->pcicfg.end += size;
+ return newcap;
+}
+
+static void nvme_mdev_pci_setup_pm_cap(struct nvme_mdev_vctrl *vctrl)
+{
+ u8 *cfg = vctrl->pcicfg.values;
+ u8 *cfgm = vctrl->pcicfg.wmask;
+
+ u8 cap = nvme_mdev_pci_allocate_cap(vctrl,
+ PCI_CAP_ID_PM, PCI_PM_SIZEOF);
+
+ store_le16(cfg + cap + PCI_PM_PMC, 0x3);
+ store_le16(cfg + cap + PCI_PM_CTRL, PCI_PM_CTRL_NO_SOFT_RESET);
+ store_le16(cfgm + cap + PCI_PM_CTRL, 0x3);
+ vctrl->pcicfg.pmcap = cap;
+}
+
+static void nvme_mdev_pci_setup_msix_cap(struct nvme_mdev_vctrl *vctrl)
+{
+ u8 *cfg = vctrl->pcicfg.values;
+ u8 *cfgm = vctrl->pcicfg.wmask;
+ u8 cap = nvme_mdev_pci_allocate_cap(vctrl,
+ PCI_CAP_ID_MSIX,
+ PCI_CAP_MSIX_SIZEOF);
+
+ int MSIX_TBL_SIZE = roundup(MAX_VIRTUAL_IRQS * 16, PAGE_SIZE);
+ int MSIX_PBA_SIZE = roundup(DIV_ROUND_UP(MAX_VIRTUAL_IRQS, 8),
+ PAGE_SIZE);
+
+ store_le16(cfg + cap + PCI_MSIX_FLAGS, MAX_VIRTUAL_IRQS - 1);
+ store_le16(cfgm + cap + PCI_MSIX_FLAGS,
+ PCI_MSIX_FLAGS_MASKALL | PCI_MSIX_FLAGS_ENABLE);
+
+ store_le32(cfg + cap + PCI_MSIX_TABLE, 0x2);
+ store_le32(cfg + cap + PCI_MSIX_PBA, MSIX_TBL_SIZE | 0x2);
+
+ nvme_mdev_pci_setup_bar(vctrl, PCI_BASE_ADDRESS_2,
+ __roundup_pow_of_two(MSIX_TBL_SIZE +
+ MSIX_PBA_SIZE), NULL);
+ vctrl->pcicfg.msixcap = cap;
+}
+
+static void nvme_mdev_pci_setup_pcie_cap(struct nvme_mdev_vctrl *vctrl)
+{
+ u8 *cfg = vctrl->pcicfg.values;
+ u8 cap = nvme_mdev_pci_allocate_cap(vctrl,
+ PCI_CAP_ID_EXP,
+ PCI_CAP_EXP_ENDPOINT_SIZEOF_V2);
+
+ store_le16(cfg + cap + PCI_EXP_FLAGS, 0x02 |
+ (PCI_EXP_TYPE_ENDPOINT << 4));
+
+ store_le32(cfg + cap + PCI_EXP_DEVCAP,
+ PCI_EXP_DEVCAP_RBER | PCI_EXP_DEVCAP_FLR);
+ store_le32(cfg + cap + PCI_EXP_LNKCAP,
+ PCI_EXP_LNKCAP_SLS_8_0GB | (4 << 4) /*4x*/);
+ store_le16(cfg + cap + PCI_EXP_LNKSTA,
+ PCI_EXP_LNKSTA_CLS_8_0GB | (4 << 4) /*4x*/);
+
+ store_le32(cfg + cap + PCI_EXP_LNKCAP2, PCI_EXP_LNKCAP2_SLS_8_0GB);
+ store_le16(cfg + cap + PCI_EXP_LNKCTL2, PCI_EXP_LNKCTL2_TLS_8_0GT);
+ vctrl->pcicfg.pciecap = cap;
+}
+
+/* This is called on PCI config read/write */
+static int nvme_mdev_pci_cfg_access(struct nvme_mdev_vctrl *vctrl,
+ u16 offset, char *buf,
+ u32 count, bool is_write)
+{
+ unsigned int i;
+
+ mutex_lock(&vctrl->lock);
+
+ if (!is_write) {
+ memcpy(buf, (vctrl->pcicfg.values + offset), count);
+ goto out;
+ }
+
+ for (i = 0; i < count; i++) {
+ u8 address = offset + i;
+ u8 value = buf[i];
+ u8 old_value = vctrl->pcicfg.values[address];
+ u8 wmask = vctrl->pcicfg.wmask[address];
+ u8 new_value = (value & wmask) | (old_value & ~wmask);
+
+ /* D3/D0 power control */
+ if (address == vctrl->pcicfg.pmcap + PCI_PM_CTRL) {
+ u8 state = new_value & 0x03;
+
+ if (state != 0 && state != 3)
+ new_value = old_value;
+
+ if (old_value != new_value) {
+ const char *s = state == 3 ? "D3" : "D0";
+
+ if (state == 3)
+ __nvme_mdev_vctrl_reset(vctrl, true);
+ _DBG(vctrl, "PCI: going to %s\n", s);
+ }
+ }
+
+ /* FLR reset*/
+ if (address == vctrl->pcicfg.pciecap + PCI_EXP_DEVCTL + 1)
+ if (value & 0x80) {
+ _DBG(vctrl, "PCI: FLR reset\n");
+ __nvme_mdev_vctrl_reset(vctrl, true);
+ }
+ vctrl->pcicfg.values[offset + i] = new_value;
+ }
+out:
+ mutex_unlock(&vctrl->lock);
+ return count;
+}
+
+/* setup pci configuration */
+int nvme_mdev_pci_create(struct nvme_mdev_vctrl *vctrl)
+{
+ u8 *cfg, *cfgm;
+
+ vctrl->pcicfg.values = kzalloc(PCI_CFG_SIZE, GFP_KERNEL);
+ if (!vctrl->pcicfg.values)
+ return -ENOMEM;
+
+ vctrl->pcicfg.wmask = kzalloc(PCI_CFG_SIZE, GFP_KERNEL);
+ if (!vctrl->pcicfg.wmask) {
+ kfree(vctrl->pcicfg.values);
+ return -ENOMEM;
+ }
+
+ cfg = vctrl->pcicfg.values;
+ cfgm = vctrl->pcicfg.wmask;
+
+ nvme_mdev_vctrl_add_region(vctrl,
+ VFIO_PCI_CONFIG_REGION_INDEX,
+ PCI_CFG_SIZE,
+ nvme_mdev_pci_cfg_access);
+
+ /* vendor information */
+ store_le16(cfg + PCI_VENDOR_ID, NVME_MDEV_PCI_VENDOR_ID);
+ store_le16(cfg + PCI_DEVICE_ID, NVME_MDEV_PCI_DEVICE_ID);
+
+ /* pci command register */
+ store_le16(cfgm + PCI_COMMAND,
+ PCI_COMMAND_INTX_DISABLE |
+ PCI_COMMAND_MEMORY |
+ PCI_COMMAND_MASTER);
+
+ /* pci status register */
+ store_le16(cfg + PCI_STATUS, PCI_STATUS_CAP_LIST);
+
+ /* subsystem information */
+ store_le16(cfg + PCI_SUBSYSTEM_VENDOR_ID, NVME_MDEV_PCI_SUBVENDOR_ID);
+ store_le16(cfg + PCI_SUBSYSTEM_ID, NVME_MDEV_PCI_SUBDEVICE_ID);
+ store_le8(cfg + PCI_CLASS_REVISION, NVME_MDEV_PCI_REVISION);
+
+ /*Programming Interface (NVM Express) */
+ store_le8(cfg + PCI_CLASS_PROG, 0x02);
+
+ /* Device class and subclass
+ * (Mass storage controller, Non-Volatile memory controller)
+ */
+ store_le16(cfg + PCI_CLASS_DEVICE, 0x0108);
+
+ /* dummy read/write */
+ store_le8(cfgm + PCI_CACHE_LINE_SIZE, 0xFF);
+
+ /* initial value*/
+ store_le8(cfg + PCI_CAPABILITY_LIST, 0);
+ vctrl->pcicfg.end = 0x40;
+
+ nvme_mdev_pci_setup_pm_cap(vctrl);
+ nvme_mdev_pci_setup_msix_cap(vctrl);
+ nvme_mdev_pci_setup_pcie_cap(vctrl);
+
+ /* INTX IRQ number - info only for BIOS */
+ store_le8(cfgm + PCI_INTERRUPT_LINE, 0xFF);
+ store_le8(cfg + PCI_INTERRUPT_PIN, 0x01);
+
+ return 0;
+}
+
+/* teardown pci configuration */
+void nvme_mdev_pci_free(struct nvme_mdev_vctrl *vctrl)
+{
+ kfree(vctrl->pcicfg.values);
+ kfree(vctrl->pcicfg.wmask);
+ vctrl->pcicfg.values = NULL;
+ vctrl->pcicfg.wmask = NULL;
+}
diff --git a/drivers/nvme/mdev/priv.h b/drivers/nvme/mdev/priv.h
new file mode 100644
index 000000000000..9f65e46c1ab2
--- /dev/null
+++ b/drivers/nvme/mdev/priv.h
@@ -0,0 +1,700 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Driver private data structures and helper macros
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+
+#ifndef _MDEV_NVME_PRIV_H
+#define _MDEV_NVME_PRIV_H
+
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/pci.h>
+#include <linux/eventfd.h>
+#include <linux/byteorder/generic.h>
+#include "../host/nvme.h"
+#include "mdev.h"
+
+#define NVME_MDEV_NVME_VER NVME_VS(0x01, 0x03, 0x00)
+#define NVME_MDEV_FIRMWARE_VERSION "1.0"
+
+#define NVME_MDEV_PCI_VENDOR_ID PCI_VENDOR_ID_REDHAT_QUMRANET
+#define NVME_MDEV_PCI_DEVICE_ID 0x1234
+#define NVME_MDEV_PCI_SUBVENDOR_ID PCI_SUBVENDOR_ID_REDHAT_QUMRANET
+#define NVME_MDEV_PCI_SUBDEVICE_ID 0
+#define NVME_MDEV_PCI_REVISION 0x0
+
+#define DB_STRIDE_SHIFT 4 /*4 = 1 cacheline */
+#define MAX_VIRTUAL_QUEUES 16
+#define MAX_VIRTUAL_QUEUE_DEPTH 0xFFFF
+#define MAX_VIRTUAL_NAMESPACES 16 /* NSID = 1..16*/
+#define MAX_VIRTUAL_IRQS 16
+
+#define MAX_HOST_QUEUES 4
+#define MAX_AER_COMMANDS 16
+#define MAX_LOG_PAGES 16
+
+extern bool use_shadow_doorbell;
+extern unsigned int io_timeout_ms;
+extern unsigned int poll_timeout_ms;
+extern unsigned int admin_poll_rate_ms;
+
+/* virtual submission queue*/
+struct nvme_vsq {
+ u16 qid;
+ u16 size;
+ u16 head; /*next item to read */
+
+ struct nvme_command *data; /*the queue*/
+ struct nvme_vcq *vcq; /* completion queue*/
+
+ dma_addr_t iova;
+ bool cont;
+
+ u16 hsq;
+};
+
+/* virtual completion queue*/
+struct nvme_vcq {
+ /* basic queue settings */
+ u16 qid;
+ u16 size;
+ u16 head;
+ u16 tail;
+ bool phase; /* current queue phase */
+
+ volatile struct nvme_completion *data;
+
+ /* number of items pending*/
+ u16 pending;
+
+ /* IRQ settings */
+ int irq /* -1 if disabled*/;
+
+ dma_addr_t iova;
+ bool cont;
+};
+
+/*A virtual namespace */
+struct nvme_mdev_vns {
+ /* host nvme namespace that we are attached to it*/
+ struct nvme_ns *host_ns;
+
+ /* block device that corresponds to the partition of that namespace */
+ struct block_device *host_part;
+ fmode_t fmode;
+
+ u32 nsid;
+
+ /* NSID on the host*/
+ u32 host_nsid;
+
+ /* host partition ID*/
+ unsigned int host_partid;
+
+ /* Offset inside the host namespace (start of the partition)*/
+ u64 host_lba_offset;
+
+ /* size of each block on the real namespace, same for host and guest */
+ u8 blksize_shift;
+
+ /* size of the namespace in lbas*/
+ u64 ns_size;
+
+ /* is the namespace read only?*/
+ bool readonly;
+
+ /* UUID of this namespace */
+ uuid_t uuid;
+
+ /* Optimal IO boundary*/
+ u16 noiob;
+};
+
+/* Virtual IOMMU */
+struct nvme_mdev_viommu {
+ struct device *hw_dev;
+ struct device *sw_dev;
+
+ /* dma/prp bookkeeping */
+ struct rb_root_cached maps_tree;
+ struct list_head maps_list;
+ struct nvme_mdev_vctrl *vctrl;
+};
+
+struct doorbell {
+ volatile __le32 sqt;
+ u8 rsvd1[(4 << DB_STRIDE_SHIFT) - sizeof(__le32)];
+ volatile __le32 cqh;
+ u8 rsvd2[(4 << DB_STRIDE_SHIFT) - sizeof(__le32)];
+};
+
+/* MMIO state */
+struct nvme_mdev_user_ctrl_mmio {
+ u32 cc; /* controller configuration */
+ u32 csts; /* controller status */
+ u64 cap /* controller capabilities*/;
+
+ /* admin queue location & size */
+ u32 aqa;
+ u32 asql;
+ u32 asqh;
+ u32 acql;
+ u32 acqh;
+
+ bool shadow_db_supported;
+ bool shadow_db_en;
+
+ /* Regular doorbells */
+ struct page *dbs_page;
+ struct page *fake_eidx_page;
+ void *db_page_kmap;
+ void *fake_eidx_kmap;
+
+ /* Shadow doorbells */
+ struct page_map sdb_map;
+ struct page_map seidx_map;
+
+ /* Current doorbell mappings */
+ volatile struct doorbell *dbs;
+ volatile struct doorbell *eidxs;
+};
+
+/* pci configuration space of the device*/
+#define PCI_CFG_SIZE 4096
+struct nvme_mdev_pci_cfg_space {
+ u8 *values;
+ u8 *wmask;
+
+ u8 pmcap;
+ u8 pciecap;
+ u8 msixcap;
+ u8 end;
+};
+
+/*IRQ state of the controller */
+struct nvme_mdev_user_irq {
+ struct eventfd_ctx *trigger;
+ /* IRQ coalescing */
+ bool irq_coalesc_en;
+ time_t irq_time;
+ unsigned int irq_pending_cnt;
+};
+
+enum nvme_mdev_irq_mode {
+ NVME_MDEV_IMODE_NONE,
+ NVME_MDEV_IMODE_INTX,
+ NVME_MDEV_IMODE_MSIX,
+};
+
+struct nvme_mdev_user_irqs {
+ /* one of VFIO_PCI_{INTX|MSI|MSIX}_IRQ_INDEX */
+ enum nvme_mdev_irq_mode mode;
+
+ struct nvme_mdev_user_irq vecs[MAX_VIRTUAL_IRQS];
+ /* user interrupt coalescing settings */
+ u8 irq_coalesc_max;
+ unsigned int irq_coalesc_time_us;
+ /* device removal trigger*/
+ struct eventfd_ctx *request_trigger;
+};
+
+/*AER state */
+struct nvme_mdev_user_events {
+ /* async event request CIDs*/
+ u16 aer_cids[MAX_AER_COMMANDS];
+ unsigned int aer_cid_count;
+
+ /* events that are enabled */
+ unsigned long events_enabled[BITS_TO_LONGS(MAX_LOG_PAGES)];
+
+ /* events that are masked till next log page read*/
+ unsigned long events_masked[BITS_TO_LONGS(MAX_LOG_PAGES)];
+
+ /* events that are pending to be sent when user gives us an AER*/
+ unsigned long events_pending[BITS_TO_LONGS(MAX_LOG_PAGES)];
+ u32 event_values[MAX_LOG_PAGES];
+};
+
+/* host IO queue */
+struct nvme_mdev_hq {
+ unsigned int usecount;
+ struct list_head link;
+ unsigned int hqid;
+};
+
+/* IO region abstraction (BARs, the PCI config space */
+struct nvme_mdev_vctrl;
+typedef int (*region_access_fn) (struct nvme_mdev_vctrl *vctrl,
+ u16 offset, char *buf,
+ u32 size, bool is_write);
+
+struct nvme_mdev_io_region {
+ unsigned int size;
+ region_access_fn rw;
+
+ /* IF != NULL, the mmap_area_start/size specify the mmaped window
+ * of this region
+ */
+ const struct vm_operations_struct *mmap_ops;
+ unsigned int mmap_area_start;
+ unsigned int mmap_area_size;
+};
+
+/*Virtual NVME controller state */
+struct nvme_mdev_vctrl {
+ struct kref ref;
+ struct mutex lock;
+ struct list_head link;
+
+ struct mdev_device *mdev;
+ struct nvme_mdev_hctrl *hctrl;
+ bool inuse;
+
+ struct nvme_mdev_io_region regions[VFIO_PCI_NUM_REGIONS];
+
+ /* virtual controller state */
+ struct nvme_mdev_user_ctrl_mmio mmio;
+ struct nvme_mdev_pci_cfg_space pcicfg;
+ struct nvme_mdev_user_irqs irqs;
+ struct nvme_mdev_user_events events;
+
+ /* emulated namespaces */
+ struct nvme_mdev_vns *namespaces[MAX_VIRTUAL_NAMESPACES];
+ __le32 ns_log[MAX_VIRTUAL_NAMESPACES];
+ unsigned int ns_log_size;
+
+ /* emulated submission queues*/
+ struct nvme_vsq vsqs[MAX_VIRTUAL_QUEUES];
+ unsigned long vsq_en[BITS_TO_LONGS(MAX_VIRTUAL_QUEUES)];
+
+ /* emulated completion queues*/
+ unsigned long vcq_en[BITS_TO_LONGS(MAX_VIRTUAL_QUEUES)];
+ struct nvme_vcq vcqs[MAX_VIRTUAL_QUEUES];
+
+ /* Host IO queues*/
+ int max_host_hw_queues;
+ struct list_head host_hw_queues;
+
+ /* Interface to access user memory */
+ struct notifier_block vfio_map_notifier;
+ struct notifier_block vfio_unmap_notifier;
+ struct nvme_mdev_viommu viommu;
+
+ /* the IO thread */
+ struct task_struct *iothread;
+ bool iothread_parked;
+ bool io_idle;
+ ktime_t now;
+
+ /* Settings */
+ unsigned int arb_burst_shift;
+ u8 worload_hint;
+ unsigned int iothread_cpu;
+
+ /* Identification*/
+ char subnqn[256];
+ char serial[9];
+
+ bool vctrl_paused; /* true when the host device paused our IO */
+};
+
+/* mdev instance type*/
+struct nvme_mdev_inst_type {
+ unsigned int max_hw_queues;
+ char name[16];
+ struct attribute_group *attrgroup;
+};
+
+/*Abstraction of the host controller that we are connected to */
+struct nvme_mdev_hctrl {
+ struct mutex lock;
+
+ /* numa node of the host controller*/
+ int node;
+
+ struct list_head link;
+ struct kref ref;
+ bool removing;
+
+ /* for reference counting */
+ struct nvme_ctrl *nvme_ctrl;
+
+ /* Host area*/
+ u16 oncs;
+ u8 mdts;
+ unsigned int id;
+
+ /* book-keeping for number of host queues we can allocate*/
+ unsigned int nr_host_queues;
+};
+
+/* vctrl.c*/
+struct nvme_mdev_vctrl *nvme_mdev_vctrl_create(struct mdev_device *mdev,
+ struct nvme_mdev_hctrl *hctrl,
+ unsigned int max_host_queues);
+
+int nvme_mdev_vctrl_destroy(struct nvme_mdev_vctrl *vctrl);
+
+int nvme_mdev_vctrl_open(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_vctrl_release(struct nvme_mdev_vctrl *vctrl);
+
+void nvme_mdev_vctrl_pause(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_vctrl_resume(struct nvme_mdev_vctrl *vctrl);
+
+bool nvme_mdev_vctrl_enable(struct nvme_mdev_vctrl *vctrl,
+ dma_addr_t cqiova, dma_addr_t sqiova, u32 sizes);
+
+void nvme_mdev_vctrl_disable(struct nvme_mdev_vctrl *vctrl);
+
+void nvme_mdev_vctrl_reset(struct nvme_mdev_vctrl *vctrl);
+void __nvme_mdev_vctrl_reset(struct nvme_mdev_vctrl *vctrl, bool pci_reset);
+
+void nvme_mdev_vctrl_add_region(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index, unsigned int size,
+ region_access_fn access_fn);
+
+void nvme_mdev_vctrl_region_set_mmap(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index,
+ unsigned int offset,
+ unsigned int size,
+ const struct vm_operations_struct *ops);
+
+void nvme_mdev_vctrl_region_disable_mmap(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index);
+
+void nvme_mdev_vctrl_bind_iothread(struct nvme_mdev_vctrl *vctrl,
+ unsigned int cpu);
+
+int nvme_mdev_vctrl_set_shadow_doorbell_supported(struct nvme_mdev_vctrl *vctrl,
+ bool enable);
+
+int nvme_mdev_vctrl_hq_alloc(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_vctrl_hq_free(struct nvme_mdev_vctrl *vctrl, u16 qid);
+unsigned int nvme_mdev_vctrl_hqs_list(struct nvme_mdev_vctrl *vctrl, u16 *out);
+bool nvme_mdev_vctrl_is_dead(struct nvme_mdev_vctrl *vctrl);
+
+int nvme_mdev_vctrl_viommu_map(struct nvme_mdev_vctrl *vctrl, u32 flags,
+ dma_addr_t iova, u64 size);
+
+int nvme_mdev_vctrl_viommu_unmap(struct nvme_mdev_vctrl *vctrl,
+ dma_addr_t iova, u64 size);
+
+/* hctrl.c*/
+struct nvme_mdev_inst_type *nvme_mdev_inst_type_get(const char *name);
+struct nvme_mdev_hctrl *nvme_mdev_hctrl_lookup_get(struct device *parent);
+void nvme_mdev_hctrl_put(struct nvme_mdev_hctrl *hctrl);
+
+int nvme_mdev_hctrl_hqs_available(struct nvme_mdev_hctrl *hctrl);
+
+bool nvme_mdev_hctrl_hqs_reserve(struct nvme_mdev_hctrl *hctrl,
+ unsigned int n);
+void nvme_mdev_hctrl_hqs_unreserve(struct nvme_mdev_hctrl *hctrl,
+ unsigned int n);
+
+int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl);
+void nvme_mdev_hctrl_hq_free(struct nvme_mdev_hctrl *hctrl, u16 qid);
+bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid);
+bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode);
+
+int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
+ u16 qid, u32 tag,
+ struct nvme_command *cmd,
+ struct nvme_ext_data_iter *datait);
+
+int nvme_mdev_hctrl_hq_poll(struct nvme_mdev_hctrl *hctrl,
+ u32 qid,
+ struct nvme_ext_cmd_result *results,
+ unsigned int max_len);
+
+void nvme_mdev_hctrl_destroy_all(void);
+
+/* io.c */
+int nvme_mdev_io_create(struct nvme_mdev_vctrl *vctrl, unsigned int cpu);
+void nvme_mdev_io_free(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_io_pause(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_io_resume(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_assert_io_not_running(struct nvme_mdev_vctrl *vctrl);
+
+/* mmio.c*/
+int nvme_mdev_mmio_create(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_mmio_open(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_mmio_reset(struct nvme_mdev_vctrl *vctrl, bool pci_reset);
+void nvme_mdev_mmio_free(struct nvme_mdev_vctrl *vctrl);
+
+int nvme_mdev_mmio_enable_dbs(struct nvme_mdev_vctrl *vctrl);
+int nvme_mdev_mmio_enable_dbs_shadow(struct nvme_mdev_vctrl *vctrl,
+ dma_addr_t sdb_iova, dma_addr_t eidx_iova);
+
+void nvme_mdev_mmio_viommu_update(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_mmio_disable_dbs(struct nvme_mdev_vctrl *vctrl);
+bool nvme_mdev_mmio_db_check(struct nvme_mdev_vctrl *vctrl,
+ u16 qid, u16 size, u16 db);
+
+/* pci.c*/
+int nvme_mdev_pci_create(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_pci_free(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_pci_setup_bar(struct nvme_mdev_vctrl *vctrl,
+ u8 bar, unsigned int size,
+ region_access_fn access_fn);
+/* adm.c*/
+void nvme_mdev_adm_process_sq(struct nvme_mdev_vctrl *vctrl);
+
+/* events.c */
+void nvme_mdev_events_init(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_events_reset(struct nvme_mdev_vctrl *vctrl);
+
+int nvme_mdev_event_request_receive(struct nvme_mdev_vctrl *vctrl, u16 cid);
+void nvme_mdev_event_process_ack(struct nvme_mdev_vctrl *vctrl, u8 log_page);
+
+void nvme_mdev_event_send(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_async_event_type type,
+ enum nvme_async_event info);
+
+u32 nvme_mdev_event_read_aen_config(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_event_set_aen_config(struct nvme_mdev_vctrl *vctrl, u32 value);
+
+/* irq.c*/
+void nvme_mdev_irqs_setup(struct nvme_mdev_vctrl *vctrl);
+void nvme_mdev_irqs_reset(struct nvme_mdev_vctrl *vctrl);
+
+int nvme_mdev_irqs_enable(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_mdev_irq_mode mode);
+void nvme_mdev_irqs_disable(struct nvme_mdev_vctrl *vctrl,
+ enum nvme_mdev_irq_mode mode);
+
+int nvme_mdev_irqs_set_triggers(struct nvme_mdev_vctrl *vctrl,
+ int start, int count, int32_t *fds);
+int nvme_mdev_irqs_set_unplug_trigger(struct nvme_mdev_vctrl *vctrl,
+ int32_t fd);
+
+void nvme_mdev_irq_raise_unplug_event(struct nvme_mdev_vctrl *vctrl,
+ unsigned int count);
+void nvme_mdev_irq_raise(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index);
+void nvme_mdev_irq_trigger(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index);
+void nvme_mdev_irq_cond_trigger(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index);
+void nvme_mdev_irq_clear(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index);
+
+/* ns.c*/
+int nvme_mdev_vns_open(struct nvme_mdev_vctrl *vctrl,
+ u32 host_nsid, unsigned int host_partid);
+int nvme_mdev_vns_destroy(struct nvme_mdev_vctrl *vctrl,
+ u32 user_nsid);
+void nvme_mdev_vns_destroy_all(struct nvme_mdev_vctrl *vctrl);
+
+struct nvme_mdev_vns *nvme_mdev_vns_from_vnsid(struct nvme_mdev_vctrl *vctrl,
+ u32 user_ns_id);
+
+int nvme_mdev_vns_print_description(struct nvme_mdev_vctrl *vctrl,
+ char *buf, unsigned int size);
+void nvme_mdev_vns_host_ns_update(struct nvme_mdev_vctrl *vctrl,
+ u32 host_nsid, bool removed);
+
+void nvme_mdev_vns_log_reset(struct nvme_mdev_vctrl *vctrl);
+
+/* vcq.c */
+int nvme_mdev_vcq_init(struct nvme_mdev_vctrl *vctrl, u16 qid,
+ dma_addr_t iova, bool cont, u16 size, int irq);
+
+int nvme_mdev_vcq_viommu_update(struct nvme_mdev_viommu *viommu,
+ struct nvme_vcq *q);
+
+void nvme_mdev_vcq_delete(struct nvme_mdev_vctrl *vctrl, u16 qid);
+void nvme_mdev_vcq_process(struct nvme_mdev_vctrl *vctrl, u16 qid,
+ bool trigger_irqs);
+
+bool nvme_mdev_vcq_flush(struct nvme_mdev_vctrl *vctrl, u16 qid);
+bool nvme_mdev_vcq_reserve_space(struct nvme_vcq *q);
+
+void nvme_mdev_vcq_write_io(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vcq *q, u16 sq_head,
+ u16 sqid, u16 cid, u16 status);
+
+void nvme_mdev_vcq_write_adm(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vcq *q, u32 dw0,
+ u16 sq_head, u16 cid, u16 status);
+/* vsq.c*/
+int nvme_mdev_vsq_init(struct nvme_mdev_vctrl *vctrl, u16 qid,
+ dma_addr_t iova, bool cont, u16 size, u16 cqid);
+
+int nvme_mdev_vsq_viommu_update(struct nvme_mdev_viommu *viommu,
+ struct nvme_vsq *q);
+
+void nvme_mdev_vsq_delete(struct nvme_mdev_vctrl *vctrl, u16 qid);
+
+bool nvme_mdev_vsq_has_data(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vsq *q);
+
+const struct nvme_command *nvme_mdev_vsq_get_cmd(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vsq *q);
+
+void nvme_mdev_vsq_cmd_done_io(struct nvme_mdev_vctrl *vctrl,
+ u16 sqid, u16 cid, u16 status);
+void nvme_mdev_vsq_cmd_done_adm(struct nvme_mdev_vctrl *vctrl,
+ u32 dw0, u16 cid, u16 status);
+bool nvme_mdev_vsq_suspend_io(struct nvme_mdev_vctrl *vctrl, u16 sqid);
+
+/* udata.c*/
+void nvme_mdev_udata_iter_setup(struct nvme_mdev_viommu *viommu,
+ struct nvme_ext_data_iter *iter);
+
+int nvme_mdev_udata_iter_set_dptr(struct nvme_ext_data_iter *it,
+ const union nvme_data_ptr *dptr, u64 size);
+
+struct nvme_ext_data_iter *
+nvme_mdev_kdata_iter_alloc(struct nvme_mdev_viommu *viommu, unsigned int size);
+
+int nvme_mdev_read_from_udata(void *dst, struct nvme_ext_data_iter *srcit,
+ u64 size);
+
+int nvme_mdev_write_to_udata(struct nvme_ext_data_iter *dstit, void *src,
+ u64 size);
+
+void *nvme_mdev_udata_queue_vmap(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ unsigned int size, bool cont);
+/* viommu.c */
+void nvme_mdev_viommu_init(struct nvme_mdev_viommu *viommu,
+ struct device *sw_dev,
+ struct device *hw_dev);
+
+int nvme_mdev_viommu_add(struct nvme_mdev_viommu *viommu, u32 flags,
+ dma_addr_t iova, u64 size);
+
+int nvme_mdev_viommu_remove(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova, u64 size);
+
+int nvme_mdev_viommu_translate(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ dma_addr_t *physical,
+ dma_addr_t *host_iova);
+
+int nvme_mdev_viommu_create_kmap(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova, struct page_map *page);
+
+void nvme_mdev_viommu_free_kmap(struct nvme_mdev_viommu *viommu,
+ struct page_map *page);
+
+void nvme_mdev_viommu_update_kmap(struct nvme_mdev_viommu *viommu,
+ struct page_map *page);
+
+void nvme_mdev_viommu_reset(struct nvme_mdev_viommu *viommu);
+
+/* some utilities*/
+
+#define store_le64(address, value) (*((__le64 *)(address)) = cpu_to_le64(value))
+#define store_le32(address, value) (*((__le32 *)(address)) = cpu_to_le32(value))
+#define store_le16(address, value) (*((__le16 *)(address)) = cpu_to_le16(value))
+#define store_le8(address, value) (*((u8 *)(address)) = (value))
+
+#define load_le16(address) le16_to_cpu(*(__le16 *)(address))
+#define load_le32(address) le32_to_cpu(*(__le32 *)(address))
+
+#define store_strsp(dst, src) \
+ memcpy_and_pad(dst, sizeof(dst), src, sizeof(src) - 1, ' ')
+
+#define DNR(e) ((e) | NVME_SC_DNR)
+
+#define PAGE_ADDRESS(address) ((address) & PAGE_MASK)
+#define OFFSET_IN_PAGE(address) ((address) & ~(PAGE_MASK))
+
+#define _DBG(vctrl, fmt, ...) \
+ dev_dbg(mdev_dev((vctrl)->mdev), fmt, ##__VA_ARGS__)
+
+#define _INFO(vctrl, fmt, ...) \
+ dev_info(mdev_dev((vctrl)->mdev), fmt, ##__VA_ARGS__)
+
+#define _WARN(vctrl, fmt, ...) \
+ dev_warn(mdev_dev((vctrl)->mdev), fmt, ##__VA_ARGS__)
+
+#define mdev_to_vctrl(mdev) \
+ ((struct nvme_mdev_vctrl *)mdev_get_drvdata(mdev))
+
+#define dev_to_vctrl(mdev) \
+ mdev_to_vctrl(mdev_from_dev(dev))
+
+#define RSRV_NSID (BIT(1))
+#define RSRV_DW23 (BIT(2) | BIT(3))
+#define RSRV_MPTR (BIT(4) | BIT(5))
+
+#define RSRV_DPTR (BIT(6) | BIT(7) | BIT(8) | BIT(9))
+#define RSRV_DPTR_PRP2 (BIT(8) | BIT(9))
+
+#define RSRV_DW10_15 (BIT(10) | BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15))
+#define RSRV_DW11_15 (BIT(11) | BIT(12) | BIT(13) | BIT(14) | BIT(15))
+#define RSRV_DW12_15 (BIT(12) | BIT(13) | BIT(14) | BIT(15))
+#define RSRV_DW13_15 (BIT(13) | BIT(14) | BIT(15))
+#define RSRV_DW14_15 (BIT(14) | BIT(15))
+
+static inline bool check_reserved_dwords(const u32 *dwords,
+ int count, unsigned long bitmask)
+{
+ int bit;
+
+ if (WARN_ON(count > BITS_PER_TYPE(long)))
+ return false;
+
+ for_each_set_bit(bit, &bitmask, count)
+ if (dwords[bit])
+ return false;
+ return true;
+}
+
+static inline bool check_range(u64 start, u64 size, u64 end)
+{
+ u64 test = start + size;
+
+ /* check for overflow */
+ if (test < start || test < size)
+ return false;
+ return test <= end;
+}
+
+/* Rough translation of internal errors to the NVME errors */
+static inline int nvme_mdev_translate_error(int error)
+{
+ // nvme status, including no error (NVME_SC_SUCCESS)
+ if (error >= 0)
+ return error;
+
+ switch (error) {
+ case -ENOMEM:
+ /*no memory - truly an internal error*/
+ return NVME_SC_INTERNAL;
+ case -ENOSPC:
+ /* Happens when user sends to large PRP list
+ * User shoudn't do this since the maximum transfer size
+ * is specified in the controller caps
+ */
+ return DNR(NVME_SC_DATA_XFER_ERROR);
+ case -EFAULT:
+ /* Bad memory pointers in the prp lists*/
+ return DNR(NVME_SC_DATA_XFER_ERROR);
+ case -EINVAL:
+ /* Bad prp offsets in the prp lists/command*/
+ return DNR(NVME_SC_PRP_OFFSET_INVALID);
+ default:
+ /*Shouldn't happen */
+ WARN_ON_ONCE(true);
+ return NVME_SC_INTERNAL;
+ }
+}
+
+static inline bool timeout(ktime_t event, ktime_t now, unsigned long timeout_ms)
+{
+ return ktime_ms_delta(now, event) > (long)timeout_ms;
+}
+
+extern struct mdev_parent_ops mdev_fops;
+extern struct list_head nvme_mdev_vctrl_list;
+extern struct mutex nvme_mdev_vctrl_list_mutex;
+
+#endif // _MDEV_NVME_H
diff --git a/drivers/nvme/mdev/udata.c b/drivers/nvme/mdev/udata.c
new file mode 100644
index 000000000000..7af6b3f6d6aa
--- /dev/null
+++ b/drivers/nvme/mdev/udata.c
@@ -0,0 +1,390 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * User (guest) data access routines
+ * Implementation of PRP iterator in user memory
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/mdev.h>
+#include <linux/nvme.h>
+#include "priv.h"
+
+#define MAX_PRP ((PAGE_SIZE / sizeof(__le64)) - 1)
+
+/* Setup up a new PRP iterator */
+void nvme_mdev_udata_iter_setup(struct nvme_mdev_viommu *viommu,
+ struct nvme_ext_data_iter *iter)
+{
+ iter->viommu = viommu;
+ iter->count = 0;
+ iter->next = NULL;
+ iter->release = NULL;
+}
+
+/* Load a new prp list into the iterator. Internal*/
+static int nvme_mdev_udata_iter_load_prplist(struct nvme_ext_data_iter *iter,
+ dma_addr_t iova)
+{
+ dma_addr_t data_iova;
+ int ret;
+ __le64 *map;
+
+ /* map the prp list*/
+ ret = nvme_mdev_viommu_create_kmap(iter->viommu,
+ PAGE_ADDRESS(iova),
+ &iter->uprp.page);
+ if (ret)
+ return ret;
+
+ iter->uprp.index = OFFSET_IN_PAGE(iova) / (sizeof(__le64));
+
+ /* read its first entry and check its alignment */
+ map = iter->uprp.page.kmap;
+ data_iova = le64_to_cpu(map[iter->uprp.index]);
+
+ if (OFFSET_IN_PAGE(data_iova) != 0) {
+ nvme_mdev_viommu_free_kmap(iter->viommu, &iter->uprp.page);
+ return -EINVAL;
+ }
+
+ /* translate the entry to complete the setup*/
+ ret = nvme_mdev_viommu_translate(iter->viommu, data_iova,
+ &iter->physical, &iter->host_iova);
+ if (ret)
+ nvme_mdev_viommu_free_kmap(iter->viommu, &iter->uprp.page);
+
+ return ret;
+}
+
+/* ->next function when iterator points to prp list*/
+static int nvme_mdev_udata_iter_next_prplist(struct nvme_ext_data_iter *iter)
+{
+ dma_addr_t iova;
+ int ret;
+ __le64 *map = iter->uprp.page.kmap;
+
+ if (WARN_ON(iter->count <= 0))
+ return 0;
+
+ if (--iter->count == 0) {
+ nvme_mdev_viommu_free_kmap(iter->viommu, &iter->uprp.page);
+ return 0;
+ }
+
+ iter->uprp.index++;
+
+ if (iter->uprp.index < MAX_PRP || iter->count == 1) {
+ // advance over next pointer in current prp list
+ // these pointers must be page aligned
+ iova = le64_to_cpu(map[iter->uprp.index]);
+ if (OFFSET_IN_PAGE(iova) != 0)
+ return -EINVAL;
+
+ ret = nvme_mdev_viommu_translate(iter->viommu, iova,
+ &iter->physical,
+ &iter->host_iova);
+ if (ret)
+ nvme_mdev_viommu_free_kmap(iter->viommu,
+ &iter->uprp.page);
+ return ret;
+ }
+
+ /* switch to next prp list. it must be page aligned as well*/
+ iova = le64_to_cpu(map[MAX_PRP]);
+
+ if (OFFSET_IN_PAGE(iova) != 0)
+ return -EINVAL;
+
+ nvme_mdev_viommu_free_kmap(iter->viommu, &iter->uprp.page);
+ return nvme_mdev_udata_iter_load_prplist(iter, iova);
+}
+
+/* ->next function when iterator points to user data pointer*/
+static int nvme_mdev_udata_iter_next_dptr(struct nvme_ext_data_iter *iter)
+{
+ dma_addr_t iova;
+
+ if (WARN_ON(iter->count <= 0))
+ return 0;
+
+ if (--iter->count == 0)
+ return 0;
+
+ /* we will be called only once to deal with the second
+ * pointer in the data pointer
+ */
+ iova = le64_to_cpu(iter->dptr->prp2);
+
+ if (iter->count == 1) {
+ /* only need to read one more entry, meaning
+ * the 2nd entry of the dptr.
+ * It must be page aligned
+ */
+ if (OFFSET_IN_PAGE(iova) != 0)
+ return -EINVAL;
+ return nvme_mdev_viommu_translate(iter->viommu, iova,
+ &iter->physical,
+ &iter->host_iova);
+ } else {
+ /*
+ * Second dptr entry is prp pointer, and it might not
+ * be page aligned (but QWORD aligned at least)
+ */
+ if (iova & 0x7ULL)
+ return -EINVAL;
+ iter->next = nvme_mdev_udata_iter_next_prplist;
+ return nvme_mdev_udata_iter_load_prplist(iter, iova);
+ }
+}
+
+/* Set prp list iterator to point to data pointer found in NVME command */
+int nvme_mdev_udata_iter_set_dptr(struct nvme_ext_data_iter *it,
+ const union nvme_data_ptr *dptr, u64 size)
+{
+ int ret;
+ u64 prp1 = le64_to_cpu(dptr->prp1);
+ dma_addr_t iova = PAGE_ADDRESS(prp1);
+ unsigned int page_offset = OFFSET_IN_PAGE(prp1);
+
+ /* first dptr pointer must be at least DWORD aligned*/
+ if (page_offset & 0x3)
+ return -EINVAL;
+
+ it->dptr = dptr;
+ it->next = nvme_mdev_udata_iter_next_dptr;
+ it->count = DIV_ROUND_UP_ULL(size + page_offset, PAGE_SIZE);
+
+ ret = nvme_mdev_viommu_translate(it->viommu, iova,
+ &it->physical, &it->host_iova);
+ if (ret)
+ return ret;
+
+ it->physical += page_offset;
+ it->host_iova += page_offset;
+ return 0;
+}
+
+/* ->next function when iterator points to kernel memory buffer */
+static int nvme_mdev_kdata_iter_next(struct nvme_ext_data_iter *it)
+{
+ if (WARN_ON(it->count <= 0))
+ return 0;
+
+ if (--it->count == 0)
+ return 0;
+
+ it->physical = PAGE_ADDRESS(it->physical) + PAGE_SIZE;
+ it->host_iova = PAGE_ADDRESS(it->host_iova) + PAGE_SIZE;
+ return 0;
+}
+
+/* ->release function for kdata iterator to free it after use */
+static void nvme_mdev_kdata_iter_free(struct nvme_ext_data_iter *it)
+{
+ struct device *dma_dev = it->viommu->hw_dev;
+
+ if (dma_dev)
+ dma_free_coherent(dma_dev, it->kmem.size,
+ it->kmem.data, it->kmem.dma_addr);
+ else
+ kfree(it->kmem.data);
+ kfree(it);
+}
+
+/* allocate a kernel data buffer with read iterator for nvme host device */
+struct nvme_ext_data_iter *
+nvme_mdev_kdata_iter_alloc(struct nvme_mdev_viommu *viommu, unsigned int size)
+{
+ struct nvme_ext_data_iter *it;
+
+ it = kzalloc(sizeof(*it), GFP_KERNEL);
+ if (!it)
+ return NULL;
+
+ it->viommu = viommu;
+ it->kmem.size = size;
+ if (viommu->hw_dev) {
+ it->kmem.data = dma_alloc_coherent(viommu->hw_dev, size,
+ &it->kmem.dma_addr,
+ GFP_KERNEL);
+ } else {
+ it->kmem.data = kzalloc(size, GFP_KERNEL);
+ it->kmem.dma_addr = 0;
+ }
+
+ if (!it->kmem.data) {
+ kfree(it);
+ return NULL;
+ }
+
+ it->physical = virt_to_phys(it->kmem.data);
+ it->host_iova = it->kmem.dma_addr;
+
+ it->count = DIV_ROUND_UP(size + OFFSET_IN_PAGE(it->physical),
+ PAGE_SIZE);
+
+ it->next = nvme_mdev_kdata_iter_next;
+ it->release = nvme_mdev_kdata_iter_free;
+ return it;
+}
+
+/* copy data from user data iterator to a kernel buffer */
+int nvme_mdev_read_from_udata(void *dst, struct nvme_ext_data_iter *srcit,
+ u64 size)
+{
+ int ret;
+ unsigned int srcoffset, chunk_size;
+
+ while (srcit->count && size > 0) {
+ struct page *page = pfn_to_page(PHYS_PFN(srcit->physical));
+ void *src = kmap(page);
+
+ if (!src)
+ return -ENOMEM;
+
+ srcoffset = OFFSET_IN_PAGE(srcit->physical);
+ chunk_size = min(size, (u64)PAGE_SIZE - srcoffset);
+
+ memcpy(dst, src + srcoffset, chunk_size);
+ dst += chunk_size;
+ size -= chunk_size;
+ kunmap(page);
+
+ ret = srcit->next(srcit);
+ if (ret)
+ return ret;
+ }
+ WARN_ON(size > 0);
+ return 0;
+}
+
+/* copy data from kernel buffer to user data iterator */
+int nvme_mdev_write_to_udata(struct nvme_ext_data_iter *dstit, void *src,
+ u64 size)
+{
+ int ret, dstoffset, chunk_size;
+
+ while (dstit->count && size > 0) {
+ struct page *page = pfn_to_page(PHYS_PFN(dstit->physical));
+ void *dst = kmap(page);
+
+ if (!dst)
+ return -ENOMEM;
+
+ dstoffset = OFFSET_IN_PAGE(dstit->physical);
+ chunk_size = min(size, (u64)PAGE_SIZE - dstoffset);
+
+ memcpy(dst + dstoffset, src, chunk_size);
+ src += chunk_size;
+ size -= chunk_size;
+ kunmap(page);
+
+ ret = dstit->next(dstit);
+ if (ret)
+ return ret;
+ }
+ WARN_ON(size > 0);
+ return 0;
+}
+
+/* Set prp list iterator to point to prp list found in create queue command */
+static int
+nvme_mdev_udata_iter_set_queue_prplist(struct nvme_mdev_viommu *viommu,
+ struct nvme_ext_data_iter *iter,
+ dma_addr_t iova, unsigned int size)
+{
+ if (iova & ~PAGE_MASK)
+ return -EINVAL;
+
+ nvme_mdev_udata_iter_setup(viommu, iter);
+ iter->count = DIV_ROUND_UP(size, PAGE_SIZE);
+ iter->next = nvme_mdev_udata_iter_next_prplist;
+ return nvme_mdev_udata_iter_load_prplist(iter, iova);
+}
+
+/* Map an SQ/CQ queue (contiguous in guest physical memory) */
+static int nvme_mdev_queue_getpages_contiguous(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ struct page **pages,
+ unsigned int npages)
+{
+ int ret;
+ unsigned int i;
+
+ dma_addr_t host_page_iova;
+ phys_addr_t physical;
+
+ for (i = 0 ; i < npages; i++) {
+ ret = nvme_mdev_viommu_translate(viommu, iova + (PAGE_SIZE * i),
+ &physical,
+ &host_page_iova);
+ if (ret)
+ return ret;
+ pages[i] = pfn_to_page(PHYS_PFN(physical));
+ }
+ return 0;
+}
+
+/* Map an SQ/CQ queue (non contiguous in guest physical memory) */
+static int nvme_mdev_queue_getpages_prplist(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ struct page **pages,
+ unsigned int npages)
+{
+ int ret, i = 0;
+ struct nvme_ext_data_iter uprpit;
+
+ ret = nvme_mdev_udata_iter_set_queue_prplist(viommu,
+ &uprpit, iova,
+ npages * PAGE_SIZE);
+ if (ret)
+ return ret;
+
+ while (uprpit.count && i < npages) {
+ pages[i++] = pfn_to_page(PHYS_PFN(uprpit.physical));
+ ret = uprpit.next(&uprpit);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+/* map a SQ/CQ queue to host physical memory */
+void *nvme_mdev_udata_queue_vmap(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ unsigned int size,
+ bool cont)
+{
+ int ret;
+ unsigned int npages;
+ void *map;
+ struct page **pages;
+
+ // queue must be page aligned
+ if (OFFSET_IN_PAGE(iova) != 0)
+ return ERR_PTR(-EINVAL);
+
+ npages = DIV_ROUND_UP(size, PAGE_SIZE);
+ pages = kcalloc(npages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+
+ ret = cont ?
+ nvme_mdev_queue_getpages_contiguous(viommu, iova, pages, npages)
+ : nvme_mdev_queue_getpages_prplist(viommu, iova, pages, npages);
+
+ if (ret) {
+ map = ERR_PTR(ret);
+ goto out;
+ }
+
+ map = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+ if (!map)
+ map = ERR_PTR(-ENOMEM);
+out:
+ kfree(pages);
+ return map;
+}
diff --git a/drivers/nvme/mdev/vcq.c b/drivers/nvme/mdev/vcq.c
new file mode 100644
index 000000000000..7702137eb8bc
--- /dev/null
+++ b/drivers/nvme/mdev/vcq.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Virtual NVMe completion queue implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include "priv.h"
+
+/* Create new virtual completion queue */
+int nvme_mdev_vcq_init(struct nvme_mdev_vctrl *vctrl, u16 qid,
+ dma_addr_t iova, bool cont, u16 size, int irq)
+{
+ struct nvme_vcq *q = &vctrl->vcqs[qid];
+ int ret;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ q->iova = iova;
+ q->cont = cont;
+ q->data = NULL;
+ q->qid = qid;
+ q->size = size;
+ q->tail = 0;
+ q->phase = true;
+ q->irq = irq;
+ q->pending = 0;
+ q->head = 0;
+
+ ret = nvme_mdev_vcq_viommu_update(&vctrl->viommu, q);
+ if (ret && (ret != -EFAULT))
+ return ret;
+
+ _DBG(vctrl, "VCQ: create qid=%d contig=%d depth=%d irq=%d\n",
+ qid, cont, size, irq);
+
+ set_bit(qid, vctrl->vcq_en);
+
+ vctrl->mmio.dbs[q->qid].cqh = 0;
+ vctrl->mmio.eidxs[q->qid].cqh = 0;
+ return 0;
+}
+
+/* Update the kernel mapping of the queue */
+int nvme_mdev_vcq_viommu_update(struct nvme_mdev_viommu *viommu,
+ struct nvme_vcq *q)
+{
+ void *data;
+
+ if (q->data)
+ vunmap((void *)q->data);
+
+ data = nvme_mdev_udata_queue_vmap(viommu, q->iova,
+ (unsigned int)q->size *
+ sizeof(struct nvme_completion),
+ q->cont);
+
+ q->data = IS_ERR(data) ? NULL : data;
+ return IS_ERR(data) ? PTR_ERR(data) : 0;
+}
+
+/* Delete a virtual completion queue */
+void nvme_mdev_vcq_delete(struct nvme_mdev_vctrl *vctrl, u16 qid)
+{
+ struct nvme_vcq *q = &vctrl->vcqs[qid];
+
+ lockdep_assert_held(&vctrl->lock);
+
+ if (q->data)
+ vunmap((void *)q->data);
+ q->data = NULL;
+
+ _DBG(vctrl, "VCQ: delete qid=%d\n", q->qid);
+ clear_bit(qid, vctrl->vcq_en);
+}
+
+/* Move queue tail one item forward */
+static void nvme_mdev_vcq_advance_tail(struct nvme_vcq *q)
+{
+ if (++q->tail == q->size) {
+ q->tail = 0;
+ q->phase = !q->phase;
+ }
+}
+
+/* Move queue head one item forward */
+static void nvme_mdev_vcq_advance_head(struct nvme_vcq *q)
+{
+ q->head++;
+ if (q->head == q->size)
+ q->head = 0;
+}
+
+/* Process a virtual completion queue*/
+void nvme_mdev_vcq_process(struct nvme_mdev_vctrl *vctrl, u16 qid,
+ bool trigger_irqs)
+{
+ struct nvme_vcq *q = &vctrl->vcqs[qid];
+ u16 new_head;
+ u32 eidx;
+
+ if (!vctrl->mmio.dbs || !vctrl->mmio.eidxs)
+ return;
+
+ new_head = le32_to_cpu(vctrl->mmio.dbs[qid].cqh);
+
+ if (new_head != q->head) {
+ /* bad tail - can't process*/
+ if (!nvme_mdev_mmio_db_check(vctrl, q->qid, q->size, new_head))
+ return;
+
+ while (q->head != new_head) {
+ nvme_mdev_vcq_advance_head(q);
+ WARN_ON_ONCE(q->pending == 0);
+ if (q->pending > 0)
+ q->pending--;
+ }
+
+ eidx = q->head + (q->size >> 1);
+ if (eidx >= q->size)
+ eidx -= q->size;
+ vctrl->mmio.eidxs[q->qid].cqh = cpu_to_le32(eidx);
+ }
+
+ if (q->irq != -1 && trigger_irqs) {
+ if (q->tail != new_head)
+ nvme_mdev_irq_cond_trigger(vctrl, q->irq);
+ else
+ nvme_mdev_irq_clear(vctrl, q->irq);
+ }
+}
+
+/* flush interrupts on a completion queue */
+bool nvme_mdev_vcq_flush(struct nvme_mdev_vctrl *vctrl, u16 qid)
+{
+ struct nvme_vcq *q = &vctrl->vcqs[qid];
+ u16 new_head = le32_to_cpu(vctrl->mmio.dbs[qid].cqh);
+
+ if (new_head == q->tail || q->irq == -1)
+ return false;
+
+ nvme_mdev_irq_trigger(vctrl, q->irq);
+ nvme_mdev_irq_clear(vctrl, q->irq);
+ return true;
+}
+
+/* Reserve space for one completion entry, that will be added later */
+bool nvme_mdev_vcq_reserve_space(struct nvme_vcq *q)
+{
+ /* TODOLATER: track passed through commmands
+ * If we pass through a command to host and never receive a response
+ * we will keep space for response in CQ forever, eventually stalling
+ * the CQ forever.
+ * In this case, the guest is still expected to recover by resetting
+ * our controller
+ * This can be fixed by tracking all the commands that we send
+ * to the host
+ */
+
+ if (q->pending == q->size - 1)
+ return false;
+ q->pending++;
+ return true;
+}
+
+/* Write a new item into the completion queue (IO version) */
+void nvme_mdev_vcq_write_io(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vcq *q, u16 sq_head,
+ u16 sqid, u16 cid, u16 status)
+{
+ volatile u64 *qw = (__le64 *)(&q->data[q->tail]);
+
+ u64 phase = q->phase ? (0x1ULL << 48) : 0;
+ u64 qw1 =
+ ((u64)sq_head) |
+ ((u64)sqid << 16) |
+ ((u64)cid << 32) |
+ ((u64)status << 49) | phase;
+
+ WRITE_ONCE(qw[1], cpu_to_le64(qw1));
+
+ nvme_mdev_vcq_advance_tail(q);
+ if (q->irq != -1)
+ nvme_mdev_irq_raise(vctrl, q->irq);
+}
+
+/* Write a new item into the completion queue (ADMIN version) */
+void nvme_mdev_vcq_write_adm(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vcq *q, u32 dw0,
+ u16 sq_head, u16 cid, u16 status)
+{
+ volatile u64 *qw = (__le64 *)(&q->data[q->tail]);
+
+ u64 phase = q->phase ? (0x1ULL << 48) : 0;
+ u64 qw1 =
+ ((u64)sq_head) |
+ ((u64)cid << 32) |
+ ((u64)status << 49) | phase;
+
+ WRITE_ONCE(qw[0], cpu_to_le64(dw0));
+ /* ensure that hardware sees the phase bit flip last */
+ wmb();
+ WRITE_ONCE(qw[1], cpu_to_le64(qw1));
+
+ nvme_mdev_vcq_advance_tail(q);
+ if (q->irq != -1)
+ nvme_mdev_irq_trigger(vctrl, q->irq);
+}
diff --git a/drivers/nvme/mdev/vctrl.c b/drivers/nvme/mdev/vctrl.c
new file mode 100644
index 000000000000..87bc7c435c0c
--- /dev/null
+++ b/drivers/nvme/mdev/vctrl.c
@@ -0,0 +1,515 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Virtual NVMe controller implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/kernel.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mdev.h>
+#include <linux/nvme.h>
+#include "priv.h"
+
+bool nvme_mdev_vctrl_is_dead(struct nvme_mdev_vctrl *vctrl)
+{
+ return (vctrl->mmio.csts & (NVME_CSTS_CFS | NVME_CSTS_SHST_MASK)) != 0;
+}
+
+/* Setup the controller guid and serial */
+static void nvme_mdev_vctrl_init_id(struct nvme_mdev_vctrl *vctrl)
+{
+ const guid_t* guid = mdev_uuid(vctrl->mdev);
+
+ snprintf(vctrl->subnqn, sizeof(vctrl->subnqn),
+ "nqn.2014-08.org.nvmexpress:uuid:%pUl", guid->b);
+
+ snprintf(vctrl->serial, sizeof(vctrl->serial), "%pUl", guid->b);
+}
+
+/* Change the IO thread CPU pinning */
+void nvme_mdev_vctrl_bind_iothread(struct nvme_mdev_vctrl *vctrl,
+ unsigned int cpu)
+{
+ mutex_lock(&vctrl->lock);
+
+ if (cpu == vctrl->iothread_cpu)
+ goto out;
+
+ nvme_mdev_io_free(vctrl);
+ nvme_mdev_io_create(vctrl, cpu);
+out:
+ mutex_unlock(&vctrl->lock);
+}
+
+/* Change the status of support for shadow doorbell */
+int nvme_mdev_vctrl_set_shadow_doorbell_supported(struct nvme_mdev_vctrl *vctrl,
+ bool enable)
+{
+ if (vctrl->inuse)
+ return -EBUSY;
+ vctrl->mmio.shadow_db_supported = enable;
+ return 0;
+}
+
+/* Called when memory mapping are changed. Propagate this to all kmap users */
+static void nvme_mdev_vctrl_viommu_update(struct nvme_mdev_vctrl *vctrl)
+{
+ u16 qid;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ if (!(vctrl->mmio.csts & NVME_CSTS_RDY))
+ return;
+
+ /* update mappings for submission and completion queues */
+ for_each_set_bit(qid, vctrl->vsq_en, MAX_VIRTUAL_QUEUES)
+ nvme_mdev_vsq_viommu_update(&vctrl->viommu, &vctrl->vsqs[qid]);
+
+ for_each_set_bit(qid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
+ nvme_mdev_vcq_viommu_update(&vctrl->viommu, &vctrl->vcqs[qid]);
+
+ /* update mapping for the shadow doorbells */
+ nvme_mdev_mmio_viommu_update(vctrl);
+}
+
+/* Create a new virtual controller */
+struct nvme_mdev_vctrl *nvme_mdev_vctrl_create(struct mdev_device *mdev,
+ struct nvme_mdev_hctrl *hctrl,
+ unsigned int max_host_queues)
+{
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = kzalloc_node(sizeof(*vctrl),
+ GFP_KERNEL, hctrl->node);
+ if (!vctrl)
+ return ERR_PTR(-ENOMEM);
+
+ /* Basic init */
+ vctrl->hctrl = hctrl;
+ vctrl->mdev = mdev;
+ vctrl->max_host_hw_queues = max_host_queues;
+ vctrl->viommu.vctrl = vctrl;
+
+ kref_init(&vctrl->ref);
+ mutex_init(&vctrl->lock);
+ nvme_mdev_vctrl_init_id(vctrl);
+ INIT_LIST_HEAD(&vctrl->host_hw_queues);
+
+ get_device(mdev_dev(mdev));
+ mdev_set_drvdata(mdev, vctrl);
+
+ /* reserve host IO queues */
+ if (!nvme_mdev_hctrl_hqs_reserve(hctrl, max_host_queues)) {
+ ret = -ENOSPC;
+ goto error1;
+ }
+
+ /* default feature values*/
+ vctrl->arb_burst_shift = 3;
+ vctrl->mmio.shadow_db_supported = use_shadow_doorbell;
+
+ ret = nvme_mdev_pci_create(vctrl);
+ if (ret)
+ goto error2;
+
+ ret = nvme_mdev_mmio_create(vctrl);
+ if (ret)
+ goto error3;
+
+ nvme_mdev_irqs_setup(vctrl);
+
+ /* Create the IO thread */
+ /*TODOLATER: IO: smp_processor_id() is not an ideal pinning choice */
+ ret = nvme_mdev_io_create(vctrl, smp_processor_id());
+ if (ret)
+ goto error4;
+
+ _INFO(vctrl, "device created using %d host queues\n", max_host_queues);
+ return vctrl;
+error4:
+ nvme_mdev_mmio_free(vctrl);
+error3:
+ nvme_mdev_pci_free(vctrl);
+error2:
+ nvme_mdev_hctrl_hqs_unreserve(hctrl, max_host_queues);
+error1:
+ put_device(mdev_dev(mdev));
+ kfree(vctrl);
+ return ERR_PTR(ret);
+}
+
+/*Try to destroy an vctrl */
+int nvme_mdev_vctrl_destroy(struct nvme_mdev_vctrl *vctrl)
+{
+ mutex_lock(&vctrl->lock);
+
+ if (vctrl->inuse) {
+ /* vctrl has mdev users */
+ mutex_unlock(&vctrl->lock);
+ return -EBUSY;
+ }
+
+ _INFO(vctrl, "device is destroying\n");
+
+ mdev_set_drvdata(vctrl->mdev, NULL);
+ mutex_unlock(&vctrl->lock);
+
+ mutex_lock(&nvme_mdev_vctrl_list_mutex);
+ list_del_init(&vctrl->link);
+ mutex_unlock(&nvme_mdev_vctrl_list_mutex);
+
+ mutex_lock(&vctrl->lock); /*only for lockdep checks */
+ nvme_mdev_io_free(vctrl);
+ nvme_mdev_vns_destroy_all(vctrl);
+ __nvme_mdev_vctrl_reset(vctrl, true);
+
+ nvme_mdev_hctrl_hqs_unreserve(vctrl->hctrl, vctrl->max_host_hw_queues);
+
+ nvme_mdev_pci_free(vctrl);
+ nvme_mdev_mmio_free(vctrl);
+
+ mutex_unlock(&vctrl->lock);
+
+ put_device(mdev_dev(vctrl->mdev));
+ _INFO(vctrl, "device is destroyed\n");
+ kfree(vctrl);
+ return 0;
+}
+
+/* Suspends a running virtual controller
+ * Called when host needs to regain full control of the device
+ */
+void nvme_mdev_vctrl_pause(struct nvme_mdev_vctrl *vctrl)
+{
+ mutex_lock(&vctrl->lock);
+ if (!vctrl->vctrl_paused) {
+ _INFO(vctrl, "pausing the virtual controller\n");
+ if (vctrl->mmio.csts & NVME_CSTS_RDY)
+ nvme_mdev_io_pause(vctrl);
+ vctrl->vctrl_paused = true;
+ }
+ mutex_unlock(&vctrl->lock);
+}
+
+/* Resumes a virtual controller
+ * Called when host done with exclusive access and allows us
+ * again to attach to the controller
+ */
+void nvme_mdev_vctrl_resume(struct nvme_mdev_vctrl *vctrl)
+{
+ mutex_lock(&vctrl->lock);
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ if (vctrl->vctrl_paused) {
+ _INFO(vctrl, "resuming the virtual controller\n");
+
+ if (vctrl->mmio.csts & NVME_CSTS_RDY) {
+ /* handle all pending admin commands*/
+ nvme_mdev_adm_process_sq(vctrl);
+ /* start the IO thread again if it was stopped or
+ * if we had doorbell writes during the pause
+ */
+ nvme_mdev_io_resume(vctrl);
+ }
+ vctrl->vctrl_paused = false;
+ }
+ mutex_unlock(&vctrl->lock);
+}
+
+/* Called when emulator opens the virtual device */
+int nvme_mdev_vctrl_open(struct nvme_mdev_vctrl *vctrl)
+{
+ struct device *dma_dev = NULL;
+ int ret = 0;
+
+ mutex_lock(&vctrl->lock);
+
+ if (vctrl->hctrl->removing) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ if (vctrl->inuse) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ _INFO(vctrl, "device is opened\n");
+
+ if (vctrl->hctrl->nvme_ctrl->ops->flags & NVME_F_MDEV_DMA_SUPPORTED)
+ dma_dev = vctrl->hctrl->nvme_ctrl->dev;
+
+ nvme_mdev_viommu_init(&vctrl->viommu, mdev_dev(vctrl->mdev), dma_dev);
+
+ nvme_mdev_mmio_open(vctrl);
+ vctrl->inuse = true;
+out:
+ mutex_unlock(&vctrl->lock);
+ return ret;
+}
+
+/* Called when emulator closes the virtual device */
+void nvme_mdev_vctrl_release(struct nvme_mdev_vctrl *vctrl)
+{
+ mutex_lock(&vctrl->lock);
+ nvme_mdev_io_pause(vctrl);
+
+ /* Remove the guest DMA mappings - new user that will open the
+ * device might be a different guest
+ */
+ nvme_mdev_viommu_reset(&vctrl->viommu);
+
+ /* Reset the controller to a clean state for a new user */
+ __nvme_mdev_vctrl_reset(vctrl, false);
+ nvme_mdev_irqs_reset(vctrl);
+ vctrl->inuse = false;
+ mutex_unlock(&vctrl->lock);
+
+ WARN_ON(!list_empty(&vctrl->host_hw_queues));
+
+ _INFO(vctrl, "device is released\n");
+
+ /* If we are released after request to remove the host controller
+ * we are dead, won't be opened again ever, so remove ourselves
+ */
+ if (vctrl->hctrl->removing)
+ nvme_mdev_vctrl_destroy(vctrl);
+}
+
+/* Called each time the controller is reset (CC.EN <= 0 or VM level reset) */
+void __nvme_mdev_vctrl_reset(struct nvme_mdev_vctrl *vctrl, bool pci_reset)
+{
+ lockdep_assert_held(&vctrl->lock);
+
+ if ((vctrl->mmio.csts & NVME_CSTS_RDY) &&
+ !(vctrl->mmio.csts & NVME_CSTS_SHST_MASK)) {
+ _DBG(vctrl, "unsafe reset (CSTS.RDY==1)\n");
+ nvme_mdev_io_pause(vctrl);
+ nvme_mdev_vctrl_disable(vctrl);
+ }
+ nvme_mdev_mmio_reset(vctrl, pci_reset);
+}
+
+/* setups initial admin queues and doorbells */
+bool nvme_mdev_vctrl_enable(struct nvme_mdev_vctrl *vctrl,
+ dma_addr_t cqiova, dma_addr_t sqiova, u32 sizes)
+{
+ int ret;
+ u16 cqentries, sqentries;
+
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ lockdep_assert_held(&vctrl->lock);
+
+ sqentries = (sizes & 0xFFFF) + 1;
+ cqentries = (sizes >> 16) + 1;
+
+ if (cqentries > 4096 || cqentries < 2)
+ return false;
+ if (sqentries > 4096 || sqentries < 2)
+ return false;
+
+ ret = nvme_mdev_mmio_enable_dbs(vctrl);
+ if (ret)
+ goto error0;
+
+ ret = nvme_mdev_vcq_init(vctrl, 0, cqiova, true, cqentries, 0);
+ if (ret)
+ goto error1;
+
+ ret = nvme_mdev_vsq_init(vctrl, 0, sqiova, true, sqentries, 0);
+ if (ret)
+ goto error2;
+
+ nvme_mdev_events_init(vctrl);
+
+ if (!vctrl->mmio.shadow_db_supported) {
+ /* start polling right away to support admin queue */
+ vctrl->io_idle = false;
+ nvme_mdev_io_resume(vctrl);
+ }
+
+ return true;
+error2:
+ nvme_mdev_mmio_disable_dbs(vctrl);
+error1:
+ nvme_mdev_vcq_delete(vctrl, 0);
+error0:
+ return false;
+}
+
+/* destroy all io/admin queues on the controller */
+void nvme_mdev_vctrl_disable(struct nvme_mdev_vctrl *vctrl)
+{
+ u16 sqid, cqid;
+
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ lockdep_assert_held(&vctrl->lock);
+
+ nvme_mdev_events_reset(vctrl);
+ nvme_mdev_vns_log_reset(vctrl);
+
+ sqid = 1;
+ for_each_set_bit_from(sqid, vctrl->vsq_en, MAX_VIRTUAL_QUEUES)
+ nvme_mdev_vsq_delete(vctrl, sqid);
+
+ cqid = 1;
+ for_each_set_bit_from(cqid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
+ nvme_mdev_vcq_delete(vctrl, cqid);
+
+ nvme_mdev_vsq_delete(vctrl, 0);
+ nvme_mdev_vcq_delete(vctrl, 0);
+
+ nvme_mdev_mmio_disable_dbs(vctrl);
+ vctrl->io_idle = true;
+}
+
+/* External reset */
+void nvme_mdev_vctrl_reset(struct nvme_mdev_vctrl *vctrl)
+{
+ mutex_lock(&vctrl->lock);
+ _INFO(vctrl, "reset\n");
+ __nvme_mdev_vctrl_reset(vctrl, true);
+ mutex_unlock(&vctrl->lock);
+}
+
+/* Add IO region*/
+void nvme_mdev_vctrl_add_region(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index, unsigned int size,
+ region_access_fn access_fn)
+{
+ struct nvme_mdev_io_region *region = &vctrl->regions[index];
+
+ region->size = size;
+ region->rw = access_fn;
+ region->mmap_ops = NULL;
+}
+
+/* Enable mmap window on an IO region */
+void nvme_mdev_vctrl_region_set_mmap(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index,
+ unsigned int offset,
+ unsigned int size,
+ const struct vm_operations_struct *ops)
+{
+ struct nvme_mdev_io_region *region = &vctrl->regions[index];
+
+ region->mmap_area_start = offset;
+ region->mmap_area_size = size;
+ region->mmap_ops = ops;
+}
+
+/* Disable mmap window on an IO region */
+void nvme_mdev_vctrl_region_disable_mmap(struct nvme_mdev_vctrl *vctrl,
+ unsigned int index)
+{
+ struct nvme_mdev_io_region *region = &vctrl->regions[index];
+
+ region->mmap_area_start = 0;
+ region->mmap_area_size = 0;
+ region->mmap_ops = NULL;
+}
+
+/* Allocate a host IO queue */
+int nvme_mdev_vctrl_hq_alloc(struct nvme_mdev_vctrl *vctrl)
+{
+ struct nvme_mdev_hq *hq = NULL, *tmp;
+ int hwqcount = 0, ret;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ list_for_each_entry(tmp, &vctrl->host_hw_queues, link) {
+ if (!hq || tmp->usecount < hq->usecount)
+ hq = tmp;
+ hwqcount++;
+ }
+
+ if (hwqcount < vctrl->max_host_hw_queues) {
+ ret = nvme_mdev_hctrl_hq_alloc(vctrl->hctrl);
+ if (ret < 0)
+ return ret;
+
+ hq = kzalloc_node(sizeof(*hq), GFP_KERNEL, vctrl->hctrl->node);
+ if (!hq) {
+ nvme_mdev_hctrl_hq_free(vctrl->hctrl, ret);
+ return -ENOMEM;
+ }
+
+ hq->hqid = ret;
+ hq->usecount = 1;
+ list_add_tail(&hq->link, &vctrl->host_hw_queues);
+ } else {
+ hq->usecount++;
+ }
+ return hq->hqid;
+}
+
+/* Free a host IO queue */
+void nvme_mdev_vctrl_hq_free(struct nvme_mdev_vctrl *vctrl, u16 hqid)
+{
+ struct nvme_mdev_hq *hq;
+
+ lockdep_assert_held(&vctrl->lock);
+ nvme_mdev_assert_io_not_running(vctrl);
+
+ list_for_each_entry(hq, &vctrl->host_hw_queues, link)
+ if (hq->hqid == hqid) {
+ if (--hq->usecount > 0)
+ return;
+ nvme_mdev_hctrl_hq_free(vctrl->hctrl, hq->hqid);
+ list_del(&hq->link);
+ kfree(hq);
+ return;
+ }
+ WARN_ON(1);
+}
+
+/* get current list of host queues */
+unsigned int nvme_mdev_vctrl_hqs_list(struct nvme_mdev_vctrl *vctrl, u16 *out)
+{
+ struct nvme_mdev_hq *q;
+ unsigned int i = 0;
+
+ list_for_each_entry(q, &vctrl->host_hw_queues, link) {
+ out[i++] = q->hqid;
+ if (WARN_ON(i > MAX_HOST_QUEUES))
+ break;
+ }
+ return i;
+}
+
+/* add a user memory mapping */
+int nvme_mdev_vctrl_viommu_map(struct nvme_mdev_vctrl *vctrl, u32 flags,
+ dma_addr_t iova, u64 size)
+{
+ int ret;
+
+ mutex_lock(&vctrl->lock);
+
+ nvme_mdev_io_pause(vctrl);
+ ret = nvme_mdev_viommu_add(&vctrl->viommu, flags, iova, size);
+ nvme_mdev_vctrl_viommu_update(vctrl);
+ nvme_mdev_io_resume(vctrl);
+
+ mutex_unlock(&vctrl->lock);
+ return ret;
+}
+
+/* remove a user memory mapping */
+int nvme_mdev_vctrl_viommu_unmap(struct nvme_mdev_vctrl *vctrl,
+ dma_addr_t iova, u64 size)
+{
+ int ret;
+
+ mutex_lock(&vctrl->lock);
+
+ nvme_mdev_io_pause(vctrl);
+ ret = nvme_mdev_viommu_remove(&vctrl->viommu, iova, size);
+ nvme_mdev_vctrl_viommu_update(vctrl);
+ nvme_mdev_io_resume(vctrl);
+
+ mutex_unlock(&vctrl->lock);
+ return ret;
+}
diff --git a/drivers/nvme/mdev/viommu.c b/drivers/nvme/mdev/viommu.c
new file mode 100644
index 000000000000..31b86e8f5768
--- /dev/null
+++ b/drivers/nvme/mdev/viommu.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Virtual IOMMU - mapping user memory to the real device
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/mdev.h>
+#include <linux/vmalloc.h>
+#include <linux/nvme.h>
+#include <linux/iommu.h>
+#include <linux/interval_tree_generic.h>
+#include "priv.h"
+
+struct mem_mapping {
+ struct rb_node rb;
+ struct list_head link;
+
+ dma_addr_t __subtree_last;
+ dma_addr_t iova_start; /* first iova in this mapping*/
+ dma_addr_t iova_last; /* last iova in this mapping*/
+
+ unsigned long pfn; /* physical address of this mapping */
+ dma_addr_t host_iova; /* dma mapping to the real device*/
+};
+
+#define map_len(m) (((m)->iova_last - (m)->iova_start) + 1ULL)
+#define map_pages(m) (map_len(m) >> PAGE_SHIFT)
+#define START(node) ((node)->iova_start)
+#define LAST(node) ((node)->iova_last)
+
+INTERVAL_TREE_DEFINE(struct mem_mapping, rb, dma_addr_t, __subtree_last,
+ START, LAST, static inline, viommu_int_tree);
+
+static void nvme_mdev_viommu_dbg_dma_range(struct nvme_mdev_viommu *viommu,
+ struct mem_mapping *map,
+ const char *action)
+{
+ dma_addr_t iova_start = map->iova_start;
+ dma_addr_t iova_end = map->iova_start + map_len(map) - 1;
+ dma_addr_t hiova_start = map->host_iova;
+ dma_addr_t hiova_end = map->host_iova + map_len(map) - 1;
+
+ _DBG(viommu->vctrl,
+ "vIOMMU: %s RW IOVA %pad-%pad -> DMA %pad-%pad\n",
+ action, &iova_start, &iova_end, &hiova_start, &hiova_end);
+}
+
+/* unpin N pages starting at given IOVA*/
+static void nvme_mdev_viommu_unpin_pages(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova, int n)
+{
+ int i;
+
+ for (i = 0; i < n; i++) {
+ unsigned long user_pfn = (iova >> PAGE_SHIFT) + i;
+ int ret = vfio_unpin_pages(viommu->sw_dev, &user_pfn, 1);
+
+ WARN_ON(ret != 1);
+ }
+}
+
+/* User memory init code*/
+void nvme_mdev_viommu_init(struct nvme_mdev_viommu *viommu,
+ struct device *sw_dev,
+ struct device *hw_dev)
+{
+ viommu->sw_dev = sw_dev;
+ viommu->hw_dev = hw_dev;
+ viommu->maps_tree = RB_ROOT_CACHED;
+ INIT_LIST_HEAD(&viommu->maps_list);
+}
+
+/* User memory end code*/
+void nvme_mdev_viommu_reset(struct nvme_mdev_viommu *viommu)
+{
+ nvme_mdev_viommu_remove(viommu, 0, 0xFFFFFFFFFFFFFFFFULL);
+ WARN_ON(!list_empty(&viommu->maps_list));
+}
+
+/* Adds a new range of user memory*/
+int nvme_mdev_viommu_add(struct nvme_mdev_viommu *viommu,
+ u32 flags,
+ dma_addr_t iova,
+ u64 size)
+{
+ u64 offset;
+ dma_addr_t iova_end = iova + size - 1;
+ struct mem_mapping *map = NULL, *tmp;
+ LIST_HEAD(new_mappings_list);
+ int ret;
+
+ if (!(flags & VFIO_DMA_MAP_FLAG_READ) ||
+ !(flags & VFIO_DMA_MAP_FLAG_WRITE)) {
+ const char *type = "none";
+
+ if (flags & VFIO_DMA_MAP_FLAG_READ)
+ type = "RO";
+ else if (flags & VFIO_DMA_MAP_FLAG_WRITE)
+ type = "WO";
+
+ _DBG(viommu->vctrl, "vIOMMU: IGN %s IOVA %pad-%pad\n",
+ type, &iova, &iova_end);
+ return 0;
+ }
+
+ WARN_ON_ONCE(nvme_mdev_viommu_remove(viommu, iova, size) != 0);
+
+ if (WARN_ON_ONCE(size & ~PAGE_MASK))
+ return -EINVAL;
+
+ // VFIO pinning all the pages
+ for (offset = 0; offset < size; offset += PAGE_SIZE) {
+ unsigned long vapfn = ((iova + offset) >> PAGE_SHIFT), pa_pfn;
+
+ ret = vfio_pin_pages(viommu->sw_dev,
+ &vapfn, 1,
+ VFIO_DMA_MAP_FLAG_READ |
+ VFIO_DMA_MAP_FLAG_WRITE,
+ &pa_pfn);
+
+ if (ret != 1) {
+ /*sadly mdev api doesn't return an error*/
+ ret = -EFAULT;
+
+ _DBG(viommu->vctrl,
+ "vIOMMU: ADD RW IOVA %pad - pin failed\n",
+ &iova);
+ goto unwind;
+ }
+
+ // new mapping needed
+ if (!map || map->pfn + map_pages(map) != pa_pfn) {
+ int node = viommu->hw_dev ?
+ dev_to_node(viommu->hw_dev) : NUMA_NO_NODE;
+
+ map = kzalloc_node(sizeof(*map), GFP_KERNEL, node);
+
+ if (WARN_ON(!map)) {
+ vfio_unpin_pages(viommu->sw_dev, &vapfn, 1);
+ ret = -ENOMEM;
+ goto unwind;
+ }
+ map->iova_start = iova + offset;
+ map->iova_last = iova + offset + PAGE_SIZE - 1ULL;
+ map->pfn = pa_pfn;
+ map->host_iova = 0;
+ list_add_tail(&map->link, &new_mappings_list);
+ } else {
+ // current map can be extended
+ map->iova_last += PAGE_SIZE;
+ }
+ }
+
+ // DMA mapping the pages
+ list_for_each_entry_safe(map, tmp, &new_mappings_list, link) {
+ if (viommu->hw_dev) {
+ map->host_iova =
+ dma_map_page(viommu->hw_dev,
+ pfn_to_page(map->pfn),
+ 0,
+ map_len(map),
+ DMA_BIDIRECTIONAL);
+
+ ret = dma_mapping_error(viommu->hw_dev, map->host_iova);
+ if (ret) {
+ _DBG(viommu->vctrl,
+ "vIOMMU: ADD RW IOVA %pad-%pad - DMA map failed\n",
+ &iova, &iova_end);
+ goto unwind;
+ }
+ }
+
+ nvme_mdev_viommu_dbg_dma_range(viommu, map, "ADD");
+ list_del(&map->link);
+ list_add_tail(&map->link, &viommu->maps_list);
+ viommu_int_tree_insert(map, &viommu->maps_tree);
+ }
+ return 0;
+unwind:
+ list_for_each_entry_safe(map, tmp, &new_mappings_list, link) {
+ nvme_mdev_viommu_unpin_pages(viommu, map->iova_start,
+ map_pages(map));
+
+ list_del(&map->link);
+ kfree(map);
+ }
+ nvme_mdev_viommu_remove(viommu, iova, size);
+ return ret;
+}
+
+/* Removes a range of user memory*/
+int nvme_mdev_viommu_remove(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ u64 size)
+{
+ struct mem_mapping *map = NULL, *tmp;
+ dma_addr_t last_iova = iova + (size) - 1ULL;
+ LIST_HEAD(remove_list);
+ int count = 0;
+
+ /* find out all the relevant ranges */
+ map = viommu_int_tree_iter_first(&viommu->maps_tree, iova, last_iova);
+ while (map) {
+ list_del(&map->link);
+ list_add_tail(&map->link, &remove_list);
+ map = viommu_int_tree_iter_next(map, iova, last_iova);
+ }
+
+ /* remove them */
+ list_for_each_entry_safe(map, tmp, &remove_list, link) {
+ count++;
+
+ nvme_mdev_viommu_dbg_dma_range(viommu, map, "DEL");
+ if (viommu->hw_dev)
+ dma_unmap_page(viommu->hw_dev, map->host_iova,
+ map_len(map), DMA_BIDIRECTIONAL);
+
+ nvme_mdev_viommu_unpin_pages(viommu, map->iova_start,
+ map_pages(map));
+
+ viommu_int_tree_remove(map, &viommu->maps_tree);
+ kfree(map);
+ }
+ return count;
+}
+
+/* Translate an IOVA to a physical address and read device bus address */
+int nvme_mdev_viommu_translate(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova,
+ dma_addr_t *physical,
+ dma_addr_t *host_iova)
+{
+ struct mem_mapping *mapping;
+ u64 offset;
+
+ if (WARN_ON_ONCE(OFFSET_IN_PAGE(iova) != 0))
+ return -EINVAL;
+
+ mapping = viommu_int_tree_iter_first(&viommu->maps_tree,
+ iova, iova + PAGE_SIZE - 1);
+ if (!mapping) {
+ _DBG(viommu->vctrl,
+ "vIOMMU: translation of IOVA %pad failed\n", &iova);
+ return -EFAULT;
+ }
+
+ WARN_ON(iova > mapping->iova_last);
+ WARN_ON(OFFSET_IN_PAGE(mapping->iova_start) != 0);
+
+ offset = iova - mapping->iova_start;
+ *physical = PFN_PHYS(mapping->pfn) + offset;
+ *host_iova = mapping->host_iova + offset;
+ return 0;
+}
+
+/* map an IOVA to kernel address space */
+int nvme_mdev_viommu_create_kmap(struct nvme_mdev_viommu *viommu,
+ dma_addr_t iova, struct page_map *page)
+{
+ dma_addr_t host_iova;
+ phys_addr_t physical;
+ struct page *new_page;
+ int ret;
+
+ page->iova = iova;
+
+ ret = nvme_mdev_viommu_translate(viommu, iova, &physical, &host_iova);
+ if (ret)
+ return ret;
+
+ new_page = pfn_to_page(PHYS_PFN(physical));
+
+ page->kmap = kmap(new_page);
+ if (!page->kmap)
+ return -ENOMEM;
+
+ page->page = new_page;
+ return 0;
+}
+
+/* update IOVA <-> kernel mapping. If fails, removes the previous mapping */
+void nvme_mdev_viommu_update_kmap(struct nvme_mdev_viommu *viommu,
+ struct page_map *page)
+{
+ dma_addr_t host_iova;
+ phys_addr_t physical;
+ struct page *new_page;
+ int ret;
+
+ ret = nvme_mdev_viommu_translate(viommu, page->iova,
+ &physical, &host_iova);
+ if (ret) {
+ nvme_mdev_viommu_free_kmap(viommu, page);
+ return;
+ }
+
+ new_page = pfn_to_page(PHYS_PFN(physical));
+ if (new_page == page->page)
+ return;
+
+ nvme_mdev_viommu_free_kmap(viommu, page);
+
+ page->kmap = kmap(new_page);
+ if (!page->kmap)
+ return;
+ page->page = new_page;
+}
+
+/* unmap an IOVA to kernel address space */
+void nvme_mdev_viommu_free_kmap(struct nvme_mdev_viommu *viommu,
+ struct page_map *page)
+{
+ if (page->page) {
+ kunmap(page->page);
+ page->page = NULL;
+ page->kmap = NULL;
+ }
+}
diff --git a/drivers/nvme/mdev/vns.c b/drivers/nvme/mdev/vns.c
new file mode 100644
index 000000000000..42d4f8d7423b
--- /dev/null
+++ b/drivers/nvme/mdev/vns.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Virtual NVMe namespace implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/nvme.h>
+#include "priv.h"
+
+/* Reset the changed namespace log */
+void nvme_mdev_vns_log_reset(struct nvme_mdev_vctrl *vctrl)
+{
+ vctrl->ns_log_size = 0;
+}
+
+/* This adds entry to NS changed log and sends to the user a notification */
+static void nvme_mdev_vns_send_event(struct nvme_mdev_vctrl *vctrl, u32 ns)
+{
+ unsigned int i;
+ unsigned int log_size = vctrl->ns_log_size;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ _INFO(vctrl, "host namespace list rescanned\n");
+
+ if (WARN_ON(ns == 0 || ns > MAX_VIRTUAL_NAMESPACES))
+ return;
+
+ // check if the namespace ID is alredy in the log
+ if (log_size == MAX_VIRTUAL_NAMESPACES)
+ return;
+
+ for (i = 0; i < log_size; i++)
+ if (vctrl->ns_log[i] == cpu_to_le32(ns))
+ return;
+
+ vctrl->ns_log[log_size++] = cpu_to_le32(ns);
+ vctrl->ns_log_size++;
+ nvme_mdev_event_send(vctrl, NVME_AER_TYPE_NOTICE,
+ NVME_AER_NOTICE_NS_CHANGED);
+}
+
+/* Read host NS/partition parameters to update our virtual NS */
+static void nvme_mdev_vns_read_host_properties(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_mdev_vns *vns,
+ struct nvme_ns *host_ns)
+{
+ unsigned int sector_to_lba_shift;
+ u64 host_ns_size, start, nr, align_mask;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ /* read the namespace block size */
+ vns->blksize_shift = host_ns->lba_shift;
+
+ if (WARN_ON(vns->blksize_shift < 9)) {
+ _WARN(vctrl, "NS/create: device block size is bad\n");
+ goto error;
+ }
+
+ sector_to_lba_shift = vns->blksize_shift - 9;
+ align_mask = (1ULL << sector_to_lba_shift) - 1;
+
+ /* read the partition start and size*/
+ start = get_start_sect(vns->host_part);
+ nr = part_nr_sects_read(vns->host_part->bd_part);
+
+ /* check that partition is aligned on LBA size*/
+ if (sector_to_lba_shift != 0) {
+ if ((start & align_mask) || (nr & align_mask)) {
+ _WARN(vctrl, "NS/create: partition not aligned\n");
+ goto error;
+ }
+ }
+
+ vns->host_lba_offset = start >> sector_to_lba_shift;
+ vns->ns_size = nr >> sector_to_lba_shift;
+ host_ns_size = get_capacity(host_ns->disk) >> sector_to_lba_shift;
+
+ /*TODOLATER: NS: support metadata on host namespace */
+ if (host_ns->ms) {
+ _WARN(vctrl, "NS/create: no support for namespace metadata\n");
+ goto error;
+ }
+
+ if (vns->ns_size == 0) {
+ _WARN(vctrl, "NS/create: host namespace has size 0\n");
+ goto error;
+ }
+
+ /* sanity check that partition doesn't extend beyond the namespace */
+ if (!check_range(vns->host_lba_offset, vns->ns_size, host_ns_size)) {
+ _WARN(vctrl, "NS/create: host namespace size mismatch\n");
+ goto error;
+ }
+
+ /* check if namespace is readonly*/
+ if (!vns->readonly)
+ vns->readonly = get_disk_ro(host_ns->disk);
+
+ vns->noiob = host_ns->noiob;
+ if (vns->noiob != 0) {
+ u64 tmp = vns->host_lba_offset;
+
+ if (do_div(tmp, vns->noiob)) {
+ _WARN(vctrl,
+ "NS/create: host partition is not aligned on host optimum IO boundary, performance might suffer");
+ vns->noiob = 0;
+ }
+ }
+ return;
+error:
+ vns->ns_size = 0;
+}
+
+/* Open new reference to a host namespace */
+int nvme_mdev_vns_open(struct nvme_mdev_vctrl *vctrl,
+ u32 host_nsid, unsigned int host_partid)
+{
+ struct nvme_mdev_vns *vns;
+ u32 user_nsid;
+ int ret;
+
+ _INFO(vctrl, "open host_namespace=%u, partition=%u\n",
+ host_nsid, host_partid);
+
+ mutex_lock(&vctrl->lock);
+ ret = -ENODEV;
+ if (nvme_mdev_vctrl_is_dead(vctrl))
+ goto out;
+
+ /* create the namespace object */
+ ret = -ENOMEM;
+ vns = kzalloc_node(sizeof(*vns), GFP_KERNEL, vctrl->hctrl->node);
+ if (!vns)
+ goto out;
+
+ uuid_gen(&vns->uuid); // TODOLATER: NS: non random NS UUID
+ vns->host_nsid = host_nsid;
+ vns->host_partid = host_partid;
+
+ /* find the host namespace */
+ vns->host_ns = nvme_find_get_ns(vctrl->hctrl->nvme_ctrl, host_nsid);
+ if (!vns->host_ns) {
+ ret = -ENODEV;
+ goto error1;
+ }
+
+ if (test_bit(NVME_NS_DEAD, &vns->host_ns->flags) ||
+ test_bit(NVME_NS_REMOVING, &vns->host_ns->flags) ||
+ !vns->host_ns->disk) {
+ ret = -ENODEV;
+ goto error2;
+ }
+
+ /* get the block device for the partition that we will use */
+ vns->host_part = bdget_disk(vns->host_ns->disk, host_partid);
+ if (!vns->host_part) {
+ ret = -ENODEV;
+ goto error2;
+ }
+
+ /* get exclusive access to the block device (partition) */
+ vns->fmode = FMODE_READ | FMODE_EXCL;
+ if (!vns->readonly)
+ vns->fmode |= FMODE_WRITE;
+
+ ret = blkdev_get(vns->host_part, vns->fmode, vns);
+ if (ret)
+ goto error2;
+
+ /* read properties of the host namespace */
+ nvme_mdev_vns_read_host_properties(vctrl, vns, vns->host_ns);
+
+ /* Allocate a user namespace ID for this namespace */
+ ret = -ENOSPC;
+ for (user_nsid = 1; user_nsid <= MAX_VIRTUAL_NAMESPACES; user_nsid++)
+ if (!nvme_mdev_vns_from_vnsid(vctrl, user_nsid))
+ break;
+
+ if (user_nsid > MAX_VIRTUAL_NAMESPACES)
+ goto error3;
+
+ nvme_mdev_io_pause(vctrl);
+
+ vctrl->namespaces[user_nsid - 1] = vns;
+ vns->nsid = user_nsid;
+
+ /* Announce the new namespace to the user */
+ nvme_mdev_vns_send_event(vctrl, user_nsid);
+ nvme_mdev_io_resume(vctrl);
+ ret = 0;
+ goto out;
+error3:
+ blkdev_put(vns->host_part, vns->fmode);
+error2:
+ nvme_put_ns(vns->host_ns);
+error1:
+ kfree(vns);
+out:
+ mutex_unlock(&vctrl->lock);
+ return ret;
+}
+
+/* Re-open new reference to a host namespace, after notification
+ * of change in the host namespace
+ */
+static bool nvme_mdev_vns_reopen(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_mdev_vns *vns)
+{
+ struct nvme_ns *host_ns;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ _INFO(vctrl, "reopen host namespace %u, partition=%u\n",
+ vns->host_nsid, vns->host_partid);
+
+ /* namespace disappeared on the host - invalid*/
+ host_ns = nvme_find_get_ns(vctrl->hctrl->nvme_ctrl, vns->host_nsid);
+ if (!host_ns)
+ return false;
+
+ /* different namespace with same ID on the host - invalid*/
+ if (vns->host_ns != host_ns)
+ goto error1;
+
+ // basic checks on the namespace
+ if (test_bit(NVME_NS_DEAD, &host_ns->flags) ||
+ test_bit(NVME_NS_REMOVING, &host_ns->flags) ||
+ !host_ns->disk)
+ goto error1;
+
+ /* read properties of the host namespace */
+ nvme_mdev_io_pause(vctrl);
+ nvme_mdev_vns_read_host_properties(vctrl, vns, host_ns);
+ nvme_mdev_io_resume(vctrl);
+
+ nvme_put_ns(host_ns);
+ return true;
+error1:
+ nvme_put_ns(host_ns);
+ return false;
+}
+
+/* Destroy a virtual namespace*/
+static int __nvme_mdev_vns_destroy(struct nvme_mdev_vctrl *vctrl, u32 user_nsid)
+{
+ struct nvme_mdev_vns *vns;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ vns = nvme_mdev_vns_from_vnsid(vctrl, user_nsid);
+ if (!vns)
+ return -ENODEV;
+
+ nvme_mdev_vns_send_event(vctrl, user_nsid);
+ nvme_mdev_io_pause(vctrl);
+
+ vctrl->namespaces[user_nsid - 1] = NULL;
+ blkdev_put(vns->host_part, vns->fmode);
+ nvme_put_ns(vns->host_ns);
+ kfree(vns);
+ nvme_mdev_io_resume(vctrl);
+ return 0;
+}
+
+/* Destroy a virtual namespace (external interface) */
+int nvme_mdev_vns_destroy(struct nvme_mdev_vctrl *vctrl, u32 user_nsid)
+{
+ int ret;
+
+ mutex_lock(&vctrl->lock);
+ nvme_mdev_io_pause(vctrl);
+ ret = __nvme_mdev_vns_destroy(vctrl, user_nsid);
+ nvme_mdev_io_resume(vctrl);
+ mutex_unlock(&vctrl->lock);
+
+ return ret;
+}
+
+/* Destroy all virtual namespaces */
+void nvme_mdev_vns_destroy_all(struct nvme_mdev_vctrl *vctrl)
+{
+ u32 user_nsid;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ for (user_nsid = 1 ; user_nsid <= MAX_VIRTUAL_NAMESPACES ; user_nsid++)
+ __nvme_mdev_vns_destroy(vctrl, user_nsid);
+}
+
+/* Get a virtual namespace */
+struct nvme_mdev_vns *nvme_mdev_vns_from_vnsid(struct nvme_mdev_vctrl *vctrl,
+ u32 user_ns_id)
+{
+ if (user_ns_id == 0 || user_ns_id > MAX_VIRTUAL_NAMESPACES)
+ return NULL;
+ return vctrl->namespaces[user_ns_id - 1];
+}
+
+/* Print description off all virtual namespaces */
+int nvme_mdev_vns_print_description(struct nvme_mdev_vctrl *vctrl,
+ char *buf, unsigned int size)
+{
+ int nsid, ret = 0;
+
+ mutex_lock(&vctrl->lock);
+
+ for (nsid = 1; nsid <= MAX_VIRTUAL_NAMESPACES; nsid++) {
+ int n;
+ struct nvme_mdev_vns *vns = nvme_mdev_vns_from_vnsid(vctrl,
+ nsid);
+ if (!vns)
+ continue;
+
+ else if (vns->host_partid == 0)
+ n = snprintf(buf, size, "VNS%d: nvme%dn%d\n",
+ nsid, vctrl->hctrl->id,
+ (int)vns->host_nsid);
+ else
+ n = snprintf(buf, size, "VNS%d: nvme%dn%dp%d\n",
+ nsid, vctrl->hctrl->id,
+ (int)vns->host_nsid,
+ (int)vns->host_partid);
+ if (n > size)
+ return -ENOMEM;
+ buf += n;
+ size -= n;
+ ret += n;
+ }
+ mutex_unlock(&vctrl->lock);
+ return ret;
+}
+
+/* Processes an update on the host namespace */
+void nvme_mdev_vns_host_ns_update(struct nvme_mdev_vctrl *vctrl,
+ u32 host_nsid, bool removed)
+{
+ int nsid;
+
+ mutex_lock(&vctrl->lock);
+
+ for (nsid = 1; nsid <= MAX_VIRTUAL_NAMESPACES; nsid++) {
+ struct nvme_mdev_vns *vns = nvme_mdev_vns_from_vnsid(vctrl,
+ nsid);
+ if (!vns || vns->host_nsid != host_nsid)
+ continue;
+
+ if (removed || !nvme_mdev_vns_reopen(vctrl, vns))
+ __nvme_mdev_vns_destroy(vctrl, nsid);
+ else
+ nvme_mdev_vns_send_event(vctrl, nsid);
+ }
+ mutex_unlock(&vctrl->lock);
+}
diff --git a/drivers/nvme/mdev/vsq.c b/drivers/nvme/mdev/vsq.c
new file mode 100644
index 000000000000..5b63081c144d
--- /dev/null
+++ b/drivers/nvme/mdev/vsq.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Virtual NVMe submission queue implementation
+ * Copyright (c) 2019 - Maxim Levitsky
+ */
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include "priv.h"
+
+/* Create new virtual completion queue */
+int nvme_mdev_vsq_init(struct nvme_mdev_vctrl *vctrl,
+ u16 qid, dma_addr_t iova, bool cont, u16 size, u16 cqid)
+{
+ struct nvme_vsq *q = &vctrl->vsqs[qid];
+ int ret;
+
+ lockdep_assert_held(&vctrl->lock);
+
+ q->iova = iova;
+ q->cont = cont;
+ q->qid = qid;
+ q->size = size;
+ q->head = 0;
+ q->vcq = &vctrl->vcqs[cqid];
+ q->data = NULL;
+ q->hsq = 0;
+
+ ret = nvme_mdev_vsq_viommu_update(&vctrl->viommu, q);
+ if (ret && (ret != -EFAULT))
+ return ret;
+
+ if (qid > 0) {
+ ret = nvme_mdev_vctrl_hq_alloc(vctrl);
+ if (ret < 0) {
+ vunmap(q->data);
+ return ret;
+ }
+ q->hsq = ret;
+ }
+
+ _DBG(vctrl, "VSQ: create qid=%d contig=%d, depth=%d cqid=%d\n",
+ qid, cont, size, cqid);
+
+ set_bit(qid, vctrl->vsq_en);
+
+ vctrl->mmio.dbs[q->qid].sqt = 0;
+ vctrl->mmio.eidxs[q->qid].sqt = 0;
+
+ return 0;
+}
+
+/* Update the kernel mapping of the queue */
+int nvme_mdev_vsq_viommu_update(struct nvme_mdev_viommu *viommu,
+ struct nvme_vsq *q)
+{
+ void *data;
+
+ if (q->data)
+ vunmap((void *)q->data);
+
+ data = nvme_mdev_udata_queue_vmap(viommu, q->iova,
+ (unsigned int)q->size *
+ sizeof(struct nvme_command),
+ q->cont);
+
+ q->data = IS_ERR(data) ? NULL : data;
+ return IS_ERR(data) ? PTR_ERR(data) : 0;
+}
+
+/* Delete an virtual completion queue */
+void nvme_mdev_vsq_delete(struct nvme_mdev_vctrl *vctrl, u16 qid)
+{
+ struct nvme_vsq *q = &vctrl->vsqs[qid];
+
+ lockdep_assert_held(&vctrl->lock);
+ _DBG(vctrl, "VSQ: delete qid=%d\n", q->qid);
+
+ if (q->data)
+ vunmap(q->data);
+ q->data = NULL;
+
+ if (q->hsq) {
+ nvme_mdev_vctrl_hq_free(vctrl, q->hsq);
+ q->hsq = 0;
+ }
+
+ clear_bit(qid, vctrl->vsq_en);
+}
+
+/* Move queue head one item forward */
+static void nvme_mdev_vsq_advance_head(struct nvme_vsq *q)
+{
+ q->head++;
+ if (q->head == q->size)
+ q->head = 0;
+}
+
+bool nvme_mdev_vsq_has_data(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vsq *q)
+{
+ u16 tail = le32_to_cpu(vctrl->mmio.dbs[q->qid].sqt);
+
+ if (!vctrl->mmio.dbs || !vctrl->mmio.eidxs || !q->data)
+ return false;
+
+ if (tail == q->head)
+ return false;
+
+ if (!nvme_mdev_mmio_db_check(vctrl, q->qid, q->size, tail))
+ return false;
+ return true;
+}
+
+/* get one command from a virtual submission queue */
+const struct nvme_command *nvme_mdev_vsq_get_cmd(struct nvme_mdev_vctrl *vctrl,
+ struct nvme_vsq *q)
+{
+ u16 oldhead = q->head;
+ u32 eidx;
+
+ if (!nvme_mdev_vsq_has_data(vctrl, q))
+ return NULL;
+ if (!nvme_mdev_vcq_reserve_space(q->vcq))
+ return NULL;
+ nvme_mdev_vsq_advance_head(q);
+
+ eidx = q->head + (q->size >> 1);
+ if (eidx >= q->size)
+ eidx -= q->size;
+
+ vctrl->mmio.eidxs[q->qid].sqt = cpu_to_le32(eidx);
+
+ return &q->data[oldhead];
+}
+
+bool nvme_mdev_vsq_suspend_io(struct nvme_mdev_vctrl *vctrl, u16 sqid)
+{
+ struct nvme_vsq *q = &vctrl->vsqs[sqid];
+ u16 tail = le32_to_cpu(vctrl->mmio.dbs[q->qid].sqt);
+
+ /* If the queue is not in working state don't allow the idle code
+ * to kick in
+ */
+ if (!vctrl->mmio.dbs || !vctrl->mmio.eidxs || !q->data)
+ return false;
+
+ /* queue has data - refuse idle*/
+ if (tail != q->head)
+ return false;
+
+ /* Write eventid to tell the user to ring normal doorbell*/
+ vctrl->mmio.eidxs[q->qid].sqt = cpu_to_le32(q->head);
+
+ /* memory barrier to ensure that the user have seen the eidx */
+ mb();
+
+ /* Check that doorbell diddn't move meanwhile */
+ tail = le32_to_cpu(vctrl->mmio.dbs[q->qid].sqt);
+ return (tail == q->head);
+}
+
+/* complete a command (IO version)*/
+void nvme_mdev_vsq_cmd_done_io(struct nvme_mdev_vctrl *vctrl,
+ u16 sqid, u16 cid, u16 status)
+{
+ struct nvme_vsq *q = &vctrl->vsqs[sqid];
+
+ nvme_mdev_vcq_write_io(vctrl, q->vcq, q->head, q->qid, cid, status);
+}
+
+/* complete a command (ADMIN version)*/
+void nvme_mdev_vsq_cmd_done_adm(struct nvme_mdev_vctrl *vctrl,
+ u32 dw0, u16 cid, u16 status)
+{
+ struct nvme_vsq *q = &vctrl->vsqs[0];
+
+ nvme_mdev_vcq_write_adm(vctrl, q->vcq, dw0, q->head, cid, status);
+}
--
2.17.2

2019-05-02 11:51:23

by Maxim Levitsky

[permalink] [raw]

Subject: [PATCH v2 10/10] nvme/mdev - generic block IO code

Use the block layer (bio_submit) to pass through the IO to the nvme driver
instead of the direct IO submission hooks.

Currently that code supports only read/write, and it still assumes that
we talk to an nvme driver.

Signed-off-by: Maxim Levitsky <[email protected]>
---
drivers/nvme/mdev/Kconfig | 8 ++
drivers/nvme/mdev/host.c | 239 +++++++++++++++++++++++++++++++++++++-
drivers/nvme/mdev/io.c | 7 ++
drivers/nvme/mdev/priv.h | 61 ++++++++++
4 files changed, 313 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/mdev/Kconfig b/drivers/nvme/mdev/Kconfig
index 7ebc66cdeac0..1ace298a364d 100644
--- a/drivers/nvme/mdev/Kconfig
+++ b/drivers/nvme/mdev/Kconfig
@@ -14,3 +14,11 @@ config NVME_MDEV_VFIO
guest, also as a NVME namespace, attached to a virtual NVME
controller
If unsure, say N.
+
+config NVME_MDEV_VFIO_GENERIC_IO
+ bool "Use generic block layer IO"
+ depends on NVME_MDEV_VFIO
+ help
+ Send the IO through the block layer using polled IO queues,
+ instead of dedicated mdev queues
+ If unsure, say N.
diff --git a/drivers/nvme/mdev/host.c b/drivers/nvme/mdev/host.c
index 6590946b86c2..a2ba69dcf4f2 100644
--- a/drivers/nvme/mdev/host.c
+++ b/drivers/nvme/mdev/host.c
@@ -53,6 +53,7 @@ static struct nvme_mdev_hctrl *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
if (!hctrl)
return NULL;

+#ifndef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
nr_host_queues = ctrl->ops->ext_queues_available(ctrl);
max_lba_transfer = ctrl->max_hw_sectors >> (PAGE_SHIFT - 9);

@@ -63,6 +64,15 @@ static struct nvme_mdev_hctrl *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
return NULL;
}

+ hctrl->oncs = ctrl->oncs &
+ (NVME_CTRL_ONCS_DSM | NVME_CTRL_ONCS_WRITE_ZEROES);
+#else
+ /* for now don't deal with bio chaining */
+ max_lba_transfer = BIO_MAX_PAGES;
+ nr_host_queues = MDEV_NVME_NUM_BIO_QUEUES;
+ /* for now no support for write zeros and discard*/
+ hctrl->oncs = 0;
+#endif

kref_init(&hctrl->ref);
mutex_init(&hctrl->lock);
@@ -70,8 +80,6 @@ static struct nvme_mdev_hctrl *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
hctrl->nvme_ctrl = ctrl;
nvme_get_ctrl(ctrl);

- hctrl->oncs = ctrl->oncs &
- (NVME_CTRL_ONCS_DSM | NVME_CTRL_ONCS_WRITE_ZEROES);

hctrl->id = ctrl->instance;
hctrl->node = dev_to_node(ctrl->dev);
@@ -200,6 +208,8 @@ bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
}
}

+#ifndef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
+
/* Allocate a host IO queue */
int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl)
{
@@ -228,6 +238,7 @@ bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid)

/* Submit a IO passthrough command */
int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
+ struct nvme_mdev_vns *vns,
u16 qid, u32 tag,
struct nvme_command *cmd,
struct nvme_ext_data_iter *datait)
@@ -248,6 +259,226 @@ int nvme_mdev_hctrl_hq_poll(struct nvme_mdev_hctrl *hctrl,
return ctrl->ops->ext_queue_poll(ctrl, qid, results, max_len);
}

+#else
+
+/* Allocate a 'host' queue - here the queues are virtual*/
+int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl)
+{
+ int qid, ret;
+ struct hw_mbio_queue *hwq;
+
+ for (qid = 0 ; qid < MDEV_NVME_NUM_BIO_QUEUES ; qid++)
+ if (!hctrl->hw_queues[qid])
+ break;
+
+ if (qid == MDEV_NVME_NUM_BIO_QUEUES)
+ return -ENOSPC;
+
+ hwq = kzalloc_node(sizeof(*hwq), GFP_KERNEL, hctrl->node);
+ if (!hwq)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&hwq->bios_in_flight);
+
+ ret = bioset_init(&hwq->bioset, MDEV_NVME_BIO_QUEUE_SIZE,
+ offsetof(struct mbio, bio), BIOSET_NEED_BVECS);
+
+ if (ret < 0) {
+ kfree(hwq);
+ return ret;
+ }
+
+ hctrl->hw_queues[qid] = hwq;
+ return qid + 1;
+}
+
+/* Free a 'host' queue - here the queues are virtual*/
+void nvme_mdev_hctrl_hq_free(struct nvme_mdev_hctrl *hctrl, u16 qid)
+{
+ struct hw_mbio_queue *hwq = hctrl->hw_queues[qid - 1];
+
+ if (WARN_ON(!hwq))
+ return;
+
+ WARN_ON(!list_empty(&hwq->bios_in_flight));
+ WARN_ON(hwq->inflight);
+
+ hctrl->hw_queues[qid - 1] = NULL;
+ bioset_exit(&hwq->bioset);
+ kfree(hwq);
+}
+
+/*
+ * Check if the host queue has space for submission - also our limit
+ * not related to the block layer
+ */
+bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid)
+{
+ struct hw_mbio_queue *hwq = hctrl->hw_queues[qid - 1];
+
+ if (WARN_ON(!hwq))
+ return false;
+ return hwq->inflight < MDEV_NVME_BIO_QUEUE_SIZE;
+}
+
+/*
+ * Callback we get from the block layer
+ * Note that despite polling, this can be run from IRQ context
+ */
+static void nvme_mdev_hctrl_bio_done(struct bio *bio)
+{
+ struct mbio *mbio = container_of(bio, struct mbio, bio);
+
+ /* this will mark this bio as done, and allow the polling thread
+ * to return it to the user
+ */
+ mbio->status = nvme_mdev_translate_error_block(bio->bi_status);
+}
+
+/* Submit a IO passthrough command */
+int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
+ struct nvme_mdev_vns *vns,
+ u16 qid, u32 tag,
+ struct nvme_command *cmd,
+ struct nvme_ext_data_iter *datait)
+{
+ struct hw_mbio_queue *hwq = hctrl->hw_queues[qid - 1];
+ struct bio *bio = NULL;
+ struct mbio *mbio;
+ struct page *page;
+ u8 opcode = cmd->common.opcode;
+ int retval, op, op_flags = 0;
+ int offset;
+
+ if (WARN_ON(!hwq))
+ return -EINVAL;
+ if (WARN_ON(hwq->inflight >= MDEV_NVME_BIO_QUEUE_SIZE))
+ return -EBUSY;
+
+ /* read/write buffer processing */
+ if (opcode == nvme_cmd_read || opcode == nvme_cmd_write) {
+ unsigned long datalength =
+ (le16_to_cpu(cmd->rw.length) + 1) << vns->blksize_shift;
+
+ if (opcode == nvme_cmd_read) {
+ op = REQ_OP_READ;
+ } else {
+ op = REQ_OP_WRITE;
+ op_flags = REQ_SYNC | REQ_IDLE;
+ if (cmd->rw.control & cpu_to_le16(NVME_RW_FUA))
+ op_flags |= REQ_FUA;
+ }
+
+ if (WARN_ON(datait->count > BIO_MAX_PAGES))
+ return -EINVAL;
+
+ bio = bio_alloc_bioset(GFP_KERNEL, datait->count, &hwq->bioset);
+ if (WARN_ON(!bio))
+ return -ENOMEM;
+
+ mbio = container_of(bio, struct mbio, bio);
+
+ /* starting sector */
+ bio->bi_iter.bi_sector = le64_to_cpu(cmd->rw.slba) <<
+ (vns->blksize_shift - 9);
+
+ /* Data. Last page might be partial size*/
+ while (datait->count) {
+ int chunk = min(PAGE_SIZE, datalength);
+
+ if (WARN_ON(datalength == 0))
+ break;
+
+ page = pfn_to_page(PHYS_PFN(datait->physical));
+ offset = OFFSET_IN_PAGE(datait->physical);
+
+ if (bio_add_page(&mbio->bio, page,
+ chunk, offset) != chunk) {
+ WARN_ON(1);
+ retval = -ENOMEM;
+ goto error;
+ }
+
+ retval = datait->next(datait);
+ if (WARN_ON(retval))
+ goto error;
+ datalength -= chunk;
+ }
+
+ /* flush request */
+ } else if (opcode == nvme_cmd_flush) {
+ op = REQ_OP_WRITE;
+ op_flags = REQ_PREFLUSH;
+ bio = bio_alloc_bioset(GFP_KERNEL, 0, &hwq->bioset);
+ if (WARN_ON(!bio))
+ return -ENOMEM;
+ mbio = container_of(bio, struct mbio, bio);
+ } else {
+ retval = -EINVAL;
+ goto error;
+ }
+
+ /* set polling */
+ op_flags |= REQ_HIPRI | REQ_NOWAIT;
+
+ /* setup the bio */
+ bio_set_dev(bio, vns->host_part);
+ bio->bi_end_io = nvme_mdev_hctrl_bio_done;
+ bio_set_op_attrs(bio, op, op_flags);
+
+ /* setup our portion of the bio*/
+ mbio = container_of(bio, struct mbio, bio);
+ mbio->tag = tag;
+ mbio->status = NVME_STATUS_PENDING;
+ mbio->blk_queue = bdev_get_queue(vns->host_part);
+
+ /* submit the bio*/
+ mbio->cookie = submit_bio(bio);
+
+ list_add_tail(&mbio->link, &hwq->bios_in_flight);
+ hwq->inflight++;
+ return 0;
+error:
+ if (bio)
+ bio_put(bio);
+ return retval;
+}
+
+/* Poll for completion of IO passthrough commands */
+int nvme_mdev_hctrl_hq_poll(struct nvme_mdev_hctrl *hctrl,
+ u32 qid,
+ struct nvme_ext_cmd_result *results,
+ unsigned int max_len)
+{
+ struct hw_mbio_queue *hwq = hctrl->hw_queues[qid - 1];
+ struct mbio *mbio, *tmp;
+
+ int i = 0;
+
+ if (!hwq->inflight)
+ return -1;
+
+ list_for_each_entry_safe(mbio, tmp, &hwq->bios_in_flight, link) {
+ if (mbio->status == NVME_STATUS_PENDING)
+ blk_poll(mbio->blk_queue, mbio->cookie, false);
+
+ if (mbio->status == NVME_STATUS_PENDING)
+ continue;
+
+ results[i].tag = mbio->tag;
+ results[i].status = mbio->status;
+
+ hwq->inflight--;
+ list_del(&mbio->link);
+ bio_put(&mbio->bio);
+
+ if (++i == max_len)
+ break;
+ }
+ return i;
+}
+#endif
+
/* Destroy all host controllers */
void nvme_mdev_hctrl_destroy_all(void)
{
@@ -486,6 +717,10 @@ static int __init nvme_mdev_init(void)
}

pr_info("nvme_mdev " NVME_MDEV_FIRMWARE_VERSION " loaded\n");
+
+#ifdef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
+ pr_info("nvme_mdev: using block layer polled IO\b");
+#endif
return 0;
}

diff --git a/drivers/nvme/mdev/io.c b/drivers/nvme/mdev/io.c
index 39550d0e3649..d3c46de33b01 100644
--- a/drivers/nvme/mdev/io.c
+++ b/drivers/nvme/mdev/io.c
@@ -70,7 +70,11 @@ static int nvme_mdev_io_translate_rw(struct io_ctx *ctx)
if (!check_range(slba, length, ctx->ns->ns_size))
return DNR(NVME_SC_LBA_RANGE);

+#ifndef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
ctx->out.rw.slba = cpu_to_le64(slba + ctx->ns->host_lba_offset);
+#else
+ ctx->out.rw.slba = in->slba;
+#endif
ctx->out.rw.length = in->length;

ret = nvme_mdev_udata_iter_set_dptr(&ctx->udatait, &in->dptr,
@@ -195,7 +199,9 @@ static int nvme_mdev_io_translate_dsm(struct io_ctx *ctx)
_DBG(ctx->vctrl, "IOQ: DSM_MANAGEMENT: RANGE 0x%llx-0x%x\n",
slba, nlb);

+#ifndef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
data_ptr[i].slba = cpu_to_le64(slba + ctx->ns->host_lba_offset);
+#endif
}

ctx->out.dsm.attributes = in->attributes;
@@ -280,6 +286,7 @@ static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)

/*passthrough*/
ret = nvme_mdev_hctrl_hq_submit(ctx->hctrl,
+ ctx->ns,
vsq->hsq,
(((u32)vsq->qid) << 16) | ((u32)ucid),
&ctx->out,
diff --git a/drivers/nvme/mdev/priv.h b/drivers/nvme/mdev/priv.h
index a11a1842957d..1dd5fce0bfa6 100644
--- a/drivers/nvme/mdev/priv.h
+++ b/drivers/nvme/mdev/priv.h
@@ -34,7 +34,12 @@
#define MAX_VIRTUAL_NAMESPACES 16 /* NSID = 1..16*/
#define MAX_VIRTUAL_IRQS 16

+#ifndef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
#define MAX_HOST_QUEUES 4
+#else
+#define MAX_HOST_QUEUES 1
+#endif
+
#define MAX_AER_COMMANDS 16
#define MAX_LOG_PAGES 16

@@ -323,6 +328,39 @@ struct nvme_mdev_inst_type {
struct attribute_group *attrgroup;
};

+#ifdef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
+
+#define MDEV_NVME_BIO_QUEUE_SIZE 128
+#define NVME_STATUS_PENDING 0xFFFF
+#define MDEV_NVME_NUM_BIO_QUEUES 16
+
+struct mbio {
+ /* link in a list of pending bios*/
+ struct list_head link;
+
+ struct request_queue *blk_queue;
+
+ /*GDPR compliant*/
+ unsigned int cookie;
+
+ /* tag from the translation (user cid + user qid) */
+ u32 tag;
+
+ /* result NVME status */
+ u16 status;
+
+ /* must be last for bioset allocation*/
+ struct bio bio;
+};
+
+struct hw_mbio_queue {
+ int inflight;
+ struct list_head bios_in_flight;
+ struct bio_set bioset;
+};
+
+#endif
+
/*Abstraction of the host controller that we are connected to */
struct nvme_mdev_hctrl {
struct mutex lock;
@@ -344,6 +382,10 @@ struct nvme_mdev_hctrl {

/* book-keeping for number of host queues we can allocate*/
unsigned int nr_host_queues;
+
+#ifdef CONFIG_NVME_MDEV_VFIO_GENERIC_IO
+ struct hw_mbio_queue *hw_queues[MDEV_NVME_NUM_BIO_QUEUES];
+#endif
};

/* vctrl.c*/
@@ -415,6 +457,7 @@ bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid);
bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode);

int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
+ struct nvme_mdev_vns *vns,
u16 qid, u32 tag,
struct nvme_command *cmd,
struct nvme_ext_data_iter *datait);
@@ -701,6 +744,24 @@ static inline int nvme_mdev_translate_error(int error)
}
}

+static inline int nvme_mdev_translate_error_block(blk_status_t blk_sts)
+{
+ switch (blk_sts) {
+ case BLK_STS_OK:
+ return NVME_SC_SUCCESS;
+ case BLK_STS_NOSPC:
+ return DNR(NVME_SC_CAP_EXCEEDED);
+ case BLK_STS_TARGET:
+ return DNR(NVME_SC_LBA_RANGE);
+ case BLK_STS_NOTSUPP:
+ return DNR(NVME_SC_INVALID_OPCODE);
+ case BLK_STS_MEDIUM:
+ return DNR(NVME_SC_ACCESS_DENIED);
+ default:
+ return DNR(NVME_SC_INTERNAL);
+ }
+}
+
static inline bool timeout(ktime_t event, ktime_t now, unsigned long timeout_ms)
{
return ktime_ms_delta(now, event) > (long)timeout_ms;
--
2.17.2

2019-05-02 11:52:32

by Maxim Levitsky

[permalink] [raw]

Subject: [PATCH v2 09/10] nvme/mdev - Add inline performance measurments

This code might not be needed to be merged in the final version

Signed-off-by: Maxim Levitsky <[email protected]>
---
drivers/nvme/mdev/instance.c | 62 ++++++++++++++++++++++++++++++++++++
drivers/nvme/mdev/io.c | 21 ++++++++++++
drivers/nvme/mdev/irq.c | 6 ++++
drivers/nvme/mdev/priv.h | 13 ++++++++
drivers/nvme/mdev/vctrl.c | 3 ++
5 files changed, 105 insertions(+)

diff --git a/drivers/nvme/mdev/instance.c b/drivers/nvme/mdev/instance.c
index 0e5ba5a9269f..d692b2bf2ef2 100644
--- a/drivers/nvme/mdev/instance.c
+++ b/drivers/nvme/mdev/instance.c
@@ -771,6 +771,62 @@ static struct attribute *nvme_mdev_dev_settings_atttributes[] = {
NULL
};

+/* show perf stats */
+static ssize_t stats_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+ struct nvme_mdev_perf *perf;
+
+ if (!vctrl)
+ return -ENODEV;
+
+ perf = &vctrl->perf;
+
+ return sprintf(buf,
+ "%u %llu %llu %llu %llu %llu %llu\n",
+
+ tsc_khz,
+
+ perf->cmds_started,
+ perf->cycles_send_to_hw,
+
+ perf->cmds_complete,
+ perf->cycles_receive_from_hw,
+
+ perf->interrupts_sent,
+ perf->cycles_irq_delivery);
+}
+
+/* clear the perf stats */
+static ssize_t stats_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ bool val;
+ int ret;
+ struct nvme_mdev_vctrl *vctrl = dev_to_vctrl(dev);
+
+ if (!vctrl)
+ return -ENODEV;
+ ret = kstrtobool(buf, &val);
+ if (ret)
+ return ret;
+
+ if (!val)
+ return -EINVAL;
+
+ memset(&vctrl->perf, 0, sizeof(vctrl->perf));
+ return count;
+}
+
+static DEVICE_ATTR_RW(stats);
+
+static struct attribute *nvme_mdev_dev_debug_attributes[] = {
+ &dev_attr_stats.attr,
+ NULL
+};
+
static const struct attribute_group nvme_mdev_ns_attr_group = {
.name = "namespaces",
.attrs = nvme_mdev_dev_ns_atttributes,
@@ -781,9 +837,15 @@ static const struct attribute_group nvme_mdev_setting_attr_group = {
.attrs = nvme_mdev_dev_settings_atttributes,
};

+static const struct attribute_group nvme_mdev_debug_attr_group = {
+ .name = "debug",
+ .attrs = nvme_mdev_dev_debug_attributes,
+};
+
static const struct attribute_group *nvme_mdev_dev_attributte_groups[] = {
&nvme_mdev_ns_attr_group,
&nvme_mdev_setting_attr_group,
+ &nvme_mdev_debug_attr_group,
NULL,
};

diff --git a/drivers/nvme/mdev/io.c b/drivers/nvme/mdev/io.c
index 59837540fec2..39550d0e3649 100644
--- a/drivers/nvme/mdev/io.c
+++ b/drivers/nvme/mdev/io.c
@@ -9,6 +9,7 @@
#include <linux/nvme.h>
#include <linux/timekeeping.h>
#include <linux/ktime.h>
+#include <asm/msr.h>
#include "priv.h"

@@ -251,6 +252,9 @@ static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
struct nvme_vsq *vsq = &ctx->vctrl->vsqs[sqid];
u16 ucid;
int ret;
+ unsigned long long c1, c2;
+
+ c1 = rdtsc();

/* If host queue is full, we can't process a command
* as a command will likely result in passthrough
@@ -289,6 +293,12 @@ static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)

nvme_mdev_vsq_cmd_done_io(ctx->vctrl, sqid, ucid, ret);
}
+
+ c2 = rdtsc();
+
+ ctx->vctrl->perf.cmds_started++;
+ ctx->vctrl->perf.cycles_send_to_hw += (c2 - c1);
+
return true;
}

@@ -304,6 +314,10 @@ static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
int n, i;
struct nvme_ext_cmd_result res[16];

+ unsigned long long c1, c2;
+
+ c1 = rdtsc();
+
/* process the completions from the hardware */
n = nvme_mdev_hctrl_hq_poll(ctx->hctrl, hwq, res, 16);
if (n == -1)
@@ -321,6 +335,13 @@ static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)

nvme_mdev_vsq_cmd_done_io(ctx->vctrl, qid, cid, status);
}
+
+ if (n > 0) {
+ c2 = rdtsc();
+ ctx->vctrl->perf.cmds_complete += n;
+ ctx->vctrl->perf.cycles_receive_from_hw += (c2 - c1);
+ }
+
return n;
}

diff --git a/drivers/nvme/mdev/irq.c b/drivers/nvme/mdev/irq.c
index 5809cdb4d84c..b6010a69b584 100644
--- a/drivers/nvme/mdev/irq.c
+++ b/drivers/nvme/mdev/irq.c
@@ -9,6 +9,7 @@
#include <linux/kernel.h>
#include <linux/slab.h>
#include "priv.h"
+#include <asm/msr.h>

/* Setup the interrupt subsystem */
void nvme_mdev_irqs_setup(struct nvme_mdev_vctrl *vctrl)
@@ -253,12 +254,17 @@ void nvme_mdev_irq_cond_trigger(struct nvme_mdev_vctrl *vctrl,
unsigned int index)
{
struct nvme_mdev_user_irq *irq = &vctrl->irqs.vecs[index];
+ unsigned long long c1, c2;

if (irq->irq_pending_cnt == 0)
return;

if (!nvme_mdev_irq_coalesce(vctrl, irq)) {
+ vctrl->perf.interrupts_sent++;
+ c1 = rdtsc();
nvme_mdev_irq_trigger(vctrl, index);
+ c2 = rdtsc();
nvme_mdev_irq_clear(vctrl, index);
+ vctrl->perf.cycles_irq_delivery += (c2 - c1);
}
}
diff --git a/drivers/nvme/mdev/priv.h b/drivers/nvme/mdev/priv.h
index 9f65e46c1ab2..a11a1842957d 100644
--- a/drivers/nvme/mdev/priv.h
+++ b/drivers/nvme/mdev/priv.h
@@ -245,6 +245,17 @@ struct nvme_mdev_io_region {
unsigned int mmap_area_size;
};

+struct nvme_mdev_perf {
+ /* number of IO commands received */
+ unsigned long long cmds_started;
+ unsigned long long cmds_complete;
+ unsigned long long interrupts_sent;
+
+ unsigned long long cycles_send_to_hw;
+ unsigned long long cycles_receive_from_hw;
+ unsigned long long cycles_irq_delivery;
+};
+
/*Virtual NVME controller state */
struct nvme_mdev_vctrl {
struct kref ref;
@@ -301,6 +312,8 @@ struct nvme_mdev_vctrl {
char serial[9];

bool vctrl_paused; /* true when the host device paused our IO */
+
+ struct nvme_mdev_perf perf;
};

/* mdev instance type*/
diff --git a/drivers/nvme/mdev/vctrl.c b/drivers/nvme/mdev/vctrl.c
index 87bc7c435c0c..d23b543dfd52 100644
--- a/drivers/nvme/mdev/vctrl.c
+++ b/drivers/nvme/mdev/vctrl.c
@@ -242,6 +242,9 @@ int nvme_mdev_vctrl_open(struct nvme_mdev_vctrl *vctrl)

nvme_mdev_mmio_open(vctrl);
vctrl->inuse = true;
+
+ memset(&vctrl->perf, 0, sizeof(vctrl->perf));
+
out:
mutex_unlock(&vctrl->lock);
return ret;
--
2.17.2

2019-05-02 14:22:32

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v2 08/10] nvme/pci: implement the mdev external queue allocation interface

On Thu, 2019-05-02 at 14:47 +0300, Maxim Levitsky wrote:
> Note that currently the number of hw queues reserved for mdev,
> has to be pre determined on module load.
>
> (I used to allocate the queues dynamicaly on demand, but
> recent changes to allocate polled/read queues made
> this somewhat difficult, so I dropped this for now)
>
> Signed-off-by: Maxim Levitsky <[email protected]>
> ---
> drivers/nvme/host/pci.c | 375 ++++++++++++++++++++++++++++++++++++++-
> drivers/nvme/mdev/host.c | 46 ++---
> drivers/nvme/mdev/io.c | 46 +++--
> drivers/nvme/mdev/mmio.c | 3 -
> 4 files changed, 421 insertions(+), 49 deletions(-)

The changes in drivers/nvme/mdev aren't supposed to be here, this was some code
moving around to reduce the diff in the generic block layer support code,
it supposed to go to the main mdev commit.

Best regards,
Maxim Levitsky

2019-05-02 21:13:37

by Heitke, Kenneth

[permalink] [raw]

Subject: Re: [PATCH v2 08/10] nvme/pci: implement the mdev external queue allocation interface

On 5/2/2019 5:47 AM, Maxim Levitsky wrote:
> Note that currently the number of hw queues reserved for mdev,
> has to be pre determined on module load.
>
> (I used to allocate the queues dynamicaly on demand, but
> recent changes to allocate polled/read queues made
> this somewhat difficult, so I dropped this for now)
>
> Signed-off-by: Maxim Levitsky <[email protected]>
> ---
> drivers/nvme/host/pci.c | 375 ++++++++++++++++++++++++++++++++++++++-
> drivers/nvme/mdev/host.c | 46 ++---
> drivers/nvme/mdev/io.c | 46 +++--
> drivers/nvme/mdev/mmio.c | 3 -
> 4 files changed, 421 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 282f28c851c1..87507e710374 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -23,6 +23,7 @@
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/sed-opal.h>
> #include <linux/pci-p2pdma.h>
> +#include "../mdev/mdev.h"
>
> #include "trace.h"
> #include "nvme.h"
> @@ -32,6 +33,7 @@
>
> #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>
> +#define USE_SMALL_PRP_POOL(nprps) ((nprps) < (256 / 8))
> /*
> * These can be higher, but we need to ensure that any command doesn't
> * require an sg allocation that needs more than a page of data.
> @@ -83,12 +85,24 @@ static int poll_queues = 0;
> module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644);
> MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
>
> +static int mdev_queues;
> +#ifdef CONFIG_NVME_MDEV
> +module_param_cb(mdev_queues, &queue_count_ops, &mdev_queues, 0644);
> +MODULE_PARM_DESC(mdev_queues, "Number of queues to use for mediated VFIO");
> +#endif
> +
> struct nvme_dev;
> struct nvme_queue;
>
> static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
> static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
>
> +#ifdef CONFIG_NVME_MDEV
> +static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid);
> +#else
> +static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid) {}
> +#endif
> +
> /*
> * Represents an NVM Express device. Each nvme_dev is a PCI function.
> */
> @@ -103,6 +117,7 @@ struct nvme_dev {
> unsigned online_queues;
> unsigned max_qid;
> unsigned io_queues[HCTX_MAX_TYPES];
> + unsigned int mdev_queues;
> unsigned int num_vecs;
> int q_depth;
> u32 db_stride;
> @@ -110,6 +125,7 @@ struct nvme_dev {
> unsigned long bar_mapped_size;
> struct work_struct remove_work;
> struct mutex shutdown_lock;
> + struct mutex ext_dev_lock;
> bool subsystem;
> u64 cmb_size;
> bool cmb_use_sqes;
> @@ -172,6 +188,16 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
> return container_of(ctrl, struct nvme_dev, ctrl);
> }
>
> +/* Simplified IO descriptor for MDEV use */
> +struct nvme_ext_iod {
> + struct list_head link;
> + u32 user_tag;
> + int nprps;
> + struct nvme_ext_data_iter *saved_iter;
> + dma_addr_t first_prplist_dma;
> + __le64 *prpslists[NVME_MAX_SEGS];
> +};
> +
> /*
> * An NVM Express queue. Each device has at least two (one for admin
> * commands and one for I/O commands).
> @@ -196,15 +222,26 @@ struct nvme_queue {
> u16 qid;
> u8 cq_phase;
> unsigned long flags;
> +
> #define NVMEQ_ENABLED 0
> #define NVMEQ_SQ_CMB 1
> #define NVMEQ_DELETE_ERROR 2
> #define NVMEQ_POLLED 3
> +#define NVMEQ_EXTERNAL 4
> +
> u32 *dbbuf_sq_db;
> u32 *dbbuf_cq_db;
> u32 *dbbuf_sq_ei;
> u32 *dbbuf_cq_ei;
> struct completion delete_done;
> +
> + /* queue passthrough for external use */
> + struct {
> + int inflight;
> + struct nvme_ext_iod *iods;
> + struct list_head free_iods;
> + struct list_head used_iods;
> + } ext;
> };
>
> /*
> @@ -255,7 +292,7 @@ static inline void _nvme_check_size(void)
>
> static unsigned int max_io_queues(void)
> {
> - return num_possible_cpus() + write_queues + poll_queues;
> + return num_possible_cpus() + write_queues + poll_queues + mdev_queues;
> }
>
> static unsigned int max_queue_count(void)
> @@ -1066,6 +1103,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
> * the irq handler, even if that was on another CPU.
> */
> rmb();
> +
> if (nvmeq->cq_head != nvmeq->last_cq_head)
> ret = IRQ_HANDLED;
> nvme_process_cq(nvmeq, &start, &end, -1);
> @@ -1553,7 +1591,11 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
> memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
> nvme_dbbuf_init(dev, nvmeq, qid);
> dev->online_queues++;
> +
> wmb(); /* ensure the first interrupt sees the initialization */
> +
> + if (test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
> + nvme_ext_queue_reset(nvmeq->dev, qid);
> }
>
> static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
> @@ -1759,7 +1801,7 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
> }
>
> max = min(dev->max_qid, dev->ctrl.queue_count - 1);
> - if (max != 1 && dev->io_queues[HCTX_TYPE_POLL]) {
> + if (max != 1) {
> rw_queues = dev->io_queues[HCTX_TYPE_DEFAULT] +
> dev->io_queues[HCTX_TYPE_READ];
> } else {
> @@ -2095,14 +2137,23 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
> * Poll queues don't need interrupts, but we need at least one IO
> * queue left over for non-polled IO.
> */
> - this_p_queues = poll_queues;
> + this_p_queues = poll_queues + mdev_queues;
> if (this_p_queues >= nr_io_queues) {
> this_p_queues = nr_io_queues - 1;
> irq_queues = 1;
> } else {
> irq_queues = nr_io_queues - this_p_queues + 1;
> }
> +
> + if (mdev_queues > this_p_queues) {
> + mdev_queues = this_p_queues;
> + this_p_queues = 0;
> + } else {
> + this_p_queues -= mdev_queues;
> + }
> +
> dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
> + dev->mdev_queues = mdev_queues;
>
> /* Initialize for the single interrupt case */
> dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
> @@ -2170,7 +2221,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>
> dev->num_vecs = result;
> result = max(result - 1, 1);
> - dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
> + dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL] +
> + dev->mdev_queues;
>
> /*
> * Should investigate if there's a performance win from allocating
> @@ -2193,10 +2245,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
> nvme_suspend_io_queues(dev);
> goto retry;
> }
> - dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
> + dev_info(dev->ctrl.device, "%d/%d/%d/%d default/read/poll/mdev queues\n",
> dev->io_queues[HCTX_TYPE_DEFAULT],
> dev->io_queues[HCTX_TYPE_READ],
> - dev->io_queues[HCTX_TYPE_POLL]);
> + dev->io_queues[HCTX_TYPE_POLL],
> + dev->mdev_queues);
> return 0;
> }
>
> @@ -2623,6 +2676,301 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
> nvme_put_ctrl(&dev->ctrl);
> }
>
> +#ifdef CONFIG_NVME_MDEV
> +static void nvme_ext_free_iod(struct nvme_dev *dev, struct nvme_ext_iod *iod)
> +{
> + int i = 0, max_prp, nprps = iod->nprps;
> + dma_addr_t dma = iod->first_prplist_dma;
> +
> + if (iod->saved_iter) {
> + iod->saved_iter->release(iod->saved_iter);
> + iod->saved_iter = NULL;
> + }
> +
> + if (--nprps < 2) {
> + goto out;
> + } else if (USE_SMALL_PRP_POOL(nprps)) {
> + dma_pool_free(dev->prp_small_pool, iod->prpslists[0], dma);
> + goto out;
> + }
> +
> + max_prp = (dev->ctrl.page_size >> 3) - 1;
> + while (nprps > 0) {
> + if (i > 0) {
> + dma = iod->prpslists[i - 1][max_prp];
> + if (nprps == 1)
> + break;
> + }
> + dma_pool_free(dev->prp_page_pool, iod->prpslists[i++], dma);
> + nprps -= max_prp;
> + }
> +out:
> + iod->nprps = -1;
> + iod->first_prplist_dma = 0;
> + iod->user_tag = 0xDEADDEAD;
> +}
> +
> +static int nvme_ext_setup_iod(struct nvme_dev *dev, struct nvme_ext_iod *iod,
> + struct nvme_common_command *cmd,
> + struct nvme_ext_data_iter *iter)
> +{
> + int ret, i, j;
> + __le64 *prp_list;
> + dma_addr_t prp_dma;
> + struct dma_pool *pool;
> + int max_prp = (dev->ctrl.page_size >> 3) - 1;
> +
> + iod->saved_iter = iter && iter->release ? iter : NULL;
> + iod->nprps = iter ? iter->count : 0;
> + cmd->dptr.prp1 = 0;
> + cmd->dptr.prp2 = 0;
> + cmd->metadata = 0;
> +
> + if (!iter)
> + return 0;
> +
> + /* put first pointer*/
> + cmd->dptr.prp1 = cpu_to_le64(iter->host_iova);
> + if (iter->count == 1)
> + return 0;
> +
> + ret = iter->next(iter);
> + if (ret)
> + goto error;
> +
> + /* if only have one more pointer, put it to second data pointer*/
> + if (iter->count == 1) {
> + cmd->dptr.prp2 = cpu_to_le64(iter->host_iova);
> + return 0;
> + }
> +
> + pool = USE_SMALL_PRP_POOL(iter->count) ? dev->prp_small_pool :
> + dev->prp_page_pool;
> +
> + /* Allocate prp lists as needed and fill them */
> + for (i = 0 ; i < NVME_MAX_SEGS && iter->count ; i++) {
> + prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
> + if (!prp_list) {
> + ret = -ENOMEM;
> + goto error;
> + }
> +
> + iod->prpslists[i++] = prp_list;
> +
> + if (i == 1) {
> + iod->first_prplist_dma = prp_dma;
> + cmd->dptr.prp2 = cpu_to_le64(prp_dma);
> + j = 0;
> + } else {
> + prp_list[0] = iod->prpslists[i - 1][max_prp];
> + iod->prpslists[i - 1][max_prp] = prp_dma;
> + j = 1;
> + }
> +
> + while (j <= max_prp && iter->count) {
> + prp_list[j++] = iter->host_iova;
> + ret = iter->next(iter);
> + if (ret)
> + goto error;
> + }
> + }
> +
> + if (iter->count) {
> + ret = -ENOSPC;
> + goto error;
> + }
> + return 0;
> +error:
> + iod->nprps -= iter->count;
> + nvme_ext_free_iod(dev, iod);
> + return ret;
> +}
> +
> +static int nvme_ext_queues_available(struct nvme_ctrl *ctrl)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + unsigned int ret = 0, qid;
> + unsigned int first_mdev_q = dev->online_queues - dev->mdev_queues;
> +
> + for (qid = first_mdev_q; qid < dev->online_queues; qid++) {
> + struct nvme_queue *nvmeq = &dev->queues[qid];
> +
> + if (!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
> + ret++;
> + }
> + return ret;
> +}
> +
> +static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid)
> +{
> + struct nvme_queue *nvmeq = &dev->queues[qid];
> + struct nvme_ext_iod *iod, *tmp;
> +
> + list_for_each_entry_safe(iod, tmp, &nvmeq->ext.used_iods, link) {
> + if (iod->saved_iter && iod->saved_iter->release) {
> + iod->saved_iter->release(iod->saved_iter);
> + iod->saved_iter = NULL;
> + list_move(&iod->link, &nvmeq->ext.free_iods);
> + }
> + }
> +
> + nvmeq->ext.inflight = 0;
> +}
> +
> +static int nvme_ext_queue_alloc(struct nvme_ctrl *ctrl, u16 *ret_qid)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + struct nvme_queue *nvmeq;
> + int ret = 0, qid, i;
> + unsigned int first_mdev_q = dev->online_queues - dev->mdev_queues;
> +
> + mutex_lock(&dev->ext_dev_lock);
> +
> + /* find a polled queue to allocate */
> + for (qid = dev->online_queues - 1 ; qid >= first_mdev_q ; qid--) {
> + nvmeq = &dev->queues[qid];
> + if (!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
> + break;
> + }
> +
> + if (qid < first_mdev_q) {
> + ret = -ENOSPC;
> + goto out;
> + }
> +
> + INIT_LIST_HEAD(&nvmeq->ext.free_iods);
> + INIT_LIST_HEAD(&nvmeq->ext.used_iods);
> +
> + nvmeq->ext.iods =
> + vzalloc_node(sizeof(struct nvme_ext_iod) * nvmeq->q_depth,
> + dev_to_node(dev->dev));
> +
> + if (!nvmeq->ext.iods) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + for (i = 0 ; i < nvmeq->q_depth ; i++)
> + list_add_tail(&nvmeq->ext.iods[i].link, &nvmeq->ext.free_iods);
> +
> + set_bit(NVMEQ_EXTERNAL, &nvmeq->flags);
> + *ret_qid = qid;
> +out:
> + mutex_unlock(&dev->ext_dev_lock);
> + return ret;
> +}
> +
> +static void nvme_ext_queue_free(struct nvme_ctrl *ctrl, u16 qid)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + struct nvme_queue *nvmeq;
> +
> + mutex_lock(&dev->ext_dev_lock);
> + nvmeq = &dev->queues[qid];
> +
> + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> + return;

This condition is probably not expected to happen (since its a warning)
but do you need to unlock the ext_dev_lock before returning?

> +
> + nvme_ext_queue_reset(dev, qid);
> +
> + vfree(nvmeq->ext.iods);
> + nvmeq->ext.iods = NULL;
> + INIT_LIST_HEAD(&nvmeq->ext.free_iods);
> + INIT_LIST_HEAD(&nvmeq->ext.used_iods);
> +
> + clear_bit(NVMEQ_EXTERNAL, &nvmeq->flags);
> + mutex_unlock(&dev->ext_dev_lock);
> +}
> +
> +static int nvme_ext_queue_submit(struct nvme_ctrl *ctrl, u16 qid, u32 user_tag,
> + struct nvme_command *command,
> + struct nvme_ext_data_iter *iter)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + struct nvme_queue *nvmeq = &dev->queues[qid];
> + struct nvme_ext_iod *iod;
> + int ret;
> +
> + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> + return -EINVAL;
> +
> + if (list_empty(&nvmeq->ext.free_iods))
> + return -1;
> +
> + iod = list_first_entry(&nvmeq->ext.free_iods,
> + struct nvme_ext_iod, link);
> +
> + list_move(&iod->link, &nvmeq->ext.used_iods);
> +
> + command->common.command_id = cpu_to_le16(iod - nvmeq->ext.iods);
> + iod->user_tag = user_tag;
> +
> + ret = nvme_ext_setup_iod(dev, iod, &command->common, iter);
> + if (ret) {
> + list_move(&iod->link, &nvmeq->ext.free_iods);
> + return ret;
> + }
> +
> + nvmeq->ext.inflight++;
> + nvme_submit_cmd(nvmeq, command, true);
> + return 0;
> +}
> +
> +static int nvme_ext_queue_poll(struct nvme_ctrl *ctrl, u16 qid,
> + struct nvme_ext_cmd_result *results,
> + unsigned int max_len)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + struct nvme_queue *nvmeq = &dev->queues[qid];
> + u16 old_head;
> + int i, j;
> +
> + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> + return -EINVAL;
> +
> + if (nvmeq->ext.inflight == 0)
> + return -1;
> +
> + old_head = nvmeq->cq_head;
> +
> + for (i = 0 ; nvme_cqe_pending(nvmeq) && i < max_len ; i++) {
> + u16 status = le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status);
> + u16 tag = le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].command_id);
> +
> + results[i].status = status >> 1;
> + results[i].tag = (u32)tag;
> + nvme_update_cq_head(nvmeq);
> + }
> +
> + if (old_head != nvmeq->cq_head)
> + nvme_ring_cq_doorbell(nvmeq);
> +
> + for (j = 0 ; j < i ; j++) {
> + u16 tag = results[j].tag & 0xFFFF;
> + struct nvme_ext_iod *iod = &nvmeq->ext.iods[tag];
> +
> + if (WARN_ON(tag >= nvmeq->q_depth || iod->nprps == -1))
> + continue;
> +
> + results[j].tag = iod->user_tag;
> + nvme_ext_free_iod(dev, iod);
> + list_move(&iod->link, &nvmeq->ext.free_iods);
> + nvmeq->ext.inflight--;
> + }
> +
> + WARN_ON(nvmeq->ext.inflight < 0);
> + return i;
> +}
> +
> +static bool nvme_ext_queue_full(struct nvme_ctrl *ctrl, u16 qid)
> +{
> + struct nvme_dev *dev = to_nvme_dev(ctrl);
> + struct nvme_queue *nvmeq = &dev->queues[qid];
> +
> + return nvmeq->ext.inflight < nvmeq->q_depth - 1;
> +}
> +#endif
> +
> static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
> {
> *val = readl(to_nvme_dev(ctrl)->bar + off);
> @@ -2652,13 +3000,25 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
> .name = "pcie",
> .module = THIS_MODULE,
> .flags = NVME_F_METADATA_SUPPORTED |
> - NVME_F_PCI_P2PDMA,
> + NVME_F_PCI_P2PDMA |
> + NVME_F_MDEV_SUPPORTED |
> + NVME_F_MDEV_DMA_SUPPORTED,
> +
> .reg_read32 = nvme_pci_reg_read32,
> .reg_write32 = nvme_pci_reg_write32,
> .reg_read64 = nvme_pci_reg_read64,
> .free_ctrl = nvme_pci_free_ctrl,
> .submit_async_event = nvme_pci_submit_async_event,
> .get_address = nvme_pci_get_address,
> +
> +#ifdef CONFIG_NVME_MDEV
> + .ext_queues_available = nvme_ext_queues_available,
> + .ext_queue_alloc = nvme_ext_queue_alloc,
> + .ext_queue_free = nvme_ext_queue_free,
> + .ext_queue_submit = nvme_ext_queue_submit,
> + .ext_queue_poll = nvme_ext_queue_poll,
> + .ext_queue_full = nvme_ext_queue_full,
> +#endif
> };
>
> static int nvme_dev_map(struct nvme_dev *dev)
> @@ -2747,6 +3107,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
> INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
> mutex_init(&dev->shutdown_lock);
> + mutex_init(&dev->ext_dev_lock);
>
> result = nvme_setup_prp_pools(dev);
> if (result)
> diff --git a/drivers/nvme/mdev/host.c b/drivers/nvme/mdev/host.c
> index 5766bad7e909..6590946b86c2 100644
> --- a/drivers/nvme/mdev/host.c
> +++ b/drivers/nvme/mdev/host.c
> @@ -48,19 +48,21 @@ static struct nvme_mdev_hctrl *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
> return NULL;
> }
>
> + hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
> + dev_to_node(ctrl->dev));
> + if (!hctrl)
> + return NULL;
> +
> nr_host_queues = ctrl->ops->ext_queues_available(ctrl);
> max_lba_transfer = ctrl->max_hw_sectors >> (PAGE_SHIFT - 9);
>
> if (nr_host_queues == 0) {
> dev_info(ctrl->dev,
> "no support for mdev - no mdev reserved queues available");
> + kfree(hctrl);
> return NULL;
> }
>
> - hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
> - dev_to_node(ctrl->dev));
> - if (!hctrl)
> - return NULL;
>
> kref_init(&hctrl->ref);
> mutex_init(&hctrl->lock);
> @@ -180,6 +182,24 @@ void nvme_mdev_hctrl_hqs_unreserve(struct nvme_mdev_hctrl *hctrl,
> mutex_unlock(&hctrl->lock);
> }
>
> +/* Check if IO passthrough is supported for given IO optcode */
> +bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
> +{
> + switch (optcode) {
> + case nvme_cmd_flush:
> + case nvme_cmd_read:
> + case nvme_cmd_write:
> + /* these are mandatory*/
> + return true;
> + case nvme_cmd_write_zeroes:
> + return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
> + case nvme_cmd_dsm:
> + return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
> + default:
> + return false;
> + }
> +}
> +
> /* Allocate a host IO queue */
> int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl)
> {
> @@ -204,23 +224,7 @@ bool nvme_mdev_hctrl_hq_can_submit(struct nvme_mdev_hctrl *hctrl, u16 qid)
> return hctrl->nvme_ctrl->ops->ext_queue_full(hctrl->nvme_ctrl, qid);
> }
>
> -/* Check if IO passthrough is supported for given IO optcode */
> -bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
> -{
> - switch (optcode) {
> - case nvme_cmd_flush:
> - case nvme_cmd_read:
> - case nvme_cmd_write:
> - /* these are mandatory*/
> - return true;
> - case nvme_cmd_write_zeroes:
> - return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
> - case nvme_cmd_dsm:
> - return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
> - default:
> - return false;
> - }
> -}
> +
>
> /* Submit a IO passthrough command */
> int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
> diff --git a/drivers/nvme/mdev/io.c b/drivers/nvme/mdev/io.c
> index a731196d0365..59837540fec2 100644
> --- a/drivers/nvme/mdev/io.c
> +++ b/drivers/nvme/mdev/io.c
> @@ -11,14 +11,16 @@
> #include <linux/ktime.h>
> #include "priv.h"
>
> +
> struct io_ctx {
> struct nvme_mdev_hctrl *hctrl;
> struct nvme_mdev_vctrl *vctrl;
>
> const struct nvme_command *in;
> - struct nvme_command out;
> struct nvme_mdev_vns *ns;
> struct nvme_ext_data_iter udatait;
> +
> + struct nvme_command out;
> struct nvme_ext_data_iter *kdatait;
>
> ktime_t last_io_t;
> @@ -28,6 +30,20 @@ struct io_ctx {
> unsigned int arb_burst;
> };
>
> +/* Check if we need to read a command from the admin queue */
> +static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
> +{
> + if (!timeout(ctx->last_admin_poll_time,
> + ctx->vctrl->now, ctx->admin_poll_rate_ms))
> + return false;
> +
> + if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
> + return true;
> +
> + ctx->last_admin_poll_time = ctx->vctrl->now;
> + return false;
> +}
> +
> /* Handle read/write command.*/
> static int nvme_mdev_io_translate_rw(struct io_ctx *ctx)
> {
> @@ -229,6 +245,7 @@ static int nvme_mdev_io_translate_cmd(struct io_ctx *ctx)
> }
> }
>
> +/* process a user submission queue */
> static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
> {
> struct nvme_vsq *vsq = &ctx->vctrl->vsqs[sqid];
> @@ -275,7 +292,13 @@ static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
> return true;
> }
>
> -/* process host replies to the passed through commands */
> +/* process a user completion queue */
> +static void nvme_mdev_io_process_cq(struct io_ctx *ctx, u16 cqid)
> +{
> + nvme_mdev_vcq_process(ctx->vctrl, cqid, true);
> +}
> +
> +/* process hardware completion queue */
> static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
> {
> int n, i;
> @@ -301,22 +324,9 @@ static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
> return n;
> }
>
> -/* Check if we need to read a command from the admin queue */
> -static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
> -{
> - if (!timeout(ctx->last_admin_poll_time,
> - ctx->vctrl->now, ctx->admin_poll_rate_ms))
> - return false;
> -
> - if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
> - return true;
> -
> - ctx->last_admin_poll_time = ctx->vctrl->now;
> - return false;
> -}
>
> /* do polling till one of events stops it */
> -static void nvme_mdev_io_maintask(struct io_ctx *ctx)
> +static void nvme_mdev_io_polling_loop(struct io_ctx *ctx)
> {
> struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
> u16 i, cqid, sqid, hsqcnt;
> @@ -353,7 +363,7 @@ static void nvme_mdev_io_maintask(struct io_ctx *ctx)
> /* process the completions from the guest*/
> cqid = 1;
> for_each_set_bit_from(cqid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
> - nvme_mdev_vcq_process(vctrl, cqid, true);
> + nvme_mdev_io_process_cq(ctx, cqid);
>
> /* process the completions from the hardware*/
> for (i = 0 ; i < hsqcnt ; i++)
> @@ -470,7 +480,7 @@ static int nvme_mdev_io_polling_thread(void *data)
> if (kthread_should_stop())
> break;
>
> - nvme_mdev_io_maintask(&ctx);
> + nvme_mdev_io_polling_loop(&ctx);
> }
>
> _DBG(ctx.vctrl, "IO: iothread stopped\n");
> diff --git a/drivers/nvme/mdev/mmio.c b/drivers/nvme/mdev/mmio.c
> index cf03c1f22f4c..a80962bf4a3d 100644
> --- a/drivers/nvme/mdev/mmio.c
> +++ b/drivers/nvme/mdev/mmio.c
> @@ -54,9 +54,6 @@ static const struct vm_operations_struct nvme_mdev_mmio_dbs_vm_ops = {
> bool nvme_mdev_mmio_db_check(struct nvme_mdev_vctrl *vctrl,
> u16 qid, u16 size, u16 db)
> {
> - if (get_current() != vctrl->iothread)
> - lockdep_assert_held(&vctrl->lock);
> -
> if (db < size)
> return true;
> if (qid == 0) {
>

2019-05-02 21:21:43

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v2 08/10] nvme/pci: implement the mdev external queue allocation interface

On Thu, 2019-05-02 at 15:12 -0600, Heitke, Kenneth wrote:
>
> On 5/2/2019 5:47 AM, Maxim Levitsky wrote:
> > Note that currently the number of hw queues reserved for mdev,
> > has to be pre determined on module load.
> >
> > (I used to allocate the queues dynamicaly on demand, but
> > recent changes to allocate polled/read queues made
> > this somewhat difficult, so I dropped this for now)
> >
> > Signed-off-by: Maxim Levitsky <[email protected]>
> > ---
> > drivers/nvme/host/pci.c | 375 ++++++++++++++++++++++++++++++++++++++-
> > drivers/nvme/mdev/host.c | 46 ++---
> > drivers/nvme/mdev/io.c | 46 +++--
> > drivers/nvme/mdev/mmio.c | 3 -
> > 4 files changed, 421 insertions(+), 49 deletions(-)
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 282f28c851c1..87507e710374 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -23,6 +23,7 @@
> > #include <linux/io-64-nonatomic-lo-hi.h>
> > #include <linux/sed-opal.h>
> > #include <linux/pci-p2pdma.h>
> > +#include "../mdev/mdev.h"
> >
> > #include "trace.h"
> > #include "nvme.h"
> > @@ -32,6 +33,7 @@
> >
> > #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
> >
> > +#define USE_SMALL_PRP_POOL(nprps) ((nprps) < (256 / 8))
> > /*
> > * These can be higher, but we need to ensure that any command doesn't
> > * require an sg allocation that needs more than a page of data.
> > @@ -83,12 +85,24 @@ static int poll_queues = 0;
> > module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644);
> > MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
> >
> > +static int mdev_queues;
> > +#ifdef CONFIG_NVME_MDEV
> > +module_param_cb(mdev_queues, &queue_count_ops, &mdev_queues, 0644);
> > +MODULE_PARM_DESC(mdev_queues, "Number of queues to use for mediated VFIO");
> > +#endif
> > +
> > struct nvme_dev;
> > struct nvme_queue;
> >
> > static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
> > static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
> >
> > +#ifdef CONFIG_NVME_MDEV
> > +static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid);
> > +#else
> > +static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid) {}
> > +#endif
> > +
> > /*
> > * Represents an NVM Express device. Each nvme_dev is a PCI function.
> > */
> > @@ -103,6 +117,7 @@ struct nvme_dev {
> > unsigned online_queues;
> > unsigned max_qid;
> > unsigned io_queues[HCTX_MAX_TYPES];
> > + unsigned int mdev_queues;
> > unsigned int num_vecs;
> > int q_depth;
> > u32 db_stride;
> > @@ -110,6 +125,7 @@ struct nvme_dev {
> > unsigned long bar_mapped_size;
> > struct work_struct remove_work;
> > struct mutex shutdown_lock;
> > + struct mutex ext_dev_lock;
> > bool subsystem;
> > u64 cmb_size;
> > bool cmb_use_sqes;
> > @@ -172,6 +188,16 @@ static inline struct nvme_dev *to_nvme_dev(struct
> > nvme_ctrl *ctrl)
> > return container_of(ctrl, struct nvme_dev, ctrl);
> > }
> >
> > +/* Simplified IO descriptor for MDEV use */
> > +struct nvme_ext_iod {
> > + struct list_head link;
> > + u32 user_tag;
> > + int nprps;
> > + struct nvme_ext_data_iter *saved_iter;
> > + dma_addr_t first_prplist_dma;
> > + __le64 *prpslists[NVME_MAX_SEGS];
> > +};
> > +
> > /*
> > * An NVM Express queue. Each device has at least two (one for admin
> > * commands and one for I/O commands).
> > @@ -196,15 +222,26 @@ struct nvme_queue {
> > u16 qid;
> > u8 cq_phase;
> > unsigned long flags;
> > +
> > #define NVMEQ_ENABLED 0
> > #define NVMEQ_SQ_CMB 1
> > #define NVMEQ_DELETE_ERROR 2
> > #define NVMEQ_POLLED 3
> > +#define NVMEQ_EXTERNAL 4
> > +
> > u32 *dbbuf_sq_db;
> > u32 *dbbuf_cq_db;
> > u32 *dbbuf_sq_ei;
> > u32 *dbbuf_cq_ei;
> > struct completion delete_done;
> > +
> > + /* queue passthrough for external use */
> > + struct {
> > + int inflight;
> > + struct nvme_ext_iod *iods;
> > + struct list_head free_iods;
> > + struct list_head used_iods;
> > + } ext;
> > };
> >
> > /*
> > @@ -255,7 +292,7 @@ static inline void _nvme_check_size(void)
> >
> > static unsigned int max_io_queues(void)
> > {
> > - return num_possible_cpus() + write_queues + poll_queues;
> > + return num_possible_cpus() + write_queues + poll_queues + mdev_queues;
> > }
> >
> > static unsigned int max_queue_count(void)
> > @@ -1066,6 +1103,7 @@ static irqreturn_t nvme_irq(int irq, void *data)
> > * the irq handler, even if that was on another CPU.
> > */
> > rmb();
> > +
> > if (nvmeq->cq_head != nvmeq->last_cq_head)
> > ret = IRQ_HANDLED;
> > nvme_process_cq(nvmeq, &start, &end, -1);
> > @@ -1553,7 +1591,11 @@ static void nvme_init_queue(struct nvme_queue *nvmeq,
> > u16 qid)
> > memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
> > nvme_dbbuf_init(dev, nvmeq, qid);
> > dev->online_queues++;
> > +
> > wmb(); /* ensure the first interrupt sees the initialization */
> > +
> > + if (test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
> > + nvme_ext_queue_reset(nvmeq->dev, qid);
> > }
> >
> > static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool
> > polled)
> > @@ -1759,7 +1801,7 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
> > }
> >
> > max = min(dev->max_qid, dev->ctrl.queue_count - 1);
> > - if (max != 1 && dev->io_queues[HCTX_TYPE_POLL]) {
> > + if (max != 1) {
> > rw_queues = dev->io_queues[HCTX_TYPE_DEFAULT] +
> > dev->io_queues[HCTX_TYPE_READ];
> > } else {
> > @@ -2095,14 +2137,23 @@ static int nvme_setup_irqs(struct nvme_dev *dev,
> > unsigned int nr_io_queues)
> > * Poll queues don't need interrupts, but we need at least one IO
> > * queue left over for non-polled IO.
> > */
> > - this_p_queues = poll_queues;
> > + this_p_queues = poll_queues + mdev_queues;
> > if (this_p_queues >= nr_io_queues) {
> > this_p_queues = nr_io_queues - 1;
> > irq_queues = 1;
> > } else {
> > irq_queues = nr_io_queues - this_p_queues + 1;
> > }
> > +
> > + if (mdev_queues > this_p_queues) {
> > + mdev_queues = this_p_queues;
> > + this_p_queues = 0;
> > + } else {
> > + this_p_queues -= mdev_queues;
> > + }
> > +
> > dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
> > + dev->mdev_queues = mdev_queues;
> >
> > /* Initialize for the single interrupt case */
> > dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
> > @@ -2170,7 +2221,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
> >
> > dev->num_vecs = result;
> > result = max(result - 1, 1);
> > - dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
> > + dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL] +
> > + dev->mdev_queues;
> >
> > /*
> > * Should investigate if there's a performance win from allocating
> > @@ -2193,10 +2245,11 @@ static int nvme_setup_io_queues(struct nvme_dev
> > *dev)
> > nvme_suspend_io_queues(dev);
> > goto retry;
> > }
> > - dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
> > + dev_info(dev->ctrl.device, "%d/%d/%d/%d default/read/poll/mdev
> > queues\n",
> > dev->io_queues[HCTX_TYPE_DEFAULT],
> > dev->io_queues[HCTX_TYPE_READ],
> > - dev->io_queues[HCTX_TYPE_POLL]);
> > + dev->io_queues[HCTX_TYPE_POLL],
> > + dev->mdev_queues);
> > return 0;
> > }
> >
> > @@ -2623,6 +2676,301 @@ static void nvme_remove_dead_ctrl_work(struct
> > work_struct *work)
> > nvme_put_ctrl(&dev->ctrl);
> > }
> >
> > +#ifdef CONFIG_NVME_MDEV
> > +static void nvme_ext_free_iod(struct nvme_dev *dev, struct nvme_ext_iod
> > *iod)
> > +{
> > + int i = 0, max_prp, nprps = iod->nprps;
> > + dma_addr_t dma = iod->first_prplist_dma;
> > +
> > + if (iod->saved_iter) {
> > + iod->saved_iter->release(iod->saved_iter);
> > + iod->saved_iter = NULL;
> > + }
> > +
> > + if (--nprps < 2) {
> > + goto out;
> > + } else if (USE_SMALL_PRP_POOL(nprps)) {
> > + dma_pool_free(dev->prp_small_pool, iod->prpslists[0], dma);
> > + goto out;
> > + }
> > +
> > + max_prp = (dev->ctrl.page_size >> 3) - 1;
> > + while (nprps > 0) {
> > + if (i > 0) {
> > + dma = iod->prpslists[i - 1][max_prp];
> > + if (nprps == 1)
> > + break;
> > + }
> > + dma_pool_free(dev->prp_page_pool, iod->prpslists[i++], dma);
> > + nprps -= max_prp;
> > + }
> > +out:
> > + iod->nprps = -1;
> > + iod->first_prplist_dma = 0;
> > + iod->user_tag = 0xDEADDEAD;
> > +}
> > +
> > +static int nvme_ext_setup_iod(struct nvme_dev *dev, struct nvme_ext_iod
> > *iod,
> > + struct nvme_common_command *cmd,
> > + struct nvme_ext_data_iter *iter)
> > +{
> > + int ret, i, j;
> > + __le64 *prp_list;
> > + dma_addr_t prp_dma;
> > + struct dma_pool *pool;
> > + int max_prp = (dev->ctrl.page_size >> 3) - 1;
> > +
> > + iod->saved_iter = iter && iter->release ? iter : NULL;
> > + iod->nprps = iter ? iter->count : 0;
> > + cmd->dptr.prp1 = 0;
> > + cmd->dptr.prp2 = 0;
> > + cmd->metadata = 0;
> > +
> > + if (!iter)
> > + return 0;
> > +
> > + /* put first pointer*/
> > + cmd->dptr.prp1 = cpu_to_le64(iter->host_iova);
> > + if (iter->count == 1)
> > + return 0;
> > +
> > + ret = iter->next(iter);
> > + if (ret)
> > + goto error;
> > +
> > + /* if only have one more pointer, put it to second data pointer*/
> > + if (iter->count == 1) {
> > + cmd->dptr.prp2 = cpu_to_le64(iter->host_iova);
> > + return 0;
> > + }
> > +
> > + pool = USE_SMALL_PRP_POOL(iter->count) ? dev->prp_small_pool :
> > + dev->prp_page_pool;
> > +
> > + /* Allocate prp lists as needed and fill them */
> > + for (i = 0 ; i < NVME_MAX_SEGS && iter->count ; i++) {
> > + prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
> > + if (!prp_list) {
> > + ret = -ENOMEM;
> > + goto error;
> > + }
> > +
> > + iod->prpslists[i++] = prp_list;
> > +
> > + if (i == 1) {
> > + iod->first_prplist_dma = prp_dma;
> > + cmd->dptr.prp2 = cpu_to_le64(prp_dma);
> > + j = 0;
> > + } else {
> > + prp_list[0] = iod->prpslists[i - 1][max_prp];
> > + iod->prpslists[i - 1][max_prp] = prp_dma;
> > + j = 1;
> > + }
> > +
> > + while (j <= max_prp && iter->count) {
> > + prp_list[j++] = iter->host_iova;
> > + ret = iter->next(iter);
> > + if (ret)
> > + goto error;
> > + }
> > + }
> > +
> > + if (iter->count) {
> > + ret = -ENOSPC;
> > + goto error;
> > + }
> > + return 0;
> > +error:
> > + iod->nprps -= iter->count;
> > + nvme_ext_free_iod(dev, iod);
> > + return ret;
> > +}
> > +
> > +static int nvme_ext_queues_available(struct nvme_ctrl *ctrl)
> > +{
> > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > + unsigned int ret = 0, qid;
> > + unsigned int first_mdev_q = dev->online_queues - dev->mdev_queues;
> > +
> > + for (qid = first_mdev_q; qid < dev->online_queues; qid++) {
> > + struct nvme_queue *nvmeq = &dev->queues[qid];
> > +
> > + if (!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
> > + ret++;
> > + }
> > + return ret;
> > +}
> > +
> > +static void nvme_ext_queue_reset(struct nvme_dev *dev, u16 qid)
> > +{
> > + struct nvme_queue *nvmeq = &dev->queues[qid];
> > + struct nvme_ext_iod *iod, *tmp;
> > +
> > + list_for_each_entry_safe(iod, tmp, &nvmeq->ext.used_iods, link) {
> > + if (iod->saved_iter && iod->saved_iter->release) {
> > + iod->saved_iter->release(iod->saved_iter);
> > + iod->saved_iter = NULL;
> > + list_move(&iod->link, &nvmeq->ext.free_iods);
> > + }
> > + }
> > +
> > + nvmeq->ext.inflight = 0;
> > +}
> > +
> > +static int nvme_ext_queue_alloc(struct nvme_ctrl *ctrl, u16 *ret_qid)
> > +{
> > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > + struct nvme_queue *nvmeq;
> > + int ret = 0, qid, i;
> > + unsigned int first_mdev_q = dev->online_queues - dev->mdev_queues;
> > +
> > + mutex_lock(&dev->ext_dev_lock);
> > +
> > + /* find a polled queue to allocate */
> > + for (qid = dev->online_queues - 1 ; qid >= first_mdev_q ; qid--) {
> > + nvmeq = &dev->queues[qid];
> > + if (!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags))
> > + break;
> > + }
> > +
> > + if (qid < first_mdev_q) {
> > + ret = -ENOSPC;
> > + goto out;
> > + }
> > +
> > + INIT_LIST_HEAD(&nvmeq->ext.free_iods);
> > + INIT_LIST_HEAD(&nvmeq->ext.used_iods);
> > +
> > + nvmeq->ext.iods =
> > + vzalloc_node(sizeof(struct nvme_ext_iod) * nvmeq->q_depth,
> > + dev_to_node(dev->dev));
> > +
> > + if (!nvmeq->ext.iods) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + for (i = 0 ; i < nvmeq->q_depth ; i++)
> > + list_add_tail(&nvmeq->ext.iods[i].link, &nvmeq->ext.free_iods);
> > +
> > + set_bit(NVMEQ_EXTERNAL, &nvmeq->flags);
> > + *ret_qid = qid;
> > +out:
> > + mutex_unlock(&dev->ext_dev_lock);
> > + return ret;
> > +}
> > +
> > +static void nvme_ext_queue_free(struct nvme_ctrl *ctrl, u16 qid)
> > +{
> > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > + struct nvme_queue *nvmeq;
> > +
> > + mutex_lock(&dev->ext_dev_lock);
> > + nvmeq = &dev->queues[qid];
> > +
> > + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> > + return;
>
> This condition is probably not expected to happen (since its a warning)
> but do you need to unlock the ext_dev_lock before returning?

This is true, I will fix this. This used to be BUG_ON, but due to checkpatch.pl
complains I turned them all to WARN_ON, and missed this.

Thanks,
Best regards,
Maxim Levitsky

>
> > +
> > + nvme_ext_queue_reset(dev, qid);
> > +
> > + vfree(nvmeq->ext.iods);
> > + nvmeq->ext.iods = NULL;
> > + INIT_LIST_HEAD(&nvmeq->ext.free_iods);
> > + INIT_LIST_HEAD(&nvmeq->ext.used_iods);
> > +
> > + clear_bit(NVMEQ_EXTERNAL, &nvmeq->flags);
> > + mutex_unlock(&dev->ext_dev_lock);
> > +}
> > +
> > +static int nvme_ext_queue_submit(struct nvme_ctrl *ctrl, u16 qid, u32
> > user_tag,
> > + struct nvme_command *command,
> > + struct nvme_ext_data_iter *iter)
> > +{
> > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > + struct nvme_queue *nvmeq = &dev->queues[qid];
> > + struct nvme_ext_iod *iod;
> > + int ret;
> > +
> > + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> > + return -EINVAL;
> > +
> > + if (list_empty(&nvmeq->ext.free_iods))
> > + return -1;
> > +
> > + iod = list_first_entry(&nvmeq->ext.free_iods,
> > + struct nvme_ext_iod, link);
> > +
> > + list_move(&iod->link, &nvmeq->ext.used_iods);
> > +
> > + command->common.command_id = cpu_to_le16(iod - nvmeq->ext.iods);
> > + iod->user_tag = user_tag;
> > +
> > + ret = nvme_ext_setup_iod(dev, iod, &command->common, iter);
> > + if (ret) {
> > + list_move(&iod->link, &nvmeq->ext.free_iods);
> > + return ret;
> > + }
> > +
> > + nvmeq->ext.inflight++;
> > + nvme_submit_cmd(nvmeq, command, true);
> > + return 0;
> > +}
> > +
> > +static int nvme_ext_queue_poll(struct nvme_ctrl *ctrl, u16 qid,
> > + struct nvme_ext_cmd_result *results,
> > + unsigned int max_len)
> > +{
> > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > + struct nvme_queue *nvmeq = &dev->queues[qid];
> > + u16 old_head;
> > + int i, j;
> > +
> > + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> > + return -EINVAL;
> > +
> > + if (nvmeq->ext.inflight == 0)
> > + return -1;
> > +
> > + old_head = nvmeq->cq_head;
> > +
> > + for (i = 0 ; nvme_cqe_pending(nvmeq) && i < max_len ; i++) {
> > + u16 status = le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].status);
> > + u16 tag = le16_to_cpu(nvmeq->cqes[nvmeq->cq_head].command_id);
> > +
> > + results[i].status = status >> 1;
> > + results[i].tag = (u32)tag;
> > + nvme_update_cq_head(nvmeq);
> > + }
> > +
> > + if (old_head != nvmeq->cq_head)
> > + nvme_ring_cq_doorbell(nvmeq);
> > +
> > + for (j = 0 ; j < i ; j++) {
> > + u16 tag = results[j].tag & 0xFFFF;
> > + struct nvme_ext_iod *iod = &nvmeq->ext.iods[tag];
> > +
> > + if (WARN_ON(tag >= nvmeq->q_depth || iod->nprps == -1))
> > + continue;
> > +
> > + results[j].tag = iod->user_tag;
> > + nvme_ext_free_iod(dev, iod);
> > + list_move(&iod->link, &nvmeq->ext.free_iods);
> > + nvmeq->ext.inflight--;
> > + }
> > +
> > + WARN_ON(nvmeq->ext.inflight < 0);
> > + return i;
> > +}
> > +
> > +static bool nvme_ext_queue_full(struct nvme_ctrl *ctrl, u16 qid)
> > +{
> > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > + struct nvme_queue *nvmeq = &dev->queues[qid];
> > +
> > + return nvmeq->ext.inflight < nvmeq->q_depth - 1;
> > +}
> > +#endif
> > +
> > static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
> > {
> > *val = readl(to_nvme_dev(ctrl)->bar + off);
> > @@ -2652,13 +3000,25 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops
> > = {
> > .name = "pcie",
> > .module = THIS_MODULE,
> > .flags = NVME_F_METADATA_SUPPORTED |
> > - NVME_F_PCI_P2PDMA,
> > + NVME_F_PCI_P2PDMA |
> > + NVME_F_MDEV_SUPPORTED |
> > + NVME_F_MDEV_DMA_SUPPORTED,
> > +
> > .reg_read32 = nvme_pci_reg_read32,
> > .reg_write32 = nvme_pci_reg_write32,
> > .reg_read64 = nvme_pci_reg_read64,
> > .free_ctrl = nvme_pci_free_ctrl,
> > .submit_async_event = nvme_pci_submit_async_event,
> > .get_address = nvme_pci_get_address,
> > +
> > +#ifdef CONFIG_NVME_MDEV
> > + .ext_queues_available = nvme_ext_queues_available,
> > + .ext_queue_alloc = nvme_ext_queue_alloc,
> > + .ext_queue_free = nvme_ext_queue_free,
> > + .ext_queue_submit = nvme_ext_queue_submit,
> > + .ext_queue_poll = nvme_ext_queue_poll,
> > + .ext_queue_full = nvme_ext_queue_full,
> > +#endif
> > };
> >
> > static int nvme_dev_map(struct nvme_dev *dev)
> > @@ -2747,6 +3107,7 @@ static int nvme_probe(struct pci_dev *pdev, const
> > struct pci_device_id *id)
> > INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
> > INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
> > mutex_init(&dev->shutdown_lock);
> > + mutex_init(&dev->ext_dev_lock);
> >
> > result = nvme_setup_prp_pools(dev);
> > if (result)
> > diff --git a/drivers/nvme/mdev/host.c b/drivers/nvme/mdev/host.c
> > index 5766bad7e909..6590946b86c2 100644
> > --- a/drivers/nvme/mdev/host.c
> > +++ b/drivers/nvme/mdev/host.c
> > @@ -48,19 +48,21 @@ static struct nvme_mdev_hctrl
> > *nvme_mdev_hctrl_create(struct nvme_ctrl *ctrl)
> > return NULL;
> > }
> >
> > + hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
> > + dev_to_node(ctrl->dev));
> > + if (!hctrl)
> > + return NULL;
> > +
> > nr_host_queues = ctrl->ops->ext_queues_available(ctrl);
> > max_lba_transfer = ctrl->max_hw_sectors >> (PAGE_SHIFT - 9);
> >
> > if (nr_host_queues == 0) {
> > dev_info(ctrl->dev,
> > "no support for mdev - no mdev reserved queues
> > available");
> > + kfree(hctrl);
> > return NULL;
> > }
> >
> > - hctrl = kzalloc_node(sizeof(*hctrl), GFP_KERNEL,
> > - dev_to_node(ctrl->dev));
> > - if (!hctrl)
> > - return NULL;
> >
> > kref_init(&hctrl->ref);
> > mutex_init(&hctrl->lock);
> > @@ -180,6 +182,24 @@ void nvme_mdev_hctrl_hqs_unreserve(struct
> > nvme_mdev_hctrl *hctrl,
> > mutex_unlock(&hctrl->lock);
> > }
> >
> > +/* Check if IO passthrough is supported for given IO optcode */
> > +bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
> > +{
> > + switch (optcode) {
> > + case nvme_cmd_flush:
> > + case nvme_cmd_read:
> > + case nvme_cmd_write:
> > + /* these are mandatory*/
> > + return true;
> > + case nvme_cmd_write_zeroes:
> > + return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
> > + case nvme_cmd_dsm:
> > + return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
> > + default:
> > + return false;
> > + }
> > +}
> > +
> > /* Allocate a host IO queue */
> > int nvme_mdev_hctrl_hq_alloc(struct nvme_mdev_hctrl *hctrl)
> > {
> > @@ -204,23 +224,7 @@ bool nvme_mdev_hctrl_hq_can_submit(struct
> > nvme_mdev_hctrl *hctrl, u16 qid)
> > return hctrl->nvme_ctrl->ops->ext_queue_full(hctrl->nvme_ctrl, qid);
> > }
> >
> > -/* Check if IO passthrough is supported for given IO optcode */
> > -bool nvme_mdev_hctrl_hq_check_op(struct nvme_mdev_hctrl *hctrl, u8 optcode)
> > -{
> > - switch (optcode) {
> > - case nvme_cmd_flush:
> > - case nvme_cmd_read:
> > - case nvme_cmd_write:
> > - /* these are mandatory*/
> > - return true;
> > - case nvme_cmd_write_zeroes:
> > - return (hctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES);
> > - case nvme_cmd_dsm:
> > - return (hctrl->oncs & NVME_CTRL_ONCS_DSM);
> > - default:
> > - return false;
> > - }
> > -}
> > +
> >
> > /* Submit a IO passthrough command */
> > int nvme_mdev_hctrl_hq_submit(struct nvme_mdev_hctrl *hctrl,
> > diff --git a/drivers/nvme/mdev/io.c b/drivers/nvme/mdev/io.c
> > index a731196d0365..59837540fec2 100644
> > --- a/drivers/nvme/mdev/io.c
> > +++ b/drivers/nvme/mdev/io.c
> > @@ -11,14 +11,16 @@
> > #include <linux/ktime.h>
> > #include "priv.h"
> >
> > +
> > struct io_ctx {
> > struct nvme_mdev_hctrl *hctrl;
> > struct nvme_mdev_vctrl *vctrl;
> >
> > const struct nvme_command *in;
> > - struct nvme_command out;
> > struct nvme_mdev_vns *ns;
> > struct nvme_ext_data_iter udatait;
> > +
> > + struct nvme_command out;
> > struct nvme_ext_data_iter *kdatait;
> >
> > ktime_t last_io_t;
> > @@ -28,6 +30,20 @@ struct io_ctx {
> > unsigned int arb_burst;
> > };
> >
> > +/* Check if we need to read a command from the admin queue */
> > +static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
> > +{
> > + if (!timeout(ctx->last_admin_poll_time,
> > + ctx->vctrl->now, ctx->admin_poll_rate_ms))
> > + return false;
> > +
> > + if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
> > + return true;
> > +
> > + ctx->last_admin_poll_time = ctx->vctrl->now;
> > + return false;
> > +}
> > +
> > /* Handle read/write command.*/
> > static int nvme_mdev_io_translate_rw(struct io_ctx *ctx)
> > {
> > @@ -229,6 +245,7 @@ static int nvme_mdev_io_translate_cmd(struct io_ctx
> > *ctx)
> > }
> > }
> >
> > +/* process a user submission queue */
> > static bool nvme_mdev_io_process_sq(struct io_ctx *ctx, u16 sqid)
> > {
> > struct nvme_vsq *vsq = &ctx->vctrl->vsqs[sqid];
> > @@ -275,7 +292,13 @@ static bool nvme_mdev_io_process_sq(struct io_ctx *ctx,
> > u16 sqid)
> > return true;
> > }
> >
> > -/* process host replies to the passed through commands */
> > +/* process a user completion queue */
> > +static void nvme_mdev_io_process_cq(struct io_ctx *ctx, u16 cqid)
> > +{
> > + nvme_mdev_vcq_process(ctx->vctrl, cqid, true);
> > +}
> > +
> > +/* process hardware completion queue */
> > static int nvme_mdev_io_process_hwq(struct io_ctx *ctx, u16 hwq)
> > {
> > int n, i;
> > @@ -301,22 +324,9 @@ static int nvme_mdev_io_process_hwq(struct io_ctx *ctx,
> > u16 hwq)
> > return n;
> > }
> >
> > -/* Check if we need to read a command from the admin queue */
> > -static bool nvme_mdev_adm_needs_processing(struct io_ctx *ctx)
> > -{
> > - if (!timeout(ctx->last_admin_poll_time,
> > - ctx->vctrl->now, ctx->admin_poll_rate_ms))
> > - return false;
> > -
> > - if (nvme_mdev_vsq_has_data(ctx->vctrl, &ctx->vctrl->vsqs[0]))
> > - return true;
> > -
> > - ctx->last_admin_poll_time = ctx->vctrl->now;
> > - return false;
> > -}
> >
> > /* do polling till one of events stops it */
> > -static void nvme_mdev_io_maintask(struct io_ctx *ctx)
> > +static void nvme_mdev_io_polling_loop(struct io_ctx *ctx)
> > {
> > struct nvme_mdev_vctrl *vctrl = ctx->vctrl;
> > u16 i, cqid, sqid, hsqcnt;
> > @@ -353,7 +363,7 @@ static void nvme_mdev_io_maintask(struct io_ctx *ctx)
> > /* process the completions from the guest*/
> > cqid = 1;
> > for_each_set_bit_from(cqid, vctrl->vcq_en, MAX_VIRTUAL_QUEUES)
> > - nvme_mdev_vcq_process(vctrl, cqid, true);
> > + nvme_mdev_io_process_cq(ctx, cqid);
> >
> > /* process the completions from the hardware*/
> > for (i = 0 ; i < hsqcnt ; i++)
> > @@ -470,7 +480,7 @@ static int nvme_mdev_io_polling_thread(void *data)
> > if (kthread_should_stop())
> > break;
> >
> > - nvme_mdev_io_maintask(&ctx);
> > + nvme_mdev_io_polling_loop(&ctx);
> > }
> >
> > _DBG(ctx.vctrl, "IO: iothread stopped\n");
> > diff --git a/drivers/nvme/mdev/mmio.c b/drivers/nvme/mdev/mmio.c
> > index cf03c1f22f4c..a80962bf4a3d 100644
> > --- a/drivers/nvme/mdev/mmio.c
> > +++ b/drivers/nvme/mdev/mmio.c
> > @@ -54,9 +54,6 @@ static const struct vm_operations_struct
> > nvme_mdev_mmio_dbs_vm_ops = {
> > bool nvme_mdev_mmio_db_check(struct nvme_mdev_vctrl *vctrl,
> > u16 qid, u16 size, u16 db)
> > {
> > - if (get_current() != vctrl->iothread)
> > - lockdep_assert_held(&vctrl->lock);
> > -
> > if (db < size)
> > return true;
> > if (qid == 0) {
> >

2019-05-03 12:16:07

[permalink] [raw]

Subject: Re: [PATCH v2 08/10] nvme/pci: implement the mdev external queue allocation interface

On Fri, May 03, 2019 at 12:20:17AM +0300, Maxim Levitsky wrote:
> On Thu, 2019-05-02 at 15:12 -0600, Heitke, Kenneth wrote:
> > On 5/2/2019 5:47 AM, Maxim Levitsky wrote:
> > > +static void nvme_ext_queue_free(struct nvme_ctrl *ctrl, u16 qid)
> > > +{
> > > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > > + struct nvme_queue *nvmeq;
> > > +
> > > + mutex_lock(&dev->ext_dev_lock);
> > > + nvmeq = &dev->queues[qid];
> > > +
> > > + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> > > + return;
> >
> > This condition is probably not expected to happen (since its a warning)
> > but do you need to unlock the ext_dev_lock before returning?
>
> This is true, I will fix this. This used to be BUG_ON, but due to checkpatch.pl
> complains I turned them all to WARN_ON, and missed this.

Gentle reminder to trim your replies to the relevant context. It's
much easier to read when we don't need to scroll through hundreds of
unnecessary lines.

2019-05-03 12:30:26

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

On Thu, May 02, 2019 at 02:47:57PM +0300, Maxim Levitsky wrote:
> If the mdev device driver also sets the
> NVME_F_MDEV_DMA_SUPPORTED, the mdev core will
> dma map all the guest memory into the nvme device,
> so that nvme device driver can use dma addresses as passed
> from the mdev core driver

We really need a proper block layer interface for that so that
uring or the nvme target can use pre-mapping as well.

2019-05-03 12:47:14

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

I simply don't get the point of this series.

MDEV is an interface for exposing parts of a device to a userspace
program / VM. But that this series appears to do is to expose a
purely software defined nvme controller to userspace. Which in
principle is a good idea, but we have a much better framework for that,
which is called vhost.

2019-05-03 20:37:21

by Max Gurtovoy

[permalink] [raw]

Subject: Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

On 5/3/2019 3:29 PM, Christoph Hellwig wrote:
> On Thu, May 02, 2019 at 02:47:57PM +0300, Maxim Levitsky wrote:
>> If the mdev device driver also sets the
>> NVME_F_MDEV_DMA_SUPPORTED, the mdev core will
>> dma map all the guest memory into the nvme device,
>> so that nvme device driver can use dma addresses as passed
>> from the mdev core driver
> We really need a proper block layer interface for that so that
> uring or the nvme target can use pre-mapping as well.

I think we can also find a way to use nvme-mdev for the target offload
p2p feature.

Don't see a big difference of taking NVMe queue and namespace/partition
to guest OS or to P2P since IO is issued by external entity and pooled
outside the pci driver.

thoughts ?

>
> _______________________________________________
> Linux-nvme mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-nvme

2019-05-04 06:52:14

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

On Fri, May 03, 2019 at 10:00:54PM +0300, Max Gurtovoy wrote:
> Don't see a big difference of taking NVMe queue and namespace/partition to
> guest OS or to P2P since IO is issued by external entity and pooled outside
> the pci driver.

We are not going to the queue aside either way.. That is where the
last patch in this series is already working to, and which would be
the sensible vhost model to start with.

2019-05-06 07:58:59

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v2 08/10] nvme/pci: implement the mdev external queue allocation interface

On Fri, 2019-05-03 at 06:09 -0600, Keith Busch wrote:
> On Fri, May 03, 2019 at 12:20:17AM +0300, Maxim Levitsky wrote:
> > On Thu, 2019-05-02 at 15:12 -0600, Heitke, Kenneth wrote:
> > > On 5/2/2019 5:47 AM, Maxim Levitsky wrote:
> > > > +static void nvme_ext_queue_free(struct nvme_ctrl *ctrl, u16 qid)
> > > > +{
> > > > + struct nvme_dev *dev = to_nvme_dev(ctrl);
> > > > + struct nvme_queue *nvmeq;
> > > > +
> > > > + mutex_lock(&dev->ext_dev_lock);
> > > > + nvmeq = &dev->queues[qid];
> > > > +
> > > > + if (WARN_ON(!test_bit(NVMEQ_EXTERNAL, &nvmeq->flags)))
> > > > + return;
> > >
> > > This condition is probably not expected to happen (since its a warning)
> > > but do you need to unlock the ext_dev_lock before returning?
> >
> > This is true, I will fix this. This used to be BUG_ON, but due to
> > checkpatch.pl
> > complains I turned them all to WARN_ON, and missed this.
>
> Gentle reminder to trim your replies to the relevant context. It's
> much easier to read when we don't need to scroll through hundreds of
> unnecessary lines.

I fully agree, sorry! Next time I will do this.
Best regards,
Maxim Levitsky

2019-05-06 08:32:51

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

On Sat, 2019-05-04 at 08:49 +0200, Christoph Hellwig wrote:
> On Fri, May 03, 2019 at 10:00:54PM +0300, Max Gurtovoy wrote:
> > Don't see a big difference of taking NVMe queue and namespace/partition to
> > guest OS or to P2P since IO is issued by external entity and pooled outside
> > the pci driver.
>
> We are not going to the queue aside either way.. That is where the
> last patch in this series is already working to, and which would be
> the sensible vhost model to start with.

Why are you saying that? I actualy prefer to use a sepearate queue per software
nvme controller, tat because of lower overhead (about half than going through
the block layer) and it better at QoS as the separate queue (or even few queues
if needed) will give the guest a mostly guaranteed slice of the bandwidth of the
device.

The only drawback of this is some code duplication but that can be worked on
with some changes in the block layer.

The last patch in my series was done with 2 purposes in mind which are to
measure the overhead, and to maybe utilize that as a failback to non nvme
devices.

Best regards,
Maxim Levitsky

2019-05-06 08:36:44

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

On Mon, 2019-05-06 at 11:31 +0300, Maxim Levitsky wrote:
> On Sat, 2019-05-04 at 08:49 +0200, Christoph Hellwig wrote:
> > On Fri, May 03, 2019 at 10:00:54PM +0300, Max Gurtovoy wrote:
> > > Don't see a big difference of taking NVMe queue and namespace/partition
> > > to
> > > guest OS or to P2P since IO is issued by external entity and pooled
> > > outside
> > > the pci driver.
> >
> > We are not going to the queue aside either way.. That is where the
> > last patch in this series is already working to, and which would be
> > the sensible vhost model to start with.
>
> Why are you saying that? I actualy prefer to use a sepearate queue per
> software
> nvme controller, tat because of lower overhead (about half than going through
> the block layer) and it better at QoS as the separate queue (or even few
> queues
> if needed) will give the guest a mostly guaranteed slice of the bandwidth of
> the
> device.

Sorry for typos - I need more coffee :-)

Best regards,
Maxim Levitsky

2019-05-06 09:06:11

by Maxim Levitsky

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

On Fri, 2019-05-03 at 14:18 +0200, Christoph Hellwig wrote:
> I simply don't get the point of this series.
>
> MDEV is an interface for exposing parts of a device to a userspace
> program / VM. But that this series appears to do is to expose a
> purely software defined nvme controller to userspace. Which in
> principle is a good idea, but we have a much better framework for that,
> which is called vhost.

Let me explain the reasons for choosing the IO interfaces as I did:

1. Frontend interface (the interface that faces the guest/userspace/etc):

VFIO/mdev is just way to expose a (partially) software defined PCIe device to a
guest.

Vhost on the other hand is an interface that is hardcoded and optimized for
virtio. It can be extended to be pci generic, but why to do so if we already
have VFIO.

So the biggest advantage of using VFIO _currently_ is that I don't add any new
API/ABI to the kernel, and neither the userspace (qemu) needs to learn to use a
new API.

It also worth noting that VFIO supports nesting out of box, so I don't need to
worry about it (vhost has to deal with that on the protocol level using its
IOTLB facility).

On top of that, it is expected that newer hardware will support the PASID based
device subdivision, which will allow us to _directly_ pass through the
submission queues of the device and _force_ us to use the NVME protocol for the
frontend.

2. Backend interface (the connection to the real nvme device):

Currently the backend interface _doesn't have_ to allocate a dedicated queue and
bypass the block layer. It can use the block submit_bio/blk_poll as I
demonstrate in the last patch in the series. Its 2x slower though.

However, similar to the (1), when the driver will support the devices with
hardware based passthrough, it will have to dedicate a bunch of queues to the
guest, configure them with the appropriate PASID, and then let the guest use
these queues directly.

Best regards,
Maxim Levitsky

2019-05-06 12:59:45

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> 1. Frontend interface (the interface that faces the guest/userspace/etc):
>
> VFIO/mdev is just way to expose a (partially) software defined PCIe device to a
> guest.
>
> Vhost on the other hand is an interface that is hardcoded and optimized for
> virtio. It can be extended to be pci generic, but why to do so if we already
> have VFIO.

I wouldn't say vhost is virtio specific. At least Hanne's vhost-nvme
doesn't get impacted by that a whole lot.

> 2. Backend interface (the connection to the real nvme device):
>
> Currently the backend interface _doesn't have_ to allocate a dedicated queue and
> bypass the block layer. It can use the block submit_bio/blk_poll as I
> demonstrate in the last patch in the series. Its 2x slower though.
>
> However, similar to the (1), when the driver will support the devices with
> hardware based passthrough, it will have to dedicate a bunch of queues to the
> guest, configure them with the appropriate PASID, and then let the guest useA
> these queues directly.

We will not let you abuse the nvme queues for anything else. We had
that discussion with the mellanox offload and it not only unsafe but
also adds way to much crap to the core nvme code for corner cases.

Or to put it into another way: unless your paravirt interface requires
zero specific changes to the core nvme code it is not acceptable at all.

2019-05-06 13:00:49

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v2 06/10] nvme/core: add mdev interfaces

On Mon, May 06, 2019 at 11:31:27AM +0300, Maxim Levitsky wrote:
> Why are you saying that? I actualy prefer to use a sepearate queue per software
> nvme controller, tat because of lower overhead (about half than going through
> the block layer) and it better at QoS as the separate queue (or even few queues
> if needed) will give the guest a mostly guaranteed slice of the bandwidth of the
> device.

The downside is that your create tons of crap code in the core nvme driver
for your specific use case that no one cares about. Which is why it is
completely unacceptable. If you want things to go fast make the block
layer go fast, don't add your very special bypass.

2019-05-06 16:51:33

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

On Mon, May 06, 2019 at 05:57:52AM -0700, Christoph Hellwig wrote:
> > However, similar to the (1), when the driver will support the devices with
> > hardware based passthrough, it will have to dedicate a bunch of queues to the
> > guest, configure them with the appropriate PASID, and then let the guest useA
> > these queues directly.
>
> We will not let you abuse the nvme queues for anything else. We had
> that discussion with the mellanox offload and it not only unsafe but
> also adds way to much crap to the core nvme code for corner cases.
>
> Or to put it into another way: unless your paravirt interface requires
> zero specific changes to the core nvme code it is not acceptable at all.

I agree we shouldn't specialize generic queues for this, but I think
it is worth revisiting driver support for assignable hardware resources
iff the specification defines it.

Until then, you can always steer processes to different queues by
assigning them to different CPUs.

2019-05-08 13:25:16

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

On 06/05/19 07:57, Christoph Hellwig wrote:
>
> Or to put it into another way: unless your paravirt interface requires
> zero specific changes to the core nvme code it is not acceptable at all.

I'm not sure it's possible to attain that goal, however I agree that
putting the control plane in the kernel is probably not a good idea, so
the vhost model is better than mdev for this usecase.

In addition, unless it is possible for the driver to pass the queue
directly to the guests, there probably isn't much advantage in putting
the driver in the kernel at all. Maxim, do you have numbers for 1) QEMU
with aio 2) QEMU with VFIO-based userspace nvme driver 3) nvme-mdev?

Paolo

2019-05-09 09:15:15

by Stefan Hajnoczi

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> On top of that, it is expected that newer hardware will support the PASID based
> device subdivision, which will allow us to _directly_ pass through the
> submission queues of the device and _force_ us to use the NVME protocol for the
> frontend.

I don't understand the PASID argument. The data path will be 100%
passthrough and this driver won't be necessary.

In the meantime there is already SPDK for users who want polling. This
driver's main feature is that the host can still access the device at
the same time as VMs, but I'm not sure that's useful in
performance-critical use cases and for non-performance use cases this
driver isn't necessary.

Stefan

Attachments:

(No filename) (746.00 B)
signature.asc (499.00 B)
Download all attachments

2019-05-09 13:55:58

[permalink] [raw]

Subject: Re: [PATCH v2 00/10] RFC: NVME MDEV

On Thu, May 09, 2019 at 02:12:55AM -0700, Stefan Hajnoczi wrote:
> On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> > On top of that, it is expected that newer hardware will support the PASID based
> > device subdivision, which will allow us to _directly_ pass through the
> > submission queues of the device and _force_ us to use the NVME protocol for the
> > frontend.
>
> I don't understand the PASID argument. The data path will be 100%
> passthrough and this driver won't be necessary.

We still need a non-passthrough component to handle slow path,
non-doorbell controller registers and admin queue. That doesn't
necessarily need to be a kernel driver, though.