This patch series implements "fake DAX": emulated persistent
memory (nvdimm) in the guest, which allows the guest page cache
to be bypassed. It also implements a VIRTIO-based asynchronous
flush mechanism.
The guest driver and QEMU device changes are shared as separate
patch sets for easier review; they have been tested together.
Details of the project idea for the 'fake DAX' flushing interface
are shared in [2] & [3].
The implementation is divided into two parts: a new virtio-pmem
guest driver, and QEMU changes for the new virtio-pmem
paravirtualized device.
1. Guest virtio-pmem kernel driver
---------------------------------
- Reads the persistent memory range from the paravirtualized
  device and registers it with 'nvdimm_bus'.
- The existing 'nvdimm/pmem' driver uses this information to
  allocate the persistent memory region and set up filesystem
  operations on the allocated memory.
- The virtio-pmem driver implements an asynchronous flushing
  interface to flush from guest to host (see the condensed sketch
  after this list).
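For orientation, a minimal sketch of the guest-side flush hook,
condensed from patches 2/3 and 3/3 below:

    /* pmem driver (patch 2/3): flush goes through the per-region callback */
    if (bio->bi_opf & REQ_PREFLUSH)
            bio->bi_status = nd_region->flush(nd_region);

    /* virtio-pmem driver (patch 3/3): points that callback at its
     * VIRTIO-based flush when creating the region */
    ndr_desc.flush = virtio_pmem_flush;
    nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);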
2. Qemu virtio-pmem device
---------------------------------
- Creates a virtio-pmem device and exposes a memory range to the
  KVM guest.
- On the host side this is file-backed memory which acts as
  persistent memory.
- The QEMU-side flush uses the AIO thread pool APIs and virtio for
  asynchronous handling of multiple guest requests (an illustrative
  invocation follows this list).
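How the device is expected to be instantiated (illustrative only;
"memdev" matches the property in the QEMU patch below, the backing
file path is hypothetical and the final syntax may still change):

    qemu-system-x86_64 ... \
        -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/virtio_pmem.img,size=4G \
        -device virtio-pmem-pci,memdev=mem1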
David Hildenbrand (CCed) has also posted a modified version [4] of
the QEMU virtio-pmem code based on the updated QEMU memory device API.
Virtio-pmem error handling:
----------------------------------------
We checked the behaviour of virtio-pmem for the error types below
and need suggestions on the expected handling:
- Hardware errors (uncorrectable, recoverable):
 a] virtio-pmem:
  - With the current logic, if the faulting page belongs to the QEMU
    process, the host MCE handler isolates (hwpoisons) that page and
    sends SIGBUS. QEMU's SIGBUS handler injects the exception into
    the KVM guest.
  - The KVM guest then isolates the page and sends SIGBUS to the
    guest userspace process that has mapped the page.
 b] Existing implementation in the ACPI pmem driver:
  - Handles such errors with an MCE notifier and creates a list of
    bad blocks. Read/direct-access DAX operations return EIO if the
    accessed memory page falls in the bad block list.
  - It also starts background scrubbing.
  - Similar functionality could be reused in virtio-pmem with the MCE
    notifier but without scrubbing (no ACPI/ARS). We need input to
    confirm whether this behaviour is acceptable or needs changes.
Changes from RFC v3: [1]
- Rebase to latest upstream - Luiz
- Call nd_region->flush in place of nvdimm_flush - Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- Fix request leak, correct sizeof(req) - Stefan
- Move declaration to virtio_pmem.c
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Pankaj Gupta (3):
nd: move nd_region to common header
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
[1] https://lkml.org/lkml/2018/7/13/102
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html
[4] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
drivers/acpi/nfit/core.c | 7 -
drivers/nvdimm/claim.c | 3
drivers/nvdimm/nd.h | 39 -----
drivers/nvdimm/pmem.c | 12 +
drivers/nvdimm/region_devs.c | 12 +
drivers/virtio/Kconfig | 9 +
drivers/virtio/Makefile | 1
drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++
include/linux/libnvdimm.h | 4
include/linux/nd.h | 40 ++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 40 ++++++
12 files changed, 374 insertions(+), 49 deletions(-)
This patch adds functionality to perform a flush from guest to host
over VIRTIO. We register a flush callback based on the 'nd_region'
type: the virtio_pmem driver requires its special flush function,
while the remaining region types keep registering the existing flush
function. An error returned by a host-side fsync failure is reported
back to userspace.
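For illustration, this is how a region provider is expected to opt in
to the new callback. The snippet is only a sketch: my_virt_flush() and
my_region_setup() are hypothetical names, while nd_region_desc,
nvdimm_pmem_region_create() and the nvdimm_flush() fallback are the
real interfaces touched by this patch.

    #include <linux/string.h>
    #include <linux/libnvdimm.h>

    /* hypothetical provider-specific flush; returns 0 or -EIO */
    static int my_virt_flush(struct nd_region *nd_region)
    {
            /* forward the flush request to the hypervisor/backing store */
            return 0;
    }

    static int my_region_setup(struct nvdimm_bus *bus, struct resource *res)
    {
            struct nd_region_desc ndr_desc;

            memset(&ndr_desc, 0, sizeof(ndr_desc));
            ndr_desc.res = res;
            ndr_desc.flush = my_virt_flush; /* leave NULL to keep nvdimm_flush() */

            return nvdimm_pmem_region_create(bus, &ndr_desc) ? 0 : -ENXIO;
    }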
Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/acpi/nfit/core.c | 7 +++++--
drivers/nvdimm/claim.c | 3 ++-
drivers/nvdimm/pmem.c | 12 ++++++++----
drivers/nvdimm/region_devs.c | 12 ++++++++++--
include/linux/libnvdimm.h | 4 +++-
5 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index b072cfc..cd63b69 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2216,6 +2216,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
{
u64 cmd, offset;
struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
+ struct nd_region *nd_region = nfit_blk->nd_region;
enum {
BCW_OFFSET_MASK = (1ULL << 48)-1,
@@ -2234,7 +2235,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
offset = to_interleave_offset(offset, mmio);
writeq(cmd, mmio->addr.base + offset);
- nvdimm_flush(nfit_blk->nd_region);
+ nd_region->flush(nd_region);
if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
readq(mmio->addr.base + offset);
@@ -2245,6 +2246,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
unsigned int lane)
{
struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
+ struct nd_region *nd_region = nfit_blk->nd_region;
unsigned int copied = 0;
u64 base_offset;
int rc;
@@ -2283,7 +2285,8 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}
if (rw)
- nvdimm_flush(nfit_blk->nd_region);
+ nd_region->flush(nd_region);
+
rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf..49dce9c 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -262,6 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
{
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
sector_t sector = offset >> 9;
int rc = 0;
@@ -301,7 +302,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
}
memcpy_flushcache(nsio->addr + offset, buf, size);
- nvdimm_flush(to_nd_region(ndns->dev.parent));
+ nd_region->flush(nd_region);
return rc;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 6071e29..ba57cfa 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -201,7 +201,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
struct nd_region *nd_region = to_region(pmem);
if (bio->bi_opf & REQ_PREFLUSH)
- nvdimm_flush(nd_region);
+ bio->bi_status = nd_region->flush(nd_region);
+
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
nd_iostat_end(bio, start);
if (bio->bi_opf & REQ_FUA)
- nvdimm_flush(nd_region);
+ bio->bi_status = nd_region->flush(nd_region);
bio_endio(bio);
return BLK_QC_T_NONE;
@@ -517,6 +518,7 @@ static int nd_pmem_probe(struct device *dev)
static int nd_pmem_remove(struct device *dev)
{
struct pmem_device *pmem = dev_get_drvdata(dev);
+ struct nd_region *nd_region = to_region(pmem);
if (is_nd_btt(dev))
nvdimm_namespace_detach_btt(to_nd_btt(dev));
@@ -528,14 +530,16 @@ static int nd_pmem_remove(struct device *dev)
sysfs_put(pmem->bb_state);
pmem->bb_state = NULL;
}
- nvdimm_flush(to_nd_region(dev->parent));
+ nd_region->flush(nd_region);
return 0;
}
static void nd_pmem_shutdown(struct device *dev)
{
- nvdimm_flush(to_nd_region(dev->parent));
+ struct nd_region *nd_region = to_nd_region(dev->parent);
+
+ nd_region->flush(nd_region);
}
static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fa37afc..a170a6b 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -290,7 +290,7 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
return rc;
if (!flush)
return -EINVAL;
- nvdimm_flush(nd_region);
+ nd_region->flush(nd_region);
return len;
}
@@ -1065,6 +1065,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
+ if (ndr_desc->flush)
+ nd_region->flush = ndr_desc->flush;
+ else
+ nd_region->flush = nvdimm_flush;
+
nd_device_register(dev);
return nd_region;
@@ -1109,7 +1114,7 @@ EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
* nvdimm_flush - flush any posted write queues between the cpu and pmem media
* @nd_region: blk or interleaved pmem region
*/
-void nvdimm_flush(struct nd_region *nd_region)
+int nvdimm_flush(struct nd_region *nd_region)
{
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i, idx;
@@ -1133,7 +1138,10 @@ void nvdimm_flush(struct nd_region *nd_region)
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
wmb();
+
+ return 0;
}
+
EXPORT_SYMBOL_GPL(nvdimm_flush);
/**
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 097072c..3af7177 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -115,6 +115,7 @@ struct nd_mapping_desc {
int position;
};
+struct nd_region;
struct nd_region_desc {
struct resource *res;
struct nd_mapping_desc *mapping;
@@ -126,6 +127,7 @@ struct nd_region_desc {
int numa_node;
unsigned long flags;
struct device_node *of_node;
+ int (*flush)(struct nd_region *nd_region);
};
struct device;
@@ -201,7 +203,7 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
--
2.9.3
This patch adds the virtio-pmem QEMU device.
The device presents a memory address range to the guest which is
backed by a file-type memory backend, and acts like a persistent
memory device for the KVM guest. The guest can perform reads and
persistent writes on this memory range with the help of a
DAX-capable filesystem.
Persistent guest writes are guaranteed with the help of the
virtio-based flushing interface: when guest userspace performs
fsync on a file descriptor on the pmem device, a flush command is
sent to QEMU over VIRTIO and a host-side flush/sync is done on the
backing image file.
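From the guest's point of view nothing special is needed; an ordinary
fsync() on a file on the DAX-mounted pmem device is what ends up as a
flush request to QEMU. A minimal sketch (the mount point /mnt/pmem is
hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char buf[] = "persist me";
            int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);

            if (fd < 0)
                    return 1;
            if (write(fd, buf, strlen(buf)) < 0)
                    return 1;
            /* triggers REQ_PREFLUSH -> nd_region->flush() -> virtio request;
             * fails with EIO if the host-side fsync of the backing file failed */
            if (fsync(fd) < 0)
                    perror("fsync");
            close(fd);
            return 0;
    }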
Signed-off-by: Pankaj Gupta <[email protected]>
---
Changes from RFC v3:
- Return EIO for host fsync failure instead of errno - Luiz, Stefan
- Change version for inclusion to Qemu 3.1 - Eric
Changes from RFC v2:
- Use aio_worker() to avoid QEMU hanging on a blocking fsync
  call - Stefan
- Use virtio_st*_p() for endianness - Stefan
- Correct indentation in qapi/misc.json - Eric
hw/virtio/Makefile.objs | 3 +
hw/virtio/virtio-pci.c | 44 +++++
hw/virtio/virtio-pci.h | 14 ++
hw/virtio/virtio-pmem.c | 241 ++++++++++++++++++++++++++++
include/hw/pci/pci.h | 1 +
include/hw/virtio/virtio-pmem.h | 42 +++++
include/standard-headers/linux/virtio_ids.h | 1 +
qapi/misc.json | 26 ++-
8 files changed, 371 insertions(+), 1 deletion(-)
create mode 100644 hw/virtio/virtio-pmem.c
create mode 100644 include/hw/virtio/virtio-pmem.h
diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 1b2799cfd8..7f914d45d0 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -10,6 +10,9 @@ obj-$(CONFIG_VIRTIO_CRYPTO) += virtio-crypto.o
obj-$(call land,$(CONFIG_VIRTIO_CRYPTO),$(CONFIG_VIRTIO_PCI)) += virtio-crypto-pci.o
obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
+ifeq ($(CONFIG_MEM_HOTPLUG),y)
+obj-$(CONFIG_LINUX) += virtio-pmem.o
+endif
obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
endif
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 3a01fe90f0..93d3fc05c7 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2521,6 +2521,49 @@ static const TypeInfo virtio_rng_pci_info = {
.class_init = virtio_rng_pci_class_init,
};
+/* virtio-pmem-pci */
+
+static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
+{
+ VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
+ DeviceState *vdev = DEVICE(&vpmem->vdev);
+
+ qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
+ object_property_set_bool(OBJECT(vdev), true, "realized", errp);
+}
+
+static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
+ PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
+ k->realize = virtio_pmem_pci_realize;
+ set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+ pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
+ pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
+ pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
+ pcidev_k->class_id = PCI_CLASS_OTHERS;
+}
+
+static void virtio_pmem_pci_instance_init(Object *obj)
+{
+ VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
+
+ virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+ TYPE_VIRTIO_PMEM);
+ object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
+ &error_abort);
+}
+
+static const TypeInfo virtio_pmem_pci_info = {
+ .name = TYPE_VIRTIO_PMEM_PCI,
+ .parent = TYPE_VIRTIO_PCI,
+ .instance_size = sizeof(VirtIOPMEMPCI),
+ .instance_init = virtio_pmem_pci_instance_init,
+ .class_init = virtio_pmem_pci_class_init,
+};
+
+
/* virtio-input-pci */
static Property virtio_input_pci_properties[] = {
@@ -2714,6 +2757,7 @@ static void virtio_pci_register_types(void)
type_register_static(&virtio_balloon_pci_info);
type_register_static(&virtio_serial_pci_info);
type_register_static(&virtio_net_pci_info);
+ type_register_static(&virtio_pmem_pci_info);
#ifdef CONFIG_VHOST_SCSI
type_register_static(&vhost_scsi_pci_info);
#endif
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index 813082b0d7..fe74fcad3f 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -19,6 +19,7 @@
#include "hw/virtio/virtio-blk.h"
#include "hw/virtio/virtio-net.h"
#include "hw/virtio/virtio-rng.h"
+#include "hw/virtio/virtio-pmem.h"
#include "hw/virtio/virtio-serial.h"
#include "hw/virtio/virtio-scsi.h"
#include "hw/virtio/virtio-balloon.h"
@@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
typedef struct VirtIOGPUPCI VirtIOGPUPCI;
typedef struct VHostVSockPCI VHostVSockPCI;
typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
+typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
/* virtio-pci-bus */
@@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
VirtIOBlock vdev;
};
+/*
+ * virtio-pmem-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
+#define VIRTIO_PMEM_PCI(obj) \
+ OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
+
+struct VirtIOPMEMPCI {
+ VirtIOPCIProxy parent_obj;
+ VirtIOPMEM vdev;
+};
+
/*
* virtio-balloon-pci: This extends VirtioPCIProxy.
*/
diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
new file mode 100644
index 0000000000..69ae4c0a50
--- /dev/null
+++ b/hw/virtio/virtio-pmem.c
@@ -0,0 +1,241 @@
+/*
+ * Virtio pmem device
+ *
+ * Copyright (C) 2018 Red Hat, Inc.
+ * Copyright (C) 2018 Pankaj Gupta <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "hw/virtio/virtio-access.h"
+#include "hw/virtio/virtio-pmem.h"
+#include "hw/mem/memory-device.h"
+#include "block/aio.h"
+#include "block/thread-pool.h"
+
+typedef struct VirtIOPMEMresp {
+ int ret;
+} VirtIOPMEMResp;
+
+typedef struct VirtIODeviceRequest {
+ VirtQueueElement elem;
+ int fd;
+ VirtIOPMEM *pmem;
+ VirtIOPMEMResp resp;
+} VirtIODeviceRequest;
+
+static int worker_cb(void *opaque)
+{
+ VirtIODeviceRequest *req = opaque;
+ int err = 0;
+
+ /* flush raw backing image */
+ err = fsync(req->fd);
+ if (err != 0) {
+ err = EIO;
+ }
+ req->resp.ret = err;
+
+ return 0;
+}
+
+static void done_cb(void *opaque, int ret)
+{
+ VirtIODeviceRequest *req = opaque;
+ int len = iov_from_buf(req->elem.in_sg, req->elem.in_num, 0,
+ &req->resp, sizeof(VirtIOPMEMResp));
+
+ /* Callbacks are serialized, so no need to use atomic ops. */
+ virtqueue_push(req->pmem->rq_vq, &req->elem, len);
+ virtio_notify((VirtIODevice *)req->pmem, req->pmem->rq_vq);
+ g_free(req);
+}
+
+static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
+{
+ VirtIODeviceRequest *req;
+ VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
+ HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
+ ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
+
+ req = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
+ if (!req) {
+ virtio_error(vdev, "virtio-pmem missing request data");
+ return;
+ }
+
+ if (req->elem.out_num < 1 || req->elem.in_num < 1) {
+ virtio_error(vdev, "virtio-pmem request not proper");
+ g_free(req);
+ return;
+ }
+ req->fd = memory_region_get_fd(&backend->mr);
+ req->pmem = pmem;
+ thread_pool_submit_aio(pool, worker_cb, req, done_cb, req);
+}
+
+static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
+{
+ VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
+ struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *) config;
+
+ virtio_stq_p(vdev, &pmemcfg->start, pmem->start);
+ virtio_stq_p(vdev, &pmemcfg->size, pmem->size);
+}
+
+static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t features,
+ Error **errp)
+{
+ return features;
+}
+
+static void virtio_pmem_realize(DeviceState *dev, Error **errp)
+{
+ VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+ VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
+ MachineState *ms = MACHINE(qdev_get_machine());
+ uint64_t align;
+ Error *local_err = NULL;
+ MemoryRegion *mr;
+
+ if (!pmem->memdev) {
+ error_setg(errp, "virtio-pmem memdev not set");
+ return;
+ }
+
+ mr = host_memory_backend_get_memory(pmem->memdev);
+ align = memory_region_get_alignment(mr);
+ pmem->size = QEMU_ALIGN_DOWN(memory_region_size(mr), align);
+ pmem->start = memory_device_get_free_addr(ms, NULL, align, pmem->size,
+ &local_err);
+ if (local_err) {
+ error_setg(errp, "Can't get free address in mem device");
+ return;
+ }
+ memory_region_init_alias(&pmem->mr, OBJECT(pmem),
+ "virtio_pmem-memory", mr, 0, pmem->size);
+ memory_device_plug_region(ms, &pmem->mr, pmem->start);
+
+ host_memory_backend_set_mapped(pmem->memdev, true);
+ virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
+ sizeof(struct virtio_pmem_config));
+ pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
+}
+
+static void virtio_mem_check_memdev(Object *obj, const char *name, Object *val,
+ Error **errp)
+{
+ if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
+ char *path = object_get_canonical_path_component(val);
+ error_setg(errp, "Can't use already busy memdev: %s", path);
+ g_free(path);
+ return;
+ }
+
+ qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
+}
+
+static const char *virtio_pmem_get_device_id(VirtIOPMEM *vm)
+{
+ Object *obj = OBJECT(vm);
+ DeviceState *parent_dev;
+
+ /* always use the ID of the proxy device */
+ if (obj->parent && object_dynamic_cast(obj->parent, TYPE_DEVICE)) {
+ parent_dev = DEVICE(obj->parent);
+ return parent_dev->id;
+ }
+ return NULL;
+}
+
+static void virtio_pmem_md_fill_device_info(const MemoryDeviceState *md,
+ MemoryDeviceInfo *info)
+{
+ VirtioPMemDeviceInfo *vi = g_new0(VirtioPMemDeviceInfo, 1);
+ VirtIOPMEM *vm = VIRTIO_PMEM(md);
+ const char *id = virtio_pmem_get_device_id(vm);
+
+ if (id) {
+ vi->has_id = true;
+ vi->id = g_strdup(id);
+ }
+
+ vi->start = vm->start;
+ vi->size = vm->size;
+ vi->memdev = object_get_canonical_path(OBJECT(vm->memdev));
+
+ info->u.virtio_pmem.data = vi;
+ info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_PMEM;
+}
+
+static uint64_t virtio_pmem_md_get_addr(const MemoryDeviceState *md)
+{
+ VirtIOPMEM *vm = VIRTIO_PMEM(md);
+
+ return vm->start;
+}
+
+static uint64_t virtio_pmem_md_get_plugged_size(const MemoryDeviceState *md)
+{
+ VirtIOPMEM *vm = VIRTIO_PMEM(md);
+
+ return vm->size;
+}
+
+static uint64_t virtio_pmem_md_get_region_size(const MemoryDeviceState *md)
+{
+ VirtIOPMEM *vm = VIRTIO_PMEM(md);
+
+ return vm->size;
+}
+
+static void virtio_pmem_instance_init(Object *obj)
+{
+ VirtIOPMEM *vm = VIRTIO_PMEM(obj);
+ object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
+ (Object **)&vm->memdev,
+ (void *) virtio_mem_check_memdev,
+ OBJ_PROP_LINK_STRONG,
+ &error_abort);
+}
+
+
+static void virtio_pmem_class_init(ObjectClass *klass, void *data)
+{
+ VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+ MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
+
+ vdc->realize = virtio_pmem_realize;
+ vdc->get_config = virtio_pmem_get_config;
+ vdc->get_features = virtio_pmem_get_features;
+
+ mdc->get_addr = virtio_pmem_md_get_addr;
+ mdc->get_plugged_size = virtio_pmem_md_get_plugged_size;
+ mdc->get_region_size = virtio_pmem_md_get_region_size;
+ mdc->fill_device_info = virtio_pmem_md_fill_device_info;
+}
+
+static TypeInfo virtio_pmem_info = {
+ .name = TYPE_VIRTIO_PMEM,
+ .parent = TYPE_VIRTIO_DEVICE,
+ .class_init = virtio_pmem_class_init,
+ .instance_size = sizeof(VirtIOPMEM),
+ .instance_init = virtio_pmem_instance_init,
+ .interfaces = (InterfaceInfo[]) {
+ { TYPE_MEMORY_DEVICE },
+ { }
+ },
+};
+
+static void virtio_register_types(void)
+{
+ type_register_static(&virtio_pmem_info);
+}
+
+type_init(virtio_register_types)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 990d6fcbde..28829b6437 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -85,6 +85,7 @@ extern bool pci_available;
#define PCI_DEVICE_ID_VIRTIO_RNG 0x1005
#define PCI_DEVICE_ID_VIRTIO_9P 0x1009
#define PCI_DEVICE_ID_VIRTIO_VSOCK 0x1012
+#define PCI_DEVICE_ID_VIRTIO_PMEM 0x1013
#define PCI_VENDOR_ID_REDHAT 0x1b36
#define PCI_DEVICE_ID_REDHAT_BRIDGE 0x0001
diff --git a/include/hw/virtio/virtio-pmem.h b/include/hw/virtio/virtio-pmem.h
new file mode 100644
index 0000000000..fda3ee691c
--- /dev/null
+++ b/include/hw/virtio/virtio-pmem.h
@@ -0,0 +1,42 @@
+/*
+ * Virtio pmem Device
+ *
+ * Copyright Red Hat, Inc. 2018
+ * Copyright Pankaj Gupta <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version. See the COPYING file in the
+ * top-level directory.
+ */
+
+#ifndef QEMU_VIRTIO_PMEM_H
+#define QEMU_VIRTIO_PMEM_H
+
+#include "hw/virtio/virtio.h"
+#include "exec/memory.h"
+#include "sysemu/hostmem.h"
+#include "standard-headers/linux/virtio_ids.h"
+#include "hw/boards.h"
+#include "hw/i386/pc.h"
+
+#define TYPE_VIRTIO_PMEM "virtio-pmem"
+
+#define VIRTIO_PMEM(obj) \
+ OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
+
+/* VirtIOPMEM device structure */
+typedef struct VirtIOPMEM {
+ VirtIODevice parent_obj;
+
+ VirtQueue *rq_vq;
+ uint64_t start;
+ uint64_t size;
+ MemoryRegion mr;
+ HostMemoryBackend *memdev;
+} VirtIOPMEM;
+
+struct virtio_pmem_config {
+ uint64_t start;
+ uint64_t size;
+};
+#endif
diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
index 6d5c3b2d4f..346389565a 100644
--- a/include/standard-headers/linux/virtio_ids.h
+++ b/include/standard-headers/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_PMEM 25 /* virtio pmem */
#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/qapi/misc.json b/qapi/misc.json
index d450cfef21..517376b866 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2907,6 +2907,29 @@
}
}
+##
+# @VirtioPMemDeviceInfo:
+#
+# VirtioPMem state information
+#
+# @id: device's ID
+#
+# @start: physical address, where device is mapped
+#
+# @size: size of memory that the device provides
+#
+# @memdev: memory backend linked with device
+#
+# Since: 3.1
+##
+{ 'struct': 'VirtioPMemDeviceInfo',
+ 'data': { '*id': 'str',
+ 'start': 'size',
+ 'size': 'size',
+ 'memdev': 'str'
+ }
+}
+
##
# @MemoryDeviceInfo:
#
@@ -2916,7 +2939,8 @@
##
{ 'union': 'MemoryDeviceInfo',
'data': { 'dimm': 'PCDIMMDeviceInfo',
- 'nvdimm': 'PCDIMMDeviceInfo'
+ 'nvdimm': 'PCDIMMDeviceInfo',
+ 'virtio-pmem': 'VirtioPMemDeviceInfo'
}
}
--
2.14.3
This patch adds the virtio-pmem driver for KVM guests.
The guest reads the persistent memory range information from QEMU
over VIRTIO and registers it on the nvdimm_bus. It also creates an
nd_region object with the persistent memory range information so
that the existing 'nvdimm/pmem' driver can reserve this range in
the system memory map. This way the 'virtio-pmem' driver reuses the
existing functionality of the pmem driver to register persistent
memory compatible with DAX-capable filesystems.
It also provides the function that performs the guest flush over
VIRTIO when userspace flushes a DAX memory range through the 'pmem'
driver (condensed below).
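The core of that flush function, condensed from virtio_pmem_flush()
in the driver below: each flush is a two-descriptor virtio
transaction (command string out, host status in) and the caller
sleeps until the host acknowledges it.

        sg_init_one(&sg, req->name, strlen(req->name));
        sgs[0] = &sg;                           /* out: "FLUSH" command */
        sg_init_one(&ret, &req->ret, sizeof(req->ret));
        sgs[1] = &ret;                          /* in: host return status */
        err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
        virtqueue_kick(vpmem->req_vq);
        wait_event(req->host_acked, req->done); /* woken by host_ack() */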
Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/virtio/Kconfig | 9 ++
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_pmem.h | 40 ++++++
5 files changed, 306 insertions(+)
create mode 100644 drivers/virtio/virtio_pmem.c
create mode 100644 include/uapi/linux/virtio_pmem.h
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3589764..a331e23 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
If unsure, say Y.
+config VIRTIO_PMEM
+ tristate "Support for virtio pmem driver"
+ depends on VIRTIO
+ help
+ This driver provides support for virtio based flushing interface
+ for persistent memory range.
+
+ If unsure, say M.
+
config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5..cbe91c6 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
new file mode 100644
index 0000000..c22cc87
--- /dev/null
+++ b/drivers/virtio/virtio_pmem.c
@@ -0,0 +1,255 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio.h>
+#include <linux/module.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/spinlock.h>
+#include <linux/libnvdimm.h>
+#include <linux/nd.h>
+
+struct virtio_pmem_request {
+ /* Host return status corresponding to flush request */
+ int ret;
+
+ /* command name*/
+ char name[16];
+
+ /* Wait queue to process deferred work after ack from host */
+ wait_queue_head_t host_acked;
+ bool done;
+
+ /* Wait queue to process deferred work after virt queue buffer avail */
+ wait_queue_head_t wq_buf;
+ bool wq_buf_avail;
+ struct list_head list;
+};
+
+struct virtio_pmem {
+ struct virtio_device *vdev;
+
+ /* Virtio pmem request queue */
+ struct virtqueue *req_vq;
+
+ /* nvdimm bus registers virtio pmem device */
+ struct nvdimm_bus *nvdimm_bus;
+ struct nvdimm_bus_descriptor nd_desc;
+
+ /* List to store deferred work if virtqueue is full */
+ struct list_head req_list;
+
+ /* Synchronize virtqueue data */
+ spinlock_t pmem_lock;
+
+ /* Memory region information */
+ uint64_t start;
+ uint64_t size;
+};
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+ /* The interrupt handler */
+static void host_ack(struct virtqueue *vq)
+{
+ unsigned int len;
+ unsigned long flags;
+ struct virtio_pmem_request *req, *req_buf;
+ struct virtio_pmem *vpmem = vq->vdev->priv;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+ req->done = true;
+ wake_up(&req->host_acked);
+
+ if (!list_empty(&vpmem->req_list)) {
+ req_buf = list_first_entry(&vpmem->req_list,
+ struct virtio_pmem_request, list);
+ list_del(&vpmem->req_list);
+ req_buf->wq_buf_avail = true;
+ wake_up(&req_buf->wq_buf);
+ }
+ }
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+ struct virtqueue *vq;
+
+ /* single vq */
+ vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+ host_ack, "flush_queue");
+ if (IS_ERR(vq))
+ return PTR_ERR(vq);
+
+ spin_lock_init(&vpmem->pmem_lock);
+ INIT_LIST_HEAD(&vpmem->req_list);
+
+ return 0;
+};
+
+ /* The request submission function */
+static int virtio_pmem_flush(struct nd_region *nd_region)
+{
+ int err;
+ unsigned long flags;
+ struct scatterlist *sgs[2], sg, ret;
+ struct virtio_device *vdev =
+ dev_to_virtio(nd_region->dev.parent->parent);
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
+
+ if (!req)
+ return -ENOMEM;
+
+ req->done = req->wq_buf_avail = false;
+ strcpy(req->name, "FLUSH");
+ init_waitqueue_head(&req->host_acked);
+ init_waitqueue_head(&req->wq_buf);
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ sg_init_one(&sg, req->name, strlen(req->name));
+ sgs[0] = &sg;
+ sg_init_one(&ret, &req->ret, sizeof(req->ret));
+ sgs[1] = &ret;
+ err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+ if (err) {
+ dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+ list_add_tail(&vpmem->req_list, &req->list);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->wq_buf, req->wq_buf_avail);
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ }
+ virtqueue_kick(vpmem->req_vq);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->host_acked, req->done);
+ err = req->ret;
+ kfree(req);
+
+ return err;
+};
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+ int err = 0;
+ struct resource res;
+ struct virtio_pmem *vpmem;
+ struct nvdimm_bus *nvdimm_bus;
+ struct nd_region_desc ndr_desc;
+ int nid = dev_to_node(&vdev->dev);
+ struct nd_region *nd_region;
+
+ if (!vdev->config->get) {
+ dev_err(&vdev->dev, "%s failure: config disabled\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+ GFP_KERNEL);
+ if (!vpmem) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ vpmem->vdev = vdev;
+ err = init_vq(vpmem);
+ if (err)
+ goto out_err;
+
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ start, &vpmem->start);
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ size, &vpmem->size);
+
+ res.start = vpmem->start;
+ res.end = vpmem->start + vpmem->size-1;
+ vpmem->nd_desc.provider_name = "virtio-pmem";
+ vpmem->nd_desc.module = THIS_MODULE;
+
+ vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+ &vpmem->nd_desc);
+ if (!nvdimm_bus)
+ goto out_vq;
+
+ dev_set_drvdata(&vdev->dev, nvdimm_bus);
+ memset(&ndr_desc, 0, sizeof(ndr_desc));
+
+ ndr_desc.res = &res;
+ ndr_desc.numa_node = nid;
+ ndr_desc.flush = virtio_pmem_flush;
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+ nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+
+ if (!nd_region)
+ goto out_nd;
+
+ //virtio_device_ready(vdev);
+ return 0;
+out_nd:
+ err = -ENXIO;
+ nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+ vdev->config->del_vqs(vdev);
+out_err:
+ dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+ return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+ nvdimm_bus_unregister(nvdimm_bus);
+ vdev->config->del_vqs(vdev);
+ kfree(vpmem);
+}
+
+#ifdef CONFIG_PM_SLEEP
+static int virtio_pmem_freeze(struct virtio_device *vdev)
+{
+ /* todo: handle freeze function */
+ return -EPERM;
+}
+
+static int virtio_pmem_restore(struct virtio_device *vdev)
+{
+ /* todo: handle restore function */
+ return -EPERM;
+}
+#endif
+
+
+static struct virtio_driver virtio_pmem_driver = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = virtio_pmem_probe,
+ .remove = virtio_pmem_remove,
+#ifdef CONFIG_PM_SLEEP
+ .freeze = virtio_pmem_freeze,
+ .restore = virtio_pmem_restore,
+#endif
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2..3463895 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_PMEM 25 /* virtio pmem */
#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 0000000..c7c22a5
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2018-2019
+ * Copyright (C) Pankaj Gupta <[email protected]>, 2018
+ */
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+ __le64 start;
+ __le64 size;
+};
+#endif
--
2.9.3
This patch moves the nd_region definition to the common header
include/linux/nd.h. This is required for flush callback support in
both the virtio-pmem and pmem drivers.
Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/nvdimm/nd.h | 39 ---------------------------------------
include/linux/nd.h | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+), 39 deletions(-)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 98317e7..d079a2b 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -123,45 +123,6 @@ enum nd_mapping_lock_class {
ND_MAPPING_UUID_SCAN,
};
-struct nd_mapping {
- struct nvdimm *nvdimm;
- u64 start;
- u64 size;
- int position;
- struct list_head labels;
- struct mutex lock;
- /*
- * @ndd is for private use at region enable / disable time for
- * get_ndd() + put_ndd(), all other nd_mapping to ndd
- * conversions use to_ndd() which respects enabled state of the
- * nvdimm.
- */
- struct nvdimm_drvdata *ndd;
-};
-
-struct nd_region {
- struct device dev;
- struct ida ns_ida;
- struct ida btt_ida;
- struct ida pfn_ida;
- struct ida dax_ida;
- unsigned long flags;
- struct device *ns_seed;
- struct device *btt_seed;
- struct device *pfn_seed;
- struct device *dax_seed;
- u16 ndr_mappings;
- u64 ndr_size;
- u64 ndr_start;
- int id, num_lanes, ro, numa_node;
- void *provider_data;
- struct kernfs_node *bb_state;
- struct badblocks bb;
- struct nd_interleave_set *nd_set;
- struct nd_percpu_lane __percpu *lane;
- struct nd_mapping mapping[0];
-};
-
struct nd_blk_region {
int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
int (*do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
diff --git a/include/linux/nd.h b/include/linux/nd.h
index 43c181a..b9da9f7 100644
--- a/include/linux/nd.h
+++ b/include/linux/nd.h
@@ -120,6 +120,46 @@ struct nd_namespace_blk {
struct resource **res;
};
+struct nd_mapping {
+ struct nvdimm *nvdimm;
+ u64 start;
+ u64 size;
+ int position;
+ struct list_head labels;
+ struct mutex lock;
+ /*
+ * @ndd is for private use at region enable / disable time for
+ * get_ndd() + put_ndd(), all other nd_mapping to ndd
+ * conversions use to_ndd() which respects enabled state of the
+ * nvdimm.
+ */
+ struct nvdimm_drvdata *ndd;
+};
+
+struct nd_region {
+ struct device dev;
+ struct ida ns_ida;
+ struct ida btt_ida;
+ struct ida pfn_ida;
+ struct ida dax_ida;
+ unsigned long flags;
+ struct device *ns_seed;
+ struct device *btt_seed;
+ struct device *pfn_seed;
+ struct device *dax_seed;
+ u16 ndr_mappings;
+ u64 ndr_size;
+ u64 ndr_start;
+ int id, num_lanes, ro, numa_node;
+ void *provider_data;
+ struct kernfs_node *bb_state;
+ struct badblocks bb;
+ struct nd_interleave_set *nd_set;
+ struct nd_percpu_lane __percpu *lane;
+ int (*flush)(struct nd_region *nd_region);
+ struct nd_mapping mapping[0];
+};
+
static inline struct nd_namespace_io *to_nd_namespace_io(const struct device *dev)
{
return container_of(dev, struct nd_namespace_io, common.dev);
--
2.9.3
Hi Pankaj,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
[also build test ERROR on v4.19-rc2 next-20180903]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: i386-randconfig-a3-201835 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
:::::: branch date: 21 hours ago
:::::: commit date: 21 hours ago
All errors (new ones prefixed by >>):
drivers/virtio/virtio_pmem.o: In function `virtio_pmem_remove':
>> drivers/virtio/virtio_pmem.c:220: undefined reference to `nvdimm_bus_unregister'
drivers/virtio/virtio_pmem.o: In function `virtio_pmem_probe':
>> drivers/virtio/virtio_pmem.c:186: undefined reference to `nvdimm_bus_register'
>> drivers/virtio/virtio_pmem.c:198: undefined reference to `nvdimm_pmem_region_create'
drivers/virtio/virtio_pmem.c:207: undefined reference to `nvdimm_bus_unregister'
# https://github.com/0day-ci/linux/commit/acce2633da18b0ad58d0cc9243a85b03020ca099
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout acce2633da18b0ad58d0cc9243a85b03020ca099
vim +220 drivers/virtio/virtio_pmem.c
acce2633 Pankaj Gupta 2018-08-31 147
acce2633 Pankaj Gupta 2018-08-31 148 static int virtio_pmem_probe(struct virtio_device *vdev)
acce2633 Pankaj Gupta 2018-08-31 149 {
acce2633 Pankaj Gupta 2018-08-31 150 int err = 0;
acce2633 Pankaj Gupta 2018-08-31 151 struct resource res;
acce2633 Pankaj Gupta 2018-08-31 152 struct virtio_pmem *vpmem;
acce2633 Pankaj Gupta 2018-08-31 153 struct nvdimm_bus *nvdimm_bus;
acce2633 Pankaj Gupta 2018-08-31 154 struct nd_region_desc ndr_desc;
acce2633 Pankaj Gupta 2018-08-31 155 int nid = dev_to_node(&vdev->dev);
acce2633 Pankaj Gupta 2018-08-31 156 struct nd_region *nd_region;
acce2633 Pankaj Gupta 2018-08-31 157
acce2633 Pankaj Gupta 2018-08-31 158 if (!vdev->config->get) {
acce2633 Pankaj Gupta 2018-08-31 159 dev_err(&vdev->dev, "%s failure: config disabled\n",
acce2633 Pankaj Gupta 2018-08-31 160 __func__);
acce2633 Pankaj Gupta 2018-08-31 161 return -EINVAL;
acce2633 Pankaj Gupta 2018-08-31 162 }
acce2633 Pankaj Gupta 2018-08-31 163
acce2633 Pankaj Gupta 2018-08-31 164 vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
acce2633 Pankaj Gupta 2018-08-31 165 GFP_KERNEL);
acce2633 Pankaj Gupta 2018-08-31 166 if (!vpmem) {
acce2633 Pankaj Gupta 2018-08-31 167 err = -ENOMEM;
acce2633 Pankaj Gupta 2018-08-31 168 goto out_err;
acce2633 Pankaj Gupta 2018-08-31 169 }
acce2633 Pankaj Gupta 2018-08-31 170
acce2633 Pankaj Gupta 2018-08-31 171 vpmem->vdev = vdev;
acce2633 Pankaj Gupta 2018-08-31 172 err = init_vq(vpmem);
acce2633 Pankaj Gupta 2018-08-31 173 if (err)
acce2633 Pankaj Gupta 2018-08-31 174 goto out_err;
acce2633 Pankaj Gupta 2018-08-31 175
acce2633 Pankaj Gupta 2018-08-31 176 virtio_cread(vpmem->vdev, struct virtio_pmem_config,
acce2633 Pankaj Gupta 2018-08-31 177 start, &vpmem->start);
acce2633 Pankaj Gupta 2018-08-31 178 virtio_cread(vpmem->vdev, struct virtio_pmem_config,
acce2633 Pankaj Gupta 2018-08-31 179 size, &vpmem->size);
acce2633 Pankaj Gupta 2018-08-31 180
acce2633 Pankaj Gupta 2018-08-31 181 res.start = vpmem->start;
acce2633 Pankaj Gupta 2018-08-31 182 res.end = vpmem->start + vpmem->size-1;
acce2633 Pankaj Gupta 2018-08-31 183 vpmem->nd_desc.provider_name = "virtio-pmem";
acce2633 Pankaj Gupta 2018-08-31 184 vpmem->nd_desc.module = THIS_MODULE;
acce2633 Pankaj Gupta 2018-08-31 185
acce2633 Pankaj Gupta 2018-08-31 @186 vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
acce2633 Pankaj Gupta 2018-08-31 187 &vpmem->nd_desc);
acce2633 Pankaj Gupta 2018-08-31 188 if (!nvdimm_bus)
acce2633 Pankaj Gupta 2018-08-31 189 goto out_vq;
acce2633 Pankaj Gupta 2018-08-31 190
acce2633 Pankaj Gupta 2018-08-31 191 dev_set_drvdata(&vdev->dev, nvdimm_bus);
acce2633 Pankaj Gupta 2018-08-31 192 memset(&ndr_desc, 0, sizeof(ndr_desc));
acce2633 Pankaj Gupta 2018-08-31 193
acce2633 Pankaj Gupta 2018-08-31 194 ndr_desc.res = &res;
acce2633 Pankaj Gupta 2018-08-31 195 ndr_desc.numa_node = nid;
acce2633 Pankaj Gupta 2018-08-31 196 ndr_desc.flush = virtio_pmem_flush;
acce2633 Pankaj Gupta 2018-08-31 197 set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
acce2633 Pankaj Gupta 2018-08-31 @198 nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
acce2633 Pankaj Gupta 2018-08-31 199
acce2633 Pankaj Gupta 2018-08-31 200 if (!nd_region)
acce2633 Pankaj Gupta 2018-08-31 201 goto out_nd;
acce2633 Pankaj Gupta 2018-08-31 202
acce2633 Pankaj Gupta 2018-08-31 203 //virtio_device_ready(vdev);
acce2633 Pankaj Gupta 2018-08-31 204 return 0;
acce2633 Pankaj Gupta 2018-08-31 205 out_nd:
acce2633 Pankaj Gupta 2018-08-31 206 err = -ENXIO;
acce2633 Pankaj Gupta 2018-08-31 207 nvdimm_bus_unregister(nvdimm_bus);
acce2633 Pankaj Gupta 2018-08-31 208 out_vq:
acce2633 Pankaj Gupta 2018-08-31 209 vdev->config->del_vqs(vdev);
acce2633 Pankaj Gupta 2018-08-31 210 out_err:
acce2633 Pankaj Gupta 2018-08-31 211 dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
acce2633 Pankaj Gupta 2018-08-31 212 return err;
acce2633 Pankaj Gupta 2018-08-31 213 }
acce2633 Pankaj Gupta 2018-08-31 214
acce2633 Pankaj Gupta 2018-08-31 215 static void virtio_pmem_remove(struct virtio_device *vdev)
acce2633 Pankaj Gupta 2018-08-31 216 {
acce2633 Pankaj Gupta 2018-08-31 217 struct virtio_pmem *vpmem = vdev->priv;
acce2633 Pankaj Gupta 2018-08-31 218 struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
acce2633 Pankaj Gupta 2018-08-31 219
acce2633 Pankaj Gupta 2018-08-31 @220 nvdimm_bus_unregister(nvdimm_bus);
acce2633 Pankaj Gupta 2018-08-31 221 vdev->config->del_vqs(vdev);
acce2633 Pankaj Gupta 2018-08-31 222 kfree(vpmem);
acce2633 Pankaj Gupta 2018-08-31 223 }
acce2633 Pankaj Gupta 2018-08-31 224
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Pankaj,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linux-nvdimm/libnvdimm-for-next]
[also build test WARNING on v4.19-rc2 next-20180831]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
:::::: branch date: 7 hours ago
:::::: commit date: 7 hours ago
drivers/nvdimm/pmem.c:116:25: sparse: expression using sizeof(void)
drivers/nvdimm/pmem.c:135:25: sparse: expression using sizeof(void)
>> drivers/nvdimm/pmem.c:204:32: sparse: incorrect type in assignment (different base types) @@ expected restricted blk_status_t [usertype] bi_status @@ got int @@
drivers/nvdimm/pmem.c:204:32: expected restricted blk_status_t [usertype] bi_status
drivers/nvdimm/pmem.c:204:32: got int
drivers/nvdimm/pmem.c:208:9: sparse: expression using sizeof(void)
drivers/nvdimm/pmem.c:208:9: sparse: expression using sizeof(void)
include/linux/bvec.h:82:37: sparse: expression using sizeof(void)
include/linux/bvec.h:82:37: sparse: expression using sizeof(void)
include/linux/bvec.h:83:32: sparse: expression using sizeof(void)
include/linux/bvec.h:83:32: sparse: expression using sizeof(void)
drivers/nvdimm/pmem.c:220:32: sparse: incorrect type in assignment (different base types) @@ expected restricted blk_status_t [usertype] bi_status @@ got int @@
drivers/nvdimm/pmem.c:220:32: expected restricted blk_status_t [usertype] bi_status
drivers/nvdimm/pmem.c:220:32: got int
# https://github.com/0day-ci/linux/commit/69b95edd2a1f4676361988fa36866b59427e2cfa
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 69b95edd2a1f4676361988fa36866b59427e2cfa
vim +204 drivers/nvdimm/pmem.c
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 107
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 108 static void write_pmem(void *pmem_addr, struct page *page,
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 109 unsigned int off, unsigned int len)
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 110 {
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 111 unsigned int chunk;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 112 void *mem;
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 113
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 114 while (len) {
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 115 mem = kmap_atomic(page);
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 @116 chunk = min_t(unsigned int, len, PAGE_SIZE);
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 117 memcpy_flushcache(pmem_addr, mem + off, chunk);
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 118 kunmap_atomic(mem);
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 119 len -= chunk;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 120 off = 0;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 121 page++;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 122 pmem_addr += PAGE_SIZE;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 123 }
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 124 }
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 125
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 126 static blk_status_t read_pmem(struct page *page, unsigned int off,
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 127 void *pmem_addr, unsigned int len)
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 128 {
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 129 unsigned int chunk;
60622d682 drivers/nvdimm/pmem.c Dan Williams 2018-05-03 130 unsigned long rem;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 131 void *mem;
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 132
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 133 while (len) {
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 134 mem = kmap_atomic(page);
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 135 chunk = min_t(unsigned int, len, PAGE_SIZE);
60622d682 drivers/nvdimm/pmem.c Dan Williams 2018-05-03 136 rem = memcpy_mcsafe(mem + off, pmem_addr, chunk);
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 137 kunmap_atomic(mem);
60622d682 drivers/nvdimm/pmem.c Dan Williams 2018-05-03 138 if (rem)
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 139 return BLK_STS_IOERR;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 140 len -= chunk;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 141 off = 0;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 142 page++;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 143 pmem_addr += PAGE_SIZE;
98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 144 }
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 145 return BLK_STS_OK;
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 146 }
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 147
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 148 static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
3f289dcb4 drivers/nvdimm/pmem.c Tejun Heo 2018-07-18 149 unsigned int len, unsigned int off, unsigned int op,
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 150 sector_t sector)
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 151 {
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 152 blk_status_t rc = BLK_STS_OK;
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 153 bool bad_pmem = false;
32ab0a3f5 drivers/nvdimm/pmem.c Dan Williams 2015-08-01 154 phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
7a9eb2066 drivers/nvdimm/pmem.c Dan Williams 2016-06-03 155 void *pmem_addr = pmem->virt_addr + pmem_off;
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 156
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 157 if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 158 bad_pmem = true;
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 159
3f289dcb4 drivers/nvdimm/pmem.c Tejun Heo 2018-07-18 160 if (!op_is_write(op)) {
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 161 if (unlikely(bad_pmem))
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 162 rc = BLK_STS_IOERR;
b5ebc8ec6 drivers/nvdimm/pmem.c Dan Williams 2016-03-06 163 else {
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 164 rc = read_pmem(page, off, pmem_addr, len);
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 165 flush_dcache_page(page);
b5ebc8ec6 drivers/nvdimm/pmem.c Dan Williams 2016-03-06 166 }
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 167 } else {
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 168 /*
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 169 * Note that we write the data both before and after
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 170 * clearing poison. The write before clear poison
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 171 * handles situations where the latest written data is
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 172 * preserved and the clear poison operation simply marks
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 173 * the address range as valid without changing the data.
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 174 * In this case application software can assume that an
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 175 * interrupted write will either return the new good
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 176 * data or an error.
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 177 *
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 178 * However, if pmem_clear_poison() leaves the data in an
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 179 * indeterminate state we need to perform the write
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 180 * after clear poison.
0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 181 */
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 182 flush_dcache_page(page);
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 183 write_pmem(pmem_addr, page, off, len);
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 184 if (unlikely(bad_pmem)) {
3115bb02b drivers/nvdimm/pmem.c Toshi Kani 2016-10-13 185 rc = pmem_clear_poison(pmem, pmem_off, len);
bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 186 write_pmem(pmem_addr, page, off, len);
59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 187 }
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 188 }
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 189
b5ebc8ec6 drivers/nvdimm/pmem.c Dan Williams 2016-03-06 190 return rc;
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 191 }
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 192
dece16353 drivers/nvdimm/pmem.c Jens Axboe 2015-11-05 193 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 194 {
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 195 blk_status_t rc = 0;
f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 196 bool do_acct;
f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 197 unsigned long start;
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 198 struct bio_vec bvec;
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 199 struct bvec_iter iter;
bd842b8ca drivers/nvdimm/pmem.c Dan Williams 2016-03-18 200 struct pmem_device *pmem = q->queuedata;
7e267a8c7 drivers/nvdimm/pmem.c Dan Williams 2016-06-01 201 struct nd_region *nd_region = to_region(pmem);
7e267a8c7 drivers/nvdimm/pmem.c Dan Williams 2016-06-01 202
d2d6364dc drivers/nvdimm/pmem.c Ross Zwisler 2018-06-06 203 if (bio->bi_opf & REQ_PREFLUSH)
69b95edd2 drivers/nvdimm/pmem.c Pankaj Gupta 2018-08-31 @204 bio->bi_status = nd_region->flush(nd_region);
69b95edd2 drivers/nvdimm/pmem.c Pankaj Gupta 2018-08-31 205
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 206
f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 207 do_acct = nd_iostat_start(bio, &start);
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 208 bio_for_each_segment(bvec, bio, iter) {
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 209 rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
3f289dcb4 drivers/nvdimm/pmem.c Tejun Heo 2018-07-18 210 bvec.bv_offset, bio_op(bio), iter.bi_sector);
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 211 if (rc) {
4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 212 bio->bi_status = rc;
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 213 break;
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 214 }
e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 215 }
f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 216 if (do_acct)
f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 217 nd_iostat_end(bio, start);
61031952f drivers/nvdimm/pmem.c Ross Zwisler 2015-06-25 218
1eff9d322 drivers/nvdimm/pmem.c Jens Axboe 2016-08-05 219 if (bio->bi_opf & REQ_FUA)
69b95edd2 drivers/nvdimm/pmem.c Pankaj Gupta 2018-08-31 220 bio->bi_status = nd_region->flush(nd_region);
61031952f drivers/nvdimm/pmem.c Ross Zwisler 2015-06-25 221
4246a0b63 drivers/nvdimm/pmem.c Christoph Hellwig 2015-07-20 222 bio_endio(bio);
dece16353 drivers/nvdimm/pmem.c Jens Axboe 2015-11-05 223 return BLK_QC_T_NONE;
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 224 }
9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 225
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hello,
Thanks for the report.
> Hi Pankaj,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
> [also build test ERROR on v4.19-rc2 next-20180903]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
> base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
> libnvdimm-for-next
> config: i386-randconfig-a3-201835 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
> :::::: branch date: 21 hours ago
> :::::: commit date: 21 hours ago
>
> All errors (new ones prefixed by >>):
>
> drivers/virtio/virtio_pmem.o: In function `virtio_pmem_remove':
> >> drivers/virtio/virtio_pmem.c:220: undefined reference to
> >> `nvdimm_bus_unregister'
> drivers/virtio/virtio_pmem.o: In function `virtio_pmem_probe':
> >> drivers/virtio/virtio_pmem.c:186: undefined reference to
> >> `nvdimm_bus_register'
> >> drivers/virtio/virtio_pmem.c:198: undefined reference to
> >> `nvdimm_pmem_region_create'
> drivers/virtio/virtio_pmem.c:207: undefined reference to
> `nvdimm_bus_unregister'
It looks like the dependent configuration option 'LIBNVDIMM' is not enabled. I will add the
dependency in the Kconfig entry for virtio_pmem in v2.
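Roughly like this (a sketch of the planned v2 change; exact wording may differ).
nvdimm_bus_register(), nvdimm_bus_unregister() and nvdimm_pmem_region_create() are
all provided by the LIBNVDIMM core, which is why the randconfig build without it
fails at link time:

config VIRTIO_PMEM
	tristate "Support for virtio pmem driver"
	depends on VIRTIO
	depends on LIBNVDIMM
	help
	  This driver provides support for virtio based flushing interface
	  for persistent memory range.

	  If unsure, say M.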
Thanks,
Pankaj
>
> #
> https://github.com/0day-ci/linux/commit/acce2633da18b0ad58d0cc9243a85b03020ca099
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout acce2633da18b0ad58d0cc9243a85b03020ca099
> vim +220 drivers/virtio/virtio_pmem.c
>
> acce2633 Pankaj Gupta 2018-08-31 147
> acce2633 Pankaj Gupta 2018-08-31 148 static int virtio_pmem_probe(struct virtio_device *vdev)
> acce2633 Pankaj Gupta 2018-08-31 149 {
> acce2633 Pankaj Gupta 2018-08-31 150 int err = 0;
> acce2633 Pankaj Gupta 2018-08-31 151 struct resource res;
> acce2633 Pankaj Gupta 2018-08-31 152 struct virtio_pmem *vpmem;
> acce2633 Pankaj Gupta 2018-08-31 153 struct nvdimm_bus *nvdimm_bus;
> acce2633 Pankaj Gupta 2018-08-31 154 struct nd_region_desc ndr_desc;
> acce2633 Pankaj Gupta 2018-08-31 155 int nid = dev_to_node(&vdev->dev);
> acce2633 Pankaj Gupta 2018-08-31 156 struct nd_region *nd_region;
> acce2633 Pankaj Gupta 2018-08-31 157
> acce2633 Pankaj Gupta 2018-08-31 158 if (!vdev->config->get) {
> acce2633 Pankaj Gupta 2018-08-31 159 dev_err(&vdev->dev, "%s failure: config disabled\n",
> acce2633 Pankaj Gupta 2018-08-31 160 __func__);
> acce2633 Pankaj Gupta 2018-08-31 161 return -EINVAL;
> acce2633 Pankaj Gupta 2018-08-31 162 }
> acce2633 Pankaj Gupta 2018-08-31 163
> acce2633 Pankaj Gupta 2018-08-31 164 vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> acce2633 Pankaj Gupta 2018-08-31 165 GFP_KERNEL);
> acce2633 Pankaj Gupta 2018-08-31 166 if (!vpmem) {
> acce2633 Pankaj Gupta 2018-08-31 167 err = -ENOMEM;
> acce2633 Pankaj Gupta 2018-08-31 168 goto out_err;
> acce2633 Pankaj Gupta 2018-08-31 169 }
> acce2633 Pankaj Gupta 2018-08-31 170
> acce2633 Pankaj Gupta 2018-08-31 171 vpmem->vdev = vdev;
> acce2633 Pankaj Gupta 2018-08-31 172 err = init_vq(vpmem);
> acce2633 Pankaj Gupta 2018-08-31 173 if (err)
> acce2633 Pankaj Gupta 2018-08-31 174 goto out_err;
> acce2633 Pankaj Gupta 2018-08-31 175
> acce2633 Pankaj Gupta 2018-08-31 176 virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> acce2633 Pankaj Gupta 2018-08-31 177 start, &vpmem->start);
> acce2633 Pankaj Gupta 2018-08-31 178 virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> acce2633 Pankaj Gupta 2018-08-31 179 size, &vpmem->size);
> acce2633 Pankaj Gupta 2018-08-31 180
> acce2633 Pankaj Gupta 2018-08-31 181 res.start = vpmem->start;
> acce2633 Pankaj Gupta 2018-08-31 182 res.end = vpmem->start + vpmem->size-1;
> acce2633 Pankaj Gupta 2018-08-31 183 vpmem->nd_desc.provider_name = "virtio-pmem";
> acce2633 Pankaj Gupta 2018-08-31 184 vpmem->nd_desc.module = THIS_MODULE;
> acce2633 Pankaj Gupta 2018-08-31 185
> acce2633 Pankaj Gupta 2018-08-31 @186 vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> acce2633 Pankaj Gupta 2018-08-31 187 &vpmem->nd_desc);
> acce2633 Pankaj Gupta 2018-08-31 188 if (!nvdimm_bus)
> acce2633 Pankaj Gupta 2018-08-31 189 goto out_vq;
> acce2633 Pankaj Gupta 2018-08-31 190
> acce2633 Pankaj Gupta 2018-08-31 191 dev_set_drvdata(&vdev->dev, nvdimm_bus);
> acce2633 Pankaj Gupta 2018-08-31 192 memset(&ndr_desc, 0, sizeof(ndr_desc));
> acce2633 Pankaj Gupta 2018-08-31 193
> acce2633 Pankaj Gupta 2018-08-31 194 ndr_desc.res = &res;
> acce2633 Pankaj Gupta 2018-08-31 195 ndr_desc.numa_node = nid;
> acce2633 Pankaj Gupta 2018-08-31 196 ndr_desc.flush = virtio_pmem_flush;
> acce2633 Pankaj Gupta 2018-08-31 197 set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> acce2633 Pankaj Gupta 2018-08-31 @198 nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> acce2633 Pankaj Gupta 2018-08-31 199
> acce2633 Pankaj Gupta 2018-08-31 200 if (!nd_region)
> acce2633 Pankaj Gupta 2018-08-31 201 goto out_nd;
> acce2633 Pankaj Gupta 2018-08-31 202
> acce2633 Pankaj Gupta 2018-08-31 203 //virtio_device_ready(vdev);
> acce2633 Pankaj Gupta 2018-08-31 204 return 0;
> acce2633 Pankaj Gupta 2018-08-31 205 out_nd:
> acce2633 Pankaj Gupta 2018-08-31 206 err = -ENXIO;
> acce2633 Pankaj Gupta 2018-08-31 207 nvdimm_bus_unregister(nvdimm_bus);
> acce2633 Pankaj Gupta 2018-08-31 208 out_vq:
> acce2633 Pankaj Gupta 2018-08-31 209 vdev->config->del_vqs(vdev);
> acce2633 Pankaj Gupta 2018-08-31 210 out_err:
> acce2633 Pankaj Gupta 2018-08-31 211 dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> acce2633 Pankaj Gupta 2018-08-31 212 return err;
> acce2633 Pankaj Gupta 2018-08-31 213 }
> acce2633 Pankaj Gupta 2018-08-31 214
> acce2633 Pankaj Gupta 2018-08-31 215 static void virtio_pmem_remove(struct virtio_device *vdev)
> acce2633 Pankaj Gupta 2018-08-31 216 {
> acce2633 Pankaj Gupta 2018-08-31 217 struct virtio_pmem *vpmem = vdev->priv;
> acce2633 Pankaj Gupta 2018-08-31 218 struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> acce2633 Pankaj Gupta 2018-08-31 219
> acce2633 Pankaj Gupta 2018-08-31 @220 nvdimm_bus_unregister(nvdimm_bus);
> acce2633 Pankaj Gupta 2018-08-31 221 vdev->config->del_vqs(vdev);
> acce2633 Pankaj Gupta 2018-08-31 222 kfree(vpmem);
> acce2633 Pankaj Gupta 2018-08-31 223 }
> acce2633 Pankaj Gupta 2018-08-31 224
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
>
Hello,
Thanks for the report.
>
> Hi Pankaj,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on linux-nvdimm/libnvdimm-for-next]
> [also build test WARNING on v4.19-rc2 next-20180831]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
> base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
> libnvdimm-for-next
> reproduce:
> # apt-get install sparse
> make ARCH=x86_64 allmodconfig
> make C=1 CF=-D__CHECK_ENDIAN__
> :::::: branch date: 7 hours ago
> :::::: commit date: 7 hours ago
>
> drivers/nvdimm/pmem.c:116:25: sparse: expression using sizeof(void)
> drivers/nvdimm/pmem.c:135:25: sparse: expression using sizeof(void)
> >> drivers/nvdimm/pmem.c:204:32: sparse: incorrect type in assignment
> >> (different base types) @@ expected restricted blk_status_t [usertype]
> >> bi_status @@ got e] bi_status @@
I will fix this sparse warning in v2, along with any other review comments that come in.
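For reference, a minimal sketch of the kind of fix I have in mind (assuming the
nd_region flush callback keeps returning an int errno): convert the errno to a
blk_status_t instead of assigning it directly, e.g.

	if (bio->bi_opf & REQ_PREFLUSH)
		bio->bi_status = errno_to_blk_status(nd_region->flush(nd_region));

and similarly for the REQ_FUA case flagged at line 220.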
Thanks,
Pankaj
> drivers/nvdimm/pmem.c:204:32: expected restricted blk_status_t
> [usertype] bi_status
> drivers/nvdimm/pmem.c:204:32: got int
> drivers/nvdimm/pmem.c:208:9: sparse: expression using sizeof(void)
> drivers/nvdimm/pmem.c:208:9: sparse: expression using sizeof(void)
> include/linux/bvec.h:82:37: sparse: expression using sizeof(void)
> include/linux/bvec.h:82:37: sparse: expression using sizeof(void)
> include/linux/bvec.h:83:32: sparse: expression using sizeof(void)
> include/linux/bvec.h:83:32: sparse: expression using sizeof(void)
> drivers/nvdimm/pmem.c:220:32: sparse: incorrect type in assignment
> (different base types) @@ expected restricted blk_status_t [usertype]
> bi_status @@ got e] bi_status @@
> drivers/nvdimm/pmem.c:220:32: expected restricted blk_status_t
> [usertype] bi_status
> drivers/nvdimm/pmem.c:220:32: got int
>
> #
> https://github.com/0day-ci/linux/commit/69b95edd2a1f4676361988fa36866b59427e2cfa
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout 69b95edd2a1f4676361988fa36866b59427e2cfa
> vim +204 drivers/nvdimm/pmem.c
>
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 107
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 108 static void write_pmem(void *pmem_addr, struct page *page,
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 109 unsigned int off, unsigned int len)
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 110 {
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 111 unsigned int chunk;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 112 void *mem;
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 113
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 114 while (len) {
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 115 mem = kmap_atomic(page);
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 @116 chunk = min_t(unsigned int, len, PAGE_SIZE);
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 117 memcpy_flushcache(pmem_addr, mem + off, chunk);
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 118 kunmap_atomic(mem);
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 119 len -= chunk;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 120 off = 0;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 121 page++;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 122 pmem_addr += PAGE_SIZE;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 123 }
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 124 }
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 125
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 126 static blk_status_t read_pmem(struct page *page, unsigned int off,
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 127 void *pmem_addr, unsigned int len)
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 128 {
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 129 unsigned int chunk;
> 60622d682 drivers/nvdimm/pmem.c Dan Williams 2018-05-03 130 unsigned long rem;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 131 void *mem;
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 132
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 133 while (len) {
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 134 mem = kmap_atomic(page);
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 135 chunk = min_t(unsigned int, len, PAGE_SIZE);
> 60622d682 drivers/nvdimm/pmem.c Dan Williams 2018-05-03 136 rem = memcpy_mcsafe(mem + off, pmem_addr, chunk);
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 137 kunmap_atomic(mem);
> 60622d682 drivers/nvdimm/pmem.c Dan Williams 2018-05-03 138 if (rem)
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 139 return BLK_STS_IOERR;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 140 len -= chunk;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 141 off = 0;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 142 page++;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 143 pmem_addr += PAGE_SIZE;
> 98cc093cb drivers/nvdimm/pmem.c Huang Ying 2017-09-06 144 }
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 145 return BLK_STS_OK;
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 146 }
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 147
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 148 static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
> 3f289dcb4 drivers/nvdimm/pmem.c Tejun Heo 2018-07-18 149 unsigned int len, unsigned int off, unsigned int op,
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 150 sector_t sector)
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 151 {
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 152 blk_status_t rc = BLK_STS_OK;
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 153 bool bad_pmem = false;
> 32ab0a3f5 drivers/nvdimm/pmem.c Dan Williams 2015-08-01 154 phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
> 7a9eb2066 drivers/nvdimm/pmem.c Dan Williams 2016-06-03 155 void *pmem_addr = pmem->virt_addr + pmem_off;
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 156
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 157 if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 158 bad_pmem = true;
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 159
> 3f289dcb4 drivers/nvdimm/pmem.c Tejun Heo 2018-07-18 160 if (!op_is_write(op)) {
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 161 if (unlikely(bad_pmem))
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 162 rc = BLK_STS_IOERR;
> b5ebc8ec6 drivers/nvdimm/pmem.c Dan Williams 2016-03-06 163 else {
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 164 rc = read_pmem(page, off, pmem_addr, len);
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 165 flush_dcache_page(page);
> b5ebc8ec6 drivers/nvdimm/pmem.c Dan Williams 2016-03-06 166 }
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 167 } else {
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 168 /*
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 169 * Note that we write the data both before and after
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 170 * clearing poison. The write before clear poison
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 171 * handles situations where the latest written data is
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 172 * preserved and the clear poison operation simply marks
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 173 * the address range as valid without changing the data.
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 174 * In this case application software can assume that an
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 175 * interrupted write will either return the new good
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 176 * data or an error.
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 177 *
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 178 * However, if pmem_clear_poison() leaves the data in an
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 179 * indeterminate state we need to perform the write
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 180 * after clear poison.
> 0a370d261 drivers/nvdimm/pmem.c Dan Williams 2016-04-14 181 */
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 182 flush_dcache_page(page);
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 183 write_pmem(pmem_addr, page, off, len);
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 184 if (unlikely(bad_pmem)) {
> 3115bb02b drivers/nvdimm/pmem.c Toshi Kani 2016-10-13 185 rc = pmem_clear_poison(pmem, pmem_off, len);
> bd697a80c drivers/nvdimm/pmem.c Vishal Verma 2016-09-30 186 write_pmem(pmem_addr, page, off, len);
> 59e647398 drivers/nvdimm/pmem.c Dan Williams 2016-03-08 187 }
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 188 }
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 189
> b5ebc8ec6 drivers/nvdimm/pmem.c Dan Williams 2016-03-06 190 return rc;
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 191 }
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 192
> dece16353 drivers/nvdimm/pmem.c Jens Axboe 2015-11-05 193 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 194 {
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 195 blk_status_t rc = 0;
> f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 196 bool do_acct;
> f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 197 unsigned long start;
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 198 struct bio_vec bvec;
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 199 struct bvec_iter iter;
> bd842b8ca drivers/nvdimm/pmem.c Dan Williams 2016-03-18 200 struct pmem_device *pmem = q->queuedata;
> 7e267a8c7 drivers/nvdimm/pmem.c Dan Williams 2016-06-01 201 struct nd_region *nd_region = to_region(pmem);
> 7e267a8c7 drivers/nvdimm/pmem.c Dan Williams 2016-06-01 202
> d2d6364dc drivers/nvdimm/pmem.c Ross Zwisler 2018-06-06 203 if (bio->bi_opf & REQ_PREFLUSH)
> 69b95edd2 drivers/nvdimm/pmem.c Pankaj Gupta 2018-08-31 @204 bio->bi_status = nd_region->flush(nd_region);
> 69b95edd2 drivers/nvdimm/pmem.c Pankaj Gupta 2018-08-31 205
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 206
> f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 207 do_acct = nd_iostat_start(bio, &start);
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 208 bio_for_each_segment(bvec, bio, iter) {
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 209 rc = pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
> 3f289dcb4 drivers/nvdimm/pmem.c Tejun Heo 2018-07-18 210 bvec.bv_offset, bio_op(bio), iter.bi_sector);
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 211 if (rc) {
> 4e4cbee93 drivers/nvdimm/pmem.c Christoph Hellwig 2017-06-03 212 bio->bi_status = rc;
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 213 break;
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 214 }
> e10624f8c drivers/nvdimm/pmem.c Dan Williams 2016-01-06 215 }
> f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 216 if (do_acct)
> f0dc089ce drivers/nvdimm/pmem.c Dan Williams 2015-05-16 217 nd_iostat_end(bio, start);
> 61031952f drivers/nvdimm/pmem.c Ross Zwisler 2015-06-25 218
> 1eff9d322 drivers/nvdimm/pmem.c Jens Axboe 2016-08-05 219 if (bio->bi_opf & REQ_FUA)
> 69b95edd2 drivers/nvdimm/pmem.c Pankaj Gupta 2018-08-31 220 bio->bi_status = nd_region->flush(nd_region);
> 61031952f drivers/nvdimm/pmem.c Ross Zwisler 2015-06-25 221
> 4246a0b63 drivers/nvdimm/pmem.c Christoph Hellwig 2015-07-20 222 bio_endio(bio);
> dece16353 drivers/nvdimm/pmem.c Jens Axboe 2015-11-05 223 return BLK_QC_T_NONE;
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 224 }
> 9e853f231 drivers/block/pmem.c Ross Zwisler 2015-04-01 225
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
>
Hi Pankaj,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
[also build test ERROR on v4.19-rc2 next-20180905]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
base: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
drivers/virtio/virtio_pmem.o: In function `virtio_pmem_remove':
>> virtio_pmem.c:(.text+0x299): undefined reference to `nvdimm_bus_unregister'
drivers/virtio/virtio_pmem.o: In function `virtio_pmem_probe':
>> virtio_pmem.c:(.text+0x5e3): undefined reference to `nvdimm_bus_register'
>> virtio_pmem.c:(.text+0x62a): undefined reference to `nvdimm_pmem_region_create'
virtio_pmem.c:(.text+0x63b): undefined reference to `nvdimm_bus_unregister'
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Fri, 31 Aug 2018 19:00:18 +0530
Pankaj Gupta <[email protected]> wrote:
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/virtio/Kconfig | 9 ++
> drivers/virtio/Makefile | 1 +
> drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_pmem.h | 40 ++++++
> 5 files changed, 306 insertions(+)
> create mode 100644 drivers/virtio/virtio_pmem.c
> create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 3589764..a331e23 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
>
> If unsure, say Y.
>
> +config VIRTIO_PMEM
> + tristate "Support for virtio pmem driver"
> + depends on VIRTIO
> + help
> + This driver provides support for virtio based flushing interface
> + for persistent memory range.
> +
> + If unsure, say M.
> +
> config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5..cbe91c6 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 0000000..c22cc87
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio.h>
> +#include <linux/module.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/spinlock.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/nd.h>
> +
> +struct virtio_pmem_request {
> + /* Host return status corresponding to flush request */
> + int ret;
> +
> + /* command name*/
> + char name[16];
> +
> + /* Wait queue to process deferred work after ack from host */
> + wait_queue_head_t host_acked;
> + bool done;
> +
> + /* Wait queue to process deferred work after virt queue buffer avail */
> + wait_queue_head_t wq_buf;
> + bool wq_buf_avail;
> + struct list_head list;
> +};
> +
> +struct virtio_pmem {
> + struct virtio_device *vdev;
> +
> + /* Virtio pmem request queue */
> + struct virtqueue *req_vq;
> +
> + /* nvdimm bus registers virtio pmem device */
> + struct nvdimm_bus *nvdimm_bus;
> + struct nvdimm_bus_descriptor nd_desc;
> +
> + /* List to store deferred work if virtqueue is full */
> + struct list_head req_list;
> +
> + /* Synchronize virtqueue data */
> + spinlock_t pmem_lock;
> +
> + /* Memory region information */
> + uint64_t start;
> + uint64_t size;
> +};
> +
> +static struct virtio_device_id id_table[] = {
> + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> + { 0 },
> +};
> +
> + /* The interrupt handler */
> +static void host_ack(struct virtqueue *vq)
> +{
> + unsigned int len;
> + unsigned long flags;
> + struct virtio_pmem_request *req, *req_buf;
> + struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> + req->done = true;
> + wake_up(&req->host_acked);
> +
> + if (!list_empty(&vpmem->req_list)) {
> + req_buf = list_first_entry(&vpmem->req_list,
> + struct virtio_pmem_request, list);
> + list_del(&vpmem->req_list);
> + req_buf->wq_buf_avail = true;
> + wake_up(&req_buf->wq_buf);
> + }
> + }
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> + struct virtqueue *vq;
> +
> + /* single vq */
> + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> + host_ack, "flush_queue");
> + if (IS_ERR(vq))
> + return PTR_ERR(vq);
> +
> + spin_lock_init(&vpmem->pmem_lock);
> + INIT_LIST_HEAD(&vpmem->req_list);
> +
> + return 0;
> +};
> +
> + /* The request submission function */
> +static int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> + int err;
> + unsigned long flags;
> + struct scatterlist *sgs[2], sg, ret;
> + struct virtio_device *vdev =
> + dev_to_virtio(nd_region->dev.parent->parent);
> + struct virtio_pmem *vpmem = vdev->priv;
I'm missing a might_sleep() call in this function.
> + struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> +
> + if (!req)
> + return -ENOMEM;
> +
> + req->done = req->wq_buf_avail = false;
> + strcpy(req->name, "FLUSH");
> + init_waitqueue_head(&req->host_acked);
> + init_waitqueue_head(&req->wq_buf);
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + sg_init_one(&sg, req->name, strlen(req->name));
> + sgs[0] = &sg;
> + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> + sgs[1] = &ret;
It seems that sg_init_one() is only setting fields, in this
case you can move spin_lock_irqsave() here.
> + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> + if (err) {
> + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> + list_add_tail(&vpmem->req_list, &req->list);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->wq_buf, req->wq_buf_avail);
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
Is this error handling code assuming that at some point
virtqueue_add_sgs() will succeed for a different thread? If yes,
what happens if the assumption is false? That is, what happens if
virtqueue_add_sgs() never succeeds anymore?
Why not just return an error?
> + }
> + virtqueue_kick(vpmem->req_vq);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->host_acked, req->done);
> + err = req->ret;
If I'm understanding the QEMU code correctly, you're returning EIO
from QEMU if fsync() fails. I think this is wrong, since we don't know
if EIO in QEMU will be the same EIO in the guest. One way to solve this
would be to return 0 for success and 1 for failure from QEMU, and let the
guest implementation pick its error code (for your implementation it
could be EIO).
> + kfree(req);
> +
> + return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc;
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> + memset(&ndr_desc, 0, sizeof(ndr_desc));
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +
> + if (!nd_region)
> + goto out_nd;
> +
> + //virtio_device_ready(vdev);
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
> + vdev->config->del_vqs(vdev);
> + kfree(vpmem);
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static int virtio_pmem_freeze(struct virtio_device *vdev)
> +{
> + /* todo: handle freeze function */
> + return -EPERM;
> +}
> +
> +static int virtio_pmem_restore(struct virtio_device *vdev)
> +{
> + /* todo: handle restore function */
> + return -EPERM;
> +}
> +#endif
> +
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name = KBUILD_MODNAME,
> + .driver.owner = THIS_MODULE,
> + .id_table = id_table,
> + .probe = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +#ifdef CONFIG_PM_SLEEP
> + .freeze = virtio_pmem_freeze,
> + .restore = virtio_pmem_restore,
> +#endif
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2..3463895 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
>
> #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 0000000..c7c22a5
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible drivers/servers:
> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in the
> + * documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + * may be used to endorse or promote products derived from this software
> + * without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2018-2019
> + * Copyright (C) Pankaj Gupta <[email protected]>, 2018
> + */
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> + __le64 start;
> + __le64 size;
> +};
> +#endif
On Fri, 31 Aug 2018 19:00:19 +0530
Pankaj Gupta <[email protected]> wrote:
> This patch adds virtio-pmem Qemu device.
>
> This device presents memory address range information to guest
> which is backed by file backend type. It acts like persistent
> memory device for KVM guest. Guest can perform read and
> persistent write operations on this memory range with the help
> of DAX capable filesystem.
>
> Persistent guest writes are assured with the help of virtio
> based flushing interface. When guest userspace performs
> fsync on a file fd on the pmem device, a flush command is sent to
> Qemu over VIRTIO and a host side flush/sync is done on the backing
> image file.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> Changes from RFC v3:
> - Return EIO for host fsync failure instead of errno - Luiz, Stefan
> - Change version for inclusion to Qemu 3.1 - Eric
>
> Changes from RFC v2:
> - Use aio_worker() to avoid Qemu from hanging with blocking fsync
> call - Stefan
> - Use virtio_st*_p() for endianess - Stefan
> - Correct indentation in qapi/misc.json - Eric
>
> hw/virtio/Makefile.objs | 3 +
> hw/virtio/virtio-pci.c | 44 +++++
> hw/virtio/virtio-pci.h | 14 ++
> hw/virtio/virtio-pmem.c | 241 ++++++++++++++++++++++++++++
> include/hw/pci/pci.h | 1 +
> include/hw/virtio/virtio-pmem.h | 42 +++++
> include/standard-headers/linux/virtio_ids.h | 1 +
> qapi/misc.json | 26 ++-
> 8 files changed, 371 insertions(+), 1 deletion(-)
> create mode 100644 hw/virtio/virtio-pmem.c
> create mode 100644 include/hw/virtio/virtio-pmem.h
>
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 1b2799cfd8..7f914d45d0 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -10,6 +10,9 @@ obj-$(CONFIG_VIRTIO_CRYPTO) += virtio-crypto.o
> obj-$(call land,$(CONFIG_VIRTIO_CRYPTO),$(CONFIG_VIRTIO_PCI)) += virtio-crypto-pci.o
>
> obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> +ifeq ($(CONFIG_MEM_HOTPLUG),y)
> +obj-$(CONFIG_LINUX) += virtio-pmem.o
> +endif
> obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
> endif
>
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index 3a01fe90f0..93d3fc05c7 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2521,6 +2521,49 @@ static const TypeInfo virtio_rng_pci_info = {
> .class_init = virtio_rng_pci_class_init,
> };
>
> +/* virtio-pmem-pci */
> +
> +static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> +{
> + VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
> + DeviceState *vdev = DEVICE(&vpmem->vdev);
> +
> + qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> + object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> +}
> +
> +static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
> +{
> + DeviceClass *dc = DEVICE_CLASS(klass);
> + VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> + PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> + k->realize = virtio_pmem_pci_realize;
> + set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> + pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> + pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
> + pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> + pcidev_k->class_id = PCI_CLASS_OTHERS;
> +}
> +
> +static void virtio_pmem_pci_instance_init(Object *obj)
> +{
> + VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
> +
> + virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> + TYPE_VIRTIO_PMEM);
> + object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
> + &error_abort);
> +}
> +
> +static const TypeInfo virtio_pmem_pci_info = {
> + .name = TYPE_VIRTIO_PMEM_PCI,
> + .parent = TYPE_VIRTIO_PCI,
> + .instance_size = sizeof(VirtIOPMEMPCI),
> + .instance_init = virtio_pmem_pci_instance_init,
> + .class_init = virtio_pmem_pci_class_init,
> +};
> +
> +
> /* virtio-input-pci */
>
> static Property virtio_input_pci_properties[] = {
> @@ -2714,6 +2757,7 @@ static void virtio_pci_register_types(void)
> type_register_static(&virtio_balloon_pci_info);
> type_register_static(&virtio_serial_pci_info);
> type_register_static(&virtio_net_pci_info);
> + type_register_static(&virtio_pmem_pci_info);
> #ifdef CONFIG_VHOST_SCSI
> type_register_static(&vhost_scsi_pci_info);
> #endif
> diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> index 813082b0d7..fe74fcad3f 100644
> --- a/hw/virtio/virtio-pci.h
> +++ b/hw/virtio/virtio-pci.h
> @@ -19,6 +19,7 @@
> #include "hw/virtio/virtio-blk.h"
> #include "hw/virtio/virtio-net.h"
> #include "hw/virtio/virtio-rng.h"
> +#include "hw/virtio/virtio-pmem.h"
> #include "hw/virtio/virtio-serial.h"
> #include "hw/virtio/virtio-scsi.h"
> #include "hw/virtio/virtio-balloon.h"
> @@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
> typedef struct VirtIOGPUPCI VirtIOGPUPCI;
> typedef struct VHostVSockPCI VHostVSockPCI;
> typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
> +typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
>
> /* virtio-pci-bus */
>
> @@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
> VirtIOBlock vdev;
> };
>
> +/*
> + * virtio-pmem-pci: This extends VirtioPCIProxy.
> + */
> +#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
> +#define VIRTIO_PMEM_PCI(obj) \
> + OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
> +
> +struct VirtIOPMEMPCI {
> + VirtIOPCIProxy parent_obj;
> + VirtIOPMEM vdev;
> +};
> +
> /*
> * virtio-balloon-pci: This extends VirtioPCIProxy.
> */
> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> new file mode 100644
> index 0000000000..69ae4c0a50
> --- /dev/null
> +++ b/hw/virtio/virtio-pmem.c
> @@ -0,0 +1,241 @@
> +/*
> + * Virtio pmem device
> + *
> + * Copyright (C) 2018 Red Hat, Inc.
> + * Copyright (C) 2018 Pankaj Gupta <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu-common.h"
> +#include "qemu/error-report.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/virtio-pmem.h"
> +#include "hw/mem/memory-device.h"
> +#include "block/aio.h"
> +#include "block/thread-pool.h"
> +
> +typedef struct VirtIOPMEMresp {
> + int ret;
> +} VirtIOPMEMResp;
> +
> +typedef struct VirtIODeviceRequest {
> + VirtQueueElement elem;
> + int fd;
> + VirtIOPMEM *pmem;
> + VirtIOPMEMResp resp;
> +} VirtIODeviceRequest;
> +
> +static int worker_cb(void *opaque)
> +{
> + VirtIODeviceRequest *req = opaque;
> + int err = 0;
> +
> + /* flush raw backing image */
> + err = fsync(req->fd);
> + if (err != 0) {
> + err = EIO;
> + }
> + req->resp.ret = err;
As I mentioned in the kernel patch, I think you should return 1 for an error
and let the guest pick the error code it wants to return to the calling
thread.
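Something like this in worker_cb(), as a sketch of what I mean:

    /* flush raw backing image; report only a generic success/failure code */
    req->resp.ret = (fsync(req->fd) != 0) ? 1 : 0;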
> +
> + return 0;
> +}
> +
> +static void done_cb(void *opaque, int ret)
> +{
> + VirtIODeviceRequest *req = opaque;
> + int len = iov_from_buf(req->elem.in_sg, req->elem.in_num, 0,
> + &req->resp, sizeof(VirtIOPMEMResp));
> +
> + /* Callbacks are serialized, so no need to use atomic ops. */
> + virtqueue_push(req->pmem->rq_vq, &req->elem, len);
> + virtio_notify((VirtIODevice *)req->pmem, req->pmem->rq_vq);
> + g_free(req);
> +}
> +
> +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> +{
> + VirtIODeviceRequest *req;
> + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> + HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
> + ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> + req = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
> + if (!req) {
> + virtio_error(vdev, "virtio-pmem missing request data");
> + return;
> + }
> +
> + if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> + virtio_error(vdev, "virtio-pmem request not proper");
> + g_free(req);
> + return;
> + }
I think you should abort() on those errors.
> + req->fd = memory_region_get_fd(&backend->mr);
> + req->pmem = pmem;
> + thread_pool_submit_aio(pool, worker_cb, req, done_cb, req);
> +}
> +
> +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> +{
> + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> + struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *) config;
> +
> + virtio_stq_p(vdev, &pmemcfg->start, pmem->start);
> + virtio_stq_p(vdev, &pmemcfg->size, pmem->size);
> +}
> +
> +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t features,
> + Error **errp)
> +{
> + return features;
> +}
> +
> +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> +{
> + VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> + VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
> + MachineState *ms = MACHINE(qdev_get_machine());
> + uint64_t align;
> + Error *local_err = NULL;
> + MemoryRegion *mr;
> +
> + if (!pmem->memdev) {
> + error_setg(errp, "virtio-pmem memdev not set");
> + return;
> + }
> +
> + mr = host_memory_backend_get_memory(pmem->memdev);
> + align = memory_region_get_alignment(mr);
> + pmem->size = QEMU_ALIGN_DOWN(memory_region_size(mr), align);
> + pmem->start = memory_device_get_free_addr(ms, NULL, align, pmem->size,
> + &local_err);
> + if (local_err) {
> + error_setg(errp, "Can't get free address in mem device");
> + return;
> + }
> + memory_region_init_alias(&pmem->mr, OBJECT(pmem),
> + "virtio_pmem-memory", mr, 0, pmem->size);
> + memory_device_plug_region(ms, &pmem->mr, pmem->start);
> +
> + host_memory_backend_set_mapped(pmem->memdev, true);
> + virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> + sizeof(struct virtio_pmem_config));
> + pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> +}
> +
> +static void virtio_mem_check_memdev(Object *obj, const char *name, Object *val,
> + Error **errp)
> +{
> + if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
> + char *path = object_get_canonical_path_component(val);
> + error_setg(errp, "Can't use already busy memdev: %s", path);
> + g_free(path);
> + return;
> + }
> +
> + qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
> +}
> +
> +static const char *virtio_pmem_get_device_id(VirtIOPMEM *vm)
> +{
> + Object *obj = OBJECT(vm);
> + DeviceState *parent_dev;
> +
> + /* always use the ID of the proxy device */
> + if (obj->parent && object_dynamic_cast(obj->parent, TYPE_DEVICE)) {
> + parent_dev = DEVICE(obj->parent);
> + return parent_dev->id;
> + }
> + return NULL;
> +}
> +
> +static void virtio_pmem_md_fill_device_info(const MemoryDeviceState *md,
> + MemoryDeviceInfo *info)
> +{
> + VirtioPMemDeviceInfo *vi = g_new0(VirtioPMemDeviceInfo, 1);
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> + const char *id = virtio_pmem_get_device_id(vm);
> +
> + if (id) {
> + vi->has_id = true;
> + vi->id = g_strdup(id);
> + }
> +
> + vi->start = vm->start;
> + vi->size = vm->size;
> + vi->memdev = object_get_canonical_path(OBJECT(vm->memdev));
> +
> + info->u.virtio_pmem.data = vi;
> + info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_PMEM;
> +}
> +
> +static uint64_t virtio_pmem_md_get_addr(const MemoryDeviceState *md)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> +
> + return vm->start;
> +}
> +
> +static uint64_t virtio_pmem_md_get_plugged_size(const MemoryDeviceState *md)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> +
> + return vm->size;
> +}
> +
> +static uint64_t virtio_pmem_md_get_region_size(const MemoryDeviceState *md)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> +
> + return vm->size;
> +}
> +
> +static void virtio_pmem_instance_init(Object *obj)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(obj);
> + object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
> + (Object **)&vm->memdev,
> + (void *) virtio_mem_check_memdev,
> + OBJ_PROP_LINK_STRONG,
> + &error_abort);
> +}
> +
> +
> +static void virtio_pmem_class_init(ObjectClass *klass, void *data)
> +{
> + VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> + MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> +
> + vdc->realize = virtio_pmem_realize;
> + vdc->get_config = virtio_pmem_get_config;
> + vdc->get_features = virtio_pmem_get_features;
> +
> + mdc->get_addr = virtio_pmem_md_get_addr;
> + mdc->get_plugged_size = virtio_pmem_md_get_plugged_size;
> + mdc->get_region_size = virtio_pmem_md_get_region_size;
> + mdc->fill_device_info = virtio_pmem_md_fill_device_info;
> +}
> +
> +static TypeInfo virtio_pmem_info = {
> + .name = TYPE_VIRTIO_PMEM,
> + .parent = TYPE_VIRTIO_DEVICE,
> + .class_init = virtio_pmem_class_init,
> + .instance_size = sizeof(VirtIOPMEM),
> + .instance_init = virtio_pmem_instance_init,
> + .interfaces = (InterfaceInfo[]) {
> + { TYPE_MEMORY_DEVICE },
> + { }
> + },
> +};
> +
> +static void virtio_register_types(void)
> +{
> + type_register_static(&virtio_pmem_info);
> +}
> +
> +type_init(virtio_register_types)
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 990d6fcbde..28829b6437 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -85,6 +85,7 @@ extern bool pci_available;
> #define PCI_DEVICE_ID_VIRTIO_RNG 0x1005
> #define PCI_DEVICE_ID_VIRTIO_9P 0x1009
> #define PCI_DEVICE_ID_VIRTIO_VSOCK 0x1012
> +#define PCI_DEVICE_ID_VIRTIO_PMEM 0x1013
>
> #define PCI_VENDOR_ID_REDHAT 0x1b36
> #define PCI_DEVICE_ID_REDHAT_BRIDGE 0x0001
> diff --git a/include/hw/virtio/virtio-pmem.h b/include/hw/virtio/virtio-pmem.h
> new file mode 100644
> index 0000000000..fda3ee691c
> --- /dev/null
> +++ b/include/hw/virtio/virtio-pmem.h
> @@ -0,0 +1,42 @@
> +/*
> + * Virtio pmem Device
> + *
> + * Copyright Red Hat, Inc. 2018
> + * Copyright Pankaj Gupta <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * (at your option) any later version. See the COPYING file in the
> + * top-level directory.
> + */
> +
> +#ifndef QEMU_VIRTIO_PMEM_H
> +#define QEMU_VIRTIO_PMEM_H
> +
> +#include "hw/virtio/virtio.h"
> +#include "exec/memory.h"
> +#include "sysemu/hostmem.h"
> +#include "standard-headers/linux/virtio_ids.h"
> +#include "hw/boards.h"
> +#include "hw/i386/pc.h"
> +
> +#define TYPE_VIRTIO_PMEM "virtio-pmem"
> +
> +#define VIRTIO_PMEM(obj) \
> + OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
> +
> +/* VirtIOPMEM device structure */
> +typedef struct VirtIOPMEM {
> + VirtIODevice parent_obj;
> +
> + VirtQueue *rq_vq;
> + uint64_t start;
> + uint64_t size;
> + MemoryRegion mr;
> + HostMemoryBackend *memdev;
> +} VirtIOPMEM;
> +
> +struct virtio_pmem_config {
> + uint64_t start;
> + uint64_t size;
> +};
> +#endif
> diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
> index 6d5c3b2d4f..346389565a 100644
> --- a/include/standard-headers/linux/virtio_ids.h
> +++ b/include/standard-headers/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
>
> #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/qapi/misc.json b/qapi/misc.json
> index d450cfef21..517376b866 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -2907,6 +2907,29 @@
> }
> }
>
> +##
> +# @VirtioPMemDeviceInfo:
> +#
> +# VirtioPMem state information
> +#
> +# @id: device's ID
> +#
> +# @start: physical address, where device is mapped
> +#
> +# @size: size of memory that the device provides
> +#
> +# @memdev: memory backend linked with device
> +#
> +# Since: 3.1
> +##
> +{ 'struct': 'VirtioPMemDeviceInfo',
> + 'data': { '*id': 'str',
> + 'start': 'size',
> + 'size': 'size',
> + 'memdev': 'str'
> + }
> +}
> +
> ##
> # @MemoryDeviceInfo:
> #
> @@ -2916,7 +2939,8 @@
> ##
> { 'union': 'MemoryDeviceInfo',
> 'data': { 'dimm': 'PCDIMMDeviceInfo',
> - 'nvdimm': 'PCDIMMDeviceInfo'
> + 'nvdimm': 'PCDIMMDeviceInfo',
> + 'virtio-pmem': 'VirtioPMemDeviceInfo'
> }
> }
>
Hi Luiz,
Thanks for the review.
>
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/virtio/Kconfig | 9 ++
> > drivers/virtio/Makefile | 1 +
> > drivers/virtio/virtio_pmem.c | 255
> > +++++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 40 ++++++
> > 5 files changed, 306 insertions(+)
> > create mode 100644 drivers/virtio/virtio_pmem.c
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 3589764..a331e23 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> >
> > If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > + tristate "Support for virtio pmem driver"
> > + depends on VIRTIO
> > + help
> > + This driver provides support for virtio based flushing interface
> > + for persistent memory range.
> > +
> > + If unsure, say M.
> > +
> > config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5..cbe91c6 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 0000000..c22cc87
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,255 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/nd.h>
> > +
> > +struct virtio_pmem_request {
> > + /* Host return status corresponding to flush request */
> > + int ret;
> > +
> > + /* command name*/
> > + char name[16];
> > +
> > + /* Wait queue to process deferred work after ack from host */
> > + wait_queue_head_t host_acked;
> > + bool done;
> > +
> > + /* Wait queue to process deferred work after virt queue buffer avail */
> > + wait_queue_head_t wq_buf;
> > + bool wq_buf_avail;
> > + struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > + struct virtio_device *vdev;
> > +
> > + /* Virtio pmem request queue */
> > + struct virtqueue *req_vq;
> > +
> > + /* nvdimm bus registers virtio pmem device */
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nvdimm_bus_descriptor nd_desc;
> > +
> > + /* List to store deferred work if virtqueue is full */
> > + struct list_head req_list;
> > +
> > + /* Synchronize virtqueue data */
> > + spinlock_t pmem_lock;
> > +
> > + /* Memory region information */
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* The interrupt handler */
> > +static void host_ack(struct virtqueue *vq)
> > +{
> > + unsigned int len;
> > + unsigned long flags;
> > + struct virtio_pmem_request *req, *req_buf;
> > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > + req->done = true;
> > + wake_up(&req->host_acked);
> > +
> > + if (!list_empty(&vpmem->req_list)) {
> > + req_buf = list_first_entry(&vpmem->req_list,
> > + struct virtio_pmem_request, list);
> > + list_del(&vpmem->req_list);
> > + req_buf->wq_buf_avail = true;
> > + wake_up(&req_buf->wq_buf);
> > + }
> > + }
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > + struct virtqueue *vq;
> > +
> > + /* single vq */
> > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vq))
> > + return PTR_ERR(vq);
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > + /* The request submission function */
> > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > + int err;
> > + unsigned long flags;
> > + struct scatterlist *sgs[2], sg, ret;
> > + struct virtio_device *vdev =
> > + dev_to_virtio(nd_region->dev.parent->parent);
> > + struct virtio_pmem *vpmem = vdev->priv;
>
> I'm missing a might_sleep() call in this function.
I am not sure we strictly need might_sleep() here?
We could add it as a debugging aid, to catch any caller that ends up invoking
this function from atomic context, since it can sleep.
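If we do add it, it would just be a one-line annotation at the top of the
request path, e.g. (a sketch):

	static int virtio_pmem_flush(struct nd_region *nd_region)
	{
		...
		might_sleep();	/* GFP_KERNEL allocation and wait_event() below can sleep */
		req = kmalloc(sizeof(*req), GFP_KERNEL);
		...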
>
> > + struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > +
> > + if (!req)
> > + return -ENOMEM;
> > +
> > + req->done = req->wq_buf_avail = false;
> > + strcpy(req->name, "FLUSH");
> > + init_waitqueue_head(&req->host_acked);
> > + init_waitqueue_head(&req->wq_buf);
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + sg_init_one(&sg, req->name, strlen(req->name));
> > + sgs[0] = &sg;
> > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > + sgs[1] = &ret;
>
> It seems that sg_init_one() is only setting fields, in this
> case you can move spin_lock_irqsave() here.
yes, will move spin_lock_irqsave here.
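i.e. something like (sketch of the reordering):

	sg_init_one(&sg, req->name, strlen(req->name));
	sgs[0] = &sg;
	sg_init_one(&ret, &req->ret, sizeof(req->ret));
	sgs[1] = &ret;

	spin_lock_irqsave(&vpmem->pmem_lock, flags);
	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);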
>
> > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > + if (err) {
> > + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > +
> > + list_add_tail(&vpmem->req_list, &req->list);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->wq_buf, req->wq_buf_avail);
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
>
> Is this error handling code assuming that at some point
> virtqueue_add_sgs() will succeed for a different thread? If yes,
> what happens if the assumption is false? That is, what happens if
> virtqueue_add_sgs() never succeeds anymore?
If virtqueue_add_sgs() does not succeed, the corresponding thread should wait,
and all subsequent calling threads should wait as well. As soon as the host
returns the first used buffer and a queue entry becomes free, the first waiting
thread is woken up. In the worst case, if Qemu never returns any used buffers,
multiple threads will keep waiting.
>
> Why not just return an error?
As per Stefan's suggestion in the previous discussion: if the virtqueue is full,
printing a message and failing the flush isn't appropriate. The thread needs to
wait until virtqueue space becomes available.
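The intended flow is roughly the following (a sketch of the behaviour described
above, not the exact code posted here; virtqueue_add_sgs() returns -ENOSPC when
the ring is full):

	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
	while (err == -ENOSPC) {
		/* ring full: park this request and wait for host_ack() to free a slot */
		list_add_tail(&req->list, &vpmem->req_list);
		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
		wait_event(req->wq_buf, req->wq_buf_avail);
		spin_lock_irqsave(&vpmem->pmem_lock, flags);
		req->wq_buf_avail = false;
		err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
	}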
>
> > + }
> > + virtqueue_kick(vpmem->req_vq);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->host_acked, req->done);
> > + err = req->ret;
>
> If I'm understanding the QEMU code correctly, you're returning EIO
> from QEMU if fsync() fails. I think this is wrong, since we don't know
> if EIO in QEMU will be the same EIO in the guest. One way to solve this
> would be to return 0 for success and 1 for failure from QEMU, and let the
> guest implementation pick its error code (for your implementation it
> could be EIO).
Makes sense, will change this.
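On the guest side it would then look roughly like this (sketch; assuming the
host reports 0 for success and non-zero for failure):

        /* When host has read buffer, this completes via host_ack */
        wait_event(req->host_acked, req->done);
        err = req->ret ? -EIO : 0;      /* guest picks its own error code */

        kfree(req);
        return err;
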
Thanks,
Pankaj
>
> > + kfree(req);
> > +
> > + return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nd_region_desc ndr_desc;
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > + GFP_KERNEL);
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
> > + if (!nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > + memset(&ndr_desc, 0, sizeof(ndr_desc));
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = virtio_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +
> > + if (!nd_region)
> > + goto out_nd;
> > +
> > + //virtio_device_ready(vdev);
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
> > + vdev->config->del_vqs(vdev);
> > + kfree(vpmem);
> > +}
> > +
> > +#ifdef CONFIG_PM_SLEEP
> > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > +{
> > + /* todo: handle freeze function */
> > + return -EPERM;
> > +}
> > +
> > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > +{
> > + /* todo: handle restore function */
> > + return -EPERM;
> > +}
> > +#endif
> > +
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +#ifdef CONFIG_PM_SLEEP
> > + .freeze = virtio_pmem_freeze,
> > + .restore = virtio_pmem_restore,
> > +#endif
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2..3463895 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> >
> > #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 0000000..c7c22a5
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible drivers/servers:
> > + *
> > + *
> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + * notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + * notice, this list of conditions and the following disclaimer in the
> > + * documentation and/or other materials provided with the distribution.
> > + * 3. Neither the name of IBM nor the names of its contributors
> > + * may be used to endorse or promote products derived from this
> > software
> > + * without specific prior written permission.
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > ``AS IS''
> > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> > THE
> > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > PURPOSE
> > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > STRICT
> > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
> > WAY
> > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + *
> > + * Copyright (C) Red Hat, Inc., 2018-2019
> > + * Copyright (C) Pankaj Gupta <[email protected]>, 2018
> > + */
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > + __le64 start;
> > + __le64 size;
> > +};
> > +#endif
>
>
>
>
> > This patch adds virtio-pmem Qemu device.
> >
> > This device presents memory address range information to guest
> > which is backed by file backend type. It acts like persistent
> > memory device for KVM guest. Guest can perform read and
> > persistent write operations on this memory range with the help
> > of DAX capable filesystem.
> >
> > Persistent guest writes are assured with the help of virtio
> > based flushing interface. When guest userspace space performs
> > fsync on file fd on pmem device, a flush command is send to
> > Qemu over VIRTIO and host side flush/sync is done on backing
> > image file.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > Changes from RFC v3:
> > - Return EIO for host fsync failure instead of errno - Luiz, Stefan
> > - Change version for inclusion to Qemu 3.1 - Eric
> >
> > Changes from RFC v2:
> > - Use aio_worker() to avoid Qemu from hanging with blocking fsync
> > call - Stefan
> > - Use virtio_st*_p() for endianess - Stefan
> > - Correct indentation in qapi/misc.json - Eric
> >
> > hw/virtio/Makefile.objs | 3 +
> > hw/virtio/virtio-pci.c | 44 +++++
> > hw/virtio/virtio-pci.h | 14 ++
> > hw/virtio/virtio-pmem.c | 241
> > ++++++++++++++++++++++++++++
> > include/hw/pci/pci.h | 1 +
> > include/hw/virtio/virtio-pmem.h | 42 +++++
> > include/standard-headers/linux/virtio_ids.h | 1 +
> > qapi/misc.json | 26 ++-
> > 8 files changed, 371 insertions(+), 1 deletion(-)
> > create mode 100644 hw/virtio/virtio-pmem.c
> > create mode 100644 include/hw/virtio/virtio-pmem.h
> >
> > diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> > index 1b2799cfd8..7f914d45d0 100644
> > --- a/hw/virtio/Makefile.objs
> > +++ b/hw/virtio/Makefile.objs
> > @@ -10,6 +10,9 @@ obj-$(CONFIG_VIRTIO_CRYPTO) += virtio-crypto.o
> > obj-$(call land,$(CONFIG_VIRTIO_CRYPTO),$(CONFIG_VIRTIO_PCI)) +=
> > virtio-crypto-pci.o
> >
> > obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> > +ifeq ($(CONFIG_MEM_HOTPLUG),y)
> > +obj-$(CONFIG_LINUX) += virtio-pmem.o
> > +endif
> > obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
> > endif
> >
> > diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> > index 3a01fe90f0..93d3fc05c7 100644
> > --- a/hw/virtio/virtio-pci.c
> > +++ b/hw/virtio/virtio-pci.c
> > @@ -2521,6 +2521,49 @@ static const TypeInfo virtio_rng_pci_info = {
> > .class_init = virtio_rng_pci_class_init,
> > };
> >
> > +/* virtio-pmem-pci */
> > +
> > +static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error
> > **errp)
> > +{
> > + VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
> > + DeviceState *vdev = DEVICE(&vpmem->vdev);
> > +
> > + qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> > + object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> > +}
> > +
> > +static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
> > +{
> > + DeviceClass *dc = DEVICE_CLASS(klass);
> > + VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> > + PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> > + k->realize = virtio_pmem_pci_realize;
> > + set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> > + pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> > + pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
> > + pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> > + pcidev_k->class_id = PCI_CLASS_OTHERS;
> > +}
> > +
> > +static void virtio_pmem_pci_instance_init(Object *obj)
> > +{
> > + VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
> > +
> > + virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> > + TYPE_VIRTIO_PMEM);
> > + object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
> > + &error_abort);
> > +}
> > +
> > +static const TypeInfo virtio_pmem_pci_info = {
> > + .name = TYPE_VIRTIO_PMEM_PCI,
> > + .parent = TYPE_VIRTIO_PCI,
> > + .instance_size = sizeof(VirtIOPMEMPCI),
> > + .instance_init = virtio_pmem_pci_instance_init,
> > + .class_init = virtio_pmem_pci_class_init,
> > +};
> > +
> > +
> > /* virtio-input-pci */
> >
> > static Property virtio_input_pci_properties[] = {
> > @@ -2714,6 +2757,7 @@ static void virtio_pci_register_types(void)
> > type_register_static(&virtio_balloon_pci_info);
> > type_register_static(&virtio_serial_pci_info);
> > type_register_static(&virtio_net_pci_info);
> > + type_register_static(&virtio_pmem_pci_info);
> > #ifdef CONFIG_VHOST_SCSI
> > type_register_static(&vhost_scsi_pci_info);
> > #endif
> > diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> > index 813082b0d7..fe74fcad3f 100644
> > --- a/hw/virtio/virtio-pci.h
> > +++ b/hw/virtio/virtio-pci.h
> > @@ -19,6 +19,7 @@
> > #include "hw/virtio/virtio-blk.h"
> > #include "hw/virtio/virtio-net.h"
> > #include "hw/virtio/virtio-rng.h"
> > +#include "hw/virtio/virtio-pmem.h"
> > #include "hw/virtio/virtio-serial.h"
> > #include "hw/virtio/virtio-scsi.h"
> > #include "hw/virtio/virtio-balloon.h"
> > @@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
> > typedef struct VirtIOGPUPCI VirtIOGPUPCI;
> > typedef struct VHostVSockPCI VHostVSockPCI;
> > typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
> > +typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
> >
> > /* virtio-pci-bus */
> >
> > @@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
> > VirtIOBlock vdev;
> > };
> >
> > +/*
> > + * virtio-pmem-pci: This extends VirtioPCIProxy.
> > + */
> > +#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
> > +#define VIRTIO_PMEM_PCI(obj) \
> > + OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
> > +
> > +struct VirtIOPMEMPCI {
> > + VirtIOPCIProxy parent_obj;
> > + VirtIOPMEM vdev;
> > +};
> > +
> > /*
> > * virtio-balloon-pci: This extends VirtioPCIProxy.
> > */
> > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > new file mode 100644
> > index 0000000000..69ae4c0a50
> > --- /dev/null
> > +++ b/hw/virtio/virtio-pmem.c
> > @@ -0,0 +1,241 @@
> > +/*
> > + * Virtio pmem device
> > + *
> > + * Copyright (C) 2018 Red Hat, Inc.
> > + * Copyright (C) 2018 Pankaj Gupta <[email protected]>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "qemu-common.h"
> > +#include "qemu/error-report.h"
> > +#include "hw/virtio/virtio-access.h"
> > +#include "hw/virtio/virtio-pmem.h"
> > +#include "hw/mem/memory-device.h"
> > +#include "block/aio.h"
> > +#include "block/thread-pool.h"
> > +
> > +typedef struct VirtIOPMEMresp {
> > + int ret;
> > +} VirtIOPMEMResp;
> > +
> > +typedef struct VirtIODeviceRequest {
> > + VirtQueueElement elem;
> > + int fd;
> > + VirtIOPMEM *pmem;
> > + VirtIOPMEMResp resp;
> > +} VirtIODeviceRequest;
> > +
> > +static int worker_cb(void *opaque)
> > +{
> > + VirtIODeviceRequest *req = opaque;
> > + int err = 0;
> > +
> > + /* flush raw backing image */
> > + err = fsync(req->fd);
> > + if (err != 0) {
> > + err = EIO;
> > + }
> > + req->resp.ret = err;
>
> As I mentioned in the kernel patch, I think you should return 1
> for error and let the guest pick the error it wants to return to
> the calling thread.
Sure.
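Something like this for worker_cb() then (sketch; 0 for success, 1 for failure,
and the guest maps that to its own errno):

static int worker_cb(void *opaque)
{
    VirtIODeviceRequest *req = opaque;

    /* flush raw backing image; report only success (0) or failure (1) */
    req->resp.ret = (fsync(req->fd) != 0) ? 1 : 0;

    return 0;
}
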
>
> > +
> > + return 0;
> > +}
> > +
> > +static void done_cb(void *opaque, int ret)
> > +{
> > + VirtIODeviceRequest *req = opaque;
> > + int len = iov_from_buf(req->elem.in_sg, req->elem.in_num, 0,
> > + &req->resp, sizeof(VirtIOPMEMResp));
> > +
> > + /* Callbacks are serialized, so no need to use atomic ops. */
> > + virtqueue_push(req->pmem->rq_vq, &req->elem, len);
> > + virtio_notify((VirtIODevice *)req->pmem, req->pmem->rq_vq);
> > + g_free(req);
> > +}
> > +
> > +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> > +{
> > + VirtIODeviceRequest *req;
> > + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > + HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
> > + ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> > +
> > + req = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
> > + if (!req) {
> > + virtio_error(vdev, "virtio-pmem missing request data");
> > + return;
> > + }
> > +
> > + if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> > + virtio_error(vdev, "virtio-pmem request not proper");
> > + g_free(req);
> > + return;
> > + }
>
> I think you should abort() in those errors.
I just skimmed over how other devices handle such errors (virtio_blk & virtio_scsi):
none of them aborts?
if (req->elem.out_num < 1 || req->elem.in_num < 1) {
virtio_error(vdev, "virtio-blk missing headers");
return -1;
}
Thanks,
Pankaj
>
> > + req->fd = memory_region_get_fd(&backend->mr);
> > + req->pmem = pmem;
> > + thread_pool_submit_aio(pool, worker_cb, req, done_cb, req);
> > +}
> > +
> > +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> > +{
> > + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > + struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *)
> > config;
> > +
> > + virtio_stq_p(vdev, &pmemcfg->start, pmem->start);
> > + virtio_stq_p(vdev, &pmemcfg->size, pmem->size);
> > +}
> > +
> > +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t
> > features,
> > + Error **errp)
> > +{
> > + return features;
> > +}
> > +
> > +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> > +{
> > + VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> > + VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
> > + MachineState *ms = MACHINE(qdev_get_machine());
> > + uint64_t align;
> > + Error *local_err = NULL;
> > + MemoryRegion *mr;
> > +
> > + if (!pmem->memdev) {
> > + error_setg(errp, "virtio-pmem memdev not set");
> > + return;
> > + }
> > +
> > + mr = host_memory_backend_get_memory(pmem->memdev);
> > + align = memory_region_get_alignment(mr);
> > + pmem->size = QEMU_ALIGN_DOWN(memory_region_size(mr), align);
> > + pmem->start = memory_device_get_free_addr(ms, NULL, align, pmem->size,
> > +
> > &local_err);
> > + if (local_err) {
> > + error_setg(errp, "Can't get free address in mem device");
> > + return;
> > + }
> > + memory_region_init_alias(&pmem->mr, OBJECT(pmem),
> > + "virtio_pmem-memory", mr, 0, pmem->size);
> > + memory_device_plug_region(ms, &pmem->mr, pmem->start);
> > +
> > + host_memory_backend_set_mapped(pmem->memdev, true);
> > + virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > + sizeof(struct
> > virtio_pmem_config));
> > + pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> > +}
> > +
> > +static void virtio_mem_check_memdev(Object *obj, const char *name, Object
> > *val,
> > + Error **errp)
> > +{
> > + if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
> > + char *path = object_get_canonical_path_component(val);
> > + error_setg(errp, "Can't use already busy memdev: %s", path);
> > + g_free(path);
> > + return;
> > + }
> > +
> > + qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
> > +}
> > +
> > +static const char *virtio_pmem_get_device_id(VirtIOPMEM *vm)
> > +{
> > + Object *obj = OBJECT(vm);
> > + DeviceState *parent_dev;
> > +
> > + /* always use the ID of the proxy device */
> > + if (obj->parent && object_dynamic_cast(obj->parent, TYPE_DEVICE)) {
> > + parent_dev = DEVICE(obj->parent);
> > + return parent_dev->id;
> > + }
> > + return NULL;
> > +}
> > +
> > +static void virtio_pmem_md_fill_device_info(const MemoryDeviceState *md,
> > + MemoryDeviceInfo *info)
> > +{
> > + VirtioPMemDeviceInfo *vi = g_new0(VirtioPMemDeviceInfo, 1);
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > + const char *id = virtio_pmem_get_device_id(vm);
> > +
> > + if (id) {
> > + vi->has_id = true;
> > + vi->id = g_strdup(id);
> > + }
> > +
> > + vi->start = vm->start;
> > + vi->size = vm->size;
> > + vi->memdev = object_get_canonical_path(OBJECT(vm->memdev));
> > +
> > + info->u.virtio_pmem.data = vi;
> > + info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_PMEM;
> > +}
> > +
> > +static uint64_t virtio_pmem_md_get_addr(const MemoryDeviceState *md)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > +
> > + return vm->start;
> > +}
> > +
> > +static uint64_t virtio_pmem_md_get_plugged_size(const MemoryDeviceState
> > *md)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > +
> > + return vm->size;
> > +}
> > +
> > +static uint64_t virtio_pmem_md_get_region_size(const MemoryDeviceState
> > *md)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > +
> > + return vm->size;
> > +}
> > +
> > +static void virtio_pmem_instance_init(Object *obj)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(obj);
> > + object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
> > + (Object **)&vm->memdev,
> > + (void *) virtio_mem_check_memdev,
> > + OBJ_PROP_LINK_STRONG,
> > + &error_abort);
> > +}
> > +
> > +
> > +static void virtio_pmem_class_init(ObjectClass *klass, void *data)
> > +{
> > + VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> > + MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> > +
> > + vdc->realize = virtio_pmem_realize;
> > + vdc->get_config = virtio_pmem_get_config;
> > + vdc->get_features = virtio_pmem_get_features;
> > +
> > + mdc->get_addr = virtio_pmem_md_get_addr;
> > + mdc->get_plugged_size = virtio_pmem_md_get_plugged_size;
> > + mdc->get_region_size = virtio_pmem_md_get_region_size;
> > + mdc->fill_device_info = virtio_pmem_md_fill_device_info;
> > +}
> > +
> > +static TypeInfo virtio_pmem_info = {
> > + .name = TYPE_VIRTIO_PMEM,
> > + .parent = TYPE_VIRTIO_DEVICE,
> > + .class_init = virtio_pmem_class_init,
> > + .instance_size = sizeof(VirtIOPMEM),
> > + .instance_init = virtio_pmem_instance_init,
> > + .interfaces = (InterfaceInfo[]) {
> > + { TYPE_MEMORY_DEVICE },
> > + { }
> > + },
> > +};
> > +
> > +static void virtio_register_types(void)
> > +{
> > + type_register_static(&virtio_pmem_info);
> > +}
> > +
> > +type_init(virtio_register_types)
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index 990d6fcbde..28829b6437 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -85,6 +85,7 @@ extern bool pci_available;
> > #define PCI_DEVICE_ID_VIRTIO_RNG 0x1005
> > #define PCI_DEVICE_ID_VIRTIO_9P 0x1009
> > #define PCI_DEVICE_ID_VIRTIO_VSOCK 0x1012
> > +#define PCI_DEVICE_ID_VIRTIO_PMEM 0x1013
> >
> > #define PCI_VENDOR_ID_REDHAT 0x1b36
> > #define PCI_DEVICE_ID_REDHAT_BRIDGE 0x0001
> > diff --git a/include/hw/virtio/virtio-pmem.h
> > b/include/hw/virtio/virtio-pmem.h
> > new file mode 100644
> > index 0000000000..fda3ee691c
> > --- /dev/null
> > +++ b/include/hw/virtio/virtio-pmem.h
> > @@ -0,0 +1,42 @@
> > +/*
> > + * Virtio pmem Device
> > + *
> > + * Copyright Red Hat, Inc. 2018
> > + * Copyright Pankaj Gupta <[email protected]>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or
> > + * (at your option) any later version. See the COPYING file in the
> > + * top-level directory.
> > + */
> > +
> > +#ifndef QEMU_VIRTIO_PMEM_H
> > +#define QEMU_VIRTIO_PMEM_H
> > +
> > +#include "hw/virtio/virtio.h"
> > +#include "exec/memory.h"
> > +#include "sysemu/hostmem.h"
> > +#include "standard-headers/linux/virtio_ids.h"
> > +#include "hw/boards.h"
> > +#include "hw/i386/pc.h"
> > +
> > +#define TYPE_VIRTIO_PMEM "virtio-pmem"
> > +
> > +#define VIRTIO_PMEM(obj) \
> > + OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
> > +
> > +/* VirtIOPMEM device structure */
> > +typedef struct VirtIOPMEM {
> > + VirtIODevice parent_obj;
> > +
> > + VirtQueue *rq_vq;
> > + uint64_t start;
> > + uint64_t size;
> > + MemoryRegion mr;
> > + HostMemoryBackend *memdev;
> > +} VirtIOPMEM;
> > +
> > +struct virtio_pmem_config {
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +#endif
> > diff --git a/include/standard-headers/linux/virtio_ids.h
> > b/include/standard-headers/linux/virtio_ids.h
> > index 6d5c3b2d4f..346389565a 100644
> > --- a/include/standard-headers/linux/virtio_ids.h
> > +++ b/include/standard-headers/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> >
> > #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/qapi/misc.json b/qapi/misc.json
> > index d450cfef21..517376b866 100644
> > --- a/qapi/misc.json
> > +++ b/qapi/misc.json
> > @@ -2907,6 +2907,29 @@
> > }
> > }
> >
> > +##
> > +# @VirtioPMemDeviceInfo:
> > +#
> > +# VirtioPMem state information
> > +#
> > +# @id: device's ID
> > +#
> > +# @start: physical address, where device is mapped
> > +#
> > +# @size: size of memory that the device provides
> > +#
> > +# @memdev: memory backend linked with device
> > +#
> > +# Since: 3.1
> > +##
> > +{ 'struct': 'VirtioPMemDeviceInfo',
> > + 'data': { '*id': 'str',
> > + 'start': 'size',
> > + 'size': 'size',
> > + 'memdev': 'str'
> > + }
> > +}
> > +
> > ##
> > # @MemoryDeviceInfo:
> > #
> > @@ -2916,7 +2939,8 @@
> > ##
> > { 'union': 'MemoryDeviceInfo',
> > 'data': { 'dimm': 'PCDIMMDeviceInfo',
> > - 'nvdimm': 'PCDIMMDeviceInfo'
> > + 'nvdimm': 'PCDIMMDeviceInfo',
> > + 'virtio-pmem': 'VirtioPMemDeviceInfo'
> > }
> > }
> >
>
>
On Thu, 13 Sep 2018 02:58:21 -0400 (EDT)
Pankaj Gupta <[email protected]> wrote:
> Hi Luiz,
>
> Thanks for the review.
>
> >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > ---
> > > drivers/virtio/Kconfig | 9 ++
> > > drivers/virtio/Makefile | 1 +
> > > drivers/virtio/virtio_pmem.c | 255
> > > +++++++++++++++++++++++++++++++++++++++
> > > include/uapi/linux/virtio_ids.h | 1 +
> > > include/uapi/linux/virtio_pmem.h | 40 ++++++
> > > 5 files changed, 306 insertions(+)
> > > create mode 100644 drivers/virtio/virtio_pmem.c
> > > create mode 100644 include/uapi/linux/virtio_pmem.h
> > >
> > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > index 3589764..a331e23 100644
> > > --- a/drivers/virtio/Kconfig
> > > +++ b/drivers/virtio/Kconfig
> > > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> > >
> > > If unsure, say Y.
> > >
> > > +config VIRTIO_PMEM
> > > + tristate "Support for virtio pmem driver"
> > > + depends on VIRTIO
> > > + help
> > > + This driver provides support for virtio based flushing interface
> > > + for persistent memory range.
> > > +
> > > + If unsure, say M.
> > > +
> > > config VIRTIO_BALLOON
> > > tristate "Virtio balloon driver"
> > > depends on VIRTIO
> > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > > index 3a2b5c5..cbe91c6 100644
> > > --- a/drivers/virtio/Makefile
> > > +++ b/drivers/virtio/Makefile
> > > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > > new file mode 100644
> > > index 0000000..c22cc87
> > > --- /dev/null
> > > +++ b/drivers/virtio/virtio_pmem.c
> > > @@ -0,0 +1,255 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include <linux/virtio.h>
> > > +#include <linux/module.h>
> > > +#include <linux/virtio_ids.h>
> > > +#include <linux/virtio_config.h>
> > > +#include <uapi/linux/virtio_pmem.h>
> > > +#include <linux/spinlock.h>
> > > +#include <linux/libnvdimm.h>
> > > +#include <linux/nd.h>
> > > +
> > > +struct virtio_pmem_request {
> > > + /* Host return status corresponding to flush request */
> > > + int ret;
> > > +
> > > + /* command name*/
> > > + char name[16];
> > > +
> > > + /* Wait queue to process deferred work after ack from host */
> > > + wait_queue_head_t host_acked;
> > > + bool done;
> > > +
> > > + /* Wait queue to process deferred work after virt queue buffer avail */
> > > + wait_queue_head_t wq_buf;
> > > + bool wq_buf_avail;
> > > + struct list_head list;
> > > +};
> > > +
> > > +struct virtio_pmem {
> > > + struct virtio_device *vdev;
> > > +
> > > + /* Virtio pmem request queue */
> > > + struct virtqueue *req_vq;
> > > +
> > > + /* nvdimm bus registers virtio pmem device */
> > > + struct nvdimm_bus *nvdimm_bus;
> > > + struct nvdimm_bus_descriptor nd_desc;
> > > +
> > > + /* List to store deferred work if virtqueue is full */
> > > + struct list_head req_list;
> > > +
> > > + /* Synchronize virtqueue data */
> > > + spinlock_t pmem_lock;
> > > +
> > > + /* Memory region information */
> > > + uint64_t start;
> > > + uint64_t size;
> > > +};
> > > +
> > > +static struct virtio_device_id id_table[] = {
> > > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > + { 0 },
> > > +};
> > > +
> > > + /* The interrupt handler */
> > > +static void host_ack(struct virtqueue *vq)
> > > +{
> > > + unsigned int len;
> > > + unsigned long flags;
> > > + struct virtio_pmem_request *req, *req_buf;
> > > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > > +
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > + req->done = true;
> > > + wake_up(&req->host_acked);
> > > +
> > > + if (!list_empty(&vpmem->req_list)) {
> > > + req_buf = list_first_entry(&vpmem->req_list,
> > > + struct virtio_pmem_request, list);
> > > + list_del(&vpmem->req_list);
> > > + req_buf->wq_buf_avail = true;
> > > + wake_up(&req_buf->wq_buf);
> > > + }
> > > + }
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +}
> > > + /* Initialize virt queue */
> > > +static int init_vq(struct virtio_pmem *vpmem)
> > > +{
> > > + struct virtqueue *vq;
> > > +
> > > + /* single vq */
> > > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > + host_ack, "flush_queue");
> > > + if (IS_ERR(vq))
> > > + return PTR_ERR(vq);
> > > +
> > > + spin_lock_init(&vpmem->pmem_lock);
> > > + INIT_LIST_HEAD(&vpmem->req_list);
> > > +
> > > + return 0;
> > > +};
> > > +
> > > + /* The request submission function */
> > > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > + int err;
> > > + unsigned long flags;
> > > + struct scatterlist *sgs[2], sg, ret;
> > > + struct virtio_device *vdev =
> > > + dev_to_virtio(nd_region->dev.parent->parent);
> > > + struct virtio_pmem *vpmem = vdev->priv;
> >
> > I'm missing a might_sleep() call in this function.
>
> I am not sure we strictly need might_sleep() here?
> We could add it as a debugging aid to detect callers that
> invoke this function from atomic context?
Yes. Since this function sleeps and since some functions that
may run in atomic context call it, it's a good idea to
call might_sleep().
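For illustration, this is the class of caller it catches at runtime (a
hypothetical broken caller, with CONFIG_DEBUG_ATOMIC_SLEEP enabled;
"some_lock" is made up):

        spin_lock(&some_lock);          /* atomic context */
        nd_region->flush(nd_region);    /* reaches virtio_pmem_flush(), which sleeps */
        spin_unlock(&some_lock);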
> > > + struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > +
> > > + if (!req)
> > > + return -ENOMEM;
> > > +
> > > + req->done = req->wq_buf_avail = false;
> > > + strcpy(req->name, "FLUSH");
> > > + init_waitqueue_head(&req->host_acked);
> > > + init_waitqueue_head(&req->wq_buf);
> > > +
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > + sg_init_one(&sg, req->name, strlen(req->name));
> > > + sgs[0] = &sg;
> > > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > + sgs[1] = &ret;
> >
> > It seems that sg_init_one() is only setting fields; in that
> > case you can move spin_lock_irqsave() here.
>
> Yes, will move spin_lock_irqsave() here.
>
> >
> > > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > > + if (err) {
> > > + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > > +
> > > + list_add_tail(&vpmem->req_list, &req->list);
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > + /* When host has read buffer, this completes via host_ack */
> > > + wait_event(req->wq_buf, req->wq_buf_avail);
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> >
> > Is this error handling code assuming that at some point
> > virtqueue_add_sgs() will succeed for a different thread? If yes,
> > what happens if the assumption is false? That is, what happens if
> > virtqueue_add_sgs() never succeeds anymore?
>
> If virtqueue_add_sgs() does not succeed, the corresponding thread should wait,
> and all subsequent calling threads should wait as well. As soon as the host
> returns the first free entry, the first waiting thread is woken up.
>
> In the worst case, if Qemu never returns any used buffer, multiple threads
> will keep waiting.
>
> >
> > Why not just return an error?
>
> As per Stefan's suggestion in the previous discussion: if the virtqueue is full,
> printing a message and failing the flush isn't appropriate. The thread needs to
> wait until virtqueue space becomes available.
If virtqueue_add_sgs() is guaranteed to succeed at some point then OK.
Otherwise, you'll get threads getting stuck forever.
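If that guarantee can't be made, one option (purely a hypothetical sketch,
timeout and error code invented) would be to bound the wait so a stuck device
can't hang the caller forever:

        if (!wait_event_timeout(req->wq_buf, req->wq_buf_avail, 30 * HZ)) {
                spin_lock_irqsave(&vpmem->pmem_lock, flags);
                /* re-check: host_ack() may have dequeued us meanwhile */
                if (!req->wq_buf_avail)
                        list_del(&req->list);
                spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
                kfree(req);
                return -ENXIO;
        }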
> > > + }
> > > + virtqueue_kick(vpmem->req_vq);
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > + /* When host has read buffer, this completes via host_ack */
> > > + wait_event(req->host_acked, req->done);
> > > + err = req->ret;
> >
> > If I'm understanding the QEMU code correctly, you're returning EIO
> > from QEMU if fsync() fails. I think this is wrong, since we don't know
> > if EIO in QEMU will be the same EIO in the guest. One way to solve this
> > would be to return 0 for success and 1 for failure from QEMU, and let the
> > guest implementation pick its error code (for your implementation it
> > could be EIO).
>
> Makes sense, will change this.
>
> Thanks,
> Pankaj
> >
> > > + kfree(req);
> > > +
> > > + return err;
> > > +};
> > > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > > +
> > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > +{
> > > + int err = 0;
> > > + struct resource res;
> > > + struct virtio_pmem *vpmem;
> > > + struct nvdimm_bus *nvdimm_bus;
> > > + struct nd_region_desc ndr_desc;
> > > + int nid = dev_to_node(&vdev->dev);
> > > + struct nd_region *nd_region;
> > > +
> > > + if (!vdev->config->get) {
> > > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > > + __func__);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > + GFP_KERNEL);
> > > + if (!vpmem) {
> > > + err = -ENOMEM;
> > > + goto out_err;
> > > + }
> > > +
> > > + vpmem->vdev = vdev;
> > > + err = init_vq(vpmem);
> > > + if (err)
> > > + goto out_err;
> > > +
> > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > + start, &vpmem->start);
> > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > + size, &vpmem->size);
> > > +
> > > + res.start = vpmem->start;
> > > + res.end = vpmem->start + vpmem->size-1;
> > > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > > + vpmem->nd_desc.module = THIS_MODULE;
> > > +
> > > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > + &vpmem->nd_desc);
> > > + if (!nvdimm_bus)
> > > + goto out_vq;
> > > +
> > > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > + memset(&ndr_desc, 0, sizeof(ndr_desc));
> > > +
> > > + ndr_desc.res = &res;
> > > + ndr_desc.numa_node = nid;
> > > + ndr_desc.flush = virtio_pmem_flush;
> > > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > +
> > > + if (!nd_region)
> > > + goto out_nd;
> > > +
> > > + //virtio_device_ready(vdev);
> > > + return 0;
> > > +out_nd:
> > > + err = -ENXIO;
> > > + nvdimm_bus_unregister(nvdimm_bus);
> > > +out_vq:
> > > + vdev->config->del_vqs(vdev);
> > > +out_err:
> > > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > + return err;
> > > +}
> > > +
> > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > +{
> > > + struct virtio_pmem *vpmem = vdev->priv;
> > > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > +
> > > + nvdimm_bus_unregister(nvdimm_bus);
> > > + vdev->config->del_vqs(vdev);
> > > + kfree(vpmem);
> > > +}
> > > +
> > > +#ifdef CONFIG_PM_SLEEP
> > > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > > +{
> > > + /* todo: handle freeze function */
> > > + return -EPERM;
> > > +}
> > > +
> > > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > > +{
> > > + /* todo: handle restore function */
> > > + return -EPERM;
> > > +}
> > > +#endif
> > > +
> > > +
> > > +static struct virtio_driver virtio_pmem_driver = {
> > > + .driver.name = KBUILD_MODNAME,
> > > + .driver.owner = THIS_MODULE,
> > > + .id_table = id_table,
> > > + .probe = virtio_pmem_probe,
> > > + .remove = virtio_pmem_remove,
> > > +#ifdef CONFIG_PM_SLEEP
> > > + .freeze = virtio_pmem_freeze,
> > > + .restore = virtio_pmem_restore,
> > > +#endif
> > > +};
> > > +
> > > +module_virtio_driver(virtio_pmem_driver);
> > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > +MODULE_LICENSE("GPL");
> > > diff --git a/include/uapi/linux/virtio_ids.h
> > > b/include/uapi/linux/virtio_ids.h
> > > index 6d5c3b2..3463895 100644
> > > --- a/include/uapi/linux/virtio_ids.h
> > > +++ b/include/uapi/linux/virtio_ids.h
> > > @@ -43,5 +43,6 @@
> > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > >
> > > #endif /* _LINUX_VIRTIO_IDS_H */
> > > diff --git a/include/uapi/linux/virtio_pmem.h
> > > b/include/uapi/linux/virtio_pmem.h
> > > new file mode 100644
> > > index 0000000..c7c22a5
> > > --- /dev/null
> > > +++ b/include/uapi/linux/virtio_pmem.h
> > > @@ -0,0 +1,40 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > > + * anyone can use the definitions to implement compatible drivers/servers:
> > > + *
> > > + *
> > > + * Redistribution and use in source and binary forms, with or without
> > > + * modification, are permitted provided that the following conditions
> > > + * are met:
> > > + * 1. Redistributions of source code must retain the above copyright
> > > + * notice, this list of conditions and the following disclaimer.
> > > + * 2. Redistributions in binary form must reproduce the above copyright
> > > + * notice, this list of conditions and the following disclaimer in the
> > > + * documentation and/or other materials provided with the distribution.
> > > + * 3. Neither the name of IBM nor the names of its contributors
> > > + * may be used to endorse or promote products derived from this
> > > software
> > > + * without specific prior written permission.
> > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > ``AS IS''
> > > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> > > THE
> > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > > PURPOSE
> > > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > > CONSEQUENTIAL
> > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > > STRICT
> > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
> > > WAY
> > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > > + * SUCH DAMAGE.
> > > + *
> > > + * Copyright (C) Red Hat, Inc., 2018-2019
> > > + * Copyright (C) Pankaj Gupta <[email protected]>, 2018
> > > + */
> > > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > > +
> > > +struct virtio_pmem_config {
> > > + __le64 start;
> > > + __le64 size;
> > > +};
> > > +#endif
> >
> >
> >
>
On Thu, 13 Sep 2018 03:06:27 -0400 (EDT)
Pankaj Gupta <[email protected]> wrote:
> >
> > > This patch adds virtio-pmem Qemu device.
> > >
> > > This device presents memory address range information to guest
> > > which is backed by file backend type. It acts like persistent
> > > memory device for KVM guest. Guest can perform read and
> > > persistent write operations on this memory range with the help
> > > of DAX capable filesystem.
> > >
> > > Persistent guest writes are assured with the help of virtio
> > > based flushing interface. When guest userspace space performs
> > > fsync on file fd on pmem device, a flush command is send to
> > > Qemu over VIRTIO and host side flush/sync is done on backing
> > > image file.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > ---
> > > Changes from RFC v3:
> > > - Return EIO for host fsync failure instead of errno - Luiz, Stefan
> > > - Change version for inclusion to Qemu 3.1 - Eric
> > >
> > > Changes from RFC v2:
> > > - Use aio_worker() to avoid Qemu from hanging with blocking fsync
> > > call - Stefan
> > > - Use virtio_st*_p() for endianess - Stefan
> > > - Correct indentation in qapi/misc.json - Eric
> > >
> > > hw/virtio/Makefile.objs | 3 +
> > > hw/virtio/virtio-pci.c | 44 +++++
> > > hw/virtio/virtio-pci.h | 14 ++
> > > hw/virtio/virtio-pmem.c | 241
> > > ++++++++++++++++++++++++++++
> > > include/hw/pci/pci.h | 1 +
> > > include/hw/virtio/virtio-pmem.h | 42 +++++
> > > include/standard-headers/linux/virtio_ids.h | 1 +
> > > qapi/misc.json | 26 ++-
> > > 8 files changed, 371 insertions(+), 1 deletion(-)
> > > create mode 100644 hw/virtio/virtio-pmem.c
> > > create mode 100644 include/hw/virtio/virtio-pmem.h
> > >
> > > diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> > > index 1b2799cfd8..7f914d45d0 100644
> > > --- a/hw/virtio/Makefile.objs
> > > +++ b/hw/virtio/Makefile.objs
> > > @@ -10,6 +10,9 @@ obj-$(CONFIG_VIRTIO_CRYPTO) += virtio-crypto.o
> > > obj-$(call land,$(CONFIG_VIRTIO_CRYPTO),$(CONFIG_VIRTIO_PCI)) +=
> > > virtio-crypto-pci.o
> > >
> > > obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> > > +ifeq ($(CONFIG_MEM_HOTPLUG),y)
> > > +obj-$(CONFIG_LINUX) += virtio-pmem.o
> > > +endif
> > > obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
> > > endif
> > >
> > > diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> > > index 3a01fe90f0..93d3fc05c7 100644
> > > --- a/hw/virtio/virtio-pci.c
> > > +++ b/hw/virtio/virtio-pci.c
> > > @@ -2521,6 +2521,49 @@ static const TypeInfo virtio_rng_pci_info = {
> > > .class_init = virtio_rng_pci_class_init,
> > > };
> > >
> > > +/* virtio-pmem-pci */
> > > +
> > > +static void virtio_pmem_pci_realize(VirtIOPCIProxy *vpci_dev, Error
> > > **errp)
> > > +{
> > > + VirtIOPMEMPCI *vpmem = VIRTIO_PMEM_PCI(vpci_dev);
> > > + DeviceState *vdev = DEVICE(&vpmem->vdev);
> > > +
> > > + qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> > > + object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> > > +}
> > > +
> > > +static void virtio_pmem_pci_class_init(ObjectClass *klass, void *data)
> > > +{
> > > + DeviceClass *dc = DEVICE_CLASS(klass);
> > > + VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> > > + PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> > > + k->realize = virtio_pmem_pci_realize;
> > > + set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> > > + pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> > > + pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PMEM;
> > > + pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> > > + pcidev_k->class_id = PCI_CLASS_OTHERS;
> > > +}
> > > +
> > > +static void virtio_pmem_pci_instance_init(Object *obj)
> > > +{
> > > + VirtIOPMEMPCI *dev = VIRTIO_PMEM_PCI(obj);
> > > +
> > > + virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> > > + TYPE_VIRTIO_PMEM);
> > > + object_property_add_alias(obj, "memdev", OBJECT(&dev->vdev), "memdev",
> > > + &error_abort);
> > > +}
> > > +
> > > +static const TypeInfo virtio_pmem_pci_info = {
> > > + .name = TYPE_VIRTIO_PMEM_PCI,
> > > + .parent = TYPE_VIRTIO_PCI,
> > > + .instance_size = sizeof(VirtIOPMEMPCI),
> > > + .instance_init = virtio_pmem_pci_instance_init,
> > > + .class_init = virtio_pmem_pci_class_init,
> > > +};
> > > +
> > > +
> > > /* virtio-input-pci */
> > >
> > > static Property virtio_input_pci_properties[] = {
> > > @@ -2714,6 +2757,7 @@ static void virtio_pci_register_types(void)
> > > type_register_static(&virtio_balloon_pci_info);
> > > type_register_static(&virtio_serial_pci_info);
> > > type_register_static(&virtio_net_pci_info);
> > > + type_register_static(&virtio_pmem_pci_info);
> > > #ifdef CONFIG_VHOST_SCSI
> > > type_register_static(&vhost_scsi_pci_info);
> > > #endif
> > > diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> > > index 813082b0d7..fe74fcad3f 100644
> > > --- a/hw/virtio/virtio-pci.h
> > > +++ b/hw/virtio/virtio-pci.h
> > > @@ -19,6 +19,7 @@
> > > #include "hw/virtio/virtio-blk.h"
> > > #include "hw/virtio/virtio-net.h"
> > > #include "hw/virtio/virtio-rng.h"
> > > +#include "hw/virtio/virtio-pmem.h"
> > > #include "hw/virtio/virtio-serial.h"
> > > #include "hw/virtio/virtio-scsi.h"
> > > #include "hw/virtio/virtio-balloon.h"
> > > @@ -57,6 +58,7 @@ typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
> > > typedef struct VirtIOGPUPCI VirtIOGPUPCI;
> > > typedef struct VHostVSockPCI VHostVSockPCI;
> > > typedef struct VirtIOCryptoPCI VirtIOCryptoPCI;
> > > +typedef struct VirtIOPMEMPCI VirtIOPMEMPCI;
> > >
> > > /* virtio-pci-bus */
> > >
> > > @@ -274,6 +276,18 @@ struct VirtIOBlkPCI {
> > > VirtIOBlock vdev;
> > > };
> > >
> > > +/*
> > > + * virtio-pmem-pci: This extends VirtioPCIProxy.
> > > + */
> > > +#define TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
> > > +#define VIRTIO_PMEM_PCI(obj) \
> > > + OBJECT_CHECK(VirtIOPMEMPCI, (obj), TYPE_VIRTIO_PMEM_PCI)
> > > +
> > > +struct VirtIOPMEMPCI {
> > > + VirtIOPCIProxy parent_obj;
> > > + VirtIOPMEM vdev;
> > > +};
> > > +
> > > /*
> > > * virtio-balloon-pci: This extends VirtioPCIProxy.
> > > */
> > > diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> > > new file mode 100644
> > > index 0000000000..69ae4c0a50
> > > --- /dev/null
> > > +++ b/hw/virtio/virtio-pmem.c
> > > @@ -0,0 +1,241 @@
> > > +/*
> > > + * Virtio pmem device
> > > + *
> > > + * Copyright (C) 2018 Red Hat, Inc.
> > > + * Copyright (C) 2018 Pankaj Gupta <[email protected]>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include "qapi/error.h"
> > > +#include "qemu-common.h"
> > > +#include "qemu/error-report.h"
> > > +#include "hw/virtio/virtio-access.h"
> > > +#include "hw/virtio/virtio-pmem.h"
> > > +#include "hw/mem/memory-device.h"
> > > +#include "block/aio.h"
> > > +#include "block/thread-pool.h"
> > > +
> > > +typedef struct VirtIOPMEMresp {
> > > + int ret;
> > > +} VirtIOPMEMResp;
> > > +
> > > +typedef struct VirtIODeviceRequest {
> > > + VirtQueueElement elem;
> > > + int fd;
> > > + VirtIOPMEM *pmem;
> > > + VirtIOPMEMResp resp;
> > > +} VirtIODeviceRequest;
> > > +
> > > +static int worker_cb(void *opaque)
> > > +{
> > > + VirtIODeviceRequest *req = opaque;
> > > + int err = 0;
> > > +
> > > + /* flush raw backing image */
> > > + err = fsync(req->fd);
> > > + if (err != 0) {
> > > + err = EIO;
> > > + }
> > > + req->resp.ret = err;
> >
> > As I mentioned in the kernel patch, I think you should return 1
> > for error and let the guest pick the error it wants to return to
> > the calling thread.
>
> Sure.
>
> >
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void done_cb(void *opaque, int ret)
> > > +{
> > > + VirtIODeviceRequest *req = opaque;
> > > + int len = iov_from_buf(req->elem.in_sg, req->elem.in_num, 0,
> > > + &req->resp, sizeof(VirtIOPMEMResp));
> > > +
> > > + /* Callbacks are serialized, so no need to use atomic ops. */
> > > + virtqueue_push(req->pmem->rq_vq, &req->elem, len);
> > > + virtio_notify((VirtIODevice *)req->pmem, req->pmem->rq_vq);
> > > + g_free(req);
> > > +}
> > > +
> > > +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> > > +{
> > > + VirtIODeviceRequest *req;
> > > + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > > + HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
> > > + ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> > > +
> > > + req = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
> > > + if (!req) {
> > > + virtio_error(vdev, "virtio-pmem missing request data");
> > > + return;
> > > + }
> > > +
> > > + if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> > > + virtio_error(vdev, "virtio-pmem request not proper");
> > > + g_free(req);
> > > + return;
> > > + }
> >
> > I think you should abort() in those errors.
>
> I just skimmed over how other devices handle such errors (virtio_blk & virtio_scsi):
> none of them aborts?
My fear is threads on the host side getting blocked forever, one after another.
>
> if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> virtio_error(vdev, "virtio-blk missing headers");
> return -1;
> }
>
> Thanks,
> Pankaj
>
> >
> > > + req->fd = memory_region_get_fd(&backend->mr);
> > > + req->pmem = pmem;
> > > + thread_pool_submit_aio(pool, worker_cb, req, done_cb, req);
> > > +}
> > > +
> > > +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> > > +{
> > > + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > > + struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *)
> > > config;
> > > +
> > > + virtio_stq_p(vdev, &pmemcfg->start, pmem->start);
> > > + virtio_stq_p(vdev, &pmemcfg->size, pmem->size);
> > > +}
> > > +
> > > +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t
> > > features,
> > > + Error **errp)
> > > +{
> > > + return features;
> > > +}
> > > +
> > > +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> > > +{
> > > + VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> > > + VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
> > > + MachineState *ms = MACHINE(qdev_get_machine());
> > > + uint64_t align;
> > > + Error *local_err = NULL;
> > > + MemoryRegion *mr;
> > > +
> > > + if (!pmem->memdev) {
> > > + error_setg(errp, "virtio-pmem memdev not set");
> > > + return;
> > > + }
> > > +
> > > + mr = host_memory_backend_get_memory(pmem->memdev);
> > > + align = memory_region_get_alignment(mr);
> > > + pmem->size = QEMU_ALIGN_DOWN(memory_region_size(mr), align);
> > > + pmem->start = memory_device_get_free_addr(ms, NULL, align, pmem->size,
> > > +
> > > &local_err);
> > > + if (local_err) {
> > > + error_setg(errp, "Can't get free address in mem device");
> > > + return;
> > > + }
> > > + memory_region_init_alias(&pmem->mr, OBJECT(pmem),
> > > + "virtio_pmem-memory", mr, 0, pmem->size);
> > > + memory_device_plug_region(ms, &pmem->mr, pmem->start);
> > > +
> > > + host_memory_backend_set_mapped(pmem->memdev, true);
> > > + virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > > + sizeof(struct
> > > virtio_pmem_config));
> > > + pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> > > +}
> > > +
> > > +static void virtio_mem_check_memdev(Object *obj, const char *name, Object
> > > *val,
> > > + Error **errp)
> > > +{
> > > + if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
> > > + char *path = object_get_canonical_path_component(val);
> > > + error_setg(errp, "Can't use already busy memdev: %s", path);
> > > + g_free(path);
> > > + return;
> > > + }
> > > +
> > > + qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
> > > +}
> > > +
> > > +static const char *virtio_pmem_get_device_id(VirtIOPMEM *vm)
> > > +{
> > > + Object *obj = OBJECT(vm);
> > > + DeviceState *parent_dev;
> > > +
> > > + /* always use the ID of the proxy device */
> > > + if (obj->parent && object_dynamic_cast(obj->parent, TYPE_DEVICE)) {
> > > + parent_dev = DEVICE(obj->parent);
> > > + return parent_dev->id;
> > > + }
> > > + return NULL;
> > > +}
> > > +
> > > +static void virtio_pmem_md_fill_device_info(const MemoryDeviceState *md,
> > > + MemoryDeviceInfo *info)
> > > +{
> > > + VirtioPMemDeviceInfo *vi = g_new0(VirtioPMemDeviceInfo, 1);
> > > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > > + const char *id = virtio_pmem_get_device_id(vm);
> > > +
> > > + if (id) {
> > > + vi->has_id = true;
> > > + vi->id = g_strdup(id);
> > > + }
> > > +
> > > + vi->start = vm->start;
> > > + vi->size = vm->size;
> > > + vi->memdev = object_get_canonical_path(OBJECT(vm->memdev));
> > > +
> > > + info->u.virtio_pmem.data = vi;
> > > + info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_PMEM;
> > > +}
> > > +
> > > +static uint64_t virtio_pmem_md_get_addr(const MemoryDeviceState *md)
> > > +{
> > > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > > +
> > > + return vm->start;
> > > +}
> > > +
> > > +static uint64_t virtio_pmem_md_get_plugged_size(const MemoryDeviceState
> > > *md)
> > > +{
> > > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > > +
> > > + return vm->size;
> > > +}
> > > +
> > > +static uint64_t virtio_pmem_md_get_region_size(const MemoryDeviceState
> > > *md)
> > > +{
> > > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > > +
> > > + return vm->size;
> > > +}
> > > +
> > > +static void virtio_pmem_instance_init(Object *obj)
> > > +{
> > > + VirtIOPMEM *vm = VIRTIO_PMEM(obj);
> > > + object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
> > > + (Object **)&vm->memdev,
> > > + (void *) virtio_mem_check_memdev,
> > > + OBJ_PROP_LINK_STRONG,
> > > + &error_abort);
> > > +}
> > > +
> > > +
> > > +static void virtio_pmem_class_init(ObjectClass *klass, void *data)
> > > +{
> > > + VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> > > + MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> > > +
> > > + vdc->realize = virtio_pmem_realize;
> > > + vdc->get_config = virtio_pmem_get_config;
> > > + vdc->get_features = virtio_pmem_get_features;
> > > +
> > > + mdc->get_addr = virtio_pmem_md_get_addr;
> > > + mdc->get_plugged_size = virtio_pmem_md_get_plugged_size;
> > > + mdc->get_region_size = virtio_pmem_md_get_region_size;
> > > + mdc->fill_device_info = virtio_pmem_md_fill_device_info;
> > > +}
> > > +
> > > +static TypeInfo virtio_pmem_info = {
> > > + .name = TYPE_VIRTIO_PMEM,
> > > + .parent = TYPE_VIRTIO_DEVICE,
> > > + .class_init = virtio_pmem_class_init,
> > > + .instance_size = sizeof(VirtIOPMEM),
> > > + .instance_init = virtio_pmem_instance_init,
> > > + .interfaces = (InterfaceInfo[]) {
> > > + { TYPE_MEMORY_DEVICE },
> > > + { }
> > > + },
> > > +};
> > > +
> > > +static void virtio_register_types(void)
> > > +{
> > > + type_register_static(&virtio_pmem_info);
> > > +}
> > > +
> > > +type_init(virtio_register_types)
> > > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > > index 990d6fcbde..28829b6437 100644
> > > --- a/include/hw/pci/pci.h
> > > +++ b/include/hw/pci/pci.h
> > > @@ -85,6 +85,7 @@ extern bool pci_available;
> > > #define PCI_DEVICE_ID_VIRTIO_RNG 0x1005
> > > #define PCI_DEVICE_ID_VIRTIO_9P 0x1009
> > > #define PCI_DEVICE_ID_VIRTIO_VSOCK 0x1012
> > > +#define PCI_DEVICE_ID_VIRTIO_PMEM 0x1013
> > >
> > > #define PCI_VENDOR_ID_REDHAT 0x1b36
> > > #define PCI_DEVICE_ID_REDHAT_BRIDGE 0x0001
> > > diff --git a/include/hw/virtio/virtio-pmem.h
> > > b/include/hw/virtio/virtio-pmem.h
> > > new file mode 100644
> > > index 0000000000..fda3ee691c
> > > --- /dev/null
> > > +++ b/include/hw/virtio/virtio-pmem.h
> > > @@ -0,0 +1,42 @@
> > > +/*
> > > + * Virtio pmem Device
> > > + *
> > > + * Copyright Red Hat, Inc. 2018
> > > + * Copyright Pankaj Gupta <[email protected]>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 or
> > > + * (at your option) any later version. See the COPYING file in the
> > > + * top-level directory.
> > > + */
> > > +
> > > +#ifndef QEMU_VIRTIO_PMEM_H
> > > +#define QEMU_VIRTIO_PMEM_H
> > > +
> > > +#include "hw/virtio/virtio.h"
> > > +#include "exec/memory.h"
> > > +#include "sysemu/hostmem.h"
> > > +#include "standard-headers/linux/virtio_ids.h"
> > > +#include "hw/boards.h"
> > > +#include "hw/i386/pc.h"
> > > +
> > > +#define TYPE_VIRTIO_PMEM "virtio-pmem"
> > > +
> > > +#define VIRTIO_PMEM(obj) \
> > > + OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
> > > +
> > > +/* VirtIOPMEM device structure */
> > > +typedef struct VirtIOPMEM {
> > > + VirtIODevice parent_obj;
> > > +
> > > + VirtQueue *rq_vq;
> > > + uint64_t start;
> > > + uint64_t size;
> > > + MemoryRegion mr;
> > > + HostMemoryBackend *memdev;
> > > +} VirtIOPMEM;
> > > +
> > > +struct virtio_pmem_config {
> > > + uint64_t start;
> > > + uint64_t size;
> > > +};
> > > +#endif
> > > diff --git a/include/standard-headers/linux/virtio_ids.h
> > > b/include/standard-headers/linux/virtio_ids.h
> > > index 6d5c3b2d4f..346389565a 100644
> > > --- a/include/standard-headers/linux/virtio_ids.h
> > > +++ b/include/standard-headers/linux/virtio_ids.h
> > > @@ -43,5 +43,6 @@
> > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > >
> > > #endif /* _LINUX_VIRTIO_IDS_H */
> > > diff --git a/qapi/misc.json b/qapi/misc.json
> > > index d450cfef21..517376b866 100644
> > > --- a/qapi/misc.json
> > > +++ b/qapi/misc.json
> > > @@ -2907,6 +2907,29 @@
> > > }
> > > }
> > >
> > > +##
> > > +# @VirtioPMemDeviceInfo:
> > > +#
> > > +# VirtioPMem state information
> > > +#
> > > +# @id: device's ID
> > > +#
> > > +# @start: physical address, where device is mapped
> > > +#
> > > +# @size: size of memory that the device provides
> > > +#
> > > +# @memdev: memory backend linked with device
> > > +#
> > > +# Since: 3.1
> > > +##
> > > +{ 'struct': 'VirtioPMemDeviceInfo',
> > > + 'data': { '*id': 'str',
> > > + 'start': 'size',
> > > + 'size': 'size',
> > > + 'memdev': 'str'
> > > + }
> > > +}
> > > +
> > > ##
> > > # @MemoryDeviceInfo:
> > > #
> > > @@ -2916,7 +2939,8 @@
> > > ##
> > > { 'union': 'MemoryDeviceInfo',
> > > 'data': { 'dimm': 'PCDIMMDeviceInfo',
> > > - 'nvdimm': 'PCDIMMDeviceInfo'
> > > + 'nvdimm': 'PCDIMMDeviceInfo',
> > > + 'virtio-pmem': 'VirtioPMemDeviceInfo'
> > > }
> > > }
> > >
> >
> >
>
>
> > Hi Luiz,
> >
> > Thanks for the review.
> >
> > >
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > >
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > >
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > > >
> > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > ---
> > > > drivers/virtio/Kconfig | 9 ++
> > > > drivers/virtio/Makefile | 1 +
> > > > drivers/virtio/virtio_pmem.c | 255
> > > > +++++++++++++++++++++++++++++++++++++++
> > > > include/uapi/linux/virtio_ids.h | 1 +
> > > > include/uapi/linux/virtio_pmem.h | 40 ++++++
> > > > 5 files changed, 306 insertions(+)
> > > > create mode 100644 drivers/virtio/virtio_pmem.c
> > > > create mode 100644 include/uapi/linux/virtio_pmem.h
> > > >
> > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > index 3589764..a331e23 100644
> > > > --- a/drivers/virtio/Kconfig
> > > > +++ b/drivers/virtio/Kconfig
> > > > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> > > >
> > > > If unsure, say Y.
> > > >
> > > > +config VIRTIO_PMEM
> > > > + tristate "Support for virtio pmem driver"
> > > > + depends on VIRTIO
> > > > + help
> > > > + This driver provides support for virtio based flushing interface
> > > > + for persistent memory range.
> > > > +
> > > > + If unsure, say M.
> > > > +
> > > > config VIRTIO_BALLOON
> > > > tristate "Virtio balloon driver"
> > > > depends on VIRTIO
> > > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > > > index 3a2b5c5..cbe91c6 100644
> > > > --- a/drivers/virtio/Makefile
> > > > +++ b/drivers/virtio/Makefile
> > > > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > > > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > > > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > > > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > > > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > > > diff --git a/drivers/virtio/virtio_pmem.c
> > > > b/drivers/virtio/virtio_pmem.c
> > > > new file mode 100644
> > > > index 0000000..c22cc87
> > > > --- /dev/null
> > > > +++ b/drivers/virtio/virtio_pmem.c
> > > > @@ -0,0 +1,255 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * virtio_pmem.c: Virtio pmem Driver
> > > > + *
> > > > + * Discovers persistent memory range information
> > > > + * from host and provides a virtio based flushing
> > > > + * interface.
> > > > + */
> > > > +#include <linux/virtio.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/virtio_ids.h>
> > > > +#include <linux/virtio_config.h>
> > > > +#include <uapi/linux/virtio_pmem.h>
> > > > +#include <linux/spinlock.h>
> > > > +#include <linux/libnvdimm.h>
> > > > +#include <linux/nd.h>
> > > > +
> > > > +struct virtio_pmem_request {
> > > > + /* Host return status corresponding to flush request */
> > > > + int ret;
> > > > +
> > > > + /* command name*/
> > > > + char name[16];
> > > > +
> > > > + /* Wait queue to process deferred work after ack from host */
> > > > + wait_queue_head_t host_acked;
> > > > + bool done;
> > > > +
> > > > + /* Wait queue to process deferred work after virt queue buffer avail
> > > > */
> > > > + wait_queue_head_t wq_buf;
> > > > + bool wq_buf_avail;
> > > > + struct list_head list;
> > > > +};
> > > > +
> > > > +struct virtio_pmem {
> > > > + struct virtio_device *vdev;
> > > > +
> > > > + /* Virtio pmem request queue */
> > > > + struct virtqueue *req_vq;
> > > > +
> > > > + /* nvdimm bus registers virtio pmem device */
> > > > + struct nvdimm_bus *nvdimm_bus;
> > > > + struct nvdimm_bus_descriptor nd_desc;
> > > > +
> > > > + /* List to store deferred work if virtqueue is full */
> > > > + struct list_head req_list;
> > > > +
> > > > + /* Synchronize virtqueue data */
> > > > + spinlock_t pmem_lock;
> > > > +
> > > > + /* Memory region information */
> > > > + uint64_t start;
> > > > + uint64_t size;
> > > > +};
> > > > +
> > > > +static struct virtio_device_id id_table[] = {
> > > > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > > + { 0 },
> > > > +};
> > > > +
> > > > + /* The interrupt handler */
> > > > +static void host_ack(struct virtqueue *vq)
> > > > +{
> > > > + unsigned int len;
> > > > + unsigned long flags;
> > > > + struct virtio_pmem_request *req, *req_buf;
> > > > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > > > +
> > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > > + req->done = true;
> > > > + wake_up(&req->host_acked);
> > > > +
> > > > + if (!list_empty(&vpmem->req_list)) {
> > > > + req_buf = list_first_entry(&vpmem->req_list,
> > > > + struct virtio_pmem_request, list);
> > > > + list_del(&vpmem->req_list);
> > > > + req_buf->wq_buf_avail = true;
> > > > + wake_up(&req_buf->wq_buf);
> > > > + }
> > > > + }
> > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +}
> > > > + /* Initialize virt queue */
> > > > +static int init_vq(struct virtio_pmem *vpmem)
> > > > +{
> > > > + struct virtqueue *vq;
> > > > +
> > > > + /* single vq */
> > > > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > > + host_ack, "flush_queue");
> > > > + if (IS_ERR(vq))
> > > > + return PTR_ERR(vq);
> > > > +
> > > > + spin_lock_init(&vpmem->pmem_lock);
> > > > + INIT_LIST_HEAD(&vpmem->req_list);
> > > > +
> > > > + return 0;
> > > > +};
> > > > +
> > > > + /* The request submission function */
> > > > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > > > +{
> > > > + int err;
> > > > + unsigned long flags;
> > > > + struct scatterlist *sgs[2], sg, ret;
> > > > + struct virtio_device *vdev =
> > > > + dev_to_virtio(nd_region->dev.parent->parent);
> > > > + struct virtio_pmem *vpmem = vdev->priv;
> > >
> > > I'm missing a might_sleep() call in this function.
> >
> > I am not sure if we need might_sleep here?
> > We can add it as debugging aid for detecting any problems
> > in sleeping from acquired atomic context?
>
> Yes. Since this function sleeps and since some functions that
> may run in atomic context call it, it's a good idea to
> call might_sleep().
OK, will add might_sleep().
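i.e. something like this at the top of the submission path (sketch only,
keeping the current signature):

static int virtio_pmem_flush(struct nd_region *nd_region)
{
	might_sleep();	/* kmalloc(GFP_KERNEL) and wait_event() below can block */

	/* ... existing request allocation and submission code ... */
}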
>
> > > > + struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > > +
> > > > + if (!req)
> > > > + return -ENOMEM;
> > > > +
> > > > + req->done = req->wq_buf_avail = false;
> > > > + strcpy(req->name, "FLUSH");
> > > > + init_waitqueue_head(&req->host_acked);
> > > > + init_waitqueue_head(&req->wq_buf);
> > > > +
> > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > + sg_init_one(&sg, req->name, strlen(req->name));
> > > > + sgs[0] = &sg;
> > > > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > > + sgs[1] = &ret;
> > >
> > > It seems that sg_init_one() is only setting fields, in this
> > > case you can move spin_lock_irqsave() here.
> >
> > yes, will move spin_lock_irqsave here.
> >
> > >
> > > > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > > > + if (err) {
> > > > + dev_err(&vdev->dev, "failed to send command to virtio pmem
> > > > device\n");
> > > > +
> > > > + list_add_tail(&vpmem->req_list, &req->list);
> > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > + /* When host has read buffer, this completes via host_ack */
> > > > + wait_event(req->wq_buf, req->wq_buf_avail);
> > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > >
> > > Is this error handling code assuming that at some point
> > > virtqueue_add_sgs() will succeed for a different thread? If yes,
> > > what happens if the assumption is false? That is, what happens if
> > > virtqueue_add_sgs() never succeeds anymore?
> >
> > virtqueue_add_sgs will not succeed and corresponding thread should wait.
> > All subsequent calling threads should also wait. As soon as there is first
> > available free entry(from host), first waiting thread is acknowledged.
> >
> > In worst case if Qemu is not utilizing any of the used buffer will keep
> > multiple threads waiting.
> >
> > >
> > > Why not just return an error?
> >
> > As per suggestion by Stefan in previous discussion: if the virtqueue is
> > full.
> > Printing a message and failing the flush isn't appropriate. This thread
> > needs to
> > wait until virtqueue space becomes available.
>
> If virtqueue_add_sgs() is guaranteed to succeed at some point then OK.
> Otherwise, you'll get threads getting stuck forever.
We are handling the 'virtqueue_add_sgs' failure here for the case when the
virtqueue is full. For the regular virtqueue-full case, guest threads should
wait. This scales to more fsync requests than the current virtqueue size and
avoids returning a failure to userspace.
Even if we returned an error when the Qemu threads are stuck, we would keep
returning errors until those threads actually make progress and free an
entry in the virtqueue.
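To spell out how waiting threads make progress (illustrative sketch only,
names as in the patch): host_ack() completes the finished flush and then
wakes exactly one parked request per freed descriptor:

	req->done = true;
	wake_up(&req->host_acked);	/* complete the finished flush */

	if (!list_empty(&vpmem->req_list)) {
		/* a descriptor just freed up: wake one parked request so
		 * it can proceed with its submission */
		req_buf = list_first_entry(&vpmem->req_list,
				struct virtio_pmem_request, list);
		list_del(&req_buf->list);
		req_buf->wq_buf_avail = true;
		wake_up(&req_buf->wq_buf);
	}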
>
> > > > + }
> > > > + virtqueue_kick(vpmem->req_vq);
> > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > + /* When host has read buffer, this completes via host_ack */
> > > > + wait_event(req->host_acked, req->done);
> > > > + err = req->ret;
> > >
> > > If I'm understanding the QEMU code correctly, you're returning EIO
> > > from QEMU if fsync() fails. I think this is wrong, since we don't know
> > > if EIO in QEMU will be the same EIO in the guest. One way to solve this
> > > would be to return 0 for success and 1 for failure from QEMU, and let the
> > > guest implementation pick its error code (for your implementation it
> > > could be EIO).
> >
> > Makes sense, will change this.
> >
> > Thanks,
> > Pankaj
> > >
> > > > + kfree(req);
> > > > +
> > > > + return err;
> > > > +};
> > > > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > > > +
> > > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > +{
> > > > + int err = 0;
> > > > + struct resource res;
> > > > + struct virtio_pmem *vpmem;
> > > > + struct nvdimm_bus *nvdimm_bus;
> > > > + struct nd_region_desc ndr_desc;
> > > > + int nid = dev_to_node(&vdev->dev);
> > > > + struct nd_region *nd_region;
> > > > +
> > > > + if (!vdev->config->get) {
> > > > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > > > + __func__);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > > + GFP_KERNEL);
> > > > + if (!vpmem) {
> > > > + err = -ENOMEM;
> > > > + goto out_err;
> > > > + }
> > > > +
> > > > + vpmem->vdev = vdev;
> > > > + err = init_vq(vpmem);
> > > > + if (err)
> > > > + goto out_err;
> > > > +
> > > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > > + start, &vpmem->start);
> > > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > > + size, &vpmem->size);
> > > > +
> > > > + res.start = vpmem->start;
> > > > + res.end = vpmem->start + vpmem->size-1;
> > > > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > > > + vpmem->nd_desc.module = THIS_MODULE;
> > > > +
> > > > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > > + &vpmem->nd_desc);
> > > > + if (!nvdimm_bus)
> > > > + goto out_vq;
> > > > +
> > > > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > > + memset(&ndr_desc, 0, sizeof(ndr_desc));
> > > > +
> > > > + ndr_desc.res = &res;
> > > > + ndr_desc.numa_node = nid;
> > > > + ndr_desc.flush = virtio_pmem_flush;
> > > > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > > +
> > > > + if (!nd_region)
> > > > + goto out_nd;
> > > > +
> > > > + //virtio_device_ready(vdev);
> > > > + return 0;
> > > > +out_nd:
> > > > + err = -ENXIO;
> > > > + nvdimm_bus_unregister(nvdimm_bus);
> > > > +out_vq:
> > > > + vdev->config->del_vqs(vdev);
> > > > +out_err:
> > > > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > > + return err;
> > > > +}
> > > > +
> > > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > > +{
> > > > + struct virtio_pmem *vpmem = vdev->priv;
> > > > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > > +
> > > > + nvdimm_bus_unregister(nvdimm_bus);
> > > > + vdev->config->del_vqs(vdev);
> > > > + kfree(vpmem);
> > > > +}
> > > > +
> > > > +#ifdef CONFIG_PM_SLEEP
> > > > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > > > +{
> > > > + /* todo: handle freeze function */
> > > > + return -EPERM;
> > > > +}
> > > > +
> > > > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > > > +{
> > > > + /* todo: handle restore function */
> > > > + return -EPERM;
> > > > +}
> > > > +#endif
> > > > +
> > > > +
> > > > +static struct virtio_driver virtio_pmem_driver = {
> > > > + .driver.name = KBUILD_MODNAME,
> > > > + .driver.owner = THIS_MODULE,
> > > > + .id_table = id_table,
> > > > + .probe = virtio_pmem_probe,
> > > > + .remove = virtio_pmem_remove,
> > > > +#ifdef CONFIG_PM_SLEEP
> > > > + .freeze = virtio_pmem_freeze,
> > > > + .restore = virtio_pmem_restore,
> > > > +#endif
> > > > +};
> > > > +
> > > > +module_virtio_driver(virtio_pmem_driver);
> > > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > > +MODULE_LICENSE("GPL");
> > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > b/include/uapi/linux/virtio_ids.h
> > > > index 6d5c3b2..3463895 100644
> > > > --- a/include/uapi/linux/virtio_ids.h
> > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > @@ -43,5 +43,6 @@
> > > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > > >
> > > > #endif /* _LINUX_VIRTIO_IDS_H */
> > > > diff --git a/include/uapi/linux/virtio_pmem.h
> > > > b/include/uapi/linux/virtio_pmem.h
> > > > new file mode 100644
> > > > index 0000000..c7c22a5
> > > > --- /dev/null
> > > > +++ b/include/uapi/linux/virtio_pmem.h
> > > > @@ -0,0 +1,40 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed
> > > > so
> > > > + * anyone can use the definitions to implement compatible
> > > > drivers/servers:
> > > > + *
> > > > + *
> > > > + * Redistribution and use in source and binary forms, with or without
> > > > + * modification, are permitted provided that the following conditions
> > > > + * are met:
> > > > + * 1. Redistributions of source code must retain the above copyright
> > > > + * notice, this list of conditions and the following disclaimer.
> > > > + * 2. Redistributions in binary form must reproduce the above
> > > > copyright
> > > > + * notice, this list of conditions and the following disclaimer in
> > > > the
> > > > + * documentation and/or other materials provided with the
> > > > distribution.
> > > > + * 3. Neither the name of IBM nor the names of its contributors
> > > > + * may be used to endorse or promote products derived from this
> > > > software
> > > > + * without specific prior written permission.
> > > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > ``AS IS''
> > > > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> > > > TO,
> > > > THE
> > > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > > > PURPOSE
> > > > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > > > CONSEQUENTIAL
> > > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
> > > > GOODS
> > > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> > > > INTERRUPTION)
> > > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > > > STRICT
> > > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
> > > > ANY
> > > > WAY
> > > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
> > > > OF
> > > > + * SUCH DAMAGE.
> > > > + *
> > > > + * Copyright (C) Red Hat, Inc., 2018-2019
> > > > + * Copyright (C) Pankaj Gupta <[email protected]>, 2018
> > > > + */
> > > > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > > > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > > > +
> > > > +struct virtio_pmem_config {
> > > > + __le64 start;
> > > > + __le64 size;
> > > > +};
> > > > +#endif
> > >
> > >
> > >
> >
>
>
> @@ -0,0 +1,241 @@
> +/*
> + * Virtio pmem device
> + *
> + * Copyright (C) 2018 Red Hat, Inc.
> + * Copyright (C) 2018 Pankaj Gupta <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu-common.h"
> +#include "qemu/error-report.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/virtio-pmem.h"
> +#include "hw/mem/memory-device.h"
> +#include "block/aio.h"
> +#include "block/thread-pool.h"
> +
> +typedef struct VirtIOPMEMresp {
> + int ret;
> +} VirtIOPMEMResp;
> +
> +typedef struct VirtIODeviceRequest {
> + VirtQueueElement elem;
> + int fd;
> + VirtIOPMEM *pmem;
> + VirtIOPMEMResp resp;
> +} VirtIODeviceRequest;
Both the response and the request have to go to a Linux header (and a
header sync patch).
Also, you are using the same request struct for host<->guest handling and
for internal purposes. The fd and pmem pointers definitely don't belong
there. Use a separate struct for internal handling purposes (the one passed
to worker_cb).
> +
> +static int worker_cb(void *opaque)
> +{
> + VirtIODeviceRequest *req = opaque;
> + int err = 0;
> +
> + /* flush raw backing image */
> + err = fsync(req->fd);
> + if (err != 0) {
> + err = EIO;
> + }
> + req->resp.ret = err;
> +
> + return 0;
> +}
> +
> +static void done_cb(void *opaque, int ret)
> +{
> + VirtIODeviceRequest *req = opaque;
> + int len = iov_from_buf(req->elem.in_sg, req->elem.in_num, 0,
> + &req->resp, sizeof(VirtIOPMEMResp));
> +
> + /* Callbacks are serialized, so no need to use atomic ops. */
> + virtqueue_push(req->pmem->rq_vq, &req->elem, len);
> + virtio_notify((VirtIODevice *)req->pmem, req->pmem->rq_vq);
> + g_free(req);
> +}
> +
> +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> +{
> + VirtIODeviceRequest *req;
> + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> + HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
> + ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> +
> + req = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
> + if (!req) {
> + virtio_error(vdev, "virtio-pmem missing request data");
> + return;
> + }
> +
> + if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> + virtio_error(vdev, "virtio-pmem request not proper");
> + g_free(req);
> + return;
> + }
> + req->fd = memory_region_get_fd(&backend->mr);
> + req->pmem = pmem;
> + thread_pool_submit_aio(pool, worker_cb, req, done_cb, req);
> +}
> +
> +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> +{
> + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> + struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *) config;
> +
> + virtio_stq_p(vdev, &pmemcfg->start, pmem->start);
> + virtio_stq_p(vdev, &pmemcfg->size, pmem->size);
> +}
> +
> +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t features,
> + Error **errp)
> +{
> + return features;
> +}
> +
> +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> +{
> + VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> + VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
> + MachineState *ms = MACHINE(qdev_get_machine());
> + uint64_t align;
> + Error *local_err = NULL;
> + MemoryRegion *mr;
> +
> + if (!pmem->memdev) {
> + error_setg(errp, "virtio-pmem memdev not set");
> + return;
> + }
> +
> + mr = host_memory_backend_get_memory(pmem->memdev);
> + align = memory_region_get_alignment(mr);
> + pmem->size = QEMU_ALIGN_DOWN(memory_region_size(mr), align);
> + pmem->start = memory_device_get_free_addr(ms, NULL, align, pmem->size,
> + &local_err);
> + if (local_err) {
> + error_setg(errp, "Can't get free address in mem device");
> + return;
> + }
> + memory_region_init_alias(&pmem->mr, OBJECT(pmem),
> + "virtio_pmem-memory", mr, 0, pmem->size);
> + memory_device_plug_region(ms, &pmem->mr, pmem->start);
> +
> + host_memory_backend_set_mapped(pmem->memdev, true);
> + virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> + sizeof(struct virtio_pmem_config));
> + pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> +}
> +
> +static void virtio_mem_check_memdev(Object *obj, const char *name, Object *val,
> + Error **errp)
> +{
> + if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
> + char *path = object_get_canonical_path_component(val);
> + error_setg(errp, "Can't use already busy memdev: %s", path);
> + g_free(path);
> + return;
> + }
> +
> + qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
> +}
> +
> +static const char *virtio_pmem_get_device_id(VirtIOPMEM *vm)
> +{
> + Object *obj = OBJECT(vm);
> + DeviceState *parent_dev;
> +
> + /* always use the ID of the proxy device */
> + if (obj->parent && object_dynamic_cast(obj->parent, TYPE_DEVICE)) {
> + parent_dev = DEVICE(obj->parent);
> + return parent_dev->id;
> + }
> + return NULL;
> +}
> +
> +static void virtio_pmem_md_fill_device_info(const MemoryDeviceState *md,
> + MemoryDeviceInfo *info)
> +{
> + VirtioPMemDeviceInfo *vi = g_new0(VirtioPMemDeviceInfo, 1);
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> + const char *id = virtio_pmem_get_device_id(vm);
> +
> + if (id) {
> + vi->has_id = true;
> + vi->id = g_strdup(id);
> + }
> +
> + vi->start = vm->start;
> + vi->size = vm->size;
> + vi->memdev = object_get_canonical_path(OBJECT(vm->memdev));
> +
> + info->u.virtio_pmem.data = vi;
> + info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_PMEM;
> +}
> +
> +static uint64_t virtio_pmem_md_get_addr(const MemoryDeviceState *md)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> +
> + return vm->start;
> +}
> +
> +static uint64_t virtio_pmem_md_get_plugged_size(const MemoryDeviceState *md)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> +
> + return vm->size;
> +}
> +
> +static uint64_t virtio_pmem_md_get_region_size(const MemoryDeviceState *md)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> +
> + return vm->size;
> +}
> +
> +static void virtio_pmem_instance_init(Object *obj)
> +{
> + VirtIOPMEM *vm = VIRTIO_PMEM(obj);
> + object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
> + (Object **)&vm->memdev,
> + (void *) virtio_mem_check_memdev,
> + OBJ_PROP_LINK_STRONG,
> + &error_abort);
> +}
> +
> +
> +static void virtio_pmem_class_init(ObjectClass *klass, void *data)
> +{
> + VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> + MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> +
> + vdc->realize = virtio_pmem_realize;
> + vdc->get_config = virtio_pmem_get_config;
> + vdc->get_features = virtio_pmem_get_features;
> +
> + mdc->get_addr = virtio_pmem_md_get_addr;
> + mdc->get_plugged_size = virtio_pmem_md_get_plugged_size;
> + mdc->get_region_size = virtio_pmem_md_get_region_size;
> + mdc->fill_device_info = virtio_pmem_md_fill_device_info;
> +}
> +
> +static TypeInfo virtio_pmem_info = {
> + .name = TYPE_VIRTIO_PMEM,
> + .parent = TYPE_VIRTIO_DEVICE,
> + .class_init = virtio_pmem_class_init,
> + .instance_size = sizeof(VirtIOPMEM),
> + .instance_init = virtio_pmem_instance_init,
> + .interfaces = (InterfaceInfo[]) {
> + { TYPE_MEMORY_DEVICE },
> + { }
> + },
> +};
> +
> +static void virtio_register_types(void)
> +{
> + type_register_static(&virtio_pmem_info);
> +}
> +
> +type_init(virtio_register_types)
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 990d6fcbde..28829b6437 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -85,6 +85,7 @@ extern bool pci_available;
> #define PCI_DEVICE_ID_VIRTIO_RNG 0x1005
> #define PCI_DEVICE_ID_VIRTIO_9P 0x1009
> #define PCI_DEVICE_ID_VIRTIO_VSOCK 0x1012
> +#define PCI_DEVICE_ID_VIRTIO_PMEM 0x1013
>
> #define PCI_VENDOR_ID_REDHAT 0x1b36
> #define PCI_DEVICE_ID_REDHAT_BRIDGE 0x0001
> diff --git a/include/hw/virtio/virtio-pmem.h b/include/hw/virtio/virtio-pmem.h
> new file mode 100644
> index 0000000000..fda3ee691c
> --- /dev/null
> +++ b/include/hw/virtio/virtio-pmem.h
> @@ -0,0 +1,42 @@
> +/*
> + * Virtio pmem Device
> + *
> + * Copyright Red Hat, Inc. 2018
> + * Copyright Pankaj Gupta <[email protected]>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * (at your option) any later version. See the COPYING file in the
> + * top-level directory.
> + */
> +
> +#ifndef QEMU_VIRTIO_PMEM_H
> +#define QEMU_VIRTIO_PMEM_H
> +
> +#include "hw/virtio/virtio.h"
> +#include "exec/memory.h"
> +#include "sysemu/hostmem.h"
> +#include "standard-headers/linux/virtio_ids.h"
> +#include "hw/boards.h"
> +#include "hw/i386/pc.h"
> +
> +#define TYPE_VIRTIO_PMEM "virtio-pmem"
> +
> +#define VIRTIO_PMEM(obj) \
> + OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
> +
> +/* VirtIOPMEM device structure */
> +typedef struct VirtIOPMEM {
> + VirtIODevice parent_obj;
> +
> + VirtQueue *rq_vq;
> + uint64_t start;
> + uint64_t size;
> + MemoryRegion mr;
> + HostMemoryBackend *memdev;
> +} VirtIOPMEM;
> +
> +struct virtio_pmem_config {
> + uint64_t start;
> + uint64_t size;
> +};
> +#endif
> diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
> index 6d5c3b2d4f..346389565a 100644
> --- a/include/standard-headers/linux/virtio_ids.h
> +++ b/include/standard-headers/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
This should be moved to a linux header sync patch.
--
Thanks,
David / dhildenb
>
> > @@ -0,0 +1,241 @@
> > +/*
> > + * Virtio pmem device
> > + *
> > + * Copyright (C) 2018 Red Hat, Inc.
> > + * Copyright (C) 2018 Pankaj Gupta <[email protected]>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "qemu-common.h"
> > +#include "qemu/error-report.h"
> > +#include "hw/virtio/virtio-access.h"
> > +#include "hw/virtio/virtio-pmem.h"
> > +#include "hw/mem/memory-device.h"
> > +#include "block/aio.h"
> > +#include "block/thread-pool.h"
> > +
> > +typedef struct VirtIOPMEMresp {
> > + int ret;
> > +} VirtIOPMEMResp;
> > +
> > +typedef struct VirtIODeviceRequest {
> > + VirtQueueElement elem;
> > + int fd;
> > + VirtIOPMEM *pmem;
> > + VirtIOPMEMResp resp;
> > +} VirtIODeviceRequest;
>
> Both the response and the request have to go to a Linux header (and a
> header sync patch).
Sure.
>
> Also, you are using the same request struct for host<->guest handling and
> for internal purposes. The fd and pmem pointers definitely don't belong
> there. Use a separate struct for internal handling purposes (the one passed
> to worker_cb).
OK, will add a separate struct for internal handling.
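Something along these lines (sketch; field widths and names are placeholders
until the header sync is done):

	/* guest-visible layout, to come from the synced linux header */
	struct virtio_pmem_resp {
		uint32_t ret;	/* little-endian on the wire */
	};

	/* QEMU-internal bookkeeping, only passed to worker_cb()/done_cb() */
	typedef struct VirtIOPMEMRequest {
		VirtQueueElement elem;
		VirtIOPMEM *pmem;
		int fd;
		struct virtio_pmem_resp resp;
	} VirtIOPMEMRequest;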
>
> > +
> > +static int worker_cb(void *opaque)
> > +{
> > + VirtIODeviceRequest *req = opaque;
> > + int err = 0;
> > +
> > + /* flush raw backing image */
> > + err = fsync(req->fd);
> > + if (err != 0) {
> > + err = EIO;
> > + }
> > + req->resp.ret = err;
> > +
> > + return 0;
> > +}
> > +
> > +static void done_cb(void *opaque, int ret)
> > +{
> > + VirtIODeviceRequest *req = opaque;
> > + int len = iov_from_buf(req->elem.in_sg, req->elem.in_num, 0,
> > + &req->resp, sizeof(VirtIOPMEMResp));
> > +
> > + /* Callbacks are serialized, so no need to use atomic ops. */
> > + virtqueue_push(req->pmem->rq_vq, &req->elem, len);
> > + virtio_notify((VirtIODevice *)req->pmem, req->pmem->rq_vq);
> > + g_free(req);
> > +}
> > +
> > +static void virtio_pmem_flush(VirtIODevice *vdev, VirtQueue *vq)
> > +{
> > + VirtIODeviceRequest *req;
> > + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > + HostMemoryBackend *backend = MEMORY_BACKEND(pmem->memdev);
> > + ThreadPool *pool = aio_get_thread_pool(qemu_get_aio_context());
> > +
> > + req = virtqueue_pop(vq, sizeof(VirtIODeviceRequest));
> > + if (!req) {
> > + virtio_error(vdev, "virtio-pmem missing request data");
> > + return;
> > + }
> > +
> > + if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> > + virtio_error(vdev, "virtio-pmem request not proper");
> > + g_free(req);
> > + return;
> > + }
> > + req->fd = memory_region_get_fd(&backend->mr);
> > + req->pmem = pmem;
> > + thread_pool_submit_aio(pool, worker_cb, req, done_cb, req);
> > +}
> > +
> > +static void virtio_pmem_get_config(VirtIODevice *vdev, uint8_t *config)
> > +{
> > + VirtIOPMEM *pmem = VIRTIO_PMEM(vdev);
> > + struct virtio_pmem_config *pmemcfg = (struct virtio_pmem_config *)
> > config;
> > +
> > + virtio_stq_p(vdev, &pmemcfg->start, pmem->start);
> > + virtio_stq_p(vdev, &pmemcfg->size, pmem->size);
> > +}
> > +
> > +static uint64_t virtio_pmem_get_features(VirtIODevice *vdev, uint64_t
> > features,
> > + Error **errp)
> > +{
> > + return features;
> > +}
> > +
> > +static void virtio_pmem_realize(DeviceState *dev, Error **errp)
> > +{
> > + VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> > + VirtIOPMEM *pmem = VIRTIO_PMEM(dev);
> > + MachineState *ms = MACHINE(qdev_get_machine());
> > + uint64_t align;
> > + Error *local_err = NULL;
> > + MemoryRegion *mr;
> > +
> > + if (!pmem->memdev) {
> > + error_setg(errp, "virtio-pmem memdev not set");
> > + return;
> > + }
> > +
> > + mr = host_memory_backend_get_memory(pmem->memdev);
> > + align = memory_region_get_alignment(mr);
> > + pmem->size = QEMU_ALIGN_DOWN(memory_region_size(mr), align);
> > + pmem->start = memory_device_get_free_addr(ms, NULL, align, pmem->size,
> > +
> > &local_err);
> > + if (local_err) {
> > + error_setg(errp, "Can't get free address in mem device");
> > + return;
> > + }
> > + memory_region_init_alias(&pmem->mr, OBJECT(pmem),
> > + "virtio_pmem-memory", mr, 0, pmem->size);
> > + memory_device_plug_region(ms, &pmem->mr, pmem->start);
> > +
> > + host_memory_backend_set_mapped(pmem->memdev, true);
> > + virtio_init(vdev, TYPE_VIRTIO_PMEM, VIRTIO_ID_PMEM,
> > + sizeof(struct
> > virtio_pmem_config));
> > + pmem->rq_vq = virtio_add_queue(vdev, 128, virtio_pmem_flush);
> > +}
> > +
> > +static void virtio_mem_check_memdev(Object *obj, const char *name, Object
> > *val,
> > + Error **errp)
> > +{
> > + if (host_memory_backend_is_mapped(MEMORY_BACKEND(val))) {
> > + char *path = object_get_canonical_path_component(val);
> > + error_setg(errp, "Can't use already busy memdev: %s", path);
> > + g_free(path);
> > + return;
> > + }
> > +
> > + qdev_prop_allow_set_link_before_realize(obj, name, val, errp);
> > +}
> > +
> > +static const char *virtio_pmem_get_device_id(VirtIOPMEM *vm)
> > +{
> > + Object *obj = OBJECT(vm);
> > + DeviceState *parent_dev;
> > +
> > + /* always use the ID of the proxy device */
> > + if (obj->parent && object_dynamic_cast(obj->parent, TYPE_DEVICE)) {
> > + parent_dev = DEVICE(obj->parent);
> > + return parent_dev->id;
> > + }
> > + return NULL;
> > +}
> > +
> > +static void virtio_pmem_md_fill_device_info(const MemoryDeviceState *md,
> > + MemoryDeviceInfo *info)
> > +{
> > + VirtioPMemDeviceInfo *vi = g_new0(VirtioPMemDeviceInfo, 1);
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > + const char *id = virtio_pmem_get_device_id(vm);
> > +
> > + if (id) {
> > + vi->has_id = true;
> > + vi->id = g_strdup(id);
> > + }
> > +
> > + vi->start = vm->start;
> > + vi->size = vm->size;
> > + vi->memdev = object_get_canonical_path(OBJECT(vm->memdev));
> > +
> > + info->u.virtio_pmem.data = vi;
> > + info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_PMEM;
> > +}
> > +
> > +static uint64_t virtio_pmem_md_get_addr(const MemoryDeviceState *md)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > +
> > + return vm->start;
> > +}
> > +
> > +static uint64_t virtio_pmem_md_get_plugged_size(const MemoryDeviceState
> > *md)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > +
> > + return vm->size;
> > +}
> > +
> > +static uint64_t virtio_pmem_md_get_region_size(const MemoryDeviceState
> > *md)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(md);
> > +
> > + return vm->size;
> > +}
> > +
> > +static void virtio_pmem_instance_init(Object *obj)
> > +{
> > + VirtIOPMEM *vm = VIRTIO_PMEM(obj);
> > + object_property_add_link(obj, "memdev", TYPE_MEMORY_BACKEND,
> > + (Object **)&vm->memdev,
> > + (void *) virtio_mem_check_memdev,
> > + OBJ_PROP_LINK_STRONG,
> > + &error_abort);
> > +}
> > +
> > +
> > +static void virtio_pmem_class_init(ObjectClass *klass, void *data)
> > +{
> > + VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> > + MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> > +
> > + vdc->realize = virtio_pmem_realize;
> > + vdc->get_config = virtio_pmem_get_config;
> > + vdc->get_features = virtio_pmem_get_features;
> > +
> > + mdc->get_addr = virtio_pmem_md_get_addr;
> > + mdc->get_plugged_size = virtio_pmem_md_get_plugged_size;
> > + mdc->get_region_size = virtio_pmem_md_get_region_size;
> > + mdc->fill_device_info = virtio_pmem_md_fill_device_info;
> > +}
> > +
> > +static TypeInfo virtio_pmem_info = {
> > + .name = TYPE_VIRTIO_PMEM,
> > + .parent = TYPE_VIRTIO_DEVICE,
> > + .class_init = virtio_pmem_class_init,
> > + .instance_size = sizeof(VirtIOPMEM),
> > + .instance_init = virtio_pmem_instance_init,
> > + .interfaces = (InterfaceInfo[]) {
> > + { TYPE_MEMORY_DEVICE },
> > + { }
> > + },
> > +};
> > +
> > +static void virtio_register_types(void)
> > +{
> > + type_register_static(&virtio_pmem_info);
> > +}
> > +
> > +type_init(virtio_register_types)
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index 990d6fcbde..28829b6437 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -85,6 +85,7 @@ extern bool pci_available;
> > #define PCI_DEVICE_ID_VIRTIO_RNG 0x1005
> > #define PCI_DEVICE_ID_VIRTIO_9P 0x1009
> > #define PCI_DEVICE_ID_VIRTIO_VSOCK 0x1012
> > +#define PCI_DEVICE_ID_VIRTIO_PMEM 0x1013
> >
> > #define PCI_VENDOR_ID_REDHAT 0x1b36
> > #define PCI_DEVICE_ID_REDHAT_BRIDGE 0x0001
> > diff --git a/include/hw/virtio/virtio-pmem.h
> > b/include/hw/virtio/virtio-pmem.h
> > new file mode 100644
> > index 0000000000..fda3ee691c
> > --- /dev/null
> > +++ b/include/hw/virtio/virtio-pmem.h
> > @@ -0,0 +1,42 @@
> > +/*
> > + * Virtio pmem Device
> > + *
> > + * Copyright Red Hat, Inc. 2018
> > + * Copyright Pankaj Gupta <[email protected]>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or
> > + * (at your option) any later version. See the COPYING file in the
> > + * top-level directory.
> > + */
> > +
> > +#ifndef QEMU_VIRTIO_PMEM_H
> > +#define QEMU_VIRTIO_PMEM_H
> > +
> > +#include "hw/virtio/virtio.h"
> > +#include "exec/memory.h"
> > +#include "sysemu/hostmem.h"
> > +#include "standard-headers/linux/virtio_ids.h"
> > +#include "hw/boards.h"
> > +#include "hw/i386/pc.h"
> > +
> > +#define TYPE_VIRTIO_PMEM "virtio-pmem"
> > +
> > +#define VIRTIO_PMEM(obj) \
> > + OBJECT_CHECK(VirtIOPMEM, (obj), TYPE_VIRTIO_PMEM)
> > +
> > +/* VirtIOPMEM device structure */
> > +typedef struct VirtIOPMEM {
> > + VirtIODevice parent_obj;
> > +
> > + VirtQueue *rq_vq;
> > + uint64_t start;
> > + uint64_t size;
> > + MemoryRegion mr;
> > + HostMemoryBackend *memdev;
> > +} VirtIOPMEM;
> > +
> > +struct virtio_pmem_config {
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +#endif
> > diff --git a/include/standard-headers/linux/virtio_ids.h
> > b/include/standard-headers/linux/virtio_ids.h
> > index 6d5c3b2d4f..346389565a 100644
> > --- a/include/standard-headers/linux/virtio_ids.h
> > +++ b/include/standard-headers/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
>
> This should be moved to a linux header sync patch.
Sure.
Thanks,
Pankaj
>
>
>
>
> --
>
> Thanks,
>
> David / dhildenb
>
>
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta <[email protected]> wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
This looks ok to me, just some nits below.
> ---
> drivers/acpi/nfit/core.c | 7 +++++--
> drivers/nvdimm/claim.c | 3 ++-
> drivers/nvdimm/pmem.c | 12 ++++++++----
> drivers/nvdimm/region_devs.c | 12 ++++++++++--
> include/linux/libnvdimm.h | 4 +++-
> 5 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index b072cfc..cd63b69 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2216,6 +2216,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
> {
> u64 cmd, offset;
> struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
> + struct nd_region *nd_region = nfit_blk->nd_region;
>
> enum {
> BCW_OFFSET_MASK = (1ULL << 48)-1,
> @@ -2234,7 +2235,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
> offset = to_interleave_offset(offset, mmio);
>
> writeq(cmd, mmio->addr.base + offset);
> - nvdimm_flush(nfit_blk->nd_region);
> + nd_region->flush(nd_region);
I would keep the indirect function call override inside of
nvdimm_flush. Then this hunk can go away...
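i.e. keep the call sites unchanged:

	writeq(cmd, mmio->addr.base + offset);
	nvdimm_flush(nfit_blk->nd_region);

and let nvdimm_flush() itself decide between the region-specific callback
and the existing WPQ flush (sketched further down, where nd_region_create()
is touched).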
>
> if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
> readq(mmio->addr.base + offset);
> @@ -2245,6 +2246,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> unsigned int lane)
> {
> struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
> + struct nd_region *nd_region = nfit_blk->nd_region;
> unsigned int copied = 0;
> u64 base_offset;
> int rc;
> @@ -2283,7 +2285,8 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> }
>
> if (rw)
> - nvdimm_flush(nfit_blk->nd_region);
> + nd_region->flush(nd_region);
> +
>
...ditto, no need to touch this code.
> rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
> return rc;
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index fb667bf..49dce9c 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -262,6 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> {
> struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
> + struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
> sector_t sector = offset >> 9;
> int rc = 0;
>
> @@ -301,7 +302,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> }
>
> memcpy_flushcache(nsio->addr + offset, buf, size);
> - nvdimm_flush(to_nd_region(ndns->dev.parent));
> + nd_region->flush(nd_region);
For this you would need to teach nsio_rw_bytes() that the flush can fail.
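i.e. something like (sketch; reuses nsio_rw_bytes()'s existing rc
convention for the write path):

		memcpy_flushcache(nsio->addr + offset, buf, size);
		rc = nvdimm_flush(nd_region);

so a failed flush propagates as the return value instead of being dropped.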
>
> return rc;
> }
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 6071e29..ba57cfa 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -201,7 +201,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> struct nd_region *nd_region = to_region(pmem);
>
> if (bio->bi_opf & REQ_PREFLUSH)
> - nvdimm_flush(nd_region);
> + bio->bi_status = nd_region->flush(nd_region);
> +
Let's have nvdimm_flush() return 0 or -EIO if it fails, since that's
what nsio_rw_bytes() expects, and you'll need to translate that to
BLK_STS_IOERR.
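For example (sketch):

	if (bio->bi_opf & REQ_PREFLUSH)
		if (nvdimm_flush(nd_region))
			bio->bi_status = BLK_STS_IOERR;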
>
> do_acct = nd_iostat_start(bio, &start);
> bio_for_each_segment(bvec, bio, iter) {
> @@ -216,7 +217,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> nd_iostat_end(bio, start);
>
> if (bio->bi_opf & REQ_FUA)
> - nvdimm_flush(nd_region);
> + bio->bi_status = nd_region->flush(nd_region);
Same comment.
>
> bio_endio(bio);
> return BLK_QC_T_NONE;
> @@ -517,6 +518,7 @@ static int nd_pmem_probe(struct device *dev)
> static int nd_pmem_remove(struct device *dev)
> {
> struct pmem_device *pmem = dev_get_drvdata(dev);
> + struct nd_region *nd_region = to_region(pmem);
>
> if (is_nd_btt(dev))
> nvdimm_namespace_detach_btt(to_nd_btt(dev));
> @@ -528,14 +530,16 @@ static int nd_pmem_remove(struct device *dev)
> sysfs_put(pmem->bb_state);
> pmem->bb_state = NULL;
> }
> - nvdimm_flush(to_nd_region(dev->parent));
> + nd_region->flush(nd_region);
Not needed if the indirect function call moves inside nvdimm_flush().
>
> return 0;
> }
>
> static void nd_pmem_shutdown(struct device *dev)
> {
> - nvdimm_flush(to_nd_region(dev->parent));
> + struct nd_region *nd_region = to_nd_region(dev->parent);
> +
> + nd_region->flush(nd_region);
> }
>
> static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index fa37afc..a170a6b 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -290,7 +290,7 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
> return rc;
> if (!flush)
> return -EINVAL;
> - nvdimm_flush(nd_region);
> + nd_region->flush(nd_region);
Let's pass the error code through if the flush fails.
>
> return len;
> }
> @@ -1065,6 +1065,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
> dev->of_node = ndr_desc->of_node;
> nd_region->ndr_size = resource_size(ndr_desc->res);
> nd_region->ndr_start = ndr_desc->res->start;
> + if (ndr_desc->flush)
> + nd_region->flush = ndr_desc->flush;
> + else
> + nd_region->flush = nvdimm_flush;
> +
We'll need to rename the existing nvdimm_flush() to generic_nvdimm_flush().
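Sketch of what I have in mind (illustrative only):

	/* nvdimm_flush() becomes a thin dispatcher... */
	int nvdimm_flush(struct nd_region *nd_region)
	{
		if (nd_region->flush)
			return nd_region->flush(nd_region);

		return generic_nvdimm_flush(nd_region);
	}

	/* ...and the current WPQ implementation gets the new name */
	int generic_nvdimm_flush(struct nd_region *nd_region)
	{
		/* existing nvdimm_flush() body: hint and write the flush WPQs */
		return 0;
	}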
> nd_device_register(dev);
>
> return nd_region;
> @@ -1109,7 +1114,7 @@ EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> * nvdimm_flush - flush any posted write queues between the cpu and pmem media
> * @nd_region: blk or interleaved pmem region
> */
> -void nvdimm_flush(struct nd_region *nd_region)
> +int nvdimm_flush(struct nd_region *nd_region)
> {
> struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
> int i, idx;
> @@ -1133,7 +1138,10 @@ void nvdimm_flush(struct nd_region *nd_region)
> if (ndrd_get_flush_wpq(ndrd, i, 0))
> writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> wmb();
> +
> + return 0;
> }
> +
Needless newline.
> EXPORT_SYMBOL_GPL(nvdimm_flush);
>
> /**
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 097072c..3af7177 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -115,6 +115,7 @@ struct nd_mapping_desc {
> int position;
> };
>
> +struct nd_region;
> struct nd_region_desc {
> struct resource *res;
> struct nd_mapping_desc *mapping;
> @@ -126,6 +127,7 @@ struct nd_region_desc {
> int numa_node;
> unsigned long flags;
> struct device_node *of_node;
> + int (*flush)(struct nd_region *nd_region);
> };
>
> struct device;
> @@ -201,7 +203,7 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
> unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
> void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
> u64 nd_fletcher64(void *addr, size_t len, bool le);
> -void nvdimm_flush(struct nd_region *nd_region);
> +int nvdimm_flush(struct nd_region *nd_region);
> int nvdimm_has_flush(struct nd_region *nd_region);
> int nvdimm_has_cache(struct nd_region *nd_region);
>
> --
> 2.9.3
>
On Fri, Aug 31, 2018 at 6:31 AM Pankaj Gupta <[email protected]> wrote:
>
> This patch moves nd_region definition to common header
> include/linux/nd.h file. This is required for flush callback
> support for both virtio-pmem & pmem driver.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/nvdimm/nd.h | 39 ---------------------------------------
> include/linux/nd.h | 40 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+), 39 deletions(-)
No, we need to find a way to do this without dumping all of these
internal details to a public / global header.
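A forward declaration plus the callback hook in the descriptor (which patch
2 already adds to include/linux/libnvdimm.h) should be enough for the
registration side, for example:

	/* include/linux/libnvdimm.h -- nd_region stays opaque here */
	struct nd_region;

	struct nd_region_desc {
		/* ... existing fields ... */
		int (*flush)(struct nd_region *nd_region);
	};

Anything that actually dereferences nd_region internals can live under
drivers/nvdimm/ next to the private header.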
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta <[email protected]> wrote:
>
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/virtio/Kconfig | 9 ++
> drivers/virtio/Makefile | 1 +
> drivers/virtio/virtio_pmem.c | 255 +++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_pmem.h | 40 ++++++
> 5 files changed, 306 insertions(+)
> create mode 100644 drivers/virtio/virtio_pmem.c
> create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 3589764..a331e23 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
>
> If unsure, say Y.
>
> +config VIRTIO_PMEM
> + tristate "Support for virtio pmem driver"
> + depends on VIRTIO
> + help
> + This driver provides support for virtio based flushing interface
> + for persistent memory range.
> +
> + If unsure, say M.
> +
> config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5..cbe91c6 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 0000000..c22cc87
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio.h>
> +#include <linux/module.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/spinlock.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/nd.h>
I think we need to split this driver into two files:
drivers/virtio/pmem.c would discover and register the virtual pmem
device with the libnvdimm core, and drivers/nvdimm/virtio.c would
house virtio_pmem_flush().
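Roughly (illustrative layout, names not final):

	/* drivers/virtio/pmem.c: probe/remove, virtqueue setup, and
	 * nvdimm_pmem_region_create() with ndr_desc.flush wired up */

	/* drivers/nvdimm/virtio.c: the flush path only */
	int virtio_pmem_flush(struct nd_region *nd_region);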
> +
> +struct virtio_pmem_request {
> + /* Host return status corresponding to flush request */
> + int ret;
> +
> + /* command name*/
> + char name[16];
> +
> + /* Wait queue to process deferred work after ack from host */
> + wait_queue_head_t host_acked;
> + bool done;
> +
> + /* Wait queue to process deferred work after virt queue buffer avail */
> + wait_queue_head_t wq_buf;
Why does this need wait queues per request? Shouldn't these be per-device?
> + bool wq_buf_avail;
> + struct list_head list;
> +};
> +
> +struct virtio_pmem {
> + struct virtio_device *vdev;
> +
> + /* Virtio pmem request queue */
> + struct virtqueue *req_vq;
> +
> + /* nvdimm bus registers virtio pmem device */
> + struct nvdimm_bus *nvdimm_bus;
> + struct nvdimm_bus_descriptor nd_desc;
> +
> + /* List to store deferred work if virtqueue is full */
> + struct list_head req_list;
> +
> + /* Synchronize virtqueue data */
> + spinlock_t pmem_lock;
> +
> + /* Memory region information */
> + uint64_t start;
> + uint64_t size;
> +};
> +
> +static struct virtio_device_id id_table[] = {
> + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> + { 0 },
> +};
> +
> + /* The interrupt handler */
> +static void host_ack(struct virtqueue *vq)
> +{
> + unsigned int len;
> + unsigned long flags;
> + struct virtio_pmem_request *req, *req_buf;
> + struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> + req->done = true;
> + wake_up(&req->host_acked);
> +
> + if (!list_empty(&vpmem->req_list)) {
> + req_buf = list_first_entry(&vpmem->req_list,
> + struct virtio_pmem_request, list);
> + list_del(&vpmem->req_list);
> + req_buf->wq_buf_avail = true;
> + wake_up(&req_buf->wq_buf);
> + }
> + }
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> + struct virtqueue *vq;
> +
> + /* single vq */
> + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> + host_ack, "flush_queue");
> + if (IS_ERR(vq))
> + return PTR_ERR(vq);
> +
> + spin_lock_init(&vpmem->pmem_lock);
> + INIT_LIST_HEAD(&vpmem->req_list);
> +
> + return 0;
> +};
> +
> + /* The request submission function */
> +static int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> + int err;
> + unsigned long flags;
> + struct scatterlist *sgs[2], sg, ret;
> + struct virtio_device *vdev =
> + dev_to_virtio(nd_region->dev.parent->parent);
That's a long de-ref chain; I would just stash the vdev in
nd_region->provider_data.
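i.e. (sketch, assuming provider_data is plumbed through the region
descriptor as usual):

	/* probe path, when creating the region */
	ndr_desc.provider_data = vdev;

	/* flush path */
	struct virtio_device *vdev = nd_region->provider_data;
	struct virtio_pmem *vpmem = vdev->priv;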
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> +
> + if (!req)
> + return -ENOMEM;
> +
> + req->done = req->wq_buf_avail = false;
> + strcpy(req->name, "FLUSH");
> + init_waitqueue_head(&req->host_acked);
> + init_waitqueue_head(&req->wq_buf);
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + sg_init_one(&sg, req->name, strlen(req->name));
> + sgs[0] = &sg;
> + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> + sgs[1] = &ret;
> + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> + if (err) {
> + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> + list_add_tail(&vpmem->req_list, &req->list);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->wq_buf, req->wq_buf_avail);
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + }
> + virtqueue_kick(vpmem->req_vq);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->host_acked, req->done);
Hmm, this seems awkward if this is called from pmem_make_request(). If
we need to wait for completion, that should be managed by the guest
block layer, i.e. make_request should just queue the request and then
trigger bio_endio() when the response comes back.
However, this does mean that nvdimm_flush() becomes asynchronous. So
maybe we need to pass in a 'sync' flag or the bio directly to indicate
whether this is an asynchronous flush request from pmem_make_request()
vs a synchronous one from nsio_rw_bytes().
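One possible shape for that (sketch only; the two-argument signature below
is an assumption, not something in this series):

	/* bio == NULL: synchronous caller (nsio_rw_bytes), wait for the host.
	 * bio != NULL: pmem_make_request, queue and call bio_endio() later. */
	int virtio_pmem_flush(struct nd_region *nd_region, struct bio *bio);

	if (bio->bi_opf & REQ_PREFLUSH)
		nvdimm_flush(nd_region, bio);	/* completes via bio_endio() */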
> + err = req->ret;
> + kfree(req);
> +
> + return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc;
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> + memset(&ndr_desc, 0, sizeof(ndr_desc));
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +
> + if (!nd_region)
> + goto out_nd;
> +
> + //virtio_device_ready(vdev);
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
> + vdev->config->del_vqs(vdev);
> + kfree(vpmem);
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static int virtio_pmem_freeze(struct virtio_device *vdev)
> +{
> + /* todo: handle freeze function */
> + return -EPERM;
> +}
> +
> +static int virtio_pmem_restore(struct virtio_device *vdev)
> +{
> + /* todo: handle restore function */
> + return -EPERM;
> +}
> +#endif
As far as I can see there's nothing to do on a power transition; I
would just omit this completely.
> +
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name = KBUILD_MODNAME,
> + .driver.owner = THIS_MODULE,
> + .id_table = id_table,
> + .probe = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +#ifdef CONFIG_PM_SLEEP
> + .freeze = virtio_pmem_freeze,
> + .restore = virtio_pmem_restore,
> +#endif
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2..3463895 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
>
> #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 0000000..c7c22a5
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible drivers/servers:
The SPDX identifier does not match this BSD license, and the whole
point of the SPDX identifier is to get out of the need to have these
large text blobs of license goop.
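For a dual-licensed uapi header the usual form is a single tag, e.g.
(assuming BSD-3-Clause is the intended permissive license here):

	/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause */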
> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + * notice, this list of conditions and the following disclaimer in the
> + * documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + * may be used to endorse or promote products derived from this software
> + * without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2018-2019
> + * Copyright (C) Pankaj Gupta <[email protected]>, 2018
> + */
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> + __le64 start;
> + __le64 size;
> +};
> +#endif
Why does this need to be in the uapi?
Hi Dan,
Thanks for the review. Please find my reply inline.
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/virtio/Kconfig | 9 ++
> > drivers/virtio/Makefile | 1 +
> >  drivers/virtio/virtio_pmem.c     | 255 +++++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 40 ++++++
> > 5 files changed, 306 insertions(+)
> > create mode 100644 drivers/virtio/virtio_pmem.c
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 3589764..a331e23 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> >
> > If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > + tristate "Support for virtio pmem driver"
> > + depends on VIRTIO
> > + help
> > + This driver provides support for virtio based flushing interface
> > + for persistent memory range.
> > +
> > + If unsure, say M.
> > +
> > config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5..cbe91c6 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 0000000..c22cc87
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,255 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/nd.h>
>
> I think we need to split this driver into 2 files,
> drivers/virtio/pmem.c would discover and register the virtual pmem
> device with the libnvdimm core, and drivers/nvdimm/virtio.c would
> house virtio_pmem_flush().
o.k. Will split the driver into two files as suggested.
>
> > +
> > +struct virtio_pmem_request {
> > + /* Host return status corresponding to flush request */
> > + int ret;
> > +
> > + /* command name*/
> > + char name[16];
> > +
> > + /* Wait queue to process deferred work after ack from host */
> > + wait_queue_head_t host_acked;
> > + bool done;
> > +
> > +       /* Wait queue to process deferred work after virt queue buffer avail */
> > + wait_queue_head_t wq_buf;
>
> Why does this need wait_queue's per request? shouldn't this be per-device?
This is used to make the flush-calling threads wait when the virtio queue is full.
The wait_queue in the request struct ties the waitqueue to its request: when the host
acknowledges the guest, the first waiting request is selected and the corresponding
thread is woken up. Alternatively, we could use "add_wait_queue_exclusive" with a
per-device wait_queue; that wakes up only one exclusive waiting process and avoids
keeping an additional list for tracking.
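A rough sketch of that exclusive-waiter alternative (illustrative only; the per-device
waitqueue, the locking shown and the vq_has_space() helper are assumptions, not part of
this patch):

	/* in struct virtio_pmem: one waitqueue shared by all submitters */
	wait_queue_head_t vq_wait;

	/* submitter side, when virtqueue_add_sgs() fails with -ENOSPC */
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&vpmem->vq_wait, &wait,
					  TASK_UNINTERRUPTIBLE);
		if (vq_has_space(vpmem->req_vq))
			break;
		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
		schedule();
		spin_lock_irqsave(&vpmem->pmem_lock, flags);
	}
	finish_wait(&vpmem->vq_wait, &wait);

	/* completion side, in host_ack(): wakes exactly one exclusive waiter */
	wake_up(&vpmem->vq_wait);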
>
> > + bool wq_buf_avail;
> > + struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > + struct virtio_device *vdev;
> > +
> > + /* Virtio pmem request queue */
> > + struct virtqueue *req_vq;
> > +
> > + /* nvdimm bus registers virtio pmem device */
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nvdimm_bus_descriptor nd_desc;
> > +
> > + /* List to store deferred work if virtqueue is full */
> > + struct list_head req_list;
> > +
> > + /* Synchronize virtqueue data */
> > + spinlock_t pmem_lock;
> > +
> > + /* Memory region information */
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* The interrupt handler */
> > +static void host_ack(struct virtqueue *vq)
> > +{
> > + unsigned int len;
> > + unsigned long flags;
> > + struct virtio_pmem_request *req, *req_buf;
> > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > + req->done = true;
> > + wake_up(&req->host_acked);
> > +
> > + if (!list_empty(&vpmem->req_list)) {
> > + req_buf = list_first_entry(&vpmem->req_list,
> > + struct virtio_pmem_request, list);
> > + list_del(&vpmem->req_list);
> > + req_buf->wq_buf_avail = true;
> > + wake_up(&req_buf->wq_buf);
> > + }
> > + }
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > + struct virtqueue *vq;
> > +
> > + /* single vq */
> > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vq))
> > + return PTR_ERR(vq);
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > + /* The request submission function */
> > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > + int err;
> > + unsigned long flags;
> > + struct scatterlist *sgs[2], sg, ret;
> > + struct virtio_device *vdev =
> > + dev_to_virtio(nd_region->dev.parent->parent);
>
> That's a long de-ref chain I would just stash the vdev in
> nd_region->provider_data.
Sure. Will use 'nd_region->provider_data' for vdev.
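For reference, a minimal sketch of what that could look like (assuming the region
descriptor's provider_data is used to carry the vdev; this is not the posted code):

	/* probe: stash the virtio_device when creating the region */
	ndr_desc.provider_data = vdev;
	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);

	/* flush: recover it without walking the device hierarchy */
	static int virtio_pmem_flush(struct nd_region *nd_region)
	{
		struct virtio_device *vdev = nd_region->provider_data;
		struct virtio_pmem *vpmem = vdev->priv;

		/* ...submit the flush request on vpmem->req_vq as before... */
		return 0;
	}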
>
> > + struct virtio_pmem *vpmem = vdev->priv;
> > +       struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > +
> > + if (!req)
> > + return -ENOMEM;
> > +
> > + req->done = req->wq_buf_avail = false;
> > + strcpy(req->name, "FLUSH");
> > + init_waitqueue_head(&req->host_acked);
> > + init_waitqueue_head(&req->wq_buf);
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + sg_init_one(&sg, req->name, strlen(req->name));
> > + sgs[0] = &sg;
> > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > + sgs[1] = &ret;
> > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > + if (err) {
> > +               dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > +
> > + list_add_tail(&vpmem->req_list, &req->list);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +               /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->wq_buf, req->wq_buf_avail);
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + }
> > + virtqueue_kick(vpmem->req_vq);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->host_acked, req->done);
>
> Hmm, this seems awkward if this is called from pmem_make_request. If
> we need to wait for completion that should be managed by the guest
> block layer. I.e. make_request should just queue request and then
> trigger bio_endio() when the response comes back.
We are plugging a VIRTIO-based flush callback into the virtio_pmem driver. If the pmem
driver (pmem_make_request) has to queue the request, we would have to plug "blk_mq_ops"
callbacks for the corresponding VIRTIO vqs. AFAICU there is no existing multiqueue
code merged for the pmem driver yet, though I could see patches by Dave upstream.
Am I missing anything here?
>
> However this does mean that nvdimm_flush() becomes asynchronous. So
> maybe we need to pass in a 'sync' flag or the bio directly to indicate
> this is an asynchronous flush request from pmem_make_request() vs a
> synchronous one from nsio_rw_bytes().
Sure.
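One possible shape for that, purely as a sketch (the extra bio argument follows Dan's
suggestion above and is not code from this series; a NULL bio would mean "synchronous
caller"):

	/* in the region descriptor / nd_region */
	int (*flush)(struct nd_region *nd_region, struct bio *bio);

	/* pmem_make_request(): asynchronous, completion driven by the bio */
	nvdimm_flush(nd_region, bio);

	/* nsio_rw_bytes(): synchronous flush, returns 0 or -EIO */
	rc = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);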
>
> > + err = req->ret;
> > + kfree(req);
> > +
> > + return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nd_region_desc ndr_desc;
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > + GFP_KERNEL);
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
> > + if (!nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > + memset(&ndr_desc, 0, sizeof(ndr_desc));
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = virtio_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +
> > + if (!nd_region)
> > + goto out_nd;
> > +
> > + //virtio_device_ready(vdev);
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
> > + vdev->config->del_vqs(vdev);
> > + kfree(vpmem);
> > +}
> > +
> > +#ifdef CONFIG_PM_SLEEP
> > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > +{
> > + /* todo: handle freeze function */
> > + return -EPERM;
> > +}
> > +
> > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > +{
> > + /* todo: handle restore function */
> > + return -EPERM;
> > +}
> > +#endif
>
> As far as I can see there's nothing to do on a power transition, I
> would just omit this completely.
o.k. Will remove these handlers.
>
> > +
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +#ifdef CONFIG_PM_SLEEP
> > + .freeze = virtio_pmem_freeze,
> > + .restore = virtio_pmem_restore,
> > +#endif
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2..3463895 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> >
> > #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 0000000..c7c22a5
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible drivers/servers:
>
> The SPDX identifier does not match this BSD license, and the whole
> point of the SPDX identifier is to get out of the need to have these
> large text blobs of license goop.
Right, just copied this. Will remove these BSD lines.
>
> > + *
> > + *
> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + * notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + * notice, this list of conditions and the following disclaimer in the
> > + * documentation and/or other materials provided with the distribution.
> > + * 3. Neither the name of IBM nor the names of its contributors
> > + *    may be used to endorse or promote products derived from this software
> > + *    without specific prior written permission.
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + *
> > + * Copyright (C) Red Hat, Inc., 2018-2019
> > + * Copyright (C) Pankaj Gupta <[email protected]>, 2018
> > + */
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > + __le64 start;
> > + __le64 size;
> > +};
> > +#endif
>
> Why does this need to be in the uapi?
This struct is defined by userspace (Qemu) and used by the kernel to fetch the
values passed by the qemu device.
Thanks,
Pankaj
> > This patch adds functionality to perform flush from guest
> > to host over VIRTIO. We are registering a callback based
> > on 'nd_region' type. virtio_pmem driver requires this special
> > flush function. For rest of the region types we are registering
> > existing flush function. Report error returned by host fsync
> > failure to userspace.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
>
> This looks ok to me, just some nits below.
>
> > ---
> > drivers/acpi/nfit/core.c | 7 +++++--
> > drivers/nvdimm/claim.c | 3 ++-
> > drivers/nvdimm/pmem.c | 12 ++++++++----
> > drivers/nvdimm/region_devs.c | 12 ++++++++++--
> > include/linux/libnvdimm.h | 4 +++-
> > 5 files changed, 28 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> > index b072cfc..cd63b69 100644
> > --- a/drivers/acpi/nfit/core.c
> > +++ b/drivers/acpi/nfit/core.c
> > @@ -2216,6 +2216,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
> > {
> > u64 cmd, offset;
> > struct nfit_blk_mmio *mmio = &nfit_blk->mmio[DCR];
> > + struct nd_region *nd_region = nfit_blk->nd_region;
> >
> > enum {
> > BCW_OFFSET_MASK = (1ULL << 48)-1,
> > @@ -2234,7 +2235,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
> > offset = to_interleave_offset(offset, mmio);
> >
> > writeq(cmd, mmio->addr.base + offset);
> > - nvdimm_flush(nfit_blk->nd_region);
> > + nd_region->flush(nd_region);
>
> I would keep the indirect function call override inside of
> nvdimm_flush. Then this hunk can go away...
Sure. Will change.
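A minimal sketch of keeping the override inside nvdimm_flush(), assuming the current
body moves into a helper (the generic_nvdimm_flush name is illustrative):

	int nvdimm_flush(struct nd_region *nd_region)
	{
		if (nd_region->flush)
			return nd_region->flush(nd_region);
		return generic_nvdimm_flush(nd_region);
	}

Callers such as write_blk_ctl() and acpi_nfit_blk_single_io() then keep calling
nvdimm_flush() unchanged.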
>
> >
> > if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
> > readq(mmio->addr.base + offset);
> > @@ -2245,6 +2246,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> > unsigned int lane)
> > {
> > struct nfit_blk_mmio *mmio = &nfit_blk->mmio[BDW];
> > + struct nd_region *nd_region = nfit_blk->nd_region;
> > unsigned int copied = 0;
> > u64 base_offset;
> > int rc;
> > @@ -2283,7 +2285,8 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> > }
> >
> > if (rw)
> > - nvdimm_flush(nfit_blk->nd_region);
> > + nd_region->flush(nd_region);
> > +
> >
>
> ...ditto, no need to touch this code.
Sure.
>
> > rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
> > return rc;
> > diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> > index fb667bf..49dce9c 100644
> > --- a/drivers/nvdimm/claim.c
> > +++ b/drivers/nvdimm/claim.c
> > @@ -262,6 +262,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> > {
> > struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> > unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
> > + struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
> > sector_t sector = offset >> 9;
> > int rc = 0;
> >
> > @@ -301,7 +302,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> > }
> >
> > memcpy_flushcache(nsio->addr + offset, buf, size);
> > - nvdimm_flush(to_nd_region(ndns->dev.parent));
> > + nd_region->flush(nd_region);
>
> For this you would need to teach nsio_rw_bytes() that the flush can fail.
Sure.
>
> >
> > return rc;
> > }
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 6071e29..ba57cfa 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -201,7 +201,8 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> > struct nd_region *nd_region = to_region(pmem);
> >
> > if (bio->bi_opf & REQ_PREFLUSH)
> > - nvdimm_flush(nd_region);
> > + bio->bi_status = nd_region->flush(nd_region);
> > +
>
> Let's have nvdimm_flush() return 0 or -EIO if it fails since thats
> what nsio_rw_bytes() expects, and you'll need to translate that to:
> BLK_STS_IOERR
o.k. Will change it as per suggestion.
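For example, the REQ_PREFLUSH hunk in pmem_make_request() might end up looking roughly
like this (a sketch, not the final patch):

	if (bio->bi_opf & REQ_PREFLUSH) {
		if (nvdimm_flush(nd_region))	/* returns 0 or -EIO */
			bio->bi_status = BLK_STS_IOERR;
	}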
>
> >
> > do_acct = nd_iostat_start(bio, &start);
> > bio_for_each_segment(bvec, bio, iter) {
> > @@ -216,7 +217,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
> > nd_iostat_end(bio, start);
> >
> > if (bio->bi_opf & REQ_FUA)
> > - nvdimm_flush(nd_region);
> > + bio->bi_status = nd_region->flush(nd_region);
>
> Same comment.
Sure.
>
> >
> > bio_endio(bio);
> > return BLK_QC_T_NONE;
> > @@ -517,6 +518,7 @@ static int nd_pmem_probe(struct device *dev)
> > static int nd_pmem_remove(struct device *dev)
> > {
> > struct pmem_device *pmem = dev_get_drvdata(dev);
> > + struct nd_region *nd_region = to_region(pmem);
> >
> > if (is_nd_btt(dev))
> > nvdimm_namespace_detach_btt(to_nd_btt(dev));
> > @@ -528,14 +530,16 @@ static int nd_pmem_remove(struct device *dev)
> > sysfs_put(pmem->bb_state);
> > pmem->bb_state = NULL;
> > }
> > - nvdimm_flush(to_nd_region(dev->parent));
> > + nd_region->flush(nd_region);
>
> Not needed if the indirect function call moves inside nvdimm_flush().
o.k
>
> >
> > return 0;
> > }
> >
> > static void nd_pmem_shutdown(struct device *dev)
> > {
> > - nvdimm_flush(to_nd_region(dev->parent));
> > + struct nd_region *nd_region = to_nd_region(dev->parent);
> > +
> > + nd_region->flush(nd_region);
> > }
> >
> > static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index fa37afc..a170a6b 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -290,7 +290,7 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
> > return rc;
> > if (!flush)
> > return -EINVAL;
> > - nvdimm_flush(nd_region);
> > + nd_region->flush(nd_region);
>
> Let's pass the error code through if the flush fails.
o.k
>
> >
> > return len;
> > }
> > @@ -1065,6 +1065,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
> > dev->of_node = ndr_desc->of_node;
> > nd_region->ndr_size = resource_size(ndr_desc->res);
> > nd_region->ndr_start = ndr_desc->res->start;
> > + if (ndr_desc->flush)
> > + nd_region->flush = ndr_desc->flush;
> > + else
> > + nd_region->flush = nvdimm_flush;
> > +
>
> We'll need to rename the existing nvdimm_flush() to generic_nvdimm_flush().
Sure.
>
> > nd_device_register(dev);
> >
> > return nd_region;
> > @@ -1109,7 +1114,7 @@ EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> >  * nvdimm_flush - flush any posted write queues between the cpu and pmem media
> > * @nd_region: blk or interleaved pmem region
> > */
> > -void nvdimm_flush(struct nd_region *nd_region)
> > +int nvdimm_flush(struct nd_region *nd_region)
> > {
> > struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
> > int i, idx;
> > @@ -1133,7 +1138,10 @@ void nvdimm_flush(struct nd_region *nd_region)
> > if (ndrd_get_flush_wpq(ndrd, i, 0))
> > writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> > wmb();
> > +
> > + return 0;
> > }
> > +
>
> Needless newline.
Will remove this.
>
> > EXPORT_SYMBOL_GPL(nvdimm_flush);
> >
> > /**
> > diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> > index 097072c..3af7177 100644
> > --- a/include/linux/libnvdimm.h
> > +++ b/include/linux/libnvdimm.h
> > @@ -115,6 +115,7 @@ struct nd_mapping_desc {
> > int position;
> > };
> >
> > +struct nd_region;
> > struct nd_region_desc {
> > struct resource *res;
> > struct nd_mapping_desc *mapping;
> > @@ -126,6 +127,7 @@ struct nd_region_desc {
> > int numa_node;
> > unsigned long flags;
> > struct device_node *of_node;
> > + int (*flush)(struct nd_region *nd_region);
> > };
> >
> > struct device;
> > @@ -201,7 +203,7 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
> > unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
> >  void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
> > u64 nd_fletcher64(void *addr, size_t len, bool le);
> > -void nvdimm_flush(struct nd_region *nd_region);
> > +int nvdimm_flush(struct nd_region *nd_region);
> > int nvdimm_has_flush(struct nd_region *nd_region);
> > int nvdimm_has_cache(struct nd_region *nd_region);
> >
> > --
> > 2.9.3
> >
>
Thanks,
Pankaj
> Subject: Re: [PATCH 1/3] nd: move nd_region to common header
>
> On Fri, Aug 31, 2018 at 6:31 AM Pankaj Gupta <[email protected]> wrote:
> >
> > This patch moves nd_region definition to common header
> > include/linux/nd.h file. This is required for flush callback
> > support for both virtio-pmem & pmem driver.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/nvdimm/nd.h | 39 ---------------------------------------
> > include/linux/nd.h | 40 ++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 40 insertions(+), 39 deletions(-)
>
> No, we need to find a way to do this without dumping all of these
> internal details to a public / global header.
This is required because the virtio_pmem driver accesses fields of the nd_region struct.
If instead we pass a device pointer in place of nd_region, we don't need to put this in
a global header. Thoughts?
e.g. virtio_pmem_flush(struct device *dev)
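A sketch of that alternative signature (illustrative; the parent walk below assumes the
same device layout as the posted probe code):

	static int virtio_pmem_flush(struct device *dev)
	{
		struct virtio_device *vdev = dev_to_virtio(dev->parent->parent);
		struct virtio_pmem *vpmem = vdev->priv;

		/* ...issue the flush request on vpmem->req_vq as before... */
		return 0;
	}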
Thanks,
Pankaj
Hello Dan,
> > > + /* The request submission function */
> > > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > + int err;
[...]
> > > + init_waitqueue_head(&req->host_acked);
> > > + init_waitqueue_head(&req->wq_buf);
> > > +
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > + sg_init_one(&sg, req->name, strlen(req->name));
> > > + sgs[0] = &sg;
> > > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > + sgs[1] = &ret;
[...]
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > + /* When host has read buffer, this completes via host_ack */
> > > + wait_event(req->host_acked, req->done);
> >
> > Hmm, this seems awkward if this is called from pmem_make_request. If
> > we need to wait for completion that should be managed by the guest
> > block layer. I.e. make_request should just queue request and then
> > trigger bio_endio() when the response comes back.
>
> We are plugging VIRTIO based flush callback for virtio_pmem driver. If pmem
> driver (pmem_make_request) has to queue request we have to plug "blk_mq_ops"
> callbacks for corresponding VIRTIO vqs. AFAICU there is no existing
> multiqueue
> code merged for pmem driver yet, though i could see patches by Dave upstream.
>
I thought about this. With the current infrastructure, "make_request" releases the
spinlock and puts the current thread/task to sleep. All other threads are free to call
'make_request'/flush and similarly wait after releasing the lock. This effectively works
like a queue of threads waiting for notifications from the host.
The current pmem code does not have multiqueue support and I am not sure if the core
pmem code needs it. Adding multiqueue support just for virtio-pmem and not for pmem in
the same driver would be confusing or require a lot of tweaking.
Could you please give your suggestions on this?
Thanks,
Pankaj
On Thu, Sep 27, 2018 at 6:07 AM Pankaj Gupta <[email protected]> wrote:
[..]
> > We are plugging VIRTIO based flush callback for virtio_pmem driver. If pmem
> > driver (pmem_make_request) has to queue request we have to plug "blk_mq_ops"
> > callbacks for corresponding VIRTIO vqs. AFAICU there is no existing
> > multiqueue
> > code merged for pmem driver yet, though i could see patches by Dave upstream.
> >
>
> I thought about this. With the current infrastructure, "make_request" releases the
> spinlock and puts the current thread/task to sleep. All other threads are free to call
> 'make_request'/flush and similarly wait after releasing the lock.
Which lock are you referring to?
> This effectively works like a queue of threads
> waiting for notifications from the host.
>
> The current pmem code does not have multiqueue support and I am not sure if the core
> pmem code needs it. Adding multiqueue support just for virtio-pmem and not for pmem in
> the same driver would be confusing or require a lot of tweaking.
Why does the pmem driver need to be converted to multiqueue support?
> Could you please give your suggestions on this.
I was expecting that flush requests that cannot be completed
synchronously be placed on a queue and have bio_endio() called at a
future time. I.e. use bio_chain() to manage the async portion of the
flush request. This causes the guest block layer to just assume the
bio was queued and will be completed at some point in the future.
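A hedged sketch of what that could look like for the virtio case (the helper names and
details are assumptions, not an agreed design):

	/*
	 * Called from pmem_make_request() for REQ_PREFLUSH instead of blocking.
	 * The parent bio only completes once the chained child does;
	 * virtio_pmem_submit_flush() is a hypothetical helper that queues the
	 * virtio request and calls bio_endio(child) from host_ack() when the
	 * host acknowledges the flush.
	 */
	static int pmem_async_flush(struct nd_region *nd_region, struct bio *parent)
	{
		struct bio *child = bio_alloc(GFP_ATOMIC, 0);

		if (!child)
			return -ENOMEM;

		bio_copy_dev(child, parent);
		child->bi_opf = REQ_PREFLUSH;
		bio_chain(child, parent);
		virtio_pmem_submit_flush(nd_region, child);
		return 0;
	}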