2019-05-10 15:53:24

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v8 0/6] virtio pmem driver

Hi Michael & Dan,

Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
your ack on nvdimm patches(1 & 3) & virtio patch 2.

Changes done from v7 are only in patch(2 & 3) and not
affecting existing reviews. Request to please review.
----

This patch series has implementation for "virtio pmem".
"virtio pmem" is fake persistent memory(nvdimm) in guest
which allows to bypass the guest page cache. This also
implements a VIRTIO based asynchronous flush mechanism.

Sharing guest kernel driver in this patchset with the
changes suggested in v4. Tested with Qemu side device
emulation [6] for virtio-pmem. Documented the impact of
possible page cache side channel attacks with suggested
countermeasures.

Details of project idea for 'virtio pmem' flushing interface
is shared [3] & [4].

Implementation is divided into two parts:
New virtio pmem guest driver and qemu code changes for new
virtio pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- 'nvdimm/pmem' driver uses this information to allocate
persistent memory region and setup filesystem operations
to the allocated memory.
- virtio pmem driver implements asynchronous flushing
interface to flush from guest to host.

2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- At host side this is file backed memory which acts as
persistent memory.
- Qemu side flush uses aio thread pool API's and virtio
for asynchronous guest multi request handling.

David Hildenbrand CCed also posted a modified version[7] of
qemu virtio-pmem code based on updated Qemu memory device API.

Virtio-pmem security implications and countermeasures:
-----------------------------------------------------

In previous posting of kernel driver, there was discussion [9]
on possible implications of page cache side channel attacks with
virtio pmem. After thorough analysis of details of known side
channel attacks, below are the suggestions:

- Depends entirely on how host backing image file is mapped
into guest address space.

- virtio-pmem device emulation, by default shared mapping is used
to map host backing file. It is recommended to use separate
backing file at host side for every guest. This will prevent
any possibility of executing common code from multiple guests
and any chance of inferring guest local data based based on
execution time.

- If backing file is required to be shared among multiple guests
it is recommended to don't support host page cache eviction
commands from the guest driver. This will avoid any possibility
of inferring guest local data or host data from another guest.

- Proposed device specification [8] for virtio-pmem device with
details of possible security implications and suggested
countermeasures for device emulation.

Virtio-pmem errors handling:
----------------------------------------
Checked behaviour of virtio-pmem for below types of errors
Need suggestions on expected behaviour for handling these errors?

- Hardware Errors: Uncorrectable recoverable Errors:
a] virtio-pmem:
- As per current logic if error page belongs to Qemu process,
host MCE handler isolates(hwpoison) that page and send SIGBUS.
Qemu SIGBUS handler injects exception to KVM guest.
- KVM guest then isolates the page and send SIGBUS to guest
userspace process which has mapped the page.

b] Existing implementation for ACPI pmem driver:
- Handles such errors with MCE notifier and creates a list
of bad blocks. Read/direct access DAX operation return EIO
if accessed memory page fall in bad block list.
- It also starts backgound scrubbing.
- Similar functionality can be reused in virtio-pmem with MCE
notifier but without scrubbing(no ACPI/ARS)? Need inputs to
confirm if this behaviour is ok or needs any change?

Changes from PATCH v7: [1]
- Corrected pending request queue logic (patch 2) - Jakub Staroń
- Used unsigned long flags for passing DAXDEV_F_SYNC (patch 3) - Dan
- Fixed typo => vma 'flag' to 'vm_flag' (patch 4)
- Added rob in patch 6 & patch 2

Changes from PATCH v6: [1]
- Corrected comment format in patch 5 & patch 6. [Dave]
- Changed variable declaration indentation in patch 6 [Darrick]
- Add Reviewed-by tag by 'Jan Kara' in patch 4 & patch 5

Changes from PATCH v5: [2]
Changes suggested in by - [Cornelia, Yuval]
- Remove assignment chaining in virtio driver
- Better error message and remove not required free
- Check nd_region before use

Changes suggested by - [Jan Kara]
- dax_synchronous() for !CONFIG_DAX
- Correct 'daxdev_mapping_supported' comment and non-dax implementation

Changes suggested by - [Dan Williams]
- Pass meaningful flag 'DAXDEV_F_SYNC' to alloc_dax
- Gate nvdimm_flush instead of additional async parameter
- Move block chaining logic to flush callback than common nvdimm_flush
- Use NULL flush callback for generic flush for better readability [Dan, Jan]

- Use virtio device id 27 from 25(already used) - [MST]

Changes from PATCH v4:
- Factor out MAP_SYNC supported functionality to a common helper
[Dave, Darrick, Jan]
- Comment, indentation and virtqueue_kick failure handle - Yuval Shaia

Changes from PATCH v3:
- Use generic dax_synchronous() helper to check for DAXDEV_SYNC
flag - [Dan, Darrick, Jan]
- Add 'is_nvdimm_async' function
- Document page cache side channel attacks implications &
countermeasures - [Dave Chinner, Michael]

Changes from PATCH v2:
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'

Changes from PATCH v1:
- 0-day build test for build dependency on libnvdimm

Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Remove BSD license text with SPDX license text

- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow

Pankaj Gupta (6):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
dax: check synchronous mapping is supported
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem

[1] https://lkml.org/lkml/2019/4/26/36
[2] https://lkml.org/lkml/2019/4/23/1092
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2
[7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
[8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
[9] https://lkml.org/lkml/2019/1/9/1191

drivers/acpi/nfit/core.c | 4 -
drivers/dax/bus.c | 2
drivers/dax/super.c | 13 +++
drivers/md/dm.c | 3
drivers/nvdimm/Makefile | 1
drivers/nvdimm/claim.c | 6 +
drivers/nvdimm/nd.h | 1
drivers/nvdimm/nd_virtio.c | 129 +++++++++++++++++++++++++++++++++++++++
drivers/nvdimm/pmem.c | 18 +++--
drivers/nvdimm/region_devs.c | 33 +++++++++
drivers/nvdimm/virtio_pmem.c | 117 +++++++++++++++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
fs/ext4/file.c | 10 +--
fs/xfs/xfs_file.c | 9 +-
include/linux/dax.h | 25 ++++++-
include/linux/libnvdimm.h | 9 ++
include/linux/virtio_pmem.h | 60 ++++++++++++++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 10 +++
19 files changed, 436 insertions(+), 25 deletions(-)


2019-05-10 15:53:54

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v8 2/6] virtio-pmem: Add virtio pmem driver

This patch adds virtio-pmem driver for KVM guest.

Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.

This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.

Signed-off-by: Pankaj Gupta <[email protected]>
Reviewed-by: Yuval Shaia <[email protected]>
---
drivers/nvdimm/Makefile | 1 +
drivers/nvdimm/nd_virtio.c | 129 +++++++++++++++++++++++++++++++
drivers/nvdimm/virtio_pmem.c | 117 ++++++++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
include/linux/virtio_pmem.h | 60 ++++++++++++++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_pmem.h | 10 +++
7 files changed, 328 insertions(+)
create mode 100644 drivers/nvdimm/nd_virtio.c
create mode 100644 drivers/nvdimm/virtio_pmem.c
create mode 100644 include/linux/virtio_pmem.h
create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 6f2a088afad6..cefe233e0b52 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
obj-$(CONFIG_ND_BLK) += nd_blk.o
obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
obj-$(CONFIG_OF_PMEM) += of_pmem.o
+obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o

nd_pmem-y := pmem.o

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
new file mode 100644
index 000000000000..ed7ddcc5a62c
--- /dev/null
+++ b/drivers/nvdimm/nd_virtio.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+ unsigned int len;
+ unsigned long flags;
+ struct virtio_pmem_request *req, *req_buf;
+ struct virtio_pmem *vpmem = vq->vdev->priv;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+ req->done = true;
+ wake_up(&req->host_acked);
+
+ if (!list_empty(&vpmem->req_list)) {
+ req_buf = list_first_entry(&vpmem->req_list,
+ struct virtio_pmem_request, list);
+ req_buf->wq_buf_avail = true;
+ wake_up(&req_buf->wq_buf);
+ list_del(&req_buf->list);
+ }
+ }
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+ int err, err1;
+ unsigned long flags;
+ struct scatterlist *sgs[2], sg, ret;
+ struct virtio_device *vdev = nd_region->provider_data;
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct virtio_pmem_request *req;
+
+ might_sleep();
+ req = kmalloc(sizeof(*req), GFP_KERNEL);
+ if (!req)
+ return -ENOMEM;
+
+ req->done = false;
+ strcpy(req->name, "FLUSH");
+ init_waitqueue_head(&req->host_acked);
+ init_waitqueue_head(&req->wq_buf);
+ INIT_LIST_HEAD(&req->list);
+ sg_init_one(&sg, req->name, strlen(req->name));
+ sgs[0] = &sg;
+ sg_init_one(&ret, &req->ret, sizeof(req->ret));
+ sgs[1] = &ret;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ /*
+ * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
+ * queue does not have free descriptor. We add the request
+ * to req_list and wait for host_ack to wake us up when free
+ * slots are available.
+ */
+ while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
+ GFP_ATOMIC)) == -ENOSPC) {
+
+ dev_err(&vdev->dev, "failed to send command to virtio pmem"\
+ "device, no free slots in the virtqueue\n");
+ req->wq_buf_avail = false;
+ list_add_tail(&req->list, &vpmem->req_list);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->wq_buf, req->wq_buf_avail);
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ }
+ err1 = virtqueue_kick(vpmem->req_vq);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /*
+ * virtqueue_add_sgs failed with error different than -ENOSPC, we can't
+ * do anything about that.
+ */
+ if (err || !err1) {
+ dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
+ err = -EIO;
+ goto ret;
+ }
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->host_acked, req->done);
+ err = req->ret;
+ret:
+ kfree(req);
+ return err;
+};
+
+/* The asynchronous flush callback function */
+int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
+{
+ int rc = 0;
+
+ /* Create child bio for asynchronous flush and chain with
+ * parent bio. Otherwise directly call nd_region flush.
+ */
+ if (bio && bio->bi_iter.bi_sector != -1) {
+ struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+ if (!child)
+ return -ENOMEM;
+ bio_copy_dev(child, bio);
+ child->bi_opf = REQ_PREFLUSH;
+ child->bi_iter.bi_sector = -1;
+ bio_chain(child, bio);
+ submit_bio(child);
+ } else {
+ if (virtio_pmem_flush(nd_region))
+ rc = -EIO;
+ }
+
+ return rc;
+};
+EXPORT_SYMBOL_GPL(async_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 000000000000..cfc6381c4e5d
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+ /* single vq */
+ vpmem->req_vq = virtio_find_single_vq(vpmem->vdev,
+ host_ack, "flush_queue");
+ if (IS_ERR(vpmem->req_vq))
+ return PTR_ERR(vpmem->req_vq);
+
+ spin_lock_init(&vpmem->pmem_lock);
+ INIT_LIST_HEAD(&vpmem->req_list);
+
+ return 0;
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+ int err = 0;
+ struct resource res;
+ struct virtio_pmem *vpmem;
+ struct nd_region_desc ndr_desc = {};
+ int nid = dev_to_node(&vdev->dev);
+ struct nd_region *nd_region;
+
+ if (!vdev->config->get) {
+ dev_err(&vdev->dev, "%s failure: config access disabled\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem), GFP_KERNEL);
+ if (!vpmem) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ vpmem->vdev = vdev;
+ vdev->priv = vpmem;
+ err = init_vq(vpmem);
+ if (err)
+ goto out_err;
+
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ start, &vpmem->start);
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ size, &vpmem->size);
+
+ res.start = vpmem->start;
+ res.end = vpmem->start + vpmem->size-1;
+ vpmem->nd_desc.provider_name = "virtio-pmem";
+ vpmem->nd_desc.module = THIS_MODULE;
+
+ vpmem->nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+ &vpmem->nd_desc);
+ if (!vpmem->nvdimm_bus)
+ goto out_vq;
+
+ dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
+
+ ndr_desc.res = &res;
+ ndr_desc.numa_node = nid;
+ ndr_desc.flush = async_pmem_flush;
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+ set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+ nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc);
+
+ if (!nd_region)
+ goto out_nd;
+ nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent);
+ return 0;
+out_nd:
+ err = -ENXIO;
+ nvdimm_bus_unregister(vpmem->nvdimm_bus);
+out_vq:
+ vdev->config->del_vqs(vdev);
+out_err:
+ dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+ return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+ struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+ nvdimm_bus_unregister(nvdimm_bus);
+ vdev->config->del_vqs(vdev);
+ vdev->config->reset(vdev);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = virtio_pmem_probe,
+ .remove = virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 35897649c24f..9f634a2ed638 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY

If unsure, say Y.

+config VIRTIO_PMEM
+ tristate "Support for virtio pmem driver"
+ depends on VIRTIO
+ depends on LIBNVDIMM
+ help
+ This driver provides support for virtio based flushing interface
+ for persistent memory range.
+
+ If unsure, say M.
+
config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
new file mode 100644
index 000000000000..ab1da877575d
--- /dev/null
+++ b/include/linux/virtio_pmem.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * virtio_pmem.h: virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ **/
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/virtio_ids.h>
+#include <linux/module.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+#include <linux/spinlock.h>
+
+struct virtio_pmem_request {
+ /* Host return status corresponding to flush request */
+ int ret;
+
+ /* command name*/
+ char name[16];
+
+ /* Wait queue to process deferred work after ack from host */
+ wait_queue_head_t host_acked;
+ bool done;
+
+ /* Wait queue to process deferred work after virt queue buffer avail */
+ wait_queue_head_t wq_buf;
+ bool wq_buf_avail;
+ struct list_head list;
+};
+
+struct virtio_pmem {
+ struct virtio_device *vdev;
+
+ /* Virtio pmem request queue */
+ struct virtqueue *req_vq;
+
+ /* nvdimm bus registers virtio pmem device */
+ struct nvdimm_bus *nvdimm_bus;
+ struct nvdimm_bus_descriptor nd_desc;
+
+ /* List to store deferred work if virtqueue is full */
+ struct list_head req_list;
+
+ /* Synchronize virtqueue data */
+ spinlock_t pmem_lock;
+
+ /* Memory region information */
+ uint64_t start;
+ uint64_t size;
+};
+
+void host_ack(struct virtqueue *vq);
+int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..32b2f94d1f58 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_PMEM 27 /* virtio pmem */

#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 000000000000..fa3f7d52717a
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+ __le64 start;
+ __le64 size;
+};
+#endif
--
2.20.1

2019-05-10 15:54:39

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/dax/bus.c | 2 +-
drivers/dax/super.c | 13 ++++++++++++-
drivers/md/dm.c | 3 ++-
drivers/nvdimm/pmem.c | 5 ++++-
drivers/nvdimm/region_devs.c | 7 +++++++
include/linux/dax.h | 8 ++++++--
include/linux/libnvdimm.h | 1 +
7 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe80219..5f184e751c82 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
* No 'host' or dax_operations since there is no access to this
* device outside of mmap of the resulting character device.
*/
- dax_dev = alloc_dax(dev_dax, NULL, NULL);
+ dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
if (!dax_dev)
goto err;

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index bbd57ca0634a..b6c44b5062e9 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -186,6 +186,8 @@ enum dax_device_flags {
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+ /* flag to check if device supports synchronous flush */
+ DAXDEV_SYNC,
};

/**
@@ -354,6 +356,12 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(dax_write_cache_enabled);

+bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_synchronous);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
@@ -508,7 +516,7 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host)
}

struct dax_device *alloc_dax(void *private, const char *__host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, unsigned long flags)
{
struct dax_device *dax_dev;
const char *host;
@@ -531,6 +539,9 @@ struct dax_device *alloc_dax(void *private, const char *__host,
dax_add_host(dax_dev, host);
dax_dev->ops = ops;
dax_dev->private = private;
+ if (flags & DAXDEV_F_SYNC)
+ set_bit(DAXDEV_SYNC, &dax_dev->flags);
+
return dax_dev;

err_dev:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 043f0761e4a0..ee007b75d9fd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);

if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
- dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
+ DAXDEV_F_SYNC);
if (!dax_dev)
goto bad;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0279eb1da3ef..bdbd2b414d3d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -365,6 +365,7 @@ static int pmem_attach_disk(struct device *dev,
struct gendisk *disk;
void *addr;
int rc;
+ unsigned long flags = 0UL;

pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
if (!pmem)
@@ -462,7 +463,9 @@ static int pmem_attach_disk(struct device *dev,
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;

- dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ if (is_nvdimm_sync(nd_region))
+ flags = DAXDEV_F_SYNC;
+ dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..f3ea5369d20a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1197,6 +1197,13 @@ int nvdimm_has_cache(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nvdimm_has_cache);

+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev) &&
+ !test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
struct conflict_context {
struct nd_region *nd_region;
resource_size_t start, size;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..ed75b7d9d178 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,9 @@
#include <linux/radix-tree.h>
#include <asm/pgtable.h>

+/* Flag for synchronous flush */
+#define DAXDEV_F_SYNC (1UL << 0)
+
typedef unsigned long dax_entry_t;

struct iomap_ops;
@@ -32,18 +35,19 @@ extern struct attribute_group dax_attribute_group;
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *dax_get_by_host(const char *host);
struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops);
+ const struct dax_operations *ops, unsigned long flags);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
+bool dax_synchronous(struct dax_device *dax_dev);
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
return NULL;
}
static inline struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, unsigned long flags)
{
/*
* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index feb342d026f2..3238a206e563 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -264,6 +264,7 @@ void nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
int nvdimm_in_overwrite(struct nvdimm *nvdimm);
+bool is_nvdimm_sync(struct nd_region *nd_region);

static inline int nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc)
--
2.20.1

2019-05-10 15:55:36

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v8 4/6] dax: check synchronous mapping is supported

This patch introduces 'daxdev_mapping_supported' helper
which checks if 'MAP_SYNC' is supported with filesystem
mapping. It also checks if corresponding dax_device is
synchronous. Virtio pmem device is asynchronous and
does not not support VM_SYNC.

Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Pankaj Gupta <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/dax.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index c97fc0cc7167..41b4a5db6305 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,6 +41,18 @@ void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
bool dax_synchronous(struct dax_device *dax_dev);
+/*
+ * Check if given mapping is supported by the file / underlying device.
+ */
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+ struct dax_device *dax_dev)
+{
+ if (!(vma->vm_flags & VM_SYNC))
+ return true;
+ if (!IS_DAX(file_inode(vma->vm_file)))
+ return false;
+ return dax_synchronous(dax_dev);
+}
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
@@ -68,6 +80,11 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
{
return false;
}
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+ struct dax_device *dax_dev)
+{
+ return !(vma->vm_flags & VM_SYNC);
+}
#endif

struct writeback_control;
--
2.20.1

2019-05-10 15:57:04

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v8 5/6] ext4: disable map_sync for async flush

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.

Signed-off-by: Pankaj Gupta <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 98ec11f69cd4..dee549339e13 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -360,15 +360,17 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct dax_device *dax_dev = sbi->s_daxdev;

- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;

/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;

file_accessed(file);
--
2.20.1

2019-05-10 15:57:20

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v8 6/6] xfs: disable map_sync for async flush

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.

Signed-off-by: Pankaj Gupta <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/xfs/xfs_file.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a7ceae90110e..f17652cca5ff 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,11 +1203,14 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
{
+ struct dax_device *dax_dev;
+
+ dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;

file_accessed(filp);
--
2.20.1

2019-05-10 16:31:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v8 0/6] virtio pmem driver

On Fri, May 10, 2019 at 09:21:56PM +0530, Pankaj Gupta wrote:
> Hi Michael & Dan,
>
> Please review/ack the patch series from LIBNVDIMM & VIRTIO side.

Thanks!
Hope to do this early next week.

> We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
> your ack on nvdimm patches(1 & 3) & virtio patch 2.
>
> Changes done from v7 are only in patch(2 & 3) and not
> affecting existing reviews. Request to please review.
> ----
>
> This patch series has implementation for "virtio pmem".
> "virtio pmem" is fake persistent memory(nvdimm) in guest
> which allows to bypass the guest page cache. This also
> implements a VIRTIO based asynchronous flush mechanism.
>
> Sharing guest kernel driver in this patchset with the
> changes suggested in v4. Tested with Qemu side device
> emulation [6] for virtio-pmem. Documented the impact of
> possible page cache side channel attacks with suggested
> countermeasures.
>
> Details of project idea for 'virtio pmem' flushing interface
> is shared [3] & [4].
>
> Implementation is divided into two parts:
> New virtio pmem guest driver and qemu code changes for new
> virtio pmem paravirtualized device.
>
> 1. Guest virtio-pmem kernel driver
> ---------------------------------
> - Reads persistent memory range from paravirt device and
> registers with 'nvdimm_bus'.
> - 'nvdimm/pmem' driver uses this information to allocate
> persistent memory region and setup filesystem operations
> to the allocated memory.
> - virtio pmem driver implements asynchronous flushing
> interface to flush from guest to host.
>
> 2. Qemu virtio-pmem device
> ---------------------------------
> - Creates virtio pmem device and exposes a memory range to
> KVM guest.
> - At host side this is file backed memory which acts as
> persistent memory.
> - Qemu side flush uses aio thread pool API's and virtio
> for asynchronous guest multi request handling.
>
> David Hildenbrand CCed also posted a modified version[7] of
> qemu virtio-pmem code based on updated Qemu memory device API.
>
> Virtio-pmem security implications and countermeasures:
> -----------------------------------------------------
>
> In previous posting of kernel driver, there was discussion [9]
> on possible implications of page cache side channel attacks with
> virtio pmem. After thorough analysis of details of known side
> channel attacks, below are the suggestions:
>
> - Depends entirely on how host backing image file is mapped
> into guest address space.
>
> - virtio-pmem device emulation, by default shared mapping is used
> to map host backing file. It is recommended to use separate
> backing file at host side for every guest. This will prevent
> any possibility of executing common code from multiple guests
> and any chance of inferring guest local data based based on
> execution time.
>
> - If backing file is required to be shared among multiple guests
> it is recommended to don't support host page cache eviction
> commands from the guest driver. This will avoid any possibility
> of inferring guest local data or host data from another guest.
>
> - Proposed device specification [8] for virtio-pmem device with
> details of possible security implications and suggested
> countermeasures for device emulation.
>
> Virtio-pmem errors handling:
> ----------------------------------------
> Checked behaviour of virtio-pmem for below types of errors
> Need suggestions on expected behaviour for handling these errors?
>
> - Hardware Errors: Uncorrectable recoverable Errors:
> a] virtio-pmem:
> - As per current logic if error page belongs to Qemu process,
> host MCE handler isolates(hwpoison) that page and send SIGBUS.
> Qemu SIGBUS handler injects exception to KVM guest.
> - KVM guest then isolates the page and send SIGBUS to guest
> userspace process which has mapped the page.
>
> b] Existing implementation for ACPI pmem driver:
> - Handles such errors with MCE notifier and creates a list
> of bad blocks. Read/direct access DAX operation return EIO
> if accessed memory page fall in bad block list.
> - It also starts backgound scrubbing.
> - Similar functionality can be reused in virtio-pmem with MCE
> notifier but without scrubbing(no ACPI/ARS)? Need inputs to
> confirm if this behaviour is ok or needs any change?
>
> Changes from PATCH v7: [1]
> - Corrected pending request queue logic (patch 2) - Jakub Staroń
> - Used unsigned long flags for passing DAXDEV_F_SYNC (patch 3) - Dan
> - Fixed typo => vma 'flag' to 'vm_flag' (patch 4)
> - Added rob in patch 6 & patch 2
>
> Changes from PATCH v6: [1]
> - Corrected comment format in patch 5 & patch 6. [Dave]
> - Changed variable declaration indentation in patch 6 [Darrick]
> - Add Reviewed-by tag by 'Jan Kara' in patch 4 & patch 5
>
> Changes from PATCH v5: [2]
> Changes suggested in by - [Cornelia, Yuval]
> - Remove assignment chaining in virtio driver
> - Better error message and remove not required free
> - Check nd_region before use
>
> Changes suggested by - [Jan Kara]
> - dax_synchronous() for !CONFIG_DAX
> - Correct 'daxdev_mapping_supported' comment and non-dax implementation
>
> Changes suggested by - [Dan Williams]
> - Pass meaningful flag 'DAXDEV_F_SYNC' to alloc_dax
> - Gate nvdimm_flush instead of additional async parameter
> - Move block chaining logic to flush callback than common nvdimm_flush
> - Use NULL flush callback for generic flush for better readability [Dan, Jan]
>
> - Use virtio device id 27 from 25(already used) - [MST]
>
> Changes from PATCH v4:
> - Factor out MAP_SYNC supported functionality to a common helper
> [Dave, Darrick, Jan]
> - Comment, indentation and virtqueue_kick failure handle - Yuval Shaia
>
> Changes from PATCH v3:
> - Use generic dax_synchronous() helper to check for DAXDEV_SYNC
> flag - [Dan, Darrick, Jan]
> - Add 'is_nvdimm_async' function
> - Document page cache side channel attacks implications &
> countermeasures - [Dave Chinner, Michael]
>
> Changes from PATCH v2:
> - Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
> - Use name 'virtio pmem' in place of 'fake dax'
>
> Changes from PATCH v1:
> - 0-day build test for build dependency on libnvdimm
>
> Changes suggested by - [Dan Williams]
> - Split the driver into two parts virtio & pmem
> - Move queuing of async block request to block layer
> - Add "sync" parameter in nvdimm_flush function
> - Use indirect call for nvdimm_flush
> - Don’t move declarations to common global header e.g nd.h
> - nvdimm_flush() return 0 or -EIO if it fails
> - Teach nsio_rw_bytes() that the flush can fail
> - Rename nvdimm_flush() to generic_nvdimm_flush()
> - Use 'nd_region->provider_data' for long dereferencing
> - Remove virtio_pmem_freeze/restore functions
> - Remove BSD license text with SPDX license text
>
> - Add might_sleep() in virtio_pmem_flush - [Luiz]
> - Make spin_lock_irqsave() narrow
>
> Pankaj Gupta (6):
> libnvdimm: nd_region flush callback support
> virtio-pmem: Add virtio-pmem guest driver
> libnvdimm: add nd_region buffered dax_dev flag
> dax: check synchronous mapping is supported
> ext4: disable map_sync for virtio pmem
> xfs: disable map_sync for virtio pmem
>
> [1] https://lkml.org/lkml/2019/4/26/36
> [2] https://lkml.org/lkml/2019/4/23/1092
> [3] https://www.spinics.net/lists/kvm/msg149761.html
> [4] https://www.spinics.net/lists/kvm/msg153095.html
> [5] https://lkml.org/lkml/2018/8/31/413
> [6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2
> [7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
> [8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
> [9] https://lkml.org/lkml/2019/1/9/1191
>
> drivers/acpi/nfit/core.c | 4 -
> drivers/dax/bus.c | 2
> drivers/dax/super.c | 13 +++
> drivers/md/dm.c | 3
> drivers/nvdimm/Makefile | 1
> drivers/nvdimm/claim.c | 6 +
> drivers/nvdimm/nd.h | 1
> drivers/nvdimm/nd_virtio.c | 129 +++++++++++++++++++++++++++++++++++++++
> drivers/nvdimm/pmem.c | 18 +++--
> drivers/nvdimm/region_devs.c | 33 +++++++++
> drivers/nvdimm/virtio_pmem.c | 117 +++++++++++++++++++++++++++++++++++
> drivers/virtio/Kconfig | 10 +++
> fs/ext4/file.c | 10 +--
> fs/xfs/xfs_file.c | 9 +-
> include/linux/dax.h | 25 ++++++-
> include/linux/libnvdimm.h | 9 ++
> include/linux/virtio_pmem.h | 60 ++++++++++++++++++
> include/uapi/linux/virtio_ids.h | 1
> include/uapi/linux/virtio_pmem.h | 10 +++
> 19 files changed, 436 insertions(+), 25 deletions(-)

2019-05-10 20:15:37

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag

On Fri, May 10, 2019 at 8:53 AM Pankaj Gupta <[email protected]> wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This later
> is used to disable MAP_SYNC functionality for
> ext4 & xfs filesystem for devices don't support
> synchronous flush.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/dax/bus.c | 2 +-
> drivers/dax/super.c | 13 ++++++++++++-
> drivers/md/dm.c | 3 ++-
> drivers/nvdimm/pmem.c | 5 ++++-
> drivers/nvdimm/region_devs.c | 7 +++++++
> include/linux/dax.h | 8 ++++++--
> include/linux/libnvdimm.h | 1 +
> 7 files changed, 33 insertions(+), 6 deletions(-)
[..]
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 043f0761e4a0..ee007b75d9fd 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> sprintf(md->disk->disk_name, "dm-%d", minor);
>
> if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> - dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> + dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> + DAXDEV_F_SYNC);

Apologies for not realizing this until now, but this is broken.
Imaging a device-mapper configuration composed of both 'async'
virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
across all members. I would change this argument to '0' and then
arrange for it to be set at dm_table_supports_dax() time after
validating that all components support synchronous dax.

2019-05-10 23:34:14

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v8 0/6] virtio pmem driver


> >
> > Hi Michael & Dan,
> >
> > Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
> > We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
> > your ack on nvdimm patches(1 & 3) & virtio patch 2.
>
> I was planning to merge these via the nvdimm tree, not ack them. Did
> you have another maintainer lined up to take these patches?

Sorry! for not being clear on this. I wanted to say same.

Proposed the patch series to be merged via nvdimm tree as kindly agreed
by you. We only need an ack on virtio patch 2 from Micahel.

Thank you for all your help.

Best regards,
Pankaj Gupta

>

2019-05-11 00:47:43

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v8 3/6] libnvdimm: add dax_dev sync flag



> >
> > This patch adds 'DAXDEV_SYNC' flag which is set
> > for nd_region doing synchronous flush. This later
> > is used to disable MAP_SYNC functionality for
> > ext4 & xfs filesystem for devices don't support
> > synchronous flush.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/dax/bus.c | 2 +-
> > drivers/dax/super.c | 13 ++++++++++++-
> > drivers/md/dm.c | 3 ++-
> > drivers/nvdimm/pmem.c | 5 ++++-
> > drivers/nvdimm/region_devs.c | 7 +++++++
> > include/linux/dax.h | 8 ++++++--
> > include/linux/libnvdimm.h | 1 +
> > 7 files changed, 33 insertions(+), 6 deletions(-)
> [..]
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 043f0761e4a0..ee007b75d9fd 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1969,7 +1969,8 @@ static struct mapped_device *alloc_dev(int minor)
> > sprintf(md->disk->disk_name, "dm-%d", minor);
> >
> > if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
> > - dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
> > + dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops,
> > + DAXDEV_F_SYNC);
>
> Apologies for not realizing this until now, but this is broken.
> Imaging a device-mapper configuration composed of both 'async'
> virtio-pmem and 'sync' pmem. The 'sync' flag needs to be unified
> across all members. I would change this argument to '0' and then
> arrange for it to be set at dm_table_supports_dax() time after
> validating that all components support synchronous dax.

o.k. Need to set 'DAXDEV_F_SYNC' flag after verifying all the target
components support synchronous DAX.

Just a question, If device mapper configuration have composed of both
virtio-pmem or pmem devices, we want to configure device mapper for async flush?

Thank you,
Pankaj
>

2019-05-12 16:52:35

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v8 2/6] virtio-pmem: Add virtio pmem driver

On Fri, May 10, 2019 at 09:21:58PM +0530, Pankaj Gupta wrote:
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> Reviewed-by: Yuval Shaia <[email protected]>

Acked-by: Michael S. Tsirkin <[email protected]>

> ---
> drivers/nvdimm/Makefile | 1 +
> drivers/nvdimm/nd_virtio.c | 129 +++++++++++++++++++++++++++++++
> drivers/nvdimm/virtio_pmem.c | 117 ++++++++++++++++++++++++++++
> drivers/virtio/Kconfig | 10 +++
> include/linux/virtio_pmem.h | 60 ++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_pmem.h | 10 +++
> 7 files changed, 328 insertions(+)
> create mode 100644 drivers/nvdimm/nd_virtio.c
> create mode 100644 drivers/nvdimm/virtio_pmem.c
> create mode 100644 include/linux/virtio_pmem.h
> create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> index 6f2a088afad6..cefe233e0b52 100644
> --- a/drivers/nvdimm/Makefile
> +++ b/drivers/nvdimm/Makefile
> @@ -5,6 +5,7 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
> obj-$(CONFIG_ND_BLK) += nd_blk.o
> obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
> obj-$(CONFIG_OF_PMEM) += of_pmem.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o
>
> nd_pmem-y := pmem.o
>
> diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> new file mode 100644
> index 000000000000..ed7ddcc5a62c
> --- /dev/null
> +++ b/drivers/nvdimm/nd_virtio.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio_pmem.h>
> +#include "nd.h"
> +
> + /* The interrupt handler */
> +void host_ack(struct virtqueue *vq)
> +{
> + unsigned int len;
> + unsigned long flags;
> + struct virtio_pmem_request *req, *req_buf;
> + struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> + req->done = true;
> + wake_up(&req->host_acked);
> +
> + if (!list_empty(&vpmem->req_list)) {
> + req_buf = list_first_entry(&vpmem->req_list,
> + struct virtio_pmem_request, list);
> + req_buf->wq_buf_avail = true;
> + wake_up(&req_buf->wq_buf);
> + list_del(&req_buf->list);
> + }
> + }
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(host_ack);
> +
> + /* The request submission function */
> +int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> + int err, err1;
> + unsigned long flags;
> + struct scatterlist *sgs[2], sg, ret;
> + struct virtio_device *vdev = nd_region->provider_data;
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct virtio_pmem_request *req;
> +
> + might_sleep();
> + req = kmalloc(sizeof(*req), GFP_KERNEL);
> + if (!req)
> + return -ENOMEM;
> +
> + req->done = false;
> + strcpy(req->name, "FLUSH");
> + init_waitqueue_head(&req->host_acked);
> + init_waitqueue_head(&req->wq_buf);
> + INIT_LIST_HEAD(&req->list);
> + sg_init_one(&sg, req->name, strlen(req->name));
> + sgs[0] = &sg;
> + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> + sgs[1] = &ret;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + /*
> + * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
> + * queue does not have free descriptor. We add the request
> + * to req_list and wait for host_ack to wake us up when free
> + * slots are available.
> + */
> + while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
> + GFP_ATOMIC)) == -ENOSPC) {
> +
> + dev_err(&vdev->dev, "failed to send command to virtio pmem"\
> + "device, no free slots in the virtqueue\n");
> + req->wq_buf_avail = false;
> + list_add_tail(&req->list, &vpmem->req_list);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->wq_buf, req->wq_buf_avail);
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + }
> + err1 = virtqueue_kick(vpmem->req_vq);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /*
> + * virtqueue_add_sgs failed with error different than -ENOSPC, we can't
> + * do anything about that.
> + */
> + if (err || !err1) {
> + dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
> + err = -EIO;
> + goto ret;
> + }
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->host_acked, req->done);
> + err = req->ret;
> +ret:
> + kfree(req);
> + return err;
> +};
> +
> +/* The asynchronous flush callback function */
> +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> +{
> + int rc = 0;
> +
> + /* Create child bio for asynchronous flush and chain with
> + * parent bio. Otherwise directly call nd_region flush.
> + */
> + if (bio && bio->bi_iter.bi_sector != -1) {
> + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> +
> + if (!child)
> + return -ENOMEM;
> + bio_copy_dev(child, bio);
> + child->bi_opf = REQ_PREFLUSH;
> + child->bi_iter.bi_sector = -1;
> + bio_chain(child, bio);
> + submit_bio(child);
> + } else {
> + if (virtio_pmem_flush(nd_region))
> + rc = -EIO;
> + }
> +
> + return rc;
> +};
> +EXPORT_SYMBOL_GPL(async_pmem_flush);
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> new file mode 100644
> index 000000000000..cfc6381c4e5d
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -0,0 +1,117 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and registers the virtual pmem device
> + * with libnvdimm core.
> + */
> +#include <linux/virtio_pmem.h>
> +#include "nd.h"
> +
> +static struct virtio_device_id id_table[] = {
> + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> + { 0 },
> +};
> +
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> + /* single vq */
> + vpmem->req_vq = virtio_find_single_vq(vpmem->vdev,
> + host_ack, "flush_queue");
> + if (IS_ERR(vpmem->req_vq))
> + return PTR_ERR(vpmem->req_vq);
> +
> + spin_lock_init(&vpmem->pmem_lock);
> + INIT_LIST_HEAD(&vpmem->req_list);
> +
> + return 0;
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nd_region_desc ndr_desc = {};
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config access disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem), GFP_KERNEL);
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + vdev->priv = vpmem;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!vpmem->nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = async_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc);
> +
> + if (!nd_region)
> + goto out_nd;
> + nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent);
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(vpmem->nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
> + vdev->config->del_vqs(vdev);
> + vdev->config->reset(vdev);
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name = KBUILD_MODNAME,
> + .driver.owner = THIS_MODULE,
> + .id_table = id_table,
> + .probe = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..9f634a2ed638 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
>
> If unsure, say Y.
>
> +config VIRTIO_PMEM
> + tristate "Support for virtio pmem driver"
> + depends on VIRTIO
> + depends on LIBNVDIMM
> + help
> + This driver provides support for virtio based flushing interface
> + for persistent memory range.
> +
> + If unsure, say M.
> +
> config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..ab1da877575d
> --- /dev/null
> +++ b/include/linux/virtio_pmem.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * virtio_pmem.h: virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + **/
> +
> +#ifndef _LINUX_VIRTIO_PMEM_H
> +#define _LINUX_VIRTIO_PMEM_H
> +
> +#include <linux/virtio_ids.h>
> +#include <linux/module.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/spinlock.h>
> +
> +struct virtio_pmem_request {
> + /* Host return status corresponding to flush request */
> + int ret;
> +
> + /* command name*/
> + char name[16];
> +
> + /* Wait queue to process deferred work after ack from host */
> + wait_queue_head_t host_acked;
> + bool done;
> +
> + /* Wait queue to process deferred work after virt queue buffer avail */
> + wait_queue_head_t wq_buf;
> + bool wq_buf_avail;
> + struct list_head list;
> +};
> +
> +struct virtio_pmem {
> + struct virtio_device *vdev;
> +
> + /* Virtio pmem request queue */
> + struct virtqueue *req_vq;
> +
> + /* nvdimm bus registers virtio pmem device */
> + struct nvdimm_bus *nvdimm_bus;
> + struct nvdimm_bus_descriptor nd_desc;
> +
> + /* List to store deferred work if virtqueue is full */
> + struct list_head req_list;
> +
> + /* Synchronize virtqueue data */
> + spinlock_t pmem_lock;
> +
> + /* Memory region information */
> + uint64_t start;
> + uint64_t size;
> +};
> +
> +void host_ack(struct virtqueue *vq);
> +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);
> +#endif
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2d4f4d..32b2f94d1f58 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 27 /* virtio pmem */
>
> #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..fa3f7d52717a
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> + __le64 start;
> + __le64 size;
> +};
> +#endif
> --
> 2.20.1

2019-05-12 16:54:08

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v8 0/6] virtio pmem driver

On Fri, May 10, 2019 at 07:33:03PM -0400, Pankaj Gupta wrote:
>
> > >
> > > Hi Michael & Dan,
> > >
> > > Please review/ack the patch series from LIBNVDIMM & VIRTIO side.
> > > We have ack on ext4, xfs patches(4, 5 & 6) patch 2. Still need
> > > your ack on nvdimm patches(1 & 3) & virtio patch 2.
> >
> > I was planning to merge these via the nvdimm tree, not ack them. Did
> > you have another maintainer lined up to take these patches?
>
> Sorry! for not being clear on this. I wanted to say same.
>
> Proposed the patch series to be merged via nvdimm tree as kindly agreed
> by you. We only need an ack on virtio patch 2 from Micahel.
>
> Thank you for all your help.
>
> Best regards,
> Pankaj Gupta

Fine by me.

> >

2019-05-13 05:08:14

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v8 2/6] virtio-pmem: Add virtio pmem driver


> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > Reviewed-by: Yuval Shaia <[email protected]>
>
> Acked-by: Michael S. Tsirkin <[email protected]>

Thank you, Michael.

Best regards,
Pankaj

>
> > ---
> > drivers/nvdimm/Makefile | 1 +
> > drivers/nvdimm/nd_virtio.c | 129 +++++++++++++++++++++++++++++++
> > drivers/nvdimm/virtio_pmem.c | 117 ++++++++++++++++++++++++++++
> > drivers/virtio/Kconfig | 10 +++
> > include/linux/virtio_pmem.h | 60 ++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 10 +++
> > 7 files changed, 328 insertions(+)
> > create mode 100644 drivers/nvdimm/nd_virtio.c
> > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > create mode 100644 include/linux/virtio_pmem.h
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> > index 6f2a088afad6..cefe233e0b52 100644
> > --- a/drivers/nvdimm/Makefile
> > +++ b/drivers/nvdimm/Makefile
> > @@ -5,6 +5,7 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
> > obj-$(CONFIG_ND_BLK) += nd_blk.o
> > obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
> > obj-$(CONFIG_OF_PMEM) += of_pmem.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o
> >
> > nd_pmem-y := pmem.o
> >
> > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > new file mode 100644
> > index 000000000000..ed7ddcc5a62c
> > --- /dev/null
> > +++ b/drivers/nvdimm/nd_virtio.c
> > @@ -0,0 +1,129 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > + /* The interrupt handler */
> > +void host_ack(struct virtqueue *vq)
> > +{
> > + unsigned int len;
> > + unsigned long flags;
> > + struct virtio_pmem_request *req, *req_buf;
> > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > + req->done = true;
> > + wake_up(&req->host_acked);
> > +
> > + if (!list_empty(&vpmem->req_list)) {
> > + req_buf = list_first_entry(&vpmem->req_list,
> > + struct virtio_pmem_request, list);
> > + req_buf->wq_buf_avail = true;
> > + wake_up(&req_buf->wq_buf);
> > + list_del(&req_buf->list);
> > + }
> > + }
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(host_ack);
> > +
> > + /* The request submission function */
> > +int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > + int err, err1;
> > + unsigned long flags;
> > + struct scatterlist *sgs[2], sg, ret;
> > + struct virtio_device *vdev = nd_region->provider_data;
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct virtio_pmem_request *req;
> > +
> > + might_sleep();
> > + req = kmalloc(sizeof(*req), GFP_KERNEL);
> > + if (!req)
> > + return -ENOMEM;
> > +
> > + req->done = false;
> > + strcpy(req->name, "FLUSH");
> > + init_waitqueue_head(&req->host_acked);
> > + init_waitqueue_head(&req->wq_buf);
> > + INIT_LIST_HEAD(&req->list);
> > + sg_init_one(&sg, req->name, strlen(req->name));
> > + sgs[0] = &sg;
> > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > + sgs[1] = &ret;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + /*
> > + * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
> > + * queue does not have free descriptor. We add the request
> > + * to req_list and wait for host_ack to wake us up when free
> > + * slots are available.
> > + */
> > + while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
> > + GFP_ATOMIC)) == -ENOSPC) {
> > +
> > + dev_err(&vdev->dev, "failed to send command to virtio pmem"\
> > + "device, no free slots in the virtqueue\n");
> > + req->wq_buf_avail = false;
> > + list_add_tail(&req->list, &vpmem->req_list);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->wq_buf, req->wq_buf_avail);
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + }
> > + err1 = virtqueue_kick(vpmem->req_vq);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /*
> > + * virtqueue_add_sgs failed with error different than -ENOSPC, we can't
> > + * do anything about that.
> > + */
> > + if (err || !err1) {
> > + dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
> > + err = -EIO;
> > + goto ret;
> > + }
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->host_acked, req->done);
> > + err = req->ret;
> > +ret:
> > + kfree(req);
> > + return err;
> > +};
> > +
> > +/* The asynchronous flush callback function */
> > +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > +{
> > + int rc = 0;
> > +
> > + /* Create child bio for asynchronous flush and chain with
> > + * parent bio. Otherwise directly call nd_region flush.
> > + */
> > + if (bio && bio->bi_iter.bi_sector != -1) {
> > + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > +
> > + if (!child)
> > + return -ENOMEM;
> > + bio_copy_dev(child, bio);
> > + child->bi_opf = REQ_PREFLUSH;
> > + child->bi_iter.bi_sector = -1;
> > + bio_chain(child, bio);
> > + submit_bio(child);
> > + } else {
> > + if (virtio_pmem_flush(nd_region))
> > + rc = -EIO;
> > + }
> > +
> > + return rc;
> > +};
> > +EXPORT_SYMBOL_GPL(async_pmem_flush);
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..cfc6381c4e5d
> > --- /dev/null
> > +++ b/drivers/nvdimm/virtio_pmem.c
> > @@ -0,0 +1,117 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > + /* single vq */
> > + vpmem->req_vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vpmem->req_vq))
> > + return PTR_ERR(vpmem->req_vq);
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nd_region_desc ndr_desc = {};
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config access disabled\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem), GFP_KERNEL);
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + vdev->priv = vpmem;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
> > + if (!vpmem->nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = async_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc);
> > +
> > + if (!nd_region)
> > + goto out_nd;
> > + nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent);
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(vpmem->nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
> > + vdev->config->del_vqs(vdev);
> > + vdev->config->reset(vdev);
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 35897649c24f..9f634a2ed638 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> >
> > If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > + tristate "Support for virtio pmem driver"
> > + depends on VIRTIO
> > + depends on LIBNVDIMM
> > + help
> > + This driver provides support for virtio based flushing interface
> > + for persistent memory range.
> > +
> > + If unsure, say M.
> > +
> > config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..ab1da877575d
> > --- /dev/null
> > +++ b/include/linux/virtio_pmem.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * virtio_pmem.h: virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + **/
> > +
> > +#ifndef _LINUX_VIRTIO_PMEM_H
> > +#define _LINUX_VIRTIO_PMEM_H
> > +
> > +#include <linux/virtio_ids.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/spinlock.h>
> > +
> > +struct virtio_pmem_request {
> > + /* Host return status corresponding to flush request */
> > + int ret;
> > +
> > + /* command name*/
> > + char name[16];
> > +
> > + /* Wait queue to process deferred work after ack from host */
> > + wait_queue_head_t host_acked;
> > + bool done;
> > +
> > + /* Wait queue to process deferred work after virt queue buffer avail */
> > + wait_queue_head_t wq_buf;
> > + bool wq_buf_avail;
> > + struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > + struct virtio_device *vdev;
> > +
> > + /* Virtio pmem request queue */
> > + struct virtqueue *req_vq;
> > +
> > + /* nvdimm bus registers virtio pmem device */
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nvdimm_bus_descriptor nd_desc;
> > +
> > + /* List to store deferred work if virtqueue is full */
> > + struct list_head req_list;
> > +
> > + /* Synchronize virtqueue data */
> > + spinlock_t pmem_lock;
> > +
> > + /* Memory region information */
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +
> > +void host_ack(struct virtqueue *vq);
> > +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio);
> > +#endif
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2d4f4d..32b2f94d1f58 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 27 /* virtio pmem */
> >
> > #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..fa3f7d52717a
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > + __le64 start;
> > + __le64 size;
> > +};
> > +#endif
> > --
> > 2.20.1
>
>

2019-05-14 09:28:11

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v8 2/6] virtio-pmem: Add virtio pmem driver


Hi David,

Thank you for the review.

> On 10.05.19 17:51, Pankaj Gupta wrote:
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > Reviewed-by: Yuval Shaia <[email protected]>
> > ---
> > drivers/nvdimm/Makefile | 1 +
> > drivers/nvdimm/nd_virtio.c | 129 +++++++++++++++++++++++++++++++
> > drivers/nvdimm/virtio_pmem.c | 117 ++++++++++++++++++++++++++++
> > drivers/virtio/Kconfig | 10 +++
> > include/linux/virtio_pmem.h | 60 ++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 10 +++
> > 7 files changed, 328 insertions(+)
> > create mode 100644 drivers/nvdimm/nd_virtio.c
> > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > create mode 100644 include/linux/virtio_pmem.h
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> > index 6f2a088afad6..cefe233e0b52 100644
> > --- a/drivers/nvdimm/Makefile
> > +++ b/drivers/nvdimm/Makefile
> > @@ -5,6 +5,7 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o
> > obj-$(CONFIG_ND_BLK) += nd_blk.o
> > obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
> > obj-$(CONFIG_OF_PMEM) += of_pmem.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o
> >
> > nd_pmem-y := pmem.o
> >
> > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > new file mode 100644
> > index 000000000000..ed7ddcc5a62c
> > --- /dev/null
> > +++ b/drivers/nvdimm/nd_virtio.c
> > @@ -0,0 +1,129 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > + /* The interrupt handler */
> > +void host_ack(struct virtqueue *vq)
> > +{
> > + unsigned int len;
> > + unsigned long flags;
> > + struct virtio_pmem_request *req, *req_buf;
> > + struct virtio_pmem *vpmem = vq->vdev->priv;
>
> Nit: use reverse Christmas tree layout :)

o.k

>
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > + req->done = true;
> > + wake_up(&req->host_acked);
> > +
> > + if (!list_empty(&vpmem->req_list)) {
> > + req_buf = list_first_entry(&vpmem->req_list,
> > + struct virtio_pmem_request, list);
> > + req_buf->wq_buf_avail = true;
> > + wake_up(&req_buf->wq_buf);
> > + list_del(&req_buf->list);
> > + }
> > + }
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(host_ack);
> > +
> > + /* The request submission function */
> > +int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > + int err, err1;
> > + unsigned long flags;
> > + struct scatterlist *sgs[2], sg, ret;
> > + struct virtio_device *vdev = nd_region->provider_data;
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct virtio_pmem_request *req;
>
> Nit: use reverse Christmas tree layout :)

o.k

>
> > +
> > + might_sleep();
> > + req = kmalloc(sizeof(*req), GFP_KERNEL);
> > + if (!req)
> > + return -ENOMEM;
> > +
> > + req->done = false;
> > + strcpy(req->name, "FLUSH");
> > + init_waitqueue_head(&req->host_acked);
> > + init_waitqueue_head(&req->wq_buf);
> > + INIT_LIST_HEAD(&req->list);
> > + sg_init_one(&sg, req->name, strlen(req->name));
> > + sgs[0] = &sg;
> > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > + sgs[1] = &ret;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + /*
> > + * If virtqueue_add_sgs returns -ENOSPC then req_vq virtual
> > + * queue does not have free descriptor. We add the request
> > + * to req_list and wait for host_ack to wake us up when free
> > + * slots are available.
> > + */
> > + while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req,
> > + GFP_ATOMIC)) == -ENOSPC) {
> > +
> > + dev_err(&vdev->dev, "failed to send command to virtio pmem"\
> > + "device, no free slots in the virtqueue\n");
> > + req->wq_buf_avail = false;
> > + list_add_tail(&req->list, &vpmem->req_list);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
>
> "A host repsonse results in "host_ack" getting called" ... ?

o.k

>
> > + wait_event(req->wq_buf, req->wq_buf_avail);
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + }
> > + err1 = virtqueue_kick(vpmem->req_vq);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /*
> > + * virtqueue_add_sgs failed with error different than -ENOSPC, we can't
> > + * do anything about that.
> > + */
> > + if (err || !err1) {
> > + dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
> > + err = -EIO;
> > + goto ret;
>
> Avoid the goto. Just move the following statements into an "else" case.

o.k

>
> > + }
> > +
> > + /* When host has read buffer, this completes via host_ack */
>
> "A host repsonse results in "host_ack" getting called" ... ?
>
> > + wait_event(req->host_acked, req->done);
> > + err = req->ret;
> > +ret:
> > + kfree(req);
> > + return err;
> > +};
> > +
> > +/* The asynchronous flush callback function */
> > +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> > +{
> > + int rc = 0;
> > +
> > + /* Create child bio for asynchronous flush and chain with
> > + * parent bio. Otherwise directly call nd_region flush.
> > + */
> > + if (bio && bio->bi_iter.bi_sector != -1) {
> > + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > +
> > + if (!child)
> > + return -ENOMEM;
> > + bio_copy_dev(child, bio);
> > + child->bi_opf = REQ_PREFLUSH;
> > + child->bi_iter.bi_sector = -1;
> > + bio_chain(child, bio);
> > + submit_bio(child);
>
> return 0;
>
> Then, drop the "else" case and "int rc" and do directly
>
> if (virtio_pmem_flush(nd_region))
> return -EIO;

and another 'return 0' here :)

I don't like return from multiple places instead I prefer
single exit point from function.

>
> > + } else {
> > +
> > + rc = -EIO;
> > + }
> > +
> > + return rc;
> > +};
> > +EXPORT_SYMBOL_GPL(async_pmem_flush);
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..cfc6381c4e5d
> > --- /dev/null
> > +++ b/drivers/nvdimm/virtio_pmem.c
> > @@ -0,0 +1,117 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > + /* single vq */
> > + vpmem->req_vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
>
> Nit: Wrong indentation of parameters.

o.k

>
> > + if (IS_ERR(vpmem->req_vq))
> > + return PTR_ERR(vpmem->req_vq);
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
>
> I would initialize the locks in the virtio_pmem_probe() directly,
> earlier (directly after allocating vpmem).

Since the lock is to protect the vq so logically I put it in 'init_vq'
which is eventually called by probe.

>
> > +
> > + return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nd_region_desc ndr_desc = {};
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
>
> Nit: use reverse Christmas tree layout :)

Done.

>
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config access disabled\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem), GFP_KERNEL);
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + vdev->priv = vpmem;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
> > + if (!vpmem->nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = async_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc);
> > +
>
> I'd drop this empty line.

hmm.

>
> > + if (!nd_region)
> > + goto out_nd;
> > + nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent);
>
> Does it make sense to move some parts into separate functions for
> readability? E.g., virtio_pmem_init_nvdimm_bus()

I think this function only has initialization and don't have any complex logic.

>
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
>
> I'd always initialize "err" along with the goto statement for
> readability, especially because ...
>
> > + nvdimm_bus_unregister(vpmem->nvdimm_bus);
> > +out_vq:
>
> ... you don't initialize err in this case. Err is here 0 if I am not wrong.

Right. Changed.

>
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
>
> Should we try to give more meaning full messages? I can think of
> scenarios like "memory region is not properly aligned" or "out of memory".

o.k

>
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
>
> Is the nd_region implicitly cleaned up?

yes. it will be.

>
> You are not freeing "vdev->priv".

vdev->priv is vpmem which is allocated using devm API.
>
> > + nvdimm_bus_unregister(nvdimm_bus);
> > + vdev->config->del_vqs(vdev);
> > + vdev->config->reset(vdev);
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 35897649c24f..9f634a2ed638 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> >
> > If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > + tristate "Support for virtio pmem driver"
> > + depends on VIRTIO
> > + depends on LIBNVDIMM
> > + help
> > + This driver provides support for virtio based flushing interface
> > + for persistent memory range.
>
> "This driver provides access to virtio-pmem devices, storage devices
> that are mapped into the physical address space - similar to NVDIMMs -
> with a virtio-based flushing interface." ... ?

looks better. Incorporated.

Thanks,
Pankaj
>
> > +
> > + If unsure, say M.
> > +
>
>
> --
>
> Thanks,
>
> David / dhildenb
>

2019-05-14 09:47:51

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v8 2/6] virtio-pmem: Add virtio pmem driver


>
> >>
> >>> + }
> >>> +
> >>> + /* When host has read buffer, this completes via host_ack */
> >>
> >> "A host repsonse results in "host_ack" getting called" ... ?
> >>
> >>> + wait_event(req->host_acked, req->done);
> >>> + err = req->ret;
> >>> +ret:
> >>> + kfree(req);
> >>> + return err;
> >>> +};
> >>> +
> >>> +/* The asynchronous flush callback function */
> >>> +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> >>> +{
> >>> + int rc = 0;
> >>> +
> >>> + /* Create child bio for asynchronous flush and chain with
> >>> + * parent bio. Otherwise directly call nd_region flush.
> >>> + */
> >>> + if (bio && bio->bi_iter.bi_sector != -1) {
> >>> + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> >>> +
> >>> + if (!child)
> >>> + return -ENOMEM;
> >>> + bio_copy_dev(child, bio);
> >>> + child->bi_opf = REQ_PREFLUSH;
> >>> + child->bi_iter.bi_sector = -1;
> >>> + bio_chain(child, bio);
> >>> + submit_bio(child);
> >>
> >> return 0;
> >>
> >> Then, drop the "else" case and "int rc" and do directly
> >>
> >> if (virtio_pmem_flush(nd_region))
> >> return -EIO;
> >
> > and another 'return 0' here :)
> >
> > I don't like return from multiple places instead I prefer
> > single exit point from function.
>
> Makes this function more complicated than necessary. I agree when there
> are locks involved.

o.k. I will change as you suggest :)

>
> >
> >>
> >>> +
> >>> + return 0;
> >>> +};
> >>> +
> >>> +static int virtio_pmem_probe(struct virtio_device *vdev)
> >>> +{
> >>> + int err = 0;
> >>> + struct resource res;
> >>> + struct virtio_pmem *vpmem;
> >>> + struct nd_region_desc ndr_desc = {};
> >>> + int nid = dev_to_node(&vdev->dev);
> >>> + struct nd_region *nd_region;
> >>
> >> Nit: use reverse Christmas tree layout :)
> >
> > Done.
> >
> >>
> >>> +
> >>> + if (!vdev->config->get) {
> >>> + dev_err(&vdev->dev, "%s failure: config access disabled\n",
> >>> + __func__);
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> + vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem), GFP_KERNEL);
> >>> + if (!vpmem) {
> >>> + err = -ENOMEM;
> >>> + goto out_err;
> >>> + }
> >>> +
> >>> + vpmem->vdev = vdev;
> >>> + vdev->priv = vpmem;
> >>> + err = init_vq(vpmem);
> >>> + if (err)
> >>> + goto out_err;
> >>> +
> >>> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> >>> + start, &vpmem->start);
> >>> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> >>> + size, &vpmem->size);
> >>> +
> >>> + res.start = vpmem->start;
> >>> + res.end = vpmem->start + vpmem->size-1;
> >>> + vpmem->nd_desc.provider_name = "virtio-pmem";
> >>> + vpmem->nd_desc.module = THIS_MODULE;
> >>> +
> >>> + vpmem->nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> >>> + &vpmem->nd_desc);
> >>> + if (!vpmem->nvdimm_bus)
> >>> + goto out_vq;
> >>> +
> >>> + dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> >>> +
> >>> + ndr_desc.res = &res;
> >>> + ndr_desc.numa_node = nid;
> >>> + ndr_desc.flush = async_pmem_flush;
> >>> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> >>> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> >>> + nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc);
> >>> +
> >>
> >> I'd drop this empty line.
> >
> > hmm.
> >
>
> The common pattern after allocating something, immediately check for it
> in the next line (like you do throughout this patch ;) )

Right. But rare times when I see space will beauty the code I tend to
add it. Maybe I should not :)

>
> ...
> >> You are not freeing "vdev->priv".
> >
> > vdev->priv is vpmem which is allocated using devm API.
>
> I'm confused. Looking at drivers/virtio/virtio_balloon.c:
>
> static void virtballoon_remove(struct virtio_device *vdev)
> {
> struct virtio_balloon *vb = vdev->priv;
>
> ...
>
> kfree(vb);
> }
>
> I think you should do the same here, vdev->priv is allocated in
> virtio_pmem_probe.
>
> But maybe I am missing something important here :)

Because virtio_balloon use "kzalloc" for allocation and needs to be freed.
But virtio pmem uses "devm_kzalloc" which takes care of automatically deleting
the device memory when associated device is detached.

Thanks,
Pankaj
>
> >>
> >>> + nvdimm_bus_unregister(nvdimm_bus);
> >>> + vdev->config->del_vqs(vdev);
> >>> + vdev->config->reset(vdev);
> >>> +}
> >>> +
>
> --
>
> Thanks,
>
> David / dhildenb
>

2019-05-14 09:59:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 2/6] virtio-pmem: Add virtio pmem driver

>>
>> I think you should do the same here, vdev->priv is allocated in
>> virtio_pmem_probe.
>>
>> But maybe I am missing something important here :)
>
> Because virtio_balloon use "kzalloc" for allocation and needs to be freed.
> But virtio pmem uses "devm_kzalloc" which takes care of automatically deleting
> the device memory when associated device is detached.

Hehe, thanks, that was the part that I was missing!

--

Thanks,

David / dhildenb