This patch series has implementation for "virtio pmem".
"virtio pmem" is fake persistent memory(nvdimm) in guest
which allows to bypass the guest page cache. This also
implements a VIRTIO based asynchronous flush mechanism.
Sharing guest kernel driver in this patchset with the
changes suggested in v4. Tested with Qemu side device
emulation [6] for virtio-pmem.
We have incorporated all the suggestions in V4. Documented
the impact of possible page cache side channel attacks with
suggested countermeasures.
Details of project idea for 'virtio pmem' flushing interface
is shared [3] & [4].
Implementation is divided into two parts:
New virtio pmem guest driver and qemu code changes for new
virtio pmem paravirtualized device.
1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- 'nvdimm/pmem' driver uses this information to allocate
persistent memory region and setup filesystem operations
to the allocated memory.
- virtio pmem driver implements asynchronous flushing
interface to flush from guest to host.
2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- At host side this is file backed memory which acts as
persistent memory.
- Qemu side flush uses aio thread pool API's and virtio
for asynchronous guest multi request handling.
David Hildenbrand CCed also posted a modified version[7] of
qemu virtio-pmem code based on updated Qemu memory device API.
Virtio-pmem security implications and countermeasures:
-----------------------------------------------------
In previous posting of kernel driver, there was discussion [9]
on possible implications of page cache side channel attacks with
virtio pmem. After thorough analysis of details of known side
channel attacks, below are the suggestions:
- Depends entirely on how host backing image file is mapped
into guest address space.
- virtio-pmem device emulation, by default shared mapping is used
to map host backing file. It is recommended to use separate
backing file at host side for every guest. This will prevent
any possibility of executing common code from multiple guests
and any chance of inferring guest local data based based on
execution time.
- If backing file is required to be shared among multiple guests
it is recommended to don't support host page cache eviction
commands from the guest driver. This will avoid any possibility
of inferring guest local data or host data from another guest.
- Proposed device specification [8] for virtio-pmem device with
details of possible security implications and suggested
countermeasures for device emulation.
Virtio-pmem errors handling:
----------------------------------------
Checked behaviour of virtio-pmem for below types of errors
Need suggestions on expected behaviour for handling these errors?
- Hardware Errors: Uncorrectable recoverable Errors:
a] virtio-pmem:
- As per current logic if error page belongs to Qemu process,
host MCE handler isolates(hwpoison) that page and send SIGBUS.
Qemu SIGBUS handler injects exception to KVM guest.
- KVM guest then isolates the page and send SIGBUS to guest
userspace process which has mapped the page.
b] Existing implementation for ACPI pmem driver:
- Handles such errors with MCE notifier and creates a list
of bad blocks. Read/direct access DAX operation return EIO
if accessed memory page fall in bad block list.
- It also starts backgound scrubbing.
- Similar functionality can be reused in virtio-pmem with MCE
notifier but without scrubbing(no ACPI/ARS)? Need inputs to
confirm if this behaviour is ok or needs any change?
Changes from PATCH v4: [1]
- Factor out MAP_SYNC supported functionality to a common helper
[Dave, Darrick, Jan]
- Comment, indentation and virtqueue_kick failure handle - Yuval Shaia
Changes from PATCH v3: [2]
- Use generic dax_synchronous() helper to check for DAXDEV_SYNC
flag - [Dan, Darrick, Jan]
- Add 'is_nvdimm_async' function
- Document page cache side channel attacks implications &
countermeasures - [Dave Chinner, Michael]
Changes from PATCH v2:
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'
Changes from PATCH v1:
- 0-day build test for build dependency on libnvdimm
Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Remove BSD license text with SPDX license text
- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow
Changes from RFC v3
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush- Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak, correct sizeof req- Stefan
- Move declaration to virtio_pmem.c
Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan
Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.
Pankaj Gupta (6):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
dax: check synchronous mapping is supported
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem
[1] https://lkml.org/lkml/2019/4/3/394
[2] https://lkml.org/lkml/2019/1/9/471
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2
[7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
[8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
[9] https://lkml.org/lkml/2019/1/9/1191
drivers/acpi/nfit/core.c | 4 -
drivers/dax/bus.c | 2
drivers/dax/super.c | 13 +++-
drivers/md/dm.c | 2
drivers/nvdimm/claim.c | 6 +
drivers/nvdimm/nd.h | 1
drivers/nvdimm/pmem.c | 17 +++--
drivers/nvdimm/region_devs.c | 45 +++++++++++++-
drivers/nvdimm/virtio_pmem.c | 88 +++++++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
drivers/virtio/Makefile | 1
drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++++++++++
fs/ext4/file.c | 11 +--
fs/xfs/xfs_file.c | 10 +--
include/linux/dax.h | 32 +++++++++-
include/linux/libnvdimm.h | 9 ++
include/linux/virtio_pmem.h | 60 ++++++++++++++++++
include/uapi/linux/virtio_ids.h | 1
include/uapi/linux/virtio_pmem.h | 10 +++
19 files changed, 419 insertions(+), 27 deletions(-)
This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.
This also handles asynchronous flush requests from the block layer
by creating a child bio and chaining it with parent bio.
Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/acpi/nfit/core.c | 4 ++--
drivers/nvdimm/claim.c | 6 ++++--
drivers/nvdimm/nd.h | 1 +
drivers/nvdimm/pmem.c | 14 ++++++++-----
drivers/nvdimm/region_devs.c | 38 ++++++++++++++++++++++++++++++++++--
include/linux/libnvdimm.h | 8 +++++++-
6 files changed, 59 insertions(+), 12 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 5a389a4f4f65..567017a2190e 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
offset = to_interleave_offset(offset, mmio);
writeq(cmd, mmio->addr.base + offset);
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL, false);
if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
readq(mmio->addr.base + offset);
@@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}
if (rw)
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL, false);
rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf469c7..a1dfa066786b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
sector_t sector = offset >> 9;
- int rc = 0;
+ int rc = 0, ret = 0;
if (unlikely(!size))
return 0;
@@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
}
memcpy_flushcache(nsio->addr + offset, buf, size);
- nvdimm_flush(to_nd_region(ndns->dev.parent));
+ ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL, false);
+ if (ret)
+ rc = ret;
return rc;
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index a5ac3b240293..916cd6c5451a 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -159,6 +159,7 @@ struct nd_region {
struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
+ int (*flush)(struct nd_region *nd_region);
struct nd_mapping mapping[0];
};
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..5a5b3ea4d073 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
{
+ int ret = 0;
blk_status_t rc = 0;
bool do_acct;
unsigned long start;
@@ -201,7 +202,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
struct nd_region *nd_region = to_region(pmem);
if (bio->bi_opf & REQ_PREFLUSH)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio, true);
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
nd_iostat_end(bio, start);
if (bio->bi_opf & REQ_FUA)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio, true);
+
+ if (ret)
+ bio->bi_status = errno_to_blk_status(ret);
bio_endio(bio);
return BLK_QC_T_NONE;
@@ -463,13 +467,13 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;
dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
-
gendev = disk_to_dev(disk);
gendev->groups = pmem_attribute_groups;
@@ -527,14 +531,14 @@ static int nd_pmem_remove(struct device *dev)
sysfs_put(pmem->bb_state);
pmem->bb_state = NULL;
}
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL, false);
return 0;
}
static void nd_pmem_shutdown(struct device *dev)
{
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL, false);
}
static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..fb1041ab32a6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
return rc;
if (!flush)
return -EINVAL;
- nvdimm_flush(nd_region);
+ rc = nvdimm_flush(nd_region, NULL, false);
+ if (rc)
+ return rc;
return len;
}
@@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
+ if (ndr_desc->flush)
+ nd_region->flush = ndr_desc->flush;
+ else
+ nd_region->flush = generic_nvdimm_flush;
+
nd_device_register(dev);
return nd_region;
@@ -1125,11 +1132,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
}
EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
+{
+ int rc = 0;
+
+ /* Create child bio for asynchronous flush and chain with
+ * parent bio. Otherwise directly call nd_region flush.
+ */
+ if (async && bio->bi_iter.bi_sector != -1) {
+
+ struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+ if (!child)
+ return -ENOMEM;
+ bio_copy_dev(child, bio);
+ child->bi_opf = REQ_PREFLUSH;
+ child->bi_iter.bi_sector = -1;
+ bio_chain(child, bio);
+ submit_bio(child);
+ } else {
+ if (nd_region->flush(nd_region))
+ rc = -EIO;
+ }
+
+ return rc;
+}
/**
* nvdimm_flush - flush any posted write queues between the cpu and pmem media
* @nd_region: blk or interleaved pmem region
*/
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
{
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i, idx;
@@ -1153,6 +1185,8 @@ void nvdimm_flush(struct nd_region *nd_region)
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
wmb();
+
+ return 0;
}
EXPORT_SYMBOL_GPL(nvdimm_flush);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index feb342d026f2..d9d2ab8a6e64 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -65,6 +65,9 @@ enum {
*/
ND_REGION_PERSIST_MEMCTRL = 2,
+ /* Platform provides asynchronous flush mechanism */
+ ND_REGION_ASYNC = 3,
+
/* mark newly adjusted resources as requiring a label update */
DPA_RESOURCE_ADJUSTED = 1 << 0,
};
@@ -121,6 +124,7 @@ struct nd_mapping_desc {
int position;
};
+struct nd_region;
struct nd_region_desc {
struct resource *res;
struct nd_mapping_desc *mapping;
@@ -133,6 +137,7 @@ struct nd_region_desc {
int target_node;
unsigned long flags;
struct device_node *of_node;
+ int (*flush)(struct nd_region *nd_region);
};
struct device;
@@ -260,7 +265,8 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
+int generic_nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
int nvdimm_in_overwrite(struct nvdimm *nvdimm);
--
2.20.1
This patch adds virtio-pmem driver for KVM guest.
Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.
This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.
Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
drivers/virtio/Makefile | 1 +
drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
include/linux/virtio_pmem.h | 60 +++++++++++++++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_pmem.h | 10 +++
7 files changed, 294 insertions(+)
create mode 100644 drivers/nvdimm/virtio_pmem.c
create mode 100644 drivers/virtio/pmem.c
create mode 100644 include/linux/virtio_pmem.h
create mode 100644 include/uapi/linux/virtio_pmem.h
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 000000000000..01044cc2f3a3
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+ unsigned int len;
+ unsigned long flags;
+ struct virtio_pmem_request *req, *req_buf;
+ struct virtio_pmem *vpmem = vq->vdev->priv;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+ req->done = true;
+ wake_up(&req->host_acked);
+
+ if (!list_empty(&vpmem->req_list)) {
+ req_buf = list_first_entry(&vpmem->req_list,
+ struct virtio_pmem_request, list);
+ list_del(&vpmem->req_list);
+ req_buf->wq_buf_avail = true;
+ wake_up(&req_buf->wq_buf);
+ }
+ }
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+ int err;
+ unsigned long flags;
+ struct scatterlist *sgs[2], sg, ret;
+ struct virtio_device *vdev = nd_region->provider_data;
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct virtio_pmem_request *req;
+
+ might_sleep();
+ req = kmalloc(sizeof(*req), GFP_KERNEL);
+ if (!req)
+ return -ENOMEM;
+
+ req->done = req->wq_buf_avail = false;
+ strcpy(req->name, "FLUSH");
+ init_waitqueue_head(&req->host_acked);
+ init_waitqueue_head(&req->wq_buf);
+ sg_init_one(&sg, req->name, strlen(req->name));
+ sgs[0] = &sg;
+ sg_init_one(&ret, &req->ret, sizeof(req->ret));
+ sgs[1] = &ret;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+ if (err) {
+ dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+ list_add_tail(&vpmem->req_list, &req->list);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->wq_buf, req->wq_buf_avail);
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ }
+ err = virtqueue_kick(vpmem->req_vq);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ if (!err) {
+ err = -EIO;
+ goto ret;
+ }
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->host_acked, req->done);
+ err = req->ret;
+ret:
+ kfree(req);
+ return err;
+};
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 35897649c24f..9f634a2ed638 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
If unsure, say Y.
+config VIRTIO_PMEM
+ tristate "Support for virtio pmem driver"
+ depends on VIRTIO
+ depends on LIBNVDIMM
+ help
+ This driver provides support for virtio based flushing interface
+ for persistent memory range.
+
+ If unsure, say M.
+
config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5dcf46..143ce91eabe9 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
new file mode 100644
index 000000000000..cc9de9589d56
--- /dev/null
+++ b/drivers/virtio/pmem.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */
+#include <linux/virtio_pmem.h>
+#include <../../drivers/nvdimm/nd.h>
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+ struct virtqueue *vq;
+
+ /* single vq */
+ vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+ host_ack, "flush_queue");
+ if (IS_ERR(vq))
+ return PTR_ERR(vq);
+
+ spin_lock_init(&vpmem->pmem_lock);
+ INIT_LIST_HEAD(&vpmem->req_list);
+
+ return 0;
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+ int err = 0;
+ struct resource res;
+ struct virtio_pmem *vpmem;
+ struct nvdimm_bus *nvdimm_bus;
+ struct nd_region_desc ndr_desc = {};
+ int nid = dev_to_node(&vdev->dev);
+ struct nd_region *nd_region;
+
+ if (!vdev->config->get) {
+ dev_err(&vdev->dev, "%s failure: config disabled\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+ GFP_KERNEL);
+ if (!vpmem) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ vpmem->vdev = vdev;
+ err = init_vq(vpmem);
+ if (err)
+ goto out_err;
+
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ start, &vpmem->start);
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ size, &vpmem->size);
+
+ res.start = vpmem->start;
+ res.end = vpmem->start + vpmem->size-1;
+ vpmem->nd_desc.provider_name = "virtio-pmem";
+ vpmem->nd_desc.module = THIS_MODULE;
+
+ vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+ &vpmem->nd_desc);
+ if (!nvdimm_bus)
+ goto out_vq;
+
+ dev_set_drvdata(&vdev->dev, nvdimm_bus);
+
+ ndr_desc.res = &res;
+ ndr_desc.numa_node = nid;
+ ndr_desc.flush = virtio_pmem_flush;
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+ set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+ nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+ nd_region->provider_data = dev_to_virtio
+ (nd_region->dev.parent->parent);
+
+ if (!nd_region)
+ goto out_nd;
+
+ return 0;
+out_nd:
+ err = -ENXIO;
+ nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+ vdev->config->del_vqs(vdev);
+out_err:
+ dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+ return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+ nvdimm_bus_unregister(nvdimm_bus);
+ vdev->config->del_vqs(vdev);
+ vdev->config->reset(vdev);
+ kfree(vpmem);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = virtio_pmem_probe,
+ .remove = virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
new file mode 100644
index 000000000000..224f9d934be6
--- /dev/null
+++ b/include/linux/virtio_pmem.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * virtio_pmem.h: virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ **/
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/virtio_ids.h>
+#include <linux/module.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+#include <linux/spinlock.h>
+
+struct virtio_pmem_request {
+ /* Host return status corresponding to flush request */
+ int ret;
+
+ /* command name*/
+ char name[16];
+
+ /* Wait queue to process deferred work after ack from host */
+ wait_queue_head_t host_acked;
+ bool done;
+
+ /* Wait queue to process deferred work after virt queue buffer avail */
+ wait_queue_head_t wq_buf;
+ bool wq_buf_avail;
+ struct list_head list;
+};
+
+struct virtio_pmem {
+ struct virtio_device *vdev;
+
+ /* Virtio pmem request queue */
+ struct virtqueue *req_vq;
+
+ /* nvdimm bus registers virtio pmem device */
+ struct nvdimm_bus *nvdimm_bus;
+ struct nvdimm_bus_descriptor nd_desc;
+
+ /* List to store deferred work if virtqueue is full */
+ struct list_head req_list;
+
+ /* Synchronize virtqueue data */
+ spinlock_t pmem_lock;
+
+ /* Memory region information */
+ uint64_t start;
+ uint64_t size;
+};
+
+void host_ack(struct virtqueue *vq);
+int virtio_pmem_flush(struct nd_region *nd_region);
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..346389565ac1 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_PMEM 25 /* virtio pmem */
#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 000000000000..fa3f7d52717a
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+ __le64 start;
+ __le64 size;
+};
+#endif
--
2.20.1
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.
Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/dax/bus.c | 2 +-
drivers/dax/super.c | 13 ++++++++++++-
drivers/md/dm.c | 2 +-
drivers/nvdimm/pmem.c | 3 ++-
drivers/nvdimm/region_devs.c | 7 +++++++
include/linux/dax.h | 9 +++++++--
include/linux/libnvdimm.h | 1 +
7 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe80219..431bf7d2a7f9 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
* No 'host' or dax_operations since there is no access to this
* device outside of mmap of the resulting character device.
*/
- dax_dev = alloc_dax(dev_dax, NULL, NULL);
+ dax_dev = alloc_dax(dev_dax, NULL, NULL, true);
if (!dax_dev)
goto err;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0a339b85133e..bd6509308d05 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -186,6 +186,8 @@ enum dax_device_flags {
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+ /* flag to check if device supports synchronous flush */
+ DAXDEV_SYNC,
};
/**
@@ -354,6 +356,12 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
+bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_synchronous);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
@@ -511,7 +519,7 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host)
}
struct dax_device *alloc_dax(void *private, const char *__host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, bool sync)
{
struct dax_device *dax_dev;
const char *host;
@@ -534,6 +542,9 @@ struct dax_device *alloc_dax(void *private, const char *__host,
dax_add_host(dax_dev, host);
dax_dev->ops = ops;
dax_dev->private = private;
+ if (sync)
+ set_bit(DAXDEV_SYNC, &dax_dev->flags);
+
return dax_dev;
err_dev:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 68d24056d0b1..534e12ca6329 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1965,7 +1965,7 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);
if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
- dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops, true);
if (!dax_dev)
goto bad;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5a5b3ea4d073..78f71ba0e7cf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -466,7 +466,8 @@ static int pmem_attach_disk(struct device *dev,
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;
- dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops,
+ is_nvdimm_sync(nd_region));
if (!dax_dev) {
put_disk(disk);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fb1041ab32a6..8c7aa047fe2b 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1231,6 +1231,13 @@ int nvdimm_has_cache(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev) &&
+ !test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
struct conflict_context {
struct nd_region *nd_region;
resource_size_t start, size;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..b896706a5ee9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,18 +32,19 @@ extern struct attribute_group dax_attribute_group;
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *dax_get_by_host(const char *host);
struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops);
+ const struct dax_operations *ops, bool sync);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
+bool dax_synchronous(struct dax_device *dax_dev);
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
return NULL;
}
static inline struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, bool sync)
{
/*
* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
@@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
{
return false;
}
+static inline bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return true;
+}
#endif
struct writeback_control;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index d9d2ab8a6e64..9a8aea370cbc 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -270,6 +270,7 @@ int generic_nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
int nvdimm_in_overwrite(struct nvdimm *nvdimm);
+bool is_nvdimm_sync(struct nd_region *nd_region);
static inline int nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc)
--
2.20.1
This patch introduces 'daxdev_mapping_supported' helper
which checks if 'MAP_SYNC' is supported with filesystem
mapping. It also checks if corresponding dax_device is
synchronous. Virtio pmem device is asynchronous and
does not not support VM_SYNC.
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Pankaj Gupta <[email protected]>
---
include/linux/dax.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b896706a5ee9..4a2a60ffec86 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,6 +38,24 @@ void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
bool dax_synchronous(struct dax_device *dax_dev);
+
+/*
+ * Callers check if synchronous mapping is enabled for DAX file
+ * and attached dax device is also synchronous.
+ *
+ * dax_synchronous function verifies if dax device is synchronous.
+ * Currently, only virtio pmem device supports asynchronous device
+ * flush.
+ */
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+ struct dax_device *dax_dev)
+{
+ if (!(vma->vm_flags & VM_SYNC))
+ return true;
+ if (!IS_DAX(file_inode(vma->vm_file)))
+ return false;
+ return dax_synchronous(dax_dev);
+}
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
@@ -69,6 +87,11 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
{
return true;
}
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+ struct dax_device *dax_dev)
+{
+ return true;
+}
#endif
struct writeback_control;
--
2.20.1
Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.
Signed-off-by: Pankaj Gupta <[email protected]>
---
fs/ext4/file.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 69d65d49837b..4b2ccaf1932e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -360,15 +360,16 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct dax_device *dax_dev = sbi->s_daxdev;
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
- /*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ /* We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
file_accessed(file);
--
2.20.1
Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.
Signed-off-by: Pankaj Gupta <[email protected]>
---
fs/xfs/xfs_file.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1f2e2845eb76..0e59be018511 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1196,11 +1196,13 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
{
- /*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ struct dax_device *dax_dev = xfs_find_daxdev_for_inode
+ (file_inode(filp));
+
+ /* We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
file_accessed(filp);
--
2.20.1
On Wed 10-04-19 09:38:24, Pankaj Gupta wrote:
> This patch introduces 'daxdev_mapping_supported' helper
> which checks if 'MAP_SYNC' is supported with filesystem
> mapping. It also checks if corresponding dax_device is
> synchronous. Virtio pmem device is asynchronous and
> does not not support VM_SYNC.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> include/linux/dax.h | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b896706a5ee9..4a2a60ffec86 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -38,6 +38,24 @@ void kill_dax(struct dax_device *dax_dev);
> void dax_write_cache(struct dax_device *dax_dev, bool wc);
> bool dax_write_cache_enabled(struct dax_device *dax_dev);
> bool dax_synchronous(struct dax_device *dax_dev);
> +
> +/*
> + * Callers check if synchronous mapping is enabled for DAX file
> + * and attached dax device is also synchronous.
> + *
> + * dax_synchronous function verifies if dax device is synchronous.
> + * Currently, only virtio pmem device supports asynchronous device
> + * flush.
> + */
Thanks for the patch! I'd restructure this comment like:
/*
* Check if given mapping is supported by the file / underlying device.
*/
> +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> + struct dax_device *dax_dev)
> +{
/* Everyone supports non-sync mappings */
> + if (!(vma->vm_flags & VM_SYNC))
> + return true;
/* Sync mappings are supported only for files using DAX */
> + if (!IS_DAX(file_inode(vma->vm_file)))
> + return false;
/* Underlying device must support persisting through CPU instructions */
> + return dax_synchronous(dax_dev);
> +}
> #else
> static inline struct dax_device *dax_get_by_host(const char *host)
> {
> @@ -69,6 +87,11 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
> {
> return true;
> }
> +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> + struct dax_device *dax_dev)
> +{
> + return true;
This looks wrong. Shouldn't it rather be:
return !(vma->flags & VM_SYNC);
?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 10-04-19 09:38:23, Pankaj Gupta wrote:
> @@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
> {
> return false;
> }
> +static inline bool dax_synchronous(struct dax_device *dax_dev)
> +{
> + return true;
> +}
Is there a need to define dax_synchronous() for !CONFIG_DAX? Because that
property of dax device is pretty much undefined and I don't see anything
needing to call it for !CONFIG_DAX...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
> > This patch introduces 'daxdev_mapping_supported' helper
> > which checks if 'MAP_SYNC' is supported with filesystem
> > mapping. It also checks if corresponding dax_device is
> > synchronous. Virtio pmem device is asynchronous and
> > does not not support VM_SYNC.
> >
> > Suggested-by: Jan Kara <[email protected]>
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > include/linux/dax.h | 23 +++++++++++++++++++++++
> > 1 file changed, 23 insertions(+)
> >
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b896706a5ee9..4a2a60ffec86 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -38,6 +38,24 @@ void kill_dax(struct dax_device *dax_dev);
> > void dax_write_cache(struct dax_device *dax_dev, bool wc);
> > bool dax_write_cache_enabled(struct dax_device *dax_dev);
> > bool dax_synchronous(struct dax_device *dax_dev);
> > +
> > +/*
> > + * Callers check if synchronous mapping is enabled for DAX file
> > + * and attached dax device is also synchronous.
> > + *
> > + * dax_synchronous function verifies if dax device is synchronous.
> > + * Currently, only virtio pmem device supports asynchronous device
> > + * flush.
> > + */
>
> Thanks for the patch! I'd restructure this comment like:
>
> /*
> * Check if given mapping is supported by the file / underlying device.
> */
Sure.
> > +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> > + struct dax_device *dax_dev)
> > +{
>
> /* Everyone supports non-sync mappings */
> > + if (!(vma->vm_flags & VM_SYNC))
> > + return true;
> /* Sync mappings are supported only for files using DAX */
> > + if (!IS_DAX(file_inode(vma->vm_file)))
> > + return false;
> /* Underlying device must support persisting through CPU instructions */
> > + return dax_synchronous(dax_dev);
> > +}
> > #else
> > static inline struct dax_device *dax_get_by_host(const char *host)
> > {
> > @@ -69,6 +87,11 @@ static inline bool dax_synchronous(struct dax_device
> > *dax_dev)
> > {
> > return true;
> > }
> > +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> > + struct dax_device *dax_dev)
> > +{
> > + return true;
>
> This looks wrong. Shouldn't it rather be:
>
> return !(vma->flags & VM_SYNC);
>
> ?
Right. I will correct this.
Thanks,
Pankaj
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>
>
> On Wed 10-04-19 09:38:23, Pankaj Gupta wrote:
> > @@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct
> > dax_device *dax_dev)
> > {
> > return false;
> > }
> > +static inline bool dax_synchronous(struct dax_device *dax_dev)
> > +{
> > + return true;
> > +}
>
> Is there a need to define dax_synchronous() for !CONFIG_DAX? Because that
> property of dax device is pretty much undefined and I don't see anything
> needing to call it for !CONFIG_DAX...
You are right. Will remove this.
Thanks for the review.
Best regards,
Pankaj
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>
On Wed, 10 Apr 2019 09:38:22 +0530
Pankaj Gupta <[email protected]> wrote:
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
> drivers/virtio/Kconfig | 10 +++
> drivers/virtio/Makefile | 1 +
> drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
> include/linux/virtio_pmem.h | 60 +++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_pmem.h | 10 +++
> 7 files changed, 294 insertions(+)
> create mode 100644 drivers/nvdimm/virtio_pmem.c
> create mode 100644 drivers/virtio/pmem.c
> create mode 100644 include/linux/virtio_pmem.h
> create mode 100644 include/uapi/linux/virtio_pmem.h
>
(...)
> diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> new file mode 100644
> index 000000000000..cc9de9589d56
> --- /dev/null
> +++ b/drivers/virtio/pmem.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and registers the virtual pmem device
> + * with libnvdimm core.
> + */
> +#include <linux/virtio_pmem.h>
> +#include <../../drivers/nvdimm/nd.h>
> +
> +static struct virtio_device_id id_table[] = {
> + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> + { 0 },
> +};
> +
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
IMHO, you don't gain much by splitting off this function...
> +{
> + struct virtqueue *vq;
> +
> + /* single vq */
> + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> + host_ack, "flush_queue");
> + if (IS_ERR(vq))
> + return PTR_ERR(vq);
I'm personally not a fan of chained assignments... I think I'd just
drop the 'vq' variable and operate on vpmem->req_vq directly.
> +
> + spin_lock_init(&vpmem->pmem_lock);
> + INIT_LIST_HEAD(&vpmem->req_list);
> +
> + return 0;
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc = {};
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
Maybe s/config disabled/config access disabled/ ? That seems to be the
more common message.
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);
Here, the vpmem variable makes sense for convenience, but I'm again not
a fan of the chaining :)
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
And here :)
> + if (!nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> + nd_region->provider_data = dev_to_virtio
> + (nd_region->dev.parent->parent);
Isn't it clear that this parent chain will always end up at &vdev->dev?
Maybe simply set ->provider_data to vdev directly? (Does it need to
grab a reference count of the device, BTW?)
> +
> + if (!nd_region)
> + goto out_nd;
Probably better to do this check before you access nd_region's
members :)
> +
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
I haven't followed this around the nvdimm code, but is the nd_region
you created during probe cleaned up automatically, or would you need to
do something here?
> + vdev->config->del_vqs(vdev);
> + vdev->config->reset(vdev);
> + kfree(vpmem);
You allocated vpmem via devm_kzalloc; isn't it freed automatically on
remove?
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name = KBUILD_MODNAME,
> + .driver.owner = THIS_MODULE,
> + .id_table = id_table,
> + .probe = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
Only looked at this from the general virtio driver angle; seems fine
apart from some easy-to-fix issues and some personal style preference
things.
On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
> drivers/virtio/Kconfig | 10 +++
> drivers/virtio/Makefile | 1 +
> drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
> include/linux/virtio_pmem.h | 60 +++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_pmem.h | 10 +++
> 7 files changed, 294 insertions(+)
> create mode 100644 drivers/nvdimm/virtio_pmem.c
> create mode 100644 drivers/virtio/pmem.c
> create mode 100644 include/linux/virtio_pmem.h
> create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> new file mode 100644
> index 000000000000..01044cc2f3a3
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio_pmem.h>
> +#include "nd.h"
> +
> + /* The interrupt handler */
> +void host_ack(struct virtqueue *vq)
> +{
> + unsigned int len;
> + unsigned long flags;
> + struct virtio_pmem_request *req, *req_buf;
> + struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> + req->done = true;
> + wake_up(&req->host_acked);
> +
> + if (!list_empty(&vpmem->req_list)) {
> + req_buf = list_first_entry(&vpmem->req_list,
> + struct virtio_pmem_request, list);
> + list_del(&vpmem->req_list);
> + req_buf->wq_buf_avail = true;
> + wake_up(&req_buf->wq_buf);
> + }
> + }
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(host_ack);
> +
> + /* The request submission function */
> +int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> + int err;
> + unsigned long flags;
> + struct scatterlist *sgs[2], sg, ret;
> + struct virtio_device *vdev = nd_region->provider_data;
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct virtio_pmem_request *req;
> +
> + might_sleep();
> + req = kmalloc(sizeof(*req), GFP_KERNEL);
> + if (!req)
> + return -ENOMEM;
> +
> + req->done = req->wq_buf_avail = false;
> + strcpy(req->name, "FLUSH");
> + init_waitqueue_head(&req->host_acked);
> + init_waitqueue_head(&req->wq_buf);
> + sg_init_one(&sg, req->name, strlen(req->name));
> + sgs[0] = &sg;
> + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> + sgs[1] = &ret;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> + if (err) {
> + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> + list_add_tail(&vpmem->req_list, &req->list);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->wq_buf, req->wq_buf_avail);
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + }
> + err = virtqueue_kick(vpmem->req_vq);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + if (!err) {
> + err = -EIO;
> + goto ret;
> + }
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->host_acked, req->done);
> + err = req->ret;
> +ret:
> + kfree(req);
> + return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..9f634a2ed638 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
>
> If unsure, say Y.
>
> +config VIRTIO_PMEM
> + tristate "Support for virtio pmem driver"
> + depends on VIRTIO
> + depends on LIBNVDIMM
> + help
> + This driver provides support for virtio based flushing interface
> + for persistent memory range.
> +
> + If unsure, say M.
> +
> config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5dcf46..143ce91eabe9 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> new file mode 100644
> index 000000000000..cc9de9589d56
> --- /dev/null
> +++ b/drivers/virtio/pmem.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and registers the virtual pmem device
> + * with libnvdimm core.
> + */
> +#include <linux/virtio_pmem.h>
> +#include <../../drivers/nvdimm/nd.h>
> +
> +static struct virtio_device_id id_table[] = {
> + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> + { 0 },
> +};
> +
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> + struct virtqueue *vq;
> +
> + /* single vq */
> + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> + host_ack, "flush_queue");
> + if (IS_ERR(vq))
> + return PTR_ERR(vq);
> +
> + spin_lock_init(&vpmem->pmem_lock);
> + INIT_LIST_HEAD(&vpmem->req_list);
> +
> + return 0;
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc = {};
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> + nd_region->provider_data = dev_to_virtio
> + (nd_region->dev.parent->parent);
> +
> + if (!nd_region)
> + goto out_nd;
> +
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
> + vdev->config->del_vqs(vdev);
> + vdev->config->reset(vdev);
> + kfree(vpmem);
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name = KBUILD_MODNAME,
> + .driver.owner = THIS_MODULE,
> + .id_table = id_table,
> + .probe = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..224f9d934be6
> --- /dev/null
> +++ b/include/linux/virtio_pmem.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * virtio_pmem.h: virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + **/
> +
> +#ifndef _LINUX_VIRTIO_PMEM_H
> +#define _LINUX_VIRTIO_PMEM_H
> +
> +#include <linux/virtio_ids.h>
> +#include <linux/module.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/spinlock.h>
> +
> +struct virtio_pmem_request {
> + /* Host return status corresponding to flush request */
> + int ret;
> +
> + /* command name*/
> + char name[16];
> +
> + /* Wait queue to process deferred work after ack from host */
> + wait_queue_head_t host_acked;
> + bool done;
> +
> + /* Wait queue to process deferred work after virt queue buffer avail */
> + wait_queue_head_t wq_buf;
> + bool wq_buf_avail;
> + struct list_head list;
> +};
> +
> +struct virtio_pmem {
> + struct virtio_device *vdev;
> +
> + /* Virtio pmem request queue */
> + struct virtqueue *req_vq;
> +
> + /* nvdimm bus registers virtio pmem device */
> + struct nvdimm_bus *nvdimm_bus;
> + struct nvdimm_bus_descriptor nd_desc;
> +
> + /* List to store deferred work if virtqueue is full */
> + struct list_head req_list;
> +
> + /* Synchronize virtqueue data */
> + spinlock_t pmem_lock;
> +
> + /* Memory region information */
> + uint64_t start;
> + uint64_t size;
> +};
> +
> +void host_ack(struct virtqueue *vq);
> +int virtio_pmem_flush(struct nd_region *nd_region);
> +#endif
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2d4f4d..346389565ac1 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
>
> #endif /* _LINUX_VIRTIO_IDS_H */
Didn't Paolo point out someone is using 25 for audio?
Please, reserve an ID with the Virtio TC before using it.
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..fa3f7d52717a
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> + __le64 start;
> + __le64 size;
> +};
> +#endif
> --
> 2.20.1
>
> On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
> > drivers/virtio/Kconfig | 10 +++
> > drivers/virtio/Makefile | 1 +
> > drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
> > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 10 +++
> > 7 files changed, 294 insertions(+)
> > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > create mode 100644 drivers/virtio/pmem.c
> > create mode 100644 include/linux/virtio_pmem.h
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..01044cc2f3a3
> > --- /dev/null
> > +++ b/drivers/nvdimm/virtio_pmem.c
> > @@ -0,0 +1,88 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > + /* The interrupt handler */
> > +void host_ack(struct virtqueue *vq)
> > +{
> > + unsigned int len;
> > + unsigned long flags;
> > + struct virtio_pmem_request *req, *req_buf;
> > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > + req->done = true;
> > + wake_up(&req->host_acked);
> > +
> > + if (!list_empty(&vpmem->req_list)) {
> > + req_buf = list_first_entry(&vpmem->req_list,
> > + struct virtio_pmem_request, list);
> > + list_del(&vpmem->req_list);
> > + req_buf->wq_buf_avail = true;
> > + wake_up(&req_buf->wq_buf);
> > + }
> > + }
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(host_ack);
> > +
> > + /* The request submission function */
> > +int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > + int err;
> > + unsigned long flags;
> > + struct scatterlist *sgs[2], sg, ret;
> > + struct virtio_device *vdev = nd_region->provider_data;
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct virtio_pmem_request *req;
> > +
> > + might_sleep();
> > + req = kmalloc(sizeof(*req), GFP_KERNEL);
> > + if (!req)
> > + return -ENOMEM;
> > +
> > + req->done = req->wq_buf_avail = false;
> > + strcpy(req->name, "FLUSH");
> > + init_waitqueue_head(&req->host_acked);
> > + init_waitqueue_head(&req->wq_buf);
> > + sg_init_one(&sg, req->name, strlen(req->name));
> > + sgs[0] = &sg;
> > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > + sgs[1] = &ret;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > + if (err) {
> > + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > +
> > + list_add_tail(&vpmem->req_list, &req->list);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->wq_buf, req->wq_buf_avail);
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + }
> > + err = virtqueue_kick(vpmem->req_vq);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + if (!err) {
> > + err = -EIO;
> > + goto ret;
> > + }
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->host_acked, req->done);
> > + err = req->ret;
> > +ret:
> > + kfree(req);
> > + return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 35897649c24f..9f634a2ed638 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> >
> > If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > + tristate "Support for virtio pmem driver"
> > + depends on VIRTIO
> > + depends on LIBNVDIMM
> > + help
> > + This driver provides support for virtio based flushing interface
> > + for persistent memory range.
> > +
> > + If unsure, say M.
> > +
> > config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5dcf46..143ce91eabe9 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > + struct virtqueue *vq;
> > +
> > + /* single vq */
> > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vq))
> > + return PTR_ERR(vq);
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nd_region_desc ndr_desc = {};
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > + GFP_KERNEL);
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
> > + if (!nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = virtio_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > + nd_region->provider_data = dev_to_virtio
> > + (nd_region->dev.parent->parent);
> > +
> > + if (!nd_region)
> > + goto out_nd;
> > +
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
> > + vdev->config->del_vqs(vdev);
> > + vdev->config->reset(vdev);
> > + kfree(vpmem);
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..224f9d934be6
> > --- /dev/null
> > +++ b/include/linux/virtio_pmem.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * virtio_pmem.h: virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + **/
> > +
> > +#ifndef _LINUX_VIRTIO_PMEM_H
> > +#define _LINUX_VIRTIO_PMEM_H
> > +
> > +#include <linux/virtio_ids.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/spinlock.h>
> > +
> > +struct virtio_pmem_request {
> > + /* Host return status corresponding to flush request */
> > + int ret;
> > +
> > + /* command name*/
> > + char name[16];
> > +
> > + /* Wait queue to process deferred work after ack from host */
> > + wait_queue_head_t host_acked;
> > + bool done;
> > +
> > + /* Wait queue to process deferred work after virt queue buffer avail */
> > + wait_queue_head_t wq_buf;
> > + bool wq_buf_avail;
> > + struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > + struct virtio_device *vdev;
> > +
> > + /* Virtio pmem request queue */
> > + struct virtqueue *req_vq;
> > +
> > + /* nvdimm bus registers virtio pmem device */
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nvdimm_bus_descriptor nd_desc;
> > +
> > + /* List to store deferred work if virtqueue is full */
> > + struct list_head req_list;
> > +
> > + /* Synchronize virtqueue data */
> > + spinlock_t pmem_lock;
> > +
> > + /* Memory region information */
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +
> > +void host_ack(struct virtqueue *vq);
> > +int virtio_pmem_flush(struct nd_region *nd_region);
> > +#endif
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2d4f4d..346389565ac1 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> >
> > #endif /* _LINUX_VIRTIO_IDS_H */
>
> Didn't Paolo point out someone is using 25 for audio?
Sorry! I did not notice this.
>
> Please, reserve an ID with the Virtio TC before using it.
I thought I requested[1] ID 25.
[1] https://github.com/oasis-tcs/virtio-spec/issues/38
[2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
In that case I will send request for next available ID i.e 26?
Thanks,
Pankaj
>
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..fa3f7d52717a
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > + __le64 start;
> > + __le64 size;
> > +};
> > +#endif
> > --
> > 2.20.1
>
>
On Wed, 10 Apr 2019 10:03:01 -0400 (EDT)
Pankaj Gupta <[email protected]> wrote:
> >
> > On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > diff --git a/include/uapi/linux/virtio_ids.h
> > > b/include/uapi/linux/virtio_ids.h
> > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > --- a/include/uapi/linux/virtio_ids.h
> > > +++ b/include/uapi/linux/virtio_ids.h
> > > @@ -43,5 +43,6 @@
> > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > >
> > > #endif /* _LINUX_VIRTIO_IDS_H */
> >
> > Didn't Paolo point out someone is using 25 for audio?
>
>
> Sorry! I did not notice this.
>
> >
> > Please, reserve an ID with the Virtio TC before using it.
>
> I thought I requested[1] ID 25.
>
> [1] https://github.com/oasis-tcs/virtio-spec/issues/38
> [2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
>
> In that case I will send request for next available ID i.e 26?
Have the folks using this for audio sent a reservation request as well?
If not, I'd say the first requester should get the id...
(And yes, we need to be quicker with voting on/applying id
reservations :/)
On Wed, Apr 10, 2019 at 02:24:26PM +0200, Cornelia Huck wrote:
> On Wed, 10 Apr 2019 09:38:22 +0530
> Pankaj Gupta <[email protected]> wrote:
>
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
> > drivers/virtio/Kconfig | 10 +++
> > drivers/virtio/Makefile | 1 +
> > drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
> > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 10 +++
> > 7 files changed, 294 insertions(+)
> > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > create mode 100644 drivers/virtio/pmem.c
> > create mode 100644 include/linux/virtio_pmem.h
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> (...)
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
>
> IMHO, you don't gain much by splitting off this function...
It make sense to have all the vq-init-related stuff in one function, so
here pmem_lock and req_list are used only for the vq.
Saying that - maybe it would be better to have the 3 in one struct.
>
> > +{
> > + struct virtqueue *vq;
> > +
> > + /* single vq */
> > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vq))
> > + return PTR_ERR(vq);
>
> I'm personally not a fan of chained assignments... I think I'd just
> drop the 'vq' variable and operate on vpmem->req_vq directly.
+1
>
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nd_region_desc ndr_desc = {};
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config disabled\n",
>
> Maybe s/config disabled/config access disabled/ ? That seems to be the
> more common message.
>
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > + GFP_KERNEL);
>
> Here, the vpmem variable makes sense for convenience, but I'm again not
> a fan of the chaining :)
+1
>
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
>
> And here :)
>
> > + if (!nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = virtio_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > + nd_region->provider_data = dev_to_virtio
> > + (nd_region->dev.parent->parent);
>
> Isn't it clear that this parent chain will always end up at &vdev->dev?
> Maybe simply set ->provider_data to vdev directly? (Does it need to
> grab a reference count of the device, BTW?)
>
> > +
> > + if (!nd_region)
> > + goto out_nd;
>
> Probably better to do this check before you access nd_region's
> members :)
>
> > +
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
>
> I haven't followed this around the nvdimm code, but is the nd_region
> you created during probe cleaned up automatically, or would you need to
> do something here?
>
> > + vdev->config->del_vqs(vdev);
> > + vdev->config->reset(vdev);
> > + kfree(vpmem);
>
> You allocated vpmem via devm_kzalloc; isn't it freed automatically on
> remove?
>
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
>
> Only looked at this from the general virtio driver angle; seems fine
> apart from some easy-to-fix issues and some personal style preference
> things.
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc = {};
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);
> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!nvdimm_bus)
> + goto out_vq;
> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> + nd_region->provider_data = dev_to_virtio
Pleas delete the extra space.
> + (nd_region->dev.parent->parent);
> +
> + if (!nd_region)
> + goto out_nd;
> +
> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
>
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
> > drivers/virtio/Kconfig | 10 +++
> > drivers/virtio/Makefile | 1 +
> > drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
> > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 10 +++
> > 7 files changed, 294 insertions(+)
> > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > create mode 100644 drivers/virtio/pmem.c
> > create mode 100644 include/linux/virtio_pmem.h
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> (...)
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..cc9de9589d56
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
>
> IMHO, you don't gain much by splitting off this function...
o.k. I kept this for better code structure.
>
> > +{
> > + struct virtqueue *vq;
> > +
> > + /* single vq */
> > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vq))
> > + return PTR_ERR(vq);
>
> I'm personally not a fan of chained assignments... I think I'd just
> drop the 'vq' variable and operate on vpmem->req_vq directly.
Sure.
>
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nd_region_desc ndr_desc = {};
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config disabled\n",
>
> Maybe s/config disabled/config access disabled/ ? That seems to be the
> more common message.
This is better.
>
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > + GFP_KERNEL);
>
> Here, the vpmem variable makes sense for convenience, but I'm again not
> a fan of the chaining :)
Sure will change :)
>
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
>
> And here :)
Sure.
>
> > + if (!nvdimm_bus)
> > + goto out_vq;
> > +
> > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = virtio_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > + nd_region->provider_data = dev_to_virtio
> > + (nd_region->dev.parent->parent);
>
> Isn't it clear that this parent chain will always end up at &vdev->dev?
Yes, It will resolve to &vdev->dev.
> Maybe simply set ->provider_data to vdev directly? (Does it need to
> grab a reference count of the device, BTW?)
reference is already taken when registering the device with nvdimm_bus.
>
> > +
> > + if (!nd_region)
> > + goto out_nd;
>
> Probably better to do this check before you access nd_region's
> members :)
ah Sorry! Will correct this.
>
> > +
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
>
> I haven't followed this around the nvdimm code, but is the nd_region
> you created during probe cleaned up automatically, or would you need to
> do something here?
'nvdimm_bus_unregister' does that.
>
> > + vdev->config->del_vqs(vdev);
> > + vdev->config->reset(vdev);
> > + kfree(vpmem);
>
> You allocated vpmem via devm_kzalloc; isn't it freed automatically on
> remove?
yes. Will remove free(vpmem).
>
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
>
> Only looked at this from the general virtio driver angle; seems fine
> apart from some easy-to-fix issues and some personal style preference
> things.
Thank you for the review.
Best regards,
Pankaj
>
> >
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > ---
> > > drivers/nvdimm/virtio_pmem.c | 88 ++++++++++++++++++++++
> > > drivers/virtio/Kconfig | 10 +++
> > > drivers/virtio/Makefile | 1 +
> > > drivers/virtio/pmem.c | 124 +++++++++++++++++++++++++++++++
> > > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > > include/uapi/linux/virtio_ids.h | 1 +
> > > include/uapi/linux/virtio_pmem.h | 10 +++
> > > 7 files changed, 294 insertions(+)
> > > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > > create mode 100644 drivers/virtio/pmem.c
> > > create mode 100644 include/linux/virtio_pmem.h
> > > create mode 100644 include/uapi/linux/virtio_pmem.h
> > >
> > (...)
> > > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > > new file mode 100644
> > > index 000000000000..cc9de9589d56
> > > --- /dev/null
> > > +++ b/drivers/virtio/pmem.c
> > > @@ -0,0 +1,124 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and registers the virtual pmem device
> > > + * with libnvdimm core.
> > > + */
> > > +#include <linux/virtio_pmem.h>
> > > +#include <../../drivers/nvdimm/nd.h>
> > > +
> > > +static struct virtio_device_id id_table[] = {
> > > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > + { 0 },
> > > +};
> > > +
> > > + /* Initialize virt queue */
> > > +static int init_vq(struct virtio_pmem *vpmem)
> >
> > IMHO, you don't gain much by splitting off this function...
>
> It make sense to have all the vq-init-related stuff in one function, so
> here pmem_lock and req_list are used only for the vq.
Yes.
> Saying that - maybe it would be better to have the 3 in one struct.
>
> >
> > > +{
> > > + struct virtqueue *vq;
> > > +
> > > + /* single vq */
> > > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > + host_ack, "flush_queue");
> > > + if (IS_ERR(vq))
> > > + return PTR_ERR(vq);
> >
> > I'm personally not a fan of chained assignments... I think I'd just
> > drop the 'vq' variable and operate on vpmem->req_vq directly.
>
> +1
Will drop extra vq.
>
> >
> > > +
> > > + spin_lock_init(&vpmem->pmem_lock);
> > > + INIT_LIST_HEAD(&vpmem->req_list);
> > > +
> > > + return 0;
> > > +};
> > > +
> > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > +{
> > > + int err = 0;
> > > + struct resource res;
> > > + struct virtio_pmem *vpmem;
> > > + struct nvdimm_bus *nvdimm_bus;
> > > + struct nd_region_desc ndr_desc = {};
> > > + int nid = dev_to_node(&vdev->dev);
> > > + struct nd_region *nd_region;
> > > +
> > > + if (!vdev->config->get) {
> > > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> >
> > Maybe s/config disabled/config access disabled/ ? That seems to be the
> > more common message.
> >
> > > + __func__);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > + GFP_KERNEL);
> >
> > Here, the vpmem variable makes sense for convenience, but I'm again not
> > a fan of the chaining :)
>
> +1
here as well.
>
> >
> > > + if (!vpmem) {
> > > + err = -ENOMEM;
> > > + goto out_err;
> > > + }
> > > +
> > > + vpmem->vdev = vdev;
> > > + err = init_vq(vpmem);
> > > + if (err)
> > > + goto out_err;
> > > +
> > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > + start, &vpmem->start);
> > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > + size, &vpmem->size);
> > > +
> > > + res.start = vpmem->start;
> > > + res.end = vpmem->start + vpmem->size-1;
> > > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > > + vpmem->nd_desc.module = THIS_MODULE;
> > > +
> > > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > + &vpmem->nd_desc);
> >
> > And here :)
> >
> > > + if (!nvdimm_bus)
> > > + goto out_vq;
> > > +
> > > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > +
> > > + ndr_desc.res = &res;
> > > + ndr_desc.numa_node = nid;
> > > + ndr_desc.flush = virtio_pmem_flush;
> > > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > + nd_region->provider_data = dev_to_virtio
> > > + (nd_region->dev.parent->parent);
> >
> > Isn't it clear that this parent chain will always end up at &vdev->dev?
> > Maybe simply set ->provider_data to vdev directly? (Does it need to
> > grab a reference count of the device, BTW?)
> >
> > > +
> > > + if (!nd_region)
> > > + goto out_nd;
> >
> > Probably better to do this check before you access nd_region's
> > members :)
> >
> > > +
> > > + return 0;
> > > +out_nd:
> > > + err = -ENXIO;
> > > + nvdimm_bus_unregister(nvdimm_bus);
> > > +out_vq:
> > > + vdev->config->del_vqs(vdev);
> > > +out_err:
> > > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > + return err;
> > > +}
> > > +
> > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > +{
> > > + struct virtio_pmem *vpmem = vdev->priv;
> > > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > +
> > > + nvdimm_bus_unregister(nvdimm_bus);
> >
> > I haven't followed this around the nvdimm code, but is the nd_region
> > you created during probe cleaned up automatically, or would you need to
> > do something here?
> >
> > > + vdev->config->del_vqs(vdev);
> > > + vdev->config->reset(vdev);
> > > + kfree(vpmem);
> >
> > You allocated vpmem via devm_kzalloc; isn't it freed automatically on
> > remove?
> >
> > > +}
> > > +
> > > +static struct virtio_driver virtio_pmem_driver = {
> > > + .driver.name = KBUILD_MODNAME,
> > > + .driver.owner = THIS_MODULE,
> > > + .id_table = id_table,
> > > + .probe = virtio_pmem_probe,
> > > + .remove = virtio_pmem_remove,
> > > +};
> > > +
> > > +module_virtio_driver(virtio_pmem_driver);
> > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > +MODULE_LICENSE("GPL");
> >
> > Only looked at this from the general virtio driver angle; seems fine
> > apart from some easy-to-fix issues and some personal style preference
> > things.
>
Best regards,
Pankaj
On Wed, Apr 10, 2019 at 04:31:39PM +0200, Cornelia Huck wrote:
> On Wed, 10 Apr 2019 10:03:01 -0400 (EDT)
> Pankaj Gupta <[email protected]> wrote:
>
> > >
> > > On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > >
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > >
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > > >
> > > > Signed-off-by: Pankaj Gupta <[email protected]>
>
> > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > b/include/uapi/linux/virtio_ids.h
> > > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > > --- a/include/uapi/linux/virtio_ids.h
> > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > @@ -43,5 +43,6 @@
> > > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > > >
> > > > #endif /* _LINUX_VIRTIO_IDS_H */
> > >
> > > Didn't Paolo point out someone is using 25 for audio?
> >
> >
> > Sorry! I did not notice this.
> >
> > >
> > > Please, reserve an ID with the Virtio TC before using it.
> >
> > I thought I requested[1] ID 25.
> >
> > [1] https://github.com/oasis-tcs/virtio-spec/issues/38
> > [2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
> >
> > In that case I will send request for next available ID i.e 26?
>
> Have the folks using this for audio sent a reservation request as well?
> If not, I'd say the first requester should get the id...
No but I think they ship their's already. No one ships pmem
so it's less pain for everyone if we just skip 25.
> (And yes, we need to be quicker with voting on/applying id
> reservations :/)
We can't vote on what was not proposed for a vote.
At the moment that responsibility is with the commented
once discussion on the comment has taken place.
I think what's missing is a description of the process
in the README. Want to write it up and post it?
--
MST
On Wed, 10 Apr 2019 12:46:12 -0400
"Michael S. Tsirkin" <[email protected]> wrote:
> On Wed, Apr 10, 2019 at 04:31:39PM +0200, Cornelia Huck wrote:
> > On Wed, 10 Apr 2019 10:03:01 -0400 (EDT)
> > Pankaj Gupta <[email protected]> wrote:
> >
> > > >
> > > > On Wed, Apr 10, 2019 at 09:38:22AM +0530, Pankaj Gupta wrote:
> > > > > This patch adds virtio-pmem driver for KVM guest.
> > > > >
> > > > > Guest reads the persistent memory range information from
> > > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > > creates a nd_region object with the persistent memory
> > > > > range information so that existing 'nvdimm/pmem' driver
> > > > > can reserve this into system memory map. This way
> > > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > > driver to register persistent memory compatible for DAX
> > > > > capable filesystems.
> > > > >
> > > > > This also provides function to perform guest flush over
> > > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > > on DAX memory range.
> > > > >
> > > > > Signed-off-by: Pankaj Gupta <[email protected]>
> >
> > > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > > b/include/uapi/linux/virtio_ids.h
> > > > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > > > --- a/include/uapi/linux/virtio_ids.h
> > > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > > @@ -43,5 +43,6 @@
> > > > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > > > >
> > > > > #endif /* _LINUX_VIRTIO_IDS_H */
> > > >
> > > > Didn't Paolo point out someone is using 25 for audio?
> > >
> > >
> > > Sorry! I did not notice this.
> > >
> > > >
> > > > Please, reserve an ID with the Virtio TC before using it.
> > >
> > > I thought I requested[1] ID 25.
> > >
> > > [1] https://github.com/oasis-tcs/virtio-spec/issues/38
> > > [2] https://lists.oasis-open.org/archives/virtio-dev/201805/threads.html
> > >
> > > In that case I will send request for next available ID i.e 26?
> >
> > Have the folks using this for audio sent a reservation request as well?
> > If not, I'd say the first requester should get the id...
>
> No but I think they ship their's already. No one ships pmem
> so it's less pain for everyone if we just skip 25.
Ugh. Ok, then we should change pmem...
>
> > (And yes, we need to be quicker with voting on/applying id
> > reservations :/)
>
> We can't vote on what was not proposed for a vote.
> At the moment that responsibility is with the commented
> once discussion on the comment has taken place.
>
> I think what's missing is a description of the process
> in the README. Want to write it up and post it?
I can add that to my TODO list, can't promise speedy resolution, though.
On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <[email protected]> wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> This also handles asynchronous flush requests from the block layer
> by creating a child bio and chaining it with parent bio.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
[..]
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index b4ef7d9ff22e..fb1041ab32a6 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
> return rc;
> if (!flush)
> return -EINVAL;
> - nvdimm_flush(nd_region);
> + rc = nvdimm_flush(nd_region, NULL, false);
> + if (rc)
> + return rc;
>
> return len;
> }
> @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
> dev->of_node = ndr_desc->of_node;
> nd_region->ndr_size = resource_size(ndr_desc->res);
> nd_region->ndr_start = ndr_desc->res->start;
> + if (ndr_desc->flush)
> + nd_region->flush = ndr_desc->flush;
> + else
> + nd_region->flush = generic_nvdimm_flush;
> +
> nd_device_register(dev);
>
> return nd_region;
> @@ -1125,11 +1132,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> }
> EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
>
> +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> +{
I don't quite see the point of the 'async' argument. All the usages of
this routine are either
nvdimm_flush(nd_region, bio, true)
...or:
nvdimm_flush(nd_region, NULL, false)
...so why not gate async behavior on the presence of the 'bio' argument?
> + int rc = 0;
> +
> + /* Create child bio for asynchronous flush and chain with
> + * parent bio. Otherwise directly call nd_region flush.
> + */
> + if (async && bio->bi_iter.bi_sector != -1) {
> +
> + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> +
> + if (!child)
> + return -ENOMEM;
> + bio_copy_dev(child, bio);
> + child->bi_opf = REQ_PREFLUSH;
> + child->bi_iter.bi_sector = -1;
> + bio_chain(child, bio);
> + submit_bio(child);
I understand how this works, but it's a bit too "magical" for my
taste. I would prefer that all flush implementations take an optional
'bio' argument rather than rely on the make_request implementation to
stash the bio away on a driver specific list.
> + } else {
> + if (nd_region->flush(nd_region))
> + rc = -EIO;
Given the common case wants to be fast and synchronous I think we
should try to avoid retpoline overhead by default. So something like
this:
if (nd_region->flush == generic_nvdimm_flush)
rc = generic_nvdimm_flush(...);
On Tue, Apr 9, 2019 at 9:10 PM Pankaj Gupta <[email protected]> wrote:
>
> This patch adds 'DAXDEV_SYNC' flag which is set
> for nd_region doing synchronous flush. This later
> is used to disable MAP_SYNC functionality for
> ext4 & xfs filesystem for devices don't support
> synchronous flush.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/dax/bus.c | 2 +-
> drivers/dax/super.c | 13 ++++++++++++-
> drivers/md/dm.c | 2 +-
> drivers/nvdimm/pmem.c | 3 ++-
> drivers/nvdimm/region_devs.c | 7 +++++++
> include/linux/dax.h | 9 +++++++--
> include/linux/libnvdimm.h | 1 +
> 7 files changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 2109cfe80219..431bf7d2a7f9 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
> * No 'host' or dax_operations since there is no access to this
> * device outside of mmap of the resulting character device.
> */
> - dax_dev = alloc_dax(dev_dax, NULL, NULL);
> + dax_dev = alloc_dax(dev_dax, NULL, NULL, true);
I find apis that take a boolean as unreadable. What does 'true' mean?
It wastes time to go look at the function definition vs something
like:
alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
Hi Dan,
Thank you for the review.
> >
> > This patch adds 'DAXDEV_SYNC' flag which is set
> > for nd_region doing synchronous flush. This later
> > is used to disable MAP_SYNC functionality for
> > ext4 & xfs filesystem for devices don't support
> > synchronous flush.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/dax/bus.c | 2 +-
> > drivers/dax/super.c | 13 ++++++++++++-
> > drivers/md/dm.c | 2 +-
> > drivers/nvdimm/pmem.c | 3 ++-
> > drivers/nvdimm/region_devs.c | 7 +++++++
> > include/linux/dax.h | 9 +++++++--
> > include/linux/libnvdimm.h | 1 +
> > 7 files changed, 31 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 2109cfe80219..431bf7d2a7f9 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region
> > *dax_region, int id,
> > * No 'host' or dax_operations since there is no access to this
> > * device outside of mmap of the resulting character device.
> > */
> > - dax_dev = alloc_dax(dev_dax, NULL, NULL);
> > + dax_dev = alloc_dax(dev_dax, NULL, NULL, true);
>
> I find apis that take a boolean as unreadable. What does 'true' mean?
> It wastes time to go look at the function definition vs something
> like:
>
> alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
Agree. Will change as suggested.
Best regards,
Pankaj
>
> >
> > This patch adds functionality to perform flush from guest
> > to host over VIRTIO. We are registering a callback based
> > on 'nd_region' type. virtio_pmem driver requires this special
> > flush function. For rest of the region types we are registering
> > existing flush function. Report error returned by host fsync
> > failure to userspace.
> >
> > This also handles asynchronous flush requests from the block layer
> > by creating a child bio and chaining it with parent bio.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---bio_chain Dan williams
> [..]
> > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > index b4ef7d9ff22e..fb1041ab32a6 100644
> > --- a/drivers/nvdimm/region_devs.c
> > +++ b/drivers/nvdimm/region_devs.c
> > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > struct device_attribute *att
> > return rc;
> > if (!flush)
> > return -EINVAL;
> > - nvdimm_flush(nd_region);
> > + rc = nvdimm_flush(nd_region, NULL, false);
> > + if (rc)
> > + return rc;
> >
> > return len;
> > }
> > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > nvdimm_bus *nvdimm_bus,
> > dev->of_node = ndr_desc->of_node;
> > nd_region->ndr_size = resource_size(ndr_desc->res);
> > nd_region->ndr_start = ndr_desc->res->start;
> > + if (ndr_desc->flush)
> > + nd_region->flush = ndr_desc->flush;
> > + else
> > + nd_region->flush = generic_nvdimm_flush;
> > +
> > nd_device_register(dev);
> >
> > return nd_region;
> > @@ -1125,11 +1132,36 @@ struct nd_region
> > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> > }
> > EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> >
> > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> > +{
>
> I don't quite see the point of the 'async' argument. All the usages of
> this routine are either
>
> nvdimm_flush(nd_region, bio, true)
> ...or:
> nvdimm_flush(nd_region, NULL, false)
Agree.
>
> ...so why not gate async behavior on the presence of the 'bio' argument?
Sure.
>
>
> > + int rc = 0;
> > +
> > + /* Create child bio for asynchronous flush and chain with
> > + * parent bio. Otherwise directly call nd_region flush.
> > + */
> > + if (async && bio->bi_iter.bi_sector != -1) {
> > +
> > + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > +
> > + if (!child)
> > + return -ENOMEM;
> > + bio_copy_dev(child, bio);
> > + child->bi_opf = REQ_PREFLUSH;
> > + child->bi_iter.bi_sector = -1;
> > + bio_chain(child, bio);
> > + submit_bio(child);
>
> I understand how this works, but it's a bit too "magical" for my
> taste. I would prefer that all flush implementations take an optional
> 'bio' argument rather than rely on the make_request implementation to
> stash the bio away on a driver specific list.
I did this to make use of "bio_chain" for chaining child bio for async flush
suggested [1]. Are you saying to remove this and just call "flush" based on
bio argument? Or I implemented the 'bio_chain' request entirely wrong?
[1] https://lkml.org/lkml/2018/9/27/1028
>
> > + } else {
> > + if (nd_region->flush(nd_region))
> > + rc = -EIO;
>
> Given the common case wants to be fast and synchronous I think we
> should try to avoid retpoline overhead by default. So something like
> this:
>
> if (nd_region->flush == generic_nvdimm_flush)
> rc = generic_nvdimm_flush(...);
Sure.
Thanks,
Pankaj
>
>
On Thu, Apr 11, 2019 at 9:02 AM Pankaj Gupta <[email protected]> wrote:
>
>
>
> > >
> > > This patch adds functionality to perform flush from guest
> > > to host over VIRTIO. We are registering a callback based
> > > on 'nd_region' type. virtio_pmem driver requires this special
> > > flush function. For rest of the region types we are registering
> > > existing flush function. Report error returned by host fsync
> > > failure to userspace.
> > >
> > > This also handles asynchronous flush requests from the block layer
> > > by creating a child bio and chaining it with parent bio.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > ---bio_chain Dan williams
> > [..]
> > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > index b4ef7d9ff22e..fb1041ab32a6 100644
> > > --- a/drivers/nvdimm/region_devs.c
> > > +++ b/drivers/nvdimm/region_devs.c
> > > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > > struct device_attribute *att
> > > return rc;
> > > if (!flush)
> > > return -EINVAL;
> > > - nvdimm_flush(nd_region);
> > > + rc = nvdimm_flush(nd_region, NULL, false);
> > > + if (rc)
> > > + return rc;
> > >
> > > return len;
> > > }
> > > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > > nvdimm_bus *nvdimm_bus,
> > > dev->of_node = ndr_desc->of_node;
> > > nd_region->ndr_size = resource_size(ndr_desc->res);
> > > nd_region->ndr_start = ndr_desc->res->start;
> > > + if (ndr_desc->flush)
> > > + nd_region->flush = ndr_desc->flush;
> > > + else
> > > + nd_region->flush = generic_nvdimm_flush;
> > > +
> > > nd_device_register(dev);
> > >
> > > return nd_region;
> > > @@ -1125,11 +1132,36 @@ struct nd_region
> > > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> > > }
> > > EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> > >
> > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
> > > +{
> >
> > I don't quite see the point of the 'async' argument. All the usages of
> > this routine are either
> >
> > nvdimm_flush(nd_region, bio, true)
> > ...or:
> > nvdimm_flush(nd_region, NULL, false)
>
> Agree.
>
> >
> > ...so why not gate async behavior on the presence of the 'bio' argument?
>
> Sure.
>
> >
> >
> > > + int rc = 0;
> > > +
> > > + /* Create child bio for asynchronous flush and chain with
> > > + * parent bio. Otherwise directly call nd_region flush.
> > > + */
> > > + if (async && bio->bi_iter.bi_sector != -1) {
> > > +
> > > + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > +
> > > + if (!child)
> > > + return -ENOMEM;
> > > + bio_copy_dev(child, bio);
> > > + child->bi_opf = REQ_PREFLUSH;
> > > + child->bi_iter.bi_sector = -1;
> > > + bio_chain(child, bio);
> > > + submit_bio(child);
> >
> > I understand how this works, but it's a bit too "magical" for my
> > taste. I would prefer that all flush implementations take an optional
> > 'bio' argument rather than rely on the make_request implementation to
> > stash the bio away on a driver specific list.
>
> I did this to make use of "bio_chain" for chaining child bio for async flush
> suggested [1]. Are you saying to remove this and just call "flush" based on
> bio argument? Or I implemented the 'bio_chain' request entirely wrong?
No, I think you implemented it correctly. I'm just asking for the
chaining to be performed internal to the ->flush() callback rather
than in the common nvdimm_flush() front-end.
> > > > This patch adds functionality to perform flush from guest
> > > > to host over VIRTIO. We are registering a callback based
> > > > on 'nd_region' type. virtio_pmem driver requires this special
> > > > flush function. For rest of the region types we are registering
> > > > existing flush function. Report error returned by host fsync
> > > > failure to userspace.
> > > >
> > > > This also handles asynchronous flush requests from the block layer
> > > > by creating a child bio and chaining it with parent bio.
> > > >
> > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > ---bio_chain Dan williams
> > > [..]
> > > > diff --git a/drivers/nvdimm/region_devs.c
> > > > b/drivers/nvdimm/region_devs.c
> > > > index b4ef7d9ff22e..fb1041ab32a6 100644
> > > > --- a/drivers/nvdimm/region_devs.c
> > > > +++ b/drivers/nvdimm/region_devs.c
> > > > @@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev,
> > > > struct device_attribute *att
> > > > return rc;
> > > > if (!flush)
> > > > return -EINVAL;
> > > > - nvdimm_flush(nd_region);
> > > > + rc = nvdimm_flush(nd_region, NULL, false);
> > > > + if (rc)
> > > > + return rc;
> > > >
> > > > return len;
> > > > }
> > > > @@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct
> > > > nvdimm_bus *nvdimm_bus,
> > > > dev->of_node = ndr_desc->of_node;
> > > > nd_region->ndr_size = resource_size(ndr_desc->res);
> > > > nd_region->ndr_start = ndr_desc->res->start;
> > > > + if (ndr_desc->flush)
> > > > + nd_region->flush = ndr_desc->flush;
> > > > + else
> > > > + nd_region->flush = generic_nvdimm_flush;
> > > > +
> > > > nd_device_register(dev);
> > > >
> > > > return nd_region;
> > > > @@ -1125,11 +1132,36 @@ struct nd_region
> > > > *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
> > > > }
> > > > EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
> > > >
> > > > +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool
> > > > async)
> > > > +{
> > >
> > > I don't quite see the point of the 'async' argument. All the usages of
> > > this routine are either
> > >
> > > nvdimm_flush(nd_region, bio, true)
> > > ...or:
> > > nvdimm_flush(nd_region, NULL, false)
> >
> > Agree.
> >
> > >
> > > ...so why not gate async behavior on the presence of the 'bio' argument?
> >
> > Sure.
> >
> > >
> > >
> > > > + int rc = 0;
> > > > +
> > > > + /* Create child bio for asynchronous flush and chain with
> > > > + * parent bio. Otherwise directly call nd_region flush.
> > > > + */
> > > > + if (async && bio->bi_iter.bi_sector != -1) {
> > > > +
> > > > + struct bio *child = bio_alloc(GFP_ATOMIC, 0);
> > > > +
> > > > + if (!child)
> > > > + return -ENOMEM;
> > > > + bio_copy_dev(child, bio);
> > > > + child->bi_opf = REQ_PREFLUSH;
> > > > + child->bi_iter.bi_sector = -1;
> > > > + bio_chain(child, bio);
> > > > + submit_bio(child);
> > >
> > > I understand how this works, but it's a bit too "magical" for my
> > > taste. I would prefer that all flush implementations take an optional
> > > 'bio' argument rather than rely on the make_request implementation to
> > > stash the bio away on a driver specific list.
> >
> > I did this to make use of "bio_chain" for chaining child bio for async
> > flush
> > suggested [1]. Are you saying to remove this and just call "flush" based on
> > bio argument? Or I implemented the 'bio_chain' request entirely wrong?
>
> No, I think you implemented it correctly. I'm just asking for the
> chaining to be performed internal to the ->flush() callback rather
> than in the common nvdimm_flush() front-end.
Sure. Perfect!
Thank you very much for all the suggestions.
Best regards,
Pankaj
>
On Thu 11-04-19 07:51:48, Dan Williams wrote:
> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <[email protected]> wrote:
> > + } else {
> > + if (nd_region->flush(nd_region))
> > + rc = -EIO;
>
> Given the common case wants to be fast and synchronous I think we
> should try to avoid retpoline overhead by default. So something like
> this:
>
> if (nd_region->flush == generic_nvdimm_flush)
> rc = generic_nvdimm_flush(...);
I'd either add a comment about avoiding retpoline overhead here or just
make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
get confused by the code.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> On Thu 11-04-19 07:51:48, Dan Williams wrote:
>> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <[email protected]> wrote:
>> > + } else {
>> > + if (nd_region->flush(nd_region))
>> > + rc = -EIO;
>>
>> Given the common case wants to be fast and synchronous I think we
>> should try to avoid retpoline overhead by default. So something like
>> this:
>>
>> if (nd_region->flush == generic_nvdimm_flush)
>> rc = generic_nvdimm_flush(...);
>
> I'd either add a comment about avoiding retpoline overhead here or just
> make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> get confused by the code.
Isn't this premature optimization? I really don't like adding things
like this without some numbers to show it's worth it.
-Jeff
Hello,
Thank you for the suggestions on this.
>
> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <[email protected]> wrote:
> >> > + } else {
> >> > + if (nd_region->flush(nd_region))
> >> > + rc = -EIO;
> >>
> >> Given the common case wants to be fast and synchronous I think we
> >> should try to avoid retpoline overhead by default. So something like
> >> this:
> >>
> >> if (nd_region->flush == generic_nvdimm_flush)
> >> rc = generic_nvdimm_flush(...);
> >
> > I'd either add a comment about avoiding retpoline overhead here or just
> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > get confused by the code.
>
> Isn't this premature optimization? I really don't like adding things
> like this without some numbers to show it's worth it.
Can we add the optimization after this version is merged.
Best regards,
Pankaj
>
> -Jeff
>
>
On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer <[email protected]> wrote:
>
> Jan Kara <[email protected]> writes:
>
> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <[email protected]> wrote:
> >> > + } else {
> >> > + if (nd_region->flush(nd_region))
> >> > + rc = -EIO;
> >>
> >> Given the common case wants to be fast and synchronous I think we
> >> should try to avoid retpoline overhead by default. So something like
> >> this:
> >>
> >> if (nd_region->flush == generic_nvdimm_flush)
> >> rc = generic_nvdimm_flush(...);
> >
> > I'd either add a comment about avoiding retpoline overhead here or just
> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > get confused by the code.
>
> Isn't this premature optimization? I really don't like adding things
> like this without some numbers to show it's worth it.
I don't think it's premature given this optimization technique is
already being deployed elsewhere, see:
https://lwn.net/Articles/774347/
Dan Williams <[email protected]> writes:
> On Fri, Apr 12, 2019 at 6:12 AM Jeff Moyer <[email protected]> wrote:
>>
>> Jan Kara <[email protected]> writes:
>>
>> > On Thu 11-04-19 07:51:48, Dan Williams wrote:
>> >> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta <[email protected]> wrote:
>> >> > + } else {
>> >> > + if (nd_region->flush(nd_region))
>> >> > + rc = -EIO;
>> >>
>> >> Given the common case wants to be fast and synchronous I think we
>> >> should try to avoid retpoline overhead by default. So something like
>> >> this:
>> >>
>> >> if (nd_region->flush == generic_nvdimm_flush)
>> >> rc = generic_nvdimm_flush(...);
>> >
>> > I'd either add a comment about avoiding retpoline overhead here or just
>> > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> > get confused by the code.
>>
>> Isn't this premature optimization? I really don't like adding things
>> like this without some numbers to show it's worth it.
>
> I don't think it's premature given this optimization technique is
> already being deployed elsewhere, see:
>
> https://lwn.net/Articles/774347/
The technique is fine, but that doesn't mean it should be applied
everywhere. Is *this* code path really going to benefit from the
optimization?
-Jeff
On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> > > I'd either add a comment about avoiding retpoline overhead here or just
> > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > > get confused by the code.
> >
> > Isn't this premature optimization? I really don't like adding things
> > like this without some numbers to show it's worth it.
>
> I don't think it's premature given this optimization technique is
> already being deployed elsewhere, see:
>
> https://lwn.net/Articles/774347/
For one this one was backed by numbers, and second after feedback
from Linux we switched to the NULL pointer check instead.
On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> > > > I'd either add a comment about avoiding retpoline overhead here or just
> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> > > > get confused by the code.
> > >
> > > Isn't this premature optimization? I really don't like adding things
> > > like this without some numbers to show it's worth it.
> >
> > I don't think it's premature given this optimization technique is
> > already being deployed elsewhere, see:
> >
> > https://lwn.net/Articles/774347/
>
> For one this one was backed by numbers, and second after feedback
> from Linux we switched to the NULL pointer check instead.
Ok I should have noticed the switch to NULL pointer check. However,
the question still stands do we want everyone to run numbers to
justify this optimization, or make it a new common kernel coding
practice to do:
if (!object->op)
generic_op(object);
else
object->op(object);
...in hot paths? I agree with not doing premature optimization in
principle, but this hack is minimally intrusive from a readability
perspective similar to likely()/unlikely() usage which also don't come
with numbers on a per patch basis.
Dan Williams <[email protected]> writes:
> On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <[email protected]> wrote:
>>
>> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
>> > > > I'd either add a comment about avoiding retpoline overhead here or just
>> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> > > > get confused by the code.
>> > >
>> > > Isn't this premature optimization? I really don't like adding things
>> > > like this without some numbers to show it's worth it.
>> >
>> > I don't think it's premature given this optimization technique is
>> > already being deployed elsewhere, see:
>> >
>> > https://lwn.net/Articles/774347/
>>
>> For one this one was backed by numbers, and second after feedback
>> from Linux we switched to the NULL pointer check instead.
>
> Ok I should have noticed the switch to NULL pointer check. However,
> the question still stands do we want everyone to run numbers to
> justify this optimization, or make it a new common kernel coding
> practice to do:
>
> if (!object->op)
> generic_op(object);
> else
> object->op(object);
>
> ...in hot paths?
I don't think nvdimm_flush is a hot path. Numbers of some
representative workload would prove one of us right.
-Jeff
On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer <[email protected]> wrote:
>
> Dan Williams <[email protected]> writes:
>
> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <[email protected]> wrote:
> >>
> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> >> > > > I'd either add a comment about avoiding retpoline overhead here or just
> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
> >> > > > get confused by the code.
> >> > >
> >> > > Isn't this premature optimization? I really don't like adding things
> >> > > like this without some numbers to show it's worth it.
> >> >
> >> > I don't think it's premature given this optimization technique is
> >> > already being deployed elsewhere, see:
> >> >
> >> > https://lwn.net/Articles/774347/
> >>
> >> For one this one was backed by numbers, and second after feedback
> >> from Linux we switched to the NULL pointer check instead.
> >
> > Ok I should have noticed the switch to NULL pointer check. However,
> > the question still stands do we want everyone to run numbers to
> > justify this optimization, or make it a new common kernel coding
> > practice to do:
> >
> > if (!object->op)
> > generic_op(object);
> > else
> > object->op(object);
> >
> > ...in hot paths?
>
> I don't think nvdimm_flush is a hot path. Numbers of some
> representative workload would prove one of us right.
I'd rather say that the if "if (!op) do_generic()" pattern is more
readable in the general case, saves grepping for who set the op in the
common case. The fact that it has the potential to be faster is gravy
at that point.
Dan Williams <[email protected]> writes:
> On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer <[email protected]> wrote:
>>
>> Dan Williams <[email protected]> writes:
>>
>> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <[email protected]> wrote:
>> >>
>> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
>> >> > > > I'd either add a comment about avoiding retpoline overhead here or just
>> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
>> >> > > > get confused by the code.
>> >> > >
>> >> > > Isn't this premature optimization? I really don't like adding things
>> >> > > like this without some numbers to show it's worth it.
>> >> >
>> >> > I don't think it's premature given this optimization technique is
>> >> > already being deployed elsewhere, see:
>> >> >
>> >> > https://lwn.net/Articles/774347/
>> >>
>> >> For one this one was backed by numbers, and second after feedback
>> >> from Linux we switched to the NULL pointer check instead.
>> >
>> > Ok I should have noticed the switch to NULL pointer check. However,
>> > the question still stands do we want everyone to run numbers to
>> > justify this optimization, or make it a new common kernel coding
>> > practice to do:
>> >
>> > if (!object->op)
>> > generic_op(object);
>> > else
>> > object->op(object);
>> >
>> > ...in hot paths?
>>
>> I don't think nvdimm_flush is a hot path. Numbers of some
>> representative workload would prove one of us right.
>
> I'd rather say that the if "if (!op) do_generic()" pattern is more
> readable in the general case, saves grepping for who set the op in the
> common case. The fact that it has the potential to be faster is gravy
> at that point.
If the primary motivation is performance, then I'd expect performance
numbers to back it up. If that isn't the primary motivation, then
choose whichever way you feel is appropriate.
Cheers,
Jeff
>
> Dan Williams <[email protected]> writes:
>
> > On Mon, Apr 22, 2019 at 8:59 AM Jeff Moyer <[email protected]> wrote:
> >>
> >> Dan Williams <[email protected]> writes:
> >>
> >> > On Thu, Apr 18, 2019 at 9:18 AM Christoph Hellwig <[email protected]>
> >> > wrote:
> >> >>
> >> >> On Thu, Apr 18, 2019 at 09:05:05AM -0700, Dan Williams wrote:
> >> >> > > > I'd either add a comment about avoiding retpoline overhead here
> >> >> > > > or just
> >> >> > > > make ->flush == NULL mean generic_nvdimm_flush(). Just so that
> >> >> > > > people don't
> >> >> > > > get confused by the code.
> >> >> > >
> >> >> > > Isn't this premature optimization? I really don't like adding
> >> >> > > things
> >> >> > > like this without some numbers to show it's worth it.
> >> >> >
> >> >> > I don't think it's premature given this optimization technique is
> >> >> > already being deployed elsewhere, see:
> >> >> >
> >> >> > https://lwn.net/Articles/774347/
> >> >>
> >> >> For one this one was backed by numbers, and second after feedback
> >> >> from Linux we switched to the NULL pointer check instead.
> >> >
> >> > Ok I should have noticed the switch to NULL pointer check. However,
> >> > the question still stands do we want everyone to run numbers to
> >> > justify this optimization, or make it a new common kernel coding
> >> > practice to do:
> >> >
> >> > if (!object->op)
> >> > generic_op(object);
> >> > else
> >> > object->op(object);
> >> >
> >> > ...in hot paths?
> >>
> >> I don't think nvdimm_flush is a hot path. Numbers of some
> >> representative workload would prove one of us right.
> >
> > I'd rather say that the if "if (!op) do_generic()" pattern is more
> > readable in the general case, saves grepping for who set the op in the
> > common case. The fact that it has the potential to be faster is gravy
> > at that point.
>
> If the primary motivation is performance, then I'd expect performance
> numbers to back it up. If that isn't the primary motivation, then
> choose whichever way you feel is appropriate.
Agree. This change enhances the code readability. Will add this change in
v6 with other changes.
Thank you!
Pankaj
>
> Cheers,
> Jeff
>