2019-04-03 10:40:51

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v4 0/5] virtio pmem driver

This patch series has implementation for "virtio pmem".
"virtio pmem" is fake persistent memory(nvdimm) in guest
which allows to bypass the guest page cache. This also
implements a VIRTIO based asynchronous flush mechanism.

Sharing guest kernel driver in this patchset with the
changes suggested in v3. Tested with Qemu side device
emulation [6] for virtio-pmem.

We have incorporated all the suggestions in V3. Documented
the impact of possible page cache side channel attacks with
suggested countermeasures.

Details of project idea for 'virtio pmem' flushing interface
is shared [3] & [4].

Implementation is divided into two parts:
New virtio pmem guest driver and qemu code changes for new
virtio pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
---------------------------------
- Reads persistent memory range from paravirt device and
registers with 'nvdimm_bus'.
- 'nvdimm/pmem' driver uses this information to allocate
persistent memory region and setup filesystem operations
to the allocated memory.
- virtio pmem driver implements asynchronous flushing
interface to flush from guest to host.

2. Qemu virtio-pmem device
---------------------------------
- Creates virtio pmem device and exposes a memory range to
KVM guest.
- At host side this is file backed memory which acts as
persistent memory.
- Qemu side flush uses aio thread pool API's and virtio
for asynchronous guest multi request handling.

David Hildenbrand CCed also posted a modified version[7] of
qemu virtio-pmem code based on updated Qemu memory device API.

Virtio-pmem security implications and countermeasures:
-----------------------------------------------------

In previous posting of kernel driver, there was discussion [9]
on possible implications of page cache side channel attacks with
virtio pmem. After thorough analysis of details of known side
channel attacks, below are the suggestions:

- Depends entirely on how host backing image file is mapped
into guest address space.

- virtio-pmem device emulation, by default shared mapping is used
to map host backing file. It is recommended to use separate
backing file at host side for every guest. This will prevent
any possibility of executing common code from multiple guests
and any chance of inferring guest local data based based on
execution time.

- If backing file is required to be shared among multiple guests
it is recommended to don't support host page cache eviction
commands from the guest driver. This will avoid any possibility
of inferring guest local data or host data from another guest.

- Proposed device specification [8] for virtio-pmem device with
details of possible security implications and suggested
countermeasures for device emulation.

Virtio-pmem errors handling:
----------------------------------------
Checked behaviour of virtio-pmem for below types of errors
Need suggestions on expected behaviour for handling these errors?

- Hardware Errors: Uncorrectable recoverable Errors:
a] virtio-pmem:
- As per current logic if error page belongs to Qemu process,
host MCE handler isolates(hwpoison) that page and send SIGBUS.
Qemu SIGBUS handler injects exception to KVM guest.
- KVM guest then isolates the page and send SIGBUS to guest
userspace process which has mapped the page.

b] Existing implementation for ACPI pmem driver:
- Handles such errors with MCE notifier and creates a list
of bad blocks. Read/direct access DAX operation return EIO
if accessed memory page fall in bad block list.
- It also starts backgound scrubbing.
- Similar functionality can be reused in virtio-pmem with MCE
notifier but without scrubbing(no ACPI/ARS)? Need inputs to
confirm if this behaviour is ok or needs any change?

Changes from PATCH v3: [1]

- Use generic dax_synchronous() helper to check for DAXDEV_SYNC
flag - [Dan, Darrick, Jan]
- Add 'is_nvdimm_async' function
- Document page cache side channel attacks implications &
countermeasures - [Dave Chinner, Michael]

Changes from PATCH v2: [2]
- Disable MAP_SYNC for ext4 & XFS filesystems - [Dan]
- Use name 'virtio pmem' in place of 'fake dax'

Changes from PATCH v1:
- 0-day build test for build dependency on libnvdimm

Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don’t move declarations to common global header e.g nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Remove BSD license text with SPDX license text

- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow

Changes from RFC v3
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush- Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak, correct sizeof req- Stefan
- Move declaration to virtio_pmem.c

Changes from RFC v2:
- Add flush function in the nd_region in place of switching
on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan

Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent
memory and other operations instead of creating an entirely
new block driver.
- Use VIRTIO driver to register memory information with
nvdimm_bus and create region_type accordingly.
- Call VIRTIO flush from existing pmem driver.

Pankaj Gupta (5):
libnvdimm: nd_region flush callback support
virtio-pmem: Add virtio-pmem guest driver
libnvdimm: add nd_region buffered dax_dev flag
ext4: disable map_sync for virtio pmem
xfs: disable map_sync for virtio pmem

[1] https://lkml.org/lkml/2019/1/9/471
[2] https://lkml.org/lkml/2018/10/13/117
[3] https://www.spinics.net/lists/kvm/msg149761.html
[4] https://www.spinics.net/lists/kvm/msg153095.html
[5] https://lkml.org/lkml/2018/8/31/413
[6] https://marc.info/?l=linux-kernel&m=153572228719237&w=2
[7] https://marc.info/?l=qemu-devel&m=153555721901824&w=2
[8] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html


2019-04-03 10:41:15

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v4 1/5] ibnvdimm: nd_region flush callback support

This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.

This also handles asynchronous flush requests from the block layer
by creating a child bio and chaining it with parent bio.

Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/acpi/nfit/core.c | 4 ++--
drivers/nvdimm/claim.c | 6 ++++--
drivers/nvdimm/nd.h | 1 +
drivers/nvdimm/pmem.c | 14 ++++++++-----
drivers/nvdimm/region_devs.c | 38 ++++++++++++++++++++++++++++++++++--
include/linux/libnvdimm.h | 8 +++++++-
6 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 5a389a4f4f65..567017a2190e 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
offset = to_interleave_offset(offset, mmio);

writeq(cmd, mmio->addr.base + offset);
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL, false);

if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
readq(mmio->addr.base + offset);
@@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}

if (rw)
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL, false);

rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
return rc;
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf469c7..a1dfa066786b 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
sector_t sector = offset >> 9;
- int rc = 0;
+ int rc = 0, ret = 0;

if (unlikely(!size))
return 0;
@@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
}

memcpy_flushcache(nsio->addr + offset, buf, size);
- nvdimm_flush(to_nd_region(ndns->dev.parent));
+ ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL, false);
+ if (ret)
+ rc = ret;

return rc;
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index a5ac3b240293..916cd6c5451a 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -159,6 +159,7 @@ struct nd_region {
struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
+ int (*flush)(struct nd_region *nd_region);
struct nd_mapping mapping[0];
};

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..5a5b3ea4d073 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,

static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
{
+ int ret = 0;
blk_status_t rc = 0;
bool do_acct;
unsigned long start;
@@ -201,7 +202,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
struct nd_region *nd_region = to_region(pmem);

if (bio->bi_opf & REQ_PREFLUSH)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio, true);

do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
nd_iostat_end(bio, start);

if (bio->bi_opf & REQ_FUA)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio, true);
+
+ if (ret)
+ bio->bi_status = errno_to_blk_status(ret);

bio_endio(bio);
return BLK_QC_T_NONE;
@@ -463,13 +467,13 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;

dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
-
gendev = disk_to_dev(disk);
gendev->groups = pmem_attribute_groups;

@@ -527,14 +531,14 @@ static int nd_pmem_remove(struct device *dev)
sysfs_put(pmem->bb_state);
pmem->bb_state = NULL;
}
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL, false);

return 0;
}

static void nd_pmem_shutdown(struct device *dev)
{
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL, false);
}

static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b4ef7d9ff22e..fb1041ab32a6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -295,7 +295,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
return rc;
if (!flush)
return -EINVAL;
- nvdimm_flush(nd_region);
+ rc = nvdimm_flush(nd_region, NULL, false);
+ if (rc)
+ return rc;

return len;
}
@@ -1085,6 +1087,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
+ if (ndr_desc->flush)
+ nd_region->flush = ndr_desc->flush;
+ else
+ nd_region->flush = generic_nvdimm_flush;
+
nd_device_register(dev);

return nd_region;
@@ -1125,11 +1132,36 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
}
EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);

+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async)
+{
+ int rc = 0;
+
+ /* Create child bio for asynchronous flush and chain with
+ * parent bio. Otherwise directly call nd_region flush.
+ */
+ if (async && bio->bi_iter.bi_sector != -1) {
+
+ struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+ if (!child)
+ return -ENOMEM;
+ bio_copy_dev(child, bio);
+ child->bi_opf = REQ_PREFLUSH;
+ child->bi_iter.bi_sector = -1;
+ bio_chain(child, bio);
+ submit_bio(child);
+ } else {
+ if (nd_region->flush(nd_region))
+ rc = -EIO;
+ }
+
+ return rc;
+}
/**
* nvdimm_flush - flush any posted write queues between the cpu and pmem media
* @nd_region: blk or interleaved pmem region
*/
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
{
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i, idx;
@@ -1153,6 +1185,8 @@ void nvdimm_flush(struct nd_region *nd_region)
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
wmb();
+
+ return 0;
}
EXPORT_SYMBOL_GPL(nvdimm_flush);

diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index feb342d026f2..d9d2ab8a6e64 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -65,6 +65,9 @@ enum {
*/
ND_REGION_PERSIST_MEMCTRL = 2,

+ /* Platform provides asynchronous flush mechanism */
+ ND_REGION_ASYNC = 3,
+
/* mark newly adjusted resources as requiring a label update */
DPA_RESOURCE_ADJUSTED = 1 << 0,
};
@@ -121,6 +124,7 @@ struct nd_mapping_desc {
int position;
};

+struct nd_region;
struct nd_region_desc {
struct resource *res;
struct nd_mapping_desc *mapping;
@@ -133,6 +137,7 @@ struct nd_region_desc {
int target_node;
unsigned long flags;
struct device_node *of_node;
+ int (*flush)(struct nd_region *nd_region);
};

struct device;
@@ -260,7 +265,8 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio, bool async);
+int generic_nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
int nvdimm_in_overwrite(struct nvdimm *nvdimm);
--
2.20.1


2019-04-03 10:41:32

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver

This patch adds virtio-pmem driver for KVM guest.

Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.

This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.

Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/nvdimm/virtio_pmem.c | 84 +++++++++++++++++++++
drivers/virtio/Kconfig | 10 +++
drivers/virtio/Makefile | 1 +
drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++
include/linux/virtio_pmem.h | 60 +++++++++++++++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_pmem.h | 10 +++
7 files changed, 291 insertions(+)
create mode 100644 drivers/nvdimm/virtio_pmem.c
create mode 100644 drivers/virtio/pmem.c
create mode 100644 include/linux/virtio_pmem.h
create mode 100644 include/uapi/linux/virtio_pmem.h

diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 000000000000..2a1b1ba2c1ff
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio_pmem.h>
+#include "nd.h"
+
+ /* The interrupt handler */
+void host_ack(struct virtqueue *vq)
+{
+ unsigned int len;
+ unsigned long flags;
+ struct virtio_pmem_request *req, *req_buf;
+ struct virtio_pmem *vpmem = vq->vdev->priv;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+ req->done = true;
+ wake_up(&req->host_acked);
+
+ if (!list_empty(&vpmem->req_list)) {
+ req_buf = list_first_entry(&vpmem->req_list,
+ struct virtio_pmem_request, list);
+ list_del(&vpmem->req_list);
+ req_buf->wq_buf_avail = true;
+ wake_up(&req_buf->wq_buf);
+ }
+ }
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+EXPORT_SYMBOL_GPL(host_ack);
+
+ /* The request submission function */
+int virtio_pmem_flush(struct nd_region *nd_region)
+{
+ int err;
+ unsigned long flags;
+ struct scatterlist *sgs[2], sg, ret;
+ struct virtio_device *vdev = nd_region->provider_data;
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct virtio_pmem_request *req;
+
+ might_sleep();
+ req = kmalloc(sizeof(*req), GFP_KERNEL);
+ if (!req)
+ return -ENOMEM;
+
+ req->done = req->wq_buf_avail = false;
+ strcpy(req->name, "FLUSH");
+ init_waitqueue_head(&req->host_acked);
+ init_waitqueue_head(&req->wq_buf);
+ sg_init_one(&sg, req->name, strlen(req->name));
+ sgs[0] = &sg;
+ sg_init_one(&ret, &req->ret, sizeof(req->ret));
+ sgs[1] = &ret;
+
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+ if (err) {
+ dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+ list_add_tail(&vpmem->req_list, &req->list);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->wq_buf, req->wq_buf_avail);
+ spin_lock_irqsave(&vpmem->pmem_lock, flags);
+ }
+ virtqueue_kick(vpmem->req_vq);
+ spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+ /* When host has read buffer, this completes via host_ack */
+ wait_event(req->host_acked, req->done);
+ err = req->ret;
+ kfree(req);
+
+ return err;
+};
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 35897649c24f..9f634a2ed638 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY

If unsure, say Y.

+config VIRTIO_PMEM
+ tristate "Support for virtio pmem driver"
+ depends on VIRTIO
+ depends on LIBNVDIMM
+ help
+ This driver provides support for virtio based flushing interface
+ for persistent memory range.
+
+ If unsure, say M.
+
config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5dcf46..143ce91eabe9 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
new file mode 100644
index 000000000000..52f74064f67e
--- /dev/null
+++ b/drivers/virtio/pmem.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */
+#include <linux/virtio_pmem.h>
+#include <../../drivers/nvdimm/nd.h>
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+ struct virtqueue *vq;
+
+ /* single vq */
+ vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+ host_ack, "flush_queue");
+ if (IS_ERR(vq))
+ return PTR_ERR(vq);
+
+ spin_lock_init(&vpmem->pmem_lock);
+ INIT_LIST_HEAD(&vpmem->req_list);
+
+ return 0;
+};
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+ int err = 0;
+ struct resource res;
+ struct virtio_pmem *vpmem;
+ struct nvdimm_bus *nvdimm_bus;
+ struct nd_region_desc ndr_desc;
+ int nid = dev_to_node(&vdev->dev);
+ struct nd_region *nd_region;
+
+ if (!vdev->config->get) {
+ dev_err(&vdev->dev, "%s failure: config disabled\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+ GFP_KERNEL);
+ if (!vpmem) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+
+ vpmem->vdev = vdev;
+ err = init_vq(vpmem);
+ if (err)
+ goto out_err;
+
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ start, &vpmem->start);
+ virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+ size, &vpmem->size);
+
+ res.start = vpmem->start;
+ res.end = vpmem->start + vpmem->size-1;
+ vpmem->nd_desc.provider_name = "virtio-pmem";
+ vpmem->nd_desc.module = THIS_MODULE;
+
+ vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+ &vpmem->nd_desc);
+ if (!nvdimm_bus)
+ goto out_vq;
+
+ dev_set_drvdata(&vdev->dev, nvdimm_bus);
+ memset(&ndr_desc, 0, sizeof(ndr_desc));
+
+ ndr_desc.res = &res;
+ ndr_desc.numa_node = nid;
+ ndr_desc.flush = virtio_pmem_flush;
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+ set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+ nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+ nd_region->provider_data = dev_to_virtio
+ (nd_region->dev.parent->parent);
+
+ if (!nd_region)
+ goto out_nd;
+
+ //virtio_device_ready(vdev);
+ return 0;
+out_nd:
+ err = -ENXIO;
+ nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+ vdev->config->del_vqs(vdev);
+out_err:
+ dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+ return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+ struct virtio_pmem *vpmem = vdev->priv;
+ struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+ nvdimm_bus_unregister(nvdimm_bus);
+ vdev->config->del_vqs(vdev);
+ kfree(vpmem);
+}
+
+static struct virtio_driver virtio_pmem_driver = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = virtio_pmem_probe,
+ .remove = virtio_pmem_remove,
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
new file mode 100644
index 000000000000..224f9d934be6
--- /dev/null
+++ b/include/linux/virtio_pmem.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * virtio_pmem.h: virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ **/
+
+#ifndef _LINUX_VIRTIO_PMEM_H
+#define _LINUX_VIRTIO_PMEM_H
+
+#include <linux/virtio_ids.h>
+#include <linux/module.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/libnvdimm.h>
+#include <linux/spinlock.h>
+
+struct virtio_pmem_request {
+ /* Host return status corresponding to flush request */
+ int ret;
+
+ /* command name*/
+ char name[16];
+
+ /* Wait queue to process deferred work after ack from host */
+ wait_queue_head_t host_acked;
+ bool done;
+
+ /* Wait queue to process deferred work after virt queue buffer avail */
+ wait_queue_head_t wq_buf;
+ bool wq_buf_avail;
+ struct list_head list;
+};
+
+struct virtio_pmem {
+ struct virtio_device *vdev;
+
+ /* Virtio pmem request queue */
+ struct virtqueue *req_vq;
+
+ /* nvdimm bus registers virtio pmem device */
+ struct nvdimm_bus *nvdimm_bus;
+ struct nvdimm_bus_descriptor nd_desc;
+
+ /* List to store deferred work if virtqueue is full */
+ struct list_head req_list;
+
+ /* Synchronize virtqueue data */
+ spinlock_t pmem_lock;
+
+ /* Memory region information */
+ uint64_t start;
+ uint64_t size;
+};
+
+void host_ack(struct virtqueue *vq);
+int virtio_pmem_flush(struct nd_region *nd_region);
+#endif
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..346389565ac1 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_PMEM 25 /* virtio pmem */

#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 000000000000..fa3f7d52717a
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+ __le64 start;
+ __le64 size;
+};
+#endif
--
2.20.1


2019-04-03 10:41:56

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v4 3/5] libnvdimm: add dax_dev sync flag

This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <[email protected]>
---
drivers/dax/bus.c | 2 +-
drivers/dax/super.c | 13 ++++++++++++-
drivers/md/dm.c | 2 +-
drivers/nvdimm/pmem.c | 3 ++-
drivers/nvdimm/region_devs.c | 7 +++++++
include/linux/dax.h | 9 +++++++--
include/linux/libnvdimm.h | 1 +
7 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe80219..431bf7d2a7f9 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -388,7 +388,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
* No 'host' or dax_operations since there is no access to this
* device outside of mmap of the resulting character device.
*/
- dax_dev = alloc_dax(dev_dax, NULL, NULL);
+ dax_dev = alloc_dax(dev_dax, NULL, NULL, true);
if (!dax_dev)
goto err;

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0a339b85133e..bd6509308d05 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -186,6 +186,8 @@ enum dax_device_flags {
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+ /* flag to check if device supports synchronous flush */
+ DAXDEV_SYNC,
};

/**
@@ -354,6 +356,12 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(dax_write_cache_enabled);

+bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(dax_synchronous);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
@@ -511,7 +519,7 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host)
}

struct dax_device *alloc_dax(void *private, const char *__host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, bool sync)
{
struct dax_device *dax_dev;
const char *host;
@@ -534,6 +542,9 @@ struct dax_device *alloc_dax(void *private, const char *__host,
dax_add_host(dax_dev, host);
dax_dev->ops = ops;
dax_dev->private = private;
+ if (sync)
+ set_bit(DAXDEV_SYNC, &dax_dev->flags);
+
return dax_dev;

err_dev:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 68d24056d0b1..534e12ca6329 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1965,7 +1965,7 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);

if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
- dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops, true);
if (!dax_dev)
goto bad;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5a5b3ea4d073..78f71ba0e7cf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -466,7 +466,8 @@ static int pmem_attach_disk(struct device *dev,
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;

- dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops,
+ is_nvdimm_sync(nd_region));

if (!dax_dev) {
put_disk(disk);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index fb1041ab32a6..8c7aa047fe2b 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1231,6 +1231,13 @@ int nvdimm_has_cache(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nvdimm_has_cache);

+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev) &&
+ !test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
struct conflict_context {
struct nd_region *nd_region;
resource_size_t start, size;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..9bdd50d06ef6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,18 +32,19 @@ extern struct attribute_group dax_attribute_group;
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *dax_get_by_host(const char *host);
struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops);
+ const struct dax_operations *ops, bool sync);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
+bool dax_synchronous(struct dax_device *dax_dev);
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
return NULL;
}
static inline struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, bool sync)
{
/*
* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
@@ -64,6 +65,10 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev)
{
return false;
}
+static inline bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return false;
+}
#endif

struct writeback_control;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index d9d2ab8a6e64..9a8aea370cbc 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -270,6 +270,7 @@ int generic_nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
int nvdimm_in_overwrite(struct nvdimm *nvdimm);
+bool is_nvdimm_sync(struct nd_region *nd_region);

static inline int nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
unsigned int buf_len, int *cmd_rc)
--
2.20.1


2019-04-03 10:42:15

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v4 4/5] ext4: disable map_sync for async flush

Virtio pmem provides asynchronous host page cache flush
mechanism. We don't support 'MAP_SYNC' with virtio pmem
and ext4.

Signed-off-by: Pankaj Gupta <[email protected]>
---
fs/ext4/file.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 69d65d49837b..86e4bf464320 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -360,8 +360,10 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct dax_device *dax_dev = sbi->s_daxdev;

- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;

/*
@@ -371,6 +373,13 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
return -EOPNOTSUPP;

+ /* We don't support synchronous mappings with DAX files if
+ * dax_device is not synchronous.
+ */
+ if (IS_DAX(file_inode(file)) && !dax_synchronous(dax_dev)
+ && (vma->vm_flags & VM_SYNC))
+ return -EOPNOTSUPP;
+
file_accessed(file);
if (IS_DAX(file_inode(file))) {
vma->vm_ops = &ext4_dax_vm_ops;
--
2.20.1


2019-04-03 10:42:46

by Pankaj Gupta

[permalink] [raw]
Subject: [PATCH v4 5/5] xfs: disable map_sync for async flush

Virtio pmem provides asynchronous host page cache flush
mechanism. we don't support 'MAP_SYNC' with virtio pmem
and xfs.

Signed-off-by: Pankaj Gupta <[email protected]>
---
fs/xfs/xfs_file.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1f2e2845eb76..dced2eb8c91a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,14 @@ xfs_file_mmap(
if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
return -EOPNOTSUPP;

+ /* We don't support synchronous mappings with DAX files if
+ * dax_device is not synchronous.
+ */
+ if (IS_DAX(file_inode(filp)) && !dax_synchronous(
+ xfs_find_daxdev_for_inode(file_inode(filp))) &&
+ (vma->vm_flags & VM_SYNC))
+ return -EOPNOTSUPP;
+
file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(file_inode(filp)))
--
2.20.1


2019-04-03 11:31:01

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] ext4: disable map_sync for async flush

On Wed 03-04-19 16:10:17, Pankaj Gupta wrote:
> Virtio pmem provides asynchronous host page cache flush
> mechanism. We don't support 'MAP_SYNC' with virtio pmem
> and ext4.
>
> Signed-off-by: Pankaj Gupta <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/file.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 69d65d49837b..86e4bf464320 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -360,8 +360,10 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
> static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct inode *inode = file->f_mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + struct dax_device *dax_dev = sbi->s_daxdev;
>
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> + if (unlikely(ext4_forced_shutdown(sbi)))
> return -EIO;
>
> /*
> @@ -371,6 +373,13 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
> if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
> return -EOPNOTSUPP;
>
> + /* We don't support synchronous mappings with DAX files if
> + * dax_device is not synchronous.
> + */
> + if (IS_DAX(file_inode(file)) && !dax_synchronous(dax_dev)
> + && (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
> file_accessed(file);
> if (IS_DAX(file_inode(file))) {
> vma->vm_ops = &ext4_dax_vm_ops;
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-03 11:44:42

by Yuval Shaia

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver

On Wed, Apr 03, 2019 at 04:10:15PM +0530, Pankaj Gupta wrote:
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> drivers/nvdimm/virtio_pmem.c | 84 +++++++++++++++++++++
> drivers/virtio/Kconfig | 10 +++
> drivers/virtio/Makefile | 1 +
> drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++
> include/linux/virtio_pmem.h | 60 +++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_pmem.h | 10 +++
> 7 files changed, 291 insertions(+)
> create mode 100644 drivers/nvdimm/virtio_pmem.c
> create mode 100644 drivers/virtio/pmem.c
> create mode 100644 include/linux/virtio_pmem.h
> create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> new file mode 100644
> index 000000000000..2a1b1ba2c1ff
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0

Is this comment stile (//) acceptable?

> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio_pmem.h>
> +#include "nd.h"
> +
> + /* The interrupt handler */
> +void host_ack(struct virtqueue *vq)
> +{
> + unsigned int len;
> + unsigned long flags;
> + struct virtio_pmem_request *req, *req_buf;
> + struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> + req->done = true;
> + wake_up(&req->host_acked);
> +
> + if (!list_empty(&vpmem->req_list)) {
> + req_buf = list_first_entry(&vpmem->req_list,
> + struct virtio_pmem_request, list);
> + list_del(&vpmem->req_list);
> + req_buf->wq_buf_avail = true;
> + wake_up(&req_buf->wq_buf);
> + }
> + }
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(host_ack);
> +
> + /* The request submission function */
> +int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> + int err;
> + unsigned long flags;
> + struct scatterlist *sgs[2], sg, ret;
> + struct virtio_device *vdev = nd_region->provider_data;
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct virtio_pmem_request *req;
> +
> + might_sleep();

[1]

> + req = kmalloc(sizeof(*req), GFP_KERNEL);
> + if (!req)
> + return -ENOMEM;
> +
> + req->done = req->wq_buf_avail = false;
> + strcpy(req->name, "FLUSH");
> + init_waitqueue_head(&req->host_acked);
> + init_waitqueue_head(&req->wq_buf);
> + sg_init_one(&sg, req->name, strlen(req->name));
> + sgs[0] = &sg;
> + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> + sgs[1] = &ret;
> +
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);

Is it okay to use GFP_ATOMIC in a might-sleep ([1]) function?

> + if (err) {
> + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> + list_add_tail(&vpmem->req_list, &req->list);
> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->wq_buf, req->wq_buf_avail);
> + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> + }
> + virtqueue_kick(vpmem->req_vq);

You probably want to check return value here.

> + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> + /* When host has read buffer, this completes via host_ack */
> + wait_event(req->host_acked, req->done);
> + err = req->ret;
> + kfree(req);
> +
> + return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 35897649c24f..9f634a2ed638 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
>
> If unsure, say Y.
>
> +config VIRTIO_PMEM
> + tristate "Support for virtio pmem driver"
> + depends on VIRTIO
> + depends on LIBNVDIMM
> + help
> + This driver provides support for virtio based flushing interface
> + for persistent memory range.
> +
> + If unsure, say M.
> +
> config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5dcf46..143ce91eabe9 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> new file mode 100644
> index 000000000000..52f74064f67e
> --- /dev/null
> +++ b/drivers/virtio/pmem.c
> @@ -0,0 +1,125 @@
> +// SPDX-License-Identifier: GPL-2.0

Ditto

> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and registers the virtual pmem device
> + * with libnvdimm core.
> + */
> +#include <linux/virtio_pmem.h>
> +#include <../../drivers/nvdimm/nd.h>

Should this file be moved to include/ directory?

> +
> +static struct virtio_device_id id_table[] = {
> + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> + { 0 },
> +};
> +
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> + struct virtqueue *vq;
> +
> + /* single vq */
> + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> + host_ack, "flush_queue");
> + if (IS_ERR(vq))
> + return PTR_ERR(vq);
> +
> + spin_lock_init(&vpmem->pmem_lock);
> + INIT_LIST_HEAD(&vpmem->req_list);
> +
> + return 0;
> +};
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> + int err = 0;
> + struct resource res;
> + struct virtio_pmem *vpmem;
> + struct nvdimm_bus *nvdimm_bus;
> + struct nd_region_desc ndr_desc;
> + int nid = dev_to_node(&vdev->dev);
> + struct nd_region *nd_region;
> +
> + if (!vdev->config->get) {
> + dev_err(&vdev->dev, "%s failure: config disabled\n",
> + __func__);
> + return -EINVAL;
> + }
> +
> + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> + GFP_KERNEL);

Suggesting to indent it right so it will be under &vdev

> + if (!vpmem) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + vpmem->vdev = vdev;
> + err = init_vq(vpmem);
> + if (err)
> + goto out_err;

No need to free vpmem here?

> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> +
> + res.start = vpmem->start;
> + res.end = vpmem->start + vpmem->size-1;
> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> + &vpmem->nd_desc);
> + if (!nvdimm_bus)
> + goto out_vq;

Ditto (i'm probably missing something here)

> +
> + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> + memset(&ndr_desc, 0, sizeof(ndr_desc));

Any reason not to use compiler initialization?
i.e.
struct nd_region_desc ndr_desc = {};

> +
> + ndr_desc.res = &res;
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = virtio_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> + nd_region->provider_data = dev_to_virtio
> + (nd_region->dev.parent->parent);
> +
> + if (!nd_region)
> + goto out_nd;
> +
> + //virtio_device_ready(vdev);

Left over

> + return 0;
> +out_nd:
> + err = -ENXIO;
> + nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct virtio_pmem *vpmem = vdev->priv;
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
> + vdev->config->del_vqs(vdev);

I think you should also call vdev->config->reset

> + kfree(vpmem);
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name = KBUILD_MODNAME,
> + .driver.owner = THIS_MODULE,
> + .id_table = id_table,
> + .probe = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..224f9d934be6
> --- /dev/null
> +++ b/include/linux/virtio_pmem.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * virtio_pmem.h: virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + **/
> +
> +#ifndef _LINUX_VIRTIO_PMEM_H
> +#define _LINUX_VIRTIO_PMEM_H
> +
> +#include <linux/virtio_ids.h>
> +#include <linux/module.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/spinlock.h>
> +
> +struct virtio_pmem_request {
> + /* Host return status corresponding to flush request */
> + int ret;
> +
> + /* command name*/
> + char name[16];
> +
> + /* Wait queue to process deferred work after ack from host */
> + wait_queue_head_t host_acked;
> + bool done;
> +
> + /* Wait queue to process deferred work after virt queue buffer avail */
> + wait_queue_head_t wq_buf;
> + bool wq_buf_avail;
> + struct list_head list;
> +};
> +
> +struct virtio_pmem {
> + struct virtio_device *vdev;
> +
> + /* Virtio pmem request queue */
> + struct virtqueue *req_vq;
> +
> + /* nvdimm bus registers virtio pmem device */
> + struct nvdimm_bus *nvdimm_bus;
> + struct nvdimm_bus_descriptor nd_desc;
> +
> + /* List to store deferred work if virtqueue is full */
> + struct list_head req_list;
> +
> + /* Synchronize virtqueue data */
> + spinlock_t pmem_lock;
> +
> + /* Memory region information */
> + uint64_t start;
> + uint64_t size;
> +};
> +
> +void host_ack(struct virtqueue *vq);
> +int virtio_pmem_flush(struct nd_region *nd_region);
> +#endif
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2d4f4d..346389565ac1 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
> #define VIRTIO_ID_INPUT 18 /* virtio input */
> #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM 25 /* virtio pmem */

Any reason for the jump here? are 21 to 24 already taken or you just
want to be on the safe side?

>
> #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 000000000000..fa3f7d52717a
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> + __le64 start;
> + __le64 size;
> +};
> +#endif
> --
> 2.20.1
>
>

2019-04-03 12:40:24

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver


> Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver
>
> On Wed, Apr 03, 2019 at 04:10:15PM +0530, Pankaj Gupta wrote:
> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > drivers/nvdimm/virtio_pmem.c | 84 +++++++++++++++++++++
> > drivers/virtio/Kconfig | 10 +++
> > drivers/virtio/Makefile | 1 +
> > drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++
> > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > include/uapi/linux/virtio_ids.h | 1 +
> > include/uapi/linux/virtio_pmem.h | 10 +++
> > 7 files changed, 291 insertions(+)
> > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > create mode 100644 drivers/virtio/pmem.c
> > create mode 100644 include/linux/virtio_pmem.h
> > create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > new file mode 100644
> > index 000000000000..2a1b1ba2c1ff
> > --- /dev/null
> > +++ b/drivers/nvdimm/virtio_pmem.c
> > @@ -0,0 +1,84 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
> Is this comment stile (//) acceptable?

In existing code, i can see same comment
pattern for license at some places.

>
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include "nd.h"
> > +
> > + /* The interrupt handler */
> > +void host_ack(struct virtqueue *vq)
> > +{
> > + unsigned int len;
> > + unsigned long flags;
> > + struct virtio_pmem_request *req, *req_buf;
> > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > + req->done = true;
> > + wake_up(&req->host_acked);
> > +
> > + if (!list_empty(&vpmem->req_list)) {
> > + req_buf = list_first_entry(&vpmem->req_list,
> > + struct virtio_pmem_request, list);
> > + list_del(&vpmem->req_list);
> > + req_buf->wq_buf_avail = true;
> > + wake_up(&req_buf->wq_buf);
> > + }
> > + }
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > +EXPORT_SYMBOL_GPL(host_ack);
> > +
> > + /* The request submission function */
> > +int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > + int err;
> > + unsigned long flags;
> > + struct scatterlist *sgs[2], sg, ret;
> > + struct virtio_device *vdev = nd_region->provider_data;
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct virtio_pmem_request *req;
> > +
> > + might_sleep();
>
> [1]
>
> > + req = kmalloc(sizeof(*req), GFP_KERNEL);
> > + if (!req)
> > + return -ENOMEM;
> > +
> > + req->done = req->wq_buf_avail = false;
> > + strcpy(req->name, "FLUSH");
> > + init_waitqueue_head(&req->host_acked);
> > + init_waitqueue_head(&req->wq_buf);
> > + sg_init_one(&sg, req->name, strlen(req->name));
> > + sgs[0] = &sg;
> > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > + sgs[1] = &ret;
> > +
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
>
> Is it okay to use GFP_ATOMIC in a might-sleep ([1]) function?

might sleep will give us a warning if we try to sleep from non-sleepable
context.

We are doing it other way, i.e might_sleep is not inside GFP_ATOMIC.

>
> > + if (err) {
> > + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > +
> > + list_add_tail(&vpmem->req_list, &req->list);
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->wq_buf, req->wq_buf_avail);
> > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > + }
> > + virtqueue_kick(vpmem->req_vq);
>
> You probably want to check return value here.

Don't think it will matter in this case?

>
> > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > + /* When host has read buffer, this completes via host_ack */
> > + wait_event(req->host_acked, req->done);
> > + err = req->ret;
> > + kfree(req);
> > +
> > + return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +MODULE_LICENSE("GPL");
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 35897649c24f..9f634a2ed638 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> >
> > If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > + tristate "Support for virtio pmem driver"
> > + depends on VIRTIO
> > + depends on LIBNVDIMM
> > + help
> > + This driver provides support for virtio based flushing interface
> > + for persistent memory range.
> > +
> > + If unsure, say M.
> > +
> > config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5dcf46..143ce91eabe9 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > new file mode 100644
> > index 000000000000..52f74064f67e
> > --- /dev/null
> > +++ b/drivers/virtio/pmem.c
> > @@ -0,0 +1,125 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
> Ditto
>
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and registers the virtual pmem device
> > + * with libnvdimm core.
> > + */
> > +#include <linux/virtio_pmem.h>
> > +#include <../../drivers/nvdimm/nd.h>
>
> Should this file be moved to include/ directory?

We are not touching the directory structure of nd & nd_pmem
kernel driver.

>
> > +
> > +static struct virtio_device_id id_table[] = {
> > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > + { 0 },
> > +};
> > +
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > + struct virtqueue *vq;
> > +
> > + /* single vq */
> > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > + host_ack, "flush_queue");
> > + if (IS_ERR(vq))
> > + return PTR_ERR(vq);
> > +
> > + spin_lock_init(&vpmem->pmem_lock);
> > + INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > + return 0;
> > +};
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > + int err = 0;
> > + struct resource res;
> > + struct virtio_pmem *vpmem;
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nd_region_desc ndr_desc;
> > + int nid = dev_to_node(&vdev->dev);
> > + struct nd_region *nd_region;
> > +
> > + if (!vdev->config->get) {
> > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > +
> > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > + GFP_KERNEL);
>
> Suggesting to indent it right so it will be under &vdev

o.k

>
> > + if (!vpmem) {
> > + err = -ENOMEM;
> > + goto out_err;
> > + }
> > +
> > + vpmem->vdev = vdev;
> > + err = init_vq(vpmem);
> > + if (err)
> > + goto out_err;
>
> No need to free vpmem here?

No. devm_kzalloc will take care of it.

>
> > +
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + start, &vpmem->start);
> > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > + size, &vpmem->size);
> > +
> > + res.start = vpmem->start;
> > + res.end = vpmem->start + vpmem->size-1;
> > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > + vpmem->nd_desc.module = THIS_MODULE;
> > +
> > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > + &vpmem->nd_desc);
> > + if (!nvdimm_bus)
> > + goto out_vq;
>
> Ditto (i'm probably missing something here)
>
> > +
> > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > + memset(&ndr_desc, 0, sizeof(ndr_desc));
>
> Any reason not to use compiler initialization?
> i.e.
> struct nd_region_desc ndr_desc = {};

will change.

>
> > +
> > + ndr_desc.res = &res;
> > + ndr_desc.numa_node = nid;
> > + ndr_desc.flush = virtio_pmem_flush;
> > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > + nd_region->provider_data = dev_to_virtio
> > + (nd_region->dev.parent->parent);
> > +
> > + if (!nd_region)
> > + goto out_nd;
> > +
> > + //virtio_device_ready(vdev);
>
> Left over

o.k

>
> > + return 0;
> > +out_nd:
> > + err = -ENXIO;
> > + nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > + vdev->config->del_vqs(vdev);
> > +out_err:
> > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > + return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > + struct virtio_pmem *vpmem = vdev->priv;
> > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > + nvdimm_bus_unregister(nvdimm_bus);
> > + vdev->config->del_vqs(vdev);
>
> I think you should also call vdev->config->reset

o.k. Here device will be removed completely, still its required?

>
> > + kfree(vpmem);
> > +}
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > + .driver.name = KBUILD_MODNAME,
> > + .driver.owner = THIS_MODULE,
> > + .id_table = id_table,
> > + .probe = virtio_pmem_probe,
> > + .remove = virtio_pmem_remove,
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..224f9d934be6
> > --- /dev/null
> > +++ b/include/linux/virtio_pmem.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * virtio_pmem.h: virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + **/
> > +
> > +#ifndef _LINUX_VIRTIO_PMEM_H
> > +#define _LINUX_VIRTIO_PMEM_H
> > +
> > +#include <linux/virtio_ids.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/spinlock.h>
> > +
> > +struct virtio_pmem_request {
> > + /* Host return status corresponding to flush request */
> > + int ret;
> > +
> > + /* command name*/
> > + char name[16];
> > +
> > + /* Wait queue to process deferred work after ack from host */
> > + wait_queue_head_t host_acked;
> > + bool done;
> > +
> > + /* Wait queue to process deferred work after virt queue buffer avail */
> > + wait_queue_head_t wq_buf;
> > + bool wq_buf_avail;
> > + struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > + struct virtio_device *vdev;
> > +
> > + /* Virtio pmem request queue */
> > + struct virtqueue *req_vq;
> > +
> > + /* nvdimm bus registers virtio pmem device */
> > + struct nvdimm_bus *nvdimm_bus;
> > + struct nvdimm_bus_descriptor nd_desc;
> > +
> > + /* List to store deferred work if virtqueue is full */
> > + struct list_head req_list;
> > +
> > + /* Synchronize virtqueue data */
> > + spinlock_t pmem_lock;
> > +
> > + /* Memory region information */
> > + uint64_t start;
> > + uint64_t size;
> > +};
> > +
> > +void host_ack(struct virtqueue *vq);
> > +int virtio_pmem_flush(struct nd_region *nd_region);
> > +#endif
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2d4f4d..346389565ac1 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
>
> Any reason for the jump here? are 21 to 24 already taken or you just
> want to be on the safe side?

They are already reserved.

Thanks,
Pankaj

>
> >
> > #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 000000000000..fa3f7d52717a
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,10 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > + __le64 start;
> > + __le64 size;
> > +};
> > +#endif
> > --
> > 2.20.1
> >
> >
>
>

2019-04-03 22:09:24

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> Virtio pmem provides asynchronous host page cache flush
> mechanism. we don't support 'MAP_SYNC' with virtio pmem
> and xfs.
>
> Signed-off-by: Pankaj Gupta <[email protected]>
> ---
> fs/xfs/xfs_file.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f2e2845eb76..dced2eb8c91a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> return -EOPNOTSUPP;
>
> + /* We don't support synchronous mappings with DAX files if
> + * dax_device is not synchronous.
> + */
> + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> + (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
> file_accessed(filp);
> vma->vm_ops = &xfs_file_vm_ops;
> if (IS_DAX(file_inode(filp)))

All this ad hoc IS_DAX conditional logic is getting pretty nasty.

xfs_file_mmap(
....
{
struct inode *inode = file_inode(filp);

if (vma->vm_flags & VM_SYNC) {
if (!IS_DAX(inode))
return -EOPNOTSUPP;
if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
return -EOPNOTSUPP;
}

file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(inode))
vma->vm_flags |= VM_HUGEPAGE;
return 0;
}


Even better, factor out all the "MAP_SYNC supported" checks into a
helper so that the filesystem code just doesn't have to care about
the details of checking for DAX+MAP_SYNC support....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-04-03 22:40:55

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > Virtio pmem provides asynchronous host page cache flush
> > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > and xfs.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > fs/xfs/xfs_file.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1f2e2845eb76..dced2eb8c91a 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > return -EOPNOTSUPP;
> >
> > + /* We don't support synchronous mappings with DAX files if
> > + * dax_device is not synchronous.
> > + */
> > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > + (vma->vm_flags & VM_SYNC))
> > + return -EOPNOTSUPP;
> > +
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > if (IS_DAX(file_inode(filp)))
>
> All this ad hoc IS_DAX conditional logic is getting pretty nasty.
>
> xfs_file_mmap(
> ....
> {
> struct inode *inode = file_inode(filp);
>
> if (vma->vm_flags & VM_SYNC) {
> if (!IS_DAX(inode))
> return -EOPNOTSUPP;
> if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> return -EOPNOTSUPP;
> }
>
> file_accessed(filp);
> vma->vm_ops = &xfs_file_vm_ops;
> if (IS_DAX(inode))
> vma->vm_flags |= VM_HUGEPAGE;
> return 0;
> }
>
>
> Even better, factor out all the "MAP_SYNC supported" checks into a
> helper so that the filesystem code just doesn't have to care about
> the details of checking for DAX+MAP_SYNC support....

Seconded, since ext4 has nearly the same flag validation logic.

--D

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2019-04-04 06:12:39

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush


Hi Dave,

> > Virtio pmem provides asynchronous host page cache flush
> > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > and xfs.
> >
> > Signed-off-by: Pankaj Gupta <[email protected]>
> > ---
> > fs/xfs/xfs_file.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1f2e2845eb76..dced2eb8c91a 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > return -EOPNOTSUPP;
> >
> > + /* We don't support synchronous mappings with DAX files if
> > + * dax_device is not synchronous.
> > + */
> > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > + (vma->vm_flags & VM_SYNC))
> > + return -EOPNOTSUPP;
> > +
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > if (IS_DAX(file_inode(filp)))
>
> All this ad hoc IS_DAX conditional logic is getting pretty nasty.
>
> xfs_file_mmap(
> ....
> {
> struct inode *inode = file_inode(filp);
>
> if (vma->vm_flags & VM_SYNC) {
> if (!IS_DAX(inode))
> return -EOPNOTSUPP;
> if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> return -EOPNOTSUPP;
> }
>
> file_accessed(filp);
> vma->vm_ops = &xfs_file_vm_ops;
> if (IS_DAX(inode))
> vma->vm_flags |= VM_HUGEPAGE;
> return 0;
> }

Sure, this is better.

>
>
> Even better, factor out all the "MAP_SYNC supported" checks into a
> helper so that the filesystem code just doesn't have to care about
> the details of checking for DAX+MAP_SYNC support....

o.k. Will add one common helper function for both ext4 & xfs filesystems.

Thanks for the suggestion.

Best regards,
Pankaj

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2019-04-04 06:13:45

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush


>
> On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > Virtio pmem provides asynchronous host page cache flush
> > > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > > and xfs.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > ---
> > > fs/xfs/xfs_file.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > return -EOPNOTSUPP;
> > >
> > > + /* We don't support synchronous mappings with DAX files if
> > > + * dax_device is not synchronous.
> > > + */
> > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > + (vma->vm_flags & VM_SYNC))
> > > + return -EOPNOTSUPP;
> > > +
> > > file_accessed(filp);
> > > vma->vm_ops = &xfs_file_vm_ops;
> > > if (IS_DAX(file_inode(filp)))
> >
> > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> >
> > xfs_file_mmap(
> > ....
> > {
> > struct inode *inode = file_inode(filp);
> >
> > if (vma->vm_flags & VM_SYNC) {
> > if (!IS_DAX(inode))
> > return -EOPNOTSUPP;
> > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > return -EOPNOTSUPP;
> > }
> >
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > if (IS_DAX(inode))
> > vma->vm_flags |= VM_HUGEPAGE;
> > return 0;
> > }
> >
> >
> > Even better, factor out all the "MAP_SYNC supported" checks into a
> > helper so that the filesystem code just doesn't have to care about
> > the details of checking for DAX+MAP_SYNC support....
>
> Seconded, since ext4 has nearly the same flag validation logic.

Agree.

Thanks,
Pankaj

>
> --D
>
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]
>

2019-04-04 06:41:30

by Yuval Shaia

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver

On Wed, Apr 03, 2019 at 08:40:13AM -0400, Pankaj Gupta wrote:
>
> > Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver
> >
> > On Wed, Apr 03, 2019 at 04:10:15PM +0530, Pankaj Gupta wrote:
> > > This patch adds virtio-pmem driver for KVM guest.
> > >
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > >
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > >
> > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > ---
> > > drivers/nvdimm/virtio_pmem.c | 84 +++++++++++++++++++++
> > > drivers/virtio/Kconfig | 10 +++
> > > drivers/virtio/Makefile | 1 +
> > > drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++
> > > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > > include/uapi/linux/virtio_ids.h | 1 +
> > > include/uapi/linux/virtio_pmem.h | 10 +++
> > > 7 files changed, 291 insertions(+)
> > > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > > create mode 100644 drivers/virtio/pmem.c
> > > create mode 100644 include/linux/virtio_pmem.h
> > > create mode 100644 include/uapi/linux/virtio_pmem.h
> > >
> > > diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> > > new file mode 100644
> > > index 000000000000..2a1b1ba2c1ff
> > > --- /dev/null
> > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > @@ -0,0 +1,84 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> >
> > Is this comment stile (//) acceptable?
>
> In existing code, i can see same comment
> pattern for license at some places.

Is it preferred for new code?

>
> >
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include <linux/virtio_pmem.h>
> > > +#include "nd.h"
> > > +
> > > + /* The interrupt handler */
> > > +void host_ack(struct virtqueue *vq)
> > > +{
> > > + unsigned int len;
> > > + unsigned long flags;
> > > + struct virtio_pmem_request *req, *req_buf;
> > > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > > +
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > + req->done = true;
> > > + wake_up(&req->host_acked);
> > > +
> > > + if (!list_empty(&vpmem->req_list)) {
> > > + req_buf = list_first_entry(&vpmem->req_list,
> > > + struct virtio_pmem_request, list);
> > > + list_del(&vpmem->req_list);
> > > + req_buf->wq_buf_avail = true;
> > > + wake_up(&req_buf->wq_buf);
> > > + }
> > > + }
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +}
> > > +EXPORT_SYMBOL_GPL(host_ack);
> > > +
> > > + /* The request submission function */
> > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > + int err;
> > > + unsigned long flags;
> > > + struct scatterlist *sgs[2], sg, ret;
> > > + struct virtio_device *vdev = nd_region->provider_data;
> > > + struct virtio_pmem *vpmem = vdev->priv;
> > > + struct virtio_pmem_request *req;
> > > +
> > > + might_sleep();
> >
> > [1]
> >
> > > + req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > + if (!req)
> > > + return -ENOMEM;
> > > +
> > > + req->done = req->wq_buf_avail = false;
> > > + strcpy(req->name, "FLUSH");
> > > + init_waitqueue_head(&req->host_acked);
> > > + init_waitqueue_head(&req->wq_buf);
> > > + sg_init_one(&sg, req->name, strlen(req->name));
> > > + sgs[0] = &sg;
> > > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > + sgs[1] = &ret;
> > > +
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> >
> > Is it okay to use GFP_ATOMIC in a might-sleep ([1]) function?
>
> might sleep will give us a warning if we try to sleep from non-sleepable
> context.
>
> We are doing it other way, i.e might_sleep is not inside GFP_ATOMIC.
>
> >
> > > + if (err) {
> > > + dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > > +
> > > + list_add_tail(&vpmem->req_list, &req->list);
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > + /* When host has read buffer, this completes via host_ack */
> > > + wait_event(req->wq_buf, req->wq_buf_avail);
> > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > + }
> > > + virtqueue_kick(vpmem->req_vq);
> >
> > You probably want to check return value here.
>
> Don't think it will matter in this case?

Have no idea, if it fails maybe you will never get to host_acked.

>
> >
> > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > + /* When host has read buffer, this completes via host_ack */
> > > + wait_event(req->host_acked, req->done);
> > > + err = req->ret;
> > > + kfree(req);
> > > +
> > > + return err;
> > > +};
> > > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > > +MODULE_LICENSE("GPL");
> > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > index 35897649c24f..9f634a2ed638 100644
> > > --- a/drivers/virtio/Kconfig
> > > +++ b/drivers/virtio/Kconfig
> > > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> > >
> > > If unsure, say Y.
> > >
> > > +config VIRTIO_PMEM
> > > + tristate "Support for virtio pmem driver"
> > > + depends on VIRTIO
> > > + depends on LIBNVDIMM
> > > + help
> > > + This driver provides support for virtio based flushing interface
> > > + for persistent memory range.
> > > +
> > > + If unsure, say M.
> > > +
> > > config VIRTIO_BALLOON
> > > tristate "Virtio balloon driver"
> > > depends on VIRTIO
> > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > > index 3a2b5c5dcf46..143ce91eabe9 100644
> > > --- a/drivers/virtio/Makefile
> > > +++ b/drivers/virtio/Makefile
> > > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > > +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> > > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > > new file mode 100644
> > > index 000000000000..52f74064f67e
> > > --- /dev/null
> > > +++ b/drivers/virtio/pmem.c
> > > @@ -0,0 +1,125 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> >
> > Ditto
> >
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and registers the virtual pmem device
> > > + * with libnvdimm core.
> > > + */
> > > +#include <linux/virtio_pmem.h>
> > > +#include <../../drivers/nvdimm/nd.h>
> >
> > Should this file be moved to include/ directory?
>
> We are not touching the directory structure of nd & nd_pmem
> kernel driver.

But since this file becomes public it should be considered, right?

>
> >
> > > +
> > > +static struct virtio_device_id id_table[] = {
> > > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > + { 0 },
> > > +};
> > > +
> > > + /* Initialize virt queue */
> > > +static int init_vq(struct virtio_pmem *vpmem)
> > > +{
> > > + struct virtqueue *vq;
> > > +
> > > + /* single vq */
> > > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > + host_ack, "flush_queue");
> > > + if (IS_ERR(vq))
> > > + return PTR_ERR(vq);
> > > +
> > > + spin_lock_init(&vpmem->pmem_lock);
> > > + INIT_LIST_HEAD(&vpmem->req_list);
> > > +
> > > + return 0;
> > > +};
> > > +
> > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > +{
> > > + int err = 0;
> > > + struct resource res;
> > > + struct virtio_pmem *vpmem;
> > > + struct nvdimm_bus *nvdimm_bus;
> > > + struct nd_region_desc ndr_desc;
> > > + int nid = dev_to_node(&vdev->dev);
> > > + struct nd_region *nd_region;
> > > +
> > > + if (!vdev->config->get) {
> > > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > > + __func__);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > + GFP_KERNEL);
> >
> > Suggesting to indent it right so it will be under &vdev
>
> o.k
>
> >
> > > + if (!vpmem) {
> > > + err = -ENOMEM;
> > > + goto out_err;
> > > + }
> > > +
> > > + vpmem->vdev = vdev;
> > > + err = init_vq(vpmem);
> > > + if (err)
> > > + goto out_err;
> >
> > No need to free vpmem here?
>
> No. devm_kzalloc will take care of it.

Nice.

>
> >
> > > +
> > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > + start, &vpmem->start);
> > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > + size, &vpmem->size);
> > > +
> > > + res.start = vpmem->start;
> > > + res.end = vpmem->start + vpmem->size-1;
> > > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > > + vpmem->nd_desc.module = THIS_MODULE;
> > > +
> > > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > + &vpmem->nd_desc);
> > > + if (!nvdimm_bus)
> > > + goto out_vq;
> >
> > Ditto (i'm probably missing something here)
> >
> > > +
> > > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > + memset(&ndr_desc, 0, sizeof(ndr_desc));
> >
> > Any reason not to use compiler initialization?
> > i.e.
> > struct nd_region_desc ndr_desc = {};
>
> will change.
>
> >
> > > +
> > > + ndr_desc.res = &res;
> > > + ndr_desc.numa_node = nid;
> > > + ndr_desc.flush = virtio_pmem_flush;
> > > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > + nd_region->provider_data = dev_to_virtio
> > > + (nd_region->dev.parent->parent);
> > > +
> > > + if (!nd_region)
> > > + goto out_nd;
> > > +
> > > + //virtio_device_ready(vdev);
> >
> > Left over
>
> o.k
>
> >
> > > + return 0;
> > > +out_nd:
> > > + err = -ENXIO;
> > > + nvdimm_bus_unregister(nvdimm_bus);
> > > +out_vq:
> > > + vdev->config->del_vqs(vdev);
> > > +out_err:
> > > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > + return err;
> > > +}
> > > +
> > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > +{
> > > + struct virtio_pmem *vpmem = vdev->priv;
> > > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > +
> > > + nvdimm_bus_unregister(nvdimm_bus);
> > > + vdev->config->del_vqs(vdev);
> >
> > I think you should also call vdev->config->reset
>
> o.k. Here device will be removed completely, still its required?

I had a bad experience with unloading virtio PCI driver and it gone after i
added call to 'reset'.
See the warning in function virtio_dev_remove.

>
> >
> > > + kfree(vpmem);
> > > +}
> > > +
> > > +static struct virtio_driver virtio_pmem_driver = {
> > > + .driver.name = KBUILD_MODNAME,
> > > + .driver.owner = THIS_MODULE,
> > > + .id_table = id_table,
> > > + .probe = virtio_pmem_probe,
> > > + .remove = virtio_pmem_remove,
> > > +};
> > > +
> > > +module_virtio_driver(virtio_pmem_driver);
> > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > +MODULE_LICENSE("GPL");
> > > diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> > > new file mode 100644
> > > index 000000000000..224f9d934be6
> > > --- /dev/null
> > > +++ b/include/linux/virtio_pmem.h
> > > @@ -0,0 +1,60 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * virtio_pmem.h: virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + **/
> > > +
> > > +#ifndef _LINUX_VIRTIO_PMEM_H
> > > +#define _LINUX_VIRTIO_PMEM_H
> > > +
> > > +#include <linux/virtio_ids.h>
> > > +#include <linux/module.h>
> > > +#include <linux/virtio_config.h>
> > > +#include <uapi/linux/virtio_pmem.h>
> > > +#include <linux/libnvdimm.h>
> > > +#include <linux/spinlock.h>
> > > +
> > > +struct virtio_pmem_request {
> > > + /* Host return status corresponding to flush request */
> > > + int ret;
> > > +
> > > + /* command name*/
> > > + char name[16];
> > > +
> > > + /* Wait queue to process deferred work after ack from host */
> > > + wait_queue_head_t host_acked;
> > > + bool done;
> > > +
> > > + /* Wait queue to process deferred work after virt queue buffer avail */
> > > + wait_queue_head_t wq_buf;
> > > + bool wq_buf_avail;
> > > + struct list_head list;
> > > +};
> > > +
> > > +struct virtio_pmem {
> > > + struct virtio_device *vdev;
> > > +
> > > + /* Virtio pmem request queue */
> > > + struct virtqueue *req_vq;
> > > +
> > > + /* nvdimm bus registers virtio pmem device */
> > > + struct nvdimm_bus *nvdimm_bus;
> > > + struct nvdimm_bus_descriptor nd_desc;
> > > +
> > > + /* List to store deferred work if virtqueue is full */
> > > + struct list_head req_list;
> > > +
> > > + /* Synchronize virtqueue data */
> > > + spinlock_t pmem_lock;
> > > +
> > > + /* Memory region information */
> > > + uint64_t start;
> > > + uint64_t size;
> > > +};
> > > +
> > > +void host_ack(struct virtqueue *vq);
> > > +int virtio_pmem_flush(struct nd_region *nd_region);
> > > +#endif
> > > diff --git a/include/uapi/linux/virtio_ids.h
> > > b/include/uapi/linux/virtio_ids.h
> > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > --- a/include/uapi/linux/virtio_ids.h
> > > +++ b/include/uapi/linux/virtio_ids.h
> > > @@ -43,5 +43,6 @@
> > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> >
> > Any reason for the jump here? are 21 to 24 already taken or you just
> > want to be on the safe side?
>
> They are already reserved.

Can you direct me to how to find a free ID?

>
> Thanks,
> Pankaj
>
> >
> > >
> > > #endif /* _LINUX_VIRTIO_IDS_H */
> > > diff --git a/include/uapi/linux/virtio_pmem.h
> > > b/include/uapi/linux/virtio_pmem.h
> > > new file mode 100644
> > > index 000000000000..fa3f7d52717a
> > > --- /dev/null
> > > +++ b/include/uapi/linux/virtio_pmem.h
> > > @@ -0,0 +1,10 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +
> > > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > > +
> > > +struct virtio_pmem_config {
> > > + __le64 start;
> > > + __le64 size;
> > > +};
> > > +#endif
> > > --
> > > 2.20.1
> > >
> > >
> >
> >

2019-04-04 07:14:45

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem driver


> >
> > > Subject: Re: [Qemu-devel] [PATCH v4 2/5] virtio-pmem: Add virtio pmem
> > > driver
> > >
> > > On Wed, Apr 03, 2019 at 04:10:15PM +0530, Pankaj Gupta wrote:
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > >
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > >
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > > >
> > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > ---
> > > > drivers/nvdimm/virtio_pmem.c | 84 +++++++++++++++++++++
> > > > drivers/virtio/Kconfig | 10 +++
> > > > drivers/virtio/Makefile | 1 +
> > > > drivers/virtio/pmem.c | 125 +++++++++++++++++++++++++++++++
> > > > include/linux/virtio_pmem.h | 60 +++++++++++++++
> > > > include/uapi/linux/virtio_ids.h | 1 +
> > > > include/uapi/linux/virtio_pmem.h | 10 +++
> > > > 7 files changed, 291 insertions(+)
> > > > create mode 100644 drivers/nvdimm/virtio_pmem.c
> > > > create mode 100644 drivers/virtio/pmem.c
> > > > create mode 100644 include/linux/virtio_pmem.h
> > > > create mode 100644 include/uapi/linux/virtio_pmem.h
> > > >
> > > > diff --git a/drivers/nvdimm/virtio_pmem.c
> > > > b/drivers/nvdimm/virtio_pmem.c
> > > > new file mode 100644
> > > > index 000000000000..2a1b1ba2c1ff
> > > > --- /dev/null
> > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > @@ -0,0 +1,84 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > >
> > > Is this comment stile (//) acceptable?
> >
> > In existing code, i can see same comment
> > pattern for license at some places.
>
> Is it preferred for new code?

will change.

>
> >
> > >
> > > > +/*
> > > > + * virtio_pmem.c: Virtio pmem Driver
> > > > + *
> > > > + * Discovers persistent memory range information
> > > > + * from host and provides a virtio based flushing
> > > > + * interface.
> > > > + */
> > > > +#include <linux/virtio_pmem.h>
> > > > +#include "nd.h"
> > > > +
> > > > + /* The interrupt handler */
> > > > +void host_ack(struct virtqueue *vq)
> > > > +{
> > > > + unsigned int len;
> > > > + unsigned long flags;
> > > > + struct virtio_pmem_request *req, *req_buf;
> > > > + struct virtio_pmem *vpmem = vq->vdev->priv;
> > > > +
> > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > + while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > > + req->done = true;
> > > > + wake_up(&req->host_acked);
> > > > +
> > > > + if (!list_empty(&vpmem->req_list)) {
> > > > + req_buf = list_first_entry(&vpmem->req_list,
> > > > + struct virtio_pmem_request, list);
> > > > + list_del(&vpmem->req_list);
> > > > + req_buf->wq_buf_avail = true;
> > > > + wake_up(&req_buf->wq_buf);
> > > > + }
> > > > + }
> > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(host_ack);
> > > > +
> > > > + /* The request submission function */
> > > > +int virtio_pmem_flush(struct nd_region *nd_region)
> > > > +{
> > > > + int err;
> > > > + unsigned long flags;
> > > > + struct scatterlist *sgs[2], sg, ret;
> > > > + struct virtio_device *vdev = nd_region->provider_data;
> > > > + struct virtio_pmem *vpmem = vdev->priv;
> > > > + struct virtio_pmem_request *req;
> > > > +
> > > > + might_sleep();
> > >
> > > [1]
> > >
> > > > + req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > > + if (!req)
> > > > + return -ENOMEM;
> > > > +
> > > > + req->done = req->wq_buf_avail = false;
> > > > + strcpy(req->name, "FLUSH");
> > > > + init_waitqueue_head(&req->host_acked);
> > > > + init_waitqueue_head(&req->wq_buf);
> > > > + sg_init_one(&sg, req->name, strlen(req->name));
> > > > + sgs[0] = &sg;
> > > > + sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > > + sgs[1] = &ret;
> > > > +
> > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > + err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > >
> > > Is it okay to use GFP_ATOMIC in a might-sleep ([1]) function?
> >
> > might sleep will give us a warning if we try to sleep from non-sleepable
> > context.
> >
> > We are doing it other way, i.e might_sleep is not inside GFP_ATOMIC.
> >
> > >
> > > > + if (err) {
> > > > + dev_err(&vdev->dev, "failed to send command to virtio pmem
> > > > device\n");
> > > > +
> > > > + list_add_tail(&vpmem->req_list, &req->list);
> > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > + /* When host has read buffer, this completes via host_ack */
> > > > + wait_event(req->wq_buf, req->wq_buf_avail);
> > > > + spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > + }
> > > > + virtqueue_kick(vpmem->req_vq);
> > >
> > > You probably want to check return value here.
> >
> > Don't think it will matter in this case?
>
> Have no idea, if it fails maybe you will never get to host_acked.

I am not sure about this. Surely will check.

>
> >
> > >
> > > > + spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > + /* When host has read buffer, this completes via host_ack */
> > > > + wait_event(req->host_acked, req->done);
> > > > + err = req->ret;
> > > > + kfree(req);
> > > > +
> > > > + return err;
> > > > +};
> > > > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > > > +MODULE_LICENSE("GPL");
> > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > index 35897649c24f..9f634a2ed638 100644
> > > > --- a/drivers/virtio/Kconfig
> > > > +++ b/drivers/virtio/Kconfig
> > > > @@ -42,6 +42,16 @@ config VIRTIO_PCI_LEGACY
> > > >
> > > > If unsure, say Y.
> > > >
> > > > +config VIRTIO_PMEM
> > > > + tristate "Support for virtio pmem driver"
> > > > + depends on VIRTIO
> > > > + depends on LIBNVDIMM
> > > > + help
> > > > + This driver provides support for virtio based flushing interface
> > > > + for persistent memory range.
> > > > +
> > > > + If unsure, say M.
> > > > +
> > > > config VIRTIO_BALLOON
> > > > tristate "Virtio balloon driver"
> > > > depends on VIRTIO
> > > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > > > index 3a2b5c5dcf46..143ce91eabe9 100644
> > > > --- a/drivers/virtio/Makefile
> > > > +++ b/drivers/virtio/Makefile
> > > > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > > > virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > > > obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > > > obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > > > +obj-$(CONFIG_VIRTIO_PMEM) += pmem.o ../nvdimm/virtio_pmem.o
> > > > diff --git a/drivers/virtio/pmem.c b/drivers/virtio/pmem.c
> > > > new file mode 100644
> > > > index 000000000000..52f74064f67e
> > > > --- /dev/null
> > > > +++ b/drivers/virtio/pmem.c
> > > > @@ -0,0 +1,125 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > >
> > > Ditto
> > >
> > > > +/*
> > > > + * virtio_pmem.c: Virtio pmem Driver
> > > > + *
> > > > + * Discovers persistent memory range information
> > > > + * from host and registers the virtual pmem device
> > > > + * with libnvdimm core.
> > > > + */
> > > > +#include <linux/virtio_pmem.h>
> > > > +#include <../../drivers/nvdimm/nd.h>
> > >
> > > Should this file be moved to include/ directory?
> >
> > We are not touching the directory structure of nd & nd_pmem
> > kernel driver.
>
> But since this file becomes public it should be considered, right?

IIRC I tried to do it but that required changes in existing pmem code
directory structure for conflict resolution. I would suggest to keep
the directory structure as it is currently and submit a followup
patch to do this after current code is merged upstream.

>
> >
> > >
> > > > +
> > > > +static struct virtio_device_id id_table[] = {
> > > > + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > > + { 0 },
> > > > +};
> > > > +
> > > > + /* Initialize virt queue */
> > > > +static int init_vq(struct virtio_pmem *vpmem)
> > > > +{
> > > > + struct virtqueue *vq;
> > > > +
> > > > + /* single vq */
> > > > + vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > > + host_ack, "flush_queue");
> > > > + if (IS_ERR(vq))
> > > > + return PTR_ERR(vq);
> > > > +
> > > > + spin_lock_init(&vpmem->pmem_lock);
> > > > + INIT_LIST_HEAD(&vpmem->req_list);
> > > > +
> > > > + return 0;
> > > > +};
> > > > +
> > > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > +{
> > > > + int err = 0;
> > > > + struct resource res;
> > > > + struct virtio_pmem *vpmem;
> > > > + struct nvdimm_bus *nvdimm_bus;
> > > > + struct nd_region_desc ndr_desc;
> > > > + int nid = dev_to_node(&vdev->dev);
> > > > + struct nd_region *nd_region;
> > > > +
> > > > + if (!vdev->config->get) {
> > > > + dev_err(&vdev->dev, "%s failure: config disabled\n",
> > > > + __func__);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > > + GFP_KERNEL);
> > >
> > > Suggesting to indent it right so it will be under &vdev
> >
> > o.k
> >
> > >
> > > > + if (!vpmem) {
> > > > + err = -ENOMEM;
> > > > + goto out_err;
> > > > + }
> > > > +
> > > > + vpmem->vdev = vdev;
> > > > + err = init_vq(vpmem);
> > > > + if (err)
> > > > + goto out_err;
> > >
> > > No need to free vpmem here?
> >
> > No. devm_kzalloc will take care of it.
>
> Nice.
>
> >
> > >
> > > > +
> > > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > > + start, &vpmem->start);
> > > > + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > > + size, &vpmem->size);
> > > > +
> > > > + res.start = vpmem->start;
> > > > + res.end = vpmem->start + vpmem->size-1;
> > > > + vpmem->nd_desc.provider_name = "virtio-pmem";
> > > > + vpmem->nd_desc.module = THIS_MODULE;
> > > > +
> > > > + vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > > + &vpmem->nd_desc);
> > > > + if (!nvdimm_bus)
> > > > + goto out_vq;
> > >
> > > Ditto (i'm probably missing something here)
> > >
> > > > +
> > > > + dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > > + memset(&ndr_desc, 0, sizeof(ndr_desc));
> > >
> > > Any reason not to use compiler initialization?
> > > i.e.
> > > struct nd_region_desc ndr_desc = {};
> >
> > will change.
> >
> > >
> > > > +
> > > > + ndr_desc.res = &res;
> > > > + ndr_desc.numa_node = nid;
> > > > + ndr_desc.flush = virtio_pmem_flush;
> > > > + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > > + set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
> > > > + nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > > + nd_region->provider_data = dev_to_virtio
> > > > + (nd_region->dev.parent->parent);
> > > > +
> > > > + if (!nd_region)
> > > > + goto out_nd;
> > > > +
> > > > + //virtio_device_ready(vdev);
> > >
> > > Left over
> >
> > o.k
> >
> > >
> > > > + return 0;
> > > > +out_nd:
> > > > + err = -ENXIO;
> > > > + nvdimm_bus_unregister(nvdimm_bus);
> > > > +out_vq:
> > > > + vdev->config->del_vqs(vdev);
> > > > +out_err:
> > > > + dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > > + return err;
> > > > +}
> > > > +
> > > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > > +{
> > > > + struct virtio_pmem *vpmem = vdev->priv;
> > > > + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > > +
> > > > + nvdimm_bus_unregister(nvdimm_bus);
> > > > + vdev->config->del_vqs(vdev);
> > >
> > > I think you should also call vdev->config->reset
> >
> > o.k. Here device will be removed completely, still its required?
>
> I had a bad experience with unloading virtio PCI driver and it gone after i
> added call to 'reset'.
> See the warning in function virtio_dev_remove.

Fair point. I will add call to vdev->config->reset.

>
> >
> > >
> > > > + kfree(vpmem);
> > > > +}
> > > > +
> > > > +static struct virtio_driver virtio_pmem_driver = {
> > > > + .driver.name = KBUILD_MODNAME,
> > > > + .driver.owner = THIS_MODULE,
> > > > + .id_table = id_table,
> > > > + .probe = virtio_pmem_probe,
> > > > + .remove = virtio_pmem_remove,
> > > > +};
> > > > +
> > > > +module_virtio_driver(virtio_pmem_driver);
> > > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > > +MODULE_LICENSE("GPL");
> > > > diff --git a/include/linux/virtio_pmem.h b/include/linux/virtio_pmem.h
> > > > new file mode 100644
> > > > index 000000000000..224f9d934be6
> > > > --- /dev/null
> > > > +++ b/include/linux/virtio_pmem.h
> > > > @@ -0,0 +1,60 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * virtio_pmem.h: virtio pmem Driver
> > > > + *
> > > > + * Discovers persistent memory range information
> > > > + * from host and provides a virtio based flushing
> > > > + * interface.
> > > > + **/
> > > > +
> > > > +#ifndef _LINUX_VIRTIO_PMEM_H
> > > > +#define _LINUX_VIRTIO_PMEM_H
> > > > +
> > > > +#include <linux/virtio_ids.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/virtio_config.h>
> > > > +#include <uapi/linux/virtio_pmem.h>
> > > > +#include <linux/libnvdimm.h>
> > > > +#include <linux/spinlock.h>
> > > > +
> > > > +struct virtio_pmem_request {
> > > > + /* Host return status corresponding to flush request */
> > > > + int ret;
> > > > +
> > > > + /* command name*/
> > > > + char name[16];
> > > > +
> > > > + /* Wait queue to process deferred work after ack from host */
> > > > + wait_queue_head_t host_acked;
> > > > + bool done;
> > > > +
> > > > + /* Wait queue to process deferred work after virt queue buffer avail
> > > > */
> > > > + wait_queue_head_t wq_buf;
> > > > + bool wq_buf_avail;
> > > > + struct list_head list;
> > > > +};
> > > > +
> > > > +struct virtio_pmem {
> > > > + struct virtio_device *vdev;
> > > > +
> > > > + /* Virtio pmem request queue */
> > > > + struct virtqueue *req_vq;
> > > > +
> > > > + /* nvdimm bus registers virtio pmem device */
> > > > + struct nvdimm_bus *nvdimm_bus;
> > > > + struct nvdimm_bus_descriptor nd_desc;
> > > > +
> > > > + /* List to store deferred work if virtqueue is full */
> > > > + struct list_head req_list;
> > > > +
> > > > + /* Synchronize virtqueue data */
> > > > + spinlock_t pmem_lock;
> > > > +
> > > > + /* Memory region information */
> > > > + uint64_t start;
> > > > + uint64_t size;
> > > > +};
> > > > +
> > > > +void host_ack(struct virtqueue *vq);
> > > > +int virtio_pmem_flush(struct nd_region *nd_region);
> > > > +#endif
> > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > b/include/uapi/linux/virtio_ids.h
> > > > index 6d5c3b2d4f4d..346389565ac1 100644
> > > > --- a/include/uapi/linux/virtio_ids.h
> > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > @@ -43,5 +43,6 @@
> > > > #define VIRTIO_ID_INPUT 18 /* virtio input */
> > > > #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
> > > > #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
> > > > +#define VIRTIO_ID_PMEM 25 /* virtio pmem */
> > >
> > > Any reason for the jump here? are 21 to 24 already taken or you just
> > > want to be on the safe side?
> >
> > They are already reserved.
>
> Can you direct me to how to find a free ID?

I referred 'content.tex' in virtio-spec.

I have also posted a draft spec document[1] for virtio-pmem.
[1] https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html

Thanks for the suggestions.

Best regards,
Pankaj

>
> >
> > Thanks,
> > Pankaj
> >
> > >
> > > >
> > > > #endif /* _LINUX_VIRTIO_IDS_H */
> > > > diff --git a/include/uapi/linux/virtio_pmem.h
> > > > b/include/uapi/linux/virtio_pmem.h
> > > > new file mode 100644
> > > > index 000000000000..fa3f7d52717a
> > > > --- /dev/null
> > > > +++ b/include/uapi/linux/virtio_pmem.h
> > > > @@ -0,0 +1,10 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +
> > > > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > > > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > > > +
> > > > +struct virtio_pmem_config {
> > > > + __le64 start;
> > > > + __le64 size;
> > > > +};
> > > > +#endif
> > > > --
> > > > 2.20.1
> > > >
> > > >
> > >
> > >
>

2019-04-04 09:31:53

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush


> > On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > > Virtio pmem provides asynchronous host page cache flush
> > > > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > > > and xfs.
> > > >
> > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > ---
> > > > fs/xfs/xfs_file.c | 8 ++++++++
> > > > 1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > > return -EOPNOTSUPP;
> > > >
> > > > + /* We don't support synchronous mappings with DAX files if
> > > > + * dax_device is not synchronous.
> > > > + */
> > > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > > + (vma->vm_flags & VM_SYNC))
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > file_accessed(filp);
> > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > if (IS_DAX(file_inode(filp)))
> > >
> > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > >
> > > xfs_file_mmap(
> > > ....
> > > {
> > > struct inode *inode = file_inode(filp);
> > >
> > > if (vma->vm_flags & VM_SYNC) {
> > > if (!IS_DAX(inode))
> > > return -EOPNOTSUPP;
> > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > > return -EOPNOTSUPP;
> > > }
> > >
> > > file_accessed(filp);
> > > vma->vm_ops = &xfs_file_vm_ops;
> > > if (IS_DAX(inode))
> > > vma->vm_flags |= VM_HUGEPAGE;
> > > return 0;
> > > }
> > >
> > >
> > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > helper so that the filesystem code just doesn't have to care about
> > > the details of checking for DAX+MAP_SYNC support....
> >
> > Seconded, since ext4 has nearly the same flag validation logic.
>

Only issue with this I see is we need the helper function only for supported
filesystems ext4 & xfs (right now). If I create the function in "fs.h" it
will be compiled for every filesystem, even for those don't need it.

Sample patch below, does below patch is near to what you have in mind?

=================

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1f2e2845eb76..614995170cac 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1196,12 +1196,17 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
{
+ struct dax_device *dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
+
/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
- return -EOPNOTSUPP;
+ if (vma->vm_flags & VM_SYNC) {
+ int err = is_synchronous(filp, dax_dev);
+ if (err)
+ return err;
+ }

file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b42df09b04c..add017de3dd7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2162,6 +2162,20 @@ static inline void file_accessed(struct file *file)
touch_atime(&file->f_path);
}

+struct dax_device;
+extern bool dax_synchronous(struct dax_device *dax_dev);
+static inline int is_synchronous(struct file *filp, struct dax_device *dax_dev)
+{
+ struct inode *inode = file_inode(filp);
+
+ if (!IS_DAX(inode))
+ return -EOPNOTSUPP;
+ if (!dax_synchronous(dax_dev))
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
int sync_inode(struct inode *inode, struct writeback_control *wbc);
int sync_inode_metadata(struct inode *inode, int wait);

---------

Thanks,
Pankaj




2019-04-04 09:41:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

On Thu 04-04-19 05:09:10, Pankaj Gupta wrote:
>
> > > On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > > > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > > > Virtio pmem provides asynchronous host page cache flush
> > > > > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > > > > and xfs.
> > > > >
> > > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > > ---
> > > > > fs/xfs/xfs_file.c | 8 ++++++++
> > > > > 1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > > > --- a/fs/xfs/xfs_file.c
> > > > > +++ b/fs/xfs/xfs_file.c
> > > > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > > > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > > > return -EOPNOTSUPP;
> > > > >
> > > > > + /* We don't support synchronous mappings with DAX files if
> > > > > + * dax_device is not synchronous.
> > > > > + */
> > > > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > > > + (vma->vm_flags & VM_SYNC))
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > > > file_accessed(filp);
> > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > if (IS_DAX(file_inode(filp)))
> > > >
> > > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > > >
> > > > xfs_file_mmap(
> > > > ....
> > > > {
> > > > struct inode *inode = file_inode(filp);
> > > >
> > > > if (vma->vm_flags & VM_SYNC) {
> > > > if (!IS_DAX(inode))
> > > > return -EOPNOTSUPP;
> > > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > > > return -EOPNOTSUPP;
> > > > }
> > > >
> > > > file_accessed(filp);
> > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > if (IS_DAX(inode))
> > > > vma->vm_flags |= VM_HUGEPAGE;
> > > > return 0;
> > > > }
> > > >
> > > >
> > > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > > helper so that the filesystem code just doesn't have to care about
> > > > the details of checking for DAX+MAP_SYNC support....
> > >
> > > Seconded, since ext4 has nearly the same flag validation logic.
> >
>
> Only issue with this I see is we need the helper function only for supported
> filesystems ext4 & xfs (right now). If I create the function in "fs.h" it
> will be compiled for every filesystem, even for those don't need it.
>
> Sample patch below, does below patch is near to what you have in mind?

So I would put the helper in include/linux/dax.h and have it like:

bool daxdev_mapping_supported(struct vm_area_struct *vma,
struct dax_device *dax_dev)
{
if (!(vma->vm_flags & VM_SYNC))
return true;
if (!IS_DAX(file_inode(vma->vm_file)))
return false;
return dax_synchronous(dax_dev);
}

Honza
>
> =================
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f2e2845eb76..614995170cac 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1196,12 +1196,17 @@ xfs_file_mmap(
> struct file *filp,
> struct vm_area_struct *vma)
> {
> + struct dax_device *dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
> +
> /*
> - * We don't support synchronous mappings for non-DAX files. At least
> - * until someone comes with a sensible use case.
> + * We don't support synchronous mappings for non-DAX files and
> + * for DAX files if underneath dax_device is not synchronous.
> */
> - if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> - return -EOPNOTSUPP;
> + if (vma->vm_flags & VM_SYNC) {
> + int err = is_synchronous(filp, dax_dev);
> + if (err)
> + return err;
> + }
>
> file_accessed(filp);
> vma->vm_ops = &xfs_file_vm_ops;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8b42df09b04c..add017de3dd7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2162,6 +2162,20 @@ static inline void file_accessed(struct file *file)
> touch_atime(&file->f_path);
> }
>
> +struct dax_device;
> +extern bool dax_synchronous(struct dax_device *dax_dev);
> +static inline int is_synchronous(struct file *filp, struct dax_device *dax_dev)
> +{
> + struct inode *inode = file_inode(filp);
> +
> + if (!IS_DAX(inode))
> + return -EOPNOTSUPP;
> + if (!dax_synchronous(dax_dev))
> + return -EOPNOTSUPP;
> +
> + return 0;
> +}
> +
> int sync_inode(struct inode *inode, struct writeback_control *wbc);
> int sync_inode_metadata(struct inode *inode, int wait);
>
> ---------
>
> Thanks,
> Pankaj
>
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-04-04 09:57:17

by Adam Borowski

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

On Thu, Apr 04, 2019 at 02:12:30AM -0400, Pankaj Gupta wrote:
> > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> >
> > xfs_file_mmap(
> > ....
> > {
> > struct inode *inode = file_inode(filp);
> >
> > if (vma->vm_flags & VM_SYNC) {
> > if (!IS_DAX(inode))
> > return -EOPNOTSUPP;
> > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > return -EOPNOTSUPP;
> > }
> >
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > if (IS_DAX(inode))
> > vma->vm_flags |= VM_HUGEPAGE;
> > return 0;
> > }
>
> Sure, this is better.

> > Even better, factor out all the "MAP_SYNC supported" checks into a
> > helper so that the filesystem code just doesn't have to care about
> > the details of checking for DAX+MAP_SYNC support....
>
> o.k. Will add one common helper function for both ext4 & xfs filesystems.

Note this pending patch for Goldwyn Rodrigues' patchset for btrfs:

https://lore.kernel.org/linux-btrfs/[email protected]/

We might want to coordinate.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄⠀⠀⠀⠀

2019-04-04 10:08:52

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 5/5] xfs: disable map_sync for async flush


> On Thu 04-04-19 05:09:10, Pankaj Gupta wrote:
> >
> > > > On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > > > > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > > > > Virtio pmem provides asynchronous host page cache flush
> > > > > > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > > > > > and xfs.
> > > > > >
> > > > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > > > ---
> > > > > > fs/xfs/xfs_file.c | 8 ++++++++
> > > > > > 1 file changed, 8 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > > > > --- a/fs/xfs/xfs_file.c
> > > > > > +++ b/fs/xfs/xfs_file.c
> > > > > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > > > > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > > > > return -EOPNOTSUPP;
> > > > > >
> > > > > > + /* We don't support synchronous mappings with DAX files if
> > > > > > + * dax_device is not synchronous.
> > > > > > + */
> > > > > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > > > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > > > > + (vma->vm_flags & VM_SYNC))
> > > > > > + return -EOPNOTSUPP;
> > > > > > +
> > > > > > file_accessed(filp);
> > > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > > if (IS_DAX(file_inode(filp)))
> > > > >
> > > > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > > > >
> > > > > xfs_file_mmap(
> > > > > ....
> > > > > {
> > > > > struct inode *inode = file_inode(filp);
> > > > >
> > > > > if (vma->vm_flags & VM_SYNC) {
> > > > > if (!IS_DAX(inode))
> > > > > return -EOPNOTSUPP;
> > > > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > > > > return -EOPNOTSUPP;
> > > > > }
> > > > >
> > > > > file_accessed(filp);
> > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > if (IS_DAX(inode))
> > > > > vma->vm_flags |= VM_HUGEPAGE;
> > > > > return 0;
> > > > > }
> > > > >
> > > > >
> > > > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > > > helper so that the filesystem code just doesn't have to care about
> > > > > the details of checking for DAX+MAP_SYNC support....
> > > >
> > > > Seconded, since ext4 has nearly the same flag validation logic.
> > >
> >
> > Only issue with this I see is we need the helper function only for
> > supported
> > filesystems ext4 & xfs (right now). If I create the function in "fs.h" it
> > will be compiled for every filesystem, even for those don't need it.
> >
> > Sample patch below, does below patch is near to what you have in mind?
>
> So I would put the helper in include/linux/dax.h and have it like:
>
> bool daxdev_mapping_supported(struct vm_area_struct *vma,
> struct dax_device *dax_dev)
> {
> if (!(vma->vm_flags & VM_SYNC))
> return true;
> if (!IS_DAX(file_inode(vma->vm_file)))
> return false;
> return dax_synchronous(dax_dev);
> }

Sure. This is much better. I was also not sure what to name the helper function.
I will go ahead with this unless 'Dave' & 'Darrick' have anything to add.

Thank you very much.

Best regards,
Pankaj

>
> Honza
> >
> > =================
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1f2e2845eb76..614995170cac 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1196,12 +1196,17 @@ xfs_file_mmap(
> > struct file *filp,
> > struct vm_area_struct *vma)
> > {
> > + struct dax_device *dax_dev =
> > xfs_find_daxdev_for_inode(file_inode(filp));
> > +
> > /*
> > - * We don't support synchronous mappings for non-DAX files. At
> > least
> > - * until someone comes with a sensible use case.
> > + * We don't support synchronous mappings for non-DAX files and
> > + * for DAX files if underneath dax_device is not synchronous.
> > */
> > - if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > - return -EOPNOTSUPP;
> > + if (vma->vm_flags & VM_SYNC) {
> > + int err = is_synchronous(filp, dax_dev);
> > + if (err)
> > + return err;
> > + }
> >
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8b42df09b04c..add017de3dd7 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2162,6 +2162,20 @@ static inline void file_accessed(struct file *file)
> > touch_atime(&file->f_path);
> > }
> >
> > +struct dax_device;
> > +extern bool dax_synchronous(struct dax_device *dax_dev);
> > +static inline int is_synchronous(struct file *filp, struct dax_device
> > *dax_dev)
> > +{
> > + struct inode *inode = file_inode(filp);
> > +
> > + if (!IS_DAX(inode))
> > + return -EOPNOTSUPP;
> > + if (!dax_synchronous(dax_dev))
> > + return -EOPNOTSUPP;
> > +
> > + return 0;
> > +}
> > +
> > int sync_inode(struct inode *inode, struct writeback_control *wbc);
> > int sync_inode_metadata(struct inode *inode, int wait);
> >
> > ---------
> >
> > Thanks,
> > Pankaj
> >
> >
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>
>

2019-04-04 10:52:25

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] xfs: disable map_sync for async flush


>
> On Thu, Apr 04, 2019 at 02:12:30AM -0400, Pankaj Gupta wrote:
> > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > >
> > > xfs_file_mmap(
> > > ....
> > > {
> > > struct inode *inode = file_inode(filp);
> > >
> > > if (vma->vm_flags & VM_SYNC) {
> > > if (!IS_DAX(inode))
> > > return -EOPNOTSUPP;
> > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > > return -EOPNOTSUPP;
> > > }
> > >
> > > file_accessed(filp);
> > > vma->vm_ops = &xfs_file_vm_ops;
> > > if (IS_DAX(inode))
> > > vma->vm_flags |= VM_HUGEPAGE;
> > > return 0;
> > > }
> >
> > Sure, this is better.
>
> > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > helper so that the filesystem code just doesn't have to care about
> > > the details of checking for DAX+MAP_SYNC support....
> >
> > o.k. Will add one common helper function for both ext4 & xfs filesystems.
>
> Note this pending patch for Goldwyn Rodrigues' patchset for btrfs:
>
> https://lore.kernel.org/linux-btrfs/[email protected]/
>
> We might want to coordinate.

Sure. Good to know.

Thanks for the pointer. Will have a look.

Best regards,
Pankaj

>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
> ⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
> ⠈⠳⣄⠀⠀⠀⠀
>

2019-04-04 15:01:30

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 5/5] xfs: disable map_sync for async flush

On Thu, Apr 04, 2019 at 06:08:44AM -0400, Pankaj Gupta wrote:
>
> > On Thu 04-04-19 05:09:10, Pankaj Gupta wrote:
> > >
> > > > > On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > > > > > Virtio pmem provides asynchronous host page cache flush
> > > > > > > mechanism. we don't support 'MAP_SYNC' with virtio pmem
> > > > > > > and xfs.
> > > > > > >
> > > > > > > Signed-off-by: Pankaj Gupta <[email protected]>
> > > > > > > ---
> > > > > > > fs/xfs/xfs_file.c | 8 ++++++++
> > > > > > > 1 file changed, 8 insertions(+)
> > > > > > >
> > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > > > > > --- a/fs/xfs/xfs_file.c
> > > > > > > +++ b/fs/xfs/xfs_file.c
> > > > > > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > > > > > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > > > > > return -EOPNOTSUPP;
> > > > > > >
> > > > > > > + /* We don't support synchronous mappings with DAX files if
> > > > > > > + * dax_device is not synchronous.
> > > > > > > + */
> > > > > > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > > > > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > > > > > + (vma->vm_flags & VM_SYNC))
> > > > > > > + return -EOPNOTSUPP;
> > > > > > > +
> > > > > > > file_accessed(filp);
> > > > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > > > if (IS_DAX(file_inode(filp)))
> > > > > >
> > > > > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > > > > >
> > > > > > xfs_file_mmap(
> > > > > > ....
> > > > > > {
> > > > > > struct inode *inode = file_inode(filp);
> > > > > >
> > > > > > if (vma->vm_flags & VM_SYNC) {
> > > > > > if (!IS_DAX(inode))
> > > > > > return -EOPNOTSUPP;
> > > > > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode))
> > > > > > return -EOPNOTSUPP;
> > > > > > }
> > > > > >
> > > > > > file_accessed(filp);
> > > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > > if (IS_DAX(inode))
> > > > > > vma->vm_flags |= VM_HUGEPAGE;
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > > > > helper so that the filesystem code just doesn't have to care about
> > > > > > the details of checking for DAX+MAP_SYNC support....
> > > > >
> > > > > Seconded, since ext4 has nearly the same flag validation logic.
> > > >
> > >
> > > Only issue with this I see is we need the helper function only for
> > > supported
> > > filesystems ext4 & xfs (right now). If I create the function in "fs.h" it
> > > will be compiled for every filesystem, even for those don't need it.
> > >
> > > Sample patch below, does below patch is near to what you have in mind?
> >
> > So I would put the helper in include/linux/dax.h and have it like:
> >
> > bool daxdev_mapping_supported(struct vm_area_struct *vma,

Should this be static inline if you're putting it in the header file?

A comment ought to be added to describe what this predicate function
does.

> > struct dax_device *dax_dev)
> > {
> > if (!(vma->vm_flags & VM_SYNC))
> > return true;
> > if (!IS_DAX(file_inode(vma->vm_file)))
> > return false;
> > return dax_synchronous(dax_dev);
> > }
>
> Sure. This is much better. I was also not sure what to name the helper function.
> I will go ahead with this unless 'Dave' & 'Darrick' have anything to add.

Jan's approach (modulo that one comment) looks good to me.

--D

> Thank you very much.
>
> Best regards,
> Pankaj
>
> >
> > Honza
> > >
> > > =================
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 1f2e2845eb76..614995170cac 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1196,12 +1196,17 @@ xfs_file_mmap(
> > > struct file *filp,
> > > struct vm_area_struct *vma)
> > > {
> > > + struct dax_device *dax_dev =
> > > xfs_find_daxdev_for_inode(file_inode(filp));
> > > +
> > > /*
> > > - * We don't support synchronous mappings for non-DAX files. At
> > > least
> > > - * until someone comes with a sensible use case.
> > > + * We don't support synchronous mappings for non-DAX files and
> > > + * for DAX files if underneath dax_device is not synchronous.
> > > */
> > > - if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > - return -EOPNOTSUPP;
> > > + if (vma->vm_flags & VM_SYNC) {
> > > + int err = is_synchronous(filp, dax_dev);
> > > + if (err)
> > > + return err;
> > > + }
> > >
> > > file_accessed(filp);
> > > vma->vm_ops = &xfs_file_vm_ops;
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 8b42df09b04c..add017de3dd7 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2162,6 +2162,20 @@ static inline void file_accessed(struct file *file)
> > > touch_atime(&file->f_path);
> > > }
> > >
> > > +struct dax_device;
> > > +extern bool dax_synchronous(struct dax_device *dax_dev);
> > > +static inline int is_synchronous(struct file *filp, struct dax_device
> > > *dax_dev)
> > > +{
> > > + struct inode *inode = file_inode(filp);
> > > +
> > > + if (!IS_DAX(inode))
> > > + return -EOPNOTSUPP;
> > > + if (!dax_synchronous(dax_dev))
> > > + return -EOPNOTSUPP;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > int sync_inode(struct inode *inode, struct writeback_control *wbc);
> > > int sync_inode_metadata(struct inode *inode, int wait);
> > >
> > > ---------
> > >
> > > Thanks,
> > > Pankaj
> > >
> > >
> > >
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
> >
> >

2019-04-04 15:50:10

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [Qemu-devel] [PATCH v4 5/5] xfs: disable map_sync for async flush


> > >
> > > So I would put the helper in include/linux/dax.h and have it like:
> > >
> > > bool daxdev_mapping_supported(struct vm_area_struct *vma,
>
> Should this be static inline if you're putting it in the header file?

yes. Thanks.

>
> A comment ought to be added to describe what this predicate function
> does.

Sure, will add a comment describing the function.

>
> > > struct dax_device *dax_dev)
> > > {
> > > if (!(vma->vm_flags & VM_SYNC))
> > > return true;
> > > if (!IS_DAX(file_inode(vma->vm_file)))
> > > return false;
> > > return dax_synchronous(dax_dev);
> > > }
> >
> > Sure. This is much better. I was also not sure what to name the helper
> > function.
> > I will go ahead with this unless 'Dave' & 'Darrick' have anything to add.
>
> Jan's approach (modulo that one comment) looks good to me.

Sure. Thank you.

Best regards,
Pankaj

>
> --D
>