2016-10-18 22:31:55

by Stephen Bates

Subject: [PATCH 0/3] iopmem : A block device for PCIe memory

This patch follows from an RFC we did earlier this year [1]. This
patchset applies cleanly to v4.9-rc1.

Updates since RFC
-----------------
Rebased.
Included the iopmem driver in the submission.

History
-------

There have been several attempts to upstream patchsets that enable
DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
style patches [3]. None have been successful to date. Haggai Eran
gives a nice overview of the prior art in this space in his cover
letter [3].

Motivation and Use Cases
------------------------

PCIe IO devices are getting faster. It is not uncommon now to find PCIe
network and storage devices that can generate and consume several GB/s.
Almost always these devices have either a high performance DMA engine, a
number of exposed PCIe BARs or both.

Until this patch, any high-performance transfer of information between
two PCIe devices has required the use of a staging buffer in system
memory. With this patch the bandwidth to system memory is not compromised
when high-throughput transfers occur between PCIe devices. This means
that more system memory bandwidth is available to the CPU cores for data
processing and manipulation. In addition, in systems where the two PCIe
devices reside behind a PCIe switch the datapath avoids the CPU
entirely.

Consumers
---------

We provide a PCIe device driver in an accompanying patch that can be
used to map any PCIe BAR into a DAX capable block device. For
non-persistent BARs this simply serves as an alternative to using
system memory bounce buffers. For persistent BARs this can serve as an
additional storage device in the system.
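As a sketch of the intended consumer workflow (the device node, mount
point and file names here are hypothetical, and assume the iopmem module
has bound to a device with a persistent BAR):

```shell
# Hypothetical workflow; /dev/iopmem0 is whatever node the driver created.
mkfs.ext4 /dev/iopmem0                  # put a DAX-capable filesystem on the BAR
mount -o dax /dev/iopmem0 /mnt/iopmem   # mount with DAX enabled
fallocate -l 1G /mnt/iopmem/p2p-buffer  # carve out a region for transfers
# An mmap() of /mnt/iopmem/p2p-buffer now yields addresses backed by the
# PCIe BAR, which can be handed to another PCIe device for p2p DMA.
```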

Testing and Performance
-----------------------

We have done a moderate amount of testing of this patch in a QEMU
environment and on real hardware. On real hardware we have observed
peer-to-peer writes of up to 4 GB/s and reads of up to 1.2 GB/s. In
both cases these numbers are limitations of our consumer hardware. In
addition, we have observed that CPU DRAM bandwidth is not impacted
when using iopmem, which is not the case when a traditional path
through system memory is taken.

For more information on the testing and performance results see the
GitHub site [4].

Known Issues
------------

1. Address Translation. Suggestions have been made that in certain
architectures and topologies the dma_addr_t passed to the DMA master
in a peer-2-peer transfer will not correctly route to the IO memory
intended. However in our testing to date we have not seen this to be
an issue, even in systems with IOMMUs and PCIe switches. It is our
understanding that an IOMMU only maps system memory and would not
interfere with device memory regions. (It certainly has no opportunity
to do so if the transfer gets routed through a switch).

2. Memory Segment Spacing. This patch has the same limitations that
ZONE_DEVICE does in that memory regions must be spaced at least
SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
BARs can be placed closer together than this. Thus ZONE_DEVICE would not
be usable on neighboring BARs. For our purposes, this is not an issue as
we'd only be looking at enabling a single BAR in a given PCIe device.
More exotic use cases may have problems with this.

3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
peer there is potential for coherency issues and for writes to occur out
of order. This is something that users of this feature need to be
cognizant of. Though really, this isn't much different than the
existing situation with things like RDMA: if userspace sets up an MR
for remote use, they need to be careful about using that memory region
themselves.

4. Architecture. Currently this patch is applicable only to x86_64
architectures. The same is true for much of the code pertaining to
PMEM and ZONE_DEVICE. It is hoped that the work will be extended to
other architectures over time.

References
----------
[1] https://patchwork.kernel.org/patch/8583221/
[2] http://comments.gmane.org/gmane.linux.drivers.rdma/21849
[3] http://www.spinics.net/lists/linux-rdma/msg38748.html
[4] https://github.com/sbates130272/zone-device

Logan Gunthorpe (1):
memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

Stephen Bates (2):
iopmem : Add a block device driver for PCIe attached IO memory.
iopmem : Add documentation for iopmem driver

Documentation/blockdev/00-INDEX | 2 +
Documentation/blockdev/iopmem.txt | 62 +++++++
MAINTAINERS | 7 +
drivers/block/Kconfig | 27 ++++
drivers/block/Makefile | 1 +
drivers/block/iopmem.c | 333 ++++++++++++++++++++++++++++++++++++++
drivers/dax/pmem.c | 4 +-
drivers/nvdimm/pmem.c | 4 +-
include/linux/memremap.h | 5 +-
kernel/memremap.c | 80 ++++++++-
tools/testing/nvdimm/test/iomap.c | 3 +-
11 files changed, 518 insertions(+), 10 deletions(-)
create mode 100644 Documentation/blockdev/iopmem.txt
create mode 100644 drivers/block/iopmem.c

--
2.1.4


2016-10-18 22:07:01

by Stephen Bates

Subject: [PATCH 3/3] iopmem : Add documentation for iopmem driver

Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
Documentation/blockdev/00-INDEX | 2 ++
Documentation/blockdev/iopmem.txt | 62 +++++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)
create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
floppy.txt
- notes and driver options for the floppy disk driver.
+iopmem.txt
+ - info on the iopmem block driver.
mflash.txt
- info on mGine m(g)flash driver for linux.
nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 0000000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===================
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+------------
+
+The iopmem module creates a DAX capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-----
+
+To include the iopmem module in your kernel please set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose of an iopmem block device is expected to be peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the following three things:
+
+1. Creating a DAX capable filesystem on the iopmem device.
+2. Creating some files on the DAX capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+ passed to other PCIe devices for p2p DMA operations.
+
+Issues
+------
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch).
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
--
2.1.4

2016-10-18 22:31:51

by Stephen Bates

Subject: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

From: Logan Gunthorpe <[email protected]>

We build on recent work that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and that adds struct page support for these new
regions of memory [2].

1. Add an extra flags argument to devm_memremap_pages to take in a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

2. For completeness, we add MEMREMAP_WT support to the memremap code;
however we have no actual need for this functionality.

3. We add the static functions add_zone_device_pages and
remove_zone_device_pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch specific, but are open to other opinions.

4. devm_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate ioremap function and the zone_device page functions
created above. A flags variable and kaddr pointer are added to struct
page_map to facilitate this for the release function. We also set up
the page attribute tables for the mapped region correctly based on the
desired mapping.

[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/dax/pmem.c | 4 +-
drivers/nvdimm/pmem.c | 4 +-
include/linux/memremap.h | 5 ++-
kernel/memremap.c | 80 +++++++++++++++++++++++++++++++++++++--
tools/testing/nvdimm/test/iomap.c | 3 +-
5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
#include <linux/memremap.h>
#include <linux/module.h>
#include <linux/pfn_t.h>
+#include <linux/pmem.h>
#include "../nvdimm/pfn.h"
#include "../nvdimm/nd.h"
#include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
if (rc)
return rc;

- addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+ addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+ ARCH_MEMREMAP_PMEM);
if (IS_ERR(addr))
return PTR_ERR(addr);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
- altmap);
+ altmap, ARCH_MEMREMAP_PMEM);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
res->start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
addr = devm_memremap_pages(dev, &nsio->res,
- &q->q_usage_counter, NULL);
+ &q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
pmem->pfn_flags |= PFN_MAP;
} else
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {

#ifdef CONFIG_ZONE_DEVICE
void *devm_memremap_pages(struct device *dev, struct resource *res,
- struct percpu_ref *ref, struct vmem_altmap *altmap);
+ struct percpu_ref *ref, struct vmem_altmap *altmap,
+ unsigned long flags);
struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
#else
static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
- struct vmem_altmap *altmap)
+ struct vmem_altmap *altmap, unsigned long flags)
{
/*
* Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+ PAGEMAP_IO_MEM = 1 << 0,
+};
+
struct page_map {
struct resource res;
struct percpu_ref *ref;
struct dev_pagemap pgmap;
struct vmem_altmap altmap;
+ void *kaddr;
+ int flags;
};

+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ struct zone *zone = pgdat->node_zones + ZONE_DEVICE;
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+
+ return __add_pages(nid, zone, start_pfn, nr_pages);
+}
+
+static void remove_zone_device_pages(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone;
+ int ret;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ ret = __remove_pages(zone, start_pfn, nr_pages);
+ WARN_ON_ONCE(ret);
+}
+
void get_zone_device_page(struct page *page)
{
percpu_ref_get(page->pgmap->ref);
@@ -246,9 +274,17 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
/* pages are dead and unused, undo the arch mapping */
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(resource_size(res), SECTION_SIZE);
- arch_remove_memory(align_start, align_size);
+
+ if (page_map->flags & PAGEMAP_IO_MEM) {
+ remove_zone_device_pages(align_start, align_size);
+ iounmap(page_map->kaddr);
+ } else {
+ arch_remove_memory(align_start, align_size);
+ }
+
untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
pgmap_radix_release(res);
+
dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
"%s: failed to free all reserved pages\n", __func__);
}
@@ -270,6 +306,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
* @res: "host memory" address range
* @ref: a live per-cpu reference count
* @altmap: optional descriptor for allocating the memmap from @res
+ * @flags: one of MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC;
+ * see memremap() for a description of the flags
*
* Notes:
* 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -280,7 +318,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
* this is not enforced.
*/
void *devm_memremap_pages(struct device *dev, struct resource *res,
- struct percpu_ref *ref, struct vmem_altmap *altmap)
+ struct percpu_ref *ref, struct vmem_altmap *altmap,
+ unsigned long flags)
{
resource_size_t key, align_start, align_size, align_end;
pgprot_t pgprot = PAGE_KERNEL;
@@ -288,6 +327,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
struct page_map *page_map;
int error, nid, is_ram;
unsigned long pfn;
+ void *addr = NULL;
+ enum page_cache_mode pcm;

align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -353,15 +394,44 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
if (nid < 0)
nid = numa_mem_id();

+ if (flags & MEMREMAP_WB)
+ pcm = _PAGE_CACHE_MODE_WB;
+ else if (flags & MEMREMAP_WT)
+ pcm = _PAGE_CACHE_MODE_WT;
+ else if (flags & MEMREMAP_WC)
+ pcm = _PAGE_CACHE_MODE_WC;
+ else
+ pcm = _PAGE_CACHE_MODE_WB;
+
+ pgprot = __pgprot(pgprot_val(pgprot) | cachemode2protval(pcm));
+
error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start), 0,
align_size);
if (error)
goto err_pfn_remap;

- error = arch_add_memory(nid, align_start, align_size, true);
+ if (flags & MEMREMAP_WB || !flags) {
+ error = arch_add_memory(nid, align_start, align_size, true);
+ addr = __va(res->start);
+ } else {
+ page_map->flags |= PAGEMAP_IO_MEM;
+ error = add_zone_device_pages(nid, align_start, align_size);
+ }
+
if (error)
goto err_add_memory;

+ if (!addr && (flags & MEMREMAP_WT))
+ addr = ioremap_wt(res->start, resource_size(res));
+
+ if (!addr && (flags & MEMREMAP_WC))
+ addr = ioremap_wc(res->start, resource_size(res));
+
+ if (!addr && page_map->flags & PAGEMAP_IO_MEM) {
+ remove_zone_device_pages(res->start, resource_size(res));
+ goto err_add_memory;
+ }
+
for_each_device_pfn(pfn, page_map) {
struct page *page = pfn_to_page(pfn);

@@ -374,8 +444,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
list_del(&page->lru);
page->pgmap = pgmap;
}
+
+ page_map->kaddr = addr;
devres_add(dev, page_map);
- return __va(res->start);
+ return addr;

err_add_memory:
untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index 3ccef73..b82fecb 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -17,6 +17,7 @@
#include <linux/module.h>
#include <linux/types.h>
#include <linux/pfn_t.h>
+#include <linux/pmem.h>
#include <linux/acpi.h>
#include <linux/io.h>
#include <linux/mm.h>
@@ -109,7 +110,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,

if (nfit_res)
return nfit_res->buf + offset - nfit_res->res.start;
- return devm_memremap_pages(dev, res, ref, altmap);
+ return devm_memremap_pages(dev, res, ref, altmap, ARCH_MEMREMAP_PMEM);
}
EXPORT_SYMBOL(__wrap_devm_memremap_pages);

--
2.1.4

2016-10-18 22:46:45

by Stephen Bates

Subject: [PATCH 2/3] iopmem : Add a block device driver for PCIe attached IO memory.

Add a new block device driver that binds to PCIe devices and turns
PCIe BARs into DAX capable block devices.

Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
MAINTAINERS | 7 ++
drivers/block/Kconfig | 27 ++++
drivers/block/Makefile | 1 +
drivers/block/iopmem.c | 333 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 368 insertions(+)
create mode 100644 drivers/block/iopmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1cd38a7..c379f9d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6510,6 +6510,13 @@ S: Maintained
F: Documentation/devicetree/bindings/iommu/
F: drivers/iommu/

+IOPMEM BLOCK DEVICE DRIVER
+M: Stephen Bates <[email protected]>
+L: [email protected]
+S: Maintained
+F: drivers/block/iopmem.c
+F: Documentation/blockdev/iopmem.txt
+
IP MASQUERADING
M: Juanjo Ciarlante <[email protected]>
S: Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 39dd30b..13ae1e7 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -537,4 +537,31 @@ config BLK_DEV_RSXX
To compile this driver as a module, choose M here: the
module will be called rsxx.

+config BLK_DEV_IOPMEM
+ tristate "Persistent block device backed by PCIe Memory"
+ depends on ZONE_DEVICE
+ default n
+ help
+ Say Y here if you want to include a generic device driver
+ that can create a block device from persistent PCIe attached
+ IO memory.
+
+ To compile this driver as a module, choose M here: The
+ module will be called iopmem. A block device will be created
+ for each PCIe attached device that matches the vendor and
+ device ID as specified in the module. Alternatively this
+ driver can be bound to any arbitrary PCIe function using the
+ sysfs bind entry.
+
+ This block device supports direct access (DAX) file systems
+ and supports struct page backing for the IO Memory. This
+ makes the underlying memory suitable for things like RDMA
+ Memory Regions and Direct IO which is useful for PCIe
+ peer-to-peer DMA operations.
+
+ Note that persistence is only assured if the memory on the
+ PCIe card has some form of power loss protection. This could
+ be provided via some form of battery, a supercap/NAND combo
+ or some exciting new persistent memory technology.
+
endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 1e9661e..1f4f69b 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_IOPMEM) += iopmem.o

skd-y := skd_main.o
swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/iopmem.c b/drivers/block/iopmem.c
new file mode 100644
index 0000000..4a1e693
--- /dev/null
+++ b/drivers/block/iopmem.c
@@ -0,0 +1,333 @@
+/*
+ * IOPMEM Block Device Driver
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/pmem.c.
+ * Copyright (c) 2014, Intel Corporation.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/pfn_t.h>
+#include <linux/memremap.h>
+
+static const int BAR_ID = 4;
+
+static struct pci_device_id iopmem_id_table[] = {
+ { PCI_DEVICE(0x11f8, 0xf115) },
+ { 0, }
+};
+MODULE_DEVICE_TABLE(pci, iopmem_id_table);
+
+struct iopmem_device {
+ struct request_queue *queue;
+ struct gendisk *disk;
+ struct device *dev;
+
+ int instance;
+
+ /* One contiguous memory region per device */
+ phys_addr_t phys_addr;
+ void *virt_addr;
+ size_t size;
+};
+
+ /*
+ * We can only access the iopmem device with full 64-bit word
+ * accesses, which cannot be guaranteed by a regular memcpy.
+ */
+
+static void memcpy_from_iopmem(void *dst, const void *src, size_t sz)
+{
+ u64 *wdst = dst;
+ const u64 *wsrc = src;
+ u64 tmp;
+
+ while (sz >= sizeof(*wdst)) {
+ *wdst++ = *wsrc++;
+ sz -= sizeof(*wdst);
+ }
+
+ if (!sz)
+ return;
+
+ tmp = *wsrc;
+ memcpy(wdst, &tmp, sz);
+}
+
+static void write_iopmem(void *iopmem_addr, struct page *page,
+ unsigned int off, unsigned int len)
+{
+ void *mem = kmap_atomic(page);
+
+ memcpy(iopmem_addr, mem + off, len);
+ kunmap_atomic(mem);
+}
+
+static void read_iopmem(struct page *page, unsigned int off,
+ void *iopmem_addr, unsigned int len)
+{
+ void *mem = kmap_atomic(page);
+
+ memcpy_from_iopmem(mem + off, iopmem_addr, len);
+ kunmap_atomic(mem);
+}
+
+static void iopmem_do_bvec(struct iopmem_device *iopmem, struct page *page,
+ unsigned int len, unsigned int off, bool is_write,
+ sector_t sector)
+{
+ phys_addr_t iopmem_off = sector * 512;
+ void *iopmem_addr = iopmem->virt_addr + iopmem_off;
+
+ if (!is_write) {
+ read_iopmem(page, off, iopmem_addr, len);
+ flush_dcache_page(page);
+ } else {
+ flush_dcache_page(page);
+ write_iopmem(iopmem_addr, page, off, len);
+ }
+}
+
+static blk_qc_t iopmem_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct iopmem_device *iopmem = q->queuedata;
+ struct bio_vec bvec;
+ struct bvec_iter iter;
+
+ bio_for_each_segment(bvec, bio, iter) {
+ iopmem_do_bvec(iopmem, bvec.bv_page, bvec.bv_len,
+ bvec.bv_offset, op_is_write(bio_op(bio)),
+ iter.bi_sector);
+ }
+
+ bio_endio(bio);
+ return BLK_QC_T_NONE;
+}
+
+static int iopmem_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, bool is_write)
+{
+ struct iopmem_device *iopmem = bdev->bd_queue->queuedata;
+
+ iopmem_do_bvec(iopmem, page, PAGE_SIZE, 0, is_write, sector);
+ page_endio(page, is_write, 0);
+ return 0;
+}
+
+static long iopmem_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, pfn_t *pfn, long size)
+{
+ struct iopmem_device *iopmem = bdev->bd_queue->queuedata;
+ resource_size_t offset = sector * 512;
+
+ if (!iopmem)
+ return -ENODEV;
+
+ *kaddr = iopmem->virt_addr + offset;
+ *pfn = phys_to_pfn_t(iopmem->phys_addr + offset, PFN_DEV | PFN_MAP);
+
+ return iopmem->size - offset;
+}
+
+static const struct block_device_operations iopmem_fops = {
+ .owner = THIS_MODULE,
+ .rw_page = iopmem_rw_page,
+ .direct_access = iopmem_direct_access,
+};
+
+static DEFINE_IDA(iopmem_instance_ida);
+static DEFINE_SPINLOCK(ida_lock);
+
+static int iopmem_set_instance(struct iopmem_device *iopmem)
+{
+ int instance, error;
+
+ do {
+ if (!ida_pre_get(&iopmem_instance_ida, GFP_KERNEL))
+ return -ENODEV;
+
+ spin_lock(&ida_lock);
+ error = ida_get_new(&iopmem_instance_ida, &instance);
+ spin_unlock(&ida_lock);
+
+ } while (error == -EAGAIN);
+
+ if (error)
+ return -ENODEV;
+
+ iopmem->instance = instance;
+ return 0;
+}
+
+static void iopmem_release_instance(struct iopmem_device *iopmem)
+{
+ spin_lock(&ida_lock);
+ ida_remove(&iopmem_instance_ida, iopmem->instance);
+ spin_unlock(&ida_lock);
+}
+
+static int iopmem_attach_disk(struct iopmem_device *iopmem)
+{
+ struct gendisk *disk;
+ int nid = dev_to_node(iopmem->dev);
+ struct request_queue *q = iopmem->queue;
+
+ blk_queue_write_cache(q, true, true);
+ blk_queue_make_request(q, iopmem_make_request);
+ blk_queue_physical_block_size(q, PAGE_SIZE);
+ blk_queue_max_hw_sectors(q, UINT_MAX);
+ blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
+ queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+ queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
+ q->queuedata = iopmem;
+
+ disk = alloc_disk_node(0, nid);
+ if (unlikely(!disk))
+ return -ENOMEM;
+
+ disk->fops = &iopmem_fops;
+ disk->queue = q;
+ disk->flags = GENHD_FL_EXT_DEVT;
+ sprintf(disk->disk_name, "iopmem%d", iopmem->instance);
+ set_capacity(disk, iopmem->size / 512);
+ iopmem->disk = disk;
+
+ device_add_disk(iopmem->dev, disk);
+ revalidate_disk(disk);
+
+ return 0;
+}
+
+static void iopmem_detach_disk(struct iopmem_device *iopmem)
+{
+ del_gendisk(iopmem->disk);
+ put_disk(iopmem->disk);
+}
+
+static int iopmem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct iopmem_device *iopmem;
+ struct device *dev;
+ int err = 0;
+ int nid = dev_to_node(&pdev->dev);
+
+ if (pci_enable_device_mem(pdev) < 0) {
+ dev_err(&pdev->dev, "unable to enable device!\n");
+ goto out;
+ }
+
+ iopmem = kzalloc(sizeof(*iopmem), GFP_KERNEL);
+ if (unlikely(!iopmem)) {
+ err = -ENOMEM;
+ goto out_disable_device;
+ }
+
+ iopmem->phys_addr = pci_resource_start(pdev, BAR_ID);
+ iopmem->size = pci_resource_end(pdev, BAR_ID) - iopmem->phys_addr + 1;
+ iopmem->dev = dev = get_device(&pdev->dev);
+ pci_set_drvdata(pdev, iopmem);
+
+ err = iopmem_set_instance(iopmem);
+ if (err)
+ goto out_put_device;
+
+ dev_info(dev, "bar space 0x%llx len %lld\n",
+ (unsigned long long) iopmem->phys_addr,
+ (unsigned long long) iopmem->size);
+
+ if (!devm_request_mem_region(dev, iopmem->phys_addr,
+ iopmem->size, dev_name(dev))) {
+ dev_warn(dev, "could not reserve region [0x%pa:0x%zx]\n",
+ &iopmem->phys_addr, iopmem->size);
+ err = -EBUSY;
+ goto out_release_instance;
+ }
+
+ iopmem->queue = blk_alloc_queue_node(GFP_KERNEL, nid);
+ if (!iopmem->queue) {
+ err = -ENOMEM;
+ goto out_release_instance;
+ }
+
+ iopmem->virt_addr = devm_memremap_pages(dev, &pdev->resource[BAR_ID],
+ &iopmem->queue->q_usage_counter,
+ NULL, MEMREMAP_WC);
+ if (IS_ERR(iopmem->virt_addr)) {
+ err = -ENXIO;
+ goto out_free_queue;
+ }
+
+ err = iopmem_attach_disk(iopmem);
+ if (err)
+ goto out_free_queue;
+
+ return 0;
+
+out_free_queue:
+ blk_cleanup_queue(iopmem->queue);
+out_release_instance:
+ iopmem_release_instance(iopmem);
+out_put_device:
+ put_device(&pdev->dev);
+ kfree(iopmem);
+out_disable_device:
+ pci_disable_device(pdev);
+out:
+ return err;
+}
+
+static void iopmem_remove(struct pci_dev *pdev)
+{
+ struct iopmem_device *iopmem = pci_get_drvdata(pdev);
+
+ blk_set_queue_dying(iopmem->queue);
+ iopmem_detach_disk(iopmem);
+ blk_cleanup_queue(iopmem->queue);
+ iopmem_release_instance(iopmem);
+ put_device(iopmem->dev);
+ kfree(iopmem);
+ pci_disable_device(pdev);
+}
+
+static struct pci_driver iopmem_pci_driver = {
+ .name = "iopmem",
+ .id_table = iopmem_id_table,
+ .probe = iopmem_probe,
+ .remove = iopmem_remove,
+};
+
+static int __init iopmem_init(void)
+{
+ int rc;
+
+ rc = pci_register_driver(&iopmem_pci_driver);
+ if (rc)
+ return rc;
+
+ pr_info("iopmem: module loaded\n");
+ return 0;
+}
+
+static void __exit iopmem_exit(void)
+{
+ pci_unregister_driver(&iopmem_pci_driver);
+ pr_info("iopmem: module unloaded\n");
+}
+
+MODULE_AUTHOR("Logan Gunthorpe <[email protected]>");
+MODULE_LICENSE("GPL");
+module_init(iopmem_init);
+module_exit(iopmem_exit);
--
2.1.4

2016-10-19 03:51:27

by Dan Williams

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

[ adding Ashok and David for potential iommu comments ]

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <[email protected]> wrote:
> This patch follows from an RFC we did earlier this year [1]. This
> patchset applies cleanly to v4.9-rc1.
>
> Updates since RFC
> -----------------
> Rebased.
> Included the iopmem driver in the submission.
>
> History
> -------
>
> There have been several attempts to upstream patchsets that enable
> DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
> style patches [3]. None have been successful to date. Haggai Eran
> gives a nice overview of the prior art in this space in his cover
> letter [3].
>
> Motivation and Use Cases
> ------------------------
>
> PCIe IO devices are getting faster. It is not uncommon now to find PCIe
> network and storage devices that can generate and consume several GB/s.
> Almost always these devices have either a high performance DMA engine, a
> number of exposed PCIe BARs or both.
>
> Until this patch, any high-performance transfer of information between
> two PCIe devices has required the use of a staging buffer in system
> memory. With this patch the bandwidth to system memory is not compromised
> when high-throughput transfers occur between PCIe devices. This means
> that more system memory bandwidth is available to the CPU cores for data
> processing and manipulation. In addition, in systems where the two PCIe
> devices reside behind a PCIe switch the datapath avoids the CPU
> entirely.

I agree with the motivation and the need for a solution, but I have
some questions about this implementation.

>
> Consumers
> ---------
>
> We provide a PCIe device driver in an accompanying patch that can be
> used to map any PCIe BAR into a DAX capable block device. For
> non-persistent BARs this simply serves as an alternative to using
> system memory bounce buffers. For persistent BARs this can serve as an
> additional storage device in the system.

Why block devices? I wonder if iopmem was initially designed back
when we were considering enabling DAX for raw block devices. However,
that support has since been ripped out / abandoned. You currently
need a filesystem on top of a block-device to get DAX operation.
Putting xfs or ext4 on top of a PCI-E memory-mapped range seems awkward
if all you want is a way to map the BAR for another PCI-E device in
the topology.

> If you're only using the block-device as an entry-point to create
dax-mappings then a device-dax (drivers/dax/) character-device might
be a better fit.

>
> Testing and Performance
> -----------------------
>
> We have done a moderate amount of testing of this patch in a QEMU
> environment and on real hardware. On real hardware we have observed
> peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
> both cases these numbers are limitations of our consumer hardware. In
> addition, we have observed that the CPU DRAM bandwidth is not impacted
> when using IOPMEM which is not the case when a traditional path
> through system memory is taken.
>
> For more information on the testing and performance results see the
> GitHub site [4].
>
> Known Issues
> ------------
>
> 1. Address Translation. Suggestions have been made that in certain
> architectures and topologies the dma_addr_t passed to the DMA master
> in a peer-2-peer transfer will not correctly route to the IO memory
> intended. However in our testing to date we have not seen this to be
> an issue, even in systems with IOMMUs and PCIe switches. It is our
> understanding that an IOMMU only maps system memory and would not
> interfere with device memory regions. (It certainly has no opportunity
> to do so if the transfer gets routed through a switch).
>

There may still be platforms where peer-to-peer cycles are routed up
through the root bridge and then back down to target device, but we
can address that when / if it happens. I wonder if we could (ab)use a
software-defined 'pasid' as the requester id for a peer-to-peer
mapping that needs address translation.

> 2. Memory Segment Spacing. This patch has the same limitations that
> ZONE_DEVICE does in that memory regions must be spaced at least
> SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> be usable on neighboring BARs. For our purposes, this is not an issue as
> we'd only be looking at enabling a single BAR in a given PCIe device.
> More exotic use cases may have problems with this.

I'm working on patches for 4.10 to allow mixing multiple
devm_memremap_pages() allocations within the same physical section.
Hopefully this won't be a problem going forward.

> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of. Though really, this isn't much different than the
> existing situation with things like RDMA: if userspace sets up an MR
> for remote use, they need to be careful about using that memory region
> themselves.

We might be able to mitigate this by indicating that the mapping is
busy for device-to-device transfers. But you're right we could wait
to see how much of a problem this is in practice.

>
> 4. Architecture. Currently this patch is applicable only to x86_64
> architectures. The same is true for much of the code pertaining to
> PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
> ARCH over time.
>
> References
> ----------
> [1] https://patchwork.kernel.org/patch/8583221/
> [2] http://comments.gmane.org/gmane.linux.drivers.rdma/21849
> [3] http://www.spinics.net/lists/linux-rdma/msg38748.html
> [4] https://github.com/sbates130272/zone-device
>
> Logan Gunthorpe (1):
> memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

I haven't yet grokked the motivation for this, but I'll go comment on
that separately.

2016-10-19 17:50:30

by Dan Williams

Subject: Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <[email protected]> wrote:
> From: Logan Gunthorpe <[email protected]>
>
> We build on recent work that adds memory regions owned by a device
> driver (ZONE_DEVICE) [1] and adds struct page support for these new
> regions of memory [2].
>
> 1. Add an extra flags argument into devm_memremap_pages to take in a
> MEMREMAP_XX argument. We update the existing calls to this function to
> reflect the change.
>
> 2. For completeness, we add MEMREMAP_WT support to memremap;
> however we have no actual need for this functionality.
>
> 3. We add the static functions add_zone_device_pages and
> remove_zone_device_pages. These are similar to arch_add_memory except
> they don't create the memory mapping. We don't believe these need to be
> made arch specific, but are open to other opinions.
>
> 4. devm_memremap_pages and devm_memremap_pages_release are updated to
> treat IO memory slightly differently. For IO memory we use a combination
> of the appropriate io_remap function and the zone_device pages functions
> created above. A flags variable and kaddr pointer are added to struct
> page_mem to facilitate this for the release function. We also set up
> the page attribute tables for the mapped region correctly based on the
> desired mapping.
>

This description says "what" is being done, but not "why".

In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
memory", it mentions that the lack of I/O coherency is a known issue
and users of this functionality need to be cognizant of the pitfalls.
If that is the case why do we need support for different cpu mapping
types than the default write-back cache setting? It's up to the
application to handle CPU cache flushing similar to what we require of
device-dax users in the persistent memory case.

2016-10-19 18:48:24

by Stephen Bates

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
> [ adding Ashok and David for potential iommu comments ]
>

Hi Dan

Thanks for adding Ashok and David!

>
> I agree with the motivation and the need for a solution, but I have
> some questions about this implementation.
>
> >
> > Consumers
> > ---------
> >
> > We provide a PCIe device driver in an accompanying patch that can be
> > used to map any PCIe BAR into a DAX capable block device. For
> > non-persistent BARs this simply serves as an alternative to using
> > system memory bounce buffers. For persistent BARs this can serve as an
> > additional storage device in the system.
>
> Why block devices? I wonder if iopmem was initially designed back
> when we were considering enabling DAX for raw block devices. However,
> that support has since been ripped out / abandoned. You currently
> need a filesystem on top of a block-device to get DAX operation.
> Putting xfs or ext4 on top of a PCI-E memory-mapped range seems awkward
> if all you want is a way to map the BAR for another PCI-E device in
> the topology.
>
> If you're only using the block-device as an entry-point to create
> dax-mappings then a device-dax (drivers/dax/) character-device might
> be a better fit.
>

We chose a block device because we felt it was intuitive for users to
carve up a memory region by putting a DAX filesystem on it and creating
files on that DAX-aware FS. It seemed like a convenient way to
partition up the region and to easily get the DMA address
for the memory backing the device.

That said, I would be very keen to get other people's thoughts on how
they would like to see this done. And I know some people have had some
reservations about using a DAX-mounted FS to do this in the past.

>
> > 2. Memory Segment Spacing. This patch has the same limitations that
> > ZONE_DEVICE does in that memory regions must be spaced at least
> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> > be usable on neighboring BARs. For our purposes, this is not an issue as
> > we'd only be looking at enabling a single BAR in a given PCIe device.
> > More exotic use cases may have problems with this.
>
> I'm working on patches for 4.10 to allow mixing multiple
> devm_memremap_pages() allocations within the same physical section.
> Hopefully this won't be a problem going forward.
>

Thanks Dan. Your patches will help address the problem of how to
partition a /dev/dax device but they don't help the case when BARs
themselves are small, closely spaced and non-segment-aligned. However
I think most people using iopmem will want to use reasonably large
BARs so I am not sure item 2 is that big of an issue.

> I haven't yet grokked the motivation for this, but I'll go comment on
> that separately.

Thanks Dan!

2016-10-19 19:03:13

by Stephen Bates

Subject: Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <[email protected]> wrote:
> > From: Logan Gunthorpe <[email protected]>
> >
> > We build on recent work that adds memory regions owned by a device
> > driver (ZONE_DEVICE) [1] and adds struct page support for these new
> > regions of memory [2].
> >
> > 1. Add an extra flags argument into devm_memremap_pages to take in a
> > MEMREMAP_XX argument. We update the existing calls to this function to
> > reflect the change.
> >
> > 2. For completeness, we add MEMREMAP_WT support to memremap;
> > however we have no actual need for this functionality.
> >
> > 3. We add the static functions add_zone_device_pages and
> > remove_zone_device_pages. These are similar to arch_add_memory except
> > they don't create the memory mapping. We don't believe these need to be
> > made arch specific, but are open to other opinions.
> >
> > 4. devm_memremap_pages and devm_memremap_pages_release are updated to
> > treat IO memory slightly differently. For IO memory we use a combination
> > of the appropriate io_remap function and the zone_device pages functions
> > created above. A flags variable and kaddr pointer are added to struct
> > page_mem to facilitate this for the release function. We also set up
> > the page attribute tables for the mapped region correctly based on the
> > desired mapping.
> >
>
> This description says "what" is being done, but not "why".

Hi Dan

We discuss the motivation in the cover letter.

>
> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
> memory", it mentions that the lack of I/O coherency is a known issue
> and users of this functionality need to be cognizant of the pitfalls.
> If that is the case why do we need support for different cpu mapping
> types than the default write-back cache setting? It's up to the
> application to handle CPU cache flushing similar to what we require of
> device-dax users in the persistent memory case.

Some of the iopmem hardware we have tested has certain alignment
restrictions on BAR accesses. At the very least we require write
combine mappings for these. We then felt it appropriate to add the
other mappings for the sake of completeness.

Cheers

Stephen

2016-10-19 19:58:34

by Dan Williams

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Wed, Oct 19, 2016 at 11:48 AM, Stephen Bates <[email protected]> wrote:
> On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
>> [ adding Ashok and David for potential iommu comments ]
>>
>
> Hi Dan
>
> Thanks for adding Ashok and David!
>
>>
>> I agree with the motivation and the need for a solution, but I have
>> some questions about this implementation.
>>
>> >
>> > Consumers
>> > ---------
>> >
>> > We provide a PCIe device driver in an accompanying patch that can be
>> > used to map any PCIe BAR into a DAX capable block device. For
>> > non-persistent BARs this simply serves as an alternative to using
>> > system memory bounce buffers. For persistent BARs this can serve as an
>> > additional storage device in the system.
>>
>> Why block devices? I wonder if iopmem was initially designed back
>> when we were considering enabling DAX for raw block devices. However,
>> that support has since been ripped out / abandoned. You currently
>> need a filesystem on top of a block-device to get DAX operation.
>> Putting xfs or ext4 on top of a PCI-E memory-mapped range seems awkward
>> if all you want is a way to map the BAR for another PCI-E device in
>> the topology.
>>
>> If you're only using the block-device as an entry-point to create
>> dax-mappings then a device-dax (drivers/dax/) character-device might
>> be a better fit.
>>
>
> We chose a block device because we felt it was intuitive for users to
> carve up a memory region by putting a DAX filesystem on it and creating
> files on that DAX-aware FS. It seemed like a convenient way to
> partition up the region and to easily get the DMA address
> for the memory backing the device.
>
> That said, I would be very keen to get other people's thoughts on how
> they would like to see this done. And I know some people have had some
> reservations about using a DAX-mounted FS to do this in the past.

I guess it depends on the expected size of these devices' BARs, but I
get the sense they may be smaller / more precious such that you
wouldn't want to spend capacity on filesystem metadata? For the target
use case is it assumed that these device BARs are always backed by
non-volatile memory? Otherwise this is a mkfs each boot for a
volatile device.

>>
>> > 2. Memory Segment Spacing. This patch has the same limitations that
>> > ZONE_DEVICE does in that memory regions must be spaced at least
>> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
>> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
>> > be usable on neighboring BARs. For our purposes, this is not an issue as
>> > we'd only be looking at enabling a single BAR in a given PCIe device.
>> > More exotic use cases may have problems with this.
>>
>> I'm working on patches for 4.10 to allow mixing multiple
>> devm_memremap_pages() allocations within the same physical section.
>> Hopefully this won't be a problem going forward.
>>
>
> Thanks Dan. Your patches will help address the problem of how to
> partition a /dev/dax device but they don't help the case when BARs
> themselves are small, closely spaced and non-segment-aligned. However
> I think most people using iopmem will want to use reasonably large
> BARs so I am not sure item 2 is that big of an issue.

I think you might have misunderstood what I'm proposing. The patches
I'm working on are separate from a facility to carve up a /dev/dax
device. The effort is to allow devm_memremap_pages() to maintain
several allocations within the same 128MB section. I need this for
persistent memory to handle platforms that mix pmem and system-ram in
the same section. I want to be able to map ZONE_DEVICE pages for a
portion of a section and be able to remove portions of a section that
may collide with allocations of a different lifetime.

2016-10-19 20:01:09

by Dan Williams

Subject: Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

On Wed, Oct 19, 2016 at 11:40 AM, Stephen Bates <[email protected]> wrote:
> On Wed, Oct 19, 2016 at 10:50:25AM -0700, Dan Williams wrote:
>> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <[email protected]> wrote:
>> > From: Logan Gunthorpe <[email protected]>
>> >
>> > We build on recent work that adds memory regions owned by a device
>> > driver (ZONE_DEVICE) [1] and adds struct page support for these new
>> > regions of memory [2].
>> >
>> > 1. Add an extra flags argument into devm_memremap_pages to take in a
>> > MEMREMAP_XX argument. We update the existing calls to this function to
>> > reflect the change.
>> >
>> > 2. For completeness, we add MEMREMAP_WT support to memremap;
>> > however we have no actual need for this functionality.
>> >
>> > 3. We add the static functions add_zone_device_pages and
>> > remove_zone_device_pages. These are similar to arch_add_memory except
>> > they don't create the memory mapping. We don't believe these need to be
>> > made arch specific, but are open to other opinions.
>> >
>> > 4. devm_memremap_pages and devm_memremap_pages_release are updated to
>> > treat IO memory slightly differently. For IO memory we use a combination
>> > of the appropriate io_remap function and the zone_device pages functions
>> > created above. A flags variable and kaddr pointer are added to struct
>> > page_mem to facilitate this for the release function. We also set up
>> > the page attribute tables for the mapped region correctly based on the
>> > desired mapping.
>> >
>>
>> This description says "what" is being done, but not "why".
>
> Hi Dan
>
> We discuss the motivation in the cover letter.
>
>>
>> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
>> memory", it mentions that the lack of I/O coherency is a known issue
>> and users of this functionality need to be cognizant of the pitfalls.
>> If that is the case why do we need support for different cpu mapping
>> types than the default write-back cache setting? It's up to the
>> application to handle CPU cache flushing similar to what we require of
>> device-dax users in the persistent memory case.
>
> Some of the iopmem hardware we have tested has certain alignment
> restrictions on BAR accesses. At the very least we require write
> combine mappings for these. We then felt it appropriate to add the
> other mappings for the sake of completeness.

If the device can support write-combine then it can support bursts, so
I wonder why it couldn't support read bursts for cache fills...

2016-10-19 23:18:32

by Stephen Bates

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

> >>
> >> If you're only using the block-device as an entry-point to create
> >> dax-mappings then a device-dax (drivers/dax/) character-device might
> >> be a better fit.
> >>
> >
> > We chose a block device because we felt it was intuitive for users to
> > carve up a memory region by putting a DAX filesystem on it and creating
> > files on that DAX-aware FS. It seemed like a convenient way to
> > partition up the region and to easily get the DMA address
> > for the memory backing the device.
> >
> > That said, I would be very keen to get other people's thoughts on how
> > they would like to see this done. And I know some people have had some
> > reservations about using a DAX-mounted FS to do this in the past.
>
> I guess it depends on the expected size of these devices' BARs, but I
> get the sense they may be smaller / more precious such that you
> wouldn't want to spend capacity on filesystem metadata? For the target
> use case is it assumed that these device BARs are always backed by
> non-volatile memory? Otherwise this is a mkfs each boot for a
> volatile device.

Dan

Fair point and this is a concern I share. We are not assuming that all
iopmem devices are backed by non-volatile memory so the mkfs
recreation comment is valid. All in all I think you are persuading us
to take a look at /dev/dax ;-). I will see if anyone else chips in
with their thoughts on this.

>
> >>
> >> > 2. Memory Segment Spacing. This patch has the same limitations that
> >> > ZONE_DEVICE does in that memory regions must be spaced at least
> >> > SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> >> > BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> >> > be usable on neighboring BARs. For our purposes, this is not an issue as
> >> > we'd only be looking at enabling a single BAR in a given PCIe device.
> >> > More exotic use cases may have problems with this.
> >>
> >> I'm working on patches for 4.10 to allow mixing multiple
> >> devm_memremap_pages() allocations within the same physical section.
> >> Hopefully this won't be a problem going forward.
> >>
> >
> > Thanks Dan. Your patches will help address the problem of how to
> > partition a /dev/dax device but they don't help the case when BARs
> > themselves are small, closely spaced and non-segment-aligned. However
> > I think most people using iopmem will want to use reasonably large
> > BARs so I am not sure item 2 is that big of an issue.
>
> I think you might have misunderstood what I'm proposing. The patches
> I'm working on are separate from a facility to carve up a /dev/dax
> device. The effort is to allow devm_memremap_pages() to maintain
> several allocations within the same 128MB section. I need this for
> persistent memory to handle platforms that mix pmem and system-ram in
> the same section. I want to be able to map ZONE_DEVICE pages for a
> portion of a section and be able to remove portions of a section that
> may collide with allocations of a different lifetime.

Oh I did misunderstand. This is very cool and would be useful to us.
One more reason to consider moving to /dev/dax in the next spin of
this patchset ;-).

Thanks

Stephen

2016-10-20 23:33:26

by Dave Chinner

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Wed, Oct 19, 2016 at 12:48:14PM -0600, Stephen Bates wrote:
> On Tue, Oct 18, 2016 at 08:51:15PM -0700, Dan Williams wrote:
> > [ adding Ashok and David for potential iommu comments ]
> >
>
> Hi Dan
>
> Thanks for adding Ashok and David!
>
> >
> > I agree with the motivation and the need for a solution, but I have
> > some questions about this implementation.
> >
> > >
> > > Consumers
> > > ---------
> > >
> > > We provide a PCIe device driver in an accompanying patch that can be
> > > used to map any PCIe BAR into a DAX capable block device. For
> > > non-persistent BARs this simply serves as an alternative to using
> > > system memory bounce buffers. For persistent BARs this can serve as an
> > > additional storage device in the system.
> >
> > Why block devices? I wonder if iopmem was initially designed back
> > when we were considering enabling DAX for raw block devices. However,
> > that support has since been ripped out / abandoned. You currently
> > need a filesystem on top of a block-device to get DAX operation.
> > Putting xfs or ext4 on top of a PCI-E memory-mapped range seems awkward
> > if all you want is a way to map the BAR for another PCI-E device in
> > the topology.
> >
> > If you're only using the block-device as an entry-point to create
> > dax-mappings then a device-dax (drivers/dax/) character-device might
> > be a better fit.
> >
>
> We chose a block device because we felt it was intuitive for users to
> carve up a memory region by putting a DAX filesystem on it and creating
> files on that DAX-aware FS. It seemed like a convenient way to
> partition up the region and to easily get the DMA address
> for the memory backing the device.

You do realise that local filesystems can silently change the
location of file data at any point in time, so there is no such
thing as a "stable mapping" of file data to block device addresses
in userspace?

If you want remote access to the blocks owned and controlled by a
filesystem, then you need to use a filesystem with a remote locking
mechanism to allow co-ordinated, coherent access to the data in
those blocks. Anything else is just asking for ongoing, unfixable
filesystem corruption or data leakage problems (i.e. security
issues).

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-21 09:57:24

by Christoph Hellwig

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> You do realise that local filesystems can silently change the
> location of file data at any point in time, so there is no such
> thing as a "stable mapping" of file data to block device addresses
> in userspace?
>
> If you want remote access to the blocks owned and controlled by a
> filesystem, then you need to use a filesystem with a remote locking
> mechanism to allow co-ordinated, coherent access to the data in
> those blocks. Anything else is just asking for ongoing, unfixable
> filesystem corruption or data leakage problems (i.e. security
> issues).

And at least for XFS we have such a mechanism :) E.g. I have a
prototype of a pNFS layout that uses XFS+DAX to allow clients to do
RDMA directly to XFS files, with the same locking mechanism we use
for the current block and scsi layout in xfs_pnfs.c.

2016-10-21 11:13:01

by Dave Chinner

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > You do realise that local filesystems can silently change the
> > location of file data at any point in time, so there is no such
> > thing as a "stable mapping" of file data to block device addresses
> > in userspace?
> >
> > If you want remote access to the blocks owned and controlled by a
> > filesystem, then you need to use a filesystem with a remote locking
> > mechanism to allow co-ordinated, coherent access to the data in
> > those blocks. Anything else is just asking for ongoing, unfixable
> > filesystem corruption or data leakage problems (i.e. security
> > issues).
>
> And at least for XFS we have such a mechanism :) E.g. I have a
> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> RDMA directly to XFS files, with the same locking mechanism we use
> for the current block and scsi layout in xfs_pnfs.c.

Oh, that's good to know - pNFS over XFS was exactly what I was
thinking of when I wrote my earlier reply. A few months ago someone
else was trying to use file mappings in userspace for direct remote
client access on fabric connected devices. I told them "pNFS on XFS
and write an efficient transport for your hardware"....

Now that I know we've got RDMA support for pNFS on XFS in the
pipeline, I can just tell them "just write an rdma driver for your
hardware" instead. :P

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-25 11:54:18

by Stephen Bates

Subject: Re: [PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

On Wed, Oct 19, 2016 at 01:01:06PM -0700, Dan Williams wrote:
> >>
> >> In the cover letter, "[PATCH 0/3] iopmem : A block device for PCIe
> >> memory", it mentions that the lack of I/O coherency is a known issue
> >> and users of this functionality need to be cognizant of the pitfalls.
> >> If that is the case why do we need support for different cpu mapping
> >> types than the default write-back cache setting? It's up to the
> >> application to handle CPU cache flushing similar to what we require of
> >> device-dax users in the persistent memory case.
> >
> > Some of the iopmem hardware we have tested has certain alignment
> > restrictions on BAR accesses. At the very least we require write
> > combine mappings for these. We then felt it appropriate to add the
> > other mappings for the sake of completeness.
>
> If the device can support write-combine then it can support bursts, so
> I wonder why it couldn't support read bursts for cache fills...

Dan

You make a good point. We did some testing on this and for the HW we
have access to we did see that a standard WB mapping worked.
Interestingly, though, the local access performance was much slower than
for the WC mapping. We also noticed the PAT entries were not marked
correctly in the WB case. I am trying to get access to some other HW
for more testing.

Grepping for the address of interest:

In WB mode it's:

uncached-minus @ 0x381f80000000-0x381fc0000000

In WC mode it's:

write-combining @ 0x381f80000000-0x381fc0000000

Cheers

Stephen

2016-10-25 12:38:02

by Stephen Bates

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

Hi Dave and Christoph

On Fri, Oct 21, 2016 at 10:12:53PM +1100, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > > You do realise that local filesystems can silently change the
> > > location of file data at any point in time, so there is no such
> > > thing as a "stable mapping" of file data to block device addresses
> > > in userspace?
> > >
> > > If you want remote access to the blocks owned and controlled by a
> > > filesystem, then you need to use a filesystem with a remote locking
> > > mechanism to allow co-ordinated, coherent access to the data in
> > > those blocks. Anything else is just asking for ongoing, unfixable
> > > filesystem corruption or data leakage problems (i.e. security
> > > issues).
> >

Dave are you saying that even for local mappings of files on a DAX
capable system it is possible for the mappings to move on you unless
the FS supports locking? Does that not mean DAX on such FS is
inherently broken?

> > And at least for XFS we have such a mechanism :) E.g. I have a
> > prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> > RDMA directly to XFS files, with the same locking mechanism we use
> > for the current block and scsi layout in xfs_pnfs.c.
>

Thanks for fixing this issue on XFS Christoph! I assume this problem
continues to exist on the other DAX-capable filesystems?

One more reason to consider a move to /dev/dax I guess ;-)...

Stephen


> Oh, that's good to know - pNFS over XFS was exactly what I was
> thinking of when I wrote my earlier reply. A few months ago someone
> else was trying to use file mappings in userspace for direct remote
> client access on fabric connected devices. I told them "pNFS on XFS
> and write an efficient transport for your hardware"....
>
> Now that I know we've got RDMA support for pNFS on XFS in the
> pipeline, I can just tell them "just write an rdma driver for your
> hardware" instead. :P
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2016-10-25 21:19:10

by Dave Chinner

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote:
> Hi Dave and Christoph
>
> On Fri, Oct 21, 2016 at 10:12:53PM +1100, Dave Chinner wrote:
> > On Fri, Oct 21, 2016 at 02:57:14AM -0700, Christoph Hellwig wrote:
> > > On Fri, Oct 21, 2016 at 10:22:39AM +1100, Dave Chinner wrote:
> > > > You do realise that local filesystems can silently change the
> > > > location of file data at any point in time, so there is no such
> > > > thing as a "stable mapping" of file data to block device addresses
> > > > in userspace?
> > > >
> > > > If you want remote access to the blocks owned and controlled by a
> > > > filesystem, then you need to use a filesystem with a remote locking
> > > > mechanism to allow co-ordinated, coherent access to the data in
> > > > those blocks. Anything else is just asking for ongoing, unfixable
> > > > filesystem corruption or data leakage problems (i.e. security
> > > > issues).
> > >
>
> Dave are you saying that even for local mappings of files on a DAX
> capable system it is possible for the mappings to move on you unless
> the FS supports locking?

Yes.

> Does that not mean DAX on such FS is
> inherently broken?

No. DAX is accessed through a virtual mapping layer that abstracts
the physical location from userspace applications.

Example: think copy-on-write overwrites. It occurs atomically from
the perspective of userspace and starts by invalidating any current
mappings userspace has of that physical location. The location is
changed, the data copied in, and then when the locks are released
userspace can fault in a new page table mapping on the next
access....

> > > And at least for XFS we have such a mechanism :) E.g. I have a
> > > prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> > > RDMA directly to XFS files, with the same locking mechanism we use
> > > for the current block and scsi layout in xfs_pnfs.c.
>
> Thanks for fixing this issue on XFS Christoph! I assume this problem
> continues to exist on the other DAX capable FS?

Yes, but if they implement the exportfs API that supplies this
capability, they'll be able to use pNFS, too.

> One more reason to consider a move to /dev/dax I guess ;-)...

That doesn't get rid of the need for sane access control arbitration
across all machines that are directly accessing the storage. That's
the problem pNFS solves, regardless of whether your direct access
target is a filesystem, a block device or object storage...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-26 08:40:13

by Haggai Eran

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On 10/19/2016 6:51 AM, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates <[email protected]> wrote:
>> 1. Address Translation. Suggestions have been made that in certain
>> architectures and topologies the dma_addr_t passed to the DMA master
>> in a peer-2-peer transfer will not correctly route to the IO memory
>> intended. However in our testing to date we have not seen this to be
>> an issue, even in systems with IOMMUs and PCIe switches. It is our
>> understanding that an IOMMU only maps system memory and would not
>> interfere with device memory regions.
I'm not sure that's the case. I think it works because with ZONE_DEVICE,
the iommu driver will simply treat the PFN in a dma_map_page call like
any other PFN, and create a mapping as it does for any memory page.

>> (It certainly has no opportunity
>> to do so if the transfer gets routed through a switch).
It can still go through the IOMMU if you enable ACS upstream forwarding.

> There may still be platforms where peer-to-peer cycles are routed up
> through the root bridge and then back down to target device, but we
> can address that when / if it happens.
I agree.

> I wonder if we could (ab)use a
> software-defined 'pasid' as the requester id for a peer-to-peer
> mapping that needs address translation.
Why would you need that? Isn't it enough to map the peer-to-peer
addresses correctly in the iommu driver?

Haggai

2016-10-26 13:39:48

by Dan Williams

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Wed, Oct 26, 2016 at 1:24 AM, Haggai Eran <[email protected]> wrote:
[..]
>> I wonder if we could (ab)use a
>> software-defined 'pasid' as the requester id for a peer-to-peer
>> mapping that needs address translation.
> Why would you need that? Isn't it enough to map the peer-to-peer
> addresses correctly in the iommu driver?
>

You're right, we might already have enough...

We would just need to audit iommu drivers to undo any assumptions that
the page being mapped is always in host memory and apply any bus
address translations between source device and target device.

2016-10-27 13:50:54

by Christoph Hellwig

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Thu, Oct 27, 2016 at 01:22:49PM +0300, Sagi Grimberg wrote:
> Christoph, did you manage to leap to the future and solve the
> RDMA persistency hole? :)
>
> e.g. what happens with O_DSYNC in this model? Or you did
> a message exchange for commits?

Yes, pNFS calls this the layoutcommit. That being said, once we get an RDMA
commit or flush operation we could easily make the layoutcommit optional
for some operations. There already is precedent for this in the
flexfiles layout specification.

2016-10-27 14:39:05

by Sagi Grimberg

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory


>> You do realise that local filesystems can silently change the
>> location of file data at any point in time, so there is no such
>> thing as a "stable mapping" of file data to block device addresses
>> in userspace?
>>
>> If you want remote access to the blocks owned and controlled by a
>> filesystem, then you need to use a filesystem with a remote locking
>> mechanism to allow co-ordinated, coherent access to the data in
>> those blocks. Anything else is just asking for ongoing, unfixable
>> filesystem corruption or data leakage problems (i.e. security
>> issues).
>
> And at least for XFS we have such a mechanism :) E.g. I have a
> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
> RDMA directly to XFS files, with the same locking mechanism we use
> for the current block and scsi layout in xfs_pnfs.c.

Christoph, did you manage to leap to the future and solve the
RDMA persistency hole? :)

e.g. what happens with O_DSYNC in this model? Or you did
a message exchange for commits?

2016-10-28 06:46:02

by Christoph Hellwig

Subject: Re: [PATCH 2/3] iopmem : Add a block device driver for PCIe attached IO memory.

> Signed-off-by: Stephen Bates <[email protected]>

FYI, that address has bounced throughout the whole thread for me,
replacing it with a known good one for now.


> + * This driver is heavily based on drivers/block/pmem.c.
> + * Copyright (c) 2014, Intel Corporation.
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.

Is there anything left of it actually? I didn't spot anything
obvious. Nevermind that we don't have a file with that name anymore :)

> + /*
> + * We can only access the iopmem device with full 32-bit word
> + * accesses which cannot be gaurantee'd by the regular memcpy
> + */

Odd comment formatting.

> +static void memcpy_from_iopmem(void *dst, const void *src, size_t sz)
> +{
> + u64 *wdst = dst;
> + const u64 *wsrc = src;
> + u64 tmp;
> +
> + while (sz >= sizeof(*wdst)) {
> + *wdst++ = *wsrc++;
> + sz -= sizeof(*wdst);
> + }
> +
> + if (!sz)
> + return;
> +
> + tmp = *wsrc;
> + memcpy(wdst, &tmp, sz);
> +}

And then we do a memcpy here anyway. And no volatile whatsoever, so
the compiler could do anything to it. I definitely feel a bit uneasy
about having this in the driver as well. Can we define the exact
semantics for this and define it by the system, possibly in an arch
specific way?

> +static void iopmem_do_bvec(struct iopmem_device *iopmem, struct page *page,
> + unsigned int len, unsigned int off, bool is_write,
> + sector_t sector)
> +{
> + phys_addr_t iopmem_off = sector * 512;
> + void *iopmem_addr = iopmem->virt_addr + iopmem_off;
> +
> + if (!is_write) {
> + read_iopmem(page, off, iopmem_addr, len);
> + flush_dcache_page(page);
> + } else {
> + flush_dcache_page(page);
> + write_iopmem(iopmem_addr, page, off, len);
> + }

How about moving the address and offset calculation as well as the
cache flushing into read_iopmem/write_iopmem and removing this function?

> +static blk_qc_t iopmem_make_request(struct request_queue *q, struct bio *bio)
> +{
> + struct iopmem_device *iopmem = q->queuedata;
> + struct bio_vec bvec;
> + struct bvec_iter iter;
> +
> + bio_for_each_segment(bvec, bio, iter) {
> + iopmem_do_bvec(iopmem, bvec.bv_page, bvec.bv_len,
> + bvec.bv_offset, op_is_write(bio_op(bio)),
> + iter.bi_sector);

op_is_write just checks the data direction. I'd feel much more
comfortable with a switch on the op, e.g.

switch (bio_op(bio)) {
case REQ_OP_READ:
	bio_for_each_segment(bvec, bio, iter)
		read_iopmem(iopmem, bvec, iter.bi_sector);
	break;
case REQ_OP_WRITE:
	bio_for_each_segment(bvec, bio, iter)
		write_iopmem(iopmem, bvec, iter.bi_sector);
	break;
default:
	WARN_ON_ONCE(1);
	bio->bi_error = -EIO;
	break;
}


> +static long iopmem_direct_access(struct block_device *bdev, sector_t sector,
> + void **kaddr, pfn_t *pfn, long size)
> +{
> + struct iopmem_device *iopmem = bdev->bd_queue->queuedata;
> + resource_size_t offset = sector * 512;
> +
> + if (!iopmem)
> + return -ENODEV;

I don't think this can ever happen, can it?

> +static DEFINE_IDA(iopmem_instance_ida);
> +static DEFINE_SPINLOCK(ida_lock);
> +
> +static int iopmem_set_instance(struct iopmem_device *iopmem)
> +{
> + int instance, error;
> +
> + do {
> + if (!ida_pre_get(&iopmem_instance_ida, GFP_KERNEL))
> + return -ENODEV;
> +
> + spin_lock(&ida_lock);
> + error = ida_get_new(&iopmem_instance_ida, &instance);
> + spin_unlock(&ida_lock);
> +
> + } while (error == -EAGAIN);
> +
> + if (error)
> + return -ENODEV;
> +
> + iopmem->instance = instance;
> + return 0;
> +}
> +
> +static void iopmem_release_instance(struct iopmem_device *iopmem)
> +{
> + spin_lock(&ida_lock);
> + ida_remove(&iopmem_instance_ida, iopmem->instance);
> + spin_unlock(&ida_lock);
> +}
> +

Just use ida_simple_get/ida_simple_remove instead to take care
of the locking and preloading, and get rid of these two functions.
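For reference, the suggested replacement collapses the two helpers to something like this (a sketch against the 4.9-era kernel API, not compiled here):

```c
static int iopmem_set_instance(struct iopmem_device *iopmem)
{
	/* ida_simple_get() handles preloading and locking internally,
	 * so the retry loop, the spinlock, and ida_pre_get() all go away. */
	int instance = ida_simple_get(&iopmem_instance_ida, 0, 0, GFP_KERNEL);

	if (instance < 0)
		return instance;

	iopmem->instance = instance;
	return 0;
}

static void iopmem_release_instance(struct iopmem_device *iopmem)
{
	ida_simple_remove(&iopmem_instance_ida, iopmem->instance);
}
```

Note this also propagates the real error code instead of flattening everything to -ENODEV.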


> +static int iopmem_attach_disk(struct iopmem_device *iopmem)
> +{
> + struct gendisk *disk;
> + int nid = dev_to_node(iopmem->dev);
> + struct request_queue *q = iopmem->queue;
> +
> + blk_queue_write_cache(q, true, true);

You don't handle flush commands or the fua bit in make_request, so
this setting seems wrong.

> + int err = 0;
> + int nid = dev_to_node(&pdev->dev);
> +
> + if (pci_enable_device_mem(pdev) < 0) {

propagate the actual error code, please.

2016-10-28 06:46:42

by Christoph Hellwig

Subject: Re: [PATCH 3/3] iopmem : Add documentation for iopmem driver

I'd say please fold this into the previous patch.

2016-10-28 20:04:11

by Logan Gunthorpe

Subject: Re: [PATCH 2/3] iopmem : Add a block device driver for PCIe attached IO memory.

Hi Christoph,

Thanks so much for the detailed review of the code! Even though, by the
sounds of things, we will be moving to device DAX and most of this is
moot, it's still great to get some feedback and learn a few things.

I've given some responses below.

On 28/10/16 12:45 AM, Christoph Hellwig wrote:
>> + * This driver is heavily based on drivers/block/pmem.c.
>> + * Copyright (c) 2014, Intel Corporation.
>> + * Copyright (C) 2007 Nick Piggin
>> + * Copyright (C) 2007 Novell Inc.
>
> Is there anything left of it actually? I didn't spot anything
> obvious. Nevermind that we don't have a file with that name anymore :)

Yes, actually there are still a lot of similarities with the current
pmem.c. Though, yes, the path was an oversight. Some of this code is
getting pretty old (it started from an out-of-tree version of pmem.c)
and we've tried our best to track as many of the changes to the pmem.c
as possible. This proved to be difficult. Note: this is now the nvdimm
pmem and not the dax pmem (drivers/nvdimm/pmem.c)

>> + /*
>> + * We can only access the iopmem device with full 32-bit word
>> + * accesses which cannot be gaurantee'd by the regular memcpy
>> + */
>
> Odd comment formatting.

Oops. I'm surprised checkpatch didn't pick up on that.

>
>> +static void memcpy_from_iopmem(void *dst, const void *src, size_t sz)
>> +{
>> + u64 *wdst = dst;
>> + const u64 *wsrc = src;
>> + u64 tmp;
>> +
>> + while (sz >= sizeof(*wdst)) {
>> + *wdst++ = *wsrc++;
>> + sz -= sizeof(*wdst);
>> + }
>> +
>> + if (!sz)
>> + return;
>> +
>> + tmp = *wsrc;
>> + memcpy(wdst, &tmp, sz);
>> +}
>
> And then we do a memcpy here anyway. And no volatile whatsoever, so
> the compiler could do anything to it. I definitely feel a bit uneasy
> about having this in the driver as well. Can we define the exact
> semantics for this and define it by the system, possibly in an arch
> specific way?

Yeah, you're right. We should have reviewed this function a bit more.
Anyway, I'd be interested in learning a better approach to forcing a
copy from a mapped BAR with larger widths.
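One existing option worth noting here: the kernel already provides arch-defined helpers for copying to and from mapped I/O memory. A sketch (untested; whether a given arch's implementation guarantees the 32-bit-or-wider accesses this device needs would have to be verified per arch):

```c
#include <linux/io.h>

/* Sketch: lean on the arch-provided MMIO copy helpers rather than a
 * hand-rolled word loop. These take __iomem pointers and are defined
 * per architecture, so the access-width semantics live with the arch
 * code instead of the driver. */
static void iopmem_read_bar(void *dst, const void __iomem *bar, size_t sz)
{
	memcpy_fromio(dst, bar, sz);
}

static void iopmem_write_bar(void __iomem *bar, const void *src, size_t sz)
{
	memcpy_toio(bar, src, sz);
}
```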


>> +static void iopmem_do_bvec(struct iopmem_device *iopmem, struct page *page,
>> + unsigned int len, unsigned int off, bool is_write,
>> + sector_t sector)
>> +{
>> + phys_addr_t iopmem_off = sector * 512;
>> + void *iopmem_addr = iopmem->virt_addr + iopmem_off;
>> +
>> + if (!is_write) {
>> + read_iopmem(page, off, iopmem_addr, len);
>> + flush_dcache_page(page);
>> + } else {
>> + flush_dcache_page(page);
>> + write_iopmem(iopmem_addr, page, off, len);
>> + }
>
> How about moving the address and offset calculation as well as the
> cache flushing into read_iopmem/write_iopmem and removing this function?

Could do. This was copied from the existing pmem.c and once the bad_pmem
stuff was stripped out this function became relatively simple.


>
>> +static blk_qc_t iopmem_make_request(struct request_queue *q, struct bio *bio)
>> +{
>> + struct iopmem_device *iopmem = q->queuedata;
>> + struct bio_vec bvec;
>> + struct bvec_iter iter;
>> +
>> + bio_for_each_segment(bvec, bio, iter) {
>> + iopmem_do_bvec(iopmem, bvec.bv_page, bvec.bv_len,
>> + bvec.bv_offset, op_is_write(bio_op(bio)),
>> + iter.bi_sector);
>
> op_is_write just checks the data direction. I'd feel much more
> comfortable with a switch on the op, e.g.

That makes sense. This was also copied from pmem.c, so this same change
may make sense there too.


>> +static long iopmem_direct_access(struct block_device *bdev, sector_t sector,
>> + void **kaddr, pfn_t *pfn, long size)
>> +{
>> + struct iopmem_device *iopmem = bdev->bd_queue->queuedata;
>> + resource_size_t offset = sector * 512;
>> +
>> + if (!iopmem)
>> + return -ENODEV;
>
> I don't think this can ever happen, can it?

Yes, I think now that's the case. This is probably a holdover from a
previous version.

> Just use ida_simple_get/ida_simple_remove instead to take care
> of the locking and preloading, and get rid of these two functions.

Thanks, noted. That would be much better. I never found a simple example
of that when I was looking, though I expected there should have been.

>
>> +static int iopmem_attach_disk(struct iopmem_device *iopmem)
>> +{
>> + struct gendisk *disk;
>> + int nid = dev_to_node(iopmem->dev);
>> + struct request_queue *q = iopmem->queue;
>> +
>> + blk_queue_write_cache(q, true, true);
>
> You don't handle flush commands or the fua bit in make_request, so
> this setting seems wrong.

Yup, ok. I'm afraid this is a case of copying without complete
comprehension.

>
>> + int err = 0;
>> + int nid = dev_to_node(&pdev->dev);
>> +
>> + if (pci_enable_device_mem(pdev) < 0) {
>
> propagate the actual error code, please.

Hmm, yup. Not sure why that was missed.

Thanks,

Logan

2016-11-06 14:35:27

by Stephen Bates

Subject: Re: [PATCH 0/3] iopmem : A block device for PCIe memory

On Tue, October 25, 2016 3:19 pm, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 05:50:43AM -0600, Stephen Bates wrote:
>>
>> Dave are you saying that even for local mappings of files on a DAX
>> capable system it is possible for the mappings to move on you unless the
>> FS supports locking?
>>
>
> Yes.
>
>
>> Does that not mean DAX on such FS is
>> inherently broken?
>
> No. DAX is accessed through a virtual mapping layer that abstracts
> the physical location from userspace applications.
>
> Example: think copy-on-write overwrites. It occurs atomically from
> the perspective of userspace and starts by invalidating any current
> mappings userspace has of that physical location. The location is changed,
> the data is copied in, and then when the locks are released userspace can
> the data copied in, and then when the locks are released userspace can
> fault in a new page table mapping on the next access....

Dave

Thanks for the good input and for correcting some of my DAX
misconceptions! We will certainly be taking this into account as we
consider v1.

>
>>>> And at least for XFS we have such a mechanism :) E.g. I have a
>>>> prototype of a pNFS layout that uses XFS+DAX to allow clients to do
>>>> RDMA directly to XFS files, with the same locking mechanism we use
>>>> for the current block and scsi layout in xfs_pnfs.c.
>>
>> Thanks for fixing this issue on XFS Christoph! I assume this problem
>> continues to exist on the other DAX capable FS?
>
> Yes, but if they implement the exportfs API that supplies this
> capability, they'll be able to use pNFS, too.
>
>> One more reason to consider a move to /dev/dax I guess ;-)...
>>
>
> That doesn't get rid of the need for sane access control arbitration
> across all machines that are directly accessing the storage. That's the
> problem pNFS solves, regardless of whether your direct access target is a
> filesystem, a block device or object storage...

Fair point. I am still hoping for a bit more discussion on the best choice
of user-space interface for this work. If/When that happens we will take
it into account when we look at spinning the patchset.


Stephen