2018-03-12 19:37:59

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 00/11] Copy Offload in NVMe Fabrics with P2P PCI Memory

Hi Everyone,

Here's v3 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc5.

Thanks,

Logan


Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
and Kconfig doc. Also, the code that disables the ACS bits is surrounded
explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
are closest to (or the same as) the clients using them. In cases
of ties, the provider is randomly chosen.

* Modify the NVMe Target code so that the PCI device name of the provider
may be explicitly specified, bypassing the logic in pci_p2pmem_find().
(Note: it's still enforced that the provider must be behind the
same switch as the clients).

* As requested by Bjorn, added documentation for driver writers.


Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
as a bunch of cleanup and spelling fixes he pointed out in the last
series.

* To address Alex's ACS concerns, we change to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
fairly simply handle Jason's concerns that this work might break with
the HFI, QIB and rxe drivers that use the virtual ops to implement
their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). However, presently, the
trade-off is currently a reduction in overall throughput. (Largely due
to hardware issues that would certainly improve in the future).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We chose to go this route based on feedback we
received at the last LSF). Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a white list of known good
root complexes.

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.

Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transactions and then
use that list to find any P2P memory that is supported by all the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.

In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged seeing a non-homogenous request would
complicate the DMA mapping requirements.

In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pmem_map_sg() function which applies the necessary transformation
to get the corrent pci_bus_addr_t for the DMA transactions.

In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.

Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.

Logan Gunthorpe (11):
PCI/P2PDMA: Support peer-to-peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
PCI/P2PDMA: Add P2P DMA driver writer's documentation
block: Introduce PCI P2P flags for request and request queue
IB/core: Ensure we map P2P memory correctly in
rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Optionally use PCI P2P memory

Documentation/ABI/testing/sysfs-bus-pci | 25 +
Documentation/PCI/index.rst | 14 +
Documentation/PCI/p2pdma.rst | 164 +++++++
Documentation/index.rst | 3 +-
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 13 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 118 +++--
drivers/nvme/target/configfs.c | 67 +++
drivers/nvme/target/core.c | 106 +++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 12 +
drivers/nvme/target/rdma.c | 32 +-
drivers/pci/Kconfig | 25 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 828 ++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 6 +
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 19 +
include/linux/pci-p2pdma.h | 119 +++++
include/linux/pci.h | 4 +
23 files changed, 1543 insertions(+), 52 deletions(-)
create mode 100644 Documentation/PCI/index.rst
create mode 100644 Documentation/PCI/p2pdma.rst
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h

--
2.11.0


2018-03-12 19:37:59

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 02/11] PCI/P2PDMA: Add sysfs group to display p2pmem stats

Add a sysfs group to display statistics about P2P memory that is
registered in each PCI device.

Attributes in the group display the total amount of P2P memory, the
amount available and whether it is published or not.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++
drivers/pci/p2pdma.c | 54 +++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..044812c816d0 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:

This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What: /sys/bus/pci/devices/.../p2pmem/available
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the amount of memory that has not been
+ allocated (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/size
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the total amount of memory that the device
+ provides (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/published
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains a '1' if the memory has been published for
+ use inside the kernel or a '0' if it is only intended
+ for use within the driver that published it.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0ee917381dce..fd4789566a56 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -24,6 +24,54 @@ struct pci_p2pdma {
bool p2pmem_published;
};

+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t size = 0;
+
+ if (pdev->p2pdma->pool)
+ size = gen_pool_size(pdev->p2pdma->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t avail = 0;
+
+ if (pdev->p2pdma->pool)
+ avail = gen_pool_avail(pdev->p2pdma->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n",
+ pdev->p2pdma->p2pmem_published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+ &dev_attr_size.attr,
+ &dev_attr_available.attr,
+ &dev_attr_published.attr,
+ NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+ .attrs = p2pmem_attrs,
+ .name = "p2pmem",
+};
+
static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
{
struct pci_p2pdma *p2p =
@@ -53,6 +101,7 @@ static void pci_p2pdma_release(void *data)
percpu_ref_exit(&pdev->p2pdma->devmap_ref);

gen_pool_destroy(pdev->p2pdma->pool);
+ sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2pdma = NULL;
}

@@ -83,9 +132,14 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)

pdev->p2pdma = p2p;

+ error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+ if (error)
+ goto out_pool_destroy;
+
return 0;

out_pool_destroy:
+ pdev->p2pdma = NULL;
gen_pool_destroy(p2p->pool);
out:
devm_kfree(&pdev->dev, p2p);
--
2.11.0


2018-03-12 19:37:59

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but would significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVME
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
we have no way of knowing whether peer-to-peer routing between PCIe
Root Ports is supported (PCIe r4.0, sec 1.3.1). Additionally, the
benefits of P2P transfers that go through the RC is limited to only
reducing DRAM usage and, in some cases, coding convienence.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 16 ++
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 679 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 18 ++
include/linux/pci-p2pdma.h | 101 +++++++
include/linux/pci.h | 4 +
6 files changed, 819 insertions(+)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..d59f6f5ddfcd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@ config PCI_PASID

If unsure, say N.

+config PCI_P2PDMA
+ bool "PCI peer-to-peer transfer support"
+ depends on ZONE_DEVICE
+ select GENERIC_ALLOCATOR
+ help
+ Enableѕ drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are the part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (ie. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it at all, so at this time you
+ will need a PCIe switch.
+
+ If unsure, say N.
+
config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o

obj-$(CONFIG_PCI_ATS) += ats.o
obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o

#
# ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..0ee917381dce
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,679 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+ struct percpu_ref devmap_ref;
+ struct completion devmap_ref_done;
+ struct gen_pool *pool;
+ bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+ struct pci_p2pdma *p2p =
+ container_of(ref, struct pci_p2pdma, devmap_ref);
+
+ complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+ struct percpu_ref *ref = data;
+
+ if (percpu_ref_is_dying(ref))
+ return;
+
+ percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+ struct pci_dev *pdev = data;
+
+ if (!pdev->p2pdma)
+ return;
+
+ wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+ percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+ gen_pool_destroy(pdev->p2pdma->pool);
+ pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+ int error = -ENOMEM;
+ struct pci_p2pdma *p2p;
+
+ p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+ if (!p2p)
+ return -ENOMEM;
+
+ p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+ if (!p2p->pool)
+ goto out;
+
+ init_completion(&p2p->devmap_ref_done);
+ error = percpu_ref_init(&p2p->devmap_ref,
+ pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+ if (error)
+ goto out_pool_destroy;
+
+ percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ if (error)
+ goto out_pool_destroy;
+
+ pdev->p2pdma = p2p;
+
+ return 0;
+
+out_pool_destroy:
+ gen_pool_destroy(p2p->pool);
+out:
+ devm_kfree(&pdev->dev, p2p);
+ return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset)
+{
+ struct dev_pagemap *pgmap;
+ void *addr;
+ int error;
+
+ if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+ return -EINVAL;
+
+ if (offset >= pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!size)
+ size = pci_resource_len(pdev, bar) - offset;
+
+ if (size + offset > pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!pdev->p2pdma) {
+ error = pci_p2pdma_setup(pdev);
+ if (error)
+ return error;
+ }
+
+ pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+ pgmap->res.end = pgmap->res.start + size - 1;
+ pgmap->res.flags = pci_resource_flags(pdev, bar);
+ pgmap->ref = &pdev->p2pdma->devmap_ref;
+ pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+ addr = devm_memremap_pages(&pdev->dev, pgmap);
+ if (IS_ERR(addr)) {
+ error = PTR_ERR(addr);
+ goto pgmap_free;
+ }
+
+ error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+ pci_bus_address(pdev, bar) + offset,
+ resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+ if (error)
+ goto pgmap_free;
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+ &pdev->p2pdma->devmap_ref);
+ if (error)
+ goto pgmap_free;
+
+ pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+ &pgmap->res);
+
+ return 0;
+
+pgmap_free:
+ devres_free(pgmap);
+ return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+ struct device *parent;
+
+ dev = get_device(dev);
+
+ while (dev) {
+ if (dev_is_pci(dev))
+ return to_pci_dev(dev);
+
+ parent = get_device(dev->parent);
+ put_device(dev);
+ dev = parent;
+ }
+
+ return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+ struct pci_dev *up1, *up2;
+
+ if (!pdev)
+ return NULL;
+
+ up1 = pci_dev_get(pci_upstream_bridge(pdev));
+ if (!up1)
+ return NULL;
+
+ up2 = pci_dev_get(pci_upstream_bridge(up1));
+ pci_dev_put(up1);
+
+ return up2;
+}
+
+/*
+ * This function checks if two PCI devices are behind the same switch.
+ * (ie. they share the same second upstream port as returned by
+ * get_upstream_bridge_port().)
+ *
+ * Future work could expand this to handle hierarchies of switches
+ * so any devices whose traffic can be routed without going through
+ * the root complex could be used. For now, we limit it to just one
+ * level of switch.
+ *
+ * This function returns the "distance" between the devices. 0 meaning
+ * they are the same device, 1 meaning they are behind the same switch.
+ * If they are not behind the same switch, -1 is returned.
+ */
+static int __upstream_bridges_match(struct pci_dev *upstream,
+ struct pci_dev *client)
+{
+ struct pci_dev *dma_up;
+ int ret = 1;
+
+ dma_up = get_upstream_bridge_port(client);
+
+ if (!dma_up) {
+ dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
+ ret = -1;
+ goto out;
+ }
+
+ if (upstream != dma_up) {
+ dev_dbg(&client->dev,
+ "does not reside on the same upstream bridge\n");
+ ret = -1;
+ goto out;
+ }
+
+out:
+ pci_dev_put(dma_up);
+ return ret;
+}
+
+static int upstream_bridges_match(struct pci_dev *provider,
+ struct pci_dev *client)
+{
+ struct pci_dev *upstream;
+ int ret;
+
+ if (provider == client)
+ return 0;
+
+ upstream = get_upstream_bridge_port(provider);
+ if (!upstream) {
+ pci_warn(provider, "not behind a PCI bridge\n");
+ return -1;
+ }
+
+ ret = __upstream_bridges_match(upstream, client);
+
+ pci_dev_put(upstream);
+
+ return ret;
+}
+
+struct pci_p2pdma_client {
+ struct list_head list;
+ struct pci_dev *client;
+ struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pdma_client *item, *new_item;
+ struct pci_dev *provider = NULL;
+ struct pci_dev *client;
+ int ret;
+
+ if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+ dev_warn(dev,
+ "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+ return -ENODEV;
+ }
+
+
+ client = find_parent_pci_dev(dev);
+ if (!client) {
+ dev_warn(dev,
+ "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+ return -ENODEV;
+ }
+
+ item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+ if (item && item->provider) {
+ provider = item->provider;
+
+ if (upstream_bridges_match(provider, client) < 0) {
+ ret = -EXDEV;
+ goto put_client;
+ }
+ }
+
+ new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+ if (!new_item) {
+ ret = -ENOMEM;
+ goto put_client;
+ }
+
+ new_item->client = client;
+ new_item->provider = pci_dev_get(provider);
+
+ list_add_tail(&new_item->list, head);
+
+ return 0;
+
+put_client:
+ pci_dev_put(client);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+ list_del(&item->list);
+ pci_dev_put(item->client);
+ pci_dev_put(item->provider);
+ kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a new p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pdma_client *pos, *tmp;
+ struct pci_dev *pdev;
+
+ pdev = find_parent_pci_dev(dev);
+ if (!pdev)
+ return;
+
+ list_for_each_entry_safe(pos, tmp, head, list) {
+ if (pos->client != pdev)
+ continue;
+
+ pci_p2pdma_client_free(pos);
+ }
+
+ pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+ struct pci_p2pdma_client *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, head, list)
+ pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determive the cumulative distance between
+ * a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns -1 if any of the clients are not compatible (behind the same
+ * switch as the provider), otherwise returns a positive number where
+ * the lower number is the preferrable choice. (If there's one client
+ * that's the same as the provider it will return 0, which is best choice).
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same switch. This cuts out cases that may work but is safest for the
+ * user. Future work can expand this to cases with nested switches.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+ struct pci_p2pdma_client *pos;
+ struct pci_dev *upstream;
+ int ret;
+ int distance = 0;
+
+ upstream = get_upstream_bridge_port(provider);
+ if (!upstream) {
+ pci_warn(provider, "not behind a PCI bridge\n");
+ return false;
+ }
+
+ list_for_each_entry(pos, clients, list) {
+ if (pos->client == provider)
+ continue;
+
+ ret = __upstream_bridges_match(upstream, pos->client);
+ if (ret < 0)
+ goto no_match;
+
+ distance += ret;
+ }
+
+ ret = distance;
+
+no_match:
+ pci_dev_put(upstream);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibily (as per pci_p2pdma_distance)
+ * and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+ struct list_head *clients)
+{
+ struct pci_p2pdma_client *pos;
+
+ if (pci_p2pdma_distance(provider, clients) < 0)
+ return false;
+
+ list_for_each_entry(pos, clients, list)
+ pos->provider = provider;
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+ return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ * the specified list of clients and shortest distance (as determined
+ * by pci_p2pmem_dma())
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers are
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ struct pci_dev *pdev = NULL;
+ struct pci_p2pdma_client *pos;
+ int distance;
+ int closest_distance = INT_MAX;
+ struct pci_dev **closest_pdevs;
+ int ties = 0;
+ const int max_ties = PAGE_SIZE / sizeof(*closest_pdevs);
+ int i;
+
+ closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+
+ while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+ if (!pci_has_p2pmem(pdev))
+ continue;
+
+ distance = pci_p2pdma_distance(pdev, clients);
+ if (distance < 0 || distance > closest_distance)
+ continue;
+
+ if (distance == closest_distance && ties >= max_ties)
+ continue;
+
+ if (distance < closest_distance) {
+ for (i = 0; i < ties; i++)
+ pci_dev_put(closest_pdevs[i]);
+
+ ties = 0;
+ closest_distance = distance;
+ }
+
+ closest_pdevs[ties++] = pci_dev_get(pdev);
+ }
+
+ if (ties)
+ pdev = pci_dev_get(closest_pdevs[prandom_u32_max(ties)]);
+
+ for (i = 0; i < ties; i++)
+ pci_dev_put(closest_pdevs[i]);
+
+ if (pdev)
+ list_for_each_entry(pos, clients, list)
+ pos->provider = pdev;
+
+ kfree(closest_pdevs);
+ return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2p_mem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ void *ret;
+
+ if (unlikely(!pdev->p2pdma))
+ return NULL;
+
+ if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+ return NULL;
+
+ ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+ if (unlikely(!ret))
+ percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+ gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+ percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_virt_to_bus - return the PCI bus address for a given virtual
+ * address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+ if (!addr)
+ return 0;
+ if (!pdev->p2pdma)
+ return 0;
+
+ /*
+ * Note: when we added the memory to the pool we used the PCI
+ * bus address as the physical address. So gen_pool_virt_to_phys()
+ * actually returns the bus address despite the misleading name.
+ */
+ return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ * @length: number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+ unsigned int *nents, u32 length)
+{
+ struct scatterlist *sg;
+ void *addr;
+
+ sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+ if (!sg)
+ return -ENOMEM;
+
+ sg_init_table(sg, 1);
+
+ addr = pci_alloc_p2pmem(pdev, length);
+ if (!addr)
+ goto out_free_sg;
+
+ sg_set_buf(sg, addr, length);
+ *sgl = sg;
+ *nents = 1;
+ return 0;
+
+out_free_sg:
+ kfree(sg);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+ unsigned int nents)
+{
+ struct scatterlist *sg;
+ int count;
+
+ if (!sgl || !nents)
+ return;
+
+ for_each_sg(sgl, sg, nents, count)
+ pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+ kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ * other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+ if (publish && !pdev->p2pdma)
+ return;
+
+ pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
* type. Any page of a process can be migrated to such memory. However no one
* should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
*/
enum memory_type {
MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
MEMORY_DEVICE_PUBLIC,
+ MEMORY_DEVICE_PCI_P2PDMA,
};

/*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif /* CONFIG_ZONE_DEVICE */

+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
static inline bool is_device_private_page(const struct page *page)
{
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..1f7856ff098b
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+ struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+ unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+ unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+ size_t size, u64 offset)
+{
+ return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+ struct device *dev)
+{
+ return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+ struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+ struct list_head *clients)
+{
+ return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+ struct list_head *clients)
+{
+ return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+ return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+ size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+ void *addr)
+{
+ return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+ struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+ return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+ struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2P_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..437e42615896 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -276,6 +276,7 @@ struct pcie_link_state;
struct pci_vpd;
struct pci_sriov;
struct pci_ats;
+struct pci_p2pdma;

/* The pci_dev structure describes PCI devices */
struct pci_dev {
@@ -429,6 +430,9 @@ struct pci_dev {
#ifdef CONFIG_PCI_PASID
u16 pasid_features;
#endif
+#ifdef CONFIG_PCI_P2PDMA
+ struct pci_p2pdma *p2pdma;
+#endif
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
char *driver_override; /* Driver name to force a match */
--
2.11.0


2018-03-12 19:38:57

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 09/11] nvme-pci: Add support for P2P memory in requests

For P2P requests, we must use the pci_p2pmem_[un]map_sg() functions
instead of the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
---
drivers/nvme/host/core.c | 4 ++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 19 +++++++++++++++----
3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7aeca5db7916..c7c5de116720 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2949,7 +2949,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
+
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+ if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+ queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
ns->queue->queuedata = ns;
ns->ctrl = ctrl;

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d733b14ede9d..1fb2b6603d49 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -290,6 +290,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
+#define NVME_F_PCI_P2PDMA (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1fb57fa42dd0..0ebab7ab4d7e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -796,8 +796,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out;

ret = BLK_STS_RESOURCE;
- nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
- DMA_ATTR_NO_WARN);
+
+ if (REQ_IS_PCI_P2PDMA(req))
+ nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+ else
+ nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+ dma_dir, DMA_ATTR_NO_WARN);
if (!nr_mapped)
goto out;

@@ -842,7 +847,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;

if (iod->nents) {
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+ if (REQ_IS_PCI_P2PDMA(req))
+ pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+ else
+ dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
if (blk_integrity_rq(req)) {
if (req_op(req) == REQ_OP_READ)
nvme_dif_remap(req, nvme_dif_complete);
@@ -2426,7 +2436,8 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
- .flags = NVME_F_METADATA_SUPPORTED,
+ .flags = NVME_F_METADATA_SUPPORTED |
+ NVME_F_PCI_P2PDMA,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
--
2.11.0


2018-03-12 19:38:59

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 03/11] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset

The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
addresses when using P2P memory.

For this, we assume that an SGL passed to these functions contain all
P2P memory or no P2P memory.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 1 +
include/linux/pci-p2pdma.h | 13 ++++++++++++
3 files changed, 65 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index fd4789566a56..ab810c3a93eb 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -190,6 +190,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->res.flags = pci_resource_flags(pdev, bar);
pgmap->ref = &pdev->p2pdma->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+ pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+ pci_resource_start(pdev, bar);

addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -731,3 +733,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
pdev->p2pdma->p2pmem_published = publish;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ *
+ * Returns the number of SG entries mapped
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+ struct dev_pagemap *pgmap;
+ struct scatterlist *s;
+ phys_addr_t paddr;
+ int i;
+
+ /*
+ * p2pdma mappings are not compatible with devices that use
+ * dma_virt_ops.
+ */
+ if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
+ return 0;
+
+ for_each_sg(sg, s, nents, i) {
+ pgmap = sg_page(s)->pgmap;
+ paddr = sg_phys(s);
+
+ s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+ sg_dma_len(s) = s->length;
+ }
+
+ return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ */
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9e907c338a44..1660f64ce96f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
+ u64 pci_p2pdma_bus_offset;
};

#ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 1f7856ff098b..59eb218bdb25 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
unsigned int nents);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -97,5 +101,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
+static inline int pci_p2pdma_map_sg(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+ return 0;
+}
+static inline void pci_p2pdma_unmap_sg(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+}
#endif /* CONFIG_PCI_P2PDMA */
#endif /* _LINUX_PCI_P2P_H */
--
2.11.0


2018-03-12 19:39:16

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 04/11] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

For peer-to-peer transactions to work the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups so this is done at enumeration time before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch will be in the same IOMMU group.
Which implies that individual devices behind any switch will not be
able to be assigned to separate VMs because there is no isolation
between them. Additionally, any malicious PCIe devices will be able
to DMA to memory exposed by other EPs in the same domain as TLPs will
not be checked by the IOMMU.

Given that the intended use case of P2P Memory is for users with
custom hardware designed for purpose, we do not expect distributors
to ever need to enable this option. Users that want to use P2P
must have compiled a custom kernel with this configuration option
and understand the implications regarding ACS. They will either
not require ACS or will have design the system in such a way that
devices that require isolation will be separate from those using P2P
transactions.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 9 +++++++++
drivers/pci/p2pdma.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 6 ++++++
include/linux/pci-p2pdma.h | 5 +++++
4 files changed, 64 insertions(+)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index d59f6f5ddfcd..c7a9d155baca 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -138,6 +138,15 @@ config PCI_P2PDMA
it's hard to tell which support it at all, so at this time you
will need a PCIe switch.

+ Enabling this option will also disable ACS on all ports behind
+ any PCIe switch. This effectively puts all devices behind any
+ switch into the same IOMMU group. Which implies that individual
+ devices behind any switch will not be able to be assigned to
+ separate VMs because there is no isolation between them.
+ Additionally, any malicious PCIe devices will be able to DMA
+ to memory exposed by other EPs in the same domain as TLPs will
+ not be checked by the IOMMU.
+
If unsure, say N.

config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ab810c3a93eb..3e70b0662def 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -264,6 +264,50 @@ static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
}

/*
+ * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
+ * bridges/switches
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any downstream port in any switch in order for
+ * the TLPs to not be forwarded up to the RC which is not what we want
+ * for P2P.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any switch to be in the same IOMMU
+ * group. At this time there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
+ */
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+ struct pci_dev *up;
+ int pos;
+ u16 ctrl;
+
+ up = get_upstream_bridge_port(pdev);
+ if (!up)
+ return 0;
+ pci_dev_put(up);
+
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ return 0;
+
+ pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
+
+ pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+ ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
+
+ pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
+
+ return 1;
+}
+
+/*
* This function checks if two PCI devices are behind the same switch.
* (ie. they share the same second upstream port as returned by
* get_upstream_bridge_port().)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd10d9b0..e5da8f482e94 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@
#include <linux/of.h>
#include <linux/of_pci.h>
#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
#include <linux/pm.h>
#include <linux/slab.h>
#include <linux/module.h>
@@ -2826,6 +2827,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
*/
void pci_enable_acs(struct pci_dev *dev)
{
+#ifdef CONFIG_PCI_P2PDMA
+ if (pci_p2pdma_disable_acs(dev))
+ return;
+#endif
+
if (!pci_acs_enable)
return;

diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 59eb218bdb25..2a2bf2ca018e 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -18,6 +18,7 @@ struct block_device;
struct scatterlist;

#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_disable_acs(struct pci_dev *pdev);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
@@ -41,6 +42,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir);
#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+ return 0;
+}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
--
2.11.0


2018-03-12 19:39:36

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 07/11] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()

In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be
called to map the correct PCI bus address.

To do this, check the first page in the scatter list to see if it is P2P
memory or not. At the moment, scatter lists that contain P2P memory must
be homogeneous so if the first page is P2P the entire SGL should be P2P.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/infiniband/core/rw.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..f495e8a7f8ac 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
*/
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
#include <rdma/mr_pool.h>
#include <rdma/rw.h>

@@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;

- ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+ if (is_pci_p2pdma_page(sg_page(sg)))
+ ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+ else
+ ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@@ -602,7 +607,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}

- ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ if (is_pci_p2pdma_page(sg_page(sg)))
+ pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg,
+ sg_cnt, dir);
+ else
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);

--
2.11.0


2018-03-12 19:39:46

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation

Add a restructured text file describing how to write drivers
with support for P2P DMA transactions. The document describes
how to use the APIs that were added in the previous few
commits.

Also adds an index for the PCI documentation tree even though this
is the only PCI document that has ben converted to restructured text
at this time.

Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
---
Documentation/PCI/index.rst | 14 ++++
Documentation/PCI/p2pdma.rst | 164 +++++++++++++++++++++++++++++++++++++++++++
Documentation/index.rst | 3 +-
3 files changed, 180 insertions(+), 1 deletion(-)
create mode 100644 Documentation/PCI/index.rst
create mode 100644 Documentation/PCI/p2pdma.rst

diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
new file mode 100644
index 000000000000..2fdc4b3c291d
--- /dev/null
+++ b/Documentation/PCI/index.rst
@@ -0,0 +1,14 @@
+==================================
+Linux PCI Driver Developer's Guide
+==================================
+
+.. toctree::
+
+ p2pdma
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/PCI/p2pdma.rst b/Documentation/PCI/p2pdma.rst
new file mode 100644
index 000000000000..d7edd48a3941
--- /dev/null
+++ b/Documentation/PCI/p2pdma.rst
@@ -0,0 +1,164 @@
+============================
+PCI Peer-to-Peer DMA Support
+============================
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two endpoints on the bus. This type of transaction is
+henceforth called Peer-to-Peer (or P2P). However, there are a number of
+issues that make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI Root Complexes are not required
+to support forwarding packets between Root Ports. To make things worse,
+there is no simple way to determine if a given Root Complex supports
+this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
+the kernel only supports doing P2P when the endpoints involved are all
+behind a PCIe Switch as this guarantees the packets will always be routable.
+
+The second issue is that to make use of existing interfaces in Linux,
+memory that is used for P2P transactions needs to be backed by struct
+pages. However, PCI BARs are not typically cache coherent so there are
+a few corner case gotchas with these pages so developers need to
+be careful about what they do with them.
+
+
+Driver Writer's Guide
+====================
+
+In a given P2P implementation there may be three or more different
+types of kernel drivers in play:
+
+* Providers - A driver which provides or publishes P2P resources like
+ memory or doorbell registers to other drivers.
+* Clients - A driver which makes use of a resource by setting up a
+ DMA transaction to it.
+* Orchestrators - A driver which orchestrates the flow of data between
+ clients and providers
+
+In many cases there could be overlap between these three types (ie.
+it may be typical for a driver to be both a provider and a client).
+
+For example, in the NVMe Target Copy Offload implementation:
+
+* The NVMe PCI driver is both a client, provider and orchestrator
+ in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
+ resource (provider), it accepts P2P memory pages as buffers in requests
+ to be used directly (client) and it can also make use the CMB as
+ submission queue entries.
+* The RDMA driver is a client in this arrangement so that an RNIC
+ can DMA directly to the memory exposed by the NVME device.
+* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
+ to the P2P memory (CMB) and then to the NVMe device (and vice versa).
+
+This is currently the only arrangement supported by the kernel but
+one could imagine slight tweaks to this that would allow for the same
+functionality. For example, if a specific RNIC added a BAR with some
+memory behind it, its driver could add support as a P2P provider and
+then the NVMe Target could use the RNIC's memory instead of the CMB
+in cases where the NVMe cards in use do not have CMB support.
+
+
+Provider Drivers
+----------------
+
+A provider simply needs to register a BAR (or a portion of a BAR)
+as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
+This will register struct pages for all the specified memory.
+
+After that it may optionally publish all of its resources as
+P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
+any orchestrator drivers to find and use the memory. When marked in
+this way, the resource must be regular memory with no side effects.
+
+For the time being this is fairly rudimentary in that all resources
+are typically going to be P2P memory. Future work will likely expand
+this to include other types of resources like doorbells.
+
+
+Client Drivers
+--------------
+
+A client driver typically only has to conditionally change its DMA map
+routine to use the mapping functions :c:func:`pci_p2pdma_map_sg()` and
+:c:func:`pci_p2pdma_unmap_sg()` instead of the usual :c:func:`dma_map_sg()`
+functions.
+
+The client may also, optionally, make use of
+:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
+functions and when to use the regular mapping functions. In some
+situations, it may be more appropriate to use a flag to indicate a
+given request is P2P memory and map appropriately (for example the
+block layer uses a flag to keep P2P memory out of queues that do not
+have P2P client support). It is important to ensure that struct pages that
+back P2P memory stay out of code that does not have support for them.
+
+
+Orchestrator Drivers
+--------------------
+
+The first task an orchestrator driver must do is compile a list of
+all client drivers that will be involved in a given transaction. For
+example, the NVMe Target driver creates a list including all NVMe drives
+and the RNIC in use. The list is stored as an anonymous struct
+list_head which must be initialized with the usual INIT_LIST_HEAD.
+The following functions may then be used to add to, remove from and free
+the list of clients with the functions :c:func:`pci_p2pdma_add_client()`,
+:c:func:`pci_p2pdma_remove_client()` and
+:c:func:`pci_p2pdma_client_list_free()`.
+
+With the client list in hand, the orchestrator may then call
+:c:func:`pci_p2pmem_find()` to obtain a published P2P memory provider
+that is supported (behind the same switch) as all the clients. If more
+than one provider is supported, the one nearest to all the clients will
+be chosen first. If there are more than one provider is an equal distance
+away, the one returned will be chosen at random. This function returns the PCI
+device to use for the provider with a reference taken and therefore
+when it's no longer needed it should be returned with pci_dev_put().
+
+Alternatively, if the orchestrator knows (via some other means)
+which provider it wants to use it may use :c:func:`pci_has_p2pmem()`
+to determine if it has P2P memory and :c:func:`pci_p2pdma_distance()`
+to determine the cumulative distance between it and a potential
+list of clients.
+
+With a supported provider in hand, the driver can then call
+:c:func:`pci_p2pdma_assign_provider()` to assign the provider
+to the client list. This function returns false if any of the
+clients are unsupported by the provider.
+
+Once a provider is assigned to a client list via either
+:c:func:`pci_p2pmem_find()` or :c:func:`pci_p2pdma_assign_provider()`,
+the list is permanently bound to the provider such that any new clients
+added to the list must be supported by the already selected provider.
+If they are not supported, :c:func:`pci_p2pdma_add_client()` will return
+an error. In this way, orchestrators are free to add and remove devices
+without having to recheck support or tear down existing transfers to
+change P2P providers.
+
+Once a provider is selected, the orchestrator can then use
+:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
+allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
+and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
+allocating scatter-gather lists with P2P memory.
+
+Struct Page Caveats
+-------------------
+
+Driver writers should be very careful about not passing these special
+struct pages to code that isn't prepared for it. At this time, the kernel
+interfaces do not have any checks for ensuring this. This obviously
+precludes passing these pages to userspace.
+
+P2P memory is also technically IO memory but should never have any side
+effects behind it. Thus, the order of loads and stores should not be important
+and ioreadX(), iowriteX() and friends should not be necessary.
+However, as the memory is not cache coherent, if access ever needs to
+be protected by a spinlock then :c:func:`mmiowb()` must be used before
+unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
+Documentation/memory-barriers.txt)
+
+
+P2P DMA API Functions
+=====================
+
+.. kernel-doc:: drivers/pci/p2pdma.c
+ :export:
diff --git a/Documentation/index.rst b/Documentation/index.rst
index ef5080cbf009..c31bf0918413 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
.. toctree::
:maxdepth: 2

- userspace-api/index
+ userspace-api/index


Introduction to kernel development
@@ -88,6 +88,7 @@ needed).
sound/index
crypto/index
filesystems/index
+ PCI/index

Architecture-specific documentation
-----------------------------------
--
2.11.0


2018-03-12 19:39:46

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 06/11] block: Introduce PCI P2P flags for request and request queue

QUEUE_FLAG_PCI_P2P is introduced meaning a driver's request queue
supports targeting P2P memory.

REQ_PCI_P2P is introduced to indicate a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queues have the QUEUE_FLAG_PCI_P2P
flag set.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
---
block/blk-core.c | 3 +++
include/linux/blk_types.h | 18 +++++++++++++++++-
include/linux/blkdev.h | 3 +++
3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f7fadd..a2f113738b85 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2183,6 +2183,9 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;

+ if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q))
+ goto not_supported;
+
if (should_fail_bio(bio))
goto end_io;

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bf18b95ed92d..490122c85b3f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -274,6 +274,10 @@ enum req_flag_bits {
__REQ_BACKGROUND, /* background IO */
__REQ_NOWAIT, /* Don't wait if request will block */

+#ifdef CONFIG_PCI_P2PDMA
+ __REQ_PCI_P2PDMA, /* request is to/from P2P memory */
+#endif
+
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */

@@ -298,6 +302,18 @@ enum req_flag_bits {
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)

+#ifdef CONFIG_PCI_P2PDMA
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2PDMA (1ULL << __REQ_PCI_P2PDMA)
+#define REQ_IS_PCI_P2PDMA(req) ((req)->cmd_flags & REQ_PCI_P2PDMA)
+#else
+#define REQ_PCI_P2PDMA 0
+#define REQ_IS_PCI_P2PDMA(req) 0
+#endif /* CONFIG_PCI_P2PDMA */
+
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)

#define REQ_DRV (1ULL << __REQ_DRV)
@@ -306,7 +322,7 @@ enum req_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)

#define REQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+ (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA)

#define bio_op(bio) \
((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b69c12..0b4a386c73ea 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,6 +698,7 @@ struct request_queue {
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA 30 /* device supports pci p2p requests */

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@@ -793,6 +794,8 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q) \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q) \
+ test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)

#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
--
2.11.0


2018-03-12 19:40:10

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 10/11] nvme-pci: Add a quirk for a pseudo CMB

Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for using CMBLOC and
CMBSIZE.

We'd like to use some of these older cards to test P2P memory.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
---
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 24 ++++++++++++++++++++----
2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1fb2b6603d49..d1381bfc40f1 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -83,6 +83,13 @@ enum nvme_quirks {
* Supports the LighNVM command set if indicated in vs[1].
*/
NVME_QUIRK_LIGHTNVM = (1 << 6),
+
+ /*
+ * Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+ * NVRAM that have CMB-like memory on a BAR but does not set
+ * CMBLOC or CMBSZ.
+ */
+ NVME_QUIRK_PSEUDO_CMB_BAR4 = (1 << 7),
};

/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0ebab7ab4d7e..a798e08a07bc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1683,6 +1683,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
}
static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);

+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+ return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+ (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+ ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
{
u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1702,10 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;

- dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
- if (!dev->cmbsz)
- return;
- dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+ dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+ dev->cmbloc = 4;
+ } else {
+ dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+ if (!dev->cmbsz)
+ return;
+ dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ }

size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2719,6 +2731,10 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807), /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+ { PCI_DEVICE(0x11f8, 0xf117), /* Microsemi NVRAM adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+ { PCI_DEVICE(0x1db1, 0x0002), /* Everspin nvNitro adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
--
2.11.0


2018-03-12 19:40:28

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 11/11] nvmet: Optionally use PCI P2P memory

We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only then use the
p2p memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enabled it an no devices are found, then the system will
silently fall back on using regular memory.

If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available.

Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/configfs.c | 67 ++++++++++++++++++++++++++
drivers/nvme/target/core.c | 106 ++++++++++++++++++++++++++++++++++++++++-
drivers/nvme/target/io-cmd.c | 3 ++
drivers/nvme/target/nvmet.h | 12 +++++
drivers/nvme/target/rdma.c | 32 +++++++++++--
5 files changed, 214 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e6b2d2af81b6..6ca8c712f0d3 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,6 +17,8 @@
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/ctype.h>
+#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>

#include "nvmet.h"

@@ -867,12 +869,77 @@ static void nvmet_port_release(struct config_item *item)
kfree(port);
}

+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_p2pmem_show(struct config_item *item, char *page)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+
+ if (!port->use_p2pmem)
+ return sprintf(page, "none\n");
+
+ if (!port->p2p_dev)
+ return sprintf(page, "auto\n");
+
+ return sprintf(page, "%s\n", pci_name(port->p2p_dev));
+}
+
+static ssize_t nvmet_p2pmem_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+ struct device *dev;
+ struct pci_dev *p2p_dev = NULL;
+ bool use_p2pmem;
+
+ switch (page[0]) {
+ case 'y':
+ case 'Y':
+ case 'a':
+ case 'A':
+ use_p2pmem = true;
+ break;
+ case 'n':
+ case 'N':
+ use_p2pmem = false;
+ break;
+ default:
+ dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
+ if (!dev) {
+ pr_err("No such PCI device: %s\n", page);
+ return -ENODEV;
+ }
+
+ use_p2pmem = true;
+ p2p_dev = to_pci_dev(dev);
+
+ if (!pci_has_p2pmem(p2p_dev)) {
+ pr_err("PCI device has no peer-to-peer memory: %s\n",
+ page);
+ pci_dev_put(p2p_dev);
+ return -ENODEV;
+ }
+ }
+
+ down_write(&nvmet_config_sem);
+ port->use_p2pmem = use_p2pmem;
+ pci_dev_put(port->p2p_dev);
+ port->p2p_dev = p2p_dev;
+ up_write(&nvmet_config_sem);
+
+ return count;
+}
+CONFIGFS_ATTR(nvmet_, p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
static struct configfs_attribute *nvmet_port_attrs[] = {
&nvmet_attr_addr_adrfam,
&nvmet_attr_addr_treq,
&nvmet_attr_addr_traddr,
&nvmet_attr_addr_trsvcid,
&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+ &nvmet_attr_p2pmem,
+#endif
NULL,
};

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index a78029e4e5f4..ab3cc7135ae8 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/random.h>
#include <linux/rculist.h>
+#include <linux/pci-p2pdma.h>

#include "nvmet.h"

@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
percpu_ref_put(&ns->ref);
}

+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+ struct nvmet_ns *ns)
+{
+ int ret;
+
+ if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+ pr_err("peer-to-peer DMA is not supported by %s\n",
+ ns->device_path);
+ return -EINVAL;
+ }
+
+ ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+ if (ret)
+ pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+ ns->device_path, ret);
+
+ return ret;
+}
+
int nvmet_ns_enable(struct nvmet_ns *ns)
{
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;

+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->p2p_dev) {
+ ret = nvmet_p2pdma_add_client(ctrl, ns);
+ if (ret)
+ goto out_remove_clients;
+ }
+ }
+
if (ns->nsid > subsys->max_nsid)
subsys->max_nsid = ns->nsid;

@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
out_unlock:
mutex_unlock(&subsys->lock);
return ret;
+out_remove_clients:
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
out_blkdev_put:
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
percpu_ref_exit(&ns->ref);

mutex_lock(&subsys->lock);
- list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+ }

if (ns->bdev)
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -764,6 +797,74 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
return __nvmet_host_allowed(subsys, hostnqn);
}

+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * Ι/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+ struct nvmet_ns *ns;
+ int ret;
+
+ if (!req->port->use_p2pmem || !req->p2p_client)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+ if (ret) {
+ pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+ dev_name(req->p2p_client), ret);
+ goto free_devices;
+ }
+
+ list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+ ret = nvmet_p2pdma_add_client(ctrl, ns);
+ if (ret)
+ goto free_devices;
+ }
+
+ if (req->port->p2p_dev) {
+ if (!pci_p2pdma_assign_provider(req->port->p2p_dev,
+ &ctrl->p2p_clients)) {
+ pr_info("peer-to-peer memory on %s is not supported\n",
+ pci_name(req->port->p2p_dev));
+ goto free_devices;
+ }
+ ctrl->p2p_dev = pci_dev_get(req->port->p2p_dev);
+ } else {
+ ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+ if (!ctrl->p2p_dev) {
+ pr_info("no supported peer-to-peer memory devices found\n");
+ goto free_devices;
+ }
+ }
+
+ mutex_unlock(&ctrl->subsys->lock);
+
+ pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+ return;
+
+free_devices:
+ pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+ if (!ctrl->p2p_dev)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+ pci_dev_put(ctrl->p2p_dev);
+ ctrl->p2p_dev = NULL;
+
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
{
@@ -803,6 +904,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,

INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
+ INIT_LIST_HEAD(&ctrl->p2p_clients);

memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -858,6 +960,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
ctrl->kato = DIV_ROUND_UP(kato, 1000);
}
nvmet_start_keep_alive_timer(ctrl);
+ nvmet_setup_p2pmem(ctrl, req);

mutex_lock(&subsys->lock);
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -894,6 +997,7 @@ static void nvmet_ctrl_free(struct kref *ref)
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fatal_err_work);

+ nvmet_release_p2pmem(ctrl);
ida_simple_remove(&cntlid_ida, ctrl->cntlid);

kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 28bbdff4a88b..a213f8fc3bf3 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
op = REQ_OP_READ;
}

+ if (is_pci_p2pdma_page(sg_page(req->sg)))
+ op_flags |= REQ_PCI_P2PDMA;
+
sector = le64_to_cpu(req->cmd->rw.slba);
sector <<= (req->ns->blksize_shift - 9);

diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 417f6c0331cc..e05afdbdaa10 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
return container_of(to_config_group(item), struct nvmet_ns, group);
}

+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+ return disk_to_dev(ns->bdev->bd_disk);
+}
+
struct nvmet_cq {
u16 qid;
u16 size;
@@ -98,6 +103,8 @@ struct nvmet_port {
struct list_head referrals;
void *priv;
bool enabled;
+ bool use_p2pmem;
+ struct pci_dev *p2p_dev;
};

static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -131,6 +138,8 @@ struct nvmet_ctrl {
struct work_struct fatal_err_work;

struct nvmet_fabrics_ops *ops;
+ struct pci_dev *p2p_dev;
+ struct list_head p2p_clients;

char subsysnqn[NVMF_NQN_FIELD_LEN];
char hostnqn[NVMF_NQN_FIELD_LEN];
@@ -232,6 +241,9 @@ struct nvmet_req {

void (*execute)(struct nvmet_req *req);
struct nvmet_fabrics_ops *ops;
+
+ struct pci_dev *p2p_dev;
+ struct device *p2p_client;
};

static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 978e169c11bf..84db1022664f 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
#include <linux/string.h>
#include <linux/wait.h>
#include <linux/inet.h>
+#include <linux/pci-p2pdma.h>
#include <asm/unaligned.h>

#include <rdma/ib_verbs.h>
@@ -430,8 +431,13 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
}

- if (rsp->req.sg != &rsp->cmd->inline_sg)
- sgl_free(rsp->req.sg);
+ if (rsp->req.sg != &rsp->cmd->inline_sg) {
+ if (rsp->req.p2p_dev)
+ pci_p2pmem_free_sgl(rsp->req.p2p_dev, rsp->req.sg,
+ rsp->req.sg_cnt);
+ else
+ sgl_free(rsp->req.sg);
+ }

if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -567,15 +573,29 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
u64 addr = le64_to_cpu(sgl->addr);
u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
+ struct pci_dev *p2p_dev = NULL;
int ret;

/* no data command? */
if (!len)
return 0;

- rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
- if (!rsp->req.sg)
- return NVME_SC_INTERNAL;
+ if (rsp->queue->nvme_sq.ctrl)
+ p2p_dev = rsp->queue->nvme_sq.ctrl->p2p_dev;
+
+ rsp->req.p2p_dev = NULL;
+ if (rsp->queue->nvme_sq.qid && p2p_dev) {
+ ret = pci_p2pmem_alloc_sgl(p2p_dev, &rsp->req.sg,
+ &rsp->req.sg_cnt, len);
+ if (!ret)
+ rsp->req.p2p_dev = p2p_dev;
+ }
+
+ if (!rsp->req.p2p_dev) {
+ rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
+ if (!rsp->req.sg)
+ return NVME_SC_INTERNAL;
+ }

ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
@@ -658,6 +678,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
cmd->send_sge.addr, cmd->send_sge.length,
DMA_TO_DEVICE);

+ cmd->req.p2p_client = &queue->dev->device->dev;
+
if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
&queue->nvme_sq, &nvmet_rdma_ops))
return;
--
2.11.0


2018-03-12 19:40:50

by Logan Gunthorpe

[permalink] [raw]
Subject: [PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO SQ.

If the CMB supports WDS and RDS, publish it for use as P2P memory
by other devices.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b738f03..1fb57fa42dd0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
+#include <linux/pci-p2pdma.h>

#include "nvme.h"

@@ -91,9 +92,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
- void __iomem *cmb;
- pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
+ bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@@ -148,7 +148,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t q_lock;
struct nvme_command *sq_cmds;
- struct nvme_command __iomem *sq_cmds_io;
+ bool sq_cmds_is_io;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
dma_addr_t sq_dma_addr;
@@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
{
u16 tail = nvmeq->sq_tail;

- if (nvmeq->sq_cmds_io)
- memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
- else
- memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+ memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

if (++tail == nvmeq->q_depth)
tail = 0;
@@ -1287,9 +1284,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
- if (nvmeq->sq_cmds)
- dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
- nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+ if (nvmeq->sq_cmds) {
+ if (nvmeq->sq_cmds_is_io)
+ pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+ nvmeq->sq_cmds,
+ SQ_SIZE(nvmeq->q_depth));
+ else
+ dma_free_coherent(nvmeq->q_dmadev,
+ SQ_SIZE(nvmeq->q_depth),
+ nvmeq->sq_cmds,
+ nvmeq->sq_dma_addr);
+ }
}

static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1369,12 +1375,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
- /* CMB SQEs will be mapped before creation */
- if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
- return 0;
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+ if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+ nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+ nvmeq->sq_cmds);
+ nvmeq->sq_cmds_is_io = true;
+ }
+
+ if (!nvmeq->sq_cmds) {
+ nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+ &nvmeq->sq_dma_addr, GFP_KERNEL);
+ nvmeq->sq_cmds_is_io = false;
+ }

- nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
- &nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
return 0;
@@ -1450,13 +1465,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
struct nvme_dev *dev = nvmeq->dev;
int result;

- if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
- unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
- dev->ctrl.page_size);
- nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
- nvmeq->sq_cmds_io = dev->cmb + offset;
- }
-
nvmeq->cq_vector = qid - 1;
result = adapter_alloc_cq(dev, qid, nvmeq);
if (result < 0)
@@ -1689,9 +1697,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);

- if (!use_cmb_sqes)
- return;
-
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1708,11 +1713,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
if (size > bar_size - offset)
size = bar_size - offset;

- dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
- if (!dev->cmb)
+ if (pci_p2pdma_add_resource(pdev, bar, size, offset))
return;
- dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
+
dev->cmb_size = size;
+ dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
+
+ if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
+ (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
+ pci_p2pmem_publish(pdev, true);

if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL))
@@ -1722,12 +1731,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)

static inline void nvme_release_cmb(struct nvme_dev *dev)
{
- if (dev->cmb) {
- iounmap(dev->cmb);
- dev->cmb = NULL;
+ if (dev->cmb_size) {
sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL);
- dev->cmbsz = 0;
+ dev->cmb_size = 0;
}
}

@@ -1922,13 +1929,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;

- if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
dev->q_depth = result;
else
- nvme_release_cmb(dev);
+ dev->cmb_use_sqes = false;
}

do {
--
2.11.0


2018-03-12 19:43:18

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation

On Mon, 12 Mar 2018 13:35:19 -0600
Logan Gunthorpe <[email protected]> wrote:

> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
>
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has ben converted to restructured text
> at this time.

This all seems good, but...could we consider moving this documentation to
driver-api/PCI as it's converted to RST? That would keep it together with
similar materials and bring a bit more coherence to Documentation/ as a
whole.

Thanks,

jon

2018-03-12 21:20:11

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation

On 3/12/2018 1:41 PM, Jonathan Corbet wrote:
> This all seems good, but...could we consider moving this documentation to
> driver-api/PCI as it's converted to RST? That would keep it together with
> similar materials and bring a bit more coherence to Documentation/ as a
> whole.

Yup, I'll change this for the next revision.

Thanks,

Logan

2018-03-13 01:56:42

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> - if (nvmeq->sq_cmds_io)

I think you should keep the code as it is for the case where
(!nvmeq->sq_cmds_is_io && nvmeq->sq_cmds_io)

You are changing the behavior for NVMe drives with CMB buffers.
You can change the if statement here with the statement above.

> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> - else
> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 01:59:59

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

On 3/12/2018 9:55 PM, Sinan Kaya wrote:
> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>> - if (nvmeq->sq_cmds_io)
>
> I think you should keep the code as it is for the case where
> (!nvmeq->sq_cmds_is_io && nvmeq->sq_cmds_io)

Never mind. I misunderstood the code.


>
> You are changing the behavior for NVMe drives with CMB buffers.
> You can change the if statement here with the statement above.
>
>> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>> - else
>> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>
>
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 03:30:27

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)

It feels like code tried to be a generic p2pdma provider first. Then got
converted to PCI, yet all dev parameters are still struct device.

Maybe, dev parameter should also be struct pci_dev so that you can get rid of
all to_pci_dev() calls in this code including find_parent_pci_dev() function.

Regarding the switch business, It is amazing how much trouble you went into
limit this functionality into very specific hardware.

I thought that we reached to an agreement that code would not impose
any limits on what user wants.

What happened to all the emails we exchanged?

I understand you are coming from what you tested. Considering that you
are singing up for providing a generic PCI functionality into the kernel,
why don't you just blacklist the products that you had problems with and
yet still allow other architectures to use your code with their root ports?

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 16:45:35

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 12/03/18 09:28 PM, Sinan Kaya wrote:
> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> Regarding the switch business, It is amazing how much trouble you went into
> limit this functionality into very specific hardware.
>
> I thought that we reached to an agreement that code would not impose
> any limits on what user wants.
>
> What happened to all the emails we exchanged?

It turns out that root ports that support P2P are far less common than
anyone thought. So it will likely have to be a white list. Nobody else
seems keen on allowing the user to enable this on hardware that doesn't
work. The easiest solution is still limiting it to using a switch. From
there, if someone wants to start creating a white-list then that's
probably the way forward to support root ports.

And there's also the ACS problem which means if you want to use P2P on
the root ports you'll have to disable ACS on the entire system. (Or
preferably, the IOMMU groups need to get more sophisticated to allow for
dynamic changes).

Additionally, once you allow for root ports you may find the IOMMU
getting in the way.

So there are great deal more issues to sort out if you don't restrict to
devices behind switches.

Logan

2018-03-13 17:51:19

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/13/2018 12:43 PM, Logan Gunthorpe wrote:
>
>
> On 12/03/18 09:28 PM, Sinan Kaya wrote:
>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>> Regarding the switch business, It is amazing how much trouble you went into
>> limit this functionality into very specific hardware.
>>
>> I thought that we reached to an agreement that code would not impose
>> any limits on what user wants.
>>
>> What happened to all the emails we exchanged?
>
> It turns out that root ports that support P2P are far less common than anyone thought. So it will likely have to be a white list. Nobody else seems keen on allowing the user to enable this on hardware that doesn't work. The easiest solution is still limiting it to using a switch. From there, if someone wants to start creating a white-list then that's probably the way forward to support root ports.

I thought only few root ports had problem. Thanks for clarifying that.

>
> And there's also the ACS problem which means if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes).
>

Do you think you can keep a pointer to the parent bridge instead of querying it
via get_upstream_bridge_port() here so that we can reuse your
pci_p2pdma_disable_acs() in the future.

+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+ struct pci_dev *up;
+ int pos;
+ u16 ctrl;
+
+ up = get_upstream_bridge_port(pdev);
+ if (!up)
+ return 0;

Some future device would only deal with pci_p2pdma_add_client(() for whitelisting
instead of changing all of your code.

We should at least limit the usage of get_upstream_bridge_port() family of functions
to probe time only.

> Additionally, once you allow for root ports you may find the IOMMU getting in the way.

We can tell iommu to do one to one translation by passing iommu.passthrough=1 to kernel
command line to have identical behavior to your switch case.

>
> So there are great deal more issues to sort out if you don't restrict to devices behind switches.
>
> Logan
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 18:41:54

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 12/03/18 09:28 PM, Sinan Kaya wrote:
> Maybe, dev parameter should also be struct pci_dev so that you can get rid of
> all to_pci_dev() calls in this code including find_parent_pci_dev() function.

No, this was mentioned in v2. find_parent_pci_dev is necessary because
the calling drivers aren't likely to have a bunch of struct pci_dev's
for all the devices they are working with lying around. It's a much
nicer from an API stand point to take struct devs and not worth it just
to have a PCI API only taking struct pci_devs.

Logan

2018-03-13 18:46:28

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 11:49 AM, Sinan Kaya wrote:
>> And there's also the ACS problem which means if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes).
>>
>
> Do you think you can keep a pointer to the parent bridge instead of querying it
> via get_upstream_bridge_port() here so that we can reuse your
> pci_p2pdma_disable_acs() in the future.

Keep a pointer where? pci_p2pdma_disable_acs() and
pci_p2pdma_add_client() are used in completely different cases on
completely different devices. There really is no overlap and no obvious
place to store the port pointer (except in the struct pci_dev itself, in
which case why not just call the function again).

> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> + struct pci_dev *up;
> + int pos;
> + u16 ctrl;
> +
> + up = get_upstream_bridge_port(pdev);
> + if (!up)
> + return 0;
>
> Some future device would only deal with pci_p2pdma_add_client(() for whitelisting
> instead of changing all of your code.

That's a problem for whoever writes the future code.

> We should at least limit the usage of get_upstream_bridge_port() family of functions
> to probe time only.

Why?

> We can tell iommu to do one to one translation by passing iommu.passthrough=1 to kernel
> command line to have identical behavior to your switch case.

Well, someone will need to write code for all available IOMMUs to
support this. That's a very big project.

Logan

2018-03-13 19:12:01

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/13/2018 2:44 PM, Logan Gunthorpe wrote:
>> Do you think you can keep a pointer to the parent bridge instead of querying it
>> via get_upstream_bridge_port() here so that we can reuse your
>> pci_p2pdma_disable_acs() in the future.
>
> Keep a pointer where? pci_p2pdma_disable_acs() and pci_p2pdma_add_client() are used in completely different cases on completely different devices. There really is no overlap and no obvious place to store the port pointer (except in the struct pci_dev itself, in which case why not just call the function again).

I was thinking of this for the pci_p2pdma_add_client() case for the
parent pointer.

+struct pci_p2pdma_client {
+ struct list_head list;
+ struct pci_dev *client;
+ struct pci_dev *provider;
+};

You are right second code seems to be there to disable ACS altogether
when this kernel configuration is selected.

It doesn't have any relationship to the client code.

But then, Why bother searching for the switch at all?

Even if the switch is there, there is no guarantee that it is currently
being used for P2P.

It seems that we are going with the assumption that enabling this config
option implies you want P2P, then we can simplify this code as well.

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 19:21:05

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 01:10 PM, Sinan Kaya wrote:
> I was thinking of this for the pci_p2pdma_add_client() case for the
> parent pointer.
>
> +struct pci_p2pdma_client {
> + struct list_head list;
> + struct pci_dev *client;
> + struct pci_dev *provider;
> +};

Yeah, that structure only exists in a list owned by the client and we
only check the upstream bridge once per entry so I don't see the point.

> But then, Why bother searching for the switch at all?

Huh? We have to make sure all the client and provider devices are behind
the same switch. How can we do that without "searching" for the switch?

In the ACS case, we only disable ACS on downstream ports of switches. No
sense disabling it globally as that's worse from an isolation point of
view and not worth it given we require all P2P transactions to be behind
a switch.

> Even if the switch is there, there is no guarantee that it is currently
> being used for P2P.

IOMMU groups are set at boot time and, at present, there's no way to
dynamically change ACS bits without messing up the groups. So switches
not used for P2P will not have ACS enabled when CONFIG_PCI_P2PDMA is set
and I don't know of any good solution to that. Please see the ACS
discussions in v1 and v2.

> It seems that we are going with the assumption that enabling this config
> option implies you want P2P, then we can simplify this code as well.

How so?

Logan



2018-03-13 19:54:40

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/13/2018 3:19 PM, Logan Gunthorpe wrote:
>
>
> On 13/03/18 01:10 PM, Sinan Kaya wrote:
>> I was thinking of this for the pci_p2pdma_add_client() case for the
>> parent pointer.
>>
>> +struct pci_p2pdma_client {
>> + struct list_head list;
>> + struct pci_dev *client;
>> + struct pci_dev *provider;
>> +};
>
> Yeah, that structure only exists in a list owned by the client and we
> only check the upstream bridge once per entry so I don't see the point.
>
>> But then, Why bother searching for the switch at all?
>
> Huh? We have to make sure all the client and provider devices are behind
> the same switch. How can we do that without "searching" for the switch?
>

Sorry, I was thinking of ACS case you described below. The only thing code
cares is if the device is behind a switch or not at this moment.

> In the ACS case, we only disable ACS on downstream ports of switches. No
> sense disabling it globally as that's worse from an isolation point of
> view and not worth it given we require all P2P transactions to be behind
> a switch.

I agree disabling globally would be bad. Somebody can always say I have
ten switches on my system. I want to do peer-to-peer on one switch only. Now,
this change weakened security for the other switches that I had no intention
with doing P2P.

Isn't this a problem?

Can we specify the BDF of the downstream device we want P2P with during boot via
kernel command line?

>
>> Even if the switch is there, there is no guarantee that it is currently
>> being used for P2P.
>
> IOMMU groups are set at boot time and, at present, there's no way to
> dynamically change ACS bits without messing up the groups. So switches
> not used for P2P will not have ACS enabled when CONFIG_PCI_P2PDMA is set
> and I don't know of any good solution to that. Please see the ACS
> discussions in v1 and v2.

Given the implementation limitations, this might be OK as a short-term
solution.

It depends on if Alex is comfortable with this.

>
>> It seems that we are going with the assumption that enabling this config
>> option implies you want P2P, then we can simplify this code as well.
>
> How so?
>
> Logan
>
>
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 20:49:28

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 01:53 PM, Sinan Kaya wrote:
> I agree disabling globally would be bad. Somebody can always say I have
> ten switches on my system. I want to do peer-to-peer on one switch only. Now,
> this change weakened security for the other switches that I had no intention
> with doing P2P.
>
> Isn't this a problem?

Well, if it's a problem for someone they'll have to solve it. We're
targeting JBOFs that have no use for ACS / IOMMU groups at all.

> Can we specify the BDF of the downstream device we want P2P with during boot via
> kernel command line?

That's a painful configuration burden. And then things might stop
working if you change your topology at all and now have to change boot
parameters.

Logan

2018-03-13 21:24:08

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/13/2018 4:46 PM, Logan Gunthorpe wrote:
>
>
> On 13/03/18 01:53 PM, Sinan Kaya wrote:
>> I agree disabling globally would be bad. Somebody can always say I have
>> ten switches on my system. I want to do peer-to-peer on one switch only. Now,
>> this change weakened security for the other switches that I had no intention
>> with doing P2P.
>>
>> Isn't this a problem?
>
> Well, if it's a problem for someone they'll have to solve it. We're
> targeting JBOFs that have no use for ACS / IOMMU groups at all.
>
>> Can we specify the BDF of the downstream device we want P2P with during boot via
>> kernel command line?
>
> That's a painful configuration burden. And then things might stop
> working if you change your topology at all and now have to change boot
> parameters.
>

It sounds like you have very tight hardware expectations for this to work
at this moment. You also don't want to generalize this code for others and
address the shortcomings.

To get you going, you should limit this change to the switch products that you have
validated via white-listing PCI vendor/device ids. Please do not enable this feature
for all other PCI devices or by default.

I think your code qualifies as a virus until this issue is resolved (so NAK).

Another option is for your CONFIG to depend on BROKEN/EXPERT.

You are delivering a general purpose P2P code with a lot of holes in it and
expecting people to jump through it.

Turning security off by default is also not acceptable. Linux requires ACS support
even though you don't care about it for your particular application.

I'd hate ACS to be broken due to some operating system enabling your CONFIG option.


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 22:02:53

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 03:22 PM, Sinan Kaya wrote:
> It sounds like you have very tight hardware expectations for this to work
> at this moment. You also don't want to generalize this code for others and
> address the shortcomings.

No, that's the way the community has pushed this work. Our original work
was very general and we were told it was unacceptable to put the onus on
the user and have things break if the hardware doesn't support it. I
think that's a reasonable requirement. So the hardware use-cases were
wittled down to the ones we can be confident about and support with
reasonable changes to the kernel today.

> To get you going, you should limit this change to the switch products that you have
> validated via white-listing PCI vendor/device ids. Please do not enable this feature
> for all other PCI devices or by default.

This doesn't seem necessary. We know that all PCIe switches available
today support P2P and we are highly confident that any switch that would
ever be produced would support P2P. As long as devices are behind a
switch you don't need any white listing. This is what the current patch
set implements. If you want to start including root ports then you will
need a white list (and solve all the other problems I mentioned earlier).

> I think your code qualifies as a virus until this issue is resolved (so NAK).

That seems a bit hyperbolic... "a virus"??!... please keep things
constructive.

> You are delivering a general purpose P2P code with a lot of holes in it and
> expecting people to jump through it.
No, the code prevents users from screwing it up. It just requires a
switch in the hardware which is hardly a high bar to jump through
(provided you are putting some design thought into your hardware). And
given it requires semi-custom hardware today, it's not something that
needs to be on by default in any distributor kernel.

> Turning security off by default is also not acceptable. Linux requires ACS support
> even though you don't care about it for your particular application.

That's not true. Linux does not require ACS support. In fact it's
already off by default until you specifically turn on the IOMMU. (Which
is not always the most obvious thing to enable.) And the only thing it
really supports is enforcing isolation between VMs that are using
different pieces of hardware in the system.

> I'd hate ACS to be broken due to some operating system enabling your CONFIG option.

ACS isn't "broken" by enabling the config option. It just makes the
IOMMU groups and isolation less granular. (ie. devices behind a switch
will be in the same group and not isolated from each-other).

Logan

2018-03-13 22:32:18

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/13/2018 6:00 PM, Logan Gunthorpe wrote:
>
>
> On 13/03/18 03:22 PM, Sinan Kaya wrote:
>> It sounds like you have very tight hardware expectations for this to work
>> at this moment. You also don't want to generalize this code for others and
>> address the shortcomings.
>
> No, that's the way the community has pushed this work. Our original work
> was very general and we were told it was unacceptable to put the onus on
> the user and have things break if the hardware doesn't support it. I
> think that's a reasonable requirement. So the hardware use-cases were
> wittled down to the ones we can be confident about and support with
> reasonable changes to the kernel today.

If hardware doesn't support it, blacklisting should have been the right
path and I still think that you should remove all switch business from the code.
I did not hear enough justification for having a switch requirement
for P2P.

You are also saying that root ports have issues not because of functionality but
because of performance.

If you know what is bad, create a list and keep adding it. You can't assume
universally that all root ports are broken/ have bad performance.

>
>> To get you going, you should limit this change to the switch products that you have
>> validated via white-listing PCI vendor/device ids. Please do not enable this feature
>> for all other PCI devices or by default.
>
> This doesn't seem necessary. We know that all PCIe switches available
> today support P2P and we are highly confident that any switch that would
> ever be produced would support P2P. As long as devices are behind a
> switch you don't need any white listing. This is what the current patch
> set implements. If you want to start including root ports then you will
> need a white list (and solve all the other problems I mentioned earlier).

What if I come up with a very cheap/crappy switch (something like used in data
mining)?

What guarantees that P2P will work with this device? You are making an
assumption here that all switches have good performance.

How is that any different from good switch vs. bad switch and good root
port vs. bad root port?

If it is universally broken, why don't you list the things that work?

>
>> I think your code qualifies as a virus until this issue is resolved (so NAK).
>
> That seems a bit hyperbolic... "a virus"??!... please keep things
> constructive.
>

Sorry, this was harsh. I'm taking "virus" word back. I apologize.
But, I hold onto my NAK opinion.

I have been doing my best to provide feedback. It feels like you are throwing
them over the wall to be honest.

You keep implying "not my problem".

> > I agree disabling globally would be bad. Somebody can always say I have
> > ten switches on my system. I want to do peer-to-peer on one switch only. Now,
> > this change weakened security for the other switches that I had no intention
> > with doing P2P.
> >
> > Isn't this a problem?
>
> Well, if it's a problem for someone they'll have to solve it. We're
> targeting JBOFs that have no use for ACS / IOMMU groups at all.

IMO, you (not somebody) should address this one way or the other before this
series land in upstream.

>> You are delivering a general purpose P2P code with a lot of holes in it and
>> expecting people to jump through it.
> No, the code prevents users from screwing it up. It just requires a
> switch in the hardware which is hardly a high bar to jump through
> (provided you are putting some design thought into your hardware). And
> given it requires semi-custom hardware today, it's not something that
> needs to be on by default in any distributor kernel.
>
>> Turning security off by default is also not acceptable. Linux requires ACS support
>> even though you don't care about it for your particular application.
>
> That's not true. Linux does not require ACS support. In fact it's
> already off by default until you specifically turn on the IOMMU. (Which
> is not always the most obvious thing to enable.) And the only thing it
> really supports is enforcing isolation between VMs that are using
> different pieces of hardware in the system.

Another assumption: There are other architectures like ARM64 where IOMMU
is enabled by default even if you don't use VMs for security reasons.
IOMMU blocks stray transactions.

>
>> I'd hate ACS to be broken due to some operating system enabling your CONFIG option.
>
> ACS isn't "broken" by enabling the config option. It just makes the
> IOMMU groups and isolation less granular. (ie. devices behind a switch
> will be in the same group and not isolated from each-other).

Didn't the ACS behavior change suddenly for no good reason when we enabled
your code even though I might not be using the P2P but I happen to have
a kernel with P2P config option?

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 22:33:10

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

>> It sounds like you have very tight hardware expectations for this to work
>> at this moment. You also don't want to generalize this code for others and
>> address the shortcomings.
> No, that's the way the community has pushed this work

Hi Sinan

Thanks for all the input. As Logan has pointed out the switch requirement is something that has evolved over time based on input from the community. You are more than welcome to have an opinion on this (and you have made that opinion clear ;-)). Over time the patchset may evolve from its current requirements but right now we are aligned with the feedback from the community.

Cheers

Stephen


2018-03-13 22:48:05

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

Hi Sinan

> If hardware doesn't support it, blacklisting should have been the right
> path and I still think that you should remove all switch business from the code.
> I did not hear enough justification for having a switch requirement
> for P2P.

We disagree. As does the community based on feedback on previous version of this patch set.

> You are also saying that root ports have issues not because of functionality but
> because of performance.

I have seen both. Some systems fail in entirety to route TLPs across route ports. Others do route but with poor performance. Having a requirement for a switch is therefore a reasonable and sensible minimum criteria for enabling p2p.

> If you know what is bad, create a list and keep adding it. You can't assume
> universally that all root ports are broken/ have bad performance.

I actually wanted to do a blacklist originally but that idea was not accepted by the community.

> What if I come up with a very cheap/crappy switch (something like used in data
> mining)?
> What guarantees that P2P will work with this device? You are making an
> assumption here that all switches have good performance.

A switch must support P2P or it is in violation of the PCIe specification.

> I have been doing my best to provide feedback. It feels like you are throwing
> them over the wall to be honest.

I think one issue Sinan is that a lot of your ideas have been discussed in previous versions of the series. So you are asking us to make changes which, in some cases, we know are not acceptable to others in the community. The input is welcome but it flies in the face of other input we have received.

However if other developers feel strongly about the blacklist/whitelist and switch criteria please speak up! ;-)

Stephen



2018-03-13 22:50:46

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 04:29 PM, Sinan Kaya wrote:
> If hardware doesn't support it, blacklisting should have been the right
> path and I still think that you should remove all switch business from the code.
> I did not hear enough justification for having a switch requirement
> for P2P.

I disagree.

> You are also saying that root ports have issues not because of functionality but
> because of performance.

No... performance can also be an issue but the main justification for
this is that many root ports do not support P2P at all and can fail in
different ways (usually just dropping the TLPs).

> What if I come up with a very cheap/crappy switch (something like used in data
> mining)?

Good luck. That's not how hardware is designed. PCIe switches that have
any hope to compete with the existing market will need features like
NTB, non-transparent ports, etc... and that implies a certain symmetry
(ie there isn't a special host port because there may be more than one
and it may move around) which implies that packets will always be able
to forward between each ports which implies P2P will work.

> I have been doing my best to provide feedback. It feels like you are throwing
> them over the wall to be honest.
>
> You keep implying "not my problem".

Well, the fact of the matter is that extending this in all the ways
people like you want face huge problems on all sides. These are not
trivial issues and holding back work that works for our problem because
it doesn't solve your problem is IMO just going to grind development in
this area to a halt. We have to have something we can agree on which is
safe to start building on. The building can't just emerge fully formed
in one go.

P2P proposal go back a long time and have never gotten anywhere because
there are limitations and people want it to do things that are hard but
don't want to contribute the work to solving those problems.

>> Well, if it's a problem for someone they'll have to solve it. We're
>> targeting JBOFs that have no use for ACS / IOMMU groups at all.
>
> IMO, you (not somebody) should address this one way or the other before this
> series land in upstream.

The real way to address this (as I've mentioned before) is with some way
of doing ACS and iomem groups dynamically. But this is a huge project in
itself and is never going to be part of the P2P patchset.

> Another assumption: There are other architectures like ARM64 where IOMMU
> is enabled by default even if you don't use VMs for security reasons.
> IOMMU blocks stray transactions.

True, but it doesn't change my point: ACS is not a requirement for Linux
many many systems do not have it on at all or by default.

> Didn't the ACS behavior change suddenly for no good reason when we enabled
> your code even though I might not be using the P2P but I happen to have
> a kernel with P2P config option?

Well no, presumably you made a conscious choice to turn the config
option on and build a custom kernel for your box. That doesn't seem very
sudden and the reason is that the two concepts are very much at odds
with each other: you can't have isolation and still have transactions
between devices.

Logan


2018-03-13 23:10:09

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen Bates wrote:
> >> It sounds like you have very tight hardware expectations for this to work
> >> at this moment. You also don't want to generalize this code for others and
> >> address the shortcomings.
> > No, that's the way the community has pushed this work
>
> Hi Sinan
>
> Thanks for all the input. As Logan has pointed out the switch
> requirement is something that has evolved over time based on input
> from the community. You are more than welcome to have an opinion on
> this (and you have made that opinion clear ;-)). Over time the
> patchset may evolve from its current requirements but right now we
> are aligned with the feedback from the community.

This part of the community hasn't been convinced of the need to have
two bridges, e.g., both an Upstream Port and a Downstream Port, or two
conventional PCI bridges, above the peers.

Every PCI-to-PCI bridge is required to support routing transactions
between devices on its secondary side. Therefore, I think it is
sufficient to verify that the potential peers share a single common
upstream bridge. This could be a conventional PCI bridge, a Switch
Downstream Port, or a Root Port.

I've seen the response that peers directly below a Root Port could not
DMA to each other through the Root Port because of the "route to self"
issue, and I'm not disputing that. But enforcing a requirement for
two upstream bridges introduces a weird restriction on conventional
PCI topologies, makes the code hard to read, and I don't think it's
necessary.

If it *is* necessary because Root Ports and devices below them behave
differently than in conventional PCI, I think you should include a
reference to the relevant section of the spec and check directly for a
Root Port. I would prefer that over trying to exclude Root Ports by
looking for two upstream bridges.

Bjorn

2018-03-13 23:20:41

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On 3/13/2018 6:48 PM, Logan Gunthorpe wrote:
>
>
> On 13/03/18 04:29 PM, Sinan Kaya wrote:
>> If hardware doesn't support it, blacklisting should have been the right
>> path and I still think that you should remove all switch business from the code.
>> I did not hear enough justification for having a switch requirement
>> for P2P.
>
> I disagree.
>
>> You are also saying that root ports have issues not because of functionality but
>> because of performance.
>
> No... performance can also be an issue but the main justification for
> this is that many root ports do not support P2P at all and can fail in
> different ways (usually just dropping the TLPs).
>
>> What if I come up with a very cheap/crappy switch (something like used in data
>> mining)?
>
> Good luck. That's not how hardware is designed. PCIe switches that have
> any hope to compete with the existing market will need features like
> NTB, non-transparent ports, etc... and that implies a certain symmetry
> (ie there isn't a special host port because there may be more than one
> and it may move around) which implies that packets will always be able
> to forward between each ports which implies P2P will work.
>

It is still a switch it can move packets but, maybe it can move data at
100kbps speed.

What prevents that switch from trying P2P and having a bad user experience?

If everything is so broken, I was suggesting to at least list the switches
you have tested.

What's the problem with this?

Why do you want to assume that all switches are good and all root ports are
bad?

>> I have been doing my best to provide feedback. It feels like you are throwing
>> them over the wall to be honest.
>>
>> You keep implying "not my problem".
>
> Well, the fact of the matter is that extending this in all the ways
> people like you want face huge problems on all sides. These are not
> trivial issues and holding back work that works for our problem because
> it doesn't solve your problem is IMO just going to grind development in
> this area to a halt. We have to have something we can agree on which is
> safe to start building on. The building can't just emerge fully formed
> in one go.
>

What if the design is so canned that you can't change anything?

I have been asking things like getting rid of switch search in ACS
enablement towards achieving generic P2P. You seem to be pushing back.
You said yourself P2P and isolation doesn't go together at this point
but you also care about isolation for other devices that are not doing
P2P.

Confusing...

> P2P proposal go back a long time and have never gotten anywhere because
> there are limitations and people want it to do things that are hard but
> don't want to contribute the work to solving those problems.
>
>>> Well, if it's a problem for someone they'll have to solve it. We're
>>> targeting JBOFs that have no use for ACS / IOMMU groups at all.
>>
>> IMO, you (not somebody) should address this one way or the other before this
>> series land in upstream.
>
> The real way to address this (as I've mentioned before) is with some way
> of doing ACS and iomem groups dynamically. But this is a huge project in
> itself and is never going to be part of the P2P patchset.

fair enough.

>
>> Another assumption: There are other architectures like ARM64 where IOMMU
>> is enabled by default even if you don't use VMs for security reasons.
>> IOMMU blocks stray transactions.
>
> True, but it doesn't change my point: ACS is not a requirement for Linux
> many many systems do not have it on at all or by default.

I don't think so.

It is not a requirement for you but it is a requirement for me (ARM64 guy).
Linux happens to run on multiple architectures. One exception invalidates your
point.

>
>> Didn't the ACS behavior change suddenly for no good reason when we enabled
>> your code even though I might not be using the P2P but I happen to have
>> a kernel with P2P config option?
>

If you are assuming that your kernel option should not be used by general
distributions like Ubuntu/redhat etc. and requires a kernel compilation,
creating a dependency to EXPERT is the right way to do.

Distributions assume that no damage is done by enabling PCI bus options
under normal circumstances.

> Well no, presumably you made a conscious choice to turn the config
> option on and build a custom kernel for your box. That doesn't seem very
> sudden and the reason is that the two concepts are very much at odds
> with each other: you can't have isolation and still have transactions
> between devices.
>
> Logan
>
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-13 23:23:03

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen Bates wrote:
> If it *is* necessary because Root Ports and devices below them behave
> differently than in conventional PCI, I think you should include a
> reference to the relevant section of the spec and check directly for a
> Root Port. I would prefer that over trying to exclude Root Ports by
> looking for two upstream bridges.

Well we've established that we don't want to allow root ports.

But we need to, at a minimum, do two pci_upstream_bridge() calls...

Remember, we need to check that a number of devices are behind the same
switch. So we need to find a _common_ upstream port for several devices.
Excluding the multifunction device case (which I don't think is
applicable for reasons we've discussed before), this will *never* be the
first upstream port for a given device. We need to find the upstream
port of the switch and ensure all devices share that same port. If a
device is behind a switch, it's pci_upstream_bridge() is the downstream
switch port which is unique to that device. So a peer device would have
a different pci_upstream_bridge() port but share the same
pci_upstream_bridge(pci_upstream_bridge()) port (assuming they are on
the same switch).

The alternative, is to walk the entire tree of upstream ports and check
that all devices have a common port somewhere in the tree that isn't the
root complex. Arguably this is the right way to do it and would support
a network of switches but is a fair bit more complex to code.

Logan

2018-03-13 23:48:45

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 05:19 PM, Sinan Kaya wrote:
> It is still a switch it can move packets but, maybe it can move data at
> 100kbps speed.

As Stephen pointed out, it's a requirement of the PCIe spec that a
switch supports P2P. If you want to sell a switch that does P2P with bad
performance then that's on you to deal with.

> What prevents that switch from trying P2P and having a bad user experience?

The fact that no one would buy such a switch as it would have a bad user
experience with regular PCI transfers. A P2P TLP is not special on a
switch as it's routed in just the same way as any others. There'd also
be no cost gain in making such a broken-by-design device.

> If everything is so broken, I was suggesting to at least list the switches
> you have tested.
>
> What's the problem with this?

More complexity for zero gain.

> Why do you want to assume that all switches are good and all root ports are
> bad?

Because the assumption that all switches are good is more than
reasonable and simplifies the code and maintenance significantly. No one
wants to maintain a white list when they don't have to.

> What if the design is so canned that you can't change anything?

Based on the feedback we've got so far and the developers that have
participated in getting it to where it is, it is not "canned".

> I have been asking things like getting rid of switch search in ACS
> enablement towards achieving generic P2P. You seem to be pushing back.
> You said yourself P2P and isolation doesn't go together at this point
> but you also care about isolation for other devices that are not doing
> P2P.

P2P and isolation will never be compatible at any point. They are two
opposite concepts. So we could just disable isolation across the whole
system and for our purposes that would be fine. But it's relatively
simple to limit this and only disable it behind switches. So why
wouldn't we? It enables use cases like having an isolated card on the
root complex used in a VM while having P2P on cards behind switches. I
personally have no interest in doing this but I also have no reason to
prevent it with my code.

> It is not a requirement for you but it is a requirement for me (ARM64 guy).
> Linux happens to run on multiple architectures. One exception invalidates your
> point.

It is not a requirement of an architecture or people that use a specific
architecture. It is a requirement of the use-case and you have not said
any use case or how we could do better. If you're running VMs that
require isolation you will need to be *very* careful if you also want to
do P2P between cards which requires no isolation. But, based on my
understanding, most people will want to do one or the other -- not both.
If you want to do P2P you enable the P2P config option, if you want
isolation you don't.

> If you are assuming that your kernel option should not be used by general
> distributions like Ubuntu/redhat etc. and requires a kernel compilation,
> creating a dependency to EXPERT is the right way to do.

I don't have a problem adding EXPERT as a dependency. We can do that for
v4. I'd rather hope that distros actually read and understand the
kconfig help text before enabling an off-by-default option. But maybe
I'm asking too much.

Logan

2018-03-14 02:57:59

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Tue, Mar 13, 2018 at 05:21:20PM -0600, Logan Gunthorpe wrote:
> On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> > On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen Bates wrote:
> > If it *is* necessary because Root Ports and devices below them behave
> > differently than in conventional PCI, I think you should include a
> > reference to the relevant section of the spec and check directly for a
> > Root Port. I would prefer that over trying to exclude Root Ports by
> > looking for two upstream bridges.
>
> Well we've established that we don't want to allow root ports.

I assume you want to exclude Root Ports because of multi-function
devices and the "route to self" error. I was hoping for a reference
to that so I could learn more about it.

While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
Functions in SR-IOV Capable and Multi-Function Devices", which seems
relevant. It talks about "peer-to-peer Requests (between Functions of
the device)". Thay says to me that multi-function devices can DMA
between themselves.

> But we need to, at a minimum, do two pci_upstream_bridge() calls...
>
> Remember, we need to check that a number of devices are behind the same
> switch. So we need to find a _common_ upstream port for several devices.

I agree that peers need to have a common upstream bridge. I think
you're saying peers need to have *two* common upstream bridges. If I
understand correctly, requiring two common bridges is a way to ensure
that peers directly below Root Ports don't try to DMA to each other.

So I guess the first order of business is to nail down whether peers
below a Root Port are prohibited from DMAing to each other. My
assumption, based on 6.12.1.2 and the fact that I haven't yet found
a prohibition, is that they can.

> Excluding the multifunction device case (which I don't think is
> applicable for reasons we've discussed before), this will *never* be the
> first upstream port for a given device.

If you're looking for a common upstream bridge, you don't need to make
assumptions about whether the hierarchy is conventional PCI or PCIe or
how many levels are in the hierarchy.

You already have upstream_bridges_match(), which takes two pci_devs.
I think it should walk up the PCI hierarchy from the first device,
checking whether the bridge at each level is also a parent of the
second device.

Bjorn

2018-03-14 12:16:34

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

From: Logan Gunthorpe
> Sent: 13 March 2018 23:46
...
> As Stephen pointed out, it's a requirement of the PCIe spec that a
> switch supports P2P. If you want to sell a switch that does P2P with bad
> performance then that's on you to deal with.

That surprises me (unless I missed something last time I read the spec).
While P2P writes are relatively easy to handle, reads and any other TLP that
require acks are a completely different proposition.
There are no additional fields that can be set in the read TLP and will be
reflected back in the ack(s) than can be used to route the acks back to the
correct initiator.

I'm pretty sure that to support P2P reads a switch would have to save
the received read TLP and (possibly later on) issue read TLP of its own
for the required data.
I'm not even sure it is easy to interleave the P2P reads with those
coming from the root.
That requires a potentially infinite queue of pending requests.

Some x86 root ports support P2P writes (maybe with a bios option).
It would be a shame not to be able to do P2P writes on such systems
even though P2P reads won't work.

(We looked at using P2P transfers for some data, but in the end used
a different scheme.
For our use case P2P writes were enough.
An alternative would be to access the same host memory buffer from
two different devices - but there isn't an API that lets you do that.)

David


2018-03-14 14:07:05

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

>I assume you want to exclude Root Ports because of multi-function
> devices and the "route to self" error. I was hoping for a reference
> to that so I could learn more about it.

Apologies Bjorn. This slipped through my net. I will try and get you a reference for RTS in the next couple of days.

> While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS
> Functions in SR-IOV Capable and Multi-Function Devices", which seems
> relevant. It talks about "peer-to-peer Requests (between Functions of
> the device)". Thay says to me that multi-function devices can DMA
> between themselves.

I will go take a look. Appreciate the link.

Stephen

2018-03-14 16:19:30

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> I assume you want to exclude Root Ports because of multi-function
> devices and the "route to self" error. I was hoping for a reference
> to that so I could learn more about it.

I haven't been able to find where in the spec it forbids route to self.
But I was told this by developers who work know switches. Hopefully
Stephen can find the reference.

But it's a bit of a moot point. Devices can DMA to themselves if they
are designed to do so. For example, some NVMe cards can read and write
their own CMB for certain types of DMA request. There is a register in
the spec (CMBSZ) which specifies which types of requests are supported.
(See 3.1.12 in NVMe 1.3a).

> I agree that peers need to have a common upstream bridge. I think
> you're saying peers need to have *two* common upstream bridges. If I
> understand correctly, requiring two common bridges is a way to ensure
> that peers directly below Root Ports don't try to DMA to each other.

No, I don't get where you think we need to have two common upstream
bridges. I'm not sure when such a case would ever happen. But you seem
to understand based on what you wrote below.

> So I guess the first order of business is to nail down whether peers
> below a Root Port are prohibited from DMAing to each other. My
> assumption, based on 6.12.1.2 and the fact that I haven't yet found
> a prohibition, is that they can.

If you have a multifunction device designed to DMA to itself below a
root port, it can. But determining this is on a device by device basis,
just as determining whether a root complex can do peer to peer is on a
per device basis. So I'd say we don't want to allow it by default and
let someone who has such a device figure out what's necessary if and
when one comes along.

> You already have upstream_bridges_match(), which takes two pci_devs.
> I think it should walk up the PCI hierarchy from the first device,
> checking whether the bridge at each level is also a parent of the
> second device.

Yes, this is what I meant when I said walking the entire tree. I've been
kicking the can down the road on implementing this as getting ref
counting right and testing it is going to be quite tricky. The single
switch approach we implemented now is just a simplification which works
for a single switch. But I guess we can look at implementing it this way
for v4.

Logan

2018-03-14 16:26:59

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 14/03/18 06:16 AM, David Laight wrote:
> That surprises me (unless I missed something last time I read the spec).
> While P2P writes are relatively easy to handle, reads and any other TLP that
> require acks are a completely different proposition.
> There are no additional fields that can be set in the read TLP and will be
> reflected back in the ack(s) than can be used to route the acks back to the
> correct initiator.
>
> I'm pretty sure that to support P2P reads a switch would have to save
> the received read TLP and (possibly later on) issue read TLP of its own
> for the required data.
> I'm not even sure it is easy to interleave the P2P reads with those
> coming from the root.
> That requires a potentially infinite queue of pending requests.

This is wrong. A completion is a TLP just like any other and makes use
of the Destination ID field in the header to route it back to the
original requester.

> Some x86 root ports support P2P writes (maybe with a bios option).
> It would be a shame not to be able to do P2P writes on such systems
> even though P2P reads won't work.

Yes, and this has been discussed many times. It won't be changing in the
near term.

Logan

2018-03-14 18:53:14

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Wed, Mar 14, 2018 at 10:17:34AM -0600, Logan Gunthorpe wrote:
> On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> > I agree that peers need to have a common upstream bridge. I think
> > you're saying peers need to have *two* common upstream bridges. If I
> > understand correctly, requiring two common bridges is a way to ensure
> > that peers directly below Root Ports don't try to DMA to each other.
>
> No, I don't get where you think we need to have two common upstream
> bridges. I'm not sure when such a case would ever happen. But you seem
> to understand based on what you wrote below.

Sorry, I phrased that wrong. You don't require two common upstream
bridges; you require two upstream bridges, with the upper one being
common, i.e.,

static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
{
struct pci_dev *up1, *up2;

up1 = pci_dev_get(pci_upstream_bridge(pdev));
up2 = pci_dev_get(pci_upstream_bridge(up1));
return up2;
}

So if you're starting with pdev, up1 is the immediately upstream
bridge and up2 is the second upstream bridge. If this is PCIe, up1
may be a Root Port and there is no up2, or up1 and up2 are in a
switch.

This is more restrictive than the spec requires. As long as there is
a single common upstream bridge, peer-to-peer DMA should work. In
fact, in conventional PCI, I think the upstream bridge could even be
the host bridge (not a PCI-to-PCI bridge).

You are focused on PCIe systems, and in those systems, most topologies
do have an upstream switch, which means two upstream bridges. I'm
trying to remove that assumption because I don't think there's a
requirement for it in the spec. Enforcing this assumption complicates
the code and makes it harder to understand because the reader says
"huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
so why do we need these two bridges?"

[*] For conventional PCI, this means anything below the same host
bridge. Two devices on a conventional PCI root bus should be able to
DMA to each other, even though there's no PCI-to-PCI bridge above
them. For PCIe, it means a "hierarchy domain" as used in PCIe r4.0,
sec 1.3.1, i.e., anything below the same Root Port.

> > So I guess the first order of business is to nail down whether peers
> > below a Root Port are prohibited from DMAing to each other. My
> > assumption, based on 6.12.1.2 and the fact that I haven't yet found
> > a prohibition, is that they can.
>
> If you have a multifunction device designed to DMA to itself below a
> root port, it can. But determining this is on a device by device basis,
> just as determining whether a root complex can do peer to peer is on a
> per device basis. So I'd say we don't want to allow it by default and
> let someone who has such a device figure out what's necessary if and
> when one comes along.

It's not the job of this infrastructure to answer the device-dependent
question of whether DMA initiators or targets support peer-to-peer
DMA.

All we want to do here is figure out whether the PCI topology supports
it, using the mechanisms guaranteed by the spec. We can derive that
from the basic rules about how PCI bridges work, i.e., from the
PCI-to-PCI Bridge spec r1.2, sec 4.3:

A bridge forwards PCI memory transactions from its primary interface
to its secondary interface (downstream) if a memory address is in
the range defined by the Memory Base and Memory Limit registers
(when the base is less than or equal to the limit) as illustrated in
Figure 4-3. Conversely, a memory transaction on the secondary
interface that is within this address range will not be forwarded
upstream to the primary interface. Any memory transactions on the
secondary interface that are outside this address range will be
forwarded upstream to the primary interface (provided they are not
in the address range defined by the prefetchable memory address
range registers).

This works for either PCI or PCIe. The only wrinkle PCIe adds is that
the very top of the hierarchy is a Root Port, and we can't rely on it
to route traffic to other Root Ports. I also doubt Root Complex
Integrated Endpoints can participate in peer-to-peer DMA.

Thanks for your patience in working through all this. I know it
sometimes feels like being bounced around in all directions. It's
just a normal consequence of trying to add complex functionality to an
already complex system, with interest and expertise spread unevenly
across a crowd of people.

Bjorn

2018-03-14 19:06:59

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
> You are focused on PCIe systems, and in those systems, most topologies
> do have an upstream switch, which means two upstream bridges. I'm
> trying to remove that assumption because I don't think there's a
> requirement for it in the spec. Enforcing this assumption complicates
> the code and makes it harder to understand because the reader says
> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
> so why do we need these two bridges?"

Yes, as I've said, we focused on being behind a single PCIe Switch
because it's easier and vaguely safer (we *know* switches will work but
other types of topology we have to assume will work based on the spec).
Also, I have my doubts that anyone will ever have a use for this with
non-PCIe devices.

A switch shows up as two or more virtual bridges (per the PCIe v4 Spec
1.3.3) which explains the existing get_upstream_bridge_port() function.

In any case, we'll look at generalizing this by looking for a common
upstream port in the next revision of the patch set.

Logan



2018-03-14 19:30:23

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Wed, Mar 14, 2018 at 12:03 PM, Logan Gunthorpe <[email protected]> wrote:
>
>
> On 14/03/18 12:51 PM, Bjorn Helgaas wrote:
>> You are focused on PCIe systems, and in those systems, most topologies
>> do have an upstream switch, which means two upstream bridges. I'm
>> trying to remove that assumption because I don't think there's a
>> requirement for it in the spec. Enforcing this assumption complicates
>> the code and makes it harder to understand because the reader says
>> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*,
>> so why do we need these two bridges?"
>
> Yes, as I've said, we focused on being behind a single PCIe Switch
> because it's easier and vaguely safer (we *know* switches will work but
> other types of topology we have to assume will work based on the spec).
> Also, I have my doubts that anyone will ever have a use for this with
> non-PCIe devices.

P2P over PCI/PCI-X is quite common in devices like raid controllers.
It would be useful if those configurations were not left behind so
that Linux could feasibly deploy offload code to a controller in the
PCI domain.

2018-03-14 19:32:26

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 14/03/18 01:28 PM, Dan Williams wrote:
> P2P over PCI/PCI-X is quite common in devices like raid controllers.
> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.

Thanks for the note. Neat. In the end nothing is getting left behind
it's just work for someone to add support. Even if I wasn't already
going to make the change I mentioned it all fits in the architecture and
APIs quite easily.

Logan


2018-03-14 19:36:09

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

> P2P over PCI/PCI-X is quite common in devices like raid controllers.

Hi Dan

Do you mean between PCIe devices below the RAID controller? Isn't it pretty novel to be able to support PCIe EPs below a RAID controller (as opposed to SCSI based devices)?

> It would be useful if those configurations were not left behind so
> that Linux could feasibly deploy offload code to a controller in the
> PCI domain.

Agreed. I think this would be great. Kind of like the XCOPY framework that was proposed a while back for SCSI devices [1] but updated to also include NVMe devices. That is definitely a use case we would like this framework to support.

Stephen

[1] https://lwn.net/Articles/592094/

2018-03-15 04:04:46

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory


Stephen,

>> It would be useful if those configurations were not left behind so
>> that Linux could feasibly deploy offload code to a controller in the
>> PCI domain.
>
> Agreed. I think this would be great. Kind of like the XCOPY framework
> that was proposed a while back for SCSI devices [1] but updated to also
> include NVMe devices. That is definitely a use case we would like this
> framework to support.

I'm on my umpteenth rewrite of the block/SCSI offload code. It is not as
protocol-agnostic as I would like in the block layer facing downwards.
It has proven quite hard to reconcile token-based and EXTENDED COPY
semantics along with the desire to support stacking. But from an
application/filesystem perspective everything looks the same regardless
of the intricacies of the device. Nothing is preventing us from
supporting other protocols...

--
Martin K. Petersen Oracle Linux Engineering

2018-03-15 04:32:36

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Wed, Mar 14, 2018 at 12:34 PM, Stephen Bates <[email protected]> wrote:
>> P2P over PCI/PCI-X is quite common in devices like raid controllers.
>
> Hi Dan
>
> Do you mean between PCIe devices below the RAID controller? Isn't it pretty novel to be able to support PCIe EPs below a RAID controller (as opposed to SCSI based devices)?

I'm thinking of the classic I/O offload card where there's an NTB to
an internal PCI bus that has a storage controller and raid offload
engines.

2018-03-21 09:28:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] nvmet: Optionally use PCI P2P memory

> + const char *page, size_t count)
> +{
> + struct nvmet_port *port = to_nvmet_port(item);
> + struct device *dev;
> + struct pci_dev *p2p_dev = NULL;
> + bool use_p2pmem;
> +
> + switch (page[0]) {
> + case 'y':
> + case 'Y':
> + case 'a':
> + case 'A':
> + use_p2pmem = true;
> + break;
> + case 'n':
> + case 'N':
> + use_p2pmem = false;
> + break;
> + default:
> + dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
> + if (!dev) {
> + pr_err("No such PCI device: %s\n", page);
> + return -ENODEV;
> + }
> +
> + use_p2pmem = true;
> + p2p_dev = to_pci_dev(dev);
> +
> + if (!pci_has_p2pmem(p2p_dev)) {
> + pr_err("PCI device has no peer-to-peer memory: %s\n",
> + page);
> + pci_dev_put(p2p_dev);
> + return -ENODEV;
> + }
> + }

Yikes. Shouldn't auto just be the normal yes case instead of this
string parsing mess?

> + if (rsp->req.sg != &rsp->cmd->inline_sg) {
> + if (rsp->req.p2p_dev)
> + pci_p2pmem_free_sgl(rsp->req.p2p_dev, rsp->req.sg,
> + rsp->req.sg_cnt);
> + else
> + sgl_free(rsp->req.sg);
> + }

Can we factor this into a helper, as the other target drivers (fc for now,
tcp soon) using sgl allocatins should share the code?

(same for the alloc side)


2018-03-21 16:55:08

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 11/11] nvmet: Optionally use PCI P2P memory



On 21/03/18 03:27 AM, Christoph Hellwig wrote:
>> + const char *page, size_t count)
>> +{
>> + struct nvmet_port *port = to_nvmet_port(item);
>> + struct device *dev;
>> + struct pci_dev *p2p_dev = NULL;
>> + bool use_p2pmem;
>> +
>> + switch (page[0]) {
>> + case 'y':
>> + case 'Y':
>> + case 'a':
>> + case 'A':
>> + use_p2pmem = true;
>> + break;
>> + case 'n':
>> + case 'N':
>> + use_p2pmem = false;
>> + break;
>> + default:
>> + dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
>> + if (!dev) {
>> + pr_err("No such PCI device: %s\n", page);
>> + return -ENODEV;
>> + }
>> +
>> + use_p2pmem = true;
>> + p2p_dev = to_pci_dev(dev);
>> +
>> + if (!pci_has_p2pmem(p2p_dev)) {
>> + pr_err("PCI device has no peer-to-peer memory: %s\n",
>> + page);
>> + pci_dev_put(p2p_dev);
>> + return -ENODEV;
>> + }
>> + }
>
> Yikes. Shouldn't auto just be the normal yes case instead of this
> string parsing mess?

Sorry, I don't follow. The code, as is, should automatically select the
device if the user sets it to "yes" or "auto" or "y" or similar.
(Roughly similar to how kstrtobool() works, except '0' or '1' are not
accepted seeing they could overlap with PCI device names). In other
cases, it looks for the specific PCI device name to use exactly.

Are you saying it shouldn't work this way or are you saying the code to
implement it is too messy?

>> + if (rsp->req.sg != &rsp->cmd->inline_sg) {
>> + if (rsp->req.p2p_dev)
>> + pci_p2pmem_free_sgl(rsp->req.p2p_dev, rsp->req.sg,
>> + rsp->req.sg_cnt);
>> + else
>> + sgl_free(rsp->req.sg);
>> + }
>
> Can we factor this into a helper, as the other target drivers (fc for now,
> tcp soon) using sgl allocatins should share the code?
>
> (same for the alloc side)

Sure. Would the helpers belong in core.c?

Thanks,

Logan

2018-03-22 22:58:47

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

> I've seen the response that peers directly below a Root Port could not
> DMA to each other through the Root Port because of the "route to self"
> issue, and I'm not disputing that.

Bjorn

You asked me for a reference to RTS in the PCIe specification. As luck would have it I ended up in an Irish bar with Peter Onufryk this week at OCP Summit. We discussed the topic. It is not explicitly referred to as "Route to Self" and it's certainly not explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification discusses error conditions for virtual PCI bridges. One of these conditions (given in the very first bullet in that section) applies to a request that is destined for the same port it came in on. When this occurs the request must be terminated as a UR.

Stephen


2018-03-23 21:51:56

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Thu, Mar 22, 2018 at 10:57:32PM +0000, Stephen Bates wrote:
> > I've seen the response that peers directly below a Root Port could not
> > DMA to each other through the Root Port because of the "route to self"
> > issue, and I'm not disputing that.
>
> Bjorn
>
> You asked me for a reference to RTS in the PCIe specification. As
> luck would have it I ended up in an Irish bar with Peter Onufryk
> this week at OCP Summit. We discussed the topic. It is not
> explicitly referred to as "Route to Self" and it's certainly not
> explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification
> discusses error conditions for virtual PCI bridges. One of these
> conditions (given in the very first bullet in that section) applies
> to a request that is destined for the same port it came in on. When
> this occurs the request must be terminated as a UR.

Thanks for that reference!

I suspect figure 10-3 in sec 10.1.1 might also be relevant, although
it's buried in the ATS section. It shows internal routing between
functions of a multifunction device. That suggests that the functions
should be able to DMA to each other without those transactions ever
appearing on the link.

Popping way up the stack, my original point was that I'm trying to
remove restrictions on what devices can participate in peer-to-peer
DMA. I think it's fairly clear that in conventional PCI, any devices
in the same PCI hierarchy, i.e., below the same host-to-PCI bridge,
should be able to DMA to each other.

The routing behavior of PCIe is supposed to be compatible with
conventional PCI, and I would argue that this effectively requires
multi-function PCIe devices to have the internal routing required to
avoid the route-to-self issue.

Bjorn

2018-03-23 22:01:40

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 23/03/18 03:50 PM, Bjorn Helgaas wrote:
> Popping way up the stack, my original point was that I'm trying to
> remove restrictions on what devices can participate in peer-to-peer
> DMA. I think it's fairly clear that in conventional PCI, any devices
> in the same PCI hierarchy, i.e., below the same host-to-PCI bridge,
> should be able to DMA to each other.

Yup, we are working on this.

> The routing behavior of PCIe is supposed to be compatible with
> conventional PCI, and I would argue that this effectively requires
> multi-function PCIe devices to have the internal routing required to
> avoid the route-to-self issue.

That would be very nice but many devices do not support the internal
route. We've had to work around this in the past and as I mentioned
earlier that NVMe devices have a flag indicating support. However, if a
device wants to be involved in P2P it must support it and we can exclude
devices that don't support it by simply not enabling their drivers.

Logan

2018-03-24 03:55:33

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Fri, Mar 23, 2018 at 03:59:14PM -0600, Logan Gunthorpe wrote:
> On 23/03/18 03:50 PM, Bjorn Helgaas wrote:
> > Popping way up the stack, my original point was that I'm trying to
> > remove restrictions on what devices can participate in
> > peer-to-peer DMA. I think it's fairly clear that in conventional
> > PCI, any devices in the same PCI hierarchy, i.e., below the same
> > host-to-PCI bridge, should be able to DMA to each other.
>
> Yup, we are working on this.
>
> > The routing behavior of PCIe is supposed to be compatible with
> > conventional PCI, and I would argue that this effectively requires
> > multi-function PCIe devices to have the internal routing required
> > to avoid the route-to-self issue.
>
> That would be very nice but many devices do not support the internal
> route. We've had to work around this in the past and as I mentioned
> earlier that NVMe devices have a flag indicating support. However,
> if a device wants to be involved in P2P it must support it and we
> can exclude devices that don't support it by simply not enabling
> their drivers.

Do you think these devices that don't support internal DMA between
functions are within spec, or should we handle them as exceptions,
e.g., via quirks?

If NVMe defines a flag indicating peer-to-peer support, that would
suggest to me that these devices are within spec.

I looked up the CMBSZ register you mentioned (NVMe 1.3a, sec 3.1.12).
You must be referring to the WDS, RDS, LISTS, CQS, and SQS bits. If
WDS is set, the controller supports having Write-related data and
metadata in the Controller Memory Buffer. That would mean the driver
could put certain queues in controller memory instead of in host
memory. The controller could then read the queue from its own
internal memory rather than issuing a PCIe transaction to read it from
host memory.

That makes sense to me, but I don't see the connection to
peer-to-peer. There's no multi-function device in this picture, so
it's not about internal DMA between functions.

WDS, etc., tell us about capabilities of the controller. If WDS is
set, the CPU (or a peer PCIe device) can write things to controller
memory. If it is clear, neither the CPU nor a peer device can put
things there. So it doesn't seem to tell us anything about
peer-to-peer specifically. It looks like information needed by the
NVMe driver, but not by the PCI core.

Bjorn

2018-03-24 15:30:02

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

> That would be very nice but many devices do not support the internal
> route.

But Logan in the NVMe case we are discussing movement within a single function (i.e. from a NVMe namespace to a NVMe CMB on the same function). Bjorn is discussing movement between two functions (PFs or VFs) in the same PCIe EP. In the case of multi-function endpoints I think the standard requires those devices to support internal DMAs for transfers between those functions (but does not require it within a function).

So I think the summary is:

1. There is no requirement for a single function to support internal DMAs but in the case of NVMe we do have a protocol specific way for a NVMe function to indicate it supports via the CMB BAR. Other protocols may also have such methods but I am not aware of them at this time.

2. For multi-function end-points I think it is a requirement that DMAs *between* functions are supported via an internal path but this can be over-ridden by ACS when supported in the EP.

3. For multi-function end-points there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device).

Based on my review of the specification I concur with Bjorn that p2pdma between functions in a MF end-point should be assured to be supported via the standard. However if the p2pdma involves only a single function in a MF device then we can only support NVMe CMBs for now. Let's review and see what the options are for supporting this in the next respin.

Stephen


2018-03-26 11:13:35

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Tue, 13 Mar 2018 10:43:55 -0600
Logan Gunthorpe <[email protected]> wrote:

> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > Regarding the switch business, It is amazing how much trouble you went into
> > limit this functionality into very specific hardware.
> >
> > I thought that we reached to an agreement that code would not impose
> > any limits on what user wants.
> >
> > What happened to all the emails we exchanged?
>
> It turns out that root ports that support P2P are far less common than
> anyone thought. So it will likely have to be a white list.

This came as a bit of a surprise to our PCIe architect.

His follow up was whether it was worth raising an ECR for the PCIe spec
to add a capability bit to allow this to be discovered. This might
long term avoid the need to maintain the white list for new devices.

So is it worth having a long term solution for making this discoverable?

Jonathan

> Nobody else
> seems keen on allowing the user to enable this on hardware that doesn't
> work. The easiest solution is still limiting it to using a switch. From
> there, if someone wants to start creating a white-list then that's
> probably the way forward to support root ports.
>
> And there's also the ACS problem which means if you want to use P2P on
> the root ports you'll have to disable ACS on the entire system. (Or
> preferably, the IOMMU groups need to get more sophisticated to allow for
> dynamic changes).
>
> Additionally, once you allow for root ports you may find the IOMMU
> getting in the way.
>
> So there are great deal more issues to sort out if you don't restrict to
> devices behind switches.
>
> Logan


2018-03-26 14:04:28

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe <[email protected]> wrote:
> > It turns out that root ports that support P2P are far less common than
> > anyone thought. So it will likely have to be a white list.
>
> This came as a bit of a surprise to our PCIe architect.
>
> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered. This might
> long term avoid the need to maintain the white list for new devices.
>
> So is it worth having a long term solution for making this discoverable?

It was surprising to me that there's no architected way to discover
this. It seems like such an obvious thing that I guess I assumed the
omission was intentional, i.e., maybe there's something that makes it
impractical, but it would be worth at least asking somebody in the
SIG. It seems like for root ports in the same root complex, at least,
there could be a bit somewhere in the root port or the RCRB (which
Linux doesn't support yet).

2018-03-26 15:45:02

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 24/03/18 09:28 AM, Stephen Bates wrote:
> 1. There is no requirement for a single function to support internal DMAs but in the case of NVMe we do have a protocol specific way for a NVMe function to indicate it supports via the CMB BAR. Other protocols may also have such methods but I am not aware of them at this time.
>
> 2. For multi-function end-points I think it is a requirement that DMAs *between* functions are supported via an internal path but this can be over-ridden by ACS when supported in the EP.
>
> 3. For multi-function end-points there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device).

Yeah, that make sense. I wasn't making a huge distinction between single
and multi-function devices because from an implementation point of view
it's not that different.

Logan

2018-03-26 15:48:08

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 26/03/18 08:01 AM, Bjorn Helgaas wrote:
> On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
>> On Tue, 13 Mar 2018 10:43:55 -0600
>> Logan Gunthorpe <[email protected]> wrote:
>>> It turns out that root ports that support P2P are far less common than
>>> anyone thought. So it will likely have to be a white list.
>>
>> This came as a bit of a surprise to our PCIe architect.
>>
>> His follow up was whether it was worth raising an ECR for the PCIe spec
>> to add a capability bit to allow this to be discovered. This might
>> long term avoid the need to maintain the white list for new devices.
>>
>> So is it worth having a long term solution for making this discoverable?
>
> It was surprising to me that there's no architected way to discover
> this. It seems like such an obvious thing that I guess I assumed the
> omission was intentional, i.e., maybe there's something that makes it
> impractical, but it would be worth at least asking somebody in the
> SIG. It seems like for root ports in the same root complex, at least,
> there could be a bit somewhere in the root port or the RCRB (which
> Linux doesn't support yet).

Yes, I agree. It would be a good long term solution to have this bit in
the spec. That would avoid us needing to create a white list for new
hardware. However, I expect it would be years before we can rely on it
so someone may yet implement that white list.

Logan

2018-03-26 16:43:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe <[email protected]> wrote:
>
> > On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > > Regarding the switch business, It is amazing how much trouble you went into
> > > limit this functionality into very specific hardware.
> > >
> > > I thought that we reached to an agreement that code would not impose
> > > any limits on what user wants.
> > >
> > > What happened to all the emails we exchanged?
> >
> > It turns out that root ports that support P2P are far less common than
> > anyone thought. So it will likely have to be a white list.
>
> This came as a bit of a surprise to our PCIe architect.

I don't think it is a hardware problem.

I know Mellanox and Nvidia have been doing p2p on Intel root complexes
for something like 5-6 years now.. I don't have the details, but it
does seem to work.

I have heard some chips give poor performance..

Also AMD GPU SLI uses P2P these days, so this isn't exactly a niche
feature in Intel/AMD land.

I think the main issue here is that there is some BIOS involvement to
set things up properly. Eg in GPU land motherboards certify for
'crossfire' support.

> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered. This might
> long term avoid the need to maintain the white list for new devices.

If it is primarily a BIOS issue then it should be an ACPI thing, right?

Jason

2018-03-26 17:33:07

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
>> On Tue, 13 Mar 2018 10:43:55 -0600
>> Logan Gunthorpe <[email protected]> wrote:
>>
>>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
>>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>>>> Regarding the switch business, It is amazing how much trouble you went into
>>>> limit this functionality into very specific hardware.
>>>>
>>>> I thought that we reached to an agreement that code would not impose
>>>> any limits on what user wants.
>>>>
>>>> What happened to all the emails we exchanged?
>>>
>>> It turns out that root ports that support P2P are far less common than
>>> anyone thought. So it will likely have to be a white list.
>>
>> This came as a bit of a surprise to our PCIe architect.
>
> I don't think it is a hardware problem.

The latest and greatest Power9 CPUs still explicitly do not support
this. And, if I recall correctly, the ARM64 device we played with did
not either -- but I suspect that will differ depending on vendor.

The latest Intel devices of course do support it, but go back 3-4 years
or so and the performance is pretty much unusable (at least for the
purposes of what this patchset implements). Even older CPUs did not
support it.

We haven't done any testing on AMD devices but I assume they are similar
to Intel.

In any case, even if every CPU manufactured today supported it well,
there are still older devices out there without support that we need to
ensure are handled properly.

Logan

2018-03-26 19:36:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Mon, Mar 26, 2018 at 11:30:38AM -0600, Logan Gunthorpe wrote:
>
>
> On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <[email protected]> wrote:
> >>
> >>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> >>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> >>>> Regarding the switch business, It is amazing how much trouble you went into
> >>>> limit this functionality into very specific hardware.
> >>>>
> >>>> I thought that we reached to an agreement that code would not impose
> >>>> any limits on what user wants.
> >>>>
> >>>> What happened to all the emails we exchanged?
> >>>
> >>> It turns out that root ports that support P2P are far less common than
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> >
> > I don't think it is a hardware problem.
>
> The latest and greatest Power9 CPUs still explicitly do not support
> this.

I think this is another case of the HW can do it but the SW support is
missing. IOMMU configuration and maybe firmware too, for instance.

If I recall I saw a presentation that Coral was expected to use P2P
between the network and GPU.

> And, if I recall correctly, the ARM64 device we played with did
> not either -- but I suspect that will differ depending on vendor.

Wouldn't surprise me at all to see broken implementations in
ARM64.. But even there it needs IOMMU enablement to work at all if I
recall.

Bascially, this is probably not a HW problem that needs a HW bit, but
a OS/firmware problem to do all the enablement..

Jason

2018-03-26 20:43:44

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 26/03/18 01:35 PM, Jason Gunthorpe wrote:
> I think this is another case of the HW can do it but the SW support is
> missing. IOMMU configuration and maybe firmware too, for instance.

Nope, not sure how you can make this leap. We've been specifically told
that peer-to-peer PCIe DMA is not supported on P9. We've also been told
there is a very limited (ie. store only), low performance mode designed
for poking mailboxes but it is disabled by default and must be enable
with some kind of special firmware calls.

Logan

2018-03-27 08:49:30

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

On Mon, 26 Mar 2018 09:46:24 -0600
Logan Gunthorpe <[email protected]> wrote:

> On 26/03/18 08:01 AM, Bjorn Helgaas wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <[email protected]> wrote:
> >>> It turns out that root ports that support P2P are far less common than
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> >>
> >> His follow up was whether it was worth raising an ECR for the PCIe spec
> >> to add a capability bit to allow this to be discovered. This might
> >> long term avoid the need to maintain the white list for new devices.
> >>
> >> So is it worth having a long term solution for making this discoverable?
> >
> > It was surprising to me that there's no architected way to discover
> > this. It seems like such an obvious thing that I guess I assumed the
> > omission was intentional, i.e., maybe there's something that makes it
> > impractical, but it would be worth at least asking somebody in the
> > SIG. It seems like for root ports in the same root complex, at least,
> > there could be a bit somewhere in the root port or the RCRB (which
> > Linux doesn't support yet).
>
> Yes, I agree. It would be a good long term solution to have this bit in
> the spec. That would avoid us needing to create a white list for new
> hardware. However, I expect it would be years before we can rely on it
> so someone may yet implement that white list.
>
> Logan

I'll see if I can get our PCI SIG people to follow this through and see if
it is just an omission or as Bjorn suggested, there is some reason we
aren't thinking of that makes it hard.

Agreed that it is a somewhat tangential question to what to do with current
hardware, but let's get the ball rolling if possible.

Jonathan

2018-03-27 15:39:57

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory



On 27/03/18 02:47 AM, Jonathan Cameron wrote:
> I'll see if I can get our PCI SIG people to follow this through and see if
> it is just an omission or as Bjorn suggested, there is some reason we
> aren't thinking of that makes it hard.

That would be great! Thanks!

Logan


2018-04-13 21:57:32

by Stephen Bates

[permalink] [raw]
Subject: Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory


> I'll see if I can get our PCI SIG people to follow this through

Hi Jonathan

Can you let me know if this moves forward within PCI-SIG? I would like to track it. I can see this being doable between Root Ports that reside in the same Root Complex but might become more challenging to standardize for RPs that reside in different RCs in the same (potentially multi-socket) system. I know in the past we have seem MemWr TLPS cross the QPI bus in Intel systems but I am sure that is not something that would work in all systems and must fall outside the remit of PCI-SIG ;-).

I agree such a capability bit would be very useful but it's going to be quite some time before we can rely on hardware being available that supports it.

Stephen