Hi Everyone,
Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:
https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
Thanks,
Logan
Changes in v4:
* Change the original upstream_bridges_match() function to
upstream_bridge_distance() which calculates the distance between two
devices as long as they are behind the same root port. This should
address Bjorn's concerns that the code was too focused on
being behind a single switch.
* The disable ACS function now disables ACS for all bridge ports instead
of only switch ports (i.e. those that had two upstream_bridge ports).
* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
API to be more like sgl_alloc() in that the alloc function returns
the allocated scatterlist and nents is not required by the free
function.
* Moved the new documentation into the driver-api tree as requested
by Jonathan
* Add SGL alloc and free helpers in the nvmet code so that the
individual drivers can share the code that allocates P2P memory.
As requested by Christoph.
* Cleanup the nvmet_p2pmem_store() function as Christoph
thought my first attempt was ugly.
* Numerous commit message and comment fix-ups
Changes in v3:
* Many more fixes and minor cleanups that were spotted by Bjorn
* Additional explanation of the ACS change in both the commit message
and Kconfig doc. Also, the code that disables the ACS bits is surrounded
explicitly by an #ifdef
* Removed the flag we added to rdma_rw_ctx() in favour of using
is_pci_p2pdma_page(), as suggested by Sagi.
* Adjust pci_p2pmem_find() so that it prefers P2P providers that
are closest to (or the same as) the clients using them. In cases
of ties, the provider is randomly chosen.
* Modify the NVMe Target code so that the PCI device name of the provider
may be explicitly specified, bypassing the logic in pci_p2pmem_find().
(Note: it's still enforced that the provider must be behind the
same switch as the clients).
* As requested by Bjorn, added documentation for driver writers.
Changes in v2:
* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
as a bunch of cleanup and spelling fixes he pointed out in the last
series.
* To address Alex's ACS concerns, we change to a simpler method of
just disabling ACS behind switches for any kernel that has
CONFIG_PCI_P2PDMA.
* We also reject using devices that employ 'dma_virt_ops', which should
fairly simply address Jason's concerns that this work might break the
HFI, QIB and rxe drivers that use the virtual ops to implement
their own special DMA operations.
--
This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.
The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU and using system memory, as
well as lower PCI bandwidth required to the CPU (such that systems
could be designed with fewer lanes connected to the CPU).
Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This will mean many setups that
could likely work well will not be supported so that we can be more
confident it will work and not place any responsibility on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF). Future work may enable these transfers
using a white list of known good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.
In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
When the PCI P2PDMA config option is selected the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bits must be
disabled at boot so the IOMMU subsystem can correctly create the
groups; there is currently no way to dynamically disable the bits and
alter the groups, though this could be addressed in the future.
Another set of functions allow a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices.
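As a rough sketch (error handling elided; rnic_pdev and nvme_pdev are
placeholder struct pci_dev pointers), an orchestrator would do
something like:

    LIST_HEAD(clients);
    struct pci_dev *p2p_dev;

    pci_p2pdma_add_client(&clients, &rnic_pdev->dev);
    pci_p2pdma_add_client(&clients, &nvme_pdev->dev);

    p2p_dev = pci_p2pmem_find(&clients);
    /* ... allocate and use the provider's memory ... */
    pci_dev_put(p2p_dev);
    pci_p2pdma_client_list_free(&clients);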
In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received it uses the
pci_p2pdma_map_sg() function which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
that don't use the proper DMA infrastructure this code rejects using
any device that employs the dma_virt_ops implementation.
Finally, in the NVMe fabrics target port we introduce a new
configuration attribute: 'p2pmem'. When enabled, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
These patches have been tested on a number of Intel based systems and
for a variety of RDMA NICs (Mellanox, Broadcom, Chelsio) and NVMe
SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
Microsemi, Chelsio and Everspin) using switches from both Microsemi
and Broadcom.
Logan Gunthorpe (14):
PCI/P2PDMA: Support peer-to-peer memory
PCI/P2PDMA: Add sysfs group to display p2pmem stats
PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
docs-rst: Add a new directory for PCI documentation
PCI/P2PDMA: Add P2P DMA driver writer's documentation
block: Introduce PCI P2P flags for request and request queue
IB/core: Ensure we map P2P memory correctly in
rdma_rw_ctx_[init|destroy]()
nvme-pci: Use PCI p2pmem subsystem to manage the CMB
nvme-pci: Add support for P2P memory in requests
nvme-pci: Add a quirk for a pseudo CMB
nvmet: Introduce helper functions to allocate and free request SGLs
nvmet-rdma: Use new SGL alloc/free helper for requests
nvmet: Optionally use PCI P2P memory
Documentation/ABI/testing/sysfs-bus-pci | 25 +
Documentation/PCI/index.rst | 14 +
Documentation/driver-api/index.rst | 2 +-
Documentation/driver-api/pci/index.rst | 20 +
Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
Documentation/driver-api/{ => pci}/pci.rst | 0
Documentation/index.rst | 3 +-
block/blk-core.c | 3 +
drivers/infiniband/core/rw.c | 13 +-
drivers/nvme/host/core.c | 4 +
drivers/nvme/host/nvme.h | 8 +
drivers/nvme/host/pci.c | 118 +++--
drivers/nvme/target/configfs.c | 67 +++
drivers/nvme/target/core.c | 143 ++++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 15 +
drivers/nvme/target/rdma.c | 22 +-
drivers/pci/Kconfig | 26 +
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
drivers/pci/pci.c | 6 +
include/linux/blk_types.h | 18 +-
include/linux/blkdev.h | 3 +
include/linux/memremap.h | 19 +
include/linux/pci-p2pdma.h | 118 +++++
include/linux/pci.h | 4 +
26 files changed, 1579 insertions(+), 56 deletions(-)
create mode 100644 Documentation/PCI/index.rst
create mode 100644 Documentation/driver-api/pci/index.rst
create mode 100644 Documentation/driver-api/pci/p2pdma.rst
rename Documentation/driver-api/{ => pci}/pci.rst (100%)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
--
2.11.0
Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO submission queues.
If the CMB supports WDS and RDS, publish it for use as P2P memory
by other devices.
We can now drop the __iomem safety on the buffer seeing that, by
convention, devm_memremap_pages() allocates regular memory without
side effects that's accessible without the iomem accessors.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 34 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..514da4de3c85 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/sed-opal.h>
+#include <linux/pci-p2pdma.h>
#include "nvme.h"
@@ -92,9 +93,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
- void __iomem *cmb;
- pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
+ bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@@ -149,7 +149,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t q_lock;
struct nvme_command *sq_cmds;
- struct nvme_command __iomem *sq_cmds_io;
+ bool sq_cmds_is_io;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
dma_addr_t sq_dma_addr;
@@ -431,10 +431,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
{
u16 tail = nvmeq->sq_tail;
- if (nvmeq->sq_cmds_io)
- memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
- else
- memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+ memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
if (++tail == nvmeq->q_depth)
tail = 0;
@@ -1289,9 +1286,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
- if (nvmeq->sq_cmds)
- dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
- nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+ if (nvmeq->sq_cmds) {
+ if (nvmeq->sq_cmds_is_io)
+ pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+ nvmeq->sq_cmds,
+ SQ_SIZE(nvmeq->q_depth));
+ else
+ dma_free_coherent(nvmeq->q_dmadev,
+ SQ_SIZE(nvmeq->q_depth),
+ nvmeq->sq_cmds,
+ nvmeq->sq_dma_addr);
+ }
}
static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1371,12 +1377,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
{
- /* CMB SQEs will be mapped before creation */
- if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
- return 0;
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+ if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+ nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+ nvmeq->sq_cmds);
+ nvmeq->sq_cmds_is_io = true;
+ }
+
+ if (!nvmeq->sq_cmds) {
+ nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+ &nvmeq->sq_dma_addr, GFP_KERNEL);
+ nvmeq->sq_cmds_is_io = false;
+ }
- nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
- &nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
return 0;
@@ -1451,13 +1466,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
struct nvme_dev *dev = nvmeq->dev;
int result;
- if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
- unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
- dev->ctrl.page_size);
- nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
- nvmeq->sq_cmds_io = dev->cmb + offset;
- }
-
/*
* A queue's vector matches the queue identifier unless the controller
* has only one vector available.
@@ -1691,9 +1699,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
return;
dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
- if (!use_cmb_sqes)
- return;
-
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1710,11 +1715,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
if (size > bar_size - offset)
size = bar_size - offset;
- dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
- if (!dev->cmb)
+ if (pci_p2pdma_add_resource(pdev, bar, size, offset))
return;
- dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
+
dev->cmb_size = size;
+ dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
+
+ if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
+ (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
+ pci_p2pmem_publish(pdev, true);
if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL))
@@ -1724,12 +1733,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
static inline void nvme_release_cmb(struct nvme_dev *dev)
{
- if (dev->cmb) {
- iounmap(dev->cmb);
- dev->cmb = NULL;
+ if (dev->cmb_size) {
sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
&dev_attr_cmb.attr, NULL);
- dev->cmbsz = 0;
+ dev->cmb_size = 0;
}
}
@@ -1928,13 +1935,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
- if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+ if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
dev->q_depth = result;
else
- nvme_release_cmb(dev);
+ dev->cmb_use_sqes = false;
}
do {
--
2.11.0
Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for using CMBLOC and
CMBSZ.
We'd like to use some of these older cards to test P2P memory.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
---
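For reference, the pseudo CMBSZ value constructed below follows the
NVMe CMBSZ register layout: the size unit is 4KB << (4 * SZU), so a
16MB unit gives SZU = (ilog2(SZ_16M) - 12) / 4 = (24 - 12) / 4 = 3,
and SZ is the BAR length in those units (e.g. a 64MB BAR gives
SZ = 4, for a total of 4 * 16MB = 64MB).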
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 24 ++++++++++++++++++++----
2 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a689c13998f..885e9ec9b889 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -84,6 +84,13 @@ enum nvme_quirks {
* Supports the LighNVM command set if indicated in vs[1].
*/
NVME_QUIRK_LIGHTNVM = (1 << 6),
+
+ /*
+ * Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+ * NVRAM that have CMB-like memory on a BAR but does not set
+ * CMBLOC or CMBSZ.
+ */
+ NVME_QUIRK_PSEUDO_CMB_BAR4 = (1 << 7),
};
/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 09b6aba6ed28..e526e969680a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1685,6 +1685,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
}
static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+ return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+ (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+ ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
{
u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1704,10 +1711,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;
- dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
- if (!dev->cmbsz)
- return;
- dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+ dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+ dev->cmbloc = 4;
+ } else {
+ dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+ if (!dev->cmbsz)
+ return;
+ dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+ }
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2736,6 +2748,10 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807), /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+ { PCI_DEVICE(0x11f8, 0xf117), /* Microsemi NVRAM adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+ { PCI_DEVICE(0x1db1, 0x0002), /* Everspin nvNitro adaptor */
+ .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
--
2.11.0
Add a new directory in the driver API guide for PCI specific
documentation.
This is in preparation for adding a new PCI P2P DMA driver writers
guide which will go in this directory.
Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Vinod Koul <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Thierry Reding <[email protected]>
Cc: Sanyog Kale <[email protected]>
Cc: Sagar Dharia <[email protected]>
---
Documentation/driver-api/index.rst | 2 +-
Documentation/driver-api/pci/index.rst | 19 +++++++++++++++++++
Documentation/driver-api/{ => pci}/pci.rst | 0
3 files changed, 20 insertions(+), 1 deletion(-)
create mode 100644 Documentation/driver-api/pci/index.rst
rename Documentation/driver-api/{ => pci}/pci.rst (100%)
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 6d8352c0f354..9e4cd4e91a49 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -27,7 +27,7 @@ available subsections can be seen below.
iio/index
input
usb/index
- pci
+ pci/index
spi
i2c
hsi
diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
new file mode 100644
index 000000000000..03b57cbf8cc2
--- /dev/null
+++ b/Documentation/driver-api/pci/index.rst
@@ -0,0 +1,19 @@
+============================================
+The Linux PCI driver implementer's API guide
+============================================
+
+.. class:: toc-title
+
+ Table of contents
+
+.. toctree::
+ :maxdepth: 2
+
+ pci
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/driver-api/pci.rst b/Documentation/driver-api/pci/pci.rst
similarity index 100%
rename from Documentation/driver-api/pci.rst
rename to Documentation/driver-api/pci/pci.rst
--
2.11.0
In order to use PCI P2P memory, the pci_p2pdma_[un]map_sg() functions
must be called to map to the correct PCI bus addresses.
To do this, check the first page in the scatterlist to see if it is P2P
memory or not. At the moment, scatterlists that contain P2P memory must
be homogeneous, so if the first page is P2P the entire SGL will be P2P.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/infiniband/core/rw.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..f495e8a7f8ac 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
*/
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
#include <rdma/mr_pool.h>
#include <rdma/rw.h>
@@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;
- ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+ if (is_pci_p2pdma_page(sg_page(sg)))
+ ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+ else
+ ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@@ -602,7 +607,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}
- ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+ if (is_pci_p2pdma_page(sg_page(sg)))
+ pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg,
+ sg_cnt, dir);
+ else
+ ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);
--
2.11.0
Add a restructured text file describing how to write drivers
with support for P2P DMA transactions. The document describes
how to use the APIs that were added in the previous few
commits.
Also adds an index for the PCI documentation tree even though this
is the only PCI document that has been converted to restructured text
at this time.
Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
---
Documentation/PCI/index.rst | 14 +++
Documentation/driver-api/pci/index.rst | 1 +
Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
Documentation/index.rst | 3 +-
4 files changed, 183 insertions(+), 1 deletion(-)
create mode 100644 Documentation/PCI/index.rst
create mode 100644 Documentation/driver-api/pci/p2pdma.rst
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
new file mode 100644
index 000000000000..2fdc4b3c291d
--- /dev/null
+++ b/Documentation/PCI/index.rst
@@ -0,0 +1,14 @@
+==================================
+Linux PCI Driver Developer's Guide
+==================================
+
+.. toctree::
+
+ p2pdma
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
index 03b57cbf8cc2..d12eeafbfc90 100644
--- a/Documentation/driver-api/pci/index.rst
+++ b/Documentation/driver-api/pci/index.rst
@@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
:maxdepth: 2
pci
+ p2pdma
.. only:: subproject and html
diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
new file mode 100644
index 000000000000..49a512c405b2
--- /dev/null
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -0,0 +1,166 @@
+============================
+PCI Peer-to-Peer DMA Support
+============================
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two endpoints on the bus. This type of transaction is
+henceforth called Peer-to-Peer (or P2P). However, there are a number of
+issues that make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI Root Complexes are not required
+to support forwarding packets between Root Ports. To make things worse,
+there is no simple way to determine if a given Root Complex supports
+this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
+the kernel only supports doing P2P when the endpoints involved are all
+behind the same PCIe root port, as the spec guarantees that all
+packets will always be routable but does not require routing between
+root ports.
+
+The second issue is that to make use of existing interfaces in Linux,
+memory that is used for P2P transactions needs to be backed by struct
+pages. However, PCI BARs are not typically cache coherent, so there
+are a few corner-case gotchas with these pages and developers need to
+be careful about what they do with them.
+
+
+Driver Writer's Guide
+=====================
+
+In a given P2P implementation there may be three or more different
+types of kernel drivers in play:
+
+* Providers - A driver which provides or publishes P2P resources like
+ memory or doorbell registers to other drivers.
+* Clients - A driver which makes use of a resource by setting up a
+ DMA transaction to or from it.
+* Orchestrators - A driver which orchestrates the flow of data between
+ clients and providers.
+
+In many cases there could be overlap between these three types (i.e.
+it may be typical for a driver to be both a provider and a client).
+
+For example, in the NVMe Target Copy Offload implementation:
+
+* The NVMe PCI driver is a client, provider and orchestrator
+ in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
+ resource (provider), it accepts P2P memory pages as buffers in requests
+ to be used directly (client) and it can also make use of the CMB as
+ submission queue entries.
+* The RDMA driver is a client in this arrangement so that an RNIC
+ can DMA directly to the memory exposed by the NVMe device.
+* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
+ to the P2P memory (CMB) and then to the NVMe device (and vice versa).
+
+This is currently the only arrangement supported by the kernel but
+one could imagine slight tweaks to this that would allow for the same
+functionality. For example, if a specific RNIC added a BAR with some
+memory behind it, its driver could add support as a P2P provider and
+then the NVMe Target could use the RNIC's memory instead of the CMB
+in cases where the NVMe cards in use do not have CMB support.
+
+
+Provider Drivers
+----------------
+
+A provider simply needs to register a BAR (or a portion of a BAR)
+as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
+This will register struct pages for all the specified memory.
+
+After that it may optionally publish all of its resources as
+P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
+any orchestrator drivers to find and use the memory. When marked in
+this way, the resource must be regular memory with no side effects.
+
+For the time being this is fairly rudimentary in that all resources
+are typically going to be P2P memory. Future work will likely expand
+this to include other types of resources like doorbells.
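+
+A minimal sketch of a provider's setup (error handling and the real
+driver structure omitted; the use of BAR 4 and a zero offset are
+illustrative assumptions)::
+
+	int example_add_p2pmem(struct pci_dev *pdev)
+	{
+		int rc;
+
+		/* register all of BAR 4 as P2P DMA memory */
+		rc = pci_p2pdma_add_resource(pdev, 4,
+					     pci_resource_len(pdev, 4), 0);
+		if (rc)
+			return rc;
+
+		/* let orchestrators find and use this memory */
+		pci_p2pmem_publish(pdev, true);
+		return 0;
+	}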
+
+
+Client Drivers
+--------------
+
+A client driver typically only has to conditionally change its DMA map
+routine to use the mapping functions :c:func:`pci_p2pdma_map_sg()` and
+:c:func:`pci_p2pdma_unmap_sg()` instead of the usual :c:func:`dma_map_sg()`
+functions.
+
+The client may also, optionally, make use of
+:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
+functions and when to use the regular mapping functions. In some
+situations, it may be more appropriate to use a flag to indicate a
+given request is P2P memory and map appropriately (for example the
+block layer uses a flag to keep P2P memory out of queues that do not
+have P2P client support). It is important to ensure that struct pages that
+back P2P memory stay out of code that does not have support for them.
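+
+For example, the mapping step might look like this (a sketch only;
+``dev``, ``sg``, ``nents`` and ``dir`` come from the surrounding
+driver)::
+
+	if (is_pci_p2pdma_page(sg_page(sg)))
+		ret = pci_p2pdma_map_sg(dev, sg, nents, dir);
+	else
+		ret = dma_map_sg(dev, sg, nents, dir);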
+
+
+Orchestrator Drivers
+--------------------
+
+The first task an orchestrator driver must do is compile a list of
+all client drivers that will be involved in a given transaction. For
+example, the NVMe Target driver creates a list including all NVMe drives
+and the RNIC in use. The list is stored as an anonymous struct
+list_head which must be initialized with the usual INIT_LIST_HEAD.
+Clients may then be added to or removed from the list, and the whole
+list freed, using :c:func:`pci_p2pdma_add_client()`,
+:c:func:`pci_p2pdma_remove_client()` and
+:c:func:`pci_p2pdma_client_list_free()`.
+
+With the client list in hand, the orchestrator may then call
+:c:func:`pci_p2pmem_find()` to obtain a published P2P memory provider
+that is compatible with (i.e. behind the same root port as) all the
+clients. If more than one provider is supported, the one nearest to
+all the clients will be chosen first. If more than one provider is an
+equal distance away, the one returned will be chosen at random. This
+function returns the PCI device to use for the provider with a
+reference taken, so when it's no longer needed it should be released
+with pci_dev_put().
+
+Alternatively, if the orchestrator knows (via some other means)
+which provider it wants to use it may use :c:func:`pci_has_p2pmem()`
+to determine if it has P2P memory and :c:func:`pci_p2pdma_distance()`
+to determine the cumulative distance between it and a potential
+list of clients.
+
+With a supported provider in hand, the driver can then call
+:c:func:`pci_p2pdma_assign_provider()` to assign the provider
+to the client list. This function returns false if any of the
+clients are unsupported by the provider.
+
+Once a provider is assigned to a client list via either
+:c:func:`pci_p2pmem_find()` or :c:func:`pci_p2pdma_assign_provider()`,
+the list is permanently bound to the provider such that any new clients
+added to the list must be supported by the already selected provider.
+If they are not supported, :c:func:`pci_p2pdma_add_client()` will return
+an error. In this way, orchestrators are free to add and remove devices
+without having to recheck support or tear down existing transfers to
+change P2P providers.
+
+Once a provider is selected, the orchestrator can then use
+:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
+allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
+and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
+allocating scatter-gather lists with P2P memory.
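+
+Putting the allocation step together (a sketch only; error handling is
+elided, ``length`` is a placeholder and ``p2p_dev`` was obtained as
+described above)::
+
+	struct scatterlist *sgl;
+	unsigned int nents;
+
+	sgl = pci_p2pmem_alloc_sgl(p2p_dev, &nents, length);
+	if (sgl) {
+		/* ... set up and run the transfers ... */
+		pci_p2pmem_free_sgl(p2p_dev, sgl);
+	}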
+
+Struct Page Caveats
+-------------------
+
+Driver writers should be very careful about not passing these special
+struct pages to code that isn't prepared for it. At this time, the kernel
+interfaces do not have any checks for ensuring this. This obviously
+precludes passing these pages to userspace.
+
+P2P memory is also technically IO memory but should never have any side
+effects behind it. Thus, the order of loads and stores should not be important
+and ioreadX(), iowriteX() and friends should not be necessary.
+However, as the memory is not cache coherent, if access ever needs to
+be protected by a spinlock then :c:func:`mmiowb()` must be used before
+unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
+Documentation/memory-barriers.txt)
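+
+For example (a sketch; ``p2p_lock``, ``p2p_buf``, ``data`` and ``len``
+are placeholders)::
+
+	spin_lock(&p2p_lock);
+	memcpy(p2p_buf, data, len);
+	mmiowb();
+	spin_unlock(&p2p_lock);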
+
+
+P2P DMA Support Library
+=======================
+
+.. kernel-doc:: drivers/pci/p2pdma.c
+ :export:
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 3b99ab931d41..e7938b507df3 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
.. toctree::
:maxdepth: 2
- userspace-api/index
+ userspace-api/index
Introduction to kernel development
@@ -89,6 +89,7 @@ needed).
sound/index
crypto/index
filesystems/index
+ PCI/index
Architecture-specific documentation
-----------------------------------
--
2.11.0
For peer-to-peer transactions to work, the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups so this is done at enumeration time before the groups are
assigned.
This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch hierarchy will be in the same IOMMU
group, which implies that individual devices behind any switch
hierarchy will not be able to be assigned to separate VMs because
there is no isolation between them. Additionally, any malicious PCIe
device will be able to DMA to memory exposed by other EPs in the same
domain, as TLPs will not be checked by the IOMMU.
Given that the intended use case of P2P memory is for users with
custom hardware designed for this purpose, we do not expect
distributors to ever need to enable this option. Users that want to
use P2P must have compiled a custom kernel with this configuration
option and understand the implications regarding ACS. They will either
not require ACS or will have designed the system in such a way that
devices requiring isolation are separate from those using P2P
transactions.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 9 +++++++++
drivers/pci/p2pdma.c | 45 ++++++++++++++++++++++++++++++---------------
drivers/pci/pci.c | 6 ++++++
include/linux/pci-p2pdma.h | 5 +++++
4 files changed, 50 insertions(+), 15 deletions(-)
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index b2396c22b53e..b6db41d4b708 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -139,6 +139,15 @@ config PCI_P2PDMA
transactions must be between devices behind the same root port.
(Typically behind a network of PCIe switches).
+ Enabling this option will also disable ACS on all ports behind
+ any PCIe switch. This effectively puts all devices behind any
+ switch hierarchy into the same IOMMU group, which implies that
+ individual devices behind any switch will not be able to be
+ assigned to separate VMs because there is no isolation between
+ them. Additionally, any malicious PCIe devices will be able to
+ DMA to memory exposed by other EPs in the same domain as TLPs
+ will not be checked by the IOMMU.
+
If unsure, say N.
config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ed9dce8552a2..e9f43b43acac 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
}
/*
- * If a device is behind a switch, we try to find the upstream bridge
- * port of the switch. This requires two calls to pci_upstream_bridge():
- * one for the upstream port on the switch, one on the upstream port
- * for the next level in the hierarchy. Because of this, devices connected
- * to the root port will be rejected.
+ * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any PCI bridge in order for the TLPs to not be forwarded
+ * up to the RC, which is not what we want for P2P.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any bridge being in the same IOMMU
+ * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
*/
-static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
{
- struct pci_dev *up1, *up2;
+ int pos;
+ u16 ctrl;
- if (!pdev)
- return NULL;
+ if (!pci_is_bridge(pdev))
+ return 0;
- up1 = pci_dev_get(pci_upstream_bridge(pdev));
- if (!up1)
- return NULL;
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+ if (!pos)
+ return 0;
+
+ pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
+
+ pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+ ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
- up2 = pci_dev_get(pci_upstream_bridge(up1));
- pci_dev_put(up1);
+ pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
- return up2;
+ return 1;
}
/*
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e597655a5643..7e2f5724ba22 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@
#include <linux/of.h>
#include <linux/of_pci.h>
#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
#include <linux/pm.h>
#include <linux/slab.h>
#include <linux/module.h>
@@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
*/
void pci_enable_acs(struct pci_dev *dev)
{
+#ifdef CONFIG_PCI_P2PDMA
+ if (pci_p2pdma_disable_acs(dev))
+ return;
+#endif
+
if (!pci_acs_enable)
return;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 0cde88341eeb..fcb3437a2f3c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -18,6 +18,7 @@ struct block_device;
struct scatterlist;
#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_disable_acs(struct pci_dev *pdev);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
@@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir);
#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+ return 0;
+}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
--
2.11.0
QUEUE_FLAG_PCI_P2PDMA is introduced, meaning a driver's request queue
supports targeting P2P memory.
REQ_PCI_P2PDMA is introduced to indicate a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queue has the QUEUE_FLAG_PCI_P2PDMA
flag set.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
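As an illustrative sketch (not part of this patch; q and bio come
from the surrounding driver), the intended usage is:

    /* a driver whose queue can DMA to/from P2P memory: */
    blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, q);

    /* a submitter whose data pages are P2P memory: */
    bio->bi_opf |= REQ_PCI_P2PDMA;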
block/blk-core.c | 3 +++
include/linux/blk_types.h | 18 +++++++++++++++++-
include/linux/blkdev.h | 3 +++
3 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 806ce2442819..35680cbebaf4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2270,6 +2270,9 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
+ if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q))
+ goto not_supported;
+
if (should_fail_bio(bio))
goto end_io;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..41194d54c45a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -279,6 +279,10 @@ enum req_flag_bits {
__REQ_BACKGROUND, /* background IO */
__REQ_NOWAIT, /* Don't wait if request will block */
+#ifdef CONFIG_PCI_P2PDMA
+ __REQ_PCI_P2PDMA, /* request is to/from P2P memory */
+#endif
+
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
@@ -303,6 +307,18 @@ enum req_flag_bits {
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
+#ifdef CONFIG_PCI_P2PDMA
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2PDMA (1ULL << __REQ_PCI_P2PDMA)
+#define REQ_IS_PCI_P2PDMA(req) ((req)->cmd_flags & REQ_PCI_P2PDMA)
+#else
+#define REQ_PCI_P2PDMA 0
+#define REQ_IS_PCI_P2PDMA(req) 0
+#endif /* CONFIG_PCI_P2PDMA */
+
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
#define REQ_DRV (1ULL << __REQ_DRV)
@@ -311,7 +327,7 @@ enum req_flag_bits {
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
#define REQ_NOMERGE_FLAGS \
- (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+ (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA)
#define bio_op(bio) \
((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..116367babb39 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,6 +698,7 @@ struct request_queue {
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA 30 /* device supports pci p2p requests */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@@ -730,6 +731,8 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
#define blk_queue_scsi_passthrough(q) \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q) \
+ test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
--
2.11.0
Add helpers to allocate and free the SGL in a struct nvmet_req:
int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
void nvmet_req_free_sgl(struct nvmet_req *req)
This will be expanded in a future patch to implement peer-to-peer
memory DMAs and should be common to all target drivers. The presently
unused 'sq' argument in the alloc function will be necessary to
decide whether to use peer-to-peer memory and obtain the correct
provider to allocate the memory.
Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
---
drivers/nvme/target/core.c | 18 ++++++++++++++++++
drivers/nvme/target/nvmet.h | 2 ++
2 files changed, 20 insertions(+)
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index e95424f172fd..75d44bc3e8d3 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -575,6 +575,24 @@ void nvmet_req_execute(struct nvmet_req *req)
}
EXPORT_SYMBOL_GPL(nvmet_req_execute);
+int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
+{
+ req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
+ if (!req->sg)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
+
+void nvmet_req_free_sgl(struct nvmet_req *req)
+{
+ sgl_free(req->sg);
+ req->sg = NULL;
+ req->sg_cnt = 0;
+}
+EXPORT_SYMBOL_GPL(nvmet_req_free_sgl);
+
static inline bool nvmet_cc_en(u32 cc)
{
return (cc >> NVME_CC_EN_SHIFT) & 0x1;
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 15fd84ab21f8..10b162615a5e 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -273,6 +273,8 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
void nvmet_req_uninit(struct nvmet_req *req);
void nvmet_req_execute(struct nvmet_req *req);
void nvmet_req_complete(struct nvmet_req *req, u16 status);
+int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq);
+void nvmet_req_free_sgl(struct nvmet_req *req);
void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid,
u16 size);
--
2.11.0
Use the new helpers introduced in the previous patch to allocate
the SGLs for the request.
Seeing we use req.transfer_len as the length of the SGL, it is
set earlier and cleared on any error. It also seems unnecessary
to accumulate the length, as the map_sgl functions should only ever
be called once.
Signed-off-by: Logan Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
---
drivers/nvme/target/rdma.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 52e0c5d579a7..f7a3459d618f 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -430,7 +430,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
}
if (rsp->req.sg != &rsp->cmd->inline_sg)
- sgl_free(rsp->req.sg);
+ nvmet_req_free_sgl(&rsp->req);
if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -564,24 +564,24 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
{
struct rdma_cm_id *cm_id = rsp->queue->cm_id;
u64 addr = le64_to_cpu(sgl->addr);
- u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
int ret;
+ rsp->req.transfer_len = get_unaligned_le24(sgl->length);
+
/* no data command? */
- if (!len)
+ if (!rsp->req.transfer_len)
return 0;
- rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
- if (!rsp->req.sg)
- return NVME_SC_INTERNAL;
+ ret = nvmet_req_alloc_sgl(&rsp->req, &rsp->queue->nvme_sq);
+ if (ret < 0)
+ goto error_out;
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
nvmet_data_dir(&rsp->req));
if (ret < 0)
- return NVME_SC_INTERNAL;
- rsp->req.transfer_len += len;
+ goto error_out;
rsp->n_rdma += ret;
if (invalidate) {
@@ -590,6 +590,10 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
}
return 0;
+
+error_out:
+ rsp->req.transfer_len = 0;
+ return NVME_SC_INTERNAL;
}
static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)
--
2.11.0
For P2P requests, we must use the pci_p2pdma_[un]map_sg() functions
instead of the dma_map_sg functions.
With that, we can then indicate P2P support in the request queue.
For this, we create an NVME_F_PCI_P2PDMA flag which tells the core to
set QUEUE_FLAG_PCI_P2PDMA in the request queue.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/core.c | 4 ++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 19 +++++++++++++++----
3 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9df4f71e58ca..2ca9debbcf2b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2977,7 +2977,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
+
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
+ if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+ blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
ns->queue->queuedata = ns;
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 061fecfd44f5..9a689c13998f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -306,6 +306,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
+#define NVME_F_PCI_P2PDMA (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 514da4de3c85..09b6aba6ed28 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -798,8 +798,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out;
ret = BLK_STS_RESOURCE;
- nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
- DMA_ATTR_NO_WARN);
+
+ if (REQ_IS_PCI_P2PDMA(req))
+ nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+ else
+ nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+ dma_dir, DMA_ATTR_NO_WARN);
if (!nr_mapped)
goto out;
@@ -844,7 +849,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;
if (iod->nents) {
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+ if (REQ_IS_PCI_P2PDMA(req))
+ pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+ else
+ dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
if (blk_integrity_rq(req)) {
if (req_op(req) == REQ_OP_READ)
nvme_dif_remap(req, nvme_dif_complete);
@@ -2439,7 +2449,8 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
- .flags = NVME_F_METADATA_SUPPORTED,
+ .flags = NVME_F_METADATA_SUPPORTED |
+ NVME_F_PCI_P2PDMA,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32 = nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
--
2.11.0
Add a sysfs group to display statistics about P2P memory that is
registered in each PCI device.
Attributes in the group display the total amount of P2P memory, the
amount available and whether it is published or not.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++
drivers/pci/p2pdma.c | 54 +++++++++++++++++++++++++++++++++
2 files changed, 79 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..044812c816d0 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What: /sys/bus/pci/devices/.../p2pmem/available
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the amount of memory that has not been
+ allocated (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/size
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains the total amount of memory that the device
+ provides (in decimal).
+
+What: /sys/bus/pci/devices/.../p2pmem/published
+Date: November 2017
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ If the device has any Peer-to-Peer memory registered, this
+ file contains a '1' if the memory has been published for
+ use inside the kernel or a '0' if it is only intended
+ for use within the driver that published it.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index e524a12eca1f..4daad6374869 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -24,6 +24,54 @@ struct pci_p2pdma {
bool p2pmem_published;
};
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t size = 0;
+
+ if (pdev->p2pdma->pool)
+ size = gen_pool_size(pdev->p2pdma->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ size_t avail = 0;
+
+ if (pdev->p2pdma->pool)
+ avail = gen_pool_avail(pdev->p2pdma->pool);
+
+ return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n",
+ pdev->p2pdma->p2pmem_published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+ &dev_attr_size.attr,
+ &dev_attr_available.attr,
+ &dev_attr_published.attr,
+ NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+ .attrs = p2pmem_attrs,
+ .name = "p2pmem",
+};
+
static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
{
struct pci_p2pdma *p2p =
@@ -53,6 +101,7 @@ static void pci_p2pdma_release(void *data)
percpu_ref_exit(&pdev->p2pdma->devmap_ref);
gen_pool_destroy(pdev->p2pdma->pool);
+ sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2pdma = NULL;
}
@@ -83,9 +132,14 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
pdev->p2pdma = p2p;
+ error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+ if (error)
+ goto out_pool_destroy;
+
return 0;
out_pool_destroy:
+ pdev->p2pdma = NULL;
gen_pool_destroy(p2p->pool);
out:
devm_kfree(&pdev->dev, p2p);
--
2.11.0
We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will use p2p memory only
if a p2p memory device can be found that is behind the same switch
hierarchy as the RDMA port and all the block devices in use. If the
user enables it and no such devices are found, the system will
silently fall back on using regular memory.
If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device, falling back to system memory
anything fail.
Ideally, we'd want to use an NVMe CMB buffer as p2p memory. This would
save an extra PCI transfer, as the NVMe card could just take the data
out of its own memory. However, at this time, only a limited number
of cards with CMB buffers seem to be available.
Signed-off-by: Stephen Bates <[email protected]>
Signed-off-by: Steve Wise <[email protected]>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/target/configfs.c | 67 ++++++++++++++++++++++
drivers/nvme/target/core.c | 127 ++++++++++++++++++++++++++++++++++++++++-
drivers/nvme/target/io-cmd.c | 3 +
drivers/nvme/target/nvmet.h | 13 +++++
drivers/nvme/target/rdma.c | 2 +
5 files changed, 210 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index ad9ff27234b5..5efe0dae0ee7 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,6 +17,8 @@
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/ctype.h>
+#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
#include "nvmet.h"
@@ -864,12 +866,77 @@ static void nvmet_port_release(struct config_item *item)
kfree(port);
}
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_p2pmem_show(struct config_item *item, char *page)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+
+ if (!port->use_p2pmem)
+ return sprintf(page, "none\n");
+
+ if (!port->p2p_dev)
+ return sprintf(page, "auto\n");
+
+ return sprintf(page, "%s\n", pci_name(port->p2p_dev));
+}
+
+static ssize_t nvmet_p2pmem_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+ struct device *dev;
+ struct pci_dev *p2p_dev = NULL;
+ bool use_p2pmem;
+
+ dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
+ if (dev) {
+ use_p2pmem = true;
+ p2p_dev = to_pci_dev(dev);
+
+ if (!pci_has_p2pmem(p2p_dev)) {
+ pr_err("PCI device has no peer-to-peer memory: %s\n",
+ page);
+ pci_dev_put(p2p_dev);
+ return -ENODEV;
+ }
+ } else if (sysfs_streq(page, "auto")) {
+ use_p2pmem = 1;
+ } else if ((page[0] == '0' || page[0] == '1') && !iscntrl(page[1])) {
+ /*
+ * If the user enters a PCI device that doesn't exist
+ * like "0000:01:00.1", we don't want strtobool to think
+ * it's a '0' when it's clearly not what the user wanted.
+ * So we require 0's and 1's to be exactly one character.
+ */
+ goto no_such_pci_device;
+ } else if (strtobool(page, &use_p2pmem)) {
+ goto no_such_pci_device;
+ }
+
+ down_write(&nvmet_config_sem);
+ port->use_p2pmem = use_p2pmem;
+ pci_dev_put(port->p2p_dev);
+ port->p2p_dev = p2p_dev;
+ up_write(&nvmet_config_sem);
+
+ return count;
+
+no_such_pci_device:
+ pr_err("No such PCI device: %s\n", page);
+ return -ENODEV;
+}
+CONFIGFS_ATTR(nvmet_, p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
static struct configfs_attribute *nvmet_port_attrs[] = {
&nvmet_attr_addr_adrfam,
&nvmet_attr_addr_treq,
&nvmet_attr_addr_traddr,
&nvmet_attr_addr_trsvcid,
&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+ &nvmet_attr_p2pmem,
+#endif
NULL,
};
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 75d44bc3e8d3..b2b62cd36f6c 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/random.h>
#include <linux/rculist.h>
+#include <linux/pci-p2pdma.h>
#include "nvmet.h"
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
percpu_ref_put(&ns->ref);
}
+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+ struct nvmet_ns *ns)
+{
+ int ret;
+
+ if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+ pr_err("peer-to-peer DMA is not supported by %s\n",
+ ns->device_path);
+ return -EINVAL;
+ }
+
+ ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+ if (ret)
+ pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+ ns->device_path, ret);
+
+ return ret;
+}
+
int nvmet_ns_enable(struct nvmet_ns *ns)
{
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->p2p_dev) {
+ ret = nvmet_p2pdma_add_client(ctrl, ns);
+ if (ret)
+ goto out_remove_clients;
+ }
+ }
+
if (ns->nsid > subsys->max_nsid)
subsys->max_nsid = ns->nsid;
@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
out_unlock:
mutex_unlock(&subsys->lock);
return ret;
+out_remove_clients:
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
out_blkdev_put:
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
percpu_ref_exit(&ns->ref);
mutex_lock(&subsys->lock);
- list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+ }
if (ns->bdev)
blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -577,6 +610,21 @@ EXPORT_SYMBOL_GPL(nvmet_req_execute);
int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
{
+ struct pci_dev *p2p_dev = NULL;
+
+ if (sq->ctrl)
+ p2p_dev = sq->ctrl->p2p_dev;
+
+ req->p2p_dev = NULL;
+ if (sq->qid && p2p_dev) {
+ req->sg = pci_p2pmem_alloc_sgl(p2p_dev, &req->sg_cnt,
+ req->transfer_len);
+ if (req->sg) {
+ req->p2p_dev = p2p_dev;
+ return 0;
+ }
+ }
+
req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
if (!req->sg)
return -ENOMEM;
@@ -587,7 +635,11 @@ EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
void nvmet_req_free_sgl(struct nvmet_req *req)
{
- sgl_free(req->sg);
+ if (req->p2p_dev)
+ pci_p2pmem_free_sgl(req->p2p_dev, req->sg);
+ else
+ sgl_free(req->sg);
+
req->sg = NULL;
req->sg_cnt = 0;
}
@@ -782,6 +834,74 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
return __nvmet_host_allowed(subsys, hostnqn);
}
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * I/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+ struct nvmet_ns *ns;
+ int ret;
+
+ if (!req->port->use_p2pmem || !req->p2p_client)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+ if (ret) {
+ pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+ dev_name(req->p2p_client), ret);
+ goto free_devices;
+ }
+
+ list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+ ret = nvmet_p2pdma_add_client(ctrl, ns);
+ if (ret)
+ goto free_devices;
+ }
+
+ if (req->port->p2p_dev) {
+ if (!pci_p2pdma_assign_provider(req->port->p2p_dev,
+ &ctrl->p2p_clients)) {
+ pr_info("peer-to-peer memory on %s is not supported\n",
+ pci_name(req->port->p2p_dev));
+ goto free_devices;
+ }
+ ctrl->p2p_dev = pci_dev_get(req->port->p2p_dev);
+ } else {
+ ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+ if (!ctrl->p2p_dev) {
+ pr_info("no supported peer-to-peer memory devices found\n");
+ goto free_devices;
+ }
+ }
+
+ mutex_unlock(&ctrl->subsys->lock);
+
+ pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+ return;
+
+free_devices:
+ pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+ if (!ctrl->p2p_dev)
+ return;
+
+ mutex_lock(&ctrl->subsys->lock);
+
+ pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+ pci_dev_put(ctrl->p2p_dev);
+ ctrl->p2p_dev = NULL;
+
+ mutex_unlock(&ctrl->subsys->lock);
+}
+
u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
{
@@ -821,6 +941,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
+ INIT_LIST_HEAD(&ctrl->p2p_clients);
memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -876,6 +997,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
ctrl->kato = DIV_ROUND_UP(kato, 1000);
}
nvmet_start_keep_alive_timer(ctrl);
+ nvmet_setup_p2pmem(ctrl, req);
mutex_lock(&subsys->lock);
list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -912,6 +1034,7 @@ static void nvmet_ctrl_free(struct kref *ref)
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fatal_err_work);
+ nvmet_release_p2pmem(ctrl);
ida_simple_remove(&cntlid_ida, ctrl->cntlid);
kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index cd2344179673..39bd37f1f312 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
op = REQ_OP_READ;
}
+ if (is_pci_p2pdma_page(sg_page(req->sg)))
+ op_flags |= REQ_PCI_P2PDMA;
+
sector = le64_to_cpu(req->cmd->rw.slba);
sector <<= (req->ns->blksize_shift - 9);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 10b162615a5e..f192fefe61d9 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
return container_of(to_config_group(item), struct nvmet_ns, group);
}
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+ return disk_to_dev(ns->bdev->bd_disk);
+}
+
struct nvmet_cq {
u16 qid;
u16 size;
@@ -98,6 +103,8 @@ struct nvmet_port {
struct list_head referrals;
void *priv;
bool enabled;
+ bool use_p2pmem;
+ struct pci_dev *p2p_dev;
};
static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -132,6 +139,9 @@ struct nvmet_ctrl {
const struct nvmet_fabrics_ops *ops;
+ struct pci_dev *p2p_dev;
+ struct list_head p2p_clients;
+
char subsysnqn[NVMF_NQN_FIELD_LEN];
char hostnqn[NVMF_NQN_FIELD_LEN];
};
@@ -234,6 +244,9 @@ struct nvmet_req {
void (*execute)(struct nvmet_req *req);
const struct nvmet_fabrics_ops *ops;
+
+ struct pci_dev *p2p_dev;
+ struct device *p2p_client;
};
static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index f7a3459d618f..27a6d8ea1b56 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -661,6 +661,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
cmd->send_sge.addr, cmd->send_sge.length,
DMA_TO_DEVICE);
+ cmd->req.p2p_client = &queue->dev->device->dev;
+
if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
&queue->nvme_sq, &nvmet_rdma_ops))
return;
--
2.11.0
The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pdma_[un]map_sg() to map the correct
addresses when using P2P memory.
For this, we assume that an SGL passed to these functions contains
either all P2P memory or no P2P memory.
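As a hedged illustration of that assumption, a caller could select the
mapping path by testing only the first page of the SGL. The helper below
is hypothetical; is_pci_p2pdma_page() and pci_p2pdma_map_sg() come from
this series, while dma_map_sg() is the existing DMA API:

static int example_map_sg(struct device *dev, struct scatterlist *sg,
                          int nents, enum dma_data_direction dir)
{
        /* The SGL is all P2P or all regular memory, so one test suffices. */
        if (is_pci_p2pdma_page(sg_page(sg)))
                return pci_p2pdma_map_sg(dev, sg, nents, dir);

        return dma_map_sg(dev, sg, nents, dir);
}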
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 1 +
include/linux/pci-p2pdma.h | 13 ++++++++++++
3 files changed, 65 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4daad6374869..ed9dce8552a2 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -190,6 +190,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->res.flags = pci_resource_flags(pdev, bar);
pgmap->ref = &pdev->p2pdma->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+ pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+ pci_resource_start(pdev, bar);
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -746,3 +748,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
pdev->p2pdma->p2pmem_published = publish;
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ *
+ * Returns the number of SG entries mapped
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+ struct dev_pagemap *pgmap;
+ struct scatterlist *s;
+ phys_addr_t paddr;
+ int i;
+
+ /*
+ * p2pdma mappings are not compatible with devices that use
+ * dma_virt_ops.
+ */
+ if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
+ return 0;
+
+ for_each_sg(sg, s, nents, i) {
+ pgmap = sg_page(s)->pgmap;
+ paddr = sg_phys(s);
+
+ s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+ sg_dma_len(s) = s->length;
+ }
+
+ return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to unmap
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ */
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9e907c338a44..1660f64ce96f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
+ u64 pci_p2pdma_bus_offset;
};
#ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 80e931cb1235..0cde88341eeb 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -35,6 +35,10 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
unsigned int *nents, u32 length);
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -96,5 +100,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
+static inline int pci_p2pdma_map_sg(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+ return 0;
+}
+static inline void pci_p2pdma_unmap_sg(struct device *dev,
+ struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+}
#endif /* CONFIG_PCI_P2PDMA */
#endif /* _LINUX_PCI_P2P_H */
--
2.11.0
Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.
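For instance, a provider's probe() routine might donate and publish a BAR
like this (a minimal sketch; the BAR number and donating the whole BAR are
assumptions):

static int example_donate_bar(struct pci_dev *pdev)
{
        int error;

        /* Register all of BAR 4 (an assumption) as p2p memory. */
        error = pci_p2pdma_add_resource(pdev, 4, 0, 0);
        if (error)
                return error;

        /* Make the memory visible to pci_p2pmem_find() in other drivers. */
        pci_p2pmem_publish(pdev, true);
        return 0;
}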
Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:
int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();
The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pdma_add_client() function,
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
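A hedged sketch of that flow from the client side (nic and nvme stand in
for whatever struct device pointers the caller actually has):

static int example_use_p2pmem(struct device *nic, struct device *nvme,
                              size_t length)
{
        LIST_HEAD(clients);
        struct pci_dev *provider;
        void *buf;

        /* Collect every device that will touch the buffer. */
        if (pci_p2pdma_add_client(&clients, nic) ||
            pci_p2pdma_add_client(&clients, nvme))
                goto err_free_list;

        /* Bind the list to the closest compatible provider, if any. */
        provider = pci_p2pmem_find(&clients);
        if (!provider)
                goto err_free_list;

        buf = pci_alloc_p2pmem(provider, length);
        if (buf) {
                /* DMA would be programmed via pci_p2pmem_virt_to_bus(). */
                pci_free_p2pmem(provider, buf, length);
        }

        pci_dev_put(provider);
        pci_p2pdma_client_list_free(&clients);
        return buf ? 0 : -ENOMEM;

err_free_list:
        pci_p2pdma_client_list_free(&clients);
        return -ENODEV;
}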
Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.
The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same root port (typically through
a network of PCIe switches). This is because we have no way of knowing
whether peer-to-peer routing between PCIe Root Ports is supported
(PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that
go through the RC are limited to reducing DRAM usage and, in some
cases, coding convenience. The PCI-SIG may be exploring adding a new
capability bit to advertise whether this is possible for future
hardware.
This commit includes significant rework and feedback from Christoph
Hellwig.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/Kconfig | 17 ++
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 694 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/memremap.h | 18 ++
include/linux/pci-p2pdma.h | 100 +++++++
include/linux/pci.h | 4 +
6 files changed, 834 insertions(+)
create mode 100644 drivers/pci/p2pdma.c
create mode 100644 include/linux/pci-p2pdma.h
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..b2396c22b53e 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,23 @@ config PCI_PASID
If unsure, say N.
+config PCI_P2PDMA
+ bool "PCI peer-to-peer transfer support"
+ depends on PCI && ZONE_DEVICE && EXPERT
+ select GENERIC_ALLOCATOR
+ help
+ Enables drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (ie. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which ones support it at all, so at this time, DMA
+ transactions must be between devices behind the same root port.
+ (Typically behind a network of PCIe switches).
+
+ If unsure, say N.
+
config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 952addc7bacf..050c1e19a1de 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_X86_INTEL_MID) += pci-mid.o
obj-$(CONFIG_PCI_SYSCALL) += syscall.o
obj-$(CONFIG_PCI_STUB) += pci-stub.o
obj-$(CONFIG_PCI_ECAM) += ecam.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
obj-y += host/
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..e524a12eca1f
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,694 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+ struct percpu_ref devmap_ref;
+ struct completion devmap_ref_done;
+ struct gen_pool *pool;
+ bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+ struct pci_p2pdma *p2p =
+ container_of(ref, struct pci_p2pdma, devmap_ref);
+
+ complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+ struct percpu_ref *ref = data;
+
+ if (percpu_ref_is_dying(ref))
+ return;
+
+ percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+ struct pci_dev *pdev = data;
+
+ if (!pdev->p2pdma)
+ return;
+
+ wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+ percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+ gen_pool_destroy(pdev->p2pdma->pool);
+ pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+ int error = -ENOMEM;
+ struct pci_p2pdma *p2p;
+
+ p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+ if (!p2p)
+ return -ENOMEM;
+
+ p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+ if (!p2p->pool)
+ goto out;
+
+ init_completion(&p2p->devmap_ref_done);
+ error = percpu_ref_init(&p2p->devmap_ref,
+ pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+ if (error)
+ goto out_pool_destroy;
+
+ percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ if (error)
+ goto out_pool_destroy;
+
+ pdev->p2pdma = p2p;
+
+ return 0;
+
+out_pool_destroy:
+ gen_pool_destroy(p2p->pool);
+out:
+ devm_kfree(&pdev->dev, p2p);
+ return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset)
+{
+ struct dev_pagemap *pgmap;
+ void *addr;
+ int error;
+
+ if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+ return -EINVAL;
+
+ if (offset >= pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!size)
+ size = pci_resource_len(pdev, bar) - offset;
+
+ if (size + offset > pci_resource_len(pdev, bar))
+ return -EINVAL;
+
+ if (!pdev->p2pdma) {
+ error = pci_p2pdma_setup(pdev);
+ if (error)
+ return error;
+ }
+
+ pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+ pgmap->res.end = pgmap->res.start + size - 1;
+ pgmap->res.flags = pci_resource_flags(pdev, bar);
+ pgmap->ref = &pdev->p2pdma->devmap_ref;
+ pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+ addr = devm_memremap_pages(&pdev->dev, pgmap);
+ if (IS_ERR(addr)) {
+ error = PTR_ERR(addr);
+ goto pgmap_free;
+ }
+
+ error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+ pci_bus_address(pdev, bar) + offset,
+ resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+ if (error)
+ goto pgmap_free;
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+ &pdev->p2pdma->devmap_ref);
+ if (error)
+ goto pgmap_free;
+
+ pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+ &pgmap->res);
+
+ return 0;
+
+pgmap_free:
+ devres_free(pgmap);
+ return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+ struct device *parent;
+
+ dev = get_device(dev);
+
+ while (dev) {
+ if (dev_is_pci(dev))
+ return to_pci_dev(dev);
+
+ parent = get_device(dev->parent);
+ put_device(dev);
+ dev = parent;
+ }
+
+ return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+ struct pci_dev *up1, *up2;
+
+ if (!pdev)
+ return NULL;
+
+ up1 = pci_dev_get(pci_upstream_bridge(pdev));
+ if (!up1)
+ return NULL;
+
+ up2 = pci_dev_get(pci_upstream_bridge(up1));
+ pci_dev_put(up1);
+
+ return up2;
+}
+
+/*
+ * Find the distance through the nearest common upstream bridge between
+ * two PCI devices.
+ *
+ * If the two devices are the same device then 0 will be returned.
+ *
+ * If there are two virtual functions of the same device behind the same
+ * bridge port then 2 will be returned (one step down to the bridge then
+ * one step back to the same device).
+ *
+ * In the case where two devices are connected to the same PCIe switch, the
+ * value 4 will be returned. This corresponds to the following PCI tree:
+ *
+ * -+ Root Port
+ * \+ Switch Upstream Port
+ * +-+ Switch Downstream Port
+ * + \- Device A
+ * \-+ Switch Downstream Port
+ * \- Device B
+ *
+ * The distance is 4 because we traverse from Device A through the downstream
+ * port of the switch, to the common upstream port, back up to the second
+ * downstream port and then to Device B.
+ *
+ * Any two devices that don't have a common upstream bridge will return -1.
+ * In this way devices on separate root ports will be rejected, which
+ * is what we want for peer-to-peer, seeing as there's no way to determine
+ * whether the root complex supports forwarding between root ports.
+ *
+ * In the case where two devices are connected to different PCIe switches
+ * this function will still return a positive distance as long as both
+ * switches eventually have a common upstream bridge. Note this covers
+ * the case of using multiple PCIe switches to achieve a desired level of
+ * fan-out from a root port. The exact distance will be a function of the
+ * number of switches between Device A and Device B.
+ *
+ */
+static int upstream_bridge_distance(struct pci_dev *a,
+ struct pci_dev *b)
+{
+ int dist_a = 0;
+ int dist_b = 0;
+ struct pci_dev *aa, *bb = NULL, *tmp;
+
+ aa = pci_dev_get(a);
+
+ while (aa) {
+ dist_b = 0;
+
+ pci_dev_put(bb);
+ bb = pci_dev_get(b);
+
+ while (bb) {
+ if (aa == bb)
+ goto put_and_return;
+
+ tmp = pci_dev_get(pci_upstream_bridge(bb));
+ pci_dev_put(bb);
+ bb = tmp;
+
+ dist_b++;
+ }
+
+ tmp = pci_dev_get(pci_upstream_bridge(aa));
+ pci_dev_put(aa);
+ aa = tmp;
+
+ dist_a++;
+ }
+
+ dist_a = -1;
+ dist_b = 0;
+
+put_and_return:
+ pci_dev_put(bb);
+ pci_dev_put(aa);
+
+ return dist_a + dist_b;
+}
+
+struct pci_p2pdma_client {
+ struct list_head list;
+ struct pci_dev *client;
+ struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pdma_client *item, *new_item;
+ struct pci_dev *provider = NULL;
+ struct pci_dev *client;
+ int ret;
+
+ if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+ dev_warn(dev, "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+ return -ENODEV;
+ }
+
+ client = find_parent_pci_dev(dev);
+ if (!client) {
+ dev_warn(dev, "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+ return -ENODEV;
+ }
+
+ item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+ if (item && item->provider) {
+ provider = item->provider;
+
+ if (upstream_bridge_distance(provider, client) < 0) {
+ dev_warn(dev, "cannot be used for peer-to-peer DMA as the client and provider do not share an upstream bridge\n");
+
+ ret = -EXDEV;
+ goto put_client;
+ }
+ }
+
+ new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+ if (!new_item) {
+ ret = -ENOMEM;
+ goto put_client;
+ }
+
+ new_item->client = client;
+ new_item->provider = pci_dev_get(provider);
+
+ list_add_tail(&new_item->list, head);
+
+ return 0;
+
+put_client:
+ pci_dev_put(client);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+ list_del(&item->list);
+ pci_dev_put(item->client);
+ pci_dev_put(item->provider);
+ kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+ struct pci_p2pdma_client *pos, *tmp;
+ struct pci_dev *pdev;
+
+ pdev = find_parent_pci_dev(dev);
+ if (!pdev)
+ return;
+
+ list_for_each_entry_safe(pos, tmp, head, list) {
+ if (pos->client != pdev)
+ continue;
+
+ pci_p2pdma_client_free(pos);
+ }
+
+ pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+ struct pci_p2pdma_client *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, head, list)
+ pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determine the cumulative distance between
+ * a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns -1 if any of the clients are not compatible (i.e. not behind the
+ * same root port as the provider), otherwise returns a non-negative number
+ * where a lower number is the preferable choice. (If one client is the same
+ * as the provider it will return 0, which is the best choice.)
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same PCI root port. This cuts out cases that might work, but it is
+ * the safest approach for the user. Future work can expand this to a
+ * white-list of root complexes that can safely forward between their ports.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+ struct pci_p2pdma_client *pos;
+ int ret;
+ int distance = 0;
+
+ if (list_empty(clients))
+ return -1;
+
+ list_for_each_entry(pos, clients, list) {
+ ret = upstream_bridge_distance(provider, pos->client);
+ if (ret < 0)
+ goto no_match;
+
+ distance += ret;
+ }
+
+ ret = distance;
+
+no_match:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibility (as per pci_p2pdma_distance)
+ * and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+ struct list_head *clients)
+{
+ struct pci_p2pdma_client *pos;
+
+ if (pci_p2pdma_distance(provider, clients) < 0)
+ return false;
+
+ list_for_each_entry(pos, clients, list)
+ pos->provider = provider;
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+ return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ * the specified list of clients and shortest distance (as determined
+ * by pci_p2pdma_distance())
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers is
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ struct pci_dev *pdev = NULL;
+ struct pci_p2pdma_client *pos;
+ int distance;
+ int closest_distance = INT_MAX;
+ struct pci_dev **closest_pdevs;
+ int dev_cnt = 0;
+ const int max_devs = PAGE_SIZE / sizeof(*closest_pdevs);
+ int i;
+
+ closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!closest_pdevs)
+ return NULL;
+
+ while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+ if (!pci_has_p2pmem(pdev))
+ continue;
+
+ distance = pci_p2pdma_distance(pdev, clients);
+ if (distance < 0 || distance > closest_distance)
+ continue;
+
+ if (distance == closest_distance && dev_cnt >= max_devs)
+ continue;
+
+ if (distance < closest_distance) {
+ for (i = 0; i < dev_cnt; i++)
+ pci_dev_put(closest_pdevs[i]);
+
+ dev_cnt = 0;
+ closest_distance = distance;
+ }
+
+ closest_pdevs[dev_cnt++] = pci_dev_get(pdev);
+ }
+
+ if (dev_cnt)
+ pdev = pci_dev_get(closest_pdevs[prandom_u32_max(dev_cnt)]);
+
+ for (i = 0; i < dev_cnt; i++)
+ pci_dev_put(closest_pdevs[i]);
+
+ if (pdev)
+ list_for_each_entry(pos, clients, list)
+ pos->provider = pdev;
+
+ kfree(closest_pdevs);
+ return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ void *ret;
+
+ if (unlikely(!pdev->p2pdma))
+ return NULL;
+
+ if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+ return NULL;
+
+ ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+ if (unlikely(!ret))
+ percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+ gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+ percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ * address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+ if (!addr)
+ return 0;
+ if (!pdev->p2pdma)
+ return 0;
+
+ /*
+ * Note: when we added the memory to the pool we used the PCI
+ * bus address as the physical address. So gen_pool_virt_to_phys()
+ * actually returns the bus address despite the misleading name.
+ */
+ return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @nents: returns the number of SG entries in the allocated list
+ * @length: number of bytes to allocate
+ *
+ * Returns the allocated scatterlist or NULL on error
+ */
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+ unsigned int *nents, u32 length)
+{
+ struct scatterlist *sg;
+ void *addr;
+
+ sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+ if (!sg)
+ return NULL;
+
+ sg_init_table(sg, 1);
+
+ addr = pci_alloc_p2pmem(pdev, length);
+ if (!addr)
+ goto out_free_sg;
+
+ sg_set_buf(sg, addr, length);
+ *nents = 1;
+ return sg;
+
+out_free_sg:
+ kfree(sg);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated scatterlist
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
+{
+ struct scatterlist *sg;
+ int count;
+
+ for_each_sg(sgl, sg, INT_MAX, count) {
+ if (!sg)
+ break;
+
+ pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+ }
+ kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ * other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ *
+ * Published memory can be used by other PCI device drivers for
+ * peer-to-peer DMA operations. Non-published memory is reserved for
+ * exclusive use of the device driver that registers the peer-to-peer
+ * memory.
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+ if (!pdev->p2pdma)
+ return;
+
+ pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
* driver can hotplug the device memory using ZONE_DEVICE and with that memory
* type. Any page of a process can be migrated to such memory. However no one
* should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
*/
enum memory_type {
MEMORY_DEVICE_HOST = 0,
MEMORY_DEVICE_PRIVATE,
MEMORY_DEVICE_PUBLIC,
+ MEMORY_DEVICE_PCI_P2PDMA,
};
/*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
}
#endif /* CONFIG_ZONE_DEVICE */
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+ return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
static inline bool is_device_private_page(const struct page *page)
{
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..80e931cb1235
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+ u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+ struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+ unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+ size_t size, u64 offset)
+{
+ return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+ struct device *dev)
+{
+ return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+ struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+ struct list_head *clients)
+{
+ return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+ struct list_head *clients)
+{
+ return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+ return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+ return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+ return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+ size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+ void *addr)
+{
+ return 0;
+}
+static inline struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+ unsigned int *nents, u32 length)
+{
+ return NULL;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+ struct scatterlist *sgl)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2P_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73178a2fcee0..005feaea8dca 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -277,6 +277,7 @@ struct pcie_link_state;
struct pci_vpd;
struct pci_sriov;
struct pci_ats;
+struct pci_p2pdma;
/* The pci_dev structure describes PCI devices */
struct pci_dev {
@@ -430,6 +431,9 @@ struct pci_dev {
#ifdef CONFIG_PCI_PASID
u16 pasid_features;
#endif
+#ifdef CONFIG_PCI_P2PDMA
+ struct pci_p2pdma *p2pdma;
+#endif
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
char *driver_override; /* Driver name to force a match */
--
2.11.0
On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
> transations must be between devices behind the same root port.
> (Typically behind a network of PCIe switches).
>
> + Enabling this option will also disable ACS on all ports behind
> + any PCIe switch. This effectively puts all devices behind any
> + switch heirarchy into the same IOMMU group. Which implies that
hierarchy group, which
and same fixes in the commit description...
> + individual devices behind any switch will not be able to be
> + assigned to separate VMs because there is no isolation between
> + them. Additionally, any malicious PCIe devices will be able to
> + DMA to memory exposed by other EPs in the same domain as TLPs
> + will not be checked by the IOMMU.
> +
> If unsure, say N.
>
> config PCI_LABEL
--
~Randy
Hi Logan,
It would be rather nice if you could separate out the functions
to detect if peer2peer is possible between two devices.
That would allow me to reuse the same logic for GPU peer2peer where I
don't really have ZONE_DEVICE.
Regards,
Christian.
Am 24.04.2018 um 01:30 schrieb Logan Gunthorpe:
> Hi Everyone,
>
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
>
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>
> Thanks,
>
> Logan
>
> Changes in v4:
>
> * Change the original upstream_bridges_match() function to
> upstream_bridge_distance() which calculates the distance between two
> devices as long as they are behind the same root port. This should
> address Bjorn's concerns that the code was to focused on
> being behind a single switch.
>
> * The disable ACS function now disables ACS for all bridge ports instead
> of switch ports (ie. those that had two upstream_bridge ports).
>
> * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
> API to be more like sgl_alloc() in that the alloc function returns
> the allocated scatterlist and nents is not required bythe free
> function.
>
> * Moved the new documentation into the driver-api tree as requested
> by Jonathan
>
> * Add SGL alloc and free helpers in the nvmet code so that the
> individual drivers can share the code that allocates P2P memory.
> As requested by Christoph.
>
> * Cleanup the nvmet_p2pmem_store() function as Christoph
> thought my first attempt was ugly.
>
> * Numerous commit message and comment fix-ups
>
> Changes in v3:
>
> * Many more fixes and minor cleanups that were spotted by Bjorn
>
> * Additional explanation of the ACS change in both the commit message
> and Kconfig doc. Also, the code that disables the ACS bits is surrounded
> explicitly by an #ifdef
>
> * Removed the flag we added to rdma_rw_ctx() in favour of using
> is_pci_p2pdma_page(), as suggested by Sagi.
>
> * Adjust pci_p2pmem_find() so that it prefers P2P providers that
> are closest to (or the same as) the clients using them. In cases
> of ties, the provider is randomly chosen.
>
> * Modify the NVMe Target code so that the PCI device name of the provider
> may be explicitly specified, bypassing the logic in pci_p2pmem_find().
> (Note: it's still enforced that the provider must be behind the
> same switch as the clients).
>
> * As requested by Bjorn, added documentation for driver writers.
>
>
> Changes in v2:
>
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> as a bunch of cleanup and spelling fixes he pointed out in the last
> series.
>
> * To address Alex's ACS concerns, we change to a simpler method of
> just disabling ACS behind switches for any kernel that has
> CONFIG_PCI_P2PDMA.
>
> * We also reject using devices that employ 'dma_virt_ops' which should
> fairly simply handle Jason's concerns that this work might break with
> the HFI, QIB and rxe drivers that use the virtual ops to implement
> their own special DMA operations.
>
> --
>
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in the kernel with initial support for the NVMe fabrics target
> subsystem. Many thanks go to Christoph Hellwig who provided valuable
> feedback to get these patches to where they are today.
>
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVMe target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU).
>
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch hierarchy. This will mean many setups that
> could likely work well will not be supported so that we can be more
> confident it will work and not place any responsibility on the user to
> understand their topology. (We chose to go this route based on feedback
> we received at the last LSF). Future work may enable these transfers
> using a white list of known good root complexes. However, at this time,
> there is no reliable way to ensure that Peer-to-Peer transactions are
> permitted between PCI Root Ports.
>
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
>
> When the PCI P2PDMA config option is selected the ACS bits in every
> bridge port in the system are turned off to allow traffic to
> pass freely behind the root port. At this time, the bit must be disabled
> at boot so the IOMMU subsystem can correctly create the groups, though
> this could be addressed in the future. There is no way to dynamically
> disable the bit and alter the groups.
>
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices.
>
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
>
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
>
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
> that don't use the proper DMA infrastructure this code rejects using
> any device that employs the virt_dma_ops implementation.
>
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
>
> These patches have been tested on a number of Intel based systems and
> for a variety of RDMA NICs (Mellanox, Broadcomm, Chelsio) and NVMe
> SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
> Microsemi, Chelsio and Everspin) using switches from both Microsemi
> and Broadcomm.
>
> Logan Gunthorpe (14):
> PCI/P2PDMA: Support peer-to-peer memory
> PCI/P2PDMA: Add sysfs group to display p2pmem stats
> PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> docs-rst: Add a new directory for PCI documentation
> PCI/P2PDMA: Add P2P DMA driver writer's documentation
> block: Introduce PCI P2P flags for request and request queue
> IB/core: Ensure we map P2P memory correctly in
> rdma_rw_ctx_[init|destroy]()
> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> nvme-pci: Add support for P2P memory in requests
> nvme-pci: Add a quirk for a pseudo CMB
> nvmet: Introduce helper functions to allocate and free request SGLs
> nvmet-rdma: Use new SGL alloc/free helper for requests
> nvmet: Optionally use PCI P2P memory
>
> Documentation/ABI/testing/sysfs-bus-pci | 25 +
> Documentation/PCI/index.rst | 14 +
> Documentation/driver-api/index.rst | 2 +-
> Documentation/driver-api/pci/index.rst | 20 +
> Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
> Documentation/driver-api/{ => pci}/pci.rst | 0
> Documentation/index.rst | 3 +-
> block/blk-core.c | 3 +
> drivers/infiniband/core/rw.c | 13 +-
> drivers/nvme/host/core.c | 4 +
> drivers/nvme/host/nvme.h | 8 +
> drivers/nvme/host/pci.c | 118 +++--
> drivers/nvme/target/configfs.c | 67 +++
> drivers/nvme/target/core.c | 143 ++++-
> drivers/nvme/target/io-cmd.c | 3 +
> drivers/nvme/target/nvmet.h | 15 +
> drivers/nvme/target/rdma.c | 22 +-
> drivers/pci/Kconfig | 26 +
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
> drivers/pci/pci.c | 6 +
> include/linux/blk_types.h | 18 +-
> include/linux/blkdev.h | 3 +
> include/linux/memremap.h | 19 +
> include/linux/pci-p2pdma.h | 118 +++++
> include/linux/pci.h | 4 +
> 26 files changed, 1579 insertions(+), 56 deletions(-)
> create mode 100644 Documentation/PCI/index.rst
> create mode 100644 Documentation/driver-api/pci/index.rst
> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> create mode 100644 drivers/pci/p2pdma.c
> create mode 100644 include/linux/pci-p2pdma.h
>
> --
> 2.11.0
Hi Christian,
On 5/2/2018 5:51 AM, Christian König wrote:
> it would be rather nice to have if you could separate out the functions
> to detect if peer2peer is possible between two devices.
This would essentially be pci_p2pdma_distance() in the existing
patchset. It returns the sum of the distances between a list of clients
and a P2PDMA provider. It returns -1 if peer2peer is not possible
between the devices (presently this means they are not behind the same
root port).
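Roughly, in a hedged sketch (with "clients" being a list previously built
with pci_p2pdma_add_client()):

static bool example_p2p_possible(struct pci_dev *provider,
                                 struct list_head *clients)
{
        /* A negative result means no common root port: no peer2peer. */
        if (pci_p2pdma_distance(provider, clients) < 0)
                return false;

        /* A lower non-negative sum means a closer, preferable provider. */
        return true;
}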
Logan
Am 02.05.2018 um 17:56 schrieb Logan Gunthorpe:
> Hi Christian,
>
> On 5/2/2018 5:51 AM, Christian König wrote:
>> it would be rather nice to have if you could separate out the
>> functions to detect if peer2peer is possible between two devices.
>
> This would essentially be pci_p2pdma_distance() in the existing
> patchset. It returns the sum of the distance between a list of clients
> and a P2PDMA provider. It returns -1 if peer2peer is not possible
> between the devices (presently this means they are not behind the same
> root port).
Ok, I'm still missing the big picture here. First question is what is
the P2PDMA provider?
Second question is how do you want to handle things when devices are not
behind the same root port (which is perfectly possible in the cases I
deal with)?
Third question: why multiple clients? That feels a bit like you are
pushing something special to your use case into the common PCI
subsystem. Something which usually isn't a good idea.
As far as I can see we need a function which returns the distance between
an initiator and a target device. This function then returns -1 if the
transaction can't be made and a positive value otherwise.
We also need to give the direction of the transaction and have a
whitelist of root complex PCI-IDs which can handle P2P transactions from
different ports for a certain DMA direction.
Christian.
>
> Logan
On 03/05/18 03:05 AM, Christian König wrote:
> Ok, I'm still missing the big picture here. First question is what is
> the P2PDMA provider?
Well there's some pretty good documentation in the patchset for this,
but in short, a provider is a device that provides some kind of P2P
resource (ie. BAR memory, or perhaps a doorbell register -- only memory
is supported at this time).
> Second question is how to you want to handle things when device are not
> behind the same root port (which is perfectly possible in the cases I
> deal with)?
I think we need to implement a whitelist. If both root ports are in the
white list and are on the same bus then we return a larger distance
instead of -1.
> Third question why multiple clients? That feels a bit like you are
> pushing something special to your use case into the common PCI
> subsystem. Something which usually isn't a good idea.
No, I think this will be pretty standard. In the simple general case you
are going to have one provider and at least two clients (one which
writes the memory and one which reads it). However, one client is
likely, but not necessarily, the same as the provider.
In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
devices. The code doesn't care which device provides the memory as it
could be the RDMA device or one/all of the block devices (or, in theory,
a completely separate device with P2P-able memory). However, it does
require that all devices involved are accessible per
pci_p2pdma_distance() or it won't use P2P transactions.
I could also imagine other use cases: ie. an RDMA NIC sends data to a
GPU for processing and then sends the data to an NVMe device for storage
(or vice-versa). In this case we have 3 clients and one provider.
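As a hedged sketch of that three-client case (the device pointers are
placeholders, and a real driver would keep the client list around so
devices can be added or removed later):

static struct pci_dev *example_find_provider(struct device *rdma,
                                             struct device *gpu,
                                             struct device *nvme)
{
        LIST_HEAD(clients);
        struct pci_dev *provider = NULL;

        if (!pci_p2pdma_add_client(&clients, rdma) &&
            !pci_p2pdma_add_client(&clients, gpu) &&
            !pci_p2pdma_add_client(&clients, nvme)) {
                /* May be any of the three or an unrelated p2pmem device. */
                provider = pci_p2pmem_find(&clients);
        }

        pci_p2pdma_client_list_free(&clients);
        return provider;        /* reference held; pci_dev_put() when done */
}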
> As far as I can see we need a function which return the distance between
> a initiator and target device. This function then returns -1 if the
> transaction can't be made and a positive value otherwise.
If you need to make a simpler convenience function for your use case I'm
not against it.
> We also need to give the direction of the transaction and have a
> whitelist root complex PCI-IDs which can handle P2P transactions from
> different ports for a certain DMA direction.
Yes. In the NVMeof case we need all devices to be able to DMA in both
directions so we did not need the DMA direction. But I can see this
being useful once we add the whitelist.
Logan
Am 03.05.2018 um 17:59 schrieb Logan Gunthorpe:
> On 03/05/18 03:05 AM, Christian König wrote:
>> Second question is how to you want to handle things when device are not
>> behind the same root port (which is perfectly possible in the cases I
>> deal with)?
> I think we need to implement a whitelist. If both root ports are in the
> white list and are on the same bus then we return a larger distance
> instead of -1.
Sounds good.
>> Third question why multiple clients? That feels a bit like you are
>> pushing something special to your use case into the common PCI
>> subsystem. Something which usually isn't a good idea.
> No, I think this will be pretty standard. In the simple general case you
> are going to have one provider and at least two clients (one which
> writes the memory and one which reads it). However, one client is
> likely, but not necessarily, the same as the provider.
Ok, that is the point where I'm stuck. Why do we need that in one
function call in the PCIe subsystem?
The problem at least with GPUs is that we seriously don't have that
information here, cause the PCI subsystem might not be aware of all the
interconnections.
For example it isn't uncommon to put multiple GPUs on one board. To the
PCI subsystem that looks like separate devices, but in reality all GPUs
are interconnected and can access each other's memory directly without
going over the PCIe bus.
I seriously don't want to model that in the PCI subsystem, but rather
the driver. That's why it feels like a mistake to me to push all that
into the PCI function.
> In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
> devices. The code doesn't care which device provides the memory as it
> could be the RDMA device or one/all of the block devices (or, in theory,
> a completely separate device with P2P-able memory). However, it does
> require that all devices involved are accessible per
> pci_p2pdma_distance() or it won't use P2P transactions.
>
> I could also imagine other use cases: ie. an RDMA NIC sends data to a
> GPU for processing and then sends the data to an NVMe device for storage
> (or vice-versa). In this case we have 3 clients and one provider.
Why can't we model that as two separate transactions?
E.g. one from the RDMA NIC to the GPU memory. And another one from the
GPU memory to the NVMe device.
That would also match how I get this information from userspace.
>> As far as I can see we need a function which return the distance between
>> a initiator and target device. This function then returns -1 if the
>> transaction can't be made and a positive value otherwise.
> If you need to make a simpler convenience function for your use case I'm
> not against it.
Yeah, same for me. If Bjorn is ok with that specialized NVM functions
that I'm fine with that as well.
I think it would just be more convenient when we can come up with
functions which can handle all use cases, cause there still seems to be
a lot of similarities.
>
>> We also need to give the direction of the transaction and have a
>> whitelist root complex PCI-IDs which can handle P2P transactions from
>> different ports for a certain DMA direction.
> Yes. In the NVMeof case we need all devices to be able to DMA in both
> directions so we did not need the DMA direction. But I can see this
> being useful once we add the whitelist.
Ok, I agree that can be added later on. For simplicity let's assume for
now we always do bidirectional transfers.
Thanks for the explanation,
Christian.
>
> Logan
On 03/05/18 11:29 AM, Christian König wrote:
> Ok, that is the point where I'm stuck. Why do we need that in one
> function call in the PCIe subsystem?
>
> The problem at least with GPUs is that we seriously don't have that
> information here, cause the PCI subsystem might not be aware of all the
> interconnections.
>
> For example it isn't uncommon to put multiple GPUs on one board. To the
> PCI subsystem that looks like separate devices, but in reality all GPUs
> are interconnected and can access each others memory directly without
> going over the PCIe bus.
>
> I seriously don't want to model that in the PCI subsystem, but rather
> the driver. That's why it feels like a mistake to me to push all that
> into the PCI function.
Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:
pci_p2pdma_distance(target, [initiator, target])
> Why can't we model that as two separate transactions?
You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).
> Yeah, same for me. If Bjorn is ok with those specialized NVMe functions
> then I'm fine with that as well.
>
> I think it would just be more convenient if we can come up with
> functions which can handle all use cases, because there still seem to be
> a lot of similarities.
The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.
Logan
On 03.05.2018 at 20:43, Logan Gunthorpe wrote:
>
> On 03/05/18 11:29 AM, Christian König wrote:
>> Ok, that is the point where I'm stuck. Why do we need that in one
>> function call in the PCIe subsystem?
>>
>> The problem at least with GPUs is that we seriously don't have that
>> information here, cause the PCI subsystem might not be aware of all the
>> interconnections.
>>
>> For example it isn't uncommon to put multiple GPUs on one board. To the
>> PCI subsystem that looks like separate devices, but in reality all GPUs
>> are interconnected and can access each others memory directly without
>> going over the PCIe bus.
>>
>> I seriously don't want to model that in the PCI subsystem, but rather
>> the driver. That's why it feels like a mistake to me to push all that
>> into the PCI function.
> Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
> list to this API, if you want. If the driver is _sure_ they are all the
> same, you only have to send one. In your terminology, you'd just have to
> call the interface with:
>
> pci_p2pdma_distance(target, [initiator, target])
Ok, I expected that something like that would do it.
So just to confirm: When I have a bunch of GPUs which could be the
initiator I only need to do "pci_p2pdma_distance(target, [first GPU,
target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU,
third GPU, fourth...., target])" ?
>> Why can't we model that as two separate transactions?
> You could, but this is more convenient for users of the API that need to
> deal with multiple devices (and manage devices that may be added or
> removed at any time).
Are you sure that this is more convenient? At least at first glance it
feels overly complicated.
I mean what's the difference between the two approaches?
sum = pci_p2pdma_distance(target, [A, B, C, target]);
and
sum = pci_p2pdma_distance(target, A);
sum += pci_p2pdma_distance(target, B);
sum += pci_p2pdma_distance(target, C);
>> Yeah, same for me. If Bjorn is ok with those specialized NVMe functions
>> then I'm fine with that as well.
>>
>> I think it would just be more convenient if we can come up with
>> functions which can handle all use cases, because there still seem to be
>> a lot of similarities.
> The way it's implemented is more general and can handle all use cases.
> You are arguing for a function that can handle your case (albeit with a
> bit more fuss) but can't handle mine and is therefore less general.
> Calling my interface specialized is wrong.
Well, at the end of the day you only need to convince Bjorn of the
interface, so I'm perfectly fine with it as long as it serves my use
case as well :)
But I still would like to understand your intention, because that really
helps to avoid accidentally breaking something in the long term.
Now when I take a look at the pure PCI hardware level, what I have is a
transaction between an initiator and a target, and not multiple devices
in one operation.
I mean you must have a very good reason for wanting to deal with
multiple devices in the software layer, but that reason is not obvious
to me from either the code or your explanation.
Thanks,
Christian.
>
> Logan
On 04/05/18 08:27 AM, Christian König wrote:
> Are you sure that this is more convenient? At least at first glance it
> feels overly complicated.
>
> I mean what's the difference between the two approaches?
>
> sum = pci_p2pdma_distance(target, [A, B, C, target]);
>
> and
>
> sum = pci_p2pdma_distance(target, A);
> sum += pci_p2pdma_distance(target, B);
> sum += pci_p2pdma_distance(target, C);
Well, it's more for consistency with pci_p2pdma_find(), which has to
take a list of devices to find a resource that matches all of them.
(You can't use multiple calls in that case because all the devices in
the list might not have the same set of compatible providers.) That way
we can use the same list to check the distance (when the user specifies
a device) as we do to find a compatible device (when the user wants to
automatically find one).
Logan
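For reference, a condensed sketch of the list-then-find flow being
described here, using the v4 function names quoted in the commit message
just below (a sketch only: error unwinding and the eventual freeing of
the allocation are elided):

/*
 * Sketch: pick a provider that is compatible with *every* client in
 * the list, then allocate from it. This is why one list and one call
 * are used rather than per-device calls.
 */
static void *my_p2p_buffer(struct list_head *clients, size_t len)
{
	struct pci_dev *provider;

	provider = pci_p2pmem_find(clients);
	if (!provider)
		return NULL;	/* no usable P2P memory; use system RAM */

	return pci_alloc_p2pmem(provider, len);
}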
On Mon, Apr 23, 2018 at 05:30:33PM -0600, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
>
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
>
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
>
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
>
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCI switch by a small number of lanes
s/PCI/PCIe/
> which would maximize the number of lanes available to connect to NVMe
> devices.
>
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same root port (typically through
s/root port/PCI bridge/
> a network of PCIe switches). This is because we have no way of knowing
> whether peer-to-peer routing between PCIe Root Ports is supported
> (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that
> go through the RC is limited to only reducing DRAM usage and, in some
> cases, coding convenience. The PCI-SIG may be exploring adding a new
> capability bit to advertise whether this is possible for future
> hardware.
>
> This commit includes significant rework and feedback from Christoph
> Hellwig.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/Kconfig | 17 ++
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 694 +++++++++++++++++++++++++++++++++++++++++++++
> include/linux/memremap.h | 18 ++
> include/linux/pci-p2pdma.h | 100 +++++++
> include/linux/pci.h | 4 +
> 6 files changed, 834 insertions(+)
> create mode 100644 drivers/pci/p2pdma.c
> create mode 100644 include/linux/pci-p2pdma.h
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..b2396c22b53e 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,23 @@ config PCI_PASID
>
> If unsure, say N.
>
> +config PCI_P2PDMA
> + bool "PCI peer-to-peer transfer support"
> + depends on PCI && ZONE_DEVICE && EXPERT
> + select GENERIC_ALLOCATOR
> + help
> + Enables drivers to do PCI peer-to-peer transactions to and from
> + BARs that are exposed in other devices that are part of
> + the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> + specification to work (ie. anything below a single PCI bridge).
> +
> + Many PCIe root complexes do not support P2P transactions and
> + it's hard to tell which support it at all, so at this time, DMA
> + transactions must be between devices behind the same root port.
s/DMA transactions/PCIe DMA transactions/
(Theoretically P2P should work on conventional PCI, and this sentence only
applies to PCIe.)
> + (Typically behind a network of PCIe switches).
Not sure this last sentence adds useful information.
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,694 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Peer 2 Peer DMA support.
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + *
Nit: unnecessary blank line.
> +/*
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge():
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
This function doesn't seem to be used anymore. Thanks for all your hard
work to get rid of it!
> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + pci_dev_put(up1);
> +
> + return up2;
> +}
> +
> +/*
> + * Find the distance through the nearest common upstream bridge between
> + * two PCI devices.
> + *
> + * If the two devices are the same device then 0 will be returned.
> + *
> + * If there are two virtual functions of the same device behind the same
> + * bridge port then 2 will be returned (one step down to the bridge then
s/bridge/PCIe switch/
> + * one step back to the same device).
> + *
> + * In the case where two devices are connected to the same PCIe switch, the
> + * value 4 will be returned. This corresponds to the following PCI tree:
> + *
> + * -+ Root Port
> + * \+ Switch Upstream Port
> + * +-+ Switch Downstream Port
> + * + \- Device A
> + * \-+ Switch Downstream Port
> + * \- Device B
> + *
> + * The distance is 4 because we traverse from Device A through the downstream
> + * port of the switch, to the common upstream port, back up to the second
> + * downstream port and then to Device B.
> + *
> + * Any two devices that don't have a common upstream bridge will return -1.
> + * In this way devices on seperate root ports will be rejected, which
s/seperate/separate/
s/root port/PCIe root ports/
(Again, since P2P should work on conventional PCI)
> + * is what we want for peer-to-peer seeing there's no way to determine
> + * if the root complex supports forwarding between root ports.
s/seeing there's no way.../
seeing each PCIe root port defines a separate hierarchy domain and
there's no way to determine whether the root complex supports forwarding
between them./
> + *
> + * In the case where two devices are connected to different PCIe switches
> + * this function will still return a positive distance as long as both
> + * switches eventually have a common upstream bridge. Note this covers
> + * the case of using multiple PCIe switches to achieve a desired level of
> + * fan-out from a root port. The exact distance will be a function of the
> + * number of switches between Device A and Device B.
> + *
Nit: unnecessary blank line.
> + */
> +static int upstream_bridge_distance(struct pci_dev *a,
> +				     struct pci_dev *b)
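Since the hunk is cut off here, the following is an illustrative
reconstruction of the walk that the comment above describes. It is a
guess at the shape of the algorithm, not the actual patch body:

static int upstream_bridge_distance(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *bb;
	int dist_a = 0, dist_b;

	/*
	 * Walk up from 'a' one bridge at a time; for each ancestor of
	 * 'a', walk up from 'b' looking for a match. The first common
	 * ancestor gives the total distance dist_a + dist_b; no common
	 * ancestor means the devices are in different hierarchy domains.
	 */
	while (a) {
		dist_b = 0;
		for (bb = b; bb; bb = pci_upstream_bridge(bb)) {
			if (a == bb)
				return dist_a + dist_b;
			dist_b++;
		}
		a = pci_upstream_bridge(a);
		dist_a++;
	}

	return -1;
}

This matches the examples in the comment: the same device yields 0, two
functions behind one bridge port yield 2, and two devices behind the
same switch yield 4.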
s/dma/DMA/ (in subject)
On Mon, Apr 23, 2018 at 05:30:35PM -0600, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
>
> For this, we assume that an SGL passed to these functions contains
> either all P2P memory or no P2P memory.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
Thanks for the review. I'll apply all of these changes for the next
version of the set.
>> +/*
>> + * If a device is behind a switch, we try to find the upstream bridge
>> + * port of the switch. This requires two calls to pci_upstream_bridge():
>> + * one for the upstream port on the switch, one on the upstream port
>> + * for the next level in the hierarchy. Because of this, devices connected
>> + * to the root port will be rejected.
>> + */
>> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>
> This function doesn't seem to be used anymore. Thanks for all your hard
> work to get rid of it!
Oops, I thought I had gotten rid of it entirely, but I guess I messed it
up a bit and it gets removed in patch 4. I'll fix it for v5.
Logan
[+to Alex]
Alex,
Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA? We only check this at enumeration-time and
I don't know if there are other places where we would care?
On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have designed the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/Kconfig | 9 +++++++++
> drivers/pci/p2pdma.c | 45 ++++++++++++++++++++++++++++++---------------
> drivers/pci/pci.c | 6 ++++++
> include/linux/pci-p2pdma.h | 5 +++++
> 4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
> transactions must be between devices behind the same root port.
> (Typically behind a network of PCIe switches).
>
> + Enabling this option will also disable ACS on all ports behind
> + any PCIe switch. This effectively puts all devices behind any
> + switch heirarchy into the same IOMMU group. Which implies that
s/heirarchy/hierarchy/ (also above in changelog)
> + individual devices behind any switch will not be able to be
> + assigned to separate VMs because there is no isolation between
> + them. Additionally, any malicious PCIe devices will be able to
> + DMA to memory exposed by other EPs in the same domain as TLPs
> + will not be checked by the IOMMU.
> +
> If unsure, say N.
>
> config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
> }
>
> /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPs to not be forwarded
> + * up to the RC which is not what we want for P2P.
s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge being in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
> */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> {
> - struct pci_dev *up1, *up2;
> + int pos;
> + u16 ctrl;
>
> - if (!pdev)
> - return NULL;
> + if (!pci_is_bridge(pdev))
> + return 0;
>
> - up1 = pci_dev_get(pci_upstream_bridge(pdev));
> - if (!up1)
> - return NULL;
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return 0;
> +
> + pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>
> - up2 = pci_dev_get(pci_upstream_bridge(up1));
> - pci_dev_put(up1);
> + pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>
> - return up2;
> + return 1;
> }
>
> /*
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..7e2f5724ba22 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
> #include <linux/of.h>
> #include <linux/of_pci.h>
> #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
> #include <linux/pm.h>
> #include <linux/slab.h>
> #include <linux/module.h>
> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
> */
> void pci_enable_acs(struct pci_dev *dev)
> {
> +#ifdef CONFIG_PCI_P2PDMA
> + if (pci_p2pdma_disable_acs(dev))
> + return;
> +#endif
> +
> if (!pci_acs_enable)
> return;
>
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0cde88341eeb..fcb3437a2f3c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -18,6 +18,7 @@ struct block_device;
> struct scatterlist;
>
> #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
> int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> u64 offset);
> int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> enum dma_data_direction dir);
> #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> + return 0;
> +}
> static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
> size_t size, u64 offset)
> {
> --
> 2.11.0
>
On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
>
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> ---
> Documentation/PCI/index.rst | 14 +++
> Documentation/driver-api/pci/index.rst | 1 +
> Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
> Documentation/index.rst | 3 +-
> 4 files changed, 183 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/PCI/index.rst
> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index 000000000000..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==================================
> +Linux PCI Driver Developer's Guide
> +==================================
> +
> +.. toctree::
> +
> + p2pdma
> +
> +.. only:: subproject and html
> +
> + Indices
> + =======
> +
> + * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
> :maxdepth: 2
>
> pci
> + p2pdma
>
> .. only:: subproject and html
>
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is
s/endpoints/devices/
> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required
s/PCI Root Complexes .../
PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./
> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.
s/endpoints involved .../
devices involved are all behind the same PCI bridge, as such devices are
all in the same PCI hierarchy domain, and the spec guarantees that all
transactions within the hierarchy will be routable, but it does not
require routing between hierarchies./
> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent, so there are
> +a few corner-case gotchas with these pages and developers need to
> +be careful about what they do with them.
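As a concrete example of the registration step the document describes, a
provider driver would publish its BAR with something like the following
sketch (the signature is taken from the header quoted earlier in this
thread; BAR 4 and the full-BAR size are illustrative choices):

/*
 * Sketch: register all of BAR 4 as ZONE_DEVICE-backed P2P memory so
 * other subsystems can find and allocate it.
 */
static int my_provider_probe(struct pci_dev *pdev)
{
	return pci_p2pdma_add_resource(pdev, 4,
				       pci_resource_len(pdev, 4), 0);
}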
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
>
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
>
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...
> Logan Gunthorpe (14):
> PCI/P2PDMA: Support peer-to-peer memory
> PCI/P2PDMA: Add sysfs group to display p2pmem stats
> PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> docs-rst: Add a new directory for PCI documentation
> PCI/P2PDMA: Add P2P DMA driver writer's documentation
> block: Introduce PCI P2P flags for request and request queue
> IB/core: Ensure we map P2P memory correctly in
> rdma_rw_ctx_[init|destroy]()
> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> nvme-pci: Add support for P2P memory in requests
> nvme-pci: Add a quirk for a pseudo CMB
> nvmet: Introduce helper functions to allocate and free request SGLs
> nvmet-rdma: Use new SGL alloc/free helper for requests
> nvmet: Optionally use PCI P2P memory
>
> Documentation/ABI/testing/sysfs-bus-pci | 25 +
> Documentation/PCI/index.rst | 14 +
> Documentation/driver-api/index.rst | 2 +-
> Documentation/driver-api/pci/index.rst | 20 +
> Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
> Documentation/driver-api/{ => pci}/pci.rst | 0
> Documentation/index.rst | 3 +-
> block/blk-core.c | 3 +
> drivers/infiniband/core/rw.c | 13 +-
> drivers/nvme/host/core.c | 4 +
> drivers/nvme/host/nvme.h | 8 +
> drivers/nvme/host/pci.c | 118 +++--
> drivers/nvme/target/configfs.c | 67 +++
> drivers/nvme/target/core.c | 143 ++++-
> drivers/nvme/target/io-cmd.c | 3 +
> drivers/nvme/target/nvmet.h | 15 +
> drivers/nvme/target/rdma.c | 22 +-
> drivers/pci/Kconfig | 26 +
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
> drivers/pci/pci.c | 6 +
> include/linux/blk_types.h | 18 +-
> include/linux/blkdev.h | 3 +
> include/linux/memremap.h | 19 +
> include/linux/pci-p2pdma.h | 118 +++++
> include/linux/pci.h | 4 +
> 26 files changed, 1579 insertions(+), 56 deletions(-)
> create mode 100644 Documentation/PCI/index.rst
> create mode 100644 Documentation/driver-api/pci/index.rst
> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> create mode 100644 drivers/pci/p2pdma.c
> create mode 100644 include/linux/pci-p2pdma.h
How do you envision merging this? There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.
If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?
Bjorn
> How do you envision merging this? There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
>
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?
Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
balance is NVMe stuff so we could look at merging it through that tree.
The block patch and IB patch are pretty small.
Thanks,
Logan
Hi Bjorn,
On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA? We only check this at enumeration-time and
> I don't know if there are other places where we would care?
thanks for pointing this out, I totally missed this hack.
AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
CPU when the IOMMU is enabled, or otherwise you will break SVM.
Similar problems arise when you do this for dedicated GPUs, but we
haven't upstreamed the support for this yet.
So that is a clear NAK from my side for the approach.
And what exactly is the problem here? I'm currently testing P2P with
GPUs in different IOMMU domains and at least with AMD IOMMUs that works
perfectly fine.
Regards,
Christian.
>
> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have designed the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> drivers/pci/Kconfig | 9 +++++++++
>> drivers/pci/p2pdma.c | 45 ++++++++++++++++++++++++++++++---------------
>> drivers/pci/pci.c | 6 ++++++
>> include/linux/pci-p2pdma.h | 5 +++++
>> 4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>> transactions must be between devices behind the same root port.
>> (Typically behind a network of PCIe switches).
>>
>> + Enabling this option will also disable ACS on all ports behind
>> + any PCIe switch. This effectively puts all devices behind any
>> + switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> + individual devices behind any switch will not be able to be
>> + assigned to separate VMs because there is no isolation between
>> + them. Additionally, any malicious PCIe devices will be able to
>> + DMA to memory exposed by other EPs in the same domain as TLPs
>> + will not be checked by the IOMMU.
>> +
>> If unsure, say N.
>>
>> config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>> }
>>
>> /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPs to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge being in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
>> */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> {
>> - struct pci_dev *up1, *up2;
>> + int pos;
>> + u16 ctrl;
>>
>> - if (!pdev)
>> - return NULL;
>> + if (!pci_is_bridge(pdev))
>> + return 0;
>>
>> - up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> - if (!up1)
>> - return NULL;
>> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> + if (!pos)
>> + return 0;
>> +
>> + pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>
>> - up2 = pci_dev_get(pci_upstream_bridge(up1));
>> - pci_dev_put(up1);
>> + pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>
>> - return up2;
>> + return 1;
>> }
>>
>> /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>> #include <linux/of.h>
>> #include <linux/of_pci.h>
>> #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>> #include <linux/pm.h>
>> #include <linux/slab.h>
>> #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>> */
>> void pci_enable_acs(struct pci_dev *dev)
>> {
>> +#ifdef CONFIG_PCI_P2PDMA
>> + if (pci_p2pdma_disable_acs(dev))
>> + return;
>> +#endif
>> +
>> if (!pci_acs_enable)
>> return;
>>
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>> struct scatterlist;
>>
>> #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>> int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>> u64 offset);
>> int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>> void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>> enum dma_data_direction dir);
>> #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> + return 0;
>> +}
>> static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>> size_t size, u64 offset)
>> {
>> --
>> 2.11.0
>>
Hi Christian
> AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
> CPU when the IOMMU is enabled, or otherwise you will break SVM.
OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?
> Similar problems arise when you do this for dedicated GPUs, but we
> haven't upstreamed the support for this yet.
Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
> So that is a clear NAK from my side for the approach.
Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
> And what exactly is the problem here?
We had a pretty lengthy discussion on this topic on one of the previous
revisions. The issue is that currently there is no mechanism in the IOMMU
code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically
change its topology (due to PCI hotplug) we had to be cognizant of the
fact that ACS settings could change.

Since there is no way to currently handle changing ACS settings, and
hence IOMMU groupings, the consensus was to simply disable ACS on all
ports in a p2pdma domain. This effectively makes all the devices in the
p2pdma domain part of the same IOMMU grouping. The plan will be to
address this in time and add a mechanism for IOMMU grouping changes and
notification to VMs, but that's not part of this series.

Note you are still allowed to have ACS functioning on other PCI domains,
so if you do need a plurality of IOMMU groupings you can still achieve it
(but you can't do p2pdma across IOMMU groupings, which is safe).
> I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.
Cheers
Stephen
On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <[email protected]> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have designed the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/Kconfig | 9 +++++++++
> drivers/pci/p2pdma.c | 45 ++++++++++++++++++++++++++++++---------------
> drivers/pci/pci.c | 6 ++++++
> include/linux/pci-p2pdma.h | 5 +++++
> 4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
> transactions must be between devices behind the same root port.
> (Typically behind a network of PCIe switches).
>
> + Enabling this option will also disable ACS on all ports behind
> + any PCIe switch. This effectively puts all devices behind any
> + switch heirarchy into the same IOMMU group. Which implies that
> + individual devices behind any switch will not be able to be
> + assigned to separate VMs because there is no isolation between
> + them. Additionally, any malicious PCIe devices will be able to
> + DMA to memory exposed by other EPs in the same domain as TLPs
> + will not be checked by the IOMMU.
> +
> If unsure, say N.
It seems unwieldy that this is a compile time option and not a runtime
option. Can't we have a kernel command line option to opt-in to this
behavior rather than require a wholly separate kernel image?
Why is this text added in a follow on patch and not the patch that
introduced the config option?
I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.
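For illustration, an opt-in of the kind Dan suggests could look roughly
like the sketch below; the parameter name and wiring are hypothetical
and not part of the posted series:

/* Hypothetical boot-time opt-in; not part of the posted series. */
static bool p2pdma_acs_off;

static int __init p2pdma_acs_off_setup(char *str)
{
	p2pdma_acs_off = true;
	return 1;
}
__setup("pci_p2pdma_acs_off", p2pdma_acs_off_setup);

int pci_p2pdma_disable_acs(struct pci_dev *pdev)
{
	if (!p2pdma_acs_off)	/* default: leave ACS and IOMMU groups alone */
		return 0;

	/* ... clear PCI_ACS_RR | PCI_ACS_CR as in the patch above ... */
	return 1;
}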
Hi Dan
> It seems unwieldy that this is a compile time option and not a runtime
> option. Can't we have a kernel command line option to opt-in to this
> behavior rather than require a wholly separate kernel image?
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
> Why is this text added in a follow on patch and not the patch that
> introduced the config option?
Because the ACS section was added later in the series and this information is associated with that additional functionality.
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.
By this you mean the address for either an RP, DSP, USP or MF EP below which we disable ACS? We could do that but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.
Stephen
[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
> CPU when the IOMMU is enabled, or otherwise you will break SVM.
Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.
> And what exactly is the problem here? I'm currently testing P2P with
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works
> perfectly fine.
In addition to Stephen's comments, seeing we've established a general
need to avoid the root complex (until we have a whitelist at least) we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and if there is no support it
will fail.
If the consensus is we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option and
this is really annoying for users of this functionality to understand
how to do that correctly.
Logan
On 08.05.2018 at 16:25, Stephen Bates wrote:
>
> Hi Christian
>
>> AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
>> CPU when the IOMMU is enabled, or otherwise you will break SVM.
> OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?
Well, I'm not an expert on this, but I think that is an incorrect
assumption you guys are using here.
At least in the default configuration, even with the IOMMU enabled, P2P
transactions do NOT necessarily travel up to the root complex for
translation.
It's already late here, but if nobody beats me to it I'm going to dig up
the necessary documentation tomorrow.
Regards,
Christian.
>
>> Similar problems arise when you do this for dedicated GPUs, but we
>> haven't upstreamed the support for this yet.
> Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
>
>> So that is a clear NAK from my side for the approach.
> Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
>
>> And what exactly is the problem here?
>
> We had a pretty lengthy discussion on this topic on one of the previous
> revisions. The issue is that currently there is no mechanism in the IOMMU
> code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically
> change its topology (due to PCI hotplug) we had to be cognizant of the
> fact that ACS settings could change.
>
> Since there is no way to currently handle changing ACS settings, and
> hence IOMMU groupings, the consensus was to simply disable ACS on all
> ports in a p2pdma domain. This effectively makes all the devices in the
> p2pdma domain part of the same IOMMU grouping. The plan will be to
> address this in time and add a mechanism for IOMMU grouping changes and
> notification to VMs, but that's not part of this series.
>
> Note you are still allowed to have ACS functioning on other PCI domains,
> so if you do need a plurality of IOMMU groupings you can still achieve it
> (but you can't do p2pdma across IOMMU groupings, which is safe).
>
>> I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
> Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.
>
> Cheers
>
> Stephen
>
>
On 08.05.2018 at 18:27, Logan Gunthorpe wrote:
>
> On 08/05/18 01:17 AM, Christian König wrote:
>> AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
>> CPU when the IOMMU is enabled, or otherwise you will break SVM.
> Well, given that the current set only disables ACS bits on bridges
> (previous versions were only on switches) this shouldn't be an issue for
> integrated devices. We do not disable ACS flags globally.
Ok, that is at least a step in the right direction. But I think we
seriously need to test that for side effects.
>
>> And what exactly is the problem here? I'm currently testing P2P with
>> GPUs in different IOMMU domains and at least with AMD IOMMUs that works
>> perfectly fine.
> In addition to Stephen's comments, seeing we've established a general
> need to avoid the root complex (until we have a whitelist at least) we
> must have ACS disabled along the path between the devices. Otherwise,
> all TLPs will go through the root complex and if there is no support it
> will fail.
Well I'm not an expert on this, but if I'm not completely mistaken that
is not correct.
E.g. transactions are initially sent to the root complex for
translation, that's for sure. But at least for AMD GPUs the root complex
answers with the translated address, which is then cached in the device.
So further transactions for the same address range then go directly to
the destination.
What you don't want is device isolation, because in this case the root
complex handles the transactions itself. IIRC there were also
parameters like "force_isolation" and "nobypass" for the IOMMU
to control that behavior.
It's already late here, but I'm going to dig up the documentation for
that tomorrow and/or contact a hardware engineer involved in the ACS spec.
Regards,
Christian.
>
> If the consensus is we want a command line option, then so be it. But
> we'll have to deny pretty much all P2P transactions unless the user
> correctly disables ACS along the path using the command line option and
> this is really annoying for users of this functionality to understand
> how to do that correctly.
>
> Logan
On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas <[email protected]> wrote:
> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> > Hi Everyone,
> >
> > Here's v4 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.17-rc2. A git repo
> > is here:
> >
> > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> > ...
>
> > Logan Gunthorpe (14):
> > PCI/P2PDMA: Support peer-to-peer memory
> > PCI/P2PDMA: Add sysfs group to display p2pmem stats
> > PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> > PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> > docs-rst: Add a new directory for PCI documentation
> > PCI/P2PDMA: Add P2P DMA driver writer's documentation
> > block: Introduce PCI P2P flags for request and request queue
> > IB/core: Ensure we map P2P memory correctly in
> > rdma_rw_ctx_[init|destroy]()
> > nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> > nvme-pci: Add support for P2P memory in requests
> > nvme-pci: Add a quirk for a pseudo CMB
> > nvmet: Introduce helper functions to allocate and free request SGLs
> > nvmet-rdma: Use new SGL alloc/free helper for requests
> > nvmet: Optionally use PCI P2P memory
> >
> > Documentation/ABI/testing/sysfs-bus-pci | 25 +
> > Documentation/PCI/index.rst | 14 +
> > Documentation/driver-api/index.rst | 2 +-
> > Documentation/driver-api/pci/index.rst | 20 +
> > Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
> > Documentation/driver-api/{ => pci}/pci.rst | 0
> > Documentation/index.rst | 3 +-
> > block/blk-core.c | 3 +
> > drivers/infiniband/core/rw.c | 13 +-
> > drivers/nvme/host/core.c | 4 +
> > drivers/nvme/host/nvme.h | 8 +
> > drivers/nvme/host/pci.c | 118 +++--
> > drivers/nvme/target/configfs.c | 67 +++
> > drivers/nvme/target/core.c | 143 ++++-
> > drivers/nvme/target/io-cmd.c | 3 +
> > drivers/nvme/target/nvmet.h | 15 +
> > drivers/nvme/target/rdma.c | 22 +-
> > drivers/pci/Kconfig | 26 +
> > drivers/pci/Makefile | 1 +
> > drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
> > drivers/pci/pci.c | 6 +
> > include/linux/blk_types.h | 18 +-
> > include/linux/blkdev.h | 3 +
> > include/linux/memremap.h | 19 +
> > include/linux/pci-p2pdma.h | 118 +++++
> > include/linux/pci.h | 4 +
> > 26 files changed, 1579 insertions(+), 56 deletions(-)
> > create mode 100644 Documentation/PCI/index.rst
> > create mode 100644 Documentation/driver-api/pci/index.rst
> > create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> > rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> > create mode 100644 drivers/pci/p2pdma.c
> > create mode 100644 include/linux/pci-p2pdma.h
>
> How do you envision merging this? There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
>
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?
AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option. I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups, nor is there any sort of compromise, configure
the kernel for p2p or device assignment, not both. Is this really such
a unique feature that distro users aren't going to be asking for both
features? Thanks,
Alex
On 08/05/18 10:50 AM, Christian König wrote:
> E.g. transactions are initially sent to the root complex for
> translation, that's for sure. But at least for AMD GPUs the root complex
> answers with the translated address, which is then cached in the device.
>
> So further transactions for the same address range then go directly to
> the destination.
Sounds like you are referring to Address Translation Services (ATS).
This is quite separate from ACS and, to my knowledge, isn't widely
supported by switch hardware.
Logan
On 08/05/18 10:57 AM, Alex Williamson wrote:
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option. I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both. Is this really such
> a unique feature that distro users aren't going to be asking for both
> features? Thanks,
I think it is. But it sounds like the majority want this to be a command
line option. So we will look at doing that for v5.
Logan
On Tue, 8 May 2018 13:13:40 -0600
Logan Gunthorpe <[email protected]> wrote:
> On 08/05/18 10:50 AM, Christian König wrote:
> > E.g. transactions are initially sent to the root complex for
> > translation, that's for sure. But at least for AMD GPUs the root complex
> > answers with the translated address, which is then cached in the device.
> >
> > So further transactions for the same address range then go directly to
> > the destination.
>
> Sounds like you are referring to Address Translation Services (ATS).
> This is quite separate from ACS and, to my knowledge, isn't widely
> supported by switch hardware.
They are not so unrelated, see the ACS Direct Translated P2P
capability, which in fact must be implemented by switch downstream
ports implementing ACS and works specifically with ATS. This appears to
be the way the PCI SIG would intend for P2P to occur within an IOMMU
managed topology, routing pre-translated DMA directly between peer
devices while requiring non-translated requests to bounce through the
IOMMU. Really, what's the value of having an I/O virtual address space
provided by an IOMMU if we're going to allow physical DMA between
downstream devices, couldn't we just turn off the IOMMU altogether? Of
course ATS is not without holes itself, basically that we trust the
endpoint's implementation of ATS implicitly. Thanks,
Alex
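For the curious, whether a port advertises the Direct Translated P2P
capability Alex mentions can be probed from the ACS capability register.
A sketch (the helper name is made up, but PCI_ACS_CAP and PCI_ACS_DT are
existing definitions):

static bool port_has_acs_dt(struct pci_dev *pdev)
{
	int pos;
	u16 cap;

	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return false;

	/* Read the capability register, not the control register */
	pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
	return cap & PCI_ACS_DT;
}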
On 08/05/18 01:34 PM, Alex Williamson wrote:
> They are not so unrelated, see the ACS Direct Translated P2P
> capability, which in fact must be implemented by switch downstream
> ports implementing ACS and works specifically with ATS. This appears to
> be the way the PCI SIG would intend for P2P to occur within an IOMMU
> managed topology, routing pre-translated DMA directly between peer
> devices while requiring non-translated requests to bounce through the
> IOMMU. Really, what's the value of having an I/O virtual address space
> provided by an IOMMU if we're going to allow physical DMA between
> downstream devices, couldn't we just turn off the IOMMU altogether? Of
> course ATS is not without holes itself, basically that we trust the
> endpoint's implementation of ATS implicitly. Thanks,
I agree that this is what the SIG intends, but I don't think hardware
fully supports this methodology yet. The Direct Translated capability
just requires switches to forward packets that have the AT request type
set. It does not require them to do the translation or to support ATS
such that P2P requests can be translated by the IOMMU. I expect this is
so that a downstream device can implement ATS and not get messed up by
an upstream switch that doesn't support it.
Logan
On Tue, 8 May 2018 13:45:50 -0600
Logan Gunthorpe <[email protected]> wrote:
> On 08/05/18 01:34 PM, Alex Williamson wrote:
> > They are not so unrelated, see the ACS Direct Translated P2P
> > capability, which in fact must be implemented by switch downstream
> > ports implementing ACS and works specifically with ATS. This appears to
> > be the way the PCI SIG would intend for P2P to occur within an IOMMU
> > managed topology, routing pre-translated DMA directly between peer
> > devices while requiring non-translated requests to bounce through the
> > IOMMU. Really, what's the value of having an I/O virtual address space
> > provided by an IOMMU if we're going to allow physical DMA between
> > downstream devices, couldn't we just turn off the IOMMU altogether? Of
> > course ATS is not without holes itself, basically that we trust the
> > endpoint's implementation of ATS implicitly. Thanks,
>
> I agree that this is what the SIG intends, but I don't think hardware
> fully supports this methodology yet. The Direct Translated capability
> just requires switches to forward packets that have the AT request type
> set. It does not require them to do the translation or to support ATS
> such that P2P requests can be translated by the IOMMU. I expect this is
> so that an downstream device can implement ATS and not get messed up by
> an upstream switch that doesn't support it.
Well, I'm a bit confused, this patch series is specifically disabling
ACS on switches, but per the spec downstream switch ports implementing
ACS MUST implement direct translated P2P. So it seems the only
potential gap here is the endpoint, which must support ATS or else
there's nothing for direct translated P2P to do. The switch port plays
no part in the actual translation of the request, ATS on the endpoint
has already cached the translation and is now attempting to use it.
For the switch port, this only becomes a routing decision, the request
is already translated, therefore ACS RR and EC can be ignored to
perform "normal" (direct) routing, as if ACS were not present. It would
be a shame to go to all the trouble of creating this no-ACS mode to find
out the target hardware supports ATS and should have simply used it, or
we should have disabled the IOMMU altogether, which leaves ACS disabled.
Thanks,
Alex
On 08/05/18 02:13 PM, Alex Williamson wrote:
> Well, I'm a bit confused, this patch series is specifically disabling
> ACS on switches, but per the spec downstream switch ports implementing
> ACS MUST implement direct translated P2P. So it seems the only
> potential gap here is the endpoint, which must support ATS or else
> there's nothing for direct translated P2P to do. The switch port plays
> no part in the actual translation of the request, ATS on the endpoint
> has already cached the translation and is now attempting to use it.
> For the switch port, this only becomes a routing decision, the request
> is already translated, therefore ACS RR and EC can be ignored to
> perform "normal" (direct) routing, as if ACS were not present. It would
> be a shame to go to all the trouble of creating this no-ACS mode to find
> out the target hardware supports ATS and should have simply used it, or
> we should have disabled the IOMMU altogether, which leaves ACS disabled.
Ah, ok, I didn't think it was the endpoint that had to implement ATS.
But in that case, for our application, we need NVMe cards and RDMA NICs
to all have ATS support and I expect that is just as unlikely. At least
none of the endpoints on my system support it. Maybe only certain GPUs
have this support.
Logan
On Tue, 8 May 2018 14:19:05 -0600
Logan Gunthorpe <[email protected]> wrote:
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P. So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do. The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present. It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
>
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.
Yes, GPUs seem to be leading the pack in implementing ATS. So now the
dumb question, why not simply turn off the IOMMU and thus ACS? The
argument of using the IOMMU for security is rather diminished if we're
specifically enabling devices to poke one another directly and clearly
this isn't favorable for device assignment either. Are there target
systems where this is not a simple kernel commandline option? Thanks,
Alex
On 08/05/18 02:43 PM, Alex Williamson wrote:
> Yes, GPUs seem to be leading the pack in implementing ATS. So now the
> dumb question, why not simply turn off the IOMMU and thus ACS? The
> argument of using the IOMMU for security is rather diminished if we're
> specifically enabling devices to poke one another directly and clearly
> this isn't favorable for device assignment either. Are there target
> systems where this is not a simple kernel commandline option? Thanks,
Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
into some BIOSes that set the bits on boot (which is annoying).
I also don't expect people will respond well to making the IOMMU and P2P
exclusive. The IOMMU is often used for more than just security and on
many platforms it's enabled by default. I'd much rather allow IOMMU use
but have fewer isolation groups in much the same way as if you had PCI
bridges that didn't support ACS.
Logan
On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
>
>
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P. So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do. The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present. It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
>
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.
I think there is confusion here; Alex properly explained the scheme:
a PCIe device does an ATS request to the IOMMU, which returns a valid
translation for a virtual address. The device can then use that address
directly without going through the IOMMU for translation.
ATS is implemented by the IOMMU, not by the device (well, the device
implements the client side of it). Also, ATS is meaningless without
something like PASID as far as I know.
Cheers,
Jérôme
On 05/08/2018 10:44 AM, Stephen Bates wrote:
> Hi Dan
>
>> It seems unwieldy that this is a compile time option and not a runtime
>> option. Can't we have a kernel command line option to opt-in to this
>> behavior rather than require a wholly separate kernel image?
>
> I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
>
The choice is equally clear whether it is a kernel command-line option or a CONFIG option.
One does not have access to the kernel command-line without a few privileges.
A CONFIG option prevents a distribution from having a default, locked-down kernel _and_ the ability to be 'unlocked' if the customer/site is 'secure' via other means.
A run/boot-time option is more flexible and achieves the best of both.
>> Why is this text added in a follow on patch and not the patch that
>> introduced the config option?
>
> Because the ACS section was added later in the series and this information is associated with that additional functionality.
>
>> I'm also wondering if that command line option can take a 'bus device
>> function' address of a switch to limit the scope of where ACS is
>> disabled.
>
Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) between two endpoints.
I recommend doing so via a sysfs method.
That way, the system can limit the 'insecure' space between two devices, likely configured on a separate switch, from the rest of the still-secured/ACS-enabled PCIe tree.
PCIe is effectively point-to-point; maybe one would have multiple NICs/fabrics doing p2p to/from NVMe, but one could look at it as a list of pairs (nic1<->nvme1; nic2<->nvme2; ....).
A pair-listing would be optimal, allowing the kernel to figure out the ACS path, and avoiding an error-prone endpoint-switch-switch...-switch-endpoint entry format.
Additionally, systems that can (or prefer to) do p2p via an RP's IOMMU, which is not optimal but better than going all the way to/from memory, and which allows a security/IOVA check, can modify the point-to-point ACS algorithm to accommodate this over time (e.g., capability bits, be they hardware-defined or device-driver/extension/quirk-defined, for each bridge/RP in a PCI domain).
Kernels that never want to support P2P could be built without it enabled; the cmdline option is then moot.
Kernels built with it on *still* need the cmdline option, to be blunt that the kernel is enabling a feature that could render the entire (IO sub)system insecure.
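A minimal sketch of the pair-listing sysfs store described above (every name here, including the p2p_enable_path() helper that would clear ACS along the path between the pair, is hypothetical; this illustrates the idea and is not code from this series):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

/* Hypothetical helper: clear ACS redirect on every bridge between
 * the two endpoints, leaving the rest of the tree secured. */
int p2p_enable_path(struct pci_dev *ep1, struct pci_dev *ep2);

static ssize_t p2p_pair_store(struct device *dev,
                              struct device_attribute *attr,
                              const char *buf, size_t count)
{
        unsigned int d1, b1, s1, f1, d2, b2, s2, f2;
        struct pci_dev *ep1, *ep2;
        int ret = -ENODEV;

        /* expected input: "0000:03:00.0 0000:04:00.0" */
        if (sscanf(buf, "%x:%x:%x.%x %x:%x:%x.%x",
                   &d1, &b1, &s1, &f1, &d2, &b2, &s2, &f2) != 8)
                return -EINVAL;

        ep1 = pci_get_domain_bus_and_slot(d1, b1, PCI_DEVFN(s1, f1));
        ep2 = pci_get_domain_bus_and_slot(d2, b2, PCI_DEVFN(s2, f2));
        if (ep1 && ep2)
                ret = p2p_enable_path(ep1, ep2);

        pci_dev_put(ep2);
        pci_dev_put(ep1);
        return ret ? ret : count;
}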
> By this you mean the address for either an RP, DSP, USP or MF EP below which we disable ACS? We could do that but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.
>
As devices are added, they start in ACS-enabled, secured mode.
As a sysfs entry modifies p2p ability, the IOMMU group is modified as well.
BTW, IOMMU grouping is a host/HV control issue, not a VM control/knowledge issue.
So I don't understand the comments why VMs should need to know.
-- Configure p2p _before_ assigning devices to VMs. ... IOMMU groups are checked at assignment time.
-- So even on hot-add: the device starts in a separate IOMMU group; once p2p is enabled it becomes part of the same IOMMU group, and can then only be assigned to the same VM.
-- VMs don't know IOMMUs & ACS are involved now, and won't later, even if devices are dynamically added/removed.
Is there a thread I need to read up on to explain/clear up the thoughts above?
> Stephen
>
> [1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
>
>
On 05/08/2018 12:57 PM, Alex Williamson wrote:
> On Mon, 7 May 2018 18:23:46 -0500
> Bjorn Helgaas <[email protected]> wrote:
>
>> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
>>> Hi Everyone,
>>>
>>> Here's v4 of our series to introduce P2P based copy offload to NVMe
>>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
>>> is here:
>>>
>>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>>> ...
>>
>>> Logan Gunthorpe (14):
>>> PCI/P2PDMA: Support peer-to-peer memory
>>> PCI/P2PDMA: Add sysfs group to display p2pmem stats
>>> PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>>> PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>>> docs-rst: Add a new directory for PCI documentation
>>> PCI/P2PDMA: Add P2P DMA driver writer's documentation
>>> block: Introduce PCI P2P flags for request and request queue
>>> IB/core: Ensure we map P2P memory correctly in
>>> rdma_rw_ctx_[init|destroy]()
>>> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>>> nvme-pci: Add support for P2P memory in requests
>>> nvme-pci: Add a quirk for a pseudo CMB
>>> nvmet: Introduce helper functions to allocate and free request SGLs
>>> nvmet-rdma: Use new SGL alloc/free helper for requests
>>> nvmet: Optionally use PCI P2P memory
>>>
>>> Documentation/ABI/testing/sysfs-bus-pci | 25 +
>>> Documentation/PCI/index.rst | 14 +
>>> Documentation/driver-api/index.rst | 2 +-
>>> Documentation/driver-api/pci/index.rst | 20 +
>>> Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
>>> Documentation/driver-api/{ => pci}/pci.rst | 0
>>> Documentation/index.rst | 3 +-
>>> block/blk-core.c | 3 +
>>> drivers/infiniband/core/rw.c | 13 +-
>>> drivers/nvme/host/core.c | 4 +
>>> drivers/nvme/host/nvme.h | 8 +
>>> drivers/nvme/host/pci.c | 118 +++--
>>> drivers/nvme/target/configfs.c | 67 +++
>>> drivers/nvme/target/core.c | 143 ++++-
>>> drivers/nvme/target/io-cmd.c | 3 +
>>> drivers/nvme/target/nvmet.h | 15 +
>>> drivers/nvme/target/rdma.c | 22 +-
>>> drivers/pci/Kconfig | 26 +
>>> drivers/pci/Makefile | 1 +
>>> drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
>>> drivers/pci/pci.c | 6 +
>>> include/linux/blk_types.h | 18 +-
>>> include/linux/blkdev.h | 3 +
>>> include/linux/memremap.h | 19 +
>>> include/linux/pci-p2pdma.h | 118 +++++
>>> include/linux/pci.h | 4 +
>>> 26 files changed, 1579 insertions(+), 56 deletions(-)
>>> create mode 100644 Documentation/PCI/index.rst
>>> create mode 100644 Documentation/driver-api/pci/index.rst
>>> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>>> rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>>> create mode 100644 drivers/pci/p2pdma.c
>>> create mode 100644 include/linux/pci-p2pdma.h
>>
>> How do you envision merging this? There's a big chunk in drivers/pci, but
>> really no opportunity for conflicts there, and there's significant stuff in
>> block and nvme that I don't really want to merge.
>>
>> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
>> merge it elsewhere?
>
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option. I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
of the IOMMU groups; nor is there any sort of compromise: configure
> the kernel for p2p or device assignment, not both. Is this really such
> a unique feature that distro users aren't going to be asking for both
> features? Thanks,
>
> Alex
At least half the cases presented to me by existing customers want it in a tunable kernel,
and tunable between two points, if the hw allows it to be 'contained' in that manner, which
a (layer of) switch(ing) provides.
To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... I'm not enough of a configfs aficionado to say which is best)
method to make two points p2p DMA capable.
Worst case, the whole system is one large IOMMU group (the current mindset of this static or run-time config option);
best case (over time, more hw), a secure set of the primary system with p2p-enabled sections that are deemed 'safe' or 'self-inflicting-insecure',
the latter being the case of today's VM with an assigned device: it can scribble all over the VM, but not other VMs and not the host/HV.
Hi Don
>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
> That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
> I recommend doing so via a sysfs method.
Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> So I don't understand the comments why VMs should need to know.
As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
> Is there a thread I need to read up to explain /clear-up the thoughts above?
If you search for p2pdma you should find the previous discussions. Thanks for the input!
Stephen
On Tue, 8 May 2018 14:49:23 -0600
Logan Gunthorpe <[email protected]> wrote:
> On 08/05/18 02:43 PM, Alex Williamson wrote:
> > Yes, GPUs seem to be leading the pack in implementing ATS. So now the
> > dumb question, why not simply turn off the IOMMU and thus ACS? The
> > argument of using the IOMMU for security is rather diminished if we're
> > specifically enabling devices to poke one another directly and clearly
> > this isn't favorable for device assignment either. Are there target
> > systems where this is not a simple kernel commandline option? Thanks,
>
> Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
> into some BIOSes that set the bits on boot (which is annoying).
But it would be a much easier proposal to disable ACS when the IOMMU is
not enabled, ACS has no real purpose in that case.
> I also don't expect people will respond well to making the IOMMU and P2P
> exclusive. The IOMMU is often used for more than just security and on
> many platforms it's enabled by default. I'd much rather allow IOMMU use
> but have fewer isolation groups in much the same way as if you had PCI
> bridges that didn't support ACS.
The IOMMU and P2P are already not exclusive, we can bounce off the
IOMMU or make use of ATS as we've previously discussed. We were
previously talking about a build time config option that you didn't
expect distros to use, so I don't think intervention for the user to
disable the IOMMU if it's enabled by default is a serious concern
either.
What you're trying to do is enable direct peer-to-peer for endpoints
which do not support ATS when the IOMMU is enabled, which is not
something that necessarily makes sense to me. As I mentioned in a
previous reply, the IOMMU provides us with an I/O virtual address space
for devices, ACS is meant to fill the topology based gaps in that
virtual address space, making transactions follow IOMMU compliant
routing rules to avoid aliases between the IOVA and physical address
spaces. But this series specifically wants to leave those gaps open
for direct P2P access.
So we compromise the P2P aspect of security, still protecting RAM, but
potentially only to the extent that a device cannot hop through or
interfere with other devices to do its bidding. Device assignment is
mostly tossed out the window because not only are bigger groups more
difficult to deal with, the IOVA space is riddled with gaps, which is
not really a solved problem. So that leaves avoiding bounce buffers as
the remaining IOMMU feature, but we're dealing with native express
devices and relatively high end devices that are probably installed in
modern systems, so that seems like a non-issue.
Are there other uses I'm forgetting? We can enable interrupt remapping
separate from DMA translation, so we can exclude that one. I'm still
not seeing why it's terribly undesirable to require devices to support
ATS if they want to do direct P2P with an IOMMU enabled. Thanks,
Alex
Hi Jerome
> I think there is confusion here; Alex properly explained the scheme:
> a PCIe device does an ATS request to the IOMMU, which returns a valid
> translation for a virtual address. The device can then use that address
> directly without going through the IOMMU for translation.
This makes sense and to be honest I now understand ATS and its interaction with ACS a lot better than I did 24 hours ago ;-).
> ATS is implemented by the IOMMU, not by the device (well, the device
> implements the client side of it). Also, ATS is meaningless without
> something like PASID as far as I know.
I think it's the client side that is important to us. Not many EPs support ATS today and it's not clear if many will in the future. So, assuming we want to do p2pdma between devices (some of) which do NOT support ATS, how best do we handle the ACS issue? Disabling the IOMMU seems a bit strong to me given this impacts all the PCI domains in the system and not just the domain we wish to do P2P on.
Stephen
On Tue, 8 May 2018 17:25:24 -0400
Don Dutile <[email protected]> wrote:
> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas <[email protected]> wrote:
> >
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> >>> [snip cover letter and diffstat, quoted in full earlier in the thread]
> >>
> >> How do you envision merging this? There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?
> >
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros optimized
> > for this sort of support would enable that build option. I'm more than
> > a little dubious though that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups; nor is there any sort of compromise: configure
> > the kernel for p2p or device assignment, not both. Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features? Thanks,
> >
> > Alex
> At least half the cases presented to me by existing customers want it in a tunable kernel,
> and tunable between two points, if the hw allows it to be 'contained' in that manner, which
> a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... I'm not enough of a configfs aficionado to say which is best)
> method to make two points p2p DMA capable.
That's not what's done here AIUI. There are also some complications to
making IOMMU groups dynamic, for instance could a downstream endpoint
already be in use by a userspace tool as ACS is being twiddled in
sysfs? Probably the easiest solution would be that all devices
affected by the ACS change are soft unplugged before and re-added after
the ACS change. Note that "affected" is not necessarily only the
downstream devices if the downstream port at which we're playing with
ACS is part of a multifunction device. Thanks,
Alex
Hi Alex
> But it would be a much easier proposal to disable ACS when the IOMMU is
> not enabled, ACS has no real purpose in that case.
I guess one issue I have with this is that it disables IOMMU groups for all Root Ports and not just the one(s) we wish to do p2pdma on.
> The IOMMU and P2P are already not exclusive, we can bounce off the
> IOMMU or make use of ATS as we've previously discussed. We were
> previously talking about a build time config option that you didn't
> expect distros to use, so I don't think intervention for the user to
> disable the IOMMU if it's enabled by default is a serious concern
> either.
ATS definitely makes things more interesting for the cases where the EPs support it. However I don't really have a handle on how common ATS support is going to be in the kinds of devices we have been focused on (NVMe SSDs and RDMA NICs mostly).
> What you're trying to do is enabled direct peer-to-peer for endpoints
> which do not support ATS when the IOMMU is enabled, which is not
> something that necessarily makes sense to me.
As above, the advantage of leaving the IOMMU on is that it allows for both p2pdma PCI domains and IOMMU-grouping PCI domains in the same system. It is just that these domains will be separate from each other.
> So that leaves avoiding bounce buffers as the remaining IOMMU feature
I agree with you here that the devices we will want to use for p2p will probably not require a bounce buffer and will support 64 bit DMA addressing.
> I'm still not seeing why it's terribly undesirable to require devices to support
> ATS if they want to do direct P2P with an IOMMU enabled.
I think the one reason is for the use-case above. Allowing IOMMU groupings on one domain and p2pdma on another domain....
Stephen
On Tue, 8 May 2018 21:42:27 +0000
"Stephen Bates" <[email protected]> wrote:
> Hi Alex
>
> > But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.
>
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on.
But as I understand this series, we're not really targeting specific
sets of devices either. It's more of a shotgun approach that we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.
> > The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed. We were
> > previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.
>
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly).
>
> > What you're trying to do is enable direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.
>
> As above, the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU-grouping PCI domains in the same
> system. It is just that these domains will be separate from each other.
That argument makes sense if we had the ability to select specific sets
of devices, but that's not the case here, right? With the shotgun
approach, we're clearly favoring one at the expense of the other and
it's not clear why we don't simply force the needle all the way in that
direction such that the results are at least predictable.
> > So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature
>
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing.
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.
>
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain....
If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment) then the build-time option
falls to pieces, we need a single kernel that can do both. I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation. Thanks,
Alex
On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both. I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation. Thanks,
Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.
Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.
A user then needs to either disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular but
presumably the person adding the command line argument understands this.
We may also want to do some work so that there's informative dmesgs on
which BDFs need to be specified on the command line so it's not so
difficult for the user to figure out.
Logan
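A rough sketch of the bridge walk such a check implies, assuming a climb from the endpoint toward the root (the helper below is illustrative and not the series' actual code; a real pci_p2pdma_distance() would stop at the common upstream bridge rather than the root, and the kernel's pci_acs_enabled()/pci_acs_path_enabled() would do the per-bridge test):

#include <linux/pci.h>

/*
 * Report whether any bridge on the upstream path still has ACS
 * Request Redirect or Egress Control enabled, which would bounce
 * P2P TLPs up to the root complex instead of routing them
 * directly. Sketch only.
 */
static bool acs_redirect_on_path(struct pci_dev *ep)
{
        struct pci_dev *bridge = pci_upstream_bridge(ep);

        while (bridge) {
                if (pci_acs_enabled(bridge, PCI_ACS_RR | PCI_ACS_EC))
                        return true;
                bridge = pci_upstream_bridge(bridge);
        }
        return false;
}

pci_p2pdma_distance() would then return -1 if either client's path reports a redirect.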
On 05/08/2018 06:03 PM, Alex Williamson wrote:
> [snip exchange quoted in full above]
>
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both. I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation. Thanks,
>
> Alex
+1/ack
RDMA VFs lend themselves to NVMe-oF w/device-assignment.... need a way to
put NVMe 'resources' into an assignable/manageable object for 'IOMMU-grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
I concur that this seems to be where the conversation is taking us.
@Alex - Before we go do this, can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....
Thanks
Stephen
On Tue, 8 May 2018 16:10:19 -0600
Logan Gunthorpe <[email protected]> wrote:
> On 08/05/18 04:03 PM, Alex Williamson wrote:
> > If IOMMU grouping implies device assignment (because nobody else uses
> > it to the same extent as device assignment) then the build-time option
> > falls to pieces, we need a single kernel that can do both. I think we
> > need to get more clever about allowing the user to specify exactly at
> > which points in the topology they want to disable isolation. Thanks,
>
>
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
>
> Then, the pci_p2pdma_distance command needs to check that ACS is
> disabled for all bridges between the two devices. If this is not the
> case, it returns -1. Future work can check if the EP has ATS support, in
> which case it has to check for the ACS direct translated bit.
>
> A user then needs to either disable the IOMMU and/or add the command
> line option to disable ACS for the specific downstream ports in the PCI
> hierarchy. This means the IOMMU groups will be less granular but
> presumably the person adding the command line argument understands this.
>
> We may also want to do some work so that there's informative dmesgs on
> which BDFs need to be specified on the command line so it's not so
> difficult for the user to figure out.
I'd advise caution with a user supplied BDF approach, we have no
guaranteed persistence for a device's PCI address. Adding a device
might renumber the buses, replacing a device with one that consumes
more/less bus numbers can renumber the buses, motherboard firmware
updates could renumber the buses, pci=assign-buses can renumber the
buses, etc. This is why the VT-d spec makes use of device paths when
describing PCI hierarchies, firmware can't know what bus number will be
assigned to a device, but it does know the base bus number and the path
of devfns needed to get to it. I don't know how we come up with an
option that's easy enough for a user to understand, but reasonably
robust against hardware changes. Thanks,
Alex
On Tue, May 8, 2018 at 3:32 PM, Alex Williamson
<[email protected]> wrote:
> On Tue, 8 May 2018 16:10:19 -0600
> Logan Gunthorpe <[email protected]> wrote:
>
>> On 08/05/18 04:03 PM, Alex Williamson wrote:
>> > If IOMMU grouping implies device assignment (because nobody else uses
>> > it to the same extent as device assignment) then the build-time option
>> > falls to pieces, we need a single kernel that can do both. I think we
>> > need to get more clever about allowing the user to specify exactly at
>> > which points in the topology they want to disable isolation. Thanks,
>>
>>
>> Yeah, so based on the discussion I'm leaning toward just having a
>> command line option that takes a list of BDFs and disables ACS for them.
>> (Essentially as Dan has suggested.) This avoids the shotgun.
>>
>> Then, the pci_p2pdma_distance command needs to check that ACS is
>> disabled for all bridges between the two devices. If this is not the
>> case, it returns -1. Future work can check if the EP has ATS support, in
>> which case it has to check for the ACS direct translated bit.
>>
>> A user then needs to either disable the IOMMU and/or add the command
>> line option to disable ACS for the specific downstream ports in the PCI
>> hierarchy. This means the IOMMU groups will be less granular but
>> presumably the person adding the command line argument understands this.
>>
>> We may also want to do some work so that there's informative dmesgs on
>> which BDFs need to be specified on the command line so it's not so
>> difficult for the user to figure out.
>
> I'd advise caution with a user supplied BDF approach, we have no
> guaranteed persistence for a device's PCI address. Adding a device
> might renumber the buses, replacing a device with one that consumes
> more/less bus numbers can renumber the buses, motherboard firmware
> updates could renumber the buses, pci=assign-buses can renumber the
> buses, etc. This is why the VT-d spec makes use of device paths when
> describing PCI hierarchies, firmware can't know what bus number will be
> assigned to a device, but it does know the base bus number and the path
> of devfns needed to get to it. I don't know how we come up with an
> option that's easy enough for a user to understand, but reasonably
> robust against hardware changes. Thanks,
True, but at the same time this feature is for "users with custom
hardware designed for purpose"; I assume they would be willing to take
on the bus renumbering risk. It's already the case that
/sys/bus/pci/drivers/<x>/bind takes a BDF, which is why it seemed
natural to make a similar interface for the command line. Ideally we
could later get something into ACPI or other platform firmware to
arrange for bridges to disable ACS by default if we see p2p becoming a
common off-the-shelf feature, i.e. a BIOS switch to enable p2p in a
given PCIe sub-domain.
On 05/08/2018 05:27 PM, Stephen Bates wrote:
> Hi Don
>
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>> That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>> I recommend doing so via a sysfs method.
>
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
>
>
>> So I don't understand the comments why VMs should need to know.
>
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
Alex:
Really? IOMMU groups are created by the kernel, so I don't know how they would be passed into the VMs, unless indirectly via PCI(e) layout.
At best, twiddling w/ACS enablement (emulation) would cause VMs to see different IOMMU groups, but again, VMs are not the security point/level; the host/HVs are.
>
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
>
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
>
> Stephen
>
>
>
On Tue, 8 May 2018 22:25:06 +0000
"Stephen Bates" <[email protected]> wrote:
> > Yeah, so based on the discussion I'm leaning toward just having a
> > command line option that takes a list of BDFs and disables ACS
> > for them. (Essentially as Dan has suggested.) This avoids the
> > shotgun.
>
> I concur that this seems to be where the conversation is taking us.
>
> @Alex - Before we go do this can you provide input on the approach? I
> don't want to re-spin only to find we are still not converging on the
> ACS issue....
I can envision numerous implementation details that makes this less
trivial than it sounds, but it seems like the thing we need to decide
first is if intentionally leaving windows between devices with the
intention of exploiting them for direct P2P DMA in an otherwise IOMMU
managed address space is something we want to do. From a security
perspective, we already handle this with IOMMU groups because many
devices do not support ACS, the new thing is embracing this rather than
working around it. It makes me a little twitchy, but so long as the
IOMMU groups match the expected worst case routing between devices,
it's really no different than if we could wipe the ACS capability from
the device.
On to the implementation details... I already mentioned the BDF issue
in my other reply. If we had a way to persistently identify a device,
would we specify the downstream points at which we want to disable ACS
or the endpoints that we want to connect? The latter has a problem
that the grouping upstream of an endpoint is already set by the time we
discover the endpoint, so we might need to unwind to get the grouping
correct. The former might be more difficult for users to find the
necessary nodes, but easier for the kernel to deal with during
discovery. A runtime, sysfs approach has some benefits here,
especially in identifying the device assuming we're ok with leaving
the persistence problem to userspace tools. I'm still a little fond of
the idea of exposing an acs_flags attribute for devices in sysfs where
a write would do a soft unplug and re-add of all affected devices to
automatically recreate the proper grouping. Any dynamic change in
routing and grouping would require all DMA be re-established anyway and
a soft hotplug seems like an elegant way of handling it. Thanks,
Alex
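To make the soft unplug/re-add idea concrete, here is a very rough sketch of what such an acs_flags write could do (attribute registration is omitted and the flow is simplified; this is an illustration of the idea, not a proposed patch):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static ssize_t acs_flags_store(struct device *dev,
                               struct device_attribute *attr,
                               const char *buf, size_t count)
{
        struct pci_dev *bridge = to_pci_dev(dev);
        struct pci_dev *child, *tmp;
        int pos;
        u16 ctrl;

        if (kstrtou16(buf, 0, &ctrl))
                return -EINVAL;

        pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
        if (!pos || !bridge->subordinate)
                return -ENODEV;

        pci_lock_rescan_remove();

        /* soft-unplug everything below so the groups can be rebuilt */
        list_for_each_entry_safe(child, tmp,
                                 &bridge->subordinate->devices, bus_list)
                pci_stop_and_remove_bus_device(child);

        pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);

        /* re-add; devices re-enumerate into the new grouping */
        pci_rescan_bus(bridge->subordinate);
        pci_unlock_rescan_remove();

        return count;
}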
On 08/05/18 05:00 PM, Dan Williams wrote:
>> I'd advise caution with a user supplied BDF approach, we have no
>> guaranteed persistence for a device's PCI address. Adding a device
>> might renumber the buses, replacing a device with one that consumes
>> more/less bus numbers can renumber the buses, motherboard firmware
>> updates could renumber the buses, pci=assign-buses can renumber the
>> buses, etc. This is why the VT-d spec makes use of device paths when
>> describing PCI hierarchies, firmware can't know what bus number will be
>> assigned to a device, but it does know the base bus number and the path
>> of devfns needed to get to it. I don't know how we come up with an
>> option that's easy enough for a user to understand, but reasonably
>> robust against hardware changes. Thanks,
>
> True, but at the same time this feature is for "users with custom
> hardware designed for purpose"; I assume they would be willing to take
> on the bus renumbering risk. It's already the case that
> /sys/bus/pci/drivers/<x>/bind takes a BDF, which is why it seemed
> natural to make a similar interface for the command line. Ideally we
> could later get something into ACPI or other platform firmware to
> arrange for bridges to disable ACS by default if we see p2p becoming a
> common off-the-shelf feature, i.e. a BIOS switch to enable p2p in a
> given PCIe sub-domain.
Yeah, I'm having a hard time coming up with an easy enough solution for
the user. I agree with Dan though, the bus renumbering risk would be
fairly low in the custom hardware seeing as the switches are likely going
to be directly soldered to the same board as the CPU.
That being said, I suppose we could allow the command line to take either
a BDF or a BaseBus/DF/DF/DF path. Though, implementing this sounds like
a bit of a challenge.
Logan
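For illustration, parsing that BaseBus/DF/DF/DF form might look roughly like this (the format and function name are hypothetical; domain parsing and error reporting are elided):

#include <linux/pci.h>
#include <linux/string.h>

static struct pci_dev *p2p_parse_path(const char *p)
{
        unsigned int bus, slot, fn;
        struct pci_dev *dev;

        /* first component names a device directly: "<bus>:<dev>.<fn>" */
        if (sscanf(p, "%x:%x.%x", &bus, &slot, &fn) != 3)
                return NULL;
        dev = pci_get_domain_bus_and_slot(0, bus, PCI_DEVFN(slot, fn));

        /* each "/<dev>.<fn>" component walks one level further down */
        while (dev && (p = strchr(p, '/'))) {
                struct pci_bus *child = dev->subordinate;

                pci_dev_put(dev);
                if (!child || sscanf(++p, "%x.%x", &slot, &fn) != 2)
                        return NULL;
                dev = pci_get_slot(child, PCI_DEVFN(slot, fn));
        }
        return dev;
}

Unlike a plain BDF, only the base bus number here is at the mercy of renumbering; the devfn path below it is fixed by the physical topology, which is the property the VT-d device-path scheme relies on.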
On 08/05/18 05:11 PM, Alex Williamson wrote:
> On to the implementation details... I already mentioned the BDF issue
> in my other reply. If we had a way to persistently identify a device,
> would we specify the downstream points at which we want to disable ACS
> or the endpoints that we want to connect? The latter has a problem
> that the grouping upstream of an endpoint is already set by the time we
> discover the endpoint, so we might need to unwind to get the grouping
> correct. The former might be more difficult for users to find the
> necessary nodes, but easier for the kernel to deal with during
> discovery.
I was envisioning the former, with the kernel helping by printing a dmesg
in certain circumstances to help with figuring out which devices need to
be specified. Specifying a list of endpoints on the command line and
having the kernel try to figure out which downstream ports need to be
adjusted while we are in the middle of enumerating the bus is, like you
said, a nightmare.
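As a sketch of the kind of hint meant here (the message text and the point at which it would be emitted are illustrative only):

#include <linux/pci.h>

/*
 * Hypothetical: called when a P2P distance check fails because a
 * bridge still has ACS redirect enabled, telling the user which
 * BDF to put on the proposed command line option.
 */
static void p2pdma_acs_hint(struct pci_dev *bridge)
{
        pci_warn(bridge,
                 "ACS redirect enabled; add %s to the ACS disable list to allow P2P DMA through this bridge\n",
                 pci_name(bridge));
}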
> A runtime, sysfs approach has some benefits here,
> especially in identifying the device assuming we're ok with leaving
> the persistence problem to userspace tools. I'm still a little fond of
> the idea of exposing an acs_flags attribute for devices in sysfs where
> a write would do a soft unplug and re-add of all affected devices to
> automatically recreate the proper grouping. Any dynamic change in
> routing and grouping would require all DMA be re-established anyway and
> a soft hotplug seems like an elegant way of handling it. Thanks,
This approach sounds like it has a lot more issues to contend with:
For starters, a soft unplug/re-add of all the devices behind a switch is
going to be difficult if a lot of those devices have had drivers
installed and their respective resources are now mounted or otherwise in
use.
Then, do we have to redo the soft-replace every time we change the ACS
bit for every downstream port? That could mean you have to do dozens of
soft-replaces before you have all the ACS bits set, which means you have
a storm of drivers being added and removed.
This would require some kind of fancy custom setup software that runs at
just the right time in the boot sequence, or a lot of work on the user's
part to unbind all the resources, set up the ACS bits and then rebind
everything (assuming the soft re-add doesn't rebind it every time you
adjust one ACS bit). Ugly.
IMO, if we need to do the sysfs approach then we need to be able to
adjust the groups dynamically in a sensible way and not through the
large hammer that is soft-replaces. I think this would be great but I
don't think we will be tackling that with this patch set.
Logan
On Tue, 8 May 2018 19:06:17 -0400
Don Dutile <[email protected]> wrote:
> On 05/08/2018 05:27 PM, Stephen Bates wrote:
> > As I understand it VMs need to know because VFIO passes IOMMU
> > grouping up into the VMs. So if an IOMMU grouping changes, the VM's
> > view of its PCIe topology changes. I think we even have to be
> > cognizant of the fact the OS running on the VM may not even support
> > hot-plug of PCI devices.
> Alex:
> Really? IOMMU groups are created by the kernel, so I don't know how
> they would be passed into the VMs, unless indirectly via PCI(e)
> layout. At best, twiddling w/ACS enablement (emulation) would cause
> VMs to see different IOMMU groups, but again, VMs are not the
> security point/level; the host/HVs are.
Correct, the VM has no concept of the host's IOMMU groups, only the
hypervisor knows about the groups, but really only to the extent of
which device belongs to which group and whether the group is viable.
Any runtime change to grouping though would require DMA mapping
updates, which I don't see how we can reasonably do with drivers,
vfio-pci or native host drivers, bound to the affected devices. Thanks,
Alex
On Tue, 8 May 2018 17:31:48 -0600
Logan Gunthorpe <[email protected]> wrote:
> On 08/05/18 05:11 PM, Alex Williamson wrote:
> > A runtime, sysfs approach has some benefits here,
> > especially in identifying the device assuming we're ok with leaving
> > the persistence problem to userspace tools. I'm still a little fond of
> > the idea of exposing an acs_flags attribute for devices in sysfs where
> > a write would do a soft unplug and re-add of all affected devices to
> > automatically recreate the proper grouping. Any dynamic change in
> > routing and grouping would require all DMA be re-established anyway and
> > a soft hotplug seems like an elegant way of handling it. Thanks,
>
> This approach sounds like it has a lot more issues to contend with:
>
> For starters, a soft unplug/re-add of all the devices behind a switch is
> going to be difficult if a lot of those devices have had drivers
> installed and their respective resources are now mounted or otherwise in
> use.
>
> Then, do we have to redo the soft-replace every time we change the ACS
> bit for every downstream port? That could mean you have to do dozens of
> soft-replaces before you have all the ACS bits set, which means you have
> a storm of drivers being added and removed.
True, anything requiring tweaking multiple downstream ports would
induce a hot-unplug/replug for each. A better sysfs interface would
allow multiple downstream ports to be updated in a single shot.
> This would require some kind of fancy custom setup software that runs at
> just the right time in the boot sequence, or a lot of work on the user's
> part to unbind all the resources, set up the ACS bits and then rebind
> everything (assuming the soft re-add doesn't rebind it every time you
> adjust one ACS bit). Ugly.
>
> IMO, if we need to do the sysfs approach then we need to be able to
> adjust the groups dynamically in a sensible way and not through the
> large hammer that is soft-replaces. I think this would be great but I
> don't think we will be tackling that with this patch set.
OTOH, I think the only sensible way to dynamically adjust groups is
through hotplug, we cannot have running drivers attached to downstream
endpoints as we're adjusting the routing. Thanks,
Alex
Hi Alex and Don
> Correct, the VM has no concept of the host's IOMMU groups, only the
> hypervisor knows about the groups,
But as I understand it these groups are usually passed through to VMs on a per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is not aware of IOMMU groupings, but it is impacted by them in the sense that if the groupings change, the PCI topology presented to the VM needs to change too.
Stephen
Hi Logan
> Yeah, I'm having a hard time coming up with an easy enough solution for
> the user. I agree with Dan though, the bus renumbering risk would be
> fairly low in the custom hardware seeing as the switches are likely going
> to be directly soldered to the same board as the CPU.
I am afraid that soldered-down assumption may not be valid. More and more PCIe cards with PCIe switches on them are becoming available and people are using these to connect servers to arrays of NVMe SSDs, which may make the topology more dynamic.
Stephen
Hi Don
> RDMA VFs lend themselves to NVMe-oF w/device-assignment.... need a way to
> put NVMe 'resources' into an assignable/manageable object for 'IOMMU-grouping',
> which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
Ha, I like your term "DMA Security Domain", which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of a hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID, though PCI-SIG are having some discussions on extending ACS. That's a long-term solution and won't be applicable to us for some time.
NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs will support SR-IOV. That will probably be a pretty high-end feature...
Stephen
Jerome and Christian
> I think there is confusion here; Alex properly explained the scheme:
> a PCIe device does an ATS request to the IOMMU, which returns a valid
> translation for a virtual address. The device can then use that address
> directly without going through the IOMMU for translation.
So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an ATS-translated TLP is still impacted by ACS, though it has a separate control knob for translated address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS, a P2P DMA will still be routed to the associated RP of the domain and down again unless we enable ACS Direct Translated P2P on all bridges between the two devices involved in the P2P DMA.
So we still don't get fine-grained control with ATS and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with the AT field set vs not set.
> Also, ATS is meaningless without something like PASID as far as I know.
ATS is still somewhat valuable without PASID in the sense you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.
Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between root ports we had a few weeks ago...
Stephen
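Following that reading of 7.7.7.2, the extra check for ATS-capable endpoints might look like this sketch (hedged: it assumes pci_acs_enabled() reports the Direct Translated P2P enable bit the same way as the other ACS control bits):

#include <linux/pci.h>

/*
 * Even translated TLPs are redirected unless every bridge on the
 * path has ACS Direct Translated P2P enabled, so a distance check
 * for ATS-capable endpoints would need something like this on top
 * of the plain RR/EC test. Illustrative only.
 */
static bool path_allows_translated_p2p(struct pci_dev *ep)
{
        struct pci_dev *bridge = pci_upstream_bridge(ep);

        while (bridge) {
                if (!pci_acs_enabled(bridge, PCI_ACS_DT))
                        return false;
                bridge = pci_upstream_bridge(bridge);
        }
        return true;
}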
On 09/05/2018 03:12 PM, Stephen Bates wrote:
> Jerome and Christian
>
>> I think there is confusion here; Alex properly explained the scheme:
>> a PCIe device does an ATS request to the IOMMU, which returns a valid
>> translation for a virtual address. The device can then use that address
>> directly without going through the IOMMU for translation.
> So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an ATS-translated TLP is still impacted by ACS, though it has a separate control knob for translated address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS, a P2P DMA will still be routed to the associated RP of the domain and down again unless we enable ACS Direct Translated P2P on all bridges between the two devices involved in the P2P DMA.
>
> So we still don't get fine-grained control with ATS and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with the AT field set vs not set.
Still need to double check the specification (had a busy morning today),
but that sounds about correct.
The key takeaway is that when any device has ATS enabled you can't
disable ACS without breaking it (even if you unplug and replug it).
>> Also, ATS is meaningless without something like PASID as far as I know.
>
> ATS is still somewhat valuable without PASID in the sense you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.
>
> Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between root ports we had a few weeks ago...
Interesting point, give me a moment to check that. That finally makes
all the hardware I have standing around here valuable :)
Christian.
>
> Stephen
>
On Wed, 9 May 2018 12:35:56 +0000
"Stephen Bates" <[email protected]> wrote:
> Hi Alex and Don
>
> > Correct, the VM has no concept of the host's IOMMU groups, only
> > the hypervisor knows about the groups,
>
> But as I understand it these groups are usually passed through to VMs
> on a per-group basis by the hypervisor? So IOMMU group 1 might be
> passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is
> not aware of IOMMU groupings but it is impacted by them in the sense
> that if the groupings change the PCI topology presented to the VM
> needs to change too.
Hypervisors don't currently expose any topology based on the grouping,
the only case where such a concept even makes sense is when a vIOMMU is
present as devices within the same group cannot have separate address
spaces. Our options for exposing such information are also limited; our
only real option would seem to be placing devices within the same group
together on a conventional PCI bus to denote the address space
granularity. Currently we strongly recommend singleton groups for this
case and leave any additional configuration constraints to the admin.
The case you note of a group passed to VM A and another passed to VM B
is exactly an example of why any sort of dynamic routing change needs to
have the groups fully released, such as via hot-unplug. For instance,
a routing change at a shared node above groups 1 & 2 could result in
the merging of these groups and there is absolutely no way to handle
that with portions of the group being owned by two separate VMs after
the merge. Thanks,
Alex
Christian
> Interesting point, give me a moment to check that. That finally makes
> all the hardware I have standing around here valuable :)
Yes. At the very least it provides an initial standards-based path for P2P DMAs across RPs, which is something we have discussed on this list in the past as being desirable.
BTW I am trying to understand how an ATS-capable EP function determines when to perform an ATS Translation Request (ATS TR). Is there an upstream example of the driver for your APU that uses ATS? If so, can you provide a pointer to it? Do you provide some type of entry in the submission queues for commands going to the APU to indicate if the address associated with a specific command should be translated using ATS or not? Or do you simply enable ATS and then all addresses passed to your APU that miss the local cache result in an ATS TR?
Your feedback would be useful as I initiate discussions within the NVMe community on where we might go with ATS...
Thanks
Stephen
On 05/08/2018 08:01 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 19:06:17 -0400
> Don Dutile <[email protected]> wrote:
>> On 05/08/2018 05:27 PM, Stephen Bates wrote:
>>> As I understand it VMs need to know because VFIO passes IOMMU
>>> grouping up into the VMs. So if an IOMMU grouping changes, the VM's
>>> view of its PCIe topology changes. I think we even have to be
>>> cognizant of the fact that the OS running on the VM may not even
>>> support hot-plug of PCI devices.
>> Alex:
>> Really? IOMMU groups are created by the kernel, so I don't know how
>> they would be passed into the VMs, unless indirectly via PCI(e)
>> layout. At best, twiddling w/ACS enablement (emulation) would cause
>> VMs to see different IOMMU groups, but again, VMs are not the
>> security point/level, the host/HVs are.
>
> Correct, the VM has no concept of the host's IOMMU groups, only the
> hypervisor knows about the groups, but really only to the extent of
> which device belongs to which group and whether the group is viable.
> Any runtime change to grouping though would require DMA mapping
> updates, which I don't see how we can reasonably do with drivers,
> vfio-pci or native host drivers, bound to the affected devices. Thanks,
>
> Alex
>
A change in IOMMU groups would/could require a device remove/add cycle to get an updated DMA mapping (yet another overused term: IOMMU 'domain').
On 05/09/2018 10:44 AM, Alex Williamson wrote:
> On Wed, 9 May 2018 12:35:56 +0000
> "Stephen Bates" <[email protected]> wrote:
>
>> Hi Alex and Don
>>
>>> Correct, the VM has no concept of the host's IOMMU groups, only
>>> the hypervisor knows about the groups,
>>
>> But as I understand it these groups are usually passed through to VMs
>> on a per-group basis by the hypervisor? So IOMMU group 1 might be
>> passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is
>> not aware of IOMMU groupings but it is impacted by them in the sense
>> that if the groupings change the PCI topology presented to the VM
>> needs to change too.
>
> Hypervisors don't currently expose any topology based on the grouping;
> the only case where such a concept even makes sense is when a vIOMMU is
> present, as devices within the same group cannot have separate address
> spaces. Our options for exposing such information are also limited; our
> only real option would seem to be placing devices within the same group
> together on a conventional PCI bus to denote the address space
> granularity. Currently we strongly recommend singleton groups for this
> case and leave any additional configuration constraints to the admin.
>
> The case you note of a group passed to VM A and another passed to VM B
> is exactly an example of why any sort of dynamic routing change needs to
> have the groups fully released, such as via hot-unplug. For instance,
> a routing change at a shared node above groups 1 & 2 could result in
> the merging of these groups and there is absolutely no way to handle
> that with portions of the group being owned by two separate VMs after
> the merge. Thanks,
>
> Alex
>
The above is why I stated the host/HV has to do p2p setup *before* device-assignment
is done.
Now, that could be done at boot time (with a mod.conf-like config in host/HV, before VM startup)
as well.
Doing it dynamically, if such a feature is needed, requires a hot-unplug/plug cycle, as Alex states.
On 05/08/2018 05:27 PM, Stephen Bates wrote:
> Hi Don
>
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>> That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) between two endpoints.
>> I recommend doing so via a sysfs method.
>
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
>
>
>> So I don't understand the comments why VMs should need to know.
>
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact that the OS running on the VM may not even support hot-plug of PCI devices.
>
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
>
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
>
under linux-pci I'm assuming...
you cc'd a number of upstream lists; I picked this thread up via rdma-list.
> Stephen
>
>
>
On 05/09/2018 08:44 AM, Stephen Bates wrote:
> Hi Don
>
>> RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>> put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>> which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
>
> Ha, I like your term "DMA Security Domain", which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of a hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter TLPs by address or ID, though PCI-SIG is having some discussions on extending ACS. That's a long-term solution and won't be applicable to us for some time.
>
> NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs will support SR-IOV. That will probably be a pretty high-end feature...
>
> Stephen
>
>
>
Sure, we could provide insecure enablement for development and kick-the-tires deployment ..
device-assignment started that way (no ACS, no intr-remapping, etc.), but for secure setups,
VFs for both p2p EPs is the best security model.
So, we should have a design goal for the secure configuration.
Workarounds/insecure modes to deal with the near-term what-we-have-to-work-with can be employed, but they shouldn't be
the only/de facto/final solution.
On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen Bates wrote:
> Christian
>
> > Interesting point, give me a moment to check that. That finally makes
> > all the hardware I have standing around here valuable :)
>
> Yes. At the very least it provides an initial standards-based path
> for P2P DMAs across RPs, which is something we have discussed on this
> list in the past as being desirable.
>
> BTW I am trying to understand how an ATS-capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it? Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in an ATS TR?
On GPUs, ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).
GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point GPU virtual addresses
to use GPU memory or system memory. Those system memory entries can
also be marked as ATS against a given PASID.
On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).
Jérôme
Hi Jerome
> Now inside that page table you can point GPU virtual addresses
> to use GPU memory or system memory. Those system memory entries can
> also be marked as ATS against a given PASID.
Thanks. This all makes sense.
But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
Stephen
On 09/05/18 07:40 AM, Christian König wrote:
> The key takeaway is that when any device has ATS enabled you can't
> disable ACS without breaking it (even if you unplug and replug it).
I don't follow how you came to this conclusion...
The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should send them upstream regardless of the state
of the ACS Req/Comp redirect bits.
Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).
I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.
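To put that in code: the bits in question are the standard ACS control
bits, so the disable would look roughly like this (a sketch using the
generic config accessors and the PCI_ACS_* flags, not the exact hunk
from this series):

        static void sketch_clear_acs_redir(struct pci_dev *bridge)
        {
                int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
                u16 ctrl;

                if (!pos)
                        return; /* no ACS capability on this port */

                pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
                /* stop redirecting peer-addressed TLPs to the RC */
                ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
                pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
        }

The ATS translation requests never hit these bits because, per the
above, they are addressed upstream to begin with.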
Logan
On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen Bates wrote:
> Hi Jerome
>
> > Now inside that page table you can point GPU virtual addresses
> > to use GPU memory or system memory. Those system memory entries can
> > also be marked as ATS against a given PASID.
>
> Thanks. This all makes sense.
>
> But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
>
As it is tied to PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
user is the AMD GPU driver, see:
drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/
Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.
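If it helps, the minimal shape of it is something like this (rough
sketch only, error handling trimmed and the PASID count arbitrary; see
the amdkfd code for the real thing):

        #include <linux/amd-iommu.h>
        #include <linux/sched.h>

        static int sketch_bind_pasid(struct pci_dev *pdev, int pasid)
        {
                /* tell the IOMMU how many PASIDs the device will use */
                int ret = amd_iommu_init_device(pdev, 16);

                if (ret)
                        return ret;

                /* bind the current process' mm to that PASID */
                return amd_iommu_bind_pasid(pdev, pasid, current);
        }

After that, ATS/PASID-tagged transactions from the device resolve
against the process address space.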
Cheers,
Jérôme
On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
> The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address, so it should send them upstream regardless of the state
> of the ACS Req/Comp redirect bits.
Why would a switch not identify that as a peer address? We use the PASID
together with ATS to identify the address space which a transaction
should use.
If I'm not completely mistaken when you disable ACS it is perfectly
possible that a bridge identifies a transaction as belonging to a peer
address, which isn't what we want here.
Christian.
>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan
Hi Christian
> Why would a switch not identify that as a peer address? We use the PASID
> together with ATS to identify the address space which a transaction
> should use.
I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP, so regardless of ACS it is going up to the Root Port. When it gets the response it has the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.2 of the spec).
> If I'm not completely mistaken when you disable ACS it is perfectly
> possible that a bridge identifies a transaction as belonging to a peer
> address, which isn't what we want here.
You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO memory address in the same PCI domain. So if we disable ACS we are in trouble, as we might MemWr to the wrong place, but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMD's use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA initiators).
Make sense?
Stephen
Hi Jerome
> As it is tied to PASID this is done using the IOMMU, so look for callers
> of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
> user is the AMD GPU driver, see:
Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs? The reason I ask is that I am looking at what, if NVMe were to support ATS, would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example, would we need a flag in the submission queue entries to indicate that a particular IO's SGL/PRP should undergo ATS).
Cheers
Stephen
On 10.05.2018 at 16:20, Stephen Bates wrote:
> Hi Jerome
>
>> As it is tied to PASID this is done using the IOMMU, so look for callers
>> of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
>> user is the AMD GPU driver, see:
>
> Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
Just FYI: There is also another effort ongoing to give the AMD,
Intel as well as ARM IOMMUs a common interface so that drivers can use
whatever the platform offers for SVM support.
> One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs? The reason I ask is that I am looking at what, if NVMe were to support ATS, would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example, would we need a flag in the submission queue entries to indicate that a particular IO's SGL/PRP should undergo ATS).
Oh, well that is complicated at best.
On very old hardware it wasn't a window, but instead you had to use
special commands in your shader which indicated that you wanted to use an
ATS transaction instead of a normal PCIe transaction for your
read/write/atomic.
As Jerome explained, on most hardware we have a window inside the
internal GPU address space which, when accessed, issues an ATS transaction
with a configurable PASID.
But on newer hardware that window became a bit in the GPUVM page
tables, so in theory we can now control it on a 4K-granularity basis for
the internal 48-bit GPU address space.
Christian.
>
> Cheers
>
> Stephen
>
On Thu, May 10, 2018 at 02:16:25PM +0000, Stephen Bates wrote:
> Hi Christian
>
> > Why would a switch not identify that as a peer address? We use the PASID
> > together with ATS to identify the address space which a transaction
> > should use.
>
> I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP, so regardless of ACS it is going up to the Root Port. When it gets the response it has the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.2 of the spec).
>
> > If I'm not completely mistaken when you disable ACS it is perfectly
> > possible that a bridge identifies a transaction as belonging to a peer
> > address, which isn't what we want here.
>
> You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
>
> If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO memory address in the same PCI domain. So if we disable ACS we are in trouble, as we might MemWr to the wrong place, but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
>
> So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMD's use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA initiators).
>
> Make sense?
>
Note that on GPUs we would not rely on ATS for peer to peer. Some parts
of the GPU (DMA engines) do not necessarily support ATS. Yet those
are the parts likely to be used in peer to peer.
However I believe a distinction in objectives is getting lost here.
We (aka GPU people, aka the good guys ;)) do not want to do peer to peer
for performance reasons, ie we do not care about having our transactions go
to the root complex and back down to the destination. At least in the use case
I am working on this is fine.
The reason is that GPUs are giving up on PCIe (see all the specialized links like
NVLink that are popping up in GPU space). So for fast GPU interconnect
we have these new links. Yet for legacy and inter-operability we would
like to do peer to peer with other devices like RDMA ... going through
the root complex would be fine from a performance point of view. Worst
case is that it is slower than the existing design where system memory is
used as a bounce buffer.
Also the IOMMU isolation does matter a lot to us. Think of someone using this
peer to peer to gain control of a server in the cloud.
Cheers,
Jérôme
On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> On 10.05.2018 at 16:20, Stephen Bates wrote:
> > Hi Jerome
> >
> > > As it is tied to PASID this is done using the IOMMU, so look for callers
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
> > > user is the AMD GPU driver, see:
> > Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
>
> Just FYI: There is also another effort ongoing to give the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers for SVM support.
>
> > One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs? The reason I ask is that I am looking at what, if NVMe were to support ATS, would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example, would we need a flag in the submission queue entries to indicate that a particular IO's SGL/PRP should undergo ATS).
>
> Oh, well that is complicated at best.
>
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you wanted to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
>
> As Jerome explained, on most hardware we have a window inside the internal
> GPU address space which, when accessed, issues an ATS transaction with a
> configurable PASID.
>
> But on newer hardware that window became a bit in the GPUVM page
> tables, so in theory we can now control it on a 4K-granularity basis for the
> internal 48-bit GPU address space.
>
To complete this, a 50-line primer on GPUs:
GPUVA - GPU virtual address
GPUPA - GPU physical address
GPUs run programs very much like CPUs, except a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program, ie threads are grouped together; the lowest
hierarchy level has a group size of <= 64 threads on most GPUs.
Those programs (called shaders for graphics, think OpenGL or Vulkan,
or compute kernels for GPGPU, think OpenCL or CUDA) are submitted by
userspace against a given address space. In the "old" days (a couple of
years back, when dinosaurs were still roaming the earth) this address
space was specific to the GPU and each userspace program could create
multiple GPU address spaces. All the memory operations done by the
program were against this address space. Hence all PCIe transactions are
spawned from a program + address space.
GPUs use a page table + window aperture (the window aperture is going away
so you can focus on the page table) to translate GPU virtual addresses into
physical addresses. The physical address can point to GPU local memory,
to system memory, or to another PCIe device's memory (ie some PCIe BAR).
So all PCIe transactions are spawned through this process of GPUVA to
GPUPA; the GPUPA is then handled by the GPU MMU unit, which either spawns
a PCIe transaction for a non-local GPUPA or accesses local memory otherwise.
So per se the kernel driver does not configure which transaction is
using ATS or peer to peer. A userspace program creates a GPU virtual
address space and binds objects into it. Such an object can be system
memory or some other PCIe device's memory, in which case we would want
to do a peer to peer transaction. So you won't find any such logic in
the kernel. What you find is creating virtual address spaces and
binding objects.
Above I talked about the old days; nowadays we want the GPU virtual
address space to be exactly the same as the CPU virtual address space
of the process which initiated the GPU program. This is where we use the
PASID and ATS. So here userspace creates a special "GPU context" that
says that the GPU virtual address space will be the same as that of the
program that created the GPU context. A process ID is then allocated and
the mm_struct is bound to this process ID in the IOMMU driver. Then all
programs executed on the GPU use the process ID to identify the address
space against which they are running.
In all of the above I did not talk about the DMA engines, which are on
the "side" of the GPU to copy memory around. GPUs have multiple DMA
engines with different capabilities; some of those DMA engines use the
same GPU address space as described above, others use GPUPA directly.
Hope this helps understanding the big picture. I oversimplified things and
the devil is in the details.
Cheers,
Jérôme
On 10/05/18 08:16 AM, Stephen Bates wrote:
> Hi Christian
>
>> Why would a switch not identify that as a peer address? We use the PASID
>> together with ATS to identify the address space which a transaction
>> should use.
>
> I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP, so regardless of ACS it is going up to the Root Port. When it gets the response it has the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.2 of the spec).
Yes. Remember if we are using the IOMMU the EP is being programmed
(regardless of whether it's a DMA engine, NTB window or GPUVA) with an
IOVA address which is separate from the device's PCI bus address. Any
packet addressed to an IOVA address is going to go back to the root
complex no matter what the ACS bits say. Only once ATS translates the
address back into the PCI bus address will the EP send packets to the
peer; the switch will then attempt to route them to the peer, and only
then do the ACS bits apply. And the Direct Translated ACS bit allows
packets that have purportedly been translated through.
> > If I'm not completely mistaken when you disable ACS it is perfectly
> > possible that a bridge identifies a transaction as belonging to a peer
> > address, which isn't what we want here.
>
> You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
>
> If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO memory address in the same PCI domain. So if we disable ACS we are in trouble, as we might MemWr to the wrong place, but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
> So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMD's use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA initiators).
>
> Make sense?
Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address. So regardless of whether we are using the IOMMU or
not, the packets will be forwarded directly to the peer. If the ACS
Redir bits are on they will be forced back to the RC by the switch and
the transaction will fail. If we clear the ACS bits, the TLPs will go
where we want and everything will work (but we lose the isolation of ACS).
For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process which will allow P2P without disabling the ACS Redir
bits -- provided the ACS direct translation bit is set. (And btw, if it
is, then we lose the benefit of ACS protecting against malicious EPs).
But, per above, the ATS transaction should involve only the IOVA address
so the ACS bits not being set should not break ATS.
Logan
> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address.
Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...
> So regardless of whether we are using the IOMMU or
> not, the packets will be forwarded directly to the peer. If the ACS
> Redir bits are on they will be forced back to the RC by the switch and
> the transaction will fail. If we clear the ACS bits, the TLPs will go
> where we want and everything will work (but we lose the isolation of ACS).
Agreed.
> For EPs that support ATS, we should (but don't necessarily have to)
> program them with the IOVA address so they can go through the
> translation process which will allow P2P without disabling the ACS Redir
> bits -- provided the ACS direct translation bit is set. (And btw, if it
> is, then we lose the benefit of ACS protecting against malicious EPs).
> But, per above, the ATS transaction should involve only the IOVA address
> so the ACS bits not being set should not break ATS.
Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
Stephen
On 10/05/18 11:11 AM, Stephen Bates wrote:
>> Not to me. In the p2pdma code we specifically program DMA engines with
>> the PCI bus address.
>
> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...
>
>> So regardless of whether we are using the IOMMU or
>> not, the packets will be forwarded directly to the peer. If the ACS
>> Redir bits are on they will be forced back to the RC by the switch and
>> the transaction will fail. If we clear the ACS bits, the TLPs will go
>> where we want and everything will work (but we lose the isolation of ACS).
>
> Agreed.
>
>> For EPs that support ATS, we should (but don't necessarily have to)
>> program them with the IOVA address so they can go through the
>> translation process which will allow P2P without disabling the ACS Redir
>> bits -- provided the ACS direct translation bit is set. (And btw, if it
>> is, then we lose the benefit of ACS protecting against malicious EPs).
>> But, per above, the ATS transaction should involve only the IOVA address
>> so the ACS bits not being set should not break ATS.
>
> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
We don't have to clear the ACS Redir bits as we did in the first case.
We just have to make sure the ACS Direct Translated bit is set.
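In the same shape as the sketch earlier in the thread (pos and ctrl as
there), that is just:

        pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
        /* let translated (AT) requests route directly peer-to-peer */
        ctrl |= PCI_ACS_DT;
        pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);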
Logan
Hi Jerome
> Note that on GPUs we would not rely on ATS for peer to peer. Some parts
> of the GPU (DMA engines) do not necessarily support ATS. Yet those
> are the parts likely to be used in peer to peer.
OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
> We (aka GPU people, aka the good guys ;)) do not want to do peer to peer
> for performance reasons, ie we do not care about having our transactions go
> to the root complex and back down to the destination. At least in the use case
> I am working on this is fine.
If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
> The reason is that GPUs are giving up on PCIe (see all the specialized links like
> NVLink that are popping up in GPU space). So for fast GPU interconnect
> we have these new links.
I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
> Also the IOMMU isolation does matter a lot to us. Think of someone using this
> peer to peer to gain control of a server in the cloud.
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.
Stephen
Hi Jerome
> Hope this helps understanding the big picture. I oversimplified things and
> the devil is in the details.
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??
Stephen
On 10/05/18 12:41 PM, Stephen Bates wrote:
> Hi Jerome
>
>> Note that on GPUs we would not rely on ATS for peer to peer. Some parts
>> of the GPU (DMA engines) do not necessarily support ATS. Yet those
>> are the parts likely to be used in peer to peer.
>
> OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
>
>> We (aka GPU people, aka the good guys ;)) do not want to do peer to peer
>> for performance reasons, ie we do not care about having our transactions go
>> to the root complex and back down to the destination. At least in the use case
>> I am working on this is fine.
>
> If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...
Obviously.
Logan
On Thu, 10 May 2018 18:41:09 +0000
"Stephen Bates" <[email protected]> wrote:
> > The reason is that GPUs are giving up on PCIe (see all the specialized links like
> > NVLink that are popping up in GPU space). So for fast GPU interconnect
> > we have these new links.
>
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction. Whether this is included
or an afterthought, I have no idea.
> > Also the IOMMU isolation does matter a lot to us. Think of someone using this
> > peer to peer to gain control of a server in the cloud.
From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation? I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,
Alex
On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen Bates" <[email protected]> wrote:
> > > The reason is that GPUs are giving up on PCIe (see all the specialized links like
> > > NVLink that are popping up in GPU space). So for fast GPU interconnect
> > > we have these new links.
> >
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
>
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction. Whether this is included
> or an afterthought, I have no idea.
AFAIK there is no IOMMU on NVLink between devices; walking a page table and
being able to sustain 80GB/s or 160GB/s is hard to achieve :) I think the
idea behind those interconnects is that devices in the mesh are inherently
secure, ie each single device is supposed to make sure that no one can
abuse it.
GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking on those but I doubt it).
So for those interconnects you program physical addresses directly in the
page table of the devices, and those physical addresses are untranslated
from the hardware's perspective.
Note that the kernel driver that does the actual GPU page table programming
can do sanity checks on the values it is setting. So checks can also happen
at setup time. But after that the assumption is that the hardware is secure
and no one can abuse it AFAICT.
>
> > > Also the IOMMU isolation does matter a lot to us. Think of someone using this
> > > peer to peer to gain control of a server in the cloud.
>
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation? I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,
Note that each single GPU (in configurations I am aware of) also has a
PCIe link with the CPU/main memory. So from that point of view they very
much behave like regular PCIe devices. It is just that the GPUs in the
mesh can access each other's memory through the high bandwidth interconnect.
I am not sure how much is public beyond that; I will ask NVidia to try to
have someone chime in on this thread and shed light on this, if possible.
Cheers,
Jérôme
On 10.05.2018 at 19:15, Logan Gunthorpe wrote:
>
> On 10/05/18 11:11 AM, Stephen Bates wrote:
>>> Not to me. In the p2pdma code we specifically program DMA engines with
>>> the PCI bus address.
>> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...
By disabling the ACS bits on the intermediate bridges you turn their
address routing from IOVA addresses (which are to be resolved by the
root complex) back to PCI bus addresses (which are resolved locally in
the bridge).
This only works when the IOVA and the PCI bus addresses never overlap.
I'm not sure how the IOVA allocation works but I don't think we
guarantee that on Linux.
>>
>>> So regardless of whether we are using the IOMMU or
>>> not, the packets will be forwarded directly to the peer. If the ACS
>>> Redir bits are on they will be forced back to the RC by the switch and
>>> the transaction will fail. If we clear the ACS bits, the TLPs will go
>>> where we want and everything will work (but we lose the isolation of ACS).
>> Agreed.
If we really want to enable P2P without ATS and IOMMU enabled I think we
should probably approach it like this:
a) Make double sure that IOVA in an IOMMU group never overlap with PCI
BARs in that group.
b) Add configuration options to put a whole PCI branch of devices (e.g.
a bridge) into a single IOMMU group.
c) Add a configuration option to disable the ACS bit on bridges in the
same IOMMU group.
I agree that we have a rather special case here, but I still find that
approach rather brave and would vote for disabling P2P without ATS when
IOMMU is enabled.
BTW: I can't say anything about other implementations, but at least for
the AMD IOMMU the transaction won't fail when it is sent to the root
complex.
Instead the root complex would send it to the correct device. I already
tested that on an AMD Ryzen with the IOMMU enabled and P2P between two GPUs
(but it could be that this only works because of ATS).
Regards,
Christian.
>>> For EPs that support ATS, we should (but don't necessarily have to)
>>> program them with the IOVA address so they can go through the
>>> translation process which will allow P2P without disabling the ACS Redir
>>> bits -- provided the ACS direct translation bit is set. (And btw, if it
>>> is, then we lose the benefit of ACS protecting against malicious EPs).
>>> But, per above, the ATS transaction should involve only the IOVA address
>>> so the ACS bits not being set should not break ATS.
>>
>> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
> We don't have to clear the ACS Redir bits as we did in the first case.
> We just have to make sure the ACS Direct Translated bit is set.
>
> Logan
On 5/11/2018 2:52 AM, Christian König wrote:
> This only works when the IOVA and the PCI bus addresses never overlap.
> I'm not sure how the IOVA allocation works but I don't think we
> guarantee that on Linux.
I find this hard to believe. There's always the possibility that some
part of the system doesn't support ACS so if the PCI bus addresses and
IOVA overlap there's a good chance that P2P and ATS won't work at all on
some hardware.
> If we really want to enable P2P without ATS and IOMMU enabled I think we
> should probably approach it like this:
>
> a) Make double sure that IOVA in an IOMMU group never overlap with PCI
> BARs in that group.
>
> b) Add configuration options to put a whole PCI branch of devices (e.g.
> a bridge) into a single IOMMU group.
>
> c) Add a configuration option to disable the ACS bit on bridges in the
> same IOMMU group.
I think a configuration option to manage IOMMU groups as you suggest
would be a very complex interface and difficult to implement. I prefer
the option to disable the ACS bit on boot and let the existing code put
the devices into their own IOMMU group (as it should already do to
support hardware that doesn't have ACS support).
Logan
> I find this hard to believe. There's always the possibility that some
> part of the system doesn't support ACS so if the PCI bus addresses and
> IOVA overlap there's a good chance that P2P and ATS won't work at all on
> some hardware.
I tend to agree but this comes down to how IOVA addresses are generated in the kernel. Alex (or anyone else) can you point to where IOVA addresses are generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI bus address directly but other IO going to the same PCI EP may flow through the IOMMU and be programmed with IOVA rather than PCI bus addresses.
> I prefer
> the option to disable the ACS bit on boot and let the existing code put
> the devices into their own IOMMU group (as it should already do to
> support hardware that doesn't have ACS support).
+1
Stephen
All
> Alex (or anyone else) can you point to where IOVA addresses are generated?
A case of RTFM perhaps (though a pointer to the code would still be appreciated).
https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
Some exceptions to IOVA
-----------------------
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.
Cheers
Stephen
On 5/11/2018 4:24 PM, Stephen Bates wrote:
> All
>
>> Alex (or anyone else) can you point to where IOVA addresses are generated?
>
> A case of RTFM perhaps (though a pointer to the code would still be appreciated).
>
> https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
>
> Some exceptions to IOVA
> -----------------------
> Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
> The same is true for peer to peer transactions. Hence we reserve the
> address from PCI MMIO ranges so they are not allocated for IOVA addresses.
Hmm, except I'm not sure how to interpret that. It sounds like there
can't be an IOVA address that overlaps with the PCI MMIO range which is
good and what I'd expect.
But for peer to peer they say they don't translate the address which
implies to me that the intention is for a peer to peer address to not be
mapped in the same way using the dma_map interface (of course though if
you were using ATS you'd want this for sure). Unless the existing
dma_map commands notice a PCI MMIO address and handle it differently,
but I don't see how.
Logan
On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
>
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> ---
> Documentation/PCI/index.rst | 14 +++
> Documentation/driver-api/pci/index.rst | 1 +
> Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
> Documentation/index.rst | 3 +-
> 4 files changed, 183 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/PCI/index.rst
> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is
> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required
> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.
> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
> +
> +
> +Driver Writer's Guide
> +=====================
> +
> +In a given P2P implementation there may be three or more different
> +types of kernel drivers in play:
> +
> +* Providers - A driver which provides or publishes P2P resources like
* Provider -
> + memory or doorbell registers to other drivers.
> +* Clients - A driver which makes use of a resource by setting up a
* Client -
> + DMA transaction to or from it.
> +* Orchestrators - A driver which orchestrates the flow of data between
* Orchestrator -
> + clients and providers
> +
> +In many cases there could be overlap between these three types (ie.
(i.e.,
> +it may be typical for a driver to be both a provider and a client).
> +
> +For example, in the NVMe Target Copy Offload implementation:
> +
> +* The NVMe PCI driver is both a client, provider and orchestrator
> + in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
> + resource (provider), it accepts P2P memory pages as buffers in requests
> +  to be used directly (client) and it can also make use of the CMB as
> + submission queue entries.
> +* The RDMA driver is a client in this arrangement so that an RNIC
> + can DMA directly to the memory exposed by the NVMe device.
> +* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
> + to the P2P memory (CMB) and then to the NVMe device (and vice versa).
> +
> +This is currently the only arrangement supported by the kernel but
> +one could imagine slight tweaks to this that would allow for the same
> +functionality. For example, if a specific RNIC added a BAR with some
> +memory behind it, its driver could add support as a P2P provider and
> +then the NVMe Target could use the RNIC's memory instead of the CMB
> +in cases where the NVMe cards in use do not have CMB support.
> +
> +
> +Provider Drivers
> +----------------
> +
> +A provider simply needs to register a BAR (or a portion of a BAR)
> +as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
> +This will register struct pages for all the specified memory.
> +
> +After that it may optionally publish all of its resources as
> +P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
> +any orchestrator drivers to find and use the memory. When marked in
> +this way, the resource must be regular memory with no side effects.
> +
> +For the time being this is fairly rudimentary in that all resources
> +are typically going to be P2P memory. Future work will likely expand
> +this to include other types of resources like doorbells.
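A short usage example might help here. Something like (sketch only; the
device, BAR number and half-BAR split are made up, signatures as
introduced earlier in this series):

        /* register the second half of BAR 4 as P2P DMA memory ... */
        error = pci_p2pdma_add_resource(pdev, 4,
                        pci_resource_len(pdev, 4) / 2,
                        pci_resource_len(pdev, 4) / 2);
        if (error)
                return error;

        /* ... and publish it so orchestrators can find it */
        pci_p2pmem_publish(pdev, true);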
> +
> +
> +Client Drivers
> +--------------
> +
> +A client driver typically only has to conditionally change its DMA map
> +routine to use the mapping functions :c:func:`pci_p2pdma_map_sg()` and
> +:c:func:`pci_p2pdma_unmap_sg()` instead of the usual :c:func:`dma_map_sg()`
> +functions.
> +
> +The client may also, optionally, make use of
> +:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
> +functions and when to use the regular mapping functions. In some
> +situations, it may be more appropriate to use a flag to indicate a
> +given request is P2P memory and map appropriately (for example the
> +block layer uses a flag to keep P2P memory out of queues that do not
> +have P2P client support). It is important to ensure that struct pages that
> +back P2P memory stay out of code that does not have support for them.
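Likewise a tiny example could help readers here. Sketch, assuming
pci_p2pdma_map_sg() mirrors the dma_map_sg() signature:

        if (is_pci_p2pdma_page(sg_page(sgl)))
                nents = pci_p2pdma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        else
                nents = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);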
> +
> +
> +Orchestrator Drivers
> +--------------------
> +
> +The first task an orchestrator driver must do is compile a list of
> +all client drivers that will be involved in a given transaction. For
> +example, the NVMe Target driver creates a list including all NVMe drives
^^^^^^
or drivers ?
Could be either, I guess, but the previous sentence says "compile a list of drivers."
> +and the RNIC in use. The list is stored as an anonymous struct
> +list_head which must be initialized with the usual INIT_LIST_HEAD.
> +The following functions may then be used to add to, remove from and free
> +the list of clients with the functions :c:func:`pci_p2pdma_add_client()`,
> +:c:func:`pci_p2pdma_remove_client()` and
> +:c:func:`pci_p2pdma_client_list_free()`.
> +
> +With the client list in hand, the orchestrator may then call
> +:c:func:`pci_p2pmem_find()` to obtain a published P2P memory provider
> +that is supported (behind the same root port) as all the clients. If more
> +than one provider is supported, the one nearest to all the clients will
> +be chosen first. If there are more than one provider is an equal distance
there is
> +away, the one returned will be chosen at random. This function returns the PCI
> +device to use for the provider with a reference taken and therefore
> +when it's no longer needed it should be returned with pci_dev_put().
> +
> +Alternatively, if the orchestrator knows (via some other means)
> +which provider it wants to use it may use :c:func:`pci_has_p2pmem()`
> +to determine if it has P2P memory and :c:func:`pci_p2pdma_distance()`
> +to determine the cumulative distance between it and a potential
> +list of clients.
> +
> +With a supported provider in hand, the driver can then call
> +:c:func:`pci_p2pdma_assign_provider()` to assign the provider
> +to the client list. This function returns false if any of the
> +clients are unsupported by the provider.
[I would say:]
is unsupported
> +
> +Once a provider is assigned to a client list via either
> +:c:func:`pci_p2pmem_find()` or :c:func:`pci_p2pdma_assign_provider()`,
> +the list is permanently bound to the provider such that any new clients
> +added to the list must be supported by the already selected provider.
> +If they are not supported, :c:func:`pci_p2pdma_add_client()` will return
> +an error. In this way, orchestrators are free to add and remove devices
> +without having to recheck support or tear down existing transfers to
> +change P2P providers.
> +
> +Once a provider is selected, the orchestrator can then use
> +:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
> +allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
> +and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
> +allocating scatter-gather lists with P2P memory.
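The flow in the last few paragraphs might deserve a worked example, too.
Roughly (error paths trimmed; rnic and nvme_dev are hypothetical client
devices):

        LIST_HEAD(clients);
        struct pci_dev *provider;
        void *buf;

        pci_p2pdma_add_client(&clients, &rnic->dev);
        pci_p2pdma_add_client(&clients, &nvme_dev->dev);

        provider = pci_p2pmem_find(&clients);
        if (provider) {
                buf = pci_alloc_p2pmem(provider, PAGE_SIZE);
                /* ... run the transfer through buf ... */
                pci_free_p2pmem(provider, buf, PAGE_SIZE);
                pci_dev_put(provider);  /* pci_p2pmem_find() took a ref */
        }
        pci_p2pdma_client_list_free(&clients);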
> +
> +Struct Page Caveats
> +-------------------
> +
> +Driver writers should be very careful about not passing these special
> +struct pages to code that isn't prepared for it. At this time, the kernel
> +interfaces do not have any checks for ensuring this. This obviously
> +precludes passing these pages to userspace.
> +
> +P2P memory is also technically IO memory but should never have any side
> +effects behind it. Thus, the order of loads and stores should not be important
> +and ioreadX(), iowriteX() and friends should not be necessary.
> +However, as the memory is not cache coherent, if access ever needs to
> +be protected by a spinlock then :c:func:`mmiowb()` must be used before
> +unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
> +Documentation/memory-barriers.txt)
> +
> +
> +P2P DMA Support Library
> +=====================
> +
> +.. kernel-doc:: drivers/pci/p2pdma.c
> + :export:
--
~Randy
Thanks for the review Randy! I'll make the changes for the next time we
post the series.
On 22/05/18 03:24 PM, Randy Dunlap wrote:
>> +The first task an orchestrator driver must do is compile a list of
>> +all client drivers that will be involved in a given transaction. For
>> +example, the NVMe Target driver creates a list including all NVMe drives
> ^^^^^^
> or drivers ?
> Could be either, I guess, but the previous sentence says "compile a list of drivers."
I did mean "drives". But perhaps "devices" would be more clear. A list
of all NVMe drivers doesn't make much sense as I'm pretty sure there is
only one NVMe driver.
Logan